Anatomy of a debate protocol

By Secan98 · 2026-04-29 · 10 min read


How a multi-agent debate is actually structured: parallel fan-out, round-robin critique, a small judge, threshold exit, and a synthesizer that preserves minority dissent.


The shape of a debate

A multi-agent debate is not a chat room with several models talking at once. It is a structured protocol with phases, roles, and exit conditions, designed to extract more signal from a panel of agents than any single model can produce alone. The protocol has a particular anatomy: a parallel fan-out, a sequence of round-robin critiques, a small judge that watches for convergence, and a synthesizer that composes the final attributed answer. This essay walks through each part as Session implements it, with the trade-offs that govern when to use which variant.

The motivation, covered in Why one model isn't enough, is simple: a single model's answer is a single sample from a distribution shaped by its training. A debate samples that distribution from multiple architectures and forces the samples to interact under structured constraints. The protocol is the scaffolding that makes the interaction productive instead of merely noisy.

Round 0: parallel fan-out

The opening move is deliberately uncoordinated. The orchestrator sends the same prompt, with the same system context, to every participating agent simultaneously. No agent sees any other agent's output. Each produces an independent answer to the question.

Parallelism here is not a performance optimization, although it is faster. It is an epistemic requirement. If agent B sees agent A's answer first, B's response is contaminated: it will anchor, defer, or contrast against A rather than committing to its own prior. In practice this collapses the diversity that makes the panel valuable. Round 0 in parallel gives you genuine baseline opinions, with the disagreements visible and uncompressed.

Latency-wise, Round 0 takes roughly the time of the slowest agent. With three frontier models in flight you typically see 8–20 seconds wall clock, dominated by the longest reasoning trace.
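The fan-out can be sketched in a few lines of asyncio. This is a minimal illustration, not Session's actual code: `call_agent` is a hypothetical stand-in for a real model API call, and here it only simulates latency.

```python
import asyncio

async def call_agent(agent: str, prompt: str) -> str:
    # Hypothetical stand-in for a real model API call; a production
    # version would hit each provider's endpoint. Here we simulate latency.
    await asyncio.sleep(0.01)
    return f"{agent} answers: {prompt}"

async def round_zero(agents: list[str], prompt: str) -> dict[str, str]:
    # Fan the identical prompt out to every agent concurrently.
    # No agent sees any peer's output, so Round 0 stays independent.
    answers = await asyncio.gather(*(call_agent(a, prompt) for a in agents))
    return dict(zip(agents, answers))

answers = asyncio.run(
    round_zero(["agent-A", "agent-B", "agent-C"], "SQLite or Postgres?")
)
```

Because `gather` awaits all calls together, wall-clock time is set by the slowest agent rather than the sum of all three.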

Rounds 1 through N: round-robin critique

From Round 1 onward, agents see each other. Each agent is now given the full set of previous-round answers from its peers and asked to do three specific things, in this order:

  1. State explicitly where it agrees with peers, and which peers.
  2. State explicitly where it disagrees, with reasoning grounded in the substance, not in authority.
  3. Decide whether to update its own answer, and if so, where and why.

The structure is enforced by asking for JSON, which makes the turn machine-readable for the judge and the synthesizer. A typical turn looks like this:

{
  "agent": "agent-A",
  "round": 2,
  "agreements": [
    {"with": "agent-B", "on": "Postgres adds operational overhead disproportionate to current load"},
    {"with": "agent-C", "on": "SQLite WAL mode handles <100 writes/sec comfortably"}
  ],
  "disagreements": [
    {"with": "agent-B",
     "on": "claim that migration is reversible at any time",
     "reason": "schema drift and pg-specific types create real lock-in after ~3 months"}
  ],
  "updated_position": "Stay on SQLite until concurrent writers exceed 50/sec OR until a second service needs the same data. Pre-migration: enforce a typed schema layer in app code now to keep the door open.",
  "confidence": 0.78
}
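Since the judge and synthesizer both consume these turns mechanically, it pays to validate them on arrival. A minimal sketch, assuming the field set shown above (`parse_turn` and `REQUIRED_FIELDS` are illustrative names, not Session's API):

```python
import json

REQUIRED_FIELDS = {"agent", "round", "agreements", "disagreements",
                   "updated_position", "confidence"}

def parse_turn(raw: str) -> dict:
    # Validate a critique turn before the judge or synthesizer sees it;
    # malformed turns are rejected early rather than skewing the round.
    turn = json.loads(raw)
    missing = REQUIRED_FIELDS - turn.keys()
    if missing:
        raise ValueError(f"turn missing fields: {sorted(missing)}")
    if not 0.0 <= turn["confidence"] <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return turn

turn = parse_turn(json.dumps({
    "agent": "agent-A", "round": 2,
    "agreements": [], "disagreements": [],
    "updated_position": "Stay on SQLite for now.",
    "confidence": 0.78,
}))
```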

Round-robin matters because it serializes reading while keeping writing independent. Every agent reads the same prior round; no agent sees its peers' current-round responses while writing. This prevents a runaway feedback loop where the last writer in a round simply summarizes the others.
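The snapshot discipline can be made explicit in code. A sketch under the same assumptions as above, with `critique` as a hypothetical stand-in for the per-agent model call:

```python
def critique(agent: str, peer_answers: dict[str, str]) -> str:
    # Hypothetical stand-in for a model call that reads peers and
    # returns an updated position.
    return f"{agent} read {sorted(peer_answers)}"

def run_round(agents: list[str], prev_round: dict[str, str]) -> dict[str, str]:
    # Freeze the prior round before anyone writes: every agent critiques
    # the same snapshot, so no agent sees a peer's current-round output.
    snapshot = dict(prev_round)
    return {
        agent: critique(agent, {p: a for p, a in snapshot.items() if p != agent})
        for agent in agents
    }

prev = {"agent-A": "stay", "agent-B": "migrate", "agent-C": "conditional"}
round_1 = run_round(["agent-A", "agent-B", "agent-C"], prev)
```

The key line is the `snapshot` copy: writes accumulate in the new dict and are never visible to peers until the round closes.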

The judge: small, cheap, single-purpose

After each round, a separate judge model reads the round's outputs and emits a single number: an estimate of inter-agent convergence, on a scale from 0 to 1. The judge does no reasoning about correctness. It only measures whether the agents' positions, claims, and recommendations are converging.

The judge should be deliberately small. Three reasons:

  • Cost. The judge runs every round, on transcripts that grow linearly. A frontier model in this seat doubles the bill for marginal benefit.
  • Bias. A small, cheap model is less likely to inject its own opinion about the substantive question. Its job is bookkeeping, not adjudication.
  • Determinism. Convergence scoring works better with a low-temperature, narrowly-prompted model. Frontier models tend to overthink the rubric.

The judge's prompt asks for a structured score across three axes: agreement on the central recommendation, agreement on the key supporting facts, and agreement on the critical caveats. The mean is the convergence score for the round.
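The scoring arithmetic is just a mean over the three axes. A minimal sketch (axis names are taken from the description above; the function name is illustrative):

```python
def convergence_score(axes: dict[str, float]) -> float:
    # Mean of the judge's three axis scores, each in [0, 1]:
    # central recommendation, key supporting facts, critical caveats.
    expected = {"recommendation", "facts", "caveats"}
    if set(axes) != expected:
        raise ValueError(f"judge must score exactly {sorted(expected)}")
    return sum(axes.values()) / len(axes)

score = convergence_score({"recommendation": 0.9, "facts": 0.85, "caveats": 0.8})
```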

Convergence threshold and early exit

Session's default early-exit condition is convergence ≥ 0.85. When the judge reports a score above the threshold for a round, the orchestrator skips remaining rounds and proceeds directly to synthesis.

Early exit matters for three reasons. First, cost: each additional round is N agent calls plus a judge call, and frontier model tokens are not free. Second, latency: a four-round debate on three agents is twelve sequential-ish reasoning passes. Third, and most importantly, there is no epistemic value in re-arguing settled points. If the panel has agreed on the central claims by Round 2, a third round produces near-identical text and dilutes the signal in the transcript that the synthesizer reads.
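The orchestrator's control flow reduces to a short loop. A sketch with the defaults stated in this essay; `run_round` and `judge` are passed in as callables, and the toy judge below replays the score trajectory from the worked example later in this post:

```python
def run_debate(run_round, judge, max_rounds: int = 4,
               threshold: float = 0.85) -> int:
    # Run critique rounds until the judge's convergence score clears
    # the threshold, or the hard ceiling is reached. Returns the number
    # of rounds actually run; synthesis happens after this loop.
    for r in range(1, max_rounds + 1):
        run_round(r)
        if judge(r) >= threshold:
            break  # converged: skip remaining rounds
    return r

# Toy judge replaying an example trajectory of per-round scores.
scores = {1: 0.41, 2: 0.74, 3: 0.89, 4: 0.95}
rounds_used = run_debate(lambda r: None, lambda r: scores[r])
```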

Empirically, on technical decisions, roughly 60% of debates exit after Round 2 and another 25% after Round 3. The remaining 15% are usually questions where the agents have genuinely irreconcilable priors, and those are the ones where you most want a full transcript anyway.

The synthesizer

The synthesizer is a designated model that reads the entire debate transcript (Round 0 answers, every round of critique, and the judge's per-round scores) and produces a single attributed final answer. It is the artifact the user actually receives. Reading one is its own skill, covered in Reading a debate transcript.

Two design choices deserve emphasis.

The synthesizer should usually not be one of the debating agents. If agent A debated and agent A also synthesizes, A's prior weight on its own arguments is structurally over-represented in the final answer. The cleanest setup uses a fourth model, or a fresh instance with a system prompt that explicitly instructs it to weight all participants equally. In small panels where a fourth model is uneconomical, rotate the synthesizer slot across debates so the bias averages out.

Minority dissent is preserved inline. When two agents agree and one dissents with a substantive reason, the dissent is not discarded. It is attached to the recommendation as a labeled minority position. This is how you avoid the failure mode where consensus rolls over the one agent that happened to notice the real risk.
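One way to make dissent structurally hard to drop is to carry it in the answer type itself. A sketch, not Session's actual data model (`Dissent` and `FinalAnswer` are illustrative names):

```python
from dataclasses import dataclass, field

@dataclass
class Dissent:
    agent: str
    round: int
    argument: str

@dataclass
class FinalAnswer:
    recommendation: str
    confidence: str
    minority: list[Dissent] = field(default_factory=list)

    def render(self) -> str:
        # Dissent travels inline with the recommendation, never discarded.
        lines = [f"RECOMMENDATION: {self.recommendation}",
                 f"CONFIDENCE: {self.confidence}"]
        for d in self.minority:
            lines.append(
                f"MINORITY POSITION ({d.agent}, Round {d.round}): {d.argument}"
            )
        return "\n".join(lines)

card = FinalAnswer(
    recommendation="Stay on SQLite; add a typed schema layer now.",
    confidence="HIGH (3/3 agents converged)",
    minority=[Dissent("agent-B", 0,
                      "Migrate now to avoid a rushed migration later.")],
)
```

Because the minority list is part of the type, a synthesizer that wants to drop a dissent has to do so explicitly rather than by omission.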

A worked example

Three agents are asked: "Should I migrate this small internal tool from SQLite to Postgres now?"

Round 0 (parallel, ~12s):
  agent-A: "Stay on SQLite. Operational simplicity outweighs scaling headroom you don't need."
  agent-B: "Migrate now. Future-proofing is cheaper than rushed migration under load."
  agent-C: "Conditional: stay on SQLite, but add an abstraction layer to make migration cheap later."

Judge score after Round 0: 0.41  (low convergence, continue)

Round 1 (round-robin, ~18s):
  Each agent critiques the others. Agent A updates to acknowledge agent C's
  abstraction-layer point. Agent B softens "migrate now" to "migrate when
  concurrent writers exceed 50/sec." Agent C holds.

Judge score after Round 1: 0.74  (continue)

Round 2 (round-robin, ~18s):
  Convergence on a conditional recommendation. Disagreement narrows to
  the specific trigger metric.

Judge score after Round 2: 0.89  (exit, proceed to synthesis)

The synthesizer then produces a card like this:

RECOMMENDATION
  Stay on SQLite. Add a typed schema layer now to keep migration cheap.

TRIGGERS TO REVISIT
  - Concurrent writers exceed ~50/sec sustained
  - A second service needs to read the same data
  - You add a feature requiring Postgres-specific types

CONFIDENCE: HIGH (3/3 agents converged)

MINORITY POSITION (agent-B, Round 0):
  Argued for migrating immediately to avoid future rushed migration.
  Updated to the conditional position by Round 1; original argument
  retained here because the "future rushed migration" risk is real
  if the triggers above are crossed quietly.

Trade-offs and protocol variants

More rounds cost more money and time. The marginal value drops sharply: in our logs, Round 4 changes the synthesis in fewer than 8% of debates that didn't already exit. Two to three rounds capture the bulk of the available signal. Setting a hard ceiling of max_rounds=4 with early-exit at 0.85 is a reasonable default for technical decisions.

The full debate protocol is not always the right tool. Three lighter-weight variants are surveyed in Multi-agent strategies for engineers:

  • Parallel-only. Round 0 plus synthesis, no debate rounds. Cheap, fast, and surprisingly good for questions where you want diverse generations rather than a converged answer.
  • Tournament elimination. Pairwise debates, loser drops out, winner advances. Useful when you have many candidate answers and want to compress to one through structured competition.
  • Hierarchical decomposition. One agent decomposes the question into sub-parts, peers answer the sub-parts independently, the decomposer reassembles. Best when the question is large enough that any single agent's working memory is a real constraint.

The full protocol (parallel fan-out, round-robin critique, small judge, threshold exit, dedicated synthesizer with minority preservation) is the right default for high-stakes single-answer questions. The variants are right when the question's shape doesn't match that mold. Knowing which mold to reach for is, increasingly, the actual craft of orchestration.

