The card is the summary. The transcript is the evidence.
Most users of a multi-agent system read the synthesis card and close the tab. That's reasonable for low-stakes questions: what's a good name for this function, does this regex look right, is this email tone okay. For anything that matters, the card is not the artifact you should be reading. The card is a digest. The transcript is the thing that was actually produced, and the interesting work happens there: the disagreements, the hedges, the moment one agent quietly capitulated.
This post is about how to read the transcript. Not how to skim it; how to read it, the way a referee reads a paper they're about to recommend rejecting. If you've followed the case for multi-agent systems in Why one model isn't enough and the protocol mechanics in Anatomy of a debate protocol, this is the operator's manual for the output.
What's on the card, and what each field is actually for
A well-designed synthesis card has roughly six fields. Knowing what each one is for tells you what you can and can't conclude from it.
- Final answer. The synthesizer's best single response. It is a summary, not a vote count. Treat it as a hypothesis, not a verdict.
- Per-claim attribution. Which agent contributed which factual claim. This is the field most people skip; it is the field that lets you trace any sentence in the answer back to the agent that authored it.
- Agreement / convergence percentage. See the next section. This number is the most misread item on the card.
- Minority dissent. Inline, not buried. Often one or two sentences. If it isn't there, ask why.
- Token and cost meter. Tells you whether the system actually deliberated or just rubber-stamped a single agent's draft. A debate with three agents and 2k total tokens did not really happen.
- Round count. How many turns of debate were run before the judge declared convergence. One round is suspicious. Five rounds is interesting.
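If it helps to see the shape all at once, here is a minimal sketch of the card as a data structure. The field names are illustrative, not any real system's API; the point is that every field maps to a question you can ask of the transcript.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    supporting_agents: list[str]   # which agents actually asserted this


@dataclass
class SynthesisCard:
    """Illustrative shape only; field names are hypothetical."""
    final_answer: str       # the synthesizer's best single response
    claims: list[Claim]     # per-claim attribution
    convergence: float      # judge-scored agreement, 0.0 to 1.0
    dissent: str | None     # minority dissent, kept inline
    total_tokens: int       # deliberation meter: 2k across three agents is a red flag
    rounds: int             # one round is suspicious, five is interesting
```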
Convergence percentage: what it actually measures
The number labeled convergence (87%, 92%, 100%) is the judge model's score for how well the agents agree on the answer. That is all it is. It is a similarity metric over the agents' final positions, scored by a model that is itself fallible.
It is not a probability of correctness. It is not calibrated to anything in the world. Two agents that agree because they share a training-data error will score 100%. Three agents from the same model family, asked the same question with mild paraphrase, will frequently converge on the same wrong answer with high confidence and no dissent.
The right intuition: convergence is necessary but never sufficient. Low convergence is a strong signal that something is contested. High convergence is a weak signal of correctness: it tells you the agents agree, not that they're right. We've explored where models systematically diverge in Where models disagree; the inverse, where they systematically agree because they were trained on the same wrong thing, is the failure mode this metric cannot catch.
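To make the metric's limits concrete, here is a toy version of the computation, under the assumption that the judge reduces to some pairwise similarity over final positions. The `similarity` argument stands in for whatever the judge actually does; notice that nothing in it consults the world.

```python
from itertools import combinations
from typing import Callable


def convergence_score(final_positions: list[str],
                      similarity: Callable[[str, str], float]) -> float:
    """Mean pairwise similarity of the agents' final answers.

    Three agents repeating the same training-data error score a perfect 1.0;
    correctness never enters the calculation.
    """
    pairs = list(combinations(final_positions, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)
```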
Three patterns of superficial agreement
When you open the transcript, look for these three patterns. They all produce high convergence numbers. None of them mean the answer is right.
1. Same conclusion, different reasons
Agent A concludes "use Postgres" because of write throughput. Agent B concludes "use Postgres" because of operational familiarity. Agent C concludes "use Postgres" because of JSON support. The convergence score is 100% on the recommendation. But the agents are not actually agreeing on the same answer; they're agreeing on the same label, for three unrelated reasons. If any one of those reasons is the real driver of the user's decision, two of the three agents have nothing useful to say about it. Read for: do the per-claim attributions cluster on shared premises, or scatter?
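That read can be made mechanical if you reduce each agent's argument to a set of premises (in practice, another model pass does the extraction). A rough sketch, with the premise sets assumed as input:

```python
def shared_premise_ratio(premises_by_agent: dict[str, set[str]]) -> float:
    """Fraction of all cited premises that more than one agent relies on.

    A ratio near zero means the agents converged on a label,
    not on an argument.
    """
    all_premises = set().union(*premises_by_agent.values())
    if not all_premises:
        return 0.0
    shared = {p for p in all_premises
              if sum(p in prems for prems in premises_by_agent.values()) > 1}
    return len(shared) / len(all_premises)
```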
2. Agreement by quotation
Agent B's round-two response begins "as Agent A noted, …" and proceeds to restate Agent A's argument. Agent C does the same. This is not three agents converging; it is one agent, echoed twice. Read for: in round two and beyond, does each agent introduce a claim the others didn't make? If not, the deliberation collapsed.
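A crude way to run this check, assuming you can extract each agent's claims per round: count what each agent adds beyond everything already said.

```python
def novel_claims(round_claims: dict[str, set[str]],
                 prior_claims: set[str]) -> dict[str, set[str]]:
    """For each agent, the claims it introduced that nobody made earlier.

    If every set comes back empty in round two and beyond, the debate
    has collapsed into one agent echoed by the others.
    """
    return {agent: claims - prior_claims
            for agent, claims in round_claims.items()}
```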
3. Anchor dominance
The strongest agent, usually the one with the most assertive prose style, speaks first, frames the problem, and the other two adjust toward that frame within one round. Convergence is high because the question got narrowed before it was examined. Read for: a sudden frame-shift in the second message of round one. If the second agent's response opens by accepting the first agent's framing rather than restating the user's question in its own terms, the anchor has already won.
Reading the minority dissent
We keep dissent inline on the card, never collapsed behind a "show more". This is deliberate. The dissent is, in our experience, the single most useful sentence in most transcripts. It is where the question's hidden difficulty lives.
The patterns worth recognizing:
- "I agree with the conclusion but disagree with the reasoning." The answer might be right by accident. Investigate the reasoning chain before relying on the answer in a different context.
- "This works for the case described but fails when X." X is the edge case the synthesizer rounded off. If your situation might be X, the synthesis doesn't apply to you.
- "I have low confidence in this answer but the others are pushing me toward agreement." The convergence number is mechanically high and epistemically empty. Treat the answer as unresolved.
If the dissent is missing entirely, that is itself information: either the question was genuinely easy, or the protocol suppressed disagreement. The token meter and round count usually tell you which.
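A rough way to encode that judgment, with thresholds that are illustrative rather than tuned to any particular protocol:

```python
def missing_dissent_verdict(rounds: int, total_tokens: int, num_agents: int) -> str:
    """Why might a card show no dissent? Thresholds here are made up; calibrate them."""
    tokens_per_agent = total_tokens / max(num_agents, 1)
    if rounds <= 1 or tokens_per_agent < 1_000:
        return "suspicious: barely any deliberation; disagreement may have been suppressed"
    return "plausible: the agents deliberated and still found nothing to contest"
```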
Finding the load-bearing claim
Every synthesis rests on one or two claims that, if false, invalidate the entire answer. Train yourself to find them. The tactic is a single question, asked of every synthesis you read:
What would have to be false for this synthesis to be wrong?
The answer is the load-bearing claim. It is almost never the conclusion. It is usually a factual premise buried two paragraphs in: a performance number, an API behavior, a cost figure, a version assumption. Once you've identified it, you have one job: verify that one claim. You don't need to verify the whole synthesis. You need to verify the load.
When the synthesizer hallucinates consensus
Rare but real. The synthesizer writes "the agents agree that X" when, reading the transcript, two of them clearly didn't say X, or one of them said the opposite of X. It is a summarization failure, not a debate failure, and it is especially dangerous because the card looks clean.
Detection is mechanical. For any high-stakes claim in the final answer, check the per-claim attribution. If the claim is attributed to only one of three agents but presented in the prose as consensus, you've found it. Spot-check the three or four claims that matter most. You don't need to audit the whole transcript; you need to audit the load-bearing parts of it.
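With per-claim attribution in hand, the spot-check fits in a few lines. A sketch, reusing the hypothetical card structure from earlier:

```python
def presented_as_false_consensus(card: "SynthesisCard", claim_text: str,
                                 num_agents: int) -> bool:
    """True if a claim framed as consensus lacks majority support in the transcript."""
    for claim in card.claims:
        if claim.text == claim_text:
            return len(claim.supporting_agents) <= num_agents // 2
    # The claim appears in the final answer but in no agent's transcript: worse.
    return True
```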
A worked example
A team asks three agents to recommend a database-migration strategy. The synthesis card returns 92% convergence and a confident recommendation: shard early, before the table crosses 500GB, because Postgres degrades sharply past that point on write-heavy workloads.
The card looks excellent. Three agents, high convergence, clean dissent ("worth reconsidering if write volume is bursty rather than sustained"), reasonable cost.
Apply the question. What would have to be false for this synthesis to be wrong? The 500GB threshold. The claim about sharp degradation past that point. That single number is the load.
Open the transcript. Each of the three agents cites the threshold. None of them cite a source. Two of them phrase it identically, a tell that they're drawing on the same training-data passage. The third hedges ("commonly cited") but does not contradict. The dissent is about workload shape, not about the threshold itself.
The threshold is from a 2019 blog post, repeatedly quoted, that is no longer accurate for current Postgres versions on modern hardware. All three agents shared the outdated assumption. Convergence: 92%. Correctness: zero. The protocol did its job perfectly; the agents just happened to be wrong in the same way.
This is the failure mode that the convergence number cannot see, that the dissent did not surface, and that the synthesis card cannot warn you about. Only the transcript, and the question you asked of it, gets you there.
Read like a referee
Your job, as the operator of a multi-agent system, is not to trust the synthesis. It is to read the transcript the way a referee reads a paper they're about to send back. Find the load-bearing claim. Check whether the agents agreed for the same reason or for three different reasons. Read the dissent twice. Verify the one number the whole answer rests on.
The card is a courtesy. The transcript is the work.