When the math works
Single-model AI is great for autocompletion: low latency, low cost, low stakes. Multi-agent debate earns its cost on judgement-heavy tasks where the failure mode is "we shipped the wrong thing," not "we mistyped a variable name." Below are three patterns where the math works at an engineer's desk, each with a setup, a worked example, and a note on when not to bother.
Pattern 1, Spec review
The cheapest defect is one caught before code is written. Spec review is the multi-agent pattern for design docs, RFCs, and engineering plans.
Setup
- Three agents receive the same document.
- Each is given a different role prior in its system prompt: (a) the implementer who will build it, (b) the SRE who will operate it at 3 a.m., (c) the security reviewer who has to sign it off.
- Each writes a concerns list, no synthesis, no diplomacy.
- A fourth agent, the synthesizer, collates, deduplicates, and ranks the concerns by severity and effort.
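If you want the skeleton, here is roughly what that fan-out looks like as code. A minimal sketch, assuming an OpenAI-compatible chat client; the role prompts, model name, and the `spec_review` helper are placeholders for whatever you actually run, not a fixed API.

```python
# spec_review.py, a sketch of the three-role fan-out plus synthesizer.
# Assumes an OpenAI-compatible client; swap in your own provider and model.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # illustrative model name

ROLES = {
    "implementer": "You will build this. List concrete concerns about scope, "
                   "estimates, and hidden refactors. Concerns only, no praise.",
    "sre": "You will operate this at 3 a.m. List concerns about failure "
           "recovery, deploys, warm-up, and SLOs. Concerns only, no praise.",
    "security": "You have to sign this off. List concerns about isolation, "
                "data exposure, and side-channels. Concerns only, no praise.",
}

def ask(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def spec_review(doc: str) -> str:
    # Round 0: each role reads the same doc in parallel, writes concerns only.
    with ThreadPoolExecutor() as pool:
        reports = dict(zip(ROLES, pool.map(lambda s: ask(s, doc), ROLES.values())))
    # Synthesis: collate, deduplicate, rank by severity and effort.
    collated = "\n\n".join(f"## {role}\n{text}" for role, text in reports.items())
    return ask(
        "Collate these reviews. Deduplicate overlapping concerns and rank "
        "them by severity and by effort to address.",
        collated,
    )
```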
Worked example: "Migrate from Redis to an in-memory cache"
A real RFC I ran this on. Three agents, three very different reports:
- Implementer flagged the refactor scope: 14 call sites assume Redis pub/sub semantics that don't translate to a process-local map. Estimated 3 weeks, not the 3 days the doc claimed.
- SRE flagged the failure-recovery story: an in-memory cache loses state on every deploy, and the doc had no warm-up plan. Cold-start latency would breach the p99 SLO for ~90 seconds after each rollout.
- Security flagged a side-channel: shared-process caches across tenants leak timing information that the network-isolated Redis instance had been hiding by accident.
One generic "review this design" prompt produces a bland summary that mentions all three weakly. The role priors force divergent attention: the SRE doesn't get distracted by refactor scope, and the implementer doesn't try to play security expert. You get three sharp opinions instead of one blurry one.
When not to use it
Two-paragraph design notes. The role-prompting overhead outweighs the doc.
Pattern 2, Code-synth
Code-synth is for the moment when you know roughly what to build but the implementation has a real fork in the road: algorithm choice, data-structure choice, sync vs. async.
Setup
- Two agents independently propose an implementation. Round 0 only, they do not see each other's drafts.
- A third agent acts as synthesizer. It reads both proposals and writes the final code, choosing the stronger approach per concern.
The synthesizer's job is not to pick a winner. It is to compose the strongest version using parts of both. If you collapse this into "pick A or B," you lose the whole point, you wanted parallel exploration, not a vote.
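The synthesis step is mostly a prompt-design problem, so here is a sketch of it under the same assumptions as the spec-review sketch (OpenAI-compatible client, illustrative prompt and model name). The part that matters is the instruction to compose per concern rather than vote.

```python
# code_synth.py, a sketch of the compose-don't-vote synthesis step.
# Assumes an OpenAI-compatible client; the prompt wording is illustrative.
from openai import OpenAI

client = OpenAI()

def synthesize(task: str, proposal_a: str, proposal_b: str) -> str:
    prompt = (
        f"Task:\n{task}\n\n"
        f"Proposal A:\n{proposal_a}\n\n"
        f"Proposal B:\n{proposal_b}\n\n"
        "Do not pick a winner. For each concern (correctness, performance, "
        "readability, tests), take the stronger approach from either proposal "
        "and write the final code. Note which proposal each part came from."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```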
Worked example: a rate-limiter
- Agent A proposes a token bucket: simple, O(1) per request, easy to reason about under burst traffic.
- Agent B proposes a sliding-window log: more accurate at the boundary, with a cleaner property-based test suite.
- Synthesizer ships A's token bucket, the right call for the production hot path, but lifts B's test harness wholesale, plus the boundary-accuracy assertions B used to validate the log approach.
You end up with the fast implementation and the rigorous test suite. Neither agent alone would have produced that pairing.
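For concreteness, here is roughly the shape that pairing takes: A's token bucket as the implementation, with a boundary-accuracy check in the spirit of B's harness. This is an illustrative sketch, not the actual synthesized code from the example; names and limits are made up.

```python
# token_bucket.py, illustrative of the synthesized result, not the original code.
import time

class TokenBucket:
    """O(1) per request: refill lazily on each call, then try to spend one token."""
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

def test_boundary_accuracy():
    # B-style assertion: a burst at t=0 drains the bucket, then roughly
    # rate * elapsed further requests are admitted, with a little slack.
    bucket = TokenBucket(rate_per_sec=10, burst=5)
    assert sum(bucket.allow() for _ in range(5)) == 5   # full burst admitted
    assert bucket.allow() is False                       # bucket now empty
    time.sleep(0.5)
    allowed = sum(bucket.allow() for _ in range(100))
    assert 4 <= allowed <= 6                             # ~10 * 0.5 refilled
```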
When not to use it
Glue code. CRUD endpoints. Anything where there isn't a real fork. If both agents would write the same thing, you're paying 3x for nothing.
Pattern 3, Pair-debug
Most debugging wall-clock time is spent narrowing the search space. Pair-debug parallelises that narrowing.
Setup
- Two agents, each handed a different hypothesis. Agent A: "the bug is in the data layer." Agent B: "the bug is in the request lifecycle."
- Each is wired to a code-search MCP server (see MCP tools not optional) and tries to prove its hypothesis: grep, read, trace, repro.
- A judge agent reads both findings and either picks a winner or escalates with a third hypothesis.
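The control flow is simple enough to sketch. Everything below the orchestration is hypothetical: `run_agent` stands in for whatever harness wires a model to your code-search tools, `judge` for the judging agent, and the "escalate:" convention is just one way to let the judge hand back a third hypothesis.

```python
# pair_debug.py, a sketch of the hypothesis fan-out and judge step.
# run_agent() and judge() are stand-ins for your own agent harness.
from concurrent.futures import ThreadPoolExecutor

def run_agent(hypothesis: str, bug_report: str) -> str:
    """Hypothetical: one agent greps, reads, traces, and repros to test its hypothesis."""
    raise NotImplementedError("wire this to your agent + code-search tools")

def judge(bug_report: str, findings: dict[str, str]) -> str:
    """Hypothetical: return a verdict, or 'escalate: <third hypothesis>'."""
    raise NotImplementedError("wire this to your judge agent")

def pair_debug(bug_report: str, max_rounds: int = 2) -> str:
    hypotheses = ["the bug is in the data layer",
                  "the bug is in the request lifecycle"]
    for _ in range(max_rounds):
        with ThreadPoolExecutor() as pool:
            findings = dict(zip(hypotheses,
                                pool.map(lambda h: run_agent(h, bug_report),
                                         hypotheses)))
        verdict = judge(bug_report, findings)
        if not verdict.startswith("escalate:"):
            return verdict                                  # judge picked a winner
        hypotheses = [verdict.removeprefix("escalate:").strip()]  # third hypothesis
    return "unresolved; hand all findings to a human"
```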
Worked example: a flaky integration test
A test that failed roughly 1 in 30 runs in CI, never locally. Two agents, two sweeps:
- Agent A (data layer) traced fixtures, transactions, isolation levels. Found nothing, the data path was clean.
- Agent B (request lifecycle) found a goroutine spawned in a middleware that wrote to a response buffer after the handler had returned. A classic race, only visible when the scheduler interleaved in a particular way.
- Judge picked B, with A's "data layer is clean" report attached as evidence narrowing the remaining surface.
Crucially, A's negative result has value. It rules out a region of the search space, which is exactly what a human debugger spends hours doing alone. For more on how disagreement carries information, see where models disagree.
When not to use it
Bugs with an obvious stack trace. If the exception points at line 47, you don't need two agents racing to find line 47.
When not to use multi-agent at all
Be honest about this. Most of your day is not a debate problem.
- Factual lookup. "What's the syntax for Postgres LATERAL JOIN?", a single best model wins. Debate adds noise.
- Verifiable in seconds. Tasks where you can compile, run tests, or hit an endpoint to know if the answer is right. Let one agent iterate against the verifier; that loop is faster and cheaper than three agents arguing.
- Short tasks. If the task takes a human five minutes, debate overhead exceeds the task. Round-robin coordination has a fixed cost.
- Taste, not reasoning. Design aesthetics, prose style, naming. Debate tends to homogenise, three agents converge on the median, which is the worst possible aesthetic choice.
Cost estimation, back-of-envelope
A 3-agent, 2-round debate runs roughly 7x the token cost of a single-agent answer. The breakdown, assuming each agent's output is roughly the same size as the input prompt:
- Round 0: 3 agents each read the prompt and write an answer. 3 units.
- Round 1: each agent reads its 2 peers' answers and revises. Input grew, but for a rough count call it 3 units of generated output.
- Synthesis: 1 agent reads all 3 final answers and writes the result. ~1 unit.
- Total: ~7 units vs. 1 for a single-agent baseline.
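The same arithmetic as a throwaway calculation, under the unit model above (each agent's output is roughly one unit, synthesis adds one):

```python
# Back-of-envelope: n-agent, r-round debate cost in units of single-agent output.
def debate_units(agents: int, rounds: int) -> int:
    # Each round (including round 0) every agent writes ~1 unit; synthesis adds 1.
    return agents * rounds + 1

print(debate_units(3, 2))  # 7, vs. 1 for a single-agent baseline
```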
So when does 7x pay off? When the cost of being wrong is more than 7x the cost of being right. A bad architectural decision costs weeks. A bad rate-limiter ships an outage. A missed race condition burns an on-call rotation. Against those numbers, 7x of a few cents in tokens is a rounding error.
When does it not? Autocomplete, lookups, scaffolding, anything you'd happily redo if it came out wrong. For more on structuring the rounds themselves, see anatomy of a debate protocol.
Starting points, not endpoints
Spec-review, code-synth, pair-debug. Three patterns, three different shapes of judgement, three places the 7x is worth it. Use them tomorrow at your desk; tune the role priors and round counts to your stack.
These are starting points. Members of the Session Community publish their own variants: pair-debug with three hypotheses instead of two, code-synth with a fourth "adversary" agent that tries to break both proposals, spec-review with a cost-modeller as the fourth role. Bring yours.