
Where models disagree

By Secan98 · 2026-03-12 · 9 min read


A field-notes essay on three task families where frontier models reliably take different positions, and what that means for picking your debate roster.


How to read what follows

This is a snapshot, not a benchmark. The three frontier providers we route to most often inside Session behave differently on certain task families in ways that are repeatable enough to plan around, but unstable enough that any reader six months from now should treat the specifics with suspicion. Provider updates re-tune behaviour. Post-training recipes shift. What we describe below is what we observe right now, in early 2026, across the kind of work that flows through Session debates.

The reason this matters is practical. If you are picking which models to combine in a debate, you want them to disagree in useful ways. Three identical voices give you a chorus, not a check. Three voices with different priors give you genuine triangulation, and a synthesis step has something real to weigh.

How we observe this

Every Pro debate inside Session produces a transcript with attributed claims: which model said what, when, and which assertions survived rebuttal. We strip user content before any aggregate analysis; we are looking at agreement structure, not message contents. Across thousands of debates, certain task families show recurring divergence patterns where the three models do not just give different wordings of the same answer, but reliably emphasise different things.
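
To make "agreement structure, not message contents" concrete, here is a rough sketch of the kind of per-claim record such an analysis aggregates over. The field names are illustrative, not Session's actual schema; the point is that attribution and rebuttal outcomes survive while the user's text does not.

The per-claim record (illustrative, Go):
  package transcript

  import "time"

  // ClaimRecord is an illustrative shape only, not Session's schema.
  // User text is stripped before this stage; what remains is who asserted
  // what kind of claim and whether it survived the other models' rebuttals.
  type ClaimRecord struct {
    DebateID         string    // which debate the claim came from
    Model            string    // which model asserted it
    TaskFamily       string    // e.g. "code-review", "planning", "research-synthesis"
    AssertedAt       time.Time // when in the debate it was asserted
    SurvivedRebuttal bool      // did it stand after the other models responded
    EndorsedBy       []string  // which other models agreed with it
  }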

The honest caveat: this is observational. Our debates are not a controlled benchmark. The prompts vary; the user populations skew toward engineering and research; the temperature settings differ across rooms. We are reporting tendencies that are consistent enough to act on, not effect sizes you could publish. Treat the categories below as field notes from a working operator, not as a paper. For more on how we read these transcripts, see reading a debate transcript.

Category 1, Code review, and where caution differs

Code review is the cleanest place to see the divergence, because the artefact under review is fixed and the responses are directly comparable. Imagine a 200-line PR that adds a retry policy with exponential backoff to an outbound HTTP client, a fairly ordinary piece of infrastructure work, the kind that lands every week in any production codebase.

What we tend to see, across many such reviews:

Conservative agent (illustrative):
  - jitter window is symmetric; under load this can produce
    thundering-herd at the edge of the backoff envelope
  - retry budget is global, not per-endpoint; one slow downstream
    can starve fast ones
  - 5xx and 429 are treated identically; 429 should respect
    Retry-After if present
  - cancellation context not threaded through the sleep
  - timer uses wall clock, not monotonic
  - log line at WARN may be noisy in steady-state failure modes

Action-oriented agent (illustrative):
  - distinguish 429 from 5xx; honour Retry-After
  - move sleep onto a monotonic timer
  - thread context through the retry loop so cancels propagate
  (with a suggested diff for each)

Style-thorough agent (illustrative):
  - naming inconsistency across files
  - prefer the standard backoff helper already in the codebase
  - docstring missing on the exported constructor
  - the table-test fixture would be cleaner with subtests
  - same three correctness items the others raised, in passing

Three observations. One model tends to be the most conservative reviewer in our sample: it flags more potential edge cases, including ones that genuinely seldom matter in practice. Another is more action-oriented: a shorter list, but the items it flags are more often shippable diffs you would actually merge. A third is the most thorough on style and idiom, and is most likely to point you at existing helpers in the codebase. None of these is the "right" review. The union is the review you actually want.
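
For readers who want the union in code: below is a minimal sketch, in Go, of the retry loop those three reviews converge on once the shared correctness items are applied. It distinguishes 429 from other retryable statuses and honours Retry-After, jitters the backoff, and threads the caller's context through the sleep so cancellation propagates. The names are illustrative; this is not code from any reviewed PR.

A sketch of the reconciled retry loop (illustrative, Go):
  package retry

  import (
    "context"
    "errors"
    "math/rand"
    "net/http"
    "strconv"
    "time"
  )

  // DoWithRetry retries an idempotent, bodyless request with full-jitter
  // exponential backoff. Illustrative only; not code from any reviewed PR.
  func DoWithRetry(ctx context.Context, client *http.Client, req *http.Request, maxAttempts int) (*http.Response, error) {
    base := 200 * time.Millisecond
    for attempt := 0; attempt < maxAttempts; attempt++ {
      resp, err := client.Do(req.Clone(ctx))
      if err == nil && resp.StatusCode < 500 && resp.StatusCode != http.StatusTooManyRequests {
        return resp, nil // success, or a non-retryable client error
      }

      wait := jitteredBackoff(base, attempt)
      if err == nil {
        if resp.StatusCode == http.StatusTooManyRequests {
          // 429: honour Retry-After when the server provides one.
          if s := resp.Header.Get("Retry-After"); s != "" {
            if secs, convErr := strconv.Atoi(s); convErr == nil {
              wait = time.Duration(secs) * time.Second
            }
          }
        }
        resp.Body.Close() // discard the failed response before retrying
      }

      // Sleep through a timer and the context so cancellation propagates;
      // Go timers read the monotonic clock, so wall-clock jumps do not
      // distort the wait.
      timer := time.NewTimer(wait)
      select {
      case <-ctx.Done():
        timer.Stop()
        return nil, ctx.Err()
      case <-timer.C:
      }
    }
    return nil, errors.New("retry budget exhausted")
  }

  // jitteredBackoff grows exponentially with full jitter, which avoids the
  // thundering-herd effect a symmetric jitter window invites.
  func jitteredBackoff(base time.Duration, attempt int) time.Duration {
    ceiling := base << attempt // 200ms, 400ms, 800ms, ...
    if ceiling <= 0 || ceiling > 30*time.Second {
      ceiling = 30 * time.Second
    }
    return time.Duration(rand.Int63n(int64(ceiling)))
  }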

Category 2, Long-horizon planning

Ask all three "design a 6-month plan to migrate this monolith to services" and you will see the priors separate sharply.

One model tends to produce a phase-by-phase plan with explicit risk gates between phases, strangler-fig boundaries, rollback criteria, what observability has to be in place before phase two begins. The plan reads as if written by someone who has seen migrations stall. Another produces a faster plan with optimistic dependencies; the milestones are tighter, the prose more confident, and parallel workstreams are assumed to actually run in parallel. The plan reads as if written by someone who wants to ship. The third produces the most thorough exploration of alternatives, strangler fig versus parallel run versus event-bridge cutover, with tradeoffs, but is correspondingly slower to commit to one path.

The implication for debate composition is direct. In a two-agent debate on this kind of task, the conservative model will reliably play the implicit constraint-keeper and the action-oriented model the tempo-setter. That tension is productive. You get a plan that is faster than the cautious instinct and more guarded than the optimistic one, and the synthesiser has something real to balance. Adding the third broadens the alternative space at the cost of a longer debate.

Category 3, Scientific summarisation

Ask the three to summarise a contested research claim, say, a recent paper claiming a particular evaluation methodology systematically over-rates instruction-tuned models, and the divergence is no longer about pace or caution. It is about which uncertainty each model surfaces by default.

  • One tends to surface methodology limits. Sample size. Prompt sensitivity. Whether the metric measures what the paper claims it measures.
  • Another tends to surface generalisation limits. Whether the result holds outside the model families tested. Whether the finding survives a year of post-training drift.
  • The third tends to surface citation gaps. Prior work that contradicts the claim, related findings that nuance it, where the literature actually stands.

None of these is more correct than the others. The union is the right summary. This is the cleanest case for routing a research-synthesis debate to all three rather than picking one: you are not voting on the answer, you are aggregating lenses.

What this means for picking your roster

The reason we walk through these three categories is that the right debate roster depends on the task, and "use all three always" is not the answer. Some practical pairings we have settled on internally (sketched in code after the list):

  • High-stakes implementation choices, the conservative model + the action-oriented model + a small judge. You want the constraint-keeper and the tempo-setter both at the table; the judge resolves where they pull apart.
  • Research synthesis, methodology lens + citation lens + a synthesiser that explicitly weights "uncertainty surfaced". You want both lenses, and you want the synthesiser to credit each.
  • Product or prose work, pick by taste. Honestly. Prose is the one domain where homogenising via debate can actively hurt; you flatten voice in exchange for averageness. Run a single model whose register you like and edit it yourself.
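
If it helps to see those pairings as configuration rather than prose, here is the promised sketch. The type and field names are hypothetical, not Session's API; the shape of each roster is the point.

Roster pairings (illustrative, Go):
  package rosters

  // Roster is a hypothetical shape, not Session's configuration API.
  type Roster struct {
    Agents      []string // models at the table
    Judge       string   // small model that resolves disagreements, if any
    Synthesiser string   // model that writes the final synthesis, if any
  }

  var (
    // High-stakes implementation choices: constraint-keeper + tempo-setter,
    // with a small judge to resolve where they pull apart.
    Implementation = Roster{
      Agents: []string{"conservative", "action-oriented"},
      Judge:  "small-judge",
    }

    // Research synthesis: methodology lens + citation lens, with a
    // synthesiser that explicitly credits the uncertainty each surfaces.
    Research = Roster{
      Agents:      []string{"methodology-lens", "citation-lens"},
      Synthesiser: "uncertainty-weighted",
    }

    // Prose work deliberately gets no roster: run a single model whose
    // register you like and edit it yourself.
  )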

For the broader argument behind why combining models works at all, see why one model isn't enough; for concrete recipes, multi-agent strategies for engineers goes deeper on roster patterns.

A note on local models

Open-weights local models bring a different prior again. In our observation they are often less alignment-tuned in the conversational sense, more willing to commit to a single recommendation without hedging, and sometimes notably blunter in tone. That is not a flaw; in a hybrid debate they can usefully play the "uncautious" voice that pushes the frontier models off their hedging defaults. The condition is that you trust your synthesiser to weight that contribution appropriately. A blunt voice in a room with no judge is just noise; in a room with a calibrated judge, it is signal the frontier models will not produce on their own.

Behaviour drifts; track the drift

I want to close on the caveat I opened with, because it is the most important sentence in this post. None of these observations are stable across more than roughly six months. Providers re-tune. An update can shift the conservatism dial. An update can change how readily a model produces shippable diffs versus longer prose reviews. An update can change which uncertainty it surfaces first.

Inside Session we re-run a small set of internal probes against each provider on a rolling basis, and members of the community room post their own. The useful work is not memorising the categories above, it is building the habit of checking them, noticing when convergence drops on a task family that used to converge, and adjusting your rosters accordingly. Bake the method into your process, not the conclusions into your prompt templates.
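
The check itself can be very small. A toy version, assuming you keep a fixed set of probe prompts per task family and record whether the roster agreed on the headline answer each run; the names are hypothetical and this is not Session's internal harness.

A convergence-drop check (illustrative, Go):
  package probes

  // agreementRate is the fraction of probe runs in which all roster models
  // agreed on the headline answer.
  func agreementRate(runs []bool) float64 {
    if len(runs) == 0 {
      return 0
    }
    agreed := 0
    for _, converged := range runs {
      if converged {
        agreed++
      }
    }
    return float64(agreed) / float64(len(runs))
  }

  // convergenceDropped flags a task family whose agreement rate fell by
  // more than threshold between the previous window and the current one.
  func convergenceDropped(previous, current []bool, threshold float64) bool {
    return agreementRate(previous)-agreementRate(current) > threshold
  }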

