Why local matters at all
Most multi-agent debate writing assumes a credit card pointed at a hosted provider. That assumption is wrong often enough to be worth attacking directly. Local-first debates (three models, three opinions, one machine) are now a genuinely practical way to run multi-agent workflows for a real class of problems. Here is why I keep coming back to them.
Privacy. Your prompts never leave the box. For internal code review, customer data triage, regulated-industry summarization, or anything covered by an NDA, this is not a nice-to-have. The bytes go from your editor to a process listening on localhost and back. There is no third party, no logging endpoint, no retention policy to read.
Cost. After the hardware is paid for, the marginal cost of a debate is electricity. A 3-round, 3-agent debate that would run you $0.40–$1.20 on frontier APIs costs cents at the wall. At a few hundred debates a week, that is hundreds of dollars in API spend against a few dollars of electricity; the math compounds quickly.
Reproducibility. A pinned local weight file does not silently change underneath you. The hosted model labeled the same way on Tuesday is not bit-identical to the one on Friday. For evaluation work, regression testing, and any kind of scientific rigor, a hash-pinned local model is worth a lot.
Resilience. No provider outage means no debate outage. If your business logic depends on an agentic loop, a four-hour incident at a single provider is a four-hour incident for you. Local does not have that failure mode.
Be honest about the trade-offs. A 70B local model at Q4 is meaningfully behind the frontier on hard reasoning, long-context synthesis, and cutting-edge code generation. Hardware costs real money up front. Tokens-per-second on consumer GPUs is well below what a hosted API delivers. Local-first is a tool, not a religion.
Ollama in 90 seconds
Ollama is a local model runtime with an OpenAI-compatible HTTP API. It manages model downloads (GGUF weights, with sane default quantizations), KV-cache, and request queuing, and it exposes a chat-completions endpoint at http://localhost:11434/v1. From Session's point of view, an Ollama instance is just another provider, same shape as a cloud one, different base URL, no API key required.
This matters because it means the same Session agent definitions you use against cloud models work locally with a single field changed. If you have already adopted BYOK as described in our BYOK post, going local-first is the next step on the same road: you own the keys, then you own the runtime.
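Concretely, the swap is just a base URL. Here is a minimal sketch using the openai npm package against the local endpoint; Session's provider config is the same idea with a different surface, and the model tag is only an example:

```ts
import OpenAI from "openai";

// Point a standard OpenAI-compatible client at the local Ollama server.
const local = new OpenAI({
  baseURL: "http://localhost:11434/v1", // the only field that changes vs. a cloud provider
  apiKey: "ollama",                     // Ollama ignores the key, but the client requires one
});

const reply = await local.chat.completions.create({
  model: "llama3:70b-instruct-q4_K_M",
  messages: [{ role: "user", content: "Summarize the trade-offs of local-first inference." }],
});

console.log(reply.choices[0].message.content);
```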
Why Ollama specifically and not llama.cpp directly, vLLM, or LM Studio? Ollama is the lowest-friction option that still gives you a real HTTP server, model versioning, and concurrent request handling. vLLM is faster on data-center GPUs but is a heavier install. llama.cpp is excellent but lower-level: you assemble the serving, model management, and scheduling yourself. For a developer machine running a debate, Ollama is the right default.
Picking three models for a debate
The whole point of a debate is that the agents disagree in useful ways. Three copies of the same model with different system prompts will agree too often. You want different pretraining mixes, different post-training, different safety regimes, different tokenizers. That is what produces genuinely independent error modes, which is what makes the synthesizer's job non-trivial and the final answer better than any single model's.
A reasonable default trio:
- A 70B-class instruction-tuned model as the strong agent. Conservative, articulate; tends to produce the longest, most defensible argument.
- A 30B-class model from a different vendor as the diverse voice. Different training mix, often heavier on math/code data; disagrees with the first model in interesting ways on edge cases.
- A smaller (7–13B) model as the synthesizer and judge. A notably terse, structured style is a good fit for the role of "read both arguments and write the answer."
The point is the spread, not the specific names. Swap in whatever the strongest open-weights drop of the month is. What matters is: three distinct training pipelines.
Hardware reality check
Quantization is what makes this fit on real hardware. A rough VRAM rule of thumb at Q4_K_M is "model size in billions of parameters times 0.6 GB," plus 2–4 GB of KV-cache headroom for a debate-sized context window (8–16k tokens).
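Applied to the trio above, the rule pencils out roughly like this. A quick sketch, not a sizing tool; real footprints vary with quantization variant and context length:

```ts
// Back-of-envelope VRAM estimate: ~0.6 GB per billion parameters at Q4_K_M,
// plus 2-4 GB of KV-cache headroom (3 GB used as a middle value here).
function estimateVramGB(paramsBillions: number, kvHeadroomGB = 3): number {
  return paramsBillions * 0.6 + kvHeadroomGB;
}

console.log(estimateVramGB(70)); // ~45 GB for the 70B
console.log(estimateVramGB(32)); // ~22 GB for the 32B
console.log(estimateVramGB(9));  // ~8 GB for the 9B judge
```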
- 24 GB (single high-end consumer card): one model at a time. Run the 70B Q4 alone (Ollama spills the layers that do not fit into system RAM), or run all three by hot-swapping (Ollama unloads idle models). Expect 8–14 tok/s on the 70B and 25–35 tok/s on a 27–32B-class model.
- 48 GB (workstation card or two consumer cards): the 70B at Q4_K_M fits entirely in VRAM with a little headroom, or you can keep the 32B and the 9B both resident and swap only the 70B.
- 64 GB: comfortable for 70B Q5 plus one mid-sized model resident.
- 96 GB+ (workstation-class GPU or M-series Mac with unified memory): all three resident at Q5_K_M, no swapping. This is where local-first stops feeling like a compromise.
Apple Silicon note: unified memory means a 64 GB M-series machine functionally behaves like a GPU with 64 GB of VRAM for inference. Tokens-per-second is lower than a discrete card, but the model fits, and the laptop is silent and portable. For solo developers, this is often the right buy.
Pull the models:
```bash
ollama pull llama3:70b-instruct-q4_K_M
ollama pull qwen2.5:32b-instruct-q5_K_M
ollama pull gemma2:9b-instruct-q5_K_M
ollama list
```
Latency, and how a debate actually runs
A 3-round, 3-agent debate is nine model calls plus a synthesis step. Naively serialized on a single GPU with one model resident at a time, you pay the model-load cost (5–20 seconds per swap on NVMe) plus generation time. That is unacceptable.
Two practical workarounds. First, if your hardware fits all three models simultaneously, Ollama keeps them resident (up to its OLLAMA_MAX_LOADED_MODELS setting) and serves their requests concurrently; a round of three agents then runs in parallel, and the wall-clock cost of a round is roughly the slowest agent. Second, if you cannot fit all three, schedule the debate as "all of agent A's turns, then all of agent B's, then all of agent C's," which cuts model loads to three per debate (two swaps) instead of nine.
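Here is what a parallel round looks like at the API level. A minimal sketch against Ollama's OpenAI-compatible endpoint using the openai npm package; runRound and the prompt wiring are illustrative names, not Session's API:

```ts
import OpenAI from "openai";

const local = new OpenAI({ baseURL: "http://localhost:11434/v1", apiKey: "ollama" });

// One debate round: fire all three debater calls concurrently. With every model
// resident, wall-clock cost is roughly the slowest agent, not the sum of the three.
async function runRound(prompt: string, previousRound: string, models: string[]) {
  const calls = models.map((model) =>
    local.chat.completions.create({
      model,
      temperature: 0.7,
      messages: [
        { role: "system", content: "You are one debater among three. Argue your strongest case." },
        { role: "user", content: `${prompt}\n\nPrevious round:\n${previousRound}` },
      ],
    })
  );
  const replies = await Promise.all(calls);
  return replies.map((r) => r.choices[0].message.content ?? "");
}
```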
Realistic numbers on a single high-end consumer GPU with all three models resident at moderate quantization, 1k token responses: a full 3-round debate completes in 60–120 seconds. On a 64 GB Apple Silicon laptop, double that. On a dual workstation-card box, under 30 seconds.
Wiring three Ollama agents into Session
In Session, each agent is a (provider, model, system prompt, role) tuple. For a local debate, all three agents share `provider = ollama` and `baseURL = http://localhost:11434/v1`, and differ in `model` and `role`.
- Define Agent-Strong: the 70B model, role debater, system prompt asking for the strongest defensible argument with citations to the input.
- Define Agent-Diverse: the 32B model from a different vendor, role debater, system prompt asking it to find what the other agents missed and to argue the contrarian read.
- Define Agent-Judge: the small model, role synthesizer, system prompt instructing it to read the full transcript, weight arguments by evidence, and produce a single ranked answer.
The judge sees the full debate transcript; the debaters see only the original prompt and the previous round's outputs. Keep the judge's temperature low (0.2) and the debaters' moderate (0.6–0.8). That is the whole configuration.
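Spelled out as data, that configuration might look something like the sketch below. The field names are illustrative, not Session's actual schema; they just mirror the tuple described above:

```ts
// Illustrative agent definitions -- field names are assumptions, not Session's schema.
const baseURL = "http://localhost:11434/v1";

const agents = [
  {
    name: "strong",
    provider: "ollama", baseURL,
    model: "llama3:70b-instruct-q4_K_M",
    role: "debater",
    temperature: 0.7,
    system: "Make the strongest defensible argument, citing specific parts of the input.",
  },
  {
    name: "diverse",
    provider: "ollama", baseURL,
    model: "qwen2.5:32b-instruct-q5_K_M",
    role: "debater",
    temperature: 0.7,
    system: "Find what the other agents missed and argue the contrarian read.",
  },
  {
    name: "judge",
    provider: "ollama", baseURL,
    model: "gemma2:9b-instruct-q5_K_M",
    role: "synthesizer",
    temperature: 0.2,
    system: "Read the full transcript, weight arguments by evidence, and produce a single ranked answer.",
  },
];
```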
A worked example: reviewing a TypeScript snippet
Suppose you hand the trio this:
```ts
function getUser(id: string) {
  const u = users.find(u => u.id == id);
  return u.name.toLowerCase();
}
```
A realistic debate transcript looks like this:
[Strong agent]
Three issues. (1) `==` should be `===` to avoid type coercion bugs.
(2) `u` may be undefined; `.name` will throw. Guard with `if (!u) return null`.
(3) Return type is implicit; annotate `: string | null`.
[Diverse agent]
Agree on the undefined check, but the deeper issue is the API contract.
Should `getUser` throw, return null, or return a Result type? Silent
null leaks bugs. Also: `users` is a closure global, that is the real
smell. Inject it.
[Judge / synthesizer]
Synthesis: fix the null deref (blocking), tighten equality to `===`,
add explicit return type. Treat the closure-global and Result-vs-null
points as a follow-up refactor, not a fix in this PR. Final patch:
return type `string | null`, early return on miss, `===` for the
comparison.
Notice what happened: the strong agent gave the textbook answer, the diverse agent pushed on the architecture, the judge scoped the fix to what belongs in this PR. No single model would have produced that shape of answer alone.
When not to go fully local
Local-first is not always the right call. If the task genuinely needs a frontier model (long-context document synthesis over 200k tokens, hard competition math, cutting-edge code generation against unfamiliar libraries), a 70B at Q4 will let you down. Be honest about which task you are running.
The hybrid setup I reach for in those cases: keep two local debaters (privacy-preserving for the raw input) and use a cloud frontier model as the synthesizer. The cloud model sees the debaters' arguments, not the original sensitive document. You get frontier-quality synthesis with most of the privacy benefit. It is the pragmatic middle, and for a lot of real work, it is the right answer.
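A sketch of that hybrid wiring, with illustrative names and an example cloud model; the point is that only the debaters' arguments cross the network, never the raw input:

```ts
import OpenAI from "openai";

// Local debaters: the sensitive input never leaves the machine.
const local = new OpenAI({ baseURL: "http://localhost:11434/v1", apiKey: "ollama" });
// Cloud synthesizer: sees the debaters' arguments only, never the raw document.
const cloud = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function hybridDebate(sensitiveDoc: string) {
  const debaters = ["llama3:70b-instruct-q4_K_M", "qwen2.5:32b-instruct-q5_K_M"];

  // The debate rounds run entirely locally against the raw document.
  const debaterArgs = await Promise.all(
    debaters.map(async (model) => {
      const r = await local.chat.completions.create({
        model,
        temperature: 0.7,
        messages: [{ role: "user", content: `Review this and argue your position:\n\n${sensitiveDoc}` }],
      });
      return r.choices[0].message.content ?? "";
    })
  );

  // Only the arguments, not the document, go to the cloud synthesizer.
  const synthesis = await cloud.chat.completions.create({
    model: "gpt-4o", // any frontier model; the name is just an example
    temperature: 0.2,
    messages: [
      { role: "system", content: "Synthesize the strongest answer from these arguments." },
      { role: "user", content: debaterArgs.join("\n\n---\n\n") },
    ],
  });
  return synthesis.choices[0].message.content;
}
```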
Start local. Measure. Reach for the cloud only when the numbers say you must.