Retrieval Economics

Token Efficiency with Deterministic Retrieval

You pay for every token you retrieve. Precision means you retrieve less.

Author

Joshua Thomas

CTO, CogniSwitch

Reading Time

~9 min read

~70% fewer tokens per consult-QA run: 85,000 → 25,000

~3.4× lower cost per consult: $0.3485 → $0.1025

~$63.5K saved per year on one workflow: $90,018 → $26,476

Modeled rates, anonymized deployment. Figures are from an anonymized virtual urgent care clinical-QA deployment comparing standard RAG with CogniSwitch deterministic retrieval. Costs are computed at modeled list rates of $2.00 per million input tokens and $8.00 per million output tokens; they are not taken from a billed invoice.

In One Paragraph

Deterministic retrieval reduces token consumption because it returns the precise subgraph a query needs instead of the top-k similar chunks that vector RAG stuffs into context. You stop paying input tokens for passages that were similar enough to be retrieved but irrelevant to the question. Structured facts are denser than prose, multi-hop questions resolve in one traversal instead of repeated re-querying that re-sends context, and there is no language model in the retrieval path burning tokens just to decide what to fetch.

Key Takeaways

TL;DR

You pay for what you retrieve. Vector RAG over-fetches top-k chunks; most of those input tokens are noise you're billed for anyway.

No model sits in the retrieval path. The expensive, token-heavy work of deciding what to fetch is replaced by a deterministic traversal.

Structured facts are denser than prose. A handful of grounded concepts carry the same answer as thousands of tokens of similar text.

Multi-hop resolves in one pass. No re-query loop re-sending a growing context, so tokens don't compound across hops.

The savings are real and they compound. ~70% fewer tokens per run in one clinical-QA deployment, ~$63.5K saved in a year on a single workflow.

Why does vector RAG burn so many tokens?

Because similarity search is recall-oriented by design. To avoid missing the relevant passage, you retrieve the top-k nearest chunks, set k generously, and often add still more context to be safe. Each chunk also mixes many concepts, only some of which bear on the question, so what comes back is mostly noise with little signal, and you are billed for every token of it on the way into the model.

Then the workarounds add their own tax. A re-ranking pass re-reads the candidates. The "lost in the middle" effect, where models underuse information buried in the middle of a long context, pushes teams to enlarge the context further. And when one retrieval doesn't answer a multi-hop question, an agent re-queries and re-sends the growing context on every loop. The bill is dominated by tokens you retrieved but never needed.
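To see why the re-query loop is so expensive, here is a minimal sketch of the arithmetic. The function names and every number in it (hops, k, chunk size, prompt size) are assumptions chosen for illustration, not measurements from the deployment discussed in this article.

```python
# Illustrative only: hops, k, chunk_tokens and prompt_tokens are assumed
# values, not figures from any real deployment.

def requery_input_tokens(hops: int, k: int, chunk_tokens: int, prompt_tokens: int) -> int:
    """Total input tokens when every hop re-sends all context gathered so far."""
    total = 0
    context = 0
    for _ in range(hops):
        context += k * chunk_tokens        # this hop's top-k chunks join the context
        total += prompt_tokens + context   # the whole context is billed again
    return total


def single_pass_input_tokens(hops: int, k: int, chunk_tokens: int, prompt_tokens: int) -> int:
    """Total input tokens if the same chunks were sent exactly once."""
    return prompt_tokens + hops * k * chunk_tokens


if __name__ == "__main__":
    looped = requery_input_tokens(hops=3, k=10, chunk_tokens=500, prompt_tokens=200)
    once = single_pass_input_tokens(hops=3, k=10, chunk_tokens=500, prompt_tokens=200)
    print(f"re-query loop: {looped} input tokens, single pass: {once} input tokens")
```

Under these assumed values the loop bills roughly twice the input tokens of a single pass, and the gap widens with each extra hop because the re-sent context grows quadratically with the number of hops.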

"There's too much noise, no signal. You can't retrieve meaningful information, and you get into hallucinations."

Joshua Thomas, CTO, CogniSwitch

How does deterministic retrieval cut token usage?

There are two kinds of model usage, and only one of them belongs in the per-query path. Move the model to ingestion, keep it out of retrieval, and the token bill collapses.

"There are two types of usage. One is the platform using a language model to do some of the job. The other is retrieval, where we don't use any model at all, and that's where you have huge benefits of tokens, which you'll see when you compare with any RAG application."

CogniSwitch, customer architecture review

No model in retrieval

A deterministic traversal decides what to fetch. You spend zero tokens choosing context.

Dense, structured facts

A chain of concepts carries the answer in a fraction of the tokens of equivalent prose chunks.

Multi-hop in one pass

The traversal handles linked concepts at once, so context isn't re-sent on every retry.
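As a rough picture of what "a deterministic traversal decides what to fetch" can look like, the sketch below walks a toy graph breadth-first and returns the linked facts as triples. It is not CogniSwitch's implementation: the graph contents, the `traverse` function, and the `max_hops` parameter are all hypothetical, and the clinical-sounding nodes are invented for illustration only.

```python
from collections import deque

# Hypothetical toy graph: node -> list of (relation, neighbor).
# Contents are invented for illustration and are not clinical guidance.
GRAPH = {
    "sore_throat": [("suggests", "strep_pharyngitis")],
    "strep_pharyngitis": [
        ("confirmed_by", "rapid_strep_test"),
        ("treated_with", "amoxicillin"),
    ],
    "amoxicillin": [("contraindicated_by", "penicillin_allergy")],
    "rapid_strep_test": [],
    "penicillin_allergy": [],
}


def traverse(graph, start, max_hops):
    """Return (subject, relation, object) triples within max_hops of start.

    Breadth-first and order-preserving, so the same query always yields
    the same facts. No model is called at any point.
    """
    facts = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for relation, neighbor in graph.get(node, []):
            facts.append((node, relation, neighbor))
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return facts


if __name__ == "__main__":
    for fact in traverse(GRAPH, "sore_throat", max_hops=3):
        print(fact)
```

A three-hop question here comes back as four short triples in a single call, which is the sense in which multi-hop resolves in one pass: the linked concepts arrive together, so nothing has to be re-queried or re-sent.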

Before / After // One Query, Two Architectures

The token bill, side by side

Take one real workflow: quality-assuring a virtual urgent care consult. Standard RAG retrieves chunks of the record and guidelines and stuffs them into the prompt. Deterministic retrieval traverses to the exact clinical facts the QA needs. Same task, same model, very different bill.

Per consult-QA run	Standard RAG	Deterministic retrieval
Total tokens (input + output)	85,000	25,000
  Input tokens	55,250	16,250
  Output tokens	29,750	8,750
Cost per consult*	$0.3485	$0.1025

* At modeled list rates of $2.00 / million input tokens and $8.00 / million output tokens. Anonymized client engagement.
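The per-consult figures in the table can be reproduced directly from the token counts and the modeled list rates. The sketch below does that arithmetic; the function and constant names are ours, and the rates are the modeled ones stated in the footnote, not an invoice.

```python
# Modeled list rates from the footnote (USD per million tokens).
INPUT_RATE_PER_M = 2.00
OUTPUT_RATE_PER_M = 8.00


def cost_per_run(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one consult-QA run at the modeled rates."""
    return (input_tokens * INPUT_RATE_PER_M + output_tokens * OUTPUT_RATE_PER_M) / 1_000_000


rag_cost = cost_per_run(55_250, 29_750)   # standard RAG row
det_cost = cost_per_run(16_250, 8_750)    # deterministic retrieval row

token_reduction = 1 - 25_000 / 85_000     # fraction of tokens removed
cost_ratio = rag_cost / det_cost          # how many times cheaper per consult

if __name__ == "__main__":
    print(f"standard RAG:  ${rag_cost:.4f}")
    print(f"deterministic: ${det_cost:.4f}")
    print(f"token reduction: {token_reduction:.1%}, cost ratio: {cost_ratio:.1f}x")
```

Because input and output tokens fall by the same proportion in this deployment, the cost ratio matches the token ratio; with a different input/output mix the two would diverge.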

Then multiply by volume

Projected 2026 consults: 410,000
Usable transcripts (~70% of consults): 287,000
Symptom-based visits (~90% of usable transcripts): 258,300

Annual cost on this one workflow

$90,018 (standard RAG) vs. $26,476 (deterministic retrieval)

258,300 visits × $0.3485 vs. 258,300 × $0.1025

~$63,500 saved per year
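The annual figure is the volume funnel multiplied by the per-consult cost. A short sketch of that projection follows, using only numbers stated in this article; the variable names are ours, and integer percentages are used to keep the funnel arithmetic exact.

```python
# Volume funnel, using the percentages stated in the article.
projected_consults = 410_000
usable_transcripts = projected_consults * 70 // 100      # ~70% usable
symptom_based_visits = usable_transcripts * 90 // 100    # ~90% symptom-based

# Per-consult cost at modeled list rates (USD).
RAG_COST_PER_CONSULT = 0.3485
DETERMINISTIC_COST_PER_CONSULT = 0.1025

rag_annual = symptom_based_visits * RAG_COST_PER_CONSULT
det_annual = symptom_based_visits * DETERMINISTIC_COST_PER_CONSULT
annual_saving = rag_annual - det_annual

if __name__ == "__main__":
    print(f"visits in scope: {symptom_based_visits:,}")
    print(f"standard RAG:    ${rag_annual:,.0f}")
    print(f"deterministic:   ${det_annual:,.0f}")
    print(f"saved per year:  ${annual_saving:,.0f}")
```

The saving is linear in volume, so swapping in your own consult count or funnel percentages rescales the result directly.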

When token efficiency isn't the deciding factor

On a small corpus, or for exploratory work where you're still figuring out what "relevant" means, the token bill may not be what's hurting, and vector search is genuinely the better tool for fuzzy discovery. Token efficiency is a consequence of precision, so it matters most exactly where precision matters most: high-volume, repeated, production workflows where the same kind of question is asked thousands of times a day.

That's also where it compounds with the other consequence of precision. Retrieving less doesn't only cost less, it returns faster.

Latency gains with deterministic retrieval

FAQ

Where the token savings come from, whether they hold up, and how they scale.

Q1. Why does deterministic retrieval use fewer tokens than RAG?

Because it retrieves a precise subgraph instead of stuffing the top-k most similar chunks into the prompt. You stop paying input tokens for the passages that were near enough but irrelevant. The model receives the exact grounded facts a question needs, not a haystack it has to read through.

Q2. How much token reduction is realistic?

In one virtual urgent care clinical-QA deployment, the token count per consult-QA run fell from about 85,000 with standard RAG to about 25,000 with deterministic retrieval, roughly 70% fewer, or a 3.4x reduction. The exact ratio depends on your corpus and task, but the direction is structural, not incidental.

Q3. Where do the token savings actually come from?

Two architectural choices. First, there is no language model in the retrieval path, so you don't spend tokens deciding what to retrieve. Second, the traversal returns a small, exact set of facts rather than a large, noisy context, and multi-hop questions resolve in one pass instead of re-querying and re-sending a growing context each time.

Q4. Doesn't this just move the cost to ingestion?

Ingestion does use models and vectors to read and structure your documents, and that cost is real. But it is paid once and amortized, while retrieval runs on every query, forever. Moving the expensive, noisy work out of the per-query path and into a one-time structuring step is exactly where the savings compound.
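One way to reason about the amortization is a break-even count: how many queries it takes for the per-query saving to repay the one-time ingestion cost. The sketch below is a hedged illustration. The per-query costs are the modeled figures from this article, but the ingestion cost is a placeholder assumption, since the article does not state one, and `break_even_queries` is a hypothetical helper.

```python
import math


def break_even_queries(ingestion_cost: float,
                       baseline_cost_per_query: float,
                       new_cost_per_query: float) -> int:
    """Number of queries after which the one-time ingestion cost is repaid."""
    saving_per_query = baseline_cost_per_query - new_cost_per_query
    if saving_per_query <= 0:
        raise ValueError("no per-query saving, so ingestion never pays back")
    return math.ceil(ingestion_cost / saving_per_query)


if __name__ == "__main__":
    # ASSUMPTION: 5,000 USD is a made-up ingestion cost for illustration only.
    assumed_ingestion_cost = 5_000.0
    n = break_even_queries(assumed_ingestion_cost, 0.3485, 0.1025)
    print(f"break-even after {n:,} queries")
```

Substitute your own ingestion estimate; the higher your query volume, the sooner the one-time cost disappears into the per-query saving.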

Q5. Does retrieving fewer tokens hurt answer quality?

No, it tends to improve it. You are removing noise, not signal: the chunks deterministic retrieval leaves out are the ones vector search pulled in because they were merely similar, not because they were relevant. A model reasoning over exact, grounded facts is less likely to hallucinate than one reasoning over a large, loosely related context.

Q6. How do the savings scale with volume?

They scale linearly with call volume, because every call saves the same amount. In the same deployment, cost per consult fell from about $0.35 to about $0.10 at modeled list rates. Across the projected annual volume of 258,300 symptom-based visits, that is roughly $90,000 versus $26,000, about $63,500 saved in a single year on one workflow.

Q7. Is the saving just from using a smaller model?

No. The reduction is in how many tokens the workflow consumes, independent of which model phrases the final answer. You can run the same model and still spend a fraction of the tokens, because the difference is in what gets retrieved and sent, not in the model tier.

Stop paying for noise.

Retrieve the exact facts a question needs, and the token bill follows. That's a traversal over your knowledge graph, not a top-k chunk dump.

How deterministic retrieval works


References

1. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," TACL, 2024.
2. Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," NeurIPS, 2020.