Retrieval Performance
Latency Gains with Deterministic Retrieval
The model call is the latency. Take it out of retrieval.
Author
Joshua Thomas
CTO, CogniSwitch
Reading Time
~9 min read
Seconds
A single model call
a few seconds, up to 10–30 s, variable
Milliseconds
A graph traversal
a bounded database lookup
Zero
Model calls in retrieval
the model only phrases the answer
Deterministic retrieval is fast because it queries your database with a graph traversal instead of routing through language-model calls. A model call is the expensive part: it adds anywhere from a few seconds to tens of seconds, and it is unpredictable. A traversal returns in milliseconds, the way any database query does. Take the model out of the retrieval path and you remove that variable, multi-second tax entirely, leaving a single model call to phrase the final answer.
Key Takeaways
TL;DR
The model call is the latency. It adds seconds to tens of seconds per call, and the duration is unpredictable.
Retrieval hits your database, not a model. A graph traversal is a bounded lookup that returns in milliseconds.
Multi-hop doesn't loop the model. Vector RAG re-queries and re-generates per hop; a traversal resolves linked concepts in one pass.
Deterministic means a predictable tail. No model in the path means no heavy, spiky p99 to design around.
Fast and predictable enough to run inline. In one telehealth deployment, review coverage went from ~3% of consults to ~90–95% by running on every output.
The hidden cost: a model call in the retrieval path
Most RAG pipelines put language-model calls in the retrieval path, not just at the end. The query gets embedded, a similarity search runs, a model re-ranks the candidates, and for anything multi-hop an agent loops the model again to decide what to fetch next. Every model call in that chain is a network round-trip to a probabilistic system.
And model calls are slow in a way that's hard to bound. A single call can take a few seconds, and under load or with long context it can take 10, 20, even 30 seconds. Worse, the time varies request to request, so you can't promise a tight response-time budget. Stack several calls in a multi-hop flow and the latency compounds with no ceiling you can rely on.
A model call can cost anywhere from a few seconds to tens of seconds, and you can't predict which. Put several in the retrieval path and the wait stacks.
The core latency tax
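The compounding is plain addition: sequential calls sum their waits. The sketch below works that arithmetic through, assuming a hypothetical per-call range of 2 to 30 seconds. The lower bound is an illustrative stand-in for "a few seconds", not a measured figure, and the function names are invented for this example.

```python
def sequential_latency(call_latencies_s):
    """Total wait when model calls run one after another: latencies add."""
    return sum(call_latencies_s)

# Hypothetical per-call bounds in seconds (low, high). The 2.0 is an assumed
# stand-in for "a few seconds"; 30.0 is the upper figure quoted in the text.
PER_CALL_RANGE_S = (2.0, 30.0)

def retrieval_wait_range(num_calls, per_call_range=PER_CALL_RANGE_S):
    """Best-case and worst-case total wait for num_calls sequential calls."""
    low, high = per_call_range
    best = sequential_latency([low] * num_calls)
    worst = sequential_latency([high] * num_calls)
    return best, worst

if __name__ == "__main__":
    for calls in (1, 3, 5):
        best, worst = retrieval_wait_range(calls)
        print(f"{calls} sequential call(s): {best:.0f}s to {worst:.0f}s")
```

The gap between best and worst case widens with every added call, which is why a tight response-time budget is hard to promise.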
How deterministic retrieval removes it
Replace the model calls with one traversal of your knowledge graph. No embedding round-trip, no re-rank pass, no agentic loop, just a bounded lookup against a structure you already built.
"It's not using vectors. That's partly why it's fast. Provisioning vectors takes time; a purely graph system isn't doing that at retrieval."
Joshua Thomas, CTO, CogniSwitch
A query, not a guess
Retrieval is a traversal of the graph: a bounded database lookup, not a probabilistic search plus model passes.
No round-trips to a model
Embedding, re-ranking, and agentic re-querying are gone from the path. The only model call left phrases the answer.
Multi-hop in one pass
Linked concepts resolve in a single traversal, so latency doesn't compound across sequential model calls.
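A minimal sketch of what "multi-hop in one pass" can look like, assuming the knowledge graph is available as a simple adjacency map. The node names, relation labels, and the `traverse` helper are all hypothetical; this is not CogniSwitch's implementation. It only shows that a hop-limited traversal does a bounded amount of work and never calls a model.

```python
from collections import deque

# Hypothetical knowledge graph as an adjacency map: node -> [(relation, node)].
GRAPH = {
    "symptom:chest_pain": [
        ("assess_with", "reading:blood_pressure"),
        ("check_history", "history:cardiac"),
        ("check_interactions", "medication:active"),
    ],
    "history:cardiac": [("treated_with", "medication:active")],
    "reading:blood_pressure": [],
    "medication:active": [],
}

def traverse(graph, start, max_hops=2):
    """Breadth-first traversal bounded by max_hops; no model call anywhere."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        order.append(node)
        if hops == max_hops:
            continue  # the hop limit is what bounds the work
        for _relation, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return order

print(traverse(GRAPH, "symptom:chest_pain"))
```

Each node is visited at most once and the hop limit caps the depth, so the cost depends on the size of the neighborhood, not on how a model behaves on a given request.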
Before / After: One Query, Two Architectures
Count the model calls
Take a multi-hop clinical question: a patient reports chest pain, and the system needs the linked blood-pressure readings, prior cardiac history, and active medications. Watch where the time goes.
| In the retrieval path | Standard RAG (agentic) | Deterministic retrieval |
|---|---|---|
| Model calls to retrieve | Several (embed, re-rank, re-query per hop) | None |
| Retrieval mechanism | Model calls + vector similarity search | One graph traversal |
| Time per retrieval step | Seconds to tens of seconds, variable | Milliseconds, bounded |
| Multi-hop | Sequential loop, latency compounds | Single pass |
| Tail latency (p99) | Heavy, unpredictable | Predictable |
| Model calls remaining | 1 (final generation) | 1 (final phrasing only) |
Both architectures end with one model call to phrase the answer. The difference is everything that happens before it: a stack of variable, multi-second model calls, or a single bounded traversal.
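To make the comparison concrete, the sketch below counts model calls in two toy pipelines. Everything here is hypothetical: `CountingModel` is a stand-in that only counts invocations, and the assumed pattern for the agentic path (one embedding call, then one re-rank and one re-query call per hop) is one plausible reading of the table, not a description of any specific framework.

```python
class CountingModel:
    """Stand-in for a language model that only counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def call(self, prompt):
        self.calls += 1
        return f"response to: {prompt}"

def agentic_rag(model, question, hops):
    """Assumed pattern: embed once, then re-rank and re-query on every hop."""
    model.call(f"embed: {question}")
    for hop in range(hops):
        model.call(f"re-rank candidates for hop {hop}")
        model.call(f"decide next query after hop {hop}")
    retrieval_calls = model.calls
    model.call(f"generate answer: {question}")
    return retrieval_calls, model.calls

def deterministic(model, question, graph_lookup):
    """Retrieval is a plain lookup; the model is used only to phrase the answer."""
    facts = graph_lookup(question)  # no model involved here
    retrieval_calls = model.calls
    model.call(f"phrase answer from {facts}")
    return retrieval_calls, model.calls

if __name__ == "__main__":
    question = "chest pain: linked readings, history, medications?"
    print("agentic RAG (retrieval, total):", agentic_rag(CountingModel(), question, hops=3))
    print("deterministic (retrieval, total):",
          deterministic(CountingModel(), question, lambda q: ["bp", "history", "meds"]))
```

Under these assumptions the agentic path makes `1 + 2 * hops` model calls before generation, while the deterministic path makes none.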
Why is predictability the real win?
Lower average latency matters, but a bounded tail matters more. A deterministic traversal returns in a predictable time on every run, so you can promise a response-time budget. A model in the path makes the p99 spiky and forces you to design around the worst case you can't control.
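One way to see the difference is to compare the median with the p99 for each path. The sketch below uses a nearest-rank percentile over two sets of latency samples. The samples are invented purely for illustration and are not measurements from any deployment.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical latency samples in milliseconds, invented for illustration only.
traversal_ms = [4, 5, 5, 6, 6, 6, 7, 7, 8, 9]
model_path_ms = [2100, 2400, 2600, 3000, 3200, 3900, 4500, 6000, 12000, 28000]

for name, samples in (("traversal", traversal_ms), ("model in path", model_path_ms)):
    p50 = percentile(samples, 50)
    p99 = percentile(samples, 99)
    print(f"{name}: p50={p50} ms, p99={p99} ms, p99/p50={p99 / p50:.1f}x")
```

The ratio of p99 to p50 is the number to watch: a budget can be promised when that ratio stays small, and it cannot when a few slow requests dominate the tail.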
Fast enough to run on every output?
When retrieval is fast and predictable, checking can run inline on every output instead of offline on a sample. That changes what's possible operationally.
Consult-review coverage, one telehealth clinical-QA deployment
In that deployment, manual review could reach only a small fraction of the thousand-plus consults happening each day. Running deterministic checks inline lifted coverage to nearly all of them. A major reason that was feasible at all is that retrieval stopped waiting on model calls. Inline verification also required cost and engineering work of its own.
When latency isn't the bottleneck
For batch or offline workloads where nothing is waiting on the answer, an LLM-heavy pipeline can be perfectly fine, and raw retrieval speed isn't what decides the design. Latency becomes the constraint the moment something is gated in real time: a user is waiting on a response, or an output has to be verified before it ships.
That's the same line that decides cost. Removing model calls from the path doesn't only make retrieval faster, it makes it leaner.
Token efficiency with deterministic retrieval
FAQ
Where the latency goes, why a traversal is bounded, and what running inline unlocks.
Q1. Why is deterministic retrieval faster than RAG?
Because it retrieves from your database with a graph traversal instead of going through language-model calls. A traversal is a bounded lookup that returns in milliseconds, the way any database query does. The slow, variable part of RAG is the model calls in the path, and deterministic retrieval removes them.
Q2. How much latency does a language-model call actually add?
Anywhere from a few seconds to tens of seconds, and it is unpredictable. A single call can take 10, 20, even 30 seconds under load or with long context, and multi-hop agentic flows stack several calls in sequence. That variability is why model-in-the-path retrieval can't promise a tight response-time budget.
Q3. Where does the latency actually go?
Into the model calls in the retrieval path. Vector RAG often embeds the query, runs a similarity search, re-ranks with another model pass, then loops the model again for multi-hop. Deterministic retrieval takes the model out of that path and replaces all of it with one traversal of the graph. The only model call left is the final one that phrases the answer.
Q4. Is the retrieval really free of model calls?
Yes. Language models and vectors do their work at ingestion, when documents are read and structured into the graph. Retrieval itself is a deterministic traversal with no model in the loop. That separation, models at ingestion and not at retrieval, is what makes the lookup both fast and reproducible.
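The separation can be sketched as two stages, one that builds the graph and one that only reads it. The example is a sketch under stated assumptions: `extract_triples` is a hypothetical stand-in for whatever model-assisted extraction runs at ingestion, and the pipe-delimited input format is invented to keep the example self-contained.

```python
def extract_triples(document):
    """Hypothetical stand-in for model-assisted extraction at ingestion time."""
    # Assumes each line is "subject|relation|object"; a real system would
    # use a model here to read and structure the document.
    return [tuple(line.split("|")) for line in document.splitlines() if line.strip()]

def ingest(documents):
    """Build the graph once, up front. Only this stage may involve a model."""
    graph = {}
    for document in documents:
        for subject, relation, obj in extract_triples(document):
            graph.setdefault(subject, []).append((relation, obj))
    return graph

def retrieve(graph, subject):
    """Deterministic lookup: the same graph and subject give the same result."""
    return sorted(graph.get(subject, []))

graph = ingest([
    "chest_pain|assess_with|blood_pressure\nchest_pain|check_history|cardiac_history",
])
first = retrieve(graph, "chest_pain")
second = retrieve(graph, "chest_pain")
print(first == second, first)
```

Because `retrieve` touches only the prebuilt structure, repeated queries return identical results, which is the reproducibility the answer above refers to.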
Q5. What about predictability and tail latency?
A deterministic traversal returns the same result in a bounded time on every run, so the p99 is predictable rather than spiky. Model calls have heavy, variable tails. When the slow component is removed from the critical path, response time stops depending on how a probabilistic model happened to behave on that request.
Q6. Does this make real-time, inline verification possible?
It does, and that is the point. Because retrieval and checking are fast and predictable, they can run on every output inline rather than offline on a sample. In one telehealth clinical-QA deployment, that shifted review coverage from the roughly 3% of consults that human reviewers could cover to about 90–95% covered by the system.
Q7. When is latency not the bottleneck?
For batch or offline workloads where nothing waits on the answer, raw speed matters less, and an LLM-heavy pipeline can be fine. Latency becomes decisive when something is gated in real time: a user is waiting, or an output has to be verified before it ships. That is exactly where removing model calls from the path pays off.
Take the model out of the path.
If a user is waiting, or you need to verify before you ship, retrieval can't depend on a model call that takes seconds you can't predict. A traversal is a query, in milliseconds.