Retrieval Performance

Latency Gains with Deterministic Retrieval

The model call is the latency. Take it out of retrieval.

Author: Joshua Thomas, CTO, CogniSwitch

Reading time: ~9 min

Seconds: a single model call (a few, up to 10–30s, variable)

Milliseconds: a graph traversal (a bounded database lookup)

Zero: model calls in retrieval (the model only phrases the answer)

In One Paragraph

Deterministic retrieval is fast because it queries your database with a graph traversal instead of routing through language-model calls. A model call is the expensive part: it adds anywhere from a few seconds to tens of seconds, and it is unpredictable. A traversal returns in milliseconds, the way any database query does. Take the model out of the retrieval path and you remove that variable, multi-second tax entirely, leaving a single model call to phrase the final answer.

Key Takeaways

TL;DR

The model call is the latency. It adds seconds to tens of seconds per call, and the duration is unpredictable.

Retrieval hits your database, not a model. A graph traversal is a bounded lookup that returns in milliseconds.

Multi-hop doesn't loop the model. Vector RAG re-queries and re-generates per hop; a traversal resolves linked concepts in one pass.

Deterministic means a predictable tail. No model in the path means no heavy, spiky p99 to design around.

Fast and predictable enough to run inline. In one telehealth deployment, review coverage went from ~3% of consults to ~90–95% by running on every output.

The hidden cost: a model call in the retrieval path

Most RAG pipelines put language-model calls in the path to retrieval, not just at the end. The query gets embedded, a similarity search runs, a model re-ranks the candidates, and for anything multi-hop an agent loops the model again to decide what to fetch next. Every one of those calls is a network round-trip to a probabilistic system.

And model calls are slow in a way that's hard to bound. A single call can take a few seconds, and under load or with long context it can take 10, 20, even 30 seconds. Worse, the time varies request to request, so you can't promise a tight response-time budget. Stack several calls in a multi-hop flow and the latency compounds with no ceiling you can rely on.
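To make the compounding concrete, here is a minimal sketch that simulates sequential model calls. The uniform 2 to 30 second range is an assumption taken loosely from the figures above, not a measurement, and `simulate_model_call` and `retrieval_latency` are hypothetical names used only for illustration.

```python
import random

def simulate_model_call(rng: random.Random) -> float:
    """Hypothetical latency of one model call in seconds (assumed range: 2 to 30 s)."""
    return rng.uniform(2.0, 30.0)

def retrieval_latency(num_calls: int, rng: random.Random) -> float:
    """Sequential calls: each one waits on the previous, so the latencies add up."""
    return sum(simulate_model_call(rng) for _ in range(num_calls))

rng = random.Random(0)  # fixed seed so repeated runs print the same numbers
for calls in (1, 3, 5):
    runs = [retrieval_latency(calls, rng) for _ in range(1000)]
    print(f"{calls} call(s): best {min(runs):.1f}s, worst {max(runs):.1f}s")
```

Under these assumptions the worst case grows linearly with the number of calls in the path, and the spread between best and worst widens with it.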

A model call can cost anywhere from a few seconds to tens of seconds, and you can't predict which. Put several in the retrieval path and the wait stacks.

The core latency tax

How deterministic retrieval removes it

Replace the model calls with one traversal of your knowledge graph. No embedding round-trip, no re-rank pass, no agentic loop, just a bounded lookup against a structure you already built.

"It's not using vectors. That's partly why it's fast. Provisioning vectors takes time; a purely graph system isn't doing that at retrieval."

Joshua Thomas, CTO, CogniSwitch

A query, not a guess

Retrieval is a traversal of the graph: a bounded database lookup, not a probabilistic search plus model passes.

No round-trips to a model

Embedding, re-ranking, and agentic re-querying are gone from the path. The only model call left phrases the answer.

Multi-hop in one pass

Linked concepts resolve in a single traversal, so latency doesn't compound across sequential model calls.
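As an illustration of what "a bounded lookup" can mean, the sketch below runs a breadth-first traversal with a hop limit over a small in-memory graph. The graph contents, node names, and the `traverse` function are all hypothetical; a production system would query a graph database rather than a Python dict, and this is not a description of any specific product's implementation.

```python
from collections import deque

# Hypothetical knowledge graph: node -> list of (relation, neighbor).
GRAPH = {
    "symptom:chest_pain": [
        ("has_reading", "vitals:blood_pressure"),
        ("has_history", "history:cardiac"),
        ("has_medication", "meds:active"),
    ],
    "vitals:blood_pressure": [],
    "history:cardiac": [("treated_with", "meds:active")],
    "meds:active": [],
}

def traverse(graph, start, max_hops):
    """Breadth-first traversal bounded by max_hops; no model call anywhere."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        order.append(node)
        if depth == max_hops:
            continue  # hop limit reached: do not expand this node further
        for _relation, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return order

print(traverse(GRAPH, "symptom:chest_pain", max_hops=2))
# ['symptom:chest_pain', 'vitals:blood_pressure', 'history:cardiac', 'meds:active']
```

The work is bounded by the hop limit and the number of edges reachable within it, and the same input always yields the same output in the same order.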

Before / After // One Query, Two Architectures

Count the model calls

Take a multi-hop clinical question: a patient reports chest pain, and the system needs the linked blood-pressure readings, prior cardiac history, and active medications. Watch where the time goes.

In the retrieval path	Standard RAG (agentic)	Deterministic retrieval
Model calls to retrieve	Several (embed, re-rank, re-query per hop)	None
Retrieval mechanism	Model calls + vector similarity search	One graph traversal
Time per retrieval step	Seconds to tens of seconds, variable	Milliseconds, bounded
Multi-hop	Sequential loop, latency compounds	Single pass
Tail latency (p99)	Heavy, unpredictable	Predictable
Model calls remaining	1 (final generation)	1 (final phrasing only)

Both architectures end with one model call to phrase the answer. The difference is everything that happens before it: a stack of variable, multi-second model calls, or a single bounded traversal.

Why is predictability the real win?

Lower average latency matters, but a bounded tail matters more. A deterministic traversal returns in a predictable time on every run, so you can promise a response-time budget. A model in the path makes the p99 spiky and forces you to design around the worst case you can't control.
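The gap between median and tail can be sketched with a small simulation. Both distributions below are assumptions chosen only for illustration (a log-normal for model calls, a narrow uniform band for traversals); they are not measurements from any deployment.

```python
import random

def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[rank - 1]

rng = random.Random(42)  # fixed seed for repeatable output
N = 10_000

# Assumed latency distributions, in seconds:
model_samples = [rng.lognormvariate(1.0, 0.8) for _ in range(N)]   # heavy right tail
traversal_samples = [rng.uniform(0.005, 0.020) for _ in range(N)]  # bounded at 20 ms

for name, samples in (("model call", model_samples), ("traversal", traversal_samples)):
    p50, p99 = percentile(samples, 50), percentile(samples, 99)
    print(f"{name}: p50 {p50:.3f}s, p99 {p99:.3f}s, p99/p50 {p99 / p50:.1f}x")
```

The useful number is the ratio of p99 to p50: a bounded distribution keeps it small, while a heavy-tailed one pushes the budget you must promise far above the typical case.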

Fast enough to run on every output?

When retrieval is fast and predictable, checking can run inline on every output instead of offline on a sample. That changes what's possible operationally.

~3% → ~90–95%

Consult-review coverage, one telehealth clinical-QA deployment

In that deployment, manual review could reach only a small fraction of the thousand-plus consults happening each day. Running deterministic checks inline lifted coverage to nearly all of them. Retrieval no longer waiting on model calls was a major reason this was feasible at all, though inline verification also required its own cost and engineering work.
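The shape of an inline check can be sketched as follows. Everything here is hypothetical: the `CONFLICTS` rule table stands in for a graph lookup, the placeholder names `drug_a` and `condition_x` carry no clinical meaning, and the 50 ms budget is an arbitrary assumption.

```python
import time

# Hypothetical rule table standing in for a graph lookup.
CONFLICTS = {("drug_a", "condition_x")}

def check_output(output: dict) -> bool:
    """Deterministic check: pass unless the output pairs a drug with a conflicting condition."""
    return (output["drug"], output["condition"]) not in CONFLICTS

def review_inline(outputs, budget_s):
    """Check every output before it ships and record whether the check met the time budget."""
    results = []
    for output in outputs:
        start = time.perf_counter()
        passed = check_output(output)
        elapsed = time.perf_counter() - start
        results.append(
            {"id": output["id"], "passed": passed, "within_budget": elapsed <= budget_s}
        )
    return results

outputs = [
    {"id": 1, "drug": "drug_a", "condition": "condition_x"},
    {"id": 2, "drug": "drug_b", "condition": "condition_x"},
]
results = review_inline(outputs, budget_s=0.05)
coverage = len(results) / len(outputs)
flagged = [r["id"] for r in results if not r["passed"]]
print(f"checked {coverage:.0%} of outputs; flagged ids: {flagged}")
```

Because the check is a lookup rather than a model call, it can sit in the request path for every output instead of being applied to a sample afterwards.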

When latency isn't the bottleneck

For batch or offline workloads where nothing is waiting on the answer, an LLM-heavy pipeline can be perfectly fine, and raw retrieval speed isn't what decides the design. Latency becomes the constraint the moment something is gated in real time: a user is waiting on a response, or an output has to be verified before it ships.

That's the same line that decides cost. Removing model calls from the path doesn't only make retrieval faster; it also makes it leaner.

Token efficiency with deterministic retrieval

FAQ

Where the latency goes, why a traversal is bounded, and what running inline unlocks.

Q1. Why is deterministic retrieval faster than RAG?

Because it retrieves from your database with a graph traversal instead of going through language-model calls. A traversal is a bounded lookup that returns in milliseconds, the way any database query does. The slow, variable part of RAG is the model calls in the path, and deterministic retrieval removes them.

Q2. How much latency does a language-model call actually add?

Anywhere from a few seconds to tens of seconds, and it is unpredictable. A single call can take 10, 20, even 30 seconds under load or with long context, and multi-hop agentic flows stack several calls in sequence. That variability is why model-in-the-path retrieval can't promise a tight response-time budget.

Q3. Where does the latency actually go?

Into the model calls that sit in the retrieval path. Vector RAG often embeds the query, runs a similarity search, re-ranks with another model pass, then loops the model again for multi-hop. Deterministic retrieval replaces all of that with one traversal of the graph, so the only model call left is the final one that phrases the answer.

Q4. Is the retrieval really free of model calls?

Yes. Language models and vectors do their work at ingestion, when documents are read and structured into the graph. Retrieval itself is a deterministic traversal with no model in the loop. That separation, models at ingestion and not at retrieval, is what makes the lookup both fast and reproducible.

Q5. What about predictability and tail latency?

A deterministic traversal returns the same result in a bounded time on every run, so the p99 is predictable rather than spiky. Model calls have heavy, variable tails. When the slow component is removed from the critical path, response time stops depending on how a probabilistic model happened to behave on that request.

Q6. Does this make real-time, inline verification possible?

It does, and that is the point. Because retrieval and checking are fast and predictable, they can run on every output inline rather than offline on a sample. In one telehealth clinical-QA deployment, that shifted review from roughly 3% of consults a human could cover to about 90–95% coverage by the system.

Q7. When is latency not the bottleneck?

For batch or offline workloads where nothing waits on the answer, raw speed matters less, and an LLM-heavy pipeline can be fine. Latency becomes decisive when something is gated in real time: a user is waiting, or an output has to be verified before it ships. That is exactly where removing model calls from the path pays off.

Take the model out of the path.

If a user is waiting, or you need to verify before you ship, retrieval can't depend on a model call that takes seconds you can't predict. A traversal is a query, in milliseconds.

