The Handover Problem

Vivek Khandelwal, Chief Business Officer, CoFounder @ CogniSwitch
Apr 24, 2026·8 Min Read·Updated May 2, 2026
Reviewed by: Dilip Ittyera — CEO & Co-Founder, CogniSwitch

Running ops at my last company, I learned one thing fast: as humans, we are terrible at honouring internal contracts. Sales reps won't update the CRM. The onboarding team won't get their deliverables. Engineering won't get the full context of bugs from QA. You get the drift. These gaps surfaced in weekly reviews. War rooms. People pointing fingers. Mostly ugly faceoffs. It was uncomfortable, but effective in one specific way: there was always a team or individual who was held accountable. That, essentially, was the contract.

The holy land of AI and the promise of no admin work

A big reason teams feel relieved (me included) is that they don't have to update the system of record anymore. Elimination of admin work is the #1 use case. Sales Ops, Healthcare Admin, front desk admin: massive opportunities and real action. Most orgs now have a range of AI tools in production chasing outcomes. Three to five AI tools. Each one is RAG plus human review, stitched together. It can feel like duct tape, but hey, it works. Saves time. This is progress. Even if in silos.

What happens when AI agents start demanding context from other AI agents?

The question is what happens when agent-first systems start demanding context from other agents and AI applications?

Classic example: an account manager prepping for a QBR needs all support tickets, contract terms, implementation deliverables, goals set during the sales handoff, pending invoices, new relevant features, and more. She outsources this to an AI assistant that pulls whatever SupportDesk AI provides, whatever CLM AI reports, and whatever Gong AI has documented. Remember: there is always a response. It doesn't matter whether it's correct, complete, or completely fabricated.

Higher stakes? Look at AI scribes. Simple job: listen, summarize, and generate the note. The physician, holding the liability, reviews and signs. A 2025 study in npj Health Systems evaluated 208 AI-generated chart summaries. Physicians flagged omissions in nearly a third of them.

The question is what happens when the physician signs anyway. The incomplete note moves into prior auth. The denial lands 14 days later, abstracted entirely from its source. By that point the connection to the AI-generated summary is invisible. In conversations with healthcare billing teams, the appeal process for a complex denied inpatient claim typically runs several weeks and thousands of dollars in administrative overhead before the underlying claim is even re-adjudicated. The AI produced a note. The system logged a clean output. Nobody sees the downstream cost.

When agents collate context from five systems and there are no audit trails, there is no way to know what was passed, what was dropped, and which system owns the error. One agent trusts another the way you'd trust a rumor. In a single AI implementation, this would feel like a bug. Across five AI systems, poor context handoff means you start bleeding all your gains.

Can a human in the loop keep pace with agent throughput?

Think about an airport security checkpoint. A human reviews every bag and the conveyor runs at that reviewer's pace, not the queue's. That's the contract. At production volume, the agent pipeline flips that relationship. Fifty outputs queued. Another agent waiting on input. The reviewer doesn't get faster — they get bypassed. At that point, nobody knows what's happening. Did the agent proceed on incomplete context? Did it fabricate? Did it miss a critical detail? No audit trail, no answer. Most enterprises today risk the conveyor setting the pace.

How should teams understand this? Daniel Kahneman called it substitution. When a hard question is too difficult to answer, the mind simply replaces it with an easier one. I've referred to this frame before because nothing else describes the failure mode as cleanly.

The hard question: was the context passed between these agents accurate, complete, and verifiable?

The easy question: did the system respond with a clean and structured output? Yes.

Why human oversight fails without citation, source and provenance

When the human-in-the-loop can't see citation, source, or provenance against the agent's output, there is no review. They can only redo it: go back to the source, re-verify manually, and do the work the agent was expected to do.

That's not human-in-the-loop. That's human-on-the-hook.

Why confident outputs are the default, not the exception

Here's the part that trips up even experienced AI teams. The confident output when context is incomplete isn't a bug. It's how the model was built. LLMs are trained to complete. Give a model an incomplete context and ask it a question. It will still answer. The training objective rewards completion and coherence, not epistemic honesty. A model that responds "I cannot verify this from the provided context" performs poorly in evaluation because it looks like a failure to generate useful output. So the model learns not to say it. What you get instead is a complete, structured, confident answer built on a gap. The gap is real. The confidence is not.

This matters for the HITL question because it means the reviewer can't use output quality as a proxy for context quality. A well-formatted response doesn't tell you whether the context was complete. It only tells you the model is good at formatting.

Context handoffs: what they are and how they actually work

A context handover contract has three components. All three are structural requirements.

Consistency: the same query, routed through the same context, produces the same answer. Every time. Not most times.

Traceability: every output carries its source. The specific document and section, not a summary or a confidence score.

Completeness: nothing critical was dropped in transit. The contraindication made it through. The pricing exception made it through. The unresolved onboarding issue made it through.

Remove any one of these and the contract fails.

What is actually being passed between agents

The contract is easy to state. The gap is in what gets built. Most agent pipelines today pass context as a flat text payload. Think of it as one agent handing another a sticky note. It has the answer. It doesn't have the document it came from. It doesn't have the version of the policy that was active when the claim was processed. It doesn't carry a flag for "this section was incomplete" or "this field was inferred, not retrieved."

A flat payload is the telephone game. The structured formatting doesn't change what gets dropped.

The right shape is different: a context object that carries the claim, the source document ID, the section reference, the retrieval timestamp, and a completeness flag. Not a confidence score. Those are probabilistic. A completeness flag is a structural assertion: the required fields are present or they are not. When that object reaches Agent C, the chain of custody is intact. When it doesn't, the gap is visible, not invisible.
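As a rough illustration of that shape, here is a minimal sketch using Python dataclasses. The class name, the field names, and the rule that "complete" means "no required field is empty" are all assumptions made for this example. They are not a CogniSwitch schema or any vendor's format.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ContextObject:
    """Hypothetical handoff payload: the claim plus its chain of custody."""
    claim: str
    source_document_id: Optional[str]
    section_ref: Optional[str]
    retrieved_at: Optional[datetime]

    def missing_fields(self) -> list:
        # A field counts as missing when it is None or an empty string.
        return [f.name for f in fields(self) if getattr(self, f.name) in (None, "")]

    @property
    def is_complete(self) -> bool:
        # Structural assertion, not a score: required fields are present or not.
        return not self.missing_fields()


# Illustrative values only.
full = ContextObject(
    claim="Billing exception applies",
    source_document_id="policy-doc",
    section_ref="4.2",
    retrieved_at=datetime(2026, 4, 21, 14, 32, tzinfo=timezone.utc),
)
sticky_note = ContextObject("Billing exception applies", None, None, None)

print(full.is_complete)              # True
print(sticky_note.missing_fields())  # the three custody fields
```

The point of the sketch is that the sticky note and the full object carry the same claim. Only the second one lets a downstream agent see that something is missing.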

Most teams don't build this because the immediate output looks fine. The agent responds. The output is structured. The downstream failure is three weeks out and three systems back.

Fig 1: The context telephone game. What Agent A retrieves: Policy document, Section 4.2; billing exception flag; Version 1.3; retrieved Tuesday 14:32. What gets passed: a flat text summary. What Agent C has: the claim.

A flat payload is not a context handoff. It's a telephone game with structured formatting.

Who exactly is held accountable when broken context handoffs between agents trigger wrong decisions?

At some point, the calls start coming in. Denial rates are up. A patient flagged something wrong in their chart. The QBR deck was missing half the context. Nobody knows why.

IT gets the ticket. IT calls the vendor. The vendor pulls the logs. The logs show the agent responded. Status 200. Output clean. No errors recorded. That's where the conversation ends, because that's all the trail there is.

No root cause analysis, no attribution trail. Nobody can explain which handover dropped what, when, or why. Everything looks clean.

This is the problem IT leaders are about to inherit. It won't come from negligence or bad vendors. It will come from architectural limitations that constrain context handoff between agents. Every handoff needs consistency, traceability, and completeness baked in before the first agent talks to the second. Before the denials spike and the tickets pile up.

Before.

Why it hasn't been fixed yet

Everyone I talk to knows this is a problem. The reason it persists is structural: there's no agent-to-agent context API. Not a missing one. Not an emerging one. It just doesn't exist yet.

Think about how conventional system integrations work. HubSpot talks to Twilio via a REST API. When that integration breaks, you can test it from either side. Hit the endpoint, see the request, see the response, isolate the failure. The contract is explicit, the interface is inspectable, and the debugging is tractable.

There's no equivalent for agent-to-agent context. Every vendor has their own payload format. Every integration is a private handshake between two systems that didn't design for each other. Something like the Model Context Protocol addresses how agents call tools. It doesn't define what context means, how to verify it arrived complete, or how to trace what was dropped. You still get a response. You just can't verify what was in it, what was missing, or why.

The result: when the handoff fails, troubleshooting is near impossible. There's no endpoint to test. No request-response pair to inspect. Just a Status 200 and a confident output that may or may not reflect what was actually in the context.

The four questions at the end of this piece aren't audits of vendor compliance. They're probes for whether your pipeline has even attempted the contract.

Fig 4.2: Integration Visibility Comparison (The Debugging Gap: Standard APIs vs. Agentic Handoffs). Panel 1, REST API Integration: HubSpot to Twilio over a v3 REST API. Panel 2, Agentic Context Handoff: Agent A to Agent B over a prompt. Source: CogniSwitch Engineering.

You've seen this problem before

The context handover problem isn't new. You've just been calling it something else.

Application-Centric — the legacy you know well. Every application owned its own data. CRM had the customer record. Support desk had the tickets. Finance had the invoices. Integration meant passing data between systems and hoping it arrived intact. It usually didn't. That's why your ops team spent half their time chasing context across disconnected tools. At least accountability had a face. You knew which system dropped it.

Agentic Pipeline — where most orgs are today. You swapped the applications for agents. The pipeline got smarter. The governance didn't. Context is still owned per-agent, still passed as a flat, unstructured payload, still unverifiable end-to-end. The agents are more capable than the apps they replaced. The structural problem is identical. Except now, nobody knows which system dropped it.

The architecture that solves it. Context defined once, governed centrally, served with full provenance at every handoff. Same query, same answer. Every output with its source. Nothing critical drops in transit. The handover contract isn't aspirational — it's built in.

In practice, this means four things that separate a context data product from a flat payload:

  1. Schema-enforced: every field is typed and validated. Not "the agent returned something about the contract." A structured object with specific fields, constraints, and completeness checks.
  2. Source-anchored: every claim carries a document reference and section pointer. The downstream agent can ask "where did this come from?" and get a specific, deterministic answer. Not a confidence score.
  3. Versioned: context from version 1.2 of a policy document is distinguishable from version 1.3. If the policy changed between the run that produced the note and the run that triggered the prior auth request, the system knows.
  4. Queryable: the audit question ("what context did Agent A pass to Agent B on Tuesday?") has a deterministic answer, not a log search.
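To make the fourth property concrete, here is one possible sketch of a handoff ledger, using an in-memory SQLite database from the Python standard library. The table layout, column names, and function names are hypothetical, and a real deployment would need durable storage. The `NOT NULL` constraints also gesture at the schema-enforced property: a handoff without a version is rejected at write time.

```python
import sqlite3


def open_ledger() -> sqlite3.Connection:
    """Create an in-memory ledger. The schema is an illustrative assumption."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE handoffs (
               sender       TEXT NOT NULL,
               receiver     TEXT NOT NULL,
               handoff_date TEXT NOT NULL,  -- ISO date string
               document_id  TEXT NOT NULL,
               version      TEXT NOT NULL,
               section      TEXT NOT NULL
           )"""
    )
    return conn


def record_handoff(conn, sender, receiver, handoff_date, document_id, version, section):
    # Raises sqlite3.IntegrityError if any required field is None.
    conn.execute(
        "INSERT INTO handoffs VALUES (?, ?, ?, ?, ?, ?)",
        (sender, receiver, handoff_date, document_id, version, section),
    )


def what_was_passed(conn, sender, receiver, handoff_date):
    """Answer 'what context did A pass to B on this date?' with a query."""
    rows = conn.execute(
        "SELECT document_id, version, section FROM handoffs "
        "WHERE sender = ? AND receiver = ? AND handoff_date = ? "
        "ORDER BY document_id, section",
        (sender, receiver, handoff_date),
    )
    return rows.fetchall()


ledger = open_ledger()
record_handoff(ledger, "agent_a", "agent_b", "2026-04-21", "policy-doc", "1.3", "4.2")
print(what_was_passed(ledger, "agent_a", "agent_b", "2026-04-21"))
```

An empty result is itself an answer here: it says nothing was recorded for that pair on that date, which is different from a log search that found no matching lines.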

Most enterprise AI teams today are building the equivalent of a phone call between agents: no record, no accountability. A context data product is the contract in writing.

Most of the teams I talk to are in stage two and don't know it yet. The systems are working. The QBR decks are landing. The notes are getting generated. Everything looks fine — until it doesn't, and the trail is three systems back and six weeks ago.

Fig 4.3: Pipeline Investment Audit

Where Teams Over-Invest

Most investment sits at the bottom, where governance value is lowest.

High Investment / Low Governance Value

  1. Agent Orchestration (coordinates agents, not context). Investment: High. Governance value: Low.
  2. RAG Infrastructure (retrieves documents, not provenance). Investment: High. Governance value: Low.
  3. LLM / Model Selection (generates output, not accountability). Investment: High. Governance value: Low.
  4. Evals / LLM Judge (probabilistic ≠ auditable). Investment: Medium. Governance value: Low.

The Contract Gap: Low Investment / High Governance Value

  1. Context Governance (not built). Investment: Near zero. Governance value: High.
  2. Provenance & Attribution (not built). Investment: Near zero. Governance value: High.

Most teams build the pipeline before the contract. The contract has to come first.

What a contract for consistency, traceability, and completeness looks like

The consistency/traceability/completeness contract has a structural requirement that most vendor architectures skip: the context layer has to exist independently of the agents consuming it.

A single governed layer, queried by any agent, with source provenance attached at the point of ingest. Note: this cannot and should not be built downstream.

CogniSwitch builds this layer. It sits between your documents and your agents. Every query returns not just an answer but a citation trail: source document, version, and section. The same query produces the same answer because the context is governed, not generated on demand. When an agent downstream asks "was this complete?" the answer is structural, not probabilistic.

If you're in the middle of an agent deployment and the four questions below don't have clean answers from your architecture team, that's the gap we work on. See how CogniSwitch handles context governance.

The questions worth asking your team

Whether it's a vendor system or something you built internally, the same four questions apply.

How is context defined? Not in theory — in the actual payload that passes between agents. Is it a document? A summary? A structured object with source metadata attached? If the answer varies by agent or by team, that's already a gap.

How do source context and citations travel through a handoff? If Agent A retrieved something from a policy document, does that attribution reach Agent C, or does it get summarized away at step one? Attribution that disappears at the first handoff doesn't exist by the third.

How do you ensure completeness? What structural check verifies that the contraindication, the pricing exception, the unresolved issue, actually made it through — not just that an output was returned? Absence of an error is not evidence of completeness. Run the Knowledge Audit.

How do you ensure consistency? If you run the same query through the same context tomorrow, do you get the same answer? If not — does anyone know? A system that works on the runs you're watching and produces a different result the next day hasn't solved the handover problem. It's just gotten lucky.
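One way to answer "does anyone know?" is to fingerprint each (query, context) pair and compare the answer against earlier runs of the same pair. The sketch below is an assumption-laden illustration: `fingerprint` and `ConsistencyLog` are hypothetical names, the context must be JSON-serializable, and answers are compared by exact equality, which a real system might relax.

```python
import hashlib
import json


def fingerprint(query: str, context: dict) -> str:
    """Stable hash of the inputs. Key order in the dict does not matter."""
    canonical = json.dumps({"query": query, "context": context}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ConsistencyLog:
    """Remembers the first answer seen for each (query, context) fingerprint."""

    def __init__(self):
        self._first_answer = {}

    def check(self, query: str, context: dict, answer: str) -> bool:
        # Returns True when this answer matches the first one recorded
        # for identical inputs, or when the inputs are new.
        key = fingerprint(query, context)
        recorded = self._first_answer.setdefault(key, answer)
        return recorded == answer


log = ConsistencyLog()
ctx = {"document_id": "policy-doc", "version": "1.3", "section": "4.2"}
print(log.check("Is the exception active?", ctx, "Yes"))  # first run: True
print(log.check("Is the exception active?", ctx, "No"))   # drift: False
```

A `False` here does not say which answer is right. It only makes the inconsistency visible, which is the part that usually goes unnoticed.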

These aren't abstract architecture questions. They're the difference between a pipeline that compounds its accuracy and one that quietly bleeds it while your lagging metrics catch up.

If you read this far — let me know if it landed. And if someone on your team is making AI architecture decisions right now, forward this before the next deployment conversation.

Best, Vivek K

Frequently Asked Questions

We run every agent output through a guardrails layer with semantic validation before it reaches a downstream agent. If you're worried about fabricated context passing silently, isn't that the fix?

Guardrails check the output — they don't verify the context that produced it. Semantic validation tells you the response is coherent. It doesn't tell you whether the billing exception from section 4.2 of the policy document was present before the model ran. A guardrail catches "this output looks wrong." A completeness check catches "the required input was never present." You can't detect absence at the output layer — the model generated something complete-looking from whatever it had. That's the whole problem.

We address the confident-output problem through system prompt engineering — we explicitly instruct the model to surface uncertainty when context is thin. Why do you frame this as architectural rather than a prompting problem?

Prompting a model to hedge works when the model can detect its own uncertainty. The problem is that a model can't detect a context gap that happened upstream. If the billing exception was dropped three agents ago, the payload Agent C receives doesn't look incomplete — it looks like complete context for the query it was given. A model fine-tuned to surface uncertainty will flag what it can see in its context window. It won't flag what was never in the payload. Prompting instruments the model. It doesn't instrument the handoff.

We use LLM-as-judge evals on every agent handoff. The judge scores output quality, coherence, and factual grounding. What does a context completeness check catch that an LLM judge doesn't?

An LLM judge scores quality — "how good is this output?" A completeness check is structural — "were the required fields present when this ran?" The judge scores 85/100 on a response that never had the pricing exception in its context. That score is technically valid. The gap is invisible to it. A probabilistic quality signal cannot detect structural absence. The eval and the completeness check examine the same output but answer different questions.

Defining "required fields" upfront assumes you know what complete context looks like before the query runs. Our agents handle open-domain questions where relevance is query-dependent. How does a completeness flag work in that scenario?

The completeness flag isn't about predicting what might be relevant — it's about asserting what was retrieved. When Agent A calls a retrieval system, the system knows what it pulled: document ID, section, version, whether the pull was exhaustive or partial. The completeness object carries those facts. Agent B can verify that. What it can't verify is what Agent A never retrieved — which is the honest gap. The answer isn't "pre-specify every possible field." It's "make what was retrieved auditable, so what's missing is visible rather than silently absent."

We use typed JSON schemas with Pydantic validation at each agent boundary. The payload is structured and contract-enforced at the API level. Are you saying that's still insufficient?

Schema validation checks that the fields that arrived are correctly typed. It doesn't check that the fields that should have arrived are present. A Pydantic model expecting context: str validates successfully whether that string is a complete policy document or a three-word summary derived from one. What's missing from a schema-validated JSON handoff is provenance: which document, which version, which section, whether the retrieval was exhaustive. You can add those fields — but most teams don't, because the output looks clean without them, and the failure surfaces three systems downstream.
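The gap can be shown without any particular validation library. In the sketch below, `passes_type_check` is a plain-Python stand-in for a schema that only declares `context: str`, and `missing_provenance` is a hypothetical second check. The provenance key names are assumptions chosen to mirror the fields listed above.

```python
REQUIRED_PROVENANCE = ("document_id", "version", "section", "retrieval_exhaustive")


def passes_type_check(payload: dict) -> bool:
    # Stand-in for a schema whose only rule is `context: str`.
    return isinstance(payload.get("context"), str)


def missing_provenance(payload: dict) -> list:
    # `is None` on purpose: retrieval_exhaustive=False is a valid, honest value.
    return [key for key in REQUIRED_PROVENANCE if payload.get(key) is None]


summary_only = {"context": "Exception may apply"}
with_provenance = {
    "context": "Exception may apply",
    "document_id": "policy-doc",
    "version": "1.3",
    "section": "4.2",
    "retrieval_exhaustive": False,
}

# Both payloads satisfy the type rule. Only one can be traced to a source.
print(passes_type_check(summary_only), missing_provenance(summary_only))
print(passes_type_check(with_provenance), missing_provenance(with_provenance))
```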

We have distributed tracing across our entire agent pipeline — every LLM call, tool invocation, and inter-agent message is logged with full input/output payloads. When a handoff fails, we can replay it. What does that not cover?

Tracing tells you what happened. It doesn't tell you what should have happened. You can replay the handoff and see Agent A sent three fields, Agent C received three fields, transport was clean. What you can't see is that the source document had five relevant sections, and two were dropped during Agent A's summarization step before the payload was assembled. The gap isn't in the transport layer — it's upstream, at the point where the context object was built. Full distributed tracing traces the handoff. It doesn't audit the assembly.
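A sketch of what auditing the assembly could look like, assuming the retrieval step records which sections it judged relevant before any summarization happens. The function name and the section identifiers are made up for illustration.

```python
def audit_assembly(relevant_sections, payload_sections) -> dict:
    """Compare what retrieval found relevant with what the payload carries."""
    relevant = set(relevant_sections)
    carried = set(payload_sections)
    return {
        "dropped": sorted(relevant - carried),     # relevant but never sent
        "unexpected": sorted(carried - relevant),  # sent but not in the record
        "complete": relevant <= carried,           # every relevant section sent
    }


# Five relevant sections at retrieval time, three in the assembled payload.
report = audit_assembly(
    relevant_sections=["4.1", "4.2", "4.3", "4.4", "4.5"],
    payload_sections=["4.1", "4.2", "4.5"],
)
print(report)
```

Transport tracing would show three fields sent and three received. This check needs the earlier record of five to exist, which is why it has to be captured where the context object is built.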

MCP is live and being adopted at Anthropic, Google, and OpenAI right now. Our vendor stack is already moving toward it. What exactly does MCP not cover that your argument requires?

MCP defines how agents call tools — the request interface, the protocol layer. It's the equivalent of HTTP: it specifies how to make the call. It doesn't define what context means inside that call, whether it arrived complete, or what was dropped between Agent A and Agent C. MCP gives you a Status 200 and a response. It doesn't give you a completeness assertion, a source citation, or a version tag on the document the response was derived from. The debugging gap — no endpoint to test from either side, no request-response pair to inspect at the context layer — exists inside MCP-compliant architectures.

Document versioning is already solved by our document management system. Why does the context layer need to re-solve something the source system already tracks?

The DMS knows which version of a document exists. The context layer needs to know which version was active when a specific agent ran a specific query — those are different questions. If Policy Doc v1.3 replaced v1.2 between the run that generated the note and the run that triggered the prior auth request, both versions exist in the DMS. But unless the context payload carried "retrieved from v1.2 at 14:32 Tuesday," Agent C is implicitly assuming it has current-policy context. The DMS doesn't push retrieval event metadata into the handoff. The context layer has to capture it at the point of retrieval and carry it forward.
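A minimal sketch of carrying that retrieval event forward, assuming the payload holds a small stamp and the receiving agent can look up the current version from the DMS. `RetrievalStamp`, `is_stale`, and the `current_versions` mapping are hypothetical. The mapping stands in for whatever lookup the real DMS offers.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RetrievalStamp:
    """What was retrieved and when. Travels with the payload."""
    document_id: str
    version: str
    retrieved_at: datetime


def is_stale(stamp: RetrievalStamp, current_versions: dict) -> bool:
    # True when the DMS now reports a different version than the one
    # retrieved. An unknown document is also treated as stale.
    return current_versions.get(stamp.document_id) != stamp.version


stamp = RetrievalStamp(
    document_id="policy-doc",
    version="1.2",
    retrieved_at=datetime(2026, 4, 21, 14, 32, tzinfo=timezone.utc),
)
print(is_stale(stamp, {"policy-doc": "1.3"}))  # policy changed since retrieval
```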

Every vendor pitching centralized context governance eventually becomes a single point of failure and a procurement chokepoint. We have 12 data sources and 4 agent frameworks. What does the migration path actually look like?

The governed layer isn't a new database you migrate 12 sources into — it's a retrieval interface that sits in front of your existing sources. Your data sources stay where they are. The layer standardizes what comes out of them: every retrieval returns a context object with source ID, section, version, and completeness signal, regardless of which underlying system it came from. Migration is incremental — start with the highest-stakes handoff in your pipeline. The chokepoint risk is real; any centralized architecture has it. The honest comparison is the current architecture, where failure is distributed, ungoverned, and every trail ends at Status 200.
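The "retrieval interface in front of existing sources" idea can be sketched as a thin wrapper. Everything below is illustrative: `govern` is a hypothetical helper, the two `fetch_*` functions are stubs standing in for real source connectors, and the completeness rule (text, section, and version all present) is an assumption.

```python
from typing import Callable


def govern(source_id: str, fetch: Callable[[str], dict]) -> Callable[[str], dict]:
    """Wrap an existing source so every retrieval returns the same shape."""
    def retrieve(query: str) -> dict:
        raw = fetch(query)
        return {
            "source_id": source_id,
            "text": raw.get("text", ""),
            "section": raw.get("section"),
            "version": raw.get("version"),
            # Completeness signal: all three custody fields came back non-empty.
            "complete": all(raw.get(k) for k in ("text", "section", "version")),
        }
    return retrieve


# Stub connectors. The underlying sources stay where they are.
def fetch_policy(query: str) -> dict:
    return {"text": "Billing exception applies", "section": "4.2", "version": "1.3"}


def fetch_tickets(query: str) -> dict:
    return {"text": "Onboarding issue unresolved"}  # no section or version


policy = govern("policy-dms", fetch_policy)
tickets = govern("support-desk", fetch_tickets)
print(policy("billing exception")["complete"])   # True
print(tickets("open issues")["complete"])        # False
```

Wrapping one source at a time matches the incremental path described above: the highest-stakes handoff gets a governed interface first, and the rest keep working unchanged.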


We run monthly accuracy audits on our three highest-volume pipelines. If context degradation were real at our scale, we'd see it in those numbers — and we haven't. Is this a production-at-scale problem, or a theoretical risk?

Monthly accuracy audits measure output correctness on the queries you're watching. They don't measure context completeness on the queries you aren't. A pipeline can produce correct outputs on 95% of monitored cases and silently degrade on the 5% involving multi-hop context — the pricing exceptions, the contraindications, the cases where stakes are highest. Those don't surface in aggregate accuracy metrics. They surface in a specific failure that takes weeks to trace. By the time denial rates spike or a patient flags something wrong, the trail is three systems back and six weeks ago.

References

  1. Evaluation of electronic health record-integrated artificial intelligence chart review. npj Health Systems / Nature Portfolio.
  2. How AI is leading to more prior authorization denials. American Medical Association.
  3. Thinking, Fast and Slow. Daniel Kahneman.
  4. Artificial Intelligence Risk Management Framework (AI RMF 1.0). National Institute of Standards and Technology (NIST).
  5. Physician-Reported Safety Outcomes of AI-Generated Hospital Course Summaries. François Grolleau et al., JAMA Network Open.

About the Author

Vivek Khandelwal

Chief Business Officer, CoFounder @ CogniSwitch·M.Sc. Chemistry, IIT Bombay

Vivek Khandelwal is the Chief Business Officer at CogniSwitch, where he leads go-to-market strategy, enterprise partnerships, and the company's thought leadership programs. He is the author of Signal, CogniSwitch's weekly newsletter that translates the complex machinery of enterprise AI infrastructure into clear, actionable intelligence for practitioners and executives in regulated industries.