
The Handover Problem

Running ops at my last company, I learned one thing fast: as humans, we are terrible at honouring internal contracts. Sales reps won't update the CRM. The onboarding team won't finish their deliverables. Engineering won't get full context on bugs from QA. You get the drift. These gaps surfaced in weekly reviews. War rooms. People pointing fingers. Mostly ugly faceoffs. It was uncomfortable — but effective in one specific way: there was always a team or individual who was held accountable. That, essentially, was the contract.

The holy land of AI and the promise of no admin work

A big reason teams feel relieved (me included) is that they don't have to update the system of record anymore. Eliminating admin work is the #1 use case. Sales ops, healthcare admin, front-desk admin — massive opportunities and real action. Most orgs now have a range of AI tools in production chasing outcomes. Three to five AI tools. Each one RAG plus human review, stitched together. It can feel like duct tape — but hey, it works. Saves time. This is progress. Even if it's in silos.

Everything looks good, right?

The question is: what happens when agent-first systems start demanding context from other agents and AI applications?

Classic example: an account manager prepping for a QBR needs all support tickets, contract terms, implementation deliverables, goals set during the sales handoff, pending invoices, new relevant features, and more. She outsources this to an AI assistant that will pull whatever SupportDesk AI provides, CLM AI reports, and Gong AI has documented. Remember — she is still the one held accountable for what lands in that deck.

Higher stakes? Look at AI scribes. Simple job: listen, summarize, and generate the note. The physician, holding the liability, reviews and signs. A 2025 study in npj Health Systems evaluated 208 AI-generated chart summaries. Physicians flagged omissions in nearly a third of them.

The question is what happens when the physician signs anyway. The incomplete note moves into prior auth. The denial lands 14 days later, abstracted entirely from its source. By that point the connection to the AI-generated summary is invisible. In conversations with healthcare billing teams, the appeal process for a complex denied inpatient claim typically runs several weeks and thousands of dollars in administrative overhead before the underlying claim is even re-adjudicated. The AI produced a note. The system logged a clean output. Nobody sees the downstream cost.

When agents collate context from five systems and there are no audit trails, there is no way to know what was passed, what was dropped, and which system owns the error. One agent trusts another the way you'd trust a rumor. In a single AI implementation, this would feel like a bug. Poor context handoff across five AI systems means you start bleeding all your gains.

Wait, what about the human in the loop?

Think about an airport security checkpoint. A human reviews every bag and the conveyor runs at that reviewer's pace, not the queue's. That's the contract. At production volume, the agent pipeline flips that relationship. Fifty outputs queued. Another agent waiting on input. The reviewer doesn't get faster — they get bypassed. At that point, nobody knows what's happening. Did the agent proceed on incomplete context? Did it fabricate? Did it miss a critical detail? No audit trail, no answer. Most enterprises today risk the conveyor setting the pace.

How should teams understand this? Daniel Kahneman called it substitution. When a hard question is too difficult to answer, the mind simply replaces it with an easier one. I've referred to this frame before because nothing else describes the failure mode as cleanly.

The hard question: was the context passed between these agents accurate, complete, and verifiable?

The easy question: did the system respond with a clean and structured output? Yes.

Humans-on-the-hook

When the human in the loop can't see the citation, source, or provenance behind the agent's output, there is no review. They can only re-do it: go back to the source, re-verify manually, do the work the agent was expected to do.

That's not human-in-the-loop. That's human-on-the-hook.

Why confident outputs are the default, not the exception

Here's the part that trips up even experienced AI teams. The confident output when context is incomplete isn't a bug. It's how the model was built. LLMs are trained to complete. Give a model incomplete context and ask it a question. It will still answer. Confidently. A model that responds "I cannot verify this from the provided context" performs poorly in evaluation because it looks like a failure to generate useful output. So the model learns not to say it. What you get instead is a complete, structured, confident answer built on a gap. The gap is real. The confidence is not.

This matters for the HITL question because it means output quality is not evidence of context quality. A well-formatted response doesn't tell you whether the context was complete. It only tells you the model is good at formatting.

Context handoffs: what they are and how they actually work

A context handover contract has three components. All three are structural requirements.

Consistency: the same query, routed through the same context, produces the same answer. Every time. Not most times.

Traceability: every output carries its source. The specific document and section, not a summary or a confidence score.

Completeness: nothing critical was dropped in transit. The contraindication made it through. The pricing exception made it through. The unresolved onboarding issue made it through.

Remove any one of these and the contract fails.
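To make the contract concrete, here is a minimal sketch in Python of what checking a single handoff against the traceability and completeness requirements could look like. Everything here is hypothetical: the payload keys are illustrative, not any vendor's schema. Consistency can't be verified from a single handoff at all, since it requires comparing repeated runs (a sketch of that check closes this piece).

```python
# Hypothetical: what a handoff validator might check, given a payload dict.
# The key names are illustrative, not a standard.

REQUIRED_KEYS = {"answer", "source_doc_id", "section_ref", "fields_present"}

def contract_violations(payload: dict, required_fields: set[str]) -> list[str]:
    """Structural (not probabilistic) checks on a single handoff."""
    problems = []
    # Traceability: the output must carry its specific source.
    missing_keys = REQUIRED_KEYS - payload.keys()
    if missing_keys:
        problems.append(f"traceability: payload missing {sorted(missing_keys)}")
    # Completeness: nothing critical was dropped in transit.
    dropped = required_fields - set(payload.get("fields_present", []))
    if dropped:
        problems.append(f"completeness: dropped {sorted(dropped)}")
    return problems

# A sticky-note payload fails both checks:
print(contract_violations(
    {"answer": "Exception approved."},
    required_fields={"billing_exception_flag"},
))
```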

What is actually being passed between agents

The contract is easy to state. The gap is in what gets built. What most pipelines pass between agents today is a flat text payload. Think of it as one agent handing another a sticky note. It has the answer. It doesn't have the document it came from. It doesn't have the version of the policy that was active when the claim was processed. It doesn't carry a flag for "this section was incomplete" or "this field was inferred, not retrieved."

A flat payload is the telephone game. The structured formatting doesn't change what gets dropped.

The right shape is different: a context object that carries the claim, the source document ID, the section reference, the retrieval timestamp, and a completeness flag. Not a confidence score. Those are probabilistic. A completeness flag is a structural assertion: the required fields are present or they are not. When that object reaches Agent C, the chain of custody is intact. When it doesn't, the gap is visible, not invisible.
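A minimal sketch of that context object, assuming Python; the field names are illustrative, not a standard:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class ContextObject:
    claim: str                        # the answer being handed over
    source_document_id: str           # the document it came from
    section_ref: str                  # e.g. "Section 4.2"
    source_version: str               # policy version active at retrieval
    retrieved_at: datetime            # when it was retrieved
    required_fields: frozenset[str]   # what the downstream contract demands
    present_fields: frozenset[str]    # what actually made it through

    @property
    def complete(self) -> bool:
        # Structural assertion, not a confidence score:
        # the required fields are present, or they are not.
        return self.required_fields <= self.present_fields
```

When this object reaches Agent C, `complete` is a yes/no it can act on, and a missing `source_document_id` fails loudly at construction time instead of three systems downstream.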

Most teams don't build this because the immediate output looks fine. The agent responds. The output is structured. The downstream failure is three weeks out and three systems back.

Fig 1 — The context telephone game. What Agent A retrieves: the policy document (Section 4.2), a billing exception flag, version 1.3, retrieved Tuesday 14:32. What gets passed: a flat text summary. What Agent C has: the claim.

A flat payload is not a context handoff. It's a telephone game with structured formatting.

Who will inherit this problem?

At some point, the calls start coming in. Denial rates are up. A patient flagged something wrong in their chart. The QBR deck was missing half the context. Nobody knows why.

IT gets the ticket. IT calls the vendor. The vendor pulls the logs. The logs show the agent responded. The ticket gets closed, because that's all the trail there is.

No root cause analysis, no attribution trail. Nobody can explain which handover dropped what, when, or why. Everything looks clean.

This is the problem IT leaders are about to inherit. It won't come from negligence or bad vendors. It will come from architectural limitations that constrain context handoff between agents. Every handoff needs consistency, traceability, and completeness baked in before the first agent talks to the second. Before the denials spike and the tickets pile up.

Before.

Why it hasn't been fixed yet

Everyone I talk to knows this is a problem. The reason it persists is structural: there is no standard, inspectable interface for agent-to-agent context.

Think about how conventional system integrations work. HubSpot talks to Twilio via a REST API. When that integration breaks, you can test it from either side. Hit the endpoint, see the request, see the response, isolate the failure. The contract is explicit, the interface is inspectable, and the debugging is tractable.

There's no equivalent for agent-to-agent context. Every vendor has their own payload format. Every integration is a private handshake between two systems that didn't design for each other. Something like the Model Context Protocol addresses how agents call tools. It doesn't define what context means, how to verify it arrived complete, or how to trace what was dropped. You still get a response. You just can't verify what was in it, what was missing, or why.

The result: when the handoff fails, troubleshooting is near impossible. There's no endpoint to test. No request-response pair to inspect. Just a Status 200 and a confident output that may or may not reflect what was actually in the context.
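The asymmetry is easy to show. With a REST integration you hold a request and a response you can log, diff, and replay. With a flat agent handoff you hold a string. A toy illustration in Python (all payloads hypothetical):

```python
# REST integration: the contract is inspectable from either side.
request = {"method": "GET", "url": "/v3/contacts/123"}             # hypothetical
response = {"status": 200, "body": {"email": "jane@example.com"}}  # hypothetical
assert response["status"] == 200
assert "email" in response["body"]   # you can test the contract directly

# Agentic handoff: all you have is the output.
handoff = "Customer qualifies for the billing exception per policy."
# Which document? Which version? Which section? What was omitted?
# There is no request to replay and no schema to diff against.
```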

The four questions at the end of this piece aren't audits of vendor compliance. They're probes for whether your pipeline has even attempted the contract.

Fig 4.2 — The Debugging Gap: standard APIs vs. agentic handoffs. HubSpot talking to Twilio over a v3 REST API is inspectable from either side; Agent A passing a prompt to Agent B is not. (Source: CogniSwitch Engineering)

You've seen this problem before

The context handover problem isn't new. You've just been calling it something else.

Application-Centric — the legacy you know well. Every application owned its own data. CRM had the customer record. Support desk had the tickets. Finance had the invoices. Integration meant passing data between systems and hoping it arrived intact. It usually didn't. That's why your ops team spent half their time chasing context across disconnected tools. At least accountability had a face. You knew which system dropped it.

Agentic Pipeline — where most orgs are today. You swapped the applications for agents. The pipeline got smarter. The governance didn't. Context is still owned per-agent, still passed as a flat, unstructured payload, still unverifiable end-to-end. The agents are more capable than the apps they replaced. The structural problem is identical. Except now, nobody knows which system dropped it.

Context-Centric — the architecture that solves it. Context defined once, governed centrally, served with full provenance at every handoff. Same query, same answer. Every output with its source. Nothing critical drops in transit. The handover contract isn't aspirational — it's built in.

In practice, this means four things that separate a context data product from a flat payload (a code sketch follows the list):

  1. Schema-enforced: every field is typed and validated. Not "the agent returned something about the contract." A structured object with specific fields, constraints, and completeness checks.
  2. Source-anchored: every claim carries a document reference and section pointer. The downstream agent can ask "where did this come from?" and get a specific, deterministic answer. Not a confidence score.
  3. Versioned: the object records which version of the source was active when it was retrieved. If the policy changed between the run that produced the note and the run that triggered the prior auth request, the system knows.
  4. Queryable: the audit question ("what context did Agent A pass to Agent B on Tuesday?") has a deterministic answer, not a log search.
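A minimal sketch of the fourth property, using an in-memory Python store. A real system would persist this, and every name here is illustrative:

```python
from datetime import datetime, timezone

# Hypothetical audit store: every handoff is recorded with who passed
# what to whom, and when. The audit question becomes a lookup, not a log grep.
audit_log: list[dict] = []

def record_handoff(sender: str, receiver: str, context_object: dict) -> None:
    audit_log.append({
        "from": sender,
        "to": receiver,
        "at": datetime.now(timezone.utc),
        "context": context_object,   # the full object, not a summary of it
    })

def what_was_passed(sender: str, receiver: str, day: str) -> list[dict]:
    """Deterministic answer to: what did A pass to B on a date (YYYY-MM-DD)?"""
    return [
        entry["context"] for entry in audit_log
        if entry["from"] == sender
        and entry["to"] == receiver
        and entry["at"].date().isoformat() == day
    ]
```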

Most enterprise AI teams today are building the equivalent of a phone call between agents: no record, no accountability. A context data product is the contract in writing.

Most of the teams I talk to are in stage two and don't know it yet. The systems are working. The QBR decks are landing. The notes are getting generated. Everything looks fine — until it doesn't, and the trail is three systems back and six weeks ago.

Fig 4.3 — Pipeline Investment Audit: Where Teams Over-Invest

| Layer | What it delivers | Investment | Governance Value |
| --- | --- | --- | --- |
| Agent Orchestration | Coordinates agents, not context | High | Low |
| RAG Infrastructure | Retrieves documents, not provenance | High | Low |
| LLM / Model Selection | Generates output, not accountability | High | Low |
| Evals / LLM Judge | Probabilistic ≠ auditable | Medium | Low |
| Context Governance | Not built (the contract gap) | Near zero | High |
| Provenance & Attribution | Not built (the contract gap) | Near zero | High |

Most investment sits where governance value is lowest.

Most teams build the pipeline before the contract. The contract has to come first.

What solving this looks like

The consistency/traceability/completeness contract has a structural requirement that most vendor architectures skip: the context layer has to exist independently of the agents consuming it.

Not inside each agent. Not rebuilt at every handoff. A single governed layer, queried by any agent, with source provenance attached at the point of ingest. Not reconstructed downstream.

CogniSwitch builds this layer. It sits between your documents and your agents. Every query returns not just an answer but a citation trail: source document, version, and section. The same query produces the same answer because the context is governed, not generated on demand. When an agent downstream asks "was this complete?" the answer is structural, not probabilistic.

If you're in the middle of an agent deployment and the four questions below don't have clean answers from your architecture team, that's the gap we work on. See how CogniSwitch handles context governance.


The questions worth asking your team

Whether it's a vendor system or something you built internally, the same four questions apply.

How is context defined? Not in theory — in the actual payload that passes between agents. Is it a document? A summary? A structured object with source metadata attached? If the answer varies by agent or by team, that's already a gap.

How do source context and citation travel through a handoff? If Agent A retrieved something from a policy document, does that attribution reach Agent C — or does it get summarized away at step one? Attribution that disappears at the first handoff doesn't exist by the third.

How do you ensure completeness? What structural check verifies that the contraindication, the pricing exception, the unresolved issue, actually made it through — not just that an output was returned? Absence of an error is not evidence of completeness. Run the Knowledge Audit.

How do you ensure consistency? If you run the same query through the same context tomorrow, do you get the same answer? If not — does anyone know? A system that works on the runs you're watching and produces a different result the next day hasn't solved the handover problem. It's just gotten lucky.
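One way to make that question answerable is a consistency regression: pin the context, repeat the query, and diff the results. A sketch in Python, assuming you supply a `run_query(query, context_id)` function for your own pipeline (that name is hypothetical):

```python
import hashlib
import json

def fingerprint(answer: dict) -> str:
    """Stable hash of an answer so runs can be compared across days."""
    canonical = json.dumps(answer, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def assert_consistent(run_query, query: str, context_id: str, runs: int = 3) -> None:
    """Same query, same pinned context: the answer must not drift."""
    fingerprints = {fingerprint(run_query(query, context_id)) for _ in range(runs)}
    if len(fingerprints) > 1:
        raise AssertionError(
            f"{len(fingerprints)} distinct answers for one query/context pair"
        )
```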

These aren't abstract architecture questions. They're the difference between human-in-the-loop and human-on-the-hook.

If you read this far — let me know if it landed. And if someone on your team is making AI architecture decisions right now, forward this before the next deployment conversation.

Best, Vivek K

About the Author
Vivek Khandelwal


Chief Business Officer, Co-Founder @ CogniSwitch · 2X Entrepreneur, IIT Bombay

2X founder who has built multiple companies in the last 15 years. He bootstrapped iZooto to multi-millons in revenue. He graduated from IIT Bombay and has deep experience across product marketing, and GTM strategy. Mentors early-stage startups at Upekkha, and SaaSBoomi's SGx program. At CogniSwitch, he leads all things Marketing, Business Development and partnerships.