Healthcare AI // For Compliance and Clinical Leaders
Braintrust and CogniSwitch: see what your AI did, and prove it was allowed
CogniSwitch and Braintrust are complementary, not competing. Braintrust evaluates and observes how your agents perform across releases and production traffic. CogniSwitch proves the one decision under question was allowed.
CogniSwitch is not a Braintrust alternative: the two operate at different layers of the stack. Braintrust evaluates how your agents perform and scores their output in production. CogniSwitch proves a specific decision followed the rules. In healthcare you need both, not one or the other.
Why this matters
Why run Braintrust and CogniSwitch together?
The problem it solves
Most health systems have AI agents stuck in pilots. The blocker is rarely the model. It's that no one can prove, to a regulator or an internal audit, that a given decision was correct and allowed.
The outcome
You can move agents out of pilot and into production, because you can now stand behind every decision they make.
When an auditor or regulator asks why the agent did what it did, you can pull the exact rule and the record, and prove it. The decision is defensible.
The stack
What does a healthcare AI stack need to be complete?
A complete regulated AI stack has four layers, and CogniSwitch is the trust layer between the agents and the clinical data.
Evals & Observability
Evaluate the agents, score their output, catch regressions
Braintrust, Arize, Langfuse
Agents & AI Applications
The agents themselves: prior-auth, clinical notes, intake
LangChain, CrewAI, OpenAI, Anthropic
Trust Layer
Check each decision against policy at runtime, and give an auditor a reason they can re-derive.
CogniSwitch
Clinical Data
The source of truth the agent has to match
EHR, payer policies, clinical SOPs
Braintrust answers
"Are the agents performing well?"
CogniSwitch answers
"Can we prove this decision was allowed?"
CogniSwitch sits between your agents and the clinical data, alongside your evaluation platform, and completes the stack.
What does Braintrust do on its own?
Braintrust is a strong, mature AI observability and evaluation platform. It scores how good your agents' output is, runs experiments to catch regressions before they ship, and traces what happens in production. The engineering is solid and the signal is real.
It ships a browser playground for rapid prompt iteration, immutable experiments you can wire into CI/CD to catch regressions, logging and tracing through its purpose-built observability store, datasets built from production logs, and online scoring that evaluates live traffic asynchronously. Scorers can be code-based, LLM-as-a-judge, or human, and its open-source autoevals library is available under the MIT license. For measuring quality and catching regressions across releases, Braintrust does its job well.
What evaluation and observability can't do on their own
When evaluation and observability platforms score live production traffic, they lean on an LLM-as-a-judge: one language model grading another model's work. By Braintrust's own documentation, its online scoring relies on LLM-as-a-judge scorers because production requests have no ground truth. That is a useful instrument, but it is not built to prove that one specific decision was correct and allowed. Four things follow from putting a model in the scoring path.
A model grades a model
When one model scores another, they can share the same blind spots. The grader is no more independent than the agent it is grading.
The score can move
Run the same output twice and the score can change. A verdict that moves is not one an auditor can re-derive and trust.
Built to measure, not to enforce
Online scoring runs asynchronously with no impact on latency. It observes and measures after the fact; it does not gate a non-compliant decision inline.
Quality scores, not provenance
A scorer produces a numeric quality score or a label. Braintrust does not market a decision-level audit trail that names the rule a regulated decision followed.
So with only Braintrust you can evaluate and score. What you cannot do is prove that one specific decision was correct and policy-compliant, with a reason you can repeat. That takes a verification step. Better tracing and better scorers alone do not get you there.
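To make the contrast concrete, here is a minimal Python sketch. It is not Braintrust's or CogniSwitch's API: `noisy_judge` is a hypothetical stand-in for a sampled LLM-as-a-judge score, `rule_verdict` is a hypothetical deterministic check, and the rule ID and limits are invented for illustration.

```python
import random

def noisy_judge(output: str, rng: random.Random) -> float:
    """Hypothetical stand-in for a sampled LLM-as-a-judge score (not a real API)."""
    base = 0.8 if "approved" in output else 0.6
    # Sampling noise: the same output can receive a different score on each run.
    return round(base + rng.uniform(-0.1, 0.1), 3)

def rule_verdict(units_requested: int, units_allowed: int) -> tuple[bool, str]:
    """Hypothetical deterministic check: same input, same verdict, rule named."""
    rule_id = "MAX-UNITS-001"  # invented rule ID, for illustration only
    within_limit = units_requested <= units_allowed
    return within_limit, rule_id

rng = random.Random(0)
# Score the same output 20 times, and check the same decision 20 times.
scores = {noisy_judge("approved: 12 visits", rng) for _ in range(20)}
verdicts = {rule_verdict(12, 10) for _ in range(20)}

print(f"distinct judge scores: {len(scores)}")
print(f"distinct rule verdicts: {len(verdicts)}")
```

The judge returns several distinct scores for one output, while the rule check collapses to a single verdict that carries the rule it applied. That single, repeatable verdict is the property an auditor can re-derive.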
In practice // A health system in production
How do Braintrust and CogniSwitch work together in healthcare?
Consider a health system running two AI agents in production: a prior-authorization agent and a clinical documentation agent. For each one, Braintrust and CogniSwitch do different jobs in the same flow. Braintrust evaluates and scores the behavior across thousands of cases. CogniSwitch proves each individual decision. Here is how that plays out.
Prior-authorization agent
The scenario
The agent reviews a request and decides whether a procedure is approved or denied under the patient's plan.
Braintrust evaluates the behavior
Braintrust scores the agent's output against datasets, runs experiments to catch regressions before a release ships, and runs online scoring over live requests to surface edge cases across the population of cases.
Answers
"Is the prior-auth agent performing well overall?"
CogniSwitch proves the decision
CogniSwitch checks each individual denial against the exact payer-policy version that applied at that moment, produces a yes-or-no verdict that names the rule that fired, and keeps the record. The same input gives the same verdict every time.
Answers
"Can we prove this specific denial followed the policy?"
Together, in one flow
Braintrust tells you the agent is scoring well and a release did not regress. CogniSwitch lets you pull up one disputed denial and show the regulator the precise rule it applied and why.
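As an illustration of checking a denial against the policy version in force at the time, here is a small Python sketch. Everything in it is assumed for the example: the `PolicyVersion` and `Verdict` types, the visit-limit rule, the rule ID, and the dates are hypothetical and do not describe CogniSwitch's implementation or any real payer policy.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class PolicyVersion:
    version: str
    effective_from: date
    max_visits: int  # hypothetical rule parameter

@dataclass(frozen=True)
class Verdict:
    allowed: bool
    rule_id: str
    policy_version: str

# Invented policy history: the visit limit changed in July.
POLICIES = [
    PolicyVersion("2024-01", date(2024, 1, 1), max_visits=10),
    PolicyVersion("2024-07", date(2024, 7, 1), max_visits=12),
]

def policy_in_force(on: date) -> PolicyVersion:
    """Return the latest policy version effective on the decision date."""
    candidates = [p for p in POLICIES if p.effective_from <= on]
    if not candidates:
        raise ValueError(f"no policy in force on {on}")
    return max(candidates, key=lambda p: p.effective_from)

def verify_denial(visits_requested: int, decided_on: date) -> Verdict:
    """Decide whether a denial was consistent with the policy in force that day."""
    policy = policy_in_force(decided_on)
    # Under this toy rule, a denial is allowed only if the request exceeds the limit.
    denial_allowed = visits_requested > policy.max_visits
    return Verdict(denial_allowed, "PT-VISIT-LIMIT", policy.version)

march = verify_denial(11, date(2024, 3, 15))
august = verify_denial(11, date(2024, 8, 1))
print(march)
print(august)
```

The same request produces different verdicts in March and August because a different policy version applied, and each verdict records which version and rule it used.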
Clinical documentation agent (discharge summaries)
The scenario
The agent drafts a discharge summary from the encounter: medications, problems, and the care plan.
Braintrust evaluates the behavior
Braintrust scores summary quality with scorers such as completeness and hallucination across datasets and live traffic, and tracks how those scores move from one release to the next.
Answers
"Are the summaries generally scoring well?"
CogniSwitch proves the decision
CogniSwitch checks each summary against the EHR itself: every medication, problem, and plan item is matched to the structured record, and any item that does not match is flagged before the summary is finalized.
Answers
"Can we prove this summary matches the patient's record?"
Together, in one flow
Braintrust tells you summary scores are holding steady across releases. CogniSwitch catches the one summary that lists a medication the patient was never on, before it reaches the chart.
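The matching step can be pictured with a short Python sketch. It assumes a simplified data shape (section name mapped to a list of strings) and exact name matching after normalization; a production system would more likely match on coded identifiers. The function names and sample data are hypothetical.

```python
def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not cause mismatches."""
    return " ".join(name.lower().split())

def unmatched_items(
    summary: dict[str, list[str]], ehr: dict[str, list[str]]
) -> list[tuple[str, str]]:
    """Return (section, item) pairs in the summary with no match in the record."""
    flagged = []
    for section, items in summary.items():
        known = {normalize(entry) for entry in ehr.get(section, [])}
        for item in items:
            if normalize(item) not in known:
                flagged.append((section, item))
    return flagged

# Invented sample data: the draft lists a medication the record does not contain.
draft_summary = {
    "medications": ["Metformin 500 mg", "Warfarin 5 mg"],
    "problems": ["Type 2 diabetes"],
}
ehr_record = {
    "medications": ["metformin 500 mg"],
    "problems": ["type 2 diabetes"],
}

flagged = unmatched_items(draft_summary, ehr_record)
print(flagged)
```

Any non-empty result would hold the summary back for review before it is finalized.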
Both layers, together
Braintrust evaluates how the agent performs across releases and the whole population. CogniSwitch proves the one decision under question. Together, the agent is safe to run in production and defensible to an auditor.
What changes when you add CogniSwitch to Braintrust?
The table below shows what changes when you add CogniSwitch to the evaluation and observability you already run. The rows build from scoring the agents to deploying them in a regulated setting with confidence.
Yes = the stack can do this
The first three rows are Braintrust doing its job well. The rest is what the verification layer adds.
| What you can do | With only Braintrust | With Braintrust + CogniSwitch |
|---|---|---|
| Score output quality with evals and scorers | Yes | Yes |
| Catch regressions across releases with experiments and CI/CD | Yes | Yes |
| Trace and monitor production traffic over time | Yes | Yes |
| Get the same verdict on a decision every time you check | No | Yes |
| Name the exact policy rule that drove a decision | No | Yes |
| Reconstruct and prove one specific decision after the fact | No | Yes |
| Block a non-compliant decision before it ships | No | Yes |
| Deploy AI agents in a regulated setting with confidence | No | Yes |
| Defend a decision to an auditor or regulator | No | Yes |
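
One way to picture "reconstruct and prove one specific decision after the fact" is a stored record that can be replayed. The Python sketch below is an assumption-laden illustration, not CogniSwitch's audit format: `audit_record`, `replay_matches`, and the toy rules are hypothetical names.

```python
import hashlib
import json
from typing import Callable

def audit_record(decision_input: dict, verdict: dict) -> dict:
    """Store the input and verdict with a digest over their canonical JSON form."""
    payload = json.dumps({"input": decision_input, "verdict": verdict}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"input": decision_input, "verdict": verdict, "digest": digest}

def replay_matches(record: dict, verify: Callable[[dict], dict]) -> bool:
    """Re-run the deterministic check and confirm it reproduces the stored digest."""
    fresh = audit_record(record["input"], verify(record["input"]))
    return fresh["digest"] == record["digest"]

def visit_rule(decision_input: dict) -> dict:
    """Toy deterministic rule, invented for this example."""
    return {"allowed": decision_input["visits"] <= 10, "rule_id": "PT-VISIT-LIMIT"}

def changed_rule(decision_input: dict) -> dict:
    """The same rule with a different limit, to show a replay mismatch."""
    return {"allowed": decision_input["visits"] <= 15, "rule_id": "PT-VISIT-LIMIT"}

record = audit_record({"visits": 12}, visit_rule({"visits": 12}))
print(replay_matches(record, visit_rule))
print(replay_matches(record, changed_rule))
```

Replaying with the rule that was in force reproduces the record; replaying with a different rule does not, which is why the rule version needs to be kept alongside the decision.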
FAQ
Common questions from teams that already run an evaluation and observability stack and are deciding where the trust layer fits.
Q1. Is CogniSwitch an alternative to Braintrust?
No. They operate at different layers of the AI stack. Braintrust is evaluation, experimentation, and observability: it scores and monitors what your agents produce. CogniSwitch is the trust layer: it verifies and enforces decisions deterministically. Most regulated teams run both, not one instead of the other.
Q2. What does Braintrust do that CogniSwitch does not?
Braintrust gives you a full evaluation and observability loop: a browser playground, immutable experiments with CI/CD regression gating, logging and tracing, datasets, and online scoring of production traffic. CogniSwitch does not replace that signal. It adds a verification layer on top of it.
Q3. What does CogniSwitch add to a Braintrust stack?
Deterministic verification. Braintrust tells you whether an agent is performing well on average and whether a release regressed. CogniSwitch proves whether a specific decision was correct and policy-compliant, with a reproducible, rule-named verdict and an audit trail you can hand an auditor.
Q4. Why isn't Braintrust's scoring enough for a regulated decision?
Braintrust's online scoring of live production traffic relies on LLM-as-a-judge scorers, by its own documentation, because production requests have no ground truth. That is the right tool for monitoring quality trends, but a model judging a model can score the same output differently on re-run. A regulator needs a reproducible verdict on the specific decision, which is what deterministic verification provides.
Q5. Do Braintrust and CogniSwitch run together?
Yes, as complementary layers in the same stack, each doing its job. Braintrust observes, evaluates, and experiments; CogniSwitch verifies and enforces. You keep your eval and observability loop and add the ability to prove and enforce regulated decisions.
Q6. We already use Braintrust. What changes if we add CogniSwitch?
Your eval and observability loop stays exactly as it is. What you gain is proof and enforcement: every regulated decision is checked against your SOPs and source data at runtime, deterministically, producing an audit trail. Braintrust keeps answering "Is the agent performing well?"; CogniSwitch answers "Can we prove this decision was compliant?"
Get your agents into production.
Keep your evaluation and observability stack. Add the layer that proves a decision followed the rules and blocks the one that does not, before it reaches a patient or payer. It runs on a context graph, not another model in the scoring path.
Keep reading
CogniSwitch also pairs with other evaluation stacks: Arize + CogniSwitch, Galileo + CogniSwitch, and Cekura + CogniSwitch.
Author
Joshua Thomas
Co-Founder & CTO, CogniSwitch