Doveryai,
No Proveryai.
Why enterprise AI needs verifiability as a first-class property, and what the benchmarks don't measure.
The Russian proverb means "trust, but verify". Reagan made it famous during the Cold War, and it fits enterprise AI in 2026 for the same reason it fit arms control: trust alone is not a control.
For the entire history of software, verifiability was assumed. Deterministic systems are testable by definition. AI broke that assumption. A system can now be accurate and completely unverifiable at the same time.
Verifiability is a prerequisite for accuracy that can be trusted. A correct answer you cannot prove, repeat, or trace is not usable in a regulated context. Current benchmarks measure whether the system got it right. They say nothing about whether you can prove it. In healthcare, financial services, and insurance, that gap is the risk.
CogniSwitch is a neuro-symbolic AI platform that performs deterministic retrieval over concepts via ontologies, vectors, and graphs. Every output traces to a named source document at the triple level. Every retrieval reproduces identically across runs. This is a structural property of the architecture, not a design aspiration.
How to read these numbers: Exact Match (EM) is a character-level metric. "Bill Clinton" fails against "William Jefferson Clinton"; "EFL Cup" fails against "Carabao Cup" even when both are correct. EM is the conservative floor used for leaderboard comparison. LLM-judge accuracy (84.4%) measures semantic correctness including format variants, a 10.1-point gap above token F1 on identical answers, driven by the verbosity preference of LLM judges. Precision when answered (93.1%) excludes the 9.4% of questions where the system abstained because the retrieved evidence was insufficient to commit to an answer. In regulated industries, a principled "I don't know" is preferable to a confident wrong answer.
| Metric | CogniSwitch + Claude | What It Measures |
|---|---|---|
| LLM-Judge Accuracy | 84.4% | Semantic correctness scored by language model judge |
| Precision when answered | 93.1% | Correctness on the 90.6% of questions answered |
| Token F1 | 74.3% | Token overlap with gold answer — official HotpotQA metric |
| Contain-Match | 69.8% | Gold answer string present in response |
| Exact Match (EM) | 60.8% | Character-level match — conservative floor |
| Principled abstention (IDK) | 9.4% | Declined to answer — insufficient evidence |
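To make the string metrics in the table concrete, here is a minimal sketch using the standard HotpotQA/SQuAD answer normalization (lowercase, strip punctuation and articles). Function names are illustrative; this is not CogniSwitch's evaluation harness.

```python
import re
import string
from collections import Counter

def normalize(s: str) -> str:
    """Standard SQuAD/HotpotQA normalization: lowercase, drop punctuation and articles."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred: str, gold: str) -> bool:
    # Character-level equality after normalization: the conservative floor.
    return normalize(pred) == normalize(gold)

def contain_match(pred: str, gold: str) -> bool:
    # Gold answer string present anywhere in the response.
    return normalize(gold) in normalize(pred)

def token_f1(pred: str, gold: str) -> float:
    # Token overlap with the gold answer: the official HotpotQA answer metric.
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

# "Bill Clinton" vs "William Jefferson Clinton": EM fails, token F1 gives partial credit.
print(exact_match("Bill Clinton", "William Jefferson Clinton"))          # False
print(round(token_f1("Bill Clinton", "William Jefferson Clinton"), 2))   # 0.4
```

This is how EM and token F1 can diverge by double digits on identical outputs: one scores the string, the other scores the overlap.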
Why this dataset is hard: The HotpotQA distractor setting provides 10 paragraphs per question: 2 gold supporting documents plus 8 adversarial distractors designed to mislead retrieval. Questions split between bridge (entity traversal across two documents) and comparison (attribute extraction across two entities). Short gold answers make EM unforgiving, as the examples above show. No fine-tuning, no adaptation signal: every question is a cold inference problem.
| System | EM | F1 | Setting |
|---|---|---|---|
| CogniSwitch + Claude (this work) | 60.8% | 74.3% | Zero-shot · No fine-tuning · Deterministic KG |
| StepChain GraphRAG (Ni et al., Oct 2025) | 66.7% | 79.5% | Zero-shot · On-the-fly KG |
| PropRAG (Wang et al., Apr 2025) | — | 64.5% | Zero-shot · Proposition paths |
| PEI (Huang et al., 2024) — best fine-tuned | 72.89% | 77.84% | Fine-tuned on 90,526 training examples |
| HotpotQA leaderboard SOTA (distractor, test set) | ~72% | ~85% | Fine-tuned · Full training set · hotpotqa.github.io |
| Human performance (reported) | ~77% | ~91% | Human annotators · Yang et al., 2018 |
The headline: The best fine-tuned system (PEI) achieves 77.84% token F1 after training on 90,526 labelled examples. CogniSwitch achieves 74.3% token F1 with zero training examples — within 3.5 F1 points of fine-tuned SOTA — while maintaining full source traceability and run-to-run reproducibility that no fine-tuned system provides. On EM, the gap is larger (60.8% vs 72.89%), consistent with the additional reasoning steps bridge questions require without ontology grounding.
Supporting Facts F1: 74.4% · Facts Precision: 84.2% · Supporting Triples Coverage: 76.9%. Retrieval quality measured independently of generation. Full methodology: HotpotQA (Yang et al., EMNLP 2018) · Evaluation metric reference: A-RAG (Du et al., 2026)
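Supporting-fact scoring works at the evidence level rather than the answer level. HotpotQA's official metric treats evidence as a set of (document title, sentence index) pairs and scores set-overlap F1; the same pattern extends naturally to triple-level coverage. A minimal sketch, with invented example pairs:

```python
def evidence_f1(pred: set, gold: set) -> float:
    """Set-overlap F1 over evidence units: (doc_title, sent_idx) pairs for
    HotpotQA supporting facts, or (subject, predicate, object) triples
    for triple-level coverage."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# One of two predicted evidence units matches gold exactly.
pred = {("Carabao Cup", 0), ("EFL Cup final", 2)}
gold = {("Carabao Cup", 0), ("EFL Cup final", 3)}
print(round(evidence_f1(pred, gold), 2))  # 0.5
```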
The HotpotQA result surfaces a deeper problem. Accuracy benchmarks ask one question: did the system get it right? For healthcare, financial services, and insurance, that is necessary but not sufficient.
The question that matters equally: can you prove it, repeat it, and defend it?
Verifiability is a prerequisite for accuracy that can actually be used. An unverifiable answer — no consistent retrieval, no named source, no audit trail — cannot be defended in a regulated context regardless of whether it happens to be correct. This is not a hypothetical edge case. It is the structural condition of every probabilistic retrieval system.
CogniSwitch's deterministic architecture makes verifiability measurable: the same retrieval every run, and every output traceable to a named source document at the triple level. Existing benchmarks have no metrics for this. The field needs new evaluation dimensions that measure verifiability alongside accuracy, and this work is a first step toward defining what those should be.
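What "traceable at the triple level" implies, structurally, is that every answer carries its evidence as data. A hypothetical shape for such an output (the field names and example values are ours, not CogniSwitch's API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str
    source_doc: str  # named source document; frozen, so provenance is immutable

@dataclass(frozen=True)
class TracedAnswer:
    question: str
    answer: str
    evidence: tuple[Triple, ...]  # every triple names its source

answer = TracedAnswer(
    question="Which cup did the club win in 2018?",
    answer="EFL Cup",
    evidence=(
        Triple("Club X", "won", "EFL Cup", source_doc="club_x_history.pdf"),
    ),
)
# An auditor can verify each triple against its named document,
# independently of whether the final answer happens to be correct.
```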
This HotpotQA result is one data point in a broader research programme. Work in progress:
MedQA USMLE benchmark
Full ingestion of Robbins Pathology and Harrison's Internal Medicine. 1,272 questions from the dev set, filtered to pathology and internal medicine coverage. Comparison target: AMG-RAG at 73.92% (arXiv 2502.13010). Unlike AMG-RAG, which builds knowledge graphs from live PubMed API calls at runtime, CogniSwitch uses pre-built deterministic graphs from named corpora — verifiable by design.
Retrieval Stability Score (RSS)
500 questions × 50 runs to empirically demonstrate that deterministic KG retrieval produces identical results across runs. Not a performance claim — an architectural proof that no probabilistic RAG system can replicate.
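A sketch of how such a stability check could run, assuming a retrieve(question) callable. RSS here is taken to be the fraction of questions whose retrieval is byte-identical across all runs; the actual RSS definition may differ.

```python
import hashlib
import json

def retrieval_fingerprint(result) -> str:
    # Canonical serialization so ordering or whitespace cannot mask (in)stability.
    blob = json.dumps(result, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def retrieval_stability(retrieve, questions, runs: int = 50) -> float:
    stable = 0
    for q in questions:
        fingerprints = {retrieval_fingerprint(retrieve(q)) for _ in range(runs)}
        stable += (len(fingerprints) == 1)  # identical across every run
    return stable / len(questions)

# A deterministic KG retriever should score 1.0 over 500 questions x 50 runs;
# a probabilistic retriever with sampling or approximate search generally will not.
```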
Context Quality Index (CQI)
A verifiability framework comprising five properties in strict dependency order: Consistency, Provenance, Domain Alignment, Completeness, Signal Clarity. Each property is a necessary condition for the next. No existing benchmark measures any of them. arXiv preprint in preparation.
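The formulas belong to the forthcoming preprint; the structural idea, that each property gates the next, can be sketched as a short pipeline. The property names come from the text above; the scoring functions and pass threshold below are placeholders of ours.

```python
from typing import Callable

# Strict dependency order: a property is only evaluated (and credited)
# if every property before it passed. Scorers return a value in [0, 1].
CQI_PROPERTIES: list[tuple[str, Callable[[dict], float]]] = [
    ("Consistency",      lambda ctx: ctx["consistency"]),
    ("Provenance",       lambda ctx: ctx["provenance"]),
    ("Domain Alignment", lambda ctx: ctx["domain_alignment"]),
    ("Completeness",     lambda ctx: ctx["completeness"]),
    ("Signal Clarity",   lambda ctx: ctx["signal_clarity"]),
]

def cqi(ctx: dict, threshold: float = 0.5) -> dict[str, float]:
    scores: dict[str, float] = {}
    for name, scorer in CQI_PROPERTIES:
        score = scorer(ctx)
        scores[name] = score
        if score < threshold:
            break  # downstream properties are undefined if a prerequisite fails
    return scores

print(cqi({"consistency": 0.9, "provenance": 0.2,
           "domain_alignment": 0.8, "completeness": 0.7, "signal_clarity": 0.9}))
# {'Consistency': 0.9, 'Provenance': 0.2}  <- evaluation stops at the failed gate
```

The gating, not the particular thresholds, is the point: a context that cannot establish consistency has no meaningful provenance score, and so on down the chain.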