COGNISWITCH · RESEARCH NOTE
May 2026
HotpotQA Benchmark — Run 3
Zero-Shot · No Fine-Tuning

Doveryai,
no proveryai.

Why enterprise AI needs verifiability as a first-class property, and what the benchmarks don't measure.

The Russian proverb meaning "trust, but verify", made famous by Reagan during the Cold War, fits enterprise AI in 2026. The principle was simple: trust alone is not a control. In enterprise AI, the same logic applies: an answer you trust but cannot verify is not a control either.

The Thesis

For the entire history of software, verifiability was assumed. Deterministic systems are testable by definition. AI broke that assumption. A system can now be accurate and completely unverifiable at the same time.

Verifiability is a prerequisite for accuracy that can be trusted. A correct answer you cannot prove, repeat, or trace is not usable in a regulated context. Current benchmarks measure whether the system got it right. They say nothing about whether you can prove it. In healthcare, financial services, and insurance, that gap is the risk.

CogniSwitch is a neuro-symbolic AI platform that performs deterministic retrieval over concepts, combining ontologies, vectors, and graphs. Every output traces to a named source document at the triple level. Every retrieval reproduces across runs. This is a structural property of the architecture, not a design aspiration.
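Triple-level provenance can be pictured as a retrieval unit that carries its source with it. A hypothetical sketch: the class and field names below are illustrative, not CogniSwitch's actual schema.

```python
from dataclasses import dataclass

# Illustrative only: a knowledge-graph triple that keeps a pointer to the
# named source document it was extracted from, so any answer built from it
# can be traced back to that document.
@dataclass(frozen=True)
class ProvenancedTriple:
    subject: str
    predicate: str
    obj: str
    source_doc: str       # named source document the triple came from
    sentence_idx: int     # location of the supporting sentence in that document

t = ProvenancedTriple("Carabao Cup", "also_known_as", "EFL Cup",
                      source_doc="efl_cup.txt", sentence_idx=0)
```

Because the source travels with the triple, an audit trail is a property of the data itself rather than a log bolted on afterwards.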

The Empirical Result
HotpotQA distractor setting · 2,000 questions · zero-shot · no fine-tuning · 10 bundled paragraphs per question
84.4% · LLM-Judge Accuracy
93.1% · Precision When Answered
9.4% · Principled Abstention
60.8% · Exact Match (Floor)

How to read these numbers: Exact Match (EM) is a character-level metric: "Bill Clinton" fails against "William Jefferson Clinton", and "EFL Cup" fails against "Carabao Cup" even when both are correct. EM is the conservative floor used for leaderboard comparison. LLM-judge accuracy (84.4%) measures semantic correctness, including format variants; it sits 10.1 points above token F1 on identical answers, a gap driven largely by the verbosity preference of LLM judges. Precision when answered (93.1%) excludes the 9.4% of questions where the system abstained because the retrieved evidence was insufficient to commit to an answer. In regulated industries, a principled "I don't know" is preferable to a confident wrong answer.
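The metric definitions above can be made concrete. The scoring below follows the standard SQuAD-style normalisation used by the HotpotQA evaluation (lowercasing, article and punctuation stripping); it is a minimal reimplementation for illustration, not the official script.

```python
import re
from collections import Counter

def normalize(s: str) -> str:
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    s = s.lower()
    s = re.sub(r"\b(a|an|the)\b", " ", s)   # strip English articles
    s = re.sub(r"[^a-z0-9 ]", " ", s)       # strip punctuation
    return " ".join(s.split())

def exact_match(pred: str, gold: str) -> bool:
    """Character-level match after normalisation - the conservative floor."""
    return normalize(pred) == normalize(gold)

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def contain_match(pred: str, gold: str) -> bool:
    """Gold answer string present anywhere in the response."""
    return normalize(gold) in normalize(pred)
```

On the note's own example, `exact_match("Bill Clinton", "William Jefferson Clinton")` is False and `token_f1` is 0.4 (one shared token out of two and three), which is exactly why EM and token F1 sit below judge-scored accuracy.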

Metric | CogniSwitch + Claude | What It Measures
LLM-Judge Accuracy | 84.4% | Semantic correctness scored by a language-model judge
Precision when answered | 93.1% | Correctness on the 90.6% of questions answered
Token F1 | 74.3% | Token overlap with the gold answer · official HotpotQA metric
Contain-Match | 69.8% | Gold answer string present in the response
Exact Match (EM) | 60.8% | Character-level match · conservative floor
Principled abstention (IDK) | 9.4% | Declined to answer · insufficient evidence
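One sanity check on these figures: if abstentions are scored as incorrect (an assumption; the note does not state how abstentions enter the headline number), overall accuracy should equal the answered rate times precision when answered.

```python
abstention_rate = 0.094            # principled abstention (IDK)
precision_when_answered = 0.931    # correctness on answered questions

answered_rate = 1 - abstention_rate            # 0.906, as in the table
overall_accuracy = answered_rate * precision_when_answered

print(f"{overall_accuracy:.1%}")   # 84.3%, matching the 84.4% headline within rounding
```

The three headline numbers are therefore internally consistent rather than independent claims.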

Why this dataset is hard: the HotpotQA distractor setting provides 10 paragraphs per question: 2 gold supporting documents plus 8 adversarial distractors designed to mislead retrieval. Questions are split between bridge (entity traversal across two documents) and comparison (attribute extraction across two entities). Short answers make EM unforgiving: "Bill Clinton" fails against "William Jefferson Clinton", and "EFL Cup" fails against "Carabao Cup". With no fine-tuning and no adaptation signal, every question is a cold inference problem.

Leaderboard context — what it takes to score here
System | EM | F1 | Setting
CogniSwitch + Claude (this work) | 60.8% | 74.3% | Zero-shot · No fine-tuning · Deterministic KG
StepChain GraphRAG (Ni et al., Oct 2025) | 66.70% | 79.5% | Zero-shot · On-the-fly KG
PropRAG (Wang et al., Apr 2025) | 64.5% | n/a | Zero-shot · Proposition paths
PEI (Huang et al., 2024), best fine-tuned | 72.89% | 77.84% | Fine-tuned on 90,000 training examples
HotpotQA leaderboard SOTA (distractor, test set) | ~72% | ~85% | Fine-tuned · Full training set · hotpotqa.github.io
Human performance (reported) | ~77% | ~91% | Human annotators · Yang et al., 2018

The headline: The best fine-tuned system (PEI) achieves 77.84% token F1 after training on 90,526 labelled examples. CogniSwitch achieves 74.3% token F1 with zero training examples — within 3.5 F1 points of fine-tuned SOTA — while maintaining full source traceability and run-to-run reproducibility that no fine-tuned system provides. On EM, the gap is larger (60.8% vs 72.89%), consistent with the additional reasoning steps bridge questions require without ontology grounding.

Supporting Facts F1: 74.4% · Facts Precision: 84.2% · Supporting Triples Coverage: 76.9%. Retrieval quality measured independently of generation. Full methodology: HotpotQA (Yang et al., EMNLP 2018) · Evaluation metric reference: A-RAG (Du et al., 2026)
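Supporting-fact scores of this kind are set-level precision, recall, and F1 over (document title, sentence index) pairs, as in the official HotpotQA evaluation. A minimal sketch:

```python
def supporting_facts_scores(pred, gold):
    """Precision, recall, and F1 over predicted vs gold
    (document_title, sentence_index) pairs, mirroring the
    HotpotQA supporting-facts metric."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)          # facts both predicted and gold
    if tp == 0:
        return 0.0, 0.0, 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, predicting `[("Doc A", 0), ("Doc B", 1)]` against gold `[("Doc A", 0), ("Doc B", 2)]` scores 0.5 on all three, since only one of two facts on each side matches. This measures retrieval quality independently of the generated answer.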

Beyond Accuracy: The Verifiability Gap

The HotpotQA result surfaces a deeper problem. Accuracy benchmarks ask one question: did the system get it right? For healthcare, financial services, and insurance, that is necessary but not sufficient.

The question that matters equally: can you prove it, repeat it, and defend it?

Verifiability is a prerequisite for accuracy that can actually be used. An unverifiable answer — no consistent retrieval, no named source, no audit trail — cannot be defended in a regulated context regardless of whether it happens to be correct. This is not a hypothetical edge case. It is the structural condition of every probabilistic retrieval system.

CogniSwitch's deterministic architecture makes verifiability measurable: same retrieval every run, every output traceable to a named source document at triple level. Existing benchmarks have no metrics for this. The field needs new evaluation dimensions to measure verifiability alongside accuracy, and this work is a first step toward defining what those should be.

What's Next

This HotpotQA result is one data point in a broader research programme. Work in progress:

MedQA USMLE benchmark

Full ingestion of Robbins Pathology and Harrison's Internal Medicine. 1,272 questions from the dev set, filtered to pathology and internal medicine coverage. Comparison target: AMG-RAG at 73.92% (arXiv 2502.13010). Unlike AMG-RAG, which builds knowledge graphs from live PubMed API calls at runtime, CogniSwitch uses pre-built deterministic graphs from named corpora — verifiable by design.

Retrieval Stability Score (RSS)

500 questions × 50 runs to empirically demonstrate that deterministic KG retrieval produces identical results across runs. Not a performance claim — an architectural proof that no probabilistic RAG system can replicate.
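The note does not define an RSS formula, so the sketch below assumes one plausible reading: the fraction of questions whose retrieval result is content-identical across every run. The fingerprinting scheme is an illustrative assumption.

```python
import hashlib
import json

def retrieval_fingerprint(triples):
    """Order-insensitive, content-sensitive hash of one retrieval result."""
    canonical = json.dumps(sorted(triples), separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def stability_score(runs_per_question):
    """Fraction of questions whose retrieval fingerprint is identical
    across every run. A deterministic retriever should score exactly 1.0;
    a probabilistic one generally cannot."""
    stable = sum(
        1 for runs in runs_per_question
        if len({retrieval_fingerprint(run) for run in runs}) == 1
    )
    return stable / len(runs_per_question)
```

Hashing a canonical serialisation makes the comparison cheap at 500 questions × 50 runs: each question stores 50 digests instead of 50 full retrieval sets.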

Context Quality Index (CQI)

A verifiability framework comprising five properties in strict dependency order: Consistency, Provenance, Domain Alignment, Completeness, Signal Clarity. Each property is a necessary condition for the next. No existing benchmark measures any of them. arXiv preprint in preparation.
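The strict dependency order can be sketched as a gated evaluation in which a failed property zeroes out everything downstream. The property names come from the note; the 0-to-1 scores, the 0.5 gate, and the mean aggregation are illustrative assumptions, not the forthcoming paper's formulas.

```python
# The five CQI properties in the note's strict dependency order.
CQI_PROPERTIES = ("consistency", "provenance", "domain_alignment",
                  "completeness", "signal_clarity")

def cqi(scores, gate=0.5):
    """Each property is a necessary condition for the next: once one
    property fails the gate, no downstream property is credited."""
    credited = 0.0
    for name in CQI_PROPERTIES:
        score = scores.get(name, 0.0)
        if score < gate:
            break  # dependency chain broken; stop crediting
        credited += score
    return credited / len(CQI_PROPERTIES)
```

Under this gating, a context with perfect provenance but inconsistent retrieval still scores zero on provenance and everything after it, which is the intuition behind putting Consistency first.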

The Framework

Context Quality Index (CQI)

Five properties. Strict dependency order. The full technical reference — formulas, metrics, worked examples.

Read the CQI framework
cogniswitch.ai · The Trust Layer for Enterprise AI
Vivek Khandelwal · Co-founder