The Glass Box Benchmark
Regulated industries currently mandate human-in-the-loop oversight for every AI output because no evaluation framework certifies the process behind the answer — only the answer itself.
The Context Quality Index (CQI) measures a second question beyond accuracy: can you prove it, repeat it, and defend it? Goal: a glass box — where you can see what happened, not just what came out.
CQI is a framework first and a metric set second. Five properties define what verifiable enterprise AI requires. All metrics are accuracy-agnostic — CQI measures verifiability on top of accuracy, not instead of it.
Quick Reference
| Property | Metric(s) | Measures | Gold Set Required |
|---|---|---|---|
| Consistency | RSS-R · RSS-O | Same correct answer, every run | NO |
| Provenance | PDS | Response facts grounded in retrieved subgraph | NO |
| Alignment | OAS-D · OAS-O · OAS-T · OAS-B | Response concepts aligned to domain ontology, enterprise taxonomy, and temporal scope | NO |
| Completeness | CCS-R · CCS-O | Retrieval surfaced every expected fact | YES |
| Signal Clarity | CFS-R · CFS-O | Retrieval contained only relevant facts | YES |
Consistency
This is the most visible property — the first thing a user notices and the first thing that erodes trust. Enterprises will not proceed to evaluate any other property if outputs are inconsistent. Consistency is where trust assessment begins.
Calculation
Run the same question N times, extract triples from each run (retrieved for RSS-R, mined from the response for RSS-O), then compute the fraction of identical triples, averaged across run pairs and questions. A score of 1.0 means no drift; a score below 1.0 indicates drift in proportion to its distance from 1.0.
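The calculation above can be sketched as follows. This is a minimal illustration, not the reference implementation: it assumes that "fraction of identical triples" means the Jaccard overlap between the triple sets of two runs, and the names `pair_overlap` and `stability_score` and the sample triples are hypothetical.

```python
from itertools import combinations
from statistics import mean

# A triple is (subject, predicate, object); this shape is an assumption.
Triple = tuple[str, str, str]


def pair_overlap(a: set[Triple], b: set[Triple]) -> float:
    """Jaccard overlap between the triple sets of two runs (assumed reading)."""
    if not a and not b:
        return 1.0  # two empty runs are trivially identical
    return len(a & b) / len(a | b)


def stability_score(runs_per_question: list[list[set[Triple]]]) -> float:
    """Mean pairwise overlap per question, then mean across questions."""
    per_question = []
    for runs in runs_per_question:
        pairs = list(combinations(runs, 2))
        if pairs:  # a question needs at least two runs to form a pair
            per_question.append(mean(pair_overlap(a, b) for a, b in pairs))
    if not per_question:
        raise ValueError("need at least one question with two or more runs")
    return mean(per_question)


# Hypothetical triples for one question run three times.
t1 = ("drug_x", "treats", "condition_y")
t2 = ("drug_x", "dose", "200 mg")
t3 = ("drug_x", "dose", "100 mg")
runs = [{t1, t2}, {t1, t2}, {t1, t3}]
score = stability_score([runs])
print(round(score, 3))
```

Passing retrieved triples would give an RSS-R style score and passing response-mined triples an RSS-O style score; the function itself is the same in both cases.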
Provenance
Once consistency is established: did every output fact come from the defined corpus, or did the system draw on external knowledge? Provenance is distinct from post-hoc citation: a citation is a claim, whereas provenance is a constraint. The architecture enforces it, not the prompt.[3,4,5]
Calculation
Mine the response for triples via a 3-stage ontology-guided pipeline (NLP pre-processing, ontology-guided extraction, validation). Two framings, numerically equivalent:
- Grounding Rate: |matched triples| / |response triples| x 100
- Response Precision: |R intersect G| / |R|
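A minimal sketch of the two framings, assuming the triples have already been mined and normalised so that exact set matching is meaningful. The function names and the sample triples are hypothetical and are not taken from the worked example.

```python
# A triple is (subject, predicate, object); this shape is an assumption.
Triple = tuple[str, str, str]


def response_precision(response: set[Triple], subgraph: set[Triple]) -> float:
    """|R intersect G| / |R|, as a fraction between 0 and 1."""
    if not response:
        raise ValueError("response contains no triples")
    return len(response & subgraph) / len(response)


def grounding_rate(response: set[Triple], subgraph: set[Triple]) -> float:
    """The same ratio expressed as a percentage."""
    return 100 * response_precision(response, subgraph)


# Hypothetical retrieved subgraph G and response triples R.
subgraph = {
    ("drug_x", "treats", "condition_y"),
    ("drug_x", "dose", "200 mg"),
    ("drug_x", "route", "intravenous"),
}
response = subgraph | {("drug_x", "approved_in", "1999")}  # one ungrounded triple

print(response_precision(response, subgraph))  # 0.75
print(grounding_rate(response, subgraph))      # 75.0
```

Any response triple absent from the subgraph lowers the score, which is how the metric flags a fabricated fact.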
Worked Example
Query on pembrolizumab in the KEYNOTE-024 trial.
PDS (A, Grounding Rate) = 66.7% / PDS (B, Response Precision) = 0.67
The overall survival (OS) figure in the response is a fabrication.
Alignment
Once provenance is established: are the corpus-grounded facts being interpreted against your ontology, your organization's definitions, and within their valid temporal scope — or against the model's parametric knowledge?[6,7] Four sub-scores:
Formulas & Status
- OAS-D: |C_retrieved intersect C_response| / |C_retrieved union C_response| x 100
- OAS-O: (response concepts resolving to enterprise taxonomy / total enterprise-scoped response concepts) x 100
- OAS-T: (retrieved facts within valid temporal scope / total retrieved facts with temporal metadata) x 100
- OAS-B: 1 minus (contradicted concepts / total ontology-mapped concepts)
Worked Example: KEYNOTE-024 Protocol
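Independently of the KEYNOTE-024 example, the four alignment formulas in this section can be sketched generically. This is an illustration only: the mapping of the formulas to OAS-D, OAS-O, OAS-T, and OAS-B follows the order in which the sub-scores are listed and is an assumption, as are the function names and the sample inputs.

```python
def _ratio(numerator: int, denominator: int) -> float:
    """Guarded division shared by the count-based sub-scores."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


def oas_d(retrieved: set[str], response: set[str]) -> float:
    """Jaccard overlap of retrieved and response concepts, as a percentage."""
    return 100 * _ratio(len(retrieved & response), len(retrieved | response))


def oas_o(resolved_to_taxonomy: int, enterprise_scoped_total: int) -> float:
    """Share of enterprise-scoped response concepts that resolve to the taxonomy."""
    return 100 * _ratio(resolved_to_taxonomy, enterprise_scoped_total)


def oas_t(in_temporal_scope: int, with_temporal_metadata: int) -> float:
    """Share of temporally tagged retrieved facts that are within valid scope."""
    return 100 * _ratio(in_temporal_scope, with_temporal_metadata)


def oas_b(contradicted: int, ontology_mapped_total: int) -> float:
    """One minus the share of ontology-mapped concepts that are contradicted."""
    return 1 - _ratio(contradicted, ontology_mapped_total)


# Hypothetical concept sets and counts.
retrieved_concepts = {"concept_a", "concept_b", "concept_c"}
response_concepts = {"concept_b", "concept_c", "concept_d"}

print(oas_d(retrieved_concepts, response_concepts))  # 50.0
print(oas_o(9, 10))                                  # 90.0
print(oas_t(3, 4))                                   # 75.0
print(oas_b(1, 5))
```

Note that, as written in the formulas, the first three sub-scores are percentages while the fourth is a fraction between 0 and 1.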
Completeness
Once alignment is established: did retrieval surface every applicable fact? This is the thesis: verifiability cannot be measured without a defined expected fact set — which is why current benchmarks do not measure it.[8,9]
Retrieval Recall (CCS-R)
|G_retrieved intersect G_gold| / |G_gold|
CCS-R 1.0 = every expected fact surfaced.
Output Recall (CCS-O)
|R_triples intersect G_gold| / |G_gold|
CCS-O 1.0 = response addressed every expected fact.
Derived Metric
CCS-R minus CCS-O
In regulated use cases, a retrieved contraindication not surfaced in the response is a failure regardless of retrieval completeness.
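The two recall scores and the derived gap can be sketched as below. This is a minimal illustration that assumes a gold set of expected triples is available and that triples are compared by exact match; the `recall` helper and the sample data are hypothetical.

```python
# A triple is (subject, predicate, object); this shape is an assumption.
Triple = tuple[str, str, str]


def recall(found: set[Triple], gold: set[Triple]) -> float:
    """|found intersect gold| / |gold|."""
    if not gold:
        raise ValueError("gold set must not be empty")
    return len(found & gold) / len(gold)


# Hypothetical gold set of four expected facts.
gold = {
    ("drug_x", "treats", "condition_y"),
    ("drug_x", "dose", "200 mg"),
    ("drug_x", "route", "intravenous"),
    ("drug_x", "contraindicated_with", "condition_z"),
}
# Retrieval surfaced every expected fact plus one extra.
retrieved = gold | {("drug_x", "manufacturer", "company_q")}
# The response omitted the contraindication.
response = gold - {("drug_x", "contraindicated_with", "condition_z")}

ccs_r = recall(retrieved, gold)
ccs_o = recall(response, gold)
omission_gap = ccs_r - ccs_o

print(ccs_r, ccs_o, omission_gap)  # 1.0 0.75 0.25
```

Here retrieval is complete (CCS-R = 1.0), yet the gap of 0.25 exposes the fact that was retrieved but never reached the response.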
Worked Example
Signal Clarity
Once completeness is established: did retrieval surface ONLY the applicable facts, or did noise reach the model alongside the signal? Noise in context is not neutral — it increases the probability of off-target or hallucinated responses.[10,11,12]
CCS and CFS are the recall-precision pair at triple level. F1 = their harmonic mean.
Retrieval Fidelity (CFS-R)
|G_retrieved intersect G_gold| / |G_retrieved|
CFS-R 1.0 = zero noise in context.
Output Fidelity (CFS-O)
|R_triples intersect G_gold| / |R_triples|
CFS-O 1.0 = zero noise in response.
Derived Metric
F1 = 2 x (CCS-R x CFS-R) / (CCS-R + CFS-R)
| Scenario | CFS-R | CFS-O | Insight |
|---|---|---|---|
| 1 | H | H | System working correctly. |
| 2 | L | H | LLM filtered noise, not reliable or repeatable. |
| 3 | H | L | LLM introduced noise, cross-ref PDS. |
| 4 | L | L | Noise propagated end to end. |
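The fidelity scores, the F1 combination, and the four scenarios can be sketched together. This is an illustration under stated assumptions: the 0.8 cut-off between high and low is hypothetical (the source does not give a threshold), and the function names and sample triples are invented for the example.

```python
# A triple is (subject, predicate, object); this shape is an assumption.
Triple = tuple[str, str, str]


def precision(found: set[Triple], gold: set[Triple]) -> float:
    """|found intersect gold| / |found| (the CFS form)."""
    if not found:
        raise ValueError("found set must not be empty")
    return len(found & gold) / len(found)


def recall(found: set[Triple], gold: set[Triple]) -> float:
    """|found intersect gold| / |gold| (the CCS form)."""
    if not gold:
        raise ValueError("gold set must not be empty")
    return len(found & gold) / len(gold)


def f1(ccs: float, cfs: float) -> float:
    """Harmonic mean of a recall score and a precision score."""
    return 0.0 if ccs + cfs == 0 else 2 * ccs * cfs / (ccs + cfs)


def scenario(cfs_r: float, cfs_o: float, threshold: float = 0.8) -> int:
    """Map the two fidelity scores to scenarios 1-4 (threshold is hypothetical)."""
    high_r, high_o = cfs_r >= threshold, cfs_o >= threshold
    if high_r and high_o:
        return 1  # system working correctly
    if high_o:
        return 2  # LLM filtered noise from a noisy context
    if high_r:
        return 3  # LLM introduced noise; cross-reference PDS
    return 4      # noise propagated end to end


# Hypothetical data: four gold facts, three retrieved along with two noise triples.
gold = {("s", "p", f"o{i}") for i in range(4)}
retrieved = {("s", "p", "o0"), ("s", "p", "o1"), ("s", "p", "o2"),
             ("s", "noise", "n1"), ("s", "noise", "n2")}
response = {("s", "p", "o0"), ("s", "p", "o1"), ("s", "p", "o2")}

cfs_r, cfs_o = precision(retrieved, gold), precision(response, gold)
ccs_r = recall(retrieved, gold)
print(cfs_r, cfs_o, round(f1(ccs_r, cfs_r), 3), scenario(cfs_r, cfs_o))
```

In this sample the context is noisy but the response is clean, so it falls into scenario 2: the outcome looks fine, but it depended on the model filtering the noise.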
Cross-Property Summary
| Metric | Types | What it catches |
|---|---|---|
| Retrieval Stability Score | RSS-R / RSS-O | Drift across runs |
| Provenance Depth Score | PDS | Fabrication (response facts not in subgraph) |
| Ontology Aligned Score | OAS-D / OAS-O / OAS-T / OAS-B | Concept misalignment across domain, org taxonomy, temporal scope, definitions |
| Context Coverage Score | CCS-R / CCS-O | Missing expected facts (retrieval gap / LLM omission) |
| Context Fidelity Score | CFS-R / CFS-O | Noise in context or response |
| Derived | CCS-R minus CCS-O | LLM omission rate |
| Derived | F1 of CCS-R and CFS-R | Balanced retrieval quality |
References
External literature underpinning each property. The CQI metrics (RSS, PDS, OAS, CCS, CFS) are CogniSwitch-original; the sources below establish the measurement problem each property addresses.
- [1] Consistency: Wang, Zhao, Tallent & Guo (2025). On the Reproducibility Limitations of RAG Systems. arXiv:2509.18869.
- [2] Consistency: Atil, Aykent et al. (2024). Non-Determinism of "Deterministic" LLM Settings. arXiv:2408.04667 · Eval4NLP 2025.
- [3] Provenance: Ji, Lee, Frieske et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys 55(12).
- [4] Provenance: Rashkin, Nikolaev, Lamm et al. (2023). Measuring Attribution in Natural Language Generation Models. Computational Linguistics 49(4).
- [5] Provenance: Min, Krishna, Lyu et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision. EMNLP 2023.
- [6] Alignment: Apple ML Research (2025). ODKE+: Ontology-Guided Open-Domain Knowledge Extraction with LLMs. arXiv:2509.04696.
- [7] Alignment: Cotti et al. (2025). OntoLogX: Ontology-Guided Knowledge Graph Extraction with LLMs. arXiv:2510.01409 · Adv. Intelligent Systems.
- [8] Completeness: Manning, Raghavan & Schütze (2008). Introduction to Information Retrieval (Ch. 8: Evaluation). Cambridge University Press.
- [9] Completeness: Williams, Bains, Tang et al. (2025). Evaluating large language models for drafting emergency department encounter summaries. PLOS Digital Health.
- [10] Signal Clarity: Cuconasu et al. (2024). The Power of Noise: Redefining Retrieval for RAG Systems. SIGIR 2024 · arXiv:2401.14887.
- [11] Signal Clarity: Shi, Chen, Misra et al. (2023). Large Language Models Can Be Easily Distracted by Irrelevant Context. ICML 2023.
- [12] Signal Clarity: Liu et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. TACL 12.