The Glass Box Benchmark
Regulated industries currently mandate human-in-the-loop oversight for every AI output because no evaluation framework certifies the process behind the answer — only the answer itself.
CQI asks a second question beyond accuracy: can you prove it, repeat it, and defend it? Goal: a glass box, where you can see what happened, not just what came out.
CQI is a framework first and a metric set second. Five properties define what verifiable enterprise AI requires. All metrics are accuracy-agnostic — CQI measures verifiability on top of accuracy, not instead of it.
Quick Reference
| Property | Metric(s) | Measures | Gold Set Required |
|---|---|---|---|
| Consistency | RSS-R · RSS-O | Same correct answer, every run | NO |
| Provenance | PDS | Response facts grounded in retrieved subgraph | NO |
| Alignment | OAS-D · OAS-O · OAS-T · OAS-B | Response concepts aligned to domain ontology, enterprise taxonomy, and temporal scope | NO |
| Completeness | CCS-R · CCS-O | Retrieval surfaced every expected fact | YES |
| Signal Clarity | CFS-R · CFS-O | Retrieval contained only relevant facts | YES |
Consistency
This is the most visible property — the first thing a user notices and the first thing that erodes trust. Enterprises will not proceed to evaluate any other property if outputs are inconsistent. Consistency is where trust assessment begins.
Calculation
Run the same question N times, extract triples (retrieved triples for RSS-R, response-mined triples for RSS-O), and compute the fraction of identical triples, averaged across run pairs and questions. A score of 1.0 = no drift. Below 1.0 = drift, with severity proportional to the distance from 1.0.
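A minimal sketch of the pairwise computation, assuming "fraction of identical triples" means the Jaccard overlap of two runs' triple sets (the function name and single-question scope are assumptions; averaging over questions is omitted for brevity):

```python
from itertools import combinations

# A triple is a (subject, predicate, object) tuple.
Triple = tuple[str, str, str]

def rss(runs: list[set[Triple]]) -> float:
    """Average fraction of identical triples across all run pairs.

    Each element of `runs` holds one run's triples for the same question:
    retrieved triples for RSS-R, response-mined triples for RSS-O.
    """
    scores = []
    for a, b in combinations(runs, 2):
        union = a | b
        # Shared triples over all triples for this pair of runs.
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)

# Hypothetical runs: the third run drops a triple, so the score dips below 1.0.
t1 = ("pembrolizumab", "evaluated_in", "KEYNOTE-024")
t2 = ("KEYNOTE-024", "studied", "NSCLC")
print(rss([{t1, t2}, {t1, t2}, {t1}]))  # ~0.67: drift detected
```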
Comparative Performance
| Framework | RSS-R | RSS-O |
|---|---|---|
| CogniSwitch | 1.0 | ~1.0 |
| Vector RAG | < 1.0 | < 1.0 |
* Probabilistic retrieval inherently drifts across runs.
Provenance
Once consistency is established: did every output fact come from the defined corpus, or did the system draw on external knowledge? This is distinct from post-hoc citation: a citation is a claim, provenance is a constraint. The architecture enforces it, not the prompt.
Calculation
Mine the response for triples via a 3-stage ontology-guided pipeline (NLP pre-processing, ontology-guided extraction, validation). Two framings, numerically equivalent:
- Grounding Rate: |matched triples| / |response triples| x 100
- Response Precision: |R intersect G| / |R|
Worked Example
Consider a query about pembrolizumab in KEYNOTE-024.
PDS (Framing A, Grounding Rate) = 66.7%; PDS (Framing B, Response Precision) = 0.67.
A third of the mined response triples had no match in the retrieved subgraph. The overall survival (OS) figure accounts for the gap: it appears in the response but not in the retrieved subgraph, making it a fabrication.
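A minimal sketch of the Response Precision framing (the triples below are hypothetical stand-ins, not the actual mined triples from the example):

```python
Triple = tuple[str, str, str]

def pds(response_triples: set[Triple], subgraph: set[Triple]) -> float:
    """Response Precision framing: |R intersect G| / |R|."""
    if not response_triples:
        return 1.0  # assumption: an empty response fabricates nothing
    return len(response_triples & subgraph) / len(response_triples)

# Hypothetical stand-ins: three mined response triples, two grounded.
response = {
    ("pembrolizumab", "evaluated_in", "KEYNOTE-024"),
    ("KEYNOTE-024", "enrolled", "PD-L1-high NSCLC patients"),
    ("pembrolizumab", "median_OS", "30.0 months"),  # absent from the subgraph
}
subgraph = {
    ("pembrolizumab", "evaluated_in", "KEYNOTE-024"),
    ("KEYNOTE-024", "enrolled", "PD-L1-high NSCLC patients"),
}
score = pds(response, subgraph)
print(f"PDS = {score:.2f} ({score:.1%})")  # PDS = 0.67 (66.7%)
```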
Alignment
Once provenance is established: are the corpus-grounded facts being interpreted against your ontology, your organization's definitions, and within their valid temporal scope — or against the model's parametric knowledge? Four sub-scores:
Formulas & Status
- OAS-D (domain ontology): |C_retrieved intersect C_response| / |C_retrieved union C_response| x 100
- OAS-O (org taxonomy): (response concepts resolving to enterprise taxonomy / total enterprise-scoped response concepts) x 100
- OAS-T (temporal scope): (retrieved facts within valid temporal scope / total retrieved facts with temporal metadata) x 100
- OAS-B (definitions): 1 minus (contradicted concepts / total ontology-mapped concepts)
Worked Example: KEYNOTE-024 Protocol
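The worked-example data itself is not reproduced here. As a minimal sketch with hypothetical inputs (all names, sets, and counts below are assumptions), the four sub-scores might be computed like this:

```python
def oas_d(c_retrieved: set[str], c_response: set[str]) -> float:
    """Domain alignment: Jaccard overlap of retrieved vs. response concepts, x 100."""
    union = c_retrieved | c_response
    return 100.0 * len(c_retrieved & c_response) / len(union) if union else 100.0

def oas_o(resolved: int, enterprise_scoped: int) -> float:
    """Taxonomy alignment: concepts resolving to the enterprise taxonomy, x 100."""
    return 100.0 * resolved / enterprise_scoped if enterprise_scoped else 100.0

def oas_t(in_scope: int, with_temporal_metadata: int) -> float:
    """Temporal alignment: retrieved facts within their valid temporal scope, x 100."""
    return 100.0 * in_scope / with_temporal_metadata if with_temporal_metadata else 100.0

def oas_b(contradicted: int, ontology_mapped: int) -> float:
    """Non-contradiction: 1 minus the contradicted share of ontology-mapped concepts."""
    return 1.0 - contradicted / ontology_mapped if ontology_mapped else 1.0

# Hypothetical concept sets and counts:
print(oas_d({"pembrolizumab", "NSCLC", "PD-L1"}, {"pembrolizumab", "NSCLC"}))  # ~66.7
print(oas_o(resolved=9, enterprise_scoped=10))      # 90.0
print(oas_t(in_scope=7, with_temporal_metadata=8))  # 87.5
print(oas_b(contradicted=1, ontology_mapped=20))    # 0.95
```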
Completeness
Once alignment is established: did retrieval surface every applicable fact? This is the thesis: verifiability cannot be measured without a defined expected fact set — which is why current benchmarks do not measure it.
Retrieval Recall (CCS-R)
|G_retrieved intersect G_gold| / |G_gold|
CCS-R 1.0 = every expected fact surfaced.
Output Recall (CCS-O)
|R_triples intersect G_gold| / |G_gold|
CCS-O 1.0 = response addressed every expected fact.
Derived Metric
CCS-R minus CCS-O = LLM omission rate.
In regulated use cases, a retrieved contraindication not surfaced in the response is a failure regardless of retrieval completeness.
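A minimal sketch of both recall scores and the derived omission rate (the gold and retrieved triples are hypothetical; assumes a non-empty gold set):

```python
Triple = tuple[str, str, str]

def ccs_r(g_retrieved: set[Triple], g_gold: set[Triple]) -> float:
    """Retrieval recall: |G_retrieved intersect G_gold| / |G_gold|."""
    return len(g_retrieved & g_gold) / len(g_gold)

def ccs_o(r_triples: set[Triple], g_gold: set[Triple]) -> float:
    """Output recall: |R_triples intersect G_gold| / |G_gold|."""
    return len(r_triples & g_gold) / len(g_gold)

# Hypothetical: retrieval surfaced all 4 expected facts, the response kept 3.
gold = {("drugX", "contraindicated_with", f"agent_{i}") for i in range(4)}
retrieved = set(gold)
response = set(sorted(gold)[:3])  # one retrieved contraindication never surfaced
omission = ccs_r(retrieved, gold) - ccs_o(response, gold)
print(ccs_r(retrieved, gold), ccs_o(response, gold), omission)  # 1.0 0.75 0.25
```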
Worked Example
Signal Clarity
Once completeness is established: did retrieval surface ONLY the applicable facts, or did noise reach the model alongside the signal? Noise in context is not neutral — it increases the probability of off-target or hallucinated responses.
CCS and CFS are the recall-precision pair at the triple level; F1 is their harmonic mean.
Retrieval Fidelity (CFS-R)
|G_retrieved intersect G_gold| / |G_retrieved|
CFS-R 1.0 = zero noise in context.
Output Fidelity (CFS-O)
|R_triples intersect G_gold| / |R_triples|
CFS-O 1.0 = zero noise in response.
Derived Metric
F1 = 2 x (CCS-R x CFS-R) / (CCS-R + CFS-R)

| Scenario | CFS-R | CFS-O | Insight |
|---|---|---|---|
| 1 | High | High | System working correctly. |
| 2 | Low | High | The LLM filtered out the noise itself; this is neither reliable nor repeatable. |
| 3 | High | Low | The LLM introduced noise; cross-reference PDS. |
| 4 | Low | Low | Noise propagated end to end. |
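A minimal sketch of the fidelity pair and the F1 combination, set up as a hypothetical Scenario 3 from the table above (clean retrieval, LLM-introduced noise); all triples are assumptions:

```python
Triple = tuple[str, str, str]

def cfs_r(g_retrieved: set[Triple], g_gold: set[Triple]) -> float:
    """Retrieval fidelity (precision): |G_retrieved intersect G_gold| / |G_retrieved|."""
    return len(g_retrieved & g_gold) / len(g_retrieved)

def cfs_o(r_triples: set[Triple], g_gold: set[Triple]) -> float:
    """Output fidelity (precision): |R_triples intersect G_gold| / |R_triples|."""
    return len(r_triples & g_gold) / len(r_triples)

def f1(recall: float, precision: float) -> float:
    """Harmonic mean of a CCS/CFS pair, e.g. CCS-R and CFS-R."""
    return 2 * recall * precision / (recall + precision) if recall + precision else 0.0

# Hypothetical Scenario 3: retrieval is clean, the LLM adds an ungrounded triple.
gold = {("a", "p", "b"), ("c", "q", "d")}
retrieved = set(gold)                       # CFS-R = 1.0: zero noise in context
response = gold | {("e", "r", "f")}         # CFS-O ~ 0.67: noise in the response
recall = len(retrieved & gold) / len(gold)  # CCS-R, the recall side of the pair
print(cfs_r(retrieved, gold), cfs_o(response, gold), f1(recall, cfs_r(retrieved, gold)))
```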
Cross-Property Summary
| Metric | Types | What it catches |
|---|---|---|
| Retrieval Stability Score | RSS-R / RSS-O | Drift across runs |
| Provenance Depth Score | PDS | Fabrication (response facts not in subgraph) |
| Ontology Aligned Score | OAS-D / OAS-O / OAS-T / OAS-B | Concept misalignment across domain, org taxonomy, temporal scope, definitions |
| Context Coverage Score | CCS-R / CCS-O | Missing expected facts (retrieval gap / LLM omission) |
| Context Fidelity Score | CFS-R / CFS-O | Noise in context or response |
| Derived | CCS-R minus CCS-O | LLM omission rate |
| Derived | F1 of CCS-R and CFS-R | Balanced retrieval quality |