Reference v1.4

The Glass Box Benchmark

Regulated industries currently mandate human-in-the-loop oversight for every AI output because no evaluation framework certifies the process behind the answer — only the answer itself.

CQI asks a second question beyond accuracy: can you prove it, repeat it, and defend it? The goal is a glass box: you can see what happened, not just what came out.

CQI is a framework first and a metric set second. Five properties define what verifiable enterprise AI requires. All metrics are accuracy-agnostic — CQI measures verifiability on top of accuracy, not instead of it.

Quick Reference

Property | Metric(s) | Measures | Gold Set Required
Consistency | RSS-R · RSS-O | Same correct answer, every run | NO
Provenance | PDS | Response facts grounded in retrieved subgraph | NO
Alignment | OAS-D · OAS-O · OAS-T · OAS-B | Response concepts aligned to domain ontology, enterprise taxonomy, and temporal scope | NO
Completeness | CCS-R · CCS-O | Retrieval surfaced every expected fact | YES
Signal Clarity | CFS-R · CFS-O | Retrieval contained only relevant facts | YES
01

Consistency

RSS-R · RSS-O

This is the most visible property — the first thing a user notices and the first thing that erodes trust. Enterprises will not proceed to evaluate any other property if outputs are inconsistent. Consistency is where trust assessment begins.

Calculation

Run the same question N times, extract triples (retrieved, or mined from the response), and compute the fraction of identical triples, averaged across run pairs and across questions. A score of 1.0 means no drift; scores below 1.0 indicate drift, with severity proportional to the distance from 1.0.
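
A minimal sketch of the per-question calculation, assuming triples are plain (subject, predicate, object) tuples and that "fraction of identical triples" is normalized by the union of each run pair; function names such as rss_score are illustrative, not part of the specification.

```python
from itertools import combinations

def pair_overlap(run_a: set, run_b: set) -> float:
    """Fraction of identical triples shared by two runs (union-normalized)."""
    if not run_a and not run_b:
        return 1.0
    return len(run_a & run_b) / len(run_a | run_b)

def rss_score(runs: list[set]) -> float:
    """Average pairwise triple overlap across N runs of the same question.
    1.0 = no drift; lower values = drift."""
    pairs = list(combinations(runs, 2))
    return sum(pair_overlap(a, b) for a, b in pairs) / len(pairs)

# Illustrative: three runs of one question, triples as (s, p, o) tuples
runs = [
    {("KEYNOTE-024", "reports_ORR", "44.8%"), ("KEYNOTE-024", "phase", "3")},
    {("KEYNOTE-024", "reports_ORR", "44.8%"), ("KEYNOTE-024", "phase", "3")},
    {("KEYNOTE-024", "reports_ORR", "44.8%")},  # this run dropped a triple
]
print(round(rss_score(runs), 3))  # 0.667 -- drift detected
```

The benchmark score then averages these per-question values across the question set; RSS-R scores retrieved triples, RSS-O scores triples mined from the response.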

Comparative Performance

Framework | RSS-R | RSS-O
CogniSwitch | 1.0 | ~1.0
Vector RAG* | < 1.0 | < 1.0

* Probabilistic retrieval inherently drifts across runs.

02

Provenance

PDS

Once consistency is established: did every output fact come from the defined corpus, or did the system draw on external knowledge? This is the distinction between post-hoc citation (a claim) and provenance (a constraint): provenance is enforced by the architecture, not by the prompt.

Calculation

Mine the response for triples via a three-stage ontology-guided pipeline (NLP pre-processing, ontology-guided extraction, validation). Two framings, numerically equivalent:

  • Grounding Rate: |matched triples| / |response triples| x 100
  • Response Precision: |R intersect G| / |R|
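
A minimal sketch of the Response Precision framing, assuming exact triple matching; in practice the matching comes from the ontology-guided extraction pipeline above, and the triple values below are illustrative placeholders.

```python
def pds(response_triples: set, retrieved_subgraph: set) -> float:
    """Provenance Depth Score: |R intersect G| / |R| -- the fraction of
    response triples grounded in the retrieved subgraph."""
    if not response_triples:
        return 0.0  # assumption: an empty response grounds nothing
    return len(response_triples & retrieved_subgraph) / len(response_triples)

# Shape of the worked example below: 3 response triples, 2 grounded
retrieved = {("KEYNOTE-024", "reports_ORR", "44.8%"),
             ("KEYNOTE-024", "phase", "3"),
             ("KEYNOTE-024", "population", "PD-L1-high NSCLC")}
response  = {("KEYNOTE-024", "reports_ORR", "44.8%"),
             ("KEYNOTE-024", "phase", "3"),
             ("KEYNOTE-024", "median_OS", "26.3 months")}  # not in subgraph
print(round(pds(response, retrieved), 3))  # 0.667 -> 66.7% grounding rate
```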

Worked Example

Query: pembrolizumab KEYNOTE-024.

RETRIEVED: ORR 44.8%, phase 3, population.
RESPONSE ADDS: OS median 26.3 months (not in subgraph).

PDS (Grounding Rate) = 2/3 = 66.7% · PDS (Response Precision) = 0.67

The OS figure is a fabrication: it appears in the response but not in the retrieved subgraph.

03

Alignment

OAS-D · OAS-O · OAS-T · OAS-B

Once provenance is established: are the corpus-grounded facts being interpreted against your ontology, your organization's definitions, and within their valid temporal scope — or against the model's parametric knowledge? Four sub-scores:

Formulas & Status

OAS-D (Domain Class Overlap) · IMPLEMENTABLE NOW
|C_retrieved intersect C_response| / |C_retrieved union C_response| x 100

OAS-O (Org Concept Resolution) · IMPLEMENTABLE NOW
(response concepts resolving to enterprise taxonomy / total enterprise-scoped response concepts) x 100

OAS-T (Temporal Validity) · IMPLEMENTABLE NOW
(retrieved facts within valid temporal scope / total retrieved facts with temporal metadata) x 100

OAS-B (Contradiction Detection) · OPEN RESEARCH
1 minus (contradicted concepts / total ontology-mapped concepts)
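
A minimal sketch of the two set-based sub-scores, OAS-D and OAS-T, assuming concepts are already mapped to ontology class identifiers and retrieved facts carry optional valid_from / valid_to metadata; the class names and dates are illustrative. OAS-O would additionally need an enterprise-taxonomy resolver, and OAS-B a contradiction checker.

```python
from datetime import date

def oas_d(retrieved_classes: set[str], response_classes: set[str]) -> float:
    """Domain class overlap: Jaccard of ontology classes, as a percentage."""
    union = retrieved_classes | response_classes
    if not union:
        return 100.0  # assumption: no classes, nothing to misalign
    return len(retrieved_classes & response_classes) / len(union) * 100

def oas_t(retrieved_facts: list[dict], as_of: date) -> float:
    """Temporal validity: share of temporally scoped retrieved facts that
    are still valid on the evaluation date, as a percentage."""
    scoped = [f for f in retrieved_facts
              if f.get("valid_from") or f.get("valid_to")]
    if not scoped:
        return 100.0  # assumption: no temporal metadata, nothing to violate
    valid = [f for f in scoped
             if (f.get("valid_from") or date.min) <= as_of
             and as_of <= (f.get("valid_to") or date.max)]
    return len(valid) / len(scoped) * 100

# Illustrative: the 2016 OS figure is superseded by the 2020 update
facts = [
    {"fact": "median OS (2016 analysis)",
     "valid_from": date(2016, 10, 1), "valid_to": date(2020, 1, 1)},
    {"fact": "median OS (2020 update)",
     "valid_from": date(2020, 1, 1), "valid_to": None},
]
print(oas_t(facts, as_of=date(2024, 6, 1)))  # 50.0 -> a stale fact retrieved
print(oas_d({"Endpoint", "SurvivalOutcome"}, {"Endpoint"}))  # 50.0
```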

Worked Example: KEYNOTE-024 Protocol

  • Class Misalignment: Time to progression mapped to a generic class (OAS-D < 100%).
  • Taxonomy Conflict: Primary endpoint per general convention instead of the protocol SAP (OAS-O < 100%).
  • Temporal Drift: 2016 OS data cited when the 2020 update is current (OAS-T < 100%).
  • Rule Violation: Measurement methodology inconsistent with RECIST 1.1 (OAS-B < 1.0).

04

Completeness

CCS-R · CCS-O

Once alignment is established: did retrieval surface every applicable fact? This is the thesis: verifiability cannot be measured without a defined expected fact set — which is why current benchmarks do not measure it.

Retrieval Recall (CCS-R)

|G_retrieved intersect G_gold| / |G_gold|

CCS-R 1.0 = every expected fact surfaced.

Output Recall (CCS-O)

|R_triples intersect G_gold| / |G_gold|

CCS-O 1.0 = response addressed every expected fact.

Derived Metric

LLM Omission Rate
CCS-R minus CCS-O

In regulated use cases, a retrieved contraindication not surfaced in the response is a failure regardless of retrieval completeness.

Worked Example

G_gold: 4
Retrieved: 3
Response: 2
Omission: 1

CCS-R: 0.75
CCS-O: 0.50
Omission Rate: 0.25
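
A minimal sketch reproducing the worked example above; the gold facts are placeholders standing in for whatever the gold set defines.

```python
def ccs_r(retrieved: set, gold: set) -> float:
    """Retrieval recall: |G_retrieved intersect G_gold| / |G_gold|."""
    return len(retrieved & gold) / len(gold)

def ccs_o(response: set, gold: set) -> float:
    """Output recall: |R_triples intersect G_gold| / |G_gold|."""
    return len(response & gold) / len(gold)

gold      = {"fact_1", "fact_2", "fact_3", "fact_4"}  # 4 expected facts
retrieved = {"fact_1", "fact_2", "fact_3"}            # retrieval gap: fact_4
response  = {"fact_1", "fact_2"}                      # LLM omitted fact_3

r, o = ccs_r(retrieved, gold), ccs_o(response, gold)
print(r, o, r - o)  # 0.75 0.5 0.25  (last value is the LLM Omission Rate)
```
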
05

Signal Clarity

CFS-R · CFS-O

Once completeness is established: did retrieval surface ONLY the applicable facts, or did noise reach the model alongside the signal? Noise in context is not neutral — it increases the probability of off-target or hallucinated responses.

CCS and CFS are the recall-precision pair at triple level. F1 = their harmonic mean.

Retrieval Fidelity (CFS-R)

|G_retrieved intersect G_gold| / |G_retrieved|

CFS-R 1.0 = zero noise in context.

Output Fidelity (CFS-O)

|R_triples intersect G_gold| / |R_triples|

CFS-O 1.0 = zero noise in response.

Derived Metric

Harmonic Mean (F1)
2 x (CCS-R x CFS-R) / (CCS-R + CFS-R)

Diagnostic Table

Scenario | R | O | Insight
1 | H | H | System working correctly.
2 | L | H | LLM filtered noise; not reliable or repeatable.
3 | H | L | LLM introduced noise; cross-reference PDS.
4 | L | L | Noise propagated end to end.

Example Calculation

4 gold facts + 2 noise triples retrieved. Response uses 2 gold + 1 noise.

CCS-R: 1.0
CFS-R: 0.67
F1: 0.80
CCS-O: 0.50
CFS-O: 0.67
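
A minimal sketch reproducing the example calculation above, under the same placeholder-fact convention as the completeness sketch.

```python
def cfs_r(retrieved: set, gold: set) -> float:
    """Retrieval fidelity: |G_retrieved intersect G_gold| / |G_retrieved|."""
    return len(retrieved & gold) / len(retrieved)

def cfs_o(response: set, gold: set) -> float:
    """Output fidelity: |R_triples intersect G_gold| / |R_triples|."""
    return len(response & gold) / len(response)

def f1(ccs_r: float, cfs_r: float) -> float:
    """Harmonic mean of triple-level recall (CCS-R) and precision (CFS-R)."""
    return 2 * ccs_r * cfs_r / (ccs_r + cfs_r)

gold      = {"g1", "g2", "g3", "g4"}           # 4 gold facts
retrieved = gold | {"noise_1", "noise_2"}      # all gold + 2 noise triples
response  = {"g1", "g2", "noise_1"}            # 2 gold + 1 noise

ccs_r_val = len(retrieved & gold) / len(gold)            # CCS-R = 1.0
print(round(cfs_r(retrieved, gold), 2))                  # CFS-R = 0.67
print(round(f1(ccs_r_val, cfs_r(retrieved, gold)), 2))   # F1 = 0.8
print(len(response & gold) / len(gold))                  # CCS-O = 0.5
print(round(cfs_o(response, gold), 2))                   # CFS-O = 0.67
```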

Cross-Property Summary

Metric | Types | What it catches
Retrieval Stability Score | RSS-R / RSS-O | Drift across runs
Provenance Depth Score | PDS | Fabrication (response facts not in subgraph)
Ontology Aligned Score | OAS-D / OAS-O / OAS-T / OAS-B | Concept misalignment across domain, org taxonomy, temporal scope, definitions
Context Coverage Score | CCS-R / CCS-O | Missing expected facts (retrieval gap / LLM omission)
Context Fidelity Score | CFS-R / CFS-O | Noise in context or response
Derived | CCS-R minus CCS-O | LLM omission rate
Derived | CCS + CFS → F1 | Balanced retrieval quality
Intelligence Infrastructure
CogniSwitch
Confidential Technical Reference. Not for public distribution. Benchmark specifications subject to ontology versioning.