Reference v1.4

The Glass Box Benchmark

Regulated industries currently mandate human-in-the-loop oversight for every AI output because no evaluation framework certifies the process behind the answer — only the answer itself.

CQI asks a second question beyond accuracy: can you prove it, repeat it, and defend it? The goal is a glass box, where you can see what happened, not just what came out.

CQI is a framework first and a metric set second. Five properties define what verifiable enterprise AI requires. All metrics are accuracy-agnostic — CQI measures verifiability on top of accuracy, not instead of it.

Quick Reference

Property	Metric(s)	Measures	Gold Set Required
Consistency	RSS-R · RSS-O	Same correct answer, every run	NO
Provenance	PDS	Response facts grounded in retrieved subgraph	NO
Alignment	OAS-D · OAS-O · OAS-T · OAS-B	Response concepts aligned to domain ontology, enterprise taxonomy, and temporal scope	NO
Completeness	CCS-R · CCS-O	Retrieval surfaced every expected fact	YES
Signal Clarity	CFS-R · CFS-O	Retrieval contained only relevant facts	YES

Consistency

RSS-R · RSS-O

This is the most visible property — the first thing a user notices and the first thing that erodes trust. Enterprises will not proceed to evaluate any other property if outputs are inconsistent. Consistency is where trust assessment begins.

Calculation

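Run the same question N times and extract triples from each run (retrieved triples for RSS-R, triples mined from the response for RSS-O). For each pair of runs, calculate the fraction of identical triples, then average across run pairs and questions. A score of 1.0 means no drift. A score below 1.0 means drift proportional to the distance from 1.0.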
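A minimal sketch of this calculation, assuming each run's triples are available as a set of (subject, predicate, object) tuples. The source does not specify how the pairwise "fraction of identical triples" is normalized; Jaccard overlap is used here as an assumption, and all function names and sample triples are hypothetical.

```python
from itertools import combinations
from statistics import mean

Triple = tuple[str, str, str]


def pair_overlap(a: set[Triple], b: set[Triple]) -> float:
    """Jaccard overlap of two runs' triple sets (assumed pairwise measure)."""
    if not a and not b:
        return 1.0  # two empty runs are treated as identical
    return len(a & b) / len(a | b)


def rss_for_question(runs: list[set[Triple]]) -> float:
    """Average overlap across every pair of runs for one question."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure drift")
    return mean(pair_overlap(a, b) for a, b in pairs)


def rss(runs_by_question: dict[str, list[set[Triple]]]) -> float:
    """Average the per-question scores across all questions."""
    return mean(rss_for_question(runs) for runs in runs_by_question.values())


# Hypothetical data: three runs of one question; the third run drifts.
stable = {("drug_a", "treats", "condition_x"), ("drug_a", "dose", "200mg")}
drifted = {("drug_a", "treats", "condition_x"), ("drug_a", "dose", "100mg")}
score = rss({"q1": [stable, stable, drifted]})
no_drift = rss({"q1": [stable, stable, stable]})
```

Passing retrieved triples gives an RSS-R style score and passing response-mined triples gives an RSS-O style score; the computation is the same in both cases.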

Comparative Performance

Framework	RSS-R	RSS-O
CogniSwitch	1.0	~1.0
Vector RAG	< 1.0	< 1.0

* Vector RAG: probabilistic retrieval inherently drifts across runs.

Provenance

PDS

Once consistency is established: did every output fact come from the defined corpus, or did the system draw on external knowledge? Provenance is distinct from post-hoc citation: a citation is a claim, while provenance is a constraint. The architecture enforces it, not the prompt.

Calculation

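Mine the response for triples via a 3-stage ontology-guided pipeline (NLP pre-processing, ontology-guided extraction, validation), then match them against the retrieved subgraph. Two framings, numerically equivalent:

(A) Grounding Rate: |matched triples| / |response triples| x 100
(B) Response Precision: |R intersect G| / |R|, where R is the set of response triples and G is the set of triples in the retrieved subgraph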
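The two framings can be sketched as one function, assuming triples have already been mined and normalized so that exact set membership counts as a match. The function name and the sample triples are hypothetical, and the matching rule of the actual pipeline may be looser than exact equality.

```python
Triple = tuple[str, str, str]


def pds(response: set[Triple], retrieved: set[Triple]) -> tuple[float, float]:
    """Return (grounding rate in percent, response precision in [0, 1])."""
    if not response:
        raise ValueError("response contains no triples to ground")
    matched = response & retrieved
    precision = len(matched) / len(response)
    return precision * 100, precision


# Hypothetical triples: two response facts are in the subgraph, one is not.
retrieved = {
    ("drug_a", "trial_phase", "3"),
    ("drug_a", "response_rate", "high"),
}
response = retrieved | {("drug_a", "survival", "unreported_value")}
grounding_rate, response_precision = pds(response, retrieved)
ungrounded = response - retrieved  # candidates for fabrication review
```

Both values come from the same ratio, which is why the framings are numerically equivalent; only the scale differs.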

Worked Example

Query: pembrolizumab, KEYNOTE-024.

RETRIEVED: ORR 44.8%, phase 3, population.

RESPONSE ADDS: median OS 26.3 months (not in subgraph).

PDS (A) = 66.7% / PDS (B) = 0.67

The OS figure is not grounded in the retrieved subgraph, so it is scored as a fabrication.

Alignment

OAS-D · OAS-O · OAS-T · OAS-B

Once provenance is established: are the corpus-grounded facts being interpreted against your ontology, your organization's definitions, and within their valid temporal scope — or against the model's parametric knowledge? Four sub-scores:

Formulas & Status

OAS-D (Domain Class Overlap) · IMPLEMENTABLE NOW

|C_retrieved intersect C_response| / |C_retrieved union C_response| x 100

OAS-O (Org Concept Resolution) · IMPLEMENTABLE NOW

(response concepts resolving to enterprise taxonomy / total enterprise-scoped response concepts) x 100

OAS-T (Temporal Validity) · IMPLEMENTABLE NOW

(retrieved facts within valid temporal scope / total retrieved facts with temporal metadata) x 100

OAS-B (Contradiction Detection) · OPEN RESEARCH

1 minus (contradicted concepts / total ontology-mapped concepts)
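A rough sketch of the four formulas, assuming concepts and classes are already resolved to string identifiers and that temporal metadata is stored as `valid_from` / `valid_to` dates. Those field names, the function names, and the sample inputs are all hypothetical. For OAS-B the sketch only applies the formula to counts supplied by the caller; detecting contradictions is the open research part and is not attempted here.

```python
from datetime import date


def oas_d(retrieved_classes: set[str], response_classes: set[str]) -> float:
    """OAS-D: ontology class overlap (intersection over union), in percent."""
    union = retrieved_classes | response_classes
    if not union:
        raise ValueError("no ontology classes on either side")
    return len(retrieved_classes & response_classes) / len(union) * 100


def oas_o(scoped_concepts: set[str], taxonomy: set[str]) -> float:
    """OAS-O: percent of enterprise-scoped response concepts that resolve."""
    if not scoped_concepts:
        raise ValueError("no enterprise-scoped concepts in the response")
    return len(scoped_concepts & taxonomy) / len(scoped_concepts) * 100


def oas_t(facts: list[dict], as_of: date) -> float:
    """OAS-T: percent of temporally tagged retrieved facts valid on as_of."""
    tagged = [f for f in facts if "valid_from" in f and "valid_to" in f]
    if not tagged:
        raise ValueError("no retrieved facts carry temporal metadata")
    valid = [f for f in tagged if f["valid_from"] <= as_of <= f["valid_to"]]
    return len(valid) / len(tagged) * 100


def oas_b(contradicted: int, mapped: int) -> float:
    """OAS-B: 1 minus the share of ontology-mapped concepts contradicted."""
    if mapped <= 0:
        raise ValueError("no ontology-mapped concepts")
    return 1 - contradicted / mapped


# Hypothetical inputs for illustration only.
d = oas_d({"Endpoint", "Drug", "Trial"}, {"Endpoint", "Drug", "Outcome"})
o = oas_o({"primary_endpoint", "orr", "ttp"}, {"primary_endpoint", "orr"})
facts = [
    {"id": "f1", "valid_from": date(2016, 1, 1), "valid_to": date(2019, 12, 31)},
    {"id": "f2", "valid_from": date(2020, 1, 1), "valid_to": date(2099, 12, 31)},
    {"id": "f3"},  # no temporal metadata, excluded from the denominator
]
t = oas_t(facts, as_of=date(2024, 6, 1))
b = oas_b(contradicted=1, mapped=4)
```

Note the scale difference carried over from the formulas: OAS-D, OAS-O, and OAS-T are percentages, while OAS-B is a fraction between 0 and 1.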

Worked Example: KEYNOTE-024 Protocol

Class Misalignment

Time to progression mapped to generic class (OAS-D < 100%).

Taxonomy Conflict

Primary endpoint per general convention instead of protocol SAP (OAS-O < 100%).

Temporal Drift

2016 OS data cited when 2020 update is current (OAS-T < 100%).

Rule Violation

Measurement methodology inconsistent with RECIST 1.1 (OAS-B < 1.0).

Completeness

CCS-R · CCS-O

Once alignment is established: did retrieval surface every applicable fact? This is the thesis: verifiability cannot be measured without a defined expected fact set — which is why current benchmarks do not measure it.

Retrieval Recall (CCS-R)

|G_retrieved intersect G_gold| / |G_gold|

CCS-R 1.0 = every expected fact surfaced.

Output Recall (CCS-O)

|R_triples intersect G_gold| / |G_gold|

CCS-O 1.0 = response addressed every expected fact.

Derived Metric

LLM Omission Rate

CCS-R minus CCS-O

In regulated use cases, a retrieved contraindication not surfaced in the response is a failure regardless of retrieval completeness.
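A minimal sketch of both recall scores and the derived omission rate, assuming facts are represented by comparable identifiers. The identifiers and the `recall` helper are hypothetical stand-ins for triples and for whatever matching the real pipeline performs.

```python
def recall(found: set[str], gold: set[str]) -> float:
    """Fraction of gold facts that appear in `found`."""
    if not gold:
        raise ValueError("gold set is empty")
    return len(found & gold) / len(gold)


# Hypothetical fact identifiers standing in for triples.
gold = {"f1", "f2", "f3", "f4"}
retrieved = {"f1", "f2", "f3", "x9"}  # f4 missed by retrieval, x9 is noise
response = {"f1", "f2"}               # f3 was retrieved but left out

ccs_r = recall(retrieved, gold)       # retrieval recall
ccs_o = recall(response, gold)        # output recall
omission_rate = ccs_r - ccs_o         # share of gold facts lost by the LLM
omitted = (retrieved & gold) - response
```

Listing `omitted` alongside the rate is a design choice for auditability: the rate says how much was dropped, the set says which facts to review.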

Worked Example

[Figure: fact panels for G_GOLD, RETRIEVED, RESPONSE, and OMISSION]

CCS-R: 0.75

CCS-O: 0.50

LLM Omission Rate: 0.25

Signal Clarity

CFS-R · CFS-O

Once completeness is established: did retrieval surface ONLY the applicable facts, or did noise reach the model alongside the signal? Noise in context is not neutral — it increases the probability of off-target or hallucinated responses.

CCS and CFS are the recall-precision pair at triple level. F1 = their harmonic mean.

Retrieval Fidelity (CFS-R)

|G_retrieved intersect G_gold| / |G_retrieved|

CFS-R 1.0 = zero noise in context.

Output Fidelity (CFS-O)

|R_triples intersect G_gold| / |R_triples|

CFS-O 1.0 = zero noise in response.

Derived Metric

Harmonic Mean (F1)

2 x (CCS-R x CFS-R) / (CCS-R + CFS-R)

Diagnostic Table

Scenario	CFS-R	CFS-O	Insight
1	High	High	System working correctly.
2	Low	High	LLM filtered the noise, but this is not reliable or repeatable.
3	High	Low	LLM introduced noise; cross-reference PDS.
4	Low	Low	Noise propagated end to end.
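The table can be read as a lookup on the two fidelity scores. The sketch below assumes the R and O columns refer to CFS-R and CFS-O, and it uses a cutoff of 0.8 to separate high from low. That cutoff is a hypothetical value chosen for illustration; the source does not define one.

```python
def diagnose(cfs_r: float, cfs_o: float, threshold: float = 0.8) -> str:
    """Map a (CFS-R, CFS-O) pair to the matching diagnostic insight."""
    high_r = cfs_r >= threshold
    high_o = cfs_o >= threshold
    insights = {
        (True, True): "System working correctly.",
        (False, True): "LLM filtered noise, not reliable or repeatable.",
        (True, False): "LLM introduced noise, cross-ref PDS.",
        (False, False): "Noise propagated end to end.",
    }
    return insights[(high_r, high_o)]


# Hypothetical scores: noisy retrieval, clean response (scenario 2).
insight = diagnose(cfs_r=0.55, cfs_o=0.95)
```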

Example Calculation

The gold set has 4 facts. Retrieval returns all 4 gold facts plus 2 noise triples. The response uses 2 gold facts and 1 noise triple.

CCS-R 1.0

CFS-R 0.67

F1 0.80

CCS-O 0.50

CFS-O 0.67
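The figures above can be reproduced with a few set operations. This is a sketch under the assumption that the gold set contains exactly the 4 gold facts mentioned; the identifiers `g1`..`g4` and `n1`, `n2` and the helper names are hypothetical.

```python
def recall(found: set[str], gold: set[str]) -> float:
    """CCS: share of gold facts present in `found`."""
    return len(found & gold) / len(gold)


def precision(found: set[str], gold: set[str]) -> float:
    """CFS: share of `found` that belongs to the gold set."""
    return len(found & gold) / len(found)


def f1(r: float, p: float) -> float:
    """Harmonic mean of recall and precision."""
    return 0.0 if r + p == 0 else 2 * r * p / (r + p)


gold = {"g1", "g2", "g3", "g4"}
retrieved = gold | {"n1", "n2"}   # 4 gold facts + 2 noise triples
response = {"g1", "g2", "n1"}     # 2 gold facts + 1 noise triple

ccs_r = recall(retrieved, gold)       # 4/4
cfs_r = precision(retrieved, gold)    # 4/6
retrieval_f1 = f1(ccs_r, cfs_r)
ccs_o = recall(response, gold)        # 2/4
cfs_o = precision(response, gold)     # 2/3
```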

Cross-Property Summary

Metric	Types	What it catches
Retrieval Stability Score	RSS-R / RSS-O	Drift across runs
Provenance Depth Score	PDS	Fabrication (response facts not in subgraph)
Ontology Aligned Score	OAS-D / OAS-O / OAS-T / OAS-B	Concept misalignment across domain, org taxonomy, temporal scope, definitions
Context Coverage Score	CCS-R / CCS-O	Missing expected facts (retrieval gap / LLM omission)
Context Fidelity Score	CFS-R / CFS-O	Noise in context or response
Derived	CCS-R minus CCS-O	LLM omission rate
Derived	F1 (harmonic mean of CCS-R and CFS-R)	Balanced retrieval quality