Criteria Authoring Workbench
Transforming SOPs from subjective documents into executable logic. A practical manual for writing audit-ready criteria.
Why Criteria Fail
Your agent works. Sort of. It handles the happy path. It sounds professional. Demo goes great. But when you deploy to production, automation rates stall at 15-20%. Customer complaints trickle in. Your engineering team asks: "What exactly should we fix?"
You don't know. Because your quality audits aren't telling you anything useful. The root cause isn't your agent. It's your criteria.
The Hidden Cost of Vague Criteria
Consider the criterion "Clear explanation provided." Clear to whom? Which explanation? Three auditors will return three different verdicts, and none of them can point to evidence in the transcript.
The Courtroom Test
If your criterion would make sense in a courtroom deposition but feels strange in a normal conversation, you've written it wrong.
Courtroom: "State your full legal name for the record."
Natural: "And I have you as John Doe, correct?"
Criteria should allow for natural flexibility. Judge the outcome (the name was confirmed), not the exact action used to get there.
Grouping Strategy
Before you write a single criterion, decide how you'll organize them. This isn't about neatness. It's about diagnosis. When your agent fails 23 interactions, knowing they all failed in "Clinical Data Collection" tells you exactly which LLM chain to debug.
Admin
The mechanical stuff. Identity verification, call setup, consent. Low complexity, high compliance.
Core Process
The actual job. Clinical assessment, lending qualification. Where domain expertise lives.
Empathy / CX
The human layer. Active listening, tone adjustment. Hardest to quantify.
Compliance
Mandatory disclosures. Verbatim requirements. Regulatory boxes that must be checked.
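The grouping can be made concrete as a plain mapping from category to criterion names (all names here are illustrative), so a failure roll-up points at one pipeline stage instead of a flat list of fails:

```python
# Minimal sketch: criteria grouped by category, with a roll-up of
# failures per group. Criterion names are invented for illustration.
CRITERIA = {
    "admin": ["identity_verified", "consent_recorded"],
    "core_process": ["clinical_assessment_complete"],
    "empathy_cx": ["no_rudeness_detected"],
    "compliance": ["recording_disclosure_verbatim"],
}

def failures_by_group(results):
    """Map each group to the criteria that failed in it."""
    report = {}
    for group, names in CRITERIA.items():
        failed = [n for n in names if results.get(n) == "fail"]
        if failed:
            report[group] = failed
    return report

results = {"identity_verified": "pass", "consent_recorded": "fail",
           "clinical_assessment_complete": "fail"}
# failures_by_group(results) -> {"admin": ["consent_recorded"],
#                                "core_process": ["clinical_assessment_complete"]}
```

A spike in one group now tells you which LLM chain to debug, which is the whole point of grouping.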
The Granularity Trap
Each conditional you bolt on feels justified. "But what if the customer spells their name wrong?" "What if they give a nickname?"
Stop. The 80/20 Rule applies here. Does this edge case occur in more than 5% of interactions? If not, let it fail the primary criterion.
The 3-Deep Rule
"For any single SOP requirement, you get one primary criterion and up to two conditionals. That's it."
- 1 Primary Criterion (Core)
- Max 2 Conditional Criteria
- Any more is over-engineering
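The rule is easy to enforce structurally. A hypothetical sketch using a dataclass that rejects a third conditional (the requirement wording is invented):

```python
# Hypothetical structure enforcing the 3-Deep Rule: one primary
# criterion plus at most two conditionals per SOP requirement.
from dataclasses import dataclass, field

@dataclass
class Requirement:
    primary: str
    conditionals: list = field(default_factory=list)

    def __post_init__(self):
        if len(self.conditionals) > 2:
            raise ValueError("3-Deep Rule: max 2 conditional criteria")

name_capture = Requirement(
    primary="Agent confirms the customer's name",
    conditionals=["If spelling is unclear, agent asks them to spell it",
                  "If a nickname is given, agent confirms the legal name"],
)
```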
The Transformation Method
You have an SOP written for humans. You need criteria machines can evaluate. Here is the algorithm to get from paragraphs to atomic logic.
Decompose
Read the SOP aloud. Every time you hear 'and' or a comma, that's likely a split point.
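This split step can be roughly mechanized (the SOP sentence below is illustrative):

```python
# Rough sketch of the decompose step: split an SOP sentence on
# commas and "and" to surface candidate atomic criteria.
import re

def decompose(sop_sentence):
    parts = re.split(r",|\band\b", sop_sentence)
    return [p.strip() for p in parts if p.strip()]

sop = "Verify the caller's identity, confirm consent to record, and state the call purpose"
# decompose(sop) -> ["Verify the caller's identity",
#                    "confirm consent to record",
#                    "state the call purpose"]
```

The output is a candidate list, not a final one: a human still decides which fragments are real criteria.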
Observable Moments
The Transcript Test: If you can't Ctrl+F for evidence, the criterion isn't observable.
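The Transcript Test can be run literally, assuming a plain-text transcript; the patterns below are illustrative:

```python
# Mechanized Transcript Test: if no search pattern can locate
# evidence in the transcript, the criterion is not observable.
import re

def has_evidence(transcript, pattern):
    return re.search(pattern, transcript, re.IGNORECASE) is not None

transcript = "Agent: This call may be recorded for quality purposes."

has_evidence(transcript, r"call may be recorded")  # observable
has_evidence(transcript, r"built rapport")         # not Ctrl+F-able
```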
Strong Fail Definitions
The fail definition isn't just "the opposite of pass." It must cover partial completion.
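One way to sketch this, with hypothetical verification states: partial completion gets its own named fail instead of collapsing into "not pass":

```python
# Sketch of a fail definition that names partial completion
# explicitly. States and labels are invented for illustration.
def grade_verification(spelled_back, customer_confirmed):
    if spelled_back and customer_confirmed:
        return "pass"
    if spelled_back and not customer_confirmed:
        # Partial completion is still a fail: evidence without confirmation.
        return "fail:unconfirmed"
    return "fail:not_attempted"

grade_verification(True, True)    # "pass"
grade_verification(True, False)   # "fail:unconfirmed"
grade_verification(False, False)  # "fail:not_attempted"
```

Naming the partial state matters for diagnosis: "fail:unconfirmed" and "fail:not_attempted" point at different fixes.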
Quantification
Quantified criteria remove the last traces of subjectivity. There is no debate about whether 94% is less than 95%.
Use the Baseline Method: Don't guess thresholds. Run 20 calls, plot the data, and find the natural gap between "clearly bad" and "clearly good".
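The Baseline Method can be sketched as finding the widest gap in sorted baseline scores and placing the threshold in its midpoint (the scores below are invented for illustration):

```python
# Minimal sketch of the Baseline Method: sort baseline scores and
# put the threshold in the largest gap between adjacent values.
def baseline_threshold(scores):
    s = sorted(scores)
    gaps = [(s[i + 1] - s[i], i) for i in range(len(s) - 1)]
    width, i = max(gaps)
    return (s[i] + s[i + 1]) / 2  # midpoint of the widest gap

# Illustrative scores from a baseline batch (abridged):
scores = [0.52, 0.55, 0.58, 0.61, 0.63,        # clearly bad
          0.88, 0.90, 0.91, 0.93, 0.94, 0.95]  # clearly good
# baseline_threshold(scores) ≈ 0.755, the midpoint of the 0.63–0.88 gap
```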
Ratio-Based
Measures one thing relative to another.
Count-Based
Measures occurrences of specific behaviors.
Time-Based
Measures when something happens or duration.
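The three shapes, sketched against a toy transcript model (field names are assumptions, not a real schema):

```python
# Toy transcript model illustrating the three quantification shapes.
turns = [
    {"speaker": "agent",    "start_s": 0,  "interrupted": False},
    {"speaker": "customer", "start_s": 9,  "interrupted": True},
    {"speaker": "agent",    "start_s": 15, "interrupted": False},
]

# Ratio-based: agent turns relative to all turns.
agent_talk_ratio = sum(t["speaker"] == "agent" for t in turns) / len(turns)

# Count-based: occurrences of a specific behavior.
interruptions = sum(t["interrupted"] for t in turns)

# Time-based: when something first happens.
first_agent_turn_s = min(t["start_s"] for t in turns if t["speaker"] == "agent")
```

Each produces a number you can put a threshold on, which is what makes the criterion debatable-proof.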
The Default State Problem
Checking for "Empathy" often yields 98% pass rates because professional behavior is the baseline. This is noise. To find the signal (the 2% of bad calls), flip the criterion to hunt for "Rudeness."
You run a criterion against 500 interactions. 487 Pass, 13 Fail. Your report says "97.4% Empathy Rate".
You spent tokens evaluating 487 interactions that were... fine. Normal. Default professional behavior.
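Flipping the criterion can be as simple as hunting for violation markers instead of grading every call (the marker list is illustrative, not a production rudeness detector):

```python
# Sketch of a flipped default-state criterion: flag the rare
# rudeness violations instead of "passing" empathy 487 times.
RUDE_MARKERS = ["that's not my problem", "calm down", "like i said"]

def flag_rudeness(transcript):
    text = transcript.lower()
    return [m for m in RUDE_MARKERS if m in text]

calls = [
    "Agent: Happy to help with that today.",
    "Agent: Like I said, calm down and listen.",
]
flagged = [c for c in calls if flag_rudeness(c)]
# Only the violation surfaces for review; default-professional
# calls are never individually evaluated.
```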
Logic Gates
You write 25 criteria for "Debt Collection." But 8 only apply to employed borrowers, and 6 only apply to hardship cases.
If you evaluate a hardship call against employed criteria (e.g., "Set Payment Date"), your agent fails. These are False Failures. They destroy trust in your audit data.
The False Failure Simulator
Without Logic Gates, the system evaluates every criterion blindly: the agent is penalized for not asking a jobless person for money. With gates, inapplicable criteria are skipped instead of failed, and your audit scores reflect what actually went wrong.
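A minimal sketch of the gate pattern, where each criterion declares when it applies (criterion names and call fields are illustrative):

```python
# Logic gates: each criterion carries an applicability predicate,
# and inapplicable criteria are skipped, not failed.
CRITERIA = [
    {"name": "set_payment_date",   "applies": lambda c: c["employed"]},
    {"name": "offer_hardship_plan", "applies": lambda c: c["hardship"]},
    {"name": "verify_identity",    "applies": lambda c: True},
]

def applicable(call):
    return [c["name"] for c in CRITERIA if c["applies"](call)]

hardship_call = {"employed": False, "hardship": True}
# applicable(hardship_call) -> ["offer_hardship_plan", "verify_identity"]
# "set_payment_date" never fires, so no False Failure is recorded.
```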
Common Pitfalls
A quick reference checklist. If you see these patterns, refactor immediately.
The 'Appropriately' Trap
"Your criterion uses words like 'appropriately', 'properly', 'correctly', 'adequately'."
Replace the judgment word with the observable behavior that defines 'appropriate'.
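A before/after sketch (wording invented for illustration):

```python
# The judgment word, replaced by the observable behaviors that
# define it. Both versions are illustrative examples.
before = "Agent responds appropriately to customer frustration"

after = [
    "Agent acknowledges the frustration in their next turn",
    "Agent does not interrupt the customer mid-sentence",
    "Agent offers a concrete next step before ending the call",
]
```

Each line in the rewritten version survives the Transcript Test; the original does not.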
Put Your Criteria to Work
You've written audit-ready criteria. Now enforce them in real-time and verify compliance at scale.