Understanding How Model Bias Impacts Agent Outputs

Simon Wardley has spent the past year making an uncomfortable argument about AI. In "AI and the New Theocracies," he warns that the real power in generative AI is the power to shape how people think, and that whoever writes the guardrails becomes, in his words, the high priest of a new theocracy. His answer is openness all the way down, to the training data itself: the case for sovereign AI. Strip away the politics and a narrower point survives. The values a model enforces are someone's values. What one team calls alignment, another experiences as bias.

It's easy to overlook bias and alignment in the context of enterprise AI. You are not asking models for their point of view on ethical problems, or their take on the trolley problem. You just want your voice agent to run mortgage collections, or do risk assessments, or screen the 5,000 candidates who applied for one intern position. As with most things in generative AI, there is more here than meets the eye. Hence this ELI5 (explain like I'm five).

Start with a result that is hard to wave away. In a 2026 study on LLM medical triage, researchers fed several leading models the same patient profile (a persistent headache, blurred vision, and morning nausea) and changed one variable: the patient's sex. The same symptoms returned a more urgent recommendation for the man than for the woman.

Look at what the model actually did. It was asked one thing, how urgent the case is, and it answered by leaning on what it had learned about patients who present like this. For the woman, what it had learned was tied to her sex, so the urgency it returned was tied to her sex too.

That is what bias is here. An assumption the model carries into the case, strong enough to outweigh the case itself. An assumption gets made somewhere, for some reason, which means it sits somewhere you can find. It has a location.

Where does this particular assumption come from? From what the model absorbed in training. That is the next thing to look at.

What the model learned from

Start with what the triage model leaned on: a base rate, learned from a body of text. So the real question is what went into that text.

The answer is narrow. GPT-3's training corpus was about 93% English by word count. Llama 2, one of the few models whose makers published the split, came in at 89.7%. The Common Crawl that feeds most of them runs somewhere between 40 and 46% English depending on the crawl, with the next language a distant second, and American sources dominate it. A model built on that learns how one slice of the world writes, and most of that slice writes from the same few countries.

That slice carries more than facts. It carries a worldview, and the model takes it on. A 2024 study in PNAS Nexus ran five GPT models against the World Values Survey and found their answers lined up with English-speaking and Protestant European countries, and drifted from everywhere else. Georgia Tech researchers found the tilt held even when models were prompted in Arabic, or trained only on Arabic. The default rides in the model itself, which is why switching the input language does not move it.

This is where it stops being a medical curiosity and becomes your problem. The triage model sorted a person into a category, urgent or not, by leaning on the average instead of the case.

Many systems sort people the same way. An AI-based applicant tracking system ranking résumés is doing category assignment, strong candidate or weak one, and, left to the model's defaults, it scores "strong" the way the internet scores it.

Researchers who studied AI hiring evaluations titled their study "Invisible Filters," because the recommendation looks neutral while carrying someone else's idea of a good answer. A health system scoring patients into risk tiers inherits the same defaults.

"Are we biased?" is the wrong question. Detecting bias doesn't tell you where it entered. That's what matters.

"Can you locate the bias?"

The work is to locate the bias, not to reduce it. A bias you can point to, sitting at the step that produced it, is one a person can examine and argue with.

Noise is ordinary. Bias is a direction

A model gives slightly different answers every time you run it. That scatter is noise. Change a variable the answer should not depend on, run it again and again, and watch the answer move the same way every time. Run the woman's case a hundred times and the man's a hundred times: the answers cluster in two places, hers low and his high, and they stay that way. The noise is inside both clouds. The gap between them is the bias. A model being inconsistent is ordinary. A model being consistently wrong in one direction, against one group, is the finding.
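The distinction between scatter and direction can be put in numbers. The sketch below is only an illustration: `simulated_urgency` is a stand-in for a real model call, and the base scores (7.0 and 5.0) and the spread are invented for the example, not taken from the study. It compares the gap between the two clouds with the scatter inside them.

```python
import random
import statistics


def gap_and_noise(scores_a, scores_b):
    """Return (gap between group means, average within-group spread)."""
    gap = statistics.mean(scores_a) - statistics.mean(scores_b)
    noise = (statistics.stdev(scores_a) + statistics.stdev(scores_b)) / 2
    return gap, noise


def simulated_urgency(base, rng, spread=0.5):
    # Stand-in for one model call: a base urgency plus run-to-run scatter.
    return base + rng.gauss(0, spread)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical numbers: same symptoms, only the patient's sex differs.
man = [simulated_urgency(7.0, rng) for _ in range(100)]
woman = [simulated_urgency(5.0, rng) for _ in range(100)]

# Control: the same case run twice, where only noise should separate the runs.
rerun_a = [simulated_urgency(5.0, rng) for _ in range(100)]
rerun_b = [simulated_urgency(5.0, rng) for _ in range(100)]

gap, noise = gap_and_noise(man, woman)
control_gap, control_noise = gap_and_noise(rerun_a, rerun_b)

print(f"man vs woman: gap={gap:.2f}, scatter={noise:.2f}")
print(f"same case twice: gap={control_gap:.2f}, scatter={control_noise:.2f}")
```

A gap that is several times the scatter, and that keeps its sign across reruns, is the direction the paragraph describes. A gap that sits inside the scatter is just noise.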

To see where that gap comes from, break the workflow into three stages. The first is your data. The second is retrieval, pulling the relevant facts from that data. The third is reasoning, turning those facts into a decision. In the standard setup the model owns the last two. It decides what counts as relevant, and it decides the conclusion. Two stages, both inside the model, both closed. When a biased output appears, you cannot tell whether the model pulled the wrong facts or reasoned wrongly over the right ones.

In the triage case, the input carried no bias to remove, and the output skewed anyway. The skew was produced downstream, in the two stages the model owns.

Converting black boxes into glass boxes

Retrieval is the usual answer to that, and it helps less than it looks. A retrieval system changes what the model reads before it decides. Feed it your own documents and the added context can pull the answer toward your world, which is worth doing. But retrieval only feeds the reasoning. The step where the model turns facts into a decision still runs on the model's own judgment, and that judgment is the worldview. Chain-of-thought hides this, because the model writes out something that reads like rule-following while the worldview makes the actual call. The reasoning trace shows you what the model said it considered, not what determined the call. And whatever retrieval does to the size of the bias, it never tells you where the bias entered.

The fix follows from the diagnosis. Take the model out of the two stages where you cannot watch it.

Leave it the one job it is good at: reading the messy input and turning it into structured facts. A patient's description becomes a set of recorded symptoms. A résumé becomes a list of stated qualifications. This is perception, and it is checkable. You can hold the extracted facts against the source and see whether they match.
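One way to make "checkable" concrete is to have the extractor return, next to each fact, the span of text it read that fact from, and then verify the span against the source. The record format below (`fact` and `evidence` keys) is an assumption made for this sketch, not a format the essay or any particular library prescribes.

```python
def unsupported_facts(source_text, extracted):
    """Return the extracted facts whose quoted evidence is not in the source."""
    # Normalize case and whitespace so line breaks do not cause false alarms.
    normalized = " ".join(source_text.lower().split())
    missing = []
    for item in extracted:
        quote = " ".join(item["evidence"].lower().split())
        if not quote or quote not in normalized:
            missing.append(item["fact"])
    return missing


note = (
    "Patient reports a persistent headache, blurred vision, "
    "and nausea in the morning."
)

# Hypothetical extractor output: each fact carries the span it was read from.
extracted = [
    {"fact": "persistent_headache", "evidence": "persistent headache"},
    {"fact": "blurred_vision", "evidence": "blurred vision"},
    {"fact": "fever", "evidence": "high fever"},  # not supported by the note
]

print(unsupported_facts(note, extracted))
```

A verbatim-span check is deliberately crude. It catches facts with no footing in the source, which is the failure that matters at this stage; it does not judge whether a supported fact was labeled correctly.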

Then make retrieval deterministic. Guide it with an ontology, an explicit map of the domain. Instead of pulling whatever sits closest, retrieval follows the map, so the same case pulls the same facts every time, each one traceable to where it came from. The model is no longer the one deciding what counts as relevant.
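A minimal sketch of what following the map can look like, assuming the ontology is a plain lookup from symptom to the findings it points at. The `ONTOLOGY` entries, the finding names, and the source labels (`guideline-A`, `guideline-B`) are placeholders invented for illustration, not clinical guidance.

```python
# Hypothetical ontology fragment: symptom -> (finding, source) edges.
# Findings and source labels are placeholders, not clinical guidance.
ONTOLOGY = {
    "persistent_headache": [("possible_raised_intracranial_pressure", "guideline-A")],
    "blurred_vision": [("possible_raised_intracranial_pressure", "guideline-A")],
    "morning_nausea": [("possible_raised_intracranial_pressure", "guideline-B")],
}


def retrieve(symptoms, ontology=ONTOLOGY):
    """Follow ontology edges from each symptom; return a sorted, traceable list."""
    facts = set()
    for symptom in symptoms:
        for finding, source in ontology.get(symptom, []):
            # Each fact records which symptom led to it and where the edge came from.
            facts.add((finding, symptom, source))
    return sorted(facts)  # sorting makes the output independent of input order


case = ["morning_nausea", "persistent_headache", "blurred_vision"]
for fact in retrieve(case):
    print(fact)
```

The patient's sex is not an input to `retrieve`, so it cannot change what is pulled. If a variable like that should matter for a given domain, it has to be written into the map as an explicit edge, where someone can read it.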

Then make the reasoning run on that same map and a small set of rules over it. This is the part people get wrong when they hear the word rules, picturing someone hand-writing a line for every case until the list buckles. That is not it. The map already holds the relationships, so a rule can sit at the level of the policy and let the structure carry it across cases. Take the urgency rule the study itself points to: elevated pressure inside the skull needs urgent evaluation, whatever the cause. You write that once. The map knows that both a tumor and the condition more common in women raise that pressure, so the single rule reaches every case beneath it. A clinician can read it and defend it, and it applies the same way to the man and the woman. Put the decision there, and the model's averaged idea of who gets which diagnosis never enters it.
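The urgency rule can be sketched in a few lines, assuming the map stores "raises" relationships between conditions and their effects. The `RAISES` edges and node names are hypothetical, and the condition the essay leaves unnamed is kept as a placeholder label.

```python
# Hypothetical "raises" edges from the map: condition -> what it raises.
RAISES = {
    "tumor": ["intracranial_pressure"],
    "condition_more_common_in_women": ["intracranial_pressure"],  # placeholder name
}

# The policy, written once: anything that raises these needs urgent evaluation.
URGENT_IF_RAISED = {"intracranial_pressure"}


def effects(condition, raises):
    """Everything a condition raises, following edges transitively."""
    seen, stack = set(), [condition]
    while stack:
        node = stack.pop()
        for nxt in raises.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_urgent(candidate_conditions, raises=RAISES):
    """Apply the single urgency rule to every candidate condition."""
    return any(effects(c, raises) & URGENT_IF_RAISED for c in candidate_conditions)


print(is_urgent(["tumor"]))
print(is_urgent(["condition_more_common_in_women"]))
print(is_urgent(["tension_headache"]))
```

The rule never mentions a specific diagnosis. Adding a new cause of raised pressure means adding one edge to the map, and the rule reaches it without being rewritten.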

The shape is easy to state. The model perceives. The system decides. The worldview no longer gets a vote on what your category means.

Now the decomposition pays off. With each stage open, a biased output has somewhere to point, and the place you point to tells you the fix.

Fig 1.0: Where bias enters once the stages are open

| Stage | Bias source | The fix |
| --- | --- | --- |
| Data | Your own history showing through | Clean it |
| Retrieval | Wrong facts pulled | Fix the retrieval rule or the ontology |
| Reasoning | Right facts, wrong call | Fix the decision rule |

When the model owned retrieval and reasoning, these three collapsed into one closed box. A bad output could have come from anywhere inside it, so there was nothing specific to fix. Pulling the model out of those stages is what gives each kind of bias its own address and its own repair.
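With the stages open, finding the address can be mechanical. The sketch below assumes each run leaves a trace recording what every stage passed on, with the changed variable itself left out of the trace. Both the trace format and the example values are hypothetical. It compares two cases that should have been treated alike and reports the first stage where they part ways.

```python
STAGES = ("data", "retrieval", "reasoning")

# The repair for each stage, as listed in the essay's table.
FIXES = {
    "data": "clean the data",
    "retrieval": "fix the retrieval rule or the ontology",
    "reasoning": "fix the decision rule",
}


def first_divergence(trace_a, trace_b):
    """Return the first stage whose recorded output differs, or None."""
    for stage in STAGES:
        if trace_a[stage] != trace_b[stage]:
            return stage
    return None


# Hypothetical traces for two cases that differ only in the patient's sex.
symptoms = ["blurred_vision", "morning_nausea", "persistent_headache"]
man = {
    "data": symptoms,
    "retrieval": ["possible_raised_intracranial_pressure"],
    "reasoning": "urgent",
}
woman = {
    "data": symptoms,
    "retrieval": ["possible_raised_intracranial_pressure"],
    "reasoning": "routine",
}

stage = first_divergence(man, woman)
print(stage, "->", FIXES.get(stage, "no divergence found"))
```

Here the same facts went in and the same facts were pulled, so the divergence lands on reasoning and the repair is the decision rule. The same comparison is not possible when retrieval and reasoning are a single closed call.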

Knowing bias vs eliminating bias

Be honest about what this does not do. It does not make the decision neutral. A rule carries the judgment of whoever wrote it, and a bad rule is still bias. What changes is that the judgment is now yours and open to challenge. The worldview was neither.

It works where the decision can already be written as a rule, the kind that answers to a policy someone can cite. Claims adjudication is one. The urgency call in triage is another. Where the decision is genuinely a matter of judgment with no rule behind it, the model's worldview is unavoidable, and the honest answer is to keep a person in the seat.

Real systems are messier than three clean stages. They loop, and the model often gets called more than once. The three stages are the decomposition to aim for, not a description of how most systems are wired today.

And none of this is an argument against the model. The model is doing real work at the front, reading input no rule could parse. The objection is narrow. The model should not be the thing that decides.

References

  1. Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency. Qi Han Wong.
  2. Cultural bias and cultural alignment of large language models. Yan Tao, Olga Viberg, Ryan S. Baker, René F. Kizilcec.
  3. Having Beer after Prayer? Measuring Cultural Bias in Large Language Models. Tarek Naous, Michael J. Ryan, Alan Ritter, Wei Xu.
  4. Invisible Filters: Cultural Bias in Hiring Evaluations Using Large Language Models. Pooja S. B. Rao, Laxminarayen Nagarajan Venkatesan, Mauro Cherubini, Dinesh Babu Jayagopi.
  5. Language Models are Few-Shot Learners. Tom B. Brown et al.
  6. Llama 2: Open Foundation and Fine-Tuned Chat Models. Hugo Touvron et al.
  7. Common Crawl Language Statistics. Common Crawl.
  8. AI and the New Theocracies. Simon Wardley.

About the Author
Vivek Khandelwal

Chief Business Officer, Co-Founder @ CogniSwitch · M.Sc. Chemistry, IIT Bombay

Vivek Khandelwal is the Chief Business Officer at CogniSwitch, where he leads go-to-market strategy, enterprise partnerships, and the company's thought leadership programs. He is the author of Signal, CogniSwitch's weekly newsletter that translates the complex machinery of enterprise AI infrastructure into clear, actionable intelligence for practitioners and executives in regulated industries.