When AI Fails Through Bias
A decade of discrimination in production AI systems. Eight cases, five mechanisms, the Colorado standard, and a framework for executives.
Twelve case studies, two analyst forecasts, and one architectural diagnosis.
By Bradley W. Petersen, PhD Candidate, Daniels College of Business, Founder, Orbis Scientia
White paper, released April 2026.
The largest AI implementation failures of the past decade were not failures of the technology. They were failures of management. Across twelve documented cases spanning healthcare, financial services, retail, government, recruiting, customer service, and professional services, the same architectural pattern recurs. The model worked. The deployment did not.
This paper makes that pattern visible. It uses a single architectural reference model, the Gen AI Value Architecture, to map twelve major AI failures onto the specific stages where each one broke. Once the cases are mapped onto the same diagram, the diagnosis becomes unmistakable.
These are not edge cases. They are the documented public failures of the past decade, and they share an architectural pattern.
The Gen AI Value Architecture has eight stages plus a top-level governance band and a cross-cutting risk band. The technical stages (Foundation, Model Adaptation, Runtime Operations) are where most attention and most engineering investment land. They are not where the failures occurred. The technical layers stay clean. The management layers are flooded.
The five components most implicated across the twelve cases are Use Case Approval, Risk Classification, Human-in-the-Loop, Output Validation, and Use Case Definition. None of them is a technical capability. All of them are governance, design, and discipline practices that organizations skipped or under-resourced because the model demos looked impressive enough to proceed without them.
The single most common strategic error across the twelve cases is what the paper names the substitution trap. Organizations moved from AI as augmentation to AI as substitution faster than the system could justify. The augmentation case is well understood. AI accelerates a human worker. The substitution case is harder. AI replaces a human worker, and the failure mode that surfaces is not slower work but degraded customer experience, civil rights exposure, regulatory action, or reputational damage that the original substitution business case did not price in.
Klarna substituted AI for customer service representatives, then publicly reversed the decision. McDonald's substituted AI for drive-thru order takers, then ended the program. Government deployments substituted AI for case workers and produced the most severe documented harms in the dataset.
Two leading research organizations independently confirm the diagnosis. Gartner predicts that 60 percent of AI projects unsupported by AI-ready data will be abandoned through 2026, that 30 percent of generative AI projects will be abandoned after proof of concept, and that more than 40 percent of agentic AI projects will be canceled by the end of 2027. McKinsey reports that value capture from generative AI remains heavily concentrated and that most enterprises fail to convert technically successful pilots into measured business outcomes. Read together with the case studies, the analyst forecasts point to a coming wave of failures concentrated in agentic AI, in projects without AI-ready data, and in organizations that have adopted tools without redesigning the work.
The paper provides the architectural reference model, the historical evidence from twelve documented cases, the analyst-confirmed forecasts, a practical framework for executive decision-making, recommendations across nine specific governance and design areas, and a self-assessment checklist that surfaces the failure modes most often implicated across the dataset. It is written for the executives, product leaders, risk officers, and board members who are accountable for the next wave of AI strategy decisions and who want to avoid becoming a case study in the next decade's edition of this paper.
Where The High Stakes Decision makes the architectural argument that high-stakes AI decisions require alignment-based and weakest-link reasoning, this paper provides the forensic evidence. Twelve organizations, twelve diagnoses, one composite pattern. The companion paper is recommended reading.
“The technical layers stay clean. The management layers are flooded.”
From 'When AI Fails,' April 2026
I have spent decades inside organizations making the kinds of decisions this paper examines. Healthcare systems deploying enterprise software at scale. Consulting engagements where the gap between what the technology promised and what the implementation delivered was the difference between a recoverable disappointment and an unrecoverable disaster. The pattern that recurs across the twelve cases in this paper is recognizable from the inside long before it appears in the news.
I wrote this paper for three reasons. The rate of public AI failures is accelerating. The architectural pattern in those failures is unmistakable once they are mapped onto a common reference model. And the executives accountable for the next decade of AI investment decisions have not yet seen the failures laid out side by side in a way that makes the pattern visible. Most reporting on AI failures treats each one as an isolated incident with a unique cause. The cases are not isolated. The pattern is not unique to any one industry. The diagnosis is the same case after case.
The companies and governments that will succeed with AI are not the ones that deployed the most tools the fastest. They are the ones that rewired their work, governed their risks, and respected the boundary between augmentation and substitution. That is the thesis of the paper, and it is supported by every case in the dataset.
The paper is meant to be read alongside The High Stakes Decision. That paper makes the architectural argument: high-stakes AI decisions require an architecture that AI infrastructure built for averaging cannot natively express. This paper provides the evidence: twelve documented cases of what happens when the architecture is wrong. The two papers are companions.
If the argument resonates, the right next step is a conversation. The self-assessment checklist in Appendix C of the paper is a starting point, not a complete governance program, but it surfaces the most common failure modes documented in the dataset. An organization that can answer yes to most of the items on the checklist has materially reduced its exposure to the failure pattern. An organization that cannot is closer to the next case study than its leadership probably realizes.