
The AI Got It Right Yesterday. Ask Again Tomorrow.

Ask an AI the same question on two different days and you may get two different answers. That is not a glitch - it is how probabilistic systems work. This piece looks at what happens when that behaviour meets jobs that demand deterministic results, and why careful human evaluation is what keeps AI systems reliable in production.
Posted on
April 21, 2025
Read Time
7 min read
Written by
Kashish Khandelwal

Most people assume that when an AI gets something wrong, it is an edge case. A glitch. Something the next model update will fix.

The reality is more uncomfortable than that. And if you are training and evaluating AI, understanding it will change how you think about every task you work on.

The same question. Two different answers.

Ask an AI something today. Close the app, reopen it tomorrow, and ask the exact same thing. The words that come back will probably be different. Sometimes noticeably so.

That is how these systems are designed.

Every response an AI generates is a fresh statistical calculation - a weighted guess assembled word by word. The model looks at every plausible next word, assigns each one a probability, samples one, and repeats that process until the response is complete. This is what makes AI probabilistic. The same input can produce a different output, and often does.
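To make that concrete, here is a toy sketch of the sampling step in Python. The words and probabilities are invented for illustration; they are not taken from any real model.

```python
import random

# Toy next-word probabilities for the prompt "The capital of France is".
# These numbers are made up for illustration, not taken from a real model.
next_word_probs = {
    "Paris": 0.87,
    "a": 0.06,
    "located": 0.04,
    "beautiful": 0.03,
}

def sample_next_word(probs):
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    weights = list(probs.values())
    return random.choices(words, weights=weights, k=1)[0]

# "Ask" the same thing five times. Most runs say "Paris" - but not all of them.
for _ in range(5):
    print(sample_next_word(next_word_probs))
```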

A calculator works on a completely different principle. Press 2 + 2 and you get 4. Always. No variation, no surprises. That is a deterministic system.

The gap between those two things - probabilistic output versus deterministic expectation - is where AI deployments go wrong. And they go wrong more often than most people realise.
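Side by side, the difference looks like this. The toy "model" below is a hypothetical stand-in that samples one of three canned replies with made-up weights; the calculator function never varies, the sampler does.

```python
import random

def calculator_add(a, b):
    # Deterministic: the same inputs always produce the same output.
    return a + b

def toy_model_answer():
    # Probabilistic: the reply is drawn from a weighted distribution,
    # so repeated calls can come back different (weights are invented).
    replies = ["Yes.", "Probably, but check the details.", "No."]
    return random.choices(replies, weights=[0.7, 0.2, 0.1], k=1)[0]

print(all(calculator_add(2, 2) == 4 for _ in range(1000)))  # always True
print({toy_model_answer() for _ in range(1000)})            # usually all three replies appear
```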

Three times it went very wrong

Zillow, 2021. The company shut down its AI-powered home-buying operation and wrote off over 500 million dollars. The models had predicted home prices with enough confidence to buy properties at scale. When the market shifted, the algorithm kept predicting from patterns that no longer existed. Around 2,000 people lost their jobs before anyone caught it.

Air Canada, 2024. The airline deployed a chatbot that gave a grieving passenger the wrong information about bereavement fares. He followed the instructions. The airline denied his claim. A civil resolution tribunal ruled Air Canada responsible and ordered it to pay damages. The company had argued the chatbot was a separate entity and therefore not its problem. The tribunal disagreed.

A New York lawyer, 2023. He submitted a legal brief citing six court cases. All six had been invented by the AI. The citations looked perfect - case numbers, judges, quotes. None of it was real. He and his firm were fined 5,000 dollars. The judge called the situation unprecedented.

None of these were reckless experiments. They were real deployments that failed because a probabilistic system was handed a job requiring deterministic results - with nobody properly checking the gap between the two.

The numbers behind the failures

Gartner found that only one in five enterprise AI initiatives delivered measurable ROI. MIT tracked hundreds of deployments and found roughly 95% of generative AI pilots produce zero measurable business value. IBM puts the share of AI projects that scale past the pilot stage at around 16%.

The model is rarely the problem. What fails is everything built around it - including the quality of the humans checking its work.

How bad labels break good models

Every label applied during AI training is a signal. Every rating submitted teaches the model something about what correct looks like. Research found that a 10% drop in annotation accuracy leads to a 2 to 5% drop in model performance. In an enterprise context that translates directly to misfiled claims, failed compliance checks, and wrong approvals.

When evaluators rush through tasks without holding the line on quality, the model absorbs that inconsistency and carries it quietly into production - where it compounds across millions of outputs before anyone traces the problem back to its source.

One study measured what happens when annotation errors are deliberately introduced into training data. Accuracy fell from 73.6% to 54.2%. The model continued to function. It just became quietly, steadily wrong.
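The direction of that effect is easy to reproduce on a toy problem. The sketch below is not the study's setup - it uses a synthetic binary-classification dataset and a logistic regression model built with scikit-learn - but it shows the same pattern: flip a growing share of training labels and watch held-out accuracy fall.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic "annotation" task: 5,000 examples with binary labels.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def train_and_score(noise_rate):
    """Flip a fraction of training labels, train, and measure test accuracy."""
    rng = np.random.default_rng(0)
    y_noisy = y_train.copy()
    flip = rng.random(len(y_noisy)) < noise_rate
    y_noisy[flip] = 1 - y_noisy[flip]  # the "bad labels"
    model = LogisticRegression(max_iter=1000).fit(X_train, y_noisy)
    return accuracy_score(y_test, model.predict(X_test))

for rate in [0.0, 0.1, 0.3]:
    print(f"label noise {rate:.0%}: test accuracy {train_and_score(rate):.3f}")
```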

What real human oversight looks like

The AI industry talks a lot about having a human in the loop. The assumption is that a person somewhere in the process catches problems before they reach the end user.

In practice, that only holds when the person doing the evaluation understands the weight of each decision they make. Every label is a design decision. Consistency across hundreds of tasks shapes how a model behaves at scale - across millions of real interactions, with real consequences attached.

Evaluators who bring genuine domain expertise and apply it carefully to every task are the reason some AI systems hold up under real-world conditions while others quietly fall apart. At Deccan AI Experts, this is the standard we hold ourselves to. The gap between a promising AI demo and a production system that actually works is closed by people who define, carefully and consistently, what good looks like.

That is the work. And now you know exactly why it matters.

Join the Top 1% Global Experts
A network of experts training and evaluating AI models.