40% of AI Projects Get Killed. Here's Why Yours Might.
Most AI agent projects don't die because the model was too dumb. They die because nobody could say what "working" meant, the agent never got the keys to the actual systems, and three months in, no one noticed it had quietly gone stupid.
Gartner expects more than 40% of agentic AI projects to be scrapped by 2027. That number gets passed around as a warning about the technology. It isn't. It's a warning about how these projects get run. The gap shows up even earlier: around 88% of agent pilots never make it to production at all. Most of them never had a chance, because the work that decides whether an agent survives happens before anyone writes a prompt, and it's the unglamorous part nobody wants to own.
Forrester broke down why these projects fail, and the causes are boringly fixable: 41% of the failures came from unclear success criteria, 33% from insufficient tool or data access, and 26% from evaluation drift. Not one of those is a model problem. Every one is a judgment problem. So here are the three questions to answer now, before your project becomes a statistic.
Have I defined what success means in a number?
This is the big one, and it's the one founders skip because it feels too obvious to bother with. You want an agent that "handles support tickets" or "qualifies leads." Fine. What does handling a ticket well look like, expressed as something you could check on a Friday afternoon?
If the answer is a vibe, the project is already in trouble. "Working" without a number means everyone on the team is quietly holding a different definition, and the day the agent ships, those definitions collide. Marketing thinks it's great. The support lead thinks it's embarrassing. Nobody's wrong, because nobody agreed.
A real success criterion sounds like: resolves 60% of tier-one tickets without a human, with a customer satisfaction score no worse than the team's current average. Now you can tell whether you won. You can also tell when you're done arguing about it.
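To make "something you could check on a Friday afternoon" concrete, here is a minimal Python sketch of that example criterion. The `Ticket` fields, the function name, and the choice to average satisfaction only over rated tickets the agent resolved on its own are all assumptions for illustration, not a prescribed method.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Ticket:
    # Hypothetical fields; map these to whatever your helpdesk exports.
    resolved_without_human: bool
    csat: Optional[float]  # None when the customer left no rating


def meets_success_criterion(
    tickets: List[Ticket],
    baseline_csat: float,
    target_rate: float = 0.60,
) -> Tuple[bool, float, Optional[float]]:
    """Return (passed, resolution_rate, agent_csat) for a batch of tickets."""
    if not tickets:
        return False, 0.0, None

    # Share of tickets the agent closed with no human involved.
    rate = sum(t.resolved_without_human for t in tickets) / len(tickets)

    # Assumption: satisfaction is judged only on rated, agent-resolved tickets.
    scores = [
        t.csat for t in tickets
        if t.resolved_without_human and t.csat is not None
    ]
    agent_csat = sum(scores) / len(scores) if scores else None

    passed = (
        rate >= target_rate
        and agent_csat is not None
        and agent_csat >= baseline_csat
    )
    return passed, rate, agent_csat
```

Run it over last week's tickets with the team's current average as `baseline_csat`. The point is less the code than the fact that it returns a yes or no that everyone agreed to in advance.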
I've seen a project nearly die over exactly this. The team wanted an agent to "handle onboarding," everyone nodded, and three months in the founder thought it was a failure while the ops lead thought it was a win, because no one had agreed what "handle" meant. The moment we forced it into a number (complete one onboarding step for 70% of new users without a human touching it), the argument evaporated. Suddenly we could see it was actually at 55% and climbing, which is a tuning problem, not a failure. Without the number, the team would have killed something that was three weeks from working.
Does the agent actually have access to the systems it needs?
Demos lie. A demo runs against a clean sandbox with three sample records and perfect data. Production is your real CRM with a decade of inconsistent entries, the permissions your IT person is nervous about granting, and the API that rate-limits you at the worst possible moment.
An agent that can't reach the live systems is a very expensive autocomplete. It can describe what it would do. It cannot do it. And the moment of truth, the part where you wire it into the actual tools and data, is usually where someone discovers the access was never going to be approved, or the data is too messy to act on, or the integration nobody scoped turns out to be the whole project.
Ask this before you build, not after: when this is real, does it touch the real thing? If the honest answer is "we'll figure out access later," later is where the project goes to die.
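One way to ask that question before you build is a preflight check: a tiny script that tries each real system with the agent's own credentials and reports what is blocked. The sketch below is illustrative only. The `preflight` function is made up for this example, and each probe is a placeholder you would fill with one harmless real call, such as reading a single CRM record.

```python
from typing import Callable, Dict


def preflight(probes: Dict[str, Callable[[], None]]) -> Dict[str, str]:
    """Run each access probe and record whether it succeeded.

    A probe is any zero-argument function that raises if the agent's
    credentials cannot do the thing it will need to do in production.
    """
    results: Dict[str, str] = {}
    for name, probe in probes.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:  # broad on purpose: any failure is a blocker
            results[name] = f"blocked: {exc}"
    return results
```

If anything comes back "blocked" during week one, you have found the real project scope while it is still cheap to change.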
Who's checking it's still good in three months?
An agent isn't a feature you ship and forget. It's closer to a hire who never tells you when they're struggling. Your data shifts, your customers start asking things they didn't before, an upstream system changes a field, and the agent keeps confidently doing what used to be right. That's evaluation drift, and it's quiet by nature. Nothing breaks. The numbers just slide.
So someone has to own the boring ritual of checking. Not "we'll look if customers complain." A standing habit: a set of test cases that get rerun, a metric somebody watches, a name attached to it. If you can't say who that person is, the answer is no one, and the agent will rot in the dark until a customer finds the problem for you.
For a small team, that ritual does not mean a fancy monitoring stack. It means twenty real examples in a spreadsheet (actual inputs from your business, with the answer you'd want beside each) and someone rerunning them whenever you change the prompt or the model, plus once a month no matter what. It takes an afternoon to set up and ten minutes to run. The teams that do it catch drift before customers do. The teams that don't find out from an angry email. You don't need ML ops. You need the discipline to look.
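The spreadsheet ritual can be sketched in a few lines of Python. This assumes the sheet is exported as CSV with `input` and `expected` columns, that your agent can be called as a plain function from text to text, and that a case-insensitive exact match counts as a pass. All three are simplifying assumptions, and fuzzier answers will need a human or a looser comparison.

```python
import csv
import io
from typing import Callable, Dict, List, Tuple


def load_cases(csv_text: str) -> List[Dict[str, str]]:
    """Parse exported spreadsheet text with 'input' and 'expected' columns."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def run_eval(
    cases: List[Dict[str, str]],
    agent: Callable[[str], str],
) -> Tuple[float, List[Tuple[str, str, str]]]:
    """Rerun every case; return the pass rate and the failing cases."""
    if not cases:
        raise ValueError("no test cases loaded")

    failures: List[Tuple[str, str, str]] = []
    for case in cases:
        got = agent(case["input"])
        # Assumption: a case passes on a case-insensitive exact match.
        if got.strip().lower() != case["expected"].strip().lower():
            failures.append((case["input"], case["expected"], got))

    pass_rate = (len(cases) - len(failures)) / len(cases)
    return pass_rate, failures
```

Write the pass rate down each time you run it. A single reading tells you little. The trend across prompt changes, model swaps, and monthly reruns is what exposes drift.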
The 30% that saves the project
Here's where the 70/30 idea earns its keep. The mechanical 70% of an agent build, the wiring, the prompting, the plumbing, is the part AI is genuinely good at now. That work has gotten cheap. It's also not the part that decides whether your project lives.
The 30% that decides everything is judgment: defining the number, making sure the agent touches reality, and owning the question of whether it's still any good next quarter. That's not building. It's deciding what "good" means and refusing to ship until you can prove it. It doesn't demo well and it doesn't make for a fun standup, which is exactly why it's the work that gets dropped, and exactly why projects die.
A better model won't save a project that was never defined. Ownership will.