How I Build AI Agents That Actually Ship
Most AI agents never make it past the demo. They look brilliant in a controlled room, then fall apart the moment a real user does something unexpected, and a month later they get quietly switched off.
I've built enough of them to know the difference between a demo and a system isn't the model. It's the process around it. Here's how I actually build agents that survive contact with production.
A note before the steps: this pairs with the 70/30 method, where agents do the volume and senior judgment owns the decisions that matter. What follows is how that judgment gets applied, in order.
1. Start with the decision, not the model
Before I touch a model, I map the actual work: where the decisions happen, what data feeds them, who's accountable when something goes wrong, and how the process really runs versus how the org chart says it runs. That gap is usually where the interesting problems live.
I also pin down what success means in numbers up front. If we can't say what "working" looks like before we build, we won't agree on it after.
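One way to make "success in numbers" concrete is to write the targets down as data before any agent code exists. The sketch below is only an illustration: the `Target` class, the metric names, and the thresholds are all hypothetical placeholders, not figures from a real project.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Target:
    """One agreed success criterion (hypothetical structure)."""

    metric: str
    threshold: float
    higher_is_better: bool = True

    def met(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


def unmet_targets(targets: List[Target], measured: Dict[str, float]) -> List[str]:
    """Return the metrics that are missing from `measured` or miss their target."""
    failures = []
    for target in targets:
        value = measured.get(target.metric)
        if value is None or not target.met(value):
            failures.append(target.metric)
    return failures


# Placeholder targets, agreed before the build starts.
targets = [
    Target("accuracy", 0.90),
    Target("median_seconds", 30.0, higher_is_better=False),
]

# Placeholder measurements from a later evaluation run.
failures = unmet_targets(targets, {"accuracy": 0.93, "median_seconds": 41.0})
```

Treating an unmeasured metric as a failure is a deliberate choice in this sketch: a target nobody measured should not pass by default.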
2. Design the boundaries before the behavior
This is the part demos skip. What can the agent touch? What can't it? What happens when it fails, when the data's missing, when it isn't sure? An agent with broad access and no guardrails isn't a capability; it's an incident with a delay on it.
I design the architecture (models, tools, integration points, auth, failure handling) with the boundaries first. The personality and the clever behavior come after the safety rails, not before.
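As a rough illustration of boundaries-first design, here is a minimal deny-by-default tool registry. It is a sketch under assumptions, not a description of any particular framework: `ToolBoundary`, `lookup_order`, and the toy order data are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ToolBoundary:
    """Deny-by-default registry: the agent can only call tools listed here."""

    # Hypothetical allowlist mapping tool name -> callable.
    allowed: Dict[str, Callable[..., Any]] = field(default_factory=dict)

    def call(self, name: str, **kwargs: Any) -> Dict[str, Any]:
        if name not in self.allowed:
            # Anything not explicitly granted is refused.
            return {"ok": False, "error": f"tool not allowed: {name}"}
        try:
            return {"ok": True, "result": self.allowed[name](**kwargs)}
        except Exception as exc:
            # Fail closed: surface the error rather than letting the agent guess.
            return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}


def lookup_order(order_id: str) -> Dict[str, str]:
    """Hypothetical read-only tool backed by toy data."""
    orders = {"A100": "shipped"}
    return {"order_id": order_id, "status": orders[order_id]}


boundary = ToolBoundary(allowed={"lookup_order": lookup_order})

allowed_result = boundary.call("lookup_order", order_id="A100")
denied_result = boundary.call("refund_order", order_id="A100")   # never granted
missing_result = boundary.call("lookup_order", order_id="Z999")  # data is missing
```

The point of the shape is that every outcome (allowed, denied, failed) comes back as an explicit result the surrounding system can log and act on.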
3. Build the smallest thing that proves it works
I build a focused version that does the core job and nothing else, then get it in front of real users fast. Prototypes are for learning, not for impressing. People always use agents differently than they say they will, and that gap is the most valuable thing the prototype tells you.
Then I iterate against real interactions, measured against the current process. Faster? More accurate? Actually trusted? Better to find out now than after launch.
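Measuring against the current process can be as simple as summarizing the same two numbers for both and looking at the difference. The sketch below assumes each interaction has been logged with a correctness flag and a duration; the `Interaction` record and the sample values are hypothetical. Trust is not captured here and would need its own signal, such as how often users override the agent.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, Iterable


@dataclass(frozen=True)
class Interaction:
    """One logged task outcome (hypothetical record shape)."""

    correct: bool
    seconds: float


def summarize(runs: Iterable[Interaction]) -> Dict[str, float]:
    runs = list(runs)
    if not runs:
        raise ValueError("need at least one interaction to summarize")
    return {
        "accuracy": sum(r.correct for r in runs) / len(runs),
        "median_seconds": median(r.seconds for r in runs),
    }


def compare(agent: Iterable[Interaction], baseline: Iterable[Interaction]) -> Dict[str, float]:
    """Positive accuracy_delta and negative median_seconds_delta favor the agent."""
    a, b = summarize(agent), summarize(baseline)
    return {
        "accuracy_delta": a["accuracy"] - b["accuracy"],
        "median_seconds_delta": a["median_seconds"] - b["median_seconds"],
    }


# Toy data for illustration only.
agent_runs = [Interaction(True, 10.0), Interaction(True, 20.0), Interaction(False, 30.0)]
baseline_runs = [Interaction(True, 60.0), Interaction(False, 80.0)]
deltas = compare(agent_runs, baseline_runs)
```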
4. Give it your context, not just the model's
Your business has years of accumulated knowledge: terminology, procedures, the rules nobody wrote down. That context is what separates an agent that sounds generic from one that sounds like it works there. I integrate your data, your language, and your rules, and build the feedback loops so the agent learns from corrections instead of repeating the same mistakes.
This is also where the human-oversight rules get set: where it acts on its own, where it asks first, where it escalates.
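The act / ask / escalate split can be written down as an explicit routing rule. This is a minimal sketch, assuming the agent reports a confidence score and each action is tagged as reversible or not; the function name and both thresholds are hypothetical and would be set per use case.

```python
from enum import Enum


class Route(Enum):
    ACT = "act"            # agent proceeds on its own
    ASK = "ask"            # agent proposes, a human approves
    ESCALATE = "escalate"  # agent hands the case to a human


def route_action(
    confidence: float,
    reversible: bool,
    ask_below: float = 0.9,       # placeholder threshold
    escalate_below: float = 0.6,  # placeholder threshold
) -> Route:
    """Decide how much autonomy a single proposed action gets."""
    if confidence < escalate_below:
        return Route.ESCALATE
    if not reversible or confidence < ask_below:
        # Irreversible actions always need approval, however confident the agent is.
        return Route.ASK
    return Route.ACT
```

For example, `route_action(0.95, reversible=False)` returns `Route.ASK`: high confidence alone does not buy autonomy over an action that cannot be undone.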
5. Deploy it like a change, not a launch
The technical deploy is the easy half. The other half is adoption: people need to see how the agent makes their work easier, not just be told it exists. I roll it in with the people who'll actually use it, watch the real usage, and tune from there. Monitoring runs from day one, because an agent you can't observe is an agent you can't trust.
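Day-one monitoring does not need a platform to start; a structured log line per agent step already answers "what ran, did it work, how long did it take". The decorator below is a small sketch using only the Python standard library. The logger name `agent.monitor` and the step name are hypothetical, and it assumes the application configures logging handlers and levels elsewhere.

```python
import functools
import json
import logging
import time
from typing import Any, Callable

# Hypothetical logger name; handlers and levels are configured by the application.
logger = logging.getLogger("agent.monitor")


def observed(step: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Log one JSON record per call with the step name, status, and latency."""

    def wrap(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def inner(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            status = "error"
            try:
                result = fn(*args, **kwargs)
                status = "ok"
                return result
            finally:
                # Runs on success and on failure; exceptions still propagate.
                latency_ms = round((time.perf_counter() - start) * 1000, 2)
                logger.info(json.dumps(
                    {"step": step, "status": status, "latency_ms": latency_ms}
                ))

        return inner

    return wrap


@observed("classify_ticket")
def classify_ticket(text: str) -> str:
    """Hypothetical agent step standing in for a model call."""
    return "billing" if "invoice" in text.lower() else "other"
```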
6. Keep it alive
Agents that don't evolve rot. Requirements move, data shifts, new edge cases surface. The ones that keep delivering are the ones someone keeps tending: watching the evals, expanding what works, pruning what doesn't. I'd rather build one agent that's still running and trusted in a year than five impressive ones that all got quietly turned off.
The short version
The model is maybe 10% of the work. The other 90% (the boundaries, the context, the evals, the adoption, the upkeep) is what decides whether you have a system or a screenshot. Skip it and you get a great demo. Do it and you get something your team actually relies on.
If you've got an AI project stuck at "impressive demo" and you need it to become "runs in production and we trust it," that's the work I do. Let's talk.