Giving an Agent a Performance Review
Most teams deploy an agent, watch it work for a week, decide it's "fine," and never look closely again. Then six months later they're surprised it's been quietly drifting, or that a competitor's version is twice as good, or that customers have been getting answers nobody would have signed off on.
You'd never manage a person that way: hire someone, glance at their first week, and then never review their work again. The fix is the same thing you already do with people: a regular, honest performance review. For agents, the industry calls it evals. Same ritual, less intimidating than it sounds.
An eval is a performance review, not a science project
The word "evals" makes this sound like an engineering discipline you need a framework and a dashboard for. You can have those, and at scale you'll want them. But the core of it is the thing every decent manager does: look at a representative sample of the work, against a clear standard, on a regular cadence, and decide whether it's good enough.
If you can run a quarterly review for a human, you can run one for an agent. The agent's version is actually easier, because the agent doesn't get defensive and you can re-run the exact same test whenever you want.
What you're actually reviewing
A good review answers three questions, and they map cleanly onto how you'd assess a person.
Is the work good against a real standard? Not "did it produce something," but "is what it produced what we'd want." That means you need a definition of good written down before you look, the same way a job has a definition of success. Pull a sample of real outputs and grade them against it. If you can't grade them, you don't have a standard yet, and that's the first thing to fix.
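For readers who want to see what "a definition of good written down" can look like, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the two criteria, the keyword checks, and the sample outputs are hypothetical, and a real standard would likely rely on human graders or richer checks than string matching.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str
    check: Callable[[str], bool]  # returns True when the output meets this criterion


# Hypothetical written standard for a support agent (illustrative checks only).
STANDARD = [
    Criterion("mentions_next_step", lambda out: "next step" in out.lower()),
    Criterion("no_unapproved_promise", lambda out: "guarantee" not in out.lower()),
]


def grade(output: str, standard: list[Criterion]) -> dict[str, bool]:
    """Grade one output against every criterion in the standard."""
    return {c.name: c.check(output) for c in standard}


def pass_rate(outputs: list[str], standard: list[Criterion]) -> float:
    """Share of sampled outputs that meet every criterion."""
    if not outputs:
        raise ValueError("need at least one output to review")
    passed = sum(all(grade(o, standard).values()) for o in outputs)
    return passed / len(outputs)


# Hypothetical sample of real outputs pulled for review.
sample = [
    "Your next step is to reset the password from the login page.",
    "We guarantee a refund within 24 hours.",
]

print(grade(sample[1], STANDARD))
print(pass_rate(sample, STANDARD))
```

The point of the sketch is the order of operations: the standard exists as data before any output is graded, so the review can't quietly bend to fit whatever the agent produced.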
Where does it fail, and how badly? Every performer has failure modes. The question is whether this agent's failures are rare and low-stakes or common and expensive. A support agent that's slightly stiff on tone is a coaching note. One that confidently invents a refund policy is a serious problem. Name the failure modes explicitly, the same way you'd name a person's development areas.
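One way to make "rare and low-stakes versus common and expensive" concrete is to tally each named failure mode and weight it by severity. The sketch below is an assumption, not a standard method: the failure-mode names, the 1-to-5 severity scale, and the rate-times-severity score are all hypothetical choices made for illustration.

```python
from collections import Counter

# Hypothetical severity weights: 1 = coaching note, 5 = serious problem.
SEVERITY = {"stiff_tone": 1, "invented_policy": 5}


def failure_report(labels: list[str], severity: dict[str, int], sample_size: int) -> list[dict]:
    """Rank failure modes by how often they occur times how bad they are."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    report = []
    for mode, count in Counter(labels).items():
        rate = count / sample_size
        report.append({
            "mode": mode,
            "count": count,
            "rate": rate,
            "severity": severity[mode],
            "risk": rate * severity[mode],  # illustrative score, not an industry metric
        })
    return sorted(report, key=lambda row: row["risk"], reverse=True)


# Hypothetical review of 100 outputs: 6 were stiff on tone, 2 invented a policy.
labels = ["stiff_tone"] * 6 + ["invented_policy"] * 2
report = failure_report(labels, SEVERITY, sample_size=100)

for row in report:
    print(row["mode"], row["count"], row["severity"])
```

In this made-up sample the invented policy ranks first even though it is three times rarer than the tone problem, which is the distinction between a coaching note and a serious problem.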
Is it getting better or worse over time? This is the one teams skip, and it's the whole point of a recurring review. Models change, your data changes, the world the agent operates in changes. Run the same set of test cases this quarter that you ran last quarter and you'll see drift before a customer does.
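Comparing this quarter's run against last quarter's can be as simple as a dictionary diff. The following is a minimal sketch under stated assumptions: the test case names and pass/fail results are hypothetical, and each result is reduced to a single boolean, which a real eval may not be.

```python
def find_regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Test cases present in both runs that passed before and fail now."""
    shared = previous.keys() & current.keys()
    return sorted(case for case in shared if previous[case] and not current[case])


def pass_rate_change(previous: dict[str, bool], current: dict[str, bool]) -> float:
    """Change in pass rate over the test cases present in both runs."""
    shared = previous.keys() & current.keys()
    if not shared:
        raise ValueError("the two runs share no test cases, so they cannot be compared")
    before = sum(previous[case] for case in shared) / len(shared)
    after = sum(current[case] for case in shared) / len(shared)
    return after - before


# Hypothetical results from the same test set, run one quarter apart.
last_quarter = {"refund_policy": True, "password_reset": True,
                "shipping_delay": False, "cancel_order": True}
this_quarter = {"refund_policy": False, "password_reset": True,
                "shipping_delay": False, "cancel_order": True}

print(find_regressions(last_quarter, this_quarter))
print(pass_rate_change(last_quarter, this_quarter))
```

Restricting the comparison to shared test cases is a deliberate choice here: adding new cases is healthy, but only the cases that appear in both runs say anything about drift.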
Reviews drive a decision, same as with people
A performance review that doesn't lead to a decision is theater. With a person, a review leads to keep, coach, promote, or move them off the role. With an agent, the options are nearly identical.
Keep it as is. Coach it, which for an agent means improving its context, instructions, or tools, then re-running the same evals to confirm the coaching worked. Promote it, meaning give it more scope or more autonomy because it's earned trust. Or move it off the work, the agent version of reassigning a person from a role they aren't suited for.
The review is how you make that call on evidence instead of vibes.
This is the 30% doing its job
A performance review for an agent is the 70/30 Method in practice. The agent does the 70, the high-volume work. The review is part of the 30, the senior judgment that decides whether the 70 is actually worth anything. It's the same loop you set up when you onboard an agent like a new hire: a clear standard, real access, and a manager who checks the work. The review is just the "checks the work" part, done on a cadence instead of once.
If you can't tell me how you'd run a performance review on your agents, that's the gap. It's also exactly the part of the Agentic OS we build with clients, the loop that keeps the system trustworthy after the demo is over. Let's talk.