Evals for People Who Aren't ML Engineers
"Evals" sounds like something you need a data science team and a research budget for. It isn't. An eval is just a way to test whether your AI system is actually any good, on purpose and repeatedly, instead of going on the vibe of whoever looked at it last. If you can build a spreadsheet, you can run evals.
And you need to, because the alternative is shipping on gut feel and finding out the system drifted only when a customer complains. Here's the whole thing, demystified.
What an eval actually is
Three pieces. A set of real examples to test against. A definition of what a good answer looks like for each one. And a way to score how the system did. That's it. Everything fancier is an optimization of those three things, not a replacement for them.
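If it helps to see those three pieces as code, here is a minimal sketch in Python. Everything in it is made up for illustration: the EvalCase shape, the toy_agent stand-in, and the "does the answer contain this phrase" scorer are assumptions, not a prescribed format. Your real agent and your real definition of good will look different.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str    # piece 1: a real example to test against
    expected: str  # piece 2: what a good answer must contain (simplified)


def contains_expected(output: str, expected: str) -> bool:
    # Piece 3: a way to score. This one is deliberately crude:
    # it only checks that the expected phrase appears in the output.
    return expected.lower() in output.lower()


def run_eval(
    cases: List[EvalCase],
    agent: Callable[[str], str],
    scorer: Callable[[str, str], bool],
) -> float:
    """Return the fraction of cases the agent passes."""
    passed = sum(1 for c in cases if scorer(agent(c.prompt), c.expected))
    return passed / len(cases)


def toy_agent(prompt: str) -> str:
    # Hypothetical stand-in for whatever system you are testing.
    if "refund" in prompt.lower():
        return "Refunds are available within 30 days."
    return "Let me connect you with a person."


cases = [
    EvalCase("What is your refund window?", "30 days"),
    EvalCase("Can I speak to a human?", "connect you"),
    EvalCase("Do you ship to Canada?", "yes"),
]

score = run_eval(cases, toy_agent, contains_expected)
print(f"Passed {score:.0%} of cases")
```

The toy agent passes two of the three cases, because it has nothing to say about shipping. That third case is the kind of gap an eval is supposed to surface.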
Think of it as a recurring exam for your agent, where you wrote the questions and the answer key from real life.
Build your first eval set
Collect 20 to 50 real examples of the work the agent does. Pull them from actual usage if you have it, or write realistic ones if you don't. The rule that matters: include the hard cases on purpose. The ambiguous inputs, the edge cases, the ones that have burned you before. An eval set made only of easy questions tells you nothing, the same way a demo of only the happy path tells you nothing.
A few dozen well-chosen cases beat a thousand random ones. You're not after volume. You're after coverage of the ways it can go wrong.
Decide what good means
For each case, write down what a good answer looks like. Sometimes that's an exact correct value. More often it's a short rubric: gets the facts right, uses the right tone, escalates when it should, doesn't promise things you don't offer. Writing this down is most of the value, because if you can't say what good is, you've found the real problem, and no model can fix that one for you.
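A rubric can be written down as a list of named checks. The sketch below is one possible way to do that in Python, and it only works for criteria a machine can check mechanically, such as a required phrase or a forbidden promise. The criteria names and the sample answer are invented for illustration. Things like tone usually still need a person reading the answer.

```python
def check_rubric(output, rubric):
    """Apply each named check to the output and return {criterion: passed}."""
    return {name: check(output) for name, check in rubric.items()}


# Hypothetical rubric for one case. Each entry is a plain-English
# criterion paired with a simple test on the agent's answer.
rubric = {
    "states the 30-day window": lambda out: "30 days" in out,
    "does not promise free shipping": lambda out: "free shipping" not in out.lower(),
    "offers escalation": lambda out: "support team" in out.lower(),
}

# A made-up agent answer to grade.
answer = "You can return items within 30 days. Free shipping is included!"

results = check_rubric(answer, rubric)
failed = [name for name, ok in results.items() if not ok]

for name, ok in results.items():
    print(f"{'PASS' if ok else 'FAIL'}  {name}")
```

This answer gets the fact right but fails two criteria: it promises something that isn't offered and never mentions escalation. Grading per criterion tells you what went wrong, which a single pass or fail does not.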
Score it
Run your examples through the agent and grade the outputs against your definition. For a small set, a person reading and scoring is perfectly legitimate and often best. As volume grows you can have a model do first-pass grading with humans spot-checking, but don't reach for that until hand-grading actually hurts. A spreadsheet with a row per case and a pass or fail column is a real eval. Start there.
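Once the spreadsheet exists, a few lines of Python can total it up. This is a sketch under assumptions: the column names (case_id, input, result, notes) and the sample rows are invented, and the sheet is inlined as text so the example runs on its own. In practice you would export your sheet as CSV and read the file instead.

```python
import csv
import io

# Stand-in for an exported spreadsheet: one row per case,
# hand-graded with "pass" or "fail" in the result column.
SHEET = """case_id,input,result,notes
1,Refund window question,pass,
2,Angry customer asks for manager,fail,Did not escalate
3,Ambiguous order number,pass,
4,Asks about unsupported product,fail,Promised a feature we lack
"""


def summarize(csv_text):
    """Return (pass_rate, list_of_failing_rows) for a graded sheet."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    failures = [r for r in rows if r["result"].strip().lower() != "pass"]
    pass_rate = (len(rows) - len(failures)) / len(rows)
    return pass_rate, failures


pass_rate, failures = summarize(SHEET)
print(f"Pass rate: {pass_rate:.0%}")
for row in failures:
    print(f"  case {row['case_id']}: {row['notes']}")
```

The notes column matters as much as the number. A pass rate tells you how often it fails. The notes tell you why.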
Run it on a cadence
The point of evals is repetition. Run the same set on a schedule, and again every time you change the prompt, the data, the model, or the tools. That's how you catch the change that quietly made things worse, and how you prove an improvement actually improved things instead of just feeling better.
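Comparing two runs case by case is what catches the quiet regressions. Here is a rough sketch of that comparison, assuming each run is stored as a mapping from a case name to pass or fail. The case names and results are invented for illustration.

```python
def find_regressions(before, after):
    """Cases that passed in the earlier run and do not pass now."""
    return sorted(
        case for case, ok in before.items() if ok and not after.get(case, False)
    )


def find_fixes(before, after):
    """Cases that pass now and did not pass in the earlier run."""
    return sorted(
        case for case, ok in after.items() if ok and not before.get(case, False)
    )


# Hypothetical results from the same eval set, before and after a prompt change.
before = {"refund-window": True, "angry-customer": False, "ambiguous-order": True}
after = {"refund-window": True, "angry-customer": True, "ambiguous-order": False}

regressions = find_regressions(before, after)
fixes = find_fixes(before, after)

print("Newly broken:", regressions)
print("Newly fixed:", fixes)
```

Both runs here pass two of three cases, so the headline number did not move. The per-case comparison shows that the change fixed one case and broke another, which is exactly the kind of trade the overall pass rate hides.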
This is the practical machinery behind giving an agent a performance review: the eval set is the test, the cadence is the review schedule. And it's the part of the 70/30 Method that turns "it seemed fine" into something you can actually stand behind. You don't need to be an ML engineer to do it. You need to decide what good looks like and check, on purpose, whether you're getting it.
If you want help building an eval set that actually catches your failure modes, that's part of the Agentic OS work I do with clients. Let's talk.