Evaluations (evals)
How you measure whether an AI system is good enough, before and after you ship it.
What it is
- Evals are tests for AI behavior: a dataset of inputs with a way to score the outputs.
- They are how you move from 'it seems to work' to 'it works, measurably'.
How it works
- You assemble representative cases, define what good looks like, and score outputs (by rules, by humans, or by a judge model).
- You track scores across versions to catch regressions and guide improvement.
- Evals run in development and, ideally, continuously in production.
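The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the `EvalCase` type, the `exact_match` scorer, and the two stand-in systems (`system_v1`, `system_v2`) are hypothetical names, and a real system would call a model rather than a lookup table.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One eval case: an input plus the output we consider correct."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Rule-based scorer: 1.0 if the normalized strings match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    system: Callable[[str], str],
    cases: list[EvalCase],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Run every case through the system and return the mean score."""
    if not cases:
        raise ValueError("eval set is empty")
    scores = [scorer(system(case.prompt), case.expected) for case in cases]
    return sum(scores) / len(scores)


# Hypothetical stand-ins for two versions of an AI system.
def system_v1(prompt: str) -> str:
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


def system_v2(prompt: str) -> str:
    answers = {"2+2": "4", "capital of France": "Paris", "opposite of hot": "Cold"}
    return answers.get(prompt, "unknown")


# A tiny, made-up eval set; a real one would be drawn from representative inputs.
CASES = [
    EvalCase("2+2", "4"),
    EvalCase("capital of France", "paris"),
    EvalCase("opposite of hot", "cold"),
    EvalCase("largest planet", "Jupiter"),
]

if __name__ == "__main__":
    for name, system in [("v1", system_v1), ("v2", system_v2)]:
        print(f"{name}: {run_eval(system, CASES):.2f}")
```

Because `scorer` is a parameter, the same loop could take a human-label lookup or a judge-model call in place of the exact-match rule.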
Trade-offs
- Good evals take effort to build and maintain, but without them you cannot tell whether a change helped or hurt.
- Automated judges scale well but can be biased or inconsistent; human review is usually more reliable but slow and costly.
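One way to make the judge-versus-human trade-off concrete is to have humans label a small sample and measure how often the judge agrees with them. The sketch below assumes pass/fail labels and uses raw agreement plus Cohen's kappa (agreement corrected for chance); the function names and the sample labels are illustrative, not taken from any particular tool.

```python
from collections import Counter


def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of cases where the judge's label equals the human's label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined, treat as perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for eight sampled outputs.
HUMAN = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
JUDGE = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]

if __name__ == "__main__":
    print(f"agreement: {agreement_rate(HUMAN, JUDGE):.2f}")
    print(f"kappa:     {cohens_kappa(HUMAN, JUDGE):.2f}")
```

If agreement on the sample is low, the judge's scores on the full set deserve less trust, and the judge prompt or rubric may need revision before it replaces human review.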
When to use it
- Before shipping anything important, and continuously afterward.
- Whenever you change a prompt, model, or pipeline and need to know if it got better.
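To answer "did it get better?" after a change, it helps to compare per-case scores rather than only the average, since an improved mean can hide individual regressions. The following is a sketch under assumptions: scores are stored as a mapping from a hypothetical case ID to a number in [0, 1], and `compare_runs` is an illustrative helper, not part of any library.

```python
def compare_runs(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> dict:
    """Compare two eval runs case by case on the cases they share."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("runs have no cases in common")
    # A case regresses if its score drops by more than the tolerance.
    regressed = [cid for cid in shared if candidate[cid] < baseline[cid] - tolerance]
    improved = [cid for cid in shared if candidate[cid] > baseline[cid] + tolerance]
    mean_delta = sum(candidate[cid] - baseline[cid] for cid in shared) / len(shared)
    return {"mean_delta": mean_delta, "regressed": regressed, "improved": improved}


# Made-up scores before and after a prompt change.
BASELINE = {"case-1": 1.0, "case-2": 0.0, "case-3": 1.0, "case-4": 0.5}
CANDIDATE = {"case-1": 1.0, "case-2": 1.0, "case-3": 0.0, "case-4": 1.0}

if __name__ == "__main__":
    report = compare_runs(BASELINE, CANDIDATE)
    print(f"mean delta: {report['mean_delta']:+.3f}")
    print("improved: ", report["improved"])
    print("regressed:", report["regressed"])
```

In this made-up data the mean score rises while `case-3` gets worse, which is the kind of regression an average alone would not reveal.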
Common pitfalls
- Shipping on vibes with no measurement.
- Evals that do not reflect real user inputs, so a high score says little about production behavior.