Evaluations (evals)

How you measure whether an AI system is good enough, before and after you ship it.

What it is

  • Evals are tests for AI behavior: a dataset of inputs with a way to score the outputs.
  • They are how you move from 'it seems to work' to 'it works, measurably'.

How it works

  • You assemble representative cases, define what good looks like, and score outputs (by rules, by humans, or by a judge model).
  • You track scores across versions to catch regressions and guide improvement.
  • Evals run in development and, ideally, continuously in production.
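The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `EvalCase`, `exact_match`, `run_eval`, and `toy_system` are hypothetical names, and `toy_system` is a stand-in for a real model or pipeline call. It uses a rule-based scorer; a human or judge-model scorer would slot in through the same `scorer` parameter.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One eval case: an input and the output we consider correct."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Rule-based scorer: 1.0 if the strings match after normalization, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    system: Callable[[str], str],
    cases: list[EvalCase],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Run every case through the system and return the mean score."""
    if not cases:
        raise ValueError("eval set is empty")
    scores = [scorer(system(case.input), case.expected) for case in cases]
    return sum(scores) / len(scores)


def toy_system(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


CASES = [
    EvalCase("2+2", "4"),
    EvalCase("capital of France", "paris"),
    EvalCase("largest planet", "Jupiter"),
]

score = run_eval(toy_system, CASES)
print(f"mean score: {score:.2f}")
```

Storing this mean score alongside the prompt or model version is what makes it possible to track results across versions.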

Trade-offs

  • Good evals take effort to build and maintain, but without them you cannot tell whether a change improved or degraded behavior.
  • Automated judges scale well but can be biased or inconsistent; human review is usually more reliable but slow and costly.
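One common way to manage the judge-versus-human trade-off is to have humans label a small sample and check how often the automated judge agrees with them. The sketch below assumes pass/fail labels encoded as 1 and 0; the function names and the sample labels are illustrative only. It reports raw agreement and Cohen's kappa, which corrects agreement for chance.

```python
def agreement_rate(judge: list[int], human: list[int]) -> float:
    """Fraction of cases where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[int], human: list[int]) -> float:
    """Chance-corrected agreement between two raters."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    labels = set(judge) | set(human)
    # Agreement expected if both raters labeled at random with their own label rates.
    expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only: 1 = pass, 0 = fail.
human_labels = [1, 1, 1, 0, 0, 0, 1, 0]
judge_labels = [1, 1, 0, 0, 0, 1, 1, 0]

print(f"agreement: {agreement_rate(judge_labels, human_labels):.2f}")
print(f"kappa:     {cohens_kappa(judge_labels, human_labels):.2f}")
```

If agreement on the sample is low, the judge's scores on the full dataset should be treated with caution until the judge prompt or rubric is revised.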

When to use it

  • Before shipping anything important, and continuously afterward.
  • Whenever you change a prompt, model, or pipeline and need to know if it got better.
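To decide whether a change "got better", it helps to compare per-case scores between the old and new versions, not just the averages. The following sketch assumes each version's results are stored as a mapping from a case ID to a score; `compare_versions`, the `tolerance` parameter, and the sample data are hypothetical.

```python
def compare_versions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> dict:
    """Compare two eval runs on the cases they share."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs share no cases")
    base_mean = sum(baseline[k] for k in shared) / len(shared)
    cand_mean = sum(candidate[k] for k in shared) / len(shared)
    # Cases that scored lower on the candidate, even if the mean held up.
    regressed = [k for k in shared if candidate[k] < baseline[k]]
    return {
        "baseline_mean": base_mean,
        "candidate_mean": cand_mean,
        "regressed_cases": regressed,
        "passes": cand_mean >= base_mean - tolerance,
    }


# Illustrative scores only.
baseline = {"case-1": 1.0, "case-2": 0.0, "case-3": 1.0, "case-4": 1.0}
candidate = {"case-1": 1.0, "case-2": 1.0, "case-3": 0.0, "case-4": 1.0}

report = compare_versions(baseline, candidate)
print(report)
```

In this example the mean is unchanged, yet one case regressed, which is why the report lists regressed cases separately from the aggregate gate.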

Common pitfalls

  • Shipping based on impressions from a few spot checks, with no measurement.
  • Evals that do not reflect real user inputs, so scores look good while real usage fails.

Related concepts