Evaluations (evals)

How you measure whether an AI system is good enough, before and after you ship it.

What it is

Evals are tests for AI behavior: a dataset of inputs with a way to score the outputs.
They are how you move from 'it seems to work' to 'it works, measurably'.

How it works

You assemble representative cases, define what good looks like, and score outputs (by rules, by humans, or by a judge model).
You track scores across versions to catch regressions and guide improvement.
Evals run in development and, ideally, continuously in production.
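
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `Case` type, the `contains_expected` rule, and the `toy_system` function are all hypothetical names introduced here, and a real setup would call an actual model and usually use richer scoring.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str
    expected: str  # text a good answer must contain (a simple rule-based check)


def contains_expected(output: str, case: Case) -> bool:
    """Rule-based scorer: pass if the expected text appears in the output."""
    return case.expected.lower() in output.lower()


def run_eval(system: Callable[[str], str], cases: list[Case]) -> float:
    """Run every case through the system and return the fraction that pass."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(contains_expected(system(c.prompt), c) for c in cases)
    return passed / len(cases)


# Hypothetical stand-in for a real model or pipeline call.
def toy_system(prompt: str) -> str:
    answers = {"Capital of France?": "The capital is Paris."}
    return answers.get(prompt, "I don't know.")


# Illustrative cases only; a real eval set should mirror actual user inputs.
cases = [
    Case("Capital of France?", "Paris"),
    Case("Capital of Japan?", "Tokyo"),
]

score = run_eval(toy_system, cases)
print(f"pass rate: {score:.2f}")
```

Recording `score` for each version of the prompt, model, or pipeline gives you the trend line to watch for regressions.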

Trade-offs

Good evals take effort to build, but without them you cannot tell whether a change improved or degraded behavior.
Automated judges scale well but make mistakes; human review is generally more reliable but slow and costly.
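
One way to see how far a judge model can be trusted is to have humans label a small sample and measure how often the judge agrees. The sketch below assumes pass/fail labels; the function name and the label lists are made up for illustration, and simple agreement is only one of several possible measures.

```python
def agreement_rate(judge: list[bool], human: list[bool]) -> float:
    """Fraction of cases where the judge's verdict matches the human label."""
    if len(judge) != len(human):
        raise ValueError("label lists must be the same length")
    if not judge:
        raise ValueError("no labels to compare")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


# Illustrative labels only: True means "output judged acceptable".
human_labels = [True, True, False, True, False]
judge_labels = [True, False, False, True, True]

rate = agreement_rate(judge_labels, human_labels)
print(f"judge agrees with humans on {rate:.0%} of cases")
```

If agreement is low, the judge prompt or rubric probably needs work before its scores are used to make shipping decisions.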

When to use it

Before shipping anything important, and continuously afterward.
Whenever you change a prompt, model, or pipeline and need to know if it got better.
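
To know whether a change "got better", it helps to compare results case by case rather than only by the overall score, because an unchanged average can hide a newly broken case. The following is a sketch under assumed names: the case ids and results are invented, and `find_regressions` is a hypothetical helper.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Overall fraction of passing cases."""
    return sum(results.values()) / len(results)


def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Case ids that passed on the baseline but fail (or are missing) on the candidate."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


# Illustrative per-case results for two versions of the same system.
baseline = {"refund-policy": True, "date-parsing": True, "long-input": False}
candidate = {"refund-policy": True, "date-parsing": False, "long-input": True}

regressions = find_regressions(baseline, candidate)
print(f"baseline:  {pass_rate(baseline):.2f}")
print(f"candidate: {pass_rate(candidate):.2f}")
print(f"regressions: {regressions}")
```

Here both versions pass two of three cases, yet the candidate breaks a case the baseline handled, which the aggregate score alone would not reveal.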

Common pitfalls

Shipping on vibes with no measurement.
Evals that do not reflect real user inputs.

Quick check

Why do teams build evals for an AI feature?

Related concepts

Hallucinations
Retrieval-augmented generation (RAG)
Deploying AI to production
