Inference cost and latency

What it costs and how long it takes to run a model, and the levers that control both.

What it is

Inference is running a trained model to produce output. Cost is usually billed per token, covering both input and output; latency is the time between sending a request and receiving the response.
Both are central to whether an AI feature is viable at scale.

How it works

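Cost scales with model size, the number of input and output tokens, and call volume.
Latency depends on the model, the output length, and the serving infrastructure.
Levers include choosing a smaller model, shortening prompts and outputs, caching repeated requests, batching work that is not time-sensitive, and streaming output so the first tokens reach the user sooner.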
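
The relationships above can be sketched as a back-of-the-envelope estimator. This is a minimal illustration, not a provider's billing formula: the `ModelProfile` fields, the prices, and the speeds are all made-up assumptions, and the latency model (a fixed time to first token plus a constant generation rate) is a common simplification that ignores queueing and network effects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical pricing and speed figures; substitute your provider's real numbers."""
    input_price_per_mtok: float   # USD per million input tokens (assumed)
    output_price_per_mtok: float  # USD per million output tokens (assumed)
    time_to_first_token_s: float  # seconds before output starts (assumed)
    output_tokens_per_s: float    # generation speed once output starts (assumed)


def request_cost(model: ModelProfile, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call in USD: tokens times per-token price, input and output priced separately."""
    total = (input_tokens * model.input_price_per_mtok
             + output_tokens * model.output_price_per_mtok)
    return total / 1_000_000


def request_latency(model: ModelProfile, output_tokens: int) -> float:
    """Rough end-to-end seconds: wait for the first token, then generate the rest."""
    return model.time_to_first_token_s + output_tokens / model.output_tokens_per_s


def monthly_cost(model: ModelProfile, input_tokens: int, output_tokens: int,
                 calls_per_month: int) -> float:
    """Per-call cost multiplied by call volume."""
    return request_cost(model, input_tokens, output_tokens) * calls_per_month


# Two invented profiles to compare a larger and a smaller model.
large = ModelProfile(10.0, 30.0, 0.8, 40.0)
small = ModelProfile(1.0, 3.0, 0.3, 100.0)

if __name__ == "__main__":
    for name, model in (("large", large), ("small", small)):
        cost = monthly_cost(model, input_tokens=2000, output_tokens=500,
                            calls_per_month=100_000)
        latency = request_latency(model, output_tokens=500)
        print(f"{name}: ${cost:,.2f} per month, about {latency:.1f}s per request")
```

With these invented numbers, halving the output length reduces both the output-token cost and the generation time, which is why output length appears in both the cost and the latency estimate. With streaming, the user starts reading after roughly `time_to_first_token_s` rather than after the full latency, even though the total time is unchanged.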

Trade-offs

Cheaper, faster models may be less capable; the right choice depends on the task.
Optimizations add engineering complexity.
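
As one illustration of that added complexity, here is a minimal sketch of an exact-match response cache. The names `CachedModel` and `fake_model` are hypothetical, and `fake_model` is only a stand-in for a real model call. The sketch assumes that identical prompts may safely receive identical answers, which is not true for every feature.

```python
import hashlib
from typing import Callable, Dict


class CachedModel:
    """Wraps a model-calling function and reuses answers for repeated prompts."""

    def __init__(self, call_model: Callable[[str], str]) -> None:
        self._call_model = call_model
        self._cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def complete(self, prompt: str) -> str:
        # Hash the prompt so the cache key has a fixed size.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._cache:
            self.hits += 1          # no model call: no token cost, minimal latency
            return self._cache[key]
        self.misses += 1
        result = self._call_model(prompt)
        self._cache[key] = result
        return result


def fake_model(prompt: str) -> str:
    """Stand-in for a real (slow, billed) model call; assumption for illustration."""
    return prompt.upper()


if __name__ == "__main__":
    model = CachedModel(fake_model)
    model.complete("What are your opening hours?")
    model.complete("What are your opening hours?")  # served from the cache
    print(f"hits={model.hits} misses={model.misses}")
```

Even this small wrapper raises new questions that the uncached version did not have: when entries should expire, how much memory the cache may use, and whether near-identical prompts should count as a match.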

When to use it

Whenever a feature will run at meaningful volume or needs to feel responsive.

Common pitfalls

Defaulting to the largest model everywhere.
Ignoring how output length and call volume drive the bill.

Quick check

Which change most directly reduces the cost and latency of a request?

Related concepts

Tokens and context windows
Deploying AI to production
AI adoption in a business
