AI docs · Operations
Inference cost and latency
What it costs and how long it takes to run a model, and the levers that control both.
What it is
- Inference is running a trained model to produce output. Cost is usually billed per token, and latency is the time from sending a request to receiving the response.
- Both are central to whether an AI feature is viable at scale.
How it works
- Cost scales with model size, input and output tokens, and call volume.
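- Latency depends on the model, the output length, and the infrastructure serving it. Output is typically generated token by token, so longer outputs take longer to finish.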
- Levers include choosing a smaller model, shortening prompts and outputs, caching repeated requests, batching requests, and streaming output so users see partial results sooner.
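The relationships above can be sketched as a rough estimator. This is a minimal illustration, not a provider's billing formula: the `ModelProfile` class, its field names, and every price and speed below are hypothetical placeholders, and the latency model (a fixed time to first token plus a constant generation speed) is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical per-model figures; replace with measured or published numbers."""
    input_price_per_1k: float     # currency units per 1,000 input tokens (assumed)
    output_price_per_1k: float    # currency units per 1,000 output tokens (assumed)
    time_to_first_token_s: float  # seconds before output starts (assumed)
    output_tokens_per_s: float    # steady generation speed (assumed)


def cost_per_call(m: ModelProfile, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call: input and output tokens are priced separately."""
    return (input_tokens / 1000 * m.input_price_per_1k
            + output_tokens / 1000 * m.output_price_per_1k)


def monthly_cost(m: ModelProfile, input_tokens: int, output_tokens: int,
                 calls_per_month: int) -> float:
    """Total cost scales linearly with call volume."""
    return cost_per_call(m, input_tokens, output_tokens) * calls_per_month


def latency_s(m: ModelProfile, output_tokens: int) -> float:
    """Simplified latency: wait for the first token, then generate the rest."""
    return m.time_to_first_token_s + output_tokens / m.output_tokens_per_s


# Made-up profiles for comparison only.
large = ModelProfile(0.01, 0.03, 0.8, 40.0)
small = ModelProfile(0.001, 0.002, 0.3, 120.0)

if __name__ == "__main__":
    for name, model in (("large", large), ("small", small)):
        print(name,
              round(monthly_cost(model, 2000, 500, 100_000), 2),
              round(latency_s(model, 500), 2))
```

With these placeholder numbers, halving the output length cuts both the output share of the bill and the generation time, which is why trimming outputs is often the first lever to try.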
Trade-offs
- Cheaper, faster models may be less capable; the right choice depends on the task.
- Optimizations such as caching, batching, and streaming add engineering complexity that has to be built and maintained.
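One way to act on the capability trade-off is to route each task to the cheapest model that is good enough for it. The sketch below assumes you already have some capability score per model from your own evaluations. The tier names, relative costs, and scores are invented for illustration and do not refer to real models.

```python
from typing import NamedTuple, Sequence


class Tier(NamedTuple):
    name: str             # hypothetical label, not a real model name
    relative_cost: float  # cost relative to the cheapest tier (assumed)
    capability: int       # made-up 1-3 score standing in for your own evals


TIERS = [
    Tier("small", 1.0, 1),
    Tier("medium", 5.0, 2),
    Tier("large", 20.0, 3),
]


def pick_tier(required_capability: int, tiers: Sequence[Tier] = TIERS) -> Tier:
    """Return the cheapest tier whose capability meets the requirement."""
    candidates = [t for t in tiers if t.capability >= required_capability]
    if not candidates:
        raise ValueError("no tier meets the required capability")
    return min(candidates, key=lambda t: t.relative_cost)


if __name__ == "__main__":
    # A simple classification task might need level 1; multi-step reasoning level 3.
    print(pick_tier(1).name)
    print(pick_tier(3).name)
```

The router itself is part of the added complexity: the capability scores need to be maintained and re-checked whenever models or tasks change.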
When to use it
- Whenever a feature will run at meaningful volume or needs to feel responsive.
Common pitfalls
- Defaulting to the largest model everywhere.
- Ignoring how output length and call volume drive the bill.