
Inference cost and latency

What it costs and how long it takes to run a model, and the levers that control both.

What it is

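  • Inference is running a trained model to produce output. Cost is usually billed per token, often at different rates for input and output tokens; latency is the time from sending a request to receiving the response.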
  • Both are central to whether an AI feature is viable at scale.

How it works

  • Cost scales with model size, input and output tokens, and call volume.
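  • Latency depends on the model, the output length (output is generated token by token, so longer outputs take longer), and the serving infrastructure.
  • Levers include choosing a smaller model, shortening prompts and outputs, caching, batching, and streaming. Streaming does not shorten the total generation time, but it shows the first tokens sooner, so the feature feels more responsive.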
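
The relationships above can be sketched as a rough estimator. This is a minimal illustration, not a provider's billing formula: the `ModelProfile` fields, the per-million-token pricing unit, and the simple "time to first token plus generation time" latency model are all assumptions, and the numbers in the usage example are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical per-model figures; substitute your provider's real numbers."""
    input_price_per_mtok: float   # price per 1M input tokens (assumed unit)
    output_price_per_mtok: float  # price per 1M output tokens (assumed unit)
    time_to_first_token_s: float  # seconds before the first token arrives (assumed)
    output_tokens_per_s: float    # steady-state generation speed (assumed)


def cost_per_call(model: ModelProfile, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call: tokens times their per-token price, input and output priced separately."""
    total = (input_tokens * model.input_price_per_mtok
             + output_tokens * model.output_price_per_mtok)
    return total / 1_000_000


def monthly_cost(model: ModelProfile, input_tokens: int, output_tokens: int,
                 calls_per_month: int) -> float:
    """Call volume multiplies the per-call cost."""
    return cost_per_call(model, input_tokens, output_tokens) * calls_per_month


def estimated_latency_s(model: ModelProfile, output_tokens: int) -> float:
    """Simplified latency model: fixed startup delay plus time to generate the output."""
    return model.time_to_first_token_s + output_tokens / model.output_tokens_per_s


if __name__ == "__main__":
    # Made-up figures for illustration only.
    large = ModelProfile(3.0, 15.0, 0.5, 50.0)
    small = ModelProfile(0.25, 1.25, 0.2, 150.0)
    for name, profile in (("large", large), ("small", small)):
        bill = monthly_cost(profile, input_tokens=1000, output_tokens=500,
                            calls_per_month=100_000)
        wait = estimated_latency_s(profile, output_tokens=500)
        print(f"{name}: about {bill:.2f} per month, about {wait:.1f} s per call")
```

Halving `output_tokens` in this sketch lowers both the bill and the estimated wait, which is why shortening outputs appears among the levers.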

Trade-offs

  • Cheaper, faster models may be less capable; the right choice depends on the task.
  • Optimizations add engineering complexity.
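
As one illustration of that added complexity, here is a minimal sketch of exact-match response caching. The `make_cached` helper and the `call_model` parameter are hypothetical names, with `call_model` standing in for whatever function actually sends a prompt to a model; no real client library is assumed.

```python
import hashlib
from typing import Callable, Dict


def make_cached(call_model: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a prompt -> response function with an in-memory, exact-match cache."""
    cache: Dict[str, str] = {}

    def cached(prompt: str) -> str:
        # Hash the prompt so the cache key has a fixed size regardless of prompt length.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in cache:
            # Cache miss: pay for one real model call and remember the result.
            cache[key] = call_model(prompt)
        return cache[key]

    return cached


if __name__ == "__main__":
    def fake_model(prompt: str) -> str:
        # Stand-in for a real model call, used only to make the sketch runnable.
        print("calling model...")
        return prompt.upper()

    ask = make_cached(fake_model)
    ask("summarize this ticket")  # calls the stand-in model
    ask("summarize this ticket")  # served from the cache, no second call
```

Even this small version raises questions a production system has to answer: when entries expire, how large the cache may grow, and whether returning an identical response for an identical prompt is acceptable for the feature.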

When to use it

  • Whenever a feature will run at meaningful volume or needs to feel responsive.

Common pitfalls

  • Defaulting to the largest model everywhere.
  • Ignoring how output length and call volume drive the bill.

Related concepts