Tokens and context windows

LLMs read and write text as tokens, and they can only consider a limited window of tokens at once. That limit shapes cost, latency, and how much material fits in a single request.

What it is

  • A token is a chunk of text, often a whole word or a piece of one. Models measure input and output in tokens, and pricing is usually per token.
  • The context window is the maximum number of tokens the model can consider at once, including both your input and its output.
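
The two definitions above can be turned into a quick budget check. The sketch below is illustrative only: the function names are hypothetical, the 4-characters-per-token ratio is a rough rule of thumb for English text rather than a real tokenizer, and any prices passed in are placeholders you would supply yourself.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (assumed ratio, not a real tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_window(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """The window has to hold the input plus the tokens reserved for the output."""
    return input_tokens + max_output_tokens <= context_window


def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Per-token pricing, expressed here as a price per 1,000 tokens (placeholder prices)."""
    return (input_tokens / 1000) * input_price_per_1k + (output_tokens / 1000) * output_price_per_1k


prompt = "Summarize the attached report in three bullet points."
prompt_tokens = estimate_tokens(prompt)
# 8000 is a made-up window size; 500 is the output we choose to reserve.
print(prompt_tokens, fits_in_window(prompt_tokens, 500, 8000))
```

For real budgeting, swap estimate_tokens for the tokenizer that matches your model, since counts vary between models and languages.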

How it works

  • Everything you send (system prompt, instructions, documents, history) consumes context.
  • If you exceed the window, older content must be dropped, summarized, or retrieved on demand.
  • Longer contexts cost more and can be slower, and a model does not necessarily use every part of a long context equally well.
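
One way to handle overflow is to drop the oldest messages first. The following is a minimal sketch under stated assumptions: trim_history and count_words are hypothetical names, messages are plain strings, and a whitespace word count stands in for a real tokenizer.

```python
from typing import Callable, List


def count_words(text: str) -> int:
    """Stand-in token counter: counts whitespace-separated words (an assumption)."""
    return len(text.split())


def trim_history(
    system_prompt: str,
    messages: List[str],
    context_window: int,
    reserved_output: int,
    count_tokens: Callable[[str], int] = count_words,
) -> List[str]:
    """Keep the newest messages that fit after the system prompt and output are accounted for."""
    budget = context_window - reserved_output - count_tokens(system_prompt)
    if budget < 0:
        raise ValueError("system prompt and reserved output already exceed the window")

    kept: List[str] = []
    used = 0
    # Walk from newest to oldest so that the oldest content is what gets dropped.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost

    kept.reverse()  # restore chronological order
    return kept


history = ["one two three", "four five", "six"]
print(trim_history("sys prompt", history, context_window=10, reserved_output=4))
```

Dropping is the simplest of the three options. Summarizing the dropped messages or retrieving them on demand keeps more information at the cost of extra work per request.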

Trade-offs

  • Large windows simplify some designs but raise cost and latency.
  • Retrieval (RAG) can be cheaper than stuffing everything into the prompt.

When to use it

  • Plan context budgets for any production use.
  • Use retrieval when the relevant knowledge is larger than the window or changes often.
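
To make the retrieval idea concrete, here is a toy sketch that picks only the most relevant chunks that fit a token budget. Everything in it is an illustrative assumption: the names score and retrieve are hypothetical, relevance is plain word overlap, and tokens are approximated by word count. Production systems typically use embeddings or a search index instead.

```python
from typing import List


def score(query: str, chunk: str) -> int:
    """Toy relevance score: number of distinct query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words)


def retrieve(query: str, chunks: List[str], token_budget: int) -> List[str]:
    """Select the highest-scoring chunks whose combined size stays within the budget."""
    ranked = sorted(chunks, key=lambda chunk: score(query, chunk), reverse=True)
    selected: List[str] = []
    used = 0
    for chunk in ranked:
        if score(query, chunk) == 0:
            break  # remaining chunks share no words with the query
        cost = len(chunk.split())  # word count as a stand-in for tokens
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected


knowledge_base = [
    "refund policy lasts thirty days",
    "shipping takes five days",
    "our office is in Berlin",
]
print(retrieve("refund policy days", knowledge_base, token_budget=6))
```

Only the selected chunks go into the prompt, so the knowledge base can grow or change without the prompt growing with it.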

Common pitfalls

  • Assuming a bigger window removes the need for good retrieval.
  • Ignoring how much hidden context (system prompts, tools) is already consumed.
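
A simple accounting step can surface hidden context before it causes trouble. The sketch below uses a hypothetical helper and made-up token counts. In practice you would measure each component with your model's tokenizer.

```python
from typing import Dict


def remaining_budget(context_window: int, components: Dict[str, int], reserved_output: int) -> int:
    """Tokens left for user content after fixed overhead and reserved output are subtracted."""
    overhead = sum(components.values())
    return context_window - overhead - reserved_output


# Illustrative numbers only; none of these come from a real model or deployment.
overhead = {
    "system_prompt": 600,
    "tool_definitions": 1400,
    "conversation_history": 2000,
}
left = remaining_budget(context_window=8000, components=overhead, reserved_output=1000)
print(f"Tokens left for documents and the user message: {left}")
```

A negative result means the request cannot fit as configured, which is easier to catch here than after a failed or truncated call.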
