Tokens and context windows
LLMs read and write text as tokens, and they can only consider a limited window of tokens at once. That limit shapes cost, latency, and how much material fits in a single request.
What it is
- A token is a chunk of text (often a word-piece). Models measure input and output in tokens, and pricing is usually per token.
- The context window is the maximum number of tokens the model can consider at once, including both your input and its output.
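As a rough illustration of the two definitions above, the sketch below estimates a token count, prices a request, and checks whether input plus output fits the window. The 4-characters-per-token ratio is only a common rule of thumb for English text, and the function names and prices are hypothetical placeholders; exact counts come from the tokenizer of the specific model.

```python
import math

# Assumption: about 4 characters per token for English text. Real tokenizers
# differ by model and language, so treat this as an estimate only.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate derived from character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_million: float,
                  output_price_per_million: float) -> float:
    """Cost of one request, given per-million-token prices (placeholders)."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


def fits_window(input_tokens: int, max_output_tokens: int,
                context_window: int) -> bool:
    """Input and output share the same window, so both count against it."""
    return input_tokens + max_output_tokens <= context_window
```

Note that a request can fail the check even when the input alone is under the limit, because the space reserved for the reply counts too.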
How it works
- Everything you send (system prompt, instructions, documents, history) consumes context.
- If the total would exceed the window, some content (typically the oldest) has to be dropped, summarized, or moved out of the prompt and retrieved on demand.
- Longer contexts cost more and can be slower; a longer context also does not guarantee that the model uses every part of it equally well.
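One simple way to handle an overflowing conversation is to keep the system prompt and drop the oldest messages first. The following is a minimal sketch of that strategy only, not of summarization or retrieval; `trim_history` and `count_words` are hypothetical names, and the word-count tokenizer is a stand-in for a real one.

```python
from typing import Callable, List


def count_words(text: str) -> int:
    """Stand-in token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


def trim_history(system_prompt: str, history: List[str], budget: int,
                 count_tokens: Callable[[str], int] = count_words) -> List[str]:
    """Return the most recent messages that fit in the budget after the system prompt."""
    remaining = budget - count_tokens(system_prompt)
    if remaining < 0:
        raise ValueError("system prompt alone exceeds the budget")

    kept: List[str] = []
    # Walk from newest to oldest and stop at the first message that does not fit,
    # so the kept history is always a contiguous recent suffix.
    for message in reversed(history):
        cost = count_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost

    kept.reverse()  # restore chronological order
    return kept
```

Stopping at the first message that does not fit, instead of skipping it and continuing, avoids leaving gaps in the middle of the conversation.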
Trade-offs
- Large windows simplify some designs but raise cost and latency.
- Retrieval-augmented generation (RAG) can be cheaper than stuffing everything into the prompt, since only the passages relevant to each request are sent.
When to use it
- Plan context budgets for any production use.
- Use retrieval when the relevant knowledge is larger than the window or changes often.
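To make the retrieval idea concrete, here is a toy sketch that picks the chunks most relevant to a query and stops adding them once a token budget is used up. It scores relevance by plain word overlap, which is an assumption made for brevity; production systems usually use embeddings or a search index instead. All names here are hypothetical.

```python
from typing import Callable, List


def count_words(text: str) -> int:
    """Stand-in token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


def score(query: str, chunk: str) -> int:
    """Toy relevance score: number of distinct query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words)


def retrieve(query: str, chunks: List[str], budget: int,
             count_tokens: Callable[[str], int] = count_words) -> List[str]:
    """Select the highest-scoring chunks whose combined size fits the budget."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    selected: List[str] = []
    used = 0
    for chunk in ranked:
        if score(query, chunk) == 0:
            break  # everything after this point is irrelevant to the query
        cost = count_tokens(chunk)
        if used + cost > budget:
            continue  # too large for the remaining space; try a smaller chunk
        selected.append(chunk)
        used += cost
    return selected
```

Because only the selected chunks enter the prompt, the knowledge base can be far larger than the window and can change without touching the prompt template.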
Common pitfalls
- Assuming a bigger window removes the need for good retrieval.
- Ignoring how much of the window is already consumed by hidden context (system prompts, tool definitions).
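A quick way to avoid the second pitfall is to subtract the fixed overhead before deciding how much room documents and history actually get. The sketch below shows that arithmetic; the function name and all numbers in the usage note are hypothetical and not tied to any particular model.

```python
def remaining_for_content(context_window: int,
                          system_prompt_tokens: int,
                          tool_definition_tokens: int,
                          reserved_output_tokens: int) -> int:
    """Tokens left for documents and history after fixed overhead is subtracted."""
    fixed_overhead = (system_prompt_tokens
                      + tool_definition_tokens
                      + reserved_output_tokens)
    remaining = context_window - fixed_overhead
    if remaining < 0:
        raise ValueError("fixed overhead alone exceeds the context window")
    return remaining
```

For example, with an 8,000-token window, a 600-token system prompt, 1,400 tokens of tool definitions, and 1,000 tokens reserved for the reply, `remaining_for_content(8000, 600, 1400, 1000)` returns 5000, so well over a third of the window is spent before any user content is added.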