

Tokens and context windows

LLMs read and write in tokens, and they can only consider a limited window of them at once. That limit shapes cost, latency, and how much material fits in a single request.

What it is

A token is a chunk of text (often a word-piece). Models measure input and output in tokens, and pricing is usually per token.
The context window is the maximum number of tokens the model can consider at once, including both your input and its output.
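Because input and output share one window, a request only fits when the input tokens plus the tokens reserved for the reply stay within the limit. The sketch below illustrates that check. The four-characters-per-token ratio is only a rough rule of thumb for English text and is an assumption here, as are the function names and the window size; real counts come from the tokenizer of the model you use.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate (assumed ratio); use the model's tokenizer for real counts."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


def fits_in_window(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """Input and output draw on the same window, so their sum must fit."""
    return input_tokens + max_output_tokens <= context_window


prompt = "Summarize the attached report in three bullet points."
input_tokens = estimate_tokens(prompt)
# 8_000 is a hypothetical window size chosen for illustration.
print(input_tokens, fits_in_window(input_tokens, max_output_tokens=500, context_window=8_000))
```

Reserving the output budget up front avoids replies that are cut off because the input left too little room.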

How it works

Everything you send (system prompt, instructions, documents, history) consumes context.
If the content would exceed the window, older content must be dropped, summarized, or retrieved on demand.
Longer contexts cost more and can be slower, and a model does not necessarily use every part of a long context equally well.
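Dropping older content is the simplest of those options. A minimal sketch, assuming the system prompt is always kept and the newest messages matter most: walk the history backwards and keep messages until the budget runs out. The names trim_history and count_tokens are hypothetical, and the word-count tokenizer in the usage lines is a stand-in for a real one.

```python
from typing import Callable, List


def trim_history(
    system_prompt: str,
    history: List[str],
    budget: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Return the newest messages that fit in the budget alongside the system prompt."""
    remaining = budget - count_tokens(system_prompt)
    if remaining < 0:
        raise ValueError("system prompt alone exceeds the budget")

    kept: List[str] = []
    for message in reversed(history):  # newest first
        cost = count_tokens(message)
        if cost > remaining:
            break  # this message and everything older is dropped
        kept.append(message)
        remaining -= cost

    kept.reverse()  # restore chronological order
    return kept


def word_count(text: str) -> int:
    """Stand-in tokenizer for illustration: one token per whitespace-separated word."""
    return len(text.split())


history = ["first question here", "short answer", "a much newer question"]
print(trim_history("be concise", history, budget=8, count_tokens=word_count))
```

Summarizing the dropped messages instead of discarding them follows the same shape: replace the break with a call that condenses the older messages into one short entry.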

Trade-offs

Large windows simplify some designs but raise cost and latency.
Retrieval (RAG) can be cheaper than stuffing everything into the prompt.
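A back-of-the-envelope comparison makes the cost difference concrete. Every number below is hypothetical (the price, the corpus size, the chunk sizes), and the sketch counts input tokens only; it ignores output tokens and the cost of building and querying a retrieval index.

```python
def prompt_cost(input_tokens: int, price_per_million_tokens: float) -> float:
    """Input cost of one request under simple per-token pricing."""
    return input_tokens / 1_000_000 * price_per_million_tokens


# All values are assumptions chosen for illustration, not real prices or sizes.
PRICE = 1.00                  # per million input tokens, in any currency
QUESTION_TOKENS = 200         # the user's question plus instructions
FULL_CORPUS_TOKENS = 100_000  # stuffing the whole knowledge base into the prompt
RETRIEVED_TOKENS = 4 * 500    # four retrieved chunks of about 500 tokens each

stuffed = prompt_cost(QUESTION_TOKENS + FULL_CORPUS_TOKENS, PRICE)
retrieved = prompt_cost(QUESTION_TOKENS + RETRIEVED_TOKENS, PRICE)

print(f"stuffed: {stuffed:.4f}  retrieved: {retrieved:.4f}  ratio: {stuffed / retrieved:.1f}x")
```

Under these assumed numbers the stuffed prompt costs roughly 45 times more per request. The gap narrows as the corpus shrinks or as more chunks are retrieved.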

When to use it

Plan context budgets for any production use.
Use retrieval when the relevant knowledge is larger than the window or changes often.
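Retrieval on a budget can be sketched as ranking chunks by relevance and adding them until the token budget is spent. The keyword-overlap score below is a deliberately simple stand-in; production systems typically rank with embeddings or a search index. The function names and the word-count tokenizer are assumptions for illustration.

```python
from typing import Callable, List


def score(query: str, chunk: str) -> int:
    """Count distinct query words that also appear in the chunk (toy relevance score)."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def select_chunks(
    query: str,
    chunks: List[str],
    token_budget: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Pick the most relevant chunks that fit within the token budget."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    selected: List[str] = []
    used = 0
    for chunk in ranked:
        if score(query, chunk) == 0:
            break  # remaining chunks share no words with the query
        cost = count_tokens(chunk)
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected


def word_count(text: str) -> int:
    """Stand-in tokenizer for illustration: one token per whitespace-separated word."""
    return len(text.split())


docs = [
    "refunds are processed within five days",
    "our office is closed on holidays",
    "refunds require a receipt",
]
print(select_chunks("how are refunds processed", docs, token_budget=10, count_tokens=word_count))
```

Because only the selected chunks enter the prompt, the knowledge base can grow or change without affecting the size of each request.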

Common pitfalls

Assuming a bigger window removes the need for good retrieval.
Ignoring how much hidden context (system prompts, tools) is already consumed.
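One way to avoid the second pitfall is to tally the fixed overhead explicitly before deciding how much room is left for documents and history. The sketch below assumes hypothetical sizes for the window, the reserved output, and each overhead component; measure the real ones with your model's tokenizer.

```python
from typing import Dict


def remaining_budget(context_window: int, reserved_output: int, overhead: Dict[str, int]) -> int:
    """Tokens left for user content after fixed overhead and the reserved reply."""
    return context_window - reserved_output - sum(overhead.values())


# Hypothetical token counts for content the end user never sees.
overhead = {"system_prompt": 800, "tool_definitions": 1_500}

left = remaining_budget(context_window=8_000, reserved_output=1_000, overhead=overhead)
print(left)
```

A negative result means the hidden context and reserved reply already exceed the window before any user content is added.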

Quick check

What happens when a conversation grows past the model's context window?

Related concepts

Large language models (LLMs)
Retrieval-augmented generation (RAG)
Inference cost and latency


Tokens and context windows: explained · SDEN