Tokens, Context Windows, and Temperature
When you work with LLM APIs, you'll encounter parameters like temperature, max_tokens, and top_p. Understanding what these do is essential for getting good results and managing costs.
What Are Tokens?
Tokens are the units LLMs use to process text. A token is roughly a word or part of a word. LLMs don't read characters or words — they read tokens.
How Tokenization Works
"Hello, world!" → ["Hello", ",", " world", "!"] = 4 tokens
"I'm learning" → ["I", "'m", " learning"] = 3 tokens
"ChatGPT" → ["Chat", "G", "PT"] = 3 tokens
"the" → ["the"] = 1 token
Rules of thumb:
- 1 token ≈ 4 characters in English
- 1 token ≈ ¾ of a word
- 100 tokens ≈ 75 words
- 1,000 tokens ≈ 750 words (about 1.5 pages of text)
Why Tokens Matter
- Pricing — API costs are calculated per token (both input and output)
- Context limits — Each model has a maximum number of tokens it can process
- Speed — More tokens = longer response time
Code: Estimating Token Count
```typescript
// Rough estimation (not exact, but useful for planning)
function estimateTokens(text: string): number {
  // English text averages about 4 characters per token
  return Math.ceil(text.length / 4);
}

console.log(estimateTokens("Hello, how are you today?")); // ~7 tokens (actual: 7)

// For exact counts, use a tokenizer library:
//   npm install tiktoken   (for OpenAI models)
// Or use the model provider's tokenizer API
```
Token Pricing Examples
| Model | Input Price (per 1M tokens) | Output Price (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o mini | $0.15 | $0.60 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3.5 Haiku | $0.80 | $4.00 |
| Gemini 1.5 Pro | $1.25 | $5.00 |
Key insight: Output tokens are usually 2–5x more expensive than input tokens. This is why setting max_tokens wisely matters.
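The pricing math is easy to script. Here's a minimal sketch, using the GPT-4o and GPT-4o mini rates from the table above (rates change, so check your provider's pricing page before relying on these numbers):

```typescript
// Price per 1M tokens, taken from the table above (verify current rates)
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

const PRICES: Record<string, Pricing> = {
  "gpt-4o":      { inputPerMillion: 2.50, outputPerMillion: 10.00 },
  "gpt-4o-mini": { inputPerMillion: 0.15, outputPerMillion: 0.60 },
};

// Cost in dollars for a single request
function estimateCost(inputTokens: number, outputTokens: number, p: Pricing): number {
  return (inputTokens / 1_000_000) * p.inputPerMillion +
         (outputTokens / 1_000_000) * p.outputPerMillion;
}

// A 2,000-token prompt with a 500-token answer on GPT-4o:
console.log(estimateCost(2000, 500, PRICES["gpt-4o"])); // ≈ $0.01
```

Notice how the output half of the bill matches the input half here despite being a quarter of the tokens, which is the "2–5x more expensive" effect in action.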
Context Windows
The context window is the total number of tokens a model can "see" at once — this includes both your input (prompt + conversation history) and the model's output.
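Because input and output share the window, it's worth checking that a prompt plus the requested output will fit before sending the request. A rough pre-flight check, reusing the 4-characters-per-token estimate (the real limit check should use the provider's tokenizer):

```typescript
// Rough pre-flight check: will this request fit in the model's context window?
function fitsInContext(
  promptText: string,
  maxOutputTokens: number,
  contextWindow: number
): boolean {
  const promptTokens = Math.ceil(promptText.length / 4); // rough estimate
  return promptTokens + maxOutputTokens <= contextWindow;
}

const prompt = "Summarize this document: ...";
console.log(fitsInContext(prompt, 1024, 128_000)); // true — plenty of headroom
```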
Context Window Sizes
| Model | Context Window |
|---|---|
| GPT-4o | 128,000 tokens |
| Claude 3.5 Sonnet | 200,000 tokens |
| Gemini 1.5 Pro | 1,000,000+ tokens |
| Llama 3 (8B) | 8,192 tokens |
| Llama 3.1 (405B) | 128,000 tokens |
What Fits in a Context Window?
128,000 tokens ≈
- ~96,000 words
- ~300 pages of text
- ~A full novel
200,000 tokens ≈
- ~150,000 words
- ~470 pages of text
- ~A large codebase
1,000,000 tokens ≈
- ~750,000 words
- ~Several books
- ~An entire repository
Why Context Windows Matter for Developers
- Conversation history — In a chatbot, every message (user and assistant) uses tokens. Long conversations eventually exceed the limit.
- Code analysis — When you paste code for the model to review, large files eat into the context.
- RAG applications — When you inject retrieved documents into the prompt, you need room for both the context and the response.
```typescript
// Managing context in a conversation
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

function trimConversation(messages: Message[], maxTokens: number): Message[] {
  // Always keep the system prompt
  const systemMessage = messages.find(m => m.role === "system");
  const conversation = messages.filter(m => m.role !== "system");

  let totalTokens = estimateTokens(systemMessage?.content || "");
  const trimmed: Message[] = systemMessage ? [systemMessage] : [];

  // Keep recent messages, drop oldest ones
  for (let i = conversation.length - 1; i >= 0; i--) {
    const msgTokens = estimateTokens(conversation[i].content);
    if (totalTokens + msgTokens > maxTokens) break;
    totalTokens += msgTokens;
    trimmed.splice(systemMessage ? 1 : 0, 0, conversation[i]);
  }

  return trimmed;
}
```
Temperature
Temperature controls the randomness (creativity) of the model's output. It's a number typically between 0 and 2.
| Temperature | Behavior | Use Case |
|---|---|---|
| 0 | Deterministic, focused | Code generation, factual Q&A, data extraction |
| 0.3 | Slightly creative | Technical writing, analysis |
| 0.7 | Balanced (common default) | General conversation, explanations |
| 1.0 | Creative | Brainstorming, creative writing |
| 1.5–2.0 | Very random | Experimental, poetry, wild ideas |
How Temperature Works (Simplified)
When predicting the next token, the model calculates probabilities for every possible token:
Prompt: "The best programming language is"
Temperature 0 (most likely):
"Python" (40%) → Always picks this ✓
"JavaScript" (25%)
"TypeScript" (15%)
Temperature 0.7 (balanced):
"Python" (40%) → Usually picks this, sometimes others
"JavaScript" (25%) → Occasionally picks this
"TypeScript" (15%) → Rarely picks this
Temperature 1.5 (creative):
"Python" (40%)
"JavaScript" (25%)
"TypeScript" (15%)
"Haskell" (5%) → Might even pick this!
Other Important Parameters
top_p (Nucleus Sampling)
top_p limits the model to considering only the most probable tokens that add up to probability p.
| top_p | Effect |
|---|---|
| 0.1 | Only considers tokens making up the top 10% of probability mass — very focused |
| 0.5 | Considers tokens making up 50% of probability — moderate |
| 0.9 | Considers tokens making up 90% of probability — diverse |
| 1.0 | Considers all tokens (default) |
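A simplified sketch of that filtering step: sort tokens by probability, keep adding them until the running total reaches p, then renormalize and sample only from that set (the probabilities below reuse the illustrative values from the temperature example):

```typescript
// Nucleus (top_p) filtering: keep the smallest set of tokens whose
// cumulative probability reaches p, then renormalize
function topPFilter(probs: Map<string, number>, p: number): Map<string, number> {
  const sorted = [...probs.entries()].sort((a, b) => b[1] - a[1]);
  const kept: [string, number][] = [];
  let cumulative = 0;
  for (const [token, prob] of sorted) {
    kept.push([token, prob]);
    cumulative += prob;
    if (cumulative >= p) break; // nucleus is complete
  }
  const total = kept.reduce((sum, [, prob]) => sum + prob, 0);
  return new Map(kept.map(([token, prob]) => [token, prob / total]));
}

const probs = new Map([
  ["Python", 0.4], ["JavaScript", 0.25], ["TypeScript", 0.15], ["Haskell", 0.05],
]);
console.log(topPFilter(probs, 0.5)); // only "Python" and "JavaScript" survive
```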
Best practice: Adjust either temperature or top_p, not both. They serve similar purposes.
max_tokens
The maximum number of tokens in the model's response. This does NOT affect quality — it just sets a hard cutoff.
```typescript
// If you set max_tokens too low, the response gets cut off mid-sentence.
// If you set it too high, you might pay for tokens you don't need.

// Good defaults:
//   Short answers:   max_tokens: 256
//   Paragraphs:      max_tokens: 1024
//   Long content:    max_tokens: 4096
//   Maximum output:  max_tokens: 8192 (model dependent)
```
frequency_penalty and presence_penalty
These reduce repetition:
| Parameter | What It Does | Range |
|---|---|---|
| frequency_penalty | Penalizes tokens that already appeared frequently | -2.0 to 2.0 |
| presence_penalty | Penalizes any token that appeared at all | -2.0 to 2.0 |
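OpenAI documents these as adjustments to a token's logit before sampling: the frequency penalty scales with how often the token has already appeared, while the presence penalty is a flat, one-time penalty. A sketch of that formula (other providers may implement repetition control differently):

```typescript
// How the penalties adjust a token's logit before sampling
// (per OpenAI's documented formula; other providers may differ)
function penalizedLogit(
  logit: number,
  countSoFar: number,         // how many times this token has already appeared
  frequencyPenalty: number,
  presencePenalty: number
): number {
  return logit
    - countSoFar * frequencyPenalty               // grows with each repetition
    - (countSoFar > 0 ? 1 : 0) * presencePenalty; // flat penalty once it appears
}

// A token that already appeared 3 times, with both penalties at 0.5:
console.log(penalizedLogit(2.0, 3, 0.5, 0.5)); // 2.0 - 1.5 - 0.5 = 0.0
```

Negative values work the same way in reverse: they make repeated tokens more likely, which is rarely what you want.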
Practical Code Example: API Call with Parameters
```typescript
// Complete example: Calling an LLM API with tuned parameters
async function generateCode(prompt: string): Promise<string> {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o",
      messages: [
        { role: "system", content: "You are a senior TypeScript developer. Write clean, typed code." },
        { role: "user", content: prompt }
      ],
      temperature: 0,       // Deterministic for code
      max_tokens: 2048,     // Enough for a full function
      top_p: 1,             // Default (not adjusting since we set temperature)
      frequency_penalty: 0, // Default
      presence_penalty: 0,  // Default
    }),
  });

  const data = await response.json();
  return data.choices[0].message.content;
}

// Example: Creative writing with different parameters
async function brainstorm(topic: string): Promise<string> {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o",
      messages: [
        { role: "system", content: "You are a creative brainstorming partner. Think outside the box." },
        { role: "user", content: `Give me 10 creative ideas for: ${topic}` }
      ],
      temperature: 1.2,      // High creativity
      max_tokens: 1024,
      presence_penalty: 0.6, // Encourage diverse ideas
    }),
  });

  const data = await response.json();
  return data.choices[0].message.content;
}
```
Parameter Cheat Sheet
| Task | Temperature | max_tokens | Notes |
|---|---|---|---|
| Code generation | 0 | 2048–4096 | Deterministic is best for code |
| Bug fixing | 0 | 1024 | You want consistent, accurate fixes |
| Summarization | 0.3 | 512–1024 | Slightly creative but factual |
| General Q&A | 0.7 | 1024 | Balanced default |
| Creative writing | 1.0–1.3 | 2048+ | More randomness = more creativity |
| Brainstorming | 1.0–1.5 | 1024 | High creativity, add presence_penalty |
| Data extraction | 0 | 512 | You want exact, consistent output |
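If you find yourself reusing these settings, the cheat sheet translates naturally into a small lookup table. A sketch with the values above as starting points (the task names are just illustrative labels, not a standard):

```typescript
type Task = "code" | "bugfix" | "summarize" | "qa" | "creative" | "brainstorm" | "extract";

interface GenerationParams {
  temperature: number;
  max_tokens: number;
  presence_penalty?: number;
}

// Defaults lifted from the cheat sheet above — starting points, not rules
const TASK_PARAMS: Record<Task, GenerationParams> = {
  code:       { temperature: 0,   max_tokens: 4096 },
  bugfix:     { temperature: 0,   max_tokens: 1024 },
  summarize:  { temperature: 0.3, max_tokens: 1024 },
  qa:         { temperature: 0.7, max_tokens: 1024 },
  creative:   { temperature: 1.2, max_tokens: 2048 },
  brainstorm: { temperature: 1.2, max_tokens: 1024, presence_penalty: 0.6 },
  extract:    { temperature: 0,   max_tokens: 512 },
};

console.log(TASK_PARAMS["code"]); // { temperature: 0, max_tokens: 4096 }
```

Spreading `...TASK_PARAMS[task]` into your request body keeps these choices in one place instead of scattered across call sites.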
Key Takeaways
- Tokens are the units LLMs process — one token is roughly ¾ of a word
- Context window is the total token limit for input + output combined
- Temperature controls creativity (0 = focused, 1+ = creative)
- max_tokens sets a hard limit on response length
- Use low temperature for code and facts, high temperature for creativity
- Token costs add up — monitor usage in production applications
What's Next?
You understand the mechanics. Now let's zoom out and look at the GenAI landscape — which models are available, how they compare, and when to use each one.
What to ask your AI: "I'm building a [type of feature]. What temperature and max_tokens should I use for the best results?"