
    Tokens, Context Windows, and Temperature


    When you work with LLM APIs, you'll encounter parameters like temperature, max_tokens, and top_p. Understanding what these do is essential for getting good results and managing costs.

    What Are Tokens?

    Tokens are the units LLMs use to process text. A token is roughly a word or part of a word. LLMs don't read characters or words — they read tokens.

    How Tokenization Works

    "Hello, world!" → ["Hello", ",", " world", "!"]   = 4 tokens
    "I'm learning"  → ["I", "'m", " learning"]        = 3 tokens
    "ChatGPT"       → ["Chat", "G", "PT"]             = 3 tokens
    "the"           → ["the"]                         = 1 token
    

    Rules of thumb:

    • 1 token ≈ 4 characters in English
    • 1 token ≈ ¾ of a word
    • 100 tokens ≈ 75 words
    • 1,000 tokens ≈ 750 words (about 1.5 pages of text)

    Why Tokens Matter

    1. Pricing — API costs are calculated per token (both input and output)
    2. Context limits — Each model has a maximum number of tokens it can process
    3. Speed — More tokens = longer response time

    Code: Estimating Token Count

    // Rough estimation (not exact, but useful for planning)
    function estimateTokens(text: string): number {
      // English text averages about 4 characters per token
      return Math.ceil(text.length / 4);
    }
    
    console.log(estimateTokens("Hello, how are you today?"));
    // ~7 tokens (actual: 7)
    
    // For exact counts, use a tokenizer library:
    // npm install tiktoken (for OpenAI models)
    // Or use the model provider's tokenizer API

    Token Pricing Examples

    Model               Input Price (per 1M tokens)   Output Price (per 1M tokens)
    GPT-4o              $2.50                         $10.00
    GPT-4o mini         $0.15                         $0.60
    Claude 3.5 Sonnet   $3.00                         $15.00
    Claude 3.5 Haiku    $0.80                         $4.00
    Gemini 1.5 Pro      $1.25                         $5.00

    Key insight: Output tokens are typically 4-5x more expensive than input tokens, as the table above shows. This is why setting max_tokens wisely matters.
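To see how these rates turn into an actual bill, here is a small sketch that estimates the cost of one API call. The default prices are the GPT-4o rates from the table above; the token counts are numbers you would get from your usage logs.

```typescript
// Estimate the dollar cost of a single API call, given token counts
// and per-million-token prices (defaults: GPT-4o rates from the table above).
function estimateCost(
  inputTokens: number,
  outputTokens: number,
  inputPricePerM: number = 2.50,
  outputPricePerM: number = 10.00
): number {
  return (inputTokens / 1_000_000) * inputPricePerM +
         (outputTokens / 1_000_000) * outputPricePerM;
}

// A 2,000-token prompt with a 500-token answer:
console.log(estimateCost(2000, 500)); // → 0.01 ($0.005 input + $0.005 output)
```

Notice that the 500 output tokens cost as much as the 2,000 input tokens — the output rate dominates even for short responses.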

    Context Windows

    The context window is the total number of tokens a model can "see" at once — this includes both your input (prompt + conversation history) and the model's output.

    Context Window Sizes

    Model               Context Window
    GPT-4o              128,000 tokens
    Claude 3.5 Sonnet   200,000 tokens
    Gemini 1.5 Pro      1,000,000+ tokens
    Llama 3 (8B)        8,000 tokens
    Llama 3.1 (405B)    128,000 tokens

    What Fits in a Context Window?

    128,000 tokens ≈
      - ~96,000 words
      - ~300 pages of text
      - a full novel
    
    200,000 tokens ≈
      - ~150,000 words
      - ~470 pages of text
      - a large codebase
    
    1,000,000 tokens ≈
      - ~750,000 words
      - several books
      - an entire repository
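Using the same ~4-characters-per-token rule of thumb from earlier, you can sanity-check whether a document will fit before sending it. This is a rough sketch for planning, not an exact check — use a real tokenizer in production.

```typescript
// Rough check: will this text fit in the model's context window,
// leaving room for the response? Uses the ~4 chars/token heuristic.
function fitsInContext(
  text: string,
  contextWindow: number,
  reservedForOutput: number = 4096
): boolean {
  const estimatedInputTokens = Math.ceil(text.length / 4);
  return estimatedInputTokens + reservedForOutput <= contextWindow;
}

const novel = "x".repeat(600_000); // ~150,000 tokens of text
console.log(fitsInContext(novel, 128_000)); // false — too big for GPT-4o
console.log(fitsInContext(novel, 200_000)); // true — fits in Claude 3.5 Sonnet
```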
    

    Why Context Windows Matter for Developers

    1. Conversation history — In a chatbot, every message (user and assistant) uses tokens. Long conversations eventually exceed the limit.
    2. Code analysis — When you paste code for the model to review, large files eat into the context.
    3. RAG applications — When you inject retrieved documents into the prompt, you need room for both the context and the response.

    Code: Trimming Conversation History

    // Managing context in a conversation
    interface Message {
      role: "system" | "user" | "assistant";
      content: string;
    }
    
    function trimConversation(messages: Message[], maxTokens: number): Message[] {
      // Always keep the system prompt
      const systemMessage = messages.find(m => m.role === "system");
      const conversation = messages.filter(m => m.role !== "system");
    
      let totalTokens = estimateTokens(systemMessage?.content || "");
      const trimmed: Message[] = systemMessage ? [systemMessage] : [];
    
      // Keep recent messages, drop oldest ones
      for (let i = conversation.length - 1; i >= 0; i--) {
        const msgTokens = estimateTokens(conversation[i].content);
        if (totalTokens + msgTokens > maxTokens) break;
        totalTokens += msgTokens;
        trimmed.splice(systemMessage ? 1 : 0, 0, conversation[i]);
      }
    
      return trimmed;
    }

    Temperature

    Temperature controls the randomness (creativity) of the model's output. It's a number typically between 0 and 2.

    Temperature   Behavior                    Use Case
    0             Deterministic, focused      Code generation, factual Q&A, data extraction
    0.3           Slightly creative           Technical writing, analysis
    0.7           Balanced (common default)   General conversation, explanations
    1.0           Creative                    Brainstorming, creative writing
    1.5–2.0       Very random                 Experimental, poetry, wild ideas

    How Temperature Works (Simplified)

    When predicting the next token, the model calculates probabilities for every possible token:

    Prompt: "The best programming language is"
    
    Temperature 0 (most likely):
      "Python" (40%) → Always picks this ✓
      "JavaScript" (25%)
      "TypeScript" (15%)
    
    Temperature 0.7 (balanced):
      "Python" (40%) → Usually picks this, sometimes others
      "JavaScript" (25%) → Occasionally picks this
      "TypeScript" (15%) → Rarely picks this
    
    Temperature 1.5 (creative):
      "Python" (40%)
      "JavaScript" (25%)
      "TypeScript" (15%)
      "Haskell" (5%) → Might even pick this!
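The reshaping above can be sketched directly: temperature divides the model's raw scores (logits) before the softmax, so low temperatures sharpen the distribution toward the top token and high temperatures flatten it. The logit values below are made up for illustration.

```typescript
// Convert raw logits into probabilities at a given temperature.
// Lower temperature → sharper distribution; higher → flatter.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  const t = Math.max(temperature, 1e-6); // guard against divide-by-zero at T = 0
  const scaled = logits.map(l => l / t);
  const max = Math.max(...scaled);       // subtract max for numerical stability
  const exps = scaled.map(l => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

const logits = [2.0, 1.5, 1.0];          // "Python", "JavaScript", "TypeScript"
console.log(softmaxWithTemperature(logits, 0.1)); // top token gets nearly all the mass
console.log(softmaxWithTemperature(logits, 2.0)); // probabilities flatten out
```

At temperature 0, sampling degenerates into always picking the highest-probability token, which is why it behaves deterministically.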
    

    Other Important Parameters

    top_p (Nucleus Sampling)

    top_p limits the model to considering only the most probable tokens that add up to probability p.

    top_p   Effect
    0.1     Only tokens covering the top 10% of probability mass — very focused
    0.5     Tokens covering 50% of probability mass — moderate
    0.9     Tokens covering 90% of probability mass — diverse
    1.0     Considers all tokens (default)

    Best practice: Adjust either temperature or top_p, not both. They serve similar purposes.
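The nucleus idea can be sketched in a few lines: sort tokens by probability, then keep the smallest set whose cumulative probability reaches p. The token names and probabilities below are illustrative, not real model output.

```typescript
// Keep the smallest set of tokens whose cumulative probability >= p.
function nucleus(tokens: { token: string; prob: number }[], p: number) {
  const sorted = [...tokens].sort((a, b) => b.prob - a.prob);
  const kept: typeof sorted = [];
  let cumulative = 0;
  for (const t of sorted) {
    kept.push(t);
    cumulative += t.prob;
    if (cumulative >= p) break; // nucleus reached — stop adding tokens
  }
  return kept;
}

const candidates = [
  { token: "Python", prob: 0.40 },
  { token: "JavaScript", prob: 0.25 },
  { token: "TypeScript", prob: 0.15 },
  { token: "Haskell", prob: 0.05 },
];
console.log(nucleus(candidates, 0.5).map(t => t.token)); // ["Python", "JavaScript"]
```

In practice, the kept probabilities are renormalized before sampling, so the model only ever picks from inside the nucleus.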

    max_tokens

    The maximum number of tokens in the model's response. This does NOT affect quality — it just sets a hard cutoff.

    // If you set max_tokens too low, the response gets cut off mid-sentence
    // If you set it too high, you might pay for tokens you don't need
    
    // Good defaults:
    // Short answers:   max_tokens: 256
    // Paragraphs:      max_tokens: 1024
    // Long content:    max_tokens: 4096
    // Maximum output:  max_tokens: 8192 (model dependent)
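One practical consequence: when a response hits the max_tokens cap, the OpenAI Chat Completions API reports finish_reason: "length" on the choice ("stop" means the model finished naturally). A sketch, assuming that response shape, for detecting truncation:

```typescript
// Detect whether a Chat Completions response was cut off by max_tokens.
// finish_reason "length" = hit the cap; "stop" = finished naturally.
function wasTruncated(data: { choices: { finish_reason: string }[] }): boolean {
  return data.choices[0]?.finish_reason === "length";
}

console.log(wasTruncated({ choices: [{ finish_reason: "length" }] })); // true
console.log(wasTruncated({ choices: [{ finish_reason: "stop" }] }));   // false
```

When truncation is detected, you can retry with a higher max_tokens or ask the model to continue.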

    frequency_penalty and presence_penalty

    These reduce repetition:

    Parameter           What It Does                                        Range
    frequency_penalty   Penalizes tokens that already appeared frequently   -2.0 to 2.0
    presence_penalty    Penalizes any token that appeared at all            -2.0 to 2.0
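OpenAI's docs describe the adjustment roughly like this: a token's logit is reduced by its repetition count times frequency_penalty, plus a flat presence_penalty if it has appeared at all. A sketch of that rule (not the exact production implementation):

```typescript
// Penalize a token's logit based on how often it has already appeared.
function penalizedLogit(
  logit: number,
  count: number,             // times this token already appeared in the output
  frequencyPenalty: number,  // scales with the repetition count
  presencePenalty: number    // flat penalty if the token appeared at all
): number {
  return logit
    - count * frequencyPenalty
    - (count > 0 ? 1 : 0) * presencePenalty;
}

// A token that already appeared 3 times, with both penalties at 0.5:
console.log(penalizedLogit(2.0, 3, 0.5, 0.5)); // → 0  (2.0 - 1.5 - 0.5)
```

This is why frequency_penalty grows with repetition while presence_penalty is a one-time nudge toward new topics.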

    Practical Code Example: API Call with Parameters

    // Complete example: Calling an LLM API with tuned parameters
    
    async function generateCode(prompt: string): Promise<string> {
      const response = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
        },
        body: JSON.stringify({
          model: "gpt-4o",
          messages: [
            {
              role: "system",
              content: "You are a senior TypeScript developer. Write clean, typed code."
            },
            {
              role: "user",
              content: prompt
            }
          ],
          temperature: 0,        // Deterministic for code
          max_tokens: 2048,       // Enough for a full function
          top_p: 1,               // Default (not adjusting since we set temperature)
          frequency_penalty: 0,   // Default
          presence_penalty: 0,    // Default
        }),
      });
    
      const data = await response.json();
      return data.choices[0].message.content;
    }
    
    // Example: Creative writing with different parameters
    async function brainstorm(topic: string): Promise<string> {
      const response = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
        },
        body: JSON.stringify({
          model: "gpt-4o",
          messages: [
            {
              role: "system",
              content: "You are a creative brainstorming partner. Think outside the box."
            },
            {
              role: "user",
              content: `Give me 10 creative ideas for: ${topic}`
            }
          ],
          temperature: 1.2,       // High creativity
          max_tokens: 1024,
          presence_penalty: 0.6,  // Encourage diverse ideas
        }),
      });
    
      const data = await response.json();
      return data.choices[0].message.content;
    }

    Parameter Cheat Sheet

    Task               Temperature   max_tokens   Notes
    Code generation    0             2048–4096    Deterministic is best for code
    Bug fixing         0             1024         You want consistent, accurate fixes
    Summarization      0.3           512–1024     Slightly creative but factual
    General Q&A        0.7           1024         Balanced default
    Creative writing   1.0–1.3       2048+        More randomness = more creativity
    Brainstorming      1.0–1.5       1024         High creativity, add presence_penalty
    Data extraction    0             512          You want exact, consistent output

    Key Takeaways

    • Tokens are the units LLMs process — roughly 1 token per ¾ word
    • Context window is the total token limit for input + output combined
    • Temperature controls creativity (0 = focused, 1+ = creative)
    • max_tokens sets a hard limit on response length
    • Use low temperature for code and facts, high temperature for creativity
    • Token costs add up — monitor usage in production applications

    What's Next?

    You understand the mechanics. Now let's zoom out and look at the GenAI landscape — which models are available, how they compare, and when to use each one.

    What to ask your AI: "I'm building a [type of feature]. What temperature and max_tokens should I use for the best results?"

