
    Tokens, Context Windows, and Temperature


    When you work with LLM APIs, you'll encounter parameters like temperature, max_tokens, and top_p. Understanding what these do is essential for getting good results and managing costs.

    What Are Tokens?

    Tokens are the units LLMs use to process text. A token is roughly a word or part of a word. LLMs don't read characters or words — they read tokens.

    How Tokenization Works

    "Hello, world!" → ["Hello", ",", " world", "!"]   = 4 tokens
    "I'm learning"  → ["I", "'m", " learning"]        = 3 tokens
    "ChatGPT"       → ["Chat", "G", "PT"]             = 3 tokens
    "the"           → ["the"]                         = 1 token
    

    Rules of thumb:

    • 1 token ≈ 4 characters in English
    • 1 token ≈ ¾ of a word
    • 100 tokens ≈ 75 words
    • 1,000 tokens ≈ 750 words (about 1.5 pages of text)

    Why Tokens Matter

    1. Pricing — API costs are calculated per token (both input and output)
    2. Context limits — Each model has a maximum number of tokens it can process
    3. Speed — More tokens = longer response time

    Code: Estimating Token Count

    // Rough estimation (not exact, but useful for planning)
    function estimateTokens(text: string): number {
      // English text averages about 4 characters per token
      return Math.ceil(text.length / 4);
    }
    
    console.log(estimateTokens("Hello, how are you today?"));
    // ~7 tokens (actual: 7)
    
    // For exact counts, use a tokenizer library:
    // npm install tiktoken (for OpenAI models)
    // Or use the model provider's tokenizer API

    Token Pricing Examples

    Model               Input Price (per 1M tokens)   Output Price (per 1M tokens)
    GPT-4o              $2.50                         $10.00
    GPT-4o mini         $0.15                         $0.60
    Claude 3.5 Sonnet   $3.00                         $15.00
    Claude 3.5 Haiku    $0.80                         $4.00
    Gemini 1.5 Pro      $1.25                         $5.00

    Key insight: Output tokens are typically 4-5x more expensive than input tokens, as the table above shows. This is why setting max_tokens wisely matters.
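To see how these rates turn into an actual bill, here is a small sketch that estimates the cost of one API call. The default prices are the GPT-4o rates from the table above; the token counts are numbers you would get from your usage logs.

```typescript
// Estimate the dollar cost of a single API call, given token counts
// and per-million-token prices (defaults: GPT-4o rates from the table above).
function estimateCost(
  inputTokens: number,
  outputTokens: number,
  inputPricePerM: number = 2.50,
  outputPricePerM: number = 10.00
): number {
  return (inputTokens / 1_000_000) * inputPricePerM +
         (outputTokens / 1_000_000) * outputPricePerM;
}

// A 2,000-token prompt with a 500-token answer:
console.log(estimateCost(2000, 500)); // → 0.01 ($0.005 input + $0.005 output)
```

Notice that the 500 output tokens cost as much as the 2,000 input tokens — the output rate dominates even for short responses.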

    Context Windows

    The context window is the total number of tokens a model can "see" at once — this includes both your input (prompt + conversation history) and the model's output.

    Context Window Sizes

    Model               Context Window
    GPT-4o              128,000 tokens
    Claude 3.5 Sonnet   200,000 tokens
    Gemini 1.5 Pro      1,000,000+ tokens
    Llama 3 (8B)        8,000 tokens
    Llama 3.1 (405B)    128,000 tokens

    What Fits in a Context Window?

    128,000 tokens ≈
      - ~96,000 words
      - ~300 pages of text
      - a full novel
    
    200,000 tokens ≈
      - ~150,000 words
      - ~470 pages of text
      - a large codebase
    
    1,000,000 tokens ≈
      - ~750,000 words
      - several books
      - an entire repository
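Using the same ~4-characters-per-token rule of thumb from earlier, you can sanity-check whether a document will fit before sending it. This is a rough sketch for planning, not an exact check — use a real tokenizer in production.

```typescript
// Rough check: will this text fit in the model's context window,
// leaving room for the response? Uses the ~4 chars/token heuristic.
function fitsInContext(
  text: string,
  contextWindow: number,
  reservedForOutput: number = 4096
): boolean {
  const estimatedInputTokens = Math.ceil(text.length / 4);
  return estimatedInputTokens + reservedForOutput <= contextWindow;
}

const novel = "x".repeat(600_000); // ~150,000 tokens of text
console.log(fitsInContext(novel, 128_000)); // false — too big for GPT-4o
console.log(fitsInContext(novel, 200_000)); // true — fits in Claude 3.5 Sonnet
```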
    

    Why Context Windows Matter for Developers

    1. Conversation history — In a chatbot, every message (user and assistant) uses tokens. Long conversations eventually exceed the limit.
    2. Code analysis — When you paste code for the model to review, large files eat into the context.
    3. RAG applications — When you inject retrieved documents into the prompt, you need room for both the context and the response.

    Code: Trimming Conversation History

    // Managing context in a conversation
    interface Message {
      role: "system" | "user" | "assistant";
      content: string;
    }
    
    function trimConversation(messages: Message[], maxTokens: number): Message[] {
      // Always keep the system prompt
      const systemMessage = messages.find(m => m.role === "system");
      const conversation = messages.filter(m => m.role !== "system");
    
      let totalTokens = estimateTokens(systemMessage?.content || "");
      const trimmed: Message[] = systemMessage ? [systemMessage] : [];
    
      // Keep recent messages, drop oldest ones
      for (let i = conversation.length - 1; i >= 0; i--) {
        const msgTokens = estimateTokens(conversation[i].content);
        if (totalTokens + msgTokens > maxTokens) break;
        totalTokens += msgTokens;
        trimmed.splice(systemMessage ? 1 : 0, 0, conversation[i]);
      }
    
      return trimmed;
    }

    Temperature

    Temperature controls the randomness (creativity) of the model's output. It's a number typically between 0 and 2.

    Temperature   Behavior                    Use Case
    0             Deterministic, focused      Code generation, factual Q&A, data extraction
    0.3           Slightly creative           Technical writing, analysis
    0.7           Balanced (common default)   General conversation, explanations
    1.0           Creative                    Brainstorming, creative writing
    1.5–2.0       Very random                 Experimental, poetry, wild ideas

    How Temperature Works (Simplified)

    When predicting the next token, the model calculates probabilities for every possible token:

    Prompt: "The best programming language is"
    
    Temperature 0 (most likely):
      "Python" (40%) → Always picks this ✓
      "JavaScript" (25%)
      "TypeScript" (15%)
    
    Temperature 0.7 (balanced):
      "Python" (40%) → Usually picks this, sometimes others
      "JavaScript" (25%) → Occasionally picks this
      "TypeScript" (15%) → Rarely picks this
    
    Temperature 1.5 (creative):
      "Python" (40%)
      "JavaScript" (25%)
      "TypeScript" (15%)
      "Haskell" (5%) → Might even pick this!
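The reshaping above can be sketched directly: temperature divides the model's raw scores (logits) before the softmax, so low temperatures sharpen the distribution toward the top token and high temperatures flatten it. The logit values below are made up for illustration.

```typescript
// Convert raw logits into probabilities at a given temperature.
// Lower temperature → sharper distribution; higher → flatter.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  const t = Math.max(temperature, 1e-6); // guard against divide-by-zero at T = 0
  const scaled = logits.map(l => l / t);
  const max = Math.max(...scaled);       // subtract max for numerical stability
  const exps = scaled.map(l => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

const logits = [2.0, 1.5, 1.0];          // "Python", "JavaScript", "TypeScript"
console.log(softmaxWithTemperature(logits, 0.1)); // top token gets nearly all the mass
console.log(softmaxWithTemperature(logits, 2.0)); // probabilities flatten out
```

At temperature 0, sampling degenerates into always picking the highest-probability token, which is why it behaves deterministically.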
    

    Other Important Parameters

    top_p (Nucleus Sampling)

    top_p limits the model to considering only the most probable tokens that add up to probability p.

    top_p   Effect
    0.1     Only tokens covering the top 10% of probability mass — very focused
    0.5     Tokens covering 50% of probability mass — moderate
    0.9     Tokens covering 90% of probability mass — diverse
    1.0     Considers all tokens (default)

    Best practice: Adjust either temperature or top_p, not both. They serve similar purposes.
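The nucleus idea can be sketched in a few lines: sort tokens by probability, then keep the smallest set whose cumulative probability reaches p. The token names and probabilities below are illustrative, not real model output.

```typescript
// Keep the smallest set of tokens whose cumulative probability >= p.
function nucleus(tokens: { token: string; prob: number }[], p: number) {
  const sorted = [...tokens].sort((a, b) => b.prob - a.prob);
  const kept: typeof sorted = [];
  let cumulative = 0;
  for (const t of sorted) {
    kept.push(t);
    cumulative += t.prob;
    if (cumulative >= p) break; // nucleus reached — stop adding tokens
  }
  return kept;
}

const candidates = [
  { token: "Python", prob: 0.40 },
  { token: "JavaScript", prob: 0.25 },
  { token: "TypeScript", prob: 0.15 },
  { token: "Haskell", prob: 0.05 },
];
console.log(nucleus(candidates, 0.5).map(t => t.token)); // ["Python", "JavaScript"]
```

In practice, the kept probabilities are renormalized before sampling, so the model only ever picks from inside the nucleus.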

    max_tokens

    The maximum number of tokens in the model's response. This does NOT affect quality — it just sets a hard cutoff.

    // If you set max_tokens too low, the response gets cut off mid-sentence
    // If you set it too high, you might pay for tokens you don't need
    
    // Good defaults:
    // Short answers:   max_tokens: 256
    // Paragraphs:      max_tokens: 1024
    // Long content:    max_tokens: 4096
    // Maximum output:  max_tokens: 8192 (model dependent)
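One practical consequence: when a response hits the max_tokens cap, the OpenAI Chat Completions API reports finish_reason: "length" on the choice ("stop" means the model finished naturally). A sketch, assuming that response shape, for detecting truncation:

```typescript
// Detect whether a Chat Completions response was cut off by max_tokens.
// finish_reason "length" = hit the cap; "stop" = finished naturally.
function wasTruncated(data: { choices: { finish_reason: string }[] }): boolean {
  return data.choices[0]?.finish_reason === "length";
}

console.log(wasTruncated({ choices: [{ finish_reason: "length" }] })); // true
console.log(wasTruncated({ choices: [{ finish_reason: "stop" }] }));   // false
```

When truncation is detected, you can retry with a higher max_tokens or ask the model to continue.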

    frequency_penalty and presence_penalty

    These reduce repetition:

    Parameter           What It Does                                        Range
    frequency_penalty   Penalizes tokens that already appeared frequently   -2.0 to 2.0
    presence_penalty    Penalizes any token that appeared at all            -2.0 to 2.0
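OpenAI's docs describe the adjustment roughly like this: a token's logit is reduced by its repetition count times frequency_penalty, plus a flat presence_penalty if it has appeared at all. A sketch of that rule (not the exact production implementation):

```typescript
// Penalize a token's logit based on how often it has already appeared.
function penalizedLogit(
  logit: number,
  count: number,             // times this token already appeared in the output
  frequencyPenalty: number,  // scales with the repetition count
  presencePenalty: number    // flat penalty if the token appeared at all
): number {
  return logit
    - count * frequencyPenalty
    - (count > 0 ? 1 : 0) * presencePenalty;
}

// A token that already appeared 3 times, with both penalties at 0.5:
console.log(penalizedLogit(2.0, 3, 0.5, 0.5)); // → 0  (2.0 - 1.5 - 0.5)
```

This is why frequency_penalty grows with repetition while presence_penalty is a one-time nudge toward new topics.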

    Practical Code Example: API Call with Parameters

    // Complete example: Calling an LLM API with tuned parameters
    
    async function generateCode(prompt: string): Promise<string> {
      const response = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
        },
        body: JSON.stringify({
          model: "gpt-4o",
          messages: [
            {
              role: "system",
              content: "You are a senior TypeScript developer. Write clean, typed code."
            },
            {
              role: "user",
              content: prompt
            }
          ],
          temperature: 0,        // Deterministic for code
          max_tokens: 2048,       // Enough for a full function
          top_p: 1,               // Default (not adjusting since we set temperature)
          frequency_penalty: 0,   // Default
          presence_penalty: 0,    // Default
        }),
      });
    
      const data = await response.json();
      return data.choices[0].message.content;
    }
    
    // Example: Creative writing with different parameters
    async function brainstorm(topic: string): Promise<string> {
      const response = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
        },
        body: JSON.stringify({
          model: "gpt-4o",
          messages: [
            {
              role: "system",
              content: "You are a creative brainstorming partner. Think outside the box."
            },
            {
              role: "user",
              content: `Give me 10 creative ideas for: ${topic}`
            }
          ],
          temperature: 1.2,       // High creativity
          max_tokens: 1024,
          presence_penalty: 0.6,  // Encourage diverse ideas
        }),
      });
    
      const data = await response.json();
      return data.choices[0].message.content;
    }

    Parameter Cheat Sheet

    Task               Temperature   max_tokens   Notes
    Code generation    0             2048–4096    Deterministic is best for code
    Bug fixing         0             1024         You want consistent, accurate fixes
    Summarization      0.3           512–1024     Slightly creative but factual
    General Q&A        0.7           1024         Balanced default
    Creative writing   1.0–1.3       2048+        More randomness = more creativity
    Brainstorming      1.0–1.5       1024         High creativity, add presence_penalty
    Data extraction    0             512          You want exact, consistent output

    Key Takeaways

    • Tokens are the units LLMs process — roughly 1 token per ¾ word
    • Context window is the total token limit for input + output combined
    • Temperature controls creativity (0 = focused, 1+ = creative)
    • max_tokens sets a hard limit on response length
    • Use low temperature for code and facts, high temperature for creativity
    • Token costs add up — monitor usage in production applications

    What's Next?

    You understand the mechanics. Now let's zoom out and look at the GenAI landscape — which models are available, how they compare, and when to use each one.

    What to ask your AI: "I'm building a [type of feature]. What temperature and max_tokens should I use for the best results?"

