How LLMs Actually Work
You've heard that LLMs "predict the next token." But what does that actually mean? This tutorial breaks down the key mechanisms — attention, embeddings, and next-token prediction — so you understand both the power and the limitations.
Next-Token Prediction: The Core Idea
At its heart, every LLM does one thing: given a sequence of tokens, estimate a probability for every possible next token, then pick one.
Input tokens: ["The", " cat", " sat", " on", " the"]
Prediction: " mat" (highest probability)
The model then appends " mat" and predicts again:
Input tokens: ["The", " cat", " sat", " on", " the", " mat"]
Prediction: "." (highest probability)
This process repeats until the model generates a stop token or hits the max token limit. Every response you've ever seen from ChatGPT or Claude was generated one token at a time through this process.
Why This Is More Powerful Than It Sounds
You might think "just predicting the next word" sounds simple. But consider: to predict the next token accurately, the model needs to understand:
- Grammar — What's syntactically valid
- Context — What the conversation is about
- Logic — What follows from previous statements
- World knowledge — Facts learned during training
- Intent — What the user is asking for
- Style — The tone and format expected
A model that can predict the next token well enough to write working code or explain quantum physics has effectively learned a compressed representation of human knowledge.
Embeddings: How Models Understand Words
Before a model can process text, it needs to convert words into numbers. This is done through embeddings — numerical representations of tokens in a high-dimensional space.
What Are Embeddings?
Each token is mapped to a vector (a list of numbers), typically with 768 to 12,288 dimensions:
"king" → [0.23, -0.45, 0.89, 0.12, ..., -0.34] (thousands of numbers)
"queen" → [0.25, -0.42, 0.91, 0.10, ..., -0.31] (similar to "king"!)
"apple" → [-0.56, 0.78, -0.12, 0.45, ..., 0.67] (very different from "king")
The magic: words with similar meanings are close together in this number space. "King" and "queen" have similar vectors. "King" and "apple" don't.
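You can see this "closeness" concretely with toy vectors. The 4-dimensional numbers below are made up for illustration (real embeddings have hundreds to thousands of dimensions), but the geometry works the same way:

```typescript
// Toy 4-dimensional "embeddings" — invented values for illustration only
const vectors: Record<string, number[]> = {
  king:  [0.23, -0.45, 0.89, 0.12],
  queen: [0.25, -0.42, 0.91, 0.10],
  apple: [-0.56, 0.78, -0.12, 0.45],
};

// Euclidean distance: smaller = closer together in the embedding space
function distance(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));
}

distance(vectors.king, vectors.queen); // small — similar meanings
distance(vectors.king, vectors.apple); // large — unrelated meanings
```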
Why Embeddings Matter for Developers
Embeddings aren't just an internal mechanism — you can use them directly:
```typescript
// Generating embeddings via OpenAI's API
const response = await fetch("https://api.openai.com/v1/embeddings", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
  },
  body: JSON.stringify({
    model: "text-embedding-3-small",
    input: "How do I deploy a React app to Firebase?",
  }),
});

const data = await response.json();
const embedding = data.data[0].embedding; // Returns a vector of 1536 numbers
```
Real-world uses of embeddings:
- Semantic search — Find documents by meaning, not just keywords
- RAG (Retrieval Augmented Generation) — Retrieve relevant context before generating
- Recommendation systems — Find similar items
- Clustering — Group similar documents automatically
- Anomaly detection — Find outliers in text data
Similarity Between Embeddings
You can measure how similar two pieces of text are by comparing their embeddings using cosine similarity:
```typescript
// Cosine similarity: 1 = identical meaning, 0 = unrelated, -1 = opposite
function cosineSimilarity(a: number[], b: number[]): number {
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

// "How to deploy React" vs "Deploying a React application" → ~0.95 (very similar)
// "How to deploy React" vs "Best pizza in New York" → ~0.15 (unrelated)
```
The Attention Mechanism
The attention mechanism is the breakthrough that made modern LLMs possible. It's how the model figures out which parts of the input are relevant to each other.
The Problem Attention Solves
Consider this sentence: "The animal didn't cross the road because it was too tired."
What does "it" refer to? The animal or the road? A human instantly knows "it" = "the animal" (roads don't get tired). The attention mechanism lets the model make these connections too.
How Attention Works (Simplified)
For each token, attention asks: "How much should I pay attention to every other token?"
Input: "The cat sat on the mat because it was tired"
When processing "it", attention scores:
"The" → low attention (0.02)
"cat" → HIGH attention (0.35) ← "it" refers to "cat"
"sat" → low attention (0.05)
"on" → low attention (0.01)
"the" → low attention (0.02)
"mat" → some attention (0.15)
"because" → low attention (0.05)
"it" → some attention (0.10)
"was" → moderate (0.15)
"tired" → low attention (0.10)
The model learns these attention patterns during training. In practice, modern LLMs have many "attention heads" — each looking at different types of relationships:
- Some heads track grammatical relationships (subject-verb)
- Some track semantic relationships (pronouns to nouns)
- Some track positional relationships (nearby words)
- Some track long-range dependencies (instructions from paragraphs ago)
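The scores in the example above come from a simple computation: each token's "query" vector is dotted against every token's "key" vector, scaled, and softmax-normalized so the weights sum to 1. Here is a minimal single-head sketch, with tiny made-up 2-dimensional vectors in place of the real learned ones:

```typescript
// Softmax turns raw scores into weights that sum to 1
function softmax(xs: number[]): number[] {
  const max = Math.max(...xs); // subtract max for numerical stability
  const exps = xs.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Scaled dot-product attention for one query against all keys
function attentionWeights(query: number[], keys: number[][]): number[] {
  const scale = Math.sqrt(query.length);
  const scores = keys.map(
    (key) => key.reduce((s, k, i) => s + k * query[i], 0) / scale
  );
  return softmax(scores); // one weight per input token
}

// Toy vectors: the query for "it" lines up most with the key for "cat"
const queryIt = [1, 0];
const keys = [
  [0.1, 0.9],  // "The"
  [0.95, 0.1], // "cat"
  [0.2, 0.4],  // "sat"
];
attentionWeights(queryIt, keys); // highest weight lands on "cat"
```

In a real model the query and key vectors are produced by learned projections of the token embeddings, which is exactly what training shapes.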
Multi-Head Attention
Real transformer models use multi-head attention — multiple attention mechanisms running in parallel, each looking for different patterns:
Head 1: Grammar patterns (subject → verb → object)
Head 2: Coreference (pronouns → their referents)
Head 3: Topic tracking (maintaining context across paragraphs)
Head 4: Code structure (function names → their usage)
...
Head N: Other patterns learned during training
This is why LLMs can simultaneously track conversation context, follow instructions, maintain style, and generate syntactically correct output.
The Full Pipeline
Here's how a complete generation works, from prompt to response:
1. TOKENIZE
"Write a hello world function" → [15043, 257, 23748, 1917, 2163]
2. EMBED
Each token ID → High-dimensional vector
[15043] → [0.23, -0.45, 0.89, ...]
3. PROCESS (Transformer layers × many)
For each layer:
a. Self-attention: Every token attends to every other token
b. Feed-forward: Process the attended information
c. Normalize and residual connections
4. PREDICT
Final layer outputs probabilities for every possible next token
"function" (32%) | "in" (18%) | "def" (12%) | ...
5. SAMPLE
Based on temperature/top_p, select the next token
→ "function" (with temperature=0, always picks highest)
6. REPEAT steps 2-5 with the new token appended
Until stop token or max_tokens reached
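Step 5 is worth seeing in code. A minimal sketch of temperature-based sampling (applying temperature directly to probabilities here, which is equivalent to the usual logits-divided-by-T formulation after renormalizing):

```typescript
type TokenProb = { token: string; prob: number };

function sample(probs: TokenProb[], temperature: number): string {
  if (temperature === 0) {
    // Greedy decoding: always pick the highest-probability token
    return probs.reduce((a, b) => (b.prob > a.prob ? b : a)).token;
  }
  // Higher temperature flattens the distribution → more randomness
  const scaled = probs.map((p) => Math.pow(p.prob, 1 / temperature));
  const total = scaled.reduce((a, b) => a + b, 0);
  let r = Math.random() * total;
  for (let i = 0; i < probs.length; i++) {
    r -= scaled[i];
    if (r <= 0) return probs[i].token;
  }
  return probs[probs.length - 1].token;
}

const next = [
  { token: "function", prob: 0.32 },
  { token: "in", prob: 0.18 },
  { token: "def", prob: 0.12 },
];
sample(next, 0); // → "function" (temperature 0 always picks the top token)
```

This is why the same prompt gives identical output at temperature 0 but varied output at higher temperatures.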
Why This Matters for Developers
Understanding the mechanism helps you reason about model behavior:
1. Why Long Prompts Can Lose Instructions
Attention has to spread across all tokens. In very long contexts, earlier instructions can get "diluted." This is why it helps to repeat key instructions or place them at both the beginning and end of your prompt.
2. Why Models Are Bad at Counting
Next-token prediction doesn't involve actual computation. When you ask "how many r's in strawberry," the model isn't counting — it's predicting what the answer probably is based on patterns. This is why LLMs struggle with precise counting tasks.
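Ordinary code, by contrast, counts deterministically. When exact counts matter, compute them in code (or have the model write and run code) instead of asking it to answer from patterns:

```typescript
// Deterministic counting — something next-token prediction can't do reliably
function countChar(text: string, ch: string): number {
  return [...text].filter((c) => c === ch).length;
}

countChar("strawberry", "r"); // → 3, every time
```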
3. Why Prompt Engineering Works
The attention mechanism means the model's output is heavily influenced by the context you provide. Good prompts work because they guide the attention to the right patterns:
```typescript
// Weak prompt — model has to guess the format
const weak = "Tell me about React hooks";

// Strong prompt — attention is guided to specific patterns
const strong = `You are a senior React developer teaching a junior.
Explain React hooks with:
1. A simple definition
2. The 3 most important hooks (useState, useEffect, useContext)
3. A practical code example for each
4. Common mistakes to avoid
Use TypeScript in all code examples.`;
```
4. Why RAG (Retrieval Augmented Generation) Works
When you inject relevant documents into the prompt, the attention mechanism connects the user's question to the provided context. The model "attends to" the documents you provided instead of relying on potentially outdated training data.
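The retrieval step is just the cosine-similarity idea from earlier applied at scale: embed the question, rank stored document embeddings against it, and inject the best matches into the prompt. A minimal sketch with toy 2-dimensional embeddings (real ones would come from an embeddings API call like the one shown earlier):

```typescript
// Cosine similarity between two vectors (same as the earlier example)
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

type Doc = { text: string; embedding: number[] };

// Rank documents by similarity to the query embedding, keep the top k
function topK(queryEmbedding: number[], docs: Doc[], k: number): Doc[] {
  return [...docs]
    .sort(
      (a, b) =>
        cosine(queryEmbedding, b.embedding) -
        cosine(queryEmbedding, a.embedding)
    )
    .slice(0, k);
}

// Toy 2-dimensional embeddings for illustration
const docs: Doc[] = [
  { text: "Firebase deploy guide", embedding: [0.9, 0.1] },
  { text: "Pizza recipes", embedding: [0.1, 0.9] },
];
topK([0.85, 0.2], docs, 1); // → the Firebase doc
```

Production systems swap the array scan for a vector database, but the ranking principle is identical.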
5. Why Fine-Tuning Changes Behavior
Fine-tuning adjusts the model's attention patterns and prediction weights for your specific domain. A fine-tuned model pays more attention to patterns relevant to your use case.
Key Takeaways
- LLMs generate text by predicting one token at a time based on all previous tokens
- Embeddings convert text into numbers — similar meanings have similar numbers
- Attention lets the model figure out which parts of the input relate to each other
- Understanding these mechanisms explains both the strengths AND limitations of LLMs
- This knowledge directly improves your prompt engineering and application design
- You can use embeddings directly in your apps for search, recommendations, and more
What's Next?
Now that you understand how LLMs work, let's explore the practical use cases for developers — from code generation to RAG to building conversational AI.
What to ask your AI: "Explain how your attention mechanism processes my prompt. What parts of this conversation are you paying the most attention to right now?"