How LLMs Actually Work
You've heard that LLMs "predict the next token." But what does that actually mean? This tutorial breaks down the key mechanisms — attention, embeddings, and next-token prediction — so you understand both the power and the limitations.
Next-Token Prediction: The Core Idea
At its heart, every LLM does one thing: given a sequence of tokens, estimate a probability for every possible next token, then pick one.
Input tokens: ["The", " cat", " sat", " on", " the"]
Prediction: " mat" (highest probability)
The model then appends " mat" and predicts again:
Input tokens: ["The", " cat", " sat", " on", " the", " mat"]
Prediction: "." (highest probability)
This process repeats until the model generates a stop token or hits the max token limit. Every response you've ever seen from ChatGPT or Claude was generated one token at a time through this process.
Why This Is More Powerful Than It Sounds
You might think "just predicting the next word" sounds simple. But consider: to predict the next token accurately, the model needs to understand:
- Grammar — What's syntactically valid
- Context — What the conversation is about
- Logic — What follows from previous statements
- World knowledge — Facts learned during training
- Intent — What the user is asking for
- Style — The tone and format expected
A model that can predict the next token well enough to write working code or explain quantum physics has effectively learned a compressed representation of human knowledge.
Embeddings: How Models Understand Words
Before a model can process text, it needs to convert words into numbers. This is done through embeddings — numerical representations of tokens in a high-dimensional space.
What Are Embeddings?
Each token is mapped to a vector (a list of numbers), typically with 768 to 12,288 dimensions:
"king" → [0.23, -0.45, 0.89, 0.12, ..., -0.34] (thousands of numbers)
"queen" → [0.25, -0.42, 0.91, 0.10, ..., -0.31] (similar to "king"!)
"apple" → [-0.56, 0.78, -0.12, 0.45, ..., 0.67] (very different from "king")
The magic: words with similar meanings are close together in this number space. "King" and "queen" have similar vectors. "King" and "apple" don't.
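You can see this "closeness" concretely with toy vectors. The 4-dimensional numbers below are made up for illustration (real embeddings have hundreds to thousands of dimensions), but the geometry works the same way:

```typescript
// Toy 4-dimensional "embeddings" — invented values for illustration only
const vectors: Record<string, number[]> = {
  king:  [0.23, -0.45, 0.89, 0.12],
  queen: [0.25, -0.42, 0.91, 0.10],
  apple: [-0.56, 0.78, -0.12, 0.45],
};

// Euclidean distance: smaller = closer together in the embedding space
function distance(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));
}

distance(vectors.king, vectors.queen); // small — similar meanings
distance(vectors.king, vectors.apple); // large — unrelated meanings
```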
Why Embeddings Matter for Developers
Embeddings aren't just an internal mechanism — you can use them directly:
```typescript
// Generating embeddings via OpenAI's API
const response = await fetch("https://api.openai.com/v1/embeddings", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
  },
  body: JSON.stringify({
    model: "text-embedding-3-small",
    input: "How do I deploy a React app to Firebase?",
  }),
});

const data = await response.json();
const embedding = data.data[0].embedding; // Returns a vector of 1536 numbers
```
Real-world uses of embeddings:
- Semantic search — Find documents by meaning, not just keywords
- RAG (Retrieval Augmented Generation) — Retrieve relevant context before generating
- Recommendation systems — Find similar items
- Clustering — Group similar documents automatically
- Anomaly detection — Find outliers in text data
Similarity Between Embeddings
You can measure how similar two pieces of text are by comparing their embeddings using cosine similarity:
```typescript
// Cosine similarity: 1 = identical meaning, 0 = unrelated, -1 = opposite
function cosineSimilarity(a: number[], b: number[]): number {
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

// "How to deploy React" vs "Deploying a React application" → ~0.95 (very similar)
// "How to deploy React" vs "Best pizza in New York" → ~0.15 (unrelated)
```
The Attention Mechanism
The attention mechanism is the breakthrough that made modern LLMs possible. It's how the model figures out which parts of the input are relevant to each other.
The Problem Attention Solves
Consider this sentence: "The animal didn't cross the road because it was too tired."
What does "it" refer to? The animal or the road? A human instantly knows "it" = "the animal" (roads don't get tired). The attention mechanism lets the model make these connections too.
How Attention Works (Simplified)
For each token, attention asks: "How much should I pay attention to every other token?"
Input: "The cat sat on the mat because it was tired"
When processing "it", attention scores:
"The" → low attention (0.02)
"cat" → HIGH attention (0.35) ← "it" refers to "cat"
"sat" → low attention (0.05)
"on" → low attention (0.01)
"the" → low attention (0.02)
"mat" → some attention (0.15)
"because" → low attention (0.05)
"it" → some attention (0.10)
"was" → moderate (0.15)
"tired" → low attention (0.10)
The model learns these attention patterns during training. In practice, modern LLMs have many "attention heads" — each looking at different types of relationships:
- Some heads track grammatical relationships (subject-verb)
- Some track semantic relationships (pronouns to nouns)
- Some track positional relationships (nearby words)
- Some track long-range dependencies (instructions from paragraphs ago)
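The scores in the example above come from a simple computation: each token's "query" vector is dotted against every token's "key" vector, scaled, and softmax-normalized so the weights sum to 1. Here is a minimal single-head sketch, with tiny made-up 2-dimensional vectors in place of the real learned ones:

```typescript
// Softmax turns raw scores into weights that sum to 1
function softmax(xs: number[]): number[] {
  const max = Math.max(...xs); // subtract max for numerical stability
  const exps = xs.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Scaled dot-product attention for one query against all keys
function attentionWeights(query: number[], keys: number[][]): number[] {
  const scale = Math.sqrt(query.length);
  const scores = keys.map(
    (key) => key.reduce((s, k, i) => s + k * query[i], 0) / scale
  );
  return softmax(scores); // one weight per input token
}

// Toy vectors: the query for "it" lines up most with the key for "cat"
const queryIt = [1, 0];
const keys = [
  [0.1, 0.9],  // "The"
  [0.95, 0.1], // "cat"
  [0.2, 0.4],  // "sat"
];
attentionWeights(queryIt, keys); // highest weight lands on "cat"
```

In a real model the query and key vectors are produced by learned projections of the token embeddings, which is exactly what training shapes.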
Multi-Head Attention
Real transformer models use multi-head attention — multiple attention mechanisms running in parallel, each looking for different patterns:
Head 1: Grammar patterns (subject → verb → object)
Head 2: Coreference (pronouns → their referents)
Head 3: Topic tracking (maintaining context across paragraphs)
Head 4: Code structure (function names → their usage)
...
Head N: Other patterns learned during training
This is why LLMs can simultaneously track conversation context, follow instructions, maintain style, and generate syntactically correct output.
The Full Pipeline
Here's how a complete generation works, from prompt to response:
1. TOKENIZE
"Write a hello world function" → [15043, 257, 23748, 1917, 2163]
2. EMBED
Each token ID → High-dimensional vector
[15043] → [0.23, -0.45, 0.89, ...]
3. PROCESS (Transformer layers × many)
For each layer:
a. Self-attention: Every token attends to every other token
b. Feed-forward: Process the attended information
c. Normalize and residual connections
4. PREDICT
Final layer outputs probabilities for every possible next token
"function" (32%) | "in" (18%) | "def" (12%) | ...
5. SAMPLE
Based on temperature/top_p, select the next token
→ "function" (with temperature=0, always picks highest)
6. REPEAT steps 2-5 with the new token appended
Until stop token or max_tokens reached
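Step 5 is worth seeing in code. A minimal sketch of temperature-based sampling (applying temperature directly to probabilities here, which is equivalent to the usual logits-divided-by-T formulation after renormalizing):

```typescript
type TokenProb = { token: string; prob: number };

function sample(probs: TokenProb[], temperature: number): string {
  if (temperature === 0) {
    // Greedy decoding: always pick the highest-probability token
    return probs.reduce((a, b) => (b.prob > a.prob ? b : a)).token;
  }
  // Higher temperature flattens the distribution → more randomness
  const scaled = probs.map((p) => Math.pow(p.prob, 1 / temperature));
  const total = scaled.reduce((a, b) => a + b, 0);
  let r = Math.random() * total;
  for (let i = 0; i < probs.length; i++) {
    r -= scaled[i];
    if (r <= 0) return probs[i].token;
  }
  return probs[probs.length - 1].token;
}

const next = [
  { token: "function", prob: 0.32 },
  { token: "in", prob: 0.18 },
  { token: "def", prob: 0.12 },
];
sample(next, 0); // → "function" (temperature 0 always picks the top token)
```

This is why the same prompt gives identical output at temperature 0 but varied output at higher temperatures.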
Why This Matters for Developers
Understanding the mechanism helps you reason about model behavior:
1. Why Long Prompts Can Lose Instructions
Attention has to spread across all tokens. In very long contexts, earlier instructions can get "diluted." This is why it helps to repeat key instructions or place them at both the beginning and end of your prompt.
2. Why Models Are Bad at Counting
Next-token prediction doesn't involve actual computation. When you ask "how many r's in strawberry," the model isn't counting — it's predicting what the answer probably is based on patterns. This is why LLMs struggle with precise counting tasks.
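Ordinary code, by contrast, counts deterministically. When exact counts matter, compute them in code (or have the model write and run code) instead of asking it to answer from patterns:

```typescript
// Deterministic counting — something next-token prediction can't do reliably
function countChar(text: string, ch: string): number {
  return [...text].filter((c) => c === ch).length;
}

countChar("strawberry", "r"); // → 3, every time
```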
3. Why Prompt Engineering Works
The attention mechanism means the model's output is heavily influenced by the context you provide. Good prompts work because they guide the attention to the right patterns:
```typescript
// Weak prompt — model has to guess the format
const weak = "Tell me about React hooks";

// Strong prompt — attention is guided to specific patterns
const strong = `You are a senior React developer teaching a junior.
Explain React hooks with:
1. A simple definition
2. The 3 most important hooks (useState, useEffect, useContext)
3. A practical code example for each
4. Common mistakes to avoid
Use TypeScript in all code examples.`;
```
4. Why RAG (Retrieval Augmented Generation) Works
When you inject relevant documents into the prompt, the attention mechanism connects the user's question to the provided context. The model "attends to" the documents you provided instead of relying on potentially outdated training data.
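The retrieval step is just the cosine-similarity idea from earlier applied at scale: embed the question, rank stored document embeddings against it, and inject the best matches into the prompt. A minimal sketch with toy 2-dimensional embeddings (real ones would come from an embeddings API call like the one shown earlier):

```typescript
// Cosine similarity between two vectors (same as the earlier example)
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

type Doc = { text: string; embedding: number[] };

// Rank documents by similarity to the query embedding, keep the top k
function topK(queryEmbedding: number[], docs: Doc[], k: number): Doc[] {
  return [...docs]
    .sort(
      (a, b) =>
        cosine(queryEmbedding, b.embedding) -
        cosine(queryEmbedding, a.embedding)
    )
    .slice(0, k);
}

// Toy 2-dimensional embeddings for illustration
const docs: Doc[] = [
  { text: "Firebase deploy guide", embedding: [0.9, 0.1] },
  { text: "Pizza recipes", embedding: [0.1, 0.9] },
];
topK([0.85, 0.2], docs, 1); // → the Firebase doc
```

Production systems swap the array scan for a vector database, but the ranking principle is identical.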
5. Why Fine-Tuning Changes Behavior
Fine-tuning adjusts the model's attention patterns and prediction weights for your specific domain. A fine-tuned model pays more attention to patterns relevant to your use case.
Key Takeaways
- LLMs generate text by predicting one token at a time based on all previous tokens
- Embeddings convert text into numbers — similar meanings have similar numbers
- Attention lets the model figure out which parts of the input relate to each other
- Understanding these mechanisms explains both the strengths AND limitations of LLMs
- This knowledge directly improves your prompt engineering and application design
- You can use embeddings directly in your apps for search, recommendations, and more
What's Next?
Now that you understand how LLMs work, let's explore the practical use cases for developers — from code generation to RAG to building conversational AI.
What to ask your AI: "Explain how your attention mechanism processes my prompt. What parts of this conversation are you paying the most attention to right now?"