
    Understanding Embeddings

    Embeddings are the foundation of RAG. Before you can search your documents for relevant information, you need to convert text into a format that machines can compare. That format is vectors — arrays of numbers that capture the meaning of text.

    What Are Embeddings?

    An embedding is a numerical representation of text in a high-dimensional space. Each piece of text — a sentence, a paragraph, a document — gets converted into a list of numbers (a vector). Texts with similar meanings end up with similar vectors.

    Think of it like coordinates on a map. Just as two nearby points on a map are geographically close, two similar text embeddings are semantically close.

    "How do I reset my password?"  →  [0.021, -0.134, 0.891, 0.045, ...]
    "I forgot my login credentials" →  [0.019, -0.128, 0.887, 0.051, ...]
    "What is the weather today?"    →  [-0.523, 0.412, 0.033, -0.298, ...]
    

    Notice how the first two vectors are close together (both are about account access), while the third points in a very different direction (it is about weather).

    How Text Becomes Numbers

    The process works in three stages:

    1. Tokenization — The text is split into tokens (words or subword pieces)
    2. Neural network processing — The tokens pass through a trained transformer model
    3. Pooling — The model outputs are combined into a single fixed-length vector

    You do not need to understand the math in detail. What matters is the result: text goes in, a vector comes out, and similar texts produce similar vectors.

    Input text: "How to train a puppy"
         │
         ▼
    ┌──────────────┐
    │  Embedding   │
    │    Model     │
    └──────┬───────┘
           │
           ▼
    Output: [0.023, -0.156, 0.734, 0.091, ..., -0.045]
            (1536 numbers for OpenAI text-embedding-3-small)
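
    The pooling stage can be sketched with toy numbers. In the snippet below the 4-dimensional token vectors are made up for illustration (real models output hundreds of dimensions per token), and mean pooling — averaging across tokens — is one common pooling strategy:

    import numpy as np

    # Toy illustration of the pooling stage: suppose the model produced
    # one 4-dimensional vector per token of "How to train a puppy".
    token_vectors = np.array([
        [0.2, -0.1, 0.5, 0.0],   # "How"
        [0.4,  0.3, 0.1, -0.2],  # "to"
        [0.0,  0.6, 0.2, 0.1],   # "train"
        [0.3, -0.2, 0.4, 0.5],   # "a"
        [0.1,  0.4, 0.3, 0.2],   # "puppy"
    ])

    # Mean pooling: average across tokens to get one fixed-length vector.
    # However many tokens go in, the result always has 4 dimensions here.
    sentence_vector = token_vectors.mean(axis=0)
    print(sentence_vector.shape)  # (4,)

    This is why every text, no matter its length, ends up as a vector of the same fixed size.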
    

    The length of the vector (its dimensionality) depends on the model:

    Model                    Dimensions   Provider
    text-embedding-3-small   1536         OpenAI
    text-embedding-3-large   3072         OpenAI
    embed-english-v3.0       1024         Cohere
    all-MiniLM-L6-v2         384          Open source (Sentence Transformers)
    nomic-embed-text-v1.5    768          Nomic (open source)
    BGE-large-en-v1.5        1024         BAAI (open source)

    Semantic Similarity and Cosine Distance

    Once you have embeddings, you compare them using cosine similarity, which measures how aligned two vectors are on a scale from -1 to 1 (cosine distance is simply 1 minus this value):

    • 1.0 — Identical meaning
    • 0.7-0.9 — Very similar
    • 0.4-0.6 — Somewhat related
    • 0.0 — Unrelated
    • -1.0 — Opposite meaning

    In code, cosine similarity takes only a few lines with NumPy (embedding_of below stands in for whichever embedding function you use):
    from numpy import dot
    from numpy.linalg import norm
    
    def cosine_similarity(a, b):
        return dot(a, b) / (norm(a) * norm(b))
    
    # Example
    similarity = cosine_similarity(
        embedding_of("How do I reset my password?"),
        embedding_of("I forgot my login credentials")
    )
    # Result: ~0.91 (very similar!)

    This is the core mechanic of RAG: when a user asks a question, you embed that question, then find the stored document embeddings with the highest cosine similarity.
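
    That retrieval step can be sketched with NumPy. The vectors below are tiny made-up stand-ins for real embeddings, but the ranking logic is the same at any dimensionality: compare the query against every stored document at once, then sort by similarity:

    import numpy as np

    # Toy vectors standing in for real document embeddings
    # (in practice these come from an embedding model).
    doc_embeddings = np.array([
        [0.9, 0.1, 0.0],   # doc about passwords
        [0.1, 0.9, 0.1],   # doc about refunds
        [0.0, 0.2, 0.9],   # doc about business hours
    ])
    query_embedding = np.array([0.85, 0.15, 0.05])

    # Cosine similarity between the query and every document at once.
    norms = np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
    similarities = doc_embeddings @ query_embedding / norms

    # Rank documents from most to least similar.
    ranking = np.argsort(similarities)[::-1]
    print(ranking[0])  # 0 — the password document wins

    Vector databases implement exactly this comparison, just with approximate indexes so it stays fast across millions of documents.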

    Embedding Models: Choosing the Right One

    OpenAI Embeddings

    The most popular choice for getting started. Excellent quality, simple API, pay per token.

    from openai import OpenAI
    
    client = OpenAI()
    
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input="How do I reset my password?"
    )
    
    embedding = response.data[0].embedding
    print(f"Vector length: {len(embedding)}")  # 1536
    print(f"First 5 values: {embedding[:5]}")

    Cohere Embeddings

    Strong alternative with a generous free tier. Supports different input types for better retrieval.

    import cohere
    
    co = cohere.ClientV2()
    
    response = co.embed(
        texts=["How do I reset my password?"],
        model="embed-english-v3.0",
        input_type="search_query",           # or "search_document" for stored docs
        embedding_types=["float"]
    )
    
    embedding = response.embeddings.float_[0]
    print(f"Vector length: {len(embedding)}")  # 1024

    Open-Source Embeddings (Sentence Transformers)

    Free, runs locally, no API key needed. Good for privacy-sensitive applications.

    from sentence_transformers import SentenceTransformer
    
    model = SentenceTransformer("all-MiniLM-L6-v2")
    
    sentences = [
        "How do I reset my password?",
        "I forgot my login credentials",
        "What is the weather today?"
    ]
    
    embeddings = model.encode(sentences)
    print(f"Shape: {embeddings.shape}")  # (3, 384)

    Code Example: Generating and Comparing Embeddings

    Here is a complete example that generates embeddings and finds the most similar document to a query:

    from openai import OpenAI
    from numpy import dot
    from numpy.linalg import norm
    
    client = OpenAI()
    
    def get_embedding(text: str) -> list[float]:
        """Generate an embedding for a piece of text."""
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding
    
    def cosine_similarity(a: list[float], b: list[float]) -> float:
        """Calculate cosine similarity between two vectors."""
        return dot(a, b) / (norm(a) * norm(b))
    
    # Our "knowledge base" — in a real app these come from your documents
    documents = [
        "To reset your password, go to Settings > Security > Change Password.",
        "Our refund policy allows returns within 30 days of purchase.",
        "Business hours are Monday through Friday, 9 AM to 5 PM EST.",
        "To upgrade your plan, visit the Billing section in your dashboard.",
    ]
    
    # Pre-compute embeddings for all documents
    doc_embeddings = [get_embedding(doc) for doc in documents]
    
    # User asks a question
    query = "How do I change my password?"
    query_embedding = get_embedding(query)
    
    # Find the most similar document
    similarities = [
        cosine_similarity(query_embedding, doc_emb)
        for doc_emb in doc_embeddings
    ]
    
    best_idx = similarities.index(max(similarities))
    print(f"Best match (similarity {similarities[best_idx]:.3f}):")
    print(f"  {documents[best_idx]}")
    # Output: "To reset your password, go to Settings > Security > Change Password."

    Best Practices for Embeddings

    1. Use the same model for queries and documents — Mixing models produces incompatible vectors
    2. Batch your embedding calls — Most APIs support embedding multiple texts at once, which is faster and cheaper
    3. Cache embeddings — Store computed embeddings so you do not regenerate them every time
    4. Consider dimensionality vs cost — Smaller models (e.g. 384 dims) are faster and cheaper; larger ones (e.g. 3072 dims) tend to be more accurate
    5. Normalize your text — Clean up formatting, remove excessive whitespace, and handle special characters before embedding
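
    Practices 3 and 5 can be combined in a small cache sketch. Here fake_embed is a placeholder for a real embedding call (an API request or local model); the cache keys on a hash of the normalized text, so the same text is never embedded twice:

    import hashlib

    _cache: dict[str, list[float]] = {}
    calls = 0

    def fake_embed(text: str) -> list[float]:
        """Placeholder for a real embedding call (API or local model)."""
        global calls
        calls += 1
        return [float(len(text))]  # dummy vector for illustration

    def cached_embedding(text: str) -> list[float]:
        normalized = " ".join(text.split())  # collapse stray whitespace
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key not in _cache:
            _cache[key] = fake_embed(normalized)
        return _cache[key]

    cached_embedding("How do I  reset my password?")   # embeds once
    cached_embedding("How do I reset my password?")    # cache hit after normalization
    print(calls)  # 1

    In a real pipeline you would persist the cache (a table keyed by the hash works well) rather than holding it in memory.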

    What to ask your AI: "Help me set up an embedding pipeline that processes my [documents/PDFs/web pages] and stores the embeddings for later retrieval."

    What's Next?

    You can now generate embeddings for any text. But where do you store millions of these vectors and search through them efficiently? That is the job of a vector database, which we cover next.


    🌐 www.genai-mentor.ai