    RAG Essentials Cheat Sheet

    Your complete reference for building RAG systems. Bookmark this page for pipeline diagrams, comparison tables, decision trees, and ready-to-use AI prompts.

    RAG Pipeline Visual Reference

    ┌──────────────────────────────────────────────────────────────┐
    │                    RAG PIPELINE                              │
    │                                                              │
    │  ┌─────────┐   ┌─────────┐   ┌──────────┐   ┌───────────┐  │
    │  │  LOAD   │──▶│  CHUNK  │──▶│  EMBED   │──▶│   STORE   │  │
    │  │  docs   │   │  split  │   │  vectors │   │  vector   │  │
    │  │         │   │  text   │   │          │   │  database │  │
    │  └─────────┘   └─────────┘   └──────────┘   └─────┬─────┘  │
    │                                                    │        │
    │  ┌─────────────────────────────────────────────────┘        │
    │  │                                                          │
    │  │  At query time:                                          │
    │  │                                                          │
    │  │  ┌──────────┐   ┌──────────┐   ┌───────────────────┐    │
    │  │  │  QUERY   │──▶│ RETRIEVE │──▶│     GENERATE      │    │
    │  │  │  embed   │   │  top-K   │   │  LLM + context    │    │
    │  │  │  question│   │  similar │   │  = grounded answer│    │
    │  │  └──────────┘   └──────────┘   └───────────────────┘    │
    │  │                                                          │
    │  └────────────────────────────────────────────────────────┘  │
    └──────────────────────────────────────────────────────────────┘

    Embedding Model Comparison

    Model                  | Provider    | Dimensions | Max Tokens | Cost                | Quality
    text-embedding-3-small | OpenAI      | 1536       | 8191       | $0.02/1M tokens     | Good
    text-embedding-3-large | OpenAI      | 3072       | 8191       | $0.13/1M tokens     | Excellent
    embed-english-v3.0     | Cohere      | 1024       | 512        | Free tier available | Excellent
    all-MiniLM-L6-v2       | Open source | 384        | 256        | Free                | Good
    nomic-embed-text-v1.5  | Nomic       | 768        | 8192       | Free (open source)  | Very good
    BGE-large-en-v1.5      | BAAI        | 1024       | 512        | Free (open source)  | Very good
    voyage-3               | Voyage AI   | 1024       | 32000      | $0.06/1M tokens     | Excellent

    Quick Pick Guide

    • Getting started: OpenAI text-embedding-3-small (easy API, good quality)
    • Best quality: OpenAI text-embedding-3-large or Cohere embed-v3
    • Free / local: all-MiniLM-L6-v2 or nomic-embed-text-v1.5
    • Long documents: nomic-embed-text-v1.5 or voyage-3 (large context windows)
    • Privacy-critical: Any open-source model run locally
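
    Embedding cost scales linearly with token count, so you can sanity-check a budget before indexing. A minimal Python sketch (the corpus size and per-chunk token count below are illustrative assumptions; the prices are the per-million-token rates from the table above):

```python
def embedding_cost_usd(total_tokens: int, price_per_million: float) -> float:
    """Estimate embedding cost: providers bill per token embedded."""
    return total_tokens / 1_000_000 * price_per_million

# Example: a 10,000-chunk corpus averaging 500 tokens per chunk = 5M tokens
corpus_tokens = 10_000 * 500
small = embedding_cost_usd(corpus_tokens, 0.02)  # text-embedding-3-small
large = embedding_cost_usd(corpus_tokens, 0.13)  # text-embedding-3-large
print(f"small: ${small:.2f}, large: ${large:.2f}")  # small: $0.10, large: $0.65
```

    Even the "Excellent" tier is cheap at this scale; re-embedding on every index rebuild is usually what drives costs up.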

    Vector Database Comparison

    Feature            | Pinecone        | ChromaDB          | Weaviate           | pgvector          | Qdrant
    Hosting            | Cloud only      | Local / embedded  | Cloud or self-host | Your PostgreSQL   | Cloud or self-host
    Setup time         | 5 min           | 2 min             | 15 min             | 10 min            | 10 min
    Free tier          | Yes             | Free (OSS)        | Free (OSS)         | Free (OSS)        | Free (OSS)
    Max vectors (free) | 100K            | Unlimited (local) | Unlimited (local)  | Unlimited (local) | Unlimited (local)
    Hybrid search      | Yes             | Limited           | Yes                | With extensions   | Yes
    Filtering          | Metadata        | Metadata          | Metadata + refs    | SQL WHERE         | Metadata
    Language SDKs      | Python, JS      | Python, JS        | Python, JS, Go     | Any SQL client    | Python, JS, Rust
    Best for           | Production SaaS | Prototyping       | Feature-rich apps  | Existing PG users | High performance

    Quick Pick Guide

    • Learning / prototyping: ChromaDB (zero config, runs locally)
    • Production (managed): Pinecone (fully managed, scales automatically)
    • Production (self-hosted): Weaviate or Qdrant
    • Already using PostgreSQL: pgvector (add to existing DB)

    Chunking Strategy Decision Tree

    What type of document are you chunking?
    │
    ├── Structured (Markdown, HTML, docs with headers)
    │   └── Use: Document-structure-based chunking
    │       Split on headers/sections, keep hierarchy as metadata
    │
    ├── Long-form text (books, articles, reports)
    │   └── Use: Recursive text splitting
    │       chunk_size=500-1000, overlap=50-100
    │       separators: paragraphs → sentences → words
    │
    ├── Short entries (FAQ, chat logs, product descriptions)
    │   └── Use: Fixed-size or per-entry chunking
    │       chunk_size=200-400, minimal overlap
    │
    ├── Mixed topics (transcripts, meeting notes)
    │   └── Use: Semantic chunking
    │       Split when topic changes (embedding similarity drops)
    │
    └── Code files
        └── Use: Language-aware splitting
            Split on functions/classes, keep file path as metadata
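
    The recursive strategy above can be sketched in a few lines of pure Python. This is a simplified illustration of the idea behind splitters like LangChain's RecursiveCharacterTextSplitter, not a drop-in replacement: it counts characters rather than tokens and omits overlap.

```python
def recursive_split(text: str, chunk_size: int = 500,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    """Split on the coarsest separator first, recursing to finer ones as needed."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separators left: fall back to a hard character split
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        if len(piece) > chunk_size:
            # Piece is itself too big: flush what we have, recurse with finer separators
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, rest))
        elif not current:
            current = piece
        elif len(current) + len(sep) + len(piece) <= chunk_size:
            current += sep + piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

print(recursive_split("aaa bbb ccc ddd", chunk_size=7, separators=(" ",)))
# ['aaa bbb', 'ccc ddd']
```

    The key property: paragraphs stay whole when they fit, and only oversized pieces get broken at finer boundaries.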
    

    Chunk Size Quick Reference

    Document Type      | Chunk Size   | Overlap | Splitter
    FAQ / Q&A          | 200-300      | 20-30   | Fixed or per-entry
    Technical docs     | 500-800      | 50-100  | Recursive
    Legal / compliance | 800-1200     | 100-200 | Recursive
    Chat logs          | 300-500      | 30-50   | Fixed or per-message
    Books / articles   | 500-1000     | 50-100  | Recursive
    API documentation  | 400-600      | 40-60   | Structure-based
    Code               | Per function | 0       | Language-aware

    RAG Prompt Templates

    Basic RAG Prompt

    You are a helpful assistant. Answer the user's question based ONLY on
    the provided context. If the context does not contain enough information,
    say "I don't have enough information to answer that."
    
    Context:
    {retrieved_chunks}
    
    Question: {user_question}
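
    Filling a template like this is plain string formatting. A minimal sketch, with a made-up chunk and question for illustration:

```python
BASIC_RAG_PROMPT = """You are a helpful assistant. Answer the user's question based ONLY on
the provided context. If the context does not contain enough information,
say "I don't have enough information to answer that."

Context:
{retrieved_chunks}

Question: {user_question}"""

def build_prompt(chunks: list[str], question: str) -> str:
    # Join chunks with a visible divider so the model can tell them apart
    return BASIC_RAG_PROMPT.format(
        retrieved_chunks="\n---\n".join(chunks),
        user_question=question,
    )

prompt = build_prompt(
    ["Vacation policy: 15 days per year."],
    "How many vacation days do I get?",
)
print(prompt)
```

    Keeping the template as a constant (rather than building the prompt inline) makes it easy to version and A/B test later.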
    

    RAG with Citations

    Answer the question based on the provided context. For each claim in
    your answer, cite the source using [Source: document_name].
    
    If the context doesn't contain the answer, say "I don't have enough
    information to answer that question."
    
    Context:
    [Source: handbook-section-5]
    Employees receive 15 days of paid vacation per year...
    
    [Source: handbook-section-8]
    Parental leave policy provides 12 weeks...
    
    Question: {user_question}
    

    Conversational RAG Prompt

    You are a knowledgeable assistant for [Company/Product].
    Use the conversation history and retrieved context to answer questions.
    Be concise and helpful. If you're not sure, say so.
    
    Conversation history:
    {chat_history}
    
    Retrieved context:
    {retrieved_chunks}
    
    User: {user_question}
    Assistant:
    

    Key Formulas and Metrics

    Cosine Similarity = dot(A, B) / (||A|| * ||B||)
      Range: -1 to 1 (higher = more similar)
    
    Precision = relevant_retrieved / total_retrieved
      "Of what I found, how much was useful?"
    
    Recall = relevant_retrieved / total_relevant
      "Of what exists, how much did I find?"
    
    F1 Score = 2 * (precision * recall) / (precision + recall)
      Harmonic mean of precision and recall
    
    MRR = 1 / rank_of_first_relevant_result
      "How quickly did I find something useful?"
    

    Common Code Snippets

    Generate Embeddings (OpenAI)

    from openai import OpenAI
    client = OpenAI()
    
    def embed(texts: list[str]) -> list[list[float]]:
        response = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return [item.embedding for item in response.data]

    Store in ChromaDB

    import chromadb

    # Separate name so it doesn't shadow the OpenAI client from the snippet above
    chroma_client = chromadb.PersistentClient(path="./data")
    collection = chroma_client.get_or_create_collection("docs")

    collection.add(
        ids=["id1", "id2"],
        documents=["text one", "text two"],
        embeddings=embed(["text one", "text two"]),  # batch both texts in one API call
        metadatas=[{"source": "file1"}, {"source": "file2"}]
    )

    Query and Generate

    # Embed the question with the SAME model used at indexing time;
    # query_texts would use Chroma's default embedder, a different vector space
    results = collection.query(query_embeddings=embed(["user question"]), n_results=5)
    context = "\n---\n".join(results["documents"][0])

    # client is the OpenAI client from the first snippet
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based on context:\n{context}"},
            {"role": "user", "content": "user question"}
        ]
    )
    print(response.choices[0].message.content)

    AI Prompts for Building RAG Systems

    Getting Started

    • "Help me build a RAG system for [document type]. Walk me through the setup step by step."
    • "I have [number] [PDFs/markdown files/web pages]. Design a RAG pipeline using [Python/TypeScript] with [OpenAI/Cohere] embeddings and [ChromaDB/Pinecone]."
    • "Compare RAG vs fine-tuning for my use case: [describe what you need]."

    Embeddings

    • "Help me choose an embedding model for [use case]. I need [speed/quality/privacy]. My documents are [language/domain]."
    • "Write a script that generates embeddings for all files in a directory and saves them to [vector DB]."
    • "My embedding costs are too high. Help me optimize by [batching/caching/switching models]."

    Chunking

    • "I have [document type] documents averaging [size]. Recommend the best chunking strategy, chunk size, and overlap."
    • "My RAG retrieves irrelevant chunks. Here's an example: [show question and retrieved chunks]. How should I improve my chunking?"
    • "Write a custom chunker that splits [my document format] by [sections/headers/pages] and preserves metadata."

    Vector Database

    • "Help me set up [Pinecone/ChromaDB/Weaviate/pgvector] for my RAG project with [Python/TypeScript]."
    • "I have [number] documents. Which vector database should I use considering [cost/scale/features]?"
    • "My vector search is slow. Help me optimize indexing and query performance."

    Pipeline Building

    • "Build a complete RAG pipeline that: loads [documents], chunks them with [strategy], embeds with [model], stores in [DB], and answers questions with [LLM]."
    • "Add metadata filtering to my RAG pipeline so users can filter by [category/date/source]."
    • "Implement hybrid search (vector + keyword) in my RAG pipeline for better retrieval."
    • "Add conversation memory to my RAG chatbot so it remembers previous questions."

    Evaluation and Improvement

    • "Help me create a ground truth evaluation set for my RAG system with [number] test questions."
    • "My RAG system gives wrong answers for [type of question]. Here's my setup: [describe]. How do I debug and fix this?"
    • "Set up Ragas evaluation for my RAG pipeline and help me interpret the results."
    • "My retrieval precision is low. What techniques can I use to improve it?"

    Production

    • "Help me deploy my RAG pipeline as a REST API using [FastAPI/Express]."
    • "Add streaming responses to my RAG API so users see the answer as it generates."
    • "Set up a document ingestion pipeline that automatically indexes new documents when they're added to [S3/GCS/a folder]."
    • "Implement rate limiting and caching for my RAG API to control costs."

    Debugging

    • "My RAG returns 'I don't know' for questions I know are in my documents. Help me debug."
    • "The retrieved chunks are relevant but the LLM ignores them. How do I fix my prompt?"
    • "My RAG works for short questions but fails for complex ones. What's wrong?"
    • "I'm getting different answers for the same question. How do I make my RAG more consistent?"

    The Complete RAG Workflow

    1. COLLECT your documents (PDFs, markdown, web pages, databases)
    2. CHOOSE an embedding model (OpenAI, Cohere, or open source)
    3. CHOOSE a vector database (ChromaDB for dev, Pinecone/Weaviate for prod)
    4. CHUNK your documents (recursive splitting is the default choice)
    5. EMBED and STORE chunks in your vector database
    6. BUILD the query pipeline (embed question → retrieve → generate)
    7. EVALUATE with ground truth questions
    8. ITERATE on chunking, retrieval, and prompts until quality is good
    9. DEPLOY as an API or integrate into your application
    10. MONITOR and update your document index as data changes
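
    Step 7 can start as simple as a hit-rate check over a hand-written test set: does the expected source document appear in the top-K results? A minimal sketch, where fake_retrieve is a toy stand-in for your real retrieval function:

```python
def hit_rate(test_set, retrieve, k=5):
    """Fraction of questions whose expected source appears in the top-K results."""
    hits = 0
    for question, expected_source in test_set:
        sources = [doc["source"] for doc in retrieve(question, k)]
        hits += expected_source in sources
    return hits / len(test_set)

# Toy stand-in for a real retriever, for illustration only
def fake_retrieve(question, k):
    if "vacation" in question:
        return [{"source": "handbook-section-5"}]
    return [{"source": "other"}]

tests = [
    ("How many vacation days do I get?", "handbook-section-5"),
    ("What is the parental leave policy?", "handbook-section-8"),
]
print(hit_rate(tests, fake_retrieve))  # 0.5
```

    Swap in your own collection.query call for fake_retrieve, and track this number every time you change chunking, the embedding model, or K.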
    

    Keep Learning

    RAG is evolving fast. Here are advanced topics to explore next:

    • Agentic RAG — Let the LLM decide when and what to retrieve
    • Graph RAG — Combine knowledge graphs with vector search
    • Multi-modal RAG — Retrieve images, tables, and diagrams alongside text
    • Re-ranking — Use cross-encoders to re-order retrieved results
    • Query decomposition — Break complex questions into sub-queries
    • Self-RAG — The model evaluates its own retrieval and generation quality

    Remember: The best RAG system is the one that gives your users accurate, grounded answers. Start simple, measure everything, and iterate based on real usage.

    Happy building!

