RAG Essentials Cheat Sheet
Your complete reference for building RAG systems. Bookmark this page for pipeline diagrams, comparison tables, decision trees, and ready-to-use AI prompts.
RAG Pipeline Visual Reference
┌──────────────────────────────────────────────────────────────┐
│                         RAG PIPELINE                         │
│                                                              │
│  ┌─────────┐   ┌─────────┐   ┌──────────┐   ┌───────────┐    │
│  │  LOAD   │──▶│  CHUNK  │──▶│  EMBED   │──▶│   STORE   │    │
│  │  docs   │   │  split  │   │  vectors │   │  vector   │    │
│  │         │   │  text   │   │          │   │  database │    │
│  └─────────┘   └─────────┘   └──────────┘   └─────┬─────┘    │
│                                                   │          │
│  At query time:      ┌────────────────────────────┘          │
│                      ▼                                       │
│  ┌──────────┐   ┌──────────┐   ┌───────────────────┐         │
│  │  QUERY   │──▶│ RETRIEVE │──▶│     GENERATE      │         │
│  │  embed   │   │  top-K   │   │  LLM + context    │         │
│  │ question │   │  similar │   │ = grounded answer │         │
│  └──────────┘   └──────────┘   └───────────────────┘         │
└──────────────────────────────────────────────────────────────┘
Embedding Model Comparison
| Model | Provider | Dimensions | Max Tokens | Cost | Quality |
|---|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | 8191 | $0.02/1M tokens | Good |
| text-embedding-3-large | OpenAI | 3072 | 8191 | $0.13/1M tokens | Excellent |
| embed-english-v3.0 | Cohere | 1024 | 512 | Free tier available | Excellent |
| all-MiniLM-L6-v2 | Open source | 384 | 256 | Free | Good |
| nomic-embed-text-v1.5 | Nomic | 768 | 8192 | Free (open source) | Very good |
| BGE-large-en-v1.5 | BAAI | 1024 | 512 | Free (open source) | Very good |
| voyage-3 | Voyage AI | 1024 | 32000 | $0.06/1M tokens | Excellent |
Quick Pick Guide
- Getting started: OpenAI text-embedding-3-small (easy API, good quality)
- Best quality: OpenAI text-embedding-3-large or Cohere embed-english-v3.0
- Free / local: all-MiniLM-L6-v2 or nomic-embed-text-v1.5
- Long documents: nomic-embed-text-v1.5 or voyage-3 (large context windows)
- Privacy-critical: Any open-source model run locally
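Per-token pricing makes costs easy to budget before you embed anything. A back-of-envelope sketch (the ratio of ~4 characters per token is a rough heuristic for English text, not an exact count):

```python
def estimate_embedding_cost(num_chars: int, price_per_million_tokens: float) -> float:
    # Rough heuristic: ~4 characters per token for English text
    est_tokens = num_chars / 4
    return est_tokens / 1_000_000 * price_per_million_tokens

# 10,000 documents averaging 3,000 characters,
# text-embedding-3-small at $0.02/1M tokens
cost = estimate_embedding_cost(10_000 * 3_000, 0.02)
print(f"${cost:.2f}")  # $0.15
```

Even large corpora are usually cheap to embed once; re-embedding on every pipeline change is where costs creep in.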
Vector Database Comparison
| Feature | Pinecone | ChromaDB | Weaviate | pgvector | Qdrant |
|---|---|---|---|---|---|
| Hosting | Cloud only | Local / embedded | Cloud or self-host | Your PostgreSQL | Cloud or self-host |
| Setup time | 5 min | 2 min | 15 min | 10 min | 10 min |
| Free tier | Yes | Free (OSS) | Free (OSS) | Free (OSS) | Free (OSS) |
| Max vectors (free) | 100K | Unlimited (local) | Unlimited (local) | Unlimited (local) | Unlimited (local) |
| Hybrid search | Yes | Limited | Yes | With extensions | Yes |
| Filtering | Metadata | Metadata | Metadata + refs | SQL WHERE | Metadata |
| Language SDKs | Python, JS | Python, JS | Python, JS, Go | Any SQL client | Python, JS, Rust |
| Best for | Production SaaS | Prototyping | Feature-rich apps | Existing PG users | High performance |
Quick Pick Guide
- Learning / prototyping: ChromaDB (zero config, runs locally)
- Production (managed): Pinecone (fully managed, scales automatically)
- Production (self-hosted): Weaviate or Qdrant
- Already using PostgreSQL: pgvector (add to existing DB)
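Whichever database you pick, the core operation is the same: rank stored vectors by similarity to the query vector and return the top K. A brute-force sketch of that operation (illustrative only; real vector databases add approximate indexes such as HNSW to avoid the full scan at scale):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def top_k(query: list[float], index: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    # Rank every stored vector by similarity to the query, keep the best k
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

index = [("doc-a", [1.0, 0.0]), ("doc-b", [0.0, 1.0]), ("doc-c", [0.9, 0.1])]
print(top_k([1.0, 0.0], index, k=2))  # ['doc-a', 'doc-c']
```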
Chunking Strategy Decision Tree
What type of document are you chunking?
│
├── Structured (Markdown, HTML, docs with headers)
│ └── Use: Document-structure-based chunking
│ Split on headers/sections, keep hierarchy as metadata
│
├── Long-form text (books, articles, reports)
│ └── Use: Recursive text splitting
│ chunk_size=500-1000, overlap=50-100
│ separators: paragraphs → sentences → words
│
├── Short entries (FAQ, chat logs, product descriptions)
│ └── Use: Fixed-size or per-entry chunking
│ chunk_size=200-400, minimal overlap
│
├── Mixed topics (transcripts, meeting notes)
│ └── Use: Semantic chunking
│ Split when topic changes (embedding similarity drops)
│
└── Code files
└── Use: Language-aware splitting
Split on functions/classes, keep file path as metadata
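The recursive strategy above is what libraries such as LangChain implement; the core idea fits in a few lines. A simplified character-based sketch (a real splitter would typically count tokens and handle more edge cases):

```python
def recursive_split(
    text: str,
    chunk_size: int = 500,
    overlap: int = 50,
    separators: tuple[str, ...] = ("\n\n", "\n", " "),
) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    preferring earlier separators; assumes chunk_size > overlap."""
    if len(text) <= chunk_size:
        return [text.strip()] if text.strip() else []
    # Find the latest occurrence of the highest-priority separator
    # that still fits inside one chunk
    for sep in separators:
        cut = text.rfind(sep, overlap + 1, chunk_size)
        if cut != -1:
            break
    else:
        cut = chunk_size  # no separator found: hard cut
    head = text[:cut].strip()
    # The next chunk starts `overlap` characters before the cut
    rest = recursive_split(text[cut - overlap:], chunk_size, overlap, separators)
    return ([head] if head else []) + rest

chunks = recursive_split("word " * 200, chunk_size=100, overlap=10)
```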
Chunk Size Quick Reference
| Document Type | Chunk Size | Overlap | Splitter |
|---|---|---|---|
| FAQ / Q&A | 200-300 | 20-30 | Fixed or per-entry |
| Technical docs | 500-800 | 50-100 | Recursive |
| Legal / compliance | 800-1200 | 100-200 | Recursive |
| Chat logs | 300-500 | 30-50 | Fixed or per-message |
| Books / articles | 500-1000 | 50-100 | Recursive |
| API documentation | 400-600 | 40-60 | Structure-based |
| Code | Per function | 0 | Language-aware |
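For the structure-based rows, a minimal Markdown header splitter shows the idea (an illustrative sketch, not a full parser; it ignores fenced code blocks, for instance):

```python
import re

def split_markdown_by_headers(md: str) -> list[dict]:
    """Split a Markdown document into one section per header,
    keeping the header text as metadata for each chunk."""
    sections: list[dict] = []
    current = {"header": None, "lines": []}
    for line in md.splitlines():
        m = re.match(r"^#{1,6}\s+(.*)", line)
        if m:
            # A new header starts a new section; save the previous one
            if current["lines"] or current["header"] is not None:
                sections.append(current)
            current = {"header": m.group(1), "lines": []}
        else:
            current["lines"].append(line)
    sections.append(current)
    return [
        {"header": s["header"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]

sections = split_markdown_by_headers(
    "# Benefits\nVacation info.\n## Parental leave\n12 weeks."
)
```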
RAG Prompt Templates
Basic RAG Prompt
You are a helpful assistant. Answer the user's question based ONLY on
the provided context. If the context does not contain enough information,
say "I don't have enough information to answer that."
Context:
{retrieved_chunks}
Question: {user_question}
RAG with Citations
Answer the question based on the provided context. For each claim in
your answer, cite the source using [Source: document_name].
If the context doesn't contain the answer, say "I don't have enough
information to answer that question."
Context:
[Source: handbook-section-5]
Employees receive 15 days of paid vacation per year...
[Source: handbook-section-8]
Parental leave policy provides 12 weeks...
Question: {user_question}
Conversational RAG Prompt
You are a knowledgeable assistant for [Company/Product].
Use the conversation history and retrieved context to answer questions.
Be concise and helpful. If you're not sure, say so.
Conversation history:
{chat_history}
Retrieved context:
{retrieved_chunks}
User: {user_question}
Assistant:
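In practice these templates are filled in code just before the LLM call. A minimal sketch of the citation variant (the function name and chunk schema are illustrative, not from any particular library):

```python
def build_rag_prompt(chunks: list[dict], question: str) -> str:
    # chunks: [{"source": "handbook-section-5", "text": "Employees receive..."}]
    context = "\n\n".join(f"[Source: {c['source']}]\n{c['text']}" for c in chunks)
    return (
        "Answer the question based on the provided context. For each claim in "
        "your answer, cite the source using [Source: document_name].\n"
        "If the context doesn't contain the answer, say \"I don't have enough "
        "information to answer that question.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    [{"source": "handbook-section-5",
      "text": "Employees receive 15 days of paid vacation per year."}],
    "How much vacation do I get?",
)
```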
Key Formulas and Metrics
Cosine Similarity = dot(A, B) / (||A|| * ||B||)
Range: -1 to 1 (higher = more similar)
Precision = relevant_retrieved / total_retrieved
"Of what I found, how much was useful?"
Recall = relevant_retrieved / total_relevant
"Of what exists, how much did I find?"
F1 Score = 2 * (precision * recall) / (precision + recall)
Harmonic mean of precision and recall
MRR = mean(1 / rank_of_first_relevant_result) over all test queries
"How quickly, on average, did I find something useful?"
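All four formulas translate directly into a few lines of stdlib Python:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def precision_recall_f1(retrieved: list[str], relevant: set[str]) -> tuple[float, float, float]:
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    # MRR is this value averaged over a set of test queries
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

p, r, f1 = precision_recall_f1(["a", "b", "c", "d"], {"a", "c", "e"})
# 2 of 4 retrieved were relevant; 2 of 3 relevant docs were retrieved
```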
Common Code Snippets
Generate Embeddings (OpenAI)
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]
Store in ChromaDB
import chromadb

# Use a distinct name so the OpenAI `client` above isn't shadowed
chroma_client = chromadb.PersistentClient(path="./data")
collection = chroma_client.get_or_create_collection("docs")

collection.add(
    ids=["id1", "id2"],
    documents=["text one", "text two"],
    embeddings=embed(["text one", "text two"]),  # one batched call instead of two
    metadatas=[{"source": "file1"}, {"source": "file2"}],
)
Query and Generate
results = collection.query(
    # Embed the query with the same model used at index time; passing
    # query_texts would fall back to Chroma's default embedder and mismatch
    query_embeddings=embed(["user question"]),
    n_results=5,
)
context = "\n---\n".join(results["documents"][0])

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": f"Answer based on context:\n{context}"},
        {"role": "user", "content": "user question"},
    ],
)
print(response.choices[0].message.content)
AI Prompts for Building RAG Systems
Getting Started
- "Help me build a RAG system for [document type]. Walk me through the setup step by step."
- "I have [number] [PDFs/markdown files/web pages]. Design a RAG pipeline using [Python/TypeScript] with [OpenAI/Cohere] embeddings and [ChromaDB/Pinecone]."
- "Compare RAG vs fine-tuning for my use case: [describe what you need]."
Embeddings
- "Help me choose an embedding model for [use case]. I need [speed/quality/privacy]. My documents are [language/domain]."
- "Write a script that generates embeddings for all files in a directory and saves them to [vector DB]."
- "My embedding costs are too high. Help me optimize by [batching/caching/switching models]."
Chunking
- "I have [document type] documents averaging [size]. Recommend the best chunking strategy, chunk size, and overlap."
- "My RAG retrieves irrelevant chunks. Here's an example: [show question and retrieved chunks]. How should I improve my chunking?"
- "Write a custom chunker that splits [my document format] by [sections/headers/pages] and preserves metadata."
Vector Database
- "Help me set up [Pinecone/ChromaDB/Weaviate/pgvector] for my RAG project with [Python/TypeScript]."
- "I have [number] documents. Which vector database should I use considering [cost/scale/features]?"
- "My vector search is slow. Help me optimize indexing and query performance."
Pipeline Building
- "Build a complete RAG pipeline that: loads [documents], chunks them with [strategy], embeds with [model], stores in [DB], and answers questions with [LLM]."
- "Add metadata filtering to my RAG pipeline so users can filter by [category/date/source]."
- "Implement hybrid search (vector + keyword) in my RAG pipeline for better retrieval."
- "Add conversation memory to my RAG chatbot so it remembers previous questions."
Evaluation and Improvement
- "Help me create a ground truth evaluation set for my RAG system with [number] test questions."
- "My RAG system gives wrong answers for [type of question]. Here's my setup: [describe]. How do I debug and fix this?"
- "Set up Ragas evaluation for my RAG pipeline and help me interpret the results."
- "My retrieval precision is low. What techniques can I use to improve it?"
Production
- "Help me deploy my RAG pipeline as a REST API using [FastAPI/Express]."
- "Add streaming responses to my RAG API so users see the answer as it generates."
- "Set up a document ingestion pipeline that automatically indexes new documents when they're added to [S3/GCS/a folder]."
- "Implement rate limiting and caching for my RAG API to control costs."
Debugging
- "My RAG returns 'I don't know' for questions I know are in my documents. Help me debug."
- "The retrieved chunks are relevant but the LLM ignores them. How do I fix my prompt?"
- "My RAG works for short questions but fails for complex ones. What's wrong?"
- "I'm getting different answers for the same question. How do I make my RAG more consistent?"
The Complete RAG Workflow
1. COLLECT your documents (PDFs, markdown, web pages, databases)
2. CHOOSE an embedding model (OpenAI, Cohere, or open source)
3. CHOOSE a vector database (ChromaDB for dev, Pinecone/Weaviate for prod)
4. CHUNK your documents (recursive splitting is the default choice)
5. EMBED and STORE chunks in your vector database
6. BUILD the query pipeline (embed question → retrieve → generate)
7. EVALUATE with ground truth questions
8. ITERATE on chunking, retrieval, and prompts until quality is good
9. DEPLOY as an API or integrate into your application
10. MONITOR and update your document index as data changes
Keep Learning
RAG is evolving fast. Here are advanced topics to explore next:
- Agentic RAG — Let the LLM decide when and what to retrieve
- Graph RAG — Combine knowledge graphs with vector search
- Multi-modal RAG — Retrieve images, tables, and diagrams alongside text
- Re-ranking — Use cross-encoders to re-order retrieved results
- Query decomposition — Break complex questions into sub-queries
- Self-RAG — The model evaluates its own retrieval and generation quality
Remember: The best RAG system is the one that gives your users accurate, grounded answers. Start simple, measure everything, and iterate based on real usage.
Happy building!