RAG Essentials Cheat Sheet
Your complete reference for building RAG systems. Bookmark this page for pipeline diagrams, comparison tables, decision trees, and ready-to-use AI prompts.
RAG Pipeline Visual Reference
┌──────────────────────────────────────────────────────────────┐
│                         RAG PIPELINE                         │
│                                                              │
│  ┌─────────┐   ┌─────────┐   ┌──────────┐   ┌───────────┐    │
│  │  LOAD   │──▶│  CHUNK  │──▶│  EMBED   │──▶│   STORE   │    │
│  │  docs   │   │  split  │   │  vectors │   │  vector   │    │
│  │         │   │  text   │   │          │   │  database │    │
│  └─────────┘   └─────────┘   └──────────┘   └─────┬─────┘    │
│                                                   │          │
│  At query time:      ┌────────────────────────────┘          │
│                      ▼                                       │
│  ┌──────────┐   ┌──────────┐   ┌───────────────────┐         │
│  │  QUERY   │──▶│ RETRIEVE │──▶│     GENERATE      │         │
│  │  embed   │   │  top-K   │   │  LLM + context    │         │
│  │ question │   │  similar │   │ = grounded answer │         │
│  └──────────┘   └──────────┘   └───────────────────┘         │
└──────────────────────────────────────────────────────────────┘
Embedding Model Comparison
| Model | Provider | Dimensions | Max Tokens | Cost | Quality |
|---|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 | 8191 | $0.02/1M tokens | Good |
| text-embedding-3-large | OpenAI | 3072 | 8191 | $0.13/1M tokens | Excellent |
| embed-english-v3.0 | Cohere | 1024 | 512 | Free tier available | Excellent |
| all-MiniLM-L6-v2 | Open source | 384 | 256 | Free | Good |
| nomic-embed-text-v1.5 | Nomic | 768 | 8192 | Free (open source) | Very good |
| BGE-large-en-v1.5 | BAAI | 1024 | 512 | Free (open source) | Very good |
| voyage-3 | Voyage AI | 1024 | 32000 | $0.06/1M tokens | Excellent |
Quick Pick Guide
- Getting started: OpenAI text-embedding-3-small (easy API, good quality)
- Best quality: OpenAI text-embedding-3-large or Cohere embed-english-v3.0
- Free / local: all-MiniLM-L6-v2 or nomic-embed-text-v1.5
- Long documents: nomic-embed-text-v1.5 or voyage-3 (large context windows)
- Privacy-critical: Any open-source model run locally
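Per-token pricing makes costs easy to budget before you embed anything. A back-of-envelope sketch (the ratio of ~4 characters per token is a rough heuristic for English text, not an exact count):

```python
def estimate_embedding_cost(num_chars: int, price_per_million_tokens: float) -> float:
    # Rough heuristic: ~4 characters per token for English text
    est_tokens = num_chars / 4
    return est_tokens / 1_000_000 * price_per_million_tokens

# 10,000 documents averaging 3,000 characters,
# text-embedding-3-small at $0.02/1M tokens
cost = estimate_embedding_cost(10_000 * 3_000, 0.02)
print(f"${cost:.2f}")  # $0.15
```

Even large corpora are usually cheap to embed once; re-embedding on every pipeline change is where costs creep in.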
Vector Database Comparison
| Feature | Pinecone | ChromaDB | Weaviate | pgvector | Qdrant |
|---|---|---|---|---|---|
| Hosting | Cloud only | Local / embedded | Cloud or self-host | Your PostgreSQL | Cloud or self-host |
| Setup time | 5 min | 2 min | 15 min | 10 min | 10 min |
| Free tier | Yes | Free (OSS) | Free (OSS) | Free (OSS) | Free (OSS) |
| Max vectors (free) | 100K | Unlimited (local) | Unlimited (local) | Unlimited (local) | Unlimited (local) |
| Hybrid search | Yes | Limited | Yes | With extensions | Yes |
| Filtering | Metadata | Metadata | Metadata + refs | SQL WHERE | Metadata |
| Language SDKs | Python, JS | Python, JS | Python, JS, Go | Any SQL client | Python, JS, Rust |
| Best for | Production SaaS | Prototyping | Feature-rich apps | Existing PG users | High performance |
Quick Pick Guide
- Learning / prototyping: ChromaDB (zero config, runs locally)
- Production (managed): Pinecone (fully managed, scales automatically)
- Production (self-hosted): Weaviate or Qdrant
- Already using PostgreSQL: pgvector (add to existing DB)
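Whichever database you pick, the core operation is the same: rank stored vectors by similarity to the query vector and return the top K. A brute-force sketch of that operation (illustrative only; real vector databases add approximate indexes such as HNSW to avoid the full scan at scale):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def top_k(query: list[float], index: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    # Rank every stored vector by similarity to the query, keep the best k
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

index = [("doc-a", [1.0, 0.0]), ("doc-b", [0.0, 1.0]), ("doc-c", [0.9, 0.1])]
print(top_k([1.0, 0.0], index, k=2))  # ['doc-a', 'doc-c']
```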
Chunking Strategy Decision Tree
What type of document are you chunking?
│
├── Structured (Markdown, HTML, docs with headers)
│ └── Use: Document-structure-based chunking
│ Split on headers/sections, keep hierarchy as metadata
│
├── Long-form text (books, articles, reports)
│ └── Use: Recursive text splitting
│ chunk_size=500-1000, overlap=50-100
│ separators: paragraphs → sentences → words
│
├── Short entries (FAQ, chat logs, product descriptions)
│ └── Use: Fixed-size or per-entry chunking
│ chunk_size=200-400, minimal overlap
│
├── Mixed topics (transcripts, meeting notes)
│ └── Use: Semantic chunking
│ Split when topic changes (embedding similarity drops)
│
└── Code files
└── Use: Language-aware splitting
Split on functions/classes, keep file path as metadata
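The recursive strategy above is what libraries such as LangChain implement; the core idea fits in a few lines. A simplified character-based sketch (a real splitter would typically count tokens and handle more edge cases):

```python
def recursive_split(
    text: str,
    chunk_size: int = 500,
    overlap: int = 50,
    separators: tuple[str, ...] = ("\n\n", "\n", " "),
) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    preferring earlier separators; assumes chunk_size > overlap."""
    if len(text) <= chunk_size:
        return [text.strip()] if text.strip() else []
    # Find the latest occurrence of the highest-priority separator
    # that still fits inside one chunk
    for sep in separators:
        cut = text.rfind(sep, overlap + 1, chunk_size)
        if cut != -1:
            break
    else:
        cut = chunk_size  # no separator found: hard cut
    head = text[:cut].strip()
    # The next chunk starts `overlap` characters before the cut
    rest = recursive_split(text[cut - overlap:], chunk_size, overlap, separators)
    return ([head] if head else []) + rest

chunks = recursive_split("word " * 200, chunk_size=100, overlap=10)
```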
Chunk Size Quick Reference
| Document Type | Chunk Size | Overlap | Splitter |
|---|---|---|---|
| FAQ / Q&A | 200-300 | 20-30 | Fixed or per-entry |
| Technical docs | 500-800 | 50-100 | Recursive |
| Legal / compliance | 800-1200 | 100-200 | Recursive |
| Chat logs | 300-500 | 30-50 | Fixed or per-message |
| Books / articles | 500-1000 | 50-100 | Recursive |
| API documentation | 400-600 | 40-60 | Structure-based |
| Code | Per function | 0 | Language-aware |
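For the structure-based rows, a minimal Markdown header splitter shows the idea (an illustrative sketch, not a full parser; it ignores fenced code blocks, for instance):

```python
import re

def split_markdown_by_headers(md: str) -> list[dict]:
    """Split a Markdown document into one section per header,
    keeping the header text as metadata for each chunk."""
    sections: list[dict] = []
    current = {"header": None, "lines": []}
    for line in md.splitlines():
        m = re.match(r"^#{1,6}\s+(.*)", line)
        if m:
            # A new header starts a new section; save the previous one
            if current["lines"] or current["header"] is not None:
                sections.append(current)
            current = {"header": m.group(1), "lines": []}
        else:
            current["lines"].append(line)
    sections.append(current)
    return [
        {"header": s["header"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]

sections = split_markdown_by_headers(
    "# Benefits\nVacation info.\n## Parental leave\n12 weeks."
)
```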
RAG Prompt Templates
Basic RAG Prompt
You are a helpful assistant. Answer the user's question based ONLY on
the provided context. If the context does not contain enough information,
say "I don't have enough information to answer that."
Context:
{retrieved_chunks}
Question: {user_question}
RAG with Citations
Answer the question based on the provided context. For each claim in
your answer, cite the source using [Source: document_name].
If the context doesn't contain the answer, say "I don't have enough
information to answer that question."
Context:
[Source: handbook-section-5]
Employees receive 15 days of paid vacation per year...
[Source: handbook-section-8]
Parental leave policy provides 12 weeks...
Question: {user_question}
Conversational RAG Prompt
You are a knowledgeable assistant for [Company/Product].
Use the conversation history and retrieved context to answer questions.
Be concise and helpful. If you're not sure, say so.
Conversation history:
{chat_history}
Retrieved context:
{retrieved_chunks}
User: {user_question}
Assistant:
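In practice these templates are filled in code just before the LLM call. A minimal sketch of the citation variant (the function name and chunk schema are illustrative, not from any particular library):

```python
def build_rag_prompt(chunks: list[dict], question: str) -> str:
    # chunks: [{"source": "handbook-section-5", "text": "Employees receive..."}]
    context = "\n\n".join(f"[Source: {c['source']}]\n{c['text']}" for c in chunks)
    return (
        "Answer the question based on the provided context. For each claim in "
        "your answer, cite the source using [Source: document_name].\n"
        "If the context doesn't contain the answer, say \"I don't have enough "
        "information to answer that question.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    [{"source": "handbook-section-5",
      "text": "Employees receive 15 days of paid vacation per year."}],
    "How much vacation do I get?",
)
```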
Key Formulas and Metrics
Cosine Similarity = dot(A, B) / (||A|| * ||B||)
Range: -1 to 1 (higher = more similar)
Precision = relevant_retrieved / total_retrieved
"Of what I found, how much was useful?"
Recall = relevant_retrieved / total_relevant
"Of what exists, how much did I find?"
F1 Score = 2 * (precision * recall) / (precision + recall)
Harmonic mean of precision and recall
MRR = mean(1 / rank_of_first_relevant_result) over all test queries
"How quickly, on average, did I find something useful?"
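All four formulas translate directly into a few lines of stdlib Python:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def precision_recall_f1(retrieved: list[str], relevant: set[str]) -> tuple[float, float, float]:
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    # MRR is this value averaged over a set of test queries
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

p, r, f1 = precision_recall_f1(["a", "b", "c", "d"], {"a", "c", "e"})
# 2 of 4 retrieved were relevant; 2 of 3 relevant docs were retrieved
```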
Common Code Snippets
Generate Embeddings (OpenAI)
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]
Store in ChromaDB
import chromadb

# Use a distinct name so the OpenAI `client` above isn't shadowed
chroma_client = chromadb.PersistentClient(path="./data")
collection = chroma_client.get_or_create_collection("docs")

collection.add(
    ids=["id1", "id2"],
    documents=["text one", "text two"],
    embeddings=embed(["text one", "text two"]),  # one batched call instead of two
    metadatas=[{"source": "file1"}, {"source": "file2"}],
)
Query and Generate
results = collection.query(
    # Embed the query with the same model used at index time; passing
    # query_texts would fall back to Chroma's default embedder and mismatch
    query_embeddings=embed(["user question"]),
    n_results=5,
)
context = "\n---\n".join(results["documents"][0])

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": f"Answer based on context:\n{context}"},
        {"role": "user", "content": "user question"},
    ],
)
print(response.choices[0].message.content)
AI Prompts for Building RAG Systems
Getting Started
- "Help me build a RAG system for [document type]. Walk me through the setup step by step."
- "I have [number] [PDFs/markdown files/web pages]. Design a RAG pipeline using [Python/TypeScript] with [OpenAI/Cohere] embeddings and [ChromaDB/Pinecone]."
- "Compare RAG vs fine-tuning for my use case: [describe what you need]."
Embeddings
- "Help me choose an embedding model for [use case]. I need [speed/quality/privacy]. My documents are [language/domain]."
- "Write a script that generates embeddings for all files in a directory and saves them to [vector DB]."
- "My embedding costs are too high. Help me optimize by [batching/caching/switching models]."
Chunking
- "I have [document type] documents averaging [size]. Recommend the best chunking strategy, chunk size, and overlap."
- "My RAG retrieves irrelevant chunks. Here's an example: [show question and retrieved chunks]. How should I improve my chunking?"
- "Write a custom chunker that splits [my document format] by [sections/headers/pages] and preserves metadata."
Vector Database
- "Help me set up [Pinecone/ChromaDB/Weaviate/pgvector] for my RAG project with [Python/TypeScript]."
- "I have [number] documents. Which vector database should I use considering [cost/scale/features]?"
- "My vector search is slow. Help me optimize indexing and query performance."
Pipeline Building
- "Build a complete RAG pipeline that: loads [documents], chunks them with [strategy], embeds with [model], stores in [DB], and answers questions with [LLM]."
- "Add metadata filtering to my RAG pipeline so users can filter by [category/date/source]."
- "Implement hybrid search (vector + keyword) in my RAG pipeline for better retrieval."
- "Add conversation memory to my RAG chatbot so it remembers previous questions."
Evaluation and Improvement
- "Help me create a ground truth evaluation set for my RAG system with [number] test questions."
- "My RAG system gives wrong answers for [type of question]. Here's my setup: [describe]. How do I debug and fix this?"
- "Set up Ragas evaluation for my RAG pipeline and help me interpret the results."
- "My retrieval precision is low. What techniques can I use to improve it?"
Production
- "Help me deploy my RAG pipeline as a REST API using [FastAPI/Express]."
- "Add streaming responses to my RAG API so users see the answer as it generates."
- "Set up a document ingestion pipeline that automatically indexes new documents when they're added to [S3/GCS/a folder]."
- "Implement rate limiting and caching for my RAG API to control costs."
Debugging
- "My RAG returns 'I don't know' for questions I know are in my documents. Help me debug."
- "The retrieved chunks are relevant but the LLM ignores them. How do I fix my prompt?"
- "My RAG works for short questions but fails for complex ones. What's wrong?"
- "I'm getting different answers for the same question. How do I make my RAG more consistent?"
The Complete RAG Workflow
1. COLLECT your documents (PDFs, markdown, web pages, databases)
2. CHOOSE an embedding model (OpenAI, Cohere, or open source)
3. CHOOSE a vector database (ChromaDB for dev, Pinecone/Weaviate for prod)
4. CHUNK your documents (recursive splitting is the default choice)
5. EMBED and STORE chunks in your vector database
6. BUILD the query pipeline (embed question → retrieve → generate)
7. EVALUATE with ground truth questions
8. ITERATE on chunking, retrieval, and prompts until quality is good
9. DEPLOY as an API or integrate into your application
10. MONITOR and update your document index as data changes
Keep Learning
RAG is evolving fast. Here are advanced topics to explore next:
- Agentic RAG — Let the LLM decide when and what to retrieve
- Graph RAG — Combine knowledge graphs with vector search
- Multi-modal RAG — Retrieve images, tables, and diagrams alongside text
- Re-ranking — Use cross-encoders to re-order retrieved results
- Query decomposition — Break complex questions into sub-queries
- Self-RAG — The model evaluates its own retrieval and generation quality
Remember: The best RAG system is the one that gives your users accurate, grounded answers. Start simple, measure everything, and iterate based on real usage.
Happy building!