Evaluating RAG Quality
You have a working RAG pipeline, but how do you know if it is actually good? Bad RAG can be worse than no RAG — it gives users confident-sounding answers that are wrong. Let's learn how to measure quality and catch problems.
Why Evaluation Matters
Without evaluation, you are flying blind. Your RAG system might:
- Retrieve irrelevant documents but generate plausible-sounding answers
- Retrieve the right documents but ignore them in the answer
- Work great for some topics and fail completely for others
- Degrade silently as you add new documents
RAG evaluation has two dimensions: retrieval quality (did we find the right documents?) and generation quality (did we produce a good answer from those documents?).
Retrieval Quality Metrics
Precision: Are Retrieved Docs Relevant?
Of the K documents retrieved, how many are actually relevant to the question?
Precision = (Number of relevant docs retrieved) / (Total docs retrieved)
Example: Retrieved 5 docs, 3 are relevant
Precision = 3/5 = 0.60 (60%)
High precision means you are not flooding the LLM with irrelevant context.
Recall: Did We Find All Relevant Docs?
Of all the relevant documents in your database, how many did you retrieve?
Recall = (Number of relevant docs retrieved) / (Total relevant docs in DB)
Example: 4 relevant docs exist in DB, retrieved 3 of them
Recall = 3/4 = 0.75 (75%)
High recall means you are not missing important information.
Mean Reciprocal Rank (MRR)
How high does the first relevant document rank in your results? For a single question this is the reciprocal rank; the "mean" in MRR comes from averaging it across all questions in your test set.
Reciprocal rank = 1 / (rank of first relevant document)
If the first relevant doc is rank 1: MRR = 1.0
If the first relevant doc is rank 3: MRR = 0.33
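The worked examples above translate directly into a few lines of Python. The doc IDs here are made up for illustration:

```python
# Worked example: 5 docs retrieved, 3 of them relevant,
# out of 4 relevant docs that exist in the database.
retrieved = ["doc-a", "doc-b", "doc-c", "doc-d", "doc-e"]
relevant = {"doc-a", "doc-c", "doc-e", "doc-f"}

hits = [d for d in retrieved if d in relevant]
precision = len(hits) / len(retrieved)              # 3/5 = 0.60
recall = len(hits) / len(relevant)                  # 3/4 = 0.75
first_relevant_rank = retrieved.index(hits[0]) + 1  # doc-a is rank 1
mrr = 1 / first_relevant_rank                       # 1.0

print(precision, recall, mrr)  # 0.6 0.75 1.0
```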
Evaluating Retrieval in Practice
```python
def evaluate_retrieval(
    question: str,
    retrieved_docs: list[str],
    relevant_docs: list[str],
) -> dict:
    """Calculate retrieval metrics for a single question."""
    retrieved_set = set(retrieved_docs)
    relevant_set = set(relevant_docs)
    relevant_retrieved = retrieved_set & relevant_set

    precision = len(relevant_retrieved) / len(retrieved_set) if retrieved_set else 0
    recall = len(relevant_retrieved) / len(relevant_set) if relevant_set else 0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) > 0 else 0

    # MRR: reciprocal rank of the first relevant document
    mrr = 0
    for i, doc in enumerate(retrieved_docs):
        if doc in relevant_set:
            mrr = 1 / (i + 1)
            break

    return {"precision": precision, "recall": recall, "f1": f1, "mrr": mrr}
```
Generation Quality Metrics
Faithfulness: Is the Answer Grounded in Context?
Does the answer only contain information from the retrieved documents? Or does it hallucinate facts?
```python
import json

from openai import OpenAI


def evaluate_faithfulness(answer: str, context: str) -> dict:
    """Use an LLM to judge if the answer is faithful to the context."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """You are an evaluation judge.
Given a context and an answer, determine if every claim in the answer
is supported by the context. Respond with JSON:
{"faithful": true/false, "unsupported_claims": ["claim1", "claim2"]}"""},
            {"role": "user", "content": f"""Context:
{context}

Answer:
{answer}

Is this answer faithful to the context?"""},
        ],
        response_format={"type": "json_object"},  # force valid JSON output
        temperature=0,
    )
    # Parse the JSON string so the function actually returns a dict
    return json.loads(response.choices[0].message.content)
```
Relevance: Does the Answer Address the Question?
A faithful answer is useless if it does not actually answer what the user asked.
```python
import json

from openai import OpenAI


def evaluate_relevance(question: str, answer: str) -> dict:
    """Use an LLM to judge if the answer is relevant to the question."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """You are an evaluation judge.
Given a question and an answer, rate how well the answer addresses
the question on a scale of 1-5:
1 = Completely irrelevant
2 = Tangentially related but doesn't answer
3 = Partially answers the question
4 = Mostly answers the question
5 = Fully and directly answers the question
Respond with JSON: {"score": N, "reasoning": "..."}"""},
            {"role": "user", "content": f"Question: {question}\n\nAnswer: {answer}"},
        ],
        response_format={"type": "json_object"},  # force valid JSON output
        temperature=0,
    )
    # Parse the JSON string so the function actually returns a dict
    return json.loads(response.choices[0].message.content)
```
Testing with Ground Truth
The gold standard for RAG evaluation is a ground truth dataset — a set of questions with known correct answers and the documents that contain those answers.
```python
# Create a ground truth evaluation set
eval_set = [
    {
        "question": "What is our vacation policy?",
        "expected_answer": "Employees receive 15 days of paid vacation per year.",
        "relevant_doc_ids": ["handbook-section-5", "handbook-section-5a"],
    },
    {
        "question": "How do I submit an expense report?",
        "expected_answer": "Submit expense reports through the Finance portal within 30 days.",
        "relevant_doc_ids": ["handbook-section-12"],
    },
    {
        "question": "What is the parental leave policy?",
        "expected_answer": "12 weeks of paid parental leave for all parents.",
        "relevant_doc_ids": ["handbook-section-8"],
    },
]


# Run evaluation
def run_evaluation(eval_set: list[dict]) -> list[dict]:
    results = []
    for item in eval_set:
        # Run your RAG pipeline
        retrieved_docs = retrieve(item["question"])
        answer = generate_answer(item["question"], retrieved_docs)

        # Evaluate retrieval
        retrieval_metrics = evaluate_retrieval(
            item["question"],
            [doc["id"] for doc in retrieved_docs],
            item["relevant_doc_ids"],
        )

        results.append({
            "question": item["question"],
            "answer": answer,
            "expected": item["expected_answer"],
            **retrieval_metrics,
        })

    # Aggregate
    avg_precision = sum(r["precision"] for r in results) / len(results)
    avg_recall = sum(r["recall"] for r in results) / len(results)
    avg_mrr = sum(r["mrr"] for r in results) / len(results)
    print(f"Average Precision: {avg_precision:.2f}")
    print(f"Average Recall: {avg_recall:.2f}")
    print(f"Average MRR: {avg_mrr:.2f}")

    return results
```
Using RAG Evaluation Frameworks
For production systems, use established frameworks instead of building from scratch:
Ragas (RAG Assessment)
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

eval_data = Dataset.from_dict({
    "question": ["What is the vacation policy?"],
    "answer": ["Employees get 15 days of paid vacation per year."],
    "contexts": [["Section 5: Employees receive 15 days of paid vacation annually..."]],
    "ground_truth": ["Employees receive 15 days of paid vacation per year."],
})

results = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
```
Common Failure Modes and Fixes
| Failure | Symptom | Fix |
|---|---|---|
| Wrong docs retrieved | Answer is off-topic | Improve chunking, tune chunk size, add metadata filters |
| Right docs, wrong answer | Answer ignores or misinterprets context | Improve system prompt, lower temperature, use a better LLM |
| Hallucination | Answer includes facts not in context | Add explicit "only use provided context" instructions |
| Incomplete answer | Misses part of the answer | Increase K (retrieve more docs), improve recall |
| Stale data | Answer uses outdated info | Update your document index, add timestamps |
| Chunk too small | Retrieved text lacks context | Increase chunk size or add surrounding context |
| Chunk too large | Retrieved text is vague | Decrease chunk size, use semantic chunking |
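For the hallucination row, the "only use provided context" fix is usually a system-prompt change. A minimal illustrative version (the wording here is ours, not canonical):

```python
# Illustrative grounding instruction to use as the system message.
GROUNDED_SYSTEM_PROMPT = (
    "Answer using ONLY the provided context. "
    "If the context does not contain the answer, reply: "
    "'I don't know based on the available documents.' "
    "Never use outside knowledge or guess."
)
```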
Debugging Checklist
When your RAG system gives a bad answer:
- Check retrieval first — Print the retrieved documents. Are they relevant?
- Check the prompt — Is the context properly formatted? Is the instruction clear?
- Check chunk quality — Are your chunks cutting information in bad places?
- Try more context — Increase K from 3 to 5 or 10. Does the answer improve?
- Try a better model — Switch from GPT-4o-mini to GPT-4o. Does it help?
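The first checklist step can be a five-line helper. This sketch assumes your retriever can hand back (text, score) pairs; the function name is hypothetical:

```python
def inspect_retrieval(question: str, docs_with_scores: list[tuple[str, float]]) -> None:
    """Print ranked retrieved chunks so you can eyeball relevance."""
    print(f"Q: {question}")
    for rank, (text, score) in enumerate(docs_with_scores, start=1):
        preview = text[:80].replace("\n", " ")  # first 80 chars, one line
        print(f"  {rank}. score={score:.3f}  {preview}")


# Example with stubbed results:
inspect_retrieval(
    "What is our vacation policy?",
    [("Section 5: Employees receive 15 days...", 0.91),
     ("Section 12: Expense reports...", 0.47)],
)
```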
What to ask your AI: "My RAG system is returning [describe the problem]. Here is my chunking strategy: [describe]. Here is my retrieval approach: [describe]. How can I improve the results?"
What's Next?
You now know how to build, optimize, and evaluate a RAG pipeline. The final tutorial is your RAG Essentials Cheat Sheet — a quick reference with comparison tables, decision trees, and ready-to-use AI prompts for building RAG systems.