
    Evaluating RAG Quality

    You have a working RAG pipeline, but how do you know if it is actually good? Bad RAG can be worse than no RAG — it gives users confident-sounding answers that are wrong. Let's learn how to measure quality and catch problems.

    Why Evaluation Matters

    Without evaluation, you are flying blind. Your RAG system might:

    • Retrieve irrelevant documents but generate plausible-sounding answers
    • Retrieve the right documents but ignore them in the answer
    • Work great for some topics and fail completely for others
    • Degrade silently as you add new documents

    RAG evaluation has two dimensions: retrieval quality (did we find the right documents?) and generation quality (did we produce a good answer from those documents?).

    Retrieval Quality Metrics

    Precision: Are Retrieved Docs Relevant?

    Of the K documents retrieved, how many are actually relevant to the question?

    Precision = (Number of relevant docs retrieved) / (Total docs retrieved)
    
    Example: Retrieved 5 docs, 3 are relevant
    Precision = 3/5 = 0.60 (60%)
    

    High precision means you are not flooding the LLM with irrelevant context.

    Recall: Did We Find All Relevant Docs?

    Of all the relevant documents in your database, how many did you retrieve?

    Recall = (Number of relevant docs retrieved) / (Total relevant docs in DB)
    
    Example: 4 relevant docs exist in DB, retrieved 3 of them
    Recall = 3/4 = 0.75 (75%)
    

    High recall means you are not missing important information.

    Mean Reciprocal Rank (MRR)

    How high does the first relevant document rank in your results?

    MRR = 1 / (rank of first relevant document)
    
    If the first relevant doc is rank 1: MRR = 1.0
    If the first relevant doc is rank 3: MRR = 0.33
    

    Evaluating Retrieval in Practice

    def evaluate_retrieval(
        question: str,
        retrieved_docs: list[str],
        relevant_docs: list[str]
    ) -> dict:
        """Calculate retrieval metrics."""
        retrieved_set = set(retrieved_docs)
        relevant_set = set(relevant_docs)
    
        relevant_retrieved = retrieved_set & relevant_set
    
        precision = len(relevant_retrieved) / len(retrieved_set) if retrieved_set else 0
        recall = len(relevant_retrieved) / len(relevant_set) if relevant_set else 0
        f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) > 0 else 0
    
        # MRR
        mrr = 0
        for i, doc in enumerate(retrieved_docs):
            if doc in relevant_set:
                mrr = 1 / (i + 1)
                break
    
        return {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "mrr": mrr
        }
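    To sanity-check the formulas, here is a tiny worked example with made-up document IDs — the same arithmetic evaluate_retrieval performs:

```python
retrieved = ["doc-5", "doc-9", "doc-2"]   # ranked retriever output (hypothetical IDs)
relevant = {"doc-2", "doc-5", "doc-7"}    # ground-truth relevant documents

hits = [d for d in retrieved if d in relevant]
precision = len(hits) / len(retrieved)    # 2/3: two of three retrieved docs are relevant
recall = len(hits) / len(relevant)        # 2/3: two of three relevant docs were found
# MRR: reciprocal rank of the first relevant result (rank 1 here, so 1.0)
mrr = next((1 / (i + 1) for i, d in enumerate(retrieved) if d in relevant), 0.0)

print(f"precision={precision:.2f} recall={recall:.2f} mrr={mrr:.2f}")
```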

    Generation Quality Metrics

    Faithfulness: Is the Answer Grounded in Context?

    Does the answer only contain information from the retrieved documents? Or does it hallucinate facts?

    def evaluate_faithfulness(answer: str, context: str) -> dict:
        """Use an LLM to judge if the answer is faithful to the context."""
        import json
        from openai import OpenAI
        client = OpenAI()
    
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": """You are an evaluation judge.
    Given a context and an answer, determine if every claim in the answer
    is supported by the context. Respond with JSON:
    {"faithful": true/false, "unsupported_claims": ["claim1", "claim2"]}"""},
                {"role": "user", "content": f"""Context:
    {context}
    
    Answer:
    {answer}
    
    Is this answer faithful to the context?"""}
            ],
            temperature=0,
            response_format={"type": "json_object"}  # ask the model for strict JSON
        )
    
        # Parse the judge's verdict so the function actually returns a dict
        return json.loads(response.choices[0].message.content)

    Relevance: Does the Answer Address the Question?

    A faithful answer is useless if it does not actually answer what the user asked.

    def evaluate_relevance(question: str, answer: str) -> dict:
        """Use an LLM to judge if the answer is relevant to the question."""
        import json
        from openai import OpenAI
        client = OpenAI()
    
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": """You are an evaluation judge.
    Given a question and an answer, rate how well the answer addresses
    the question on a scale of 1-5:
    1 = Completely irrelevant
    2 = Tangentially related but doesn't answer
    3 = Partially answers the question
    4 = Mostly answers the question
    5 = Fully and directly answers the question
    Respond with JSON: {"score": N, "reasoning": "..."}"""},
                {"role": "user", "content": f"Question: {question}\n\nAnswer: {answer}"}
            ],
            temperature=0,
            response_format={"type": "json_object"}  # ask the model for strict JSON
        )
    
        # Parse the judge's rating so the function actually returns a dict
        return json.loads(response.choices[0].message.content)

    Testing with Ground Truth

    The gold standard for RAG evaluation is a ground truth dataset — a set of questions with known correct answers and the documents that contain those answers.

    # Create a ground truth evaluation set
    eval_set = [
        {
            "question": "What is our vacation policy?",
            "expected_answer": "Employees receive 15 days of paid vacation per year.",
            "relevant_doc_ids": ["handbook-section-5", "handbook-section-5a"],
        },
        {
            "question": "How do I submit an expense report?",
            "expected_answer": "Submit expense reports through the Finance portal within 30 days.",
            "relevant_doc_ids": ["handbook-section-12"],
        },
        {
            "question": "What is the parental leave policy?",
            "expected_answer": "12 weeks of paid parental leave for all parents.",
            "relevant_doc_ids": ["handbook-section-8"],
        },
    ]
    
    # Run evaluation
    def run_evaluation(eval_set: list[dict]) -> dict:
        results = []
    
        for item in eval_set:
            # Run your RAG pipeline
            retrieved_docs = retrieve(item["question"])
            answer = generate_answer(item["question"], retrieved_docs)
    
            # Evaluate
            retrieval_metrics = evaluate_retrieval(
                item["question"],
                [doc["id"] for doc in retrieved_docs],
                item["relevant_doc_ids"]
            )
    
            results.append({
                "question": item["question"],
                "answer": answer,
                "expected": item["expected_answer"],
                **retrieval_metrics
            })
    
        # Aggregate
        avg_precision = sum(r["precision"] for r in results) / len(results)
        avg_recall = sum(r["recall"] for r in results) / len(results)
        avg_mrr = sum(r["mrr"] for r in results) / len(results)
    
        print(f"Average Precision: {avg_precision:.2f}")
        print(f"Average Recall:    {avg_recall:.2f}")
        print(f"Average MRR:       {avg_mrr:.2f}")
    
        return results

    Using RAG Evaluation Frameworks

    For production systems, use established frameworks instead of building from scratch:

    Ragas (RAG Assessment)

    from ragas import evaluate
    from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
    from datasets import Dataset
    
    eval_data = Dataset.from_dict({
        "question": ["What is the vacation policy?"],
        "answer": ["Employees get 15 days of paid vacation per year."],
        "contexts": [["Section 5: Employees receive 15 days of paid vacation annually..."]],
        "ground_truth": ["Employees receive 15 days of paid vacation per year."]
    })
    
    results = evaluate(
        eval_data,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
    )
    print(results)

    Common Failure Modes and Fixes

    Failure                  | Symptom                                  | Fix
    -------------------------|------------------------------------------|-----------------------------------------------------------
    Wrong docs retrieved     | Answer is off-topic                      | Improve chunking, tune chunk size, add metadata filters
    Right docs, wrong answer | Answer ignores or misinterprets context  | Improve system prompt, lower temperature, use a better LLM
    Hallucination            | Answer includes facts not in context     | Add explicit "only use provided context" instructions
    Incomplete answer        | Misses part of the answer                | Increase K (retrieve more docs), improve recall
    Stale data               | Answer uses outdated info                | Update your document index, add timestamps
    Chunk too small          | Retrieved text lacks context             | Increase chunk size or add surrounding context
    Chunk too large          | Retrieved text is vague                  | Decrease chunk size, use semantic chunking
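    The hallucination fix — explicit grounding instructions — can be as simple as a stricter system prompt. A minimal sketch (the wording here is illustrative, not canonical):

```python
# A hypothetical grounding-focused system prompt; tune the wording for your domain.
GROUNDED_SYSTEM_PROMPT = """You are a helpful assistant.
Answer ONLY using the context provided below.
If the context does not contain the answer, say
"I don't know based on the available documents."
Do not use outside knowledge or guess."""

def build_messages(question: str, context: str) -> list[dict]:
    """Assemble chat messages that pin the model to the retrieved context."""
    return [
        {"role": "system", "content": GROUNDED_SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```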

    Debugging Checklist

    When your RAG system gives a bad answer:

    1. Check retrieval first — Print the retrieved documents. Are they relevant?
    2. Check the prompt — Is the context properly formatted? Is the instruction clear?
    3. Check chunk quality — Are your chunks cutting information in bad places?
    4. Try more context — Increase K from 3 to 5 or 10. Does the answer improve?
    5. Try a better model — Switch from GPT-4o-mini to GPT-4o. Does it help?
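    For step 1, a small helper that dumps the ranked results makes eyeballing relevance quick. This sketch assumes your retriever returns dicts with id, score, and text keys — adjust to whatever your pipeline produces:

```python
def inspect_retrieval(question: str, docs: list[dict], max_chars: int = 120) -> list[str]:
    """Format ranked retrieval results for quick manual inspection."""
    lines = [f"Question: {question}"]
    for rank, doc in enumerate(docs, start=1):
        # Truncate each chunk to one short line so the ranking is easy to scan
        snippet = doc["text"][:max_chars].replace("\n", " ")
        lines.append(f"{rank:>2}. [{doc['id']}] score={doc['score']:.3f}  {snippet}")
    print("\n".join(lines))
    return lines
```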

    What to ask your AI: "My RAG system is returning [describe the problem]. Here is my chunking strategy: [describe]. Here is my retrieval approach: [describe]. How can I improve the results?"

    What's Next?

    You now know how to build, optimize, and evaluate a RAG pipeline. The final tutorial is your RAG Essentials Cheat Sheet — a quick reference with comparison tables, decision trees, and ready-to-use AI prompts for building RAG systems.

