Evaluating RAG Quality
You have a working RAG pipeline, but how do you know if it is actually good? Bad RAG can be worse than no RAG — it gives users confident-sounding answers that are wrong. Let's learn how to measure quality and catch problems.
Why Evaluation Matters
Without evaluation, you are flying blind. Your RAG system might:
- Retrieve irrelevant documents but generate plausible-sounding answers
- Retrieve the right documents but ignore them in the answer
- Work great for some topics and fail completely for others
- Degrade silently as you add new documents
RAG evaluation has two dimensions: retrieval quality (did we find the right documents?) and generation quality (did we produce a good answer from those documents?).
Retrieval Quality Metrics
Precision: Are Retrieved Docs Relevant?
Of the K documents retrieved, how many are actually relevant to the question?
Precision = (Number of relevant docs retrieved) / (Total docs retrieved)
Example: Retrieved 5 docs, 3 are relevant
Precision = 3/5 = 0.60 (60%)
High precision means you are not flooding the LLM with irrelevant context.
Recall: Did We Find All Relevant Docs?
Of all the relevant documents in your database, how many did you retrieve?
Recall = (Number of relevant docs retrieved) / (Total relevant docs in DB)
Example: 4 relevant docs exist in DB, retrieved 3 of them
Recall = 3/4 = 0.75 (75%)
High recall means you are not missing important information.
Mean Reciprocal Rank (MRR)
How high does the first relevant document rank in your results? For a single question this is the reciprocal rank; the "mean" in MRR comes from averaging it across all questions in your test set.
Reciprocal rank = 1 / (rank of first relevant document)
If the first relevant doc is rank 1: MRR = 1.0
If the first relevant doc is rank 3: MRR = 0.33
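The worked examples above translate directly into a few lines of Python. The doc IDs here are made up for illustration:

```python
# Worked example: 5 docs retrieved, 3 of them relevant,
# out of 4 relevant docs that exist in the database.
retrieved = ["doc-a", "doc-b", "doc-c", "doc-d", "doc-e"]
relevant = {"doc-a", "doc-c", "doc-e", "doc-f"}

hits = [d for d in retrieved if d in relevant]
precision = len(hits) / len(retrieved)              # 3/5 = 0.60
recall = len(hits) / len(relevant)                  # 3/4 = 0.75
first_relevant_rank = retrieved.index(hits[0]) + 1  # doc-a is rank 1
mrr = 1 / first_relevant_rank                       # 1.0

print(precision, recall, mrr)  # 0.6 0.75 1.0
```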
Evaluating Retrieval in Practice
```python
def evaluate_retrieval(
    question: str,
    retrieved_docs: list[str],
    relevant_docs: list[str],
) -> dict:
    """Calculate retrieval metrics for a single question."""
    retrieved_set = set(retrieved_docs)
    relevant_set = set(relevant_docs)
    relevant_retrieved = retrieved_set & relevant_set

    precision = len(relevant_retrieved) / len(retrieved_set) if retrieved_set else 0
    recall = len(relevant_retrieved) / len(relevant_set) if relevant_set else 0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) > 0 else 0

    # MRR: reciprocal rank of the first relevant document
    mrr = 0
    for i, doc in enumerate(retrieved_docs):
        if doc in relevant_set:
            mrr = 1 / (i + 1)
            break

    return {"precision": precision, "recall": recall, "f1": f1, "mrr": mrr}
```
Generation Quality Metrics
Faithfulness: Is the Answer Grounded in Context?
Does the answer only contain information from the retrieved documents? Or does it hallucinate facts?
```python
import json

from openai import OpenAI


def evaluate_faithfulness(answer: str, context: str) -> dict:
    """Use an LLM to judge if the answer is faithful to the context."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """You are an evaluation judge.
Given a context and an answer, determine if every claim in the answer
is supported by the context. Respond with JSON:
{"faithful": true/false, "unsupported_claims": ["claim1", "claim2"]}"""},
            {"role": "user", "content": f"""Context:
{context}

Answer:
{answer}

Is this answer faithful to the context?"""},
        ],
        response_format={"type": "json_object"},  # force valid JSON output
        temperature=0,
    )
    # Parse the JSON string so the function actually returns a dict
    return json.loads(response.choices[0].message.content)
```
Relevance: Does the Answer Address the Question?
A faithful answer is useless if it does not actually answer what the user asked.
```python
import json

from openai import OpenAI


def evaluate_relevance(question: str, answer: str) -> dict:
    """Use an LLM to judge if the answer is relevant to the question."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """You are an evaluation judge.
Given a question and an answer, rate how well the answer addresses
the question on a scale of 1-5:
1 = Completely irrelevant
2 = Tangentially related but doesn't answer
3 = Partially answers the question
4 = Mostly answers the question
5 = Fully and directly answers the question
Respond with JSON: {"score": N, "reasoning": "..."}"""},
            {"role": "user", "content": f"Question: {question}\n\nAnswer: {answer}"},
        ],
        response_format={"type": "json_object"},  # force valid JSON output
        temperature=0,
    )
    # Parse the JSON string so the function actually returns a dict
    return json.loads(response.choices[0].message.content)
```
Testing with Ground Truth
The gold standard for RAG evaluation is a ground truth dataset — a set of questions with known correct answers and the documents that contain those answers.
```python
# Create a ground truth evaluation set
eval_set = [
    {
        "question": "What is our vacation policy?",
        "expected_answer": "Employees receive 15 days of paid vacation per year.",
        "relevant_doc_ids": ["handbook-section-5", "handbook-section-5a"],
    },
    {
        "question": "How do I submit an expense report?",
        "expected_answer": "Submit expense reports through the Finance portal within 30 days.",
        "relevant_doc_ids": ["handbook-section-12"],
    },
    {
        "question": "What is the parental leave policy?",
        "expected_answer": "12 weeks of paid parental leave for all parents.",
        "relevant_doc_ids": ["handbook-section-8"],
    },
]


# Run evaluation
def run_evaluation(eval_set: list[dict]) -> list[dict]:
    results = []
    for item in eval_set:
        # Run your RAG pipeline
        retrieved_docs = retrieve(item["question"])
        answer = generate_answer(item["question"], retrieved_docs)

        # Evaluate retrieval
        retrieval_metrics = evaluate_retrieval(
            item["question"],
            [doc["id"] for doc in retrieved_docs],
            item["relevant_doc_ids"],
        )

        results.append({
            "question": item["question"],
            "answer": answer,
            "expected": item["expected_answer"],
            **retrieval_metrics,
        })

    # Aggregate
    avg_precision = sum(r["precision"] for r in results) / len(results)
    avg_recall = sum(r["recall"] for r in results) / len(results)
    avg_mrr = sum(r["mrr"] for r in results) / len(results)
    print(f"Average Precision: {avg_precision:.2f}")
    print(f"Average Recall: {avg_recall:.2f}")
    print(f"Average MRR: {avg_mrr:.2f}")

    return results
```
Using RAG Evaluation Frameworks
For production systems, use established frameworks instead of building from scratch:
Ragas (RAG Assessment)
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

eval_data = Dataset.from_dict({
    "question": ["What is the vacation policy?"],
    "answer": ["Employees get 15 days of paid vacation per year."],
    "contexts": [["Section 5: Employees receive 15 days of paid vacation annually..."]],
    "ground_truth": ["Employees receive 15 days of paid vacation per year."],
})

results = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
```
Common Failure Modes and Fixes
| Failure | Symptom | Fix |
|---|---|---|
| Wrong docs retrieved | Answer is off-topic | Improve chunking, tune chunk size, add metadata filters |
| Right docs, wrong answer | Answer ignores or misinterprets context | Improve system prompt, lower temperature, use a better LLM |
| Hallucination | Answer includes facts not in context | Add explicit "only use provided context" instructions |
| Incomplete answer | Misses part of the answer | Increase K (retrieve more docs), improve recall |
| Stale data | Answer uses outdated info | Update your document index, add timestamps |
| Chunk too small | Retrieved text lacks context | Increase chunk size or add surrounding context |
| Chunk too large | Retrieved text is vague | Decrease chunk size, use semantic chunking |
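For the hallucination row, the "only use provided context" fix is usually a system-prompt change. A minimal illustrative version (the wording here is ours, not canonical):

```python
# Illustrative grounding instruction to use as the system message.
GROUNDED_SYSTEM_PROMPT = (
    "Answer using ONLY the provided context. "
    "If the context does not contain the answer, reply: "
    "'I don't know based on the available documents.' "
    "Never use outside knowledge or guess."
)
```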
Debugging Checklist
When your RAG system gives a bad answer:
- Check retrieval first — Print the retrieved documents. Are they relevant?
- Check the prompt — Is the context properly formatted? Is the instruction clear?
- Check chunk quality — Are your chunks cutting information in bad places?
- Try more context — Increase K from 3 to 5 or 10. Does the answer improve?
- Try a better model — Switch from GPT-4o-mini to GPT-4o. Does it help?
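The first checklist step can be a five-line helper. This sketch assumes your retriever can hand back (text, score) pairs; the function name is hypothetical:

```python
def inspect_retrieval(question: str, docs_with_scores: list[tuple[str, float]]) -> None:
    """Print ranked retrieved chunks so you can eyeball relevance."""
    print(f"Q: {question}")
    for rank, (text, score) in enumerate(docs_with_scores, start=1):
        preview = text[:80].replace("\n", " ")  # first 80 chars, one line
        print(f"  {rank}. score={score:.3f}  {preview}")


# Example with stubbed results:
inspect_retrieval(
    "What is our vacation policy?",
    [("Section 5: Employees receive 15 days...", 0.91),
     ("Section 12: Expense reports...", 0.47)],
)
```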
What to ask your AI: "My RAG system is returning [describe the problem]. Here is my chunking strategy: [describe]. Here is my retrieval approach: [describe]. How can I improve the results?"
What's Next?
You now know how to build, optimize, and evaluate a RAG pipeline. The final tutorial is your RAG Essentials Cheat Sheet — a quick reference with comparison tables, decision trees, and ready-to-use AI prompts for building RAG systems.