
    Debugging Bad AI Outputs

    Even with great prompts, AI will sometimes give you bad output. The response might be vague, wrong, hallucinated, too verbose, or formatted incorrectly. Knowing how to diagnose and fix these failures is a critical prompt engineering skill. This chapter gives you a systematic approach to debugging AI outputs.

    The Five Failure Modes

    Most bad AI outputs fall into one of five categories:

    Failure Mode | What It Looks Like                             | Root Cause
    Vague        | Generic, unhelpful response                    | Prompt lacks specificity
    Wrong        | Confidently incorrect answer                   | Missing context or outdated training data
    Hallucinated | Made-up facts, fake citations                  | AI fills gaps with plausible-sounding fiction
    Too verbose  | 1000 words when you needed 100                 | No length or format constraints
    Wrong format | Prose when you wanted JSON, or missing sections | No format instructions

    Let's fix each one.

    Fixing Vague Responses

    Symptom: The AI gives generic advice that could apply to anything.

    Example of vague output:

    User: How do I improve my app's performance?
    
    AI: There are many ways to improve performance. You should optimize
    your code, reduce bundle size, use caching, and minimize network
    requests. Consider profiling your app to find bottlenecks.
    

    This is technically correct but practically useless. It's a list of platitudes.

    Fix: Add specificity to your prompt.

    My Next.js 14 app (App Router, TypeScript) takes 4.2 seconds to load
    on mobile. The largest contentful paint (LCP) is 3.8 seconds. The page
    loads a list of 50 products with images from a Firestore database.
    
    What are the top 3 specific changes I should make to get LCP under
    2.5 seconds? For each change, show the exact code I need to write.
    

    Strategies for eliminating vagueness:

    • Include specific numbers (load time, data size, user count)
    • Name your exact tech stack and versions
    • Describe the specific problem, not the general category
    • Ask for a specific number of recommendations
    • Request code, not advice

    Fixing Wrong Answers

    Symptom: The AI gives a confident, specific answer that is incorrect.

    This happens because:

    1. The AI's training data is outdated
    2. You didn't provide enough context for the AI to reason correctly
    3. The AI pattern-matched to a similar but different situation

    Fix 1: Provide the correct context.

    // Bad — AI might use outdated Next.js syntax
    How do I create an API route in Next.js?
    
    // Good — specifies the version and pattern
    How do I create an API route in Next.js 14 using the App Router?
    I need a POST endpoint at /api/users that accepts JSON body
    and writes to Firestore. Show the route.ts file.
    

    Fix 2: Ask the AI to verify its own answer.

    Write a Firestore query that gets documents created in the last 7 days.
    
    After writing the query, double-check:
    1. Is this syntax correct for the Firebase SDK version 10+?
    2. Does the where() clause use the correct operator for date comparison?
    3. Does this require a composite index?
    

    Fix 3: Ask for sources or reasoning.

    What's the maximum document size in Firestore?
    
    Show me where this is documented. If you're not certain, say so.
    

    Fixing Hallucinations

    Symptom: The AI invents facts, cites papers that don't exist, references APIs that were never real, or generates plausible-looking but fictional data.

    Hallucinations are the most dangerous failure because they look correct. The AI isn't lying — it's generating the most statistically likely continuation of text, which sometimes means fabricating details.

    Red flags for hallucination:

    • Specific citations with author names and years (often fake)
    • API methods or configuration options you can't find in docs
    • Statistics with precise numbers but no verifiable source
    • Library or package names that don't exist on npm

    Fix 1: Constrain the AI to what it knows.

    List the Firebase Firestore query operators.
    Only list operators that you are confident exist.
    If you are uncertain about any operator, mark it with [VERIFY].
    

    Fix 2: Provide reference material.

    Here is the current Firebase Firestore documentation for query operators:
    [paste relevant docs]
    
    Based on this documentation, write a query that filters
    products by category and price range.
    

    Fix 3: Ask the AI to distinguish facts from assumptions.

    Explain the rate limits for the OpenAI API.
    
    Separate your response into two sections:
    1. "Confirmed" — Information you are highly confident about
    2. "May have changed" — Information that might be outdated
    

    Fix 4: Validate with code.

    When the AI generates code with specific API calls, always verify:

    # Check if a package actually exists
    npm info [package-name]
    
    # Check if a method exists in the types
    npx tsc --noEmit  # TypeScript will catch non-existent methods
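
    The package check can also be automated. The sketch below queries the public npm registry, where a 404 response means the package does not exist. The fetch function is injected so the check is easy to test offline; packageExists is an illustrative helper name, not a real library API:

    ```typescript
    // Returns true if the package name resolves on the public npm registry.
    // fetchFn is injected so the check can be tested without network access;
    // in real use, pass the global fetch.
    async function packageExists(
      name: string,
      fetchFn: (url: string) => Promise<{ status: number }>,
    ): Promise<boolean> {
      const res = await fetchFn(
        `https://registry.npmjs.org/${encodeURIComponent(name)}`,
      );
      return res.status === 200; // 404 → the package was likely hallucinated
    }
    ```

    Run this over every import the AI generated before you `npm install` anything.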

    Fixing Verbose Responses

    Symptom: You asked a simple question and got a 1500-word essay.

    Fix: Add explicit length and format constraints.

    // Instead of:
    Explain the difference between let and const in JavaScript.
    
    // Use:
    Explain let vs const in JavaScript in exactly 3 bullet points.
    Each bullet point should be one sentence.
    

    Other length control techniques:

    // Maximum word count
    Answer in under 50 words.
    
    // Format that naturally limits length
    Reply with only the code. No explanations.
    
    // Table format forces brevity
    Compare let, const, and var in a table with columns:
    Keyword | Scope | Reassignable | Hoisted
    
    // One-line answer
    Give me the one-line answer, then optionally a brief explanation.
    

    Fixing Wrong Format

    Symptom: You wanted JSON but got prose. You wanted a table but got paragraphs. You wanted code but got an explanation.

    Fix: Be extremely explicit about format.

    Return your response as valid JSON matching this exact schema:
    {
      "summary": "string — one sentence summary",
      "issues": ["string — each issue found"],
      "score": "number — 1 to 10",
      "recommendation": "string — what to do next"
    }
    
    Do not include any text before or after the JSON.
    Do not wrap it in markdown code blocks.
    Return only the raw JSON object.
    

    For API calls, use structured output features when available:

    const response = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [...], // note: json_object mode requires that the prompt itself mention JSON
      response_format: { type: "json_object" }, // Forces valid JSON output
    });
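
    Even with structured output, validate the response before trusting it. Here is a minimal hand-rolled check with no external libraries; the field names match the schema in the prompt above, but `validateReview` is a hypothetical helper name (a library like zod does the same job with less code):

    ```typescript
    // Shape described by the prompt's schema.
    interface ReviewResult {
      summary: string;
      issues: string[];
      score: number;
      recommendation: string;
    }

    // Returns the parsed object if it matches the schema, or null otherwise.
    function validateReview(raw: string): ReviewResult | null {
      let parsed: unknown;
      try {
        parsed = JSON.parse(raw);
      } catch {
        return null; // model returned prose, or wrapped the JSON in a code block
      }
      if (typeof parsed !== "object" || parsed === null) return null;
      const r = parsed as Record<string, unknown>;
      if (typeof r.summary !== "string") return null;
      if (!Array.isArray(r.issues) || !r.issues.every((i) => typeof i === "string")) {
        return null;
      }
      if (typeof r.score !== "number" || r.score < 1 || r.score > 10) return null;
      if (typeof r.recommendation !== "string") return null;
      return r as unknown as ReviewResult;
    }
    ```

    If validation fails, retry the request or fall back to a default rather than passing malformed data downstream.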

    Iterative Refinement Strategy

    When a prompt isn't working, follow this systematic refinement process:

    Step 1: Identify the Failure Mode

    Ask: Is the response vague? Wrong? Hallucinated? Too long? Wrong format?

    Step 2: Apply the Targeted Fix

    Use the specific fix for that failure mode (see sections above).

    Step 3: Test with Multiple Inputs

    Don't test with just one input. Try 3-5 different inputs to make sure the fix works consistently.
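
    A tiny harness makes this repeatable. The sketch below assumes a `runPrompt` function that wraps your model call (a hypothetical name; wire it to whatever SDK you use), and checks each output with a cheap predicate:

    ```typescript
    // A check is any cheap heuristic on the output:
    // "is it valid JSON?", "is it under 50 words?", "does it mention the stack I named?"
    type Check = (output: string) => boolean;

    interface PromptCase {
      input: string;
      check: Check;
    }

    // Runs the same prompt over several inputs and reports which ones failed.
    async function testPrompt(
      runPrompt: (input: string) => Promise<string>,
      cases: PromptCase[],
    ): Promise<{ passed: number; failed: string[] }> {
      const failed: string[] = [];
      for (const c of cases) {
        const output = await runPrompt(c.input);
        if (!c.check(output)) failed.push(c.input);
      }
      return { passed: cases.length - failed.length, failed };
    }
    ```

    Run it on 3-5 representative inputs after every prompt revision; a fix that only works on one input is not a fix.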

    Step 4: Add Guardrails

    After generating your response, verify:
    1. All code examples are syntactically valid
    2. No external libraries are used unless specified
    3. The response follows the format defined above
    4. All claims are based on well-established facts
    
    If any check fails, fix the issue before responding.
    

    Step 5: Version Your Prompts

    Keep track of what you tried:

    // prompts/codeReview.ts
    // v1 — Too vague, gave generic advice
    // v2 — Added specific format, better but missed edge cases
    // v3 — Added examples and guardrails, works consistently
    
    export const CODE_REVIEW_PROMPT_V3 = `...`;

    Adding Constraints and Guardrails

    Guardrails are instructions that prevent common problems:

    Guardrails:
    - If you don't know something, say "I'm not sure" instead of guessing
    - Do not make up package names, API methods, or citations
    - If the question is ambiguous, state your interpretation before answering
    - Verify that all code examples would compile/run without errors
    - Do not include deprecated methods or syntax
    - If the task cannot be completed as specified, explain why and suggest an alternative
    

    When to Switch Models

    Sometimes the problem isn't your prompt — it's the model.

    Task Type                    | Recommended Model
    Simple classification        | GPT-4o-mini, Claude Haiku (faster, cheaper)
    Code generation              | GPT-4o, Claude Sonnet (good balance)
    Complex reasoning            | GPT-4o, Claude Opus (strongest reasoning)
    Large context (100K+ tokens) | Claude models (larger context window)
    Structured data extraction   | GPT-4o with JSON mode
    Creative writing             | Claude Sonnet or Opus (nuanced language)

    Signs you need a different model:

    • Simple prompts produce wrong answers consistently
    • The task requires reasoning the model can't handle
    • You need a larger context window than the model supports
    • Cost is too high for the volume of requests

    Debugging Checklist

    When AI output is bad, work through this list:

    1. [ ] Is my task specific enough? (not vague)
    2. [ ] Did I provide the right context? (tech stack, constraints, examples)
    3. [ ] Did I specify the output format? (JSON, table, code-only)
    4. [ ] Did I set length constraints? (word count, number of items)
    5. [ ] Did I add guardrails? (what to avoid, how to handle uncertainty)
    6. [ ] Did I use the right technique? (zero-shot, few-shot, chain-of-thought)
    7. [ ] Am I using the right model for this task?
    8. [ ] Did I test with multiple inputs?
    

    Try it now: Find an AI output from the past week that was not good enough. Identify which failure mode it was (vague, wrong, hallucinated, verbose, or wrong format). Apply the targeted fix from this chapter and compare the results.
