
    Debugging Bad AI Outputs

    Even with great prompts, AI will sometimes give you bad output. The response might be vague, wrong, hallucinated, too verbose, or formatted incorrectly. Knowing how to diagnose and fix these failures is a critical prompt engineering skill. This chapter gives you a systematic approach to debugging AI outputs.

    The Five Failure Modes

    Most bad AI outputs fall into one of five categories:

    Failure Mode | What It Looks Like                             | Root Cause
    Vague        | Generic, unhelpful response                    | Prompt lacks specificity
    Wrong        | Confidently incorrect answer                   | Missing context or outdated training data
    Hallucinated | Made-up facts, fake citations                  | AI fills gaps with plausible-sounding fiction
    Too verbose  | 1000 words when you needed 100                 | No length or format constraints
    Wrong format | Prose when you wanted JSON, or missing sections | No format instructions

    Let's fix each one.

    Fixing Vague Responses

    Symptom: The AI gives generic advice that could apply to anything.

    Example of vague output:

    User: How do I improve my app's performance?
    
    AI: There are many ways to improve performance. You should optimize
    your code, reduce bundle size, use caching, and minimize network
    requests. Consider profiling your app to find bottlenecks.
    

    This is technically correct but practically useless. It's a list of platitudes.

    Fix: Add specificity to your prompt.

    My Next.js 14 app (App Router, TypeScript) takes 4.2 seconds to load
    on mobile. The largest contentful paint (LCP) is 3.8 seconds. The page
    loads a list of 50 products with images from a Firestore database.
    
    What are the top 3 specific changes I should make to get LCP under
    2.5 seconds? For each change, show the exact code I need to write.
    

    Strategies for eliminating vagueness:

    • Include specific numbers (load time, data size, user count)
    • Name your exact tech stack and versions
    • Describe the specific problem, not the general category
    • Ask for a specific number of recommendations
    • Request code, not advice

    Fixing Wrong Answers

    Symptom: The AI gives a confident, specific answer that is incorrect.

    This happens because:

    1. The AI's training data is outdated
    2. You didn't provide enough context for the AI to reason correctly
    3. The AI pattern-matched to a similar but different situation

    Fix 1: Provide the correct context.

    // Bad — AI might use outdated Next.js syntax
    How do I create an API route in Next.js?
    
    // Good — specifies the version and pattern
    How do I create an API route in Next.js 14 using the App Router?
    I need a POST endpoint at /api/users that accepts JSON body
    and writes to Firestore. Show the route.ts file.
    

    Fix 2: Ask the AI to verify its own answer.

    Write a Firestore query that gets documents created in the last 7 days.
    
    After writing the query, double-check:
    1. Is this syntax correct for the Firebase SDK version 10+?
    2. Does the where() clause use the correct operator for date comparison?
    3. Does this require a composite index?
    

    Fix 3: Ask for sources or reasoning.

    What's the maximum document size in Firestore?
    
    Show me where this is documented. If you're not certain, say so.
    

    Fixing Hallucinations

    Symptom: The AI invents facts, cites papers that don't exist, references APIs that were never real, or generates plausible-looking but fictional data.

    Hallucinations are the most dangerous failure because they look correct. The AI isn't lying — it's generating the most statistically likely continuation of text, which sometimes means fabricating details.

    Red flags for hallucination:

    • Specific citations with author names and years (often fake)
    • API methods or configuration options you can't find in docs
    • Statistics with precise numbers but no verifiable source
    • Library or package names that don't exist on npm

    Fix 1: Constrain the AI to what it knows.

    List the Firebase Firestore query operators.
    Only list operators that you are confident exist.
    If you are uncertain about any operator, mark it with [VERIFY].
    

    Fix 2: Provide reference material.

    Here is the current Firebase Firestore documentation for query operators:
    [paste relevant docs]
    
    Based on this documentation, write a query that filters
    products by category and price range.
    

    Fix 3: Ask the AI to distinguish facts from assumptions.

    Explain the rate limits for the OpenAI API.
    
    Separate your response into two sections:
    1. "Confirmed" — Information you are highly confident about
    2. "May have changed" — Information that might be outdated
    

    Fix 4: Validate with code.

    When the AI generates code with specific API calls, always verify:

    # Check if a package actually exists
    npm info [package-name]
    
    # Check if a method exists in the types
    npx tsc --noEmit  # TypeScript will catch non-existent methods
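
    The package check can also be automated. The sketch below queries the public npm registry, where a 404 response means the package does not exist. The fetch function is injected so the check is easy to test offline; packageExists is an illustrative helper name, not a real library API:

    ```typescript
    // Returns true if the package name resolves on the public npm registry.
    // fetchFn is injected so the check can be tested without network access;
    // in real use, pass the global fetch.
    async function packageExists(
      name: string,
      fetchFn: (url: string) => Promise<{ status: number }>,
    ): Promise<boolean> {
      const res = await fetchFn(
        `https://registry.npmjs.org/${encodeURIComponent(name)}`,
      );
      return res.status === 200; // 404 → the package was likely hallucinated
    }
    ```

    Run this over every import the AI generated before you `npm install` anything.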

    Fixing Verbose Responses

    Symptom: You asked a simple question and got a 1500-word essay.

    Fix: Add explicit length and format constraints.

    // Instead of:
    Explain the difference between let and const in JavaScript.
    
    // Use:
    Explain let vs const in JavaScript in exactly 3 bullet points.
    Each bullet point should be one sentence.
    

    Other length control techniques:

    // Maximum word count
    Answer in under 50 words.
    
    // Format that naturally limits length
    Reply with only the code. No explanations.
    
    // Table format forces brevity
    Compare let, const, and var in a table with columns:
    Keyword | Scope | Reassignable | Hoisted
    
    // One-line answer
    Give me the one-line answer, then optionally a brief explanation.
    

    Fixing Wrong Format

    Symptom: You wanted JSON but got prose. You wanted a table but got paragraphs. You wanted code but got an explanation.

    Fix: Be extremely explicit about format.

    Return your response as valid JSON matching this exact schema:
    {
      "summary": "string — one sentence summary",
      "issues": ["string — each issue found"],
      "score": "number — 1 to 10",
      "recommendation": "string — what to do next"
    }
    
    Do not include any text before or after the JSON.
    Do not wrap it in markdown code blocks.
    Return only the raw JSON object.
    

    For API calls, use structured output features when available:

    const response = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [...], // note: json_object mode requires that the prompt itself mention JSON
      response_format: { type: "json_object" }, // Forces valid JSON output
    });
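
    Even with structured output, validate the response before trusting it. Here is a minimal hand-rolled check with no external libraries; the field names match the schema in the prompt above, but `validateReview` is a hypothetical helper name (a library like zod does the same job with less code):

    ```typescript
    // Shape described by the prompt's schema.
    interface ReviewResult {
      summary: string;
      issues: string[];
      score: number;
      recommendation: string;
    }

    // Returns the parsed object if it matches the schema, or null otherwise.
    function validateReview(raw: string): ReviewResult | null {
      let parsed: unknown;
      try {
        parsed = JSON.parse(raw);
      } catch {
        return null; // model returned prose, or wrapped the JSON in a code block
      }
      if (typeof parsed !== "object" || parsed === null) return null;
      const r = parsed as Record<string, unknown>;
      if (typeof r.summary !== "string") return null;
      if (!Array.isArray(r.issues) || !r.issues.every((i) => typeof i === "string")) {
        return null;
      }
      if (typeof r.score !== "number" || r.score < 1 || r.score > 10) return null;
      if (typeof r.recommendation !== "string") return null;
      return r as unknown as ReviewResult;
    }
    ```

    If validation fails, retry the request or fall back to a default rather than passing malformed data downstream.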

    Iterative Refinement Strategy

    When a prompt isn't working, follow this systematic refinement process:

    Step 1: Identify the Failure Mode

    Ask: Is the response vague? Wrong? Hallucinated? Too long? Wrong format?

    Step 2: Apply the Targeted Fix

    Use the specific fix for that failure mode (see sections above).

    Step 3: Test with Multiple Inputs

    Don't test with just one input. Try 3-5 different inputs to make sure the fix works consistently.
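
    A tiny harness makes this repeatable. The sketch below assumes a `runPrompt` function that wraps your model call (a hypothetical name; wire it to whatever SDK you use), and checks each output with a cheap predicate:

    ```typescript
    // A check is any cheap heuristic on the output:
    // "is it valid JSON?", "is it under 50 words?", "does it mention the stack I named?"
    type Check = (output: string) => boolean;

    interface PromptCase {
      input: string;
      check: Check;
    }

    // Runs the same prompt over several inputs and reports which ones failed.
    async function testPrompt(
      runPrompt: (input: string) => Promise<string>,
      cases: PromptCase[],
    ): Promise<{ passed: number; failed: string[] }> {
      const failed: string[] = [];
      for (const c of cases) {
        const output = await runPrompt(c.input);
        if (!c.check(output)) failed.push(c.input);
      }
      return { passed: cases.length - failed.length, failed };
    }
    ```

    Run it on 3-5 representative inputs after every prompt revision; a fix that only works on one input is not a fix.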

    Step 4: Add Guardrails

    After generating your response, verify:
    1. All code examples are syntactically valid
    2. No external libraries are used unless specified
    3. The response follows the format defined above
    4. All claims are based on well-established facts
    
    If any check fails, fix the issue before responding.
    

    Step 5: Version Your Prompts

    Keep track of what you tried:

    // prompts/codeReview.ts
    // v1 — Too vague, gave generic advice
    // v2 — Added specific format, better but missed edge cases
    // v3 — Added examples and guardrails, works consistently
    
    export const CODE_REVIEW_PROMPT_V3 = `...`;

    Adding Constraints and Guardrails

    Guardrails are instructions that prevent common problems:

    Guardrails:
    - If you don't know something, say "I'm not sure" instead of guessing
    - Do not make up package names, API methods, or citations
    - If the question is ambiguous, state your interpretation before answering
    - Verify that all code examples would compile/run without errors
    - Do not include deprecated methods or syntax
    - If the task cannot be completed as specified, explain why and suggest an alternative
    

    When to Switch Models

    Sometimes the problem isn't your prompt — it's the model.

    Task Type                    | Recommended Model
    Simple classification        | GPT-4o-mini, Claude Haiku (faster, cheaper)
    Code generation              | GPT-4o, Claude Sonnet (good balance)
    Complex reasoning            | GPT-4o, Claude Opus (strongest reasoning)
    Large context (100K+ tokens) | Claude models (larger context window)
    Structured data extraction   | GPT-4o with JSON mode
    Creative writing             | Claude Sonnet or Opus (nuanced language)

    Signs you need a different model:

    • Simple prompts produce wrong answers consistently
    • The task requires reasoning the model can't handle
    • You need a larger context window than the model supports
    • Cost is too high for the volume of requests

    Debugging Checklist

    When AI output is bad, work through this list:

    1. [ ] Is my task specific enough? (not vague)
    2. [ ] Did I provide the right context? (tech stack, constraints, examples)
    3. [ ] Did I specify the output format? (JSON, table, code-only)
    4. [ ] Did I set length constraints? (word count, number of items)
    5. [ ] Did I add guardrails? (what to avoid, how to handle uncertainty)
    6. [ ] Did I use the right technique? (zero-shot, few-shot, chain-of-thought)
    7. [ ] Am I using the right model for this task?
    8. [ ] Did I test with multiple inputs?
    

    Try it now: Find an AI output from the past week that was not good enough. Identify which failure mode it was (vague, wrong, hallucinated, verbose, or wrong format). Apply the targeted fix from this chapter and compare the results.
