Books/Deploying AI Apps/Cost Management and Scaling

    Cost Management and Scaling

    AI APIs charge per token, and tokens add up fast. A single chat conversation might cost a few cents, but multiply that by thousands of users and you have a real bill. This tutorial teaches you how to keep costs under control while scaling your app.

    AI API Cost Breakdown

    Understanding pricing is the first step to controlling costs. Here's how major AI providers charge:

    Token-Based Pricing

    AI APIs charge per token, where one token is roughly 0.75 words (so 1,000 tokens is about 750 words). There are two types:

    • Input tokens — Your prompt (what you send to the AI)
    • Output tokens — The AI's response (what comes back)

    Output tokens are typically more expensive because they require more computation.
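    Those two rates can be combined into a quick back-of-the-envelope estimator. Here is a sketch: the 0.75 words-per-token rule above is only an approximation, and the prices you pass in should come from your provider's current pricing page.

```typescript
// Rough token count from the "1 token ≈ 0.75 words" rule of thumb
function estimateTokens(text: string): number {
  const words = text.trim().split(/\s+/).length;
  return Math.ceil(words / 0.75);
}

// Cost of one request, given per-million-token prices
// (e.g. 0.15 input / 0.60 output for GPT-4o Mini)
function estimateCost(
  inputTokens: number,
  outputTokens: number,
  inputPricePer1M: number,
  outputPricePer1M: number
): number {
  return (
    (inputTokens / 1_000_000) * inputPricePer1M +
    (outputTokens / 1_000_000) * outputPricePer1M
  );
}
```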

    Price Comparison (as of 2024-2025)

    Model             | Input (per 1M tokens) | Output (per 1M tokens) | Best For
    GPT-4o            | $2.50                 | $10.00                 | Complex reasoning, coding
    GPT-4o Mini       | $0.15                 | $0.60                  | Most tasks, cost-effective
    Claude 3.5 Sonnet | $3.00                 | $15.00                 | Long content, analysis
    Claude 3.5 Haiku  | $0.80                 | $4.00                  | Fast, affordable tasks
    Gemini 1.5 Pro    | $1.25                 | $5.00                  | Multimodal, long context
    Gemini 1.5 Flash  | $0.075                | $0.30                  | Cheapest option

    Real-World Cost Examples

    Use Case         | Model         | Tokens/Request | Cost/Request | 1,000 Users/Day
    Simple chatbot   | GPT-4o Mini   | ~500           | $0.0004      | $0.40/day
    Code generation  | GPT-4o        | ~2,000         | $0.025       | $25/day
    Document summary | Claude Sonnet | ~5,000         | $0.09        | $90/day
    AI search        | GPT-4o Mini   | ~300           | $0.0002      | $0.20/day
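    To turn a per-request figure from this table into a monthly bill, just multiply it out. A tiny helper (a 30-day month is assumed):

```typescript
// Monthly spend from a per-request cost, e.g. a row of the table above
function monthlyCost(
  costPerRequest: number,
  requestsPerUserPerDay: number,
  users: number
): number {
  return costPerRequest * requestsPerUserPerDay * users * 30;
}
```

    For example, the simple-chatbot row ($0.0004/request, 1,000 users each making one request a day) works out to about $12/month.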

    What to ask your AI: "Estimate the monthly cost of my AI app. It makes about [X] API calls per user per day, with [Y] users, using [model name]."

    Caching Strategies to Reduce Costs

    The most effective way to cut AI costs is to avoid making the same API call twice. Caching stores responses so identical or similar requests get served from cache instead of calling the AI API.

    Simple In-Memory Cache

    // src/lib/aiCache.ts
    // Note: this Map grows without bound. That's fine for a demo, but add
    // eviction (a max size or LRU policy) before relying on it in production.
    const cache = new Map<string, { response: string; timestamp: number }>();
    const CACHE_TTL = 1000 * 60 * 60; // 1 hour
    
    function getCacheKey(prompt: string, model: string): string {
      // Create a consistent key from the prompt and model
      return `${model}:${prompt.trim().toLowerCase()}`;
    }
    
    export async function cachedAICall(
      prompt: string,
      model: string,
      fetchFn: () => Promise<string>
    ): Promise<string> {
      const key = getCacheKey(prompt, model);
    
      // Check cache
      const cached = cache.get(key);
      if (cached && Date.now() - cached.timestamp < CACHE_TTL) {
        console.log("Cache hit — saved an API call!");
        return cached.response;
      }
    
      // Cache miss — make the API call
      const response = await fetchFn();
    
      // Store in cache
      cache.set(key, { response, timestamp: Date.now() });
    
      return response;
    }

    Database Cache (Persistent)

    For production, use your database as a cache so responses survive server restarts:

    // src/services/aiCacheService.ts
    import { db } from "@/lib/firebase";
    import { Timestamp } from "firebase-admin/firestore";
    import crypto from "crypto";
    
    const CACHE_COLLECTION = "ai_cache";
    const CACHE_TTL_HOURS = 24;
    
    function hashPrompt(prompt: string): string {
      return crypto.createHash("sha256").update(prompt).digest("hex");
    }
    
    export async function getCachedResponse(
      prompt: string,
      model: string
    ): Promise<string | null> {
      const hash = hashPrompt(`${model}:${prompt}`);
      const doc = await db.collection(CACHE_COLLECTION).doc(hash).get();
    
      if (!doc.exists) return null;
    
      const data = doc.data();
      // Treat a missing createdAt timestamp as expired instead of throwing
      const age = Date.now() - (data?.createdAt?.toMillis() ?? 0);
      if (age > CACHE_TTL_HOURS * 60 * 60 * 1000) return null;
    
      return data?.response ?? null;
    }
    
    export async function setCachedResponse(
      prompt: string,
      model: string,
      response: string
    ): Promise<void> {
      const hash = hashPrompt(`${model}:${prompt}`);
      await db.collection(CACHE_COLLECTION).doc(hash).set({
        prompt: prompt.substring(0, 200), // Store truncated for debugging
        model,
        response,
        createdAt: Timestamp.now(),
      });
    }

    Semantic Caching

    Instead of matching prompts exactly, cache based on meaning. If someone asks "What is JavaScript?" and later "Explain JavaScript to me," the prompts are similar enough that one cached response can serve both:

    // This is an advanced technique — use with a vector database
    // or a similarity search library
    // The idea: embed the prompt, find similar cached prompts,
    // return the cached response if similarity is above a threshold
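    To make the idea concrete, here is a minimal in-memory sketch. It assumes you already have prompt embeddings (produced by an embeddings API or library, not shown), and the 0.9 threshold is a made-up starting point you would tune on real traffic:

```typescript
type Embedding = number[];

// Standard cosine similarity between two equal-length vectors
function cosineSimilarity(a: Embedding, b: Embedding): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const SIMILARITY_THRESHOLD = 0.9; // Hypothetical value; tune for your app

const semanticCache: { embedding: Embedding; response: string }[] = [];

// Linear scan over cached embeddings; a vector database replaces this at scale
function findSimilar(promptEmbedding: Embedding): string | null {
  for (const entry of semanticCache) {
    if (cosineSimilarity(entry.embedding, promptEmbedding) >= SIMILARITY_THRESHOLD) {
      return entry.response;
    }
  }
  return null;
}

function storeSemantic(embedding: Embedding, response: string): void {
  semanticCache.push({ embedding, response });
}
```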

    What to ask your AI: "Implement a caching layer for my AI API calls. Use [Firestore/Redis/in-memory] and cache responses for [duration]."

    Rate Limiting Your Users

    Without rate limits, a single user could make hundreds of AI calls and blow your budget. Rate limiting controls how many requests each user can make.

    Simple Rate Limiter

    // src/lib/rateLimiter.ts
    const userRequestCounts = new Map<string, { count: number; resetTime: number }>();
    
    const RATE_LIMIT = 20; // requests per window
    const WINDOW_MS = 60 * 60 * 1000; // 1 hour
    
    export function checkRateLimit(userId: string): {
      allowed: boolean;
      remaining: number;
      resetIn: number;
    } {
      const now = Date.now();
      const userLimit = userRequestCounts.get(userId);
    
      // Reset if window has passed
      if (!userLimit || now > userLimit.resetTime) {
        userRequestCounts.set(userId, {
          count: 1,
          resetTime: now + WINDOW_MS,
        });
        return { allowed: true, remaining: RATE_LIMIT - 1, resetIn: WINDOW_MS };
      }
    
      // Check limit
      if (userLimit.count >= RATE_LIMIT) {
        return {
          allowed: false,
          remaining: 0,
          resetIn: userLimit.resetTime - now,
        };
      }
    
      // Increment
      userLimit.count++;
      return {
        allowed: true,
        remaining: RATE_LIMIT - userLimit.count,
        resetIn: userLimit.resetTime - now,
      };
    }

    Using the Rate Limiter in an API Route

    // app/api/chat/route.ts
    import { checkRateLimit } from "@/lib/rateLimiter";
    
    export async function POST(req: Request) {
      const { prompt } = await req.json();
      const userId = getUserIdFromRequest(req); // Your auth logic
    
      const rateCheck = checkRateLimit(userId);
    
      if (!rateCheck.allowed) {
        return Response.json(
          {
            error: "Rate limit exceeded",
            retryAfter: Math.ceil(rateCheck.resetIn / 1000),
          },
          { status: 429 }
        );
      }
    
      // Process the AI request...
      const response = await generateResponse(prompt);
    
      return Response.json({
        response,
        rateLimitRemaining: rateCheck.remaining,
      });
    }

    Tiered Rate Limits

    Different users get different limits:

    const RATE_LIMITS: Record<string, number> = {
      free: 10,       // 10 requests per hour
      pro: 100,       // 100 requests per hour
      enterprise: 1000, // 1000 requests per hour
    };
    
    function getUserRateLimit(userTier: string): number {
      return RATE_LIMITS[userTier] ?? RATE_LIMITS.free;
    }
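    Combining that tier lookup with the windowed counter from the rate limiter above gives a tier-aware check. A sketch, with the same in-memory caveats as before:

```typescript
const TIER_LIMITS: Record<string, number> = {
  free: 10,
  pro: 100,
  enterprise: 1000,
};
const WINDOW_MS = 60 * 60 * 1000; // 1 hour

const counters = new Map<string, { count: number; resetTime: number }>();

function checkTieredLimit(userId: string, tier: string): boolean {
  const limit = TIER_LIMITS[tier] ?? TIER_LIMITS.free;
  const now = Date.now();
  const entry = counters.get(userId);

  // Start a fresh window on the first request or after the old one expires
  if (!entry || now > entry.resetTime) {
    counters.set(userId, { count: 1, resetTime: now + WINDOW_MS });
    return true;
  }

  if (entry.count >= limit) return false; // Over this tier's limit
  entry.count++;
  return true;
}
```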

    What to ask your AI: "Implement rate limiting for my AI API endpoint. Free users get [X] requests per hour, paid users get [Y]."

    Scaling Considerations

    As your app grows, you need to think about scaling both your infrastructure and your AI API usage.

    Horizontal vs. Vertical Scaling

    Type       | What It Means | Example
    Vertical   | Bigger server | More RAM, faster CPU
    Horizontal | More servers  | Multiple instances behind a load balancer

    Good news: platforms like Vercel, Firebase, and Railway handle horizontal scaling automatically. You don't need to manage servers.

    AI-Specific Scaling Challenges

    1. Rate limits from AI providers — Each provider has limits on requests per minute. As you scale, you might hit these.
    2. Latency under load — More requests can mean slower responses.
    3. Cost scaling — Costs scale linearly (or worse) with users.

    Solutions

    // 1. Request queuing — don't overwhelm the AI API
    import PQueue from "p-queue";
    
    const queue = new PQueue({ concurrency: 10 }); // Max 10 concurrent AI calls
    
    async function queuedAICall(prompt: string): Promise<string> {
      return queue.add(() => generateResponse(prompt));
    }
    
    // 2. Fallback models — use a cheaper model when the primary is slow/down
    async function resilientAICall(prompt: string): Promise<string> {
      try {
        return await generateResponse(prompt, "gpt-4o");
      } catch (error) {
        console.warn("Primary model failed, falling back to mini");
        return await generateResponse(prompt, "gpt-4o-mini");
      }
    }

    When to Use Smaller Models

    Not every request needs the most powerful (and expensive) model. Use the right model for the job:

    Task                   | Recommended Model         | Why
    Simple Q&A             | GPT-4o Mini, Gemini Flash | Fast and cheap
    Classification/tagging | GPT-4o Mini               | Structured output, low complexity
    Code generation        | GPT-4o, Claude Sonnet     | Needs reasoning ability
    Creative writing       | Claude Sonnet             | Best quality prose
    Data extraction        | GPT-4o Mini               | Structured, repetitive
    Complex reasoning      | GPT-4o, Claude Sonnet     | Needs deep thinking
    Summarization          | GPT-4o Mini               | Good enough quality

    Dynamic Model Selection

    function selectModel(task: string): string {
      const modelMap: Record<string, string> = {
        classify: "gpt-4o-mini",
        summarize: "gpt-4o-mini",
        chat: "gpt-4o-mini",
        code: "gpt-4o",
        analyze: "gpt-4o",
        creative: "gpt-4o",
      };
    
      return modelMap[task] ?? "gpt-4o-mini"; // Default to cheaper model
    }

    Budget Alerts and Monitoring

    Set Provider-Level Limits

    Every AI API provider lets you set spending limits:

    Provider  | How to Set Limits
    OpenAI    | Dashboard → Settings → Billing → Usage limits
    Anthropic | Console → Settings → Spending limit
    Google AI | Google Cloud Console → Budgets & alerts

    Set a hard limit that's 20% above your expected monthly spend. This prevents surprise bills.

    Application-Level Budget Tracking

    // Track spending in real-time (same Firestore setup as the earlier examples)
    import { db } from "@/lib/firebase";
    import { Timestamp } from "firebase-admin/firestore";
    
    async function trackSpending(cost: number): Promise<boolean> {
      const today = new Date().toISOString().split("T")[0];
    
      const budgetDoc = await db.collection("budgets").doc(today).get();
      const currentSpend = budgetDoc.exists ? budgetDoc.data()?.totalCost ?? 0 : 0;
      const newTotal = currentSpend + cost;
    
      await db.collection("budgets").doc(today).set(
        { totalCost: newTotal, updatedAt: Timestamp.now() },
        { merge: true }
      );
    
      const DAILY_LIMIT = 50; // $50/day
      if (newTotal > DAILY_LIMIT) {
        console.error(`Daily budget exceeded: $${newTotal.toFixed(2)}`);
        return false; // Signal to stop making AI calls
      }
    
      return true;
    }

    Cost Optimization Checklist

    ✅ Using the cheapest model that meets quality requirements
    ✅ Caching identical/similar requests
    ✅ Rate limiting users (per hour/day)
    ✅ Setting spending limits with AI providers
    ✅ Tracking costs per user/day in your app
    ✅ Using streaming for long responses (better UX, same cost)
    ✅ Trimming unnecessary context from prompts
    ✅ Setting max_tokens to prevent runaway responses
    ✅ Budget alerts at 50%, 80%, and 100% of monthly limit
    ✅ Monitoring token usage trends weekly
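    The max_tokens item deserves a concrete example, since it is the cheapest guardrail to add. This sketch builds a request object in the OpenAI chat completions shape; field names differ between providers, so adapt as needed, and the 2,000-token ceiling is an arbitrary example value:

```typescript
interface ChatRequest {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  max_tokens: number;
}

function buildRequest(prompt: string, maxTokens = 500): ChatRequest {
  return {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: prompt }],
    // Hard ceiling on output length so one runaway response can't inflate the bill
    max_tokens: Math.min(maxTokens, 2000),
  };
}
```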
    

    What's Next?

    You've covered all the essential deployment topics. The final tutorial is your Deployment Cheat Sheet — a quick reference with commands, templates, and AI prompts for everything deployment-related.

    What to ask your AI: "Analyze my AI app's API calls and suggest ways to reduce costs. Here's my current usage: [describe patterns]."


    🌐 www.genai-mentor.ai