Books/Deploying AI Apps/Cost Management and Scaling

    Cost Management and Scaling

    AI APIs charge per token, and tokens add up fast. A single chat conversation might cost a few cents, but multiply that by thousands of users and you have a real bill. This tutorial teaches you how to keep costs under control while scaling your app.

    AI API Cost Breakdown

    Understanding pricing is the first step to controlling costs. Here's how major AI providers charge:

    Token-Based Pricing

    AI APIs charge per token, where one token is roughly 0.75 words (so 1,000 tokens is about 750 words). There are two types:

    • Input tokens — Your prompt (what you send to the AI)
    • Output tokens — The AI's response (what comes back)

    Output tokens are typically more expensive because they require more computation.
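    Those two rates can be combined into a quick back-of-the-envelope estimator. Here is a sketch: the 0.75 words-per-token rule above is only an approximation, and the prices you pass in should come from your provider's current pricing page.

```typescript
// Rough token count from the "1 token ≈ 0.75 words" rule of thumb
function estimateTokens(text: string): number {
  const words = text.trim().split(/\s+/).length;
  return Math.ceil(words / 0.75);
}

// Cost of one request, given per-million-token prices
// (e.g. 0.15 input / 0.60 output for GPT-4o Mini)
function estimateCost(
  inputTokens: number,
  outputTokens: number,
  inputPricePer1M: number,
  outputPricePer1M: number
): number {
  return (
    (inputTokens / 1_000_000) * inputPricePer1M +
    (outputTokens / 1_000_000) * outputPricePer1M
  );
}
```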

    Price Comparison (as of 2024-2025)

    Model             | Input (per 1M tokens) | Output (per 1M tokens) | Best For
    GPT-4o            | $2.50                 | $10.00                 | Complex reasoning, coding
    GPT-4o Mini       | $0.15                 | $0.60                  | Most tasks, cost-effective
    Claude 3.5 Sonnet | $3.00                 | $15.00                 | Long content, analysis
    Claude 3.5 Haiku  | $0.80                 | $4.00                  | Fast, affordable tasks
    Gemini 1.5 Pro    | $1.25                 | $5.00                  | Multimodal, long context
    Gemini 1.5 Flash  | $0.075                | $0.30                  | Cheapest option

    Real-World Cost Examples

    Use Case         | Model         | Tokens/Request | Cost/Request | 1,000 Users/Day
    Simple chatbot   | GPT-4o Mini   | ~500           | $0.0004      | $0.40/day
    Code generation  | GPT-4o        | ~2,000         | $0.025       | $25/day
    Document summary | Claude Sonnet | ~5,000         | $0.09        | $90/day
    AI search        | GPT-4o Mini   | ~300           | $0.0002      | $0.20/day
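    To turn a per-request figure from this table into a monthly bill, just multiply it out. A tiny helper (a 30-day month is assumed):

```typescript
// Monthly spend from a per-request cost, e.g. a row of the table above
function monthlyCost(
  costPerRequest: number,
  requestsPerUserPerDay: number,
  users: number
): number {
  return costPerRequest * requestsPerUserPerDay * users * 30;
}
```

    For example, the simple-chatbot row ($0.0004/request, 1,000 users each making one request a day) works out to about $12/month.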

    What to ask your AI: "Estimate the monthly cost of my AI app. It makes about [X] API calls per user per day, with [Y] users, using [model name]."

    Caching Strategies to Reduce Costs

    The most effective way to cut AI costs is to avoid making the same API call twice. Caching stores responses so identical or similar requests get served from cache instead of calling the AI API.

    Simple In-Memory Cache

    // src/lib/aiCache.ts
    // Note: this Map grows without bound. That's fine for a demo, but add
    // eviction (a max size or LRU policy) before relying on it in production.
    const cache = new Map<string, { response: string; timestamp: number }>();
    const CACHE_TTL = 1000 * 60 * 60; // 1 hour
    
    function getCacheKey(prompt: string, model: string): string {
      // Create a consistent key from the prompt and model
      return `${model}:${prompt.trim().toLowerCase()}`;
    }
    
    export async function cachedAICall(
      prompt: string,
      model: string,
      fetchFn: () => Promise<string>
    ): Promise<string> {
      const key = getCacheKey(prompt, model);
    
      // Check cache
      const cached = cache.get(key);
      if (cached && Date.now() - cached.timestamp < CACHE_TTL) {
        console.log("Cache hit — saved an API call!");
        return cached.response;
      }
    
      // Cache miss — make the API call
      const response = await fetchFn();
    
      // Store in cache
      cache.set(key, { response, timestamp: Date.now() });
    
      return response;
    }

    Database Cache (Persistent)

    For production, use your database as a cache so responses survive server restarts:

    // src/services/aiCacheService.ts
    import { db } from "@/lib/firebase";
    import { Timestamp } from "firebase-admin/firestore";
    import crypto from "crypto";
    
    const CACHE_COLLECTION = "ai_cache";
    const CACHE_TTL_HOURS = 24;
    
    function hashPrompt(prompt: string): string {
      return crypto.createHash("sha256").update(prompt).digest("hex");
    }
    
    export async function getCachedResponse(
      prompt: string,
      model: string
    ): Promise<string | null> {
      const hash = hashPrompt(`${model}:${prompt}`);
      const doc = await db.collection(CACHE_COLLECTION).doc(hash).get();
    
      if (!doc.exists) return null;
    
      const data = doc.data();
      // Treat a missing createdAt timestamp as expired instead of throwing
      const age = Date.now() - (data?.createdAt?.toMillis() ?? 0);
      if (age > CACHE_TTL_HOURS * 60 * 60 * 1000) return null;
    
      return data?.response ?? null;
    }
    
    export async function setCachedResponse(
      prompt: string,
      model: string,
      response: string
    ): Promise<void> {
      const hash = hashPrompt(`${model}:${prompt}`);
      await db.collection(CACHE_COLLECTION).doc(hash).set({
        prompt: prompt.substring(0, 200), // Store truncated for debugging
        model,
        response,
        createdAt: Timestamp.now(),
      });
    }

    Semantic Caching

    Instead of matching prompts exactly, cache based on meaning. If someone asks "What is JavaScript?" and later "Explain JavaScript to me," the prompts are similar enough that one cached response can serve both:

    // This is an advanced technique — use with a vector database
    // or a similarity search library
    // The idea: embed the prompt, find similar cached prompts,
    // return the cached response if similarity is above a threshold
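    To make the idea concrete, here is a minimal in-memory sketch. It assumes you already have prompt embeddings (produced by an embeddings API or library, not shown), and the 0.9 threshold is a made-up starting point you would tune on real traffic:

```typescript
type Embedding = number[];

// Standard cosine similarity between two equal-length vectors
function cosineSimilarity(a: Embedding, b: Embedding): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const SIMILARITY_THRESHOLD = 0.9; // Hypothetical value; tune for your app

const semanticCache: { embedding: Embedding; response: string }[] = [];

// Linear scan over cached embeddings; a vector database replaces this at scale
function findSimilar(promptEmbedding: Embedding): string | null {
  for (const entry of semanticCache) {
    if (cosineSimilarity(entry.embedding, promptEmbedding) >= SIMILARITY_THRESHOLD) {
      return entry.response;
    }
  }
  return null;
}

function storeSemantic(embedding: Embedding, response: string): void {
  semanticCache.push({ embedding, response });
}
```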

    What to ask your AI: "Implement a caching layer for my AI API calls. Use [Firestore/Redis/in-memory] and cache responses for [duration]."

    Rate Limiting Your Users

    Without rate limits, a single user could make hundreds of AI calls and blow your budget. Rate limiting controls how many requests each user can make.

    Simple Rate Limiter

    // src/lib/rateLimiter.ts
    const userRequestCounts = new Map<string, { count: number; resetTime: number }>();
    
    const RATE_LIMIT = 20; // requests per window
    const WINDOW_MS = 60 * 60 * 1000; // 1 hour
    
    export function checkRateLimit(userId: string): {
      allowed: boolean;
      remaining: number;
      resetIn: number;
    } {
      const now = Date.now();
      const userLimit = userRequestCounts.get(userId);
    
      // Reset if window has passed
      if (!userLimit || now > userLimit.resetTime) {
        userRequestCounts.set(userId, {
          count: 1,
          resetTime: now + WINDOW_MS,
        });
        return { allowed: true, remaining: RATE_LIMIT - 1, resetIn: WINDOW_MS };
      }
    
      // Check limit
      if (userLimit.count >= RATE_LIMIT) {
        return {
          allowed: false,
          remaining: 0,
          resetIn: userLimit.resetTime - now,
        };
      }
    
      // Increment
      userLimit.count++;
      return {
        allowed: true,
        remaining: RATE_LIMIT - userLimit.count,
        resetIn: userLimit.resetTime - now,
      };
    }

    Using the Rate Limiter in an API Route

    // app/api/chat/route.ts
    import { checkRateLimit } from "@/lib/rateLimiter";
    
    export async function POST(req: Request) {
      const { prompt } = await req.json();
      const userId = getUserIdFromRequest(req); // Your auth logic
    
      const rateCheck = checkRateLimit(userId);
    
      if (!rateCheck.allowed) {
        return Response.json(
          {
            error: "Rate limit exceeded",
            retryAfter: Math.ceil(rateCheck.resetIn / 1000),
          },
          { status: 429 }
        );
      }
    
      // Process the AI request...
      const response = await generateResponse(prompt);
    
      return Response.json({
        response,
        rateLimitRemaining: rateCheck.remaining,
      });
    }

    Tiered Rate Limits

    Different users get different limits:

    const RATE_LIMITS: Record<string, number> = {
      free: 10,       // 10 requests per hour
      pro: 100,       // 100 requests per hour
      enterprise: 1000, // 1000 requests per hour
    };
    
    function getUserRateLimit(userTier: string): number {
      return RATE_LIMITS[userTier] ?? RATE_LIMITS.free;
    }
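    Combining that tier lookup with the windowed counter from the rate limiter above gives a tier-aware check. A sketch, with the same in-memory caveats as before:

```typescript
const TIER_LIMITS: Record<string, number> = {
  free: 10,
  pro: 100,
  enterprise: 1000,
};
const WINDOW_MS = 60 * 60 * 1000; // 1 hour

const counters = new Map<string, { count: number; resetTime: number }>();

function checkTieredLimit(userId: string, tier: string): boolean {
  const limit = TIER_LIMITS[tier] ?? TIER_LIMITS.free;
  const now = Date.now();
  const entry = counters.get(userId);

  // Start a fresh window on the first request or after the old one expires
  if (!entry || now > entry.resetTime) {
    counters.set(userId, { count: 1, resetTime: now + WINDOW_MS });
    return true;
  }

  if (entry.count >= limit) return false; // Over this tier's limit
  entry.count++;
  return true;
}
```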

    What to ask your AI: "Implement rate limiting for my AI API endpoint. Free users get [X] requests per hour, paid users get [Y]."

    Scaling Considerations

    As your app grows, you need to think about scaling both your infrastructure and your AI API usage.

    Horizontal vs. Vertical Scaling

    Type       | What It Means | Example
    Vertical   | Bigger server | More RAM, faster CPU
    Horizontal | More servers  | Multiple instances behind a load balancer

    Good news: platforms like Vercel, Firebase, and Railway handle horizontal scaling automatically. You don't need to manage servers.

    AI-Specific Scaling Challenges

    1. Rate limits from AI providers — Each provider has limits on requests per minute. As you scale, you might hit these.
    2. Latency under load — More requests can mean slower responses.
    3. Cost scaling — Costs scale linearly (or worse) with users.

    Solutions

    // 1. Request queuing — don't overwhelm the AI API
    import PQueue from "p-queue";
    
    const queue = new PQueue({ concurrency: 10 }); // Max 10 concurrent AI calls
    
    async function queuedAICall(prompt: string): Promise<string> {
      return queue.add(() => generateResponse(prompt));
    }
    
    // 2. Fallback models — use a cheaper model when the primary is slow/down
    async function resilientAICall(prompt: string): Promise<string> {
      try {
        return await generateResponse(prompt, "gpt-4o");
      } catch (error) {
        console.warn("Primary model failed, falling back to mini");
        return await generateResponse(prompt, "gpt-4o-mini");
      }
    }

    When to Use Smaller Models

    Not every request needs the most powerful (and expensive) model. Use the right model for the job:

    Task                   | Recommended Model         | Why
    Simple Q&A             | GPT-4o Mini, Gemini Flash | Fast and cheap
    Classification/tagging | GPT-4o Mini               | Structured output, low complexity
    Code generation        | GPT-4o, Claude Sonnet     | Needs reasoning ability
    Creative writing       | Claude Sonnet             | Best quality prose
    Data extraction        | GPT-4o Mini               | Structured, repetitive
    Complex reasoning      | GPT-4o, Claude Sonnet     | Needs deep thinking
    Summarization          | GPT-4o Mini               | Good enough quality

    Dynamic Model Selection

    function selectModel(task: string): string {
      const modelMap: Record<string, string> = {
        classify: "gpt-4o-mini",
        summarize: "gpt-4o-mini",
        chat: "gpt-4o-mini",
        code: "gpt-4o",
        analyze: "gpt-4o",
        creative: "gpt-4o",
      };
    
      return modelMap[task] ?? "gpt-4o-mini"; // Default to cheaper model
    }

    Budget Alerts and Monitoring

    Set Provider-Level Limits

    Every AI API provider lets you set spending limits:

    Provider  | How to Set Limits
    OpenAI    | Dashboard → Settings → Billing → Usage limits
    Anthropic | Console → Settings → Spending limit
    Google AI | Google Cloud Console → Budgets & alerts

    Set a hard limit that's 20% above your expected monthly spend. This prevents surprise bills.

    Application-Level Budget Tracking

    // Track spending in real-time (same Firestore setup as the earlier examples)
    import { db } from "@/lib/firebase";
    import { Timestamp } from "firebase-admin/firestore";
    
    async function trackSpending(cost: number): Promise<boolean> {
      const today = new Date().toISOString().split("T")[0];
    
      const budgetDoc = await db.collection("budgets").doc(today).get();
      const currentSpend = budgetDoc.exists ? budgetDoc.data()?.totalCost ?? 0 : 0;
      const newTotal = currentSpend + cost;
    
      await db.collection("budgets").doc(today).set(
        { totalCost: newTotal, updatedAt: Timestamp.now() },
        { merge: true }
      );
    
      const DAILY_LIMIT = 50; // $50/day
      if (newTotal > DAILY_LIMIT) {
        console.error(`Daily budget exceeded: $${newTotal.toFixed(2)}`);
        return false; // Signal to stop making AI calls
      }
    
      return true;
    }

    Cost Optimization Checklist

    ✅ Using the cheapest model that meets quality requirements
    ✅ Caching identical/similar requests
    ✅ Rate limiting users (per hour/day)
    ✅ Setting spending limits with AI providers
    ✅ Tracking costs per user/day in your app
    ✅ Using streaming for long responses (better UX, same cost)
    ✅ Trimming unnecessary context from prompts
    ✅ Setting max_tokens to prevent runaway responses
    ✅ Budget alerts at 50%, 80%, and 100% of monthly limit
    ✅ Monitoring token usage trends weekly
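    The max_tokens item deserves a concrete example, since it is the cheapest guardrail to add. This sketch builds a request object in the OpenAI chat completions shape; field names differ between providers, so adapt as needed, and the 2,000-token ceiling is an arbitrary example value:

```typescript
interface ChatRequest {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  max_tokens: number;
}

function buildRequest(prompt: string, maxTokens = 500): ChatRequest {
  return {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: prompt }],
    // Hard ceiling on output length so one runaway response can't inflate the bill
    max_tokens: Math.min(maxTokens, 2000),
  };
}
```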
    

    What's Next?

    You've covered all the essential deployment topics. The final tutorial is your Deployment Cheat Sheet — a quick reference with commands, templates, and AI prompts for everything deployment-related.

    What to ask your AI: "Analyze my AI app's API calls and suggest ways to reduce costs. Here's my current usage: [describe patterns]."


    🌐 www.genai-mentor.ai