# LLM System: Caching & Cost Control (Starter Template)

---

**Last Updated:** 2025-12-12
**Phase:** Phase 0 (Planning)
**Status:** Draft — finalize in Phase 1
**Owner:** AI/LLM Lead + Backend Architect
**References:**

- `/docs/llm/prompting.md`
- `/docs/llm/evals.md`

---

This document defines how to keep LLM usage reliable and within budget.

## 1. Goals

- Minimize cost while preserving quality.
- Keep latency predictable for user flows.
- Avoid repeated work (idempotency + caching).

## 2. Budgets & Limits

Define per tenant and per feature:

- monthly token/cost cap,
- per-request max tokens,
- max retries/timeouts,
- concurrency limits.

## 3. Caching Layers

Pick what applies:

1. **Input normalization cache**
   - Canonicalize inputs (trim, stable key ordering) to increase hit rate.
2. **LLM response cache**
   - Key: `(prompt_version, model, canonical_input_hash, retrieval_config_hash)`.
   - TTL depends on the volatility of the task.
3. **Embeddings cache**
   - Store embeddings for reusable texts/items.
4. **RAG retrieval cache**
   - Cache top-k doc IDs for stable queries.

> Never cache raw PII; cache keys use hashes of redacted inputs.

## 4. Cost Controls

- Prefer cheaper models for low-risk tasks; escalate to stronger models only when needed.
- Use staged pipelines (rules/heuristics/RAG) to reduce LLM calls.
- Batch non-interactive jobs (classification, report generation).
- Track tokens in/out per request and per tenant.

## 5. Fallbacks

- On timeouts/errors: retry with backoff, then fall back to a safe default or human review.
- On budget exhaustion: degrade gracefully (limit features, queue jobs, ask the user).

## 6. Monitoring

- Dashboards for cost, latency, cache hit rate, and retry rate.
- Alerts for cost spikes, anomalous tenants, and runaway loops.
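The per-tenant caps in §2 can be sketched as a simple pre-call guard. This is a minimal illustration, not a production design: the `BudgetGuard` class, its parameter names, and the flat per-1k-token pricing are all assumptions made here for clarity (a real system would persist spend and reset it monthly).

```python
from collections import defaultdict


class BudgetGuard:
    """Hypothetical sketch of the §2 limits: monthly cost cap + per-request token cap."""

    def __init__(self, monthly_cap_usd: float, per_request_max_tokens: int):
        self.monthly_cap_usd = monthly_cap_usd
        self.per_request_max_tokens = per_request_max_tokens
        self.spend_usd = defaultdict(float)  # tenant -> spend this month (in-memory only)

    def allow(self, tenant: str, est_tokens: int, usd_per_1k_tokens: float) -> bool:
        """Check both caps before making the LLM call."""
        if est_tokens > self.per_request_max_tokens:
            return False  # per-request max tokens exceeded
        est_cost = est_tokens / 1000 * usd_per_1k_tokens
        return self.spend_usd[tenant] + est_cost <= self.monthly_cap_usd

    def record(self, tenant: str, used_tokens: int, usd_per_1k_tokens: float) -> None:
        """Record actual usage after the call completes."""
        self.spend_usd[tenant] += used_tokens / 1000 * usd_per_1k_tokens
```

Keeping `allow` and `record` separate lets the guard check the *estimated* cost up front and charge the *actual* token usage afterward, which is also the hook for the per-tenant tracking mentioned in §4.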
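The response-cache key from §3 can be sketched as follows. This is a minimal example under assumptions: the `canonicalize` helper and the colon-joined key format are inventions for illustration, and the payload is assumed to be JSON-serializable and already redacted of PII (per the note in §3, only hashes of redacted inputs go into keys).

```python
import hashlib
import json


def canonicalize(payload) -> str:
    """Input normalization (§3.1): trim strings, sort dict keys for stable ordering."""
    def norm(v):
        if isinstance(v, str):
            return v.strip()
        if isinstance(v, dict):
            return {k: norm(v[k]) for k in sorted(v)}
        if isinstance(v, list):
            return [norm(x) for x in v]
        return v
    return json.dumps(norm(payload), sort_keys=True, separators=(",", ":"))


def cache_key(prompt_version: str, model: str, payload: dict, retrieval_config: dict) -> str:
    """Build the §3.2 key: (prompt_version, model, canonical_input_hash, retrieval_config_hash)."""
    input_hash = hashlib.sha256(canonicalize(payload).encode()).hexdigest()
    retrieval_hash = hashlib.sha256(canonicalize(retrieval_config).encode()).hexdigest()
    return f"{prompt_version}:{model}:{input_hash}:{retrieval_hash}"
```

Because canonicalization sorts keys and trims whitespace, semantically identical requests hash to the same key, which is what drives the hit rate up. Bumping `prompt_version` or changing the retrieval config naturally invalidates stale entries without an explicit purge.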
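The §5 fallback path (retry with backoff, then safe default) can be sketched like this. A hedged illustration only: the function name, the choice of exceptions to retry, and the jittered exponential schedule are assumptions, not a prescribed policy.

```python
import random
import time


def call_with_fallback(call, fallback, max_retries: int = 3, base_delay: float = 0.5):
    """Retry a flaky LLM call with exponential backoff + jitter; fall back after exhaustion."""
    for attempt in range(max_retries):
        try:
            return call()
        except (TimeoutError, ConnectionError):
            # Backoff doubles each attempt; jitter avoids synchronized retry storms.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
    # All retries exhausted: return the safe default (or enqueue for human review).
    return fallback()
```

The same wrapper covers the budget-exhaustion case in §5 if the caller raises before invoking the model and `fallback` queues the job or returns a degraded response.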