# LLM System: Caching & Cost Control (Starter Template)

---

**Last Updated:** 2025-12-12
**Phase:** Phase 0 (Planning)
**Status:** Draft — finalize in Phase 1
**Owner:** AI/LLM Lead + Backend Architect
**References:**

- `/docs/llm/prompting.md`
- `/docs/llm/evals.md`

---

This document defines how to keep LLM usage reliable and within budget.

## 1. Goals

- Minimize cost while preserving quality.
- Keep latency predictable for user flows.
- Avoid repeated work (idempotency + caching).

## 2. Budgets & Limits

Define per tenant and per feature:

- monthly token/cost cap,
- per-request max tokens,
- max retries/timeouts,
- concurrency limits.

## 3. Caching Layers

Pick what applies:

1. **Input normalization cache**
   - Canonicalize inputs (trim, stable key ordering) to increase hit rate.
2. **LLM response cache**
   - Key: `(prompt_version, model, canonical_input_hash, retrieval_config_hash)`.
   - TTL depends on the volatility of the task.
3. **Embeddings cache**
   - Store embeddings for reusable texts/items.
4. **RAG retrieval cache**
   - Cache top-k doc IDs for stable queries.

> Never cache raw PII; cache keys use hashes of redacted inputs.

## 4. Cost Controls

- Prefer cheaper models for low-risk tasks; escalate to stronger models only when needed.
- Use staged pipelines (rules/heuristics/RAG) to reduce LLM calls.
- Batch non-interactive jobs (classification, report generation).
- Track tokens in/out per request and per tenant.

## 5. Fallbacks

- On timeouts/errors: retry with backoff, then fall back to a safe default or human review.
- On budget exhaustion: degrade gracefully (limit features, queue jobs, ask the user).

## 6. Monitoring

- Dashboards for cost, latency, cache hit rate, and retry rate.
- Alerts for cost spikes, anomalous tenants, and runaway loops.
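The per-tenant caps in §2 can be sketched as a simple pre-call guard. This is a minimal illustration, not a production design: the `BudgetGuard` class, its parameter names, and the flat per-1k-token pricing are all assumptions made here for clarity (a real system would persist spend and reset it monthly).

```python
from collections import defaultdict


class BudgetGuard:
    """Hypothetical sketch of the §2 limits: monthly cost cap + per-request token cap."""

    def __init__(self, monthly_cap_usd: float, per_request_max_tokens: int):
        self.monthly_cap_usd = monthly_cap_usd
        self.per_request_max_tokens = per_request_max_tokens
        self.spend_usd = defaultdict(float)  # tenant -> spend this month (in-memory only)

    def allow(self, tenant: str, est_tokens: int, usd_per_1k_tokens: float) -> bool:
        """Check both caps before making the LLM call."""
        if est_tokens > self.per_request_max_tokens:
            return False  # per-request max tokens exceeded
        est_cost = est_tokens / 1000 * usd_per_1k_tokens
        return self.spend_usd[tenant] + est_cost <= self.monthly_cap_usd

    def record(self, tenant: str, used_tokens: int, usd_per_1k_tokens: float) -> None:
        """Record actual usage after the call completes."""
        self.spend_usd[tenant] += used_tokens / 1000 * usd_per_1k_tokens
```

Keeping `allow` and `record` separate lets the guard check the *estimated* cost up front and charge the *actual* token usage afterward, which is also the hook for the per-tenant tracking mentioned in §4.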
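The response-cache key from §3 can be sketched as follows. This is a minimal example under assumptions: the `canonicalize` helper and the colon-joined key format are inventions for illustration, and the payload is assumed to be JSON-serializable and already redacted of PII (per the note in §3, only hashes of redacted inputs go into keys).

```python
import hashlib
import json


def canonicalize(payload) -> str:
    """Input normalization (§3.1): trim strings, sort dict keys for stable ordering."""
    def norm(v):
        if isinstance(v, str):
            return v.strip()
        if isinstance(v, dict):
            return {k: norm(v[k]) for k in sorted(v)}
        if isinstance(v, list):
            return [norm(x) for x in v]
        return v
    return json.dumps(norm(payload), sort_keys=True, separators=(",", ":"))


def cache_key(prompt_version: str, model: str, payload: dict, retrieval_config: dict) -> str:
    """Build the §3.2 key: (prompt_version, model, canonical_input_hash, retrieval_config_hash)."""
    input_hash = hashlib.sha256(canonicalize(payload).encode()).hexdigest()
    retrieval_hash = hashlib.sha256(canonicalize(retrieval_config).encode()).hexdigest()
    return f"{prompt_version}:{model}:{input_hash}:{retrieval_hash}"
```

Because canonicalization sorts keys and trims whitespace, semantically identical requests hash to the same key, which is what drives the hit rate up. Bumping `prompt_version` or changing the retrieval config naturally invalidates stale entries without an explicit purge.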
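The §5 fallback path (retry with backoff, then safe default) can be sketched like this. A hedged illustration only: the function name, the choice of exceptions to retry, and the jittered exponential schedule are assumptions, not a prescribed policy.

```python
import random
import time


def call_with_fallback(call, fallback, max_retries: int = 3, base_delay: float = 0.5):
    """Retry a flaky LLM call with exponential backoff + jitter; fall back after exhaustion."""
    for attempt in range(max_retries):
        try:
            return call()
        except (TimeoutError, ConnectionError):
            # Backoff doubles each attempt; jitter avoids synchronized retry storms.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
    # All retries exhausted: return the safe default (or enqueue for human review).
    return fallback()
```

The same wrapper covers the budget-exhaustion case in §5 if the caller raises before invoking the model and `fallback` queues the job or returns a degraded response.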