AI_template/docs/llm/caching-costs.md
olekhondera 5b28ea675d add SKILL
2026-02-14 07:38:50 +02:00


LLM System: Caching & Cost Control (Starter Template)


Phase: Phase 0 (Planning)
Status: Draft — finalize in Phase 1
Owner: AI/LLM Lead + Backend Architect
References:

  • /docs/llm/prompting.md
  • /docs/llm/evals.md

This document defines how to keep LLM usage reliable and within budget.

1. Goals

  • Minimize cost while preserving quality.
  • Keep latency predictable for user flows.
  • Avoid repeated work (idempotency + caching).

2. Budgets & Limits

Define per tenant and per feature:

  • monthly token/cost cap,
  • per-request max tokens,
  • max retries/timeouts,
  • concurrency limits.
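The limits above can be captured in a single config object per (tenant, feature) pair. A minimal sketch, assuming hypothetical field names (finalize the actual schema in Phase 1):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LlmLimits:
    """Per-tenant, per-feature LLM usage limits (field names are placeholders)."""
    monthly_cost_cap_usd: float   # hard monthly spend cap
    max_tokens_per_request: int   # per-request output token ceiling
    max_retries: int              # retries before falling back
    timeout_s: float              # per-call timeout
    max_concurrency: int          # simultaneous in-flight requests

# Example: stricter limits for an interactive, user-facing chat feature.
CHAT_LIMITS = LlmLimits(
    monthly_cost_cap_usd=500.0,
    max_tokens_per_request=1024,
    max_retries=2,
    timeout_s=10.0,
    max_concurrency=8,
)
```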

3. Caching Layers

Choose the layers that apply:

  1. Input normalization cache
    • canonicalize inputs (trim, stable ordering) to increase hit rate.
  2. LLM response cache
    • key: (prompt_version, model, canonical_input_hash, retrieval_config_hash).
    • TTL depends on volatility of the task.
  3. Embeddings cache
    • store embeddings for reusable texts/items.
  4. RAG retrieval cache
    • cache top-k doc IDs for stable queries.

Never cache raw PII; cache keys use hashes of redacted inputs.
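A sketch of the response-cache key from layer 2, hashing the canonical (already-redacted) input so no raw PII appears in the key; function and argument names are illustrative:

```python
import hashlib
import json

def cache_key(prompt_version: str, model: str, canonical_input: dict,
              retrieval_config: dict) -> str:
    """Build a response-cache key:
    (prompt_version, model, canonical_input_hash, retrieval_config_hash).

    Inputs must already be redacted; only hashes of the canonical
    forms appear in the key.
    """
    def h(obj) -> str:
        # sort_keys gives a stable serialization, so logically-equal
        # inputs hash identically and increase the hit rate.
        blob = json.dumps(obj, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:16]

    return f"{prompt_version}:{model}:{h(canonical_input)}:{h(retrieval_config)}"

key = cache_key("v3", "small-model", {"q": "refund policy"}, {"top_k": 5})
```

Key ordering inside the dicts does not matter: `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` produce the same key.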

4. Cost Controls

  • Prefer cheaper models for low-risk tasks; escalate to stronger models only when needed.
  • Use staged pipelines (rules/heuristics/RAG) to reduce LLM calls.
  • Batch non-interactive jobs (classification, report gen).
  • Track tokens in/out per request and per tenant.
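The cheap-first escalation policy can be sketched as a routing function; model names, risk labels, and the confidence threshold are all assumptions to be tuned in Phase 1:

```python
def route_model(task_risk: str, confidence: float) -> str:
    """Staged routing: try the cheap model first, escalate only when needed.

    task_risk:  "low" or "high" (placeholder taxonomy).
    confidence: score from an upstream rule/heuristic stage in [0, 1].
    """
    if task_risk == "low" or confidence >= 0.9:
        return "cheap-model"   # placeholder name
    return "strong-model"      # placeholder name

# A high-risk task with low upstream confidence escalates.
model = route_model("high", 0.4)
```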

5. Fallbacks

  • On timeouts/errors: retry with backoff, then fallback to safe default or human review.
  • On budget exhaustion: degrade gracefully (limited features, queue jobs, ask user).
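The timeout/error path above can be sketched as retry-with-backoff wrapping any LLM call, with a safe default when retries are exhausted; the helper and its defaults are illustrative:

```python
import random
import time

def call_with_backoff(fn, max_retries: int = 2, base_delay: float = 0.5,
                      safe_default=None):
    """Call fn(); on timeout/connection errors, retry with exponential
    backoff plus jitter, then fall back to a safe default (or route to
    human review) once retries are exhausted."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if attempt == max_retries:
                return safe_default  # alternatively: enqueue for human review
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```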

6. Monitoring

  • Dashboards for cost, latency, cache hit rate, retry rate.
  • Alerts for spikes, anomaly tenants, or runaway loops.
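One way to express the spike/anomaly alerts as a rule over the tracked metrics; the thresholds (3x daily spend, 30% minimum hit rate) are placeholder values to calibrate against real dashboards:

```python
def should_alert(cost_today: float, avg_daily_cost: float,
                 cache_hit_rate: float, spike_factor: float = 3.0,
                 min_hit_rate: float = 0.3) -> bool:
    """Flag a tenant when spend spikes well above its baseline or the
    cache hit rate degrades (a possible sign of a runaway loop or
    cache-key drift)."""
    cost_spike = cost_today > spike_factor * avg_daily_cost
    cache_degraded = cache_hit_rate < min_hit_rate
    return cost_spike or cache_degraded
```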