LLM System: Caching & Cost Control (Starter Template)
Last Updated: 2025-12-12
Phase: Phase 0 (Planning)
Status: Draft — finalize in Phase 1
Owner: AI/LLM Lead + Backend Architect
References:
/docs/llm/prompting.md
/docs/llm/evals.md
This document defines how to keep LLM usage reliable and within budget.
1. Goals
- Minimize cost while preserving quality.
- Keep latency predictable for user flows.
- Avoid repeated work (idempotency + caching).
2. Budgets & Limits
Define per tenant and per feature:
- monthly token/cost cap,
- per‑request max tokens,
- max retries/timeouts,
- concurrency limits.
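The limits above can be captured in a small per-tenant, per-feature table. This is an illustrative sketch: the class and field names (`LlmBudget`, `budget_for`, etc.) are placeholders to be finalized in Phase 1, not an existing API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LlmBudget:
    # All field names are illustrative placeholders.
    monthly_cost_cap_usd: float
    per_request_max_tokens: int
    max_retries: int
    request_timeout_s: float
    max_concurrency: int

# Example limit table, keyed by (tenant, feature).
BUDGETS = {
    ("tenant_a", "summarize"): LlmBudget(50.0, 1024, 2, 30.0, 4),
    ("tenant_a", "classify"):  LlmBudget(10.0, 256, 1, 10.0, 8),
}

def budget_for(tenant: str, feature: str) -> LlmBudget:
    """Look up the budget for one tenant/feature pair."""
    return BUDGETS[(tenant, feature)]
```

In practice this table would live in config or a database so limits can change without a deploy.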
3. Caching Layers
Pick what applies:
- Input normalization cache
- canonicalize inputs (trim, stable ordering) to increase hit rate.
- LLM response cache
- key: (prompt_version, model, canonical_input_hash, retrieval_config_hash).
- TTL depends on volatility of the task.
- Embeddings cache
- store embeddings for reusable texts/items.
- RAG retrieval cache
- cache top‑k doc IDs for stable queries.
Never cache raw PII; cache keys use hashes of redacted inputs.
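A minimal sketch of the response-cache key described above, assuming inputs are already redacted upstream: the payload is canonicalized (strings trimmed, keys sorted) and hashed, so semantically identical requests hit the same entry. Function names are illustrative.

```python
import hashlib
import json

def canonical_input_hash(payload: dict) -> str:
    """Hash a canonicalized form of the input to maximize cache hits."""
    def norm(v):
        if isinstance(v, str):
            return v.strip()
        if isinstance(v, dict):
            return {k: norm(v[k]) for k in sorted(v)}
        if isinstance(v, list):
            return [norm(x) for x in v]
        return v
    # Stable serialization: sorted keys, no whitespace variation.
    blob = json.dumps(norm(payload), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def response_cache_key(prompt_version: str, model: str,
                       payload: dict, retrieval_config_hash: str) -> str:
    """Build the (prompt_version, model, input hash, retrieval config) key."""
    return ":".join((prompt_version, model,
                     canonical_input_hash(payload), retrieval_config_hash))
```

Note that only the hash of the (redacted) input enters the key, so no raw PII lands in cache storage.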
4. Cost Controls
- Prefer cheaper models for low‑risk tasks; escalate to stronger models only when needed.
- Use staged pipelines (rules/heuristics/RAG) to reduce LLM calls.
- Batch non‑interactive jobs (classification, report gen).
- Track tokens in/out per request and per tenant.
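The escalation idea (cheap model first, stronger model only when needed) can be sketched as below. The callables and the confidence-based trigger are assumptions for illustration; real escalation criteria are to be defined per task.

```python
def run_task(text, cheap_model, strong_model, confidence_threshold=0.8):
    """Try the cheap model first; escalate only if confidence is low.

    cheap_model / strong_model are caller-supplied callables returning
    (answer, confidence) tuples; both are illustrative placeholders.
    Returns (answer, which_model_was_used).
    """
    answer, confidence = cheap_model(text)
    if confidence >= confidence_threshold:
        return answer, "cheap"
    # Low confidence: pay for the stronger model.
    answer, confidence = strong_model(text)
    return answer, "strong"
```

Logging which branch was taken, per tenant, feeds directly into the token/cost tracking above.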
5. Fallbacks
- On timeouts/errors: retry with backoff, then fallback to safe default or human review.
- On budget exhaustion: degrade gracefully (limited features, queue jobs, ask user).
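The retry-then-fallback behavior can be sketched as follows. This is a minimal illustration: `fn` stands in for the LLM call, and `fallback` for a safe default or a marker that routes the request to human review.

```python
import random
import time

def call_with_fallback(fn, fallback, max_retries=2, base_delay_s=0.5):
    """Retry fn with exponential backoff and jitter; return fallback on failure."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                # Retries exhausted: degrade to the safe default.
                return fallback
            # Exponential backoff with jitter to avoid thundering herds.
            time.sleep(base_delay_s * (2 ** attempt) * (1 + random.random()))
    return fallback
```

In a real system the except clause would be narrowed to timeout/transient errors, and permanent errors (e.g. auth failures) would fail fast instead of retrying.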
6. Monitoring
- Dashboards for cost, latency, cache hit rate, retry rate.
- Alerts for spikes, anomaly tenants, or runaway loops.
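A minimal in-process sketch of the metrics these dashboards need, using cache hit rate as the example. A production system would export these counters to a metrics backend; the class and event names here are illustrative.

```python
from collections import Counter

class LlmMetrics:
    """Minimal counters for cost/latency/cache dashboards (sketch only)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, event: str) -> None:
        """Count one occurrence of an event, e.g. 'cache_hit' or 'retry'."""
        self.counts[event] += 1

    def cache_hit_rate(self) -> float:
        """Hits / (hits + misses); 0.0 when nothing has been recorded."""
        hits = self.counts["cache_hit"]
        misses = self.counts["cache_miss"]
        total = hits + misses
        return hits / total if total else 0.0
```

Alerting on a sudden drop in this rate, or a spike in retry counts for one tenant, covers the runaway-loop case above.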