LLM System: Caching & Cost Control (Starter Template)
Last Updated: 2025-12-12
Phase: Phase 0 (Planning)
Status: Draft — finalize in Phase 1
Owner: AI/LLM Lead + Backend Architect
References:
/docs/llm/prompting.md
/docs/llm/evals.md
This document defines how to keep LLM usage reliable and within budget.
1. Goals
- Minimize cost while preserving quality.
- Keep latency predictable for user flows.
- Avoid repeated work (idempotency + caching).
2. Budgets & Limits
Define per tenant and per feature:
- monthly token/cost cap,
- per‑request max tokens,
- max retries/timeouts,
- concurrency limits.
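The limits above can be captured in a small per-tenant, per-feature table. This is an illustrative sketch: the class and field names (`LlmBudget`, `budget_for`, etc.) are placeholders to be finalized in Phase 1, not an existing API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LlmBudget:
    # All field names are illustrative placeholders.
    monthly_cost_cap_usd: float
    per_request_max_tokens: int
    max_retries: int
    request_timeout_s: float
    max_concurrency: int

# Example limit table, keyed by (tenant, feature).
BUDGETS = {
    ("tenant_a", "summarize"): LlmBudget(50.0, 1024, 2, 30.0, 4),
    ("tenant_a", "classify"):  LlmBudget(10.0, 256, 1, 10.0, 8),
}

def budget_for(tenant: str, feature: str) -> LlmBudget:
    """Look up the budget for one tenant/feature pair."""
    return BUDGETS[(tenant, feature)]
```

In practice this table would live in config or a database so limits can change without a deploy.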
3. Caching Layers
Pick what applies:
- Input normalization cache
- canonicalize inputs (trim, stable ordering) to increase hit rate.
- LLM response cache
- key: (prompt_version, model, canonical_input_hash, retrieval_config_hash).
- TTL depends on volatility of the task.
- Embeddings cache
- store embeddings for reusable texts/items.
- RAG retrieval cache
- cache top‑k doc IDs for stable queries.
Never cache raw PII; cache keys use hashes of redacted inputs.
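A minimal sketch of the response-cache key described above, assuming inputs are already redacted upstream: the payload is canonicalized (strings trimmed, keys sorted) and hashed, so semantically identical requests hit the same entry. Function names are illustrative.

```python
import hashlib
import json

def canonical_input_hash(payload: dict) -> str:
    """Hash a canonicalized form of the input to maximize cache hits."""
    def norm(v):
        if isinstance(v, str):
            return v.strip()
        if isinstance(v, dict):
            return {k: norm(v[k]) for k in sorted(v)}
        if isinstance(v, list):
            return [norm(x) for x in v]
        return v
    # Stable serialization: sorted keys, no whitespace variation.
    blob = json.dumps(norm(payload), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def response_cache_key(prompt_version: str, model: str,
                       payload: dict, retrieval_config_hash: str) -> str:
    """Build the (prompt_version, model, input hash, retrieval config) key."""
    return ":".join((prompt_version, model,
                     canonical_input_hash(payload), retrieval_config_hash))
```

Note that only the hash of the (redacted) input enters the key, so no raw PII lands in cache storage.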
4. Cost Controls
- Prefer cheaper models for low‑risk tasks; escalate to stronger models only when needed.
- Use staged pipelines (rules/heuristics/RAG) to reduce LLM calls.
- Batch non‑interactive jobs (classification, report gen).
- Track tokens in/out per request and per tenant.
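The escalation idea (cheap model first, stronger model only when needed) can be sketched as below. The callables and the confidence-based trigger are assumptions for illustration; real escalation criteria are to be defined per task.

```python
def run_task(text, cheap_model, strong_model, confidence_threshold=0.8):
    """Try the cheap model first; escalate only if confidence is low.

    cheap_model / strong_model are caller-supplied callables returning
    (answer, confidence) tuples; both are illustrative placeholders.
    Returns (answer, which_model_was_used).
    """
    answer, confidence = cheap_model(text)
    if confidence >= confidence_threshold:
        return answer, "cheap"
    # Low confidence: pay for the stronger model.
    answer, confidence = strong_model(text)
    return answer, "strong"
```

Logging which branch was taken, per tenant, feeds directly into the token/cost tracking above.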
5. Fallbacks
- On timeouts/errors: retry with backoff, then fallback to safe default or human review.
- On budget exhaustion: degrade gracefully (limited features, queue jobs, ask user).
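The retry-then-fallback behavior can be sketched as follows. This is a minimal illustration: `fn` stands in for the LLM call, and `fallback` for a safe default or a marker that routes the request to human review.

```python
import random
import time

def call_with_fallback(fn, fallback, max_retries=2, base_delay_s=0.5):
    """Retry fn with exponential backoff and jitter; return fallback on failure."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                # Retries exhausted: degrade to the safe default.
                return fallback
            # Exponential backoff with jitter to avoid thundering herds.
            time.sleep(base_delay_s * (2 ** attempt) * (1 + random.random()))
    return fallback
```

In a real system the except clause would be narrowed to timeout/transient errors, and permanent errors (e.g. auth failures) would fail fast instead of retrying.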
6. Monitoring
- Dashboards for cost, latency, cache hit rate, retry rate.
- Alerts for spikes, anomaly tenants, or runaway loops.
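A minimal in-process sketch of the metrics these dashboards need, using cache hit rate as the example. A production system would export these counters to a metrics backend; the class and event names here are illustrative.

```python
from collections import Counter

class LlmMetrics:
    """Minimal counters for cost/latency/cache dashboards (sketch only)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, event: str) -> None:
        """Count one occurrence of an event, e.g. 'cache_hit' or 'retry'."""
        self.counts[event] += 1

    def cache_hit_rate(self) -> float:
        """Hits / (hits + misses); 0.0 when nothing has been recorded."""
        hits = self.counts["cache_hit"]
        misses = self.counts["cache_miss"]
        total = hits + misses
        return hits / total if total else 0.0
```

Alerting on a sudden drop in this rate, or a spike in retry counts for one tenant, covers the runaway-loop case above.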