# LLM System: Caching & Cost Control (Starter Template)

---
**Last Updated:** 2025-12-12
**Phase:** Phase 0 (Planning)
**Status:** Draft — finalize in Phase 1
**Owner:** AI/LLM Lead + Backend Architect
**References:**
- `/docs/llm/prompting.md`
- `/docs/llm/evals.md`
---

This document defines how to keep LLM usage reliable and within budget.

## 1. Goals
- Minimize cost while preserving quality.
- Keep latency predictable for user flows.
- Avoid repeated work (idempotency + caching).

## 2. Budgets & Limits
Define per tenant and per feature:
- monthly token/cost cap,
- per‑request max tokens,
- max retries/timeouts,
- concurrency limits.
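
The limits listed above could be captured in a small config object keyed by tenant and feature. What follows is a minimal sketch, not a prescribed schema: `FeatureBudget`, its field names, the `BUDGETS` table, and every number in it are hypothetical placeholders to be settled in Phase 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureBudget:
    """Hypothetical per-tenant, per-feature limits (all values are placeholders)."""
    monthly_token_cap: int        # total tokens allowed per calendar month
    max_tokens_per_request: int   # hard ceiling for a single request
    max_retries: int              # retries allowed after the first attempt
    timeout_seconds: float        # per-call timeout
    max_concurrency: int          # simultaneous in-flight requests


# Assumed lookup table keyed by (tenant_id, feature); a real system would load it from config.
BUDGETS = {
    ("tenant-a", "summarize"): FeatureBudget(2_000_000, 1_024, 2, 30.0, 4),
}


def check_request(tenant_id: str, feature: str, tokens_used_this_month: int,
                  requested_tokens: int) -> tuple[bool, str]:
    """Return (allowed, reason) for a request checked against its budget."""
    budget = BUDGETS.get((tenant_id, feature))
    if budget is None:
        return False, "no budget configured"
    if requested_tokens > budget.max_tokens_per_request:
        return False, "per-request token limit exceeded"
    if tokens_used_this_month + requested_tokens > budget.monthly_token_cap:
        return False, "monthly cap exceeded"
    return True, "ok"
```

Denying requests that have no configured budget is one possible default; a permissive default with a global cap would be an equally valid choice for this template.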

## 3. Caching Layers
Pick what applies:
1. **Input normalization cache**
   - canonicalize inputs (trim, stable ordering) to increase hit rate.
2. **LLM response cache**
   - key: `(prompt_version, model, canonical_input_hash, retrieval_config_hash)`.
   - TTL depends on volatility of the task.
3. **Embeddings cache**
   - store embeddings for reusable texts/items.
4. **RAG retrieval cache**
   - cache top‑k doc IDs for stable queries.
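
The normalization step and the response-cache key from the list above can be sketched together. This is an illustration under assumptions: `canonicalize`, `response_cache_key`, and `TTLCache` are hypothetical names, inputs are assumed to be JSON-serializable dicts, and the in-memory store stands in for whatever cache backend the project eventually picks.

```python
import hashlib
import json
import time


def canonicalize(payload: dict) -> str:
    """Trim top-level string values and serialize with sorted keys."""
    cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in payload.items()}
    return json.dumps(cleaned, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def response_cache_key(prompt_version: str, model: str, payload: dict,
                       retrieval_config: dict) -> str:
    """Combine the four key components into a single digest."""
    parts = (
        prompt_version,
        model,
        sha256_hex(canonicalize(payload)),
        sha256_hex(json.dumps(retrieval_config, sort_keys=True)),
    )
    return sha256_hex("|".join(parts))


class TTLCache:
    """In-memory stand-in for a real cache backend (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock  # injectable so expiry can be tested without sleeping

    def set(self, key: str, value, ttl_seconds: float) -> None:
        self._store[key] = (self._clock() + ttl_seconds, value)

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # drop expired entries lazily on read
            return None
        return value
```

With this sketch, `{"q": " hello ", "lang": "en"}` and `{"lang": "en", "q": "hello"}` produce the same key, while bumping `prompt_version` produces a different one. The same `TTLCache` shape could plausibly hold embeddings or top-k doc IDs, keyed by a hash of the text or query.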

> Never cache raw PII; derive cache keys from hashes of redacted inputs.
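
One way to follow the PII rule above is to redact before hashing. The sketch below is only an illustration: `redact`, `redacted_key`, and the two regular expressions are hypothetical and far from complete PII detection, which would normally rely on a vetted library or service.

```python
import hashlib
import re

# Illustrative patterns only; they catch simple emails and phone-like digit runs.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace matched spans with fixed placeholders before anything is hashed or stored."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)


def redacted_key(text: str) -> str:
    """Cache key derived from the redacted text, never from the raw input."""
    return hashlib.sha256(redact(text).encode("utf-8")).hexdigest()
```

A consequence of this approach is that inputs differing only in their redacted spans share a key, so a response whose content depends on the redacted value should not be stored under it.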

## 4. Cost Controls
- Prefer cheaper models for low‑risk tasks; escalate to stronger models only when needed.
- Use staged pipelines (rules/heuristics/RAG) to reduce LLM calls.
- Batch non‑interactive jobs (classification, report generation).
- Track tokens in/out per request and per tenant.
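
The "cheap first, escalate when needed" rule and per-tenant token tracking might look like the sketch below. It rests on assumptions the template does not make: the model names are placeholders, `call_model` is a hypothetical adapter that returns a confidence score, and a confidence threshold is only one possible trigger for escalation.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical model names; substitute whatever tiers the project adopts.
CHEAP_MODEL = "small-model"
STRONG_MODEL = "large-model"

# (tenant_id, direction) -> token count; direction is "in" or "out".
token_usage = defaultdict(int)


def record_usage(tenant_id: str, tokens_in: int, tokens_out: int) -> None:
    token_usage[(tenant_id, "in")] += tokens_in
    token_usage[(tenant_id, "out")] += tokens_out


def answer_with_escalation(
    tenant_id: str,
    prompt: str,
    call_model: Callable[[str, str], tuple[str, float, int, int]],
    min_confidence: float = 0.8,
) -> tuple[str, str]:
    """Try the cheap model first; escalate only if its confidence is too low.

    `call_model(model, prompt)` is an assumed adapter returning
    (text, confidence, tokens_in, tokens_out).
    """
    text, confidence, t_in, t_out = call_model(CHEAP_MODEL, prompt)
    record_usage(tenant_id, t_in, t_out)
    if confidence >= min_confidence:
        return text, CHEAP_MODEL
    # Both calls are recorded, so the cost of escalation stays visible per tenant.
    text, _, t_in, t_out = call_model(STRONG_MODEL, prompt)
    record_usage(tenant_id, t_in, t_out)
    return text, STRONG_MODEL
```

Passing the model call in as a function keeps the routing logic testable without network access and independent of any particular provider SDK.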

## 5. Fallbacks
- On timeouts/errors: retry with backoff, then fall back to a safe default or human review.
- On budget exhaustion: degrade gracefully (limited features, queue jobs, ask user).
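
The retry-then-fallback behaviour for timeouts and errors can be sketched as a small wrapper. `call_with_fallback`, its default retry count, and the delay values are hypothetical; exponential backoff with jitter is one common choice, not something the template mandates.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(
    call: Callable[[], T],
    fallback: Callable[[], T],
    max_retries: int = 2,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`; on failure retry with exponential backoff, then use `fallback`."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:  # a real system should catch only timeout/transient errors
            if attempt == max_retries:
                break
            # Base delay doubles each attempt (0.5s, 1s, ...) plus up to 10% jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    return fallback()
```

The `fallback` callable is where a safe default answer or a hand-off to human review would plug in; `sleep` is injectable so tests do not have to wait.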

## 6. Monitoring
- Dashboards for cost, latency, cache hit rate, retry rate.
- Alerts for spikes, anomalous tenants, or runaway loops.
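
Two of the signals above reduce to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the "more than 3x the trailing average" spike rule is an assumed threshold, not one the template defines.

```python
from statistics import mean


def cache_hit_rate(hits: int, misses: int) -> float:
    """Fraction of lookups served from cache; 0.0 when there were no lookups."""
    total = hits + misses
    return hits / total if total else 0.0


def is_cost_spike(daily_costs: list[float], today: float, factor: float = 3.0) -> bool:
    """Flag today if it exceeds `factor` times the trailing average (assumed rule)."""
    if not daily_costs:
        return False  # no history yet, so nothing to compare against
    return today > factor * mean(daily_costs)
```

Running `is_cost_spike` per tenant rather than globally is what would surface a single anomalous tenant or a runaway loop hidden inside an otherwise normal total.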