Add foundational documentation templates to support product design and architecture planning, including ADR, archetypes, LLM systems, dev setup, and shared modules.
This commit is contained in:
54
docs/llm/caching-costs.md
Normal file
54
docs/llm/caching-costs.md
Normal file
@@ -0,0 +1,54 @@
|
||||
# LLM System: Caching & Cost Control (Starter Template)
|
||||
|
||||
---
|
||||
**Last Updated:** 2025-12-12
|
||||
**Phase:** Phase 0 (Planning)
|
||||
**Status:** Draft — finalize in Phase 1
|
||||
**Owner:** AI/LLM Lead + Backend Architect
|
||||
**References:**
|
||||
- `/docs/llm/prompting.md`
|
||||
- `/docs/llm/evals.md`
|
||||
---
|
||||
|
||||
This document defines how to keep LLM usage reliable and within budget.
|
||||
|
||||
## 1. Goals
|
||||
- Minimize cost while preserving quality.
|
||||
- Keep latency predictable for user flows.
|
||||
- Avoid repeated work (idempotency + caching).
|
||||
|
||||
## 2. Budgets & Limits
|
||||
Define per tenant and per feature:
|
||||
- monthly token/cost cap,
|
||||
- per‑request max tokens,
|
||||
- max retries/timeouts,
|
||||
- concurrency limits.
|
||||
|
||||
## 3. Caching Layers
|
||||
Pick what applies:
|
||||
1. **Input normalization cache**
|
||||
- canonicalize inputs (trim, stable ordering) to increase hit rate.
|
||||
2. **LLM response cache**
|
||||
- key: `(prompt_version, model, canonical_input_hash, retrieval_config_hash)`.
|
||||
- TTL depends on volatility of the task.
|
||||
3. **Embeddings cache**
|
||||
- store embeddings for reusable texts/items.
|
||||
4. **RAG retrieval cache**
|
||||
- cache top‑k doc IDs for stable queries.
|
||||
|
||||
> Never cache raw PII; cache keys use hashes of redacted inputs.
|
||||
|
||||
## 4. Cost Controls
|
||||
- Prefer cheaper models for low‑risk tasks; escalate to stronger models only when needed.
|
||||
- Use staged pipelines (rules/heuristics/RAG) to reduce LLM calls.
|
||||
- Batch non‑interactive jobs (classification, report gen).
|
||||
- Track tokens in/out per request and per tenant.
|
||||
|
||||
## 5. Fallbacks
|
||||
- On timeouts/errors: retry with backoff, then fallback to safe default or human review.
|
||||
- On budget exhaustion: degrade gracefully (limited features, queue jobs, ask user).
|
||||
|
||||
## 6. Monitoring
|
||||
- Dashboards for cost, latency, cache hit rate, retry rate.
|
||||
- Alerts for spikes, anomaly tenants, or runaway loops.
|
||||
|
||||
Reference in New Issue
Block a user