Add foundational documentation templates to support product design and architecture planning, including ADR, archetypes, LLM systems, dev setup, and shared modules.

2025-12-12 02:31:03 +02:00
parent 5053235e95
commit c905cbb725
26 changed files with 759 additions and 65 deletions
--- a/docs/llm/caching-costs.md
+++ b/docs/llm/caching-costs.md
@@ -0,0 +1,54 @@
+# LLM System: Caching & Cost Control (Starter Template)
+
+---
+**Last Updated:** 2025-12-12  
+**Phase:** Phase 0 (Planning)  
+**Status:** Draft — finalize in Phase 1  
+**Owner:** AI/LLM Lead + Backend Architect  
+**References:**
+- `/docs/llm/prompting.md`
+- `/docs/llm/evals.md`
+---
+
+This document defines how to keep LLM usage reliable and within budget.
+
+## 1. Goals
+- Minimize cost while preserving quality.
+- Keep latency predictable for user flows.
+- Avoid repeated work (idempotency + caching).
+
+## 2. Budgets & Limits
+Define per tenant and per feature:
+- monthly token/cost cap,
+- per‑request max tokens,
+- max retries/timeouts,
+- concurrency limits.
+
+## 3. Caching Layers
+Pick what applies:
+1. **Input normalization cache**  
+   - canonicalize inputs (trim, stable ordering) to increase hit rate.
+2. **LLM response cache**  
+   - key: `(prompt_version, model, canonical_input_hash, retrieval_config_hash)`.
+   - TTL depends on how volatile the task's inputs and source data are (shorter for fast‑changing data).
+3. **Embeddings cache**  
+   - store embeddings for reusable texts/items.
+4. **RAG retrieval cache**  
+   - cache top‑k doc IDs for stable queries.
+
+> Never cache raw PII; derive cache keys from hashes of redacted inputs.
+
+## 4. Cost Controls
+- Prefer cheaper models for low‑risk tasks; escalate to stronger models only when needed.
+- Use staged pipelines (rules/heuristics/RAG) to reduce LLM calls.
+- Batch non‑interactive jobs (classification, report generation).
+- Track tokens in/out per request and per tenant.
+
+## 5. Fallbacks
+- On timeouts/errors: retry with backoff, then fall back to a safe default or human review.
+- On budget exhaustion: degrade gracefully (limit features, queue jobs, or ask the user).
+
+## 6. Monitoring
+- Dashboards for cost, latency, cache hit rate, retry rate.
+- Alerts for spikes, anomalous tenants, or runaway loops.
+
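
The response-cache key in section 3 of `docs/llm/caching-costs.md` could be built roughly as follows. This is a minimal sketch, not part of the commit: the names `canonicalize`, `response_cache_key`, and `TTLCache` are hypothetical, the in-memory dict stands in for whatever cache store the project picks, and inputs are assumed to be JSON-serializable and already redacted before they reach this code.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Optional, Tuple


def canonicalize(payload: Dict[str, Any]) -> str:
    """Trim top-level string values and serialize with sorted keys."""
    cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in payload.items()}
    return json.dumps(cleaned, sort_keys=True, separators=(",", ":"))


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def response_cache_key(
    prompt_version: str,
    model: str,
    redacted_input: Dict[str, Any],    # assumption: PII already removed upstream
    retrieval_config: Dict[str, Any],
) -> str:
    """Hash the four key components named in the template into one string."""
    parts = [
        prompt_version,
        model,
        sha256_hex(canonicalize(redacted_input)),
        sha256_hex(canonicalize(retrieval_config)),
    ]
    # JSON-encode the list so component boundaries stay unambiguous.
    return sha256_hex(json.dumps(parts))


class TTLCache:
    """Tiny in-memory cache with a single TTL (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key: str, value: str) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)
```

Because the key includes `prompt_version` and `model`, changing either one naturally misses the cache instead of serving stale responses, and canonicalizing before hashing lets inputs that differ only in whitespace or key order share an entry.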