Add foundational documentation templates to support product design and architecture planning, including ADR, archetypes, LLM systems, dev setup, and shared modules.

This commit is contained in:
olekhondera
2025-12-12 02:31:03 +02:00
parent 5053235e95
commit c905cbb725
26 changed files with 759 additions and 65 deletions

docs/llm/caching-costs.md Normal file

@@ -0,0 +1,54 @@
# LLM System: Caching & Cost Control (Starter Template)
---
**Last Updated:** 2025-12-12
**Phase:** Phase 0 (Planning)
**Status:** Draft — finalize in Phase 1
**Owner:** AI/LLM Lead + Backend Architect
**References:**
- `/docs/llm/prompting.md`
- `/docs/llm/evals.md`
---
This document defines how to keep LLM usage reliable and within budget.
## 1. Goals
- Minimize cost while preserving quality.
- Keep latency predictable for user flows.
- Avoid repeated work (idempotency + caching).
## 2. Budgets & Limits
Define per tenant and per feature:
- monthly token/cost cap,
- per-request max tokens,
- max retries/timeouts,
- concurrency limits.
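The limits above can be captured in a small config object so they are explicit and testable per tenant/feature. A minimal sketch (field names and defaults are illustrative, to be finalized in Phase 1):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LlmBudget:
    """Per-tenant, per-feature LLM limits. All names/values are placeholders."""
    monthly_cost_cap_usd: float = 100.0
    per_request_max_tokens: int = 2048
    max_retries: int = 3
    request_timeout_s: float = 30.0
    max_concurrency: int = 5


# Example: a tenant on a tighter plan overrides only what differs.
starter_plan = LlmBudget(monthly_cost_cap_usd=25.0, max_concurrency=2)
```

A frozen dataclass keeps budgets immutable at runtime; changes go through config review rather than ad-hoc mutation.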
## 3. Caching Layers
Pick what applies:
1. **Input normalization cache**
- canonicalize inputs (trim, stable ordering) to increase hit rate.
2. **LLM response cache**
- key: `(prompt_version, model, canonical_input_hash, retrieval_config_hash)`.
- TTL depends on volatility of the task.
3. **Embeddings cache**
- store embeddings for reusable texts/items.
4. **RAG retrieval cache**
- cache top-k doc IDs for stable queries.
> Never cache raw PII; cache keys use hashes of redacted inputs.
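The cache-key scheme from layer 2 can be sketched as follows: canonicalize the (already redacted) input, hash it, and combine with prompt version, model, and retrieval config. Function and field names here are illustrative:

```python
import hashlib
import json


def canonical_input_hash(payload: dict) -> str:
    """Hash a redacted, canonicalized input: trimmed strings, stable key order."""
    canonical = json.dumps(
        {k: v.strip() if isinstance(v, str) else v for k, v in payload.items()},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def response_cache_key(prompt_version: str, model: str, payload: dict,
                       retrieval_config_hash: str = "none") -> str:
    """Key = (prompt_version, model, canonical_input_hash, retrieval_config_hash)."""
    return f"{prompt_version}:{model}:{canonical_input_hash(payload)}:{retrieval_config_hash}"
```

Because keys are derived from hashes of redacted inputs, no raw PII ever reaches the cache store.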
## 4. Cost Controls
- Prefer cheaper models for low-risk tasks; escalate to stronger models only when needed.
- Use staged pipelines (rules/heuristics/RAG) to reduce LLM calls.
- Batch non-interactive jobs (classification, report gen).
- Track tokens in/out per request and per tenant.
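The escalation rule can be made explicit in a tiny router. A sketch under assumed inputs (model names, risk labels, and the confidence threshold are placeholders; the confidence value would come from an upstream heuristic/rules stage):

```python
def route_model(task_risk: str, heuristic_confidence: float) -> str:
    """Pick the cheapest model that fits the task.

    task_risk: "low" or "high" (illustrative labels).
    heuristic_confidence: 0..1 score from the pre-LLM rules/RAG stage.
    """
    if task_risk == "low" and heuristic_confidence >= 0.8:
        return "small-model"   # cheap tier for routine, well-covered cases
    return "large-model"       # escalate when risk or uncertainty is high
```

Keeping routing in one pure function makes the cost policy easy to unit-test and to adjust when budgets change.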
## 5. Fallbacks
- On timeouts/errors: retry with backoff, then fallback to safe default or human review.
- On budget exhaustion: degrade gracefully (limited features, queue jobs, ask user).
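The retry-then-fallback behavior can be sketched as a small wrapper (exception types, delay values, and the fallback signature are assumptions for illustration):

```python
import random
import time


def call_with_fallback(call, fallback, max_retries: int = 3, base_delay_s: float = 0.5):
    """Retry `call` with exponential backoff + jitter; on exhaustion, return
    the safe default from `fallback` (or route to human review there)."""
    for attempt in range(max_retries):
        try:
            return call()
        except (TimeoutError, ConnectionError):
            if attempt == max_retries - 1:
                break  # retries exhausted; fall through to the safe default
            # Backoff doubles each attempt; jitter avoids thundering herds.
            time.sleep(base_delay_s * (2 ** attempt) * (1 + random.random()))
    return fallback()
```

Jittered backoff prevents synchronized retries from amplifying an outage, and the explicit fallback keeps user flows degraded-but-functional instead of failing hard.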
## 6. Monitoring
- Dashboards for cost, latency, cache hit rate, retry rate.
- Alerts for spikes, anomaly tenants, or runaway loops.