Add foundational documentation templates to support product design and architecture planning, including ADR, archetypes, LLM systems, dev setup, and shared modules.

This commit is contained in:
olekhondera
2025-12-12 02:31:03 +02:00
parent 5053235e95
commit c905cbb725
26 changed files with 759 additions and 65 deletions

docs/llm/evals.md Normal file

@@ -0,0 +1,73 @@
# LLM System: Evals & Quality (Starter Template)
---
**Last Updated:** 2025-12-12
**Phase:** Phase 0 (Planning)
**Status:** Draft — finalize in Phase 1
**Owner:** AI/LLM Lead + Test Engineer
**References:**
- `/docs/llm/prompting.md`
- `/docs/llm/safety.md`
---
This document defines how to measure LLM quality and prevent regressions.
## 1. Goals
- Detect prompt/model regressions before production.
- Track accuracy, safety, latency, and cost over time.
- Provide a repeatable path for improving prompts and RAG.
## 2. Eval Suite Types
Combine up to three layers, depending on the archetype:
1. **Unit evals (offline, deterministic)**
- Small golden set, strict expected outputs.
2. **Integration evals (offline, realistic)**
- Full pipeline including retrieval, tools, and postprocessing.
3. **Online evals (production, controlled)**
   - Shadow runs, A/B, canary prompts, RUM-style metrics.
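The unit-eval layer above is the simplest to stand up: a small golden set with strict expected outputs, compared exactly against model output. A minimal sketch, assuming a hypothetical `model_fn` callable (here stubbed with `fake_model`; swap in your provider client):

```python
from dataclasses import dataclass

@dataclass
class GoldenCase:
    prompt: str
    expected: str  # strict expected output (unit evals are deterministic)

def run_unit_evals(model_fn, cases):
    """Return the pass rate over a golden set using exact comparison."""
    passed = sum(1 for c in cases if model_fn(c.prompt).strip() == c.expected)
    return passed / len(cases)

# Stand-in for a real model call; replace with your provider client.
def fake_model(prompt):
    return {"2+2?": "4", "Capital of France?": "Paris"}.get(prompt, "")

cases = [GoldenCase("2+2?", "4"), GoldenCase("Capital of France?", "Paris")]
print(run_unit_evals(fake_model, cases))  # 1.0
```

Integration evals reuse the same harness but call the full pipeline (retrieval, tools, postprocessing) instead of the bare model.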
## 3. Datasets
- Maintain **versioned eval datasets** with:
- input,
- expected output or rubric,
- metadata (domain, difficulty, edge cases).
- Include adversarial cases:
- prompt injection,
- ambiguous queries,
- long/noisy inputs,
  - PII-rich inputs (to test redaction).
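One convenient storage shape for versioned eval datasets is one JSON record per line (JSONL). The field names below are illustrative, not a fixed schema; note how a record can carry either a strict expected output or a rubric, plus the metadata tags listed above:

```python
import json

# One eval record per JSONL line; all field names are assumptions to adapt.
record = {
    "id": "case-0007",
    "dataset_version": "v3",
    "input": "Summarize the attached support ticket.",
    "expected": None,  # omit/None when grading by rubric instead
    "rubric": "Covers key facts; no hallucinated entities.",
    "metadata": {
        "domain": "support",
        "difficulty": "hard",
        "edge_case": "prompt_injection",  # tags adversarial cases
    },
}
line = json.dumps(record)
print(json.loads(line)["metadata"]["edge_case"])  # prompt_injection
```

Keeping `dataset_version` on every record lets eval runs pin exactly which data they were scored against.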
## 4. Metrics (suggested)
Choose per archetype:
- **Task quality:** accuracy/F1, exact match, rubric score, human preference rate.
- **Safety:** refusal correctness, policy violations, PII leakage rate.
- **Robustness:** format-valid rate, tool-call correctness, retry rate.
- **Performance:** p50/p95 latency, tokens in/out, cost per task.
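Two of these metrics are cheap to compute from raw run data and worth standardizing early: exact-match rate and p95 latency. A sketch (the nearest-rank percentile definition is one common choice, not the only one):

```python
import math

def exact_match_rate(outputs, expected):
    """Fraction of outputs that exactly match the expected answers."""
    return sum(o == e for o, e in zip(outputs, expected)) / len(expected)

def p95(latencies_ms):
    """Nearest-rank p95: smallest value with >=95% of samples at or below it."""
    s = sorted(latencies_ms)
    idx = max(0, math.ceil(0.95 * len(s)) - 1)
    return s[idx]

print(exact_match_rate(["4", "Paris", "5"], ["4", "Paris", "4"]))  # ≈ 0.667
print(p95([100, 120, 130, 900]))  # 900
```

Token counts and cost per task come from the provider's usage fields and can be aggregated the same way.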
## 5. Regression Policy
- Every prompt or model change must run evals.
- Define gates:
- no safety regressions,
- quality must improve or stay within tolerance,
- latency/cost budgets respected.
- If a gate fails: block the rollout, or require an explicit override recorded in `RECOMMENDATIONS.md`.
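The gates above can be encoded as a simple check of a candidate run against a baseline. A minimal sketch; the tolerance and budget values are placeholder assumptions to be locked in Phase 1:

```python
def gates_pass(baseline, candidate, quality_tol=0.01, latency_budget_ms=2000):
    """Return (ok, reasons). The candidate must not regress safety, must stay
    within the quality tolerance, and must respect the latency budget.
    Thresholds here are illustrative defaults, not policy."""
    reasons = []
    if candidate["safety_violations"] > baseline["safety_violations"]:
        reasons.append("safety regression")
    if candidate["quality"] < baseline["quality"] - quality_tol:
        reasons.append("quality below tolerance")
    if candidate["p95_latency_ms"] > latency_budget_ms:
        reasons.append("latency budget exceeded")
    return (not reasons, reasons)

ok, why = gates_pass(
    {"safety_violations": 0, "quality": 0.82},
    {"safety_violations": 0, "quality": 0.815, "p95_latency_ms": 1800},
)
print(ok)  # True (quality drop of 0.005 is within the 0.01 tolerance)
```

Wiring this into CI makes "every prompt or model change must run evals" enforceable rather than aspirational.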
## 6. Human Review Loop
- For tasks without ground truth, use rubric-based human grading.
- Sample strategy:
- new prompt versions → 100% review on small batch,
- stable versions → periodic audits.
## 7. Logging for Evals
- Store eval runs with:
- prompt version,
- model/provider version,
- retrieval config version (if used),
- inputs/outputs,
- metrics + artifacts.
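A stored eval run can be a single record tying all of those versions together, so any metric is reproducible later. The field names below are illustrative assumptions:

```python
import json

# Illustrative shape of one stored eval run; adapt field names as needed.
run = {
    "run_id": "run-0042",
    "prompt_version": "summarize-v4",
    "model": "provider/model-2025-06",          # model + provider version
    "retrieval_config_version": "rag-v2",       # omit if no RAG
    "dataset_version": "v3",
    "metrics": {"exact_match": 0.91, "p95_latency_ms": 1450},
    "artifacts": ["outputs.jsonl"],             # inputs/outputs stored alongside
}
print(json.dumps(run, indent=2))
```

Because the record pins prompt, model, retrieval config, and dataset versions, a regression can be bisected to the one thing that changed.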
## 8. Open Questions to Lock in Phase 1
- Where datasets live (repo vs storage)?
- Which metrics are hard gates for MVP?
- Online eval strategy (shadow vs A/B) and sample sizes?