Add foundational documentation templates to support product design and architecture planning, including ADR, archetypes, LLM systems, dev setup, and shared modules.

This commit is contained in:
olekhondera
2025-12-12 02:31:03 +02:00
parent 5053235e95
commit c905cbb725
26 changed files with 759 additions and 65 deletions

docs/llm/evals.md Normal file

@@ -0,0 +1,73 @@
# LLM System: Evals & Quality (Starter Template)
---
**Last Updated:** 2025-12-12
**Phase:** Phase 0 (Planning)
**Status:** Draft — finalize in Phase 1
**Owner:** AI/LLM Lead + Test Engineer
**References:**
- `/docs/llm/prompting.md`
- `/docs/llm/safety.md`
---
This document defines how to measure LLM quality and prevent regressions.
## 1. Goals
- Detect prompt/model regressions before production.
- Track accuracy, safety, latency, and cost over time.
- Provide a repeatable path for improving prompts and RAG.
## 2. Eval Suite Types
Combine up to three layers, depending on the archetype:
1. **Unit evals (offline, deterministic)**
- Small golden set, strict expected outputs.
2. **Integration evals (offline, realistic)**
- Full pipeline including retrieval, tools, and postprocessing.
3. **Online evals (production, controlled)**
   - Shadow runs, A/B, canary prompts, RUM-style metrics.
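The unit-eval layer above is the simplest to stand up: a small golden set with strict expected outputs, compared exactly against model output. A minimal sketch, assuming a hypothetical `model_fn` callable (here stubbed with `fake_model`; swap in your provider client):

```python
from dataclasses import dataclass

@dataclass
class GoldenCase:
    prompt: str
    expected: str  # strict expected output (unit evals are deterministic)

def run_unit_evals(model_fn, cases):
    """Return the pass rate over a golden set using exact comparison."""
    passed = sum(1 for c in cases if model_fn(c.prompt).strip() == c.expected)
    return passed / len(cases)

# Stand-in for a real model call; replace with your provider client.
def fake_model(prompt):
    return {"2+2?": "4", "Capital of France?": "Paris"}.get(prompt, "")

cases = [GoldenCase("2+2?", "4"), GoldenCase("Capital of France?", "Paris")]
print(run_unit_evals(fake_model, cases))  # 1.0
```

Integration evals reuse the same harness but call the full pipeline (retrieval, tools, postprocessing) instead of the bare model.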
## 3. Datasets
- Maintain **versioned eval datasets** with:
- input,
- expected output or rubric,
- metadata (domain, difficulty, edge cases).
- Include adversarial cases:
- prompt injection,
- ambiguous queries,
- long/noisy inputs,
  - PII-rich inputs (to test redaction).
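One convenient storage shape for versioned eval datasets is one JSON record per line (JSONL). The field names below are illustrative, not a fixed schema; note how a record can carry either a strict expected output or a rubric, plus the metadata tags listed above:

```python
import json

# One eval record per JSONL line; all field names are assumptions to adapt.
record = {
    "id": "case-0007",
    "dataset_version": "v3",
    "input": "Summarize the attached support ticket.",
    "expected": None,  # omit/None when grading by rubric instead
    "rubric": "Covers key facts; no hallucinated entities.",
    "metadata": {
        "domain": "support",
        "difficulty": "hard",
        "edge_case": "prompt_injection",  # tags adversarial cases
    },
}
line = json.dumps(record)
print(json.loads(line)["metadata"]["edge_case"])  # prompt_injection
```

Keeping `dataset_version` on every record lets eval runs pin exactly which data they were scored against.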
## 4. Metrics (suggested)
Choose per archetype:
- **Task quality:** accuracy/F1, exact match, rubric score, human preference rate.
- **Safety:** refusal correctness, policy violations, PII leakage rate.
- **Robustness:** format-valid rate, tool-call correctness, retry rate.
- **Performance:** p50/p95 latency, tokens in/out, cost per task.
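Two of these metrics are cheap to compute from raw run data and worth standardizing early: exact-match rate and p95 latency. A sketch (the nearest-rank percentile definition is one common choice, not the only one):

```python
import math

def exact_match_rate(outputs, expected):
    """Fraction of outputs that exactly match the expected answers."""
    return sum(o == e for o, e in zip(outputs, expected)) / len(expected)

def p95(latencies_ms):
    """Nearest-rank p95: smallest value with >=95% of samples at or below it."""
    s = sorted(latencies_ms)
    idx = max(0, math.ceil(0.95 * len(s)) - 1)
    return s[idx]

print(exact_match_rate(["4", "Paris", "5"], ["4", "Paris", "4"]))  # ≈ 0.667
print(p95([100, 120, 130, 900]))  # 900
```

Token counts and cost per task come from the provider's usage fields and can be aggregated the same way.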
## 5. Regression Policy
- Every prompt or model change must run evals.
- Define gates:
- no safety regressions,
- quality must improve or stay within tolerance,
- latency/cost budgets respected.
- If a gate fails: block the rollout, or require an explicit override recorded in `RECOMMENDATIONS.md`.
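The gates above can be encoded as a simple check of a candidate run against a baseline. A minimal sketch; the tolerance and budget values are placeholder assumptions to be locked in Phase 1:

```python
def gates_pass(baseline, candidate, quality_tol=0.01, latency_budget_ms=2000):
    """Return (ok, reasons). The candidate must not regress safety, must stay
    within the quality tolerance, and must respect the latency budget.
    Thresholds here are illustrative defaults, not policy."""
    reasons = []
    if candidate["safety_violations"] > baseline["safety_violations"]:
        reasons.append("safety regression")
    if candidate["quality"] < baseline["quality"] - quality_tol:
        reasons.append("quality below tolerance")
    if candidate["p95_latency_ms"] > latency_budget_ms:
        reasons.append("latency budget exceeded")
    return (not reasons, reasons)

ok, why = gates_pass(
    {"safety_violations": 0, "quality": 0.82},
    {"safety_violations": 0, "quality": 0.815, "p95_latency_ms": 1800},
)
print(ok)  # True (quality drop of 0.005 is within the 0.01 tolerance)
```

Wiring this into CI makes "every prompt or model change must run evals" enforceable rather than aspirational.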
## 6. Human Review Loop
- For tasks without ground truth, use rubric-based human grading.
- Sample strategy:
- new prompt versions → 100% review on small batch,
- stable versions → periodic audits.
## 7. Logging for Evals
- Store eval runs with:
- prompt version,
- model/provider version,
- retrieval config version (if used),
- inputs/outputs,
- metrics + artifacts.
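A stored eval run can be a single record tying all of those versions together, so any metric is reproducible later. The field names below are illustrative assumptions:

```python
import json

# Illustrative shape of one stored eval run; adapt field names as needed.
run = {
    "run_id": "run-0042",
    "prompt_version": "summarize-v4",
    "model": "provider/model-2025-06",          # model + provider version
    "retrieval_config_version": "rag-v2",       # omit if no RAG
    "dataset_version": "v3",
    "metrics": {"exact_match": 0.91, "p95_latency_ms": 1450},
    "artifacts": ["outputs.jsonl"],             # inputs/outputs stored alongside
}
print(json.dumps(run, indent=2))
```

Because the record pins prompt, model, retrieval config, and dataset versions, a regression can be bisected to the one thing that changed.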
## 8. Open Questions to Lock in Phase 1
- Where datasets live (repo vs storage)?
- Which metrics are hard gates for MVP?
- Online eval strategy (shadow vs A/B) and sample sizes?