Add foundational documentation templates to support product design and architecture planning, including ADR, archetypes, LLM systems, dev setup, and shared modules.
New file: `docs/llm/evals.md` (73 lines)

# LLM System: Evals & Quality (Starter Template)

---

**Last Updated:** 2025-12-12

**Phase:** Phase 0 (Planning)

**Status:** Draft — finalize in Phase 1

**Owner:** AI/LLM Lead + Test Engineer

**References:**

- `/docs/llm/prompting.md`
- `/docs/llm/safety.md`

---

This document defines how you measure LLM quality and prevent regressions.

## 1. Goals

- Detect prompt/model regressions before production.
- Track accuracy, safety, latency, and cost over time.
- Provide a repeatable path for improving prompts and RAG.

## 2. Eval Suite Types

Mix these three layers depending on the archetype:

1. **Unit evals (offline, deterministic)**
   - Small golden set, strict expected outputs.
2. **Integration evals (offline, realistic)**
   - Full pipeline including retrieval, tools, and post‑processing.
3. **Online evals (production, controlled)**
   - Shadow runs, A/B, canary prompts, RUM‑style metrics.

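As a concrete starting point, a unit eval can be a tiny golden set checked with strict exact match. A minimal sketch in Python — the golden set, `fake_model`, and all names are illustrative placeholders, not part of this template:

```python
# Minimal unit eval: run each golden case through the model and compare
# against a strict expected output. Deterministic, so safe to run in CI.

GOLDEN_SET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def fake_model(prompt: str) -> str:
    # Stand-in for the real LLM call; swap in your client here.
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "")

def run_unit_evals(model, cases) -> float:
    # Returns the exact-match rate over the golden set.
    hits = [model(c["input"]) == c["expected"] for c in cases]
    return sum(hits) / len(hits)
```

A real suite would also record per-case failures, not just the aggregate rate, so regressions are diagnosable.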
## 3. Datasets

- Maintain **versioned eval datasets** with:
  - input,
  - expected output or rubric,
  - metadata (domain, difficulty, edge cases).
- Include adversarial cases:
  - prompt injection,
  - ambiguous queries,
  - long/noisy inputs,
  - PII‑rich inputs (to test redaction).

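One possible shape for a versioned dataset entry, stored as JSONL (one JSON object per line, with the version carried in the file name). All field names here are suggestions to finalize in Phase 1:

```python
import json

# Example entry for a file such as eval_sets/qa_v3.jsonl (path is illustrative).
entry = {
    "id": "qa-0001",
    "input": "What is the capital of France?",
    "expected": "Paris",          # or a grading rubric for open-ended tasks
    "metadata": {
        "domain": "geography",
        "difficulty": "easy",
        "edge_case": None,        # e.g. "prompt_injection", "pii_rich"
    },
}

# One line, ready to append to the dataset file.
line = json.dumps(entry)
```

Keeping the version in the file name (rather than inside each entry) makes it trivial to pin an eval run to an exact dataset snapshot.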
## 4. Metrics (suggested)

Choose per archetype:

- **Task quality:** accuracy/F1, exact‑match, rubric score, human preference rate.
- **Safety:** refusal correctness, policy violations, PII leakage rate.
- **Robustness:** format‑valid rate, tool‑call correctness, retry rate.
- **Performance:** p50/p95 latency, tokens in/out, cost per task.

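A dependency-free sketch of computing the performance metrics from one run; the latencies, token counts, and price are made-up illustration values, not real quotes:

```python
import math
import statistics

def percentile(values, pct):
    # Nearest-rank percentile: small and dependency-free, fine for dashboards.
    ordered = sorted(values)
    k = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[k]

# Illustrative numbers from one eval run.
latencies_ms = [120, 150, 180, 200, 950]
tokens_out = [50, 60, 55, 70, 65]
cost_per_1k_tokens = 0.002  # placeholder price

report = {
    "p50_ms": percentile(latencies_ms, 50),
    "p95_ms": percentile(latencies_ms, 95),
    "cost_per_task": statistics.mean(tokens_out) / 1000 * cost_per_1k_tokens,
}
```

Note how a single slow outlier dominates p95 while leaving p50 untouched — which is exactly why both belong in the report.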
## 5. Regression Policy

- Every prompt or model change must run evals.
- Define gates:
  - no safety regressions,
  - quality must improve or stay within tolerance,
  - latency/cost budgets respected.
- If a gate fails: block rollout or require explicit override in `RECOMMENDATIONS.md`.

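The gates above can be sketched as a single rollout check. The thresholds, metric names, and dict shape below are placeholders to lock in Phase 1:

```python
def passes_gates(baseline, candidate,
                 quality_tolerance=0.01, latency_headroom=1.10):
    """Return True if the candidate eval run may roll out."""
    if candidate["safety_violations"] > baseline["safety_violations"]:
        return False  # gate 1: no safety regressions, ever
    if candidate["quality"] < baseline["quality"] - quality_tolerance:
        return False  # gate 2: quality improves or stays within tolerance
    if candidate["p95_ms"] > baseline["p95_ms"] * latency_headroom:
        return False  # gate 3: latency budget respected
    return True

baseline = {"safety_violations": 0, "quality": 0.91, "p95_ms": 800}
candidate = {"safety_violations": 0, "quality": 0.905, "p95_ms": 820}
ok = passes_gates(baseline, candidate)  # within all three gates
```

Making safety a hard gate (no tolerance) while quality gets a tolerance band mirrors the policy text: safety never regresses, quality may trade slightly for other wins.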
## 6. Human Review Loop

- For tasks without ground truth, use rubric‑based human grading.
- Sampling strategy:
  - new prompt versions → 100% review on a small batch,
  - stable versions → periodic audits.

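That sampling strategy could look like the following; the function name, `audit_rate`, and the fixed seed are all assumptions for illustration:

```python
import random

def select_for_review(outputs, prompt_version_is_new, audit_rate=0.05, seed=0):
    # New prompt versions: review the entire (small) batch.
    if prompt_version_is_new:
        return list(outputs)
    # Stable versions: periodic audit of a small random sample.
    rng = random.Random(seed)  # fixed seed keeps an audit reproducible
    k = max(1, int(len(outputs) * audit_rate))
    return rng.sample(list(outputs), k)
```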
## 7. Logging for Evals

- Store eval runs with:
  - prompt version,
  - model/provider version,
  - retrieval config version (if used),
  - inputs/outputs,
  - metrics + artifacts.

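A minimal sketch of persisting an eval-run record as an append-only JSONL log. The field names mirror the list above; the function name and file layout are assumptions:

```python
import json
import time

def log_eval_run(path, *, prompt_version, model_version,
                 retrieval_config=None, inputs=None, outputs=None,
                 metrics=None):
    # Append one JSON record per eval run so any result can be reproduced
    # from the exact prompt/model/retrieval versions that produced it.
    record = {
        "timestamp": time.time(),
        "prompt_version": prompt_version,
        "model_version": model_version,
        "retrieval_config": retrieval_config,
        "inputs": inputs,
        "outputs": outputs,
        "metrics": metrics,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```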
## 8. Open Questions to Lock in Phase 1

- Where do datasets live (repo vs storage)?
- Which metrics are hard gates for MVP?
- Online eval strategy (shadow vs A/B) and sample sizes?