# LLM System: Evals & Quality (Starter Template)
---

**Last Updated:** 2025-12-12
**Phase:** Phase 0 (Planning)
**Status:** Draft — finalize in Phase 1
**Owner:** AI/LLM Lead + Test Engineer
**References:**

- `/docs/llm/prompting.md`
- `/docs/llm/safety.md`

---

This document defines how you measure LLM quality and prevent regressions.

## 1. Goals

- Detect prompt/model regressions before production.
- Track accuracy, safety, latency, and cost over time.
- Provide a repeatable path for improving prompts and RAG.

## 2. Eval Suite Types

Mix three layers depending on the archetype:

1. **Unit evals (offline, deterministic)**
   - Small golden set, strict expected outputs.
2. **Integration evals (offline, realistic)**
   - Full pipeline including retrieval, tools, and post‑processing.
3. **Online evals (production, controlled)**
   - Shadow runs, A/B, canary prompts, RUM‑style metrics.

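A unit eval in the sense above can be sketched as a deterministic loop over a small golden set with strict expected outputs. This is an illustrative sketch, not part of the template: `golden_set`, `run_model`, and `run_unit_evals` are hypothetical names, and the stubbed model would be replaced by a real prompt/model call.

```python
# Minimal unit-eval sketch: strict expected outputs over a small golden set.
golden_set = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def run_model(prompt: str) -> str:
    # Placeholder: replace with the real model call under test.
    return {"2 + 2": "4", "capital of France": "Paris"}[prompt]

def run_unit_evals(cases, model):
    failures = []
    for case in cases:
        output = model(case["input"]).strip()
        if output != case["expected"]:
            failures.append({"input": case["input"], "got": output})
    return {"total": len(cases), "failed": len(failures), "failures": failures}

report = run_unit_evals(golden_set, run_model)
print(f'{report["failed"]}/{report["total"]} failed')  # prints "0/2 failed"
```

Because the cases are deterministic, this layer can run in CI on every prompt change.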
## 3. Datasets

- Maintain **versioned eval datasets** with:
  - input,
  - expected output or rubric,
  - metadata (domain, difficulty, edge cases).
- Include adversarial cases:
  - prompt injection,
  - ambiguous queries,
  - long/noisy inputs,
  - PII‑rich inputs (to test redaction).

## 4. Metrics (suggested)

Choose per archetype:

- **Task quality:** accuracy/F1, exact‑match, rubric score, human preference rate.
- **Safety:** refusal correctness, policy violations, PII leakage rate.
- **Robustness:** format‑valid rate, tool‑call correctness, retry rate.
- **Performance:** p50/p95 latency, tokens in/out, cost per task.

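A few of these metrics can be computed from per-call result records as sketched below. The record shape and `summarize` helper are assumptions for illustration; the p95 here is a crude index-based estimate that only makes sense for larger samples.

```python
import statistics

# Each result record is assumed to carry per-call measurements.
results = [
    {"exact_match": True,  "format_valid": True,  "latency_ms": 420, "cost_usd": 0.004},
    {"exact_match": False, "format_valid": True,  "latency_ms": 910, "cost_usd": 0.006},
    {"exact_match": True,  "format_valid": False, "latency_ms": 380, "cost_usd": 0.003},
]

def summarize(results):
    n = len(results)
    latencies = sorted(r["latency_ms"] for r in results)
    return {
        "exact_match_rate": sum(r["exact_match"] for r in results) / n,
        "format_valid_rate": sum(r["format_valid"] for r in results) / n,
        "p50_latency_ms": statistics.median(latencies),
        "p95_latency_ms": latencies[min(n - 1, int(0.95 * n))],  # crude p95
        "cost_per_task_usd": sum(r["cost_usd"] for r in results) / n,
    }

summary = summarize(results)
```

In practice these summaries would be computed per dataset slice (domain, difficulty, edge-case tag) so a regression in one slice is not averaged away.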
## 5. Regression Policy

- Every prompt or model change must run evals.
- Define gates:
  - no safety regressions,
  - quality must improve or stay within tolerance,
  - latency/cost budgets respected.
- If a gate fails: block rollout or require an explicit override in `RECOMMENDATIONS.md`.

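The gates above could be checked mechanically by comparing a candidate run against a baseline, as in this sketch. The metric names and threshold values are illustrative assumptions; real tolerances and budgets would be set per archetype in Phase 1.

```python
# Gate check: an empty failure list means the rollout may proceed.
def check_gates(baseline, candidate,
                quality_tolerance=0.02,
                latency_budget_ms=1500,
                cost_budget_usd=0.01):
    failures = []
    if candidate["safety_violations"] > baseline["safety_violations"]:
        failures.append("safety regression")
    if candidate["quality"] < baseline["quality"] - quality_tolerance:
        failures.append("quality below tolerance")
    if candidate["p95_latency_ms"] > latency_budget_ms:
        failures.append("latency budget exceeded")
    if candidate["cost_per_task_usd"] > cost_budget_usd:
        failures.append("cost budget exceeded")
    return failures

baseline = {"safety_violations": 0, "quality": 0.84,
            "p95_latency_ms": 1200, "cost_per_task_usd": 0.006}
candidate = {"safety_violations": 0, "quality": 0.83,
             "p95_latency_ms": 1300, "cost_per_task_usd": 0.007}
print(check_gates(baseline, candidate))  # prints [] (within tolerance and budgets)
```

Wired into CI, a non-empty failure list would block the rollout unless an override is recorded.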
## 6. Human Review Loop

- For tasks without ground truth, use rubric‑based human grading.
- Sampling strategy:
  - new prompt versions → 100% review on a small batch,
  - stable versions → periodic audits.

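The sampling strategy above can be sketched as a small selector: full review for new prompt versions, a fixed-rate audit sample for stable ones. The function name and the 5% audit rate are illustrative assumptions.

```python
import random

def select_for_review(outputs, prompt_version_is_new, audit_rate=0.05, seed=0):
    if prompt_version_is_new:
        return list(outputs)  # review the whole (small) batch
    # Stable version: deterministic audit sample via a seeded RNG.
    rng = random.Random(seed)
    return [o for o in outputs if rng.random() < audit_rate]

batch = [f"output-{i}" for i in range(100)]
print(len(select_for_review(batch, prompt_version_is_new=True)))  # prints 100
```

Seeding the sampler keeps the audit sample reproducible, so graders and pipelines agree on which outputs were selected.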
## 7. Logging for Evals

- Store eval runs with:
  - prompt version,
  - model/provider version,
  - retrieval config version (if used),
  - inputs/outputs,
  - metrics + artifacts.

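An eval-run record capturing the versions listed above might look like the following. Key names and the version strings are illustrative assumptions, not a prescribed schema.

```python
import json
import time

def make_eval_run_record(prompt_version, model_version, retrieval_version,
                         inputs, outputs, metrics, artifacts=None):
    # One serializable record per eval run; pin every version that affects results.
    return {
        "run_id": f"eval-{int(time.time())}",
        "prompt_version": prompt_version,
        "model_version": model_version,
        "retrieval_config_version": retrieval_version,  # None if no RAG
        "inputs": inputs,
        "outputs": outputs,
        "metrics": metrics,
        "artifacts": artifacts or [],
    }

record = make_eval_run_record(
    prompt_version="support-v7",
    model_version="provider-x/model-2025-01",
    retrieval_version="rag-config-v2",
    inputs=["What is your refund policy?"],
    outputs=["Refunds are available within 30 days."],
    metrics={"rubric_score": 0.9, "p95_latency_ms": 800},
)
serialized = json.dumps(record)  # e.g. appended to a run-log JSONL file
```

Pinning prompt, model, and retrieval versions in every record is what makes later regression comparisons between runs meaningful.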
## 8. Open Questions to Lock in Phase 1

- Where do datasets live (repo vs. storage)?
- Which metrics are hard gates for the MVP?
- Online eval strategy (shadow vs. A/B) and sample sizes?