LLM System: Evals & Quality (Starter Template)
Last Updated: 2025-12-12
Phase: Phase 0 (Planning)
Status: Draft — finalize in Phase 1
Owner: AI/LLM Lead + Test Engineer
References:
- /docs/llm/prompting.md
- /docs/llm/safety.md
This document defines how you measure LLM quality and prevent regressions.
1. Goals
- Detect prompt/model regressions before production.
- Track accuracy, safety, latency, and cost over time.
- Provide a repeatable path for improving prompts and RAG.
2. Eval Suite Types
Combine up to three layers, depending on the archetype:
- Unit evals (offline, deterministic)
- Small golden set, strict expected outputs.
- Integration evals (offline, realistic)
- Full pipeline including retrieval, tools, and post‑processing.
- Online evals (production, controlled)
- Shadow runs, A/B, canary prompts, RUM‑style metrics.
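The unit-eval layer above can be sketched as a strict check over a small golden set. `run_model`, `GOLDEN_SET`, and `pass_rate` are hypothetical names, and the model is stubbed so the sketch runs offline and deterministically:

```python
# Minimal unit-eval sketch: strict expected-output checks against a small
# golden set. `run_model` is a hypothetical stand-in for the real LLM call;
# the lookup-table stub keeps the example offline and deterministic.
GOLDEN_SET = [
    {"input": "capital of France?", "expected": "Paris"},
    {"input": "2 + 2 = ?", "expected": "4"},
]

def run_model(prompt: str) -> str:
    # Stub lookup in place of a real model call.
    return {"capital of France?": "Paris", "2 + 2 = ?": "4"}.get(prompt, "")

def pass_rate(golden_set) -> float:
    # Strict comparison: trimmed output must equal the expected string exactly.
    passed = sum(
        run_model(case["input"]).strip() == case["expected"]
        for case in golden_set
    )
    return passed / len(golden_set)
```

Integration and online evals reuse the same case shape but swap the stub for the full pipeline.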
3. Datasets
- Maintain versioned eval datasets with:
- input,
- expected output or rubric,
- metadata (domain, difficulty, edge cases).
- Include adversarial cases:
- prompt injection,
- ambiguous queries,
- long/noisy inputs,
- PII‑rich inputs (to test redaction).
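One possible record shape for a versioned dataset, stored as one JSON object per line (JSONL); all field names here are illustrative assumptions, not a fixed schema:

```python
import json

# Hypothetical versioned eval record. `expected` is None when the case has
# no ground truth and is graded against `rubric` instead.
record = {
    "dataset_version": "v3",
    "input": "Summarize this support ticket: ...",
    "expected": None,
    "rubric": "Covers the key facts; no fabricated details.",
    "metadata": {
        "domain": "support",
        "difficulty": "hard",
        "edge_case": "prompt_injection",  # tag adversarial cases explicitly
    },
}

# One line in the JSONL dataset file.
line = json.dumps(record)
```

Keeping `dataset_version` on every record lets eval runs pin the exact data they were scored against.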
4. Metrics (suggested)
Choose per archetype:
- Task quality: accuracy/F1, exact‑match, rubric score, human preference rate.
- Safety: refusal correctness, policy violations, PII leakage rate.
- Robustness: format‑valid rate, tool‑call correctness, retry rate.
- Performance: p50/p95 latency, tokens in/out, cost per task.
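Two of the metrics above can be computed as follows; this is a sketch (nearest-rank percentile, JSON-parse check for format validity), and the function names are assumptions:

```python
import json
import math

def p95_latency(latencies_ms):
    # Nearest-rank 95th percentile over per-task latencies (ms).
    ordered = sorted(latencies_ms)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

def format_valid_rate(outputs):
    # Fraction of outputs that parse as JSON (for JSON-constrained tasks).
    def valid(text):
        try:
            json.loads(text)
            return True
        except json.JSONDecodeError:
            return False
    return sum(valid(o) for o in outputs) / len(outputs)
```

Cost per task follows the same pattern: sum token counts per run, multiply by the provider's price, and average.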
5. Regression Policy
- Every prompt or model change must run evals.
- Define gates:
- no safety regressions,
- quality must improve or stay within tolerance,
- latency/cost budgets respected.
- If a gate fails: block rollout or require explicit override in
RECOMMENDATIONS.md.
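The gates above can be expressed as a single boolean check in CI. The thresholds below are illustrative assumptions to be locked in Phase 1, not project policy:

```python
# Illustrative gate thresholds -- replace with the values locked in Phase 1.
QUALITY_TOLERANCE = 0.01     # allowed absolute drop in task quality
LATENCY_BUDGET_MS = 2000     # p95 latency budget
COST_BUDGET_USD = 0.02       # cost-per-task budget

def gates_pass(baseline: dict, candidate: dict) -> bool:
    # A change rolls out only if every gate holds against the baseline run.
    return (
        candidate["safety_violations"] <= baseline["safety_violations"]
        and candidate["quality"] >= baseline["quality"] - QUALITY_TOLERANCE
        and candidate["p95_latency_ms"] <= LATENCY_BUDGET_MS
        and candidate["cost_per_task_usd"] <= COST_BUDGET_USD
    )
```

A failed gate blocks the rollout; the explicit-override path in RECOMMENDATIONS.md is the only way around it.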
6. Human Review Loop
- For tasks without ground truth, use rubric‑based human grading.
- Sample strategy:
- new prompt versions → 100% review on a small batch,
- stable versions → periodic audits.
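The sampling strategy above can be sketched as one selection function; `select_for_review` and the 5% audit rate are assumptions to tune per archetype:

```python
import random

def select_for_review(batch, is_new_prompt_version, audit_rate=0.05, seed=0):
    # New prompt versions: human-review every output in the (small) batch.
    if is_new_prompt_version:
        return list(batch)
    # Stable versions: a small, reproducible random audit sample.
    rng = random.Random(seed)  # fixed seed so the audit is reproducible
    items = list(batch)
    k = max(1, int(audit_rate * len(items)))
    return rng.sample(items, k)
```

Graded samples feed back into the eval datasets as new golden or rubric cases.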
7. Logging for Evals
- Store eval runs with:
- prompt version,
- model/provider version,
- retrieval config version (if used),
- inputs/outputs,
- metrics + artifacts.
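A run record capturing the fields above might look like the following; every key name is an illustrative assumption:

```python
import json

# Hypothetical eval-run record: pins every version involved so a run can be
# reproduced and compared against later runs.
run = {
    "run_id": "eval-2025-12-12-001",
    "prompt_version": "v7",
    "model": "provider/model@2025-11",
    "retrieval_config_version": "rag-v2",   # omit if retrieval is unused
    "started_at": "2025-12-12T00:00:00Z",
    "cases": [
        {"input": "...", "output": "...", "metrics": {"exact_match": 1}},
    ],
    "aggregate": {"accuracy": 1.0, "p95_latency_ms": 850},
}

log_line = json.dumps(run)  # one line per run in an append-only log
```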
8. Open Questions to Lock in Phase 1
- Where datasets live (repo vs storage)?
- Which metrics are hard gates for MVP?
- Online eval strategy (shadow vs A/B) and sample sizes?