# LLM System: Evals & Quality (Starter Template)
---
**Last Updated:** 2025-12-12
**Phase:** Phase 0 (Planning)
**Status:** Draft — finalize in Phase 1
**Owner:** AI/LLM Lead + Test Engineer
**References:**
- `/docs/llm/prompting.md`
- `/docs/llm/safety.md`
---
This document defines how you measure LLM quality and prevent regressions.
## 1. Goals
- Detect prompt/model regressions before production.
- Track accuracy, safety, latency, and cost over time.
- Provide a repeatable path for improving prompts and RAG.
## 2. Eval Suite Types
Combine up to three layers, depending on the archetype:
1. **Unit evals (offline, deterministic)**
- Small golden set, strict expected outputs.
2. **Integration evals (offline, realistic)**
- Full pipeline including retrieval, tools, and postprocessing.
3. **Online evals (production, controlled)**
- Shadow runs, A/B, canary prompts, RUM-style metrics.
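A unit eval from the first layer can be sketched as a strict comparison against a small golden set. This is a minimal illustration; `run_llm` and `GOLDEN_SET` are hypothetical names, and a real harness would call your model client instead of the stubbed lookup.

```python
# Hypothetical golden set: each case pairs an input with a strict expected output.
GOLDEN_SET = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def run_llm(prompt: str) -> str:
    # Stub standing in for the real model call; deterministic for the sketch.
    return {"2+2": "4", "capital of France": "Paris"}[prompt]

def run_unit_evals(cases):
    # Exact-match comparison; collect failing cases for inspection.
    failures = [c for c in cases if run_llm(c["input"]) != c["expected"]]
    return {"total": len(cases), "failed": len(failures), "failures": failures}
```

Because unit evals are deterministic, any failure here is a hard signal worth blocking on.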
## 3. Datasets
- Maintain **versioned eval datasets** with:
- input,
- expected output or rubric,
- metadata (domain, difficulty, edge cases).
- Include adversarial cases:
- prompt injection,
- ambiguous queries,
- long/noisy inputs,
- PII-rich inputs (to test redaction).
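One convenient on-disk shape for a versioned dataset is JSONL, one record per line. The field names below are illustrative, not a required schema; they mirror the input / rubric / metadata structure described above.

```python
import json

# Hypothetical record shape for a versioned eval dataset stored as JSONL.
record = {
    "id": "eval-0042",
    "dataset_version": "v3",
    "input": "Summarize the ticket thread for an agent handoff.",
    "rubric": "Covers all key points; no fabricated facts; PII redacted.",
    "metadata": {
        "domain": "support",
        "difficulty": "hard",
        "edge_case": "prompt_injection",
    },
}

# One line per record makes diffs and dataset versioning easy in git.
line = json.dumps(record)
```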
## 4. Metrics (suggested)
Choose per archetype:
- **Task quality:** accuracy/F1, exact-match, rubric score, human preference rate.
- **Safety:** refusal correctness, policy violations, PII leakage rate.
- **Robustness:** format-valid rate, tool-call correctness, retry rate.
- **Performance:** p50/p95 latency, tokens in/out, cost per task.
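Two of these metrics can be sketched directly: exact-match rate over paired predictions/golds, and p95 latency via the nearest-rank percentile. This is one reasonable definition of p95, not the only one (interpolating variants also exist).

```python
import math

def exact_match_rate(preds, golds):
    # Fraction of predictions identical to the gold answer.
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def p95_latency(latencies_ms):
    # Nearest-rank percentile: smallest value covering 95% of samples.
    xs = sorted(latencies_ms)
    idx = max(0, math.ceil(0.95 * len(xs)) - 1)
    return xs[idx]
```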
## 5. Regression Policy
- Every prompt or model change must run evals.
- Define gates:
- no safety regressions,
- quality must improve or stay within tolerance,
- latency/cost budgets respected.
- If a gate fails: block rollout or require explicit override in `RECOMMENDATIONS.md`.
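The gate logic above can be expressed as a small check over baseline vs. candidate metrics. Metric names and thresholds here are placeholders to be locked in Phase 1; the shape (empty list means the rollout may proceed) is the point.

```python
def check_gates(baseline, candidate,
                quality_tolerance=0.01, latency_budget_ms=2000.0):
    """Return a list of gate failures; empty list => rollout allowed."""
    failures = []
    # Gate 1: no safety regressions.
    if candidate["safety_violations"] > baseline["safety_violations"]:
        failures.append("safety regression")
    # Gate 2: quality must improve or stay within tolerance.
    if candidate["quality"] < baseline["quality"] - quality_tolerance:
        failures.append("quality below tolerance")
    # Gate 3: latency budget respected.
    if candidate["p95_latency_ms"] > latency_budget_ms:
        failures.append("latency budget exceeded")
    return failures
```

A failing result would block rollout unless an explicit override is recorded in `RECOMMENDATIONS.md`.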
## 6. Human Review Loop
- For tasks without ground truth, use rubric-based human grading.
- Sample strategy:
- new prompt versions → 100% review on small batch,
- stable versions → periodic audits.
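The sampling strategy can be sketched as a single selector: full review for new prompt versions, a random audit fraction for stable ones. The `audit_rate` value is an assumption, not a policy decision.

```python
import random

def review_sample(items, version_is_new, audit_rate=0.05, seed=0):
    # New prompt versions: review 100% of a small batch.
    if version_is_new:
        return list(items)
    # Stable versions: periodic audit on a random fraction (at least one item).
    rng = random.Random(seed)
    k = max(1, int(len(items) * audit_rate))
    return rng.sample(list(items), k)
```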
## 7. Logging for Evals
- Store eval runs with:
- prompt version,
- model/provider version,
- retrieval config version (if used),
- inputs/outputs,
- metrics + artifacts.
## 8. Open Questions to Lock in Phase 1
- Where datasets live (repo vs storage)?
- Which metrics are hard gates for MVP?
- Online eval strategy (shadow vs A/B) and sample sizes?