# LLM System: Evals & Quality (Starter Template)
---
**Last Updated:** 2025-12-12
**Phase:** Phase 0 (Planning)
**Status:** Draft — finalize in Phase 1
**Owner:** AI/LLM Lead + Test Engineer
**References:**
- `/docs/llm/prompting.md`
- `/docs/llm/safety.md`
---
This document defines how you measure LLM quality and prevent regressions.
## 1. Goals
- Detect prompt/model regressions before production.
- Track accuracy, safety, latency, and cost over time.
- Provide a repeatable path for improving prompts and RAG.
## 2. Eval Suite Types
Combine up to three layers, depending on the archetype:
1. **Unit evals (offline, deterministic)**
- Small golden set, strict expected outputs.
2. **Integration evals (offline, realistic)**
- Full pipeline including retrieval, tools, and postprocessing.
3. **Online evals (production, controlled)**
- Shadow runs, A/B, canary prompts, RUM-style metrics.
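A unit eval from the first layer can be sketched as a strict comparison against a small golden set. This is a minimal illustration; `run_llm` and `GOLDEN_SET` are hypothetical names, and a real harness would call your model client instead of the stubbed lookup.

```python
# Hypothetical golden set: each case pairs an input with a strict expected output.
GOLDEN_SET = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def run_llm(prompt: str) -> str:
    # Stub standing in for the real model call; deterministic for the sketch.
    return {"2+2": "4", "capital of France": "Paris"}[prompt]

def run_unit_evals(cases):
    # Exact-match comparison; collect failing cases for inspection.
    failures = [c for c in cases if run_llm(c["input"]) != c["expected"]]
    return {"total": len(cases), "failed": len(failures), "failures": failures}
```

Because unit evals are deterministic, any failure here is a hard signal worth blocking on.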
## 3. Datasets
- Maintain **versioned eval datasets** with:
- input,
- expected output or rubric,
- metadata (domain, difficulty, edge cases).
- Include adversarial cases:
- prompt injection,
- ambiguous queries,
- long/noisy inputs,
- PII-rich inputs (to test redaction).
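One convenient on-disk shape for a versioned dataset is JSONL, one record per line. The field names below are illustrative, not a required schema; they mirror the input / rubric / metadata structure described above.

```python
import json

# Hypothetical record shape for a versioned eval dataset stored as JSONL.
record = {
    "id": "eval-0042",
    "dataset_version": "v3",
    "input": "Summarize the ticket thread for an agent handoff.",
    "rubric": "Covers all key points; no fabricated facts; PII redacted.",
    "metadata": {
        "domain": "support",
        "difficulty": "hard",
        "edge_case": "prompt_injection",
    },
}

# One line per record makes diffs and dataset versioning easy in git.
line = json.dumps(record)
```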
## 4. Metrics (suggested)
Choose per archetype:
- **Task quality:** accuracy/F1, exact-match, rubric score, human preference rate.
- **Safety:** refusal correctness, policy violations, PII leakage rate.
- **Robustness:** format-valid rate, tool-call correctness, retry rate.
- **Performance:** p50/p95 latency, tokens in/out, cost per task.
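Two of these metrics can be sketched directly: exact-match rate over paired predictions/golds, and p95 latency via the nearest-rank percentile. This is one reasonable definition of p95, not the only one (interpolating variants also exist).

```python
import math

def exact_match_rate(preds, golds):
    # Fraction of predictions identical to the gold answer.
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def p95_latency(latencies_ms):
    # Nearest-rank percentile: smallest value covering 95% of samples.
    xs = sorted(latencies_ms)
    idx = max(0, math.ceil(0.95 * len(xs)) - 1)
    return xs[idx]
```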
## 5. Regression Policy
- Every prompt or model change must run evals.
- Define gates:
- no safety regressions,
- quality must improve or stay within tolerance,
- latency/cost budgets respected.
- If a gate fails: block rollout or require explicit override in `RECOMMENDATIONS.md`.
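The gate logic above can be expressed as a small check over baseline vs. candidate metrics. Metric names and thresholds here are placeholders to be locked in Phase 1; the shape (empty list means the rollout may proceed) is the point.

```python
def check_gates(baseline, candidate,
                quality_tolerance=0.01, latency_budget_ms=2000.0):
    """Return a list of gate failures; empty list => rollout allowed."""
    failures = []
    # Gate 1: no safety regressions.
    if candidate["safety_violations"] > baseline["safety_violations"]:
        failures.append("safety regression")
    # Gate 2: quality must improve or stay within tolerance.
    if candidate["quality"] < baseline["quality"] - quality_tolerance:
        failures.append("quality below tolerance")
    # Gate 3: latency budget respected.
    if candidate["p95_latency_ms"] > latency_budget_ms:
        failures.append("latency budget exceeded")
    return failures
```

A failing result would block rollout unless an explicit override is recorded in `RECOMMENDATIONS.md`.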
## 6. Human Review Loop
- For tasks without ground truth, use rubric-based human grading.
- Sample strategy:
- new prompt versions → 100% review on small batch,
- stable versions → periodic audits.
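The sampling strategy can be sketched as a single selector: full review for new prompt versions, a random audit fraction for stable ones. The `audit_rate` value is an assumption, not a policy decision.

```python
import random

def review_sample(items, version_is_new, audit_rate=0.05, seed=0):
    # New prompt versions: review 100% of a small batch.
    if version_is_new:
        return list(items)
    # Stable versions: periodic audit on a random fraction (at least one item).
    rng = random.Random(seed)
    k = max(1, int(len(items) * audit_rate))
    return rng.sample(list(items), k)
```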
## 7. Logging for Evals
- Store eval runs with:
- prompt version,
- model/provider version,
- retrieval config version (if used),
- inputs/outputs,
- metrics + artifacts.
## 8. Open Questions to Lock in Phase 1
- Where datasets live (repo vs storage)?
- Which metrics are hard gates for MVP?
- Online eval strategy (shadow vs A/B) and sample sizes?