# LLM System: Evals & Quality (Starter Template)

---

**Last Updated:** 2025-12-12
**Phase:** Phase 0 (Planning)
**Status:** Draft — finalize in Phase 1
**Owner:** AI/LLM Lead + Test Engineer
**References:**

- `/docs/llm/prompting.md`
- `/docs/llm/safety.md`

---

This document defines how you measure LLM quality and prevent regressions.

## 1. Goals

- Detect prompt/model regressions before they reach production.
- Track accuracy, safety, latency, and cost over time.
- Provide a repeatable path for improving prompts and RAG.

## 2. Eval Suite Types

Combine three layers, depending on archetype:

1. **Unit evals (offline, deterministic)**
   - Small golden set, strict expected outputs.
2. **Integration evals (offline, realistic)**
   - Full pipeline including retrieval, tools, and post‑processing.
3. **Online evals (production, controlled)**
   - Shadow runs, A/B tests, canary prompts, RUM‑style metrics.

## 3. Datasets

- Maintain **versioned eval datasets** with:
  - input,
  - expected output or rubric,
  - metadata (domain, difficulty, edge cases).
- Include adversarial cases:
  - prompt injection,
  - ambiguous queries,
  - long/noisy inputs,
  - PII‑rich inputs (to test redaction).

## 4. Metrics (suggested)

Choose per archetype:

- **Task quality:** accuracy/F1, exact‑match, rubric score, human preference rate.
- **Safety:** refusal correctness, policy violations, PII leakage rate.
- **Robustness:** format‑valid rate, tool‑call correctness, retry rate.
- **Performance:** p50/p95 latency, tokens in/out, cost per task.

## 5. Regression Policy

- Every prompt or model change must run evals.
- Define gates:
  - no safety regressions,
  - quality must improve or stay within tolerance,
  - latency/cost budgets respected.
- If a gate fails: block the rollout or require an explicit override in `RECOMMENDATIONS.md`.

## 6. Human Review Loop

- For tasks without ground truth, use rubric‑based human grading.
- Sampling strategy:
  - new prompt versions → 100% review on a small batch,
  - stable versions → periodic audits.

## 7. Logging for Evals

- Store eval runs with:
  - prompt version,
  - model/provider version,
  - retrieval config version (if used),
  - inputs/outputs,
  - metrics + artifacts.

## 8. Open Questions to Lock in Phase 1

- Where do datasets live (repo vs storage)?
- Which metrics are hard gates for MVP?
- Online eval strategy (shadow vs A/B) and sample sizes?
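
---

## Appendix: Illustrative Sketches (non‑normative)

As a sketch of the versioned eval dataset records in Section 3: one JSONL line per case, carrying the input, the expected output or rubric, and metadata. The field names here (`id`, `dataset_version`, `expected`, `rubric`, `metadata`) are illustrative placeholders, not a fixed schema — lock the real schema in Phase 1.

```python
import json

# One eval case as a JSON-serializable dict. All field names are
# placeholders to be finalized in Phase 1.
case = {
    "id": "faq-0042",
    "dataset_version": "2025-12-12",
    "input": "How do I reset my password?",
    "expected": "Direct the user to Settings > Security > Reset password.",
    "rubric": None,  # use a rubric instead of `expected` for open-ended tasks
    "metadata": {
        "domain": "account",
        "difficulty": "easy",
        "edge_case": False,
        "adversarial": None,  # e.g. "prompt_injection" or "pii_rich"
    },
}

# JSONL keeps one case per line, which versions and diffs cleanly in a repo.
line = json.dumps(case, ensure_ascii=False)
restored = json.loads(line)
assert restored == case  # round-trips losslessly
```

A rubric‑graded case would set `expected` to `None` and put grading criteria in `rubric`; the adversarial tag lets you slice metrics by attack type.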
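Two of the suggested metrics from Section 4 can be sketched as plain functions over eval outputs. The JSON‑validity check below is one hypothetical instance of a format contract; substitute whatever schema your outputs must satisfy.

```python
import json


def exact_match_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (output, expected) pairs that match exactly,
    after trimming surrounding whitespace."""
    hits = sum(1 for out, expected in pairs if out.strip() == expected.strip())
    return hits / len(pairs)


def format_valid_rate(outputs: list[str]) -> float:
    """Fraction of outputs that parse as JSON -- a stand-in for the
    'format-valid rate' robustness metric."""
    def is_valid(s: str) -> bool:
        try:
            json.loads(s)
            return True
        except json.JSONDecodeError:
            return False

    return sum(1 for o in outputs if is_valid(o)) / len(outputs)
```

Keeping metrics as small pure functions makes them trivially unit‑testable and reusable across the offline and online eval layers.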
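The three gates in Section 5 can be sketched as a single check that compares a candidate run's metrics against the current baseline. The metric names and thresholds below are placeholders to be finalized in Phase 1; the structure mirrors the policy (safety is a hard gate, quality has a tolerance band, latency and cost have absolute budgets).

```python
QUALITY_TOLERANCE = 0.01   # candidate accuracy may drop at most 1 point
LATENCY_BUDGET_MS = 2000   # p95 budget (placeholder)
COST_BUDGET_USD = 0.05     # per-task budget (placeholder)


def check_gates(baseline: dict, candidate: dict) -> list[str]:
    """Return a list of gate failures; an empty list means the change may ship."""
    failures = []

    # Gate 1: no safety regressions (hard gate, zero tolerance).
    if candidate["policy_violation_rate"] > baseline["policy_violation_rate"]:
        failures.append("safety: policy violation rate regressed")
    if candidate["pii_leakage_rate"] > baseline["pii_leakage_rate"]:
        failures.append("safety: PII leakage rate regressed")

    # Gate 2: quality must improve or stay within tolerance.
    if candidate["accuracy"] < baseline["accuracy"] - QUALITY_TOLERANCE:
        failures.append("quality: accuracy dropped beyond tolerance")

    # Gate 3: latency/cost budgets respected.
    if candidate["p95_latency_ms"] > LATENCY_BUDGET_MS:
        failures.append("performance: p95 latency over budget")
    if candidate["cost_per_task_usd"] > COST_BUDGET_USD:
        failures.append("performance: cost per task over budget")

    return failures
```

A CI job would run the eval suite, call `check_gates`, and block the rollout (or demand the `RECOMMENDATIONS.md` override) whenever the returned list is non‑empty.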