Files
AI_template/docs/llm/evals.md
olekhondera 5b28ea675d add SKILL
2026-02-14 07:38:50 +02:00

2.3 KiB
Raw Permalink Blame History

LLM System: Evals & Quality (Starter Template)


Phase: Phase 0 (Planning)
Status: Draft — finalize in Phase 1
Owner: AI/LLM Lead + Test Engineer
References:

  • /docs/llm/prompting.md
  • /docs/llm/safety.md

This document defines how you measure LLM quality and prevent regressions.

1. Goals

  • Detect prompt/model regressions before production.
  • Track accuracy, safety, latency, and cost over time.
  • Provide a repeatable path for improving prompts and RAG.

2. Eval Suite Types

Mix 3 layers depending on archetype:

  1. Unit evals (offline, deterministic)
    • Small golden set, strict expected outputs.
  2. Integration evals (offline, realistic)
    • Full pipeline including retrieval, tools, and postprocessing.
  3. Online evals (production, controlled)
    • Shadow runs, A/B, canary prompts, RUMstyle metrics.

3. Datasets

  • Maintain versioned eval datasets with:
    • input,
    • expected output or rubric,
    • metadata (domain, difficulty, edge cases).
  • Include adversarial cases:
    • prompt injection,
    • ambiguous queries,
    • long/noisy inputs,
    • PIIrich inputs (to test redaction).

4. Metrics (suggested)

Choose per archetype:

  • Task quality: accuracy/F1, exactmatch, rubric score, human preference rate.
  • Safety: refusal correctness, policy violations, PII leakage rate.
  • Robustness: formatvalid rate, toolcall correctness, retry rate.
  • Performance: p50/p95 latency, tokens in/out, cost per task.

5. Regression Policy

  • Every prompt or model change must run evals.
  • Define gates:
    • no safety regressions,
    • quality must improve or stay within tolerance,
    • latency/cost budgets respected.
  • If a gate fails: block rollout or require explicit override in RECOMMENDATIONS.md.

6. Human Review Loop

  • For tasks without ground truth, use rubricbased human grading.
  • Sample strategy:
    • new prompt versions → 100% review on small batch,
    • stable versions → periodic audits.

7. Logging for Evals

  • Store eval runs with:
    • prompt version,
    • model/provider version,
    • retrieval config version (if used),
    • inputs/outputs,
    • metrics + artifacts.

8. Open Questions to Lock in Phase 1

  • Where datasets live (repo vs storage)?
  • Which metrics are hard gates for MVP?
  • Online eval strategy (shadow vs A/B) and sample sizes?