LLM System: Evals & Quality (Starter Template)
Last Updated: 2025-12-12
Phase: Phase 0 (Planning)
Status: Draft — finalize in Phase 1
Owner: AI/LLM Lead + Test Engineer
References:
- /docs/llm/prompting.md
- /docs/llm/safety.md
This document defines how you measure LLM quality and prevent regressions.
1. Goals
- Detect prompt/model regressions before production.
- Track accuracy, safety, latency, and cost over time.
- Provide a repeatable path for improving prompts and RAG.
2. Eval Suite Types
Combine up to three layers, depending on the archetype:
- Unit evals (offline, deterministic)
- Small golden set, strict expected outputs.
- Integration evals (offline, realistic)
- Full pipeline including retrieval, tools, and post‑processing.
- Online evals (production, controlled)
- Shadow runs, A/B, canary prompts, RUM‑style metrics.
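The unit-eval layer above can be sketched as a strict check over a small golden set. `run_model`, `GOLDEN_SET`, and `pass_rate` are hypothetical names, and the model is stubbed so the sketch runs offline and deterministically:

```python
# Minimal unit-eval sketch: strict expected-output checks against a small
# golden set. `run_model` is a hypothetical stand-in for the real LLM call;
# the lookup-table stub keeps the example offline and deterministic.
GOLDEN_SET = [
    {"input": "capital of France?", "expected": "Paris"},
    {"input": "2 + 2 = ?", "expected": "4"},
]

def run_model(prompt: str) -> str:
    # Stub lookup in place of a real model call.
    return {"capital of France?": "Paris", "2 + 2 = ?": "4"}.get(prompt, "")

def pass_rate(golden_set) -> float:
    # Strict comparison: trimmed output must equal the expected string exactly.
    passed = sum(
        run_model(case["input"]).strip() == case["expected"]
        for case in golden_set
    )
    return passed / len(golden_set)
```

Integration and online evals reuse the same case shape but swap the stub for the full pipeline.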
3. Datasets
- Maintain versioned eval datasets with:
- input,
- expected output or rubric,
- metadata (domain, difficulty, edge cases).
- Include adversarial cases:
- prompt injection,
- ambiguous queries,
- long/noisy inputs,
- PII‑rich inputs (to test redaction).
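One possible record shape for a versioned dataset, stored as one JSON object per line (JSONL); all field names here are illustrative assumptions, not a fixed schema:

```python
import json

# Hypothetical versioned eval record. `expected` is None when the case has
# no ground truth and is graded against `rubric` instead.
record = {
    "dataset_version": "v3",
    "input": "Summarize this support ticket: ...",
    "expected": None,
    "rubric": "Covers the key facts; no fabricated details.",
    "metadata": {
        "domain": "support",
        "difficulty": "hard",
        "edge_case": "prompt_injection",  # tag adversarial cases explicitly
    },
}

# One line in the JSONL dataset file.
line = json.dumps(record)
```

Keeping `dataset_version` on every record lets eval runs pin the exact data they were scored against.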
4. Metrics (suggested)
Choose per archetype:
- Task quality: accuracy/F1, exact‑match, rubric score, human preference rate.
- Safety: refusal correctness, policy violations, PII leakage rate.
- Robustness: format‑valid rate, tool‑call correctness, retry rate.
- Performance: p50/p95 latency, tokens in/out, cost per task.
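Two of the metrics above can be computed as follows; this is a sketch (nearest-rank percentile, JSON-parse check for format validity), and the function names are assumptions:

```python
import json
import math

def p95_latency(latencies_ms):
    # Nearest-rank 95th percentile over per-task latencies (ms).
    ordered = sorted(latencies_ms)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

def format_valid_rate(outputs):
    # Fraction of outputs that parse as JSON (for JSON-constrained tasks).
    def valid(text):
        try:
            json.loads(text)
            return True
        except json.JSONDecodeError:
            return False
    return sum(valid(o) for o in outputs) / len(outputs)
```

Cost per task follows the same pattern: sum token counts per run, multiply by the provider's price, and average.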
5. Regression Policy
- Every prompt or model change must run evals.
- Define gates:
- no safety regressions,
- quality must improve or stay within tolerance,
- latency/cost budgets respected.
- If a gate fails: block rollout or require explicit override in
RECOMMENDATIONS.md.
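The gates above can be expressed as a single boolean check in CI. The thresholds below are illustrative assumptions to be locked in Phase 1, not project policy:

```python
# Illustrative gate thresholds -- replace with the values locked in Phase 1.
QUALITY_TOLERANCE = 0.01     # allowed absolute drop in task quality
LATENCY_BUDGET_MS = 2000     # p95 latency budget
COST_BUDGET_USD = 0.02       # cost-per-task budget

def gates_pass(baseline: dict, candidate: dict) -> bool:
    # A change rolls out only if every gate holds against the baseline run.
    return (
        candidate["safety_violations"] <= baseline["safety_violations"]
        and candidate["quality"] >= baseline["quality"] - QUALITY_TOLERANCE
        and candidate["p95_latency_ms"] <= LATENCY_BUDGET_MS
        and candidate["cost_per_task_usd"] <= COST_BUDGET_USD
    )
```

A failed gate blocks the rollout; the explicit-override path in RECOMMENDATIONS.md is the only way around it.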
6. Human Review Loop
- For tasks without ground truth, use rubric‑based human grading.
- Sample strategy:
- new prompt versions → 100% review on a small batch,
- stable versions → periodic audits.
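The sampling strategy above can be sketched as one selection function; `select_for_review` and the 5% audit rate are assumptions to tune per archetype:

```python
import random

def select_for_review(batch, is_new_prompt_version, audit_rate=0.05, seed=0):
    # New prompt versions: human-review every output in the (small) batch.
    if is_new_prompt_version:
        return list(batch)
    # Stable versions: a small, reproducible random audit sample.
    rng = random.Random(seed)  # fixed seed so the audit is reproducible
    items = list(batch)
    k = max(1, int(audit_rate * len(items)))
    return rng.sample(items, k)
```

Graded samples feed back into the eval datasets as new golden or rubric cases.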
7. Logging for Evals
- Store eval runs with:
- prompt version,
- model/provider version,
- retrieval config version (if used),
- inputs/outputs,
- metrics + artifacts.
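A run record capturing the fields above might look like the following; every key name is an illustrative assumption:

```python
import json

# Hypothetical eval-run record: pins every version involved so a run can be
# reproduced and compared against later runs.
run = {
    "run_id": "eval-2025-12-12-001",
    "prompt_version": "v7",
    "model": "provider/model@2025-11",
    "retrieval_config_version": "rag-v2",   # omit if retrieval is unused
    "started_at": "2025-12-12T00:00:00Z",
    "cases": [
        {"input": "...", "output": "...", "metrics": {"exact_match": 1}},
    ],
    "aggregate": {"accuracy": 1.0, "p95_latency_ms": 850},
}

log_line = json.dumps(run)  # one line per run in an append-only log
```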
8. Open Questions to Lock in Phase 1
- Where datasets live (repo vs storage)?
- Which metrics are hard gates for MVP?
- Online eval strategy (shadow vs A/B) and sample sizes?