Add foundational documentation templates to support product design and architecture planning, including ADR, archetypes, LLM systems, dev setup, and shared modules.

This commit is contained in:
olekhondera
2025-12-12 02:31:03 +02:00
parent 5053235e95
commit c905cbb725
26 changed files with 759 additions and 65 deletions

54
docs/llm/caching-costs.md Normal file

@@ -0,0 +1,54 @@
# LLM System: Caching & Cost Control (Starter Template)
---
**Last Updated:** 2025-12-12
**Phase:** Phase 0 (Planning)
**Status:** Draft — finalize in Phase 1
**Owner:** AI/LLM Lead + Backend Architect
**References:**
- `/docs/llm/prompting.md`
- `/docs/llm/evals.md`
---
This document defines how to keep LLM usage reliable and within budget.
## 1. Goals
- Minimize cost while preserving quality.
- Keep latency predictable for user flows.
- Avoid repeated work (idempotency + caching).
## 2. Budgets & Limits
Define per tenant and per feature (a configuration sketch follows this list):
- monthly token/cost cap,
- per-request max tokens,
- max retries/timeouts,
- concurrency limits.
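A minimal configuration sketch in TypeScript; the type and field names (`LlmBudget`, `llmBudgets`) and the values are illustrative placeholders to be locked in Phase 1.
```ts
// Illustrative shape only; exact fields and defaults are decided in Phase 1.
interface LlmBudget {
  monthlyCostCapUsd: number;      // hard monthly spend cap
  maxTokensPerRequest: number;    // per-request output token cap
  maxRetries: number;             // retries before falling back
  requestTimeoutMs: number;       // per-call timeout
  maxConcurrentRequests: number;  // concurrency limit
}
// Hypothetical per-feature overrides layered on a tenant-wide default.
const llmBudgets: Record<string, LlmBudget> = {
  default: { monthlyCostCapUsd: 200, maxTokensPerRequest: 1024, maxRetries: 2, requestTimeoutMs: 15_000, maxConcurrentRequests: 5 },
  "report-generation": { monthlyCostCapUsd: 50, maxTokensPerRequest: 4096, maxRetries: 1, requestTimeoutMs: 60_000, maxConcurrentRequests: 2 },
};
```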
## 3. Caching Layers
Pick what applies:
1. **Input normalization cache**
- canonicalize inputs (trim, stable ordering) to increase hit rate.
2. **LLM response cache**
- key: `(prompt_version, model, canonical_input_hash, retrieval_config_hash)`.
- TTL depends on volatility of the task.
3. **Embeddings cache**
- store embeddings for reusable texts/items.
4. **RAG retrieval cache**
- cache top-k doc IDs for stable queries.
> Never cache raw PII; cache keys use hashes of redacted inputs.
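A minimal sketch of the response-cache key above, assuming Node's built-in `crypto` module; `canonicalize` is a hypothetical helper standing in for the input-normalization step.
```ts
import { createHash } from "node:crypto";

// Hypothetical normalization: trim strings and order object keys for a stable hash.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(typeof value === "string" ? value.trim() : value);
}

// Key: (prompt_version, model, canonical_input_hash, retrieval_config_hash).
// Inputs are hashed after redaction, so no raw PII appears in the key.
function responseCacheKey(promptVersion: string, model: string, redactedInput: unknown, retrievalConfig: unknown): string {
  const inputHash = createHash("sha256").update(canonicalize(redactedInput)).digest("hex");
  const retrievalHash = createHash("sha256").update(canonicalize(retrievalConfig)).digest("hex");
  return `${promptVersion}:${model}:${inputHash}:${retrievalHash}`;
}
```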
## 4. Cost Controls
- Prefer cheaper models for low-risk tasks; escalate to stronger models only when needed.
- Use staged pipelines (rules/heuristics/RAG) to reduce LLM calls.
- Batch non-interactive jobs (classification, report gen).
- Track tokens in/out per request and per tenant.
## 5. Fallbacks
- On timeouts/errors: retry with backoff, then fall back to a safe default or human review.
- On budget exhaustion: degrade gracefully (limited features, queue jobs, ask user).
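A minimal sketch of the retry-then-fallback behavior, where `callModel` and `fallback` are hypothetical stand-ins for a single gateway attempt and the degraded path.
```ts
// Retry with exponential backoff, then fall back to a safe default or human-review queue.
async function withRetryAndFallback<T>(
  callModel: () => Promise<T>,      // one attempt through the LLM gateway
  fallback: () => Promise<T> | T,   // safe default or enqueue for human review
  maxRetries = 2,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await callModel();
    } catch {
      if (attempt === maxRetries) break;                                    // retry budget exhausted
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));  // 500ms, 1s, 2s, ...
    }
  }
  return fallback();
}
```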
## 6. Monitoring
- Dashboards for cost, latency, cache hit rate, retry rate.
- Alerts for spikes, anomaly tenants, or runaway loops.

73
docs/llm/evals.md Normal file

@@ -0,0 +1,73 @@
# LLM System: Evals & Quality (Starter Template)
---
**Last Updated:** 2025-12-12
**Phase:** Phase 0 (Planning)
**Status:** Draft — finalize in Phase 1
**Owner:** AI/LLM Lead + Test Engineer
**References:**
- `/docs/llm/prompting.md`
- `/docs/llm/safety.md`
---
This document defines how you measure LLM quality and prevent regressions.
## 1. Goals
- Detect prompt/model regressions before production.
- Track accuracy, safety, latency, and cost over time.
- Provide a repeatable path for improving prompts and RAG.
## 2. Eval Suite Types
Mix the three layers below depending on your archetype:
1. **Unit evals (offline, deterministic)**
- Small golden set, strict expected outputs.
2. **Integration evals (offline, realistic)**
- Full pipeline including retrieval, tools, and post-processing.
3. **Online evals (production, controlled)**
- Shadow runs, A/B, canary prompts, RUM-style metrics.
## 3. Datasets
- Maintain **versioned eval datasets** with:
- input,
- expected output or rubric,
- metadata (domain, difficulty, edge cases).
- Include adversarial cases:
- prompt injection,
- ambiguous queries,
- long/noisy inputs,
- PII-rich inputs (to test redaction).
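A minimal sketch of one versioned eval case in TypeScript; the field names are placeholders pending the dataset-location decision in §8.
```ts
// Illustrative record shape for an entry in a versioned eval dataset.
interface EvalCase {
  id: string;
  datasetVersion: string;           // e.g. "classification-golden@1.0.0" (hypothetical name)
  input: string;                    // redacted or synthetic input text
  expected?: string;                // exact expected output, when one exists
  rubric?: string;                  // grading rubric when there is no single right answer
  metadata: {
    domain: string;
    difficulty: "easy" | "medium" | "hard";
    tags: string[];                 // e.g. ["prompt-injection", "long-input", "pii"]
  };
}
```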
## 4. Metrics (suggested)
Choose per archetype:
- **Task quality:** accuracy/F1, exact match, rubric score, human preference rate.
- **Safety:** refusal correctness, policy violations, PII leakage rate.
- **Robustness:** format-valid rate, tool-call correctness, retry rate.
- **Performance:** p50/p95 latency, tokens in/out, cost per task.
## 5. Regression Policy
- Every prompt or model change must run evals.
- Define gates:
- no safety regressions,
- quality must improve or stay within tolerance,
- latency/cost budgets respected.
- If a gate fails: block rollout or require explicit override in `RECOMMENDATIONS.md`.
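A minimal sketch of how the gates could be enforced against a baseline run; the metric names and the 20% latency/cost headroom are assumptions, not agreed budgets.
```ts
interface EvalRun { safetyViolations: number; qualityScore: number; p95LatencyMs: number; costPerTaskUsd: number; }

// Compare a candidate run to the current baseline and collect gate failures.
function passesGates(baseline: EvalRun, candidate: EvalRun, qualityTolerance = 0.01): { pass: boolean; reasons: string[] } {
  const reasons: string[] = [];
  if (candidate.safetyViolations > baseline.safetyViolations) reasons.push("safety regression");
  if (candidate.qualityScore < baseline.qualityScore - qualityTolerance) reasons.push("quality below tolerance");
  if (candidate.p95LatencyMs > baseline.p95LatencyMs * 1.2) reasons.push("latency budget exceeded");   // 20% headroom is an assumption
  if (candidate.costPerTaskUsd > baseline.costPerTaskUsd * 1.2) reasons.push("cost budget exceeded");   // 20% headroom is an assumption
  return { pass: reasons.length === 0, reasons };
}
```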
## 6. Human Review Loop
- For tasks without ground truth, use rubric-based human grading.
- Sample strategy:
- new prompt versions → 100% review on a small batch,
- stable versions → periodic audits.
## 7. Logging for Evals
- Store eval runs with:
- prompt version,
- model/provider version,
- retrieval config version (if used),
- inputs/outputs,
- metrics + artifacts.
## 8. Open Questions to Lock in Phase 1
- Where datasets live (repo vs storage)?
- Which metrics are hard gates for MVP?
- Online eval strategy (shadow vs A/B) and sample sizes?

110
docs/llm/prompting.md Normal file

@@ -0,0 +1,110 @@
# LLM System: Prompting (Starter Template)
---
**Last Updated:** 2025-12-12
**Phase:** Phase 0 (Planning)
**Status:** Draft — finalize in Phase 1
**Owner:** AI/LLM Lead
**References:**
- `/docs/archetypes.md`
- `/docs/llm/safety.md`
- `/docs/llm/evals.md`
---
This document defines how prompts are designed, versioned, and executed.
It is **archetype-agnostic**: adapt the “interaction surface” (chat, workflow generation, pipeline classification, agentic tasks) to your product.
## 1. Goals
- Produce **consistent, auditable outputs** across models/providers.
- Make prompt changes **safe and reversible** (versioning + evals).
- Keep sensitive data out of prompts unless strictly required (see safety).
## 2. Single LLM Entry Point
All LLM calls go through one abstraction (e.g., `callLLM()` / “LLM Gateway”):
- Centralizes model selection, temperature/top_p defaults, retries, timeouts.
- Applies redaction and policy checks before sending prompts.
- Emits structured logs + trace IDs to `EventLog`.
- Enforces output schema validation.
> Lock the exact interface and defaults in Phase 1.
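A minimal sketch of what the entry point could look like; the parameter and type names are assumptions to be replaced by the interface locked in Phase 1.
```ts
// Illustrative gateway signature; not the final interface.
interface LlmRequest {
  promptVersion: string;        // e.g. "classify@1.2.0"
  input: unknown;               // raw input; redaction happens inside the gateway
  model?: string;               // default selected centrally per task type
  maxTokens?: number;
  temperature?: number;
  outputSchema?: object;        // schema the output must validate against
}
interface LlmResponse<T = unknown> {
  output: T;                    // schema-validated output
  traceId: string;              // correlates with EventLog entries
  usage: { inputTokens: number; outputTokens: number };
}
// Every feature calls this one function; redaction, policy checks, retries,
// timeouts, logging, and schema validation all live behind it.
declare function callLLM<T = unknown>(request: LlmRequest): Promise<LlmResponse<T>>;
```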
## 3. Prompt Types
Define prompt families that match your archetype:
- **Chat-first:** system prompt + conversation memory + optional retrieval context.
- **Generation/workflow:** task prompt + constraints + examples + output schema.
- **Classification/pipeline:** short instruction + label set + few-shot examples + JSON output.
- **Agentic automation:** planner prompt + tool policy + step budget + “stop/ask-human” rules.
## 4. Prompt Structure (recommended)
Use a predictable layout for every prompt:
1. **System / role:** who the model is, high-level mission.
2. **Safety & constraints:** what not to do, privacy rules, refusal triggers.
3. **Task spec:** exact objective and success criteria.
4. **Context:** domain data, retrieved snippets, tool outputs (clearly delimited).
5. **Few-shot examples:** 1–3 archetype-relevant pairs.
6. **Output schema:** strict JSON/XML/markdown template.
### Example skeleton
```text
[SYSTEM]
You are ...
[CONSTRAINTS]
- Never ...
- If unsure, respond with ...
[TASK]
Given input X, produce Y.
[CONTEXT]
<untrusted_input>
...
</untrusted_input>
[EXAMPLES]
Input: ...
Output: ...
[OUTPUT_SCHEMA]
{ "label": "...", "confidence": 0..1, "reasoning_trace": {...} }
```
## 5. Prompt Versioning
- Store prompts in a dedicated location (e.g., `prompts/` folder or DB table).
- **Semantic versioning**: `prompt_name@major.minor.patch`.
- **major:** behavior change or schema change.
- **minor:** quality improvement (new examples, clearer instruction).
- **patch:** typos / no behavior change.
- Every version is linked to:
- model/provider version,
- eval suite run,
- changelog entry.
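A minimal sketch of the metadata a prompt version could carry to make these links explicit; the field names are illustrative.
```ts
// Illustrative registry entry tying a prompt version to its model and eval run.
interface PromptVersionRecord {
  name: string;            // e.g. "classify"
  version: string;         // "major.minor.patch"
  modelProvider: string;   // provider:model this version was validated against
  evalRunId: string;       // eval suite run that gated the rollout
  changelog: string;       // short human-readable summary of the change
  createdAt: string;       // ISO 8601 timestamp
}
```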
## 6. Output Schemas & Validation
- Prefer **strict JSON** for machine-consumed outputs.
- Validate outputs server-side (see the sketch after this list):
- required fields present,
- types/enum values correct,
- confidence in range,
- no disallowed keys (PII, secrets).
- If validation fails: retry with a “fix-format” prompt or fall back to a safe default.
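A minimal sketch of the server-side checks, using a plain validator rather than any particular schema library; the label set and allowed keys are placeholders.
```ts
// Validate machine-consumed output before acting on it.
interface ClassificationOutput { label: string; confidence: number; }

const ALLOWED_LABELS = new Set(["label_a", "label_b", "label_c"]);          // placeholder label set
const ALLOWED_KEYS = new Set(["label", "confidence", "reasoning_trace"]);   // reject any other key

function validateOutput(raw: unknown): ClassificationOutput | null {
  if (typeof raw !== "object" || raw === null) return null;
  const obj = raw as Record<string, unknown>;
  if (Object.keys(obj).some((k) => !ALLOWED_KEYS.has(k))) return null;      // disallowed keys present
  if (typeof obj.label !== "string" || !ALLOWED_LABELS.has(obj.label)) return null;
  if (typeof obj.confidence !== "number" || obj.confidence < 0 || obj.confidence > 1) return null;
  return { label: obj.label, confidence: obj.confidence };
}
// A null result triggers the "fix-format" retry or the safe default.
```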
## 7. Context Management
- Separate **trusted** vs **untrusted** context:
- Untrusted: user input, webhook payloads, retrieved docs.
- Trusted: system instructions, tool policies, fixed label sets.
- Delimit untrusted context explicitly to reduce prompt injection risk.
- Keep context minimal; avoid leaking irrelevant tenant/user data.
## 8. Memory (if applicable)
For chat/agentic archetypes:
- Short-term memory: last N turns.
- Long-term memory: curated summaries or embeddings with strict privacy rules.
- Never store raw PII in memory unless required and approved.
## 9. Open Questions to Lock in Phase 1
- Which models/providers are supported at launch?
- Default parameters and retry/backoff policy?
- Where prompts live (repo vs DB) and who can change them?
- How schema validation + fallback works per archetype?


@@ -0,0 +1,53 @@
# LLM System: RAG & Embeddings (Starter Template)
---
**Last Updated:** 2025-12-12
**Phase:** Phase 0 (Planning)
**Status:** Draft — finalize in Phase 1
**Owner:** AI/LLM Lead + Backend Architect
**References:**
- `/docs/backend/architecture.md`
- `/docs/llm/evals.md`
- `/docs/llm/safety.md`
---
This document describes retrieval-augmented generation (RAG) and embeddings.
Use it only if your archetype needs external knowledge or similarity search.
## 1. When to Use RAG
- You need grounded answers from a knowledge base.
- Inputs are large or dynamic (docs, tickets, policies).
- You want controllable citations/explainability.
Do **not** use RAG when:
- the task is purely generative with no grounding,
- retrieval latency/cost outweighs benefit.
## 2. Data Sources
- Curated docs, user-uploaded files, internal DB records, external APIs.
- Mark each source as trusted/untrusted and apply safety rules.
## 3. Chunking & Indexing
- Define chunk size/overlap per domain.
- Store embeddings in a vector index (e.g., `pgvector`, managed vector DB).
- Keep an embedding model/version field to support migrations.
## 4. Retrieval Strategy
- Default: semantic search top-k + optional filters (tenant, type, recency).
- Rerank if quality requires it.
- Always include retrieved doc IDs in `reasoning_trace` (not raw text).
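A minimal sketch of the default retrieval step; `VectorIndex` is a hypothetical interface standing in for `pgvector` or a managed vector DB client.
```ts
// Hypothetical vector-store interface; the real client is chosen in Phase 1.
interface VectorIndex {
  search(params: { embedding: number[]; topK: number; filter: Record<string, unknown> }): Promise<{ docId: string; score: number }[]>;
}

async function retrieve(index: VectorIndex, queryEmbedding: number[], tenantId: string, topK = 5) {
  // Tenant filter is mandatory: never retrieve across tenants.
  const hits = await index.search({ embedding: queryEmbedding, topK, filter: { tenantId } });
  // Only doc IDs are recorded in reasoning_trace; raw text is not.
  return { docIds: hits.map((h) => h.docId), hits };
}
```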
## 5. RAG Prompting Pattern
- Provide retrieved snippets in a clearly delimited block.
- Instruct model to answer **only** using retrieved context when grounding is required.
- If context is insufficient → ask for clarification or defer.
## 6. Evaluating Retrieval
- Measure recall/precision of retrieval separately from generation quality.
- Add “no-answer” test cases to avoid hallucinations.
## 7. Privacy & Multi-Tenancy
- Tenant-scoped indexes or strict filters.
- Never retrieve across tenants.
- Redact PII before embedding if embeddings can be exposed or logged.

86
docs/llm/safety.md Normal file

@@ -0,0 +1,86 @@
# LLM System: Safety, Privacy & Reasoning Traces (Starter Template)
---
**Last Updated:** 2025-12-12
**Phase:** Phase 0 (Planning)
**Status:** Draft — finalize in Phase 1
**Owner:** Security + AI/LLM Lead
**References:**
- `/docs/backend/security.md`
- `/docs/llm/prompting.md`
---
This document defines the safety posture for any LLM-backed feature: privacy, injection defenses, tool safety, and what you log.
## 1. Safety Goals
- Prevent leakage of PII/tenant secrets to LLMs, logs, or UI.
- Resist prompt injection and untrusted context manipulation.
- Ensure outputs are safe to act on (validated, bounded, auditable).
## 2. Data Classification & Handling
Define categories for your domain:
- **Public:** safe to send and store.
- **Internal:** safe to send only if necessary; store minimally.
- **Sensitive (PII/PHI/PCI/Secrets):** never send unless explicitly approved; never store in traces.
## 3. Redaction Pipeline (before LLM)
Apply a mandatory preprocessing step in `callLLM()` (a sketch follows this list):
1. Detect sensitive fields (allowlist what *can* be sent, not what cant).
2. Redact or hash PII (names, emails, phone, addresses, IDs, card data).
3. Replace with stable placeholders: `{{USER_EMAIL_HASH}}`.
4. Attach a “redaction summary” to logs (no raw PII).
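A minimal sketch of steps 2–4, assuming Node's `crypto`; the two patterns shown (emails, phone-like numbers) are illustrative and much narrower than the real rule set would be.
```ts
import { createHash } from "node:crypto";

// Replace detected PII with stable, hash-based placeholders and count what was redacted.
function redact(text: string): { redacted: string; summary: Record<string, number> } {
  const patterns: Record<string, RegExp> = {
    EMAIL: /[\w.+-]+@[\w-]+\.[\w.]+/g,
    PHONE: /\+?\d[\d\s().-]{7,}\d/g,
  };
  const summary: Record<string, number> = {};
  let redacted = text;
  for (const [kind, re] of Object.entries(patterns)) {
    redacted = redacted.replace(re, (match) => {
      summary[kind] = (summary[kind] ?? 0) + 1;
      const hash = createHash("sha256").update(match).digest("hex").slice(0, 8);
      return `{{${kind}_HASH_${hash}}}`;   // stable placeholder, e.g. {{EMAIL_HASH_1a2b3c4d}}
    });
  }
  return { redacted, summary };            // summary holds counts only, never raw PII
}
```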
## 4. Prompt Injection & Untrusted Context
- Delimit untrusted input (`<untrusted_input>...</untrusted_input>`).
- Never allow untrusted text to override system constraints.
- For RAG: treat retrieved docs as untrusted unless curated.
- If injection is detected → refuse or ask for human review.
## 5. Tool / Agent Safety (if applicable)
- Tool allowlist with scopes and rate limits.
- Confirm destructive actions with humans (“human checkpoint”).
- Constrain tool output length and validate it before reuse.
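A minimal sketch of an allowlist check; the tool names, scopes, and per-run limits are hypothetical.
```ts
// Illustrative tool policy: only allowlisted tools run, and destructive ones require a human checkpoint.
interface ToolPolicy { scopes: string[]; maxCallsPerRun: number; destructive: boolean; }

const toolAllowlist: Record<string, ToolPolicy> = {
  search_docs: { scopes: ["read:kb"], maxCallsPerRun: 10, destructive: false },
  close_ticket: { scopes: ["write:tickets"], maxCallsPerRun: 1, destructive: true },
};

function authorizeToolCall(tool: string, callsSoFar: number): "allow" | "needs_human" | "deny" {
  const policy = toolAllowlist[tool];
  if (!policy || callsSoFar >= policy.maxCallsPerRun) return "deny";
  return policy.destructive ? "needs_human" : "allow";
}
```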
## 6. `reasoning_trace` Specification
`reasoning_trace` is **optional** and should be safe to show to humans.
Store only **structured, privacy-safe metadata**, never raw prompts or user PII.
### Allowed fields (example)
```json
{
"prompt_version": "classify@1.2.0",
"model": "provider:model",
"inputs": { "redacted": true, "source_ids": ["..."] },
"steps": [
{ "type": "rule_hit", "rule_id": "r_123", "confidence": 0.72 },
{ "type": "retrieval", "top_k": 5, "doc_ids": ["d1","d2"] },
{ "type": "llm_call", "confidence": 0.64 }
],
"output": { "label": "X", "confidence": 0.64 },
"trace_id": "..."
}
```
### Explicitly disallowed in traces
- Raw user input, webhook payloads, or document text.
- Emails, phone numbers, addresses, names, gov IDs.
- Payment data, auth tokens, API keys, secrets.
- Full prompts or full LLM responses (store refs or summaries only).
### How we guarantee “no PII” in traces
1. **Schema allowlist:** trace is validated against a strict schema with only allowed keys.
2. **Redaction required:** `callLLM()` sets `inputs.redacted=true` only after redaction succeeded.
3. **PII linting:** server-side scan of the trace JSON for patterns (emails, phones, IDs) before storing (see the sketch after this list).
4. **UI gating:** only safe fields are rendered; raw text is never shown from the trace.
5. **Audits:** periodic sampling in Phase 3+ to verify zero leakage.
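A minimal sketch of the PII lint in step 3: the serialized trace is scanned before storage; the patterns are illustrative and narrower than the real rule set.
```ts
// Reject a trace whose JSON serialization matches any PII-looking pattern.
const PII_PATTERNS: RegExp[] = [
  /[\w.+-]+@[\w-]+\.[\w.]+/,    // email
  /\+?\d[\d\s().-]{7,}\d/,      // phone-like digit runs
];

function traceIsSafeToStore(trace: object): boolean {
  const serialized = JSON.stringify(trace);
  return !PII_PATTERNS.some((re) => re.test(serialized));
}
```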
## 7. Storage & Retention
- Traces stored per tenant; encrypted at rest.
- Retention window aligned with compliance needs.
- Ability to disable traces globally or per tenant.
## 8. Open Questions to Lock in Phase 1
- Exact redaction rules and allowlist fields.
- Whether to store any raw LLM outputs outside traces (audit vault).
- Who can access traces in UI and API.