Files
AI_template/docs/llm/safety.md
olekhondera 5b28ea675d add SKILL
2026-02-14 07:38:50 +02:00

86 lines
3.4 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# LLM System: Safety, Privacy & Reasoning Traces (Starter Template)
---
**Phase:** Phase 0 (Planning)
**Status:** Draft — finalize in Phase 1
**Owner:** Security + AI/LLM Lead
**References:**
- `/docs/backend/security.md`
- `/docs/llm/prompting.md`
---
This document defines the safety posture for any LLMbacked feature: privacy, injection defenses, tool safety, and what you log.
## 1. Safety Goals
- Prevent leakage of PII/tenant secrets to LLMs, logs, or UI.
- Resist prompt injection and untrusted context manipulation.
- Ensure outputs are safe to act on (validated, bounded, auditable).
## 2. Data Classification & Handling
Define categories for your domain:
- **Public:** safe to send and store.
- **Internal:** safe to send only if necessary; store minimally.
- **Sensitive (PII/PHI/PCI/Secrets):** never send unless explicitly approved; never store in traces.
## 3. Redaction Pipeline (before LLM)
Apply a mandatory preprocessing step in `callLLM()`:
1. Detect sensitive fields (allowlist what *can* be sent, not what cant).
2. Redact or hash PII (names, emails, phone, addresses, IDs, card data).
3. Replace with stable placeholders: `{{USER_EMAIL_HASH}}`.
4. Attach a “redaction summary” to logs (no raw PII).
## 4. Prompt Injection & Untrusted Context
- Delimit untrusted input (`<untrusted_input>...</untrusted_input>`).
- Never allow untrusted text to override system constraints.
- For RAG: treat retrieved docs as untrusted unless curated.
- If injection detected → refuse or ask for human review.
## 5. Tool / Agent Safety (if applicable)
- Tool allowlist with scopes and rate limits.
- Confirm destructive actions with humans (“human checkpoint”).
- Constrain tool outputs length and validate before reuse.
## 6. `reasoning_trace` Specification
`reasoning_trace` is **optional** and should be safe to show to humans.
Store only **structured, privacysafe metadata**, never raw prompts or user PII.
### Allowed fields (example)
```json
{
"prompt_version": "classify@1.2.0",
"model": "provider:model",
"inputs": { "redacted": true, "source_ids": ["..."] },
"steps": [
{ "type": "rule_hit", "rule_id": "r_123", "confidence": 0.72 },
{ "type": "retrieval", "top_k": 5, "doc_ids": ["d1","d2"] },
{ "type": "llm_call", "confidence": 0.64 }
],
"output": { "label": "X", "confidence": 0.64 },
"trace_id": "..."
}
```
### Explicitly disallowed in traces
- Raw user input, webhook payloads, or document text.
- Emails, phone numbers, addresses, names, gov IDs.
- Payment data, auth tokens, API keys, secrets.
- Full prompts or full LLM responses (store refs or summaries only).
### How we guarantee “no PII” in traces
1. **Schema allowlist:** trace is validated against a strict schema with only allowed keys.
2. **Redaction required:** `callLLM()` sets `inputs.redacted=true` only after redaction succeeded.
3. **PII linting:** serverside scan of trace JSON for patterns (emails, phones, IDs) before storing.
4. **UI gating:** only safe fields are rendered; raw text never shown from trace.
5. **Audits:** periodic sampling in Phase 3+ to verify zero leakage.
## 7. Storage & Retention
- Traces stored per tenant; encrypted at rest.
- Retention window aligned with compliance needs.
- Ability to disable traces globally or per tenant.
## 8. Open Questions to Lock in Phase 1
- Exact redaction rules and allowlist fields.
- Whether to store any raw LLM outputs outside traces (audit vault).
- Who can access traces in UI and API.