LLM System: Safety, Privacy & Reasoning Traces (Starter Template)
Last Updated: 2025-12-12
Phase: Phase 0 (Planning)
Status: Draft — finalize in Phase 1
Owner: Security + AI/LLM Lead
References:
- /docs/backend/security.md
- /docs/llm/prompting.md
This document defines the safety posture for any LLM‑backed feature: privacy, injection defenses, tool safety, and what you log.
1. Safety Goals
- Prevent leakage of PII/tenant secrets to LLMs, logs, or UI.
- Resist prompt injection and untrusted context manipulation.
- Ensure outputs are safe to act on (validated, bounded, auditable).
2. Data Classification & Handling
Define categories for your domain:
- Public: safe to send and store.
- Internal: safe to send only if necessary; store minimally.
- Sensitive (PII/PHI/PCI/Secrets): never send unless explicitly approved; never store in traces.
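The categories above can be encoded as a default-deny policy map. A minimal sketch, assuming illustrative field names (`doc_category`, `ticket_body`, `user_email`) that would be replaced with your domain's fields in Phase 1:

```typescript
// Illustrative classification policy; categories mirror this doc, field names
// are assumptions for the template.
type Classification = "public" | "internal" | "sensitive";

const FIELD_CLASSIFICATION: Record<string, Classification> = {
  doc_category: "public",
  ticket_body:  "internal",
  user_email:   "sensitive",
};

// Public: always sendable. Internal: only if necessary. Sensitive: never.
// Unknown fields default to sensitive (default-deny).
function maySendToLLM(field: string, necessary: boolean): boolean {
  const cls = FIELD_CLASSIFICATION[field] ?? "sensitive";
  if (cls === "public") return true;
  if (cls === "internal") return necessary;
  return false;
}
```

Defaulting unknown fields to sensitive keeps the policy an allowlist: a new field leaks nothing until someone classifies it.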
3. Redaction Pipeline (before LLM)
Apply a mandatory pre‑processing step in callLLM():
- Detect sensitive fields (allowlist what can be sent, not what can’t).
- Redact or hash PII (names, emails, phone, addresses, IDs, card data).
- Replace with stable placeholders (e.g., {{USER_EMAIL_HASH}}).
- Attach a “redaction summary” to logs (no raw PII).
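The pre-processing step could look like the following sketch: an allowlist of sendable fields plus an email-to-hashed-placeholder rewrite, returning a redaction summary for logging. Field names, the allowlist contents, and the placeholder format are assumptions for this template:

```typescript
import { createHash } from "crypto";

// Allowlist what can be sent, not what can't (per section 3).
const ALLOWED_FIELDS = new Set(["subject", "category", "body"]);
const EMAIL_RE = /[\w.+-]+@[\w-]+\.[A-Za-z]{2,}/g;

function redactForLLM(input: Record<string, string>) {
  const redacted: Record<string, string> = {};
  const summary: string[] = [];
  for (const [key, value] of Object.entries(input)) {
    if (!ALLOWED_FIELDS.has(key)) {
      summary.push(`dropped:${key}`); // non-allowlisted fields never leave
      continue;
    }
    // Replace emails with a stable hashed placeholder so the same address
    // maps to the same token across calls.
    redacted[key] = value.replace(EMAIL_RE, (match) => {
      summary.push(`hashed:email:${key}`);
      const h = createHash("sha256").update(match).digest("hex").slice(0, 8);
      return `{{EMAIL_${h}}}`;
    });
  }
  return { redacted, summary };
}
```

The summary contains only field names and redaction kinds, never raw values, so it is safe to attach to logs.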
4. Prompt Injection & Untrusted Context
- Delimit untrusted input (<untrusted_input>...</untrusted_input>).
- Never allow untrusted text to override system constraints.
- For RAG: treat retrieved docs as untrusted unless curated.
- If injection detected → refuse or ask for human review.
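Two of the bullets above can be sketched directly: a delimiter wrapper that strips attacker-embedded closing tags (so untrusted text cannot escape its fence), and a crude heuristic for the refuse-or-escalate path. The patterns are illustrative, not a complete injection detector:

```typescript
// Illustrative injection heuristics; a real list would be tuned in Phase 1.
const INJECTION_PATTERNS = [
  /ignore (all )?previous instructions/i,
  /you are now/i,
];

function wrapUntrusted(text: string): string {
  // Strip any delimiter tags the attacker embedded to break out of the fence.
  const cleaned = text.replace(/<\/?untrusted_input>/g, "");
  return `<untrusted_input>\n${cleaned}\n</untrusted_input>`;
}

function looksLikeInjection(text: string): boolean {
  return INJECTION_PATTERNS.some((re) => re.test(text));
}
```

Heuristics like this only gate escalation to human review; the primary defense remains that the system prompt treats everything inside the delimiters as data, not instructions.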
5. Tool / Agent Safety (if applicable)
- Tool allowlist with scopes and rate limits.
- Confirm destructive actions with humans (“human checkpoint”).
- Constrain tool output length and validate outputs before reuse.
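An allowlist with scopes and a human checkpoint for destructive actions could be sketched as below. Tool names, scopes, and limits are illustrative placeholders:

```typescript
// Illustrative tool allowlist; entries and limits would be defined in Phase 1.
interface ToolPolicy {
  scopes: string[];
  maxCallsPerMinute: number;
  destructive: boolean; // destructive tools require a human checkpoint
}

const TOOL_ALLOWLIST: Record<string, ToolPolicy> = {
  search_docs:   { scopes: ["read:docs"],     maxCallsPerMinute: 30, destructive: false },
  delete_record: { scopes: ["write:records"], maxCallsPerMinute: 2,  destructive: true },
};

function authorizeToolCall(tool: string, humanApproved: boolean): boolean {
  const policy = TOOL_ALLOWLIST[tool];
  if (!policy) return false;                               // not on the allowlist
  if (policy.destructive && !humanApproved) return false;  // human checkpoint
  return true;
}
```

Rate limiting against `maxCallsPerMinute` would happen in the same gate; it is omitted here for brevity.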
6. reasoning_trace Specification
reasoning_trace is optional and should be safe to show to humans.
Store only structured, privacy‑safe metadata, never raw prompts or user PII.
Allowed fields (example)
{
  "prompt_version": "classify@1.2.0",
  "model": "provider:model",
  "inputs": { "redacted": true, "source_ids": ["..."] },
  "steps": [
    { "type": "rule_hit", "rule_id": "r_123", "confidence": 0.72 },
    { "type": "retrieval", "top_k": 5, "doc_ids": ["d1", "d2"] },
    { "type": "llm_call", "confidence": 0.64 }
  ],
  "output": { "label": "X", "confidence": 0.64 },
  "trace_id": "..."
}
Explicitly disallowed in traces
- Raw user input, webhook payloads, or document text.
- Emails, phone numbers, addresses, names, gov IDs.
- Payment data, auth tokens, API keys, secrets.
- Full prompts or full LLM responses (store refs or summaries only).
How we guarantee “no PII” in traces
- Schema allowlist: trace is validated against a strict schema with only allowed keys.
- Redaction required: callLLM() sets inputs.redacted=true only after redaction has succeeded.
- PII linting: server‑side scan of trace JSON for patterns (emails, phones, IDs) before storing.
- UI gating: only safe fields are rendered; raw text never shown from trace.
- Audits: periodic sampling in Phase 3+ to verify zero leakage.
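The PII-linting step from the list above could be sketched as a pattern scan over the serialized trace, run before storage. The patterns are illustrative starting points, not production-grade detectors, and would be tuned in Phase 1:

```typescript
// Illustrative PII lint over serialized trace JSON; patterns are assumptions.
const PII_PATTERNS: Record<string, RegExp> = {
  email: /[\w.+-]+@[\w-]+\.[A-Za-z]{2,}/,
  phone: /\+?\d[\d\s().-]{8,}\d/,
  card:  /\b(?:\d[ -]?){13,16}\b/,
};

// Returns the names of every pattern that matched; store only if empty.
function lintTraceForPII(trace: unknown): string[] {
  const json = JSON.stringify(trace);
  return Object.entries(PII_PATTERNS)
    .filter(([, re]) => re.test(json))
    .map(([name]) => name);
}
```

A non-empty result would block the write and raise an alert, since the schema allowlist upstream should already have prevented such values from entering the trace.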
7. Storage & Retention
- Traces stored per tenant; encrypted at rest.
- Retention window aligned with compliance needs.
- Ability to disable traces globally or per tenant.
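The retention and kill-switch bullets imply a small config surface. A sketch with assumed keys and defaults (to be locked in Phase 1 alongside compliance requirements):

```typescript
// Illustrative trace storage config; keys and defaults are assumptions.
interface TraceConfig {
  enabled: boolean;                                      // global kill switch
  retentionDays: number;                                 // align with compliance
  tenantOverrides: Record<string, { enabled: boolean }>; // per-tenant disable
}

const defaultTraceConfig: TraceConfig = {
  enabled: true,
  retentionDays: 30,
  tenantOverrides: {},
};

function tracesEnabledFor(cfg: TraceConfig, tenantId: string): boolean {
  if (!cfg.enabled) return false; // global disable always wins
  return cfg.tenantOverrides[tenantId]?.enabled ?? true;
}
```

Making the global switch take precedence means an incident response can cut all trace writes with a single flag, regardless of tenant settings.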
8. Open Questions to Lock in Phase 1
- Exact redaction rules and allowlist fields.
- Whether to store any raw LLM outputs outside traces (audit vault).
- Who can access traces in UI and API.