# LLM System: Safety, Privacy & Reasoning Traces (Starter Template)

---

**Last Updated:** 2025-12-12
**Phase:** Phase 0 (Planning)
**Status:** Draft — finalize in Phase 1
**Owner:** Security + AI/LLM Lead
**References:**

- `/docs/backend/security.md`
- `/docs/llm/prompting.md`

---

This document defines the safety posture for any LLM-backed feature: privacy, injection defenses, tool safety, and what you log.

## 1. Safety Goals

- Prevent leakage of PII/tenant secrets to LLMs, logs, or UI.
- Resist prompt injection and untrusted context manipulation.
- Ensure outputs are safe to act on (validated, bounded, auditable).

## 2. Data Classification & Handling

Define categories for your domain:

- **Public:** safe to send and store.
- **Internal:** safe to send only if necessary; store minimally.
- **Sensitive (PII/PHI/PCI/Secrets):** never send unless explicitly approved; never store in traces.

## 3. Redaction Pipeline (before LLM)

Apply a mandatory pre-processing step in `callLLM()`:

1. Detect sensitive fields (allowlist what *can* be sent, not what can’t).
2. Redact or hash PII (names, emails, phone numbers, addresses, IDs, card data).
3. Replace values with stable placeholders, e.g. `{{USER_EMAIL_HASH}}`.
4. Attach a “redaction summary” to logs (no raw PII).

## 4. Prompt Injection & Untrusted Context

- Delimit untrusted input (`...`).
- Never allow untrusted text to override system constraints.
- For RAG: treat retrieved docs as untrusted unless curated.
- If injection is detected → refuse or escalate for human review.

## 5. Tool / Agent Safety (if applicable)

- Tool allowlist with scopes and rate limits.
- Confirm destructive actions with humans (“human checkpoint”).
- Constrain tool output length and validate it before reuse.

## 6. `reasoning_trace` Specification

`reasoning_trace` is **optional** and should be safe to show to humans. Store only **structured, privacy-safe metadata**, never raw prompts or user PII.
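One way to enforce the "structured, privacy-safe metadata only" rule is a strict key allowlist applied before any trace is stored. The sketch below is illustrative, not a prescribed implementation: `ALLOWED_KEYS` and `validate_trace` are hypothetical names, and a real check would validate nested fields as well.

```python
# Minimal sketch of a trace allowlist check (hypothetical names).
# Top-level keys follow the example schema in this document.
ALLOWED_KEYS = {"prompt_version", "model", "inputs", "steps", "output", "trace_id"}

def validate_trace(trace: dict) -> dict:
    """Reject traces with keys outside the allowlist or unredacted inputs."""
    unknown = set(trace) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"disallowed trace keys: {sorted(unknown)}")
    # Guarantee #2 below relies on this flag being set only post-redaction.
    if not trace.get("inputs", {}).get("redacted", False):
        raise ValueError("trace inputs must be redacted before storage")
    return trace
```

A check like this would run server-side at write time, so a buggy caller cannot persist a trace carrying raw text under an unexpected key.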
### Allowed fields (example)

```json
{
  "prompt_version": "classify@1.2.0",
  "model": "provider:model",
  "inputs": { "redacted": true, "source_ids": ["..."] },
  "steps": [
    { "type": "rule_hit", "rule_id": "r_123", "confidence": 0.72 },
    { "type": "retrieval", "top_k": 5, "doc_ids": ["d1", "d2"] },
    { "type": "llm_call", "confidence": 0.64 }
  ],
  "output": { "label": "X", "confidence": 0.64 },
  "trace_id": "..."
}
```

### Explicitly disallowed in traces

- Raw user input, webhook payloads, or document text.
- Emails, phone numbers, addresses, names, gov IDs.
- Payment data, auth tokens, API keys, secrets.
- Full prompts or full LLM responses (store refs or summaries only).

### How we guarantee “no PII” in traces

1. **Schema allowlist:** trace is validated against a strict schema with only allowed keys.
2. **Redaction required:** `callLLM()` sets `inputs.redacted=true` only after redaction succeeded.
3. **PII linting:** server-side scan of trace JSON for patterns (emails, phones, IDs) before storing.
4. **UI gating:** only safe fields are rendered; raw text never shown from trace.
5. **Audits:** periodic sampling in Phase 3+ to verify zero leakage.

## 7. Storage & Retention

- Traces stored per tenant; encrypted at rest.
- Retention window aligned with compliance needs.
- Ability to disable traces globally or per tenant.

## 8. Open Questions to Lock in Phase 1

- Exact redaction rules and allowlist fields.
- Whether to store any raw LLM outputs outside traces (audit vault).
- Who can access traces in UI and API.
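The PII-linting guarantee in §6 (a server-side pattern scan of trace JSON before storage) can be sketched as below. This is a minimal illustration under stated assumptions: `PII_PATTERNS` and `lint_trace` are hypothetical names, and the three regexes are deliberately narrow — a production linter needs far broader coverage than emails, phone-like digit runs, and US SSN-shaped strings.

```python
import json
import re

# Illustrative detection patterns only — not an exhaustive PII detector.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{8,}\d"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def lint_trace(trace: dict) -> list[str]:
    """Return the names of PII patterns found anywhere in the serialized trace."""
    blob = json.dumps(trace)
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(blob)]
```

Scanning the serialized JSON (rather than individual fields) catches PII smuggled into any nesting level; a non-empty result would block the write and page the owning team.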