86 lines
3.4 KiB
Markdown
86 lines
3.4 KiB
Markdown
# LLM System: Safety, Privacy & Reasoning Traces (Starter Template)
|
||
|
||
---
|
||
**Phase:** Phase 0 (Planning)
|
||
**Status:** Draft — finalize in Phase 1
|
||
**Owner:** Security + AI/LLM Lead
|
||
**References:**
|
||
- `/docs/backend/security.md`
|
||
- `/docs/llm/prompting.md`
|
||
---
|
||
|
||
This document defines the safety posture for any LLM‑backed feature: privacy, injection defenses, tool safety, and what you log.
|
||
|
||
## 1. Safety Goals
|
||
- Prevent leakage of PII/tenant secrets to LLMs, logs, or UI.
|
||
- Resist prompt injection and untrusted context manipulation.
|
||
- Ensure outputs are safe to act on (validated, bounded, auditable).
|
||
|
||
## 2. Data Classification & Handling
|
||
Define categories for your domain:
|
||
- **Public:** safe to send and store.
|
||
- **Internal:** safe to send only if necessary; store minimally.
|
||
- **Sensitive (PII/PHI/PCI/Secrets):** never send unless explicitly approved; never store in traces.
|
||
|
||
## 3. Redaction Pipeline (before LLM)
|
||
Apply a mandatory pre‑processing step in `callLLM()`:
|
||
1. Detect sensitive fields (allowlist what *can* be sent, not what can’t).
|
||
2. Redact or hash PII (names, emails, phone, addresses, IDs, card data).
|
||
3. Replace with stable placeholders: `{{USER_EMAIL_HASH}}`.
|
||
4. Attach a “redaction summary” to logs (no raw PII).
|
||
|
||
## 4. Prompt Injection & Untrusted Context
|
||
- Delimit untrusted input (`<untrusted_input>...</untrusted_input>`).
|
||
- Never allow untrusted text to override system constraints.
|
||
- For RAG: treat retrieved docs as untrusted unless curated.
|
||
- If injection detected → refuse or ask for human review.
|
||
|
||
## 5. Tool / Agent Safety (if applicable)
|
||
- Tool allowlist with scopes and rate limits.
|
||
- Confirm destructive actions with humans (“human checkpoint”).
|
||
- Constrain tool outputs length and validate before reuse.
|
||
|
||
## 6. `reasoning_trace` Specification
|
||
`reasoning_trace` is **optional** and should be safe to show to humans.
|
||
Store only **structured, privacy‑safe metadata**, never raw prompts or user PII.
|
||
|
||
### Allowed fields (example)
|
||
```json
|
||
{
|
||
"prompt_version": "classify@1.2.0",
|
||
"model": "provider:model",
|
||
"inputs": { "redacted": true, "source_ids": ["..."] },
|
||
"steps": [
|
||
{ "type": "rule_hit", "rule_id": "r_123", "confidence": 0.72 },
|
||
{ "type": "retrieval", "top_k": 5, "doc_ids": ["d1","d2"] },
|
||
{ "type": "llm_call", "confidence": 0.64 }
|
||
],
|
||
"output": { "label": "X", "confidence": 0.64 },
|
||
"trace_id": "..."
|
||
}
|
||
```
|
||
|
||
### Explicitly disallowed in traces
|
||
- Raw user input, webhook payloads, or document text.
|
||
- Emails, phone numbers, addresses, names, gov IDs.
|
||
- Payment data, auth tokens, API keys, secrets.
|
||
- Full prompts or full LLM responses (store refs or summaries only).
|
||
|
||
### How we guarantee “no PII” in traces
|
||
1. **Schema allowlist:** trace is validated against a strict schema with only allowed keys.
|
||
2. **Redaction required:** `callLLM()` sets `inputs.redacted=true` only after redaction succeeded.
|
||
3. **PII linting:** server‑side scan of trace JSON for patterns (emails, phones, IDs) before storing.
|
||
4. **UI gating:** only safe fields are rendered; raw text never shown from trace.
|
||
5. **Audits:** periodic sampling in Phase 3+ to verify zero leakage.
|
||
|
||
## 7. Storage & Retention
|
||
- Traces stored per tenant; encrypted at rest.
|
||
- Retention window aligned with compliance needs.
|
||
- Ability to disable traces globally or per tenant.
|
||
|
||
## 8. Open Questions to Lock in Phase 1
|
||
- Exact redaction rules and allowlist fields.
|
||
- Whether to store any raw LLM outputs outside traces (audit vault).
|
||
- Who can access traces in UI and API.
|
||
|