
LLM System: Safety, Privacy & Reasoning Traces (Starter Template)


Last Updated: 2025-12-12
Phase: Phase 0 (Planning)
Status: Draft — finalize in Phase 1
Owner: Security + AI/LLM Lead
References:

  • /docs/backend/security.md
  • /docs/llm/prompting.md

This document defines the safety posture for any LLM-backed feature: privacy, prompt-injection defenses, tool safety, and what may be logged.

1. Safety Goals

  • Prevent leakage of PII/tenant secrets to LLMs, logs, or UI.
  • Resist prompt injection and untrusted context manipulation.
  • Ensure outputs are safe to act on (validated, bounded, auditable).

2. Data Classification & Handling

Define categories for your domain:

  • Public: safe to send and store.
  • Internal: safe to send only if necessary; store minimally.
  • Sensitive (PII/PHI/PCI/Secrets): never send unless explicitly approved; never store in traces.
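The three categories above can be encoded as a gate that every field must pass before inclusion in an LLM request. This is a minimal sketch; the type names (`DataClass`, `maySendToLLM`) are illustrative, not part of the template.

```typescript
// Hypothetical sketch: a classification tag attached to every field
// before it can be included in an LLM request.
type DataClass = "public" | "internal" | "sensitive";

interface ClassifiedField {
  name: string;
  value: string;
  dataClass: DataClass;
}

// Decide whether a field may be sent to the LLM. Sensitive data is
// always rejected; internal data requires an explicit justification.
function maySendToLLM(field: ClassifiedField, justified = false): boolean {
  switch (field.dataClass) {
    case "public":
      return true;
    case "internal":
      return justified;
    case "sensitive":
      return false;
  }
}
```

Making the sensitive branch unconditionally `false` keeps the "never send unless explicitly approved" rule out of runtime hands; approval would mean reclassifying the field, not passing a flag.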

3. Redaction Pipeline (before LLM)

Apply a mandatory preprocessing step in callLLM():

  1. Detect sensitive fields (allowlist what can be sent, not what can't).
  2. Redact or hash PII (names, emails, phone, addresses, IDs, card data).
  3. Replace with stable placeholders: {{USER_EMAIL_HASH}}.
  4. Attach a “redaction summary” to logs (no raw PII).
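The four steps above can be sketched as a single preprocessing function. This is one possible shape, assuming emails as the example PII class; the hash suffix is one way to keep placeholders stable per value (same input, same placeholder) as step 3 requires, and the returned summary carries counts only, no raw PII, per step 4.

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch of the mandatory preprocessing step in callLLM():
// redact emails, replace them with stable hashed placeholders, and
// return a PII-free redaction summary for logging.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

function redactForLLM(text: string): { redacted: string; summary: { emails: number } } {
  let emails = 0;
  const redacted = text.replace(EMAIL_RE, (match) => {
    emails++;
    // Stable placeholder: identical emails map to identical tokens.
    const digest = createHash("sha256").update(match).digest("hex").slice(0, 8);
    return `{{USER_EMAIL_HASH:${digest}}}`;
  });
  return { redacted, summary: { emails } };
}
```

A real pipeline would add detectors for phone numbers, addresses, IDs, and card data, but the contract stays the same: the redacted text and a countable summary go forward, the raw values do not.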

4. Prompt Injection & Untrusted Context

  • Delimit untrusted input (<untrusted_input>...</untrusted_input>).
  • Never allow untrusted text to override system constraints.
  • For RAG: treat retrieved docs as untrusted unless curated.
  • If injection is detected → refuse or escalate for human review.
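A minimal sketch of the first and last bullets, assuming the `<untrusted_input>` delimiters named above. The function names and the single regex heuristic are illustrative; real detection should combine patterns, classifiers, and human review for borderline cases.

```typescript
// Wrap untrusted text in explicit delimiters. Delimiter look-alikes are
// stripped first so the payload cannot close the tag and escape.
function wrapUntrusted(text: string): string {
  const sanitized = text.replace(/<\/?untrusted_input>/g, "");
  return `<untrusted_input>\n${sanitized}\n</untrusted_input>`;
}

// Naive heuristic for an obvious injection attempt (illustrative only).
function looksLikeInjection(text: string): boolean {
  return /ignore (all |any )?(previous|prior) instructions/i.test(text);
}
```

Stripping delimiter look-alikes matters as much as adding the delimiters: without it, a document containing `</untrusted_input>` would break out of the untrusted region.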

5. Tool / Agent Safety (if applicable)

  • Tool allowlist with scopes and rate limits.
  • Confirm destructive actions with humans (“human checkpoint”).
  • Constrain tool output length and validate outputs before reuse.
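The allowlist and human-checkpoint bullets can be sketched as a single authorization gate in front of every tool call. All names here (`ToolPolicy`, `authorizeToolCall`, the example tools) are hypothetical; rate limiting is declared in the policy but its enforcement is left out for brevity.

```typescript
// Hypothetical tool registry enforcing an allowlist, scopes, and a
// human checkpoint for destructive actions.
interface ToolPolicy {
  scopes: string[];
  destructive: boolean;
  maxCallsPerMinute: number;
}

const TOOL_ALLOWLIST: Record<string, ToolPolicy> = {
  search_docs: { scopes: ["read"], destructive: false, maxCallsPerMinute: 30 },
  delete_record: { scopes: ["write"], destructive: true, maxCallsPerMinute: 2 },
};

function authorizeToolCall(
  tool: string,
  grantedScopes: string[],
  humanApproved: boolean
): { allowed: boolean; reason: string } {
  const policy = TOOL_ALLOWLIST[tool];
  if (!policy) return { allowed: false, reason: "tool not on allowlist" };
  if (!policy.scopes.every((s) => grantedScopes.includes(s)))
    return { allowed: false, reason: "missing scope" };
  if (policy.destructive && !humanApproved)
    return { allowed: false, reason: "human checkpoint required" };
  return { allowed: true, reason: "ok" };
}
```

Unknown tools fail closed: anything the model invents that is not in the allowlist is rejected before execution, with a reason string that can go into the trace.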

6. reasoning_trace Specification

reasoning_trace is optional and should be safe to show to humans.
Store only structured, privacy-safe metadata; never raw prompts or user PII.

Allowed fields (example)

{
  "prompt_version": "classify@1.2.0",
  "model": "provider:model",
  "inputs": { "redacted": true, "source_ids": ["..."] },
  "steps": [
    { "type": "rule_hit", "rule_id": "r_123", "confidence": 0.72 },
    { "type": "retrieval", "top_k": 5, "doc_ids": ["d1","d2"] },
    { "type": "llm_call", "confidence": 0.64 }
  ],
  "output": { "label": "X", "confidence": 0.64 },
  "trace_id": "..."
}

Explicitly disallowed in traces

  • Raw user input, webhook payloads, or document text.
  • Emails, phone numbers, addresses, names, gov IDs.
  • Payment data, auth tokens, API keys, secrets.
  • Full prompts or full LLM responses (store refs or summaries only).

How we guarantee “no PII” in traces

  1. Schema allowlist: trace is validated against a strict schema with only allowed keys.
  2. Redaction required: callLLM() sets inputs.redacted=true only after redaction has succeeded.
  3. PII linting: server-side scan of trace JSON for patterns (emails, phones, IDs) before storing.
  4. UI gating: only safe fields are rendered; raw text never shown from trace.
  5. Audits: periodic sampling in Phase 3+ to verify zero leakage.
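Guarantee 3 (PII linting) can be sketched as a scan over the serialized trace before it is stored. The two patterns shown are assumptions, a starting point rather than a complete detector set; the function name `lintTrace` is illustrative.

```typescript
// Hypothetical server-side PII lint: scan the serialized trace for
// email and phone-number patterns before it is stored.
const PII_PATTERNS: Record<string, RegExp> = {
  email: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/,
  phone: /\+?\d[\d\s().-]{8,}\d/,
};

// Returns the names of the patterns that matched; an empty array
// means the trace passed the lint and may be stored.
function lintTrace(trace: unknown): string[] {
  const serialized = JSON.stringify(trace);
  const hits: string[] = [];
  for (const [name, pattern] of Object.entries(PII_PATTERNS)) {
    if (pattern.test(serialized)) hits.push(name);
  }
  return hits;
}
```

Linting the serialized JSON rather than walking the object means a PII string hidden in an unexpected key still trips the check, which complements the schema allowlist in guarantee 1.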

7. Storage & Retention

  • Traces stored per tenant; encrypted at rest.
  • Retention window aligned with compliance needs.
  • Ability to disable traces globally or per tenant.
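The three requirements above imply a small, per-tenant configuration surface. A possible shape, with invented keys and an invented tenant ID for illustration:

```json
{
  "traces": {
    "enabled": true,
    "retention_days": 30,
    "tenant_overrides": {
      "tenant_abc": { "enabled": false }
    }
  }
}
```

The global `enabled` flag covers the kill switch, `retention_days` is set per compliance review, and `tenant_overrides` allows opting individual tenants out without a deploy.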

8. Open Questions to Lock in Phase 1

  • Exact redaction rules and allowlist fields.
  • Whether to store any raw LLM outputs outside traces (audit vault).
  • Who can access traces in UI and API.