
LLM System: Safety, Privacy & Reasoning Traces (Starter Template)


Last Updated: 2025-12-12
Phase: Phase 0 (Planning)
Status: Draft — finalize in Phase 1
Owner: Security + AI/LLM Lead
References:

  • /docs/backend/security.md
  • /docs/llm/prompting.md

This document defines the safety posture for any LLM-backed feature: privacy, prompt-injection defenses, tool safety, and what may be logged.

1. Safety Goals

  • Prevent leakage of PII/tenant secrets to LLMs, logs, or UI.
  • Resist prompt injection and untrusted context manipulation.
  • Ensure outputs are safe to act on (validated, bounded, auditable).

2. Data Classification & Handling

Define categories for your domain:

  • Public: safe to send and store.
  • Internal: safe to send only if necessary; store minimally.
  • Sensitive (PII/PHI/PCI/Secrets): never send unless explicitly approved; never store in traces.
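The three categories above can be encoded as a gate that every field must pass before inclusion in an LLM request. This is a minimal sketch; the type names (`DataClass`, `maySendToLLM`) are illustrative, not part of the template.

```typescript
// Hypothetical sketch: a classification tag attached to every field
// before it can be included in an LLM request.
type DataClass = "public" | "internal" | "sensitive";

interface ClassifiedField {
  name: string;
  value: string;
  dataClass: DataClass;
}

// Decide whether a field may be sent to the LLM. Sensitive data is
// always rejected; internal data requires an explicit justification.
function maySendToLLM(field: ClassifiedField, justified = false): boolean {
  switch (field.dataClass) {
    case "public":
      return true;
    case "internal":
      return justified;
    case "sensitive":
      return false;
  }
}
```

Making the sensitive branch unconditionally `false` keeps the "never send unless explicitly approved" rule out of runtime hands; approval would mean reclassifying the field, not passing a flag.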

3. Redaction Pipeline (before LLM)

Apply a mandatory preprocessing step in callLLM():

  1. Detect sensitive fields (allowlist what can be sent, not what can't).
  2. Redact or hash PII (names, emails, phone, addresses, IDs, card data).
  3. Replace with stable placeholders: {{USER_EMAIL_HASH}}.
  4. Attach a “redaction summary” to logs (no raw PII).
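The four steps above can be sketched as a single preprocessing function. This is one possible shape, assuming emails as the example PII class; the hash suffix is one way to keep placeholders stable per value (same input, same placeholder) as step 3 requires, and the returned summary carries counts only, no raw PII, per step 4.

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch of the mandatory preprocessing step in callLLM():
// redact emails, replace them with stable hashed placeholders, and
// return a PII-free redaction summary for logging.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

function redactForLLM(text: string): { redacted: string; summary: { emails: number } } {
  let emails = 0;
  const redacted = text.replace(EMAIL_RE, (match) => {
    emails++;
    // Stable placeholder: identical emails map to identical tokens.
    const digest = createHash("sha256").update(match).digest("hex").slice(0, 8);
    return `{{USER_EMAIL_HASH:${digest}}}`;
  });
  return { redacted, summary: { emails } };
}
```

A real pipeline would add detectors for phone numbers, addresses, IDs, and card data, but the contract stays the same: the redacted text and a countable summary go forward, the raw values do not.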

4. Prompt Injection & Untrusted Context

  • Delimit untrusted input (<untrusted_input>...</untrusted_input>).
  • Never allow untrusted text to override system constraints.
  • For RAG: treat retrieved docs as untrusted unless curated.
  • If injection is detected → refuse or escalate for human review.
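A minimal sketch of the first and last bullets, assuming the `<untrusted_input>` delimiters named above. The function names and the single regex heuristic are illustrative; real detection should combine patterns, classifiers, and human review for borderline cases.

```typescript
// Wrap untrusted text in explicit delimiters. Delimiter look-alikes are
// stripped first so the payload cannot close the tag and escape.
function wrapUntrusted(text: string): string {
  const sanitized = text.replace(/<\/?untrusted_input>/g, "");
  return `<untrusted_input>\n${sanitized}\n</untrusted_input>`;
}

// Naive heuristic for an obvious injection attempt (illustrative only).
function looksLikeInjection(text: string): boolean {
  return /ignore (all |any )?(previous|prior) instructions/i.test(text);
}
```

Stripping delimiter look-alikes matters as much as adding the delimiters: without it, a document containing `</untrusted_input>` would break out of the untrusted region.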

5. Tool / Agent Safety (if applicable)

  • Tool allowlist with scopes and rate limits.
  • Confirm destructive actions with humans (“human checkpoint”).
  • Constrain tool output length and validate outputs before reuse.
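The allowlist and human-checkpoint bullets can be sketched as a single authorization gate in front of every tool call. All names here (`ToolPolicy`, `authorizeToolCall`, the example tools) are hypothetical; rate limiting is declared in the policy but its enforcement is left out for brevity.

```typescript
// Hypothetical tool registry enforcing an allowlist, scopes, and a
// human checkpoint for destructive actions.
interface ToolPolicy {
  scopes: string[];
  destructive: boolean;
  maxCallsPerMinute: number;
}

const TOOL_ALLOWLIST: Record<string, ToolPolicy> = {
  search_docs: { scopes: ["read"], destructive: false, maxCallsPerMinute: 30 },
  delete_record: { scopes: ["write"], destructive: true, maxCallsPerMinute: 2 },
};

function authorizeToolCall(
  tool: string,
  grantedScopes: string[],
  humanApproved: boolean
): { allowed: boolean; reason: string } {
  const policy = TOOL_ALLOWLIST[tool];
  if (!policy) return { allowed: false, reason: "tool not on allowlist" };
  if (!policy.scopes.every((s) => grantedScopes.includes(s)))
    return { allowed: false, reason: "missing scope" };
  if (policy.destructive && !humanApproved)
    return { allowed: false, reason: "human checkpoint required" };
  return { allowed: true, reason: "ok" };
}
```

Unknown tools fail closed: anything the model invents that is not in the allowlist is rejected before execution, with a reason string that can go into the trace.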

6. reasoning_trace Specification

reasoning_trace is optional and should be safe to show to humans.
Store only structured, privacy-safe metadata; never raw prompts or user PII.

Allowed fields (example)

{
  "prompt_version": "classify@1.2.0",
  "model": "provider:model",
  "inputs": { "redacted": true, "source_ids": ["..."] },
  "steps": [
    { "type": "rule_hit", "rule_id": "r_123", "confidence": 0.72 },
    { "type": "retrieval", "top_k": 5, "doc_ids": ["d1","d2"] },
    { "type": "llm_call", "confidence": 0.64 }
  ],
  "output": { "label": "X", "confidence": 0.64 },
  "trace_id": "..."
}

Explicitly disallowed in traces

  • Raw user input, webhook payloads, or document text.
  • Emails, phone numbers, addresses, names, gov IDs.
  • Payment data, auth tokens, API keys, secrets.
  • Full prompts or full LLM responses (store refs or summaries only).

How we guarantee “no PII” in traces

  1. Schema allowlist: trace is validated against a strict schema with only allowed keys.
  2. Redaction required: callLLM() sets inputs.redacted=true only after redaction has succeeded.
  3. PII linting: server-side scan of trace JSON for patterns (emails, phones, IDs) before storing.
  4. UI gating: only safe fields are rendered; raw text never shown from trace.
  5. Audits: periodic sampling in Phase 3+ to verify zero leakage.
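Guarantee 3 (PII linting) can be sketched as a scan over the serialized trace before it is stored. The two patterns shown are assumptions, a starting point rather than a complete detector set; the function name `lintTrace` is illustrative.

```typescript
// Hypothetical server-side PII lint: scan the serialized trace for
// email and phone-number patterns before it is stored.
const PII_PATTERNS: Record<string, RegExp> = {
  email: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/,
  phone: /\+?\d[\d\s().-]{8,}\d/,
};

// Returns the names of the patterns that matched; an empty array
// means the trace passed the lint and may be stored.
function lintTrace(trace: unknown): string[] {
  const serialized = JSON.stringify(trace);
  const hits: string[] = [];
  for (const [name, pattern] of Object.entries(PII_PATTERNS)) {
    if (pattern.test(serialized)) hits.push(name);
  }
  return hits;
}
```

Linting the serialized JSON rather than walking the object means a PII string hidden in an unexpected key still trips the check, which complements the schema allowlist in guarantee 1.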

7. Storage & Retention

  • Traces stored per tenant; encrypted at rest.
  • Retention window aligned with compliance needs.
  • Ability to disable traces globally or per tenant.
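The three requirements above imply a small, per-tenant configuration surface. A possible shape, with invented keys and an invented tenant ID for illustration:

```json
{
  "traces": {
    "enabled": true,
    "retention_days": 30,
    "tenant_overrides": {
      "tenant_abc": { "enabled": false }
    }
  }
}
```

The global `enabled` flag covers the kill switch, `retention_days` is set per compliance review, and `tenant_overrides` allows opting individual tenants out without a deploy.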

8. Open Questions to Lock in Phase 1

  • Exact redaction rules and allowlist fields.
  • Whether to store any raw LLM outputs outside traces (audit vault).
  • Who can access traces in UI and API.