AI_template/docs/llm/rag-embeddings.md

# LLM System: RAG & Embeddings (Starter Template)

---
**Last Updated:** 2025-12-12
**Phase:** Phase 0 (Planning)
**Status:** Draft — finalize in Phase 1
**Owner:** AI/LLM Lead + Backend Architect
**References:**
- `/docs/backend/architecture.md`
- `/docs/llm/evals.md`
- `/docs/llm/safety.md`
---

This document describes retrieval‑augmented generation (RAG) and embeddings.
Use it only if your archetype needs external knowledge or similarity search.

## 1. When to Use RAG
- You need grounded answers from a knowledge base.
- Inputs are large or dynamic (docs, tickets, policies).
- You want controllable citations/explainability.

Do **not** use RAG when:
- the task is purely generative and needs no grounding,
- retrieval latency or cost outweighs the benefit.

## 2. Data Sources
- Curated docs, user‑uploaded files, internal DB records, external APIs.
- Mark each source as trusted or untrusted, and apply the safety rules in `/docs/llm/safety.md` accordingly.
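
One way to record the trusted/untrusted marking is a small source registry. The sketch below is illustrative only: `TrustLevel`, `DataSource`, and `needs_safety_screening` are hypothetical names, and the rule "screen every untrusted source before indexing" is an assumption to be replaced by whatever the safety doc specifies.

```python
from dataclasses import dataclass
from enum import Enum


class TrustLevel(Enum):
    TRUSTED = "trusted"
    UNTRUSTED = "untrusted"


@dataclass(frozen=True)
class DataSource:
    name: str
    kind: str          # e.g. "curated_docs", "user_upload", "db", "external_api"
    trust: TrustLevel


def needs_safety_screening(source: DataSource) -> bool:
    """Assumed rule: untrusted sources are screened before indexing."""
    return source.trust is TrustLevel.UNTRUSTED


sources = [
    DataSource("handbook", "curated_docs", TrustLevel.TRUSTED),
    DataSource("uploads", "user_upload", TrustLevel.UNTRUSTED),
]
to_screen = [s.name for s in sources if needs_safety_screening(s)]
```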

## 3. Chunking & Indexing
- Define chunk size/overlap per domain.
- Store embeddings in a vector index (e.g., `pgvector`, managed vector DB).
- Keep an embedding model/version field to support migrations.
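
As a minimal sketch of the points above, the following splits text into fixed-size character chunks with overlap and tags each chunk with the embedding model and version. Character-based chunking, the `ChunkRecord` fields, and the model name in the usage lines are all assumptions for illustration; a real pipeline may chunk by tokens or document structure instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRecord:
    doc_id: str
    chunk_index: int
    text: str
    embedding_model: str     # hypothetical field: which model produced the vector
    embedding_version: str   # hypothetical field: lets you re-embed selectively


def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of chunk_size chars, each sharing `overlap` chars with the previous."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def build_records(doc_id: str, text: str, chunk_size: int, overlap: int,
                  model: str, version: str) -> list[ChunkRecord]:
    """Attach model/version metadata to every chunk of one document."""
    return [
        ChunkRecord(doc_id, i, chunk, model, version)
        for i, chunk in enumerate(chunk_text(text, chunk_size, overlap))
    ]


records = build_records("doc-1", "abcdefghij", chunk_size=4, overlap=1,
                        model="example-embedder", version="v1")
```

Storing the model and version per chunk means a migration can select only the rows produced by an older model and re-embed those.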

## 4. Retrieval Strategy
- Default: semantic search top‑k + optional filters (tenant, type, recency).
- Add a re‑ranking step when top‑k semantic search alone does not meet retrieval quality targets.
- Always include retrieved doc IDs in `reasoning_trace` (not raw text).
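
The default strategy can be sketched with an in-memory index and cosine similarity. This is a toy stand-in for a real vector index: `IndexedChunk`, `retrieve`, and the shape of the `reasoning_trace` dictionary are assumptions, and the embeddings are hand-written two-dimensional vectors.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexedChunk:
    doc_id: str
    tenant_id: str
    doc_type: str
    embedding: tuple
    text: str


def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, index, tenant_id: str, k: int = 3,
             doc_type: Optional[str] = None) -> list:
    """Filter first (tenant, optional type), then rank by similarity and keep top-k."""
    candidates = [
        c for c in index
        if c.tenant_id == tenant_id and (doc_type is None or c.doc_type == doc_type)
    ]
    ranked = sorted(
        candidates,
        key=lambda c: cosine_similarity(query_embedding, c.embedding),
        reverse=True,
    )
    return ranked[:k]


def build_reasoning_trace(chunks) -> dict:
    """Record doc IDs only; raw snippet text stays out of the trace."""
    return {"retrieved_doc_ids": [c.doc_id for c in chunks]}


index = [
    IndexedChunk("doc-1", "tenant-a", "policy", (1.0, 0.0), "Refund policy text"),
    IndexedChunk("doc-2", "tenant-a", "ticket", (0.0, 1.0), "Support ticket text"),
    IndexedChunk("doc-3", "tenant-b", "policy", (1.0, 0.0), "Other tenant's policy"),
]
top = retrieve((1.0, 0.0), index, tenant_id="tenant-a", k=1)
trace = build_reasoning_trace(top)
```

Filtering before ranking keeps other tenants' chunks out of the candidate set entirely, rather than relying on the ranking to exclude them.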

## 5. RAG Prompting Pattern
- Provide retrieved snippets in a clearly delimited block.
- Instruct the model to answer **only** from the retrieved context when grounding is required.
- If the context is insufficient, ask for clarification or defer.
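
A minimal sketch of this pattern follows. The `<context>` delimiter, the instruction wording, and the `build_rag_prompt` helper are assumptions chosen for illustration; adapt them to the prompt conventions your project already uses.

```python
def build_rag_prompt(question: str, snippets: list) -> str:
    """Build a grounded prompt from (doc_id, text) pairs.

    Snippets go in a delimited block, each labeled with its doc ID so the
    answer can cite sources.
    """
    context_lines = [f"[{doc_id}] {text}" for doc_id, text in snippets]
    context_block = "\n".join(context_lines) if context_lines else "(no context retrieved)"
    return (
        "Answer using ONLY the material inside the context block below.\n"
        "If the context does not contain the answer, say so and ask a "
        "clarifying question instead of guessing.\n\n"
        f"<context>\n{context_block}\n</context>\n\n"
        f"Question: {question}"
    )


prompt = build_rag_prompt(
    "What is the refund window?",
    [("doc-1", "Refunds are accepted within 30 days.")],
)
```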

## 6. Evaluating Retrieval
- Measure recall/precision of retrieval separately from generation quality.
- Add “no‑answer” test cases (queries the knowledge base cannot answer) to catch hallucinated responses.
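
Retrieval metrics can be computed from doc IDs alone, independently of the generated answer. The sketch below shows precision and recall at k plus a check for no-answer cases. `RetrievalCase`, the convention that an empty `relevant_ids` set marks a no-answer case, and the `model_deferred` flag are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalCase:
    query: str
    relevant_ids: frozenset  # assumed convention: empty set marks a "no-answer" case


def precision_recall_at_k(retrieved_ids, relevant_ids, k: int):
    """Return (precision@k, recall@k) over de-duplicated retrieved IDs."""
    top_k = list(dict.fromkeys(retrieved_ids))[:k]  # de-duplicate, keep order
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


def no_answer_case_passed(case: RetrievalCase, model_deferred: bool) -> bool:
    """A no-answer case passes only if the model deferred instead of answering."""
    if case.relevant_ids:
        raise ValueError("case has relevant documents; it is not a no-answer case")
    return model_deferred


scores = precision_recall_at_k(["a", "b", "c"], {"a", "d"}, k=2)
unanswerable = RetrievalCase("Who won the 2090 election?", frozenset())
```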

## 7. Privacy & Multi‑Tenancy
- Use tenant‑scoped indexes or apply a strict tenant filter on every query.
- Never retrieve across tenants.
- Redact PII before embedding if embeddings can be exposed or logged.
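
The two safeguards above can be sketched as a redaction pass before embedding and a post-retrieval tenant check. The regular expressions are deliberately simple and will miss many PII forms (names, addresses, IDs); treat them, the placeholder tokens, and the helper names as assumptions rather than a complete redaction policy.

```python
import re

# Illustrative patterns only; real PII detection needs a broader approach.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders before embedding."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


def assert_single_tenant(chunk_tenant_ids: list, tenant_id: str) -> None:
    """Defense in depth: fail loudly if any retrieved chunk belongs to another tenant."""
    leaked = [t for t in chunk_tenant_ids if t != tenant_id]
    if leaked:
        raise PermissionError(f"cross-tenant retrieval blocked: {sorted(set(leaked))}")


clean = redact_pii("Contact jane@example.com or +1 555-123-4567.")
```

The tenant check runs after retrieval so that a misconfigured filter surfaces as an error instead of a silent leak.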