# LLM System: RAG & Embeddings (Starter Template)

---

**Last Updated:** 2025-12-12  
**Phase:** Phase 0 (Planning)  
**Status:** Draft (finalize in Phase 1)  
**Owner:** AI/LLM Lead + Backend Architect  
**References:**
- `/docs/backend/architecture.md`
- `/docs/llm/evals.md`
- `/docs/llm/safety.md`

---

This document describes retrieval‑augmented generation (RAG) and embeddings. Use it only if your archetype needs external knowledge or similarity search.

## 1. When to Use RAG

- You need grounded answers from a knowledge base.
- Inputs are large or dynamic (docs, tickets, policies).
- You want controllable citations/explainability.

Do **not** use RAG when:
- the task is purely generative with no grounding,
- retrieval latency/cost outweighs benefit.

## 2. Data Sources

- Curated docs, user‑uploaded files, internal DB records, external APIs.
- Mark each source as trusted/untrusted and apply safety rules.

## 3. Chunking & Indexing

- Define chunk size/overlap per domain.
- Store embeddings in a vector index (e.g., `pgvector`, managed vector DB).
- Keep an embedding model/version field to support migrations.

## 4. Retrieval Strategy

- Default: semantic search top‑k + optional filters (tenant, type, recency).
- Re‑rank if quality requires it.
- Always include retrieved doc IDs in `reasoning_trace` (not raw text).

## 5. RAG Prompting Pattern

- Provide retrieved snippets in a clearly delimited block.
- Instruct model to answer **only** using retrieved context when grounding is required.
- If context is insufficient → ask for clarification or defer.

## 6. Evaluating Retrieval

- Measure recall/precision of retrieval separately from generation quality.
- Add “no‑answer” test cases to avoid hallucinations.

## 7. Privacy & Multi‑Tenancy

- Tenant‑scoped indexes or strict filters.
- Never cross‑tenant retrieve.
- Redact PII before embedding if embeddings can be exposed or logged.
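
As an illustration of the strict tenant filter, combined with the top‑k semantic search from section 4, here is a minimal sketch. It assumes an in-memory list standing in for a real vector index such as `pgvector`; the names `Chunk`, `retrieve`, `toy-embed-v1`, and the tenant and doc IDs are hypothetical and not part of this template. Only doc IDs, not raw text, are written to `reasoning_trace`.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    # Hypothetical record shape; a real index row would also hold the text.
    doc_id: str
    tenant_id: str
    embedding: tuple
    embedding_model: str  # model/version field kept to support migrations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(index, query_embedding, tenant_id, embedding_model, k=3):
    """Return the top-k chunks for one tenant, ranked by cosine similarity."""
    # Filter before scoring so another tenant's chunks are never ranked
    # or returned, and vectors from a different model are never compared.
    candidates = [
        c for c in index
        if c.tenant_id == tenant_id and c.embedding_model == embedding_model
    ]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(query_embedding, c.embedding),
        reverse=True,
    )
    return ranked[:k]


# Toy data: two tenants, 2-dimensional embeddings chosen by hand.
index = [
    Chunk("doc-a1", "tenant-a", (1.0, 0.0), "toy-embed-v1"),
    Chunk("doc-a2", "tenant-a", (0.6, 0.8), "toy-embed-v1"),
    Chunk("doc-b1", "tenant-b", (1.0, 0.0), "toy-embed-v1"),
]

hits = retrieve(index, (1.0, 0.0), tenant_id="tenant-a",
                embedding_model="toy-embed-v1", k=2)

# Record only the IDs of retrieved docs, never their raw text.
reasoning_trace = {"retrieved_doc_ids": [c.doc_id for c in hits]}
print(reasoning_trace)  # {'retrieved_doc_ids': ['doc-a1', 'doc-a2']}
```

Filtering before ranking is a deliberate choice in this sketch: `doc-b1` matches the query exactly but can never surface for `tenant-a`. With a real vector store, the equivalent would be a tenant predicate in the query itself or a per-tenant index, depending on what the store supports.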