LLM System: RAG & Embeddings (Starter Template)
Last Updated: 2025-12-12
Phase: Phase 0 (Planning)
Status: Draft — finalize in Phase 1
Owner: AI/LLM Lead + Backend Architect
References:
- /docs/backend/architecture.md
- /docs/llm/evals.md
- /docs/llm/safety.md
This document describes retrieval‑augmented generation (RAG) and embeddings.
Use it only if your archetype needs external knowledge or similarity search.
1. When to Use RAG
- You need grounded answers from a knowledge base.
- Inputs are large or dynamic (docs, tickets, policies).
- You want controllable citations/explainability.
Do not use RAG when:
- the task is purely generative with no grounding,
- retrieval latency/cost outweighs benefit.
2. Data Sources
- Curated docs, user‑uploaded files, internal DB records, external APIs.
- Mark each source as trusted/untrusted and apply safety rules.
3. Chunking & Indexing
- Define chunk size/overlap per domain.
- Store embeddings in a vector index (e.g., pgvector, managed vector DB).
- Keep an embedding model/version field to support migrations.
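As one way to picture the chunking and versioning points above, here is a minimal sketch, assuming character-based windows and a plain record type. The names `ChunkRecord`, `chunk_text`, `build_records`, `embedding_model`, and `embedding_version` are hypothetical and not part of any existing schema; real chunkers often split on tokens or sentences instead.

```python
from dataclasses import dataclass


@dataclass
class ChunkRecord:
    """One indexable chunk. Field names are illustrative assumptions."""
    doc_id: str
    chunk_index: int
    text: str
    embedding_model: str    # which model produced the vector
    embedding_version: str  # lets a migration find stale vectors


def chunk_text(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` characters sharing `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def build_records(doc_id: str, text: str, size: int, overlap: int,
                  model: str, version: str) -> list[ChunkRecord]:
    """Attach the model/version tag to every chunk before it is embedded."""
    return [
        ChunkRecord(doc_id, i, chunk, model, version)
        for i, chunk in enumerate(chunk_text(text, size, overlap))
    ]
```

With a tag on every record, a migration can select rows whose `embedding_version` differs from the current one and re-embed only those.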
4. Retrieval Strategy
- Default: semantic search top‑k + optional filters (tenant, type, recency).
- Re‑rank if quality requires it.
- Always include retrieved doc IDs in reasoning_trace (not raw text).
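The retrieval strategy above could be sketched as follows, assuming an in-memory list stands in for the vector index and cosine similarity is the ranking score. `IndexedChunk`, `retrieve`, `trace_entry`, and the `retrieved_doc_ids` key are hypothetical names; a real deployment would push the filter and the top-k search into the vector store.

```python
import math
from dataclasses import dataclass


@dataclass
class IndexedChunk:
    doc_id: str
    tenant_id: str
    doc_type: str
    vector: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, k, tenant_id, doc_type=None):
    """Filter first (tenant is mandatory, type is optional), then rank and keep top-k."""
    candidates = [
        c for c in index
        if c.tenant_id == tenant_id and (doc_type is None or c.doc_type == doc_type)
    ]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c.vector), reverse=True)
    return ranked[:k]


def trace_entry(hits):
    """Record only document IDs for the trace, never the raw chunk text."""
    return {"retrieved_doc_ids": [c.doc_id for c in hits]}
```

Filtering before ranking keeps out-of-scope chunks from ever competing for a top-k slot.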
5. RAG Prompting Pattern
- Provide retrieved snippets in a clearly delimited block.
- Instruct the model to answer only from the retrieved context when grounding is required.
- If the context is insufficient, ask for clarification or defer.
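A minimal sketch of the prompting pattern above, assuming `<context>` tags as the delimiter and a `None` return as the signal to ask for clarification or defer. The function name `build_prompt`, the tag choice, and the instruction wording are illustrative assumptions, not a prescribed template.

```python
def build_prompt(question, snippets):
    """Build a grounded prompt from (doc_id, text) pairs.

    Returns None when there is no context, so the caller can ask for
    clarification or defer instead of letting the model guess.
    """
    if not snippets:
        return None
    blocks = "\n".join(f"[{doc_id}] {text}" for doc_id, text in snippets)
    return (
        "Answer using only the text between <context> and </context>.\n"
        "If the context does not contain the answer, say so.\n"
        f"<context>\n{blocks}\n</context>\n"
        f"Question: {question}"
    )
```

Prefixing each snippet with its document ID also gives the model something concrete to cite.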
6. Evaluating Retrieval
- Measure recall/precision of retrieval separately from generation quality.
- Add “no‑answer” test cases, where the knowledge base holds no relevant document, to catch hallucinated answers.
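To make the retrieval metrics above concrete, here is a small sketch, assuming each test case carries a labeled set of relevant doc IDs. `precision_recall_at_k` and `no_answer_case_passes` are hypothetical helper names, and treating an empty relevant set as a no-answer case is an assumption of this sketch.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Score retrieval alone: retrieved is a ranked list of doc IDs, relevant is a set."""
    top = retrieved[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def no_answer_case_passes(relevant, abstained):
    """A no-answer case has no relevant documents; it passes only if the system abstained."""
    return (not relevant) and abstained
```

For example, with `retrieved = ["a", "b", "c", "d"]` and `relevant = {"a", "c"}`, the call with `k=2` finds one hit among two results, so precision and recall are both 0.5.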
7. Privacy & Multi‑Tenancy
- Use tenant‑scoped indexes or apply a strict tenant filter to every query.
- Never retrieve across tenants.
- Redact PII before embedding if embeddings can be exposed or logged.
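As an illustration of redacting before embedding, here is a rough sketch, assuming that emails and phone-like digit runs are the only PII of interest. The patterns and the `redact_pii` name are illustrative assumptions; the phone pattern in particular can match other long digit sequences, and production systems typically rely on a dedicated PII detection step.

```python
import re

# Simplified patterns for illustration only; they do not cover all PII.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace emails and phone-like numbers with placeholders before embedding."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)
```

Calling `redact_pii("Contact jane.doe@example.com or +1 415-555-0100.")` returns `"Contact [EMAIL] or [PHONE]."`, and that redacted string is what would be passed to the embedding step.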