LLM System: RAG & Embeddings (Starter Template)

Last Updated: 2025-12-12
Phase: Phase 0 (Planning)
Status: Draft — finalize in Phase 1
Owner: AI/LLM Lead + Backend Architect
References:

/docs/backend/architecture.md
/docs/llm/evals.md
/docs/llm/safety.md

This document describes retrieval‑augmented generation (RAG) and embeddings.
Use it only if your archetype needs external knowledge or similarity search.

1. When to Use RAG

Use RAG when:

you need grounded answers from a knowledge base,
inputs are large or dynamic (docs, tickets, policies),
you want controllable citations/explainability.

Do not use RAG when:

the task is purely generative with no grounding,
retrieval latency or cost outweighs the benefit.

2. Data Sources

Curated docs, user‑uploaded files, internal DB records, external APIs.
Mark each source as trusted/untrusted and apply safety rules.
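
One way to record the trusted/untrusted marking is a small source registry. This is a minimal sketch: the names `Trust`, `Source`, `SOURCES`, and `trust_of` are hypothetical, the two example entries are illustrative, and the actual safety rules belong in /docs/llm/safety.md.

```python
from dataclasses import dataclass
from enum import Enum


class Trust(Enum):
    TRUSTED = "trusted"
    UNTRUSTED = "untrusted"


@dataclass(frozen=True)
class Source:
    name: str
    kind: str      # e.g. "docs", "files", "db", "api"
    trust: Trust


# Hypothetical registry; replace with the project's real sources.
SOURCES = {
    "curated_docs": Source("curated_docs", "docs", Trust.TRUSTED),
    "user_uploads": Source("user_uploads", "files", Trust.UNTRUSTED),
}


def trust_of(source_name: str) -> Trust:
    """Return the trust label for a source; unknown sources default to untrusted."""
    src = SOURCES.get(source_name)
    return src.trust if src else Trust.UNTRUSTED
```

Defaulting unknown sources to untrusted is a design assumption here: a source that nobody registered should not silently receive trusted handling.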

3. Chunking & Indexing

Define chunk size/overlap per domain.
Store embeddings in a vector index (e.g., pgvector, managed vector DB).
Store the embedding model name and version with each vector so the index can be re-embedded when the model changes.
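
The chunking step might look like the sketch below. It assumes word-based windows (a real system may chunk by tokens or document structure), and the `Chunk` fields, default sizes, and the model label `example-embed-v1` are placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    index: int
    text: str
    embedding_model: str  # kept per chunk so a later migration can find stale vectors


def chunk_text(doc_id: str, text: str, size: int = 200, overlap: int = 40,
               embedding_model: str = "example-embed-v1") -> list[Chunk]:
    """Split text into windows of `size` words that share `overlap` words with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for i, start in enumerate(range(0, len(words), step)):
        chunks.append(Chunk(doc_id, i, " ".join(words[start:start + size]), embedding_model))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

For example, `chunk_text("doc-1", text, size=4, overlap=1)` on a ten-word text yields three chunks, each starting on the last word of the previous one.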

4. Retrieval Strategy

Default: semantic search top‑k + optional filters (tenant, type, recency).
Re‑rank if quality requires it.
Always include retrieved doc IDs in reasoning_trace (not raw text).
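
The default strategy can be sketched in plain Python as follows. This is an in-memory illustration only: a real deployment would push the filter and the similarity search into the vector index, and `IndexedChunk`, `retrieve`, and the `retrieved_doc_ids` key are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class IndexedChunk:
    doc_id: str
    tenant_id: str
    doc_type: str
    vector: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, tenant_id, k=3, doc_type=None):
    """Filter first (tenant, optional type), then rank by cosine similarity and keep the top k."""
    candidates = [
        c for c in index
        if c.tenant_id == tenant_id and (doc_type is None or c.doc_type == doc_type)
    ]
    top = sorted(candidates, key=lambda c: cosine(query_vec, c.vector), reverse=True)[:k]
    # Only document IDs go into the trace, never the raw snippet text.
    reasoning_trace = {"retrieved_doc_ids": [c.doc_id for c in top]}
    return top, reasoning_trace
```

Filtering before ranking keeps chunks from other tenants out of the candidate set entirely, rather than relying on them scoring low.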

5. RAG Prompting Pattern

Provide retrieved snippets in a clearly delimited block.
Instruct the model to answer using only the retrieved context when grounding is required.
If the context is insufficient, the model should ask for clarification or defer.
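
A prompt following this pattern could be assembled as in the sketch below. The `<context>` delimiter, the `INSUFFICIENT_CONTEXT` sentinel, and the function name `build_rag_prompt` are assumptions chosen for illustration; any unambiguous delimiter and deferral signal would work.

```python
def build_rag_prompt(question: str, snippets: list[tuple[str, str]]) -> str:
    """Build a grounded prompt from (doc_id, text) snippets inside a delimited block."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in snippets)
    return (
        "Answer using ONLY the context between the <context> tags.\n"
        "Cite the doc IDs you used.\n"
        "If the context is insufficient, reply exactly: INSUFFICIENT_CONTEXT\n"
        "<context>\n"
        f"{context}\n"
        "</context>\n"
        f"Question: {question}"
    )
```

The caller can then check the reply for the sentinel and route to a clarification question or a deferral message instead of showing an ungrounded answer.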

6. Evaluating Retrieval

Measure recall/precision of retrieval separately from generation quality.
Add "no-answer" test cases (questions the knowledge base cannot answer) to catch hallucinations.
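
Retrieval metrics can be computed independently of the generator, for instance as precision and recall at k over labelled doc IDs. The sketch below assumes each test case lists its relevant doc IDs; the function names and the `INSUFFICIENT_CONTEXT` sentinel are hypothetical.

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Precision@k and recall@k for one query, given ranked retrieved IDs and the labelled relevant IDs."""
    top = retrieved[:k]
    hits = len(set(top) & relevant)
    precision = hits / len(top) if top else 0.0
    # Recall is undefined with no relevant docs; 0.0 is reported here by convention.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def no_answer_passed(model_reply: str, sentinel: str = "INSUFFICIENT_CONTEXT") -> bool:
    """A no-answer case passes when the model defers instead of inventing an answer."""
    return model_reply.strip() == sentinel
```

No-answer cases have an empty relevant set, so they are better scored on whether the model deferred than on retrieval recall.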

7. Privacy & Multi‑Tenancy

Use tenant-scoped indexes or strict tenant filters on every query.
Never retrieve across tenants.
Redact PII before embedding if embeddings can be exposed or logged.
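
The two safeguards above might be sketched as follows. The regular expressions cover only simple email and phone patterns and are an illustrative assumption, not a complete PII detector; `redact_pii` and `assert_single_tenant` are hypothetical helper names.

```python
import re

# Deliberately simple patterns; production redaction needs a broader, tested detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders before embedding."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def assert_single_tenant(chunk_tenant_ids: list[str], tenant_id: str) -> None:
    """Defensive check after retrieval: fail loudly if any result belongs to another tenant."""
    leaked = [t for t in chunk_tenant_ids if t != tenant_id]
    if leaked:
        raise PermissionError(f"cross-tenant retrieval detected: {sorted(set(leaked))}")
```

The post-retrieval check is a second line of defence; the primary control remains the tenant filter or tenant-scoped index applied at query time.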
