LLM System: RAG & Embeddings (Starter Template)


Last Updated: 2025-12-12
Phase: Phase 0 (Planning)
Status: Draft — finalize in Phase 1
Owner: AI/LLM Lead + Backend Architect
References:

  • /docs/backend/architecture.md
  • /docs/llm/evals.md
  • /docs/llm/safety.md

This document describes retrieval-augmented generation (RAG) and embeddings.
Use it only if your archetype needs external knowledge or similarity search.

1. When to Use RAG

  • You need grounded answers from a knowledge base.
  • Inputs are large or dynamic (docs, tickets, policies).
  • You want controllable citations/explainability.

Do not use RAG when:

  • the task is purely generative with no grounding,
  • retrieval latency/cost outweighs benefit.

2. Data Sources

  • Curated docs, user-uploaded files, internal DB records, external APIs.
  • Mark each source as trusted/untrusted and apply safety rules.
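As a minimal sketch of the trusted/untrusted marking above (the `Source` shape and the `kind` values are hypothetical, not part of the template):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Source:
    uri: str
    kind: str      # e.g. "curated_doc", "user_upload", "db_record", "api"
    trusted: bool  # curated/internal -> True; user uploads/external -> False

def requires_sanitization(source: Source) -> bool:
    """Untrusted sources get safety processing before indexing."""
    return not source.trusted

docs = [
    Source("s3://kb/policies.pdf", "curated_doc", trusted=True),
    Source("upload://ticket-123.txt", "user_upload", trusted=False),
]
untrusted = [s for s in docs if requires_sanitization(s)]
```

Downstream safety rules (see /docs/llm/safety.md) can then branch on the flag instead of re-deriving trust per pipeline stage.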

3. Chunking & Indexing

  • Define chunk size/overlap per domain.
  • Store embeddings in a vector index (e.g., pgvector, managed vector DB).
  • Keep an embedding model/version field to support migrations.
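A sketch of the chunking and record shape above, assuming character-based windows and a hypothetical model identifier (real chunk sizes and the index schema are per-domain decisions for Phase 1):

```python
EMBEDDING_MODEL = "example-embed-v1"  # hypothetical; store per row for migrations

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def index_record(doc_id: str, chunk: str, vector: list[float]) -> dict:
    """Row shape for the vector index; the model field supports re-embedding."""
    return {
        "doc_id": doc_id,
        "text": chunk,
        "embedding": vector,
        "embedding_model": EMBEDDING_MODEL,
    }

chunks = chunk_text("a" * 1200, chunk_size=500, overlap=50)
```

Keeping `embedding_model` on every row means a migration can re-embed only rows on the old version instead of rebuilding the whole index.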

4. Retrieval Strategy

  • Default: top-k semantic search + optional filters (tenant, type, recency).
  • Rerank if quality requires it.
  • Always include retrieved doc IDs in reasoning_trace (not raw text).
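A toy in-memory version of this strategy (cosine similarity stands in for the real vector store; the `reasoning_trace` field names are assumptions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, index, top_k=2, tenant=None):
    # Optional metadata filter applied before ranking.
    candidates = [r for r in index if tenant is None or r["tenant"] == tenant]
    ranked = sorted(candidates,
                    key=lambda r: cosine(query_vec, r["embedding"]),
                    reverse=True)
    hits = ranked[:top_k]
    # Trace records doc IDs only, never raw text.
    trace = {"retrieved_doc_ids": [r["doc_id"] for r in hits]}
    return hits, trace

index = [
    {"doc_id": "d1", "tenant": "t1", "embedding": [1.0, 0.0]},
    {"doc_id": "d2", "tenant": "t1", "embedding": [0.0, 1.0]},
    {"doc_id": "d3", "tenant": "t2", "embedding": [1.0, 0.1]},
]
hits, trace = retrieve([1.0, 0.0], index, top_k=1, tenant="t1")
```

Note that `d3` is the closest vector overall but is excluded by the tenant filter, which is exactly the behavior Section 7 requires.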

5. RAG Prompting Pattern

  • Provide retrieved snippets in a clearly delimited block.
  • Instruct model to answer only using retrieved context when grounding is required.
  • If context is insufficient → ask for clarification or defer.
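The prompting pattern above could look like this (the delimiters and instruction wording are illustrative; finalize them alongside the eval suite):

```python
def build_rag_prompt(question: str, snippets: list[str]) -> str:
    """Grounded prompt: delimited context block + explicit defer instruction."""
    context = "\n---\n".join(snippets)
    return (
        "Answer using ONLY the context below. If the context does not "
        "contain the answer, say you cannot answer and ask for clarification.\n\n"
        "<context>\n" + context + "\n</context>\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    "What is the refund policy?",
    ["Refunds within 30 days."],
)
```

The explicit "cannot answer" branch is what the no-answer eval cases in Section 6 exercise.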

6. Evaluating Retrieval

  • Measure recall/precision of retrieval separately from generation quality.
  • Add “no-answer” test cases to avoid hallucinations.
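A sketch of scoring retrieval in isolation, including a no-answer case where returning nothing is the correct behavior (metric choice and case format are assumptions):

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Fraction of relevant docs found in the top-k retrieved results."""
    if not relevant_ids:
        # No relevant docs: retrieving nothing is a perfect score.
        return 1.0 if not retrieved_ids[:k] else 0.0
    hits = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(hits) / len(relevant_ids)

cases = [
    {"retrieved": ["d1", "d3"], "relevant": ["d1", "d2"]},  # partial hit
    {"retrieved": [], "relevant": []},                       # no-answer case
]
scores = [recall_at_k(c["retrieved"], c["relevant"], k=2) for c in cases]
```

Scoring retrieval separately from generation makes it clear whether a bad answer came from missing context or from the model ignoring good context.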

7. Privacy & Multi-Tenancy

  • Tenant-scoped indexes or strict filters.
  • Never retrieve across tenants.
  • Redact PII before embedding if embeddings can be exposed or logged.
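A minimal sketch of the two rules above, assuming a hard tenant filter and naive regex redaction (the email regex is illustrative only; real PII detection needs a proper pipeline per /docs/llm/safety.md):

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # illustrative, not exhaustive

def redact_pii(text: str) -> str:
    """Redact before embedding, since embeddings may be exposed or logged."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

def tenant_filter(rows: list[dict], tenant_id: str) -> list[dict]:
    """Hard filter: rows from other tenants never reach retrieval."""
    return [r for r in rows if r["tenant_id"] == tenant_id]

clean = redact_pii("Contact alice@example.com for access.")
rows = tenant_filter(
    [{"tenant_id": "t1", "doc_id": "a"}, {"tenant_id": "t2", "doc_id": "b"}],
    "t1",
)
```

Applying the filter in one shared code path (rather than per caller) is what makes "never cross-tenant" enforceable.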