# LLM System: Caching & Cost Control (Starter Template)
---
**Last Updated:** 2025-12-12
**Phase:** Phase 0 (Planning)
**Status:** Draft — finalize in Phase 1
**Owner:** AI/LLM Lead + Backend Architect
**References:**
- `/docs/llm/prompting.md`
- `/docs/llm/evals.md`
---
This document defines how to keep LLM usage reliable and within budget.
## 1. Goals
- Minimize cost while preserving quality.
- Keep latency predictable for user flows.
- Avoid repeated work (idempotency + caching).
## 2. Budgets & Limits
Define per tenant and per feature:
- monthly token/cost cap,
- per-request max tokens,
- max retries/timeouts,
- concurrency limits.
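
The limits above could be captured in a small typed config that is checked before each call. This is a minimal sketch, not a prescribed schema: the `LLMBudget` class, the `check_request` helper, the tenant and feature names, and every numeric value are hypothetical placeholders to be replaced in Phase 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LLMBudget:
    """Limits for one (tenant, feature) pair. All values are placeholders."""
    monthly_cost_cap_usd: float
    max_tokens_per_request: int
    max_retries: int
    timeout_seconds: float
    max_concurrency: int


def check_request(budget: LLMBudget, spent_usd: float,
                  requested_tokens: int, in_flight: int) -> Optional[str]:
    """Return a rejection reason, or None if the request is within limits."""
    if spent_usd >= budget.monthly_cost_cap_usd:
        return "monthly_cap_exceeded"
    if requested_tokens > budget.max_tokens_per_request:
        return "request_too_large"
    if in_flight >= budget.max_concurrency:
        return "too_many_concurrent_requests"
    return None


# Hypothetical limits keyed by (tenant, feature).
BUDGETS = {
    ("tenant-a", "summarize"): LLMBudget(
        monthly_cost_cap_usd=50.0,
        max_tokens_per_request=2048,
        max_retries=2,
        timeout_seconds=30.0,
        max_concurrency=4,
    ),
}
```

Returning a reason string rather than raising lets the caller decide how to degrade, for example by queueing the job instead of rejecting it.
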
## 3. Caching Layers
Pick what applies:
1. **Input normalization cache**
- canonicalize inputs (trim, stable ordering) to increase hit rate.
2. **LLM response cache**
- key: `(prompt_version, model, canonical_input_hash, retrieval_config_hash)`.
- TTL depends on volatility of the task.
3. **Embeddings cache**
- store embeddings for reusable texts/items.
4. **RAG retrieval cache**
- cache top-k doc IDs for stable queries.
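
To illustrate how a per-task TTL might work for the response cache, here is a minimal in-memory sketch. The `TTLCache` class is hypothetical and assumes a single process; a shared store with native expiry would likely replace it in a real deployment. The clock is injectable so expiry can be tested without sleeping.

```python
import time
from typing import Any, Callable, Dict, Hashable, Optional, Tuple


class TTLCache:
    """Tiny in-memory cache with per-entry expiry. Not thread-safe."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: Dict[Hashable, Tuple[float, Any]] = {}

    def set(self, key: Hashable, value: Any, ttl_seconds: float) -> None:
        # Store the absolute expiry time next to the value.
        self._store[key] = (self._clock() + ttl_seconds, value)

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value
```

A volatile task (for example, answers that depend on fresh data) would pass a short `ttl_seconds`, while a stable classification task could use a much longer one.
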
> Never cache raw PII; derive cache keys from hashes of redacted inputs.
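
One way to build the response cache key from redacted, canonicalized input is sketched below. The helper names (`redact`, `canonicalize`, `response_cache_key`) are hypothetical, and the email-only regex is a stand-in for illustration: real PII redaction needs far broader coverage.

```python
import hashlib
import json
import re

# Illustrative pattern only: matches simple email addresses.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Mask email addresses. A real redactor must cover more PII types."""
    return EMAIL_RE.sub("<EMAIL>", text)


def canonicalize(payload: dict) -> str:
    """Trim and redact top-level string values, then serialize with sorted keys."""
    cleaned = {
        k: redact(v.strip()) if isinstance(v, str) else v
        for k, v in payload.items()
    }
    return json.dumps(cleaned, sort_keys=True, separators=(",", ":"))


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def response_cache_key(prompt_version: str, model: str,
                       payload: dict, retrieval_config: dict) -> tuple:
    """Build (prompt_version, model, canonical_input_hash, retrieval_config_hash)."""
    return (
        prompt_version,
        model,
        sha256_hex(canonicalize(payload)),
        sha256_hex(json.dumps(retrieval_config, sort_keys=True)),
    )
```

Because keys are sorted and whitespace is trimmed before hashing, two requests that differ only in field order or padding map to the same key, which is what raises the hit rate.
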
## 4. Cost Controls
- Prefer cheaper models for low-risk tasks; escalate to stronger models only when needed.
- Use staged pipelines (rules/heuristics/RAG) to reduce LLM calls.
- Batch non-interactive jobs (classification, report generation).
- Track tokens in/out per request and per tenant.
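
Model tiering and per-tenant token tracking might look roughly like the sketch below. The model names, the prices, the `pick_model` rule, and the `UsageTracker` class are all hypothetical; actual prices and escalation criteria need to come from the provider and from evals.

```python
from collections import defaultdict

# Hypothetical model names and prices in USD per 1,000 tokens (input, output).
PRICES = {
    "small-model": (0.0005, 0.0015),
    "large-model": (0.01, 0.03),
}


def pick_model(risk: str, escalate: bool = False) -> str:
    """Use the cheaper model for low-risk tasks unless escalation is requested."""
    if risk == "low" and not escalate:
        return "small-model"
    return "large-model"


class UsageTracker:
    """Accumulates tokens in/out and estimated cost per tenant."""

    def __init__(self) -> None:
        self.tokens_in = defaultdict(int)
        self.tokens_out = defaultdict(int)
        self.cost_usd = defaultdict(float)

    def record(self, tenant: str, model: str,
               tokens_in: int, tokens_out: int) -> float:
        """Record one request and return its estimated cost in USD."""
        price_in, price_out = PRICES[model]
        cost = tokens_in / 1000 * price_in + tokens_out / 1000 * price_out
        self.tokens_in[tenant] += tokens_in
        self.tokens_out[tenant] += tokens_out
        self.cost_usd[tenant] += cost
        return cost
```

A caller would typically try `pick_model("low")` first and retry with `escalate=True` only if the cheaper model's output fails a validation check.
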
## 5. Fallbacks
- On timeouts/errors: retry with backoff, then fall back to a safe default or human review.
- On budget exhaustion: degrade gracefully (limited features, queue jobs, ask user).
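
The retry-then-fallback behavior could be sketched as follows. `call_with_fallback` is a hypothetical helper; which exceptions count as retryable, the retry count, and the delays are assumptions to be tuned per provider and per feature.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(call: Callable[[], T], fallback: Callable[[], T],
                       max_retries: int = 2, base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Try `call`; on timeout/connection errors retry with exponential
    backoff plus jitter, then return `fallback()` if all attempts fail."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except (TimeoutError, ConnectionError):
            if attempt == max_retries:
                break  # retries exhausted
            delay = base_delay * (2 ** attempt)
            # Jitter spreads out retries from concurrent callers.
            sleep(delay + random.uniform(0, delay / 2))
    return fallback()
```

The fallback callable is where a safe default response would be returned or a human-review task enqueued; errors outside the retryable set propagate to the caller unchanged.
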
## 6. Monitoring
- Dashboards for cost, latency, cache hit rate, retry rate.
- Alerts for spikes, anomalous tenants, or runaway loops.
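
As a rough illustration of two of these signals, the sketch below computes a cache hit rate and flags a cost spike against a trailing average. Both function names and the 3x threshold are hypothetical; a production setup would more likely express these as queries and alert rules in the monitoring stack.

```python
from statistics import mean
from typing import Sequence


def cache_hit_rate(hits: int, misses: int) -> float:
    """Fraction of lookups served from cache; 0.0 when there is no traffic."""
    total = hits + misses
    return hits / total if total else 0.0


def is_cost_spike(recent_daily_costs: Sequence[float], today: float,
                  factor: float = 3.0) -> bool:
    """Flag today's cost if it exceeds `factor` times the trailing average."""
    if not recent_daily_costs:
        return False  # no baseline yet, so nothing to compare against
    return today > factor * mean(recent_daily_costs)
```

Running `is_cost_spike` per tenant rather than globally helps surface a single anomalous tenant that would otherwise be hidden in the aggregate.
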