AI_template/docs/llm/caching-costs.md
olekhondera 5b28ea675d add SKILL
2026-02-14 07:38:50 +02:00


LLM System: Caching & Cost Control (Starter Template)


Phase: Phase 0 (Planning)
Status: Draft — finalize in Phase 1
Owner: AI/LLM Lead + Backend Architect
References:

  • /docs/llm/prompting.md
  • /docs/llm/evals.md

This document defines how to keep LLM usage reliable and within budget.

1. Goals

  • Minimize cost while preserving quality.
  • Keep latency predictable for user flows.
  • Avoid repeated work (idempotency + caching).

2. Budgets & Limits

Define per tenant and per feature:

  • monthly token/cost cap,
  • per-request max tokens,
  • max retries/timeouts,
  • concurrency limits.
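The limits above can be captured in a single config object per (tenant, feature) pair. A minimal sketch, assuming hypothetical field names (finalize the actual schema in Phase 1):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LlmLimits:
    """Per-tenant, per-feature LLM usage limits (field names are placeholders)."""
    monthly_cost_cap_usd: float   # hard monthly spend cap
    max_tokens_per_request: int   # per-request output token ceiling
    max_retries: int              # retries before falling back
    timeout_s: float              # per-call timeout
    max_concurrency: int          # simultaneous in-flight requests

# Example: stricter limits for an interactive, user-facing chat feature.
CHAT_LIMITS = LlmLimits(
    monthly_cost_cap_usd=500.0,
    max_tokens_per_request=1024,
    max_retries=2,
    timeout_s=10.0,
    max_concurrency=8,
)
```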

3. Caching Layers

Choose the layers that apply:

  1. Input normalization cache
    • canonicalize inputs (trim, stable ordering) to increase hit rate.
  2. LLM response cache
    • key: (prompt_version, model, canonical_input_hash, retrieval_config_hash).
    • TTL depends on volatility of the task.
  3. Embeddings cache
    • store embeddings for reusable texts/items.
  4. RAG retrieval cache
    • cache top-k doc IDs for stable queries.

Never cache raw PII; cache keys use hashes of redacted inputs.
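A sketch of the response-cache key from layer 2, hashing the canonical (already-redacted) input so no raw PII appears in the key; function and argument names are illustrative:

```python
import hashlib
import json

def cache_key(prompt_version: str, model: str, canonical_input: dict,
              retrieval_config: dict) -> str:
    """Build a response-cache key:
    (prompt_version, model, canonical_input_hash, retrieval_config_hash).

    Inputs must already be redacted; only hashes of the canonical
    forms appear in the key.
    """
    def h(obj) -> str:
        # sort_keys gives a stable serialization, so logically-equal
        # inputs hash identically and increase the hit rate.
        blob = json.dumps(obj, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:16]

    return f"{prompt_version}:{model}:{h(canonical_input)}:{h(retrieval_config)}"

key = cache_key("v3", "small-model", {"q": "refund policy"}, {"top_k": 5})
```

Key ordering inside the dicts does not matter: `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` produce the same key.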

4. Cost Controls

  • Prefer cheaper models for low-risk tasks; escalate to stronger models only when needed.
  • Use staged pipelines (rules/heuristics/RAG) to reduce LLM calls.
  • Batch non-interactive jobs (classification, report gen).
  • Track tokens in/out per request and per tenant.
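The cheap-first escalation policy can be sketched as a routing function; model names, risk labels, and the confidence threshold are all assumptions to be tuned in Phase 1:

```python
def route_model(task_risk: str, confidence: float) -> str:
    """Staged routing: try the cheap model first, escalate only when needed.

    task_risk:  "low" or "high" (placeholder taxonomy).
    confidence: score from an upstream rule/heuristic stage in [0, 1].
    """
    if task_risk == "low" or confidence >= 0.9:
        return "cheap-model"   # placeholder name
    return "strong-model"      # placeholder name

# A high-risk task with low upstream confidence escalates.
model = route_model("high", 0.4)
```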

5. Fallbacks

  • On timeouts/errors: retry with backoff, then fallback to safe default or human review.
  • On budget exhaustion: degrade gracefully (limited features, queue jobs, ask user).
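The timeout/error path above can be sketched as retry-with-backoff wrapping any LLM call, with a safe default when retries are exhausted; the helper and its defaults are illustrative:

```python
import random
import time

def call_with_backoff(fn, max_retries: int = 2, base_delay: float = 0.5,
                      safe_default=None):
    """Call fn(); on timeout/connection errors, retry with exponential
    backoff plus jitter, then fall back to a safe default (or route to
    human review) once retries are exhausted."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if attempt == max_retries:
                return safe_default  # alternatively: enqueue for human review
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```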

6. Monitoring

  • Dashboards for cost, latency, cache hit rate, retry rate.
  • Alerts for spikes, anomaly tenants, or runaway loops.
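One way to express the spike/anomaly alerts as a rule over the tracked metrics; the thresholds (3x daily spend, 30% minimum hit rate) are placeholder values to calibrate against real dashboards:

```python
def should_alert(cost_today: float, avg_daily_cost: float,
                 cache_hit_rate: float, spike_factor: float = 3.0,
                 min_hit_rate: float = 0.3) -> bool:
    """Flag a tenant when spend spikes well above its baseline or the
    cache hit rate degrades (a possible sign of a runaway loop or
    cache-key drift)."""
    cost_spike = cost_today > spike_factor * avg_daily_cost
    cache_degraded = cache_hit_rate < min_hit_rate
    return cost_spike or cache_degraded
```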