akashnotes — Structured Learning for Engineers

Every LLM call costs money and takes seconds. In production, many queries are semantically identical — "What's the return policy?" and "How do I return an item?" should hit the same cache. This guide teaches you to build caching layers that dramatically reduce cost and latency.

Coming from Software Engineering? Semantic caching is like a CDN with fuzzy matching — instead of exact URL match, you match by meaning. If you've built Redis caching layers with cache-aside patterns, the architecture is identical. The only new concept is using embeddings for cache key similarity instead of string equality.

Why Cache LLM Calls?

Metric	Without Cache	With Cache (70% hit rate)
Avg latency	2,000ms	605ms
Cost per 1000 queries	$10.00	$3.00
Monthly cost (10K queries/day)	$3,000	$900
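
The figures in the table can be reproduced with a little arithmetic. The sketch below assumes every query pays a small cache-lookup cost (5 ms, an assumed figure chosen because it reproduces the 605ms in the table) and that only misses pay the full LLM latency and price. The function names are illustrative, not part of the lesson's code.

```python
def expected_latency_ms(hit_rate: float, llm_ms: float = 2000.0, lookup_ms: float = 5.0) -> float:
    """Average latency when every query pays the lookup and only misses call the LLM."""
    return lookup_ms + (1.0 - hit_rate) * llm_ms


def cost_per_1000(hit_rate: float, cost_per_query: float = 0.01) -> float:
    """LLM spend per 1000 queries when only cache misses are billed."""
    return 1000 * (1.0 - hit_rate) * cost_per_query


print(f"{expected_latency_ms(0.70):.0f} ms")      # 605 ms
print(f"${cost_per_1000(0.70):.2f}")              # $3.00 per 1000 queries
# 10K queries/day is 10 batches of 1000 per day, over 30 days
print(f"${cost_per_1000(0.70) * 10 * 30:.2f}")    # $900.00 per month
```

Embedding and vector-store costs are left out here; the Cost Analysis section later in the lesson adds the embedding term.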

Level 1: Exact-Match Caching

The simplest approach — hash the prompt and check for exact matches.

# script_id: day_091_semantic_caching/exact_match_cache
import hashlib
import json
import time
from openai import OpenAI

client = OpenAI()

# In-memory exact-match cache
class ExactMatchCache:
    """Cache LLM responses by exact prompt match."""

    def __init__(self, ttl_seconds: int = 3600):
        self.cache: dict[str, dict] = {}
        self.ttl = ttl_seconds
        self.hits = 0
        self.misses = 0

    def _hash_key(self, model: str, messages: list, **kwargs) -> str:
        """Create a deterministic hash from the request."""
        key_data = json.dumps({"model": model, "messages": messages, **kwargs}, sort_keys=True)
        return hashlib.sha256(key_data.encode()).hexdigest()

    def get(self, model: str, messages: list, **kwargs) -> str | None:
        key = self._hash_key(model, messages, **kwargs)
        entry = self.cache.get(key)

        if entry and (time.time() - entry["timestamp"]) < self.ttl:
            self.hits += 1
            return entry["response"]

        self.misses += 1
        return None

    def set(self, model: str, messages: list, response: str, **kwargs):
        key = self._hash_key(model, messages, **kwargs)
        self.cache[key] = {"response": response, "timestamp": time.time()}

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0


# Usage
cache = ExactMatchCache(ttl_seconds=3600)

def cached_chat(messages: list, model: str = "gpt-4o") -> str:
    # Check cache first ("is not None" so an empty-string response still counts as a hit)
    cached = cache.get(model, messages)
    if cached is not None:
        return cached

    # Cache miss — call LLM
    response = client.chat.completions.create(model=model, messages=messages)
    result = response.choices[0].message.content

    # Store in cache
    cache.set(model, messages, result)
    return result

Redis-Backed Exact Cache

For multi-instance deployments:

# script_id: day_091_semantic_caching/redis_llm_cache
import redis
import hashlib
import json

class RedisLLMCache:
    """Redis-backed LLM cache for distributed deployments."""

    def __init__(self, redis_url: str = "redis://localhost:6379", ttl: int = 3600):
        self.redis = redis.from_url(redis_url)
        self.ttl = ttl

    def _key(self, model: str, messages: list) -> str:
        data = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return f"llm:cache:{hashlib.sha256(data.encode()).hexdigest()}"

    def get(self, model: str, messages: list) -> str | None:
        result = self.redis.get(self._key(model, messages))
        return result.decode() if result else None

    def set(self, model: str, messages: list, response: str):
        self.redis.setex(self._key(model, messages), self.ttl, response)

Limitation: Exact-match only helps when users send identical prompts. In practice, "What's your return policy?" and "How do returns work?" are semantically the same but hash differently.
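
To make the limitation concrete, here is a small sketch that hashes two paraphrases the same way the exact-match cache does. The helper name prompt_key is hypothetical; it mirrors _hash_key above without the surrounding class.

```python
import hashlib
import json


def prompt_key(model: str, messages: list) -> str:
    """Deterministic SHA-256 key for a request (same idea as _hash_key)."""
    key_data = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(key_data.encode()).hexdigest()


a = prompt_key("gpt-4o", [{"role": "user", "content": "What's your return policy?"}])
b = prompt_key("gpt-4o", [{"role": "user", "content": "How do returns work?"}])
c = prompt_key("gpt-4o", [{"content": "What's your return policy?", "role": "user"}])

print(a == b)  # False: same meaning, different text, different key
print(a == c)  # True: sort_keys makes dict key order irrelevant
```

Any difference in the text, even punctuation or casing, produces an unrelated key, which is why paraphrases never share an exact-match entry.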

Level 2: Semantic Caching

Match queries by meaning, not exact text. Use embeddings to find similar past queries.

An embedding is a fixed-length list of floats (e.g. 1536 numbers) that an embedding model produces from a string, here via an API call (client.embeddings.create), such that texts with similar meaning produce nearby lists. Think of it like a hash, except that a small edit nudges the output slightly instead of changing it completely, so close meanings give close outputs. (You don't need to know how embeddings are trained to use them.)

To compare two embeddings we use cosine similarity. Mathematically it ranges from -1 to 1, but for text embeddings it falls in practice between roughly 0 (unrelated) and 1 (identical meaning). Many vector stores return cosine distance instead (0 = identical, higher = further apart), so similarity = 1 - distance. We count a lookup as a cache hit only when similarity clears a threshold like 0.92.
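
As a minimal sketch of the comparison itself, the snippet below computes cosine similarity in plain Python on made-up 3-dimensional vectors. Real embeddings have far more dimensions, and a vector store does this for you. Opposite vectors score -1, which rarely happens with text embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only.
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]     # same direction as v1, different length
v3 = [3.0, 0.0, -1.0]    # perpendicular to v1
v4 = [-1.0, -2.0, -3.0]  # opposite direction

for name, v in [("same direction", v2), ("perpendicular", v3), ("opposite", v4)]:
    print(f"{name}: similarity={cosine_similarity(v1, v):.3f}")
# same direction: similarity=1.000
# perpendicular: similarity=0.000
# opposite: similarity=-1.000
```

Note that v2 is twice as long as v1 yet scores 1.000: cosine similarity looks only at direction, not magnitude.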

Complete Semantic Cache

# script_id: day_091_semantic_caching/semantic_cache
from openai import OpenAI
import chromadb
import time
import uuid

client = OpenAI()

class SemanticCache:
    """Cache LLM responses using embedding similarity."""

    def __init__(self, similarity_threshold: float = 0.95, ttl_seconds: int = 3600):
        self.threshold = similarity_threshold
        self.ttl = ttl_seconds
        self.chroma = chromadb.Client()
        # Use the cosine metric for vector comparison. HNSW is the fast
        # nearest-neighbor index Chroma uses under the hood; you don't need to tune it.
        self.collection = self.chroma.get_or_create_collection(
            name="llm_cache",
            metadata={"hnsw:space": "cosine"}
        )
        self.stats = {"hits": 0, "misses": 0}

    def _get_embedding(self, text: str) -> list[float]:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

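    def _query_to_key(self, messages: list) -> str:
        """Extract the semantic key from messages (last user message).

        Note: this ignores the system prompt and earlier turns, so it is only
        safe when the answer does not depend on conversation context.
        """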
        for msg in reversed(messages):
            if msg["role"] == "user":
                return msg["content"]
        return str(messages)

    def get(self, messages: list) -> str | None:
        """Check cache for semantically similar query."""
        query_text = self._query_to_key(messages)
        query_embedding = self._get_embedding(query_text)

        # Search for similar cached queries
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=1
        )

        if not results["ids"][0]:
            self.stats["misses"] += 1
            return None

        # Check similarity threshold
        distance = results["distances"][0][0]
        similarity = 1 - distance  # ChromaDB returns distance, not similarity

        if similarity >= self.threshold:
            # Check TTL
            metadata = results["metadatas"][0][0]
            if time.time() - metadata["timestamp"] < self.ttl:
                self.stats["hits"] += 1
                return metadata["response"]

        self.stats["misses"] += 1
        return None

    def set(self, messages: list, response: str):
        """Cache a query-response pair."""
        query_text = self._query_to_key(messages)
        embedding = self._get_embedding(query_text)

        self.collection.add(
            ids=[str(uuid.uuid4())],
            embeddings=[embedding],
            documents=[query_text],
            metadatas=[{
                "response": response,
                "timestamp": time.time()
            }]
        )

    @property
    def hit_rate(self) -> float:
        total = self.stats["hits"] + self.stats["misses"]
        return self.stats["hits"] / total if total > 0 else 0.0


# Usage
semantic_cache = SemanticCache(similarity_threshold=0.92)

def smart_chat(messages: list, model: str = "gpt-4o") -> str:
    # Try semantic cache ("is not None" so an empty-string response still counts as a hit)
    cached = semantic_cache.get(messages)
    if cached is not None:
        return cached

    # Cache miss
    response = client.chat.completions.create(model=model, messages=messages)
    result = response.choices[0].message.content

    semantic_cache.set(messages, result)
    return result

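# Paraphrases like these are the target case. Whether q2 and q3 actually hit
# depends on the embedding model and the threshold, so measure the similarities.
q1 = [{"role": "user", "content": "What is your return policy?"}]
q2 = [{"role": "user", "content": "How do I return an item?"}]
q3 = [{"role": "user", "content": "Can I send back a product?"}]

print(smart_chat(q1))  # Cache miss: calls the LLM and stores the answer
print(smart_chat(q2))  # Cache hit if similarity to q1 >= threshold
print(smart_chat(q3))  # Cache hit if similarity to a stored query >= threshold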

Choosing Similarity Thresholds

The threshold controls the tradeoff between cache hit rate and response accuracy. On the 0-1 similarity scale, a higher threshold demands closer meaning, so fewer queries match (lower hit rate) but matches are safer. The hit rates below are illustrative; they vary by dataset and embedding model, so measure your own:

Threshold	Hit Rate (illustrative)	Risk	Best For
0.98+	Low (~10%)	Very safe	Factual/compliance queries
0.95	Moderate (~30%)	Safe	Customer support, FAQ
0.92	High (~50%)	Some false positives	Conversational, general Q&A
0.85	Very high (~70%)	Risky	Only for non-critical use

When in doubt, start at 0.95+ and only loosen if your hit rate is too low and you have measured the wrong-answer rate.

# script_id: day_091_semantic_caching/adaptive_thresholds
# Adaptive thresholds by query category
THRESHOLDS = {
    "factual": 0.97,      # "What are your hours?" — wrong answer is bad
    "conversational": 0.92, # "Tell me about X" — slight variation is OK
    "creative": 0.85,       # "Write a poem about..." — reuse is fine
}

Warning: A semantic cache can return stale or wrong answers if the threshold is too low. Always monitor your false-positive rate in production — track cases where users ask follow-up questions (indicating the cached answer didn't help).
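
One way to measure that tradeoff offline is to label a sample of (query, nearest cached query) pairs by hand and sweep the threshold over them. The sketch below assumes you already have such pairs as (similarity, cached_answer_was_correct) tuples. The function name and the sample data are hypothetical, invented purely to show the mechanics.

```python
def sweep_thresholds(pairs: list[tuple[float, bool]], thresholds: list[float]) -> list[dict]:
    """For each threshold, report the hit rate and the share of hits that are wrong.

    pairs: (similarity to nearest cached query, whether the cached answer was
    actually correct for the new query), e.g. from hand-labeled logs.
    """
    rows = []
    for t in thresholds:
        hits = [ok for sim, ok in pairs if sim >= t]
        wrong = sum(1 for ok in hits if not ok)
        rows.append({
            "threshold": t,
            "hit_rate": len(hits) / len(pairs),
            "wrong_answer_rate": wrong / len(hits) if hits else 0.0,
        })
    return rows


# Hypothetical labeled sample, NOT real measurements.
labeled = [
    (0.99, True), (0.97, True), (0.96, True), (0.94, True),
    (0.93, False), (0.91, True), (0.88, False), (0.86, False),
    (0.80, False), (0.70, False),
]

for row in sweep_thresholds(labeled, [0.85, 0.92, 0.95, 0.98]):
    print(row)
```

On this made-up sample, dropping the threshold from 0.95 to 0.85 raises the hit rate from 0.3 to 0.8 but also raises the wrong-answer rate among hits from 0.0 to 0.375. Your own numbers will differ, which is the point of running the sweep.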

Cache Invalidation

# script_id: day_091_semantic_caching/semantic_cache
class InvalidatingSemanticCache(SemanticCache):
    """Semantic cache with invalidation strategies."""

    def invalidate_by_age(self, max_age_seconds: int):
        """Remove entries older than max_age."""
        # In production, run this periodically
        all_entries = self.collection.get()
        stale_ids = []

        for i, metadata in enumerate(all_entries["metadatas"]):
            if time.time() - metadata["timestamp"] > max_age_seconds:
                stale_ids.append(all_entries["ids"][i])

        if stale_ids:
            self.collection.delete(ids=stale_ids)
            print(f"Invalidated {len(stale_ids)} stale cache entries")

    def invalidate_by_topic(self, topic_query: str, threshold: float = 0.85):
        """Invalidate all entries similar to a topic (e.g., after content update)."""
        embedding = self._get_embedding(topic_query)
        results = self.collection.query(
            query_embeddings=[embedding],
            n_results=100
        )

        to_delete = []
        for i, dist in enumerate(results["distances"][0]):
            if 1 - dist >= threshold:
                to_delete.append(results["ids"][0][i])

        if to_delete:
            self.collection.delete(ids=to_delete)
            print(f"Invalidated {len(to_delete)} entries related to: {topic_query}")

When to invalidate:

Source documents updated (invalidate by topic)
Model changed (invalidate everything)
Time-based TTL expired (background cleanup)
User reports wrong answer (invalidate specific entry)
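
For the "model changed" case, one option (an assumption of this sketch, not something the classes above do) is to derive the cache collection's name from the model and system prompt. Entries written under an old configuration are then never consulted. The helper cache_namespace is hypothetical.

```python
import hashlib


def cache_namespace(model: str, system_prompt: str = "", prefix: str = "llm_cache") -> str:
    """Name a cache collection after the configuration that produced its answers.

    Changing the model or the system prompt yields a different name, so a
    fresh (empty) collection is used and old answers are implicitly retired.
    """
    fingerprint = hashlib.sha256(f"{model}\n{system_prompt}".encode()).hexdigest()[:12]
    return f"{prefix}_{fingerprint}"


old = cache_namespace("gpt-4o", "You are a support agent.")
new = cache_namespace("gpt-4o-mini", "You are a support agent.")

print(old != new)                                                    # True: new model, new namespace
print(old == cache_namespace("gpt-4o", "You are a support agent."))  # True: stable for the same config
```

The returned string could be passed as the name argument to get_or_create_collection. Old collections still take up space until they are deleted, so this complements rather than replaces explicit cleanup.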

GPTCache: Production Library

For production deployments, consider GPTCache:

# script_id: day_091_semantic_caching/gptcache_usage
# pip install gptcache

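from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Initialize with semantic matching. The vector store (FAISS here) is what
# enables similarity search; without a data_manager, GPTCache falls back to
# an exact-match map.
onnx = Onnx()
data_manager = get_data_manager(
    CacheBase("sqlite"),
    VectorBase("faiss", dimension=onnx.dimension)
)
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation()
)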

# NOTE: this uses GPTCache's `openai` ADAPTER imported above (not the real
# openai module). Its drop-in `ChatCompletion.create` mirrors the legacy
# openai<1.0 surface; with the openai>=1.0 client you instead wrap your own
# `client.chat.completions.create` calls with a cache lookup (the pattern built
# earlier in this lesson). Check the GPTCache docs for current adapter support.
response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}]
)

Cost Analysis

# script_id: day_091_semantic_caching/cost_analysis
def estimate_cache_savings(
    daily_queries: int = 10_000,
    cost_per_query: float = 0.01,
    embedding_cost_per_query: float = 0.00002,
    cache_hit_rate: float = 0.60
):
    """Estimate monthly savings from semantic caching."""
    monthly_queries = daily_queries * 30

    # Without cache
    no_cache_cost = monthly_queries * cost_per_query

    # With cache
    hits = monthly_queries * cache_hit_rate
    misses = monthly_queries * (1 - cache_hit_rate)

    cache_cost = (
        monthly_queries * embedding_cost_per_query +  # Embedding for every query
        misses * cost_per_query                        # LLM call only on miss
    )

    savings = no_cache_cost - cache_cost
    return {
        "monthly_without_cache": f"${no_cache_cost:,.2f}",
        "monthly_with_cache": f"${cache_cost:,.2f}",
        "monthly_savings": f"${savings:,.2f}",
        "savings_percent": f"{(savings/no_cache_cost)*100:.0f}%"
    }

print(estimate_cache_savings())
# monthly_without_cache: $3,000.00
# monthly_with_cache: $1,206.00
# monthly_savings: $1,794.00
# savings_percent: 60%
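
One consequence of the formula above: the cache pays for itself once the hit rate exceeds the ratio of embedding cost to LLM cost. Setting the cached cost equal to the uncached cost, N*e + N*(1-h)*c = N*c, gives h = e/c. A minimal sketch with the same default prices, assuming one embedding per query as the function above does (the function name is illustrative):

```python
def break_even_hit_rate(cost_per_query: float = 0.01,
                        embedding_cost_per_query: float = 0.00002) -> float:
    """Hit rate above which the cache costs less than always calling the LLM."""
    return embedding_cost_per_query / cost_per_query


h = break_even_hit_rate()
print(f"{h:.2%}")  # 0.20%
```

This ignores the cost of running the vector store itself, so the real break-even point sits somewhat higher.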

Native Prompt Caching

Separately from semantic caching, major providers now offer server-side prompt caching that discounts repeated prefixes in your API calls. This is especially useful for system prompts, few-shot examples, or large context documents that stay the same across requests.

OpenAI automatically caches prompt prefixes for requests to supported models, giving roughly a 50% discount on cached input tokens with no code changes required (as of 2026-06; verify current discounts at the provider).

Anthropic offers explicit cache control -- you mark which parts of the prompt to cache and receive roughly a 90% discount on cached input tokens (as of 2026-06; verify current discounts at the provider):

# script_id: day_091_semantic_caching/anthropic_prompt_caching
# Anthropic prompt caching — ~90% discount on cached input tokens
import anthropic

anthropic_client = anthropic.Anthropic()
query = "Your question here"

response = anthropic_client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": "Your large system prompt here...",
        "cache_control": {"type": "ephemeral"}
    }],
    # prompt caching only kicks in above the model's minimum cacheable prefix
    # (roughly 1k+ tokens, varies by model — check the provider), so a real
    # system prompt or context document is needed for the discount to apply
    messages=[{"role": "user", "content": query}]
)

How this complements semantic caching: Prompt caching and semantic caching solve different problems. Prompt caching handles repeated prefixes -- the same system prompt or context documents sent across many requests. Semantic caching handles similar queries -- different users asking the same question in different words. In production, you often want both: prompt caching reduces per-token cost on every request, while semantic caching eliminates redundant LLM calls entirely for repeated questions.
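
To see how the two savings combine, the back-of-the-envelope sketch below estimates input-token spend per 1000 requests. All prices and token counts are hypothetical placeholders. It deliberately ignores cache-write surcharges, the uncached first request, and output tokens, so treat it as the shape of the calculation rather than a pricing model.

```python
def input_cost_per_1000_requests(
    prefix_tokens: int,
    query_tokens: int,
    price_per_million: float,
    prefix_discount: float = 0.0,    # e.g. 0.9 for a ~90% discount on cached prefix tokens
    semantic_hit_rate: float = 0.0,  # fraction of requests answered without an LLM call
) -> float:
    """Input-token cost for 1000 requests under the stated simplifications."""
    llm_calls = 1000 * (1.0 - semantic_hit_rate)
    billed_prefix = prefix_tokens * (1.0 - prefix_discount)
    tokens_per_call = billed_prefix + query_tokens
    return llm_calls * tokens_per_call * price_per_million / 1_000_000


# Hypothetical workload: 5000-token system prompt, 100-token question, $3 per million input tokens.
baseline = input_cost_per_1000_requests(5000, 100, 3.0)
prompt_only = input_cost_per_1000_requests(5000, 100, 3.0, prefix_discount=0.9)
both = input_cost_per_1000_requests(5000, 100, 3.0, prefix_discount=0.9, semantic_hit_rate=0.6)

print(f"${baseline:.2f}")     # $15.30
print(f"${prompt_only:.2f}")  # $1.80
print(f"${both:.2f}")         # $0.72
```

Prompt caching shrinks the cost of each call, while the semantic hit rate shrinks the number of calls, so the two effects multiply.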

Checkpoint

Run the semantic_cache example, ask the same question twice with slightly different wording, and confirm the second call is served from cache (much faster, no new API spend) rather than hitting the model again. If the paraphrase misses the cache every time, check that your similarity threshold isn't set too strict and that both queries are being embedded with the same model.

Summary

Quick Reference

Pattern	When	How
Exact-match cache	Identical repeated prompts	Hash the prompt → Redis key
Semantic cache	Paraphrases of the same question	Embed query, nearest-neighbor in pgvector/Chroma
Similarity threshold	Tune precision vs hit-rate	~0.97 factual, ~0.92 conversational, ~0.85 creative
TTL invalidation	Answers go stale	Set per-entry expiry
Topic/manual invalidation	Underlying data changed	Evict by tag or key

# Semantic cache lookup (pseudo-flow)
emb = embed(query)
hit = vector_store.search(emb, top_k=1)
if hit and hit.score >= THRESHOLD:
    return hit.cached_response          # cache hit
resp = call_llm(query)
vector_store.add(emb, resp, ttl=3600)   # store for next time
return resp
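
The pseudo-flow above can be turned into something runnable without any API spend. The sketch below is a linear-scan, in-memory version with the embedding function and LLM call injected as parameters. The class name, toy_embed, and fake_llm are all hypothetical stand-ins: toy_embed just counts a few keywords and is in no way a real embedding model.

```python
import math
import time
from typing import Callable


class InMemorySemanticCache:
    """Tiny linear-scan semantic cache; fine for tests, not for large volumes."""

    def __init__(self, embed: Callable[[str], list[float]],
                 threshold: float = 0.92, ttl_seconds: float = 3600.0):
        self.embed = embed
        self.threshold = threshold
        self.ttl = ttl_seconds
        # Each entry is (embedding, response, stored_at).
        self.entries: list[tuple[list[float], str, float]] = []

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def lookup_or_call(self, query: str, call_llm: Callable[[str], str]) -> tuple[str, bool]:
        """Return (response, was_cache_hit)."""
        emb = self.embed(query)
        now = time.time()
        # Find the nearest non-expired entry by cosine similarity.
        best_score, best_response = -1.0, None
        for vec, response, stored_at in self.entries:
            if now - stored_at >= self.ttl:
                continue
            score = self._cosine(emb, vec)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            return best_response, True            # cache hit
        resp = call_llm(query)                    # cache miss
        self.entries.append((emb, resp, now))     # store for next time
        return resp, False


VOCAB = ["return", "refund", "policy", "item", "shipping", "hours"]


def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: counts of a few keywords."""
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in VOCAB]


calls: list[str] = []


def fake_llm(query: str) -> str:
    """Stand-in for the LLM call; records each invocation."""
    calls.append(query)
    return f"answer to: {query}"


demo_cache = InMemorySemanticCache(toy_embed)
print(demo_cache.lookup_or_call("What is the return policy?", fake_llm)[1])     # False (miss)
print(demo_cache.lookup_or_call("Return policy please", fake_llm)[1])           # True (hit)
print(demo_cache.lookup_or_call("What are your shipping hours?", fake_llm)[1])  # False (miss)
print(len(calls))  # 2: only the misses reached the "LLM"
```

Injecting embed and call_llm keeps the hit/miss logic testable on its own; in production you would swap in a real embedding call and a vector store instead of the linear scan.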

Exercises

1. Exact-match first. Build a Redis-backed cache keyed on a hash of the prompt; measure the hit rate on a log of repeated questions.
2. Go semantic. Add an embedding-similarity cache so "What's your refund policy?" hits the entry stored for "How do refunds work?"
3. Tune the threshold. Sweep the similarity threshold from 0.85 to 0.98 and chart hit-rate vs wrong-answer rate. Pick a value for a factual assistant.
4. Invalidate. Add TTL expiry plus a way to evict every cached entry about a given topic when its source data changes.

Solutions (approaches)

1. key = hashlib.sha256(prompt.encode()).hexdigest(); redis.get(key) then redis.setex(key, ttl, resp).
2. Embed the query, nearest-neighbor search in pgvector/Chroma, return the stored response when score >= threshold.
3. Higher threshold → fewer but safer hits; for factual, bias high (~0.97) to avoid serving a near-but-wrong answer.
4. Store entries with a topic tag and ttl; evict by scanning/deleting that tag, and let TTL handle the rest.

What's Next?

Caching cuts cost and latency on the generation side. Next we turn to the retrieval side: Retrieval Evaluation & Reranking — measuring whether your search is actually returning the right context, and the highest-ROI fix when it isn't.

Semantic Caching for LLM Applications