Context
The current `response_cache_enabled` option in Zeph uses exact-match caching. Research indicates semantic caching (matching by embedding similarity rather than exact query text) can reduce LLM API calls by up to 69%.
Concept
Instead of caching only identical prompts, cache responses for semantically similar queries:
- Embed incoming user query
- Search cache for similar embeddings (threshold ~0.95)
- On cache hit: return cached response (sub-millisecond)
- On cache miss: call LLM, store response + embedding
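The lookup flow above can be sketched as a minimal in-memory cache. This is illustrative only: `SemanticCache` and the linear scan are placeholder names and structure, embeddings are passed in precomputed (in Zeph they would come from the existing Ollama embedding infrastructure), and a real deployment would query the existing Qdrant or sqlite vector index instead of scanning a list.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Illustrative semantic cache: store (embedding, response) pairs,
    return a cached response when the best similarity clears the threshold."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def get(self, embedding):
        # Linear scan stands in for a vector-index lookup.
        best_response, best_sim = None, 0.0
        for emb, response in self.entries:
            sim = cosine(embedding, emb)
            if sim > best_sim:
                best_response, best_sim = response, sim
        return best_response if best_sim >= self.threshold else None

    def put(self, embedding, response):
        self.entries.append((embedding, response))
```

On a miss, the caller invokes the LLM and then `put`s the response together with the query embedding, so the next similar query becomes a hit.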
Applicability to Zeph
- Already has embedding infrastructure (sqlite vector, Qdrant)
- Already has a `response_cache_enabled` config option that could be extended
- Most impactful for: repeated similar questions across sessions, skill-triggered prompts, experiment evaluator calls
- Less impactful for: unique creative queries, tool-heavy interactions
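Extending the existing option might look like the following. This is purely a sketch: the `semantic_cache` key and its sub-keys are hypothetical names, and a YAML config format is assumed.

```yaml
response_cache_enabled: true
# Hypothetical extension (names illustrative):
semantic_cache:
  enabled: true
  similarity_threshold: 0.95   # cosine similarity required for a hit
  embedding_model: ollama      # reuse existing embedding infrastructure
```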
Trade-offs
- Requires an embedding model (Ollama); not available in a Claude-only config
- Similarity threshold tuning: too low = stale/wrong answers, too high = no cache hits
- Cache invalidation: context changes may make cached responses incorrect
- Memory overhead: storing embeddings + responses
References
- AI Agent Architecture (Redis) — semantic caching pattern, 69% API call reduction, 15X faster responses
- Redis LangCache: vector-similarity based query caching