Context
The current `response_cache_enabled` option in Zeph uses exact-match caching. Research indicates semantic caching (matching by embedding similarity rather than exact query text) can reduce LLM API calls by up to 69%.
Concept
Instead of caching only identical prompts, cache responses for semantically similar queries:
- Embed incoming user query
- Search cache for similar embeddings (threshold ~0.95)
- On cache hit: return cached response (sub-millisecond)
- On cache miss: call LLM, store response + embedding
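The lookup flow above can be sketched as a minimal in-memory cache. This is illustrative only: `SemanticCache` and the linear scan are placeholder names and structure, embeddings are passed in precomputed (in Zeph they would come from the existing Ollama embedding infrastructure), and a real deployment would query the existing Qdrant or sqlite vector index instead of scanning a list.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Illustrative semantic cache: store (embedding, response) pairs,
    return a cached response when the best similarity clears the threshold."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def get(self, embedding):
        # Linear scan stands in for a vector-index lookup.
        best_response, best_sim = None, 0.0
        for emb, response in self.entries:
            sim = cosine(embedding, emb)
            if sim > best_sim:
                best_response, best_sim = response, sim
        return best_response if best_sim >= self.threshold else None

    def put(self, embedding, response):
        self.entries.append((embedding, response))
```

On a miss, the caller invokes the LLM and then `put`s the response together with the query embedding, so the next similar query becomes a hit.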
Applicability to Zeph
- Already has embedding infrastructure (sqlite vector, Qdrant)
- Already has a `response_cache_enabled` config option that could be extended
- Most impactful for: repeated similar questions across sessions, skill-triggered prompts, experiment evaluator calls
- Less impactful for: unique creative queries, tool-heavy interactions
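Extending the existing option might look like the following. This is purely a sketch: the `semantic_cache` key and its sub-keys are hypothetical names, and a YAML config format is assumed.

```yaml
response_cache_enabled: true
# Hypothetical extension (names illustrative):
semantic_cache:
  enabled: true
  similarity_threshold: 0.95   # cosine similarity required for a hit
  embedding_model: ollama      # reuse existing embedding infrastructure
```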
Trade-offs
- Requires an embedding model (Ollama); not available in a Claude-only config
- Similarity threshold tuning: too low = stale/wrong answers, too high = no cache hits
- Cache invalidation: context changes may make cached responses incorrect
- Memory overhead: storing embeddings + responses
References
- AI Agent Architecture (Redis) — semantic caching pattern, 69% API call reduction, 15X faster responses
- Redis LangCache: vector-similarity based query caching