[PERFORMANCE]: Response-cache-by-prompt algorithmic optimization #1835
Labels
SHOULD (P2: Important but not vital; high-value items that are not crucial for the immediate release), performance (Performance related items)
Description
Summary
The response-cache-by-prompt plugin vectorizes the input and performs a full linear scan of cached entries on every request. Lookup cost is O(n) in the number of cached entries and becomes CPU-heavy as the cache grows.
Evidence (current code)
plugins/response_cache_by_prompt/response_cache_by_prompt.py:_find_best vectorizes the input and compares cosine similarity against all cache entries on each request.
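For illustration only, the pattern described above has roughly the following shape. This is a hypothetical sketch, not the plugin's actual code: `vectorize`, `cosine`, and `find_best_linear` are assumed names, and the bag-of-words vectorizer is a stand-in for whatever the plugin really uses.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Hypothetical bag-of-words vectorizer (lowercased word counts)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_best_linear(prompt, entries, threshold=0.9):
    """Score the prompt against every cached vector: O(n) work per request."""
    query = vectorize(prompt)
    best_key, best_score = None, 0.0
    for key, vec in entries.items():  # full scan over the cache
        score = cosine(query, vec)
        if score > best_score:
            best_key, best_score = key, score
    if best_score < threshold:
        return None, best_score
    return best_key, best_score
```

Every request pays for one vectorization plus one similarity computation per cached entry, which is where the linear growth comes from.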
Impact
- CPU usage grows linearly with cache size.
- Can dominate request latency when cache is large.
Proposed fix
- Use LRU + pruning to keep cache small, or index entries by tokens to reduce candidate comparisons.
- Consider approximate nearest neighbor search for large caches.
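A minimal sketch of the first option, combining an LRU bound with an inverted token index so that only entries sharing at least one token with the prompt are scored. Everything here is an assumption for illustration: `TokenIndexedCache`, `tokenize`, `cosine`, `max_entries`, and `threshold` are hypothetical names, the bag-of-words vectors stand in for the plugin's real vectorizer, and approximate nearest neighbor search is not shown.

```python
import math
import re
from collections import Counter, OrderedDict, defaultdict


def tokenize(text: str) -> Counter:
    """Hypothetical bag-of-words vectorizer (lowercased word counts)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TokenIndexedCache:
    """LRU-bounded cache with an inverted token index (illustrative only)."""

    def __init__(self, max_entries: int = 1000):
        self.max_entries = max_entries
        self._entries = OrderedDict()   # prompt -> (vector, response), LRU order
        self._index = defaultdict(set)  # token -> prompts containing that token

    def _remove(self, key: str) -> None:
        vec, _ = self._entries.pop(key)
        for token in vec:
            keys = self._index[token]
            keys.discard(key)
            if not keys:
                del self._index[token]

    def put(self, prompt: str, response: str) -> None:
        if prompt in self._entries:
            self._remove(prompt)
        vec = tokenize(prompt)
        self._entries[prompt] = (vec, response)
        for token in vec:
            self._index[token].add(prompt)
        # Evict least recently used entries once the bound is exceeded.
        while len(self._entries) > self.max_entries:
            self._remove(next(iter(self._entries)))

    def get(self, prompt: str, threshold: float = 0.9):
        query = tokenize(prompt)
        # Candidates are only the entries that share a token with the prompt.
        candidates = set()
        for token in query:
            candidates |= self._index.get(token, set())
        best_key, best_score = None, 0.0
        for key in candidates:
            score = cosine(query, self._entries[key][0])
            if score > best_score:
                best_key, best_score = key, score
        if best_key is None or best_score < threshold:
            return None
        self._entries.move_to_end(best_key)  # mark as most recently used
        return self._entries[best_key][1]
```

Skipping entries with no shared token is safe for bag-of-words cosine similarity, since their score would be zero anyway. The worst case is still a full scan when a very common token appears in most entries, so a real implementation would probably need to drop stop words or cap the candidate set.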
Acceptance criteria
- Cache lookup avoids full linear scan for common cases.
- CPU cost per request scales sublinearly with cache size.