[PERFORMANCE]: Response-cache-by-prompt algorithmic optimization #1835

@crivetimihai

Description

Summary

Response-cache-by-prompt vectorizes the input and performs a full linear scan of cached entries on every request. Each lookup is O(n) in the number of cached entries, so it becomes CPU-heavy as the cache grows.

Evidence (current code)

  • plugins/response_cache_by_prompt/response_cache_by_prompt.py: _find_best vectorizes the input and computes cosine similarity against every cache entry on each request.
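
For readers without the file open, the pattern described above looks roughly like the sketch below. This is illustrative only and is not the plugin's code: the bag-of-words `vectorize`, the `cosine` helper, the `find_best_linear` name, the `(vector, response)` entry shape, and the `threshold` default are all assumptions made for the example.

```python
import math
from collections import Counter


def vectorize(text: str) -> Counter:
    # Hypothetical vectorizer: lowercase whitespace tokens -> term counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_best_linear(prompt, entries, threshold=0.9):
    # entries: list of (cached_prompt_vector, response) pairs (assumed shape).
    query = vectorize(prompt)  # done once per request
    best_score, best_response = 0.0, None
    for vec, response in entries:  # touches every entry: O(n) per lookup
        score = cosine(query, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None
```

The loop body runs once per cached entry regardless of how unrelated the entry is, which is where the linear CPU growth comes from.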

Impact

  • CPU usage grows linearly with cache size.
  • Can dominate request latency when the cache is large.

Proposed fix

  • Use LRU eviction plus pruning to keep the cache small, or index entries by token to reduce the number of candidate comparisons.
  • Consider approximate nearest neighbor search for large caches.
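
A minimal sketch of the first option, combining LRU eviction with a token inverted index, might look like this. Everything here is hypothetical (the `IndexedPromptCache` class, its method names, the bag-of-words vectorizer, and the `max_entries` and `threshold` defaults); it is not the plugin's API, only one way the idea could be shaped.

```python
import math
from collections import Counter, OrderedDict, defaultdict


class IndexedPromptCache:
    """Hypothetical LRU cache that only scores entries sharing a token with the query."""

    def __init__(self, max_entries=1000, threshold=0.9):
        self.max_entries = max_entries
        self.threshold = threshold
        self._entries = OrderedDict()   # prompt -> (vector, norm, response), oldest first
        self._index = defaultdict(set)  # token -> prompts containing that token
        self.last_candidates = 0        # entries scored by the most recent get()

    @staticmethod
    def _vectorize(text):
        # Assumed vectorizer: lowercase whitespace tokens -> term counts.
        return Counter(text.lower().split())

    @staticmethod
    def _norm(vec):
        return math.sqrt(sum(c * c for c in vec.values()))

    def _remove(self, key):
        # Drop an entry and clean up its postings in the inverted index.
        vec, _, _ = self._entries.pop(key)
        for token in vec:
            bucket = self._index[token]
            bucket.discard(key)
            if not bucket:
                del self._index[token]

    def put(self, prompt, response):
        if prompt in self._entries:
            self._remove(prompt)
        vec = self._vectorize(prompt)
        self._entries[prompt] = (vec, self._norm(vec), response)
        for token in vec:
            self._index[token].add(prompt)
        # LRU eviction: discard least recently used entries beyond the cap.
        while len(self._entries) > self.max_entries:
            self._remove(next(iter(self._entries)))

    def get(self, prompt):
        query = self._vectorize(prompt)
        q_norm = self._norm(query)
        candidates = set()
        for token in query:
            candidates |= self._index.get(token, set())
        self.last_candidates = len(candidates)
        best_key, best_score = None, 0.0
        for key in candidates:  # only entries with at least one shared token
            vec, norm, _ = self._entries[key]
            dot = sum(c * vec[t] for t, c in query.items())
            score = dot / (q_norm * norm)
            if score > best_score:
                best_key, best_score = key, score
        if best_key is None or best_score < self.threshold:
            return None
        self._entries.move_to_end(best_key)  # mark as most recently used
        return self._entries[best_key][2]
```

With a bag-of-words vector, an entry that shares no token with the query has cosine similarity 0, so skipping it does not change the result. The worst case is still linear when a very common token appears in most entries; filtering stop words or requiring a minimum token overlap would tighten the candidate set, and a dense-embedding vectorizer would need the approximate nearest neighbor route instead.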

Acceptance criteria

  • Cache lookup avoids a full linear scan for common cases.
  • CPU cost per request scales sublinearly with cache size.

Metadata

Labels

  • SHOULD (P2: Important but not vital; high-value items that are not crucial for the immediate release)
  • performance (Performance related items)
