
Fix kv prefix cache#1262

Merged
rltakashige merged 11 commits into main from fix-kv-prefix-cache
Jan 26, 2026
Conversation

Collaborator

@rltakashige rltakashige commented Jan 23, 2026

Motivation

OpenCode sends very large prompts, most of whose content is repeated on the next call.

Changes

Add prefix caching, reducing average prefill time in testing from 40 seconds to 4 seconds. This massively improves the user experience.

Also evicts KV caches from this prefix cache in an LRU manner.

Why It Works

We no longer prefill repeatedly; instead we reuse the KV cache stored in memory. A future update could back the prefix cache with storage to make it larger.
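The mechanism can be sketched as a small LRU-evicting store keyed by token prefixes. This is a hypothetical illustration, not the PR's actual implementation: the class name `KVPrefixCache` comes from the commit list below, but `max_entries` and the dict-based storage are stand-ins for the real mlx_lm cache objects.

```python
from collections import OrderedDict
import copy

class KVPrefixCache:
    """Illustrative LRU prefix cache: stores KV snapshots keyed by the
    token sequence that produced them, evicting least-recently-used."""

    def __init__(self, max_entries: int = 4):
        self.max_entries = max_entries
        self._entries: OrderedDict[tuple[int, ...], object] = OrderedDict()

    def put(self, tokens: list[int], kv_cache: object) -> None:
        key = tuple(tokens)
        self._entries[key] = kv_cache
        self._entries.move_to_end(key)          # mark as most recently used
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)   # evict least recently used

    def get(self, tokens: list[int]):
        """Return (snapshot, matched_length) for the longest stored prefix."""
        best_key, best_len = None, 0
        for key in self._entries:
            n = 0
            for a, b in zip(key, tokens):
                if a != b:
                    break
                n += 1
            if n > best_len:
                best_key, best_len = key, n
        if best_key is None:
            return None, 0
        self._entries.move_to_end(best_key)
        # Return a deep copy so callers cannot mutate the stored snapshot
        # (one of the bugs fixed below was returning a reference).
        return copy.deepcopy(self._entries[best_key]), best_len
```

On a hit, only the tokens past the matched prefix need prefilling, which is where the 40s-to-4s reduction comes from.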

Test Plan

Manual Testing

Tested speedup on OpenCode

Automated Testing

Added extensive automated tests covering the cache changes below.

- Wire up KVPrefixCache to runner and generate
- Fix exact match to return deepcopy (was returning reference)
- Fix trim_prompt_cache argument (was using wrong calculation)
- Fix token slicing to use best_snapshot_length (not index)
- Add _cache_length() using .offset for compatibility with older mlx_lm
- Fix prefill() to use max_tokens=1 with trim (workaround for mlx_lm bug)
- Add clear() method for single-cache behavior
- Remove KEEP_KV_SIZE limit from prefix matching
- Add minimal logging for cache hits/misses
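The token-slicing and exact-match fixes above can be sketched together: slice the prompt by `best_snapshot_length` (the matched prefix length), not by a cache index, and on an exact match hold one token back so generation still has a prompt to process (mirroring the `max_tokens=1` workaround). The helper name is illustrative.

```python
def split_prompt(prompt_tokens: list[int], best_snapshot_length: int):
    """Split a prompt into (cached_portion, tokens_to_prefill).

    Caps the matched portion at len(prompt) - 1 so that even an exact
    cache hit leaves one token for stream_generate to process.
    """
    matched = min(best_snapshot_length, len(prompt_tokens) - 1)
    return prompt_tokens[:matched], prompt_tokens[matched:]
```

Slicing by the match length rather than a cache index matters whenever the snapshot is shorter than the prompt; using the wrong quantity re-prefills tokens the cache already covers, or worse, skips tokens it does not.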

Fix type errors and KV cache implementation

Type fixes for CI:
- Add KVCacheType alias matching make_kv_cache return type
- Update function signatures to use consistent cache types
- Add explicit type annotations

KV cache fixes to actually reduce TTFT:
- get_kv_cache now prefills internally and returns only last token
- stream_generate receives 1 token on cache hit instead of full prompt
- Extract encode_prompt as standalone function for reuse

Refactor KV cache: move prefill to generate.py, add shared KVCacheType

Address PR feedback:
- Move KVCacheType to shared/types/mlx.py for reuse across codebase
- Move prefill logic from cache.py to generate.py
- get_kv_cache now only returns cache + remaining tokens (no prefill)
- Caller (mlx_generate) is responsible for prefilling
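The division of labor after this refactor can be sketched as follows. This is a hedged illustration: `get_kv_cache` only does lookup and returns the cache plus the remaining tokens, and the caller (standing in for the real `mlx_generate`) performs the prefill. The dict-based "cache" and `prefix_store` are stand-ins for the real mlx_lm objects.

```python
import copy

def get_kv_cache(prefix_store: dict, prompt_tokens: list[int]):
    """Lookup only: return (cache_state, tokens_still_to_prefill)."""
    for key in sorted(prefix_store, key=len, reverse=True):
        if prompt_tokens[: len(key)] == list(key):
            # Deep copy so the caller cannot mutate the stored snapshot.
            return copy.deepcopy(prefix_store[key]), prompt_tokens[len(key):]
    return {"seen": []}, prompt_tokens  # miss: fresh cache, prefill everything

def mlx_generate_sketch(prefix_store: dict, prompt_tokens: list[int]):
    cache, remaining = get_kv_cache(prefix_store, prompt_tokens)
    cache["seen"].extend(remaining)  # caller-side stand-in for prefill()
    return cache, remaining
```

Keeping prefill out of the cache keeps the cache a pure lookup structure and leaves the generation loop in one place, which is what the PR feedback asked for.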

Fix types: regenerate mlx stubs, remove type ignores

- Regenerate cache.pyi and tokenizer_utils.pyi stubs for latest mlx_lm
- Remove # type: ignore from cache.py (now fully typed)
- Remove unnecessary type ignores from generate.py
- Use mx.equal() instead of == for proper array typing

Fix encode_prompt to not add special tokens for chat-templated prompts

Chat templates (like Kimi-K2's <|im_user|>, <|im_middle|>, etc.) already
include their own structure markers. Adding BOS/EOS tokens on top of this
corrupts the prompt structure and can slow down prefill.

Use add_special_tokens=False since the chat template defines its own structure.
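A minimal sketch of the bug class, assuming a HuggingFace-style tokenizer interface. `ToyTokenizer` and its token ids are hypothetical; the point is that a chat-templated prompt already begins with its own marker, so prepending BOS corrupts the structure.

```python
class ToyTokenizer:
    """Toy stand-in for a HuggingFace-style tokenizer."""
    bos_token_id = 1

    def encode(self, ids: list[int], add_special_tokens: bool = True):
        # Chat templates (e.g. <|im_user|>) carry their own start marker;
        # prepending BOS on top of it corrupts the prompt structure.
        return ([self.bos_token_id] + ids) if add_special_tokens else list(ids)
```

The fix is simply to pass `add_special_tokens=False` when encoding a prompt that was produced by a chat template.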

Add prefill logging with progress callbacks and timing stats
# Conflicts:
#	.mlx_typings/mlx_lm/tokenizer_utils.pyi
#	src/exo/worker/engines/mlx/generator/generate.py
#	src/exo/worker/runner/runner.py
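The prefill logging described above can be sketched as a chunked loop with a progress callback and timing. The callback signature and chunk size are illustrative, not the PR's exact interface; the model call is elided.

```python
import time

def prefill(tokens: list[int], chunk_size: int = 512, on_progress=None):
    """Process the prompt in chunks, reporting progress and total time."""
    start = time.perf_counter()
    done = 0
    while done < len(tokens):
        chunk = tokens[done : done + chunk_size]
        # model(chunk, cache=cache) would run here in the real engine
        done += len(chunk)
        if on_progress:
            on_progress(done, len(tokens))
    return done, time.perf_counter() - start
```

With a 1100-token prompt and the default chunk size, the callback fires at 512, 1024, and 1100 tokens processed.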
@rltakashige rltakashige enabled auto-merge (squash) January 23, 2026 20:47
@Evanev7 Evanev7 (Member) left a comment

I think we should move forward with this and get some results - looks good.

@JakeHillion JakeHillion (Member) left a comment

LGTM.

@rltakashige rltakashige merged commit cd8c01b into main Jan 26, 2026
8 checks passed
@rltakashige rltakashige deleted the fix-kv-prefix-cache branch January 26, 2026 20:14
AlexCheema added a commit that referenced this pull request Jan 26, 2026
leocamello added a commit to leocamello/exo that referenced this pull request Jan 26, 2026
This PR makes exo engine-agnostic by adding PyTorch as an inference backend,
enabling Linux systems with NVIDIA GPUs to run inference.

## Architecture Changes

- **Engine abstraction**: Created base_engine.py with Engine interface
- **MlxEngine**: Moved all MLX-specific patches into MlxEngine.generate()
  - Includes KV prefix cache for faster prefill (upstream feature exo-explore#1262)
  - Properly passes kv_prefix_cache to mlx_generate()
- **PytorchEngine**: New engine for HuggingFace transformers on NVIDIA GPUs
- **Engine-agnostic runner**: runner.py no longer imports MLX at top level
- **Conditional imports**: bootstrap.py selects engine based on instance type

## New Files

- src/exo/worker/engines/base_engine.py - Abstract Engine interface
- src/exo/worker/engines/pytorch/__init__.py - PyTorch engine implementation
- src/exo/worker/engines/pytorch/auto_parallel.py - Pipeline parallelism for PyTorch
- src/exo/worker/engines/mlx/patches.py - Extracted MLX-specific helpers
- src/exo/utils/info_gatherer/linux_metrics.py - nvidia-smi GPU metrics

## Upstream Features Preserved

- KV prefix cache (exo-explore#1262) - integrated into MlxEngine
- Empty message fix (exo-explore#1292) - in utils_mlx.py
- Model shard loading fix (exo-explore#1291) - in auto_parallel.py

## Dashboard Changes

- Model-engine compatibility filtering
- Reordered instance type buttons (MLX above PyTorch)
- Fixed matchesSelectedRuntime() for PyTorch
- Model dropdown reset on instance type change

## Bug Fixes

- placement.py: Added missing logger import
- api.py: Model/engine compatibility validation
- api.py: Fixed tags hardcoding (tags=card.tags or [])
- test_event_ordering.py: Updated to use MockEngine instead of MLX patches

## Testing

- 20+ tests for nvidia-smi parsing edge cases
- PyTorch engine tests
- All existing tests pass (137 passed)
leocamello added a commit to leocamello/exo that referenced this pull request Jan 31, 2026
This PR makes exo engine-agnostic by adding PyTorch as an inference backend,
enabling Linux systems with NVIDIA GPUs to run inference.

- **Engine abstraction**: Created base_engine.py with Engine interface
- **MlxEngine**: Moved all MLX-specific patches into MlxEngine.generate()
  - Includes KV prefix cache for faster prefill (upstream feature exo-explore#1262)
  - Properly passes kv_prefix_cache to mlx_generate()
- **PytorchEngine**: New engine for HuggingFace transformers on NVIDIA GPUs
- **Engine-agnostic runner**: runner.py no longer imports MLX at top level
- **Conditional imports**: bootstrap.py selects engine based on instance type

Related to exo-explore#1347 - PyTorch Backend Requirements
leocamello added a commit to leocamello/exo that referenced this pull request Feb 7, 2026
michaelharrigan added a commit to michaelharrigan/exo that referenced this pull request Mar 5, 2026