
Fix kv prefix cache#1262

Merged
rltakashige merged 11 commits into main from fix-kv-prefix-cache
Jan 26, 2026
Conversation

Collaborator

@rltakashige rltakashige commented Jan 23, 2026

Motivation

OpenCode sends very large prompts, most of whose content is repeated on the next call.

Changes

Add prefix caching, reducing average prefill time in testing from 40 seconds to 4 seconds. This massively improves the user experience.

Also evicts KV caches from this prefix cache in an LRU manner.

Why It Works

We no longer prefill repeatedly; instead we reuse the KV cache stored in memory. A future update could back the prefix cache with storage to make it larger.
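The mechanism can be sketched as a small LRU-evicting store keyed by token prefixes. This is a hypothetical illustration, not the PR's actual implementation: the class name `KVPrefixCache` comes from the commit list below, but `max_entries` and the dict-based storage are stand-ins for the real mlx_lm cache objects.

```python
from collections import OrderedDict
import copy

class KVPrefixCache:
    """Illustrative LRU prefix cache: stores KV snapshots keyed by the
    token sequence that produced them, evicting least-recently-used."""

    def __init__(self, max_entries: int = 4):
        self.max_entries = max_entries
        self._entries: OrderedDict[tuple[int, ...], object] = OrderedDict()

    def put(self, tokens: list[int], kv_cache: object) -> None:
        key = tuple(tokens)
        self._entries[key] = kv_cache
        self._entries.move_to_end(key)          # mark as most recently used
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)   # evict least recently used

    def get(self, tokens: list[int]):
        """Return (snapshot, matched_length) for the longest stored prefix."""
        best_key, best_len = None, 0
        for key in self._entries:
            n = 0
            for a, b in zip(key, tokens):
                if a != b:
                    break
                n += 1
            if n > best_len:
                best_key, best_len = key, n
        if best_key is None:
            return None, 0
        self._entries.move_to_end(best_key)
        # Return a deep copy so callers cannot mutate the stored snapshot
        # (one of the bugs fixed below was returning a reference).
        return copy.deepcopy(self._entries[best_key]), best_len
```

On a hit, only the tokens past the matched prefix need prefilling, which is where the 40s-to-4s reduction comes from.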

Test Plan

Manual Testing

Tested speedup on OpenCode

Automated Testing

Added extensive automated tests covering the cache changes below.

- Wire up KVPrefixCache to runner and generate
- Fix exact match to return deepcopy (was returning reference)
- Fix trim_prompt_cache argument (was using wrong calculation)
- Fix token slicing to use best_snapshot_length (not index)
- Add _cache_length() using .offset for compatibility with older mlx_lm
- Fix prefill() to use max_tokens=1 with trim (workaround for mlx_lm bug)
- Add clear() method for single-cache behavior
- Remove KEEP_KV_SIZE limit from prefix matching
- Add minimal logging for cache hits/misses
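The token-slicing and exact-match fixes above can be sketched together: slice the prompt by `best_snapshot_length` (the matched prefix length), not by a cache index, and on an exact match hold one token back so generation still has a prompt to process (mirroring the `max_tokens=1` workaround). The helper name is illustrative.

```python
def split_prompt(prompt_tokens: list[int], best_snapshot_length: int):
    """Split a prompt into (cached_portion, tokens_to_prefill).

    Caps the matched portion at len(prompt) - 1 so that even an exact
    cache hit leaves one token for stream_generate to process.
    """
    matched = min(best_snapshot_length, len(prompt_tokens) - 1)
    return prompt_tokens[:matched], prompt_tokens[matched:]
```

Slicing by the match length rather than a cache index matters whenever the snapshot is shorter than the prompt; using the wrong quantity re-prefills tokens the cache already covers, or worse, skips tokens it does not.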

Fix type errors and KV cache implementation

Type fixes for CI:
- Add KVCacheType alias matching make_kv_cache return type
- Update function signatures to use consistent cache types
- Add explicit type annotations

KV cache fixes to actually reduce TTFT:
- get_kv_cache now prefills internally and returns only last token
- stream_generate receives 1 token on cache hit instead of full prompt
- Extract encode_prompt as standalone function for reuse

Refactor KV cache: move prefill to generate.py, add shared KVCacheType

Address PR feedback:
- Move KVCacheType to shared/types/mlx.py for reuse across codebase
- Move prefill logic from cache.py to generate.py
- get_kv_cache now only returns cache + remaining tokens (no prefill)
- Caller (mlx_generate) is responsible for prefilling
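The division of labor after this refactor can be sketched as follows. This is a hedged illustration: `get_kv_cache` only does lookup and returns the cache plus the remaining tokens, and the caller (standing in for the real `mlx_generate`) performs the prefill. The dict-based "cache" and `prefix_store` are stand-ins for the real mlx_lm objects.

```python
import copy

def get_kv_cache(prefix_store: dict, prompt_tokens: list[int]):
    """Lookup only: return (cache_state, tokens_still_to_prefill)."""
    for key in sorted(prefix_store, key=len, reverse=True):
        if prompt_tokens[: len(key)] == list(key):
            # Deep copy so the caller cannot mutate the stored snapshot.
            return copy.deepcopy(prefix_store[key]), prompt_tokens[len(key):]
    return {"seen": []}, prompt_tokens  # miss: fresh cache, prefill everything

def mlx_generate_sketch(prefix_store: dict, prompt_tokens: list[int]):
    cache, remaining = get_kv_cache(prefix_store, prompt_tokens)
    cache["seen"].extend(remaining)  # caller-side stand-in for prefill()
    return cache, remaining
```

Keeping prefill out of the cache keeps the cache a pure lookup structure and leaves the generation loop in one place, which is what the PR feedback asked for.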

Fix types: regenerate mlx stubs, remove type ignores

- Regenerate cache.pyi and tokenizer_utils.pyi stubs for latest mlx_lm
- Remove # type: ignore from cache.py (now fully typed)
- Remove unnecessary type ignores from generate.py
- Use mx.equal() instead of == for proper array typing

Fix encode_prompt to not add special tokens for chat-templated prompts

Chat templates (like Kimi-K2's <|im_user|>, <|im_middle|>, etc.) already
include their own structure markers. Adding BOS/EOS tokens on top of this
corrupts the prompt structure and can slow down prefill.

Use add_special_tokens=False since the chat template defines its own structure.
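A minimal sketch of the bug class, assuming a HuggingFace-style tokenizer interface. `ToyTokenizer` and its token ids are hypothetical; the point is that a chat-templated prompt already begins with its own marker, so prepending BOS corrupts the structure.

```python
class ToyTokenizer:
    """Toy stand-in for a HuggingFace-style tokenizer."""
    bos_token_id = 1

    def encode(self, ids: list[int], add_special_tokens: bool = True):
        # Chat templates (e.g. <|im_user|>) carry their own start marker;
        # prepending BOS on top of it corrupts the prompt structure.
        return ([self.bos_token_id] + ids) if add_special_tokens else list(ids)
```

The fix is simply to pass `add_special_tokens=False` when encoding a prompt that was produced by a chat template.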

Add prefill logging with progress callbacks and timing stats
# Conflicts:
#	.mlx_typings/mlx_lm/tokenizer_utils.pyi
#	src/exo/worker/engines/mlx/generator/generate.py
#	src/exo/worker/runner/runner.py
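The prefill logging described above can be sketched as a chunked loop with a progress callback and timing. The callback signature and chunk size are illustrative, not the PR's exact interface; the model call is elided.

```python
import time

def prefill(tokens: list[int], chunk_size: int = 512, on_progress=None):
    """Process the prompt in chunks, reporting progress and total time."""
    start = time.perf_counter()
    done = 0
    while done < len(tokens):
        chunk = tokens[done : done + chunk_size]
        # model(chunk, cache=cache) would run here in the real engine
        done += len(chunk)
        if on_progress:
            on_progress(done, len(tokens))
    return done, time.perf_counter() - start
```

With a 1100-token prompt and the default chunk size, the callback fires at 512, 1024, and 1100 tokens processed.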
@rltakashige rltakashige enabled auto-merge (squash) January 23, 2026 20:47
@Evanev7 Evanev7 (Member) left a comment

I think we should move forward with this and get some results - looks good.

@JakeHillion JakeHillion (Member) left a comment

LGTM.

@rltakashige rltakashige merged commit cd8c01b into main Jan 26, 2026
8 checks passed
@rltakashige rltakashige deleted the fix-kv-prefix-cache branch January 26, 2026 20:14
AlexCheema added a commit that referenced this pull request Jan 26, 2026
leocamello added a commit to leocamello/exo that referenced this pull request Jan 26, 2026
This PR makes exo engine-agnostic by adding PyTorch as an inference backend,
enabling Linux systems with NVIDIA GPUs to run inference.

## Architecture Changes

- **Engine abstraction**: Created base_engine.py with Engine interface
- **MlxEngine**: Moved all MLX-specific patches into MlxEngine.generate()
  - Includes KV prefix cache for faster prefill (upstream feature exo-explore#1262)
  - Properly passes kv_prefix_cache to mlx_generate()
- **PytorchEngine**: New engine for HuggingFace transformers on NVIDIA GPUs
- **Engine-agnostic runner**: runner.py no longer imports MLX at top level
- **Conditional imports**: bootstrap.py selects engine based on instance type

## New Files

- src/exo/worker/engines/base_engine.py - Abstract Engine interface
- src/exo/worker/engines/pytorch/__init__.py - PyTorch engine implementation
- src/exo/worker/engines/pytorch/auto_parallel.py - Pipeline parallelism for PyTorch
- src/exo/worker/engines/mlx/patches.py - Extracted MLX-specific helpers
- src/exo/utils/info_gatherer/linux_metrics.py - nvidia-smi GPU metrics

## Upstream Features Preserved

- KV prefix cache (exo-explore#1262) - integrated into MlxEngine
- Empty message fix (exo-explore#1292) - in utils_mlx.py
- Model shard loading fix (exo-explore#1291) - in auto_parallel.py

## Dashboard Changes

- Model-engine compatibility filtering
- Reordered instance type buttons (MLX above PyTorch)
- Fixed matchesSelectedRuntime() for PyTorch
- Model dropdown reset on instance type change

## Bug Fixes

- placement.py: Added missing logger import
- api.py: Model/engine compatibility validation
- api.py: Fixed tags hardcoding (tags=card.tags or [])
- test_event_ordering.py: Updated to use MockEngine instead of MLX patches

## Testing

- 20+ tests for nvidia-smi parsing edge cases
- PyTorch engine tests
- All existing tests pass (137 passed)
leocamello added a commit to leocamello/exo that referenced this pull request Jan 31, 2026
This PR makes exo engine-agnostic by adding PyTorch as an inference backend,
enabling Linux systems with NVIDIA GPUs to run inference.

- **Engine abstraction**: Created base_engine.py with Engine interface
- **MlxEngine**: Moved all MLX-specific patches into MlxEngine.generate()
  - Includes KV prefix cache for faster prefill (upstream feature exo-explore#1262)
  - Properly passes kv_prefix_cache to mlx_generate()
- **PytorchEngine**: New engine for HuggingFace transformers on NVIDIA GPUs
- **Engine-agnostic runner**: runner.py no longer imports MLX at top level
- **Conditional imports**: bootstrap.py selects engine based on instance type

Related to exo-explore#1347 - PyTorch Backend Requirements
leocamello added a commit to leocamello/exo that referenced this pull request Feb 7, 2026
michaelharrigan added a commit to michaelharrigan/exo that referenced this pull request Mar 5, 2026