Merged
- Wire up KVPrefixCache to runner and generate
- Fix exact match to return deepcopy (was returning reference)
- Fix trim_prompt_cache argument (was using wrong calculation)
- Fix token slicing to use best_snapshot_length (not index)
- Add _cache_length() using .offset for compatibility with older mlx_lm
- Fix prefill() to use max_tokens=1 with trim (workaround for mlx_lm bug)
- Add clear() method for single-cache behavior
- Remove KEEP_KV_SIZE limit from prefix matching
- Add minimal logging for cache hits/misses

**Fix type errors and KV cache implementation**

Type fixes for CI:
- Add KVCacheType alias matching make_kv_cache return type
- Update function signatures to use consistent cache types
- Add explicit type annotations

KV cache fixes to actually reduce TTFT:
- get_kv_cache now prefills internally and returns only last token
- stream_generate receives 1 token on cache hit instead of full prompt
- Extract encode_prompt as standalone function for reuse

**Refactor KV cache: move prefill to generate.py, add shared KVCacheType**

Address PR feedback:
- Move KVCacheType to shared/types/mlx.py for reuse across codebase
- Move prefill logic from cache.py to generate.py
- get_kv_cache now only returns cache + remaining tokens (no prefill)
- Caller (mlx_generate) is responsible for prefilling

**Fix types: regenerate mlx stubs, remove type ignores**

- Regenerate cache.pyi and tokenizer_utils.pyi stubs for latest mlx_lm
- Remove # type: ignore from cache.py (now fully typed)
- Remove unnecessary type ignores from generate.py
- Use mx.equal() instead of == for proper array typing

**Fix encode_prompt to not add special tokens for chat-templated prompts**

Chat templates (like Kimi-K2's <|im_user|>, <|im_middle|>, etc.) already include their own structure markers. Adding BOS/EOS tokens on top of this corrupts the prompt structure and can slow down prefill. Use add_special_tokens=False since the chat template defines its own structure.
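The encode_prompt fix above can be sketched as follows. FakeTokenizer is a stand-in written only for this illustration; the real code would call the model's HF-style tokenizer, whose `encode()` accepts `add_special_tokens`.

```python
# Stand-in tokenizer (hypothetical) that mimics the HF-style encode() signature.
class FakeTokenizer:
    BOS, EOS = 1, 2

    def encode(self, text: str, add_special_tokens: bool = True) -> list[int]:
        ids = [ord(c) for c in text]
        return [self.BOS] + ids + [self.EOS] if add_special_tokens else ids


def encode_prompt(tokenizer, prompt: str) -> list[int]:
    # Chat templates (e.g. <|im_user|>) already define the prompt structure;
    # adding BOS/EOS on top would corrupt it and can slow down prefill.
    return tokenizer.encode(prompt, add_special_tokens=False)
```

With the default `add_special_tokens=True`, the same prompt would come back wrapped in BOS/EOS, duplicating the structure the chat template already provides.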
Add prefill logging with progress callbacks and timing stats
Force-pushed from fc0d2d3 to a1939c8
Force-pushed from e4badff to 812a9f2
# Conflicts:
#   .mlx_typings/mlx_lm/tokenizer_utils.pyi
#   src/exo/worker/engines/mlx/generator/generate.py
#   src/exo/worker/runner/runner.py
Evanev7 (Member) approved these changes on Jan 24, 2026, commenting:
I think we should move forward with this and get some results. Looks good.
AlexCheema added a commit that referenced this pull request on Jan 26, 2026:
This reverts commit cd8c01b.
leocamello added a commit to leocamello/exo that referenced this pull request on Jan 26, 2026:
This PR makes exo engine-agnostic by adding PyTorch as an inference backend, enabling Linux systems with NVIDIA GPUs to run inference.

## Architecture Changes
- **Engine abstraction**: Created base_engine.py with Engine interface
- **MlxEngine**: Moved all MLX-specific patches into MlxEngine.generate()
  - Includes KV prefix cache for faster prefill (upstream feature exo-explore#1262)
  - Properly passes kv_prefix_cache to mlx_generate()
- **PytorchEngine**: New engine for HuggingFace transformers on NVIDIA GPUs
- **Engine-agnostic runner**: runner.py no longer imports MLX at top level
- **Conditional imports**: bootstrap.py selects engine based on instance type

## New Files
- src/exo/worker/engines/base_engine.py - Abstract Engine interface
- src/exo/worker/engines/pytorch/__init__.py - PyTorch engine implementation
- src/exo/worker/engines/pytorch/auto_parallel.py - Pipeline parallelism for PyTorch
- src/exo/worker/engines/mlx/patches.py - Extracted MLX-specific helpers
- src/exo/utils/info_gatherer/linux_metrics.py - nvidia-smi GPU metrics

## Upstream Features Preserved
- KV prefix cache (exo-explore#1262) - integrated into MlxEngine
- Empty message fix (exo-explore#1292) - in utils_mlx.py
- Model shard loading fix (exo-explore#1291) - in auto_parallel.py

## Dashboard Changes
- Model-engine compatibility filtering
- Reordered instance type buttons (MLX above PyTorch)
- Fixed matchesSelectedRuntime() for PyTorch
- Model dropdown reset on instance type change

## Bug Fixes
- placement.py: Added missing logger import
- api.py: Model/engine compatibility validation
- api.py: Fixed tags hardcoding (tags=card.tags or [])
- test_event_ordering.py: Updated to use MockEngine instead of MLX patches

## Testing
- 20+ tests for nvidia-smi parsing edge cases
- PyTorch engine tests
- All existing tests pass (137 passed)
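The engine abstraction described above can be sketched as a small abstract base class. This is a hedged illustration of the idea, not the actual base_engine.py interface, whose method names and signatures may differ.

```python
from abc import ABC, abstractmethod
from typing import Iterator


class Engine(ABC):
    """Sketch of an inference-backend interface (assumed, not exo's real one)."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int) -> Iterator[str]:
        """Yield generated text chunks for the given prompt."""


class EchoEngine(Engine):
    # Toy backend used only to show how an engine plugs into the interface;
    # a real MlxEngine or PytorchEngine would run model inference here.
    def generate(self, prompt: str, max_tokens: int) -> Iterator[str]:
        for tok in prompt.split()[:max_tokens]:
            yield tok
```

With such an interface, the runner can hold an `Engine` and never import MLX or PyTorch at the top level; bootstrap code picks the concrete class per instance type.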
leocamello added a commit to leocamello/exo that referenced this pull request on Jan 31, 2026:
This PR makes exo engine-agnostic by adding PyTorch as an inference backend, enabling Linux systems with NVIDIA GPUs to run inference.

- **Engine abstraction**: Created base_engine.py with Engine interface
- **MlxEngine**: Moved all MLX-specific patches into MlxEngine.generate()
  - Includes KV prefix cache for faster prefill (upstream feature exo-explore#1262)
  - Properly passes kv_prefix_cache to mlx_generate()
- **PytorchEngine**: New engine for HuggingFace transformers on NVIDIA GPUs
- **Engine-agnostic runner**: runner.py no longer imports MLX at top level
- **Conditional imports**: bootstrap.py selects engine based on instance type

Related to exo-explore#1347 - PyTorch Backend Requirements
leocamello added a commit to leocamello/exo that referenced this pull request on Feb 7, 2026
michaelharrigan added a commit to michaelharrigan/exo that referenced this pull request on Mar 5, 2026:
…actor, regressing from exo-explore#1262.
Motivation
OpenCode sends very large prompts, most of which are repeated on the next call.
Changes
Add prefix caching, reducing average prefill time (in testing) from 40 seconds to 4. This massively improves user experience.
Entries are also evicted from the prefix cache in an LRU-style manner.
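The LRU-style eviction can be sketched with an `OrderedDict`. The class and names here are assumptions for illustration, not exo's actual KVPrefixCache implementation.

```python
from collections import OrderedDict


class KVPrefixCache:
    """Minimal sketch: map token prefixes to cached KV snapshots, LRU-evicted."""

    def __init__(self, max_entries: int = 4):
        self.max_entries = max_entries
        self._entries: OrderedDict[tuple[int, ...], object] = OrderedDict()

    def put(self, tokens: list[int], kv_snapshot: object) -> None:
        key = tuple(tokens)
        self._entries[key] = kv_snapshot
        self._entries.move_to_end(key)          # mark as most recently used
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)   # evict least recently used

    def get(self, tokens: list[int]):
        key = tuple(tokens)
        if key in self._entries:
            self._entries.move_to_end(key)      # a hit refreshes recency
            return self._entries[key]
        return None
```

Bounding the entry count matters because each KV snapshot holds attention state for the whole prefix, which is large in memory.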
Why It Works
We no longer prefill repeatedly; instead we reuse the KV cache stored in memory. A future update could spill the prefix cache to persistent storage to make it larger.
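The core idea can be illustrated with a prefix-match helper: find the cached entry sharing the longest token prefix with the new prompt, then prefill only the unmatched tail. These helper names are mine, not exo's.

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    # Length of the shared leading run of tokens.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def split_for_prefill(
    prompt: list[int], cached_prefixes: list[list[int]]
) -> tuple[list[int], list[int]]:
    best = max(cached_prefixes, key=lambda c: common_prefix_len(prompt, c), default=[])
    matched = common_prefix_len(prompt, best)
    if matched == len(prompt):
        # Exact match: keep one token to feed the generator (mirrors the
        # max_tokens=1-with-trim workaround mentioned in the commits above).
        matched -= 1
    # (tokens covered by the cache, tokens that still need prefill)
    return prompt[:matched], prompt[matched:]
```

When a large OpenCode prompt repeats with only a short addition at the end, the second element of the returned tuple is just that addition, which is why prefill time drops so sharply.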
Test Plan
Manual Testing
Tested speedup on OpenCode
Automated Testing
Added a lot of tests