Thread local generation stream #1090
Merged
3 tasks
andresy reviewed on Apr 2, 2026
  nonlocal tokens
- with mx.stream(generation_stream):
+ with mx.stream(generation_stream.stream):
Member
Author
Yeah good point. I guess I was being lazy yesterday. Moved ThreadLocalStream to C++ and made it implicitly convert to stream. So now we can use it wherever we would use a stream and that looks exactly like before.
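To illustrate the idea in the comment above, here is a minimal pure-Python sketch of a handle that resolves to a separate object in each thread. It is only an analogy built on `threading.local`, not the C++ `ThreadLocalStream` from this PR; `ThreadLocalHandle`, `make_fake_stream`, and `collect_owner` are hypothetical names, and the "stream" is a plain dictionary.

```python
import threading


class ThreadLocalHandle:
    """Hypothetical stand-in: resolves to a distinct per-thread object."""

    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()

    def get(self):
        # Create the per-thread resource lazily, the first time a thread asks.
        if not hasattr(self._local, "value"):
            self._local.value = self._factory()
        return self._local.value


def make_fake_stream():
    # Placeholder for a real stream; records which thread created it.
    return {"owner": threading.get_ident()}


# One shared module-level handle, as with a global generation stream.
generation_stream = ThreadLocalHandle(make_fake_stream)


def collect_owner(results, key):
    results[key] = generation_stream.get()


results = {}
worker = threading.Thread(target=collect_owner, args=(results, "worker"))
worker.start()
worker.join()
collect_owner(results, "main")
```

Each thread that touches the shared handle gets its own object, so callers can keep writing code against a single global name without sharing the underlying resource across threads.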
andresy approved these changes on Apr 4, 2026
95d688e to e69a53d
e69a53d to 615e63b
615e63b to dd64c60
4 tasks
rcourtman pushed a commit to rcourtman/parakey that referenced this pull request on May 11, 2026
Got the original 'warm during speak' design working after all. The prior blocker (MLX raising 'There is no Stream(gpu, 0) in current thread' when a background thread did model.generate()) turns out to be the same bug Apple's mlx-lm hit and fixed in their server (ml-explore/mlx-lm#1090). The fix is architectural: route ALL generate() calls through one dedicated thread, and that thread must load the model itself (a model can't be safely used from a different thread than the one it was loaded on).

inference_worker.py: new module. InferenceWorker is a daemon thread that:
* Creates its own mx.new_thread_local_stream(mx.gpu).
* Loads the parakeet model in run(), not on main.
* Pulls tasks (warmup / transcribe) from a queue and dispatches them inside .
* Exposes wait_ready(), submit_warmup(), submit_transcribe(), shutdown(). Used by parakey.py for everything inference-related.
* Takes a injection point so unit tests can mock MLX out without monkey-patching.

parakey.py: refactored.
* _initialize() no longer loads the model. It blocks on worker.wait_ready() and forwards the worker's status callbacks to the menu's load_status text.
* _start_recording() submits a parallel warmup via the worker if WarmupGate's try_begin_warmup() says we're cold. The warmup runs on the worker thread WHILE the user is speaking, so by release time the GPU is hot.
* _transcribe_and_paste() submits the real transcribe to the worker and waits via a callback. The worker's FIFO queue naturally serializes warmup → transcribe, with no explicit lock.
* Menu status reads (the worker is now the source of truth for 'is a warmup in flight').
* _inline_rewarm() removed; superseded by the parallel approach.

bench_idle.py: updated to drive a real InferenceWorker. Shows the new ceiling:
* cold first generate: ~1.3s
* steady-state (warm): ~160ms
* parallel-warmup pattern: ~150ms post-release ← essentially equal to steady-state

tests/test_inference_worker.py: 6 unit tests with fakes for MLX, covering:
* Model loads on the worker thread (not main).
* wait_ready() blocks until load + startup warmup complete.
* Load failures are surfaced via .error AND still set _ready.
* Tasks run FIFO; transcribe waits for in-flight warmup.
* warming flag is True during warmup and only that.
* Task errors don't kill the worker.
* shutdown() drains the queue before exiting.

All 20 tests (14 WarmupGate + 6 InferenceWorker) pass in ~270ms, and continue to run on Linux CI without MLX installed.
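The single-worker pattern described in this commit message can be sketched roughly as follows. This is not the actual inference_worker.py: it omits MLX entirely (no stream creation, no real model), and the generic `submit(fn, callback)` method, the `load_model` injection parameter, and the `(result, error)` callback signature are assumptions made for illustration. Only the names InferenceWorker, wait_ready(), shutdown(), `.error`, and `_ready` come from the message.

```python
import queue
import threading


class InferenceWorker(threading.Thread):
    """Sketch: one daemon thread owns the model and runs tasks in FIFO order."""

    _STOP = object()  # sentinel queued by shutdown()

    def __init__(self, load_model):
        super().__init__(daemon=True)
        self._load_model = load_model  # injected so tests can pass a fake
        self._tasks = queue.Queue()
        self._ready = threading.Event()
        self.error = None
        self.model = None

    def run(self):
        try:
            # Load on the worker thread, not on the caller's thread.
            self.model = self._load_model()
        except Exception as exc:
            self.error = exc
        finally:
            self._ready.set()  # set even on failure so waiters never hang
        if self.error is not None:
            return
        while True:
            task = self._tasks.get()
            if task is self._STOP:
                break
            fn, callback = task
            try:
                result, err = fn(self.model), None
            except Exception as exc:  # a failing task must not kill the worker
                result, err = None, exc
            callback(result, err)

    def wait_ready(self, timeout=None):
        return self._ready.wait(timeout)

    def submit(self, fn, callback):
        self._tasks.put((fn, callback))

    def shutdown(self):
        # The sentinel sits behind pending tasks, so they drain first.
        self._tasks.put(self._STOP)
        self.join()


def fake_load():
    # Fake model that remembers which thread loaded it.
    return {"loaded_on": threading.get_ident()}


log = []
worker = InferenceWorker(fake_load)
worker.start()
worker.wait_ready()
worker.submit(lambda m: "warmup", lambda r, e: log.append((r, e)))
worker.submit(lambda m: 1 / 0, lambda r, e: log.append((r, type(e).__name__)))
worker.submit(lambda m: m["loaded_on"], lambda r, e: log.append((r, e)))
worker.shutdown()
```

Because every task passes through one queue consumed by one thread, ordering between a warmup and a later transcribe falls out of the FIFO discipline rather than from a lock, which matches the serialization the message describes.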
This PR requires #3355 from MLX core.