Thread local generation stream by angeloskath · Pull Request #1090 · ml-explore/mlx-lm

angeloskath · 2026-04-02T09:51:16Z

This PR requires #3355 from MLX core.

Uses a thread local stream for the generation stream (see "Add a convenience for making local streams in python", mlx#3355)
Allows setting the stream to use in the batch generator
Refactors the model provider to load the default model in the generation thread
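
As a rough illustration of what "thread local" means for the generation stream, here is a minimal sketch using only Python's standard library. `FakeStream` and `ThreadLocalResource` are hypothetical names invented for this example; they are not MLX's API and not the implementation in this PR. The sketch only shows the general idea that each thread lazily gets its own object instead of sharing one created at import time.

```python
import threading


class FakeStream:
    """Hypothetical stand-in for a GPU stream; records the creating thread."""

    def __init__(self):
        self.owner = threading.get_ident()


class ThreadLocalResource:
    """Hypothetical helper: lazily creates one resource per thread."""

    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()

    def get(self):
        # Each thread sees its own attribute namespace on threading.local().
        if not hasattr(self._local, "value"):
            self._local.value = self._factory()
        return self._local.value


# Module-level object, but the underlying "stream" is created per thread.
generation_stream = ThreadLocalResource(FakeStream)


def owner_seen_by_new_thread():
    """Return the stream owner and thread id observed inside a new thread."""
    result = {}

    def target():
        result["owner"] = generation_stream.get().owner
        result["ident"] = threading.get_ident()

    t = threading.Thread(target=target)
    t.start()
    t.join()
    return result
```

In this sketch, calling `generation_stream.get()` twice on the same thread returns the same object, while a worker thread gets a separate one that it created itself.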

andresy · 2026-04-02T15:54:10Z

        nonlocal tokens

-        with mx.stream(generation_stream):
+        with mx.stream(generation_stream.stream):


not sure about the naming?

Yeah good point. I guess I was being lazy yesterday. Moved ThreadLocalStream to C++ and made it implicitly convert to stream. So now we can use it wherever we would use a stream and that looks exactly like before.

Got the original 'warm during speak' design working after all. The prior blocker (MLX raising 'There is no Stream(gpu, 0) in current thread' when a background thread did model.generate()) turns out to be the same bug Apple's mlx-lm hit and fixed in their server (ml-explore/mlx-lm#1090). The fix is architectural: route ALL generate() calls through one dedicated thread, and that thread must load the model itself (a model can't be safely used from a different thread than the one it was loaded on).

inference_worker.py: new module. InferenceWorker is a daemon thread that:
* Creates its own mx.new_thread_local_stream(mx.gpu).
* Loads the parakeet model in run(), not on main.
* Pulls tasks (warmup / transcribe) from a queue and dispatches them inside .
* Exposes wait_ready(), submit_warmup(), submit_transcribe(), shutdown(). Used by parakey.py for everything inference-related.
* Takes a injection point so unit tests can mock MLX out without monkey-patching.

parakey.py: refactored:
* _initialize() no longer loads the model. It blocks on worker.wait_ready() and forwards the worker's status callbacks to the menu's load_status text.
* _start_recording() submits a parallel warmup via the worker if WarmupGate's try_begin_warmup() says we're cold. The warmup runs on the worker thread WHILE the user is speaking, so by release time the GPU is hot.
* _transcribe_and_paste() submits the real transcribe to the worker and waits via a callback. The worker's FIFO queue naturally serializes warmup → transcribe, with no explicit lock.
* Menu status reads (the worker is now the source of truth for 'is a warmup in flight').
* _inline_rewarm() removed, superseded by the parallel approach.

bench_idle.py: updated to drive a real InferenceWorker. Shows the new ceiling:
* cold first generate: ~1.3s
* steady-state (warm): ~160ms
* parallel-warmup pattern: ~150ms post-release, essentially equal to steady-state

tests/test_inference_worker.py: 6 unit tests with fakes for MLX, covering:
* Model loads on the worker thread (not main).
* wait_ready() blocks until load + startup warmup complete.
* Load failures are surfaced via .error AND still set _ready.
* Tasks run FIFO; transcribe waits for in-flight warmup.
* warming flag is True during warmup and only that.
* Task errors don't kill the worker.
* shutdown() drains the queue before exiting.

All 20 tests (14 WarmupGate + 6 InferenceWorker) pass in ~270ms, and continue to run on Linux CI without MLX installed.
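
The single-worker pattern described above can be sketched with the standard library alone. This is an assumption-laden illustration, not the actual inference_worker.py: the `load_model` parameter, the generic `submit()` method (standing in for submit_warmup() and submit_transcribe()), and the callback signature are all hypothetical, and no MLX calls are made so the loader can be faked.

```python
import queue
import threading


class InferenceWorker(threading.Thread):
    """Sketch: a daemon thread that loads the model itself and runs tasks FIFO."""

    _STOP = object()  # sentinel used to end the task loop

    def __init__(self, load_model):
        super().__init__(daemon=True)
        self._load_model = load_model  # hypothetical injection point for tests
        self._tasks = queue.Queue()
        self._ready = threading.Event()
        self.error = None
        self.model = None

    def run(self):
        try:
            # The model is loaded on the worker thread, not on main.
            self.model = self._load_model()
        except Exception as exc:
            self.error = exc
        finally:
            # Set even on failure so wait_ready() callers always wake up.
            self._ready.set()
        if self.error is not None:
            return
        while True:
            item = self._tasks.get()
            if item is self._STOP:
                break
            fn, callback = item
            try:
                result, err = fn(self.model), None
            except Exception as exc:
                # A failing task is reported to its callback, not re-raised.
                result, err = None, exc
            callback(result, err)

    def wait_ready(self, timeout=None):
        return self._ready.wait(timeout)

    def submit(self, fn, callback):
        self._tasks.put((fn, callback))

    def shutdown(self):
        # The sentinel queues behind pending tasks, so they drain first.
        self._tasks.put(self._STOP)
        self.join()
```

Because every task goes through one `queue.Queue`, a warmup submitted before a transcribe always finishes first, which is why the sketch needs no explicit lock. A callback that raises would still end the loop in this version; a fuller implementation would likely guard that too.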

angeloskath requested a review from andresy April 2, 2026 09:51

angeloskath mentioned this pull request Apr 2, 2026

fix: make generation_stream thread-local (MLX #3348 compat) #1088

Closed

3 tasks

andresy reviewed Apr 2, 2026

andresy approved these changes Apr 4, 2026

angeloskath force-pushed the generation-stream branch from 95d688e to e69a53d April 4, 2026 12:22

angeloskath force-pushed the generation-stream branch from e69a53d to 615e63b April 20, 2026 06:15

angeloskath added 7 commits April 21, 2026 18:03

Local stream in generate.py

1421291

Use the default thread local stream in the server

92c8642

Fix omission of is_batchable

36a807a

Update to the better ThreadLocalStream

ba755fb

Remove unused import

1cc5312

Remove import

2adc388

Switch to the new thread local stream

dd64c60

angeloskath force-pushed the generation-stream branch from 615e63b to dd64c60 April 22, 2026 01:03

angeloskath added 2 commits April 21, 2026 18:03

Update the minimum MLX version

8ac2592

Add load_default in test

60efab6

angeloskath mentioned this pull request Apr 22, 2026

Server crashes with 'There is no Stream(gpu, 0) in current thread' on mlx 0.31.2 #1179

Closed

Fix gguf test

6ccada8

angeloskath merged commit ed1fca4 into main Apr 22, 2026
2 checks passed

angeloskath deleted the generation-stream branch April 22, 2026 07:34

angeloskath mentioned this pull request Apr 22, 2026

Fix BatchGenerator crash on worker threads with mlx 0.31.2 #1182

Closed

csheaff mentioned this pull request Apr 22, 2026

Crash on mlx 0.31.2: 'There is no Stream(gpu, N) in current thread' when generate() runs in a worker thread Blaizzy/mlx-vlm#1049

Closed

Blaizzy mentioned this pull request Apr 22, 2026

Thread-local generation stream (port mlx-lm#1090) Blaizzy/mlx-vlm#1050

Merged

4 tasks

spicyneuron mentioned this pull request Apr 22, 2026

fix: Use thread-local generation stream Blaizzy/mlx-vlm#1051

Closed

jackneil mentioned this pull request Apr 29, 2026

fix(stream): port upstream PR #421 — run scheduler.step on event-loop thread jackneil/vllm-mlx-patched#39

Merged

4 tasks

sangemaru mentioned this pull request May 7, 2026

mlx_lm.server crashes with 'There is no Stream(gpu, 1) in current thread' on sliding-window models (mlx-lm 0.31.3) #1256

Open

nish2292 mentioned this pull request May 14, 2026

fix: make generation_stream per-thread to fix server crash on worker threads #1275

Open

4 tasks

angeloskath merged 10 commits into main from generation-stream