
Thread local generation stream#1090

Merged
angeloskath merged 10 commits into main from generation-stream on Apr 22, 2026

@angeloskath

Member

This PR requires #3355 from MLX core.

Comment thread mlx_lm/generate.py Outdated
  nonlocal tokens

- with mx.stream(generation_stream):
+ with mx.stream(generation_stream.stream):

Member

not sure about the naming?

Member Author

Yeah good point. I guess I was being lazy yesterday. Moved ThreadLocalStream to C++ and made it implicitly convert to stream. So now we can use it wherever we would use a stream and that looks exactly like before.
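
To illustrate the idea of a thread-local stream, here is a minimal pure-Python sketch. It is not the MLX implementation (which, per the comment above, lives in C++): `FakeStream`, the `factory` argument, and `stream_in_new_thread` are hypothetical stand-ins, and the sketch exposes the per-thread stream through an explicit `.stream` property rather than the implicit conversion described above.

```python
import threading


class FakeStream:
    """Hypothetical stand-in for a device stream; records its creating thread."""

    def __init__(self):
        self.owner = threading.get_ident()


class ThreadLocalStream:
    """Hands each thread its own lazily created stream."""

    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()

    @property
    def stream(self):
        # Create the stream on first use in the calling thread.
        if not hasattr(self._local, "stream"):
            self._local.stream = self._factory()
        return self._local.stream


generation_stream = ThreadLocalStream(FakeStream)


def stream_in_new_thread():
    """Return the stream object that a freshly started thread observes."""
    seen = []
    t = threading.Thread(target=lambda: seen.append(generation_stream.stream))
    t.start()
    t.join()
    return seen[0]
```

Repeated access from one thread returns the same object, while a second thread gets a distinct stream of its own.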

@angeloskath angeloskath merged commit ed1fca4 into main Apr 22, 2026
2 checks passed
@angeloskath angeloskath deleted the generation-stream branch April 22, 2026 07:34
rcourtman pushed a commit to rcourtman/parakey that referenced this pull request May 11, 2026
Got the original 'warm during speak' design working after all. The
prior blocker — MLX raising 'There is no Stream(gpu, 0) in current
thread' when a background thread did model.generate() — turns out to
be the same bug Apple's mlx-lm hit and fixed in their server
(ml-explore/mlx-lm#1090). The fix is architectural: route ALL
generate() calls through one dedicated thread, and that thread must
load the model itself (a model can't be safely used from a different
thread than the one it was loaded on).

inference_worker.py — new module. InferenceWorker is a daemon thread
that:

  * Creates its own mx.new_thread_local_stream(mx.gpu).
  * Loads the parakeet model in run(), not on main.
  * Pulls tasks (warmup / transcribe) from a queue and dispatches
    them inside .
  * Exposes wait_ready(), submit_warmup(), submit_transcribe(),
    shutdown(). Used by parakey.py for everything inference-related.
  * Takes a  injection
    point so unit tests can mock MLX out without monkey-patching.
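
The worker described above could look roughly like the following sketch. It keeps the method names from the list but simplifies the rest: the `load_model` callable is an assumed shape for the injection point, the model is treated as a plain callable, the stream setup and the startup warmup are omitted, and task errors are passed to the callback as exception objects.

```python
import queue
import threading


class InferenceWorker(threading.Thread):
    """Owns the model; every inference call runs on this one thread."""

    def __init__(self, load_model):
        super().__init__(daemon=True)
        self._load_model = load_model  # injection point (assumed shape)
        self._tasks = queue.Queue()
        self._ready = threading.Event()
        self.error = None
        self.warming = False

    def run(self):
        try:
            model = self._load_model()  # load on the worker, not on main
        except Exception as exc:
            self.error = exc
            self._ready.set()  # unblock waiters even when loading failed
            return
        self._ready.set()
        while True:
            task = self._tasks.get()
            if task is None:  # shutdown sentinel, queued behind pending work
                break
            kind, payload, callback = task
            self.warming = kind == "warmup"
            try:
                outcome = model(payload)
            except Exception as exc:  # a failing model call must not kill the worker
                outcome = exc
            finally:
                self.warming = False
            if callback is not None:
                callback(outcome)

    def wait_ready(self, timeout=None):
        return self._ready.wait(timeout)

    def submit_warmup(self):
        self._tasks.put(("warmup", "", None))

    def submit_transcribe(self, audio, callback):
        self._tasks.put(("transcribe", audio, callback))

    def shutdown(self):
        self._tasks.put(None)  # FIFO order drains earlier tasks first
        self.join()
```

Because the queue is FIFO and there is a single consumer, a transcribe submitted after a warmup cannot start until the warmup has finished.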

parakey.py — refactored:

  * _initialize() no longer loads the model. It blocks on
    worker.wait_ready() and forwards the worker's status callbacks
    to the menu's load_status text.
  * _start_recording() submits a parallel warmup via the worker if
    WarmupGate's try_begin_warmup() says we're cold. The warmup runs
    on the worker thread WHILE the user is speaking, so by release
    time the GPU is hot.
  * _transcribe_and_paste() submits the real transcribe to the
    worker and waits via a callback. The worker's FIFO queue
    naturally serializes warmup → transcribe — no explicit lock.
  * Menu status reads  (the worker is now the source
    of truth for 'is a warmup in flight').
  * _inline_rewarm() removed — superseded by the parallel approach.
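
The commit message does not show WarmupGate itself, so the following is only a guess at the kind of logic behind try_begin_warmup(). The idle threshold `cold_after_s`, the `mark_hot()` method, and the injectable `clock` are all assumptions made for illustration.

```python
import time


class WarmupGate:
    """Decides whether the GPU is likely cold and a warmup is worthwhile."""

    def __init__(self, cold_after_s=30.0, clock=time.monotonic):
        self._cold_after_s = cold_after_s  # hypothetical idle threshold
        self._clock = clock                # injectable so tests can fake time
        self._last_hot = None              # None means never warmed

    def mark_hot(self):
        # Call after any real inference finishes.
        self._last_hot = self._clock()

    def try_begin_warmup(self):
        """Return True if the caller should submit a warmup now."""
        now = self._clock()
        cold = self._last_hot is None or now - self._last_hot >= self._cold_after_s
        if cold:
            # Record the attempt so an immediate second call is not also cold.
            self._last_hot = now
        return cold
```

In this sketch, _start_recording() would submit a warmup to the worker only when `try_begin_warmup()` returns True.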

bench_idle.py — updated to drive a real InferenceWorker. Shows the
new ceiling:

  * cold first generate:        ~1.3s
  * steady-state (warm):        ~160ms
  * parallel-warmup pattern:    ~150ms post-release  ← essentially
                                                       equal to
                                                       steady-state

tests/test_inference_worker.py — 6 unit tests with fakes for MLX,
covering:

  * Model loads on the worker thread (not main).
  * wait_ready() blocks until load + startup warmup complete.
  * Load failures are surfaced via .error AND still set _ready.
  * Tasks run FIFO; transcribe waits for in-flight warmup.
  * warming flag is True during warmup and only that.
  * Task errors don't kill the worker.
  * shutdown() drains the queue before exiting.

All 20 tests (14 WarmupGate + 6 InferenceWorker) pass in ~270ms,
and continue to run on Linux CI without MLX installed.