Thread-local generation stream (port mlx-lm#1090) by Blaizzy · Pull Request #1050 · Blaizzy/mlx-vlm

Blaizzy · 2026-04-22T23:55:56Z

Summary

Ports the thread-local generation stream changes from mlx-lm#1090 into mlx-vlm.
Module-level generation_stream switched to mx.new_thread_local_stream(mx.default_device()).
BatchGenerator now accepts a stream= kwarg (other args made keyword-only) and routes wired_limit, remove(), and next() through self._stream; exposes a .stream property.
server.py: drops the module-level import and creates a local mx.default_stream(mx.default_device()) inside _run() and _run_speculative(), passing it to BatchGenerator(stream=...) so generation and synchronization run on the generator thread's default stream.
Bumps mlx>=0.31.2 and mlx-lm>=0.31.3 in requirements.txt (new_thread_local_stream requires MLX core 0.31.2).
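The pattern described above can be illustrated without MLX. The following is a minimal stdlib-only sketch, not the code from this PR: `FakeStream`, `ThreadLocalStream`, and `SketchBatchGenerator` are hypothetical stand-ins, and `threading.local` is used here only as an analogy for what `mx.new_thread_local_stream` is described as providing. It shows a module-level handle that resolves to a per-thread stream, plus a generator that takes `stream=` as a keyword-only argument and exposes it through a `.stream` property.

```python
import threading


class FakeStream:
    """Hypothetical stand-in for an MLX stream; records its creating thread."""

    def __init__(self):
        self.owner = threading.get_ident()


class ThreadLocalStream:
    """Hands each thread its own FakeStream, created lazily on first use."""

    def __init__(self):
        self._local = threading.local()

    def get(self):
        if not hasattr(self._local, "stream"):
            self._local.stream = FakeStream()
        return self._local.stream


# Module-level handle, analogous in spirit to generation_stream.
generation_stream = ThreadLocalStream()


class SketchBatchGenerator:
    """Hypothetical generator: arguments after model/processor are keyword-only."""

    def __init__(self, model, processor, *, stream=None, max_tokens=128):
        self.model = model
        self.processor = processor
        self.max_tokens = max_tokens
        # Fall back to the calling thread's stream when none is passed in.
        self._stream = stream if stream is not None else generation_stream.get()

    @property
    def stream(self):
        return self._stream


def run_in_worker():
    """Mimic a server thread: resolve the stream locally, then pass it in."""
    result = {}

    def _run():
        local_stream = generation_stream.get()
        gen = SketchBatchGenerator("model", "processor", stream=local_stream)
        # The stream the generator uses was created on this same thread.
        result["owned_by_worker"] = gen.stream.owner == threading.get_ident()

    worker = threading.Thread(target=_run)
    worker.start()
    worker.join()
    return result["owned_by_worker"]
```

Calling `run_in_worker()` returns `True` in this sketch because the stream is resolved inside the worker thread rather than captured from the importing thread, which is the property the summary attributes to the real change.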

Test plan

python -c "from mlx_vlm import generate, server" imports cleanly on mlx 0.31.2.
Run pytest mlx_vlm/tests/test_generate.py mlx_vlm/tests/test_batch_quantized_cache.py — BatchGenerator call sites already use kwargs past model, processor.
Start mlx_vlm.server and issue a multi-request load; confirm generation still completes and stays on a single stream.
Exercise the speculative path (_run_speculative) with a draft model if available.
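The multi-request step in the plan above could be driven by a small harness like the one below. This is a sketch under stated assumptions: `send` is a hypothetical callable that would wrap whatever HTTP request the running server accepts (no endpoint or payload shape is assumed here), and `fake_send` exists only so the harness can be exercised without a server.

```python
from concurrent.futures import ThreadPoolExecutor


def run_load(send, prompts, max_workers=4):
    """Call send(prompt) concurrently; return (prompt, reply, error) in prompt order."""

    def _call(prompt):
        try:
            return prompt, send(prompt), None
        except Exception as exc:  # keep going so every failure is reported
            return prompt, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_call, prompts))


def fake_send(prompt):
    """Placeholder for a real request to the server (hypothetical)."""
    if not prompt:
        raise ValueError("empty prompt")
    return f"echo: {prompt}"


def all_completed(results):
    """True when every request returned a reply without raising."""
    return all(error is None for _, _, error in results)
```

Swapping `fake_send` for a real client function and checking `all_completed(run_load(...))` gives a quick pass/fail signal for "generation still completes" under concurrent requests; it does not by itself verify which stream the server used.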

🤖 Generated with Claude Code

Switch generation_stream to mx.new_thread_local_stream and let BatchGenerator accept a stream= kwarg, so the server can pass the generator thread's default stream explicitly. Keeps generation and synchronization on the same stream. Requires mlx>=0.31.2 (for mx.new_thread_local_stream).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Updated ResponseGenerator to load model resources in a dedicated thread, improving responsiveness.
- Introduced a wait_until_ready method to ensure the model is fully loaded before generating responses.
- Added error handling for model loading failures, allowing for graceful degradation.
- Removed direct model loading from get_cached_model, streamlining the initialization process.

This change enhances the overall architecture by decoupling model loading from response generation, ensuring better performance and reliability.

Merge changes from upstream:
- Blaizzy#1056: hunyuan_vl/gemma3n cache-offset optimization
- Blaizzy#1053: Fix DFlash speculative decoding (GPU hang, performance)
- Blaizzy#1050: Thread-local generation stream (port mlx-lm#1090)
- Blaizzy#1055: Close batch_generate/server decode gap + VLM fixes

Conflict resolution:
- requirements.txt: Mixed approach: mlx>=0.31.2 with transformers<5.4.0 to maintain omlx compatibility while accepting mlx update

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…MoE pos fix

Absorb upstream/main (4-19 to 4-25 batch).

All of the fork's core assets are preserved: MTP (ported from mlx-lm, Qwen3.5 dense+MoE), PrefixCache hybrid, server hardening (MLX_MEMORY_LIMIT_GB env, /v1/status, /v1/models including loaded models, model pinning, busy tracking, GC threshold, last_request, removal of the OOM-prone startup warmup), server-side thinking strip with incremental streaming, and the null tool_calls guard.

Absorbed from upstream: continuous batching server (Blaizzy#1027), DFlash speculative decoding (Blaizzy#1029, Blaizzy#1053 fix), thread-local generation stream (Blaizzy#1050, with a hasattr guard for mlx<0.32), batch_generate/server VLM fixes (Blaizzy#1055), Qwen3.5/3.6 MoE stale position IDs + gdn_sink compatibility (Blaizzy#1040), tool-call markup strip (Blaizzy#1037), KV cache quantization (Blaizzy#1030), Qwen2-3.5 VL torch-free video processors (Blaizzy#1048), Gemma4 LoRA NaN/freeze fix (Blaizzy#1052), Gemma4 video, Youtu-VL, distributed inference, and others.

Conflict-resolution principles: the fork's MTP n_confirmed and upstream's gdn_sink coexist in the same function through an extended signature. The fork branched before Blaizzy#1029 (DFlash) was introduced, so the gdn_sink body logic is inactive in our models (None is passed); the signature still accepts it to maintain compatibility. When cached position_ids are reused, the fork's ">= cache_offset + seq_length" check covers the Blaizzy#1040 fix more precisely. The LanguageModelOutput.hidden_states/gdn_states fields are compatible with the upstream additions.

Verification: syntax + import OK for 4 files. Compatibility with mlx 0.31.0 confirmed on an M3 with 96GB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a stable LiteLLM model_name (mac/gemma4-vision) that downstream apps target without coupling to the underlying engine. Currently routes to ollama/gemma4:26b on the mac node; once the upstream mlx-vlm threading bug is resolved (mlx-vlm#1134, vllm-mlx#496) the alias swaps to vLLM-MLX without touching consumers.

Schema change: models.yaml entries (both ollama models and vllm instances) now accept an `extra_aliases: [...]` list of additional LiteLLM model_names that route to the same backend. gen-litellm emits one entry per primary alias plus one per extra. Documented in the models.yaml header and in the disabled gemma4 vllm-mlx block (which carries the swap instructions for when upstream lands a fix).

The disabled vllm-mlx gemma4 comment now points at:
- Blaizzy/mlx-vlm#1134
- Blaizzy/mlx-vlm#1050
- waybarrios/vllm-mlx#496

Blaizzy mentioned this pull request Apr 23, 2026

fix: Use thread-local generation stream #1051

Closed

Blaizzy linked an issue Apr 23, 2026 that may be closed by this pull request

Crash on mlx 0.31.2: 'There is no Stream(gpu, N) in current thread' when generate() runs in a worker thread #1049

Closed

Merge branch 'main' into pc/thread-local-generation-stream

b1df54d

Blaizzy merged commit 728fab1 into main Apr 24, 2026
1 check passed

elonen mentioned this pull request Apr 29, 2026

Issue with some models: "There is no Stream(gpu, 1) in current thread." cubist38/mlx-openai-server#290

Closed

Blaizzy merged 3 commits into main from pc/thread-local-generation-stream
