Thread-local generation stream (port mlx-lm#1090) #1050

Merged
Blaizzy merged 3 commits into main from pc/thread-local-generation-stream
Apr 24, 2026

Conversation

@Blaizzy Blaizzy commented Apr 22, 2026

Owner

Summary

  • Ports the thread-local generation stream changes from mlx-lm#1090 into mlx-vlm.
  • Module-level generation_stream switched to mx.new_thread_local_stream(mx.default_device()).
  • BatchGenerator now accepts a stream= kwarg (other args made keyword-only) and routes wired_limit, remove(), and next() through self._stream; exposes a .stream property.
  • server.py: drops the module-level import and creates a local mx.default_stream(mx.default_device()) inside _run() and _run_speculative(), passing it to BatchGenerator(stream=...) so generation and synchronization run on the generator thread's default stream.
  • Bumps mlx>=0.31.2 and mlx-lm>=0.31.3 in requirements.txt (new_thread_local_stream requires MLX core 0.31.2).
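
A minimal sketch of the constructor shape described above, using pure-Python stand-ins so it runs without MLX. `ToyStream`, `thread_default_stream`, and `ToyBatchGenerator` are hypothetical names for illustration, not the mlx-vlm implementation; the only details taken from the summary are the keyword-only `stream=` argument, the internal `self._stream`, and the `.stream` property.

```python
import threading


class ToyStream:
    """Stand-in for an MLX stream; records which thread created it."""

    def __init__(self):
        self.owner = threading.get_ident()


_local = threading.local()


def thread_default_stream():
    """Return the calling thread's stream, creating it on first use."""
    if not hasattr(_local, "stream"):
        _local.stream = ToyStream()
    return _local.stream


class ToyBatchGenerator:
    # Everything after `processor` is keyword-only, mirroring the summary.
    def __init__(self, model, processor, *, stream=None, max_tokens=128):
        self.model = model
        self.processor = processor
        self.max_tokens = max_tokens
        # Assumption: fall back to the current thread's stream if none is given.
        self._stream = stream if stream is not None else thread_default_stream()

    @property
    def stream(self):
        return self._stream


def run_worker(results):
    # The worker creates the stream on its own thread and passes it in,
    # so generation and synchronization would share one stream.
    stream = thread_default_stream()
    gen = ToyBatchGenerator("model", "processor", stream=stream)
    results["same_thread"] = gen.stream.owner == threading.get_ident()
    results["stream"] = gen.stream


results = {}
worker = threading.Thread(target=run_worker, args=(results,))
worker.start()
worker.join()
main_stream = thread_default_stream()
```

In this toy, the worker's stream is a different object from the main thread's stream, which is the property the server change relies on when it builds the stream inside `_run()` rather than importing a module-level one.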

Test plan

  • python -c "from mlx_vlm import generate, server" imports cleanly on mlx 0.31.2.
  • Run pytest mlx_vlm/tests/test_generate.py mlx_vlm/tests/test_batch_quantized_cache.py — BatchGenerator call sites already use kwargs past model, processor.
  • Start mlx_vlm.server and issue a multi-request load; confirm generation still completes and stays on a single stream.
  • Exercise the speculative path (_run_speculative) with a draft model if available.
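
Before running the plan above, it may help to confirm the installed packages meet the new minimums. The following is a rough stdlib-only sketch; `parse_version`, `meets_minimum`, and `check_installed` are hypothetical helpers, and the parser only handles the leading dotted numeric part of a version string (pre-release suffixes are ignored).

```python
from importlib import metadata


def parse_version(text):
    """Parse the leading dotted numeric part, e.g. '0.31.2' -> (0, 31, 2)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if `installed` is at least `minimum` (numeric components only)."""
    return parse_version(installed) >= parse_version(minimum)


def check_installed(package, minimum):
    """Return True/False for an installed package, or None if it is missing."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return meets_minimum(installed, minimum)


# Minimums taken from the requirements.txt bump in this PR.
for package, minimum in (("mlx", "0.31.2"), ("mlx-lm", "0.31.3")):
    print(package, check_installed(package, minimum))
```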

🤖 Generated with Claude Code

Switch generation_stream to mx.new_thread_local_stream and let
BatchGenerator accept a stream= kwarg, so the server can pass the
generator thread's default stream explicitly. Keeps generation and
synchronization on the same stream.

Requires mlx>=0.31.2 (for mx.new_thread_local_stream).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Updated ResponseGenerator to load model resources in a dedicated thread, improving responsiveness.
- Introduced a wait_until_ready method to ensure the model is fully loaded before generating responses.
- Added error handling for model loading failures, allowing for graceful degradation.
- Removed direct model loading from get_cached_model, streamlining the initialization process.

This change enhances the overall architecture by decoupling model loading from response generation, ensuring better performance and reliability.
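
The loading pattern in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the actual `ResponseGenerator`: the class name `ToyResponseGenerator`, the `loader` callable, and the choice to re-raise load failures from `wait_until_ready` are all hypothetical.

```python
import threading


class ToyResponseGenerator:
    """Loads a model on a dedicated thread and lets callers wait for it."""

    def __init__(self, loader):
        self._ready = threading.Event()
        self._error = None
        self._model = None
        self._thread = threading.Thread(
            target=self._load, args=(loader,), daemon=True
        )
        self._thread.start()

    def _load(self, loader):
        try:
            self._model = loader()
        except Exception as exc:
            # Keep the failure so waiters can report it instead of hanging.
            self._error = exc
        finally:
            self._ready.set()

    def wait_until_ready(self, timeout=None):
        """Block until loading finishes; raise if it failed or timed out."""
        if not self._ready.wait(timeout):
            raise TimeoutError("model did not finish loading in time")
        if self._error is not None:
            raise RuntimeError("model loading failed") from self._error
        return self._model


def failing_loader():
    raise ValueError("weights not found")


ok = ToyResponseGenerator(lambda: "fake-model")
model = ok.wait_until_ready(timeout=5)

broken = ToyResponseGenerator(failing_loader)
try:
    broken.wait_until_ready(timeout=5)
    load_failed = False
except RuntimeError:
    load_failed = True
```

Setting the event in `finally` means a failed load still wakes any waiting caller, which is one way to get the graceful degradation the commit message describes.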
@Blaizzy Blaizzy merged commit 728fab1 into main Apr 24, 2026
1 check passed
afanty2021 added a commit to afanty2021/mlx-vlm that referenced this pull request Apr 24, 2026
Merge changes from upstream:
- Blaizzy#1056: hunyuan_vl/gemma3n cache-offset optimization
- Blaizzy#1053: Fix DFlash speculative decoding (GPU hang, performance)
- Blaizzy#1050: Thread-local generation stream (port mlx-lm#1090)
- Blaizzy#1055: Close batch_generate/server decode gap + VLM fixes

Conflict resolution:
- requirements.txt: Mixed approach - mlx>=0.31.2 with transformers<5.4.0
  to maintain omlx compatibility while accepting mlx update

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mdkirin pushed a commit to mdkirin/mlx-seori that referenced this pull request Apr 26, 2026
…MoE pos fix

Absorb upstream/main (4-19 to 4-25 batch). All of the fork's core assets are preserved:
MTP (ported from mlx-lm, Qwen3.5 dense+MoE), PrefixCache hybrid, server hardening
(MLX_MEMORY_LIMIT_GB env, /v1/status, /v1/models including loaded models, model
pinning, busy tracking, GC threshold, last_request, removal of the OOM-prone startup
warmup), server-side thinking strip + incremental streaming, and a null
tool_calls guard.

Absorbed from upstream: continuous batching server (Blaizzy#1027), DFlash speculative
decoding (Blaizzy#1029, Blaizzy#1053 fix), thread-local generation stream (Blaizzy#1050,
mlx<0.32 hasattr guard), batch_generate/server VLM fixes (Blaizzy#1055), Qwen3.5/3.6
MoE stale position IDs + gdn_sink compatibility (Blaizzy#1040), tool-call markup strip
(Blaizzy#1037), KV cache quantization (Blaizzy#1030), torch-free video processors for
Qwen2-3.5 VL (Blaizzy#1048), Gemma4 LoRA NaN/freeze fix (Blaizzy#1052), Gemma4 video,
Youtu-VL, distributed inference, and more.

Conflict resolution principle: the fork's MTP n_confirmed and upstream's gdn_sink
coexist in the same function by extending its signature. The fork branched before
Blaizzy#1029 (DFlash) was introduced, so the gdn_sink body logic is inactive for our
models (None is passed); the signature still accepts it to keep compatibility.
When the position_ids cache is reused, the fork's ">= cache_offset + seq_length"
check covers the Blaizzy#1040 fix more precisely.
The LanguageModelOutput.hidden_states/gdn_states fields are compatible with the upstream additions.

Verification: syntax + import OK for 4 files. Compatibility with mlx 0.31.0 confirmed on an M3 with 96GB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
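
The hasattr guard mentioned in this commit message is not shown on this page. One plausible shape, sketched under assumptions: `make_generation_stream` is a hypothetical helper, and falling back to `mx.new_stream` on older MLX is a guess at what the fork does. The module is passed in as a parameter so the sketch runs against stand-ins without MLX installed.

```python
from types import SimpleNamespace


def make_generation_stream(mx):
    """Prefer a thread-local stream when the installed MLX provides one."""
    device = mx.default_device()
    if hasattr(mx, "new_thread_local_stream"):
        return mx.new_thread_local_stream(device)
    # Assumed fallback for older MLX: an ordinary stream on the same device.
    return mx.new_stream(device)


# Stand-ins for a newer and an older `mlx.core` module (hypothetical).
new_mx = SimpleNamespace(
    default_device=lambda: "gpu",
    new_thread_local_stream=lambda device: ("thread-local", device),
    new_stream=lambda device: ("plain", device),
)
old_mx = SimpleNamespace(
    default_device=lambda: "gpu",
    new_stream=lambda device: ("plain", device),
)

new_result = make_generation_stream(new_mx)
old_result = make_generation_stream(old_mx)
```

With a real install the call would be `make_generation_stream(mlx.core)`.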
TheBranchDriftCatalyst pushed a commit to TheBranchDriftCatalyst/catalyst-llm that referenced this pull request May 7, 2026
Adds a stable LiteLLM model_name (mac/gemma4-vision) that downstream apps
target without coupling to the underlying engine. Currently routes to
ollama/gemma4:26b on the mac node; once the upstream mlx-vlm threading
bug is resolved (mlx-vlm#1134, vllm-mlx#496) the alias swaps to vLLM-MLX
without touching consumers.

Schema change: models.yaml entries (both ollama models and vllm
instances) now accept an `extra_aliases: [...]` list of additional
LiteLLM model_names that route to the same backend. gen-litellm emits
one entry per primary alias plus one per extra. Documented in the
models.yaml header and in the disabled gemma4 vllm-mlx block (which
carries the swap instructions for when upstream lands a fix).

The disabled vllm-mlx gemma4 comment now points at:
  - Blaizzy/mlx-vlm#1134
  - Blaizzy/mlx-vlm#1050
  - waybarrios/vllm-mlx#496
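
The alias expansion described in this commit message might look roughly like the sketch below. Only `extra_aliases` and the LiteLLM `model_name` field come from the message; the `alias` and `backend` keys, the primary name `mac/gemma4`, and the function `expand_entries` are hypothetical, and the real gen-litellm output may carry more fields.

```python
def expand_entries(models):
    """Emit one LiteLLM entry per primary alias plus one per extra alias."""
    model_list = []
    for entry in models:
        names = [entry["alias"], *entry.get("extra_aliases", [])]
        for name in names:
            model_list.append(
                {
                    "model_name": name,
                    "litellm_params": {"model": entry["backend"]},
                }
            )
    return model_list


# Hypothetical models.yaml content, already parsed into Python objects.
models = [
    {
        "alias": "mac/gemma4",
        "backend": "ollama/gemma4:26b",
        "extra_aliases": ["mac/gemma4-vision"],
    },
    {"alias": "mac/other", "backend": "ollama/other"},
]

model_list = expand_entries(models)
```

Because consumers only see `model_name`, swapping the backend behind `mac/gemma4-vision` later would not require changes on their side.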
TheBranchDriftCatalyst added a commit to TheBranchDriftCatalyst/catalyst-llm that referenced this pull request May 13, 2026

Development

Successfully merging this pull request may close these issues.

Crash on mlx 0.31.2: 'There is no Stream(gpu, N) in current thread' when generate() runs in a worker thread
