Thread-local generation stream (port mlx-lm#1090) #1050
Merged
Switch generation_stream to mx.new_thread_local_stream and let BatchGenerator accept a stream= kwarg, so the server can pass the generator thread's default stream explicitly. Keeps generation and synchronization on the same stream. Requires mlx>=0.31.2 (for mx.new_thread_local_stream). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Updated ResponseGenerator to load model resources in a dedicated thread, improving responsiveness.
- Introduced a wait_until_ready method to ensure the model is fully loaded before generating responses.
- Added error handling for model loading failures, allowing for graceful degradation.
- Removed direct model loading from get_cached_model, streamlining the initialization process.

This change decouples model loading from response generation.
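The loading pattern described in this commit can be sketched with the standard library alone. This is a minimal illustration, not the code from `ResponseGenerator`: the class name `BackgroundLoader` and the `load_fn` parameter are hypothetical, and only the `wait_until_ready` name comes from the commit message.

```python
import threading


class BackgroundLoader:
    """Load a resource on a dedicated thread and let callers wait for it.

    Hypothetical stand-in for the loading logic described above.
    """

    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._ready = threading.Event()
        self._resource = None
        self._error = None
        self._thread = threading.Thread(target=self._load, daemon=True)
        self._thread.start()

    def _load(self):
        try:
            self._resource = self._load_fn()
        except Exception as exc:
            # Keep the failure so the waiting caller can see it.
            self._error = exc
        finally:
            # Always signal completion, on success or failure.
            self._ready.set()

    def wait_until_ready(self, timeout=None):
        """Block until loading finishes; raise if loading failed or timed out."""
        if not self._ready.wait(timeout):
            raise TimeoutError("model is still loading")
        if self._error is not None:
            raise RuntimeError("model loading failed") from self._error
        return self._resource
```

A request handler would call `wait_until_ready()` before generating, so a slow or failed load surfaces as an error on the request instead of blocking server startup.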
afanty2021 added a commit to afanty2021/mlx-vlm that referenced this pull request on Apr 24, 2026
Merge changes from upstream:
- Blaizzy#1056: hunyuan_vl/gemma3n cache-offset optimization
- Blaizzy#1053: Fix DFlash speculative decoding (GPU hang, performance)
- Blaizzy#1050: Thread-local generation stream (port mlx-lm#1090)
- Blaizzy#1055: Close batch_generate/server decode gap + VLM fixes

Conflict resolution:
- requirements.txt: mixed approach, mlx>=0.31.2 with transformers<5.4.0, to maintain omlx compatibility while accepting the mlx update

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mdkirin pushed a commit to mdkirin/mlx-seori that referenced this pull request on Apr 26, 2026
…MoE pos fix

Absorb upstream/main (batch of 4-19 to 4-25).

All of the fork's core assets are preserved: MTP (ported from mlx-lm, Qwen3.5 dense+MoE), PrefixCache hybrid, server hardening (MLX_MEMORY_LIMIT_GB env, /v1/status, /v1/models including loaded models, model pinning, busy tracking, GC threshold, last_request, removal of the OOM-prone startup warmup), server-side thinking strip with incremental streaming, and a null tool_calls guard.

Absorbed from upstream: continuous batching server (Blaizzy#1027), DFlash speculative decoding (Blaizzy#1029, Blaizzy#1053 fix), thread-local generation stream (Blaizzy#1050, with a hasattr guard for mlx<0.32), batch_generate/server VLM fixes (Blaizzy#1055), Qwen3.5/3.6 MoE stale position IDs + gdn_sink compatibility (Blaizzy#1040), tool-call markup strip (Blaizzy#1037), KV cache quantization (Blaizzy#1030), Qwen2-3.5 VL torch-free video processors (Blaizzy#1048), Gemma4 LoRA NaN/freeze fix (Blaizzy#1052), Gemma4 video, Youtu-VL, distributed inference, and more.

Conflict resolution principles: the function signature is extended so that the fork's MTP n_confirmed and upstream's gdn_sink coexist in the same function. The fork branched before Blaizzy#1029 (DFlash) was introduced, so the gdn_sink body logic is inactive in our models (None is passed); the signature still accepts it to maintain compatibility. When cached position_ids are reused, the fork's ">= cache_offset + seq_length" check covers the Blaizzy#1040 fix more precisely. The LanguageModelOutput.hidden_states/gdn_states fields are compatible with the upstream additions.

Verification: syntax + import OK for 4 files. Compatibility with mlx 0.31.0 confirmed on an M3 with 96GB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
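The `hasattr` guard that this commit message mentions for the thread-local generation stream might look roughly like the sketch below. The helper name `make_generation_stream` is hypothetical, and falling back to `mx.new_stream` is an assumption about what the fork does on MLX builds that lack `mx.new_thread_local_stream`. The module is passed in as a parameter so the sketch runs without MLX installed.

```python
from types import SimpleNamespace


def make_generation_stream(mx):
    """Return a thread-local stream when available, else a regular new stream.

    `mx` is expected to be `mlx.core`; the fallback branch is an assumption.
    """
    device = mx.default_device()
    if hasattr(mx, "new_thread_local_stream"):
        return mx.new_thread_local_stream(device)
    return mx.new_stream(device)


# Stand-in modules so the sketch can be exercised without MLX.
old_mx = SimpleNamespace(
    default_device=lambda: "gpu",
    new_stream=lambda device: ("stream", device),
)
new_mx = SimpleNamespace(
    default_device=lambda: "gpu",
    new_stream=lambda device: ("stream", device),
    new_thread_local_stream=lambda device: ("thread_local_stream", device),
)
```

With MLX installed, the call would be `make_generation_stream(mlx.core)`.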
TheBranchDriftCatalyst pushed a commit to TheBranchDriftCatalyst/catalyst-llm that referenced this pull request on May 7, 2026
Adds a stable LiteLLM model_name (mac/gemma4-vision) that downstream apps target without coupling to the underlying engine. Currently routes to ollama/gemma4:26b on the mac node; once the upstream mlx-vlm threading bug is resolved (mlx-vlm#1134, vllm-mlx#496) the alias swaps to vLLM-MLX without touching consumers.

Schema change: models.yaml entries (both ollama models and vllm instances) now accept an `extra_aliases: [...]` list of additional LiteLLM model_names that route to the same backend. gen-litellm emits one entry per primary alias plus one per extra. Documented in the models.yaml header and in the disabled gemma4 vllm-mlx block (which carries the swap instructions for when upstream lands a fix).

The disabled vllm-mlx gemma4 comment now points at:
- Blaizzy/mlx-vlm#1134
- Blaizzy/mlx-vlm#1050
- waybarrios/vllm-mlx#496
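The `extra_aliases` expansion described in this commit can be sketched as follows. This is not the gen-litellm source: the function name `expand_aliases`, the input keys `model_name` and `backend`, and the primary alias `mac/gemma4` are hypothetical. Only `extra_aliases`, `mac/gemma4-vision`, and `ollama/gemma4:26b` come from the commit message, and the output shape follows LiteLLM's `model_list` entries (`model_name` plus `litellm_params.model`).

```python
def expand_aliases(entries):
    """Emit one LiteLLM-style entry per primary alias plus one per extra alias."""
    routes = []
    for entry in entries:
        # The primary alias comes first, followed by any extras.
        names = [entry["model_name"], *entry.get("extra_aliases", [])]
        for name in names:
            routes.append(
                {"model_name": name, "litellm_params": {"model": entry["backend"]}}
            )
    return routes


# Hypothetical models.yaml content, already parsed into Python objects.
models = [
    {
        "model_name": "mac/gemma4",
        "backend": "ollama/gemma4:26b",
        "extra_aliases": ["mac/gemma4-vision"],
    },
]
routes = expand_aliases(models)
```

Because every alias resolves to the same `backend` value, swapping the engine later means editing one entry while consumers keep calling `mac/gemma4-vision`.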
Summary
- `generation_stream` switched to `mx.new_thread_local_stream(mx.default_device())`.
- `BatchGenerator` now accepts a `stream=` kwarg (other args made keyword-only) and routes `wired_limit`, `remove()`, and `next()` through `self._stream`; exposes a `.stream` property.
- `server.py`: drops the module-level import and creates a local `mx.default_stream(mx.default_device())` inside `_run()` and `_run_speculative()`, passing it to `BatchGenerator(stream=...)` so generation and synchronization run on the generator thread's default stream.
- `requirements.txt`: `mlx>=0.31.2` and `mlx-lm>=0.31.3` (`new_thread_local_stream` requires MLX core 0.31.2).

Test plan
- `python -c "from mlx_vlm import generate, server"` imports cleanly on mlx 0.31.2.
- `pytest mlx_vlm/tests/test_generate.py mlx_vlm/tests/test_batch_quantized_cache.py`; BatchGenerator call sites already use kwargs past `model, processor`.
- Run `mlx_vlm.server` and issue a multi-request load; confirm generation still completes and stays on a single stream.
- `_run_speculative` with a draft model if available.

🤖 Generated with Claude Code
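For reviewers checking call sites, the keyword-only change and the `.stream` property from the summary can be illustrated with a toy class. This is a sketch under stated assumptions, not the real `BatchGenerator`: the class name `ToyBatchGenerator`, the `max_tokens` parameter, and the string placeholders for streams are hypothetical.

```python
class ToyBatchGenerator:
    """Toy stand-in showing a keyword-only `stream=` argument."""

    def __init__(self, model, processor, *, stream=None, max_tokens=128):
        # Everything after `*` must be passed by keyword.
        self.model = model
        self.processor = processor
        self.max_tokens = max_tokens
        # Placeholder default; the real class would hold an MLX stream here.
        self._stream = stream if stream is not None else "default-stream"

    @property
    def stream(self):
        """Read-only access to the stream this generator runs on."""
        return self._stream


gen = ToyBatchGenerator("model", "processor", stream="generator-thread-stream")
```

A call such as `ToyBatchGenerator("model", "processor", 64)` raises `TypeError`, which is why call sites need to use kwargs past `model, processor`.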