Summary
I’m seeing a deterministic crash during GPT-OSS inference when multiple requests run concurrently. The crash disappears when I force concurrency to 1. Other models (like Qwen) handle multiple concurrent requests without issue, so the problem appears specific to GPT-OSS.
The abort comes from MLX’s trace/compile path: it asserts because vector::back() is called on an empty vector.
Environment
- OS Version: macOS 26.2
- Device: MacBook Pro M1
- Version: tested on both the main branch and the GPT-OSS performance optimizations PR (mlx-swift-lm#51); both show the same behavior. Since mlx-swift-lm#51 closely mirrors the Python implementation, the same issue likely also exists in mlx-lm (not tested).
Steps to Reproduce
- Run GPT-OSS inference with concurrent requests (e.g., multiple client requests at once).
- Observe crash during generation/compile phase.
- Set concurrency to 1 (serialize requests) and rerun.
- Crash disappears.
Expected Behavior
Concurrent inference requests should be safe (or fail gracefully), without crashing the process.
Actual Behavior
Process aborts with SIGABRT. Console shows:
Stack Trace (excerpt)
Task 153 Queue : com.apple.root.user-initiated-qos.cooperative (concurrent)
#0 __pthread_kill
#1 pthread_kill
#2 abort
#3 std::__1::__libcpp_verbose_abort
#4 std::__1::vector<...>::back at __vector/vector.h:430
#5 mlx::core::detail::InTracing::~InTracing at mlx/transforms_impl.h:28
#6 mlx::core::detail::InTracing::~InTracing at mlx/transforms_impl.h:27
#7 mlx::core::detail::compile_trace at mlx/compile.cpp:410
#8 mlx::core::detail::compile(...) at mlx/compile.cpp:1125
#16 mlx::core::detail::compile(...) at mlx/compile.cpp:1179
#24 ::mlx_closure_apply at mlx/c/closure.cpp:102
#25 CompiledFunction.innerCall at Transforms+Compile.swift:100
#29 CompiledFunction.call at Transforms+Compile.swift:40
#31 SwiGLUSwitchGLU.callAsFunction at GPTOSS.swift:144
#35 GPTOSSModel.callAsFunction at GPTOSS.swift:509
#37 LanguageModel.callAsFunction at LanguageModel.swift:183
#39 TokenIterator.step at Evaluate.swift:421
#41 LLMRunner.generateHarmonyTokenStreaming at LLMRunner+HarmonyGeneration.swift:68
#40 TokenIterator.next() at Evaluate.swift:445
#41 in closure #1 in closure #1 in closure #2 in LLMRunner.generateHarmonyTokenStreaming(tokenIds:modelId:languageModelContainer:generateParameters:maxCompletionTokens:stopTokens:) ()
#42 in partial apply for closure #1 in closure #1 in closure #2 in LLMRunner.generateHarmonyTokenStreaming(tokenIds:modelId:languageModelContainer:generateParameters:maxCompletionTokens:stopTokens:) ()
#43 ModelContainer.perform<τ_0_0>(_:) at ModelContainer.swift:71
Notes / Hypothesis
- The crash happens only when multiple requests are processed concurrently.
- Setting concurrency to 1 makes the crash disappear.
- This only happens with GPT-OSS, not with other models.
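Until the thread-safety issue in the compile path is resolved, one possible workaround is to serialize all GPT-OSS generation behind a single gate so the compiled function is never entered concurrently, which mirrors the "concurrency = 1" setting that avoids the crash. The sketch below is purely illustrative and assumes nothing about this project's API: `SerializedInference` and the `i * 2` placeholder stand in for the real MLX generation call.

```swift
import Foundation

// Hypothetical workaround: funnel all generation calls through a
// semaphore so only one request reaches the compiled function at a time.
final class SerializedInference {
    private let gate = DispatchSemaphore(value: 1)  // effective concurrency = 1

    // `work` stands in for the real MLX generation call.
    func run<T>(_ work: () throws -> T) rethrows -> T {
        gate.wait()
        defer { gate.signal() }
        return try work()
    }
}

// Usage sketch: concurrent callers, but the model body is never re-entered.
let inference = SerializedInference()
let queue = DispatchQueue.global(qos: .userInitiated)
let group = DispatchGroup()
var results: [Int] = []
let resultsLock = NSLock()

for i in 0..<4 {
    queue.async(group: group) {
        let value = inference.run { i * 2 }  // placeholder for token generation
        resultsLock.lock()
        results.append(value)
        resultsLock.unlock()
    }
}
group.wait()
print(results.sorted())  // [0, 2, 4, 6]
```

This trades throughput for safety, so it is only a stopgap; the proper fix would be making the trace/compile state in MLX safe for concurrent callers (or per-thread).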