[BUG] SIGABRT for multiple inference requests concurrently on GPT-oss (vector::back on empty) #337

@ronaldmannak

Description

Summary

I’m seeing a deterministic crash during GPT-OSS inference when multiple requests run concurrently. The crash disappears when I force concurrency to 1. Other models (like Qwen) can run multiple requests without any issues. The issue seems specific to GPT-OSS.

The abort originates in MLX’s trace/compile path, where libc++ asserts because vector::back() is called on an empty vector.

Environment

Steps to Reproduce

  1. Run GPT-OSS inference with concurrent requests (e.g., multiple client requests at once).
  2. Observe crash during generation/compile phase.
  3. Set concurrency to 1 (serialize requests) and rerun.
  4. Crash disappears.
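The "concurrency = 1" workaround in steps 3–4 can be sketched generically. This is a hypothetical stand-in, not the actual runner code: a single mutex held across each inference call guarantees only one request is inside the MLX trace/compile path at a time, and the instrumented counter lets us verify that.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the inference entry point. The real
// workaround is equivalent: hold one lock across every GPT-OSS call.
std::mutex inference_mutex;
int in_flight = 0;      // requests currently inside the critical section
int max_in_flight = 0;  // highest concurrency ever observed

void run_inference_serialized() {
    std::lock_guard<std::mutex> lock(inference_mutex);
    ++in_flight;
    if (in_flight > max_in_flight) max_in_flight = in_flight;
    // ... the model call (and its compile/trace work) would go here ...
    --in_flight;
}

// Fire n_requests concurrent "clients" and report the peak concurrency
// that actually reached the model; with the mutex it is always 1.
int max_concurrency_with(int n_requests) {
    max_in_flight = 0;
    std::vector<std::thread> reqs;
    for (int i = 0; i < n_requests; ++i)
        reqs.emplace_back(run_inference_serialized);
    for (auto& t : reqs) t.join();
    return max_in_flight;
}
```

With eight concurrent callers, `max_concurrency_with(8)` still reports a peak of 1, which matches why serializing requests hides the crash.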

Expected Behavior

Concurrent inference requests should be safe (or fail gracefully), without crashing the process.

Actual Behavior

Process aborts with SIGABRT; the console shows a libc++ verbose abort (vector::back() called on an empty vector).

Stack Trace (excerpt)

Task 153 Queue : com.apple.root.user-initiated-qos.cooperative (concurrent)
#0  __pthread_kill
#1  pthread_kill
#2  abort
#3  std::__1::__libcpp_verbose_abort
#4  std::__1::vector<...>::back at __vector/vector.h:430
#5  mlx::core::detail::InTracing::~InTracing at mlx/transforms_impl.h:28
#6  mlx::core::detail::InTracing::~InTracing at mlx/transforms_impl.h:27
#7  mlx::core::detail::compile_trace at mlx/compile.cpp:410
#8  mlx::core::detail::compile(...) at mlx/compile.cpp:1125
#16 mlx::core::detail::compile(...) at mlx/compile.cpp:1179
#24 ::mlx_closure_apply at mlx/c/closure.cpp:102
#25 CompiledFunction.innerCall at Transforms+Compile.swift:100
#29 CompiledFunction.call at Transforms+Compile.swift:40
#31 SwiGLUSwitchGLU.callAsFunction at GPTOSS.swift:144
#35 GPTOSSModel.callAsFunction at GPTOSS.swift:509
#37 LanguageModel.callAsFunction at LanguageModel.swift:183
#39 TokenIterator.step at Evaluate.swift:421
#40 TokenIterator.next() at Evaluate.swift:445
#41 closure #1 in closure #1 in closure #2 in LLMRunner.generateHarmonyTokenStreaming(tokenIds:modelId:languageModelContainer:generateParameters:maxCompletionTokens:stopTokens:) at LLMRunner+HarmonyGeneration.swift:68
#42 partial apply for closure #1 in closure #1 in closure #2 in LLMRunner.generateHarmonyTokenStreaming(tokenIds:modelId:languageModelContainer:generateParameters:maxCompletionTokens:stopTokens:)
#43 ModelContainer.perform<τ_0_0>(_:) at ModelContainer.swift:71

Notes / Hypothesis

  • The crash happens only when multiple requests are processed concurrently.
  • Setting concurrency to 1 makes the crash disappear.
  • This only happens with GPT-OSS, not with other models.
  • My guess, based on the trace: the tracing state popped in InTracing::~InTracing (reached via compile_trace) is shared across threads, so two concurrent compiles can race and pop an already-empty stack.
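To make the suspected mechanism concrete, here is a minimal sketch, assuming (hypothetically, this is a simplification of MLX internals) that the tracing guard pushes a marker onto a shared stack in its constructor and pops it via back()/pop_back() in its destructor. Without synchronization, interleaved destructors can find the stack empty, which is exactly where libc++ aborts in the report; the sketch adds a mutex so the push/pop pairs stay balanced.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical simplification of an RAII tracing guard like
// mlx::core::detail::InTracing: constructor pushes a marker onto a
// shared stack, destructor pops it. Without external serialization,
// back() on an empty vector is what libc++ aborts on in this bug.
static std::vector<int> trace_stack;
static std::mutex trace_mutex;  // serializes access in this sketch

struct TraceGuard {
    TraceGuard() {
        std::lock_guard<std::mutex> lock(trace_mutex);
        trace_stack.push_back(1);
    }
    ~TraceGuard() {
        std::lock_guard<std::mutex> lock(trace_mutex);
        // Guarded pop: with the mutex, every pop is matched by a
        // prior push, so the stack is never empty here.
        if (!trace_stack.empty()) {
            trace_stack.pop_back();
        }
    }
};

// Run n_threads workers, each entering/exiting the "trace" many
// times, mimicking concurrent compiles. Returns the final stack
// size; a balanced trace ends at 0.
int simulate(int n_threads) {
    std::vector<std::thread> workers;
    for (int i = 0; i < n_threads; ++i) {
        workers.emplace_back([] {
            for (int j = 0; j < 1000; ++j) {
                TraceGuard g;  // one push/pop pair per "compile"
            }
        });
    }
    for (auto& t : workers) t.join();
    return static_cast<int>(trace_stack.size());
}
```

With the lock in place the stack ends balanced for any thread count; removing the lock (or the emptiness check) reproduces the class of failure seen in frame #5, where ~InTracing calls back() on an empty vector.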
