Summary
I’m seeing a deterministic crash during GPT-OSS inference when multiple requests run concurrently. The crash disappears when I force concurrency to 1. Other models (like Qwen) handle multiple concurrent requests without issue, so the problem appears specific to GPT-OSS.
The abort comes from MLX’s trace/compile path: it asserts because vector::back() is called on an empty vector.
Environment
- OS Version: macOS 26.2
- Device: MacBook Pro M1
- Version: tested on both the main branch and the GPT-OSS performance optimizations PR (mlx-swift-lm#51); both show the same behavior. Since mlx-swift-lm#51 closely mirrors the Python implementation, the same issue likely also exists in mlx-lm (not tested).
Steps to Reproduce
- Run GPT-OSS inference with concurrent requests (e.g., multiple client requests at once).
- Observe crash during generation/compile phase.
- Set concurrency to 1 (serialize requests) and rerun.
- Crash disappears.
Expected Behavior
Concurrent inference requests should be safe (or fail gracefully), without crashing the process.
Actual Behavior
Process aborts with SIGABRT. Console shows:
Stack Trace (excerpt)
Task 153 Queue : com.apple.root.user-initiated-qos.cooperative (concurrent)
#0 __pthread_kill
#1 pthread_kill
#2 abort
#3 std::__1::__libcpp_verbose_abort
#4 std::__1::vector<...>::back at __vector/vector.h:430
#5 mlx::core::detail::InTracing::~InTracing at mlx/transforms_impl.h:28
#6 mlx::core::detail::InTracing::~InTracing at mlx/transforms_impl.h:27
#7 mlx::core::detail::compile_trace at mlx/compile.cpp:410
#8 mlx::core::detail::compile(...) at mlx/compile.cpp:1125
#16 mlx::core::detail::compile(...) at mlx/compile.cpp:1179
#24 ::mlx_closure_apply at mlx/c/closure.cpp:102
#25 CompiledFunction.innerCall at Transforms+Compile.swift:100
#29 CompiledFunction.call at Transforms+Compile.swift:40
#31 SwiGLUSwitchGLU.callAsFunction at GPTOSS.swift:144
#35 GPTOSSModel.callAsFunction at GPTOSS.swift:509
#37 LanguageModel.callAsFunction at LanguageModel.swift:183
#39 TokenIterator.step at Evaluate.swift:421
#41 LLMRunner.generateHarmonyTokenStreaming at LLMRunner+HarmonyGeneration.swift:68
#40 TokenIterator.next() at Evaluate.swift:445
#41 in closure #1 in closure #1 in closure #2 in LLMRunner.generateHarmonyTokenStreaming(tokenIds:modelId:languageModelContainer:generateParameters:maxCompletionTokens:stopTokens:) ()
#42 in partial apply for closure #1 in closure #1 in closure #2 in LLMRunner.generateHarmonyTokenStreaming(tokenIds:modelId:languageModelContainer:generateParameters:maxCompletionTokens:stopTokens:) ()
#43 ModelContainer.perform<τ_0_0>(_:) at ModelContainer.swift:71
Notes / Hypothesis
- The crash happens only when multiple requests are processed concurrently.
- Setting concurrency to 1 makes the crash disappear.
- This only happens with GPT-OSS, not with other models.
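Until the thread-safety issue in the compile path is resolved, one possible workaround is to serialize all GPT-OSS generation behind a single gate so the compiled function is never entered concurrently, which mirrors the "concurrency = 1" setting that avoids the crash. The sketch below is purely illustrative and assumes nothing about this project's API: `SerializedInference` and the `i * 2` placeholder stand in for the real MLX generation call.

```swift
import Foundation

// Hypothetical workaround: funnel all generation calls through a
// semaphore so only one request reaches the compiled function at a time.
final class SerializedInference {
    private let gate = DispatchSemaphore(value: 1)  // effective concurrency = 1

    // `work` stands in for the real MLX generation call.
    func run<T>(_ work: () throws -> T) rethrows -> T {
        gate.wait()
        defer { gate.signal() }
        return try work()
    }
}

// Usage sketch: concurrent callers, but the model body is never re-entered.
let inference = SerializedInference()
let queue = DispatchQueue.global(qos: .userInitiated)
let group = DispatchGroup()
var results: [Int] = []
let resultsLock = NSLock()

for i in 0..<4 {
    queue.async(group: group) {
        let value = inference.run { i * 2 }  // placeholder for token generation
        resultsLock.lock()
        results.append(value)
        resultsLock.unlock()
    }
}
group.wait()
print(results.sorted())  // [0, 2, 4, 6]
```

This trades throughput for safety, so it is only a stopgap; the proper fix would be making the trace/compile state in MLX safe for concurrent callers (or per-thread).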