Skip to content

CompiledFunction is not thread safe#226

Merged
davidkoski merged 1 commit intomainfrom
compiled-concurrency
Apr 28, 2025
Merged

CompiledFunction is not thread safe#226
davidkoski merged 1 commit intomainfrom
compiled-concurrency

Conversation

@davidkoski
Copy link
Collaborator

- fixes #225
- although CompiledFunction was marked as unchecked(Sendable) it really wasn't
- guard the state with a lock
@davidkoski davidkoski requested review from awni and barronalex April 14, 2025 23:39
/// unique (for the lifetime of the object) identifier for the compiled function
private var id: UInt!

let lock = NSLock()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully this is enough -- I am scoping the lock to the instance of CompileFunction (e.g. to handle cases like compiledSilu). if the mlx::core side of it isn't thread safe then I can make the lock global instead.

Comment on lines +40 to +42
lock.withLock {
innerCall(arguments)
}
Copy link
Member

@awni awni Apr 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here, we grab the lock every time we call the compiled function? Is that because innerCall is not thread safe or something in mlx::core was causing issues?

Indeed I think mlx::core::compile is not entirely thread safe since it puts the compiled function in a cache. I think it probably makes sense to guard reading and writing to the compile cache with a lock. Maybe that will make this lock unnecessary?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may be two-fold, though this is speculation from observing the behavior:

  • the cache itself may not be thread safe, in which case this lock per compiled function won't be enough
  • the function / tracer arguments are not thread safe (innerCall) as the cached function itself is not thread safe

I think I was observing the latter as it looked like there were two simultaneous call using the same innerCall & messing with the tracing parameters. That is specifically what this is guarding against and in the failure case we had it seems to work.

But I think the cache itself being not thread safe could very well be an issue, maybe just a little harder to race on as it is a shorter critical section.

Copy link
Member

@awni awni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.. hopefully we can remove this soon!

@davidkoski davidkoski merged commit 1f1f922 into main Apr 28, 2025
1 check passed
@davidkoski davidkoski deleted the compiled-concurrency branch April 28, 2025 20:45
pjm073 added a commit to Oxyide/mlx-rs that referenced this pull request Dec 30, 2025
Fixes oxiglade#259, oxiglade#258, oxiglade#224

## Problem

MLX C++ core is not thread-safe for concurrent operations:
- Concurrent eval() calls corrupt the lazy evaluation graph
- Concurrent compiled function calls corrupt the compile cache
- Results in segmentation faults under concurrent load

Thread safety stress tests confirmed SIGSEGV crashes.

## Solution

Following mlx-swift's approach:

1. **Global eval() lock** - Serializes all eval() and async_eval() calls
   - Array::eval()
   - transforms::eval()
   - transforms::async_eval()

2. **Per-compiled-function locks** - Prevents cache corruption
   - Each Compiled<F, G> gets Arc<Mutex<()>>
   - All call_mut() implementations acquire lock before execution
   - Applies to both compile() and compile_with_state()

## Implementation

- Uses parking_lot::Mutex for better performance
- Global EVAL_LOCK defined in transforms::mod.rs
- Per-function locks in Compiled struct using Arc for Clone compatibility
- All call_mut()/call_mut_with_state() implementations protected

## Testing

- All 545 existing mlx-rs tests pass
- Added comprehensive thread safety stress tests (see tests/thread_safety.rs)

## Performance Impact

- Concurrent eval() calls are serialized (necessary trade-off)
- Different compiled functions can still run in parallel
- Same function calls to different instances are serialized

## Upstream MLX Issues

Apple MLX C++ core has intentional design decision to NOT be thread-safe.
Bindings must handle thread safety themselves:

- ml-explore/mlx#2067 - Thread issues with evaluation
  Our EVAL_LOCK works around this issue

- ml-explore/mlx#2086 - C++ compile cache should be thread safe
  Our per-function locks partially address this (full fix in next commit)

- ml-explore/mlx#2133 - Umbrella thread safety tracking issue

Apple maintainer stated these will NOT be fixed at C++ level.

## References

- mlx-swift eval lock: https://github.com/ml-explore/mlx-swift/blob/main/Source/MLX/Transforms%2BEval.swift#L9
- mlx-swift compiled fix: ml-explore/mlx-swift#226
- Investigation: docs/THREAD_SAFETY_INVESTIGATION.md
- Findings: docs/THREAD_SAFETY_FINDINGS.md
pjm073 added a commit to Oxyide/mlx-rs that referenced this pull request Jan 1, 2026
Fixes oxiglade#259, oxiglade#258, oxiglade#224

## Problem

MLX C++ core is not thread-safe for concurrent operations:
- Concurrent eval() calls corrupt the lazy evaluation graph
- Concurrent compiled function calls corrupt the compile cache
- Results in segmentation faults under concurrent load

Thread safety stress tests confirmed SIGSEGV crashes.

## Solution

Following mlx-swift's approach:

1. **Global eval() lock** - Serializes all eval() and async_eval() calls
   - Array::eval()
   - transforms::eval()
   - transforms::async_eval()

2. **Per-compiled-function locks** - Prevents cache corruption
   - Each Compiled<F, G> gets Arc<Mutex<()>>
   - All call_mut() implementations acquire lock before execution
   - Applies to both compile() and compile_with_state()

## Implementation

- Uses parking_lot::Mutex for better performance
- Global EVAL_LOCK defined in transforms::mod.rs
- Per-function locks in Compiled struct using Arc for Clone compatibility
- All call_mut()/call_mut_with_state() implementations protected

## Testing

- All 545 existing mlx-rs tests pass
- Added comprehensive thread safety stress tests (see tests/thread_safety.rs)

## Performance Impact

- Concurrent eval() calls are serialized (necessary trade-off)
- Different compiled functions can still run in parallel
- Same function calls to different instances are serialized

## Upstream MLX Issues

Apple MLX C++ core has intentional design decision to NOT be thread-safe.
Bindings must handle thread safety themselves:

- ml-explore/mlx#2067 - Thread issues with evaluation
  Our EVAL_LOCK works around this issue

- ml-explore/mlx#2086 - C++ compile cache should be thread safe
  Our per-function locks partially address this (full fix in next commit)

- ml-explore/mlx#2133 - Umbrella thread safety tracking issue

Apple maintainer stated these will NOT be fixed at C++ level.

## References

- mlx-swift eval lock: https://github.com/ml-explore/mlx-swift/blob/main/Source/MLX/Transforms%2BEval.swift#L9
- mlx-swift compiled fix: ml-explore/mlx-swift#226
- Investigation: docs/THREAD_SAFETY_INVESTIGATION.md
- Findings: docs/THREAD_SAFETY_FINDINGS.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Crashes occur when running multiple LLMEvaluators.

2 participants