Conversation
davidkoski
commented
Apr 14, 2025
- fixes Crashes occur when running multiple LLMEvaluators. #225
- although CompiledFunction was marked as unchecked(Sendable) it really wasn't
- guard the state with a lock
- fixes #225 - although CompiledFunction was marked as unchecked(Sendable) it really wasn't - guard the state with a lock
| /// unique (for the lifetime of the object) identifier for the compiled function | ||
| private var id: UInt! | ||
|
|
||
| let lock = NSLock() |
There was a problem hiding this comment.
Hopefully this is enough -- I am scoping the lock to the instance of CompileFunction (e.g. to handle cases like compiledSilu). if the mlx::core side of it isn't thread safe then I can make the lock global instead.
| lock.withLock { | ||
| innerCall(arguments) | ||
| } |
There was a problem hiding this comment.
So here, we grab the lock every time we call the compiled function? Is that because innerCall is not thread safe or something in mlx::core was causing issues?
Indeed I think mlx::core::compile is not entirely thread safe since it puts the compiled function in a cache. I think it probably makes sense to guard reading and writing to the compile cache with a lock. Maybe that will make this lock unnecessary?
There was a problem hiding this comment.
I think it may be two-fold, though this is speculation from observing the behavior:
- the cache itself may not be thread safe, in which case this lock per compiled function won't be enough
- the function / tracer arguments are not thread safe (innerCall) as the cached function itself is not thread safe
I think I was observing the latter as it looked like there were two simultaneous call using the same innerCall & messing with the tracing parameters. That is specifically what this is guarding against and in the failure case we had it seems to work.
But I think the cache itself being not thread safe could very well be an issue, maybe just a little harder to race on as it is a shorter critical section.
awni
left a comment
There was a problem hiding this comment.
LGTM.. hopefully we can remove this soon!
Fixes oxiglade#259, oxiglade#258, oxiglade#224 ## Problem MLX C++ core is not thread-safe for concurrent operations: - Concurrent eval() calls corrupt the lazy evaluation graph - Concurrent compiled function calls corrupt the compile cache - Results in segmentation faults under concurrent load Thread safety stress tests confirmed SIGSEGV crashes. ## Solution Following mlx-swift's approach: 1. **Global eval() lock** - Serializes all eval() and async_eval() calls - Array::eval() - transforms::eval() - transforms::async_eval() 2. **Per-compiled-function locks** - Prevents cache corruption - Each Compiled<F, G> gets Arc<Mutex<()>> - All call_mut() implementations acquire lock before execution - Applies to both compile() and compile_with_state() ## Implementation - Uses parking_lot::Mutex for better performance - Global EVAL_LOCK defined in transforms::mod.rs - Per-function locks in Compiled struct using Arc for Clone compatibility - All call_mut()/call_mut_with_state() implementations protected ## Testing - All 545 existing mlx-rs tests pass - Added comprehensive thread safety stress tests (see tests/thread_safety.rs) ## Performance Impact - Concurrent eval() calls are serialized (necessary trade-off) - Different compiled functions can still run in parallel - Same function calls to different instances are serialized ## Upstream MLX Issues Apple MLX C++ core has intentional design decision to NOT be thread-safe. Bindings must handle thread safety themselves: - ml-explore/mlx#2067 - Thread issues with evaluation Our EVAL_LOCK works around this issue - ml-explore/mlx#2086 - C++ compile cache should be thread safe Our per-function locks partially address this (full fix in next commit) - ml-explore/mlx#2133 - Umbrella thread safety tracking issue Apple maintainer stated these will NOT be fixed at C++ level. ## References - mlx-swift eval lock: https://github.com/ml-explore/mlx-swift/blob/main/Source/MLX/Transforms%2BEval.swift#L9 - mlx-swift compiled fix: ml-explore/mlx-swift#226 - Investigation: docs/THREAD_SAFETY_INVESTIGATION.md - Findings: docs/THREAD_SAFETY_FINDINGS.md
Fixes oxiglade#259, oxiglade#258, oxiglade#224 ## Problem MLX C++ core is not thread-safe for concurrent operations: - Concurrent eval() calls corrupt the lazy evaluation graph - Concurrent compiled function calls corrupt the compile cache - Results in segmentation faults under concurrent load Thread safety stress tests confirmed SIGSEGV crashes. ## Solution Following mlx-swift's approach: 1. **Global eval() lock** - Serializes all eval() and async_eval() calls - Array::eval() - transforms::eval() - transforms::async_eval() 2. **Per-compiled-function locks** - Prevents cache corruption - Each Compiled<F, G> gets Arc<Mutex<()>> - All call_mut() implementations acquire lock before execution - Applies to both compile() and compile_with_state() ## Implementation - Uses parking_lot::Mutex for better performance - Global EVAL_LOCK defined in transforms::mod.rs - Per-function locks in Compiled struct using Arc for Clone compatibility - All call_mut()/call_mut_with_state() implementations protected ## Testing - All 545 existing mlx-rs tests pass - Added comprehensive thread safety stress tests (see tests/thread_safety.rs) ## Performance Impact - Concurrent eval() calls are serialized (necessary trade-off) - Different compiled functions can still run in parallel - Same function calls to different instances are serialized ## Upstream MLX Issues Apple MLX C++ core has intentional design decision to NOT be thread-safe. Bindings must handle thread safety themselves: - ml-explore/mlx#2067 - Thread issues with evaluation Our EVAL_LOCK works around this issue - ml-explore/mlx#2086 - C++ compile cache should be thread safe Our per-function locks partially address this (full fix in next commit) - ml-explore/mlx#2133 - Umbrella thread safety tracking issue Apple maintainer stated these will NOT be fixed at C++ level. ## References - mlx-swift eval lock: https://github.com/ml-explore/mlx-swift/blob/main/Source/MLX/Transforms%2BEval.swift#L9 - mlx-swift compiled fix: ml-explore/mlx-swift#226 - Investigation: docs/THREAD_SAFETY_INVESTIGATION.md - Findings: docs/THREAD_SAFETY_FINDINGS.md