fix(boundary-store): serialize cleanup_all() / cleanup_request() with writer thread by cfbraun · Pull Request #1423 · jundot/omlx

cfbraun · 2026-05-26T12:28:06Z

Summary

test_cleanup_all_drains_queue was failing ~20% of the time, in isolation and in suites. The race:

T0 save() puts item in queue
T1 writer queue.get() pulls item (already off the queue)
T2 cleanup_all() drains queue — finds it empty
T3 cleanup_all() rmtree(snapshot_dir) + mkdir(snapshot_dir)
T4 writer mkdir(req_dir) + temp write + rename(final)

Result: req-X/N.safetensors survives the supposed cleanup, the caller (scheduler / shutdown) believes the snapshot dir is empty, and the leftover file is the disk shadow of a request that should have been forgotten.

The fix introduces a _writer_busy barrier so cleanup_all() and cleanup_request() wait for the in-flight writer iteration to complete before touching the filesystem. Both methods now serialize correctly against the writer thread.

Also pins the cleanup-before-shutdown ordering of the new shutdown(cleanup=True) path with a regression test.

Test plan

pytest tests/test_boundary_snapshot_store.py — 24 passed (previously flaked ~20%)
test_cleanup_all_with_in_flight_writer regression test added
test_shutdown_cleanup_true_runs_cleanup_before_setting_flag regression test added

…th the writer thread test_cleanup_all_drains_queue failed ~20% of the time, in isolation and in suites. The race: T0 save() puts item in queue T1 writer queue.get() pulls item (already off the queue) T2 cleanup_all() drains queue — finds it empty T3 cleanup_all() rmtree(snapshot_dir) + mkdir(snapshot_dir) T4 writer mkdir(req_dir) + temp write + rename(final) Result: ``req-X/N.safetensors`` survives the supposed cleanup, the caller (scheduler / shutdown) believes the snapshot dir is empty, and the leftover file is the disk shadow of a request that should have been forgotten. Add ``_writer_busy`` (threading.Lock). The writer holds it for the entire duration of each ``_process_write_item`` call. ``cleanup_all()`` drains the queue first (no new items can start) and then acquires ``_writer_busy``, so any item the writer had already pulled finishes before rmtree runs. Same class of race exists for ``cleanup_request()`` — writer pulls a request's item before cleanup_request sets the cancelled-counter, then mkdir's the req_dir back into existence after the rmtree, and the late-rename rescue at the tail of ``_process_write_item`` catches it in most timings but a small window remains. Apply the same ``_writer_busy`` barrier (bounded, see below) so the two cleanup paths are symmetric. Per-path timeouts so cleanup_request can yield faster than cleanup_all: - ``_CLEANUP_ALL_TIMEOUT_S = 5.0`` — cleanup_all runs at reset / startup where blocking longer is tolerable in exchange for a stronger orphan-avoidance guarantee. - ``_CLEANUP_REQUEST_TIMEOUT_S = 2.0`` — cleanup_request is called from the scheduler's abort hot path (~3 sites) where bounding latency matters more than chasing one in-flight write. The bounded acquires have a logged-warning fallback. Without that bound a slow disk inside ``_write_safetensors_no_mx`` could deadlock the scheduler's hot path — cleanup_all() is called from request abort / reset (scheduler.py:5400 / 5596 / 5737), not just shutdown. The worst-case fallback behaviour is an orphaned file under the recreated snapshot dir until next startup's constructor cleanup, which is the exact same state the pre-fix code produced on every cleanup_all. Counter pop on cleanup_request now only runs when the ``_writer_busy`` acquire succeeded. On the timeout fallback the writer is still mid-item and may not yet have consulted ``_is_cancelled``; popping the counter there would defeat the late- rename rescue that the docstring advertises as the timeout- fallback safety net. The previous code popped unconditionally. Counter bump itself is now additive (``get + count``) and skipped entirely when ``count == 0``. Two distinct bugs that closes: - Skip-on-zero: cleanup_request("X") for an rid with NO pending items previously wrote ``cancelled[X] = 0`` then popped on the acquired path. On the timeout fallback the pop never ran and the ``X: 0`` entry lingered for the process lifetime — every later ``save()`` under that rid (or any reuse of the string) was silently discarded by the writer's ``_is_cancelled`` gates, which check key membership not value > 0. Counter must only exist when there is at least one in-flight item to drain it. Regression: test_cleanup_request_no_pending_does_not_pin_counter _on_timeout (new). - Additive: a re-entrant cleanup_request("X") for an rid that already has an in-flight cancellation must NOT overwrite the previous count. The writer's ``cleared_by_cleanup`` branch + ``_writer_busy`` lock together close the file-write race today (so no orphan file slips out), but the per-item dec_cancelled bookkeeping has to balance — overwriting drops remaining decs on the floor and on the next ``save()`` under the same rid the writer would see a non-zero counter from the earlier batch and silently discard the new item. shutdown() now accepts ``cleanup=True`` so callers that want both operations can express the cleanup-before-shutdown ordering in one call instead of sequencing them manually. The cleanup_all() warning at the top of the function catches misordered callers that still call cleanup_all() AFTER shutdown — the writer no longer reacquires ``_writer_busy`` past the sentinel, so a post-shutdown cleanup degrades to an in-memory-only clear. Tests: - ``test_cleanup_all_drains_queue`` — no longer sleeps; relies on the lock to guarantee ordering, runs deterministically. - ``test_cleanup_all_blocks_until_writer_finishes_pinned_item`` — monkey-patches ``_write_safetensors_no_mx`` to block on an Event, pins the writer mid-item, asserts cleanup_all does NOT return until the writer releases. Without the lock this test fails deterministically rather than the original ~20% flake rate. - ``test_cleanup_request_blocks_until_writer_finishes_pinned_item`` — symmetric for cleanup_request. - ``test_cleanup_request_keeps_counter_on_timeout`` — regression for the bug where the timeout-fallback pop dropped the late-rename rescue's safety net. - ``test_cleanup_request_timeout_drains_counter_on_writer_early_return`` — the writer's cleared_by_cleanup early-return must still dec_cancelled or the counter pins the rid forever. - ``test_cleanup_request_no_pending_does_not_pin_counter_on_timeout`` (new) — regression for the skip-on-zero bug above. Pins the writer with an unrelated save, calls cleanup_request("never-saved-rid") past the 0.1s timeout, asserts the rid does NOT appear in ``_cancelled_requests`` AND that a subsequent save under the same rid is not silently discarded. - ``test_shutdown_cleanup_true_runs_cleanup_before_setting_flag`` — pins the cleanup-before-shutdown ordering of the new ``shutdown(cleanup=True)`` path. 24/24 boundary-store tests pass.

fry69 · 2026-05-27T07:35:35Z

FYI: I saw

pytest tests/test_boundary_snapshot_store.py

failing in previous pytest runs, but with current main (e6d8a3f) I can no longer reproduce this flakiness.

I gave up after ~20 tries of this test file succeeding -> 15 passed in 2.57s

jundot · 2026-05-27T07:48:51Z

Thanks, the race timeline matches what I traced through _writer_loop on main, and lock order is safe (cleanup_request nests pending → cancelled, the writer keeps them in separate critical sections, no reverse nesting anywhere). Both save() callers in scheduler already handle False as in-memory fallback, so the queue-full rollback is transparent. Merging.

Catch-up sync. Highlights: - boundary-store cleanup race fix (jundot#1423) — eliminates the test_cleanup_all_drains_queue flake we have been carrying. - per-engine MLX threads / streams (jundot#1304) — multiple models stepping scheduler.step() concurrently no longer cross-contaminate streams. - VLM lazy state materialized on loader thread; skip MTPModule attach when checkpoint lacks mtp.* weights. - Dead TieredCacheManager removed; profiles three-scope template refactor (jundot#1399). 4567 pass / 3 known env-override fails / 36 skip. Zero regression. User delegated review. --- Catch-up 同步. 主要内容: - boundary-store 清理 race 修复 (jundot#1423) - 消掉一直拖着的 test_cleanup_all_drains_queue flake. - 每引擎 MLX 线程 / 流 (jundot#1304) - 多模型并发 scheduler.step() 不再 cross-contaminate. - VLM 在 loader 线程实例化 lazy 状态; checkpoint 无 mtp.* 权重时跳过 MTPModule attach. - 删 dead TieredCacheManager; profiles three-scope template 重构 (jundot#1399). 4567 pass / 3 known env-override fails / 36 skip. 零回归. 用户委托 review.

cfbraun mentioned this pull request May 26, 2026

fix(tests): repair default test failures #942

Open

cfbraun force-pushed the pr/boundary-store-race branch 2 times, most recently from 763de0c to b167e38 Compare May 27, 2026 06:43

cfbraun force-pushed the pr/boundary-store-race branch from b167e38 to 09118cd Compare May 27, 2026 07:29

jundot merged commit 4f3a9b9 into jundot:main May 27, 2026

cfbraun added a commit to cfbraun/omlx that referenced this pull request May 27, 2026

Merge upstream/main (PR jundot#1423 merge)

bf799cf

cfbraun deleted the pr/boundary-store-race branch May 27, 2026 07:52

panwudi mentioned this pull request May 27, 2026

sync: independent upstream fixes (boundary-store, per-engine threads, VLM, profiles) panwudi/flyto-mlx#8

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(boundary-store): serialize cleanup_all() / cleanup_request() with writer thread#1423

fix(boundary-store): serialize cleanup_all() / cleanup_request() with writer thread#1423
jundot merged 1 commit into
jundot:mainfrom
cfbraun:pr/boundary-store-race

cfbraun commented May 26, 2026

Uh oh!

fry69 commented May 27, 2026

Uh oh!

jundot commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

cfbraun commented May 26, 2026

Summary

Test plan

Uh oh!

fry69 commented May 27, 2026

Uh oh!

jundot commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants