fix(boundary-store): serialize cleanup_all() / cleanup_request() with writer thread#1423
Merged
Merged
Conversation
763de0c to
b167e38
Compare
…th the writer thread
test_cleanup_all_drains_queue failed ~20% of the time, in isolation
and in suites. The race:
T0 save() puts item in queue
T1 writer queue.get() pulls item (already off the queue)
T2 cleanup_all() drains queue — finds it empty
T3 cleanup_all() rmtree(snapshot_dir) + mkdir(snapshot_dir)
T4 writer mkdir(req_dir) + temp write + rename(final)
Result: ``req-X/N.safetensors`` survives the supposed cleanup, the
caller (scheduler / shutdown) believes the snapshot dir is empty, and
the leftover file is the disk shadow of a request that should have
been forgotten.
Add ``_writer_busy`` (threading.Lock). The writer holds it for the
entire duration of each ``_process_write_item`` call. ``cleanup_all()``
drains the queue first (no new items can start) and then acquires
``_writer_busy``, so any item the writer had already pulled finishes
before rmtree runs.
Same class of race exists for ``cleanup_request()`` — writer pulls a
request's item before cleanup_request sets the cancelled-counter, then
mkdir's the req_dir back into existence after the rmtree, and the
late-rename rescue at the tail of ``_process_write_item`` catches it
in most timings but a small window remains. Apply the same
``_writer_busy`` barrier (bounded, see below) so the two cleanup paths
are symmetric.
Per-path timeouts so cleanup_request can yield faster than
cleanup_all:
- ``_CLEANUP_ALL_TIMEOUT_S = 5.0`` — cleanup_all runs at reset /
startup where blocking longer is tolerable in exchange for a
stronger orphan-avoidance guarantee.
- ``_CLEANUP_REQUEST_TIMEOUT_S = 2.0`` — cleanup_request is called
from the scheduler's abort hot path (~3 sites) where bounding
latency matters more than chasing one in-flight write.
The bounded acquires have a logged-warning fallback. Without that
bound a slow disk inside ``_write_safetensors_no_mx`` could deadlock
the scheduler's hot path — cleanup_all() is called from request abort
/ reset (scheduler.py:5400 / 5596 / 5737), not just shutdown. The
worst-case fallback behaviour is an orphaned file under the recreated
snapshot dir until next startup's constructor cleanup, which is the
exact same state the pre-fix code produced on every cleanup_all.
Counter pop on cleanup_request now only runs when the
``_writer_busy`` acquire succeeded. On the timeout fallback the
writer is still mid-item and may not yet have consulted
``_is_cancelled``; popping the counter there would defeat the late-
rename rescue that the docstring advertises as the timeout-
fallback safety net. The previous code popped unconditionally.
Counter bump itself is now additive (``get + count``) and skipped
entirely when ``count == 0``. Two distinct bugs that closes:
- Skip-on-zero: cleanup_request("X") for an rid with NO pending
items previously wrote ``cancelled[X] = 0`` then popped on the
acquired path. On the timeout fallback the pop never ran and the
``X: 0`` entry lingered for the process lifetime — every later
``save()`` under that rid (or any reuse of the string) was
silently discarded by the writer's ``_is_cancelled`` gates,
which check key membership not value > 0. Counter must only
exist when there is at least one in-flight item to drain it.
Regression: test_cleanup_request_no_pending_does_not_pin_counter
_on_timeout (new).
- Additive: a re-entrant cleanup_request("X") for an rid that
already has an in-flight cancellation must NOT overwrite the
previous count. The writer's ``cleared_by_cleanup`` branch +
``_writer_busy`` lock together close the file-write race today
(so no orphan file slips out), but the per-item dec_cancelled
bookkeeping has to balance — overwriting drops remaining decs on
the floor and on the next ``save()`` under the same rid the
writer would see a non-zero counter from the earlier batch and
silently discard the new item.
shutdown() now accepts ``cleanup=True`` so callers that want both
operations can express the cleanup-before-shutdown ordering in one
call instead of sequencing them manually. The cleanup_all() warning
at the top of the function catches misordered callers that still
call cleanup_all() AFTER shutdown — the writer no longer reacquires
``_writer_busy`` past the sentinel, so a post-shutdown cleanup
degrades to an in-memory-only clear.
Tests:
- ``test_cleanup_all_drains_queue`` — no longer sleeps; relies on the
lock to guarantee ordering, runs deterministically.
- ``test_cleanup_all_blocks_until_writer_finishes_pinned_item`` —
monkey-patches ``_write_safetensors_no_mx`` to block on an Event,
pins the writer mid-item, asserts cleanup_all does NOT return until
the writer releases. Without the lock this test fails
deterministically rather than the original ~20% flake rate.
- ``test_cleanup_request_blocks_until_writer_finishes_pinned_item`` —
symmetric for cleanup_request.
- ``test_cleanup_request_keeps_counter_on_timeout`` — regression for
the bug where the timeout-fallback pop dropped the late-rename
rescue's safety net.
- ``test_cleanup_request_timeout_drains_counter_on_writer_early_return``
— the writer's cleared_by_cleanup early-return must still
dec_cancelled or the counter pins the rid forever.
- ``test_cleanup_request_no_pending_does_not_pin_counter_on_timeout``
(new) — regression for the skip-on-zero bug above. Pins the writer
with an unrelated save, calls cleanup_request("never-saved-rid")
past the 0.1s timeout, asserts the rid does NOT appear in
``_cancelled_requests`` AND that a subsequent save under the same
rid is not silently discarded.
- ``test_shutdown_cleanup_true_runs_cleanup_before_setting_flag`` —
pins the cleanup-before-shutdown ordering of the new
``shutdown(cleanup=True)`` path.
24/24 boundary-store tests pass.
b167e38 to
09118cd
Compare
Contributor
|
FYI: I saw pytest tests/test_boundary_snapshot_store.pyfailing in previous I gave up after ~20 tries of this test file succeeding -> |
Owner
|
Thanks, the race timeline matches what I traced through |
cfbraun
added a commit
to cfbraun/omlx
that referenced
this pull request
May 27, 2026
4 tasks
panwudi
added a commit
to panwudi/flyto-mlx
that referenced
this pull request
May 27, 2026
Catch-up sync. Highlights: - boundary-store cleanup race fix (jundot#1423) — eliminates the test_cleanup_all_drains_queue flake we have been carrying. - per-engine MLX threads / streams (jundot#1304) — multiple models stepping scheduler.step() concurrently no longer cross-contaminate streams. - VLM lazy state materialized on loader thread; skip MTPModule attach when checkpoint lacks mtp.* weights. - Dead TieredCacheManager removed; profiles three-scope template refactor (jundot#1399). 4567 pass / 3 known env-override fails / 36 skip. Zero regression. User delegated review. --- Catch-up 同步. 主要内容: - boundary-store 清理 race 修复 (jundot#1423) - 消掉一直拖着的 test_cleanup_all_drains_queue flake. - 每引擎 MLX 线程 / 流 (jundot#1304) - 多模型并发 scheduler.step() 不再 cross-contaminate. - VLM 在 loader 线程实例化 lazy 状态; checkpoint 无 mtp.* 权重时 跳过 MTPModule attach. - 删 dead TieredCacheManager; profiles three-scope template 重构 (jundot#1399). 4567 pass / 3 known env-override fails / 36 skip. 零回归. 用户委托 review.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
test_cleanup_all_drains_queuewas failing ~20% of the time, in isolation and in suites. The race:Result:
req-X/N.safetensorssurvives the supposed cleanup, the caller (scheduler / shutdown) believes the snapshot dir is empty, and the leftover file is the disk shadow of a request that should have been forgotten.The fix introduces a
_writer_busybarrier socleanup_all()andcleanup_request()wait for the in-flight writer iteration to complete before touching the filesystem. Both methods now serialize correctly against the writer thread.Also pins the cleanup-before-shutdown ordering of the new
shutdown(cleanup=True)path with a regression test.Test plan
pytest tests/test_boundary_snapshot_store.py— 24 passed (previously flaked ~20%)test_cleanup_all_with_in_flight_writerregression test addedtest_shutdown_cleanup_true_runs_cleanup_before_setting_flagregression test added