Skip to content

fix(boundary-store): serialize cleanup_all() / cleanup_request() with writer thread#1423

Merged
jundot merged 1 commit into
jundot:mainfrom
cfbraun:pr/boundary-store-race
May 27, 2026
Merged

fix(boundary-store): serialize cleanup_all() / cleanup_request() with writer thread#1423
jundot merged 1 commit into
jundot:mainfrom
cfbraun:pr/boundary-store-race

Conversation

@cfbraun

@cfbraun cfbraun commented May 26, 2026

Copy link
Copy Markdown
Contributor

Summary

test_cleanup_all_drains_queue was failing ~20% of the time, in isolation and in suites. The race:

T0 save() puts item in queue
T1 writer queue.get() pulls item (already off the queue)
T2 cleanup_all() drains queue — finds it empty
T3 cleanup_all() rmtree(snapshot_dir) + mkdir(snapshot_dir)
T4 writer mkdir(req_dir) + temp write + rename(final)

Result: req-X/N.safetensors survives the supposed cleanup, the caller (scheduler / shutdown) believes the snapshot dir is empty, and the leftover file is the disk shadow of a request that should have been forgotten.

The fix introduces a _writer_busy barrier so cleanup_all() and cleanup_request() wait for the in-flight writer iteration to complete before touching the filesystem. Both methods now serialize correctly against the writer thread.

Also pins the cleanup-before-shutdown ordering of the new shutdown(cleanup=True) path with a regression test.

Test plan

  • pytest tests/test_boundary_snapshot_store.py — 24 passed (previously flaked ~20%)
  • test_cleanup_all_with_in_flight_writer regression test added
  • test_shutdown_cleanup_true_runs_cleanup_before_setting_flag regression test added

@cfbraun cfbraun force-pushed the pr/boundary-store-race branch 2 times, most recently from 763de0c to b167e38 Compare May 27, 2026 06:43
…th the writer thread

test_cleanup_all_drains_queue failed ~20% of the time, in isolation
and in suites. The race:

  T0 save() puts item in queue
  T1 writer queue.get() pulls item (already off the queue)
  T2 cleanup_all() drains queue — finds it empty
  T3 cleanup_all() rmtree(snapshot_dir) + mkdir(snapshot_dir)
  T4 writer mkdir(req_dir) + temp write + rename(final)

Result: ``req-X/N.safetensors`` survives the supposed cleanup, the
caller (scheduler / shutdown) believes the snapshot dir is empty, and
the leftover file is the disk shadow of a request that should have
been forgotten.

Add ``_writer_busy`` (threading.Lock). The writer holds it for the
entire duration of each ``_process_write_item`` call. ``cleanup_all()``
drains the queue first (no new items can start) and then acquires
``_writer_busy``, so any item the writer had already pulled finishes
before rmtree runs.

Same class of race exists for ``cleanup_request()`` — writer pulls a
request's item before cleanup_request sets the cancelled-counter, then
mkdir's the req_dir back into existence after the rmtree, and the
late-rename rescue at the tail of ``_process_write_item`` catches it
in most timings but a small window remains. Apply the same
``_writer_busy`` barrier (bounded, see below) so the two cleanup paths
are symmetric.

Per-path timeouts so cleanup_request can yield faster than
cleanup_all:

- ``_CLEANUP_ALL_TIMEOUT_S = 5.0`` — cleanup_all runs at reset /
  startup where blocking longer is tolerable in exchange for a
  stronger orphan-avoidance guarantee.
- ``_CLEANUP_REQUEST_TIMEOUT_S = 2.0`` — cleanup_request is called
  from the scheduler's abort hot path (~3 sites) where bounding
  latency matters more than chasing one in-flight write.

The bounded acquires have a logged-warning fallback. Without that
bound a slow disk inside ``_write_safetensors_no_mx`` could deadlock
the scheduler's hot path — cleanup_all() is called from request abort
/ reset (scheduler.py:5400 / 5596 / 5737), not just shutdown. The
worst-case fallback behaviour is an orphaned file under the recreated
snapshot dir until next startup's constructor cleanup, which is the
exact same state the pre-fix code produced on every cleanup_all.

Counter pop on cleanup_request now only runs when the
``_writer_busy`` acquire succeeded. On the timeout fallback the
writer is still mid-item and may not yet have consulted
``_is_cancelled``; popping the counter there would defeat the late-
rename rescue that the docstring advertises as the timeout-
fallback safety net. The previous code popped unconditionally.

Counter bump itself is now additive (``get + count``) and skipped
entirely when ``count == 0``. Two distinct bugs that closes:

- Skip-on-zero: cleanup_request("X") for an rid with NO pending
  items previously wrote ``cancelled[X] = 0`` then popped on the
  acquired path. On the timeout fallback the pop never ran and the
  ``X: 0`` entry lingered for the process lifetime — every later
  ``save()`` under that rid (or any reuse of the string) was
  silently discarded by the writer's ``_is_cancelled`` gates,
  which check key membership not value > 0. Counter must only
  exist when there is at least one in-flight item to drain it.
  Regression: test_cleanup_request_no_pending_does_not_pin_counter
  _on_timeout (new).
- Additive: a re-entrant cleanup_request("X") for an rid that
  already has an in-flight cancellation must NOT overwrite the
  previous count. The writer's ``cleared_by_cleanup`` branch +
  ``_writer_busy`` lock together close the file-write race today
  (so no orphan file slips out), but the per-item dec_cancelled
  bookkeeping has to balance — overwriting drops remaining decs on
  the floor and on the next ``save()`` under the same rid the
  writer would see a non-zero counter from the earlier batch and
  silently discard the new item.

shutdown() now accepts ``cleanup=True`` so callers that want both
operations can express the cleanup-before-shutdown ordering in one
call instead of sequencing them manually. The cleanup_all() warning
at the top of the function catches misordered callers that still
call cleanup_all() AFTER shutdown — the writer no longer reacquires
``_writer_busy`` past the sentinel, so a post-shutdown cleanup
degrades to an in-memory-only clear.

Tests:
- ``test_cleanup_all_drains_queue`` — no longer sleeps; relies on the
  lock to guarantee ordering, runs deterministically.
- ``test_cleanup_all_blocks_until_writer_finishes_pinned_item`` —
  monkey-patches ``_write_safetensors_no_mx`` to block on an Event,
  pins the writer mid-item, asserts cleanup_all does NOT return until
  the writer releases. Without the lock this test fails
  deterministically rather than the original ~20% flake rate.
- ``test_cleanup_request_blocks_until_writer_finishes_pinned_item`` —
  symmetric for cleanup_request.
- ``test_cleanup_request_keeps_counter_on_timeout`` — regression for
  the bug where the timeout-fallback pop dropped the late-rename
  rescue's safety net.
- ``test_cleanup_request_timeout_drains_counter_on_writer_early_return``
  — the writer's cleared_by_cleanup early-return must still
  dec_cancelled or the counter pins the rid forever.
- ``test_cleanup_request_no_pending_does_not_pin_counter_on_timeout``
  (new) — regression for the skip-on-zero bug above. Pins the writer
  with an unrelated save, calls cleanup_request("never-saved-rid")
  past the 0.1s timeout, asserts the rid does NOT appear in
  ``_cancelled_requests`` AND that a subsequent save under the same
  rid is not silently discarded.
- ``test_shutdown_cleanup_true_runs_cleanup_before_setting_flag`` —
  pins the cleanup-before-shutdown ordering of the new
  ``shutdown(cleanup=True)`` path.

24/24 boundary-store tests pass.
@cfbraun cfbraun force-pushed the pr/boundary-store-race branch from b167e38 to 09118cd Compare May 27, 2026 07:29
@fry69

fry69 commented May 27, 2026

Copy link
Copy Markdown
Contributor

FYI: I saw

pytest tests/test_boundary_snapshot_store.py

failing in previous pytest runs, but with current main (e6d8a3f) I can no longer reproduce this flakiness.

I gave up after ~20 tries of this test file succeeding -> 15 passed in 2.57s

@jundot

jundot commented May 27, 2026

Copy link
Copy Markdown
Owner

Thanks, the race timeline matches what I traced through _writer_loop on main, and lock order is safe (cleanup_request nests pending → cancelled, the writer keeps them in separate critical sections, no reverse nesting anywhere). Both save() callers in scheduler already handle False as in-memory fallback, so the queue-full rollback is transparent. Merging.

@jundot jundot merged commit 4f3a9b9 into jundot:main May 27, 2026
cfbraun added a commit to cfbraun/omlx that referenced this pull request May 27, 2026
@cfbraun cfbraun deleted the pr/boundary-store-race branch May 27, 2026 07:52
panwudi added a commit to panwudi/flyto-mlx that referenced this pull request May 27, 2026
Catch-up sync. Highlights:
- boundary-store cleanup race fix (jundot#1423) — eliminates the
  test_cleanup_all_drains_queue flake we have been carrying.
- per-engine MLX threads / streams (jundot#1304) — multiple models stepping
  scheduler.step() concurrently no longer cross-contaminate streams.
- VLM lazy state materialized on loader thread; skip MTPModule attach
  when checkpoint lacks mtp.* weights.
- Dead TieredCacheManager removed; profiles three-scope template
  refactor (jundot#1399).

4567 pass / 3 known env-override fails / 36 skip. Zero regression.
User delegated review.

---

Catch-up 同步. 主要内容:
- boundary-store 清理 race 修复 (jundot#1423) - 消掉一直拖着的
  test_cleanup_all_drains_queue flake.
- 每引擎 MLX 线程 / 流 (jundot#1304) - 多模型并发 scheduler.step()
  不再 cross-contaminate.
- VLM 在 loader 线程实例化 lazy 状态; checkpoint 无 mtp.* 权重时
  跳过 MTPModule attach.
- 删 dead TieredCacheManager; profiles three-scope template 重构
  (jundot#1399).

4567 pass / 3 known env-override fails / 36 skip. 零回归.
用户委托 review.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants