Skip to content

Fix flusher data race in Gridstore#7702

Merged
timvisee merged 10 commits intodevfrom
fix-gridstore-flush-data-race
Dec 9, 2025
Merged

Fix flusher data race in Gridstore#7702
timvisee merged 10 commits intodevfrom
fix-gridstore-flush-data-race

Conversation

@timvisee
Copy link
Member

@timvisee timvisee commented Dec 5, 2025

In #7624 and #7627 we improved flushing for Gridstore.

But, there was still a data race left when wiping/clearing Gridstore. After wiping, it was still possible for a pending (now invalid) flush task to write data to disk. We did add a barrier in wipe, but that didn't prevent a flusher task to run one more time.

This sequence of events was possible:

  • flusher task is created
  • flusher task is invoked and upgrades Arc's
  • wipe is called, falls through bitmask barrier, and deletes files on disk
  • flusher task now obtains bitmask lock, and flushes data

I use the same approach as in other places - an 'is alive' flag (for example).

The added test breaks on the old flush behavior. You can try this by commenting out this line:

// Wait for all background flush operations to finish, abort pending flushes
// Below we create a new Gridstore instance with a new boolean, so flushers created on the
// new instance work as expected
*self.is_alive_flush_lock.lock() = false;

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully ran tests with your changes locally?

@timvisee timvisee marked this pull request as ready for review December 5, 2025 16:26
coderabbitai[bot]

This comment was marked as resolved.

@qdrant qdrant deleted a comment from coderabbitai bot Dec 5, 2025
@qdrant qdrant deleted a comment from coderabbitai bot Dec 5, 2025
coderabbitai[bot]

This comment was marked as resolved.

@qdrant qdrant deleted a comment from coderabbitai bot Dec 5, 2025
@timvisee timvisee marked this pull request as draft December 5, 2025 17:23
@timvisee timvisee force-pushed the fix-gridstore-flush-data-race branch from 35f26fa to 3aeaa50 Compare December 8, 2025 16:20
@timvisee timvisee marked this pull request as ready for review December 8, 2025 16:23
coderabbitai[bot]

This comment was marked as resolved.

@qdrant qdrant deleted a comment from coderabbitai bot Dec 8, 2025
@timvisee timvisee force-pushed the fix-gridstore-flush-data-race branch from 857d3e0 to f79199b Compare December 8, 2025 16:33
@qdrant qdrant deleted a comment from coderabbitai bot Dec 8, 2025
coderabbitai[bot]

This comment was marked as resolved.

@qdrant qdrant deleted a comment from coderabbitai bot Dec 8, 2025
@qdrant qdrant deleted a comment from coderabbitai bot Dec 9, 2025
@timvisee timvisee merged commit bdd3871 into dev Dec 9, 2025
18 of 21 checks passed
@timvisee timvisee deleted the fix-gridstore-flush-data-race branch December 9, 2025 10:23
timvisee added a commit that referenced this pull request Dec 18, 2025
* Add lease structure to invalidate pending flushers after wipe/clear

* We don't use bitmask as barrier anymore

* Add missing early return

* Add test

* Replace flush lease with is alive boolean we also use elsewhere

* Reimplement test, now it does fail on the old implementation

* In test, cover both new fixed and old broken flushing

* Remove old test case because it is flaky

* Fix clippy

* Fix outdated comment
@timvisee timvisee mentioned this pull request Dec 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants