Fix BufferedDynamicFlags flushing with concurrent drop/deletion#7621
Fix BufferedDynamicFlags flushing with concurrent drop/deletion#7621
Conversation
|
Yeah, it makes sense to me that we see it on this particular storage/flusher. See, I believe this problem only appears when we try to reopen the file. In almost all storage types we don't and keep a file handle. When we remove the file, the file stays accessible through then handle as long as we keep it and we can still write to it. The kernel silently cleans it up once all handles are closed. This storage does indeed try to reopen the file when the number of flags grows, breaking the pattern. We reopen through So our other payload indices suffer from a similar problem. We just don't see an error report for it, as all late writes are silently voided. I believe we should audit all our storage types for similar problems. Just thinking out loud here. |
|
|
||
| impl Drop for BufferedDynamicFlags { | ||
| fn drop(&mut self) { | ||
| // Wait for all background flush operations to finish |
There was a problem hiding this comment.
| // Wait for all background flush operations to finish | |
| // Wait for all background flush operations to finish, and cancel future flushes |
| // Weak reference to detect when the storage has been deleted | ||
| let flags_arc = Arc::downgrade(&self.storage); |
There was a problem hiding this comment.
I did similarly for #7624, this is an improvement so that we don't start flushing a dropped component 👍
| /// Lock to prevent concurrent flush and drop | ||
| pub is_alive_flush_lock: Arc<Mutex<bool>>, |
There was a problem hiding this comment.
After some consideration, I am not sure this will actually prevent trying to interact with an inexistent file/directory. In PayloadIndex::drop_index we actually remove the directory BEFORE dropping the index.
This lock only makes the dropping wait for the end of the flush, but the files may already be deleted externally.
Please correct me if I missed something 🤔
There was a problem hiding this comment.
Good call. That is a problem!
We must only delete files once we unload the payload index structure.
I am quite sure it's important to only delete after also releasing all files handles. Because Windows doesn't allow us to delete files which we still have open. Though then I'm also a bit puzzled on why we haven't seen issues about this. Maybe it needs a specific setup, like explicitly deleting a payload index which may not be very common.
There was a problem hiding this comment.
Thanks I did push a fix to order the drop properly 34d031d
This lock only makes the dropping wait for the end of the flush, but the files may already be deleted externally.
My guess it that the original patch worked for me locally because the flush would happen after the drop anyway.
There is not a lot of time to squeeze the flusher between the FS remove and the implicit drop.
| // Keep the guard till the end of the flush to prevent concurrent drop/flushes | ||
| let is_alive_flush_guard = is_alive_flush_lock.lock(); | ||
|
|
||
| if !*is_alive_flush_guard { |
There was a problem hiding this comment.
We should be able to avoid this check by also using a weak reference to the lock. Then it would also simplify the mutex to just hold a unit struct ()
* Fix BufferedDynamicFlags flushing with concurrent drop/deletion * make sure to drop before fs remove

Fixes flushing error happening when deleting a payload index concurrently with flushing.
This PR makes sure that drop and flushing are exclusive and uses a weak reference on storage to not trip over a deleted folder.
This was tested manually with
bfband adding/removing a keyword payload index.