Skip to content

MB-66163: always protect the latest snapshot within bolt#2183

Merged
Thejas-bhat merged 8 commits into
masterfrom
boltBug
Apr 24, 2025
Merged

MB-66163: always protect the latest snapshot within bolt#2183
Thejas-bhat merged 8 commits into
masterfrom
boltBug

Conversation

@Thejas-bhat

@Thejas-bhat Thejas-bhat commented Apr 21, 2025

Copy link
Copy Markdown
Member
  • Currently because of the time-series based checkpointing mechanism, there can be edge cases where we don't end up protecting the very latest snapshot that's there in the bolt file.
  • This can cause some serious issues along the persister path, where the newly created file segments that's part of the new snapshot written to bolt can potentially get cleaned up when the purge code removeOldData is invoked. If there were a backup operation indexReader.CopyTo() to be performed just after this purge operation, the latest root still has a pointer to the file segment but the underlying file itself would have been purged because its not part of ineligibleForRemoval structure and also the corresponding bolt entry would have been deleted.
  • The issue highlighted has also been unit tested and is consistently reproducable.
    rollback_test.go:559: error copying the index: error backing up index snapshot: segment: /var/folders/5n/g80bkslj5qn6xn8srh6pxv_r0000gp/T/bleve-scorch-test-TestLatestSnapshotCheckpointed/000000000004.zap copy err: stat /var/folders/5n/g80bkslj5qn6xn8srh6pxv_r0000gp/T/bleve-scorch-test-TestLatestSnapshotCheckpointed/000000000004.zap: no such file or directory
    
  • The PR addresses this issue by firstly preserving the latest snapshot within bolt and n-1 time-series based checkpoints and also marking the file as ineligible for removal as an layer of protection.
  • There is also a bug fix along a specific merger path where the old snapshot's ref count wasn't being decremented appropriately which can potentially cause file leak and memory bloat due to invalid snapshots still being present.

@abhinavdangeti abhinavdangeti added this to the v2.5.1 milestone Apr 21, 2025
@Thejas-bhat Thejas-bhat marked this pull request as ready for review April 22, 2025 15:24
@Thejas-bhat Thejas-bhat changed the title MB-66163: always preserve latest snapshot within bolt MB-66163: always protect the latest snapshot within bolt Apr 22, 2025
abhinavdangeti
abhinavdangeti previously approved these changes Apr 22, 2025
@Thejas-bhat Thejas-bhat force-pushed the boltBug branch 2 times, most recently from 54150a1 to 3a08bed Compare April 23, 2025 11:44
Comment thread index/scorch/event.go Outdated
abhinavdangeti
abhinavdangeti previously approved these changes Apr 23, 2025
Comment thread index/scorch/rollback_test.go Outdated

@abhinavdangeti abhinavdangeti left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid diagnosis & fix @Thejas-bhat !

@Thejas-bhat Thejas-bhat merged commit 9dc5de8 into master Apr 24, 2025
Thejas-bhat added a commit that referenced this pull request Apr 24, 2025
- Currently because of the time-series based checkpointing mechanism,
there can be edge cases where we don't end up protecting the very latest
snapshot that's there in the bolt file.
- This can cause some serious issues along the persister path, where the
newly created file segments that's part of the new snapshot written to
bolt can potentially get cleaned up when the purge code `removeOldData`
is invoked. If there were a backup operation `indexReader.CopyTo()` to
be performed just after this purge operation, the latest root still has
a pointer to the file segment but the underlying file itself would have
been purged because its not part of `ineligibleForRemoval` structure and
also the corresponding bolt entry would have been deleted.
- The issue highlighted has also been unit tested and is consistently
reproducable.
  ```
rollback_test.go:559: error copying the index: error backing up index
snapshot: segment:
/var/folders/5n/g80bkslj5qn6xn8srh6pxv_r0000gp/T/bleve-scorch-test-TestLatestSnapshotCheckpointed/000000000004.zap
copy err: stat
/var/folders/5n/g80bkslj5qn6xn8srh6pxv_r0000gp/T/bleve-scorch-test-TestLatestSnapshotCheckpointed/000000000004.zap:
no such file or directory
  ```
- The PR addresses this issue by firstly preserving the latest snapshot
within bolt and n-1 time-series based checkpoints and also marking the
file as ineligible for removal as an layer of protection.
- There is also a bug fix along a specific merger path where the old
snapshot's ref count wasn't being decremented appropriately which can
potentially cause file leak and memory bloat due to invalid snapshots
still being present.
Thejas-bhat added a commit that referenced this pull request Apr 24, 2025
Backporting #2183 for couchbase 7.6.x release cycle builds.
@abhinavdangeti abhinavdangeti deleted the boltBug branch April 24, 2025 13:57
abhinavdangeti pushed a commit that referenced this pull request May 6, 2025
Backporting #2183 for couchbase 7.6.x release cycle builds.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants