[Draft] Fix Universal Compaction causing overlapping seqnos in L0 files due to overlapped seqnos between ingested files and memtable's #10776
Conversation
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@hx235 has updated the pull request. You must reimport the pull request before landing.
Update: just a draft to see if it passes make check. It has a dependency on "Fix overlapping seqnos in L0 files caused by overlapped seqnos between ingested files and memtable (FIFO Compaction case)".
@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
db/db_compaction_test.cc
I haven't investigated, but when num_level = 1 with level compaction, there will still be an L1 in vstorage and the ingested file will be placed in L1, making it difficult to force compaction to L0. So I skipped adding a UT for kCompactionStyleLevel.
[Question @ajkr] However, let me know if you still see value in investigating this and figuring it out!
I think we sanitize num_levels to be at least 3 or something in level compaction.
Oh okay, thanks! Then CompactRange + level compaction falls under the case we didn't fix in this PR.
db/db_compaction_test.cc
[Question @ajkr] I wanted to expand test coverage to num_levels > 1 for universal compaction, as mentioned in the PR summary, to cover some edge cases in PickCompaction(), but I couldn't quite force the compaction to L0. Any good ideas?
db/db_compaction_test.cc
[Question @ajkr] I tried to cover PickDeleteTriggeredCompaction() but only came up with a low-level test case that manipulates vstorage directly, like https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_picker_test.cc#L2953-L2977. Is there another way of doing this? If not, I can still go with the low-level one; better than nothing.
Actually, I found a more useful test case: TEST_P(DBCompactionTestWithParam, CompactionDeletionTrigger).
Running the stress test now; will notify @ajkr once signals are clear.
HISTORY.md
CompactRangeOptions::change_level=true and CompactRangeOptions::target_level=0: I am not sure it suffers from a bug. Perhaps it does when used together with setting non-nullptr endpoints to prevent the flush. Non-nullptr endpoints with change_level==true is weird though, and I think we should disallow it, or at least pretend the user passed us nullptr endpoints. Last discussion: #7441 (comment)
HISTORY.md
I think this was not in the 7.8.0 release
db/compaction/compaction_picker.cc
Would you mind pointing me to the places in my previous commit that indicate such an assumption? I actually didn't have this assumption in mind, but I might have made a mistake that implied it. Although I added a UT for it in my latest commit, it would be great if you could point me to a place in my previous commit so that I don't miss anything. Thanks!
A key should be compacted if it exists before the CompactRange() starts and overlaps with the range specified to CompactRange(). If the memtable were not flushed, we would miss compacting such keys that are in an L0 file overlapping memtable in seqno. This conditional is where files containing those missed keys are omitted from compaction.
Oh okay. That seems like a contradiction between the CompactRange() contract (i.e., "a key should be compacted ...") and the "skip file" approach we are pursuing here. Have we discussed a solution to this before?
Maybe leave a TODO in CompactionPicker::CompactRange() for now and fix it by not allowing such an "L0 file overlapping memtable in seqno" to exist, as you mentioned in the task about the next step for universal compaction?
No solution has been discussed before, but a necessary condition for this bug ("If the memtable were not flushed, ...") is not present in universal CompactRange(), because it always does the flush. My original point was just that this is a big assumption that might change or might not even be true if I misread the code.
What if I ingest level0_stop_writes_trigger L0 files that don't overlap with memtable and they need to be compacted to L0? Will writes be stopped forever since it won't flush and can't compact anything? This is difficult to reason about. I need to think about this approach more.
Aha, good point ... I didn't even realize there is a feature called level0_stop_writes_trigger; I need to think about this more. The current situation sounds like a deadlock to me.
A side note is that db_stress could be improved to detect deadlocks like this.
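For reference, the seqno-overlap condition debated in this thread can be sketched as below. This is a standalone illustration with a simplified stand-in struct and a hypothetical helper name, not RocksDB's actual `FileMetaData` or code:

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for an SST file's seqno range (illustration only;
// not RocksDB's FileMetaData).
struct FileSeqnoRange {
  uint64_t smallest_seqno;
  uint64_t largest_seqno;
};

// An L0 file "overlaps the memtable in seqno" when its largest seqno
// reaches at or past the memtable's earliest seqno. Skipping such files
// from compaction (the approach discussed above) means CompactRange()
// can miss keys inside them if the memtable is not flushed first.
inline bool OverlapsMemtableSeqno(const FileSeqnoRange& f,
                                  uint64_t earliest_mem_seqno) {
  return f.largest_seqno >= earliest_mem_seqno;
}
```

Under this predicate, a file ingested with an assigned seqno below the memtable's earliest seqno stays eligible for compaction, while one straddling it is skipped, which is what raises the level0_stop_writes_trigger deadlock concern above.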
[Note] A bug fix: caused by assuming the sorted_run on L0 is always a prefix of vstorage->LevelFiles(0), which is no longer true after we sanitize sorted_run in this PR.
Update: addressed some comments, added more UTs, and fixed a bug found by a new UT.
What if we set FileMetaData::largest_seqno as follows?
The idea is to play with the metadata so we don't have to worry about what files can or cannot be picked for compaction. I'm still working on an intuitive explanation for this altered meaning of largest_seqno. In case we can't find one, I also thought about a new ordering field in T133335138.
HBase does this -- https://github.com/apache/hbase/blob/984d226010558c5b5e54d283b465a645816ffb24/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/Compactor.java#L171-L172. But ever since HBase style compaction was introduced to RocksDB in 554c06d, we have been learning largest_seqno from the output file keys rather than from the input file metadata. While our way is buggy when ordering L0 files by descending largest_seqno, it still feels like we have a more intuitive meaning of largest_seqno, so probably we should introduce a different field like "L0 epoch number" to use for ordering.
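The ordering issue above can be illustrated with a standalone sketch (not RocksDB code; `epoch_number` here is the hypothetical "L0 epoch number" mentioned in the comment): when an ingested file's assigned seqno falls inside an older flushed file's range, descending-largest_seqno order calls the older file newer, while a monotonically assigned epoch number preserves true recency.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified L0 file record (illustration only; not RocksDB's FileMetaData).
struct L0File {
  uint64_t largest_seqno;  // today: learned from the output file keys
  uint64_t epoch_number;   // hypothetical counter assigned at flush/ingestion
};

// Current ordering: newest first by largest_seqno. Buggy once ingested
// files and memtable flushes produce overlapping seqno ranges.
inline bool BySeqnoDesc(const L0File& a, const L0File& b) {
  return a.largest_seqno > b.largest_seqno;
}

// Sketched alternative: newest first by epoch number, which increases
// monotonically per flush/ingestion and therefore never inverts.
inline bool ByEpochDesc(const L0File& a, const L0File& b) {
  return a.epoch_number > b.epoch_number;
}
```

For example, a flushed file {largest_seqno=20, epoch=1} followed by an ingested file assigned seqno 15 {epoch=2} sorts in the wrong order under BySeqnoDesc but correctly under ByEpochDesc.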
Closed since an alternative approach was pursued.
Context:
Same as #5958 (comment), but applies the fix to the Universal Compaction case.
Repro:
Summary:
This PR considers `earliest_mem_seqno` whenever we are compacting to L0, which includes the following three cases:
- Filtering files overlapping with `earliest_mem_seqno` out of compaction in `CalculateSortedRuns` at the beginning of `UniversalCompactionPicker::PickCompaction()`
- Handling `earliest_mem_seqno` so that previously filtered-out files can be chosen again
- Passing `earliest_mem_seqno` to the `PickXXXCompaction()`s called in `UniversalCompactionPicker::PickCompaction()`. This is because some of these functions actually use files directly from `vstorage` as input files, not from the sanitized sorted runs resulting from `CalculateSortedRuns`. This includes the following (for test coverage of each case, see Test):
  - `PickPeriodicCompaction()` for multi-level
  - `PickCompactionToReduceSizeAmp()` for multi-level
  - `PickCompactionToReduceSortedRuns()` for multi-level
  - `PickDeleteTriggeredCompaction()` for both single- and multi-level
- `vstorage->GetOverlappingInputs()` / `GetOverlappingInputsRangeBinarySearch()`
- `ExpandInputsToCleanCut()`
- `SetupOtherInputs()`
- OutputPathId: `DBTestUniversalManualCompactionOutputPathId.ManualCompactionOutputPathId/0,1` failed due to a bug about setting an incorrect EarliestSequenceNumber of the memtable in `DB::Reopen()`. Therefore I had to remove the reopening part in this PR (d4a2129#diff-3d401dbe63b9d238f4c6ed8ae4f3b4b3b6948be976d902b7a41021b208a06192L1875-L1879) to avoid the bug and pass.
- `CompactionPicker::CompactFiles`: fixed in "Fix FIFO causing overlapping seqnos in L0 files due to overlapped seqnos between ingested files and memtable's" #10777
- `earliest_mem_seqno` is required in more functions, including those shared with level compaction in `CompactionPicker`, `DBImpl` (i.e., `DBImpl::DeleteFilesInRanges`) and `Compaction` (i.e., `Compaction::IsTrivialMove()`). For these callers `kMaxSequenceNumber` is passed in to limit this PR's scope.
- `DBCompactionTestL0FilesReorderingCorruption` and adopted the comment
- Moved the `TEST_SYNC_POINT` location for the FIFO-related unit test to make it more resistant to accidental code changes breaking the test.
- `VersionStorageInfo::files_`
Test:
- `DBCompactionTestL0FilesReorderingCorruption` covers all the above cases that fail before the fix and pass after, except for `UniversalCompactionPicker::PickCompaction()`; see the TODO in the unit test for the challenges of adding one.
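The sanitization step described in the summary, dropping L0 files whose seqnos overlap the memtable's when forming candidate sorted runs, can be sketched roughly as follows. This is a standalone illustration; the struct, the function name `SanitizeL0Files`, and the exact predicate are assumptions for exposition, not RocksDB's actual `CalculateSortedRuns`:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified L0 file record (illustration only).
struct L0FileInfo {
  uint64_t smallest_seqno;
  uint64_t largest_seqno;
};

// Hedged sketch: when picking a compaction that outputs to L0, keep only
// L0 files whose seqnos lie entirely below the live memtable's earliest
// seqno, so the compaction output cannot interleave with seqnos still in
// the memtable. Passing the maximum seqno (as the PR does for callers it
// leaves out of scope via kMaxSequenceNumber) disables the filter.
std::vector<L0FileInfo> SanitizeL0Files(const std::vector<L0FileInfo>& files,
                                        uint64_t earliest_mem_seqno) {
  std::vector<L0FileInfo> picked;
  for (const auto& f : files) {
    if (f.largest_seqno < earliest_mem_seqno) {
      picked.push_back(f);
    }
  }
  return picked;
}
```

A design consequence visible even in this sketch is the one debated in the review thread: filtered-out files are invisible to the picker, so something else (a later flush, or re-running with the filter disabled) must eventually make them compactable again.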