
[Draft] Fix Universal Compaction causing overlapping seqnos in L0 files due to overlapped seqnos between ingested files and memtable's #10776

Closed
hx235 wants to merge 2 commits into facebook:main from hx235:overlap_file_ingest_with_memtable

Conversation

@hx235
Contributor

@hx235 hx235 commented Oct 5, 2022

Context:
Same as #5958 (comment), but applying the fix to the Universal Compaction case.

Repro:

COERCE_CONTEXT_SWITCH=1 make -j56 db_stress

./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=100 --compaction_pri=2 --compaction_style=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test2/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 
--max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_files=100 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35 

put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file #4153 with seqno 0 101850 vs. file #4164 with seqno 101187 101820

Summary:
This PR considers earliest_mem_seqno whenever we are compacting to L0, which covers the following three cases.

  • UniversalCompactionPicker::PickCompaction() when compacting to L0 level
    • Like the fix in Fix corruption with intra-L0 on ingested files #5958 (comment), we exclude files whose largest seqno is greater than earliest_mem_seqno from compaction.
    • Firstly, this is done in CalculateSortedRuns at the beginning of UniversalCompactionPicker::PickCompaction()
      • However, we don't know the output level at this point, so we end up filtering out such files even when we are not compacting to L0.
      • Would this be problematic? Probably it is fine: once we accumulate some data in the memtable and earliest_mem_seqno increases, previously filtered-out files can be chosen again.
    • Secondly, this is done in some of the PickXXXCompaction()s called in UniversalCompactionPicker::PickCompaction(). This is because some of these functions use files directly from vstorage as input files, rather than the sanitized sorted runs produced by CalculateSortedRuns. This includes the following; for test coverage of each case, see Test.
      • PickPeriodicCompaction() for multi-level
      • PickCompactionToReduceSizeAmp() for multi-level
      • PickCompactionToReduceSortedRuns() for multi-level
      • PickDeleteTriggeredCompaction() for both single and multi-level.
        • For this path, I also fixed a bug caused by assuming sorted_run on L0 is always a prefix of vstorage->LevelFiles(0), which is no longer true after we sanitize sorted_run in this PR.
  • CompactionPicker::CompactRange when compacting to L0 level
  • CompactionPicker::CompactFiles: fixed in Fix FIFO causing overlapping seqnos in L0 files due to overlapped seqnos between ingested files and memtable's #10777
  • Some clean-up include:
    • Made earliest_mem_seqno required in more functions, including those shared with level compaction in CompactionPicker, DBImpl (i.e., DBImpl::DeleteFilesInRanges) and Compaction (i.e., Compaction::IsTrivialMove()). For these callers, kMaxSequenceNumber is passed in to limit this PR's scope.
    • Consolidated the unit tests with Fix FIFO causing overlapping seqnos in L0 files due to overlapped seqnos between ingested files and memtable's #10777 under DBCompactionTestL0FilesReorderingCorruption and adopted the comment
      • Also updated TEST_SYNC_POINT location for fifo related unit test to make it more resistant to accidental code change breaking the test.
    • Clarified the comment about the basis of ordering for VersionStorageInfo::files_

Test:
  • New unit tests under DBCompactionTestL0FilesReorderingCorruption cover all the above cases that fail before the fix and pass after, except for
    • multi-level for UniversalCompactionPicker::PickCompaction(), see TODO in unit test for challenges to add one
  • make check
  • [Ongoing] Regular CI stress run on this PR + stress test with aggressive values ([CI only] Aggressive crash test value #10761), including runs with Universal Compaction only

@facebook-github-bot
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 changed the title [Draft] Fix overlapping seqno between files in L0 due to file ingestion's seqno overlapping with memtable's for Universal Compaction [Draft] Fix overlapping seqno between files in L0 Universal Compaction due to file ingestion's seqno overlapping with memtable's Oct 5, 2022
@hx235 hx235 changed the title [Draft] Fix overlapping seqno between files in L0 Universal Compaction due to file ingestion's seqno overlapping with memtable's [Draft] Fix overlapping seqno between files in L0 due to incorrectly including ingested files for Universal Compaction Oct 5, 2022
@hx235 hx235 force-pushed the overlap_file_ingest_with_memtable branch from 3c7b3c3 to 9862123 Compare October 5, 2022 19:17
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 changed the title [Draft] Fix overlapping seqno between files in L0 due to incorrectly including ingested files for Universal Compaction [Draft] Fix overlapping seqnos in L0 files cased by overlapped seqnos between ingested files and memtable (Universal Compaction case) Oct 6, 2022
@hx235 hx235 force-pushed the overlap_file_ingest_with_memtable branch from 9862123 to 30f1120 Compare October 8, 2022 00:54
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 force-pushed the overlap_file_ingest_with_memtable branch from 30f1120 to 90fbb4c Compare October 11, 2022 23:40
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235
Contributor Author

hx235 commented Oct 11, 2022

Update: just a draft to see if it passes make check. Have some dependency on Fix overlapping seqnos in L0 files cased by overlapped seqnos between ingested files and memtable (FIFO Compaction case)
#10777

@hx235 hx235 force-pushed the overlap_file_ingest_with_memtable branch from 90fbb4c to d3c5c9b Compare October 12, 2022 02:02
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 force-pushed the overlap_file_ingest_with_memtable branch from d3c5c9b to 7ba76cd Compare October 12, 2022 21:10
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 force-pushed the overlap_file_ingest_with_memtable branch from 7ba76cd to 564fd4a Compare October 12, 2022 21:17
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 changed the title [Draft] Fix overlapping seqnos in L0 files cased by overlapped seqnos between ingested files and memtable (Universal Compaction case) [Draft] Fix overlapping seqnos in L0 files caused by overlapped seqnos between ingested files and memtable (Universal Compaction case) Oct 12, 2022
@hx235 hx235 force-pushed the overlap_file_ingest_with_memtable branch from 564fd4a to 3c6f98b Compare October 12, 2022 21:34
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 force-pushed the overlap_file_ingest_with_memtable branch from 3c6f98b to d340eef Compare October 12, 2022 21:35
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 force-pushed the overlap_file_ingest_with_memtable branch from d340eef to d4a2129 Compare October 12, 2022 21:40
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 force-pushed the overlap_file_ingest_with_memtable branch from 512a04c to 5433150 Compare October 26, 2022 18:15
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 force-pushed the overlap_file_ingest_with_memtable branch from 5433150 to 5d10ebc Compare October 27, 2022 00:10
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Contributor Author

@hx235 hx235 Oct 27, 2022

I haven't investigated, but when num_levels = 1 with level compaction there will still be an L1 in vstorage and the ingested file will be placed in L1, making it difficult to force compaction to L0. So I skipped adding a UT for kCompactionStyleLevel.

[Question @ajkr] However let me know if you still see value in investigating this and figure this out!

Contributor

I think we sanitize num_levels to be at least 3 or something in level compaction.

Contributor Author

@hx235 hx235 Oct 27, 2022

Oh okay, thanks! Then CompactRange + level compaction falls under the case that we didn't fix in this PR.

Contributor Author

@hx235 hx235 Oct 27, 2022

[Question @ajkr] I wanted to expand test coverage to num_levels > 1 for universal compaction as mentioned in PR summary to cover some edge case in PickCompaction() but couldn't quite force the compaction to L0 - any good idea?

Contributor Author

@hx235 hx235 Oct 27, 2022

[Question @ajkr] I tried to cover PickDeleteTriggeredCompaction() but only came up with a low-level test case manipulating vstorage directly, like https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_picker_test.cc#L2953-L2977. Is there another way of doing this? If not, I can still go with the low-level one - better than nothing.

Contributor Author

Actually - found another more useful test case TEST_P(DBCompactionTestWithParam, CompactionDeletionTrigger)

Contributor Author

Fixed

@hx235
Contributor Author

hx235 commented Oct 27, 2022

Running stress test now - will notify @ajkr once signals are clear

HISTORY.md Outdated
Contributor

Should be in "Unreleased"

Contributor Author

Fixed

HISTORY.md Outdated
Contributor

CompactRangeOptions::change_level=true and CompactRangeOptions::target_level=0. I am not sure it suffers from a bug. Perhaps it does when used together with setting non-nullptr endpoints to prevent the flush. Non-nullptr endpoints with change_level==true is weird though and I think we should disallow it, or at least pretend the user passed us nullptr endpoints. Last discussion - #7441 (comment)

HISTORY.md Outdated
Contributor

I think this was not in the 7.8.0 release

Contributor Author

Fixed now.

Contributor

Would you mind pointing me to the places in my previous commit indicating such an assumption? I actually didn't have this assumption in mind, but I might have made a mistake that implied it. Although I added a UT for it in my latest commit, it would be great if you could point me to a place in my prev commit so that I don't miss anything. Thanks!

A key should be compacted if it exists before the CompactRange() starts and overlaps with the range specified to CompactRange(). If the memtable were not flushed, we would miss compacting such keys that are in an L0 file overlapping memtable in seqno. This conditional is where files containing those missed keys are omitted from compaction.

Contributor Author

oh okay - that seems like a contradiction between the CompactRange() contract (i.e., "a key should be compacted ....") and the "skip file" approach we are pursuing here .... Have we discussed a solution to this before?

Maybe leave a TODO for CompactionPicker::CompactRange() for now, and fix it by not allowing such an "L0 file overlapping memtable in seqno" to exist, as you mentioned in the task about next steps for universal compaction?

Contributor

No solution has been discussed before, but a necessary condition for this bug ("If the memtable were not flushed, ...") is not present in universal CompactRange(), because it always does the flush. My original point was just that this is a big assumption that might change or might not even be true if I misread the code.

Contributor

@ajkr ajkr Oct 27, 2022

What if I ingest level0_stop_writes_trigger L0 files that don't overlap with memtable and they need to be compacted to L0? Will writes be stopped forever since it won't flush and can't compact anything? This is difficult to reason about. I need to think about this approach more.

Contributor Author

Aha, good point ... I didn't even realize there is a feature called level0_stop_writes_trigger ... need to think about this more. The current situation sounds like a deadlock to me.

Contributor Author

A side note: db_stress could be improved to detect deadlocks like this.

@hx235 hx235 force-pushed the overlap_file_ingest_with_memtable branch from 5d10ebc to b65b5d6 Compare October 27, 2022 10:45
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 force-pushed the overlap_file_ingest_with_memtable branch from b65b5d6 to 863c0c5 Compare October 27, 2022 10:58
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 force-pushed the overlap_file_ingest_with_memtable branch from 863c0c5 to d4a3f6f Compare October 27, 2022 13:00
Contributor Author

[Note] A bug fix - caused by assuming sorted_run on L0 is always a prefix of vstorage->LevelFiles(0), which is no longer true after we sanitize sorted_run in this PR.

Contributor Author

[Note] A bug fix - caused by assuming sorted_run on L0 is always a prefix of vstorage->LevelFiles(0), which is no longer true after we sanitize sorted_run in this PR.

@hx235
Contributor Author

hx235 commented Oct 27, 2022

Update: addressed some comments + added more UT + fixed a bug found by new UT

@ajkr
Contributor

ajkr commented Oct 27, 2022

What if we set FileMetaData::largest_seqno as follows?

  • During flush/ingestion, set it to the max of (i) the max seqno of keys in the new file and (ii) one higher than the max largest_seqno of all L0 files. Today it is set to just (i).
  • During intra-L0, set it to the max largest_seqno of all input files

@ajkr
Contributor

ajkr commented Oct 27, 2022

What if we set FileMetaData::largest_seqno as follows?

  • During flush/ingestion, set it to the max of (i) the max seqno of keys in the new file and (ii) one higher than the max largest_seqno of all L0 files. Today it is set to just (i).
  • During intra-L0, set it to the max largest_seqno of all input files

The idea is to play with the metadata so we don't have to worry about what files can or cannot be picked for compaction. I'm still working on an intuitive explanation for this altered meaning of largest_seqno. In case we can't find one, I also thought about a new ordering field in T133335138.

@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 force-pushed the overlap_file_ingest_with_memtable branch from d4a3f6f to 8b851df Compare October 27, 2022 22:55
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@ajkr
Contributor

ajkr commented Oct 30, 2022

  • During intra-L0, set it to the max largest_seqno of all input files

HBase does this -- https://github.com/apache/hbase/blob/984d226010558c5b5e54d283b465a645816ffb24/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/Compactor.java#L171-L172. But ever since HBase style compaction was introduced to RocksDB in 554c06d, we have been learning largest_seqno from the output file keys rather than from the input file metadata. While our way is buggy when ordering L0 files by descending largest_seqno, it still feels like we have a more intuitive meaning of largest_seqno, so probably we should introduce a different field like "L0 epoch number" to use for ordering.

@hx235
Contributor Author

hx235 commented Mar 11, 2023

Closed since alternative approach was pursued.

@hx235 hx235 closed this Mar 11, 2023
Labels: CLA Signed, WIP (Work in progress)

3 participants