Sort L0 files by newly introduced epoch_num #10922
hx235 wants to merge 2 commits into facebook:main
Conversation
Force-pushed from 2107596 to a616822
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 has updated the pull request. You must reimport the pull request before landing.
ajkr
left a comment
Finally finished thinking about all the parts I hoped to think about. Great work on this very difficult problem!
Update: Addressed the comments and going through compatibility + stress testing again. Will notify here when it's ready for review again.
ajkr
left a comment
Thanks for addressing all the comments. LGTM!!
Thanks for the review!!! The compatibility and stress tests finished well, so I will land this once signals are clear.
…rash test (#11063)

Summary:

**Context/Summary:** #10777 was reverted (#10999) due to an internal blocker and replaced with a better fix, #10922. However, the revert also reverted the `Options::CompactionOptionsFIFO::allow_compaction` stress/crash coverage added by that PR.

It's useful coverage because setting `Options::CompactionOptionsFIFO::allow_compaction=true` will [increase](https://github.com/facebook/rocksdb/blob/7.8.fb/db/version_set.cc#L3255) the compaction score of L0 files for FIFO and therefore trigger more FIFO compaction. This speeds up discovery of bugs related to FIFO compaction such as #10955. To see the speedup, compare the failure occurrence of the following command with `Options::CompactionOptionsFIFO::allow_compaction=true/false`:

```
--fifo_allow_compaction=1 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=8.869062094789008 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=2 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1
--destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=100 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=15 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 
--stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=65
```

Therefore this PR adds the coverage back to the stress/crash test.

Pull Request resolved: #11063

Test Plan: Rehearsal stress test to make sure the stress/crash test is stable

Reviewed By: ajkr

Differential Revision: D42283650

Pulled By: hx235

fbshipit-source-id: 132e6396ab6e24d8dcb8fe51c62dd5211cdf53ef
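To illustrate the claim above about why this flag speeds up bug discovery, here is a minimal sketch (not RocksDB's actual scoring code; `FifoL0Score` and its parameters are hypothetical) of how enabling FIFO's `allow_compaction` turns the L0 file count into a compaction score, so accumulating L0 files get picked for compaction sooner:

```cpp
#include <cassert>

// Hypothetical model: with CompactionOptionsFIFO::allow_compaction=true,
// FIFO scores L0 by how full it is relative to the compaction trigger,
// so more L0 files yield a higher score and compaction runs sooner.
double FifoL0Score(int num_l0_files, int level0_file_num_compaction_trigger,
                   bool allow_compaction) {
  if (!allow_compaction) {
    return 0.0;  // otherwise FIFO only drops old files by TTL/size limits
  }
  return static_cast<double>(num_l0_files) /
         level0_file_num_compaction_trigger;
}
```

More frequent FIFO compaction exercises the compaction path more often per stress-test hour, which is exactly why the coverage helped surface bugs like #10955.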
Context:
Sorting L0 files by `largest_seqno` has at least two inconveniences:

- `force_consistency_check=true` will catch seqno-range overlap between L0 files even when the overlap is harmless: it will think files like s4 and s3 have a file-reordering corruption that might cause returning an old value of k1.
- Sorting by `largest_seqno` can genuinely introduce a wrong ordering and hence real file-reordering corruption, and without `force_consistency_check=true` there isn't a good way to prevent this from happening when ordering by `largest_seqno`.

Therefore, we are redesigning the sorting criteria of L0 files to avoid the above inconveniences. Credit to @ajkr, we now introduce
`epoch_num`, which describes the order in which a file was flushed or ingested/imported (a compaction output file gets the minimum `epoch_num` among its input files'). This avoids the above inconveniences in the following way: keep `force_consistency_check=true` but base its L0 check on `epoch_number` ordering. This results in file orderings s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction), which won't trigger a false-positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.

Summary:
- Introduce an `epoch_number` stored per `ColumnFamilyData` and sort the CF's L0 files by their assigned `epoch_number` instead of by `largest_seqno`.
- `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (and similarly for `WriteLevel0TableForRecovery`) and for file ingestion (except for `allow_behind` true ingestion, which is always assigned `kReservedEpochNumberForFileIngestedBehind`).
- A compaction output file is assigned the minimum `epoch_number` among its input files'.
- Special `epoch_number` treatment applies to files that would previously rely on `NewestFirstBySeqNo` ordering.
- `epoch_number` assignment is race-free: it happens inside `VersionEdit::AddFile()`, where the version edit is applied to the LSM tree shape right after, either while holding the db mutex (e.g., flush, file ingestion, import column family) or because there is only one ongoing edit per CF (e.g., `WriteLevel0TableForRecovery`, `Repair`).
- Assigning the minimum input `epoch_number` to compaction output is safe (including for `Refit(target_level=0)`): for every key "k" in the input range, a legit compaction covers a continuous epoch-number range of that key. As long as we assign key "k" the minimum input epoch number, it won't become newer or older than the versions of this key not included in this compaction, hence no misorder.
- Persist the `epoch_number` of each file in the manifest (via a new `NewFileCustomTag`) and recover `epoch_number` on db recovery.
- Backward compatibility with old dbs without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files in `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more.
- Fix `force_consistent_check` on L0 to use `epoch_number` and remove false-positive checks like case 1 above with `largest_seqno`.
- For files whose epoch number has not been inferred yet, keep using the old way (`NewestFirstBySeqno`) to check/sort them until we infer their epoch number. See usages of `EpochNumberRequirement`.
- Fix existing tests' assumptions broken by `epoch_number` so `make check` will pass.
- Add new tests for `epoch_number`, covering universal/FIFO compaction, `CompactRange`, and `CompactFile` cases.

Test:
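The assignment and sorting rules above can be sketched as a minimal model (illustrative only, not the RocksDB implementation; `FileMeta`, `ColumnFamily`, and the function names here are hypothetical):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct FileMeta {
  uint64_t epoch_number;   // order of flush/ingestion into this CF
  uint64_t largest_seqno;  // the old sort key, kept for illustration
};

struct ColumnFamily {
  uint64_t next_epoch = 1;  // per-CF counter, bumped under the db mutex

  // Flush and (non-behind) ingestion each receive a fresh, strictly
  // increasing epoch number on the AddFile path.
  FileMeta AddFlushedOrIngestedFile(uint64_t largest_seqno) {
    return FileMeta{next_epoch++, largest_seqno};
  }
};

// Compaction output inherits the MINIMUM epoch among its inputs, so it
// never appears newer than uncompacted versions of the same keys.
uint64_t CompactionOutputEpoch(const std::vector<FileMeta>& inputs) {
  uint64_t min_epoch = inputs.front().epoch_number;
  for (const auto& f : inputs) {
    min_epoch = std::min(min_epoch, f.epoch_number);
  }
  return min_epoch;
}

// L0 is sorted newest-first by epoch_number instead of largest_seqno.
bool NewestFirstByEpoch(const FileMeta& a, const FileMeta& b) {
  return a.epoch_number > b.epoch_number;
}
```

Note how a file ingested later but carrying a lower seqno still sorts as newer than earlier flushed files, which is exactly the case that `largest_seqno` ordering mishandled.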
- `make check`
- Modified tests in `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Repro tests (`DBCompactionTestL0FilesMisorderCorruption*`) under "Fix corruption with intra-L0 on ingested files" #5958 (comment)
- Crash tests (with an `orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`
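As a sketch of the backward-compatibility point in the summary above (illustrative only; the real logic lives in `VersionStorageInfo::RecoverEpochNumbers()`, and `RecoveredFile`/`AssignEpochsBySeqno` are hypothetical names): files from a pre-`epoch_number` db can be given epochs in `NewestFirstBySeqno` order so that the epoch-based sort reproduces the old ordering.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct RecoveredFile {
  uint64_t largest_seqno;
  uint64_t epoch_number = 0;  // 0 = unknown (file from an old-format db)
};

// Sort old-format L0 files newest-first by seqno, then hand out decreasing
// epoch numbers so epoch order reproduces the NewestFirstBySeqno order.
void AssignEpochsBySeqno(std::vector<RecoveredFile>& l0_files) {
  std::sort(l0_files.begin(), l0_files.end(),
            [](const RecoveredFile& a, const RecoveredFile& b) {
              return a.largest_seqno > b.largest_seqno;  // newest first
            });
  uint64_t epoch = l0_files.size();
  for (auto& f : l0_files) {
    f.epoch_number = epoch--;  // newest file gets the largest epoch
  }
}
```

Once every live file has an epoch number (persisted in the manifest), the seqno fallback is no longer needed for those files.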