Description
Expected behavior
Write slowdowns should be applied long before write stalls
Actual behavior
The bytes pending compaction value computed in EstimateCompactionBytesNeeded changes too rapidly. With write-heavy workloads (db_bench --benchmarks=overwrite), RocksDB can go within 1 or 2 seconds from no stalls (or some slowdown) to long periods of write stalls. Even worse, some write stalls last for 100+ seconds. With a small change (see the PoC diff below) there are no long stalls and variance is greatly reduced.
The root cause is that CalculateBaseBytes computes target sizes bottom-up for all levels except Lmax. It uses max(sizeof(L0), options.max_bytes_for_level_base) for the L1 target size, and then from L1 to Lmax-1 each level's target is set to be fanout times larger than the level above it.
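The bottom-up computation described above can be sketched as follows. This is an illustration, not RocksDB's actual code; the function name and parameters are hypothetical:

```python
def bottom_up_targets(l0_size, max_bytes_for_level_base, fanout, num_target_levels):
    # L1 target: max(sizeof(L0), max_bytes_for_level_base); each deeper level
    # (L2..Lmax-1) gets a target fanout times larger than the one above it.
    targets = [max(l0_size, max_bytes_for_level_base)]
    for _ in range(num_target_levels - 1):
        targets.append(targets[-1] * fanout)
    return targets
```

Note that when L0 grows past max_bytes_for_level_base, every level's target scales up with it, and when L0 shrinks back, every target collapses at once.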
This leads to large targets when L0 is large because compaction has fallen behind. On its own that isn't horrible and might even be considered adaptive. However, when sizeof(L0) suddenly shrinks because an L0->L1 compaction finishes, the targets can suddenly become much smaller because the L1 target will be max_bytes_for_level_base rather than sizeof(L0). At that point the bytes pending compaction value can become huge because it is the sum of the per-level bytes pending compaction values (PLBPC), where PLBPC = (sizeof(level) - target_size(level)) × fanout.
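To make the failure mode concrete, here is a hedged sketch of that per-level sum using the PLBPC formula stated above (the function name and the round numbers are illustrative, not taken from RocksDB):

```python
def estimated_pending_bytes(actual_sizes, target_sizes, fanout):
    # Sum of per-level pending bytes: (sizeof(level) - target_size(level)) * fanout,
    # counting only levels that exceed their target.
    total = 0
    for actual, target in zip(actual_sizes, target_sizes):
        if actual > target:
            total += (actual - target) * fanout
    return total

# Round numbers in the spirit of the logs: the actual sizes are unchanged,
# but the targets shrink after an L0->L1 compaction finishes.
actual        = [7_000_000_000, 10_000_000_000]   # L1, L2
big_targets   = [2_000_000_000,  5_400_000_000]   # while L0 is large
small_targets = [  140_000_000,    650_000_000]   # after L0 shrinks
```

With the smaller targets the estimate jumps even though no new data arrived, which is exactly the step change that flips the DB from slowdown to stall.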
We also need to improve visibility into this. AFAIK, it is hard or impossible to see the global bytes pending compaction value, the contribution from each level, and the target size for each level. I ended up adding a lot of code to dump more info to LOG.
Here is one example using output written to LOG, including extra monitoring I added. The first block of logging is at 20:23:49, and there are write slowdowns because bytes pending compaction ~= 25B while the slowdown trigger was ~= 23B. The second line lists the per-level target sizes: 1956545174 (~2B) for L1, 5354149941 (~5.3B) for L2, 14651806650 (~14.6B) for L3, 40095148713 (~40B) for L4, and 109721687484 (~110B) for L5. In this case the bytes pending compaction value jumps from ~25B to ~85B in 2 seconds. In other tests I have seen it increase by 100B or 200B in a few seconds after L0->L1 completes.
2022/01/21-20:23:49.546718 7ff4d71ff700 [/version_set.cc:2687] ZZZ ECBN scores 24985540821 total, 25.0 totalB, 2.0 L0, 15.6 L1, 9.3 L2, 0.0 L3, 0.0 L4, 0.0 L5, 0.0 L6, 0.0 L7, 0.0 L8, 0.0 L9
2022/01/21-20:23:49.546733 7ff4d71ff700 [/version_set.cc:2703] ZZZ ECBN target 12 L0, 1956545174 L1, 5354149941 L2, 14651806650 L3, 40095148713 L4, 109721687484 L5, 0 L6, 0 L7, 0 L8, 0 L9
2022/01/21-20:23:49.546737 7ff4d71ff700 [/version_set.cc:2716] ZZZ ECBN actual 1956545174 L0, 6988421099 L1, 10062248176 L2, 14614385201 L3, 4683299263 L4, 37515691925 L5, 0 L6, 0 L7, 0 L8, 0 L9
2022/01/21-20:23:49.546773 7ff4d71ff700 [WARN] [/column_family.cc:984] [default] ZZZ Stalling writes because of estimated pending compaction bytes 24985540821 rate 5368703
But 2 seconds later, after L0->L1 finishes, the per-level target sizes are recomputed and they are much smaller: 141300841 (~140M) for L1, 654055230 (~650M) for L2, 3027499631 (~3B) for L3, 14013730948 (~14B) for L4, and 64866946001 (~65B) for L5. The result is that the per-level contributions to bytes pending compaction are much larger for L2, L3 and L4, which triggers a long write stall. The ECBN scores line has the value of estimated_compaction_bytes_needed_ as total and totalB, followed by the per-level contributions. The ECBN actual line has the actual per-level sizes. Together these show that the increase comes from the sudden reduction in the per-level target sizes.
2022/01/21-20:23:51.893239 7ff4db3ff700 [/version_set.cc:3859] ZZZ set level_max_bytes: 141300841 L1, 654055230 L2, 3027499631 L3, 14013730948 L4, 64866946001 L5, 300256990742 L6, 9 L7, 9 L8, 9 L9
2022/01/21-20:23:51.906933 7ff4db3ff700 [/version_set.cc:2687] ZZZ ECBN scores 84473373628 total, 84.5 totalB, 0.1 L0, 18.7 L1, 20.5 L2, 22.1 L3, 23.1 L4, 0.0 L5, 0.0 L6, 0.0 L7, 0.0 L8, 0.0 L9
2022/01/21-20:23:51.906951 7ff4db3ff700 [/version_set.cc:2703] ZZZ ECBN target 3 L0, 141300841 L1, 654055230 L2, 3027499631 L3, 14013730948 L4, 64866946001 L5, 0 L6, 0 L7, 0 L8, 0 L9
2022/01/21-20:23:51.906954 7ff4db3ff700 [/version_set.cc:2716] ZZZ ECBN actual 141300841 L0, 6955042332 L1, 11844113742 L2, 21096345478 L3, 22752145110 L4, 46254106087 L5, 0 L6, 0 L7, 0 L8, 0 L9
2022/01/21-20:23:51.907149 7ff4db3ff700 [WARN] [/column_family.cc:925] [default] ZZZ Stopping writes because of estimated pending compaction bytes 84473373628
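A quick calculation over the target sizes copied from the two LOG excerpts above shows how sharply each level's target shrank within about 2 seconds (the variable names are mine):

```python
# Per-level target sizes (L1..L5) from the 20:23:49 and 20:23:51 log lines.
targets_before = [1956545174, 5354149941, 14651806650, 40095148713, 109721687484]
targets_after  = [141300841, 654055230, 3027499631, 14013730948, 64866946001]

# Shrink factor per level: L1's target drops by roughly 14x,
# with progressively smaller drops for the deeper levels.
shrink = [old / new for old, new in zip(targets_before, targets_after)]
```

Since every level's actual size stayed roughly the same across those 2 seconds, each of these drops translates directly into a larger PLBPC contribution.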
Steps to reproduce the behavior
A proof-of-concept diff to fix this is here. It computes level targets from largest to smallest, while the current code computes them from smallest to largest.
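As a hedged sketch of the largest-to-smallest direction the PoC describes (this is my reading of the idea, not the actual diff), targets could be derived from the size of the largest level downward:

```python
def top_down_targets(lmax_size, max_bytes_for_level_base, fanout, num_target_levels):
    # Start from the size of the largest level and divide by fanout toward L1,
    # with max_bytes_for_level_base as a floor. Because nothing here depends on
    # sizeof(L0), a transient L0 spike cannot whipsaw every level's target.
    targets = []
    size = lmax_size
    for _ in range(num_target_levels):
        size = max(size // fanout, max_bytes_for_level_base)
        targets.append(size)
    targets.reverse()  # return in L1..Lmax-1 order
    return targets
```

The key property is that the targets only move as fast as Lmax does, so finishing an L0->L1 compaction cannot suddenly inflate the pending-bytes estimate.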
Reproduction:
numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/m/rx --wal_dir=/data/m/rx --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=none --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=5 --report_interval_seconds=5 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=24696061952 --hard_pending_compaction_bytes_limit=49392123904 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --seed=1642906118
numactl --interleave=all ./db_bench --benchmarks=overwrite --use_existing_db=1 --sync=0 --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/m/rx --wal_dir=/data/m/rx --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=none --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=5 --report_interval_seconds=5 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=24696061952 --hard_pending_compaction_bytes_limit=49392123904 --duration=3600 --threads=16 --merge_operator="put" --seed=1642907605 --report_file=bm.lc.nt16.cm1.d0/v6.26.1/benchmark_overwrite.t16.s0.log.r.csv