Extra allocations make short range queries 5% slower when linked with glibc malloc #10340

@mdcallag

Description

I encountered this by accident while benchmarking: I was unintentionally using glibc malloc. Normally I use jemalloc, but it wasn't installed on the host where I compiled db_bench. The regression was introduced by 8b74cea, and the new allocation might be here.

I know we prefer jemalloc over glibc malloc, but is it possible to reduce the number of allocations?

Example output from the fwdrangewhilewriting benchmark step shows the impact: QPS drops from 512323 to 485274. The first line is from b82edff and the second is from 8b74cea. These commits are adjacent in the repo history (b82edff immediately precedes 8b74cea).

ops_sec mb_sec  lsm_sz  blob_sz c_wgb   w_amp   c_mbps  c_wsecs c_csecs b_rgb   b_wgb   usec_op p50     p99     p99.9   p99.99  pmax    uptime  stall%  Nstall  u_cpu   s_cpu   rss     test    date    version job_id  githash
512323  2052.1  18GB    0.0GB,  33.3    14.9    28.8    107     75      0       0       42.9    41.7    76      168     479     22597   1183    0.0     0       21.2    3.1     0.0     fwdrangewhilewriting.t22        2022-07-11T18:36:10     7.3.0           b82edffc7b
485274  1943.7  18GB    0.0GB,  33.2    14.8    28.7    106     74      0       0       45.3    43.7    84      174     489     22534   1183    0.0     0       21.8    2.8     0.0     fwdrangewhilewriting.t22        2022-07-11T18:57:20     7.3.0           8b74cea7fe

From the throughput result and vmstat output (not shared here), I see that 8b74cea uses ~5% more CPU per query. I confirmed the regression does not reproduce when db_bench is linked with jemalloc.

Reproduction scripts (fillseq to load the database, then seekrandomwhilewriting to measure):

numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/m/rx --wal_dir=/data/m/rx --num=40000000 --key_size=20 --value_size=400 --block_size=8192 --cache_size=193273528320 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=none --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --report_interval_seconds=5 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --num_levels=8 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --seed=1657564500 --report_file=benchmark_fillseq.wal_disabled.v400.log.r.csv 2>&1 

numactl --interleave=all timeout 1800 ./db_bench --benchmarks=seekrandomwhilewriting --use_existing_db=1 --sync=0 --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/m/rx --wal_dir=/data/m/rx --num=40000000 --key_size=20 --value_size=400 --block_size=8192 --cache_size=193273528320 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=none --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=2097152 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --report_interval_seconds=5 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --num_levels=8 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --duration=1200 --threads=22 --merge_operator="put" --seek_nexts=10 --reverse_iterator=false --seed=1657564570 --report_file=benchmark_fwdrangewhilewriting.t22.log.r.csv 2>&1

A flamegraph for b82edff (no regression here):
[flamegraph image: benchmark_fwdrangewhilewriting t22 log stats perf g 9 Jul11 185402]

A flamegraph for 8b74cea that shows the problem; on the left side of the flamegraph, the call stacks for __default_morecore, __libc_free, and __libc_malloc are much wider:
[flamegraph image: benchmark_fwdrangewhilewriting t22 log stats perf g 9 Jul11 191500]

Labels: performance, regression
