
db: benchmark framework for compaction heuristic comparisons #1865

@sumeerbhola

Description


This has come up in past discussions, such as on #1603 and #1746.

We need benchmarks that satisfy the following (non-exhaustive) requirements:

  • Workloads with different distributions of writes in the key space: for example, (a) uniform (probably the most challenging for compactions), (b) hash-sharded sequential writes (which seem common in the CockroachDB context), (c) Zipfian.
  • Large LSMs: we need many levels to be populated, at least L2-L6, for the benchmarks to be representative of large deployments. We also need to be able to run benchmarks over short time intervals, say 30min, which means capturing and storing already-built LSMs corresponding to various workloads (say in S3 or GCS) and using those as starting points for benchmarking. Since these benchmarks will be comparing write amplification, we can live with the performance implications of the starting files staying in blob storage (rather than copying them locally to run experiments). The SharedFS developed by @itsbilal could be useful for this.
  • Pacing the writes so that compactions do not fall behind. One cannot make a fair comparison between two schemes if one fell behind and then caught up at the end while the other kept up. The termination condition of the benchmarks also needs to be consistent -- one option is to stop writes and then wait until all levels have compaction scores < 1.0 and there are no more compactions left to run.

Sub-issues:

TODO(jackson): Add issues for remaining work.
