perf: L0->Lbase compactions not keeping up with flushing #203
Description
On a c5d.4xlarge instance, the pebble sync workload shows good write performance, but also a problematic behavior: L0->Lbase compactions are not keeping up with flushing, leading to an ever-growing number of files in L0.
```
~ ./pebble sync -c 100 -d 1m -w /mnt/data1/bench --batch 100 -v
...
level__files____size___score______in__ingest____move____read___write___w-amp
  WAL      4    92 M       -   5.8 G       -       -       -   5.8 G     1.0
    0    111   4.6 G   55.50   5.7 G     0 B     0 B     0 B   6.2 G     1.1
    1      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    2      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    3      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    4      0     0 B    0.00     0 B     0 B     0 B     0 B     0 B     0.0
    5    374   1.4 G   23.20   1.5 G     0 B     0 B   2.8 G   2.8 G     1.9
    6     34   144 M    1.00   144 M     0 B     0 B   214 M   214 M     1.5
total    519   6.2 G    0.00   5.8 G     0 B     0 B   3.0 G    15 G     2.6
```
(I tweaked the Pebble options to set the L0 stop writes threshold to 1000).
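For reference, that tweak can be expressed through Pebble's options. A sketch, assuming the `L0StopWritesThreshold` field name and `github.com/cockroachdb/pebble` import path used by recent Pebble releases; the version used for this run may spell these differently:

```go
package main

import "github.com/cockroachdb/pebble"

// openDB is a hypothetical helper showing the option tweak mentioned above:
// allow L0 to grow to 1000 files before stopping writes, so the benchmark
// surfaces the compaction backlog instead of stalling on write throttling.
func openDB(path string) (*pebble.DB, error) {
	opts := &pebble.Options{
		L0StopWritesThreshold: 1000,
	}
	return pebble.Open(path, opts)
}
```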
What is happening is that Pebble sees the large number of L0 sstables and decides to compact them into Lbase (L5 in this case). Because the workload generates uniformly random keys, the L0 sstables overlap all of Lbase, which means an L0->Lbase compaction will have 111+374 == 485 input sstables, totaling 6 GB. That compaction necessarily takes a long time, and while it proceeds further L0 tables build up; by the time it finishes, there are enough L0 tables to require another L0->Lbase compaction. The real starvation here is of L5->L6 compactions.
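The input-count arithmetic above can be sketched as a toy model. This is not Pebble's compaction-picking code; `keyRange`, `buildLevels`, and `compactionInputs` are invented for illustration, under the assumption that every L0 file spans the whole keyspace while Lbase files partition it:

```go
package main

import "fmt"

// keyRange is a hypothetical [start, end) span of keys covered by an sstable.
type keyRange struct {
	start, end int
}

func overlaps(a, b keyRange) bool {
	return a.start < b.end && b.start < a.end
}

// buildLevels models the state in the stats above: 111 L0 files that each
// span the keyspace (uniformly random keys), and 374 non-overlapping Lbase
// files partitioning the same keyspace.
func buildLevels() (l0, lbase []keyRange) {
	keyspace := keyRange{0, 1 << 20}
	l0 = make([]keyRange, 111)
	for i := range l0 {
		l0[i] = keyspace
	}
	lbase = make([]keyRange, 374)
	w := keyspace.end / len(lbase)
	for i := range lbase {
		lbase[i] = keyRange{i * w, (i + 1) * w}
	}
	return l0, lbase
}

// compactionInputs seeds a compaction with the first L0 file, pulls in every
// Lbase file it overlaps (widening the key range as it goes), then pulls in
// every L0 file overlapping the widened range.
func compactionInputs(l0, lbase []keyRange) int {
	r := l0[0]
	n := 0
	for _, f := range lbase {
		if overlaps(r, f) {
			n++
			if f.start < r.start {
				r.start = f.start
			}
			if f.end > r.end {
				r.end = f.end
			}
		}
	}
	for _, f := range l0 {
		if overlaps(r, f) {
			n++
		}
	}
	return n
}

func main() {
	l0, lbase := buildLevels()
	// Every Lbase file and every L0 file ends up in the single compaction:
	// 374 + 111 = 485 inputs, matching the count in the issue text.
	fmt.Println(compactionInputs(l0, lbase))
}
```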
RocksDB somehow avoids this egregiously bad behavior, though I'm not quite sure how yet. It seems to be a combination of L0->L0 compactions and concurrent compactions: if I disable either one, RocksDB exhibits the same behavior as Pebble. I'm somewhat suspicious it is also related to the lower write throughput I see from RocksDB on this workload. An interesting side-effect of L0->L0 compactions is that they lower the number of files in L0, which lowers the L0 compaction score. Perhaps that is what allows Lbase->Lbase+1 compactions to be scheduled.
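The score effect can be illustrated with a rough sketch. This is an assumed file-count-based formula (number of L0 files divided by a compaction trigger), not necessarily the exact one either engine uses, though a trigger of 2 would be consistent with the 55.50 shown for L0 in the stats above:

```go
package main

import "fmt"

// l0Score is a hypothetical file-count-based L0 compaction score:
// the number of L0 files divided by the compaction trigger threshold.
func l0Score(numFiles, trigger int) float64 {
	return float64(numFiles) / float64(trigger)
}

func main() {
	const trigger = 2 // assumed L0 compaction trigger

	// Before: 111 L0 files, as in the stats above.
	fmt.Printf("before: %.2f\n", l0Score(111, trigger))

	// After a hypothetical L0->L0 compaction merges 20 files into 1, the
	// file count drops to 92, lowering the score and easing the pressure
	// to schedule L0->Lbase work, which could let an Lbase->Lbase+1
	// compaction run instead.
	fmt.Printf("after:  %.2f\n", l0Score(92, trigger))
}
```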
A limitation both Pebble and RocksDB currently share is that an L0->Lbase compaction locks out a concurrent Lbase->Lbase+1 compaction. This is mentioned in https://github.com/petermattis/pebble/issues/136.