Bug #53926


rocksdb's Option.ttl may be beneficial for RGW index workload

Added by Alex Marangone about 4 years ago. Updated 8 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

100%

Source:
Community (user)
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Tags (freeform):
Fixed In:
v18.0.0-1010-g85e2757e85e
Released In:
v18.2.0~972
Upkeep Timestamp:
2025-07-14T15:41:13+00:00

Description

Background

In our environment, many RGW workloads have lifecycle rules or customers issuing a lot of DELETE/PUT requests on their buckets. These workloads create a lot of tombstones, which slows omap_iterator until the OSDs eventually start to complain:

2021-12-22 19:10:22.201 7f19eb59f700  0 bluestore(/var/lib/ceph/osd/ceph-19) log_latency_fn slow operation observed for upper_bound, latency = 6.25955s, after = <redacted_key_name> omap_iterator(cid = 14.311_head, oid = <redacted_object_name>
2021-12-22 19:10:34.370 7f19eb59f700  0 bluestore(/var/lib/ceph/osd/ceph-19) log_latency_fn slow operation observed for upper_bound, latency = 6.0164s, after = <redacted_key_name> omap_iterator(cid = 14.311_head, oid = <redacted_object_name>

In the most extreme scenarios this issue created a lot of slow requests (10k+) on the PG, resulting in an outage. On some clusters we have no alternative but to trigger a compaction three times a day.
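For reference, the thrice-daily workaround boils down to scheduled manual compactions via `ceph tell osd.<id> compact`. A sketch as a crontab fragment; the OSD ids and schedule are placeholders, not our production setup:

```shell
# crontab fragment (illustrative): compact OSDs 19-21 three times a day.
# OSD ids and times are examples only; adjust for your cluster.
0 2,10,18 * * * for id in 19 20 21; do ceph tell osd.$id compact; done
```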

Options.ttl

While looking at rocksdb options, we discovered the following: https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#periodic-and-ttl-compaction. Because our production runs on Nautilus, we decided to investigate options.ttl instead of options.periodic_compaction_seconds. Note: ttl is disabled in the Nautilus version of rocksdb and set to 30 days in Pacific.
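For context, Ceph passes RocksDB tuning to BlueStore as a single option string, so a ttl like the ones tested below can be set roughly like this (a sketch: the option string shown is illustrative, not our production configuration):

```ini
# ceph.conf fragment (illustrative). ttl is in seconds; 1800 = 30 min,
# matching the first benchmark below. Any other options in the string
# are whatever the deployment already uses.
[osd]
bluestore_rocksdb_options = compression=kNoCompression,ttl=1800
```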

I ran two sets of benchmarks:
- On a 10M-key omap object: with the rate capped at 10k keys/s, list 10k keys -> delete the listed keys -> repeat until all keys are deleted. ttl was set to 30min
- On a 10M-key omap object: send the following distribution per second (max: 100 ops/s): 11 key lists, 44 key inserts, 44 key deletes. ttl was set to 60min
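The mixed-distribution workload can be sketched as follows. This is a minimal simulation against an in-memory dict standing in for the omap object; the real benchmark drives librados omap operations instead, and all names here are illustrative:

```python
import random

def run_mixed_workload(omap, ticks=10, seed=42):
    """Issue the 11/44/44 list/insert/delete mix per tick against `omap`,
    a dict standing in for a bucket-index omap object."""
    rng = random.Random(seed)
    counter = 0
    stats = {"list": 0, "insert": 0, "delete": 0}
    # One tick's op mix: 11 lists + 44 inserts + 44 deletes = 99 ops,
    # staying under the 100 ops/s cap.
    mix = ["list"] * 11 + ["insert"] * 44 + ["delete"] * 44
    for _ in range(ticks):
        rng.shuffle(mix)
        for op in mix:
            if op == "list":
                # Mimic an omap range listing of a handful of keys.
                _ = sorted(omap)[:11]
                stats["list"] += 1
            elif op == "insert":
                omap[f"key-{counter:09d}"] = b"value"
                counter += 1
                stats["insert"] += 1
            else:
                # Delete the lexicographically first key; in RocksDB
                # this is what leaves a tombstone behind.
                if omap:
                    del omap[min(omap)]
                stats["delete"] += 1
    return stats

# Seed with some pre-existing keys, then run the mix.
omap = {f"seed-{i:09d}": b"value" for i in range(1000)}
print(run_mixed_workload(omap))
# → {'list': 110, 'insert': 440, 'delete': 440}
```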

Results

All OSDs were manually compacted prior to running the bench.

For the 10k list -> delete run (on Pacific):
Default (no ttl): showed an increase in latency; some drops happen due to general compaction, but latency never returns to its starting value.

ttl=30min: steady increase in latency followed by a drop back to the starting value.

For the workload-distribution test (on Nautilus):
Default (no ttl): latency clearly increases over time.

ttl=1h: latency still increases over time. When TTL compaction is triggered there's a high latency spike followed by a significant drop.

Conclusion

We started to deploy ttl compaction in prod, but based on this I'm not sure whether TTL should be changed upstream or not. I think it needs more investigation to make sure it doesn't negatively affect some workloads, but I'm not sure how to proceed. This ticket is to gather more data or tests we could run.


Files

10k-delete-nottl.png (289 KB) 10k-delete-nottl.png Alex Marangone, 01/18/2022 10:52 PM
10k-delete-ttl.png (283 KB) 10k-delete-ttl.png Alex Marangone, 01/18/2022 10:52 PM
no-ttl-nautilus.png (64.8 KB) no-ttl-nautilus.png Alex Marangone, 01/18/2022 10:52 PM
ttl-nautilus.png (45.8 KB) ttl-nautilus.png Alex Marangone, 01/18/2022 10:52 PM
Pacific - vanilla.png (76.5 KB) Pacific - vanilla.png Alex Marangone, 03/15/2023 05:15 PM
Pacific - patched.png (107 KB) Pacific - patched.png Alex Marangone, 03/15/2023 05:15 PM

Related issues: 2 (0 open, 2 closed)

Related to bluestore - Backport #59329: quincy: kv/RocksDBStore: Add CompactOnDeletion support (Resolved, Cory Snyder)
Related to bluestore - Backport #59330: pacific: kv/RocksDBStore: Add CompactOnDeletion support (Resolved, Cory Snyder)
Actions #1

Updated by Alex Marangone about 4 years ago

Ran my omap-bench tests on an omap object with 100k keys, which is more in line with RGW's recommendation. As expected, the results follow a similar pattern.

Updated by Alex Marangone about 3 years ago

I've run benchmarks very similar to the ones described above on 16.2.11 with https://github.com/ceph/ceph/pull/47221 backported.
The results are significant.

Without the patch:

With the patch:

The deletes are graphed too, but they're too quick to show up at the graph's scale.
The average delete time was 41 ms without the patch and 7 ms with it.

Going to close this tracker since it appears the perf degradation is fixed!

Actions #3

Updated by Alex Marangone about 3 years ago

  • Status changed from New to Closed
Actions #4

Updated by jianwei zhang almost 3 years ago

I don't quite understand why an insert can cost 10000ms (10s).

Actions #5

Updated by jianwei zhang almost 3 years ago

I think that at any time, insert latency should be lower than list latency.
I'm curious; I hope you can explain.

Thank you very much!

Actions #6

Updated by Alex Marangone almost 3 years ago

It doesn't take 10s; the graphs are stacked, and this is the time to insert 10k keys. The bench that I wrote is also extremely chaotic and raw: it attempts to insert/list/delete 10k keys on the same object (thus one PG) at the same time. To make things even worse, the omap has 1M entries to start with.

The tl;dr is that I wanted to create something somewhat unrealistic but still the worst-case scenario for rocksdb, in order to quickly (minutes) reproduce the performance degradation that we've seen pile up slowly over time (months), and that seemed to do the trick. Because of this the performance numbers themselves aren't really relevant; the pattern of latency over time is.
Hope that helps.

Actions #7

Updated by jianwei zhang almost 3 years ago

Can you share the bench tool?
I haven't been able to find a way to reproduce it lately.
Thank you so much!

Actions #8

Updated by Alex Marangone almost 3 years ago

Sure. It's super rough and gross, and I never got time to work on it much, but: https://gist.github.com/alram/ecfcf90035403fcf5a2dfff27daabf37

Actions #9

Updated by Cory Snyder almost 3 years ago

  • Related to Backport #59329: quincy: kv/RocksDBStore: Add CompactOnDeletion support added
Actions #10

Updated by Cory Snyder almost 3 years ago

  • Related to Backport #59330: pacific: kv/RocksDBStore: Add CompactOnDeletion support added
Actions #11

Updated by Igor Fedotov almost 3 years ago

  • Status changed from Closed to Pending Backport
  • Pull request ID set to 47221
Actions #13

Updated by Konstantin Shalygin over 2 years ago

  • Tracker changed from Feature to Bug
  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100
  • Regression set to No
  • Severity set to 3 - minor
Actions #14

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to 85e2757e85eb5939a202e878b9a8ca694d7acc29
  • Fixed In set to v18.0.0-1010-g85e2757e85e
  • Released In set to v18.2.0~972
  • Upkeep Timestamp set to 2025-07-14T15:41:13+00:00