
osd: optimize handling of RGW's start_after & filter during OMAP iteration #60000

Closed

rzarzynski wants to merge 6 commits into ceph:main from rzarzynski:wip-os-cheaper-get_omap_iterator

Conversation

@rzarzynski
Contributor

@rzarzynski rzarzynski commented Sep 26, 2024

Currently an OSD does up to 3 key seeks while iterating over OMAP:

  1. to rewind to the first OMAP entry of a particular RADOS object in the whole, global key namespace;
  2. then to find the pagination-related start_after key;
  3. then to respect the filter parameter.

This is pretty inefficient: Seek() in RocksDB is logarithmic, so one Seek(n) should be cheaper than Seek(n/2) + Seek(n/2).

Also, this change further narrows the RocksDB iterator's bound.

Please note there are other inefficiencies: the underlying Seek()s are performed under the collection lock, and the iterator is one-time only. Ultimately we could try to cache the iterator and reuse it across multiple rounds of the same RGW listing.
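The three seeks above share a single effective lower bound, so they can in principle be collapsed into one Seek(). A minimal sketch of the idea (hypothetical helper names, not the actual patch; keys are assumed to be in the object's own OMAP key space, with the object prefix prepended by the caller):

```cpp
#include <algorithm>
#include <string>

// Collapse the three logical seek targets into a single RocksDB Seek().
std::string effective_seek_key(const std::string& object_head,  // first OMAP key of the object
                               const std::string& start_after,  // pagination cursor, "" if unset
                               const std::string& filter)       // prefix filter, "" if unset
{
  // start_after is exclusive, so the iterator must land strictly after it;
  // appending '\0' yields the smallest key greater than start_after.
  const std::string after =
      start_after.empty() ? std::string() : start_after + '\0';
  // One Seek() to the largest of the three lower bounds replaces up to
  // three separate seeks.
  return std::max({object_head, after, filter});
}
```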

Signed-off-by: Radoslaw Zarzynski rzarzyns@redhat.com


The impact of this change is currently unknown. I'm hunting for a cluster to verify it. Draft for now.

Benchmarking & profiling under the freshly extended, unreviewed rados bench: https://gist.github.com/rzarzynski/dbfedcb55bd9c9cafeb0b0c3358a32ec

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@github-actions github-actions bot added this to the reef milestone Sep 26, 2024
@rzarzynski rzarzynski changed the title Wip os cheaper get omap iterator osd: optimize handling of RGW's start_after & filter during OMAP iteration Sep 26, 2024
@anthonyeleven
Contributor

Nice! I'm not qualified to approve this code PR, but it sounds like good stuff

@rzarzynski
Contributor Author

The only change in the recent push is dissecting the "Also, this change further narrows the RocksDB iterator's bound." part into a dedicated commit.

Contributor

@aclamk aclamk left a comment


This is valuable work on reducing extra seeks in RocksDB iterators, but the details have to be ironed out.

@rzarzynski rzarzynski force-pushed the wip-os-cheaper-get_omap_iterator branch 2 times, most recently from 09cd45b to 65b78be September 30, 2024 17:05
@rzarzynski rzarzynski requested a review from aclamk October 1, 2024 07:53
@rzarzynski rzarzynski force-pushed the wip-os-cheaper-get_omap_iterator branch from 72fcda0 to f8f09f0 October 1, 2024 08:10
@rzarzynski rzarzynski force-pushed the wip-os-cheaper-get_omap_iterator branch from f8f09f0 to d6f7b36 October 1, 2024 08:59
@rzarzynski
Contributor Author

The new revision dropped 65b78be.

@rzarzynski rzarzynski requested a review from aclamk October 1, 2024 09:01
@rzarzynski rzarzynski force-pushed the wip-os-cheaper-get_omap_iterator branch from d6f7b36 to 58e6f25 October 1, 2024 09:14
@rzarzynski rzarzynski changed the base branch from reef to main October 1, 2024 09:14
@athanatos athanatos self-requested a review October 8, 2024 23:32
Currently an OSD does up to 3 key seeks while iterating over OMAP:
1. to rewind to the first OMAP entry of a particular RADOS object
   in the whole, global key namespace;
2. then to find the pagination-related `start_after` key;
3. then to respect the `filter` parameter.

This is pretty inefficient as `Seek()` in RocksDB is logarithmic,
so `Seek(n)` should be cheaper than `Seek(n/2)` + `Seek(n/2)`.

Please note there are other inefficiencies. The underlying `Seeks()`
are performed under the collection lock and the iterator is one-time
only. Ultimately we could try to cache it and reuse across multiple
rounds of the same listing of RGW.

Fixes: https://tracker.ceph.com/issues/68457

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
…bound()

This squeezes an additional call into the OMAP iterator on the hot
RGW bucket-listing path while preserving the pure-extension
characteristic of the interface modification; that is, there
is no second collection locking in BlueStore, and the patch
should be easier (in terms of risk) to backport.

Inspired by cxyxd's PR 59384 and Adam Kupczyk's PR 60056.

Fixes: https://tracker.ceph.com/issues/68457

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Fixes: https://tracker.ceph.com/issues/68457

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Fixes: https://tracker.ceph.com/issues/68457

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Fixes: https://tracker.ceph.com/issues/68457

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
@rzarzynski rzarzynski force-pushed the wip-os-cheaper-get_omap_iterator branch from cb1b35a to 901acac October 9, 2024 10:34
@rzarzynski rzarzynski marked this pull request as ready for review October 9, 2024 10:35
@rzarzynski rzarzynski requested a review from a team as a code owner October 9, 2024 10:35
Contributor

@mkogan1 mkogan1 left a comment


with this PR the time it took to list 50M objects is shorter

time (nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4K -d -1 -t $(( $(numactl -N 0 -- nproc) / 1 )) -b 1 -n 50000000 -m l -bp b01b- -op 'folder01/stage01_')

# Before PR: 47:20.61 total
2024/10/09 08:06:42 Running Loop 0 BUCKET LIST TEST
2024/10/09 08:54:02 Loop: 0, Int: TOTAL, Dur(s): 2840.6, Mode: LIST, Ops: 50000, MB/s: 0.00, IO/s: 18, Lat(ms): [ min: 43.5, avg: 56.8, 99%: 76.6, max: 120.5 ], Slowdowns: 0               
( nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s  -u  )  2163.04s user 85.44s system 79% cpu 47:20.61 total

# After PR: 46:58.62 total
2024/10/09 12:26:55 Running Loop 0 BUCKET LIST TEST 
2024/10/09 13:13:53 Loop: 0, Int: TOTAL, Dur(s): 2818.6, Mode: LIST, Ops: 50000, MB/s: 0.00, IO/s: 18, Lat(ms): [ min: 43.0, avg: 56.4, 99%: 76.3, max: 127.1 ], Slowdowns: 0 
( nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s  -u  )  2160.19s user 84.21s system 79% cpu 46:58.62 total

@markhpc
Member

markhpc commented Oct 10, 2024

with this PR the time it took to list 50M objects is shorter

time (nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4K -d -1 -t $(( $(numactl -N 0 -- nproc) / 1 )) -b 1 -n 50000000 -m l -bp b01b- -op 'folder01/stage01_')

# Before PR: 47:20.61 total
2024/10/09 08:06:42 Running Loop 0 BUCKET LIST TEST
2024/10/09 08:54:02 Loop: 0, Int: TOTAL, Dur(s): 2840.6, Mode: LIST, Ops: 50000, MB/s: 0.00, IO/s: 18, Lat(ms): [ min: 43.5, avg: 56.8, 99%: 76.6, max: 120.5 ], Slowdowns: 0               
( nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s  -u  )  2163.04s user 85.44s system 79% cpu 47:20.61 total

# After PR: 46:58.62 total
2024/10/09 12:26:55 Running Loop 0 BUCKET LIST TEST 
2024/10/09 13:13:53 Loop: 0, Int: TOTAL, Dur(s): 2818.6, Mode: LIST, Ops: 50000, MB/s: 0.00, IO/s: 18, Lat(ms): [ min: 43.0, avg: 56.4, 99%: 76.3, max: 127.1 ], Slowdowns: 0 
( nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s  -u  )  2160.19s user 84.21s system 79% cpu 46:58.62 total

Any idea what the standard deviation is? If I did my math right we're looking at around a 0.8% improvement in this test?

@mkogan1
Contributor

mkogan1 commented Oct 14, 2024

Any idea what the standard deviation is? If I did my math right we're looking at around a 0.8% improvement in this test?

yes, regrettably, repeated tests suggest it is within the noise variability range

@aclamk
Contributor

aclamk commented Oct 15, 2024

@athanatos: @rzarzynski

1. both PRs change the interface. 
   This one does that by extension to free from reasoning about other callers. 
   It makes it more complex but less intrusive. The driving goal was ability to (back)port to many 
   releases without repeating the human-based inspection – only pin-pointed users are affected, 
   interface providers are verified by a compiler (as happened with the reef-main forthporting).
   Whatever the interface becomes, I believe it should be consistent across all implementations of `ObjectStore`.

a) This is simply NOT true. #60056 does not change the interface.
One could argue that it modifies a contract, if we had a contract that the iterator, upon creation, seeks to the first OMAP key of the object. But it's not stated anywhere.
We only have implementation.
And in all use cases we do seek_to_first/lower_bound/upper_bound.
b) Both #60056 and #60000 can be backported. #60056 is smaller.
c) If a backportability without inspection is a goal, one can replace
ceph_assert(seeked);
with
if (!seeked) seek_to_first();
This gives complete equivalence of previous and new behaviour.
d) Our omap implementations in different ObjectStores are already different.
OmapIterators in BlueStore freeze the dataset at iterator creation; no seeking will let you see a key inserted later.
OmapIterators in KStore and MemStore point to a live database; changes to the db are visible via an iterator created earlier.
I do not think we want to invest in unifying the difference.
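Point (c) above can be illustrated with a toy iterator that lazily falls back to seek_to_first() on first access. This is a sketch over std::map with illustrative names, not the real ObjectStore iterator:

```cpp
#include <map>
#include <string>

// Toy illustration of the lazy-seek fallback: instead of
// ceph_assert(seeked), default to seek_to_first() on first use,
// preserving the old implicit-seek behaviour for unaudited callers.
class LazyOmapIter {
  const std::map<std::string, std::string>& kv;
  std::map<std::string, std::string>::const_iterator it;
  bool seeked = false;
public:
  explicit LazyOmapIter(const std::map<std::string, std::string>& m)
    : kv(m), it(m.end()) {}
  void seek_to_first() { it = kv.begin(); seeked = true; }
  void lower_bound(const std::string& k) { it = kv.lower_bound(k); seeked = true; }
  bool valid() {
    if (!seeked) seek_to_first();  // equivalence with the previous behaviour
    return it != kv.end();
  }
  const std::string& key() const { return it->first; }
  void next() { ++it; }
};
```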

As a side note, I just realized that my omap bench test uses the ability to create an iterator to lock the dataset and perform the seeking later. #60315

2. Seek-at-create allows to squeeze relocking the `Collection::lock`. 
   This doesn't make a difference when listing with extensive iteration (e.g. `max_keys=1000`)
   but the targeted bottleneck is the _list-to-check-for-existence_ case (`max_keys=2`).

True. I am not sure what the Collection::locks are for, but we are taking them.

Ultimately, I think that our path forward should be to get rid of OmapIterator altogether, as proposed here:
https://github.com/ceph/ceph/pull/60278/files#diff-14111846d7c70ada2d719669d9ed93e7d153c198ef1aa93c44b94b00125f6d24R791

@rzarzynski
Contributor Author

rzarzynski commented Oct 15, 2024

@aclamk:

a) This is simply NOT true. #60056 does not change interface.

I disagree. The interface is broader than what can be expressed with language constructs; it is established also by documentation and, particularly in the lack thereof, by implementations. If it had stayed intact, an audit of all callers wouldn't have been necessary.

All providers of the OMAP iterator interface do seek-at-create.

b) Both #60056 and #60000 can be backported. #60056 is smaller.

I agree they can be. The question is about risk.

c) If a backportability without inspection is a goal, one can replace
ceph_assert(seeked);
with
if (!seeked) seek_to_first();

Then it would have been far less intrusive.

@rzarzynski
Contributor Author

rzarzynski commented Oct 15, 2024

@markhpc, @mkogan1:

  1. hsbench exercises the plain bucket listing instead of the check-for-existence (list with max-keys=2 + prefix). In hsbench.go (from @markhpc's repo):
func runBucketList(thread_num int, stats *Stats) {
        // ...
                // ...
                err := svc.ListObjectsPages(
                        &s3.ListObjectsInput{
                                Bucket:  &buckets[bucket_num],
                                MaxKeys: &max_keys,
                        },

Please note the absence of the prefix S3 parameter.

max_keys is user-configurable, with the default being 1000:

func init() {
        // ...
        myflag.Int64Var(&max_keys, "mk", 1000, "Maximum number of keys to retreive at once for bucket listings")

The default doesn't cause RGW to paginate the output, so start_after will not be used either.

  2. The PR optimizes start_after, filter_prefix and particularly the check-for-existence case while doing very little for plain listing (without the prefix and start_after parameters).
  3. To verify the impact on check-for-existence I extended rados bench to cover OMAP reads (PR #60277).
  4. This new & unreviewed benchmark shows a 20% gain for the list-for-existence case and roughly the same numbers for the generic, parameterless list.

However, more interesting is the profiling under the new workload. In addition to list-for-existence, it allowed optimizing the plain listing case as well. #60278 is a continuation of this PR, with backportability again being the driving goal.

@markhpc
Member

markhpc commented Oct 17, 2024

@rzarzynski Is there any chance I could ask you to submit a PR for hsbench as well that would let us test for this?

struct omap_iter_seek_t {
std::string seek_position;
enum {
LOWER_BOUND,
Contributor


Shouldn't we additionally have BEGIN or something for the sake of completeness?

CollectionHandle &c, ///< [in] collection
const ghobject_t &oid ///< [in] object
const ghobject_t &oid, ///< [in] object
omap_iter_seek_t start_from = omap_iter_seek_t::min_lower_bound() ///< [in] where the iterator should point to at the beginning
Contributor


Wouldn't be passing "const omap_iter_seek_t&" more straightforward/readable in terms of what's being copied during the call?

o->get_omap_key(string(), &head);
o->get_omap_tail(&tail);
it->lower_bound(head);
string key;
Contributor


IMO this changes the default behavior when start_from is empty. Originally we seek to lower_bound(head), but now it would be lower_bound("").

Contributor

@ifed01 ifed01 Oct 22, 2024


And with this approach we're still not eliminating the duplicate lookup for get_omap_iterator() callers other than the one in PrimaryLogPG.cc...

@aclamk
Contributor

aclamk commented Oct 31, 2024

Some results from alternative approach:
#60056 (comment)

@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@github-actions github-actions bot removed the stale label Jan 13, 2025
@rzarzynski
Contributor Author

rzarzynski commented Jan 13, 2025

Closing as the far broader #60278 has been merged.

@rzarzynski rzarzynski closed this Jan 13, 2025