
osd: optimize handling of RGW's start_after & filter during OMAP iteration #60000

Closed

rzarzynski wants to merge 6 commits into ceph:main from rzarzynski:wip-os-cheaper-get_omap_iterator

Conversation

@rzarzynski
Contributor

@rzarzynski rzarzynski commented Sep 26, 2024

Currently an OSD does up to 3 key seeks while iterating over OMAP:

  1. to rewind to the first OMAP entry of a particular RADOS object in the whole, global key namespace;
  2. then to find the pagination-related start_after key;
  3. then to respect the filter parameter.

This is pretty inefficient: Seek() in RocksDB is logarithmic, so one Seek(n) should be cheaper than Seek(n/2) + Seek(n/2).

Also, this change further narrows the RocksDB iterator's bound.

Please note there are other inefficiencies: the underlying Seek()s are performed under the collection lock, and the iterator is one-time only. Ultimately we could try to cache the iterator and reuse it across multiple rounds of the same RGW listing.
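The three seeks above share a single effective lower bound, so they can in principle be collapsed into one Seek(). A minimal sketch of the idea (hypothetical helper names, not the actual patch; keys are assumed to be in the object's own OMAP key space, with the object prefix prepended by the caller):

```cpp
#include <algorithm>
#include <string>

// Collapse the three logical seek targets into a single RocksDB Seek().
std::string effective_seek_key(const std::string& object_head,  // first OMAP key of the object
                               const std::string& start_after,  // pagination cursor, "" if unset
                               const std::string& filter)       // prefix filter, "" if unset
{
  // start_after is exclusive, so the iterator must land strictly after it;
  // appending '\0' yields the smallest key greater than start_after.
  const std::string after =
      start_after.empty() ? std::string() : start_after + '\0';
  // One Seek() to the largest of the three lower bounds replaces up to
  // three separate seeks.
  return std::max({object_head, after, filter});
}
```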

Signed-off-by: Radoslaw Zarzynski rzarzyns@redhat.com


The impact of this change is currently unknown. I'm hunting for a cluster to verify it. Draft for now.

Benchmarking & profiling under the freshly extended, unreviewed rados bench: https://gist.github.com/rzarzynski/dbfedcb55bd9c9cafeb0b0c3358a32ec

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@github-actions github-actions bot added this to the reef milestone Sep 26, 2024
@rzarzynski rzarzynski changed the title Wip os cheaper get omap iterator osd: optimize handling of RGW's start_after & filter during OMAP iteration Sep 26, 2024
@anthonyeleven
Contributor

Nice! I'm not qualified to approve this code PR, but it sounds like good stuff

@rzarzynski
Contributor Author

The only change in the recent push is dissecting the "Also, this change further narrows the RocksDB iterator's bound." part into a dedicated commit.

Contributor

@aclamk aclamk left a comment


This is valuable work on reducing extra seeks in RocksDB iterators, but the details have to be ironed out.

@rzarzynski rzarzynski force-pushed the wip-os-cheaper-get_omap_iterator branch 2 times, most recently from 09cd45b to 65b78be September 30, 2024 17:05
@rzarzynski rzarzynski requested a review from aclamk October 1, 2024 07:53
@rzarzynski rzarzynski force-pushed the wip-os-cheaper-get_omap_iterator branch from 72fcda0 to f8f09f0 October 1, 2024 08:10
@rzarzynski rzarzynski force-pushed the wip-os-cheaper-get_omap_iterator branch from f8f09f0 to d6f7b36 October 1, 2024 08:59
@rzarzynski
Contributor Author

The new revision dropped 65b78be.

@rzarzynski rzarzynski requested a review from aclamk October 1, 2024 09:01
@rzarzynski rzarzynski force-pushed the wip-os-cheaper-get_omap_iterator branch from d6f7b36 to 58e6f25 October 1, 2024 09:14
@rzarzynski rzarzynski changed the base branch from reef to main October 1, 2024 09:14
@athanatos athanatos self-requested a review October 8, 2024 23:32
Currently an OSD does up to 3 key seeks while iterating over OMAP:
1. to rewind to the first OMAP entry of a particular RADOS object
   in the whole, global key namespace;
2. then to find the pagination-related `start_after` key;
3. then to respect the `filter` parameter.

This is pretty inefficient as `Seek()` in RocksDB is logarithmic,
so `Seek(n)` should be cheaper than `Seek(n/2)` + `Seek(n/2)`.

Please note there are other inefficiencies. The underlying `Seeks()`
are performed under the collection lock and the iterator is one-time
only. Ultimately we could try to cache it and reuse across multiple
rounds of the same listing of RGW.

Fixes: https://tracker.ceph.com/issues/68457

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
…bound()

This squeezes an additional call into the OMAP iterator on the hot
RGW bucket-listing path while preserving the pure-extension
characteristic of the interface modification; that is, there
is no second collection locking in BlueStore, and the patch
should be easier (in terms of risk) to backport.

Inspired by cxyxd's PR 59384 and Adam Kupczyk's PR 60056.

Fixes: https://tracker.ceph.com/issues/68457

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Fixes: https://tracker.ceph.com/issues/68457

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Fixes: https://tracker.ceph.com/issues/68457

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Fixes: https://tracker.ceph.com/issues/68457

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
@rzarzynski rzarzynski force-pushed the wip-os-cheaper-get_omap_iterator branch from cb1b35a to 901acac October 9, 2024 10:34
@rzarzynski rzarzynski marked this pull request as ready for review October 9, 2024 10:35
@rzarzynski rzarzynski requested a review from a team as a code owner October 9, 2024 10:35
Contributor

@mkogan1 mkogan1 left a comment


with this PR the time it took to list 50M objects is shorter

time (nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4K -d -1 -t $(( $(numactl -N 0 -- nproc) / 1 )) -b 1 -n 50000000 -m l -bp b01b- -op 'folder01/stage01_')

# Before PR: 47:20.61 total
2024/10/09 08:06:42 Running Loop 0 BUCKET LIST TEST
2024/10/09 08:54:02 Loop: 0, Int: TOTAL, Dur(s): 2840.6, Mode: LIST, Ops: 50000, MB/s: 0.00, IO/s: 18, Lat(ms): [ min: 43.5, avg: 56.8, 99%: 76.6, max: 120.5 ], Slowdowns: 0               
( nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s  -u  )  2163.04s user 85.44s system 79% cpu 47:20.61 total

# After PR: 46:58.62 total
2024/10/09 12:26:55 Running Loop 0 BUCKET LIST TEST 
2024/10/09 13:13:53 Loop: 0, Int: TOTAL, Dur(s): 2818.6, Mode: LIST, Ops: 50000, MB/s: 0.00, IO/s: 18, Lat(ms): [ min: 43.0, avg: 56.4, 99%: 76.3, max: 127.1 ], Slowdowns: 0 
( nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s  -u  )  2160.19s user 84.21s system 79% cpu 46:58.62 total

@markhpc
Member

markhpc commented Oct 10, 2024

with this PR the time it took to list 50M objects is shorter

time (nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4K -d -1 -t $(( $(numactl -N 0 -- nproc) / 1 )) -b 1 -n 50000000 -m l -bp b01b- -op 'folder01/stage01_')

# Before PR: 47:20.61 total
2024/10/09 08:06:42 Running Loop 0 BUCKET LIST TEST
2024/10/09 08:54:02 Loop: 0, Int: TOTAL, Dur(s): 2840.6, Mode: LIST, Ops: 50000, MB/s: 0.00, IO/s: 18, Lat(ms): [ min: 43.5, avg: 56.8, 99%: 76.6, max: 120.5 ], Slowdowns: 0               
( nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s  -u  )  2163.04s user 85.44s system 79% cpu 47:20.61 total

# After PR: 46:58.62 total
2024/10/09 12:26:55 Running Loop 0 BUCKET LIST TEST 
2024/10/09 13:13:53 Loop: 0, Int: TOTAL, Dur(s): 2818.6, Mode: LIST, Ops: 50000, MB/s: 0.00, IO/s: 18, Lat(ms): [ min: 43.0, avg: 56.4, 99%: 76.3, max: 127.1 ], Slowdowns: 0 
( nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s  -u  )  2160.19s user 84.21s system 79% cpu 46:58.62 total

Any idea what the standard deviation is? If I did my math right we're looking at around a 0.8% improvement in this test?

@mkogan1
Contributor

mkogan1 commented Oct 14, 2024

Any idea what the standard deviation is? If I did my math right we're looking at around a 0.8% improvement in this test?

yes, regrettably, repeated tests suggest it is within the noise variability range

@aclamk
Contributor

aclamk commented Oct 15, 2024

@athanatos: @rzarzynski

1. both PRs change the interface. 
   This one does that by extension to free from reasoning about other callers. 
   It makes it more complex but less intrusive. The driving goal was ability to (back)port to many 
   releases without repeating the human-based inspection – only pin-pointed users are affected, 
   interface providers are verified by a compiler (as happened with the reef-main forthporting).
   Whatever the interface becomes, I believe it should be consistent across all implementations of `ObjectStore`.

a) This is simply NOT true. #60056 does not change the interface.
One could argue that it modifies a contract, if we had a contract that the iterator, upon creation, seeks to the first OMAP key of the object. But it's not stated anywhere.
We only have implementation.
And in all use cases we do seek_to_first/lower_bound/upper_bound.
b) Both #60056 and #60000 can be backported. #60056 is smaller.
c) If a backportability without inspection is a goal, one can replace
ceph_assert(seeked);
with
if (!seeked) seek_to_first();
This gives complete equivalence of previous and new behaviour.
d) Our omap implementations in different ObjectStores are already different.
OmapIterators in BlueStore freeze the dataset at iterator creation; no seeking will let you see a key inserted later.
OmapIterators in KStore and MemStore point to a live database; changes to the db are visible via an iterator created earlier.
I do not think we want to invest in unifying the difference.
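Point (c) above can be illustrated with a toy iterator that lazily falls back to seek_to_first() on first access. This is a sketch over std::map with illustrative names, not the real ObjectStore iterator:

```cpp
#include <map>
#include <string>

// Toy illustration of the lazy-seek fallback: instead of
// ceph_assert(seeked), default to seek_to_first() on first use,
// preserving the old implicit-seek behaviour for unaudited callers.
class LazyOmapIter {
  const std::map<std::string, std::string>& kv;
  std::map<std::string, std::string>::const_iterator it;
  bool seeked = false;
public:
  explicit LazyOmapIter(const std::map<std::string, std::string>& m)
    : kv(m), it(m.end()) {}
  void seek_to_first() { it = kv.begin(); seeked = true; }
  void lower_bound(const std::string& k) { it = kv.lower_bound(k); seeked = true; }
  bool valid() {
    if (!seeked) seek_to_first();  // equivalence with the previous behaviour
    return it != kv.end();
  }
  const std::string& key() const { return it->first; }
  void next() { ++it; }
};
```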

As a side note, I just realized that my omap bench test uses the ability to create an iterator to lock the dataset and perform the seeking later. #60315

2. Seek-at-create allows to squeeze relocking the `Collection::lock`. 
   This doesn't make a difference when listing with extensive iteration (e.g. `max_keys=1000`)
   but the targeted bottleneck is the _list-to-check-for-existence_ case (`max_keys=2`).

True. I am not sure what the Collection::locks are for, but we are taking them.

Ultimately, I think that our path forward should be to get rid of OmapIterator altogether, as proposed here:
https://github.com/ceph/ceph/pull/60278/files#diff-14111846d7c70ada2d719669d9ed93e7d153c198ef1aa93c44b94b00125f6d24R791

@rzarzynski
Contributor Author

rzarzynski commented Oct 15, 2024

@aclamk:

a) This is simply NOT true. #60056 does not change interface.

I disagree. The interface is broader than what can be expressed with language constructs; it is established also by documentation and, particularly in the lack thereof, by implementations. If it had stayed intact, an audit of all callers wouldn't have been necessary.

All providers of the OMAP iterator interface do seek-at-create.

b) Both #60056 and #60000 can be backported. #60056 is smaller.

I agree they can be. The question is about risk.

c) If a backportability without inspection is a goal, one can replace
ceph_assert(seeked);
with
if (!seeked) seek_to_first();

Then it would have been far less intrusive.

@rzarzynski
Contributor Author

rzarzynski commented Oct 15, 2024

@markhpc, @mkogan1:

  1. hsbench exercises the plain bucket listing instead of the check-for-existence (list with max-keys=2 + prefix). In hsbench.go (from @markhpc's repo):
func runBucketList(thread_num int, stats *Stats) {
        // ...
                // ...
                err := svc.ListObjectsPages(
                        &s3.ListObjectsInput{
                                Bucket:  &buckets[bucket_num],
                                MaxKeys: &max_keys,
                        },

Please note the absence of the prefix S3 parameter.

max_keys is user-configurable, with the default being 1000:

func init() {
        // ...
        myflag.Int64Var(&max_keys, "mk", 1000, "Maximum number of keys to retreive at once for bucket listings")

The default doesn't cause RGW to paginate the output, so start_after will not be used either.

  2. The PR optimizes start_after, filter_prefix and particularly the check-for-existence case while doing very little for plain listing (without the prefix and start_after parameters).
  3. To verify the impact on check-for-existence I extended rados bench to cover OMAP reads (PR #60277).
  4. This new & unreviewed benchmark shows a 20% gain for the list-for-existence case and roughly the same numbers for the generic, parameterless list.

However, more interesting is the profiling under the new workload. In addition to list-for-existence, it allowed optimizing the plain listing case as well. #60278 is a continuation of this PR, with backportability again being the driving goal.

@markhpc
Member

markhpc commented Oct 17, 2024

@rzarzynski Is there any chance I could ask you to submit a PR for hsbench as well that would let us test for this?

struct omap_iter_seek_t {
std::string seek_position;
enum {
LOWER_BOUND,
Contributor


Shouldn't we additionally have BEGIN or something for the sake of completeness?

CollectionHandle &c, ///< [in] collection
const ghobject_t &oid ///< [in] object
const ghobject_t &oid, ///< [in] object
omap_iter_seek_t start_from = omap_iter_seek_t::min_lower_bound() ///< [in] where the iterator should point to at the beginning
Contributor


Wouldn't be passing "const omap_iter_seek_t&" more straightforward/readable in terms of what's being copied during the call?

o->get_omap_key(string(), &head);
o->get_omap_tail(&tail);
it->lower_bound(head);
string key;
Contributor


IMO this changes the default behavior when start_from is empty. Originally we seek to lower_bound(head), but now it would be lower_bound("").

Contributor

@ifed01 ifed01 Oct 22, 2024


And with this approach we're still not eliminating the duplicate lookup for get_omap_iterator() callers other than the one in PrimaryLogPG.cc...

@aclamk
Contributor

aclamk commented Oct 31, 2024

Some results from alternative approach:
#60056 (comment)

@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@github-actions github-actions bot removed the stale label Jan 13, 2025
@rzarzynski
Contributor Author

rzarzynski commented Jan 13, 2025

Closing as the far broader #60278 has been merged.

@rzarzynski rzarzynski closed this Jan 13, 2025