osd: optimize handling of RGW's start_after & filter during OMAP iteration #60000
osd: optimize handling of RGW's start_after & filter during OMAP iteration #60000rzarzynski wants to merge 6 commits intoceph:mainfrom
Conversation
|
Nice! I'm not qualified to approve this code PR, but it sounds like good stuff |
89a8278 to
2b6117c
Compare
|
The only change is the recent push is dissecting the "Also, this change further narrows the RocksDB iterator's bound." part into a dedicated commit. |
aclamk
left a comment
There was a problem hiding this comment.
This is a valuable work on reducing extra seeks in RocksDB iterators, but details have to be iron out.
09cd45b to
65b78be
Compare
72fcda0 to
f8f09f0
Compare
f8f09f0 to
d6f7b36
Compare
|
The new revision dropped 65b78be. |
d6f7b36 to
58e6f25
Compare
Currently an OSD does up to 3 key seeks while iterating over OMAP: 1. to rewind to the first OMAP entry of particular RADOS object in the whole, global key namespace; 2. then to find the pagination-related `start_after` key; 3. then to respect the `filter` parameter. This is pretty inefficient as `Seek()` in RocksDB is logarithmic, so `Seek(n)` should be cheaper than `Seek(n/2)` + `Seek(n/2)`. Please note there are other inefficiencies. The underlying `Seeks()` are performed under the collection lock and the iterator is one-time only. Ultimately we could try to cache it and reuse across multiple rounds of the same listing of RGW. Fixes: https://tracker.ceph.com/issues/68457 Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Fixes: https://tracker.ceph.com/issues/68457 Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
…bound() This squeezes additional calling into OMAP iterator on the hot RGW bucket listing path while preserving the pure-extend characteristic of the interface modification; that is, there is no 2nd collection locking in BlueStore while the patch should be easier (in terms of risk) to backport. Inspired by cxyxd's PR 59384 and Adam Kupczyk's PR 60056. Fixes: https://tracker.ceph.com/issues/68457 Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Fixes: https://tracker.ceph.com/issues/68457 Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Fixes: https://tracker.ceph.com/issues/68457 Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Fixes: https://tracker.ceph.com/issues/68457 Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
cb1b35a to
901acac
Compare
mkogan1
left a comment
There was a problem hiding this comment.
with this PR the time it took to list 50M objects is shorter
time (nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4K -d -1 -t $(( $(numactl -N 0 -- nproc) / 1 )) -b 1 -n 50000000 -m l -bp b01b- -op 'folder01/stage01_')
# Before PR: 47:20.61 total
2024/10/09 08:06:42 Running Loop 0 BUCKET LIST TEST
2024/10/09 08:54:02 Loop: 0, Int: TOTAL, Dur(s): 2840.6, Mode: LIST, Ops: 50000, MB/s: 0.00, IO/s: 18, Lat(ms): [ min: 43.5, avg: 56.8, 99%: 76.6, max: 120.5 ], Slowdowns: 0
( nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s -u ) 2163.04s user 85.44s system 79% cpu 47:20.61 total
# After PR: 46:58.62 total
2024/10/09 12:26:55 Running Loop 0 BUCKET LIST TEST
2024/10/09 13:13:53 Loop: 0, Int: TOTAL, Dur(s): 2818.6, Mode: LIST, Ops: 50000, MB/s: 0.00, IO/s: 18, Lat(ms): [ min: 43.0, avg: 56.4, 99%: 76.3, max: 127.1 ], Slowdowns: 0
( nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s -u ) 2160.19s user 84.21s system 79% cpu 46:58.62 total
Any idea what the standard deviation is? If I did my math right we're looking at around a 0.8% improvement in this test? |
yes, regrettably from repeated tests seems like it is within the noise variability range |
a) This is simply NOT true. #60056 does not change interface. As a side note, I just realized that my omap bench test uses the ability to create iterator to lock the dataset, and perform seeking later. #60315
True. I am not sure what are the Ultimately, I think that out path forward should be to get rid of OmapIterator altogether, like proposed here: |
I disagree. The interface is broader than what can be expressed with language constructs; it's established also by documentation and, particularly in lack of thereof, by implementations. If it had stayed intact, audit of all callers wouldn't have been necessary. All providers of the OMAP iterator interface do seek-at-create.
I agree they can be. The question is about risk.
Then it would have been far less intrusive. |
func runBucketList(thread_num int, stats *Stats) {
// ...
// ...
err := svc.ListObjectsPages(
&s3.ListObjectsInput{
Bucket: &buckets[bucket_num],
MaxKeys: &max_keys,
},Please note the absence of the
func init() {
// ...
myflag.Int64Var(&max_keys, "mk", 1000, "Maximum number of keys to retreive at once for bucket listings")The default doesn't cause RGW to paginate the output, so neither
However, more interesting is profiling under the new workload. In addition to the list-for-existence, it allowed to optimize the plain listing case as well. #60278 is continuation of this PR, also in the matter of backportability being the driving goal. |
|
@rzarzynski Is there any chance I could ask you to submit a PR for hsbench as well that would let us test for this? |
| struct omap_iter_seek_t { | ||
| std::string seek_position; | ||
| enum { | ||
| LOWER_BOUND, |
There was a problem hiding this comment.
Shouldn't we additionally have BEGIN or something for the sake of completeness?
| CollectionHandle &c, ///< [in] collection | ||
| const ghobject_t &oid ///< [in] object | ||
| const ghobject_t &oid, ///< [in] object | ||
| omap_iter_seek_t start_from = omap_iter_seek_t::min_lower_bound() ///< [in] where the iterator should point to at the beginning |
There was a problem hiding this comment.
Wouldn't be passing "const omap_iter_seek_t&" more straightforward/readable in terms of what's being copied during the call?
| o->get_omap_key(string(), &head); | ||
| o->get_omap_tail(&tail); | ||
| it->lower_bound(head); | ||
| string key; |
There was a problem hiding this comment.
IMO this changes the default behavior when start_from is empty. Originally we seek to lower_bound(head) but not it would be lower_bound("")
There was a problem hiding this comment.
And with this approach we're still not eliminating the duplicate lookup for get_omap_iterator() callers other than one in PrimaryLogPG.cc...
|
Some results from alternative approach: |
|
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
|
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
|
Closing as the far broader #60278 has been merged. |
Currently an OSD does up to 3 key seeks while iterating over OMAP:
start_afterkey;filterparameter.This is pretty inefficient as
Seek()in RocksDB is logarithmic, soSeek(n)should be cheaper thanSeek(n/2)+Seek(n/2).Also, this change further narrows the RocksDB iterator's bound.Please note there are other inefficiencies. The underlying
Seeks()are performed under the collection lock and the iterator is one-time only. Ultimately we could try to cache it and reuse across multiple rounds of the same listing of RGW.Signed-off-by: Radoslaw Zarzynski rzarzyns@redhat.com
The impact of this change is currently unknown. I'm hunting for a cluster to verify it. Draft for now.Benchmarking & profiling under the freshly extended, unreviewed
rados bench: https://gist.github.com/rzarzynski/dbfedcb55bd9c9cafeb0b0c3358a32ecContribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an
xbetween the brackets:[x]. Spaces and capitalization matter when checking off items this way.Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test dashboard cephadmjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume toxjenkins test windowsjenkins test rook e2e