octopus: rgw: resolve empty ordered bucket listing results w/ CLS filtering *and* bucket index list produces incorrect result when non-ascii entries #45088
Conversation
A recent PR that helped address the issue of non-ascii plain entries didn't cover all the bases, allowing I/O errors to be produced in some circumstances during a bucket index list (i.e., `radosgw-admin bi list ...`). This fixes those issues and does some additional clean-up. Signed-off-by: J. Eric Ivancich <ivancich@redhat.com> (cherry picked from commit e714f0d) (cherry picked from commit d3d8df7)
When using asynchronous (concurrent) IO for bucket index requests, there are two int ids in use that need to be kept separate -- the shard id and the request id. In many cases they're the same -- shard 0 gets request 0, and so forth. But in preparation for re-requests, those ids can diverge, so that, e.g., request 13 maps to shard 2. The existing code maintained the OID that went with each request; this PR maintains the shard id as well. Documentation has been beefed up to help future developers navigate this. Signed-off-by: J. Eric Ivancich <ivancich@redhat.com> (cherry picked from commit 9606346) Conflicts: src/cls/rgw/cls_rgw_client.cc src/cls/rgw/cls_rgw_client.h In all cases I took the patch code, not mangling anything. These were all cases of std:: or auto type.
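A minimal sketch of the bookkeeping described above, with hypothetical names (`ShardTarget`, `build_requests` are illustrations, not the actual CLSRGWConcurrentIO code): each request records both its target shard id and that shard's OID, so a re-request whose request id differs from its shard id can still be routed correctly.

```cpp
#include <map>
#include <string>

// Hypothetical per-request record: keep the shard id alongside the OID so
// request ids and shard ids are free to diverge on retry.
struct ShardTarget {
  int shard_id;
  std::string oid;
};

// Request id -> target shard. Initially ids coincide (request 0 -> shard 0,
// and so forth); an "advancing" retry then gets a fresh request id.
std::map<int, ShardTarget> build_requests() {
  std::map<int, ShardTarget> requests;
  int next_request_id = 0;
  for (int shard = 0; shard < 3; ++shard) {
    requests[next_request_id++] =
        {shard, "bucket.index." + std::to_string(shard)};
  }
  // A retry of shard 2: request 3 now maps back to shard 2, not shard 3.
  requests[next_request_id++] = {2, "bucket.index.2"};
  return requests;
}
```

With only the OID stored (as in the pre-PR code), recovering the shard id from a request would require parsing it back out of the object name; storing it explicitly avoids that.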
When doing an asynchronous/concurrent bucket index operation against multiple bucket index shards, a special error code is set aside to indicate that an "advancing" retry of one or more shards is necessary. In that case another asynchronous call is made on the indicated shard(s) from the client (i.e., CLSRGWConcurrentIO). It is up to the subclass of CLSRGWConcurrentIO to handle the retry such that it "advances" and doesn't simply get stuck, looping forever. The retry functionality only works when the "need_multiple_rounds" functionality is not in use. Signed-off-by: J. Eric Ivancich <ivancich@redhat.com> (cherry picked from commit 5d28307) Conflicts: src/cls/rgw/cls_rgw_client.cc src/cls/rgw/cls_rgw_client.h Resolved by taking the patch version -- all cases of auto type and std::
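The retry contract can be sketched as follows (hypothetical names throughout: `ADVANCE_AND_RETRY` stands in for whatever sentinel errno the real cls_rgw code reserves, and `drive_shards` is an illustration, not the CLSRGWConcurrentIO implementation): shards that return the sentinel are re-issued, and forward progress is guaranteed only because the callee advances its marker before asking for the retry.

```cpp
#include <functional>
#include <map>
#include <string>

// Hypothetical stand-in for the reserved "advance and retry" error code.
constexpr int ADVANCE_AND_RETRY = -1000;

// Drive concurrent per-shard calls to completion. call_shard receives the
// shard id and a mutable marker; returning ADVANCE_AND_RETRY means "I moved
// the marker forward, call me again", so the loop cannot spin in place.
int drive_shards(std::map<int, std::string>& markers,
                 const std::function<int(int, std::string&)>& call_shard) {
  std::map<int, bool> pending;
  for (auto& [shard, marker] : markers) {
    (void)marker;
    pending[shard] = true;
  }
  while (!pending.empty()) {
    for (auto it = pending.begin(); it != pending.end(); ) {
      int r = call_shard(it->first, markers[it->first]);
      if (r == ADVANCE_AND_RETRY) {
        ++it;                      // marker was advanced; retry this shard
      } else if (r < 0) {
        return r;                  // hard error aborts the whole operation
      } else {
        it = pending.erase(it);    // this shard is done
      }
    }
  }
  return 0;
}
```

The "advances" requirement in the commit message maps onto the invariant in the comment: a subclass that returned the sentinel without moving its marker would loop forever here.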
A previous PR moved much of the filtering that's part of bucket listing to the CLS layer. One unanticipated result was that it is now possible for a call to return 0 entries. In such a case we want to retry the call with the marker moved forward (i.e., advanced), repeatedly if necessary, in order to either retrieve some entries or hit the end of the entries. This PR adds that functionality. Signed-off-by: J. Eric Ivancich <ivancich@redhat.com> (cherry picked from commit 423c183) Conflicts: src/cls/rgw/cls_rgw_ops.h s/ceph::buffer::list/bufferlist/g
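The caller-side loop being described can be sketched like this (hypothetical names: `ListResult`, `list_until_entries`, and `do_list` are illustrations under the assumption that each call reports a resume marker and a truncated flag, not the actual rgw types): because filtering now happens below us, an empty result no longer means the listing is finished, so we keep re-calling with the advanced marker.

```cpp
#include <functional>
#include <string>
#include <vector>

// Assumed shape of one listing call's result.
struct ListResult {
  std::vector<std::string> entries;  // entries that survived CLS filtering
  std::string next_marker;           // where a follow-up call should resume
  bool truncated;                    // false once the end of the index is hit
};

// Retry with an advanced marker until we either retrieve some entries or
// reach the end of the entries, per the commit message above.
ListResult list_until_entries(
    std::string marker,
    const std::function<ListResult(const std::string&)>& do_list) {
  while (true) {
    ListResult r = do_list(marker);
    if (!r.entries.empty() || !r.truncated) {
      return r;                // got data, or genuinely hit the end
    }
    marker = r.next_marker;    // everything in this range was filtered out;
                               // advance past it and try the next range
  }
}
```

Without the loop, a region of the index containing only filtered-out entries would surface to the caller as a spurious empty listing.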
This includes the single commit from #44858.
When "bucket index list" traverses the different regions in the bucket index assembling the output, it miscalculates how many entries to ask for at one point. This fixes that. This fixes the previous commit "rgw: bucket index list can produce I/O errors". Credit for finding this bug goes to Soumya Koduri <skoduri@redhat.com>. Signed-off-by: J. Eric Ivancich <ivancich@redhat.com> (cherry picked from commit aa76051)
@ivancich as mentioned in https://tracker.ceph.com/issues/51462#note-7 we started to get bucket listing problems after upgrading to octopus. So I'm trying to collect here the backports needed to bring octopus up to the state of the art in cls_rgw bucket listing. Does this look complete and safe to try?
We tested this PR on our cluster with bucket listing. It fixes the issue!
https://pulpito.ceph.com/yuriw-2022-02-24_22:51:58-rgw-wip-yuri10-testing-2022-02-24-1329-octopus-distro-default-smithi/ shows 3 jobs failing where the exception is thrown in the cls_rgw_client code (not the cls_rgw part running in the osd). in the past, i've seen
so far it's unclear whether this also happens in the pacific backport, we didn't get a clean run there due to issues around centos 8 eol https://pulpito.ceph.com/yuriw-2022-02-24_20:49:23-rgw-wip-yuri2-testing-2022-02-23-1524-pacific-distro-default-smithi/
Make sure marker is cleared. Put end-of-list check inside the conditional with the rest of the test. Add some additional testing. Signed-off-by: J. Eric Ivancich <ivancich@redhat.com> (cherry picked from commit 1bf0581)
jenkins retest this please
thanks @dvanders, worth a shot! i tried reproducing this in an old focal vm, but the test passes consistently there. so i guess it's somehow specific to centos?
jenkins test api
Argh! v15.2.16 was just released, and this PR is not in it ;-(
@GillesMocellin i'm afraid that Ceph no longer has much of a backport team, and rgw has not been getting timely backports as a result. if you're interested in getting involved, you can find some resources in https://tracker.ceph.com/projects/ceph-releases/wiki and https://github.com/ceph/ceph/blob/master/SubmittingPatches-backports.rst (and thanks to @dvanders for working on this one)
Fix bugs surrounding calculation of number of entries returned and whether the end of a listing range has been reached. Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
jenkins test api
backport tracker: https://tracker.ceph.com/issues/52076
backport of #42125 and #42404 and #44562
parent tracker: https://tracker.ceph.com/issues/51462 and https://tracker.ceph.com/issues/51429