RGW: multi object delete op; skip olh update for all deletes but the last one#64800
Conversation
jenkins test make check
Force-pushed 311a5da to 476f24e, then 476f24e to 9bb1701. Commit message:
RGW: multi object delete op; skip olh update for all deletes but the last one
Fixes: https://tracker.ceph.com/issues/72375
Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
cbodley
left a comment
looks good to me. do we have s3test coverage of delete_objects() on versioned buckets to catch any regressions here?
Good point — I’ll check whether we already have coverage for delete_objects() on versioned buckets. If not, I’ll add a test, and in either case I’ll follow up here with a link to the relevant S3 test case.
Looks like we do have a testcase for multi object delete for versioning: https://github.com/ceph/s3-tests/blob/cb8c4b3ef8c2140f522f2cb57309de959ee3cf5b/s3tests_boto3/functional/test_s3.py#L1651

Just added an s3-tests PR, ceph/s3-tests#687, to enhance it a bit by adding more versions per object. Ran the fix here with the updated testcase:

```
$ S3TEST_CONF=./s3tests_vstart.conf tox -- s3tests_boto3/functional/test_s3.py::test_versioning_concurrent_multi_object_delete
...
collected 1 item
s3tests_boto3/functional/test_s3.py .
...
========================= 1 passed, 11 warnings in 3.04s =========================
py: OK (7.18=setup[3.74]+cmd[3.43] seconds)
  congratulations :) (7.24 seconds)
```

NOTE: the warnings are SyntaxWarnings for some untouched lines.

Also increased debug_rgw to 20 to get the skip update_olh() log lines:

```
$ grep skip radosgw.8000.log | grep 17329738590219513412
2025-09-03T13:56:28.393+0000 7f6ad4ef6700 20 req 17329738590219513412 0.015000031s s3:multi_object_delete key_0[gF5H7aX4YNLV2O9Yzc52JCxyBEtjLcs] skip update_olh() target_obj=oozmen-ivtzl1mx13c8p7rrs398ur-1:key_0
2025-09-03T13:56:28.394+0000 7f6aea318700 20 req 17329738590219513412 0.016000032s s3:multi_object_delete key_1[1Ns0DF9m1P17oyZne4hYeyEtugqi.S.] skip update_olh() target_obj=oozmen-ivtzl1mx13c8p7rrs398ur-1:key_1
2025-09-03T13:56:28.395+0000 7f6aea318700 20 req 17329738590219513412 0.017000036s s3:multi_object_delete key_2[mGy8Y9VxAWRAI9aea4ZgHDXNUCAZmvq] skip update_olh() target_obj=oozmen-ivtzl1mx13c8p7rrs398ur-1:key_2
2025-09-03T13:56:28.397+0000 7f6aca4e5700 20 req 17329738590219513412 0.019000040s s3:multi_object_delete key_3[45ZUoGukZek9vfnVm5FKToAYWJNFpQC] skip update_olh() target_obj=oozmen-ivtzl1mx13c8p7rrs398ur-1:key_3
2025-09-03T13:56:28.397+0000 7f6aca4e5700 20 req 17329738590219513412 0.019000040s s3:multi_object_delete key_4[42WJcFt1gnio8jfBO4F4exh4DF9K.lf] skip update_olh() target_obj=oozmen-ivtzl1mx13c8p7rrs398ur-1:key_4
2025-09-03T13:56:28.398+0000 7f6ad12f0700 20 req 17329738590219513412 0.020000041s s3:multi_object_delete key_0[ckOEucGOte6Npyz7K8j-HLR8AONLdTh] skip update_olh() target_obj=oozmen-ivtzl1mx13c8p7rrs398ur-1:key_0
2025-09-03T13:56:28.400+0000 7f6ae7b14700 20 req 17329738590219513412 0.022000046s s3:multi_object_delete key_4[VfoHDQpLbRcpeuD23qLO06N6Etb6kVp] skip update_olh() target_obj=oozmen-ivtzl1mx13c8p7rrs398ur-1:key_4
2025-09-03T13:56:28.400+0000 7f6ae7b14700 20 req 17329738590219513412 0.022000046s s3:multi_object_delete key_2[W0agC19ZNsMKh.64dOqJ1YE.IkV2O20] skip update_olh() target_obj=oozmen-ivtzl1mx13c8p7rrs398ur-1:key_2
2025-09-03T13:56:28.401+0000 7f6b2bd81700 20 req 17329738590219513412 0.023000048s s3:multi_object_delete key_1[HjpxbWmCUkW4ICtgVoXj4mRE0Mr5YkP] skip update_olh() target_obj=oozmen-ivtzl1mx13c8p7rrs398ur-1:key_1
2025-09-03T13:56:28.401+0000 7f6ac22d8700 20 req 17329738590219513412 0.023000048s s3:multi_object_delete key_3[tU3KDMrx5fmCZ2wu658cyBawyyFJRus] skip update_olh() target_obj=oozmen-ivtzl1mx13c8p7rrs398ur-1:key_3
$ grep skip radosgw.8000.log | grep 17329738590219513412 | wc -l
10
```

For each of the 5 objects, it skips the olh update for the first 2 versions (5x2 skips) and carries out olh processing only for the last (3rd) version of each object.
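The skip count above can be sanity-checked with a small stand-alone model (hypothetical Python, not the actual RGWDeleteMultiObj code): if the OLH update runs only for the last delete of each key, a batch deleting 3 versions of each of 5 keys should skip it 5 x 2 = 10 times and run it 5 times.

```python
from collections import Counter

def count_olh_skips(deletes):
    """Toy model of the PR's behavior (not the real rgw_op.cc code):
    within one multi-object delete, the OLH update runs only for the
    last delete of each key; every earlier delete of that key skips it.
    Returns (skips, olh_updates)."""
    per_key = Counter(key for key, _version in deletes)
    skips = sum(n - 1 for n in per_key.values())
    updates = len(per_key)
    return skips, updates

# 5 objects, 3 versions each, as in the test run above
deletes = [(f"key_{k}", f"v{v}") for k in range(5) for v in range(3)]
print(count_olh_skips(deletes))  # (10, 5): 10 skips, 5 OLH updates
```

This matches the `wc -l` result of 10 skip lines and one olh update per object in the radosgw log.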
qa results in https://pulpito.ceph.com/cbodley-2025-09-06_01:03:01-rgw-wip-72375-distro-default-gibba/ look pretty good, and the updated test case is passing:
however, there was one rgw/tempest job failure for the swift api that i don't recognize:
i scheduled a rerun in https://pulpito.ceph.com/cbodley-2025-09-08_16:10:56-rgw-wip-72375-distro-default-gibba to see if that goes away
the rerun failed again on rgw/tempest, but it was setting up a different test case:
both failures mention a traceback like the following:
```
2025-09-06T02:48:17.388 INFO:teuthology.orchestra.run.gibba042.stdout:
2025-09-06T02:48:17.388 INFO:teuthology.orchestra.run.gibba042.stdout:==============================
2025-09-06T02:48:17.388 INFO:teuthology.orchestra.run.gibba042.stdout:Failed 1 tests - output below:
2025-09-06T02:48:17.388 INFO:teuthology.orchestra.run.gibba042.stdout:==============================
2025-09-06T02:48:17.389 INFO:teuthology.orchestra.run.gibba042.stdout:
2025-09-06T02:48:17.389 INFO:teuthology.orchestra.run.gibba042.stdout:setUpClass (tempest.api.object_storage.test_object_temp_url_negative.ObjectTempUrlNegativeTest)
2025-09-06T02:48:17.389 INFO:teuthology.orchestra.run.gibba042.stdout:-----------------------------------------------------------------------------------------------
2025-09-06T02:48:17.389 INFO:teuthology.orchestra.run.gibba042.stdout:
2025-09-06T02:48:17.389 INFO:teuthology.orchestra.run.gibba042.stdout:Captured traceback:
2025-09-06T02:48:17.389 INFO:teuthology.orchestra.run.gibba042.stdout:~~~~~~~~~~~~~~~~~~~
...
2025-09-06T02:48:17.393 INFO:teuthology.orchestra.run.gibba042.stdout:  File "/home/ubuntu/cephtest/tempest/tempest/test.py", line 748, in get_client_manager
2025-09-06T02:48:17.393 INFO:teuthology.orchestra.run.gibba042.stdout:    cred_provider = cls._get_credentials_provider()
2025-09-06T02:48:17.393 INFO:teuthology.orchestra.run.gibba042.stdout:
2025-09-06T02:48:17.393 INFO:teuthology.orchestra.run.gibba042.stdout:  File "/home/ubuntu/cephtest/tempest/tempest/test.py", line 723, in _get_credentials_provider
2025-09-06T02:48:17.393 INFO:teuthology.orchestra.run.gibba042.stdout:    cls._creds_provider = credentials.get_credentials_provider(
...
2025-09-06T02:48:17.400 INFO:teuthology.orchestra.run.gibba042.stdout:  urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
...
2025-09-06T02:48:17.400 INFO:teuthology.orchestra.run.gibba042.stdout:
2025-09-06T02:48:17.400 INFO:teuthology.orchestra.run.gibba042.stdout:======
2025-09-06T02:48:17.400 INFO:teuthology.orchestra.run.gibba042.stdout:Totals
2025-09-06T02:48:17.400 INFO:teuthology.orchestra.run.gibba042.stdout:======
2025-09-06T02:48:17.400 INFO:teuthology.orchestra.run.gibba042.stdout:Ran: 119 tests in 42.6611 sec.
2025-09-06T02:48:17.400 INFO:teuthology.orchestra.run.gibba042.stdout: - Passed: 118
2025-09-06T02:48:17.401 INFO:teuthology.orchestra.run.gibba042.stdout: - Skipped: 0
2025-09-06T02:48:17.401 INFO:teuthology.orchestra.run.gibba042.stdout: - Expected Fail: 0
2025-09-06T02:48:17.401 INFO:teuthology.orchestra.run.gibba042.stdout: - Unexpected Success: 0
2025-09-06T02:48:17.401 INFO:teuthology.orchestra.run.gibba042.stdout: - Failed: 1
2025-09-06T02:48:17.401 INFO:teuthology.orchestra.run.gibba042.stdout:Sum of execute time for each test: 34.6238 sec.
```
This seems to be happening in class setup, and the traceback shows it dies while setting up credentials: test.py:setUpClass → setup_credentials → get_client_manager → _get_credentials_provider. Does this testsuite use Keystone for auth?
yes it should be keystone. there are additional logs for the run in keystone.client.0.log and tempest.log, but i didn't spot anything interesting around the timestamp of the initial failure:
if you'd like to compare those logs with a recent successful job: it's hard to imagine what changes in this pr could break the tempest job, given that the swift api doesn't use RGWDeleteMultiObj at all (it has a separate RGWBulkDelete op)
@BBoozmen i saw the same rgw/tempest failures in another branch today, so i opened https://tracker.ceph.com/issues/72968 to track this. approving this now that we've shown it's not your fault :)
This is an automated message by src/script/redmine-upkeep.py. I have resolved the following tracker ticket due to the merge of this PR: No backports are pending for the ticket. If this is incorrect, please update the tracker. Update Log: https://github.com/ceph/ceph/actions/runs/17624856443
tagged for squid and tentacle |
Thank you! FWIW, I was looking at this, and one thing about the successful branch (wip-usage-exporter-clean) is that it's based off an earlier HEAD (~mid August). Though, since the issue seems to lie in the keystone service (e.g., a front-end blip?) rather than ceph, an earlier HEAD may not matter. It looks like a front-proxy reset (nginx/HAProxy idle/keepalive/queue overflow) closed the socket without writing an HTTP status, which is exactly what a client sees as an RST mid-read. The stream of concurrent token POSTs around 02:47:53–02:47:57 suggests a brief saturation/recycle rather than a steady misconfig. Anyhow, since this is merged in, we'll figure out the mystery through https://tracker.ceph.com/issues/72968.
Sounds good. Let me work on these. Tentacle should be OK but "squid" will need some work. |
RGW: multi object delete op; skip olh update for all deletes but the last one
Reviewed-by: Casey Bodley <cbodley@redhat.com>
Conflicts:
src/rgw/rgw_op.cc
src/rgw/rgw_op.h
- RGWDeleteMultiObj kept the vector of objects to be deleted as "rgw_obj_key"
rather than "RGWMultiDelObject".
- RGWDeleteMultiObj::execute didn't factor out the object deletions into
"handle_objects" helper method.
- There was no check whether RGWDeleteMultiObj::execute is already running in
a coroutine or not before handling objects.
Fixes: https://tracker.ceph.com/issues/72375
Tracker describes how to reproduce the problem.
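As a rough sketch of the change's intent (hypothetical Python model; the actual logic lives in RGWDeleteMultiObj in src/rgw/rgw_op.cc), deciding which delete of each object should perform the OLH update amounts to finding the last occurrence of every key in the batch:

```python
def olh_update_indexes(deletes):
    """Given the ordered (key, version_id) pairs of one multi-object
    delete, return the set of indexes that should run the OLH update:
    only the last delete seen for each key. All other deletes of the
    same key skip the OLH update, which is the behavior this PR adds."""
    last_seen = {}
    for i, (key, _version) in enumerate(deletes):
        last_seen[key] = i  # later occurrences overwrite earlier ones
    return set(last_seen.values())

# key_0 appears at indexes 0, 2, 3 and key_1 at 1, 4, so only
# indexes 3 and 4 trigger an OLH update; the rest skip it.
batch = [("key_0", "v1"), ("key_1", "v1"),
         ("key_0", "v2"), ("key_0", "v3"), ("key_1", "v2")]
print(sorted(olh_update_indexes(batch)))  # [3, 4]
```

Skipping the intermediate OLH updates avoids redundant OLH log churn when several versions of the same object are deleted in one request, since only the final state matters.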