
RGW: multi object delete op; skip olh update for all deletes but the last one #64800

Merged
cbodley merged 1 commit into ceph:main from BBoozmen:wip-72375-bulk_delete_olh
Sep 10, 2025

Conversation

@BBoozmen
Contributor

@BBoozmen BBoozmen commented Aug 1, 2025

Fixes: https://tracker.ceph.com/issues/72375

Tracker describes how to reproduce the problem.
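For context, the idea behind the fix can be modeled in a few lines of Python. This is an illustrative sketch only, not RGW code; the function name `plan_olh_updates` and the batch shape are mine. In a multi-object delete against a versioned bucket, the OLH (object logical head) acts as a per-object pointer to the current version, so only the delete that runs last for each object in the batch needs to trigger an OLH update; intermediate updates would be overwritten anyway:

```python
def plan_olh_updates(deletes):
    """Given an ordered list of (object_key, version_id) deletes, return
    a parallel list of booleans: True where an OLH update is needed,
    i.e. where this is the last delete for that object in the batch."""
    last_index = {}
    for i, (key, _ver) in enumerate(deletes):
        last_index[key] = i  # keeps overwriting; ends at the last position
    return [i == last_index[key] for i, (key, _ver) in enumerate(deletes)]

if __name__ == "__main__":
    batch = [("key_0", "v1"), ("key_0", "v2"), ("key_1", "v1"),
             ("key_0", "v3"), ("key_1", "v2")]
    # only the final delete per key updates the OLH
    print(plan_olh_updates(batch))  # -> [False, False, False, True, True]
```

In this toy batch, two of the five deletes carry an OLH update (one per distinct key); the other three are the redundant updates the patch skips.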

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@BBoozmen BBoozmen self-assigned this Aug 1, 2025
@BBoozmen BBoozmen added the rgw label Aug 1, 2025
@BBoozmen
Contributor Author

BBoozmen commented Aug 4, 2025

jenkins test make check

@BBoozmen BBoozmen force-pushed the wip-72375-bulk_delete_olh branch from 311a5da to 476f24e on August 5, 2025 02:00
RGW: multi object delete op; skip olh update for all deletes but the last one

Fixes: https://tracker.ceph.com/issues/72375

Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
@BBoozmen BBoozmen force-pushed the wip-72375-bulk_delete_olh branch from 476f24e to 9bb1701 on August 8, 2025 19:15
@BBoozmen BBoozmen marked this pull request as ready for review August 8, 2025 19:22
@BBoozmen BBoozmen requested a review from a team as a code owner August 8, 2025 19:22
Contributor

@cbodley cbodley left a comment


looks good to me. do we have s3test coverage of delete_objects() on versioned buckets to catch any regressions here?

@BBoozmen
Contributor Author

looks good to me. do we have s3test coverage of delete_objects() on versioned buckets to catch any regressions here?

Good point — I’ll check whether we already have coverage for delete_objects() on versioned buckets. If not, I’ll add a test, and in either case I’ll follow up here with a link to the relevant S3 test case.

@BBoozmen
Contributor Author

BBoozmen commented Sep 3, 2025

looks good to me. do we have s3test coverage of delete_objects() on versioned buckets to catch any regressions here?

Looks like we do have a test case covering multi-object delete on versioned buckets: https://github.com/ceph/s3-tests/blob/cb8c4b3ef8c2140f522f2cb57309de959ee3cf5b/s3tests_boto3/functional/test_s3.py#L1651

Just opened s3-tests PR ceph/s3-tests#687 to enhance it a bit by adding more versions per object.
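The shape of that test can be sketched without a running RGW. This is a rough stand-in, not the actual s3-tests code: `FakeVersionedBucket` and `concurrent_multi_object_delete` are names I made up, and a thread-safe dict plays the role of the versioned bucket, with several "clients" issuing the same DeleteObjects-style request concurrently:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class FakeVersionedBucket:
    """Toy stand-in for a versioned bucket: key -> list of version ids."""
    def __init__(self):
        self._lock = threading.Lock()
        self._versions = {}

    def put(self, key, version_id):
        with self._lock:
            self._versions.setdefault(key, []).append(version_id)

    def delete_objects(self, entries):
        # entries: list of (key, version_id), like an S3 DeleteObjects body
        with self._lock:
            for key, ver in entries:
                vers = self._versions.get(key, [])
                if ver in vers:
                    vers.remove(ver)

    def all_versions(self):
        with self._lock:
            return {k: list(v) for k, v in self._versions.items() if v}

def concurrent_multi_object_delete(bucket, num_keys=5, num_versions=3, clients=3):
    entries = [(f"key_{k}", f"v{v}")
               for k in range(num_keys) for v in range(num_versions)]
    for key, ver in entries:
        bucket.put(key, ver)
    # every "client" sends the same delete request concurrently, mirroring
    # the concurrency in the s3-tests case
    with ThreadPoolExecutor(max_workers=clients) as pool:
        for _ in range(clients):
            pool.submit(bucket.delete_objects, entries)
    return bucket.all_versions()
```

After the concurrent deletes finish, the bucket should hold no versions at all, which is essentially the invariant the real test checks against RGW.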

Ran the fix here with the updated testcase:

$ S3TEST_CONF=./s3tests_vstart.conf tox -- s3tests_boto3/functional/test_s3.py::test_versioning_concurrent_multi_object_delete
...
collected 1 item

s3tests_boto3/functional/test_s3.py .
...
================================================================================================= 1 passed, 11 warnings in 3.04s ==================================================================================================
  py: OK (7.18=setup[3.74]+cmd[3.43] seconds)
  congratulations :) (7.24 seconds)

NOTE: the warnings are SyntaxWarnings from lines not touched by this change.

Also increased debug_rgw to 20 to capture the skip update_olh() events and verified the fix works as expected. For example, looking at one of the multi-object delete requests (one of the threads s3-tests initiated to delete all 5x3 versions):

$ grep skip radosgw.8000.log | grep 17329738590219513412
2025-09-03T13:56:28.393+0000 7f6ad4ef6700 20 req 17329738590219513412 0.015000031s s3:multi_object_delete key_0[gF5H7aX4YNLV2O9Yzc52JCxyBEtjLcs] skip update_olh() target_obj=oozmen-ivtzl1mx13c8p7rrs398ur-1:key_0
2025-09-03T13:56:28.394+0000 7f6aea318700 20 req 17329738590219513412 0.016000032s s3:multi_object_delete key_1[1Ns0DF9m1P17oyZne4hYeyEtugqi.S.] skip update_olh() target_obj=oozmen-ivtzl1mx13c8p7rrs398ur-1:key_1
2025-09-03T13:56:28.395+0000 7f6aea318700 20 req 17329738590219513412 0.017000036s s3:multi_object_delete key_2[mGy8Y9VxAWRAI9aea4ZgHDXNUCAZmvq] skip update_olh() target_obj=oozmen-ivtzl1mx13c8p7rrs398ur-1:key_2
2025-09-03T13:56:28.397+0000 7f6aca4e5700 20 req 17329738590219513412 0.019000040s s3:multi_object_delete key_3[45ZUoGukZek9vfnVm5FKToAYWJNFpQC] skip update_olh() target_obj=oozmen-ivtzl1mx13c8p7rrs398ur-1:key_3
2025-09-03T13:56:28.397+0000 7f6aca4e5700 20 req 17329738590219513412 0.019000040s s3:multi_object_delete key_4[42WJcFt1gnio8jfBO4F4exh4DF9K.lf] skip update_olh() target_obj=oozmen-ivtzl1mx13c8p7rrs398ur-1:key_4
2025-09-03T13:56:28.398+0000 7f6ad12f0700 20 req 17329738590219513412 0.020000041s s3:multi_object_delete key_0[ckOEucGOte6Npyz7K8j-HLR8AONLdTh] skip update_olh() target_obj=oozmen-ivtzl1mx13c8p7rrs398ur-1:key_0
2025-09-03T13:56:28.400+0000 7f6ae7b14700 20 req 17329738590219513412 0.022000046s s3:multi_object_delete key_4[VfoHDQpLbRcpeuD23qLO06N6Etb6kVp] skip update_olh() target_obj=oozmen-ivtzl1mx13c8p7rrs398ur-1:key_4
2025-09-03T13:56:28.400+0000 7f6ae7b14700 20 req 17329738590219513412 0.022000046s s3:multi_object_delete key_2[W0agC19ZNsMKh.64dOqJ1YE.IkV2O20] skip update_olh() target_obj=oozmen-ivtzl1mx13c8p7rrs398ur-1:key_2
2025-09-03T13:56:28.401+0000 7f6b2bd81700 20 req 17329738590219513412 0.023000048s s3:multi_object_delete key_1[HjpxbWmCUkW4ICtgVoXj4mRE0Mr5YkP] skip update_olh() target_obj=oozmen-ivtzl1mx13c8p7rrs398ur-1:key_1
2025-09-03T13:56:28.401+0000 7f6ac22d8700 20 req 17329738590219513412 0.023000048s s3:multi_object_delete key_3[tU3KDMrx5fmCZ2wu658cyBawyyFJRus] skip update_olh() target_obj=oozmen-ivtzl1mx13c8p7rrs398ur-1:key_3
$ grep skip radosgw.8000.log | grep 17329738590219513412 | wc -l
10

For each of the 5 objects, the op skips the olh update for the first two versions (5x2 = 10 skips) and carries out olh processing only for the last (3rd) version of each object.
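The expected counts follow directly from the batch shape: for a request that deletes every version of every object, N objects with V versions each yield N*(V-1) skipped updates and N olh updates. A small sketch (the helper name is mine):

```python
def expected_olh_counts(num_objects, versions_per_object):
    """For a multi-object delete removing every version of every object,
    only the last delete per object does OLH processing; the rest skip."""
    skips = num_objects * (versions_per_object - 1)
    updates = num_objects
    return skips, updates

# the 5x3 run above: 10 skipped, 5 processed
print(expected_olh_counts(5, 3))  # -> (10, 5)
```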

@cbodley
Contributor

cbodley commented Sep 8, 2025

qa results in https://pulpito.ceph.com/cbodley-2025-09-06_01:03:01-rgw-wip-72375-distro-default-gibba/ look pretty good, and the updated test case is passing:

s3tests_boto3/functional/test_s3.py::test_versioning_concurrent_multi_object_delete PASSED [ 40%]

however, there was one rgw/tempest job failure for the swift api that i don't recognize:

{13} setUpClass (tempest.api.object_storage.test_object_temp_url_negative.ObjectTempUrlNegativeTest) [0.000000s] ... FAILED

i scheduled a rerun in https://pulpito.ceph.com/cbodley-2025-09-08_16:10:56-rgw-wip-72375-distro-default-gibba to see if that goes away

@cbodley
Contributor

cbodley commented Sep 8, 2025

i scheduled a rerun in https://pulpito.ceph.com/cbodley-2025-09-08_16:10:56-rgw-wip-72375-distro-default-gibba to see if that goes away

the rerun failed again on rgw/tempest, but it was setting up a different test case:

{15} setUpClass (tempest.api.object_storage.test_object_formpost.ObjectFormPostTest) [0.000000s] ... FAILED

both failures mention a ConnectionResetError: [Errno 104] Connection reset by peer

@BBoozmen
Contributor Author

BBoozmen commented Sep 8, 2025

however, there was one rgw/tempest job failure for the swift api that i don't recognize:

{13} setUpClass (tempest.api.object_storage.test_object_temp_url_negative.ObjectTempUrlNegativeTest) [0.000000s] ... FAILED

2025-09-06T02:48:17.388 INFO:teuthology.orchestra.run.gibba042.stdout:
2025-09-06T02:48:17.388 INFO:teuthology.orchestra.run.gibba042.stdout:==============================
2025-09-06T02:48:17.388 INFO:teuthology.orchestra.run.gibba042.stdout:Failed 1 tests - output below:
2025-09-06T02:48:17.388 INFO:teuthology.orchestra.run.gibba042.stdout:==============================
2025-09-06T02:48:17.389 INFO:teuthology.orchestra.run.gibba042.stdout:
2025-09-06T02:48:17.389 INFO:teuthology.orchestra.run.gibba042.stdout:setUpClass (tempest.api.object_storage.test_object_temp_url_negative.ObjectTempUrlNegativeTest)
2025-09-06T02:48:17.389 INFO:teuthology.orchestra.run.gibba042.stdout:-----------------------------------------------------------------------------------------------
2025-09-06T02:48:17.389 INFO:teuthology.orchestra.run.gibba042.stdout:
2025-09-06T02:48:17.389 INFO:teuthology.orchestra.run.gibba042.stdout:Captured traceback:
2025-09-06T02:48:17.389 INFO:teuthology.orchestra.run.gibba042.stdout:~~~~~~~~~~~~~~~~~~~
...
2025-09-06T02:48:17.393 INFO:teuthology.orchestra.run.gibba042.stdout:      File "/home/ubuntu/cephtest/tempest/tempest/test.py", line 748, in get_client_manager
2025-09-06T02:48:17.393 INFO:teuthology.orchestra.run.gibba042.stdout:    cred_provider = cls._get_credentials_provider()
2025-09-06T02:48:17.393 INFO:teuthology.orchestra.run.gibba042.stdout:
2025-09-06T02:48:17.393 INFO:teuthology.orchestra.run.gibba042.stdout:      File "/home/ubuntu/cephtest/tempest/tempest/test.py", line 723, in _get_credentials_provider
2025-09-06T02:48:17.393 INFO:teuthology.orchestra.run.gibba042.stdout:    cls._creds_provider = credentials.get_credentials_provider(
...
2025-09-06T02:48:17.400 INFO:teuthology.orchestra.run.gibba042.stdout:    urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
...
2025-09-06T02:48:17.400 INFO:teuthology.orchestra.run.gibba042.stdout:
2025-09-06T02:48:17.400 INFO:teuthology.orchestra.run.gibba042.stdout:======
2025-09-06T02:48:17.400 INFO:teuthology.orchestra.run.gibba042.stdout:Totals
2025-09-06T02:48:17.400 INFO:teuthology.orchestra.run.gibba042.stdout:======
2025-09-06T02:48:17.400 INFO:teuthology.orchestra.run.gibba042.stdout:Ran: 119 tests in 42.6611 sec.
2025-09-06T02:48:17.400 INFO:teuthology.orchestra.run.gibba042.stdout: - Passed: 118
2025-09-06T02:48:17.401 INFO:teuthology.orchestra.run.gibba042.stdout: - Skipped: 0
2025-09-06T02:48:17.401 INFO:teuthology.orchestra.run.gibba042.stdout: - Expected Fail: 0
2025-09-06T02:48:17.401 INFO:teuthology.orchestra.run.gibba042.stdout: - Unexpected Success: 0
2025-09-06T02:48:17.401 INFO:teuthology.orchestra.run.gibba042.stdout: - Failed: 1
2025-09-06T02:48:17.401 INFO:teuthology.orchestra.run.gibba042.stdout:Sum of execute time for each test: 34.6238 sec.

This seems to happen during class setup, and the traceback shows it dies while setting up credentials: test.py:setUpClass → setup_credentials → get_client_manager → _get_credentials_provider. Does this test suite use Keystone for auth?

@cbodley
Contributor

cbodley commented Sep 9, 2025

This seems to be happening in class setup and the traceback shows it dies while setting up credentials: test.py:setUpClass → setup_credentials → get_client_manager → _get_credentials_provider. Does this testsuite use Keystone for auth?

yes it should be keystone. there are additional logs for the run in keystone.client.0.log and tempest.log, but i didn't spot anything interesting around the timestamp of the initial failure:

2025-09-06T02:47:54.022 INFO:teuthology.orchestra.run.gibba042.stdout:{13} setUpClass (tempest.api.object_storage.test_object_temp_url_negative.ObjectTempUrlNegativeTest) [0.000000s] ... FAILED

if you'd like to compare those logs with a recent successful job:
https://qa-proxy.ceph.com/teuthology/aemerson-2025-09-09_03:40:07-rgw-wip-usage-exporter-clean-distro-default-smithi/8488955/teuthology.log
https://qa-proxy.ceph.com/teuthology/aemerson-2025-09-09_03:40:07-rgw-wip-usage-exporter-clean-distro-default-smithi/8488955/remote/smithi152/keystone.client.0.log
https://qa-proxy.ceph.com/teuthology/aemerson-2025-09-09_03:40:07-rgw-wip-usage-exporter-clean-distro-default-smithi/8488955/remote/smithi152/tempest.log

it's hard to imagine what changes in this pr could break the tempest job, given that the swift api doesn't use RGWDeleteMultiObj at all (it has a separate RGWBulkDelete op)

@cbodley
Contributor

cbodley commented Sep 10, 2025

@BBoozmen i saw the same rgw/tempest failures in another branch today, so i opened https://tracker.ceph.com/issues/72968 to track this. approving this now that we've shown it's not your fault :)

@cbodley cbodley merged commit 6f543b8 into ceph:main Sep 10, 2025
13 checks passed
@github-actions

This is an automated message by src/script/redmine-upkeep.py.

I have resolved the following tracker ticket due to the merge of this PR:

No backports are pending for the ticket. If this is incorrect, please update the tracker
ticket and reset to Pending Backport state.

Update Log: https://github.com/ceph/ceph/actions/runs/17624856443

@cbodley
Contributor

cbodley commented Sep 10, 2025

No backports are pending for the ticket. If this is incorrect, please update the tracker
ticket and reset to Pending Backport state.

tagged for squid and tentacle

@BBoozmen
Contributor Author

@BBoozmen i saw the same rgw/tempest failures in another branch today, so i opened https://tracker.ceph.com/issues/72968 to track this. approving this now that we've shown it's not your fault :)

Thank you!

FWIW, I was looking at this; one thing to note about the successful branch (wip-usage-exporter-clean) is that it's based on an earlier HEAD (~mid August).

That said, the issue seems to lie in the keystone service (e.g., a front-end blip?) rather than ceph, so the earlier HEAD may not matter. It looks like a front-proxy reset (nginx/HAProxy idle timeout, keepalive, or queue overflow) closed the socket without writing an HTTP status, which is exactly what a client sees as an RST mid-read. The stream of concurrent token POSTs around 02:47:53–02:47:57 suggests a brief saturation/recycle rather than a steady misconfiguration.
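For what it's worth, the usual client-side mitigation for this failure mode is to retry on connection reset with backoff. This is a generic sketch, not tempest's or keystoneauth's actual retry code; the names `retry_on_reset` and `flaky` are mine:

```python
import time

def retry_on_reset(fn, attempts=3, backoff=0.1):
    """Retry a callable when the peer resets the connection mid-request.
    Transient proxy recycles usually succeed on the next attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionResetError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(backoff * (2 ** attempt))  # exponential backoff

# flaky stand-in: fails twice with ECONNRESET, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionResetError(104, "Connection reset by peer")
    return "ok"
```

Of course, a retry only papers over the symptom; if the front proxy is saturating, the real fix is on the proxy/keystone side.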

Anyhow, since this is merged, we'll figure out the mystery through https://tracker.ceph.com/issues/72968.

@BBoozmen
Contributor Author

No backports are pending for the ticket. If this is incorrect, please update the tracker
ticket and reset to Pending Backport state.

tagged for squid and tentacle

Sounds good. Let me work on these. Tentacle should be OK but "squid" will need some work.
I'll raise the backport PRs tomorrow.

BBoozmen pushed a commit to BBoozmen/ceph that referenced this pull request Sep 11, 2025
RGW: multi object delete op; skip olh update for all deletes but the last one

Reviewed-by: Casey Bodley <cbodley@redhat.com>
Conflicts:
    src/rgw/rgw_op.cc
    src/rgw/rgw_op.h
- RGWDeleteMultiObj kept the vector of objects to be deleted as "rgw_obj_key"
  rather than "RGWMultiDelObject".
- RGWDeleteMultiObj::execute didn't factor out the object deletions into
  "handle_objects" helper method.
- There was no check whether RGWDeleteMultiObj::execute is already running in
  a coroutine or not before handling objects.

3 participants