RGW/multisite: fix bucket-full-sync infinite loop caused by stale bucket_list_result reuse#66203
Conversation
Config Diff Tool Output

```
+ added: rgw_inject_delay_sec (rgw.yaml.in)
+ added: rgw_inject_delay_pattern (rgw.yaml.in)
```
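These two options gate a test-only delay injection in the sync path (seen later in this thread as "injecting a delay of 100.000000s" in the rgw log). Roughly, such a pattern-gated hook can be modeled in Python as follows; the function name and its shape are illustrative assumptions, not the actual RGW implementation:

```python
import re
import time

def maybe_inject_delay(message, delay_sec, pattern, sleep=time.sleep):
    """Sleep for delay_sec when both options are set and the message
    matches the configured pattern; return True if a delay was injected.

    Hypothetical helper for illustration only; the real options are
    rgw_inject_delay_sec and rgw_inject_delay_pattern.
    """
    if delay_sec <= 0 or not pattern:
        # either option unset: injection is disabled
        return False
    if re.search(pattern, message):
        sleep(delay_sec)
        return True
    return False
```

The point of the pattern is to delay only a specific code path (here, bucket full sync) so the testcase has time to delete the source objects and the bucket while full sync is still paging through the listing.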
The above configuration changes are found in the PR. Please update the relevant release documentation if necessary.

/config check ok
@BBoozmen thanks for the detailed test case and reproducer. I do agree that the list_bucket_result should be cleared before sending the request to the remote for the next listing.
Am I missing something?
Force-pushed 9f76d26 to 887302c (compare)
Thank you @smanjara for the feedback. Yes, the current reproduction recipe doesn't seem to reproduce the issue deterministically. I've updated the recipe to make the reproduction deterministic and updated the related commit. Please have a look. The new recipe doesn't rely on bilog trimming but it just
Now, this new recipe makes sure that, when bucket full sync starts, the bucket will have > 1000 objects so the full-sync listing is paginated. I tested the new recipe with and without the fix.

**With the fix**

Once full sync starts (after enabling bucket sync once all objects are uploaded), we always see the paginated listing happen deterministically, which is critical for reproducing the issue:

```
# now we get paginated listing
run/c2/out/radosgw.8001.log:2025-11-23T03:26:28.842+0000 7f7cce0da700 20 RGW-SYNC:data:sync:shard[91]:entry[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3[0]]:bucket_sync_sources[source=:rvtfgn-1[eb859e84-2137-4098-a6c5-06de10988051.4421.1]):3:source_zone=eb859e84-2137-4098-a6c5-06de10988051]:bucket[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1<-rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]:full_sync[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]: listing bucket for full sync
run/c2/out/radosgw.8001.log:2025-11-23T03:26:28.969+0000 7f7cce0da700 20 RGW-SYNC:data:sync:shard[91]:entry[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3[0]]:bucket_sync_sources[source=:rvtfgn-1[eb859e84-2137-4098-a6c5-06de10988051.4421.1]):3:source_zone=eb859e84-2137-4098-a6c5-06de10988051]:bucket[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1<-rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]:full_sync[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]: listed bucket for full sync list_result.entries.size=1000 is_truncated=1
```

And the delay injection, which gives the testcase time to delete all objects and the bucket in the meantime, now happens reliably once the listing is paginated:
```
run/c2/out/radosgw.8001.log:2025-11-23T03:26:28.969+0000 7f7cce0da700 0 RGW-SYNC:data:sync:shard[91]:entry[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3[0]]:bucket_sync_sources[source=:rvtfgn-1[eb859e84-2137-4098-a6c5-06de10988051.4421.1]):3:source_zone=eb859e84-2137-4098-a6c5-06de10988051]:bucket[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1<-rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]:full_sync[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]: injecting a delay of 100.000000s
```

With the fix, the testcase passes since we now reset the list_result object and don't use stale state:

```
run/c2/out/radosgw.8001.log:2025-11-23T03:28:12.771+0000 7f7cce0da700 20 RGW-SYNC:data:sync:shard[91]:entry[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3[0]]:bucket_sync_sources[source=:rvtfgn-1[eb859e84-2137-4098-a6c5-06de10988051.4421.1]):3:source_zone=eb859e84-2137-4098-a6c5-06de10988051]:bucket[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1<-rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]:full_sync[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]: listing bucket for full sync
# now - since source objects and the bucket are deleted -
# with the fix we can get out of the for-loop
run/c2/out/radosgw.8001.log:2025-11-23T03:28:12.773+0000 7f7cce0da700 20 RGW-SYNC:data:sync:shard[91]:entry[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3[0]]:bucket_sync_sources[source=:rvtfgn-1[eb859e84-2137-4098-a6c5-06de10988051.4421.1]):3:source_zone=eb859e84-2137-4098-a6c5-06de10988051]:bucket[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1<-rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]:full_sync[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]: listed bucket for full sync list_result.entries.size=0 is_truncated=0
```

Test passes:

```
Ran 1 test in 284.062s

OK
```

**Without the fix**

Running the updated testcase without the fix now reproduces the issue deterministically, and the test fails as expected.
Now, the listing loop never ends after the bucket is deleted:

```
run/c2/out/radosgw.8001.log:2025-11-23T04:07:58.309+0000 7fe9ad1ae700 20 RGW-SYNC:data:sync:shard[11]:entry[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2[0]]:bucket_sync_sources[source=:mkoudz-1[88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1]):2:source_zone=88756c7d-e05a-460c-9b97-0efd67cf04eb]:bucket[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1<-mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2]:full_sync[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2]: listing bucket for full sync
run/c2/out/radosgw.8001.log:2025-11-23T04:07:58.490+0000 7fe9ad1ae700 20 RGW-SYNC:data:sync:shard[11]:entry[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2[0]]:bucket_sync_sources[source=:mkoudz-1[88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1]):2:source_zone=88756c7d-e05a-460c-9b97-0efd67cf04eb]:bucket[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1<-mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2]:full_sync[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2]: listed bucket for full sync list_result.entries.size=1000 is_truncated=1
.
.
.
# the full-sync coroutine can never exit the loop as it is using the
# stale list-results object until the rgw instance is restarted
run/c2/out/radosgw.8001.log:2025-11-23T04:09:54.779+0000 7fe9ad1ae700 20 RGW-SYNC:data:sync:shard[11]:entry[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2[0]]:bucket_sync_sources[source=:mkoudz-1[88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1]):2:source_zone=88756c7d-e05a-460c-9b97-0efd67cf04eb]:bucket[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1<-mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2]:full_sync[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2]: listed bucket for full sync list_result.entries.size=1000 is_truncated=1
.
.
.
```

If you look at the sync status during this time, you'll see that it keeps reporting the same delay:
```
...
current time 2025-11-23T04:11:59Z
..
data sync source: 88756c7d-e05a-460c-9b97-0efd67cf04eb (a1)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is behind on 1 shards
behind shards: [11]
oldest incremental change not applied: 2025-11-23T04:09:09.837391+0000 [11]
10 shards are recovering
recovering shards: [9,10,12,13,14,15,16,17,18,19]
.
.
.
...
current time 2025-11-23T04:12:39Z
...
data sync source: 88756c7d-e05a-460c-9b97-0efd67cf04eb (a1)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is behind on 1 shards
behind shards: [11]
oldest incremental change not applied: 2025-11-23T04:09:09.837391+0000 [11]
10 shards are recovering
recovering shards: [9,10,12,13,14,15,16,17,18,19]
```
And the testcase fails since data sync keeps reporting it's still behind:

```
... rgw_multi.tests: ERROR: test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime failed: failed data checkpoint for target_zone=a2 source_zone=a1
--------------------- >> end captured logging << ---------------------
----------------------------------------------------------------------
Ran 1 test in 598.802s

FAILED (failures=1)
http://localhost:8000
http://localhost:8001
```

Please take a look at the testcase again. Just a note: please don't forget to stop your mstart cluster (or restart the affected rgw instance) after a testcase failure (if you are trying it out without the fix); otherwise, the radosgw.log will fill up quickly due to the never-ending loop of "listing bucket for full sync" and "listed bucket for full sync list_result.entries.size=1000 is_truncated=1" events.
Force-pushed 887302c to ce9be9e (compare)
scheduled teuthology runs at https://pulpito.ceph.com/smanjara-2025-11-25_06:49:38-rgw:multisite-test-wip-oozmen-73799-distro-default-smithi/

the test is failing with
Hmm, I think the problem is that qa/tasks uses the definition from qa/tasks/rgw_multisite.py, not the one from src/test/rgw/test_multi.py. This PR updates only the latter.
Just to give an update on this one... Teuthology seems to be busy, though; the job has been sitting in the queue for a while: https://pulpito.ceph.com/bcs-ceph-2025-12-02_19:41:42-rgw:multisite-wip-oozmen-73799-distro-default-smithi/ To my understanding, there's no easy way to test
An update on this one; it's getting closer. I have to change the testcase a bit due to the difference in cluster topology between:

The former creates a realm with a single (master) zonegroup with 2 zones/clusters, one being the master, so the logic below works when running the integration testing locally:

```python
primary_zone_cluster_conn = zonegroup.zones[0]
secondary_zone_cluster_conn = zonegroup.zones[1]
```

However, in the teuthology run:

```python
# get cluster connections
primary_zone_cluster_conn = master_zonegroup.master_zone
secondary_zone_cluster_conn = None
for zg in realm.current_period.zonegroups:
    for zone in zg.zones:
        if zone.cluster != primary_zone_cluster_conn.cluster and zone != zg.master_zone:
            secondary_zone_cluster_conn = zone
            break
    if secondary_zone_cluster_conn is not None:
        break
else:
    raise SkipTest("test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime is skipped. "
                   "Requires a secondary zone in a different cluster.")
```
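To make the selection logic above concrete, here is a runnable sketch with minimal stand-in classes; the `Zone`/`Zonegroup` stand-ins and the helper name `find_secondary_zone` are illustrative assumptions, not the real qa/tasks types:

```python
class Zone:
    """Minimal stand-in for a zone that knows which cluster hosts it."""
    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster

class Zonegroup:
    """Minimal stand-in for a zonegroup with a designated master zone."""
    def __init__(self, zones, master_zone):
        self.zones = zones
        self.master_zone = master_zone

def find_secondary_zone(primary_zone, zonegroups):
    """Pick the first zone on a different cluster than the primary that is
    not its zonegroup's master; return None when no such zone exists (the
    real testcase raises SkipTest in that situation)."""
    for zg in zonegroups:
        for zone in zg.zones:
            if zone.cluster != primary_zone.cluster and zone is not zg.master_zone:
                return zone
    return None
```

This mirrors why the teuthology variant needs the SkipTest guard: a topology without a zone on a second cluster yields no eligible secondary.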
The sepia lab is still under migration, so you might not be able to schedule teuthology runs yet. Besides that, we tested the sync code at a reasonable scale and we didn't find any issues or regressions!
Sounds good, thank you for the update!
Hopefully, I can test these
Force-pushed ce9be9e to 3355a23 (compare)
It's OK, no rush on this one. This PR is meant to change qa/tasks as well; i.e., I've just pushed the changes here too (see compare) that make the change to

Bottom line, we can wait for the boto3 changes to go in first, and I can test this one later.
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
…ts entries

Add a new method `reset_entries()` to the `bucket_list_result` struct that clears the list of entries and resets the truncated flag. This will be used in the reuse cases to avoid accessing stale entries or a stale truncated flag.

Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
…to avoid stale listings

RGWBucketFullSyncCR could spin indefinitely when the source bucket was already deleted. The coroutine reused a bucket_list_result member, and RGWListRemoteBucketCR populated it without clearing prior state. Stale entries/is_truncated from a previous iteration caused the loop to continue even after the bucket no longer existed.

Fix by clearing the provided bucket_list_result at the start of RGWListRemoteBucketCR (constructor), ensuring each listing starts from a clean state and reflects the current remote bucket contents. This prevents the infinite loop and returns correct results when the bucket has been deleted.

Fixes: https://tracker.ceph.com/issues/73799
Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
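As a sanity check of the reasoning in this commit message, here is a small Python model of the loop; this is a simplified sketch, not the actual C++ coroutine. Only `reset_entries` mirrors the patch, the rest is illustrative: without the clear, a listing that silently fails after the bucket is deleted leaves the previous page's `is_truncated` flag in place and the loop spins; with the clear, the loop observes an empty, non-truncated result and exits.

```python
class BucketListResult:
    """Simplified stand-in for RGW's bucket_list_result."""
    def __init__(self):
        self.entries = []
        self.is_truncated = False

    def reset_entries(self):
        # the fix: clear stale entries and the truncated flag
        self.entries = []
        self.is_truncated = False

def list_remote_bucket(result, pages, page_idx, fixed):
    """Model of the listing CR filling a caller-provided result object.
    If the bucket is gone (no pages), the result is not repopulated,
    which is where stale state survives without the fix."""
    if fixed:
        result.reset_entries()  # done in the CR constructor in the patch
    if page_idx < len(pages):
        result.entries = list(pages[page_idx])
        result.is_truncated = page_idx + 1 < len(pages)

def full_sync_iterations(pages, deleted_after, fixed, max_iters=10):
    """Count listing iterations until the loop exits; max_iters means it spins."""
    result = BucketListResult()
    for i in range(max_iters):
        remote = pages if i < deleted_after else []  # bucket deleted mid-sync
        list_remote_bucket(result, remote, i, fixed)
        if not result.is_truncated:
            return i + 1
    return max_iters
```

Running the model with three pages and a deletion after the second listing reproduces the shape of the bug: the unfixed loop never terminates, while the fixed loop exits on the first listing after the deletion.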
Force-pushed 3355a23 to 74d3441 (compare)
…e bucket is deleted in the middle

Tests: https://tracker.ceph.com/issues/73799
Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
Force-pushed 74d3441 to 9432f0e (compare)
@cbodley , @smanjara - can you please review the PR? I could finally complete the teuthology testing. https://pulpito.ceph.com/bcs-ceph-2026-02-19_14:51:57-rgw:multisite-wip-oozmen-73799-distro-default-trial/ is the

```
$ egrep "test_period_update_commit|test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime" teuthology.log
2026-02-19T23:10:43.212 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_period_update_commit ... ok
2026-02-19T23:13:20.426 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime ... ok
```

Some details on

```
2026-02-19T23:10:43.404 INFO:rgw_multi.tests:disable sync for bucket=eoibqj-69
2026-02-19T23:11:03.164 INFO:rgw_multi.tests:successfully uploaded 3000 objects to bucket=eoibqj-69
2026-02-19T23:11:03.267 INFO:rgw_multi.tests:set rgw_inject_delay_sec and rgw_inject_delay_pattern to slow down bucket full sync
2026-02-19T23:11:03.623 INFO:rgw_multi.tests:enable bucket sync to initiate full sync
2026-02-19T23:11:03.741 INFO:rgw_multi.tests:verify that bucket sync is stalled
2026-02-19T23:11:14.050 INFO:rgw_multi.tests:verified that bucket sync is stalled, oldest incremental change not applied epoch: 0.0
2026-02-19T23:11:39.502 INFO:rgw_multi.tests:removing rgw_inject_delay_sec and rgw_inject_delay_pattern to allow bucket full sync to run normally to the completion
2026-02-19T23:13:19.872 INFO:rgw_multi.tests:wait for data sync to complete
2026-02-19T23:13:20.426 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime ... ok
```

As for the failed tests, they are failing in other submissions as well, so I don't think they are relevant to this PR.
For example, looking at anuchaithra's submission: https://qa-proxy.ceph.com/teuthology/anuchaithra-2026-02-18_10:20:38-rgw-wip-anrao5-testing-2026-02-18-1230-distro-default-trial/56569/teuthology.log

```
$ egrep -A 1 "=======" teuthology-anrao5.log | grep ERROR
2026-02-18T14:14:30.968 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_create
2026-02-18T14:14:30.970 INFO:tasks.rgw_multisite_tests:ERROR: create a bucket from secondary zone under tenant namespace. check if it successfully syncs
2026-02-18T14:14:30.972 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_recreate
2026-02-18T14:14:30.974 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_remove
2026-02-18T14:14:30.976 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_object_sync
2026-02-18T14:14:30.978 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_object_delete
2026-02-18T14:14:30.980 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_multi_object_delete
2026-02-18T14:14:30.982 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_versioned_object_incremental_sync
2026-02-18T14:14:30.984 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_delete_marker_full_sync
2026-02-18T14:14:30.986 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_suspended_delete_marker_full_sync
2026-02-18T14:14:30.988 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_versioning
2026-02-18T14:14:30.990 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_acl
2026-02-18T14:14:30.992 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_cors
2026-02-18T14:14:30.995 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_delete_notempty
2026-02-18T14:14:30.996 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_datalog_autotrim
2026-02-18T14:14:30.998 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_set_bucket_website
2026-02-18T14:14:31.000 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_set_bucket_policy
2026-02-18T14:14:31.002 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_sync_disable
2026-02-18T14:14:31.003 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_sync_enable_right_after_disable
2026-02-18T14:14:31.005 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_sync_disable_enable
2026-02-18T14:14:31.006 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_multipart_object_sync
2026-02-18T14:14:31.008 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_assume_role_after_sync
2026-02-18T14:14:31.009 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_topic_notification_sync
```

And from this PR's run:

```
$ egrep -A 1 "=======" teuthology.log | grep ERROR
2026-02-19T23:13:20.427 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_create
2026-02-19T23:13:20.429 INFO:tasks.rgw_multisite_tests:ERROR: create a bucket from secondary zone under tenant namespace. check if it successfully syncs
2026-02-19T23:13:20.430 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_recreate
2026-02-19T23:13:20.432 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_remove
2026-02-19T23:13:20.434 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_object_sync
2026-02-19T23:13:20.436 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_object_delete
2026-02-19T23:13:20.438 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_multi_object_delete
2026-02-19T23:13:20.441 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_versioned_object_incremental_sync
2026-02-19T23:13:20.443 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_delete_marker_full_sync
2026-02-19T23:13:20.445 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_suspended_delete_marker_full_sync
2026-02-19T23:13:20.447 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_versioning
2026-02-19T23:13:20.449 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_acl
2026-02-19T23:13:20.451 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_cors
2026-02-19T23:13:20.453 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_delete_notempty
2026-02-19T23:13:20.456 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_datalog_autotrim
2026-02-19T23:13:20.457 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_set_bucket_website
2026-02-19T23:13:20.459 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_set_bucket_policy
2026-02-19T23:13:20.461 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_sync_disable
2026-02-19T23:13:20.463 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_sync_enable_right_after_disable
2026-02-19T23:13:20.464 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_sync_disable_enable
2026-02-19T23:13:20.466 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_multipart_object_sync
2026-02-19T23:13:20.467 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_assume_role_after_sync
2026-02-19T23:13:20.468 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_topic_notification_sync
2026-02-19T23:13:20.469 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_create_location_constraint
```

Both logs show that the failure reasons for the above testcases are the same.
Again, those failures seem irrelevant to this PR.
the PR associated with https://tracker.ceph.com/issues/74579 should fix the CreateBucket forwarded requests.
Thank you @smanjara! Added
The QA run for this PR included two others. But they're showing a lot of crash-related errors. There are six of these in the cluster log:

```
2026-02-25T05:26:33.173487+0000 mon.a (mon.0) 181 : cluster [WRN] Health check failed: 2 daemons have recently crashed (RECENT_CRASH)
```

And there are nine of these, mostly dealing with the upgrade tests:

```
Found coredumps on ubuntu@trial096.front.sepia.ceph.com
```

None of the PRs jumps out at me as a likely culprit, so I'm asking you. I'm also having a re-run done. Here's the full run: https://pulpito.ceph.com/anuchaithra-2026-02-25_05:07:15-rgw-wip-anrao1-testing-2026-02-23-1551-distro-default-trial/ Thanks!
@ivancich the two multisite jobs haven't crashed, though, so it must be something else. Could you tell which are the other PRs? Also, multisite tests will fail without #67083; we need that to be merged, or this PR tested along with wip-anrao3-testing. cc @anrao19
#67083 merged, let's retry with that
I was going to start looking at the failure logs @ivancich pointed out earlier, but I think there's going to be another round of testing including #67083. I'll be on stand-by for now, then.
Sent this PR to Teuthology testing for the rgw testsuite:
Overall Summary
Looking at failures:
```
failure_reason: 'Command failed (workunit test rgw/s3_user_quota-run.sh) on trial049 with status 22: ''mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=c221a0e611968dcf997352e2792ea5cbb550c44e TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rgw/s3_user_quota-run.sh'''
```
```
2026-03-07T05:20:08.109 DEBUG:teuthology.orchestra.run.trial052:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph pg dump --format=json
2026-03-07T05:22:08.142 DEBUG:teuthology.orchestra.run:got remote process result: 124
```
```
2026-03-07T05:43:53.758 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_bucket_remove ... FAIL
2026-03-07T06:31:32.060 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_period_update_commit ... FAIL
```

**test_bucket_remove**

The testcase is not touched by this PR, and it seems to be failing in other submissions, too; e.g., https://qa-proxy.ceph.com/teuthology/anuchaithra-2026-03-09_10:21:55-rgw-wip-anrao2-testing-2026-03-05-1549-distro-default-trial/94912/teuthology.log. This testcase is about metadata sync, which is not relevant to this PR. I think we can address this testcase separately. If there's no tracker item, I can open one to work on this separately; it seems to be failing consistently in other submissions as well.

**test_period_update_commit**

It fails at the very last step at the check. Again, this failure is not relevant to the PR, so I can open a separate tracker; I think it should be enough to make the s3 client workload less aggressive. Otherwise, this testcase used to pass, as shown in https://qa-proxy.ceph.com/teuthology/bcs-ceph-2026-02-19_14:51:57-rgw:multisite-wip-oozmen-73799-distro-default-trial/58954/teuthology.log, and I just built this very ceph-ci branch, tested it locally, and it passed again.
thanks @BBoozmen
please do open trackers for the s3 and multisite failures @BBoozmen. thanks!
Fixes: https://tracker.ceph.com/issues/73799