
RGW/multisite: fix bucket-full-sync infinite loop caused by stale bucket_list_result reuse#66203

Merged
smanjara merged 8 commits into ceph:main from BBoozmen:wip-oozmen-73799
Mar 10, 2026

Conversation

@BBoozmen
Contributor

@BBoozmen BBoozmen commented Nov 11, 2025

Fixes: https://tracker.ceph.com/issues/73799

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands

You must only issue one Jenkins command per comment. Jenkins does not understand
comments with more than one command.

@github-actions

github-actions bot commented Nov 11, 2025

Config Diff Tool Output

+ added: rgw_inject_delay_sec (rgw.yaml.in)
+ added: rgw_inject_delay_pattern (rgw.yaml.in)

The above configuration changes are found in the PR. Please update the relevant release documentation if necessary.
Ignore this comment if docs are already updated. To make the "Check ceph config changes" CI check pass, please comment /config check ok and re-run the test.

@BBoozmen BBoozmen marked this pull request as ready for review November 12, 2025 19:16
@BBoozmen BBoozmen requested a review from a team as a code owner November 12, 2025 19:16
@BBoozmen
Contributor Author

/config check ok

@cbodley cbodley requested a review from smanjara November 20, 2025 15:26
@smanjara
Contributor

@BBoozmen thanks for the detailed test case and reproducer. I do agree that the list_bucket_result should be cleared before sending the request to the remote for the next listing.
I tried running your test, without the proposed fix, with a single rgw instance on either side. We should expect a repetitive loop with a log message like "listed bucket for full sync" if we had stale entries, but I could only find one such log message:

2025-11-20T14:13:15.452-0500 7efe2975f6c0 20 RGW-SYNC:data:sync:shard[36]:entry[ccjgqv-1:609213d3-f584-4e86-bca2-b7c1c434fd59.4362.1:5[0]]:bucket_sync_sources[source=:ccjgqv-1[609213d3-f584-4e86-bca2-b7c1c434fd59.4362.1]):5:source_zone=609213d3-f584-4e86-bca2-b7c1c434fd59]:bucket[ccjgqv-1:609213d3-f584-4e86-bca2-b7c1c434fd59.4362.1<-ccjgqv-1:609213d3-f584-4e86-bca2-b7c1c434fd59.4362.1:5]:full_sync[ccjgqv-1:609213d3-f584-4e86-bca2-b7c1c434fd59.4362.1:5]: listed bucket for full sync list_result.entries.size=299 is_truncated=0

am I missing something?

@BBoozmen
Contributor Author

@BBoozmen thanks for the detailed test case and reproducer. I do agree that the list_bucket_result should be cleared before sending the request to the remote for the next listing. I tried running your test, without the proposed fix, with a single rgw instance on either side. We should expect a repetitive loop with a log message like "listed bucket for full sync" if we had stale entries, but I could only find one such log message:

2025-11-20T14:13:15.452-0500 7efe2975f6c0 20 RGW-SYNC:data:sync:shard[36]:entry[ccjgqv-1:609213d3-f584-4e86-bca2-b7c1c434fd59.4362.1:5[0]]:bucket_sync_sources[source=:ccjgqv-1[609213d3-f584-4e86-bca2-b7c1c434fd59.4362.1]):5:source_zone=609213d3-f584-4e86-bca2-b7c1c434fd59]:bucket[ccjgqv-1:609213d3-f584-4e86-bca2-b7c1c434fd59.4362.1<-ccjgqv-1:609213d3-f584-4e86-bca2-b7c1c434fd59.4362.1:5]:full_sync[ccjgqv-1:609213d3-f584-4e86-bca2-b7c1c434fd59.4362.1:5]: listed bucket for full sync list_result.entries.size=299 is_truncated=0

am I missing something?

Thank you @smanjara for the feedback. Yes, the current reproduction recipe doesn't reproduce the issue deterministically. I've updated the recipe to make the reproduction deterministic and updated the related commit. Please have a look.

The new recipe doesn't rely on bilog trimming; instead it:

  • disables bucket sync before uploading any objects
  • uploads all objects (>1000)
  • injects the delay
  • re-enables bucket sync to initiate the full sync

This new recipe makes sure that, when bucket full sync starts, the bucket has more than 1000 objects, so the full-sync listing is paginated.

Tested the new recipe with and without the fix: python3.9 -m nose -s test_multi.py -v -m test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime.
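The recipe above can be sketched roughly as follows. This is a minimal illustration with hypothetical fixture names (`FakeZone`, `put_object`, etc.); the real testcase lives in src/test/rgw/rgw_multi/tests.py and drives actual clusters:

```python
class FakeZone:
    """Stand-in for a test-zone fixture; records the actions taken (hypothetical)."""
    def __init__(self):
        self.calls = []

    def admin(self, args):
        self.calls.append(('admin',) + tuple(args))

    def ceph_admin(self, args):
        self.calls.append(('ceph_admin',) + tuple(args))

    def put_object(self, bucket, key):
        self.calls.append(('put', bucket, key))

    def delete_object(self, bucket, key):
        self.calls.append(('delete', bucket, key))

    def delete_bucket(self, bucket):
        self.calls.append(('delete_bucket', bucket))


def reproduce(primary, secondary, bucket, num_objects=3000):
    # 1. disable bucket sync before any objects exist
    primary.admin(['bucket', 'sync', 'disable', '--bucket', bucket])
    # 2. upload enough objects to force a paginated full-sync listing (>1000)
    for i in range(num_objects):
        primary.put_object(bucket, f'obj-{i}')
    # 3. inject a delay into the full-sync listing loop on the secondary
    secondary.ceph_admin(['config', 'set', 'client.rgw',
                          'rgw_inject_delay_sec', '100'])
    # 4. re-enable sync to kick off full sync; it stalls in the injected delay
    primary.admin(['bucket', 'sync', 'enable', '--bucket', bucket])
    # 5. while the secondary is stalled, delete every object and the bucket
    for i in range(num_objects):
        primary.delete_object(bucket, f'obj-{i}')
    primary.delete_bucket(bucket)
    # 6. remove the delay; without the fix, the listing loop never exits
    secondary.ceph_admin(['config', 'rm', 'client.rgw', 'rgw_inject_delay_sec'])
```

The key ordering is that the delete of the source bucket lands while the secondary's full-sync coroutine is parked inside the injected delay, between two pages of the listing.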

With Fix

Once full sync starts (after enabling bucket sync once all objects are uploaded), we always see the paginated listing happen deterministically, which is critical to reproducing the issue:

run/c2/out/radosgw.8001.log:2025-11-23T03:26:28.842+0000 7f7cce0da700 20 RGW-SYNC:data:sync:shard[91]:entry[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3[0]]:bucket_sync_sources[source=:rvtfgn-1[eb859e84-2137-4098-a6c5-06de10988051.4421.1]):3:source_zone=eb859e84-2137-4098-a6c5-06de10988051]:bucket[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1<-rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]:full_sync[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]: listing bucket for full sync

# now we get paginated listing
run/c2/out/radosgw.8001.log:2025-11-23T03:26:28.969+0000 7f7cce0da700 20 RGW-SYNC:data:sync:shard[91]:entry[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3[0]]:bucket_sync_sources[source=:rvtfgn-1[eb859e84-2137-4098-a6c5-06de10988051.4421.1]):3:source_zone=eb859e84-2137-4098-a6c5-06de10988051]:bucket[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1<-rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]:full_sync[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]: listed bucket for full sync list_result.entries.size=1000 is_truncated=1

# and the delay injection - giving the testcase time to delete all objects
# and the bucket in the meantime - now happens correctly when pagination is in effect.
run/c2/out/radosgw.8001.log:2025-11-23T03:26:28.969+0000 7f7cce0da700  0 RGW-SYNC:data:sync:shard[91]:entry[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3[0]]:bucket_sync_sources[source=:rvtfgn-1[eb859e84-2137-4098-a6c5-06de10988051.4421.1]):3:source_zone=eb859e84-2137-4098-a6c5-06de10988051]:bucket[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1<-rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]:full_sync[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]: injecting a delay of 100.000000s

With the fix, the testcase passes since we now reset the list_result object and don't use stale state.

run/c2/out/radosgw.8001.log:2025-11-23T03:28:12.771+0000 7f7cce0da700 20 RGW-SYNC:data:sync:shard[91]:entry[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3[0]]:bucket_sync_sources[source=:rvtfgn-1[eb859e84-2137-4098-a6c5-06de10988051.4421.1]):3:source_zone=eb859e84-2137-4098-a6c5-06de10988051]:bucket[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1<-rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]:full_sync[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]: listing bucket for full sync

# now - since the source objects and the bucket are deleted -
# with the fix we can get out of the for-loop
run/c2/out/radosgw.8001.log:2025-11-23T03:28:12.773+0000 7f7cce0da700 20 RGW-SYNC:data:sync:shard[91]:entry[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3[0]]:bucket_sync_sources[source=:rvtfgn-1[eb859e84-2137-4098-a6c5-06de10988051.4421.1]):3:source_zone=eb859e84-2137-4098-a6c5-06de10988051]:bucket[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1<-rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]:full_sync[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]: listed bucket for full sync list_result.entries.size=0 is_truncated=0

Test passes:

Ran 1 test in 284.062s
OK

Without fix

Running the updated testcase without the fix reproduces the issue deterministically, and the test fails as expected. The listing loop never ends after the bucket is deleted:

run/c2/out/radosgw.8001.log:2025-11-23T04:07:58.309+0000 7fe9ad1ae700 20 RGW-SYNC:data:sync:shard[11]:entry[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2[0]]:bucket_sync_sources[source=:mkoudz-1[88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1]):2:source_zone=88756c7d-e05a-460c-9b97-0efd67cf04eb]:bucket[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1<-mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2]:full_sync[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2]: listing bucket for full sync
run/c2/out/radosgw.8001.log:2025-11-23T04:07:58.490+0000 7fe9ad1ae700 20 RGW-SYNC:data:sync:shard[11]:entry[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2[0]]:bucket_sync_sources[source=:mkoudz-1[88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1]):2:source_zone=88756c7d-e05a-460c-9b97-0efd67cf04eb]:bucket[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1<-mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2]:full_sync[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2]: listed bucket for full sync list_result.entries.size=1000 is_truncated=1
.
.
.
# the full-sync coroutine can never exit the loop as it is using the
# stale list_result object, until the rgw instance is restarted
run/c2/out/radosgw.8001.log:2025-11-23T04:09:54.779+0000 7fe9ad1ae700 20 RGW-SYNC:data:sync:shard[11]:entry[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2[0]]:bucket_sync_sources[source=:mkoudz-1[88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1]):2:source_zone=88756c7d-e05a-460c-9b97-0efd67cf04eb]:bucket[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1<-mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2]:full_sync[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2]: listed bucket for full sync list_result.entries.size=1000 is_truncated=1
.
.
.

If you look at the sync status during this time, you'll see that it keeps reporting the same lag:

...
   current time 2025-11-23T04:11:59Z
..
      data sync source: 88756c7d-e05a-460c-9b97-0efd67cf04eb (a1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        behind shards: [11]
                        oldest incremental change not applied: 2025-11-23T04:09:09.837391+0000 [11]
                        10 shards are recovering
                        recovering shards: [9,10,12,13,14,15,16,17,18,19]
.
.
.

...
   current time 2025-11-23T04:12:39Z
...
      data sync source: 88756c7d-e05a-460c-9b97-0efd67cf04eb (a1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        behind shards: [11]
                        oldest incremental change not applied: 2025-11-23T04:09:09.837391+0000 [11]
                        10 shards are recovering
                        recovering shards: [9,10,12,13,14,15,16,17,18,19]

and the testcase fails since data sync keeps reporting that it's behind.

...
rgw_multi.tests: ERROR: test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime failed: failed data checkpoint for target_zone=a2 source_zone=a1
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 598.802s

FAILED (failures=1)
http://localhost:8000
http://localhost:8001

Please take a look at the testcase again. A note: please don't forget to stop your mstart cluster (or restart the affected rgw instance) after the testcase failure (if you're trying it out without the fix); otherwise, radosgw.log will fill up quickly due to the never-ending loop of "listing bucket for full sync" and "listed bucket for full sync list_result.entries.size=1000 is_truncated=1" events.

@jmundack jmundack requested a review from a team November 24, 2025 02:15
@smanjara
Contributor

scheduled teuthology runs at https://pulpito.ceph.com/smanjara-2025-11-25_06:49:38-rgw:multisite-test-wip-oozmen-73799-distro-default-smithi/

the test is failing with AttributeError: 'Cluster' object has no attribute 'ceph_admin':

2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests:======================================================================
2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime
2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests:----------------------------------------------------------------------
2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests:Traceback (most recent call last):
2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests: File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/virtualenv/lib/python3.10/site-packages/nose/case.py", line 170, in runTest
2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests: self.test(*self.arg)
2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests: File "/home/teuthworker/src/github.com_BBoozmen_ceph_ce9be9e2a9c1bf23e82ba7dfc7a7caaedfecbfd1/qa/../src/test/rgw/rgw_multi/tests.py", line 6198, in test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime
2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests: secondary_zone_cluster_conn.cluster.ceph_admin(
2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests:AttributeError: 'Cluster' object has no attribute 'ceph_admin'

@BBoozmen
Contributor Author

scheduled teuthology runs at https://pulpito.ceph.com/smanjara-2025-11-25_06:49:38-rgw:multisite-test-wip-oozmen-73799-distro-default-smithi/

the test is failing with AttributeError: 'Cluster' object has no attribute 'ceph_admin':

Hmm, the ceph_admin method is introduced in this PR in commit 4f47850, and the ceph-ci branch https://github.com/ceph/ceph-ci/commits/test-wip-oozmen-73799/ has it, too.

2025-11-25T07:58:19.703 ERROR:rgw_multi.tests:test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime failed: 'Cluster' object has no attribute 'ceph_admin'

I think the problem is that qa/tasks uses the definition from qa/tasks/rgw_multisite.py, not the one from src/test/rgw/test_multi.py. This PR updates only the latter.

qa/tasks/rgw_multisite.py seems to use a slightly different implementation. Let me take a look at how I can add the ceph_admin method there as well.

@BBoozmen
Contributor Author

BBoozmen commented Dec 5, 2025

Just to give an update on this one...

Testing the qa/tasks changes via https://github.com/ceph/ceph-ci/tree/wip-oozmen-73799. Once it passes, I'll reflect the changes here.

Teuthology seems to be busy, though; the job has been sitting in the queue for a while: https://pulpito.ceph.com/bcs-ceph-2025-12-02_19:41:42-rgw:multisite-wip-oozmen-73799-distro-default-smithi/

To my understanding, there's no easy way to test qa/tasks changes locally, so they have to go through the teuthology machinery?

@BBoozmen
Contributor Author

BBoozmen commented Dec 9, 2025

https://pulpito.ceph.com/bcs-ceph-2025-12-02_19:41:42-rgw:multisite-wip-oozmen-73799-distro-default-smithi/

An update on this one: it's getting closer. I had to change the testcase a bit due to the difference in cluster topology between:

  • local/mstart (src/test/rgw/) integration test run and
  • teuthology (qa/tasks) run

The former creates a realm with a single (master) zonegroup with two zones/clusters, one being the master, so the logic below works when running the integration tests locally:

    primary_zone_cluster_conn = zonegroup.zones[0]
    secondary_zone_cluster_conn = zonegroup.zones[1]

However, in the teuthology run, secondary_zone_cluster_conn still refers to c1 because teuthology creates a more complex topology. I've updated the testcase at https://github.com/ceph/ceph-ci/tree/wip-oozmen-73799 as follows; I'll try again on teuthology.

    # get cluster connections
    primary_zone_cluster_conn = master_zonegroup.master_zone
    secondary_zone_cluster_conn = None
    for zg in realm.current_period.zonegroups:
        for zone in zg.zones:
            if zone.cluster != primary_zone_cluster_conn.cluster and zone != zg.master_zone:
                secondary_zone_cluster_conn = zone
                break
        if secondary_zone_cluster_conn is not None:
            break
    else:
        raise SkipTest("test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime is skipped. "
                       "Requires a secondary zone in a different cluster.")

@smanjara
Contributor

smanjara commented Jan 5, 2026

the test is failing with AttributeError: 'Cluster' object has no attribute 'ceph_admin':

2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests: secondary_zone_cluster_conn.cluster.ceph_admin(
2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests:AttributeError: 'Cluster' object has no attribute 'ceph_admin'

qa/tasks/rgw_multisite.py seems to use a slightly different implementation. Let me take a look at how I can add the ceph_admin method there as well.

The sepia lab is still under migration, so you might not be able to schedule teuthology runs yet. The existing cluster.admin() method comes from the class Cluster. As you pointed out, teuthology runs these tasks a bit differently: qa/tasks/rgw_multi/multisite.py (a symlink to the copy under src/test/rgw/rgw_multi/) is where the abstract methods are declared, and qa/tasks/rgw_multisite.py is where the concrete classes are implemented. I think we will need to add ceph_admin() as a method in the class Cluster.
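The abstract/concrete split being described can be sketched like this. Only `Cluster`, `admin()`, and `ceph_admin()` are names from the discussion; everything else (`LocalCluster`, the command-string return values) is hypothetical illustration, not the real qa/tasks implementation:

```python
from abc import ABC, abstractmethod

class Cluster(ABC):
    """Abstract cluster interface (models qa/tasks/rgw_multi/multisite.py)."""

    @abstractmethod
    def admin(self, args):
        """Run a radosgw-admin command against this cluster."""

    @abstractmethod
    def ceph_admin(self, args):
        """Run a ceph CLI command (e.g. 'ceph config set') against this cluster."""

class LocalCluster(Cluster):
    """Concrete implementation (models the classes in qa/tasks/rgw_multisite.py)."""

    def __init__(self, name):
        self.name = name

    def admin(self, args):
        # in the real code this would execute the command; here we just build it
        return f"radosgw-admin --cluster {self.name} " + " ".join(args)

    def ceph_admin(self, args):
        return f"ceph --cluster {self.name} " + " ".join(args)
```

The point is that a testcase calling `zone.cluster.ceph_admin(...)` only works if both the abstract declaration and every concrete `Cluster` subclass (local mstart and teuthology) provide the method.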

Besides that, we tested the sync code at a reasonable scale and didn't find any issues or regressions!

@BBoozmen
Contributor Author

BBoozmen commented Jan 6, 2026

besides that, we tested the sync code at a reasonable scale and we didn't find any issues or regressions!

Sounds good, thank you for the update!

The sepia lab is still under migration, so you might not be able to schedule teuthology runs yet. The existing cluster.admin() method comes from the class Cluster. As you pointed out, teuthology runs these tasks a bit differently: qa/tasks/rgw_multi/multisite.py (a symlink to the copy under src/test/rgw/rgw_multi/) is where the abstract methods are declared, and qa/tasks/rgw_multisite.py is where the concrete classes are implemented. I think we will need to add ceph_admin() as a method in the class Cluster.

Hopefully I can test these qa/tasks changes once the lab is functional again. I've not added the changes to this PR yet, but I've added them to its corresponding ceph-ci branch (https://github.com/ceph/ceph-ci/tree/wip-oozmen-73799). When I get a chance, I'll test it via teuthology, and once it passes I'll reflect the changes here as well. I'm planning to run the ceph-ci branch in the Poughkeepsie lab if I can.

@BBoozmen
Contributor Author

rgw:multisite tests are broken at the new POK lab. We need #67011 (qa/multisite: switch to boto3) to be merged first.

@smanjara
Contributor

@BBoozmen, the boto3 migration is still underway, but I don't want to block the PR, with just the multisite commit, from getting merged. We could follow up with the test once the test migration is done. cc @cbodley

@BBoozmen
Contributor Author

@BBoozmen, the boto3 migration is still underway. but I don't want to block the PR with just the multisite commit from getting merged. we could follow up with the test once the test migration is done. cc @cbodley

It's OK. No rush on this one.

This PR is meant to change qa/tasks as well (i.e., qa/tasks/rgw_multisite.py); that's why I'd like to wait for the rgw:multisite suite to pass.

I've just pushed the changes here as well (see compare); they make the change to qa/tasks/rgw_multisite.py, which I was hoping to test in teuthology first via the ceph-ci branch ceph-ci::wip-oozmen-73799.

Bottom line: we can wait for the boto3 changes to go in first, and I can test this one later.

@github-actions

github-actions bot commented Feb 3, 2026

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

…ts entries

Add a new method `reset_entries()` to the `bucket_list_result` struct
that clears the list of entries and resets the truncated flag.

This will be used in reuse cases to avoid accessing stale entries or a
stale truncated flag.

Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
…to avoid stale listings

RGWBucketFullSyncCR could spin indefinitely when the source bucket was
already deleted. The coroutine reused a bucket_list_result member, and
RGWListRemoteBucketCR populated it without clearing prior state. Stale
entries/is_truncated from a previous iteration caused the loop to
continue even after the bucket no longer existed.

Fix by clearing the provided bucket_list_result at the start of
RGWListRemoteBucketCR (constructor), ensuring each listing starts from a
clean state and reflects the current remote bucket contents.

This prevents the infinite loop and returns correct results when the
bucket has been deleted.

Fixes: https://tracker.ceph.com/issues/73799
Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
…e bucket is deleted in the middle

Tests: https://tracker.ceph.com/issues/73799
Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
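The failure mode described in the fix commit above can be modeled in a few lines of Python. This is a sketch with hypothetical names mirroring the C++ `bucket_list_result` / `RGWListRemoteBucketCR` / `RGWBucketFullSyncCR`, not the real implementation:

```python
class BucketListResult:
    """Models RGW's bucket_list_result: a page of entries plus a truncation flag."""
    def __init__(self):
        self.entries = []
        self.is_truncated = False

    def reset_entries(self):
        # the new method added by the fix: clear entries and the truncated flag
        self.entries = []
        self.is_truncated = False

def list_remote_bucket(result, remote_entries, reset_first):
    """Models RGWListRemoteBucketCR populating a caller-provided result.

    If the remote bucket is gone, the remote returns nothing -- but without
    a reset, the result keeps the stale entries/is_truncated from last time.
    """
    if reset_first:
        result.reset_entries()
    if remote_entries:  # bucket still exists on the source
        result.entries = remote_entries[:1000]
        result.is_truncated = len(remote_entries) > 1000

def full_sync(remote_entries, reset_first, max_iters=10):
    """Models the RGWBucketFullSyncCR listing loop; returns iterations used."""
    result = BucketListResult()  # reused across iterations, as in the coroutine
    for i in range(1, max_iters + 1):
        list_remote_bucket(result, remote_entries, reset_first)
        remote_entries = []  # the source bucket is deleted after the first page
        if not result.is_truncated:
            return i  # loop terminates normally
    return max_iters  # never saw is_truncated=False: the "infinite" loop
```

With `reset_first=True` the second listing observes the empty (deleted) bucket and the loop exits; with `reset_first=False` the stale `is_truncated=True` from the first page keeps the loop spinning forever, which is the behavior seen in the radosgw logs above.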
@BBoozmen
Contributor Author

BBoozmen commented Feb 20, 2026

@cbodley, @smanjara - can you please review the PR? I was finally able to complete the teuthology testing.

https://pulpito.ceph.com/bcs-ceph-2026-02-19_14:51:57-rgw:multisite-wip-oozmen-73799-distro-default-trial/ is the rgw:multisite test run. Although it shows as failed, the tests relevant to this PR passed. This PR changes a helper function used by the testcase test_period_update_commit, and it introduces the new testcase test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime.

$ egrep "test_period_update_commit|test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime" teuthology.log
2026-02-19T23:10:43.212 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_period_update_commit ... ok
2026-02-19T23:13:20.426 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime ... ok

Some details on test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime to show it runs as expected:

2026-02-19T23:10:43.404 INFO:rgw_multi.tests:disable sync for bucket=eoibqj-69
2026-02-19T23:11:03.164 INFO:rgw_multi.tests:successfully uploaded 3000 objects to bucket=eoibqj-69
2026-02-19T23:11:03.267 INFO:rgw_multi.tests:set rgw_inject_delay_sec and rgw_inject_delay_pattern to slow down bucket full sync
2026-02-19T23:11:03.623 INFO:rgw_multi.tests:enable bucket sync to initiate full sync
2026-02-19T23:11:03.741 INFO:rgw_multi.tests:verify that bucket sync is stalled
2026-02-19T23:11:14.050 INFO:rgw_multi.tests:verified that bucket sync is stalled, oldest incremental change not applied epoch: 0.0
2026-02-19T23:11:39.502 INFO:rgw_multi.tests:removing rgw_inject_delay_sec and rgw_inject_delay_pattern to allow bucket full sync to run normally to the completion
2026-02-19T23:13:19.872 INFO:rgw_multi.tests:wait for data sync to complete
2026-02-19T23:13:20.426 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime ... ok

As for the failed tests, they are failing in other submissions as well, so I don't think they are related to this PR. For example, looking at anuchaithra's submission: https://qa-proxy.ceph.com/teuthology/anuchaithra-2026-02-18_10:20:38-rgw-wip-anrao5-testing-2026-02-18-1230-distro-default-trial/56569/teuthology.log

$ egrep -A 1 "=======" teuthology-anrao5.log | grep ERROR
2026-02-18T14:14:30.968 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_create
2026-02-18T14:14:30.970 INFO:tasks.rgw_multisite_tests:ERROR: create a bucket from secondary zone under tenant namespace. check if it successfully syncs
2026-02-18T14:14:30.972 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_recreate
2026-02-18T14:14:30.974 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_remove
2026-02-18T14:14:30.976 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_object_sync
2026-02-18T14:14:30.978 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_object_delete
2026-02-18T14:14:30.980 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_multi_object_delete
2026-02-18T14:14:30.982 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_versioned_object_incremental_sync
2026-02-18T14:14:30.984 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_delete_marker_full_sync
2026-02-18T14:14:30.986 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_suspended_delete_marker_full_sync
2026-02-18T14:14:30.988 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_versioning
2026-02-18T14:14:30.990 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_acl
2026-02-18T14:14:30.992 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_cors
2026-02-18T14:14:30.995 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_delete_notempty
2026-02-18T14:14:30.996 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_datalog_autotrim
2026-02-18T14:14:30.998 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_set_bucket_website
2026-02-18T14:14:31.000 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_set_bucket_policy
2026-02-18T14:14:31.002 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_sync_disable
2026-02-18T14:14:31.003 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_sync_enable_right_after_disable
2026-02-18T14:14:31.005 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_sync_disable_enable
2026-02-18T14:14:31.006 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_multipart_object_sync
2026-02-18T14:14:31.008 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_assume_role_after_sync
2026-02-18T14:14:31.009 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_topic_notification_sync


$ egrep -A 1 "=======" teuthology.log | grep ERROR
2026-02-19T23:13:20.427 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_create
2026-02-19T23:13:20.429 INFO:tasks.rgw_multisite_tests:ERROR: create a bucket from secondary zone under tenant namespace. check if it successfully syncs
2026-02-19T23:13:20.430 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_recreate
2026-02-19T23:13:20.432 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_remove
2026-02-19T23:13:20.434 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_object_sync
2026-02-19T23:13:20.436 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_object_delete
2026-02-19T23:13:20.438 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_multi_object_delete
2026-02-19T23:13:20.441 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_versioned_object_incremental_sync
2026-02-19T23:13:20.443 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_delete_marker_full_sync
2026-02-19T23:13:20.445 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_suspended_delete_marker_full_sync
2026-02-19T23:13:20.447 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_versioning
2026-02-19T23:13:20.449 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_acl
2026-02-19T23:13:20.451 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_cors
2026-02-19T23:13:20.453 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_delete_notempty
2026-02-19T23:13:20.456 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_datalog_autotrim
2026-02-19T23:13:20.457 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_set_bucket_website
2026-02-19T23:13:20.459 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_set_bucket_policy
2026-02-19T23:13:20.461 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_sync_disable
2026-02-19T23:13:20.463 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_sync_enable_right_after_disable
2026-02-19T23:13:20.464 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_sync_disable_enable
2026-02-19T23:13:20.466 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_multipart_object_sync
2026-02-19T23:13:20.467 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_assume_role_after_sync
2026-02-19T23:13:20.468 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_topic_notification_sync
2026-02-19T23:13:20.469 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_create_location_constraint

Both logs show the same failure reasons for the above testcases:

  • Failures are mostly AccessDenied during CreateBucket
  • the same AccessDenied during PutBucketNotificationConfiguration for topic notifications
  • a one-off SignatureDoesNotMatch during CreateBucket

Again, those failures seem unrelated to this PR.

@smanjara
Contributor

  • Failures are mostly AccessDenied during CreateBucket
  • the same AccessDenied during PutBucketNotificationConfiguration for topic notifications
  • one-off SignatureDoesNotMatch during CreateBucket

Again, those failures seem unrelated to this PR.

the PR associated with https://tracker.ceph.com/issues/74579 should fix the CreateBucket forwarded requests.

@BBoozmen
Contributor Author

Thank you @smanjara! Added the needs-qa label to get this through the general QA testing.

@ivancich
Member

The QA run for this PR included two others, but they're showing a lot of crash-related errors.

There are six of these:

"2026-02-25T05:26:33.173487+0000 mon.a (mon.0) 181 : cluster [WRN] Health check failed: 2 daemons have recently crashed (RECENT_CRASH)" in cluster log

And there are nine of these, mostly dealing with the upgrade tests.

Found coredumps on ubuntu@trial096.front.sepia.ceph.com

None of the PRs jumps out at me as a likely culprit, so I'm asking you. I'm also having a re-run done.

Here's the full run: https://pulpito.ceph.com/anuchaithra-2026-02-25_05:07:15-rgw-wip-anrao1-testing-2026-02-23-1551-distro-default-trial/

Thanks!

@smanjara
Contributor

smanjara commented Feb 25, 2026

The QA run for this PR included two others. But they're showing a lot of crash related errors.

There are six of these:

"2026-02-25T05:26:33.173487+0000 mon.a (mon.0) 181 : cluster [WRN] Health check failed: 2 daemons have recently crashed (RECENT_CRASH)" in cluster log

And there are nine of these, mostly dealing with the upgrade tests.

Found coredumps on ubuntu@trial096.front.sepia.ceph.com

None of the PRs jumps out at me as a likely culprit, so I'm asking you. I'm also having a re-run done.

Here's the full run: https://pulpito.ceph.com/anuchaithra-2026-02-25_05:07:15-rgw-wip-anrao1-testing-2026-02-23-1551-distro-default-trial/

Thanks!

@ivancich the two multisite jobs haven't crashed, though, so it must be something else. Could you tell us which the other PRs are? Also, multisite tests will fail without #67083; we need that to be merged, or this PR tested along with wip-anrao3-testing.

cc @anrao19

@smanjara smanjara removed the needs-qa label Feb 25, 2026
@cbodley
Contributor

cbodley commented Feb 26, 2026

@ivancich the two multisite jobs haven't crashed though, so it must be something else. Could you tell which are the other PRs? Also, multisite tests will fail without #67083; we need that to be merged, or have this PR tested along with wip-anrao3-testing

#67083 merged, let's retry with that

@BBoozmen
Contributor Author

@ivancich the two multisite jobs haven't crashed though, so it must be something else. Could you tell which are the other PRs? Also, multisite tests will fail without #67083; we need that to be merged, or have this PR tested along with wip-anrao3-testing

#67083 merged, let's retry with that

I was going to start looking at the failure logs @ivancich pointed out earlier, but I think there's going to be another round of testing including #67083, so I'll be on standby for now.

@BBoozmen BBoozmen added the wip-oozmen-testing label (for teuthology integration testing, to be used via build-integration-branch.sh) Mar 4, 2026
@BBoozmen
Contributor Author

@cbodley / @smanjara

Sent this PR through Teuthology testing for the rgw test suite:

Overall Summary

  • Total tests: 71
  • Passed: 43 (60.6%)
  • Failed: 27
  • Dead: 1

Looking at failures:

| Root Cause | Count | Suites Affected |
| --- | --- | --- |
| Valgrind Error (sendmsg/SyscallParam) | 17 | rgw/verify (12), rgw/notifications (5) |
| Keystone/OpenStack setup failure | 4 | rgw/crypt (1), rgw/tempest (3) |
| S3 User Quota workunit failure | 3 | rgw/multifs (3) |
| RGW Multisite test failures | 2 | rgw/multisite (2) |
| adjust-ulimits error code 124 | 1 | rgw/d4n (1) |
| Job timeout | 1 | rgw/upgrade (1) |
  • Valgrind errors are all related to SyscallParam sendmsg. I believe this is a known infrastructure/valgrind issue?
  • Keystone/OpenStack issues - a test infrastructure problem, I think, with openstack project create failing.
  • s3_user_quota-run.sh failures all seem to be getting error 22 (EINVAL).
failure_reason: 'Command failed (workunit test rgw/s3_user_quota-run.sh) on trial049
  with status 22: ''mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd --
  /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=c221a0e611968dcf997352e2792ea5cbb550c44e
  TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin
  CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0
  CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage
  timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rgw/s3_user_quota-run.sh'''
  • adjust-ulimits error code 124 (timeout)
2026-03-07T05:20:08.109 DEBUG:teuthology.orchestra.run.trial052:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph pg dump --format=json
2026-03-07T05:22:08.142 DEBUG:teuthology.orchestra.run:got remote process result: 124
  • The relevant failures are 2 multisite testcases:
2026-03-07T05:43:53.758 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_bucket_remove ... FAIL
2026-03-07T06:31:32.060 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_period_update_commit ... FAIL

test_bucket_remove

The testcase is not touched by this PR and it seems to be failing in other submissions too; e.g., https://qa-proxy.ceph.com/teuthology/anuchaithra-2026-03-09_10:21:55-rgw-wip-anrao2-testing-2026-03-05-1549-distro-default-trial/94912/teuthology.log. It exercises metadata sync, which is not relevant to this PR, so I think we can address it separately. If there's no tracker item, I can open one.

test_period_update_commit

It fails at the very last step, the zonegroup_data_checkpoint check. I think it's a scale issue, as this testcase uploads a couple of thousand objects concurrently.

...
# this is last (5 of 5) verification after a period update --commit
2026-03-07T06:17:10.733 INFO:rgw_multi.tests:verify data sync is making progress
...
# The workload stopped (all 25 threads)
2026-03-07T06:21:14.514 INFO:rgw_multi.tests:uploaded 4000 times for the range (900, 999) to bucket=jwzcgy-66
2026-03-07T06:21:14.550 INFO:rgw_multi.tests:uploaded 4000 times for the range (700, 799) to bucket=jwzcgy-66
...
2026-03-07T06:21:14.711 INFO:rgw_multi.tests:uploaded 4000 times for the range (1600, 1699) to bucket=jwzcgy-66
...
# but after 5 mins (default timeout) - zonegroup_data_checkpoint timed out
2026-03-07T06:26:13.259 INFO:teuthology.orchestra.run.trial052.stderr:2026-03-07T06:26:13.256+0000 7f6f8bfff640  0 WARNING: curl operation timed out, network average transfer speed less than 1024 Bytes per second during 300 seconds.

Again, this failure is not relevant to the PR, so I can open a separate tracker; I think it should be enough to make the S3 client workload less aggressive.

That said, this testcase used to pass, as shown in https://qa-proxy.ceph.com/teuthology/bcs-ceph-2026-02-19_14:51:57-rgw:multisite-wip-oozmen-73799-distro-default-trial/58954/teuthology.log, and I just built this same ceph-ci branch, tested it locally, and it passed again.

test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime

The testcase introduced in this PR passed.

...
2026-03-07T06:33:58.810 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime ... ok

@cbodley / @smanjara - Shall we get this merged?

@smanjara smanjara merged commit 16c4842 into ceph:main Mar 10, 2026
13 checks passed
@smanjara
Contributor

thanks @BBoozmen

@smanjara
Contributor

Overall Summary

  • Total tests: 71
  • Passed: 43 (60.6%)
  • Failed: 27
  • Dead: 1

please do open trackers for the s3 and multisite failures @BBoozmen. thanks!
