
RGW/multisite: fix bucket-full-sync infinite loop caused by stale bucket_list_result reuse#66203

Merged
smanjara merged 8 commits into ceph:main from BBoozmen:wip-oozmen-73799
Mar 10, 2026

Conversation

@BBoozmen
Contributor

@BBoozmen BBoozmen commented Nov 11, 2025

Fixes: https://tracker.ceph.com/issues/73799

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands

You must only issue one Jenkins command per comment. Jenkins does not understand
comments with more than one command.

@github-actions

github-actions bot commented Nov 11, 2025

Config Diff Tool Output

+ added: rgw_inject_delay_sec (rgw.yaml.in)
+ added: rgw_inject_delay_pattern (rgw.yaml.in)

The above configuration changes are found in the PR. Please update the relevant release documentation if necessary.
Ignore this comment if docs are already updated. To make the "Check ceph config changes" CI check pass, please comment /config check ok and re-run the test.

@BBoozmen BBoozmen marked this pull request as ready for review November 12, 2025 19:16
@BBoozmen BBoozmen requested a review from a team as a code owner November 12, 2025 19:16
@BBoozmen
Contributor Author

/config check ok

@cbodley cbodley requested a review from smanjara November 20, 2025 15:26
@smanjara
Contributor

@BBoozmen thanks for the detailed test case and reproducer. I do agree that the list_bucket_result should be cleared before sending the request to the remote for the next listing.
I tried running your test, without the proposed fix, with a single rgw instance on either side. We should expect a repetitive loop with a log message like "listed bucket for full sync" if we had stale entries, but I could only find one such log message:

2025-11-20T14:13:15.452-0500 7efe2975f6c0 20 RGW-SYNC:data:sync:shard[36]:entry[ccjgqv-1:609213d3-f584-4e86-bca2-b7c1c434fd59.4362.1:5[0]]:bucket_sync_sources[source=:ccjgqv-1[609213d3-f584-4e86-bca2-b7c1c434fd59.4362.1]):5:source_zone=609213d3-f584-4e86-bca2-b7c1c434fd59]:bucket[ccjgqv-1:609213d3-f584-4e86-bca2-b7c1c434fd59.4362.1<-ccjgqv-1:609213d3-f584-4e86-bca2-b7c1c434fd59.4362.1:5]:full_sync[ccjgqv-1:609213d3-f584-4e86-bca2-b7c1c434fd59.4362.1:5]: listed bucket for full sync list_result.entries.size=299 is_truncated=0

am I missing something?

@BBoozmen
Contributor Author

@BBoozmen thanks for the detailed test case and reproducer. I do agree that the list_bucket_result should be cleared before sending the request to the remote for the next listing. I tried running your test, without the proposed fix, with a single rgw instance on either side. We should expect a repetitive loop with a log message like "listed bucket for full sync" if we had stale entries, but I could only find one such log message:

2025-11-20T14:13:15.452-0500 7efe2975f6c0 20 RGW-SYNC:data:sync:shard[36]:entry[ccjgqv-1:609213d3-f584-4e86-bca2-b7c1c434fd59.4362.1:5[0]]:bucket_sync_sources[source=:ccjgqv-1[609213d3-f584-4e86-bca2-b7c1c434fd59.4362.1]):5:source_zone=609213d3-f584-4e86-bca2-b7c1c434fd59]:bucket[ccjgqv-1:609213d3-f584-4e86-bca2-b7c1c434fd59.4362.1<-ccjgqv-1:609213d3-f584-4e86-bca2-b7c1c434fd59.4362.1:5]:full_sync[ccjgqv-1:609213d3-f584-4e86-bca2-b7c1c434fd59.4362.1:5]: listed bucket for full sync list_result.entries.size=299 is_truncated=0

am I missing something?

Thank you @smanjara for the feedback. Yes, the current reproduction recipe doesn't reproduce the issue deterministically. I've updated the recipe to make the reproduction deterministic and updated the related commit. Please have a look.

The new recipe doesn't rely on bilog trimming; instead it:

  • disables bucket sync before uploading any objects
  • uploads all objects (>1000)
  • injects the delay
  • re-enables bucket sync to initiate the full sync

This new recipe makes sure that, when bucket full sync starts, the bucket has more than 1000 objects, so the full-sync listing is paginated.

Tested the new recipe with and without the fix: python3.9 -m nose -s test_multi.py -v -m test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime.
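The recipe above can be sketched roughly as follows. This is a minimal illustration with hypothetical fixture names (`FakeZone`, `put_object`, etc.); the real testcase lives in src/test/rgw/rgw_multi/tests.py and drives actual clusters:

```python
class FakeZone:
    """Stand-in for a test-zone fixture; records the actions taken (hypothetical)."""
    def __init__(self):
        self.calls = []

    def admin(self, args):
        self.calls.append(('admin',) + tuple(args))

    def ceph_admin(self, args):
        self.calls.append(('ceph_admin',) + tuple(args))

    def put_object(self, bucket, key):
        self.calls.append(('put', bucket, key))

    def delete_object(self, bucket, key):
        self.calls.append(('delete', bucket, key))

    def delete_bucket(self, bucket):
        self.calls.append(('delete_bucket', bucket))


def reproduce(primary, secondary, bucket, num_objects=3000):
    # 1. disable bucket sync before any objects exist
    primary.admin(['bucket', 'sync', 'disable', '--bucket', bucket])
    # 2. upload enough objects to force a paginated full-sync listing (>1000)
    for i in range(num_objects):
        primary.put_object(bucket, f'obj-{i}')
    # 3. inject a delay into the full-sync listing loop on the secondary
    secondary.ceph_admin(['config', 'set', 'client.rgw',
                          'rgw_inject_delay_sec', '100'])
    # 4. re-enable sync to kick off full sync; it stalls in the injected delay
    primary.admin(['bucket', 'sync', 'enable', '--bucket', bucket])
    # 5. while the secondary is stalled, delete every object and the bucket
    for i in range(num_objects):
        primary.delete_object(bucket, f'obj-{i}')
    primary.delete_bucket(bucket)
    # 6. remove the delay; without the fix, the listing loop never exits
    secondary.ceph_admin(['config', 'rm', 'client.rgw', 'rgw_inject_delay_sec'])
```

The key ordering is that the delete of the source bucket lands while the secondary's full-sync coroutine is parked inside the injected delay, between two pages of the listing.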

With Fix

Once full sync starts (after enabling bucket sync once all objects are uploaded), we always see the paginated listing happen deterministically, which is critical to reproducing the issue:

run/c2/out/radosgw.8001.log:2025-11-23T03:26:28.842+0000 7f7cce0da700 20 RGW-SYNC:data:sync:shard[91]:entry[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3[0]]:bucket_sync_sources[source=:rvtfgn-1[eb859e84-2137-4098-a6c5-06de10988051.4421.1]):3:source_zone=eb859e84-2137-4098-a6c5-06de10988051]:bucket[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1<-rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]:full_sync[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]: listing bucket for full sync

# now we get paginated listing
run/c2/out/radosgw.8001.log:2025-11-23T03:26:28.969+0000 7f7cce0da700 20 RGW-SYNC:data:sync:shard[91]:entry[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3[0]]:bucket_sync_sources[source=:rvtfgn-1[eb859e84-2137-4098-a6c5-06de10988051.4421.1]):3:source_zone=eb859e84-2137-4098-a6c5-06de10988051]:bucket[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1<-rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]:full_sync[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]: listed bucket for full sync list_result.entries.size=1000 is_truncated=1

# and the delay injection - giving the testcase time to delete all objects
# and the bucket in the meantime - now happens correctly when pagination is in effect.
run/c2/out/radosgw.8001.log:2025-11-23T03:26:28.969+0000 7f7cce0da700  0 RGW-SYNC:data:sync:shard[91]:entry[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3[0]]:bucket_sync_sources[source=:rvtfgn-1[eb859e84-2137-4098-a6c5-06de10988051.4421.1]):3:source_zone=eb859e84-2137-4098-a6c5-06de10988051]:bucket[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1<-rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]:full_sync[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]: injecting a delay of 100.000000s

With the fix, the testcase passes since we now reset the list_result object and don't use stale state.

run/c2/out/radosgw.8001.log:2025-11-23T03:28:12.771+0000 7f7cce0da700 20 RGW-SYNC:data:sync:shard[91]:entry[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3[0]]:bucket_sync_sources[source=:rvtfgn-1[eb859e84-2137-4098-a6c5-06de10988051.4421.1]):3:source_zone=eb859e84-2137-4098-a6c5-06de10988051]:bucket[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1<-rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]:full_sync[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]: listing bucket for full sync

# now - since the source objects and the bucket are deleted -
# with the fix we can get out of the for-loop
run/c2/out/radosgw.8001.log:2025-11-23T03:28:12.773+0000 7f7cce0da700 20 RGW-SYNC:data:sync:shard[91]:entry[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3[0]]:bucket_sync_sources[source=:rvtfgn-1[eb859e84-2137-4098-a6c5-06de10988051.4421.1]):3:source_zone=eb859e84-2137-4098-a6c5-06de10988051]:bucket[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1<-rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]:full_sync[rvtfgn-1:eb859e84-2137-4098-a6c5-06de10988051.4421.1:3]: listed bucket for full sync list_result.entries.size=0 is_truncated=0

Test passes:

Ran 1 test in 284.062s
OK

Without fix

Running the updated testcase without the fix reproduces the issue deterministically, and the test fails as expected. The listing loop never ends after the bucket is deleted:

run/c2/out/radosgw.8001.log:2025-11-23T04:07:58.309+0000 7fe9ad1ae700 20 RGW-SYNC:data:sync:shard[11]:entry[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2[0]]:bucket_sync_sources[source=:mkoudz-1[88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1]):2:source_zone=88756c7d-e05a-460c-9b97-0efd67cf04eb]:bucket[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1<-mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2]:full_sync[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2]: listing bucket for full sync
run/c2/out/radosgw.8001.log:2025-11-23T04:07:58.490+0000 7fe9ad1ae700 20 RGW-SYNC:data:sync:shard[11]:entry[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2[0]]:bucket_sync_sources[source=:mkoudz-1[88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1]):2:source_zone=88756c7d-e05a-460c-9b97-0efd67cf04eb]:bucket[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1<-mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2]:full_sync[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2]: listed bucket for full sync list_result.entries.size=1000 is_truncated=1
.
.
.
# the full-sync coroutine can never exit the loop as it is using the
# stale list_result object, until the rgw instance is restarted
run/c2/out/radosgw.8001.log:2025-11-23T04:09:54.779+0000 7fe9ad1ae700 20 RGW-SYNC:data:sync:shard[11]:entry[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2[0]]:bucket_sync_sources[source=:mkoudz-1[88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1]):2:source_zone=88756c7d-e05a-460c-9b97-0efd67cf04eb]:bucket[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1<-mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2]:full_sync[mkoudz-1:88756c7d-e05a-460c-9b97-0efd67cf04eb.4389.1:2]: listed bucket for full sync list_result.entries.size=1000 is_truncated=1
.
.
.

If you look at the sync status during this time, you'll see that it keeps reporting the same lag:

...
   current time 2025-11-23T04:11:59Z
..
      data sync source: 88756c7d-e05a-460c-9b97-0efd67cf04eb (a1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        behind shards: [11]
                        oldest incremental change not applied: 2025-11-23T04:09:09.837391+0000 [11]
                        10 shards are recovering
                        recovering shards: [9,10,12,13,14,15,16,17,18,19]
.
.
.

...
   current time 2025-11-23T04:12:39Z
...
      data sync source: 88756c7d-e05a-460c-9b97-0efd67cf04eb (a1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        behind shards: [11]
                        oldest incremental change not applied: 2025-11-23T04:09:09.837391+0000 [11]
                        10 shards are recovering
                        recovering shards: [9,10,12,13,14,15,16,17,18,19]

and the testcase fails since data sync keeps reporting that it's behind.

...
rgw_multi.tests: ERROR: test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime failed: failed data checkpoint for target_zone=a2 source_zone=a1
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 598.802s

FAILED (failures=1)
http://localhost:8000
http://localhost:8001

Please take a look at the testcase again. A note: please don't forget to stop your mstart cluster (or restart the affected rgw instance) after the testcase failure (if you're trying it out without the fix); otherwise, radosgw.log will fill up quickly due to the never-ending loop of "listing bucket for full sync" and "listed bucket for full sync list_result.entries.size=1000 is_truncated=1" events.

@jmundack jmundack requested a review from a team November 24, 2025 02:15
@smanjara
Contributor

scheduled teuthology runs at https://pulpito.ceph.com/smanjara-2025-11-25_06:49:38-rgw:multisite-test-wip-oozmen-73799-distro-default-smithi/

the test is failing with AttributeError: 'Cluster' object has no attribute 'ceph_admin':

2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests:======================================================================
2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime
2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests:----------------------------------------------------------------------
2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests:Traceback (most recent call last):
2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests: File "/home/teuthworker/src/git.ceph.com_teuthology_258eb6279f4d7fcd4b45c82e521f2a2e799d7f33/virtualenv/lib/python3.10/site-packages/nose/case.py", line 170, in runTest
2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests: self.test(*self.arg)
2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests: File "/home/teuthworker/src/github.com_BBoozmen_ceph_ce9be9e2a9c1bf23e82ba7dfc7a7caaedfecbfd1/qa/../src/test/rgw/rgw_multi/tests.py", line 6198, in test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime
2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests: secondary_zone_cluster_conn.cluster.ceph_admin(
2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests:AttributeError: 'Cluster' object has no attribute 'ceph_admin'

@BBoozmen
Contributor Author

scheduled teuthology runs at https://pulpito.ceph.com/smanjara-2025-11-25_06:49:38-rgw:multisite-test-wip-oozmen-73799-distro-default-smithi/

the test is failing with AttributeError: 'Cluster' object has no attribute 'ceph_admin':

Hmm, the ceph_admin method is introduced in this PR in commit 4f47850, and the ceph-ci branch https://github.com/ceph/ceph-ci/commits/test-wip-oozmen-73799/ has it, too.

2025-11-25T07:58:19.703 ERROR:rgw_multi.tests:test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime failed: 'Cluster' object has no attribute 'ceph_admin'

I think the problem is that qa/tasks uses the definition from qa/tasks/rgw_multisite.py, not the one from src/test/rgw/test_multi.py. This PR updates only the latter.

qa/tasks/rgw_multisite.py seems to use a slightly different implementation. Let me take a look at how I can add the ceph_admin method there as well.

@BBoozmen
Contributor Author

BBoozmen commented Dec 5, 2025

Just to give an update on this one...

Testing the qa/tasks changes via https://github.com/ceph/ceph-ci/tree/wip-oozmen-73799. Once it passes, I'll reflect the changes here.

Teuthology seems to be busy, though; the job has been sitting in the queue for a while: https://pulpito.ceph.com/bcs-ceph-2025-12-02_19:41:42-rgw:multisite-wip-oozmen-73799-distro-default-smithi/

To my understanding, there's no easy way to test qa/tasks changes locally, so they have to go through the teuthology machinery?

@BBoozmen
Contributor Author

BBoozmen commented Dec 9, 2025

https://pulpito.ceph.com/bcs-ceph-2025-12-02_19:41:42-rgw:multisite-wip-oozmen-73799-distro-default-smithi/

An update on this one: it's getting closer. I had to change the testcase a bit due to the difference in cluster topology between:

  • local/mstart (src/test/rgw/) integration test run and
  • teuthology (qa/tasks) run

The former creates a realm with a single (master) zonegroup with two zones/clusters, one being the master, so the logic below works when running the integration tests locally:

    primary_zone_cluster_conn = zonegroup.zones[0]
    secondary_zone_cluster_conn = zonegroup.zones[1]

However, in the teuthology run, secondary_zone_cluster_conn still refers to c1 because teuthology creates a more complex topology. I've updated the testcase at https://github.com/ceph/ceph-ci/tree/wip-oozmen-73799 as follows; I'll try again on teuthology.

    # get cluster connections
    primary_zone_cluster_conn = master_zonegroup.master_zone
    secondary_zone_cluster_conn = None
    for zg in realm.current_period.zonegroups:
        for zone in zg.zones:
            if zone.cluster != primary_zone_cluster_conn.cluster and zone != zg.master_zone:
                secondary_zone_cluster_conn = zone
                break
        if secondary_zone_cluster_conn is not None:
            break
    else:
        raise SkipTest("test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime is skipped. "
                       "Requires a secondary zone in a different cluster.")

@smanjara
Contributor

smanjara commented Jan 5, 2026

the test is failing with AttributeError: 'Cluster' object has no attribute 'ceph_admin':

2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests: secondary_zone_cluster_conn.cluster.ceph_admin(
2025-11-25T07:51:20.693 INFO:tasks.rgw_multisite_tests:AttributeError: 'Cluster' object has no attribute 'ceph_admin'

qa/tasks/rgw_multisite.py seems to use a slightly different implementation. Let me take a look at how I can add the ceph_admin method there as well.

The sepia lab is still under migration, so you might not be able to schedule teuthology runs yet. The existing cluster.admin() method comes from the class Cluster. As you pointed out, teuthology runs these tasks a bit differently: qa/tasks/rgw_multi/multisite.py (a symlink to the copy under src/test/rgw/rgw_multi/) is where the abstract methods are declared, and qa/tasks/rgw_multisite.py is where the concrete classes are implemented. I think we will need to add ceph_admin() as a method in the class Cluster.
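The abstract/concrete split being described can be sketched like this. Only `Cluster`, `admin()`, and `ceph_admin()` are names from the discussion; everything else (`LocalCluster`, the command-string return values) is hypothetical illustration, not the real qa/tasks implementation:

```python
from abc import ABC, abstractmethod

class Cluster(ABC):
    """Abstract cluster interface (models qa/tasks/rgw_multi/multisite.py)."""

    @abstractmethod
    def admin(self, args):
        """Run a radosgw-admin command against this cluster."""

    @abstractmethod
    def ceph_admin(self, args):
        """Run a ceph CLI command (e.g. 'ceph config set') against this cluster."""

class LocalCluster(Cluster):
    """Concrete implementation (models the classes in qa/tasks/rgw_multisite.py)."""

    def __init__(self, name):
        self.name = name

    def admin(self, args):
        # in the real code this would execute the command; here we just build it
        return f"radosgw-admin --cluster {self.name} " + " ".join(args)

    def ceph_admin(self, args):
        return f"ceph --cluster {self.name} " + " ".join(args)
```

The point is that a testcase calling `zone.cluster.ceph_admin(...)` only works if both the abstract declaration and every concrete `Cluster` subclass (local mstart and teuthology) provide the method.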

Besides that, we tested the sync code at a reasonable scale and didn't find any issues or regressions!

@BBoozmen
Contributor Author

BBoozmen commented Jan 6, 2026

besides that, we tested the sync code at a reasonable scale and we didn't find any issues or regressions!

Sounds good, thank you for the update!

The sepia lab is still under migration, so you might not be able to schedule teuthology runs yet. The existing cluster.admin() method comes from the class Cluster. As you pointed out, teuthology runs these tasks a bit differently: qa/tasks/rgw_multi/multisite.py (a symlink to the copy under src/test/rgw/rgw_multi/) is where the abstract methods are declared, and qa/tasks/rgw_multisite.py is where the concrete classes are implemented. I think we will need to add ceph_admin() as a method in the class Cluster.

Hopefully I can test these qa/tasks changes once the lab is functional again. I've not added the changes to this PR yet, but I've added them to its corresponding ceph-ci branch (https://github.com/ceph/ceph-ci/tree/wip-oozmen-73799). When I get a chance, I'll test it via teuthology, and once it passes I'll reflect the changes here as well. I'm planning to run the ceph-ci branch in the Poughkeepsie lab if I can.

@BBoozmen
Contributor Author

rgw:multisite tests are broken at the new POK lab. We need #67011 (qa/multisite: switch to boto3) to be merged first.

@smanjara
Contributor

@BBoozmen, the boto3 migration is still underway, but I don't want to block the PR, with just the multisite commit, from getting merged. We could follow up with the test once the test migration is done. cc @cbodley

@BBoozmen
Contributor Author

@BBoozmen, the boto3 migration is still underway. but I don't want to block the PR with just the multisite commit from getting merged. we could follow up with the test once the test migration is done. cc @cbodley

It's OK. No rush on this one.

This PR is meant to change qa/tasks as well (i.e., qa/tasks/rgw_multisite.py); that's why I'd like to wait for the rgw:multisite suite to pass.

I've just pushed the changes here as well (see compare); they make the change to qa/tasks/rgw_multisite.py, which I was hoping to test in teuthology first via the ceph-ci branch ceph-ci::wip-oozmen-73799.

Bottom line: we can wait for the boto3 changes to go in first, and I can test this one later.

@github-actions

github-actions bot commented Feb 3, 2026

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

…ts entries

Add a new method `reset_entries()` to the `bucket_list_result` struct
that clears the list of entries and resets the truncated flag.

This will be used in reuse cases to avoid accessing stale entries or a
stale truncated flag.

Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
…to avoid stale listings

RGWBucketFullSyncCR could spin indefinitely when the source bucket was
already deleted. The coroutine reused a bucket_list_result member, and
RGWListRemoteBucketCR populated it without clearing prior state. Stale
entries/is_truncated from a previous iteration caused the loop to
continue even after the bucket no longer existed.

Fix by clearing the provided bucket_list_result at the start of
RGWListRemoteBucketCR (constructor), ensuring each listing starts from a
clean state and reflects the current remote bucket contents.

This prevents the infinite loop and returns correct results when the
bucket has been deleted.

Fixes: https://tracker.ceph.com/issues/73799
Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
…e bucket is deleted in the middle

Tests: https://tracker.ceph.com/issues/73799
Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
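The failure mode described in the fix commit above can be modeled in a few lines of Python. This is a sketch with hypothetical names mirroring the C++ `bucket_list_result` / `RGWListRemoteBucketCR` / `RGWBucketFullSyncCR`, not the real implementation:

```python
class BucketListResult:
    """Models RGW's bucket_list_result: a page of entries plus a truncation flag."""
    def __init__(self):
        self.entries = []
        self.is_truncated = False

    def reset_entries(self):
        # the new method added by the fix: clear entries and the truncated flag
        self.entries = []
        self.is_truncated = False

def list_remote_bucket(result, remote_entries, reset_first):
    """Models RGWListRemoteBucketCR populating a caller-provided result.

    If the remote bucket is gone, the remote returns nothing -- but without
    a reset, the result keeps the stale entries/is_truncated from last time.
    """
    if reset_first:
        result.reset_entries()
    if remote_entries:  # bucket still exists on the source
        result.entries = remote_entries[:1000]
        result.is_truncated = len(remote_entries) > 1000

def full_sync(remote_entries, reset_first, max_iters=10):
    """Models the RGWBucketFullSyncCR listing loop; returns iterations used."""
    result = BucketListResult()  # reused across iterations, as in the coroutine
    for i in range(1, max_iters + 1):
        list_remote_bucket(result, remote_entries, reset_first)
        remote_entries = []  # the source bucket is deleted after the first page
        if not result.is_truncated:
            return i  # loop terminates normally
    return max_iters  # never saw is_truncated=False: the "infinite" loop
```

With `reset_first=True` the second listing observes the empty (deleted) bucket and the loop exits; with `reset_first=False` the stale `is_truncated=True` from the first page keeps the loop spinning forever, which is the behavior seen in the radosgw logs above.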
@BBoozmen
Contributor Author

BBoozmen commented Feb 20, 2026

@cbodley, @smanjara - can you please review the PR? I was finally able to complete the teuthology testing.

https://pulpito.ceph.com/bcs-ceph-2026-02-19_14:51:57-rgw:multisite-wip-oozmen-73799-distro-default-trial/ is the rgw:multisite test run. Although it shows as failed, the tests relevant to this PR passed. This PR changes a helper function used by the testcase test_period_update_commit, and it introduces the new testcase test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime.

$ egrep "test_period_update_commit|test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime" teuthology.log
2026-02-19T23:10:43.212 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_period_update_commit ... ok
2026-02-19T23:13:20.426 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime ... ok

Some details on test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime to show it runs as expected:

2026-02-19T23:10:43.404 INFO:rgw_multi.tests:disable sync for bucket=eoibqj-69
2026-02-19T23:11:03.164 INFO:rgw_multi.tests:successfully uploaded 3000 objects to bucket=eoibqj-69
2026-02-19T23:11:03.267 INFO:rgw_multi.tests:set rgw_inject_delay_sec and rgw_inject_delay_pattern to slow down bucket full sync
2026-02-19T23:11:03.623 INFO:rgw_multi.tests:enable bucket sync to initiate full sync
2026-02-19T23:11:03.741 INFO:rgw_multi.tests:verify that bucket sync is stalled
2026-02-19T23:11:14.050 INFO:rgw_multi.tests:verified that bucket sync is stalled, oldest incremental change not applied epoch: 0.0
2026-02-19T23:11:39.502 INFO:rgw_multi.tests:removing rgw_inject_delay_sec and rgw_inject_delay_pattern to allow bucket full sync to run normally to the completion
2026-02-19T23:13:19.872 INFO:rgw_multi.tests:wait for data sync to complete
2026-02-19T23:13:20.426 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime ... ok

As for the failed tests, they are failing in other submissions as well, so I don't think they are related to this PR. For example, looking at anuchaithra's submission: https://qa-proxy.ceph.com/teuthology/anuchaithra-2026-02-18_10:20:38-rgw-wip-anrao5-testing-2026-02-18-1230-distro-default-trial/56569/teuthology.log

$ egrep -A 1 "=======" teuthology-anrao5.log | grep ERROR
2026-02-18T14:14:30.968 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_create
2026-02-18T14:14:30.970 INFO:tasks.rgw_multisite_tests:ERROR: create a bucket from secondary zone under tenant namespace. check if it successfully syncs
2026-02-18T14:14:30.972 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_recreate
2026-02-18T14:14:30.974 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_remove
2026-02-18T14:14:30.976 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_object_sync
2026-02-18T14:14:30.978 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_object_delete
2026-02-18T14:14:30.980 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_multi_object_delete
2026-02-18T14:14:30.982 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_versioned_object_incremental_sync
2026-02-18T14:14:30.984 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_delete_marker_full_sync
2026-02-18T14:14:30.986 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_suspended_delete_marker_full_sync
2026-02-18T14:14:30.988 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_versioning
2026-02-18T14:14:30.990 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_acl
2026-02-18T14:14:30.992 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_cors
2026-02-18T14:14:30.995 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_delete_notempty
2026-02-18T14:14:30.996 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_datalog_autotrim
2026-02-18T14:14:30.998 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_set_bucket_website
2026-02-18T14:14:31.000 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_set_bucket_policy
2026-02-18T14:14:31.002 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_sync_disable
2026-02-18T14:14:31.003 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_sync_enable_right_after_disable
2026-02-18T14:14:31.005 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_sync_disable_enable
2026-02-18T14:14:31.006 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_multipart_object_sync
2026-02-18T14:14:31.008 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_assume_role_after_sync
2026-02-18T14:14:31.009 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_topic_notification_sync


$ egrep -A 1 "=======" teuthology.log | grep ERROR
2026-02-19T23:13:20.427 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_create
2026-02-19T23:13:20.429 INFO:tasks.rgw_multisite_tests:ERROR: create a bucket from secondary zone under tenant namespace. check if it successfully syncs
2026-02-19T23:13:20.430 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_recreate
2026-02-19T23:13:20.432 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_remove
2026-02-19T23:13:20.434 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_object_sync
2026-02-19T23:13:20.436 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_object_delete
2026-02-19T23:13:20.438 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_multi_object_delete
2026-02-19T23:13:20.441 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_versioned_object_incremental_sync
2026-02-19T23:13:20.443 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_delete_marker_full_sync
2026-02-19T23:13:20.445 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_suspended_delete_marker_full_sync
2026-02-19T23:13:20.447 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_versioning
2026-02-19T23:13:20.449 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_acl
2026-02-19T23:13:20.451 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_cors
2026-02-19T23:13:20.453 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_delete_notempty
2026-02-19T23:13:20.456 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_datalog_autotrim
2026-02-19T23:13:20.457 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_set_bucket_website
2026-02-19T23:13:20.459 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_set_bucket_policy
2026-02-19T23:13:20.461 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_sync_disable
2026-02-19T23:13:20.463 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_sync_enable_right_after_disable
2026-02-19T23:13:20.464 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_sync_disable_enable
2026-02-19T23:13:20.466 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_multipart_object_sync
2026-02-19T23:13:20.467 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_assume_role_after_sync
2026-02-19T23:13:20.468 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_topic_notification_sync
2026-02-19T23:13:20.469 INFO:tasks.rgw_multisite_tests:ERROR: rgw_multi.tests.test_bucket_create_location_constraint

Both logs show the same failure reasons for the above testcases:

  • Failures are mostly AccessDenied during CreateBucket
  • the same AccessDenied during PutBucketNotificationConfiguration for topic notifications
  • a one-off SignatureDoesNotMatch during CreateBucket

Again, those failures seem unrelated to this PR.

@smanjara
Contributor

  • Failures are mostly AccessDenied during CreateBucket
  • the same AccessDenied during PutBucketNotificationConfiguration for topic notifications
  • one-off SignatureDoesNotMatch during CreateBucket

Again, those failures seem unrelated to this PR.

the PR associated with https://tracker.ceph.com/issues/74579 should fix the CreateBucket forwarded requests.

@BBoozmen
Contributor Author

Thank you @smanjara! Added the needs-qa label to get this through the general QA testing.

@ivancich
Member

The QA run for this PR included two others, but they're showing a lot of crash-related errors.

There are six of these:

"2026-02-25T05:26:33.173487+0000 mon.a (mon.0) 181 : cluster [WRN] Health check failed: 2 daemons have recently crashed (RECENT_CRASH)" in cluster log

And there are nine of these, mostly dealing with the upgrade tests.

Found coredumps on ubuntu@trial096.front.sepia.ceph.com

None of the PRs jumps out at me as a likely culprit, so I'm asking you. I'm also having a re-run done.

Here's the full run: https://pulpito.ceph.com/anuchaithra-2026-02-25_05:07:15-rgw-wip-anrao1-testing-2026-02-23-1551-distro-default-trial/

Thanks!

@smanjara
Contributor

smanjara commented Feb 25, 2026

The QA run for this PR included two others. But they're showing a lot of crash related errors.

There are six of these:

"2026-02-25T05:26:33.173487+0000 mon.a (mon.0) 181 : cluster [WRN] Health check failed: 2 daemons have recently crashed (RECENT_CRASH)" in cluster log

And there are nine of these, mostly dealing with the upgrade tests.

Found coredumps on ubuntu@trial096.front.sepia.ceph.com

None of the PRs jumps out at me as a likely culprit, so I'm asking you. I'm also having a re-run done.

Here's the full run: https://pulpito.ceph.com/anuchaithra-2026-02-25_05:07:15-rgw-wip-anrao1-testing-2026-02-23-1551-distro-default-trial/

Thanks!

@ivancich the two multisite jobs haven't crashed, though, so it must be something else. Could you tell us which the other PRs are? Also, multisite tests will fail without #67083; we need that to be merged, or this PR tested along with wip-anrao3-testing.

cc @anrao19

@smanjara smanjara removed the needs-qa label Feb 25, 2026
@cbodley
Contributor

cbodley commented Feb 26, 2026

@ivancich the two multisite jobs haven't crashed though, so it must be something else. Could you tell which are the other PRs? Also, multisite tests will fail without #67083; we need that to be merged, or have this PR tested along with wip-anrao3-testing

#67083 merged, let's retry with that

@BBoozmen
Contributor Author

@ivancich the two multisite jobs haven't crashed though, so it must be something else. Could you tell which are the other PRs? Also, multisite tests will fail without #67083; we need that to be merged, or have this PR tested along with wip-anrao3-testing

#67083 merged, let's retry with that

I was going to start looking at the failure logs @ivancich pointed out earlier, but I think there's going to be another round of testing including #67083, so I'll be on standby for now.

@BBoozmen BBoozmen added the wip-oozmen-testing label (for teuthology integration testing, to be used via build-integration-branch.sh) Mar 4, 2026
@BBoozmen
Contributor Author

@cbodley / @smanjara

Sent this PR through Teuthology testing for the rgw test suite:

Overall Summary

  • Total tests: 71
  • Passed: 43 (60.6%)
  • Failed: 27
  • Dead: 1

Looking at failures:

| Root Cause | Count | Suites Affected |
| --- | --- | --- |
| Valgrind Error (sendmsg/SyscallParam) | 17 | rgw/verify (12), rgw/notifications (5) |
| Keystone/OpenStack setup failure | 4 | rgw/crypt (1), rgw/tempest (3) |
| S3 User Quota workunit failure | 3 | rgw/multifs (3) |
| RGW Multisite test failures | 2 | rgw/multisite (2) |
| adjust-ulimits error code 124 | 1 | rgw/d4n (1) |
| Job timeout | 1 | rgw/upgrade (1) |
  • Valgrind errors are all related to SyscallParam sendmsg. I believe this is a known infrastructure/valgrind issue?
  • Keystone/OpenStack issues - a test infrastructure problem, I think, with openstack project create failing.
  • s3_user_quota-run.sh failures all seem to be getting error 22 (EINVAL).
failure_reason: 'Command failed (workunit test rgw/s3_user_quota-run.sh) on trial049
  with status 22: ''mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd --
  /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=c221a0e611968dcf997352e2792ea5cbb550c44e
  TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin
  CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0
  CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage
  timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rgw/s3_user_quota-run.sh'''
  • adjust-ulimits error code 124 (timeout)
2026-03-07T05:20:08.109 DEBUG:teuthology.orchestra.run.trial052:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph pg dump --format=json
2026-03-07T05:22:08.142 DEBUG:teuthology.orchestra.run:got remote process result: 124
  • The relevant failures are 2 multisite testcases:
2026-03-07T05:43:53.758 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_bucket_remove ... FAIL
2026-03-07T06:31:32.060 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_period_update_commit ... FAIL

test_bucket_remove

The testcase is not touched by this PR and it seems to be failing in other submissions too; e.g., https://qa-proxy.ceph.com/teuthology/anuchaithra-2026-03-09_10:21:55-rgw-wip-anrao2-testing-2026-03-05-1549-distro-default-trial/94912/teuthology.log. It exercises metadata sync, which is not relevant to this PR, so I think we can address it separately. If there's no tracker item, I can open one.

test_period_update_commit

It fails at the very last step, the zonegroup_data_checkpoint check. I think it's a scale issue, as this testcase uploads a couple of thousand objects concurrently.

...
# this is last (5 of 5) verification after a period update --commit
2026-03-07T06:17:10.733 INFO:rgw_multi.tests:verify data sync is making progress
...
# The workload stopped (all 25 threads)
2026-03-07T06:21:14.514 INFO:rgw_multi.tests:uploaded 4000 times for the range (900, 999) to bucket=jwzcgy-66
2026-03-07T06:21:14.550 INFO:rgw_multi.tests:uploaded 4000 times for the range (700, 799) to bucket=jwzcgy-66
...
2026-03-07T06:21:14.711 INFO:rgw_multi.tests:uploaded 4000 times for the range (1600, 1699) to bucket=jwzcgy-66
...
# but after 5 mins (default timeout) - zonegroup_data_checkpoint timed out
2026-03-07T06:26:13.259 INFO:teuthology.orchestra.run.trial052.stderr:2026-03-07T06:26:13.256+0000 7f6f8bfff640  0 WARNING: curl operation timed out, network average transfer speed less than 1024 Bytes per second during 300 seconds.

Again, this failure is not relevant to the PR, so I can open a separate tracker; I think it should be enough to make the S3 client workload less aggressive.

That said, this testcase used to pass, as shown in https://qa-proxy.ceph.com/teuthology/bcs-ceph-2026-02-19_14:51:57-rgw:multisite-wip-oozmen-73799-distro-default-trial/58954/teuthology.log, and I just built this same ceph-ci branch, tested it locally, and it passed again.

test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime

The testcase introduced in this PR passed.

...
2026-03-07T06:33:58.810 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_bucket_full_sync_when_the_bucket_is_deleted_in_the_meantime ... ok

@cbodley / @smanjara - Shall we get this merged?

@smanjara smanjara merged commit 16c4842 into ceph:main Mar 10, 2026
13 checks passed
@smanjara
Contributor

thanks @BBoozmen

@smanjara
Contributor

Overall Summary

  • Total tests: 71
  • Passed: 43 (60.6%)
  • Failed: 27
  • Dead: 1

please do open trackers for the s3 and multisite failures @BBoozmen. thanks!
