Bug #62710
Multisite replication is super slow when some of the rgws configured in zonegroup are down
Status: Closed
Description
Multisite replication is super slow when some of the rgws configured in the zonegroup are down.
This can be reproduced on the main branch (as of Aug. 29th, 2023):
- 2 clusters
- each cluster has 3 rgw nodes, and 16 rgw instances per node (with 8 client-facing on the primary site)
- all rgw instances are added in the zonegroup settings
- shut down all the rgw instances on one primary rgw node
- cosbench write only, 15 users, 30 workers, 600 seconds
- generated 1800 buckets, >2 million objects
Replication lag:
Replication was still not complete 50 minutes after the client traffic finished.
Updated by Jane Zhu over 2 years ago
I set the severity of this issue to Major because it results in significant replication lag. But there is a workaround, which is to remove the downed rgws from the zonegroup settings. So please feel free to change it to Minor if you think that's more appropriate.
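For reference, the workaround can be applied with radosgw-admin roughly as follows. This is a sketch against a live cluster: the zonegroup name `default` is a placeholder, and the exact JSON layout of the endpoints should be taken from your own `zonegroup get` output.

```shell
# Dump the current zonegroup configuration (zonegroup name is a placeholder)
radosgw-admin zonegroup get --rgw-zonegroup=default > zg.json

# Edit zg.json and delete the downed rgw endpoints from the
# "endpoints" arrays, then load the edited config back
radosgw-admin zonegroup set --rgw-zonegroup=default < zg.json

# Commit the updated period so all gateways pick up the change
radosgw-admin period update --commit
```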
Updated by Jane Zhu over 2 years ago
I accidentally put this in the wrong project. Can somebody please move it to the "rgw" project? Thanks!
Updated by Neha Ojha over 2 years ago
- Project changed from teuthology to rgw
- Status changed from New to Fix Under Review
- Pull request ID set to 53320
Updated by Casey Bodley about 2 years ago
- Status changed from Fix Under Review to Resolved
Updated by Soumya Koduri almost 2 years ago
@Jane,
The change below - https://github.com/ceph/ceph/pull/53320/commits/e200499bb3c5703862b92a4d7fb534d98601f1bf - seems to have caused a regression in the LC/cloud-transition code - https://tracker.ceph.com/issues/65251.
  if (diff >= CONN_STATUS_EXPIRE_SECS) {
    endpoints_status[endpoint].store(ceph::real_clock::zero());
    ldout(cct, 10) << "endpoint " << endpoint << " unconnectable status expired. mark it connectable" << dendl;
    break;
  }
Even though there is a valid endpoint, since its status was updated less than 2 seconds ago, a null RGWRESTStreamS3PutObj pointer was returned, resulting in a crash in the tier code. The crash can be avoided with an extra check, but it would still return an error, failing the transition request at times.
Could you please explain why the above check is needed, and whether it needs to be modified to handle the LC cloud transition, and perhaps cloud sync too (which uses similar routines)? Thanks!
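The check being discussed implements a short-lived "unconnectable" cache for endpoints. A minimal self-contained sketch of that logic (the names `ConnStatusMap`, `pick_endpoint`-style usage, and the `steady_clock` bookkeeping are illustrative, not the actual rgw code) shows why an endpoint whose status was updated less than CONN_STATUS_EXPIRE_SECS ago is still treated as unusable, which is the situation that leads to the null connection in the LC path:

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

using Clock = std::chrono::steady_clock;

// Illustrative value matching the constant discussed above.
constexpr int CONN_STATUS_EXPIRE_SECS = 2;

// Hypothetical sketch: a zero time_point means "believed connectable";
// a non-zero time_point records when the endpoint last failed.
struct ConnStatusMap {
  std::map<std::string, Clock::time_point> status;

  void mark_unconnectable(const std::string& ep, Clock::time_point now) {
    status[ep] = now;
  }

  // Returns true if ep may be used: either it never failed, or its
  // unconnectable status has expired and it should be retried.
  bool connectable(const std::string& ep, Clock::time_point now) {
    auto it = status.find(ep);
    if (it == status.end() || it->second == Clock::time_point{})
      return true;
    auto diff = std::chrono::duration_cast<std::chrono::seconds>(
                    now - it->second).count();
    if (diff >= CONN_STATUS_EXPIRE_SECS) {
      it->second = Clock::time_point{};  // expired: mark connectable again
      return true;
    }
    return false;  // failed recently: skipped, even if actually reachable
  }
};
```

If every configured endpoint is in the "failed recently" window, a caller that only returns connections for connectable endpoints comes back empty-handed, which would explain the null pointer seen in the tier code.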
Updated by Upkeep Bot 8 months ago
- Merge Commit set to d3256c484136a1b32b79a904861f681a9248ba3c
- Fixed In set to v19.0.0-842-gd3256c48413
- Released In set to v19.2.0~869
- Upkeep Timestamp set to 2025-07-11T22:09:13+00:00