Bug #62710
Multisite replication is super slow when some of the rgws configured in zonegroup are down
Status: Closed
Description
Multisite replication is super slow when some of the rgws configured in the zonegroup are down.
This can be reproduced on the main branch (as of Aug. 29th, 2023):
- 2 clusters
- each cluster has 3 rgw nodes, and 16 rgw instances per node (with 8 client-facing on the primary site)
- all rgw instances are added in the zonegroup settings
- shut down all the rgw instances on one primary rgw node
- cosbench write only, 15 users, 30 workers, 600 seconds
- generated 1800 buckets, >2 million objects
Replication lag:
Replication was still not complete 50 minutes after the client traffic finished.
Updated by Jane Zhu over 2 years ago
I set the severity of this issue to Major because it results in significant replication lag. But there is a workaround, which is to remove the downed rgws from the zonegroup settings. So please feel free to change it to Minor if you think that's more appropriate.
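For reference, the workaround can be applied with radosgw-admin roughly as follows. This is a sketch against a live cluster: the zonegroup name `default` is a placeholder, and the exact JSON layout of the endpoints should be taken from your own `zonegroup get` output.

```shell
# Dump the current zonegroup configuration (zonegroup name is a placeholder)
radosgw-admin zonegroup get --rgw-zonegroup=default > zg.json

# Edit zg.json and delete the downed rgw endpoints from the
# "endpoints" arrays, then load the edited config back
radosgw-admin zonegroup set --rgw-zonegroup=default < zg.json

# Commit the updated period so all gateways pick up the change
radosgw-admin period update --commit
```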
Updated by Jane Zhu over 2 years ago
I accidentally put this in the wrong project. Can somebody please move it to the "rgw" project? Thanks!
Updated by Neha Ojha over 2 years ago
- Project changed from teuthology to rgw
- Status changed from New to Fix Under Review
- Pull request ID set to 53320
Updated by Casey Bodley about 2 years ago
- Status changed from Fix Under Review to Resolved
Updated by Soumya Koduri almost 2 years ago
@Jane,
The change below - https://github.com/ceph/ceph/pull/53320/commits/e200499bb3c5703862b92a4d7fb534d98601f1bf - seems to have caused a regression in the LC/cloud-transition code - https://tracker.ceph.com/issues/65251.
  if (diff >= CONN_STATUS_EXPIRE_SECS) {
    endpoints_status[endpoint].store(ceph::real_clock::zero());
    ldout(cct, 10) << "endpoint " << endpoint << " unconnectable status expired. mark it connectable" << dendl;
    break;
  }
Even though there is a valid endpoint, since its status was updated less than 2 seconds ago, a null RGWRESTStreamS3PutObj pointer was returned, resulting in a crash in the tier code. The crash can be avoided with an extra check, but it would still return an error, failing the transition request at times.
Could you please explain why the above check is needed, and whether it needs to be modified to handle the LC cloud transition, and perhaps cloud sync too (which uses similar routines)? Thanks!
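The check being discussed implements a short-lived "unconnectable" cache for endpoints. A minimal self-contained sketch of that logic (the names `ConnStatusMap`, `pick_endpoint`-style usage, and the `steady_clock` bookkeeping are illustrative, not the actual rgw code) shows why an endpoint whose status was updated less than CONN_STATUS_EXPIRE_SECS ago is still treated as unusable, which is the situation that leads to the null connection in the LC path:

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

using Clock = std::chrono::steady_clock;

// Illustrative value matching the constant discussed above.
constexpr int CONN_STATUS_EXPIRE_SECS = 2;

// Hypothetical sketch: a zero time_point means "believed connectable";
// a non-zero time_point records when the endpoint last failed.
struct ConnStatusMap {
  std::map<std::string, Clock::time_point> status;

  void mark_unconnectable(const std::string& ep, Clock::time_point now) {
    status[ep] = now;
  }

  // Returns true if ep may be used: either it never failed, or its
  // unconnectable status has expired and it should be retried.
  bool connectable(const std::string& ep, Clock::time_point now) {
    auto it = status.find(ep);
    if (it == status.end() || it->second == Clock::time_point{})
      return true;
    auto diff = std::chrono::duration_cast<std::chrono::seconds>(
                    now - it->second).count();
    if (diff >= CONN_STATUS_EXPIRE_SECS) {
      it->second = Clock::time_point{};  // expired: mark connectable again
      return true;
    }
    return false;  // failed recently: skipped, even if actually reachable
  }
};
```

If every configured endpoint is in the "failed recently" window, a caller that only returns connections for connectable endpoints comes back empty-handed, which would explain the null pointer seen in the tier code.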
Updated by Upkeep Bot 8 months ago
- Merge Commit set to d3256c484136a1b32b79a904861f681a9248ba3c
- Fixed In set to v19.0.0-842-gd3256c48413
- Released In set to v19.2.0~869
- Upkeep Timestamp set to 2025-07-11T22:09:13+00:00