rgw: Reduce data sync parallelism in response to RADOS lock latency#48451
adamemerson wants to merge 2 commits into ceph:main from
Conversation
force-pushed from c0b62dd to 5e2ef27
force-pushed from 5e2ef27 to b93e394
Lock latency in RGWContinuousLeaseCR gets high enough under load that the locks end up timing out, leading to incorrect behavior. Monitor lock latency and cut concurrent operations in half if it goes above ten seconds. Cut concurrency to one if it goes above twenty seconds. Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
force-pushed from b93e394 to 2b2464f
Limited to only warn every five minutes. Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
/// cut it to 1.
int64_t adj_concurrency(int64_t concurrency) {
  using namespace std::literals;
  auto threshold = (cct->_conf->rgw_sync_lease_period * 1s) / 12;
@cbodley I was thinking about this, and I think making this scale with the lease time might be the wrong idea.
If we have a twenty-minute lease time then ten or twenty seconds for the lock action to complete is still pretty egregious and suggests that we're overloading the OSD, and with it scaling this way we'd end up only throttling back at hundred-second average latency.
(cc @mattbenjamin)
considering a deployment with one radosgw per zone where we could omit the leases entirely, this latency shouldn't matter at all - it could be 10 minutes or an hour, and sync would chug along (albeit slowly) without any errors
it wasn't until the RGWContinuousLeaseCR::is_locked() change from #47728 that this latency turned into actual sync errors. because this lease timer is the only thing that's sensitive to latency, i suggested that the throttling logic should aim to keep that latency in a range where we can actually make progress
however, lacking a better model for flow control or throttling here, maybe we should just revert that is_locked() part of #47728 - as-is, that seems to be making sync less reliable because it's shown that we have no control over this latency in our testing
i'm open to exploring band-aids here just to ship something, but this is complicated and it's hard for me to tell whether we're making things better or worse. even with your super-conservative throttling, we were still seeing extreme latencies here, right? that suggests the load is coming from somewhere else like ingest or /admin/log requests
@cbodley I guess let's review today and go with your and @adamemerson 's intuition on this for 5.3, at least
included in #48898
Lock latency in RGWContinuousLeaseCR gets high enough under load that the locks end up timing out, leading to incorrect behavior.
Monitor lock latency and cut concurrent operations in half if it goes above ten seconds.
Cut concurrency to one if it goes above twenty seconds.