
rgw: Reduce data sync parallelism in response to RADOS lock latency#48451

Closed
adamemerson wants to merge 2 commits into ceph:main from adamemerson:wip-rgw-sync-latency-spawnwindow

Conversation


@adamemerson adamemerson commented Oct 11, 2022

Lock latency in RGWContinuousLeaseCR gets high enough under load that the locks end up timing out, leading to incorrect behavior.

Monitor lock latency and cut concurrent operations in half if it goes above ten seconds.

Cut concurrency to one if it goes above twenty seconds.


@adamemerson adamemerson requested a review from cbodley October 11, 2022 18:54
@github-actions github-actions bot added the rgw label Oct 11, 2022
@cbodley cbodley requested review from ofriedma and yuvalif October 11, 2022 19:09
@adamemerson adamemerson force-pushed the wip-rgw-sync-latency-spawnwindow branch from c0b62dd to 5e2ef27 Compare October 11, 2022 20:35
@adamemerson adamemerson requested a review from cbodley October 11, 2022 20:36
@adamemerson adamemerson force-pushed the wip-rgw-sync-latency-spawnwindow branch from 5e2ef27 to b93e394 Compare October 11, 2022 20:38
Lock latency in RGWContinuousLeaseCR gets high enough under load that
the locks end up timing out, leading to incorrect behavior.

Monitor lock latency and cut concurrent operations in half if it goes
above ten seconds.

Cut concurrency to one if it goes above twenty seconds.

Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
@adamemerson adamemerson force-pushed the wip-rgw-sync-latency-spawnwindow branch from b93e394 to 2b2464f Compare October 11, 2022 20:41
Limited to only warn every five minutes.

Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
  /// cut it to 1.
  int64_t adj_concurrency(int64_t concurrency) {
    using namespace std::literals;
    auto threshold = (cct->_conf->rgw_sync_lease_period * 1s) / 12;
adamemerson (author) commented on this snippet:
@cbodley I was thinking about this, and I think making this scale with the lease time might be the wrong idea.

If we have a twenty-minute lease time, then ten or twenty seconds for the lock action to complete is still pretty egregious and suggests that we're overloading the OSDs; with the threshold scaling this way, we'd only end up throttling back at a hundred-second average latency.

cbodley replied:
(cc @mattbenjamin)

considering a deployment with one radosgw per zone where we could omit the leases entirely, this latency shouldn't matter at all - it could be 10 minutes or an hour, and sync would chug along (albeit slowly) without any errors

it wasn't until the RGWContinuousLeaseCR::is_locked() change from #47728 that this latency turned into actual sync errors. because this lease timer is the only thing that's sensitive to latency, i suggested that the throttling logic should aim to keep that latency in a range where we can actually make progress

however, lacking a better model for flow control or throttling here, maybe we should just revert that is_locked() part of #47728 - as-is, that seems to be making sync less reliable because it's shown that we have no control over this latency in our testing

i'm open to exploring band-aids here just to ship something, but this is complicated and it's hard for me to tell whether we're making things better or worse. even with your super-conservative throttling, we were still seeing extreme latencies here, right? that suggests the load is coming from somewhere else like ingest or /admin/log requests

Another reviewer replied:
@cbodley I guess let's review today and go with your and @adamemerson's intuition on this for 5.3, at least


cbodley commented Nov 23, 2022

included in #48898
