osd: mClock recovery/backfill cost fixes#49975
Conversation
Force-pushed: b0a1169 to 1c3cecb
Pretty close! https://github.com/athanatos/ceph/tree/sjust/wip-recovery-cost-49959 has some commits that I think clean up and simplify things a bit. If you agree with the changes, squash my commits as indicated and update the commit messages (add Signed-off-by: and Fixes: lines as well).
src/osd/ECBackend.cc (outdated)
See my branch for a comment with an explanation, but I think we want a cost of 1 here.
Thanks for the detailed explanation of the case involved.
src/osd/OSD.h (outdated)

```diff
   // -- pg recovery and associated throttling --
   ceph::mutex recovery_lock = ceph::make_mutex("OSDService::recovery_lock");
-  std::list<std::pair<epoch_t, PGRef> > awaiting_throttle;
+  std::list<std::tuple<epoch_t, PGRef, int> > awaiting_throttle;
```
See my branch, let's use a named struct here.
Sure, accessing the tuple members did seem awkward. With the named struct, it's clear which member is being accessed.
src/osd/PG.cc (outdated)

```cpp
osd->queue_for_recovery(this);
// Determine recovery cost in terms of average object size in this PG
int64_t num_bytes = info.stats.stats.sum.num_bytes;
int64_t num_objects = info.stats.stats.sum.num_objects;
```
See my branch, we need to handle the case where num_objects is 0.
src/osd/PGBackend.h (outdated)

```diff
   virtual void schedule_recovery_work(
-    GenContext<ThreadPool::TPHandle&> *c) = 0;
+    GenContext<ThreadPool::TPHandle&> *c,
+    int cost) = 0;
```
See branch, need to use uint64_t everywhere for cost.
Sure. I used int since OpSchedulerItem takes cost as an int.
src/osd/ReplicatedBackend.cc (outdated)

```cpp
if (j != pulling.end()) {
  auto rec_size = j->second.recovery_info.size;
  auto rec_progress = j->second.recovery_progress.data_recovered_to;
  cost += (rec_size - rec_progress);
```
This isn't quite right, data_recovered_to will generally be equal to recovery_info.size since the pull is complete. PG_RecoveryQueueAsync here is going to enqueue a C_ReplicatedBackend_OnPullComplete which pushes the object data out to waiting acting set OSDs. We want the full size of the relevant objects. Instead, let's use i.stat.num_bytes_recovered for each element of to_continue. See my branch.
Understood. Thanks for clarifying this!
Will do @athanatos. I will take a look and squash your commits into mine. Thanks for taking a look and for these.
Force-pushed: 1c3cecb to 0569e33
src/osd/OSD.h (outdated)

```diff
   } else {
-    awaiting_throttle.push_back(std::make_pair(pg->get_osdmap()->get_epoch(), pg));
+    awaiting_throttle.emplace_back(
+      pg->get_osdmap()->get_epoch(), pg, cost_per_object);
```
@athanatos The initialization of pg_awaiting_throttle_t is missing. Will add it in the next push.
Fixed in the latest push.
Force-pushed: 0363ee4 to 3f594db
jenkins retest this please

jenkins test make check
Force-pushed: 3f594db to 5609184

Force-pushed: 5609184 to 1a5d40c
jenkins retest this please

jenkins render docs

jenkins test api

jenkins test make check
Rados Suite Run:
Unrelated Failures:
@athanatos PTAL at commit: 55575ae (see my email for additional info).

@sseshasa Those changes seem reasonable to me.

Rados Suite Run: Unrelated Failures:

@athanatos Can you please approve? The rados suite results look good.
athanatos left a comment:
@neha-ojha You should probably approve this as well given that it's going to be backported immediately.
Force-pushed: 55575ae to e996629
Signed-off-by: Samuel Just <sjust@redhat.com>
is_qos_item() was only used in operator<< for OpSchedulerItem. However, it's actually useful to see priority for mclock items since it affects whether it goes into the immediate queues and, for some types, the class. Unconditionally display both class_id and priority. Signed-off-by: Samuel Just <sjust@redhat.com>
…riority Consolidate methods governing recovery scheduling in PeeringState. Signed-off-by: Samuel Just <sjust@redhat.com>
…ry_msg Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
…ovement Recovery operations on pgs/objects that have fewer than the configured number of copies should be treated more urgently than operations on pgs/objects that simply need to be moved to a new location. Signed-off-by: Samuel Just <sjust@redhat.com>
…ages
Otherwise, these end up as PGOpItem and therefore as immediate:
```cpp
class PGOpItem : public PGOpQueueable {
  ...
  op_scheduler_class get_scheduler_class() const final {
    auto type = op->get_req()->get_type();
    if (type == CEPH_MSG_OSD_OP ||
        type == CEPH_MSG_OSD_BACKOFF) {
      return op_scheduler_class::client;
    } else {
      return op_scheduler_class::immediate;
    }
  }
  ...
};
```
This was probably causing a bunch of extra interference with client
ops.
Signed-off-by: Samuel Just <sjust@redhat.com>
PGs with degraded objects should be higher priority. Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Now that recovery operations are split between background_recovery and background_best_effort, rebalance qos params to avoid penalizing background_recovery while idle. Signed-off-by: Samuel Just <sjust@redhat.com>
Let's use the middle profile as the default. Modify the standalone tests accordingly. Signed-off-by: Samuel Just <sjust@redhat.com> Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
The earlier limit of 3 was still aggressive enough to have an impact on the client and other competing operations. Retain the current default for mClock. This can be modified if necessary after setting the osd_mclock_override_recovery_settings option. Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
…r SSDs The osd_mclock_max_sequential_bandwidth_ssd is changed to 1200 MiB/s as a reasonable middle ground considering the broad range of SSD capabilities. This allows mClock's cost model to extract the SSD's capability depending on the cost of the IO being performed. Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
…anges Modify the relevant documentation to reflect: - change in the default mClock profile to 'balanced' - new allocations for ops across mClock profiles - change in the osd_max_backfills limit - miscellaneous changes related to warnings. Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
The qa tests are not client I/O centric and mostly focus on triggering recovery/backfills and monitoring them for completion within a finite amount of time. The same holds true for scrub operations. Therefore, an mClock profile that optimizes background operations is a better fit for qa related tests. The osd_mclock_profile is therefore globally overridden to the 'high_recovery_ops' profile for the Rados suite, as it fits the requirement. Also, many standalone tests expect recovery and scrub operations to complete within a finite time. To ensure this, the osd_mclock_profile option is set to 'high_recovery_ops' as part of the run_osd() function in ceph-helpers.sh. A subset of standalone tests explicitly used the 'high_recovery_ops' profile. Since the profile is now set as part of run_osd(), the earlier overrides are redundant and have therefore been removed from the tests. Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Force-pushed: abca84e to 4c22fcf
jenkins test api

jenkins test make check

Tracked the make check failure in https://tracker.ceph.com/issues/59671. I don't see it happening on every PR, though, so it could be transient, or a problem with a particular node. Testing that theory by retriggering.

jenkins test make check
ceph/ceph#49975 introduced changes to mClock conf value types which caused the OSD to stall while booting. Signed-off-by: Matan Breizman <mbreizma@redhat.com>
ceph#49975 introduced changes to mClock conf value types which caused the OSD to stall while booting. Signed-off-by: Matan Breizman <mbreizma@redhat.com> (cherry picked from commit 5cb12bb)
This PR addresses the issue of slow backfill/recoveries due to non-ideal
cost settings for recovery/backfill operations.
The following summarizes the changes:
Fixes: https://tracker.ceph.com/issues/58529
Fixes: https://tracker.ceph.com/issues/58606
Fixes: https://tracker.ceph.com/issues/58607
Fixes: https://tracker.ceph.com/issues/59080
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>