osd: mClock recovery/backfill cost fixes #49975

Merged
rzarzynski merged 30 commits into ceph:main from sseshasa:wip-fix-mclk-rec-backfill-cost
May 8, 2023

Conversation

@sseshasa
Contributor

@sseshasa sseshasa commented Feb 2, 2023

This PR addresses the issue of slow backfill/recoveries due to non-ideal
cost settings for recovery/backfill operations.

The following summarizes the changes:

  1. Set cost for PGRecovery item based on average size of objects in PG.
  2. Set cost for PGRecoveryContext item (Replicated and EC backend) based on the aggregate cost of impending pushes.
  3. Set PushReplyOp cost to the minimum cost, and PullOp cost based on the remaining bytes to be pushed.
  4. Modify IO cost model to incorporate device characteristics like random write IOPS and sequential bandwidth capacity.
  5. Represent QoS cost in terms of bytes instead of seconds.
  6. Consider degraded object recoveries with greater urgency than backfill recovery.
  7. Categorize backfill operations under the best-effort client class.
  8. Express QoS parameter allocations as proportions and optimize the mClock profiles.
  9. Set 'balanced' profile as the default mClock profile.
  10. Restore backfill/recovery limits to original defaults.
  11. Update mClock configuration document to reflect all the changes.

Fixes: https://tracker.ceph.com/issues/58529
Fixes: https://tracker.ceph.com/issues/58606
Fixes: https://tracker.ceph.com/issues/58607
Fixes: https://tracker.ceph.com/issues/59080
Signed-off-by: Samuel Just sjust@redhat.com
Signed-off-by: Sridhar Seshasayee sseshasa@redhat.com

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@sseshasa sseshasa requested a review from a team as a code owner February 2, 2023 15:40
@github-actions github-actions bot added the core label Feb 2, 2023
@sseshasa sseshasa force-pushed the wip-fix-mclk-rec-backfill-cost branch from b0a1169 to 1c3cecb on February 3, 2023 05:51
Contributor

@athanatos athanatos left a comment

Pretty close! https://github.com/athanatos/ceph/tree/sjust/wip-recovery-cost-49959 has some commits that I think clean/simplify things a bit. If you agree with the changes, squash my commits as indicated and update the commit messages (add Signed-off-by: and Fixes: lines as well).

Contributor

See my branch for a comment with an explanation, but I think we want a cost of 1 here.

Contributor Author

Thanks for the detailed explanation of the case involved.

src/osd/OSD.h (outdated)
    // -- pg recovery and associated throttling --
    ceph::mutex recovery_lock = ceph::make_mutex("OSDService::recovery_lock");
  - std::list<std::pair<epoch_t, PGRef> > awaiting_throttle;
  + std::list<std::tuple<epoch_t, PGRef, int> > awaiting_throttle;
Contributor

See my branch, let's use a named struct here.

Contributor Author

@sseshasa sseshasa Feb 3, 2023

Sure, accessing the tuple members did seem awkward. With the named struct, it's clear which member is being accessed.

src/osd/PG.cc (outdated)
    osd->queue_for_recovery(this);
    // Determine recovery cost in terms of average object size in this PG
    int64_t num_bytes = info.stats.stats.sum.num_bytes;
    int64_t num_objects = info.stats.stats.sum.num_objects;
Contributor

See my branch, we need to handle the case where num_objects is 0.

    virtual void schedule_recovery_work(
  -   GenContext<ThreadPool::TPHandle&> *c) = 0;
  +   GenContext<ThreadPool::TPHandle&> *c,
  +   int cost) = 0;
Contributor

See branch, need to use uint64_t everywhere for cost.

Contributor Author

Sure, I used int since OpSchedulerItem takes in cost as int.

if (j != pulling.end()) {
auto rec_size = j->second.recovery_info.size;
auto rec_progress = j->second.recovery_progress.data_recovered_to;
cost += (rec_size - rec_progress);
Contributor

@athanatos athanatos Feb 3, 2023

This isn't quite right, data_recovered_to will generally be equal to recovery_info.size since the pull is complete. PG_RecoveryQueueAsync here is going to enqueue a C_ReplicatedBackend_OnPullComplete which pushes the object data out to waiting acting set OSDs. We want the full size of the relevant objects. Instead, let's use i.stat.num_bytes_recovered for each element of to_continue. See my branch.

Contributor Author

Understood. Thanks for clarifying this!

@sseshasa
Contributor Author

sseshasa commented Feb 3, 2023

Pretty close! https://github.com/athanatos/ceph/tree/sjust/wip-recovery-cost-49959 has some commits that I think clean/simplify things a bit. If you agree with the changes, squash my commits as indicated and update the commit messages (add Signed-off-by: and Fixes: lines as well).

Will do @athanatos. I will take a look and squash your commits into mine. Thanks for taking a look and for these commits.

@sseshasa sseshasa force-pushed the wip-fix-mclk-rec-backfill-cost branch from 1c3cecb to 0569e33 on February 3, 2023 12:36
src/osd/OSD.h (outdated)
    } else {
  -   awaiting_throttle.push_back(std::make_pair(pg->get_osdmap()->get_epoch(), pg));
  +   awaiting_throttle.emplace_back(
  +     pg->get_osdmap()->get_epoch(), pg, cost_per_object);
Contributor Author

@athanatos The initialization of pg_awaiting_throttle_t is missing. Will add it in the next push.

Contributor Author

Fixed in the latest push.

@sseshasa sseshasa force-pushed the wip-fix-mclk-rec-backfill-cost branch 2 times, most recently from 0363ee4 to 3f594db on February 3, 2023 13:48
@sseshasa sseshasa changed the title from "osd: (DRAFT) mClock recovery/backfill cost fixes" to "osd: mClock recovery/backfill cost fixes" Feb 3, 2023
@sseshasa
Contributor Author

sseshasa commented Feb 3, 2023

jenkins retest this please

@sseshasa
Contributor Author

sseshasa commented Feb 3, 2023

jenkins test make check

@sseshasa sseshasa force-pushed the wip-fix-mclk-rec-backfill-cost branch from 3f594db to 5609184 on February 3, 2023 15:38
@sseshasa sseshasa force-pushed the wip-fix-mclk-rec-backfill-cost branch from 5609184 to 1a5d40c on February 6, 2023 14:15
@sseshasa
Contributor Author

sseshasa commented Feb 7, 2023

jenkins retest this please

@sseshasa
Contributor Author

sseshasa commented Feb 7, 2023

jenkins render docs

@ljflores
Member

ljflores commented Feb 8, 2023

jenkins test api

@ljflores
Member

ljflores commented Feb 8, 2023

jenkins test make check

@sseshasa
Contributor Author

sseshasa commented Feb 9, 2023

Rados Suite Run:

Unrelated Failures:

  1. https://tracker.ceph.com/issues/58496 - osd/PeeringState: FAILED ceph_assert(!acting_recovery_backfill.empty())
  2. https://tracker.ceph.com/issues/58585 - rook: failed to pull kubelet image - Ceph - Orchestrator
  3. https://tracker.ceph.com/issues/57754 - test_envlibrados_for_rocksdb.sh: update-alternatives: error: alternative path /usr/bin/gcc-11 doesn't exist - Infrastructure
  4. https://tracker.ceph.com/issues/58475 - test_dashboard_e2e.sh: Conflicting peer dependency: postcss@8.4.21 - Ceph - Mgr - Dashboard
  5. https://tracker.ceph.com/issues/57755 - Test failure: test_cephfs_mirror (tasks.cephadm_cases.test_cli.TestCephadmCLI)

@sseshasa
Contributor Author

sseshasa commented Feb 9, 2023

@athanatos PTAL at commit: 55575ae (see my email for additional info).

@athanatos
Contributor

@sseshasa Those changes seem reasonable to me.

@sseshasa
Contributor Author

Rados Suite Run:
This is after including latest commit: 55575ae

https://pulpito.ceph.com/nojha-2023-02-10_01:49:56-rados-wip-sseshasa-testing-2023-02-09-cost-fix-distro-default-smithi/

Unrelated Failures:

  1. https://tracker.ceph.com/issues/58585 - rook: failed to pull kubelet image - Ceph - Orchestrator
  2. https://tracker.ceph.com/issues/57754 - test_envlibrados_for_rocksdb.sh: update-alternatives: error: alternative path /usr/bin/gcc-11 doesn't exist - Infrastructure
  3. https://tracker.ceph.com/issues/58475 - test_dashboard_e2e.sh: Conflicting peer dependency: postcss@8.4.21 - Ceph - Mgr - Dashboard
  4. https://tracker.ceph.com/issues/57755 - Test failure: test_cephfs_mirror (tasks.cephadm_cases.test_cli.TestCephadmCLI)
  5. https://tracker.ceph.com/issues/57771 - Command failed on smithi082 with status 1: 'TESTDIR=/home/ubuntu/cephtest bash -s'
  6. https://tracker.ceph.com/issues/55443 - SELinuxError: SELinux denials
  7. https://tracker.ceph.com/issues/58693 - FAIL: test_update_export_with_invalid_values (tasks.cephfs.test_nfs.TestNFS) - New tracker was created.

@sseshasa
Contributor Author

@athanatos Can you please approve? The rados suite results look good.

Contributor

@athanatos athanatos left a comment

@neha-ojha You should probably approve this as well given that it's going to be backported immediately.

@neha-ojha neha-ojha added the DNM label Feb 23, 2023
@sseshasa sseshasa force-pushed the wip-fix-mclk-rec-backfill-cost branch from 55575ae to e996629 on March 20, 2023 13:58
athanatos and others added 17 commits May 8, 2023 16:22
Signed-off-by: Samuel Just <sjust@redhat.com>
is_qos_item() was only used in operator<< for OpSchedulerItem.  However,
it's actually useful to see priority for mclock items since it affects
whether it goes into the immediate queues and, for some types, the
class.  Unconditionally display both class_id and priority.

Signed-off-by: Samuel Just <sjust@redhat.com>
…riority

Consolidate methods governing recovery scheduling in PeeringState.

Signed-off-by: Samuel Just <sjust@redhat.com>
…ry_msg

Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
…ovement

Recovery operations on pgs/objects that have fewer than the configured
number of copies should be treated more urgently than operations on
pgs/objects that simply need to be moved to a new location.

Signed-off-by: Samuel Just <sjust@redhat.com>
…ages

Otherwise, these end up as PGOpItem and therefore as immediate:

class PGOpItem : public PGOpQueueable {
...
  op_scheduler_class get_scheduler_class() const final {
    auto type = op->get_req()->get_type();
    if (type == CEPH_MSG_OSD_OP ||
  type == CEPH_MSG_OSD_BACKOFF) {
      return op_scheduler_class::client;
    } else {
      return op_scheduler_class::immediate;
    }
  }
...
};

This was probably causing a bunch of extra interference with client
ops.

Signed-off-by: Samuel Just <sjust@redhat.com>
PGs with degraded objects should be higher priority.

Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Now that recovery operations are split between background_recovery and
background_best_effort, rebalance qos params to avoid penalizing
background_recovery while idle.

Signed-off-by: Samuel Just <sjust@redhat.com>
Let's use the middle profile as the default.
Modify the standalone tests accordingly.

Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
The earlier limit of 3 was still aggressive enough to have an impact on
the client and other competing operations. Retain the current default
for mClock. This can be modified if necessary after setting the
osd_mclock_override_recovery_settings option.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
…r SSDs

The osd_mclock_max_sequential_bandwidth_ssd option is changed to 1200 MiB/s as
a reasonable middle ground considering the broad range of SSD capabilities.
This allows mClock's cost model to exploit an SSD's capability depending
on the cost of the IO being performed.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
…anges

Modify the relevant documentation to reflect:

- change in the default mClock profile to 'balanced'
- new allocations for ops across mClock profiles
- change in the osd_max_backfills limit
- miscellaneous changes related to warnings.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
The qa tests are not client I/O centric and mostly focus on triggering
recovery/backfills and monitoring them for completion within a finite amount
of time. The same holds true for scrub operations.

Therefore, an mClock profile that optimizes background operations is a
better fit for qa tests. The osd_mclock_profile option is therefore
globally overridden to the 'high_recovery_ops' profile for the Rados suite,
as it fits the requirement.

Also, many standalone tests expect recovery and scrub operations to
complete within a finite time. To ensure this, the osd_mclock_profile
option is set to 'high_recovery_ops' as part of the run_osd() function
in ceph-helpers.sh.

A subset of standalone tests explicitly used 'high_recovery_ops' profile.
Since the profile is now set as part of run_osd(), the earlier overrides
are redundant and therefore removed from the tests.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
@sseshasa sseshasa force-pushed the wip-fix-mclk-rec-backfill-cost branch from abca84e to 4c22fcf on May 8, 2023 10:52
@sseshasa
Contributor Author

sseshasa commented May 8, 2023

jenkins test api

@sseshasa
Contributor Author

sseshasa commented May 8, 2023

jenkins test make check

@ljflores
Member

ljflores commented May 8, 2023

Tracked the make check failure in https://tracker.ceph.com/issues/59671. I don't see it happening on every PR, though, so I think it could be transient or a problem with a particular node.

Testing that theory by retriggering.

@ljflores
Member

ljflores commented May 8, 2023

jenkins test make check

@rzarzynski rzarzynski merged commit a9d46ad into ceph:main May 8, 2023
Matan-B added a commit to ceph/ceph-ci that referenced this pull request May 11, 2023
ceph/ceph#49975 Introduced changes to
mclock conf value types which caused the osd to stall while booting.

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Matan-B added a commit to Matan-B/ceph that referenced this pull request May 21, 2023
ceph#49975 Introduced changes to
mclock conf value types which caused the osd to stall while booting.

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
(cherry picked from commit 5cb12bb)