osd: mClock recovery/backfill cost fixes#49975
Conversation
Force-pushed: b0a1169 to 1c3cecb
Pretty close! https://github.com/athanatos/ceph/tree/sjust/wip-recovery-cost-49959 has some commits that I think clean up and simplify things a bit. If you agree with the changes, squash my commits as indicated and update the commit messages (add Signed-off-by: and Fixes: lines as well).
src/osd/ECBackend.cc (outdated)
See my branch for a comment with an explanation, but I think we want a cost of 1 here.
Thanks for the detailed explanation of the case involved.
src/osd/OSD.h (outdated)

```diff
   // -- pg recovery and associated throttling --
   ceph::mutex recovery_lock = ceph::make_mutex("OSDService::recovery_lock");
-  std::list<std::pair<epoch_t, PGRef> > awaiting_throttle;
+  std::list<std::tuple<epoch_t, PGRef, int> > awaiting_throttle;
```
See my branch, let's use a named struct here.
Sure, accessing the tuple members did seem awkward. With the named struct, it's clear which member is being accessed.
src/osd/PG.cc (outdated)

```cpp
osd->queue_for_recovery(this);
// Determine recovery cost in terms of average object size in this PG
int64_t num_bytes = info.stats.stats.sum.num_bytes;
int64_t num_objects = info.stats.stats.sum.num_objects;
```
See my branch, we need to handle the case where num_objects is 0.
src/osd/PGBackend.h (outdated)

```diff
   virtual void schedule_recovery_work(
-    GenContext<ThreadPool::TPHandle&> *c) = 0;
+    GenContext<ThreadPool::TPHandle&> *c,
+    int cost) = 0;
```
See branch, need to use uint64_t everywhere for cost.
Sure. I used int since OpSchedulerItem takes cost as an int.
src/osd/ReplicatedBackend.cc (outdated)

```cpp
if (j != pulling.end()) {
  auto rec_size = j->second.recovery_info.size;
  auto rec_progress = j->second.recovery_progress.data_recovered_to;
  cost += (rec_size - rec_progress);
```
This isn't quite right, data_recovered_to will generally be equal to recovery_info.size since the pull is complete. PG_RecoveryQueueAsync here is going to enqueue a C_ReplicatedBackend_OnPullComplete which pushes the object data out to waiting acting set OSDs. We want the full size of the relevant objects. Instead, let's use i.stat.num_bytes_recovered for each element of to_continue. See my branch.
Understood. Thanks for clarifying this!
Will do @athanatos. I will take a look and squash your commits into mine. Thanks for taking a look and for these.
Force-pushed: 1c3cecb to 0569e33
src/osd/OSD.h (outdated)

```diff
   } else {
-    awaiting_throttle.push_back(std::make_pair(pg->get_osdmap()->get_epoch(), pg));
+    awaiting_throttle.emplace_back(
+      pg->get_osdmap()->get_epoch(), pg, cost_per_object);
```
@athanatos The initialization of pg_awaiting_throttle_t is missing. Will add it in the next push.
Fixed in the latest push.
Force-pushed: 0363ee4 to 3f594db
jenkins retest this please

jenkins test make check
Force-pushed: 3f594db to 5609184

Force-pushed: 5609184 to 1a5d40c
jenkins retest this please

jenkins render docs

jenkins test api

jenkins test make check
Rados Suite Run:
Unrelated Failures:
@athanatos PTAL at commit: 55575ae (see my email for additional info).

@sseshasa Those changes seem reasonable to me.

Rados Suite Run: Unrelated Failures:

@athanatos Can you please approve? The rados suite results look good.
athanatos left a comment:
@neha-ojha You should probably approve this as well given that it's going to be backported immediately.
Force-pushed: 55575ae to e996629
Signed-off-by: Samuel Just <sjust@redhat.com>
is_qos_item() was only used in operator<< for OpSchedulerItem. However, it's actually useful to see priority for mclock items since it affects whether it goes into the immediate queues and, for some types, the class. Unconditionally display both class_id and priority. Signed-off-by: Samuel Just <sjust@redhat.com>
…riority Consolidate methods governing recovery scheduling in PeeringState. Signed-off-by: Samuel Just <sjust@redhat.com>
…ry_msg Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
…ovement Recovery operations on pgs/objects that have fewer than the configured number of copies should be treated more urgently than operations on pgs/objects that simply need to be moved to a new location. Signed-off-by: Samuel Just <sjust@redhat.com>
…ages
Otherwise, these end up as PGOpItem and therefore as immediate:
```cpp
class PGOpItem : public PGOpQueueable {
  ...
  op_scheduler_class get_scheduler_class() const final {
    auto type = op->get_req()->get_type();
    if (type == CEPH_MSG_OSD_OP ||
        type == CEPH_MSG_OSD_BACKOFF) {
      return op_scheduler_class::client;
    } else {
      return op_scheduler_class::immediate;
    }
  }
  ...
};
```
This was probably causing a bunch of extra interference with client
ops.
Signed-off-by: Samuel Just <sjust@redhat.com>
PGs with degraded objects should be higher priority. Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Now that recovery operations are split between background_recovery and background_best_effort, rebalance qos params to avoid penalizing background_recovery while idle. Signed-off-by: Samuel Just <sjust@redhat.com>
Let's use the middle profile as the default. Modify the standalone tests accordingly. Signed-off-by: Samuel Just <sjust@redhat.com> Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
The earlier limit of 3 was still aggressive enough to have an impact on the client and other competing operations. Retain the current default for mClock. This can be modified if necessary after setting the osd_mclock_override_recovery_settings option. Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
…r SSDs The osd_mclock_max_sequential_bandwidth_ssd is changed to 1200 MiB/s as a reasonable middle ground considering the broad range of SSD capabilities. This allows mClock's cost model to extract the SSD's capability depending on the cost of the IO being performed. Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
…anges Modify the relevant documentation to reflect: - change in the default mClock profile to 'balanced' - new allocations for ops across mClock profiles - change in the osd_max_backfills limit - miscellaneous changes related to warnings. Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
The qa tests are not client I/O centric and mostly focus on triggering recovery/backfills and monitoring them for completion within a finite amount of time. The same holds true for scrub operations. Therefore, an mClock profile that optimizes background operations is a better fit for qa related tests. The osd_mclock_profile is therefore globally overridden to the 'high_recovery_ops' profile for the Rados suite, as it fits the requirement. Also, many standalone tests expect recovery and scrub operations to complete within a finite time. To ensure this, the osd_mclock_profile option is set to 'high_recovery_ops' as part of the run_osd() function in ceph-helpers.sh. A subset of standalone tests explicitly used the 'high_recovery_ops' profile. Since the profile is now set as part of run_osd(), the earlier overrides are redundant and have therefore been removed from the tests. Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Force-pushed: abca84e to 4c22fcf
jenkins test api

jenkins test make check

Tracked the make check failure in https://tracker.ceph.com/issues/59671. I don't see it happening on every PR, though, so it could be transient, or a problem with a particular node. Testing that theory by retriggering.

jenkins test make check
ceph/ceph#49975 introduced changes to mClock conf value types which caused the OSD to stall while booting. Signed-off-by: Matan Breizman <mbreizma@redhat.com>
ceph#49975 introduced changes to mClock conf value types which caused the OSD to stall while booting. Signed-off-by: Matan Breizman <mbreizma@redhat.com> (cherry picked from commit 5cb12bb)
This PR addresses the issue of slow backfill/recoveries due to non-ideal
cost settings for recovery/backfill operations.
The following summarizes the changes:
Fixes: https://tracker.ceph.com/issues/58529
Fixes: https://tracker.ceph.com/issues/58606
Fixes: https://tracker.ceph.com/issues/58607
Fixes: https://tracker.ceph.com/issues/59080
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>