osd: some recovery improvements and cleanups #23663
(force-pushed: 5a27350 → 106dc2d)
```cpp
    pg_log.reset_recovery_pointers();
  } else {
    dout(10) << "activate - not complete, " << missing << dendl;
    info.stats.stats.sum.num_objects_missing = missing.num_missing();
```
Hmm, I think this should be num_objects_missing_on_primary... and I think (?) that num_objects_missing isn't updated correctly anywhere?
PG::_update_calc_stats() calculates num_objects_missing based on what is missing on all replicas. I don't think this needs to be changed anywhere else.
> I think (?) that num_objects_missing isn't updated correctly anywhere?
Exactly; that's why I am trying to make this member track each peer's number of missing objects correctly here. Then, in the next peering cycle, we can leverage this (plus the log difference) to choose more suitable, worth-recovering peers for the acting set.
Hmm, looking a bit more closely, I think all we need is a call to publish_stats_to_osd() at the end of activate()?
Nope. What I intend to do is let each peer record its number of missing objects separately. publish_stats_to_osd() (and the later call to _update_calc_stats()) only updates the primary's in-memory peer-info num_objects_missing.
This pull request should be run against 2 tests: cd build
src/osd/PG.cc (outdated)
```cpp
    approx_objects_missing += primary_version - auth_version;
  }
  auto force_threshold = cct->_conf.get_val<uint64_t>(
    "osd_force_auth_primary_missing_objects");
```
Making choose_acting depend on a config variable is dangerous because it can lead to a cycle if the value differs between OSDs (with the acting set toggling back and forth between two primaries that have different settings). We do this with the ignore_les option, but that has led to confusion/problems in the field as a result.
Yeah. That's why I introduced a very strict constraint above:

```cpp
if (HAVE_FEATURE(osdmap->get_up_osd_features(), SERVER_NAUTILUS))
```

And choosing the acting set under the control of a configurable option is something the async_recovery code already does; see osd_async_recovery_min_pg_log_entries.
```cpp
void recover_got(hobject_t oid, eversion_t v, pg_info_t &info) {
  if (missing.is_missing(oid, v)) {
    missing.got(oid, v);
    info.stats.stats.sum.num_objects_missing = missing.num_missing();
```
For this, I think on_local_recover() is the only caller that really matters, and publish_stats_to_osd() is already called a few lines down inside the if (is_primary()) block.
I am trying to let each peer track its own number of missing objects and get those up-to-date stats written to disk in time.
(force-pushed: e2a31a6 → 63a51c7)
@liewegas Ping? BTW, I've done some local testing which shows that this patch, together with async_recovery, can noticeably reduce the impact on client IOPS when PGs are in recovery mode...
@xiexingguo could you please at least rerun osd-recovery-stats.sh as suggested by @dzafman at
src/osd/PG.cc (outdated)
```cpp
  set<pg_shard_t> *backfill,
  set<pg_shard_t> *acting_backfill,
  const OSDMapRef osdmap,
  CephContext *cct,
```
If all we need is osd_force_auth_primary_missing_objects, why not pass it in instead?
Async recovery peers usually have a relatively complete log history but may have a lot of missing objects. Choosing one of them as auth_log_shard, and further as primary if the current up_primary is unrecoverable, say, has a bigger chance of blocking client I/Os. Among peers with identically new log histories, we now prefer those that are complete (having no missing objects) when determining auth_log_shard.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
- kill usable, use want->size instead
- introduce a (separate) lambda function for sorting

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Which has no consumers. Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
So if there are a lot of missing objects on the primary, we can make use of auth_log_shard to restore client I/O quickly. Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(force-pushed: 63a51c7 → 22786cf)
We changed the async recovery cost calculation in nautilus to also take into account approx_missing_objects in ab241bf. That commit depends on ceph#23663, hence wasn't backported to mimic. Mimic only uses the difference in length of logs as the cost. Due to this, the same OSD might have different costs in a mixed mimic and nautilus (or above) cluster. This can lead to choose_acting() cycling between OSDs when trying to select the acting set and async_recovery_targets.

Fixes: https://tracker.ceph.com/issues/39441
Signed-off-by: Neha Ojha <nojha@redhat.com>
We changed the async recovery cost calculation in nautilus to also take into account approx_missing_objects in ab241bf. That commit depends on ceph#23663, hence wasn't backported to mimic. Mimic only uses the difference in length of logs as the cost. Due to this, the same OSD might have different costs in a mixed mimic and nautilus (or above) cluster. This can lead to choose_acting() cycling between OSDs when trying to select the acting set and async_recovery_targets.

Fixes: https://tracker.ceph.com/issues/39441
Signed-off-by: Neha Ojha <nojha@redhat.com>
(cherry picked from commit 4c617ec)

Conflicts:
	src/osd/PG.cc: Resolved in choose_async_recovery_ec and choose_async_recovery_replicated
