osd: some recovery improvements and cleanups #23663
(force-pushed: 5a27350 → 106dc2d)
```cpp
    pg_log.reset_recovery_pointers();
  } else {
    dout(10) << "activate - not complete, " << missing << dendl;
    info.stats.stats.sum.num_objects_missing = missing.num_missing();
```
Hmm, I think this should be num_objects_missing_on_primary... and I think (?) that num_objects_missing isn't updated correctly anywhere?
PG::_update_calc_stats() calculates num_objects_missing based on what is missing on all replicas. I don't think this needs to be changed anywhere else.
> I think (?) that num_objects_missing isn't updated correctly anywhere?
Exactly; that's why I am trying to make this member track each peer's number of missing objects correctly here. Then, in the next peering cycle, we can leverage this (plus the log difference) to choose more suitable, worth-recovering peers for the acting set.
Hmm, looking a bit more closely, I think all we need is a call to publish_stats_to_osd() at the end of activate()?
Nope. What I intend to do is let each peer record its number of missing objects separately. publish_stats_to_osd() (and the later call to _update_calc_stats()) only updates the primary's in-memory peer-info num_objects_missing.
This pull request should be run against 2 tests: cd build
src/osd/PG.cc (outdated)
```cpp
    approx_objects_missing += primary_version - auth_version;
  }
  auto force_threshold = cct->_conf.get_val<uint64_t>(
    "osd_force_auth_primary_missing_objects");
```
Making choose_acting depend on a config variable is dangerous because it can lead to a cycle if the value differs between OSDs (with the acting set toggling back and forth between two primaries that have different settings). We do this with the ignore_les option, but that has led to confusion/problems in the field as a result.
Yeah. That's why I introduced a very strict constraint above:

```cpp
if (HAVE_FEATURE(osdmap->get_up_osd_features(), SERVER_NAUTILUS))
```

And choosing the acting set under the control of a configurable option is something the async_recovery code already does; see osd_async_recovery_min_pg_log_entries.
```cpp
void recover_got(hobject_t oid, eversion_t v, pg_info_t &info) {
  if (missing.is_missing(oid, v)) {
    missing.got(oid, v);
    info.stats.stats.sum.num_objects_missing = missing.num_missing();
```
For this, I think on_local_recover() is the only caller that really matters, and publish_stats_to_osd() is already called a few lines down inside the if (is_primary()) block.
I am trying to let each peer track its own number of missing objects and get those up-to-date stats written to disk in time.
(force-pushed: e2a31a6 → 63a51c7)
@liewegas Ping? BTW, I've done some local testing which shows that this patch, together with async_recovery, can noticeably reduce the impact on client IOPS when PGs are in recovery mode...
@xiexingguo could you please at least rerun osd-recovery-stats.sh as suggested by @dzafman at
src/osd/PG.cc (outdated)
```cpp
  set<pg_shard_t> *backfill,
  set<pg_shard_t> *acting_backfill,
  const OSDMapRef osdmap,
  CephContext *cct,
```
If all we need is osd_force_auth_primary_missing_objects, why not pass it in instead?
Async recovery peers usually have a relatively complete log history but may have a lot of missing objects. Choosing one of them as auth_log_shard, and further as primary if the current up_primary is unrecoverable, say, has a bigger chance of blocking client I/Os. Among peers with identically new log histories, we now prefer those that are complete (having no missing objects) when determining auth_log_shard.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
- kill usable, use want->size instead
- introduce a (separate) lambda function for sorting

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Which has no consumers. Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
So if there are a lot of missing objects on the primary, we can make use of auth_log_shard to restore client I/O quickly. Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(force-pushed: 63a51c7 → 22786cf)
We changed the async recovery cost calculation in nautilus to also take into account approx_missing_objects in ab241bf. That commit depends on ceph#23663, hence wasn't backported to mimic. Mimic only uses the difference in length of logs as the cost. Due to this, the same OSD might have different costs in a mixed mimic and nautilus (or above) cluster. This can lead to choose_acting() cycling between OSDs when trying to select the acting set and async_recovery_targets.

Fixes: https://tracker.ceph.com/issues/39441
Signed-off-by: Neha Ojha <nojha@redhat.com>
We changed the async recovery cost calculation in nautilus to also take into account approx_missing_objects in ab241bf. That commit depends on ceph#23663, hence wasn't backported to mimic. Mimic only uses the difference in length of logs as the cost. Due to this, the same OSD might have different costs in a mixed mimic and nautilus (or above) cluster. This can lead to choose_acting() cycling between OSDs when trying to select the acting set and async_recovery_targets.

Fixes: https://tracker.ceph.com/issues/39441
Signed-off-by: Neha Ojha <nojha@redhat.com>
(cherry picked from commit 4c617ec)

Conflicts:
	src/osd/PG.cc: Resolved in choose_async_recovery_ec and choose_async_recovery_replicated
