
osd: some recovery improvements and cleanups#23663

Merged
xiexingguo merged 6 commits into ceph:master from xiexingguo:wip-incompat-async-fixes
Sep 1, 2018

Conversation

@xiexingguo (Member)

No description provided.

@xiexingguo xiexingguo force-pushed the wip-incompat-async-fixes branch from 5a27350 to 106dc2d Compare August 21, 2018 06:29
@xiexingguo xiexingguo requested a review from liewegas August 21, 2018 10:33
pg_log.reset_recovery_pointers();
} else {
dout(10) << "activate - not complete, " << missing << dendl;
info.stats.stats.sum.num_objects_missing = missing.num_missing();
Member
Hmm, I think this should be num_objects_missing_on_primary.. and I think (?) that num_objects_missing isn't updated correctly anywhere?

Contributor

PG::_update_calc_stats() calculates num_objects_missing based on the missing sets of all replicas. I don't think this needs to be changed anywhere else.

Member Author

> I think (?) that num_objects_missing isn't updated correctly anywhere?

Exactly, and that's why I'm trying to make this member track each peer's number of missing objects correctly here. Then, in the next peering cycle, we can leverage this (plus the log difference) to choose more suitable, worth-recovering peers for the acting set.

Member

Hmm, looking a bit more closely, I think all we need is a call to publish_stats_to_osd() at the end of activate()?

Member Author

Nope. What I'm intending to do is let each peer record its number of missing objects separately. publish_stats_to_osd() (and the later call to _update_calc_stats()) will only update the primary's in-memory peer-info num_objects_missing.

@dzafman
Contributor

dzafman commented Aug 26, 2018

This pull request should be run against 2 tests:

cd build
../qa/run-standalone.sh osd-backfill-stats.sh osd-recovery-stats.sh

src/osd/PG.cc Outdated
approx_objects_missing += primary_version - auth_version;
}
auto force_threshold = cct->_conf.get_val<uint64_t>(
"osd_force_auth_primary_missing_objects");
Member

Making choose_acting depend on a config variable is dangerous because it can lead to a cycle if the value differs between OSDs (with acting toggling back and forth between two primaries with different settings). We do it with the ignore_les option but that has led to confusion/problems in the field as a result.

Member Author

Yeah. That's why I introduced a very strict constraint above:

if (HAVE_FEATURE(osdmap->get_up_osd_features(), SERVER_NAUTILUS))

Also, choosing the acting set under the control of a configurable option is something the async_recovery code already does; see:

osd_async_recovery_min_pg_log_entries

void recover_got(hobject_t oid, eversion_t v, pg_info_t &info) {
  if (missing.is_missing(oid, v)) {
    missing.got(oid, v);
    info.stats.stats.sum.num_objects_missing = missing.num_missing();
Member

For this, I think on_local_recover() is the only caller that really matters, and publish_stats_to_osd() is already called a few lines down inside the if (is_primary()) block.

Member Author

Same here

Member Author

I'm trying to let each peer track its own number of missing objects and get those up-to-date stats written to disk on time.

@xiexingguo xiexingguo force-pushed the wip-incompat-async-fixes branch 3 times, most recently from e2a31a6 to 63a51c7 Compare August 28, 2018 02:36
@xiexingguo
Member Author

@liewegas Ping?

BTW, I've done some local testing which shows that this patch, together with async_recovery, can noticeably reduce the impact on client IOPS when PGs are in recovery mode...

(attached image: client IOPS benchmark results)

@tchaikov
Contributor

tchaikov commented Aug 30, 2018

src/osd/PG.cc Outdated
set<pg_shard_t> *backfill,
set<pg_shard_t> *acting_backfill,
const OSDMapRef osdmap,
CephContext *cct,
Contributor

If all we need is osd_force_auth_primary_missing_objects, why not pass it in instead?

Async recovery peers usually have a relatively complete
log history but may have a lot of missing objects. Choosing one of them
as auth_log_shard, and further as primary if the current up_primary is
unrecoverable, has a bigger chance of blocking client I/Os.

Among peers with equally new log histories, we now prefer those
that are complete (having no missing objects)
when determining auth_log_shard.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
- kill usable, use want->size instead
- introduce a (separate) lambda function for sorting

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Which has no consumers.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
So if there are a lot of missing objects on the primary, we can
make use of auth_log_shard to restore client I/O quickly.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
@xiexingguo xiexingguo force-pushed the wip-incompat-async-fixes branch from 63a51c7 to 22786cf Compare August 31, 2018 08:31
@xiexingguo xiexingguo merged commit 0857124 into ceph:master Sep 1, 2018
@xiexingguo xiexingguo deleted the wip-incompat-async-fixes branch September 1, 2018 06:27
neha-ojha added a commit to neha-ojha/ceph that referenced this pull request Apr 25, 2019
We changed async recovery cost calculation in nautilus to also take
into account approx_missing_objects in ab241bf.
This commit depends on ceph#23663, and hence
wasn't backported to mimic.

Mimic only uses the difference in log lengths as the cost. Due to this,
the same OSD might have different costs in a mixed mimic and nautilus (or above)
cluster. This can lead to choose_acting() cycling between OSDs when trying
to select the acting set and async_recovery_targets.

Fixes: https://tracker.ceph.com/issues/39441
Signed-off-by: Neha Ojha <nojha@redhat.com>
pdvian pushed a commit to pdvian/ceph that referenced this pull request May 17, 2019
We changed async recovery cost calculation in nautilus to also take
into account approx_missing_objects in ab241bf.
This commit depends on ceph#23663, and hence
wasn't backported to mimic.

Mimic only uses the difference in log lengths as the cost. Due to this,
the same OSD might have different costs in a mixed mimic and nautilus (or above)
cluster. This can lead to choose_acting() cycling between OSDs when trying
to select the acting set and async_recovery_targets.

Fixes: https://tracker.ceph.com/issues/39441
Signed-off-by: Neha Ojha <nojha@redhat.com>
(cherry picked from commit 4c617ec)

Conflicts:
	src/osd/PG.cc : Resolved in choose_async_recovery_ec and
			choose_async_recovery_replicated