osd/PG: async-recovery should respect historical missing objects #24004

Merged
xiexingguo merged 2 commits into ceph:master from xiexingguo:wip-yet-more-async-fixes on Sep 20, 2018

Conversation

@xiexingguo
Member

Peers with async-recovery enabled usually have an up-to-date
last_update and hence might be moved out of the async_recovery_targets
set during the next peering cycles.

7de3562 makes num_objects_missing
trace historical missing objects correctly, hence we can take
num_objects_missing into account when determining async_recovery_targets.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Option("osd_async_recovery_min_pg_log_entries", Option::TYPE_UINT, Option::LEVEL_ADVANCED)
Option("osd_async_recovery_approx_missing_objects", Option::TYPE_UINT, Option::LEVEL_ADVANCED)
.set_description("Approximate missing objects above which to force auth_log_shard to be primary temporarily"),
Member

Not sure this name is appropriate, since this option now accounts for the difference in log lengths plus missing objects.

Member Author

eh, what is your suggestion, then? @neha-ojha

Member Author

@xiexingguo xiexingguo Sep 11, 2018

> difference in length of logs

IMHO, the log difference is essentially an imprecise measure of missing objects, so I guess the naming should be fine? @neha-ojha

Member

I see it as a cost of recovery, more so because now it seems to be dependent on more than one parameter.

Member Author

@neha-ojha

What about:

Option("osd_async_recovery_min_cost", Option::TYPE_UINT, Option::LEVEL_ADVANCED)
set_description("A mixture measure of the current log-entry difference and historical missing objects, above which we switch to asynchronous recovery when appropriate")

Member

sounds good to me

Member Author

@neha-ojha Repushed now!

Contributor

@Yan-waller Yan-waller left a comment

lgtm

if (auth_version > candidate_version) {
  approx_missing_objects += auth_version - candidate_version;
}
if (approx_missing_objects > cct->_conf.get_val<uint64_t>(
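The snippet above can be read as a standalone predicate. Below is a hedged sketch, not the actual Ceph source: the conf lookup is mocked, the option name follows the `osd_async_recovery_min_cost` naming settled on earlier in this thread, and `eversion_t` versions are simplified to plain integers.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Mock of the cct->_conf.get_val<uint64_t>(...) lookup in the snippet above.
struct MockConf {
  std::map<std::string, uint64_t> vals;
  uint64_t get_val(const std::string& key) const { return vals.at(key); }
};

// Hedged reconstruction of the reviewed logic: start from the peer's
// persisted (historical) num_objects_missing, add the log-version gap to
// the authoritative shard, and compare the combined cost against the
// threshold option. A peer whose log is fully caught up can still exceed
// the threshold through historical missing objects alone -- the case this
// PR addresses.
bool wants_async_recovery(uint64_t num_objects_missing,
                          uint64_t auth_version,
                          uint64_t candidate_version,
                          const MockConf& conf) {
  uint64_t approx_missing_objects = num_objects_missing;
  if (auth_version > candidate_version) {
    approx_missing_objects += auth_version - candidate_version;
  }
  return approx_missing_objects >
         conf.get_val("osd_async_recovery_min_cost");
}
```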
Member

@xiexingguo If num_objects_missing is reliable, this change should work. Wondering if you have done some evaluation like #23663 (comment), to compare how this change impacts the overall performance of async recovery.

Since this change is critical to how async recovery works in general, I'd like @jdurgin to review this as well.

@xiexingguo
Member Author

@neha-ojha

> Wondering if you have done some evaluation like #23663 (comment), to compare how this change impacts the overall performance of async recovery.

Actually I have. Without this change, the recovery process can cause up to an 80% decrease in client IOPS
if PGs are recovering from a previous unfinished recovery process, since not many async recovery peers are initiated.

BTW, #22330 and #22664 dramatically reduce the chance that a pg can go async recovery, and hence the chance that the recovery process will unblock client I/Os. I am also wondering
whether we could revert those two changes and fix the out-of-order issue in another way instead...

> Since this change is critical to how async recovery works in general, I'd like @jdurgin to review this as well.

Sure.

@xiexingguo
Member Author

@jdurgin Ping?

Member

@jdurgin jdurgin left a comment

Including missing objects that we know about at this point in peering seems like a good idea. It's a bit more accurate at least, even if some objects may be more expensive than others to recover.

With respect to increasing availability by choosing more async recovery targets, I wonder if this could be achieved better by e.g. the balancer mgr module manipulating the up-set.

Trying to make the OSDs converge on mappings that aren't the up set will be tough, since that's what recovery is trying to achieve eventually. It's pretty easy to introduce bugs that way.

Peers with async-recovery enabled usually have an up-to-date
last_update and hence might be moved out of the async_recovery_targets
set during the next peering cycles.

7de3562 makes num_objects_missing
trace historical missing objects correctly, hence we can take
num_objects_missing into account when determining async_recovery_targets.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
@xiexingguo xiexingguo force-pushed the wip-yet-more-async-fixes branch from f88a573 to 90eda15 Compare September 20, 2018 01:48
@xiexingguo xiexingguo merged commit 03abaf9 into ceph:master Sep 20, 2018
@xiexingguo xiexingguo deleted the wip-yet-more-async-fixes branch September 20, 2018 06:49
xiexingguo added a commit to xiexingguo/ceph that referenced this pull request Sep 19, 2019
guoracle reported that:

> In the asynchronous recovery feature, the asynchronous recovery
> target OSD is selected by last_update.version, so that after
> peering is completed, the asynchronous recovery target OSDs update
> the last_update.version and then go down again. When the asynchronous
> recovery target OSDs come back online, there is no pglog
> difference between the asynchronous recovery targets and the
> authoritative OSD during peering, resulting in no asynchronous recovery.

ceph#24004 aimed to solve the problem by
persisting the number of missing objects to disk when peering was
done, so that we could take both the new approximate missing objects
(estimated according to last_update) and the historical num_objects_missing
into account when determining async_recovery_targets on any follow-up
peering cycles.
However, the above holds only if we keep an up-to-date
num_objects_missing field for each pg instance under all circumstances,
which is unfortunately not true for replicas which have completed peering
but never started recovery afterwards (7de3562
makes sure we update num_objects_missing for the primary when peering is done,
and keeps num_objects_missing up to date as each missing object
is recovered).

Note that guoracle also suggests fixing the same problem by using
last_complete.version to calculate the pglog difference and updating the
last_complete of the asynchronous recovery target OSD in the copy of peer_info
to the latest after recovery is complete, which would not work well
because we might reset last_complete to 0'0 whenever we trim the pglog past the
minimal need-version of the missing set.

Fix by persisting num_objects_missing for replicas correctly when peering
is done.

Fixes: https://tracker.ceph.com/issues/41924
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
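The last_complete caveat in the note above can be shown with a toy model. This is purely illustrative, not Ceph code: `eversion_t` versions are reduced to plain integers, and 0'0 is modeled as 0.

```cpp
#include <cstdint>

// Toy model of the caveat above: last_complete stays meaningful only while
// the pg log still covers the oldest version needed by the missing set.
// Once the log is trimmed past that version, the history needed to advance
// last_complete is gone, so it falls back to 0'0 (modeled as 0 here) --
// which is why a last_complete-based pglog difference is unreliable.
uint64_t last_complete_after_trim(uint64_t log_tail_version,
                                  uint64_t oldest_missing_need,
                                  uint64_t last_complete) {
  if (log_tail_version > oldest_missing_need) {
    return 0;  // reset to 0'0: trimmed past the minimal need-version
  }
  return last_complete;
}
```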
xiexingguo added a commit to xiexingguo/ceph that referenced this pull request Sep 20, 2019
xiexingguo added a commit to xiexingguo/ceph that referenced this pull request Sep 20, 2019
smithfarm pushed a commit to smithfarm/ceph that referenced this pull request Oct 23, 2019
(cherry picked from commit 3b024c5)
xiexingguo added a commit to ceph/ceph-ci that referenced this pull request Nov 2, 2019
(cherry picked from commit 3b024c5)
xiexingguo added a commit to ceph/ceph-ci that referenced this pull request Nov 7, 2019
(cherry picked from commit 3b024c5)
xiexingguo added a commit to ceph/ceph-ci that referenced this pull request Nov 26, 2019
(cherry picked from commit 3b024c5)
alfonsomthd pushed a commit to rhcs-dashboard/ceph that referenced this pull request Dec 13, 2019
(cherry picked from commit 3b024c5)
(cherry picked from commit 8f645d0)

Resolves: rhbz#1457536
xiexingguo added a commit to ceph/ceph-ci that referenced this pull request Feb 15, 2020
(cherry picked from commit 3b024c5)
