mgr: relax ok-to-stop condition by xxhdx1985126 · Pull Request #39335 · ceph/ceph

xxhdx1985126 · 2021-02-07T06:35:13Z

Right now, the "ok-to-stop" condition is relatively rigorous, it allows
stopping an osd only if no PG on it is non-active or degraded. But there
are situations in which people want to concurrently stop multiple osds even if
they share pgs, as long as that wouldn't cause client requests to block.

This commit relax the "ok-to-stop" condition to something like: allow
osd to stop as long as all pgs on it are active AND there are more-than-min_size
pgs that are still up/in AND there are at least one OSD that have no missing
objects.

Signed-off-by: Xuehan Xu xxhdx1985126@gmail.com

Checklist

References tracker ticket
Updates documentation if necessary
Includes tests for new functionality or reproducer for bug

Show available Jenkins commands

jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox

xxhdx1985126 · 2021-02-08T06:21:21Z

@dzafman Hi, could you take a look this PR, please? Thanks:-)

dzafman · 2021-02-08T18:47:15Z

src/mgr/DaemonServer.cc

-	    if (pg_acting.size() < pi->min_size) {
+	    if (((q.second.state & PG_STATE_DEGRADED) &&
+                  !avail_no_missing.size()) ||
+		pg_acting.size() < pi->min_size) {


It seems to that if the pg is DEGRADED, then pg_acting == avail_no_missing. So avail_no_missing has no extra information.

Since !avail_no_missing.size() == (pg_acting.size() == 0), then it has no change whether dangerous_pgs is incremented since 0 should always be < min_size.

@dzafman Um...It seems that an OSD can be a member of pg_acting even if it has missing objects, so I think when the pg is DEGRADED, pg_acting may not be the same as avail_no_missing, right? And for the same reason, pg_acting may not be necessarily empty even if no OSD has all objects, and I think that's why we have a concept of "unfound objects". Am I right?:-)

When a pg is degraded the pg_acting is based on avail_no_missing. If not, pg_acting comes from acting. In any case the size of pg_acting must be >= min_size. Unfound is for objects that aren't stored on any OSD.

Now that think about it, I'm not clear how a pg with pg_acting.size() == min_size wouldn't be dangerous because stopping an OSD could then cause a PG to drop below min_size. Doesn't "available" imply writable.

dzafman

As it stands this change should not be merged.

dzafman · 2021-02-08T18:51:23Z

src/mgr/DaemonServer.cc

-	    if (pg_acting.size() < pi->min_size) {
+	    if (((q.second.state & PG_STATE_DEGRADED) &&
+                  !avail_no_missing.size()) ||
+		pg_acting.size() < pi->min_size) {


Since !avail_no_missing.size() == (pg_acting.size() == 0), then it has no change whether dangerous_pgs is incremented since 0 should always be < min_size.

dzafman · 2021-02-08T18:57:17Z

The description of the goals of this change "AND there are at least one OSD that have no missing
objects.", based on the code should read "OR there are at least one OSD that have no missing
objects." This means that the change would only apply to replicated pools, since erasure coded pools need more than 1 replica to rebuild.

xxhdx1985126 · 2021-02-09T02:01:42Z

The description of the goals of this change "AND there are at least one OSD that have no missing
objects.", based on the code should read "OR there are at least one OSD that have no missing
objects." This means that the change would only apply to replicated pools, since erasure coded pools need more than 1 replica to rebuild.

So I think maybe I should distinguish replicated pools and ec pools, right?

xxhdx1985126 · 2021-02-09T02:27:19Z

@dzafman hi, I've restrict the change to only affect osds of replicated pools, please take a look:-) Thanks.

dzafman · 2021-02-09T02:44:06Z

src/mgr/DaemonServer.cc

-	    if (pg_acting.size() < pi->min_size) {
+	    if (((q.second.state & PG_STATE_DEGRADED) &&
+                  !avail_no_missing.size() &&
+                  (pi->type == pg_pool_t::TYPE_ERASURE)) ||


should be pi->type != pg_pool_t::TYPE_ERASURE

I still think you aren't changing anything with this change. If pg_acting.size() == 0 then it will be less than min_size. Remember for degraded avail_no_missing == pg_acting which is also why you don't need the avail_no_missing set.

When a pg is degraded the pg_acting is based on avail_no_missing. If not, pg_acting comes from acting. In any case the size of pg_acting must be >= min_size. Unfound is for objects that aren't stored on any OSD.

Yes, but pg_acting.size() >= min_size doesn't necessarily mean avail_no_missing is not empty, as it looks to me that pg_acting comes from acting and avail_no_missing is constructed when PeeringState do "update_calc_stats". What I wanted to express here is that "if after stopping the specified osds would mean that there's no OSD who has all the up-to-date objects of the pg is still up in the cluster, it's not ok to stop even if the pg can go into PG_STATE_ACTIVE".

Now that think about it, I'm not clear how a pg with pg_acting.size() == min_size wouldn't be dangerous because stopping an OSD could then cause a PG to drop below min_size. Doesn't "available" imply writable.

Yes, but if, say, a pool's size is 5 and min_size is 2, the current implementation doesn't allow stopping anymore OSDs, even if currently there's only one OSD is down. Actually, according to our online clusters' "ops" team, cases like this happens. So we think maybe the current ok-to-stop is a little bit rigorous. If we want to avoid pg_acting.size() == min_size, we can modify the code to something like:

if (pg_acting.size() <= pi->min_size) ++dangerous_pgs;

Am I right?

I still think you aren't changing anything with this change. If pg_acting.size() == 0 then it will be less than min_size. Remember for degraded avail_no_missing == pg_acting which is also why you don't need the avail_no_missing set.

Um...it looks to me that this is not the case. avail_no_missing may be less than pg_acting.

should be pi->type != pg_pool_t::TYPE_ERASURE

Sorry, this must be a typo......will change it.

Yes, but if, say, a pool's size is 5 and min_size is 2, the current implementation doesn't allow stopping anymore OSDs, even if currently there's only one OSD is down. Actually, according to our online clusters' "ops" team, cases like this happens. So we think maybe the current ok-to-stop is a little bit rigorous. If we want to avoid pg_acting.size() == min_size, we can modify the code to something like:

If there is already an OSD down in your example, then pg_acting might be size 4 which is 4 < 2 (min_size) is NOT dangerous as expected. I don't see why that would or should prevent another OSD from being stopped.

if (pg_acting.size() <= pi->min_size) ++dangerous_pgs;

Am I right?

Yes, I'm not sure why the code doesn't already have <=. Of course, that is more stringent.

I still think you aren't changing anything with this change. If pg_acting.size() == 0 then it will be less than min_size. Remember for degraded avail_no_missing == pg_acting which is also why you don't need the avail_no_missing set.

Um...it looks to me that this is not the case. avail_no_missing may be less than pg_acting.

See my other comment. Don't confuse the added local avail_no_missing and q.second.avail_no_missing nor pg_acting and q.second.acting.

Yes, but if, say, a pool's size is 5 and min_size is 2, the current implementation doesn't allow stopping anymore OSDs, even if currently there's only one OSD is down. Actually, according to our online clusters' "ops" team, cases like this happens. So we think maybe the current ok-to-stop is a little bit rigorous. If we want to avoid pg_acting.size() == min_size, we can modify the code to something like:

If there is already an OSD down in your example, then pg_acting might be size 4 which is 4 < 2 (min_size) is NOT dangerous as expected. I don't see why that would or should prevent another OSD from being stopped.

Oh, sorry that I made this misunderstanding. What is preventing OSD to stop is not this line, but here

if (pg_acting.size() <= pi->min_size) ++dangerous_pgs;

Am I right?

Yes, I'm not sure why the code doesn't already have <=. Of course, that is more stringent.

I still think you aren't changing anything with this change. If pg_acting.size() == 0 then it will be less than min_size. Remember for degraded avail_no_missing == pg_acting which is also why you don't need the avail_no_missing set.

Um...it looks to me that this is not the case. avail_no_missing may be less than pg_acting.

See my other comment. Don't confuse the added local avail_no_missing and q.second.avail_no_missing nor pg_acting and q.second.acting.

Finally, I got your point. Will remove avail_no_missing set:-)

dzafman · 2021-02-09T16:51:05Z

src/mgr/DaemonServer.cc

 	      }
 	      if (anm.osd != CRUSH_ITEM_NONE) {
 		pg_acting.insert(anm.osd);
+		avail_no_missing.insert(anm.osd);


Here you see that pg_acting AND avail_no_missing is set from q.second.avail_no_missing for degraded PGs. There is no place for a degraded PG that they can diverge.

In summary:
non-degraded pg -> pg_acting from q.second.acting
degraded pg -> pg_acting AND avail_no_missing identically set from q.second.avail_no_missing

xxhdx1985126 · 2021-02-10T03:56:44Z

@dzafman I've just modified the code as suggested, please take a look again at your leisure, thanks:-)

liewegas · 2021-02-18T15:33:04Z

src/mgr/DaemonServer.cc

+	    ++dangerous_pgs;
+	    continue;
+	  }
+          if (pg_acting.size() <= pi->min_size) {


this changed from < to <=.. was that intentional? I think < is correct...

If pg_acting.size() == pi->min_size then wouldn't write access be lost of another OSD is stopped? Isn't that the point of ok-to-stop checking?

pg_acting is the anticipated acting set if the osd(s) are stopped (current acting, minus any of the osds provided as an argument)

@xxhdx1985126 Change this back to pg_acting.size() < pi->min_size

xxhdx1985126 · 2021-02-19T03:03:28Z

@dzafman Done:-)

liewegas

@dzafman, I'm still not sure about this, actually. If a PG is degraded, one or more of the OSDs in the acting set is missing objects. Is that what avail_no_missing is tracking? Which OSDs are missing no objects? Shouldn't we make sure that we have at least one of those in the final acting set? Otherwise, we might still end up with enough acting osds, but may have stopped the OSD(s) that have the only available copy of one or more objects...

liewegas · 2021-02-19T13:14:18Z

(@xuxuehan I definitely what to fix this! ok-to-stop is currently way too conservative.)

dzafman · 2021-02-19T18:03:16Z

@dzafman, I'm still not sure about this, actually. If a PG is degraded, one or more of the OSDs in the acting set is missing objects. Is that what avail_no_missing is tracking? Which OSDs are missing no objects? Shouldn't we make sure that we have at least one of those in the final acting set? Otherwise, we might still end up with enough acting osds, but may have stopped the OSD(s) that have the only available copy of one or more objects...

Yes, I assumed that avail_no_missing excludes OSDs with missing objects for that PG. The current code requires at least min_size OSDs with no missing objects for degraded PGs to exist. I don't understand how we can loosen this criteria. I think that this change originally was trying to allow read-only access for replicated pools that had at least 1 OSD with no missing. Writes would block which could not have been the original purpose of ok-to-stop.

liewegas · 2021-02-19T19:06:14Z

@dzafman, I'm still not sure about this, actually. If a PG is degraded, one or more of the OSDs in the acting set is missing objects. Is that what avail_no_missing is tracking? Which OSDs are missing no objects? Shouldn't we make sure that we have at least one of those in the final acting set? Otherwise, we might still end up with enough acting osds, but may have stopped the OSD(s) that have the only available copy of one or more objects...

Yes, I assumed that avail_no_missing excludes OSDs with missing objects for that PG. The current code requires at least min_size OSDs with no missing objects for degraded PGs to exist. I don't understand how we can loosen this criteria. I think that this change originally was trying to allow read-only access for replicated pools that had at least 1 OSD with no missing. Writes would block which could not have been the original purpose of ok-to-stop.

Oh! I missed that this code was already looking at avail_no_missing. I think this version is right. Thanks!!

liewegas · 2021-02-19T19:07:45Z

Sigh, I'm still confused...

liewegas · 2021-02-19T19:11:19Z

Okay, to summarize:

We build a pg_acting set out of OSDs that are in avail_no_missing (if degraded) or acting, MINUS any of the osds we're trying to stop.
Before, if the PG was in degraded state, we would always say it was dangerous, regardless of how big pg_acting was
Now, we still say any degraded erasure coded pool pg is dangerous to touch (because avail_no_missing isn't enough info?). But for replicated pools, we can proceed if the pg_acting set (built with only avail_no_missing OSDs) is >= min_size.
Right?

dzafman · 2021-02-19T19:42:34Z

Okay, to summarize:

We build a pg_acting set out of OSDs that are in avail_no_missing (if degraded) or acting, MINUS any of the osds we're trying to stop.

Before, if the PG was in degraded state, we would always say it was dangerous, regardless of how big pg_acting was

Now, we still say any degraded erasure coded pool pg is dangerous to touch (because avail_no_missing isn't enough info?). But for replicated pools, we can proceed if the pg_acting set (built with only avail_no_missing OSDs) is >= min_size.
Right?

@liewegas YES!!!!! I was confused but that summary makes it very clear.

dzafman

I'd like to see a tracker ticker, the commit message with the tracker specified and correction to the commit message which describes something that the final version doesn't do.

dzafman · 2021-02-19T19:56:44Z

@liewegas Now that you mention it, originally the code allowed 1 OSD to work presumably for read access (which was wrong). That case would only apply to replicated, so I asked for that be checked. Now in the final version, it should work for replicated or erasure coded.

liewegas · 2021-02-19T20:11:19Z

@liewegas Now that you mention it, originally the code allowed 1 OSD to work presumably for read access (which was wrong). That case would only apply to replicated, so I asked for that be checked. Now in the final version, it should work for replicated or erasure coded.

I opened https://tracker.ceph.com/issues/49392 ... @xuxuehan can you add this to the commit message?

@dzafman are you saying that check for replicated can be dropped? i.e., reduce this section

	  if (!(q.second.state & PG_STATE_ACTIVE) ||
              ((pi->type != pg_pool_t::TYPE_REPLICATED) &&
               (q.second.state & PG_STATE_DEGRADED))) {
	    ++dangerous_pgs;
	    continue;
	  }

to

	  if (!(q.second.state & PG_STATE_ACTIVE) {
	    ++dangerous_pgs;
	    continue;
	  }

?

dzafman · 2021-02-19T20:12:14Z

src/mgr/DaemonServer.cc

+               (q.second.state & PG_STATE_DEGRADED))) {
+	    ++dangerous_pgs;
+	    continue;
+	  }


@liewegas I made change 9750061 to add degraded handling but didn't remove the PG_STATE_DEGRADED check here. So shouldn't this just read?

if (!(q.second.state & PG_STATE_ACTIVE)) { ++dangerous_pgs; continue; }

liewegas · 2021-02-19T20:33:58Z

how does liewegas@mgr-relax-ok-to-stop look?

dzafman · 2021-02-19T21:05:56Z

how does liewegas@mgr-rel look?

perfect

dzafman · 2021-02-19T21:06:23Z

@xxhdx1985126 @liewegas Do we have a functional test for this?

xxhdx1985126 · 2021-02-21T07:11:59Z

@dzafman @liewegas Seems that the proposed commit is already inclosed in PR 39455, should I close this PR? :-)

…ds only replicated pools Right now, the "ok-to-stop" condition is relatively rigorous, it allows stopping an osd only if no PG on it is non-active or degraded. But there are situations in which an OSD is part of a degraded pg and the pg still still have > min_size complete replicas after the OSD is stopped. In [9750061](liewegas@9750061), we changed from considering just acting to using avail_no_missing (OSDs that have no missing objects). When the projected pg_acting is constructed this way, we can safely compare to min_size... even for a PG marked degraded. Fixes: https://tracker.ceph.com/issues/49392 Signed-off-by: Xuehan Xu <xxhdx1985126@gmail.com> Signed-off-by: Xuehan Xu <xxhdx1985126@gmail.com>

dzafman · 2021-02-22T19:03:57Z

Now included in #39455

github-actions bot added core mgr labels Feb 7, 2021

dzafman reviewed Feb 8, 2021

View reviewed changes

dzafman suggested changes Feb 8, 2021

View reviewed changes

xxhdx1985126 force-pushed the wip-osd-relax-ok_to_stop branch from 636a185 to 0c829f9 Compare February 9, 2021 02:25

dzafman reviewed Feb 9, 2021

View reviewed changes

xxhdx1985126 force-pushed the wip-osd-relax-ok_to_stop branch from 0c829f9 to ed947e7 Compare February 10, 2021 03:55

tchaikov mentioned this pull request Feb 18, 2021

mgr: add --max <n> to 'osd ok-to-stop' command #39455

Merged

liewegas reviewed Feb 18, 2021

View reviewed changes

xxhdx1985126 force-pushed the wip-osd-relax-ok_to_stop branch from ed947e7 to fd525de Compare February 19, 2021 03:03

liewegas approved these changes Feb 19, 2021

View reviewed changes

liewegas requested changes Feb 19, 2021

View reviewed changes

liewegas approved these changes Feb 19, 2021

View reviewed changes

dzafman approved these changes Feb 19, 2021

View reviewed changes

dzafman suggested changes Feb 19, 2021

View reviewed changes

dzafman reviewed Feb 19, 2021

View reviewed changes

xxhdx1985126 force-pushed the wip-osd-relax-ok_to_stop branch from fd525de to 10bb20d Compare February 21, 2021 07:14

dzafman closed this Feb 22, 2021

xxhdx1985126 deleted the wip-osd-relax-ok_to_stop branch March 2, 2021 04:30

Conversation

xxhdx1985126 commented Feb 7, 2021

Checklist

Uh oh!

xxhdx1985126 commented Feb 8, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dzafman left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dzafman commented Feb 8, 2021

Uh oh!

xxhdx1985126 commented Feb 9, 2021

Uh oh!

xxhdx1985126 commented Feb 9, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xxhdx1985126 Feb 9, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xxhdx1985126 commented Feb 10, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xxhdx1985126 commented Feb 19, 2021

Uh oh!

liewegas left a comment

Choose a reason for hiding this comment

Uh oh!

liewegas commented Feb 19, 2021

Uh oh!

dzafman commented Feb 19, 2021

Uh oh!

liewegas commented Feb 19, 2021

Uh oh!

liewegas commented Feb 19, 2021

Uh oh!

liewegas commented Feb 19, 2021

Uh oh!

dzafman commented Feb 19, 2021

Uh oh!

dzafman left a comment

Choose a reason for hiding this comment

Uh oh!

dzafman commented Feb 19, 2021

Uh oh!

liewegas commented Feb 19, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dzafman Feb 19, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xxhdx1985126 Feb 9, 2021 •

edited

Loading

liewegas commented Feb 19, 2021 •

edited

Loading

dzafman Feb 19, 2021 •

edited

Loading

xxhdx1985126 commented Feb 21, 2021 •

edited

Loading