Skip to content

mgr: relax ok-to-stop condition#39335

Closed
xxhdx1985126 wants to merge 1 commit intoceph:masterfrom
xxhdx1985126:wip-osd-relax-ok_to_stop
Closed

mgr: relax ok-to-stop condition#39335
xxhdx1985126 wants to merge 1 commit intoceph:masterfrom
xxhdx1985126:wip-osd-relax-ok_to_stop

Conversation

@xxhdx1985126
Copy link
Contributor

Right now, the "ok-to-stop" condition is relatively rigorous, it allows
stopping an osd only if no PG on it is non-active or degraded. But there
are situations in which people want to concurrently stop multiple osds even if
they share pgs, as long as that wouldn't cause client requests to block.

This commit relax the "ok-to-stop" condition to something like: allow
osd to stop as long as all pgs on it are active AND there are more-than-min_size
pgs that are still up/in AND there are at least one OSD that have no missing
objects.

Signed-off-by: Xuehan Xu xxhdx1985126@gmail.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@xxhdx1985126
Copy link
Contributor Author

@dzafman Hi, could you take a look this PR, please? Thanks:-)

if (pg_acting.size() < pi->min_size) {
if (((q.second.state & PG_STATE_DEGRADED) &&
!avail_no_missing.size()) ||
pg_acting.size() < pi->min_size) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to that if the pg is DEGRADED, then pg_acting == avail_no_missing. So avail_no_missing has no extra information.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since !avail_no_missing.size() == (pg_acting.size() == 0), then it has no change whether dangerous_pgs is incremented since 0 should always be < min_size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dzafman Um...It seems that an OSD can be a member of pg_acting even if it has missing objects, so I think when the pg is DEGRADED, pg_acting may not be the same as avail_no_missing, right? And for the same reason, pg_acting may not be necessarily empty even if no OSD has all objects, and I think that's why we have a concept of "unfound objects". Am I right?:-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a pg is degraded the pg_acting is based on avail_no_missing. If not, pg_acting comes from acting. In any case the size of pg_acting must be >= min_size. Unfound is for objects that aren't stored on any OSD.

Now that think about it, I'm not clear how a pg with pg_acting.size() == min_size wouldn't be dangerous because stopping an OSD could then cause a PG to drop below min_size. Doesn't "available" imply writable.

Copy link
Contributor

@dzafman dzafman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it stands this change should not be merged.

if (pg_acting.size() < pi->min_size) {
if (((q.second.state & PG_STATE_DEGRADED) &&
!avail_no_missing.size()) ||
pg_acting.size() < pi->min_size) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since !avail_no_missing.size() == (pg_acting.size() == 0), then it has no change whether dangerous_pgs is incremented since 0 should always be < min_size.

@dzafman
Copy link
Contributor

dzafman commented Feb 8, 2021

The description of the goals of this change "AND there are at least one OSD that have no missing
objects.", based on the code should read "OR there are at least one OSD that have no missing
objects." This means that the change would only apply to replicated pools, since erasure coded pools need more than 1 replica to rebuild.

@xxhdx1985126
Copy link
Contributor Author

The description of the goals of this change "AND there are at least one OSD that have no missing
objects.", based on the code should read "OR there are at least one OSD that have no missing
objects." This means that the change would only apply to replicated pools, since erasure coded pools need more than 1 replica to rebuild.

So I think maybe I should distinguish replicated pools and ec pools, right?

@xxhdx1985126 xxhdx1985126 force-pushed the wip-osd-relax-ok_to_stop branch from 636a185 to 0c829f9 Compare February 9, 2021 02:25
@xxhdx1985126
Copy link
Contributor Author

@dzafman hi, I've restrict the change to only affect osds of replicated pools, please take a look:-) Thanks.

if (pg_acting.size() < pi->min_size) {
if (((q.second.state & PG_STATE_DEGRADED) &&
!avail_no_missing.size() &&
(pi->type == pg_pool_t::TYPE_ERASURE)) ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be pi->type != pg_pool_t::TYPE_ERASURE

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think you aren't changing anything with this change. If pg_acting.size() == 0 then it will be less than min_size. Remember for degraded avail_no_missing == pg_acting which is also why you don't need the avail_no_missing set.

Copy link
Contributor Author

@xxhdx1985126 xxhdx1985126 Feb 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a pg is degraded the pg_acting is based on avail_no_missing. If not, pg_acting comes from acting. In any case the size of pg_acting must be >= min_size. Unfound is for objects that aren't stored on any OSD.

Yes, but pg_acting.size() >= min_size doesn't necessarily mean avail_no_missing is not empty, as it looks to me that pg_acting comes from acting and avail_no_missing is constructed when PeeringState do "update_calc_stats". What I wanted to express here is that "if after stopping the specified osds would mean that there's no OSD who has all the up-to-date objects of the pg is still up in the cluster, it's not ok to stop even if the pg can go into PG_STATE_ACTIVE".

Now that think about it, I'm not clear how a pg with pg_acting.size() == min_size wouldn't be dangerous because stopping an OSD could then cause a PG to drop below min_size. Doesn't "available" imply writable.

Yes, but if, say, a pool's size is 5 and min_size is 2, the current implementation doesn't allow stopping anymore OSDs, even if currently there's only one OSD is down. Actually, according to our online clusters' "ops" team, cases like this happens. So we think maybe the current ok-to-stop is a little bit rigorous. If we want to avoid pg_acting.size() == min_size, we can modify the code to something like:

if (pg_acting.size() <= pi->min_size)
    ++dangerous_pgs;

Am I right?

I still think you aren't changing anything with this change. If pg_acting.size() == 0 then it will be less than min_size. Remember for degraded avail_no_missing == pg_acting which is also why you don't need the avail_no_missing set.

Um...it looks to me that this is not the case. avail_no_missing may be less than pg_acting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be pi->type != pg_pool_t::TYPE_ERASURE

Sorry, this must be a typo......will change it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but if, say, a pool's size is 5 and min_size is 2, the current implementation doesn't allow stopping anymore OSDs, even if currently there's only one OSD is down. Actually, according to our online clusters' "ops" team, cases like this happens. So we think maybe the current ok-to-stop is a little bit rigorous. If we want to avoid pg_acting.size() == min_size, we can modify the code to something like:

If there is already an OSD down in your example, then pg_acting might be size 4 which is 4 < 2 (min_size) is NOT dangerous as expected. I don't see why that would or should prevent another OSD from being stopped.

if (pg_acting.size() <= pi->min_size)
    ++dangerous_pgs;

Am I right?

Yes, I'm not sure why the code doesn't already have <=. Of course, that is more stringent.

I still think you aren't changing anything with this change. If pg_acting.size() == 0 then it will be less than min_size. Remember for degraded avail_no_missing == pg_acting which is also why you don't need the avail_no_missing set.

Um...it looks to me that this is not the case. avail_no_missing may be less than pg_acting.

See my other comment. Don't confuse the added local avail_no_missing and q.second.avail_no_missing nor pg_acting and q.second.acting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but if, say, a pool's size is 5 and min_size is 2, the current implementation doesn't allow stopping anymore OSDs, even if currently there's only one OSD is down. Actually, according to our online clusters' "ops" team, cases like this happens. So we think maybe the current ok-to-stop is a little bit rigorous. If we want to avoid pg_acting.size() == min_size, we can modify the code to something like:

If there is already an OSD down in your example, then pg_acting might be size 4 which is 4 < 2 (min_size) is NOT dangerous as expected. I don't see why that would or should prevent another OSD from being stopped.

Oh, sorry that I made this misunderstanding. What is preventing OSD to stop is not this line, but here

if (pg_acting.size() <= pi->min_size)
    ++dangerous_pgs;

Am I right?

Yes, I'm not sure why the code doesn't already have <=. Of course, that is more stringent.

I still think you aren't changing anything with this change. If pg_acting.size() == 0 then it will be less than min_size. Remember for degraded avail_no_missing == pg_acting which is also why you don't need the avail_no_missing set.

Um...it looks to me that this is not the case. avail_no_missing may be less than pg_acting.

See my other comment. Don't confuse the added local avail_no_missing and q.second.avail_no_missing nor pg_acting and q.second.acting.

Finally, I got your point. Will remove avail_no_missing set:-)

}
if (anm.osd != CRUSH_ITEM_NONE) {
pg_acting.insert(anm.osd);
avail_no_missing.insert(anm.osd);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you see that pg_acting AND avail_no_missing is set from q.second.avail_no_missing for degraded PGs. There is no place for a degraded PG that they can diverge.

In summary:
non-degraded pg -> pg_acting from q.second.acting
degraded pg -> pg_acting AND avail_no_missing identically set from q.second.avail_no_missing

@xxhdx1985126 xxhdx1985126 force-pushed the wip-osd-relax-ok_to_stop branch from 0c829f9 to ed947e7 Compare February 10, 2021 03:55
@xxhdx1985126
Copy link
Contributor Author

@dzafman I've just modified the code as suggested, please take a look again at your leisure, thanks:-)

++dangerous_pgs;
continue;
}
if (pg_acting.size() <= pi->min_size) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this changed from < to <=.. was that intentional? I think < is correct...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If pg_acting.size() == pi->min_size then wouldn't write access be lost of another OSD is stopped? Isn't that the point of ok-to-stop checking?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pg_acting is the anticipated acting set if the osd(s) are stopped (current acting, minus any of the osds provided as an argument)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xxhdx1985126 Change this back to pg_acting.size() < pi->min_size

@xxhdx1985126 xxhdx1985126 force-pushed the wip-osd-relax-ok_to_stop branch from ed947e7 to fd525de Compare February 19, 2021 03:03
@xxhdx1985126
Copy link
Contributor Author

@dzafman Done:-)

Copy link
Member

@liewegas liewegas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dzafman, I'm still not sure about this, actually. If a PG is degraded, one or more of the OSDs in the acting set is missing objects. Is that what avail_no_missing is tracking? Which OSDs are missing no objects? Shouldn't we make sure that we have at least one of those in the final acting set? Otherwise, we might still end up with enough acting osds, but may have stopped the OSD(s) that have the only available copy of one or more objects...

@liewegas
Copy link
Member

(@xuxuehan I definitely what to fix this! ok-to-stop is currently way too conservative.)

@dzafman
Copy link
Contributor

dzafman commented Feb 19, 2021

@dzafman, I'm still not sure about this, actually. If a PG is degraded, one or more of the OSDs in the acting set is missing objects. Is that what avail_no_missing is tracking? Which OSDs are missing no objects? Shouldn't we make sure that we have at least one of those in the final acting set? Otherwise, we might still end up with enough acting osds, but may have stopped the OSD(s) that have the only available copy of one or more objects...

Yes, I assumed that avail_no_missing excludes OSDs with missing objects for that PG. The current code requires at least min_size OSDs with no missing objects for degraded PGs to exist. I don't understand how we can loosen this criteria. I think that this change originally was trying to allow read-only access for replicated pools that had at least 1 OSD with no missing. Writes would block which could not have been the original purpose of ok-to-stop.

@liewegas
Copy link
Member

@dzafman, I'm still not sure about this, actually. If a PG is degraded, one or more of the OSDs in the acting set is missing objects. Is that what avail_no_missing is tracking? Which OSDs are missing no objects? Shouldn't we make sure that we have at least one of those in the final acting set? Otherwise, we might still end up with enough acting osds, but may have stopped the OSD(s) that have the only available copy of one or more objects...

Yes, I assumed that avail_no_missing excludes OSDs with missing objects for that PG. The current code requires at least min_size OSDs with no missing objects for degraded PGs to exist. I don't understand how we can loosen this criteria. I think that this change originally was trying to allow read-only access for replicated pools that had at least 1 OSD with no missing. Writes would block which could not have been the original purpose of ok-to-stop.

Oh! I missed that this code was already looking at avail_no_missing. I think this version is right. Thanks!!

@liewegas
Copy link
Member

Sigh, I'm still confused...

@liewegas
Copy link
Member

Okay, to summarize:

  • We build a pg_acting set out of OSDs that are in avail_no_missing (if degraded) or acting, MINUS any of the osds we're trying to stop.
  • Before, if the PG was in degraded state, we would always say it was dangerous, regardless of how big pg_acting was
  • Now, we still say any degraded erasure coded pool pg is dangerous to touch (because avail_no_missing isn't enough info?). But for replicated pools, we can proceed if the pg_acting set (built with only avail_no_missing OSDs) is >= min_size.
    Right?

@dzafman
Copy link
Contributor

dzafman commented Feb 19, 2021

Okay, to summarize:

  • We build a pg_acting set out of OSDs that are in avail_no_missing (if degraded) or acting, MINUS any of the osds we're trying to stop.
  • Before, if the PG was in degraded state, we would always say it was dangerous, regardless of how big pg_acting was
  • Now, we still say any degraded erasure coded pool pg is dangerous to touch (because avail_no_missing isn't enough info?). But for replicated pools, we can proceed if the pg_acting set (built with only avail_no_missing OSDs) is >= min_size.
    Right?

@liewegas YES!!!!! I was confused but that summary makes it very clear.

Copy link
Contributor

@dzafman dzafman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see a tracker ticker, the commit message with the tracker specified and correction to the commit message which describes something that the final version doesn't do.

@dzafman
Copy link
Contributor

dzafman commented Feb 19, 2021

@liewegas Now that you mention it, originally the code allowed 1 OSD to work presumably for read access (which was wrong). That case would only apply to replicated, so I asked for that be checked. Now in the final version, it should work for replicated or erasure coded.

@liewegas
Copy link
Member

liewegas commented Feb 19, 2021

@liewegas Now that you mention it, originally the code allowed 1 OSD to work presumably for read access (which was wrong). That case would only apply to replicated, so I asked for that be checked. Now in the final version, it should work for replicated or erasure coded.

I opened https://tracker.ceph.com/issues/49392 ... @xuxuehan can you add this to the commit message?

@dzafman are you saying that check for replicated can be dropped? i.e., reduce this section

	  if (!(q.second.state & PG_STATE_ACTIVE) ||
              ((pi->type != pg_pool_t::TYPE_REPLICATED) &&
               (q.second.state & PG_STATE_DEGRADED))) {
	    ++dangerous_pgs;
	    continue;
	  }

to

	  if (!(q.second.state & PG_STATE_ACTIVE) {
	    ++dangerous_pgs;
	    continue;
	  }

?

(q.second.state & PG_STATE_DEGRADED))) {
++dangerous_pgs;
continue;
}
Copy link
Contributor

@dzafman dzafman Feb 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liewegas I made change 9750061 to add degraded handling but didn't remove the PG_STATE_DEGRADED check here. So shouldn't this just read?

if (!(q.second.state & PG_STATE_ACTIVE)) {
	    ++dangerous_pgs;
 	    continue;
 	  }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep!

@liewegas
Copy link
Member

how does liewegas@mgr-relax-ok-to-stop look?

@dzafman
Copy link
Contributor

dzafman commented Feb 19, 2021

how does liewegas@mgr-rel look?

perfect

@dzafman
Copy link
Contributor

dzafman commented Feb 19, 2021

@xxhdx1985126 @liewegas Do we have a functional test for this?

@xxhdx1985126
Copy link
Contributor Author

xxhdx1985126 commented Feb 21, 2021

@dzafman @liewegas Seems that the proposed commit is already inclosed in PR 39455, should I close this PR? :-)

…ds only replicated pools

Right now, the "ok-to-stop" condition is relatively rigorous, it allows
stopping an osd only if no PG on it is non-active or degraded. But there
are situations in which an OSD is part of a degraded pg and the pg still
still have > min_size complete replicas after the OSD is stopped.

In [9750061](liewegas@9750061),
we changed from considering just acting to using avail_no_missing (OSDs that have no missing
objects). When the projected pg_acting is constructed this way, we can safely compare to
min_size... even for a PG marked degraded.

Fixes: https://tracker.ceph.com/issues/49392
Signed-off-by: Xuehan Xu <xxhdx1985126@gmail.com>

Signed-off-by: Xuehan Xu <xxhdx1985126@gmail.com>
@xxhdx1985126 xxhdx1985126 force-pushed the wip-osd-relax-ok_to_stop branch from fd525de to 10bb20d Compare February 21, 2021 07:14
@dzafman
Copy link
Contributor

dzafman commented Feb 22, 2021

Now included in #39455

@dzafman dzafman closed this Feb 22, 2021
@xxhdx1985126 xxhdx1985126 deleted the wip-osd-relax-ok_to_stop branch March 2, 2021 04:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants