Conversation
|
@dzafman Hi, could you take a look this PR, please? Thanks:-) |
src/mgr/DaemonServer.cc
Outdated
| if (pg_acting.size() < pi->min_size) { | ||
| if (((q.second.state & PG_STATE_DEGRADED) && | ||
| !avail_no_missing.size()) || | ||
| pg_acting.size() < pi->min_size) { |
There was a problem hiding this comment.
It seems to that if the pg is DEGRADED, then pg_acting == avail_no_missing. So avail_no_missing has no extra information.
There was a problem hiding this comment.
Since !avail_no_missing.size() == (pg_acting.size() == 0), then it has no change whether dangerous_pgs is incremented since 0 should always be < min_size.
There was a problem hiding this comment.
@dzafman Um...It seems that an OSD can be a member of pg_acting even if it has missing objects, so I think when the pg is DEGRADED, pg_acting may not be the same as avail_no_missing, right? And for the same reason, pg_acting may not be necessarily empty even if no OSD has all objects, and I think that's why we have a concept of "unfound objects". Am I right?:-)
There was a problem hiding this comment.
When a pg is degraded the pg_acting is based on avail_no_missing. If not, pg_acting comes from acting. In any case the size of pg_acting must be >= min_size. Unfound is for objects that aren't stored on any OSD.
Now that think about it, I'm not clear how a pg with pg_acting.size() == min_size wouldn't be dangerous because stopping an OSD could then cause a PG to drop below min_size. Doesn't "available" imply writable.
dzafman
left a comment
There was a problem hiding this comment.
As it stands this change should not be merged.
src/mgr/DaemonServer.cc
Outdated
| if (pg_acting.size() < pi->min_size) { | ||
| if (((q.second.state & PG_STATE_DEGRADED) && | ||
| !avail_no_missing.size()) || | ||
| pg_acting.size() < pi->min_size) { |
There was a problem hiding this comment.
Since !avail_no_missing.size() == (pg_acting.size() == 0), then it has no change whether dangerous_pgs is incremented since 0 should always be < min_size.
|
The description of the goals of this change "AND there are at least one OSD that have no missing |
So I think maybe I should distinguish replicated pools and ec pools, right? |
636a185 to
0c829f9
Compare
|
@dzafman hi, I've restrict the change to only affect osds of replicated pools, please take a look:-) Thanks. |
src/mgr/DaemonServer.cc
Outdated
| if (pg_acting.size() < pi->min_size) { | ||
| if (((q.second.state & PG_STATE_DEGRADED) && | ||
| !avail_no_missing.size() && | ||
| (pi->type == pg_pool_t::TYPE_ERASURE)) || |
There was a problem hiding this comment.
should be pi->type != pg_pool_t::TYPE_ERASURE
There was a problem hiding this comment.
I still think you aren't changing anything with this change. If pg_acting.size() == 0 then it will be less than min_size. Remember for degraded avail_no_missing == pg_acting which is also why you don't need the avail_no_missing set.
There was a problem hiding this comment.
When a pg is degraded the pg_acting is based on avail_no_missing. If not, pg_acting comes from acting. In any case the size of pg_acting must be >= min_size. Unfound is for objects that aren't stored on any OSD.
Yes, but pg_acting.size() >= min_size doesn't necessarily mean avail_no_missing is not empty, as it looks to me that pg_acting comes from acting and avail_no_missing is constructed when PeeringState do "update_calc_stats". What I wanted to express here is that "if after stopping the specified osds would mean that there's no OSD who has all the up-to-date objects of the pg is still up in the cluster, it's not ok to stop even if the pg can go into PG_STATE_ACTIVE".
Now that think about it, I'm not clear how a pg with pg_acting.size() == min_size wouldn't be dangerous because stopping an OSD could then cause a PG to drop below min_size. Doesn't "available" imply writable.
Yes, but if, say, a pool's size is 5 and min_size is 2, the current implementation doesn't allow stopping anymore OSDs, even if currently there's only one OSD is down. Actually, according to our online clusters' "ops" team, cases like this happens. So we think maybe the current ok-to-stop is a little bit rigorous. If we want to avoid pg_acting.size() == min_size, we can modify the code to something like:
if (pg_acting.size() <= pi->min_size)
++dangerous_pgs;
Am I right?
I still think you aren't changing anything with this change. If pg_acting.size() == 0 then it will be less than min_size. Remember for degraded avail_no_missing == pg_acting which is also why you don't need the avail_no_missing set.
Um...it looks to me that this is not the case. avail_no_missing may be less than pg_acting.
There was a problem hiding this comment.
should be pi->type != pg_pool_t::TYPE_ERASURE
Sorry, this must be a typo......will change it.
There was a problem hiding this comment.
Yes, but if, say, a pool's size is 5 and min_size is 2, the current implementation doesn't allow stopping anymore OSDs, even if currently there's only one OSD is down. Actually, according to our online clusters' "ops" team, cases like this happens. So we think maybe the current ok-to-stop is a little bit rigorous. If we want to avoid pg_acting.size() == min_size, we can modify the code to something like:
If there is already an OSD down in your example, then pg_acting might be size 4 which is 4 < 2 (min_size) is NOT dangerous as expected. I don't see why that would or should prevent another OSD from being stopped.
if (pg_acting.size() <= pi->min_size) ++dangerous_pgs;Am I right?
Yes, I'm not sure why the code doesn't already have <=. Of course, that is more stringent.
I still think you aren't changing anything with this change. If pg_acting.size() == 0 then it will be less than min_size. Remember for degraded avail_no_missing == pg_acting which is also why you don't need the avail_no_missing set.
Um...it looks to me that this is not the case. avail_no_missing may be less than pg_acting.
See my other comment. Don't confuse the added local avail_no_missing and q.second.avail_no_missing nor pg_acting and q.second.acting.
There was a problem hiding this comment.
Yes, but if, say, a pool's size is 5 and min_size is 2, the current implementation doesn't allow stopping anymore OSDs, even if currently there's only one OSD is down. Actually, according to our online clusters' "ops" team, cases like this happens. So we think maybe the current ok-to-stop is a little bit rigorous. If we want to avoid pg_acting.size() == min_size, we can modify the code to something like:
If there is already an OSD down in your example, then pg_acting might be size 4 which is 4 < 2 (min_size) is NOT dangerous as expected. I don't see why that would or should prevent another OSD from being stopped.
Oh, sorry that I made this misunderstanding. What is preventing OSD to stop is not this line, but here
if (pg_acting.size() <= pi->min_size) ++dangerous_pgs;Am I right?
Yes, I'm not sure why the code doesn't already have <=. Of course, that is more stringent.
I still think you aren't changing anything with this change. If pg_acting.size() == 0 then it will be less than min_size. Remember for degraded avail_no_missing == pg_acting which is also why you don't need the avail_no_missing set.
Um...it looks to me that this is not the case. avail_no_missing may be less than pg_acting.
See my other comment. Don't confuse the added local avail_no_missing and q.second.avail_no_missing nor pg_acting and q.second.acting.
Finally, I got your point. Will remove avail_no_missing set:-)
src/mgr/DaemonServer.cc
Outdated
| } | ||
| if (anm.osd != CRUSH_ITEM_NONE) { | ||
| pg_acting.insert(anm.osd); | ||
| avail_no_missing.insert(anm.osd); |
There was a problem hiding this comment.
Here you see that pg_acting AND avail_no_missing is set from q.second.avail_no_missing for degraded PGs. There is no place for a degraded PG that they can diverge.
In summary:
non-degraded pg -> pg_acting from q.second.acting
degraded pg -> pg_acting AND avail_no_missing identically set from q.second.avail_no_missing
0c829f9 to
ed947e7
Compare
|
@dzafman I've just modified the code as suggested, please take a look again at your leisure, thanks:-) |
src/mgr/DaemonServer.cc
Outdated
| ++dangerous_pgs; | ||
| continue; | ||
| } | ||
| if (pg_acting.size() <= pi->min_size) { |
There was a problem hiding this comment.
this changed from < to <=.. was that intentional? I think < is correct...
There was a problem hiding this comment.
If pg_acting.size() == pi->min_size then wouldn't write access be lost of another OSD is stopped? Isn't that the point of ok-to-stop checking?
There was a problem hiding this comment.
pg_acting is the anticipated acting set if the osd(s) are stopped (current acting, minus any of the osds provided as an argument)
There was a problem hiding this comment.
@xxhdx1985126 Change this back to pg_acting.size() < pi->min_size
ed947e7 to
fd525de
Compare
|
@dzafman Done:-) |
liewegas
left a comment
There was a problem hiding this comment.
@dzafman, I'm still not sure about this, actually. If a PG is degraded, one or more of the OSDs in the acting set is missing objects. Is that what avail_no_missing is tracking? Which OSDs are missing no objects? Shouldn't we make sure that we have at least one of those in the final acting set? Otherwise, we might still end up with enough acting osds, but may have stopped the OSD(s) that have the only available copy of one or more objects...
|
(@xuxuehan I definitely what to fix this! ok-to-stop is currently way too conservative.) |
Yes, I assumed that avail_no_missing excludes OSDs with missing objects for that PG. The current code requires at least min_size OSDs with no missing objects for degraded PGs to exist. I don't understand how we can loosen this criteria. I think that this change originally was trying to allow read-only access for replicated pools that had at least 1 OSD with no missing. Writes would block which could not have been the original purpose of ok-to-stop. |
Oh! I missed that this code was already looking at avail_no_missing. I think this version is right. Thanks!! |
|
Sigh, I'm still confused... |
|
Okay, to summarize:
|
@liewegas YES!!!!! I was confused but that summary makes it very clear. |
dzafman
left a comment
There was a problem hiding this comment.
I'd like to see a tracker ticker, the commit message with the tracker specified and correction to the commit message which describes something that the final version doesn't do.
|
@liewegas Now that you mention it, originally the code allowed 1 OSD to work presumably for read access (which was wrong). That case would only apply to replicated, so I asked for that be checked. Now in the final version, it should work for replicated or erasure coded. |
I opened https://tracker.ceph.com/issues/49392 ... @xuxuehan can you add this to the commit message? @dzafman are you saying that check for replicated can be dropped? i.e., reduce this section to ? |
| (q.second.state & PG_STATE_DEGRADED))) { | ||
| ++dangerous_pgs; | ||
| continue; | ||
| } |
|
how does liewegas@mgr-relax-ok-to-stop look? |
perfect |
|
@xxhdx1985126 @liewegas Do we have a functional test for this? |
…ds only replicated pools Right now, the "ok-to-stop" condition is relatively rigorous, it allows stopping an osd only if no PG on it is non-active or degraded. But there are situations in which an OSD is part of a degraded pg and the pg still still have > min_size complete replicas after the OSD is stopped. In [9750061](liewegas@9750061), we changed from considering just acting to using avail_no_missing (OSDs that have no missing objects). When the projected pg_acting is constructed this way, we can safely compare to min_size... even for a PG marked degraded. Fixes: https://tracker.ceph.com/issues/49392 Signed-off-by: Xuehan Xu <xxhdx1985126@gmail.com> Signed-off-by: Xuehan Xu <xxhdx1985126@gmail.com>
fd525de to
10bb20d
Compare
|
Now included in #39455 |
Right now, the "ok-to-stop" condition is relatively rigorous, it allows
stopping an osd only if no PG on it is non-active or degraded. But there
are situations in which people want to concurrently stop multiple osds even if
they share pgs, as long as that wouldn't cause client requests to block.
This commit relax the "ok-to-stop" condition to something like: allow
osd to stop as long as all pgs on it are active AND there are more-than-min_size
pgs that are still up/in AND there are at least one OSD that have no missing
objects.
Signed-off-by: Xuehan Xu xxhdx1985126@gmail.com
Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume tox