mon [stretch mode]: support disable_stretch_mode #59483
.. describe:: {crush_rule}

The CRUSH rule that the user wants all the pools to move back to. If this
is not specified, the cluster will move back to the default CRUSH rule.

All pools will move its ``size`` and ``min_size``
back to the default values it started with.
At this point the user is responsible for scaling down the cluster
to the desired number of OSDs.
Why would exiting stretch mode necessarily mean taking out OSDs?
One of the use cases is that the user may encounter a whole DC failure and doesn't want to bother bringing the failed DC back up and wants to continue using the surviving DC without stretch mode.
I suspected as much, but as written this might confuse a user who thinks that scaling OSDs is necessary in all cases.
@anthonyeleven
That is a good point. I added:
At this point the user is responsible for scaling down the cluster
to the desired number of OSDs if they choose to operate with less number of OSDs.
Let me know if you think this clarifies the point I'm trying to make, or whether I should just delete this sentence completely.
Thanks
Wrote a Python integration test; however, I still need to modify the command to include --yes-i-really-really-mean-it
Problem:
Currently, Ceph lacks the ability to exit stretch mode and move back to normal cluster (non-stretched).

Solution:
Provide a command to allow the user to exit stretch mode gracefully:
`ceph mon disable_stretch_mode <crush_rule> --yes-i-really-mean-it`
User can either specify a crush rule that they want all pools to move to or not specify a rule and Ceph will use a default replicated crush rule.

Fixes: https://tracker.ceph.com/issues/67467
Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>

Added documentation about exiting stretch mode.

Fixes: https://tracker.ceph.com/issues/67467
Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
log-ignorelist:
  - overall HEALTH_
  - \(OSDMAP_FLAGS\)
  - \(OSD_
  - \(PG_
  - \(POOL_
  - \(CACHE_POOL_
  - \(OBJECT_
  - \(SLOW_OPS\)
  - \(REQUEST_SLOW\)
  - \(TOO_FEW_PGS\)
  - slow request
  - \(POOL_APP_NOT_ENABLED\)
  - overall HEALTH_
  - \(MGR_DOWN\)
  - \(MON_DOWN\)
  - \(PG_AVAILABILITY\)
  - \(SLOW_OPS\)
@kamoltat do you think we can eliminate a few of the log-ignorelist entries? Or should we split it so that we have separate ignorelists for before and after the conversion?
My concern is that PG_* will be ignored and we will miss important warnings after the conversion has happened.
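To make the concern concrete, here is a minimal Python sketch of how a broad entry such as `\(PG_` hides a degraded-PG warning. It assumes each ignorelist entry is applied as an unanchored regex search against every cluster-log line; the two log lines are hypothetical, written only for illustration.

```python
import re

# Two entries taken from the log-ignorelist in this thread.
# Assumption: each entry is used as an unanchored regex per log line.
IGNORELIST = [r"\(PG_", r"\(MON_DOWN\)"]


def is_ignored(line, patterns=IGNORELIST):
    """Return True if any ignorelist pattern matches somewhere in the line."""
    return any(re.search(pattern, line) for pattern in patterns)


# Hypothetical cluster-log lines, for illustration only.
degraded = "cluster [WRN] Health check failed: Degraded data redundancy (PG_DEGRADED)"
osd_down = "cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)"

print(is_ignored(degraded))  # the broad \(PG_ entry swallows PG_DEGRADED
print(is_ignored(osd_down))  # no entry in this short list covers OSD_DOWN
```

Narrowing `\(PG_` to specific codes (for example `\(PG_AVAILABILITY\)`) would leave other PG warnings visible to the log check.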
That's a good point, @NitzanMordhai. Let me remove most of the ones that can affect the integrity of the test.
@NitzanMordhai I had to put in things like MON_DOWN and PG_AVAILABILITY since my test involves a lot of failover. Honestly, this grep for warning in the cluster log adds more issues than it actually solves. Maybe worth discussing with the team about it.
Test disabling stretch mode with the following scenario:
1. Healthy Stretch Mode
2. Degraded Stretch Mode

Fixes: https://tracker.ceph.com/issues/67467
Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>

Problem:
Current dump for "removed_ranks" and "disallowed_leaders" doesn't have the correct format so the python test script can parse through these values.

Solution:
Modified the values such that it is in the correct format

Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
Passed 1/1 test after addressing Nitzan's remarks on the ignore list.
jenkins test windows
jenkins test api
hmm something wrong with build system ...
jenkins test windows
@NitzanMordhai let me know what you think of the new change on the log-ignorelist
@kamoltat I have an issue with PG_DEGRADED. Let's say you moved between modes and a PG gets stuck in PG_DEGRADED: how will you catch such an error? The test will ignore it. This issue is not specific to this test; we have more suites with the same problem. But we can check at the last iteration whether any warning is still present before closing the test (if it has already cleared, we are OK with that).
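A rough sketch of such a final check, in Python. It assumes the health report is available as JSON with a top-level `checks` mapping keyed by health-check code (the shape `ceph health detail --format json` is understood to produce); the helper name `unexpected_health_checks` and the sample payload are hypothetical.

```python
import json


def unexpected_health_checks(health_json, tolerated=()):
    """Return the sorted health-check codes that are active and not tolerated."""
    health = json.loads(health_json)
    active = health.get("checks", {})
    return sorted(code for code in active if code not in tolerated)


# Hypothetical health report, for illustration only.
sample = json.dumps({
    "status": "HEALTH_WARN",
    "checks": {
        "PG_DEGRADED": {"severity": "HEALTH_WARN"},
        "MON_DOWN": {"severity": "HEALTH_WARN"},
    },
})

# Tolerate MON_DOWN because of the failover; anything else is reported.
print(unexpected_health_checks(sample, tolerated=("MON_DOWN",)))
```

Run once at the end of the test, a non-empty result would flag a warning that is still present, while warnings that already cleared would not appear.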
@NitzanMordhai I understand your concern. However, in my test I do check that all PGs are active+clean when switching between the modes. The only time I am willing to tolerate degraded PGs is when I am testing the case where the stretch cluster is in degraded stretch mode (PGs are expected to be degraded because we lost one DC) and the user is disabling stretch mode and going back to a normal cluster. Therefore, I think grepping for PG_DEGRADED at the end of this test won't help us in this case.
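For reference, the active+clean check described here could look roughly like the sketch below. It is not the code from the test in this PR; it assumes a `pgmap` dictionary with `num_pgs` and `pgs_by_state` entries (the shape found in `ceph status --format json`), and the helper name `all_pgs_active_clean` is hypothetical.

```python
def all_pgs_active_clean(pgmap):
    """Return True only if every PG in the pgmap is exactly active+clean."""
    num_pgs = pgmap.get("num_pgs", 0)
    clean = sum(
        state["count"]
        for state in pgmap.get("pgs_by_state", [])
        if state["state_name"] == "active+clean"
    )
    # An empty cluster (no PGs) is treated as "not ready" rather than clean.
    return num_pgs > 0 and clean == num_pgs


# Hypothetical pgmap samples, for illustration only.
healthy = {"num_pgs": 33, "pgs_by_state": [{"state_name": "active+clean", "count": 33}]}
degraded = {
    "num_pgs": 33,
    "pgs_by_state": [
        {"state_name": "active+clean", "count": 20},
        {"state_name": "active+undersized+degraded", "count": 13},
    ],
}

print(all_pgs_active_clean(healthy))
print(all_pgs_active_clean(degraded))
```

In the degraded stretch mode scenario the check would be skipped or relaxed before the switch, since degraded PGs are expected there.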
@NitzanMordhai ping, let me know what you think of the comment above
Well, in that case, that's perfectly fine! Then we are covered and are not allowing this situation. Thanks for checking that!
RADOS approved; 119/153 tests passed.
Problem:
Currently, Ceph lacks the ability
to exit stretch mode and move back
to normal cluster (non-stretched).
Solution:
Provide a command to allow
the user to exit stretch mode gracefully:
ceph mon disable_stretch_mode <crush_rule> --yes-i-really-mean-it
Users can either specify a crush rule that
they want all pools to move to or not specify
a rule and Ceph will use a default replicated crush rule.
Fixes: https://tracker.ceph.com/issues/67467
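As an illustration of the two invocation forms described above, a small Python helper that builds the argument list might look like this. The helper name `disable_stretch_mode_argv` and the rule name `replicated_rule` are hypothetical; the command and confirmation flag are taken from the description as written.

```python
def disable_stretch_mode_argv(crush_rule=None):
    """Build the argv for the disable_stretch_mode command from the description.

    crush_rule is optional: when omitted, the description says Ceph falls
    back to a default replicated crush rule.
    """
    argv = ["ceph", "mon", "disable_stretch_mode"]
    if crush_rule is not None:
        argv.append(crush_rule)
    argv.append("--yes-i-really-mean-it")
    return argv


# With an explicit rule ("replicated_rule" is only an example name).
print(" ".join(disable_stretch_mode_argv("replicated_rule")))
# Without a rule, leaving the choice to Ceph.
print(" ".join(disable_stretch_mode_argv()))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a host with admin access to the cluster.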
TODO