crush: add multistep retry rules#55332
Conversation
|
@neha-ojha @rzarzynski @Matan-B I could use at least one more detailed review if any of you have time. |
|
Review-in-progress. |
|
A first iteration of testing passed here: https://tracker.ceph.com/projects/rados/wiki/MAIN#httpstrellocomctbflL3Ja1936-wip-yuri3-testing-2024-01-22-1155 |
gregsfortytwo
left a comment
There was a problem hiding this comment.
I'm partway through reviewing this and just got to the meat of the changes, and it definitely needs documentation work before it can merge. The user-visible documentation simply says to use MSR rules when you need multiple OSDs per failure domain, but doesn't provide any other orientation.
In the source code, I did eventually find function documentation in mapper.c. But I had to dig through quite a lot — there's nothing in the header file; there's no pointers to it; there's no "here is why we built this new CRUSH implementation" anywhere. I'd really like to see something like we have for stretch clusters providing a broader view: https://docs.ceph.com/en/reef/rados/operations/stretch-mode/ Without that, it's really hard for anybody to get into without a personal briefing.
src/mon/OSDMonitor.cc
Outdated
| if (mv > newmap.require_osd_release) { | ||
| ss << "new crush map requires client version " << mv | ||
| << " but require_osd_release is " | ||
| << newmap.require_osd_release; |
There was a problem hiding this comment.
We definitely need this, but I don't think the commit message is right: OSDs are checked against require_osd_reelase, and this function was only ever checking the require_min_compat_client which is only checked against client/MDS/MGR as best I can tell.
I don't think that means we need any other changes, but just checking?
There was a problem hiding this comment.
Hmm, I shouldn't have put it in this block with the client compat check -- not sure why I did that.
However, immediately below, this function does check the newmap features against existing OSDs via check_cluster_features.
That said, I'm going to drop this commit for now. I think this is actually a problem with check_cluster_features generally. It's used to gate mon commands based on what OSDs support, but the check should really be against a bar we use to disallow OSDs from starting, not against what the OSDs that happen to be running use. https://tracker.ceph.com/issues/64257
There was a problem hiding this comment.
Stretch mode actually has its own check in preprocess_boot. I suppose I can add a check there.
| CRUSH_RULE_TYPE_REPLICATED = 1, | ||
| CRUSH_RULE_TYPE_ERASURE = 3, | ||
| CRUSH_RULE_TYPE_MSR_FIRSTN = 4, | ||
| CRUSH_RULE_TYPE_MSR_INDEP = 5 |
There was a problem hiding this comment.
Is there a reason the MSR ones are FIRSTN and INDEP, but the others are just REPLICATED and ERASURE? They're referred to as "firstn" and "indep" when writing the rules!
There was a problem hiding this comment.
That's not quite right. Classic crush rules specify a type in crush_make_rule which ends up in crush_rule::type. Prior to this PR, it was usually passed a pool type enum -- REPLICATED or ERASURE. Regardless of which of those two types the rule actually is, the choose steps in the rule could be either FIRSTN or INDEP. IIUC, there isn't anything to enforce a relationship between rule type and FIRSTN|INDEP in the actual steps (or that the steps are uniform).
MSR is different here -- the output behavior (FIRSTN or INDEP) is governed by the rule type and choosemsr steps do not individually specify FIRSTN or INDEP.
|
@gregsfortytwo I did add user documentation in doc/rados/operations/crush-map.rst as well as in doc/rados/operations/crush-map-edits.rst. Erasure code profiles are the way I expect normal users to interact with it -- does that seem reasonable for user documentation? For developer documentation, there isn't an existing explainer for crush outside of the source files. I do want to keep the comment on crush_msr_do_rule in mapper.c as the main explainer as I think reading that code is what will normally spark questions. I predict that mapper.h is the entry point where most developers will encounter the concept, so adding a quick statement of the two crush types with a pointer to the more detailed explanation on crush_msr_do_rule seems sufficient. Do you think it's worth adding a doc/dev/crush.rst document mainly to point at this file? |
4c9cd28 to
2c855c7
Compare
2c855c7 to
273e28a
Compare
|
@gregsfortytwo @rzarzynski The feature cutoff for squid was extended, but only for 4 days. I really need this reviewed within the next 24 hours. |
|
jenkins test api |
gregsfortytwo
left a comment
There was a problem hiding this comment.
Okay, I spent a while staring at the guts of mapper.c and I think this all works. It requires a certain amount of inner CRUSH knowledge I have not grokked in a while, but the new functions all do what they say, I believe the overall algorithm does work, and they tie together appropriately.
I did not review the tests and only skimmed the EC parts that plug into it, but I presume from the volume and commit messages that they are good.
Do go over my notes. Is there a plan for when or how we want to integrate this with the kernel client? Since I understand this to mostly be an RGW-focused feature, I wonder if we want to delay pushing it out aggressively to make sure we don't find any problems in early deployments -- it seems like every CRUSH change runs across some issues with the math not playing out quite the way we'd hoped.
| break; | ||
| case CRUSH_RULE_SET_MSR_COLLISION_TRIES: | ||
| if (msr_collision_tries) *msr_collision_tries = step->arg1; | ||
| break; |
There was a problem hiding this comment.
So if somebody has multiple CRUSH_RULE_SET_MSR_DESCENTS_TRIES or CRUSH_RULE_SET_MSR_COLLISION_TRIES steps, the earlier ones are simply overwritten and life proceeds. Is this okay? Do we want some kind of enforcement here?
There was a problem hiding this comment.
That seems to be how the other CRUSH_RULE_SET_* steps work. I'm not really worried about it.
src/crush/mapper.c
Outdated
| unsigned end_index = MIN(start_index + total_children, | ||
| input.result_max); | ||
| while (tries_so_far <= input.msr_descents && | ||
| output.returned_so_far < input.result_max) { |
There was a problem hiding this comment.
I'm confused by the interplay between this comparison of output.returned_so_far < input.result_max and the end_index we set up that can be smaller than the result_max. What's going on here?
There was a problem hiding this comment.
That's an artifact from when I added support for multiple emit blocks -- good catch. Fixing.
There was a problem hiding this comment.
Ok, added a bit of code to stop the loop once we've generated end_index - start_index worth of outputs.
|
@gregsfortytwo I do expect this to go into the kernel. The only encoding change really is the two new tunables. I think all we have to do is handle that and copy over the new mapper.h/cc files. I'll look into it. |
e1f94ad to
98db74d
Compare
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Add rule_valid_for_pool_type to CrushWrapper to generalize rule type <-> pool type mapping to include the new MSR types. Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
…les flags Signed-off-by: Samuel Just <sjust@redhat.com>
Adds support for crush-osds-per-failure-domain and crush-num-failure-domains via MSR rules. Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Newly added profile options may break this test otherwise. Signed-off-by: Samuel Just <sjust@redhat.com>
98db74d to
0736d5d
Compare
|
@athanatos latest test results show some apparent regressions. I think the osdmap is not getting updated correctly. In this job, these messages in the log caught my eye: /a/yuriw-2024-02-05_19:32:33-rados-wip-yuri4-testing-2024-02-05-0849-distro-default-smithi/7547574 2024-02-06T15:38:08.835 INFO:tasks.ceph.osd.8.smithi002.stderr:2024-02-06T15:38:08.832+0000 7f0ef10c0640 -1 osd.8 0 waiting for initial osdmap /a/yuriw-2024-02-05_19:32:33-rados-wip-yuri4-testing-2024-02-05-0849-distro-default-smithi/7547574/remote/smithi002/log/ceph-osd.0.log.gz 2024-02-06T15:41:17.369+0000 7f4000a81640 10 osd.0 52 maybe_share_map: con v2:172.21.15.143:6802/3809609215 our osdmap epoch of 52 is not newer than session's projected_epoch of 52 The other jobs show the same symptoms. See https://pulpito.ceph.com/yuriw-2024-02-05_19:32:33-rados-wip-yuri4-testing-2024-02-05-0849-distro-default-smithi/ for more examples. |
|
@ljflores Hmm, those are normal as long as the OSDs eventually started. |
We found the failures are actually from #54312 merging. |
This is a re-submission of #55096 after an accidental merge.
Adds support for CRUSH rules that require multiple choose steps -- see added comments and documentation for details.
Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test dashboard cephadmjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume toxjenkins test windowsjenkins test rook e2e