
crush: add multistep retry rules #55332

Merged
rzarzynski merged 20 commits into ceph:main from athanatos:sjust/wip-crush-multi-choose
Feb 8, 2024

Conversation

@athanatos
Contributor

@athanatos athanatos commented Jan 26, 2024

This is a re-submission of #55096 after an accidental merge.

Adds support for CRUSH rules that require multiple choose steps -- see added comments and documentation for details.
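For orientation, an MSR rule groups several choosemsr steps under a single msr_firstn or msr_indep rule type. A sketch of what such a rule might look like (the rule name, id, and counts here are hypothetical; `choosemsr` and the MSR rule types come from this patch set, but see the documentation added in doc/rados/operations/crush-map-edits.rst for the exact grammar):

```
rule ecpool-msr {
        id 86
        type msr_indep
        step take default
        step choosemsr 3 type host
        step choosemsr 2 type osd
        step emit
}
```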

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@athanatos
Contributor Author

@neha-ojha @rzarzynski @Matan-B I could use at least one more detailed review if any of you have time.

@rzarzynski
Contributor

Review-in-progress.

Member

@gregsfortytwo gregsfortytwo left a comment


I'm partway through reviewing this and just got to the meat of the changes, and it definitely needs documentation work before it can merge. The user-visible documentation simply says to use MSR rules when you need multiple OSDs per failure domain, but doesn't provide any other orientation.
In the source code, I did eventually find function documentation in mapper.c. But I had to dig through quite a lot: there's nothing in the header file, there are no pointers to it, and there's no "here is why we built this new CRUSH implementation" anywhere. I'd really like to see something like what we have for stretch clusters, providing a broader view: https://docs.ceph.com/en/reef/rados/operations/stretch-mode/ Without that, it's really hard for anybody to get into without a personal briefing.

if (mv > newmap.require_osd_release) {
ss << "new crush map requires client version " << mv
<< " but require_osd_release is "
<< newmap.require_osd_release;
Member

We definitely need this, but I don't think the commit message is right: OSDs are checked against require_osd_release, and this function was only ever checking require_min_compat_client, which as best I can tell is only checked against clients/MDS/MGR.
I don't think that means we need any other changes, but just checking?

Contributor Author

Hmm, I shouldn't have put it in this block with the client compat check -- not sure why I did that.

However, immediately below, this function does check the newmap features against existing OSDs via check_cluster_features.

That said, I'm going to drop this commit for now. I think this is actually a problem with check_cluster_features generally. It's used to gate mon commands based on what OSDs support, but the check should really be against a bar we use to disallow OSDs from starting, not against what the OSDs that happen to be running use. https://tracker.ceph.com/issues/64257

Contributor Author

Stretch mode actually has its own check in preprocess_boot. I suppose I can add a check there.

CRUSH_RULE_TYPE_REPLICATED = 1,
CRUSH_RULE_TYPE_ERASURE = 3,
CRUSH_RULE_TYPE_MSR_FIRSTN = 4,
CRUSH_RULE_TYPE_MSR_INDEP = 5
Member

Is there a reason the MSR ones are FIRSTN and INDEP, but the others are just REPLICATED and ERASURE? They're referred to as "firstn" and "indep" when writing the rules!

Contributor Author

That's not quite right. Classic crush rules specify a type in crush_make_rule which ends up in crush_rule::type. Prior to this PR, it was usually passed a pool type enum -- REPLICATED or ERASURE. Regardless of which of those two types the rule actually is, the choose steps in the rule could be either FIRSTN or INDEP. IIUC, there isn't anything to enforce a relationship between rule type and FIRSTN|INDEP in the actual steps (or that the steps are uniform).

MSR is different here -- the output behavior (FIRSTN or INDEP) is governed by the rule type and choosemsr steps do not individually specify FIRSTN or INDEP.
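To illustrate the FIRSTN/INDEP distinction being discussed, here is a hedged, self-contained sketch (not the mapper.c implementation): FIRSTN output compacts over failed devices, while INDEP preserves positions by substituting CRUSH_ITEM_NONE, which matters for erasure-code shards that are identified by their index. Only CRUSH_ITEM_NONE is the real constant; the function names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* The real sentinel CRUSH uses for "no device at this position". */
#define CRUSH_ITEM_NONE 0x7fffffff

/* firstn semantics: failed entries are skipped and later picks
 * shift forward, so the result is a compact prefix. */
size_t map_firstn(const int *picks, const int *alive, size_t n,
                  int *out, size_t out_max) {
    size_t filled = 0;
    for (size_t i = 0; i < n && filled < out_max; i++)
        if (alive[i])
            out[filled++] = picks[i];
    return filled;
}

/* indep semantics: every position is preserved; failed entries
 * become CRUSH_ITEM_NONE so each shard keeps its index. */
size_t map_indep(const int *picks, const int *alive, size_t n,
                 int *out, size_t out_max) {
    size_t i;
    for (i = 0; i < n && i < out_max; i++)
        out[i] = alive[i] ? picks[i] : CRUSH_ITEM_NONE;
    return i;
}
```

With an MSR rule, which of these two output behaviors applies is fixed by the rule type (msr_firstn vs msr_indep) rather than by per-step flags.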

@athanatos
Contributor Author

athanatos commented Jan 30, 2024

@gregsfortytwo I did add user documentation in doc/rados/operations/crush-map.rst as well as in doc/rados/operations/crush-map-edits.rst. Erasure code profiles are the way I expect normal users to interact with it -- does that seem reasonable for user documentation?
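As a sketch of that profile-level interface (the profile name and k/m values below are made up for illustration; crush-osds-per-failure-domain and crush-num-failure-domains are the option names this PR's commits add):

```
# 4+2 = 6 shards, spread as 2 OSDs in each of 3 failure domains
ceph osd erasure-code-profile set ecprofile-msr \
    k=4 m=2 \
    crush-osds-per-failure-domain=2 \
    crush-num-failure-domains=3
```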

For developer documentation, there isn't an existing explainer for crush outside of the source files. I do want to keep the comment on crush_msr_do_rule in mapper.c as the main explainer as I think reading that code is what will normally spark questions. I predict that mapper.h is the entry point where most developers will encounter the concept, so adding a quick statement of the two crush types with a pointer to the more detailed explanation on crush_msr_do_rule seems sufficient. Do you think it's worth adding a doc/dev/crush.rst document mainly to point at this file?

@athanatos athanatos force-pushed the sjust/wip-crush-multi-choose branch 4 times, most recently from 4c9cd28 to 2c855c7 on January 30, 2024 at 22:12
@athanatos athanatos force-pushed the sjust/wip-crush-multi-choose branch from 2c855c7 to 273e28a on February 1, 2024 at 16:57
@athanatos
Contributor Author

@gregsfortytwo @rzarzynski The feature cutoff for squid was extended, but only for 4 days. I really need this reviewed within the next 24 hours.

@athanatos
Contributor Author

jenkins test api

Member

@gregsfortytwo gregsfortytwo left a comment

Okay, I spent a while staring at the guts of mapper.c and I think this all works. It requires a certain amount of inner CRUSH knowledge I have not grokked in a while, but the new functions all do what they say, I believe the overall algorithm does work, and they tie together appropriately.

I did not review the tests and only skimmed the EC parts that plug into it, but I presume from the volume and commit messages that they are good.

Do go over my notes. Is there a plan for when or how we want to integrate this with the kernel client? Since I understand this to mostly be an RGW-focused feature, I wonder if we want to delay pushing it out aggressively to make sure we don't find any problems in early deployments -- it seems like every CRUSH change runs across some issues with the math not playing out quite the way we'd hoped.

break;
case CRUSH_RULE_SET_MSR_COLLISION_TRIES:
if (msr_collision_tries) *msr_collision_tries = step->arg1;
break;
Member

So if somebody has multiple CRUSH_RULE_SET_MSR_DESCENTS_TRIES or CRUSH_RULE_SET_MSR_COLLISION_TRIES steps, the earlier ones are simply overwritten and life proceeds. Is this okay? Do we want some kind of enforcement here?

Contributor Author

That seems to be how the other CRUSH_RULE_SET_* steps work. I'm not really worried about it.
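For illustration, the last-write-wins behavior under discussion can be sketched like this (a hypothetical simplification, not the crush code: the enum values, struct, and function name here are stand-ins; only the assign-through-out-parameter pattern mirrors the snippet above):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the CRUSH_RULE_SET_MSR_* step ops. */
enum { SET_MSR_DESCENTS_TRIES, SET_MSR_COLLISION_TRIES, OTHER_STEP };

struct step { int op; int arg1; };

/* Scan all steps; each SET_* step simply assigns through the
 * out-parameter, so a later step silently overwrites an earlier
 * one (last write wins), with no duplicate-step enforcement. */
void scan_tries(const struct step *steps, size_t n,
                int *descents_tries, int *collision_tries) {
    for (size_t i = 0; i < n; i++) {
        switch (steps[i].op) {
        case SET_MSR_DESCENTS_TRIES:
            if (descents_tries) *descents_tries = steps[i].arg1;
            break;
        case SET_MSR_COLLISION_TRIES:
            if (collision_tries) *collision_tries = steps[i].arg1;
            break;
        default:
            break;
        }
    }
}
```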

unsigned end_index = MIN(start_index + total_children,
input.result_max);
while (tries_so_far <= input.msr_descents &&
output.returned_so_far < input.result_max) {
Member

I'm confused by the interplay between this comparison of output.returned_so_far < input.result_max and the end_index we set up that can be smaller than the result_max. What's going on here?

Contributor Author

That's an artifact from when I added support for multiple emit blocks -- good catch. Fixing.

Contributor Author

Ok, added a bit of code to stop the loop once we've generated end_index - start_index worth of outputs.
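A hedged sketch of that termination condition (an illustrative model only, not crush_msr_do_rule: names are simplified and every descent is pretended to succeed and yield one result):

```c
#include <assert.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Stand-in for one successful CRUSH descent pass. */
static unsigned fake_descent(void) {
    return 1; /* results produced this pass */
}

/* Retry descents up to msr_descents times, but also stop once this
 * emit block has produced its own quota (end_index - start_index),
 * which can be smaller than result_max when there are multiple
 * emit blocks. */
unsigned run_emit_block(unsigned start_index, unsigned total_children,
                        unsigned result_max, unsigned msr_descents) {
    unsigned end_index = MIN(start_index + total_children, result_max);
    unsigned returned_so_far = start_index;
    unsigned tries_so_far = 0;
    /* was: returned_so_far < result_max, which could overrun this
     * block's share whenever end_index < result_max */
    while (tries_so_far <= msr_descents && returned_so_far < end_index) {
        returned_so_far += fake_descent();
        tries_so_far++;
    }
    return returned_so_far - start_index; /* outputs from this block */
}
```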

@athanatos
Contributor Author

@gregsfortytwo I do expect this to go into the kernel. The only encoding change really is the two new tunables. I think all we have to do is handle that and copy over the new mapper.h/cc files. I'll look into it.

@athanatos athanatos force-pushed the sjust/wip-crush-multi-choose branch 3 times, most recently from e1f94ad to 98db74d on February 4, 2024 at 00:49
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Add rule_valid_for_pool_type to CrushWrapper to generalize
rule type <-> pool type mapping to include the new MSR
types.

Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
…les flags

Signed-off-by: Samuel Just <sjust@redhat.com>
Adds support for crush-osds-per-failure-domain and
crush-num-failure-domains via MSR rules.

Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
Newly added profile options may break this test otherwise.

Signed-off-by: Samuel Just <sjust@redhat.com>
@athanatos athanatos force-pushed the sjust/wip-crush-multi-choose branch from 98db74d to 0736d5d on February 4, 2024 at 05:00
@ljflores
Member

ljflores commented Feb 6, 2024

@athanatos latest test results show some apparent regressions. I think the osdmap is not getting updated correctly.

In this job, these messages in the log caught my eye:

/a/yuriw-2024-02-05_19:32:33-rados-wip-yuri4-testing-2024-02-05-0849-distro-default-smithi/7547574

2024-02-06T15:38:08.835 INFO:tasks.ceph.osd.8.smithi002.stderr:2024-02-06T15:38:08.832+0000 7f0ef10c0640 -1 osd.8 0 waiting for initial osdmap
2024-02-06T15:38:08.835 INFO:tasks.ceph.osd.7.smithi160.stderr:2024-02-06T15:38:08.832+0000 7f0a709a3640 -1 osd.7 0 waiting for initial osdmap
2024-02-06T15:38:08.835 INFO:tasks.ceph.osd.14.smithi143.stderr:2024-02-06T15:38:08.834+0000 7f575841f640 -1 osd.14 0 waiting for initial osdmap
2024-02-06T15:38:08.836 INFO:tasks.ceph.osd.11.smithi160.stderr:2024-02-06T15:38:08.835+0000 7f82329c0640 -1 osd.11 0 waiting for initial osdmap
2024-02-06T15:38:08.838 INFO:tasks.ceph.osd.0.smithi002.stderr:2024-02-06T15:38:08.837+0000 7f3ffe269640 -1 osd.0 0 waiting for initial osdmap
2024-02-06T15:38:08.838 INFO:tasks.ceph.osd.4.smithi002.stderr:2024-02-06T15:38:08.837+0000 7fd8c4397640 -1 osd.4 0 waiting for initial osdmap
2024-02-06T15:38:08.839 INFO:tasks.ceph.osd.12.smithi002.stderr:2024-02-06T15:38:08.837+0000 7f2b6a4b3640 -1 osd.12 0 waiting for initial osdmap
2024-02-06T15:38:08.839 INFO:tasks.ceph.osd.1.smithi033.stderr:2024-02-06T15:38:08.837+0000 7f012b8a2640 -1 osd.1 0 waiting for initial osdmap
2024-02-06T15:38:08.841 INFO:tasks.ceph.osd.3.smithi160.stderr:2024-02-06T15:38:08.839+0000 7fc400ec6640 -1 osd.3 0 waiting for initial osdmap

/a/yuriw-2024-02-05_19:32:33-rados-wip-yuri4-testing-2024-02-05-0849-distro-default-smithi/7547574/remote/smithi002/log/ceph-osd.0.log.gz

2024-02-06T15:41:17.369+0000 7f4000a81640 10 osd.0 52 maybe_share_map: con v2:172.21.15.143:6802/3809609215 our osdmap epoch of 52 is not newer than session's projected_epoch of 52

The other jobs show the same symptoms.

See https://pulpito.ceph.com/yuriw-2024-02-05_19:32:33-rados-wip-yuri4-testing-2024-02-05-0849-distro-default-smithi/ for more examples.

@athanatos
Contributor Author

@ljflores Hmm, those are normal as long as the OSDs eventually started.

@ljflores
Member

ljflores commented Feb 6, 2024


We found the failures are actually from #54312 merging.

@rzarzynski rzarzynski merged commit 72be1f4 into ceph:main Feb 8, 2024