crush: add multistep retry rules by athanatos · Pull Request #55332 · ceph/ceph

athanatos · 2024-01-26T23:24:33Z

This is a re-submission of #55096 after an accidental merge.

Adds support for CRUSH rules that require multiple choose steps -- see added comments and documentation for details.

athanatos · 2024-01-28T01:28:44Z

@neha-ojha @rzarzynski @Matan-B I could use at least one more detailed review if any of you have time.

rzarzynski · 2024-01-28T22:13:36Z

Review-in-progress.

ljflores · 2024-01-29T18:49:30Z

A first iteration of testing passed here: https://tracker.ceph.com/projects/rados/wiki/MAIN#httpstrellocomctbflL3Ja1936-wip-yuri3-testing-2024-01-22-1155

gregsfortytwo

I'm partway through reviewing this and just got to the meat of the changes, and it definitely needs documentation work before it can merge. The user-visible documentation simply says to use MSR rules when you need multiple OSDs per failure domain, but doesn't provide any other orientation.
In the source code, I did eventually find function documentation in mapper.c, but I had to dig through quite a lot to get there: there's nothing in the header file, there are no pointers to it, and there's no "here is why we built this new CRUSH implementation" anywhere. I'd really like to see something like what we have for stretch clusters, which provides a broader view (https://docs.ceph.com/en/reef/rados/operations/stretch-mode/). Without that, it's really hard for anybody to get into this without a personal briefing.

gregsfortytwo · 2024-01-27T02:29:09Z

src/mon/OSDMonitor.cc

+    if (mv > newmap.require_osd_release) {
+      ss << "new crush map requires client version " << mv
+	 << " but require_osd_release is "
+	 << newmap.require_osd_release;


We definitely need this, but I don't think the commit message is right: OSDs are checked against require_osd_release, and this function was only ever checking require_min_compat_client, which as best I can tell is only checked against client/MDS/MGR.
I don't think that means we need any other changes, but I wanted to check.

Hmm, I shouldn't have put it in this block with the client compat check -- not sure why I did that.

However, immediately below, this function does check the newmap features against existing OSDs via check_cluster_features.

That said, I'm going to drop this commit for now. I think this is actually a problem with check_cluster_features generally. It's used to gate mon commands based on what OSDs support, but the check should really be against the bar we use to disallow OSDs from starting, not against whatever the OSDs that happen to be running support. See https://tracker.ceph.com/issues/64257

Stretch mode actually has its own check in preprocess_boot. I suppose I can add a check there.
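
As a rough illustration of the gate discussed in this thread, the sketch below compares the minimum release a new map needs against the cluster's configured floor, rather than against whatever daemons happen to be running. It is not the OSDMonitor code: `enum release`, `map_allowed`, and the buffer-based error reporting are hypothetical names chosen for this example, and only the comparison and the message wording follow the quoted hunk.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical release ordinals; only their relative order matters here. */
enum release { RELEASE_QUINCY = 17, RELEASE_REEF = 18, RELEASE_SQUID = 19 };

/*
 * Hypothetical helper: return true if a map that needs at least `map_min`
 * may be installed on a cluster whose configured floor is
 * `require_osd_release`. On rejection, a message is written into `err`.
 */
static bool map_allowed(enum release map_min,
                        enum release require_osd_release,
                        char *err, size_t errlen)
{
    if (map_min > require_osd_release) {
        snprintf(err, errlen,
                 "new crush map requires release %d but "
                 "require_osd_release is %d",
                 (int)map_min, (int)require_osd_release);
        return false;
    }
    return true;
}
```

Because the comparison is against a configured floor, the answer does not change when an older OSD is temporarily down, which is the property the reply above argues check_cluster_features lacks.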

gregsfortytwo · 2024-01-27T02:43:18Z

src/crush/crush.h

+	CRUSH_RULE_TYPE_REPLICATED = 1,
+	CRUSH_RULE_TYPE_ERASURE = 3,
+	CRUSH_RULE_TYPE_MSR_FIRSTN = 4,
+	CRUSH_RULE_TYPE_MSR_INDEP = 5


Is there a reason the MSR ones are FIRSTN and INDEP, but the others are just REPLICATED and ERASURE? They're referred to as "firstn" and "indep" when writing the rules!

That's not quite right. Classic crush rules specify a type in crush_make_rule which ends up in crush_rule::type. Prior to this PR, it was usually passed a pool type enum -- REPLICATED or ERASURE. Regardless of which of those two types the rule actually is, the choose steps in the rule could be either FIRSTN or INDEP. IIUC, there isn't anything to enforce a relationship between rule type and FIRSTN|INDEP in the actual steps (or that the steps are uniform).

MSR is different here -- the output behavior (FIRSTN or INDEP) is governed by the rule type and choosemsr steps do not individually specify FIRSTN or INDEP.
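
To make the distinction concrete, here is a minimal sketch, not the code from this PR. The enum values are copied from the crush.h hunk quoted above; `enum output_mode` and `rule_output_mode` are hypothetical names invented for illustration.

```c
/* Rule type values as they appear in the crush.h hunk quoted above. */
enum crush_rule_type {
    CRUSH_RULE_TYPE_REPLICATED = 1,
    CRUSH_RULE_TYPE_ERASURE = 3,
    CRUSH_RULE_TYPE_MSR_FIRSTN = 4,
    CRUSH_RULE_TYPE_MSR_INDEP = 5
};

/* Hypothetical: where the FIRSTN/INDEP decision comes from. */
enum output_mode {
    MODE_PER_STEP, /* classic rules: each choose step names its own mode */
    MODE_FIRSTN,   /* MSR: fixed by the rule type */
    MODE_INDEP     /* MSR: fixed by the rule type */
};

/* Hypothetical helper mapping a rule type to its output mode. */
static enum output_mode rule_output_mode(int type)
{
    switch (type) {
    case CRUSH_RULE_TYPE_MSR_FIRSTN:
        return MODE_FIRSTN;
    case CRUSH_RULE_TYPE_MSR_INDEP:
        return MODE_INDEP;
    default:
        /* REPLICATED and ERASURE say nothing about the steps' modes. */
        return MODE_PER_STEP;
    }
}
```

Under this reading, the FIRSTN/INDEP suffix on the MSR types carries information that the classic type names never had to carry.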

athanatos · 2024-01-30T21:20:21Z

@gregsfortytwo I did add user documentation in doc/rados/operations/crush-map.rst as well as in doc/rados/operations/crush-map-edits.rst. Erasure code profiles are the way I expect normal users to interact with it -- does that seem reasonable for user documentation?

For developer documentation, there isn't an existing explainer for crush outside of the source files. I do want to keep the comment on crush_msr_do_rule in mapper.c as the main explainer as I think reading that code is what will normally spark questions. I predict that mapper.h is the entry point where most developers will encounter the concept, so adding a quick statement of the two crush types with a pointer to the more detailed explanation on crush_msr_do_rule seems sufficient. Do you think it's worth adding a doc/dev/crush.rst document mainly to point at this file?

athanatos · 2024-02-01T16:58:57Z

@gregsfortytwo @rzarzynski The feature cutoff for squid was extended, but only for 4 days. I really need this reviewed within the next 24 hours.

athanatos · 2024-02-02T20:42:14Z

jenkins test api

gregsfortytwo

Okay, I spent a while staring at the guts of mapper.c and I think this all works. It requires a certain amount of inner CRUSH knowledge I have not grokked in a while, but the new functions all do what they say, I believe the overall algorithm does work, and they tie together appropriately.

I did not review the tests and only skimmed the EC parts that plug into it, but I presume from the volume and commit messages that they are good.

Do go over my notes. Is there a plan for when or how we want to integrate this with the kernel client? Since I understand this to mostly be an RGW-focused feature, I wonder if we want to delay pushing it out aggressively to make sure we don't find any problems in early deployments -- it seems like every CRUSH change runs across some issues with the math not playing out quite the way we'd hoped.

gregsfortytwo · 2024-02-03T03:17:56Z

src/crush/mapper.c

+			break;
+		case CRUSH_RULE_SET_MSR_COLLISION_TRIES:
+			if (msr_collision_tries) *msr_collision_tries = step->arg1;
+			break;


So if somebody has multiple CRUSH_RULE_SET_MSR_DESCENTS_TRIES or CRUSH_RULE_SET_MSR_COLLISION_TRIES steps, the earlier ones are simply overwritten and life proceeds. Is this okay? Do we want some kind of enforcement here?

That seems to be how the other CRUSH_RULE_SET_* steps work. I'm not really worried about it.
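
The "last one wins" behaviour described here can be sketched as follows. This is a simplified stand-in for the scan in mapper.c, assuming a flat array of steps; `enum step_op`, `struct step`, and `scan_msr_config` are hypothetical names, and only the assign-if-non-null pattern is taken from the quoted hunk.

```c
#include <stddef.h>

/* Hypothetical opcodes, loosely modelled on the hunk quoted above. */
enum step_op {
    STEP_SET_MSR_DESCENTS,
    STEP_SET_MSR_COLLISION_TRIES,
    STEP_OTHER
};

/* Hypothetical step layout: an opcode plus one argument. */
struct step {
    enum step_op op;
    int arg1;
};

/*
 * Walk the steps in order and copy each SET value into the caller's
 * variable. A later SET step silently overwrites an earlier one; no
 * error is raised for duplicates.
 */
static void scan_msr_config(const struct step *steps, size_t n,
                            int *msr_descents, int *msr_collision_tries)
{
    for (size_t i = 0; i < n; ++i) {
        switch (steps[i].op) {
        case STEP_SET_MSR_DESCENTS:
            if (msr_descents) *msr_descents = steps[i].arg1;
            break;
        case STEP_SET_MSR_COLLISION_TRIES:
            if (msr_collision_tries) *msr_collision_tries = steps[i].arg1;
            break;
        default:
            break;
        }
    }
}
```

Enforcement, if it were wanted, would most naturally live in whatever validates the rule before it is installed rather than in this scan, but that is a design guess and not something the thread settles.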

gregsfortytwo · 2024-02-03T04:14:09Z

src/crush/mapper.c

+			unsigned end_index = MIN(start_index + total_children,
+						 input.result_max);
+			while (tries_so_far <= input.msr_descents &&
+			       output.returned_so_far < input.result_max) {


I'm confused by the interplay between this comparison of output.returned_so_far < input.result_max and the end_index we set up that can be smaller than the result_max. What's going on here?

That's an artifact from when I added support for multiple emit blocks -- good catch. Fixing.

Ok, added a bit of code to stop the loop once we've generated end_index - start_index worth of outputs.
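
A simplified sketch of the corrected bound follows. It is not the mapper.c loop: the real descent logic is replaced by a hypothetical `per_try` parameter that says how many slots each pass manages to fill, and `fill_emit_block` is an invented name. What it does show is the loop stopping once `end_index - start_index` outputs exist, instead of running until `result_max`.

```c
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * Hypothetical stand-in for filling one emit block. Returns how many
 * outputs were produced for the block that starts at `start_index`.
 */
static unsigned fill_emit_block(unsigned start_index, unsigned total_children,
                                unsigned result_max, unsigned max_tries,
                                unsigned per_try)
{
    /* The block may not run past the end of the result vector. */
    unsigned end_index = MIN(start_index + total_children, result_max);
    /* Number of slots this block is actually responsible for. */
    unsigned want = end_index > start_index ? end_index - start_index : 0;
    unsigned produced = 0;
    unsigned tries_so_far = 0;

    /* Bound by this block's slot count, not by result_max. */
    while (tries_so_far <= max_tries && produced < want) {
        unsigned got = MIN(per_try, want - produced);
        produced += got;
        ++tries_so_far;
    }
    return produced;
}
```

With `start_index = 8`, `total_children = 4`, and `result_max = 10`, the block owns only two slots, so the loop exits after two outputs even though `result_max` is larger.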

athanatos · 2024-02-03T23:21:56Z

@gregsfortytwo I do expect this to go into the kernel. The only real encoding change is the two new tunables. I think all we have to do is handle those and copy over the new mapper.h and mapper.c files. I'll look into it.

ljflores · 2024-02-06T18:41:36Z

@athanatos latest test results show some apparent regressions. I think the osdmap is not getting updated correctly.

In this job, these messages in the log caught my eye:

/a/yuriw-2024-02-05_19:32:33-rados-wip-yuri4-testing-2024-02-05-0849-distro-default-smithi/7547574

2024-02-06T15:38:08.835 INFO:tasks.ceph.osd.8.smithi002.stderr:2024-02-06T15:38:08.832+0000 7f0ef10c0640 -1 osd.8 0 waiting for initial osdmap
2024-02-06T15:38:08.835 INFO:tasks.ceph.osd.7.smithi160.stderr:2024-02-06T15:38:08.832+0000 7f0a709a3640 -1 osd.7 0 waiting for initial osdmap
2024-02-06T15:38:08.835 INFO:tasks.ceph.osd.14.smithi143.stderr:2024-02-06T15:38:08.834+0000 7f575841f640 -1 osd.14 0 waiting for initial osdmap
2024-02-06T15:38:08.836 INFO:tasks.ceph.osd.11.smithi160.stderr:2024-02-06T15:38:08.835+0000 7f82329c0640 -1 osd.11 0 waiting for initial osdmap
2024-02-06T15:38:08.838 INFO:tasks.ceph.osd.0.smithi002.stderr:2024-02-06T15:38:08.837+0000 7f3ffe269640 -1 osd.0 0 waiting for initial osdmap
2024-02-06T15:38:08.838 INFO:tasks.ceph.osd.4.smithi002.stderr:2024-02-06T15:38:08.837+0000 7fd8c4397640 -1 osd.4 0 waiting for initial osdmap
2024-02-06T15:38:08.839 INFO:tasks.ceph.osd.12.smithi002.stderr:2024-02-06T15:38:08.837+0000 7f2b6a4b3640 -1 osd.12 0 waiting for initial osdmap
2024-02-06T15:38:08.839 INFO:tasks.ceph.osd.1.smithi033.stderr:2024-02-06T15:38:08.837+0000 7f012b8a2640 -1 osd.1 0 waiting for initial osdmap
2024-02-06T15:38:08.841 INFO:tasks.ceph.osd.3.smithi160.stderr:2024-02-06T15:38:08.839+0000 7fc400ec6640 -1 osd.3 0 waiting for initial osdmap

/a/yuriw-2024-02-05_19:32:33-rados-wip-yuri4-testing-2024-02-05-0849-distro-default-smithi/7547574/remote/smithi002/log/ceph-osd.0.log.gz

2024-02-06T15:41:17.369+0000 7f4000a81640 10 osd.0 52 maybe_share_map: con v2:172.21.15.143:6802/3809609215 our osdmap epoch of 52 is not newer than session's projected_epoch of 52

The other jobs show the same symptoms.

See https://pulpito.ceph.com/yuriw-2024-02-05_19:32:33-rados-wip-yuri4-testing-2024-02-05-0849-distro-default-smithi/ for more examples.

athanatos · 2024-02-06T19:58:30Z

@ljflores Hmm, those are normal as long as the OSDs eventually started.

ljflores · 2024-02-06T20:47:01Z

We found that the failures reported above actually come from the merge of #54312.

ljflores · 2024-02-08T01:32:12Z

Rados approved: https://tracker.ceph.com/projects/rados/wiki/MAIN#httpstrellocomcwQZkeQIp1948-wip-yuri4-testing-2024-02-05-0849-old-wip-yuri4-testing-2024-02-03-0802-old-wip-yuri4-testing-2024-02-02-1609

athanatos requested review from a team as code owners January 26, 2024 23:24

athanatos requested review from Pegonzal, aaSharma14 and gregsfortytwo and removed request for a team January 26, 2024 23:24

github-actions bot added core dashboard documentation mon tests labels Jan 26, 2024

athanatos mentioned this pull request Jan 26, 2024

crush: add multistep retry rules #55096

Merged

8 tasks

athanatos requested review from Matan-B, neha-ojha and rzarzynski January 28, 2024 01:27

gregsfortytwo requested changes Jan 30, 2024

athanatos force-pushed the sjust/wip-crush-multi-choose branch 4 times, most recently from 4c9cd28 to 2c855c7 Compare January 30, 2024 22:12

dparmar18 reviewed Jan 31, 2024

doc/rados/operations/crush-map.rst

dparmar18 reviewed Jan 31, 2024

doc/rados/operations/crush-map.rst

dparmar18 reviewed Jan 31, 2024

doc/rados/operations/crush-map-edits.rst

athanatos force-pushed the sjust/wip-crush-multi-choose branch from 2c855c7 to 273e28a Compare February 1, 2024 16:57

rzarzynski requested a review from gregsfortytwo February 2, 2024 20:43

ljflores added the needs-qa label Feb 2, 2024

yuriw added the wip-yuri4-testing label Feb 3, 2024

gregsfortytwo approved these changes Feb 3, 2024

athanatos force-pushed the sjust/wip-crush-multi-choose branch 3 times, most recently from e1f94ad to 98db74d Compare February 4, 2024 00:49

athanatos added 15 commits February 3, 2024 21:00

crush/mapper: add support for MSR types

bbfa5ba

Signed-off-by: Samuel Just <sjust@redhat.com>

doc/dev/crush-msr.rst: add developer summary of crush msr

8eb6835

Signed-off-by: Samuel Just <sjust@redhat.com>

mon/OSDMonitor: generalize rule type check for pools

8fba03f

Add rule_valid_for_pool_type to CrushWrapper to generalize rule type <-> pool type mapping to include the new MSR types.

Signed-off-by: Samuel Just <sjust@redhat.com>

test/crush/crush.cc: s/NULL/nullptr/g

ab2b62c

Signed-off-by: Samuel Just <sjust@redhat.com>

test/crush/crush.cc: convert indep test cases to test MSR as well

28989d0

Signed-off-by: Samuel Just <sjust@redhat.com>

test/crush/crush.cc: add test variants for firstn rules

0445c33

Signed-off-by: Samuel Just <sjust@redhat.com>

test/crush/crush.cc: add tests specifically for MSR

4b4eb17

Signed-off-by: Samuel Just <sjust@redhat.com>

vstart.sh: add --osds-per-host

f58d4e8

Signed-off-by: Samuel Just <sjust@redhat.com>

vstart.sh: add --require-osd-and-client-version and --use-crush-tunables flags

d9f463e

Signed-off-by: Samuel Just <sjust@redhat.com>

erasure-code: add support for multiple osds in a single failure domain

b398c54

Adds support for crush-osds-per-failure-domain and crush-num-failure-domains via MSR rules.

Signed-off-by: Samuel Just <sjust@redhat.com>

doc/rados/operations: add CRUSH MSR documentation

aa88dfa

Signed-off-by: Samuel Just <sjust@redhat.com>

qa/erasure-code: modify jerasure 4/2 ec test case to use msr

e56e1bb

Signed-off-by: Samuel Just <sjust@redhat.com>

test/cli/crushtool/choose-args.t: add msr related json output

39ad2e7

Signed-off-by: Samuel Just <sjust@redhat.com>

test/cli/osdmaptool/crush.t: adjust --import-crush size output

c52aabf

Signed-off-by: Samuel Just <sjust@redhat.com>

tasks/.../test_erasure_code_profile: assertSubset in test_create_plugin

0736d5d

Newly added profile options may break this test otherwise.

Signed-off-by: Samuel Just <sjust@redhat.com>

athanatos force-pushed the sjust/wip-crush-multi-choose branch from 98db74d to 0736d5d Compare February 4, 2024 05:00

rzarzynski merged commit 72be1f4 into ceph:main Feb 8, 2024
