
Ceph: provision OSDs using Drive Groups #4916

Merged
BlaineEXE merged 1 commit into rook:master from SUSE:drive-groups on Jul 15, 2020
Conversation

@BlaineEXE
Member

@BlaineEXE BlaineEXE commented Feb 25, 2020

[test ceph]
[test full]

Add a design document to plan how Rook will allow Ceph's Rook
Orchestrator Mgr Module to add and remove OSDs to/from a Rook-Ceph
cluster including changes needed in both the Orchestrator Module and in
Rook.

Related to CI:
I will implement unit tests in Rook as needed. I intend to implement CI tests for the Drive Group work using the SUSE/rookcheck framework, which we hope to present soon and use as a Ceph nightly build tester, since SUSE will be the primary user of the Drive Group functionality.

Integration test matrix to verify functionality around drive groups:

  • Test drive groups with useAllNodes: false and the storage config empty
  • Test an empty storage config with tainted nodes and no Drive Group toleration in Rook
  • Test an empty storage config with Rook Drive Group placement defined
  • Test the above without drive groups specified (no provisioning jobs should be created)
  • Test that drive groups ignore placement, host_pattern, and service_id in the drive group spec
  • Test drive groups with some storage config applied previously (upgrade)
  • Test with a Ceph image that doesn't support drive groups

Signed-off-by: Blaine Gardner <blaine.gardner@suse.com>

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #

Checklist:

  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.
  • Comments have been added or updated based on the standards set in CONTRIBUTING.md
  • Add the flag for skipping the CI if this PR does not require a build. See here for more details.

@BlaineEXE BlaineEXE added this to the 1.3 milestone Feb 25, 2020
@BlaineEXE BlaineEXE requested review from leseb and travisn February 25, 2020 20:34
@BlaineEXE
Member Author

This is obviously not complete. Unit tests have been commented out instead of changed, but what currently exists propagates drive group config to the OSD provisioning pod.

Relevant drive groups config in cluster.yaml

  driveGroups:
    # Examples below
    regular_nodes:
      host_pattern: "*-osd"
      data_devices:
        all: true
    fast_nodes:
      host_pattern: "*-osd-fast"
      data_devices:
        limit: 6
        size: "10TB:10TB"

Relevant OSD prepare pod logs

2020-02-25 03:20:40.129309 I | cephosd: TODO: configure Drive Group "fast_nodes": {"data_devices":{"limit":6,"size":"10TB:10TB"},"host_pattern":"*-osd-fast"}
2020-02-25 03:20:40.129338 I | cephosd: TODO: configure Drive Group "regular_nodes": {"data_devices":{"all":true},"host_pattern":"*-osd"}
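For context, the `data_devices` filters in the config above come from Ceph's Drive Group spec rather than from Rook. A hedged sketch of a group using a few more of the filters Ceph defines (`rotational`, `model`, and open-ended `size` ranges); the group name and values here are illustrative only, and exact filter support depends on the Ceph version:

```yaml
driveGroups:
  # Hypothetical example; "hdd_nodes" and the values below are illustrative.
  hdd_nodes:
    host_pattern: "*-osd"
    data_devices:
      rotational: 1          # spinning disks only
      size: "2TB:"           # at least 2 TB (low:high range, either end open)
      model: "MC-55-44-XZ"   # match on the device model string
```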

@BlaineEXE BlaineEXE added the do-not-merge DO NOT MERGE :) label Feb 25, 2020
@BlaineEXE BlaineEXE changed the title Drive groups Ceph: provision OSDs using Drive Groups Feb 25, 2020
@BlaineEXE BlaineEXE force-pushed the drive-groups branch 2 times, most recently from 414447f to f850e3a Compare April 24, 2020 15:56
@BlaineEXE BlaineEXE force-pushed the drive-groups branch 2 times, most recently from 0d8e125 to e8bea7e Compare May 8, 2020 20:26
Comment on lines +705 to +734
// TODO: what if log path were /dev/stderr and we didn't do combined output so that logs would
// appear in the prepare pod log?
Member Author

Open question. Thoughts @travisn / @leseb ?

Member

Sounds good for the logs to be piped to the same pod log
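The open question above is about where ceph-volume's output should land. A minimal Go sketch of the alternative being discussed, assuming a hypothetical helper name (this is not Rook's actual exec wrapper): capture stdout for parsing while letting stderr stream through unbuffered, so tool logs show up directly in the prepare pod's log instead of being folded into a combined-output buffer:

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
)

// runWithStderrPassthrough captures stdout for parsing while letting stderr
// stream straight to this process's stderr, so in a container the tool's log
// output lands in the pod log rather than being swallowed by CombinedOutput.
func runWithStderrPassthrough(name string, args ...string) (string, error) {
	cmd := exec.Command(name, args...)
	var stdout bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = os.Stderr // stream logs to the pod log
	err := cmd.Run()
	return stdout.String(), err
}

func main() {
	out, err := runWithStderrPassthrough("sh", "-c", `echo result; echo "log line" >&2`)
	if err != nil {
		fmt.Fprintln(os.Stderr, "command failed:", err)
		os.Exit(1)
	}
	fmt.Printf("captured stdout: %q\n", out)
}
```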

# deviceFilter: "^sd."
# Ceph supports adding devices as OSDs by Drive Group definitions. TODO: LINK
# Rook will give precedence to creating OSDs using Drive Groups before using other methods.
driveGroups:
Member

What if we added a new example for drive groups? The main cluster.yaml would be used for raw devices, cluster-on-pvc.yaml for storageClassDeviceSets, and cluster-drivegroups.yaml for this example. If we keep it in cluster.yaml we would need to comment it all out by default.

Member Author

I still like the idea of doing this in the main cluster.yaml and commenting these out. I intend to comment this out, but for my testing they are uncommented. Thoughts?

Member

IMO if the example is a big chunk of yaml it makes more sense as a separate yaml, but I'm ok with it staying in cluster.yaml as long as it's not too long. If it's a separate yaml, the docs can also refer to a working example instead of needing to uncomment it. Another thought is to comment out a very short example in cluster.yaml, and have a bigger example in a new yaml.

Member Author

I opted for a short example in cluster.yaml and made a new cluster-with-drive-groups.yaml

# Rook will give precedence to creating OSDs using Drive Groups before using other methods.
driveGroups:
# # Examples below
regular_nodes:
Member

This is the same schema defined by ceph-volume (outside Rook's control), right? It just has a different look and feel with the underscores. Not a problem, only an observation.

Member Author

This is defined by Ceph (and ceph-volume) yes. Somewhat annoying, but... 🤷‍♀️

@BlaineEXE BlaineEXE modified the milestones: 1.3, 1.4 May 12, 2020
@BlaineEXE BlaineEXE force-pushed the drive-groups branch 9 times, most recently from d9158b2 to 7435565 Compare June 15, 2020 21:27
@BlaineEXE
Member Author

Still waiting on ceph-volume PR ceph/ceph#35728 for Ceph to support this in master.

@BlaineEXE BlaineEXE force-pushed the drive-groups branch 2 times, most recently from 5094665 to 4b4dfaa Compare July 8, 2020 19:11
@BlaineEXE
Member Author

Actually, I would suggest we add these checks in the admission controller so they wouldn't be needed here. As a separate work item, I still want to get to the point where the admission controller is always enabled.

Since the admission controller isn't always enabled, I think having both is still good. I added admission controller checks.

@BlaineEXE BlaineEXE requested a review from travisn July 9, 2020 19:56
return nil
}

func validateStorageSpec(cluster cephv1.CephCluster) error {
Member

Note that the admission controller code will move to a different location with #5761.
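The shared validation being discussed could look something like the sketch below. This is an illustrative approximation with simplified stand-in types — not Rook's actual cephv1 structs, and not its real validation rules, which live in the operator and admission controller packages:

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for Rook's cephv1 types; the real CRD structs have
// many more fields and live in pkg/apis/ceph.rook.io/v1.
type DriveGroup struct {
	Name string
	Spec map[string]interface{}
}

type StorageSpec struct {
	UseAllNodes bool
	DriveGroups []DriveGroup
}

// validateStorageSpec is a hedged sketch of the kind of check an admission
// controller might run: drive group names must be unique and non-empty, and
// each group needs a non-empty spec.
func validateStorageSpec(storage StorageSpec) error {
	seen := map[string]bool{}
	for _, dg := range storage.DriveGroups {
		if dg.Name == "" {
			return errors.New("drive group name must not be empty")
		}
		if seen[dg.Name] {
			return fmt.Errorf("duplicate drive group name %q", dg.Name)
		}
		seen[dg.Name] = true
		if len(dg.Spec) == 0 {
			return fmt.Errorf("drive group %q has an empty spec", dg.Name)
		}
	}
	return nil
}

func main() {
	bad := StorageSpec{DriveGroups: []DriveGroup{
		{Name: "a", Spec: map[string]interface{}{"data_devices": map[string]interface{}{"all": true}}},
		{Name: "a", Spec: map[string]interface{}{"data_devices": map[string]interface{}{"all": true}}},
	}}
	fmt.Println(validateStorageSpec(bad)) // prints: duplicate drive group name "a"
}
```

Running the same check in both the operator and the webhook means clusters are still protected when the admission controller is disabled, at the cost of maintaining the logic in one shared package.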

@mergify

mergify bot commented Jul 10, 2020

This pull request has merge conflicts that must be resolved before it can be merged. @BlaineEXE please rebase it. https://rook.io/docs/rook/master/development-flow.html#updating-your-fork

@BlaineEXE BlaineEXE requested a review from travisn July 10, 2020 21:06
@BlaineEXE BlaineEXE force-pushed the drive-groups branch 3 times, most recently from 042c8f1 to ead8207 Compare July 13, 2020 14:11
@BlaineEXE
Member Author

Had a failure on k8s 1.18. Restarting CI.

@mergify

mergify bot commented Jul 13, 2020

This pull request has merge conflicts that must be resolved before it can be merged. @BlaineEXE please rebase it. https://rook.io/docs/rook/master/development-flow.html#updating-your-fork

Member

@travisn travisn left a comment


LGTM after the merge conflict is resolved

Add the ability to provision Ceph OSDs with Drive Groups.
This adds Drive Groups to the CephCluster CRD, and it sets code
in place for propagating this config to the OSD provisioning pod.

Signed-off-by: Blaine Gardner <blaine.gardner@suse.com>
Member

@travisn travisn left a comment


The CI issues look unrelated to this PR.

@BlaineEXE BlaineEXE force-pushed the drive-groups branch 2 times, most recently from 2d8e0d7 to 7117fc1 Compare July 14, 2020 22:46
@BlaineEXE BlaineEXE merged commit f2bc23c into rook:master Jul 15, 2020
@BlaineEXE BlaineEXE deleted the drive-groups branch July 15, 2020 00:08

Labels

ceph (main ceph tag), ceph-osd, do-not-merge (DO NOT MERGE :)), feature
