This repository includes a Ceph OSD operator based on the Ansible variant of the Red Hat (formerly CoreOS) Operator SDK.
In contrast to Rook it is not a one-stop solution for deploying Ceph but only deals
with the orchestration of the Ceph OSDs. To be a complete solution it needs further supporting components. Currently
`ceph-osd-operator` is used in `ceph-with-helm` to form a complete
container-based installation of Ceph.
This is the third major version based on version 1.8.0 of the Operator SDK. Its predecessors have been working like a charm for the last few years.
- Functionality is the same as in previous versions.
- A test suite based on the framework provided by the SDK is included. It uses Molecule. Two test scenarios are provided: `default` and `kind`. The `kind` scenario creates a three-node Kubernetes cluster based on Kind. It requires Docker, Ansible, and the Python packages `molecule`, `openshift`, and `jmespath`.
- In previous versions the naming of the CRD was erroneous as it used the singular instead of the plural; this has been corrected. The CRD and all custom resources need to be recreated, which will disrupt the Ceph cluster.
- Two versions of the CRD are provided: one using API version `v1beta1` and another one using `v1`. Please see `config/crd/bases`.
- Example manifests and overlays can be found in `config`.
- The helper scripts have been removed in this version.
This operator only supports Bluestore-based deployments with `ceph-volume`. Separate devices for the RocksDB and
the WAL can be specified. Support for passing the WAL is untested as ceph-with-helm currently doesn't support it.

The operator creates one pod per OSD. The `restartPolicy` on these pods should normally be `Always` to ensure that
they are restarted automatically when they terminate unexpectedly. Missing pods will automatically be recreated by
the operator. During deployment of the pods `nodeAffinity` rules are injected into the pod definition to bind each pod
to a specific node.
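For example, a pod bound to `worker041.example.com` might receive a rule along these lines. This is an illustrative sketch using standard Kubernetes node affinity, not the operator's exact template; see `ansible/roles/CephOSD/templates/pod-definition-additions.yaml.j2` for the real one:

```yaml
# Hypothetical node-affinity fragment as it could be injected into a pod spec
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker041.example.com
```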
Rolling updates of the pod configuration (including container images) can be performed either one OSD at a time or
on a host-by-host basis. At the moment there is no direct interaction between the operator and Ceph; the operator
only watches the `Ready` condition of newly created pods. But if the readiness check is implemented correctly,
this should go a long way in making sure that the OSD is okay. If an updated OSD pod doesn't become ready, the update
process is halted and no further OSDs are updated without manual intervention.
```yaml
apiVersion: ceph.elemental.net/v1alpha1
kind: CephOSD
metadata:
  name: ceph-osds
spec:
  storage:
    - hosts:
        - worker041.example.com
        - worker042.example.com
      osds:
        - data: '/dev/disk/by-slot/pci-0000:5c:00.0-encl-0:7:0-slot-1'
          db: '/dev/disk/by-slot/pci-0000:5c:00.0-encl-0:7:0-slot-10-part1'
        - data: '/dev/disk/by-slot/pci-0000:5c:00.0-encl-0:7:0-slot-2'
          db: '/dev/disk/by-slot/pci-0000:5c:00.0-encl-0:7:0-slot-10-part2'
  updateDomain: "Host"
  podTemplate:
    # [...]
```

- `storage` is a list of `hosts`/`osds` pairs. The cartesian product of each element of the `hosts` list and each element of the `osds` list is formed and an OSD pod is created for each resulting host/OSD pair. This is repeated for each element of the `storage` list.
- The hostnames of a `hosts` list must match your Kubernetes node names as they are used for constructing the `nodeAffinity` rules.
- Each element of an `osds` list consists of a dictionary with up to three keys: `data`, `db`, and `wal`. Only the `data` key is mandatory and its value represents the primary OSD data device. A separate RocksDB or WAL location can be specified by setting the `db` or `wal` key respectively.
- `updateDomain` can either be set to `OSD` to perform rolling updates one OSD at a time, or to `Host` to update all pods on a host at the same time before proceeding to the next host.
- The `podTemplate` should contain a complete pod definition. It is instantiated for each OSD by replacing some values like `metadata.name` or `metadata.namespace` and adding other values (`labels`, `annotations`, and `nodeAffinity` rules). See `ansible/roles/CephOSD/templates/pod-definition-additions.yaml.j2` for a complete list of changes.
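The cartesian-product expansion described above can be sketched in a few lines of Python. The names (`expand_storage`, the dictionary keys of the returned pods) are illustrative and not taken from the operator's Ansible code:

```python
from itertools import product

# Sketch: expand a CephOSD spec's `storage` list into one (host, osd)
# pair per pod, mirroring the cartesian product described above.
def expand_storage(storage):
    pods = []
    for entry in storage:
        for host, osd in product(entry["hosts"], entry["osds"]):
            pods.append({"host": host, "data": osd["data"], "db": osd.get("db")})
    return pods

storage = [{
    "hosts": ["worker041.example.com", "worker042.example.com"],
    "osds": [{"data": "/dev/sdb"}, {"data": "/dev/sdc"}],
}]
print(len(expand_storage(storage)))  # 2 hosts x 2 OSDs -> 4 pods
```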
The whole custom resource is checked against an OpenAPI v3 schema provided in the included custom resource definition.
This includes the `podTemplate`. Changes to the Pod specification by the Kubernetes team might require updates
to the schema in the future.
Each pod has an annotation named `ceph.elemental.net/state` tracking its state. These states are:

- `up-to-date`: The pod looks fine and its configuration is up-to-date.
- `out-of-date`: The pod looks fine but its configuration is out-of-date and it will be updated by recreating it.
- `ready`: The pod is up-to-date, it is waiting to get ready, and its actual state is ready. This state is only used internally and is not reflected in the annotation.
- `unready`: The pod is up-to-date and it is waiting to get ready but it is not ready (yet). Internal state only.
- `invalid`: The pod is missing some mandatory annotations, is a duplicate, or has an invalid name. It will be deleted. Internal state only.
The state of a pod is determined by this logic:
- If the pod is terminating:
  - The pod is ignored altogether.
- Else if the mandatory pod annotations are present:
  - If the pod is a duplicate of another pod:
    - The pod is invalid.
  - Else if the pod name doesn't conform to our scheme:
    - The pod is invalid.
  - Else if the pod's template hash equals the current template hash from the CR:
    - If the pod is waiting to get ready and its actual state is ready:
      - The pod is ready.
    - Else if the pod is waiting to get ready and it's not ready:
      - The pod is unready.
    - Else:
      - The pod is up-to-date.
  - Else:
    - The pod is out-of-date.
- Else:
  - The pod is invalid.
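The decision tree above can be sketched as a small Python function. The function and field names are illustrative, not taken from the operator's source:

```python
# Sketch of the pod state logic described above. `pod` is a plain dict
# standing in for the pod's observed attributes; a terminating pod yields
# None because it is ignored altogether.
def determine_state(pod, current_hash, is_duplicate):
    if pod.get("terminating"):
        return None  # ignored altogether
    if not pod.get("annotations_present"):
        return "invalid"
    if is_duplicate or not pod.get("name_conforms"):
        return "invalid"
    if pod.get("template_hash") == current_hash:
        if pod.get("waiting_for_ready"):
            return "ready" if pod.get("is_ready") else "unready"
        return "up-to-date"
    return "out-of-date"
```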
When the `podTemplate` is changed, all pods are marked as out-of-date. Depending on the `updateDomain`, either one pod
or all pods of one host from the out-of-date list are recreated with the new configuration. After that the
operator waits for the new pod or the group of new pods to become ready. When all new pods have become ready the
operator proceeds to the next out-of-date pod or group of pods. If not all pods become ready, the update process
halts and requires manual intervention by an administrator. Options for the administrator include:

- If the `podTemplate` is faulty, the administrator can fix the `podTemplate` and the update process will automatically restart from the beginning.
- If there are other reasons preventing a pod from becoming ready, the administrator can fix them. After that the pod should become ready after some time and the update process continues automatically.
- The administrator can delete the `ceph.elemental.net/pod-state` annotation or set it to `up-to-date`, overriding the operator. The update process will continue without waiting for this pod to become ready.
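The batching implied by the `updateDomain` setting can be sketched as follows. This is an illustrative sketch, not the operator's actual code; `update_batches` and the pod dictionaries are assumptions:

```python
from itertools import groupby

# Sketch: group the out-of-date pods into update batches. With "OSD" each
# pod forms its own batch; with "Host" all pods of one host are recreated
# together before the operator moves on to the next host.
def update_batches(out_of_date_pods, update_domain):
    if update_domain == "OSD":
        return [[pod] for pod in out_of_date_pods]  # one pod at a time
    # "Host": sort by host so groupby yields one batch per host
    pods = sorted(out_of_date_pods, key=lambda p: p["host"])
    return [list(group) for _, group in groupby(pods, key=lambda p: p["host"])]
```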
The `restartPolicy` in the pod template should be `Always`. In addition, the following tolerations should be included
to prevent eviction under these conditions:

```yaml
tolerations:
  - key: node.kubernetes.io/unschedulable
    operator: Exists
    effect: NoSchedule
  - key: node.kubernetes.io/not-ready
    operator: Exists
  - key: node.kubernetes.io/unreachable
    operator: Exists
```

It is also a good idea to set a `priorityClass` in the template:
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ceph-osd
value: 1000000000
```
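The pod template would then reference this class through the standard `priorityClassName` field of the pod spec; a minimal fragment, assuming the `PriorityClass` above has been created:

```yaml
# Fragment of a pod template's spec (illustrative)
spec:
  restartPolicy: Always
  priorityClassName: ceph-osd
```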
Container images for the operator are available in the GitHub Container Registry. Images are built automatically from the Git repository by Travis CI.
- Setting `noout` for OSDs during update
- Watch OSD status directly via Ceph during update