Failure creating cluster using Kubernetes Operator #13554

@dmatch01

What is the problem?

reference: https://docs.ray.io/en/master/cluster/k8s-operator.html#k8s-operator

I followed the instructions in the k8s operator setup guide referenced above. After successfully creating the operator pod, I attempted to launch a Ray cluster (kubectl -n ray apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/example_cluster.yaml), but the launch failed. See the logs from the Ray operator pod:

$ oc logs ray-operator-pod 
example-cluster:2021-01-19 05:46:11,736	DEBUG config.py:83 -- Updating the resources of node type head-node to include {'CPU': 1, 'GPU': 0}.
example-cluster:2021-01-19 05:46:11,737	DEBUG config.py:83 -- Updating the resources of node type worker-nodes to include {'CPU': 1, 'GPU': 0}.
example-cluster:2021-01-19 05:46:11,773	WARNING config.py:164 -- KubernetesNodeProvider: not checking if namespace 'ray' exists
example-cluster:2021-01-19 05:46:11,773	INFO config.py:184 -- KubernetesNodeProvider: no autoscaler_service_account config provided, must already exist
example-cluster:2021-01-19 05:46:11,773	INFO config.py:210 -- KubernetesNodeProvider: no autoscaler_role config provided, must already exist
example-cluster:2021-01-19 05:46:11,774	INFO config.py:236 -- KubernetesNodeProvider: no autoscaler_role_binding config provided, must already exist
example-cluster:2021-01-19 05:46:11,774	INFO config.py:269 -- KubernetesNodeProvider: no services config provided, must already exist
example-cluster:2021-01-19 05:46:11,809	INFO node_provider.py:114 -- KubernetesNodeProvider: calling create_namespaced_pod (count=1).
2021-01-19 05:46:11,687	INFO commands.py:221 -- Cluster: example-cluster
2021-01-19 05:46:11,735	INFO commands.py:283 -- Checking Kubernetes environment settings
2021-01-19 05:46:11,808	INFO commands.py:533 -- No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]
2021-01-19 05:46:11,808	INFO commands.py:578 -- Acquiring an up-to-date head node
Process example-cluster:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/operator/operator.py", line 48, in _create_or_update
    self.start_head()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/operator/operator.py", line 60, in start_head
    no_config_cache=True)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 228, in create_or_update_cluster
    override_cluster_name)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 598, in get_or_create_head_node
    provider.create_node(head_node_config, head_node_tags, 1)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kubernetes/node_provider.py", line 117, in create_node
    pod = core_api().create_namespaced_pod(self.namespace, pod_spec)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 7320, in create_namespaced_pod
    return self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 7429, in create_namespaced_pod_with_http_info
    collection_formats=collection_formats)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
    _preload_content, _request_timeout, _host)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
    _request_timeout=_request_timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 397, in request
    body=body)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 280, in POST
    body=body)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 233, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': '562ee453-9aa8-4190-8450-3fb975bb0a7a', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 19 Jan 2021 13:46:11 GMT', 'Content-Length': '352'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"example-cluster-ray-head-pmfw4\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , \u003cnil\u003e","reason":"Forbidden","details":{"name":"example-cluster-ray-head-pmfw4","kind":"pods"},"code":403}
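
For readers triaging similar failures, here is a minimal sketch of how the 403 body above could be classified. The wording "cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on" is what Kubernetes' OwnerReferencesPermissionEnforcement admission plugin returns when the caller lacks update permission on the finalizers subresource of the owner object. That the owner here is the RayCluster custom resource, and that the plugin is active on this OpenShift cluster, are assumptions and are not confirmed by the log. The function name is hypothetical; the body is copied verbatim from the log.

```python
import json

# Status body copied verbatim from the ApiException in the operator log above.
RESPONSE_BODY = r'''{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"example-cluster-ray-head-pmfw4\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , \u003cnil\u003e","reason":"Forbidden","details":{"name":"example-cluster-ray-head-pmfw4","kind":"pods"},"code":403}'''


def is_owner_reference_permission_error(body: str) -> bool:
    """Return True if a Kubernetes Status body looks like a blockOwnerDeletion rejection.

    Hypothetical helper for illustration; it only inspects the JSON text and
    does not talk to the cluster.
    """
    status = json.loads(body)
    return (
        status.get("code") == 403
        and status.get("reason") == "Forbidden"
        and "blockOwnerDeletion" in status.get("message", "")
    )


if __name__ == "__main__":
    print(is_owner_reference_permission_error(RESPONSE_BODY))
```

If the check matches, one thing worth verifying is whether the Role bound to ray-operator-serviceaccount grants the update verb on the owner resource's finalizers subresource. This is a suggestion to investigate, not a confirmed fix.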

Reproduction (REQUIRED)

Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):

Environment Details:
OpenShift v4.5.8

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.0", GitCommit:"0ed33881dc4355495f623c6f22e7dd0b7632b7c0", GitTreeState:"clean", BuildDate:"2018-09-27T17:05:32Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.3+fa69cae", GitCommit:"fa69cae", GitTreeState:"clean", BuildDate:"2020-12-14T23:03:06Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

Operator Successfully Running:

$ kubectl get pods
NAME               READY   STATUS    RESTARTS   AGE
ray-operator-pod   1/1     Running   1          24m

Ray Image:

$ kubectl describe pod ray-operator-pod  | grep "Image ID:"
    Image ID:      docker.io/rayproject/ray@sha256:b6273b691dff8d980128dad0a6fe70ceadc755ea24490da413d07710ee04d88b

YAML output of running operator pod:

$ kubectl get pod ray-operator-pod -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/podIP: 172.30.100.133/32
    cni.projectcalico.org/podIPs: 172.30.100.133/32
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "k8s-pod-network",
          "ips": [
              "172.30.100.133"
          ],
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "k8s-pod-network",
          "ips": [
              "172.30.100.133"
          ],
          "default": true,
          "dns": {}
      }]
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"ray-operator-pod","namespace":"ray"},"spec":{"containers":[{"command":["ray-operator"],"env":[{"name":"RAY_OPERATOR_POD_NAMESPACE","valueFrom":{"fieldRef":{"fieldPath":"metadata.namespace"}}}],"image":"rayproject/ray:nightly","imagePullPolicy":"Always","name":"ray","resources":{"limits":{"memory":"2Gi"},"requests":{"cpu":1,"memory":"1Gi"}}}],"serviceAccountName":"ray-operator-serviceaccount"}}
    openshift.io/scc: anyuid
  creationTimestamp: 2021-01-19T13:42:53Z
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
      f:spec:
        f:containers:
          k:{"name":"ray"}:
            .: {}
            f:command: {}
            f:env:
              .: {}
              k:{"name":"RAY_OPERATOR_POD_NAMESPACE"}:
                .: {}
                f:name: {}
                f:valueFrom:
                  .: {}
                  f:fieldRef:
                    .: {}
                    f:apiVersion: {}
                    f:fieldPath: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:name: {}
            f:resources:
              .: {}
              f:limits:
                .: {}
                f:memory: {}
              f:requests:
                .: {}
                f:cpu: {}
                f:memory: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
        f:dnsPolicy: {}
        f:enableServiceLinks: {}
        f:restartPolicy: {}
        f:schedulerName: {}
        f:securityContext: {}
        f:serviceAccount: {}
        f:serviceAccountName: {}
        f:terminationGracePeriodSeconds: {}
    manager: kubectl
    operation: Update
    time: 2021-01-19T13:42:53Z
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:cni.projectcalico.org/podIP: {}
          f:cni.projectcalico.org/podIPs: {}
    manager: calico
    operation: Update
    time: 2021-01-19T13:42:54Z
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:k8s.v1.cni.cncf.io/network-status: {}
          f:k8s.v1.cni.cncf.io/networks-status: {}
    manager: multus
    operation: Update
    time: 2021-01-19T13:42:54Z
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          k:{"type":"ContainersReady"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Initialized"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Ready"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
        f:containerStatuses: {}
        f:hostIP: {}
        f:phase: {}
        f:podIP: {}
        f:podIPs:
          .: {}
          k:{"ip":"172.30.100.133"}:
            .: {}
            f:ip: {}
        f:startTime: {}
    manager: kubelet
    operation: Update
    time: 2021-01-19T14:01:52Z
  name: ray-operator-pod
  namespace: ray
  resourceVersion: "25632355"
  selfLink: /api/v1/namespaces/ray/pods/ray-operator-pod
  uid: a7163014-e459-4479-9665-732ee18eff16
spec:
  containers:
  - command:
    - ray-operator
    env:
    - name: RAY_OPERATOR_POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    image: rayproject/ray:nightly
    imagePullPolicy: Always
    name: ray
    resources:
      limits:
        memory: 2Gi
      requests:
        cpu: "1"
        memory: 1Gi
    securityContext:
      capabilities:
        drop:
        - MKNOD
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: ray-operator-serviceaccount-token-lhfnd
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: ray-operator-serviceaccount-dockercfg-bcrj5
  nodeName: 10.95.102.76
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    seLinuxOptions:
      level: s0:c25,c15
  serviceAccount: ray-operator-serviceaccount
  serviceAccountName: ray-operator-serviceaccount
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: ray-operator-serviceaccount-token-lhfnd
    secret:
      defaultMode: 420
      secretName: ray-operator-serviceaccount-token-lhfnd
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2021-01-19T13:42:53Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2021-01-19T14:01:52Z
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: 2021-01-19T14:01:52Z
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: 2021-01-19T13:42:53Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://a9670e6ca1e06ed70db76af3fe35e79198b91cbb3f75bc3d4d1d777f5aae7dae
    image: docker.io/rayproject/ray:nightly
    imageID: docker.io/rayproject/ray@sha256:8a09fc4eff3c142ae9c0174b7beb8311a479afd53d85010aa092307479d59eb5
    lastState:
      terminated:
        containerID: cri-o://08c907a4ff9c2854fe34edd158432cd87724bbcbf4d0ba346690593c8824bcef
        exitCode: 1
        finishedAt: 2021-01-19T14:01:49Z
        reason: Error
        startedAt: 2021-01-19T13:43:42Z
    name: ray
    ready: true
    restartCount: 1
    started: true
    state:
      running:
        startedAt: 2021-01-19T14:01:51Z
  hostIP: 10.95.102.76
  phase: Running
  podIP: 172.30.100.133
  podIPs:
  - ip: 172.30.100.133
  qosClass: Burstable
  startTime: 2021-01-19T13:42:53Z

If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Labels: bug (Something that is supposed to be working; but isn't)
