Description
What is the problem?
Reference: https://docs.ray.io/en/master/cluster/k8s-operator.html#k8s-operator
I followed the instructions in the k8s operator setup referenced above. After successfully creating the operator pod, I attempted to launch a Ray cluster with kubectl -n ray apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/example_cluster.yaml, but the launch failed. See the logs from the Ray operator pod:
$ oc logs ray-operator-pod
example-cluster:2021-01-19 05:46:11,736 DEBUG config.py:83 -- Updating the resources of node type head-node to include {'CPU': 1, 'GPU': 0}.
example-cluster:2021-01-19 05:46:11,737 DEBUG config.py:83 -- Updating the resources of node type worker-nodes to include {'CPU': 1, 'GPU': 0}.
example-cluster:2021-01-19 05:46:11,773 WARNING config.py:164 -- KubernetesNodeProvider: not checking if namespace 'ray' exists
example-cluster:2021-01-19 05:46:11,773 INFO config.py:184 -- KubernetesNodeProvider: no autoscaler_service_account config provided, must already exist
example-cluster:2021-01-19 05:46:11,773 INFO config.py:210 -- KubernetesNodeProvider: no autoscaler_role config provided, must already exist
example-cluster:2021-01-19 05:46:11,774 INFO config.py:236 -- KubernetesNodeProvider: no autoscaler_role_binding config provided, must already exist
example-cluster:2021-01-19 05:46:11,774 INFO config.py:269 -- KubernetesNodeProvider: no services config provided, must already exist
example-cluster:2021-01-19 05:46:11,809 INFO node_provider.py:114 -- KubernetesNodeProvider: calling create_namespaced_pod (count=1).
2021-01-19 05:46:11,687 INFO commands.py:221 -- Cluster: example-cluster
2021-01-19 05:46:11,735 INFO commands.py:283 -- Checking Kubernetes environment settings
2021-01-19 05:46:11,808 INFO commands.py:533 -- No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]
2021-01-19 05:46:11,808 INFO commands.py:578 -- Acquiring an up-to-date head node
Process example-cluster:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/ray/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/operator/operator.py", line 48, in _create_or_update
    self.start_head()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/operator/operator.py", line 60, in start_head
    no_config_cache=True)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 228, in create_or_update_cluster
    override_cluster_name)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 598, in get_or_create_head_node
    provider.create_node(head_node_config, head_node_tags, 1)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/kubernetes/node_provider.py", line 117, in create_node
    pod = core_api().create_namespaced_pod(self.namespace, pod_spec)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 7320, in create_namespaced_pod
    return self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 7429, in create_namespaced_pod_with_http_info
    collection_formats=collection_formats)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
    _preload_content, _request_timeout, _host)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
    _request_timeout=_request_timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 397, in request
    body=body)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 280, in POST
    body=body)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 233, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': '562ee453-9aa8-4190-8450-3fb975bb0a7a', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 19 Jan 2021 13:46:11 GMT', 'Content-Length': '352'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"example-cluster-ray-head-pmfw4\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , \u003cnil\u003e","reason":"Forbidden","details":{"name":"example-cluster-ray-head-pmfw4","kind":"pods"},"code":403}
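For what it's worth, this error text is the one emitted by the OwnerReferencesPermissionEnforcement admission plugin, which OpenShift enables by default: creating a pod whose ownerReference sets blockOwnerDeletion requires update permission on the finalizers subresource of the owning resource. (The empty kind and <nil> in the message may also indicate that the ownerReference the operator attaches is missing fields.) A possible workaround, sketched below and not yet verified, is to grant that permission to the operator's service account. The Role/RoleBinding names here are made up, and pods/finalizers assumes the owning resource is itself a pod:

# Sketch of supplemental RBAC for the workaround described above.
# Names are illustrative; the rule could instead be merged into the
# existing operator Role from the example configs.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ray-operator-finalizers        # hypothetical name
  namespace: ray
rules:
- apiGroups: [""]
  resources:
  - pods/finalizers                    # assumes the owner is a pod
  verbs:
  - update
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ray-operator-finalizers        # hypothetical name
  namespace: ray
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ray-operator-finalizers
subjects:
- kind: ServiceAccount
  name: ray-operator-serviceaccount    # service account from the pod spec below
  namespace: ray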
Reproduction (REQUIRED)
Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):
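The failure depends on cluster-side admission control rather than on any Python code, so the closest thing to a copy-pasteable repro is the kubectl sequence from the description, run against a cluster with the OwnerReferencesPermissionEnforcement admission plugin enabled (the operator manifest placeholder stands in for the file from the referenced docs):

$ kubectl create namespace ray
$ kubectl -n ray apply -f <operator manifest from the k8s-operator docs>
$ kubectl -n ray apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/example_cluster.yaml
$ oc logs ray-operator-pod    # prints the 403 traceback shown above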
Environment Details:
Openshift v4.5.8
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.0", GitCommit:"0ed33881dc4355495f623c6f22e7dd0b7632b7c0", GitTreeState:"clean", BuildDate:"2018-09-27T17:05:32Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.3+fa69cae", GitCommit:"fa69cae", GitTreeState:"clean", BuildDate:"2020-12-14T23:03:06Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
Operator pod running successfully:
$ kubectl get pods
NAME               READY   STATUS    RESTARTS   AGE
ray-operator-pod   1/1     Running   1          24m
Ray Image:
$ kubectl describe pod ray-operator-pod | grep "Image ID:"
Image ID: docker.io/rayproject/ray@sha256:b6273b691dff8d980128dad0a6fe70ceadc755ea24490da413d07710ee04d88b
YAML output of the running operator pod:
$ kubectl get pod ray-operator-pod -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/podIP: 172.30.100.133/32
    cni.projectcalico.org/podIPs: 172.30.100.133/32
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "k8s-pod-network",
          "ips": [
              "172.30.100.133"
          ],
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "k8s-pod-network",
          "ips": [
              "172.30.100.133"
          ],
          "default": true,
          "dns": {}
      }]
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"ray-operator-pod","namespace":"ray"},"spec":{"containers":[{"command":["ray-operator"],"env":[{"name":"RAY_OPERATOR_POD_NAMESPACE","valueFrom":{"fieldRef":{"fieldPath":"metadata.namespace"}}}],"image":"rayproject/ray:nightly","imagePullPolicy":"Always","name":"ray","resources":{"limits":{"memory":"2Gi"},"requests":{"cpu":1,"memory":"1Gi"}}}],"serviceAccountName":"ray-operator-serviceaccount"}}
    openshift.io/scc: anyuid
  creationTimestamp: 2021-01-19T13:42:53Z
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
      f:spec:
        f:containers:
          k:{"name":"ray"}:
            .: {}
            f:command: {}
            f:env:
              .: {}
              k:{"name":"RAY_OPERATOR_POD_NAMESPACE"}:
                .: {}
                f:name: {}
                f:valueFrom:
                  .: {}
                  f:fieldRef:
                    .: {}
                    f:apiVersion: {}
                    f:fieldPath: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:name: {}
            f:resources:
              .: {}
              f:limits:
                .: {}
                f:memory: {}
              f:requests:
                .: {}
                f:cpu: {}
                f:memory: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
        f:dnsPolicy: {}
        f:enableServiceLinks: {}
        f:restartPolicy: {}
        f:schedulerName: {}
        f:securityContext: {}
        f:serviceAccount: {}
        f:serviceAccountName: {}
        f:terminationGracePeriodSeconds: {}
    manager: kubectl
    operation: Update
    time: 2021-01-19T13:42:53Z
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:cni.projectcalico.org/podIP: {}
          f:cni.projectcalico.org/podIPs: {}
    manager: calico
    operation: Update
    time: 2021-01-19T13:42:54Z
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:k8s.v1.cni.cncf.io/network-status: {}
          f:k8s.v1.cni.cncf.io/networks-status: {}
    manager: multus
    operation: Update
    time: 2021-01-19T13:42:54Z
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          k:{"type":"ContainersReady"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Initialized"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Ready"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
        f:containerStatuses: {}
        f:hostIP: {}
        f:phase: {}
        f:podIP: {}
        f:podIPs:
          .: {}
          k:{"ip":"172.30.100.133"}:
            .: {}
            f:ip: {}
        f:startTime: {}
    manager: kubelet
    operation: Update
    time: 2021-01-19T14:01:52Z
  name: ray-operator-pod
  namespace: ray
  resourceVersion: "25632355"
  selfLink: /api/v1/namespaces/ray/pods/ray-operator-pod
  uid: a7163014-e459-4479-9665-732ee18eff16
spec:
  containers:
  - command:
    - ray-operator
    env:
    - name: RAY_OPERATOR_POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    image: rayproject/ray:nightly
    imagePullPolicy: Always
    name: ray
    resources:
      limits:
        memory: 2Gi
      requests:
        cpu: "1"
        memory: 1Gi
    securityContext:
      capabilities:
        drop:
        - MKNOD
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: ray-operator-serviceaccount-token-lhfnd
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: ray-operator-serviceaccount-dockercfg-bcrj5
  nodeName: 10.95.102.76
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    seLinuxOptions:
      level: s0:c25,c15
  serviceAccount: ray-operator-serviceaccount
  serviceAccountName: ray-operator-serviceaccount
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: ray-operator-serviceaccount-token-lhfnd
    secret:
      defaultMode: 420
      secretName: ray-operator-serviceaccount-token-lhfnd
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2021-01-19T13:42:53Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2021-01-19T14:01:52Z
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: 2021-01-19T14:01:52Z
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: 2021-01-19T13:42:53Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://a9670e6ca1e06ed70db76af3fe35e79198b91cbb3f75bc3d4d1d777f5aae7dae
    image: docker.io/rayproject/ray:nightly
    imageID: docker.io/rayproject/ray@sha256:8a09fc4eff3c142ae9c0174b7beb8311a479afd53d85010aa092307479d59eb5
    lastState:
      terminated:
        containerID: cri-o://08c907a4ff9c2854fe34edd158432cd87724bbcbf4d0ba346690593c8824bcef
        exitCode: 1
        finishedAt: 2021-01-19T14:01:49Z
        reason: Error
        startedAt: 2021-01-19T13:43:42Z
    name: ray
    ready: true
    restartCount: 1
    started: true
    state:
      running:
        startedAt: 2021-01-19T14:01:51Z
  hostIP: 10.95.102.76
  phase: Running
  podIP: 172.30.100.133
  podIPs:
  - ip: 172.30.100.133
  qosClass: Burstable
  startTime: 2021-01-19T13:42:53Z
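To confirm the RBAC theory on this cluster, the following check against the service account shown above should report "no" before any workaround is applied and "yes" after (pods/finalizers is, again, an assumption about the owning resource):

$ kubectl -n ray auth can-i update pods/finalizers \
    --as=system:serviceaccount:ray:ray-operator-serviceaccount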
If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.