[Ray Serve] Conda environment can't be activated from image on ray==2.44.1 #51971
Closed
Labels
P0 (Issues that should be fixed in short order), bug (Something that is supposed to be working, but isn't), community-backlog, core (Issues that should be addressed in Ray Core)
Description
What happened + What you expected to happen
I upgraded from ray 2.42.1 to 2.44.1 (I also tested 2.44.0, which fails as well) and I get the following error in the service controller logs:
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/runtime_env/agent/runtime_env_agent.py", line 391, in _create_runtime_env_with_retry
runtime_env_context = await asyncio.wait_for(
File "/home/ray/anaconda3/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/runtime_env/agent/runtime_env_agent.py", line 357, in _setup_runtime_env
await create_for_plugin_if_needed(
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/runtime_env/plugin.py", line 249, in create_for_plugin_if_needed
await plugin.create(None, runtime_env, context, logger=logger)
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/runtime_env/conda.py", line 386, in create
return await loop.run_in_executor(None, _create)
File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/runtime_env/conda.py", line 347, in _create
raise ValueError(
ValueError: The given conda environment 'minimal-example' from the runtime env {'_ray_commit': 'daca7b2b1a950dc7f731e34e74c76ae383794ffe', 'conda': 'minimal-example'} can't be activated with conda activate minimal-example 1>&2 && python --version
You can only specify an env that already exists. Please make sure to create an env minimal-example
I have baked the environment into the Docker image the cluster is running. The presence of the conda environment can be verified by connecting to the cluster and activating it manually (or via VS Code on the dashboard).
I verified this setup still works correctly on 2.42.1.
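The error message shows the exact shell command Ray's conda plugin uses to probe the environment. A minimal sketch of that probe, useful for reproducing the check by hand on the head node (this is a hypothetical helper written from the error message, not Ray's actual implementation):

```python
import subprocess


def conda_env_activatable(env_name: str) -> bool:
    """Return True if `conda activate <env>` succeeds in a fresh shell.

    Mirrors the command quoted in the ValueError:
        conda activate <env> 1>&2 && python --version
    A non-zero exit code is what Ray interprets as "env does not exist".
    """
    cmd = f"conda activate {env_name} 1>&2 && python --version"
    # Login shell so conda's shell hook (from ~/.bashrc / profile) is sourced.
    result = subprocess.run(["bash", "-lc", cmd], capture_output=True)
    return result.returncode == 0
```

Running this on the head node for `minimal-example` should show whether the env is visible to the same kind of shell the runtime-env agent spawns, as opposed to an interactive SSH session.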
Versions / Dependencies
ray[serve]==2.44.1
Reproduction script
Dockerfile:
# syntax=docker/dockerfile:1.4
FROM rayproject/ray:2.44.1-py310-cu121 AS base
ENV PYTHONUNBUFFERED=1
RUN python -m pip install --upgrade pip && \
conda upgrade -n base -c defaults conda
COPY --chown=ray . .
RUN conda env create -f cluster/conda-env.yaml
RUN conda run -n minimal-example python --version
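The referenced `cluster/conda-env.yaml` is not shown in this report (the full file is in the linked repository). A minimal hypothetical version consistent with the env name used above would look like:

```yaml
# cluster/conda-env.yaml (hypothetical sketch; see linked repo for the real file)
name: minimal-example
dependencies:
  - python=3.10
  - pip
```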
manifest.yaml:
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: minimal-ray-example
  namespace: defaults
  labels:
    app: minimal-ray-example
    cluster: minimal-ray-example
spec:
  rayClusterConfig:
    rayVersion: 2.44.1
    enableInTreeAutoscaling: true
    autoscalerOptions:
      upscalingMode: Default
      idleTimeoutSeconds: 1200
      imagePullPolicy: IfNotPresent
      securityContext: {}
      env: []
      envFrom: []
    headGroupSpec:
      serviceType: ClusterIP
      rayStartParams:
        dashboard-host: 0.0.0.0
        resources: '"{\"instance_type:c7i.2xlarge\": 1, \"priority:normal\": 1, \"lifecycle:on-demand\": 1}"'
      template:
        metadata:
          labels:
            instance.type: c7i.2xlarge
            priority: normal
            lifecycle: on-demand
          annotations:
            karpenter.sh/do-not-disrupt: 'true'
        spec:
          serviceAccountName: ''
          containers:
            - name: ray-head
              image: ${DOCKER_NAME}@${IMAGE_SHA}
              ports:
                - containerPort: 6379
                  name: gcs
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 44217
                  name: as-metrics
                - containerPort: 44227
                  name: dash-metrics
                - containerPort: 8484
                  name: serve
                - containerPort: 3001
                  name: code-server
                - containerPort: 6006
                  name: tensorboard
              lifecycle:
                preStop:
                  exec:
                    command:
                      - /bin/sh
                      - -c
                      - ray stop
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
              resources:
                limits: &id001
                  cpu: 4
                  memory: 8Gi
                requests: *id001
          nodeSelector:
            node.kubernetes.io/instance-type: c7i.2xlarge
            karpenter.sh/capacity-type: on-demand
            topology.kubernetes.io/zone: us-east-1a
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: karpenter.sh/nodepool
                        operator: In
                        values:
                          - cpu
                          - gpu-g5
                          - gpu-p5
                          - gpu-p
                          - kuberay-nodepool-cpu
                          - kuberay-nodepool-gpu-g5
                          - kuberay-nodepool-gpu-p5
          volumes:
            - name: ray-logs
              emptyDir: {}
          priorityClassName: ray-job-normal
    workerGroupSpecs:
      - replicas: 0
        minReplicas: 0
        maxReplicas: 20
        groupName: minimal-ray-example-worker-c7i-2xlarge
        rayStartParams:
          resources: '"{\"instance_type:c7i.2xlarge\": 1, \"priority:normal\": 1, \"lifecycle:on-demand\": 1}"'
        template:
          metadata:
            labels:
              instance.type: c7i.2xlarge
              priority: normal
              lifecycle: on-demand
            annotations:
              karpenter.sh/do-not-disrupt: 'true'
          spec:
            serviceAccountName: ''
            containers:
              - name: ray-worker
                image: ${DOCKER_NAME}@${IMAGE_SHA}
                lifecycle:
                  preStop:
                    exec:
                      command:
                        - /bin/sh
                        - -c
                        - ray stop
                volumeMounts:
                  - mountPath: /tmp/ray
                    name: ray-logs
                resources:
                  limits: &id002
                    cpu: 4
                    memory: 8Gi
                  requests: *id002
                env: []
            nodeSelector:
              node.kubernetes.io/instance-type: c7i.2xlarge
              karpenter.sh/capacity-type: on-demand
              topology.kubernetes.io/zone: us-east-1a
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: karpenter.sh/nodepool
                          operator: In
                          values:
                            - cpu
                            - gpu-g5
                            - gpu-p5
                            - gpu-p
                            - kuberay-nodepool-cpu
                            - kuberay-nodepool-gpu-g5
                            - kuberay-nodepool-gpu-p5
            volumes:
              - name: ray-logs
                emptyDir: {}
            priorityClassName: ray-job-normal
  serveConfigV2: |
    proxy_location: EveryNode
    http_options:
      host: 0.0.0.0
      port: 8484
      root_path: /minimal-ray-example
    grpc_options:
      port: 9000
      grpc_servicer_functions: []
    logging_config:
      encoding: TEXT
      log_level: INFO
      logs_dir: null
      enable_access_log: true
    applications:
      - name: MinimalExample
        route_prefix: /
        import_path: package.app:ray_app
        runtime_env:
          conda: minimal-example
        deployments:
          - name: FastAPIDeployment
            ray_actor_options:
              num_cpus: 1
            num_replicas: 1
I have posted a complete project structure here: https://github.com/JJMinton/ray-serve-bug-report
Issue Severity
Medium: It is a significant difficulty but I can work around it.