[Ray Serve] Conda environment can't be activated from image on ray==2.44.1 #51971

@JJMinton

Description

What happened + What you expected to happen

I upgraded from ray 2.42.1 to 2.44.1 (I also tested 2.44.0, which fails as well) and I get the following error in the serve controller logs:

File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/runtime_env/agent/runtime_env_agent.py", line 391, in _create_runtime_env_with_retry
    runtime_env_context = await asyncio.wait_for(
File "/home/ray/anaconda3/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/runtime_env/agent/runtime_env_agent.py", line 357, in _setup_runtime_env
    await create_for_plugin_if_needed(
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/runtime_env/plugin.py", line 249, in create_for_plugin_if_needed
    await plugin.create(None, runtime_env, context, logger=logger)
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/runtime_env/conda.py", line 386, in create
    return await loop.run_in_executor(None, _create)
File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/runtime_env/conda.py", line 347, in _create
    raise ValueError(
ValueError: The given conda environment 'minimal-example' from the runtime env {'_ray_commit': 'daca7b2b1a950dc7f731e34e74c76ae383794ffe', 'conda': 'minimal-example'} can't be activated with conda activate minimal-example 1>&2 && python --version. You can only specify an env that already exists. Please make sure to create an env minimal-example
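For what it's worth, the failing check can be probed outside Ray. A minimal sketch that just rebuilds the probe command suggested by the error message above (assuming the check in ray/_private/runtime_env/conda.py still runs a shell string of this shape):

```python
import shlex


def build_probe_command(env_name: str) -> str:
    # Sketch of the activation probe implied by the error above:
    # "conda activate" chatter is redirected to stderr, so only the
    # "python --version" output reaches stdout if activation succeeds.
    return f"conda activate {shlex.quote(env_name)} 1>&2 && python --version"


print(build_probe_command("minimal-example"))
```

Running the printed command in an interactive shell inside the container (e.g. `bash -i -c '...'`) succeeds for me, which is what makes the controller-side failure surprising.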

I have baked the conda environment into the Docker image the cluster is running. Its presence can be verified by connecting to the cluster and activating it manually (or via VS Code on the dashboard).
I verified this setup still works correctly on 2.42.1.

Versions / Dependencies

ray[serve]==2.44.1

Reproduction script

Dockerfile:

# syntax=docker/dockerfile:1.4
FROM rayproject/ray:2.44.1-py310-cu121 AS base
ENV PYTHONUNBUFFERED=1

RUN python -m pip install --upgrade pip && \
    conda upgrade -n base -c defaults conda

COPY --chown=ray . .
RUN conda env create -f cluster/conda-env.yaml

RUN conda run -n minimal-example python --version
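For completeness, a hypothetical `cluster/conda-env.yaml` along these lines (the actual file is in the linked repo; the channel and pinned Python version here are assumptions):

```yaml
name: minimal-example
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pip
```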

manifest.yaml:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: minimal-ray-example
  namespace: defaults
  labels:
    app: minimal-ray-example
    cluster: minimal-ray-example
spec:
  rayClusterConfig:
    rayVersion: 2.44.1
    enableInTreeAutoscaling: true
    autoscalerOptions:
      upscalingMode: Default
      idleTimeoutSeconds: 1200
      imagePullPolicy: IfNotPresent
      securityContext: {}
      env: []
      envFrom: []
    headGroupSpec:
      serviceType: ClusterIP
      rayStartParams:
        dashboard-host: 0.0.0.0
        resources: '"{\"instance_type:c7i.2xlarge\": 1, \"priority:normal\": 1, \"lifecycle:on-demand\":
          1}"'
      template:
        metadata:
          labels:
            instance.type: c7i.2xlarge
            priority: normal
            lifecycle: on-demand
          annotations:
            karpenter.sh/do-not-disrupt: 'true'
        spec:
          serviceAccountName: ''
          containers:
          - name: ray-head
            image: ${DOCKER_NAME}@${IMAGE_SHA}
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 44217
              name: as-metrics
            - containerPort: 44227
              name: dash-metrics
            - containerPort: 8484
              name: serve
            - containerPort: 3001
              name: code-server
            - containerPort: 6006
              name: tensorboard
            lifecycle:
              preStop:
                exec:
                  command:
                  - /bin/sh
                  - -c
                  - ray stop
            volumeMounts:
            - mountPath: /tmp/ray
              name: ray-logs
            resources:
              limits: &id001
                cpu: 4
                memory: 8Gi
              requests: *id001
          nodeSelector:
            node.kubernetes.io/instance-type: c7i.2xlarge
            karpenter.sh/capacity-type: on-demand
            topology.kubernetes.io/zone: us-east-1a
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: karpenter.sh/nodepool
                    operator: In
                    values:
                    - cpu
                    - gpu-g5
                    - gpu-p5
                    - gpu-p
                    - kuberay-nodepool-cpu
                    - kuberay-nodepool-gpu-g5
                    - kuberay-nodepool-gpu-p5
          volumes:
          - name: ray-logs
            emptyDir: {}
          priorityClassName: ray-job-normal
    workerGroupSpecs:
    - replicas: 0
      minReplicas: 0
      maxReplicas: 20
      groupName: minimal-ray-example-worker-c7i-2xlarge
      rayStartParams:
        resources: '"{\"instance_type:c7i.2xlarge\": 1, \"priority:normal\": 1, \"lifecycle:on-demand\":
          1}"'
      template:
        metadata:
          labels:
            instance.type: c7i.2xlarge
            priority: normal
            lifecycle: on-demand
          annotations:
            karpenter.sh/do-not-disrupt: 'true'
        spec:
          serviceAccountName: ''
          containers:
          - name: ray-worker
            image: ${DOCKER_NAME}@${IMAGE_SHA}
            lifecycle:
              preStop:
                exec:
                  command:
                  - /bin/sh
                  - -c
                  - ray stop
            volumeMounts:
            - mountPath: /tmp/ray
              name: ray-logs
            resources:
              limits: &id002
                cpu: 4
                memory: 8Gi
              requests: *id002
            env: []
          nodeSelector:
            node.kubernetes.io/instance-type: c7i.2xlarge
            karpenter.sh/capacity-type: on-demand
            topology.kubernetes.io/zone: us-east-1a
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: karpenter.sh/nodepool
                    operator: In
                    values:
                    - cpu
                    - gpu-g5
                    - gpu-p5
                    - gpu-p
                    - kuberay-nodepool-cpu
                    - kuberay-nodepool-gpu-g5
                    - kuberay-nodepool-gpu-p5
          volumes:
          - name: ray-logs
            emptyDir: {}
          priorityClassName: ray-job-normal
  serveConfigV2: |
    proxy_location: EveryNode
    http_options:
      host: 0.0.0.0
      port: 8484
      root_path: /minimal-ray-example
    grpc_options:
      port: 9000
      grpc_servicer_functions: []
    logging_config:
      encoding: TEXT
      log_level: INFO
      logs_dir: null
      enable_access_log: true
    applications:
    - name: MinimalExample
      route_prefix: /
      import_path: package.app:ray_app
      runtime_env:
        conda: minimal-example
      deployments:
      - name: FastAPIDeployment
        ray_actor_options:
          num_cpus: 1
        num_replicas: 1

I have posted a complete project structure here: https://github.com/JJMinton/ray-serve-bug-report

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Metadata

Labels

P0 (Issues that should be fixed in short order), bug (Something that is supposed to be working; but isn't), community-backlog, core (Issues that should be addressed in Ray Core)
