
[Dashboard/Core] Resource list in Cluster Dashboard tab should show only logical GPUs #53641

@eicherseiji

Description


What happened + What you expected to happen

Bug: the Cluster tab shows 8 GPUs for the head pod, even though the head's `ray start` params requested 0 GPUs and the worker group requested 4.
Expected: the head node shows 0 GPUs on the Cluster page, matching the `ray status` output and the `ray start` params.

Dashboard cluster tab:
(screenshot: Cluster tab listing the head pod with 8 GPU)

(base) ray@ray-serve-llm-raycluster-mzkv9-head-lqnkv:~$ ray status
======== Autoscaler status: 2025-06-07 17:45:06.206068 ========
Node status
---------------------------------------------------------------
Active:
 1 node_9928b5bd637a26e42b8068817fb5d853febda9ec8424f4082618d855
 1 node_95c758accfddf1542e60fdccb17773d8c866bcd12b1c812c7ea70a80
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 3.0/32.0 CPU (1.0 used of 1.0 reserved in placement groups)
 1.0/4.0 GPU (1.0 used of 1.0 reserved in placement groups)
 0B/36.00GiB memory
 109.95KiB/10.36GiB object_store_memory

Total Constraints:
 (no request_resources() constraints)
Total Demands:
 (no resource demands)
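To illustrate the expected behavior, here is a minimal sketch (not Ray's actual dashboard code) of reporting the logical GPU counts that each node registered at `ray start` time. The node dicts below are hypothetical but modeled on the shape returned by `ray.nodes()`, whose per-node `"Resources"` field holds the logical resources, using the repro cluster's numbers (head started with `--num-gpus=0`, worker with `--num-gpus=4`):

```python
def logical_gpus(node_info):
    """Logical GPUs a node registered with Ray (0.0 if none registered)."""
    # "Resources" mirrors what `ray.nodes()` reports per node: the logical
    # resources set at `ray start` time, not the physically detected devices.
    return node_info.get("Resources", {}).get("GPU", 0.0)

# Hypothetical node entries shaped like `ray.nodes()` output for this repro.
nodes = [
    {"NodeManagerHostname": "head", "Resources": {"CPU": 0.0, "GPU": 0.0}},
    {"NodeManagerHostname": "gpu-worker", "Resources": {"CPU": 32.0, "GPU": 4.0}},
]

per_node = {n["NodeManagerHostname"]: logical_gpus(n) for n in nodes}
total = sum(per_node.values())
print(per_node)  # {'head': 0.0, 'gpu-worker': 4.0}
print(total)     # 4.0, matching the `ray status` total of 4 GPU, not 8
```

This is the aggregation `ray status` already reflects (1.0/4.0 GPU above); the Cluster tab appears to be reading the pod's physical GPU count instead.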

Versions / Dependencies

Ray 2.46.0, Python 3.11.11, KubeRay 1.3.0

Cluster: AWS EKS with 4x m5.4xlarge nodes and one 8xA100 node (p4d.24xlarge)

Reproduction script

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ray-serve-llm
spec:
  serveConfigV2: |
    applications:
    - name: llms
      import_path: ray.serve.llm:build_openai_app
      route_prefix: "/"
      args:
        llm_configs:
        - model_loading_config:
            model_id: qwen2.5-7b-instruct
            model_source: Qwen/Qwen2.5-7B-Instruct
          engine_kwargs:
            dtype: bfloat16
            max_model_len: 1024
            device: auto
            gpu_memory_utilization: 0.75
          deployment_config:
            autoscaling_config:
              min_replicas: 1
              max_replicas: 4
              target_ongoing_requests: 64
            max_ongoing_requests: 128
  rayClusterConfig:
    rayVersion: "2.46.0"
    headGroupSpec:
      rayStartParams:
        num-cpus: "0"
        num-gpus: "0"
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray-llm:2.46.0-py311-cu124
            ports:
            - containerPort: 8000
              name: serve
              protocol: TCP
            - containerPort: 8080
              name: metrics
              protocol: TCP
            - containerPort: 6379
              name: gcs
              protocol: TCP
            - containerPort: 8265
              name: dashboard
              protocol: TCP
            - containerPort: 10001
              name: client
              protocol: TCP
            resources:
              limits:
                cpu: 2
                memory: 4Gi
              requests:
                cpu: 2
                memory: 4Gi
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: hf_token
    workerGroupSpecs:
    - replicas: 1
      minReplicas: 1
      maxReplicas: 1
      numOfHosts: 1
      groupName: gpu-group
      rayStartParams:
        num-gpus: "4"
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray-llm:2.46.0-py311-cu124
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: hf_token
            resources:
              limits:
                cpu: 32
                memory: 32Gi
                nvidia.com/gpu: "4"
              requests:
                cpu: 32
                memory: 32Gi
                nvidia.com/gpu: "4"

---

apiVersion: v1
kind: Secret
metadata:
  name: hf-token
type: Opaque
stringData:
  hf_token: <your-hf-access-token-value>

Issue Severity

Low: It annoys or frustrates me.

Metadata


Labels

P1: issue that should be fixed within a few weeks
core: issues that should be addressed in Ray Core
dashboard: issues specific to the Ray Dashboard
enhancement: request for new feature and/or capability
usability
