
[train] Raise error when calling ray.train.report with a gpu tensor#53725

Merged
justinvyu merged 2 commits into ray-project:master from TimothySeah:tseah/raise-gpu-tensor-error
Jun 11, 2025

Conversation


@TimothySeah TimothySeah commented Jun 10, 2025

Right now, when users call `ray.train.report` with a GPU tensor in their Ray Train `train_func`, the train controller fails to deserialize the GPU tensor, causing the training run to hang. With this change, the train workers running the `train_func` preemptively raise a ValueError, allowing the run to terminate properly.

Tested by running this script (previously known to hang) in a workspace:

import torch
import ray.train

from ray.train.torch import TorchTrainer

def train_func():
    # Reporting a CUDA tensor as a metric previously hung the run;
    # with this change it raises a ValueError on the worker instead.
    x = torch.tensor([1.0], device=torch.device("cuda"))
    ray.train.report({"x": x})

trainer = TorchTrainer(train_func, scaling_config=ray.train.ScalingConfig(use_gpu=True))
trainer.fit()

which resulted in the following logs:

(base) ray@ip-10-0-50-151:~/default$ RAY_TRAIN_V2_ENABLED=1 python repro.py
2025-06-10 18:47:41,730 INFO worker.py:1736 -- Connecting to existing Ray cluster at address: 10.0.50.151:6379...
2025-06-10 18:47:41,742 INFO worker.py:1907 -- Connected to Ray cluster. View the dashboard at https://session-p6gyps3ixmrxrgbdnhhtktkxa8.i.anyscaleuserdata-staging.com 
2025-06-10 18:47:41,744 INFO packaging.py:380 -- Pushing file package 'gcs://_ray_pkg_acaede90bc4876a6fb8ccaa7821e0566476ec355.zip' (0.02MiB) to Ray cluster...
2025-06-10 18:47:41,744 INFO packaging.py:393 -- Successfully pushed file package 'gcs://_ray_pkg_acaede90bc4876a6fb8ccaa7821e0566476ec355.zip'.
(TrainController pid=8326) [State Transition] INITIALIZING -> SCHEDULING.
(TrainController pid=8326) Attempting to start training worker group of size 1 with the following resources: [{'GPU': 1}] * 1
(TrainController pid=8326) Retrying the launch of the training worker group. The previous launch attempt encountered the following failure:
(TrainController pid=8326) The worker group startup timed out after 30.0 seconds waiting for 1 workers. Potential causes include: (1) temporary insufficient cluster resources while waiting for autoscaling (ignore this warning in this case), (2) infeasible resource request where the provided `ScalingConfig` cannot be satisfied), and (3) transient network issues. Set the RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S environment variable to increase the timeout.
(TrainController pid=8326) [State Transition] SCHEDULING -> RESCHEDULING.
(TrainController pid=8326) [State Transition] RESCHEDULING -> SCHEDULING.
(TrainController pid=8326) Attempting to start training worker group of size 1 with the following resources: [{'GPU': 1}] * 1
(TrainController pid=8326) Retrying the launch of the training worker group. The previous launch attempt encountered the following failure:
(TrainController pid=8326) The worker group startup timed out after 30.0 seconds waiting for 1 workers. Potential causes include: (1) temporary insufficient cluster resources while waiting for autoscaling (ignore this warning in this case), (2) infeasible resource request where the provided `ScalingConfig` cannot be satisfied), and (3) transient network issues. Set the RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S environment variable to increase the timeout.
(TrainController pid=8326) [State Transition] SCHEDULING -> RESCHEDULING.
(TrainController pid=8326) [State Transition] RESCHEDULING -> SCHEDULING.
(TrainController pid=8326) Attempting to start training worker group of size 1 with the following resources: [{'GPU': 1}] * 1
(RayTrainWorker pid=3299, ip=10.0.27.89) Setting up process group for: env:// [rank=0, world_size=1]
(TrainController pid=8326) Started training worker group of size 1: 
(TrainController pid=8326) - (ip=10.0.27.89, pid=3299) world_rank=0, local_rank=0, node_rank=0
(TrainController pid=8326) [State Transition] SCHEDULING -> RUNNING.
(RayTrainWorker pid=3299, ip=10.0.27.89) Error in training function:
(RayTrainWorker pid=3299, ip=10.0.27.89) Traceback (most recent call last):
(RayTrainWorker pid=3299, ip=10.0.27.89)   File "/home/ray/default/repro.py", line 8, in train_func
(RayTrainWorker pid=3299, ip=10.0.27.89)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/v2/api/train_fn_utils.py", line 86, in report
(RayTrainWorker pid=3299, ip=10.0.27.89)     get_train_context().report(
(RayTrainWorker pid=3299, ip=10.0.27.89)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/v2/_internal/execution/context.py", line 229, in report
(RayTrainWorker pid=3299, ip=10.0.27.89)     raise ValueError(
(RayTrainWorker pid=3299, ip=10.0.27.89) ValueError: Passing objects containg Torch tensors as metrics is not supported as it will throw an exception on deserialization. You can either convert the tensors to Python objects or report a `train.Checkpoint` with `ray.train.report` to store your Torch objects.
(RayTrainWorker pid=3299, ip=10.0.27.89) 
Traceback (most recent call last):
  File "/home/ray/default/repro.py", line 11, in <module>
    trainer.fit()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/v2/api/data_parallel_trainer.py", line 129, in fit
    raise result.error
ray.train.v2.api.exceptions.TrainingFailedError: Training failed due to worker errors:
[Rank 0]
Traceback (most recent call last):
  File "/home/ray/default/repro.py", line 8, in train_func
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/v2/api/train_fn_utils.py", line 86, in report
    get_train_context().report(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/v2/_internal/execution/context.py", line 229, in report
    raise ValueError(
ValueError: Passing objects containg Torch tensors as metrics is not supported as it will throw an exception on deserialization. You can either convert the tensors to Python objects or report a `train.Checkpoint` with `ray.train.report` to store your Torch objects.

(TrainController pid=8326) Deciding to TERMINATE, since the total failure count (1) exceeded the maximum allowed failures: FailureConfig(max_failures=0).
(TrainController pid=8326) Terminating training worker group after encountering failure(s) on 1 worker(s):
(TrainController pid=8326) [Rank 0]
(TrainController pid=8326) 
(TrainController pid=8326) [State Transition] RUNNING -> ERRORED.
(TrainController pid=8326) Traceback (most recent call last):
(TrainController pid=8326)   File "/home/ray/default/repro.py", line 8, in train_func
(TrainController pid=8326)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/v2/_internal/execution/context.py", line 229, in report [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(TrainController pid=8326)     get_train_context().report(
(TrainController pid=8326)     raise ValueError(
(TrainController pid=8326) ValueError: Passing objects containg Torch tensors as metrics is not supported as it will throw an exception on deserialization. You can either convert the tensors to Python objects or report a `train.Checkpoint` with `ray.train.report` to store your Torch objects.
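As the error message suggests, the user-side fix is to move reported values off the GPU into plain Python objects before calling `ray.train.report`. A minimal sketch of that conversion (the helper name is illustrative; `.item()` is torch's standard call for extracting a Python scalar from a one-element tensor):

```python
def to_reportable(metrics: dict) -> dict:
    """Convert tensor-like values (anything exposing .item()) to plain
    Python scalars so they serialize cleanly across processes."""
    return {
        k: (v.item() if hasattr(v, "item") else v)
        for k, v in metrics.items()
    }

# In train_func, instead of ray.train.report({"x": x}):
#   ray.train.report(to_reportable({"x": x}))
```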

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested a review from a team as a code owner June 10, 2025 23:16

@justinvyu justinvyu left a comment


Maybe we can make this a TorchTrainer specific WorkerCallback that implements a check within on_report instead.

Then we would be able to remove the torch module check and remove this code in the generic report implementation.
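A rough sketch of that alternative, with hypothetical names (the `on_report` method and the callback registration mechanism are illustrative assumptions, not Ray's confirmed `WorkerCallback` API): the idea is to keep the generic `report` path torch-free and hook the validation in from the Torch-specific side.

```python
class TorchReportCallback:
    """Hypothetical TorchTrainer-specific worker callback that validates
    metrics on every report, keeping the generic report path torch-free."""

    def on_report(self, metrics: dict) -> None:
        for key, value in metrics.items():
            # Module-name check avoids importing torch in the sketch.
            if type(value).__module__.split(".")[0] == "torch":
                raise ValueError(
                    f"Metric {key!r} is a Torch object; convert it to a "
                    "plain Python value before calling ray.train.report."
                )
```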

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah
Contributor Author

Maybe we can make this a TorchTrainer specific WorkerCallback that implements a check within on_report instead.

Then we would be able to remove the torch module check and remove this code in the generic report implementation.

Filed a ticket for this as per our discussion.

@TimothySeah TimothySeah requested a review from justinvyu June 11, 2025 01:43

@justinvyu justinvyu left a comment


Thanks!

@justinvyu justinvyu enabled auto-merge (squash) June 11, 2025 19:40
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Jun 11, 2025
@justinvyu justinvyu merged commit 0c84a2c into ray-project:master Jun 11, 2025
7 checks passed
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
…53725)

Right now, when users call `ray.train.report` with a gpu tensor in their
ray train train_function, the train controller fails to deserialize the
gpu tensor, causing the training run to hang. With this change, the
train workers running the train_function preemptively raise a
ValueError, allowing the train run to terminate properly.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025