
[train] Raise error when calling ray.train.report with a gpu tensor#53725

Merged
justinvyu merged 2 commits into ray-project:master from TimothySeah:tseah/raise-gpu-tensor-error
Jun 11, 2025

Conversation


@TimothySeah TimothySeah commented Jun 10, 2025

Right now, when users call `ray.train.report` with a GPU tensor in their Ray Train `train_func`, the train controller fails to deserialize the GPU tensor, causing the training run to hang. With this change, the train workers running the `train_func` preemptively raise a ValueError, allowing the run to terminate properly.

Tested by running this script (previously known to hang) in a workspace:

import torch
import ray.train

from ray.train.torch import TorchTrainer

def train_func():
    # Reporting a CUDA tensor as a metric previously hung the run;
    # with this change it raises a ValueError on the worker instead.
    x = torch.tensor([1.0], device=torch.device("cuda"))
    ray.train.report({"x": x})

trainer = TorchTrainer(train_func, scaling_config=ray.train.ScalingConfig(use_gpu=True))
trainer.fit()

which resulted in the following logs:

(base) ray@ip-10-0-50-151:~/default$ RAY_TRAIN_V2_ENABLED=1 python repro.py
2025-06-10 18:47:41,730 INFO worker.py:1736 -- Connecting to existing Ray cluster at address: 10.0.50.151:6379...
2025-06-10 18:47:41,742 INFO worker.py:1907 -- Connected to Ray cluster. View the dashboard at https://session-p6gyps3ixmrxrgbdnhhtktkxa8.i.anyscaleuserdata-staging.com 
2025-06-10 18:47:41,744 INFO packaging.py:380 -- Pushing file package 'gcs://_ray_pkg_acaede90bc4876a6fb8ccaa7821e0566476ec355.zip' (0.02MiB) to Ray cluster...
2025-06-10 18:47:41,744 INFO packaging.py:393 -- Successfully pushed file package 'gcs://_ray_pkg_acaede90bc4876a6fb8ccaa7821e0566476ec355.zip'.
(TrainController pid=8326) [State Transition] INITIALIZING -> SCHEDULING.
(TrainController pid=8326) Attempting to start training worker group of size 1 with the following resources: [{'GPU': 1}] * 1
(TrainController pid=8326) Retrying the launch of the training worker group. The previous launch attempt encountered the following failure:
(TrainController pid=8326) The worker group startup timed out after 30.0 seconds waiting for 1 workers. Potential causes include: (1) temporary insufficient cluster resources while waiting for autoscaling (ignore this warning in this case), (2) infeasible resource request where the provided `ScalingConfig` cannot be satisfied), and (3) transient network issues. Set the RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S environment variable to increase the timeout.
(TrainController pid=8326) [State Transition] SCHEDULING -> RESCHEDULING.
(TrainController pid=8326) [State Transition] RESCHEDULING -> SCHEDULING.
(TrainController pid=8326) Attempting to start training worker group of size 1 with the following resources: [{'GPU': 1}] * 1
(TrainController pid=8326) Retrying the launch of the training worker group. The previous launch attempt encountered the following failure:
(TrainController pid=8326) The worker group startup timed out after 30.0 seconds waiting for 1 workers. Potential causes include: (1) temporary insufficient cluster resources while waiting for autoscaling (ignore this warning in this case), (2) infeasible resource request where the provided `ScalingConfig` cannot be satisfied), and (3) transient network issues. Set the RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S environment variable to increase the timeout.
(TrainController pid=8326) [State Transition] SCHEDULING -> RESCHEDULING.
(TrainController pid=8326) [State Transition] RESCHEDULING -> SCHEDULING.
(TrainController pid=8326) Attempting to start training worker group of size 1 with the following resources: [{'GPU': 1}] * 1
(RayTrainWorker pid=3299, ip=10.0.27.89) Setting up process group for: env:// [rank=0, world_size=1]
(TrainController pid=8326) Started training worker group of size 1: 
(TrainController pid=8326) - (ip=10.0.27.89, pid=3299) world_rank=0, local_rank=0, node_rank=0
(TrainController pid=8326) [State Transition] SCHEDULING -> RUNNING.
(RayTrainWorker pid=3299, ip=10.0.27.89) Error in training function:
(RayTrainWorker pid=3299, ip=10.0.27.89) Traceback (most recent call last):
(RayTrainWorker pid=3299, ip=10.0.27.89)   File "/home/ray/default/repro.py", line 8, in train_func
(RayTrainWorker pid=3299, ip=10.0.27.89)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/v2/api/train_fn_utils.py", line 86, in report
(RayTrainWorker pid=3299, ip=10.0.27.89)     get_train_context().report(
(RayTrainWorker pid=3299, ip=10.0.27.89)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/v2/_internal/execution/context.py", line 229, in report
(RayTrainWorker pid=3299, ip=10.0.27.89)     raise ValueError(
(RayTrainWorker pid=3299, ip=10.0.27.89) ValueError: Passing objects containg Torch tensors as metrics is not supported as it will throw an exception on deserialization. You can either convert the tensors to Python objects or report a `train.Checkpoint` with `ray.train.report` to store your Torch objects.
(RayTrainWorker pid=3299, ip=10.0.27.89) 
Traceback (most recent call last):
  File "/home/ray/default/repro.py", line 11, in <module>
    trainer.fit()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/v2/api/data_parallel_trainer.py", line 129, in fit
    raise result.error
ray.train.v2.api.exceptions.TrainingFailedError: Training failed due to worker errors:
[Rank 0]
Traceback (most recent call last):
  File "/home/ray/default/repro.py", line 8, in train_func
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/v2/api/train_fn_utils.py", line 86, in report
    get_train_context().report(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/v2/_internal/execution/context.py", line 229, in report
    raise ValueError(
ValueError: Passing objects containg Torch tensors as metrics is not supported as it will throw an exception on deserialization. You can either convert the tensors to Python objects or report a `train.Checkpoint` with `ray.train.report` to store your Torch objects.

(TrainController pid=8326) Deciding to TERMINATE, since the total failure count (1) exceeded the maximum allowed failures: FailureConfig(max_failures=0).
(TrainController pid=8326) Terminating training worker group after encountering failure(s) on 1 worker(s):
(TrainController pid=8326) [Rank 0]
(TrainController pid=8326) 
(TrainController pid=8326) [State Transition] RUNNING -> ERRORED.
(TrainController pid=8326) Traceback (most recent call last):
(TrainController pid=8326)   File "/home/ray/default/repro.py", line 8, in train_func
(TrainController pid=8326)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/v2/_internal/execution/context.py", line 229, in report [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(TrainController pid=8326)     get_train_context().report(
(TrainController pid=8326)     raise ValueError(
(TrainController pid=8326) ValueError: Passing objects containg Torch tensors as metrics is not supported as it will throw an exception on deserialization. You can either convert the tensors to Python objects or report a `train.Checkpoint` with `ray.train.report` to store your Torch objects.
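As the error message suggests, the user-side fix is to move reported values off the GPU into plain Python objects before calling `ray.train.report`. A minimal sketch of that conversion (the helper name is illustrative; `.item()` is torch's standard call for extracting a Python scalar from a one-element tensor):

```python
def to_reportable(metrics: dict) -> dict:
    """Convert tensor-like values (anything exposing .item()) to plain
    Python scalars so they serialize cleanly across processes."""
    return {
        k: (v.item() if hasattr(v, "item") else v)
        for k, v in metrics.items()
    }

# In train_func, instead of ray.train.report({"x": x}):
#   ray.train.report(to_reportable({"x": x}))
```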

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested a review from a team as a code owner June 10, 2025 23:16

@justinvyu justinvyu left a comment


Maybe we can make this a TorchTrainer specific WorkerCallback that implements a check within on_report instead.

Then we would be able to remove the torch module check and remove this code in the generic report implementation.
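A rough sketch of that alternative, with hypothetical names (the `on_report` method and the callback registration mechanism are illustrative assumptions, not Ray's confirmed `WorkerCallback` API): the idea is to keep the generic `report` path torch-free and hook the validation in from the Torch-specific side.

```python
class TorchReportCallback:
    """Hypothetical TorchTrainer-specific worker callback that validates
    metrics on every report, keeping the generic report path torch-free."""

    def on_report(self, metrics: dict) -> None:
        for key, value in metrics.items():
            # Module-name check avoids importing torch in the sketch.
            if type(value).__module__.split(".")[0] == "torch":
                raise ValueError(
                    f"Metric {key!r} is a Torch object; convert it to a "
                    "plain Python value before calling ray.train.report."
                )
```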

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah
Contributor Author

Maybe we can make this a TorchTrainer specific WorkerCallback that implements a check within on_report instead.

Then we would be able to remove the torch module check and remove this code in the generic report implementation.

Filed a ticket for this as per our discussion.

@TimothySeah TimothySeah requested a review from justinvyu June 11, 2025 01:43

@justinvyu justinvyu left a comment


Thanks!

@justinvyu justinvyu enabled auto-merge (squash) June 11, 2025 19:40
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Jun 11, 2025
@justinvyu justinvyu merged commit 0c84a2c into ray-project:master Jun 11, 2025
7 checks passed
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
…53725)

Right now, when users call `ray.train.report` with a gpu tensor in their
ray train train_function, the train controller fails to deserialize the
gpu tensor, causing the training run to hang. With this change, the
train workers running the train_function preemptively raise a
ValueError, allowing the train run to terminate properly.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025