
[train] add trace to WorkerHealthCheckFailedError#53626

Merged

matthewdeng merged 1 commit into ray-project:master from matthewdeng:healthcheck-trace on Jun 11, 2025

Conversation

@matthewdeng
Contributor

Improve the string representation of WorkerHealthCheckFailedError to also include the base reason why the health check failed.

Repro

```python
import torch
import ray.train
from ray.train.torch import TorchTrainer

def train_func():
    x = torch.tensor([1.0], device=torch.device("cuda"))
    ray.train.report({"x": x})

trainer = TorchTrainer(train_func, scaling_config=ray.train.ScalingConfig(use_gpu=True))
trainer.fit()
```

Before

```
Terminating training worker group after encountering failure(s) on 1 worker(s):
[Rank 0]
A worker health check failed.
Worker info: Worker(
  actor=Actor(RayTrainWorker, d60c89b07cc4662cf2e3bdde07000000),
  metadata=ActorMetadata(
    hostname='ip-10-0-97-173',
    node_id='c3caabc03a219ac24d7db2c91ea0c9f62f02d5ac56c301bacceb1291',
    node_ip='10.0.97.173',
    pid=21785,
    accelerator_ids={'GPU': ['0']},
  ),
  distributed_context=DistributedContext(world_rank=0, world_size=1, local_rank=0, local_world_size=1, node_rank=0),
  log_file_path='/tmp/ray/session_2025-06-06_12-48-00_927737_2713/logs/train/ray-train-app-worker-7babcde8cfa5b228be37776e8812a75f993d156f235d070633f1afa2.log',
)
```

After

```
Terminating training worker group after encountering failure(s) on 1 worker(s):
[Rank 0]
A worker health check failed.
Worker info: Worker(
  actor=Actor(RayTrainWorker, d60c89b07cc4662cf2e3bdde07000000),
  metadata=ActorMetadata(
    hostname='ip-10-0-97-173',
    node_id='c3caabc03a219ac24d7db2c91ea0c9f62f02d5ac56c301bacceb1291',
    node_ip='10.0.97.173',
    pid=21785,
    accelerator_ids={'GPU': ['0']},
  ),
  distributed_context=DistributedContext(world_rank=0, world_size=1, local_rank=0, local_world_size=1, node_rank=0),
  log_file_path='/tmp/ray/session_2025-06-06_12-48-00_927737_2713/logs/train/ray-train-app-worker-7babcde8cfa5b228be37776e8812a75f993d156f235d070633f1afa2.log',
)
System error: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
traceback: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/serialization.py", line 460, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/serialization.py", line 317, in _deserialize_object
    return self._deserialize_msgpack_data(data, metadata_fields)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/serialization.py", line 272, in _deserialize_msgpack_data
    python_objects = self._deserialize_pickle5_data(pickle5_data)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/serialization.py", line 262, in _deserialize_pickle5_data
    obj = pickle.loads(in_band)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/storage.py", line 381, in _load_from_bytes
    return torch.load(io.BytesIO(b))
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 1040, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 1272, in _legacy_load
    result = unpickler.load()
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 1205, in persistent_load
    obj = restore_location(obj, location)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 390, in default_restore_location
    result = fn(storage, location)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 265, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 249, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
```

Signed-off-by: Matthew Deng <matt@anyscale.com>
Copilot AI review requested due to automatic review settings June 6, 2025 21:27
@matthewdeng matthewdeng requested a review from a team as a code owner June 6, 2025 21:27
Contributor

Copilot AI left a comment


Pull Request Overview

This PR enhances the WorkerHealthCheckFailedError exception by appending the underlying health check failure reason to its string output.

  • Overrides __str__ to include the base exception message.
  • Ensures end users see the root cause when a worker health check fails.
Comments suppressed due to low confidence (2)

python/ray/train/v2/_internal/exceptions.py:44 (`def __str__(self):`)

  • Add unit tests for the new `__str__` method to verify that the underlying `health_check_failure` is correctly included in the returned string.

python/ray/train/v2/_internal/exceptions.py:44 (`def __str__(self):`)

  • Consider using Python exception chaining (e.g., `raise WorkerHealthCheckFailedError(...) from health_check_failure`) instead of manual string concatenation so the original traceback is preserved and printed automatically.

@matthewdeng matthewdeng added the go add ONLY when ready to merge, run all tests label Jun 7, 2025
Comment on lines +44 to +45

```python
def __str__(self):
    return self._message + "\n" + str(self.health_check_failure)
```
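The effect of this override can be exercised with a quick self-contained check. The class below is a simplified stand-in: the real `WorkerHealthCheckFailedError` in `python/ray/train/v2/_internal/exceptions.py` also carries worker metadata, so this constructor is hypothetical.

```python
# Simplified stand-in for WorkerHealthCheckFailedError; the real class takes
# additional worker info, so this constructor is hypothetical.
class WorkerHealthCheckFailedError(Exception):
    def __init__(self, message: str, health_check_failure: Exception):
        super().__init__(message)
        self._message = message
        self.health_check_failure = health_check_failure

    def __str__(self) -> str:
        # The new behavior: append the underlying failure to the message.
        return self._message + "\n" + str(self.health_check_failure)


cause = RuntimeError("CUDA is not available on this worker")
err = WorkerHealthCheckFailedError("A worker health check failed.", cause)

# Both the generic message and the root cause now appear in str(err).
assert str(err).startswith("A worker health check failed.")
assert "CUDA is not available on this worker" in str(err)
```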
Contributor


Thoughts on setting the `__cause__` of this error to be the worker error, then not needing to update the string representation?

Contributor Author


Just tested this; unfortunately, `__cause__` doesn't get included in the string representation, only in the printed traceback (e.g. when the exception is raised).
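This behavior is easy to verify in plain Python: setting `__cause__` (which is what `raise ... from ...` does) changes the rendered traceback but not `str()`. A minimal sketch, no Ray or Torch needed:

```python
import traceback

# Setting __cause__ does not change the exception's string representation.
cause = RuntimeError("CUDA is not available")
err = Exception("A worker health check failed.")
err.__cause__ = cause

# str() shows only the exception's own message; the cause is absent.
assert str(err) == "A worker health check failed."

# The cause only appears when the full traceback chain is rendered.
rendered = "".join(traceback.TracebackException.from_exception(err).format())
assert "CUDA is not available" in rendered
```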

@matthewdeng matthewdeng merged commit 0e24c26 into ray-project:master Jun 11, 2025
5 checks passed
@matthewdeng matthewdeng deleted the healthcheck-trace branch June 11, 2025 00:09
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
Improve the string representation of `WorkerHealthCheckFailedError` to
also include the base reason why the health check failed.

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025