
[train] add trace to WorkerHealthCheckFailedError#53626

Merged

matthewdeng merged 1 commit into ray-project:master from matthewdeng:healthcheck-trace on Jun 11, 2025

Conversation

@matthewdeng
Contributor

Improve the string representation of WorkerHealthCheckFailedError to also include the base reason why the health check failed.

Repro

```python
import torch
import ray.train
from ray.train.torch import TorchTrainer

def train_func():
    x = torch.tensor([1.0], device=torch.device("cuda"))
    ray.train.report({"x": x})

trainer = TorchTrainer(train_func, scaling_config=ray.train.ScalingConfig(use_gpu=True))
trainer.fit()
```

Before

```
Terminating training worker group after encountering failure(s) on 1 worker(s):
[Rank 0]
A worker health check failed.
Worker info: Worker(
  actor=Actor(RayTrainWorker, d60c89b07cc4662cf2e3bdde07000000),
  metadata=ActorMetadata(
    hostname='ip-10-0-97-173',
    node_id='c3caabc03a219ac24d7db2c91ea0c9f62f02d5ac56c301bacceb1291',
    node_ip='10.0.97.173',
    pid=21785,
    accelerator_ids={'GPU': ['0']},
  ),
  distributed_context=DistributedContext(world_rank=0, world_size=1, local_rank=0, local_world_size=1, node_rank=0),
  log_file_path='/tmp/ray/session_2025-06-06_12-48-00_927737_2713/logs/train/ray-train-app-worker-7babcde8cfa5b228be37776e8812a75f993d156f235d070633f1afa2.log',
)
```

After

```
Terminating training worker group after encountering failure(s) on 1 worker(s):
[Rank 0]
A worker health check failed.
Worker info: Worker(
  actor=Actor(RayTrainWorker, d60c89b07cc4662cf2e3bdde07000000),
  metadata=ActorMetadata(
    hostname='ip-10-0-97-173',
    node_id='c3caabc03a219ac24d7db2c91ea0c9f62f02d5ac56c301bacceb1291',
    node_ip='10.0.97.173',
    pid=21785,
    accelerator_ids={'GPU': ['0']},
  ),
  distributed_context=DistributedContext(world_rank=0, world_size=1, local_rank=0, local_world_size=1, node_rank=0),
  log_file_path='/tmp/ray/session_2025-06-06_12-48-00_927737_2713/logs/train/ray-train-app-worker-7babcde8cfa5b228be37776e8812a75f993d156f235d070633f1afa2.log',
)
System error: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
traceback: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/serialization.py", line 460, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/serialization.py", line 317, in _deserialize_object
    return self._deserialize_msgpack_data(data, metadata_fields)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/serialization.py", line 272, in _deserialize_msgpack_data
    python_objects = self._deserialize_pickle5_data(pickle5_data)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/serialization.py", line 262, in _deserialize_pickle5_data
    obj = pickle.loads(in_band)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/storage.py", line 381, in _load_from_bytes
    return torch.load(io.BytesIO(b))
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 1040, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 1272, in _legacy_load
    result = unpickler.load()
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 1205, in persistent_load
    obj = restore_location(obj, location)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 390, in default_restore_location
    result = fn(storage, location)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 265, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 249, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
```

Signed-off-by: Matthew Deng <matt@anyscale.com>
Copilot AI review requested due to automatic review settings June 6, 2025 21:27
@matthewdeng matthewdeng requested a review from a team as a code owner June 6, 2025 21:27
Contributor

Copilot AI left a comment


Pull Request Overview

This PR enhances the WorkerHealthCheckFailedError exception by appending the underlying health check failure reason to its string output.

  • Overrides __str__ to include the base exception message.
  • Ensures end users see the root cause when a worker health check fails.
Comments suppressed due to low confidence (2)

python/ray/train/v2/_internal/exceptions.py:44 (`def __str__(self):`)

  • Add unit tests for the new `__str__` method to verify that the underlying `health_check_failure` is correctly included in the returned string.

python/ray/train/v2/_internal/exceptions.py:44 (`def __str__(self):`)

  • Consider using Python exception chaining (e.g., `raise WorkerHealthCheckFailedError(...) from health_check_failure`) instead of manual string concatenation so the original traceback is preserved and printed automatically.

@matthewdeng matthewdeng added the go add ONLY when ready to merge, run all tests label Jun 7, 2025
Comment on lines +44 to +45

```python
def __str__(self):
    return self._message + "\n" + str(self.health_check_failure)
```
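The effect of this override can be exercised with a quick self-contained check. The class below is a simplified stand-in: the real `WorkerHealthCheckFailedError` in `python/ray/train/v2/_internal/exceptions.py` also carries worker metadata, so this constructor is hypothetical.

```python
# Simplified stand-in for WorkerHealthCheckFailedError; the real class takes
# additional worker info, so this constructor is hypothetical.
class WorkerHealthCheckFailedError(Exception):
    def __init__(self, message: str, health_check_failure: Exception):
        super().__init__(message)
        self._message = message
        self.health_check_failure = health_check_failure

    def __str__(self) -> str:
        # The new behavior: append the underlying failure to the message.
        return self._message + "\n" + str(self.health_check_failure)


cause = RuntimeError("CUDA is not available on this worker")
err = WorkerHealthCheckFailedError("A worker health check failed.", cause)

# Both the generic message and the root cause now appear in str(err).
assert str(err).startswith("A worker health check failed.")
assert "CUDA is not available on this worker" in str(err)
```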
Contributor


Thoughts on setting the `__cause__` of this error to be the worker error, then not needing to update the string representation?

Contributor Author


Just tested this; unfortunately, `__cause__` doesn't get included in the string representation, only in the printed traceback (e.g. when the exception is raised).
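This behavior is easy to verify in plain Python: setting `__cause__` (which is what `raise ... from ...` does) changes the rendered traceback but not `str()`. A minimal sketch, no Ray or Torch needed:

```python
import traceback

# Setting __cause__ does not change the exception's string representation.
cause = RuntimeError("CUDA is not available")
err = Exception("A worker health check failed.")
err.__cause__ = cause

# str() shows only the exception's own message; the cause is absent.
assert str(err) == "A worker health check failed."

# The cause only appears when the full traceback chain is rendered.
rendered = "".join(traceback.TracebackException.from_exception(err).format())
assert "CUDA is not available" in rendered
```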

@matthewdeng matthewdeng merged commit 0e24c26 into ray-project:master Jun 11, 2025
5 checks passed
@matthewdeng matthewdeng deleted the healthcheck-trace branch June 11, 2025 00:09
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
Improve the string representation of `WorkerHealthCheckFailedError` to
also include the base reason why the health check failed.

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025