
[train] log errors raised by worker #52223

Merged
matthewdeng merged 3 commits into ray-project:master from matthewdeng:train-log-error
Apr 11, 2025

Conversation

@matthewdeng (Contributor)

Why are these changes needed?

When an error occurs in the user training function, it is raised in a separate thread and is not currently logged.

This change propagates the error so that it is logged when the thread completes with an error.
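The mechanism described above can be sketched as follows. This is a minimal illustration, not Ray's actual implementation: the `ThreadRunner` class, its method names, and the logging call site are assumptions made for the example. The idea is that an exception raised in the training thread is captured instead of being lost, and is logged once the thread finishes.

```python
import logging
import threading
import traceback

logger = logging.getLogger(__name__)

class ThreadRunner:
    """Minimal sketch (hypothetical, not Ray's code) of running a user
    function in a thread and capturing any exception it raises, so the
    error can be logged when the thread completes."""

    def __init__(self, fn):
        self._fn = fn
        self._error = None
        self._thread = threading.Thread(target=self._run)

    def _run(self):
        try:
            self._fn()
        except BaseException as e:
            # Capture the error instead of letting the thread die silently.
            self._error = e

    def start(self):
        self._thread.start()

    def join(self):
        self._thread.join()
        # Log the error once the thread has finished, mirroring the
        # behavior this PR adds.
        if self._error is not None:
            tb = "".join(
                traceback.format_exception(
                    type(self._error), self._error, self._error.__traceback__
                )
            )
            logger.error("Error in training function:\n%s", tb)
        return self._error

runner = ThreadRunner(lambda: 1 / 0)
runner.start()
error = runner.join()  # logs the ZeroDivisionError traceback
```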

Checks

from ray.train.v2.api.data_parallel_trainer import DataParallelTrainer

def f():
    1/0

trainer = DataParallelTrainer(f)
trainer.fit()
~ cat /tmp/ray/session_latest/logs/train/ray-train-app-worker-8a2f3e3c1a67ccfd095ea12fe7eb343d0063af1b9cadb55874eed73c.log | jq -r .message           
Error in training function:
Traceback (most recent call last):
  File "/tmp/repro.py", line 4, in f
    1/0
ZeroDivisionError: division by zero

Signed-off-by: Matthew Deng <matt@anyscale.com>

@justinvyu left a comment



error = execution_context.training_thread_runner.get_error()
if error:
    logger.error(f"Error in training function:\n{error}")


This prints the error stacktrace due to the UserExceptionWithTraceback wrapper that we add right?

@matthewdeng (Author)

Oh yes you are right! Updated to the location you linked above and just printed out the stacktrace directly.

@justinvyu self-assigned this Apr 10, 2025
@justinvyu

justinvyu commented Apr 10, 2025

@matthewdeng Does this now show the worker error 2x in the console? Once upon the thread raising and once in logger.error? Can you also show the raw console output in the example?

@matthewdeng

Yeah, it shows up 2x for me on my laptop, but I believe it would not be shown in the main driver console logs if it is raised from a separate worker node.

RAY_TRAIN_V2_ENABLED=1 python repro.py
2025-04-10 15:56:21,036 INFO worker.py:1849 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
(TrainController pid=10401) Attempting to start training worker group of size 1 with the following resources: [{'CPU': 1}] * 1
Traceback (most recent call last):
  File "/Users/matt/workspace/scratch/trainv2/repro.py", line 8, in <module>
    trainer.fit()
  File "/Users/matt/workspace/ray/python/ray/train/v2/api/data_parallel_trainer.py", line 128, in fit
    raise result.error
ray.train.v2.api.exceptions.TrainingFailedError: Training failed due to worker errors:
[Rank 0]
Traceback (most recent call last):
  File "/Users/matt/workspace/scratch/trainv2/repro.py", line 4, in f
    1/0
ZeroDivisionError: division by zero

(TrainController pid=10401) Started training worker group of size 1: 
(TrainController pid=10401) - (ip=127.0.0.1, pid=10404) world_rank=0, local_rank=0, node_rank=0
(TrainController pid=10401) Deciding to TERMINATE, since the total failure count (1) exceeded the maximum allowed failures: FailureConfig(max_failures=0).
(TrainController pid=10401) Terminating training worker group after encountering failure(s) on 1 worker(s):
(TrainController pid=10401) [Rank 0]
(TrainController pid=10401) Traceback (most recent call last):
(TrainController pid=10401)   File "/Users/matt/workspace/scratch/trainv2/repro.py", line 4, in f
(TrainController pid=10401)     1/0
(TrainController pid=10401) ZeroDivisionError: division by zero
(TrainController pid=10401) 
(RayTrainWorker pid=10404) Error in training function:
(RayTrainWorker pid=10404)     1/0
(RayTrainWorker pid=10404) 
(RayTrainWorker pid=10404) Traceback (most recent call last):
(RayTrainWorker pid=10404)   File "/Users/matt/workspace/scratch/trainv2/repro.py", line 4, in f
(RayTrainWorker pid=10404) ZeroDivisionError: division by zero

@justinvyu

(RayTrainWorker pid=10404) Error in training function:
(RayTrainWorker pid=10404)     1/0
(RayTrainWorker pid=10404) 
(RayTrainWorker pid=10404) Traceback (most recent call last):
(RayTrainWorker pid=10404)   File "/Users/matt/workspace/scratch/trainv2/repro.py", line 4, in f
(RayTrainWorker pid=10404) ZeroDivisionError: division by zero

What's up with the 1/0 and the newline? This doesn't look like the other tracebacks above.

@matthewdeng

That is a good question... possibly something to do with worker log lines being reordered when propagated to the console? The output written to the log file is properly formatted. Let me test it some more.

@matthewdeng

I was able to isolate a reproduction for the reordering behavior; it seems there are two requirements:

  1. Every word in the line contains a digit.
     - Triggers: 1/0, a1, 1 1
     - Does not trigger: a, a 1, 1 a
  2. Ray log deduplication is enabled (RAY_DEDUP_LOGS=1).

Given how narrow these criteria are, I'm going to consider this non-blocking for merge.
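The trigger condition above can be expressed as a small predicate. This is only an illustration of the reported criterion (the function name and implementation are assumptions, not Ray's dedup code): a line is affected when every whitespace-separated word contains at least one digit.

```python
import re

def every_word_has_digit(line: str) -> bool:
    """Check the reported trigger condition: every whitespace-separated
    word in the line contains at least one digit."""
    words = line.split()
    return bool(words) and all(re.search(r"\d", w) for w in words)

# Lines reported to trigger the reordering:
for line in ["1/0", "a1", "1 1"]:
    assert every_word_has_digit(line)

# Lines reported not to trigger it:
for line in ["a", "a 1", "1 a"]:
    assert not every_word_has_digit(line)
```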

@matthewdeng enabled auto-merge (squash) April 11, 2025 18:20
@github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Apr 11, 2025
@matthewdeng merged commit 3c32e56 into ray-project:master Apr 11, 2025
7 checks passed
han-steve pushed a commit to han-steve/ray that referenced this pull request Apr 11, 2025

Labels: community-backlog, go
