
[train] log errors raised by worker #52223

Merged
matthewdeng merged 3 commits into ray-project:master from matthewdeng:train-log-error
Apr 11, 2025

Conversation

@matthewdeng (Contributor)

Why are these changes needed?

When an error occurs in the user training function, it is raised in a separate thread and is not currently logged.

This change propagates the error so that it is logged when the thread completes with an error.
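The mechanism described above can be sketched as follows. This is a minimal illustration, not Ray's actual implementation: the `ThreadRunner` class, its method names, and the logging call site are assumptions made for the example. The idea is that an exception raised in the training thread is captured instead of being lost, and is logged once the thread finishes.

```python
import logging
import threading
import traceback

logger = logging.getLogger(__name__)

class ThreadRunner:
    """Minimal sketch (hypothetical, not Ray's code) of running a user
    function in a thread and capturing any exception it raises, so the
    error can be logged when the thread completes."""

    def __init__(self, fn):
        self._fn = fn
        self._error = None
        self._thread = threading.Thread(target=self._run)

    def _run(self):
        try:
            self._fn()
        except BaseException as e:
            # Capture the error instead of letting the thread die silently.
            self._error = e

    def start(self):
        self._thread.start()

    def join(self):
        self._thread.join()
        # Log the error once the thread has finished, mirroring the
        # behavior this PR adds.
        if self._error is not None:
            tb = "".join(
                traceback.format_exception(
                    type(self._error), self._error, self._error.__traceback__
                )
            )
            logger.error("Error in training function:\n%s", tb)
        return self._error

runner = ThreadRunner(lambda: 1 / 0)
runner.start()
error = runner.join()  # logs the ZeroDivisionError traceback
```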

Checks

from ray.train.v2.api.data_parallel_trainer import DataParallelTrainer

def f():
    1/0

trainer = DataParallelTrainer(f)
trainer.fit()
~ cat /tmp/ray/session_latest/logs/train/ray-train-app-worker-8a2f3e3c1a67ccfd095ea12fe7eb343d0063af1b9cadb55874eed73c.log | jq -r .message           
Error in training function:
Traceback (most recent call last):
  File "/tmp/repro.py", line 4, in f
    1/0
ZeroDivisionError: division by zero

Signed-off-by: Matthew Deng <matt@anyscale.com>

@justinvyu left a comment



error = execution_context.training_thread_runner.get_error()
if error:
    logger.error(f"Error in training function:\n{error}")


This prints the error stacktrace due to the UserExceptionWithTraceback wrapper that we add right?

@matthewdeng (Author)

Oh yes you are right! Updated to the location you linked above and just printed out the stacktrace directly.

@justinvyu self-assigned this Apr 10, 2025
@justinvyu

justinvyu commented Apr 10, 2025

@matthewdeng Does this now show the worker error 2x in the console? Once upon the thread raising and once in logger.error? Can you also show the raw console output in the example?

@matthewdeng

Yeah, it shows up 2x for me on my laptop, but I believe it would not be shown in the main driver console logs if it is raised from a separate worker node.

RAY_TRAIN_V2_ENABLED=1 python repro.py
2025-04-10 15:56:21,036 INFO worker.py:1849 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
(TrainController pid=10401) Attempting to start training worker group of size 1 with the following resources: [{'CPU': 1}] * 1
Traceback (most recent call last):
  File "/Users/matt/workspace/scratch/trainv2/repro.py", line 8, in <module>
    trainer.fit()
  File "/Users/matt/workspace/ray/python/ray/train/v2/api/data_parallel_trainer.py", line 128, in fit
    raise result.error
ray.train.v2.api.exceptions.TrainingFailedError: Training failed due to worker errors:
[Rank 0]
Traceback (most recent call last):
  File "/Users/matt/workspace/scratch/trainv2/repro.py", line 4, in f
    1/0
ZeroDivisionError: division by zero

(TrainController pid=10401) Started training worker group of size 1: 
(TrainController pid=10401) - (ip=127.0.0.1, pid=10404) world_rank=0, local_rank=0, node_rank=0
(TrainController pid=10401) Deciding to TERMINATE, since the total failure count (1) exceeded the maximum allowed failures: FailureConfig(max_failures=0).
(TrainController pid=10401) Terminating training worker group after encountering failure(s) on 1 worker(s):
(TrainController pid=10401) [Rank 0]
(TrainController pid=10401) Traceback (most recent call last):
(TrainController pid=10401)   File "/Users/matt/workspace/scratch/trainv2/repro.py", line 4, in f
(TrainController pid=10401)     1/0
(TrainController pid=10401) ZeroDivisionError: division by zero
(TrainController pid=10401) 
(RayTrainWorker pid=10404) Error in training function:
(RayTrainWorker pid=10404)     1/0
(RayTrainWorker pid=10404) 
(RayTrainWorker pid=10404) Traceback (most recent call last):
(RayTrainWorker pid=10404)   File "/Users/matt/workspace/scratch/trainv2/repro.py", line 4, in f
(RayTrainWorker pid=10404) ZeroDivisionError: division by zero

@justinvyu

(RayTrainWorker pid=10404) Error in training function:
(RayTrainWorker pid=10404)     1/0
(RayTrainWorker pid=10404) 
(RayTrainWorker pid=10404) Traceback (most recent call last):
(RayTrainWorker pid=10404)   File "/Users/matt/workspace/scratch/trainv2/repro.py", line 4, in f
(RayTrainWorker pid=10404) ZeroDivisionError: division by zero

What's up with the 1/0 and the newline? This doesn't look like the other tracebacks above.

@matthewdeng

That is a good question... possibly something to do with worker log lines being reordered when propagated to the console? The output written to the log file is properly formatted. Let me test it some more.

@matthewdeng

I was able to isolate a reproduction for the reordering behavior; it seems there are two requirements:

  1. Every word in the line contains a digit.
     - Triggers: 1/0, a1, 1 1
     - Does not trigger: a, a 1, 1 a
  2. Ray log deduplication is enabled (RAY_DEDUP_LOGS=1).

Given how narrow these criteria are, I'm going to consider this non-blocking for merge.
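The trigger condition above can be expressed as a small predicate. This is only an illustration of the reported criterion (the function name and implementation are assumptions, not Ray's dedup code): a line is affected when every whitespace-separated word contains at least one digit.

```python
import re

def every_word_has_digit(line: str) -> bool:
    """Check the reported trigger condition: every whitespace-separated
    word in the line contains at least one digit."""
    words = line.split()
    return bool(words) and all(re.search(r"\d", w) for w in words)

# Lines reported to trigger the reordering:
for line in ["1/0", "a1", "1 1"]:
    assert every_word_has_digit(line)

# Lines reported not to trigger it:
for line in ["a", "a 1", "1 a"]:
    assert not every_word_has_digit(line)
```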

@matthewdeng enabled auto-merge (squash) April 11, 2025 18:20
@github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Apr 11, 2025
@matthewdeng merged commit 3c32e56 into ray-project:master Apr 11, 2025
7 checks passed
han-steve pushed a commit to han-steve/ray that referenced this pull request Apr 11, 2025

Labels: community-backlog, go
