
[release-test][data][train] Preload a subset of modules for torch dataloader forkserver multiprocessing #56343

Merged

justinvyu merged 3 commits into ray-project:master from justinvyu:fix_forkdserver_preloaded_modules on Sep 8, 2025


Conversation

@justinvyu
Contributor

Summary

For the training ingest release test baseline that uses torch dataloader multiprocessing, we currently preload all imported modules and submodules into the forkserver. This is brittle: worker startup fails if any of the imported modules cannot be forked safely. For example, see the following release test error:

Traceback (most recent call last):
File "/tmp/ray/session_2025-09-04_05-22-53_048164_2379/runtime_resources/working_dir_files/s3_runtime-release-test-artifacts_working_dirs_training_ingest_benchmark-task=image_classification_full_training_parquet_torch_dataloader_doiardvjgz__anyscale_pkg_339c386192dfee395b1179753d3efecc/train_benchmark.py", line 35, in train_fn_per_worker
File "/tmp/ray/session_2025-09-04_05-22-53_048164_2379/runtime_resources/working_dir_files/_ray_pkg_b43ef8c2034973b4/runner.py", line 313, in run
self._train_epoch()
File "/tmp/ray/session_2025-09-04_05-22-53_048164_2379/runtime_resources/working_dir_files/_ray_pkg_b43ef8c2034973b4/runner.py", line 154, in _train_epoch
for batch in train_dataloader:
File "/tmp/ray/session_2025-09-04_05-22-53_048164_2379/runtime_resources/working_dir_files/_ray_pkg_b43ef8c2034973b4/runner.py", line 105, in dataloader_with_timers
batch = next(dataloader_iter)
File "/tmp/ray/session_2025-09-04_05-22-53_048164_2379/runtime_resources/working_dir_files/_ray_pkg_b43ef8c2034973b4/image_classification/factory.py", line 117, in create_batch_iterator
for batch_idx, batch in enumerate(dataloader):
File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 434, in __iter__
self._iterator = self._get_iterator()
File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 387, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1040, in __init__
w.start()
File "/home/ray/anaconda3/lib/python3.9/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/home/ray/anaconda3/lib/python3.9/multiprocessing/context.py", line 291, in _Popen
return Popen(process_obj)
File "/home/ray/anaconda3/lib/python3.9/multiprocessing/popen_forkserver.py", line 35, in __init__
super().__init__(process_obj)
File "/home/ray/anaconda3/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/home/ray/anaconda3/lib/python3.9/multiprocessing/popen_forkserver.py", line 59, in _launch
self.pid = forkserver.read_signed(self.sentinel)
File "/home/ray/anaconda3/lib/python3.9/multiprocessing/forkserver.py", line 328, in read_signed
raise EOFError('unexpected EOF')
EOFError: unexpected EOF 

I found that excluding all `ray` submodule imports from the preload list fixes the issue. To make this more robust, this PR only preloads modules from an allowlist of a few heavy imports that take a long time to load.
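The allowlist approach can be sketched roughly as follows (the module names and the helper `get_forkserver_preload_modules` are illustrative, not the PR's actual code):

```python
import multiprocessing
import sys

# Hypothetical allowlist of heavy modules that are slow to import but safe
# to preload in the forkserver. Preloading everything in sys.modules can
# crash the forkserver if any module (e.g. a ray submodule) is not fork-safe.
HEAVY_MODULE_ALLOWLIST = ["torch", "numpy", "pandas"]

def get_forkserver_preload_modules():
    # Only preload allowlisted modules that the parent has already imported.
    return [name for name in HEAVY_MODULE_ALLOWLIST if name in sys.modules]

if __name__ == "__main__":
    ctx = multiprocessing.get_context("forkserver")
    # This is only a hint: the forkserver tries to import these at startup.
    ctx.set_forkserver_preload(get_forkserver_preload_modules())
```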

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a fork-safety issue with torch.utils.data.DataLoader by switching from preloading all modules to an allowlist of heavy, safe-to-fork modules. This is a robust fix for the EOFError encountered. The refactoring in get_val_dataloader to conditionally create the multiprocessing context is also a good improvement. My review suggests applying this same refactoring to get_train_dataloader for consistency and to fully resolve the inefficiency of creating the context when no worker processes are used.
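The conditional-context refactoring the review refers to could look roughly like this (a sketch under the assumption that the benchmark builds its dataloaders from a helper; `get_mp_context` is illustrative, not the actual code):

```python
import multiprocessing

def get_mp_context(num_workers: int, preload_modules=()):
    # With num_workers=0, torch's DataLoader loads batches in-process,
    # so creating (and preloading) a forkserver context would be wasted work.
    if num_workers == 0:
        return None
    ctx = multiprocessing.get_context("forkserver")
    ctx.set_forkserver_preload(list(preload_modules))
    return ctx

# The returned context would be passed to DataLoader as
# multiprocessing_context=ctx only when num_workers > 0.
```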


@srinathk10 srinathk10 left a comment


LGTM

@justinvyu justinvyu enabled auto-merge (squash) September 8, 2025 20:04
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Sep 8, 2025
@justinvyu justinvyu merged commit d55b35d into ray-project:master Sep 8, 2025
5 of 7 checks passed
@justinvyu justinvyu deleted the fix_forkdserver_preloaded_modules branch September 8, 2025 21:17
justinvyu added a commit that referenced this pull request Sep 10, 2025
…rch dataloader baselines (#56395)

#56343 refactored some code for
torch dataloader creation but introduced a bug when it came to the
validation dataset throughput calculation.

This happened because `drop_last=True` became the default setting, which caused the validation pass to be empty: the validation dataset is small and sharded across enough workers that no worker could form a single full batch. This PR fixes the issue by setting `drop_last=False`.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
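The batch-count arithmetic behind that bug is easy to state. A minimal sketch mirroring how torch's DataLoader counts batches (the helper and numbers are illustrative):

```python
def num_batches(dataset_size: int, batch_size: int, drop_last: bool) -> int:
    # drop_last=True discards the final partial batch; a shard smaller than
    # one full batch therefore yields zero batches.
    full, remainder = divmod(dataset_size, batch_size)
    return full if drop_last or remainder == 0 else full + 1

# A small validation shard, e.g. 50 samples with batch_size=64:
#   drop_last=True  -> 0 batches (empty validation pass, broken throughput)
#   drop_last=False -> 1 partial batch
```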