
[release-test][data][train] Preload a subset of modules for torch dataloader forkserver multiprocessing #56343

Merged

justinvyu merged 3 commits into ray-project:master from justinvyu:fix_forkdserver_preloaded_modules on Sep 8, 2025


Conversation

@justinvyu
Contributor

Summary

For the training ingest release test baseline that uses torch dataloader multiprocessing, we currently preload all imported modules and submodules into the forkserver. This is brittle: worker startup fails if any of the imported modules cannot be forked safely. For example, see the following release test error:

Traceback (most recent call last):
File "/tmp/ray/session_2025-09-04_05-22-53_048164_2379/runtime_resources/working_dir_files/s3_runtime-release-test-artifacts_working_dirs_training_ingest_benchmark-task=image_classification_full_training_parquet_torch_dataloader_doiardvjgz__anyscale_pkg_339c386192dfee395b1179753d3efecc/train_benchmark.py", line 35, in train_fn_per_worker
File "/tmp/ray/session_2025-09-04_05-22-53_048164_2379/runtime_resources/working_dir_files/_ray_pkg_b43ef8c2034973b4/runner.py", line 313, in run
self._train_epoch()
File "/tmp/ray/session_2025-09-04_05-22-53_048164_2379/runtime_resources/working_dir_files/_ray_pkg_b43ef8c2034973b4/runner.py", line 154, in _train_epoch
for batch in train_dataloader:
File "/tmp/ray/session_2025-09-04_05-22-53_048164_2379/runtime_resources/working_dir_files/_ray_pkg_b43ef8c2034973b4/runner.py", line 105, in dataloader_with_timers
batch = next(dataloader_iter)
File "/tmp/ray/session_2025-09-04_05-22-53_048164_2379/runtime_resources/working_dir_files/_ray_pkg_b43ef8c2034973b4/image_classification/factory.py", line 117, in create_batch_iterator
for batch_idx, batch in enumerate(dataloader):
File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 434, in __iter__
self._iterator = self._get_iterator()
File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 387, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1040, in __init__
w.start()
File "/home/ray/anaconda3/lib/python3.9/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/home/ray/anaconda3/lib/python3.9/multiprocessing/context.py", line 291, in _Popen
return Popen(process_obj)
File "/home/ray/anaconda3/lib/python3.9/multiprocessing/popen_forkserver.py", line 35, in __init__
super().__init__(process_obj)
File "/home/ray/anaconda3/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/home/ray/anaconda3/lib/python3.9/multiprocessing/popen_forkserver.py", line 59, in _launch
self.pid = forkserver.read_signed(self.sentinel)
File "/home/ray/anaconda3/lib/python3.9/multiprocessing/forkserver.py", line 328, in read_signed
raise EOFError('unexpected EOF')
EOFError: unexpected EOF 

I found that excluding all `ray` submodule imports from the preload list fixes the issue. To make this more robust, this PR only preloads modules from an allowlist of a few heavy imports that take a long time to load.
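The allowlist approach can be sketched roughly as follows (the module names and the helper `get_forkserver_preload_modules` are illustrative, not the PR's actual code):

```python
import multiprocessing
import sys

# Hypothetical allowlist of heavy modules that are slow to import but safe
# to preload in the forkserver. Preloading everything in sys.modules can
# crash the forkserver if any module (e.g. a ray submodule) is not fork-safe.
HEAVY_MODULE_ALLOWLIST = ["torch", "numpy", "pandas"]

def get_forkserver_preload_modules():
    # Only preload allowlisted modules that the parent has already imported.
    return [name for name in HEAVY_MODULE_ALLOWLIST if name in sys.modules]

if __name__ == "__main__":
    ctx = multiprocessing.get_context("forkserver")
    # This is only a hint: the forkserver tries to import these at startup.
    ctx.set_forkserver_preload(get_forkserver_preload_modules())
```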

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a fork-safety issue with torch.utils.data.DataLoader by switching from preloading all modules to an allowlist of heavy, safe-to-fork modules. This is a robust fix for the EOFError encountered. The refactoring in get_val_dataloader to conditionally create the multiprocessing context is also a good improvement. My review suggests applying this same refactoring to get_train_dataloader for consistency and to fully resolve the inefficiency of creating the context when no worker processes are used.
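The conditional-context refactoring the review refers to could look roughly like this (a sketch under the assumption that the benchmark builds its dataloaders from a helper; `get_mp_context` is illustrative, not the actual code):

```python
import multiprocessing

def get_mp_context(num_workers: int, preload_modules=()):
    # With num_workers=0, torch's DataLoader loads batches in-process,
    # so creating (and preloading) a forkserver context would be wasted work.
    if num_workers == 0:
        return None
    ctx = multiprocessing.get_context("forkserver")
    ctx.set_forkserver_preload(list(preload_modules))
    return ctx

# The returned context would be passed to DataLoader as
# multiprocessing_context=ctx only when num_workers > 0.
```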


@srinathk10 srinathk10 left a comment


LGTM

@justinvyu justinvyu enabled auto-merge (squash) September 8, 2025 20:04
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Sep 8, 2025
@justinvyu justinvyu merged commit d55b35d into ray-project:master Sep 8, 2025
5 of 7 checks passed
@justinvyu justinvyu deleted the fix_forkdserver_preloaded_modules branch September 8, 2025 21:17
justinvyu added a commit that referenced this pull request Sep 10, 2025
…rch dataloader baselines (#56395)

#56343 refactored some code for
torch dataloader creation but introduced a bug when it came to the
validation dataset throughput calculation.

This happened because `drop_last=True` became the default setting, which caused the validation pass to be empty: the validation dataset is small and sharded across enough workers that no worker could form a single full batch. This PR fixes the issue by setting `drop_last=False`.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
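The batch-count arithmetic behind that bug is easy to state. A minimal sketch mirroring how torch's DataLoader counts batches (the helper and numbers are illustrative):

```python
def num_batches(dataset_size: int, batch_size: int, drop_last: bool) -> int:
    # drop_last=True discards the final partial batch; a shard smaller than
    # one full batch therefore yields zero batches.
    full, remainder = divmod(dataset_size, batch_size)
    return full if drop_last or remainder == 0 else full + 1

# A small validation shard, e.g. 50 samples with batch_size=64:
#   drop_last=True  -> 0 batches (empty validation pass, broken throughput)
#   drop_last=False -> 1 partial batch
```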