[release-test][data][train] Preload a subset of modules for torch dataloader forkserver multiprocessing#56343
Merged
justinvyu merged 3 commits into ray-project:master on Sep 8, 2025
Conversation
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Code Review
This pull request addresses a fork-safety issue with torch.utils.data.DataLoader by switching from preloading all modules to an allowlist of heavy, safe-to-fork modules. This is a robust fix for the EOFError encountered. The refactoring in get_val_dataloader to conditionally create the multiprocessing context is also a good improvement. My review suggests applying this same refactoring to get_train_dataloader for consistency and to fully resolve the inefficiency of creating the context when no worker processes are used.
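The suggested refactor can be sketched roughly as follows. This is a minimal illustration with hypothetical helper and module names, not the actual Ray release-test code: the idea is to build a forkserver multiprocessing context only when worker processes are actually used.

```python
import multiprocessing


def make_dataloader_mp_context(num_workers: int):
    """Return a forkserver context only when worker processes are used.

    Creating (and warming) the forkserver is wasted work when
    num_workers == 0, since the DataLoader then loads data in-process.
    """
    if num_workers == 0:
        return None
    ctx = multiprocessing.get_context("forkserver")
    # Preload only an allowlist of heavy, fork-safe modules instead of
    # everything in sys.modules. The exact list here is illustrative.
    ctx.set_forkserver_preload(["torch", "numpy", "pandas", "pyarrow"])
    return ctx
```

The returned context would then be passed through to the DataLoader, e.g. `DataLoader(..., num_workers=n, multiprocessing_context=ctx)`; applying the same pattern in both `get_train_dataloader` and `get_val_dataloader` keeps the two code paths consistent.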
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
justinvyu added a commit that referenced this pull request on Sep 10, 2025
…rch dataloader baselines (#56395) #56343 refactored some code for torch dataloader creation but introduced a bug when it came to the validation dataset throughput calculation. This happened because `drop_last=True` became the default setting, which would cause the validation dataset to be empty since it's small enough and spread across enough workers so that we couldn't form a single full batch. This PR fixes the issue by setting `drop_last=False`. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com>
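The `drop_last` failure mode comes down to simple batch arithmetic: with `drop_last=True`, a shard smaller than one batch yields zero batches, so the validation loader is empty. A minimal sketch (the shard and batch sizes are illustrative, not taken from the release test):

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches a DataLoader-style iterator yields over one shard."""
    if drop_last:
        # The partial final batch is dropped entirely.
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


# A small validation set spread across many workers can leave each
# shard with fewer samples than a single batch:
shard_samples = 100   # samples on one worker's shard (illustrative)
batch_size = 256

assert num_batches(shard_samples, batch_size, drop_last=True) == 0   # empty loader
assert num_batches(shard_samples, batch_size, drop_last=False) == 1  # keeps partial batch
```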
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request on Sep 10, 2025
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request on Sep 11, 2025
…aloader forkserver multiprocessing (ray-project#56343) For the training ingest release test baseline using torch dataloader multiprocessing, we preload all imported modules and submodules. This can be brittle and run into issues if any of the imported modules cannot be forked safely. To make this more robust, I only preload modules in an allowlist of a few heavy imports that take a long time. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
ZacAttack pushed two commits to ZacAttack/ray that referenced this pull request on Sep 24, 2025
dstrodtman pushed two commits that referenced this pull request on Oct 6, 2025
justinyeh1995 pushed two commits to justinyeh1995/ray that referenced this pull request on Oct 20, 2025
landscapepainter pushed two commits to landscapepainter/ray that referenced this pull request on Nov 17, 2025
Summary
For the training ingest release test baseline using torch dataloader multiprocessing, we preload all imported modules and submodules. This can be brittle and run into issues if any of the imported modules cannot be forked safely. For example, see the following release test error:
I found that excluding all `ray` submodule imports from the preload list fixes the issue. To make this more robust, I only preload modules in an allowlist of a few heavy imports that take a long time to load.
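The difference between the brittle and the robust approach can be sketched as follows. The allowlist contents here are illustrative, not the exact list from the PR:

```python
import multiprocessing
import sys

# Brittle approach: preload every currently imported top-level module.
# This drags in modules (e.g. `ray` submodules) that are not fork-safe,
# which is what caused the release test failure.
all_imported = sorted({name.split(".")[0] for name in sys.modules})

# Robust approach: preload only a small allowlist of heavy,
# slow-to-import, fork-safe modules.
ALLOWLIST = ["torch", "numpy", "pandas", "pyarrow"]

# set_forkserver_preload only records the names; the modules are
# imported later, once, inside the forkserver process, so worker
# startup avoids paying the import cost repeatedly.
multiprocessing.set_forkserver_preload(ALLOWLIST)
```

Since the forkserver imports the preloaded modules exactly once and then forks workers from that warmed-up process, restricting the list to known-safe heavy imports keeps the startup-time benefit while avoiding fork-unsafe state.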