[Train] [release test] Release tests for ray train local mode #56862

Merged
matthewdeng merged 14 commits into ray-project:master from xinyuangui2:local-run-test on Oct 3, 2025
xinyuangui2 commented on Sep 24, 2025

Why are these changes needed?

Release test for Ray Train local mode.

Setup:

  • 2 nodes, each with 4 GPUs.
  • Each node runs torchrun.
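
For orientation, this setup implies one training process per GPU across the cluster. The following is a minimal sketch of that arithmetic, assuming the conventional torchrun layout in which the global rank equals node_rank * nproc_per_node + local_rank. The helper name `rank_layout` is hypothetical and is not part of this PR.

```python
def rank_layout(nnodes: int, nproc_per_node: int) -> dict[tuple[int, int], int]:
    """Map (node_rank, local_rank) to a global rank.

    Assumption: ranks are assigned contiguously per node, which is the
    usual torchrun convention; this is illustrative only.
    """
    return {
        (node, local): node * nproc_per_node + local
        for node in range(nnodes)
        for local in range(nproc_per_node)
    }


# One process per GPU on each node of the release-test cluster.
layout = rank_layout(nnodes=2, nproc_per_node=4)
world_size = len(layout)
```

Under these assumptions `world_size` is the product of the node count and the per-node process count, and the first process on the second node receives the global rank that follows the last process on the first node.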

Test job: https://buildkite.com/ray-project/release/builds/61135

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Note

Adds a nightly Train release test that launches torchrun across 2 GPU nodes to run a Ray Train TorchTrainer on FashionMNIST and validates loss.

  • Train tests:
    • Add torch_local_mode to release/release_tests.yaml (nightly, BYOD GPU) running python torch_local_mode_launcher.py with cluster config train_tests/local_mode/compute_gpu_2x4_aws.yaml.
  • New local mode assets:
    • release/train_tests/local_mode/compute_gpu_2x4_aws.yaml: 1 head m5.4xlarge, 2 workers g4dn.12xlarge (non-spot) in us-west-2.
    • release/train_tests/local_mode/torch_local_mode_launcher.py: Starts a Ray cluster task that uploads torch_local_mode_test.py to 2 nodes and launches torchrun (--nnodes=2, --nproc-per-node=4) with RAY_TRAIN_V2_ENABLED=1.
    • release/train_tests/local_mode/torch_local_mode_test.py: Defines a TorchTrainer training resnet18 on FashionMNIST, reports checkpoints/metrics over 20 epochs, and asserts loss decreases.

Written by Cursor Bugbot for commit 1157068.
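
The launcher itself is not shown in this conversation, so the following is only a sketch of how the per-node torchrun invocation described in the summary could be assembled. The flags `--nnodes`, `--nproc-per-node`, `--node-rank`, `--master-addr`, and `--master-port` are standard torchrun options; the helper names, the address `10.0.0.1`, and the port value are assumptions made for illustration and may differ from what torch_local_mode_launcher.py actually does.

```python
import os
import shlex


def build_torchrun_command(
    node_rank: int,
    master_addr: str,
    script: str = "torch_local_mode_test.py",
    nnodes: int = 2,
    nproc_per_node: int = 4,
    master_port: int = 29500,  # assumed port, not taken from the PR
) -> list[str]:
    """Build the argv for one node's torchrun invocation (hypothetical helper)."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc-per-node={nproc_per_node}",
        f"--node-rank={node_rank}",
        f"--master-addr={master_addr}",
        f"--master-port={master_port}",
        script,
    ]


def build_env(base=None) -> dict:
    """Copy an environment mapping and enable Ray Train V2 as the summary describes."""
    env = dict(os.environ if base is None else base)
    env["RAY_TRAIN_V2_ENABLED"] = "1"
    return env


# Placeholder address for the rank-0 node; a real launcher would discover it.
cmd = build_torchrun_command(node_rank=0, master_addr="10.0.0.1")
env = build_env()
print(shlex.join(cmd))
```

The command and environment could then be passed to `subprocess.run(cmd, env=env, check=True)` on each node, with `node_rank` set to 0 on one node and 1 on the other.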

Signed-off-by: xgui <xgui@anyscale.com>
xinyuangui2 changed the title from "adding release tests for local mode" to "[Train] [release test] Release tests for ray train local mode" on Sep 24, 2025
xinyuangui2 marked this pull request as ready for review on September 24, 2025 22:38
ray-gardener bot added the train (Ray Train Related Issue) and release-test (release test) labels on Sep 25, 2025
xinyuangui2 and others added 4 commits September 29, 2025 11:11
Co-authored-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Resolve comments

Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Comment on lines +94 to +117
def fit_func():
    # Define configurations.
    train_loop_config = {"num_epochs": 20, "lr": 0.01, "batch_size": 32}
    scaling_config = ScalingConfig(num_workers=0, use_gpu=True)
    run_config = RunConfig(checkpoint_config=CheckpointConfig(num_to_keep=1))

    # Initialize the Trainer.
    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        train_loop_config=train_loop_config,
        scaling_config=scaling_config,
        run_config=run_config,
    )

    # Train the model.
    result = trainer.fit()

    # Inspect the results.
    final_loss = result.metrics["loss"]
    logger.info(f"final_loss: {final_loss}")


if __name__ == "__main__":
    fit_func()
Contributor
Is this run anywhere?

Would it be helpful to verify that the runtimes are the same? And maybe set the seeds and compare the results?

Contributor Author
Added loss convergence check.
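
The check itself is not quoted in this thread. As a rough sketch of what a loss convergence check can look like, the hypothetical helper below compares the first and last reported losses; the name `assert_loss_decreased`, the `min_drop` parameter, and the sample values are assumptions for illustration, not the code added in the PR.

```python
from typing import Sequence


def assert_loss_decreased(losses: Sequence[float], min_drop: float = 0.0) -> None:
    """Fail if the last loss is not lower than the first by more than min_drop."""
    if len(losses) < 2:
        raise ValueError("need at least two loss values to compare")
    first, last = losses[0], losses[-1]
    assert last < first - min_drop, (
        f"loss did not decrease: first={first}, last={last}"
    )


# Made-up per-epoch losses, used only to exercise the helper.
assert_loss_decreased([2.3, 1.1, 0.6, 0.4])
```

Comparing only the endpoints keeps the check tolerant of noisy intermediate epochs; a stricter variant could compare averages over the first and last few epochs instead.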

matthewdeng enabled auto-merge (squash) on October 3, 2025 00:19
github-actions bot added the go (add ONLY when ready to merge, run all tests) label on Oct 3, 2025
github-actions bot disabled auto-merge on October 3, 2025 17:51
matthewdeng merged commit 637aaf3 into ray-project:master on Oct 3, 2025
6 checks passed
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
Release test for ray train local mode.

Setup:

- 2 nodes, each with 4 gpus.
- Each node runs `torchrun`.

Test job: https://buildkite.com/ray-project/release/builds/61135

---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
liulehui pushed a commit to liulehui/ray that referenced this pull request Oct 9, 2025
joshkodi pushed a commit to joshkodi/ray that referenced this pull request Oct 13, 2025
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025