[train][template] Add Anyscale template for pytorch + train + data #53220

Merged
matthewdeng merged 30 commits into ray-project:master from TimothySeah:tseah/ray-train-template
Jun 10, 2025
Conversation

@TimothySeah
Contributor

@TimothySeah TimothySeah commented May 21, 2025

Followed steps 1 and 2 here.

A few questions:

  • Right now I'm setting env variables in build.sh and in the notebook itself via dotenv as done in this example. Lmk if the latter is still necessary.
  • Not sure if my index.rst is correct
  • Not sure where in the codebase I should put my example

I will replace image links with public links in a future PR.

I ran the notebook in an Anyscale Workspace and verified that it worked as intended.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Comment on lines +5 to +8
# Install Python dependencies
pip3 install --no-cache-dir \
"torch==2.7.0" \
"torchvision==0.22.0"
Collaborator


Why are these required? It explicitly pins torch 2.7.

There is already a version of torch inside the image.

Contributor Author


Do I still need torchvision?
When running the notebook in my workspace, I pip-installed torch and torchvision, and then just pinned the versions that pip happened to download.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Comment on lines +319 to +329
" # [3] Report metrics to Ray Train\n",
" # ===============================\n",
" with tempfile.TemporaryDirectory() as temp_checkpoint_dir:\n",
" ray.train.report(\n",
" metrics={\"loss\": test_loss, \"accuracy\": accuracy},\n",
" checkpoint=ray.train.Checkpoint.from_directory(temp_checkpoint_dir),\n",
" )\n",
" if ray.train.get_context().get_world_rank() == 0:\n",
" print({\"epoch_num\": epoch, \"loss\": test_loss, \"accuracy\": accuracy})"
]
},
Contributor


Checkpoint is empty here.

Contributor Author


Added torch.save calls to this and the next example. IIRC the checkpoint should be optional, but right now it is not; maybe I can fix this in a future PR?

TODO: run the notebook again after it is finalized.
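For readers following the thread: the fix is to write files into the temporary directory before `Checkpoint.from_directory` packages it. The write-then-package pattern can be sketched without Ray or torch — the `save_checkpoint`/`load_checkpoint` helpers below are illustrative stand-ins for the notebook's `torch.save` call, not part of the template:

```python
import json
import os
import tempfile

def save_checkpoint(state, checkpoint_dir):
    # Write training state into the directory. In the notebook this is
    # torch.save(model.state_dict(), ...); Ray Train's
    # Checkpoint.from_directory then packages whatever files are here.
    with open(os.path.join(checkpoint_dir, "state.json"), "w") as f:
        json.dump(state, f)

def load_checkpoint(checkpoint_dir):
    with open(os.path.join(checkpoint_dir, "state.json")) as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
    save_checkpoint({"epoch": 3, "loss": 0.42}, temp_checkpoint_dir)
    # In the notebook, ray.train.report(metrics=...,
    # checkpoint=Checkpoint.from_directory(temp_checkpoint_dir)) would
    # run here; for illustration we just read the state back.
    restored = load_checkpoint(temp_checkpoint_dir)

print(restored)
```

An empty directory produces an empty checkpoint, which is why the reviewer flagged the original snippet.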

Comment on lines +352 to +365
" # [1] Start distributed training\n",
" # Define computation resources for workers\n",
" # Run `train_func_per_worker` on those workers\n",
" # ============================================\n",
" scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=use_gpu)\n",
" run_config = RunConfig(storage_path=\"/mnt/cluster_storage\", name=\"ray_train_run\")\n",
" trainer = TorchTrainer(\n",
" train_loop_per_worker=train_func_per_worker,\n",
" train_loop_config=train_config,\n",
" scaling_config=scaling_config,\n",
" run_config=run_config,\n",
" )\n",
" result = trainer.fit()\n",
" print(f\"Training result: {result}\")\n",
Contributor


Can use this as an opportunity to introduce what each of these classes/concepts are.

Contributor Author


Done - lmk what you think of the new explanation.

{
"cell_type": "markdown",
"metadata": {},
"source": [
Contributor


Images aren't rendering for me, will re-review this section whenever the docs render

Contributor Author


See 1d in https://www.notion.so/anyscale-hq/Examples-Publishing-Workflow-1e6027c809cb80bba02ccaa87523c22d - IIUC I need to submit this PR first and then use absolute links in a follow-up PR.


![Metrics Dashboard](images/metrics_dashboard.png)

## Step 3: Speed up its data loading with Ray Data
Contributor


Not obvious what "it" is here, can just do:

Suggested change
## Step 3: Speed up its data loading with Ray Data
## Step 3: Speed up data loading with Ray Data

Contributor Author


Changed to "Scale data ingest separately from training with Ray Data" since it doesn't actually get sped up - lmk if this is fine.


## Step 3: Speed up its data loading with Ray Data

Let’s modify this example to load data with Ray Data instead of the native Torch DataLoader. With just a few modifications, you can offload data preprocessing to an independently scaling distributed Ray Data pipeline. See [here](https://docs.ray.io/en/latest/data/comparisons.html#how-does-ray-data-compare-to-other-solutions-for-ml-training-ingest) for a comparison between Ray Data and Torch data loading.
Contributor


"offload data preprocessing to an independently scaling distributed Ray Data pipeline" is a mouthful and difficult for users to conceptualize what that actually means - we should revisit this to simplify the wording.

Contributor Author


Edited - lmk if the new wording is better!

Comment on lines +333 to +335
# [1] Prepare Dataloader for distributed training
# Shard the datasets among workers and move batches to the correct device
# =======================================================================
Contributor


Maybe elaborate more that the methods called are doing these? Otherwise it feels too auto-magical.

Contributor Author


Changed all of the numbered comments to mention what each function does - ptal, thanks!
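The "auto-magic" the reviewer flags boils down to sharding: `prepare_data_loader` wraps the loader with a `DistributedSampler` so each of the `world_size` workers iterates over a disjoint slice of the dataset (and it moves batches to that worker's device). The core sharding idea can be sketched without torch or Ray — `shard_indices` below is a hypothetical helper, not an API from the notebook:

```python
def shard_indices(num_samples, rank, world_size):
    # Round-robin assignment, the same basic scheme DistributedSampler
    # uses (minus its shuffling and padding options): worker `rank`
    # gets every `world_size`-th sample starting at offset `rank`.
    return list(range(rank, num_samples, world_size))

world_size = 4
shards = [shard_indices(10, rank, world_size) for rank in range(world_size)]
for rank, shard in enumerate(shards):
    print(f"worker {rank}: {shard}")
```

Together the shards are disjoint and cover the whole dataset, which is what lets each worker process a unique subset per epoch.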

@TimothySeah TimothySeah requested a review from matthewdeng May 28, 2025 01:14
Contributor

@justinvyu justinvyu left a comment


Looks really great! Some suggestions:

In section 2 (scaling training w/ Ray Train):

  • Let's include a detail that we're doing DDP training and briefly (like 1 sentence) describe what that's doing.
  • Let's include the expected speedup that we see, since we say the expected runtime in the first section. OR, just remove mentions of the expected runtimes, since it might be hard to explain why we don't see an 8x speedup.
  • Remove the mention in the text about enabling Ray Train v2.

" train_loop_per_worker=train_func_per_worker,\n",
" # [1] With Ray Data you pass the Dataset directly to the Trainer.\n",
" # ==============================================================\n",
" datasets={\"train\": train_dataset, \"test\": test_dataset},\n",
Contributor


nit: "valid" set rather than "test" set

Contributor Author


Changed all references to "test" in the train_func to "valid" but kept generic loading utilities as "test" since they could have been used for testing. Lmk if this is fine.

@TimothySeah
Contributor Author

  • Let's include the expected speedup that we see, since we say the expected runtime in the first section. OR, just remove mentions of the expected runtimes, since it might be hard to explain why we don't see an 8x speedup.

Wait, what do you mean by this comment? After running train_func in section 1 I mention "If everything works as expected, the training should take around 2 minutes 10 seconds with an accuracy of around 0.35.", and after running train_cifar_10 in section 2 I mention "Because you ran training in a data parallel fashion this time, it should have taken under 1 minute while maintaining similar accuracy."

@justinvyu
Contributor

@TimothySeah Oops, missed that then. Looks good 👍

@TimothySeah TimothySeah added the `go` label (add ONLY when ready to merge, run all tests) Jun 4, 2025
Contributor

@justinvyu justinvyu left a comment


Great job! Will we add a link to this in the ray docs as a follow-up?

@angelinalg
Contributor

So impressed with how much you did this on your own, @TimothySeah! Very high quality, too.
Re: where to put this example in the docs
Per @matthewdeng, let's replace this example with your new example:

- title: Train an image classifier with PyTorch

@angelinalg
Contributor

When you mention Anyscale-only features, keep in mind that OSS readers might be reading it. Suggest adding verbiage to let readers know that this is a feature on Anyscale and what OSS users would need to do to get the same functionality, if applicable.

Contributor

@angelinalg angelinalg left a comment


Approving with some suggestions.

TimothySeah and others added 13 commits June 9, 2025 15:49
…ipynb

Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Timothy Seah <timothy.seah777@yahoo.com>
@TimothySeah
Contributor Author

So impressed with how much you did this on your own, @TimothySeah! Very high quality, too. Re: where to put this example in the docs Per @matthewdeng, let's replace this example with your new example:

- title: Train an image classifier with PyTorch

Will consider doing this in a future PR. For now I'll just follow https://www.notion.so/anyscale-hq/Examples-Publishing-Workflow-1e6027c809cb80bba02ccaa87523c22d to add it to the workspace templates page.

@TimothySeah
Contributor Author

When you mention Anyscale-only features, keep in mind that OSS readers might be reading it. Suggest adding verbiage to let readers know that this is a feature on Anyscale and what OSS users would need to do to get the same functionality, if applicable.

Good callout - pushed a commit to do this.

@TimothySeah TimothySeah removed the request for review from a team June 9, 2025 23:51
@matthewdeng matthewdeng enabled auto-merge (squash) June 10, 2025 00:20
@matthewdeng matthewdeng merged commit 6f0ce28 into ray-project:master Jun 10, 2025
6 checks passed
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
… Data (#53220)

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
… Data (#53220)

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>