[train][template] Add Anyscale template for pytorch + train + data #53220

Merged
matthewdeng merged 30 commits into ray-project:master from TimothySeah:tseah/ray-train-template
Jun 10, 2025
Conversation

@TimothySeah
Contributor

@TimothySeah TimothySeah commented May 21, 2025

Followed steps 1 and 2 here.

A few questions:

  • Right now I'm setting env variables in build.sh and in the notebook itself via dotenv as done in this example. Lmk if the latter is still necessary.
  • Not sure if my index.rst is correct
  • Not sure where in the codebase I should put my example

I will replace image links with public links in a future PR.

I ran the notebook in an Anyscale Workspace and verified that it worked as intended.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Comment on lines +5 to +8
# Install Python dependencies
pip3 install --no-cache-dir \
"torch==2.7.0" \
"torchvision==0.22.0"
Collaborator


Why are these required? It explicitly pins torch 2.7.

There is already a version of torch inside the image.

Contributor Author


Do I still need torchvision?
When running the notebook in my workspace, I pip-installed torch and torchvision, and then just pinned the versions that pip happened to download.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Comment on lines +319 to +329
" # [3] Report metrics to Ray Train\n",
" # ===============================\n",
" with tempfile.TemporaryDirectory() as temp_checkpoint_dir:\n",
" ray.train.report(\n",
" metrics={\"loss\": test_loss, \"accuracy\": accuracy},\n",
" checkpoint=ray.train.Checkpoint.from_directory(temp_checkpoint_dir),\n",
" )\n",
" if ray.train.get_context().get_world_rank() == 0:\n",
" print({\"epoch_num\": epoch, \"loss\": test_loss, \"accuracy\": accuracy})"
]
},
Contributor


Checkpoint is empty here.

Contributor Author


Added torch.save calls to this and the next example. IIRC the checkpoint should be optional, but right now it is not; maybe I can fix this in a future PR?

TODO: run the notebook again after it is finalized.
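For readers following the thread: the fix is to write files into the temporary directory before `Checkpoint.from_directory` packages it. The write-then-package pattern can be sketched without Ray or torch — the `save_checkpoint`/`load_checkpoint` helpers below are illustrative stand-ins for the notebook's `torch.save` call, not part of the template:

```python
import json
import os
import tempfile

def save_checkpoint(state, checkpoint_dir):
    # Write training state into the directory. In the notebook this is
    # torch.save(model.state_dict(), ...); Ray Train's
    # Checkpoint.from_directory then packages whatever files are here.
    with open(os.path.join(checkpoint_dir, "state.json"), "w") as f:
        json.dump(state, f)

def load_checkpoint(checkpoint_dir):
    with open(os.path.join(checkpoint_dir, "state.json")) as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
    save_checkpoint({"epoch": 3, "loss": 0.42}, temp_checkpoint_dir)
    # In the notebook, ray.train.report(metrics=...,
    # checkpoint=Checkpoint.from_directory(temp_checkpoint_dir)) would
    # run here; for illustration we just read the state back.
    restored = load_checkpoint(temp_checkpoint_dir)

print(restored)
```

An empty directory produces an empty checkpoint, which is why the reviewer flagged the original snippet.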

Comment on lines +352 to +365
" # [1] Start distributed training\n",
" # Define computation resources for workers\n",
" # Run `train_func_per_worker` on those workers\n",
" # ============================================\n",
" scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=use_gpu)\n",
" run_config = RunConfig(storage_path=\"/mnt/cluster_storage\", name=\"ray_train_run\")\n",
" trainer = TorchTrainer(\n",
" train_loop_per_worker=train_func_per_worker,\n",
" train_loop_config=train_config,\n",
" scaling_config=scaling_config,\n",
" run_config=run_config,\n",
" )\n",
" result = trainer.fit()\n",
" print(f\"Training result: {result}\")\n",
Contributor


Can use this as an opportunity to introduce what each of these classes/concepts are.

Contributor Author


Done - lmk what you think of the new explanation.

{
"cell_type": "markdown",
"metadata": {},
"source": [
Contributor


Images aren't rendering for me, will re-review this section whenever the docs render

Contributor Author


See 1d in https://www.notion.so/anyscale-hq/Examples-Publishing-Workflow-1e6027c809cb80bba02ccaa87523c22d - IIUC I need to submit this PR first and then use absolute links in a follow-up PR.


![Metrics Dashboard](images/metrics_dashboard.png)

## Step 3: Speed up its data loading with Ray Data
Contributor


Not obvious what "it" is here, can just do:

Suggested change
## Step 3: Speed up its data loading with Ray Data
## Step 3: Speed up data loading with Ray Data

Contributor Author


Changed to "Scale data ingest separately from training with Ray Data" since it doesn't actually get sped up - lmk if this is fine.


## Step 3: Speed up its data loading with Ray Data

Let’s modify this example to load data with Ray Data instead of the native Torch DataLoader. With just a few modifications, you can offload data preprocessing to an independently scaling distributed Ray Data pipeline. See [here](https://docs.ray.io/en/latest/data/comparisons.html#how-does-ray-data-compare-to-other-solutions-for-ml-training-ingest) for a comparison between Ray Data and Torch data loading.
Contributor


"offload data preprocessing to an independently scaling distributed Ray Data pipeline" is a mouthful and difficult for users to conceptualize what that actually means - we should revisit this to simplify the wording.

Contributor Author


Edited - lmk if the new wording is better!

Comment on lines +333 to +335
# [1] Prepare Dataloader for distributed training
# Shard the datasets among workers and move batches to the correct device
# =======================================================================
Contributor


Maybe elaborate more that the methods called are doing these? Otherwise it feels too auto-magical.

Contributor Author


Changed all of the numbered comments to mention what each function does - ptal, thanks!
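The "auto-magic" the reviewer flags boils down to sharding: `prepare_data_loader` wraps the loader with a `DistributedSampler` so each of the `world_size` workers iterates over a disjoint slice of the dataset (and it moves batches to that worker's device). The core sharding idea can be sketched without torch or Ray — `shard_indices` below is a hypothetical helper, not an API from the notebook:

```python
def shard_indices(num_samples, rank, world_size):
    # Round-robin assignment, the same basic scheme DistributedSampler
    # uses (minus its shuffling and padding options): worker `rank`
    # gets every `world_size`-th sample starting at offset `rank`.
    return list(range(rank, num_samples, world_size))

world_size = 4
shards = [shard_indices(10, rank, world_size) for rank in range(world_size)]
for rank, shard in enumerate(shards):
    print(f"worker {rank}: {shard}")
```

Together the shards are disjoint and cover the whole dataset, which is what lets each worker process a unique subset per epoch.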

@TimothySeah TimothySeah requested a review from matthewdeng May 28, 2025 01:14
Contributor

@justinvyu justinvyu left a comment


Looks really great! Some suggestions:

In section 2 (scaling training w/ Ray Train):

  • Let's include a detail that we're doing DDP training and briefly (like 1 sentence) describe what that's doing.
  • Let's include the expected speedup that we see, since we say the expected runtime in the first section. OR, just remove mentions of the expected runtimes, since it might be hard to explain why we don't see an 8x speedup.
  • Remove the mention in the text about enabling Ray Train v2.

" train_loop_per_worker=train_func_per_worker,\n",
" # [1] With Ray Data you pass the Dataset directly to the Trainer.\n",
" # ==============================================================\n",
" datasets={\"train\": train_dataset, \"test\": test_dataset},\n",
Contributor


nit: "valid" set rather than "test" set

Contributor Author


Changed all references to "test" in the train_func to "valid" but kept generic loading utilities as "test" since they could have been used for testing. Lmk if this is fine.

@TimothySeah
Contributor Author

  • Let's include the expected speedup that we see, since we say the expected runtime in the first section. OR, just remove mentions of the expected runtimes, since it might be hard to explain why we don't see an 8x speedup.

Wait, what do you mean by this comment? After running train_func in section 1 I mention "If everything works as expected, the training should take around 2 minutes 10 seconds with an accuracy of around 0.35.", and after running train_cifar_10 in section 2 I mention "Because you ran training in a data parallel fashion this time, it should have taken under 1 minute while maintaining similar accuracy."

@justinvyu
Contributor

@TimothySeah Oops, missed that then. Looks good 👍

@TimothySeah TimothySeah added the `go` label (add ONLY when ready to merge, run all tests) Jun 4, 2025
Contributor

@justinvyu justinvyu left a comment


Great job! Will we add a link to this in the ray docs as a follow-up?

@angelinalg
Contributor

So impressed with how much you did this on your own, @TimothySeah! Very high quality, too.
Re: where to put this example in the docs
Per @matthewdeng, let's replace this example with your new example:

- title: Train an image classifier with PyTorch

@angelinalg
Contributor

When you mention Anyscale-only features, keep in mind that OSS readers might be reading it. Suggest adding verbiage to let readers know that this is a feature on Anyscale and what OSS users would need to do to get the same functionality, if applicable.

Contributor

@angelinalg angelinalg left a comment


Approving with some suggestions.

TimothySeah and others added 13 commits June 9, 2025 15:49
…ipynb

Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Timothy Seah <timothy.seah777@yahoo.com>
@TimothySeah
Contributor Author

So impressed with how much you did this on your own, @TimothySeah! Very high quality, too. Re: where to put this example in the docs Per @matthewdeng, let's replace this example with your new example:

- title: Train an image classifier with PyTorch

Will consider doing this in a future PR. For now I'll just follow https://www.notion.so/anyscale-hq/Examples-Publishing-Workflow-1e6027c809cb80bba02ccaa87523c22d to add it to the workspace templates page.

@TimothySeah
Contributor Author

When you mention Anyscale-only features, keep in mind that OSS readers might be reading it. Suggest adding verbiage to let readers know that this is a feature on Anyscale and what OSS users would need to do to get the same functionality, if applicable.

Good callout - pushed a commit to do this.

@TimothySeah TimothySeah removed the request for review from a team June 9, 2025 23:51
@matthewdeng matthewdeng enabled auto-merge (squash) June 10, 2025 00:20
@matthewdeng matthewdeng merged commit 6f0ce28 into ray-project:master Jun 10, 2025
6 checks passed
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
… Data (#53220)

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
… Data (#53220)

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>