[train][template] Add Anyscale template for pytorch + train + data #53220
matthewdeng merged 30 commits into ray-project:master
Conversation
Signed-off-by: Timothy Seah <tseah@anyscale.com>
```shell
# Install Python dependencies
pip3 install --no-cache-dir \
    "torch==2.7.0" \
    "torchvision==0.22.0"
```
Why are these required? The template explicitly pins torch 2.7, but there is already a version of torch inside the image.
Do I still need torchvision?
When running the notebook in my workspace, I pip installed torch and torchvision, and then just used the versions that it happened to download.
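To make the version question above concrete, here is a hedged, stdlib-only sketch (not part of the template) of how a setup step could detect the torch already baked into the image and only install when it is missing. `installed_version` is a hypothetical helper; `torch` may or may not be present wherever this runs.

```python
# Check whether a package is already installed before pinning/reinstalling it.
# Uses only the standard library (importlib.metadata).
from importlib import metadata


def installed_version(pkg: str):
    """Return the installed version of `pkg`, or None if it is absent."""
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return None


# Only run the pip3 install when torch is actually missing from the image.
needs_torch_install = installed_version("torch") is None
```

A template script could branch on `needs_torch_install` instead of unconditionally pinning `torch==2.7.0` over the preinstalled copy.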
doc/source/train/examples/pytorch/distributing-pytorch/README.ipynb (outdated, resolved)
```python
# [3] Report metrics to Ray Train
# ===============================
with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
    ray.train.report(
        metrics={"loss": test_loss, "accuracy": accuracy},
        checkpoint=ray.train.Checkpoint.from_directory(temp_checkpoint_dir),
    )
    if ray.train.get_context().get_world_rank() == 0:
        print({"epoch_num": epoch, "loss": test_loss, "accuracy": accuracy})
```
Checkpoint is empty here.
Added torch.save calls to this and the next example. IIRC the checkpoint should be optional, but right now it isn't - maybe I can fix this in a future PR?
TODO: run the notebook again after it is finalized.
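The fix under discussion boils down to writing checkpoint files into the temporary directory before packaging it, so the reported checkpoint is not empty. Below is a self-contained sketch of that pattern; `save_and_report` is a hypothetical stand-in, with `json.dump` standing in for `torch.save` and a plain dict standing in for `ray.train.report`, so it runs without Ray or torch.

```python
# Sketch: populate the checkpoint directory *before* reporting it.
import json
import os
import tempfile


def save_and_report(epoch, model_state, metrics):
    """Mimic the report-with-checkpoint pattern from the notebook."""
    with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
        # Stand-in for torch.save(model.state_dict(), path).
        path = os.path.join(temp_checkpoint_dir, f"model_epoch_{epoch}.json")
        with open(path, "w") as f:
            json.dump(model_state, f)
        # Stand-in for ray.train.report(metrics=...,
        #     checkpoint=Checkpoint.from_directory(temp_checkpoint_dir)).
        files = os.listdir(temp_checkpoint_dir)
        return {"metrics": metrics, "checkpoint_files": files}


result = save_and_report(0, {"w": [0.1, 0.2]}, {"loss": 0.5, "accuracy": 0.9})
```

The key detail is ordering: the save happens inside the `with` block, before the directory is handed off, so the packaged checkpoint actually contains the model file.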
```python
# [1] Start distributed training
# Define computation resources for workers
# Run `train_func_per_worker` on those workers
# ============================================
scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=use_gpu)
run_config = RunConfig(storage_path="/mnt/cluster_storage", name="ray_train_run")
trainer = TorchTrainer(
    train_loop_per_worker=train_func_per_worker,
    train_loop_config=train_config,
    scaling_config=scaling_config,
    run_config=run_config,
)
result = trainer.fit()
print(f"Training result: {result}")
```
Can use this as an opportunity to introduce what each of these classes/concepts are.
Done - lmk what you think of the new explanation.
Images aren't rendering for me, will re-review this section whenever the docs render
See 1d in https://www.notion.so/anyscale-hq/Examples-Publishing-Workflow-1e6027c809cb80bba02ccaa87523c22d - iiuc I need to submit this PR first and then use absolute links in a followup PR.
```markdown
## Step 3: Speed up its data loading with Ray Data
```
Not obvious what "it" is here, can just do:

```diff
- ## Step 3: Speed up its data loading with Ray Data
+ ## Step 3: Speed up data loading with Ray Data
```
Changed it to "Scale data ingest separately from training with Ray Data" since the loading doesn't actually get sped up - lmk if this is fine.
doc/source/train/examples/pytorch/distributing-pytorch/README.md (outdated, resolved)
```markdown
## Step 3: Speed up its data loading with Ray Data

Let’s modify this example to load data with Ray Data instead of the native Torch DataLoader. With just a few modifications, you can offload data preprocessing to an independently scaling distributed Ray Data pipeline. See [here](https://docs.ray.io/en/latest/data/comparisons.html#how-does-ray-data-compare-to-other-solutions-for-ml-training-ingest) for a comparison between Ray Data and Torch data loading.
```
"offload data preprocessing to an independently scaling distributed Ray Data pipeline" is a mouthful, and it's difficult for users to conceptualize what that actually means - we should revisit this to simplify the wording.
Edited - lmk if the new wording is better!
```python
# [1] Prepare Dataloader for distributed training
# Shard the datasets among workers and move batches to the correct device
# =======================================================================
```
Maybe elaborate that the methods being called are what accomplish these steps? Otherwise it feels too auto-magical.
Changed all of the numbered comments to mention what each function does - ptal, thanks!
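For readers who find the sharding comment auto-magical, here is a stdlib-only sketch of what "shard the datasets among workers" boils down to: each of N workers reads a disjoint, interleaved slice of the sample indices, similar to what a distributed sampler produces under the hood. `shard_indices` is a hypothetical helper, not a Ray or torch API.

```python
# Interleaved index sharding: worker `rank` of `world_size` workers takes
# every world_size-th sample starting at its own rank.
def shard_indices(num_samples: int, world_size: int, rank: int) -> list:
    """Return the sample indices that one worker would process."""
    return list(range(rank, num_samples, world_size))


# Four workers split ten samples with no overlap and full coverage.
shards = [shard_indices(10, world_size=4, rank=r) for r in range(4)]
```

Moving batches to the correct device is the other half of the prepared loader's job, which this index-level sketch leaves out.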
justinvyu left a comment
Looks really great! Some suggestions:
In section 2 (scaling training w/ Ray Train):
- Let's include a detail that we're doing DDP training and briefly (like 1 sentence) describe what that's doing.
- Let's include the expected speedup that we see, since we say the expected runtime in the first section. OR, just remove mentions of the expected runtimes, since it might be hard to explain why we don't see an 8x speedup.
- Remove the mention in the text about enabling Ray Train v2.
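The one-sentence DDP description requested above comes down to this: every worker computes gradients on its own data shard, then an all-reduce averages those gradients so each model replica applies the identical update. A framework-free sketch of that averaging step, where `allreduce_mean` is a hypothetical stand-in for torch's collective op:

```python
# Average per-worker gradients elementwise, as DDP's all-reduce does.
def allreduce_mean(per_worker_grads):
    """Each inner list is one worker's gradient vector; return the mean."""
    n = len(per_worker_grads)
    return [sum(g) / n for g in zip(*per_worker_grads)]


# Two workers, two parameters each: replicas end up with the same update.
avg = allreduce_mean([[1.0, 2.0], [3.0, 4.0]])
```

Because every replica sees the same averaged gradient, the model weights stay synchronized across workers without any explicit broadcast after each step.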
```python
    train_loop_per_worker=train_func_per_worker,
    # [1] With Ray Data you pass the Dataset directly to the Trainer.
    # ==============================================================
    datasets={"train": train_dataset, "test": test_dataset},
```
nit: "valid" set rather than "test" set
Changed all references to "test" in the train_func to "valid" but kept generic loading utilities as "test" since they could have been used for testing. Lmk if this is fine.
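The pattern in the snippet above is that the Trainer takes a dict of named datasets and each worker pulls its own shard of a dataset by name inside the training function. A stdlib sketch of that flow, where `get_shard` is a hypothetical stand-in for `ray.train.get_dataset_shard` and slicing stands in for streaming shards:

```python
# Split each named dataset into per-worker shards, then look them up by name.
def split(data, world_size):
    """Interleaved split of a list into one shard per worker."""
    return [data[r::world_size] for r in range(world_size)]


world_size = 2
datasets = {"train": list(range(8)), "valid": list(range(4))}
shards = {name: split(d, world_size) for name, d in datasets.items()}


def get_shard(name: str, rank: int):
    """Stand-in for ray.train.get_dataset_shard(name) inside a worker."""
    return shards[name][rank]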
Wait, what do you mean by this comment? I mention
@TimothySeah Oops, missed that then. Looks good 👍
justinvyu left a comment
Great job! Will we add a link to this in the ray docs as a follow-up?
doc/source/train/examples/pytorch/distributing-pytorch/configs/aws.yaml (outdated, resolved)
So impressed with how much you did this on your own, @TimothySeah! Very high quality, too.
(ray/doc/source/train/examples.yml, Line 6 in cb98953)
When you mention Anyscale-only features, keep in mind that OSS readers might be reading it. Suggest adding verbiage to let readers know that this is a feature on Anyscale and what OSS users would need to do to get the same functionality, if applicable.
angelinalg left a comment
Approving with some suggestions.
…ipynb Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Timothy Seah <timothy.seah777@yahoo.com>
Will consider doing this in a future PR. For now I'll just follow https://www.notion.so/anyscale-hq/Examples-Publishing-Workflow-1e6027c809cb80bba02ccaa87523c22d to add it to the workspace templates page.
Good callout - pushed a commit to do this.
… Data (#53220) --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Followed steps 1 and 2 here.
A few notes:
- I will replace image links with public links in a future PR.
- I ran the notebook in an Anyscale Workspace and verified that it worked as intended.