This repository was archived by the owner on Mar 3, 2026. It is now read-only.

Support run trainer locally #111

Merged: zpcore merged 3 commits into main from piz/run_local on Mar 15, 2025

@zpcore (Collaborator) commented Feb 13, 2025

Adds support for running train.py locally in a container:

tp docker-run --use-hf torchprime/hf_models/train.py  # Hugging Face run

Or

tp docker-run torchprime/torch_xla_models/train.py

This makes it easier to debug model code in a Docker environment.
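For readers unfamiliar with this kind of wrapper, here is a minimal sketch of how a command like `tp docker-run` could translate a script path into a `docker run` invocation. It is an illustration only and not the code in torchprime/launcher/cli.py: the function name `build_docker_run_command`, the image name `torchprime-dev:latest`, and the `/workspace` mount point are all hypothetical, and the handling of `--use-hf` is not shown because this thread does not describe it.

```python
import os
import shlex


def build_docker_run_command(
    script,
    script_args=(),
    image="torchprime-dev:latest",  # hypothetical image name
    host_dir=None,
    workdir="/workspace",  # hypothetical mount point inside the container
):
    """Build an argv list that runs `script` with python3 inside a container.

    The host source tree is bind-mounted so that local edits are visible in
    the container without rebuilding the image.
    """
    if host_dir is None:
        host_dir = os.getcwd()
    return [
        "docker", "run",
        "--rm",                          # remove the container when it exits
        "-v", f"{host_dir}:{workdir}",   # bind-mount the source tree
        "-w", workdir,                   # run from the mounted directory
        image,
        "python3", script,
        *script_args,
    ]


if __name__ == "__main__":
    cmd = build_docker_run_command(
        "torchprime/torch_xla_models/train.py",
        ["max_steps=50"],
    )
    # Print a shell-safe version of the command instead of executing it.
    print(shlex.join(cmd))
```

Returning the argv as a list rather than a single string keeps quoting predictable, and the same list could be passed to `subprocess.run` if the command should actually be executed.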

Comment thread torchprime/launcher/cli.py
@zpcore zpcore requested a review from tengyifei February 13, 2025 21:58
@zpcore zpcore marked this pull request as ready for review February 13, 2025 21:58
@tengyifei (Contributor) commented

I usually run a model locally like this:

python3 torchprime/torch_xla_models/train.py model=llama-3-8b mesh.fsdp=8 profile_step=3 max_steps=50

Is this insufficient for your use case?

@zpcore (Collaborator, Author) commented Feb 13, 2025

> I usually run a model locally like this:
>
> python3 torchprime/torch_xla_models/train.py model=llama-3-8b mesh.fsdp=8 profile_step=3 max_steps=50
>
> Is this insufficient for your use case?

Yes, this should also work. However, I ran into many permission issues when running train.py directly. This change just makes it easy to run the script in a container instead.

Comment threads: README.md (3, all outdated); torchprime/launcher/cli.py (3, 1 outdated); torchprime/launcher/buildpush.py (1, outdated)
@zpcore zpcore merged commit 825320e into main Mar 15, 2025
@zpcore zpcore deleted the piz/run_local branch March 15, 2025 06:04