Conversation
@dasoto: Any idea when this will be merged? I tried to implement the prime_optimizer function, but it looks like it requires some of the changes on the torch_xla/csrc/init_python_bindings.cpp side.
Collaborator (Author):
Hi @dasoto, I've found that this approach will not guarantee the same sharding in the optimizer state as running a full training step (this is due to sharding propagation decisions in the compiler). I believe the adagrad unit test broke after an openxla pin update, for example. Since this is an experimental feature, I would be OK to merge after a rebase. cc @JackCaoG @alanwaketan
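A hedged sketch of how one might check that concern: collect the sharding annotation of every optimizer state tensor once after priming and once after a genuine forward/backward/step on the same model, then diff the two maps. The helper name `optimizer_state_shardings` is hypothetical; `_get_xla_sharding_spec` is an internal torch_xla binding (used in torch_xla's own tests), and its availability may vary by version.

```python
# Sketch: dump the XLA sharding spec of each optimizer state tensor so the
# sharding produced by priming can be compared against a real training step.
# optimizer_state_shardings is a hypothetical helper; _get_xla_sharding_spec
# is an internal torch_xla binding.
import torch
import torch_xla


def optimizer_state_shardings(optimizer):
    """Map (group index, param index, state key) -> XLA sharding spec string."""
    specs = {}
    for g, group in enumerate(optimizer.param_groups):
        for i, param in enumerate(group['params']):
            for key, value in optimizer.state.get(param, {}).items():
                if isinstance(value, torch.Tensor):
                    specs[(g, i, key)] = torch_xla._XLAC._get_xla_sharding_spec(value)
    return specs
```

Any mismatch between the two maps would surface the divergence described above.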
See also: #6546
The optimizer state must be primed before it can be restored: optimizer state isn't materialized until the first optim.step call, so a dummy step is needed to restore optimizer state before resuming training. This PR introduces the prime_optimizer API, which runs a dummy optimizer step with zeroed gradients. The gradient sharding is copied from the parameters so that the resulting optimizer state is sharded consistently with the parameters.
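A minimal usage sketch of the API described above. The import path (torch_xla.experimental.distributed_checkpoint) and the surrounding model/optimizer setup are assumptions for illustration, not verbatim from this PR:

```python
# Sketch: prime the optimizer so its state exists before a checkpoint restore.
# The import path is an assumption; model and optimizer choices are placeholders.
import torch
import torch_xla.core.xla_model as xm
from torch_xla.experimental.distributed_checkpoint import prime_optimizer

device = xm.xla_device()
model = torch.nn.Linear(16, 16).to(device)
optim = torch.optim.Adagrad(model.parameters(), lr=0.1)

# Optimizer state (e.g. Adagrad's per-parameter accumulators) only exists
# after the first optim.step(). prime_optimizer runs a dummy step with
# zeroed gradients whose sharding is copied from the parameters, so the
# state tensors are created (and sharded) before a checkpoint is loaded
# into them.
prime_optimizer(optim)

# optim.state_dict() now contains real tensors that a distributed
# checkpoint loader can restore in place.
```

Without the priming step, optim.state_dict() would be empty and there would be nothing for the checkpoint loader to restore into.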