ONNX Training Proposal #2013
Conversation
This proposal aims at capturing the information required to perform stochastic gradient-based training on the inference graph. Although training itself could be encoded as a sub-graph by adding a single backward operator, we currently prefer a more strongly-typed way to store that information.
Thanks Wei-Sheng. Looks good to me, with a few minor points I mentioned above.
// function defined in "ModelProto.function."
//
// The field MUST be present.
optional NodeProto loss = 4;
Please clarify the name look-up when referencing a FunctionProto in ModelProto.functions, for both loss and optimizer.
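For concreteness, a minimal sketch of the look-up being asked about, assuming it resolves the way ordinary node-to-function resolution does in ONNX (matching the node's op_type and domain against a FunctionProto's name and domain); the function name and domain below are hypothetical:

```python
from onnx import helper

# Hypothetical: under the proposal, TrainingInfoProto.loss would be a NodeProto
# whose (op_type, domain) pair is looked up in ModelProto.functions, the same
# way a regular node call resolves to a FunctionProto with matching name/domain.
loss_node = helper.make_node(
    "MySoftmaxCrossEntropy",      # hypothetical FunctionProto name in ModelProto.functions
    inputs=["scores", "labels"],
    outputs=["loss"],
    domain="ai.onnx.example",     # hypothetical domain of that FunctionProto
)
```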
// Optimized tensors and their gradient tensor names. Each pair describes a
// tensor's name (key) and its gradient's name (value). The gradient tensors
// are only visible to the input list of "optimizer."
repeated StringStringEntryProto gradient_binding = 5;
Please explain how the gradient tensors are associated with loss.
Do we really need this binding? Based on the document https://github.com/onnx/onnx/files/3208156/ONNX.Training.Discussion.pptx, the optimizer has W and Gradient W as inputs. The name of "Gradient W", which is an output of the backward pass, could seemingly be derived by the backend and runtime implementation. As a reference, the current PyTorch and TF APIs do not support user-defined names for the gradients passed to optimizers.
We need it. PyTorch creates such a binding in another (but equivalent) way: it keeps a dictionary whose keys are parameters and whose values are those parameters' gradient tensors.
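To make that comparison concrete, here is a minimal PyTorch sketch (the tiny model is illustrative) of the implicit parameter-to-gradient binding mentioned above; in ONNX, where tensors are referenced by name, the same association has to be spelled out explicitly, which is what `gradient_binding` does:

```python
import torch

# Tiny illustrative model; after backward(), each parameter object carries its
# own gradient tensor, so the parameter -> gradient mapping is implicit.
model = torch.nn.Linear(2, 1)
loss = model(torch.randn(4, 2)).sum()
loss.backward()

# The "dictionary" mentioned above: parameter name -> that parameter's gradient.
binding = {name: p.grad for name, p in model.named_parameters()}
# In the proposed ONNX encoding this becomes name-to-name pairs, e.g.
# gradient_binding = [("W", "grad_of_W"), ...]  (names are hypothetical).
```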
Add GatherND
* Update to include ONNX Foundation WG Added link to Gitter and description of the newly formed Foundation WG. Co-leaders Jim Spohrer (IBM) and Ryan Loney (Intel) * Updated description of Foundation WG Revised the description of the working group for ONNX Foundation * Update working-groups.md
…2288) with type default values even though they are not in the stream.
Move map and sequence types to onnx domain; this is the first step of merging onnx-ml and onnx types.
* Fix link to community docs in readme Addresses onnx#2255 * Update README.md * Update README.md
* Update managingexperimentalops.md * Update managingexperimentalops.md * Rename managingexperimentalops.md to ManagingExperimentalOps.md
* Added negative axes for slice and squeeze opset 11 * added negative axes support for squeeze, unsqueeze, flatten * added support for negative axes to all the existing ops * fixed minor if condition missed for axis attr in flatten * fixed test name for flatten with negative axes * updated unsqueeze and softmax tests with fix for failures * fixed typo * Updating Split op documentations and version * fixed typo in unsqueeze model * fixed dim check for unsqueeze * fixed type cast * test fix for build failure * updating onnx model for unsqueeze test * fixed minor error in type casting
* added test for int64 input to 'where' op * added onnx model files and docs for test 'where' op with long input * added missing doc updates
… that do not have matching graph inputs. (onnx#2135) * Update helper.py An IR v4 model is not required to have matching graph inputs for all initializers. Update printable_graph to allow for this and output the name, type and shape of initializers with no matching graph input. * Add test for printable_graph Add test and tweak messaging * Fix comment formatting
…lements', 'OneHot' (onnx#2260) * modified gather docs to support negative indices * added support for negative indices to gather_elements and scatter_elements * fixed documentation formatting as per comments * Added negative indices to docs for OneHot op * GatherND spec for negative indices edited * fix for comments * fixed formatting for gather op * Update onnx/defs/tensor/defs.cc Co-Authored-By: Wei-Sheng Chin <wschin@outlook.com> * Update docs/Changelog.md Co-Authored-By: Wei-Sheng Chin <wschin@outlook.com> * added print examples to onehot and gather as per comments * added modified model tests for onehot * updated doc files * updating unsqueeze test * typo fix as per comments
* Fix shapeinference function * Added shapeinference test for cumsum * update inference test * fix test * minor fix -- shape (1) should be (1,) * Add whitespace after comma to fix flake warning
* Add a helper function update_inputs_outputs_dims to tools * fix link to doc * newline at the end * add test for tools * doc props * nit * ci tests * ci tests 2 * accept shapes by dictionary inputs and add more error handling * Update onnx/tools/update_model_dims.py nit: rephrasing Co-Authored-By: Wei-Sheng Chin <wschin@outlook.com> * remove debug line * fix type annotation * fix annotation * fix annotation * fix annotation * fix flake8
* sequence related ops * refine docs * extend hasInputShape to Sequence * refining naming and error checking * refine descriptions
* fix resize shape inference issue in opset10 * include opset10 upsample as well * nit: const auto* * rename 'opset7' to 'opset7_to_10'
* Added more test cases for Unsqueeze * Added a test case for unsqueezing 3 dims Also renamed the 1 dim test cases slightly. * Added more test cases for Unsqueeze * Added a test case for unsqueezing 3 dims Also renamed the 1 dim test cases slightly. * Update docs/Operators.md Feedback from wschin to fix axis bounds. Co-Authored-By: Wei-Sheng Chin <wschin@outlook.com> * Re-ran update_doc.sh
This PR has been merged into #2314 so we can close it now.
Major changes:
1. Add a protobuf message, `TrainingInfoProto` originally designed in
onnx#2013, to store training information.
2. In `TrainingInfoProto`, the user can store the training algorithm in the
`algorithm` field as a `GraphProto`.
3. The user can also store an initialization algorithm for resetting the
model in `TrainingInfoProto.initialization` (proposed by @tbennun in
onnx#2517 and agreed by the Training WG).
4. `ModelProto.graph` is callable inside `TrainingInfoProto.algorithm`.
`ModelProto.graph.initializer` are visible to nodes in
`TrainingInfoProto.algorithm.node`.
5. This PR also introduces a `Gradient` operator to differentiate a
function represented by a (sub-)graph. This idea is from onnx#2168.
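A minimal Python sketch of items 1-5 above, assuming the field names that were eventually merged (`algorithm`, `update_binding`), the `Gradient` operator's `xs`/`y` attributes, and the released `ai.onnx.preview.training` domain (the commits below first placed it in `ai.onnx.training`); the tiny model Y = MatMul(X, W) and all tensor names are illustrative:

```python
import onnx
from onnx import helper, TensorProto

# Forward computation plus one SGD step, expressed as a training "algorithm" graph.
fwd = helper.make_node("MatMul", ["X", "W"], ["Y"])
grad = helper.make_node(
    "Gradient", inputs=["X", "W"], outputs=["dY_dX", "dY_dW"],
    domain="ai.onnx.preview.training",  # earlier drafts used "ai.onnx.training"
    xs=["X", "W"], y="Y")               # differentiate Y with respect to X and W
lr = helper.make_node(
    "Constant", [], ["lr"],
    value=helper.make_tensor("lr_t", TensorProto.FLOAT, [], [0.01]))
step = helper.make_node("Mul", ["lr", "dY_dW"], ["step"])
new_w = helper.make_node("Sub", ["W", "step"], ["W_new"])

algorithm = helper.make_graph(
    [fwd, grad, lr, step, new_w], "one_sgd_step",
    inputs=[helper.make_tensor_value_info("X", TensorProto.FLOAT, [1, 2]),
            helper.make_tensor_value_info("W", TensorProto.FLOAT, [2, 1])],
    outputs=[helper.make_tensor_value_info("W_new", TensorProto.FLOAT, [2, 1])])

training_info = onnx.TrainingInfoProto()
training_info.algorithm.CopyFrom(algorithm)
# Bind the stateful tensor "W" to the algorithm output holding its new value.
training_info.update_binding.add(key="W", value="W_new")
```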
Contribution list:
Baihan Huang: spec design.
Tal Ben-Nun: model initialization design.
Wei-Sheng Chin: spec design, Gradient operator design.
Jonny Shipton and active WG members and participants: many valuable comments and reviews.
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
* ONNX Training proposal.
Major changes:
1. Add a protobuf message, `TrainingInfoProto` originally designed in
#2013, to store training information.
2. In `TrainingInfoProto`, the user can store the training algorithm in the
`algorithm` field as a `GraphProto`.
3. The user can also store an initialization algorithm for resetting the
model in `TrainingInfoProto.initialization` (proposed by @tbennun in
#2517 and agreed by the Training WG).
4. `ModelProto.graph` is callable inside `TrainingInfoProto.algorithm`.
`ModelProto.graph.initializer` are visible to nodes in
`TrainingInfoProto.algorithm.node`.
5. This PR also introduces a `Gradient` operator to differentiate a
function represented by a (sub-)graph. This idea is from #2168.
Contribution list:
Baihan Huang: spec design.
Tal Ben-Nun: model initialization design.
Wei-Sheng Chin: spec design, Gradient operator design.
Jonny Shipton and active WG members and participants: many valuable comments and reviews.
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
* Address comments
* Address a comment
* Move Gradient to ai.onnx.training
Update Gradient test models
* Address comments
1. Create initialization_binding instead of
using update_binding for initialization.
2. Swap key and value in update_binding.
3. Refine documents accordingly.
* Clarify semantics of algorithm and initialization
* Fix typos
* Address comment and explain the two computation modes of ModelProto.training_info
* Fix typo and explain default behavior
* Update onnx/checker.cc
Co-Authored-By: Jonny Shipton <tmvector@gmail.com>
* Address comments
* Make normalization_binding a repeated field
* Add GraphCall operator
* Polish GraphCall
* GraphCall now uses position to map inputs and outputs
* Address comments:
1. Clarify GraphCall's semantics (see the GraphCall sketch after this commit list).
2. Implicitly force trainable tensors to be inference graph's inputs.
3. Training operators cannot be called in the inference graph.
* Add accidentally removed changes back
* Use protobuf lite
* Polish the helper script
* Fix windows build and polish helper script
* Fix linux and mac builds
* One more line
* fix the attribute types section in IR.md (#2590)
* fix the attribute types section in IR.md
* update per comments.
* Some changes around the behavior of optional inference inputs.
1. Use pass-by-value for optional inference inputs.
2. Due to the semantics of GraphCall, we implicitly force trainable
inputs to be added into the inference graph's input list.
Revise docs
* Update spec per WG discussion
* update_binding is optional now because the user might only want to store initialization
* Polish doc
* Address comments. Polish words.
* Use an alternative field to declare global variables.
In yesterday's Operator SIG meeting, we agreed to still
put global variables in the inference graph and add a
model-level field to indicate global variables. This way
we have a smaller impact on inference engines, because
they don't need to move trainable tensors to a new field.
* polish docs
* Allow training initializers to be promoted to global & mutable variables
* Merge the functions of global_mutable_initializer_names into update_binding
* Polish docs
* Remove restriction on using ai.onnx.training in the inference graph
* Split training register from ai.onnx register file
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
Co-authored-by: Ke Zhang <kezhan@microsoft.com>
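A minimal sketch of the GraphCall usage introduced in the commits above, assuming the `graph_name` attribute and the positional input/output mapping of the preview training operator set; the graph and tensor names are illustrative:

```python
from onnx import helper

# Inside TrainingInfoProto.algorithm, a GraphCall node can invoke the inference
# graph (ModelProto.graph) to obtain the prediction consumed by the loss and
# Gradient computation. Inputs and outputs are matched to the called graph's
# inputs and outputs by position, as described in the commits above.
call = helper.make_node(
    "GraphCall",
    inputs=["X", "W"],                  # positionally mapped to the called graph's inputs
    outputs=["Y_pred"],                 # positionally mapped to its outputs
    domain="ai.onnx.preview.training",
    graph_name="inference_graph",       # name of the graph being invoked (illustrative)
)
```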
PR #2314 is a single place for reviewing the whole training story.
This proposal aims at capturing the information required to perform
stochastic gradient-based training on the inference graph.
Although training itself could be encoded as a sub-graph by adding
a single backward operator, we currently prefer a more strongly-typed
way to store that information. We also think separating the training graph
from the inference graph makes backends' lives easier.
The major change is the introduction of `TrainingInfoProto`:
An optional `TrainingInfoProto` will be added into `ModelProto` so that users know how to apply the specified optimization algorithm to conduct further training iterations.
To allow customized gradient-based optimization algorithms, a `FunctionProto` list is also added into `ModelProto`. Users can store their training algorithm as a `FunctionProto` in that list and reference that `FunctionProto` using `TrainingInfoProto.optimizer`.
The existence of `TrainingInfoProto.additional_initializer` has a reason: there are many per-tensor states in the training phase, momentum and accumulated squared gradients being common examples. If we reused `ModelProto.graph.initializer` to store those training-specific tensors, loading a model for inference could consume two or three times as much memory.
Common optimizers and loss functions will be proposed as `FunctionProto`s subsequently. For example, the ADAGRAD optimizer has a WIP PR, #1955.
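As one hypothetical example of such a `FunctionProto` (the domain, name, and signature below are illustrative, not taken from the proposal or the ADAGRAD PR), a plain SGD update could be packaged as:

```python
from onnx import helper

# Hypothetical SGD update as a FunctionProto: new_weight = weight - learning_rate * gradient.
sgd_step = helper.make_function(
    domain="ai.onnx.example",                        # illustrative domain
    fname="SGDStep",                                 # illustrative function name
    inputs=["weight", "gradient", "learning_rate"],
    outputs=["new_weight"],
    nodes=[
        helper.make_node("Mul", ["learning_rate", "gradient"], ["step"]),
        helper.make_node("Sub", ["weight", "step"], ["new_weight"]),
    ],
    opset_imports=[helper.make_opsetid("", 12)],
    attributes=[],
)
# Under the proposal, TrainingInfoProto.optimizer would reference this function
# by name, and gradient_binding would supply the name of the "gradient" input.
```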