Training Proposal: Spec Changes and Gradient Operator #2314
wschin merged 53 commits into onnx:master from
Conversation
Are the new ops only useful in training scenarios? If so, they should be put in a separate namespace. It doesn't have to be "ai.onnx.training"; it can be "ai.onnx.lossfunction" or something like that.
Sounds good. I will put them into |
How should a backend decide which parameters in the graph are trainable and which are not? Do we consider all the initializers to be 'trainable' or only those that are captured in |
There is no direct concept of trainable. As long as you use |
Okay, I believe that means trainable parameters are inferred based on initializer and update_binding. Sounds reasonable. |
Yes!
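The inference rule agreed above (a tensor is trainable iff it is an initializer and is referenced by update_binding) can be sketched in Python. Note this is a minimal illustration with hypothetical stand-in types, not the real onnx protobuf classes; since the key/value orientation of update_binding changed during this PR, the sketch checks both sides of each entry.

```python
from dataclasses import dataclass

@dataclass
class StringStringEntry:
    # Stand-in for onnx's StringStringEntryProto (a key/value pair).
    key: str
    value: str

def trainable_parameters(initializer_names, update_binding):
    """Infer trainable tensors: initializers whose names appear in
    update_binding. Whether the initializer name sits in `key` or
    `value` depends on the spec revision, so both are checked here."""
    bound = set()
    for entry in update_binding:
        bound.add(entry.key)
        bound.add(entry.value)
    return [name for name in initializer_names if name in bound]

# Example: "W" and "B" are initializers; only "W" is updated by training.
binding = [StringStringEntry(key="W", value="W_new")]
print(trainable_parameters(["W", "B"], binding))  # ['W']
```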
// If an input or initializer in a graph has the same name as a global variable,
// then the local variable in the graph will hide the global variable.
Wouldn't it be more consistent to say that graph initializers cannot have the same name as any global initializer? The spec currently requires node outputs in subgraphs to not hide any tensors from parent graphs.
A graph initializer means declaring a variable, so the local variable hides the outer scope's variable. An operator output declares a variable only if that variable doesn't exist; if the variable already exists, it would be an assignment, which we want to avoid.
The variable "W" is an optional input in the called graph.
If the user omits it, the input names of GraphCall become ["X_1", "", "Z_1"].
In this case, from the view of the computation graph, the Conv operator invoked by
GraphCall may be connected to the global "W" variable.
So when you use the gradient operator in the training graph, on a graph containing a GraphCall:

- Global initializers preserve gradient flow even when used by the inference graph but not explicitly passed to the GraphCall
- Inference graph initializers aren't visible to the training graph, so you can't take the gradient w.r.t. them
- If the inference graph has an input with the same name as a global initializer, then that input is optional

Is that right?
I'm guessing the main purpose in adding the global initializers is to make it clearer that these are implicitly used by GraphCall without GraphCall having to explicitly pass values for each one?
So when you use the gradient operator in the training graph, on a graph containing a GraphCall:

- Global initializers preserve gradient flow even when used by the inference graph but not explicitly passed to the GraphCall

Yes. The global variable is referenced if we don't find a local variable with the required name (here, the name is "W").

- Inference graph initializers aren't visible to the training graph, so you can't take the gradient w.r.t. them

Yes. They are inference-time and training-time constants.

- If the inference graph has an input with the same name as a global initializer, then that input is optional

Yes.

I'm guessing the main purpose in adding the global initializers is to make it clearer that these are implicitly used by GraphCall without GraphCall having to explicitly pass values for each one?

Yes! It also forces the initialization/training algorithm not to touch inference-only constants.
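The lookup order discussed above (local scope first, then the global initializer scope) can be sketched as follows. The function and data shapes are hypothetical, purely for illustration of the resolution rule:

```python
def resolve(name, local_scope, global_initializers):
    """Resolve a variable referenced by an operator inside a called graph.

    Models only the lookup order: a local variable hides the global one,
    and a name absent locally falls back to the global initializer list
    (this is how an omitted optional GraphCall input like "W" still
    reaches the Conv node inside the called graph).
    """
    if name in local_scope:
        return local_scope[name]          # local variable hides global
    if name in global_initializers:
        return global_initializers[name]  # implicit use of a global initializer
    raise KeyError(f"undefined variable: {name}")

# "W" is not passed to GraphCall, so Conv inside the called graph
# implicitly reads the global initializer "W".
local = {"X_1": "activation"}
globals_ = {"W": "trainable weight"}
print(resolve("W", local, globals_))  # trainable weight
```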
// This field usually stores trainable tensors of the model, because trainable
// tensors are used by both the inference graph and the training algorithm graph.
repeated TensorProto global_initializer = 30;
Suggested change:
- repeated TensorProto global_initializer = 30;
+ repeated TensorProto global_mutable_initializer = 30;

maybe?
Let's make it a string. All tensors stay in global_initializer; the names of the mutable subset go in global_mutable_initializer.
Ok. I changed this field to global_mutable_initializer_name, and it can only contain names from ModelProto.graph.initializer. Do you want TrainingInfoProto.algorithm.initializer and TrainingInfoProto.initialization.initializer to be mutable and globally visible?
// Sparse initializers which are globally accessible.
// See the documentation of `global_initializer` for details.
repeated SparseTensorProto global_sparse_initializer = 31;
Merge these two lists into a string list? Each element is a name of ModelProto.graph.initializer.
(a) I thought the above version also allowed initializers from the training-algorithm?
(b) I forget whether today's meeting said it is okay to explicitly specify this, or just infer it from update_bindings?
(c) We didn't discuss one point in meeting: whether this list allows any (ModelProto.graph) initializer or only initializers that appear in input?
(a) I thought the above version also allowed initializers from the training-algorithm?
If the mutable initializer is only needed in a training graph, its name won't appear in the global variable name list.
(b) I forget whether today's meeting said it is okay to explicitly specify this, or just infer it from update_bindings?
Ok with both but it seems people like update_bindings more.
(c) We didn't discuss one point in meeting: whether this list allows any (ModelProto.graph) initializer or only initializers that appear in input?
No, because:

- We only need initializers (trainable and mutable tensors) for training. I don't feel an input is a mutable thing.
- If an input is promoted to global, multiple graph calls would have to use the same global input, which sounds strange.
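A validity check for the constraint just discussed (the mutable-name list may reference only ModelProto.graph.initializer entries, never plain graph inputs) could look like this sketch. The function name and error messages are illustrative, not the actual onnx checker API:

```python
def check_mutable_names(mutable_names, graph_initializer_names, graph_input_names):
    """Every mutable name must refer to a graph initializer.
    A plain graph input cannot be promoted to global mutable state."""
    initializers = set(graph_initializer_names)
    # Inputs that are not also backed by an initializer are plain inputs.
    plain_inputs = set(graph_input_names) - initializers
    for name in mutable_names:
        if name in plain_inputs:
            raise ValueError(f"'{name}' is a graph input, not an initializer")
        if name not in initializers:
            raise ValueError(f"'{name}' is not in ModelProto.graph.initializer")

# "W" is an initializer, so it may be declared mutable; "X" may not.
check_mutable_names(["W"], ["W", "B"], ["X"])
```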
ebarsoum
left a comment
As we agreed in today's SIG meeting, let's do the following:
- Rename global_initializer to mutable_initializer.
- Change the type to string; it contains the names of the tensors in the initializer list that are mutable.
In yesterday's Operator SIG meeting, we agreed to still put global variables in the inference graph and add a model-level field to indicate global variables. This way we have a smaller impact on inference engines, because they don't need to move trainable tensors to a new field.
The last commit includes all requested changes. Please take a look. Thank you.
The request has been addressed.
houseroad
left a comment
Why did we silently switch to protobuf-lite by default?
This is not friendly to build systems that rely on protobuf.
Please give a heads-up and discuss it before doing so.
* ONNX Training proposal.
Major changes:
1. Add a protobuf message, `TrainingInfoProto` originally designed in
onnx#2013, to store training information.
2. In `TrainingInfoProto`, the user can store training algorithm in
`algorithm` field as a `GraphProto`.
3. The user can also store initialization algorithm for resetting the
model in `TrainingInfoProto.initialization` (proposed by @tbennun in
onnx#2517 and agreed by Training WG).
4. `ModelProto.graph` is callable inside `TrainingInfoProto.algorithm`.
`ModelProto.graph.initializer` are visible to nodes in
`TrainingInfoProto.algorithm.node`.
5. This PR also introduces a `Gradient` operator to differentiate a
function represented by a (sub-)graph. This idea is from onnx#2168.
Contribution list:
Baihan Huang: spec design.
Tal Ben-Nun: model initialization design.
Wei-Sheng Chin: spec design, Gradient operator design.
Jonny Shipton and active WG members and participants: many valuable comments and reviews.
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
* Address comments
* Address a comment
* Move Gradient to ai.onnx.training
Update Gradient test models
* Address comments
1. Create initialization_binding instead of
using update_binding for initialization.
2. Swap key and value in update_binding.
3. Refine documents accordingly.
* Clarify semantics of algorithm and initialization
* Fix typos
* Address comment and explain the two computation modes of ModelProto.training_info
* Fix typo and explain default behavior
* Update onnx/checker.cc
Co-Authored-By: Jonny Shipton <tmvector@gmail.com>
* Address comments
* Make normalization_binding a repeated field
* Add GraphCall operator
* Polish GraphCall
* GraphCall now uses position to map inputs and outputs
* Address comments:
1. Clarify GraphCall's semantic.
2. Implicitly force trainable tensors to be inference graph's inputs.
3. Training operators cannot be called in the inference graph.
* Add accidently removed changes back
* Use protobuf lite
* Polish the helper script
* Fix windows build and polish helper script
* Fix linux and mac builds
* One more line
* fix the attribute types section in IR.md (onnx#2590)
* fix the attribute types section in IR.md
* update per comments.
* Some changes around the behavior of optional inference inputs.
1. Use pass-by-value for optional inference inputs.
2. Due to the semantic of GraphCall, we implicitly force trainable
inputs to be added into inference graph's input list.
Revise docs
* Update spec per WG discussion
* update_binding is optional now because user might only want to store initialization
* Polish doc
* Address comments. Polish words.
* Use an alternative field to declare global variables.
In yesterday's Operator SIG meeting, we agree to still
put global variables in the inference graph and add a
model-level field to indicate global variables. This way
we can have smaller impact to the inference engines, because
they don't need to move trainable tensors to a new field.
* polish docs
* Allow training initializers to be promoted to global & mutable variables
* Merge the functions of global_mutable_initializer_names into update_binding
* Polish docs
* Remove restriction on using ai.onnx.training in the inference graph
* Split training register from ai.onnx register file
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
Co-authored-by: Ke Zhang <kezhan@microsoft.com>
This PR includes all necessary changes for enabling training. It currently includes #2013, #1955, #1970, #1959, #2168, #1939. Per the Training WG's discussion, we want to put everything into a single place for convenience and to have a global view of our status. Please see #2038 and those sub-PRs' messages for details. To speed up the code review process and properly distribute the work, this PR now only contains content from #2517 by @tbennun, #2013, and #2168. They are the minimal requirement to express a training graph. Losses and optimizers will be added separately.
This PR introduces several major changes.
1. Add a protobuf message, TrainingInfoProto, originally designed in ONNX Training Proposal #2013, to store training information.
2. In TrainingInfoProto, the user can store the training algorithm in the algorithm field as a GraphProto.
3. The user can also store an initialization algorithm for resetting the model in TrainingInfoProto.initialization (proposed by @tbennun in Graph initializers for tensors #2517 and agreed by the Training WG).
4. ModelProto.graph is callable inside TrainingInfoProto.algorithm. ModelProto.graph.initializer are visible to nodes in TrainingInfoProto.algorithm.node.
5. This PR also introduces a Gradient operator to differentiate a function represented by a (sub-)graph. This idea is from [WIP][Training] Draft of Gradient Operator #2168. The domain of Gradient is ai.onnx.training.
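As a rough illustration of what the Gradient operator computes (not the operator's actual API or attribute names), differentiating a scalar target with respect to a trainable parameter can be checked numerically:

```python
def numeric_grad(f, x, eps=1e-6):
    """Central finite difference: a numerical stand-in for what a
    Gradient node produces for one scalar input of a (sub-)graph."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

# y = (x * w) ** 2; take the gradient w.r.t. the trainable parameter w,
# analogous to differentiating a small subgraph w.r.t. one initializer.
x, w = 3.0, 2.0
y = lambda w_: (x * w_) ** 2
analytic = 2.0 * (x * w) * x  # dy/dw = 2 * x^2 * w
print(abs(numeric_grad(y, w) - analytic) < 1e-3)  # True
```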