Graph initializers for tensors #2517
Conversation
It looks like this might also be related to the proposal to extend the IR to support training: please see #2314 ... training needs more support, since it also needs to describe how to update the weights after processing a batch.
@linkerzhang Thanks! Posted more details on Gitter. @gramalingam Yes, this PR is related to #2314 but currently orthogonal to it, as the initializers there are still defined as Tensors.
Comments based on chatting in the Gitter room - https://gitter.im/onnx/Infra:
One more question: is it necessary to have one graph (with one output) mapping to one weight? How about just using one graph to cover all weights?
There is no functional difference between having one graph to initialize all tensors and having a graph per tensor, but there are two main reasons why we thought it was best to use a graph per tensor:
TMVector left a comment:
I'm not sure where the best place to specify it is, but IMO there should be an explicit prescription of the evaluation semantics. I don't think ONNX by itself has a notion of sessions (perhaps that changes with training?), but operators like RandomUniform imply different implementations could behave differently.
The obvious behaviour would seem to be evaluation once per "session", since once-per-evaluation can be implemented by just putting your nodes in the graph instead of an initializer, and once-forever is implemented by using a constant tensor.
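For reference, a minimal sketch (using onnx.helper, with made-up names and shapes) of the two existing endpoints mentioned here: a constant initializer that is fixed once and forever, versus a node in the graph that is re-evaluated on every run.

```python
import numpy as np
from onnx import TensorProto, helper, numpy_helper

# "Once forever": the value is a constant initializer baked into the model file.
w_const = numpy_helper.from_array(np.full((4, 4), 0.1, dtype=np.float32), name="W")
g_const = helper.make_graph(
    nodes=[helper.make_node("Relu", ["W"], ["Y"])],
    name="constant_weight",
    inputs=[],
    outputs=[helper.make_tensor_value_info("Y", TensorProto.FLOAT, [4, 4])],
    initializer=[w_const],
)

# "Once per evaluation": the value is produced by a node, so it is recomputed
# (and, for RandomUniform, resampled) on every inference call.
g_node = helper.make_graph(
    nodes=[
        helper.make_node("RandomUniform", [], ["W"], shape=[4, 4], dtype=TensorProto.FLOAT),
        helper.make_node("Relu", ["W"], ["Y"]),
    ],
    name="per_evaluation_weight",
    inputs=[],
    outputs=[helper.make_tensor_value_info("Y", TensorProto.FLOAT, [4, 4])],
)
# The proposed graph initializers would sit in between: evaluated once per "session".
```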
| // A list of named graphs, used to specify how model inputs should be
| // initialized. Each GraphProto must have a name (distinct from any other
| // initializer) and one output, which is of the same type and shape as the
| // corresponding model input.
How about this?
Initializers (see above) described by graphs. Each GraphProto must have a name (which acts as the initializer name), zero inputs, and one output (which acts as the initializer tensor).
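As a hedged illustration of that wording (names and shapes are made up, and the `graph_initializer` field itself is only proposed here): a conforming GraphProto whose name is the initializer name, with zero inputs and exactly one output that plays the role of the initializer tensor.

```python
from onnx import TensorProto, helper

# Hypothetical graph initializer producing a [0, 1, ..., 511] position table.
# Name = initializer name, zero inputs, exactly one output.
position_ids = helper.make_graph(
    nodes=[helper.make_node("Range", ["start", "limit", "delta"], ["position_ids"])],
    name="position_ids",
    inputs=[],
    outputs=[helper.make_tensor_value_info("position_ids", TensorProto.INT64, [512])],
    initializer=[
        helper.make_tensor("start", TensorProto.INT64, [], [0]),
        helper.make_tensor("limit", TensorProto.INT64, [], [512]),
        helper.make_tensor("delta", TensorProto.INT64, [], [1]),
    ],
)
```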
I think the type/shape constraint is implied by the spec already, but if you think it is worth being explicit, then IMO it should be on the initializer field since it applies to all initializers.
I agree, this is why I didn't mention the type/shape constraint in the first iteration. The zero inputs requirement is also important. Thanks!
Co-Authored-By: Jonny Shipton <tmvector@gmail.com>
| node|Node[]|A list of nodes, forming a partially ordered computation graph based on input/output data dependencies. |
| initializer|Tensor[]|A list of named tensor values. When an initializer has the same name as a graph input, it specifies a default value for that input. When an initializer has a name different from all graph inputs, it specifies a constant value. |
| sparse_initializer|SparseTensor[]|A list of initializer tensor values that are represented by sparse tensors. |
| graph_initializer|Graph[]|A list of initializers that are represented by graphs, which describe how to compute the initializer tensor. Each initializer graph has zero inputs and one output, which acts as the initializer tensor. The initializer graph name acts as the initializer name. |
I'm not quite sure about the reasons behind graph_initializers. If an initializer is input-dependent, it should be computed from the input in the main computation graph. If an initializer is input-independent, it sounds like a constant and the current initializer support is sufficient. Are there cases where we must compute initializers from separate graphs? Or is this new field mainly for convenience?
The ONNX training spec will support some ways to compute the initializers and will possibly have similar semantics here. It's also strange if we have a main graph, a training graph, and an initialization graph --- should we execute the initialization graph each time we execute the training graph? If not, when should the initialization graph be computed? These two questions make me feel initialization is a part of the training graph, rather than a part of the inference-oriented graph.
I see. I created the PR after we saw that the training spec does not include an initialization graph. Training is the main purpose for this change.
IMO there may still be other reasons to keep the initialization separate from training. For instance, if for inference some weights are represented by large tensors that can be computed simply, it would be better to initialize those tensors once. Currently, if they are added as constant initializers, the file size would be large. If they are added as nodes to the graph, they will be recomputed every time, which might not be efficient.
It would probably be fine if this change will only be applied in TrainingInfoProto. What do you think?
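A hedged sketch of the kind of case meant here (shapes and op choices are illustrative): a large identity/mask matrix that is cheap to compute but expensive to store.

```python
import numpy as np
from onnx import TensorProto, helper, numpy_helper

n = 4096

# As a constant initializer, the value costs n*n*4 bytes (~67 MB) in the model file.
dense_eye = numpy_helper.from_array(np.eye(n, dtype=np.float32), name="eye")

# As a zero-input graph, the same value is a few hundred bytes of proto and,
# under this proposal, would be computed once per session (instead of once per
# inference call if the nodes were placed in the main graph).
eye_graph = helper.make_graph(
    nodes=[
        helper.make_node("ConstantOfShape", ["shape"], ["zeros"]),
        helper.make_node("EyeLike", ["zeros"], ["eye"]),
    ],
    name="eye",
    inputs=[],
    outputs=[helper.make_tensor_value_info("eye", TensorProto.FLOAT, [n, n])],
    initializer=[helper.make_tensor("shape", TensorProto.INT64, [2], [n, n])],
)
```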
If we want to use a graph to compute a large constant tensor, we can specify that logic in the main inference graph (that is, we insert nodes into the inference graph). Is it necessary to define another field for that logic?
For combining this proposal and training, we need to first list out the scenarios we want to support. For example, if we only want to express initialization + training in the training graph, we can merge initialization and training into one single graph and extra fields are not necessary.
If we put a layer's weight into graph_initializer, it means we don't necessarily have a placeholder for the tensor, which can cause some problems when the training algorithm updates its value. For example, if I define W as a graph initializer, where should I save its value produced by the training algorithm?
The signature of graph_initializer looks fine. We just need to refine its semantics. It should be an optional step to update existing initializers rather than a way to define new initializers.
| repeated SparseTensorProto sparse_initializer = 15;
|
| // A list of named graphs, used to specify how model inputs should be
| // initialized. Each GraphProto must have a name (distinct from any other
Suggested change:
| // initialized. Each GraphProto must have a name (distinct from any other
| // initialized. Each GraphProto must have a name (the same to one
As mentioned above, graph initializers can't be used to declare initializers. They can only specify how existing initializers should be computed.
| // initialized. Each GraphProto must have a name (distinct from any other
| // initializer) and one output, which is of the same type and shape as the
| // corresponding model input.
| repeated GraphProto graph_initializer = 16;
Should we put the entire initialization stage into a single graph? I feel it's cleaner than having a list of graphs.
[Update]
Now, I think we should make it a list. In the training proposal, we have TrainingInfoProto as a repeated field. I think for each TrainingInfoProto, we should have one initialization graph.
With more thought about supporting training (plus the training proposal #2314), I'd suggest,
Thoughts?
@linkerzhang I agree with everything. We also had a discussion in the training WG that supports this decision. As @wschin said, adding graph initializers to [...]. I will now close this PR and create one targeted at the branch in #2314, so that it will appear as part of that PR. I will incorporate notes (2) and (3) from the message above. Thank you all for the important feedback!
* ONNX Training proposal.
Major changes:
1. Add a protobuf message, `TrainingInfoProto` originally designed in
#2013, to store training information.
2. In `TrainingInfoProto`, the user can store training algorithm in
`algorithm` field as a `GraphProto`.
3. The user can also store initialization algorithm for resetting the
model in `TrainingInfoProto.initialization` (proposed by @tbennun in
#2517 and agreed by Training WG).
4. `ModelProto.graph` is callable inside `TrainingInfoProto.algorithm`.
`ModelProto.graph.initializer` are visible to nodes in
`TrainingInfoProto.algorithm.node`.
5. This PR also introduces a `Gradient` operator to differentiate a
function represented by a (sub-)graph. This idea is from #2168.
Contribution list:
Baihan Huang: spec design.
Tal Ben-Nun: model initialization design.
Wei-Sheng Chin: spec design, Gradient operator design.
Jonny Shipton and active WG members and participants: many valuable comments and reviews.
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
* Address comments
* Address a comment
* Move Gradient to ai.onnx.training
Update Gradient test models
* Address comments
1. Create initialization_binding instead of
using update_binding for initialization.
2. Swap key and value in update_binding.
3. Refine documents accordingly.
* Clarify semantics of algorithm and initialization
* Fix typos
* Address comment and explain the two computation modes of ModelProto.training_info
* Fix typo and explain default behavior
* Update onnx/checker.cc
Co-Authored-By: Jonny Shipton <tmvector@gmail.com>
* Address comments
* Make normalization_binding a repeated field
* Add GraphCall operator
* Polish GraphCall
* GraphCall now uses position to map inputs and outputs
* Address comments:
1. Clarify GraphCall's semantics.
2. Implicitly force trainable tensors to be inference graph's inputs.
3. Training operators cannot be called in the inference graph.
* Add accidentally removed changes back
* Use protobuf lite
* Polish the helper script
* Fix windows build and polish helper script
* Fix linux and mac builds
* One more line
* fix the attribute types section in IR.md (#2590)
* fix the attribute types section in IR.md
* update per comments.
* Some changes around the behavior of optional inference inputs.
1. Use pass-by-value for optional inference inputs.
2. Due to the semantics of GraphCall, we implicitly force trainable
inputs to be added into the inference graph's input list.
Revise docs
* Update spec per WG discussion
* update_binding is optional now because the user might only want to store initialization
* Polish doc
* Address comments. Polish words.
* Use an alternative field to declare global variables.
In yesterday's Operator SIG meeting, we agreed to still
put global variables in the inference graph and add a
model-level field to indicate global variables. This way
we have a smaller impact on inference engines, because
they don't need to move trainable tensors to a new field.
* polish docs
* Allow training initializers to be promoted to global & mutable variables
* Merge the functions of global_mutable_initializer_names into update_binding
* Polish docs
* Remove restriction on using ai.onnx.training in the inference graph
* Split training register from ai.onnx register file
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
Co-authored-by: Ke Zhang <kezhan@microsoft.com>
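For reference, the shape this initialization support eventually took in the training proposal: a TrainingInfoProto carries a zero-input `initialization` graph plus an `initialization_binding` that maps its outputs onto existing initializers. A hedged sketch follows (tensor names are made up, and the key/value direction of the binding is my reading; the authoritative convention is in onnx.proto):

```python
import onnx
from onnx import TensorProto, helper

# Zero-input graph that (re)computes the starting value of a trainable tensor.
init_graph = helper.make_graph(
    nodes=[helper.make_node("RandomNormal", [], ["W_initial"],
                            shape=[16, 16], dtype=TensorProto.FLOAT, seed=0.0)],
    name="initialization",
    inputs=[],
    outputs=[helper.make_tensor_value_info("W_initial", TensorProto.FLOAT, [16, 16])],
)

info = onnx.TrainingInfoProto()
info.initialization.CopyFrom(init_graph)

# Bind the initialization graph's output "W_initial" to the existing
# initializer "W" in ModelProto.graph, so that running `initialization`
# resets that tensor (assumed: key = initializer name, value = graph output).
binding = info.initialization_binding.add()
binding.key, binding.value = "W", "W_initial"

# model.training_info.append(info)   # `model` would be an existing ModelProto
```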
ONNX now supports dense and sparse tensors as initializers. This PR adds missing IR documentation for sparse tensor initializers, and adds an initializer type that is based on a computational graph.
The reasoning behind the distinction between an initializer graph and simply adding nodes to the ONNX model is to support the strict separation between how a model is computed and the way to initialize (some of) its tensors. For example, pre-trained models can use tensor initializers to store weights, whereas untrained models can provide schemes for the initialization of weights (e.g., a RandomNormal operator followed by offset and scaling).
Our use case for this is reproducible training workloads (such as in MLPerf). Instead of defining a reference model as a PyTorch/TensorFlow script, with this PR one could simply upload an .onnx file that competitors can use. Beyond reducing the file size from hundreds of megabytes to something that can be pushed to git, the initializers also act as "recipes" for what the fair starting conditions of training should be (as that differs from model to model).
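To make that example concrete, here is a hedged sketch of such an initialization recipe (op choices, names, and constants are illustrative, and the result is deterministic only if the runtime honors the seed attribute): a RandomNormal draw followed by scaling and an offset.

```python
from onnx import TensorProto, helper

# A self-contained "recipe" for one weight: W = N(0, 1) * scale + offset.
# Shipping this graph instead of the realized tensor keeps the file tiny and
# documents the intended starting conditions for training.
recipe = helper.make_graph(
    nodes=[
        helper.make_node("RandomNormal", [], ["noise"],
                         shape=[1024, 1024], dtype=TensorProto.FLOAT, seed=42.0),
        helper.make_node("Mul", ["noise", "scale"], ["scaled"]),
        helper.make_node("Add", ["scaled", "offset"], ["W"]),
    ],
    name="W",
    inputs=[],
    outputs=[helper.make_tensor_value_info("W", TensorProto.FLOAT, [1024, 1024])],
    initializer=[
        helper.make_tensor("scale", TensorProto.FLOAT, [], [0.02]),
        helper.make_tensor("offset", TensorProto.FLOAT, [], [0.0]),
    ],
)
# Roughly a few hundred bytes of proto, versus ~4 MB for the realized
# 1024x1024 float32 tensor stored as a constant initializer.
```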