Graph initializers for tensors #2517
Conversation
It looks like this might also be related to the proposal to extend the IR to support training: please see #2314 ... training needs more support, since it also needs to describe how to update the weights after processing a batch.
@linkerzhang Thanks! Posted more details on Gitter. @gramalingam Yes, this PR is related to #2314 but currently orthogonal to it, as the initializers there are still defined as Tensors.
Comments based on chatting in the Gitter room - https://gitter.im/onnx/Infra:
One more question: is it necessary to have one graph (with one output) mapping to one weight? How about just using one graph to cover all weights?
There is no functional difference between having one graph to initialize all tensors and having a graph per tensor, but there are two main reasons why we thought it was best to use a graph per tensor:
TMVector left a comment:
I'm not sure where the best place to specify it is, but IMO there should be an explicit prescription of the evaluation semantics. I don't think ONNX by itself has a notion of sessions (perhaps that changes with training?), but operators like RandomUniform imply different implementations could behave differently.
The obvious behaviour would seem to be evaluation once per "session", since once-per-evaluation can be implemented by just putting your nodes in the graph instead of an initializer, and once-forever is implemented by using a constant tensor.
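For reference, a minimal sketch (using onnx.helper, with made-up names and shapes) of the two existing endpoints mentioned here: a constant initializer that is fixed once and forever, versus a node in the graph that is re-evaluated on every run.

```python
import numpy as np
from onnx import TensorProto, helper, numpy_helper

# "Once forever": the value is a constant initializer baked into the model file.
w_const = numpy_helper.from_array(np.full((4, 4), 0.1, dtype=np.float32), name="W")
g_const = helper.make_graph(
    nodes=[helper.make_node("Relu", ["W"], ["Y"])],
    name="constant_weight",
    inputs=[],
    outputs=[helper.make_tensor_value_info("Y", TensorProto.FLOAT, [4, 4])],
    initializer=[w_const],
)

# "Once per evaluation": the value is produced by a node, so it is recomputed
# (and, for RandomUniform, resampled) on every inference call.
g_node = helper.make_graph(
    nodes=[
        helper.make_node("RandomUniform", [], ["W"], shape=[4, 4], dtype=TensorProto.FLOAT),
        helper.make_node("Relu", ["W"], ["Y"]),
    ],
    name="per_evaluation_weight",
    inputs=[],
    outputs=[helper.make_tensor_value_info("Y", TensorProto.FLOAT, [4, 4])],
)
# The proposed graph initializers would sit in between: evaluated once per "session".
```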
| // A list of named graphs, used to specify how model inputs should be
| // initialized. Each GraphProto must have a name (distinct from any other
| // initializer) and one output, which is of the same type and shape as the
| // corresponding model input.
How about this?
Initializers (see above) described by graphs. Each GraphProto must have a name (which acts as the initializer name), zero inputs, and one output (which acts as the initializer tensor).
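As a hedged illustration of that wording (names and shapes are made up, and the `graph_initializer` field itself is only proposed here): a conforming GraphProto whose name is the initializer name, with zero inputs and exactly one output that plays the role of the initializer tensor.

```python
from onnx import TensorProto, helper

# Hypothetical graph initializer producing a [0, 1, ..., 511] position table.
# Name = initializer name, zero inputs, exactly one output.
position_ids = helper.make_graph(
    nodes=[helper.make_node("Range", ["start", "limit", "delta"], ["position_ids"])],
    name="position_ids",
    inputs=[],
    outputs=[helper.make_tensor_value_info("position_ids", TensorProto.INT64, [512])],
    initializer=[
        helper.make_tensor("start", TensorProto.INT64, [], [0]),
        helper.make_tensor("limit", TensorProto.INT64, [], [512]),
        helper.make_tensor("delta", TensorProto.INT64, [], [1]),
    ],
)
```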
I think the type/shape constraint is implied by the spec already, but if you think it is worth being explicit, then IMO it should be on the initializer field since it applies to all initializers.
I agree, this is why I didn't mention the type/shape constraint in the first iteration. The zero inputs requirement is also important. Thanks!
Co-Authored-By: Jonny Shipton <tmvector@gmail.com>
| node|Node[]|A list of nodes, forming a partially ordered computation graph based on input/output data dependencies. |
| initializer|Tensor[]|A list of named tensor values. When an initializer has the same name as a graph input, it specifies a default value for that input. When an initializer has a name different from all graph inputs, it specifies a constant value. |
| sparse_initializer|SparseTensor[]|A list of initializer tensor values that are represented by sparse tensors. |
| graph_initializer|Graph[]|A list of initializers that are represented by graphs, which describe how to compute the initializer tensor. Each initializer graph has zero inputs and one output, which acts as the initializer tensor. The initializer graph name acts as the initializer name. |
I'm not quite sure about the reasons behind graph_initializers. If an initializer is input-dependent, it should be computed from the input in the main computation graph. If an initializer is input-independent, it sounds like a constant and the current initializer support is sufficient. Are there cases where we must compute initializers from separate graphs? Or is this new field mainly for convenience?
The ONNX training spec will support some ways to compute the initializers and will possibly have similar semantics here. It's also strange if we have a main graph, a training graph, and an initialization graph --- should we execute the initialization graph each time we execute the training graph? If not, when should the initialization graph be computed? These two questions make me feel initialization is a part of the training graph, rather than a part of the inference-oriented graph.
I see. I created the PR after we saw that the training spec does not include an initialization graph. Training is the main purpose for this change.
IMO there may still be other reasons to keep the initialization separate from training. For instance, if for inference some weights are represented by large tensors that can be computed simply, it would be better to initialize those tensors once. Currently, if they are added as constant initializers, the file size would be large. If they are added as nodes to the graph, they will be recomputed every time, which might not be efficient.
It would probably be fine if this change will only be applied in TrainingInfoProto. What do you think?
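A hedged sketch of the kind of case meant here (shapes and op choices are illustrative): a large identity/mask matrix that is cheap to compute but expensive to store.

```python
import numpy as np
from onnx import TensorProto, helper, numpy_helper

n = 4096

# As a constant initializer, the value costs n*n*4 bytes (~67 MB) in the model file.
dense_eye = numpy_helper.from_array(np.eye(n, dtype=np.float32), name="eye")

# As a zero-input graph, the same value is a few hundred bytes of proto and,
# under this proposal, would be computed once per session (instead of once per
# inference call if the nodes were placed in the main graph).
eye_graph = helper.make_graph(
    nodes=[
        helper.make_node("ConstantOfShape", ["shape"], ["zeros"]),
        helper.make_node("EyeLike", ["zeros"], ["eye"]),
    ],
    name="eye",
    inputs=[],
    outputs=[helper.make_tensor_value_info("eye", TensorProto.FLOAT, [n, n])],
    initializer=[helper.make_tensor("shape", TensorProto.INT64, [2], [n, n])],
)
```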
If we want to use a graph to compute a large constant tensor, we can specify that logic in the main inference graph (that is, we insert nodes into the inference graph). Is it necessary to define another field for that logic?
For combining this proposal and training, we need to first list out the scenarios we want to support. For example, if we only want to express initialization + training in the training graph, we can merge initialization and training into one single graph and extra fields are not necessary.
If we put a layer's weight into graph_initializer, it means we don't necessarily have a placeholder for the tensor, which can cause some problems when the training algorithm updates its value. For example, if I define W as a graph initializer, where should I save its value produced by the training algorithm?
The signature of graph_initializer looks fine. We just need to refine its semantics. It should be an optional step to update existing initializers rather than a way to define new initializers.
| repeated SparseTensorProto sparse_initializer = 15;
|
| // A list of named graphs, used to specify how model inputs should be
| // initialized. Each GraphProto must have a name (distinct from any other
Suggested change:
| // initialized. Each GraphProto must have a name (distinct from any other
| // initialized. Each GraphProto must have a name (the same to one
As mentioned above, graph initializers can't be used to declare initializers. They can only specify how existing initializers should be computed.
| // initialized. Each GraphProto must have a name (distinct from any other
| // initializer) and one output, which is of the same type and shape as the
| // corresponding model input.
| repeated GraphProto graph_initializer = 16;
Should we put the entire initialization stage into a single graph? I feel it's cleaner than having a list of graphs.
[Update]
Now, I think we should make it a list. In the training proposal, we have TrainingInfoProto as a repeated field. I think for each TrainingInfoProto, we should have one initialization graph.
With more thought about supporting training (plus the training proposal #2314), I'd suggest,
Thoughts?
@linkerzhang I agree with everything. We also had a discussion in the training WG that supports this decision. As @wschin said, adding graph initializers to [...]. I will now close this PR and create one targeted at the branch in #2314, so that it will appear as part of that PR. I will incorporate notes (2) and (3) from the message above. Thank you all for the important feedback!
* ONNX Training proposal.
Major changes:
1. Add a protobuf message, `TrainingInfoProto` originally designed in
#2013, to store training information.
2. In `TrainingInfoProto`, the user can store training algorithm in
`algorithm` field as a `GraphProto`.
3. The user can also store initialization algorithm for resetting the
model in `TrainingInfoProto.initialization` (proposed by @tbennun in
#2517 and agreed by Training WG).
4. `ModelProto.graph` is callable inside `TrainingInfoProto.algorithm`.
`ModelProto.graph.initializer` are visible to nodes in
`TrainingInfoProto.algorithm.node`.
5. This PR also introduces a `Gradient` operator to differentiate a
function represented by a (sub-)graph. This idea is from #2168.
Contribution list:
Baihan Huang: spec design.
Tal Ben-Nun: model initialization design.
Wei-Sheng Chin: spec design, Gradient operator design.
Jonny Shipton and active WG members and participants: many valuable comments and reviews.
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
* Address comments
* Address a comment
* Move Gradient to ai.onnx.training
Update Gradient test models
* Address comments
1. Create initialization_binding instead of
using update_binding for initialization.
2. Swap key and value in update_binding.
3. Refine documents accordingly.
* Clarify semantics of algorithm and initialization
* Fix typos
* Address comment and explain the two computation modes of ModelProto.training_info
* Fix typo and explain default behavior
* Update onnx/checker.cc
Co-Authored-By: Jonny Shipton <tmvector@gmail.com>
* Address comments
* Make normalization_binding a repeated field
* Add GraphCall operator
* Polish GraphCall
* GraphCall now uses position to map inputs and outputs
* Address comments:
1. Clarify GraphCall's semantics.
2. Implicitly force trainable tensors to be inference graph's inputs.
3. Training operators cannot be called in the inference graph.
* Add accidentally removed changes back
* Use protobuf lite
* Polish the helper script
* Fix windows build and polish helper script
* Fix linux and mac builds
* One more line
* fix the attribute types section in IR.md (#2590)
* fix the attribute types section in IR.md
* update per comments.
* Some changes around the behavior of optional inference inputs.
1. Use pass-by-value for optional inference inputs.
2. Due to the semantics of GraphCall, we implicitly force trainable
inputs to be added into the inference graph's input list.
Revise docs
* Update spec per WG discussion
* update_binding is optional now because the user might only want to store initialization
* Polish doc
* Address comments. Polish words.
* Use an alternative field to declare global variables.
In yesterday's Operator SIG meeting, we agreed to still
put global variables in the inference graph and add a
model-level field to indicate global variables. This way
we have a smaller impact on inference engines, because
they don't need to move trainable tensors to a new field.
* polish docs
* Allow training initializers to be promoted to global & mutable variables
* Merge the functions of global_mutable_initializer_names into update_binding
* Polish docs
* Remove restriction on using ai.onnx.training in the inference graph
* Split training register from ai.onnx register file
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
Co-authored-by: Ke Zhang <kezhan@microsoft.com>
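For reference, the shape this initialization support eventually took in the training proposal: a TrainingInfoProto carries a zero-input `initialization` graph plus an `initialization_binding` that maps its outputs onto existing initializers. A hedged sketch follows (tensor names are made up, and the key/value direction of the binding is my reading; the authoritative convention is in onnx.proto):

```python
import onnx
from onnx import TensorProto, helper

# Zero-input graph that (re)computes the starting value of a trainable tensor.
init_graph = helper.make_graph(
    nodes=[helper.make_node("RandomNormal", [], ["W_initial"],
                            shape=[16, 16], dtype=TensorProto.FLOAT, seed=0.0)],
    name="initialization",
    inputs=[],
    outputs=[helper.make_tensor_value_info("W_initial", TensorProto.FLOAT, [16, 16])],
)

info = onnx.TrainingInfoProto()
info.initialization.CopyFrom(init_graph)

# Bind the initialization graph's output "W_initial" to the existing
# initializer "W" in ModelProto.graph, so that running `initialization`
# resets that tensor (assumed: key = initializer name, value = graph output).
binding = info.initialization_binding.add()
binding.key, binding.value = "W", "W_initial"

# model.training_info.append(info)   # `model` would be an existing ModelProto
```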
ONNX now supports dense and sparse tensors as initializers. This PR adds missing IR documentation for sparse tensor initializers, and adds an initializer type that is based on a computational graph.
The reasoning behind the distinction between an initializer graph and simply adding nodes to the ONNX model is to support the strict separation between how a model is computed and the way to initialize (some of) its tensors. For example, pre-trained models can use tensor initializers to store weights, whereas untrained models can provide schemes for the initialization of weights (e.g., a RandomNormal operator followed by offset and scaling).
Our use case for this is reproducible training workloads (such as in MLPerf). Instead of defining a reference model as a PyTorch/TensorFlow script, with this PR one could simply upload an .onnx file that competitors can use. Beyond reducing the file size from hundreds of megabytes to something that can be pushed to git, the initializers also act as "recipes" for what the fair starting conditions of training should be (as that differs from model to model).
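To make that example concrete, here is a hedged sketch of such an initialization recipe (op choices, names, and constants are illustrative, and the result is deterministic only if the runtime honors the seed attribute): a RandomNormal draw followed by scaling and an offset.

```python
from onnx import TensorProto, helper

# A self-contained "recipe" for one weight: W = N(0, 1) * scale + offset.
# Shipping this graph instead of the realized tensor keeps the file tiny and
# documents the intended starting conditions for training.
recipe = helper.make_graph(
    nodes=[
        helper.make_node("RandomNormal", [], ["noise"],
                         shape=[1024, 1024], dtype=TensorProto.FLOAT, seed=42.0),
        helper.make_node("Mul", ["noise", "scale"], ["scaled"]),
        helper.make_node("Add", ["scaled", "offset"], ["W"]),
    ],
    name="W",
    inputs=[],
    outputs=[helper.make_tensor_value_info("W", TensorProto.FLOAT, [1024, 1024])],
    initializer=[
        helper.make_tensor("scale", TensorProto.FLOAT, [], [0.02]),
        helper.make_tensor("offset", TensorProto.FLOAT, [], [0.0]),
    ],
)
# Roughly a few hundred bytes of proto, versus ~4 MB for the realized
# 1024x1024 float32 tensor stored as a constant initializer.
```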