
Graph initializers for tensors #2517

Closed
tbennun wants to merge 7 commits into onnx:master from deep500:graph_initializer

Conversation

@tbennun
Contributor

@tbennun tbennun commented Dec 18, 2019

ONNX now supports dense and sparse tensors as initializers. This PR adds missing IR documentation for sparse tensor initializers, and adds an initializer type that is based on a computational graph.

The reasoning behind distinguishing an initializer graph from simply adding nodes to the ONNX model is to keep a strict separation between how a model is computed and how (some of) its tensors are initialized. For example, pre-trained models can use tensor initializers to store weights, whereas untrained models can provide schemes for initializing the weights (e.g., a RandomNormal operator followed by offset and scaling).

Our use case for this is reproducible training workloads (such as in MLPerf). Instead of defining a reference model as a PyTorch/TensorFlow script, with this PR one could simply upload an .onnx file that competitors can use. Beyond reducing the file size from hundreds of megabytes to something that can be pushed to git, the initializers also act as "recipes" for what the fair starting conditions of training should be (as these differ from model to model).
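As a rough illustration of the proposal (not part of the PR text itself), a weight-initialization "recipe" of the RandomNormal-plus-scale-and-offset kind could be expressed as a zero-input graph built with `onnx.helper`. The tensor names and shapes below are made up for the example, and `graph_initializer` is the field this PR proposes, so it does not exist in released ONNX:

```python
import onnx
from onnx import TensorProto, helper

# Hypothetical "recipe" for a weight W: draw from N(0, 1), scale, then offset.
sample = helper.make_node("RandomNormal", inputs=[], outputs=["sample"],
                          dtype=TensorProto.FLOAT, shape=[256, 256], seed=1.0)
scaled = helper.make_node("Mul", inputs=["sample", "scale_const"], outputs=["scaled"])
shifted = helper.make_node("Add", inputs=["scaled", "offset_const"], outputs=["W"])

init_graph = helper.make_graph(
    nodes=[sample, scaled, shifted],
    name="W",   # under this proposal the graph name doubles as the initializer name
    inputs=[],  # initializer graphs take no inputs
    outputs=[helper.make_tensor_value_info("W", TensorProto.FLOAT, [256, 256])],
    initializer=[
        helper.make_tensor("scale_const", TensorProto.FLOAT, [], [0.01]),
        helper.make_tensor("offset_const", TensorProto.FLOAT, [], [0.0]),
    ],
)

# The proposed (not yet existing) field would then carry this graph:
# model.graph.graph_initializer.extend([init_graph])
```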

@tbennun tbennun requested a review from a team as a code owner December 18, 2019 22:58
Comment thread onnx/onnx.in.proto
@gramalingam
Contributor

It looks like this might also be related to the proposal to extend the IR to support training (please see #2314). Training needs more support, since it also needs to describe how to update the weights after processing a batch.

@tbennun
Contributor Author

tbennun commented Dec 20, 2019

@linkerzhang Thanks! Posted more details on Gitter.

@gramalingam Yes, this PR is related to #2314 but currently orthogonal to it, as the initializers there are still defined as Tensors.

@linkerzhang
Member

@wschin

@linkerzhang
Member

Comments based on the discussion in the Gitter room (https://gitter.im/onnx/Infra):

  1. Both IR.md and the *.proto files should be commented, clarifying that the 1:1 name mapping still holds between the single output of each of these graphs and the initializers (weights).
  2. Comments from the training working group folks, if any: @wschin @SherlockNoMad @chinhuang007

One more question: is it necessary to have one graph (with one output) mapping to one weight? How about just using one graph to cover all weights?

@tbennun
Contributor Author

tbennun commented Jan 2, 2020

There is no functional difference between having one graph to initialize all tensors and having a graph per tensor, but there are two main reasons why we thought it was best to use a graph per tensor:

  • In both cases, in order to create different weights for each tensor, you would have to replicate the graph nodes.
  • Using the name field of a GraphProto is more straightforward to implement (for both saving and loading) than iterating over the nodes of a single initializer graph and updating the output names. It is also closer to how tensor and sparse tensor initializers are defined now, so the implementations would remain similar (see the sketch below).
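To make the saving/loading point concrete, here is a minimal sketch of how a loader could resolve per-tensor initializer graphs by name. The `graph_initializer` field is the one proposed in this PR, and `evaluate_graph` is a hypothetical callback standing in for whatever runtime evaluates a zero-input GraphProto:

```python
def resolve_graph_initializers(graph_proto, evaluate_graph):
    """Return {initializer name: tensor value} for the proposed graph initializers.

    `graph_proto.graph_initializer` is the repeated GraphProto field proposed
    here; `evaluate_graph` is a hypothetical helper that runs a zero-input
    GraphProto and returns the value of its single output.
    """
    values = {}
    for init_graph in graph_proto.graph_initializer:  # proposed field
        # The graph's name field doubles as the name of the tensor it initializes,
        # mirroring how (sparse_)initializer entries are matched by name today.
        values[init_graph.name] = evaluate_graph(init_graph)
    return values
```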

Contributor

@TMVector TMVector left a comment


I'm not sure where the best place to specify it is, but IMO there should be an explicit prescription of the evaluation semantics. I don't think ONNX by itself has a notion of sessions (perhaps that changes with training?), but operators like RandomUniform mean that different implementations could behave differently.

The obvious behaviour would seem to be evaluation once per "session": once-per-evaluation can already be achieved by putting the nodes in the graph instead of in an initializer, and once-forever by using a constant tensor.

Comment thread docs/IR.md Outdated
Comment thread onnx/onnx.in.proto
Comment on lines +290 to +293
// A list of named graphs, used to specify how model inputs should be
// initialized. Each GraphProto must have a name (distinct from any other
// initializer) and one output, which is of the same type and shape as the
// corresponding model input.
Contributor


How about this?

Initializers (see above) described by graphs. Each GraphProto must have a name (which acts as the initializer name), zero inputs, and one output (which acts as the initializer tensor).

Contributor


I think the type/shape constraint is implied by the spec already, but if you think it is worth being explicit, then IMO it should be on the initializer field, since it applies to all initializers.

Contributor Author


I agree; this is why I didn't mention the type/shape constraint in the first iteration. The zero-inputs requirement is also important. Thanks!

tbennun and others added 2 commits January 3, 2020 12:28
Co-Authored-By: Jonny Shipton <tmvector@gmail.com>
Comment thread docs/IR.md
node|Node[]|A list of nodes, forming a partially ordered computation graph based on input/output data dependencies.
initializer|Tensor[]|A list of named tensor values. When an initializer has the same name as a graph input, it specifies a default value for that input. When an initializer has a name different from all graph inputs, it specifies a constant value.
sparse_initializer|SparseTensor[]|A list of initializer tensor values that are represented by sparse tensors.
graph_initializer|Graph[]|A list of initializers that are represented by graphs, which describe how to compute the initializer tensor. Each initializer graph has zero inputs and one output, which acts as the initializer tensor. The initializer graph name acts as the initializer name.
Collaborator

@wschin wschin Jan 3, 2020


I'm not quite sure about the reasons behind graph_initializers. If an initializer is input-dependent, it should be computed from the input in the main computation graph. If an initializer is input-independent, it sounds like a constant, and the current initializer support is sufficient. Are there cases where we must compute initializers from separate graphs, or is this new field mainly for convenience?

The ONNX training spec will support some ways to compute the initializers and may end up with similar semantics here. It's also strange if we have a main graph, a training graph, and an initialization graph: should we execute the initialization graph each time we execute the training graph? If not, when should the initialization graph be computed? These two questions make me feel that initialization is part of the training graph rather than part of the inference-oriented graph.

Contributor Author


I see. I created the PR after we saw that the training spec does not include an initialization graph. Training is the main purpose for this change.

IMO there may still be other reasons to keep initialization separate from training. For instance, if for inference some weights are large tensors that can be computed simply, it would be better to initialize those tensors once. Currently, if they are added as constant initializers, the file size becomes large; if they are added as nodes to the graph, they are recomputed on every run, which might not be efficient.

It would probably be fine if this change were only applied in TrainingInfoProto. What do you think?

Collaborator


If we want to use a graph to compute a large constant tensor, we can specify that logic in the main inference graph (that is, insert nodes into the inference graph). Is it necessary to define another field for that logic?

To combine this proposal with training, we first need to list the scenarios we want to support. For example, if we only want to express initialization + training in the training graph, we can merge initialization and training into a single graph, and no extra fields are necessary.

Member

@linkerzhang linkerzhang Jan 7, 2020


@wschin My understanding is that this PR is trying to add a field to express the initialization logic of weights in the training scenario; @tbennun, correct me if I'm wrong. This is why I think it should be aligned with the training proposal.

Contributor Author


@linkerzhang yes, that is the goal.

Collaborator

@wschin wschin Jan 7, 2020


If we put a layer's weight into graph_initializer, it means we don't necessarily have a placeholder for that tensor, which can cause problems when the training algorithm updates its value. For example, if I define W as a graph initializer, where should I save the value produced for it by the training algorithm?

The signature of graph_initializer looks fine; we just need to refine its semantics. It should be an optional step that updates existing initializers rather than one that defines new initializers.

Comment thread onnx/onnx-ml.proto
repeated SparseTensorProto sparse_initializer = 15;

// A list of named graphs, used to specify how model inputs should be
// initialized. Each GraphProto must have a name (distinct from any other
Collaborator

@wschin wschin Jan 7, 2020


Suggested change
// initialized. Each GraphProto must have a name (distinct from any other
// initialized. Each GraphProto must have a name (the same to one

As mentioned above, graph initializers can't be used to declare initializers; they can only specify how existing initializers should be computed.

Comment thread onnx/onnx-ml.proto
// initialized. Each GraphProto must have a name (distinct from any other
// initializer) and one output, which is of the same type and shape as the
// corresponding model input.
repeated GraphProto graph_initializer = 16;
Collaborator

@wschin wschin Jan 7, 2020


Should we put the entire initialization stage into a single graph? I feel that's cleaner than having a list of graphs.

[Update]
Now I think we should make it a list. In the training proposal, TrainingInfoProto is a repeated field, and I think each TrainingInfoProto should have one initialization graph.

@linkerzhang
Member

With more thought about supporting training (plus the training proposal #2314),

I'd suggest:

  1. Move this initialization field into the training proposal (TrainingInfoProto).
  2. If my understanding is correct, this initialization for training should be run only once (for training reproducibility). Given that, I'd still suggest having only one GraphProto to initialize all trainable weights.
  3. The initialization graph's outputs will have exactly the same names as the "initializers" in the inference graph. Say there is a tensor named "a" in "initializers"; if this "a" is trainable, there should be an output tensor named "a" in the initialization graph.

Thoughts?

@tbennun
Contributor Author

tbennun commented Jan 7, 2020

@linkerzhang I agree with everything. We also had a discussion in the training WG that supports this decision. As @wschin said, adding graph initializers to TrainingInfoProto has the additional benefit of being able to restart training from a pre-trained ONNX file, and it makes the behavior of saving/loading models better defined.

I will now close this PR and create one targeted at the branch in #2314, so that it will appear as part of that PR. I will incorporate notes (2) and (3) from the message above.

Thank you all for the important feedback!

@tbennun tbennun closed this Jan 7, 2020
wschin added a commit to wschin/onnx that referenced this pull request Jan 24, 2020
Major changes:
  1. Add a protobuf message, `TrainingInfoProto` originally designed in
     onnx#2013, to store training information.
  2. In `TrainingInfoProto`, the user can store training algorithm in
     `algorithm` field as a `GraphProto`.
  3. The user can also store initialization algorithm for resetting the
     model in `TrainingInfoProto.initialization` (proposed by @tbennun in
     onnx#2517 and agreed by Training WG).
  4. `ModelProto.graph` is callable inside `TrainingInfoProto.algorithm`.
     `ModelProto.graph.initializer` are visible to nodes in
     `TrainingInfoProto.algorithm.node`.
  5. This PR also introduces a `Gradient` operator to differentiate a
     function represented by a (sub-)graph. This idea is from onnx#2168.

Contribution list:
   Baihan Huang: spec design.
   Tal Ben-Nun: model initialization design.
   Wei-Sheng Chin: spec design, Gradient operator design.
   Jonny Shipton and active WG members and participants: many valuable comments and reviews.

Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
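For context on how the idea above surfaces in the merged training proposal: `TrainingInfoProto` carries the initialization graph in its `initialization` field, and `initialization_binding` ties that graph's outputs to existing initializers. The sketch below assumes a previously built zero-input `init_graph` whose output is named "W_init" and a model that already has an initializer "W"; the key/value orientation of the binding is my reading and should be double-checked against onnx.proto.

```python
import onnx

# Assumes: `model` is a ModelProto with an initializer named "W", and
# `init_graph` is a zero-input GraphProto whose output "W_init" computes
# the value that should reset "W".
training_info = onnx.TrainingInfoProto()
training_info.initialization.CopyFrom(init_graph)

# Bind the initialization graph's output to the initializer it resets.
# (Orientation of key/value here is an assumption; verify against onnx.proto.)
binding = training_info.initialization_binding.add()
binding.key = "W"         # name of the initializer in ModelProto.graph.initializer
binding.value = "W_init"  # name of an output of TrainingInfoProto.initialization

model.training_info.append(training_info)  # ModelProto.training_info is a repeated field
```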
wschin added a commit that referenced this pull request Feb 17, 2020
* ONNX Training proposal.

Major changes:
  1. Add a protobuf message, `TrainingInfoProto` originally designed in
     #2013, to store training information.
  2. In `TrainingInfoProto`, the user can store training algorithm in
     `algorithm` field as a `GraphProto`.
  3. The user can also store initialization algorithm for resetting the
     model in `TrainingInfoProto.initialization` (proposed by @tbennun in
     #2517 and agreed by Training WG).
  4. `ModelProto.graph` is callable inside `TrainingInfoProto.algorithm`.
     `ModelProto.graph.initializer` are visible to nodes in
     `TrainingInfoProto.algorithm.node`.
  5. This PR also introduces a `Gradient` operator to differentiate a
     function represented by a (sub-)graph. This idea is from #2168.

Contribution list:
   Baihan Huang: spec design.
   Tal Ben-Nun: model initialization design.
   Wei-Sheng Chin: spec design, Gradient operator design.
   Jonny Shipton and active WG members and participants: many valuable comments and reviews.

Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>

* Address comments

* Address a comment

* Move Gradient to ai.onnx.training

Update Gradient test models

* Address comments
1. Create initialization_binding instead of
   using update_binding for initialization.
2. Swap key and value in update_binding.
3. Refine documents accordingly.

* Clarify semantics of algorithm and initialization

* Fix typos

* Address comment and explain the two computation modes of  ModelProto.training_info

* Fix typo and explain default behavior

* Update onnx/checker.cc

Co-Authored-By: Jonny Shipton <tmvector@gmail.com>

* Address comments

* Make normalization_binding a repeated field

* Add GraphCall operator

* Polish GraphCall

* GraphCall now uses position to map inputs and outputs

* Address comments:
1. Clarify GraphCall's semantic.
2. Implicitly force trainable tensors to be inference graph's inputs.
3. Training operators cannot be called in the inference graph.

* Add accidentally removed changes back

* Use protobuf lite

* Polish the helper script

* Fix windows build and polish helper script

* Fix linux and mac builds

* One more line

* fix the attribute types section in IR.md (#2590)

* fix the attribute types section in IR.md

* update per comments.

* Some changes around the behavior of optional inference inputs.

1. Use pass-by-value to optional inference inputs.
2. Due to the semantic of GraphCall, we implicitly force trainable
   inputs to be added into inference graph's input list.

Revise docs

* Update spec per WG discussion

* update_binding is optional now because user might only want to store initialization

* Polish doc

* Address comments. Polish words.

* Use an alternative field to declare global variables.
In yesterday's Operator SIG meeting, we agreed to still
put global variables in the inference graph and add a
model-level field to indicate global variables. This way
we have a smaller impact on the inference engines, because
they don't need to move trainable tensors to a new field.

* polish docs

* Allow training initializers to be promoted to global & mutable variables

* Merge the functions of global_mutable_initializer_names into update_binding

* Polish docs

* Remove restriction on using ai.onnx.training in the inference graph

* Split training register from ai.onnx register file

Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
Co-authored-by: Ke Zhang <kezhan@microsoft.com>
jcwchen pushed a commit to jcwchen/onnx that referenced this pull request Sep 23, 2020