
ONNX Training Proposal#2013

Closed
wschin wants to merge 75 commits into onnx:master from wschin:training-info

Conversation

@wschin
Collaborator

@wschin wschin commented May 11, 2019

PR #2314 is a single place for reviewing the whole training story.

This proposal aims at capturing the information required to perform
stochastic gradient-based training on the inference graph.

Although training itself can be encoded as a sub-graph by adding
a single backward operator, we currently prefer a more strongly-typed
way to store that information. We also think separating the training graph
from the inference graph makes backends' lives easier.

The major change happens in the introduction of TrainingInfoProto:

// Training information
// TrainingInfoProto contains the loss function to be minimized (e.g., mean
// squared error), the optimization algorithm used (a mapping from gradients
// and the current tensors to their new values; that is, one training
// iteration), and the binding between the optimized tensors and their
// gradient tensors' names (we need to give a name (e.g., "gradient_of_X")
// to the gradient of tensor "X", so that gradient tensors can be referenced
// in the training algorithm).
//
// This design mainly targets popular stochastic gradient-based methods.
// Conceptually, a TrainingInfoProto defines a "partial computation graph"
// which will be merged into the inference graph to form a full training
// computation graph.
message TrainingInfoProto {
  // Training-specific initializers, for example, accumulated squared gradient
  // in ADAGRAD optimizer, momentum, or number of executed training iterations.
  // The values can only be inputs of "loss" and "optimizer" defined below.
  repeated TensorProto additional_initializer = 1;

  // Training-specific inputs, for example, labels in classification problems.
  // The values can only be inputs of "loss" and "optimizer" defined below.
  // The full input list of the training computation graph is the concatenation
  // of "ModelProto.graph.input" and
  // "ModelProto.training_info.additional_input".
  repeated ValueInfoProto additional_input = 2;

  // Training-specific outputs, for example, new tensor values computed by the
  // specified training algorithm. The full output list of the training
  // computation graph is the concatenation of "ModelProto.graph.output" and
  // "ModelProto.training_info.additional_output".
  repeated ValueInfoProto additional_output = 3;
  
  // Loss node which consumes tensors in "ModelProto.graph",
  // "additional_input", and "additional_initializer". The loss node produces
  // a scalar loss value. In the training phase, the optimization algorithm is
  // launched to minimize the loss node's output value. "loss" can be an ONNX
  // primitive operator or a function defined in "ModelProto.function".
  //
  // The field MUST be present.
  optional NodeProto loss = 4;

  // Optimized tensors and their gradient tensor names. Each pair describes a
  // tensor's name (key) and its gradient's name (value). The gradient tensors
  // are only visible to the input list of "optimizer."
  repeated StringStringEntryProto gradient_binding = 5;

  // Optimization algorithm used to minimize the loss function. It can be an
  // ONNX primitive operator or a function defined in "ModelProto.function."
  // This node can reference tensors in "ModelProto.graph",
  // "additional_input," "additional_initializer," and the gradient tensors
  // defined by "gradient_binding" above. The field "optimizer"'s outputs are
  // accessible only in "additional_output."
  //
  // The field MUST be present.
  optional NodeProto optimizer = 6;

  // Gradient-based training is usually an iterative procedure. In one gradient
  // descent iteration, we apply
  //
  // x = x - r * g
  //
  // where "x" is the optimized tensor, "r" stands for learning rate, and "g" is
  // gradient of "x" with respect to the "loss" node's output value. To avoid
  // the direct use of an assignment, the user should explicitly specify the
  // new tensor's name for each optimized tensor. For example, if the new
  // value of "x" is "y," the field "update_binding" may contain a pair of
  // "x" (key) and "y" (value).
  //
  // Similarly, optimizer's state tensors (e.g., accumulated squared gradient
  // and momentum) should be handled by using the same strategy. 
  //
  // Each StringStringEntryProto binds an initializer defined in
  // "ModelProto.graph.initializer" or "additional_initializer" to an
  // element of "additional_output." An initializer can be bound to at
  // most one training output.
  repeated StringStringEntryProto update_binding = 7;
}
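The update rule and the `update_binding` semantics above can be illustrated with a toy sketch. This is plain Python, not a real ONNX API; the tensor names (`x`, `y`, `gradient_of_x`) are hypothetical examples:

```python
# Toy illustration of update_binding semantics: instead of assigning
# x = x - r * g in place, the optimizer produces a *new* tensor "y", and
# update_binding pairs "x" (key) with "y" (value) so the runtime knows
# that "y" is the next value of "x". All tensor names are hypothetical.

def sgd_step(x, g, r):
    """One gradient-descent iteration: y = x - r * g, element-wise."""
    return [xi - r * gi for xi, gi in zip(x, g)]

x = [1.0, 2.0, 3.0]              # optimized tensor "x"
gradient_of_x = [0.5, 0.5, 0.5]  # its gradient, named via gradient_binding

y = sgd_step(x, gradient_of_x, r=0.1)  # new value, bound back to "x"
# y is approximately [0.95, 1.95, 2.95]
```

A runtime applying `update_binding` would copy `y` back into `x` after the iteration completes, which keeps the training graph purely functional.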

An optional TrainingInfoProto will be added to ModelProto so that users know how to apply the specified optimization algorithm to conduct further training iterations.

To allow customized gradient-based optimization algorithms, a FunctionProto list is also added to ModelProto. Users can store their training algorithm as a FunctionProto in that list and reference it via TrainingInfoProto.optimizer.

There is a reason for the existence of TrainingInfoProto.additional_initializer: training introduces many per-tensor states, such as momentum and accumulated squared gradient. If we reused ModelProto.graph.initializer to store those training-specific tensors, loading a model for inference could consume significantly more memory (2 to 3 times).

Common optimizers and loss functions will subsequently be proposed as FunctionProtos. For example, the ADAGRAD optimizer has a WIP PR, #1955.
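The memory argument above can be made concrete with the textbook ADAGRAD update (a sketch only, not the exact definition in the WIP FunctionProto): each optimized tensor `x` carries a same-sized per-element state `s` (accumulated squared gradient), which is exactly the kind of tensor that belongs in `additional_initializer` rather than `graph.initializer`:

```python
# Textbook ADAGRAD sketch: the state "s" doubles the per-tensor storage,
# so keeping it out of the inference initializer list saves memory at
# inference time. Hyperparameter values below are illustrative.
import math

def adagrad_step(x, s, g, r=0.1, eps=1e-8):
    # accumulate squared gradients into the per-tensor state
    s_new = [si + gi * gi for si, gi in zip(s, g)]
    # scale each step by the root of the accumulated state
    x_new = [xi - r * gi / (math.sqrt(si) + eps)
             for xi, gi, si in zip(x, g, s_new)]
    return x_new, s_new

x, s = adagrad_step([1.0, -1.0], [0.0, 0.0], [2.0, -2.0])
# s is now [4.0, 4.0]; on the first step each coordinate moves by
# roughly r = 0.1, since g / sqrt(sum g^2) is +-1.
```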

@wschin wschin requested a review from a team as a code owner May 11, 2019 23:59
@wschin wschin changed the title ONNX Training Proposal [WIP] ONNX Training Proposal May 12, 2019
@wschin wschin requested a review from a team as a code owner May 12, 2019 18:53
@wschin wschin changed the title [WIP] ONNX Training Proposal ONNX Training Proposal May 12, 2019
Comment thread onnx/onnx-ml.proto Outdated
Comment thread onnx/onnx-ml.proto Outdated
Comment thread onnx/onnx-ml.proto Outdated
@gramalingam
Contributor

Thanks Wei-sheng. Looks good to me, with a few minor points I mentioned above.

Comment thread onnx/onnx.proto Outdated
// function defined in "ModelProto.function."
//
// The field MUST be present.
optional NodeProto loss = 4;
Collaborator Author

@wschin wschin May 21, 2019


Please clarify name look-up when referencing a FunctionProto in ModelProto.functions, for both loss and optimizer.

Comment thread onnx/onnx.proto Outdated
// Optimized tensors and their gradient tensor names. Each pair describes a
// tensor's name (key) and its gradient's name (value). The gradient tensors
// are only visible to the input list of "optimizer."
repeated StringStringEntryProto gradient_binding = 5;
Collaborator Author


Please explain how the gradient tensors are associated with loss.

Contributor


Do we really need this binding? Based on the document https://github.com/onnx/onnx/files/3208156/ONNX.Training.Discussion.pptx, the optimizer has W and the gradient of W as inputs. The name of "gradient of W", which is output from the backward pass, could seemingly be derived by the backend and runtime implementation. For reference, the current PyTorch and TF APIs do not support user-defined names for the gradients passed to optimizers.

Collaborator Author

@wschin wschin May 29, 2019


We need it. PyTorch creates such a binding in another (but equivalent) way: it keeps a dictionary where the key is a parameter and the value is that parameter's gradient tensor.
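The implicit parameter-to-gradient dictionary PyTorch keeps maps directly onto `gradient_binding` as an explicit name map. A minimal resolution sketch (all names hypothetical, not real ONNX API):

```python
# Minimal sketch of how a runtime could resolve "gradient_binding":
# each entry maps an optimized tensor's name (key) to the name the
# training algorithm uses for its gradient (value). Names are made up.

gradient_binding = [("W", "gradient_of_W"), ("B", "gradient_of_B")]

def gradient_name(tensor_name, binding):
    """Look up the gradient tensor's name for an optimized tensor."""
    for key, value in binding:
        if key == tensor_name:
            return value
    raise KeyError(f"no gradient bound for {tensor_name!r}")

print(gradient_name("W", gradient_binding))  # gradient_of_W
```

With an explicit map, the optimizer's input list can reference gradient tensors by name without relying on backend-specific naming conventions.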

hariharans29 and others added 24 commits August 29, 2019 13:42
* Update to include ONNX Foundation WG

Added link to Gitter and description of the newly formed Foundation WG. Co-leaders Jim Spohrer (IBM) and Ryan Loney (Intel)

* Updated description of Foundation WG

Revised the description of the working group for ONNX Foundation

* Update working-groups.md
…2288)

with type default values even though they are not in the stream.
Move map and sequence types to onnx domain, this is the first step of merging onnx-ml and onnx types.
* Fix link to community docs in readme

Addresses onnx#2255

* Update README.md

* Update README.md
* Update managingexperimentalops.md

* Update managingexperimentalops.md

* Rename managingexperimentalops.md to ManagingExperimentalOps.md
* Added negative axes for slice and squeeze opset 11

* added negative axes support for squeeze, unsqueeze, flatten

* added support for negative axes to all the existing ops

* fixed minor if condition missed for axis attr in flatten

* fixed test name for flatten with negative axes

* updated unsqueeze and softmax tests with fix for failures

* fixed typo

* Updating Split op documentations and version

* fixed typo in unsqueeze model

* fixed dim check for unsqueeze

* fixed type cast

* test fix for build failure

* updating onnx model for unsqueeze test

* fixed minor error in type casting
* added test for int64 input to 'where' op

* added onnx model files and docs for test 'where' op with long input

* added missing doc updates
… that do not have matching graph inputs. (onnx#2135)

* Update helper.py

An IR v4 model is not required to have matching graph inputs for all initializers. Update printable_graph to allow for this and output the name, type and shape of initializers with no matching graph input.

* Add test for printable_graph

Add test and tweak messaging

* Fix comment formatting
…lements', 'OneHot' (onnx#2260)

* modified gather docs to support negative indices

* added support for negative indices to gather_elements and scatter_elements

* fixed documentation formatting as per comments

* Added negative indices to docs for OneHot op

* GatherND spec for negative indices edited

* fix for comments

* fixed formatting for gather op

* Update onnx/defs/tensor/defs.cc

Co-Authored-By: Wei-Sheng Chin <wschin@outlook.com>

* Update docs/Changelog.md

Co-Authored-By: Wei-Sheng Chin <wschin@outlook.com>

* added print examples to onehot and gather as per comments

* adding modified model tests for onehot

* updated doc files

* updating unsqueeze test

* typo fix as per comments
* Fix shapeinference function

* Added shapeinference test for cumsum

* update inference test

* fix test

* minor fix -- shape (1) should be (1,)

* Add whitespace after comma to fix flake warning
* Add a helper function update_inputs_outputs_dims to tools

* fix link to doc

* newline at the end

* add test for tools

* doc props

* nit

* ci tests

* ci tests 2

* accept shapes by dictionary inputs and add more error handling

* Update onnx/tools/update_model_dims.py

nit: rephrasing

Co-Authored-By: Wei-Sheng Chin <wschin@outlook.com>

* remove debug line

* fix type annotation

* fix annotation

* fix annotation

* fix annotation

* fix flake8
* sequence related ops

* refine docs

* extend hasInputShape to Sequence

* refining naming and error checking

* refine descriptions
* fix resize shape inference issue in opset10

* include opset10 upsample as well

* nit: const auto*

* rename 'opset7' to 'opset7_to_10'
* Added more test cases for Unsqueeze

* Added a test case for unsqueezing 3 dims

Also renamed the 1 dim test cases slightly.

* Added more test cases for Unsqueeze

* Added a test case for unsqueezing 3 dims

Also renamed the 1 dim test cases slightly.

* Update docs/Operators.md

Feedback from wschin to fix axis bounds.

Co-Authored-By: Wei-Sheng Chin <wschin@outlook.com>

* Re-ran update_doc.sh
@wschin
Copy link
Copy Markdown
Collaborator Author

wschin commented Jan 24, 2020

This PR has been merged into #2314 so we can close it now.

@wschin wschin closed this Jan 24, 2020
wschin added a commit to wschin/onnx that referenced this pull request Jan 24, 2020
Major changes:
  1. Add a protobuf message, `TrainingInfoProto` originally designed in
     onnx#2013, to store training information.
  2. In `TrainingInfoProto`, the user can store training algorithm in
     `algorithm` field as a `GraphProto`.
  3. The user can also store initialization algorithm for resetting the
     model in `TrainingInfoProto.initialization` (proposed by @tbennun in
     onnx#2517 and agreed by Training WG).
  4. `ModelProto.graph` is callable inside `TrainingInfoProto.algorithm`.
     `ModelProto.graph.initializer` are visible to nodes in
     `TrainingInfoProto.algorithm.node`.
  5. This PR also introduces a `Gradient` operator to differentiate a
     function represented by a (sub-)graph. This idea is from onnx#2168.

Contribution list:
   Baihan Huang: spec design.
   Tal Ben-Nun: model initialization design.
   Wei-Sheng Chin: spec design, Gradient operator design.
   Jonny Shipton and active WG members and participants: many valuable comments and reviews.

Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
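The `Gradient` operator introduced in point 5 differentiates a function represented by a (sub-)graph. The idea can be illustrated with a toy reverse-mode pass over a two-node graph; everything below is a simplified illustration in plain Python, not the operator's actual specification:

```python
# Toy reverse-mode differentiation of a tiny "graph": y = Mul(Add(x, 1), x),
# i.e. y = (x + 1) * x, so dy/dx = 2x + 1. A real Gradient operator works
# on a GraphProto; this only illustrates the chain-rule sweep it implies.

def forward_and_grad(x):
    # forward pass through the two nodes
    t = x + 1.0          # node 1: Add
    y = t * x            # node 2: Mul
    # backward pass (chain rule), seeded with dy/dy = 1
    dy = 1.0
    dt = dy * x          # Mul: d(t*x)/dt = x
    dx = dy * t          # Mul: d(t*x)/dx = t
    dx += dt * 1.0       # Add: dt/dx = 1
    return y, dx

y, dx = forward_and_grad(3.0)
print(y, dx)  # 12.0 7.0  (matches dy/dx = 2*3 + 1)
```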
wschin added a commit that referenced this pull request Feb 17, 2020
* ONNX Training proposal.

Major changes:
  1. Add a protobuf message, `TrainingInfoProto` originally designed in
     #2013, to store training information.
  2. In `TrainingInfoProto`, the user can store training algorithm in
     `algorithm` field as a `GraphProto`.
  3. The user can also store initialization algorithm for resetting the
     model in `TrainingInfoProto.initialization` (proposed by @tbennun in
     #2517 and agreed by Training WG).
  4. `ModelProto.graph` is callable inside `TrainingInfoProto.algorithm`.
     `ModelProto.graph.initializer` are visible to nodes in
     `TrainingInfoProto.algorithm.node`.
  5. This PR also introduces a `Gradient` operator to differentiate a
     function represented by a (sub-)graph. This idea is from #2168.

Contribution list:
   Baihan Huang: spec design.
   Tal Ben-Nun: model initialization design.
   Wei-Sheng Chin: spec design, Gradient operator design.
   Jonny Shipton and active WG members and participants: many valuable comments and reviews.

Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>

* Address comments

* Address a comment

* Move Gradient to ai.onnx.training

Update Gradient test models

* Address comments
1. Create initialization_binding instead of
   using update_binding for initialization.
2. Swap key and value in update_binding.
3. Refine documents accordingly.

* Clarify semantics of algorithm and initialization

* Fix typos

* Address comment and explain the two computation modes of ModelProto.training_info

* Fix typo and explain default behavior

* Update onnx/checker.cc

Co-Authored-By: Jonny Shipton <tmvector@gmail.com>

* Address comments

* Make normalization_binding a repeated field

* Add GraphCall operator

* Polish GraphCall

* GraphCall now uses position to map inputs and outputs

* Address comments:
1. Clarify GraphCall's semantics.
2. Implicitly force trainable tensors to be inference graph's inputs.
3. Training operators cannot be called in the inference graph.

* Add accidentally removed changes back

* Use protobuf lite

* Polish the helper script

* Fix windows build and polish helper script

* Fix linux and mac builds

* One more line

* fix the attribute types section in IR.md (#2590)

* fix the attribute types section in IR.md

* update per comments.

* Some changes around the behavior of optional inference inputs.

1. Use pass-by-value for optional inference inputs.
2. Due to the semantic of GraphCall, we implicitly force trainable
   inputs to be added into inference graph's input list.

Revise docs

* Update spec per WG discussion

* update_binding is optional now because the user might only want to store initialization

* Polish doc

* Address comments. Polish words.

* Use an alternative field to declare global variables.
In yesterday's Operator SIG meeting, we agreed to still
put global variables in the inference graph and add a
model-level field to indicate global variables. This way
we have a smaller impact on inference engines, because
they don't need to move trainable tensors to a new field.

* polish docs

* Allow training initializers to be promoted to global & mutable variables

* Merge the functions of global_mutable_initializer_names into update_binding

* Polish docs

* Remove restriction on using ai.onnx.training in the inference graph

* Split training register from ai.onnx register file

Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
Co-authored-by: Ke Zhang <kezhan@microsoft.com>
jcwchen pushed a commit to jcwchen/onnx that referenced this pull request Sep 23, 2020

Labels

topic: training Issues related to ONNX training
