
ONNX Training Proposal#2013

Closed
wschin wants to merge 75 commits into onnx:master from wschin:training-info

Conversation

@wschin
Collaborator

@wschin wschin commented May 11, 2019

PR #2314 is a single place for reviewing the whole training story.

This proposal aims at capturing the information required to perform
stochastic gradient-based training on the inference graph.

Although training itself can be encoded as a sub-graph by adding
a single backward operator, we currently prefer a more strongly-typed
way to store that information. We also think separating the training graph
from the inference graph makes backends' lives easier.

The major change happens in the introduction of TrainingInfoProto:

// Training information
// TrainingInfoProto contains the loss function to be minimized (e.g., mean
// squared error), the optimization algorithm used (a mapping from gradients
// and the current tensors to their new values; that is, one training
// iteration), and the binding between the optimized tensors and their
// gradient tensors' names (we need to give a name (e.g., "gradient_of_X")
// to the gradient of tensor "X", so that gradient tensors can be referenced
// in the training algorithm).
//
// This design mainly targets popular stochastic gradient-based methods.
// Conceptually, a TrainingInfoProto defines a "partial computation graph"
// which will be merged into the inference graph to form a full training
// computation graph.
message TrainingInfoProto {
  // Training-specific initializers, for example, accumulated squared gradient
  // in ADAGRAD optimizer, momentum, or number of executed training iterations.
  // The values can only be inputs of "loss" and "optimizer" defined below.
  repeated TensorProto additional_initializer = 1;

  // Training-specific inputs, for example, labels in classification problems.
  // The values can only be inputs of "loss" and "optimizer" defined below.
  // The full input list of the training computation graph is the concatenation
  // of "ModelProto.graph.input" and
  // "ModelProto.training_info.additional_input".
  repeated ValueInfoProto additional_input = 2;

  // Training-specific outputs, for example, new tensor values computed by the
  // specified training algorithm. The full output list of the training
  // computation graph is the concatenation of "ModelProto.graph.output" and
  // "ModelProto.training_info.additional_output".
  repeated ValueInfoProto additional_output = 3;
  
  // Loss node which consumes tensors in "ModelProto.graph",
  // "additional_input", and "additional_initializer". The loss node produces
  // a scalar loss value. In the training phase, the optimization algorithm is
  // launched to minimize the loss node's output value. "loss" can be an ONNX
  // primitive operator or a function defined in "ModelProto.function".
  //
  // The field MUST be present.
  optional NodeProto loss = 4;

  // Optimized tensors and their gradient tensor names. Each pair describes a
  // tensor's name (key) and its gradient's name (value). The gradient tensors
  // are only visible to the input list of "optimizer."
  repeated StringStringEntryProto gradient_binding = 5;

  // Optimization algorithm used to minimize the loss function. It can be an
  // ONNX primitive operator or a function defined in "ModelProto.function."
  // This node can reference tensors in "ModelProto.graph",
  // "additional_input," "additional_initializer," and the gradient tensors
  // defined by "gradient_binding" above. The field "optimizer"'s outputs are
  // accessible only in "additional_output."
  //
  // The field MUST be present.
  optional NodeProto optimizer = 6;

  // Gradient-based training is usually an iterative procedure. In one gradient
  // descent iteration, we apply
  //
  // x = x - r * g
  //
  // where "x" is the optimized tensor, "r" stands for learning rate, and "g" is
  // gradient of "x" with respect to the "loss" node's output value. To avoid
  // the direct use of an assignment, the user should explicitly specify the
  // new tensor's name for each optimized tensor. For example, if the new
  // value of "x" is "y," the field "update_binding" may contain a pair of
  // "x" (key) and "y" (value).
  //
  // Similarly, optimizer's state tensors (e.g., accumulated squared gradient
  // and momentum) should be handled by using the same strategy. 
  //
  // Each StringStringEntryProto binds an initializer defined in
  // "ModelProto.graph.initializer" or "additional_initializer" to an
  // element of "additional_output." An initializer can be bound to at
  // most one training output.
  repeated StringStringEntryProto update_binding = 7;
}
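The update rule and the `update_binding` semantics above can be illustrated with a toy sketch. This is plain Python, not a real ONNX API; the tensor names (`x`, `y`, `gradient_of_x`) are hypothetical examples:

```python
# Toy illustration of update_binding semantics: instead of assigning
# x = x - r * g in place, the optimizer produces a *new* tensor "y", and
# update_binding pairs "x" (key) with "y" (value) so the runtime knows
# that "y" is the next value of "x". All tensor names are hypothetical.

def sgd_step(x, g, r):
    """One gradient-descent iteration: y = x - r * g, element-wise."""
    return [xi - r * gi for xi, gi in zip(x, g)]

x = [1.0, 2.0, 3.0]              # optimized tensor "x"
gradient_of_x = [0.5, 0.5, 0.5]  # its gradient, named via gradient_binding

y = sgd_step(x, gradient_of_x, r=0.1)  # new value, bound back to "x"
# y is approximately [0.95, 1.95, 2.95]
```

A runtime applying `update_binding` would copy `y` back into `x` after the iteration completes, which keeps the training graph purely functional.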

An optional TrainingInfoProto will be added to ModelProto so that users know how to apply the specified optimization algorithm to conduct further training iterations.

To allow customized gradient-based optimization algorithms, a FunctionProto list is also added to ModelProto. Users can store their training algorithm as a FunctionProto in that list and reference it via TrainingInfoProto.optimizer.

There is a reason for the existence of TrainingInfoProto.additional_initializer: training introduces many per-tensor states, such as momentum and accumulated squared gradient. If we reused ModelProto.graph.initializer to store those training-specific tensors, loading a model for inference could consume significantly more memory (2 to 3 times).

Common optimizers and loss functions will subsequently be proposed as FunctionProtos. For example, the ADAGRAD optimizer has a WIP PR, #1955.
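The memory argument above can be made concrete with the textbook ADAGRAD update (a sketch only, not the exact definition in the WIP FunctionProto): each optimized tensor `x` carries a same-sized per-element state `s` (accumulated squared gradient), which is exactly the kind of tensor that belongs in `additional_initializer` rather than `graph.initializer`:

```python
# Textbook ADAGRAD sketch: the state "s" doubles the per-tensor storage,
# so keeping it out of the inference initializer list saves memory at
# inference time. Hyperparameter values below are illustrative.
import math

def adagrad_step(x, s, g, r=0.1, eps=1e-8):
    # accumulate squared gradients into the per-tensor state
    s_new = [si + gi * gi for si, gi in zip(s, g)]
    # scale each step by the root of the accumulated state
    x_new = [xi - r * gi / (math.sqrt(si) + eps)
             for xi, gi, si in zip(x, g, s_new)]
    return x_new, s_new

x, s = adagrad_step([1.0, -1.0], [0.0, 0.0], [2.0, -2.0])
# s is now [4.0, 4.0]; on the first step each coordinate moves by
# roughly r = 0.1, since g / sqrt(sum g^2) is +-1.
```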

@wschin wschin requested a review from a team as a code owner May 11, 2019 23:59
@wschin wschin changed the title ONNX Training Proposal [WIP] ONNX Training Proposal May 12, 2019
@wschin wschin requested a review from a team as a code owner May 12, 2019 18:53
@wschin wschin changed the title [WIP] ONNX Training Proposal ONNX Training Proposal May 12, 2019
Comment thread onnx/onnx-ml.proto Outdated
Comment thread onnx/onnx-ml.proto Outdated
Comment thread onnx/onnx-ml.proto Outdated
@gramalingam
Contributor

Thanks Wei-sheng. Looks good to me, with a few minor points I mentioned above.

Comment thread onnx/onnx.proto Outdated
// function defined in "ModelProto.function."
//
// The field MUST be present.
optional NodeProto loss = 4;
Collaborator Author

@wschin wschin May 21, 2019


Please clarify name look-up when referencing a FunctionProto in ModelProto.functions, for both loss and optimizer.

Comment thread onnx/onnx.proto Outdated
// Optimized tensors and their gradient tensor names. Each pair describes a
// tensor's name (key) and its gradient's name (value). The gradient tensors
// are only visible to the input list of "optimizer."
repeated StringStringEntryProto gradient_binding = 5;
Collaborator Author


Please explain how the gradient tensors are associated with loss.

Contributor


Do we really need this binding? Based on the document https://github.com/onnx/onnx/files/3208156/ONNX.Training.Discussion.pptx, the optimizer has W and the gradient of W as inputs. The name of "gradient of W", which is output from the backward pass, could seemingly be derived by the backend and runtime implementation. For reference, the current PyTorch and TF APIs do not support user-defined names for the gradients passed to optimizers.

Collaborator Author

@wschin wschin May 29, 2019


We need it. PyTorch creates such a binding in another (but equivalent) way: it keeps a dictionary where the key is a parameter and the value is that parameter's gradient tensor.
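The implicit parameter-to-gradient dictionary PyTorch keeps maps directly onto `gradient_binding` as an explicit name map. A minimal resolution sketch (all names hypothetical, not real ONNX API):

```python
# Minimal sketch of how a runtime could resolve "gradient_binding":
# each entry maps an optimized tensor's name (key) to the name the
# training algorithm uses for its gradient (value). Names are made up.

gradient_binding = [("W", "gradient_of_W"), ("B", "gradient_of_B")]

def gradient_name(tensor_name, binding):
    """Look up the gradient tensor's name for an optimized tensor."""
    for key, value in binding:
        if key == tensor_name:
            return value
    raise KeyError(f"no gradient bound for {tensor_name!r}")

print(gradient_name("W", gradient_binding))  # gradient_of_W
```

With an explicit map, the optimizer's input list can reference gradient tensors by name without relying on backend-specific naming conventions.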

hariharans29 and others added 24 commits August 29, 2019 13:42
* Update to include ONNX Foundation WG

Added link to Gitter and description of the newly formed Foundation WG. Co-leaders Jim Spohrer (IBM) and Ryan Loney (Intel)

* Updated description of Foundation WG

Revised the description of the working group for ONNX Foundation

* Update working-groups.md
…2288)

with type default values even though they are not in the stream.
Move map and sequence types to onnx domain, this is the first step of merging onnx-ml and onnx types.
* Fix link to community docs in readme

Addresses onnx#2255

* Update README.md

* Update README.md
* Update managingexperimentalops.md

* Update managingexperimentalops.md

* Rename managingexperimentalops.md to ManagingExperimentalOps.md
* Added negative axes for slice and squeeze opset 11

* added negative axes support for squeeze, unsqueeze, flatten

* added support for negative axes to all the existing ops

* fixed minor if condition missed for axis attr in flatten

* fixed test name for flatten with negative axes

* updated unsqueeze and softmax tests with fix for failures

* fixed typo

* Updating Split op documentations and version

* fixed typo in unsqueeze model

* fixed dim check for unsqueeze

* fixed type cast

* test fix for build failure

* updating onnx model for unsqueeze test

* fixed minor error in type casting
* added test for int64 input to 'where' op

* added onnx model files and docs for test 'where' op with long input

* added missing doc updates
… that do not have matching graph inputs. (onnx#2135)

* Update helper.py

An IR v4 model is not required to have matching graph inputs for all initializers. Update printable_graph to allow for this and output the name, type and shape of initializers with no matching graph input.

* Add test for printable_graph

Add test and tweak messaging

* Fix comment formatting
…lements', 'OneHot' (onnx#2260)

* modified gather docs to support negative indices

* added support for negative indices to gather_elements and scatter_elements

* fixed documentation formatting as per comments

* Added negative indices to docs for OneHot op

* GatherND spec for negative indices edited

* fix for comments

* fixed formatting for gather op

* Update onnx/defs/tensor/defs.cc

Co-Authored-By: Wei-Sheng Chin <wschin@outlook.com>

* Update docs/Changelog.md

Co-Authored-By: Wei-Sheng Chin <wschin@outlook.com>

* added print examples to onehot and gather as per comments

* adding modified model tests for onehot

* updated doc files

* updating unsqueeze test

* typo fix as per comments
* Fix shapeinference function

* Added shapeinference test for cumsum

* update inference test

* fix test

* minor fix -- shape (1) should be (1,)

* Add whitespace after comma to fix flake warning
* Add a helper function update_inputs_outputs_dims to tools

* fix link to doc

* newline at the end

* add test for tools

* doc props

* nit

* ci tests

* ci tests 2

* accept shapes by dictionary inputs and add more error handling

* Update onnx/tools/update_model_dims.py

nit: rephrasing

Co-Authored-By: Wei-Sheng Chin <wschin@outlook.com>

* remove debug line

* fix type annotation

* fix annotation

* fix annotation

* fix annotation

* fix flake8
* sequence related ops

* refine docs

* extend hasInputShape to Sequence

* refining naming and error checking

* refine descriptions
* fix resize shape inference issue in opset10

* include opset10 upsample as well

* nit: const auto*

* rename 'opset7' to 'opset7_to_10'
* Added more test cases for Unsqueeze

* Added a test case for unsqueezing 3 dims

Also renamed the 1 dim test cases slightly.

* Added more test cases for Unsqueeze

* Added a test case for unsqueezing 3 dims

Also renamed the 1 dim test cases slightly.

* Update docs/Operators.md

Feedback from wschin to fix axis bounds.

Co-Authored-By: Wei-Sheng Chin <wschin@outlook.com>

* Re-ran update_doc.sh
@wschin
Copy link
Copy Markdown
Collaborator Author

wschin commented Jan 24, 2020

This PR has been merged into #2314 so we can close it now.

@wschin wschin closed this Jan 24, 2020
wschin added a commit to wschin/onnx that referenced this pull request Jan 24, 2020
Major changes:
  1. Add a protobuf message, `TrainingInfoProto` originally designed in
     onnx#2013, to store training information.
  2. In `TrainingInfoProto`, the user can store training algorithm in
     `algorithm` field as a `GraphProto`.
  3. The user can also store initialization algorithm for resetting the
     model in `TrainingInfoProto.initialization` (proposed by @tbennun in
     onnx#2517 and agreed by Training WG).
  4. `ModelProto.graph` is callable inside `TrainingInfoProto.algorithm`.
     `ModelProto.graph.initializer` are visible to nodes in
     `TrainingInfoProto.algorithm.node`.
  5. This PR also introduces a `Gradient` operator to differentiate a
     function represented by a (sub-)graph. This idea is from onnx#2168.

Contribution list:
   Baihan Huang: spec design.
   Tal Ben-Nun: model initialization design.
   Wei-Sheng Chin: spec design, Gradient operator design.
   Jonny Shipton and active WG members and participants: many valuable comments and reviews.

Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
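The `Gradient` operator introduced in point 5 differentiates a function represented by a (sub-)graph. The idea can be illustrated with a toy reverse-mode pass over a two-node graph; everything below is a simplified illustration in plain Python, not the operator's actual specification:

```python
# Toy reverse-mode differentiation of a tiny "graph": y = Mul(Add(x, 1), x),
# i.e. y = (x + 1) * x, so dy/dx = 2x + 1. A real Gradient operator works
# on a GraphProto; this only illustrates the chain-rule sweep it implies.

def forward_and_grad(x):
    # forward pass through the two nodes
    t = x + 1.0          # node 1: Add
    y = t * x            # node 2: Mul
    # backward pass (chain rule), seeded with dy/dy = 1
    dy = 1.0
    dt = dy * x          # Mul: d(t*x)/dt = x
    dx = dy * t          # Mul: d(t*x)/dx = t
    dx += dt * 1.0       # Add: dt/dx = 1
    return y, dx

y, dx = forward_and_grad(3.0)
print(y, dx)  # 12.0 7.0  (matches dy/dx = 2*3 + 1)
```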
wschin added a commit that referenced this pull request Feb 17, 2020
* ONNX Training proposal.

Major changes:
  1. Add a protobuf message, `TrainingInfoProto` originally designed in
     #2013, to store training information.
  2. In `TrainingInfoProto`, the user can store training algorithm in
     `algorithm` field as a `GraphProto`.
  3. The user can also store initialization algorithm for resetting the
     model in `TrainingInfoProto.initialization` (proposed by @tbennun in
     #2517 and agreed by Training WG).
  4. `ModelProto.graph` is callable inside `TrainingInfoProto.algorithm`.
     `ModelProto.graph.initializer` are visible to nodes in
     `TrainingInfoProto.algorithm.node`.
  5. This PR also introduces a `Gradient` operator to differentiate a
     function represented by a (sub-)graph. This idea is from #2168.

Contribution list:
   Baihan Huang: spec design.
   Tal Ben-Nun: model initialization design.
   Wei-Sheng Chin: spec design, Gradient operator design.
   Jonny Shipton and active WG members and participants: many valuable comments and reviews.

Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>

* Address comments

* Address a comment

* Move Gradient to ai.onnx.training

Update Gradient test models

* Address comments
1. Create initialization_binding instead of
   using update_binding for initialization.
2. Swap key and value in update_binding.
3. Refine documents accordingly.

* Clarify semantics of algorithm and initialization

* Fix typos

* Address comment and explain the two computation modes of ModelProto.training_info

* Fix typo and explain default behavior

* Update onnx/checker.cc

Co-Authored-By: Jonny Shipton <tmvector@gmail.com>

* Address comments

* Make normalization_binding a repeated field

* Add GraphCall operator

* Polish GraphCall

* GraphCall now uses position to map inputs and outputs

* Address comments:
1. Clarify GraphCall's semantics.
2. Implicitly force trainable tensors to be inference graph's inputs.
3. Training operators cannot be called in the inference graph.

* Add accidentally removed changes back

* Use protobuf lite

* Polish the helper script

* Fix windows build and polish helper script

* Fix linux and mac builds

* One more line

* fix the attribute types section in IR.md (#2590)

* fix the attribute types section in IR.md

* update per comments.

* Some changes around the behavior of optional inference inputs.

1. Use pass-by-value for optional inference inputs.
2. Due to the semantic of GraphCall, we implicitly force trainable
   inputs to be added into inference graph's input list.

Revise docs

* Update spec per WG discussion

* update_binding is optional now because the user might only want to store initialization

* Polish doc

* Address comments. Polish words.

* Use an alternative field to declare global variables.
In yesterday's Operator SIG meeting, we agreed to still
put global variables in the inference graph and add a
model-level field to indicate global variables. This way
we have a smaller impact on inference engines, because
they don't need to move trainable tensors to a new field.

* polish docs

* Allow training initializers to be promoted to global & mutable variables

* Merge the functions of global_mutable_initializer_names into update_binding

* Polish docs

* Remove restriction on using ai.onnx.training in the inference graph

* Split training register from ai.onnx register file

Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
Co-authored-by: Ke Zhang <kezhan@microsoft.com>
jcwchen pushed a commit to jcwchen/onnx that referenced this pull request Sep 23, 2020

Labels

topic: training Issues related to ONNX training
