Training Proposal: Spec Changes and Gradient Operator #2314
wschin merged 53 commits into onnx:master from
Conversation
Are the new ops only useful in training scenarios? If so, they should be put in a separate namespace. It doesn't have to be "ai.onnx.training"; it can be "ai.onnx.lossfunction" or something like that.
Sounds good. I will put them into |
How should a backend decide which parameters in the graph are trainable and which are not? Do we consider all the initializers to be 'trainable' or only those that are captured in |
There is no direct concept of trainable. As long as you use |
Okay, I believe that means trainable parameters are inferred based on initializer and update_binding. Sounds reasonable. |
Yes!
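The inference rule agreed above (a tensor is trainable iff it is an initializer and is referenced by update_binding) can be sketched in Python. Note this is a minimal illustration with hypothetical stand-in types, not the real onnx protobuf classes; since the key/value orientation of update_binding changed during this PR, the sketch checks both sides of each entry.

```python
from dataclasses import dataclass

@dataclass
class StringStringEntry:
    # Stand-in for onnx's StringStringEntryProto (a key/value pair).
    key: str
    value: str

def trainable_parameters(initializer_names, update_binding):
    """Infer trainable tensors: initializers whose names appear in
    update_binding. Whether the initializer name sits in `key` or
    `value` depends on the spec revision, so both are checked here."""
    bound = set()
    for entry in update_binding:
        bound.add(entry.key)
        bound.add(entry.value)
    return [name for name in initializer_names if name in bound]

# Example: "W" and "B" are initializers; only "W" is updated by training.
binding = [StringStringEntry(key="W", value="W_new")]
print(trainable_parameters(["W", "B"], binding))  # ['W']
```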
// If an input or initializer in a graph has the same name as a global variable,
// then the local variable in the graph will hide the global variable.
Wouldn't it be more consistent to say that graph initializers cannot have the same name as any global initializer? The spec currently requires node outputs in subgraphs to not hide any tensors from parent graphs.
A graph initializer means declaring a variable, so the local variable hides the outer scope's variable. An operator output declares a variable only if that variable doesn't exist; if the variable already exists, it would be an assignment, which we want to avoid.
The variable "W" is an optional input in the called graph.
If the user omits it, the input names of GraphCall become ["X_1", "", "Z_1"].
In this case, from the view of the computation graph, the Conv operator invoked by
GraphCall may be connected to the global "W" variable.
So when you use the gradient operator in the training graph, on a graph containing a GraphCall:

- Global initializers preserve gradient flow even when used by the inference graph but not explicitly passed to the GraphCall
- Inference graph initializers aren't visible to the training graph, so you can't take the gradient w.r.t. them
- If the inference graph has an input with the same name as a global initializer, then that input is optional

Is that right?
I'm guessing the main purpose in adding the global initializers is to make it clearer that these are implicitly used by GraphCall without GraphCall having to explicitly pass values for each one?
So when you use the gradient operator in the training graph, on a graph containing a GraphCall:

- Global initializers preserve gradient flow even when used by the inference graph but not explicitly passed to the GraphCall

Yes. The global variable is referenced if we don't find a local variable with the required name (here, the name is "W").

- Inference graph initializers aren't visible to the training graph, so you can't take the gradient w.r.t. them

Yes. They are inference-time and training-time constants.

- If the inference graph has an input with the same name as a global initializer, then that input is optional

Yes.

I'm guessing the main purpose in adding the global initializers is to make it clearer that these are implicitly used by GraphCall without GraphCall having to explicitly pass values for each one?

Yes! It also forces the initialization/training algorithm not to touch inference-only constants.
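The lookup order discussed above (local scope first, then the global initializer scope) can be sketched as follows. The function and data shapes are hypothetical, purely for illustration of the resolution rule:

```python
def resolve(name, local_scope, global_initializers):
    """Resolve a variable referenced by an operator inside a called graph.

    Models only the lookup order: a local variable hides the global one,
    and a name absent locally falls back to the global initializer list
    (this is how an omitted optional GraphCall input like "W" still
    reaches the Conv node inside the called graph).
    """
    if name in local_scope:
        return local_scope[name]          # local variable hides global
    if name in global_initializers:
        return global_initializers[name]  # implicit use of a global initializer
    raise KeyError(f"undefined variable: {name}")

# "W" is not passed to GraphCall, so Conv inside the called graph
# implicitly reads the global initializer "W".
local = {"X_1": "activation"}
globals_ = {"W": "trainable weight"}
print(resolve("W", local, globals_))  # trainable weight
```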
// This field usually stores trainable tensors of the model, because trainable
// tensors are used by both the inference graph and the training algorithm graph.
repeated TensorProto global_initializer = 30;
Suggested change:
- repeated TensorProto global_initializer = 30;
+ repeated TensorProto global_mutable_initializer = 30;

maybe?
Let's make it a string. All tensors stay in global_initializer; the names of the mutable subset go in global_mutable_initializer.
Ok. I changed this field to global_mutable_initializer_name, and it can only contain names from ModelProto.graph.initializer. Do you want TrainingInfoProto.algorithm.initializer and TrainingInfoProto.initialization.initializer to be mutable and globally visible?
// Sparse initializers which are globally accessible.
// See the documentation of `global_initializer` for details.
repeated SparseTensorProto global_sparse_initializer = 31;
Merge these two lists into a string list? Each element is a name of ModelProto.graph.initializer.
(a) I thought the above version also allowed initializers from the training-algorithm?
(b) I forget whether today's meeting said it is okay to explicitly specify this, or just infer it from update_bindings?
(c) We didn't discuss one point in meeting: whether this list allows any (ModelProto.graph) initializer or only initializers that appear in input?
(a) I thought the above version also allowed initializers from the training-algorithm?
If the mutable initializer is only needed in a training graph, its name won't appear in the global variable name list.
(b) I forget whether today's meeting said it is okay to explicitly specify this, or just infer it from update_bindings?
Ok with both but it seems people like update_bindings more.
(c) We didn't discuss one point in meeting: whether this list allows any (ModelProto.graph) initializer or only initializers that appear in input?
No, because:

- We only need initializers (trainable and mutable tensors) for training. I don't feel an input is a mutable thing.
- If an input is promoted to global, multiple graph calls would have to use the same global input, which sounds strange.
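A validity check for the constraint just discussed (the mutable-name list may reference only ModelProto.graph.initializer entries, never plain graph inputs) could look like this sketch. The function name and error messages are illustrative, not the actual onnx checker API:

```python
def check_mutable_names(mutable_names, graph_initializer_names, graph_input_names):
    """Every mutable name must refer to a graph initializer.
    A plain graph input cannot be promoted to global mutable state."""
    initializers = set(graph_initializer_names)
    # Inputs that are not also backed by an initializer are plain inputs.
    plain_inputs = set(graph_input_names) - initializers
    for name in mutable_names:
        if name in plain_inputs:
            raise ValueError(f"'{name}' is a graph input, not an initializer")
        if name not in initializers:
            raise ValueError(f"'{name}' is not in ModelProto.graph.initializer")

# "W" is an initializer, so it may be declared mutable; "X" may not.
check_mutable_names(["W"], ["W", "B"], ["X"])
```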
ebarsoum
left a comment
As we agreed in today's SIG meeting, let's do the following:
- Rename global_initializer to mutable_initializer.
- Change the type to string; it contains the names of the tensors in the initializer list that are mutable.
In yesterday's Operator SIG meeting, we agreed to still put global variables in the inference graph and add a model-level field to indicate global variables. This way we have a smaller impact on inference engines, because they don't need to move trainable tensors to a new field.
The last commit includes all requested changes. Please take a look. Thank you.
The request has been addressed.
houseroad
left a comment
Why did we silently switch to protobuf-lite by default?
This is not friendly to build systems that rely on protobuf.
Please give a heads-up and discuss it before doing so.
* ONNX Training proposal.
Major changes:
1. Add a protobuf message, `TrainingInfoProto` originally designed in
onnx#2013, to store training information.
2. In `TrainingInfoProto`, the user can store training algorithm in
`algorithm` field as a `GraphProto`.
3. The user can also store initialization algorithm for resetting the
model in `TrainingInfoProto.initialization` (proposed by @tbennun in
onnx#2517 and agreed by Training WG).
4. `ModelProto.graph` is callable inside `TrainingInfoProto.algorithm`.
`ModelProto.graph.initializer` are visible to nodes in
`TrainingInfoProto.algorithm.node`.
5. This PR also introduces a `Gradient` operator to differentiate a
function represented by a (sub-)graph. This idea is from onnx#2168.
Contribution list:
Baihan Huang: spec design.
Tal Ben-Nun: model initialization design.
Wei-Sheng Chin: spec design, Gradient operator design.
Jonny Shipton and active WG members and participants: many valuable comments and reviews.
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
* Address comments
* Address a comment
* Move Gradient to ai.onnx.training
Update Gradient test models
* Address comments
1. Create initialization_binding instead of
using update_binding for initialization.
2. Swap key and value in update_binding.
3. Refine documents accordingly.
* Clarify semantics of algorithm and initialization
* Fix typos
* Address comment and explain the two computation modes of ModelProto.training_info
* Fix typo and explain default behavior
* Update onnx/checker.cc
Co-Authored-By: Jonny Shipton <tmvector@gmail.com>
* Address comments
* Make normalization_binding a repeated field
* Add GraphCall operator
* Polish GraphCall
* GraphCall now uses position to map inputs and outputs
* Address comments:
1. Clarify GraphCall's semantic.
2. Implicitly force trainable tensors to be inference graph's inputs.
3. Training operators cannot be called in the inference graph.
* Add accidently removed changes back
* Use protobuf lite
* Polish the helper script
* Fix windows build and polish helper script
* Fix linux and mac builds
* One more line
* fix the attribute types section in IR.md (onnx#2590)
* fix the attribute types section in IR.md
* update per comments.
* Some changes around the behavior of optional inference inputs.
1. Use pass-by-value for optional inference inputs.
2. Due to the semantic of GraphCall, we implicitly force trainable
inputs to be added into inference graph's input list.
Revise docs
* Update spec per WG discussion
* update_binding is optional now because user might only want to store initialization
* Polish doc
* Address comments. Polish words.
* Use an alternative field to declare global variables.
In yesterday's Operator SIG meeting, we agree to still
put global variables in the inference graph and add a
model-level field to indicate global variables. This way
we can have smaller impact to the inference engines, because
they don't need to move trainable tensors to a new field.
* polish docs
* Allow training initializers to be promoted to global & mutable variables
* Merge the functions of global_mutable_initializer_names into update_binding
* Polish docs
* Remove restriction on using ai.onnx.training in the inference graph
* Split training register from ai.onnx register file
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
Co-authored-by: Ke Zhang <kezhan@microsoft.com>
This PR includes all necessary changes for enabling training. It currently includes #2013, #1955, #1970, #1959, #2168, #1939. Per the Training WG's discussion, we want to put everything into a single place for convenience and to have a global view of our status. Please see #2038 and those sub-PRs' messages for details. To speed up the code review process and properly distribute the work, this PR now only contains content from #2517 by @tbennun, #2013, and #2168. They are the minimal requirement to express a training graph. Losses and optimizers will be added separately.
This PR introduces several major changes.
1. Add a protobuf message, TrainingInfoProto, originally designed in ONNX Training Proposal #2013, to store training information.
2. In TrainingInfoProto, the user can store the training algorithm in the algorithm field as a GraphProto.
3. The user can also store an initialization algorithm for resetting the model in TrainingInfoProto.initialization (proposed by @tbennun in Graph initializers for tensors #2517 and agreed by the Training WG).
4. ModelProto.graph is callable inside TrainingInfoProto.algorithm. ModelProto.graph.initializer are visible to nodes in TrainingInfoProto.algorithm.node.
5. This PR also introduces a Gradient operator to differentiate a function represented by a (sub-)graph. This idea is from [WIP][Training] Draft of Gradient Operator #2168. The domain of Gradient is ai.onnx.training.
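As a rough illustration of what the Gradient operator computes (not the operator's actual API or attribute names), differentiating a scalar target with respect to a trainable parameter can be checked numerically:

```python
def numeric_grad(f, x, eps=1e-6):
    """Central finite difference: a numerical stand-in for what a
    Gradient node produces for one scalar input of a (sub-)graph."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

# y = (x * w) ** 2; take the gradient w.r.t. the trainable parameter w,
# analogous to differentiating a small subgraph w.r.t. one initializer.
x, w = 3.0, 2.0
y = lambda w_: (x * w_) ** 2
analytic = 2.0 * (x * w) * x  # dy/dw = 2 * x^2 * w
print(abs(numeric_grad(y, w) - analytic) < 1e-3)  # True
```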