
Training Proposal: Spec Changes and Gradient Operator#2314

Merged
wschin merged 53 commits into onnx:master from wschin:training on Feb 17, 2020

Conversation

@wschin (Collaborator) commented Sep 16, 2019

This PR includes all necessary changes for enabling training. It currently includes #2013, #1955, #1970, #1959, #2168, #1939. Per the Training WG's discussion, we want to put everything in a single place for convenience and to have a global view of our status.

Please see #2038 and those sub-PRs' messages for details.

To speed up the code review process and properly distribute the work, this PR now only contains content from #2517 by @tbennun, #2013, and #2168. These are the minimal requirements to express a training graph. Losses and optimizers will be added separately.

This PR introduces several major changes.

  1. Add a protobuf message, TrainingInfoProto (originally designed in ONNX Training Proposal #2013), to store training information.
  2. In TrainingInfoProto, the user can store a training algorithm in the algorithm field as a GraphProto.
  3. The user can also store an initialization algorithm for resetting the model in TrainingInfoProto.initialization (proposed by @tbennun in Graph initializers for tensors #2517 and agreed by the Training WG).
  4. ModelProto.graph is callable inside TrainingInfoProto.algorithm. The tensors in ModelProto.graph.initializer are visible to nodes in TrainingInfoProto.algorithm.node.
  5. This PR also introduces a Gradient operator to differentiate a function represented by a (sub-)graph. This idea is from [WIP][Training] Draft of Gradient Operator #2168. The domain of Gradient is ai.onnx.training. (A minimal sketch of how these pieces fit together follows this list.)
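
A minimal sketch (illustrative only, not part of this PR's tests) of how these pieces could be assembled with the ONNX Python helpers. The Gradient attribute names (xs, y) and the ai.onnx.training domain follow the proposed operator spec; the opset versions, shapes, and tensor names are arbitrary assumptions.

import onnx
from onnx import helper, TensorProto

# Inference graph: Y = MatMul(X, W), with W stored as an initializer.
X = helper.make_tensor_value_info("X", TensorProto.FLOAT, [1, 2])
Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [1, 2])
W = helper.make_tensor("W", TensorProto.FLOAT, [2, 2], [1.0, 0.0, 0.0, 1.0])
inference = helper.make_graph(
    [helper.make_node("MatMul", ["X", "W"], ["Y"])],
    "inference", [X], [Y], initializer=[W])

# Training algorithm graph: recompute the forward pass and differentiate it.
# Per point 4 above, the initializer "W" from ModelProto.graph is visible here.
algorithm = helper.make_graph(
    [
        helper.make_node("MatMul", ["X", "W"], ["Y_t"]),
        helper.make_node("Gradient", ["X", "W"], ["dY_dX", "dY_dW"],
                         xs=["X", "W"], y="Y_t", domain="ai.onnx.training"),
    ],
    "algorithm",
    [helper.make_tensor_value_info("X", TensorProto.FLOAT, [1, 2])],
    [helper.make_tensor_value_info("dY_dW", TensorProto.FLOAT, [2, 2])])

model = helper.make_model(
    inference,
    opset_imports=[helper.make_opsetid("", 12),                  # version is an assumption
                   helper.make_opsetid("ai.onnx.training", 1)])
info = model.training_info.add()        # new repeated TrainingInfoProto field
info.algorithm.CopyFrom(algorithm)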

@wschin wschin added the topic: training (Issues related to ONNX training) label Sep 16, 2019
@wschin wschin added this to the 1.7 milestone Sep 16, 2019
@wschin wschin requested review from a team as code owners September 16, 2019 05:31
@CLAassistant commented Sep 17, 2019

CLA assistant check
All committers have signed the CLA.

@prasanthpul (Member) commented Sep 17, 2019

Are the new ops only useful in training scenarios? If so, they should be put in a separate namespace. It doesn't have to be "ai.onnx.training", it can be "ai.onnx.lossfunction" or something like that.

@wschin (Collaborator, Author) commented Sep 17, 2019

Are the new ops only useful in training scenarios? If so, they should be put in a separate namespace. It doesn't have to be "ai.onnx.training", it can be "ai.onnx.lossfunction" or something like that.

Sounds good. I will put them into ai.onnx.training.

@spandantiwari (Contributor) commented:

How should a backend decide which parameters in the graph are trainable and which are not? Do we consider all the initializers to be 'trainable' or only those that are captured in update_binding? This question comes up when backends are trying to optimize the graphs using constant folding, and trying to decide which of the initializers are true constants (non-trainable) that can be folded.

Comment thread onnx/onnx.in.proto Outdated
@wschin (Collaborator, Author) commented Dec 3, 2019

How should a backend decide which parameters in the graph are trainable and which are not? Do we consider all the initializers to be 'trainable' or only those that are captured in update_binding? This question comes up when backends are trying to optimize the graphs using constant folding, and trying to decide which of the initializers are true constants (non-trainable) that can be folded.

There is no direct concept of trainable. As long as you use update_binding to update an initializer, that initializer is considered trainable, because its value will be altered at the end of each training iteration. In contrast, as long as an initializer's name is not the value of any key-value pair in update_binding, that initializer is considered a constant.
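
A small sketch of that rule (my illustration, assuming the merged field names ModelProto.training_info and TrainingInfoProto.update_binding): an initializer is treated as trainable exactly when its name appears as a value in some update_binding pair.

import onnx

def split_initializers(model: onnx.ModelProto):
    # Names appearing as *values* in update_binding get overwritten at the end
    # of each training iteration, so they are trainable; the rest are constants.
    updated = {kv.value
               for info in model.training_info      # repeated TrainingInfoProto
               for kv in info.update_binding}       # repeated StringStringEntryProto
    trainable = [t for t in model.graph.initializer if t.name in updated]
    constant = [t for t in model.graph.initializer if t.name not in updated]
    return trainable, constant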

@chinhuang007 (Contributor) commented:

Okay, I believe that means trainable parameters are inferred based on initializer and update_binding. Sounds reasonable.

@wschin (Collaborator, Author) commented Dec 10, 2019

Okay, I believe that means trainable parameters are inferred based on initializer and update_binding. Sounds reasonable.

Yes!

Comment thread onnx/onnx.in.proto Outdated
Comment thread onnx/onnx.in.proto Outdated
Comment thread onnx/onnx.in.proto Outdated
Comment on lines +395 to +396
// If an input or initializer in a graph has the same name as a global variable,
// then the local variable in the graph will hide the global variable.
Contributor:

Wouldn't it be more consistent to say that graph initializers cannot have the same name as any global initializer? The spec currently requires node outputs in subgraphs to not hide any tensors from parent graphs.

Collaborator (Author):

A graph initializer means declaring a variable, so the local variable hides the outer scope's variable. An operator output means declaring a variable if that variable doesn't exist; if the variable already exists, it becomes an assignment, which we want to avoid.

Comment thread onnx/defs/training/defs.cc Outdated
Comment on lines +269 to +272
The variable "W" is an optional input in the called graph.
If the user omits it, the input names of GraphCall becomes ["X_1", "", "Z_1"].
In this case, from the view of computation graph, the Conv operator invoked by
GraphCall's may be connected the global "W" variable.
Contributor:

So when you use the gradient operator in the training graph, on a graph containing a GraphCall:

  • Global initializers preserve gradient flow even when used by the inference graph but not explicitly passed to the GraphCall
  • Inference graph initializers aren't visible to the training graph so you can't take the gradient w.r.t. them
  • If the inference graph has an input with the same name as a global initializer, then that input is optional

Is that right?

I'm guessing the main purpose of adding the global initializers is to make it clearer that these are implicitly used by GraphCall without the GraphCall having to explicitly pass values for each one?

@wschin (Collaborator, Author) commented Feb 12, 2020

So when you use the gradient operator in the training graph, on a graph containing a GraphCall:

  • Global initializers preserve gradient flow even when used by the inference graph but not explicitly passed to the GraphCall

Yes. The global variable is referenced if we don't find a local variable with the required name (here, "W").

  • Inference graph initializers aren't visible to the training graph so you can't take the gradient w.r.t. them

Yes. They are inference-time and training-time constants.

  • If the inference graph has an input with the same name as a global initializer, then that input is optional

Yes.

Is that right?

I'm guessing the main purpose of adding the global initializers is to make it clearer that these are implicitly used by GraphCall without the GraphCall having to explicitly pass values for each one?

Yes! It also forces the initialization/training algorithm not to touch inference-only constants.
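
For concreteness, a sketch of the GraphCall node from the quoted example, with the optional "W" input left unbound via an empty name. This is illustrative only: the output names and the graph-selection attribute (graph_name) are assumptions, not quoted from the spec.

from onnx import helper

# Hypothetical: invoke ModelProto.graph (assumed to be named "inference")
# with the optional "W" input omitted. The empty string leaves that position
# unbound, so the callee falls back to the globally visible initializer "W".
call = helper.make_node(
    "GraphCall",
    inputs=["X_1", "", "Z_1"],        # "" omits the optional W input
    outputs=["Y_1"],                   # output name is an assumption
    graph_name="inference",            # attribute name is an assumption
    domain="ai.onnx.training")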

Comment thread onnx/onnx.in.proto Outdated
Comment thread onnx/onnx.in.proto Outdated
Comment thread onnx/onnx.in.proto Outdated
Comment thread onnx/onnx.in.proto Outdated
Comment thread onnx/onnx.in.proto Outdated
//
// This field usually stores trainable tensors of the model, because trainable
// tensors are used by both inference graph and training algorithm graph.
repeated TensorProto global_initializer = 30;
Collaborator (Author):

Suggested change:
- repeated TensorProto global_initializer = 30;
+ repeated TensorProto global_mutable_initializer = 30;

maybe?

Contributor:

Let's make it a string. All tensors are in global_initializer; the names of the mutable subset are in global_mutable_initializer.

@wschin (Collaborator, Author) commented Feb 13, 2020

OK. I changed this field to global_mutable_initializer_name, and it can only contain names from ModelProto.graph.initializer. Do you want TrainingInfoProto.algorithm.initializer and TrainingInfoProto.initialization.initializer to be mutable and globally visible?

Comment thread onnx/onnx.in.proto Outdated

// Sparse initializers which are globally accessible.
// See the document of `global_initializer` for details.
repeated SparseTensorProto global_sparse_initializer = 31;
Collaborator (Author):

Merge these two lists into a string list? Each element is a name of ModelProto.graph.initializer.

Contributor:

(a) I thought the above version also allowed initializers from the training-algorithm?
(b) I forget whether today's meeting said it is okay to explicitly specify this, or just infer it from update_bindings?
(c) We didn't discuss one point in the meeting: whether this list allows any (ModelProto.graph) initializer or only initializers that also appear among the graph inputs?

Collaborator (Author):

(a) I thought the above version also allowed initializers from the training-algorithm?

If the mutable initializer is only needed in a training graph, its name won't appear in the global variable name list.

(b) I forget whether today's meeting said it is okay to explicitly specify this, or just infer it from update_bindings?

OK with both, but it seems people like update_bindings more.

(c) We didn't discuss one point in the meeting: whether this list allows any (ModelProto.graph) initializer or only initializers that also appear among the graph inputs?

No because

  1. We only need initializers (trainable and mutable tensors) for training. I don't feel an input is a mutable thing.
  2. If an input is promoted to global, multiple graph calls will have to use the same global input, which sounds strange.

@ebarsoum (Contributor) left a comment

As we agreed in today's SIG meeting, let's do the following:

  1. Rename global_initializer to mutable_initializer.
  2. Change the type to string, containing the names of the tensors in the initializer list that are mutable.

In yesterday's Operator SIG meeting, we agreed to still
put global variables in the inference graph and add a
model-level field to indicate global variables. This way
we can have a smaller impact on the inference engines, because
they don't need to move trainable tensors to a new field.
@wschin (Collaborator, Author) commented Feb 14, 2020

As we agreed in today's SIG meeting, let's do the following:

  1. Rename global_initializer to mutable_initializer.
  2. Change the type to string, containing the names of the tensors in the initializer list that are mutable.

The last commit includes all requested changes. Please take a look. Thank you.

@ebarsoum ebarsoum dismissed prasanthpul’s stale review February 17, 2020 07:35

The request has been addressed.

@wschin wschin merged commit 807c62c into onnx:master Feb 17, 2020
@wschin wschin deleted the training branch May 6, 2020 22:15
@houseroad (Member) left a comment

Why did we silently switch to protobuf-lite by default?

This is not friendly to build systems that rely on protobuf.

Please give a heads-up and discuss it before doing so.

jcwchen pushed a commit to jcwchen/onnx that referenced this pull request Sep 23, 2020
* ONNX Training proposal.

Major changes:
  1. Add a protobuf message, `TrainingInfoProto` originally designed in
     onnx#2013, to store training information.
  2. In `TrainingInfoProto`, the user can store training algorithm in
     `algorithm` field as a `GraphProto`.
  3. The user can also store initialization algorithm for resetting the
     model in `TrainingInfoProto.initialization` (proposed by @tbennun in
     onnx#2517 and agreed by Training WG).
  4. `ModelProto.graph` is callable inside `TrainingInfoProto.algorithm`.
     `ModelProto.graph.initializer` are visible to nodes in
     `TrainingInfoProto.algorithm.node`.
  5. This PR also introduces a `Gradient` operator to differentiate a
     function represented by a (sub-)graph. This idea is from onnx#2168.

Contribution list:
   Baihan Huang: spec design.
   Tal Ben-Nun: model initialization design.
   Wei-Sheng Chin: spec design, Gradient operator design.
   Jonny Shipton and active WG members and participants: many valuable comments and reviews.

Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>

* Address comments

* Address a comment

* Move Gradient to ai.onnx.training

Update Gradient test models

* Address comments
1. Create initialization_binding instead of
   using update_binding for initialization.
2. Swap key and value in update_binding.
3. Refine documents accordingly.

* Clarify semantics of algorithm and initialization

* Fix typos

* Address comment and explain the two computation modes of  ModelProto.training_info

* Fix typo and explain default behavior

* Update onnx/checker.cc

Co-Authored-By: Jonny Shipton <tmvector@gmail.com>

* Address comments

* Make normalization_binding a repeated field

* Add GraphCall operator

* Polish GraphCall

* GraphCall now uses position to map inputs and outputs

* Address comments:
1. Clarify GraphCall's semantic.
2. Implicitly force trainable tensors to be inference graph's inputs.
3. Training operators cannot be called in the inference graph.

* Add accidently removed changes back

* Use protobuf lite

* Polish the helper script

* Fix windows build and polish helper script

* Fix linux and mac builds

* One more line

* fix the attribute types section in IR.md (onnx#2590)

* fix the attribute types section in IR.md

* update per comments.

* Some changes around the behavior of optional inference inputs.

1. Use pass-by-value for optional inference inputs.
2. Due to the semantics of GraphCall, we implicitly force trainable
   inputs to be added to the inference graph's input list.

Revise docs

* Update spec per WG discussion

* update_binding is optional now because the user might only want to store initialization

* Polish doc

* Address comments. Polish words.

* Use an alternative field to declare global variables.
In yesterday's Operator SIG meeting, we agreed to still
put global variables in the inference graph and add a
model-level field to indicate global variables. This way
we can have a smaller impact on the inference engines, because
they don't need to move trainable tensors to a new field.

* polish docs

* Allow training initializers to be promoted to global & mutable variables

* Merge the functions of global_mutable_initializer_names into update_binding

* Polish docs

* Remove restriction on using ai.onnx.training in the inference graph

* Split training register from ai.onnx register file

Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
Co-authored-by: Ke Zhang <kezhan@microsoft.com>

Labels

topic: operator (Issues related to ONNX operators), topic: training (Issues related to ONNX training)
