
[WIP][Training] Draft of Gradient Operator #2168

Closed
wschin wants to merge 8 commits into onnx:master from wschin:grad

Conversation

Collaborator

@wschin commented Jul 14, 2019

PR #2314 is a single place for reviewing the whole training story.

In TensorFlow and PyTorch, most models are trained using backpropagation and gradient-based optimization algorithms. This operator provides a way to compute gradients given an inference graph.

  • Signature and documentation
  • Shape inference code
  • Reference implementation
  • Tests
  • Verification script

Here is a similar operator in PyTorch.
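For illustration, here is a minimal, hypothetical sketch (not part of this PR) of how such a node could be constructed with the onnx Python helpers, using the draft attribute names xs and y from this proposal:

```python
from onnx import helper

# Hypothetical sketch of a Gradient node as drafted in this PR:
# xs names the independent variables, y names the target(s), and the
# node produces one gradient output per entry of xs.
grad_node = helper.make_node(
    "Gradient",
    inputs=["X", "W"],           # tensors fed to the identified (sub-)graph
    outputs=["dY_dX", "dY_dW"],  # d(Y)/d(X) and d(Y)/d(W)
    xs=["X", "W"],
    y=["Y"],
)
```

The output names here are deliberately valid identifiers, anticipating the naming discussion in the review below.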

@wschin wschin requested a review from a team as a code owner July 14, 2019 08:11
Comment thread onnx/defs/training/defs.cc Outdated
(Quoted diagram snippet: Gradient(xs=["X", "W"], y=["Y"]) produces dO/dX as its 1st output and dO/dW as its 2nd output.)
Contributor

I suggest replacing "dO/dW" with a valid identifier, like "dO_dW" or "first_order_gradient".

(Quoted diagram snippet: d^2O/dW^2 is the 2nd output of Gradient.)
Contributor

Suggest using a valid identifier (as above).

Collaborator Author

Hmm, but d(dO/dW)/dX would be changed to something like d_dO_dW_dX and become less readable.

Contributor

I see your point. But my concern is that some users will copy this blindly and use "dO/dW" as a tensor name. This can lead to potential problems downstream. Technically, the ONNX documentation says that tensor names should be valid C identifiers. If the ONNX checker were to enforce this, that would be somewhat helpful. But it does not seem to do that. (I remember Niklas was trying to add such a check ... but it doesn't seem to exist today ... not sure why. Possibly because a number of models were violating this then.) We should probably sort this out one way or the other in the ONNX standard. Until then, I think it would be better to try to stick to the identifier restriction.

We could simply use names like D1 and D2 to denote the first and second derivative in the example (and in the text we could say D1 denotes dO/dW, for example).
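As an aside, the identifier rule under discussion is easy to state in code; a minimal sketch (hypothetical, not the actual onnx checker) might look like:

```python
import re

# Hypothetical sketch of the rule above: tensor names should be valid
# C identifiers (letters, digits, underscores; not starting with a digit).
_C_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def is_valid_tensor_name(name: str) -> bool:
    return bool(_C_IDENTIFIER.match(name))

assert is_valid_tensor_name("dO_dW")
assert not is_valid_tensor_name("dO/dW")  # '/' is not a valid identifier character
```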

Comment thread onnx/defs/training/defs.cc Outdated

namespace ONNX_NAMESPACE {
static const char* Gradient_ver12_doc = R"DOC(
Gradient operator computes the derivatives of some tensors with respect to
Contributor

It might be worth adding an explanatory note somewhere, something like "The gradient operator is slightly different from the standard operators. It is conceptually a higher-order operator that takes a computation-graph as an attribute, but we abuse the attribute mechanism to describe the computation-sub-graph implicitly by listing the inputs and outputs of the computation-sub-graph in the current graph."

Comment thread onnx/defs/training/defs.cc Outdated

As mentioned above, the attributes "xs" and "y" are used to identify a graph,
and we can feed different tensors to the identified graph. For example,
one can compute the gradient of Y with respect to H by substituting
Contributor

Are there typos here? I am a bit confused, because I am not sure what Y_1 and H_1 are, and the picture below shows W_1 and Z_1. Are you trying to say that it is not necessary for the inputs to be identical to the list xs specified in the attributes?

Contributor

If the above is the case, it would be helpful to add that "The implementation can be optimized in the common case where the inputs are identical to the attributes." ... because if the inputs are not identical to the attributes, the implementation has to do extra work to create a duplicate of the forward-graph for the actual-inputs.

Collaborator Author

Do you think we need to allow

 if the inputs are not identical to the attributes, the implementation has to do extra work to create a duplicate of the forward-graph for the actual-inputs

? Looks like it's not necessary.

Contributor

If we don't support it, that's okay with me (since I can't immediately think of use-cases where we would want to use it). But, then, the documentation should make this explicit (and implementations should check that the input-list and attribute-list match).
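For concreteness, a hypothetical sketch of the check suggested here (names and structure assumed; not actual checker code):

```python
from onnx import NodeProto

def check_gradient_inputs_match_xs(node: NodeProto) -> None:
    # Hypothetical check: a Gradient node's input list must be identical
    # to its "xs" attribute, so implementations never need to duplicate
    # the forward graph for different actual inputs.
    xs = []
    for attr in node.attribute:
        if attr.name == "xs":
            xs = [s.decode("utf-8") for s in attr.strings]
    if list(node.input) != xs:
        raise ValueError(
            f"Gradient inputs {list(node.input)} do not match xs {xs}"
        )
```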

Comment thread onnx/defs/training/defs.cc Outdated
"contains and only contains the necessary inputs of a "
"(sub-)graph. Variables (usually called intermediate "
"variables) which can be generated by inputs cannot be "
"included in this attribute.",
Contributor

Is it correct to say the following? If xs=[X_1, ..., X_m], then no X_i can depend on X_j (i <> j): that is, there cannot be a dependence-path in the graph from X_j to X_i? Furthermore, do we need to say the list in xs must be complete? If we have Y = X_1 + X_2, is it valid or invalid to specify xs=["X_1"]?

Collaborator Author

Yes. The inputs named in xs must be independent variables; that is, they cannot be upstream variables of each other.

The list must be complete. Otherwise, we cannot complete a forward pass due to a missing input (in your example, we cannot compute Y without X_2).
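To make these two rules concrete, here is a hypothetical sketch, assuming a simple map from each tensor name to the names it is computed from:

```python
# Hypothetical sketch of the independence rule above: no entry of xs may be
# reachable (upstream) from another entry of xs.
def producers_closure(name, deps):
    """All tensor names that `name` transitively depends on.
    `deps` maps a tensor name to the names it is computed from."""
    seen, stack = set(), list(deps.get(name, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(deps.get(n, []))
    return seen

def xs_are_independent(xs, deps):
    return all(
        other not in producers_closure(x, deps)
        for x in xs for other in xs if other != x
    )

# Example: H = f(X), Y = g(H, W).
deps = {"H": ["X"], "Y": ["H", "W"]}
assert xs_are_independent(["X", "W"], deps)      # valid
assert not xs_are_independent(["X", "H"], deps)  # invalid: H depends on X
```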

Comment thread onnx/defs/training/defs.cc Outdated
@CLAassistant

CLAassistant commented Jul 23, 2019

CLA assistant check
All committers have signed the CLA.

@prasanthpul prasanthpul added this to the 1.7 milestone Aug 20, 2019
@postrational postrational added topic: operator Issues related to ONNX operators topic: training Issues related to ONNX training labels Aug 23, 2019
@newling left a comment

Is it the case that "xs" is always the names of the inputs? If so, why is it
necessary to have both? It seems like one or the other is redundant.

So, why is an attribute-less definition like this,

input i = 0:N-1 : see outputs definition
input N : see outputs definition
outputs i = 0:N-1 : d(input_N)/d(input_i)

attributes : none.

not enough?
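For comparison, a hypothetical sketch of the attribute-less form proposed above (all names invented for illustration):

```python
from onnx import helper

# Hypothetical attribute-less variant: inputs 0..N-1 are the tensors to
# differentiate with respect to, input N is the differentiated target, and
# output i is d(input_N)/d(input_i). No xs/y attributes are needed.
grad_node = helper.make_node(
    "Gradient",
    inputs=["X", "W", "Y"],      # X, W are inputs 0..N-1; Y is input N
    outputs=["dY_dX", "dY_dW"],  # one gradient per differentiation target
)
```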

Comment thread docs/Operators.md
@wschin
Collaborator Author

wschin commented Jan 24, 2020

The operator Gradient is now a part of #2314, so this PR can be closed.

@wschin wschin closed this Jan 24, 2020
wschin added a commit to wschin/onnx that referenced this pull request Jan 24, 2020
Major changes:
  1. Add a protobuf message, `TrainingInfoProto`, originally designed in
     onnx#2013, to store training information.
  2. In `TrainingInfoProto`, the user can store a training algorithm in
     the `algorithm` field as a `GraphProto`.
  3. The user can also store an initialization algorithm for resetting the
     model in `TrainingInfoProto.initialization` (proposed by @tbennun in
     onnx#2517 and agreed upon by the Training WG).
  4. `ModelProto.graph` is callable inside `TrainingInfoProto.algorithm`.
     `ModelProto.graph.initializer` entries are visible to nodes in
     `TrainingInfoProto.algorithm.node`.
  5. This PR also introduces a `Gradient` operator to differentiate a
     function represented by a (sub-)graph. This idea is from onnx#2168.

Contribution list:
   Baihan Huang: spec design.
   Tal Ben-Nun: model initialization design.
   Wei-Sheng Chin: spec design, Gradient operator design.
   Jonny Shipton and active WG members and participants: many valuable comments and reviews.

Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
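For context, a minimal sketch (assuming the `TrainingInfoProto` fields as they shipped with ONNX 1.7; not code from this PR) of attaching training information to a model:

```python
import onnx
from onnx import helper

# Minimal sketch, assuming the ONNX 1.7 protobuf definitions: attach a
# TrainingInfoProto whose `algorithm` field holds a training GraphProto.
model = onnx.ModelProto()
model.ir_version = onnx.IR_VERSION

training_info = model.training_info.add()
algorithm = helper.make_graph(
    nodes=[],   # Gradient/optimizer nodes would go here
    name="training_algorithm",
    inputs=[],
    outputs=[],
)
training_info.algorithm.CopyFrom(algorithm)
```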
wschin added a commit that referenced this pull request Feb 17, 2020
* ONNX Training proposal.

Major changes:
  1. Add a protobuf message, `TrainingInfoProto`, originally designed in
     #2013, to store training information.
  2. In `TrainingInfoProto`, the user can store a training algorithm in
     the `algorithm` field as a `GraphProto`.
  3. The user can also store an initialization algorithm for resetting the
     model in `TrainingInfoProto.initialization` (proposed by @tbennun in
     #2517 and agreed upon by the Training WG).
  4. `ModelProto.graph` is callable inside `TrainingInfoProto.algorithm`.
     `ModelProto.graph.initializer` entries are visible to nodes in
     `TrainingInfoProto.algorithm.node`.
  5. This PR also introduces a `Gradient` operator to differentiate a
     function represented by a (sub-)graph. This idea is from #2168.

Contribution list:
   Baihan Huang: spec design.
   Tal Ben-Nun: model initialization design.
   Wei-Sheng Chin: spec design, Gradient operator design.
   Jonny Shipton and active WG members and participants: many valuable comments and reviews.

Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>

* Address comments

* Address a comment

* Move Gradient to ai.onnx.training

Update Gradient test models

* Address comments
1. Create initialization_binding instead of
   using update_binding for initialization.
2. Swap key and value in update_binding.
3. Refine documents accordingly.

* Clarify semantics of algorithm and initialization

* Fix typos

* Address comment and explain the two computation modes of ModelProto.training_info

* Fix typo and explain default behavior

* Update onnx/checker.cc

Co-Authored-By: Jonny Shipton <tmvector@gmail.com>

* Address comments

* Make normalization_binding a repeated field

* Add GraphCall operator

* Polish GraphCall

* GraphCall now uses position to map inputs and outputs

* Address comments:
1. Clarify GraphCall's semantics.
2. Implicitly force trainable tensors to be inference graph's inputs.
3. Training operators cannot be called in the inference graph.

* Add accidentally removed changes back

* Use protobuf lite

* Polish the helper script

* Fix windows build and polish helper script

* Fix linux and mac builds

* One more line

* fix the attribute types section in IR.md (#2590)

* fix the attribute types section in IR.md

* update per comments.

* Some changes around the behavior of optional inference inputs.

1. Use pass-by-value for optional inference inputs.
2. Due to the semantic of GraphCall, we implicitly force trainable
   inputs to be added into inference graph's input list.

Revise docs

* Update spec per WG discussion

* update_binding is optional now because the user might only want to store initialization

* Polish doc

* Address comments. Polish words.

* Use an alternative field to declare global variables.
In yesterday's Operator SIG meeting, we agreed to still
put global variables in the inference graph and add a
model-level field to indicate global variables. This way
we have a smaller impact on the inference engines, because
they don't need to move trainable tensors to a new field.

* polish docs

* Allow training initializers to be promoted to global & mutable variables

* Merge the functions of global_mutable_initializer_names into update_binding

* Polish docs

* Remove restriction on using ai.onnx.training in the inference graph

* Split training register from ai.onnx register file

Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
Co-authored-by: Ke Zhang <kezhan@microsoft.com>
jcwchen pushed a commit to jcwchen/onnx that referenced this pull request Sep 23, 2020
ONNX Training proposal.