
[WIP][Training] Draft of Gradient Operator #2168

Closed
wschin wants to merge 8 commits into onnx:master from wschin:grad

Conversation

Collaborator

@wschin commented Jul 14, 2019

PR #2314 is a single place for reviewing the whole training story.

In TensorFlow and PyTorch, most models are trained using backpropagation and gradient-based optimization algorithms. This operator provides a way to compute gradients given an inference graph.

  • Signature and documentation
  • Shape inference code
  • Reference implementation
  • Tests
  • Verification script

Here is a similar operator in PyTorch.
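For illustration, here is a minimal, hypothetical sketch (not part of this PR) of how such a node could be constructed with the onnx Python helpers, using the draft attribute names xs and y from this proposal:

```python
from onnx import helper

# Hypothetical sketch of a Gradient node as drafted in this PR:
# xs names the independent variables, y names the target(s), and the
# node produces one gradient output per entry of xs.
grad_node = helper.make_node(
    "Gradient",
    inputs=["X", "W"],           # tensors fed to the identified (sub-)graph
    outputs=["dY_dX", "dY_dW"],  # d(Y)/d(X) and d(Y)/d(W)
    xs=["X", "W"],
    y=["Y"],
)
```

The output names here are deliberately valid identifiers, anticipating the naming discussion in the review below.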

@wschin wschin requested a review from a team as a code owner July 14, 2019 08:11
Comment thread onnx/defs/training/defs.cc Outdated
(Quoted diagram snippet: Gradient(xs=["X", "W"], y=["Y"]) produces dO/dX as its 1st output and dO/dW as its 2nd output.)
Contributor

I suggest replacing "dO/dW" with a valid identifier, like "dO_dW" or "first_order_gradient".

(Quoted diagram snippet: d^2O/dW^2 is the 2nd output of Gradient.)
Contributor

Suggest using a valid identifier (as above).

Collaborator Author

Hmm, but d(dO/dW)/dX would be changed to something like d_dO_dW_dX and become less readable.

Contributor

I see your point. But my concern is that some users will copy this blindly and use "dO/dW" as a tensor name. This can lead to potential problems downstream. Technically, the ONNX documentation says that tensor names should be valid C identifiers. If the ONNX checker were to enforce this, that would be somewhat helpful. But it does not seem to do that. (I remember Niklas was trying to add such a check ... but it doesn't seem to exist today ... not sure why. Possibly because a number of models were violating this then.) We should probably sort this out one way or the other in the ONNX standard. Until then, I think it would be better to try to stick to the identifier restriction.

We could simply use names like D1 and D2 to denote the first and second derivative in the example (and in the text we could say D1 denotes dO/dW, for example).
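As an aside, the identifier rule under discussion is easy to state in code; a minimal sketch (hypothetical, not the actual onnx checker) might look like:

```python
import re

# Hypothetical sketch of the rule above: tensor names should be valid
# C identifiers (letters, digits, underscores; not starting with a digit).
_C_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def is_valid_tensor_name(name: str) -> bool:
    return bool(_C_IDENTIFIER.match(name))

assert is_valid_tensor_name("dO_dW")
assert not is_valid_tensor_name("dO/dW")  # '/' is not a valid identifier character
```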

Comment thread onnx/defs/training/defs.cc Outdated

namespace ONNX_NAMESPACE {
static const char* Gradient_ver12_doc = R"DOC(
Gradient operator computes the derivatives of some tensors with respect to
Contributor

It might be worth adding an explanatory note somewhere, something like "The gradient operator is slightly different from the standard operators. It is conceptually a higher-order operator that takes a computation-graph as an attribute, but we abuse the attribute mechanism to describe the computation-sub-graph implicitly by listing the inputs and outputs of the computation-sub-graph in the current graph."

Comment thread onnx/defs/training/defs.cc Outdated

As mentioned above, the attributes "xs" and "y" are used to identify a graph,
and we can feed different tensors to the identified graph. For example,
one can compute the gradient of Y with respect to H by substituting
Contributor

Are there typos here? I am a bit confused, because I am not sure what Y_1 and H_1 are, and the picture below shows W_1 and Z_1. Are you trying to say that it is not necessary for the inputs to be identical to the list xs specified in the attributes?

Contributor

If the above is the case, it would be helpful to add that "The implementation can be optimized in the common case where the inputs are identical to the attributes." ... because if the inputs are not identical to the attributes, the implementation has to do extra work to create a duplicate of the forward-graph for the actual-inputs.

Collaborator Author

Do you think we need to allow

 if the inputs are not identical to the attributes, the implementation has to do extra work to create a duplicate of the forward-graph for the actual-inputs

? Looks like it's not necessary.

Contributor

If we don't support it, that's okay with me (since I can't immediately think of use-cases where we would want to use it). But, then, the documentation should make this explicit (and implementations should check that the input-list and attribute-list match).
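For concreteness, a hypothetical sketch of the check suggested here (names and structure assumed; not actual checker code):

```python
from onnx import NodeProto

def check_gradient_inputs_match_xs(node: NodeProto) -> None:
    # Hypothetical check: a Gradient node's input list must be identical
    # to its "xs" attribute, so implementations never need to duplicate
    # the forward graph for different actual inputs.
    xs = []
    for attr in node.attribute:
        if attr.name == "xs":
            xs = [s.decode("utf-8") for s in attr.strings]
    if list(node.input) != xs:
        raise ValueError(
            f"Gradient inputs {list(node.input)} do not match xs {xs}"
        )
```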

Comment thread onnx/defs/training/defs.cc Outdated
"contains and only contains the necessary inputs of a "
"(sub-)graph. Variables (usually called intermediate "
"variables) which can be generated by inputs cannot be "
"included in this attribute.",
Contributor

Is it correct to say the following? If xs=[X_1, ..., X_m], then no X_i can depend on X_j (i <> j): that is, there cannot be a dependence-path in the graph from X_j to X_i? Furthermore, do we need to say the list in xs must be complete? If we have Y = X_1 + X_2, is it valid or invalid to specify xs=["X_1"]?

Collaborator Author

Yes. The inputs named in xs must be independent variables; that is, they cannot be upstream variables of each other.

The list must be complete. Otherwise, we cannot complete a forward pass due to a missing input (in your example, we cannot compute Y without X_2).
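To make these two rules concrete, here is a hypothetical sketch, assuming a simple map from each tensor name to the names it is computed from:

```python
# Hypothetical sketch of the independence rule above: no entry of xs may be
# reachable (upstream) from another entry of xs.
def producers_closure(name, deps):
    """All tensor names that `name` transitively depends on.
    `deps` maps a tensor name to the names it is computed from."""
    seen, stack = set(), list(deps.get(name, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(deps.get(n, []))
    return seen

def xs_are_independent(xs, deps):
    return all(
        other not in producers_closure(x, deps)
        for x in xs for other in xs if other != x
    )

# Example: H = f(X), Y = g(H, W).
deps = {"H": ["X"], "Y": ["H", "W"]}
assert xs_are_independent(["X", "W"], deps)      # valid
assert not xs_are_independent(["X", "H"], deps)  # invalid: H depends on X
```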

Comment thread onnx/defs/training/defs.cc Outdated
@CLAassistant

CLAassistant commented Jul 23, 2019

CLA assistant check
All committers have signed the CLA.

@prasanthpul prasanthpul added this to the 1.7 milestone Aug 20, 2019
@postrational postrational added topic: operator Issues related to ONNX operators topic: training Issues related to ONNX training labels Aug 23, 2019
@newling left a comment

Is it the case that "xs" is always the names of the inputs? If so, why is it
necessary to have both? It seems like one or the other is redundant.

So, why is an attribute-less definition like this,

input i = 0:N-1 : see outputs definition
input N : see outputs definition
outputs i = 0:N-1 : d(input_N)/d(input_i)

attributes : none.

not enough?
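For comparison, a hypothetical sketch of the attribute-less form proposed above (all names invented for illustration):

```python
from onnx import helper

# Hypothetical attribute-less variant: inputs 0..N-1 are the tensors to
# differentiate with respect to, input N is the differentiated target, and
# output i is d(input_N)/d(input_i). No xs/y attributes are needed.
grad_node = helper.make_node(
    "Gradient",
    inputs=["X", "W", "Y"],      # X, W are inputs 0..N-1; Y is input N
    outputs=["dY_dX", "dY_dW"],  # one gradient per differentiation target
)
```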

Comment thread docs/Operators.md
@wschin
Collaborator Author

wschin commented Jan 24, 2020

The operator Gradient is now a part of #2314, so this PR can be closed.

@wschin wschin closed this Jan 24, 2020
wschin added a commit to wschin/onnx that referenced this pull request Jan 24, 2020
Major changes:
  1. Add a protobuf message, `TrainingInfoProto`, originally designed in
     onnx#2013, to store training information.
  2. In `TrainingInfoProto`, the user can store a training algorithm in
     the `algorithm` field as a `GraphProto`.
  3. The user can also store an initialization algorithm for resetting the
     model in `TrainingInfoProto.initialization` (proposed by @tbennun in
     onnx#2517 and agreed upon by the Training WG).
  4. `ModelProto.graph` is callable inside `TrainingInfoProto.algorithm`.
     `ModelProto.graph.initializer` entries are visible to nodes in
     `TrainingInfoProto.algorithm.node`.
  5. This PR also introduces a `Gradient` operator to differentiate a
     function represented by a (sub-)graph. This idea is from onnx#2168.

Contribution list:
   Baihan Huang: spec design.
   Tal Ben-Nun: model initialization design.
   Wei-Sheng Chin: spec design, Gradient operator design.
   Jonny Shipton and active WG members and participants: many valuable comments and reviews.

Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
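For context, a minimal sketch (assuming the `TrainingInfoProto` fields as they shipped with ONNX 1.7; not code from this PR) of attaching training information to a model:

```python
import onnx
from onnx import helper

# Minimal sketch, assuming the ONNX 1.7 protobuf definitions: attach a
# TrainingInfoProto whose `algorithm` field holds a training GraphProto.
model = onnx.ModelProto()
model.ir_version = onnx.IR_VERSION

training_info = model.training_info.add()
algorithm = helper.make_graph(
    nodes=[],   # Gradient/optimizer nodes would go here
    name="training_algorithm",
    inputs=[],
    outputs=[],
)
training_info.algorithm.CopyFrom(algorithm)
```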
wschin added a commit that referenced this pull request Feb 17, 2020
* ONNX Training proposal.

Major changes:
  1. Add a protobuf message, `TrainingInfoProto`, originally designed in
     #2013, to store training information.
  2. In `TrainingInfoProto`, the user can store a training algorithm in
     the `algorithm` field as a `GraphProto`.
  3. The user can also store an initialization algorithm for resetting the
     model in `TrainingInfoProto.initialization` (proposed by @tbennun in
     #2517 and agreed upon by the Training WG).
  4. `ModelProto.graph` is callable inside `TrainingInfoProto.algorithm`.
     `ModelProto.graph.initializer` entries are visible to nodes in
     `TrainingInfoProto.algorithm.node`.
  5. This PR also introduces a `Gradient` operator to differentiate a
     function represented by a (sub-)graph. This idea is from #2168.

Contribution list:
   Baihan Huang: spec design.
   Tal Ben-Nun: model initialization design.
   Wei-Sheng Chin: spec design, Gradient operator design.
   Jonny Shipton and active WG members and participants: many valuable comments and reviews.

Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>

* Address comments

* Address a comment

* Move Gradient to ai.onnx.training

Update Gradient test models

* Address comments
1. Create initialization_binding instead of
   using update_binding for initialization.
2. Swap key and value in update_binding.
3. Refine documents accordingly.

* Clarify semantics of algorithm and initialization

* Fix typos

* Address comment and explain the two computation modes of ModelProto.training_info

* Fix typo and explain default behavior

* Update onnx/checker.cc

Co-Authored-By: Jonny Shipton <tmvector@gmail.com>

* Address comments

* Make normalization_binding a repeated field

* Add GraphCall operator

* Polish GraphCall

* GraphCall now uses position to map inputs and outputs

* Address comments:
1. Clarify GraphCall's semantics.
2. Implicitly force trainable tensors to be inference graph's inputs.
3. Training operators cannot be called in the inference graph.

* Add accidentally removed changes back

* Use protobuf lite

* Polish the helper script

* Fix windows build and polish helper script

* Fix linux and mac builds

* One more line

* fix the attribute types section in IR.md (#2590)

* fix the attribute types section in IR.md

* update per comments.

* Some changes around the behavior of optional inference inputs.

1. Use pass-by-value for optional inference inputs.
2. Due to the semantic of GraphCall, we implicitly force trainable
   inputs to be added into inference graph's input list.

Revise docs

* Update spec per WG discussion

* update_binding is optional now because the user might only want to store initialization

* Polish doc

* Address comments. Polish words.

* Use an alternative field to declare global variables.
In yesterday's Operator SIG meeting, we agreed to still
put global variables in the inference graph and add a
model-level field to indicate global variables. This way
we have a smaller impact on the inference engines, because
they don't need to move trainable tensors to a new field.

* polish docs

* Allow training initializers to be promoted to global & mutable variables

* Merge the functions of global_mutable_initializer_names into update_binding

* Polish docs

* Remove restriction on using ai.onnx.training in the inference graph

* Split training register from ai.onnx register file

Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
Co-authored-by: Ke Zhang <kezhan@microsoft.com>
jcwchen pushed a commit to jcwchen/onnx that referenced this pull request Sep 23, 2020
ONNX Training proposal.