[WIP][Training] Draft of Gradient Operator #2168
Conversation
+------+--> Gradient(xs=["X", "W"], y=["Y"]) ---> dO/dX (1st output of Gradient)
                                             |
                                             '---> dO/dW (2nd output of Gradient)
I suggest replacing "dO/dW" by a valid identifier, like "dO_dW" or "first_order_gradient".
[second-order diagram truncated; its 2nd output of Gradient is d^2O/dW^2]
Suggest using a valid identifier (as above).
Hmm, but d(dO/dW)/dX would be changed to something like d_dO_dW_dX and would become less readable.
I see your point. But my concern is that some users will copy this blindly and use "dO/dW" as a tensor name. This can lead to potential problems downstream. Technically, the ONNX documentation says that tensor names should be valid C identifiers. If the ONNX checker were to enforce this, that would be somewhat helpful. But it does not seem to do that. (I remember Niklas was trying to add such a check ... but it doesn't seem to exist today … not sure why. Possibly because a number of models were violating this then.) We should probably sort this out one way or the other in the ONNX standard. Until then, I think it would be better to try to stick to the identifier restriction.
We could simply use names like D1 and D2 to denote the first and second derivative in the example (and in the text we could say D1 denotes dO/dW, for example).
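For concreteness, a minimal Python sketch (using onnx.helper; the domain name and exact attribute shapes are assumptions based on this draft and may not match the final spec) of the Gradient node from the first diagram, with valid-identifier output names such as dO_dX/dO_dW (or simply D1/D2):

from onnx import helper

# Hypothetical Gradient node mirroring the first diagram above: differentiate the
# tensor named by attribute "y" with respect to the tensors listed in "xs".
grad_node = helper.make_node(
    "Gradient",
    inputs=["X", "W"],            # tensors fed to the (sub-)graph identified by xs/y
    outputs=["dO_dX", "dO_dW"],   # valid identifiers instead of "dO/dX", "dO/dW"
    xs=["X", "W"],                # independent variables of the (sub-)graph
    y="Y",                        # target to differentiate (the draft diagram writes y=["Y"])
    domain="ai.onnx.training",    # domain used by this draft; assumption
)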
namespace ONNX_NAMESPACE {
static const char* Gradient_ver12_doc = R"DOC(
Gradient operator computes the derivatives of some tensors with respect to
It might be worth adding an explanatory note somewhere, something like "The gradient operator is slightly different from the standard operators. It is conceptually a higher-order operator that takes a computation-graph as an attribute, but we abuse the attribute mechanism to describe the computation-sub-graph implicitly by listing the inputs and output of the computation-sub-graph in the current graph."
As mentioned above, the attributes "xs" and "y" are used to identify a graph,
and we can feed different tensors to the identified graph. For
example, one can compute the gradient of Y with respect to H by substituting
Are there typos here? I am a bit confused, because I am not sure what Y_1 and H_1 are, and the picture below shows W_1 and Z_1. Are you trying to say that it is not necessary for the inputs to be identical to the list xs specified in the attributes?
If the above is the case, it would be helpful to add that "The implementation can be optimized in the common case where the inputs are identical to the attributes." … because if the inputs are not identical to the attributes, the implementation has to do extra work to create a duplicate of the forward-graph for the actual-inputs.
Do you think we need to allow "if the inputs are not identical to the attributes, the implementation has to do extra work to create a duplicate of the forward-graph for the actual-inputs"? Looks like it's not necessary.
If we don't support it, that's okay with me (since I can't immediately think of use-cases where we would want to use it). But, then, the documentation should make this explicit (and implementations should check that the input-list and attribute-list match).
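To make the last point concrete, a small hedged sketch of the kind of check an implementation might perform if input substitution is not supported (hypothetical function, not part of the draft; node is an onnx NodeProto):

# Verify that a Gradient node's runtime input list matches its "xs" attribute
# exactly, as suggested in the comment above.
def check_gradient_inputs(node):
    xs_attr = next(a for a in node.attribute if a.name == "xs")
    xs = [s.decode("utf-8") for s in xs_attr.strings]
    if list(node.input) != xs:
        raise ValueError(
            "Gradient inputs %s must match the xs attribute %s" % (list(node.input), xs)
        )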
| "contains and only contains the necessary inputs of a " | ||
| "(sub-)graph. Variables (usually called intermediate " | ||
| "variables) which can be generated by inputs cannot be " | ||
| "included in this attribute.", |
Is it correct to say the following? If xs=[X_1, ..., X_m], then no X_i can depend on X_j (i <> j): that is, there cannot be a dependence-path in the graph from X_j to X_i? Furthermore, do we need to say the list in xs must be complete? If we have Y = X_1 + X_2, is it valid or invalid to specify xs=["X_1"]?
Yes. The inputs named in xs must be independent variables; that is, they cannot be upstream variables of each other.
The list must be complete. Otherwise, we cannot finish a forward pass due to a missing input (in your example, we cannot compute Y without X_2).
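A small sketch of that rule using the Y = X_1 + X_2 example (hypothetical tensor names; same assumptions about the draft's domain and attributes as above):

from onnx import helper

# Forward (sub-)graph: Y = X_1 + X_2
add_node = helper.make_node("Add", inputs=["X_1", "X_2"], outputs=["Y"])

# Valid: xs is complete and lists only independent inputs.
grad_ok = helper.make_node(
    "Gradient",
    inputs=["X_1", "X_2"], outputs=["dY_dX1", "dY_dX2"],
    xs=["X_1", "X_2"], y="Y", domain="ai.onnx.training",
)

# Invalid per the answer above: xs=["X_1"] is incomplete (Y cannot be computed
# without X_2), and intermediate tensors generated from inputs must not appear in xs.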
newling left a comment
Is it the case that "xs" is always the names of the inputs, and if so why is
it necessary to have both? It seems like one or the other is redundant.
So, why is an attribute-less definition like this,
input i = 0:N-1 : see outputs definition
input N : see outputs definition
outputs i = 0:N-1 : d(input_N)/d(input_i)
attributes : none.
not enough?
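For comparison, the attribute-less form sketched above would presumably be constructed like this (purely hypothetical; it is not the form proposed in this draft):

from onnx import helper

# Hypothetical attribute-less Gradient: the last input names the target tensor,
# the remaining inputs name the independent variables, and output i is d(input_N)/d(input_i).
grad_no_attrs = helper.make_node(
    "Gradient",
    inputs=["X", "W", "Y"],       # final input is the target tensor Y
    outputs=["dY_dX", "dY_dW"],
    domain="ai.onnx.training",    # assumption
)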
* ONNX Training proposal.
Major changes:
1. Add a protobuf message, `TrainingInfoProto` originally designed in
#2013, to store training information.
2. In `TrainingInfoProto`, the user can store training algorithm in
`algorithm` field as a `GraphProto`.
3. The user can also store initialization algorithm for resetting the
model in `TrainingInfoProto.initialization` (proposed by @tbennun in
#2517 and agreed by Training WG).
4. `ModelProto.graph` is callable inside `TrainingInfoProto.algorithm`.
`ModelProto.graph.initializer` are visible to nodes in
`TrainingInfoProto.algorithm.node`.
5. This PR also introduces a `Gradient` operator to differentiate a
function represented by a (sub-)graph. This idea is from #2168.
Contribution list:
Baihan Huang: spec design.
Tal Ben-Nun: model initialization design.
Wei-Sheng Chin: spec design, Gradient operator design.
Jonny Shipton and active WG members and participants: many valuable comments and reviews.
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
* Address comments
* Address a comment
* Move Gradient to ai.onnx.training
Update Gradient test models
* Address comments
1. Create initialization_binding instead of
using update_binding for initialization.
2. Swap key and value in update_binding.
3. Refine documents accordingly.
* Clarify semantics of algorithm and initialization
* Fix typos
* Address comment and explain the two computation modes of ModelProto.training_info
* Fix typo and explain default behavior
* Update onnx/checker.cc
Co-Authored-By: Jonny Shipton <tmvector@gmail.com>
* Address comments
* Make normalization_binding a repeated field
* Add GraphCall operator
* Polish GraphCall
* GraphCall now uses position to map inputs and outputs
* Address comments:
1. Clarify GraphCall's semantics.
2. Implicitly force trainable tensors to be inference graph's inputs.
3. Training operators cannot be called in the inference graph.
* Add accidentally removed changes back
* Use protobuf lite
* Polish the helper script
* Fix windows build and polish helper script
* Fix linux and mac builds
* One more line
* fix the attribute types section in IR.md (#2590)
* fix the attribute types section in IR.md
* update per comments.
* Some changes around the behavior of optional inference inputs.
1. Use pass-by-value for optional inference inputs.
2. Due to the semantics of GraphCall, we implicitly force trainable
inputs to be added into inference graph's input list.
Revise docs
* Update spec per WG discussion
* update_binding is optional now because the user might only want to store initialization
* Polish doc
* Address comments. Polish words.
* Use an alternative field to declare global variables.
In yesterday's Operator SIG meeting, we agreed to still
put global variables in the inference graph and add a
model-level field to indicate global variables. This way
we have a smaller impact on the inference engines, because
they don't need to move trainable tensors to a new field.
* polish docs
* Allow training initializers to be promoted to global & mutable variables
* Merge the functions of global_mutable_initializer_names into update_binding
* Polish docs
* Remove restriction on using ai.onnx.training in the inference graph
* Split training register from ai.onnx register file
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Co-authored-by: Jonny Shipton <tmvector@gmail.com>
Co-authored-by: Ke Zhang <kezhan@microsoft.com>
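As a rough illustration of how the pieces described in the commit messages above fit together, a hedged Python sketch (assuming the field names of the merged proposal, TrainingInfoProto.algorithm and update_binding, and a hypothetical model file model.onnx; the key/value direction of update_binding changed during review, so check the proto comments):

import onnx
from onnx import helper

model = onnx.load("model.onnx")  # assumed path to an existing inference model

training_info = onnx.TrainingInfoProto()
# `algorithm` holds one training step as a GraphProto; its nodes may invoke the
# inference graph (e.g. via GraphCall) and use the Gradient operator.
training_info.algorithm.CopyFrom(
    helper.make_graph(nodes=[], name="training_step", inputs=[], outputs=[])
)
# `update_binding` pairs a mutable state tensor with the algorithm output that
# supplies its new value after one step.
binding = training_info.update_binding.add()
binding.key, binding.value = "W", "W_new"

model.training_info.append(training_info)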
PR #2314 is a single place for reviewing the whole training story.
In TensorFlow and PyTorch, most models are trained using backpropagation and gradient-based optimization algorithms. This operator provides a way to compute gradients given an inference graph.
Here is a similar operator in PyTorch.
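The PyTorch counterpart meant here is presumably torch.autograd.grad, which likewise takes the target tensor and the tensors to differentiate with respect to:

import torch

X = torch.randn(3, requires_grad=True)
W = torch.randn(3, requires_grad=True)
Y = (X * W).sum()

# Analogous to Gradient(xs=["X", "W"], y="Y"): returns (dY/dX, dY/dW).
dY_dX, dY_dW = torch.autograd.grad(Y, [X, W])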