[GSoC] Adding optimization features not related with my GSoC project#1070
[GSoC] Adding optimization features not related with my GSoC project#1070zoq merged 10 commits intomlpack:masterfrom
Conversation
…d RNNs GSoC project
| */ | ||
| GradientClipping(const double minGradient, | ||
| const double maxGradient, | ||
| UpdatePolicy updatePolicy) : |
There was a problem hiding this comment.
Should we pass the update policy by reference? Also, I think we should add:
//! Get the update policy.
UpdatePolicyType& UpdatePolicy() const { return updatePolicy; }
//! Modify the update policy.
UpdatePolicyType& UpdatePolicy() { return updatePolicy; }
to modify the wrapped policy.
There was a problem hiding this comment.
Likewise, it looks reasonable to add MinGradient and MaxGradient methods (done in the last commit)
| const arma::mat& gradient) | ||
| { | ||
| // First, clip the gradient. | ||
| gradient.transform( |
There was a problem hiding this comment.
Haven't tested it, but I guess using transform is faster as arma::clamp.
There was a problem hiding this comment.
Didn't know about clamp, implementing. By the way, it also resolves the issue with const reference.
There was a problem hiding this comment.
Speaking about performance, they should have the comparable performance, as they're both element-wise operations (if transform is parallelized, which I think is true)
There was a problem hiding this comment.
Benchmarked that one. Here's what I've got: https://gist.github.com/partobs-mdp/02b3bb93be0496ad6866528359d5ba3d
There was a problem hiding this comment.
Thanks for taking a look into the issue; looks like transform is slightly faster, but I think it's negligible.
| */ | ||
| void Update(arma::mat& iterate, | ||
| const double stepSize, | ||
| const arma::mat& gradient) |
There was a problem hiding this comment.
gradient.transformshould fail, if we pass the gradient is const.
There was a problem hiding this comment.
Another option is to relax the const restriction and allow the UpdatePolicy to modify the gradient when Update() is called. I don't think it would be a problem to do that, as long as we document that the UpdatePolicy is allowed to do so. And it would avoid the copy too. :)
There was a problem hiding this comment.
Agreed, that is another good option, @partobs-mdp what do you think?
There was a problem hiding this comment.
Sure, but should we do this? As we have seen, the clamp performance is not much of an issue, and (imho) we shouldn't break the natural assumption that the update policy doesn't break the variable which stores the gradient.
There was a problem hiding this comment.
Right, the performance difference is insignificant, we could use arma::clamp here, no need to change it if you don't think that's a good idea. We just wanted to point out there is another solution.
src/mlpack/tests/sgd_test.cpp
Outdated
| SGDTestFunction f; | ||
| VanillaUpdate vanillaUpdate; | ||
| GradientClipping<VanillaUpdate> update(-3., +3., vanillaUpdate); | ||
| StandardSGD s(0.0003, 5000000, 1e-9, true); |
There was a problem hiding this comment.
We should use SGD<GradientClipping<VanillaUpdate> > s(..., update); here, right now we use the standard update policy, without gradient clipping.
src/mlpack/tests/sgd_test.cpp
Outdated
| { | ||
| SGDTestFunction f; | ||
| VanillaUpdate vanillaUpdate; | ||
| GradientClipping<VanillaUpdate> update(-3., +3., vanillaUpdate); |
There was a problem hiding this comment.
Pedantic style issue; I think using 3.0 instead of 3. would slightly improve the readability.
| @@ -39,6 +41,23 @@ BOOST_AUTO_TEST_CASE(SimpleSGDTestFunction) | |||
| BOOST_REQUIRE_SMALL(coordinates[2], 1e-7); | |||
| } | |||
|
|
|||
There was a problem hiding this comment.
Might be a good idea to test the policy independently from the optimizer, we could simply test if the gradient is clipped correctly.
There was a problem hiding this comment.
I also thought about this, but how could we extract the gradient history from SGD optimizer object?
| const arma::Mat<eT>&& target, | ||
| arma::Mat<eT>&& output) | ||
| { | ||
| output = (1. - target) / (1. - input + 1e-2) - target / (input + 1e-2); |
There was a problem hiding this comment.
1e-2 is somewhat hight, I would use something like 1e-10.
There was a problem hiding this comment.
Introduced the eps parameter for handling this kind of trade-off without breaking the code.
| typename InputDataType = arma::mat, | ||
| typename OutputDataType = arma::mat | ||
| > | ||
| class CrossEntropyError |
There was a problem hiding this comment.
Can we add a simple test for the cross entropy error function?
| arma::mat coordinates = arma::zeros(3, 3); | ||
| // Setting step = 1 to make math easy. | ||
| double stepSize = 1.0; | ||
| arma::mat dummyGradient = { {-6, +6, 0}, {1, 2, 3}, {-3, 0, +4} }; |
There was a problem hiding this comment.
To make the windows build happy you could use:
mat input;
input << 1 << 2 << 3 << endr
<< 4 << 5 << 6 << endr
<< 7 << 8 << 9
<< ...;
or
mat input("1 2 3; \
4 5 6 \
...");
rcurtin
left a comment
There was a problem hiding this comment.
Looks good to me so far, just some minor comments to address from my end. Overall I think the design is fine, but we should definitely add a test for the CrossEntropyLayer like Marcus suggested.
| * @param minGradient Minimum gradient value | ||
| * (affects optimization iff clipGradient flag is on). | ||
| * @param maxGradient Maximum gradient value | ||
| * (affects optimization iff clipGradient flag is on). |
There was a problem hiding this comment.
I think the last three comments here are not needed anymore. :)
| */ | ||
| void Update(arma::mat& iterate, | ||
| const double stepSize, | ||
| const arma::mat& gradient) |
There was a problem hiding this comment.
Another option is to relax the const restriction and allow the UpdatePolicy to modify the gradient when Update() is called. I don't think it would be a problem to do that, as long as we document that the UpdatePolicy is allowed to do so. And it would avoid the copy too. :)
| double& Eps() { return eps; } | ||
|
|
||
| /** | ||
| * Serialize the layer |
There was a problem hiding this comment.
Very pedantic comment: can you add a period at the end of the sentence? :)
| const arma::Mat<eT>&& input, const arma::Mat<eT>&& target) | ||
| { | ||
| return -arma::accu(target % arma::log(input + eps) + | ||
| (1. - target) % arma::log(1. - input + eps)); |
There was a problem hiding this comment.
It's a little late so I'm not sure I'm thinking 100% clearly, but this appears that it will work in the multiclass setting as long as the input and target matrices are one-hot encoded. So labels like [0 2 1] will not work but labels like [[1 0 0] [0 0 1] [0 1 0]] will. Correct me if I am wrong. (i.e., I think this is right and works the way I would expect it to.)
There was a problem hiding this comment.
Well, it won't work as it stands, but the computations in the case you've mentioned would be easier: -arma::accu(target % arma::log(input + eps)) (the formula get easier due to data representation redundancy)
There was a problem hiding this comment.
Fair enough; in this case, can you clarify the limitation on how the labels should be in the documentation for the class? We should definitely support multiclass cross-entropy at some point, so if you don't want to do that here that's ok, but in that case could I ask you to open a new issue for it, detailing basically what needs to be done and where someone could look to get started with it?
…ropy performance function
|
|
||
| //! Get the update policy. | ||
| const UpdatePolicyType& UpdatePolicy() const { return updatePolicy; } | ||
| UpdatePolicyType UpdatePolicy() const { return updatePolicy; } |
There was a problem hiding this comment.
We should return a reference instead of a copy here.
| */ | ||
| void Update(arma::mat& iterate, | ||
| const double stepSize, | ||
| const arma::mat& gradient) |
There was a problem hiding this comment.
Agreed, that is another good option, @partobs-mdp what do you think?
| Archive& /* ar */, | ||
| const unsigned int /* version */) | ||
| { | ||
| // Nothing to do here. |
There was a problem hiding this comment.
We should serialize eps here, since it's now a parameter.
| const arma::Mat<eT>&& target, | ||
| arma::Mat<eT>&& output) | ||
| { | ||
| output = (1. - target) / (1. - input + eps) - target / (input + eps); |
There was a problem hiding this comment.
Not sure, if the compile would optimize that away, but we could rewrite the expression as:
output = (target - input) / ((x - 1) % x); and save one extra division and the eps addition.
There was a problem hiding this comment.
Checked it on a piece of paper and got the precise expression we should get after simplifying the original expression so that it would still be the gradient of the loss function with log(x + eps):
output = (input - target + eps * (1. - 2 * target)) / ((1. - input + eps) % (input + eps))
This one, however, doesn't really optimize much (or, at least, I think so), because even though it runs only one (element-wise) division, it runs three multiplications, which are also slow (as compared with additions - that's why I didn't count them).
There was a problem hiding this comment.
Right, I guess the only benefit I see is that we could avoid adding eps since: ((x - 1) % x) should be stable. Anyway, I don't mind to leave it as it is.
|
What's wrong with the Internet connection on Travis CI? It didn't even manage to install boost from apt :'( |
|
Let me restart the build. |
|
Is there anything else that should be done on this PR on my side? |
zoq
left a comment
There was a problem hiding this comment.
Looks ready for me, I'll wait 3 days for the merge, in case anyone has any more comments.
| // the gradient from the momentum, which gives 2 * gradient value | ||
| // for the momentum on that step. Adding that to the gradient which | ||
| // was subtracted earlier yiels the 3 * gradient in the following check. | ||
| BOOST_REQUIRE_SMALL(arma::abs(coordinates - 3 * targetCoordinates).max(), |
There was a problem hiding this comment.
Really pedantic style issue; I would probably write:
BOOST_REQUIRE_SMALL(
arma::abs(coordinates - targetCoordinates).max(), 1e-7);
| /** | ||
| * The cross-entropy performance function measures the network's | ||
| * performance according to the cross-entropy | ||
| * between the input and target distributions. |
There was a problem hiding this comment.
I think you can reduce the number of lines here, it seems like they are nowhere near as long as 80 characters. :) (I think this also applies elsewhere)
src/mlpack/tests/sgd_test.cpp
Outdated
| #include <mlpack/core.hpp> | ||
| #include <mlpack/core/optimizers/sgd/sgd.hpp> | ||
| #include <mlpack/core/optimizers/sgd/update_policies/gradient_clipping.hpp> | ||
| #include <mlpack/core/optimizers/sgd/update_policies/vanilla_update.hpp> |
There was a problem hiding this comment.
Is there a need to add these here if they aren't being used by the tests?
rcurtin
left a comment
There was a problem hiding this comment.
Code looks good, nothing more from my side. Thanks for splitting this out from the other PR so we can merge it more quickly. :)
|
Thanks for the contributions! |
This PR is part of my GSoC project "Augmented RNNs".
Implemented:
CrossEntropyLayerfor evaluating the performance of the model on binary vector targets.As far as I understand, the conversation related to these two points (including but not limited to the reusable update API for gradient clipping) is transferred here.