Skip to content

[GSoC] Adding optimization features not related with my GSoC project#1070

Merged
zoq merged 10 commits intomlpack:masterfrom
sidorov-ks:grad-opt
Jul 29, 2017
Merged

[GSoC] Adding optimization features not related with my GSoC project#1070
zoq merged 10 commits intomlpack:masterfrom
sidorov-ks:grad-opt

Conversation

@sidorov-ks
Copy link
Copy Markdown
Contributor

This PR is part of my GSoC project "Augmented RNNs".

Implemented:

As far as I understand, the conversation related to these two points (including but not limited to the reusable update API for gradient clipping) is transferred here.

*/
GradientClipping(const double minGradient,
const double maxGradient,
UpdatePolicy updatePolicy) :
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we pass the update policy by reference? Also, I think we should add:

//! Get the update policy.
UpdatePolicyType& UpdatePolicy() const { return updatePolicy; }
//! Modify the update policy.
UpdatePolicyType& UpdatePolicy() { return updatePolicy; }

to modify the wrapped policy.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, it looks reasonable to add MinGradient and MaxGradient methods (done in the last commit)

const arma::mat& gradient)
{
// First, clip the gradient.
gradient.transform(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't tested it, but I guess using transform is faster as arma::clamp.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't know about clamp, implementing. By the way, it also resolves the issue with const reference.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking about performance, they should have the comparable performance, as they're both element-wise operations (if transform is parallelized, which I think is true)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Benchmarked that one. Here's what I've got: https://gist.github.com/partobs-mdp/02b3bb93be0496ad6866528359d5ba3d

Copy link
Copy Markdown
Member

@zoq zoq Jul 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look into the issue; looks like transform is slightly faster, but I think it's negligible.

*/
void Update(arma::mat& iterate,
const double stepSize,
const arma::mat& gradient)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gradient.transformshould fail, if we pass the gradient is const.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to relax the const restriction and allow the UpdatePolicy to modify the gradient when Update() is called. I don't think it would be a problem to do that, as long as we document that the UpdatePolicy is allowed to do so. And it would avoid the copy too. :)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, that is another good option, @partobs-mdp what do you think?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but should we do this? As we have seen, the clamp performance is not much of an issue, and (imho) we shouldn't break the natural assumption that the update policy doesn't break the variable which stores the gradient.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, the performance difference is insignificant, we could use arma::clamp here, no need to change it if you don't think that's a good idea. We just wanted to point out there is another solution.

SGDTestFunction f;
VanillaUpdate vanillaUpdate;
GradientClipping<VanillaUpdate> update(-3., +3., vanillaUpdate);
StandardSGD s(0.0003, 5000000, 1e-9, true);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use SGD<GradientClipping<VanillaUpdate> > s(..., update); here, right now we use the standard update policy, without gradient clipping.

{
SGDTestFunction f;
VanillaUpdate vanillaUpdate;
GradientClipping<VanillaUpdate> update(-3., +3., vanillaUpdate);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pedantic style issue; I think using 3.0 instead of 3. would slightly improve the readability.

@@ -39,6 +41,23 @@ BOOST_AUTO_TEST_CASE(SimpleSGDTestFunction)
BOOST_REQUIRE_SMALL(coordinates[2], 1e-7);
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a good idea to test the policy independently from the optimizer, we could simply test if the gradient is clipped correctly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also thought about this, but how could we extract the gradient history from SGD optimizer object?

const arma::Mat<eT>&& target,
arma::Mat<eT>&& output)
{
output = (1. - target) / (1. - input + 1e-2) - target / (input + 1e-2);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1e-2 is somewhat hight, I would use something like 1e-10.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Introduced the eps parameter for handling this kind of trade-off without breaking the code.

typename InputDataType = arma::mat,
typename OutputDataType = arma::mat
>
class CrossEntropyError
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a simple test for the cross entropy error function?

arma::mat coordinates = arma::zeros(3, 3);
// Setting step = 1 to make math easy.
double stepSize = 1.0;
arma::mat dummyGradient = { {-6, +6, 0}, {1, 2, 3}, {-3, 0, +4} };
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make the windows build happy you could use:

mat input;
input << 1 << 2 << 3 << endr
      << 4 << 5 << 6 << endr
      << 7 << 8 << 9
      << ...;

or

mat input("1 2 3; \
           4 5 6 \
          ...");

Copy link
Copy Markdown
Member

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me so far, just some minor comments to address from my end. Overall I think the design is fine, but we should definitely add a test for the CrossEntropyLayer like Marcus suggested.

* @param minGradient Minimum gradient value
* (affects optimization iff clipGradient flag is on).
* @param maxGradient Maximum gradient value
* (affects optimization iff clipGradient flag is on).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the last three comments here are not needed anymore. :)

*/
void Update(arma::mat& iterate,
const double stepSize,
const arma::mat& gradient)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to relax the const restriction and allow the UpdatePolicy to modify the gradient when Update() is called. I don't think it would be a problem to do that, as long as we document that the UpdatePolicy is allowed to do so. And it would avoid the copy too. :)

double& Eps() { return eps; }

/**
* Serialize the layer
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very pedantic comment: can you add a period at the end of the sentence? :)

const arma::Mat<eT>&& input, const arma::Mat<eT>&& target)
{
return -arma::accu(target % arma::log(input + eps) +
(1. - target) % arma::log(1. - input + eps));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little late so I'm not sure I'm thinking 100% clearly, but this appears that it will work in the multiclass setting as long as the input and target matrices are one-hot encoded. So labels like [0 2 1] will not work but labels like [[1 0 0] [0 0 1] [0 1 0]] will. Correct me if I am wrong. (i.e., I think this is right and works the way I would expect it to.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it won't work as it stands, but the computations in the case you've mentioned would be easier: -arma::accu(target % arma::log(input + eps)) (the formula get easier due to data representation redundancy)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough; in this case, can you clarify the limitation on how the labels should be in the documentation for the class? We should definitely support multiclass cross-entropy at some point, so if you don't want to do that here that's ok, but in that case could I ask you to open a new issue for it, detailing basically what needs to be done and where someone could look to get started with it?


//! Get the update policy.
const UpdatePolicyType& UpdatePolicy() const { return updatePolicy; }
UpdatePolicyType UpdatePolicy() const { return updatePolicy; }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should return a reference instead of a copy here.

*/
void Update(arma::mat& iterate,
const double stepSize,
const arma::mat& gradient)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, that is another good option, @partobs-mdp what do you think?

Archive& /* ar */,
const unsigned int /* version */)
{
// Nothing to do here.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should serialize eps here, since it's now a parameter.

const arma::Mat<eT>&& target,
arma::Mat<eT>&& output)
{
output = (1. - target) / (1. - input + eps) - target / (input + eps);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, if the compile would optimize that away, but we could rewrite the expression as:

output = (target - input) / ((x - 1) % x); and save one extra division and the eps addition.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked it on a piece of paper and got the precise expression we should get after simplifying the original expression so that it would still be the gradient of the loss function with log(x + eps):

output = (input - target + eps * (1. - 2 * target)) / ((1. - input + eps) % (input + eps))

This one, however, doesn't really optimize much (or, at least, I think so), because even though it runs only one (element-wise) division, it runs three multiplications, which are also slow (as compared with additions - that's why I didn't count them).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I guess the only benefit I see is that we could avoid adding eps since: ((x - 1) % x) should be stable. Anyway, I don't mind to leave it as it is.

@sidorov-ks
Copy link
Copy Markdown
Contributor Author

What's wrong with the Internet connection on Travis CI? It didn't even manage to install boost from apt :'(

@zoq
Copy link
Copy Markdown
Member

zoq commented Jul 25, 2017

Let me restart the build.

@sidorov-ks
Copy link
Copy Markdown
Contributor Author

Is there anything else that should be done on this PR on my side?

Copy link
Copy Markdown
Member

@zoq zoq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ready for me, I'll wait 3 days for the merge, in case anyone has any more comments.

// the gradient from the momentum, which gives 2 * gradient value
// for the momentum on that step. Adding that to the gradient which
// was subtracted earlier yiels the 3 * gradient in the following check.
BOOST_REQUIRE_SMALL(arma::abs(coordinates - 3 * targetCoordinates).max(),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really pedantic style issue; I would probably write:

BOOST_REQUIRE_SMALL(
    arma::abs(coordinates - targetCoordinates).max(), 1e-7);

/**
* The cross-entropy performance function measures the network's
* performance according to the cross-entropy
* between the input and target distributions.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can reduce the number of lines here, it seems like they are nowhere near as long as 80 characters. :) (I think this also applies elsewhere)

#include <mlpack/core.hpp>
#include <mlpack/core/optimizers/sgd/sgd.hpp>
#include <mlpack/core/optimizers/sgd/update_policies/gradient_clipping.hpp>
#include <mlpack/core/optimizers/sgd/update_policies/vanilla_update.hpp>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a need to add these here if they aren't being used by the tests?

Copy link
Copy Markdown
Member

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good, nothing more from my side. Thanks for splitting this out from the other PR so we can merge it more quickly. :)

@zoq zoq merged commit 5c68061 into mlpack:master Jul 29, 2017
@zoq
Copy link
Copy Markdown
Member

zoq commented Jul 29, 2017

Thanks for the contributions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants