[MRG+1] refactor NMF and add CD solver#4852
Conversation
|
I think it's awesome! And travis is failing ;) |
|
Thanks! Only 2/3 are failing :p |
|
nice. I don't have time to review though. |
sklearn/decomposition/nmf.py
|
In the discussion on NMF, you said:
Maybe we can keep the class API simple and expose a function with more options
This function adds the option to regularize W, H, or both, whereas the class does not have this option and regularizes both. Is that what you meant?
|
The idea was to do like in ridge.py with ridge_regression and Ridge (i.e. the public function is common for all solvers).
|
> (i.e. the public function is common for all solvers).
Ok I'll do that
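For reference, the `ridge_regression` / `Ridge` split could translate to NMF roughly like this. This is a minimal sketch with a placeholder multiplicative-update inner solver; the names, signatures, and the `_fit_mu` helper are illustrative, not the final API:

```python
import numpy as np

def _fit_mu(X, W, H, max_iter):
    # Placeholder inner solver: classic multiplicative updates.
    # A real CD or PG solver would plug in here instead.
    for _ in range(max_iter):
        H *= (W.T @ X) / (W.T @ W @ H + 1e-10)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-10)
    return W, H

def non_negative_factorization(X, n_components=2, solver="mu",
                               max_iter=100, random_state=0):
    """Shared public function: one entry point for every solver."""
    rng = np.random.RandomState(random_state)
    W = np.abs(rng.randn(X.shape[0], n_components))
    H = np.abs(rng.randn(n_components, X.shape[1]))
    if solver == "mu":
        return _fit_mu(X, W, H, max_iter)
    raise ValueError("unknown solver %r" % solver)

class NMF:
    """Thin estimator wrapper that delegates to the function, like Ridge."""
    def __init__(self, n_components=2, solver="mu"):
        self.n_components = n_components
        self.solver = solver

    def fit_transform(self, X):
        W, H = non_negative_factorization(X, self.n_components, self.solver)
        self.components_ = H
        return W
```

The point is that new solvers only touch the function, and the class stays a thin wrapper.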
|
I think we should remove projected gradient by v0.19. Otherwise, with your plan to add more loss functions, this will be way too much code to maintain. @vene, is it fine with you? |
I think that was the plan, I will add deprecation. |
|
Based on the plots above, which aren't surprising, yes, I fully agree that if we merge CD we can deprecate PG. |
sklearn/decomposition/nmf.py
|
I don't see where the code path for 'uniform' currently is. Is 'uniform' supposed to be different than 'random'?
Also, @mblondel's gist used as initialization a random subset of the rows/cols of the input matrix, would this make sense? We could run some benchmarks for this too.
|
'uniform' is initialization by the same value everywhere, but I removed it in this PR.
I will do a specific PR to address the initialization schemes and run some benchmarks.
And I will also look at the random subset of rows/cols.
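For what it's worth, the random-subset idea could be as simple as this (my own sketch of the interpretation, not the gist's actual code; the function name is made up):

```python
import numpy as np

def init_from_subset(X, n_components, random_state=0):
    """Sketch: seed W with random columns of X and H with random rows.

    Since the factors start from actual data, no row/column of W or H
    starts at zero (assumed motivation, to be checked in benchmarks).
    """
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    cols = rng.choice(n_features, size=n_components, replace=False)
    rows = rng.choice(n_samples, size=n_components, replace=False)
    W = X[:, cols].copy()   # (n_samples, n_components)
    H = X[rows, :].copy()   # (n_components, n_features)
    return W, H
```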
|
I remember that in one of the papers they say full rows/cols of zeros can cause problems; is that the case in this implementation?
|
With a full col/row of zeros in W or H, WtW or HHt will have a zero on its diagonal, which is a problem since we would divide by zero. To avoid that, I added a small L2 regularization (1e-15).
|
I think this is not the right way to do it. In this case, the optimal solution is probably to not update the coefficient.
|
i.e., skip the current iteration
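In plain Python, the skip would look like this. This is an illustrative sketch of one CD sweep over W with H fixed (minimizing ||X - WH||² coordinate by coordinate), not the actual Cython implementation:

```python
import numpy as np

def cd_update_W(X, W, H):
    """One coordinate-descent sweep over W (H fixed), plain-Python sketch.

    The gradient w.r.t. W[i, t] is (W @ HHt - X @ H.T)[i, t] and the
    Hessian along that coordinate is HHt[t, t].  When a whole row of H
    is zero, HHt[t, t] == 0, and instead of dividing by it (or by an
    added epsilon) we simply skip that coordinate: its update is
    undefined and leaving it alone is optimal.
    """
    HHt = H @ H.T
    XHt = X @ H.T
    n_samples, n_components = W.shape
    for t in range(n_components):
        if HHt[t, t] == 0.0:
            continue  # zero curvature: skip this coordinate entirely
        for i in range(n_samples):
            grad = np.dot(W[i], HHt[:, t]) - XHt[i, t]
            W[i, t] = max(0.0, W[i, t] - grad / HHt[t, t])
    return W
```

Each inner update is an exact 1-D minimization under the nonnegativity constraint, so the loss never increases, and the zero-diagonal case can no longer produce NaNs.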
sklearn/decomposition/cdnmf_fast.pyx
|
W is (n_samples, n_components)
H.T is (n_features, n_components)
I changed the name to be more clear:
cdef int n_samples = W.shape[0] # n_features for H update
|
really nice
just one thing, can you summarize your conclusions when you post figures?
thanks
|
|
I notice that the greedy curves are sometimes longer on the X axis, even if they reach the same loss. Is this because the convergence criterion fires later, or because iterations take longer? The paper seemed to advertise the greedy selection as being very cheap computationally. |
Iterations take longer. I used the same number of iterations for CD and GCD, with a low tolerance so as to stop only when the maximum number of iterations is reached. |
|
The improvements are not so impressive overall (~1 s gain). The author released the source code in MATLAB (with the computationally expensive parts written in C); you could have a look for potential tricks. In the lower-right plot in the n_components=10 case (news20), the loss difference looks very bad. I wonder if it's not just a scale problem. |
sklearn/decomposition/nmf.py
|
I noticed that you used fast_dot in some places. You could potentially use it here too.
|
It was to make sure the array is C-contiguous, but since I removed the check in the Cython code, I will change it.
sklearn/decomposition/nmf.py
|
you can add a constant term to the diag of a matrix without allocating a full identity matrix.
git grep ".flat["
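For the record, the `.flat` trick works like this (a toy demo; `alpha` is just an example value, the PR would add a much smaller regularization term):

```python
import numpy as np

n = 4
A = np.arange(float(n * n)).reshape(n, n)
alpha = 5.0  # example constant, not the actual regularization value

# Same result as A + alpha * np.eye(n), but in place and without
# allocating an identity matrix: stepping through the flattened array
# with stride n + 1 visits exactly the diagonal entries.
A.flat[::n + 1] += alpha
```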
|
That's it for me. Don't forget the what's new page. Nice job @TomDLT! |
|
Sorry to be late on this but could we address the scaling of alpha in another PR? I agree that scaling of alpha could be useful but I think it should probably be an option (default: False) for backward compatibility and consistency with the rest of the scikit. Also, the motivation for the way it is currently scaled is not completely obvious to me. My first intuition would have been to divide the loss term by n_samples x n_features. Although @TomDLT has some early experiments, it would be nice to take our time to compare different scalings. |
|
Comments addressed, thanks. I also removed the alpha scaling from this PR, as suggested by @mblondel, and I will open a new PR for further discussion about it. |
|
+1 for merge on my side. |
|
+1 too |
|
🍻
|
|
Cool 🍻 ! |
|
sorry if I overlooked it but did you commit / post your benchmark script using the "real" datasets? the current one only uses synthetic data, right? |
|
I benchmarked it with "real" datasets I think: 20 newsgroups, Olivetti Faces, RCV1 and MNIST, but I don't find the script. |
|
I think that would be great. Would you be interested to do that? Otherwise maybe open an issue? |
|
I can do that |
|
see #5779 |
This PR is a first part of what is discussed in #4811.
It includes:
- `_initialize_nmf()`
- regularization parameters (`alpha` and `l1_ratio`)

In some future PR, I will:
Please tell me what you think :)