[WIP] Sparse PCA online in the direction of samples #4873

arthurmensch wants to merge 10 commits into scikit-learn:master
Conversation
`ols` or `least_squares` would be a better (more explicit) name in my opinion.
Actually this will not yield a sparse code, so I wonder if adding this option here does not reflect a design problem.
I think this method is interesting, but the class organisation needs a rework:
However, sparse PCA is not dictionary learning, as the activations / code are not sparse. From a user point of view I find this really confusing. I think we should find a way to refactor the code so that the estimator name reflects the model name (the class of objective function, with either sparse components or sparse code) and not the kind of solver used internally.
I think this should be called ~~`projected_gradient`~~ `sparse_projection`. Elastic net by itself is not a solver like lars and coordinate descent, but rather a kind of penalty.
Edit: it's not the gradient that is projected, but the weights after an update, so `projected_gradient` is a bad name.
Also, is the current implementation well defined, and equivalent to l1-ball projection when l1_ratio=1.0?
It should. Do we have an existing l1-ball projection to compare with? It is actually a block coordinate descent with projection, so cd_proj could be used, but it would be confusing as it describes the dictionary update algorithm and not the code learning algorithm.
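For reference, here is a minimal sketch (not from this PR) of the standard sorting-based Euclidean projection onto the l1 ball (Duchi et al., 2008), which the elastic-net projection should reduce to when `l1_ratio=1.0`; it could serve as the comparison point asked for above:

```python
import numpy as np

def proj_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto the l1 ball of the given radius.

    Sorting-based algorithm: a pure-Python reference to test the
    elastic-net projection against in the l1_ratio=1.0 case.
    """
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]          # magnitudes in decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    cond = u - (css - radius) / idx > 0   # indices kept by the threshold
    rho = idx[cond][-1]
    theta = (css[cond][-1] - radius) / rho
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```

By construction the result has l1 norm equal to `radius` whenever the input lies outside the ball, and inputs already inside the ball are returned unchanged.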
Out of curiosity, what is the speed impact of using the Python version of the Elastic Net projection? If the use of the Cython version is critical for speed, I think we should just write tests for that version and completely drop the Python version from the code base. There is no point in keeping 2 versions if one is always suboptimal.
We gain a factor of ~40 in performance on the faces dataset. I kept the Python version in the code base for testing, but it should be included in a test file.
Actually, the online SPCA method presented in this PR can also enforce code sparsity; however, it is often the case that you do not want both a sparse code and a sparse dictionary. I think we could refactor as follows:
- We could keep a class `MiniBatchFeatureSparsePCA` (or whatnot) that would keep computing a sparse code and a dense dictionary by transposing …
- I still need to do some benchmarking to see if we do not lose considerable performance depending on …
- One of the problems with this approach is that the output of …
Force-pushed 2d63da6 to 1411d21.
Sorry for the multiple commits; to sum up, I added tolerance-based early stopping, using the surrogate function.
I tweaked the Cython code so that it can release the GIL. The for loop in …
However, you can either use joblib, or a `ThreadPool` directly:

```python
from contextlib import closing
from multiprocessing.pool import ThreadPool

with closing(ThreadPool(n_jobs)) as p:
    for X_batch in minibatch_iterator:
        # use the pool to apply a function in parallel, for instance p.map
        ...
```
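The same pattern can be written with joblib's threading backend. A sketch: `update_code` is a hypothetical stand-in for the per-batch work (the real PR would call its GIL-releasing Cython routine there), and thread-based parallelism only pays off because that inner loop releases the GIL:

```python
import numpy as np
from joblib import Parallel, delayed

def update_code(X_batch):
    # hypothetical stand-in for the per-batch update
    return X_batch.sum()

batches = [np.ones((10, 5)) for _ in range(4)]

# backend="threading" gives the same thread-based parallelism as the
# ThreadPool snippet above, with joblib handling pool setup/teardown.
results = Parallel(n_jobs=2, backend="threading")(
    delayed(update_code)(b) for b in batches
)
```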
Note that we could also work to make …
Since we are only using a ThreadPool for …
Force-pushed 3157d29 to 6e1644a.
Force-pushed 871d2f0 to 0fbc54b.
I changed this PR to … We could then deprecate the current …
@arthurmensch when you rebase and squash commits, please clean up the full commit message: there are old phony messages included in the commit message for 0fbc54b.
Force-pushed 0fbc54b to c0abd2d.
The major issue here is now a naming problem: this PR's code could be used to change MiniBatchSparsePCA into a real online estimator. However, we would have to keep the present MiniBatchSparsePCA for the next few versions and deprecate it, while creating another estimator (IncrementalSparsePCA?).
This scaling is the one recommended in the Lasso literature. I can write a test showing that the sparsity level stays roughly the same no matter what input feature size is provided.
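A rough sketch of the kind of test proposed above, assuming a universal-threshold-style scaling `alpha ∝ sqrt(log(p) / n)` (one common recommendation in the Lasso literature; the PR's exact scaling may differ). The idea is that with the penalty rescaled this way, the fraction of nonzero coefficients should stay in the same ballpark as the number of features grows:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n = 200
fractions = []

for p in (50, 100, 400):
    # sparse ground truth: 10% of coefficients are active
    w_true = np.zeros(p)
    w_true[rng.choice(p, p // 10, replace=False)] = 2.0
    X = rng.randn(n, p)
    y = X @ w_true + rng.randn(n)

    # penalty rescaled with p and n (illustrative choice)
    alpha = np.sqrt(2 * np.log(p) / n)
    nnz = np.count_nonzero(Lasso(alpha=alpha).fit(X, y).coef_)
    fractions.append(nnz / p)
```

A real test would assert that the entries of `fractions` stay within a fixed band of each other across feature sizes.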
Output from example:
Force-pushed a21ea54 to e37d1f2.
I am closing this PR for the moment, as it needs serious rework on my side.
ok thanks for working on this :) |
Following #4856, I implemented a dictionary learning algorithm that outputs a dense code and a sparse dictionary, which corresponds to the description of SPCA given in [1]. It alternates between ridge regression, to learn the code for a batch of samples, and projected block coordinate descent on the elastic net ball.
For each feature (i.e. each pixel in the faces example) we project the `n_components` vector that contains all dictionary component values for this feature, which explicitly enforces dictionary components to be non-overlapping (this differs from [1], where projection occurs in the feature space and not in the component space).

For the moment there are no unit tests, but the output of the method can be seen by running `example.decomposition.plot_decomposition_faces`. As we use a large outer loop of `n_features` iterations and an inner loop for the elastic net projection, I wrote a Cython extension for speed considerations.

Does this method seem to be of interest for the project?
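The alternation described above can be sketched in plain NumPy. This is a toy batch version under simplifying assumptions: a plain l1-ball projection per feature instead of the elastic-net ball, and a pseudo-inverse dictionary step instead of block coordinate descent; the PR's online minibatch variant is more involved:

```python
import numpy as np

def proj_l1(v, radius=1.0):
    # Euclidean projection onto the l1 ball (sorting-based algorithm)
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u) - radius
    idx = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * idx > css)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def sparse_pca_step(X, V, ridge=1.0, radius=1.0):
    """One alternation: ridge code update, then projected dictionary update."""
    k = V.shape[0]
    # 1) dense code via ridge regression on the current dictionary V
    U = X @ V.T @ np.linalg.inv(V @ V.T + ridge * np.eye(k))
    # 2) dictionary: least-squares step, then project, for each feature,
    #    the n_components vector (a column of V) onto the l1 ball, which
    #    pushes components toward non-overlapping supports as described
    V = np.linalg.pinv(U) @ X
    return U, np.apply_along_axis(proj_l1, 0, V, radius)

rng = np.random.RandomState(0)
X = rng.randn(50, 20)
V = rng.randn(3, 20)
for _ in range(10):
    U, V = sparse_pca_step(X, V)
```

After each step, every per-feature column of `V` lies inside the l1 ball, so sparsity is enforced across components for each feature rather than across features for each component.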
[1] Mairal, J., Bach, F., Ponce, J., & Sapiro, G. (2010). Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11, 19-60.