
[WIP] Sparse PCA online in the direction of samples#4873

Closed
arthurmensch wants to merge 10 commits into scikit-learn:master from arthurmensch:spca_online_sample

Conversation

@arthurmensch
Contributor

Following #4856, I implemented a dictionary learning algorithm that outputs a dense code and a sparse dictionary, which corresponds to the description of SPCA given in [1]. It alternates between ridge regression for learning the code of a batch of samples, and projected block coordinate descent on the elastic net ball.

For each feature (i.e. each pixel in the faces example) we project the n_component vector that contains the value of every dictionary component for this feature, which explicitly enforces non-overlapping dictionary components (this differs from [1], where the projection occurs in the feature space and not in the component space).

For the moment there are no unit tests, but the output of the method can be seen by running example.decomposition.plot_decomposition_faces. As we use a large outer loop of n_feature iterations and an inner loop for the elastic net projection, I wrote a Cython extension for speed considerations.
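A minimal NumPy sketch of the alternating scheme described above (helper names are made up, not the PR code; for simplicity a plain l1-ball projection stands in for the elastic-net-ball projection, and a least-squares solve stands in for the block coordinate descent dictionary update):

```python
import numpy as np

def ridge_code(X, D, alpha):
    # Closed-form ridge regression: min_C ||X - C D||^2 + alpha ||C||^2,
    # giving the dense code for a batch of samples.
    G = D @ D.T + alpha * np.eye(D.shape[0])
    return np.linalg.solve(G, D @ X.T).T

def project_l1(v, radius=1.0):
    # Euclidean projection of v onto the l1 ball of the given radius
    # (sort-based algorithm); this induces sparsity in the components.
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0)

rng = np.random.RandomState(0)
X = rng.randn(50, 20)   # n_samples x n_features
D = rng.randn(5, 20)    # n_components x n_features
for _ in range(10):
    C = ridge_code(X, D, alpha=1.0)            # dense code
    D = np.linalg.lstsq(C, X, rcond=None)[0]   # dictionary update
    D = np.array([project_l1(d, radius=2.0) for d in D])  # sparse dictionary
```

After a few iterations the rows of D sit on (or inside) the l1 ball, so the learned components are sparse while the code C stays dense.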

Does this method seem of interest for the project?

Mairal, J., Bach, F., Ponce, J., & Sapiro, G. (2010). Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11, 19-60.

Member


ols or least_squares would be a better (more explicit) name in my opinion.

Member


Actually this will not yield a sparse code, so I wonder whether adding this option here does not reflect a design problem.

@ogrisel
Member

ogrisel commented Jun 18, 2015

I think this method is interesting but I think the class organisation needs a rework:

  • the current MiniBatchSparsePCA does minibatches in the feature space, which I think is an abuse: at least it's not consistent with other uses of the MiniBatch prefix in scikit-learn. I would rather have called it BlockedCoordinateSparsePCA or something similar.
  • the new method would implement sparse PCA with mini-batches in the sample space (via ENet projections of the components after each minibatch update, if I understand correctly) but is currently implemented as part of the online dictionary learning code base for code reuse purposes.

However sparse PCA is not dictionary learning, as the activations / code are not sparse. From a user point of view I find this really confusing. I think we should find a way to refactor the code so that the estimator name reflects the model name (the class of objective function, with either sparse components or sparse code) and not the kind of solver used internally.
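For reference, the two objective families being discussed can be written side by side (standard formulations in the spirit of the cited Mairal et al. reference, with $X \approx U D$, $U$ the code and $D$ the dictionary / components; this is my paraphrase, not taken from the PR):

```latex
% Dictionary learning: sparse code, constrained (dense) dictionary
\min_{U, D}\ \tfrac{1}{2}\lVert X - U D \rVert_F^2
  + \alpha \lVert U \rVert_1
  \quad \text{s.t. } \lVert d_k \rVert_2 \le 1

% Sparse PCA: sparse components, constrained (dense) code
\min_{U, D}\ \tfrac{1}{2}\lVert X - U D \rVert_F^2
  + \alpha \lVert D \rVert_1
  \quad \text{s.t. } \lVert u_k \rVert_2 \le 1
```

The penalty switches from the code to the components; the solver (minibatch, block coordinate, etc.) is orthogonal to this choice, which is the distinction the naming should reflect.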

Member


I think this should be called ~~projected_gradient~~ sparse_projection. Elastic net by itself is not a solver like lars and coordinate descent, but rather a kind of penalty.

Edit: it's not the gradient that is projected, but the weights after an update, so projected_gradient is a bad name.

Member


Also, is the current implementation well defined and equivalent to an l1-ball projection when l1_ratio=1.0?

Contributor Author


It should. Do we have an existing l1-ball projection to compare with? It is actually a block coordinate descent with projection, so cd_proj could be used, but that would be confusing as it describes the dictionary update algorithm and not the code learning algorithm.

Contributor Author


It does

@ogrisel
Member

ogrisel commented Jun 18, 2015

Out of curiosity, what is the speed impact of using the Python version of the elastic net projection? If the Cython version is critical for speed, I think we should just write tests for that version and completely drop the Python version from the code base. There is no point in keeping two versions if one is always suboptimal.

@arthurmensch
Contributor Author

We gain a factor of ~40 in performance on the faces dataset. I kept the Python version in the code base for testing, but it should be moved into a test file.

@arthurmensch
Contributor Author

Actually the online SPCA method presented in this PR can also enforce code sparsity; however, it is often the case that you do not want both a sparse code and a sparse dictionary.

I think we could refactor as follows:

  • Allow user selection of the code computation algorithm, i.e. {'cd', 'lars', 'ols'}, thus allowing the user to compute a non-sparse code. This would involve a parameter alpha controlling code sparsity.
  • Allow user selection of the dictionary constraint, i.e. {'ordinary', 'sparse'}, with a parameter l1_ratio controlling dictionary sparsity.

MiniBatchDictionaryLearning would then default to a sparse code and a dense dictionary, while MiniBatchSparsePCA would default to a sparse dictionary and a sparse code, which is its expected behaviour as of version 0.16.

We could keep a class MiniBatchFeatureSparsePCA (or what not...) that would keep computing a sparse code and a dense dictionary by transposing X.

I still need to do some benchmarking to check that we do not lose considerable performance depending on the shape of X.

One of the problems with this approach is that the output of fit using the new MiniBatchDictionaryLearning would be different with the same parameters, as there is no obvious correspondence between the regularization parameters l1_ratio and alpha (the latter is used today to control code sparsity, and thus components_ sparsity, in MiniBatchSparsePCA).

@arthurmensch
Contributor Author

Sorry for the multiple commits. To sum up, I added tolerance-based early stopping using the surrogate function $\hat f (t)$ described in the reference above, and I changed the API so we can control the code learning algorithm ('lars', 'cd' (not working on master, but working with PR #4775), 'ols') and the dictionary learning constraint (either euclidean or enet, which is structure inducing).
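For reference, the surrogate from the Mairal et al. reference is the running average of past sample losses, with each code $\alpha_i$ computed against the dictionary available at step $i$ (my paraphrase of the paper, not the PR code):

```latex
\hat f_t(D) = \frac{1}{t} \sum_{i=1}^{t}
  \left( \frac{1}{2} \lVert x_i - D \alpha_i \rVert_2^2
         + \lambda \lVert \alpha_i \rVert_1 \right)
```

The early stopping presumably monitors the decrease of this quantity between iterations and stops once it falls below the tolerance.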

@arthurmensch
Contributor Author

I tweaked the Cython code so that it can release the GIL. The for loop in _update_dict_enet is embarrassingly parallel; are we allowed to use prange, or should we always stick to joblib?

@ogrisel
Member

ogrisel commented Jun 23, 2015

prange cannot currently be used in scikit-learn because:

  • not all supported compilers (e.g. clang on OSX or MSVC 2008 on Windows) implement OpenMP
  • the OpenMP implementation of gcc (libgomp) still crashes when used in a program that uses multiprocessing somewhere else: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60035

However you can use joblib with backend="threading" if you do not make many calls to Parallel in a row (otherwise the thread pool creation and destruction will kill the performance). Alternatively you can maintain your own pool (contextlib is necessary to use the ThreadPool as a context manager in Python 2.6):

from contextlib import closing
from multiprocessing.pool import ThreadPool

with closing(ThreadPool(n_jobs)) as p:
    for X_batch in minibatch_iterator:
        # use the pool to apply a function in parallel, for instance with p.map
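The sketch above fleshed out with made-up data so it actually runs; the per-item function is a stand-in for real work that releases the GIL (e.g. a Cython nogil routine), which is what makes the threading approach pay off:

```python
from contextlib import closing
from multiprocessing.pool import ThreadPool

def square(x):
    # placeholder for the real per-item update
    return x * x

n_jobs = 2
minibatch_iterator = iter([[1, 2, 3], [4, 5, 6]])

results = []
with closing(ThreadPool(n_jobs)) as p:
    for X_batch in minibatch_iterator:
        # p.map distributes the per-item calls over the thread pool
        # and preserves input order in its output
        results.append(p.map(square, X_batch))
```

The closing() wrapper guarantees the pool is terminated when the block exits, which is the Python 2.6 workaround mentioned above.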

@ogrisel
Member

ogrisel commented Jun 23, 2015

Note that we could also make joblib.Parallel a context manager with __enter__ and __exit__ methods to keep a managed pool internally for successive jobs. This requires a lot more work and would not bring any benefit for the threading backend case. It could be useful for the multiprocessing backend though.

@arthurmensch
Contributor Author

Since we are only using a ThreadPool for the enet version, I explicitly close it at the end of dict_learning_online. I also tried to use the constructed pool in sparse_encode, without any gain in performance (I guess we need a multiprocessing Pool for that).

@arthurmensch arthurmensch changed the title [WIP] Sparse PCA online in the direction of samples [MRG] Sparse PCA online in the direction of samples Jun 24, 2015
@arthurmensch arthurmensch force-pushed the spca_online_sample branch 2 times, most recently from 871d2f0 to 0fbc54b Compare June 24, 2015 14:34
@ogrisel ogrisel changed the title [MRG] Sparse PCA online in the direction of samples [WIP] Sparse PCA online in the direction of samples Jun 25, 2015
@ogrisel ogrisel added the API label Jun 25, 2015
@ogrisel
Member

ogrisel commented Jun 25, 2015

I changed this PR to [WIP] as we need to resolve the API issue. I suggest we introduce classes named IncrementalDictionaryLearning and IncrementalSparsePCA to host the implementation for the cases where we have a partial_fit method, and keep DictionaryLearning and SparsePCA for the cases where we cannot implement a partial_fit method (even if the fit method does some kind of internal 'online' optimization along the feature axis).

We could then deprecate the current MiniBatchDictionaryLearning and MiniBatchSparsePCA, as they introduce confusion by not being consistent.

@ogrisel
Member

ogrisel commented Jun 25, 2015

@arthurmensch when you rebase and squash commits, please clean up the full commit message: there are old phony messages included in the commit message for 0fbc54b.

@arthurmensch
Contributor Author

The major issue here is now a naming problem: this PR's code could be used to turn MiniBatchSparsePCA into a real online estimator.

However, we would have to keep the present MiniBatchSparsePCA for the next few versions and deprecate it, while creating another estimator (IncrementalSparsePCA?).

Contributor Author


This scaling is the one recommended in the Lasso literature. I can write a test showing that the sparsity level stays roughly the same no matter what input feature size is provided.

@arthurmensch
Contributor Author

Output from the example:

Dataset consists of 400 faces
Extracting the top 30 Online Dictionary learning...
Iteration   0 (elapsed time:   0s,  0.0mn)
Iteration  10 (elapsed time:   0s,  0.0mn)
Iteration  20 (elapsed time:   0s,  0.0mn)
Iteration  30 (elapsed time:   0s,  0.0mn)
Iteration  40 (elapsed time:   0s,  0.0mn)
Iteration  50 (elapsed time:   1s,  0.0mn)
Iteration  60 (elapsed time:   1s,  0.0mn)
Iteration  70 (elapsed time:   1s,  0.0mn)
Iteration  80 (elapsed time:   1s,  0.0mn)
Iteration  90 (elapsed time:   2s,  0.0mn)
done in 2.416s
Online Dictionary learning - Component density
0.0601806640625
Code density
1.0

[images: components, faces, reconstruction]

@arthurmensch
Contributor Author

I am closing this PR for the moment as it needs serious rework on my side.

@amueller
Member

ok thanks for working on this :)
