[MRG] Integration of GSoC2015 Gaussian Mixture (first step) #6407
tguillemot wants to merge 3 commits into scikit-learn:master from tguillemot:GSoC-integration-step1
Conversation
force-pushed from 4476dc3 to 2f1200e
sklearn/mixture/gmm.py
Outdated
    -------
    X : array, (n_samples, n_features)
    """
    X = check_array(X, dtype=[np.float64, np.float32], ensure_2d=False)
why not have ensure_2d=True, allow_nd=False?
also if you check X here, you can remove the check in GMM._fit
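As an illustration of what the suggested flags enforce: `check_array(X, ensure_2d=True, allow_nd=False)` rejects anything that is not a 2-D array. A minimal numpy-only sketch of that behaviour (the helper name `check_X` and its messages are made up here; the real implementation is `sklearn.utils.check_array`):

```python
import numpy as np

def check_X(X, ensure_2d=True, allow_nd=False):
    # Simplified stand-in for sklearn.utils.check_array: convert the input
    # to a float array and enforce the requested dimensionality.
    X = np.asarray(X, dtype=np.float64)
    if ensure_2d and X.ndim != 2:
        raise ValueError("Expected a 2D array, got a %dD array instead." % X.ndim)
    if not allow_nd and X.ndim > 2:
        raise ValueError("Found array with %d dimensions." % X.ndim)
    return X
```

With `ensure_2d=False` (as in the diff above), a 1-D sample would silently pass through, which is why the reviewer suggests the stricter flags.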
I don't understand why you work on […].
I think the goal is to make the diff readable, and then to do a separate PR with a simple file renaming.
Well, the new classes in my PR have huge differences. That's not a small diff.
@TomDLT Thanks for your comments, I will correct all those things quickly.

Indeed. The only thing I've done for now is to add the class _MixtureBase and modify the old GMM class to derive from it. I plan next to do another PR to move _MixtureBase into base.py.

@xuewei4d Maybe I didn't explain my goal correctly, and I'm sorry for that. Of course your PR has huge differences, and obviously I have noticed it. My purpose is to merge all your work. For me, the best solution is to merge it step by step, which is what I've proposed with this PR. Of course, it's just a small part of your work, but it's easier to analyse it that way.
The problem is to deprecate the old […], so I would suggest to: […]

This […]
And include […].
In fact, I thought to add […].
force-pushed from e7841e7 to a1cea27
sklearn/mixture/gmm.py
Outdated
    - class GMM(BaseEstimator):
    + class _GMMBase(_MixtureBase):
This should not inherit from the new _MixtureBase class. Just BaseEstimator as previously. We do not want to change the implementation and behavior of the old GMM class in any way besides issuing a deprecation warning in the init.
In that case, I have to send all the code of _MixtureBase and GaussianMixture in the same PR.
To simplify the review, I can propose to split the code into separate commits. I don't see a better way to avoid pushing 1000 lines at once.
yes, but just the code for GaussianMixture (not the variational Bayes subclasses).
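The change the reviewer asks for can be sketched like this (a hypothetical minimal version: only a warning is added in `__init__`, everything else about the old class stays untouched):

```python
import warnings

class GMM:
    """Sketch of the requested approach: the old class keeps its original
    base class and behaviour; the only change is a deprecation warning
    issued at construction time (class body heavily simplified)."""

    def __init__(self, n_components=1, covariance_type='diag'):
        warnings.warn("The GMM class is deprecated; use GaussianMixture "
                      "instead.", DeprecationWarning)
        self.n_components = n_components
        self.covariance_type = covariance_type
```

This keeps the deprecation PR small and reviewable, since no behaviour of the legacy estimator changes.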
@tguillemot any news on this PR?
Sorry for being late. I'll send everything before the end of the day.
force-pushed from a7922ff to a878b60
sklearn/mixture/base.py
Outdated
        return logsumexp(self._estimate_weighted_log_prob(X), axis=1)

    def score(self, X, y=None):
        """Compute the log-likelihood of the model on the given data X.
The docstring is not correct: this is the per-sample average log-likelihood, as we use p.mean() instead of p.sum().
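In other words, `score` should document (and compute) the mean of the per-sample log-likelihoods. A self-contained sketch of the two quantities under that convention, operating directly on a precomputed `weighted_log_prob` matrix for illustration (the real methods take `X` and compute this matrix internally):

```python
import numpy as np

def score_samples(weighted_log_prob):
    # Per-sample log-likelihood: log p(x_i) = logsumexp_k of
    # (log w_k + log p(x_i | component k)), computed stably.
    # weighted_log_prob has shape (n_samples, n_components).
    m = weighted_log_prob.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(weighted_log_prob - m)
                       .sum(axis=1, keepdims=True))).ravel()

def score(weighted_log_prob):
    # Per-sample *average* log-likelihood: the mean over samples,
    # not the sum -- which is what the docstring should say.
    return score_samples(weighted_log_prob).mean()
```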
Alright, but I don't understand why we divide tol by […]. Also, shouldn't we divide the log-likelihood change by […]?
My mistake: on the old class it's not divided by […].
OK great, that makes sense to me then. It would be good to check how other packages check convergence for GMMs trained with EM: http://www.mixmod.org
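Under that convention, the convergence test compares the change in the per-sample average log-likelihood against `tol`, so `tol` does not need rescaling with `n_samples`. A minimal sketch of the check discussed above (names are illustrative):

```python
def em_converged(prev_avg_ll, curr_avg_ll, tol=1e-3):
    # Compare the change in per-sample *average* log-likelihood to tol.
    # Because the average is used, the same tol is meaningful whether the
    # data set has 100 samples or 100,000.
    return abs(curr_avg_ll - prev_avg_ll) < tol
```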
    for cov_type in COVARIANCE_TYPE:
        gmm = GaussianMixture(n_components=n_samples, covariance_type=cov_type,
                              reg_covar=0)
        assert_raise_message(ValueError,
I wonder if we should not introduce a dedicated exception such as DivergenceError in sklearn.exceptions but maybe we can stick to ValueError for now.
I will do another PR for that once we have finished with the GMM, OK?
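A sketch of what such a dedicated exception could look like. Both `DivergenceError` and `check_covariance` are hypothetical names (neither exists in `sklearn.exceptions`); subclassing `ValueError` would keep existing `except ValueError` handlers working, matching the "stick to ValueError for now" fallback:

```python
import numpy as np

class DivergenceError(ValueError):
    """Hypothetical exception for EM divergence; subclassing ValueError
    keeps backward compatibility with code that catches ValueError."""

def check_covariance(covariance, reg_covar=1e-6):
    # Sketch: with reg_covar=0, a component collapsing onto a single
    # sample yields a singular covariance matrix. Raising explicitly is
    # preferable to silently continuing with NaNs in the next E-step.
    cov = covariance + reg_covar * np.eye(covariance.shape[0])
    if np.linalg.eigvalsh(cov).min() <= 0.0:
        raise DivergenceError(
            "Fitting the mixture model failed: a component has a singular "
            "covariance. Try increasing reg_covar or decreasing "
            "n_components.")
    return cov
```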
force-pushed from 2930805 to 13a8ccf
OK, I need help to solve the Travis problem. Do you have any explanation?

This is weird. Maybe try to squash and rebase on top of master.
force-pushed from b0eed8a to 628e6a5
Another solution could be to propose a new PR from a new branch.
or:

    # back up the current branch under another name
    git checkout GSoC-integration-step1
    git checkout -b GSoC-integration-step1_backup
    git branch -D GSoC-integration-step1
    # create a new branch with the same name from master
    git checkout master
    git checkout -b GSoC-integration-step1
    # cherry-pick the commit and force-push
    git cherry-pick 628e6a56c861aa582cf7f6d4032c9bff24bd07de
    git push -f origin GSoC-integration-step1
    git branch -D GSoC-integration-step1_backup

Maybe do some […]
force-pushed from 9f7ece7 to 2201ef7
Deprecation of the GMM class. Modification of the GaussianMixture class:
- Some functions from the original GSoC code have been removed, renamed or simplified.
- Some new functions have been introduced (such as the check_parameters function).
- Some parameter names have been changed: covars_ -> covariances_, to be consistent with sklearn/covariance.
- Addition of the warm_start parameter, allowing to fit data using the previous computation.

The old examples have been modified to replace the deprecated GMM class with the new GaussianMixture class. Every example uses the eigenvector norm to solve the ellipse scale problem (issue #6548).

Corrections for all comments from the PR review:
- Rename MixtureBase -> BaseMixture; remove n_features_.
- Fix some problems and add some tests.
- Correct the bic/aic test; fix test_check_means and test_check_covariances.
- Remove all references to the deprecated GMM class; remove initialized_.
- Add and correct docstrings; correct the order of random_state; fix a small typo.
- Some fixes in preparation of the integration of the new BayesianGaussianMixture class.
- Add the best_n_iter attribute, later renamed to n_iter_.
- Fix some bugs and tests; change the parameter order in the documentation.
- Fix the warm_start problem; fix the divergence error message.
- Correct the random state init in the test file; fix the testing problems.
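The warm_start behaviour described in the commit message can be sketched as follows (a hypothetical minimal class to illustrate the semantics, not the real GaussianMixture API):

```python
class MixtureSketch:
    """Illustration of warm_start: when True, a second call to fit reuses
    the parameters of the previous fit as initialisation instead of
    re-initialising from scratch."""

    def __init__(self, warm_start=False):
        self.warm_start = warm_start
        self._fitted = False
        self.n_init_calls = 0  # counts how often we re-initialise

    def fit(self, X):
        # Skip re-initialisation only when warm-starting from a prior fit.
        if not (self.warm_start and self._fitted):
            self._initialize(X)
        self._run_em(X)
        self._fitted = True
        return self

    def _initialize(self, X):
        self.n_init_calls += 1  # placeholder for parameter initialisation

    def _run_em(self, X):
        pass  # placeholder for the EM iterations
```

This is useful when fitting the same model repeatedly on growing or streaming data, where the previous solution is a good starting point.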
force-pushed from 2201ef7 to 12d66fc
OK, I've tried your solution and also tried to do it manually, and it still doesn't work.

Interestingly, you solved the problem by adding the file, but it came back when you removed the file (only on py35, though).

To work around the strange Travis bug, I've created another PR: #6666.
@xuewei4d has done a huge amount of work for the GMM classes.
Nevertheless, it seems that #4802 is not integrated yet. I thought it was because of the length of the contribution and the difficulty of measuring the efficiency of the new classes with respect to the previous ones.
I propose to integrate this work step by step in several pull requests (400 lines max each).
This pull request introduces the new abstract class _MixtureBase with some common GMM attributes and methods. Now, GMM derives from _MixtureBase.
@ogrisel @amueller @agramfort