[MRG+1] Adress decomposition.PCA mle option problem by lschwetlick · Pull Request #16224 · scikit-learn/scikit-learn

lschwetlick · 2020-01-25T15:06:18Z

Reference Issues/PRs

This PR takes the solutions proposed in PR #10359 for issue #4441 and translates them to the current version of the decomposition package. We addressed 2 of the 3 issues raised by the reviewer:

rename rcond to spectrum_cutoff for more intuitive naming
make the spectrum cutoff value contingent upon the machine epsilon for the spectrum's data type
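
As a rough illustration of the second point: machine epsilon depends on the floating-point type, so a cutoff derived from it scales with the precision of the spectrum. The sketch below relies only on NumPy's `np.finfo`; the helper name `spectrum_cutoff_for` is hypothetical and is not taken from the PR.

```python
import numpy as np


def spectrum_cutoff_for(spectrum):
    """Hypothetical helper: machine epsilon of the spectrum's dtype."""
    return np.finfo(spectrum.dtype).eps


# The same eigenvalues stored at two different precisions.
spectrum64 = np.array([2.5, 1.0, 1e-20], dtype=np.float64)
spectrum32 = spectrum64.astype(np.float32)

cutoff64 = spectrum_cutoff_for(spectrum64)
cutoff32 = spectrum_cutoff_for(spectrum32)

# Eigenvalues below the cutoff are treated as numerically zero.
negligible64 = spectrum64 < cutoff64
negligible32 = spectrum32 < cutoff32
```

The float32 cutoff is several orders of magnitude larger than the float64 one, which is why a single hard-coded constant would not suit both types.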

We reverted the changes relating to the off-by-one error. Maybe someone with more experience in linear algebra can weigh in and we can fix it as a separate issue. It was discussed here and here.

other comments

Contributing authors were @gelavizh1, @marijavlajic and @lschwetlick at #Wimlds sprint Berlin :)

CC: @adrinjalali @noatamir

glemaitre · 2020-01-26T11:32:24Z

You can check why codecov is complaining:

https://codecov.io/gh/scikit-learn/scikit-learn/compare/5c36df6098d4a6325b621030163897d19853e698...9b5d58ddcee529c9c4374efca37fa6afc5961d43/diff

lschwetlick · 2020-01-30T13:00:12Z

I'm a bit stuck due to my limited knowledge of the linear algebra behind this. Two questions occur to me:

If _assess_dimension_ is a private function and is only accessed from within (from _infer_dimension_) and that call does not make use of the spectrum_cutoff function argument, do we really need it? Or should we just omit it and define the default (cutoff at the datatype's epsilon) as a hard-coded fact?
In reference to the test coverage: I need to test whether I get a sensible result when the rank and the number of features are equal. I do not know how to test this well, except to make sure that I get a value in that case. The test_assess_dimension_same_n_rank_and_features test I just added checks this, but maybe there is a more specific, sensible test case we could use?

I appreciate any input :)

jnothman

This is looking good

I can confirm that the following new tests fail at master:

test_infer_dim_bad_spec
test_assess_dimension_same_n_rank_and_features
test_assess_dimension_small_eigenvalues
test_infer_dim_mle

jnothman · 2020-02-10T22:30:23Z

sklearn/decomposition/tests/test_pca.py

+
+
+def test_assess_dimension_same_n_rank_and_features():
+    # Test that


please finish this line

glemaitre

Please add an entry to the change log at doc/whats_new/v0.23.rst under bug fixes. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

glemaitre · 2020-02-11T13:59:20Z

If _assess_dimension_ is a private function and is only accessed from within (from _infer_dimension_) and that call does not make use of the spectrum_cutoff function argument, do we really need it? Or should we just omit it and define the default (cutoff at the datatype's epsilon) as a hard-coded fact?

I agree. I would remove the parameter because it is never used elsewhere. So let's internally create the threshold.
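
A minimal sketch of what creating the threshold internally could look like, assuming the threshold is the machine epsilon of the spectrum's dtype. The function name `assess_rank_guard`, the `-inf` return for negligible eigenvalues, and the placeholder score are all assumptions made for illustration; this is not scikit-learn's implementation.

```python
import numpy as np


def assess_rank_guard(spectrum, rank):
    """Hypothetical guard, not scikit-learn's code.

    Returns -inf when the eigenvalue at position ``rank - 1`` falls below
    a threshold created inside the function, and a placeholder score of
    0.0 otherwise.
    """
    spectrum = np.asarray(spectrum, dtype=float)
    if not 1 <= rank <= len(spectrum):
        raise ValueError("rank must be in [1, len(spectrum)]")
    # The threshold is derived here instead of being passed in by the caller.
    threshold = np.finfo(spectrum.dtype).eps
    if spectrum[rank - 1] < threshold:
        return -np.inf
    return 0.0  # placeholder for the real log-likelihood computation
```

With the keyword argument gone, callers have nothing to configure, and the cutoff follows the data type automatically.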

In addition, could you rename:

_infer_dimension_ -> _infer_dimension
_assess_dimension_ -> _assess_dimension

They are internal functions so we can change them.

lschwetlick · 2020-02-15T16:23:58Z

Thanks for the input!

Unfortunately I'm having some trouble writing sensible tests for this and I'm starting to understand what is wrong. While I'm a bit reluctant to get into the off by one error problem, I think it might not be avoidable.

So the way I understand it, the variable explained_variance_ (later called spectrum) is a vector of the eigenvalues of the data. In the 'mle' fit, we want to ascertain how many dimensions are relevant using this maths. Essentially, we take that vector and, for each identified eigenvalue, compute a log likelihood.

The problem, which is probably the same as the off-by-one problem we mentioned previously, is that in _infer_dimension we iterate over range(len(spectrum)) and call our iteration variable rank.
This is misleading because the name suggests the rank of the matrix, where rank=0 would mean a matrix that has no variance. However, in _assess_dimension we use rank as an index into our list of eigenvalues, where the first one (I think) refers to a dimension that exists. So rank is being used both as an index and as a concept with its own meaning, and I'm not sure those two roles match up.

In my understanding, if we find 3 eigenvalues, we should be looking at the log likelihood (LL) for the cases rank=0, rank=1, rank=2, and rank=3. Currently we look at ranks 0, 1, and 2. That was also why the code coverage was bad before: the if rank == n_features branch is never entered unless I artificially make a direct call to _assess_dimension.
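
The mismatch described above can be illustrated with plain Python. This is only a sketch of the two candidate ranges, assuming the number of eigenvalues equals n_features; the variable names are chosen for illustration and are not from the codebase.

```python
spectrum = [3.0, 2.0, 1.0]  # three eigenvalues
n_features = len(spectrum)  # assumption: one eigenvalue per feature

# Candidate ranks produced by iterating over range(len(spectrum)).
ranks_current = list(range(len(spectrum)))

# Candidate ranks if every rank from 0 to n_features were considered.
ranks_expected = list(range(len(spectrum) + 1))

# The rank == n_features case is only reachable with the second range.
reaches_full_rank_current = n_features in ranks_current
reaches_full_rank_expected = n_features in ranks_expected
```

Under the first range the full-rank case is never visited, which matches the uncovered branch reported by codecov.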

Still To Do:

add an entry to the change log at doc/whats_new/v0.23.rst
check the off by one situation
write a test case for a known outcome of mle

adrinjalali

Do I understand correctly that this now fixes the small eigenvalue problem, but whenever there is only one component, our n_components_ is 0?

It seems on master:

In [...]: from sklearn import datasets
     ...: from sklearn.decomposition import PCA

In [...]: X, _ = datasets.make_classification(n_informative=1, n_repeated=0,
     ...:                                     n_redundant=0, n_clusters_per_class=1,
     ...:                                     random_state=42)
     ...: pca = PCA(n_components='mle').fit(X)

In [...]: pca.n_components_
Out[...]: 0

That makes me think this PR is kinda complete as is, and the off by one issue can have its own issue+pr.

You could also add a test with a large cut-off to check that the passed value always works.

lschwetlick · 2020-02-25T16:49:34Z

Okay, so then let's work on the off by one error in a separate issue. I left TODOs in the code, where I found the code coverage never reaches the lines on account of this.

adrinjalali · 2020-02-25T17:07:37Z

The CI issue is not related; I opened #16545 to figure it out. You can safely ignore it (I think).

adrinjalali

I'd be happy to have this in and work on #16546 on a different PR. Thanks @lschwetlick

cmarmo · 2020-03-02T15:47:28Z

@lschwetlick minimal dependencies have been updated in the meantime. Could you please synchronise with upstream? This will update the build and (hopefully) green-light all the tests. Thanks!

jnothman · 2020-03-02T21:19:10Z

Please change the title from WiP to MRG if this is awaiting review but otherwise ready for merge

lschwetlick · 2020-03-03T16:44:25Z

Could you please synchronise with upstream?

I rebased the PR onto the current upstream master. I hope that was the right way to synchronise it?

jnothman · 2020-03-03T20:16:04Z

I rebased the PR onto the current upstream master. I hope that was the right way to synchronise it?

Looks good, though merging master into your branch is cleaner :)

jnothman

Otherwise lgtm

cmarmo · 2020-03-04T15:15:47Z

Is someone available to merge this twice-approved, all-green PR? :)
Thanks!

rth · 2020-03-04T15:26:23Z

Thanks @lschwetlick (and @cmarmo for the reminder) !

jnothman · 2020-03-05T08:04:24Z

Yay! About time this got closed! Thanks @lschwetlick

adrinjalali added Sprint module:decomposition Needs Info labels Jan 30, 2020

jnothman reviewed Feb 10, 2020

glemaitre removed the Needs Info label Feb 11, 2020

glemaitre reviewed Feb 11, 2020

cmarmo added the Needs work label Feb 14, 2020

lschwetlick changed the title ~~[MRG] Adress decomposition.PCA mle option problem (issue #4441) (bring PR #10359 up to date)~~ [WIP] Adress decomposition.PCA mle option problem (issue #4441) (bring PR #10359 up to date) Feb 16, 2020

adrinjalali reviewed Feb 17, 2020

adrinjalali mentioned this pull request Feb 25, 2020

CI: scikit-image 0.12.3 requires dask[array]>=0.5.0, which is not installed. #16545

Closed

lschwetlick mentioned this pull request Feb 25, 2020

Off-By-One Error in _pca with 'mle' #16546

Closed

adrinjalali approved these changes Feb 26, 2020

cmarmo added Waiting for Reviewer and removed Needs work labels Mar 2, 2020

adrinjalali changed the title ~~[WIP] Adress decomposition.PCA mle option problem (issue #4441) (bring PR #10359 up to date)~~ [MRG+1] Adress decomposition.PCA mle option problem Mar 3, 2020

lschwetlick added 7 commits March 3, 2020 17:37

add changes from PR #10359 to current version of decomposition package

6792ac6

fix test that failed because of the off-by-one error mentioned in PR #…

67216c9

…4827 and #10359

linting code

f604bd7

added tests

05c928c

test edge case where samples<features. discovered making datasets

1f91790

linting

e2a6400

linting

492dd60

lschwetlick and others added 13 commits March 3, 2020 17:37

forgot a function description

3d35c27

linting

d306fcd

typo

5f1d395

Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>

docstring

f30285e

Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>

docstring

05e5f5a

Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>

Update sklearn/decomposition/_pca.py

bce1c8f

Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>

rename spectrum_cutoff to spectrum_threshold

ef7d80c

rename _assess_dimension_ to _assess_dimension and _infer_dimension_ …

db66fc1

…to _infer_dimension

remove spectrum threshold as a keyword

2c12704

fix tests

adcfa89

clean up, comments and doc

fdc5f3f

linting

8272383

linting

3f238c0

jnothman approved these changes Mar 3, 2020

remove comment

353a000

cmarmo removed the Waiting for Reviewer label Mar 3, 2020

rth merged commit ea31818 into scikit-learn:master Mar 4, 2020

lschwetlick deleted the pca_fix_new branch March 13, 2020 10:27

ashutosh1919 pushed a commit to ashutosh1919/scikit-learn that referenced this pull request Mar 13, 2020

FIX Adress decomposition.PCA mle option problem (scikit-learn#16224)

4905ac3

larsoner mentioned this pull request Mar 20, 2020

BUG: MLE for PCA mis-estimates rank #16730

Closed

gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020

FIX Adress decomposition.PCA mle option problem (scikit-learn#16224)

e9d7cd3



6 participants
