[MRG+1] Adress decomposition.PCA mle option problem#16224
[MRG+1] Adress decomposition.PCA mle option problem#16224rth merged 21 commits intoscikit-learn:masterfrom lschwetlick:pca_fix_new
Conversation
|
You can check why codecov is complaining; |
|
I'm a bit stuck due to my limited knowledge of the linear algebra behind this. Two questions occur to me:
I appreciate any input :) |
jnothman
left a comment
There was a problem hiding this comment.
This is looking good
I can confirm that the following new tests fail at master:
- test_infer_dim_bad_spec
- test_assess_dimension_same_n_rank_and_features
- test_assess_dimension_small_eigenvalues
- test_infer_dim_mle
|
|
||
|
|
||
| def test_assess_dimension_same_n_rank_and_features(): | ||
| # Test that |
glemaitre
left a comment
There was a problem hiding this comment.
Please add an entry to the change log at doc/whats_new/v0.23.rst under bug fixes. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:
I agree. I would remove the parameter because it is never used elsewhere. So let's internally create the threshold. In addition, could you rename:
They are internal functions so we can change them. |
|
Thanks for the input! Unfortunately I'm having some trouble writing sensible tests for this and I'm starting to understand what is wrong. While I'm a bit reluctant to get into the off by one error problem, I think it might not be avoidable. So the way I understand it the variable The problem, which is probably the same as the off-by-one problem we mentioned previously, is that in In my understanding, if we find 3 eigenvalues, we should be looking at the LL for the cases rank=0, rank=1, rank=2, and rank=3. Currently we look at ranks 0,1, and 2. That was also why the code coverage was bad before, because the Still To Do:
|
adrinjalali
left a comment
There was a problem hiding this comment.
Do I understand correctly that this now fixes the small eigenvalue problem, but whenever there is only one component, our n_components_ is 0?
It seems on master:
In [...]: X, _ = datasets.make_classification(n_informative=1, n_repeated=0,
...: n_redundant=0, n_clusters_per_class=1,
...: random_state=42)
...: pca = PCA(n_components='mle').fit(X)
In [...]: pca.n_components_
Out[...]: 0That makes me think this PR is kinda complete as is, and the off by one issue can have its own issue+pr.
You could also add a test with a large cut-off to check that the passed value always works.
|
Okay, so then let's work on the off by one error in a separate issue. I left TODOs in the code, where I found the code coverage never reaches the lines on account of this. |
|
The CI issue is not related, I opened #16545 to figure it out. You can safely ignore it (I think) |
adrinjalali
left a comment
There was a problem hiding this comment.
I'd be happy to have this in and work on #16546 on a different PR. Thanks @lschwetlick
|
@lschwetlick minimal dependencies have been updated in the meanwhile. Could you please synchronise with upstream? this will update the build and (hopefully) green-lights all the tests. Thanks! |
|
Please change the title from WiP to MRG if this is awaiting review but otherwise ready for merge |
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
…to _infer_dimension
I rebased the PR onto the current upstream master. I hope that was the right way to synchronise it? |
Looks good though merging master into your branch is cleaner :) |
|
Someone available to merge this twice approved all green PR? :) |
|
Thanks @lschwetlick (and @cmarmo for the reminder) ! |
|
Yay! About time this got closed! Thanks @lschwetlick
|
Reference Issues/PRs
closes #10359
closes #4441
This PR uses the proposed solutions from PR #10359 to issue #4441 and translates it to the current version of the decomposition package. We adressed 2 of the 3 issues raised by the reviewer:
We reverted changes relating to the off-by-one error. Maybe someone with more experience in linear algebra can weigh in and we can fix it as a seperate issue. It was discussed here and here.
other comments
Contributing authors were @gelavizh1, @marijavlajic and @lschwetlick at #Wimlds sprint Berlin :)
CC: @adrinjalali @noatamir