Skip to content

[MRG] MNT Fixes for PCA with n_components='mle'#16841

Merged
agramfort merged 6 commits intoscikit-learn:masterfrom
NicolasHug:fix_pca_mle
Apr 7, 2020
Merged

[MRG] MNT Fixes for PCA with n_components='mle'#16841
agramfort merged 6 commits intoscikit-learn:masterfrom
NicolasHug:fix_pca_mle

Conversation

@NicolasHug
Copy link
Copy Markdown
Member

@NicolasHug NicolasHug commented Apr 4, 2020

Fixes #16730
Closes #16546

  • infer_dimension() now returns a value in [1, n_features - 1]. rank = 0 isn't possible anymore (it's a bug fix, as this would transform into an empty array).
  • rank as passed to _assess_dimension() is actually the rank, not an index as previously. This fixes the infamous off-by-one error. Note also that the method isn't defined if we pass it rank == n_features (division by zero)
  • a few fixes in the formula

@NicolasHug
Copy link
Copy Markdown
Member Author

@NicolasHug NicolasHug added this to the 0.23 milestone Apr 4, 2020
@lschwetlick
Copy link
Copy Markdown
Contributor

This looks good to me, thanks for fixing!

@NicolasHug
Copy link
Copy Markdown
Member Author

Thanks for checking @lschwetlick . If you have time / feel like it, it'd be great if you could check with your notebook if the results make sense now? (but no worries otherwise)

@lschwetlick
Copy link
Copy Markdown
Contributor

Well... not sure if this is maybe quite edge case-y but

b = np.ones((9, 6))
print("rank=", np.linalg.matrix_rank(b))
u2, s2, vh2 = np.linalg.svd(b, full_matrices=True)
ll = np.empty_like(s2)
ll[0] = -np.inf
for r in range(1,s2.shape[0]):
    ll[r] = (_assess_dimension(np.asarray(s2), r, 9))
print(ll)

-> [-inf nan -inf -inf -inf -inf]

but then funnily enough it gives the right number of dimensions (1) because apparently argmax treats nan>-inf. Its giving nan because if all subsequent values in the spectrum are 0 then in line 76 v is also 0...

@lschwetlick
Copy link
Copy Markdown
Contributor

I'm also still slightly confused to be honest: we dont test full rank because the method isn't defined for testing full rank. But what if the number of dimensions is full rank?

@NicolasHug
Copy link
Copy Markdown
Member Author

NicolasHug commented Apr 7, 2020

Thanks @lschwetlick , I fixed that and added a test (though indeed it's an unlikely case I think)

But what if the number of dimensions is full rank?

I guess users use mle when they want some dimensionality reduction, but just don't know by how much. So that's fine not to consider rank==n_features (we can't anyway). There's always the solution to directly provide n_components=n_features.

@lschwetlick
Copy link
Copy Markdown
Contributor

Alright cool, other than that my notebook and I are happy :)

@agramfort agramfort merged commit a655de5 into scikit-learn:master Apr 7, 2020
@agramfort
Copy link
Copy Markdown
Member

thanks a lot @NicolasHug and @lschwetlick for the team work !

gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
* Fixed off by one in MLE and better handling of small eigenvalues

* light update tests

* pep8

* Added test + threhsold on small log
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: MLE for PCA mis-estimates rank Off-By-One Error in _pca with 'mle'

3 participants