
[MRG] Fixes 'math domain error' in sklearn.decomposition.PCA with "n_components='mle'#10359

Closed
thechargedneutron wants to merge 4 commits into scikit-learn:master from thechargedneutron:pca_svd_solver

Conversation

@thechargedneutron
Contributor

@thechargedneutron thechargedneutron commented Dec 22, 2017

Reference Issues/PRs

Fixes #4441

What does this implement/fix? Explain your changes.

This is a PR originally by @lbillingham in #4827. All thanks to him.

Any other comments?
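For context, a minimal sketch of the kind of low-rank input that can trigger this code path (the data and shapes here are illustrative assumptions, not the exact reproduction from #4441):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
# Low-rank data: every feature is an exact copy of one of two base
# columns, so the trailing eigenvalues of the covariance are ~0.
base = rng.randn(100, 2)
X = np.hstack([base] * 5)  # 100 samples, 10 features, rank 2

# Before this fix, Minka's MLE dimension estimate could raise
# "ValueError: math domain error" (log of a non-positive spectrum value).
pca = PCA(n_components='mle', svd_solver='full').fit(X)
print(pca.n_components_)
```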

@thechargedneutron thechargedneutron changed the title [WIP] Fixes 'math domain error' in sklearn.decomposition.PCA with "n_components='mle' [MRG] Fixes 'math domain error' in sklearn.decomposition.PCA with "n_components='mle' Dec 22, 2017
```
        n_redundant=1, n_clusters_per_class=1,
        random_state=42)
pca = PCA(n_components='mle').fit(X)
assert_equal(pca.n_components_, 0)
```
Member


isn't it surprising to get 0 components even if you have 1 informative feature?

Contributor Author

@thechargedneutron thechargedneutron Dec 24, 2017


@agramfort Yes, indeed. There's an off-by-one problem, as found in #4827 (comment). But I am reluctant to change

```
return ll.argmax()
```

to `ll.argmax() + 1`, as it may break existing code. Any workaround for this?
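To make the off-by-one concern concrete, a toy sketch (the likelihood values are made up) of how adding `+ 1` shifts the reported rank:

```python
import numpy as np

# Illustrative log-likelihoods for candidate ranks 0..3 (made-up values):
# the MLE picks the rank whose likelihood is highest.
ll = np.array([-10.0, -2.0, -5.0, -9.0])

print(ll.argmax())      # -> 1: index of the best-scoring rank
print(ll.argmax() + 1)  # -> 2: every estimate shifts up by one
```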

@thechargedneutron
Contributor Author

@agramfort Does this look correct now?

```diff
-            n_components = \
-                _infer_dimension_(explained_variance_, n_samples, n_features)
+            n_components = \
+                _infer_dimension_(explained_variance_, n_samples,
+                                  n_features) + 1
```
Member


You cannot change the behavior in any case like this. Was it a bug? When I complained before about n_components=0 I was surprised, but was the behavior expected for the MLE option?

Contributor Author


I am not sure. But this off-by-one problem is discussed here: #4827 (comment)

Member


@vene are you able to comment here? I assume we should not be making this change together with the present fix, but I've not looked into it at all.

Member


I don't think there was an off-by-one error here formerly; the argmax seems to correspond correctly to the assessed rank. But if rank=0 is a bad answer, then we don't need to assess rank=0... If so, that's a separate bug, but this +1 should be removed, IMO.

@agramfort
Member

agramfort commented Dec 28, 2017 via email


```
    Number of samples.
n_features : int
    Number of features.
rcond : float
```
Member


why the name rcond?

```
    Number of features.
rcond : float
    Cut-off for values in `spectrum`. Any value lower than this
    will be ignored (`default=1e-15`)
```
Member


should the default be a function of dtype?
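For reference, a dtype-aware cutoff could be built from the machine epsilon, in the spirit of `numpy.linalg.matrix_rank`'s default tolerance. This is a hypothetical helper for illustration, not sklearn's API:

```python
import numpy as np

def spectrum_cutoff(spectrum, dtype=np.float64):
    # Hypothetical dtype-dependent cutoff: scale machine epsilon by the
    # largest spectrum value and the number of entries, as matrix_rank
    # does for singular values.
    eps = np.finfo(dtype).eps
    return spectrum.max() * len(spectrum) * eps

s = np.array([3.0, 1.0, 1e-18])
cut = spectrum_cutoff(s)         # ~2e-15 for float64
print((s > cut).sum())           # -> 2: the tiny value is discarded
```

With `dtype=np.float32` the cutoff would be roughly seven orders of magnitude larger, which is the point of the question: a fixed `1e-15` is below float32 precision.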


Development

Successfully merging this pull request may close these issues.

Problems in sklearn.decomposition.PCA with "n_components='mle' option"

4 participants