[MRG] Fixes 'math domain error' in sklearn.decomposition.PCA with "n_components='mle'#10359
[MRG] Fixes 'math domain error' in sklearn.decomposition.PCA with "n_components='mle'#10359thechargedneutron wants to merge 4 commits intoscikit-learn:masterfrom
Conversation
| n_redundant=1, n_clusters_per_class=1, | ||
| random_state=42) | ||
| pca = PCA(n_components='mle').fit(X) | ||
| assert_equal(pca.n_components_, 0) |
There was a problem hiding this comment.
isn't it surprising to get 0 components even if you have 1 informative feature?
There was a problem hiding this comment.
@agramfort Yes, indeed. There's an off by one problem as found in #4827 (comment) . But I am reluctant to change
scikit-learn/sklearn/decomposition/pca.py
Line 105 in 5311c81
ll.agrmax() + 1 as it may break existing code. Any workaround for this?
…into pca_svd_solver
|
@agramfort Does this look correct now? |
| n_components = \ | ||
| _infer_dimension_(explained_variance_, n_samples, n_features) | ||
| _infer_dimension_(explained_variance_, n_samples, | ||
| n_features) + 1 |
There was a problem hiding this comment.
you cannot change the behavior in any case like this. Was is it a bug? when I complained before about n_components=0 I was surprised but was the behavior expected for MLE option?
There was a problem hiding this comment.
I am not sure. But this off by one problem is discussed here. #4827 (comment)
There was a problem hiding this comment.
@vene are you able to comment here? I assume we should not be making this change together with the present fix, but I've not looked into it at all.
There was a problem hiding this comment.
I don't think there was an off-by-one error here formerly, the argmax seems to correspond correctly to the assessed rank. But if rank=0 is a bad answer, then we don't need to assess rank=0... If so, that's a separate bug, but this +1 should be removed, IMO.
|
if you look here:
http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_vs_fa_model_selection.html
you see that MLE gives 10 components as expected.
|
| n_components = \ | ||
| _infer_dimension_(explained_variance_, n_samples, n_features) | ||
| _infer_dimension_(explained_variance_, n_samples, | ||
| n_features) + 1 |
There was a problem hiding this comment.
@vene are you able to comment here? I assume we should not be making this change together with the present fix, but I've not looked into it at all.
| Number of samples. | ||
| n_features : int | ||
| Number of features. | ||
| rcond : float |
| Number of features. | ||
| rcond : float | ||
| Cut-off for values in `spectrum`. Any value lower than this | ||
| will be ignored (`default=1e-15`) |
There was a problem hiding this comment.
should the default be a function of dtype?
Reference Issues/PRs
Fixes #4441
What does this implement/fix? Explain your changes.
This is a PR originally by @lbillingham in #4827. All thanks to him.
Any other comments?