[MRG + 1] Raising an error when batch_size < n_components in IncrementalPCA#9303
[MRG + 1] Raising an error when batch_size < n_components in IncrementalPCA#9303jnothman merged 22 commits intoscikit-learn:masterfrom
Conversation
|
Codecov: "No report found to compare against" ? |
…into n_samples6452
|
Sugar. Reverting the merge I did while trying to solve the codecov issue did not have the effect I expected (some commits by others have been undone) EDIT: ok, solved. |
|
|
||
| if self.n_components is None: | ||
| self.n_components_ = n_features | ||
| if self.components_ is None: |
There was a problem hiding this comment.
Can you add regression tests for these? I guess if n_features < n_samples we had an error earlier?
There was a problem hiding this comment.
Yes, the master had an error if n_samples < n_features (you wrote the opposite, but I believe it was a typo right?). As a ‘visual’ aid, this is the partial_fit method, so n_samples is equivalent to the size of the batches used.
|
looks good apart of missing regression test for n_components=None and |
|
added the regression test! |
|
ping @amueller. Should I add an entry to whats_new.rst? |
|
@wallygauze sorry, too many reviews :-/ yes, whatsnew sounds good :) |
|
LGTM |
|
Thanks. |
| "processing" % (self.n_components, n_features)) | ||
| elif not self.n_components <= n_samples: | ||
| raise ValueError("n_components=%r must be less or equal to " | ||
| "the batch number of samples %d. You can change " |
There was a problem hiding this comment.
I would remove the "You can change either one depending on what you want". It doesn't bring any useful piece of information and the message is clear without it.
There was a problem hiding this comment.
Is that so. I added that to avoid putting the emphasis on either of the parameters. But you are right that it should still be easy to see even for a beginner.
|
@lesteve Done |
| "processing" % (self.n_components, n_features)) | ||
| elif not self.n_components <= n_samples: | ||
| raise ValueError("n_components=%r must be less or equal to " | ||
| "the batch number of samples " |
There was a problem hiding this comment.
Funnily enough we were chatting with @ogrisel about this yesterday in an unrelated context. IIUC he was hoping that IncrementalPCA would be able to do partial_fit on a small number of samples (and converge to something sensible after a few calls to partial_fit). It looks like this is not the case at the moment ...
There was a problem hiding this comment.
Also I bumped into the same problem (checking that n_components <= n_features but not n_components <= n_samples) in sklearn/decomposition/pca.py yesterday. There are also slight inconsistencies between _fit_full and _fit_truncated. Not in this PR but I think we should have a helper function that is reused where appropriate.
There was a problem hiding this comment.
@lesteve I have a pull-request for PCA as well --> #8742.
It has received a number of reviews and it all seems pretty much finished, but it has not received much attention these last months because it was not marked for the 0.19 release (which on second thoughts may be incongruous since it's practically the same as this.)
Do you want to have a look, I do think it may be a quick case to just finish off.
lesteve
left a comment
There was a problem hiding this comment.
I pushed some minor changes in the test. Other than this this looks good.
| self.n_components_ = self.components_.shape[0] | ||
| elif not 1 <= self.n_components <= n_features: | ||
| raise ValueError("n_components=%r invalid for n_features=%d, need " | ||
| "more rows than columns for IncrementalPCA " |
There was a problem hiding this comment.
I have no idea what "more rows than columns means here" ...
There was a problem hiding this comment.
I don't think the message is good here either, but I wanted to focus my pull request on the points mentioned in #6452 (so that it would be reviewed and merged more quickly).
| rng = np.random.RandomState(1999) | ||
| for n_samples, n_features in [(50, 10), (10, 50)]: | ||
|
|
||
| X = rng.rand(n_samples, n_features) |
There was a problem hiding this comment.
Thanks for all the improvements. For this bit here, the reason why I wanted X to be created anew for each partial_fit call was just to replicate the mechanism of batches (i.e. if you get three batches of 30 samples from a dataset with 90 samples, the three batches will be different).
Not sure it's that important for our case, but regardless I just realised my code was producing identical batches anyway, since the random state is fixed beforehand (I guess I would have had to use a different random state for the second call)
|
@lesteve Can we merge or are we expecting a third person to review the changes you yourself added to the test? It lgtm |
jnothman
left a comment
There was a problem hiding this comment.
Other than needing the what's new entry to move to 0.20, this looks good to me
| "n_components={} invalid for n_features={}, need" | ||
| " more rows than columns for IncrementalPCA " | ||
| "processing".format(n_components, n_features), | ||
| IncrementalPCA(n_components, batch_size=10).fit, X) |
There was a problem hiding this comment.
Should this also be raised for partial_fit?
|
Thanks for your contribution and for your patience! |
|
:D |
Reference Issue
#6452
What does this implement/fix? Explain your changes.
Raises an error when the number of samples of the dataset entered in partial_fit is lower than the number of components
Obviously, the problem mentioned in the reference issue would also present itself when n_components was 'None'. The modified test from the issue resulted in the same failure:
Any other comments?