[MRG+1] API make sure vectorizers read data from file before analyzing#13641
[MRG+1] API make sure vectorizers read data from file before analyzing#13641jnothman merged 19 commits intoscikit-learn:masterfrom
Conversation
jnothman
left a comment
There was a problem hiding this comment.
Do you think we need a deprecation period before changing this? Previously there was no benefit in specifying input with a custom analyzer, but users still might have done it??? Or do we not care if they relied on buggy (not as documented) behaviour? Otherwise I think this is good.
|
I think regarding custom analyzers there are two issues here which we should separate from one another:
To fix the first one, we can simply check if the input is a file object, or a valid existing file name, and not do more. For the second issue, this PR is changing the behavior and I don't see a backward compatible way w/ deprecation cycle of doing it w/o an extra parameter. I personally think this PR brings the handling of a custom analyzer to what people would expect(?), but I also understand the other side saying this is a change in behavior and not merely fixing a bug. |
|
I think this is a bug fix too. Is there a way we can issue a
ChangedBehaviorWarning in appropriate circumstances (at least if the
analyzer breaks)?
|
I think there are some edge cases where we may not detect the issue, but this probably covers most cases. Of course if the given custom analyzer actually catches those exceptions, we can't do much. |
I think with this PR, the analyzer would never be given a file object. I'm not sure if I understand you correctly here. |
jnothman
left a comment
There was a problem hiding this comment.
You're right. If won't be given the file object. Tired. Thankful you're tackling these ancient issues. Probably not in a state to be reviewing!
jnothman
left a comment
There was a problem hiding this comment.
Okay, I'm fine with this.
What's new? Emphasise that it may break current usage?
Co-Authored-By: adrinjalali <adrin.jalali@gmail.com>
…ikit-learn into countvectorizer/fileinput
| "and not the file names or the file objects. This warning " | ||
| "will be removed in v0.23.") | ||
| try: | ||
| self.analyzer(fname) |
There was a problem hiding this comment.
If input="file" why are we checking that analyser works with filenames? fname is a string, right?
There was a problem hiding this comment.
we're just making sure that the analyzer neither tries to read from a file object, nor try to open the non-existing file.
NicolasHug
left a comment
There was a problem hiding this comment.
looks like comments have been addressed, lgtm
|
Yay! Thanks @adrinjalali for digging this one up :) |
scikit-learn#13641)" This reverts commit 9010d9f.
scikit-learn#13641)" This reverts commit 9010d9f.
Fixes #5482
If the given analyzer is a calable, it seems reasonable to assume if
input='file'orinput='filename', the data should be read from the file first, and then passed to the analyzer, the same way as it's done for non-callable analyzers.This PR clarifies this in the docstrings, and passes the "decoded" input to the analyzer. It should be less of a concern regarding the input on the bytes vs str since we don't support python2 anymore.
I'm not entirely sure if this is what we wanna do, it's more of a proposal to move it forward.
After this PR, the following would result in a
FileNotFoundErrorexception: