Cleaned and updated version of token-processor support#1735
Cleaned and updated version of token-processor support#1735wrichert wants to merge 9 commits intoscikit-learn:masterfrom wrichert:new-token-processor
Conversation
There was a problem hiding this comment.
Can you move these tests to test_text.py?
|
I understand from the examples that the processor is called once per token. I'd prefer it to receive a stream (iterable) of tokens and produce one as well, to eliminate function call overhead, allow context-sensitive rules (POS/NER taggers!), and allow production of multiple tokens given a single one. I'd also call it a token filter to mirror Lucene (and Unix) terminology. |
doc/modules/feature_extraction.rst
Outdated
There was a problem hiding this comment.
I'd prefer a more interesting example. E.g., using my proposed stream API,
def to_british(tokens):
"""Heuristic British->American spelling converter."""
for t in tokens:
t = re.sub(r"(...)our$", "\1or", t)
t = re.sub(r"([bt])re$", r"\1er", t)
t = re.sub(r"([iy])s(e$|ing|ation)", r"\1z\2", t)
t = re.sub(r"ogue$", "og", t)
yield t
|
@larsmans Totally agree both with your stream API suggestion and the example. Addressed it in https://github.com/wrichert/scikit-learn/commit/083205e01a8db713c629d1cb5a11aca8c6963be7. |
|
This can no longer be merged. Could you please rebase? |
|
Sorry I was to late on IRC. After rebase, you can just force-push in the same branch and the PR will automatically be updated. Thanks :) |
|
I just force-pushed. Hopefully, the universe does not collapse now... |
single tokens; integrating larsmans' to_british() token_processor
|
I'm afraid my Py3 compat changes broke this PR again... |
single tokens; integrating larsmans' to_british() token_processor
…kit-learn into new-token-processor
|
@larsmans I've rebased again. Could you give it a try before it is outdated again? :-) |
|
I'm juggling a lot of projects simultaneously ATM, but I'll see if I can find some time. Don't hold your breath. |
|
Just wanted to bump and +1 this PR. |
There was a problem hiding this comment.
Looking at the diff on github, these lines seem repeated. I don't understand. Is there a reason, or is this just a oversight?
|
In general, I am 👍 for merge on this PR as it is a nice feature (please address my small comments). However, I have the feeling that this part of the codebase is getting more and more hairy. It would be good that someone with good understanding of the text processing application puts a bit of order. I have in mind:
|
|
@wrichert sorry I completely forgot about this PR. I wanted to finalize it during the sprint last week but failed to do so. Could you please rebase it on top of current master? |
|
Link chased here from google looking for exactly this functionality. Is there any status update on this or has it already been added in some other way? |
|
Will this be merged any time soon? It is really useful feature and it looks like code has been written some time ago but still not included in scikit-learn release. I hate to override these classes and create custom solutions when this is already solved :) |
|
@ldulcic, it seems as if there are a number of comments, particularly from @GaelVaroquaux that @wrichert has not responded to. If someone would like to take over this PR, and finish it up, it seems like it's still a desirable enhancement. |
|
superseded by #7286 |
This is the cleaned up version of pull request #1537 to fix #1156.