Skip to content

Cleaned and updated version of token-processor support#1735

Closed
wrichert wants to merge 9 commits intoscikit-learn:masterfrom
wrichert:new-token-processor
Closed

Cleaned and updated version of token-processor support#1735
wrichert wants to merge 9 commits intoscikit-learn:masterfrom
wrichert:new-token-processor

Conversation

@wrichert
Copy link
Copy Markdown

@wrichert wrichert commented Mar 5, 2013

This is the cleaned up version of pull request #1537 to fix #1156.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move these tests to test_text.py?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@larsmans
Copy link
Copy Markdown
Member

larsmans commented Mar 5, 2013

I understand from the examples that the processor is called once per token. I'd prefer it to receive a stream (iterable) of tokens and produce one as well, to eliminate function call overhead, allow context-sensitive rules (POS/NER taggers!), and allow production of multiple tokens given a single one.

I'd also call it a token filter to mirror Lucene (and Unix) terminology.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer a more interesting example. E.g., using my proposed stream API,

def to_british(tokens):
    """Heuristic British->American spelling converter."""
    for t in tokens:
        t = re.sub(r"(...)our$", "\1or", t)
        t = re.sub(r"([bt])re$", r"\1er", t)
        t = re.sub(r"([iy])s(e$|ing|ation)", r"\1z\2", t)
        t = re.sub(r"ogue$", "og", t)
        yield t

@wrichert
Copy link
Copy Markdown
Author

wrichert commented Mar 5, 2013

@larsmans Totally agree both with your stream API suggestion and the example. Addressed it in https://github.com/wrichert/scikit-learn/commit/083205e01a8db713c629d1cb5a11aca8c6963be7.

@amueller
Copy link
Copy Markdown
Member

amueller commented Mar 9, 2013

This can no longer be merged. Could you please rebase?

@amueller
Copy link
Copy Markdown
Member

Sorry I was to late on IRC. After rebase, you can just force-push in the same branch and the PR will automatically be updated. Thanks :)

@wrichert
Copy link
Copy Markdown
Author

I just force-pushed. Hopefully, the universe does not collapse now...

@larsmans
Copy link
Copy Markdown
Member

I'm afraid my Py3 compat changes broke this PR again...

@wrichert
Copy link
Copy Markdown
Author

@larsmans I've rebased again. Could you give it a try before it is outdated again? :-)

@larsmans
Copy link
Copy Markdown
Member

I'm juggling a lot of projects simultaneously ATM, but I'll see if I can find some time. Don't hold your breath.

@jseabold
Copy link
Copy Markdown
Contributor

Just wanted to bump and +1 this PR.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the diff on github, these lines seem repeated. I don't understand. Is there a reason, or is this just a oversight?

@GaelVaroquaux
Copy link
Copy Markdown
Member

In general, I am 👍 for merge on this PR as it is a nice feature (please address my small comments). However, I have the feeling that this part of the codebase is getting more and more hairy.

It would be good that someone with good understanding of the text processing application puts a bit of order. I have in mind:

  • Separating out a clear public and private API. We don't want too many public methods
  • Making sure that there is no redundancy.
  • Making sure that everything pickles right (a lot of functionality is defined in closures of functions. I am uneasy with this).

@ogrisel
Copy link
Copy Markdown
Member

ogrisel commented Jul 29, 2013

@wrichert sorry I completely forgot about this PR. I wanted to finalize it during the sprint last week but failed to do so. Could you please rebase it on top of current master?

@EntilZha
Copy link
Copy Markdown

EntilZha commented Dec 2, 2015

Link chased here from google looking for exactly this functionality. Is there any status update on this or has it already been added in some other way?

@ldulcic
Copy link
Copy Markdown

ldulcic commented May 27, 2016

Will this be merged any time soon? It is really useful feature and it looks like code has been written some time ago but still not included in scikit-learn release. I hate to override these classes and create custom solutions when this is already solved :)

@jnothman
Copy link
Copy Markdown
Member

@ldulcic, it seems as if there are a number of comments, particularly from @GaelVaroquaux that @wrichert has not responded to. If someone would like to take over this PR, and finish it up, it seems like it's still a desirable enhancement.

@jnothman
Copy link
Copy Markdown
Member

superseded by #7286

@jnothman jnothman closed this Sep 29, 2016
rth pushed a commit to rth/scikit-learn that referenced this pull request Nov 2, 2016
rth pushed a commit to rth/scikit-learn that referenced this pull request Jun 6, 2017
rth pushed a commit to rth/scikit-learn that referenced this pull request Jun 21, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add stemming support to CountVectorizer

10 participants