Cleaned and updated version of token-processor support by wrichert · Pull Request #1735 · scikit-learn/scikit-learn

wrichert · 2013-03-05T00:19:11Z

This is the cleaned up version of pull request #1537 to fix #1156.

mblondel · 2013-03-05T01:30:51Z

sklearn/feature_extraction/tests/test_count_vectorizer.py

Can you move these tests to test_text.py?

larsmans · 2013-03-05T11:13:15Z

I understand from the examples that the processor is called once per token. I'd prefer it to receive a stream (iterable) of tokens and produce one as well, to eliminate function call overhead, allow context-sensitive rules (POS/NER taggers!), and allow production of multiple tokens given a single one.

I'd also call it a token filter to mirror Lucene (and Unix) terminology.
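A minimal sketch of the stream-style filter described above, assuming nothing about the PR's actual code: each filter takes an iterable of tokens and yields tokens. The names `split_hyphens`, `join_negations`, and `chain_filters` are hypothetical and exist only for illustration. The first filter produces several tokens from one; the second needs context from the preceding token, which a per-token callback could not provide.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical alias: a token filter maps a token stream to a token stream.
TokenFilter = Callable[[Iterable[str]], Iterator[str]]


def split_hyphens(tokens: Iterable[str]) -> Iterator[str]:
    """One token in, possibly several out: 'state-of-the-art' gives 4 tokens."""
    for t in tokens:
        # Skip empty pieces produced by leading, trailing, or doubled hyphens.
        yield from (part for part in t.split("-") if part)


def join_negations(tokens: Iterable[str]) -> Iterator[str]:
    """Context-sensitive rule: fuse 'not' with the token that follows it."""
    pending = None
    for t in tokens:
        if pending is not None:
            yield pending + "_" + t
            pending = None
        elif t == "not":
            pending = t  # hold it back until the next token is seen
        else:
            yield t
    if pending is not None:
        yield pending  # a trailing 'not' with nothing after it


def chain_filters(*filters: TokenFilter) -> TokenFilter:
    """Compose filters left to right, Unix-pipe style."""
    def run(tokens: Iterable[str]) -> Iterator[str]:
        for f in filters:
            tokens = f(tokens)
        return iter(tokens)
    return run


pipeline = chain_filters(split_hyphens, join_negations)
result = list(pipeline(["not", "good", "state-of-the-art"]))
print(result)  # ['not_good', 'state', 'of', 'the', 'art']
```

Because every stage is a generator, the whole pipeline costs one function call per filter per document rather than one per token.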

larsmans · 2013-03-05T11:21:25Z

doc/modules/feature_extraction.rst

I'd prefer a more interesting example. E.g., using my proposed stream API,

    import re

    def to_british(tokens):
        """Heuristic British->American spelling converter."""
        for t in tokens:
            t = re.sub(r"(...)our$", r"\1or", t)
            t = re.sub(r"([bt])re$", r"\1er", t)
            t = re.sub(r"([iy])s(e$|ing|ation)", r"\1z\2", t)
            t = re.sub(r"ogue$", "og", t)
            yield t
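A quick check of what this heuristic does on a few sample tokens, written as a standalone sketch. The function name is kept from the thread even though the rules map British spellings to American ones. Every replacement here is a raw string: in a plain literal, `"\1"` is the character `chr(1)` rather than a backreference. The sample word list is an assumption chosen for illustration.

```python
import re


def to_british(tokens):
    """Heuristic British->American spelling converter (name kept from the thread)."""
    for t in tokens:
        t = re.sub(r"(...)our$", r"\1or", t)       # colour -> color
        t = re.sub(r"([bt])re$", r"\1er", t)       # centre -> center
        t = re.sub(r"([iy])s(e$|ing|ation)", r"\1z\2", t)  # organise -> organize
        t = re.sub(r"ogue$", "og", t)              # catalogue -> catalog
        yield t


sample = ["colour", "centre", "organise", "catalogue", "our", "raising"]
converted = list(to_british(sample))
print(converted)
# ['color', 'center', 'organize', 'catalog', 'our', 'raizing']
```

The last two tokens show the limits of the rules: the `(...)` guard leaves the short word "our" alone, while "raising" is wrongly rewritten to "raizing", so this is best read as a demonstration of the filter interface rather than a reliable spelling normalizer.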

wrichert · 2013-03-05T20:32:59Z

@larsmans Totally agree both with your stream API suggestion and the example. Addressed it in https://github.com/wrichert/scikit-learn/commit/083205e01a8db713c629d1cb5a11aca8c6963be7.

amueller · 2013-03-09T12:23:16Z

This can no longer be merged. Could you please rebase?

amueller · 2013-03-10T16:41:50Z

Sorry, I was too late on IRC. After the rebase, you can just force-push to the same branch and the PR will be updated automatically. Thanks :)

wrichert · 2013-03-10T18:16:05Z

I just force-pushed. Hopefully, the universe does not collapse now...

larsmans · 2013-03-11T11:55:00Z

I'm afraid my Py3 compat changes broke this PR again...

wrichert · 2013-03-15T16:24:19Z

@larsmans I've rebased again. Could you give it a try before it is outdated again? :-)

larsmans · 2013-03-15T16:26:18Z

I'm juggling a lot of projects simultaneously ATM, but I'll see if I can find some time. Don't hold your breath.

jseabold · 2013-05-19T01:23:58Z

Just wanted to bump and +1 this PR.

GaelVaroquaux · 2013-06-03T07:34:44Z

sklearn/feature_extraction/tests/test_text.py

Looking at the diff on GitHub, these lines seem repeated. I don't understand. Is there a reason, or is this just an oversight?

GaelVaroquaux · 2013-06-03T11:46:13Z

In general, I am 👍 for merge on this PR as it is a nice feature (please address my small comments). However, I have the feeling that this part of the codebase is getting more and more hairy.

It would be good if someone with a good understanding of the text-processing application tidied it up. I have in mind:

- Separating out a clear public and private API. We don't want too many public methods.
- Making sure that there is no redundancy.
- Making sure that everything pickles correctly (a lot of functionality is defined in closures of functions, which I am uneasy about).
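The pickling concern can be illustrated with a small standalone sketch. It does not reproduce the code under review: `make_tokenizer_closure` and `RegexTokenizer` are hypothetical names, and the pattern is an assumption modelled on a typical word-token regex. A function defined inside another function cannot be pickled by the standard `pickle` module, whereas an instance of a module-level callable class can.

```python
import pickle
import re


def make_tokenizer_closure(pattern):
    """Closure-based tokenizer: the inner function is local, so pickle cannot look it up by name."""
    compiled = re.compile(pattern)

    def tokenize(doc):
        return compiled.findall(doc)

    return tokenize


class RegexTokenizer:
    """Same behaviour as the closure, but state lives in a plain attribute of a module-level class."""

    def __init__(self, pattern):
        self.pattern = pattern

    def __call__(self, doc):
        return re.findall(self.pattern, doc)


PATTERN = r"\b\w\w+\b"  # assumed pattern: words of two or more characters

closure = make_tokenizer_closure(PATTERN)
try:
    pickle.dumps(closure)
    closure_pickles = True
except (pickle.PicklingError, AttributeError, TypeError):
    # The exact exception type varies between Python versions.
    closure_pickles = False

# Round-trip the class-based tokenizer through pickle and use the restored copy.
restored = pickle.loads(pickle.dumps(RegexTokenizer(PATTERN)))
tokens = restored("Token filters should pickle cleanly")
print(closure_pickles, tokens)
```

The same reasoning applies to user-supplied token filters: a module-level function or callable object survives pickling, while a lambda or nested function does not.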

ogrisel · 2013-07-29T18:12:06Z

@wrichert sorry I completely forgot about this PR. I wanted to finalize it during the sprint last week but failed to do so. Could you please rebase it on top of current master?

EntilZha · 2015-12-02T03:14:10Z

Link chased here from Google, looking for exactly this functionality. Is there any status update on this, or has it already been added in some other way?

ldulcic · 2016-05-27T12:57:54Z

Will this be merged any time soon? It is a really useful feature, and it looks like the code was written some time ago but is still not included in a scikit-learn release. I hate to override these classes and create custom solutions when this is already solved :)

jnothman · 2016-05-29T14:23:29Z

@ldulcic, it seems there are a number of comments, particularly from @GaelVaroquaux, that @wrichert has not responded to. If someone would like to take over this PR and finish it up, it still seems like a desirable enhancement.

jnothman · 2016-09-29T14:43:43Z

superseded by #7286

mblondel reviewed Mar 5, 2013

larsmans reviewed Mar 5, 2013

wrichert added 4 commits March 10, 2013 19:18

Cleaned and updated version of token-processor support

6e8fb33

Forgot to add tests to test_text.py

6fc8796

Changing token_processor() to work on iterables of tokens instead of on single tokens; integrating larsmans' to_british() token_processor

9bd1536

Reapplying larsmans' min_df=1 changes

59523b5

wrichert added 5 commits March 14, 2013 21:39

Cleaned and updated version of token-processor support

c85c4ef

Forgot to add tests to test_text.py

801d0d6

Changing token_processor() to work on iterables of tokens instead of on single tokens; integrating larsmans' to_british() token_processor

ab74af6

Reapplying larsmans' min_df=1 changes

9eb520a

Merge branch 'new-token-processor' of https://github.com/wrichert/scikit-learn into new-token-processor

d9a1e35

GaelVaroquaux reviewed Jun 3, 2013

larsmans force-pushed the master branch from 58a55ad to 4b82379 on August 25, 2014 21:50

MechCoder force-pushed the master branch from 6deaea0 to 3f49cee on November 3, 2014 12:36

jnothman added the Need Contributor label May 29, 2016

rth pushed a commit to rth/scikit-learn that referenced this pull request Aug 29, 2016

Cleaned and updated version of the token processor (PR scikit-learn#1735)

66dd3eb

rth mentioned this pull request Aug 29, 2016

[MRG+1] Custom token processor example #7286

Merged

jnothman closed this Sep 29, 2016

rth pushed a commit to rth/scikit-learn that referenced this pull request Nov 2, 2016

Cleaned and updated version of the token processor (PR scikit-learn#1735)

7cce960

rth pushed a commit to rth/scikit-learn that referenced this pull request Jun 6, 2017

Cleaned and updated version of the token processor (PR scikit-learn#1735)

ae055d2

rth pushed a commit to rth/scikit-learn that referenced this pull request Jun 21, 2017

Cleaned and updated version of the token processor (PR scikit-learn#1735)

d1fe09b

10 participants