
[MRG+1] Custom token processor example#7286

Merged
ogrisel merged 9 commits intoscikit-learn:masterfrom
rth:new-token-processor-v2
Jun 30, 2017

Conversation

@rth (Member) commented Aug 29, 2016

Rebased version of PR #1735, fixes Issue #1156.

This would allow plugging in a custom token processor for text feature extraction, which can be used e.g. for stemming.

This PR addresses review comments by @GaelVaroquaux in the original PR.

Since this is a frequently requested feature (cf. comments in the original PR), with the general idea mostly validated in 2013: would it be possible to include this in the 0.18 release?

@rth rth force-pushed the new-token-processor-v2 branch from 9db0f10 to 14fc884 Compare August 29, 2016 22:42
@amueller (Member)
It looks to me like the outcome of #1156 was a different solution. Maybe @larsmans or @ogrisel care to comment?
I would also think that creating custom analyzers should be made easier. Otherwise we end up with arbitrarily many callbacks.
What you want to do is relatively easy to achieve by providing a custom analyzer, right? Or is it the n-grams that give you trouble?

We could maybe add an example on how to build a custom analyzer that can handle n-grams?

@rth (Member, Author) commented Aug 29, 2016

Well, issue #1156 is marked to be automatically closed (not sure how that works) when PR #1735 is merged, which itself was labeled as "Need contributor" not so long ago by @jnothman (since people keep commenting there). I don't personally have a problem with this issue.

I agree this can be achieved with a custom analyzer, but this simplifies the process from this example to:

stem_vectorizer = CountVectorizer(token_processor=FrenchStemmer().stem)

though the nested callbacks are indeed not pretty.
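For illustration, the proposed hook amounts to composing a per-token function into the analysis pipeline. A minimal, dependency-free sketch of that flow (`toy_stem` is a hypothetical stand-in for `FrenchStemmer().stem`; the token pattern mirrors scikit-learn's default `(?u)\b\w\w+\b`):

```python
import re

def toy_stem(token):
    # Toy stand-in for FrenchStemmer().stem (assumption): strip a plural "s".
    return token[:-1] if token.endswith("s") else token

def analyze(doc, token_processor=None):
    # Rough sketch of the proposed behavior: tokenize, then optionally
    # run each token through the user-supplied token_processor.
    tokens = re.findall(r"(?u)\b\w\w+\b", doc.lower())
    if token_processor is not None:
        tokens = [token_processor(t) for t in tokens]
    return tokens

print(analyze("Les chats mangent", token_processor=toy_stem))
# ['le', 'chat', 'mangent']
```

This is only a sketch of the parameter's intent, not the actual vectorizer internals.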

@rth rth force-pushed the new-token-processor-v2 branch 2 times, most recently from 7880dae to ab393b5 Compare August 29, 2016 23:23
@jnothman (Member)

@rth, you should know that we're working towards a release now and seeing as this is not critical (though admittedly it's long-sought) it probably won't be reviewed in detail right now. Please bump once 0.18 has been released over the coming weeks.

@rth rth changed the title Cleaned and updated version of token-processor v2 [MRG] Cleaned and updated version of token-processor v2 Aug 30, 2016
@amueller (Member)

I'd bump this but I think first we want 0.18.1 out the door ;)

@rth rth force-pushed the new-token-processor-v2 branch from ab393b5 to ab4fdd2 Compare November 2, 2016 10:00
@rth (Member, Author) commented Nov 2, 2016

I rebased and changed the code a bit to bypass this new token processor if it is not specified (the previous version had a 10% performance cost for bag-of-words with default parameters).

I'm not so happy that this adds yet another parameter to CountVectorizer, etc. Fundamentally, this reflects the problem that preprocessor, tokenizer and token_processor are somewhat redundant with the custom analyzer argument; at the same time, it's not trivial to add processing steps to the analyzer without getting into the internals of how it works. For instance, to add stemming with the current code base, one has to create an analyzer that inherits from the default analyzer (cf. post above), which is somewhat non-standard. Not sure what can be done about it, though.

P.S.: I know 0.18.1 is still in progress, so nothing critical here.

@rth rth force-pushed the new-token-processor-v2 branch from 4e3a4de to f0296e5 Compare November 2, 2016 11:14
@jnothman (Member) left a comment

Sorry, @rth, I see the need for this example, but not this feature. You're composing an arbitrary function onto another arbitrary function; and it applies only to words.

What's wrong with:

class StemmingMixin(object):
    def build_tokenizer(self):
        base = super(StemmingMixin, self).build_tokenizer()
        return lambda doc: map(stem, base(doc))

?
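The mixin above can be exercised without scikit-learn; a hedged sketch where `BaseVectorizer` is a minimal stand-in for CountVectorizer and `stem` is a toy stemmer (both assumptions, named only for the demo). A list comprehension replaces `map` so the result is concrete in Python 3:

```python
import re

class BaseVectorizer:
    # Minimal stand-in for CountVectorizer (assumption), enough to show the pattern.
    def build_tokenizer(self):
        token_pattern = re.compile(r"(?u)\b\w\w+\b")  # scikit-learn's default word pattern
        return token_pattern.findall

def stem(token):
    # Toy stemmer for the demo: strip a trailing "s".
    return token[:-1] if token.endswith("s") else token

class StemmingMixin:
    def build_tokenizer(self):
        tokenize = super().build_tokenizer()
        return lambda doc: [stem(t) for t in tokenize(doc)]

class StemmingVectorizer(StemmingMixin, BaseVectorizer):
    pass

print(StemmingVectorizer().build_tokenizer()("cats chase dogs"))
# ['cat', 'chase', 'dog']
```

With the real library, the same mixin would be combined with CountVectorizer in exactly this way.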

Only applies if ``analyzer == 'word'``.

token_processor: callable or None (default)
Override the token processing step while preserving the

Needs to be clear that it applies to a sequence of tokens, rather than each token individually.

preprocessing and n-grams generation steps.
Only applies if ``analyzer == 'word'``.

token_processor: callable or None (default)

space before colon.

@rth rth force-pushed the new-token-processor-v2 branch 2 times, most recently from 3974ab5 to 661adee Compare June 6, 2017 12:30
@rth rth changed the title [MRG] Cleaned and updated version of token-processor v2 [MRG] Custom token processor example Jun 6, 2017
@rth (Member, Author) commented Jun 6, 2017

@jnothman Thanks for the feedback, and sorry for the late response.

You are right, this PR was far from ideal. So I removed all the code changes and simply kept the example in the docs on how to use a custom token processor (the same could be used for stemming, though I'm not sure it's the most elegant example). Issue #1156 could then be closed with a "won't fix" status (i.e. one has to use a custom analyzer / tokenizer for stemming).

This also removes all the min_df=1 arguments to CountVectorizer & TfidfVectorizer in the user manual, since it's the default anyway.

cc @agramfort

(Note that this will not filter out punctuation.)


The following example will, for instance, transform British spelling to American

"some"!


This is when I say "so much for all those input parameters; overloading ftw!"

I'm sure this kind of thing is done in the wild, so we might as well advertise it.

LGTM!


Very nice example!

Another use for this, maybe more useful in practice, is normalizing all numbers to a new ## token

@jnothman jnothman changed the title [MRG] Custom token processor example [MRG+1] Custom token processor example Jun 6, 2017
>>> print(CustomVectorizer().build_analyzer()(u"color colour")) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
[...'color', ...'color']

The same approach could be used for stemming.

Maybe mention the number normalization idea here.


The number normalization idea is also performed in this example in our codebase. How about modifying that example to use the same inheritance trick (to avoid duplication of the tokenizing functionality), and then reference it from here?


@jnothman what do you think?


Modifying that example and referencing here seems reasonable.

The same approach could be used for stemming or normalizing all numbers
to a new token, as illustrated in:

* :ref:`sphx_glr_auto_examples_model_selection_grid_search_text_feature_extraction.py`

I think you're referencing the wrong example here

[...'color', ...'color']

The same approach could be used for stemming.
The same approach could be used for stemming or normalizing all numbers

[...] for other styles of preprocessing; examples include stemming, lemmatization, or normalizing numerical tokens, with the latter illustrated in:


def number_aware_tokenizer(doc):
""" Tokenizer that maps all numeric tokens to a placeholder.
def numbers_to_paceholder(tokens):

number_normalizer?


(paceholder is a typo anyway)

yield re.sub(r"^\d+$", "#NUMBER", t)


class CustomTfidfVectorizer(TfidfVectorizer):

class NumberNormalizingVectorizer (or NumberNormalizerTfidfVectorizer, NumberAware(Tfidf)Vectorizer)

@vene (Member) commented Jun 8, 2017

As soon as the minor wording fixes are done and we can check the Travis/CircleCI output, this is awesome imo!

@rth (Member, Author) commented Jun 8, 2017

Thanks for the quick review @vene! Made the changes, waiting for CI..

@rth (Member, Author) commented Jun 8, 2017

Also replaced super(CustomVectorizer, self).build_tokenizer() with CountVectorizer.build_tokenizer(self). Not sure what the policy is on that...
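The two spellings behave the same in single inheritance but differ under multiple inheritance, which is presumably why the change was reverted. A small sketch of the difference (class names are illustrative only):

```python
class Base:
    def build_tokenizer(self):
        return ["base"]

class ViaSuper(Base):
    # super() follows the MRO, so a mixin inserted between the two
    # classes would still be called.
    def build_tokenizer(self):
        return super().build_tokenizer() + ["extra"]

class ViaExplicit(Base):
    # Naming the base class directly is equivalent in single inheritance,
    # but bypasses anything inserted between the two in the MRO.
    def build_tokenizer(self):
        return Base.build_tokenizer(self) + ["extra"]

print(ViaSuper().build_tokenizer(), ViaExplicit().build_tokenizer())
# ['base', 'extra'] ['base', 'extra']
```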

@rth rth force-pushed the new-token-processor-v2 branch from 31bb781 to eadfafc Compare June 9, 2017 07:18
@rth (Member, Author) commented Jun 9, 2017

@vene OK, I reverted the last change regarding super(). CircleCI and Travis passed..

Edit: rebased so that Appveyor build would pass..

...
>>> class CustomVectorizer(CountVectorizer):
...     def build_tokenizer(self):
...         base = super(CustomVectorizer, self).build_tokenizer()

one more thing, sorry!

could you rename this from base to tokenize or similar? (here as well as in the example)

for other styles of preprocessing; examples include stemming, lemmatization,
or normalizing numerical tokens, with the latter illustrated in:

* :ref:`sphx_glr_auto_examples_bicluster_plot_bicluster_newsgroups.py.py`

this link is not working correctly, see rendering here


i think it's because of the double ".py" at the end ;)

@rth (Member, Author)

Thanks, missed that. Where do you find the URL of the docs rendered by CircleCI? CircleCI passed, but I'm not sure if it rendered correctly..


see #9082 ;)

for token in tokens]
return tokens
for t in tokens:
yield re.sub(r"^\d+$", "#NUMBER", t)
@vene (Member) Jun 9, 2017

This is different behavior than before. The docstring is no longer accurate: the old version checks whether the first character is a digit, while your regex requires the entire token to be made of digits.

Either update the docstring or (maybe better) replace it with

return ("#NUMBER" if token[0].isdigit() else token for token in tokens)

Come to think of it, even if you want to keep your behaviour, it is probably faster and more readable to use

return ("#NUMBER" if token.isdigit() else token for token in tokens)
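The suggested generator can be checked on its own; a quick sketch of the first-character variant (the token list is illustrative only):

```python
def number_normalizer(tokens):
    """Map tokens whose first character is a digit to a placeholder,
    matching the old example's behavior."""
    return ("#NUMBER" if token[0].isdigit() else token for token in tokens)

tokens = ["call", "911", "now", "24h", "x86"]
print(list(number_normalizer(tokens)))
# ['call', '#NUMBER', 'now', '#NUMBER', 'x86']
```

Note that "24h" is normalized under this rule but would survive the whole-token regex `^\d+$`, which is exactly the behavioral difference being discussed.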


On this dataset there seem to be quite a few tokens that start with digits and should probably be discarded, so it might be a good idea to showcase this idea, as well as to keep behavior as close to the old example as possible.

(I'm ok with removing the underscore, that was not really explained in the old example)


Done, reverted to the original behavior. Sorry for all the changes..

@rth rth force-pushed the new-token-processor-v2 branch 2 times, most recently from 34126fa to 59c2692 Compare June 10, 2017 10:55
@jnothman jnothman added this to the 0.19 milestone Jun 18, 2017
@rth rth force-pushed the new-token-processor-v2 branch from 59c2692 to 8437cb0 Compare June 21, 2017 16:42
@ogrisel (Member) commented Jun 30, 2017

LGTM, merging.

@ogrisel ogrisel merged commit b46e638 into scikit-learn:master Jun 30, 2017
@rth (Member, Author) commented Jul 2, 2017

Thanks for the review, @ogrisel.

@rth rth deleted the new-token-processor-v2 branch July 2, 2017 08:57
@vene (Member) commented Jul 3, 2017

Thanks @rth, this is a very nice example!

massich pushed a commit to massich/scikit-learn that referenced this pull request Jul 13, 2017
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
AishwaryaRK pushed a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
6 participants