
[MRG+1] Custom token processor example#7286

Merged
ogrisel merged 9 commits intoscikit-learn:masterfrom
rth:new-token-processor-v2
Jun 30, 2017

Conversation

@rth (Member) commented Aug 29, 2016

Rebased version of PR #1735, fixes Issue #1156.

This would allow plugging in a custom token processor for text feature extraction, which can be used e.g. for stemming.

This PR addresses review comments by @GaelVaroquaux in the original PR.

Since this is a frequently requested feature (cf. comments in the original PR), with the general idea mostly validated in 2013: would it be possible to include this in the 0.18 release?

@rth rth force-pushed the new-token-processor-v2 branch from 9db0f10 to 14fc884 Compare August 29, 2016 22:42
@amueller (Member)
It looks to me like the outcome of #1156 was a different solution. Maybe @larsmans or @ogrisel care to comment?
I would also think that creating custom analyzers should be made easier. Otherwise we end up with arbitrarily many callbacks.
What you want to do is relatively easy to achieve by providing a custom analyzer, right? Or is it the n-grams that give you trouble?

We could maybe add an example on how to build a custom analyzer that can handle n-grams?

@rth (Member, Author) commented Aug 29, 2016

Well, issue #1156 is marked to be automatically closed (not sure how that works) when PR #1735 is merged, which itself was labeled as "Need contributor" not so long ago by @jnothman (since people keep commenting there). I don't personally have a problem with this issue.

I agree this can be achieved with a custom analyzer, but this simplifies the process from this example to:

stem_vectorizer = CountVectorizer(token_processor=FrenchStemmer().stem)

though the nested callbacks are indeed not pretty.
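For illustration, the proposed hook amounts to composing a per-token function into the analysis pipeline. A minimal, dependency-free sketch of that flow (`toy_stem` is a hypothetical stand-in for `FrenchStemmer().stem`; the token pattern mirrors scikit-learn's default `(?u)\b\w\w+\b`):

```python
import re

def toy_stem(token):
    # Toy stand-in for FrenchStemmer().stem (assumption): strip a plural "s".
    return token[:-1] if token.endswith("s") else token

def analyze(doc, token_processor=None):
    # Rough sketch of the proposed behavior: tokenize, then optionally
    # run each token through the user-supplied token_processor.
    tokens = re.findall(r"(?u)\b\w\w+\b", doc.lower())
    if token_processor is not None:
        tokens = [token_processor(t) for t in tokens]
    return tokens

print(analyze("Les chats mangent", token_processor=toy_stem))
# ['le', 'chat', 'mangent']
```

This is only a sketch of the parameter's intent, not the actual vectorizer internals.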

@rth rth force-pushed the new-token-processor-v2 branch 2 times, most recently from 7880dae to ab393b5 Compare August 29, 2016 23:23
@jnothman (Member)

@rth, you should know that we're working towards a release now and seeing as this is not critical (though admittedly it's long-sought) it probably won't be reviewed in detail right now. Please bump once 0.18 has been released over the coming weeks.

@rth rth changed the title Cleaned and updated version of token-processor v2 [MRG] Cleaned and updated version of token-processor v2 Aug 30, 2016
@amueller (Member)

I'd bump this but I think first we want 0.18.1 out the door ;)

@rth rth force-pushed the new-token-processor-v2 branch from ab393b5 to ab4fdd2 Compare November 2, 2016 10:00
@rth (Member, Author) commented Nov 2, 2016

I rebased and changed the code a bit to bypass this new token processor if it is not specified (the previous version had a 10% performance cost for bag-of-words with default parameters).

I'm not so happy that this adds yet another parameter to CountVectorizer, etc. Fundamentally, this reflects the problem that preprocessor, tokenizer and token_processor are somewhat redundant with the custom analyzer argument; at the same time, it's not trivial to add processing steps to the analyzer without getting into the internals of how it works. For instance, to add stemming with the current code base, one has to create an analyzer that inherits from the default analyzer (cf. post above), which is somewhat non-standard. Not sure what can be done about it, though.

P.S.: I know 0.18.1 is still in progress, so nothing critical here.

@rth rth force-pushed the new-token-processor-v2 branch from 4e3a4de to f0296e5 Compare November 2, 2016 11:14
@jnothman (Member) left a comment

Sorry, @rth, I see the need for this example, but not this feature. You're composing an arbitrary function onto another arbitrary function; and it applies only to words.

What's wrong with:

class StemmingMixin(object):
    def build_tokenizer(self):
        base = super(StemmingMixin, self).build_tokenizer()
        return lambda doc: map(stem, base(doc))

?
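The mixin above can be exercised without scikit-learn; a hedged sketch where `BaseVectorizer` is a minimal stand-in for CountVectorizer and `stem` is a toy stemmer (both assumptions, named only for the demo). A list comprehension replaces `map` so the result is concrete in Python 3:

```python
import re

class BaseVectorizer:
    # Minimal stand-in for CountVectorizer (assumption), enough to show the pattern.
    def build_tokenizer(self):
        token_pattern = re.compile(r"(?u)\b\w\w+\b")  # scikit-learn's default word pattern
        return token_pattern.findall

def stem(token):
    # Toy stemmer for the demo: strip a trailing "s".
    return token[:-1] if token.endswith("s") else token

class StemmingMixin:
    def build_tokenizer(self):
        tokenize = super().build_tokenizer()
        return lambda doc: [stem(t) for t in tokenize(doc)]

class StemmingVectorizer(StemmingMixin, BaseVectorizer):
    pass

print(StemmingVectorizer().build_tokenizer()("cats chase dogs"))
# ['cat', 'chase', 'dog']
```

With the real library, the same mixin would be combined with CountVectorizer in exactly this way.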

Only applies if ``analyzer == 'word'``.

token_processor: callable or None (default)
Override the token processing step while preserving the

Needs to be clear that it applies to a sequence of tokens, rather than each token individually.

preprocessing and n-grams generation steps.
Only applies if ``analyzer == 'word'``.

token_processor: callable or None (default)

space before colon.

@rth rth force-pushed the new-token-processor-v2 branch 2 times, most recently from 3974ab5 to 661adee Compare June 6, 2017 12:30
@rth rth changed the title [MRG] Cleaned and updated version of token-processor v2 [MRG] Custom token processor example Jun 6, 2017
@rth (Member, Author) commented Jun 6, 2017

@jnothman Thanks for the feedback, and sorry for the late response.

You are right, this PR was far from ideal. So I removed all the code changes and simply kept the example in the docs on how to use a custom token processor (the same could be used for stemming, though I'm not sure it's the most elegant example). Issue #1156 could then be closed with a "won't fix" status (i.e. one has to use a custom analyzer / tokenizer for stemming).

This also removes all the min_df=1 arguments to CountVectorizer & TfidfVectorizer in the user manual, since it's the default anyway.

cc @agramfort

(Note that this will not filter out punctuation.)


The following example will, for instance, transform British spelling to American

"some"!


This is when I say "so much for all those input parameters; overloading ftw!"

I'm sure this kind of thing is done in the wild, so we might as well advertise it.

LGTM!


Very nice example!

Another use for this, maybe more useful in practice, is normalizing all numbers to a new ## token

@jnothman jnothman changed the title [MRG] Custom token processor example [MRG+1] Custom token processor example Jun 6, 2017
>>> print(CustomVectorizer().build_analyzer()(u"color colour")) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
[...'color', ...'color']

The same approach could be used for stemming.

Maybe mention the number normalization idea here.


The number normalization idea is also performed in this example in our codebase. How about modifying that example to use the same inheritance trick (to avoid duplication of the tokenizing functionality), and then reference it from here?


@jnothman what do you think?


Modifying that example and referencing here seems reasonable.

The same approach could be used for stemming or normalizing all numbers
to a new token, as illustrated in:

* :ref:`sphx_glr_auto_examples_model_selection_grid_search_text_feature_extraction.py`

I think you're referencing the wrong example here

[...'color', ...'color']

The same approach could be used for stemming.
The same approach could be used for stemming or normalizing all numbers

[...] for other styles of preprocessing; examples include stemming, lemmatization, or normalizing numerical tokens, with the latter illustrated in:


def number_aware_tokenizer(doc):
""" Tokenizer that maps all numeric tokens to a placeholder.
def numbers_to_paceholder(tokens):

number_normalizer?


(paceholder is a typo anyway)

yield re.sub(r"^\d+$", "#NUMBER", t)


class CustomTfidfVectorizer(TfidfVectorizer):

class NumberNormalizingVectorizer (or NumberNormalizerTfidfVectorizer, NumberAware(Tfidf)Vectorizer)

@vene (Member) commented Jun 8, 2017

As soon as the minor wording fixes are done and we can check the Travis/CircleCI output, this is awesome imo!

@rth (Member, Author) commented Jun 8, 2017

Thanks for the quick review @vene! Made the changes, waiting for CI..

@rth (Member, Author) commented Jun 8, 2017

Also replaced super(CustomVectorizer, self).build_tokenizer() with CountVectorizer.build_tokenizer(self). Not sure what the policy is on that...
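The two spellings behave the same in single inheritance but differ under multiple inheritance, which is presumably why the change was reverted. A small sketch of the difference (class names are illustrative only):

```python
class Base:
    def build_tokenizer(self):
        return ["base"]

class ViaSuper(Base):
    # super() follows the MRO, so a mixin inserted between the two
    # classes would still be called.
    def build_tokenizer(self):
        return super().build_tokenizer() + ["extra"]

class ViaExplicit(Base):
    # Naming the base class directly is equivalent in single inheritance,
    # but bypasses anything inserted between the two in the MRO.
    def build_tokenizer(self):
        return Base.build_tokenizer(self) + ["extra"]

print(ViaSuper().build_tokenizer(), ViaExplicit().build_tokenizer())
# ['base', 'extra'] ['base', 'extra']
```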

@rth rth force-pushed the new-token-processor-v2 branch from 31bb781 to eadfafc Compare June 9, 2017 07:18
@rth (Member, Author) commented Jun 9, 2017

@vene OK, I reverted the last change regarding super(). CircleCI and Travis passed..

Edit: rebased so that Appveyor build would pass..

...
>>> class CustomVectorizer(CountVectorizer):
...     def build_tokenizer(self):
...         base = super(CustomVectorizer, self).build_tokenizer()

one more thing, sorry!

could you rename this from base to tokenize or similar? (here as well as in the example)

for other styles of preprocessing; examples include stemming, lemmatization,
or normalizing numerical tokens, with the latter illustrated in:

* :ref:`sphx_glr_auto_examples_bicluster_plot_bicluster_newsgroups.py.py`

this link is not working correctly, see rendering here


i think it's because of the double ".py" at the end ;)

@rth (Member, Author)

Thanks, missed that. Where do you find the URL of the docs rendered by CircleCI? CircleCI passed, but I'm not sure if it rendered correctly..


see #9082 ;)

for token in tokens]
return tokens
for t in tokens:
yield re.sub(r"^\d+$", "#NUMBER", t)
@vene (Member) Jun 9, 2017

This is different behavior than before. The docstring is no longer accurate: the old version checks whether the first character is a digit, while your regex requires the entire token to be made of digits.

Either update the docstring or (maybe better) replace it with

return ("#NUMBER" if token[0].isdigit() else token for token in tokens)

Come to think of it, even if you want to keep your behaviour, it is probably faster and more readable to use

return ("#NUMBER" if token.isdigit() else token for token in tokens)
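The suggested generator can be checked on its own; a quick sketch of the first-character variant (the token list is illustrative only):

```python
def number_normalizer(tokens):
    """Map tokens whose first character is a digit to a placeholder,
    matching the old example's behavior."""
    return ("#NUMBER" if token[0].isdigit() else token for token in tokens)

tokens = ["call", "911", "now", "24h", "x86"]
print(list(number_normalizer(tokens)))
# ['call', '#NUMBER', 'now', '#NUMBER', 'x86']
```

Note that "24h" is normalized under this rule but would survive the whole-token regex `^\d+$`, which is exactly the behavioral difference being discussed.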


On this dataset there seem to be quite a few tokens that start with digits and should probably be discarded, so it might be a good idea to showcase this idea, as well as to keep behavior as close to the old example as possible.

(I'm ok with removing the underscore, that was not really explained in the old example)


Done, reverted to the original behavior. Sorry for all the changes..

@rth rth force-pushed the new-token-processor-v2 branch 2 times, most recently from 34126fa to 59c2692 Compare June 10, 2017 10:55
@jnothman jnothman added this to the 0.19 milestone Jun 18, 2017
@rth rth force-pushed the new-token-processor-v2 branch from 59c2692 to 8437cb0 Compare June 21, 2017 16:42
@ogrisel (Member) commented Jun 30, 2017

LGTM, merging.

@ogrisel ogrisel merged commit b46e638 into scikit-learn:master Jun 30, 2017
@rth (Member, Author) commented Jul 2, 2017

Thanks for the review, @ogrisel.

@rth rth deleted the new-token-processor-v2 branch July 2, 2017 08:57
@vene (Member) commented Jul 3, 2017

Thanks @rth, this is a very nice example!

massich pushed a commit to massich/scikit-learn that referenced this pull request Jul 13, 2017
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
AishwaryaRK pushed a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
6 participants