[MRG+1] Custom token processor example #7286
Conversation
It looks to me that the outcome of #1156 was a different solution. Maybe @larsmans or @ogrisel care to comment? We could perhaps add an example of how to build a custom analyzer that can handle n-grams?
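(For context, a hedged sketch of what such a custom analyzer could look like: when ``analyzer`` is a callable, ``CountVectorizer`` uses it to extract features directly from the raw document, bypassing its own tokenization and ``ngram_range`` machinery, so the callable must generate the n-grams itself. The ``ngram_analyzer`` name and the naive whitespace tokenization are illustrative assumptions, not scikit-learn API.)

```python
from sklearn.feature_extraction.text import CountVectorizer

def ngram_analyzer(doc, n=2):
    # naive whitespace tokenization; a real analyzer would also handle
    # preprocessing (lowercasing, accent stripping, punctuation, etc.)
    tokens = doc.lower().split()
    # emit unigrams ...
    features = list(tokens)
    # ... plus word n-grams, since a callable analyzer bypasses
    # CountVectorizer's own ngram_range machinery
    features.extend(" ".join(tokens[i:i + n])
                    for i in range(len(tokens) - n + 1))
    return features

vectorizer = CountVectorizer(analyzer=ngram_analyzer)
vectorizer.fit(["red apple", "green apple"])
print(sorted(vectorizer.vocabulary_))
```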
Well, issue #1156 is marked to be automatically closed (not sure how that works) when PR #1735 is merged, which itself was labeled "Need contributor" not so long ago by @jnothman (since people keep commenting there). I don't personally have a problem with this issue. I agree this can be achieved with a custom analyser, but this simplifies the process from this example to, though the nested callbacks are indeed not pretty.
@rth, you should know that we're working towards a release now, and seeing as this is not critical (though admittedly long-sought) it probably won't be reviewed in detail right now. Please bump once 0.18 has been released in the coming weeks.
I'd bump this but I think first we want 0.18.1 out the door ;)
I rebased and changed the code a bit to bypass this new text token processor when it is not specified (the previous version had a 10% performance cost for bag-of-words with default parameters). I'm not so happy that this adds yet another parameter to CountVectorizer, etc. However, fundamentally this reflects the problem that

P.S.: I know 0.18.1 is still in progress, so nothing critical here.
jnothman left a comment:
Sorry, @rth, I see the need for this example, but not this feature. You're composing an arbitrary function onto another arbitrary function, and it applies only to words.
What's wrong with:
```python
class StemmingMixin(object):
    def build_tokenizer(self):
        base = super(StemmingMixin, self).build_tokenizer()
        return lambda doc: map(stem, base(doc))
```
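(A runnable sketch of this suggestion; ``toy_stem`` is a hypothetical stand-in for a real stemmer such as one from NLTK, and a list comprehension replaces ``map`` so the result is a list under Python 3 as well.)

```python
from sklearn.feature_extraction.text import CountVectorizer

def toy_stem(token):
    # hypothetical stand-in for an actual stemming function:
    # it only strips a plural "s" from longer tokens
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

class StemmingMixin(object):
    def build_tokenizer(self):
        # compose stemming onto whatever tokenizer the vectorizer builds
        tokenize = super(StemmingMixin, self).build_tokenizer()
        return lambda doc: [toy_stem(t) for t in tokenize(doc)]

class StemmedCountVectorizer(StemmingMixin, CountVectorizer):
    pass

vect = StemmedCountVectorizer()
vect.fit(["cats and dogs", "a cat and a dog"])
print(sorted(vect.vocabulary_))
```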
sklearn/feature_extraction/text.py (outdated)

    Only applies if ``analyzer == 'word'``.

    token_processor : callable or None (default)
        Override the token processing step while preserving the
Needs to be clear that it applies to a sequence of tokens, rather than to each token individually.
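(To make the distinction concrete: a token processor under this design receives the whole token sequence at once, so it can filter as well as map; ``drop_short_tokens`` is purely an illustrative name.)

```python
def drop_short_tokens(tokens):
    # operates on the full sequence of tokens, so it can drop
    # elements entirely rather than only map each one
    return [t for t in tokens if len(t) > 2]

print(drop_short_tokens(["a", "an", "apple", "is", "red"]))
```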
sklearn/feature_extraction/text.py (outdated)

    preprocessing and n-grams generation steps.
    Only applies if ``analyzer == 'word'``.

    token_processor : callable or None (default)
@jnothman Thanks for the feedback, and sorry for the late response. You are right, this PR was far from ideal, so I removed all the code changes and simply kept the example in the docs on how to use a custom token processor (the same could be used for stemming, though I'm not sure it's the most elegant example). Issue #1156 could then be closed as "won't fix" (i.e. one has to use a custom analyser / tokenizer for stemming). This also removes all

cc @agramfort
doc/modules/feature_extraction.rst (outdated)

    (Note that this will not filter out punctuation.)

    The following example will, for instance, transform British spelling to American
This is when I say "so much for all those input parameters; overloading ftw!"
I'm sure this kind of thing is done in the wild, so we might as well advertise it.
LGTM!
Very nice example!
Another use for this, maybe more useful in practice, is normalizing all numbers to a new ``##`` token.
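(A minimal sketch of that idea as a sequence-level processor; the ``##`` placeholder and the ``numbers_to_placeholder`` name are illustrative assumptions.)

```python
import re

def numbers_to_placeholder(tokens):
    # collapse every all-digit token into a single "##" placeholder,
    # so "555" and "0199" end up in the same vocabulary entry
    return ["##" if re.fullmatch(r"\d+", t) else t for t in tokens]

print(numbers_to_placeholder(["call", "555", "0199", "now"]))
```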
doc/modules/feature_extraction.rst (outdated)

    >>> print(CustomVectorizer().build_analyzer()(u"color colour"))  # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
    [...'color', ...'color']

    The same approach could be used for stemming.
Maybe mention the number normalization idea here.
The number normalization idea is also performed in this example in our codebase. How about modifying that example to use the same inheritance trick (to avoid duplicating the tokenizing functionality), and then referencing it from here?
Modifying that example and referencing it here seems reasonable.
doc/modules/feature_extraction.rst (outdated)

    The same approach could be used for stemming or normalizing all numbers
    to a new token, as illustrated in:

    * :ref:`sphx_glr_auto_examples_model_selection_grid_search_text_feature_extraction.py`
I think you're referencing the wrong example here.
doc/modules/feature_extraction.rst (outdated)

    [...'color', ...'color']

    -The same approach could be used for stemming.
    +The same approach could be used for stemming or normalizing all numbers
[...] for other styles of preprocessing; examples include stemming, lemmatization, or normalizing numerical tokens, with the latter illustrated in:
    -def number_aware_tokenizer(doc):
    -    """ Tokenizer that maps all numeric tokens to a placeholder.
    +def numbers_to_placeholder(tokens):
    +    yield re.sub(r"^\d+$", "#NUMBER", t)

    class CustomTfidfVectorizer(TfidfVectorizer):
``class NumberNormalizingVectorizer`` (or ``NumberNormalizerTfidfVectorizer``, ``NumberAware(Tfidf)Vectorizer``)
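(Under the first of those names, the subclass could be sketched as follows; the first-character ``isdigit`` test mirrors the old example's behaviour, and the toy documents are illustrative.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer

class NumberNormalizingVectorizer(TfidfVectorizer):
    def build_tokenizer(self):
        # wrap the default tokenizer, mapping any token whose first
        # character is a digit to a "#NUMBER" placeholder
        tokenize = super(NumberNormalizingVectorizer, self).build_tokenizer()
        return lambda doc: ["#NUMBER" if token[0].isdigit() else token
                            for token in tokenize(doc)]

vectorizer = NumberNormalizingVectorizer()
vectorizer.fit(["ticket 1234 closed", "ticket 9876 reopened"])
print(sorted(vectorizer.vocabulary_))
```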
As soon as the minor wording fixes are done and we can check the Travis/CircleCI output, this is awesome imo!
Thanks for the quick review, @vene! Made the changes; waiting for CI.
Also replaced
@vene OK, I reverted the last change regarding

Edit: rebased so that the AppVeyor build would pass.
doc/modules/feature_extraction.rst (outdated)

    ...
    >>> class CustomVectorizer(CountVectorizer):
    ...     def build_tokenizer(self):
    ...         base = super(CustomVectorizer, self).build_tokenizer()
One more thing, sorry! Could you rename this from ``base`` to ``tokenize`` or similar? (here as well as in the example)
doc/modules/feature_extraction.rst (outdated)

    for other styles of preprocessing; examples include stemming, lemmatization,
    or normalizing numerical tokens, with the latter illustrated in:

    * :ref:`sphx_glr_auto_examples_bicluster_plot_bicluster_newsgroups.py.py`
This link is not working correctly, see rendering here
I think it's because of the double ".py" at the end ;)
Thanks, missed that. Where do you find the URL of the docs rendered by CircleCI? CircleCI passed, but I'm not sure whether it rendered correctly.
    -                  for token in tokens]
    -        return tokens
    +        for t in tokens:
    +            yield re.sub(r"^\d+$", "#NUMBER", t)
This is different behavior than before. The docstring is no longer accurate: the old version checks whether the first character is a digit, while your regex requires the entire token to be made of digits.
Either update the docstring or (maybe better) replace it with

```python
return ("#NUMBER" if token[0].isdigit() else token for token in tokens)
```

Come to think of it, even if you want to keep your behaviour, it is probably faster and more readable to use

```python
return ("#NUMBER" if token.isdigit() else token for token in tokens)
```
On this dataset there seem to be quite a few tokens that start with digits and should probably be discarded, so it might be a good idea to showcase this, as well as to keep the behavior as close to the old example as possible.
(I'm OK with removing the underscore; that was not really explained in the old example.)
Done, reverted to the original behavior. Sorry for all the changes.
LGTM, merging.
Thanks for the review, @ogrisel.
Thanks @rth, this is a very nice example!
Rebased version of PR #1735, fixes Issue #1156.
This would allow plugging in a custom token processor for text feature extraction, which can be used e.g. for stemming.
This PR addresses review comments by @GaelVaroquaux in the original PR.
Since this is a relatively demanded feature (cf. comments in the original PR), with the general idea mostly validated in 2013: would it be possible to include this in the 0.18 release?