Add stemming support to CountVectorizer #1156

An idea for a feature enhancement:

I'm currently using [`sklearn.feature_extraction.text.CountVectorizer`](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L86) for one of my projects. In my opinion, it would benefit greatly from stemming support. An additional attribute such as `self.stemmer`, which accepts a stemming function as a callable, would be nice, together with a reasonable default stemmer for English.

What about an additional stemmer module? Then again, some quite good Python stemmers are already available, for example as part of the Natural Language Toolkit. Some time ago, I contributed stemmers for 12 languages, which are available in [`nltk.stem.snowball`](https://github.com/nltk/nltk/blob/master/nltk/stem/snowball.py). Maybe they could be supported as a dependency of scikit-learn?

The general interface, however, is not difficult to implement. For example, one could do it like this:

1) Introduce an additional attribute `self.stemmer` which is initialized as `None` by default.
2) Write a method `build_stemmer` as something like this:

``` python
# Snowball stemmers could be used as a dependency
from nltk.stem import SnowballStemmer

def build_stemmer(self):
    """Return a callable that maps a list of tokens to a list of stems."""
    # A user-supplied stemmer must accept and return a list of tokens
    if self.stemmer is not None:
        return self.stemmer
    # One could provide an English stemmer as default
    english_stemmer = SnowballStemmer('english')
    return lambda tokens: [english_stemmer.stem(token) for token in tokens]
```

3) Incorporate this method call in [`CountVectorizer.build_analyzer()`](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L358) as something like this:

``` python
def build_analyzer(self):
    # ...    
    elif self.analyzer == 'word':
        stop_words = self.get_stop_words()
        # Add a stemmer instance here
        stem = self.build_stemmer()
        tokenize = self.build_tokenizer()

        # Include stemmer in the method chain
        return lambda doc: self._word_ngrams(
            stem(tokenize(preprocess(self.decode(doc)))), stop_words)
    # ...
```
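
To make the three steps concrete without touching scikit-learn itself, here is a minimal, dependency-free sketch of the same chain (tokenize, stem, filter stop words). Everything in it is an assumption for illustration: `toy_stem` is a hypothetical stand-in for a real stemmer such as Snowball, and the standalone `build_stemmer` and `build_analyzer` functions only mimic the shape of the proposed methods; they are not scikit-learn's API.

``` python
import re
from typing import Callable, FrozenSet, List, Optional

# A stemmer here maps a list of tokens to a list of stems, as in the proposal
Stemmer = Callable[[List[str]], List[str]]

# Simple word pattern for this sketch: runs of two or more word characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def toy_stem(token: str) -> str:
    """Hypothetical stand-in for a real stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "s"):
        # Only strip if at least three characters remain
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_stemmer(stemmer: Optional[Stemmer] = None) -> Stemmer:
    """Return the user-supplied stemmer, or fall back to the toy default."""
    if stemmer is not None:
        return stemmer
    return lambda tokens: [toy_stem(token) for token in tokens]


def build_analyzer(
    stemmer: Optional[Stemmer] = None,
    stop_words: FrozenSet[str] = frozenset(),
) -> Callable[[str], List[str]]:
    """Chain tokenizing, stemming and stop word filtering into one callable."""
    stem = build_stemmer(stemmer)

    def analyze(doc: str) -> List[str]:
        tokens = TOKEN_PATTERN.findall(doc.lower())
        # Stop words are filtered after stemming, mirroring the proposed chain
        return [t for t in stem(tokens) if t not in stop_words]

    return analyze


analyze = build_analyzer(stop_words=frozenset({"the"}))
print(analyze("The cats jumped"))  # ['cat', 'jump']
```

One consequence of this ordering is worth noting: because stop words are compared against already-stemmed tokens, a stop list might need to contain stemmed forms to keep matching.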

What do you think?    

