Add stemming support to CountVectorizer #1156

@pemistahl

An idea for a feature enhancement:

I'm currently using sklearn.feature_extraction.text.CountVectorizer for one of my projects, and in my opinion it would benefit greatly from support for stemming. An additional attribute such as self.stemmer that accepts a stemming function as a callable would be nice, together with a reasonable default stemmer for English.

What about an additional stemmer module? That said, some quite good Python stemmers are already available, for example as part of the Natural Language Toolkit. Some time ago, I contributed stemmers for 12 languages, which are available in nltk.stem.snowball. Maybe scikit-learn could use them as a dependency?

The general interface, however, is not difficult to implement. For example, one could do it like this:

  1. Introduce an additional attribute self.stemmer which is initialized as None by default.
  2. Write a method build_stemmer, something like this:
# Snowball stemmers could be used as a dependency
from nltk.stem import SnowballStemmer

def build_stemmer(self):
    if self.stemmer is not None:
        return self.stemmer
    # One could provide an English stemmer as default
    english_stemmer = SnowballStemmer('english')
    return lambda tokens: [english_stemmer.stem(token) for token in tokens]    
  3. Call this method in CountVectorizer.build_analyzer(), something like this:
def build_analyzer(self):
    # preprocess must be defined before it is used in the chain below
    preprocess = self.build_preprocessor()
    # ...
    elif self.analyzer == 'word':
        stop_words = self.get_stop_words()
        # Add a stemmer instance here
        stem = self.build_stemmer()
        tokenize = self.build_tokenizer()

        # Include stemmer in the method chain
        return lambda doc: self._word_ngrams(
            stem(tokenize(preprocess(self.decode(doc)))), stop_words)
    # ...
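
To make the proposed chain concrete without touching scikit-learn itself, here is a minimal, dependency-free sketch of the same idea. Everything in it is an assumption for illustration: toy_stem is a hypothetical stand-in for a real stemmer such as SnowballStemmer, the standalone build_stemmer and build_analyzer functions only mimic the shape of the methods above, and the n-gram step is left out.

```python
import re

# Token pattern matching the documented CountVectorizer default:
# words of two or more alphanumeric characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def toy_stem(token):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_stemmer(stemmer=None):
    """Return a callable that maps a list of tokens to a list of stems."""
    if stemmer is not None:
        return stemmer
    # Fall back to a default stemmer when none is supplied
    return lambda tokens: [toy_stem(token) for token in tokens]


def build_analyzer(stemmer=None, stop_words=frozenset()):
    """Chain lowercasing, tokenizing, stemming and stop-word filtering."""
    stem = build_stemmer(stemmer)

    def analyze(doc):
        tokens = TOKEN_PATTERN.findall(doc.lower())
        # Stop words are filtered after stemming, as in the chain above
        return [token for token in stem(tokens) if token not in stop_words]

    return analyze


analyzer = build_analyzer(stop_words=frozenset({"the"}))
tokens = analyzer("The cats jumped over the running dogs")
print(tokens)
```

One consequence of this ordering is that stop words are compared against stems rather than raw tokens, so a stop-word list may need to contain stemmed forms. Whether that is the desired behaviour is a design question for the actual implementation.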

What do you think?
