An idea for a feature enhancement:
I'm currently using sklearn.feature_extraction.text.CountVectorizer for one of my projects. In my opinion, it is missing support for stemming. An additional attribute such as self.stemmer, which accepts a stemming function as a callable, would be nice, together with a reasonable default stemmer for English.
What about an additional stemmer module? That said, there are already some quite good Python stemmers available, for example as part of the Natural Language Toolkit. Some time ago, I contributed stemmers for 12 languages, which are available in nltk.stem.snowball. Maybe scikit-learn could support them as a dependency?
The general interface, however, is not difficult to implement. For example, one could do it like this:
- Introduce an additional attribute self.stemmer, which is initialized as None by default.
- Write a method build_stemmer as something like this:

    # Snowball stemmers could be used as a dependency
    from nltk.stem import SnowballStemmer

    def build_stemmer(self):
        if self.stemmer is not None:
            return self.stemmer
        # One could provide an English stemmer as default
        english_stemmer = SnowballStemmer('english')
        return lambda tokens: [english_stemmer.stem(token) for token in tokens]

- Incorporate this method call in CountVectorizer.build_analyzer() as something like this:

    def build_analyzer(self):
        # ...
        elif self.analyzer == 'word':
            stop_words = self.get_stop_words()
            # Add a stemmer instance here
            stem = self.build_stemmer()
            tokenize = self.build_tokenizer()
            # Include stemmer in the method chain
            return lambda doc: self._word_ngrams(
                stem(tokenize(preprocess(self.decode(doc)))), stop_words)
        # ...

What do you think?
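
To make the proposed chain easier to try out, here is a minimal, self-contained sketch that uses only the standard library. Everything in it is an assumption for illustration: TinyVectorizer is a toy stand-in and not the scikit-learn class, toy_stem is a crude suffix stripper standing in for a real Snowball stemmer, and the token pattern and stop-word handling are simplified.

```python
import re
from collections import Counter


def toy_stem(token):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the token remain
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


class TinyVectorizer:
    """Toy stand-in for CountVectorizer (not the scikit-learn class)."""

    def __init__(self, stemmer=None, stop_words=()):
        # Proposed attribute: a callable mapping a list of tokens to a list of tokens
        self.stemmer = stemmer
        self.stop_words = frozenset(stop_words)

    def build_tokenizer(self):
        # Simplified tokenizer: words of two or more word characters
        return re.compile(r"\b\w\w+\b").findall

    def build_stemmer(self):
        # A user-supplied stemmer wins; otherwise fall back to a default
        if self.stemmer is not None:
            return self.stemmer
        return lambda tokens: [toy_stem(token) for token in tokens]

    def build_analyzer(self):
        stem = self.build_stemmer()
        tokenize = self.build_tokenizer()
        stop_words = self.stop_words
        # Same order as in the proposal: tokenize, then stem, then drop stop words
        return lambda doc: [
            token for token in stem(tokenize(doc.lower()))
            if token not in stop_words
        ]


# Default stemmer
analyze = TinyVectorizer(stop_words={"the"}).build_analyzer()
counts = Counter(analyze("The cat jumped while the cats were jumping"))

# Custom stemmer passed in as a callable (here: naive truncation to 3 characters)
truncate = TinyVectorizer(stemmer=lambda tokens: [t[:3] for t in tokens])
truncated = truncate.build_analyzer()("running runner")
```

One consequence of this ordering is worth noting: because stemming runs before stop-word removal, the stop-word list is compared against stemmed tokens, so it would have to be given in stemmed form (or the stemmer moved after the filter).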