Use succinct tries for vocabulary storage #2639
Hey,
A Python dictionary is quite wasteful for string data, so CountVectorizer, TfidfVectorizer and DictVectorizer could all take much less memory with another data structure for vocabulary_.
Here is a quick prototype that stores the vocabulary_ of CountVectorizer and TfidfVectorizer in a MARISA trie via its Python wrapper: https://gist.github.com/kmike/7814472.
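For reference, the core idea is roughly the following. This is a minimal sketch rather than the prototype itself; it assumes the marisa_trie package and simply re-transforms once to re-map token indices to trie ids (exact attribute names and hooks may differ across scikit-learn versions):

```python
import marisa_trie
from sklearn.feature_extraction.text import CountVectorizer


class MarisaCountVectorizer(CountVectorizer):
    """Minimal sketch: keep vocabulary_ in a MARISA trie instead of a dict."""

    def fit_transform(self, raw_documents, y=None):
        # First pass: let CountVectorizer build the usual dict vocabulary.
        super(MarisaCountVectorizer, self).fit_transform(raw_documents)
        # Replace the dict with a trie; the trie assigns its own ids,
        # so the column order differs from a plain CountVectorizer.
        self.vocabulary_ = marisa_trie.Trie(self.vocabulary_.keys())
        # Second pass: re-transform so the matrix columns match the trie ids.
        # (Depending on the scikit-learn version, a "fixed vocabulary" flag
        # may also need to be set here.)
        return super(MarisaCountVectorizer, self).transform(raw_documents)

    def fit(self, raw_documents, y=None):
        self.fit_transform(raw_documents)
        return self
```

Since marisa_trie.Trie supports len(), `in` and key-to-id lookup, transform works unchanged, and pickling the vectorizer then only stores the compact trie.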
I used one script to measure the memory usage and speed of fit and dump, and another script to measure the same for load and transform.
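The measurements boil down to something like the sketch below (a rough approximation, not the actual scripts; it reuses the MarisaCountVectorizer sketch above and only compares pickle size and wall-clock time, whereas the scripts also track process memory):

```python
import pickle
import time

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

train = fetch_20newsgroups(subset='train').data
test = fetch_20newsgroups(subset='test').data

for vec in (CountVectorizer(), MarisaCountVectorizer()):
    start = time.time()
    vec.fit(train)
    fit_time = time.time() - start

    dumped = pickle.dumps(vec)  # serialized size as a proxy for vocabulary footprint

    start = time.time()
    vec.transform(test)
    transform_time = time.time() - start

    print('%s: fit %.1fs, transform %.1fs, pickle %.1f MB' % (
        type(vec).__name__, fit_time, transform_time, len(dumped) / 1e6))
```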
The results are quite cool: for the '20 newsgroups' data (fit on the training subset, transform on the test subset), the memory consumption of the different vectorizers after loading is as follows:
- CountVectorizer(): 94 MB
- CountVectorizer(ngram_range=(1, 2)): 666 MB
- MarisaCountVectorizer(): 1.2 MB
- MarisaCountVectorizer(ngram_range=(1, 2)): 13.3 MB
So using a good succinct trie implementation gives us a 50x-80x reduction in memory usage; it also makes serialization and deserialization almost instant. And such vectorizers don't have HashingVectorizer's limitations.
The downside is that the fit method is 2x-3x slower and requires slightly more memory (though I think it could be changed to require less memory than CountVectorizer.fit), and transform is about 2x slower. Full output: https://gist.github.com/kmike/7815156
What do you think about adding something similar to scikit-learn?