[MRG+1] micro-optimize HashingVectorizer and FeatureHasher by kmike · Pull Request #7470 · scikit-learn/scikit-learn

kmike · 2016-09-22T17:18:38Z

There is a common gotcha in Cython code: even after a type check, Cython compiles txt.encode('utf-8') to code that looks up the 'encode' method and then calls it as a Python method; that method in turn has to find the utf-8 codec, etc. For example:

if isinstance(f, unicode):
    f = f.encode("utf-8")

On the other hand,

if isinstance(f, unicode):
    f = (<unicode>f).encode("utf-8")

compiles directly to a PyUnicode_AsUTF8String call.
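As a rough illustration only, the two paths can be mimicked in pure Python. This is an analogy, not the code Cython generates: `encode_generic` and `encode_direct` are hypothetical names, and `codecs.utf_8_encode` stands in for the direct C API call.

```python
import codecs


def encode_generic(txt):
    # Roughly what the untyped call does: look up the "encode" attribute
    # at run time, call it as a Python method, and let that method
    # resolve the codec from its name.
    method = getattr(txt, "encode")
    return method("utf-8")


def encode_direct(txt):
    # Analogy for the typed cast: go straight to the UTF-8 encoder,
    # skipping the attribute lookup and the codec-name lookup.
    # codecs.utf_8_encode returns (encoded_bytes, characters_consumed).
    encoded, _ = codecs.utf_8_encode(txt)
    return encoded
```

Both functions return the same bytes; only the amount of run-time dispatch differs.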

Another gotcha is that if you declare an argument type (e.g. bytes) and then pass a variable of type object, Cython does more work, not less, because the type has to be checked. Since we know the variable is bytes, casting it to bytes removes all these type checks and conversions.

The third thing I fixed is the string_types lookup. In Cython, "basestring" is the same as six.string_types, and this way the isinstance check gets compiled directly to C API calls.
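For readers unfamiliar with string_types, here is a minimal pure-Python sketch of what such a compatibility tuple looks like. It follows the spirit of six.string_types rather than copying its source, and `is_string_value` is a hypothetical helper, not a function from scikit-learn.

```python
import sys

# Assumption: a six-style compatibility tuple, written out by hand.
if sys.version_info[0] >= 3:
    string_types = (str,)
else:
    string_types = (basestring,)  # noqa: F821 (name exists on Python 2 only)


def is_string_value(v):
    """Return True when v is a text-like value on the running Python version."""
    # In plain Python, string_types is a module-level name looked up at
    # run time on every call.
    return isinstance(v, string_types)
```

In Cython, spelling out the built-in types in the isinstance call removes that run-time name lookup.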

These changes make HashingVectorizer about 10% faster. Benchmark script:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer

categories = ['alt.atheism', 'comp.graphics']
remove = ('headers', 'footers', 'quotes')
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)
vec = HashingVectorizer()
%timeit vec.fit_transform(data_train.data[:100])

For me it runs in 13.8 ms before the changes and 11.5 ms after the changes.
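Since %timeit only works inside IPython, a similar measurement can be sketched with the standard timeit module. The helper names and the stand-in workload below are assumptions made for illustration; to reproduce the benchmark, swap the workload for a call to vec.fit_transform(data_train.data[:100]).

```python
import timeit


def best_time_ms(func, repeat=5, number=10):
    """Return the best per-call time in milliseconds, similar in spirit to %timeit."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number * 1000.0


def relative_reduction(before_ms, after_ms):
    """Fraction of run time saved when going from before_ms to after_ms."""
    return (before_ms - after_ms) / before_ms


def workload():
    # Hypothetical stand-in so the sketch runs without scikit-learn;
    # replace with: vec.fit_transform(data_train.data[:100])
    return [token.encode("utf-8") for token in ("some text " * 50).split()]


elapsed = best_time_ms(workload)
saved = relative_reduction(13.8, 11.5)  # the two timings reported above
```

With the timings quoted above, relative_reduction(13.8, 11.5) comes out at about 0.167, i.e. this particular 100-document run took roughly 17% less time.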

kmike · 2016-09-22T17:29:02Z

sklearn/feature_extraction/_hashing.pyx

    for x in raw_X:
        for f, v in x:
-            if isinstance(v, string_types):
+            if isinstance(v, basestring):


This requires Cython 0.20+; is it OK? If not, it is possible to replace it with (str, unicode), which does the same.

nelson-liu · 2016-09-22

Hmm, I'd prefer using (str, unicode). We don't have published "minimum Cython requirements", but I suppose there's no harm in maintaining backwards compatibility in this case?

jnothman · 2016-09-26T23:46:12Z

LGTM

amueller · 2016-09-30T00:51:02Z

thanks

micro-optimize HashingVectorizer and FeatureHasher

4cb6fd4

fix backwards compatibility for Cython < 0.20

9de996d

jnothman changed the title ~~micro-optimize HashingVectorizer and FeatureHasher~~ [MRG+1] micro-optimize HashingVectorizer and FeatureHasher Sep 26, 2016

amueller merged commit c336a43 into scikit-learn:master Sep 30, 2016

TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016

[MRG+1] micro-optimize HashingVectorizer and FeatureHasher (scikit-le…

861d9ec

…arn#7470) * micro-optimize HashingVectorizer and FeatureHasher * fix backwards compatibility for Cython < 0.20

rth mentioned this pull request Jun 9, 2017

[MRG+1] Add text vectorizers benchmarks #9086

Merged

Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

fa37ed4

paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

1794164

amueller merged 2 commits into scikit-learn:master from kmike:optimize-hashing
