[MRG+1] micro-optimize HashingVectorizer and FeatureHasher #7470
Merged
amueller merged 2 commits into scikit-learn:master on Sep 30, 2016
kmike commented Sep 22, 2016
  for x in raw_X:
      for f, v in x:
-         if isinstance(v, string_types):
+         if isinstance(v, basestring):
Contributor
Author
This requires Cython 0.20+; is that OK? If not, it can be replaced with (str, unicode), which does the same thing.
Contributor
Hmm I'd prefer using (str, unicode). We don't have published "minimum cython requirements" but I suppose there's no harm in this case in maintaining backwards compatibility?
Member
LGTM
Member
thanks
TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
…arn#7470) * micro-optimize HashingVectorizer and FeatureHasher * fix backwards compatibility for Cython < 0.20
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
…arn#7470) * micro-optimize HashingVectorizer and FeatureHasher * fix backwards compatibility for Cython < 0.20
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
…arn#7470) * micro-optimize HashingVectorizer and FeatureHasher * fix backwards compatibility for Cython < 0.20
There is a common gotcha in Cython code: even after a type check, Cython compiles txt.encode('utf-8') to code that looks up the 'encode' method and then calls it as a Python method, which in turn has to find the utf-8 codec, etc. On the other hand, the same call on a value that Cython knows to be unicode (for example, via a cast) compiles directly to a PyUnicode_AsUTF8String call.
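To make the difference concrete, here is a rough plain-Python sketch of the two paths. It is only an illustration, not the code from this PR: the helper names `encode_generic` and `encode_direct` are made up, and going through `ctypes.pythonapi` adds its own overhead, so this shows what is being called, not how fast it is. It assumes CPython 3.

```python
import ctypes

# PyUnicode_AsUTF8String is part of the CPython C API; ctypes.pythonapi
# lets us call it from Python for illustration purposes.
_as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8String
_as_utf8.argtypes = [ctypes.py_object]
_as_utf8.restype = ctypes.py_object


def encode_generic(txt):
    # Roughly what untyped code does: look up the 'encode' attribute at
    # run time, then call the bound method, which resolves the codec by name.
    method = getattr(txt, "encode")
    return method("utf-8")


def encode_direct(txt):
    # Roughly what Cython can emit once it knows txt is unicode:
    # a single C API call, with no attribute or codec lookup.
    return _as_utf8(txt)


sample = "h\u00e9llo"
same_bytes = encode_generic(sample) == encode_direct(sample)
```

Both helpers return the same bytes object contents; the point is only that the second path skips the method and codec lookups.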
Another gotcha is that if you declare an argument type (e.g. bytes) and then pass a variable of type object, Cython does more work, not less, because the type has to be checked. Since we know the variable is bytes, casting it to bytes removes all these type checks and conversions.
The third thing I fixed is the string_types lookup. "basestring" in Cython is the same as six.string_types; this way the isinstance check gets compiled directly to C API calls.
These changes make HashingVectorizer about 10% faster. Benchmark script:
For me it runs in 13.8 ms before the changes and 11.5 ms after the changes.
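For readers unfamiliar with the loop that the isinstance change touches, here is a minimal pure-Python sketch of its shape. It is an assumption-laden illustration, not the Cython source: `iter_feature_pairs` is a hypothetical name, `str` stands in for `(str, unicode)` on Python 3, and the "name=value" handling of string values reflects my understanding of FeatureHasher's behaviour.

```python
def iter_feature_pairs(raw_X):
    """Yield (feature_name, value) pairs for each sample in raw_X.

    Hypothetical helper: string values are treated as categorical, so the
    feature name becomes "name=value" and the numeric value becomes 1.
    """
    for x in raw_X:
        for f, v in x:
            # In the Cython code this check is the one discussed above;
            # here `str` is the Python 3 stand-in for (str, unicode).
            if isinstance(v, str):
                f = "%s=%s" % (f, v)
                v = 1
            yield f, v


pairs = list(iter_feature_pairs([[("color", "red"), ("size", 3)]]))
```

The check runs once per (feature, value) pair, which is why a cheaper isinstance matters in this inner loop.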