[MRG + 1] MNT Platform independent hash collision tests in FeatureHasher #9710
Merged
jnothman merged 2 commits into scikit-learn:master on Sep 12, 2017
amueller
reviewed
Sep 8, 2017
@ignore_warnings(category=DeprecationWarning)
def test_hash_collisions():
You could be really sure and do X = [list("Thequickbrownfoxjumped")]
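A minimal sketch of why a larger vocabulary makes the check more robust. It assumes a stand-in hash (`hashlib.md5`) rather than the MurmurHash3 routine scikit-learn uses, and `signed_bucket` is a hypothetical helper, not part of `FeatureHasher`. With a single output feature every token collides, so the bucket only reaches `len(tokens)` in magnitude if all signs agree.

```python
import hashlib


def signed_bucket(token, n_features):
    """Map a token to (bucket index, sign).

    Assumption: md5 is only a platform-independent stand-in here;
    it is NOT the hash used by scikit-learn.
    """
    digest = hashlib.md5(token.encode("utf-8")).digest()
    # Interpret the first 4 bytes as a signed 32-bit integer.
    h = int.from_bytes(digest[:4], "little", signed=True)
    index = abs(h) % n_features
    sign = 1 if h >= 0 else -1
    return index, sign


tokens = list("Thequickbrownfoxjumped")
n_features = 1  # a single bucket forces every token to collide

buckets = [0] * n_features
signs = []
for tok in tokens:
    index, sign = signed_bucket(tok, n_features)
    buckets[index] += sign
    signs.append(sign)

# If some signs differ, they partially cancel and the magnitude
# of the bucket stays strictly below the number of tokens.
print("bucket value:", buckets[0], "tokens:", len(tokens))
```

The more distinct tokens are hashed, the less likely it is that every sign agrees, whichever hash function and platform are involved.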
Member
LGTM. (I think you meant .5 ** 8 = 0.004)
Member
Author
@amueller Thanks for the review. Increased the vocabulary size as you suggested.
Yes thanks, I keep making typos in every other comment, apparently.
Member
Have you tried finding a Docker image to reproduce it somehow?
…On 8 Sep 2017 10:47 pm, "Roman Yurchak" wrote:
This PR aims to address the current failures of test_hasher_alternate_sign on non-amd64 platforms (#9393 (comment)). They are likely due to the fact that the current test relies on MurmurHash3 yielding a particular hash value (one that produces a collision), while the result is actually platform dependent (#9393 (comment)). Since the original issue couldn't be reproduced, there is no guarantee that this would fix it (hopefully it would), but in any case it would make test_hasher_alternate_sign more robust ...
Note: these tests rely on the fact that when hashing 8 strings with alternate_sign=True, some of them will get a negative sign and some a positive one (each sign has a 50% probability). However, there is still a (0.5)**8 ≈ 0.004 probability that on a given platform all the signs will be positive (in which case these tests will fail), but hopefully that's unlikely enough...
cc @jnothman
jnothman
reviewed
Sep 12, 2017
This PR aims to address the current failures of test_hasher_alternate_sign on non-amd64 platforms (#9393 (comment)). They are likely due to the fact that the current test relies on MurmurHash3 yielding a particular hash value (one that produces a collision), while the result is actually platform dependent (#9393 (comment)). Since the original issue couldn't be reproduced, there is no guarantee that this would fix it (hopefully it would), but in any case it would make test_hasher_alternate_sign more robust ...
Note: these tests rely on the fact that when hashing 8 strings with alternate_sign=True, some of them will get a negative sign and some a positive one (each sign has a 50% probability). However, there is still a (0.5)**8 ≈ 0.004 probability that on a given platform all the signs will be positive (in which case these tests will fail), but hopefully that's unlikely enough...
cc @jnothman
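The failure probability in the note can be checked with a few lines of arithmetic. This is a sketch that assumes each distinct token's sign is an independent fair coin flip; `prob_all_positive` is a hypothetical helper name used only for illustration.

```python
def prob_all_positive(n_tokens):
    """Probability that n independently signed tokens are all positive.

    Assumption: each distinct token gets +1 or -1 with probability 0.5,
    independently of the others.
    """
    return 0.5 ** n_tokens


# Eight distinct strings, as in the note above.
p8 = prob_all_positive(8)
print(p8)  # 0.00390625

# A larger vocabulary shrinks the failure probability further.
# Repeated tokens share a sign, so only distinct tokens count.
n_distinct = len(set("Thequickbrownfoxjumped"))
p_larger = prob_all_positive(n_distinct)
print(n_distinct, p_larger)
```

Under the same assumption, a test that fails whenever all signs agree (all positive or all negative) would fail with twice this probability.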