Fix uncomparable values in SimpleImputer tie-breaking strategy#31820
Fix uncomparable values in SimpleImputer tie-breaking strategy#31820betatim merged 6 commits intoscikit-learn:mainfrom
Conversation
When encountering uncomparable types, SimpleImputer's 'min' strategy previously raised an error. This commit resolves the issue by selecting the value with the smallest hash of its pickled representation as a fallback. Fix scikit-learn#31717
sklearn/impute/_base.py
Outdated
| return min(items) | ||
| except TypeError as e: | ||
| if "'<' not supported between" in str(e): | ||
| return min(items, key=lambda x: murmurhash3_32(pickle.dumps(x))) |
There was a problem hiding this comment.
I wonder if it's not a bit overkill. For the use case of SimpleImputer, I believe that using the string representation of objects would be fine. It's deterministic across versions, os, envs... for usual types (int, float, str, None). It's not on more complex objects but I don't think that we expect complex objects to be passed to SimpleImputer.
| return min(items, key=lambda x: murmurhash3_32(pickle.dumps(x))) | |
| return min(items, key=str) |
What do you think ?
There was a problem hiding this comment.
I thought of that too but I am bugged by the 1 out of a million case where there is a tie between None and 'None' that is shuffle-sensitive:
In [1]: min([None, 'None'], key=str)
In [2]: min(['None', None], key=str)
Out[2]: 'None'repr does not fail in that case but it may in others.
Actually, we do not really need the hash, the pickle dumped string should be unique.
return min(items, key=lambda x: pickle.dumps(x))Is that better? Still needs pickle though...
Actually, that's where the other PR comes in handy, using the type breaks the tie.
There was a problem hiding this comment.
It's actually the same for 1 and "1" and so on which is more likely to happen.
I was undecided between pickle.dumps(x) and (str(type(x), str(x)), but a quick benchmark showed that the latter is much faster, so I'm in favor of choosing this one.
There was a problem hiding this comment.
Maybe another reason to not rely on pickle is that IIRC there are very little promises about the contents of a pickle with respect to the Python version (I think you'd have to select an explicit version of the protocol?). Generally it feels like using pickle is somehow "too complicated"
jeremiedbb
left a comment
There was a problem hiding this comment.
More comments. Please also add a changelog entry
jeremiedbb
left a comment
There was a problem hiding this comment.
I directly pushed a commit to use (str(type(x)), str(x)) as key. LGTM.
@betatim is this solution fine for you ?
|
Sorry, I was off, I could not implement the requests. let me know if there is anything more I can do. |
betatim
left a comment
There was a problem hiding this comment.
I think this is the best compromise. I still wonder about "what could be weird edge cases" - even though I can't think of any. Let's try it and see what people report
|
I merged |
…t-learn#31820) Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai> Co-authored-by: Tim Head <betatim@gmail.com>
…t-learn#31820) Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai> Co-authored-by: Tim Head <betatim@gmail.com>
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai> Co-authored-by: Tim Head <betatim@gmail.com>
…t-learn#31820) Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai> Co-authored-by: Tim Head <betatim@gmail.com>
Fixes #31717
When encountering uncomparable types, SimpleImputer's 'min' strategy previously raised an error. This commit resolves the issue by selecting the value with the smallest hash of its pickled representation as a fallback.
The pickled value is used because some special values such as
Noneare not handled natively by murmurhash.