[MRG + 2] ENH RobustScaler#4125
Conversation
There was a problem hiding this comment.
Could you add a line check_is_fitted(self, 'scale_') here too?
|
Test failure: AttributeError: 'RobustScaler' object has no attribute 'center_' |
|
Fixed now, sorry that it even happened, I only ran the |
|
is this ready to review? |
|
Yes it is |
|
Yes, but probably only after he beta release is cut on Friday. |
|
ping @amueller is now a good time to remind you about this PR? |
|
It is always a good time to remind me ;) I'll try to have a look. |
sklearn/preprocessing/data.py
Outdated
There was a problem hiding this comment.
please add a space before " :"
|
If you like you could add a plot_ example that shows samples from a 2d gaussian with outliers, how it gets transformed using StandardScaler (into something weird) and how using RobustScaler (a ball + outliers). Not required but I think this would be very illustrative. Apart from my minor nitpicks this looks great. Sorry for the lack of feedback previously. |
Long answer outside of the affected code lines so the answer stays visible for future reviewers: Yes, using the 0.25-0.75 (the "interquartile range") is the most common option, although others might make sense in some occasions (e.g. Wikipedia also meantions the interdecale range 0.1-0.9 as relevant). We could make the quantiles a user-definable parameter that just defaults to 0.25-0.75, to cover the more general case. The only question is if it's worth the added complexity for the user. I've personally never needed anything more than IQR, but maybe having the option might be nice? |
|
@agramfort Thanks for taking the time to look at this. I've fixed the mistakes you pointed out, and went over the docstrings in |
sklearn/preprocessing/data.py
Outdated
There was a problem hiding this comment.
I don't see any axis parameter
There was a problem hiding this comment.
You are right, that is a left-over from a feature that I later removed because an other reviewer didn't like it. Thanks for catching it!
|
ok LGTM one last review? |
|
Thanks for looking it over. Do PRs need 3 reviews before being accepted now? |
|
no but I'd like to have one more review to see if my last comments would be
nitpicks for another core dev
|
|
Ok, I find that the issues raised by @agramfort are not major ones. I think that this can be merged as such. However :), the median and the inter quartile range are very robust estimators but also very noisy. Using a trimmed mean, both on the raw data to estimate center location, and on the squared distances to the center, to estimate spread, is in general a much better strategy with a continuous parameter to interpolate between mean and median. I would really love a PR adding such a behavior to the robust scaler. It worries me that anybody wanting to do robust scaling will right now have also very noisy scaling. |
|
I never heard of the trimmed mean being more robust, could you name any references I could read up on? |
|
Merged. Thanks @untom. |
|
Nice! Thanks to the reviewers for their comments. @amueller: I am fairly busy until the 5th, so most likely I will only find time to work on this afterwards. |
|
I smell nips ;) Good luck! |
If there are papers to support this claim, it would be nice to add them to the references. |
This PR adds
RobustScalerandrobust_scaleas alternative toStandardScalerandscale. They use robust estimates of data center/scale (median & interquartile range), which will work better for data with outliers.Most of this was discussed before in #2514 (and even older commits). I separated this out so it can be merged as-is.
I originally wanted to submit this only after #3639 was merged, but sending the PR now allows it to be discussed concurrently (if either this or #3639 get merged, I'll of course update the other commit).