
ENH Add option to scale output to unit var in RobustScaler#17193

Merged
glemaitre merged 26 commits into scikit-learn:master from lucyleeow:IS/10139
Jun 2, 2020

Conversation

@lucyleeow
Member

Reference Issues/PRs

Follows from PR #10140
closes #10139

What does this implement/fix? Explain your changes.

Adds a gauss_adjust parameter to RobustScaler so that the transformed data is standard Gaussian when the input data is Gaussian.

Adds a test to check that this option makes StandardScaler and RobustScaler equivalent on normal data, as suggested by @amueller. Note these are only equivalent for a large dataset. Not sure if the parameters I used are optimal; open to suggestions.

Any other comments?

@jnothman
Member

In QuantileTransformer this is available as:

output_distribution : str, optional (default='uniform')
    Marginal distribution for the transformed data.
    The choices are 'uniform' (default) or 'normal'.
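For reference, a minimal sketch of the option being cited, assuming scikit-learn is available and using a skewed sample so the effect is visible (the sample and its size are illustrative choices, not from the thread):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X = rng.exponential(size=(10_000, 1))  # skewed, clearly non-normal input

# Map the marginal distribution of each feature to a standard normal.
qt = QuantileTransformer(output_distribution="normal", random_state=0)
X_out = qt.fit_transform(X)

print(X_out.mean(), X_out.std())  # roughly 0 and 1
```

Unlike the change proposed here, this is a full (non-linear) quantile mapping of the distribution, not just a rescaling.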

@jnothman
Member

I suppose it's not the same. Here it's just a change in scale? Still, it would be nice if the options looked more similar to each other??

@lucyleeow
Member Author

lucyleeow commented May 13, 2020

I see your point, but what would you suggest? This seems a bit clunky:

output_distribution : str, optional (default='non-standard normal')
    Scale distribution for the transformed data to standard normal or keep as non-standard normal.
    Options: 'non-standard normal' or 'standard normal'.

@lucyleeow
Member Author

lucyleeow commented May 13, 2020

Here it's just a change in scale?

Yes, it is just a change in scale. For normal input data, the standard deviation of the output distribution with the (0.25, 0.75) quantiles is:
std ~= 1 / (norm.ppf(.75) - norm.ppf(.25)) ~= 0.741

The adjustment brings the output standard deviation to ~1 by rescaling:

self.scale_  = (q[1] - q[0]) / self.adjust

where self.adjust is:

self.adjust = norm.ppf(q[1]  / 100.0) - norm.ppf(q[0] / 100.0) 

@lucyleeow
Member Author

Maybe:

standard-normal :  boolean, False by default
    If True, scale data to a standard normal distribution.

which is vaguely more similar to QuantileTransformer.

@jnothman
Member

Don't know what the right name is... scale_to_unit_variance???

@amueller
Member

Or just unit_variance? Maybe scale_to_unit_variance is more explicit?
Happy with either name.

Semi-related: should that become the default, i.e. should we deprecate it being False?

@lucyleeow
Member Author

lucyleeow commented May 13, 2020

Thanks for the reviews. I've used unit_variance, but happy to change to scale_to_unit_variance.
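With the name settled on here, the equivalence test described earlier can be sketched roughly as follows (assumes a scikit-learn release that includes this PR's unit_variance parameter; the sample size and distribution parameters are illustrative, not the ones used in the actual test):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

rng = np.random.RandomState(42)
X = rng.normal(loc=5.0, scale=3.0, size=(50_000, 1))  # large Gaussian sample

# With unit_variance=True, the IQR-based scale is divided by
# norm.ppf(.75) - norm.ppf(.25), so the output has ~unit variance.
X_robust = RobustScaler(unit_variance=True).fit_transform(X)
X_standard = StandardScaler().fit_transform(X)

# On large Gaussian data the two scalings nearly coincide.
print(np.abs(X_robust - X_standard).max())
```

On small samples the two remain different, since median/IQR and mean/std estimates converge at different rates.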

@lucyleeow changed the title from "ENH Gaussian adjust option in RobustScaler" to "ENH Add option to scale output to unit var in RobustScaler" May 13, 2020
@jnothman left a comment
Member

otherwise LGTM

Please add an entry to the change log at doc/whats_new/v0.24.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors, if applicable) with :user:.

@adrinjalali left a comment
Member

Thanks @lucyleeow.

@glemaitre left a comment
Member

Some additional nitpicks

@lucyleeow
Member Author

ping @glemaitre. I also added more explanation of what the parameter does generally as suggested by @adrinjalali, but I'm not sure about the wording...

Could we also expand on this a bit and explain what happens if the feature is not normally distributed? Since this is the only place we're having the explanation of this parameter (no user guide), it'd be nice to elaborate a bit.

@glemaitre glemaitre self-assigned this May 27, 2020
@glemaitre glemaitre removed their assignment May 27, 2020
@lucyleeow
Member Author

Amended!

@glemaitre glemaitre merged commit 863e58f into scikit-learn:master Jun 2, 2020
@glemaitre
Member

Thanks @lucyleeow, merging.

@lucyleeow lucyleeow deleted the IS/10139 branch June 2, 2020 12:52
viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020
…arn#17193)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
…arn#17193)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

Successfully merging this pull request may close these issues.

Feature Request: Gaussian adjust option in RobustScaler

6 participants