DOC Add details to StandardScaler calculation #12446
qinhanmin2014 merged 1 commit into scikit-learn:master
sklearn/preprocessing/data.py
Outdated
scale_ : ndarray or None, shape (n_features,)
    Per feature relative scaling of the data. Equal to ``None`` when
    ``with_std=False``.
    Per feature relative scaling of the data. Computed using
"Computed using" is ambiguous. Better to just say "Equal to".
Or maybe you can use "This is calculated using np.sqrt(..."
Hi @robert-dodier and @eamanu, thank you for your suggestions. I ended up choosing "This is calculated using" to avoid repeating "Equal to" in the next sentence. Let me know what you think.
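A small sketch of what the "This is calculated using np.sqrt(..." wording describes. The argument of np.sqrt is cut off in the snippet above, so this assumes it is the per-feature variance (named var_ here); the data values are made up for illustration, and this is not the library's own code.

```python
import numpy as np

# Toy training data: 4 samples, 2 features (values are made up for illustration).
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

# Assumed input to np.sqrt: per-feature population variance (ddof=0),
# one value per column.
var_ = X_train.var(axis=0)

# Per-feature scale as the square root of that variance.
scale_ = np.sqrt(var_)
```

Under that assumption, scale_ has shape (n_features,) and matches the per-column standard deviation, X_train.std(axis=0).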
sklearn/preprocessing/data.py
Outdated
    z = (x - mean_) / scale_

But the formula can be different if `with_mean=False` or `with_std=False`.
I see two problems here. (1) There is actually one mean_ and one scale_ per column. The documentation should make this explicit.
(2) The documentation should state which formula is used when with_mean=False, when with_std=False, and when both are False.
I agree with @robert-dodier. But is it necessary to put a lot of implementation detail in the documentation?
I changed mean_ to u and scale_ to std to indicate that this is a high-level explanation rather than the actual implementation. I have also added their respective values when with_mean=False or with_std=False.
I agree that we have one mean_ and one scale_ per column, but this is covered at least partially in the next paragraph: "Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set."
@tuliocasagrande Tell me if you need help.
@tuliocasagrande once @robert-dodier approves, you should change the PR title to [MRG].
TomDLT left a comment
Only nitpicks, thanks @tuliocasagrande
sklearn/preprocessing/data.py
Outdated
    z = (x - u) / std

where `u` is the mean of the population or zero if `with_mean=False`, and
`std` is the standard deviation of the population or one if
I would either use explicit names, mean and std, or single letters as in math expressions, u and s, but not a mix.
Good call, @TomDLT. I was initially considering using μ and σ, but I'm not sure how they'd be rendered. I'm sticking to 'u' and 's'.
sklearn/preprocessing/data.py
Outdated
    z = (x - u) / std

where `u` is the mean of the population or zero if `with_mean=False`, and
The term population is rather vague. What about mean of the training samples?
Force-pushed from b34cab9 to 5c1e357.
How can I see the latest version of the proposed patch? I see 5c1e357, which doesn't seem to contain the suggestions made during this discussion. Does 5c1e357 contain everything that is proposed? Is there something else I should look at? Thanks for any info.
@robert-dodier 5c1e357 is indeed the last commit. If you're using GitHub, you can check the previous discussions. Some of them are still unresolved.
xhluca left a comment
I think preprocessing.scale accomplishes a similar task (but doesn't use the Transformer api), maybe it could also be updated with a similar equation?
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html#sklearn.preprocessing.scale
qinhanmin2014 left a comment
LGTM, thanks @tuliocasagrande
For everyone here: feel free to submit a PR to improve the docstring of
Hello!
This addresses some documentation issues raised on #12438.
1- Define the standard scaler formula
2- Make explicit how scale_ is calculated
Thanks for reviewing this!
Closes #12438