[MRG] Fix missing 'sparse matrix' in docstrings when allowed #16646#16656
[MRG] Fix missing 'sparse matrix' in docstrings when allowed #16646#16656glemaitre merged 22 commits intoscikit-learn:masterfrom
Conversation
glemaitre
left a comment
There was a problem hiding this comment.
A couple of changes to apply in the same time.
sklearn/preprocessing/_data.py
Outdated
| Parameters | ||
| ---------- | ||
| X : {array-like, sparse matrix}, shape [n_samples, n_features] | ||
| X : {array-like, sparse matrix}, shape (n_samples, n_features) |
There was a problem hiding this comment.
we should change , shape to of shape at the same time. In addition, I think that the algorithms pointed out will work with dataframe as well.
| X : {array-like, sparse matrix}, shape (n_samples, n_features) | |
| X : {array-like, sparse matrix, dataframe} of shape (n_samples, n_features) |
sklearn/preprocessing/_data.py
Outdated
| Returns | ||
| ------- | ||
| K_new : numpy array of shape [n_samples1, n_samples2] | ||
| K_new : numpy array of shape (n_samples1, n_samples2) |
There was a problem hiding this comment.
| K_new : numpy array of shape (n_samples1, n_samples2) | |
| K_new : ndarray of shape (n_samples1, n_samples2) |
sklearn/preprocessing/_data.py
Outdated
| Parameters | ||
| ---------- | ||
| K : numpy array of shape [n_samples1, n_samples2] | ||
| K : numpy array of shape (n_samples1, n_samples2) |
There was a problem hiding this comment.
| K : numpy array of shape (n_samples1, n_samples2) | |
| K : ndarray of shape (n_samples1, n_samples2) |
sklearn/preprocessing/_data.py
Outdated
| Parameters | ||
| ---------- | ||
| K : numpy array of shape [n_samples, n_samples] | ||
| K : numpy array of shape (n_samples, n_samples) |
There was a problem hiding this comment.
| K : numpy array of shape (n_samples, n_samples) | |
| K : ndarray of shape (n_samples, n_samples) |
sklearn/preprocessing/_data.py
Outdated
| Normalized input X. | ||
|
|
||
| norms : array, shape [n_samples] if axis=1 else [n_features] | ||
| norms : array, shape (n_samples, ) if axis=1 else (n_features, ) |
There was a problem hiding this comment.
| norms : array, shape (n_samples, ) if axis=1 else (n_features, ) | |
| norms : ndarray of shape (n_samples, ) if `axis=1` else (n_features, ) |
sklearn/preprocessing/_data.py
Outdated
| Returns | ||
| ------- | ||
| XP : np.ndarray or CSR/CSC sparse matrix, shape [n_samples, NP] | ||
| XP : {np.ndarray, CSR/CSC sparse matrix}, shape (n_samples, NP) |
There was a problem hiding this comment.
I think that we should put CSR and CSC information in the description rather than in the type.
|
@glemaitre thanks for the feedback, I will proceed with the fixes. |
|
Normally we have xxx : ndarray of shape (n_features,) or NoneIf there is a default then xxx: ndarray of shape (n_features,), default=None |
|
Could you check the linter error |
| Parameters | ||
| ---------- | ||
| X : {array-like, sparse matrix}, shape [n_samples, n_features] | ||
| X : {array-like, sparse matrix} of shape (n_samples, n_features) |
There was a problem hiding this comment.
it should accept dataframe as well
| Parameters | ||
| ---------- | ||
| X : {array-like, sparse matrix}, shape [n_samples, n_features] | ||
| X : {array-like, sparse matrix} of shape (n_samples, n_features) |
There was a problem hiding this comment.
it should accept dataframe as well
sklearn/preprocessing/_data.py
Outdated
| Parameters | ||
| ---------- | ||
| X : array-like, shape [n_samples, n_features] | ||
| X : {array-like, sparse matrix} of shape (n_samples, n_features) |
There was a problem hiding this comment.
it should accept dataframe as well
| Parameters | ||
| ---------- | ||
| X : array-like, shape [n_samples, n_features] | ||
| X : {array-like, sparse matrix} of shape (n_samples, n_features) |
There was a problem hiding this comment.
it should accept dataframe as well
sklearn/preprocessing/_data.py
Outdated
| Returns | ||
| ------- | ||
| X_tr : array-like, shape [n_samples, n_features] | ||
| X_tr : array-like of shape (n_samples, n_features) |
There was a problem hiding this comment.
It will not be an array-like in fact. It will be an ndarray.
sklearn/preprocessing/_data.py
Outdated
| Parameters | ||
| ---------- | ||
| X : {array-like, sparse matrix}, shape [n_samples, n_features] | ||
| X : {array-like, sparse matrix} of (n_samples, n_features) |
There was a problem hiding this comment.
Basically all the preprocesing method will accept dataframe
|
Hi @glemaitre, the linter errors are due to the line being too long. Is there a standard way within scikit-learn to split the line? I scanned all the scikit-learn modules and I found that dataframe is almost never mentioned, is it because it might be considered covered by |
|
You can split using the backslack X : {array-like, sparse matrix, dataframe} of shape \
(n_samples, n_features)
Whatever description |
|
Preview of how it would look like: def fit(self, X, y=None):
"""Compute the mean and std to be used for later scaling.
Parameters
----------
X : {array-like, sparse matrix, dataframe} of shape \
(n_samples, n_features)
The data used to compute the mean and standard deviation
used for later scaling along the features axis.
y
Ignored
"""Do you have any comments/improvements? |
|
Looks good to me 👍
…On Tue, 10 Mar 2020 at 14:48, Alessandro Gentile ***@***.***> wrote:
Preview of how it would look like:
def fit(self, X, y=None):
"""Compute the mean and std to be used for later scaling. Parameters ---------- X : {array-like, sparse matrix, dataframe} of shape \ (n_samples, n_features) The data used to compute the mean and standard deviation used for later scaling along the features axis. y Ignored """
Do you have any comments/improvements?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#16656?email_source=notifications&email_token=ABY32PYANIWYJFQPVJNEDBDRGZAKXA5CNFSM4LD3ML22YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEOLPHEY#issuecomment-597095315>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABY32PZWFAHK3MOX3CLUD2DRGZAKXANCNFSM4LD3ML2Q>
.
--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/
|
|
Cool! Another thing I noticed: some scikit-learn/sklearn/preprocessing/_data.py Lines 779 to 788 in 535ef55 Shall I fix it now, or it should be part of another issue/PR? |
|
Let's correct this in the same PR then :) It will be a documentation change consistency across the file. |
|
Hi @glemaitre, I did some last checks and everything seems good. What do you think? |
glemaitre
left a comment
There was a problem hiding this comment.
I think this is almost there. Apart from my suggestion, there are the occurrences of boolean to be changed to bool and remove "optional"
sklearn/preprocessing/_data.py
Outdated
| each sample. | ||
|
|
||
| with_mean : boolean, True by default | ||
| with_mean : boolean, default=True |
There was a problem hiding this comment.
| with_mean : boolean, default=True | |
| with_mean : bool, default=True |
sklearn/preprocessing/_data.py
Outdated
| If True, center the data before scaling. | ||
|
|
||
| with_std : boolean, True by default | ||
| with_std : boolean, default=True |
There was a problem hiding this comment.
| with_std : boolean, default=True | |
| with_std : bool, default=True |
sklearn/preprocessing/_data.py
Outdated
| unit standard deviation). | ||
|
|
||
| copy : boolean, optional, default True | ||
| copy : boolean, optional, default=True |
There was a problem hiding this comment.
| copy : boolean, optional, default=True | |
| copy : bool, default=True |
sklearn/preprocessing/_data.py
Outdated
| Parameters | ||
| ---------- | ||
| copy : boolean, optional, default True | ||
| copy : boolean, optional, default=True |
There was a problem hiding this comment.
for the subsequent entries, you can remove optional and use the python type instead of english one; e.g boolean -> bool, string -> str, etc.
sklearn/preprocessing/_data.py
Outdated
| Parameters | ||
| ---------- | ||
| norm : 'l1', 'l2', or 'max', optional ('l2' by default) | ||
| norm : 'l1', 'l2', or 'max', optional, default='l2' |
There was a problem hiding this comment.
| norm : 'l1', 'l2', or 'max', optional, default='l2' | |
| norm : {'l1', 'l2', 'max'}, default='l2' |
sklearn/preprocessing/_data.py
Outdated
| differ for value-identical sparse and dense matrices. | ||
|
|
||
| random_state : int, RandomState instance or None, optional (default=None) | ||
| random_state : int, RandomState instance or None, optional, default=None |
There was a problem hiding this comment.
| random_state : int, RandomState instance or None, optional, default=None | |
| random_state : int or RandomState instance, default=None |
|
Thanks for the suggestions, I will work on it tomorrow! |
…python naming of types
|
Is there a standard way to document a list with a fixed type of its elements? |
|
You were probably aware of this, but I found out a way to set a template for the documentation: https://github.com/NilsJPWerner/autoDocstring/blob/master/src/docstring/templates/numpy.mustache |
glemaitre
left a comment
There was a problem hiding this comment.
LGTM. I added 2 suggestions for the case you had questions. Up to now, we use the same syntax than NumPy array (true that it has only a length indeed)
sklearn/preprocessing/_data.py
Outdated
| Parameters | ||
| ---------- | ||
| input_features : list of string, length n_features, optional | ||
| input_features : list of str, length n_features, default=None |
There was a problem hiding this comment.
| input_features : list of str, length n_features, default=None | |
| input_features : list of str of shape (n_features,), default=None |
sklearn/preprocessing/_data.py
Outdated
| ------- | ||
| output_feature_names : list of string, length n_output_features | ||
|
|
||
| output_feature_names : list of str, length n_output_features |
There was a problem hiding this comment.
| output_feature_names : list of str, length n_output_features | |
| output_feature_names : list of str of shape (n_output_features,) |
There was a problem hiding this comment.
Thanks for the suggestions. I updated the PR!
|
I do not think we should add |
I would prefer to be explicit and mention it. |
|
On one hand, I see that dataframe is almost never explicitly mentioned in sklearn, but I see here https://scikit-learn.org/stable/developers/contributing.html#contribute-documentation that it's given as an option |
|
"array-like or frame-like or sparse matrix"?? I think it's getting too
verbose.
|
|
Up to now, our dev guide states However, I really feel that we should settle with those once for all. It is bad that we ask contributors to add/remove at each new reviewer passing by while the dev guide should be sufficient regarding the doc style. |
We can talk about this in the monthly call. Here are my thoughts: It looks like we only use I would say unless we are doing something special with the column names, we just need "array-like". If we are doing something special, such as in column transformer, we can put "dataframe" or maybe "frame-like" (if a dataframe protocol becomes a thing). In the end, I want to avoid listing out every single valid data structure. Currently, we can take cupy arrays, dask arrays, xarrays, etc. and they work because they are "array-like". This can get incredibility verbose if we list them all. |
|
I think I agree with your point. Until an estimator uses a I am also thinking that we should be careful with the output type. Usually, if an array or sparse array is given, we will output the same type. It is one of the reasons that I might not consider stating |
|
Use the Glossary to clarify, and use numpydoc_xref_param_type to link to it?
|
|
I will remove the references to |
|
Thanks @Alemaudit, #17359 bring more light on the dataframe and as you did here, we will not it in the description. I will have a quick look but I think that I should be able to merge. |
|
Thanks @Alemaudit for the patience :) |
|
Thanks @glemaitre, it was a pleasure to contribute! |
Reference Issues/PRs
Fixes #16646
What does this implement/fix? Explain your changes.
This PR is an improvement to the docstrings of some Transformer classes in
sklearn.preprocessing._data.As described in issue #16646,
PolynomialFeaturesdoes not mention the possibility to pass sparse matrices. The same holds for other classes in the same module, such asStandardScaler,MaxAbsScaler,RobustScaler,Binarizer.With this PR all the descriptions of
Xare changed toor
depending on whether the corresponding methods accept sparse matrices and dataframes respectively.
Any other comments?
I took the liberty of formatting other occurrences of
[n_samples, n_features]to(n_samples, n_features), for consistency within the module.Further fixes: docstrings with no
Returnsare completed; every.fitmethod documents the ignored argumenty, and the default values are described in a consistent format throughout the module.