[MRG] Fix missing 'sparse matrix' in docstrings when allowed #16646 by genziano · Pull Request #16656 · scikit-learn/scikit-learn

genziano · 2020-03-08T17:42:42Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This PR is an improvement to the docstrings of some Transformer classes in sklearn.preprocessing._data.
As described in issue #16646, PolynomialFeatures does not mention the possibility of passing sparse matrices. The same holds for other classes in the same module, such as StandardScaler, MaxAbsScaler, RobustScaler, and Binarizer.

With this PR all the descriptions of X are changed to

array-like, shape (n_samples, n_features)

or

{array-like, sparse matrix, dataframe} of shape (n_samples, n_features)

depending on whether the corresponding methods accept sparse matrices and dataframes respectively.

Any other comments?

I took the liberty of formatting other occurrences of [n_samples, n_features] to (n_samples, n_features), for consistency within the module.

Further fixes: docstrings with no Returns are completed; every .fit method documents the ignored argument y, and the default values are described in a consistent format throughout the module.

glemaitre

A couple of changes to apply at the same time.

glemaitre · 2020-03-09T10:11:20Z

sklearn/preprocessing/_data.py

        Parameters
        ----------
-        X : {array-like, sparse matrix}, shape [n_samples, n_features]
+        X : {array-like, sparse matrix}, shape (n_samples, n_features)


We should change ", shape" to "of shape" at the same time. In addition, I think that the algorithms in question will work with dataframes as well.

Suggested change

X : {array-like, sparse matrix}, shape (n_samples, n_features)

X : {array-like, sparse matrix, dataframe} of shape (n_samples, n_features)
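The ", shape [..]" to "of shape (..)" rewrite requested here is mechanical, so it could be scripted. The following is a minimal sketch using only the standard library; `modernize_shape` is a hypothetical helper written for this illustration, not part of scikit-learn or numpydoc, and it does not add or remove any accepted types.

```python
import re

# Matches the legacy ", shape [a, b]" (or ", shape (a, b)") suffix of a type spec.
_LEGACY_SHAPE = re.compile(r",\s*shape\s*[\[(]([^\])]*)[\])]")


def modernize_shape(type_spec):
    """Rewrite ', shape [a, b]' as ' of shape (a, b)' in a parameter header."""

    def _rewrite(match):
        dims = [d.strip() for d in match.group(1).split(",") if d.strip()]
        # A single dimension keeps the trailing comma of a 1-tuple.
        inner = dims[0] + "," if len(dims) == 1 else ", ".join(dims)
        return f" of shape ({inner})"

    return _LEGACY_SHAPE.sub(_rewrite, type_spec)


print(modernize_shape("X : {array-like, sparse matrix}, shape [n_samples, n_features]"))
# X : {array-like, sparse matrix} of shape (n_samples, n_features)
```

Headers that already use "of shape" are left untouched, so the helper can be run repeatedly over the same file.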

glemaitre · 2020-03-09T10:13:46Z

sklearn/preprocessing/_data.py

        Returns
        -------
-        K_new : numpy array of shape [n_samples1, n_samples2]
+        K_new : numpy array of shape (n_samples1, n_samples2)


Suggested change

K_new : numpy array of shape (n_samples1, n_samples2)

K_new : ndarray of shape (n_samples1, n_samples2)

glemaitre · 2020-03-09T10:14:00Z

sklearn/preprocessing/_data.py

        Parameters
        ----------
-        K : numpy array of shape [n_samples1, n_samples2]
+        K : numpy array of shape (n_samples1, n_samples2)


Suggested change

K : numpy array of shape (n_samples1, n_samples2)

K : ndarray of shape (n_samples1, n_samples2)

glemaitre · 2020-03-09T10:14:23Z

sklearn/preprocessing/_data.py

        Parameters
        ----------
-        K : numpy array of shape [n_samples, n_samples]
+        K : numpy array of shape (n_samples, n_samples)


Suggested change

K : numpy array of shape (n_samples, n_samples)

K : ndarray of shape (n_samples, n_samples)

glemaitre · 2020-03-09T10:14:51Z

sklearn/preprocessing/_data.py

        Normalized input X.

-    norms : array, shape [n_samples] if axis=1 else [n_features]
+    norms : array, shape (n_samples, ) if axis=1 else (n_features, )


Suggested change

norms : array, shape (n_samples, ) if axis=1 else (n_features, )

norms : ndarray of shape (n_samples, ) if `axis=1` else (n_features, )

glemaitre · 2020-03-09T10:15:22Z

sklearn/preprocessing/_data.py

        Returns
        -------
-        XP : np.ndarray or CSR/CSC sparse matrix, shape [n_samples, NP]
+        XP : {np.ndarray, CSR/CSC sparse matrix}, shape (n_samples, NP)


I think that we should put CSR and CSC information in the description rather than in the type.

genziano · 2020-03-09T17:42:54Z

@glemaitre thanks for the feedback, I will proceed with the fixes.
Is it ok to rewrite ndarray or None, shape (n_features,) as ndarray of shape (n_features,) or None?

glemaitre · 2020-03-10T09:56:05Z

Normally we have

xxx : ndarray of shape (n_features,) or None

If there is a default then

xxx : ndarray of shape (n_features,), default=None
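To make the difference between the two forms concrete, here is a small sketch that splits a parameter header into name, type and default. `parse_param` and its regular expression are assumptions made for this illustration; numpydoc has its own parser, which is not used here.

```python
import re

# name : type[, default=value]
_PARAM = re.compile(
    r"^(?P<name>\w+)\s*:\s*(?P<type>.*?)(?:,\s*default=(?P<default>.+))?$"
)


def parse_param(line):
    """Split a numpydoc parameter header into (name, type, default)."""
    match = _PARAM.match(line.strip())
    if match is None:
        raise ValueError(f"not a parameter header: {line!r}")
    return match.group("name"), match.group("type"), match.group("default")


# "or None" is part of the type: None is a possible value, no default is stated.
print(parse_param("xxx : ndarray of shape (n_features,) or None"))
# "default=None" is a separate field: the parameter can be omitted.
print(parse_param("xxx : ndarray of shape (n_features,), default=None"))
```

In the first call the default field comes back as Python's `None` (nothing documented); in the second it is the string `"None"`.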

glemaitre · 2020-03-10T09:56:17Z

Could you check the linter error?

glemaitre · 2020-03-10T09:56:42Z

sklearn/preprocessing/_data.py

        Parameters
        ----------
-        X : {array-like, sparse matrix}, shape [n_samples, n_features]
+        X : {array-like, sparse matrix} of shape (n_samples, n_features)


it should accept dataframe as well

glemaitre · 2020-03-10T09:56:55Z

sklearn/preprocessing/_data.py

        Parameters
        ----------
-        X : {array-like, sparse matrix}, shape [n_samples, n_features]
+        X : {array-like, sparse matrix} of shape (n_samples, n_features)


it should accept dataframe as well

glemaitre · 2020-03-10T09:57:04Z

sklearn/preprocessing/_data.py

        Parameters
        ----------
-        X : array-like, shape [n_samples, n_features]
+        X : {array-like, sparse matrix} of shape (n_samples, n_features)


it should accept dataframe as well

glemaitre · 2020-03-10T09:57:13Z

sklearn/preprocessing/_data.py

        Parameters
        ----------
-        X : array-like, shape [n_samples, n_features]
+        X : {array-like, sparse matrix} of shape (n_samples, n_features)


it should accept dataframe as well

glemaitre · 2020-03-10T09:57:38Z

sklearn/preprocessing/_data.py

        Returns
        -------
-        X_tr : array-like, shape [n_samples, n_features]
+        X_tr : array-like of shape (n_samples, n_features)


It will not be an array-like in fact. It will be an ndarray.

glemaitre · 2020-03-10T09:58:00Z

sklearn/preprocessing/_data.py

        Parameters
        ----------
-        X : {array-like, sparse matrix}, shape [n_samples, n_features]
+        X : {array-like, sparse matrix} of (n_samples, n_features)


Basically, all the preprocessing methods will accept a dataframe.

genziano · 2020-03-10T10:07:35Z

Hi @glemaitre, the linter errors are due to the line

        X : {array-like, sparse matrix, dataframe} of shape (n_samples, n_features)

being too long. Is there a standard way within scikit-learn to split the line?

I scanned all the scikit-learn modules and found that dataframe is almost never mentioned. Is that because it is considered to be covered by array-like?

glemaitre · 2020-03-10T13:04:42Z

You can split the line using a backslash:

X : {array-like, sparse matrix, dataframe} of shape \
        (n_samples, n_features)
    Whatever description
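For reference, a minimal sketch of what the backslash does at the Python level: in a regular (non-raw) docstring, a backslash at the end of a line is a line continuation, so the two source lines become a single line in `__doc__`. The `header_line` helper is hypothetical and exists only for this illustration; how numpydoc or Sphinx renders the result is not checked here.

```python
import inspect


def fit(X, y=None):
    """Compute the mean and std to be used for later scaling.

    Parameters
    ----------
    X : {array-like, sparse matrix, dataframe} of shape \
            (n_samples, n_features)
        The data used to compute the mean and standard deviation.

    y : None
        Ignored.
    """
    return None


def header_line(func, name):
    """Return the parameter header for `name`, with runs of spaces collapsed."""
    for line in inspect.cleandoc(func.__doc__).splitlines():
        if line.startswith(name + " :"):
            return " ".join(line.split())
    return None


print(header_line(fit, "X"))
# X : {array-like, sparse matrix, dataframe} of shape (n_samples, n_features)
```

The source line stays under the linter's length limit while the docstring that tools read contains the full type description on one line.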

genziano · 2020-03-10T13:48:26Z

Preview of how it would look:

    def fit(self, X, y=None):
        """Compute the mean and std to be used for later scaling.

        Parameters
        ----------
        X : {array-like, sparse matrix, dataframe} of shape \
                (n_samples, n_features)
            The data used to compute the mean and standard deviation
            used for later scaling along the features axis.

        y
            Ignored
        """

Do you have any comments/improvements?

glemaitre · 2020-03-10T14:08:53Z

Looks good to me 👍

genziano · 2020-03-10T14:17:47Z

Cool!

Another thing I noticed: some .transform() methods do not document the output. For instance:

scikit-learn/sklearn/preprocessing/_data.py

Lines 779 to 788 in 535ef55

    
    def transform(self, X, copy=None):
        """Perform standardization by centering and scaling

        Parameters
        ----------
        X : array-like, shape [n_samples, n_features]
            The data used to scale along the features axis.
        copy : bool, optional (default: None)
            Copy the input X or not.
        """

Shall I fix it now, or should it be part of another issue/PR?

glemaitre · 2020-03-11T09:00:25Z

Let's correct this in the same PR then :) It will be a documentation consistency change across the file.
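Finding the remaining methods without a documented output could be scripted. The sketch below assumes numpydoc-style section headers (a title underlined with dashes) and reports methods whose docstring has no Returns section. `has_section` and `ToyScaler` are hypothetical names used for illustration only; they are not scikit-learn code.

```python
import inspect
import re


def has_section(func, title):
    """Return True if the docstring of `func` has a numpydoc section `title`."""
    doc = inspect.getdoc(func) or ""
    # A numpydoc section is its title on one line, underlined by dashes on the next.
    pattern = rf"^{re.escape(title)}\n-+$"
    return re.search(pattern, doc, flags=re.MULTILINE) is not None


class ToyScaler:
    """Hypothetical transformer used only to exercise the check."""

    def transform(self, X, copy=None):
        """Perform standardization by centering and scaling.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data used to scale along the features axis.
        copy : bool, default=None
            Copy the input X or not.
        """
        return X

    def inverse_transform(self, X):
        """Scale back the data to the original representation.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The data to scale back.

        Returns
        -------
        X_tr : ndarray of shape (n_samples, n_features)
            Transformed array.
        """
        return X


# Collect the names of methods whose docstring lacks a Returns section.
missing = [
    name
    for name, method in inspect.getmembers(ToyScaler, inspect.isfunction)
    if not has_section(method, "Returns")
]
print(missing)
# ['transform']
```

The same loop could be pointed at a real class, though methods that legitimately return nothing would need to be excluded by hand.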

genziano · 2020-03-12T11:04:45Z

Hi @glemaitre, I did some last checks and everything seems good. What do you think?

glemaitre

I think this is almost there. Apart from my suggestions, the remaining occurrences of "boolean" should be changed to "bool", and "optional" should be removed.

glemaitre · 2020-03-12T15:29:11Z

sklearn/preprocessing/_data.py

        each sample.

-    with_mean : boolean, True by default
+    with_mean : boolean, default=True


Suggested change

with_mean : boolean, default=True

with_mean : bool, default=True

glemaitre · 2020-03-12T15:29:20Z

sklearn/preprocessing/_data.py

        If True, center the data before scaling.

-    with_std : boolean, True by default
+    with_std : boolean, default=True


Suggested change

with_std : boolean, default=True

with_std : bool, default=True

glemaitre · 2020-03-12T15:29:30Z

sklearn/preprocessing/_data.py

        unit standard deviation).

-    copy : boolean, optional, default True
+    copy : boolean, optional, default=True


Suggested change

copy : boolean, optional, default=True

copy : bool, default=True

glemaitre · 2020-03-12T15:30:47Z

sklearn/preprocessing/_data.py

    Parameters
    ----------
-    copy : boolean, optional, default True
+    copy : boolean, optional, default=True


For the subsequent entries, you can remove "optional" and use the Python type name instead of the English one, e.g. boolean -> bool, string -> str, etc.

glemaitre · 2020-03-12T15:31:54Z

sklearn/preprocessing/_data.py

    Parameters
    ----------
-    norm : 'l1', 'l2', or 'max', optional ('l2' by default)
+    norm : 'l1', 'l2', or 'max', optional, default='l2'


Suggested change

norm : 'l1', 'l2', or 'max', optional, default='l2'

norm : {'l1', 'l2', 'max'}, default='l2'

glemaitre · 2020-03-12T15:32:35Z

sklearn/preprocessing/_data.py

        differ for value-identical sparse and dense matrices.

-    random_state : int, RandomState instance or None, optional (default=None)
+    random_state : int, RandomState instance or None, optional, default=None


Suggested change

random_state : int, RandomState instance or None, optional, default=None

random_state : int or RandomState instance, default=None
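The conventions requested in this review round (Python type names, a single "default=value" form, no "optional") can be summarized as a small normalizer. This is a sketch under stated assumptions: `modernize_type_spec` and the `_TYPE_NAMES` mapping are hypothetical, only the patterns seen in this thread are handled, and semantic rewrites such as turning "'l1', 'l2', or 'max'" into a set or dropping "or None" are deliberately left to a human.

```python
import re

# English type names mapped to the Python names requested in the review.
_TYPE_NAMES = {"boolean": "bool", "string": "str", "integer": "int"}


def modernize_type_spec(spec):
    """Apply the review's conventions to the type part of a parameter header."""
    # 1. English type names become Python type names.
    for english, python in _TYPE_NAMES.items():
        spec = re.sub(rf"\b{english}\b", python, spec)
    # 2. "(default=X)", "(default: X)" and "default X" all become ", default=X".
    spec = re.sub(r"[,\s]*\(?default[=:\s]\s*([^)]+?)\)?$", r", default=\1", spec)
    # 3. "optional" is redundant once a default is documented.
    spec = re.sub(r",?\s*\boptional\b", "", spec)
    return spec


print(modernize_type_spec("boolean, optional, default True"))
# bool, default=True
print(modernize_type_spec("bool, optional (default: None)"))
# bool, default=None
```

Running the default rewrite before removing "optional" keeps the comma handling simple, since the separator in front of "default" is always regenerated.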

genziano · 2020-03-16T22:09:04Z

Thanks for the suggestions, I will work on it tomorrow!

genziano · 2020-03-17T09:19:01Z

Is there a standard way to document a list with a fixed type of its elements?
For instance:

input_features : list of str, length n_features, default=None

genziano · 2020-03-18T14:56:40Z

You were probably aware of this, but I found out a way to set a template for the documentation: https://github.com/NilsJPWerner/autoDocstring/blob/master/src/docstring/templates/numpy.mustache

glemaitre

LGTM. I added 2 suggestions for the cases you asked about. Up to now, we use the same syntax as for NumPy arrays (admittedly, a list only has a length).

glemaitre · 2020-05-13T14:48:19Z

sklearn/preprocessing/_data.py

        Parameters
        ----------
-        input_features : list of string, length n_features, optional
+        input_features : list of str, length n_features, default=None


Suggested change

input_features : list of str, length n_features, default=None

input_features : list of str of shape (n_features,), default=None

glemaitre · 2020-05-13T14:48:41Z

sklearn/preprocessing/_data.py

        -------
-        output_feature_names : list of string, length n_output_features
-
+        output_feature_names : list of str, length n_output_features


Suggested change

output_feature_names : list of str, length n_output_features

output_feature_names : list of str of shape (n_output_features,)

Thanks for the suggestions. I updated the PR!

thomasjpfan · 2020-05-16T21:31:36Z

I do not think we should add dataframe everywhere. It would fit almost everywhere we have ndarray, as long as np.asarray returns a homogeneous ndarray.

glemaitre · 2020-05-18T07:50:38Z

I do not think we should add dataframe everywhere. It would fit almost everywhere we have ndarray, as long as np.asarray returns a homogeneous ndarray.

I would prefer to be explicit and mention it.

genziano · 2020-05-18T11:07:12Z

On the one hand, dataframe is almost never explicitly mentioned in sklearn; on the other hand, the contributing guide at https://scikit-learn.org/stable/developers/contributing.html#contribute-documentation gives it as an option.

jnothman · 2020-05-18T12:23:51Z

"array-like or frame-like or sparse matrix"?? I think it's getting too verbose.

glemaitre · 2020-05-18T13:25:32Z

Up to now, our dev guide states {array-like, sparse matrix, dataframe}.
I don't mind if we come up with something shorter and change the guideline.

However, I really feel that we should settle these once and for all. It is bad that we ask contributors to add or remove things each time a new reviewer passes by, when the dev guide should be sufficient regarding the doc style.

thomasjpfan · 2020-05-18T13:45:29Z

However, I really feel that we should settle these once and for all. It is bad that we ask contributors to add or remove things each time a new reviewer passes by, when the dev guide should be sufficient regarding the doc style.

We can talk about this in the monthly call.

Here are my thoughts:

It looks like we only use dataframe in the column transformer, some mixins, and the return types of the dataset functions.

I would say unless we are doing something special with the column names, we just need "array-like". If we are doing something special, such as in column transformer, we can put "dataframe" or maybe "frame-like" (if a dataframe protocol becomes a thing).

In the end, I want to avoid listing out every single valid data structure. Currently, we can take cupy arrays, dask arrays, xarrays, etc. and they work because they are "array-like". This can get incredibly verbose if we list them all.

glemaitre · 2020-05-18T14:26:32Z

I think I agree with your point. As long as an estimator uses a DataFrame as an array, we can consider it array-like. Right now, we only state that a DataFrame should be numeric to be array-like, but we could actually also consider object dtype if the estimator can manage the data (OHE or OrdinalEncoder, for instance).

I am also thinking that we should be careful with the output type. Usually, if an array or sparse array is given, we will output the same type. It is one of the reasons that I might not consider stating dataframe since we cannot output them up to now. And as you mentioned, if we grow support for these different types and we need to state types in input and output, it needs to be done wisely to avoid verbosity.

jnothman · 2020-05-18T14:47:11Z

Use the Glossary to clarify, and use numpydoc_xref_param_type to link to it?

genziano · 2020-05-19T16:00:15Z

I will remove the references to dataframe then.

glemaitre · 2020-05-27T13:15:29Z

Thanks @Alemaudit. #17359 sheds more light on the dataframe question and, as you did here, we will not mention it in the description. I will have a quick look, but I think that I should be able to merge.

glemaitre

We are good to go.

glemaitre · 2020-05-27T13:17:35Z

Thanks @Alemaudit for the patience :)

genziano · 2020-05-28T13:48:32Z

Thanks @glemaitre, it was a pleasure to contribute!

genziano added 3 commits March 8, 2020 17:18

DOC make consistent description of X in preprocessing._data, take into account sparse matrices (scikit-learn#16646)

bad03af

DOC last fixes to X docs for QuantileTransformer (scikit-learn#16646)

9880a3e

DOC further fixes to restore ndarray instead of array-like (scikit-learn#16646)

030524e

github-actions bot added the module:preprocessing label Mar 8, 2020

genziano changed the title ~~[MRG] Fix missing 'sparse matrix' in docstrings when allowed #16626~~ [MRG] Fix missing 'sparse matrix' in docstrings when allowed #16646 Mar 8, 2020

glemaitre reviewed Mar 9, 2020

genziano added 2 commits March 9, 2020 20:27

', shape' -> 'of shape', 'array' -> 'ndarray', some possibilities of passing a dataframe documented

1b762fd

Merge remote-tracking branch 'upstream/master' into issue16646

4e15284

glemaitre reviewed Mar 10, 2020

genziano added 2 commits March 10, 2020 15:10

DOC all dataframe as input should be documented scikit-learn#16646

988ed5d

Merge remote-tracking branch 'upstream/master' into issue16646

d9c6bc3

DOC minor fixes (scikit-learn#16646)

4578ebd

genziano added 7 commits March 11, 2020 15:49

Merge remote-tracking branch 'upstream/master' into issue16646

b395090

DOC included missing and in classes of preprocessing._data

8ed1808

DOC fixed missing in functions of preprocessing._data (scikit-learn#16646)

d767899

DOC all docstrings with the same format for default: default=<value> (scikit-learn#16646)

efd386a

DOC trailing spaces fixed (scikit-learn#16646)

859195f

DOCK Minor consistency fixes (scikit-learn#16646)

c8dfea2

Merge remote-tracking branch 'upstream/master' into issue16646

fd95f83

glemaitre reviewed Mar 12, 2020

genziano added 4 commits March 17, 2020 09:19

Merge remote-tracking branch 'upstream/master' into issue16646

8dd9f95

DOC removed 'optional' made redundant from default value, english to python naming of types

c4b96f3

DOC minor fixes, added some missing defaults (scikit-learn#16646)

f15fdc8

DOC changed docs of multiple choices of str attributes (scikit-learn#16646)

098aa3a

Merge remote-tracking branch 'upstream/master' into issue16646

c19ad81

genziano requested a review from glemaitre April 22, 2020 12:33

glemaitre reviewed May 13, 2020

documenting lists of str

7d188b5

removed dataframe as type in the docstrigs

c004916

genziano requested a review from glemaitre May 24, 2020 16:21

glemaitre approved these changes May 27, 2020

glemaitre merged commit ffd1873 into scikit-learn:master May 27, 2020

genziano deleted the issue16646 branch May 29, 2020 07:20

viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020

DOC improve supporting types input and output in preprocessing module (scikit-learn#16656)

9f8637b

jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020

DOC improve supporting types input and output in preprocessing module (scikit-learn#16656)

b3037fe
