Skip to content

[MRG] Fix missing 'sparse matrix' in docstrings when allowed #16646#16656

Merged
glemaitre merged 22 commits intoscikit-learn:masterfrom
genziano:issue16646
May 27, 2020
Merged

[MRG] Fix missing 'sparse matrix' in docstrings when allowed #16646#16656
glemaitre merged 22 commits intoscikit-learn:masterfrom
genziano:issue16646

Conversation

@genziano
Copy link
Copy Markdown
Contributor

@genziano genziano commented Mar 8, 2020

Reference Issues/PRs

Fixes #16646

What does this implement/fix? Explain your changes.

This PR is an improvement to the docstrings of some Transformer classes in sklearn.preprocessing._data.
As described in issue #16646, PolynomialFeatures does not mention the possibility to pass sparse matrices. The same holds for other classes in the same module, such as StandardScaler, MaxAbsScaler, RobustScaler, Binarizer.

With this PR all the descriptions of X are changed to

array-like, shape (n_samples, n_features)

or

{array-like, sparse matrix, dataframe} of shape (n_samples, n_features)

depending on whether the corresponding methods accept sparse matrices and dataframes respectively.

Any other comments?

I took the liberty of formatting other occurrences of [n_samples, n_features] to (n_samples, n_features), for consistency within the module.

Further fixes: docstrings with no Returns are completed; every .fit method documents the ignored argument y, and the default values are described in a consistent format throughout the module.

@genziano genziano changed the title [MRG] Fix missing 'sparse matrix' in docstrings when allowed #16626 [MRG] Fix missing 'sparse matrix' in docstrings when allowed #16646 Mar 8, 2020
Copy link
Copy Markdown
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of changes to apply in the same time.

Parameters
----------
X : {array-like, sparse matrix}, shape [n_samples, n_features]
X : {array-like, sparse matrix}, shape (n_samples, n_features)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should change , shape to of shape at the same time. In addition, I think that the algorithms pointed out will work with dataframe as well.

Suggested change
X : {array-like, sparse matrix}, shape (n_samples, n_features)
X : {array-like, sparse matrix, dataframe} of shape (n_samples, n_features)

Returns
-------
K_new : numpy array of shape [n_samples1, n_samples2]
K_new : numpy array of shape (n_samples1, n_samples2)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
K_new : numpy array of shape (n_samples1, n_samples2)
K_new : ndarray of shape (n_samples1, n_samples2)

Parameters
----------
K : numpy array of shape [n_samples1, n_samples2]
K : numpy array of shape (n_samples1, n_samples2)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
K : numpy array of shape (n_samples1, n_samples2)
K : ndarray of shape (n_samples1, n_samples2)

Parameters
----------
K : numpy array of shape [n_samples, n_samples]
K : numpy array of shape (n_samples, n_samples)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
K : numpy array of shape (n_samples, n_samples)
K : ndarray of shape (n_samples, n_samples)

Normalized input X.

norms : array, shape [n_samples] if axis=1 else [n_features]
norms : array, shape (n_samples, ) if axis=1 else (n_features, )
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
norms : array, shape (n_samples, ) if axis=1 else (n_features, )
norms : ndarray of shape (n_samples, ) if `axis=1` else (n_features, )

Returns
-------
XP : np.ndarray or CSR/CSC sparse matrix, shape [n_samples, NP]
XP : {np.ndarray, CSR/CSC sparse matrix}, shape (n_samples, NP)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should put CSR and CSC information in the description rather than in the type.

@genziano
Copy link
Copy Markdown
Contributor Author

genziano commented Mar 9, 2020

@glemaitre thanks for the feedback, I will proceed with the fixes.
Is it ok to rewrite ndarray or None, shape (n_features,) as ndarray of shape (n_features,) or None?

@glemaitre
Copy link
Copy Markdown
Member

Normally we have

xxx : ndarray of shape (n_features,) or None

If there is a default then

xxx: ndarray of shape (n_features,), default=None

@glemaitre
Copy link
Copy Markdown
Member

Could you check the linter error

Parameters
----------
X : {array-like, sparse matrix}, shape [n_samples, n_features]
X : {array-like, sparse matrix} of shape (n_samples, n_features)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should accept dataframe as well

Parameters
----------
X : {array-like, sparse matrix}, shape [n_samples, n_features]
X : {array-like, sparse matrix} of shape (n_samples, n_features)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should accept dataframe as well

Parameters
----------
X : array-like, shape [n_samples, n_features]
X : {array-like, sparse matrix} of shape (n_samples, n_features)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should accept dataframe as well

Parameters
----------
X : array-like, shape [n_samples, n_features]
X : {array-like, sparse matrix} of shape (n_samples, n_features)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should accept dataframe as well

Returns
-------
X_tr : array-like, shape [n_samples, n_features]
X_tr : array-like of shape (n_samples, n_features)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will not be an array-like in fact. It will be an ndarray.

Parameters
----------
X : {array-like, sparse matrix}, shape [n_samples, n_features]
X : {array-like, sparse matrix} of (n_samples, n_features)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically all the preprocesing method will accept dataframe

@genziano
Copy link
Copy Markdown
Contributor Author

Hi @glemaitre, the linter errors are due to the line

        X : {array-like, sparse matrix, dataframe} of shape (n_samples, n_features)

being too long. Is there a standard way within scikit-learn to split the line?

I scanned all the scikit-learn modules and I found that dataframe is almost never mentioned, is it because it might be considered covered by array-like?

@glemaitre
Copy link
Copy Markdown
Member

You can split using the backslack

X : {array-like, sparse matrix, dataframe} of shape \
        (n_samples, n_features)
    Whatever description

@genziano
Copy link
Copy Markdown
Contributor Author

Preview of how it would look like:

    def fit(self, X, y=None):
        """Compute the mean and std to be used for later scaling.

        Parameters
        ----------
        X : {array-like, sparse matrix, dataframe} of shape \
                (n_samples, n_features)
            The data used to compute the mean and standard deviation
            used for later scaling along the features axis.

        y
            Ignored
        """

Do you have any comments/improvements?

@glemaitre
Copy link
Copy Markdown
Member

glemaitre commented Mar 10, 2020 via email

@genziano
Copy link
Copy Markdown
Contributor Author

Cool!

Another thing I noticed: some .transform() methods do not document the output. For instance:

def transform(self, X, copy=None):
"""Perform standardization by centering and scaling
Parameters
----------
X : array-like, shape [n_samples, n_features]
The data used to scale along the features axis.
copy : bool, optional (default: None)
Copy the input X or not.
"""

Shall I fix it now, or it should be part of another issue/PR?

@glemaitre
Copy link
Copy Markdown
Member

Let's correct this in the same PR then :) It will be a documentation change consistency across the file.

@genziano
Copy link
Copy Markdown
Contributor Author

Hi @glemaitre, I did some last checks and everything seems good. What do you think?

Copy link
Copy Markdown
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is almost there. Apart from my suggestion, there are the occurrences of boolean to be changed to bool and remove "optional"

each sample.

with_mean : boolean, True by default
with_mean : boolean, default=True
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
with_mean : boolean, default=True
with_mean : bool, default=True

If True, center the data before scaling.

with_std : boolean, True by default
with_std : boolean, default=True
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
with_std : boolean, default=True
with_std : bool, default=True

unit standard deviation).

copy : boolean, optional, default True
copy : boolean, optional, default=True
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
copy : boolean, optional, default=True
copy : bool, default=True

Parameters
----------
copy : boolean, optional, default True
copy : boolean, optional, default=True
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the subsequent entries, you can remove optional and use the python type instead of english one; e.g boolean -> bool, string -> str, etc.

Parameters
----------
norm : 'l1', 'l2', or 'max', optional ('l2' by default)
norm : 'l1', 'l2', or 'max', optional, default='l2'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
norm : 'l1', 'l2', or 'max', optional, default='l2'
norm : {'l1', 'l2', 'max'}, default='l2'

differ for value-identical sparse and dense matrices.

random_state : int, RandomState instance or None, optional (default=None)
random_state : int, RandomState instance or None, optional, default=None
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
random_state : int, RandomState instance or None, optional, default=None
random_state : int or RandomState instance, default=None

@genziano
Copy link
Copy Markdown
Contributor Author

Thanks for the suggestions, I will work on it tomorrow!

@genziano
Copy link
Copy Markdown
Contributor Author

Is there a standard way to document a list with a fixed type of its elements?
For instance:

input_features : list of str, length n_features, default=None

@genziano
Copy link
Copy Markdown
Contributor Author

You were probably aware of this, but I found out a way to set a template for the documentation: https://github.com/NilsJPWerner/autoDocstring/blob/master/src/docstring/templates/numpy.mustache

@genziano genziano requested a review from glemaitre April 22, 2020 12:33
Copy link
Copy Markdown
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I added 2 suggestions for the case you had questions. Up to now, we use the same syntax than NumPy array (true that it has only a length indeed)

Parameters
----------
input_features : list of string, length n_features, optional
input_features : list of str, length n_features, default=None
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
input_features : list of str, length n_features, default=None
input_features : list of str of shape (n_features,), default=None

-------
output_feature_names : list of string, length n_output_features

output_feature_names : list of str, length n_output_features
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
output_feature_names : list of str, length n_output_features
output_feature_names : list of str of shape (n_output_features,)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestions. I updated the PR!

@thomasjpfan
Copy link
Copy Markdown
Member

thomasjpfan commented May 16, 2020

I do not think we should add dataframe everywhere. It would fit almost everywhere we have ndarray as long as np.asarray returns a a homogeneous ndarray.

@glemaitre
Copy link
Copy Markdown
Member

I do not think if we should add dataframe everywhere. It would fit almost everywhere we have ndarray as long as np.asarray returns a a homogeneous ndarray.

I would prefer to be explicit and mention it.

@genziano
Copy link
Copy Markdown
Contributor Author

On one hand, I see that dataframe is almost never explicitly mentioned in sklearn, but I see here https://scikit-learn.org/stable/developers/contributing.html#contribute-documentation that it's given as an option

@jnothman
Copy link
Copy Markdown
Member

jnothman commented May 18, 2020 via email

@glemaitre
Copy link
Copy Markdown
Member

Up to now, our dev guide states {array-like, sparse matrix, dataframe}.
I don't mind that we come with something shorter and change the guideline.

However, I really feel that we should settle with those once for all. It is bad that we ask contributors to add/remove at each new reviewer passing by while the dev guide should be sufficient regarding the doc style.

@thomasjpfan
Copy link
Copy Markdown
Member

However, I really feel that we should settle with those once for all. It is bad that we ask contributors to add/remove at each new reviewer passing by while the dev guide should be sufficient regarding the doc style.

We can talk about this in the monthly call.

Here are my thoughts:

It looks like we only use dataframe in the column transformer, some mixins, and the return types of the dataset functions.

I would say unless we are doing something special with the column names, we just need "array-like". If we are doing something special, such as in column transformer, we can put "dataframe" or maybe "frame-like" (if a dataframe protocol becomes a thing).

In the end, I want to avoid listing out every single valid data structure. Currently, we can take cupy arrays, dask arrays, xarrays, etc. and they work because they are "array-like". This can get incredibility verbose if we list them all.

@glemaitre
Copy link
Copy Markdown
Member

I think I agree with your point. Until an estimator uses a DataFrame as an array, then we can consider in the array-like. Right now, we only state that DataFrame should be numeric to be an array-like but actually we could consider object dtype, if the estimator can manage the data (OHE or OrdinalEncoder for instance).

I am also thinking that we should be careful with the output type. Usually, if an array or sparse array is given, we will output the same type. It is one of the reasons that I might not consider stating dataframe since we cannot output them up to now. And as you mentioned, if we grow support for these different types and we need to state types in input and output, it needs to be done wisely to avoid verbosity.

@jnothman
Copy link
Copy Markdown
Member

jnothman commented May 18, 2020 via email

@genziano
Copy link
Copy Markdown
Contributor Author

I will remove the references to dataframe then.

@genziano genziano requested a review from glemaitre May 24, 2020 16:21
@glemaitre
Copy link
Copy Markdown
Member

Thanks @Alemaudit, #17359 bring more light on the dataframe and as you did here, we will not it in the description. I will have a quick look but I think that I should be able to merge.

Copy link
Copy Markdown
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are good to go.

@glemaitre glemaitre merged commit ffd1873 into scikit-learn:master May 27, 2020
@glemaitre
Copy link
Copy Markdown
Member

Thanks @Alemaudit for the patience :)

@genziano
Copy link
Copy Markdown
Contributor Author

Thanks @glemaitre, it was a pleasure to contribute!

@genziano genziano deleted the issue16646 branch May 29, 2020 07:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PolynomialFeatures' docstring does not mention that sparse data is allowed for fit

4 participants