Skip to content

[MRG+1] TST cover sparse matrix case for passing through NaN in transformer#11012

Merged
jnothman merged 6 commits intoscikit-learn:masterfrom
glemaitre:fix_test_sparse_nan
Apr 26, 2018
Merged

[MRG+1] TST cover sparse matrix case for passing through NaN in transformer#11012
jnothman merged 6 commits intoscikit-learn:masterfrom
glemaitre:fix_test_sparse_nan

Conversation

@glemaitre
Copy link
Copy Markdown
Member

We previously make a common test for testing consistency of transformers letting pass NaN values.
However, we did not have a common test using sparse inputs.

This PR is covering this case.

[QuantileTransformer(n_quantiles=10, random_state=42)]
)
@pytest.mark.parametrize(
"func_sparse_format",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe sparse_constructor ?

@pytest.mark.parametrize(
"func_sparse_format",
[sparse.csr_matrix,
sparse.csc_matrix]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should include all sparse array formats IMO (see e.g.

for sparse_format in ['csr', 'csc', 'dok', 'lil', 'coo', 'dia', 'bsr']:
)

for i in range(X.shape[1]):
# train only on non-NaN
col_train_sparse = func_sparse_format(
X_train[:, [i]][~np.isnan(X_train[:, i])])
Copy link
Copy Markdown
Member

@rth rth Apr 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe define this in a function,

def _get_valid_samples_by_column(X, i):
    """Get non NaN samples in column i of X"""
    return X_train[:, [i]][~np.isnan(X_train[:, i])])

and use here and in the test_missing_value_handling_dense function (twice)

@glemaitre glemaitre changed the title [WIP] TST cover sparse matrix case for passing through NaN in transformer [MRG] TST cover sparse matrix case for passing through NaN in transformer Apr 23, 2018
@glemaitre
Copy link
Copy Markdown
Member Author

@jnothman We forgot to include the sparse case in the common test.
If you can look at it, that would be great.

@rth
Copy link
Copy Markdown
Member

rth commented Apr 23, 2018

LGTM

@rth rth changed the title [MRG] TST cover sparse matrix case for passing through NaN in transformer [MRG+1] TST cover sparse matrix case for passing through NaN in transformer Apr 23, 2018
Copy link
Copy Markdown
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite rigorous, but if all our estimators supporting sparse matrices also support dense ones, I don't understand why we are not just checking that for X_train and X_test containing some NaN, the sparse and dense transforms are equivalent.

Copy link
Copy Markdown
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I.e. there seems to be redundant testing of invariants here, when the only invariant we really care about is between sparse and dense (in the case that NaN occurs).

@glemaitre
Copy link
Copy Markdown
Member Author

This is quite rigorous, but if all our estimators supporting sparse matrices also support dense ones, I don't understand why we are not just checking that for X_train and X_test containing some NaN, the sparse and dense transforms are equivalent.

Good point. I thought that we needed to dissociate but actually this is only a single line to test and we can parametrize if the transformer support sparse matrices. I will change it accordingly.

@glemaitre glemaitre force-pushed the fix_test_sparse_nan branch from aeacc10 to 47ed0bf Compare April 24, 2018 11:38
@glemaitre glemaitre force-pushed the fix_test_sparse_nan branch from 47ed0bf to fc4042c Compare April 24, 2018 11:39
@glemaitre
Copy link
Copy Markdown
Member Author

@jnothman Made the changes.

Xt_sparse = (est_sparse.fit(sparse_constructor(X_train))
.transform(sparse_constructor(X_test)))
assert_allclose(Xt_dense, Xt_sparse.A)
# check that inverse transform lead to the input data
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is surely not always the case. again we should be testing consistency with the dense case and no more

Copy link
Copy Markdown
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't mean this shouldn't test inverse_transform, but only test it for consistency with dense. But we should perhaps merge and fix it up

est_sparse = clone(est)

Xt_dense = est_dense.fit(X_train).transform(X_test)
Xt_inv_dense = est_dense.transform(Xt_dense)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be inverse_transform?

Xt_sparse = (est_sparse.fit(sparse_constructor(X_train))
.transform(sparse_constructor(X_test)))
assert_allclose(Xt_sparse.A, Xt_dense)
Xt_inv_sparse = est_sparse.transform(Xt_sparse)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be inverse_transform?

@glemaitre
Copy link
Copy Markdown
Member Author

thanks stupid mistake

@jnothman jnothman merged commit 96a02f3 into scikit-learn:master Apr 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants