Skip to content

[MRG+1] Add add_indicator_features option to show missing values in the output#6607

Merged
jnothman merged 1 commit intoscikit-learn:masterfrom
maniteja123:imputer_indicator
Apr 14, 2016
Merged

[MRG+1] Add add_indicator_features option to show missing values in the output#6607
jnothman merged 1 commit intoscikit-learn:masterfrom
maniteja123:imputer_indicator

Conversation

@maniteja123
Copy link
Copy Markdown
Contributor

Initial attempt to provide an option to indicate imputed features in the transformed output as suggested in #6556 . I have added an option to fit_transform to provide backward compatibility and not break the code.

At present, the indicator feature is only for dense matrices and it is a simple approach. Any feedback is appreciated here. Thanks.

@maniteja123 maniteja123 changed the title Add option to inidcate imputed features [WIP] Add option to inidcate imputed features Mar 30, 2016
@raghavrv
Copy link
Copy Markdown
Member

Ping @amueller @jmschrei @jnothman

@jnothman
Copy link
Copy Markdown
Member

Option to fit_transform is a no-no in scikit-learn. Make it an estimator parameter. After all, I may want to perform a parameter search that switches this on and off and see which performs better.

else:
self.imputed_features = features_with_missing_values
if self.return_imputed:
X = np.hstack((X,imputed_mask))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pep8 space after comma

@maniteja123
Copy link
Copy Markdown
Contributor Author

@jnothman Sorry I didn't make it clear in the tests regarding the case when a feature has all missing values. I hope it is clearer now. In that case with axis=0, the indicator doesn't give the information about the discarded feature hence the masking on valid_statistics_indexes.

imputed_features_ : array of shape (n_features_with_missing, )
The input features which have been imputed during transform
The size of this attribute will be the number of features with
at least one missing value.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(at least one and fewer than all in the axis=0 case, if we want to be pedantic)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sure, that makes it clearer.

@jnothman
Copy link
Copy Markdown
Member

You're not currently testing the correctness of the concatenated indicator matrix. Please do.

I'm not sure what it'd be useful for, but this change also presents the possibility of implementing inverse_transform. Is there any value in that, @amueller?

@jnothman
Copy link
Copy Markdown
Member

And I'd really like to see an example added to this PR, and perhaps an extension to the narrative docs.

@maniteja123
Copy link
Copy Markdown
Contributor Author

Regarding checking the indicator matrix, sorry I couldn't decide whether it is sufficient to only check the indicator matrix or the whole output. Would something like calling fit_transform with add_missing_indicator=False option and stack it with the separately computed indicator matrix based on imputed_features_ and finally check with the output when the fit_transform is called with the add_missing_indicator = True option sound good ?

And regarding an example, should I try something on lines of this example ?

Thanks !

@maniteja123 maniteja123 changed the title [WIP] Add option to inidcate imputed features [WIP] Add add_missing_indicator option to show missing values in the output Mar 31, 2016
@maniteja123
Copy link
Copy Markdown
Contributor Author

I have added the check for the values in the indicator matrix as explained above. There is definitely a lot of code redundancy in it( except for checking the imputed_features_ attribute, everything else has much in common). Will try to make it more compact. I will wait for your opinion on implementing inverse_transform and adding an example.

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Apr 3, 2016

I don't think there's any question that an example which shows this improves the value of imputation would be a real boon to this PR. inverse_transform, I think don't bother.

>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
Imputer(add_missing_indicator=False, axis=0, copy=True, missing_values='NaN',
strategy='mean', verbose=0)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please indent appropriately for PEP8.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed this after the previous commit failed on doctest and updated based on the actual output when I run the command. I understand pep8 warrants more indentation, but this is documentation and we should be able to give the output as it would if we run the code right ? I have never changed anything related to doctests, so let me know if pep8 guidelines need to be followed here also. Thanks.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the NORMALIZE_WHITESPACE flag

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh okay didn't get that.. Thanks will do it.

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Apr 3, 2016

The tests look alright. I'd like to see the narrative doc changed to mention this feature and its motivation.

- If `axis=1` and X is encoded as a CSC matrix.

add_missing_indicator : boolean, optional (default=False)
If True, the returned X will have an appended indicator matrix
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"returned" is inappropriate in the class docstring. Try "transformed"

@jnothman
Copy link
Copy Markdown
Member

LGTM, after that name change. Thanks @maniteja123!

@jnothman jnothman changed the title [MRG] Add add_missing_indicator option to show missing values in the output [MRG+1] Add add_missing_indicator option to show missing values in the output Apr 13, 2016
@maniteja123
Copy link
Copy Markdown
Contributor Author

Thanks. Shall I will squash all the commits ?

@jnothman
Copy link
Copy Markdown
Member

After you've changed the test name, you can squash the commits, I guess.

@maniteja123 maniteja123 changed the title [MRG+1] Add add_missing_indicator option to show missing values in the output [MRG+1] Add add_indicator_features option to show missing values in the output Apr 13, 2016
@maniteja123
Copy link
Copy Markdown
Contributor Author

Squashed and rebased with master. Thank you so much @jnothman for patiently answering my queries and guiding all along !

@MechCoder
Copy link
Copy Markdown
Member

Just one more minor thing, can you add a test to show that imputed_features_ returns an empty mask when add_indicator_features is set to False and update whatsnew?

I can merge after that.

@maniteja123
Copy link
Copy Markdown
Contributor Author

Sorry if I am understanding it wrong, but irrespective of add_indicator_features, imputed_features will be the features with at least one missing value, right ?

@MechCoder
Copy link
Copy Markdown
Member

Oh yes, my bad :/ Sorry.

@maniteja123
Copy link
Copy Markdown
Contributor Author

No problem, I actually changed it in the last version. And any suggestion regarding the name ?

Also I am on mobile now. Sorry don't have laptop now to do the changes now. Would it be possible for you to add what's new if it is not a problem. Thanks for the help !

@maniteja123
Copy link
Copy Markdown
Contributor Author

I have added an entry to what's new. As a side note, I also did add entry for the MultiOutputClassifier here. But i suppose it will now have a merge conflict. Please do let me know if you need to push a commit again for that or if possible for you to resolve conflict and add. Thanks.

@jnothman
Copy link
Copy Markdown
Member

LGTM! Thanks @maniteja123!

@jnothman jnothman merged commit d837998 into scikit-learn:master Apr 14, 2016
@MechCoder
Copy link
Copy Markdown
Member

Thanks a lot @maniteja123 🍷 🍷 !!

@raghavrv
Copy link
Copy Markdown
Member

Thanks @maniteja123 🍻

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Aug 30, 2016

With apologies, @maniteja123, I think we should revert this change so as not to include it in the upcoming release. The error was mine: I was not thorough enough in my review, and mistakenly thought someone else had given it a +1 (we usually require two full reviews to merge). I think it cannot be released in its current form and we should revert this change and produce a PR with a fully functional add_indicator_features option. (I or you or someone else could do this, but I don't want the pressure of the release in getting that done.) Is this alright by everyone? @amueller?

@MechCoder
Copy link
Copy Markdown
Member

MechCoder commented Aug 30, 2016

Sorry for not responding to your pings lately (or did you ping?). My github notifications are overloaded. I'm astonished that I did not object to the attribute indicator_features_ not being set at fit but at transform time (and had given an implicit +1 (#6607 (comment)) ) There seems to be two major issues.

  1. The issue of indicator_features_ attribute being set at transform time (being fixed by FIX: Imputer fix for indicator_matrix in add_indicator_features option #6782 ?)
  2. The design decision of what to do when missing values are present at transform time. (Imputer(with add_indicator_features=True) can create arrays of different shape in transform() than in fit_transform() #6767 (comment)). Option 1 seems to be more reasonable since I view the indicator features as something learnt during fit time and should not be influenced by the unknown data at transform time. For option 2 and 3 to feel useful, we still do not support partial_fit for Imputer, right?

I know there are a few other issues that you had summarized somewhere but I'm unable to find them. However at this point of time, if you feel that it is too much of stress before the release +1 for reverting. I would be happy to start afresh as well. Again apologies, I feel quite bad for doing this.

@GaelVaroquaux
Copy link
Copy Markdown
Member

I think it cannot be released in its current form and we should revert
this change and produce a PR with a fully functional
add_indicator_features option.

Indeed, I feel that good practice is not to release this in a
half-complete state, as people are going to rely on it, and it's going to
break.

@maniteja123
Copy link
Copy Markdown
Contributor Author

Hi everyone, I am sorry for proceeding in the wrong direction and this issue becoming a blocker for the release. Please don't apologize for reverting this change or reviewing, all of you have patiently guided me with the questions. IMO I think it is definitely essential to revert the change. Also if possible, I would happily work on this once the consensus of functionality for fit and transform is reached. And I suppose partial fit is not supported for the Imputer as of now. Please let me know if I can revert the changes or if the commits can be reverted. Thanks.

@jnothman
Copy link
Copy Markdown
Member

It's not blocking the release. It's just distressing me a little that I merged it and can't fix it quickly enough :)

jnothman added a commit to jnothman/scikit-learn that referenced this pull request Aug 30, 2016
)"

This reverts commit 18396be as it was
merged when incomplete.
jnothman added a commit that referenced this pull request Aug 30, 2016
This reverts commit 18396be as it was
merged when incomplete.
@amueller
Copy link
Copy Markdown
Member

Should we reopen?

@jnothman
Copy link
Copy Markdown
Member

I think rebasing and renaming #6782 is more appropriate.

TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants