[MRG+1] Add add_indicator_features option to show missing values in the output by maniteja123 · Pull Request #6607 · scikit-learn/scikit-learn

maniteja123 · 2016-03-30T17:51:44Z

Initial attempt to provide an option to indicate imputed features in the transformed output as suggested in #6556 . I have added an option to fit_transform to provide backward compatibility and not break the code.

At present, the indicator feature is only for dense matrices and it is a simple approach. Any feedback is appreciated here. Thanks.

raghavrv · 2016-03-30T18:36:57Z

Ping @amueller @jmschrei @jnothman

jnothman · 2016-03-30T23:22:26Z

Option to fit_transform is a no-no in scikit-learn. Make it an estimator parameter. After all, I may want to perform a parameter search that switches this on and off and see which performs better.

raghavrv · 2016-03-30T23:27:03Z

sklearn/preprocessing/imputation.py

+            else:
+                self.imputed_features = features_with_missing_values
+            if self.return_imputed:
+                X = np.hstack((X,imputed_mask))


pep8 space after comma

maniteja123 · 2016-03-31T01:25:22Z

@jnothman Sorry I didn't make it clear in the tests regarding the case when a feature has all missing values. I hope it is clearer now. In that case with axis=0, the indicator doesn't give the information about the discarded feature hence the masking on valid_statistics_indexes.

jnothman · 2016-03-31T03:13:25Z

sklearn/preprocessing/imputation.py

+    imputed_features_ : array of shape (n_features_with_missing, )
+        The input features which have been imputed during transform
+        The size of this attribute will be the number of features with
+        at least one missing value.


(at least one and fewer than all in the axis=0 case, if we want to be pedantic)

Yeah sure, that makes it clearer.

jnothman · 2016-03-31T03:22:17Z

You're not currently testing the correctness of the concatenated indicator matrix. Please do.

I'm not sure what it'd be useful for, but this change also presents the possibility of implementing inverse_transform. Is there any value in that, @amueller?

jnothman · 2016-03-31T03:22:38Z

And I'd really like to see an example added to this PR, and perhaps an extension to the narrative docs.

maniteja123 · 2016-03-31T09:32:22Z

Regarding checking the indicator matrix, sorry I couldn't decide whether it is sufficient to only check the indicator matrix or the whole output. Would something like calling fit_transform with add_missing_indicator=False option and stack it with the separately computed indicator matrix based on imputed_features_ and finally check with the output when the fit_transform is called with the add_missing_indicator = True option sound good ?

And regarding an example, should I try something on lines of this example ?

Thanks !

maniteja123 · 2016-04-02T07:03:15Z

I have added the check for the values in the indicator matrix as explained above. There is definitely a lot of code redundancy in it( except for checking the imputed_features_ attribute, everything else has much in common). Will try to make it more compact. I will wait for your opinion on implementing inverse_transform and adding an example.

jnothman · 2016-04-03T09:37:01Z

I don't think there's any question that an example which shows this improves the value of imputation would be a real boon to this PR. inverse_transform, I think don't bother.

jnothman · 2016-04-03T12:13:22Z

doc/modules/preprocessing.rst

    >>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
-    Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
+    Imputer(add_missing_indicator=False, axis=0, copy=True, missing_values='NaN',
+        strategy='mean', verbose=0)


Please indent appropriately for PEP8.

I have changed this after the previous commit failed on doctest and updated based on the actual output when I run the command. I understand pep8 warrants more indentation, but this is documentation and we should be able to give the output as it would if we run the code right ? I have never changed anything related to doctests, so let me know if pep8 guidelines need to be followed here also. Thanks.

Use the NORMALIZE_WHITESPACE flag

Oh okay didn't get that.. Thanks will do it.

jnothman · 2016-04-03T12:17:36Z

The tests look alright. I'd like to see the narrative doc changed to mention this feature and its motivation.

jnothman · 2016-04-03T12:19:09Z

sklearn/preprocessing/imputation.py

        - If `axis=1` and X is encoded as a CSC matrix.

+    add_missing_indicator : boolean, optional (default=False)
+        If True, the returned X will have an appended indicator matrix


"returned" is inappropriate in the class docstring. Try "transformed"

jnothman · 2016-04-13T12:55:08Z

LGTM, after that name change. Thanks @maniteja123!

maniteja123 · 2016-04-13T12:57:19Z

Thanks. Shall I will squash all the commits ?

jnothman · 2016-04-13T12:58:49Z

After you've changed the test name, you can squash the commits, I guess.

maniteja123 · 2016-04-13T13:04:25Z

Squashed and rebased with master. Thank you so much @jnothman for patiently answering my queries and guiding all along !

MechCoder · 2016-04-13T16:14:02Z

Just one more minor thing, can you add a test to show that imputed_features_ returns an empty mask when add_indicator_features is set to False and update whatsnew?

I can merge after that.

maniteja123 · 2016-04-13T16:19:04Z

Sorry if I am understanding it wrong, but irrespective of add_indicator_features, imputed_features will be the features with at least one missing value, right ?

MechCoder · 2016-04-13T16:20:35Z

Oh yes, my bad :/ Sorry.

maniteja123 · 2016-04-13T16:26:00Z

No problem, I actually changed it in the last version. And any suggestion regarding the name ?

Also I am on mobile now. Sorry don't have laptop now to do the changes now. Would it be possible for you to add what's new if it is not a problem. Thanks for the help !

maniteja123 · 2016-04-14T07:04:19Z

I have added an entry to what's new. As a side note, I also did add entry for the MultiOutputClassifier here. But i suppose it will now have a merge conflict. Please do let me know if you need to push a commit again for that or if possible for you to resolve conflict and add. Thanks.

jnothman · 2016-04-14T12:13:01Z

LGTM! Thanks @maniteja123!

MechCoder · 2016-04-14T17:32:33Z

Thanks a lot @maniteja123 🍷 🍷 !!

raghavrv · 2016-04-14T17:35:54Z

Thanks @maniteja123 🍻

jnothman · 2016-08-30T02:38:51Z

With apologies, @maniteja123, I think we should revert this change so as not to include it in the upcoming release. The error was mine: I was not thorough enough in my review, and mistakenly thought someone else had given it a +1 (we usually require two full reviews to merge). I think it cannot be released in its current form and we should revert this change and produce a PR with a fully functional add_indicator_features option. (I or you or someone else could do this, but I don't want the pressure of the release in getting that done.) Is this alright by everyone? @amueller?

MechCoder · 2016-08-30T06:10:37Z

Sorry for not responding to your pings lately (or did you ping?). My github notifications are overloaded. I'm astonished that I did not object to the attribute indicator_features_ not being set at fit but at transform time (and had given an implicit +1 (#6607 (comment)) ) There seems to be two major issues.

The issue of indicator_features_ attribute being set at transform time (being fixed by FIX: Imputer fix for indicator_matrix in add_indicator_features option #6782 ?)
The design decision of what to do when missing values are present at transform time. (Imputer(with add_indicator_features=True) can create arrays of different shape in transform() than in fit_transform() #6767 (comment)). Option 1 seems to be more reasonable since I view the indicator features as something learnt during fit time and should not be influenced by the unknown data at transform time. For option 2 and 3 to feel useful, we still do not support partial_fit for Imputer, right?

I know there are a few other issues that you had summarized somewhere but I'm unable to find them. However at this point of time, if you feel that it is too much of stress before the release +1 for reverting. I would be happy to start afresh as well. Again apologies, I feel quite bad for doing this.

GaelVaroquaux · 2016-08-30T06:30:55Z

I think it cannot be released in its current form and we should revert
this change and produce a PR with a fully functional
add_indicator_features option.

Indeed, I feel that good practice is not to release this in a
half-complete state, as people are going to rely on it, and it's going to
break.

maniteja123 · 2016-08-30T09:32:22Z

Hi everyone, I am sorry for proceeding in the wrong direction and this issue becoming a blocker for the release. Please don't apologize for reverting this change or reviewing, all of you have patiently guided me with the questions. IMO I think it is definitely essential to revert the change. Also if possible, I would happily work on this once the consensus of functionality for fit and transform is reached. And I suppose partial fit is not supported for the Imputer as of now. Please let me know if I can revert the changes or if the commits can be reverted. Thanks.

jnothman · 2016-08-30T09:47:02Z

It's not blocking the release. It's just distressing me a little that I merged it and can't fix it quickly enough :)

)" This reverts commit 18396be as it was merged when incomplete.

This reverts commit 18396be as it was merged when incomplete.

amueller · 2016-08-30T13:47:51Z

Should we reopen?

jnothman · 2016-08-30T23:48:14Z

I think rebasing and renaming #6782 is more appropriate.

)" (scikit-learn#7292) This reverts commit 18396be as it was merged when incomplete.

maniteja123 mentioned this pull request Mar 30, 2016

add indicator features in imputer #6556

Closed

maniteja123 changed the title ~~Add option to inidcate imputed features~~ [WIP] Add option to inidcate imputed features Mar 30, 2016

raghavrv reviewed Mar 30, 2016
View reviewed changes

maniteja123 force-pushed the imputer_indicator branch from ebb9384 to 3507580 Compare March 31, 2016 01:22

jnothman reviewed Mar 31, 2016
View reviewed changes

maniteja123 changed the title ~~[WIP] Add option to inidcate imputed features~~ [WIP] Add add_missing_indicator option to show missing values in the output Mar 31, 2016

jnothman reviewed Apr 3, 2016
View reviewed changes

jnothman changed the title ~~[MRG] Add add_missing_indicator option to show missing values in the output~~ [MRG+1] Add add_missing_indicator option to show missing values in the output Apr 13, 2016

maniteja123 changed the title ~~[MRG+1] Add add_missing_indicator option to show missing values in the output~~ [MRG+1] Add add_indicator_features option to show missing values in the output Apr 13, 2016

maniteja123 force-pushed the imputer_indicator branch from bb94859 to 74108e3 Compare April 13, 2016 13:01

ENH: Add indicator features to imputer output

18396be

maniteja123 force-pushed the imputer_indicator branch from 74108e3 to 18396be Compare April 14, 2016 06:51

jnothman merged commit d837998 into scikit-learn:master Apr 14, 2016

jnothman added a commit to jnothman/scikit-learn that referenced this pull request Aug 30, 2016

Revert "ENH: Add indicator features to imputer output (scikit-learn#6607

9f9253b

)" This reverts commit 18396be as it was merged when incomplete.

jnothman added a commit that referenced this pull request Aug 30, 2016

Revert "ENH: Add indicator features to imputer output (#6607)" (#7292)

0a0bf24

This reverts commit 18396be as it was merged when incomplete.

MechCoder mentioned this pull request Aug 30, 2016

Common test to check that estimator state does not change at transform/predict/predict_proba time #7297

Closed

TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016

Revert "ENH: Add indicator features to imputer output (scikit-learn#6607

f13015c

)" (scikit-learn#7292) This reverts commit 18396be as it was merged when incomplete.

paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

Revert "ENH: Add indicator features to imputer output (scikit-learn#6607

ce1ae59

)" (scikit-learn#7292) This reverts commit 18396be as it was merged when incomplete.

Uh oh!

Conversation

maniteja123 commented Mar 30, 2016

Uh oh!

raghavrv commented Mar 30, 2016

Uh oh!

jnothman commented Mar 30, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

maniteja123 commented Mar 31, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jnothman commented Mar 31, 2016

Uh oh!

jnothman commented Mar 31, 2016

Uh oh!

maniteja123 commented Mar 31, 2016

Uh oh!

maniteja123 commented Apr 2, 2016

Uh oh!

jnothman commented Apr 3, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jnothman commented Apr 3, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jnothman commented Apr 13, 2016

Uh oh!

maniteja123 commented Apr 13, 2016

Uh oh!

jnothman commented Apr 13, 2016

Uh oh!

maniteja123 commented Apr 13, 2016

Uh oh!

MechCoder commented Apr 13, 2016

Uh oh!

maniteja123 commented Apr 13, 2016

Uh oh!

MechCoder commented Apr 13, 2016

Uh oh!

maniteja123 commented Apr 13, 2016

Uh oh!

maniteja123 commented Apr 14, 2016

Uh oh!

jnothman commented Apr 14, 2016

Uh oh!

MechCoder commented Apr 14, 2016

Uh oh!

raghavrv commented Apr 14, 2016

Uh oh!

jnothman commented Aug 30, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MechCoder commented Aug 30, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

GaelVaroquaux commented Aug 30, 2016

Uh oh!

maniteja123 commented Aug 30, 2016

Uh oh!

jnothman commented Aug 30, 2016

Uh oh!

amueller commented Aug 30, 2016

Uh oh!

jnothman commented Aug 30, 2016

Uh oh!

Reviewers

Assignees

jnothman commented Aug 30, 2016 •

edited

Loading

MechCoder commented Aug 30, 2016 •

edited

Loading