ENH Perform KNN imputation without O(n^2) memory cost #16397

Merged
thomasjpfan merged 6 commits into scikit-learn:master from jnothman:knnimpute-memory
Feb 24, 2020

Conversation

@jnothman
Member

@jnothman jnothman commented Feb 6, 2020

Fixes #15604

This is more computationally expensive than the previous implementation,
but should reduce memory costs substantially in common use cases.

Sorry for duplicating your effort, @thomasjpfan, if you had already attempted this.

The KNNImputer is a pretty difficult piece of code to work with.

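For context on how such a memory reduction is typically achieved: instead of materializing the full (n_samples, n_samples) distance matrix at once, distances can be computed one block of rows at a time and reduced immediately. A minimal sketch of that technique using scikit-learn's pairwise_distances_chunked (an illustration of the general idea, not necessarily this PR's exact code):

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.random.RandomState(0).rand(1000, 8)
n_neighbors = 5

def reduce_func(dist_chunk, start):
    # Reduce each block of distance rows to the k nearest-neighbour
    # indices, so the block can be discarded before the next one is
    # computed; column 0 of argsort is each row's self-match.
    return np.argsort(dist_chunk, axis=1)[:, 1:n_neighbors + 1]

# working_memory (in MiB) caps the size of each distance block.
gen = pairwise_distances_chunked(X, reduce_func=reduce_func, working_memory=1)
neigh_ind = np.vstack(list(gen))
print(neigh_ind.shape)  # (1000, 5)
```

Peak memory is then bounded by one block plus the reduced results, rather than by the whole n^2 matrix.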
@jnothman
Member Author

jnothman commented Feb 6, 2020

The uncovered lines are also uncovered in master...

@ajing

ajing commented Feb 6, 2020

When will this be available? How could I try this new function before merging?

@jnothman
Member Author

jnothman commented Feb 6, 2020

> When will this be available? How could I try this new function before merging?

You can pull this branch into your local working copy... Or try pip install https://github.com/jnothman/scikit-learn/archive/knnimpute-memory.zip

@thomasjpfan thomasjpfan self-requested a review February 7, 2020 03:06
@thomasjpfan
Member

Running a code snippet similar to #15604 (comment):

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler  # missing from the original snippet

calhousing = fetch_california_housing()

X = pd.DataFrame(calhousing.data, columns=calhousing.feature_names)
y = pd.Series(calhousing.target, name='house_value')

rng = np.random.RandomState(42)

density = 4  # one in four values will be NaN

mask = rng.randint(density, size=X.shape) == 0
X_na = X.copy()
X_na.values[mask] = np.nan
X_na = StandardScaler().fit_transform(X_na)

knn = KNNImputer()

This PR

%%memit
knn.fit_transform(X_na)
# peak memory: 3468.01 MiB, increment: 3345.06 MiB

Master

%%memit
knn.fit_transform(X_na)
# peak memory: 6371.18 MiB, increment: 6245.66 MiB
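(%%memit is an IPython magic from the memory_profiler package. Outside a notebook, a rough peak-memory check can be sketched with the stdlib tracemalloc module; this is a hedged example on synthetic data, and tracemalloc counts Python-level allocations rather than process RSS, so the numbers will not match the figures above.)

```python
import tracemalloc
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.RandomState(42)
X = rng.rand(2000, 8)
X[rng.randint(4, size=X.shape) == 0] = np.nan  # roughly one in four values NaN

tracemalloc.start()
KNNImputer().fit_transform(X)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak traced memory: {peak / 2**20:.1f} MiB")
```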

@impiyush

> When will this be available? How could I try this new function before merging?

> You can pull this branch into your local working copy... Or try pip install https://github.com/jnothman/scikit-learn/archive/knnimpute-memory.zip

I was facing the same memory error and the imputer kept crashing. Thanks to this PR and sharing the link to pull this into my local copy, I was able to move forward in my project.

@glemaitre glemaitre self-requested a review February 20, 2020 12:26
Member

@glemaitre glemaitre left a comment


LGTM. I am just not sure about the warning: would it be fine to filter it, to make it obvious that we expect it in this test?


@glemaitre
Member

Oh, these lines were not covered by tests before either, so still LGTM.
@thomasjpfan we should probably add a test case, if possible, to cover these lines. This is independent of this PR.

@glemaitre
Member

@thomasjpfan In which case do we reach this part of the code?

@glemaitre glemaitre self-assigned this Feb 21, 2020
@glemaitre
Member

@jnothman Do you want me to push the small changes if you have limited time? They are only nitpicks, which I can handle :)

@glemaitre
Member

LGTM. @thomasjpfan, do you want to have a final look? I think this is good to be merged.

Member

@thomasjpfan thomasjpfan left a comment


LGTM

@thomasjpfan thomasjpfan merged commit 244d118 into scikit-learn:master Feb 24, 2020
@jnothman
Member Author

jnothman commented Feb 24, 2020 via email

@jnothman jnothman added this to the 0.22.2 milestone Feb 24, 2020
jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Feb 28, 2020
ogrisel added a commit that referenced this pull request Feb 28, 2020
* FIX ensure object array are properly casted when dtype=object (#16076)

* DOC Docstring example of classifier should import classifier (#16430)

* MNT Update nightly build URL and release staging config (#16435)

* BUG ensure that estimator_name is properly stored in the ROC display (#16500)

* BUG ensure that name is properly stored in the precision/recall display (#16505)

* ENH Perform KNN imputation without O(n^2) memory cost (#16397)

* bump scikit-learn version for binder

* bump version to 0.22.2

* MNT Skips failing SpectralCoclustering doctest (#16232)

* TST Updates test for deprecation in pandas.SparseArray (#16040)

* move 0.22.2 what's new entries (#16586)

* add 0.22.2 in the news of the web site frontpage

* skip test_ard_accuracy_on_easy_problem

Co-authored-by: alexshacked <al.shacked@gmail.com>
Co-authored-by: Oleksandr Pavlyk <oleksandr-pavlyk@users.noreply.github.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020
gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
Development

Successfully merging this pull request may close these issues.

MemoryError in KNNImputer with california housing

6 participants