ENH Perform KNN imputation without O(n^2) memory cost #16397
thomasjpfan merged 6 commits into scikit-learn:master
Conversation
Fixes scikit-learn#15604

This is more computationally expensive than the previous implementation, but should reduce memory costs substantially in common use cases.
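The core idea is to compute the pairwise nan-Euclidean distances in chunks of rows rather than materialising the full n x n matrix at once. A minimal sketch of that chunking idea (not the actual patch, which lives in `KNNImputer`) using scikit-learn's `pairwise_distances_chunked`:

```python
# Sketch of the chunking idea behind this PR (not the actual patch):
# instead of building the full n x n distance matrix at once, iterate
# over chunks of rows, so peak memory is O(chunk_rows * n).
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

rng = np.random.RandomState(0)
X = rng.rand(1000, 8)
X[rng.rand(1000, 8) < 0.25] = np.nan  # ~25% of values missing

n_neighbors = 5
neighbor_idx = []
for chunk in pairwise_distances_chunked(X, metric="nan_euclidean"):
    # indices of the n_neighbors smallest distances for each row in
    # this chunk (note: each row's nearest neighbour is itself, at
    # distance 0 -- the real imputer handles that case explicitly)
    idx = np.argpartition(chunk, n_neighbors, axis=1)[:, :n_neighbors]
    neighbor_idx.append(idx)
neighbor_idx = np.vstack(neighbor_idx)
print(neighbor_idx.shape)  # (1000, 5)
```

The generator yields each chunk of the distance matrix sized to fit within a `working_memory` budget, so only one chunk is ever held in memory at a time.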
|
the uncovered lines are uncovered in master... |
|
When will this be available? How could I try this new function before merging? |
You can pull this branch into your local working copy... Or try |
|
By running a code snippet similar to #15604 (comment):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

calhousing = fetch_california_housing()
X = pd.DataFrame(calhousing.data, columns=calhousing.feature_names)
y = pd.Series(calhousing.target, name='house_value')

rng = np.random.RandomState(42)
density = 4  # one in 4 values will be NaN
mask = rng.randint(density, size=X.shape) == 0
X_na = X.copy()
X_na.values[mask] = np.nan
X_na = StandardScaler().fit_transform(X_na)

knn = KNNImputer()
```

This PR:

```python
%%memit
knn.fit_transform(X_na)
# peak memory: 3468.01 MiB, increment: 3345.06 MiB
```

Master:

```python
%%memit
knn.fit_transform(X_na)
# peak memory: 6371.18 MiB, increment: 6245.66 MiB
```
|
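For anyone wanting to reproduce a measurement like this without IPython's `%%memit`, a rough peak-memory check can be done with the standard-library `tracemalloc` (recent NumPy versions report buffer allocations to it). A smaller random matrix is used here so it runs quickly; the absolute numbers will not match the ones above.

```python
# A dependency-free alternative to %%memit using the standard-library
# tracemalloc. A small random matrix is used so this runs quickly; the
# numbers will NOT match the California-housing benchmark above.
import tracemalloc
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.RandomState(42)
X = rng.rand(2000, 8)
X[rng.randint(4, size=X.shape) == 0] = np.nan  # one in four values NaN

tracemalloc.start()
KNNImputer().fit_transform(X)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak traced memory: {peak / 2**20:.1f} MiB")
```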
I was facing the same memory error and the imputer kept crashing. Thanks to this PR and sharing the link to pull this into my local copy, I was able to move forward in my project. |
glemaitre
left a comment
There was a problem hiding this comment.
LGTM. I am just not sure about the warning: would it be fine to filter it, to make it obvious that we are expecting it in this test?
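For reference, making an expected warning explicit in a test usually looks like this with `pytest.warns` (the function and warning message below are made up for illustration, not taken from the scikit-learn test suite):

```python
# Illustration of asserting an expected warning with pytest.warns
# (hypothetical function and message, not from scikit-learn's tests).
import warnings
import pytest

def func_that_warns():
    warnings.warn("some rows are all-NaN", UserWarning)
    return 42

def test_expected_warning():
    # fails if the warning is NOT raised, and keeps it out of the
    # captured-warnings noise for the rest of the test
    with pytest.warns(UserWarning, match="all-NaN"):
        assert func_that_warns() == 42

test_expected_warning()
```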
|
Hmm, though the codecov error is weird: https://codecov.io/gh/scikit-learn/scikit-learn/compare/0c4252cc52ccb4f150e2e7564f40ff8af83f47cc...a6af8242801d6aeef3ba61c0021fce2556579609/diff#D3-272 |
|
Oh these lines were not covered by the test before as well. So still LGTM. |
|
@thomasjpfan In which case will we reach this part of the code? |
|
@jnothman Do you want me to push the small changes, if you have limited time? They are only nitpicks which I am able to do :) |
|
LGTM. @thomasjpfan do you want to have a final look? I think this is good to be merged. |
|
Thanks for the reviews!
|
* FIX ensure object array are properly casted when dtype=object (#16076)
* DOC Docstring example of classifier should import classifier (#16430)
* MNT Update nightly build URL and release staging config (#16435)
* BUG ensure that estimator_name is properly stored in the ROC display (#16500)
* BUG ensure that name is properly stored in the precision/recall display (#16505)
* ENH Perform KNN imputation without O(n^2) memory cost (#16397)
* bump scikit-learn version for binder
* bump version to 0.22.2
* MNT Skips failing SpectralCoclustering doctest (#16232)
* TST Updates test for deprecation in pandas.SparseArray (#16040)
* move 0.22.2 what's new entries (#16586)
* add 0.22.2 in the news of the web site frontpage
* skip test_ard_accuracy_on_easy_problem

Co-authored-by: alexshacked <al.shacked@gmail.com>
Co-authored-by: Oleksandr Pavlyk <oleksandr-pavlyk@users.noreply.github.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
Sorry for duplicating your effort, @thomasjpfan, if you had already attempted this.
The KNNImputer is a pretty difficult piece of code to work with.