ENH Perform KNN imputation without O(n^2) memory cost #16397
thomasjpfan merged 6 commits into scikit-learn:master
Conversation
Fixes scikit-learn#15604

This is more computationally expensive than the previous implementation, but should reduce memory costs substantially in common use cases.
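The core idea is to compute the pairwise nan-Euclidean distances in chunks of rows rather than materialising the full n x n matrix at once. A minimal sketch of that chunking idea (not the actual patch, which lives in `KNNImputer`) using scikit-learn's `pairwise_distances_chunked`:

```python
# Sketch of the chunking idea behind this PR (not the actual patch):
# instead of building the full n x n distance matrix at once, iterate
# over chunks of rows, so peak memory is O(chunk_rows * n).
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

rng = np.random.RandomState(0)
X = rng.rand(1000, 8)
X[rng.rand(1000, 8) < 0.25] = np.nan  # ~25% of values missing

n_neighbors = 5
neighbor_idx = []
for chunk in pairwise_distances_chunked(X, metric="nan_euclidean"):
    # indices of the n_neighbors smallest distances for each row in
    # this chunk (note: each row's nearest neighbour is itself, at
    # distance 0 -- the real imputer handles that case explicitly)
    idx = np.argpartition(chunk, n_neighbors, axis=1)[:, :n_neighbors]
    neighbor_idx.append(idx)
neighbor_idx = np.vstack(neighbor_idx)
print(neighbor_idx.shape)  # (1000, 5)
```

The generator yields each chunk of the distance matrix sized to fit within a `working_memory` budget, so only one chunk is ever held in memory at a time.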
|
the uncovered lines are uncovered in master... |
|
When will this be available? How could I try this new function before merging? |
You can pull this branch into your local working copy... Or try |
|
By running a code snippet similar to #15604 (comment):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

calhousing = fetch_california_housing()
X = pd.DataFrame(calhousing.data, columns=calhousing.feature_names)
y = pd.Series(calhousing.target, name='house_value')

rng = np.random.RandomState(42)
density = 4  # one in 4 values will be NaN
mask = rng.randint(density, size=X.shape) == 0
X_na = X.copy()
X_na.values[mask] = np.nan
X_na = StandardScaler().fit_transform(X_na)

knn = KNNImputer()
```

This PR:

```python
%%memit
knn.fit_transform(X_na)
# peak memory: 3468.01 MiB, increment: 3345.06 MiB
```

Master:

```python
%%memit
knn.fit_transform(X_na)
# peak memory: 6371.18 MiB, increment: 6245.66 MiB
```
|
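For anyone wanting to reproduce a measurement like this without IPython's `%%memit`, a rough peak-memory check can be done with the standard-library `tracemalloc` (recent NumPy versions report buffer allocations to it). A smaller random matrix is used here so it runs quickly; the absolute numbers will not match the ones above.

```python
# A dependency-free alternative to %%memit using the standard-library
# tracemalloc. A small random matrix is used so this runs quickly; the
# numbers will NOT match the California-housing benchmark above.
import tracemalloc
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.RandomState(42)
X = rng.rand(2000, 8)
X[rng.randint(4, size=X.shape) == 0] = np.nan  # one in four values NaN

tracemalloc.start()
KNNImputer().fit_transform(X)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak traced memory: {peak / 2**20:.1f} MiB")
```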
I was facing the same memory error and the imputer kept crashing. Thanks to this PR and sharing the link to pull this into my local copy, I was able to move forward in my project. |
glemaitre
left a comment
There was a problem hiding this comment.
LGTM. I am just not sure about the warning: would it be fine to filter it, to make it obvious that we are expecting it in this test?
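For reference, making an expected warning explicit in a test usually looks like this with `pytest.warns` (the function and warning message below are made up for illustration, not taken from the scikit-learn test suite):

```python
# Illustration of asserting an expected warning with pytest.warns
# (hypothetical function and message, not from scikit-learn's tests).
import warnings
import pytest

def func_that_warns():
    warnings.warn("some rows are all-NaN", UserWarning)
    return 42

def test_expected_warning():
    # fails if the warning is NOT raised, and keeps it out of the
    # captured-warnings noise for the rest of the test
    with pytest.warns(UserWarning, match="all-NaN"):
        assert func_that_warns() == 42

test_expected_warning()
```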
|
Hmm, though the codecov error is weird: https://codecov.io/gh/scikit-learn/scikit-learn/compare/0c4252cc52ccb4f150e2e7564f40ff8af83f47cc...a6af8242801d6aeef3ba61c0021fce2556579609/diff#D3-272 |
|
Oh these lines were not covered by the test before as well. So still LGTM. |
|
@thomasjpfan In which case will we reach this part of the code? |
|
@jnothman Do you want me to push the small changes, if you have limited time? They are only nitpicks which I am able to do :) |
|
LGTM. @thomasjpfan do you want to have a final look? I think this is good to be merged. |
|
Thanks for the reviews!
|
* FIX ensure object array are properly casted when dtype=object (#16076)
* DOC Docstring example of classifier should import classifier (#16430)
* MNT Update nightly build URL and release staging config (#16435)
* BUG ensure that estimator_name is properly stored in the ROC display (#16500)
* BUG ensure that name is properly stored in the precision/recall display (#16505)
* ENH Perform KNN imputation without O(n^2) memory cost (#16397)
* bump scikit-learn version for binder
* bump version to 0.22.2
* MNT Skips failing SpectralCoclustering doctest (#16232)
* TST Updates test for deprecation in pandas.SparseArray (#16040)
* move 0.22.2 what's new entries (#16586)
* add 0.22.2 in the news of the web site frontpage
* skip test_ard_accuracy_on_easy_problem

Co-authored-by: alexshacked <al.shacked@gmail.com>
Co-authored-by: Oleksandr Pavlyk <oleksandr-pavlyk@users.noreply.github.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
Sorry for duplicating your effort, @thomasjpfan, if you had already attempted this.
The KNNImputer is a pretty difficult piece of code to work with.