ENH Uses binned values from training to find missing values#16883
Conversation
|
The proposed changes seem equivalent (potentially slower?)
Correct. When there are no missing values, we only do one scan (left to right).
How? |
Ah yes, I was mistaken, this PR is not strictly needed.
Although this change is faster: from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper
import numpy as np
rng = np.random.RandomState(42)
X = rng.rand(100_000, 200)
mask = rng.binomial(1, 0.1, size=X.shape).astype(np.bool)
X[mask] = np.nan
mapper = _BinMapper()
X_trans = mapper.fit_transform(X)%timeit (X_trans == mapper.missing_values_bin_idx_).any(axis=0).astype(np.uint8)
# 2.31 ms ± 69.6 µs per loop
%timeit np.isnan(X).any(axis=0).astype(np.uint8)
# 14.4 ms ± 394 µs per loopSecretly, I am using this change for categorical support and this PR may be an improvement (at least in terms of speed). |
NicolasHug
left a comment
There was a problem hiding this comment.
Oh OK nice, thanks for the quick benchmark.
I guess we can merge without a +2?
Looks okay to me. |
|
Reverted this from master in c9ca1d7 since it broke master |
|
It was broken because of #13022 which does not work for new versions of sphinx. |
|
Will be fixed with #17724 |
|
Ah okay I see. |
…earn#16883) * ENH Uses training data to find missing values * CLN Uses the binned data to find missing value
…cikit-learn#16883)" This reverts commit e5cc2b0.
…earn#16883) * ENH Uses training data to find missing values * CLN Uses the binned data to find missing value
…cikit-learn#16883)" This reverts commit e5cc2b0.
From my understanding: If the training data,
X_binned_train, has no missing values, we would not need to do the right to left sweep when splitting. Secondly, this seems leaks data from the validation set,X_binned_val, into training.@NicolasHug What do you think?