[MRG] IterativeImputer: n_iter -> max_iter #13061
jnothman merged 22 commits into scikit-learn:iterativeimputer
Conversation
Note the new test: I've pushed a proposed fix, but I don't know if it's statistically valid. Opinion appreciated.
jnothman left a comment:
So this early stopping criterion avoids dangerous chaos, but does not avoid unnecessary fits and possible minor perturbations when, say, the result has converged but is not about to blow out.
sklearn/impute.py (outdated):

```python
                          'reached. Using result of round %d' % i_rnd)
                    break
            else:
                Xt_previous = np.copy(Xt)
```

I don't get why we're copying in both the if and the else here. I think maybe this one is needed if Xt keeps being updated, but the other is not.

You're right. We can do this in the if branch: `Xt = Xt_previous`.
sklearn/impute.py (outdated):

```python
            # stop early if difference between consecutive imputations goes up.
            # if so, back off to previous imputation
            if not self.sample_posterior:
                norm_diff = np.linalg.norm(Xt - Xt_previous) \
```

Use parentheses rather than \ for line continuation.

Is this an obvious definition of "the difference between imputations"?

It's not obvious, but it's documented here: https://academic.oup.com/bioinformatics/article/28/1/112/219101
I'll add it in a comment.
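For concreteness, a convergence statistic in the spirit of the missForest criterion referenced above can be sketched like this. `imputation_change` is a hypothetical helper for illustration, not the code in this PR, and the paper's exact definition differs in details (it normalizes per variable type):

```python
import numpy as np

def imputation_change(Xt, Xt_previous):
    """Squared change between consecutive imputation rounds, normalized
    by the magnitude of the current imputation. Hypothetical helper
    loosely following the missForest-style criterion."""
    return np.sum((Xt - Xt_previous) ** 2) / np.sum(Xt ** 2)

# toy example: one imputed cell moved from 4.0 to 4.5 between rounds
Xt_prev = np.array([[1.0, 2.0], [3.0, 4.0]])
Xt = np.array([[1.0, 2.0], [3.0, 4.5]])
change = imputation_change(Xt, Xt_prev)  # 0.25 / 34.25
```

In missForest the loop stops as soon as this statistic increases between rounds, which is what removes the need for an explicit `tol`.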
```python
            imputed_values[~good_sigmas] = mus[~good_sigmas]
            mus = mus[good_sigmas]
            sigmas = sigmas[good_sigmas]
            # two types of problems: (1) non-positive sigmas, (2) mus outside
```

@jnothman @benlawson This is my proposed solution to the problem of having mus outside of the legal min_value and max_value. However, after implementing this, the test test_iterative_imputer_truncated_normal_posterior now fails. So something is off here...

OK, what's happening is that the first prediction via `mus, sigmas = predictor.predict(X_test, return_std=True)` is outside of the min/max, so it never gets to the truncated draw, and thus the imputations from `imputations = np.array([imputer.transform(X)[0][0] for _ in range(1000)])` are all 0.5.

I changed the random seed, which allows the first draw to be between the min/max bounds and thus results in the test passing. I also turned the 1000 draws down to 100, as it seemed unnecessarily slow.
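As a sketch of the truncated draw being discussed (toy values, not the PR's exact code): scipy's `truncnorm` expects its bounds standardized by `loc`/`scale`, which is an easy place to get wrong:

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.RandomState(0)
mus = np.array([0.2, 0.8])      # posterior means (toy values)
sigmas = np.array([0.1, 0.3])   # posterior stds (toy values)
min_value, max_value = 0.0, 1.0

# truncnorm's a/b are expressed in standard-deviation units around loc
a = (min_value - mus) / sigmas
b = (max_value - mus) / sigmas
draws = truncnorm(a=a, b=b, loc=mus, scale=sigmas).rvs(random_state=rng)
```

Every draw respects the bounds by construction, which is why the out-of-range `mus` described above need special handling before this point.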
I agree.
One argument against this early stopping criterion is that it broke an existing test. But now, on my machine, early stopping happens after the 2nd iteration, and you get a result which is clearly wrong based on the training data. OK, I'm open to considering a `tol`.
Do you mind if we do some benchmarks? E.g. take Boston, insert some MCAR missingness, and plot the convergence statistic over the number of iterations? Could you show both the L2 and L∞ norms?
jnothman left a comment:
Do you think we should issue a ConvergenceWarning if we reach max_iter, or is it not really important for imputation?
I hope to give this a bigger look later/tomorrow.
sklearn/impute.py:

```python
        the imputation. Length is ``self.n_features_with_missing_ * max_iter``.

    n_features_with_missing_ : int
        Number of features with missing values.
```

This should be another list item rather than a note in the above attribute description.
sklearn/impute.py (outdated):

```diff
     ``neighbor_feat_idx`` is the array of other features used to impute the
     current feature, and ``predictor`` is the trained predictor used for
-    the imputation. Length is ``self.n_features_with_missing_ * n_iter``.
+    the imputation. Length is ``self.n_features_with_missing_ * max_iter``.
```
So that's a problem of stopping too early rather than too late! Hmm... If it's stopping too early, that basically means it needs more burn-in for the imputed values to find some equilibrium? So maybe what you want is not tol, but something more like n_iter_no_decrease.
Haha, now you've got me convinced about `tol`. I'll try that for now - it's what you had originally. Setting `tol` to a default of 1e-3 fixes the failing test.
Re: benchmarks. I'll do some tomorrow (it's 9pm here). Re: convergence warning. Can't hurt. I'll find an example and stick it in here.
Might want to have that max (which is an infinity norm, so you can maybe use the scipy norm implementation) divided by the norm of the non-missing values of the original X, to be invariant to scale.
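A sketch of that scaling idea on toy data (variable names are illustrative, not necessarily what the PR ends up using):

```python
import numpy as np

X = np.array([[10.0, 2.0], [3.0, 40.0]])
mask_missing_values = np.array([[False, True], [False, False]])

Xt_previous = X.copy()
Xt = X.copy()
Xt[0, 1] += 0.01  # pretend this round nudged one imputed cell

tol = 1e-3
# scale tol by the largest observed magnitude so the rule is
# invariant to the scale of X
normalized_tol = tol * np.max(np.abs(X[~mask_missing_values]))  # 1e-3 * 40
# only one cell changed, so the norm of the change is 0.01
inf_norm = np.linalg.norm(Xt - Xt_previous, ord=np.inf, axis=None)
converged = inf_norm < normalized_tol
```

Without the normalization, the same `tol` would behave very differently on data measured in, say, meters versus millimeters.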
jnothman left a comment:

It would be nice if the max_iter change here was pulled out from the rest?
sklearn/tests/test_impute.py (outdated):

```python
                               verbose=1,
                               random_state=rng)
    X_filled_100 = imputer.fit_transform(X_missing)
    assert(len(imputer.imputation_sequence_) == d * 5)
```

So that the code is clear, you can check n_iter_ directly. Can you also check that tol=0 results in n_iter_ == max_iter?

And remove the outer brackets.
sklearn/impute.py (outdated):

```diff
-        the imputation. Length is ``self.n_features_with_missing_ * n_iter``.
+        the imputation. Length is ``self.n_features_with_missing_ *
+        self.n_iter_``. ``self.n_iter_`` is the number of iteration rounds that
+        actually occurred, taking early stopping into account.
```
sklearn/impute.py (outdated):

```python
        Maximum number of imputation rounds to perform before returning the
        imputations computed during the final round. A round is a single
        imputation of each feature with missing values. The stopping criterion
        is met once abs(max(X_i - X_{i-1})) < tol. Note that early stopping is
```

Put backticks around the formulation of the stopping criterion.
sklearn/impute.py (outdated):

```python
            mus_too_high = mus > self._max_value
            imputed_values[mus_too_high] = self._max_value
            # the rest can be sampled without statistical issues
            sample_flag = positive_sigmas & ~mus_too_low & ~mus_too_high
```

I would use the term mask instead of flag. Maybe inrange_mask.
sklearn/impute.py (outdated):

```python
        self.n_iter_ = self.max_iter
        if not self.sample_posterior:
            Xt_previous = np.copy(Xt)
        for i_rnd in range(self.max_iter):
```

I would prefer a while loop:

```python
while criterion > self.tol and self.n_iter_ < self.max_iter:
```

The reasoning is that we can already see that we will have an early stopping mechanism.

You also need `if not self.sample_posterior:`, so it gets messier. I would prefer to leave it as is.

The idiom I have come to enjoy is:

```python
for self.n_iter_ in range(1, self.max_iter + 1):
    # update model
    if stopping_condition_met:
        break
else:
    warnings.warn(...)
```

(We have something similar in label_propagation.py, but there the stopping condition applies at the start of the loop.)
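A runnable sketch of that for/else idiom, with a toy update and stopping test standing in for the imputer's real ones:

```python
import warnings

MAX_ITER, TOL = 100, 1e-3

class Toy:
    """Stand-in estimator demonstrating the for/else idiom."""
    def fit(self):
        change = 1.0
        # the loop variable itself records how many rounds actually ran
        for self.n_iter_ in range(1, MAX_ITER + 1):
            change *= 0.5          # "update model" stand-in
            if change < TOL:       # stopping condition met
                break
        else:
            # runs only if the loop never broke, i.e. no convergence
            warnings.warn("did not converge")
        return self

toy = Toy().fit()
# 0.5 ** 10 is the first value below 1e-3, so toy.n_iter_ == 10
```

The `else` clause fires only when the loop exhausts `range` without `break`, which is exactly the "reached max_iter without converging" case where a ConvergenceWarning would be appropriate.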
sklearn/impute.py (outdated):

```diff
-        for i_rnd in range(self.n_iter):
+        self.n_iter_ = self.max_iter
+        if not self.sample_posterior:
+            Xt_previous = np.copy(Xt)
```
sklearn/impute.py (outdated):

```python
                    print('[IterativeImputer] Early stopping criterion '
                          'reached.')
                break
        else:
```

I don't think that's true. I need to specify that `Xt_previous = Xt.copy()`, otherwise the subsequent diff will be 0. Try it out.

What I mean is:

```python
if not self.sample_posterior:
    inf_norm = np.linalg.norm(Xt - Xt_previous, ord=np.inf,
                              axis=None)
    if inf_norm < normalized_tol:
        self.n_iter_ = i_rnd + 1
        if self.verbose > 0:
            print('[IterativeImputer] Early stopping criterion '
                  'reached.')
        break
    Xt_previous = Xt.copy()
```

You don't need `else` since you will break in the `if` branch.

Oh! I thought you meant that the entire else branch was not needed. Yup, you're right. I'll fix it.
sklearn/impute.py (outdated):

```diff
         return Xt

-        imputations_per_round = len(self.imputation_sequence_) // self.n_iter
+        imps_per_round = len(self.imputation_sequence_) // self.n_iter_
```

Keep the full name; it is more explicit. Split the line into two lines.
sklearn/tests/test_impute.py (outdated):

```python
                               verbose=1,
                               random_state=rng)
    X_filled_5 = imputer.fit_transform(X_missing)
    assert_allclose(X_filled_100, X_filled_5, atol=1e-7)
```

You can add an assert on n_iter_. The reason is that we were missing a (+1) in some estimators when reporting the number of iterations; it could be quite useful.

Sorry, what does this mean? I added this:

```python
imputer = IterativeImputer(max_iter=100,
                           tol=0,
                           sample_posterior=False,
                           verbose=1,
                           random_state=rng)
imputer.fit(X_missing)
assert imputer.max_iter == imputer.n_iter_
```

Yes, that is what I meant. Reverse the assert order; usually the expected value comes last:

```python
assert imputer.n_iter_ == imputer.max_iter
```
@jnothman You asked for some convergence plots. I'm plotting the inf norm and the l2 norm of the difference between consecutive imputations. Here are two plots (images omitted). Convergence is quite rapid in both cases, but there are minor improvements in the case of California, while Boston stays flat. The code that generates these is below, but you can't run it because I had to define custom class variables.
I don't think it's so weird with sample_posterior... it's maybe even converging on a reasonably stable distribution.
Great. I'll say this is ready for MRG reviews then.
jnothman left a comment:
This looks very good otherwise!
sklearn/impute.py (outdated):

```python
        Maximum number of imputation rounds to perform before returning the
        imputations computed during the final round. A round is a single
        imputation of each feature with missing values. The stopping criterion
        is met once `abs(max(X_i - X_{i-1}))/abs(max(X[known_vals]))` < tol.
```

You might need to clarify that X_i is X at iteration i. Might it be better to use t to index time?
@jnothman I've addressed your two comments. @glemaitre perhaps you'll take a second look?
I thought about that pattern and gave up.
glemaitre left a comment:

All good. Just three little tests for the coverage.
sklearn/impute.py:

```diff
-        if self.n_iter == 0:
+        if self.max_iter == 0:
             return Xt
```

This line is not covered by a test (as previously). Could you add this case to the unit tests?
```python
                                          axis=None)
                if inf_norm < normalized_tol:
                    if self.verbose > 0:
                        print('[IterativeImputer] Early stopping criterion '
```

We also do not test the printing. Turning verbose to 1 could be enough for a couple of tests; we will ensure that printing does not trigger an error.
sklearn/impute.py (outdated):

```diff
         X, Xt, mask_missing_values = self._initial_imputation(X)

-        if self.n_iter == 0:
+        if self.n_iter_ == 0:
```

This is also not tested. Could we force n_iter_ and check that we have X == Xt?
@glemaitre I've added some tests to cover the cases you mentioned.
That's the error:
Weird, this passed locally. Will investigate...

On Fri, Feb 8, 2019, 10:36 AM Guillaume Lemaitre wrote:
> That's the error:
```
================================== FAILURES ===================================
_______________________ test_iterative_imputer_verbose ________________________

    def test_iterative_imputer_verbose():
        n = 10
        d = 3
        X = sparse_random_matrix(n, d, density=0.10).toarray()
        imputer = IterativeImputer(missing_values=0, max_iter=1, verbose=1)
>       imputer.fit(X)

X          = array([[0., 0., 0.],
                    [0., 0., 0.],
                    [0., 0., 0.],
                    [0., 0., 0.],
                    [0., 0., 0.],
                    [0., 0., 0.],
                    [0., 0., 0.],
                    [0., 0., 0.],
                    [0., 0., 0.],
                    [0., 0., 0.]])
d          = 3
imputer    = IterativeImputer(imputation_order='ascending', initial_strategy='mean',
                 max_iter=1, max_value=None, m...earest_features=None,
                 predictor=None, random_state=None,
                 sample_posterior=False, tol=0.001, verbose=1)
n          = 10

c:\python37-x64\lib\site-packages\sklearn\tests\test_impute.py:542:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
c:\python37-x64\lib\site-packages\sklearn\impute.py:1032: in fit
    self.fit_transform(X)
c:\python37-x64\lib\site-packages\sklearn\impute.py:930: in fit_transform
    normalized_tol = self.tol * np.max(np.abs(X[~mask_missing_values]))
c:\python37-x64\lib\site-packages\numpy\core\fromnumeric.py:2505: in amax
    initial=initial)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

obj = array([], dtype=float64), ufunc = <ufunc 'maximum'>, method = 'max'
axis = None, dtype = None, out = None
kwargs = {'initial': <no value>, 'keepdims': <no value>}, passkwargs = {}

    def _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs):
        passkwargs = {k: v for k, v in kwargs.items()
                      if v is not np._NoValue}

        if type(obj) is not mu.ndarray:
            try:
                reduction = getattr(obj, method)
            except AttributeError:
                pass
            else:
                # This branch is needed for reductions like any which don't
                # support a dtype.
                if dtype is not None:
                    return reduction(axis=axis, dtype=dtype, out=out, **passkwargs)
                else:
                    return reduction(axis=axis, out=out, **passkwargs)

>       return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
E       ValueError: zero-size array to reduction operation maximum which has no identity

axis       = None
dtype      = None
kwargs     = {'initial': <no value>, 'keepdims': <no value>}
method     = 'max'
obj        = array([], dtype=float64)
out        = None
passkwargs = {}
ufunc      = <ufunc 'maximum'>
```
Fascinating - it accidentally made a test matrix that had no known values. I guess I should add a check for that and return without doing anything, because you can't make an inference if you have 0 information.
@glemaitre I fixed that issue and made a few more very small changes to prevent other problems. Plus one more test.
Co-Authored-By: sergeyf <sergeyfeldman@gmail.com>
Co-Authored-By: sergeyf <sergeyfeldman@gmail.com>
Thanks, applied.
@jnothman @glemaitre I renamed it as suggested.
glemaitre left a comment:

Only one nitpick, otherwise LGTM.
sklearn/tests/test_impute.py:

```diff
     d = 3
-    X = sparse_random_matrix(n, d, density=0.10).toarray()
+    X = sparse_random_matrix(n, d, density=0.10, random_state=rng).toarray()
     imputer = IterativeImputer(missing_values=0, max_iter=1, verbose=1)
```

Could you put verbose at 2? I think we have a case where we want verbose > 1.

I did this two lines down (line 545):

```python
imputer.verbose = 2
imputer.transform(X)
```

Never mind, I'll just do it twice.
@glemaitre Ready to merge!
Thanks once more @sergeyf
This PR addresses a number of discussions, most recently in #11977.

The main purpose of this PR is to provide automatic early stopping. We are going to be using the early stopping rule that is used in the missForest R package. This is a sensible criterion, and it has the added benefit of not needing to specify a tol. Note that this criterion is not applied when sample_posterior=True because there is no steady state.

This PR does a few more things:

- Fixes sample_posterior=True by using truncnorm to sample from the posterior.