Skip to content

IterativeImputer does not behave as expected when using fit() and transform() on a train/test split versus using fit_transform() on an entire dataset #14383

@LuxMiranda

Description

@LuxMiranda

Description

IterativeImputer seems to give pretty rubbish values when using the two-step fit() and transform() on a train/test data split versus simply using fit_transform() on the entire dataset. Issue #14338 may be related.

Steps/Code to Reproduce

Here's a fairly minimal example:

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
import numpy as np

# x_0 = x_1 + x_2 + x_3
train = np.array([
    [ 5, 2, 2, 1],
    [10, 1, 2, 7],
    [ 3, 1, 1, 1],
    [ 8, 4, 2, 2]
])

test = np.array([
    [np.nan, 2, 4, 5],
    [np.nan, 4, 1, 2],
    [np.nan, 1, 10, 1]
])

y_actual = np.array([11, 7, 12])

## Prediction using IterativeImputer and  the 1-step fit_transform()
imputer     = IterativeImputer(estimator = LinearRegression())
fullData    = np.concatenate([train,test])
imputedData = imputer.fit_transform(fullData)
y_1imputed  = imputedData[-3:,0]

## Prediction using IterativeImputer and the 2-step fit() / transform()
imputer2     = IterativeImputer(estimator = LinearRegression())
imputer2     = imputer2.fit(train)
imputedTest  = imputer2.transform(test)
y_2imputed   = imputedTest[:,0]

print('Actual y:             {}'.format(y_actual))
print('1-step Imputed y:     {}'.format(y_1imputed))
print('2-step Imputed y:     {}'.format(y_2imputed))

Expected Results

Actual y:             [11  7 12]
1-step Imputed y:     [11.  7. 12.]
2-step Imputed y:     [11.  7. 12.]

Actual Results

Actual y:             [11  7 12]
1-step Imputed y:     [11.  7. 12.]
2-step Imputed y:     [6.5 6.5 6.5]

Versions

System:
    python: 3.7.2 (default, Dec 29 2018, 06:19:36)  [GCC 7.3.0]
executable: /home/jack/miniconda3/bin/python
   machine: Linux-4.15.0-54-generic-x86_64-with-debian-buster-sid

BLAS:
    macros: NO_ATLAS_INFO=1, HAVE_CBLAS=None
  lib_dirs: /usr/lib/x86_64-linux-gnu
cblas_libs: cblas

Python deps:
       pip: 19.1.1
setuptools: 40.2.0
   sklearn: 0.22.dev0
     numpy: 1.15.3
     scipy: 1.1.0
    Cython: 0.29.12
    pandas: 0.24.2
matplotlib: 3.0.2
    joblib: 0.13.2

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions