-
-
Notifications
You must be signed in to change notification settings - Fork 26.9k
ARD Regressor accuracy degrades when upgrading Scipy 1.2.1 -> 1.3.0 #14055
Description
Hi,
bit of a tricky one, I'm hoping someone will have some time and/or suggestions for further investigation!
There seems to be an often-occurring worsening of performance (i.e. accuracy, although run-time increases too!) from the ARD regressor when upgrading from Scipy 1.2.1 -> 1.3.0.
Description
On a very simple dataset (see code snippets below) where a near-perfect fit should be achievable, typical error seems to degrade from order 1E-5 to 1E-2. Notably, convergence iterations seem to increase also from ~a few (~5) to around 50-200 iterations.
Here's the headline plot, plotting absolute co-efficient error when fit across 1000 datasets generated with different random seeds:

Note how with Scipy==1.2.1, errors are largely constrained to <0.01, while with Scipy==1.3.0 they range up to 0.05 (and in a few rare cases the algorithm produces garbage results, see later).
I guess this could be (probably is?) a Scipy rather than Sklearn issue, but probably the only way to confirm / isolate that would be to start here.
It's also possible that this worsening of behaviour is a weirdness of my particular toy example, but the difference in behaviour seems large and unexpected enough to warrant further investigation, I'd hope!
Steps/Code to Reproduce
Single Seed:
OK, so here's a short snippet on just a single seed if you're curious to try this yourself. I'm generating three vectors of normally distributed values, 250 samples. Then the target is just a perfect copy of one of those vectors (index=1). We measure the accuracy of the fit by simply checking how close that coefficient is to 1.0 (the other coefficients always shrink to 0., as you'd hope):
import scipy
import sklearn
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import ARDRegression
sklearn.show_versions()
def test_ard_regressor(dataset: np.ndarray) -> float:
X = dataset
y = X[:,1]
regressor = ARDRegression(verbose=True)
regressor.fit(X, y)
abs_coef_error = np.abs(1 - regressor.coef_[1])
print(abs_coef_error)
return abs_coef_error
size=250
X = np.random.RandomState(seed=45).normal(size=(size,3))
test_ard_regressor(X)
Results
Scipy 1.2.1:
python single_seed.py
System:
python: 3.6.7 (default, Oct 22 2018, 11:32:17) [GCC 8.2.0]
executable: /home/staley/.virtualenvs/sklearn-bug-scipy-1.2.1/bin/python
machine: Linux-4.15.0-47-generic-x86_64-with-Ubuntu-18.04-bionic
BLAS:
macros: HAVE_CBLAS=None
lib_dirs: /usr/lib/x86_64-linux-gnu
cblas_libs: openblas, openblas
Python deps:
pip: 19.1.1
setuptools: 41.0.1
sklearn: 0.21.2
numpy: 1.16.4
scipy: 1.2.1
Cython: None
pandas: None
Converged after 4 iterations
9.647701516568574e-07
Scipy 1.3.0
python single_seed.py
System:
python: 3.6.7 (default, Oct 22 2018, 11:32:17) [GCC 8.2.0]
executable: /home/staley/.virtualenvs/sklearn-bug-scipy-1.3/bin/python
machine: Linux-4.15.0-47-generic-x86_64-with-Ubuntu-18.04-bionic
BLAS:
macros: HAVE_CBLAS=None
lib_dirs: /usr/lib/x86_64-linux-gnu
cblas_libs: openblas, openblas
Python deps:
pip: 19.1.1
setuptools: 41.0.1
sklearn: 0.21.2
numpy: 1.16.4
scipy: 1.3.0
Cython: None
pandas: None
Converged after 18 iterations
0.16538104739325354
Datasets from 1000 different seeds
It could be that there's some oddity of the random data from a single seed, so I set up some short scripts to first generate a static collection of 1000 of the datasets as seen above, then collate the results from both versions of scipy. The snippets are as follows:
Make data:
import numpy as np
size=250
random_datasets = {seed: np.random.RandomState(seed).normal(size=(size,3))
for seed in range(1000)}
np.savez('random_datasets.npz', data=list(random_datasets.values()), seeds=list(random_datasets.keys()))
Test sklearn:
import scipy
import sklearn
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import ARDRegression
random_datasets = np.load('random_datasets.npz')
random_datasets=dict(zip(random_datasets['seeds'], random_datasets['data']))
def test_ard_regressor(dataset: np.ndarray) -> float:
X = dataset
y = X[:,1]
regressor = ARDRegression(verbose=True)
regressor.fit(X, y)
abs_coef_error = np.abs(1 - regressor.coef_[1])
print(abs_coef_error)
return abs_coef_error
results = []
for seed, data in random_datasets.items():
print("Seed:",seed)
results.append(test_ard_regressor(data))
np.save(f'scipy_{scipy.__version__}_results', np.array(results))
Plot results:
import numpy as np
import matplotlib.pyplot as plt
results_1_2_1 = np.load("./scipy_1.2.1_results.npy")
results_1_3_0 = np.load("./scipy_1.3.0_results.npy")
counts, bin_edges = np.histogram(results_1_2_1)
ax = plt.gca()
ax.hist(results_1_2_1, label="scipy==1.2.1", alpha=0.5, bins=bin_edges)
ax.hist(results_1_3_0, label="scipy==1.3.0", alpha=0.5, bins=bin_edges)
# ax.set_xlim(0, 1.0)
ax.legend()
plt.show()
A little investigating summary statistics of those datasets in notebook gives the following points of comparison:
> np.median(results_1_2_1)
1.1909624002770514e-05
> np.median(results_1_3_0)
0.008368892887510193
>np.percentile(results_1_2_1, 99)
0.03166983391537859
>np.percentile(results_1_3_0, 99)
0.16551247976283737
> results_1_2_1.max()
0.08478086928684647
>results_1_3_0.max()
46606.5545533851