Closed
Labels
Bug, Metadata Routing (all issues related to metadata routing, slep006, sample props)
Description
Note: this is a special case of the wider problem described in:
Describe the bug
_log_reg_scoring_path, used within LogisticRegressionCV, does not return the same coefficients with the liblinear solver when samples are weighted via sample_weight as when samples are repeated according to their weights.
NOTE: L801 in _log_reg_scoring_path does not pass sample_weight to the scorer when no scorer is specified; this needs fixing.
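To illustrate why the scorer also needs sample_weight, here is a minimal pure-Python sketch. The helpers weighted_accuracy and repeat_by_weight are hypothetical names written for this illustration; they are not scikit-learn functions and not the code at L801. The sketch only shows that a score computed with integer weights matches the score on the repeated data, while the unweighted score generally does not.

```python
def weighted_accuracy(y_true, y_pred, sample_weight=None):
    """Fraction of correct predictions, optionally weighted per sample."""
    if sample_weight is None:
        sample_weight = [1] * len(y_true)
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)


def repeat_by_weight(values, sample_weight):
    """Repeat each value w times (w = 0 drops it), like np.repeat."""
    return [v for v, w in zip(values, sample_weight) for _ in range(w)]


# Toy labels, predictions and integer weights (made up for illustration).
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 0]
sw = [3, 0, 1, 2, 4]

score_weighted = weighted_accuracy(y_true, y_pred, sample_weight=sw)
score_repeated = weighted_accuracy(
    repeat_by_weight(y_true, sw), repeat_by_weight(y_pred, sw)
)
score_unweighted = weighted_accuracy(y_true, y_pred)

print(score_weighted, score_repeated, score_unweighted)
```

If the scorer ignores the weights, the weighted run is effectively scored with score_unweighted, so the cross-validation scores of the weighted and repeated runs can differ even when the fitted coefficients on each fold agree.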
Steps/Code to Reproduce
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import get_scorer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import LeaveOneGroupOut
import sklearn
sklearn.set_config(enable_metadata_routing=True)
rng = np.random.RandomState(0)
X, y = make_classification(
    n_samples=300000,
    n_features=8,
    random_state=10,
    n_informative=4,
    n_classes=2,
)
n_samples = X.shape[0] // 3
sw = np.ones_like(y)
# Give each sample in the first fold a random integer weight in [0, 4].
sw[:n_samples] = rng.randint(0, 5, size=n_samples)
groups_sw = np.r_[
    np.full(n_samples, 0), np.full(n_samples, 1), np.full(n_samples, 2)
]
splits_weighted = list(LeaveOneGroupOut().split(X, groups=groups_sw))
# Repeat the samples of the first fold according to their weights and provide
# the matching splits ourselves.
X_resampled_by_weights = np.repeat(X, sw.astype(int), axis=0)
# Net change in the number of rows caused by the repetitions.
n_reps = X_resampled_by_weights.shape[0] - X.shape[0]
y_resampled_by_weights = np.repeat(y, sw.astype(int), axis=0)
groups = np.r_[
    np.full(n_reps + n_samples, 0), np.full(n_samples, 1), np.full(n_samples, 2)
]
splits_repeated = list(LeaveOneGroupOut().split(X_resampled_by_weights, groups=groups))
est_weighted = LogisticRegression(solver="liblinear").fit(X, y, sample_weight=sw)
est_repeated = LogisticRegression(solver="liblinear").fit(X_resampled_by_weights, y_resampled_by_weights)
np.testing.assert_allclose(est_weighted.coef_, est_repeated.coef_)
est_weighted = LogisticRegressionCV(cv=splits_weighted, solver="liblinear").fit(X, y, sample_weight=sw)
est_repeated = LogisticRegressionCV(cv=splits_repeated, solver="liblinear").fit(X_resampled_by_weights, y_resampled_by_weights)
np.testing.assert_allclose(est_weighted.coef_, est_repeated.coef_)
Expected Results
No error is thrown
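The expectation rests on the fact that, for integer weights, a weighted sum of per-sample losses equals the plain sum over the repeated samples, while a penalty on the coefficients does not depend on the samples at all. A minimal sketch of that identity follows; logistic_loss_sum and repeat_by_weight are hypothetical helpers and the data and coefficients are made up. This is not liblinear's objective code, only the data-fit term under the stated assumptions.

```python
import math


def logistic_loss_sum(X, y, coef, sample_weight=None):
    """Sum of logistic losses for labels in {0, 1}, optionally weighted."""
    if sample_weight is None:
        sample_weight = [1] * len(y)
    total = 0.0
    for x_i, y_i, w_i in zip(X, y, sample_weight):
        z = sum(c * v for c, v in zip(coef, x_i))  # linear score
        p = 1.0 / (1.0 + math.exp(-z))  # predicted P(y = 1)
        total += -w_i * (y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p))
    return total


def repeat_by_weight(rows, sample_weight):
    """Repeat each row w times (w = 0 drops it), like np.repeat on axis 0."""
    return [r for r, w in zip(rows, sample_weight) for _ in range(w)]


# Toy data, labels, integer weights and coefficients (illustrative only).
X = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8], [0.1, 0.1], [-1.2, -0.4]]
y = [0, 1, 1, 0, 1]
sw = [3, 0, 1, 2, 4]
coef = [0.4, -0.6]

loss_weighted = logistic_loss_sum(X, y, coef, sample_weight=sw)
loss_repeated = logistic_loss_sum(
    repeat_by_weight(X, sw), repeat_by_weight(y, sw), coef
)

print(loss_weighted, loss_repeated)
```

Since the two objectives agree for every coefficient vector, a solver that converges on the same train/test partitions should reach the same coefficients up to numerical tolerance, which is what the assert_allclose calls check.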
Actual Results
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0
Mismatched elements: 8 / 8 (100%)
Max absolute difference among violations: 0.02352997
Max relative difference among violations: 10.49415031
ACTUAL: array([[ 5.580057e-01, 1.455297e-01, 1.117538e-02, 9.940221e-04,
2.078733e-05, -2.118241e-01, -2.361904e-01, -6.555003e-01]])
DESIRED: array([[ 5.757953e-01, 1.541149e-01, 9.722671e-04, 1.094184e-03,
1.143567e-04, -2.027509e-01, -2.405034e-01, -6.790303e-01]])
Versions
System:
    python: 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:13:44) [Clang 16.0.6 ]
    executable: /Users/shrutinath/micromamba/envs/scikit-learn/bin/python
    machine: macOS-14.3-arm64-arm-64bit

Python dependencies:
    sklearn: 1.6.dev0
    pip: 24.0
    setuptools: 70.1.1
    numpy: 2.0.0
    scipy: 1.14.0
    Cython: 3.0.10
    pandas: None
    matplotlib: 3.9.0
    joblib: 1.4.2
    threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
    user_api: blas
    internal_api: openblas
    num_threads: 8
    prefix: libopenblas
    filepath: /Users/shrutinath/micromamba/envs/scikit-learn/lib/libopenblas.0.dylib
    version: 0.3.27
    threading_layer: openmp
    architecture: VORTEX

    user_api: openmp
    internal_api: openmp
    num_threads: 8
    prefix: libomp
    filepath: /Users/shrutinath/micromamba/envs/scikit-learn/lib/libomp.dylib
    version: None