Fix elasticnet cv sample weight #29308
Conversation
It seems like this single call to _preprocess_data suffices in all cases.
This tiny example was given in scikit-learn#22914. The test merely asserts that alpha_max is large enough to force the coefficient to 0.
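As a minimal sketch of the property the test relies on (the toy data below is illustrative, not the exact example from scikit-learn#22914): with `fit_intercept=True`, `alpha_max = max |X_c^T y_c| / n_samples` on the centered data, and any `alpha >= alpha_max` forces all coefficients to exactly 0.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative toy data, not the example from scikit-learn#22914.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0])

# alpha_max computed on centered data, as _alpha_grid does for fit_intercept=True.
n_samples = X.shape[0]
Xc = X - X.mean(axis=0)
yc = y - y.mean()
alpha_max = np.abs(Xc.T @ yc).max() / n_samples

# At alpha == alpha_max the L1 penalty is just large enough to zero out the fit.
reg = Lasso(alpha=alpha_max).fit(X, y)
```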
As per reviewer's suggestions: (1) Clarify eps=1. (2) Parameterize `fit_intercept`.
(1) Give the name `n_samples` to the quantity `X.shape[0]`. (2) Clarify that `y_offset` and `X_scale` are not used, since these are already applied to the data by `_preprocess_data`.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
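To make the preceding commit note concrete, here is a hedged plain-NumPy sketch (not the private `_preprocess_data` helper itself) of the weighted centering applied when `fit_intercept=True`; after subtraction, the returned offsets need not be applied again because the data is already centered:

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples = 8  # the name given to X.shape[0], as suggested in the review
X = rng.randn(n_samples, 3)
y = rng.randn(n_samples)
sw = rng.uniform(0.5, 2.0, size=n_samples)

# Sample-weighted offsets; subtracting them centers the data in the
# weighted sense, which is why y_offset / X_scale are not reused afterwards.
X_offset = np.average(X, axis=0, weights=sw)
y_offset = np.average(y, weights=sw)
X_centered = X - X_offset
y_centered = y - y_offset
```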
…ied alpha_grid_ to accommodate for MultitaskCV y shape
@snath-xoc I merged Don't forget to
ogrisel left a comment
Thanks for the follow-up.
Aside from the formatting issue reported by the linter and the missing changelog entry, here are a few more suggestions:
Feel free to ignore the linter reports on files that are not changed in this PR. This will be fixed in #29359.
Merging
…ear_model/tests/test_coordinate_descent/test_enet_cv_sample_weight_correctness
@snath-xoc why did you close this PR?
```python
accept_sparse="csc",
order="F",
dtype=[np.float64, np.float32],
force_writeable=True,
```
I think the removal of this line and the other force_writeable=True lines below were not intentional (maybe when resolving conflicts with main)?
I think this is what causes the test failures on the CI.
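For context, a hedged NumPy-only sketch of why `force_writeable=True` matters here: coordinate descent updates its input arrays in place, so a read-only input must be copied to a writeable buffer during validation (the toy array below is illustrative):

```python
import numpy as np

X = np.arange(6, dtype=np.float64).reshape(3, 2)
X.setflags(write=False)  # simulate a read-only input, e.g. a memmapped array

# Without a writeable copy, any in-place update on X would raise ValueError.
# force_writeable=True in the validation call amounts to making this copy:
X_validated = np.array(X, order="F")
X_validated[0, 0] = 42.0  # in-place update now succeeds
```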
```diff
-# We weight the first fold 2 times more.
-sw[:n_samples] = 2
+# We weight the first fold n times more.
```
```diff
-# We weight the first fold n times more.
+# We re-weight the first cross-validation group with random integer weights.
+# The samples in the other groups are left with unit weights.
```
```python
X = X.toarray()
X = np.r_[X[:n_samples], X]
X_rep = np.repeat(X, sw.astype(int), axis=0)
##Need to know number of repitions made in total
```
```diff
-##Need to know number of repitions made in total
+# Inspect the total number of random repetitions so as to adjust the size of
+# the first cross-validation group accordingly.
```
Actually, I think that computing the number of repetitions is not needed, see the other suggestions below.
```python
X_rep = np.repeat(X, sw.astype(int), axis=0)
##Need to know number of repitions made in total
n_reps = X_rep.shape[0] - X.shape[0]
X = X_rep
```
I would rather not rename the `X` variable, to keep the code easier to follow.
Maybe you could instead name the variables `X_with_weights`, `y_with_weights`, `groups_with_weights` on the one hand and `X_with_repetitions`, `y_with_repetitions` and `groups_with_repetitions` on the other hand.
And similarly for the names of the 2 cross-validation splitters (if you adapt the code to use metadata routing) or the results of their splits if you prefer to precompute them instead of leveraging metadata routing.
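As an illustration of the "precompute the splits" option mentioned above (toy data and variable names are made up for the sketch):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12, dtype=np.float64).reshape(6, 2)
y = np.arange(6, dtype=np.float64)
groups_with_weights = np.array([0, 0, 1, 1, 2, 2])

# Precomputing the (train, test) index pairs lets the same folds be passed
# as cv=... to both the weighted and the repeated estimators, without
# relying on metadata routing.
splitter = GroupKFold(n_splits=3)
splits_with_weights = list(splitter.split(X, y, groups_with_weights))
```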
```diff
 groups = np.r_[
-    np.full(2 * n_samples, 0), np.full(n_samples, 1), np.full(n_samples, 2)
+    np.full(n_reps + n_samples, 0), np.full(n_samples, 1), np.full(n_samples, 2)
 ]
```
Instead of using `n_reps`, you could use:

```python
groups_with_repetitions = np.repeat(groups_with_weights, sw.astype(int), axis=0)
```

as is done for X and y.
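A small self-contained check of that suggestion (the weights here are made-up integers):

```python
import numpy as np

groups_with_weights = np.array([0, 0, 1, 2])
sw = np.array([3.0, 1.0, 2.0, 1.0])  # integer-valued sample weights

# Repeating each group label sw_i times keeps the labels aligned with the
# repeated rows of X and y produced the same way.
groups_with_repetitions = np.repeat(groups_with_weights, sw.astype(int), axis=0)
```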
```python
# ensure that we chose meaningful alphas, i.e. not boundaries
assert alphas[0] < reg.alpha_ < alphas[-1]
assert_allclose(reg_sw.alphas_, reg.alphas_)
assert reg_sw.alpha_ == reg.alpha_
```
Please also compare the values of the `mse_path_` attributes prior to comparing the `coef_` values.
Closing as most review comments have been addressed in #29442.
Fixes #22914
What does this implement/fix? Explain your changes.
Adapted from the pull request #23045 by s-banach
Modifies the `_alpha_grid` function in `linear_model._coordinate_descent` to accept a `sample_weight` argument and implements changes to be compatible with `_preprocess_data`.
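A hedged sketch of the idea (the function name `alpha_max_weighted` and its exact normalization are hypothetical illustrations, not the code merged in the PR):

```python
import numpy as np

def alpha_max_weighted(X, y, sample_weight, l1_ratio=1.0):
    """Hypothetical sketch: largest alpha yielding an all-zero solution,
    computed on sample-weighted, centered data."""
    sw = sample_weight / sample_weight.sum()  # normalize weights to sum to 1
    X_offset = np.average(X, axis=0, weights=sw)
    y_offset = np.average(y, weights=sw)
    Xc = X - X_offset
    yc = y - y_offset
    return np.abs(Xc.T @ (sw * yc)).max() / l1_ratio
```

With uniform weights this reduces to the unweighted formula `max |X_c^T y_c| / (n_samples * l1_ratio)`.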
TODO