MAINT downcast indices dtype when converting sparse arrays by glemaitre · Pull Request #27372 · scikit-learn/scikit-learn

glemaitre · 2023-09-14T14:38:44Z

The index dtype of sparse arrays can differ from that of sparse matrices. This PR modifies check_array so that the behaviour is consistent.

The goal is to avoid regressions in low-level code that types indices with 32-bit precision, as seen in #27240. Note that this typing is not limited to scikit-learn, which makes it harder to handle and can lead to regressions in the future.

The main issue is the conversion from DIA arrays to CSR/COO arrays.
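
A minimal sketch for inspecting the issue locally, assuming a SciPy version that provides `dia_array`. The helper name `index_dtypes_after_conversion` is hypothetical and not part of scikit-learn or SciPy. The dtypes it reports depend on the installed SciPy version, so no particular output is claimed here.

```python
import numpy as np
import scipy.sparse as sp


def index_dtypes_after_conversion(container):
    """Return the index dtypes of the CSR and COO conversions of `container`.

    Hypothetical helper, for illustration only.
    """
    csr = container.tocsr()
    coo = container.tocoo()
    return {
        "csr.indices": csr.indices.dtype,
        "csr.indptr": csr.indptr.dtype,
        "coo.row": coo.row.dtype,
        "coo.col": coo.col.dtype,
    }


# The same small diagonal data, once as a sparse matrix and once as a sparse array.
data = np.ones((1, 4))
offsets = np.array([0])
dia_as_matrix = sp.dia_matrix((data, offsets), shape=(4, 4))
dia_as_array = sp.dia_array((data, offsets), shape=(4, 4))

for name, container in [("dia_matrix", dia_as_matrix), ("dia_array", dia_as_array)]:
    print(name, index_dtypes_after_conversion(container))
```

Comparing the two printed dictionaries shows whether the installed SciPy picks wider indices for the array container than for the matrix container.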

github-actions · 2023-09-14T14:41:08Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 31eebd7.

ogrisel

First pass of comments:

sklearn/utils/tests/test_validation.py

sklearn/utils/validation.py

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

sklearn/utils/validation.py

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

ogrisel

Another pass, mostly about wording, and making the test cases easier to understand.

doc/whats_new/v1.4.rst

sklearn/utils/validation.py

sklearn/utils/tests/test_validation.py

sklearn/utils/validation.py

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

glemaitre · 2023-09-26T09:24:16Z

@ogrisel I completely reworked the test for the new helper. Sorry for not having checked it earlier.
I parametrized it as well. I think this is more readable.

I also renamed the variable in _ensure_sparse_format. I checked: it is only used once and it is private, so it should not be a problem.

glemaitre · 2023-09-26T15:31:14Z

I checked codecov and we should not have an issue here.

ogrisel

Another round of feedback.

Once #27346 is merged in main, we should merge main into this branch and push a commit with [pyodide] in the commit message to check that it works as expected on that platform or skip tests that cannot be run on that platform.

sklearn/utils/tests/test_validation.py

sklearn/utils/validation.py

sklearn/utils/tests/test_validation.py

sklearn/utils/validation.py

sklearn/utils/tests/test_validation.py

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

sklearn/utils/tests/test_validation.py

sklearn/utils/validation.py

ogrisel

Another pass of suggestions to further simplify the code and improve comments + cosmetic changes.

Besides this, LGTM.

sklearn/utils/tests/test_validation.py

ogrisel · 2023-10-24T16:08:57Z

sklearn/utils/tests/test_validation.py

+        # be converted to int32,
+        (
+            {"arrays": np.array([1], dtype=np.int64), "check_contents": True},
+            np.dtype("int32"),


Suggested change

np.dtype("int32"),

np.int32,

ogrisel · 2023-10-24T16:09:08Z

sklearn/utils/tests/test_validation.py

+                "arrays": np.array([np.iinfo(np.int32).max + 1], dtype=np.uint32),
+                "check_contents": True,
+            },
+            np.dtype("int64"),


Suggested change

np.dtype("int64"),

np.int64,

ogrisel · 2023-10-24T16:09:20Z

sklearn/utils/tests/test_validation.py

+                "check_contents": True,
+                "maxval": np.iinfo(np.int32).max + 1,
+            },
+            np.dtype("int64"),


Suggested change

np.dtype("int64"),

np.int64,

ogrisel · 2023-10-24T16:09:33Z

sklearn/utils/tests/test_validation.py

+                "check_contents": True,
+                "maxval": 1,
+            },
+            np.dtype("int64"),


Suggested change

np.dtype("int64"),

np.int64,

ogrisel · 2023-10-24T16:10:22Z

sklearn/utils/tests/test_validation.py

+    [
+        ({}, np.dtype("int32")),  # default behaviour
+        ({"maxval": np.iinfo(np.int32).max}, np.dtype("int32")),
+        ({"maxval": np.iinfo(np.int32).max + 1}, np.dtype("int64")),


Can you please do a big search and replace for np.dtype("int32") to just np.int32 and similar for np.dtype("int64")?

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

adrinjalali

Otherwise LGTM.

adrinjalali · 2023-10-26T12:02:03Z

sklearn/utils/validation.py

+    # With SciPy sparse arrays, conversion from DIA format to COO, CSR, or BSR triggers
+    # the use of `np.int64` indices even if the data is such that it could be more
+    # efficiently represented with `np.int32` indices.
+    # https://github.com/scipy/scipy/issues/19245
+    # Since not all scikit-learn algorithms support large indices, the following code
+    # downcasts to `np.int32` indices when it's safe to do so.
+    if (
+        sparse_container_type_name == "dia_array"
+        and changed_format
+        and accept_sparse[0] in ("csr", "coo")
+    ):
+        if accept_sparse[0] == "csr":
+            index_dtype = _smallest_admissible_index_dtype(
+                arrays=(sparse_container.indptr, sparse_container.indices),
+                maxval=max(sparse_container.nnz, sparse_container.shape[1]),
+                check_contents=True,
+            )
+            sparse_container.indices = sparse_container.indices.astype(
+                index_dtype, copy=False
+            )
+            sparse_container.indptr = sparse_container.indptr.astype(
+                index_dtype, copy=False
+            )
+        else:  # accept_sparse[0] == "coo"
+            index_dtype = _smallest_admissible_index_dtype(
+                maxval=max(sparse_container.shape)
+            )
+            sparse_container.row = sparse_container.row.astype(index_dtype, copy=False)
+            sparse_container.col = sparse_container.col.astype(index_dtype, copy=False)


Could we move this to _fixes.py with a clear comment on when we can remove it, i.e. when the next SciPy release becomes our minimum requirement?
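
For readers following the thread, the dtype selection discussed in the diff above can be approximated as follows. This is a simplified sketch, not the actual `_smallest_admissible_index_dtype` from scikit-learn: the function name `smallest_index_dtype` and its exact rules are assumptions chosen to match the test expectations quoted earlier in this review (default gives int32, a `maxval` above the int32 maximum gives int64, and array contents are only inspected when `check_contents=True`).

```python
import numpy as np


def smallest_index_dtype(arrays=(), maxval=None, check_contents=False):
    """Pick np.int32 when it is safe for sparse indices, else np.int64.

    Hypothetical simplification for illustration only.
    """
    int32_min = np.iinfo(np.int32).min
    int32_max = np.iinfo(np.int32).max

    # A declared upper bound that does not fit in int32 forces int64.
    if maxval is not None and maxval > int32_max:
        return np.int64

    if isinstance(arrays, np.ndarray):
        arrays = (arrays,)

    for arr in arrays:
        arr = np.asarray(arr)
        if not np.issubdtype(arr.dtype, np.integer):
            raise ValueError(f"Expected an integer array, got dtype {arr.dtype}.")
        if np.can_cast(arr.dtype, np.int32):
            # The dtype is already narrow enough, nothing to inspect.
            continue
        if not check_contents:
            # Wider dtype and no permission to look at the values: stay wide.
            return np.int64
        if arr.size and (int(arr.max()) > int32_max or int(arr.min()) < int32_min):
            return np.int64

    return np.int32


# Usage mirroring the CSR branch of the diff: downcast only when it is safe.
indices = np.array([0, 2, 1], dtype=np.int64)
indptr = np.array([0, 1, 2, 3], dtype=np.int64)
index_dtype = smallest_index_dtype(
    arrays=(indptr, indices), maxval=3, check_contents=True
)
indices = indices.astype(index_dtype, copy=False)
indptr = indptr.astype(index_dtype, copy=False)
```

Passing `copy=False` to `astype` avoids a copy when the arrays already have the selected dtype.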

glemaitre · 2023-10-30T09:45:18Z

@adrinjalali Would you be able to merge this PR now that I have moved the code into fixes.py?

adrinjalali

NICE!

glemaitre · 2023-10-30T10:35:36Z

Thanks @adrinjalali

MAINT downcast indices dtype when converting sparse arrays

d0f9e24

github-actions bot added the module:utils label Sep 14, 2023

DOC add changelog

3387951

glemaitre mentioned this pull request Sep 14, 2023

TST Extend tests for scipy.sparse/*array in sklearn/manifold/tests/test_spectral_embedding #27240

Merged

test for _get_sparse_index_dtype

ba719b3

ogrisel reviewed Sep 15, 2023

sklearn/utils/tests/test_validation.py (outdated, resolved)

sklearn/utils/validation.py (outdated, resolved)

sklearn/utils/validation.py (outdated, resolved)

glemaitre and others added 2 commits September 15, 2023 21:46

Apply suggestions from code review

884308f

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

iter

d5a69ad

ogrisel reviewed Sep 16, 2023

sklearn/utils/validation.py (outdated, resolved)

Update sklearn/utils/validation.py

a57274b

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

ogrisel reviewed Sep 18, 2023

ogrisel mentioned this pull request Sep 18, 2023

TST Extend tests for scipy.sparse.*array #27090

Closed

glemaitre and others added 4 commits September 25, 2023 16:52

Apply suggestions from code review

c85dacc

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

don't support bsr

57e3ca0

improve test and naming

58c28ae

Merge remote-tracking branch 'origin/main' into downcast_indices_dtype

468e038

glemaitre added 2 commits September 26, 2023 15:16

actually we don't support this use case

bb91ae3

update doc

ed0d9f1

ogrisel reviewed Sep 26, 2023

glemaitre and others added 5 commits September 27, 2023 10:42

Apply suggestions from code review

9a653d6

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

apply more fixes

b446bf3

Merge remote-tracking branch 'origin/main' into downcast_indices_dtype

b55e726

earlier cathching for unknown type of arrays or dtype

7a0ead4

Merge branch 'main' into downcast_indices_dtype

4c63b49

StefanieSenger mentioned this pull request Oct 12, 2023

TST separate checks for sparse array and sparse matrix input in estimator_checks #27576

Merged

glemaitre added 2 commits October 23, 2023 13:09

Merge remote-tracking branch 'origin/main' into downcast_indices_dtype

1cc94cb

[azure parallel][pyodide] trigger pyodide

f5be95c

adrinjalali reviewed Oct 24, 2023

sklearn/utils/tests/test_validation.py (resolved)

sklearn/utils/validation.py (outdated, resolved)

ogrisel approved these changes Oct 24, 2023

glemaitre and others added 4 commits October 24, 2023 18:20

Apply suggestions from code review

0c36a5a

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

rework test

eb6eb1a

replace np.dytpe with the actual dtype

3f7413e

simplify code casting

a45bd80

adrinjalali reviewed Oct 26, 2023

move utilities in fixes

31eebd7

adrinjalali approved these changes Oct 30, 2023

adrinjalali merged commit 56f6477 into scikit-learn:main Oct 30, 2023

RUrlus pushed a commit to RUrlus/scikit-learn that referenced this pull request Oct 30, 2023

MAINT downcast indices dtype when converting sparse arrays (scikit-le…

af00b3f

…arn#27372) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Oct 31, 2023

MAINT downcast indices dtype when converting sparse arrays (scikit-le…

cbe170e

…arn#27372) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023

MAINT downcast indices dtype when converting sparse arrays (scikit-le…

a336a99

…arn#27372) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
