MAINT downcast indices dtype when converting sparse arrays#27372
MAINT downcast indices dtype when converting sparse arrays#27372adrinjalali merged 24 commits intoscikit-learn:mainfrom
Conversation
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
ogrisel
left a comment
There was a problem hiding this comment.
Another pass, mostly about wording, and making the test cases easier to understand.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
|
@ogrisel I reworked completely the test for the new helper. Sorry to not have not check it. I also rename the variable in the |
|
I checked codecov and we should not have an issue here. |
ogrisel
left a comment
There was a problem hiding this comment.
Another round of feedback.
Once #27346 is merged in main, we should merge main into this branch and push a commit with [pyodide] in the commit message to check that it works as expected on that platform or skip tests that cannot be run on that platform.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
ogrisel
left a comment
There was a problem hiding this comment.
Another pass of suggestions to further simplify the code and improve comments + cosmetic changes.
Besides this, LGTM.
| # be converted to int32, | ||
| ( | ||
| {"arrays": np.array([1], dtype=np.int64), "check_contents": True}, | ||
| np.dtype("int32"), |
There was a problem hiding this comment.
| np.dtype("int32"), | |
| np.int32, |
| "arrays": np.array([np.iinfo(np.int32).max + 1], dtype=np.uint32), | ||
| "check_contents": True, | ||
| }, | ||
| np.dtype("int64"), |
There was a problem hiding this comment.
| np.dtype("int64"), | |
| np.int64, |
| "check_contents": True, | ||
| "maxval": np.iinfo(np.int32).max + 1, | ||
| }, | ||
| np.dtype("int64"), |
There was a problem hiding this comment.
| np.dtype("int64"), | |
| np.int64, |
| "check_contents": True, | ||
| "maxval": 1, | ||
| }, | ||
| np.dtype("int64"), |
There was a problem hiding this comment.
| np.dtype("int64"), | |
| np.int64, |
| [ | ||
| ({}, np.dtype("int32")), # default behaviour | ||
| ({"maxval": np.iinfo(np.int32).max}, np.dtype("int32")), | ||
| ({"maxval": np.iinfo(np.int32).max + 1}, np.dtype("int64")), |
There was a problem hiding this comment.
Can you please do a big search and replace for np.dtype("int32") to just np.int32 and similar for np.dtype("int64")?
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
sklearn/utils/validation.py
Outdated
| # With SciPy sparse arrays, conversion from DIA format to COO, CSR, or BSR triggers | ||
| # the use of `np.int64` indices even if the data is such that it could be more | ||
| # efficiently represented with `np.int32` indices. | ||
| # https://github.com/scipy/scipy/issues/19245 | ||
| # Since not all scikit-learn algorithms support large indices, the following code | ||
| # downcasts to `np.int32` indices when it's safe to do so. | ||
| if ( | ||
| sparse_container_type_name == "dia_array" | ||
| and changed_format | ||
| and accept_sparse[0] in ("csr", "coo") | ||
| ): | ||
| if accept_sparse[0] == "csr": | ||
| index_dtype = _smallest_admissible_index_dtype( | ||
| arrays=(sparse_container.indptr, sparse_container.indices), | ||
| maxval=max(sparse_container.nnz, sparse_container.shape[1]), | ||
| check_contents=True, | ||
| ) | ||
| sparse_container.indices = sparse_container.indices.astype( | ||
| index_dtype, copy=False | ||
| ) | ||
| sparse_container.indptr = sparse_container.indptr.astype( | ||
| index_dtype, copy=False | ||
| ) | ||
| else: # accept_sparse[0] == "coo" | ||
| index_dtype = _smallest_admissible_index_dtype( | ||
| maxval=max(sparse_container.shape) | ||
| ) | ||
| sparse_container.row = sparse_container.row.astype(index_dtype, copy=False) | ||
| sparse_container.col = sparse_container.col.astype(index_dtype, copy=False) |
There was a problem hiding this comment.
Could we move this to _fixes.py with a clear comment as when we can remove this? aka when the next scipy release becomes our minimum requirement.
|
@adrinjalali Would you be able to merge this PR after that I moved the code into |
|
Thanks @adrinjalali |
…arn#27372) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
…arn#27372) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
…arn#27372) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
The indices dtype of sparse arrays is different from sparse matrices. This PR modifies
check_arrayto have a consistent behaviour.The reason to do so is to not have any regression on low-level code that typed indices to be 32-bits precision as seen in #27240. Not that this typing is not only in scikit-learn which makes it more difficult to handle and can lead to regression in the future.
The main issue is the conversion from DIA arrays to CSR/COO arrays.