[MRG+1] BUG: MultiLabelBinarizer.fit_transform sometimes returns an invalid CSR matrix #7750
See scipy/scipy#6719 for context. The gist is that the `inverse` array may have a different dtype than `yt.indices`, which causes trouble down the line because, in those cases, `yt.indices` and `yt.indptr` end up with different dtypes. Alternatively, we could insert `yt.check_format(full_check=False)` after modifying the sparse matrix members.
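To make the mismatch concrete, here is a minimal sketch, not the code from the PR. The toy matrix and the names `remapped` and `fixed` are made up for illustration. It assumes a typical 64-bit platform, where `np.unique(..., return_inverse=True)` returns a pointer-sized integer array while a small CSR matrix stores 32-bit indices.

```python
import numpy as np
import scipy.sparse as sp

# Toy 2x2 CSR matrix with explicit int32 index arrays (illustrative data).
indptr = np.array([0, 2, 3], dtype=np.int32)
indices = np.array([0, 1, 1], dtype=np.int32)
data = np.ones(3, dtype=int)
yt = sp.csr_matrix((data, indices, indptr), shape=(2, 2))

# `inverse` maps the original class order onto the sorted unique classes.
# Its dtype is np.intp, which is int64 on most 64-bit platforms.
_, inverse = np.unique(np.array(["b", "a"]), return_inverse=True)

# Fancy indexing keeps the dtype of `inverse`, not of `yt.indices`.
remapped = inverse[yt.indices]
print(remapped.dtype, yt.indptr.dtype)

# Cast back so that indices and indptr share one dtype again.
fixed = np.asarray(remapped, dtype=yt.indices.dtype)
yt.indices = fixed

# Cheap structural check, as mentioned in the PR description.
yt.check_format(full_check=False)
```

Assigning `remapped` directly would leave `yt.indices` and `yt.indptr` with different dtypes on such platforms, which is the situation the description above warns about.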
|
Thanks. Can you add a test, please? |
Older versions don't support kwargs for `astype`
sklearn/preprocessing/label.py
    class_mapping = np.empty(len(tmp), dtype=dtype)
    class_mapping[:] = tmp
    self.classes_, inverse = np.unique(class_mapping, return_inverse=True)
    yt.indices = inverse[yt.indices].astype(yt.indices.dtype, copy=False)
you could use np.asarray(..., dtype=) to more closely reflect this operation, I think
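As a rough illustration of this suggestion (a sketch with made-up array names, not code from the PR): `np.asarray` with a `dtype` argument returns its input unchanged when the dtype already matches, so it gives the no-copy behaviour of `astype(..., copy=False)` without needing a keyword argument to `astype`.

```python
import numpy as np

# Hypothetical index array that already has the target dtype.
a = np.array([1, 0, 0], dtype=np.int32)

# np.asarray hands back the very same object when no conversion is needed.
same = np.asarray(a, dtype=np.int32)

# astype without copy=False always allocates a new array.
copied = a.astype(np.int32)

# When the dtype differs, np.asarray converts (and therefore copies).
wide = np.array([1, 0, 0], dtype=np.int64)
converted = np.asarray(wide, dtype=np.int32)

print(same is a, copied is a, converted.dtype)
```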
|
Yes, it looks like that's the sort of test we should perform wherever we do CSR manipulation. Hacky hacky hack hack. Thanks @perimosocordiae. |
|
Tests pass now. Should I squash the commits? |
|
no need. |
|
Maybe add a comment why the assert is needed and why the line is needed? It's a bit non-obvious to me. |
|
Opened #7762 to track the overall problem. |
|
Okay, comments added. I skipped the full CI treatment for the comment-only changes. Conversion to LIL format would test the symptom but not the cause of the error. It's possible that we may add checks in scipy to deal with this in the future, so I'd prefer to not rely on |
|
thanks :) |
…nvalid CSR matrix (scikit-learn#7750)
* BUG: MultiLabelBinarizer makes invalid CSR matrix
  See scipy/scipy#6719 for context. The gist is that the `inverse` array may have a different dtype than `yt.indices`, which causes trouble down the line because, in those cases, `yt.indices` and `yt.indptr` have different dtypes. Alternately, we could insert `yt.check_format(full_check=False)` after modifying the sparse matrix members.
* Fixing for old numpy
  Older versions don't support kwargs for `astype`
* Adding tests
* line-wrapping
* adding comment to tests [ci skip]
* added rationale comment [ci skip]