[MRG] fixes an issue w/ large sparse matrix indices in CountVectorizer #11295
jnothman merged 9 commits into scikit-learn:master
Conversation
rth
left a comment
Nice work on investigating! Overall I think it's the right fix.
If a CountVectorizer's feature matrix ends up having more than or equal to 2^31 values, it needs to use 64 bit indices / index pointers
The case of a 64-bit indptr was fixed in #9147, where the indices also end up 64-bit as a side effect, due to constraints of scipy CSR matrices, if I remember correctly (see the related discussion in #9147 (comment)). Needing 64-bit indices is not something that actually happens much - it would mean a vocabulary of ~2e9 tokens (roughly the total word count of English Wikipedia, but counted as unique tokens).
However, when we do use 64-bit indices/indptr, I think this change means one less memory copy from dtype conversion, plus better validity of the output array, so it's all good.
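For a sense of the threshold involved (a quick sketch, not code from the PR): the CSR indptr array stores cumulative counts of stored values, so it is the first thing to outgrow int32.

```python
import numpy as np

# indptr[-1] equals the total number of stored values, so once the
# matrix holds 2**31 or more values it no longer fits in int32.
nnz = 2 ** 31
print(nnz > np.iinfo(np.int32).max)  # True -> 64-bit indptr required
```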
See other comments below,
```python
vec._sort_features(X, vocabulary)

assert_equal(INDEX_DTYPE, X.indices.dtype)
```
`assert INDEX_DTYPE == X.indices.dtype` will produce nicer error messages with pytest.
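For reference, a bare `assert` lets pytest's assertion rewriting show both operands on failure; a minimal sketch:

```python
import numpy as np

# On failure pytest would report the operands themselves, e.g.
# "assert dtype('int64') == dtype('int32')", instead of
# assert_equal's more generic message.
INDEX_DTYPE = np.int64
indices = np.zeros(3, dtype=np.int64)
assert INDEX_DTYPE == indices.dtype
```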
good to know, will update
| "great!": 2 | ||
| } | ||
|
|
||
| vec._sort_features(X, vocabulary) |
According to the docstring, this returns the reordered matrix, so let's rather validate the return value:

`Xs = vec._sort_features(X, vocabulary)`

Just to be safe. Looking at the code, I agree that it currently modifies X in place, but there is no guarantee that it will continue to do so in the future.
Sure, that makes sense, I'll update.
```python
# force indices and indptr to int64.
INDEX_DTYPE = np.int64
X.indices = X.indices.astype(INDEX_DTYPE, copy=False)
X.indptr = X.indptr.astype(INDEX_DTYPE, copy=False)
```
Are you sure `copy=False` has an impact here?
I am not sure -- I suppose it doesn't matter if it's a copy or not.
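For what it's worth, `astype(..., copy=False)` only avoids a copy when no conversion is needed; a quick sketch:

```python
import numpy as np

a = np.arange(5, dtype=np.int64)
# Same dtype requested: copy=False hands back the original array.
assert a.astype(np.int64, copy=False) is a

b = np.arange(5, dtype=np.int32)
# Different dtype: a new array is allocated despite copy=False.
assert b.astype(np.int64, copy=False) is not b
```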
```python
# If a count vectorizer has to store >= 2**31 count values, the sparse
# storage matrix has 64-bit indices / indptrs. This requires ~2*8*2**31
# bytes of memory in practice, so we just test the method that would
# hypothetically fail.
```
Maybe just say,

```python
"""
Check that _sort_features preserves the indices dtype of sparse arrays with 64 bit indices
"""
```
Yep, I'll update it to something cleaner.
```python
X = sparse.csr_matrix((5, 5), dtype=np.int64)

# force indices and indptr to int64.
INDEX_DTYPE = np.int64
```
`INDICES_DTYPE`, if we want to be consistent.
Sure, will update.
```python
    ValueError, message, vec.transform, ["good news everyone"])


@pytest.mark.parametrize("vec", [CountVectorizer()])
```
I think there is no need to parametrize it, since we are only checking the CountVectorizer (which should be sufficient).
Yeah, I wasn't sure at first if we needed to test this with other vectorizers, but I don't think so.
(force-pushed 9c0b813 to 6d6099d)
Yes this will likely never happen in practice!

Looking into it!

It would be great to have this. @gvacaliuc any chance you could fix the CI (and resolve the conflict)? Thanks!

Hey there, sorry to ghost. I'll mess around with it tonight and see if I can figure it out.
(force-pushed 6d6099d to 59d12b8)
* added test to show example of issue in scikit-learn#7762
* fixes error caused by manually manipulating sparse indices in `CountVectorizer`
(force-pushed 59d12b8 to 317a169)
Sorry, life's been busy lately. Appveyor fails on 32-bit Python 2.7.8. IIUC this issue would never arise on a 32-bit arch, but I'll set up a test environment to figure out what's going on. The strange thing is that this is the relevant bit of code:

```python
map_index = np.empty(len(sorted_features), dtype=X.indices.dtype)
for new_val, (term, old_val) in enumerate(sorted_features):
    vocabulary[term] = new_val
    map_index[old_val] = new_val
X.indices = map_index.take(X.indices, mode='clip')
```

And Appveyor is failing on that last line due to a type error (cannot cast int64 to int32). Clearly map_index and X.indices have the same dtype, so my guess off the bat is that there's some issue with trying to index the map_index array with 64-bit integers on a 32-bit architecture, and that under the hood numpy tries to cast whatever dtype the index array is to int32.
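That guess matches numpy's behavior: `take` converts the index array to `np.intp`, which is int32 on a 32-bit build, and the int64-to-int32 cast is rejected. A small sketch of the mechanism (the failure only reproduces on a 32-bit build):

```python
import numpy as np

values = np.arange(5)
idx = np.array([0, 2, 4], dtype=np.int64)
print(np.intp)           # int64 on 64-bit builds, int32 on 32-bit ones
# take() indexes via np.intp, so on a 32-bit build the int64 index
# array cannot be cast safely and numpy raises a TypeError.
print(values.take(idx))  # [0 2 4] here; TypeError on 32-bit Python
```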
I will admit I also find this quite inexplicable...
I can reproduce the 64-bit indexing issue you see. What we could do is just fall back to the previous behavior (and skip the added tests) on 32-bit systems. Here is the code that could be used for detection (and for skipping tests):
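The snippet itself didn't survive extraction; judging by the `_IS_32BIT` flag used in the diff below, it was presumably a pointer-size check along these lines:

```python
import struct

import pytest

# A Python build is 32-bit when pointers are 4 bytes wide; np.intp
# (which take() uses for indexing) follows the pointer size.
_IS_32BIT = 8 * struct.calcsize("P") == 32

# Tests needing 64-bit sparse indices can then be skipped:
skip_on_32bit = pytest.mark.skipif(
    _IS_32BIT, reason="64-bit sparse indices unsupported on 32-bit Python")
```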
| """ | ||
| sorted_features = sorted(six.iteritems(vocabulary)) | ||
| map_index = np.empty(len(sorted_features), dtype=np.int32) | ||
| map_index = np.empty(len(sorted_features), dtype=X.indices.dtype) |
Should we just add a check here that catches an improper conversion to int64 on 32-bit arch? Or should we never be mucking with the index dtype at all, and just throw when our np.intp would overflow?
(scikit-learn/sklearn/feature_extraction/text.py, lines 944-955 at 317a169)
I don't have enough background here - could you link to the PR that added 64-bit indexing to scipy?

At any rate, it seems to me that indices_dtype should always be np.intp; if that's not possible, then you should pick between np.int32 (on old scipy) and np.intp (on fixed scipy).
The indices can be either int32 or int64; scipy.sparse doesn't care which (intp size does not matter). Both indices and indptr, however, need to be of the same dtype to avoid casts on each operation.

Of course, when you're constructing stuff manually, you'll also need to choose the dtype so that it can hold the items you are going to insert.
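That constructor-side dtype choice is also why the test widens the index arrays by attribute assignment rather than at construction time; a minimal sketch mirroring the test in this PR:

```python
import numpy as np
from scipy import sparse

# The CSR constructor picks the index dtype itself (and may keep
# int32 when the contents fit), so the test assigns the attributes
# directly to force 64-bit indices.
X = sparse.csr_matrix((5, 5), dtype=np.int64)
X.indices = X.indices.astype(np.int64, copy=False)
X.indptr = X.indptr.astype(np.int64, copy=False)
print(X.indices.dtype, X.indptr.dtype)  # int64 int64
```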
```diff
@@ -852,7 +852,7 @@ def _sort_features(self, X, vocabulary):
     Returns a reordered matrix and modifies the vocabulary in place
```
Unrelated, but: This seems to reorder X in place too.
Indeed it does -- I can update that when we decide what the desired behavior is.
That's intentional, I think, to reduce memory usage - why would it be an issue?
It's an issue only because the documentation doesn't tell me that's going to happen.
> This seems to reorder X in place too.
Agreed, we should add a note about it but in a separate PR.
Thanks a lot for your feedback @eric-wieser and @pv on this!

That was my basic principle for #9147. I'll admit I have not thought too much about 32-bit architectures, beyond making the tests pass (which also run on 32-bit Appveyor). It's possible that there is a more elegant solution (using …).

In this scikit-learn context, few users would need to vectorize enough text data to produce 64-bit indexed sparse arrays: the resulting sparse array alone would take >30-50 GB of memory, which comes close to the inherent 64 GB RAM limit of 32-bit architectures (with PAE). So currently we support this use case with recent scipy, on Unix or Windows with Python 3, and (possibly?) 64-bit architectures only, and fail in all other cases. I don't think there is a need to support 32-bit architectures here. On the other hand, we do need to support all OSes, Python versions, and 32-bit/64-bit architectures for smaller text datasets that produce 32-bit indexed sparse arrays.

So in the end, I am not sure what can/should be done here beyond what I proposed in #11295 (comment), but you might have better ideas.
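A back-of-envelope check of that memory figure (consistent with the ~2*8*2**31 bytes noted in the test comment above):

```python
# 2**31 stored values, with 8-byte counts plus 8-byte indices,
# ignoring indptr and any intermediate copies:
nnz = 2 ** 31
print((8 + 8) * nnz / 2 ** 30, "GiB")  # 32.0 GiB
```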
Sorry, accidental key combination... Updated the unfinished comment (#11295 (comment)) above.
Merged master in to resolve conflicts. @gvacaliuc please run a `git pull` before pushing further changes.
```python
Xs = CountVectorizer()._sort_features(X, vocabulary)

assert INDICES_DTYPE == Xs.indices.dtype
```
I think PEP8 is complaining that a blank line contains whitespace here?
This should be a candidate for 0.20.3 (have we got a way to mark that?)
Doesn't creating a milestone and putting it in there make sense? We can then backport them all I guess.
Thanks for your work on this @gvacaliuc

I think it's acceptable not to support 64-bit sparse indices on 32-bit architectures (particularly if the alternative is some performance cost on all platforms). The main uses of 32-bit arch I can think of (Raspberry Pi & embedded, 32-bit Windows, WebAssembly) are not typically ones that need to process tens of GB of sparse data.
rth
left a comment
Thanks @gvacaliuc! Looks good apart from a minor comment below.
sklearn/feature_extraction/text.py (outdated)

```diff
 if indptr[-1] > 2147483648:  # = 2**31
-    if sp_version >= (0, 14):
+    if sp_version >= (0, 14) and not _IS_32BIT:
```
Our current minimum supported scipy version is 0.17, so you can just remove the scipy version check and remove the last else clause.
Cool, will update now!
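With the version check dropped, the dtype selection would look roughly like this (a sketch of the intended shape, not the exact merged code; `indptr` and `_IS_32BIT` stand in for values from `_count_vocab` and `sklearn.utils.fixes`):

```python
import numpy as np

indptr = [0, 2 ** 31]  # stand-in for the indptr built in _count_vocab
_IS_32BIT = False      # stand-in for sklearn.utils.fixes._IS_32BIT

if indptr[-1] > np.iinfo(np.int32).max:  # nnz no longer fits in int32
    if _IS_32BIT:
        raise ValueError('sparse CSR array has %d non-zero elements and '
                         'requires 64 bit indexing, which is unsupported '
                         'with 32 bit Python.' % indptr[-1])
    indices_dtype = np.int64
else:
    indices_dtype = np.int32
```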
Let me know if there are any other changes I should make; it looks like the failing check is a code coverage issue. Since I didn't add much code, I think the only test I could add is to check that 32-bit Python throws the error, but that might require a very large vocab 😮
Yes, presumably because the error isn't tested in our CI.
I'm going to take @rth's comment as an approval and merge this.

Thanks @gvacaliuc!
Wait... this lacks a changelog entry. @gvacaliuc please add an entry to the change log at …
rth
left a comment
LGTM (apart from the missing what's new entry).
OK, just added the change log entry. Let me know if I missed anything else 😄

Thanks @gvacaliuc
Revert "[MRG] fixes an issue w/ large sparse matrix indices in CountVectorizer (scikit-learn#11295)" - this reverts commit 1da72e7.
Reference Issues/PRs
#7762
What does this implement/fix? Explain your changes.
If a `CountVectorizer`'s feature matrix ends up having more than or equal to 2^31 values, it needs to use 64-bit indices / index pointers. If the features get sorted via a call to `_sort_features`, the new indices were hard-coded to be 32-bit. I've updated that call to make the new indices the same dtype as before.

Any other comments?
The issue was broader than this one case, but I didn't identify any other manual modifications of sparse matrix indices / indptrs in the code. See #7762 for more details on that.