Tfidf mem fix by aupiff · Pull Request #4941 · scikit-learn/scikit-learn

aupiff · 2015-07-08T19:31:05Z

Previously, line 760 in text.py

values = np.ones(len(j_indices))

caused memory issues, because len(j_indices) equals the word count of the entire corpus. I hit this with a dataset of 200,000 documents of ~4000 words each, where many gigabytes were allocated temporarily. I've eliminated the need for this line and for the X.sum_duplicates calculation, without a perceptible performance hit.
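For scale: 200,000 documents × 4000 words is 800 million entries, and np.ones defaults to float64, so that one temporary array is roughly 6.4 GB. The sketch below shows one way to avoid it, by counting each document's features in a dict and appending only the distinct (index, count) pairs. It is an illustration under assumptions, not the code from this PR: `count_features_sketch` and its pre-tokenized input are hypothetical, and it returns the three CSR component arrays rather than a scipy matrix.

```python
from collections import defaultdict

import numpy as np


def count_features_sketch(tokenized_docs):
    """Hypothetical helper: build CSR components with per-document counts.

    Only one entry per distinct feature per document is stored, so no
    corpus-length array of ones (and no later duplicate summing) is needed.
    """
    vocabulary = {}
    j_indices = []   # column index of each stored entry
    values = []      # count of that feature in the document
    indptr = [0]     # row boundaries into j_indices / values

    for doc in tokenized_docs:
        feature_counter = defaultdict(int)
        for token in doc:
            # Assign the next free column index to unseen tokens.
            idx = vocabulary.setdefault(token, len(vocabulary))
            feature_counter[idx] += 1
        j_indices.extend(feature_counter.keys())
        values.extend(feature_counter.values())
        indptr.append(len(j_indices))

    return (
        np.asarray(values, dtype=np.int64),
        np.asarray(j_indices, dtype=np.int64),
        np.asarray(indptr, dtype=np.int64),
        vocabulary,
    )


values, j_indices, indptr, vocabulary = count_features_sketch(
    [["a", "b", "a"], ["b", "c"]]
)
```

The three arrays can be passed to scipy.sparse.csr_matrix((values, j_indices, indptr)) to obtain the document-term matrix.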

Additionally, for matrices with more than 2 billion nonzeros (nnz), np.intc and array.array(str("i")) were insufficient for storing indptr values that must index beyond position 2^31 - 1 in the values or j_indices arrays. I've changed np.intc -> np.int64 and array.array(str("i")) -> array.array(str("l")) to accommodate this, so that large datasets with > 2 billion nnz can be stored as a sparse matrix without integer overflow.
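To make the 2^31 - 1 limit concrete, here is a small sketch that checks whether a given nnz still fits in a 32-bit index. The helper `index_dtype_for` and the example nnz values are hypothetical. np.int32 stands in for np.intc, which is a C int and is 32 bits wide on common platforms.

```python
import numpy as np

# Largest value a signed 32-bit index can hold: 2**31 - 1 = 2147483647.
INT32_MAX = np.iinfo(np.int32).max


def index_dtype_for(nnz):
    """Hypothetical helper: smallest of int32/int64 that can index nnz entries."""
    return np.int32 if nnz <= INT32_MAX else np.int64


nnz_small = 800_000_000      # fits in 32 bits
nnz_large = 3_000_000_000    # exceeds 2**31 - 1, needs 64 bits

# The final indptr entry equals nnz, so it must be representable.
indptr_tail = np.array([0, nnz_large], dtype=index_dtype_for(nnz_large))
```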

larsmans · 2015-07-09T15:00:21Z

We used np.intc for compatibility with older SciPy, which didn't support 64-bit sparse matrices. I'm not sure whether we still need to; ping @ogrisel.

Also, this fails on Windows because long is 32-bit there, regardless of architecture. It seems that there is no way to portably get a 64-bit integer array from array.array.
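The platform dependence can be checked directly by printing item sizes. This sketch only reports what the current interpreter provides; the `itemsizes` helper is hypothetical. As a side note, Python 3.3+ adds a "q" typecode (signed long long), but it does not exist in Python 2, so it would not have helped code that had to run on both.

```python
import array

import numpy as np


def itemsizes():
    """Hypothetical helper: bytes per element for the types discussed above."""
    return {
        "array 'i'": array.array("i").itemsize,  # C int
        "array 'l'": array.array("l").itemsize,  # C long: 4 on Windows, 8 on most 64-bit Unix
        "np.int64": np.dtype(np.int64).itemsize,  # always 8
    }


sizes = itemsizes()
for name, nbytes in sizes.items():
    print(name, nbytes)
```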

aupiff · 2015-07-09T16:52:13Z

Maybe I should get rid of the 64-bit changes and just keep the memory fix?

amueller · 2015-07-11T20:56:08Z

Splitting it up into two PRs would probably be good.

larsmans · 2015-07-12T12:03:24Z

Yes. Mind you, it is possible to implement dynamic arrays with np.append, but it's probably slower and likely to take more memory. It also requires doing the doubling manually.
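A minimal sketch of what such a NumPy-backed dynamic array might look like, assuming an append-only int64 buffer. The class `GrowableInt64Array` is hypothetical and is not part of scikit-learn or NumPy. Because np.append always copies into a new array, the old and new buffers coexist briefly during each doubling, which is where the extra memory comes from.

```python
import numpy as np


class GrowableInt64Array:
    """Hypothetical append-only int64 buffer with manual capacity doubling."""

    def __init__(self, capacity=16):
        self._data = np.empty(capacity, dtype=np.int64)
        self._size = 0

    def append(self, value):
        if self._size == len(self._data):
            # Double the capacity. np.append returns a fresh copy, so the
            # old buffer stays alive until this assignment completes.
            self._data = np.append(self._data, np.empty_like(self._data))
        self._data[self._size] = value
        self._size += 1

    def __len__(self):
        return self._size

    def to_array(self):
        # Copy out only the filled prefix.
        return self._data[: self._size].copy()


buf = GrowableInt64Array(capacity=4)
for i in range(100):
    buf.append(i)
buf.append(2**40)  # a value that would not fit in a 32-bit integer
result = buf.to_array()
```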

aupiff added 4 commits July 8, 2015 12:00

more memory-efficient word count calculation

e4edd12

intc -> int64 change

9fcb135

so large datasets with > 2 billion nnz can be stored as a sparse matrix without integer overflow

iteritems -> items for python3 compatibility

a312772

np.frombuffer -> frombuffer_empty for values array

7aeffbe

aupiff mentioned this pull request Jul 12, 2015

more memory-efficient word count calculation #4968

Closed

aupiff closed this Jul 12, 2015

aupiff wants to merge 4 commits into scikit-learn:master from aupiff:tfidf-mem-fix