Skip to content

Tfidf mem fix#4941

Closed
aupiff wants to merge 4 commits intoscikit-learn:masterfrom
aupiff:tfidf-mem-fix
Closed

Tfidf mem fix#4941
aupiff wants to merge 4 commits intoscikit-learn:masterfrom
aupiff:tfidf-mem-fix

Conversation

@aupiff
Copy link
Copy Markdown

@aupiff aupiff commented Jul 8, 2015

previously, line 760 in text.py

values = np.ones(len(j_indices))

would cause memory issues as len(j_indices) is equal to the entire corpus' word count. I had issues with a dataset of 200,000 documents with ~4000 words each when many gigabytes would be allocated temporarily. I've eliminated the need for this line and the X.sum_duplicates calculation without a perceptible performance hit.

Additionally, for matrices with greater than 2 billion nnz, np.intc and array.array(str("i")) were insufficient for storing values of indptr that had to index into more than the (2^31 - 1)th position in the values or j_indices arrays. I've changed np.intc -> np.int64 and array.array(str("i")) -> array.array(str("l")) to accommodate this possibility.

@larsmans
Copy link
Copy Markdown
Member

larsmans commented Jul 9, 2015

We used np.intc because of compatibility with older SciPy, which didn't support 64-bit sparse matrices. I'm not sure if we still need to, ping @ogrisel.

Also, this fails on Windows because long is 32-bit there, regardless of architecture. It seems that there is no way to portably get a 64-bit integer array from array.array.

@aupiff
Copy link
Copy Markdown
Author

aupiff commented Jul 9, 2015

Maybe I should get rid of the 64-bit changes and just keep the memory fix?

@amueller
Copy link
Copy Markdown
Member

splitting it up into two PRs would probably be good.

@larsmans
Copy link
Copy Markdown
Member

Yes. Mind you, it is possible to implement dynamic arrays with np.append, but it's probably slower and likely to take more memory. It also requires doing the doubling manually.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants