Skip to content

more memory-efficient word count calculation#4968

Closed
aupiff wants to merge 3 commits intoscikit-learn:masterfrom
aupiff:only-tfidf-memory-fix
Closed

more memory-efficient word count calculation#4968
aupiff wants to merge 3 commits intoscikit-learn:masterfrom
aupiff:only-tfidf-memory-fix

Conversation

@aupiff
Copy link
Copy Markdown

@aupiff aupiff commented Jul 12, 2015

(SPLITTING UP PR #4941)

previously, line 760 in text.py

values = np.ones(len(j_indices))

would cause memory issues as len(j_indices) is equal to the entire corpus' word count. I had issues with a dataset of 200,000 documents with ~4000 words each when many gigabytes would be allocated temporarily. I've eliminated the need for this line and the X.sum_duplicates calculation without a perceptible performance hit.

@amueller
Copy link
Copy Markdown
Member

travis is unhappy

@jmschrei
Copy link
Copy Markdown
Member

jmschrei commented Aug 3, 2015

@aupiff Is this still being worked on?

@aupiff
Copy link
Copy Markdown
Author

aupiff commented Aug 4, 2015

@jmschrei Yes! I forgot about this for a little while. I'm fixing a few failing tests now.

@aupiff
Copy link
Copy Markdown
Author

aupiff commented Aug 4, 2015

@jmschrei all tests passing now! sorry this took so long to fix.

@jmschrei
Copy link
Copy Markdown
Member

jmschrei commented Aug 4, 2015

Excellent! Would it be possible for you to provide some simple benchmarks, and the code which runs those benchmarks as well? It would make it easier for us to ensure the quality of your code.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use a Counter?

@amueller
Copy link
Copy Markdown
Member

👍 for some benchmarks :)

@ephes
Copy link
Copy Markdown
Contributor

ephes commented Aug 16, 2015

I've added this vectorizer to my benchmark script for #5122 and added a line to the results table.

@jnothman
Copy link
Copy Markdown
Member

Superseded by #7272

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants