Skip to content

Refactored Count Vectorizer to be more memory efficient on N-grams#7107

Closed
qingyili wants to merge 2 commits intoscikit-learn:mainfrom
qingyili:master
Closed

Refactored Count Vectorizer to be more memory efficient on N-grams#7107
qingyili wants to merge 2 commits intoscikit-learn:mainfrom
qingyili:master

Conversation

@qingyili
Copy link
Copy Markdown

Reference Issue

What does this implement/fix? Explain your changes.

Any other comments?

When max_features are specified, CountVectorizer will first count the word occurrence. It will take the the top occurring words up to max_features. It will only make a 2-gram when its 1 gram count is apart of max_features, make a 3-gram when its 2-gram is apart of the max features and so on. This faster and more memory efficient on n-grams when the data sets are large.

self.max_features = max_features
if max_features is not None:
if (not isinstance(max_features, numbers.Integral) or
max_features <= 0):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

@amueller
Copy link
Copy Markdown
Member

amueller commented Jul 28, 2016

can you please post speed benchmarks for varying ngram_ranges and dataset sizes?

max_features=None, vocabulary=None, binary=False,
dtype=np.int64, norm='l2', use_idf=True, smooth_idf=True,
sublinear_tf=False):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please try not to make any non-function modifications to the code; it makes it harder to review / in this case the newline should be there.

@nelson-liu
Copy link
Copy Markdown
Contributor

travis doesn't seem happy, but i'm very interested to see benchmarks for this as well.

@rth rth mentioned this pull request Jan 30, 2017
Base automatically changed from master to main January 22, 2021 10:49
@adrinjalali
Copy link
Copy Markdown
Member

Closing, never got response from the OP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants