Custom vocabulary checks for CountVectorizer #2357

Closed
alemagnani wants to merge 3 commits into scikit-learn:master from alemagnani:CountVectorizer_Vocab

@alemagnani
Contributor

Added a check for repeated indices and gaps in a custom vocabulary for CountVectorizer.
Added a test with invalid custom vocabularies.
This commit fixes #2353.

alemagnani and others added 2 commits August 8, 2013 16:21
fix bug in matrix creation for CountVectorizer
* added test for custom vocab to check faulty vocabs
Member

Perhaps use six.itervalues.

Contributor Author

I am not familiar with six.itervalues. What is the difference?

Member

It ensures a list is not materialised in either Py2 or 3.
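
As a rough illustration of that point (a sketch, not code from this pull request): `six.itervalues` returns an iterator over a dict's values on both Python 2 and Python 3, so the values can be scanned without first building a list. The `vocabulary` dict below is a made-up example, and the fallback branch is an assumption added only so the snippet runs on Python 3 when `six` is not installed.

```python
try:
    from six import itervalues  # third-party Py2/Py3 compatibility helper
except ImportError:
    # Assumed stand-in for this sketch only: on Python 3 this behaves the
    # same way, iterating over the dict's value view without copying it.
    def itervalues(d):
        return iter(d.values())

# Hypothetical custom vocabulary mapping terms to column indices.
vocabulary = {"apple": 0, "banana": 1, "cherry": 2}

values = itervalues(vocabulary)

# The result is a lazy iterator, not a materialised list.
is_list = isinstance(values, list)

# Aggregations can consume the iterator directly.
max_index = max(itervalues(vocabulary))
```

On Python 2, calling `vocabulary.values()` would instead allocate a full list, which is the cost the suggestion avoids.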

@jnothman
Member

Looks good. Perhaps specify in the docstring that vocabulary must map each entry to a distinct integer from 0 to the vocabulary size - 1.
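
The constraint described here (each entry maps to a distinct integer from 0 to the vocabulary size - 1) could be checked along the following lines. This is a minimal sketch under stated assumptions, not the code merged in this pull request: the function name `check_vocabulary` and the error messages are hypothetical.

```python
def check_vocabulary(vocabulary):
    """Raise ValueError if the indices repeat or leave gaps in 0..n-1.

    `vocabulary` is assumed to be a dict mapping terms to integer indices.
    """
    indices = set(vocabulary.values())

    # Fewer distinct indices than terms means two terms share an index.
    if len(indices) != len(vocabulary):
        raise ValueError("Vocabulary contains repeated indices.")

    # With n distinct indices, every value in 0..n-1 must be present.
    for i in range(len(vocabulary)):
        if i not in indices:
            raise ValueError(
                "Vocabulary of size %d doesn't contain index %d."
                % (len(vocabulary), i)
            )


# A valid vocabulary passes silently.
check_vocabulary({"apple": 0, "banana": 1, "cherry": 2})

# Repeated index (two terms mapped to 0) and a gap (index 1 missing)
# are both rejected.
for bad in ({"apple": 0, "banana": 0}, {"apple": 0, "banana": 2}):
    try:
        check_vocabulary(bad)
    except ValueError as exc:
        print("rejected:", exc)
```

Checking for repeats first keeps the gap check simple: once the indices are known to be distinct, any index outside 0..n-1 necessarily leaves some index in that range missing.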

* used six.itervalues
@larsmans closed this in c2cf21d on Aug 15, 2013
@jnothman
Member

Thanks for the report and the fix, @alemagnani.

Development

Successfully merging this pull request may close these issues.

Bug in text.py line 738