Naive Bayes classifier for multivariate Bernoulli (binary features) #210
larsmans merged 13 commits into scikit-learn:master
Requires ogrisel's preprocessing-simplification branch, pull request scikit-learn#193.
Still two orders of magnitude slower than MultinomialNB because of loop in _joint_log_likelihood.
Conflicts: scikits/learn/naive_bayes.py
Maybe GaussianNB should share some of the code as well (?)
All tests pass.
I have the following broken tests:
Should work now.
It does. For the documentation, I found the following minor issues:
It would also be great to add a docstring for the
Other than that, it looks fine to me. Thanks for this contribution. I am +1 for merging once those glitches are fixed.
Also, rather than:
I would prefer something like:
This makes it easier for users to quickly spot shape issues in their data without having to add a breakpoint or print statement and rerun their program to gain that runtime knowledge.
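The two snippets this comment compared did not survive on this page, so the following is only a sketch of the general idea: an error message that reports the offending sizes. The helper name `check_n_samples` and the message wording are hypothetical and are not taken from the pull request.

```python
def check_n_samples(X, y):
    """Raise a ValueError that reports both sizes when X and y disagree.

    Hypothetical helper for illustration only: X is a sequence of feature
    rows and y is a sequence of labels.
    """
    n_rows, n_labels = len(X), len(y)
    if n_rows != n_labels:
        # Put the actual sizes in the message so the user does not need
        # a debugger or a print statement to find the mismatch.
        raise ValueError(
            "X and y have incompatible shapes: X has %d samples, "
            "but y has %d labels" % (n_rows, n_labels)
        )
    return n_rows
```

With a message like this, the mismatch is visible in the traceback itself.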
Thanks to Lars for putting this together and to Olivier for the review. That indeed looks good for merge after Olivier's concerns have been addressed. Regarding
No link for reference to Manning et al. since nlp.stanford.edu seems to be down. Should add this later.
@ogrisel: the
@mblondel: I can imagine an
OK, this branch looks great to me. +1 for merge.
Naive Bayes classifier for multivariate Bernoulli (binary features)
Here's a version of the naive Bayes classifier for data following multivariate Bernoulli distributions; in short, categorical data with binary features. If handed any other kind of data, it binarizes the input using a threshold that the user can set.
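To make the model concrete, here is a minimal pure-Python sketch of a multivariate Bernoulli naive Bayes with thresholded binarization. It is not the code in this pull request: the class name `TinyBernoulliNB`, the parameter names, and the additive smoothing with `alpha=1.0` are assumptions made for illustration.

```python
import math
from collections import defaultdict


def binarize(X, threshold=0.0):
    """Map every feature value to 1 if it exceeds the threshold, else 0."""
    return [[1 if v > threshold else 0 for v in row] for row in X]


class TinyBernoulliNB:
    """Illustrative multivariate Bernoulli naive Bayes (not the PR's code)."""

    def __init__(self, alpha=1.0, binarize_threshold=0.0):
        self.alpha = alpha  # additive smoothing; an assumption of this sketch
        self.binarize_threshold = binarize_threshold

    def fit(self, X, y):
        X = binarize(X, self.binarize_threshold)
        n_features = len(X[0])
        counts = defaultdict(lambda: [0] * n_features)  # "on" counts per class
        n_docs = defaultdict(int)                       # samples per class
        for row, label in zip(X, y):
            n_docs[label] += 1
            for i, v in enumerate(row):
                counts[label][i] += v
        self.classes_ = sorted(n_docs)
        total = len(y)
        self.log_prior_ = {c: math.log(n_docs[c] / total) for c in self.classes_}
        # P(feature i is on | class c), smoothed over the two outcomes on/off.
        self.prob_ = {
            c: [(counts[c][i] + self.alpha) / (n_docs[c] + 2 * self.alpha)
                for i in range(n_features)]
            for c in self.classes_
        }
        return self

    def joint_log_likelihood(self, row):
        """Return {class: log P(class) + log P(row | class)} for one sample."""
        scores = {}
        for c in self.classes_:
            score = self.log_prior_[c]
            for v, p in zip(row, self.prob_[c]):
                # Absent features contribute log(1 - p); this is what
                # separates the Bernoulli model from the multinomial one.
                score += math.log(p) if v else math.log(1.0 - p)
            scores[c] = score
        return scores

    def predict(self, X):
        X = binarize(X, self.binarize_threshold)
        predictions = []
        for row in X:
            scores = self.joint_log_likelihood(row)
            predictions.append(max(scores, key=scores.get))
        return predictions


# Count-valued toy data; it is binarized at the default threshold of 0.
X_train = [[3, 0, 0], [2, 1, 0], [0, 0, 4], [0, 1, 5]]
y_train = ["a", "a", "b", "b"]
clf = TinyBernoulliNB().fit(X_train, y_train)
print(clf.predict([[5, 0, 0], [0, 0, 1]]))  # prints ['a', 'b']
```

The explicit per-feature loop in `joint_log_likelihood` keeps the sketch readable; it is not meant to be fast.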
BernoulliNB is a separate class rather than a different "mode" of MultinomialNB mostly because the parameter interaction needed to implement both models in a single class would, I think, be too complicated to describe. So, "complex is better than complicated" beats "flat is better than nested" here.
BernoulliNB performs slightly better in the document classification example on four classes; this may be a coincidence. It's also slower, but a pipeline using it might be a lot faster with something like an OccurrenceVectorizer (TODO).
The documentation may need some expanding.
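On the speed difference: the commit log attributes it to a loop in _joint_log_likelihood. As a hedged aside, and not something this pull request states, the Bernoulli log-likelihood can be rearranged so that its only data-dependent term is a dot product. The symbols are introduced here for illustration: $x_i \in \{0, 1\}$ is the binarized feature $i$ and $p_{ci} = P(x_i = 1 \mid c)$.

```latex
\log P(x \mid c)
  = \sum_{i} \bigl[ x_i \log p_{ci} + (1 - x_i) \log (1 - p_{ci}) \bigr]
  = \sum_{i} x_i \bigl[ \log p_{ci} - \log (1 - p_{ci}) \bigr]
    + \sum_{i} \log (1 - p_{ci})
```

The last sum does not depend on $x$, so it can be precomputed once per class. The first sum is then a single matrix product over all samples and classes.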