Naive Bayes classifier for multivariate Bernoulli (binary features) by larsmans · Pull Request #210 · scikit-learn/scikit-learn

larsmans · 2011-06-11T17:57:29Z

Here's a version of the naive Bayes classifier for data following multivariate Bernoulli distributions; in short, categorical data with binary features. If handed any other kind of data, it binarizes it using a threshold that can be set by the user.
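To make the thresholding concrete, here is a minimal sketch of what binarizing count data could look like. The helper name `binarize` and the strict "greater than" convention are assumptions made for illustration, not code taken from this branch.

```python
import numpy as np

def binarize(X, threshold=0.0):
    """Hypothetical helper: 1 where a value exceeds `threshold`, else 0."""
    X = np.asarray(X, dtype=float)
    # Assumption: values strictly above the threshold count as "present".
    return (X > threshold).astype(int)

# Toy word-count matrix: 2 documents, 3 features.
counts = np.array([[0, 3, 1],
                   [2, 0, 0]])

binary_default = binarize(counts)               # any nonzero count becomes 1
binary_strict = binarize(counts, threshold=1.0) # only counts above 1 become 1
```

With the default threshold every nonzero count maps to 1; raising the threshold keeps only the features that occur more often.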

BernoulliNB is a separate class rather than a different "mode" of MultinomialNB mostly because the parameter interaction needed to implement both models in a single class would, I think, be too complicated to describe. So, "complex is better than complicated" beats "flat is better than nested" here.

BernoulliNB performs slightly better in the document classification example on four classes; this may be a coincidence. It's also slower, but a pipeline using it might be a lot faster with something like an OccurrenceVectorizer (TODO).

The documentation may need some expanding.

Requires ogrisel's preprocessing-simplification branch, pull request scikit-learn#193.

Still two orders of magnitude slower than MultinomialNB because of loop in _joint_log_likelihood.
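For context, a loop of this kind can in principle be replaced by a single matrix product. The sketch below is one standard way to do that for the Bernoulli model, not necessarily what this branch ends up doing; the names `jll_loop`, `jll_vectorized`, `log_prior` and `feature_prob` are hypothetical, and `feature_prob[c, j]` is assumed to hold P(x_j = 1 | class c) with values strictly between 0 and 1.

```python
import numpy as np

def jll_loop(X, log_prior, feature_prob):
    """Reference version: explicit loop over samples and classes."""
    n_samples = X.shape[0]
    n_classes = feature_prob.shape[0]
    out = np.zeros((n_samples, n_classes))
    for i in range(n_samples):
        for c in range(n_classes):
            p = feature_prob[c]
            out[i, c] = log_prior[c] + np.sum(
                X[i] * np.log(p) + (1 - X[i]) * np.log(1 - p))
    return out

def jll_vectorized(X, log_prior, feature_prob):
    """Same quantity computed with one matrix product."""
    log_p = np.log(feature_prob)
    log_not_p = np.log(1.0 - feature_prob)
    # sum_j [x_j log p_j + (1 - x_j) log(1 - p_j)]
    #   = x . (log p - log(1 - p)) + sum_j log(1 - p_j)
    return np.dot(X, (log_p - log_not_p).T) + log_not_p.sum(axis=1) + log_prior

# Toy data: 5 binary samples, 4 features, 3 classes with a uniform prior.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5, 4))
feature_prob = rng.uniform(0.1, 0.9, size=(3, 4))
log_prior = np.log(np.full(3, 1.0 / 3.0))

same = np.allclose(jll_loop(X, log_prior, feature_prob),
                   jll_vectorized(X, log_prior, feature_prob))
```

The rewrite moves the "feature absent" terms into a per-class constant, so the only sample-dependent work is the product with `X`, which is the part that benefits when `X` is sparse.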

Conflicts: scikits/learn/naive_bayes.py

Maybe GaussianNB should share some of the code as well (?)

All tests pass.

ogrisel · 2011-06-11T23:43:08Z

I have the following broken tests:

======================================================================
FAIL: Doctest: scikits.learn.naive_bayes.BernoulliNB
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/doctest.py", line 2166, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for scikits.learn.naive_bayes.BernoulliNB
  File "/home/ogrisel/coding/scikit-learn/scikits/learn/naive_bayes.py", line 411, in BernoulliNB

----------------------------------------------------------------------
File "/home/ogrisel/coding/scikit-learn/scikits/learn/naive_bayes.py", line 464, in scikits.learn.naive_bayes.BernoulliNB
Failed example:
    clf.fit(X, Y)
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest scikits.learn.naive_bayes.BernoulliNB[5]>", line 1, in <module>
        clf.fit(X, Y)
      File "/home/ogrisel/coding/scikit-learn/scikits/learn/naive_bayes.py", line 254, in fit
        N_c, N_c_i = self._count(X, Y)
      File "/home/ogrisel/coding/scikit-learn/scikits/learn/naive_bayes.py", line 496, in _count
        N_c_i = safe_sparse_dot(Y.T, X)
      File "/home/ogrisel/coding/scikit-learn/scikits/learn/utils/extmath.py", line 100, in safe_sparse_dot
        return np.dot(a,b)
    ValueError: matrices are not aligned
----------------------------------------------------------------------
File "/home/ogrisel/coding/scikit-learn/scikits/learn/naive_bayes.py", line 466, in scikits.learn.naive_bayes.BernoulliNB
Failed example:
    print clf.predict(X[2])
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest scikits.learn.naive_bayes.BernoulliNB[6]>", line 1, in <module>
        print clf.predict(X[2])
      File "/home/ogrisel/coding/scikit-learn/scikits/learn/naive_bayes.py", line 277, in predict
        joint_log_likelihood = self._joint_log_likelihood(X)
      File "/home/ogrisel/coding/scikit-learn/scikits/learn/naive_bayes.py", line 508, in _joint_log_likelihood
        n_classes, n_features = self.coef_.shape
    AttributeError: 'BernoulliNB' object has no attribute 'coef_'

>>  raise self.failureException(self.format_failure(<StringIO.StringIO instance at 0x3e8a0e0>.getvalue()))

larsmans · 2011-06-12T10:14:25Z

Should work now.

ogrisel · 2011-06-12T11:40:19Z

It does. For the documentation, I found the following minor issues:

BernoulliNB is not clickable the way MultinomialNB is (BernoulliNB should be added to the class reference documentation under doc/modules/classes.rst next to the other naive_bayes estimators).
Once the documentation is built using cd doc && make clean html, the page locations in the reference section of BernoulliNB are interpreted as mailto:-style links. It would also be great to put real links to PDF files if available (see other chapters such as clustering.rst for examples of the linking syntax).

It would also be great to add a docstring for the BaseDiscreteNB abstract class giving an overview of the parts that are common to MultinomialNB and BernoulliNB and the parts that are specific to each (i.e. the _count and _joint_log_likelihood estimation). Speaking of which, BernoulliNB._joint_log_likelihood is missing a docstring. Also in that method:

scikits/learn/naive_bayes.py:503: local variable 'sparse' is assigned to but never used

Other than that it looks fine to me. Thanks for this contribution. I am +1 for merging once those glitches are fixed.

ogrisel · 2011-06-12T11:45:48Z

Also rather than:

assert n_features_X == n_features, \
    "Shape of samples doesn't match shape of training data"

I would prefer something like:

if n_features_X != n_features:
    raise ValueError("Expected input with n_features = %d, got %d instead"
                     % (n_features, n_features_X))

This makes it easier for users to quickly spot shape issues in their data without having to add a breakpoint or print statement and rerun their program to gain that runtime knowledge.
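A small self-contained sketch of the suggestion above, showing the message a caller would see. The function name `check_n_features` is hypothetical and used only for illustration. One further point in favor of an explicit exception: `assert` statements are skipped entirely when Python runs with `-O`, whereas a raised `ValueError` is not.

```python
def check_n_features(n_features_X, n_features):
    """Hypothetical helper: fail loudly when feature counts disagree."""
    if n_features_X != n_features:
        raise ValueError("Expected input with n_features = %d, got %d instead"
                         % (n_features, n_features_X))

# A model trained on 5 features that is handed samples with 3 features.
try:
    check_n_features(3, 5)
except ValueError as exc:
    message = str(exc)
```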

mblondel · 2011-06-12T16:43:57Z

Thanks to Lars for putting this together and to Olivier for the review. That indeed looks good for merge after Olivier's concerns have been addressed.

Regarding OccurrenceVectorizer, I'm not sure what its purpose would be, since you can just use Binarizer after CountVectorizer in a pipeline.

No link for reference to Manning et al. since nlp.stanford.edu seems to be down. Should add this later.

larsmans · 2011-06-12T19:52:05Z

@ogrisel: the mailto: links seem to be caused by a Unicode issue. I replaced the fancy dashes with good old -. Should be alright now.

larsmans · 2011-06-12T19:53:19Z

@mblondel: I can imagine an OccurrenceVectorizer to be faster than that. I have no plans to implement it, though.
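To illustrate the two routes being discussed, here is a pure-Python sketch that builds the same binary document-term rows either by counting tokens and then thresholding, or by recording occurrence directly. Both function names and the whitespace tokenization are assumptions for illustration; the sketch shows only that the outputs agree and says nothing about which route is faster in practice.

```python
from collections import Counter

def build_vocab(docs):
    """Sorted list of whitespace-separated tokens seen in any document."""
    return sorted({tok for doc in docs for tok in doc.split()})

def count_then_binarize(docs):
    """Route 1: count every token, then map positive counts to 1."""
    vocab = build_vocab(docs)
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([1 if counts[tok] > 0 else 0 for tok in vocab])
    return vocab, rows

def occurrence_vectorize(docs):
    """Route 2: record only whether each token occurs at all."""
    vocab = build_vocab(docs)
    rows = []
    for doc in docs:
        present = set(doc.split())
        rows.append([1 if tok in present else 0 for tok in vocab])
    return vocab, rows

docs = ["spam spam eggs", "eggs ham"]
vocab_a, rows_a = count_then_binarize(docs)
vocab_b, rows_b = occurrence_vectorize(docs)
```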

ogrisel · 2011-06-13T07:09:16Z

Ok this branch looks great to me +1 for merge.

Naive Bayes classifier for multivariate Bernoulli (binary features)

larsmans and others added 11 commits June 1, 2011 17:21

Added naive Bayes classifier for multivariate Bernoulli models

5f87e43

some documentation for BernoulliNB

3522d79

Do binarizing in BernoulliNB

77817a8

Requires ogrisel's preprocessing-simplification branch, pull request scikit-learn#193.

Simplify binarizing in BernoulliNB

657b41d

Merge branch 'master' into bernoulli-naive-bayes

1ce86f3

Optimize BernoulliNB + improve docstring + add to doc-class example

7880af7

Still two orders of magnitude slower than MultinomialNB because of loop in _joint_log_likelihood.

Merge branch 'master' into bernoulli-naive-bayes

f035f4e

Conflicts: scikits/learn/naive_bayes.py

Refactor MultinomialNB and BernoulliNB: introduce BaseDiscreteNB

794605a

Maybe GaussianNB should share some of the code as well (?)

vectorize loop in BernoulliNB for 100x speedup in sparse case

8e7369c

Extend MultinomialNB tests to BernoulliNB

ea11518

All tests pass.

Update BernoulliNB docs

5ab1fee

BUG: broken doctest in BernoulliNB

c08faad

Glitches in BernoulliNB and DiscreteNB (mostly docs)

c92d52c

No link for reference to Manning et al. since nlp.stanford.edu seems to be down. Should add this later.

larsmans added a commit that referenced this pull request Jun 13, 2011

Merge pull request #210 from larsmans/bernoulli-naive-bayes

841a295

Naive Bayes classifier for multivariate Bernoulli (binary features)

larsmans merged commit 841a295 into scikit-learn:master Jun 13, 2011

larsmans merged 13 commits into scikit-learn:master from larsmans:bernoulli-naive-bayes

3 participants
