
Naive Bayes classifier for multivariate Bernoulli (binary features)#210

Merged
larsmans merged 13 commits into scikit-learn:master from larsmans:bernoulli-naive-bayes
Jun 13, 2011

@larsmans
Member

Here's a version of the naive Bayes classifier for data following multivariate Bernoulli distributions; in short, categorical data with binary features. If handed any other kind of data, it binarizes the input using a threshold that the user can set.
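
To make the description above concrete, here is a minimal pure-Python sketch of a multivariate Bernoulli naive Bayes model with input binarization. It is not the code from this pull request: the function names, the smoothing parameter alpha, and the toy data are assumptions chosen for illustration.

```python
import math

def binarize(X, threshold=0.0):
    """Map every value strictly above `threshold` to 1, everything else to 0."""
    return [[1 if v > threshold else 0 for v in row] for row in X]

def fit_bernoulli_nb(X, y, alpha=1.0, threshold=0.0):
    """Estimate log priors and smoothed per-class feature probabilities.

    Hypothetical helper for illustration; not the BernoulliNB API.
    """
    Xb = binarize(X, threshold)
    classes = sorted(set(y))
    n_features = len(Xb[0])
    log_prior, feature_prob = {}, {}
    for c in classes:
        rows = [row for row, label in zip(Xb, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(Xb))
        # Additive smoothing over the two outcomes (feature present / absent).
        feature_prob[c] = [
            (sum(row[j] for row in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
    return classes, log_prior, feature_prob, threshold

def predict_bernoulli_nb(model, x):
    """Return the class with the highest joint log-likelihood for sample x."""
    classes, log_prior, feature_prob, threshold = model
    xb = binarize([x], threshold)[0]

    def joint_log_likelihood(c):
        # Absent features (x_j = 0) also contribute, through log(1 - p).
        return log_prior[c] + sum(
            math.log(p if v else 1.0 - p)
            for v, p in zip(xb, feature_prob[c])
        )

    return max(classes, key=joint_log_likelihood)

# Toy count data; values above the threshold become 1 before fitting.
X = [[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 1, 0]]
y = ["a", "a", "b", "b"]
model = fit_bernoulli_nb(X, y)
prediction = predict_bernoulli_nb(model, [5, 0, 0])
print(prediction)
```

The explicit log(1 - p) term for absent features is what separates this model from the multinomial one, where only the features that occur contribute to the score.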

BernoulliNB is a separate class rather than a different "mode" of MultinomialNB mostly because the parameter interaction needed to implement both models in a single class would, I think, be too complicated to describe. So, "complex is better than complicated" beats "flat is better than nested" here.

BernoulliNB performs slightly better in the document classification example on four classes; this may be a coincidence. It's also slower, but a pipeline using it might be a lot faster with something like an OccurrenceVectorizer (TODO).

The documentation may need some expanding.

@ogrisel
Member

ogrisel commented Jun 11, 2011

I have the following broken tests:

======================================================================
FAIL: Doctest: scikits.learn.naive_bayes.BernoulliNB
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/doctest.py", line 2166, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for scikits.learn.naive_bayes.BernoulliNB
  File "/home/ogrisel/coding/scikit-learn/scikits/learn/naive_bayes.py", line 411, in BernoulliNB

----------------------------------------------------------------------
File "/home/ogrisel/coding/scikit-learn/scikits/learn/naive_bayes.py", line 464, in scikits.learn.naive_bayes.BernoulliNB
Failed example:
    clf.fit(X, Y)
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest scikits.learn.naive_bayes.BernoulliNB[5]>", line 1, in <module>
        clf.fit(X, Y)
      File "/home/ogrisel/coding/scikit-learn/scikits/learn/naive_bayes.py", line 254, in fit
        N_c, N_c_i = self._count(X, Y)
      File "/home/ogrisel/coding/scikit-learn/scikits/learn/naive_bayes.py", line 496, in _count
        N_c_i = safe_sparse_dot(Y.T, X)
      File "/home/ogrisel/coding/scikit-learn/scikits/learn/utils/extmath.py", line 100, in safe_sparse_dot
        return np.dot(a,b)
    ValueError: matrices are not aligned
----------------------------------------------------------------------
File "/home/ogrisel/coding/scikit-learn/scikits/learn/naive_bayes.py", line 466, in scikits.learn.naive_bayes.BernoulliNB
Failed example:
    print clf.predict(X[2])
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest scikits.learn.naive_bayes.BernoulliNB[6]>", line 1, in <module>
        print clf.predict(X[2])
      File "/home/ogrisel/coding/scikit-learn/scikits/learn/naive_bayes.py", line 277, in predict
        joint_log_likelihood = self._joint_log_likelihood(X)
      File "/home/ogrisel/coding/scikit-learn/scikits/learn/naive_bayes.py", line 508, in _joint_log_likelihood
        n_classes, n_features = self.coef_.shape
    AttributeError: 'BernoulliNB' object has no attribute 'coef_'

>>  raise self.failureException(self.format_failure(<StringIO.StringIO instance at 0x3e8a0e0>.getvalue()))
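
The first traceback fails in the product Y.T times X inside _count. The thread does not say what caused the misalignment, so the following is only a sketch of the shape requirement, assuming Y is a one-hot class indicator matrix of shape (n_samples, n_classes) and X has shape (n_samples, n_features). The helpers transpose and matmul are hypothetical pure-Python stand-ins, not safe_sparse_dot.

```python
def transpose(M):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    """Plain matrix product; raises if the inner dimensions differ."""
    if len(A[0]) != len(B):
        raise ValueError(
            "matrices are not aligned: %d columns vs %d rows"
            % (len(A[0]), len(B))
        )
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Binary feature matrix, shape (n_samples=4, n_features=3).
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
labels = [0, 0, 1, 1]
n_classes = 2

# One-hot indicator matrix, shape (n_samples=4, n_classes=2).
Y = [[1 if label == c else 0 for c in range(n_classes)] for label in labels]

# Per-class feature counts, shape (n_classes, n_features).
N_c_i = matmul(transpose(Y), X)
print(N_c_i)

# Forgetting the transpose gives (4, 2) times (4, 3), which cannot align.
try:
    matmul(Y, X)
    misaligned = False
except ValueError:
    misaligned = True
```

Each row of N_c_i counts, for one class, how many training samples of that class have each feature switched on.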

@larsmans
Member Author

Should work now.

@ogrisel
Member

ogrisel commented Jun 12, 2011

It does. For the documentation, I found the following minor issues:

  • BernoulliNB is not clickable the way MultinomialNB is (BernoulliNB should be added to the class reference documentation in doc/modules/classes.rst, next to the other naive_bayes estimators).
  • Once the documentation is built using cd doc && make clean html, the page locations in the references section of BernoulliNB are interpreted as mailto:-style links. It would also be great to link to the actual PDF files if available (see other chapters such as clustering.rst for examples of the linking syntax).

It would also be great to add a docstring for the BaseDiscreteNB abstract class giving an overview of the parts that are common to MultinomialNB and BernoulliNB and the parts that are specific to each (i.e. the _count and _joint_log_likelihood estimation). Speaking of which, BernoulliNB._joint_log_likelihood is missing a docstring. Also in that method:

scikits/learn/naive_bayes.py:503: local variable 'sparse' is assigned to but never used

Other than that it looks fine to me. Thanks for this contribution. I am +1 for merging once those glitches are fixed.

@ogrisel
Member

ogrisel commented Jun 12, 2011

Also rather than:

assert n_features_X == n_features, \
    "Shape of samples doesn't match shape of training data"

I would prefer something like:

if n_features_X != n_features:
    raise ValueError("Expected input with n_features = %d, got %d instead"
                     % (n_features, n_features_X))

This makes it easier for users to quickly spot shape issues in their data without having to add a breakpoint or print statement and rerun their program to get that runtime information.
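
The suggestion above can be sketched as a small standalone check. The function name check_n_features is hypothetical and only serves the illustration; the message format is the one proposed in the comment. A further reason to prefer an explicit raise is that assert statements are skipped entirely when Python runs with the -O flag.

```python
def check_n_features(n_features_X, n_features):
    """Raise a descriptive error when the input width differs from training.

    Hypothetical helper; in the pull request this check lives inline.
    """
    if n_features_X != n_features:
        raise ValueError("Expected input with n_features = %d, got %d instead"
                         % (n_features, n_features_X))

# A mismatching input reports both the expected and the actual width.
try:
    check_n_features(n_features_X=5, n_features=3)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```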

@mblondel
Member

Thanks to Lars for putting this together and to Olivier for the review. That indeed looks good for merge after Olivier's concerns have been addressed.

Regarding OccurrenceVectorizer, I'm not sure what its purpose would be, since you can just use Binarizer after CountVectorizer in a pipeline.
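
The two approaches under discussion can be compared with a pure-Python sketch. The functions below are hypothetical stand-ins, not scikit-learn's CountVectorizer or Binarizer, and the fixed vocabulary and whitespace tokenization are simplifying assumptions. The sketch only shows that counting then binarizing gives the same matrix as recording occurrences directly; it does not measure speed.

```python
from collections import Counter

def count_vectorize(docs, vocabulary):
    """Term counts per document, in vocabulary order."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts[term] for term in vocabulary])
    return rows

def binarize(rows, threshold=0):
    """Turn counts into 0/1 occurrence indicators."""
    return [[1 if v > threshold else 0 for v in row] for row in rows]

def occurrence_vectorize(docs, vocabulary):
    """Record presence or absence directly, without counting first."""
    rows = []
    for doc in docs:
        present = set(doc.lower().split())
        rows.append([1 if term in present else 0 for term in vocabulary])
    return rows

docs = ["spam spam eggs", "eggs and ham"]
vocabulary = ["eggs", "ham", "spam"]

two_step = binarize(count_vectorize(docs, vocabulary))
one_step = occurrence_vectorize(docs, vocabulary)
print(two_step == one_step)
```

The one-step variant skips the intermediate count matrix, which is where a dedicated vectorizer could in principle save work.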

No link for reference to Manning et al. since nlp.stanford.edu seems to be down. Should add this later.

@larsmans
Member Author

@ogrisel: the mailto: links seem to be caused by a Unicode issue. I replaced the fancy dashes with good old -. Should be alright now.

@larsmans
Member Author

@mblondel: I can imagine an OccurrenceVectorizer being faster than that. I have no plans to implement it, though.

@ogrisel
Member

ogrisel commented Jun 13, 2011

OK, this branch looks great to me. +1 for merge.

larsmans added a commit that referenced this pull request Jun 13, 2011
Naive Bayes classifier for multivariate Bernoulli (binary features)
@larsmans larsmans merged commit 841a295 into scikit-learn:master Jun 13, 2011