In a Naive Bayes classifier, why bother with smoothing when we have unknown words in the test set?

I came across this question on Stack Overflow a few days ago and found it very interesting, so I thought I’d post the answer here on the blog as well.

Why do we bother with smoothing at all in a Naive Bayes classifier (when we can throw away the unknown features instead)?

The answer to the question is: words are often unknown in some but not all classes.

Say there are two classes M and N with features A, B and C, as follows:

M: A=3, B=1, C=0

(In class M, A appears 3 times, B appears once, and C never appears)

N: A=0, B=1, C=3

(In class N, C appears 3 times, B appears once, and A never appears)

Let’s see what happens when we throw away features that appear zero times.

A) Throw Away Features That Appear Zero Times In Any Class

If we throw away features A and C because each of them appears zero times in at least one class (A in class N, C in class M), then we are left with only feature B to classify documents with.

And losing that information is a bad thing, as we will see below!

Say we’re presented with a test document as follows:

B=1, C=3

(It contains B once and C three times)

Since we’ve discarded features A and C, the only evidence left is B, and B appears once in each class, so we won’t be able to tell whether the above document belongs to class M or class N (even though its three occurrences of C clearly point to N).

So, losing any feature information is a bad thing!
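To make this concrete, here is a minimal sketch in plain Python (the toy counts and helper names are my own, not from the original question). With only feature B surviving, both classes assign the test document exactly the same likelihood, so the classifier has nothing to go on.

from math import prod

# Toy training counts from the example above.
counts = {
    "M": {"A": 3, "B": 1, "C": 0},
    "N": {"A": 0, "B": 1, "C": 3},
}

# Strategy A: keep only the features that are non-zero in every class.
kept = [f for f in ("A", "B", "C")
        if all(counts[c][f] > 0 for c in counts)]  # only "B" survives

def likelihood(label, doc):
    # Multinomial likelihood over the surviving features only
    # (priors and the multinomial coefficient are ignored for simplicity).
    total = sum(counts[label][f] for f in kept)
    return prod((counts[label][f] / total) ** doc.get(f, 0) for f in kept)

test_doc = {"B": 1, "C": 3}
print(kept)                       # ['B']
print(likelihood("M", test_doc))  # 1.0
print(likelihood("N", test_doc))  # 1.0 -- identical, so M and N are indistinguishable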

B) Throw Away Features That Appear Zero Times In All Classes

Is it possible to get around this problem by discarding only those features that appear zero times in all of the classes?

No, because that would create its own problems!

The following test document illustrates what would happen if we did that:

A=3, B=1, C=1

The probabilities of both M and N would become zero, because we kept feature A (which has a zero count in class N) and feature C (which has a zero count in class M), and each of those zero counts turns one class’s product of feature probabilities into zero.
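Here is the same kind of sketch for this case (again my own toy helper, not code from the question): because A and C are kept, their zero counts make the unsmoothed likelihood collapse to zero for both classes.

from math import prod

# Toy training counts from the example above.
counts = {
    "M": {"A": 3, "B": 1, "C": 0},
    "N": {"A": 0, "B": 1, "C": 3},
}

def unsmoothed_likelihood(label, doc):
    # Raw relative frequencies, no smoothing: any feature unseen in the
    # class contributes a factor of zero to the product.
    total = sum(counts[label].values())
    return prod((counts[label][f] / total) ** n for f, n in doc.items())

test_doc = {"A": 3, "B": 1, "C": 1}
print(unsmoothed_likelihood("M", test_doc))  # 0.0 (C was never seen in M)
print(unsmoothed_likelihood("N", test_doc))  # 0.0 (A was never seen in N)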

C) Don’t Throw Anything Away – Use Smoothing Instead

Smoothing allows us to classify both of the above documents correctly (as the sketch after this list illustrates) because:

  1. We do not lose count information in classes where such information is available and
  2. We do not have to contend with zero counts.
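Here is a minimal sketch of add-one (Laplace) smoothing over the same toy counts; the choice of add-one smoothing and the helper names are mine, but any standard smoothing scheme would do. With smoothing, every feature gets a non-zero probability in every class, and both test documents come out with the intuitively right label.

from math import prod

# Toy training counts from the example above.
counts = {
    "M": {"A": 3, "B": 1, "C": 0},
    "N": {"A": 0, "B": 1, "C": 3},
}
vocab = ("A", "B", "C")

def smoothed_likelihood(label, doc, alpha=1.0):
    # Add-alpha smoothing: every count is bumped by alpha, so no
    # feature probability is ever exactly zero.
    total = sum(counts[label].values()) + alpha * len(vocab)
    return prod(((counts[label][f] + alpha) / total) ** n
                for f, n in doc.items())

def classify(doc):
    # Uniform class priors for simplicity; pick the class with the
    # largest smoothed likelihood.
    return max(counts, key=lambda label: smoothed_likelihood(label, doc))

print(classify({"B": 1, "C": 3}))          # N (the three C's dominate)
print(classify({"A": 3, "B": 1, "C": 1}))  # M (the three A's dominate)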

Naive Bayes Classifiers In Practice

The Naive Bayes classifier in NLTK throws away features that have zero counts in all of the classes.

This makes it perform poorly when trained using a hard EM procedure (where the classifier is bootstrapped up from very little training data) on data sets with a high occurrence of unknown words (like Twitter feeds).

Here is the relevant code from NLTK 3.0:

def prob_classify(self, featureset):
  # Discard any feature names that we've never seen before.
  # Otherwise, we'll just assign a probability of 0 to
  # everything.
  featureset = featureset.copy()
  for fname in list(featureset.keys()):
    for label in self._labels:
      if (label, fname) in self._feature_probdist:
        break
    else:
      #print 'Ignoring unseen feature %s' % fname
      del featureset[fname]

Related Blog Posts:

  1. Analysing documents for non-obvious differences

  2. Naive Bayes Classifier in OpenNLP

 
