I came across this question on Stack Overflow a few days ago, and found it very interesting, so I thought I’d post the answer to the question here on the blog as well.
Why do we bother with smoothing at all in a Naive Bayes classifier (when we can throw away the unknown features instead).
The answer to the question is: words are often unknown in some but not all classes.
Say there are two classes M and N with features A, B and C, as follows:
M: A=3, B=1, C=0
(In the class M, A appears 3 times and B only once)
N: A=0, B=1, C=3
(In the class N, C appears 3 times and B only once)
Let’s see what happens when we throw away features that appear zero times.
A) Throw Away Features That Appear Zero Times In Any Class
If we throw away features A and C because they appear zero times in any of the classes, then we are only left with feature B to classify documents with.
And losing that information is a bad thing as we will see below!
Say we’re presented with a test document as follows:
(It contains B once and C three times)
Since we’ve discarded the features A and C, we won’t be able to tell whether the above document belongs to class M or class N.
So, losing any feature information is a bad thing!
B) Throw Away Features That Appear Zero Times In All Classes
Is it possible to get around this problem by discarding only those features that appear zero times in all of the classes?
No, because that would create its own problems!
The following test document illustrates what would happen if we did that:
A=3, B=1, C=1
The probability of M and N would both become zero (because we did not throw away the zero probability of A in class N and the zero probability of C in class M).
C) Don’t Throw Anything Away – Use Smoothing Instead
Smoothing allows us to classify both the above documents correctly because:
- We do not lose count information in classes where such information is available and
- We do not have to contend with zero counts.
Naive Bayes Classifiers In Practice
The Naive Bayes classifier in NLTK throws away features that have zero counts in all of the classes.
This makes it perform poorly when trained using a hard EM procedure (where the classifier is bootstrapped up from very little training data) on data sets with a high occurrence of unknown words (like Twitter feeds).
Here is the relevant code from NLTK 3.0:
def prob_classify(self, featureset):
# Discard any feature names that we've never seen before.
# Otherwise, we'll just assign a probability of 0 to
featureset = featureset.copy()
for fname in list(featureset.keys()):
for label in self._labels:
if (label, fname) in self._feature_probdist:
#print 'Ignoring unseen feature %s' % fname
Related Blog Posts: