
Deep Learning for NLP

Deep learning is usually associated with neural networks.

In this article, we show that generative classifiers are also capable of deep learning.

What is deep learning?

Deep learning is a method of machine learning involving the use of multiple processing layers to learn non-linear functions or boundaries.

What are generative classifiers?

Generative classifiers use Bayes' rule to invert the probabilities of the features F given a class c into a prediction of the class c given the features F.

The class predicted by the classifier is the one yielding the highest P(c|F).

A commonly used generative classifier is the Naive Bayes classifier.  It has two layers (one for the features F and one for the classes C).
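In symbols, the inversion and the Naive Bayes independence assumption can be written as follows. The symbol \hat{c} for the predicted class and the index f over individual features are notation introduced here for illustration.

```latex
P(c \mid F) = \frac{P(F \mid c)\, P(c)}{P(F)},
\qquad
\hat{c} = \arg\max_{c} P(c \mid F) = \arg\max_{c} P(c) \prod_{f \in F} P(f \mid c)
```

The denominator P(F) is the same for every class, so it can be dropped when comparing classes. The product over features is where the "naive" assumption enters: features are treated as independent given the class.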

Deep learning using generative classifiers

The first thing you need for deep learning is a hidden layer.  So you add one more layer H between the C and F layers to get a Hierarchical Bayesian classifier (HBC).

Now, you can compute P(c|F) in an HBC in two ways:

The first equation computes P(c|F) using a product of sums (POS).  The second equation computes P(c|F) using a sum of products (SOP).
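One way to write the two forms is sketched here. It assumes hidden nodes h in the layer H, with parameters P(h|c) linking classes to hidden nodes and P(f|h) linking hidden nodes to features. This is a reconstruction from the descriptions in this post, so the notation of the original equations may differ.

```latex
% First equation (POS): a product over features of sums over hidden nodes
P(c \mid F) \propto P(c) \prod_{f \in F} \sum_{h \in H} P(f \mid h)\, P(h \mid c)

% Second equation (SOP): a sum over hidden nodes of products over features
P(c \mid F) \propto P(c) \sum_{h \in H} P(h \mid c) \prod_{f \in F} P(f \mid h)
```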

POS Equation

We discovered something very interesting about these two equations.

It turns out that if you use the first equation, the HBC reduces to a Naive Bayes classifier. Such an HBC can only learn linear (or quadratic) decision boundaries.
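The reduction can be seen by marginalising over the hidden layer. The sketch below assumes the product-of-sums form P(c) \prod_f \sum_h P(f|h) P(h|c), with h ranging over the hidden nodes, and assumes that a feature depends on the class only through the hidden node. Both are assumptions stated for illustration.

```latex
\sum_{h \in H} P(f \mid h)\, P(h \mid c) = P(f \mid c)
\quad\Longrightarrow\quad
P(c) \prod_{f \in F} \sum_{h \in H} P(f \mid h)\, P(h \mid c)
  = P(c) \prod_{f \in F} P(f \mid c)
```

The right-hand side is the Naive Bayes score, so in this form the hidden layer adds no expressive power.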

Consider the discrete XOR-like function shown in Figure 1.

There is no way to separate the black dots from the white dots using one straight line.

Such a pattern can only be classified 100% correctly by a non-linear classifier.

If you train a multinomial Naive Bayes classifier on the data in Figure 1, you get the decision boundary seen in Figure 2a.

Note that the dotted area represents class 1 and the clear area represents class 0.

It can be seen that no matter what the angle of the line is, at least one point of the four will be misclassified.

In this instance, it is the point at {5, 1} that is misclassified as 0 (since the clear area represents the class 0).

You get the same result if you use a POS HBC.
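The failure on the XOR-like pattern can be reproduced with a few lines of NumPy. This is a minimal sketch, not the code behind Figure 2a. It assumes that the two coordinates are treated as feature counts, that {1, 1} and {5, 5} belong to class 0 and {5, 1} and {1, 5} to class 1, and that Lidstone smoothing with alpha = 0.5 is used. The function names are illustrative.

```python
import numpy as np

# Four XOR-like points (read as feature counts) and their assumed labels.
X = np.array([[1, 1], [5, 5], [5, 1], [1, 5]], dtype=float)
y = np.array([0, 0, 1, 1])


def train_multinomial_nb(X, y, alpha=0.5):
    """Estimate log priors and Lidstone-smoothed log feature probabilities."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    counts = np.array([X[y == c].sum(axis=0) for c in classes])
    smoothed = counts + alpha
    log_lik = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
    return classes, log_prior, log_lik


def predict(X, classes, log_prior, log_lik):
    """Score each class as log P(c) + sum_f count_f * log P(f | c)."""
    scores = X @ log_lik.T + log_prior
    return classes[np.argmax(scores, axis=1)]


classes, log_prior, log_lik = train_multinomial_nb(X, y)
pred = predict(X, classes, log_prior, log_lik)
accuracy = float((pred == y).mean())
print("predictions:", pred, "accuracy:", accuracy)
```

Because the log-score of a multinomial Naive Bayes classifier is linear in the feature counts, no choice of parameters gets all four points right. In this particular sketch the two classes even end up with identical feature totals, so which points are misclassified can differ from Figure 2a.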

SOP Equation

Our research showed us that something amazing happens if you use the second equation.

With the “sum of products” equation, the HBC becomes capable of deep learning.

SOP + Multinomial Distribution

The decision boundary learnt by a multinomial non-linear HBC (one that computes the posterior using a sum of products of the hidden-node conditional feature probabilities) is shown in Figure 2b.

The boundary consists of two straight lines passing through the origin. They are angled in such a way that they separate the data points into the two required categories.

All four points are classified correctly since the points at {1, 1} and {5, 5} fall in the clear conical region which represents a classification of 0 whereas the other two points fall in the dotted region representing class 1.

Therefore, the multinomial non-linear hierarchical Bayes classifier can learn the non-linear function of Figure 1.
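To make the difference concrete, the sketch below scores the four points with both forms, using the same parameters. Everything here is an assumption for illustration: the three hidden nodes, the hand-picked probabilities (they are not learned from data), the labels of the points, and the function names sop_scores and pos_scores.

```python
import numpy as np

# XOR-like points read as feature counts; labels as assumed for Figure 1.
X = np.array([[1, 1], [5, 5], [5, 1], [1, 5]])
y = np.array([0, 0, 1, 1])

# Hypothetical hand-picked parameters.
# P(f | h): rows are hidden nodes, columns are the two features.
p_f_given_h = np.array([
    [0.5, 0.5],  # h0: both features equally likely
    [0.9, 0.1],  # h1: mostly feature 1
    [0.1, 0.9],  # h2: mostly feature 2
])
# P(h | c): rows are classes, columns are hidden nodes.
p_h_given_c = np.array([
    [1.0, 0.0, 0.0],  # class 0 uses only the balanced node
    [0.0, 0.5, 0.5],  # class 1 mixes the two skewed nodes
])
p_c = np.array([0.5, 0.5])


def sop_scores(x):
    """Sum over hidden nodes of the product over features."""
    per_hidden = np.prod(p_f_given_h ** x, axis=1)
    return p_c * (p_h_given_c @ per_hidden)


def pos_scores(x):
    """Product over features of the sum over hidden nodes."""
    p_f_given_c = p_h_given_c @ p_f_given_h
    return p_c * np.prod(p_f_given_c ** x, axis=1)


sop_pred = np.array([int(np.argmax(sop_scores(x))) for x in X])
pos_pred = np.array([int(np.argmax(pos_scores(x))) for x in X])
print("SOP predictions:", sop_pred, "accuracy:", (sop_pred == y).mean())
print("POS predictions:", pos_pred, "accuracy:", (pos_pred == y).mean())
```

With these parameters the sum-of-products score favours class 0 inside a cone around the diagonal and class 1 outside it, which is the kind of two-line boundary described for Figure 2b. The product-of-sums score collapses the hidden layer into a single P(f|c) per class, so it cannot classify all four points correctly.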

SOP + Gaussian Distribution

The decision boundary learnt by a Gaussian non-linear HBC is shown in Figure 2c.

The boundary consists of two quadratic curves separating the data points into the required categories.

Therefore, the Gaussian non-linear HBC can also learn the non-linear function depicted in Figure 1.
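A Gaussian version of the same idea can be sketched by replacing the multinomial P(f|h) with a one-dimensional Gaussian per feature. The parameters below are hypothetical and hand-set (two hidden nodes per class, centred on the data points, with unit variance), so the resulting boundary is only indicative and need not match Figure 2c.

```python
import numpy as np

X = np.array([[1.0, 1.0], [5.0, 5.0], [5.0, 1.0], [1.0, 5.0]])
y = np.array([0, 0, 1, 1])

# Hypothetical hand-set means: one row per hidden node, one column per feature.
means = {
    0: np.array([[1.0, 1.0], [5.0, 5.0]]),
    1: np.array([[5.0, 1.0], [1.0, 5.0]]),
}
sigma = 1.0        # shared standard deviation (assumption)
p_h_given_c = 0.5  # the two hidden nodes of a class are equally likely
p_c = 0.5          # equal class priors


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated element-wise at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def sop_score(x, c):
    """Sum over hidden nodes of the product over features of N(x_f; mu_hf, sigma)."""
    per_hidden = np.prod(gaussian_pdf(x, means[c], sigma), axis=1)
    return p_c * np.sum(p_h_given_c * per_hidden)


pred = np.array([max((0, 1), key=lambda c: sop_score(x, c)) for x in X])
print("predictions:", pred)
```

Each class-conditional density is a mixture of two Gaussians, which is what lets the classifier place class 0 on the diagonal and class 1 off it.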

Conclusion

SOP HBCs are multilayered (with a layer of hidden nodes) and can learn non-linear decision boundaries, so they can be said to be capable of deep learning.

Applications to NLP

It turns out that the multinomial SOP HBC can outperform a number of linear classifiers at certain tasks.  For more information, read our paper.

Visit Aiaioo Labs

A Naive Bayes classifier that outperforms NLTK’s

We found that by changing the smoothing parameters of a Naive Bayes classifier, we could get far better accuracy numbers for certain tasks.  By changing the Lidstone smoothing parameter from 0.05 to 0.5 or greater, we could go from an accuracy of about 50% to almost 70% on the task of question classification for question answering.

This is not at all surprising because, as described in an earlier post, the smoothing method used in the estimation of probabilities affects Naive Bayes classifiers greatly.

Below, we have provided an implementation of a Naive Bayes classifier which outperforms the Naive Bayes classifier supplied with NLTK 3.0 by almost 10% on the task of classifying questions from the questions-train.txt file supplied with the textbook “Taming Text”.

Our Naive Bayes classifier (with a Lidstone smoothing parameter of 0.5) exhibits about 65% accuracy on the task of question classification, whereas the NLTK classifier has an accuracy of about 40% as shown below.

Finally, I’d like to say a few words about the import of this work.

Theoretically, by increasing the Lidstone smoothing parameter, we are merely compensating more strongly for absent features. In other words, we are reducing the penalty for the absence of a feature in a specific category.

Because increased smoothing lowers the penalty for feature absence, it could help increase the accuracy when a data-set has many low-volume features that do not contribute to predicting a category, but whose chance presence and absence may be construed in the learning phase to be correlated with a category.
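The effect on the penalty can be checked numerically. The sketch below uses the same Lidstone formula as the smooth method of the classifier in this post, (count + alpha) / (total + alpha * vocabulary size). The counts, the vocabulary size, and the helper name absence_penalty are made-up values and names for illustration.

```python
import math


def lidstone(count, total, vocab_size, alpha):
    """Lidstone-smoothed estimate of P(feature | class)."""
    return (count + alpha) / (total + alpha * vocab_size)


def absence_penalty(alpha, seen_count=5, total=100, vocab_size=1000):
    """Log-probability gap between a feature seen in the class and an unseen one."""
    seen = math.log(lidstone(seen_count, total, vocab_size, alpha))
    unseen = math.log(lidstone(0, total, vocab_size, alpha))
    return seen - unseen


for alpha in (0.05, 0.5):
    print("alpha =", alpha, "penalty for an absent feature:", round(absence_penalty(alpha), 3))
```

The gap equals log((count + alpha) / alpha), so it shrinks as alpha grows: a feature that was never seen with a category costs that category less.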

Further investigation is required before we can say whether the aforesaid hypothesis would explain the effect of smoothing on the accuracy of classification in regard to the question classification data-set that we used.

However, this exercise shows that algorithm implementations would do well to leave the choice of Lidstone smoothing parameters to the discretion of the end user of a Naive Bayes classifier.

The source code of our Naive Bayes classifier (using Lidstone smoothing) is provided below:

This implementation of the Naive Bayes classifier was created by Geetanjali Rakshit, an intern at Aiaioo Labs.

```python
import math
import random
import sys

import numpy as np


class Classifier:
    """Base class: maps class labels and features to integer indices."""

    def __init__(self, featureGenerator):
        self.featureGenerator = featureGenerator
        self._C_SIZE = 0
        self._V_SIZE = 0
        self._classes_list = []
        self._classes_dict = {}
        self._vocab = {}

    def setClasses(self, trainingData):
        # Assign each distinct label the next free class index.
        for (label, line) in trainingData:
            if label not in self._classes_dict:
                self._classes_dict[label] = len(self._classes_list)
                self._classes_list.append(label)
        self._C_SIZE = len(self._classes_list)

    def getClasses(self):
        return self._classes_list

    def setVocab(self, trainingData):
        # Assign each distinct feature the next free vocabulary index.
        index = 0
        for (label, line) in trainingData:
            line = self.featureGenerator.getFeatures(line)
            for item in line:
                if item not in self._vocab:
                    self._vocab[item] = index
                    index += 1
        self._V_SIZE = len(self._vocab)

    def getVocab(self):
        return self._vocab

    def train(self, trainingData):
        pass

    def classify(self, testData):
        pass

    def getFeatures(self, data):
        return self.featureGenerator.getFeatures(data)


class FeatureGenerator:
    """Lower-cases the text and splits it on whitespace."""

    def getFeatures(self, text):
        text = text.lower()
        return text.split()


class NaiveBayesClassifier(Classifier):
    """Multinomial Naive Bayes with Lidstone smoothing (parameter alpha)."""

    def __init__(self, fg, alpha=0.05):
        Classifier.__init__(self, fg)
        self.__classParams = []
        self.__params = [[]]
        self.__alpha = alpha

    def getParameters(self):
        return (self.__classParams, self.__params)

    def train(self, trainingData):
        self.setClasses(trainingData)
        self.setVocab(trainingData)
        self.initParameters()

        for (cat, document) in trainingData:
            for feature in self.getFeatures(document):
                self.countFeature(feature, self._classes_dict[cat])

    def countFeature(self, feature, class_index):
        counts = 1
        self._counts_in_class[class_index][self._vocab[feature]] += counts
        self._total_counts[class_index] += counts
        self._norm += counts

    def classify(self, testData):
        post_prob = self.getPosteriorProbabilities(testData)
        return self._classes_list[self.getMaxIndex(post_prob)]

    def getPosteriorProbabilities(self, testData):
        # Unnormalised log posteriors: log P(c) + sum over features of log P(f | c).
        post_prob = np.zeros(self._C_SIZE)
        for i in range(0, self._C_SIZE):
            for feature in self.getFeatures(testData):
                post_prob[i] += self.getLogProbability(feature, i)
            post_prob[i] += self.getClassLogProbability(i)
        return post_prob

    def initParameters(self):
        self._total_counts = np.zeros(self._C_SIZE)
        self._counts_in_class = np.zeros((self._C_SIZE, self._V_SIZE))
        self._norm = 0.0

    def getLogProbability(self, feature, class_index):
        return math.log(self.smooth(self.getCount(feature, class_index),
                                    self._total_counts[class_index]))

    def getCount(self, feature, class_index):
        # Features never seen in training get a count of 0 and rely on smoothing.
        if feature not in self._vocab:
            return 0
        return self._counts_in_class[class_index][self._vocab[feature]]

    def smooth(self, numerator, denominator):
        # Lidstone smoothing: add alpha to every count.
        return (numerator + self.__alpha) / (denominator + (self.__alpha * len(self._vocab)))

    def getClassLogProbability(self, class_index):
        # The class prior is estimated from feature (token) counts, not document counts.
        return math.log(self._total_counts[class_index] / self._norm)

    def getMaxIndex(self, posteriorProbabilities):
        # Ties are resolved in favour of the last class with the maximum score.
        maxi = 0
        maxProb = posteriorProbabilities[maxi]
        for i in range(0, self._C_SIZE):
            if posteriorProbabilities[i] >= maxProb:
                maxProb = posteriorProbabilities[i]
                maxi = i
        return maxi


class Dataset:
    """Reads "<label> <text>" lines, shuffles them and splits into train/test/dev."""

    def __init__(self, filename):
        self.__dataset = []
        with open(filename, "r") as fp:
            for line in fp:
                if line.strip():  # skip blank lines
                    tokens = line.split()
                    cat = tokens[0]
                    sent = " ".join(tokens[1:])
                    self.__dataset.append([cat, sent])
        random.shuffle(self.__dataset)
        self.__D_SIZE = len(self.__dataset)
        self.__trainSIZE = int(0.6 * self.__D_SIZE)
        self.__testSIZE = int(0.3 * self.__D_SIZE)
        # Fixed: the development set is whatever remains after train and test.
        self.__devSIZE = self.__D_SIZE - (self.__trainSIZE + self.__testSIZE)

    def setTrainSize(self, value):
        # value is a percentage of the whole data set.
        self.__trainSIZE = int(value * 0.01 * self.__D_SIZE)
        self.setDevelopmentSize()
        return self.__trainSIZE

    def setTestSize(self, value):
        # value is a percentage of the whole data set.
        self.__testSIZE = int(value * 0.01 * self.__D_SIZE)
        self.setDevelopmentSize()
        return self.__testSIZE

    def setDevelopmentSize(self):
        self.__devSIZE = max(0, self.__D_SIZE - (self.__trainSIZE + self.__testSIZE))
        return self.__devSIZE

    def getDataSize(self):
        return self.__D_SIZE

    def getTrainingData(self):
        return self.__dataset[0:self.__trainSIZE]

    def getTestData(self):
        return self.__dataset[self.__trainSIZE:(self.__trainSIZE + self.__testSIZE)]

    def getDevData(self):
        # Fixed: take the slice after the test data so it does not overlap the training data.
        start = self.__trainSIZE + self.__testSIZE
        return self.__dataset[start:start + self.__devSIZE]

#============================================================================================

if __name__ == "__main__":

    # This Naive Bayes classifier implementation achieves about 10% better accuracy than the
    # NLTK 3.0 Naive Bayes classifier implementation at the task of classifying questions in
    # the question corpus distributed with the book "Taming Text".

    # The "questions-train.txt" file can be found in the source code distributed with the book at https://www.manning.com/books/taming-text.

    # To the best of our knowledge, the improvement in accuracy is owed to the smoothing methods described in our blog:
    # https://aiaioo.wordpress.com/2016/01/29/in-a-naive-bayes-classifier-why-bother-with-smoothing-when-we-have-unknown-words-in-the-test-set/

    filename = "questions-train.txt"

    if len(sys.argv) > 1:
        filename = sys.argv[1]

    data = Dataset(filename)

    data.setTrainSize(50)
    data.setTestSize(50)

    train_set = data.getTrainingData()
    test_set = data.getTestData()

    test_data = [document for (label, document) in test_set]
    actual_labels = [label for (label, document) in test_set]

    fg = FeatureGenerator()
    alpha = 0.5  # Lidstone smoothing parameter

    nbClassifier = NaiveBayesClassifier(fg, alpha)
    nbClassifier.train(train_set)

    correct = 0
    total = 0
    for line in test_data:
        best_label = nbClassifier.classify(line)
        if best_label == actual_labels[total]:
            correct += 1
        total += 1

    acc = 1.0 * correct / total
    print("Accuracy of this Naive Bayes Classifier: " + str(acc))
```
