Tag: classifier

Deep Learning for NLP

Deep learning is usually associated with neural networks.

In this article, we show that generative classifiers are also capable of deep learning.

What is deep learning?

Deep learning is a method of machine learning involving the use of multiple processing layers to learn non-linear functions or boundaries.

What are generative classifiers?

Generative classifiers use the Bayes rule to invert probabilities of the features F given a class c into a prediction of the class c given the features F.

The class predicted by the classifier is the one yielding the highest P(c|F).

A commonly used generative classifier is the Naive Bayes classifier.  It has two layers (one for the features F and one for the classes C).

Deep learning using generative classifiers

The first thing you need for deep learning is a hidden layer.  So you add one more layer H between the C and F layers to get a Hierarchical Bayesian classifier (HBC).

Now, you can compute P(c|F) in a HBC in two ways:

Product of Sums
Computing P(c|F) using a Product of Sums
Sum of Products
Computing P(c|F) using a Sum of Products

The first equation computes P(c|F) using a product of sums (POS).  The second equation computes P(c|F) using a sum of products (SOP).

POS Equation

We discovered something very interesting about these two equations.

It turns out that if you use the first equation, the HBC reduces to a Naive Bayes classifier. Such an HBC can only learn linear (or quadratic) decision boundaries.

Consider the discrete XOR-like function shown in Figure 1.


There is no way to separate the black dots from the white dots using one straight line.

Such a pattern can only be classified 100% correctly by a non-linear classifier.

If you train a multinomial Naive Bayes classifier on the data in Figure 1, you get the decision boundary seen in Figure 2a.

Note that the dotted area represents the class 1 and the clear area represents the class 0.

Multinomial NB Classifier Decision Boundary
Figure 2a: The decision boundary of a multinomial NB classifier (or a POS HBC).

It can be seen that no matter what the angle of the line is, at least one point of the four will be misclassified.

In this instance, it is the point at {5, 1} that is misclassified as 0 (since the clear area represents the class 0).

You get the same result if you use a POS HBC.

SOP Equation

Our research showed us that something amazing happens if you use the second equation.

With the “sum of products” equation, the HBC becomes capable of deep learning.

SOP + Multinomial Distribution

The decision boundary learnt by a multinomial non-linear HBC (one that computes the posterior using a sum of products of the hidden-node conditional feature probabilities) is shown in Figure 2b.

Decision boundary of a SOP HBC.
Figure 2b: Decision boundary learnt by a multinomial SOP HBC.

The boundary consists of two straight lines passing through the origin. They are angled in such a way that they separate the data points into the two required categories.

All four points are classified correctly since the points at {1, 1} and {5, 5} fall in the clear conical region which represents a classification of 0 whereas the other two points fall in the dotted region representing class 1.

Therefore, the multinomial non-linear hierarchical Bayes classifier can learn the non-linear function of Figure 1.

Gaussian Distribution

The decision boundary learnt by a Gaussian nonlinear HBC is shown in Figure 2c.

Decision Boundary of a Gaussian SOP HBC.
Figure 2c: Decision boundary learnt by a SOP HBC based on the Gaussian probability distribution.

The boundary consists of two quadratic curves separating the data points into the required categories.

Therefore, the Gaussian non-linear HBC can also learn the non-linear function depicted in Figure 1.


Since SOP HBCs are multilayered (with a layer of hidden nodes), and can learn non-linear decision boundaries, they can therefore be said to be capable of deep learning.

Applications to NLP

It turns out that the multinomial SOP HBC can outperform a number of linear classifiers at certain tasks.  For more information, read our paper.

Visit Aiaioo Labs

A Naive Bayes classifier that outperforms NLTK’s

We found that by changing the smoothing parameters of a Naive Bayes classifier, we could get far better accuracy numbers for certain tasks.  By changing the Lidstone smoothing parameter from 0.05 to 0.5 or greater, we could go from an accuracy of about 50% to almost 70% on the task of question classification for question answering.

This is not at all surprising because, as described in an earlier post, the smoothing method used in the estimation of probabilities affects Naive Bayes classifiers greatly.

Below, we have provided an implementation of a Naive Bayes classifier which outperforms the Naive Bayes classifier supplied with NLTK 3.o by almost 10% on the task of classifying questions from the questions-train.txt file supplied with the textbook “Taming Text”.

Our Naive Bayes classifier (with a Lidstone smoothing parameter of 0.5) exhibits about 65% accuracy on the task of question classification, whereas the NLTK classifier has an accuracy of about 40% as shown below.


Finally, I’d like to say a few words about the import of this work.

Theoretically, by increasing the Lidstone smoothing parameter, we are merely compensating more strongly for absent features; we are negating the absence of a feature more vigorously;  reducing the penalty for the absence of a feature in a specific category.

Because increased smoothing lowers the penalty for feature absence, it could help increase the accuracy when a data-set has many low-volume features that do not contribute to predicting a category, but whose chance presence and absence may be construed in the learning phase to be correlated with a category.

Further investigation is required before we can say whether the aforesaid hypothesis would explain the effect of smoothing on the accuracy of classification in regard to the question classification data-set that we used.

However, this exercise shows that algorithm implementations would do well to leave the choice of Lidstone smoothing parameters to the discretion of the end user of a Naive Bayes classifier.

The source code of our Naive Bayes classifier (using Lidstone smoothing) is provided below:

This implementation of the Naive Bayes classifier was created by Geetanjali Rakshit, an intern at Aiaioo Labs.


import numpy as np
import random
import sys, math

class Classifier:
	def __init__(self, featureGenerator):
		self.featureGenerator = featureGenerator
		self._C_SIZE = 0
		self._V_SIZE = 0
		self._classes_list = []
		self._classes_dict = {}
		self._vocab = {}

	def setClasses(self, trainingData):
		for(label, line) in trainingData:
			if label not in self._classes_dict.keys():
				self._classes_dict[label] = len(self._classes_list)
		self._C_SIZE = len(self._classes_list)
	def getClasses(self):
		return self._classes_list

	def setVocab(self, trainingData):
		index = 0;
		for (label, line) in trainingData:
			line = self.featureGenerator.getFeatures(line)
			for item in line:
				if(item not in self._vocab.keys()):
					self._vocab[item] = index
					index += 1
		self._V_SIZE = len(self._vocab)

	def getVocab(self):
		return self._vocab

	def train(self, trainingData):

	def classify(self, testData, params):

	def getFeatures(self, data):
		return self.featureGenerator.getFeatures(data)

class FeatureGenerator:
	def getFeatures(self, text):
		text = text.lower()
		return text.split()

class NaiveBayesClassifier(Classifier):
	def __init__(self, fg, alpha = 0.05):
		Classifier.__init__(self, fg)
		self.__classParams = []
		self.__params = [[]]
		self.__alpha = alpha

	def getParameters(self):
		return (self.__classParams, self.__params)

	def train(self, trainingData):

		for (cat, document) in trainingData:
			for feature in self.getFeatures(document):
				self.countFeature(feature, self._classes_dict[cat])

	def countFeature(self, feature, class_index):
		counts = 1
		self._counts_in_class[class_index][self._vocab[feature]] = self._counts_in_class[class_index][self._vocab[feature]] + counts
		self._total_counts[class_index] = self._total_counts[class_index] + counts
		self._norm = self._norm + counts

	def classify(self, testData):
		post_prob = self.getPosteriorProbabilities(testData)
		return self._classes_list[self.getMaxIndex(post_prob)]

	def getPosteriorProbabilities(self, testData):
		post_prob = np.zeros(self._C_SIZE)
		for i in range(0, self._C_SIZE):
			for feature in self.getFeatures(testData):
				post_prob[i] += self.getLogProbability(feature, i)
			post_prob[i] += self.getClassLogProbability(i)
		return post_prob

	def getFeatures(self, testData):
		return self.featureGenerator.getFeatures(testData)

	def initParameters(self):
		self._total_counts = np.zeros(self._C_SIZE)
		self._counts_in_class = np.zeros((self._C_SIZE, self._V_SIZE))
		self._norm = 0.0

	def getLogProbability(self, feature, class_index):
		return math.log(self.smooth(self.getCount(feature, class_index),self._total_counts[class_index]))

	def getCount(self, feature, class_index):
		if feature not in self._vocab.keys():
			return 0
			return self._counts_in_class[class_index][self._vocab[feature]]

	def smooth(self, numerator, denominator):
		return (numerator + self.__alpha) / (denominator + (self.__alpha * len(self._vocab)))

	def getClassLogProbability(self, class_index):
		return math.log(self._total_counts[class_index]/self._norm)

	def getMaxIndex(self, posteriorProbabilities):
		maxi = 0
		maxProb = posteriorProbabilities[maxi]
		for i in range(0, self._C_SIZE):
			if(posteriorProbabilities[i] >= maxProb):
				maxProb = posteriorProbabilities[i]
				maxi = i
		return maxi

class Dataset:
	def __init__(self, filename):
		fp = open(filename, "r")
		i = 0
		self.__dataset = []
		for line in fp:
			if(line != "\n"):
				line = line.split()
				cat = line[0]
				sent = ""
				for word in range(1, len(line)):
					sent = sent+line[word]+" "
				sent = sent.strip()
				self.__dataset.append([cat, str(sent)])
				i = i+1
		self.__D_SIZE = i
		self.__trainSIZE = int(0.6*self.__D_SIZE)
		self.__testSIZE = int(0.3*self.__D_SIZE)
		self.__devSIZE = 1 - (self.__trainSIZE + self.__testSIZE)

	def setTrainSize(self, value):
		self.__trainSIZE = int(value*0.01*self.__D_SIZE)
		return self.__trainSIZE

	def setTestSize(self, value):
		self.__testSIZE = int(value*0.01*self.__D_SIZE)
		return self.__testSIZE

	def setDevelopmentSize(self):
		self.__devSIZE = int(1 - (self.__trainSIZE + self.__testSIZE))
		return self.__devSIZE

	def getDataSize(self):
		return self.__D_SIZE
	def getTrainingData(self):
		return self.__dataset[0:self.__trainSIZE]

	def getTestData(self):
		return self.__dataset[self.__trainSIZE:(self.__trainSIZE+self.__testSIZE)]

	def getDevData(self):
		return self.__dataset[0:self.__devSIZE]


if __name__ == "__main__":
	# This Naive Bayes classifier implementation 10% better accuracy than the NLTK 3.0 Naive Bayes classifier implementation
	# at the task of classifying questions in the question corpus distributed with the book "Taming Text".

	# The "questions-train.txt" file can be found in the source code distributed with the book at https://www.manning.com/books/taming-text.
	# To the best of our knowledge, the improvement in accuracy is owed to the smoothing methods described in our blog:
	# https://aiaioo.wordpress.com/2016/01/29/in-a-naive-bayes-classifier-why-bother-with-smoothing-when-we-have-unknown-words-in-the-test-set/
	filename = "questions-train.txt"
	if len(sys.argv) > 1:
		filename = sys.argv[1]
	data = Dataset(filename)
	train_set = data.getTrainingData()
	test_set = data.getTestData()
	test_data = [test_set[i][1] for i in range(len(test_set))]
	actual_labels = [test_set[i][0] for i in range(len(test_set))]
	fg = FeatureGenerator()
	alpha = 0.5 #smoothing parameter
	nbClassifier = NaiveBayesClassifier(fg, alpha)
	correct = 0;
	total = 0;
	for line in test_data:
		best_label = nbClassifier.classify(line)
		if best_label == actual_labels[total]:
			correct += 1
		total += 1
	acc = 1.0*correct/total
	print("Accuracy of this Naive Bayes Classifier: "+str(acc))


Visit Aiaioo Labs

Fun With Text – Hacking Text Analytics


I’ve always wondered if there was a way to teach people to cobble together quick and dirty solutions to problems involving natural language, from duct tape, as it were.

Having worked in the field now for a donkey’s years as of 2015, and having taught a number of text analytics courses along the way, I’ve seen students of text analysis stumble mostly on one of two hurdles:

1.  Inability to Reduce Text Analytics Problems to Machine Learning Problems

I’ve seen students, after hours of training, still revert to rule-based thinking when asked to solve new problems involving text.

You can spend hours teaching people about classification and feature sets, but when you ask them to apply their learning to a new task, say segmenting a resume, you’ll hear them very quickly falling back to thinking in terms of programming steps.

Umm, you could write a script to look for a horizontal line, followed by capitalized text in bold, big font, with the words “Education” or “Experience” in it !!!

2.  Inability to Solve the Machine Learning (ML) Problems

Another task that I have seen teams getting hung up on has been solving ML problems and comparing different solutions.

My manager wants me to identify the ‘introduction’ sections.  So, I labelled 5 sentences as introductions.  Then, I trained a maximum entropy classifier with them.  Why isn’t it working?

One Machine Learning Algorithm to Rule Them All

One day, when I was about to give a lecture at Barcamp Bangalore, I had an idea.

Wouldn’t it be fun to try to use just one machine learning algorithm, show people how to code up that algorithm themselves, and then show them how a really large number of text analytics problem (almost every single problem related to the semantic web) could be solved using it.

So, I quickly wrote up a set of problems in order of increasing complexity, and went about trying to reduce them all to one ML problem, and surprised myself!  It could be done!

Just about every text analytics problem related to the semantic web (which is, by far, the most important commercial category) could be reduced to a classification problem.

Moreover, you could tackle just about any problem using just two steps:

a) Modeling the problem as a machine learning problem

Spot the appropriate machine learning problem underlying the text analytics problem, and if it is a classification problem, the relevant categories, and you’ve reduced the text analytics problem to a machine learning problem.

b) Solving the problem using feature engineering

To solve the machine learning problem, you need to coming up with a set of features that allows the machine learning algorithm to separate the desired categories.

That’s it!

Check it out for yourself!

Here’s a set of slides.

It’s called “Fun with Text – Hacking Text Analytics”.

Can you make a sandwich from classifiers?

One day, just a few years ago, a client came to Aiaioo Labs with a very interesting problem.

He wanted to know if and how AI tools could save him some money.

It turned out that he had a fairly large team performing the task of manually categorizing documents.

He wanted to know if we could supply him an AI tool that could automate the work.  The only problem was, he was going to need very high quality.

And no single automated classifier was going to be able to deliver that sort of quality by itself.

That’s when we hit upon the idea of a classifier sandwich.

The sandwich is prepared by arranging two classifiers as follows:

1.  Top layer – high precision classifier – when it picks a category, it is very likely to be right (the precision of the selected category is very high).

2.  Bottom layer – high recall classifier – when it rejects a category, it is very likely to be right about the rejection (the precision of the rejected category is very high).

Whatever the top layer does not pick and the bottom layer does not reject – that is, the middle of the sandwich – is then handed off to be processed manually by the team of editors that the client had in place.

So, that was a lovely little offering, one that any consultant could put together.  And it is incredibly easy to put together an ROI proposition for such an offering.

How do you calculate the ROI of a classifier sandwich?


Let’s say the high-precision top layer has a recall of 30%.

Let’s say the high-recall bottom layer has a recall of 80%.

Then about 50% of the documents that pass through the system will end up being automatically sorted out.

The work effort and therefore the size of the team needed to do it, would therefore be halved.

Note that to make the sandwich, we need two high-precision classifiers (the first one selects a category with high precision while the second one rejects the other category with high precision).

Both categories need to have a precision greater than or equal to the quality guarantee demanded by the client.

That precision limit determines the amount of effort left over for humans to do.

How can we tune classifiers for high precision?

For maxent classifiers, thresholds can be set on the confidence scores they return for the various categories.

For naive bayesian classifiers, the best approach to creating high-precision classifiers is a training process known as expectation maximization.

For more information, please refer the work of Kamal Nigam et al:  http://www.kamalnigam.com/papers/emcat-mlj99.pdf

Another secret to boosting precision is using the right features in your classifier, but more about that later.