Tag: text analytics

Deep Bayesian Learning for NLP

Deep learning is usually associated with neural networks.

In this article, we show that generative classifiers are also capable of deep learning.

What is deep learning?

Deep learning is a method of machine learning involving the use of multiple processing layers to learn non-linear functions or boundaries.

What are generative classifiers?

Generative classifiers use the Bayes rule to invert probabilities of the features F given a class c into a prediction of the class c given the features F.

The class predicted by the classifier is the one yielding the highest P(c|F).

A commonly used generative classifier is the Naive Bayes classifier.  It has two layers (one for the features F and one for the classes C).

Deep learning using generative classifiers

The first thing you need for deep learning is a hidden layer.  So you add one more layer H between the C and F layers to get a Hierarchical Bayesian classifier (HBC).

Now, you can compute P(c|F) in a HBC in two ways:

Product of Sums
Computing P(c|F) using a Product of Sums
Sum of Products
Computing P(c|F) using a Sum of Products

The first equation computes P(c|F) using a product of sums (POS).  The second equation computes P(c|F) using a sum of products (SOP).

POS Equation

We discovered something very interesting about these two equations.

It turns out that if you use the first equation, the HBC reduces to a Naive Bayes classifier. Such an HBC can only learn linear (or quadratic) decision boundaries.

Consider the discrete XOR-like function shown in Figure 1.


There is no way to separate the black dots from the white dots using one straight line.

Such a pattern can only be classified 100% correctly by a non-linear classifier.

If you train a multinomial Naive Bayes classifier on the data in Figure 1, you get the decision boundary seen in Figure 2a.

Note that the dotted area represents the class 1 and the clear area represents the class 0.

Multinomial NB Classifier Decision Boundary
Figure 2a: The decision boundary of a multinomial NB classifier (or a POS HBC).

It can be seen that no matter what the angle of the line is, at least one point of the four will be misclassified.

In this instance, it is the point at {5, 1} that is misclassified as 0 (since the clear area represents the class 0).

You get the same result if you use a POS HBC.

SOP Equation

Our research showed us that something amazing happens if you use the second equation.

With the “sum of products” equation, the HBC becomes capable of deep learning.

SOP + Multinomial Distribution

The decision boundary learnt by a multinomial non-linear HBC (one that computes the posterior using a sum of products of the hidden-node conditional feature probabilities) is shown in Figure 2b.

Decision boundary of a SOP HBC.
Figure 2b: Decision boundary learnt by a multinomial SOP HBC.

The boundary consists of two straight lines passing through the origin. They are angled in such a way that they separate the data points into the two required categories.

All four points are classified correctly since the points at {1, 1} and {5, 5} fall in the clear conical region which represents a classification of 0 whereas the other two points fall in the dotted region representing class 1.

Therefore, the multinomial non-linear hierarchical Bayes classifier can learn the non-linear function of Figure 1.

Gaussian Distribution

The decision boundary learnt by a Gaussian nonlinear HBC is shown in Figure 2c.

Decision Boundary of a Gaussian SOP HBC.
Figure 2c: Decision boundary learnt by a SOP HBC based on the Gaussian probability distribution.

The boundary consists of two quadratic curves separating the data points into the required categories.

Therefore, the Gaussian non-linear HBC can also learn the non-linear function depicted in Figure 1.


Since SOP HBCs are multilayered (with a layer of hidden nodes), and can learn non-linear decision boundaries, they can therefore be said to be capable of deep learning.

Applications to NLP

It turns out that the multinomial SOP HBC can outperform a number of linear classifiers at certain tasks.  For more information, read our paper.

Visit Aiaioo Labs

Fun with Text – Managing Text Analytics

The year is 2016.

I’m a year older than when I designed the text analytics lecture titled “Fun with Text – Hacking Text Analytics“.

Yesterday, I found myself giving a follow on lecture titled “Fun with Text – Managing Text Analytics”.

Here are the slides:

“Hacking Text Analytics” was meant to help students understand a range text analytics problems by reducing them into simpler problems.

But it was designed with the understanding that they would hack their own text analytics tools.

However, in project after project, I was seeing that engineers tended not to build their own text analytics tools, but instead rely on handy and widely available open source products, and that the main thing they needed to learn was how to use them.

So, when I was asked to lecture to an audience at the NASSCOM Big Data and Analytics Summit in Hyderabad, and was advised that a large part of the audience might be non-technical, and could I please base the talk on use-cases, I tried a different tack.

So I designed another lecture “Fun with Text – Managing Text Analytics” about:

  • 3 types of opportunities for text analytics that typically exist in every vertical
  • 3 use cases dealing with each of these types of opportunities
  • 3 mistakes to avoid and 3 things to embrace

And the take away from it is how to go about solving a typical business problem (involving text), using text analytics.

Enjoy the slides!

Visit Aiaioo Labs

Kabir and Language

Image from Wikipedia

Yesterday, I went to a concert of songs belonging to the tradition of a 15th century saint-poet called Kabir, and came across a very interesting song that he is said to have composed.

It went something like this.

The cow was milked

Before the calf was born

But after I sold the curd in the market

and this:

The ant went to its wedding

Carrying a gallon of oil

And an elephant and a camel under its arms

From the perspective of natural language processing and machine learning, the incongruous situations depicted in these poems turn out having an interesting pattern in them, as I will explain below.

I found more examples of Kabir’s “inverted verses” online.

The poems at http://www.sriviliveshere.com/mapping-ulat-bansi.html come with beautiful illustrations as well.

Here are a few more lines from Kabir’s inverted verse:

A tree stands without roots

A tree bears fruit without flowers

Someone dances without feet

Someone plays music without hands

Someone sings without a tongue

Water catches fire

Someone sees with blind eyes

A cow eats a lion

A deer eats a cheetah

A crow pounces on a falcon

A quail pounces on a hawk

A mouse eats a cat

A dog eats a jackal

A frog eats snakes

What’s interesting about all of these is that they’re examples of entity-relationships that are false.

Let me first explain what entities and relationships are.

Entities are the real or conceptual objects that we perceive as existing in the world we live in.  They are usually described using a noun phrase and qualified using an adjective.

Relationships are the functions that apply to an ordered list of entities and return a true or false value.

For example, if you take the sentence “The hunter hunts the fox,” there are two entities (1. the hunter, 2. the fox).  The relationship is “hunts”, it returns true for the two entities presented in that order.

The relationship “hunts” would return false if the entities were inverted (as in 1. the fox and 2. the hunter … as in the sentence “The fox hunts the hunter”).

The relationship and the entity can be stored in a database and hence can be considered as the structured form of an unstructured plain-language utterance.

In fact it is entities and relationships such as these that it was speculated would some day make up the semantic web.

Most of Kabir’s inverted verse seems to be based on examples of false entity relationships of dual arity (involving two entities), and that often, there is a violation of entity order which causes the entity function to return the value false.

In the “cow was milked” song, the relationship that is violated is the temporal relationship: “takes place before”.

In the “ant’s wedding” song, the relationship that is violated is that of capability: “can do”.

In the rest of the examples, relationships like “eats”, “hunts”, “plays”, “dances”, “bears fruit”, etc., are violated.

Other Commentary

In Osho’s “The Revolution”, he talks about Kabir’s interest in and distrust of language, quoting the poet as saying:








And Osho concludes with:

Kabir is not interested in giving you any answers — because he knows perfectly well there is no answer. The game of question and answers is just a game — not that Kabir was not answering his disciples’ questions; he was answering, but answering playfully. That quality you have to remember. He is not a serious man; no wise man can ever be serious. Seriousness is part of ignorance, seriousness is a shadow of the ego. The wise is always non-serious. There can be no serious answers to questions, not at least with Kabir — because he does not believe that there is any meaning in life, and he does not believe that you have to stand aloof from life to observe and to find the meaning. He believes in participation. He does not want you to become a spectator, a speculator, a philosopher.


This genre of verse seems to have been a tradition in folk religious movements in North India.  In “The Tenth Rasa: An Anthology of Indian Nonsense” by Michael Heyman, Sumanya Satpathy and Anushka Ravishankar, they talk about Namdev, a 13th century saint-poet as having authored such verses as well.

A Naive Bayes classifier that outperforms NLTK’s

We found that by changing the smoothing parameters of a Naive Bayes classifier, we could get far better accuracy numbers for certain tasks.  By changing the Lidstone smoothing parameter from 0.05 to 0.5 or greater, we could go from an accuracy of about 50% to almost 70% on the task of question classification for question answering.

This is not at all surprising because, as described in an earlier post, the smoothing method used in the estimation of probabilities affects Naive Bayes classifiers greatly.

Below, we have provided an implementation of a Naive Bayes classifier which outperforms the Naive Bayes classifier supplied with NLTK 3.o by almost 10% on the task of classifying questions from the questions-train.txt file supplied with the textbook “Taming Text”.

Our Naive Bayes classifier (with a Lidstone smoothing parameter of 0.5) exhibits about 65% accuracy on the task of question classification, whereas the NLTK classifier has an accuracy of about 40% as shown below.


Finally, I’d like to say a few words about the import of this work.

Theoretically, by increasing the Lidstone smoothing parameter, we are merely compensating more strongly for absent features; we are negating the absence of a feature more vigorously;  reducing the penalty for the absence of a feature in a specific category.

Because increased smoothing lowers the penalty for feature absence, it could help increase the accuracy when a data-set has many low-volume features that do not contribute to predicting a category, but whose chance presence and absence may be construed in the learning phase to be correlated with a category.

Further investigation is required before we can say whether the aforesaid hypothesis would explain the effect of smoothing on the accuracy of classification in regard to the question classification data-set that we used.

However, this exercise shows that algorithm implementations would do well to leave the choice of Lidstone smoothing parameters to the discretion of the end user of a Naive Bayes classifier.

The source code of our Naive Bayes classifier (using Lidstone smoothing) is provided below:

This implementation of the Naive Bayes classifier was created by Geetanjali Rakshit, an intern at Aiaioo Labs.

There is a bug in the following code in that it uses calls to a dictionary’s “keys()” method in places where this is not required (resulting in poor run-time performance), as in the line “if label not in self._classes_dict.keys():”.  If you wish to use this implementation with large numbers of features, remove the “keys()” calls wherever possible.  So, the above line should be changed to “if label not in self._classes_dict:”.  Grep ‘keys’ and repeat.

import numpy as np
import random
import sys, math

class Classifier:
	def __init__(self, featureGenerator):
		self.featureGenerator = featureGenerator
		self._C_SIZE = 0
		self._V_SIZE = 0
		self._classes_list = []
		self._classes_dict = {}
		self._vocab = {}

	def setClasses(self, trainingData):
		for(label, line) in trainingData:
			if label not in self._classes_dict.keys():
				self._classes_dict[label] = len(self._classes_list)
		self._C_SIZE = len(self._classes_list)
	def getClasses(self):
		return self._classes_list

	def setVocab(self, trainingData):
		index = 0;
		for (label, line) in trainingData:
			line = self.featureGenerator.getFeatures(line)
			for item in line:
				if(item not in self._vocab.keys()):
					self._vocab[item] = index
					index += 1
		self._V_SIZE = len(self._vocab)

	def getVocab(self):
		return self._vocab

	def train(self, trainingData):

	def classify(self, testData, params):

	def getFeatures(self, data):
		return self.featureGenerator.getFeatures(data)

class FeatureGenerator:
	def getFeatures(self, text):
		text = text.lower()
		return text.split()

class NaiveBayesClassifier(Classifier):
	def __init__(self, fg, alpha = 0.05):
		Classifier.__init__(self, fg)
		self.__classParams = []
		self.__params = [[]]
		self.__alpha = alpha

	def getParameters(self):
		return (self.__classParams, self.__params)

	def train(self, trainingData):

		for (cat, document) in trainingData:
			for feature in self.getFeatures(document):
				self.countFeature(feature, self._classes_dict[cat])

	def countFeature(self, feature, class_index):
		counts = 1
		self._counts_in_class[class_index][self._vocab[feature]] = self._counts_in_class[class_index][self._vocab[feature]] + counts
		self._total_counts[class_index] = self._total_counts[class_index] + counts
		self._norm = self._norm + counts

	def classify(self, testData):
		post_prob = self.getPosteriorProbabilities(testData)
		return self._classes_list[self.getMaxIndex(post_prob)]

	def getPosteriorProbabilities(self, testData):
		post_prob = np.zeros(self._C_SIZE)
		for i in range(0, self._C_SIZE):
			for feature in self.getFeatures(testData):
				post_prob[i] += self.getLogProbability(feature, i)
			post_prob[i] += self.getClassLogProbability(i)
		return post_prob

	def getFeatures(self, testData):
		return self.featureGenerator.getFeatures(testData)

	def initParameters(self):
		self._total_counts = np.zeros(self._C_SIZE)
		self._counts_in_class = np.zeros((self._C_SIZE, self._V_SIZE))
		self._norm = 0.0

	def getLogProbability(self, feature, class_index):
		return math.log(self.smooth(self.getCount(feature, class_index),self._total_counts[class_index]))

	def getCount(self, feature, class_index):
		if feature not in self._vocab.keys():
			return 0
			return self._counts_in_class[class_index][self._vocab[feature]]

	def smooth(self, numerator, denominator):
		return (numerator + self.__alpha) / (denominator + (self.__alpha * len(self._vocab)))

	def getClassLogProbability(self, class_index):
		return math.log(self._total_counts[class_index]/self._norm)

	def getMaxIndex(self, posteriorProbabilities):
		maxi = 0
		maxProb = posteriorProbabilities[maxi]
		for i in range(0, self._C_SIZE):
			if(posteriorProbabilities[i] >= maxProb):
				maxProb = posteriorProbabilities[i]
				maxi = i
		return maxi

class Dataset:
	def __init__(self, filename):
		fp = open(filename, "r")
		i = 0
		self.__dataset = []
		for line in fp:
			if(line != "\n"):
				line = line.split()
				cat = line[0]
				sent = ""
				for word in range(1, len(line)):
					sent = sent+line[word]+" "
				sent = sent.strip()
				self.__dataset.append([cat, str(sent)])
				i = i+1
		self.__D_SIZE = i
		self.__trainSIZE = int(0.6*self.__D_SIZE)
		self.__testSIZE = int(0.3*self.__D_SIZE)
		self.__devSIZE = 1 - (self.__trainSIZE + self.__testSIZE)

	def setTrainSize(self, value):
		self.__trainSIZE = int(value*0.01*self.__D_SIZE)
		return self.__trainSIZE

	def setTestSize(self, value):
		self.__testSIZE = int(value*0.01*self.__D_SIZE)
		return self.__testSIZE

	def setDevelopmentSize(self):
		self.__devSIZE = int(1 - (self.__trainSIZE + self.__testSIZE))
		return self.__devSIZE

	def getDataSize(self):
		return self.__D_SIZE
	def getTrainingData(self):
		return self.__dataset[0:self.__trainSIZE]

	def getTestData(self):
		return self.__dataset[self.__trainSIZE:(self.__trainSIZE+self.__testSIZE)]

	def getDevData(self):
		return self.__dataset[0:self.__devSIZE]


if __name__ == "__main__":
	# This Naive Bayes classifier implementation 10% better accuracy than the NLTK 3.0 Naive Bayes classifier implementation
	# at the task of classifying questions in the question corpus distributed with the book "Taming Text".

	# The "questions-train.txt" file can be found in the source code distributed with the book at https://www.manning.com/books/taming-text.
	# To the best of our knowledge, the improvement in accuracy is owed to the smoothing methods described in our blog:
	# https://aiaioo.wordpress.com/2016/01/29/in-a-naive-bayes-classifier-why-bother-with-smoothing-when-we-have-unknown-words-in-the-test-set/
	filename = "questions-train.txt"
	if len(sys.argv) > 1:
		filename = sys.argv[1]
	data = Dataset(filename)
	train_set = data.getTrainingData()
	test_set = data.getTestData()
	test_data = [test_set[i][1] for i in range(len(test_set))]
	actual_labels = [test_set[i][0] for i in range(len(test_set))]
	fg = FeatureGenerator()
	alpha = 0.5 #smoothing parameter
	nbClassifier = NaiveBayesClassifier(fg, alpha)
	correct = 0;
	total = 0;
	for line in test_data:
		best_label = nbClassifier.classify(line)
		if best_label == actual_labels[total]:
			correct += 1
		total += 1
	acc = 1.0*correct/total
	print("Accuracy of this Naive Bayes Classifier: "+str(acc))

Visit Aiaioo Labs

Fun With Text – Hacking Text Analytics


I’ve always wondered if there was a way to teach people to cobble together quick and dirty solutions to problems involving natural language, from duct tape, as it were.

Having worked in the field now for a donkey’s years as of 2015, and having taught a number of text analytics courses along the way, I’ve seen students of text analysis stumble mostly on one of two hurdles:

1.  Inability to Reduce Text Analytics Problems to Machine Learning Problems

I’ve seen students, after hours of training, still revert to rule-based thinking when asked to solve new problems involving text.

You can spend hours teaching people about classification and feature sets, but when you ask them to apply their learning to a new task, say segmenting a resume, you’ll hear them very quickly falling back to thinking in terms of programming steps.

Umm, you could write a script to look for a horizontal line, followed by capitalized text in bold, big font, with the words “Education” or “Experience” in it !!!

2.  Inability to Solve the Machine Learning (ML) Problems

Another task that I have seen teams getting hung up on has been solving ML problems and comparing different solutions.

My manager wants me to identify the ‘introduction’ sections.  So, I labelled 5 sentences as introductions.  Then, I trained a maximum entropy classifier with them.  Why isn’t it working?

One Machine Learning Algorithm to Rule Them All

One day, when I was about to give a lecture at Barcamp Bangalore, I had an idea.

Wouldn’t it be fun to try to use just one machine learning algorithm, show people how to code up that algorithm themselves, and then show them how a really large number of text analytics problem (almost every single problem related to the semantic web) could be solved using it.

So, I quickly wrote up a set of problems in order of increasing complexity, and went about trying to reduce them all to one ML problem, and surprised myself!  It could be done!

Just about every text analytics problem related to the semantic web (which is, by far, the most important commercial category) could be reduced to a classification problem.

Moreover, you could tackle just about any problem using just two steps:

a) Modeling the problem as a machine learning problem

Spot the appropriate machine learning problem underlying the text analytics problem, and if it is a classification problem, the relevant categories, and you’ve reduced the text analytics problem to a machine learning problem.

b) Solving the problem using feature engineering

To solve the machine learning problem, you need to coming up with a set of features that allows the machine learning algorithm to separate the desired categories.

That’s it!

Check it out for yourself!

Here’s a set of slides.

It’s called “Fun with Text – Hacking Text Analytics”.

Text Analytics Tools for Deliberative Democracy

In our last post, we spoke about various control mechanisms that can be implemented to support direct democracy (which we  interpreted to mean the control of the allocation of common resources by the people who pooled in).

We also examined how these controls could be used to curtail man-in-the-middle corruption.

In this article, we examine a more sophisticated form of direct democracy called a deliberative democracy.

In a deliberative democracy, in addition to the control mechanisms prescribed for direct democracy, there need to be mechanisms to allow deliberation (discussion) before a referendum or any other action is taken.

I quote from the Wikipedia article on deliberative democracy:

Deliberative democracy holds that, for a democratic decision to be legitimate, it must be preceded by authentic deliberation, not merely the aggregation of preferences that occurs in voting.

In elitist deliberative democracy, principles of deliberative democracy apply to elite societal decision-making bodies, such as legislatures and courts; in populist deliberative democracy, principles of deliberative democracy apply to groups of lay citizens who are empowered to make decisions.

The article on direct democracy had the following to say:

Democratic theorists have identified a trilemma due to the presence of three desirable characteristics of an ideal system of direct democracy, which are challenging to deliver all at once. These three characteristics are participation – widespread participation in the decision making process by the people affected; deliberation – a rational discussion where all major points of view are weighted according to evidence; and equality – all members of the population on whose behalf decisions are taken have an equal chance of having their views taken into account.

(Aside to computer scientists: doesn’t this trilemma remind you of the CAP theorem that applies to database systems? Here’s a simple explanation of the CAP theorem: http://ksat.me/a-plain-english-introduction-to-cap-theorem/).

So, for example, representative democracy satisfies the requirement for deliberation and equality but sacrifices participation.

Participatory democracy allows inclusive participation and deliberation but sacrifices equality.

And then there is direct democracy which supports participation and equality, but not deliberation.

The problem seems to be that when a large number of people are invited to participate in a deliberation (and given that deliberations take time), it will not be possible to compensate them all for their time. Consequently, only those more interested in the issue being debated (or more likely to benefit from one position or the other) are more likely to participate, biasing the sample in their favour (all sections of the population are no longer equally represented in the discussion/decision).

So, it seems that all the three properties desired in an ideal democratic system – participation, equality and deliberation – cannot be present at the same time in a real democratic system.

But then, a while ago, we began wondering if this trilemma is merely a result of the lack of suitable technology and not really a fundamental property of democracy.  So, we proposed a design for (though we have not yet realized it) a tool that can support the participation of a large number of people in deliberations.  We call it the MCT (Mass Communication Tool).

It could be used as a method to enable direct democracies to support deliberations in which all citizens can participate, ahead of a vote on any subject.

It uses text clustering algorithms to solve the problems of volume as well as numeric asymmetry in the flow of communications between the deliberating participants and the moderators of the communications.

There’s a brief overview of the system in our lab profile.

MCTs are bound to have a huge impact on our experience of representative government.  A typical use case would involve a public figure, (say President Obama), sounding out the electorate before introducing legislation on say healthcare reform.


By first discussing the competing proposals with large numbers of people, it might be possible for the initiator of the discussion to get a sense of what might or might not work and what the response to the legislation was likely to be.


An MCT would have to be capable of supporting a live dialog involving a large number of people.

It would use natural language processing and machine learning to enable a few moderators (for example, the CEO of a company) to interact with a large number of people (for example, all the employees of the company) in real time (for example, during a virtual all-hands meeting), get a synopsis of a large number of concurrent discussions in real time, and participate in a significant fraction of the discussions as they are taking place.

The system would consist of:

  1. an aggregator of messages (built from natural language processing components) that groups together messages and discussions with identical semantic content;
  2. a hierarchical clustering system (built from natural language processing components) that assigns aggregated messages their place in a hierarchy by specificity with more general messages closer to the root of the hierarchy and more specific messages closer to the leaves of the hierarchy;
  3. a summarization system (built from natural language processing components) that creates a summary of the aggregate of all messages in a sub-tree; and
  4. a reply routing system (built from natural language processing components) that routes replies from cluster to cluster based on their relevance to the discussion threads.

Using text analytics to prevent O(log n) failure of democratic institutions

In this article, we discuss an Achilles’ heel present in many democratic institutions.  We claim that many democratic institutions can be made to fail in ‘log n’ time (exponentially fast) if patronage (nepotistic) networks are allowed to grow unfettered.

We then support the claim using real-world examples of the failure of democratic institutions (political and otherwise) and discuss why such failure has not been observed in polity in India.

We also look at how text analytics can be used to detect (and consequently enable steps to be taken to prevent) such failures.

The Weakness

In democratic institutions, voting mechanisms are used to confer upon one individual (from a field of eligible candidates), powers and responsibilities as specified in the charter of the institution.

In some cases, the powers that accrue to the elected individual are so great that they enable him to use them to ensure his or her re-election.  There are two methods available to such an individual.

The first is to pay off the electoral college and secure re-election.  This method is an O(n) algorithm.  The quantum of resources required to secure re-election is proportional to the number of people who need to be suborned.  So, this method only works in cases where electoral colleges are small (for example, in the case of committees deciding job appointments).

A faster method of suborning large numbers of people exists.  The establishment of a hierarchy of patronage can leverage the dynamics of social networks to speedily (in log n time) corrupt very large numbers of people.

It works like this:  The person who is elected to head the country appoints as immediate subordinates only people who are on account of tribal or ethnic affiliations expected to be loyal to him/her. This appointment is often accompanied by a monetary exchange in the form of a bribe paid by the appointee to secure the post.  Such a monetary exchange helps cement the loyalty since the person making the payment becomes dependent on their superior’s continuation in power to recoup the money spent. In other words, the immediate subordinates are forced to invest in their superiors’ careers.

The subordinates so appointed in turn appoint people loyal to themselves to positions below them, in order to recover their investment in the person above them and so on.  Very soon, the entire government machinery becomes beholden directly or indirectly to the person at the top for their jobs and has a vested interest in keeping the person at the top in power.

In some countries, this effectively transforms the democratically elected ‘president’ into a dictator for life. 

Example 1

The first example of such failure is possibly (and I am just illustrating a point – no disrespect intended) the government of Cameroon.  The President of Cameroon has been in power since the 1980s.  The President is impossible to replace in spite of rampant corruption and economic mismanagement because all the tribal chiefs and officials in Cameroon are beholden to the President.

All these officials and chiefs try and recoup their investment by rent-seeking behavior (you will need their good offices if you wish to do any business in Cameroon).

The resulting economic climate doesn’t necessarily encourage the growth of entrepreneurship or investment in Cameroon.

Example 2

Iraq for many years was ruled by a dictator who had managed to suborn an entire political system.  Iraq once had a participatory democratic system.  But the use of patronage by Saddam led to Iraq coming under totalitarian rule.

Similar failures can be seen in post WW2 Libya and Syria, and in Stalin’s Russia.


One common feature of many of the countries where such failure has occurred is that they have hierarchical social structures in which respect and obedience can be commanded by people higher in the hierarchy from people lower in the hierarchy.

Cameroon has a culture of veneration of elders and tribal leaders.  So do the countries of the Arab world.

India also has somewhat similar cultural traits.  So, it is very interesting to see that a similar process of deterioration of democracy has not been observed in India.

One explanation is that India was saved by its own heterogeneity.  India is made up of distinct linguistic regions and ethnic groups.  It would be impossible for tribal and ethnic hierarchies to span India’s many linguistic and ethnic boundaries.

So, even if one province has failed, that would not trigger failure in neighboring provinces, and the regional despot would not be strong enough to change the Indian constitution.

Examples of such failure can arguably be observed to have happened already in Karnataka and in Tamil Nadu.  In Karnataka, a couple of mining barons managed to become ministers in the government and managed to exercise a lot of control over it for a number of years. Had Karnataka been an independent political entity, they might at some point have attempted to change the constitution to give themselves more power and possibly unlimited tenure.

In the state of Tamil Nadu, the ruling politicians are suspected of building patronage networks around themselves in order to facilitate rent-seeking behaviour.  If Tamil Nadu had been an independent political entity, I suspect that it might have lost its democratic character, since the number of people below the Chief Minister with a vested interest in keeping him/her in power would have become too big to permit free and fair elections to take place.

But what is interesting is that in India, though democracy could have failed at the level of the states (controlled by politicians who had suborned a large part of the mechanism of government) the possible failure did not take place or proved reversible.  That is probably because no kinship networks extend to all parts of India.

The fragmentation must have protected the constitution and the electoral system.  That in turn allowed constitutional and legal frameworks and electoral politics to correct the failures.

In Karnataka, the corrupt mining barons and the Chief Minister who supported them ended up behind bars.  In Tamil Nadu, some family members of the corrupt Chief Minister ended up behind bars.  In both states, the parties that had engaged in corruption were voted out of power.

So, failure through networks of patronage might be difficult to engineer in India because of the extremely heterogeneous nature of its society (a result of size and diversity).


Why would someone intending to build a patronage network only elevate kith and kin to positions of power?

As in, why did the dictators in Iraq and Syria choose to pack the governing organizations with people from their own community or tribe?

One possible answer emanates from the work of George Price (http://www.bbc.co.uk/news/magazine-24457645), who showed that altruism exhibited towards close relatives can have concrete benefits for a selfish individual.

I quote from the article:

Price’s equation explained how altruism could thrive, even amongst groups of selfish people.

It built on the work of a number of other scientists, arguably beginning with JBS Haldane, a British biologist who developed a theory in the early 1950s. When asked if he would sacrifice his own life to save that of another, he said that he would, but only under certain conditions. “I would lay down my life for two brothers, or eight cousins.”

Here is the Wikipedia page describing the Price equation (http://en.wikipedia.org/wiki/Price_equation).

So, it is possible that the elevation of kith and kin minimizes the possibility that the people so elevated might one day turn against their patron (they might be more likely to exhibit non-selfish behavior towards their patron).

Detection using Text Analytics

One is tempted to ask whether it is possible to detect at an early stage the process of failure through patronage of a democratic system.

The answer is yes.  It appears to be possible to build a protective mechanism that can uncover and highlight the formation and growth of nepotistic patronage networks.

The BBC article on nepotism in Italy examines research on the detection of nepotism in Italy, and I quote:

Prof Perotti, along with others, has published revealing studies of university teachers, showing the extraordinary concentration of surnames in many departments.

“There is a scarcity of last-names that’s inexplicable,” says fellow academic, Stefano Allesina. “The odds of getting such population densities across so many departments is a million to one.”

Take the University of Bari, where five families have for years dominated the dozens of senior positions in Business and Economics there. Or consider the University of Palermo, where more than half the entire academic population has at least one relative working within the institution.

I happened to find one of Allesina’s papers titled “Measuring Nepotism through Shared Last Names: The Case of Italian Academia” on the internet.

Here it is:  http://www.plosone.org/article/info:doi/10.1371/journal.pone.0021160

I quote from the paper:

Both types of analysis (Monte Carlo and logistic regression) showed the same results: the paucity of names and the abundance of name-sharing connections in Italian academia are highly unlikely to be observed at random. Many disciplines, accounting for the majority of Italian academics, are very likely to be affected by nepotism. There is a strong latitudinal effect, with nepotistic practices increasing in the south. Although detecting some nepotism in Italian academia is hardly surprising, the level of diffusion evidenced by this analysis is well beyond what is expected.

Concentrating resources in the “healthy” part of the system is especially important at a time when funding is very limited and new positions are scarce: two conditions that are currently met by the Italian academic system. Moreover, promoting merit against nepotistic practices could help stem the severe brain-drain observed in Italy.

In December 2010, the Italian Parliament approved a new law for the University. Among other things, the new law forbids the hiring of relatives within the same department and introduces a probation period before tenure. The analysis conducted here should be repeated in the future, as the results could provide an assessment of the efficacy of the new law.

This analysis can be applied to different countries and types of organizations. Policy-makers can use similar methods to target resources and cuts in order to promote fair practices.

The analysis that Allesina performed (a diversity analysis of last names) is a fairly easy text analytics task and provides a clue to the solution to the problem.

Such an analysis can unearth nepotism and allow steps to be taken to prevent it.

Extensions of the method can also be used to determine if, as in the case of Iraq and Syria, a certain community or ethnicity or family has taken over or is beginning to take over the governing mechanisms of a country.

And if the problem is identified early enough, it might give the constitutional, legal and electoral institutions of a democracy a fighting chance at protecting themselves and lead to a correction of the failure.

Using text analytics to detect fraud related to government tenders

In my last article, I talked about using statistics to detect fraud.  I’d promised to write about methods for detecting and preventing fraud in the issuing of tenders.

The floating of tenders is the primary mechanism by which governments – often the biggest economic force in a geographical area – procure services from private organizations.

If the bidding process is compromised, contracts might end up not going to the best or most efficient vendor as a stakeholder (the tax-payer) would desire.

That in turn results in bad roads, poor infrastructure, delayed projects, under-spending on education, over-spending on military purchases and other problems associated with bad governance. (See: https://aiaioo.wordpress.com/2012/11/18/tools-for-the-mind-and-how-you-can-change-the-world/).

It also stands to reason that if you could detect tendering fraud, you could solve quite a few of the problems that affect places where corruption is rife.

So how do you tell if a tender has been fairly issued or if it has been gamed to the benefit of a certain participant?

An unfair procurement process can use one of the following methods to ensure that a contract is awarded to a favored party:

Method 1:  Choice of pre-selection criteria

One method used to favour a certain party is the introduction of unnecessary qualifying conditions in the tender that have nothing whatsoever to do with the the end product or service to be procured.

These conditions are added to the tender in order to ensure that only a chosen small set of bidders meet all the conditions for participation in the tender process.

Method 2:  Cancellation and reissuing of the tender

I have been given to believe (by various sources) that in India/China, 3% of the size of the deal is the norm for kickbacks.

If a very efficient bid is placed, and it brings the cost of the service down so that the 3% kickbacks do not translate into a lot of money or if the winner of the bid refuses to pay a bribe, procurement officials might be able to subvert the process by coming up with reasons to cancel or terminate the tender.

They can then reissue the tender with tighter criteria intended to disqualify the uncooperative bidder.

My Experiences

Now, I must tell you that since the beginning of my career as an entrepreneur in India, I have come across numerous stories of terminated tenders, or of the disqualification of firms from a bidding process because they bid too low to be able to pay much by way of bribes.

I have personally walked into a tendering meeting where the government officials began with the words: “Ladies and gentlemen, we are proud to welcome you to our campus today.  We are extremely sorry that we cannot entertain you the way you entertain us when we come to your campuses.”

The tender was being issued for a software project that I felt should have taken a team of 3 engineers no more than 6 months to deliver.  But the tender stated that only firms with a minimum of “100 crores in revenue each year for the past 5 years,” (approximately 20 million USD each year) could bid for the project.  There were only 6 other firms in the room.

When the officials realised that Aiaioo Labs was a small firm, they suggested we leave.

They said, “There should be some other things we can work on with you.  Let’s meet some other time.”

Over the years, I began to wonder if there was any way that I as a tax payer might protect myself from bad deals (corrupt or price-inefficient deals) entered into by government middlemen with my tax money.

Fortunately, it seems possible to use text analytics to detect and alert an ombudsman to possible fraud in the issuing of tenders.  Below is a description of how such a method might work.

Using text analytics to detect irrelevant selection constraints

If tenders for procuring very different products have very similar pre-selection criteria, they could be flagged as suspicious.

The reason this method might work is that relationships between corrupt officials and client firms can take a considerable amount of time to form (because of the risks involved and the consequent need for caution).  It is easy for corrupt officials to change the favored vendor very frequently.

That would mean that they would have to keep the criteria of selection of firms more or less unchanged across widely varying tenders and over long periods of time.  So, you might find that tenders small and large, for hardware or for software, (in other words, tenders for different services), but issued by the same organization might – if the tender process has become unfair – employ more or less the same set of selection criteria irrespective of what is being purchased.

Tools can be developed to detect these similarities and flag them up for review.  Such tools would have to be able to detect the portions of the tender document that are related to the bidder, and the portions of the document that are related to the product or service requested.  It would then have to measure the similarity between the bidder-related sections of the tender documents.  It might also be possible to extract only the qualifying criteria and look for similarities there.  It might also be possible to analyse the bidder selection criteria to see if any criteria might be irrelevant to a project, or incompatible with the requirements of the project.

Using text analytics to detect reissued tenders

If a new tender document’s product or service description sections resemble those in an older tender – and if the issuing organization remains the same, it might be possible that the tender has been reissued.  If it is further found that a very low bid had won the bidding in the previous round of tendering, and that the earlier tender had been cancelled, this could be used as a flag to alert an ombudsman.

Using text analytics to detect vendor-oriented constraints

If many of the conditions for participation in the tender are company-specific (properties of companies such as size or earnings) as opposed to capability-specific (experience in a certain technology space), it might raise a red-flag.

Systems to manage tenders

It might be possible to analyse tenders for fraud if tenders are stored in and managed using a tender management system with fraud detection analytics that both serves as a repository for tender documents, as well as manages the submission of bids and monitors the selection procedure and the life-cycle of projects. This would allow governments to maintain not just a history of tender issuers, but also a history of vendors.  By so doing, governments would be able to determine which vendors are reliable and which are not.

Moreover, it would give people issuing and evaluating tenders more confidence in a low bidder (there is always a danger in projects that someone could bid too low and win the project, but then not be able to execute) and hence help reduce costs. So, a tender fraud detection tool could possibly help governments make better decisions regarding vendors of services and reduce corruption in the process of issuing tenders for the procurement of services and products for government.

Graph Algorithms for Fraud Detection

Text analytics algorithms are difficult and expensive to develop.  Fortunately there are other ways to detect tendering fraud. Patterns of favoritism in tender outcomes can be detected from a bipartite graph of issuing organizations and beneficiaries.  If tenders from a particular issuing organization are found to repeatedly favour a specific vendor from a large field of vendors (more than random probabilities would allow), the organization and vendor could be flagged. Price comparisons across tenders can also be made to determine if any prices have exceeded the price range for similar purchases (this will again require text analytics).

Some Theory

There has been a lot of work on corruption by economists in the last 10 years.  One interesting equation that models corruption is the Klitgaard Corruption Equation. The equation is C = R + D – A where C stands for Corruption, R for Rent (quantum of possible illegal earnings from being in a position of responsibility where corruption is possible) and A stands for Accountability. These concepts are explained very well in the following article http://seekyt.com/define-corruption/ (from where I got the following image as well). But Klitgaard does not model one variable that can impact corruption – and that variable is choice. If you increase the choices available to a purchaser, the opportunities for avoiding corruption increase and the likelihood of corrupt transactions occurring decreases.

For example, if everyone in a certain location must only obtain a service from the government office serving their locality, then a person who does not want to pay a bribe does not have the option of travelling to a different office to obtain the service without paying a bribe.  So, if the officer at the local office is corrupt and demands a bribe for rendering a service, then the person has no choice but to cough up the bribe.  This happens in a lot of government offices where registrations have to be performed.  Increasing the choice of service provider lessens the likelihood of people being trapped into giving bribes.

The same forces are at work in the case of tenders.  The strategy of a corrupt tendering official is to artificially reduce the choices of the selectors to only firms that will pay a bribe. Computer systems that fight tendering corruption work by preventing the artificial restriction of choices.

Can you make a sandwich from classifiers?

One day, just a few years ago, a client came to Aiaioo Labs with a very interesting problem.

He wanted to know if and how AI tools could save him some money.

It turned out that he had a fairly large team performing the task of manually categorizing documents.

He wanted to know if we could supply him an AI tool that could automate the work.  The only problem was, he was going to need very high quality.

And no single automated classifier was going to be able to deliver that sort of quality by itself.

That’s when we hit upon the idea of a classifier sandwich.

The sandwich is prepared by arranging two classifiers as follows:

1.  Top layer – high precision classifier – when it picks a category, it is very likely to be right (the precision of the selected category is very high).

2.  Bottom layer – high recall classifier – when it rejects a category, it is very likely to be right about the rejection (the precision of the rejected category is very high).

Whatever the top layer does not pick and the bottom layer does not reject – that is, the middle of the sandwich – is then handed off to be processed manually by the team of editors that the client had in place.

So, that was a lovely little offering, one that any consultant could put together.  And it is incredibly easy to put together an ROI proposition for such an offering.

How do you calculate the ROI of a classifier sandwich?


Let’s say the high-precision top layer has a recall of 30%.

Let’s say the high-recall bottom layer has a recall of 80%.

Then about 50% of the documents that pass through the system will end up being automatically sorted out.

The work effort and therefore the size of the team needed to do it, would therefore be halved.

Note that to make the sandwich, we need two high-precision classifiers (the first one selects a category with high precision while the second one rejects the other category with high precision).

Both categories need to have a precision greater than or equal to the quality guarantee demanded by the client.

That precision limit determines the amount of effort left over for humans to do.

How can we tune classifiers for high precision?

For maxent classifiers, thresholds can be set on the confidence scores they return for the various categories.

For naive bayesian classifiers, the best approach to creating high-precision classifiers is a training process known as expectation maximization.

For more information, please refer the work of Kamal Nigam et al:  http://www.kamalnigam.com/papers/emcat-mlj99.pdf

Another secret to boosting precision is using the right features in your classifier, but more about that later.

Purchase Intention Use Cases

WisdomTap (www.wisdomtap.com) is a firm that uses purchase intention in a very interesting manner.

They spot purchase intention in public posts and target advertisements at the speakers.

The team at WisdomTap obtained some startling results when they measured how those users responded to their ads.

The results can be found in this whitepaper: http://www.aiaioo.com/whitepapers/intention_analysis_use_cases.pdf