Tag: nlp

Deep Learning for NLP

Deep learning is usually associated with neural networks.

In this article, we show that generative classifiers are also capable of deep learning.

What is deep learning?

Deep learning is a method of machine learning involving the use of multiple processing layers to learn non-linear functions or boundaries.

What are generative classifiers?

Generative classifiers use the Bayes rule to invert probabilities of the features F given a class c into a prediction of the class c given the features F.

The class predicted by the classifier is the one yielding the highest P(c|F).

A commonly used generative classifier is the Naive Bayes classifier.  It has two layers (one for the features F and one for the classes C).

Deep learning using generative classifiers

The first thing you need for deep learning is a hidden layer.  So you add one more layer H between the C and F layers to get a Hierarchical Bayesian classifier (HBC).

Now, you can compute P(c|F) in a HBC in two ways:

Product of Sums
Computing P(c|F) using a Product of Sums
Sum of Products
Computing P(c|F) using a Sum of Products

The first equation computes P(c|F) using a product of sums (POS).  The second equation computes P(c|F) using a sum of products (SOP).

POS Equation

We discovered something very interesting about these two equations.

It turns out that if you use the first equation, the HBC reduces to a Naive Bayes classifier. Such an HBC can only learn linear (or quadratic) decision boundaries.

Consider the discrete XOR-like function shown in Figure 1.

hbc_figure_1

There is no way to separate the black dots from the white dots using one straight line.

Such a pattern can only be classified 100% correctly by a non-linear classifier.

If you train a multinomial Naive Bayes classifier on the data in Figure 1, you get the decision boundary seen in Figure 2a.

Note that the dotted area represents the class 1 and the clear area represents the class 0.

Multinomial NB Classifier Decision Boundary
Figure 2a: The decision boundary of a multinomial NB classifier (or a POS HBC).

It can be seen that no matter what the angle of the line is, at least one point of the four will be misclassified.

In this instance, it is the point at {5, 1} that is misclassified as 0 (since the clear area represents the class 0).

You get the same result if you use a POS HBC.

SOP Equation

Our research showed us that something amazing happens if you use the second equation.

With the “sum of products” equation, the HBC becomes capable of deep learning.

SOP + Multinomial Distribution

The decision boundary learnt by a multinomial non-linear HBC (one that computes the posterior using a sum of products of the hidden-node conditional feature probabilities) is shown in Figure 2b.

Decision boundary of a SOP HBC.
Figure 2b: Decision boundary learnt by a multinomial SOP HBC.

The boundary consists of two straight lines passing through the origin. They are angled in such a way that they separate the data points into the two required categories.

All four points are classified correctly since the points at {1, 1} and {5, 5} fall in the clear conical region which represents a classification of 0 whereas the other two points fall in the dotted region representing class 1.

Therefore, the multinomial non-linear hierarchical Bayes classifier can learn the non-linear function of Figure 1.

Gaussian Distribution

The decision boundary learnt by a Gaussian nonlinear HBC is shown in Figure 2c.

Decision Boundary of a Gaussian SOP HBC.
Figure 2c: Decision boundary learnt by a SOP HBC based on the Gaussian probability distribution.

The boundary consists of two quadratic curves separating the data points into the required categories.

Therefore, the Gaussian non-linear HBC can also learn the non-linear function depicted in Figure 1.

Conclusion

Since SOP HBCs are multilayered (with a layer of hidden nodes), and can learn non-linear decision boundaries, they can therefore be said to be capable of deep learning.

Applications to NLP

It turns out that the multinomial SOP HBC can outperform a number of linear classifiers at certain tasks.  For more information, read our paper.

Visit Aiaioo Labs

Advertisements

Fun with Text – Managing Text Analytics

The year is 2016.

I’m a year older than when I designed the text analytics lecture titled “Fun with Text – Hacking Text Analytics“.

Yesterday, I found myself giving a follow on lecture titled “Fun with Text – Managing Text Analytics”.

Here are the slides:

“Hacking Text Analytics” was meant to help students understand a range text analytics problems by reducing them into simpler problems.

But it was designed with the understanding that they would hack their own text analytics tools.

However, in project after project, I was seeing that engineers tended not to build their own text analytics tools, but instead rely on handy and widely available open source products, and that the main thing they needed to learn was how to use them.

So, when I was asked to lecture to an audience at the NASSCOM Big Data and Analytics Summit in Hyderabad, and was advised that a large part of the audience might be non-technical, and could I please base the talk on use-cases, I tried a different tack.

So I designed another lecture “Fun with Text – Managing Text Analytics” about:

  • 3 types of opportunities for text analytics that typically exist in every vertical
  • 3 use cases dealing with each of these types of opportunities
  • 3 mistakes to avoid and 3 things to embrace

And the take away from it is how to go about solving a typical business problem (involving text), using text analytics.

Enjoy the slides!

Visit Aiaioo Labs

Kabir and Language

Kabir
Image from Wikipedia

Yesterday, I went to a concert of songs belonging to the tradition of a 15th century saint-poet called Kabir, and came across a very interesting song that he is said to have composed.

It went something like this.

The cow was milked

Before the calf was born

But after I sold the curd in the market

and this:

The ant went to its wedding

Carrying a gallon of oil

And an elephant and a camel under its arms

From the perspective of natural language processing and machine learning, the incongruous situations depicted in these poems turn out having an interesting pattern in them, as I will explain below.

I found more examples of Kabir’s “inverted verses” online.

The poems at http://www.sriviliveshere.com/mapping-ulat-bansi.html come with beautiful illustrations as well.

Here are a few more lines from Kabir’s inverted verse:

A tree stands without roots

A tree bears fruit without flowers

Someone dances without feet

Someone plays music without hands

Someone sings without a tongue

Water catches fire

Someone sees with blind eyes

A cow eats a lion

A deer eats a cheetah

A crow pounces on a falcon

A quail pounces on a hawk

A mouse eats a cat

A dog eats a jackal

A frog eats snakes

What’s interesting about all of these is that they’re examples of entity-relationships that are false.

Let me first explain what entities and relationships are.

Entities are the real or conceptual objects that we perceive as existing in the world we live in.  They are usually described using a noun phrase and qualified using an adjective.

Relationships are the functions that apply to an ordered list of entities and return a true or false value.

For example, if you take the sentence “The hunter hunts the fox,” there are two entities (1. the hunter, 2. the fox).  The relationship is “hunts”, it returns true for the two entities presented in that order.

The relationship “hunts” would return false if the entities were inverted (as in 1. the fox and 2. the hunter … as in the sentence “The fox hunts the hunter”).

The relationship and the entity can be stored in a database and hence can be considered as the structured form of an unstructured plain-language utterance.

In fact it is entities and relationships such as these that it was speculated would some day make up the semantic web.

Most of Kabir’s inverted verse seems to be based on examples of false entity relationships of dual arity (involving two entities), and that often, there is a violation of entity order which causes the entity function to return the value false.

In the “cow was milked” song, the relationship that is violated is the temporal relationship: “takes place before”.

In the “ant’s wedding” song, the relationship that is violated is that of capability: “can do”.

In the rest of the examples, relationships like “eats”, “hunts”, “plays”, “dances”, “bears fruit”, etc., are violated.

Other Commentary

In Osho’s “The Revolution”, he talks about Kabir’s interest in and distrust of language, quoting the poet as saying:

I HAVE BEEN THINKING OF THE DIFFERENCE BETWEEN WATER

AND THE WAVES ON IT. RISING,

WATER’S STILL WATER, FALLING BACK,

IT IS WATER. WILL YOU GIVE ME A HINT

HOW TO TELL THEM APART?

BECAUSE SOMEONE HAS MADE UP THE WORD ‘WAVE’,

DO I HAVE TO DISTINGUISH IT FROM ‘WATER’?

And Osho concludes with:

Kabir is not interested in giving you any answers — because he knows perfectly well there is no answer. The game of question and answers is just a game — not that Kabir was not answering his disciples’ questions; he was answering, but answering playfully. That quality you have to remember. He is not a serious man; no wise man can ever be serious. Seriousness is part of ignorance, seriousness is a shadow of the ego. The wise is always non-serious. There can be no serious answers to questions, not at least with Kabir — because he does not believe that there is any meaning in life, and he does not believe that you have to stand aloof from life to observe and to find the meaning. He believes in participation. He does not want you to become a spectator, a speculator, a philosopher.

Notes

This genre of verse seems to have been a tradition in folk religious movements in North India.  In “The Tenth Rasa: An Anthology of Indian Nonsense” by Michael Heyman, Sumanya Satpathy and Anushka Ravishankar, they talk about Namdev, a 13th century saint-poet as having authored such verses as well.

A Naive Bayes classifier that outperforms NLTK’s

We found that by changing the smoothing parameters of a Naive Bayes classifier, we could get far better accuracy numbers for certain tasks.  By changing the Lidstone smoothing parameter from 0.05 to 0.5 or greater, we could go from an accuracy of about 50% to almost 70% on the task of question classification for question answering.

This is not at all surprising because, as described in an earlier post, the smoothing method used in the estimation of probabilities affects Naive Bayes classifiers greatly.

Below, we have provided an implementation of a Naive Bayes classifier which outperforms the Naive Bayes classifier supplied with NLTK 3.o by almost 10% on the task of classifying questions from the questions-train.txt file supplied with the textbook “Taming Text”.

Our Naive Bayes classifier (with a Lidstone smoothing parameter of 0.5) exhibits about 65% accuracy on the task of question classification, whereas the NLTK classifier has an accuracy of about 40% as shown below.

smoothing_graph

Finally, I’d like to say a few words about the import of this work.

Theoretically, by increasing the Lidstone smoothing parameter, we are merely compensating more strongly for absent features; we are negating the absence of a feature more vigorously;  reducing the penalty for the absence of a feature in a specific category.

Because increased smoothing lowers the penalty for feature absence, it could help increase the accuracy when a data-set has many low-volume features that do not contribute to predicting a category, but whose chance presence and absence may be construed in the learning phase to be correlated with a category.

Further investigation is required before we can say whether the aforesaid hypothesis would explain the effect of smoothing on the accuracy of classification in regard to the question classification data-set that we used.

However, this exercise shows that algorithm implementations would do well to leave the choice of Lidstone smoothing parameters to the discretion of the end user of a Naive Bayes classifier.

The source code of our Naive Bayes classifier (using Lidstone smoothing) is provided below:

This implementation of the Naive Bayes classifier was created by Geetanjali Rakshit, an intern at Aiaioo Labs.

 

import numpy as np
import random
import sys, math

class Classifier:
	def __init__(self, featureGenerator):
		self.featureGenerator = featureGenerator
		self._C_SIZE = 0
		self._V_SIZE = 0
		self._classes_list = []
		self._classes_dict = {}
		self._vocab = {}

	def setClasses(self, trainingData):
		for(label, line) in trainingData:
			if label not in self._classes_dict.keys():
				self._classes_dict[label] = len(self._classes_list)
				self._classes_list.append(label)
		self._C_SIZE = len(self._classes_list)
		return
		
	def getClasses(self):
		return self._classes_list

	def setVocab(self, trainingData):
		index = 0;
		for (label, line) in trainingData:
			line = self.featureGenerator.getFeatures(line)
			for item in line:
				if(item not in self._vocab.keys()):
					self._vocab[item] = index
					index += 1
		self._V_SIZE = len(self._vocab)
		return

	def getVocab(self):
		return self._vocab

	def train(self, trainingData):
		pass

	def classify(self, testData, params):
		pass

	def getFeatures(self, data):
		return self.featureGenerator.getFeatures(data)
		

class FeatureGenerator:
	def getFeatures(self, text):
		text = text.lower()
		return text.split()


class NaiveBayesClassifier(Classifier):
	def __init__(self, fg, alpha = 0.05):
		Classifier.__init__(self, fg)
		self.__classParams = []
		self.__params = [[]]
		self.__alpha = alpha

	def getParameters(self):
		return (self.__classParams, self.__params)

	def train(self, trainingData):
		self.setClasses(trainingData)
		self.setVocab(trainingData)
		self.initParameters()

		for (cat, document) in trainingData:
			for feature in self.getFeatures(document):
				self.countFeature(feature, self._classes_dict[cat])

	def countFeature(self, feature, class_index):
		counts = 1
		self._counts_in_class[class_index][self._vocab[feature]] = self._counts_in_class[class_index][self._vocab[feature]] + counts
		self._total_counts[class_index] = self._total_counts[class_index] + counts
		self._norm = self._norm + counts

	def classify(self, testData):
		post_prob = self.getPosteriorProbabilities(testData)
		return self._classes_list[self.getMaxIndex(post_prob)]

	def getPosteriorProbabilities(self, testData):
		post_prob = np.zeros(self._C_SIZE)
		for i in range(0, self._C_SIZE):
			for feature in self.getFeatures(testData):
				post_prob[i] += self.getLogProbability(feature, i)
			post_prob[i] += self.getClassLogProbability(i)
		return post_prob

	def getFeatures(self, testData):
		return self.featureGenerator.getFeatures(testData)

	def initParameters(self):
		self._total_counts = np.zeros(self._C_SIZE)
		self._counts_in_class = np.zeros((self._C_SIZE, self._V_SIZE))
		self._norm = 0.0

	def getLogProbability(self, feature, class_index):
		return math.log(self.smooth(self.getCount(feature, class_index),self._total_counts[class_index]))

	def getCount(self, feature, class_index):
		if feature not in self._vocab.keys():
			return 0
		else:
			return self._counts_in_class[class_index][self._vocab[feature]]

	def smooth(self, numerator, denominator):
		return (numerator + self.__alpha) / (denominator + (self.__alpha * len(self._vocab)))

	def getClassLogProbability(self, class_index):
		return math.log(self._total_counts[class_index]/self._norm)

	def getMaxIndex(self, posteriorProbabilities):
		maxi = 0
		maxProb = posteriorProbabilities[maxi]
		for i in range(0, self._C_SIZE):
			if(posteriorProbabilities[i] >= maxProb):
				maxProb = posteriorProbabilities[i]
				maxi = i
		return maxi


class Dataset:
	def __init__(self, filename):
		fp = open(filename, "r")
		i = 0
		self.__dataset = []
		for line in fp:
			if(line != "\n"):
				line = line.split()
				cat = line[0]
				sent = ""
				for word in range(1, len(line)):
					sent = sent+line[word]+" "
				sent = sent.strip()
				self.__dataset.append([cat, str(sent)])
				i = i+1
		random.shuffle(self.__dataset)	
		self.__D_SIZE = i
		self.__trainSIZE = int(0.6*self.__D_SIZE)
		self.__testSIZE = int(0.3*self.__D_SIZE)
		self.__devSIZE = 1 - (self.__trainSIZE + self.__testSIZE)

	def setTrainSize(self, value):
		self.__trainSIZE = int(value*0.01*self.__D_SIZE)
		return self.__trainSIZE

	def setTestSize(self, value):
		self.__testSIZE = int(value*0.01*self.__D_SIZE)
		return self.__testSIZE

	def setDevelopmentSize(self):
		self.__devSIZE = int(1 - (self.__trainSIZE + self.__testSIZE))
		return self.__devSIZE

	def getDataSize(self):
		return self.__D_SIZE
	
	def getTrainingData(self):
		return self.__dataset[0:self.__trainSIZE]

	def getTestData(self):
		return self.__dataset[self.__trainSIZE:(self.__trainSIZE+self.__testSIZE)]

	def getDevData(self):
		return self.__dataset[0:self.__devSIZE]



#============================================================================================

if __name__ == "__main__":
	
	# This Naive Bayes classifier implementation 10% better accuracy than the NLTK 3.0 Naive Bayes classifier implementation
	# at the task of classifying questions in the question corpus distributed with the book "Taming Text".

	# The "questions-train.txt" file can be found in the source code distributed with the book at https://www.manning.com/books/taming-text.
	
	# To the best of our knowledge, the improvement in accuracy is owed to the smoothing methods described in our blog:
	# https://aiaioo.wordpress.com/2016/01/29/in-a-naive-bayes-classifier-why-bother-with-smoothing-when-we-have-unknown-words-in-the-test-set/
	
	filename = "questions-train.txt"
	
	if len(sys.argv) > 1:
		filename = sys.argv[1]
	
	data = Dataset(filename)
	
	data.setTrainSize(50)
	data.setTestSize(50)
	
	train_set = data.getTrainingData()
	test_set = data.getTestData()
	
	test_data = [test_set[i][1] for i in range(len(test_set))]
	actual_labels = [test_set[i][0] for i in range(len(test_set))]
	
	fg = FeatureGenerator()
	alpha = 0.5 #smoothing parameter
	
	nbClassifier = NaiveBayesClassifier(fg, alpha)
	nbClassifier.train(train_set)
	
	correct = 0;
	total = 0;
	for line in test_data:
		best_label = nbClassifier.classify(line)
		if best_label == actual_labels[total]:
			correct += 1
		total += 1
	
	acc = 1.0*correct/total
	print("Accuracy of this Naive Bayes Classifier: "+str(acc))

 

Visit Aiaioo Labs

Text Analytics Tools for Deliberative Democracy

In our last post, we spoke about various control mechanisms that can be implemented to support direct democracy (which we  interpreted to mean the control of the allocation of common resources by the people who pooled in).

We also examined how these controls could be used to curtail man-in-the-middle corruption.

In this article, we examine a more sophisticated form of direct democracy called a deliberative democracy.

In a deliberative democracy, in addition to the control mechanisms prescribed for direct democracy, there need to be mechanisms to allow deliberation (discussion) before a referendum or any other action is taken.

I quote from the Wikipedia article on deliberative democracy:

Deliberative democracy holds that, for a democratic decision to be legitimate, it must be preceded by authentic deliberation, not merely the aggregation of preferences that occurs in voting.

In elitist deliberative democracy, principles of deliberative democracy apply to elite societal decision-making bodies, such as legislatures and courts; in populist deliberative democracy, principles of deliberative democracy apply to groups of lay citizens who are empowered to make decisions.

The article on direct democracy had the following to say:

Democratic theorists have identified a trilemma due to the presence of three desirable characteristics of an ideal system of direct democracy, which are challenging to deliver all at once. These three characteristics are participation – widespread participation in the decision making process by the people affected; deliberation – a rational discussion where all major points of view are weighted according to evidence; and equality – all members of the population on whose behalf decisions are taken have an equal chance of having their views taken into account.

(Aside to computer scientists: doesn’t this trilemma remind you of the CAP theorem that applies to database systems? Here’s a simple explanation of the CAP theorem: http://ksat.me/a-plain-english-introduction-to-cap-theorem/).

So, for example, representative democracy satisfies the requirement for deliberation and equality but sacrifices participation.

Participatory democracy allows inclusive participation and deliberation but sacrifices equality.

And then there is direct democracy which supports participation and equality, but not deliberation.

The problem seems to be that when a large number of people are invited to participate in a deliberation (and given that deliberations take time), it will not be possible to compensate them all for their time. Consequently, only those more interested in the issue being debated (or more likely to benefit from one position or the other) are more likely to participate, biasing the sample in their favour (all sections of the population are no longer equally represented in the discussion/decision).

So, it seems that all the three properties desired in an ideal democratic system – participation, equality and deliberation – cannot be present at the same time in a real democratic system.

But then, a while ago, we began wondering if this trilemma is merely a result of the lack of suitable technology and not really a fundamental property of democracy.  So, we proposed a design for (though we have not yet realized it) a tool that can support the participation of a large number of people in deliberations.  We call it the MCT (Mass Communication Tool).

It could be used as a method to enable direct democracies to support deliberations in which all citizens can participate, ahead of a vote on any subject.

It uses text clustering algorithms to solve the problems of volume as well as numeric asymmetry in the flow of communications between the deliberating participants and the moderators of the communications.

There’s a brief overview of the system in our lab profile.

MCTs are bound to have a huge impact on our experience of representative government.  A typical use case would involve a public figure, (say President Obama), sounding out the electorate before introducing legislation on say healthcare reform.

 

By first discussing the competing proposals with large numbers of people, it might be possible for the initiator of the discussion to get a sense of what might or might not work and what the response to the legislation was likely to be.

 

An MCT would have to be capable of supporting a live dialog involving a large number of people.

It would use natural language processing and machine learning to enable a few moderators (for example, the CEO of a company) to interact with a large number of people (for example, all the employees of the company) in real time (for example, during a virtual all-hands meeting), get a synopsis of a large number of concurrent discussions in real time, and participate in a significant fraction of the discussions as they are taking place.

The system would consist of:

  1. an aggregator of messages (built from natural language processing components) that groups together messages and discussions with identical semantic content;
  2. a hierarchical clustering system (built from natural language processing components) that assigns aggregated messages their place in a hierarchy by specificity with more general messages closer to the root of the hierarchy and more specific messages closer to the leaves of the hierarchy;
  3. a summarization system (built from natural language processing components) that creates a summary of the aggregate of all messages in a sub-tree; and
  4. a reply routing system (built from natural language processing components) that routes replies from cluster to cluster based on their relevance to the discussion threads.

Using text analytics to prevent O(log n) failure of democratic institutions

In this article, we discuss an Achilles’ heel present in many democratic institutions.  We claim that many democratic institutions can be made to fail in ‘log n’ time (exponentially fast) if patronage (nepotistic) networks are allowed to grow unfettered.

We then support the claim using real-world examples of the failure of democratic institutions (political and otherwise) and discuss why such failure has not been observed in polity in India.

We also look at how text analytics can be used to detect (and consequently enable steps to be taken to prevent) such failures.

The Weakness

In democratic institutions, voting mechanisms are used to confer upon one individual (from a field of eligible candidates), powers and responsibilities as specified in the charter of the institution.

In some cases, the powers that accrue to the elected individual are so great that they enable him to use them to ensure his or her re-election.  There are two methods available to such an individual.

The first is to pay off the electoral college and secure re-election.  This method is an O(n) algorithm.  The quantum of resources required to secure re-election is proportional to the number of people who need to be suborned.  So, this method only works in cases where electoral colleges are small (for example, in the case of committees deciding job appointments).

A faster method of suborning large numbers of people exists.  The establishment of a hierarchy of patronage can leverage the dynamics of social networks to speedily (in log n time) corrupt very large numbers of people.

It works like this:  The person who is elected to head the country appoints as immediate subordinates only people who are on account of tribal or ethnic affiliations expected to be loyal to him/her. This appointment is often accompanied by a monetary exchange in the form of a bribe paid by the appointee to secure the post.  Such a monetary exchange helps cement the loyalty since the person making the payment becomes dependent on their superior’s continuation in power to recoup the money spent. In other words, the immediate subordinates are forced to invest in their superiors’ careers.

The subordinates so appointed in turn appoint people loyal to themselves to positions below them, in order to recover their investment in the person above them and so on.  Very soon, the entire government machinery becomes beholden directly or indirectly to the person at the top for their jobs and has a vested interest in keeping the person at the top in power.

In some countries, this effectively transforms the democratically elected ‘president’ into a dictator for life. 

Example 1

The first example of such failure is possibly (and I am just illustrating a point – no disrespect intended) the government of Cameroon.  The President of Cameroon has been in power since the 1980s.  The President is impossible to replace in spite of rampant corruption and economic mismanagement because all the tribal chiefs and officials in Cameroon are beholden to the President.

All these officials and chiefs try and recoup their investment by rent-seeking behavior (you will need their good offices if you wish to do any business in Cameroon).

The resulting economic climate doesn’t necessarily encourage the growth of entrepreneurship or investment in Cameroon.

Example 2

Iraq for many years was ruled by a dictator who had managed to suborn an entire political system.  Iraq once had a participatory democratic system.  But the use of patronage by Saddam led to Iraq coming under totalitarian rule.

Similar failures can be seen in post WW2 Libya and Syria, and in Stalin’s Russia.

India

One common feature of many of the countries where such failure has occurred is that they have hierarchical social structures in which respect and obedience can be commanded by people higher in the hierarchy from people lower in the hierarchy.

Cameroon has a culture of veneration of elders and tribal leaders.  So do the countries of the Arab world.

India also has somewhat similar cultural traits.  So, it is very interesting to see that a similar process of deterioration of democracy has not been observed in India.

One explanation is that India was saved by its own heterogeneity.  India is made up of distinct linguistic regions and ethnic groups.  It would be impossible for tribal and ethnic hierarchies to span India’s many linguistic and ethnic boundaries.

So, even if one province has failed, that would not trigger failure in neighboring provinces, and the regional despot would not be strong enough to change the Indian constitution.

Examples of such failure can arguably be observed to have happened already in Karnataka and in Tamil Nadu.  In Karnataka, a couple of mining barons managed to become ministers in the government and managed to exercise a lot of control over it for a number of years. Had Karnataka been an independent political entity, they might at some point have attempted to change the constitution to give themselves more power and possibly unlimited tenure.

In the state of Tamil Nadu, the ruling politicians are suspected of building patronage networks around themselves in order to facilitate rent-seeking behaviour.  If Tamil Nadu had been an independent political entity, I suspect that it might have lost its democratic character, since the number of people below the Chief Minister with a vested interest in keeping him/her in power would have become too big to permit free and fair elections to take place.

But what is interesting is that in India, though democracy could have failed at the level of the states (controlled by politicians who had suborned a large part of the mechanism of government) the possible failure did not take place or proved reversible.  That is probably because no kinship networks extend to all parts of India.

The fragmentation must have protected the constitution and the electoral system.  That in turn allowed constitutional and legal frameworks and electoral politics to correct the failures.

In Karnataka, the corrupt mining barons and the Chief Minister who supported them ended up behind bars.  In Tamil Nadu, some family members of the corrupt Chief Minister ended up behind bars.  In both states, the parties that had engaged in corruption were voted out of power.

So, failure through networks of patronage might be difficult to engineer in India because of the extremely heterogeneous nature of its society (a result of size and diversity).

Nepotism

Why would someone intending to build a patronage network only elevate kith and kin to positions of power?

As in, why did the dictators in Iraq and Syria choose to pack the governing organizations with people from their own community or tribe?

One possible answer emanates from the work of George Price (http://www.bbc.co.uk/news/magazine-24457645), who showed that altruism exhibited towards close relatives can have concrete benefits for a selfish individual.

I quote from the article:

Price’s equation explained how altruism could thrive, even amongst groups of selfish people.

It built on the work of a number of other scientists, arguably beginning with JBS Haldane, a British biologist who developed a theory in the early 1950s. When asked if he would sacrifice his own life to save that of another, he said that he would, but only under certain conditions. “I would lay down my life for two brothers, or eight cousins.”

Here is the Wikipedia page describing the Price equation (http://en.wikipedia.org/wiki/Price_equation).

So, it is possible that the elevation of kith and kin minimizes the possibility that the people so elevated might one day turn against their patron (they might be more likely to exhibit non-selfish behavior towards their patron).

Detection using Text Analytics

One is tempted to ask whether it is possible to detect at an early stage the process of failure through patronage of a democratic system.

The answer is yes.  It appears to be possible to build a protective mechanism that can uncover and highlight the formation and growth of nepotistic patronage networks.

The BBC article on nepotism in Italy examines research on the detection of nepotism in Italy, and I quote:

Prof Perotti, along with others, has published revealing studies of university teachers, showing the extraordinary concentration of surnames in many departments.

“There is a scarcity of last-names that’s inexplicable,” says fellow academic, Stefano Allesina. “The odds of getting such population densities across so many departments is a million to one.”

Take the University of Bari, where five families have for years dominated the dozens of senior positions in Business and Economics there. Or consider the University of Palermo, where more than half the entire academic population has at least one relative working within the institution.

I happened to find one of Allesina’s papers titled “Measuring Nepotism through Shared Last Names: The Case of Italian Academia” on the internet.

Here it is:  http://www.plosone.org/article/info:doi/10.1371/journal.pone.0021160

I quote from the paper:

Both types of analysis (Monte Carlo and logistic regression) showed the same results: the paucity of names and the abundance of name-sharing connections in Italian academia are highly unlikely to be observed at random. Many disciplines, accounting for the majority of Italian academics, are very likely to be affected by nepotism. There is a strong latitudinal effect, with nepotistic practices increasing in the south. Although detecting some nepotism in Italian academia is hardly surprising, the level of diffusion evidenced by this analysis is well beyond what is expected.

Concentrating resources in the “healthy” part of the system is especially important at a time when funding is very limited and new positions are scarce: two conditions that are currently met by the Italian academic system. Moreover, promoting merit against nepotistic practices could help stem the severe brain-drain observed in Italy.

In December 2010, the Italian Parliament approved a new law for the University. Among other things, the new law forbids the hiring of relatives within the same department and introduces a probation period before tenure. The analysis conducted here should be repeated in the future, as the results could provide an assessment of the efficacy of the new law.

This analysis can be applied to different countries and types of organizations. Policy-makers can use similar methods to target resources and cuts in order to promote fair practices.

The analysis that Allesina performed (a diversity analysis of last names) is a fairly easy text analytics task and provides a clue to the solution to the problem.

Such an analysis can unearth nepotism and allow steps to be taken to prevent it.

Extensions of the method can also be used to determine if, as in the case of Iraq and Syria, a certain community or ethnicity or family has taken over or is beginning to take over the governing mechanisms of a country.

And if the problem is identified early enough, it might give the constitutional, legal and electoral institutions of a democracy a fighting chance at protecting themselves and lead to a correction of the failure.

Analysing documents for non-obvious differences

The ease of classification of documents depends on the categories you are looking to classify documents into.

A few days ago, an engineer wrote about a problem where the analysis that needed to be performed on documents was not the most straight-forward.

He described the problem in a forum as follows: “I am working on sub classification. We already crawled sites using focused crawling. So we know domain, broad category for the site. Sometimes site is also tagged with broad category. So I don’t require to predict broad class for individual site. I am interested in sub-classification. For example, I don’t want to find if post is related to sports, politics, cricket etc. I am interested in to find if post is related to Indian cricket, Australia cricket, given that I already know post is related to cricket. Since in cricket post may contains frequent words like runs, six, fours, out,score etc, which are common across all cricket related posts. So I also want to consider rare terms which can help me in sub-classification. I agree that I may also require frequent words for classification. But I don’t want to skip rare terms for classification.

If you’re dealing with categories like sports, politics and finance, then using machine learning for classification is very easy.  That’s because all the nouns and verbs in the document give you clues as to the category that the document belongs to.

But if you’re given a set of categories for which there are few indicators in the text, you end up with no easy way to categorize it.

After spending a few days thinking about it, I realized that something I had learnt in college could be applied to the problem.  It’s a technique called Feature Selection.

I am going to share the reply I posted to the question, because it might be useful to others working on the classification of documents:

You seem to have a data set that looks as follows (letters are categories and numbers are features):

A P 2 4
A Q 2 5
B P 3 4
B Q 3 5

Let’s say the 2s and the 3s are features that occur very frequently in your corpus while the 4s and the 5s are features that occur far less frequently in your corpus.

When you use the ‘bag of words’ model as your feature vector, your classifier will only learn to tell A apart from B (because the 4s and 5s will not matter much to the classifier, being overwhelmed as it is by the 2s and 3s which are far more frequent).

I think that is why you have come to the conclusion that you need to look for rare words to be able to accomplish your goal of distinguishing category P from category Q.

But in reality, perhaps what you need to do is identify all the features like 4 and 5 that might be able to help you distinguish P from Q and you might even find some frequent features that could help you do that (it might turn out that some frequent features might also have a fairly healthy ability to resolve these categories).

So, now the question just boils down to how you would go about finding the set of features that resolves any given categorization scheme.

The answer seems to be something that literature refers to as ‘Feature Selection’.

As the name says, you select features that help you break data points apart in the way you want.

Wikipedia has an article on Feature Selection:

http://en.wikipedia.org/wiki/Feature_selection 

And Mark Hall’s thesis http://www.cs.waikato.ac.nz/~mhall/thesis.pdf seems to be highly referenced.

Mark Hall’s thesis – “A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other.”

To be honest to you, I’d heard about Feature Selection, but never connected it to the problem it solves until now, so I’m just looking up reading material as I write.

Best of luck with it.