
Deep Learning for NLP

Deep learning is usually associated with neural networks.

In this article, we show that generative classifiers are also capable of deep learning.

What is deep learning?

Deep learning is a method of machine learning involving the use of multiple processing layers to learn non-linear functions or boundaries.

What are generative classifiers?

Generative classifiers use Bayes’ rule to invert the probability of the features F given a class c into a prediction of the class c given the features F.

The class predicted by the classifier is the one yielding the highest P(c|F).
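In symbols, Bayes’ rule gives

P(c|F) = \frac{P(F|c) \, P(c)}{P(F)}

and the prediction is \hat{c} = \operatorname{argmax}_c P(c|F).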

A commonly used generative classifier is the Naive Bayes classifier.  It has two layers (one for the features F and one for the classes C).

Deep learning using generative classifiers

The first thing you need for deep learning is a hidden layer.  So you add one more layer H between the C and F layers to get a Hierarchical Bayesian classifier (HBC).

Now, you can compute P(c|F) in an HBC in two ways.

Product of sums:

P(c|F) \propto P(c) \prod_{f \in F} \sum_{h} P(f|h) \, P(h|c)

Sum of products:

P(c|F) \propto P(c) \sum_{h} P(h|c) \prod_{f \in F} P(f|h)

The first equation computes P(c|F) using a product of sums (POS).  The second equation computes P(c|F) using a sum of products (SOP).
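To make the difference concrete, here is a minimal numpy sketch of the two computations (the parameter values and shapes are illustrative assumptions, not a trained model):

import numpy as np

# Illustrative HBC parameters (assumed values, not a trained model):
# prior[c] = P(c); trans[c][h] = P(h|c); emit[h][f] = P(f|h)
prior = np.array([0.5, 0.5])                # 2 classes
trans = np.array([[0.7, 0.3], [0.2, 0.8]])  # classes x hidden nodes
emit = np.array([[0.6, 0.4], [0.1, 0.9]])   # hidden nodes x features

features = [0, 1]  # indices of the observed features F

def posterior_pos(features):
	# product of sums: P(c|F) ~ P(c) * prod_f sum_h P(f|h) P(h|c)
	scores = prior * np.prod([trans @ emit[:, f] for f in features], axis=0)
	return scores / scores.sum()

def posterior_sop(features):
	# sum of products: P(c|F) ~ P(c) * sum_h P(h|c) prod_f P(f|h)
	per_hidden = np.prod(emit[:, features], axis=1)  # prod_f P(f|h), one entry per h
	scores = prior * (trans @ per_hidden)
	return scores / scores.sum()

print(posterior_pos(features))  # identical to a Naive Bayes posterior
print(posterior_sop(features))  # the form that can learn non-linear boundaries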

POS Equation

We discovered something very interesting about these two equations.

It turns out that if you use the first equation, the HBC reduces to a Naive Bayes classifier. Such an HBC can only learn linear (or quadratic) decision boundaries.
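To see why, note that the inner sum of the POS equation marginalizes the hidden layer out of each factor (each feature depends on the class only through H, by the structure of the HBC):

\sum_{h} P(f|h) \, P(h|c) = P(f|c)

so the POS posterior collapses to the familiar Naive Bayes form, P(c|F) \propto P(c) \prod_{f \in F} P(f|c).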

Consider the discrete XOR-like function shown in Figure 1.

Figure 1: A discrete XOR-like function.

There is no way to separate the black dots from the white dots using one straight line.

Such a pattern can only be classified 100% correctly by a non-linear classifier.

If you train a multinomial Naive Bayes classifier on the data in Figure 1, you get the decision boundary seen in Figure 2a.

Note that the dotted area represents the class 1 and the clear area represents the class 0.

Figure 2a: The decision boundary of a multinomial NB classifier (or a POS HBC).

It can be seen that no matter what the angle of the line, at least one of the four points will be misclassified.

In this instance, it is the point at {5, 1} that is misclassified as 0 (since the clear area represents the class 0).

You get the same result if you use a POS HBC.

SOP Equation

Our research showed us that something amazing happens if you use the second equation.

With the “sum of products” equation, the HBC becomes capable of deep learning.

SOP + Multinomial Distribution

The decision boundary learnt by a multinomial non-linear HBC (one that computes the posterior using a sum of products of the hidden-node conditional feature probabilities) is shown in Figure 2b.

Figure 2b: Decision boundary learnt by a multinomial SOP HBC.

The boundary consists of two straight lines passing through the origin. They are angled in such a way that they separate the data points into the two required categories.

All four points are classified correctly, since the points at {1, 1} and {5, 5} fall in the clear conical region, which represents a classification of 0, whereas the other two points fall in the dotted region representing class 1.

Therefore, the multinomial non-linear hierarchical Bayes classifier can learn the non-linear function of Figure 1.

Gaussian Distribution

The decision boundary learnt by a Gaussian nonlinear HBC is shown in Figure 2c.

Figure 2c: Decision boundary learnt by a SOP HBC based on the Gaussian probability distribution.

The boundary consists of two quadratic curves separating the data points into the required categories.

Therefore, the Gaussian non-linear HBC can also learn the non-linear function depicted in Figure 1.

Conclusion

Since SOP HBCs are multilayered (with a layer of hidden nodes) and can learn non-linear decision boundaries, they can be said to be capable of deep learning.

Applications to NLP

It turns out that the multinomial SOP HBC can outperform a number of linear classifiers at certain tasks.  For more information, read our paper.



Fun with Text – Managing Text Analytics

The year is 2016.

I’m a year older than when I designed the text analytics lecture titled “Fun with Text – Hacking Text Analytics”.

Yesterday, I found myself giving a follow-on lecture titled “Fun with Text – Managing Text Analytics”.

Here are the slides:

“Hacking Text Analytics” was meant to help students understand a range of text analytics problems by reducing them into simpler problems.

But it was designed with the understanding that they would hack their own text analytics tools.

However, in project after project, I was seeing that engineers tended not to build their own text analytics tools but instead relied on handy, widely available open-source products, and that the main thing they needed to learn was how to use them.

So, when I was asked to lecture at the NASSCOM Big Data and Analytics Summit in Hyderabad, and was advised that a large part of the audience might be non-technical and that I should base the talk on use cases, I tried a different tack.

So I designed another lecture “Fun with Text – Managing Text Analytics” about:

  • 3 types of opportunities for text analytics that typically exist in every vertical
  • 3 use cases dealing with each of these types of opportunities
  • 3 mistakes to avoid and 3 things to embrace

And the takeaway from it is how to go about solving a typical business problem involving text, using text analytics.

Enjoy the slides!


A Naive Bayes classifier that outperforms NLTK’s

We found that by changing the smoothing parameters of a Naive Bayes classifier, we could get far better accuracy numbers for certain tasks.  By changing the Lidstone smoothing parameter from 0.05 to 0.5 or greater, we could go from an accuracy of about 50% to almost 70% on the task of question classification for question answering.

This is not at all surprising because, as described in an earlier post, the smoothing method used in the estimation of probabilities affects Naive Bayes classifiers greatly.

Below, we have provided an implementation of a Naive Bayes classifier which outperforms the Naive Bayes classifier supplied with NLTK 3.0 by almost 10% on the task of classifying questions from the questions-train.txt file supplied with the textbook “Taming Text”.

Our Naive Bayes classifier (with a Lidstone smoothing parameter of 0.5) exhibits about 65% accuracy on the task of question classification, whereas the NLTK classifier has an accuracy of about 40% as shown below.

[Figure: accuracy of the two classifiers at different settings of the smoothing parameter]

Finally, I’d like to say a few words about the import of this work.

Theoretically, by increasing the Lidstone smoothing parameter, we are merely compensating more strongly for absent features: we are negating the absence of a feature more vigorously, reducing the penalty for the absence of a feature in a specific category.

Because increased smoothing lowers the penalty for feature absence, it could help increase the accuracy when a data-set has many low-volume features that do not contribute to predicting a category, but whose chance presence and absence may be construed in the learning phase to be correlated with a category.
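A rough numerical sketch makes this concrete (the counts below are toy numbers assumed for illustration, not taken from our experiments):

import math

def lidstone(count, total, alpha, vocab_size):
	# Lidstone-smoothed estimate of P(feature | class)
	return (count + alpha) / (total + alpha * vocab_size)

total_count, vocab_size = 1000, 5000  # assumed class total and vocabulary size
for alpha in (0.05, 0.5):
	# log-probability assigned to a feature never seen in the class:
	# about -10.1 at alpha = 0.05, about -8.9 at alpha = 0.5,
	# so the larger alpha imposes a smaller penalty for absence
	print(alpha, math.log(lidstone(0, total_count, alpha, vocab_size)))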

Further investigation is required before we can say whether the aforesaid hypothesis would explain the effect of smoothing on the accuracy of classification in regard to the question classification data-set that we used.

However, this exercise shows that algorithm implementations would do well to leave the choice of Lidstone smoothing parameters to the discretion of the end user of a Naive Bayes classifier.

The source code of our Naive Bayes classifier (using Lidstone smoothing) is provided below:

This implementation of the Naive Bayes classifier was created by Geetanjali Rakshit, an intern at Aiaioo Labs.

 

import numpy as np
import random
import sys, math

class Classifier:
	def __init__(self, featureGenerator):
		self.featureGenerator = featureGenerator
		self._C_SIZE = 0
		self._V_SIZE = 0
		self._classes_list = []
		self._classes_dict = {}
		self._vocab = {}

	def setClasses(self, trainingData):
		for(label, line) in trainingData:
			if label not in self._classes_dict.keys():
				self._classes_dict[label] = len(self._classes_list)
				self._classes_list.append(label)
		self._C_SIZE = len(self._classes_list)
		return
		
	def getClasses(self):
		return self._classes_list

	def setVocab(self, trainingData):
		index = 0
		for (label, line) in trainingData:
			line = self.featureGenerator.getFeatures(line)
			for item in line:
				if(item not in self._vocab.keys()):
					self._vocab[item] = index
					index += 1
		self._V_SIZE = len(self._vocab)
		return

	def getVocab(self):
		return self._vocab

	def train(self, trainingData):
		pass

	def classify(self, testData):
		pass

	def getFeatures(self, data):
		return self.featureGenerator.getFeatures(data)
		

class FeatureGenerator:
	def getFeatures(self, text):
		text = text.lower()
		return text.split()


class NaiveBayesClassifier(Classifier):
	def __init__(self, fg, alpha = 0.05):
		Classifier.__init__(self, fg)
		self.__classParams = []
		self.__params = [[]]
		self.__alpha = alpha

	def getParameters(self):
		return (self.__classParams, self.__params)

	def train(self, trainingData):
		self.setClasses(trainingData)
		self.setVocab(trainingData)
		self.initParameters()

		for (cat, document) in trainingData:
			for feature in self.getFeatures(document):
				self.countFeature(feature, self._classes_dict[cat])

	def countFeature(self, feature, class_index):
		# increment the per-class count for this feature, the class total and the grand total
		self._counts_in_class[class_index][self._vocab[feature]] += 1
		self._total_counts[class_index] += 1
		self._norm += 1

	def classify(self, testData):
		post_prob = self.getPosteriorProbabilities(testData)
		return self._classes_list[self.getMaxIndex(post_prob)]

	def getPosteriorProbabilities(self, testData):
		post_prob = np.zeros(self._C_SIZE)
		for i in range(0, self._C_SIZE):
			for feature in self.getFeatures(testData):
				post_prob[i] += self.getLogProbability(feature, i)
			post_prob[i] += self.getClassLogProbability(i)
		return post_prob

	def getFeatures(self, testData):
		return self.featureGenerator.getFeatures(testData)

	def initParameters(self):
		self._total_counts = np.zeros(self._C_SIZE)
		self._counts_in_class = np.zeros((self._C_SIZE, self._V_SIZE))
		self._norm = 0.0

	def getLogProbability(self, feature, class_index):
		return math.log(self.smooth(self.getCount(feature, class_index),self._total_counts[class_index]))

	def getCount(self, feature, class_index):
		if feature not in self._vocab.keys():
			return 0
		else:
			return self._counts_in_class[class_index][self._vocab[feature]]

	def smooth(self, numerator, denominator):
		# Lidstone smoothing: add alpha to the count and alpha * |V| to the total
		return (numerator + self.__alpha) / (denominator + (self.__alpha * len(self._vocab)))

	def getClassLogProbability(self, class_index):
		return math.log(self._total_counts[class_index]/self._norm)

	def getMaxIndex(self, posteriorProbabilities):
		maxi = 0
		maxProb = posteriorProbabilities[maxi]
		for i in range(0, self._C_SIZE):
			if(posteriorProbabilities[i] >= maxProb):
				maxProb = posteriorProbabilities[i]
				maxi = i
		return maxi


class Dataset:
	def __init__(self, filename):
		fp = open(filename, "r")
		i = 0
		self.__dataset = []
		for line in fp:
			if line != "\n":
				line = line.split()
				cat = line[0]
				sent = " ".join(line[1:])
				self.__dataset.append([cat, sent])
				i = i + 1
		fp.close()
		random.shuffle(self.__dataset)
		self.__D_SIZE = i
		self.__trainSIZE = int(0.6*self.__D_SIZE)
		self.__testSIZE = int(0.3*self.__D_SIZE)
		self.__devSIZE = self.__D_SIZE - (self.__trainSIZE + self.__testSIZE)

	def setTrainSize(self, value):
		self.__trainSIZE = int(value*0.01*self.__D_SIZE)
		return self.__trainSIZE

	def setTestSize(self, value):
		self.__testSIZE = int(value*0.01*self.__D_SIZE)
		return self.__testSIZE

	def setDevelopmentSize(self):
		self.__devSIZE = self.__D_SIZE - (self.__trainSIZE + self.__testSIZE)
		return self.__devSIZE

	def getDataSize(self):
		return self.__D_SIZE
	
	def getTrainingData(self):
		return self.__dataset[0:self.__trainSIZE]

	def getTestData(self):
		return self.__dataset[self.__trainSIZE:(self.__trainSIZE+self.__testSIZE)]

	def getDevData(self):
		# the development split follows the training and test splits
		return self.__dataset[(self.__trainSIZE + self.__testSIZE):(self.__trainSIZE + self.__testSIZE + self.__devSIZE)]



#============================================================================================

if __name__ == "__main__":
	
	# This Naive Bayes classifier implementation achieves almost 10% better accuracy than the NLTK 3.0 Naive Bayes classifier implementation
	# at the task of classifying questions in the question corpus distributed with the book "Taming Text".

	# The "questions-train.txt" file can be found in the source code distributed with the book at https://www.manning.com/books/taming-text.
	
	# To the best of our knowledge, the improvement in accuracy is owed to the smoothing methods described in our blog:
	# https://aiaioo.wordpress.com/2016/01/29/in-a-naive-bayes-classifier-why-bother-with-smoothing-when-we-have-unknown-words-in-the-test-set/
	
	filename = "questions-train.txt"
	
	if len(sys.argv) > 1:
		filename = sys.argv[1]
	
	data = Dataset(filename)
	
	data.setTrainSize(50)
	data.setTestSize(50)
	
	train_set = data.getTrainingData()
	test_set = data.getTestData()
	
	test_data = [test_set[i][1] for i in range(len(test_set))]
	actual_labels = [test_set[i][0] for i in range(len(test_set))]
	
	fg = FeatureGenerator()
	alpha = 0.5 #smoothing parameter
	
	nbClassifier = NaiveBayesClassifier(fg, alpha)
	nbClassifier.train(train_set)
	
	correct = 0
	total = 0
	for line in test_data:
		best_label = nbClassifier.classify(line)
		if best_label == actual_labels[total]:
			correct += 1
		total += 1
	
	acc = 1.0*correct/total
	print("Accuracy of this Naive Bayes Classifier: "+str(acc))

 


Fun With Text – Hacking Text Analytics

hacking_text_analytics

I’ve always wondered if there was a way to teach people to cobble together quick and dirty solutions to problems involving natural language, from duct tape, as it were.

Having worked in the field now for donkey’s years as of 2015, and having taught a number of text analytics courses along the way, I’ve seen students of text analysis stumble mostly on one of two hurdles:

1.  Inability to Reduce Text Analytics Problems to Machine Learning Problems

I’ve seen students, after hours of training, still revert to rule-based thinking when asked to solve new problems involving text.

You can spend hours teaching people about classification and feature sets, but when you ask them to apply their learning to a new task, say segmenting a resume, you’ll hear them very quickly falling back to thinking in terms of programming steps.

Umm, you could write a script to look for a horizontal line, followed by capitalized text in bold, big font, with the words “Education” or “Experience” in it !!!

2.  Inability to Solve the Machine Learning (ML) Problems

Another task that I have seen teams getting hung up on has been solving ML problems and comparing different solutions.

My manager wants me to identify the ‘introduction’ sections.  So, I labelled 5 sentences as introductions.  Then, I trained a maximum entropy classifier with them.  Why isn’t it working?

One Machine Learning Algorithm to Rule Them All

One day, when I was about to give a lecture at Barcamp Bangalore, I had an idea.

Wouldn’t it be fun to try to use just one machine learning algorithm, show people how to code up that algorithm themselves, and then show them how a really large number of text analytics problems (almost every single problem related to the semantic web) could be solved using it?

So, I quickly wrote up a set of problems in order of increasing complexity, and went about trying to reduce them all to one ML problem, and surprised myself!  It could be done!

Just about every text analytics problem related to the semantic web (which is, by far, the most important commercial category) could be reduced to a classification problem.

Moreover, you could tackle just about any problem using just two steps:

a) Modeling the problem as a machine learning problem

Spot the machine learning problem underlying the text analytics problem (and, if it is a classification problem, the relevant categories), and you’ve reduced the text analytics problem to a machine learning problem.

b) Solving the problem using feature engineering

To solve the machine learning problem, you need to come up with a set of features that allows the machine learning algorithm to separate the desired categories.
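As a tiny sketch of both steps on the resume-segmentation example from earlier (the labels and feature set here are assumptions made up for illustration):

def line_features(line):
	# layout and keyword cues become classifier features
	return {
		"all_caps": line.isupper(),
		"mentions_education": "education" in line.lower(),
		"mentions_experience": "experience" in line.lower(),
		"is_short": len(line.split()) <= 3,
	}

# A classifier (Naive Bayes, maximum entropy, ...) trained on
# (line_features(line), label) pairs can then label every line of a
# resume as, say, SECTION_HEADING or BODY.
print(line_features("EDUCATION"))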

That’s it!

Check it out for yourself!

Here’s a set of slides.

It’s called “Fun with Text – Hacking Text Analytics”.

Building Machine Learning Models that can help with Customer Service and Supply Chain Management

The Laptop that Stopped Working

One fine day, a couple of months ago, a laptop that we owned stopped working.  We heard 4 beeps coming from the machine at intervals but nothing appeared on the screen.

Customer Service

The service person quickly looked up the symptoms in his knowledge base and informed us that 4 beeps meant a memory error.

I first replaced the two memory modules one by one, but the machine still wouldn’t start.  Then I tried two spare memory modules that I had in the cupboard, but the computer still wouldn’t start.

I had a brand new computer with me that used the same type and speed of memory as the one we were fixing.  I pulled out its memory chips and inserted them into the faulty computer, but still no luck.

At that point, the service person told me that it must be the motherboard itself that was not working.

Second Attempt at Triage

So the next day, a motherboard and some memory arrived at my office.  A little later, a field engineer showed up and replaced the motherboard.  The computer still wouldn’t start up.

When the field engineer heard 4 beeps, the engineer said it MUST BE THE MEMORY.

Third Attempt at Triage

A few days later, a new set of memory modules arrived.

The engineer returned and tried inserting the new memory in.  Still no luck.  The computer would not start and you could still hear the 4 beeps.

A third set of brand new memory modules and a new motherboard were sent over.

Fourth Attempt at Triage

The engineer tried both motherboards and various combinations of memory modules, but still, all you could hear were 4 beeps and the computer would not start.

During one of his attempts to combine memory and motherboards, the engineer noticed that though the computer did not start, it did not beep either.

So, the engineer guessed that it was the screen that was not working.  But just to be safe, he’d ask them to send another motherboard and another set of memory modules to go with it.

Fifth Attempt at Triage

The screen, the third motherboard and the fourth set of memory modules arrived in our office and an engineer spent the day trying various combinations of screens, motherboards and memory modules.

But the man on the phone said: “Sir, 4 beeps means there is something wrong with your memory.  I will have them replaced.”

I had to take out my new laptop’s memory and pop it into the faulty machine to convince the engineer and support staff that replacing the memory would not fix the problem.

All the parts were now sent over – the memory, motherboard, processor, drive, and screen.

Sixth Attempt at Triage

Finally, the field engineer found that when he had replaced the processor, the computer was able to boot up with no problems.

Better Root Cause Analysis

The manufacturer could have spared themselves all that expense, time and effort had they used an expert system that relied on a probabilistic model of the symptoms and their causes.

Such a model would be able to tell, given the symptoms, which component was the most likely to have failed.

Such a model would be able to direct a field engineer to the component or components whose replacement would be most likely to fix the problem.

If the attempted fix did not work, the model would simply update its understanding of the problem and recommend a different course of action.

I will illustrate the process using what is known in the machine learning community as a directed probabilistic graphical model.

Run-Through of Root Cause Analysis 

Let’s say a failure has occurred and there is only one symptom that can be observed: the laptop won’t start and emits 4 beeps.

The first step is to enter this information into the probabilistic graphical model.  From a list of symptoms, we select the ones that we observe (all observed symptoms are represented as yellow circles in this document).

So the following diagram has only one circle (observed symptom). 

Model 1:  The symptom of 4 beeps is modeled in a probabilistic graphical model with a yellow circle as follows:

[Diagram: one yellow circle representing the observed symptom, 4 beeps]

Now, let’s assume that this symptom can be caused by the failure of memory, the motherboard or the processor.

Model 2:  I can add that information to the predictive model, so that the model now looks like this:

[Diagram: the 4-beeps symptom linked to three possible causes: memory, motherboard and processor failure]

The model captures the belief that the causes of the symptom – processor, memory or motherboard failure – are (in the absence of any symptoms) independent of each other.

It also captures the belief that given a symptom like 4 beeps, evidence for one cause will explain away (or decrease the probability of) the other causes.

Once such a model is built, it can tell a field engineer the most probable cause of a symptom, the second most probable cause and so on.

So, the engineer will only have to look at the output of the model’s analysis to know whether he needs to replace one component, or two, and which ones.

When the field engineer goes out and replaces the components, his actions can also be fed into the model.

Model 3:  Below is an extended model into which attempts to fix the problem by replacing the memory can be incorporated.

[Diagram: the model extended with a memory-replacement node]

If a field engineer were to feed into the system the fact that the memory was replaced with a new module and it didn’t fix the problem, the system would be able to immediately figure out that the memory could not be the cause of the problem, and it would suggest the next most probable cause of failure.
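Here is a minimal sketch of that reasoning, with a noisy-OR model of the symptom (all the failure rates below are illustrative assumptions, not calibrated statistics):

import itertools

priors = {"memory": 0.10, "motherboard": 0.05, "processor": 0.02}
leak = 0.001       # chance of 4 beeps with no modelled failure
cause_power = 0.8  # chance that any single failed part produces the beeps

def posterior(known_good=()):
	# P(part failed | 4 beeps observed), by brute-force enumeration over
	# fault combinations; parts in known_good were replaced and verified.
	totals = {part: 0.0 for part in priors}
	norm = 0.0
	for states in itertools.product([False, True], repeat=len(priors)):
		faults = dict(zip(priors, states))
		if any(faults[part] for part in known_good):
			continue  # a verified-good part cannot be the fault
		p = 1.0
		for part, failed in faults.items():
			p *= priors[part] if failed else 1.0 - priors[part]
		# noisy-OR: each failed part independently triggers the symptom
		p *= 1.0 - (1.0 - leak) * (1.0 - cause_power) ** sum(faults.values())
		norm += p
		for part, failed in faults.items():
			if failed:
				totals[part] += p
	return {part: totals[part] / norm for part in priors}

print(posterior())                        # memory is the most probable cause
print(posterior(known_good=("memory",)))  # explaining away after a verified swap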

Model 4

Finally, in case new memory modules being sent to customers for repairs frequently turned out to be defective, that information could also be added to the model as follows:

[Diagram: the model extended with a node for defective replacement memory]

Now, if the error rate for new memory modules in the supply chain happens to be high for a particular type of memory, and a memory replacement fails to fix a 4-beep problem, the model will understand that faulty memory could still be the cause of the problem.

Applications to Supply Chain Management

The probabilities at all the nodes adjust themselves continually, and this information can be used to detect whether the error rates in new memory-module deliveries have suddenly gone up.

Benefits to a Customer Service Process

1.  Formal capture and storage of triage history

2.  Suggestion of cause(s) given the effects (symptoms)

3.  Suggestion of other causes given triage steps performed

What the system will seem to be doing (to the layman):

1.  Recording symptoms

2.  Recommending a course of action

3.  Recording the outcome of the course of action

4.  Recommending next steps

Analysing documents for non-obvious differences

The ease of classification of documents depends on the categories you are looking to classify documents into.

A few days ago, an engineer wrote about a problem where the analysis that needed to be performed on documents was not the most straightforward.

He described the problem in a forum as follows: “I am working on sub classification. We already crawled sites using focused crawling. So we know domain, broad category for the site. Sometimes site is also tagged with broad category. So I don’t require to predict broad class for individual site. I am interested in sub-classification. For example, I don’t want to find if post is related to sports, politics, cricket etc. I am interested in to find if post is related to Indian cricket, Australia cricket, given that I already know post is related to cricket. Since in cricket post may contains frequent words like runs, six, fours, out,score etc, which are common across all cricket related posts. So I also want to consider rare terms which can help me in sub-classification. I agree that I may also require frequent words for classification. But I don’t want to skip rare terms for classification.”

If you’re dealing with categories like sports, politics and finance, then using machine learning for classification is very easy.  That’s because all the nouns and verbs in the document give you clues as to the category that the document belongs to.

But if you’re given a set of categories for which there are few indicators in the text, you end up with no easy way to categorize it.

After spending a few days thinking about it, I realized that something I had learnt in college could be applied to the problem.  It’s a technique called Feature Selection.

I am going to share the reply I posted to the question, because it might be useful to others working on the classification of documents:

You seem to have a data set that looks as follows (letters are categories and numbers are features):

A P 2 4
A Q 2 5
B P 3 4
B Q 3 5

Let’s say the 2s and the 3s are features that occur very frequently in your corpus while the 4s and the 5s are features that occur far less frequently in your corpus.

When you use the ‘bag of words’ model as your feature vector, your classifier will only learn to tell A apart from B (because the 4s and 5s will not matter much to the classifier, being overwhelmed as it is by the 2s and 3s which are far more frequent).

I think that is why you have come to the conclusion that you need to look for rare words to be able to accomplish your goal of distinguishing category P from category Q.

But in reality, perhaps what you need to do is identify all the features, like 4 and 5, that might help you distinguish P from Q.  You might even find that some frequent features have a fairly healthy ability to resolve these categories.

So, now the question just boils down to how you would go about finding the set of features that resolves any given categorization scheme.

The answer seems to be something that the literature refers to as ‘Feature Selection’.

As the name says, you select features that help you break data points apart in the way you want.
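To make this concrete on the toy data above, here is a sketch using scikit-learn’s chi-squared feature scorer (the count encoding is an assumption made for illustration):

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Rows are the documents A-P, A-Q, B-P, B-Q; columns are counts of the
# features "2", "3", "4" and "5" (the frequent features get large counts).
X = np.array([[10, 0, 1, 0],
              [10, 0, 0, 1],
              [0, 10, 1, 0],
              [0, 10, 0, 1]])
y = ["P", "Q", "P", "Q"]  # the sub-categories we want to resolve

selector = SelectKBest(chi2, k=2).fit(X, y)
print(chi2(X, y)[0])                       # the rare "4"/"5" columns score highest
print(selector.get_support(indices=True))  # -> [2 3]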

Wikipedia has an article on Feature Selection:

http://en.wikipedia.org/wiki/Feature_selection 

And Mark Hall’s thesis http://www.cs.waikato.ac.nz/~mhall/thesis.pdf seems to be highly referenced.

Mark Hall’s thesis – “A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other.”

To be honest with you, I’d heard about Feature Selection, but never connected it to the problem it solves until now, so I’m just looking up reading material as I write.

Best of luck with it.

Frameworks for evaluating values

I recently came across a very interesting 2001 paper by Daphne Koller that dealt with influence diagrams and how they could be applied to game theory.  I found the paper while doing background reading for a talk by a friend of mine, Somik Raha, on decision making in accordance with our core values.

Influence diagrams are a formalism (very similar to probabilistic graphical models) that are used for making decisions.

What Somik Raha has attempted to do is come up with a framework for making decisions while also taking one’s values into account (either as constraints or as inputs into the decision model).  To do that he proposes extensions to influence diagrams.

What I found interesting when I thought about Daphne Koller’s work and Somik’s together, is that they could possibly give you a framework to evaluate your values.

Koller’s formalism reduces to a game theoretic model, which can be evaluated to determine the outcome of the decisions made by a group of people.

Plug in a formalism based on Somik’s ideas and you just might be able to create a way to quantify the benefits of values.

The Importance of Values?

I have been thinking a bit about values these days because there has been a horrific gang rape in Delhi, and there have recently been numerous incidents of bad driving in Bangalore in which friends of mine have been injured.  Then there is corruption.  Our society seems to be quite happy with inequality and vast differences in the distribution of wealth.  It makes me wonder if our values are to blame.

I have often wondered whether some of our problems originate in our value systems and whether the value systems that we consider sacrosanct in India are really very good ones.

Let me take just a couple of values that most Indians would consider to be very good ones

  1. Non-violence
  2. Obedience

and let’s discuss them in more detail.

  1.  Non-violence

This value appeals not just to people in India.  Variants of the value of non-violence appear in Tolstoy’s writings and in the Semitic religions, as you can see from the Bible (“turn the other cheek”) and the Quran (“give alms to one who begs from you, even if he comes on the back of a horse”).

The issue with this sort of value is that it makes a person (and those around him/her) extremely vulnerable to injustice.

In India, we restrict the liberty of women – in their choice of clothing, company and lifestyle – for fear that they could be in danger if they violated societal norms.  This shows that none of us want to fight society or cross swords with someone who might make disparaging comments about personal choices.

Moreover, possibly as a result of the value of non-violence, very few Indians if any are taught fighting skills in school.  So, even if a person really wanted to act, say to protect a friend, he or she might not really have the skills to take down an aggressor.

So, instead of protecting and standing up for people who might be vulnerable, we become their tormentors and make their lives more miserable, just so we don’t have to get our hands dirty, or because we don’t have the skills and strength to do squat.

I’ve written about how bribes are openly collected by traffic policemen.  It should be very easy to put a stop to such behavior if you’re willing to fight.

If non-violence is not a core value, then how do we protect people from tearing each other to bits?

We could start with a question like:  non-violence for what purpose?  (turning it into an extrinsic value)

If the answer is something like, “so that the weak feel protected”, why not make protecting the weak our core value?

I’d prefer teaching kids values like “Don’t ever turn your back on a bully” rather than values like “Don’t fight anybody, and just come home safe, child!”

2.  Obedience

Indian parents love to boast that their child is “such an obedient child!”

Is that a good thing?

Obedience is different from politeness or respect.  The latter are mutual but the former is one way.

So, the politics of obedience creates a hierarchy of subservience.

In India, Parents expect complete obedience from Children.

The Police expect complete obedience from People.

The Politicians expect complete obedience from Police.

Teachers expect complete obedience from Students.

Managers expect complete obedience from Employees.

The creation of the hierarchy (through expectations of obedience) can be very dangerous in many ways.

1.  It can stifle creativity and problem-solving ability.  There is a bias against ideas flowing up a hierarchy because those higher up the hierarchy claim their place above those required to be obedient to them on the premise that they are somehow superior to those below them.  A good example is how parliament will not accept that people have a right to demand a bill against corruption (members of the Indian parliament claim that parliament is supreme in a parliamentary democracy – not the citizens that the parliamentarians represent).

2.  It can leave young people ill-equipped to defend their personal spaces.  I read in a paper on rape that many rapists approach victims by testing their boundaries.  They make comments and otherwise violate the intended victim’s personal boundaries.  If these are not strongly resisted, the probability of an assault becomes greater.  Another strategy used by rapists is to move their intended victim to a new location where they are more vulnerable. It is very important for people to be conditioned so that they do not obey an order by an attacker to relocate under any circumstances.

3.  The hierarchies perpetuate the power of stronger (bigger, older or richer) parties by providing social sanction to their dominant position, and so hinder social mobility.

4.  The obedience hierarchy could allow a few people at the top to amass too much power. It might have, for example, prevented cops from disobeying those in power during the Gujarat riots.

5.  Obedience means valuing rules above truth.  Obedience implies not challenging the rules or the status quo.  So there is little scope for discovering if the rules really are good ones for everybody.  People often defend something they assert with a “because I said so.” – that is, you are expected to believe them because of their authority, and not because they can substantiate their assertion.

Obedience as an absolute value is not entirely harmless.  It could be dangerous to us as a society because corrupt politicians can use the pliability and obedience of people around them to get away with evil (remember the activist who was hacked to death on the orders of a corporator from Bangalore, the journalist who was burnt to death in Uttar Pradesh, or the shutdown that the former Chief Minister of Karnataka State ordered when he was about to be investigated for corrupt dealings?).

I’d love to replace “obedience” with something else, perhaps “honesty” and “trustworthiness” and “pride”.

Summary

I understand that we as Indians are very proud of our values but I’ve tried to argue that our values need to be re-examined.

Personally, I’d love to see the day when we replace all our values with just the value of trustworthiness.

Trustworthiness as a value would mean we’d fight for each other.  It would mean we’d protect the weak.  It would mean we’d be on time.  It would mean we’d be honest.  It would mean we’d be capable and skilled and strong.  It would mean we’d be proud of each other.  It might mean we’d never lose another war.

Reading Koller’s and Somik’s work, you get the feeling that one day you might be able to evaluate the comparative benefits of two sets of values, and pick the better one, using plain math.

And hopefully, by showing them mathematical proofs, you can convince people to change their values and pick better ones for themselves.