Month: March 2016

Programming in Hindi and Tamil in addition to Kannada

For children who are taught computer science in Indian languages, conventional software programming can pose a challenge because most programming languages use the Roman alphabet.

We’ve created a way to allow students and teachers to write Arduino programs in Indian languages (at the request of Arduino India).

Arduino boards can now be programmed in Hindi and Tamil in addition to Kannada (as already described in an earlier post).

The Indian language extensions can be downloaded from our website.

These are the words used as replacements for English words: http://www.aiaioo.com/arduino_in_local_languages/

The website allows one to comment on and discuss the keywords picked as replacements for English words.

There’s still a lot of work to be done on the choices.

Very specifically, we’re looking for ways to simplify the words so that school children will find them easier to remember and type.

Any ideas for facilitating the learning process for children would be very welcome.

Illustrative Example

For example, we used to translate analogWrite as “ಅನಂಕೀಯವಾಗಿ_ಬರೆ” in Kannada, (“अनंकीय_लिखो” in Hindi and “அனலாக்_எழுது” in Tamil) using the coined word ‘anankiiya’ (negation of ‘ankiiya’ meaning digital) or the transliteration of ‘analog’.

However, during a discussion, a physicist who helped with the Kannada (thank you, Padmalekha) suggested that we use the phrase “write without spaces” for analogWrite.

And then it hit us that we could just use the phrase “write value” for analogWrite and “write number” for digitalWrite.

The following translations for analogWrite: “ಮೌಲ್ಯವನ್ನು_ಬರೆ“, “मूल्य_लिखो” and “மதிப்பை_எழுது” are much more intuitive.

The new translations for digitalWrite are also just as easy to comprehend: “ಅಂಕೆಯನ್ನು_ಬರೆ“, “अंक_लिखो” and “எண்ணை_எழுது

The process of simplification is an ongoing one, and we hope in a few months’ time to have a generally agreed-upon set of translations, after taking everyone’s inputs into consideration.

Syntax

The Arduino IDE with extensions now supports syntax highlighting in Indian languages. This makes it easier to program in the local language.

This is how Kannada code looks:

Kannada Code

programming_kannada

And here is how it looks in Hindi and in Tamil.

Hindi Code

programming_hindi

Tamil Code

programming_tamil

Visit Aiaioo Labs

A Naive Bayes classifier that outperforms NLTK’s

We found that by changing the smoothing parameters of a Naive Bayes classifier, we could get far better accuracy numbers for certain tasks.  By changing the Lidstone smoothing parameter from 0.05 to 0.5 or greater, we could go from an accuracy of about 50% to almost 70% on the task of question classification for question answering.

This is not at all surprising because, as described in an earlier post, the smoothing method used in the estimation of probabilities affects Naive Bayes classifiers greatly.

Below, we have provided an implementation of a Naive Bayes classifier which outperforms the Naive Bayes classifier supplied with NLTK 3.o by almost 10% on the task of classifying questions from the questions-train.txt file supplied with the textbook “Taming Text”.

Our Naive Bayes classifier (with a Lidstone smoothing parameter of 0.5) exhibits about 65% accuracy on the task of question classification, whereas the NLTK classifier has an accuracy of about 40% as shown below.

smoothing_graph

Finally, I’d like to say a few words about the import of this work.

Theoretically, by increasing the Lidstone smoothing parameter, we are merely compensating more strongly for absent features; we are negating the absence of a feature more vigorously;  reducing the penalty for the absence of a feature in a specific category.

Because increased smoothing lowers the penalty for feature absence, it could help increase the accuracy when a data-set has many low-volume features that do not contribute to predicting a category, but whose chance presence and absence may be construed in the learning phase to be correlated with a category.

Further investigation is required before we can say whether the aforesaid hypothesis would explain the effect of smoothing on the accuracy of classification in regard to the question classification data-set that we used.

However, this exercise shows that algorithm implementations would do well to leave the choice of Lidstone smoothing parameters to the discretion of the end user of a Naive Bayes classifier.

The source code of our Naive Bayes classifier (using Lidstone smoothing) is provided below:

This implementation of the Naive Bayes classifier was created by Geetanjali Rakshit, an intern at Aiaioo Labs.

 

import numpy as np
import random
import sys, math

class Classifier:
	def __init__(self, featureGenerator):
		self.featureGenerator = featureGenerator
		self._C_SIZE = 0
		self._V_SIZE = 0
		self._classes_list = []
		self._classes_dict = {}
		self._vocab = {}

	def setClasses(self, trainingData):
		for(label, line) in trainingData:
			if label not in self._classes_dict.keys():
				self._classes_dict[label] = len(self._classes_list)
				self._classes_list.append(label)
		self._C_SIZE = len(self._classes_list)
		return
		
	def getClasses(self):
		return self._classes_list

	def setVocab(self, trainingData):
		index = 0;
		for (label, line) in trainingData:
			line = self.featureGenerator.getFeatures(line)
			for item in line:
				if(item not in self._vocab.keys()):
					self._vocab[item] = index
					index += 1
		self._V_SIZE = len(self._vocab)
		return

	def getVocab(self):
		return self._vocab

	def train(self, trainingData):
		pass

	def classify(self, testData, params):
		pass

	def getFeatures(self, data):
		return self.featureGenerator.getFeatures(data)
		

class FeatureGenerator:
	def getFeatures(self, text):
		text = text.lower()
		return text.split()


class NaiveBayesClassifier(Classifier):
	def __init__(self, fg, alpha = 0.05):
		Classifier.__init__(self, fg)
		self.__classParams = []
		self.__params = [[]]
		self.__alpha = alpha

	def getParameters(self):
		return (self.__classParams, self.__params)

	def train(self, trainingData):
		self.setClasses(trainingData)
		self.setVocab(trainingData)
		self.initParameters()

		for (cat, document) in trainingData:
			for feature in self.getFeatures(document):
				self.countFeature(feature, self._classes_dict[cat])

	def countFeature(self, feature, class_index):
		counts = 1
		self._counts_in_class[class_index][self._vocab[feature]] = self._counts_in_class[class_index][self._vocab[feature]] + counts
		self._total_counts[class_index] = self._total_counts[class_index] + counts
		self._norm = self._norm + counts

	def classify(self, testData):
		post_prob = self.getPosteriorProbabilities(testData)
		return self._classes_list[self.getMaxIndex(post_prob)]

	def getPosteriorProbabilities(self, testData):
		post_prob = np.zeros(self._C_SIZE)
		for i in range(0, self._C_SIZE):
			for feature in self.getFeatures(testData):
				post_prob[i] += self.getLogProbability(feature, i)
			post_prob[i] += self.getClassLogProbability(i)
		return post_prob

	def getFeatures(self, testData):
		return self.featureGenerator.getFeatures(testData)

	def initParameters(self):
		self._total_counts = np.zeros(self._C_SIZE)
		self._counts_in_class = np.zeros((self._C_SIZE, self._V_SIZE))
		self._norm = 0.0

	def getLogProbability(self, feature, class_index):
		return math.log(self.smooth(self.getCount(feature, class_index),self._total_counts[class_index]))

	def getCount(self, feature, class_index):
		if feature not in self._vocab.keys():
			return 0
		else:
			return self._counts_in_class[class_index][self._vocab[feature]]

	def smooth(self, numerator, denominator):
		return (numerator + self.__alpha) / (denominator + (self.__alpha * len(self._vocab)))

	def getClassLogProbability(self, class_index):
		return math.log(self._total_counts[class_index]/self._norm)

	def getMaxIndex(self, posteriorProbabilities):
		maxi = 0
		maxProb = posteriorProbabilities[maxi]
		for i in range(0, self._C_SIZE):
			if(posteriorProbabilities[i] >= maxProb):
				maxProb = posteriorProbabilities[i]
				maxi = i
		return maxi


class Dataset:
	def __init__(self, filename):
		fp = open(filename, "r")
		i = 0
		self.__dataset = []
		for line in fp:
			if(line != "\n"):
				line = line.split()
				cat = line[0]
				sent = ""
				for word in range(1, len(line)):
					sent = sent+line[word]+" "
				sent = sent.strip()
				self.__dataset.append([cat, str(sent)])
				i = i+1
		random.shuffle(self.__dataset)	
		self.__D_SIZE = i
		self.__trainSIZE = int(0.6*self.__D_SIZE)
		self.__testSIZE = int(0.3*self.__D_SIZE)
		self.__devSIZE = 1 - (self.__trainSIZE + self.__testSIZE)

	def setTrainSize(self, value):
		self.__trainSIZE = int(value*0.01*self.__D_SIZE)
		return self.__trainSIZE

	def setTestSize(self, value):
		self.__testSIZE = int(value*0.01*self.__D_SIZE)
		return self.__testSIZE

	def setDevelopmentSize(self):
		self.__devSIZE = int(1 - (self.__trainSIZE + self.__testSIZE))
		return self.__devSIZE

	def getDataSize(self):
		return self.__D_SIZE
	
	def getTrainingData(self):
		return self.__dataset[0:self.__trainSIZE]

	def getTestData(self):
		return self.__dataset[self.__trainSIZE:(self.__trainSIZE+self.__testSIZE)]

	def getDevData(self):
		return self.__dataset[0:self.__devSIZE]



#============================================================================================

if __name__ == "__main__":
	
	# This Naive Bayes classifier implementation 10% better accuracy than the NLTK 3.0 Naive Bayes classifier implementation
	# at the task of classifying questions in the question corpus distributed with the book "Taming Text".

	# The "questions-train.txt" file can be found in the source code distributed with the book at https://www.manning.com/books/taming-text.
	
	# To the best of our knowledge, the improvement in accuracy is owed to the smoothing methods described in our blog:
	# https://aiaioo.wordpress.com/2016/01/29/in-a-naive-bayes-classifier-why-bother-with-smoothing-when-we-have-unknown-words-in-the-test-set/
	
	filename = "questions-train.txt"
	
	if len(sys.argv) > 1:
		filename = sys.argv[1]
	
	data = Dataset(filename)
	
	data.setTrainSize(50)
	data.setTestSize(50)
	
	train_set = data.getTrainingData()
	test_set = data.getTestData()
	
	test_data = [test_set[i][1] for i in range(len(test_set))]
	actual_labels = [test_set[i][0] for i in range(len(test_set))]
	
	fg = FeatureGenerator()
	alpha = 0.5 #smoothing parameter
	
	nbClassifier = NaiveBayesClassifier(fg, alpha)
	nbClassifier.train(train_set)
	
	correct = 0;
	total = 0;
	for line in test_data:
		best_label = nbClassifier.classify(line)
		if best_label == actual_labels[total]:
			correct += 1
		total += 1
	
	acc = 1.0*correct/total
	print("Accuracy of this Naive Bayes Classifier: "+str(acc))

 

Visit Aiaioo Labs