
Fun with Text – Managing Text Analytics

The year is 2016.

I’m a year older than when I designed the text analytics lecture titled “Fun with Text – Hacking Text Analytics”.

Yesterday, I found myself giving a follow-on lecture titled “Fun with Text – Managing Text Analytics”.

Here are the slides:

“Hacking Text Analytics” was meant to help students understand a range of text analytics problems by reducing them to simpler problems.

But it was designed with the understanding that they would hack their own text analytics tools.

However, in project after project, I was seeing that engineers tended not to build their own text analytics tools, but instead rely on handy and widely available open source products, and that the main thing they needed to learn was how to use them.

So, when I was asked to lecture at the NASSCOM Big Data and Analytics Summit in Hyderabad, and was advised that a large part of the audience might be non-technical and that the talk should be built around use cases, I tried a different tack.

So I designed another lecture, “Fun with Text – Managing Text Analytics”, about:

  • 3 types of opportunities for text analytics that typically exist in every vertical
  • 3 use cases dealing with each of these types of opportunities
  • 3 mistakes to avoid and 3 things to embrace

And the takeaway from it is how to go about solving a typical business problem (involving text) using text analytics.

Enjoy the slides!


Text Analytics Tools for Deliberative Democracy

In our last post, we spoke about various control mechanisms that can be implemented to support direct democracy (which we interpreted to mean the control of the allocation of common resources by the people who pooled them).

We also examined how these controls could be used to curtail man-in-the-middle corruption.

In this article, we examine a more sophisticated form of direct democracy called a deliberative democracy.

In a deliberative democracy, in addition to the control mechanisms prescribed for direct democracy, there need to be mechanisms to allow deliberation (discussion) before a referendum or any other action is taken.

I quote from the Wikipedia article on deliberative democracy:

Deliberative democracy holds that, for a democratic decision to be legitimate, it must be preceded by authentic deliberation, not merely the aggregation of preferences that occurs in voting.

In elitist deliberative democracy, principles of deliberative democracy apply to elite societal decision-making bodies, such as legislatures and courts; in populist deliberative democracy, principles of deliberative democracy apply to groups of lay citizens who are empowered to make decisions.

The article on direct democracy had the following to say:

Democratic theorists have identified a trilemma due to the presence of three desirable characteristics of an ideal system of direct democracy, which are challenging to deliver all at once. These three characteristics are participation – widespread participation in the decision making process by the people affected; deliberation – a rational discussion where all major points of view are weighted according to evidence; and equality – all members of the population on whose behalf decisions are taken have an equal chance of having their views taken into account.

(Aside to computer scientists: doesn’t this trilemma remind you of the CAP theorem that applies to database systems? Here’s a simple explanation of the CAP theorem: http://ksat.me/a-plain-english-introduction-to-cap-theorem/).

So, for example, representative democracy satisfies the requirement for deliberation and equality but sacrifices participation.

Participatory democracy allows inclusive participation and deliberation but sacrifices equality.

And then there is direct democracy which supports participation and equality, but not deliberation.

The problem seems to be that when a large number of people are invited to participate in a deliberation (and given that deliberations take time), it is not possible to compensate them all for their time. Consequently, those more interested in the issue being debated (or more likely to benefit from one position or the other) are more likely to participate, biasing the sample in their favour (all sections of the population are no longer equally represented in the discussion/decision).

So, it seems that all three properties desired in an ideal democratic system – participation, equality and deliberation – cannot be present at the same time in a real democratic system.

But then, a while ago, we began wondering if this trilemma is merely a result of the lack of suitable technology and not really a fundamental property of democracy. So we proposed a design (though we have not yet built it) for a tool that can support the participation of a large number of people in deliberations. We call it the MCT (Mass Communication Tool).

It could be used as a method to enable direct democracies to support deliberations in which all citizens can participate, ahead of a vote on any subject.

It uses text clustering algorithms to solve the problems of volume as well as numeric asymmetry in the flow of communications between the deliberating participants and the moderators of the communications.

There’s a brief overview of the system in our lab profile.

MCTs could have a huge impact on our experience of representative government. A typical use case would involve a public figure (say, President Obama) sounding out the electorate before introducing legislation on, say, healthcare reform.


By first discussing the competing proposals with large numbers of people, it might be possible for the initiator of the discussion to get a sense of what might or might not work and what the response to the legislation was likely to be.


An MCT would have to be capable of supporting a live dialog involving a large number of people.

It would use natural language processing and machine learning to enable a few moderators (for example, the CEO of a company) to interact with a large number of people (for example, all the employees of the company) in real time (for example, during a virtual all-hands meeting), get a synopsis of a large number of concurrent discussions in real time, and participate in a significant fraction of the discussions as they are taking place.

The system would consist of four natural language processing components:

  1. an aggregator that groups together messages and discussions with identical semantic content;
  2. a hierarchical clustering system that assigns aggregated messages a place in a hierarchy by specificity, with more general messages closer to the root of the hierarchy and more specific messages closer to the leaves;
  3. a summarization system that creates a summary of the aggregate of all messages in a sub-tree; and
  4. a reply routing system that routes replies from cluster to cluster based on their relevance to the discussion threads.
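As a toy illustration of the first of these components, here is a minimal sketch in which token-overlap (Jaccard) similarity stands in for real semantic matching; the function names, the threshold, and the sample messages are all invented for illustration, and a production aggregator would need genuine NLP components:

```python
def jaccard(a, b):
    """Token-overlap similarity between two token sets (0.0 to 1.0)."""
    return len(a & b) / len(a | b)

def aggregate(messages, threshold=0.5):
    """Greedily group messages whose token overlap with an existing
    cluster exceeds the threshold; each cluster then stands for one
    point made by many participants."""
    clusters = []  # each cluster: {"tokens": set, "messages": list}
    for msg in messages:
        tokens = set(msg.lower().split())
        for cluster in clusters:
            if jaccard(tokens, cluster["tokens"]) >= threshold:
                cluster["messages"].append(msg)
                cluster["tokens"] |= tokens
                break
        else:
            clusters.append({"tokens": tokens, "messages": [msg]})
    return clusters

# A moderator sees one line per cluster instead of one line per message.
messages = [
    "raise the education budget",
    "please raise the education budget",
    "cut income taxes instead",
]
for cluster in aggregate(messages):
    print(len(cluster["messages"]), "say:", cluster["messages"][0])
```

Here the first two messages collapse into one cluster, so a single moderator reads two lines instead of three; at scale, that compression is what makes a few moderators able to face thousands of participants.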

Should Cecilia have said “insecure” instead of “unsecure”?

In this funny PhD Comic, the main character – Cecilia (the girl in red) – says:

“Do you realize how unsecure your coffee distribution system is?”

That made me wonder – should she have said ‘insecure’?

Even the WordPress spell-checker has a problem with “unsecure”.

It thinks that “unsecure” is a spelling error.

However, the word “insecure” doesn’t sound as if it were the right term to use in the context of computer security.

That is because the word “insecure” is usually used in the context of a person to mean a person who is not confident and self-assured.

To call a computer “insecure” would be a bit like saying that the computer had self-image issues.

Others have written about this cognitive dissonance as well (see http://english.stackexchange.com/questions/19653/insecure-or-unsecure-when-dealing-with-security for a nice discussion).

Given the problem, the author of the cartoon seems to be justified in using a newly-minted word (one not found in any dictionary) in order to describe the lack of security.

This is also very interesting because it throws some light on how words are born.

Before I can explain what I mean, I’ll need you to take a look at the Oxford dictionary’s definitions of the word “insecure” (from the Oxford English Dictionary online search at http://oxforddictionaries.com/definition/english/insecure?q=insecure):



  • 1   uncertain or anxious about oneself; not confident:  a rather gauche, insecure young man;  a top model who is notoriously insecure about her looks
  • 2   (of a thing) not firm or fixed; liable to give way or break:  an insecure footbridge

                 not sufficiently protected; easily broken into:  an insecure computer system

  • 3   (of a job or situation) liable to change for the worse; not permanent or settled:  badly paid and insecure jobs;  a financially insecure period

There are three ways in which the word “insecure” can be used.

The second usage would have been perfect for the context of computer security.

But the first usage might be conflated with the second in that context.

And that is because (sorry, I no longer recall the references to support this claim) computers appear to the human mind to have human-like characteristics (we say things like “Google tells me that …” or “my computer has gone to sleep”).

So, the only word in the dictionary that can do the job – the word “insecure” – has a conflict of interest.

And therefore, a new word needs to be coined that is not susceptible to the same sort of ambiguity.

And if the new word “unsecure” catches on, then one day, the second sense of the word “insecure” could become extinct in the context of computers.

Oh well, “it’s only words!”


A friend pointed out that the Google NGram Viewer shows a history of the use of the word “unsecure”: http://books.google.com/ngrams/graph?content=unsecure.

The word seems to have been in use between 1650 and 1850 (there is evidence of use in literature), and has in more recent times simply fallen out of circulation (having been eclipsed by “insecure” around 1750).  Thanks, Prashant.

(You can also search for those early usages in books – http://books.google.com/books?id=WmpCAAAAcAAJ&pg=PA12&dq=%22unsecure%22&hl=en&sa=X&ei=aOcLUq7aA-3iyAHu8YGwAg&ved=0CDMQ6AEwAA#v=onepage&q=%22unsecure%22&f=false)

Analysing documents for non-obvious differences

The ease of classification of documents depends on the categories you are looking to classify documents into.

A few days ago, an engineer wrote about a problem where the analysis that needed to be performed on documents was not the most straightforward.

He described the problem in a forum as follows: “I am working on sub classification. We already crawled sites using focused crawling. So we know domain, broad category for the site. Sometimes site is also tagged with broad category. So I don’t require to predict broad class for individual site. I am interested in sub-classification. For example, I don’t want to find if post is related to sports, politics, cricket etc. I am interested in to find if post is related to Indian cricket, Australia cricket, given that I already know post is related to cricket. Since in cricket post may contains frequent words like runs, six, fours, out, score etc, which are common across all cricket related posts. So I also want to consider rare terms which can help me in sub-classification. I agree that I may also require frequent words for classification. But I don’t want to skip rare terms for classification.”

If you’re dealing with categories like sports, politics and finance, then using machine learning for classification is very easy.  That’s because all the nouns and verbs in the document give you clues as to the category that the document belongs to.
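To make that concrete, here is a minimal bag-of-words Naive Bayes classifier sketched from scratch; the training sentences, category names and function names are all invented for illustration, and a real system would use a proper toolkit:

```python
from collections import Counter, defaultdict
from math import log

def train(docs):
    """docs: list of (text, label) pairs. Returns per-label word
    counts, per-label document counts, and the vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def classify(text, model):
    """Pick the label maximizing log P(label) + sum of log P(token|label),
    with add-one (Laplace) smoothing for unseen tokens."""
    word_counts, label_counts, vocab = model
    n_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = log(label_counts[label] / n_docs)
        for token in text.lower().split():
            score += log((word_counts[label][token] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Nouns and verbs ("batsman", "parliament", "debated") carry the signal.
model = train([
    ("the batsman hit a six and scored runs", "sports"),
    ("the bowler took a wicket", "sports"),
    ("parliament passed the budget bill", "politics"),
    ("the minister debated the bill in parliament", "politics"),
])
print(classify("the batsman scored runs", model))      # sports
print(classify("parliament debated the bill", model))  # politics
```

Because the topical nouns and verbs occur almost exclusively in one category, even this crude model separates sports from politics; that is exactly the property the sub-classification problem below lacks.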

But if you’re given a set of categories for which there are few indicators in the text, you end up with no easy way to categorize a document.

After spending a few days thinking about it, I realized that something I had learnt in college could be applied to the problem.  It’s a technique called Feature Selection.

I am going to share the reply I posted to the question, because it might be useful to others working on the classification of documents:

You seem to have a data set that looks as follows (letters are categories and numbers are features):

A P 2 4
A Q 2 5
B P 3 4
B Q 3 5

Let’s say the 2s and the 3s are features that occur very frequently in your corpus while the 4s and the 5s are features that occur far less frequently in your corpus.

When you use the ‘bag of words’ model as your feature vector, your classifier will only learn to tell A apart from B (because the 4s and 5s will not matter much to the classifier, being overwhelmed as it is by the 2s and 3s which are far more frequent).

I think that is why you have come to the conclusion that you need to look for rare words to be able to accomplish your goal of distinguishing category P from category Q.

But in reality, perhaps what you need to do is identify all the features like 4 and 5 that can help you distinguish P from Q; you might even find that some frequent features have a fairly healthy ability to resolve these categories.

So, now the question just boils down to how you would go about finding the set of features that resolves any given categorization scheme.

The answer seems to be something that literature refers to as ‘Feature Selection’.

As the name says, you select features that help you break data points apart in the way you want.
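On the toy data above, one standard way to do this selection is to score each candidate feature by its mutual information with the target labels (mutual information is one common selection criterion among several; chi-squared and correlation-based measures are others). The sketch below hard-codes the four rows from the example and shows that features 4 and 5 carry a full bit of information about the P/Q split while 2 and 3 carry none, and vice versa for A/B:

```python
from collections import Counter
from math import log2

# The toy data set from above: (feature set, A/B label, P/Q label)
data = [
    ({"2", "4"}, "A", "P"),
    ({"2", "5"}, "A", "Q"),
    ({"3", "4"}, "B", "P"),
    ({"3", "5"}, "B", "Q"),
]

def mutual_info(feature, rows):
    """Mutual information (in bits) between a feature's presence/absence
    and the label, estimated from counts. rows: list of (features, label)."""
    n = len(rows)
    joint = Counter((feature in feats, label) for feats, label in rows)
    p_feat = Counter(feature in feats for feats, _ in rows)
    p_label = Counter(label for _, label in rows)
    return sum(
        (c / n) * log2((c / n) / ((p_feat[f] / n) * (p_label[y] / n)))
        for (f, y), c in joint.items()
    )

pq_rows = [(feats, pq) for feats, _, pq in data]
ab_rows = [(feats, ab) for feats, ab, _ in data]
for feat in ["2", "3", "4", "5"]:
    print(feat,
          "MI with P/Q:", round(mutual_info(feat, pq_rows), 3),
          "MI with A/B:", round(mutual_info(feat, ab_rows), 3))
```

Selecting the top-scoring features for the P/Q task therefore recovers exactly the rarer features 4 and 5, which is the intuition behind looking past raw frequency.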

Wikipedia has an article on Feature Selection:


And Mark Hall’s thesis http://www.cs.waikato.ac.nz/~mhall/thesis.pdf seems to be highly referenced.

From Mark Hall’s thesis: “A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other.”

To be honest, I’d heard about Feature Selection but never connected it to the problem it solves until now, so I’m just looking up reading material as I write.

Best of luck with it.

Wishful Thinking and Leprechauns

I recently came across a lovely cartoon on Leprechauns and social media.

Fortunately for us, we have a leprechaun in the office.

(So, now you know where we get our startup funding from).

Here’s a picture of the guy (that’s the cubicle he shares with Selasdia):


Just kidding!

One of our business partners brought the little pewter leprechaun in the picture back to India for us from Ireland.

It might have once been popularly believed in Ireland that leprechauns had the ability to grant Wishes.

And we find Wishes immensely interesting because some of the earliest work on Intention Analysis started out as an attempt to detect and classify Wishes.

In fact, one of the loveliest papers on the subject started out with an attempt to study what people wished for (wanted) on New Year’s Day.

You can read the paper here:  http://pages.cs.wisc.edu/~jerryzhu/pub/wish.pdf

It has a very beautiful title: “May All Your Wishes Come True: A Study of Wishes and How to Recognize Them”

You also find the word Wishes in the title of one of the first attempts in research literature to find “buy” intentions:


It is a paper titled, again quite poetically (what’s with Wishes and beautiful titles!) “Wishful Thinking – Finding suggestions and ‘buy’ wishes from product reviews”.

This paper was written by a research team working at Cognizant (India) in 2010.