Category: Problem Solving

# Protecting against deadly stampedes during the Hajj and other religious festivals

More than 700 people were killed in a stampede during the Hajj pilgrimage of 2015, which took place just a few days ago.

There have also been stampedes during religious events in India that have cost hundreds of lives.

So pinpointing the causes of death and injury in stampedes, and devising methods of prevention, are matters of great importance to a large number of people.

In our earlier posts on stampedes, we looked at possible causes of deaths on sloping paths.

However, we had not been able to explain how people could die on flat ground.

In this article, we present a model for how forces on people in a crowd on flat ground can build to such a magnitude that people are crushed to death.

We also present a number of mechanisms for preventing deaths due to excess pressure in crowds.

On flat ground, as long as everyone in the crowd remains standing and stationary, there would be no horizontal crushing force.

However, a person can generate a force by trying to move in any direction.

Let’s say one person can generate 10 Kg (kilograms-force) of lateral push.

Now, if ten people stood one behind the other, in contact with each other, and pushed in the same direction, they could be expected to generate approximately ten times the 10 Kg of lateral force.

In other words, they’d be exerting 100 Kg of force on anything ahead of them, as shown in the figure below.

When people in a crowd experience such accumulated forces, they are either injured physically or asphyxiated.

Autopsies of victims of asphyxiation in stampedes showed that they could have experienced pressures on their chests of around 6.4 psi.

If the area of the torso coming into contact with another person in a crush is 2 square feet, close to 1 ton of force (about 1000 Kg) would be needed to produce a pressure of 6.4 psi.

A tightly packed column of 100 people could generate that kind of cumulative lateral crushing force.
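
The arithmetic can be checked in a few lines (a back-of-the-envelope sketch using the figures above, not a biomechanical model; the 10 Kg-per-person push is the assumption made earlier):

```python
# Force needed to produce a 6.4 psi chest pressure over ~2 sq ft of
# contact area, and the column depth needed to generate it at
# ~10 kgf of push per person (the assumption made above).
PSI = 6.4
AREA_SQ_IN = 2 * 144          # 2 square feet in square inches
LBF_TO_KGF = 0.45359237

force_kgf = PSI * AREA_SQ_IN * LBF_TO_KGF
people = force_kgf / 10       # people needed at 10 kgf each

print(round(force_kgf))       # ~836 kgf, i.e. roughly the "about 1 ton" above
print(round(people))          # ~84 people; a 100-deep column exceeds this
```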

So, if a tightly packed column of people, say 100 deep, were suddenly obstructed (by a barrier, say, or by another group of people crossing their path), the forces experienced by those in front, or at the intersection, could be as high as 1 ton.

This seems to have been what happened in Mecca a few days ago.

## How people were injured

According to eye witness accounts of the stampede during the Hajj, the deaths occurred on a flat road, and there had been pushing and jostling at the start of the stampede:

“As our group started to head back, taking Road 204, another group, coming from Road 206, crossed our way,” said another worshipper, Ahmed Mohammed Amer.

“Heavy pushing ensued. I’m at a loss of words to describe what happened. This massive pushing is what caused the high number of casualties among the pilgrims.”

Something very similar seems to have been reported by a witness to the Hajj stampede of 2006 where 350 people died:

“On January 12, as we were returning to Mina for the last ritual of Haj, we saw the big stampede from a distance as waves of people collided.”

## Mathematical / Physical Models

I will now attempt to show that in a constrained space, even higher forces can be generated by a wedging effect.

### The Wedge Effect

A wedge is a mechanical device that can amplify forces.

If a wedge four times as long as it is tall is used, a force of 10 Kg applied along its longer edge generates a force of 40 Kg in the direction of the shorter edge, as shown in the following diagram.
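
Treating the wedge as an ideal, frictionless machine (friction between bodies and barriers would reduce the effect, so this is an upper bound), the amplification is just the length-to-height ratio:

```python
# Ideal (frictionless) wedge: output force = input force x (length / height).
def wedge_output_force(input_force_kgf, length, height):
    return input_force_kgf * (length / height)

print(wedge_output_force(10, 4, 1))   # the 10 Kg -> 40 Kg example above
print(wedge_output_force(200, 5, 1))  # 20 people x 10 Kg into a 1:5 taper -> 1000 Kg
```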

Restraints of any kind (railings, barriers, fences, chains) can act as wedges and increase the pressure within a crowd perpendicular to the direction of movement of the crowd.

So, a column of 20 people could generate a force of one ton if they were wedged in between fences with an aspect ratio of 1:5 (the fences closing in by 1 meter for every 5 meters of road), as shown in the following diagram (for reasons of space, the diagram shows a column of 5 people generating a lateral force of 250 Kg on account of the wedging).

Wedging could also occur on a path with no constrictions if people in the crowd moved in opposite directions, as shown in the following figure.

The above kind of wedging is probably what caused the deaths at the 2010 Love Parade, in a crowd that had been standing still.

So, the following need to be eliminated to prevent deadly crushes:

1. Obstructions to the movement of a tightly packed column of people
2. Any wedges that can amplify pressures

## A Partial Solution

The organizers could therefore probably improve the safety of their events by doing the following:

### Parallel Channel Movement

Organizers could close off all intersections and keep all movement going along completely parallel, non-intersecting channels.

This would ensure that there could be no obstructions to movement.

### Prevention of Wedging

Organizers would need to ensure that routes never constrict.

So, gates and converging roads would need to be avoided.

Also all traffic would have to be one-way.

This would prevent the formation of wedges.

## References

The Hajj Stampede is a Fluid Dynamics Problem

# Fun With Text – Hacking Text Analytics

I’ve always wondered if there was a way to teach people to cobble together quick and dirty solutions to problems involving natural language, from duct tape, as it were.

Having worked in the field for donkey’s years now (as of 2015), and having taught a number of text analytics courses along the way, I’ve seen students of text analysis stumble mostly over one of two hurdles:

1.  Inability to Reduce Text Analytics Problems to Machine Learning Problems

I’ve seen students, after hours of training, still revert to rule-based thinking when asked to solve new problems involving text.

You can spend hours teaching people about classification and feature sets, but when you ask them to apply their learning to a new task, say segmenting a resume, you’ll hear them very quickly falling back to thinking in terms of programming steps.

Umm, you could write a script to look for a horizontal line, followed by capitalized text in bold, big font, with the words “Education” or “Experience” in it!!!

2.  Inability to Solve the Machine Learning (ML) Problems

Another thing I have seen teams get hung up on is solving the ML problems themselves and comparing different solutions.

My manager wants me to identify the ‘introduction’ sections.  So, I labelled 5 sentences as introductions.  Then, I trained a maximum entropy classifier with them.  Why isn’t it working?

## One Machine Learning Algorithm to Rule Them All

One day, when I was about to give a lecture at Barcamp Bangalore, I had an idea.

Wouldn’t it be fun to use just one machine learning algorithm, show people how to code up that algorithm themselves, and then show them how a really large number of text analytics problems (almost every problem related to the semantic web) could be solved using it?

So, I quickly wrote up a set of problems in order of increasing complexity, and went about trying to reduce them all to one ML problem, and surprised myself!  It could be done!

Just about every text analytics problem related to the semantic web (which is, by far, the most important commercial category) could be reduced to a classification problem.

Moreover, you could tackle just about any problem using just two steps:

a) Modeling the problem as a machine learning problem

Spot the machine learning problem underlying the text analytics problem (and, if it is a classification problem, the relevant categories), and you’ve reduced the text analytics problem to a machine learning problem.

b) Solving the problem using feature engineering

To solve the machine learning problem, you need to come up with a set of features that allows the machine learning algorithm to separate the desired categories.

That’s it!
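
To make the two steps concrete, here is a minimal, hypothetical sketch (my own toy example, not from the slides): step (a) models “find the introduction sentences” as binary classification, and step (b) uses the simplest possible features, the bag of words, inside a little Naive Bayes classifier. The training sentences are invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Step (a): model the task as classification -- each sentence gets a label.
train = [
    ("this paper presents a new approach to parsing", "intro"),
    ("we introduce a method for entity extraction", "intro"),
    ("in this work we propose a novel model", "intro"),
    ("the results are shown in table two", "other"),
    ("accuracy improved by five percent", "other"),
    ("we thank the reviewers for their comments", "other"),
]

# Step (b): feature engineering -- here, simply the bag of words.
word_counts = defaultdict(Counter)
label_counts = Counter()
for text, label in train:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    scores = {}
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for w in text.split():
            # Laplace smoothing so unseen words don't zero out the score
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("we propose a new approach"))
print(classify("accuracy improved by five percent"))
```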

Check it out for yourself!

Here’s a set of slides.

It’s called “Fun with Text – Hacking Text Analytics”.

# Using text analytics to prevent O(log n) failure of democratic institutions

In this article, we discuss an Achilles’ heel present in many democratic institutions.  We claim that many democratic institutions can be made to fail in O(log n) time (that is, the corruption spreads exponentially fast) if patronage (nepotistic) networks are allowed to grow unfettered.

We then support the claim using real-world examples of the failure of democratic institutions (political and otherwise) and discuss why such failure has not been observed in the Indian polity.

We also look at how text analytics can be used to detect (and consequently enable steps to be taken to prevent) such failures.

## The Weakness

In democratic institutions, voting mechanisms are used to confer upon one individual (from a field of eligible candidates) powers and responsibilities as specified in the charter of the institution.

In some cases, the powers that accrue to the elected individual are so great that they enable the incumbent to ensure his or her re-election.  There are two methods available to such an individual.

The first is to pay off the electoral college and secure re-election.  This method is an O(n) algorithm.  The quantum of resources required to secure re-election is proportional to the number of people who need to be suborned.  So, this method only works in cases where electoral colleges are small (for example, in the case of committees deciding job appointments).

A faster method of suborning large numbers of people exists.  The establishment of a hierarchy of patronage can leverage the dynamics of social networks to speedily (in log n time) corrupt very large numbers of people.

It works like this: the person who is elected to head the country appoints as immediate subordinates only people who, on account of tribal or ethnic affiliations, can be expected to be loyal to him or her.  This appointment is often accompanied by a monetary exchange in the form of a bribe paid by the appointee to secure the post.  Such an exchange helps cement the loyalty, since the person making the payment becomes dependent on their superior’s continuation in power to recoup the money spent.  In other words, the immediate subordinates are forced to invest in their superiors’ careers.

The subordinates so appointed in turn appoint people loyal to themselves to positions below them, in order to recover their investment in the person above them and so on.  Very soon, the entire government machinery becomes beholden directly or indirectly to the person at the top for their jobs and has a vested interest in keeping the person at the top in power.
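
The speed claim can be sketched numerically. Assuming, purely for illustration, that every captured official appoints ten loyal subordinates in each round of appointments, the whole machinery is captured in logarithmically many rounds:

```python
def rounds_to_capture(n_officials, branching=10):
    """Rounds of appointments until at least n_officials owe their
    position, directly or indirectly, to the person at the top,
    assuming every captured official appoints `branching` loyal
    subordinates in the next round."""
    frontier, total, rounds = 1, 1, 0
    while total < n_officials:
        frontier *= branching   # the newest appointees each appoint again
        total += frontier
        rounds += 1
    return rounds

# Bribing a million-strong bureaucracy head-on costs a million payments (O(n));
# through a patronage hierarchy it takes only about log10(1,000,000) = 6 rounds.
print(rounds_to_capture(10**6))  # 6
```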

In some countries, this effectively transforms the democratically elected ‘president’ into a dictator for life.

## Example 1

The first example of such failure is possibly (and I am just illustrating a point – no disrespect intended) the government of Cameroon.  The President of Cameroon has been in power since the 1980s.  The President is impossible to replace in spite of rampant corruption and economic mismanagement because all the tribal chiefs and officials in Cameroon are beholden to the President.

All these officials and chiefs try and recoup their investment by rent-seeking behavior (you will need their good offices if you wish to do any business in Cameroon).

The resulting economic climate doesn’t necessarily encourage the growth of entrepreneurship or investment in Cameroon.

## Example 2

Iraq for many years was ruled by a dictator who had managed to suborn an entire political system.  Iraq once had a participatory democratic system.  But the use of patronage by Saddam led to Iraq coming under totalitarian rule.

Similar failures can be seen in post-WW2 Libya and Syria, and in Stalin’s Russia.

## India

One common feature of many of the countries where such failure has occurred is that they have hierarchical social structures in which respect and obedience can be commanded by people higher in the hierarchy from people lower in the hierarchy.

Cameroon has a culture of veneration of elders and tribal leaders.  So do the countries of the Arab world.

India also has somewhat similar cultural traits.  So, it is very interesting to see that a similar process of deterioration of democracy has not been observed in India.

One explanation is that India was saved by its own heterogeneity.  India is made up of distinct linguistic regions and ethnic groups.  It would be impossible for tribal and ethnic hierarchies to span India’s many linguistic and ethnic boundaries.

So, even if one province failed, that would not trigger failure in neighboring provinces, and the regional despot would not be strong enough to change the Indian constitution.

Examples of such failure can arguably be observed to have happened already in Karnataka and in Tamil Nadu.  In Karnataka, a couple of mining barons managed to become ministers in the government and managed to exercise a lot of control over it for a number of years. Had Karnataka been an independent political entity, they might at some point have attempted to change the constitution to give themselves more power and possibly unlimited tenure.

In the state of Tamil Nadu, the ruling politicians are suspected of building patronage networks around themselves in order to facilitate rent-seeking behaviour.  If Tamil Nadu had been an independent political entity, I suspect that it might have lost its democratic character, since the number of people below the Chief Minister with a vested interest in keeping him/her in power would have become too big to permit free and fair elections to take place.

But what is interesting is that in India, though democracy could have failed at the level of the states (controlled by politicians who had suborned a large part of the mechanism of government) the possible failure did not take place or proved reversible.  That is probably because no kinship networks extend to all parts of India.

The fragmentation must have protected the constitution and the electoral system.  That in turn allowed constitutional and legal frameworks and electoral politics to correct the failures.

In Karnataka, the corrupt mining barons and the Chief Minister who supported them ended up behind bars.  In Tamil Nadu, some family members of the corrupt Chief Minister ended up behind bars.  In both states, the parties that had engaged in corruption were voted out of power.

So, failure through networks of patronage might be difficult to engineer in India because of the extremely heterogeneous nature of its society (a result of size and diversity).

## Nepotism

Why would someone intending to build a patronage network only elevate kith and kin to positions of power?

As in, why did the dictators in Iraq and Syria choose to pack the governing organizations with people from their own community or tribe?

One possible answer comes from the work of George Price (http://www.bbc.co.uk/news/magazine-24457645), who showed that altruism exhibited towards close relatives can have concrete benefits for a selfish individual.

I quote from the article:

Price’s equation explained how altruism could thrive, even amongst groups of selfish people.

It built on the work of a number of other scientists, arguably beginning with JBS Haldane, a British biologist who developed a theory in the early 1950s. When asked if he would sacrifice his own life to save that of another, he said that he would, but only under certain conditions. “I would lay down my life for two brothers, or eight cousins.”

So, it is possible that the elevation of kith and kin minimizes the possibility that the people so elevated might one day turn against their patron (they might be more likely to exhibit non-selfish behavior towards their patron).

## Detection using Text Analytics

One is tempted to ask whether it is possible to detect at an early stage the process of failure through patronage of a democratic system.

The answer is yes.  It appears to be possible to build a protective mechanism that can uncover and highlight the formation and growth of nepotistic patronage networks.

A BBC article on nepotism in Italy examines research on its detection, and I quote:

Prof Perotti, along with others, has published revealing studies of university teachers, showing the extraordinary concentration of surnames in many departments.

“There is a scarcity of last-names that’s inexplicable,” says fellow academic, Stefano Allesina. “The odds of getting such population densities across so many departments is a million to one.”

Take the University of Bari, where five families have for years dominated the dozens of senior positions in Business and Economics there. Or consider the University of Palermo, where more than half the entire academic population has at least one relative working within the institution.

I happened to find one of Allesina’s papers titled “Measuring Nepotism through Shared Last Names: The Case of Italian Academia” on the internet.

I quote from the paper:

Both types of analysis (Monte Carlo and logistic regression) showed the same results: the paucity of names and the abundance of name-sharing connections in Italian academia are highly unlikely to be observed at random. Many disciplines, accounting for the majority of Italian academics, are very likely to be affected by nepotism. There is a strong latitudinal effect, with nepotistic practices increasing in the south. Although detecting some nepotism in Italian academia is hardly surprising, the level of diffusion evidenced by this analysis is well beyond what is expected.

Concentrating resources in the “healthy” part of the system is especially important at a time when funding is very limited and new positions are scarce: two conditions that are currently met by the Italian academic system. Moreover, promoting merit against nepotistic practices could help stem the severe brain-drain observed in Italy.

In December 2010, the Italian Parliament approved a new law for the University. Among other things, the new law forbids the hiring of relatives within the same department and introduces a probation period before tenure. The analysis conducted here should be repeated in the future, as the results could provide an assessment of the efficacy of the new law.

This analysis can be applied to different countries and types of organizations. Policy-makers can use similar methods to target resources and cuts in order to promote fair practices.

The analysis that Allesina performed (a diversity analysis of last names) is a fairly easy text analytics task and provides a clue to the solution to the problem.

Such an analysis can unearth nepotism and allow steps to be taken to prevent it.
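
Here is a minimal sketch of the Monte Carlo side of such an analysis (the surname pool and its frequencies below are invented; Allesina’s paper uses the actual Italian name distribution): draw random “departments” from a national surname distribution and ask how often chance alone produces as few distinct surnames as observed.

```python
import random

# Hypothetical national surname pool: 1000 surnames with a Zipf-like
# frequency distribution (an assumption made purely for illustration).
random.seed(0)
SURNAMES = [f"name{i}" for i in range(1000)]
WEIGHTS = [1 / (i + 1) for i in range(1000)]

def distinct_names(sample_size):
    """Distinct surnames in one randomly drawn department."""
    draw = random.choices(SURNAMES, weights=WEIGHTS, k=sample_size)
    return len(set(draw))

def monte_carlo_p_value(observed_distinct, sample_size, trials=2000):
    """Fraction of random departments with as few or fewer distinct
    surnames than observed; a very small value hints at nepotism."""
    hits = sum(1 for _ in range(trials)
               if distinct_names(sample_size) <= observed_distinct)
    return hits / trials

# e.g. a 50-person department with only 20 distinct surnames:
print(monte_carlo_p_value(20, 50))
```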

Extensions of the method can also be used to determine if, as in the case of Iraq and Syria, a certain community or ethnicity or family has taken over or is beginning to take over the governing mechanisms of a country.

And if the problem is identified early enough, it might give the constitutional, legal and electoral institutions of a democracy a fighting chance at protecting themselves and lead to a correction of the failure.

# Fraud detection using computers

For a long time, we’ve been interested in using mathematics (and computers) to detect and deter fraud.  It is related to our earlier work on identifying perpetrators of terrorist attacks.  (Yeah, I know it’s not as cool, but it’s some similar math!)

Today, I want to talk about some approaches to detecting fraud that we talked about on a beautiful summer day, in the engineering room at Aiaioo Labs.

That day, in the afternoon, somebody had rung the bell.  A colleague had answered the bell and then come and handed me a sheet of paper, saying that a lady at the door was asking for donations.

The paper bore the letterhead of an organization in a script that I couldn’t read.  However, the text in English stated that the bearer was a student collecting money to feed a few thousand refugees living in a camp in Hyderabad: their homes had been destroyed in artillery shelling on the India-Pakistan border, and a few thousand families were now without shelter and urgently in need of food and medicines.

On the sheet were the names and signatures of about 20 donors who had each donated around 1000 rupees.

Now the problem before us was to figure out if the lady was a genuine student volunteer or a fraudster out to make some quick money.

There was one thing about the document that looked decidedly suspicious.

It was that the amounts donated were all very similar – 1000, 1200, 1300, 1000, 1000, 1000, 1000.

All the numbers had unnaturally high values.

So, I called a friend of mine who came from the place she claimed the refugees (and the student volunteers) were from and asked him to talk to her and tell me if her story checked out.

He spoke to her over the phone for a few minutes and then told me that her story was not entirely true.

She was from the place that she claimed the refugees came from, but she was in fact collecting money for her own family (they had come south because one of them had needed a medical operation and were now collecting money to travel back to their home town).

We felt it would be fine to help a family in need, so we gave her some money.

However, the whole affair gave us an interesting problem to solve.

How do you tell if a set of numbers is ‘natural’ or if it has been made up by a person intent on making them look natural?

Well, it turns out that statistics can give you the tools to do that.

## Method 1

In nature, many processes result in random numbers that follow a certain distribution.  And there are standard distributions that many of the numbers found in nature belong to.

For example, on the sheet of paper that the lady had presented, the figures for the money donated should have followed a normal distribution.  There should have been a few high values and a few low values and a lot of the values in the middle.

Since that wasn’t the case I could easily tell that the numbers had been made up.

But you don’t need a human to tell you that.  There are statistical tests that can be done to see if a set of numbers belongs to any expected distribution.

I looked around online and found an article that tells you about methods that can be used to check if a set of numbers belongs to a normal distribution (a distribution that occurs very frequently in nature): http://mathforum.org/library/drmath/view/72065.html

Some of the methods it talks about are the Kolmogorov-Smirnov test, the Chi-square test, the D’Agostino-Pearson test and the Jarque-Bera test.

Details of each can be found at these links (taken from the article):

```
One common test for normality with which I am personally NOT familiar, is the Kolmogorov-Smirnov test.  The math behind it is very involved, and I would suggest you refer to other resources such as this page

Wikipedia: Kolmogorov-Smirnov Test
http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test

You can read more about the D'Agostino-Pearson test and get a table that can be used in Excel here:

Wikipedia: Normality Test
http://en.wikipedia.org/wiki/User:Xargque#Normality_Test

Wikipedia: Jarque-Bera Test
http://en.wikipedia.org/wiki/Jarque-Bera_test

One item of note: depending on how your stats program calculates kurtosis, you may or may not need to subtract 3 from kurtosis.

See: Wikipedia Talk: Jarque-Bera Test
http://en.wikipedia.org/wiki/Talk:Jarque-Bera_test
```
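
As an illustration, the Jarque-Bera statistic is simple enough to compute by hand for the donation figures on the sheet. One honest caveat: with only seven numbers the test has very little power, so the statistic (about 1.54) stays well below the chi-squared(2) 5% critical value of roughly 5.99, and a formal test alone would not have rejected normality here.

```python
def jarque_bera(xs):
    """Jarque-Bera statistic from sample skewness and kurtosis; large
    values suggest the sample is unlikely to be from a normal
    distribution (compare against chi-squared with 2 d.o.f.)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)

donations = [1000, 1200, 1300, 1000, 1000, 1000, 1000]
print(round(jarque_bera(donations), 2))  # ~1.54
```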

On to the next method:

## Method 2

Another property of many naturally occurring numbers is that about 30% of them start with the digit 1!  Surprising, isn’t it?

Well, it turns out that this applies to population numbers, electricity bills, stock prices and the lengths of rivers.

It applies to all numbers that come from power law distributions (power laws govern the distribution of wealth, connections on Facebook, the numbers of speakers of a language, and a lot of numbers related to society).

This is called Benford’s law:  http://en.wikipedia.org/wiki/Benford%27s_law

(I believe that Benford’s law would have applied to the above case as well – donations would have a power law distribution – if you assumed that all donors donated money proportional to their wealth).

Wikipedia says:

### Accounting fraud detection

In 1972, Hal Varian suggested that the law could be used to detect possible fraud in lists of socio-economic data submitted in support of public planning decisions. Based on the plausible assumption that people who make up figures tend to distribute their digits fairly uniformly, a simple comparison of first-digit frequency distribution from the data with the expected distribution according to Benford’s Law ought to show up any anomalous results. Following this idea, Mark Nigrini showed that Benford’s Law could be used in forensic accounting and auditing as an indicator of accounting and expenses fraud.[10] In practice, applications of Benford’s Law for fraud detection routinely use more than the first digit.[10]
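
Benford’s expected first-digit frequencies are easy to compute and compare against observed data. Here is a small sketch using powers of 2, a sequence well known to follow the law:

```python
import math

# Benford's law: P(first digit = d) = log10(1 + 1/d)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit_freqs(numbers):
    """Observed relative frequency of each leading digit."""
    counts = {d: 0 for d in range(1, 10)}
    for x in numbers:
        counts[int(str(x)[0])] += 1
    total = sum(counts.values())
    return {d: counts[d] / total for d in counts}

# Powers of 2 are a classic Benford-distributed sequence.
observed = first_digit_freqs([2 ** k for k in range(1, 200)])
for d in range(1, 10):
    print(d, round(benford[d], 3), round(observed[d], 3))
```

A fraud check then amounts to comparing the two columns (in practice with a chi-square test, and usually over more than just the first digit).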

## Method 3

There are also methods that can be used by governments and large organizations to prevent fraud in the issuing of tenders.

More about that in my next article.

# Building Machine Learning Models that can help with Customer Service and Supply Chain Management

## The Laptop that Stopped Working

One fine day, a couple of months ago, a laptop that we owned stopped working.  We heard 4 beeps coming from the machine at intervals but nothing appeared on the screen.

## Customer Service

The customer service person we called quickly looked up the symptoms in his knowledge base and informed us that 4 beeps meant a memory error.

I first replaced the two memory modules one by one, but the machine still wouldn’t start.  I then tried two spare memory modules that I had in the cupboard, but the computer wouldn’t start.

I had a brand new computer with me that used the same type and speed of memory as the one we were fixing.  I pulled out its memory chips and inserted them into the faulty computer, but still no luck.

At that point, the service person told me that it must be the motherboard itself that was not working.

## Second Attempt at Triage

So the next day, a motherboard and some memory arrived at my office.  A little later, a field engineer showed up and replaced the motherboard.  The computer still wouldn’t start up.

When the field engineer heard 4 beeps, the engineer said it MUST BE THE MEMORY.

## Third Attempt at Triage

A few days later, a new set of memory modules arrived.

The engineer returned and tried inserting the new memory in.  Still no luck.  The computer would not start and you could still hear the 4 beeps.

A third set of brand new memory modules and a new motherboard were sent over.

## Fourth Attempt at Triage

The engineer tried both motherboards and various combinations of memory modules, but still, all you could hear were 4 beeps and the computer would not start.

During one of his attempts to combine memory and motherboards, the engineer noticed that though the computer did not start, it did not beep either.

So, the engineer guessed that it was the screen that was not working.  But just to be safe, he asked them to send another motherboard and another set of memory modules to go with it.

## Fifth Attempt at Triage

The screen, the third motherboard and the fourth set of memory modules arrived in our office and an engineer spent the day trying various combinations of screens, motherboards and memory modules.

But the man on the phone still said: “Sir, 4 beeps means there is something wrong with your memory.  I will have them replaced.”

I had to take out my new laptop’s memory and pop it into the faulty machine to convince the engineer and support staff that replacing the memory would not fix the problem.

All the parts were now sent over – the memory, motherboard, processor, drive, and screen.

## Sixth Attempt at Triage

Finally, the field engineer found that when he had replaced the processor, the computer was able to boot up with no problems.

## Better Root Cause Analysis

The manufacturer could have spared themselves all that expense, time and effort had they used an expert system that relied on a probabilistic model of the symptoms and their causes.

Such a model would be able to tell, given the symptoms, which component was the most likely to have failed.

Such a model would be able to direct a field engineer to the component or components whose replacement would be most likely to fix the problem.

If the attempted fix did not work, the model would simply update its understanding of the problem and recommend a different course of action.

I will illustrate the process using what is known in the machine learning community as a directed probabilistic graphical model.

## Run-Through of Root Cause Analysis

Let’s say a failure has occurred and there is only one symptom that can be observed: the laptop won’t start and emits 4 beeps.

The first step is to enter this information into the probabilistic graphical model.  From a list of symptoms, we select the ones that we observe (all observed symptoms are represented as yellow circles in this document).

So the following diagram has only one circle (observed symptom).

Model 1:  The symptom of 4 beeps is modeled in a probabilistic graphical model with a yellow circle as follows:

Now, let’s assume that this symptom can be caused by the failure of memory, the motherboard or the processor.

Model 2:  I can add that information to the predictive model, so that the model now looks like this:

The model captures the belief that the causes of the symptom (processor, memory or motherboard failure) are, in the absence of any symptoms, independent of each other.

It also captures the belief that given a symptom like 4 beeps, evidence for one cause will explain away (or decrease the probability of) the other causes.

Once such a model is built, it can tell a field engineer the most probable cause of a symptom, the second most probable cause and so on.

So, the engineer will only have to look at the output of the model’s analysis to know whether he needs to replace one component, or two, and which ones.

When the field engineer goes out and replaces the components, his actions can also be fed into the model.

Model 3:  Below is an extended model into which attempts to fix the problem by replacing the memory can be incorporated.

If a field engineer were to feed into the system the fact that the memory was replaced with a new module and it didn’t fix the problem, the system would be able to immediately figure out that the memory could not be the cause of the problem, and it would suggest the next most probable cause of failure.
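
Here is a minimal sketch of that reasoning (the prior failure rates and the noisy-OR link probabilities below are invented numbers, purely for illustration): enumerate the eight joint states of the three causes, condition on the 4-beep symptom, and watch the posterior shift when the memory is ruled out.

```python
from itertools import product

# Hypothetical prior failure probabilities and noisy-OR link strengths.
priors = {"processor": 0.10, "memory": 0.30, "motherboard": 0.20}
transmit = {"processor": 0.90, "memory": 0.95, "motherboard": 0.90}
LEAK = 0.01  # chance of 4 beeps even with no failed component

def p_symptom(state):
    """Noisy-OR probability of the 4-beep symptom given which causes failed."""
    p_none = 1 - LEAK
    for cause, failed in state.items():
        if failed:
            p_none *= 1 - transmit[cause]
    return 1 - p_none

def posterior(evidence=None):
    """P(cause failed | 4 beeps, evidence), by brute-force enumeration."""
    evidence = evidence or {}
    post = {c: 0.0 for c in priors}
    z = 0.0
    for vals in product([True, False], repeat=len(priors)):
        state = dict(zip(priors, vals))
        if any(state[c] != v for c, v in evidence.items()):
            continue  # inconsistent with what the engineer already knows
        w = 1.0
        for c in priors:
            w *= priors[c] if state[c] else 1 - priors[c]
        w *= p_symptom(state)
        z += w
        for c in priors:
            if state[c]:
                post[c] += w
    return {c: round(post[c] / z, 3) for c in priors}

print(posterior())                   # memory comes out as the most probable culprit
print(posterior({"memory": False}))  # known-good memory: the blame shifts elsewhere
```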

Model 4

Finally, in case new memory modules being sent to customers for repairs frequently turned out to be defective, that information could also be added to the model as follows:

Now, if the error rate for new memory modules in the supply chain happens to be high for a particular type of memory, then if memory replacement failed to fix a 4-beep problem, the model would understand that faulty memory could still be the cause of the problem.

## Applications to Supply Chain Management

The probabilities at all the nodes adjust themselves continuously, and this information can actually be used to detect whether the error rates in new memory module deliveries have suddenly gone up.
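
As a sketch of what such detection could look like (a simple binomial control limit, not the graphical model itself; the numbers are invented):

```python
import math

def spike_alert(failures, shipped, baseline_rate, z=3.0):
    """True if the observed failure count exceeds the baseline
    expectation by more than z binomial standard deviations."""
    expected = shipped * baseline_rate
    sd = math.sqrt(shipped * baseline_rate * (1 - baseline_rate))
    return failures > expected + z * sd

# 12 failed modules out of 200 shipped, against a 2% baseline (~4 expected):
print(spike_alert(12, 200, 0.02))  # True  -- flag the supply chain
print(spike_alert(5, 200, 0.02))   # False -- within normal variation
```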

## Benefits to a Customer Service Process

1.  Formal capture and storage of triage history

2.  Suggestion of cause(s) given the effects (symptoms)

3.  Suggestion of other causes given triage steps performed

What the system will seem to be doing (to the layman):

1.  Recording symptoms

2.  Recommending a course of action

3.  Recording the outcome of the course of action

4.  Recommending next steps

# Analysing documents for non-obvious differences

The ease of classification of documents depends on the categories you are looking to classify documents into.

A few days ago, an engineer wrote about a problem where the analysis that needed to be performed on documents was not the most straightforward.

He described the problem in a forum as follows: “I am working on sub classification. We already crawled sites using focused crawling. So we know domain, broad category for the site. Sometimes site is also tagged with broad category. So I don’t require to predict broad class for individual site. I am interested in sub-classification. For example, I don’t want to find if post is related to sports, politics, cricket etc. I am interested in to find if post is related to Indian cricket, Australia cricket, given that I already know post is related to cricket. Since in cricket post may contains frequent words like runs, six, fours, out, score etc, which are common across all cricket related posts. So I also want to consider rare terms which can help me in sub-classification. I agree that I may also require frequent words for classification. But I don’t want to skip rare terms for classification.”

If you’re dealing with categories like sports, politics and finance, then using machine learning for classification is very easy.  That’s because all the nouns and verbs in the document give you clues as to the category that the document belongs to.

But if you’re given a set of categories for which there are few indicators in the text, you end up with no easy way to categorize the documents.

After spending a few days thinking about it, I realized that something I had learnt in college could be applied to the problem.  It’s a technique called Feature Selection.

I am going to share the reply I posted to the question, because it might be useful to others working on the classification of documents:

You seem to have a data set that looks as follows (letters are categories and numbers are features):

A P 2 4
A Q 2 5
B P 3 4
B Q 3 5

Let’s say the 2s and the 3s are features that occur very frequently in your corpus while the 4s and the 5s are features that occur far less frequently in your corpus.

When you use the ‘bag of words’ model as your feature vector, your classifier will only learn to tell A apart from B (because the 4s and 5s will not matter much to the classifier, being overwhelmed as it is by the 2s and 3s which are far more frequent).

I think that is why you have come to the conclusion that you need to look for rare words to be able to accomplish your goal of distinguishing category P from category Q.

But in reality, perhaps what you need to do is identify all the features like 4 and 5 that might be able to help you distinguish P from Q and you might even find some frequent features that could help you do that (it might turn out that some frequent features might also have a fairly healthy ability to resolve these categories).

So, now the question just boils down to how you would go about finding the set of features that resolves any given categorization scheme.

The answer seems to be something that literature refers to as ‘Feature Selection’.

As the name says, you select features that help you break data points apart in the way you want.
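As a concrete (and deliberately tiny) illustration, here is one common feature-scoring criterion, mutual information, applied to the toy data set above. With P/Q as the class labels, the rare features 4 and 5 score high while the frequent features 2 and 3 score zero, which is exactly the ranking a feature selector would use. The code is only a sketch of the idea, not of any particular library’s API:

```python
from collections import Counter
import math

def mutual_information(feature_values, labels):
    """Estimate I(feature; class) in bits from co-occurrence counts."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    f_marg = Counter(feature_values)
    l_marg = Counter(labels)
    mi = 0.0
    for (f, l), count in joint.items():
        # p(f,l) * log2( p(f,l) / (p(f) * p(l)) )
        mi += (count / n) * math.log2(count * n / (f_marg[f] * l_marg[l]))
    return mi

# Toy corpus mirroring the example above: features "2"/"3" track the
# A/B split, features "4"/"5" track the P/Q split we care about.
docs = [({"2": 1, "4": 1}, "P"),
        ({"2": 1, "5": 1}, "Q"),
        ({"3": 1, "4": 1}, "P"),
        ({"3": 1, "5": 1}, "Q")]
labels = [label for _, label in docs]
for feat in ["2", "3", "4", "5"]:
    column = [d.get(feat, 0) for d, _ in docs]
    print(feat, mutual_information(column, labels))
# Features 4 and 5 score 1.0 bit; features 2 and 3 score 0.0.
```

Ranking features this way and keeping the top scorers against the target categorization (P vs. Q, not A vs. B) is the essence of filter-style feature selection; raw corpus frequency never enters the score.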

Wikipedia has an article on Feature Selection.

And Mark Hall’s thesis http://www.cs.waikato.ac.nz/~mhall/thesis.pdf seems to be widely cited.

From Mark Hall’s thesis: “A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other.”

To be honest, I’d heard about Feature Selection but had never connected it to the problem it solves until now, so I’m just looking up reading material as I write.

Best of luck with it.

# Digital Democracy and Cutting out the Middleman in Government

Can information technology in general and text analytics in particular help improve the quality of governance?

We believe they can.  In this article, we discuss one problem/weakness with the present system of governance that makes it very susceptible to corruption.  We then present a solution that relies on analytics to mitigate the problem.

Governance

Governance is a service.  An organization (government) provides people in a geographical area with a service called governance.  The organization that provides the service is for all practical purposes a service company owned by all the people to whom the service is provided.

Services provided by government include collecting money and using it to create infrastructure and services for the common good, such as roads, schools, city planning and waste disposal.

One weakness in the present approach is as follows.

The goals of the service provider may not always be well-aligned with the goals of the people being served.

When corruption exists, these goals may be very poorly aligned indeed.

Misalignment of Goals

Example 1:  Misalignment of Goals in Road Construction

For example, take the construction of a road.  What the people of the city who use roads want in return for paying out money is better roads.  To the governing body that disburses the money, the goal – where corruption is rife – is high kickbacks.

Does Bangalore really not have enough money to build good roads?  It is very likely that our roads are bad not because we lack the money or the means to build roads that last, but because our governing body in charge of road repairs repeatedly doles out road maintenance contracts to contractors who do the road construction authorities favors in return for those contracts.

Example 2:  Misalignment of Goals in Allocating Budgets for Defence and Education

In an article on why India imports vast quantities of arms, we had described how the Indian government was under-spending on education and over-spending on defence procurement.

That article was based on an IMF report http://www.imf.org/external/pubs/nft/2002/govern/index.htm that mentioned a study showing that corrupt governments overspend on defence procurement because of the lack of transparency in such deals.

For example, in 2011 and 2012, India committed close to USD 50 billion to purchases of aircraft and ships alone, whereas expenditure on education was around USD 12 billion per annum (woefully inadequate for our country).

Here again, we see a complete misalignment of goals.  People in India need education.  The government, however, when given a choice between putting our money into education or into arms, picks the choice that gives it a higher chance of receiving kickbacks.

Both are examples of something we call man-in-the-middle corruption.

One possible solution is to allow people to allocate portions of their income tax to categories of services that we expect our government to provide us.

Goal Alignment

For example, if I am paying Rs. 20,000 in income tax, I might quite reasonably be allowed to allocate, say, Rs. 10,000 of it to areas of infrastructure that I feel we need to invest in.  I might allocate Rs. 5,000 to education and Rs. 5,000 to health services.  This would give people some measure of control over the use of their money by the governing body.

Moreover, it would give the governing body a deeper insight into the needs of the people, and also put some pressure on it to allocate all public funds according to a similar ratio.
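As a sketch of the mechanics, aggregating such citizen-directed allocations into a published ratio is straightforward; the category names and amounts below are, of course, hypothetical:

```python
from collections import defaultdict

def aggregate_allocations(taxpayers):
    """Sum citizen-directed allocations per category and return each
    category's share of the total directed amount."""
    totals = defaultdict(float)
    for allocations in taxpayers:
        for category, amount in allocations.items():
            totals[category] += amount
    grand_total = sum(totals.values())
    return {category: amount / grand_total
            for category, amount in totals.items()}

# Two hypothetical taxpayers directing half of their income tax:
citizens = [
    {"education": 5000, "health": 5000},
    {"education": 8000, "roads": 2000},
]
print(aggregate_allocations(citizens))
# education: 13000/20000 = 0.65, health: 0.25, roads: 0.10
```

The resulting ratio is the “deeper insight” mentioned above: a direct, quantitative statement of what the taxpaying public wants funded.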

For this to work, the allocation choices offered to people would have to be meaningful.  Meaningful choices may be determined by public discussion and/or referenda.

Any public discussion on the matter would require the use of debate support tools – text analytics tools that help large numbers of people communicate.

In essence, what might be needed are text analytics technologies that can support legislation (proposing legislation, modifying legislation, or conducting a referendum on legislation).

Legislation

More to the point, at this year’s Coling conference, we came across a paper by Swapna Gottipati, a student at Singapore Management University, on how one might detect thoughtful suggestions in social media messages.  The paper was titled “Finding Thoughtful Comments from Social Media”.  Unfortunately the paper is not yet available online.

There have been attempts to allow people to propose legislation through online communities, but they don’t seem to work very well, as the following article shows: http://news.yahoo.com/interactive-white-house-secession-petitions-and-presidential-power-235012490.html

But a more successful attempt at using social media is described in this BBC article Why not let social media run the country?, and I quote: “But Nick Jones, deputy director of digital communications at Downing Street … points to the Red Tape Challenge, which has received more than 28,000 comments since it was launched by the prime minister last year and which has a ‘social media element’.  More than 150 pieces of legislation identified by the public as unnecessary have so far been scrapped.”

I also really like Clay Shirky’s talk on how the internet will one day transform government.  He talks about how freedom of expression is promoted by social media.  What does freedom of speech do?  Well, it allows more ideas to circulate.  The more ideas there are in circulation, the better things (possibly governance) can become.

He talks about a need for an open-source model for generating agreement on ideas and proposes large-scale discussion using something like the Git version control system.

He provides examples of legislation dumped on GitHub, and his big takeaway seems to be the idea of collaboration without coordination.

He also talks about the need for openness working in two directions (about participatory legislation and not just legislation being visible to everyone), and about the invention of new methods of argument.  Very interesting.

Feedback

Another use of social media in governance is to collect feedback on government policies and decisions.  In that context I want to mention Project Dreamcatcher, an analytics project with a social media component that was used by the Obama campaign in 2012.  Here is an article on Project Dreamcatcher.  It seems to be an extension of the feedback monitoring that has been used in customer service.

Summary

There seem to be new possibilities opening up for the use of technology, possibly text analytics technology, in governance.