Category: Use Cases

Fun With Text – Hacking Text Analytics


I’ve always wondered if there was a way to teach people to cobble together quick and dirty solutions to problems involving natural language, from duct tape, as it were.

Having worked in the field for donkey’s years now (as of 2015), and having taught a number of text analytics courses along the way, I’ve seen students of text analysis stumble mostly over one of two hurdles:

1.  Inability to Reduce Text Analytics Problems to Machine Learning Problems

I’ve seen students, after hours of training, still revert to rule-based thinking when asked to solve new problems involving text.

You can spend hours teaching people about classification and feature sets, but when you ask them to apply their learning to a new task, say segmenting a resume, you’ll hear them very quickly falling back to thinking in terms of programming steps.

Umm, you could write a script to look for a horizontal line, followed by capitalized text in bold, big font, with the words “Education” or “Experience” in it!!!

2.  Inability to Solve the Machine Learning (ML) Problems

Another thing I have seen teams get hung up on is actually solving the ML problems and comparing different solutions.

My manager wants me to identify the ‘introduction’ sections.  So, I labelled 5 sentences as introductions.  Then, I trained a maximum entropy classifier with them.  Why isn’t it working?

One Machine Learning Algorithm to Rule Them All

One day, when I was about to give a lecture at Barcamp Bangalore, I had an idea.

Wouldn’t it be fun to try to use just one machine learning algorithm, show people how to code up that algorithm themselves, and then show them how a really large number of text analytics problems (almost every single problem related to the semantic web) could be solved using it?

So, I quickly wrote up a set of problems in order of increasing complexity, and went about trying to reduce them all to one ML problem, and surprised myself!  It could be done!

Just about every text analytics problem related to the semantic web (which is, by far, the most important commercial category) could be reduced to a classification problem.

Moreover, you could tackle just about any problem using just two steps:

a) Modeling the problem as a machine learning problem

Spot the machine learning problem underlying the text analytics problem (and, if it is a classification problem, the relevant categories), and you’ve reduced the text analytics problem to a machine learning problem.

b) Solving the problem using feature engineering

To solve the machine learning problem, you need to come up with a set of features that allows the machine learning algorithm to separate the desired categories.

That’s it!
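To make those two steps concrete, here is a tiny sketch in Python (using scikit-learn).  The resume lines, the labels and the bag-of-words features are all made up for illustration; a real system would need far more data and richer layout features.

# Step (a): model resume segmentation as classification - every line of the
# resume gets a section label.  All the training data here is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

lines = [
    "EDUCATION",
    "B.E. in Computer Science, 2008",
    "EXPERIENCE",
    "Software Engineer, Acme Corp, 2008 - 2012",
    "Built a text analytics pipeline in Python",
]
labels = ["education", "education", "experience", "experience", "experience"]

# Step (b): feature engineering - here, just a bag of words over each line.
# A real system would add layout features (capitalization, font size, position).
model = make_pipeline(CountVectorizer(lowercase=True), LogisticRegression())
model.fit(lines, labels)

print(model.predict(["M.Sc. in Computer Science, 2010"]))  # most likely 'education' on this tiny toy set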

Check it out for yourself!

Here’s a set of slides.

It’s called “Fun with Text – Hacking Text Analytics”.

Fraud detection using computers

For a long time, we’ve been interested in using mathematics (and computers) to detect and deter fraud.  It is related to our earlier work on identifying perpetrators of terrorist attacks.  (Yeah, I know it’s not as cool, but it’s some similar math!)

Today, I want to talk about some approaches to detecting fraud that we talked about on a beautiful summer day, in the engineering room at Aiaioo Labs.

That day, in the afternoon, somebody had rung the bell.  A colleague had answered the bell and then come and handed me a sheet of paper, saying that a lady at the door was asking for donations.

The paper bore the letterhead of an organization in a script that I couldn’t read.  However, the text in English stated that the bearer was a student collecting money to feed a few thousand refugees living in a refugee camp in Hyderabad (their homes had been destroyed in artillery shelling on the India-Pakistan border, and there were a few thousand families without shelter who needed food and medicines urgently).

On the sheet were the names and signatures of about 20 donors who had each donated around 1000 rupees.

Now the problem before us was to figure out if the lady was a genuine student volunteer or a fraudster out to make some quick money.

There was one thing about the document that looked decidedly suspicious.

It was that the amounts donated were all very similar – 1000, 1200, 1300, 1000, 1000, 1000, 1000.

All the numbers had unnaturally high values.

So, I called a friend of mine who came from the place she claimed the refugees (and the student volunteers) were from and asked him to talk to her and tell me if her story checked out.

He spoke to her over the phone for a few minutes and then told me that her story was not entirely true.

She was from the place that she claimed the refugees came from, but she was in fact collecting money for her own family (they had come south because one of them had needed a medical operation and were now collecting money to travel back to their home town).

When we asked her why she had lied, she just shrugged.

We felt it would be fine to help a family in need, so we gave her some money.

However, the whole affair gave us an interesting problem to solve.

How do you tell if a set of numbers is ‘natural’ or if it has been made up by a person intent on making them look natural?

Well, it turns out that statistics can give you the tools to do that.

Method 1

In nature, many processes result in random numbers that follow a certain distribution. And there are standard distributions that almost all numbers found in nature belong to.

For example, on the sheet of paper that the lady had presented, the figures for the money donated should have followed a normal distribution.  There should have been a few high values and a few low values and a lot of the values in the middle.

Since that wasn’t the case I could easily tell that the numbers had been made up.

But you don’t need a human to tell you that.  There are statistical tests that can be done to see if a set of numbers belongs to any expected distribution.

I looked around online and found an article that tells you about methods that can be used to check if a set of numbers belongs to a normal distribution (a distribution that occurs very frequently in nature): http://mathforum.org/library/drmath/view/72065.html

Some of the methods it talks about are the Kolmogorov-Smirnov test, the Chi-square test, the D’Agostino-Pearson test and the Jarque-Bera test.

Details of each can be found at these links (taken from the article):

One common test for normality with which I am personally NOT familiar is the Kolmogorov-Smirnov test.  The math behind it is very involved, and I would suggest you refer to other resources such as this page:

Wikipedia: Kolmogorov-Smirnov Test
http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test

You can read more about the D'Agostino-Pearson test and get a table that can be used in Excel here:

Wikipedia: Normality Test
http://en.wikipedia.org/wiki/User:Xargque#Normality_Test

Wikipedia: Jarque-Bera Test
http://en.wikipedia.org/wiki/Jarque-Bera_test

One item of note: depending on how your stats program calculates kurtosis, you may or may not need to subtract 3 from kurtosis.

See: Wikipedia Talk: Jarque-Bera Test
http://en.wikipedia.org/wiki/Talk:Jarque-Bera_test
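To show the mechanics of Method 1, here is a rough sketch in Python (using SciPy) that runs two of the tests mentioned above on the donation figures.  With only seven values these tests are not reliable, so treat it purely as an illustration.

import numpy as np
from scipy import stats

# The figures from the donation sheet (far too few for a reliable test;
# this is only to show the mechanics).
donations = np.array([1000, 1200, 1300, 1000, 1000, 1000, 1000], dtype=float)

# Kolmogorov-Smirnov test against a normal distribution fitted to the data.
# (Strictly, estimating the parameters from the same data calls for the
# Lilliefors correction, which the plain kstest does not apply.)
ks_stat, ks_p = stats.kstest(donations, "norm",
                             args=(donations.mean(), donations.std(ddof=1)))
print("Kolmogorov-Smirnov: statistic = %.3f, p-value = %.3f" % (ks_stat, ks_p))

# Jarque-Bera test, based on skewness and kurtosis.
jb_stat, jb_p = stats.jarque_bera(donations)
print("Jarque-Bera:        statistic = %.3f, p-value = %.3f" % (jb_stat, jb_p))

# A small p-value is evidence against the numbers being normally distributed.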

On to the next method:

Method 2

Another property of many naturally occurring numbers is that about one third of them start with the digit 1!  Surprising, isn’t it?

Well, it turns out that this applies to population numbers, electricity bills, stock prices and the lengths of rivers.

It applies to all numbers that come from power law distributions (power laws govern the distribution of wealth, connections on Facebook, the number of speakers of a language, and a lot of other numbers related to society).

This is called Benford’s law:  http://en.wikipedia.org/wiki/Benford%27s_law

(I believe that Benford’s law would have applied to the above case as well – donations would have a power law distribution – if you assumed that all donors donated money proportional to their wealth).

When I read about Benford’s law on Wikipedia (while writing this article), I found that it is already being used for accounting fraud detection.

The Wikipedia article says:

Accounting fraud detection

In 1972, Hal Varian suggested that the law could be used to detect possible fraud in lists of socio-economic data submitted in support of public planning decisions. Based on the plausible assumption that people who make up figures tend to distribute their digits fairly uniformly, a simple comparison of first-digit frequency distribution from the data with the expected distribution according to Benford’s Law ought to show up any anomalous results. Following this idea, Mark Nigrini showed that Benford’s Law could be used in forensic accounting and auditing as an indicator of accounting and expenses fraud.[10] In practice, applications of Benford’s Law for fraud detection routinely use more than the first digit.[10]
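Here is a small sketch of how the first-digit check can be coded up.  The amounts are just the handful of figures from the donation sheet, so the comparison means little here; real forensic checks run this over thousands of accounting entries.

import math
from collections import Counter

def first_digit(x):
    """Return the leading non-zero digit of a positive number."""
    s = str(abs(x)).lstrip("0.")
    return int(s[0])

# The figures from the donation sheet (far too few to be meaningful;
# real forensic applications use thousands of accounting entries).
amounts = [1000, 1200, 1300, 1000, 1000, 1000, 1000]

observed = Counter(first_digit(a) for a in amounts)
n = float(len(amounts))

print("digit  observed  Benford expected")
for d in range(1, 10):
    expected = math.log10(1 + 1.0 / d)   # Benford's law: P(d) = log10(1 + 1/d)
    print("%5d  %8.2f  %16.2f" % (d, observed.get(d, 0) / n, expected))

# Large, systematic gaps between the two columns (over a big enough sample)
# are a red flag that the numbers may have been made up.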

Method 3

There are also methods that can be used by governments and large organizations to prevent fraud in the issuing of tenders.

More about that in my next article.

Building Machine Learning Models that can help with Customer Service and Supply Chain Management

The Laptop that Stopped Working

One fine day, a couple of months ago, a laptop that we owned stopped working.  We heard 4 beeps coming from the machine at intervals but nothing appeared on the screen.

Customer Service

The service person quickly looked up the symptoms in his knowledge base and informed us that 4 beeps meant a memory error.

I replaced the two memory modules one by one, but the machine still wouldn’t start.  I then tried two spare memory modules that I had in the cupboard, but the computer still wouldn’t start.

I had a brand new computer with me that used the same type and speed of memory as the one we were fixing.  I pulled out its memory chips and inserted them into the faulty computer, but still no luck.

At that point, the service person told me that it must be the motherboard itself that was not working.

Second Attempt at Triage

So the next day, a motherboard and some memory arrived at my office.  A little later a field engineer showed up and replaced the motherboard.  The computer still wouldn’t start up.

When the field engineer heard the 4 beeps, he said it MUST BE THE MEMORY.

Third Attempt at Triage

A few days later, a new set of memory modules arrived.

The engineer returned and tried inserting the new memory in.  Still no luck.  The computer would not start and you could still hear the 4 beeps.

A third set of brand new memory modules and a new motherboard were sent over.

Fourth Attempt at Triage

The engineer tried both motherboards and various combinations of memory modules, but still, all you could hear were 4 beeps and the computer would not start.

During one of his attempts to combine memory and motherboards, the engineer noticed that though the computer did not start, it did not beep either.

So, the engineer guessed that it was the screen that was not working.  But just to be safe, he said he’d ask them to send another motherboard and another set of memory modules to go with it.

Fifth Attempt at Triage

The screen, the third motherboard and the fourth set of memory modules arrived in our office and an engineer spent the day trying various combinations of screens, motherboards and memory modules.

But the man on the phone said: “Sir, 4 beeps means there is something wrong with your memory.  I will have them replaced.”

I had to take out my new laptop’s memory and pop it into the faulty machine to convince the engineer and support staff that replacing the memory would not fix the problem.

All the parts were now sent over – the memory, motherboard, processor, drive, and screen.

Sixth Attempt at Triage

Finally, the field engineer found that when he had replaced the processor, the computer was able to boot up with no problems.

Better Root Cause Analysis

The manufacturer could have spared themselves all that expense, time and effort had they used an expert system that relied on a probabilistic model of the symptoms and their causes.

Such a model would be able to tell, given the symptoms, which component was the most likely to have failed.

Such a model would be able to direct a field engineer to the component or components whose replacement would be most likely to fix the problem.

If the attempted fix did not work, the model would simply update its understanding of the problem and recommend a different course of action.

I will illustrate the process using what is known in the machine learning community as a directed probabilistic graphical model.

Run-Through of Root Cause Analysis 

Let’s say a failure has occurred and there is only one symptom that can be observed: the laptop won’t start and emits 4 beeps.

The first step is to enter this information into the probabilistic graphical model.  From a list of symptoms, we select the ones that we observe (all observed symptoms are represented as yellow circles in this document).

So the following diagram has only one circle (observed symptom). 

Model 1:  The symptom of 4 beeps is modeled in a probabilistic graphical model with a yellow circle as follows:

[Figure: Model 1 – a single observed node (yellow circle) representing the 4-beep symptom]

Now, let’s assume that this symptom can be caused by the failure of memory, the motherboard or the processor.

Model 2:  I can add that information to the predictive model, so that the model now looks like this:

[Figure: Model 2 – three possible causes (processor, memory and motherboard failure) pointing to the observed 4-beep symptom]

The model captures the belief that the causes of the symptom – processor, memory or motherboard failure – are (in the absence of any observed symptoms) independent of each other.

It also captures the belief that given a symptom like 4 beeps, evidence for one cause will explain away (or decrease the probability of) the other causes.

Once such a model is built, it can tell a field engineer the most probable cause of a symptom, the second most probable cause and so on.

So, the engineer will only have to look at the output of the model’s analysis to know whether he needs to replace one component, or two, and which ones.
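Here is a toy sketch of the kind of reasoning Model 2 supports, written as a plain enumeration over the three causes with a noisy-OR model of the symptom.  The prior failure rates and symptom probabilities are invented purely for illustration; a production system would use a proper PGM library and calibrated numbers.

from itertools import product

# Invented prior failure rates for the three possible causes.
priors = {"memory": 0.05, "motherboard": 0.02, "processor": 0.01}
# Invented probability that each faulty part, on its own, produces 4 beeps.
beep_given_fault = {"memory": 0.9, "motherboard": 0.6, "processor": 0.5}
leak = 0.001  # probability of 4 beeps when nothing is actually faulty

def p_beeps(state):
    """Noisy-OR probability of the 4-beep symptom given which parts have failed."""
    p_no_beeps = 1.0 - leak
    for part, failed in state.items():
        if failed:
            p_no_beeps *= 1.0 - beep_given_fault[part]
    return 1.0 - p_no_beeps

parts = list(priors)
posterior = {part: 0.0 for part in parts}
evidence = 0.0

# Enumerate all 2^3 joint states of the causes, weighting each by prior x likelihood.
for failures in product([False, True], repeat=len(parts)):
    state = dict(zip(parts, failures))
    p_state = 1.0
    for part in parts:
        p_state *= priors[part] if state[part] else 1.0 - priors[part]
    joint = p_state * p_beeps(state)   # we have observed the 4 beeps
    evidence += joint
    for part in parts:
        if state[part]:
            posterior[part] += joint

for part in parts:
    print("P(%s failed | 4 beeps) = %.3f" % (part, posterior[part] / evidence))

With these made-up numbers, memory comes out as the most probable culprit (which is what the knowledge base said too); the difference is that the model can revise that belief as repair attempts are fed back in.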

When the field engineer goes out and replaces the components, his actions can also be fed into the model.

Model 3:  Below is an extended model into which attempts to fix the problem by replacing the memory can be incorporated.

[Figure: Model 3 – the model extended with a node representing the repair action of replacing the memory]

If a field engineer were to feed into the system the fact that the memory was replaced with a new module and it didn’t fix the problem, the system would be able to immediately figure out that the memory could not be the cause of the problem, and it would suggest the next most probable cause of failure.

Model 4

Finally, in case new memory modules being sent to customers for repairs frequently turned out to be defective, that information could also be added to the model as follows:

[Figure: Model 4 – the model further extended with a node representing the defect rate of replacement memory modules]

Now, if the error rate for new memory modules in the supply chain happens to be high for a particular type of memory, then even when a memory replacement fails to fix a 4-beep problem, the model will understand that faulty memory could still be the cause of the problem.
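Continuing the same toy sketch (with the same invented numbers): once the memory has been replaced, the prior probability that the memory is faulty becomes the defect rate of new modules, which is exactly the supply-chain information Model 4 adds.

from itertools import product

def memory_posterior(new_module_defect_rate):
    """P(memory at fault | 4 beeps persist after the memory was replaced)."""
    # After the swap, the prior for 'memory' is the defect rate of new modules.
    priors = {"memory": new_module_defect_rate,
              "motherboard": 0.02, "processor": 0.01}
    beep_given_fault = {"memory": 0.9, "motherboard": 0.6, "processor": 0.5}
    leak = 0.001
    parts = list(priors)
    num, den = 0.0, 0.0
    for failures in product([False, True], repeat=len(parts)):
        state = dict(zip(parts, failures))
        p_state = 1.0
        for part in parts:
            p_state *= priors[part] if state[part] else 1.0 - priors[part]
        p_no_beeps = 1.0 - leak
        for part in parts:
            if state[part]:
                p_no_beeps *= 1.0 - beep_given_fault[part]
        joint = p_state * (1.0 - p_no_beeps)   # the 4 beeps are still there
        den += joint
        if state["memory"]:
            num += joint
    return num / den

# Reliable supply chain: replacing the memory all but rules it out as the cause.
print("defect rate 0.1%%: P(memory | still beeping) = %.3f" % memory_posterior(0.001))
# A bad batch of replacement modules: memory stays a plausible cause.
print("defect rate 20%% : P(memory | still beeping) = %.3f" % memory_posterior(0.20))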

Applications to Supply Chain Management

The probabilities at all the nodes are updated continuously, and this information can actually be used to detect whether the error rate in new memory module deliveries has suddenly gone up.

Benefits to a Customer Service Process

1.  Formal capture and storage of triage history

2.  Suggestion of cause(s) given the effects (symptoms)

3.  Suggestion of other causes given triage steps performed

What the system will seem to be doing (to the layman):

1.  Recording symptoms

2.  Recommending a course of action

3.  Recording the outcome of the course of action

4.  Recommending next steps

Measuring the efficiency of retail and the possible implications

I read a beautiful BBC article today titled “How much will the technology boom change Kenya?”

It is about how information delivered over mobile phones can improve people’s lives.  Here is an example:

“Ms Oguya, 25, is the creator of a mobile phone app called M-Farm, designed to help small-scale farmers maximise their potential.

Ms Oguya herself grew up on a farm. She realised people like her parents had two main problems.

Firstly, they did not always know the up-to-date market price for a particular crop.

Unscrupulous middlemen would take advantage of that and persuade farmers to part with their produce at lower prices.

Using M-Farm, a farmer can now find out the latest prices with a single text message.”

This reminded me first of all about the price of pomegranates in Bangalore.  A kilo of pomegranates costs Rs. 120 in Bangalore city.  The price at which traders buy pomegranates from farmers is Rs. 12 per kilo.  In other words, the producer gets a mere 1/10 of the price the consumer pays for the product.

A year ago, I had written an article on an anti-corruption blog titled “Why Walmart’s measure of efficiency might be flawed”.

I quote from it:

“Walmart’s definition of efficiency is the cost of a product. They say they increase the efficiency of the entire market by lowering the cost of products (by getting an item manufactured in a low-cost geography).”

“However, in my opinion, there is a better way to measure efficiency, and I feel that retailers highlight lower prices as a measure of efficiency only because it is the only measure possible under the current conditions of lack of transparency in retail.”

“Take for example a $10 product that used to cost $5 to manufacture in the USA (potential profit margin of $5). Now, Walmart has a profit motive to move its manufacture to a geography where it costs less than $1 to manufacture if they can still charge $8 for it. Say, the transportation cost is $1. If the product sells for $8, they still have a profit margin of $6 which is higher than $5.”

“However, if I measure efficiency as the percentage of money paid by a customer that is being delivered to the manufacturer, there has actually been a drop in efficiency. The percentage of the price that went to the manufacturer dropped from 50% to only 12.5% (one dollar out of eight).”

“You would never donate your money to a charity without first asking what percentage of your donation was reaching the beneficiary, would you?”

Now let’s do the math for pomegranates.

By using the ratio of purchase price to sale price as a measure of efficiency, we find that the efficiency of the retail mechanism for pomegranates is a mere 10%.

If farmers became better informed, they might work to discover ways to improve the efficiency – either by bypassing middlemen (I remember the farmers’ market in Raleigh, NC, an amazingly simple idea that did just that) or by negotiating better prices for themselves.

End consumers might also choose to patronize outlets that pay fairer prices to the producers if they could find out how much was really paid to the manufacturer.

That might also help bring back manufacturing to the USA.

Can you make a sandwich from classifiers?

One day, just a few years ago, a client came to Aiaioo Labs with a very interesting problem.

He wanted to know if and how AI tools could save him some money.

It turned out that he had a fairly large team performing the task of manually categorizing documents.

He wanted to know if we could supply him an AI tool that could automate the work.  The only problem was, he was going to need very high quality.

And no single automated classifier was going to be able to deliver that sort of quality by itself.

That’s when we hit upon the idea of a classifier sandwich.

The sandwich is prepared by arranging two classifiers as follows:

1.  Top layer – high precision classifier – when it picks a category, it is very likely to be right (the precision of the selected category is very high).

2.  Bottom layer – high recall classifier – when it rejects a category, it is very likely to be right about the rejection (the precision of the rejected category is very high).

Whatever the top layer does not pick and the bottom layer does not reject – that is, the middle of the sandwich – is then handed off to be processed manually by the team of editors that the client had in place.
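Here is a minimal sketch of the routing logic.  It assumes, for simplicity, that both layers are derived from a single probabilistic classifier by picking two confidence thresholds; that is only one of several ways the sandwich could be built.

ACCEPT_THRESHOLD = 0.95   # top layer: very confident the document IS in the category
REJECT_THRESHOLD = 0.05   # bottom layer: very confident the document is NOT in it

def route(p_category):
    """Decide what to do with a document, given P(category | document)."""
    if p_category >= ACCEPT_THRESHOLD:
        return "auto-accept"        # top layer of the sandwich
    if p_category <= REJECT_THRESHOLD:
        return "auto-reject"        # bottom layer of the sandwich
    return "send to human editors"  # the middle of the sandwich

for p in (0.99, 0.40, 0.02):
    print("P = %.2f -> %s" % (p, route(p)))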

So, that was a lovely little offering, one that any consultant could put together.  And it is incredibly easy to put together an ROI proposition for such an offering.

How do you calculate the ROI of a classifier sandwich?

Simple!

Let’s say the high-precision top layer has a recall of 30%.

Let’s say the high-recall bottom layer has a recall of 80%.

Then about 50% of the documents that pass through the system will end up being automatically sorted out.  (Roughly speaking, the top layer automatically accepts the 30% it confidently picks, and the bottom layer automatically rejects roughly the 20% it does not flag as possible members of the category, leaving about half for the humans.)

The work effort and therefore the size of the team needed to do it, would therefore be halved.

Note that to make the sandwich, we need two high-precision classifiers (the first one selects a category with high precision while the second one rejects the other category with high precision).

Both categories need to have a precision greater than or equal to the quality guarantee demanded by the client.

That precision limit determines the amount of effort left over for humans to do.

How can we tune classifiers for high precision?

For maxent classifiers, thresholds can be set on the confidence scores they return for the various categories.

For naive bayesian classifiers, the best approach to creating high-precision classifiers is a training process known as expectation maximization.

For more information, please refer to the work of Kamal Nigam et al.:  http://www.kamalnigam.com/papers/emcat-mlj99.pdf
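Here is one hedged sketch of how the threshold could be picked in practice: sweep candidate thresholds over a held-out validation set and keep the lowest one whose accepted documents still meet the client’s precision target.  The scores, labels and target below are made up.

def pick_threshold(scores, labels, target_precision):
    """Return the lowest confidence threshold whose accepted set meets the target precision."""
    for t in sorted(set(scores)):               # candidate thresholds, lowest first
        accepted = [y for s, y in zip(scores, labels) if s >= t]
        if accepted and sum(accepted) / float(len(accepted)) >= target_precision:
            return t                            # first (lowest) threshold that qualifies
    return None

# Hypothetical validation data: classifier confidence scores and true labels (1 = in category).
scores = [0.97, 0.91, 0.85, 0.80, 0.72, 0.60, 0.41, 0.30]
labels = [1,    1,    1,    0,    1,    0,    0,    0]

print(pick_threshold(scores, labels, target_precision=0.95))   # 0.85 on this made-up data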

Another secret to boosting precision is using the right features in your classifier, but more about that later.

How product firms can expand into new markets with Worldjumper.com


This blog post is about a client of ours in Japan called Worldjumper.com.

Japanese businesses need a way to communicate with customers from outside Japan, to learn about their problems, and to find out more about their needs.

I was invited to visit Worldjumper’s office in Tokyo in late winter to help with a tool that could deliver easy localization at high speed and very low cost.

Localization is interesting because businesses need to adapt and change their message to suit different regions, cultures and languages.

For example, the message “WorldJumper can make customer queries in any language understandable to a service person who speaks Japanese” would be a great message to convey (in the Japanese language) to a client in Japan.

But in China, that message should say (in the Chinese language) “WorldJumper can make customer queries in any language understandable to a service person who only speaks Chinese.”

This is where Worldjumper can help.  It can identify and prioritize messages that need localization and use crowd-sourcing platforms to channel a huge volume of human effort to the task.

Worldjumper can also integrate very easily and quickly into a website.

If you are a product firm and have a website that needs to be readable in multiple languages, all you need to do is sign up for an account with Worldjumper and insert a snippet of HTML code into your website.

Now, your website will be readable in any language.

Moreover, you’ll be given a list of localization tasks, which you can manage directly from the Worldjumper console.

If you are a product firm, all you need to do to start selling your products to customers in countries where people speak languages that you do not understand is take five minutes to insert the Worldjumper code into your website.

Once you have inserted the HTML, your site becomes multilingual.  There will also be a contact form for you that can convert customer inquiries into your language, and then convert your responses back into the customer’s own language.  I believe there are lots of plugins in the works – like chat tools and Facebook page plugins.

So, when you get a chance, do check out Worldjumper.com.