Month: September 2013

Using text analytics to detect fraud related to government tenders

In my last article, I talked about using statistics to detect fraud.  I’d promised to write about methods for detecting and preventing fraud in the issuing of tenders.

The floating of tenders is the primary mechanism by which governments – often the biggest economic force in a geographical area – procure services from private organizations.

If the bidding process is compromised, contracts might end up not going to the best or most efficient vendor as a stakeholder (the tax-payer) would desire.

That in turn results in bad roads, poor infrastructure, delayed projects, under-spending on education, over-spending on military purchases and other problems associated with bad governance. (See: https://aiaioo.wordpress.com/2012/11/18/tools-for-the-mind-and-how-you-can-change-the-world/).

It also stands to reason that if you could detect tendering fraud, you could solve quite a few of the problems that affect places where corruption is rife.

So how do you tell if a tender has been fairly issued or if it has been gamed to the benefit of a certain participant?

An unfair procurement process can use one of the following methods to ensure that a contract is awarded to a favored party:

Method 1:  Choice of pre-selection criteria

One method used to favour a certain party is the introduction of unnecessary qualifying conditions in the tender that have nothing whatsoever to do with the end product or service to be procured.

These conditions are added to the tender in order to ensure that only a chosen small set of bidders meet all the conditions for participation in the tender process.

Method 2:  Cancellation and reissuing of the tender

I have been given to believe (by various sources) that in India/China, 3% of the size of the deal is the norm for kickbacks.

If a very efficient bid is placed, and it brings the cost of the service down so that the 3% kickbacks do not translate into a lot of money or if the winner of the bid refuses to pay a bribe, procurement officials might be able to subvert the process by coming up with reasons to cancel or terminate the tender.

They can then reissue the tender with tighter criteria intended to disqualify the uncooperative bidder.

My Experiences

Now, I must tell you that since the beginning of my career as an entrepreneur in India, I have come across numerous stories of terminated tenders, or of the disqualification of firms from a bidding process because they bid too low to be able to pay much by way of bribes.

I have personally walked into a tendering meeting where the government officials began with the words: “Ladies and gentlemen, we are proud to welcome you to our campus today.  We are extremely sorry that we cannot entertain you the way you entertain us when we come to your campuses.”

The tender was being issued for a software project that I felt should have taken a team of 3 engineers no more than 6 months to deliver.  But the tender stated that only firms with a minimum of “100 crores in revenue each year for the past 5 years,” (approximately 20 million USD each year) could bid for the project.  There were only 6 other firms in the room.

When the officials realised that Aiaioo Labs was a small firm, they suggested we leave.

They said, “There should be some other things we can work on with you.  Let’s meet some other time.”

Over the years, I began to wonder if there was any way that I, as a taxpayer, might protect myself from bad deals (corrupt or price-inefficient deals) entered into by government middlemen with my tax money.

Fortunately, it seems possible to use text analytics to detect and alert an ombudsman to possible fraud in the issuing of tenders.  Below is a description of how such a method might work.

Using text analytics to detect irrelevant selection constraints

If tenders for procuring very different products have very similar pre-selection criteria, they could be flagged as suspicious.

The reason this method might work is that relationships between corrupt officials and client firms can take a considerable amount of time to form (because of the risks involved and the consequent need for caution).  It is therefore not easy for corrupt officials to change the favored vendor very frequently.

That would mean that they would have to keep the criteria of selection of firms more or less unchanged across widely varying tenders and over long periods of time.  So, you might find that tenders small and large, for hardware or for software, (in other words, tenders for different services), but issued by the same organization might – if the tender process has become unfair – employ more or less the same set of selection criteria irrespective of what is being purchased.

Tools can be developed to detect these similarities and flag them for review.  Such tools would have to be able to tell apart the portions of a tender document that relate to the bidder from the portions that relate to the product or service requested.  They would then have to measure the similarity between the bidder-related sections of different tender documents.  It might also be possible to extract only the qualifying criteria and look for similarities there, or to analyse the bidder selection criteria to see if any of them are irrelevant to a project or incompatible with its requirements.
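To make the idea concrete, here is a minimal sketch in Python of the similarity check.  It assumes that the bidder-related text ("criteria") and the product-related text ("product") have already been separated out of each tender, which is the hard part and is not shown.  The field names, the thresholds and the use of plain word-count cosine similarity are all my assumptions for illustration, not a tested design.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_suspicious(tender_a, tender_b, criteria_threshold=0.8, product_threshold=0.3):
    """Flag a pair whose criteria are near-identical although the products differ.

    Both thresholds are hypothetical and would need tuning on real tenders.
    """
    criteria_sim = cosine_similarity(tender_a["criteria"], tender_b["criteria"])
    product_sim = cosine_similarity(tender_a["product"], tender_b["product"])
    return criteria_sim >= criteria_threshold and product_sim <= product_threshold


# Two made-up tenders from the same organization.
software_tender = {
    "criteria": "Bidder must have revenue of 100 crores each year for the past 5 years",
    "product": "Web portal for student records",
}
hardware_tender = {
    "criteria": "Bidder must have revenue of 100 crores each year for the past 5 years",
    "product": "Supply of diesel generators",
}
print(is_suspicious(software_tender, hardware_tender))
```

A real tool would probably want something sturdier than raw word counts (weighting rare words more heavily, for instance), but even this crude measure separates the two made-up tenders above.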

Using text analytics to detect reissued tenders

If a new tender document’s product or service description sections resemble those in an older tender, and if the issuing organization remains the same, it is possible that the tender has been reissued.  If it is further found that a very low bid had won the bidding in the previous round of tendering, and that the earlier tender had been cancelled, this could be used as a flag to alert an ombudsman.
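The checks described above could be strung together roughly as follows.  This is only a sketch: the record fields (issuer, description, cancelled, winning_bid, estimated_cost), the similarity threshold and the rule for what counts as a "very low" bid are all hypothetical, and I am using the character-level matcher from Python's difflib module simply because it ships with the language.

```python
from difflib import SequenceMatcher


def description_similarity(old_text, new_text):
    """Ratio in [0, 1] of matching characters between two descriptions."""
    return SequenceMatcher(None, old_text.lower(), new_text.lower()).ratio()


def reissue_flags(new_tender, old_tender, similarity_threshold=0.8, low_bid_ratio=0.7):
    """Return the list of warning signs found for this pair of tenders."""
    flags = []
    if new_tender["issuer"] != old_tender["issuer"]:
        return flags
    similarity = description_similarity(old_tender["description"], new_tender["description"])
    if similarity < similarity_threshold:
        return flags
    flags.append("possible reissue")
    if old_tender["cancelled"]:
        flags.append("earlier tender was cancelled")
        # Hypothetical rule: a winning bid well below the estimate counts as "very low".
        if old_tender["winning_bid"] < low_bid_ratio * old_tender["estimated_cost"]:
            flags.append("earlier round was won by a very low bid")
    return flags


# Made-up records for illustration.
old_tender = {
    "issuer": "Department A",
    "description": "Design and development of a web portal for land records",
    "cancelled": True,
    "winning_bid": 40,
    "estimated_cost": 100,
}
new_tender = {
    "issuer": "Department A",
    "description": "Design and development of a web portal for land record management",
}
for flag in reissue_flags(new_tender, old_tender):
    print(flag)
```

The more flags a pair collects, the stronger the case for putting it in front of an ombudsman.  None of the flags proves anything by itself, since tenders do get cancelled and reissued for honest reasons.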

Using text analytics to detect vendor-oriented constraints

If many of the conditions for participation in the tender are company-specific (properties of companies such as size or earnings) as opposed to capability-specific (experience in a certain technology space), it might raise a red flag.
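A very rough way to try this out is to label each qualifying condition using keyword lists and then measure what share of the conditions are about the company rather than its capabilities.  The keyword lists below are invented for illustration and are nowhere near complete.  A real system would more likely need a trained classifier.

```python
# Hypothetical keyword lists; these are assumptions, not a vetted vocabulary.
COMPANY_TERMS = ("revenue", "turnover", "net worth", "employees", "crores")
CAPABILITY_TERMS = ("experience", "expertise", "certified", "delivered", "skills")


def classify(criterion):
    """Label a criterion as 'company', 'capability' or 'unknown'."""
    text = criterion.lower()
    if any(term in text for term in COMPANY_TERMS):
        return "company"
    if any(term in text for term in CAPABILITY_TERMS):
        return "capability"
    return "unknown"


def company_share(criteria):
    """Fraction of the recognised criteria that are company-specific."""
    labels = [classify(c) for c in criteria]
    known = [label for label in labels if label != "unknown"]
    if not known:
        return 0.0
    return known.count("company") / len(known)


criteria = [
    "Minimum revenue of 100 crores each year for the past 5 years",
    "At least 500 employees on the rolls",
    "Experience in building web applications",
]
print(company_share(criteria))
```

A tender whose share is close to 1 says a great deal about who may bid and very little about what they must be able to do.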

Systems to manage tenders

It might be possible to analyse tenders for fraud if they are stored in and managed using a tender management system with built-in fraud detection analytics.  Such a system would serve as a repository for tender documents, manage the submission of bids, and monitor the selection procedure and the life-cycle of projects.  This would allow governments to maintain not just a history of tender issuers, but also a history of vendors.  By so doing, governments would be able to determine which vendors are reliable and which are not.

Moreover, it would give people issuing and evaluating tenders more confidence in a low bidder (there is always a danger in projects that someone could bid too low and win the project, but then not be able to execute) and hence help reduce costs. So, a tender fraud detection tool could possibly help governments make better decisions regarding vendors of services and reduce corruption in the process of issuing tenders for the procurement of services and products for government.

Graph Algorithms for Fraud Detection

Text analytics algorithms are difficult and expensive to develop.  Fortunately there are other ways to detect tendering fraud.  Patterns of favoritism in tender outcomes can be detected from a bipartite graph of issuing organizations and beneficiaries.  If tenders from a particular issuing organization are found to repeatedly favour a specific vendor from a large field of vendors (more often than chance alone would explain), the organization and vendor could be flagged.  Price comparisons across tenders can also be made to determine if any prices have exceeded the price range for similar purchases (this will again require text analytics).
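Here is one way the favoritism check could be sketched.  Each award is an edge in the bipartite graph between an issuing organization and a vendor.  The sketch makes the strong (and certainly unrealistic) assumption that every tender attracts the same number of bidders and that each bidder is equally likely to win, and then asks how improbable a vendor's number of wins would be under that assumption.  The field size and the cut-off alpha are made-up parameters.

```python
from collections import Counter
from math import comb


def binomial_tail(wins, tenders, p):
    """Probability of at least `wins` successes in `tenders` trials of probability p."""
    return sum(
        comb(tenders, k) * p**k * (1 - p) ** (tenders - k)
        for k in range(wins, tenders + 1)
    )


def flag_favoritism(awards, field_size, alpha=0.001):
    """Return (issuer, vendor, probability) for improbably frequent winners.

    `awards` holds one (issuer, vendor) pair per tender awarded.
    """
    p = 1.0 / field_size  # assumed chance of any one bidder winning a fair tender
    tenders_per_issuer = Counter(issuer for issuer, _ in awards)
    wins_per_pair = Counter(awards)
    flagged = []
    for (issuer, vendor), wins in wins_per_pair.items():
        probability = binomial_tail(wins, tenders_per_issuer[issuer], p)
        if probability < alpha:
            flagged.append((issuer, vendor, probability))
    return flagged


# Made-up award history: Department A gives 8 of its 10 tenders to Vendor 1.
awards = (
    [("Department A", "Vendor 1")] * 8
    + [("Department A", "Vendor 2"), ("Department A", "Vendor 3")]
    + [("Department B", v) for v in ("Vendor 1", "Vendor 2", "Vendor 3", "Vendor 4")]
)
for issuer, vendor, probability in flag_favoritism(awards, field_size=10):
    print(issuer, vendor, probability)
```

A good vendor will legitimately win more often than a coin toss would suggest, so a flag here is only a reason to look closer.  With many issuer and vendor pairs being tested, the cut-off would also need to be made stricter to keep false alarms down.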

Some Theory

There has been a lot of work on corruption by economists in the last 10 years.  One interesting equation that models corruption is the Klitgaard Corruption Equation. The equation is C = R + D - A, where C stands for Corruption, R for Rent (quantum of possible illegal earnings from being in a position of responsibility where corruption is possible), D for Discretion and A for Accountability. These concepts are explained very well in the following article http://seekyt.com/define-corruption/ (from where I got the following image as well). But Klitgaard does not model one variable that can impact corruption, and that variable is choice. If you increase the choices available to a purchaser, the opportunities for avoiding corruption increase and the likelihood of corrupt transactions occurring decreases.

For example, if everyone in a certain location must only obtain a service from the government office serving their locality, then a person who does not want to pay a bribe does not have the option of travelling to a different office to obtain the service without paying a bribe.  So, if the officer at the local office is corrupt and demands a bribe for rendering a service, then the person has no choice but to cough up the bribe.  This happens in a lot of government offices where registrations have to be performed.  Increasing the choice of service provider lessens the likelihood of people being trapped into giving bribes.

The same forces are at work in the case of tenders.  The strategy of a corrupt tendering official is to artificially reduce the choices of the selectors to only firms that will pay a bribe. Computer systems that fight tendering corruption work by preventing the artificial restriction of choices.

Fraud detection using computers

For a long time, we’ve been interested in using mathematics (and computers) to detect and deter fraud.  It is related to our earlier work on identifying perpetrators of terrorist attacks.  (Yeah, I know it’s not as cool, but it’s some similar math!)

Today, I want to talk about some approaches to detecting fraud that we talked about on a beautiful summer day, in the engineering room at Aiaioo Labs.

That day, in the afternoon, somebody had rung the bell.  A colleague had answered the bell and then come and handed me a sheet of paper, saying that a lady at the door was asking for donations.

The paper bore the letterhead of an organization in a script that I couldn’t read.  However, the text in English stated that the bearer was a student collecting money to feed a few thousand refugees living in a refugee camp in Hyderabad (it said that the refugees’ homes had been destroyed in artillery shelling on the India-Pakistan border and that there were a few thousand families without shelter who needed food and medicines urgently).

On the sheet were the names and signatures of about 20 donors who had each donated around 1000 rupees.

Now the problem before us was to figure out if the lady was a genuine student volunteer or a fraudster out to make some quick money.

There was one thing about the document that looked decidedly suspicious.

It was that the amounts donated were all very similar – 1000, 1200, 1300, 1000, 1000, 1000, 1000.

All the numbers had unnaturally high values.

So, I called a friend of mine who came from the place she claimed the refugees (and the student volunteers) were from and asked him to talk to her and tell me if her story checked out.

He spoke to her over the phone for a few minutes and then told me that her story was not entirely true.

She was from the place that she claimed the refugees came from, but she was in fact collecting money for her own family (they had come south because one of them had needed a medical operation and were now collecting money to travel back to their home town).

When we asked her why she had lied, she just shrugged.

We felt it would be fine to help a family in need, so we gave her some money.

However, the whole affair gave us an interesting problem to solve.

How do you tell if a set of numbers is ‘natural’ or if it has been made up by a person intent on making them look natural?

Well, it turns out that statistics can give you the tools to do that.

Method 1

In nature, many processes result in random numbers that follow a certain distribution. And a great many of the numbers found in nature follow one of a handful of standard distributions.

For example, on the sheet of paper that the lady had presented, I would have expected the figures for the money donated to follow something like a normal distribution.  There should have been a few high values and a few low values and a lot of the values in the middle.

Since that wasn’t the case I could easily tell that the numbers had been made up.

But you don’t need a human to tell you that.  There are statistical tests that can be done to see if a set of numbers belongs to any expected distribution.

I looked around online and found an article that tells you about methods that can be used to check if a set of numbers belongs to a normal distribution (a distribution that occurs very frequently in nature): http://mathforum.org/library/drmath/view/72065.html

Some of the methods it talks about are the Kolmogorov-Smirnov test, the Chi-square test, the D’Agostino-Pearson test and the Jarque-Bera test.

Details of each can be found at these links (taken from the article):

One common test for normality with which I am personally NOT familiar, is the Kolmogorov-Smirnov test.  The math behind it is very involved, and I would suggest you refer to other resources such as this page:

  Wikipedia: Kolmogorov-Smirnov Test
    http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test 

You can read more about the D'Agostino-Pearson test and get a table that can be used in Excel here:

  Wikipedia: Normality Test
     http://en.wikipedia.org/wiki/User:Xargque#Normality_Test 

 Wikipedia: Jarque-Bera Test
     http://en.wikipedia.org/wiki/Jarque-Bera_test 

One item of note: depending on how your stats program calculates kurtosis, you may or may not need to subtract 3 from kurtosis.

 See: Wikipedia Talk: Jarque-Bera Test
      http://en.wikipedia.org/wiki/Talk:Jarque-Bera_test
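To give a feel for what such a test involves, here is a small sketch of the Jarque-Bera test written with nothing but Python's standard library.  It computes plain kurtosis (which is 3 for a normal distribution) and therefore subtracts 3, which is exactly the point made in the note above.  The p-value uses the chi-square distribution with 2 degrees of freedom, an approximation that is known to be poor for small samples such as the 7 donation amounts used here, so please treat the output as illustrative only.

```python
import math


def jarque_bera(values):
    """Return the Jarque-Bera statistic and its approximate p-value."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of the sample (dividing by n, not n - 1).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("all values are identical, so the test is undefined")
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # plain kurtosis: 3 for a normal distribution
    statistic = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # For a chi-square distribution with 2 degrees of freedom,
    # the probability of exceeding x is exactly exp(-x / 2).
    p_value = math.exp(-statistic / 2)
    return statistic, p_value


# The donation amounts from the sheet of paper.
donations = [1000, 1200, 1300, 1000, 1000, 1000, 1000]
statistic, p_value = jarque_bera(donations)
print(statistic, p_value)
```

A small p-value would suggest that the numbers are unlikely to have come from a normal distribution.  With real data, a statistics package would be a safer choice than hand-rolled code like this.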

On to the next method:

Method 2

Another property of many naturally occurring numbers is that about one third of them start with the digit 1!  Surprising, isn’t it?!

Well, it turns out that this applies to population numbers, electricity bills, stock prices and the lengths of rivers.

It tends to apply to numbers that span several orders of magnitude, such as those that come from power law distributions (power laws govern the distribution of wealth, connections on Facebook, the numbers of speakers of a language, and a lot of numbers related to society).

This is called Benford’s law:  http://en.wikipedia.org/wiki/Benford's_law

(I believe that Benford’s law would have applied to the above case as well – donations would have a power law distribution – if you assumed that all donors donated money proportional to their wealth).

When I read about Benford’s law on Wikipedia (while writing this article), I found that it is already being used for accounting fraud detection.

Wikipedia says:

Accounting fraud detection

In 1972, Hal Varian suggested that the law could be used to detect possible fraud in lists of socio-economic data submitted in support of public planning decisions. Based on the plausible assumption that people who make up figures tend to distribute their digits fairly uniformly, a simple comparison of first-digit frequency distribution from the data with the expected distribution according to Benford’s Law ought to show up any anomalous results. Following this idea, Mark Nigrini showed that Benford’s Law could be used in forensic accounting and auditing as an indicator of accounting and expenses fraud.[10] In practice, applications of Benford’s Law for fraud detection routinely use more than the first digit.[10]
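The first-digit comparison described in that passage can be sketched in a few lines of Python.  Benford's law says that the expected share of numbers starting with the digit d is log10(1 + 1/d), which works out to about 30% for the digit 1.  The sketch below compares observed first digits with those expected shares using a chi-square statistic.  It looks only at the first digit, assumes the amounts are whole numbers of at least 1, and uses the 5% critical value for 8 degrees of freedom.  The chi-square comparison is not reliable for a sample as tiny as the 7 donation amounts, which I am using only because they are the numbers at hand.

```python
import math
from collections import Counter


def benford_expected(digit):
    """Expected share of numbers whose first digit is `digit` (1 to 9)."""
    return math.log10(1 + 1 / digit)


def first_digit(amount):
    """First digit of a whole number that is at least 1."""
    return int(str(int(amount))[0])


def benford_chi_square(amounts):
    """Chi-square statistic comparing observed first digits with Benford's law."""
    observed = Counter(first_digit(a) for a in amounts)
    n = len(amounts)
    statistic = 0.0
    for digit in range(1, 10):
        expected = n * benford_expected(digit)
        statistic += (observed[digit] - expected) ** 2 / expected
    return statistic


# 5% critical value of the chi-square distribution with 8 degrees of freedom.
CRITICAL_VALUE_5_PERCENT = 15.507

# The donation amounts from the sheet of paper.
donations = [1000, 1200, 1300, 1000, 1000, 1000, 1000]
statistic = benford_chi_square(donations)
print(statistic, statistic > CRITICAL_VALUE_5_PERCENT)
```

On a ledger with hundreds or thousands of entries, a statistic well above the critical value would be a reason to take a closer look at the books.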

Method 3

There are also methods that can be used by governments and large organizations to prevent fraud in the issuing of tenders.

More about that in my next article.