In a Naive Bayes classifier, why bother with smoothing when we have unknown words in the test set?

I came across this question on Stack Overflow a few days ago, and found it very interesting, so I thought I’d post the answer to the question here on the blog as well.

Why do we bother with smoothing at all in a Naive Bayes classifier (when we can throw away the unknown features instead)?

The answer to the question is: words are often unknown in some but not all classes.

Say there are two classes M and N with features A, B and C, as follows:

M: A=3, B=1, C=0

(In the class M, A appears 3 times and B only once)

N: A=0, B=1, C=3

(In the class N, C appears 3 times and B only once)

Let’s see what happens when we throw away features that appear zero times.

A) Throw Away Features That Appear Zero Times In Any Class

If we throw away the features A and C because each of them appears zero times in at least one of the classes, then we are left with only feature B to classify documents with.

And losing that information is a bad thing as we will see below!

Say we’re presented with a test document as follows:

B=1, C=3

(It contains B once and C three times)

Since we’ve discarded the features A and C, the only feature left is B, which appears exactly once in each class, so we won’t be able to tell whether the above document belongs to class M or class N.

So, losing any feature information is a bad thing!

B) Throw Away Features That Appear Zero Times In All Classes

Is it possible to get around this problem by discarding only those features that appear zero times in all of the classes?

No, because that would create its own problems!

The following test document illustrates what would happen if we did that:

A=3, B=1, C=1

The probabilities of M and N would both become zero, because the test document contains feature C (which has a zero count, and hence zero probability, in class M) and feature A (which has a zero count in class N), and we did not throw either feature away.

C) Don’t Throw Anything Away – Use Smoothing Instead

Smoothing allows us to classify both the above documents correctly (see the short sketch after this list) because:

  1. We do not lose count information in classes where such information is available and
  2. We do not have to contend with zero counts.
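To make this concrete, here is a small Python sketch of the toy example above. It uses add-one (Laplace) smoothing and assumes uniform class priors; both choices are assumptions made for the illustration, not something fixed by the argument above.

from math import prod

# Feature counts per class, from the example above.
counts = {
    'M': {'A': 3, 'B': 1, 'C': 0},
    'N': {'A': 0, 'B': 1, 'C': 3},
}
vocab = ['A', 'B', 'C']

def smoothed_prob(cls, feature, k=1):
    # Add-k (Laplace) smoothed estimate of P(feature | cls).
    total = sum(counts[cls].values()) + k * len(vocab)
    return (counts[cls][feature] + k) / total

def score(cls, document):
    # Unnormalized P(cls | document), assuming uniform class priors.
    return prod(smoothed_prob(cls, f) ** n for f, n in document.items())

for doc in [{'B': 1, 'C': 3}, {'A': 3, 'B': 1, 'C': 1}]:
    winner = max(['M', 'N'], key=lambda cls: score(cls, doc))
    print(doc, '->', winner)

With smoothing, the first test document is assigned to class N and the second to class M, as we would want; without smoothing, both class scores for the second document would be zero.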

Naive Bayes Classifiers In Practice

The Naive Bayes classifier in NLTK throws away features that have zero counts in all of the classes.

This makes it perform poorly when trained using a hard EM procedure (where the classifier is bootstrapped up from very little training data) on data sets with a high occurrence of unknown words (like Twitter feeds).

Here is the relevant code from NLTK 3.0:

def prob_classify(self, featureset):
  # Discard any feature names that we've never seen before.
  # Otherwise, we'll just assign a probability of 0 to
  # everything.
  featureset = featureset.copy()
  for fname in list(featureset.keys()):
    for label in self._labels:
      if (label, fname) in self._feature_probdist:
        break
    else:
      #print 'Ignoring unseen feature %s' % fname
      del featureset[fname]

Related Blog Posts:

  1. Analysing documents for non-obvious differences

  2. Naive Bayes Classifier in OpenNLP


Cooking Robot for Indian Cuisine using a Sensor Network Architecture

We had designed a cooking robot not very long ago.  The design was prepared as a creative exercise and there are some kinks in it that need to be ironed out.  We will list the kinks at the end.

The robot essentially uses a sensor network to perform transfer and heat operations.  It can also dip and activate various tools – mixing tools, stirring tools, grinding tools, etc.

Another cool feature of the design is that all tools and operations are driven by a single immovable motor drive.

The cooking robot, if it is realized, should be able to prepare many types of Indian side dishes and cook rice.

One of the things that this robot can’t do is prepare rotis (but there is now an automatic roti maker available in the market that can do that).

In the blog post below, we describe the design of the robot and hereby put it in the public domain.

At the core of the design is the vessel.  This is the only part of the design (apart from various tools) which comes into contact with the food.

[Figure: cooking_robot_1]

The vessel has a handle that allows it to be easily ‘picked up’ by the robot, moved around, turned and released.

Various tools, for example, mixing tools, stirring tools, and beating tools can be fitted with similar handles (but with a drive shaft running within to provide a mechanical driving force to the tool tip).

The robot connects with the vessel through a vessel carrier.  The vessel carrier is just a square piece.  The diagram below shows the vessel carrier in three positions – open, locked and rotated through 45 degrees.

[Figure: cooking_robot_2]

The purple shading denotes the handle of the vessel.  There are two concentric rings with slits (shown in grey) in the rectangular carrier.

When the carrier is open, the slits in the slit rings line up and the vessel handle can slide into the carrier.

When the carrier locks, the inner slit ring rotates around the handle shaft, so it cannot slide out of place.

The slits are now at 90 degrees to each other.  Whenever they are at 90 degrees with respect to each other, the handle cannot slide out.

If both rings rotate, the vessel rotates with them, as shown in the last of the three figures in the above diagram.

Any number of such vessel carriers can be allowed to run along a set of vertical and horizontal tracks as shown in the following figure.

[Figure: cooking_robot_3]

The tracks are called ‘vessel guides’ because the vessel carriers can only run along these guides.

The tracks and the surface they are cut into form the front face of the cooking robot.

So, in this design, the robot operates entirely in 2 dimensions, in a vertical plane parallel to a kitchen wall.  This plane of operation was deliberately chosen to conserve kitchen room, and to allow for pouring operations.

To pour a fluid or mixture from one vessel to another, all that one has to do is position one vessel in a vertical track (N-S) above another vessel running in a horizontal track (E-W) and turn the higher vessel (the pouring vessel).  The pouring vessel rises to keep its lower edge at a steady height.  The receiving vessel adjusts its position horizontally to keep the pouring lip (the lower edge of the pouring vessel’s rim) in line with its own center.
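To make that coordination rule concrete, here is a rough sketch of the geometry, assuming an idealized two-dimensional rectangular vessel of width w and height h that pivots about its centre (the real vessel and handle geometry would change the exact numbers, so treat this only as an illustration):

from math import sin, cos, radians

def pouring_offsets(w, h, tilt_degrees):
    # For an idealized rectangular vessel of width w and height h that
    # pivots about its centre, return:
    #   rise  - how far the centre must move up so that the lowest corner
    #           stays at the height it had before tilting
    #   lip_x - horizontal offset of the pouring lip (the dipped rim corner)
    #           from the vessel centre, i.e. where the receiving vessel
    #           should position its own centre
    t = radians(tilt_degrees)
    rise = (w / 2) * sin(t) + (h / 2) * cos(t) - h / 2
    lip_x = (w / 2) * cos(t) + (h / 2) * sin(t)
    return rise, lip_x

# Example: a 20 cm wide, 15 cm tall vessel tilted by 60 degrees.
rise, lip_x = pouring_offsets(0.20, 0.15, 60)
print(f"raise the pouring vessel by {rise * 100:.1f} cm and centre the "
      f"receiving vessel {lip_x * 100:.1f} cm to the pouring side")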

At the base of the front face are burners.  They can turn on to provide heating energy and a vessel can position itself at any height above these burners to heat or warm food.

For mixing operations, a vessel holder equipped with a mixing tool would position itself over the vessel whose contents need to be mixed, and the rotating shaft in the handle would be engaged to provide power to the mixing tool’s tip.

The basic ingredients (already washed, cut and ground) were meant to be stored in containers just above the front face, so that controlled quantities could be dropped under gravity into the vessels positioned below them.

Power Train

We also designed a power train for the system so that the entire robot would operate using just one base-mounted motor.

At the heart of the power train is a set of vertical and horizontal lifts that run along rails and are propelled by screws (drive bars).

The figure below shows the arrangement of vertical rails (black lines) and drive bars (red lines).

[Figure: cooking_robot_4]

Below is an illustration of the horizontal lifts (rails and drive shafts).

[Figure: cooking_robot_5]

These rails and screws drive lift boxes as shown below.

Each lift box runs along only one set of rails, and the lift boxes running along vertical rails do not interfere with those running along horizontal rails, as shown below.

[Figure: cooking_robot_6]

In this diagram, the lift boxes (drawn as dotted rectangles) are shown latching on to the vessel carriers using holding bolts.

The holding bolts can be withdrawn and relocked at will, allowing the vessel carriers to be transferred from one lift to another at points of intersection, and allowing the horizontal lifts’ holding bolts to avoid colliding with the vertical drive shafts.

Below is a rough schematic of the lift boxes.

[Figure: cooking_robot_7]

The holding bolts are shown in blue.

The grippers (shown in solid black) hold on to the guide rails to stop the lift or onto the drive shaft(s) to propel the lift up or down.

We originally meant the system to have two counter-rotating drive shafts to allow the lift box to change direction very rapidly or stop against the drive rail.

The yellow square represents the gear wheels that power the tools (for mixing, stirring, etc.).

The blue squares are the holding bolts; they latch onto the vessel carriers through matching slots, as shown below.

[Figure: cooking_robot_8]

We had designed a gearing mechanism to lock and unlock vessel handles using cams that transferred power from the drive shafts and gears to power the tools, but those detailed diagrams have been lost.

Feasibility

An industrial robotics startup in Bangalore that evaluated this design for feasibility pointed out some problems with it:

  1. The vertical plane of operation of the robot. They pointed out that lifting heavy vessels would require a lot of power and would run into other problems such as friction, and that it is always much easier to operate in the horizontal plane.  They asked us to study food-processor designs and come up with ways to transfer Indian food ingredients between vessels in the horizontal plane.
  2. Friction in many of the components could prove difficult to overcome, for instance in the vessel carrier. The weight of the vessel and its ingredients might end up locking the slit-ring system and preventing it from turning, or might end up jamming the holding bolts connecting the vessel carrier and the lift.

So, it appears that a number of mechanical problems need to be solved before such a product could become a reality.

Temperature Controlled Clothing for South Indian Summers

One of the first things we invented at Aiaioo Labs was a cheap vest that could keep people cool in the hot Bangalore summers.

We had applied for a patent for the vest, but later abandoned the application (it had appeared in the Indian patent office gazette as 2058/CHE/2008, with a filing date of 25.08.2008).

Since we do not intend to pursue the technology to production ourselves, I thought I’d describe it in our blog in the hope that someone interested in the idea might find a way to solve the remaining problems on the road to productising it.

[Figure: clothing_temperature_controlled_1]

This figure illustrates the components of the vest:

  1. Tubing filled with water
  2. A connector that connects the tubing to a reservoir (a bottle) of cold water
  3. A reservoir full of ice-cold water
  4. A hand-pump for circulating the water

[Figure: clothing_temperature_controlled_5]

The tubing is laid out in the pattern shown in the figure above so that cold water, entering at the top, can flow through the tube freely all the way to the bottom, mixing with the fluid already in the tubing.

This prevents the formation of cold spots and permits the water to maintain an evenly cool temperature throughout the vest.

[Figure: clothing_temperature_controlled_4]

The tubes that run up from the reservoir to the top of the vest carry very cold water and the wearer needs to be insulated from them as shown in the above figure.

[Figure: clothing_temperature_controlled_2]

The coolant is circulated using a hand pump – a kind of bellows – as shown above.  There are two reservoirs in the figure, but one reservoir will suffice.

[Figure: clothing_temperature_controlled_7]

The second reservoir comes in handy if the mixing of warm and cool fluids in the tubing does not warm the coolant to a comfortable temperature.  The second reservoir can then be kept at a higher temperature than the first.

[Figure: clothing_temperature_controlled_3]

The above figure is of a simple pump that I built from commonly available parts – a rubber bulb, two valves and two nozzles.

Field Trials

We built a working prototype of the vest, and on a fine day in 2008, I wore it under my clothes while travelling by bus from Banashankari to Vidyaranyapura.

I thought it felt cooler with the vest on, but not significantly.

Work Remaining

To productise this, among other things, one needs to improve the tubing.

The only tubing whose walls could maintain their shape tended to have thick, insulating rubber or plastic walls, which impeded the cooling effect of the vest.

Thinner tubes tended to collapse and fold sharply, stopping the circulation of the fluid.

Another issue was the pumping pressure.  The hand pump (the bladder) was sufficient, but it needed frequent short, tight squeezes to push the fluid through the tube, which became exhausting and monotonous.

If a non-insulating, non-collapsing tubing system could be constructed, the hand pump replaced with a light battery-powered peristaltic pump, and the cost of the latter kept to within a few hundred rupees, then the vest could become a commercially feasible proposition.

A device for carrying road traffic on the Bangalore metro

A few of us were wondering the other day if it might be possible to carry road vehicles on metro rails.

A car is a machine with wheels that operates on certain flat surfaces, namely roads.

[Figure: cars]

A metro train is also a machine on wheels, but it operates on railway tracks.

[Figure: metro]

Roads tend to be very crowded, but not so the rails.  Long stretches of railway tracks go unoccupied most of the time.

So, initially, we felt that if we could put cars on rails using a suitable adaptor, we could move some 4-wheeler traffic off the roads.

Such an adaptor, it turns out, is easy to build.

[Figure: car_holder]

A cage with a roof fitted with eyelets that allow the cage to be lifted, and with wheel chocks that immobilize the car, will do the trick.

Such a cage makes the car easily transportable and allows it to be lifted onto railway carriages or inaccessible parking spaces (say on the roofs of buildings).

Advantages

We calculated that a typical metro coach could hold up to 8 cars (a coach is 22 metres in length while cars are usually less than 5 metres in length, and you can stack two car cages one on top of the other).

So, 20 coaches could hold 160 cars.

Assuming a metro train frequency of one every 5 minutes and trains of 20 car-carrying coaches, you would have an hourly capacity of 1920 cars.

Typical peak road capacities are around 1700 cars per hour per lane.

Since many central Bangalore roads are single lane roads, this could effectively double the traffic capacity at the city center.
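For reference, here is the back-of-the-envelope arithmetic behind those figures as a short sketch (the coach length, car length, stacking, train frequency, and lane capacity are simply the values quoted above):

# Back-of-the-envelope capacity calculation using the figures quoted above.
coach_length_m = 22
car_length_m = 5
stacking_levels = 2            # car cages stacked two high
coaches_per_train = 20
trains_per_hour = 60 // 5      # one train every 5 minutes
road_lane_capacity = 1700      # typical peak cars per hour per lane

cars_per_coach = (coach_length_m // car_length_m) * stacking_levels  # 8
cars_per_train = cars_per_coach * coaches_per_train                  # 160
cars_per_hour = cars_per_train * trains_per_hour                     # 1920

print(cars_per_coach, cars_per_train, cars_per_hour)
print(f"{cars_per_hour / road_lane_capacity:.2f} road-lane equivalents")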

Disadvantages

However, the plan would be hard to justify for the following reasons.

Each metro coach can carry a maximum of 300 passengers (if they are packed 6 to a square metre).  Since the same coach holds only 8 cars, each car takes up the space of nearly 40 passengers, so the ticket for a car would have to be priced like 40 passenger tickets during peak hours.  I doubt anyone would pay so much to travel by car on a metro train.

The second problem is that metro coaches are prohibitively expensive.  Metro coaches cost upwards of Rs. 8 crores each.

Finally, a metro transport system is designed to serve a high density of passengers.  Those 20 coaches would have been of far greater service to commuters if employed to carry 6000 people instead of 160 cars.

So, it would make sense to add car-carrying trailers only after all commuter demand has been met, and only if the cost of car-carrying trailers turns out to be far less than that of passenger coaches.

Using such cages for parking also seemed like an interesting idea, but I feel that the cost of such a system might again be prohibitive (it’s cheaper in India to use a valet).

Anyway, it was an interesting thought experiment.

Naive Bayes Classifier in OpenNLP

The OpenNLP project of the Apache Foundation is a machine learning toolkit for text analytics.

For many years, OpenNLP did not carry a Naive Bayes classifier implementation.

A Naive Bayes classifier has finally been included in the trunk (though it is not yet available in a stable release).

Naive Bayes classifiers are very useful when there is little to no labelled data available.

Labelled data is usually needed in large quantities to train classifiers.

However, the Naive Bayes classifier can sometimes make do with a very small amount of labelled data and bootstrap itself over unlabelled data.  Unlabelled data is usually easier to get your hands on or cheaper to collect than labelled data – by far.  The process of bootstrapping Naive Bayes classifiers over unlabelled data is explained in the paper “Text Classification from Labelled and Unlabelled Documents using EM” by Kamal Nigam et al.
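The bootstrapping loop itself is simple.  Here is a minimal, toolkit-agnostic sketch of the hard-EM variant in Python; the train and classify functions stand in for whatever classifier library you use (they are placeholders, not OpenNLP or NLTK APIs):

def hard_em_bootstrap(train, classify, labelled, unlabelled, iterations=10):
    # train(examples)        -> model, where examples is a list of (features, label)
    # classify(model, feats) -> label
    # labelled               -> small list of (features, label) seed examples
    # unlabelled             -> list of feature sets with no labels
    model = train(labelled)                      # start from the seed data
    for _ in range(iterations):
        # E-step (hard): label the unlabelled data with the current model.
        guessed = [(feats, classify(model, feats)) for feats in unlabelled]
        # M-step: retrain on the seed data plus the guessed labels.
        model = train(labelled + guessed)
    return model

(The soft-EM version described by Nigam et al. uses the predicted class probabilities as fractional counts instead of hard labels.)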

So, whenever I get clients who are using OpenNLP, but have only very scanty labelled data available to train a classifier with, I end up having to teach them to build a Naive Bayes classifier and bootstrap it by using an EM procedure over unlabelled data.

Now that won’t be necessary any longer, because OpenNLP provides a Naive Bayes classifier that can be used for that purpose.

Tutorial

Training a Naive Bayes classifier is a lot like training a maximum entropy classifier.  In fact, you still have to use the DocumentCategorizerME class to do it.

But you pass in a special parameter to tell the DocumentCategorizerME class that you want a Naive Bayes classifier instead.

Here is some code (from the OpenNLP manual) for training a classifier – in this case, the default Maximum Entropy classifier.

DoccatModel model = null;
InputStream dataIn = null;
try {
  dataIn = new FileInputStream("en-sentiment.train");
  ObjectStream<String> lineStream =
		new PlainTextByLineStream(dataIn, "UTF-8");
  ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

  // Training a maxent model by default!!!
  model = DocumentCategorizerME.train("en", sampleStream);
}
catch (IOException e) {
  // Failed to read or parse training data, training failed
  e.printStackTrace();
}

Now, if you want to invoke the new Naive Bayes classifier instead, you just have to pass in a few training parameters, as follows.

DoccatModel model = null;
InputStream dataIn = null;
try {
  dataIn = new FileInputStream("en-sentiment.train");
  ObjectStream<String> lineStream =
		new PlainTextByLineStream(dataIn, "UTF-8");
  ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

  TrainingParameters params = new TrainingParameters();
  params.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(0));
  params.put(TrainingParameters.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);

  // Now the parameter TrainingParameters.ALGORITHM_PARAM ensures
  // that we train a Naive Bayes model instead
  model = DocumentCategorizerME.train("en", sampleStream, params);
}
catch (IOException e) {
  // Failed to read or parse training data, training failed
  e.printStackTrace();
}

Evaluation

I ran some tests on the Naive Bayes document categorizer in OpenNLP built from the trunk (you can also get the latest build using Maven).

Here are the numbers.

1. Subjectivity Classification

I ran the experiment on the 5000 movie reviews dataset (used in the paper “A Sentimental Education” by Bo Pang and Lillian Lee) with a 50:50 split into training and test:

Accuracies
Perceptron: 57.54% (100 iterations)
Perceptron: 59.96% (1000 iterations)
Maxent: 91.48% (100 iterations)
Maxent: 90.68% (1000 iterations)
Naive Bayes: 90.72%

2. Sentiment Polarity Classification

Cornell movie review dataset v1.1 (700 positive and 700 negative reviews).

With 350 of each as training and the rest as test, I get:

Accuracies
Perceptron: 49.70% (100 iterations)
Perceptron: 49.85% (1000 iterations)
Maxent: 77.11% (100 iterations)
Maxent: 77.55% (1000 iterations)
Naive Bayes: 75.65%

The data used in this experiment was taken from http://www.cs.cornell.edu/people/pabo/movie-review-data/

The OpenNLP Jira details for this feature are available at: https://issues.apache.org/jira/browse/OPENNLP-777

From Naive to Perplexed

We recently came up with three proofs that, taken together, suggest that the naive independence assumptions made in the Naive Bayes classifier are quite unnecessary for achieving the accuracy it is capable of.

A new classifier based on this math (called the Perplexed Bayes Classifier) is described in our recent research paper.

This blog post is to announce that we’ve also provided a mathematics companion to the paper.

The mathematics companion may be downloaded from:  http://www.aiaioo.com/publications/icon2015_mathematics_companion.pdf

What is the Perplexed Bayes Classifier?

The Perplexed Bayes classifier resembles the Naive Bayes classifier, but it does not suffer from one of the Naive Bayes classifier’s shortcomings: a tendency to be unjustifiably overconfident about its predictions.

The following diagram shows how the reliability curves of the Perplexed Bayes classifier look alongside the Naive Bayes classifier’s.

[Figure: research_paper_graph]


The Perplexed Bayes Classifier

[Figure: research_paper]

We are proud to announce our latest research paper.

This paper describes an improved version of the grand old Naive Bayesian classification algorithm with rather interesting mathematical implications.

We have shown in this paper that it is possible to create a classifier that can classify data in exactly the same way as a Naive Bayes classifier, but without using the same “naive” independence assumptions from which the old classifier got its name.

Now smoke that for a second.  What we’re saying is that it was completely unnecessary to make those independence assumptions to get that accuracy after all.

So you take them out.

And what results is vastly better posterior probabilities. In other words, the classifier’s confidence in a prediction becomes vastly more believable.

But since it makes the same decisions as a Naive Bayes algorithm, its accuracy is provably the same!

We call the new classifier the Perplexed Bayes classifier because it uses the reciprocal of perplexity (the geometric mean) as the combination operator for individual feature probabilities.

It turns out, moreover, that the mathematical implication of using this operator is that the naive independence assumptions disappear.
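Based on that description (the precise formulation is in the paper, so treat this as a sketch that assumes equal class priors), the change amounts to replacing the product of the feature likelihoods with their geometric mean.  For a fixed set of features, the n-th root is a monotone transformation, so the winning class is unchanged, but the normalized confidence is far less extreme:

from math import prod

def posteriors(likelihoods, combine):
    # Normalize combined per-class scores into posteriors.
    # likelihoods: {class: [P(f_i | class) for each feature]}
    # Assumes equal class priors (a simplification for this sketch).
    scores = {c: combine(ps) for c, ps in likelihoods.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

def naive(ps):
    return prod(ps)                    # product of feature likelihoods

def perplexed(ps):
    return prod(ps) ** (1 / len(ps))   # geometric mean = 1 / perplexity

# Ten moderately informative features, each slightly favouring class 'pos'.
likelihoods = {'pos': [0.6] * 10, 'neg': [0.4] * 10}

print(posteriors(likelihoods, naive))      # ~{'pos': 0.98, 'neg': 0.02} - overconfident
print(posteriors(likelihoods, perplexed))  # ~{'pos': 0.60, 'neg': 0.40} - tempered

Both combiners pick the same class here, which is consistent with the paper’s claim that the decisions (and hence the accuracy) match those of Naive Bayes.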

The Perplexed Bayes classifier and the math thereof are described in this research paper: http://www.aiaioo.com/publications/icon2015.pdf

This new classifier is as easy to build as a Naive Bayes, has the same accuracy, learns fast, and returns confidence scores that are way better.

And the math is amenable to retrofitting into Hidden Markov Models and Probabilistic Graphical Models to dissolve invalid independence assumptions.

Yes, as I said, the mathematical implications are very interesting.

The paper will be presented at ICON in Trivandrum in December 2015: http://ltrc.iiit.ac.in/icon2015/.