Naive Bayes Classifier in OpenNLP

The OpenNLP project of the Apache Foundation is a machine learning toolkit for text analytics.

For many years, OpenNLP did not carry a Naive Bayes classifier implementation.

OpenNLP has finally included a Naive Bayes classifier implementation in the trunk (it is not yet available in a stable release).

Naive Bayes classifiers are very useful when there is little to no labelled data available.

Labelled data is usually needed in large quantities to train classifiers.

However, the Naive Bayes classifier can sometimes make do with a very small amount of labelled data and bootstrap itself over unlabelled data.  Unlabelled data is usually easier to get your hands on or cheaper to collect than labelled data – by far.  The process of bootstrapping Naive Bayes classifiers over unlabelled data is explained in the paper “Text Classification from Labelled and Unlabelled Documents using EM” by Kamal Nigam et al.
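The bootstrapping idea from the Nigam et al. paper can be sketched in plain Java without any OpenNLP dependency. The toy example below is purely illustrative (the class name, data, and structure are all mine, not OpenNLP's): it trains a two-class multinomial Naive Bayes model on just two labelled documents, then runs a few EM rounds over unlabelled documents, rebuilding the counts from the soft class posteriors each round. Words like "fun" and "boring", which never appear in the labelled data, pick up the right polarity purely from co-occurrence.

```java
import java.util.*;

class ToyEmNaiveBayes {

    // Accumulate (possibly fractional) counts for one document,
    // weighted by the class responsibilities w = {wPos, wNeg}.
    static void add(Map<String, double[]> wc, double[] cc, String[] doc, double[] w) {
        cc[0] += w[0];
        cc[1] += w[1];
        for (String t : doc) {
            double[] v = wc.computeIfAbsent(t, k -> new double[2]);
            v[0] += w[0];
            v[1] += w[1];
        }
    }

    // P(class | doc) for a 2-class multinomial Naive Bayes with Laplace smoothing.
    static double[] posterior(String[] doc, Map<String, double[]> wc, double[] cc, int vocab) {
        double[] tok = new double[2];
        for (double[] v : wc.values()) { tok[0] += v[0]; tok[1] += v[1]; }
        double[] lp = new double[2];
        for (int c = 0; c < 2; c++) {
            lp[c] = Math.log(cc[c] / (cc[0] + cc[1]));
            for (String w : doc)
                lp[c] += Math.log((wc.getOrDefault(w, new double[2])[c] + 1) / (tok[c] + vocab));
        }
        double m = Math.max(lp[0], lp[1]);
        double a = Math.exp(lp[0] - m), b = Math.exp(lp[1] - m);
        return new double[] { a / (a + b), b / (a + b) };
    }

    // Two labelled docs bootstrap over four unlabelled ones; "fun" and "boring"
    // never occur in the labelled data, yet EM learns their polarity.
    static double[] demo() {
        String[][] lab = { { "good", "great" }, { "bad", "awful" } };
        double[][] labW = { { 1, 0 }, { 0, 1 } };          // hard labels: pos, neg
        String[][] unlab = { { "great", "fun" }, { "good", "fun" },
                             { "awful", "boring" }, { "bad", "boring" } };
        Set<String> vocab = new HashSet<>();
        for (String[] d : lab) vocab.addAll(Arrays.asList(d));
        for (String[] d : unlab) vocab.addAll(Arrays.asList(d));

        Map<String, double[]> wc = new HashMap<>();
        double[] cc = new double[2];
        double[][] soft = new double[unlab.length][];
        for (int iter = 0; iter < 10; iter++) {
            wc = new HashMap<>();                          // M-step: rebuild counts
            cc = new double[2];
            for (int i = 0; i < lab.length; i++) add(wc, cc, lab[i], labW[i]);
            if (soft[0] != null)
                for (int i = 0; i < unlab.length; i++) add(wc, cc, unlab[i], soft[i]);
            for (int i = 0; i < unlab.length; i++)         // E-step: soft labels
                soft[i] = posterior(unlab[i], wc, cc, vocab.size());
        }
        return posterior(new String[] { "fun" }, wc, cc, vocab.size());
    }

    public static void main(String[] args) {
        double[] p = demo();
        if (p[0] <= p[1]) throw new AssertionError("EM failed to learn that 'fun' is positive");
        System.out.println("P(pos | \"fun\") = " + p[0]);
    }
}
```

On the first iteration only the labelled counts are used, so the model cannot distinguish the unseen words at all; each EM round after that sharpens the soft labels on the unlabelled documents until the posterior for "fun" settles well above one half.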

So, whenever I get clients who are using OpenNLP, but have only very scanty labelled data available to train a classifier with, I end up having to teach them to build a Naive Bayes classifier and bootstrap it by using an EM procedure over unlabelled data.

Now that won’t be necessary any longer, because OpenNLP provides a Naive Bayes classifier that can be used for that purpose.

Tutorial

Training a Naive Bayes classifier is a lot like training a maximum entropy classifier.  In fact, you still have to use the DocumentCategorizerME class to do it.

But you pass in a special parameter to tell the DocumentCategorizerME class that you want a Naive Bayes classifier instead.

Here is some code from the OpenNLP manual for training a classifier, in this case the default maximum entropy classifier.

// Imports needed for this snippet
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

DoccatModel model = null;
InputStream dataIn = null;
try {
  dataIn = new FileInputStream("en-sentiment.train");
  ObjectStream<String> lineStream =
      new PlainTextByLineStream(dataIn, "UTF-8");
  ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

  // Trains a maxent model by default
  model = DocumentCategorizerME.train("en", sampleStream);
}
catch (IOException e) {
  // Failed to read or parse training data, training failed
  e.printStackTrace();
}
finally {
  if (dataIn != null) {
    try { dataIn.close(); } catch (IOException e) {}
  }
}

Now, if you want to invoke the new Naive Bayes classifier instead, you just have to pass in a few training parameters, as follows.

DoccatModel model = null;
InputStream dataIn = null;
try {
  dataIn = new FileInputStream("en-sentiment.train");
  ObjectStream<String> lineStream =
      new PlainTextByLineStream(dataIn, "UTF-8");
  ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

  TrainingParameters params = new TrainingParameters();
  params.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(0));
  // NaiveBayesTrainer lives in the package opennlp.tools.ml.naivebayes
  params.put(TrainingParameters.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);

  // Now the parameter TrainingParameters.ALGORITHM_PARAM ensures
  // that we train a Naive Bayes model instead
  model = DocumentCategorizerME.train("en", sampleStream, params);
}
catch (IOException e) {
  // Failed to read or parse training data, training failed
  e.printStackTrace();
}
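Once trained, the model is used the same way regardless of which algorithm produced it. A minimal sketch of classifying a new document (assuming the trunk-era API, in which categorize() takes the document text as a single string; the example text is made up):

```java
DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
double[] outcomes = categorizer.categorize("What a wonderful movie");
String best = categorizer.getBestCategory(outcomes);
System.out.println(best + " " + outcomes[categorizer.getIndex(best)]);
```

The outcomes array holds one probability per category, so you can threshold on it rather than always taking the single best category.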

Evaluation

I ran some tests on the Naive Bayes document categorizer in OpenNLP built from the trunk (you can also get the latest build using Maven).

Here are the numbers.

1. Subjectivity Classification

I ran the experiment on the 5000 movie reviews dataset (used in the paper “A Sentimental Education” by Bo Pang and Lillian Lee) with a 50:50 split into training and test:

Accuracies
Perceptron: 57.54% (100 iterations)
Perceptron: 59.96% (1000 iterations)
Maxent: 91.48% (100 iterations)
Maxent: 90.68% (1000 iterations)
Naive Bayes: 90.72%

2. Sentiment Polarity Classification

Cornell movie review dataset v1.1 (700 positive and 700 negative reviews).

With 350 of each as training and the rest as test, I get:

Accuracies
Perceptron: 49.70% (100 iterations)
Perceptron: 49.85% (1000 iterations)
Maxent: 77.11% (100 iterations)
Maxent: 77.55% (1000 iterations)
Naive Bayes: 75.65%

The data used in this experiment was taken from http://www.cs.cornell.edu/people/pabo/movie-review-data/

The OpenNLP Jira details for this feature are available at: https://issues.apache.org/jira/browse/OPENNLP-777

17 thoughts on “Naive Bayes Classifier in OpenNLP”

  1. Shirish, the NaiveBayesTrainer is under opennlp-tools/src/main/java. It’s in the package “opennlp.tools.ml.naivebayes”. If you download the trunk version, you should be able to just do an “import opennlp.tools.ml.naivebayes.NaiveBayesTrainer”.

  2. Hi,

    Nice article.

    Any idea how to parse and feed the provided dataset into the OpenNLP DocumentCategorizerME? The dataset contains a couple of files in folders pos/neg, which basically tell us the score, but how do we feed this in? Any code snippets/ideas would be very helpful.

      1. Thanks, I managed to get it running.

        However, I have two questions.

        1) How to check the accuracy of our model?

        In Apache Spark we split our training dataset into test/train by some ratio, say 70:30.

        2) Is there any tuning done for accuracy?

        I could see random results coming out of the model that I trained. Apart from the mentioned points, is there any optimization done to improve accuracy?

    1. About your two questions.

      1) How to check the accuracy of our model?

      You do the same in OpenNLP as you described for Apache Spark.

      2) Is there any tuning done for accuracy?

      No. In theory there is a smoothing parameter you could set, using tuning data, but we have not enabled the setting of that parameter in the Apache OpenNLP implementation (we’ve picked a value that generally works fine for sentiment analysis). So you have no tuning to do.

  3. Is there some trick to serializing the NB output model from training for use later? The result of DoccatModel.serialize and then creating a new DoccatModel from that file results in basically random classifications. Default ME model serialization/deserialization works fine for my tests. Looking at the unit tests they have added, it only checks to ensure the model serialized and read back in is null and not that it actually functions correctly.

    1. Mike, I saw an email on the OpenNLP dev group indicating that someone on the OpenNLP team (William Colen) has fixed it. The message said: “OPENNLP-1010: Fix NaiveBayes model writer

      The previous sortValues method was based on Perceptron, but for some reason it was not working
      for NaiveBayes. Changed it to the one from GIS fixed it.”

      It’ll take them a few days to pull it and then the fix should be available in the trunk.
