The OpenNLP project of the Apache Foundation is a machine learning toolkit for text analytics.
For many years, OpenNLP did not carry a Naive Bayes classifier implementation. That has finally changed: a Naive Bayes classifier is now included in the trunk (it is not yet available in a stable release).
Naive Bayes classifiers are very useful when there is little to no labelled data available.
Labelled data is usually needed in large quantities to train classifiers.
However, the Naive Bayes classifier can sometimes make do with a very small amount of labelled data and bootstrap itself over unlabelled data. Unlabelled data is usually far easier to get your hands on, and far cheaper to collect, than labelled data. The process of bootstrapping Naive Bayes classifiers over unlabelled data is explained in the paper “Text Classification from Labeled and Unlabeled Documents using EM” by Kamal Nigam et al.
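The bootstrapping procedure is easy to sketch. The toy program below is entirely illustrative (the class name, the data, and the "pos"/"neg" labels are made up here, and none of it is OpenNLP code): it trains a Laplace-smoothed multinomial Naive Bayes model on one labelled document per class, then runs EM rounds in which the E-step soft-labels the unlabelled documents and the M-step re-estimates the model from the labelled plus soft-labelled counts.

```java
import java.util.*;

public class EmNaiveBayesSketch {

    static final String[] CLASSES = {"pos", "neg"};
    static final List<String> VOCAB =
        List.of("good", "great", "bad", "awful", "fun", "boring");

    // classCount[c] = (soft) number of documents in class c;
    // wordCount[c][w] = (soft) number of occurrences of word w in class c.
    static Map<String, Double> classCount = new HashMap<>();
    static Map<String, Map<String, Double>> wordCount = new HashMap<>();

    public static void main(String[] args) {
        // Tiny labelled seed: one document per class.
        Map<String, String[][]> labelled = new HashMap<>();
        labelled.put("pos", new String[][] {{"good", "great"}});
        labelled.put("neg", new String[][] {{"bad", "awful"}});
        // Unlabelled documents: "fun" and "boring" never occur in the seed.
        String[][] unlabelled = {
            {"good", "fun"}, {"great", "fun"},
            {"bad", "boring"}, {"awful", "boring"}};

        train(labelled, unlabelled, 10);

        // The model has picked up the polarity of words it only ever
        // saw in unlabelled documents.
        System.out.println(predict(new String[] {"fun"}));     // pos
        System.out.println(predict(new String[] {"boring"}));  // neg
    }

    static void train(Map<String, String[][]> labelled,
                      String[][] unlabelled, int emIterations) {
        countLabelled(labelled);
        for (int iter = 0; iter < emIterations; iter++) {
            // E-step: soft-label every unlabelled document.
            double[][] resp = new double[unlabelled.length][];
            for (int i = 0; i < unlabelled.length; i++)
                resp[i] = posterior(unlabelled[i]);
            // M-step: re-estimate counts from labelled + soft-labelled data.
            countLabelled(labelled);
            for (int i = 0; i < unlabelled.length; i++)
                for (int k = 0; k < CLASSES.length; k++)
                    addDocument(CLASSES[k], unlabelled[i], resp[i][k]);
        }
    }

    static void countLabelled(Map<String, String[][]> labelled) {
        for (String c : CLASSES) {
            classCount.put(c, 0.0);
            wordCount.put(c, new HashMap<>());
        }
        for (String c : CLASSES)
            for (String[] doc : labelled.get(c))
                addDocument(c, doc, 1.0);
    }

    static void addDocument(String c, String[] doc, double weight) {
        classCount.merge(c, weight, Double::sum);
        for (String w : doc)
            wordCount.get(c).merge(w, weight, Double::sum);
    }

    // Laplace-smoothed P(class | document), computed in log space.
    static double[] posterior(String[] doc) {
        double totalDocs =
            classCount.values().stream().mapToDouble(Double::doubleValue).sum();
        double[] logp = new double[CLASSES.length];
        for (int k = 0; k < CLASSES.length; k++) {
            String c = CLASSES[k];
            double classTokens =
                wordCount.get(c).values().stream().mapToDouble(Double::doubleValue).sum();
            logp[k] = Math.log((classCount.get(c) + 1.0) / (totalDocs + CLASSES.length));
            for (String w : doc)
                logp[k] += Math.log((wordCount.get(c).getOrDefault(w, 0.0) + 1.0)
                                    / (classTokens + VOCAB.size()));
        }
        // Normalise the log scores into probabilities.
        double max = Math.max(logp[0], logp[1]);
        double[] p = new double[logp.length];
        double z = 0.0;
        for (int k = 0; k < p.length; k++) { p[k] = Math.exp(logp[k] - max); z += p[k]; }
        for (int k = 0; k < p.length; k++) p[k] /= z;
        return p;
    }

    static String predict(String[] doc) {
        double[] p = posterior(doc);
        return p[0] >= p[1] ? CLASSES[0] : CLASSES[1];
    }
}
```

Nigam et al. also describe weighting the unlabelled contribution and handling multiple mixture components per class; the sketch above keeps only the core E/M loop.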
So, whenever I get clients who are using OpenNLP, but have only very scanty labelled data available to train a classifier with, I end up having to teach them to build a Naive Bayes classifier and bootstrap it by using an EM procedure over unlabelled data.
Now that won’t be necessary any longer, because OpenNLP provides a Naive Bayes classifier that can be used for that purpose.
Tutorial
Training a Naive Bayes classifier is a lot like training a maximum entropy classifier. In fact, you still have to use the DocumentCategorizerME class to do it.
But you pass in a special parameter to tell the DocumentCategorizerME class that you want a Naive Bayes classifier instead.
Here is some code (from the OpenNLP manual) for training a classifier, in this case the default Maximum Entropy classifier.
DoccatModel model = null;
InputStream dataIn = null;
try {
    dataIn = new FileInputStream("en-sentiment.train");
    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
    ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
    // Training a maxent model by default!!!
    model = DocumentCategorizerME.train("en", sampleStream);
} catch (IOException e) {
    // Failed to read or parse training data, training failed
    e.printStackTrace();
}
Now, if you want to invoke the new Naive Bayes classifier instead, you just have to pass in a few training parameters, as follows.
DoccatModel model = null;
InputStream dataIn = null;
try {
    dataIn = new FileInputStream("en-sentiment.train");
    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
    ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
    TrainingParameters params = new TrainingParameters();
    params.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(0));
    params.put(TrainingParameters.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);
    // Now the parameter TrainingParameters.ALGORITHM_PARAM ensures
    // that we train a Naive Bayes model instead
    model = DocumentCategorizerME.train("en", sampleStream, params);
} catch (IOException e) {
    // Failed to read or parse training data, training failed
    e.printStackTrace();
}
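Once trained (with either algorithm), the model is used the same way to categorize new documents. A minimal sketch, assuming the trunk-era DocumentCategorizerME API; the example sentence, the whitespace tokenization, and the category names (which come from your training file) are illustrative:

```java
// Categorize a new document with the trained model.
DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
String[] tokens = "what a great movie".split("\\s+");  // naive tokenization
double[] outcomes = categorizer.categorize(tokens);    // one score per category
String best = categorizer.getBestCategory(outcomes);
System.out.println(best);
```

This snippet needs the OpenNLP jar and a trained DoccatModel on the classpath to run.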
Evaluation
I ran some tests on the Naive Bayes document categorizer in OpenNLP built from the trunk (you can also get the latest build using Maven).
Here are the numbers.
1. Subjectivity Classification
I ran the experiment on the 5000 movie reviews dataset (used in the paper “A Sentimental Education” by Bo Pang and Lillian Lee) with a 50:50 split into training and test:
Accuracies
Perceptron: 57.54% (100 iterations)
Perceptron: 59.96% (1000 iterations)
Maxent: 91.48% (100 iterations)
Maxent: 90.68% (1000 iterations)
Naive Bayes: 90.72%
2. Sentiment Polarity Classification
Cornell movie review dataset v1.1 (700 positive and 700 negative reviews).
With 350 of each as training and the rest as test, I get:
Accuracies
Perceptron: 49.70% (100 iterations)
Perceptron: 49.85% (1000 iterations)
Maxent: 77.11% (100 iterations)
Maxent: 77.55% (1000 iterations)
Naive Bayes: 75.65%
The data used in this experiment was taken from http://www.cs.cornell.edu/people/pabo/movie-review-data/
The OpenNLP Jira details for this feature are available at: https://issues.apache.org/jira/browse/OPENNLP-777
Hey, thanks for the article. Which version of the OpenNLP jar did you use for NaiveBayesTrainer?
The Naive Bayes code is still in the trunk (release cycles take time on Apache).
I could not find the NaiveBayesTrainer class in https://svn.apache.org/repos/asf/opennlp/trunk/ Could you please help?
It’s available in 1.7.0.
Shirish, the NaiveBayesTrainer is under opennlp-tools/src/main/java. It’s in the package “opennlp.tools.ml.naivebayes”. If you download the trunk version, you should be able to just do an “import opennlp.tools.ml.naivebayes.NaiveBayesTrainer”.
Hi,
Nice article.
Any idea how to parse and feed the provided dataset into the OpenNLP DocumentCategorizerME? The dataset contains a couple of files in pos/neg folders, which basically tells us the score, but how to feed this in? Any code snippets/ideas would be very helpful.
You’ll have to use a single file. The format is described in the OpenNLP manual: https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html. You could post any general questions on usage to the OpenNLP users’ mailing list: https://opennlp.apache.org/mail-lists.html
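If it helps, one way to get the pos/neg folder layout into that single-file format is to flatten it yourself. The helper below is illustrative, not OpenNLP API (the class name and the hard-coded "pos"/"neg" folder names are my assumptions); it writes one "category text" line per review, which is what DocumentSampleStream expects:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class TrainingFileBuilder {

    // Flatten a dataDir containing pos/ and neg/ subfolders into a single
    // training file: one "category<whitespace>document text" line per review.
    public static void build(Path dataDir, Path outFile) throws IOException {
        try (var out = Files.newBufferedWriter(outFile, StandardCharsets.UTF_8)) {
            for (String category : new String[] {"pos", "neg"}) {
                try (Stream<Path> files = Files.list(dataDir.resolve(category)).sorted()) {
                    for (Path file : (Iterable<Path>) files::iterator) {
                        // Each review must end up on a single line.
                        String text = String.join(" ",
                            Files.readAllLines(file, StandardCharsets.UTF_8)).trim();
                        out.write(category + " " + text);
                        out.newLine();
                    }
                }
            }
        }
    }
}
```

The resulting file can then be passed to the training code shown in the article (e.g. as "en-sentiment.train").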
Thanks, I managed to get it running.
However, I have two questions.
1) How to check the accuracy of our model?
In Apache Spark we split our dataset into train and test sets by some ratio, say 70:30.
2) Is there any tuning done for accuracy?
I could see random results coming out of the model that I trained. Apart from the mentioned points, is there any optimization done to improve accuracy?
About your two questions.
1) How to check the accuracy of our model?
You do the same in OpenNLP as you described for Apache Spark.
2) Is there any tuning done for accuracy?
No. In theory there is a smoothing parameter you could set using tuning data, but we have not enabled the setting of that parameter in the Apache OpenNLP implementation (we’ve picked a value that generally works fine for sentiment analysis). So you have no tuning to do.
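For the curious, the smoothing parameter in question is the usual add-alpha (Lidstone) term in the word-likelihood estimate. The demo below is purely illustrative (the alpha values and toy counts are made up, not OpenNLP's internals); it just shows why some smoothing is needed and what the knob would control:

```java
public class SmoothingDemo {

    // Lidstone-smoothed estimate of P(word | class):
    // (count(word, class) + alpha) / (tokens(class) + alpha * |vocabulary|)
    static double lidstone(double count, double classTokens,
                           double vocabSize, double alpha) {
        return (count + alpha) / (classTokens + alpha * vocabSize);
    }

    public static void main(String[] args) {
        double vocab = 10_000, tokens = 50_000;
        // With alpha = 0, a word never seen with a class gets zero
        // probability and vetoes that class outright.
        System.out.println(lidstone(0, tokens, vocab, 0.0));   // 0.0
        // With alpha > 0, unseen words get a small non-zero probability;
        // larger alpha flattens all the estimates more aggressively.
        System.out.println(lidstone(0, tokens, vocab, 0.05));
        System.out.println(lidstone(0, tokens, vocab, 1.0));
    }
}
```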
Is there some trick to serializing the NB output model from training for use later? The result of DoccatModel.serialize and then creating a new DoccatModel from that file results in basically random classifications. Default ME model serialization/deserialization works fine for my tests. Looking at the unit tests they have added, it only checks that the model serialized and read back in is not null, not that it actually functions correctly.
There isn’t any trick to it, Mike. So, there’s a possibility there’s a bug in the code there (and there have been some refactorings taking place around the serialization mechanism too). Would you be able to open a bug on the OpenNLP JIRA (https://issues.apache.org/jira/browse/OPENNLP/) and share the link here? I’ll follow it up.
It would appear someone beat me to creating an issue by a couple weeks: https://issues.apache.org/jira/browse/OPENNLP-1010 . You can just as easily reproduce the issue by modifying https://github.com/apache/opennlp/blob/c17c55110b216ed3d5e0adb06734677a9cb04abd/opennlp-tools/src/test/java/opennlp/tools/doccat/DocumentCategorizerNBTest.java to serialize out the Doccat model then read it back in.
Thanks, Mike. I’ll follow this up.
Mike, I saw an email on the OpenNLP dev group indicating that someone on the OpenNLP team (William Colen) has fixed it. The message said: “OPENNLP-1010: Fix NaiveBayes model writer
The previous sortValues method was based on Perceptron, but for some reason it was not working
for NaiveBayes. Changed it to the one from GIS fixed it.”
It’ll take them a few days to pull it and then the fix should be available in the trunk.
Hi, is it possible to do sentiment analysis for the Kannada language?
Yes, if you have training data for sentiment analysis in Kannada.