The OpenNLP project of the Apache Foundation is a machine learning toolkit for text analytics.
For many years, OpenNLP did not carry a Naive Bayes classifier implementation. That has finally changed: a Naive Bayes classifier is now included in the trunk (it is not yet available in a stable release).
Naive Bayes classifiers are very useful when there is little to no labelled data available.
Labelled data is usually needed in large quantities to train classifiers.
However, the Naive Bayes classifier can sometimes make do with a very small amount of labelled data and bootstrap itself over unlabelled data. Unlabelled data is usually far easier to get your hands on, and far cheaper to collect, than labelled data. The process of bootstrapping Naive Bayes classifiers over unlabelled data is explained in the paper “Text Classification from Labeled and Unlabeled Documents using EM” by Kamal Nigam et al.
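The bootstrapping procedure is easy to sketch. The toy program below is entirely illustrative (the class name, the data, and the "pos"/"neg" labels are made up here, and none of it is OpenNLP code): it trains a Laplace-smoothed multinomial Naive Bayes model on one labelled document per class, then runs EM rounds in which the E-step soft-labels the unlabelled documents and the M-step re-estimates the model from the labelled plus soft-labelled counts.

```java
import java.util.*;

public class EmNaiveBayesSketch {

    static final String[] CLASSES = {"pos", "neg"};
    static final List<String> VOCAB =
        List.of("good", "great", "bad", "awful", "fun", "boring");

    // classCount[c] = (soft) number of documents in class c;
    // wordCount[c][w] = (soft) number of occurrences of word w in class c.
    static Map<String, Double> classCount = new HashMap<>();
    static Map<String, Map<String, Double>> wordCount = new HashMap<>();

    public static void main(String[] args) {
        // Tiny labelled seed: one document per class.
        Map<String, String[][]> labelled = new HashMap<>();
        labelled.put("pos", new String[][] {{"good", "great"}});
        labelled.put("neg", new String[][] {{"bad", "awful"}});
        // Unlabelled documents: "fun" and "boring" never occur in the seed.
        String[][] unlabelled = {
            {"good", "fun"}, {"great", "fun"},
            {"bad", "boring"}, {"awful", "boring"}};

        train(labelled, unlabelled, 10);

        // The model has picked up the polarity of words it only ever
        // saw in unlabelled documents.
        System.out.println(predict(new String[] {"fun"}));     // pos
        System.out.println(predict(new String[] {"boring"}));  // neg
    }

    static void train(Map<String, String[][]> labelled,
                      String[][] unlabelled, int emIterations) {
        countLabelled(labelled);
        for (int iter = 0; iter < emIterations; iter++) {
            // E-step: soft-label every unlabelled document.
            double[][] resp = new double[unlabelled.length][];
            for (int i = 0; i < unlabelled.length; i++)
                resp[i] = posterior(unlabelled[i]);
            // M-step: re-estimate counts from labelled + soft-labelled data.
            countLabelled(labelled);
            for (int i = 0; i < unlabelled.length; i++)
                for (int k = 0; k < CLASSES.length; k++)
                    addDocument(CLASSES[k], unlabelled[i], resp[i][k]);
        }
    }

    static void countLabelled(Map<String, String[][]> labelled) {
        for (String c : CLASSES) {
            classCount.put(c, 0.0);
            wordCount.put(c, new HashMap<>());
        }
        for (String c : CLASSES)
            for (String[] doc : labelled.get(c))
                addDocument(c, doc, 1.0);
    }

    static void addDocument(String c, String[] doc, double weight) {
        classCount.merge(c, weight, Double::sum);
        for (String w : doc)
            wordCount.get(c).merge(w, weight, Double::sum);
    }

    // Laplace-smoothed P(class | document), computed in log space.
    static double[] posterior(String[] doc) {
        double totalDocs =
            classCount.values().stream().mapToDouble(Double::doubleValue).sum();
        double[] logp = new double[CLASSES.length];
        for (int k = 0; k < CLASSES.length; k++) {
            String c = CLASSES[k];
            double classTokens =
                wordCount.get(c).values().stream().mapToDouble(Double::doubleValue).sum();
            logp[k] = Math.log((classCount.get(c) + 1.0) / (totalDocs + CLASSES.length));
            for (String w : doc)
                logp[k] += Math.log((wordCount.get(c).getOrDefault(w, 0.0) + 1.0)
                                    / (classTokens + VOCAB.size()));
        }
        // Normalise the log scores into probabilities.
        double max = Math.max(logp[0], logp[1]);
        double[] p = new double[logp.length];
        double z = 0.0;
        for (int k = 0; k < p.length; k++) { p[k] = Math.exp(logp[k] - max); z += p[k]; }
        for (int k = 0; k < p.length; k++) p[k] /= z;
        return p;
    }

    static String predict(String[] doc) {
        double[] p = posterior(doc);
        return p[0] >= p[1] ? CLASSES[0] : CLASSES[1];
    }
}
```

Nigam et al. also describe weighting the unlabelled contribution and handling multiple mixture components per class; the sketch above keeps only the core E/M loop.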
So, whenever I get clients who are using OpenNLP, but have only very scanty labelled data available to train a classifier with, I end up having to teach them to build a Naive Bayes classifier and bootstrap it by using an EM procedure over unlabelled data.
Now that won’t be necessary any longer, because OpenNLP provides a Naive Bayes classifier that can be used for that purpose.
Tutorial
Training a Naive Bayes classifier is a lot like training a maximum entropy classifier. In fact, you still have to use the DocumentCategorizerME class to do it.
But you pass in a special parameter to tell the DocumentCategorizerME class that you want a Naive Bayes classifier instead.
Here is some code (from the OpenNLP manual) for training a classifier, in this case the default Maximum Entropy classifier.
DoccatModel model = null;
InputStream dataIn = null;
try {
    dataIn = new FileInputStream("en-sentiment.train");
    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
    ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
    // Training a maxent model by default!!!
    model = DocumentCategorizerME.train("en", sampleStream);
} catch (IOException e) {
    // Failed to read or parse training data, training failed
    e.printStackTrace();
}
Now, if you want to invoke the new Naive Bayes classifier instead, you just have to pass in a few training parameters, as follows.
DoccatModel model = null;
InputStream dataIn = null;
try {
    dataIn = new FileInputStream("en-sentiment.train");
    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
    ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
    TrainingParameters params = new TrainingParameters();
    params.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(0));
    params.put(TrainingParameters.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);
    // Now the parameter TrainingParameters.ALGORITHM_PARAM ensures
    // that we train a Naive Bayes model instead
    model = DocumentCategorizerME.train("en", sampleStream, params);
} catch (IOException e) {
    // Failed to read or parse training data, training failed
    e.printStackTrace();
}
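Once trained (with either algorithm), the model is used the same way to categorize new documents. A minimal sketch, assuming the trunk-era DocumentCategorizerME API; the example sentence, the whitespace tokenization, and the category names (which come from your training file) are illustrative:

```java
// Categorize a new document with the trained model.
DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
String[] tokens = "what a great movie".split("\\s+");  // naive tokenization
double[] outcomes = categorizer.categorize(tokens);    // one score per category
String best = categorizer.getBestCategory(outcomes);
System.out.println(best);
```

This snippet needs the OpenNLP jar and a trained DoccatModel on the classpath to run.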
Evaluation
I ran some tests on the Naive Bayes document categorizer in OpenNLP built from the trunk (you can also get the latest build using Maven).
Here are the numbers.
1. Subjectivity Classification
I ran the experiment on the 5000 movie reviews dataset (used in the paper “A Sentimental Education” by Bo Pang and Lillian Lee) with a 50:50 split into training and test:
Accuracies
Perceptron: 57.54% (100 iterations)
Perceptron: 59.96% (1000 iterations)
Maxent: 91.48% (100 iterations)
Maxent: 90.68% (1000 iterations)
Naive Bayes: 90.72%
2. Sentiment Polarity Classification
Cornell movie review dataset v1.1 (700 positive and 700 negative reviews).
With 350 of each as training and the rest as test, I get:
Accuracies
Perceptron: 49.70% (100 iterations)
Perceptron: 49.85% (1000 iterations)
Maxent: 77.11% (100 iterations)
Maxent: 77.55% (1000 iterations)
Naive Bayes: 75.65%
The data used in this experiment was taken from http://www.cs.cornell.edu/people/pabo/movie-review-data/
The OpenNLP Jira details for this feature are available at: https://issues.apache.org/jira/browse/OPENNLP-777
Hey, thanks for the article. Which version of the OpenNLP jar did you use for NaiveBayesTrainer?
The Naive Bayes code is still in the trunk (release cycles take time on Apache).
I could not find the NaiveBayesTrainer class in https://svn.apache.org/repos/asf/opennlp/trunk/ Could you please help?
It’s available in 1.7.0.
Shirish, the NaiveBayesTrainer is under opennlp-tools/src/main/java. It’s in the package “opennlp.tools.ml.naivebayes”. If you download the trunk version, you should be able to just do an “import opennlp.tools.ml.naivebayes.NaiveBayesTrainer”.
Hi,
Nice article.
Any idea how to parse and feed the provided dataset into the OpenNLP DocumentCategorizerME? The dataset contains a couple of files in pos/neg folders, which basically tells us the score, but how to feed this in? Any code snippets/ideas would be very helpful.
You’ll have to use a single file. The format is described in the OpenNLP manual: https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html. You could post any general questions on usage to the OpenNLP users’ mailing list: https://opennlp.apache.org/mail-lists.html
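If it helps, one way to get the pos/neg folder layout into that single-file format is to flatten it yourself. The helper below is illustrative, not OpenNLP API (the class name and the hard-coded "pos"/"neg" folder names are my assumptions); it writes one "category text" line per review, which is what DocumentSampleStream expects:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class TrainingFileBuilder {

    // Flatten a dataDir containing pos/ and neg/ subfolders into a single
    // training file: one "category<whitespace>document text" line per review.
    public static void build(Path dataDir, Path outFile) throws IOException {
        try (var out = Files.newBufferedWriter(outFile, StandardCharsets.UTF_8)) {
            for (String category : new String[] {"pos", "neg"}) {
                try (Stream<Path> files = Files.list(dataDir.resolve(category)).sorted()) {
                    for (Path file : (Iterable<Path>) files::iterator) {
                        // Each review must end up on a single line.
                        String text = String.join(" ",
                            Files.readAllLines(file, StandardCharsets.UTF_8)).trim();
                        out.write(category + " " + text);
                        out.newLine();
                    }
                }
            }
        }
    }
}
```

The resulting file can then be passed to the training code shown in the article (e.g. as "en-sentiment.train").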
Thanks, I managed to get it running.
However, I have two questions.
1) How to check the accuracy of our model?
In Apache Spark we split our dataset into train and test sets by some ratio, say 70:30.
2) Is there any tuning done for accuracy?
I could see random results coming out of the model that I trained. Apart from the mentioned points, is there any optimization done to improve accuracy?
About your two questions.
1) How to check the accuracy of our model?
You do the same in OpenNLP as you described for Apache Spark.
2) Is there any tuning done for accuracy?
No. In theory there is a smoothing parameter you could set using tuning data, but we have not enabled the setting of that parameter in the Apache OpenNLP implementation (we’ve picked a value that generally works fine for sentiment analysis). So you have no tuning to do.
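For the curious, the smoothing parameter in question is the usual add-alpha (Lidstone) term in the word-likelihood estimate. The demo below is purely illustrative (the alpha values and toy counts are made up, not OpenNLP's internals); it just shows why some smoothing is needed and what the knob would control:

```java
public class SmoothingDemo {

    // Lidstone-smoothed estimate of P(word | class):
    // (count(word, class) + alpha) / (tokens(class) + alpha * |vocabulary|)
    static double lidstone(double count, double classTokens,
                           double vocabSize, double alpha) {
        return (count + alpha) / (classTokens + alpha * vocabSize);
    }

    public static void main(String[] args) {
        double vocab = 10_000, tokens = 50_000;
        // With alpha = 0, a word never seen with a class gets zero
        // probability and vetoes that class outright.
        System.out.println(lidstone(0, tokens, vocab, 0.0));   // 0.0
        // With alpha > 0, unseen words get a small non-zero probability;
        // larger alpha flattens all the estimates more aggressively.
        System.out.println(lidstone(0, tokens, vocab, 0.05));
        System.out.println(lidstone(0, tokens, vocab, 1.0));
    }
}
```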
Is there some trick to serializing the NB output model from training for use later? The result of DoccatModel.serialize and then creating a new DoccatModel from that file results in basically random classifications. Default ME model serialization/deserialization works fine for my tests. Looking at the unit tests they have added, it only checks that the model serialized and read back in is not null, not that it actually functions correctly.
There isn’t any trick to it, Mike. So, there’s a possibility there’s a bug in the code there (and there have been some refactorings taking place around the serialization mechanism too). Would you be able to open a bug on the OpenNLP JIRA (https://issues.apache.org/jira/browse/OPENNLP/) and share the link here? I’ll follow it up.
It would appear someone beat me to creating an issue by a couple weeks: https://issues.apache.org/jira/browse/OPENNLP-1010 . You can just as easily reproduce the issue by modifying https://github.com/apache/opennlp/blob/c17c55110b216ed3d5e0adb06734677a9cb04abd/opennlp-tools/src/test/java/opennlp/tools/doccat/DocumentCategorizerNBTest.java to serialize out the Doccat model then read it back in.
Thanks, Mike. I’ll follow this up.
Mike, I saw an email on the OpenNLP dev group indicating that someone on the OpenNLP team (William Colen) has fixed it. The message said: “OPENNLP-1010: Fix NaiveBayes model writer
The previous sortValues method was based on Perceptron, but for some reason it was not working
for NaiveBayes. Changed it to the one from GIS fixed it.”
It’ll take them a few days to pull it and then the fix should be available in the trunk.
Hi, is it possible to do sentiment analysis for the Kannada language?
Yes, if you have training data for sentiment analysis in Kannada.