How easy it is to classify documents depends on the categories you want to sort them into.
A few days ago, an engineer wrote about a classification problem where the analysis that needed to be performed on the documents was not straightforward.
He described the problem in a forum as follows: “I am working on sub-classification. We already crawled sites using focused crawling, so we know the domain and broad category for each site. Sometimes a site is also tagged with its broad category, so I don’t need to predict the broad class for an individual site. I am interested in sub-classification. For example, I don’t want to find whether a post is related to sports, politics, cricket, etc. I want to find whether a post is related to Indian cricket or Australian cricket, given that I already know the post is related to cricket. A cricket post may contain frequent words like runs, six, fours, out, score, etc., which are common across all cricket-related posts. So I also want to consider rare terms which can help me in sub-classification. I agree that I may also require frequent words for classification, but I don’t want to skip rare terms.”
If you’re dealing with categories like sports, politics and finance, then using machine learning for classification is very easy. That’s because the nouns and verbs in a document give you strong clues about the category it belongs to.
But if you’re given a set of categories for which there are few indicators in the text, you end up with no easy way to categorize the documents.
After spending a few days thinking about it, I realized that something I had learnt in college could be applied to the problem. It’s a technique called Feature Selection.
I am going to share the reply I posted to the question, because it might be useful to others working on the classification of documents:
“You seem to have a data set that looks as follows (letters are categories and numbers are features):
A P 2 4
A Q 2 5
B P 3 4
B Q 3 5
Let’s say the 2s and the 3s are features that occur very frequently in your corpus while the 4s and the 5s are features that occur far less frequently in your corpus.
When you use the ‘bag of words’ model as your feature vector, your classifier will only learn to tell A apart from B (because the 4s and 5s will not matter much to the classifier, being overwhelmed as it is by the 2s and 3s which are far more frequent).
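To make that concrete, here is a toy Python sketch. The posts and words are invented for illustration: ‘runs’, ‘six’, ‘wickets’ and ‘score’ play the role of the frequent terms (the 2s and 3s), while ‘kohli’ and ‘smith’ play the role of the rare, subclass-revealing terms (the 4s and 5s). Most of the bag-of-words mass sits in terms the two subclasses share:

```python
from collections import Counter

# Hypothetical cricket posts: one about Indian cricket, one about
# Australian cricket. The vocabulary is made up for illustration.
india_post = "runs runs six six wickets score kohli".split()
australia_post = "runs runs six six wickets score smith".split()

india_counts = Counter(india_post)
australia_counts = Counter(australia_post)

# Multiset intersection / union: how much of the two bag-of-words
# vectors is shared versus distinct.
shared = sum((india_counts & australia_counts).values())
total = sum((india_counts | australia_counts).values())
print(f"overlap: {shared}/{total}")  # prints "overlap: 6/8"
```

Six of the eight word occurrences are identical across the two subclasses, so a classifier trained on raw counts sees two nearly indistinguishable vectors.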
I think that is why you have come to the conclusion that you need to look for rare words to be able to accomplish your goal of distinguishing category P from category Q.
But in reality, perhaps what you need to do is identify all the features, like 4 and 5, that can help you distinguish P from Q. You might even find that some frequent features help too: it can turn out that certain frequent features also have a fairly healthy ability to resolve these categories.
So, now the question just boils down to how you would go about finding the set of features that resolves any given categorization scheme.
The answer seems to be something that literature refers to as ‘Feature Selection’.
As the name says, you select features that help you break data points apart in the way you want.
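As a minimal sketch of what a filter-style feature selector does, the following Python scores each term by how unevenly it occurs across the two subclasses. The toy dataset and the scoring rule (the gap between the term’s document frequency in each class) are illustrative assumptions, not a prescribed method:

```python
# Toy labelled corpus: (subclass_label, set_of_terms_in_post).
docs = [
    ("india", {"runs", "six", "kohli"}),
    ("india", {"runs", "wickets", "kohli"}),
    ("australia", {"runs", "six", "smith"}),
    ("australia", {"runs", "wickets", "smith"}),
]

terms = set().union(*(t for _, t in docs))

def separation(term):
    # Fraction of documents in each class containing the term;
    # a large gap means the term helps tell the classes apart.
    india = [t for label, t in docs if label == "india"]
    aus = [t for label, t in docs if label == "australia"]
    p = sum(term in t for t in india) / len(india)
    q = sum(term in t for t in aus) / len(aus)
    return abs(p - q)

ranked = sorted(terms, key=separation, reverse=True)
print(ranked[:2])  # the rare terms 'kohli' and 'smith' rank highest
```

Note that the very frequent term ‘runs’ scores zero here: frequency alone is not what matters, only how unevenly a feature falls across the classes you want to separate.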
Wikipedia has an article on Feature Selection.
And Mark Hall’s thesis http://www.cs.waikato.ac.nz/~mhall/thesis.pdf seems to be highly referenced.
As Mark Hall’s thesis puts it: “A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other.”
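Hall’s thesis turns that idea into a merit score for a candidate subset of k features: k · r̄cf / sqrt(k + k(k−1) · r̄ff), where r̄cf is the average feature–class correlation and r̄ff the average feature–feature correlation. A small sketch of the formula, with correlation values made up purely for illustration:

```python
from math import sqrt

def cfs_merit(k, avg_feature_class_corr, avg_feature_feature_corr):
    """Correlation-based merit of a k-feature subset: higher when the
    features predict the class, lower when they predict each other
    (i.e. when the subset is redundant)."""
    return (k * avg_feature_class_corr) / sqrt(
        k + k * (k - 1) * avg_feature_feature_corr
    )

# Two subsets equally correlated with the class (0.6); the one whose
# features are less correlated with each other gets the higher merit.
redundant = cfs_merit(3, 0.6, 0.9)  # features largely duplicate each other
diverse = cfs_merit(3, 0.6, 0.1)    # features carry independent evidence
print(round(redundant, 3), round(diverse, 3))  # prints "0.621 0.949"
```

So a search over subsets guided by this merit prefers a mix of features that each add new predictive information, rather than many copies of the same strong signal.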
To be honest, I’d heard about Feature Selection but had never connected it to the problem it solves until now, so I’m just looking up reading material as I write.
Best of luck with it.”