I’ve always wondered whether there was a way to teach people to cobble together quick-and-dirty solutions to natural language problems, out of duct tape, as it were.
Having worked in the field for donkey’s years (as of 2015), and having taught a number of text analytics courses along the way, I’ve seen students of text analysis stumble mostly on one of two hurdles:
1. Inability to Reduce Text Analytics Problems to Machine Learning Problems
I’ve seen students, after hours of training, still revert to rule-based thinking when asked to solve new problems involving text.
You can spend hours teaching people about classification and feature sets, but when you ask them to apply what they have learned to a new task, say segmenting a resume, you’ll very quickly hear them fall back to thinking in terms of programming steps.
Umm, you could write a script to look for a horizontal line, followed by capitalized text in a big, bold font, with the words “Education” or “Experience” in it!!!
2. Inability to Solve the Machine Learning (ML) Problems
The other hurdle I have seen teams get hung up on is solving the ML problem itself and comparing different solutions.
My manager wants me to identify the ‘introduction’ sections. So, I labelled 5 sentences as introductions. Then, I trained a maximum entropy classifier with them. Why isn’t it working?
One Machine Learning Algorithm to Rule Them All
One day, when I was about to give a lecture at Barcamp Bangalore, I had an idea.
Wouldn’t it be fun to use just one machine learning algorithm, show people how to code up that algorithm themselves, and then show them how a really large number of text analytics problems (almost every single problem related to the semantic web) could be solved using it?
So, I quickly wrote up a set of problems in order of increasing complexity, and went about trying to reduce them all to one ML problem, and surprised myself! It could be done!
Just about every text analytics problem related to the semantic web (which is, by far, the most important commercial category) could be reduced to a classification problem.
Moreover, you could tackle just about any problem using just two steps:
a) Modeling the problem as a machine learning problem
Spot the machine learning problem underlying the text analytics problem and, if it is a classification problem, identify the relevant categories. Once you have done that, you’ve reduced the text analytics problem to a machine learning problem.
b) Solving the problem using feature engineering
To solve the machine learning problem, you need to come up with a set of features that allows the machine learning algorithm to separate the desired categories.
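To make the two steps concrete, here is a minimal sketch applied to the resume-segmentation example from earlier. It treats segmentation as classifying each line as "heading" or "body" (step a), and then hand-picks a few features that might separate the two (step b). Everything in it is an assumption made for illustration: this post does not name the one algorithm, so a tiny Naive Bayes classifier with add-one smoothing stands in for it, and the feature names, the `extract_features` helper, and the training lines are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def extract_features(line):
    """Step (b): turn one resume line into feature strings (hypothetical feature set)."""
    text = line.strip()
    words = text.lower().split()
    features = ["word=" + w.strip(".,:;") for w in words]
    if text.isupper():
        features.append("all_caps")      # headings are often capitalized
    if len(words) <= 3:
        features.append("short_line")    # headings tend to be short
    if any(ch.isdigit() for ch in text):
        features.append("has_digit")     # dates and durations suggest body text
    return features


class NaiveBayes:
    """A small Naive Bayes classifier with add-one smoothing (an assumed stand-in)."""

    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, examples):
        for features, label in examples:
            self.label_counts[label] += 1
            for f in features:
                self.feature_counts[label][f] += 1
                self.vocabulary.add(f)

    def classify(self, features):
        total = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each feature.
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocabulary)
            for f in features:
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Step (a): the categories are "heading" and "body"; each line is one example.
# These training lines are made up for illustration.
training_lines = [
    ("EDUCATION", "heading"),
    ("WORK EXPERIENCE", "heading"),
    ("SKILLS", "heading"),
    ("B.Tech in Computer Science, 2009", "body"),
    ("Built a search engine for legal documents", "body"),
    ("Worked on text classification for 3 years", "body"),
]

model = NaiveBayes()
model.train([(extract_features(line), label) for line, label in training_lines])
print(model.classify(extract_features("PROJECTS")))
```

Note that the word “projects” never appears in the training lines. The unseen heading is still picked out because of the `all_caps` and `short_line` features, which is the point of step (b): the same cues the rule-based script would hard-code become evidence that the classifier weighs for itself. Six lines is far too little data for real use; the toy set only keeps the sketch short.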
Check it out for yourself!
Here’s a set of slides.
It’s called “Fun with Text – Hacking Text Analytics”.