Yesterday, I found myself giving a follow-on lecture titled “Fun with Text – Managing Text Analytics”.
Here are the slides:
“Hacking Text Analytics” was meant to help students understand a range of text analytics problems by reducing them to simpler problems.
But it was designed with the understanding that they would hack their own text analytics tools.
However, in project after project, I was seeing that engineers tended not to build their own text analytics tools, but instead rely on handy and widely available open source products, and that the main thing they needed to learn was how to use them.
So, when I was asked to lecture at the NASSCOM Big Data and Analytics Summit in Hyderabad, was advised that a large part of the audience might be non-technical, and was asked to base the talk on use cases, I tried a different tack.
So I designed another lecture “Fun with Text – Managing Text Analytics” about:
3 types of opportunities for text analytics that typically exist in every vertical
3 use cases dealing with each of these types of opportunities
3 mistakes to avoid and 3 things to embrace
The takeaway from it is how to go about solving a typical business problem involving text, using text analytics.
Here are a few more lines from Kabir’s inverted verse:
A tree stands without roots
A tree bears fruit without flowers
Someone dances without feet
Someone plays music without hands
Someone sings without a tongue
Water catches fire
Someone sees with blind eyes
A cow eats a lion
A deer eats a cheetah
A crow pounces on a falcon
A quail pounces on a hawk
A mouse eats a cat
A dog eats a jackal
A frog eats snakes
What’s interesting about all of these is that they’re examples of entity-relationships that are false.
Let me first explain what entities and relationships are.
Entities are the real or conceptual objects that we perceive as existing in the world we live in. They are usually described using a noun phrase and qualified using an adjective.
Relationships are the functions that apply to an ordered list of entities and return a true or false value.
For example, if you take the sentence “The hunter hunts the fox,” there are two entities (1. the hunter, 2. the fox). The relationship is “hunts”, and it returns true for the two entities presented in that order.
The relationship “hunts” would return false if the entities were inverted (as in 1. the fox and 2. the hunter … as in the sentence “The fox hunts the hunter”).
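Here is a minimal sketch of that idea in Python, treating a relationship as a function over an ordered pair of entities. The `HUNTS` set and the `hunts` function are hypothetical names made up for illustration, and the facts in the set are toy data.

```python
# Toy knowledge base (an assumption for illustration): the ordered
# (subject, object) pairs for which the relationship "hunts" holds.
HUNTS = {("hunter", "fox")}


def hunts(subject: str, obj: str) -> bool:
    """Return True only if "subject hunts obj" is a known fact."""
    return (subject, obj) in HUNTS


# The same two entities, presented in the two possible orders.
in_order = hunts("hunter", "fox")   # "The hunter hunts the fox"
inverted = hunts("fox", "hunter")   # "The fox hunts the hunter"
print(in_order, inverted)
```

Because the pair is ordered, swapping the entities is a different question to the function, which is why the inverted sentence comes out false.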
In fact, it was once speculated that entities and relationships such as these would someday make up the semantic web.
Most of Kabir’s inverted verse seems to be based on false entity relationships of arity two (involving two entities). Often, it is a violation of entity order that causes the relationship function to return false.
In the “cow was milked” song, the relationship that is violated is the temporal relationship: “takes place before”.
In the “ant’s wedding” song, the relationship that is violated is that of capability: “can do”.
In the rest of the examples, relationships like “eats”, “hunts”, “plays”, “dances”, “bears fruit”, etc., are violated.
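One rough way to make the "violated order" observation concrete is to check whether a line is false as stated but true once its two entities are swapped. The sketch below assumes a tiny hand-written fact set; `FACTS`, `holds` and `is_inverted` are hypothetical names, and the facts are toy common-sense examples rather than anything drawn from a real knowledge base.

```python
# Toy common-sense facts (assumed for illustration only), stored as
# ordered (subject, relation, object) triples.
FACTS = {
    ("lion", "eats", "cow"),
    ("cat", "eats", "mouse"),
    ("snake", "eats", "frog"),
}


def holds(subject: str, relation: str, obj: str) -> bool:
    """Return True if the ordered triple is a known fact."""
    return (subject, relation, obj) in FACTS


def is_inverted(subject: str, relation: str, obj: str) -> bool:
    """True when the triple is false but swapping the entities makes it true."""
    return not holds(subject, relation, obj) and holds(obj, relation, subject)


# Three lines of the verse, normalized by hand into triples.
verse = [
    ("cow", "eats", "lion"),
    ("mouse", "eats", "cat"),
    ("frog", "eats", "snake"),
]
inverted_lines = [triple for triple in verse if is_inverted(*triple)]
print(inverted_lines)
```

Lines such as "Water catches fire" would not be caught by this check, since they are not simple swaps of a true pair.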
In Osho’s “The Revolution”, he talks about Kabir’s interest in and distrust of language, quoting the poet as saying:
I HAVE BEEN THINKING OF THE DIFFERENCE BETWEEN WATER
AND THE WAVES ON IT. RISING,
WATER’S STILL WATER, FALLING BACK,
IT IS WATER. WILL YOU GIVE ME A HINT
HOW TO TELL THEM APART?
BECAUSE SOMEONE HAS MADE UP THE WORD ‘WAVE’,
DO I HAVE TO DISTINGUISH IT FROM ‘WATER’?
And Osho concludes with:
Kabir is not interested in giving you any answers — because he knows perfectly well there is no answer. The game of question and answers is just a game — not that Kabir was not answering his disciples’ questions; he was answering, but answering playfully. That quality you have to remember. He is not a serious man; no wise man can ever be serious. Seriousness is part of ignorance, seriousness is a shadow of the ego. The wise is always non-serious. There can be no serious answers to questions, not at least with Kabir — because he does not believe that there is any meaning in life, and he does not believe that you have to stand aloof from life to observe and to find the meaning. He believes in participation. He does not want you to become a spectator, a speculator, a philosopher.
This genre of verse seems to have been a tradition in folk religious movements in North India. In “The Tenth Rasa: An Anthology of Indian Nonsense”, Michael Heyman, Sumanyu Satpathy and Anushka Ravishankar mention that Namdev, a 13th-century saint-poet, authored such verses as well.
I’ve always wondered if there was a way to teach people to cobble together quick and dirty solutions to problems involving natural language, from duct tape, as it were.
Having worked in the field now for donkey’s years as of 2015, and having taught a number of text analytics courses along the way, I’ve seen students of text analysis stumble mostly on one of two hurdles:
1. Inability to Reduce Text Analytics Problems to Machine Learning Problems
I’ve seen students, after hours of training, still revert to rule-based thinking when asked to solve new problems involving text.
You can spend hours teaching people about classification and feature sets, but when you ask them to apply their learning to a new task, say segmenting a resume, you’ll hear them very quickly falling back to thinking in terms of programming steps.
Umm, you could write a script to look for a horizontal line, followed by capitalized text in bold, big font, with the words “Education” or “Experience” in it !!!
2. Inability to Solve the Machine Learning (ML) Problems
Another task that I have seen teams getting hung up on has been solving ML problems and comparing different solutions.
My manager wants me to identify the ‘introduction’ sections. So, I labelled 5 sentences as introductions. Then, I trained a maximum entropy classifier with them. Why isn’t it working?
One Machine Learning Algorithm to Rule Them All
One day, when I was about to give a lecture at Barcamp Bangalore, I had an idea.
Wouldn’t it be fun to try to use just one machine learning algorithm, show people how to code up that algorithm themselves, and then show them how a really large number of text analytics problems (almost every single problem related to the semantic web) could be solved using it?
So, I quickly wrote up a set of problems in order of increasing complexity, and went about trying to reduce them all to one ML problem, and surprised myself! It could be done!
Just about every text analytics problem related to the semantic web (which is, by far, the most important commercial category) could be reduced to a classification problem.
Moreover, you could tackle just about any problem using just two steps:
a) Modeling the problem as a machine learning problem
Spot the machine learning problem underlying the text analytics problem and, if it is a classification problem, identify the relevant categories. With that, you’ve reduced the text analytics problem to a machine learning problem.
b) Solving the problem using feature engineering
To solve the machine learning problem, you need to come up with a set of features that allows the machine learning algorithm to separate the desired categories.
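To illustrate the two steps, here is a small sketch that models the resume segmentation task mentioned earlier as a two-category classification problem ("heading" vs. "body") and solves it with a handful of hand-picked features. The post does not say which algorithm the slides use, so a plain perceptron stands in here as an assumption. The feature names, the training lines and the function names are all made up for illustration.

```python
from collections import defaultdict


def features(line: str) -> dict:
    """Step (b): turn a resume line into a few hand-engineered features."""
    words = line.split()
    return {
        "bias": 1.0,
        "short": 1.0 if len(words) <= 3 else 0.0,
        "all_caps": 1.0 if line.isupper() else 0.0,
        "ends_with_period": 1.0 if line.rstrip().endswith(".") else 0.0,
        "has_digit": 1.0 if any(ch.isdigit() for ch in line) else 0.0,
    }


def score(weights, feats) -> float:
    """Weighted sum of the feature values."""
    return sum(weights[name] * value for name, value in feats.items())


def train_perceptron(examples, epochs: int = 10):
    """Perceptron training: adjust the weights only on misclassified lines."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for line, label in examples:  # label: +1 = heading, -1 = body
            feats = features(line)
            predicted = 1 if score(weights, feats) > 0 else -1
            if predicted != label:
                for name, value in feats.items():
                    weights[name] += label * value
    return weights


def predict(weights, line: str) -> str:
    return "heading" if score(weights, features(line)) > 0 else "body"


# Step (a): the categories are "heading" (+1) and "body" (-1).
# Toy training data, invented for this sketch.
TRAIN = [
    ("EDUCATION", 1),
    ("WORK EXPERIENCE", 1),
    ("SKILLS", 1),
    ("B.Tech in Computer Science, 2009.", -1),
    ("Built a search engine for legal documents.", -1),
    ("Managed a team of 4 engineers.", -1),
]

weights = train_perceptron(TRAIN)
print(predict(weights, "PROJECTS"))
print(predict(weights, "Led the migration of a billing system."))
```

Notice that the cues a rule-based script would hard-code (capitalization, short lines) reappear here as features, but the learner decides how much weight each one deserves. Six training lines are far too few for real use, and they are kept this small only so the example is readable.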
Check it out for yourself!
Here’s a set of slides.
It’s called “Fun with Text – Hacking Text Analytics”.