Category: Linguistics

Fun with Text – Managing Text Analytics

The year is 2016.

I’m a year older than when I designed the text analytics lecture titled “Fun with Text – Hacking Text Analytics”.

Yesterday, I found myself giving a follow-on lecture titled “Fun with Text – Managing Text Analytics”.

Here are the slides:

“Hacking Text Analytics” was meant to help students understand a range of text analytics problems by reducing them to simpler problems.

But it was designed with the understanding that they would hack their own text analytics tools.

However, in project after project, I was seeing that engineers tended not to build their own text analytics tools, but instead rely on handy and widely available open source products, and that the main thing they needed to learn was how to use them.

So, when I was asked to lecture to an audience at the NASSCOM Big Data and Analytics Summit in Hyderabad, and was advised that a large part of the audience might be non-technical, and could I please base the talk on use-cases, I tried a different tack.

So I designed another lecture “Fun with Text – Managing Text Analytics” about:

  • 3 types of opportunities for text analytics that typically exist in every vertical
  • 3 use cases dealing with each of these types of opportunities
  • 3 mistakes to avoid and 3 things to embrace

And the takeaway from it is how to go about solving a typical business problem (involving text) using text analytics.

Enjoy the slides!


Fun With Text – Hacking Text Analytics


I’ve always wondered if there was a way to teach people to cobble together quick and dirty solutions to problems involving natural language, from duct tape, as it were.

Having worked in the field for donkey’s years as of 2015, and having taught a number of text analytics courses along the way, I’ve seen students of text analysis stumble mostly on one of two hurdles:

1.  Inability to Reduce Text Analytics Problems to Machine Learning Problems

I’ve seen students, after hours of training, still revert to rule-based thinking when asked to solve new problems involving text.

You can spend hours teaching people about classification and feature sets, but when you ask them to apply their learning to a new task, say segmenting a resume, you’ll hear them very quickly falling back to thinking in terms of programming steps.

Umm, you could write a script to look for a horizontal line, followed by capitalized text in bold, big font, with the words “Education” or “Experience” in it !!!

2.  Inability to Solve the Machine Learning (ML) Problems

Another task that I have seen teams getting hung up on has been solving ML problems and comparing different solutions.

My manager wants me to identify the ‘introduction’ sections.  So, I labelled 5 sentences as introductions.  Then, I trained a maximum entropy classifier with them.  Why isn’t it working?

One Machine Learning Algorithm to Rule Them All

One day, when I was about to give a lecture at Barcamp Bangalore, I had an idea.

Wouldn’t it be fun to try to use just one machine learning algorithm, show people how to code up that algorithm themselves, and then show them how a really large number of text analytics problems (almost every single problem related to the semantic web) could be solved using it?

So, I quickly wrote up a set of problems in order of increasing complexity, and went about trying to reduce them all to one ML problem, and surprised myself!  It could be done!

Just about every text analytics problem related to the semantic web (which is, by far, the most important commercial category) could be reduced to a classification problem.

Moreover, you could tackle just about any problem using just two steps:

a) Modeling the problem as a machine learning problem

Spot the appropriate machine learning problem underlying the text analytics problem, and if it is a classification problem, the relevant categories, and you’ve reduced the text analytics problem to a machine learning problem.

b) Solving the problem using feature engineering

To solve the machine learning problem, you need to come up with a set of features that allows the machine learning algorithm to separate the desired categories.

That’s it!
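To make the two steps concrete, here is a minimal sketch in Python of the resume-segmentation task mentioned earlier, treated as a two-category classification problem ("heading" vs "body"). Everything in it is an assumption made for illustration: the tiny training set is invented, the feature names are hypothetical, and the classifier is a hand-rolled Naive Bayes with add-one smoothing, which may not be the algorithm used in the slides.

```python
import math
from collections import Counter, defaultdict

def features(line):
    """Step (b), feature engineering: turn a line of text into feature names.
    These features are illustrative guesses, not a recommended set."""
    words = line.split()
    feats = ["word=" + w.lower().strip(".,:;") for w in words]
    if len(words) <= 3:
        feats.append("short_line")
    if line.strip().isupper() or line.strip().istitle():
        feats.append("heading_case")
    return feats

class NaiveBayes:
    """A small multinomial Naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            for f in features(text):
                self.feat_counts[label][f] += 1
                self.vocab.add(f)

    def classify(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            # log P(label) + sum of log P(feature | label)
            score = math.log(count / total)
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            for f in features(text):
                score += math.log((self.feat_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Step (a), modeling: each line of a resume gets one of two categories.
# The training lines below are made up for this sketch.
TRAINING = [
    ("Education", "heading"),
    ("Work Experience", "heading"),
    ("Skills", "heading"),
    ("Projects", "heading"),
    ("I worked as a developer at a software company for three years.", "body"),
    ("Completed a degree in computer science with honours.", "body"),
    ("Built a web crawler in Python for a search startup.", "body"),
    ("Managed a team of five engineers on a billing product.", "body"),
]

model = NaiveBayes()
model.train(TRAINING)
print(model.classify("Professional Experience"))
print(model.classify("Worked as a consultant for a bank."))
```

A real system would need far more labelled lines than this, and most of the effort would go into trying different feature sets and comparing the results.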

Check it out for yourself!

Here’s a set of slides.

It’s called “Fun with Text – Hacking Text Analytics”.

Funky language features – the mystery of the missing possessive verb

The verb ‘have’ is used to indicate possession.  When a speaker of the English language says, “I have a car”, the listener can infer that the speaker possesses a car.

“Have” is a word that we use a lot.  I doubt anyone can imagine English without the word “have” in it.

So, it will come as a surprise to many to know that many Indian languages have no such verb.

Yes, you heard it right.  Many Indian languages have no verb like “have”.

Speakers of those languages say “There is a car near me” instead.

Below is “I have a vehicle” in three Indian languages:

Tamil:  En kitta vandi irukku  (translation into English: there is a vehicle near me)

Kannada:  Nanna hatthira gaadi idhe   (translation into English: there is a vehicle near me)

Hindi:  Mere paas gaadi hai    (translation into English: there is a vehicle near me)

Expressing Possession in Asian Languages

Some other Asian languages lack a word for “have”.

Japanese does not have a word for “have”.  Neither does Korean.

In Malay, the word for “is” is “ada”.

But “ada” can be used to mean “have” as well, as you can see from the examples below.

In the following examples, “saya” means “mine/my” (the meanings of the other Malay words are obvious).

Malay: Guru saya ada motokar baru.   (translation:  My teacher has a new car)

Malay: Bapa saya ada di rumah.      (translation:  My father is in the house)

Mandarin Chinese is an exception to this pattern.  It has a verb meaning “have”.  It is 有 (yǒu).  有 (yǒu) can also mean “to exist”, but the word commonly used for “is” is different.  It is 是 (shì) meaning “to be”.

So, a good number of widely spoken languages in South Asia don’t use a possessive verb.

But this does not mean that these Asian languages lack a mechanism to express possession.

It only means that the expression of possession and ownership uses alternative mechanisms such as idiomatic expressions (“is near” in the case of Indic languages) and context (word order and semantics in the case of Malay) in large parts of South and South-East Asia.

Expressing Possession in European Languages

In Europe, the possessive verb seems to be the preferred tool to denote possession.

We’ve already encountered the verb “have” in English, and we know that it is distinct from the verb “is”.

Below are examples from a few other European languages:

French:

I am = Je suis

I have = J’ai

Polish:

I am = Jestem

I have = mam

Modern Greek:

I am = Είμαι (Eímai)

I have = έχω (écho̱)

Latin:

I am = sum

I have = habeo

Expressing Possession in Sanskrit

Sanskrit, unlike ancient Greek and Latin, does not have a possessive verb.

I asked a Sanskrit scholar if possessive verbs like “have” appear anywhere in the Vedas.

He answered in the negative.

There is no evidence for the existence of possessive verbs in Vedic Sanskrit.

Some Interpretations and Flights of Fantasy

Some economists surmise that early human societies (hunter-gatherer societies) did not know the concept of ownership.

In early human societies, food from a hunt was shared, because it could not be hoarded (there was only so much food that one could eat, and what was not eaten would spoil).

So, early languages would not have had a verb like “have”.

The most important conversations in those languages would have been sort of like:

Person 1:  “Is there food?”

Person 2:  “Nope.  There is no food today.”

Another type of conversation that would have been critical to self-preservation would have gone like this:

Person 1:  “There is a tiger behind you!  Run!”

Person 2:  “There is an antelope to your right!”

In societies centered around herding, the herds could have been common property.

Daily conversations would have gone:

Person 1:  “How many cows are there?”

Person 2:  “There are 200 cows.”

Sentences like “I have thirty cows” weren’t yet needed.

Economists surmise that it was farming that gave rise to concepts like ownership and property.

Farming for the first time allowed people to have a surplus of food.

This excess food could be stored, divided and traded.

Trade might have motivated the invention of language tools for talking about ownership.

It seems that in Europe languages converged on one such tool – the possessive verb.

It seems that in India languages chose another such tool – the idiomatic usage of the verb “is near”.

Historical Linguistics Questions

There is no evidence for the use of possessive verbs in Sanskrit.

However, I do not know if ancient (Vedic) Sanskrit used the idiomatic “is near” mechanism found in modern Indian languages for expressing ownership.

If it didn’t, it would suggest that the Indian vernacular mechanism for expressing ownership evolved after the period of time when the Vedas were composed or in a different geographical area.

If it did, it would suggest that the Vedas were composed after the Indian mechanisms for expressing possession were developed and in the same geographical area (assuming accurate oral transmission that preserved ancient language features).

I’d be very grateful if someone with a better knowledge of Vedic Sanskrit would be able to tell me whether such an idiomatic usage of “is near” to indicate ownership is attested in Vedic Sanskrit texts.

I’d also love to find out what mechanisms for expressing the idea of ownership existed in Old and Avestan Persian.

(Modern Persian – Farsi – has a verb “daestaen” meaning “have”, but Farsi is very different from Old Persian).

I’ve made a lot of assumptions in proposing those historical implications.  But this article was written merely to discuss possibilities.

ADDENDUM:

I’ll add examples from other languages below as and when I get them from readers (with their permission to post them here).

Arabic

Omar Khayyam (http://www.linkedin.com/profile/view?id=97267188) in a comment on LinkedIn (http://www.linkedin.com/groups/Funky-language-features-mystery-missing-1356867.S.5838734689329766403) said:

Arabic has no “have”. You don’t need a verb to say “I have a car” = “عِــنْــدِي سَــيَّـــارَةٌ” (By me a car). Nevertheless, there are the verbs “مَــلَــكَ” and “امْتَلَكَ” (to possess/own), which are used to stress that something belongs to someone, like, for example, in juridical documents. In a newspaper article you’d write “الأمير الوليد يمتلك طائرة خاصّة من نوع بوينغ ٧٤٧” (Prince Al-Walid owns a Boeing 747) rather than “عِنْدَ الأمير الوليد طائرة خاصّة من نوع بوينغ ٧٤٧”, even if it is grammatically correct.
As to the verb “to be”, Arabic has no need of it in the present tense. For example, “مَلِكُ الـمَـغْرِبِ غَـنِــيٌّ جِدَّا ” (word for word = The King of Morocco very rich). But in the past you need the verb “كَـانَ ” (to be/to exist). For example, “كَـانَ الملك الحسن الثّاني غنيّا جدّا ” (King Hassan II was very rich).

Funky language features – the third spatial deictic reference in Japanese, Korean and Tamil

The words ‘here’ and ‘there’ are spatial deictic references that are familiar to all English speakers.

‘Here’ means ‘near the speaker’.

‘There’ means ‘not near the speaker’.

Two words related to ‘here’ and ‘there’ are ‘this’ and ‘that’, which work much like ‘the’ but refer to things that are ‘near the speaker’ or ‘not near the speaker’.

So, in English, all spatial deictic references are relative to the speaker.

Here is an illustration of spatial deixis taken from the Wikipedia article on deixis.

But there are languages in which there are more than two spatial deictic references.

Japanese, Korean and Tamil have three each.

In Japanese, they are koko, soko and asoko.

In Korean, they are yogi, kugi and chogi.  (Here is a very nice lesson on deixis in Korean http://www.talktomeinkorean.com/lessons/l1l7).

In Tamil, they are inge, unge and ange.

The reason for the additional deictic reference is that in these languages, distances are perceived not just with respect to the speaker, but also with respect to the listener.

So,  in Japanese, Korean and Tamil respectively, koko, yogi and inge mean ‘near the speaker’.

Then, soko, kugi and unge mean ‘near the listener’.

Finally, asoko, chogi and ange mean ‘far from both the speaker and the listener’.
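The three-way system described above can be sketched as a small lookup in Python. The word lists come from this post; the function name and the tie-breaking rule (an object near both people is treated as "near the speaker") are assumptions made for illustration, not claims about how native speakers actually choose.

```python
from typing import NamedTuple

class Deictics(NamedTuple):
    near_speaker: str
    near_listener: str
    far_from_both: str

# Romanized forms as given in this post.
SYSTEMS = {
    "Japanese": Deictics("koko", "soko", "asoko"),
    "Korean": Deictics("yogi", "kugi", "chogi"),
    "Tamil": Deictics("inge", "unge", "ange"),
}

def deictic(language, near_speaker, near_listener):
    """Pick a spatial deictic from where the object is relative to both people."""
    words = SYSTEMS[language]
    if near_speaker:
        # Assumption: the speaker-proximal form wins when both are near.
        return words.near_speaker
    if near_listener:
        return words.near_listener
    return words.far_from_both

print(deictic("Japanese", near_speaker=False, near_listener=True))
print(deictic("Tamil", near_speaker=False, near_listener=False))
```

In a two-way system like English, the `near_listener` argument would simply be ignored, which is the sense in which English deixis is relative to the speaker alone.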

The “near the listener” deixis seems like a rather useless feature to have in a language (it is disappearing from modern Tamil).

In the modern world, when you talk to someone face to face (not on the phone), you are usually standing just a few feet from them.

So, anything “near the speaker” is also “near the listener”.  One of those spatial references is therefore redundant.

But then, if one of the spatial references was so useless, why did it appear in Korean and Japanese in addition to Tamil?

Perhaps it has something to do with the fact that Korea and South India are peninsulas, and Japan is an island.

All three countries have long coastlines.

So, some ancestors of the inhabitants of Korea, Japan and South India might have lived off of deep-water fishing.

On the ocean there is an immediate use for the “near the listener” deictic.

Imagine a fleet of boats spread out on the ocean looking for fish to spear or net.

The boatmen would have no features to use to communicate directions.

The only features they’d have had to identify positions would have been their own boats.

So, they’d probably have had conversations with each other that went as follows:

Boat 1:  Are there any fish near you (the listener)?

Boat 2:  No, there are no fish near me (the speaker).  Are there any fish near you (the listener)?

Boat 1:  No, there are no fish near me (the speaker).  We should look for fish away from both of us (pointing).

In such conversations, all three deictics would have been used.

The sentence “Are there any fish near you (the listener)?” would have used the word soko (in Japanese), kugi (in Korean) and unge (in Tamil).

The sentence “No, there are no fish near me (the speaker)” would have used the word koko (in Japanese), yogi (in Korean) and inge (in Tamil).

The sentence “We should look for fish away from both of us (pointing)” would have used the word asoko (in Japanese), chogi (in Korean) and ange (in Tamil).

I am just guessing at all this, of course.  Part of the fun of working in linguistics is that you can extrapolate from tenuous linguistic clues, and indulge in wild flights of fantasy.

But what I am proposing is not entirely unimaginable.

In 2011, in a small cave (called the Jerimalai cave) in East Timor, archaeologists found bones from 2843 individual fish, some of which were caught 42000 years ago.  50% of the bones were those of deep-water tuna fish. The finds also included fish hooks dating from between 23000 and 16000 years ago.

More details on the Jerimalai find here: http://news.discovery.com/history/archaeology/ancient-human-fishermen-111128.htm

Should Cecilia have said “insecure” instead of “unsecure”?

In this funny PhD Comic, the main character – Cecilia (the girl in red) – says:

“Do you realize how unsecure your coffee distribution system is?”

That made me wonder – should she have said ‘insecure’?

Even the WordPress spell-checker has a problem with “unsecure”.

It thinks that “unsecure” is a spelling error.

However, the word “insecure” doesn’t sound as if it were the right term to use in the context of computer security.

That is because the word “insecure” is usually used in the context of a person to mean a person who is not confident and self-assured.

To call a computer “insecure” would be a bit like saying that the computer had self-image issues.

Others have written about this cognitive dissonance as well (see http://english.stackexchange.com/questions/19653/insecure-or-unsecure-when-dealing-with-security for a nice discussion).

Given the problem, the author of the cartoon seems to be justified in using a newly-minted word (one not found in any dictionary) in order to describe the lack of security.

This is also very interesting because it throws some light on how words are born.

Before I can explain what I mean, I’ll need you to take a look at the Oxford dictionary’s definitions of the word “insecure” (from the Oxford English Dictionary online search at http://oxforddictionaries.com/definition/english/insecure?q=insecure):

insecure

adjective

  • 1   uncertain or anxious about oneself; not confident:  a rather gauche, insecure young man,  a top model who is notoriously insecure about her looks
  • 2   (of a thing) not firm or fixed; liable to give way or break:  an insecure footbridge 

                 not sufficiently protected; easily broken into:  an insecure computer system

  • 3   (of a job or situation) liable to change for the worse; not permanent or settled:  badly paid and insecure jobs,  a financially insecure period

There are three ways in which the word “insecure” can be used.

The second usage would have been perfect for the context of computer security.

But the first usage might be conflated with the second in that context.

And that is because (sorry, I no longer recall the references to support this claim) computers appear to the human mind to have human-like characteristics (we say things like “Google tells me that …” or “my computer has gone to sleep”).

So, the only word in the dictionary that can do the job – the word “insecure” – has a conflict of interest.

And therefore, a new word needs to be coined that is not susceptible to the same sort of ambiguity.

And if the new word “unsecure” catches on, then one day, the second sense of the word “insecure” could become extinct in the context of computers.

Oh well, “it’s only words!”

POST EDIT

A friend pointed out that the Google NGram Viewer shows a history of the use of the word “unsecure”: http://books.google.com/ngrams/graph?content=unsecure.

The word seems to have been in use between 1650 and 1850 (there is evidence of use in literature), and has in more recent times simply fallen out of circulation (being eclipsed by “insecure” in around 1750).  Thanks, Prashant.

(You can also search for those early usages in books – http://books.google.com/books?id=WmpCAAAAcAAJ&pg=PA12&dq=%22unsecure%22&hl=en&sa=X&ei=aOcLUq7aA-3iyAHu8YGwAg&ved=0CDMQ6AEwAA#v=onepage&q=%22unsecure%22&f=false)

Japanese and Tamil – The Work of Susumu Ohno


My father recently pointed me to the research work of Dr. Susumu Ohno, a Japanese linguist who studied ancient Japanese as well as ancient Tamil (a language spoken in South India).

Dr. Ohno (in a paper titled “The Genealogy of the Japanese Language”) made a number of interesting observations about phonological similarities and the existence of cognates (similar-sounding words) in some forms of both languages.

For example, he noted that in some dialects of Japanese, the words for “father”, “mother”, “elder brother” and “elder sister” are similar to the words used in Tamil.

In some Honshu and Ryukyu dialects of Japanese, the words for father, mother, elder brother and elder sister are “accha”, “aaya”, “annyaa” and “anne”.  Ohno argues that these words resemble the words “acchan”, “aaya”, “anna” and “annai” in Tamil.

I found that his observations supported some arguments that I had made in a blog entry in 2010 (I’d attempted to draw a 3-way comparison between Japanese, Tamil and Australian aboriginal languages).

He proposes a theory that in early Japanese, there were no e and o sounds – that these sounds were replacements for ai or ia and ua.

I quote:

The vowels in group B are believed to have resulted from the merging of two vowels, as follows:  ia>e, ai>e, ui>i, oi>i, ua>o

Though I don’t have a reference, I am told that T. P. Meenakshi Sundaram made an almost identical assertion in the case of Tamil.

You also see some evidence of such a transformation in the Tulu word “yan-ku” (to me). The corresponding word in Tamil is “en-akku”.  The correspondence makes you think that sometime in the past, they used to say “yan-akku” in Tamil instead of “en-akku”.

You see a similar correspondence in Kannada.  The Kannada word for why can be written and pronounced as “yaake” or as “eke”.  So “ia” seems to be replaceable with “e” there.

Similarly in Tamil, the word “evan” (who) can also be pronounced (colloquially) as “yaveng”.

So, if both ancient Tamil and Japanese used just a, i and u sounds, their phonetics begins to resemble that of Australian languages like Dyirbal.

Regarding consonants, Ohno notes the following correspondences:

Japanese

Consonants at head of word
k-, s-, t-, n-, F-, m-, y-, w-

Consonants mid-word
-k-, -s-, -t-, -n-, -F-, -m-, -y-, -w-,
-r-, -ng-, -nz-, -nd-, -nb-

Tamil

Consonants at head of word
k-, c-, t-, n-, n-, p-, m-, y-, v-

Consonants mid-word
-k-, -c-, -t-, -n-, -p-, -m-, -y-, -v-,
-t-, -n-, -r-, -l-, -r-, -l-, -r-,
-nt-, -nc-, -nt-, -mp-

Unfortunately, I don’t know Dyirbal or any Australian language.  So, I can’t check if these rules apply to them as well.  I can’t wait to get hold of a linguistic analysis of Dyirbal by an Indian or Japanese linguist.

A work of literature that plays with word patterns (Melanctha by Gertrude Stein)

This blog post is about a story titled “Melanctha” by Gertrude Stein, a novelist who lived at about the same time as Renoir and Picasso.  Picasso painted a portrait of her in eighty sittings that spanned a year, and finally ended by painting out her face and replacing it with a mask.

I recently came across her writings in a book by the name of “Three Lives”.

The description of the writings in the introduction was very intriguing, so I picked up the book.

In the introduction, I read that Gertrude had had a very high opinion of the importance of her writing and had once said “think of the Bible and Homer, think of Shakespeare, and think of me”.  I also read that she considered the second story in “Three Lives”, about a girl called “Melanctha”, to be “the first definitive step away from the nineteenth century and into the twentieth century in literature”.

The story did not disappoint.

The language in “Melanctha” was very different from anything I’ve ever read, and it produced a very pleasant sensation.  It’s about the language that I want to write.

One interesting thing about the language is that in some parts phrase patterns appear in pairs and with a rhythm.

Here is an example of the pairing of sentences that is so interesting in the language:

“Jeff Campbell sat in his room, very quiet, a long time, after he got through reading this letter.  He sat very still and first he was very angry.  As if he, too, did not know very badly what it was to suffer keenly.  As if he had not been very strong to stay with Melanctha when he knew what it was that she really wanted.  He knew he was very right to be angry, he knew he really had not been a coward.  He knew Melanctha had done many things it was very hard for him to forgive her”

In some parts, the repetition gives rise to sentences like:

“Good night now, Dr. Campbell, I call you if I need you later to help me, Dr. Campbell, I hope you rest well, Dr. Campbell.”

I found an essay on Stein’s work online that said that Stein had tried to write memory-less literature, where the literature kept itself always in the present by not relying on the reader’s memories of past sentences.

But it seemed to me that there was a sort of similarity with impressionist painting if you considered the granularity of language used in rendering the story.

What Gertrude’s writing had in common with impressionist paintings, it seemed to me, was a form of broad, rough brush-strokes.

So, it seemed to me that the phrases that were often repeated in quick succession fused within themselves to become separate units of expression, and therefore the smallest units of expression that Gertrude’s stories were built of were not single words, but phrases made of many words, making the language richer and more beautiful.

There was also a certain musicality in the prose.  There was a certain way for certain phrases to be repeated time and again, like a musical theme, for example, the line “what you mean by what you were saying” which, with its variants appears time and time again in the story.

Finally, I found it hilarious to read in the story a passage that was very similar to things that Ramana Maharshi and Osho had said about “thinking”, which I had quoted in an older blog post in November (https://aiaioo.wordpress.com/2012/11/22/contradictions-in-some-thoughts-on-thinking/).

I had quoted the following:

To bring about peace means to be free from thoughts and to abide as Pure Consciousness. ~ Sri Ramana Maharshi

Thoughts can create such a barrier that even if you are standing before a beautiful flower, you will not be able to see it. Your eyes are covered with layers of thought. To experience the beauty of the flower you have to be in a state of meditation, not in a state of mentation. You have to be silent, utterly silent, not even a flicker of thought – and the beauty explodes, reaches to you from all directions. You are drowned in the beauty of a sunrise, of a starry night, of beautiful trees.  ~ Osho

I had quoted the above and commented that those who wrote that must have thought a lot about thinking.

In the story “Melanctha” there was a meme that was similar to the above lines that I had quoted.

I quote from the story:

“Don’t you ever stop with your thinking long enough ever to have any feeling Jeff Campbell,” said Melanctha a little sadly.