Tag: linguistics

Fun with Text – Managing Text Analytics

The year is 2016.

I’m a year older than when I designed the text analytics lecture titled “Fun with Text – Hacking Text Analytics“.

Yesterday, I found myself giving a follow-on lecture titled “Fun with Text – Managing Text Analytics”.

Here are the slides:

“Hacking Text Analytics” was meant to help students understand a range of text analytics problems by reducing them to simpler problems.

But it was designed with the understanding that they would hack their own text analytics tools.

However, in project after project, I saw that engineers tended not to build their own text analytics tools. Instead, they relied on handy and widely available open source products, and the main thing they needed to learn was how to use them.

So, when I was asked to lecture to an audience at the NASSCOM Big Data and Analytics Summit in Hyderabad, and was advised that a large part of the audience might be non-technical, and could I please base the talk on use-cases, I tried a different tack.

So I designed another lecture “Fun with Text – Managing Text Analytics” about:

3 types of opportunities for text analytics that typically exist in every vertical
3 use cases dealing with each of these types of opportunities
3 mistakes to avoid and 3 things to embrace

And the takeaway is how to go about solving a typical business problem (involving text) using text analytics.

Enjoy the slides!

Visit Aiaioo Labs

Languages and Numbers and Ways of Counting to 8 !

This article is about how small numbers are represented in various languages.

Acknowledgement: much of this article is taken from the Wikipedia page about positional notation.

Bases

The base is the mathematical term for the number of digits you would use to count in a language.

For example, if you used the fingers of both hands to count, you would be using a base of 10.

If you used the fingers of one hand to count, you would be using a base of 5.

If you used the fingers of both hands and the toes of both feet, you would be using a base of 20.
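To make the idea of a base concrete, here is a minimal sketch that writes a number as a list of digits in any base. The function name `to_base` is my own choice for illustration, not something from the article's sources.

```python
def to_base(n: int, base: int) -> list[int]:
    """Return the digits of a non-negative integer n in the given base,
    most significant digit first."""
    if n < 0 or base < 2:
        raise ValueError("n must be >= 0 and base must be >= 2")
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        # Peel off the least significant digit on each pass.
        n, remainder = divmod(n, base)
        digits.append(remainder)
    return digits[::-1]


print(to_base(87, 10))  # [8, 7]    -> eight tens and seven
print(to_base(87, 20))  # [4, 7]    -> four twenties and seven
print(to_base(87, 5))   # [3, 2, 2] -> three twenty-fives, two fives and two
```

The same quantity gets a different digit sequence depending on whether you group by hands (5), by both hands (10), or by fingers and toes (20).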

Base-20

Some languages have names for numbers that lead you to suspect that their users might have thought in terms of groups of 20.

French has an interesting way of describing numbers above 60. In French, the word for 60 is “soixante” and the word for 75 is “soixante-quinze” (sixty and fifteen), while 80 is “quatre-vingts” (four twenties) and 95 is “quatre-vingt-quinze” (four twenties and fifteen).
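The pattern can be sketched in a few lines of code. This is only an illustration for the numbers 60 to 99, using the traditional French spelling (hyphens, with “et” before 1 and 11 in the sixties). The function name `french_60_to_99` is hypothetical.

```python
# French names for 1 to 19; index 0 is left empty on purpose.
UNITS = ["", "un", "deux", "trois", "quatre", "cinq", "six", "sept", "huit",
         "neuf", "dix", "onze", "douze", "treize", "quatorze", "quinze",
         "seize", "dix-sept", "dix-huit", "dix-neuf"]


def french_60_to_99(n: int) -> str:
    """Name a number from 60 to 99 as a block of twenty plus 0..19."""
    if not 60 <= n <= 99:
        raise ValueError("this sketch only covers 60 to 99")
    block = 60 if n < 80 else 80
    rest = n - block  # always between 0 and 19
    if block == 60:
        if rest == 0:
            return "soixante"
        if rest in (1, 11):
            return "soixante et " + UNITS[rest]
        return "soixante-" + UNITS[rest]
    if rest == 0:
        return "quatre-vingts"  # the plural 's' appears only on exactly 80
    return "quatre-vingt-" + UNITS[rest]


print(french_60_to_99(75))  # soixante-quinze
print(french_60_to_99(80))  # quatre-vingts
print(french_60_to_99(95))  # quatre-vingt-quinze
```

Notice that the remainder runs all the way up to 19 before the name changes, which is what you would expect of a system built on groups of twenty.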

And it is not just French. English uses the word ‘score’ to describe a group of 20 things. So, when we talk of “two score” we mean forty, and when we say “four score and seven” we mean 87.

The article also talks about Welsh and Irish and Maori:

The Irish language also used base-20 in the past, twenty being fichid, forty dhá fhichid, sixty trí fhichid and eighty ceithre fhichid. A remnant of this system may be seen in the modern word for 40, daoichead.

The Welsh language continues to use a base-20 counting system, particularly for the age of people, dates and in common phrases. 15 is also important, with 16–19 being “one on 15”, “two on 15” etc. 18 is normally “two nines”. A decimal system is commonly used.

Danish numerals display a similar base-20 structure.

The Maori language of New Zealand also has evidence of an underlying base-20 system as seen in the terms Te Hokowhitu a Tu referring to a war party (literally “the seven 20s of Tu”) and Tama-hokotahi, referring to a great warrior (“the one man equal to 20”).

Base-12

Another interesting system is the base-12 system.

The Wikipedia article says:

Twelve is a useful base because it has many factors. It is the smallest common multiple of one, two, three, four and six. There is still a special word for “dozen” in English, and by analogy with the word for 10², hundred, commerce developed a word for 12², gross. The standard 12-hour clock and common use of 12 in English units emphasize the utility of the base. In addition, prior to its conversion to decimal, the old British currency Pound Sterling (GBP) partially used base-12; there were 12 pence (d) in a shilling (s), 20 shillings in a pound (£), and therefore 240 pence in a pound. Hence the term LSD or, more properly, £sd.
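The mixed bases in the old currency are easy to see in code. Here is a small sketch that breaks a number of pence into pounds, shillings and pence using the 12 and 20 from the quote above. The function name `pence_to_lsd` is my own.

```python
PENCE_PER_SHILLING = 12
SHILLINGS_PER_POUND = 20


def pence_to_lsd(total_pence: int) -> tuple[int, int, int]:
    """Split a non-negative number of old pence into (pounds, shillings, pence)."""
    if total_pence < 0:
        raise ValueError("total_pence must be >= 0")
    shillings, pence = divmod(total_pence, PENCE_PER_SHILLING)
    pounds, shillings = divmod(shillings, SHILLINGS_PER_POUND)
    return pounds, shillings, pence


print(pence_to_lsd(240))  # (1, 0, 0)
print(pence_to_lsd(500))  # (2, 1, 8)
```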

Base-2

There was even a language that made use of a base-2 (binary) system for counting. Base-2 (binary) is mainly used in computers today (because switches can represent binary numbers – a switch that is off represents the 0 digit and a switch that is on represents the 1 digit). But apparently, native Australian languages use binary too.

A number of Australian Aboriginal languages employ binary or binary-like counting systems. For example, in Kala Lagaw Ya, the numbers one through six are urapon, ukasar, ukasar-urapon, ukasar-ukasar, ukasar-ukasar-urapon, ukasar-ukasar-ukasar.
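The six names quoted above follow a simple rule: count off as many twos (“ukasar”) as possible, then add a one (“urapon”) if anything is left over. Here is a sketch of that rule. The quote only attests the names for one through six, so anything the function returns beyond six is my extrapolation, not attested usage.

```python
def binary_like_name(n: int) -> str:
    """Build a Kala Lagaw Ya style name from groups of two plus a leftover one."""
    if n < 1:
        raise ValueError("n must be >= 1")
    twos, ones = divmod(n, 2)
    parts = ["ukasar"] * twos + ["urapon"] * ones
    return "-".join(parts)


for n in range(1, 7):
    print(n, binary_like_name(n))
```

Strictly speaking this is pair-counting rather than positional binary notation, which is why the quote calls such systems “binary-like”.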

Base-8

The article also says that there is some evidence of the use of base-8 in language:

A base-8 system (octal) was devised by the Yuki tribe of Northern California, who used the spaces between the fingers to count, corresponding to the digits one through eight.^[6] There is also linguistic evidence which suggests that the Bronze Age Proto-Indo Europeans (from whom most European and Indic languages descend) might have replaced a base-8 system (or a system which could only count up to 8) with a base-10 system. The evidence is that the word for 9, newm, is suggested by some to derive from the word for “new”, newo-, suggesting that the number 9 had been recently invented and called the “new number”.^[7]

So much for bases.

Some languages have two sets of names for numerals!

Two Sets of Names for Numbers in Japanese and Korean

Japanese and Korean use two sets of names for numbers while counting.

In Japanese, there is a set of names that are typically used when small quantities are involved:

“hitotsu”, “futatsu”, “mittsu”, “yottsu”, “itsutsu“, “muttsu”, “nanatsu“, “yattsu“, “kokonotsu“, “to” (1 to 10).

But for larger numbers and for zero, the names used are ones derived from Chinese.

“ichi”, “ni”, “san”, “shi”, “go”, “roku”, “shichi”, “hachi”, “kyu”, “ju”.

These numbers correspond to the Chinese digits:

“yī”, “èr”, “sān”, “sì”, “wǔ”, “liù”, “qī”, “bā”, “jiǔ”, “shí”.

And similarly in Korean, you would use one set of names for small quantities (for example, hours in the day):

“hana”, “dul”, “seth”, “neth”, “thasoth”, “yosoth”, “ilgop”, “yodolp”, “ahop”, “yol”.

But to describe larger quantities, like minutes or the days in a month, you’d go with names based on Chinese:

“il”, “i”, “sam”, “sa”, “o”, “yug”, “chhil”, “phal”, “ku”, “ship”.

Finally, we come to some interesting irregularities in south Indian languages.

Irregular Numbering

In Tamil (a language spoken in south India), the word for 90 is “pre-hundred”.

The first ten numbers in Tamil go:

“ondru”, “irendu”, “muundru”, “naangu”, “aindhu”, “aaru”, “eelu”, “ettu”, “ombadhu”, “patthu”

But the word “ombadhu” which means 9 is not used in 90.

Tamil

In Tamil, the name for 80 is derived from the name for 8 by adding a suffix like in English. Just as “eight” becomes “eight-y”, in Tamil, “ettu” becomes “embathu”.

But the name for 90 is not derived from the number for 9. Instead, it is “pre-hundred”. (In Tamil, 90 is “thonnuuru” – hundred being “nuuru”). So, when counting from 90 to 99, you use the suffix one would normally associate with the hundreds position.

So 91 is “pre-hundred and one”. It is pronounced “thonnuutri-ondru” in Tamil. 92 is “pre-hundred and two”. It is pronounced “thonnuutri-rendu” in Tamil.

I’ve not come across many languages in which 90 is described as pre-hundred. But Hindi (a language from the north of India) has a similar feature.

Hindi

In many Indian languages spoken in the north of India, the names of the first ten numbers are similar to their names in Latin. For example, Hindi has:

“ek”, “dho”, “thiin”, “chaar”, “paanch”, “che”, “saath”, “aaT”, “nov”, “dhas”

The Hindi names for various numbers are similar to the Sanskrit names of those numbers:

“ekam”, “dve”, “thriini”, “chathvaari”, “pancha”, “shath”, “saptha”, “ashta”, “nava”, “dhasha”

But when you get to 29 in Hindi, you say “pre-30”. The word in Hindi is “unthees” (“thees” means 30 in Hindi).

Similarly, 39 is “pre-40” (“unchaaliis” where “chaaliis” means 40).
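The “pre-” pattern can be written down as a tiny rule: to name a number ending in 9, take the name of the next multiple of ten and put “un” in front of it. The sketch below covers only the two cases mentioned above (29 and 39). The table `TENS` and the function name `pre_ten_name` are mine, and the transliterations are the ones used in this article.

```python
# Only the two tens named in the text; other entries are deliberately left out.
TENS = {30: "thees", 40: "chaaliis"}


def pre_ten_name(n: int) -> str:
    """Name a number ending in 9 as 'pre-' plus the next multiple of ten."""
    next_ten = n + 1
    if next_ten not in TENS:
        raise ValueError(f"no entry for {next_ten} in this small sketch")
    return "un" + TENS[next_ten]


print(pre_ten_name(29))  # unthees
print(pre_ten_name(39))  # unchaaliis
```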

This is different from how you count in Sanskrit.

Sanskrit

In Sanskrit, 39 is “navatrimshat” (nine and thirty) and 29 is “navavimshatihi” (nine and twenty).

Now the absence of a regular name for numbers with 9 in them supports a theory that Indic languages might once have used base-8 for counting.

I quote from the Wikipedia article again:

There is also linguistic evidence which suggests that the Bronze Age Proto-Indo Europeans (from whom most European and Indic languages descend) might have replaced a base-8 system (or a system which could only count up to 8) with a base-10 system. The evidence is that the word for 9, newm, is suggested by some to derive from the word for “new”, newo-, suggesting that the number 9 had been recently invented and called the “new number”.[7]

The assertion seems to have been made in an article titled ‘The Indo-European system of numerals from ‘1’ to ‘10’’ by Eugenio Ramón Luján Martínez.

Eugenio argues that each of the numerals in Indo-European languages gradually came into use when required by necessity, starting with the numbers 2 and 3 (which started as deictics – like in the words ‘duo’ and ‘trio’).

There’s an overview of his arguments in this article: http://smallislandnotesan.blogspot.in/2008/01/indo-european-numbers-1-10.html

Counting on the Fingers

To a twenty-first century human, a base-10 system of counting seems like the natural way to count.

But to early humans, a base-8 system could have felt more natural to count with than a base-10 system.

This is because it is only possible to count to ten on the fingers of one’s hands if one has developed the technique of bending them to mark the number up to which one has counted.

If a person uses the technique of touching the thumb to a finger to mark a count, then one can only count up to 4 on each hand (and therefore only up to 8 on both hands).

Indian musicians still keep count of the rhythmic patterns in music (the thaalas) by touching the tips of their fingers with the thumb (counting in multiples of 3 or 4).

So it is possible that at some point in the distant past, speakers of Indo-European languages did indeed count in groups of 8.

Funky language features – the mystery of the missing possessive verb

The verb ‘have’ is used to indicate possession. When a speaker of the English language says, “I have a car“, the listener can infer that the speaker possesses a car.

“Have” is a word that we use a lot. I doubt anyone can imagine English without the word “have” in it.

So, it will come as a surprise to many to know that many Indian languages have no such verb.

Yes, you heard it right. Many Indian languages have no verb like “have”.

Speakers of those languages say “There is a car near me” instead.

Below is “I have a vehicle” in three Indian languages:

Tamil: En kitta vandi irukku (translation into English: there is a vehicle near me)

Kannada: Nanna hatthira gaadi idhe (translation into English: there is a vehicle near me)

Hindi: Mere paas gaadi hai (translation into English: there is a vehicle near me)

Expressing Possession in Asian Languages

Some other Asian languages lack a word for “have”.

Japanese does not have a word for “have”. Neither does Korean.

In Malay, the word for “is” is “ada”.

But “ada” can be used to mean “have” as well, as you can see from the examples below.

In the following examples, “saya” means “mine/my” (the meanings of the other Malay words are obvious).

Malay: Guru saya ada motokar baru. (translation: My teacher has a new car)

Malay: Bapa saya ada di rumah. (translation: My father is in the house)

Mandarin Chinese is an exception to this pattern. It has a verb meaning “have”. It is 有 (yǒu). 有 (yǒu) can also mean “to exist”, but the word commonly used for “is” is different. It is 是 (shì) meaning “to be”.

So, a good number of widely spoken languages in South Asia don’t use a possessive verb.

But this does not mean that these Asian languages lack a mechanism to express possession.

It only means that the expression of possession and ownership uses alternative mechanisms such as idiomatic expressions (“is near” in the case of Indic languages) and context (word order and semantics in the case of Malay) in large parts of South and South-East Asia.

Expressing Possession in European Languages

In Europe, the possessive verb seems to be the preferred tool to denote possession.

We’ve already encountered the verb “have” in English, and we know that it is distinct from the verb “is”.

Below are examples from a few other European languages:

French:

I am = Je suis

I have = J’ai

Polish:

I am = Jestem

I have = mam

Modern Greek:

I am = Είμαι (Eímai)

I have = έχω (écho̱)

Latin:

I am = sum

I have = habeo

Expressing Possession in Sanskrit

Sanskrit, unlike ancient Greek and Latin, does not have a possessive verb.

I asked a Sanskrit scholar if possessive verbs like “have” appear anywhere in the Vedas.

He answered in the negative.

There is no evidence for the existence of possessive verbs in Vedic Sanskrit.

Some Interpretations and Flights of Fantasy

Some economists surmise that early human societies (hunter-gatherer societies) did not know the concept of ownership.

In early human societies, food from a hunt was shared, because it could not be hoarded (there was only so much food that one could eat, and what was not eaten would spoil).

So, early languages would not have had a verb like “have”.

The most important conversations in those languages would have been sort of like:

Person 1: “Is there food?”

Person 2: “Nope. There is no food today.”

Another type of conversation that would have been critical to self-preservation would have gone like this:

Person 1: “There is a tiger behind you! Run!”

Person 2: “There is an antelope to your right!”

In societies centered around herding, the herds could have been common property.

Daily conversations would have gone:

Person 1: “How many cows are there?”

Person 2: “There are 200 cows.”

Sentences like “I have thirty cows” weren’t yet needed.

Economists surmise that it was farming that gave rise to concepts like ownership and property.

Farming for the first time allowed people to have a surplus of food.

This excess food could be stored, divided and traded.

Trade might have motivated the invention of language tools for talking about ownership.

It seems that in Europe languages converged on one such tool – the possessive verb.

It seems that in India languages chose another such tool – the idiomatic usage of the verb “is near”.

Historical Linguistics Questions

There is no evidence for the use of possessive verbs in Sanskrit.

However, I do not know if ancient (Vedic) Sanskrit used the idiomatic “is near” mechanism found in modern Indian languages for expressing ownership.

If it didn’t, it would suggest that the Indian vernacular mechanism for expressing ownership evolved after the period of time when the Vedas were composed or in a different geographical area.

If it did, it would suggest that the Vedas were composed after the Indian mechanisms for expressing possession were developed and in the same geographical area (assuming accurate oral transmission that preserved ancient language features).

I’d be very grateful if someone with a better knowledge of Vedic Sanskrit would be able to tell me whether such an idiomatic usage of “is near” to indicate ownership is attested in Vedic Sanskrit texts.

I’d also love to find out what mechanisms for expressing the idea of ownership existed in Old and Avestan Persian.

(Modern Persian – Farsi – has a verb “daestaen” meaning “have”, but Farsi is very different from Old Persian).

I’ve made a lot of assumptions in proposing those historical implications. But this article was written merely to discuss possibilities.

ADDENDUM:

I’ll add examples from other languages below as and when I get them from readers (with their permission to post them here).

Arabic

Omar Khayyam (http://www.linkedin.com/profile/view?id=97267188) in a comment on LinkedIn (http://www.linkedin.com/groups/Funky-language-features-mystery-missing-1356867.S.5838734689329766403) said:

Arabic has no “have”. You don’t need a verb to say “I have a car” = “عِــنْــدِي سَــيَّـــارَةٌ” (By me a car). Nevertheless, there are the verbs “مَــلَــكَ” and “امْتَلَكَ” (to possess/own), which are used to stress that something belongs to someone, like, for example, in juridical documents. In a newspaper article you’d write “الأمير الوليد يمتلك طائرة خاصّة من نوع بوينغ ٧٤٧” (Prince Al-Walid owns a Boeing 747) rather than “عِنْدَ الأمير الوليد طائرة خاصّة من نوع بوينغ ٧٤٧ “, even if it is grammatically correct.
As to the verb “to be”, Arabic has no need of it in the present tense. For example, “مَلِكُ الـمَـغْرِبِ غَـنِــيٌّ جِدَّا ” (word for word = The King of Morocco very rich). But in the past you need the verb “كَـانَ ” (to be/to exist). For example, “كَـانَ الملك الحسن الثّاني غنيّا جدّا ” (King Hassan II was very rich).

Funky language features – the third spatial deictic reference in Japanese, Korean and Tamil

The words ‘here’ and ‘there’ are spatial deictic references that are familiar to all English speakers.

‘Here’ means ‘near the speaker’.

‘There’ means ‘not near the speaker’.

Two words related to ‘here’ and ‘there’ are ‘this’ and ‘that’ which work much like ‘the’ but refer to things that are ‘near the speaker’ or ‘not near the speaker’.

So, in English, all spatial deictic references are relative to the speaker.

Here is an illustration of spatial deixis taken from the Wikipedia article on deixis.

But there are languages in which there are more than two spatial deictic references.

Japanese, Korean and Tamil have three each.

In Japanese, they are koko, soko and asoko.

In Korean, they are yogi, kugi and chogi. (Here is a very nice lesson on deixis in Korean http://www.talktomeinkorean.com/lessons/l1l7).

In Tamil, they are inge, unge and ange.

The reason for the additional deictic reference is that in these languages, distances are perceived not just with respect to the speaker, but also with respect to the listener.

So, in Japanese, Korean and Tamil respectively, koko, yogi and inge mean ‘near the speaker’.

Then, soko, kugi and unge mean ‘near the listener’.

Finally, asoko, chogi and ange mean ‘far from both the speaker and the listener’.
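The three-way choice can be summarised as a small lookup. This is just a sketch of the rule as I have described it. The function name `choose_deictic` is hypothetical, and letting “near the speaker” win when something is near both people is my assumption.

```python
# (near the speaker, near the listener, far from both), as listed above.
DEICTICS = {
    "Japanese": ("koko", "soko", "asoko"),
    "Korean": ("yogi", "kugi", "chogi"),
    "Tamil": ("inge", "unge", "ange"),
}


def choose_deictic(language: str, near_speaker: bool, near_listener: bool) -> str:
    """Pick the spatial deictic for a location relative to speaker and listener."""
    near_me, near_you, far_from_both = DEICTICS[language]
    if near_speaker:
        return near_me  # assumption: the speaker's position takes precedence
    if near_listener:
        return near_you
    return far_from_both


print(choose_deictic("Japanese", near_speaker=False, near_listener=True))  # soko
print(choose_deictic("Tamil", near_speaker=False, near_listener=False))    # ange
```

English, by contrast, would need only the first test: ‘here’ if near the speaker, ‘there’ otherwise.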

The “near the listener” deixis seems like a rather useless feature to have in a language (it is disappearing from modern Tamil).

In the modern world, when you talk to someone face to face (not on the phone), you are usually standing just a few feet from them.

So, anything “near the speaker” is also “near the listener”. One of those spatial references is therefore redundant.

But then, if one of the spatial references was so useless, why did it appear in Korean and Japanese in addition to Tamil?

Perhaps it has something to do with the fact that Korea and South India are peninsulas, and Japan is an island.

All three countries have long coastlines.

So, some ancestors of the inhabitants of Korea, Japan and South India might have lived off of deep-water fishing.

On the ocean there is an immediate use for the “near the listener” deictic.

Imagine a fleet of boats spread out on the ocean looking for fish to spear or net.

The boatmen would have no features to use to communicate directions.

The only features they’d have had to identify positions would have been their own boats.

So, they’d probably have had conversations with each other that went as follows:

Boat 1: Are there any fish near you (the listener)?

Boat 2: No, there are no fish near me (the speaker). Are there any fish near you (the listener)?

Boat 1: No, there are no fish near me (the speaker). We should look for fish away from both of us (pointing).

In such conversations, all three deictics would have been used.

The sentence “Are there any fish near you (the listener)?” would have used the word soko (in Japanese), kugi (in Korean) and unge (in Tamil).

The sentence “No, there are no fish near me (the speaker)” would have used the word koko (in Japanese), yogi (in Korean) and inge (in Tamil).

The sentence “We should look for fish away from both of us (pointing)” would have used the word asoko (in Japanese), chogi (in Korean) and ange (in Tamil).

I am just guessing at all this, of course. Part of the fun of working in linguistics is that you can extrapolate from tenuous linguistic clues, and indulge in wild flights of fantasy.

But what I am proposing is not entirely unimaginable.

In 2011, in a small cave (called the Jerimalai cave) in East Timor, archaeologists found bones from 2,843 individual fish, some of which were caught 42,000 years ago. 50% of the bones were those of deep-water tuna fish. The finds also included fish hooks dating from between 23,000 and 16,000 years ago.

More details on the Jerimalai find here: http://news.discovery.com/history/archaeology/ancient-human-fishermen-111128.htm

Should Cecilia have said “insecure” instead of “unsecure”?

In this funny PhD Comic, the main character – Cecilia (the girl in red) – says:

“Do you realize how unsecure your coffee distribution system is?”

That made me wonder – should she have said ‘insecure’?

Even the WordPress spell-checker has a problem with “unsecure”.

It thinks that “unsecure” is a spelling error.

However, the word “insecure” doesn’t sound as if it were the right term to use in the context of computer security.

That is because the word “insecure” is usually applied to a person, to mean someone who is not confident and self-assured.

To call a computer “insecure” would be a bit like saying that the computer had self-image issues.

Others have written about this cognitive dissonance as well (see http://english.stackexchange.com/questions/19653/insecure-or-unsecure-when-dealing-with-security for a nice discussion).

Given the problem, the author of the cartoon seems to be justified in using a newly-minted word (one not found in any dictionary) in order to describe the lack of security.

This is also very interesting because it throws some light on how words are born.

Before I can explain what I mean, I’ll need you to take a look at the Oxford dictionary’s definitions of the word “insecure” (from the Oxford English Dictionary online search at http://oxforddictionaries.com/definition/english/insecure?q=insecure):

insecure

Pronunciation: /ˌɪnsɪˈkjʊə, ˌɪnsɪˈkjɔː/

adjective

1 uncertain or anxious about oneself; not confident: a rather gauche, insecure young man, a top model who is notoriously insecure about her looks

2 (of a thing) not firm or fixed; liable to give way or break: an insecure footbridge

not sufficiently protected; easily broken into: an insecure computer system

3 (of a job or situation) liable to change for the worse; not permanent or settled: badly paid and insecure jobs, a financially insecure period

There are three ways in which the word “insecure” can be used.

The second usage would have been perfect for the context of computer security.

But the first usage might be conflated with the second in that context.

And that is because (sorry, I no longer recall the references to support this claim) computers appear to the human mind to have human-like characteristics (we say things like “Google tells me that …” or “my computer has gone to sleep”).

So, the only word in the dictionary that can do the job – the word “insecure” – has a conflict of interest.

And therefore, a new word needs to be coined that is not susceptible to the same sort of ambiguity.

And if the new word “unsecure” catches on, then one day, the second sense of the word “insecure” could become extinct in the context of computers.

Oh well, “it’s only words!”

POST EDIT

A friend pointed out that the Google NGram Viewer shows a history of the use of the word “unsecure”: http://books.google.com/ngrams/graph?content=unsecure.

The word seems to have been in use between 1650 and 1850 (there is evidence of use in literature), and has in more recent times simply fallen out of circulation (being eclipsed by “insecure” in around 1750). Thanks, Prashant.

(You can also search for those early usages in books – http://books.google.com/books?id=WmpCAAAAcAAJ&pg=PA12&dq=%22unsecure%22&hl=en&sa=X&ei=aOcLUq7aA-3iyAHu8YGwAg&ved=0CDMQ6AEwAA#v=onepage&q=%22unsecure%22&f=false)

Japanese and Tamil – The Work of Susumu Ohno

My father recently pointed me to the research work of Dr. Susumu Ohno, a Japanese linguist who studied ancient Japanese as well as ancient Tamil (a language spoken in South India).

Dr. Ohno (in a paper titled “The Genealogy of the Japanese Language”) made a number of interesting observations about phonological similarities and the existence of cognates (similar-sounding words) in some forms of both languages.

For example, he noted that in some dialects of Japanese, the words for “father”, “mother”, “elder brother” and “elder sister” are similar to the words used in Tamil.

In some Honshu and Ryukyu dialects of Japanese, the words for father, mother, elder brother and elder sister are “accha”, “aaya”, “annyaa” and “anne”. Ohno argues that these words resemble the words “acchan”, “aaya”, “anna” and “annai” in Tamil.

I found that his observations supported some arguments that I had made in a blog entry in 2010 (I’d attempted to draw a 3-way comparison between Japanese, Tamil and Australian aboriginal languages).

He proposes a theory that in early Japanese, there were no e and o sounds – that these sounds were replacements for ai or ia and ua.

I quote:

The vowels in group B are believed to have resulted from the merging of two vowels, as follows: ia>e, ai>e, ui>i, oi>i, ua>o
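The five mergers in the quote can be read as rewrite rules on vowel pairs. Here is a minimal sketch that applies them to a romanized string in a single left-to-right pass. The test strings are made-up syllables for illustration only, not attested Japanese or Tamil words, and the names `MERGERS` and `merge_vowels` are my own.

```python
import re

# The vowel mergers listed in the quote: vowel pair -> resulting vowel.
MERGERS = {"ia": "e", "ai": "e", "ui": "i", "oi": "i", "ua": "o"}

# One pattern that matches any of the listed vowel pairs.
PAIR_PATTERN = re.compile("|".join(MERGERS))


def merge_vowels(word: str) -> str:
    """Replace each listed vowel pair with its merged vowel (one pass, no overlaps)."""
    return PAIR_PATTERN.sub(lambda match: MERGERS[match.group(0)], word)


# Toy syllables, not real words.
print(merge_vowels("kia"))   # ke
print(merge_vowels("tui"))   # ti
print(merge_vowels("mua"))   # mo
print(merge_vowels("kata"))  # kata (no listed pair, so unchanged)
```

Running the rules in reverse is not possible without more information, since both “ia” and “ai” end up as “e”.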

Though I don’t have a reference, I am told that T. P. Meenakshi Sundaram made an almost identical assertion in the case of Tamil.

You also see some evidence of such a transformation in the Tulu word “yan-ku” (to me). The corresponding word in Tamil is “en-akku”. The correspondence makes you think that sometime in the past, they used to say “yan-akku” in Tamil instead of “en-akku”.

You see a similar correspondence in Kannada. The Kannada word for “why” can be written and pronounced as “yaake” or as “eke”. So “ia” seems to be replaceable with “e” there.

Similarly in Tamil, the word “evan” (who) can also be pronounced (colloquially) as “yaveng”.

So, if both ancient Tamil and Japanese used just a, i and u sounds, their phonetics begins to resemble that of Australian languages like Dyirbal.

Regarding consonants, Ohno notes the following correspondences:

Japanese

Consonants at head of word
k-, s-, t-, n-, F-, m-, y-, w-

Consonants mid-word
-k-, -s-, -t-, -n-, -F-, -m-, -y-, -w-,
-r-, -ng-, -nz-, -nd-, -nb-

Tamil

Consonants at head of word
k-, c-, t-, n-, n-, p-, m-, y-, v-

Consonants mid-word
-k-, -c-, -t-, -n-, -p-, -m-, -y-, -v-,
-t-, -n-, -r-, -l-, -r-, -l-, -r-,
-nt-, -nc-, -nt-, -mp-

Unfortunately, I don’t know Dyirbal or any Australian language. So, I can’t check if these rules apply to them as well. I can’t wait to get hold of a linguistic analysis of Dyirbal by an Indian or Japanese linguist.

Statistics for Linguistics

I recently wrote an article on how machine learning algorithms can be considered extensions of linguistic rules.

The article I am currently sharing with you is also related to linguistics.

It is in the form of a set of slides and describes statistical concepts and tools that might come in handy for linguists.

Here are the slides: http://aiaioo.com/whitepapers/statistics_for_linguistics.pdf

The goal of these slides was to teach a team of linguists about statistics and machine learning.

The main sections are:

Motivation for learning new tools using examples from algorithms and their application to education and poverty elimination.
What to annotate
How to develop insights
How to annotate
How much data to annotate
How to avoid mistakes in using the corpus

Tulu to Kusunda

The young engineer who joined our firm a few months ago is from the coast, and one of the languages she speaks is Tulu. One day, on the way to a business meeting, she taught me a few sentences of Tulu.

The way I like to learn languages is this. I choose simple sentences in the language and then create variations of these sentences till an understanding of the grammar begins to emerge. So I asked her, “How do you say, ‘What is your name’ in Tulu?” Her answer amazed me. In Tulu, you say “Irna podar dada?”

The answer was amazing because it contained function words that I had never come across before. In Kannada, you would say something like: ‘Ninna hesaru enu?’ In Hindi, it would be “Tumhara naam kya hai?” Tulu is the first language I’ve encountered in South India that has function words that don’t resemble those in Kannada or Hindi.

But it turns out that there are more interesting and unique languages in the subcontinent, and that some are disappearing. On the BBC website today is an article: ‘Nepal’s mystery language on the verge of extinction‘ that talks about a language called Kusunda in Nepal. I quote:

The unknown origins and mysterious sentence structures of Kusunda have long baffled linguists. … Professor Pokharel describes Kusunda as a “language isolate”, not related to any common language of the world. “There are about 20 language families in the world,” he said, “among them are the Indo-European, Sino-Tibetan and Austro-Asiatic group of languages. Kusunda stands out because it is not phonologically, morphologically, syntactically and lexically related to any other languages of the world.”

It turns out that there are only two speakers of the Kusunda language left, one of them a very old woman named Gyani Maiya Sen and the other a woman named Kamala Khatri who had ‘left the country in search of a job’.

Intentions and Opinions

In the last blog post, I talked about intention analysis and what it does.

Intention analysis is the identification of intentions from text. Some examples of intentions are:

a) intention to complain
b) intention to inquire
c) intention to issue a directive
d) intention to buy

In this post, I am claiming that Sentiment Analysis needs Intention Analysis.

Yes, the results of sentiment analysis will be inaccurate unless you know that the intent of the speaker is to express an opinion.

Background

When sentiment analysis was initially proposed by researchers, they applied it to the analysis of product reviews.

The intention of a reviewer is obvious. Reviewers have only one intention: the intention to opine (either to praise or to criticize).

However, with the growth of social media, especially Twitter, the same sentiment analysis methods began to be applied to the analysis of Twitter streams and other social media streams.

Now that’s where there is a problem.

Not every message on Twitter that mentions a particular product or brand intends to express an opinion about the brand!

Below are a few illustrative examples.

Example 1: “Is the Canon EOS 5 a good camera?”

This sentence is not an expression of positive opinion, but an inquiry about the Canon EOS 5.

In other words, the intent of the speaker is not to express an opinion, but to inquire.

Example 2: “I am looking to buy me a good Canon camera”

Here, the intention of the user is to purchase a product (people only indicate a preference for good things … no one really looks to buy a bad camera).

However, most sentiment analysis tools will identify this sentence as an expression of positive sentiment.

Example 3: “Take me to a good movie.”

Here, the speaker’s intent is to direct someone to do something.

A directive is not an assertion, and so does not always imply an intention to opine.

Example 4: “My good old Porsche for sale (cheap)”

Here, the speaker’s intent is to talk up something they’re selling.

The intent here is not to express sentiment about the brand.

Conclusion

So, what we can learn from the above examples is that sentiment analysis is not meant to be applied without reservations to Social Media Analysis.

In other words, for sentiment analysis to be accurate when applied to social media, it needs to be supported by intention analysis.
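To illustrate the idea, here is a toy sketch of a sentiment analyser that first checks the intention behind a message and only scores messages that are meant as opinions. Everything in it (the rules in `INTENT_RULES`, the word lists, the function names) is a made-up illustration under simple assumptions. It is not how our API works, and real systems would need far more than a few regular expressions.

```python
import re

# Hypothetical hand-written rules, checked in order; the first match wins.
INTENT_RULES = [
    ("inquire", re.compile(r"\?\s*$")),
    ("buy", re.compile(r"\b(looking to buy|want to buy|intend to buy)\b", re.I)),
    ("sell", re.compile(r"\bfor sale\b", re.I)),
    ("direct", re.compile(r"^(take|bring|show|give|send)\b", re.I)),
]

# Tiny illustrative word lists, not a real sentiment lexicon.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}


def detect_intent(text: str) -> str:
    """Return the first matching intention, or 'opine' if no rule fires."""
    stripped = text.strip()
    for intent, pattern in INTENT_RULES:
        if pattern.search(stripped):
            return intent
    return "opine"


def naive_sentiment(text: str) -> str:
    """Keyword-counting sentiment that ignores intention altogether."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def sentiment_with_intent_filter(text: str):
    """Score sentiment only when the message is intended as an opinion."""
    if detect_intent(text) != "opine":
        return None  # not an opinion, so no sentiment is reported
    return naive_sentiment(text)


examples = [
    "Is the Canon EOS 5 a good camera?",
    "I am looking to buy me a good Canon camera",
    "Take me to a good movie.",
    "My good old Porsche for sale (cheap)",
    "The Canon EOS 5 is a good camera.",
]
for text in examples:
    print(text, "|", naive_sentiment(text), "|", sentiment_with_intent_filter(text))
```

The keyword counter calls all five messages positive because each contains the word “good”. With the intention filter in front, only the last one, which is an actual opinion, gets a sentiment label.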

Demonstration

We recently released a sentiment analysis API that has the ability to filter out many kinds of intention including the ones listed above. We’d love to get your thoughts on our work. The demo is available at the following URL:

Demonstration of VakSent (a Sentiment Analysis API from Aiaioo Labs)

Do write me at cohan@aiaioo.com with intent to opine!