Month: June 2011

Speech Recognition using PocketSphinx on Win32

The zeroth thing you need is the PocketSphinx binaries.
Download the Win32 builds of pocketsphinx, sphinxbase, sphinxtrain and cmuclmtk from the Sphinx website.

The first thing you need to do is build a language model or a grammar.

A grammar can be a simple file in a format called JSGF (the JSpeech Grammar Format), and it is the easier way to get a speech recognizer up and running. Alternatively, you can use a statistical language model, built by following the instructions on the Sphinx site. You create it starting from a file of sentences like this:

<s> I WANT A NEXTCUBE ZERO FOUR ZERO </s>
<s> I WANT THE NEXTCUBE ZERO FOUR ZERO </s>
<s> I NEED A NEXTCUBE ZERO FOUR ZERO </s>
<s> I NEED THE NEXTCUBE ZERO FOUR ZERO </s>
<s> I AM LOOKING FOR A NEXTCUBE ZERO FOUR ZERO </s>
<s> I AM LOOKING FOR THE NEXTCUBE ZERO FOUR ZERO </s>
<s> I AM SEEKING A NEXTCUBE ZERO FOUR ZERO </s>
<s> I AM SEEKING THE NEXTCUBE ZERO FOUR ZERO </s>
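If you would rather generate such a file than type it by hand, the sentences above can be produced from a few phrase lists. This is only a minimal sketch: the function name `build_corpus` and the way the sentences are split into openers, articles and an item are my own assumptions, not part of the Sphinx tools.

```python
from itertools import product


def build_corpus(openers, articles, item):
    """Combine every opener with every article and wrap the result in <s> ... </s>."""
    lines = []
    for opener, article in product(openers, articles):
        # Upper-case the words only; the sentence markers stay lower case.
        words = f"{opener} {article} {item}".upper().split()
        lines.append("<s> " + " ".join(words) + " </s>")
    return lines


corpus = build_corpus(
    ["I want", "I need", "I am looking for", "I am seeking"],
    ["a", "the"],
    "NeXTcube zero four zero",
)
print("\n".join(corpus))
```

Redirect the printed lines into a text file and use that file as the input to the language model tools.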

A sample JSGF file (modified from the sample on the Sphinx website) is shown below. Note that I’ve put all the words in capitals because the CMU phonetic dictionary lists its words in caps. Make sure that any language model is all caps as well, except for the sentence boundary markers <s> and </s>:

#JSGF V1.0;

/**
 * JSGF Grammar for Hello World example
 */

grammar hello;

public <greet> = (GOOD MORNING | HELLO | HI) ( PAUL | RITA | WILL );
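To see exactly which utterances this grammar accepts, you can enumerate them. The sketch below is not a JSGF parser; it is a hand translation of the `<greet>` rule into two Python lists (the names `GREETINGS`, `NAMES` and `greet_phrases` are mine), which is enough for a grammar this small.

```python
from itertools import product

# Hand-copied from the two alternative groups in the <greet> rule.
GREETINGS = ["GOOD MORNING", "HELLO", "HI"]
NAMES = ["PAUL", "RITA", "WILL"]


def greet_phrases():
    """Return every phrase the rule accepts: one greeting followed by one name."""
    return [f"{greeting} {name}" for greeting, name in product(GREETINGS, NAMES)]


phrases = greet_phrases()
for phrase in phrases:
    print(phrase)
```

The rule accepts nine phrases in total, and the recognizer will only ever output one of them, which is why a grammar is the easier option for a small command set.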

The second thing you need is an Acoustic Model

An acoustic model maps the sound features that the recognizer extracts from the audio to phonemes.
VoxForge provides a free acoustic model for PocketSphinx that you can use.

The third thing you need is a phonetic dictionary

The phonetic dictionary maps the recognized phonemes to actual words in your language. For English, there is a phonetic dictionary available from CMU.

You will just need to download one file: cmudict.0.7a_SPHINX_40

Now, you have all the components you need!

Running Pocketsphinx

With JSGF:

$ pocketsphinx-0.7-win32/pocketsphinx_continuous.exe \
-hmm voxforge-en-r0_1_3/model_parameters/voxforge_en_sphinx.cd_cont_3000 \
-jsgf greet.jsgf \
-dict cmudict.0.7a_SPHINX_40

With a language model:

$ pocketsphinx-0.7-win32/pocketsphinx_continuous.exe \
-hmm voxforge-en-r0_1_3/model_parameters/voxforge_en_sphinx.cd_cont_3000 \
-lm cmuclmtk-0.7-win32/output.lm.DMP \
-dict cmudict.0.7a_SPHINX_40
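The two invocations differ only in whether `-jsgf` or `-lm` is passed. If you want to launch the recognizer from a script, a small helper can assemble the argument list. This is a sketch under assumptions: the helper name `pocketsphinx_args` is hypothetical, and it assumes you pass exactly one of the two options, mirroring the commands above. The flags and paths are the ones used in this post.

```python
def pocketsphinx_args(exe, hmm, dictionary, jsgf=None, lm=None):
    """Build the argument list for pocketsphinx_continuous.

    Assumes exactly one of jsgf (grammar file) or lm (language model) is given.
    """
    if (jsgf is None) == (lm is None):
        raise ValueError("pass exactly one of jsgf or lm")
    args = [exe, "-hmm", hmm]
    if jsgf is not None:
        args += ["-jsgf", jsgf]
    else:
        args += ["-lm", lm]
    args += ["-dict", dictionary]
    return args


args = pocketsphinx_args(
    "pocketsphinx-0.7-win32/pocketsphinx_continuous.exe",
    "voxforge-en-r0_1_3/model_parameters/voxforge_en_sphinx.cd_cont_3000",
    "cmudict.0.7a_SPHINX_40",
    jsgf="greet.jsgf",
)
print(" ".join(args))
# To actually start the recognizer:
#   import subprocess
#   subprocess.run(args, check=True)
```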

If a word you need is missing from the phonetic dictionary, you can add an entry for it, spelling out its pronunciation with the CMU dictionary phoneme set.
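To find out which words need new entries, you can compare your sentences or grammar phrases against the dictionary. The sketch below assumes each dictionary line is a word followed by its phonemes, with alternate pronunciations marked by a suffix such as `(2)`. The function names and the handful of sample entries are illustrative only, so check them against the real file.

```python
import re


def dictionary_words(dict_lines):
    """Collect the words defined in a Sphinx-style phonetic dictionary."""
    words = set()
    for line in dict_lines:
        parts = line.split()
        if parts:
            # Alternate pronunciations look like HELLO(2); strip the suffix.
            words.add(re.sub(r"\(\d+\)$", "", parts[0]))
    return words


def missing_words(phrases, dict_lines):
    """Return the words used in the phrases that have no dictionary entry."""
    known = dictionary_words(dict_lines)
    needed = set()
    for phrase in phrases:
        for word in phrase.split():
            if word not in ("<s>", "</s>"):  # sentence markers are not words
                needed.add(word)
    return sorted(needed - known)


# Illustrative entries only; in practice read the lines from the dictionary file.
sample_dict = [
    "HELLO HH AH L OW",
    "HELLO(2) HH EH L OW",
    "HI HH AY",
    "PAUL P AO L",
]
missing = missing_words(["<s> HELLO PAUL </s>", "HI NEXTCUBE"], sample_dict)
print(missing)
```

Each word it reports needs a line of its own in the dictionary. For example, an entry such as `NEXTCUBE N EH K S T K Y UW B` would do, where the pronunciation is a guess written with phonemes from the CMU set.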

Education

  1. Videos on speech recognition
  2. Lectures on speech recognition
  3. Voxforge has an article on what an acoustic model is

NLP Workshop

The IASNLP 2011 workshop turned out to be a good opportunity to learn a little bit about speech research.

(See the article: http://www.aiaioo.com/cms/index.php?id=28)

Here are two of the faculty who work on speech at IIIT-H:

1. Yegnanarayana: http://speech.iiit.ac.in/~yegna (many publications on signal processing, noise cancellation, feature extraction, ANNs).

2. Kishore Prahallad: http://www.iiit.net/people/faculty/kishore (speech synthesis and spoken dialog systems)

IIIT-H also has research on grammar and translation.

Dr. Rajeev Sangal (http://www.iiit.net/~sangal/) works on Dependency Parsing, Transfer Based Machine Translation and Anaphora Resolution.

Robotics Workshop

On the first day of a three-day workshop, I built a line-follower robot that successfully navigated what the instructor promised was a very difficult course (he said it would be impossible to navigate using a simple on-off algorithm).

The trick I used to complete the course was to run the DC motors at half voltage and adjust the sensor angles so that both sensors always fed the ‘brain’ an excellent set of signals.
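For readers who have not seen one, a simple on-off controller can be sketched in a few lines. This is a generic illustration, not the code that ran on the robot: it assumes two line sensors straddling the line and two independently driven wheels, and the function name `on_off_step` and the 0.5 power level (standing in for half voltage) are assumptions.

```python
def on_off_step(left_on_line, right_on_line, power=0.5):
    """Return (left_motor, right_motor) drive levels for one control step.

    Assumes the sensors sit on either side of the line, so a sensor that
    sees the line means the line has moved towards that side.
    """
    if left_on_line and not right_on_line:
        return (0.0, power)  # line is to the left: stop the left wheel to turn left
    if right_on_line and not left_on_line:
        return (power, 0.0)  # line is to the right: stop the right wheel to turn right
    return (power, power)    # line between the sensors (or a crossing): go straight


print(on_off_step(True, False))
print(on_off_step(False, False))
```

The controller has no notion of how far off the line the robot is, so everything depends on the quality of the two sensor readings and on the robot moving slowly enough to react in time.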

I came up with the idea owing to my experience with text analytics. The most critical task in text analysis is feature engineering. With a good set of features, you can get excellent results even if the machine learning algorithm is very simple. Unfortunately, very little work goes into feature engineering and feature combination methods for NLP.

So, I guess my weekend dabbling in robotics taught me an important lesson – no matter how good your machine learning algorithms (the brains of the system) are, they can’t do anything without eyes.