Speech and Language Processing

Dan Jurafsky, James H. Martin

Mentioned 12

An explosion of Web-based language techniques, merging of distinct fields, availability of phone-based dialogue systems, and much more make this an exciting time in speech and language processing. The first of its kind to thoroughly cover language technology – at all levels and with all modern technologies – this book takes an empirical approach to the subject, based on applying statistical and other machine-learning algorithms to large corporations. Builds each chapter around one or more worked examples demonstrating the main idea of the chapter, usingthe examples to illustrate the relative strengths and weaknesses of various approaches. Adds coverage of statistical sequence labeling, information extraction, question answering and summarization, advanced topics in speech recognition, speech synthesis. Revises coverage of language modeling, formal grammars, statistical parsing, machine translation, and dialog processing. A useful reference for professionals in any of the areas of speech and language processing.

More on Amazon.com

Mentioned in questions and answers.

input: phrase 1, phrase 2

output: semantic similarity value (between 0 and 1), or the probability these two phrases are talking about the same thing

There's a short and a long answer to this.

The short answer:

Use the WordNet::Similarity Perl package. If Perl is not your language of choice, check the WordNet project page at Princeton, or google for a wrapper library.

The long answer:

Determining word similarity is a complicated issue, and research is still very hot in this area. To compute similarity, you need an appropriate represenation of the meaning of a word. But what would be a representation of the meaning of, say, 'chair'? In fact, what is the exact meaning of 'chair'? If you think long and hard about this, it will twist your mind, you will go slightly mad, and finally take up a research career in Philosophy or Computational Linguistics to find the truth™. Both philosophers and linguists have tried to come up with an answer for literally thousands of years, and there's no end in sight.

So, if you're interested in exploring this problem a little more in-depth, I highly recommend reading Chapter 20.7 in Speech and Language Processing by Jurafsky and Martin, some of which is available through Google Books. It gives a very good overview of the state-of-the-art of distributional methods, which use word co-occurrence statistics to define a measure for word similarity. You are not likely to find libraries implementing these, however.

Assume you know a student who wants to study Machine Learning and Natural Language Processing.

What introductory subjects would you recommend?

Example: I'm guessing that knowing Prolog and Matlab might help him. He also might want to study Discrete Structures*, Calculus, and Statistics.

*Graphs and trees. Functions: properties, recursive definitions, solving recurrences. Relations: properties, equivalence, partial order. Proof techniques, inductive proof. Counting techniques and discrete probability. Logic: propositional calculus, first-order predicate calculus. Formal reasoning: natural deduction, resolution. Applications to program correctness and automatic reasoning. Introduction to algebraic structures in computing.

This related stackoverflow question has some nice answers: What are good starting points for someone interested in natural language processing?

This is a very big field. The prerequisites mostly consist of probability/statistics, linear algebra, and basic computer science, although Natural Language Processing requires a more intensive computer science background to start with (frequently covering some basic AI). Regarding specific langauges: Lisp was created "as an afterthought" for doing AI research, while Prolog (with it's roots in formal logic) is especially aimed at Natural Language Processing, and many courses will use Prolog, Scheme, Matlab, R, or another functional language (e.g. OCaml is used for this course at Cornell) as they are very suited to this kind of analysis.

Here are some more specific pointers:

For Machine Learning, Stanford CS 229: Machine Learning is great: it includes everything, including full videos of the lectures (also up on iTunes), course notes, problem sets, etc., and it was very well taught by Andrew Ng.

Note the prerequisites:

Students are expected to have the following background: Knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably non-trivial computer program. Familiarity with the basic probability theory. Familiarity with the basic linear algebra.

The course uses Matlab and/or Octave. It also recommends the following readings (although the course notes themselves are very complete):

For Natural Language Processing, the NLP group at Stanford provides many good resources. The introductory course Stanford CS 224: Natural Language Processing includes all the lectures online and has the following prerequisites:

Adequate experience with programming and formal structures. Programming projects will be written in Java 1.5, so knowledge of Java (or a willingness to learn on your own) is required. Knowledge of standard concepts in artificial intelligence and/or computational linguistics. Basic familiarity with logic, vector spaces, and probability.

Some recommended texts are:

The prerequisite computational linguistics course requires basic computer programming and data structures knowledge, and uses the same text books. The required articificial intelligence course is also available online along with all the lecture notes and uses:

This is the standard Artificial Intelligence text and is also worth reading.

I use R for machine learning myself and really recommend it. For this, I would suggest looking at The Elements of Statistical Learning, for which the full text is available online for free. You may want to refer to the Machine Learning and Natural Language Processing views on CRAN for specific functionality.

Jurafsky and Martin's Speech and Language Processing http://www.amazon.com/Speech-Language-Processing-Daniel-Jurafsky/dp/0131873210/ is very good. Unfortunately the draft second edition chapters are no longer free online now that it's been published :(

Also, if you're a decent programmer it's never too early to toy around with NLP programs. NLTK comes to mind (Python). It has a book you can read free online that was published (by OReilly I think).

I am planning to learn natural language processing this year.

But when I start reading introductory books on this topic, I found that I miss a lot of points relating mainly to mathematics.

So I'm here searching for what I should learn before I can learn nlp, well, more smoothly?

Thanks in advance.

There are two main approaches to NLP right now - one is the language-based approach detailed by Jurafsky and Martin (Speech and Language Processing) and the other is a probability and statistics-based approach (Foundations of Statistical Natural Language Processing).

Most people that I've talked to tend to prefer the latter as far as ease of ramping up and useful results. So I would recommend going over probability theory first and then tackling an NLP book (like the second one I linked to, which I am actually using on a project right now with pretty good results).

While I agree with laura that formal language theory is highly useful, I actually think that currently if you just want to get into the actual NL parts of NLP, you can leave formal languages for later as there are enough tools that will do your lexical analysis / parsing / tokenizing / text transformations that you can use those rather than roll your own.

Here is a book describing three such tools - I own it and recommend it as a good introduction to all three. Building Search Applications: Lucene, LingPipe, and Gate

Edit: in response to your question, I would say that the first step would be to get a thorough grounding in the basics of probability (the first 3-5 chapters of any undergrad prob/stats book should be fine), and then from there look up new topics as they come up in the NLP book. For instance, yesterday I had to learn about t-values or something (I'm bad with names) because they happened to be relevant to determining incidence of collocation.

I recently attended a class on coursera about "Natural Language Processing" and I learnt a lot about parsing, IR and other interesting aspects like Q&A etc. though I grasped the concepts well but I did not actually get any practical knowledge of it. Can anyone suggest me good online tutorials or books for Natural Language Processing?


You could read Jurafsky and Martin's Speech and Language Processing (2008 edition), which is the standard textbook in the field. It's long, and has a variety of topics, so I'd suggest reading just the chapters that really apply to your interests.

Further, the best way to learn is almost certainly to actually implement NLP algorithms from scratch. You could pick some standard tasks (language modeling, text classification, POS-tagging, NER, parsing) and implement various algorithms from the ground up (ngram models, HMMs, Naive Bayes, MaxEnt, CKY) to really understand what makes them work. It also shouldn't be too hard to find some free dataset to test your implementations on.

Finally, there are lots of tutorials out there for specific NLP algorithms that are excellent. For example, if you want to build an HMM, I suggest Jason Eisner's tutorial which also covers smoothing and unsupervised training with EM. If you want to implement Gibbs sampling for unsupervised Naive Bayes training, I suggest Philip Resnik's tutorial.

I have written (am writting) a program to analyze encrypted text and attempt to analyze and break it using frequency analysis.

The encrypted text takes the form of each letter being substituted for some other letter ie. a->m, b->z, c->t etc etc. all spaces and non alpha chars are removed and upper case letters made lowercase.

An example would be :

Orginal input - thisisasamplemessageitonlycontainslowercaseletters
Encrypted output - ziololqlqdhstdtllqutozgfsnegfzqoflsgvtkeqltstzztkl
Attempt at cracking - omieieaeanuhtnteeawtiorshylrsoaisehrctdlaethtootde

Here it has only got I, A and Y correctly.

Currently my program cracks it by analysing the frequency of each individual character, and mapping it to the character that appears in the same frequency rank in a non encrypted text.

I am looking for methods and ways to improve the accuracy of my program as at the moment I don't get too many characters right. For example when attempting to crack X amount of characters from Pride and Prejudice, I get:

1600 - 10 letters correct
800 - 7 letters correct
400 - 2 letters correct
200 - 3 letters correct
100 - 3 letters correct.

I am using Romeo and Juliet as a base to get the frequency data.

It has been suggested to me to look at and use the frequency of character pairs, but I am unsure how to use this because unless I am using very large encrypted texts I can imagine a similar approach to how I am doing single characters would be even more inaccurate and cause more errors than successes. I am hoping also to make my encryption cracker more accurate for shorter 'inputs'.

Any suggestions would be very helpful.


I'm not sure how constrained this problem is, i.e. how many of the decisions you made are yours to change, but here are some comments:

1) Frequency mapping is not enough to solve a puzzle like this, many frequencies are very close to each other and if you aren't using the same text for frequency source and plaintext, you are almost guaranteed to have a few letters off no matter how long the text. Different materials will have different use patterns.

2) Don't strip the spaces if you can help it. This will allow you to validate your potential solution by checking that some percentage of the words exist in a dictionary you have access to.

3) Look into natural language processing if you really want to get into the language side of this. This book has all you could ever want to know about it.

Edit: I would look into bigraphs and trigraphs first. If you're fairly confident of one or two letters, they can help predict likely candidates for the letters that follow. They're basically probability tables where AB would be the probability of an A being followed by a B. So assuming you have a given letter solved, that can be used to solve the letters next to it, rather than just guessing. For example, if you've got the word "y_u", it's obvious to you that the word is you, but not to the computer. If you've got the letters N, C, and O left, bigraphs will tell you that YN and YC are very uncommon where as YO is much more likely, so even if your text has unusual letter frequencies (which is easy when it's short) you still have a fairly accurate system for solving for unknowns. You can hunt around for a compiled dataset, or do your own analysis, but make sure to use a lot of varied text, a lot of Shakespeare is not the same as half of Shakespeare and half journal articles.

I'm interested in learning more about Natural Language Processing (NLP) and am curious if there are currently any strategies for recognizing proper nouns in a text that aren't based on dictionary recognition? Also, could anyone explain or link to resources that explain the current dictionary-based methods? Who are the authoritative experts on NLP or what are the definitive resources on the subject?

The task of determining the proper part of speech for a word in a text is called Part of Speech Tagging. The Brill tagger, for example, uses a mixture of dictionary(vocabulary) words and contextual rules. I believe that some of the important initial dictionary words for this task are the stop words. Once you have (mostly correct) parts of speech for your words, you can start building larger structures. This industry-oriented book differentiates between recognizing noun phrases (NPs) and recognizing named entities. About textbooks: Allen's Natural Language Understanding is a good, but a bit dated, book. Foundations of Statistical Natural Language Processing is a nice introduction to statistical NLP. Speech and Language Processing is a bit more rigorous and maybe more authoritative. The Association for Computational Linguistics is a leading scientific community on computational linguistics.

What Algorithm/method do I use for a Question Answering System's Question Processing?

I have been searching possible algorithms for my Question Answering System, the only thing that I think that would be possible to use is Parsing but I have asked about parsing in my last question and with the answers there i think its not possible to be used?(I'm not sure).

My idea of using Parsing is by Cutting the question into pieces word per word and then it will go through a Storage of Words that would determine what Kind of Word(noun,adjective,verb,etc) is being said. My purpose of using Parsing is to remove or rather to determine the Topic of the question.

The other idea of mine is the ChatterBot. A Chatterbot uses a query of words? Correct me if I'm not mistaken and those words are assigned to another Word. It would randomly choose a word from its Query.

Example: User's Statement: Hello > ChatterBot's Possible Replies: Hi,Hello,Hey

I'm not quite sure what is the possible method/algorithm to use in a Question Answering, I have read the Wikipedia post : http://en.wikipedia.org/wiki/Question_answering but I do not quite understand what algorithm to use in Question Processing.

Thank you.

PS: I'm developing in Javascript. Q = Question

here's a great book, you may need to read to get a lot of stuff related to NLP, and Question answering systems http://www.amazon.com/Speech-Language-Processing-2nd-Edition/dp/0131873210

the book has a full section (V.Applications) that will help you a lot to develop a good system. but note that the book is discussing theories and algorithms only (no code)

it's not about parsing text only, you'll need to understand the context to provide better answer. actually you need to extract some keywords and ignore everything else.

also you may read in topics Keywords (Bag of words), algorithms like (TF/IDF).

I need a Regex to detect questions within a text.

Example input:

please, tell me how to do this... or how to make it right! and so on....

I need output:

  1. how to do this
  2. how to make it right

Now I use this: (?<q>(how to|how match|how many).*)(\s|\.|;|!|\?|( \-)|(\- )|‾|:|…|_|\||@|~|…|–|—|¯|»|•|●|{|}|\(|\)|\\|\]|\[|>|<|→|'|""|`|$) but does not work

I need only how to questions

The task you are trying to accomplish falls under a different category than what regular expressions are good for.

To solve the problem of extracting arbitrary questions from text you need a lot more than just a few good regular expressions. You should start looking at a good natural language processing toolkit. And maybe first do some Part of Speech tagging. Then, from there you will need to do some syntax and sentence parsing and then move on to try to answer the question of: "Is this a sentence a question?" by examining each sentence your NLP pipeline will have identified.

Armed with this knowledge, at a minimum, you should understand that the task you want to accomplish is rather difficult and while not impossible will require a lot of fine tuning to get good performance (usually measured with Accuracy and Precision metrics). You will most likely not get anywhere near 100% on either but you should be able to get decent results with a good PoS tagger and a good sentence parser.


In light of your recent edit to the question, you may be able to get some basic coverage with RegExs and hand-written rules but you will still fail to differentiate many more complicated cases. The natural language processing toolkit route is still preferred for a more generic solution.

Don't spend too much time trying to come up with a silver bullet regular expression to match natural language. Natural language is not regular - so it's not going to work! It's ok to use regular expressions to identify some keywords but beyond that you're better off with simple hand-written rules, and tokenizing in lue of a good natural language pipeline..


If you're really serious about this task, have a look at sharpnlp.codeplex.com as a starting point. There are other NLP tool-kits out there with NLTK springing to mind as a popular one if you're not required to use C#. As a second step, get yourself an introductory book on NLP. The subject is vast and really cool. A great book I've learned a lot from would be: Speech and Language Processing by Jurafski and Martin.

And as a final thought, here's what I would do at a minimum:

  1. perform normalization (remove any symbols that you don't need, and duplicates)
  2. try to do basic sentence segmentation (split at punctuation: . , ; ? !)
  3. convert all letters to lowercase
  4. replace all numbers with a tag (i.e. )
  5. perform Part of Speech tagging on each of the normalized sentences
  6. then you can move on to try to determine where all the "how to" questions are located in the text.
  7. after you have your locations you should be able to map them back to the original text and extract the original "How To" questions from there

Good luck!

Hat in hand here. I'm a seasoned developer and I would be grateful for a bit of help. I don't have time to read or digest long intricate discussions on theoretical concepts around NLP (or go get my PHD). That said, I have read a few and it's a damn interesting field. The problem is I need real world solutions, for real world products, in real world time frames.

The problem I'm having is right now I'm not sure what the right questions are to ask to get started implementing. I believe this is mostly related to vocabulary. I'll read somewhere, a blog post, a forum post, a whitepaper, and it says, I'm doing flooping with the blargy blarg method, and I go google flooping and blargy blarg, and I get references to more obscurity. It seemingly never ends.

So, my question is multiphased.

First, more generally, how do I become passingly educated on this quickly? Just in time educated. I only need to know what I need to know to take the next step. I've spent 20 years writing code. Explain quick. I'll get it. (I mean provide a reference to something that explains quickly of course).

I'm happy to read the right book, but I don't want to read a book where I read the chapter introduction that explains what floopy floop is and then skip over the rest of the chapter with examples of floopy flooping (because now I get what it is). I also don't want to read a book that goes into too much detail with theoretical underpinnings or history. For example, the Jurafsky book seems like way more than I need: http://www.amazon.com/gp/product/0131873210. But I will read it if this is the right book to read. (It's also dang expensive!)

I need the root node of the expedited learning tree here, if you will. Point me in the right direction and I'll be quite grateful. I'm expecting quite a lot of firehose drinking - I just need the right firehose.

Second, what I need to do is take a single sentence, with a very reduced vocabulary, and get a grammar tree (sorry if this is the wrong terminology) that I can do something with.

I know I could easily write this command line input style in c in a more conventional manner, but I need it to be way better than that. But I don't need a chatterbot either.

What I'm doing needs to live in a constrained environment. I can't use Python (unfortunately). I can't ship with gigabytes of corpuses. I need any libraries I use to be in c/c++. If I have to write this myself, I will. Hopefully, it will be achievable considering the reduced problem set. Maybe, probably, that's just naive. If so, let me know. :-)

Thanks in advance - Mike

The Jurafsky & Martin book is the right book to read, or atleast have. It has a good index and a good bibliography and explains most of the concepts you need to know. There might also be a softcover version available which is considerably cheaper.

Natural language processing (NLP) is a subfield of artificial intelligence that involves transforming or extracting useful information from natural language data. Methods include machine-learning and rule-based approaches. It is often regarded as the engineering arm of Computational Linguistics.

NLP tasks

Beginner books on Natural Language Processing

Popular software packages

This is not a code question, but about concepts. I want to know who are the main author/researches for Information Extraction, Natural Language Processing and Text Mining to read his papers/books/works.

Look at The Handbook of Data Mining - Nong Ye for a collection of many papers. This should also point you to the key researchers in text/data mining.


I for the record own this book.

I tried to search a lot about this but all i could find was links to NLP libraries and AIML or chatbot APIs. I want to start from scratch and analyze the sentences myself so that i can write a basic chat bot that gives human like responses. Could someone please point to some links/research papers/ tutorials/videos for this?

Without using NLP libraries you'll have to write some of their functionality yourself. Although this can be educational you should know it can also be very time consuming.

Some academic resources:

Programming/practical resources: