How do you implement a "Did you mean"? [duplicate]

Possible Duplicate:
How does the Google “Did you mean?” Algorithm work?

Suppose you have a search system already in your website. How can you implement the "Did you mean:<spell_checked_word>" like Google does in some search queries?
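One common starting point is to suggest the most frequent known term that lies within one edit of the query. The sketch below assumes you can build a term-frequency table from your own index or query log; `KNOWN_TERMS` and its counts are made-up placeholders, and this is not a description of what Google actually does.

```python
from collections import Counter

# Hypothetical term frequencies; in practice, build this from your own
# search index or query log.
KNOWN_TERMS = Counter({"python": 120, "parser": 40, "parsing": 25, "pattern": 10})

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string one delete, swap, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + swaps + replaces + inserts)


def did_you_mean(query, known=KNOWN_TERMS):
    """Return a suggestion, or None if the query is known or nothing is close."""
    if query in known:
        return None
    candidates = [w for w in edits1(query) if w in known]
    if not candidates:
        return None
    # Prefer the candidate that appears most often in the known terms.
    return max(candidates, key=lambda w: known[w])


print(did_you_mean("pyhton"))
```

Checking only one edit keeps the candidate set small; handling two edits or multi-word queries would need a wider search or a different index structure.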

Source: (StackOverflow)

How does Apple find dates, times and addresses in emails?

In the iOS email client, when an email contains a date, time or location, the text becomes a hyperlink and it is possible to create an appointment or look at a map simply by tapping the link. It not only works for emails in English, but in other languages also. I love this feature and would like to understand how they do it.

The naive way to do this would be to have many regular expressions and run them all. However, this is not going to scale very well and will work for only a specific language or date format. I think that Apple must be using some form of machine learning to extract entities (8:00PM, 8PM, 8:00, 0800, 20:00, 20h, 20h00, 2000, etc.).

Any idea how Apple is able to extract entities so quickly in its email client? What machine learning algorithm would you apply to accomplish such a task?
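For reference, the naive regular-expression approach described above might look like the following sketch. The pattern is an assumption written for this illustration; it covers only some of the listed time formats and deliberately skips bare four-digit forms such as 0800 or 2000, which cannot be told apart from years or other numbers without context. That ambiguity is the kind of problem that pushes toward a statistical approach.

```python
import re

# Hand-written pattern for a few clock-time notations (illustrative only).
TIME_RE = re.compile(
    r"""\b
    (?P<hour>[01]?\d|2[0-3])            # hour, 0-23
    (?:
        [:h](?P<minute>[0-5]\d)         # 8:00 or 20h00
        (?:\s?(?P<ampm1>[AaPp][Mm]))?   # optional AM/PM after the minutes
      | \s?(?P<ampm2>[AaPp][Mm])        # 8PM
      | h                               # 20h
    )
    \b""",
    re.VERBOSE,
)


def find_times(text):
    """Return (hour, minute) pairs in 24-hour form for each match in text."""
    results = []
    for m in TIME_RE.finditer(text):
        hour = int(m.group("hour"))
        minute = int(m.group("minute") or 0)
        ampm = (m.group("ampm1") or m.group("ampm2") or "").lower()
        if ampm == "pm" and hour < 12:
            hour += 12
        elif ampm == "am" and hour == 12:
            hour = 0
        results.append((hour, minute))
    return results


print(find_times("Meet at 8:00PM or 20h00, maybe 8PM; lunch at 12:30."))
```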

Source: (StackOverflow)

Detecting syllables in a word

I need to find a fairly efficient way to detect syllables in a word. E.g.,

Invisible -> in-vi-sib-le

There are some syllabification rules that could be used:

V CV VC CVC CCV CCCV CVCC

*where V is a vowel and C is a consonant. E.g.,

Pronunciation (5 Pro-nun-ci-a-tion; CV-CVC-CV-V-CVC)

I've tried a few methods, including regular expressions (which help only if you want to count syllables), hard-coded rule definitions (a brute-force approach that proved very inefficient) and finally a finite-state automaton (which did not produce anything useful).
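To make the V/C notation and the regex-counting approach concrete, here is a minimal sketch. It assumes English spelling and treats 'y' as a vowel; both function names are hypothetical. It counts vowel groups rather than true syllables, so it undercounts words such as "pronunciation", where adjacent vowels belong to different syllables.

```python
import re

# Assumption: 'y' counts as a vowel. Real rules are language-specific.
VOWELS = "aeiouy"


def cv_pattern(word):
    """Map each letter to V (vowel) or C (consonant)."""
    return "".join("V" if ch in VOWELS else "C"
                   for ch in word.lower() if ch.isalpha())


def count_vowel_groups(word):
    """Approximate the syllable count as the number of vowel runs."""
    return len(re.findall(f"[{VOWELS}]+", word.lower()))


print(cv_pattern("Invisible"), count_vowel_groups("invisible"))
print(cv_pattern("pronunciation"), count_vowel_groups("pronunciation"))
```

Splitting the pattern into actual syllables still requires deciding where consonant clusters break, which is where the rule list or a trained model comes in.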

The purpose of my application is to create a dictionary of all syllables in a given language. This dictionary will later be used for spell checking applications (using Bayesian classifiers) and text to speech synthesis.

I would appreciate any tips on an alternative way to solve this problem, beyond the approaches I have already tried.

I work in Java, but any tip in C/C++, C#, Python, Perl... would work for me.

Source: (StackOverflow)

What programming language is most like natural language? [closed]

I got the idea for this question from numerous situations where I don't understand what the person is talking about and when others don't understand me.

So, a "smart" solution would be to speak a computer language. :)

I am interested in how far a programming language can go toward (English) natural language. By "near" I mean not just using words and sentences, but being able to "do" the things a natural language can "do", and by "do" I mean that it can be used (in a very limited way) as a replacement for natural language.

I know that this is impossible (is it?) but I think that this can be interesting.

Source: (StackOverflow)

How to extract common / significant phrases from a series of text entries

I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not the single most common phrase, and ideally, not enforcing word-for-word matching).

My example is any review on Yelp.com, that shows 3 snippets from hundreds of reviews of a given restaurant, in the format:

"Try the hamburger" (in 44 reviews)

e.g., the "Review Highlights" section of this page:

http://www.yelp.com/biz/sushi-gen-los-angeles/

I have NLTK installed and I've played around with it a bit, but am honestly overwhelmed by the options. This seems like a rather common problem and I haven't been able to find a straightforward solution by searching here. Thanks in advance for any help.
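A simple baseline, before reaching for NLTK, is to strip the HTML, split each entry into word n-grams, and count how many entries contain each n-gram. The sketch below uses only the standard library; the function names and sample entries are made up, the tag stripping is crude, and there is no stop-word filtering or fuzzy matching.

```python
import re
from collections import Counter
from html import unescape


def strip_html(raw):
    """Crudely drop tags and decode entities; not a full HTML parser."""
    return unescape(re.sub(r"<[^>]+>", " ", raw))


def ngrams(tokens, n):
    """Yield consecutive n-word tuples from a token list."""
    return zip(*(tokens[i:] for i in range(n)))


def common_phrases(entries, n=2, top=3):
    """Count in how many entries each n-word phrase appears."""
    counts = Counter()
    for raw in entries:
        tokens = re.findall(r"[a-z']+", strip_html(raw).lower())
        # Use a set so a phrase repeated within one entry counts once.
        counts.update(set(ngrams(tokens, n)))
    return [(" ".join(p), c) for p, c in counts.most_common(top)]


reviews = [
    "<p>Try the hamburger!</p>",
    "You should try the hamburger.",
    "The hamburger was fine.",
]
print(common_phrases(reviews, n=2, top=1))
```

Counting entries rather than raw occurrences mirrors the "(in 44 reviews)" format in the example.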

Source: (StackOverflow)

What are good starting points for someone interested in natural language processing? [closed]

Question

I've recently come up with some possible new projects that would have to derive 'meaning' from text submitted and generated by users.

Natural language processing is the field that deals with these kinds of issues, and after some initial research I found the OpenNLP Hub and university collaborations like the attempto project. And stackoverflow has this.

If anyone could link me to some good resources, from research papers and introductory texts to APIs, I'd be happier than a 6-year-old kid opening his Christmas presents!

Update

Through one of your recommendations I've found OpenCyc ('the world's largest and most complete general knowledge base and commonsense reasoning engine'). Even more amazing, there's a project called UMBEL that is a distilled version of OpenCyc. It provides semantic data in RDF/OWL/SKOS N3 syntax.

I've also stumbled upon antlr, a parser generator for 'constructing recognizers, interpreters, compilers, and translators from grammatical descriptions'.

And there's a question on here by me, that lists tons of free and open data.

Thanks stackoverflow community!

Source: (StackOverflow)

Java or Python for Natural Language Processing [closed]

I would like to know which programming language is better for natural language processing: Java or Python? I have found lots of questions and answers about it, but I am still unsure which one to use.

I also want to know which NLP library to use for Java, since there are lots of them (LingPipe, GATE, OpenNLP, StanfordNLP). For Python, most programmers recommend NLTK.

But if I want to do text processing or information extraction on unstructured data (just free-form plain English text) to get some useful information, what is the best option: Java or Python? And which library is suitable?

Updated

What I want to do is extract useful product information from unstructured data (e.g., users post advertisements for mobile phones or laptops in not very standard English).

Source: (StackOverflow)

Machine Learning and Natural Language Processing [closed]

Assume you know a student who wants to study Machine Learning and Natural Language Processing.

What introductory subjects would you recommend?

Example: I'm guessing that knowing Prolog and Matlab might help him. He also might want to study Discrete Structures*, Calculus, and Statistics.

*Graphs and trees. Functions: properties, recursive definitions, solving recurrences. Relations: properties, equivalence, partial order. Proof techniques, inductive proof. Counting techniques and discrete probability. Logic: propositional calculus, first-order predicate calculus. Formal reasoning: natural deduction, resolution. Applications to program correctness and automatic reasoning. Introduction to algebraic structures in computing.

Source: (StackOverflow)

How do I do word Stemming or Lemmatization?

I've tried PorterStemmer and Snowball, but neither works on all words, and both miss some very common ones.

My test words are: "cats running ran cactus cactuses cacti community communities", and both get less than half right.

Source: (StackOverflow)

Practical examples of NLTK use [closed]

I'm playing around with the Natural Language Toolkit (NLTK).

Its documentation (Book and HOWTO) is quite bulky and the examples are sometimes slightly advanced.

Are there any good but basic examples of uses/applications of NLTK? I'm thinking of things like the NLTK articles on the Stream Hacker blog.

Source: (StackOverflow)

Is it possible to guess a user's mood based on the structure of text?

I assume a natural language processor would need to be used to parse the text itself, but what suggestions do you have for an algorithm to detect a user's mood based on text that they have written? I doubt it would be very accurate, but I'm still interested nonetheless.
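The simplest baseline for this kind of task is a lexicon-based score: count positive and negative words, and flip the polarity when a negator comes directly before. The sketch below assumes tiny made-up word lists; all names are hypothetical, and it looks only at word choice, not at the structure of the text.

```python
import re

# Hypothetical, tiny word lists; real sentiment lexicons are far larger.
POSITIVE = {"good", "great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "sad", "angry", "hate", "terrible"}
NEGATORS = {"not", "never", "no"}


def mood_score(text):
    """Return a positive number for upbeat text, negative for downbeat."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)
        # Flip the polarity when the previous word is a negator.
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score


print(mood_score("I love this, it is great"))
print(mood_score("This is not good"))
```

As the question suspects, accuracy is limited: sarcasm, longer-range negation and domain-specific vocabulary all defeat a plain word count.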

EDIT: I am by no means an expert on linguistics or natural language processing, so I apologize if this question is too general or stupid.

Source: (StackOverflow)

Difference between constituency parser and dependency parser

I want to know the difference between a constituency parser and a dependency parser. What are the different uses of the two, and how are they used in Natural Language Processing?
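To make the distinction concrete, here is the same sentence in both representations. A constituency parse groups words into nested phrases, while a dependency parse links each word to its head with a labelled relation. Both structures are hand-written for illustration rather than actual parser output; the labels loosely follow Penn Treebank tags and Universal Dependencies relations.

```python
# Constituency parse of "The cat chased a mouse": nested
# (label, children...) tuples, with words at the leaves.
constituency = (
    "S",
    ("NP", ("DT", "The"), ("NN", "cat")),
    ("VP", ("VBD", "chased"), ("NP", ("DT", "a"), ("NN", "mouse"))),
)

# Dependency parse of the same sentence: (head, relation, dependent) triples.
dependency = [
    ("chased", "nsubj", "cat"),
    ("cat", "det", "The"),
    ("chased", "obj", "mouse"),
    ("mouse", "det", "a"),
]


def leaves(tree):
    """Collect the words of a constituency tree from left to right."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    return [word for child in children for word in leaves(child)]


def dependents(head, triples):
    """Return {relation: dependent} for every word attached to head."""
    return {rel: dep for h, rel, dep in triples if h == head}


print(leaves(constituency))
print(dependents("chased", dependency))
```

Reading "who did what to whom" is a direct lookup in the dependency triples, whereas the constituency tree makes phrase boundaries such as the noun phrase "a mouse" explicit.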

I am using Stanford and Linked Parser.

Source: (StackOverflow)

Looking for Java spell checker library [closed]

I am looking for an open source Java spell checking library which has dictionaries for at least the following languages: French, German, Spanish, and Czech. Any suggestion?

Source: (StackOverflow)

Similarity between two text documents

I am looking at working on an NLP project, in any language (though Python will be my preference).

I want to write a program that will take two documents and determine how similar they are.

I am fairly new to this, and a quick Google search did not turn up much. Do you know of any references (websites, textbooks, journal articles) which cover this subject and would be of help to me?
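A standard first approach is to turn each document into a bag-of-words vector and compute the cosine of the angle between the two vectors. The sketch below uses raw term counts and only the standard library; the function names are hypothetical, and weighting schemes such as TF-IDF are left out for brevity.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lower-case the text and count each word."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(doc_a, doc_b):
    """Return a value in [0, 1]; 1 means identical word distributions."""
    a, b = term_counts(doc_a), term_counts(doc_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


print(cosine_similarity("the cat sat", "the cat ran"))
```

Because word order is ignored, two documents with the same words in a different order score 1.0; whether that matters depends on the application.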

Thanks

Source: (StackOverflow)

How to read values from numbers written as words?

As we all know, numbers can be written either as numerals or spelled out in words. While there are plenty of examples that convert 123 into "one hundred twenty three", I could not find good examples of how to convert the other way around.

Some of the caveats:

cardinal/nominal or ordinal: "one" and "first"
common spelling mistakes: "forty"/"fourty"
hundreds/thousands: 2100 -> "twenty one hundred" and also "two thousand and one hundred"
separators: "eleven hundred fifty two", but also "elevenhundred fiftytwo" or "eleven-hundred fifty-two" and whatnot
colloquialisms: "thirty-something"
fragments: 'one third', 'two fifths'
common names: 'a dozen', 'half'

And there are probably more caveats possible that are not yet listed. Suppose the algorithm needs to be very robust, and even understand spelling mistakes.
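As a starting point, the core of an English-only parser can be a small accumulator: units and tens add to a running group, "hundred" multiplies it, and larger scale words flush the group into the total. The sketch below handles the hundreds/thousands and hyphen cases from the list and one listed misspelling. It does not handle ordinals, fractions, colloquialisms, run-together words or other languages, and the function name and tables are hypothetical.

```python
import re

UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
TENS["fourty"] = 40  # tolerate one common misspelling
SCALES = {"thousand": 1_000, "million": 1_000_000}


def words_to_number(text):
    """Convert English cardinal number words to an int."""
    # Splitting on non-letters treats hyphens and spaces alike.
    tokens = re.findall(r"[a-z]+", text.lower())
    total = current = 0
    for tok in tokens:
        if tok == "and":
            continue
        if tok in UNITS:
            current += UNITS[tok]
        elif tok in TENS:
            current += TENS[tok]
        elif tok == "hundred":
            # A bare "hundred" counts as one hundred.
            current = max(current, 1) * 100
        elif tok in SCALES:
            total += current * SCALES[tok]
            current = 0
        else:
            raise ValueError(f"unrecognised number word: {tok!r}")
    return total + current


print(words_to_number("two thousand and one hundred"))
print(words_to_number("eleven-hundred fifty-two"))
```

Robustness to arbitrary spelling mistakes would need an extra step, for example matching each token against the known number words by edit distance before parsing.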

What fields/papers/studies/algorithms should I read to learn how to write all this? Where is the information?

PS: My final parser should actually understand three different languages: English, Russian and Hebrew. More languages may be added at a later stage. Hebrew also has masculine and feminine number forms: "one man" and "one woman" use a different "one", "ehad" and "ahat". Russian also has some of its own complexities.

Google does a great job at this, for example:

http://www.google.com/search?q=two+thousand+and+one+hundred+plus+five+dozen+and+four+fifths+in+decimal

(the reverse is also possible http://www.google.com/search?q=999999999999+in+english)

Source: (StackOverflow)
