Top machine-learning frequently asked interview questions

How to understand Locality Sensitive Hashing?

I noticed that LSH seems a good way to find similar items with high-dimension properties.

After reading the paper http://www.slaney.org/malcolm/yahoo/Slaney2008-LSHTutorial.pdf, I'm still confused with those formulas.

Does anyone know a blog or article that explains that the easy way?

Source: (StackOverflow)

What is the difference between supervised learning and unsupervised learning?

In terms of artificial intelligence and machine learning. Can you provide a basic, easy explanation with an example?

Source: (StackOverflow)

What is the difference between a Generative and Discriminative Algorithm? [closed]

Please help me understand the difference between a Generative and Discriminative Algorithm keeping in mind that I am just a beginner.

Source: (StackOverflow)

Role of Bias in Neural Networks

I'm aware of the Gradient Descent and the Back-propagation Theorem. What I don't get is: When is using a bias important and how do you use it?

For example, when mapping the AND function, when I use 2 inputs and 1 output, it does not give the correct weights, however, when I use 3 inputs (1 of which is a bias), it gives the correct weights.

Source: (StackOverflow)

A simple explanation of Naive Bayes Classification

I am finding it hard to understand the process of Naive Bayes, and I was wondering if someone could explained it with a simple step by step process in English. I understand it takes comparisons by times occurred as a probability, but I have no idea how the training data is related to the actual dataset.

Please give me an explanation of what role the training set plays. I am giving a very simple example for fruits here, like banana for example

training set---
round-red
round-orange
oblong-yellow
round-red

dataset----
round-red
round-orange
round-red
round-orange
oblong-yellow
round-red
round-orange
oblong-yellow
oblong-yellow
round-red

Source: (StackOverflow)

How does the Google "Did you mean?" Algorithm work?

I've been developing an internal website for a portfolio management tool. There is a lot of text data, company names etc. I've been really impressed with some search engines ability to very quickly respond to queries with "Did you mean: xxxx".

I need to be able to intelligently take a user query and respond with not only raw search results but also with a "Did you mean?" response when there is a highly likely alternative answer etc

[I'm developing in ASP.NET (VB - don't hold it against me! )]

UPDATE: OK, how can I mimic this without the millions of 'unpaid users'?

Generate typos for each 'known' or 'correct' term and perform lookups?
Some other more elegant method?

Source: (StackOverflow)

Estimating the number of neurons and number of layers of an artificial neural network

I am looking for a method on how to calculate the number of layers and the number of neurons per layer. As input i only have the size of the input vector, the size of the output vector and the size of the trainig set.

Usually the best net is determined by trying different net topologies and selecting the one with the least error. Unfortunately I can not do that.

Source: (StackOverflow)

Detecting patterns in waves

I'm trying to read a image from a electrocardiography and detect each one of the main waves in it (P wave, QRS complex and T wave). Now I can read the image and get a vector like (4.2; 4.4; 4.9; 4.7; ...) representative of the values in the electrocardiography, what is half of the problem. I need a algorithm that can walk through this vector and detect when each of this waves start and end.

Here is a example of one of its graphs:

alt text

Would be easy if they always had the same size, but it's not like it works, or if I knew how many waves the ecg would have, but it can vary too. Does anyone have some ideas?

Thanks!

Updating

Example of what I'm trying to achieve:

Given the wave

alt text

I can extract the vector

[0; 0; 20; 20; 20; 19; 18; 17; 17; 17; 17; 17; 16; 16; 16; 16; 16; 16; 16; 17; 17; 18; 19; 20; 21; 22; 23; 23; 23; 25; 25; 23; 22; 20; 19; 17; 16; 16; 14; 13; 14; 13; 13; 12; 12; 12; 12; 12; 11; 11; 10; 12; 16; 22; 31; 38; 45; 51; 47; 41; 33; 26; 21; 17; 17; 16; 16; 15; 16; 17; 17; 18; 18; 17; 18; 18; 18; 18; 18; 18; 18; 17; 17; 18; 19; 18; 18; 19; 19; 19; 19; 20; 20; 19; 20; 22; 24; 24; 25; 26; 27; 28; 29; 30; 31; 31; 31; 32; 32; 32; 31; 29; 28; 26; 24; 22; 20; 20; 19; 18; 18; 17; 17; 16; 16; 15; 15; 16; 15; 15; 15; 15; 15; 15; 15; 15; 15; 14; 15; 16; 16; 16; 16; 16; 16; 16; 16; 16; 15; 16; 15; 15; 15; 16; 16; 16; 16; 16; 16; 16; 16; 15; 16; 16; 16; 16; 16; 15; 15; 15; 15; 15; 16; 16; 17; 18; 18; 19; 19; 19; 20; 21; 22; 22; 22; 22; 21; 20; 18; 17; 17; 15; 15; 14; 14; 13; 13; 14; 13; 13; 13; 12; 12; 12; 12; 13; 18; 23; 30; 38; 47; 51; 44; 39; 31; 24; 18; 16; 15; 15; 15; 15; 15; 15; 16; 16; 16; 17; 16; 16; 17; 17; 16; 17; 17; 17; 17; 18; 18; 18; 18; 19; 19; 20; 20; 20; 20; 21; 22; 22; 24; 25; 26; 27; 28; 29; 30; 31; 32; 33; 32; 33; 33; 33; 32; 30; 28; 26; 24; 23; 23; 22; 20; 19; 19; 18; 17; 17; 18; 17; 18; 18; 17; 18; 17; 18; 18; 17; 17; 17; 17; 16; 17; 17; 17; 18; 18; 17; 17; 18; 18; 18; 19; 18; 18; 17; 18; 18; 17; 17; 17; 17; 17; 18; 17; 17; 18; 17; 17; 17; 17; 17; 17; 17; 18; 17; 17; 18; 18; 18; 20; 20; 21; 21; 22; 23; 24; 23; 23; 21; 21; 20; 18; 18; 17; 16; 14; 13; 13; 13; 13; 13; 13; 13; 13; 13; 12; 12; 12; 16; 19; 28; 36; 47; 51; 46; 40; 32; 24; 20; 18; 16; 16; 16; 16; 15; 16; 16; 16; 17; 17; 17; 18; 17; 17; 18; 18; 18; 18; 19; 18; 18; 19; 20; 20; 20; 20; 20; 21; 21; 22; 22; 23; 25; 26; 27; 29; 29; 30; 31; 32; 33; 33; 33; 34; 35; 35; 35; 0; 0; 0; 0;]

I would like to detect, for example

P wave in [19 - 37]

QRS complex in [51 - 64]

etc...

Source: (StackOverflow)

How do I find Wally with Python?

Shamelessly jumping on the bandwagon :-)

Inspired by How do I find Waldo with Mathematica and the followup How to find Waldo with R, as a new python user I'd love to see how this could be done. It seems that python would be better suited to this than R, and we don't have to worry about licenses as we would with Mathematica or Matlab.

In an example like the one below obviously simply using stripes wouldn't work. It would be interesting if a simple rule based approach could be made to work for difficult examples such as this.

At the beach

I've added the [machine-learning] tag as I believe the correct answer will have to use ML techniques, such as the Restricted Boltzmann Machine (RBM) approach advocated by Gregory Klopper in the original thread. There is some RBM code available in python which might be a good place to start, but obviously training data is needed for that approach.

At the 2009 IEEE International Workshop on MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP 2009) they ran a Data Analysis Competition: Where's Wally?. Training data is provided in matlab format. Note that the links on that website are dead, but the data (along with the source of an approach taken by Sean McLoone and colleagues can be found here (see SCM link). Seems like one place to start.

Source: (StackOverflow)

Machine Learning in Game AI

In the old days of gaming, I'm sure simple switch/case statements (in a sense) would have done just fine for most of the game "AI." However, as games have become increasing complex, especially at the 3d leap, more complex algorithms are needed. My question is, are actual machine learning algorithms (like reinforcement learning) used in game AI at this point? Or is that still mostly only in research projects at universities (which I have been exposed to)?

If not actual machine learning algorithms, then what is driving bleeding edge commerical game AI? Is it simply highly complex but static (non-ML) algorithms that are able to cover most of the possibilities? And if so, what actual types of algorithms are used?

I've always been curious about this, thanks!

Edit: After thinking about it some more I can further clarify a bit. How do the agents in the game make decisions? If they are not using actual learning algorithms real-time, was a learning algorithm perhaps used in the development stages to produce a model (static algorithm), and that model is then used to make decisions in the game? Or was a static algorithm for decision making hand-coded in a sense?

Source: (StackOverflow)

When to choose which machine learning classifier?

Suppose I'm working on some classification problem. (Fraud detection and comment spam are two problems I'm working on right now, but I'm curious about any classification task in general.)

How do I know which classifier I should use? (Decision tree, SVM, Bayesian, logistic regression, etc.) In which cases is one of them the "natural" first choice, and what are the principles for choosing that one?

Examples of the type of answers I'm looking for (from Manning et al.'s "Introduction to Information Retrieval book": http://nlp.stanford.edu/IR-book/html/htmledition/choosing-what-kind-of-classifier-to-use-1.html):

a. If your data is labeled, but you only have a limited amount, you should use a classifier with high bias (for example, Naive Bayes). [I'm guessing this is because a higher-bias classifier will have lower variance, which is good because of the small amount of data.]

b. If you have a ton of data, then the classifier doesn't really matter so much, so you should probably just choose a classifier with good scalability.

What are other guidelines? Even answers like "if you'll have to explain your model to some upper management person, then maybe you should use a decision tree, since the decision rules are fairly transparent" are good. I care less about implementation/library issues, though.
Also, for a somewhat separate question, besides standard Bayesian classifiers, are there 'standard state-of-the-art' methods for comment spam detection (as opposed to email spam)?

[Not sure if stackoverflow is the best place to ask this question, since it's more machine learning than actual programming -- if not, any suggestions for where else?]

Source: (StackOverflow)

Is there a recommended package for machine learning in Python? [closed]

Is there a recommended package for machine learning in Python?

I have previous experience in implementing a variety of machine learning and statistical algorithms in C++ and MATLAB, but having done some work in Python I'm curious about the available packages for Python.

Source: (StackOverflow)

How can I build a model to distinguish tweets about Apple (Inc.) from tweets about apple (fruit)?

See below for 50 tweets about "apple." I have hand labeled the positive matches about Apple Inc. They are marked as 1 below.

Here are a couple of lines:

1|“@chrisgilmer: Apple targets big business with new iOS 7 features http://bit.ly/15F9JeF ”. Finally.. A corp iTunes account!
0|“@Zach_Paull: When did green skittles change from lime to green apple? #notafan” @Skittles
1|@dtfcdvEric: @MaroneyFan11 apple inc is searching for people to help and tryout all their upcoming tablet within our own net page No.
0|@STFUTimothy have you tried apple pie shine?
1|#SuryaRay #India Microsoft to bring Xbox and PC games to Apple, Android phones: Report: Microsoft Corp... http://dlvr.it/3YvbQx  @SuryaRay

Here is the total data set: http://pastebin.com/eJuEb4eB

I need to build a model that classifies "Apple" (Inc). from the rest.

I'm not looking for a general overview of machine learning, rather I'm looking for actual model in code (Python preferred).

Source: (StackOverflow)

Nearest neighbors in high-dimensional data?

I have asked a question a few days back on how to find the nearest neighbors for a given vector. My vector is now 21 dimensions and before I proceed further, because I am not from the domain of Machine Learning nor Math, I am beginning to ask myself some fundamental questions:

Is Euclidean distance a good metric for finding the nearest neighbors in the first place? If not, what are my options?
In addition, how does one go about deciding the right threshold for determining the k-neighbors? Is there some analysis that can be done to figure this value out?
Previously, I was suggested to use kd-Trees but the Wikipedia page clearly says that for high-dimensions, kd-Tree is almost equivalent to a brute-force search. In that case, what is the best way to find nearest-neighbors in a million point dataset efficiently?

Can someone please clarify the some (or all) of the above questions?

Source: (StackOverflow)

How does Apple find dates, times and addresses in emails?

In the iOS email client, when an email contains a date, time or location, the text becomes a hyperlink and it is possible to create an appointment or look at a map simply by tapping the link. It not only works for emails in English, but in other languages also. I love this feature and would like to understand how they do it.

The naive way to do this would be to have many regular expressions and run them all. However I this is not going to scale very well and will work for only a specific language or date format, etc. I think that Apple must be using some concept of machine learning to extract entities (8:00PM, 8PM, 8:00, 0800, 20:00, 20h, 20h00, 2000 etc.).

Any idea how Apple is able to extract entities so quickly in its email client? What machine learning algorithm would you to apply accomplish such task?

Source: (StackOverflow)

EzDevInfo.com

machine-learning interview questions

How to understand Locality Sensitive Hashing?

What is the difference between supervised learning and unsupervised learning?

What is the difference between a Generative and Discriminative Algorithm? [closed]

Role of Bias in Neural Networks

A simple explanation of Naive Bayes Classification

How does the Google "Did you mean?" Algorithm work?

Estimating the number of neurons and number of layers of an artificial neural network

Detecting patterns in waves

How do I find Wally with Python?

Machine Learning in Game AI

When to choose which machine learning classifier?

Is there a recommended package for machine learning in Python? [closed]

How can I build a model to distinguish tweets about Apple (Inc.) from tweets about apple (fruit)?

Nearest neighbors in high-dimensional data?

How does Apple find dates, times and addresses in emails?