EzDevInfo.com

statistics interview questions

Top frequently asked statistics interview questions

Geometric Mean: is there a built-in?

I tried to find a built-in for the geometric mean but couldn't.

(Obviously a built-in isn't going to save me any time while working in the shell, nor do I suspect there's any difference in accuracy; but in scripts I try to use built-ins as often as possible, where the (cumulative) performance gain is often noticeable.)

In case there isn't one (which I doubt is the case), here's mine:

gm_mean <- function(a) prod(a)^(1 / length(a))  # nth root of the product of the n values

Source: (StackOverflow)

How to calculate cumulative normal distribution in Python

I am looking for a function in NumPy or SciPy (or any rigorous Python library) that will give me the cumulative distribution function of the normal distribution in Python.
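To clarify the kind of call I'm hoping for, here is a rough sketch of the two routes I can imagine (assuming SciPy is an acceptable dependency; math.erf is a pure-Python fallback available since Python 2.7):

from scipy.stats import norm
p = norm.cdf(1.96)   # P(Z <= 1.96) for a standard normal, roughly 0.975

from math import erf, sqrt
p = 0.5 * (1.0 + erf(1.96 / sqrt(2.0)))   # same value without SciPy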


Source: (StackOverflow)


In R, how to find the standard error of the mean?

Is there any command to find the standard error of the mean in R?


Source: (StackOverflow)

Getting statistics from Google Play Developers with an API

I am in charge of developing a website that should show clients statistics from both Apple's App Store and the Google Play Store, so they can easily see what's going on.

I have figured out some ways to get the App Store's data, but the Google Play developer statistics seem much harder to get.

I've heard of scraping, but this wouldn't be a great solution, as it would probably break whenever the Developer Console gets a major update.

I'm looking for something that works the way Andlytics or App Annie do, for example, so I could fetch the data with AJAX or something similar (JSON format, maybe?) and put it into a database.

For now, I haven't found any reliable solution (besides scraping, which seems like an unstable way to go), and this question was asked a while ago, so I allow myself to ask it again, because maybe there are now some solutions to get around this.

All I could find was the Google Play APIs, which allow me to fetch data from an app's public page, but not from the Developer Console with authentication.

Any hints or help will be greatly appreciated :)


Source: (StackOverflow)

Which Git commit stats are easy to pull

Previously I have enjoyed TortoiseSVN's ability to generate simple commit stats for a given SVN repository. I wonder what is available in Git, and am particularly interested in:

  • Number of commits per user
  • Number of lines changed per user
  • Activity over time (for instance, aggregated weekly changes)

Any ideas?


Source: (StackOverflow)

Screening (multi)collinearity in a regression model

I hope that this one is not going to be an "ask-and-answer" question... here goes: (multi)collinearity refers to extremely high correlations between predictors in a regression model. How to cure it... well, sometimes you don't need to "cure" collinearity, since it doesn't affect the regression model itself, only the interpretation of the effects of individual predictors.

One way to spot collinearity is to take each predictor as a dependent variable and the other predictors as independent variables, determine R², and, if it's larger than .9 (or .95), consider that predictor redundant (a rough sketch of this is shown below). This is one "method"... what about other approaches? Some of them are time-consuming, like excluding predictors from the model and watching for changes in the b-coefficients - they should be noticeably different.
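To make the R²-per-predictor screening concrete, here is a quick sketch of how I would compute it (NumPy; all names are my own):

import numpy as np

def redundancy_r2(X):
    # For each column of X (n samples x p predictors), regress it on the
    # remaining columns and return R^2; values above ~.9 suggest redundancy.
    n, p = X.shape
    r2 = []
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])  # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        ss_res = float(np.sum((y - A @ coef) ** 2))
        ss_tot = float(np.sum((y - y.mean()) ** 2))
        r2.append(1.0 - ss_res / ss_tot)
    return r2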

Of course, we must always bear in mind the specific context/goal of the analysis... Sometimes the only remedy is to repeat the research, but right now I'm interested in various ways of screening for redundant predictors when (multi)collinearity occurs in a regression model.


Source: (StackOverflow)

Sorting algorithms for data of known statistical distribution?

It just occurred to me that if you know something about the distribution (in the statistical sense) of the data to be sorted, a sorting algorithm's performance might benefit from taking that information into account.

So my question is, are there any sorting algorithms that take into account that kind of information? How good are they?

Edit: an example to clarify: if you know the distribution of your data to be Gaussian, you could estimate the mean and standard deviation on the fly as you process the data. This would give you an estimate of the final position of each number, which you could use to place each one close to its final position.
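To make that concrete, here is a rough Python sketch of what I have in mind (all names are mine): estimate the Gaussian parameters, push each value through the CDF to guess its final rank, bucket accordingly, and sort the small buckets.

import math

def gaussian_bucket_sort(data, num_buckets=1024):
    # Place each value into the bucket suggested by its estimated normal CDF
    # (i.e. near its final rank), then sort each bucket and concatenate.
    n = len(data)
    mean = sum(data) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / n) or 1.0
    buckets = [[] for _ in range(num_buckets)]
    for x in data:
        cdf = 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))
        buckets[min(int(cdf * num_buckets), num_buckets - 1)].append(x)
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))
    return result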

Edit #2: I'm pretty surprised the answer isn't a wiki link to a thorough page discussing this issue. Isn't this a very common case (the Gaussian case, for example)?

Edit #3: I'm adding a bounty to this question, because I'm looking for definite answers with sources, not speculation. Something like "in the case of Gaussian-distributed data, XYZ algorithm is the fastest on average, as was proved by Smith et al. [1]". However, any additional information is welcome.

Note: I will award the bounty to the highest-voted answer. Vote wisely!


Source: (StackOverflow)

Plotting a 3D surface plot with contour map overlay, using R

I have a 3-tuple data set (X,Y,Z points) that I want to plot using R.

I want to create a surface plot from the data, and superimpose a contour map on the surface plot, so as to create the impression of the contour map being the "shadow" or projection from the surface plot. The contour map is to appear below the surface plot.

My data set looks somewhat like this:

Axis  |  Data Type
-------------------
X     |  Date value
Y     |  Float value
Z     |  Float value

How can I achieve this?


Source: (StackOverflow)

"On-line" (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis?

Is there an algorithm to estimate the median, mode, skewness, and/or kurtosis of a set of values, but one that does NOT require storing all the values in memory at once?

I'd like to calculate the basic statistics:

  • mean: arithmetic average
  • variance: average of squared deviations from the mean
  • standard deviation: square root of the variance
  • median: the value that separates the larger half of the numbers from the smaller half
  • mode: the most frequent value found in the set
  • skewness: tl; dr
  • kurtosis: tl; dr

The basic formulas for calculating any of these are grade-school arithmetic, and I do know them. There are many stats libraries that implement them, as well.

My problem is the large number (billions) of values in the sets I'm handling: Working in Python, I can't just make a list or hash with billions of elements. Even if I wrote this in C, billion-element arrays aren't too practical.

The data is not sorted. It's produced randomly, on-the-fly, by other processes. The size of each set is highly variable, and the sizes will not be known in advance.

I've already figured out how to handle the mean and variance pretty well, iterating through each value in the set in any order. (Actually, in my case, I take them in the order in which they're generated.) Here's the algorithm I'm using, courtesy of http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm:

  • Initialize three variables: count, sum, and sum_of_squares
  • For each value:
    • Increment count.
    • Add the value to sum.
    • Add the square of the value to sum_of_squares.
  • Divide sum by count, storing as the variable mean.
  • Divide sum_of_squares by count, storing as the variable mean_of_squares.
  • Square mean, storing as square_of_mean.
  • Subtract square_of_mean from mean_of_squares, storing as variance.
  • Output mean and variance.

This "on-line" algorithm has weaknesses (e.g., accuracy problems as sum_of_squares quickly grows larger than integer range or float precision), but it basically gives me what I need, without having to store every value in each set.

But I don't know whether similar techniques exist for estimating the additional statistics (median, mode, skewness, kurtosis). I could live with a biased estimator, or even a method that compromises accuracy to a certain degree, as long as the memory required to process N values is substantially less than O(N).

Pointing me to an existing stats library will help, too, if the library has functions to calculate one or more of these operations "on-line".


Source: (StackOverflow)

Calculate mean and standard deviation from a vector of samples in C++ using Boost

Is there a way to calculate the mean and standard deviation for a vector containing samples using Boost? Or do I have to create an accumulator and feed the vector into it?


Source: (StackOverflow)

Browser statistics on JavaScript disabled

I am having a hard time collecting publicly available statistics on the percentage of web users that browse with JavaScript disabled.

Yahoo has published data from 2010 and R. Reid published data from 2009 (picked from a site he had access to).

The findings from Yahoo were rather interesting at that time:

We took a combination of access logs and beacon data (previously included in the page) and filtered out all of the automated requests, leaving us with a set of requests we could confirm were sent by actual users. This data, which is completely anonymous, gave us a good indication of traffic patterns in several countries.

After crunching the numbers, we found a consistent rate of JavaScript-disabled requests hovering around 1% of the actual visitor traffic, with the highest rate being roughly 2 percent in the United States and the lowest being roughly 0.25 percent in Brazil. All of the other countries tested showed numbers very close to 1.3 percent.

Yahoo browser percentage with JavaScript disabled, 2010

This is about all I could find so far. But since this data is getting old, I wonder where the percentages are today.

I also looked at StatCounter, which seems to be the only company left that still openly publishes browser statistics, but they do not publish data about JavaScript. I know that W3Schools also publishes stats, but since its audience is developers, that data is extremely biased and therefore not interesting to me (it has to be representative of ordinary users).

I therefore ask you to provide:

  • Links to any open, freely available statistics that touch this area
  • Your own stats, preferably from larger sites that do not target developers

Source: (StackOverflow)

Select k random elements from a list whose elements have weights

Selecting without any weights (equal probabilities) is beautifully described here.

I was wondering if there is a way to convert this approach to a weighted one.

I am interested in other approaches as well.

Update: Sampling without replacement
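For reference, the only approach I can come up with myself is the naive one sketched below (Python; all names are mine): draw one element at a time with probability proportional to its weight and remove it from the pool.

import random

def weighted_sample_without_replacement(items, weights, k):
    # O(n*k) sketch: pick an index proportionally to weight, then remove the winner.
    items, weights = list(items), list(weights)
    chosen = []
    for _ in range(k):
        r = random.uniform(0, sum(weights))
        acc = 0.0
        for i, w in enumerate(weights):
            acc += w
            if acc >= r:
                chosen.append(items.pop(i))
                weights.pop(i)
                break
    return chosen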


Source: (StackOverflow)

How to make execution pause, sleep, wait for X seconds in R?

How do you pause an R script for a specified number of seconds or milliseconds? In many languages, there is a sleep function, but ?sleep references a data set, and ?pause and ?wait don't exist.

The intended purpose is for self-timed animations. The desired solution works without asking for user input.


Source: (StackOverflow)

Statistics: combinations in Python

I need to compute combinatorials (nCr) in Python but cannot find a function to do that in the 'math', 'numpy', or 'stats' libraries. Something like a function of the type:

comb = calculate_combinations(n, r)

I need the number of possible combinations, not the actual combinations, so itertools.combinations does not interest me.

Finally, I want to avoid using factorials, as the numbers I'll be calculating the combinations for can get too big, and the factorials are going to be monstrous.
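To illustrate what I mean by avoiding factorials, a rough multiplicative sketch (reusing the function name from above) would be something like:

def calculate_combinations(n, r):
    # Multiply (n-r+1 .. n) while dividing by (1 .. r); every intermediate
    # result is itself a binomial coefficient, so the integer division stays exact.
    r = min(r, n - r)
    result = 1
    for i in range(1, r + 1):
        result = result * (n - r + i) // i
    return result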

This seems like a REALLY easy question to answer; however, I am being drowned in questions about generating all the actual combinations, which is not what I want. :)

Many thanks


Source: (StackOverflow)

Calculating Pearson correlation and significance in Python

I am looking for a function that takes two lists as input and returns the Pearson correlation along with the significance of the correlation.
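Something with behaviour along these lines is what I'm imagining (I believe scipy.stats.pearsonr returns the coefficient and a two-sided p-value, but I'd welcome confirmation or alternatives):

from scipy import stats

# Pearson r and a two-sided p-value for the correlation between two lists
r, p_value = stats.pearsonr([1, 2, 3, 4, 5], [5, 6, 7, 8, 7])
print(r, p_value)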


Source: (StackOverflow)