EzDevInfo.com

statsmodels

Statsmodels: statistical modeling and econometrics in Python StatsModels: Statistics in Python — statsmodels 0.7.0 documentation

Get it on Github
Language : Python

http://statsmodels.sourceforge.net/devel/

Why do I get only one parameter from a statsmodels OLS fit

Here is what I am doing:

$ python
Python 2.7.6 (v2.7.6:3a1db0d2747e, Nov 10 2013, 00:42:54) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
>>> import statsmodels.api as sm
>>> statsmodels.__version__
'0.5.0'
>>> import numpy 
>>> y = numpy.array([1,2,3,4,5,6,7,8,9])
>>> X = numpy.array([1,1,2,2,3,3,4,4,5])
>>> res_ols = sm.OLS(y, X).fit()
>>> res_ols.params
array([ 1.82352941])

I had expected an array with two elements?!? The intercept and the slope coefficient?

Source: (StackOverflow)

What statistics module for python supports one way ANOVA with post hoc tests (Tukey, Scheffe or other)?

I have tried looking through multiple stats modules for python but can't seem to find any that support one away ANOVA post hoc tests.

Source: (StackOverflow)

ValueError: numpy.dtype has the wrong size, try recompiling

I just installed pandas and statsmodels package on my python 2.7 When I tried "import pandas as pd", this error message comes out. Can anyone help? Thanks!!!

numpy.dtype has the wrong size, try recompiling
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\analytics\ext\python27\lib\site-packages\statsmodels-0.5.0-py2.7-win32.egg\statsmodels\formula\__init__.py",
line 4, in <module>
    from formulatools import handle_formula_data
  File "C:\analytics\ext\python27\lib\site-packages\statsmodels-0.5.0-py2.7-win32.egg\statsmodels\formula\formulatools.p
y", line 1, in <module>
    import statsmodels.tools.data as data_util
  File "C:\analytics\ext\python27\lib\site-packages\statsmodels-0.5.0-py2.7-win32.egg\statsmodels\tools\__init__.py", li
ne 1, in <module>
    from tools import add_constant, categorical
  File "C:\analytics\ext\python27\lib\site-packages\statsmodels-0.5.0-py2.7-win32.egg\statsmodels\tools\tools.py", line
14, in <module>
    from pandas import DataFrame
  File "C:\analytics\ext\python27\lib\site-packages\pandas\__init__.py", line 6, in <module>
    from . import hashtable, tslib, lib
  File "numpy.pxd", line 157, in init pandas.tslib (pandas\tslib.c:49133)
ValueError: numpy.dtype has the wrong size, try recompiling

Source: (StackOverflow)

Package for time series analysis in python [closed]

I am working on time series in python. The libraries which I found useful and promising are

pandas;
statsmodel (for ARIMA);
simple exponential smoothing is provided from pandas.

Also for visualization: matplotlib

Does anyone know a library for exponential smoothing?

Source: (StackOverflow)

Where can I find mad (mean absolute deviation) in scipy?

It seems scipy once provided a function mad to calculate the mean absolute deviation for a set of numbers:

http://projects.scipy.org/scipy/browser/trunk/scipy/stats/models/utils.py?rev=3473

However, I can not find it anywhere in current versions of scipy. Of course it is possible to just copy the old code from repository but I prefer to use scipy's version. Where can I find it, or has it been replaced or removed?

Source: (StackOverflow)

Calculate how a value differs from the average of values using the Gaussian Kernel Density (Python)

I use this code to calculate a Gaussian Kernel Density on this values

from random import randint
x_grid=[]
for i in range(1000):
    x_grid.append(randint(0,4))
print (x_grid)

This is the code to calculate the Gaussian Kernel Density

from statsmodels.nonparametric.kde import KDEUnivariate
import matplotlib.pyplot as plt

def kde_statsmodels_u(x, x_grid, bandwidth=0.2, **kwargs):
    """Univariate Kernel Density Estimation with Statsmodels"""
    kde = KDEUnivariate(x)
    kde.fit(bw=bandwidth, **kwargs)
    return kde.evaluate(x_grid)

import numpy as np
from scipy.stats.distributions import norm

# The grid we'll use for plotting
from random import randint
x_grid=[]
for i in range(1000):
    x_grid.append(randint(0,4))
print (x_grid)

# Draw points from a bimodal distribution in 1D
np.random.seed(0)
x = np.concatenate([norm(-1, 1.).rvs(400),
                    norm(1, 0.3).rvs(100)])

pdf_true = (0.8 * norm(-1, 1).pdf(x_grid) +
            0.2 * norm(1, 0.3).pdf(x_grid))

# Plot the three kernel density estimates
fig, ax = plt.subplots(1, 2, sharey=True, figsize=(13, 8))
fig.subplots_adjust(wspace=0)

pdf=kde_statsmodels_u(x, x_grid, bandwidth=0.2)
ax[0].plot(x_grid, pdf, color='blue', alpha=0.5, lw=3)
ax[0].fill(x_grid, pdf_true, ec='gray', fc='gray', alpha=0.4)
ax[0].set_title("kde_statsmodels_u")
ax[0].set_xlim(-4.5, 3.5)

plt.show()

All the values in the grid are between 0 e 4. If I receive a new value of 5 I want to calculate how that value differs from the average values and assign to it a score between 0 and 1. (setting a threshold)

So if I receive as a new value 5 its score must be close to 0.90, while if I receive as a new value 500 its score must be close to 0.0.

How can I do that? Is my function to calculate the Gaussian Kernel Density correct or is there a better way/library to do that?

* UPDATE * I read an example in a paper. The weight of a washing machine is typically of 100 kg. Usually vendors use the kg unit to also refer its capacity (example 9 kg). For a human is easy to understand that 9 gk is the capacity and not the total weight of the washing machine. We can “fake” this form of intelligence without deep language understanding, by instead modeling a distribution of values over training data for each attribute.

For a given attribute a (weight of a washing machine for example), let Va = {va1, va2, . . . van} (|Va| = n) be the set of values of attribute a corresponding to products in the training data. If I found a new value v Intuitively it is “close” to (the distribution estimated from) Va, then we should feel more confident assigning this value to a (example weight of a washing machine).

An idea could be to measure the number of standard deviations by which the new value v differs from the average of values in Va but a better one could be to model a (Gaussian) kernel density on Va, and then express the support at new value v as the density at that point:

enter image description here

where where σ^(2)ak is the variance of the kth Gaussian, and Z is a constant to make sure S(c.s.v, Va) ∈ [0, 1]. How can I obtain it in Python using the statsmodels library?

* UPDATED 2 * Example of data... but I think that is not very important... Generated by this code...

from random import randint
x_grid=[]
for i in range(1000):
    x_grid.append(randint(1,3))
print (x_grid)

[2, 2, 1, 2, 2, 3, 1, 1, 1, 2, 2, 2, 1, 1, 3, 3, 1, 2, 1, 3, 2, 3, 3, 1, 2, 3, 1, 1, 3, 2, 2, 1, 1, 1, 2, 3, 2, 1, 2, 3, 3, 2, 2, 3, 3, 2, 2, 1, 2, 1, 2, 2, 3, 3, 1, 1, 2, 3, 3, 2, 1, 2, 3, 3, 3, 3, 2, 1, 3, 2, 2, 1, 3, 3, 1, 2, 1, 3, 2, 3, 3, 1, 2, 3, 3, 2, 1, 2, 3, 2, 1, 1, 2, 1, 1, 2, 3, 2, 1, 2, 2, 2, 3, 2, 3, 3, 1, 1, 3, 2, 1, 1, 3, 3, 3, 2, 1, 2, 2, 1, 3, 2, 3, 1, 3, 1, 2, 3, 1, 3, 2, 2, 1, 1, 2, 2, 3, 1, 1, 3, 2, 2, 1, 2, 1, 2, 3, 1, 3, 3, 1, 2, 1, 2, 1, 3, 1, 3, 3, 2, 1, 1, 3, 2, 2, 2, 3, 2, 1, 3, 2, 1, 1, 3, 3, 3, 2, 1, 1, 3, 2, 1, 2, 2, 2, 1, 3, 1, 3, 2, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 3, 2, 2, 2, 3, 1, 1, 2, 2, 1, 1, 1, 3, 3, 3, 3, 1, 3, 1, 3, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 3, 1, 2, 3, 1, 3, 2, 2, 2, 2, 2, 1, 1, 2, 3, 1, 1, 1, 3, 1, 3, 2, 2, 3, 1, 3, 3, 2, 2, 3, 2, 1, 2, 1, 1, 1, 2, 2, 3, 2, 1, 1, 3, 1, 2, 1, 3, 3, 3, 1, 2, 2, 2, 1, 1, 2, 2, 1, 2, 3, 1, 3, 2, 2, 2, 2, 2, 2, 1, 3, 1, 3, 3, 2, 3, 2, 1, 3, 3, 3, 3, 3, 1, 2, 2, 2, 1, 1, 3, 2, 3, 1, 2, 3, 2, 3, 2, 1, 1, 3, 3, 1, 1, 2, 3, 2, 3, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 2, 1, 1, 2, 3, 2, 3, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 3, 1, 1, 2, 3, 1, 1, 2, 3, 1, 2, 3, 1, 2, 1, 3, 3, 2, 2, 3, 3, 3, 2, 1, 1, 2, 2, 3, 2, 3, 2, 1, 1, 1, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, 1, 2, 1, 1, 2, 3, 3, 1, 1, 3, 2, 1, 3, 3, 2, 1, 1, 3, 1, 3, 1, 2, 2, 1, 3, 3, 2, 3, 1, 1, 3, 1, 2, 2, 1, 3, 2, 3, 1, 1, 3, 1, 3, 1, 2, 1, 3, 2, 2, 2, 2, 1, 3, 2, 1, 3, 3, 2, 3, 2, 1, 3, 1, 2, 1, 2, 3, 2, 3, 2, 3, 3, 2, 3, 3, 1, 1, 3, 2, 3, 2, 2, 2, 3, 1, 3, 2, 2, 3, 3, 2, 3, 2, 2, 2, 3, 3, 1, 3, 2, 3, 1, 1, 2, 1, 3, 1, 2, 2, 3, 3, 1, 3, 1, 1, 2, 2, 1, 3, 3, 3, 1, 2, 2, 2, 1, 3, 1, 2, 2, 2, 3, 3, 3, 1, 1, 2, 3, 3, 1, 1, 2, 3, 2, 3, 3, 2, 2, 1, 3, 3, 3, 3, 2, 3, 1, 3, 3, 2, 1, 3, 2, 1, 1, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 2, 3, 3, 3, 2, 1, 3, 1, 1, 1, 1, 3, 1, 2, 3, 3, 3, 2, 3, 1, 2, 2, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 1, 2, 3, 3, 3, 3, 2, 3, 3, 2, 1, 1, 1, 2, 3, 1, 3, 3, 2, 1, 3, 3, 3, 2, 2, 1, 2, 3, 2, 3, 3, 3, 3, 2, 3, 2, 1, 2, 1, 1, 3, 3, 3, 2, 2, 3, 1, 3, 2, 1, 3, 1, 1, 3, 3, 1, 2, 2, 2, 3, 3, 1, 2, 1, 2, 1, 3, 2, 3, 3, 3, 3, 3, 3, 3, 1, 2, 3, 1, 3, 3, 2, 2, 1, 3, 1, 1, 3, 2, 1, 2, 3, 2, 1, 3, 3, 3, 2, 3, 1, 2, 3, 3, 1, 2, 2, 2, 3, 1, 2, 1, 1, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 2, 3, 3, 1, 2, 1, 3, 2, 2, 2, 2, 2, 2, 1, 2, 2, 3, 2, 2, 3, 2, 2, 2, 3, 1, 1, 3, 3, 1, 3, 1, 2, 1, 2, 1, 3, 2, 2, 1, 3, 1, 3, 3, 1, 3, 1, 1, 1, 1, 3, 2, 1, 2, 3, 1, 1, 3, 1, 1, 3, 1, 3, 3, 3, 1, 1, 3, 1, 3, 2, 2, 2, 1, 1, 2, 3, 3, 2, 3, 3, 1, 2, 3, 2, 2, 3, 1, 2, 2, 2, 1, 1, 3, 1, 2, 2, 2, 1, 1, 2, 3, 1, 3, 1, 1, 3, 2, 2, 3, 2, 2, 3, 3, 1, 1, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 2, 2, 1, 1, 3, 2, 3, 1, 1, 3, 1, 3, 2, 3, 3, 3, 3, 3, 2, 2, 3, 2, 1, 1, 1, 3, 3, 1, 2, 1, 3, 2, 3, 2, 2, 1, 2, 3, 3, 1, 1, 1, 1, 3, 3, 1, 3, 3, 1, 1, 3, 1, 3, 1, 3, 2, 3, 1, 3, 3, 3, 1, 1, 2, 2, 3, 2, 3, 2, 2, 1, 2, 1, 2, 1, 2, 2, 3, 1, 1, 3, 2, 2, 3, 2, 3, 3, 2, 2, 2, 2, 2, 2, 3, 2, 3, 1, 2, 2, 1, 1, 2, 3, 3, 1, 3, 3, 1, 3, 3, 1, 3, 2, 2, 2, 1, 1, 2, 1, 3, 1, 1, 1, 2, 3, 3, 2, 3, 1, 3]

This array represents the ram of new smartphones in the market... Usually they have 1,2,3 GB of ram.

That's the kernel density

enter image description here

*** UPDATE

I try the code with this values

[1024, 1, 1024, 1000, 1024, 128, 1536, 16, 192, 2048, 2000, 2048, 24, 250, 256, 278, 288, 290, 3072, 3, 3000, 3072, 32, 384, 4096, 4, 4096, 448, 45, 512, 576, 64, 768, 8, 96]

The values are all in mb... do you think that is working well? I think that I must set a threshold

      100%      cdfv      kdev
1       42  0.210097  0.499734
1024    96  0.479597  0.499983
5000     0  0.000359  0.498885
2048    36  0.181609  0.499700
3048     8  0.040299  0.499424

* UPDATE 3 *

[256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 512, 512, 512, 256, 256, 256, 512, 512, 512, 128, 128, 128, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 2048, 2048, 2048, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 128, 128, 128, 512, 512, 512, 256, 256, 256, 256, 256, 256, 1024, 1024, 1024, 512, 512, 512, 128, 128, 128, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 4, 4, 4, 3, 3, 3, 24, 24, 24, 8, 8, 8, 16, 16, 16, 16, 16, 16, 256, 256, 256, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 512, 512, 512, 512, 512, 512, 256, 256, 256, 256, 256, 256, 256, 256, 256, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 2048, 2048, 2048, 2048, 2048, 2048, 4096, 4096, 4096, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 768, 768, 768, 768, 768, 768, 2048, 2048, 2048, 2048, 2048, 2048, 3072, 3072, 3072, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 1024, 1024, 1024, 512, 512, 512, 256, 256, 256, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 3072, 3072, 3072, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 512, 512, 512, 256, 256, 256, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 512, 512, 512, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 64, 64, 64, 1024, 1024, 1024, 1024, 1024, 1024, 256, 256, 256, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 64, 64, 64, 64, 64, 64, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 128, 128, 128, 576, 576, 576, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 576, 576, 576, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 512, 512, 512, 2048, 2048, 2048, 768, 768, 768, 768, 768, 768, 768, 768, 768, 512, 512, 512, 192, 192, 192, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 384, 384, 384, 448, 448, 448, 576, 576, 576, 384, 384, 384, 288, 288, 288, 768, 768, 768, 384, 384, 384, 288, 288, 288, 64, 64, 64, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 3072, 3072, 3072, 2048, 2048, 2048, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 64, 64, 64, 128, 128, 128, 128, 128, 128, 128, 128, 128, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 256, 256, 256, 768, 768, 768, 768, 768, 768, 768, 768, 768, 256, 256, 256, 192, 192, 192, 256, 256, 256, 64, 64, 64, 256, 256, 256, 192, 192, 192, 128, 128, 128, 256, 256, 256, 192, 192, 192, 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 128, 128, 128, 128, 128, 128, 384, 384, 384, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 3072, 3072, 3072, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 3072, 3072, 3072, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 32, 32, 32, 768, 768, 768, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 3072, 3072, 3072, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 512, 512, 512, 512, 512, 512, 256, 256, 256, 512, 512, 512, 512, 512, 512, 512, 512, 512, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 128, 128, 128, 128, 128, 128, 1024, 1024, 1024, 1024, 1024, 1024, 128, 128, 128, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 3072, 3072, 3072, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 256, 256, 256, 256, 256, 256, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 3072, 3072, 3072, 2048, 2048, 2048, 384, 384, 384, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 3072, 3072, 3072, 3072, 3072, 3072, 3072, 3072, 3072, 128, 128, 128, 256, 256, 256, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 768, 768, 768, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 128, 128, 128, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 64, 64, 64, 64, 64, 64, 256, 256, 256, 512, 512, 512, 512, 512, 512, 512, 512, 512, 16, 16, 16, 3072, 3072, 3072, 3072, 3072, 3072, 256, 256, 256, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 512, 512, 512, 32, 32, 32, 1024, 1024, 1024, 1024, 1024, 1024, 256, 256, 256, 256, 256, 256, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 32, 32, 32, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 512, 512, 512, 1, 1, 1, 1024, 1024, 1024, 32, 32, 32, 32, 32, 32, 45, 45, 45, 8, 8, 8, 512, 512, 512, 256, 256, 256, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 16, 16, 16, 4, 4, 4, 4, 4, 4, 4, 4, 4, 16, 16, 16, 16, 16, 16, 16, 16, 16, 64, 64, 64, 8, 8, 8, 8, 8, 8, 8, 8, 8, 64, 64, 64, 64, 64, 64, 256, 256, 256, 64, 64, 64, 64, 64, 64, 512, 512, 512, 512, 512, 512, 512, 512, 512, 32, 32, 32, 32, 32, 32, 32, 32, 32, 128, 128, 128, 128, 128, 128, 128, 128, 128, 32, 32, 32, 128, 128, 128, 64, 64, 64, 64, 64, 64, 16, 16, 16, 256, 256, 256, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 256, 256, 256, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 256, 256, 256, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 256, 256, 256, 256, 256, 256, 1024, 1024, 1024, 1024, 1024, 1024, 256, 256, 256, 3072, 3072, 3072, 3072, 3072, 3072, 128, 128, 128, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 128, 128, 128, 128, 128, 128, 64, 64, 64, 256, 256, 256, 256, 256, 256, 512, 512, 512, 768, 768, 768, 768, 768, 768, 16, 16, 16, 32, 32, 32, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 512, 512, 512, 2048, 2048, 2048, 1024, 1024, 1024, 3072, 3072, 3072, 3072, 3072, 3072, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 3072, 3072, 3072, 3072, 3072, 3072, 3072, 3072, 3072, 3072, 3072, 3072, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 3072, 3072, 3072, 3072, 3072, 3072, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 64, 64, 64, 96, 96, 96, 512, 512, 512, 64, 64, 64, 64, 64, 64, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 3072, 3072, 3072, 3072, 3072, 3072, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 512, 512, 512, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 64, 64, 64, 64, 64, 64, 256, 256, 256, 1024, 1024, 1024, 512, 512, 512, 256, 256, 256, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 3072, 3072, 3072, 3072, 3072, 3072, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 3072, 3072, 3072, 2048, 2048, 2048]

With this data if I try as new value this number

# new values
x = np.asarray([128,512,1024,2048,3072,2800])

Something goes wrong with the 3072 (all values are in MB).

This is the result:

      100%      cdfv      kdev
128     26  0.129688  0.499376
512     55  0.275874  0.499671
1024    91  0.454159  0.499936
2048    12  0.062298  0.499150
3072     0  0.001556  0.498364
2800     1  0.004954  0.498573

I can't understand why this happens... the 3072 value appears a lot of time in the data... This is the histogram of my datas... this is very strange because there are some values for 3072 and also for 4096.

enter image description here

Source: (StackOverflow)

ANOVA in python using pandas dataframe with statsmodels or scipy?

I want to use the Pandas dataframe to breakdown the variance in one variable.

For example, if I have a column called 'Degrees', and I have this indexed for various dates, cities, and night vs. day, I want to find out what fraction of the variation in this series is coming from cross-sectional city variation, how much is coming from time series variation, and how much is coming from night vs. day.

In Stata I would use Fixed effects and look at the R^2. Hopefully my question makes sense.

Basically, what I want to do, is find the ANOVA breakdown of "Degrees" by three other columns.

Source: (StackOverflow)

Difference in Python statsmodels OLS and R's lm

I'm not sure why I'm getting slightly different results for a simple OLS, depending on whether I go through panda's experimental rpy interface to do the regression in R or whether I use statsmodels in Python.

import pandas
from rpy2.robjects import r

from functools import partial

loadcsv = partial(pandas.DataFrame.from_csv,
                  index_col="seqn", parse_dates=False)

demoq = loadcsv("csv/DEMO.csv")
rxq = loadcsv("csv/quest/RXQ_RX.csv")

num_rx = {}
for seqn, num in rxq.rxd295.iteritems():
    try:
        val = int(num)
    except ValueError:
        val = 0
    num_rx[seqn] = val

series = pandas.Series(num_rx, name="num_rx")
demoq = demoq.join(series)

import pandas.rpy.common as com
df = com.convert_to_r_dataframe(demoq)
r.assign("demoq", df)
r('lmout <- lm(demoq$num_rx ~ demoq$ridageyr)')  # run the regression
r('print(summary(lmout))')  # print from R

From R, I get the following summary:

Call:
lm(formula = demoq$num_rx ~ demoq$ridageyr)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9086 -0.6908 -0.2940  0.1358 15.7003 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -0.1358216  0.0241399  -5.626 1.89e-08 ***
demoq$ridageyr  0.0358161  0.0006232  57.469  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.545 on 9963 degrees of freedom
Multiple R-squared: 0.249,  Adjusted R-squared: 0.2489 
F-statistic:  3303 on 1 and 9963 DF,  p-value: < 2.2e-16

Using statsmodels.api to do the OLS:

import statsmodels.api as sm
results = sm.OLS(demoq.num_rx, demoq.ridageyr).fit()
results.summary()

The results are similar to R's output but not the same:

OLS Regression Results
Adj. R-squared:  0.247
Log-Likelihood:  -18488.
No. Observations:    9965    AIC:   3.698e+04
Df Residuals:    9964    BIC:   3.698e+04
             coef   std err  t     P>|t|    [95.0% Conf. Int.]
ridageyr     0.0331  0.000   82.787    0.000        0.032 0.034

The install process is a a bit cumbersome. But, there is an ipython notebook here, that can reproduce the inconsistency.

Source: (StackOverflow)

Plotting confidence intervals for Maximum Likelihood Estimate

I am trying to write code to produce confidence intervals for the number of different books in a library (as well as produce an informative plot).

My cousin is at elementary school and every week is given a book by his teacher. He then reads it and returns it in time to get another one the next week. After a while we started noticing that he was getting books he had read before and this became gradually more common over time.

Say the true number of books in the library is N and the teacher picks one uniformly at random (with replacement) to give to you each week. If at week t the number of occasions on which you have received a book you have read is x, then I can produce a maximum likelihood estimate for the number of books in the library following http://math.stackexchange.com/questions/615464/how-many-books-are-in-a-library .

Example: Consider a library with five books A, B, C, D, and E. If you receive books [A, B, A, C, B, B, D] in seven successive weeks, then the value for x (the number of duplicates) will be [0, 0, 1, 1, 2, 3, 3] after each of those weeks, meaning after seven weeks, you have received a book you have already read on three occasions.

To visualise the likelihood function (assuming I have understood what one is correctly) I have written the following code which I believe plots the likelihood function. The maximum is around 135 which is indeed the maximum likelihood estimate according to the MSE link above.

from __future__ import division
import random
import matplotlib.pyplot as plt
import numpy as np

#N is the true number of books. t is the number of weeks.unk is the true number of repeats found 
t = 30
unk = 3
def numberrepeats(N, t):
    return t - len(set([random.randint(0,N) for i in xrange(t)]))

iters = 1000
ydata = []
for N in xrange(10,500):
    sampledunk = [numberrepeats(N,t) for i in xrange(iters)].count(unk)
    ydata.append(sampledunk/iters)

print "MLE is", np.argmax(ydata)
xdata = range(10, 500)
print len(xdata), len(ydata)
plt.plot(xdata,ydata)
plt.show()

The output looks like

enter image description here

My questions are these:

Is there an easy way to get a 95% confidence interval and plot it on the diagram?
How can you superimpose a smoothed curve over the plot?
Is there a better way my code should have been written? It isn't very elegant and is also quite slow.

Finding the 95% confidence interval means finding the range of the x axis so that 95% of the time the empirical maximum likelihood estimate we get by sampling (which should theoretically be 135 in this example) will fall within it. The answer @mbatchkarov has given does not currently do this correctly.

There is now a mathematical answer at http://math.stackexchange.com/questions/656101/how-to-find-a-confidence-interval-for-a-maximum-likelihood-estimate .

Source: (StackOverflow)

Making Probability Distribution Functions (PDFs) from histograms

Say I have several histograms, each with counts at different bin locations (on a real axis). e.g.

def generate_random_histogram():

    # Random bin locations between 0 and 100
    bin_locations = np.random.rand(10,) * 100
    bin_locations.sort()

    # Random counts between 0 and 50 on those locations 
    bin_counts = np.random.randint(50, size=len(bin_locations))
    return {'loc': bin_locations, 'count':bin_counts}

# We can assume that the bin size is either pre-defined or that 
# the bin edges are on the middle-point between consecutive counts.
hists = [generate_random_histogram() for x in xrange(3)]

How can I normalize these histograms so that I get PDFs where the integral of each PDF adds up to one within a given range (e.g. 0 and 100)?

We can assume that the histogram counts events on pre-defined bin size (e.g. 10)

Most implementations I have seen are based, for example, on Gaussian Kernels (see scipy and scikit-learn) that start from the data. In my case, I need to do this from the histograms, since I don't have access to the original data.

Update:

Note that all current answers assume that we are looking at a random variable that lives in (-Inf, +Inf). That's fine as a rough approximation, but this may not be the case depending on the application, where the variable may be defined within some other range [a,b] (e.g. 0 and 100 in the above case)

Source: (StackOverflow)

Highest Posterior Density Region and Central Credible Region

Given a posterior p(Θ|D) over some parameters Θ, one can define the following:

Highest Posterior Density Region:

The Highest Posterior Density Region is the set of most probable values of Θ that, in total, constitute 100(1-α) % of the posterior mass.

In other words, for a given α, we look for a p* that satisfies:

enter image description here

and then obtain the Highest Posterior Density Region as the set:

enter image description here

Central Credible Region:

Using the same notation as above, a Credible Region (or interval) is defined as:

enter image description here

Depending on the distribution, there could be many such intervals. The central credible interval is defined as a credible interval where there is (1-α)/2 mass on each tail.

Computation:

For general distributions, given samples from the distribution, are there any built-ins in to obtain the two quantities above in Python or PyMC?
For common parametric distributions (e.g. Beta, Gaussian, etc.) are there any built-ins or libraries to compute this using SciPy or statsmodels?

Source: (StackOverflow)

Python statistics package: difference between statsmodel and scipy.stats [closed]

I need some advice on selecting statistics package for Python, I've done quite some search, but not sure if I get everything right, specifically on the differences between statsmodels and scipy.stats.

One thing that I know is those with scikits namespace are specific "branches" of scipy, and what used to be scikits.statsmodels is now called statsmodels. On the other hand there is also scipy.stats. What are the differences between the two, and which one is the statistics package for Python?

Thanks.

--EDIT--

I changed the title because some answers are not really related to the question, and I suppose that's because the title is not clear enough.

Source: (StackOverflow)

confidence and prediction intervals with StatsModels

I do this linear regression with StatsModels:

import numpy as np
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

#measurements genre
nmuestra = 100

x = np.linspace(0, 10, nmuestra)
e = np.random.normal(size=nmuestra)
y = 1 + 0.5*x + 2*e
X = sm.add_constant(x)

re = sm.OLS(y, X).fit()
print re.summary()    #print the result type Stata

prstd, iv_l, iv_u = wls_prediction_std(re)

My questions are, iv_l and iv_u are the upper and lower confidence intervals or prediction intervals?? How I get the other?? (I need the confidence and prediction intervals for all point, to do as plot)

Source: (StackOverflow)

Best practice for upgrading Python modules?

Good morning,

I've been learning Python for two or three months now but now finding some problems with my 2.7 installation as I've looked into modules such as nltk.

However, when I want to list modules using help ("modules) I have the main error which I think explains the problem is:

    /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/distribute-0.6.28-py2.7.egg/setuptools/command/install_scripts.py:3: UserWarning: Module numpy was already imported from /Library/Python/2.7/site-packages/numpy-override/numpy/__init__.pyc, but /Library/Python/2.7/site-packages/numpy-1.8.0.dev_5c944b9_20120828-py2.7-macosx-10.8-x86_64.egg is being added to sys.path
from pkg_resources import Distribution, PathMetadata, ensure_directory

I also receive the following error to do with deprecated modules:

    /Library/Python/2.7/site-packages/statsmodels-0.5.0-py2.7-macosx-10.8-intel.egg/scikits/statsmodels/__init__.py:2: UserWarning: scikits.statsmodels namespace is deprecated and will be removed in 0.5, please use statsmodels instead

I'm still trying to get to grips with paths and wonder if anyone can help me avoid this issue in future. Thank you.

Source: (StackOverflow)

python 3 + statsmodels?

If I do sudo pip3 install statsmodels I get errors. I pasted the end of the console output below. I see a numpy 1.7 warning, yet if I do pip3 freeze | grep numpy, I see that I'm using numpy==1.8.1.

Here is the output. any ideas?

/usr/lib/python3/dist-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]

 #warning "Using deprecated NumPy API, disable it by " \

  ^

statsmodels/tsa/kalmanf/kalman_loglike.c: In function ‘__Pyx_TraceSetupAndCall’:

statsmodels/tsa/kalmanf/kalman_loglike.c:7021:17: error: ‘PyFrameObject’ has no member named ‘f_tstate’

         (*frame)->f_tstate = PyThreadState_GET();

                 ^

In file included from /usr/lib/python3/dist-packages/numpy/core/include/numpy/ndarrayobject.h:26:0,

                 from /usr/lib/python3/dist-packages/numpy/core/include/numpy/arrayobject.h:4,

                 from statsmodels/tsa/kalmanf/kalman_loglike.c:257:

statsmodels/tsa/kalmanf/kalman_loglike.c: At top level:

/usr/lib/python3/dist-packages/numpy/core/include/numpy/__multiarray_api.h:1629:1: warning: ‘_import_array’ defined but not used [-Wunused-function]

 _import_array(void)

 ^

In file included from /usr/lib/python3/dist-packages/numpy/core/include/numpy/ufuncobject.h:327:0,

                 from statsmodels/tsa/kalmanf/kalman_loglike.c:258:

/usr/lib/python3/dist-packages/numpy/core/include/numpy/__ufunc_api.h:241:1: warning: ‘_import_umath’ defined but not used [-Wunused-function]

 _import_umath(void)

 ^

error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

----------------------------------------
Cleaning up...
Command /usr/bin/python3 -c "import setuptools, tokenize;__file__='/tmp/pip_build_root/statsmodels/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-ukof84o0-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /tmp/pip_build_root/statsmodels
Storing debug log for failure in /home/kurt/.pip/pip.log

Source: (StackOverflow)