EzDevInfo.com

patsy

Describing statistical models in Python using symbolic formulas

Using patsy in PySpark

I like using formulas in R. Python has a library patsy that can form matrices from the formulas, including categorical variable creation.

Is there a good way to use patsy when constructing RDDs in PySpark? I'd like to see examples.


Source: (StackOverflow)

as_formula specifier for sklearn.tree.DecisionTreeClassifier in Python?

I was curious whether there is an as_formula specifier (like in statsmodels) for sklearn.tree.DecisionTreeClassifier in Python, or some way to hack one in. Currently, I must use

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

but I would prefer to have something like

clf = clf.fit(formula='Y ~ X', data=df)

The reason is that I would like to specify more than one X without having to do a lot of array shaping. Thanks.
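One possible workaround, sketched here on the assumption that patsy and pandas are installed: let patsy.dmatrices build the arrays from the formula, then pass them to fit as usual. The helper name formula_to_arrays and the toy columns X1 and X2 are hypothetical, not part of scikit-learn.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices


def formula_to_arrays(formula, data):
    """Turn an R-style formula into (X, y) suitable for estimator.fit(X, y)."""
    y, X = dmatrices(formula, data, return_type="dataframe")
    # Flatten the single-column outcome frame into a 1-D array.
    return X, np.ravel(y)


# Hypothetical toy data: two predictors and a binary target.
df = pd.DataFrame({
    "Y": [0, 1, 0, 1],
    "X1": [1.0, 2.0, 3.0, 4.0],
    "X2": [0.5, 0.1, 0.9, 0.3],
})

# "- 1" drops patsy's intercept column, which a tree does not need.
X, y = formula_to_arrays("Y ~ X1 + X2 - 1", df)
```

X and y can then be handed to clf.fit(X, y); adding another predictor only means extending the formula string.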


Source: (StackOverflow)

ipython notebook and patsy categorical variable (formula)

I had the same error as in this question.

What is weird is that the answer provided there works in an IPython shell, but not in an IPython notebook. The problem seems related to the C() operator, because the formula works without it (just not with the operator).

The same happens with this example:

import statsmodels.formula.api as smf
import numpy as np
import pandas


url = "http://vincentarelbundock.github.com/Rdatasets/csv/HistData/Guerry.csv"
df = pandas.read_csv(url)
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head()
mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
res = mod.fit()
print(res.summary())

This works well, both in the IPython notebook and in the shell, and patsy treats Region as a categorical variable because it is composed of strings.

But if I try this (as in the tutorial):

res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()

I get an error in the IPython notebook:

TypeError: 'Series' object is not callable

Note that statsmodels and patsy are the same versions in both the notebook and the shell (0.5.0 and 0.3.0 respectively).

Do you get the same error?


Source: (StackOverflow)

Converting a dictionary of 16k dicts to a pandas DataFrame

I'm working on a data mining problem for my Master's thesis. I'm using Python for data analysis, but I have no experience with pandas, which I need in order to convert my data to a DataFrame. To do survival regression with a Python package called Lifelines, I need to create a covariate matrix from my experiment_data dict, which contains over 16k dicts with Twitter data about Kickstarter projects (see the example dict below).

16041: {'goal': 1200, 'launch': 1353544772, 'days-before-deadline': 3, 'followers': 149, 'date-funded': 1355887690.9189188, 'id': 52687, 'tweet_ids': [280965208409796608, ... n], 'state': 1, 'deadline': 1356136772, 'retweets': 0, 'favorites': 0, 'duration': 31, 'timestamps': [1355876412.0], 'favourites': 0, 'runtime': 27, 'friends': 127, 'pledges': [0.0, 0.0625, 0.0625, ... n], 'statuses': 7460}

If I create a pandas DataFrame from this dict, I'll be able to create a covariate matrix by using Patsy, for example like this:

X = patsy.dmatrix('friends + followers + retweets + favorites - 1', data, return_type='dataframe')

Now my question is how to create a pandas DataFrame from the experiment_data dicts. The keys of the inner dictionaries (goal, launch, followers, etc.) should become the columns, with one row per Kickstarter project (i.e. index numbers 0 to 16041).

Any help would be really appreciated. Thanks in advance!

P.S. If you have experience in Survival Regression using Python and Lifelines, please let me know!
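A minimal sketch of one way to do this, assuming pandas is available: DataFrame.from_dict with orient="index" treats every outer key as a row label and every inner key as a column. The two records below are a shortened, hypothetical stand-in for experiment_data.

```python
import pandas as pd

# Hypothetical, shortened stand-in for experiment_data (outer key -> inner dict).
experiment_data = {
    0: {"goal": 500, "followers": 20, "friends": 35, "retweets": 1, "favorites": 2},
    1: {"goal": 1200, "followers": 149, "friends": 127, "retweets": 0, "favorites": 0},
}

# orient="index" makes each outer key a row and each inner key a column.
data = pd.DataFrame.from_dict(experiment_data, orient="index")
```

List-valued fields such as tweet_ids or pledges would end up as object columns holding Python lists, so they would probably need to be summarised (for example by length) before they can go into a formula.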


Source: (StackOverflow)

Mapping dummy variables in pandas data frame

I have a large data frame with 11 columns. I need to convert categorical variables into binary values, so I used Patsy:

attributes = "admit ~ C(gender) + age + C(ethnicity) + C(state) + gpa + sci_gpa + mcat + C(major) + C(tier) + C(same_ins)"
y, X = dmatrices(attributes, df, return_type="dataframe")

This works well. However, I want to test a new sample using data that is stored in the format of the original data frame, for example:

gender    age    ethnicity    state    gpa    sci_gpa    gre    major    tier    same_ins
male      21     Asian        NV       3.4    3.2        .99    Physics  1       1     

Is there an easy way to convert this into the same format as X?
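One approach that may help, sketched under the assumption that X was built with return_type="dataframe": the frame returned by dmatrices carries a design_info attribute, and patsy.build_design_matrices can reuse it to encode new rows with the same columns. The small training frame and the shortened formula here are hypothetical.

```python
import pandas as pd
from patsy import dmatrices, build_design_matrices

# Hypothetical, shortened training frame with the same kind of columns.
df = pd.DataFrame({
    "admit": [1, 0, 1, 0],
    "gender": ["male", "female", "female", "male"],
    "age": [21, 23, 22, 24],
    "gpa": [3.4, 3.1, 3.8, 2.9],
})
y, X = dmatrices("admit ~ C(gender) + age + gpa", df, return_type="dataframe")

# A new sample in the original (unencoded) layout.
new_sample = pd.DataFrame({"gender": ["male"], "age": [21], "gpa": [3.4]})

# X.design_info remembers the categorical levels and column order used for X.
(X_new,) = build_design_matrices([X.design_info], new_sample,
                                 return_type="dataframe")
```

Because the levels come from the stored design information, a single row is enough. A category value that never appeared in the training data should raise an error rather than silently produce a new column.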


Source: (StackOverflow)

How to get rid of main effects when coding interaction between categorical variables in patsy?

I have a problem very similar to:

Interaction effects in patsy with patsy.dmatrices giving duplicate columns for ":" as with "+" , or "*"

except that I have other categorical variables besides the interaction term. My formula is :

f = 'VarDep ~ C(MoisAvantDep):C(Groupe) + C(JourSemDep) + C(MoisDep) + jour_nuit'

When I run an ols regression in statsmodels with this formula, I get main effects for the variable "Groupe", which I would like to avoid. I tried to add -1 in the formula (as suggested in the above mentioned discussion), but still get the main effects.

Any suggestions?


Source: (StackOverflow)

Easily configure categorical variables

I have a categorical variable, let's say cat_var, which can take the following values: cat_var = ["A", "B", "C", "D"]

I run a series of regressions, and patsy makes it easy to describe a regression: regr = "y ~ x + C(cat_var)"

I was wondering what the easiest way is to tune how the categorical variable is used. For example, let's say I would like patsy to create dummies only for "A" and "B", i.e. "C" and "D" are treated as one single group. I could remap cat_var to another set of values, but is there some sugar in patsy that does this already?
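As far as I know patsy has no dedicated syntax for merging levels, so the remapping mentioned above may be the simplest route. A minimal sketch, assuming pandas is used for the data; the column name cat_grouped and the merged label "CD" are hypothetical.

```python
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({"cat_var": ["A", "B", "C", "D", "A", "C"]})

# Keep "A" and "B" as they are; every other value becomes the single level "CD".
df["cat_grouped"] = df["cat_var"].where(df["cat_var"].isin(["A", "B"]), "CD")

# Treatment coding: "A" is the reference level, "B" and "CD" get indicator columns.
X = dmatrix("C(cat_grouped)", df, return_type="dataframe")
```

Changing which levels are kept separate only requires editing the list passed to isin, so a series of regressions can loop over different groupings.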


Source: (StackOverflow)

How to prepare large datasets with Patsy's API?

I'm running a logistic regression and having trouble using Patsy's API to prepare the data when it is bigger than a small sample.

Using the dmatrices function directly on a DataFrame, I am left with this abrupt error (please note: I spun up an EC2 instance with 300GB of RAM after encountering this on my laptop, and got the same error):

Traceback (most recent call last):
File "My_File.py", line 22, in <module>
   df, return_type="dataframe")
File "/root/anaconda/lib/python2.7/site-packages/patsy/highlevel.py", line 297, in dmatrices
 NA_action, return_type)
File "/root/anaconda/lib/python2.7/site-packages/patsy/highlevel.py", line 156, in do_highlevel_design
return_type=return_type)
File "/root/anaconda/lib/python2.7/site-packages/patsy/build.py", line 989, in build_design_matrices
results.append(builder._build(evaluator_to_values, dtype))
File "/root/anaconda/lib/python2.7/site-packages/patsy/build.py", line 821, in _build
m = DesignMatrix(np.empty((num_rows, self.total_columns), dtype=dtype),
MemoryError

So, I combed through Patsy's docs and found this gem:

patsy.incr_dbuilder(formula_like, data_iter_maker, eval_env=0)
    Construct a design matrix builder incrementally from a large data set.

However, the method is sparsely documented, and the source code is largely uncommented.

I have arrived at this code:

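import csv

from patsy import dmatrix, incr_dbuilders


def iter_maker():
    # Yield one row (a dict of strings) at a time from the TSV file.
    with open("test.tsv", "r") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            yield row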


y, dta = incr_dbuilders("s ~ C(x) + C(y):C(rgh) + \
C(z):C(f) + C(r):C(p) + C(q):C(w) + \
C(zr):C(rt) + C(ff):C(djjj) + C(hh):C(tt) + \
C(bb):lat + C(jj):lng + C(ee):C(bb) + C(qq):C(uu)",
        iter_maker)

df = dmatrix(dta, {}, 0, "drop", return_type="dataframe")

but I receive PatsyError: Error evaluating factor: NameError: name 'ff' is not defined

This is being thrown because _try_incr_builders (called from dmatrix) is returning None on line 151 of highlevel.py

What is the correct way to use these Patsy functions to prepare my data? Any examples or guidance you may have will be helpful.
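A sketch of how these pieces seem intended to fit together, based on my reading of the patsy API: incr_dbuilders makes passes over the data to learn the design (for example the categorical levels), and patsy.build_design_matrices then encodes each chunk. Passing an empty dict as the data to dmatrix is probably why the factor lookup fails with a NameError. The in-memory chunks, the shortened formula, and the variable names below are hypothetical; in this sketch each yielded item is a chunk holding several rows as lists.

```python
import numpy as np
from patsy import incr_dbuilders, build_design_matrices

# Hypothetical in-memory chunks; in practice each could be a slice of a large file.
chunks = [
    {"s": [1.0, 0.0], "x": ["a", "b"], "lat": [1.0, 2.0]},
    {"s": [1.0, 1.0], "x": ["c", "a"], "lat": [3.0, 4.0]},
]


def iter_maker():
    # patsy may call this several times, so return a fresh iterator each time.
    return iter(chunks)


# First stage: learn the design from the data without building any matrix.
y_info, x_info = incr_dbuilders("s ~ C(x) + lat", iter_maker)

# Second stage: encode one chunk at a time with the learned design.
parts = [build_design_matrices([y_info, x_info], chunk) for chunk in iter_maker()]
y = np.vstack([np.asarray(p[0]) for p in parts])
X = np.vstack([np.asarray(p[1]) for p in parts])
```

Stacking everything at the end, as done here for illustration, still needs memory for the full matrix. With a truly large data set each encoded chunk would more likely be written to disk or fed to a model that supports incremental fitting.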


Source: (StackOverflow)

Pandas + Patsy + Statsmodels Linear Reg issue passing in categorical variable (duplicate rows)

[Preface: I now realize I should've used a classification model (maybe decision tree) instead but I ended up using a linear regression model.]

I had a pandas dataframe as such:

(screenshot of the original dataframe; image not available)

I want to predict the audience score using genre, year, and tomato-meter score. But as the dataframe is constructed, the genres for each movie come in a list, so I felt the need to isolate each genre in order to pass the genres into my model as separate variables.

After doing so, my modified dataframe looks like this, with duplicate rows for each movie but each genre element of that movie isolated (just one movie pulled from the dataframe to show):

(screenshot of the modified dataframe; image not available)

Now, my question is: can I pass this second dataframe as is to Patsy and a statsmodels linear regression, or will the row duplication introduce bias into my model?

y1, X1 = dmatrices('Q("Audience Score") ~ Year + Q("Tomato-meter") + Genre',
                   data=DF2, return_type='dataframe')

In summary, I'm looking for a way for patsy and my model to treat each genre as a separate variable, but I want to make sure I'm not fudging the numbers/model by passing in a dataframe in this format as the data (as not every movie has the same number of genres).
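With duplicated rows, a movie that has more genres contributes more observations, so those movies would weigh more heavily in the fit. One alternative, sketched here with hypothetical data, is to keep one row per movie and add a 0/1 indicator column per genre, built from the original list column with pandas string methods.

```python
import pandas as pd

# Hypothetical stand-in for the original frame: one row per movie,
# with the genres stored as a list.
df = pd.DataFrame({
    "Title": ["Movie A", "Movie B"],
    "Year": [2001, 2010],
    "Genre": [["Comedy", "Drama"], ["Drama"]],
})

# Join each list into "Comedy|Drama" and expand it into 0/1 indicator columns.
genre_dummies = df["Genre"].str.join("|").str.get_dummies(sep="|")

# One row per movie, one indicator column per genre.
wide = pd.concat([df.drop(columns="Genre"), genre_dummies], axis=1)
```

The indicator columns can then be listed in the formula like any other numeric column, using Q("...") for genre names that are not valid Python identifiers.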


Source: (StackOverflow)

Logistic Regression Bigram Text Classification w/ Patsy

I'm working on upgrading a LogisticRegression text classification from single word features to bigrams (two word features). However when I include the two word feature in the formula sent to patsy.dmatrices, I receive the following error...

y, X = dmatrices("is_host ~ dedicated + hosting + dedicated hosting", df, return_type="dataframe")

  File "<string>", line 1
    dedicated hosting
                ^
SyntaxError: unexpected EOF while parsing

I've looked around online for any examples on how to approach this and haven't found anything. I tried throwing a few different syntax options at the formula and none seem to work.

"is_host ~ dedicated + hosting + {dedicated hosting}"
"is_host ~ dedicated + hosting + (dedicated hosting)"
"is_host ~ dedicated + hosting + [dedicated hosting]"

What is the proper way to include multi-word features in the formula passed to dmatrices?
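Patsy evaluates each term as Python code, so a name containing a space cannot be parsed directly. Its Q() helper quotes a column name for exactly this kind of case. A minimal sketch with a hypothetical toy frame, assuming the bigram is stored in a column literally named "dedicated hosting":

```python
import pandas as pd
from patsy import dmatrices

# Hypothetical toy frame; the bigram column name contains a space.
df = pd.DataFrame({
    "is_host": [1, 0, 1, 0],
    "dedicated": [1, 0, 1, 1],
    "hosting": [1, 1, 0, 0],
    "dedicated hosting": [1, 0, 0, 0],
})

# Q("...") makes patsy look the quoted name up as a column instead of parsing it.
y, X = dmatrices('is_host ~ dedicated + hosting + Q("dedicated hosting")',
                 df, return_type="dataframe")
```

Renaming the column to something like dedicated_hosting before building the formula would avoid the quoting altogether.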


Source: (StackOverflow)

Patsy's dmatrices cannot read my formula

I have a function LogReg, which is as follows: (using justmarkham's code as inspiration)

def LogReg(self):
      formulA = "class ~"
      print self.frame #dataframe used
      print self.columnNames[:-1]
      for a in self.columnNames[:-1]:
         formulA += " {0} +".format(a)
      formula = formulA[:-2] #strip the trailing " +" left by the loop
      print "formula = " + formula
      Y,X = dmatrices(formula, self.frame, return_type="dataframe")
      Y = np.ravel(Y) #flatten Y to a 1D list
      model = LogisticRegression() #from sklearn.linear_model
      model = model.fit(X, Y)
      print model.score(X, Y)

with the following outcome:

         a0 a1  a2  a3 class
picture1  1  2   3  67     1
picture2  6  7  45  61     3
picture3  8  7   6   5     2
picture4  1  2   4   3     0
['a0', 'a1', 'a2', 'a3']
formula = class ~ a0 + a1 + a2 + a3
Traceback (most recent call last):
  File "classification.py", line 80, in <module>
    c.LogReg()
  File "classification.py", line 61, in LogReg
    Y,X = dmatrices(formula, self.frame, return_type="dataframe")
  File "/<path>/python2.7/site-packages/patsy/highlevel.py", line 297, in dmatrices
    NA_action, return_type)
  File "/<path>/python2.7/site-packages/patsy/highlevel.py", line 152, in _do_highlevel_design
    NA_action)
  File "/<path>/python2.7/site-packages/patsy/highlevel.py", line 57, in _try_incr_builders
    NA_action)
  File "/<path>/python2.7/site-packages/patsy/build.py", line 660, in design_matrix_builders
    NA_action)
  File "/<path>/python2.7/site-packages/patsy/build.py", line 424, in _examine_factor_types
    value = factor.eval(factor_states[factor], data)
  File "/<path>/python2.7/site-packages/patsy/eval.py", line 485, in eval
    return self._eval(memorize_state["eval_code"], memorize_state, data)
  File "/<path>/python2.7/site-packages/patsy/eval.py", line 468, in _eval
    code, inner_namespace=inner_namespace)
  File "/<path>/python2.7/site-packages/patsy/compat.py", line 117, in call_and_wrap_exc
    return f(*args, **kwargs)
  File "/<path>/python2.7/site-packages/patsy/eval.py", line 125, in eval
    code = compile(expr, source_name, "eval", self.flags, False)
  File "<string>", line 1
    class
        ^
SyntaxError: unexpected EOF while parsing

I do not see what goes wrong here: to my knowledge the string does not contain an EOF character, nor does the Python code seem erroneous. Therefore, the question: where does it go wrong (and preferably, how do I fix it)?

P.S.: The software used are all the most recent stable packages as available on 04/09/2015.
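The last frames of the traceback show patsy compiling the term class as a Python expression, and class is a reserved word, which would explain the SyntaxError. One way around it is to wrap such names in patsy's Q(), or to rename the column. The helpers below are a hypothetical sketch using only the standard library; quote_term and build_formula are not patsy functions.

```python
import keyword


def quote_term(name):
    """Wrap a column name in Q("...") unless it is a plain, non-reserved identifier."""
    if keyword.iskeyword(name) or not name.isidentifier():
        return 'Q("{0}")'.format(name)
    return name


def build_formula(target, predictors):
    """Assemble 'target ~ p1 + p2 + ...'; join() leaves no trailing '+' to strip."""
    right_side = " + ".join(quote_term(p) for p in predictors)
    return "{0} ~ {1}".format(quote_term(target), right_side)


formula = build_formula("class", ["a0", "a1", "a2", "a3"])
print(formula)  # Q("class") ~ a0 + a1 + a2 + a3
```

The resulting string can be passed to dmatrices in place of the hand-built one. Column names that themselves contain a double quote are not handled by this sketch.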


Source: (StackOverflow)

patsy formula - adding powers of a factor

I use patsy to build design matrix. I need to include powers of the original factors. For example, with the regression y~x1+x1^2+x2+x2^2+x2^3, I want to be able to write

patsy.dmatrix('y~x1 + x1**2 + x2 + x2**2 + x2**3', data)

where data is a dataframe that contains the columns y, x1, and x2. But it does not seem to work at all. Any solutions?
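Two things may be going on. In a patsy formula, ** is a formula operator rather than arithmetic, so the power has to be wrapped in I() to be evaluated as ordinary Python. And dmatrix expects a one-sided formula, so a formula with y on the left needs dmatrices instead. A minimal sketch with hypothetical toy data:

```python
import pandas as pd
from patsy import dmatrices

# Hypothetical toy data with the three columns named in the question.
data = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0],
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [2.0, 3.0, 5.0, 7.0],
})

# I(...) makes patsy evaluate ** as ordinary Python exponentiation.
y, X = dmatrices("y ~ x1 + I(x1**2) + x2 + I(x2**2) + I(x2**3)",
                 data, return_type="dataframe")
```

X then holds the intercept, x1, x1 squared, x2, x2 squared, and x2 cubed, in that order.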


Source: (StackOverflow)

regression on trend + seasonal using python statsmodels

I have a question regarding regression in Python. To make a long story short, I need to find a model of the form yt = mt + st, where mt and st are the trend and seasonal components respectively. In my earlier regression analysis, I found that a good model for mt is a quadratic trend of the form mt = a0 + a1*t + a2*t^2. Adding the seasonal component is where I am having the hardest time. I approached this in two ways: one through R, calling R objects from Python, and the other purely in Python. Following the example in my book, I did the following using R:

%load_ext rmagic
import rpy2.robjects as R
import pandas.rpy.common as com
from rpy2.robjects.packages import importr

stats = importr('stats')
r_df = com.convert_to_r_dataframe(pd.DataFrame(data.logTotal))
%Rpush r_df
%R ss = as.factor(rep(1:12,length(r_df$logTotal)/12))
%R tt = 1:length(r_df$logTotal)
%R tt2 = cbind(tt,tt^2)
%R ts_model = lm(r_df$logTotal ~ tt2+ss-1)
%R print(summary(ts_model))

I get the right regression coefficients. But when I try to do the same thing in Python, I have trouble replicating it.

import statsmodels.formula.api as smf
ss_temp = pd.Categorical.from_array(np.repeat(np.arange(1,13),len(data.logTotal)/12))
dtemp = np.column_stack((t,t**2,data.logTotal))
dtemp = pd.DataFrame(dtemp,columns=['t','tsqr','logTotal'])
dtemp['ss'] = ss_temp
res_result = smf.ols(formula='logTotal ~ t+tsqr + C(ss) -1',data=dtemp).fit()
res_result.params

What am I doing wrong here? I first got an error saying 'data type not found', which points to the res_result formula. I then tried changing ss_temp to a Series, and the statements above worked. However, my parameters were completely off compared to the R output. I have spent a day on this to no avail. Can someone please help or guide me on what to do, or is there a Python equivalent to as.factor in R? I assumed that was Categorical in pandas.

Thanks

If the above is too hard, that's fine. I still have the residual model from my regression in R. But are there any ideas on how to convert this to a Python equivalent of what statsmodels interprets as a res from a regression? Thanks again.
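One likely source of the mismatched parameters is the seasonal index itself: R's rep(1:12, n) cycles through the whole vector n times, which corresponds to np.tile, whereas np.repeat repeats each element in place. A small sketch of the difference, with a hypothetical series length:

```python
import numpy as np

n_obs = 24  # hypothetical series length, a multiple of 12

# R's rep(1:12, n_obs/12) cycles through the months: 1, 2, ..., 12, 1, 2, ...
ss_tile = np.tile(np.arange(1, 13), n_obs // 12)

# np.repeat instead repeats each value in place: 1, 1, 2, 2, ..., 12, 12
ss_repeat = np.repeat(np.arange(1, 13), n_obs // 12)
```

With the repeated version, the first observations would all be assigned to season 1, so the seasonal dummies no longer line up with the calendar months.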


Source: (StackOverflow)

statsmodels.ols.predict() is not working with exog=dict

Because statsmodels.OLS.params returns only a np.array() without the corresponding keys from the dataframe it is impossible to 'lookup' regressors when trying to use statsmodels.OLS.predict...especially with categorical regressors with a lot of categories (I have 1k plus params) in a relatively large and rich dataset.

So I tried to go back to the statsmodels formula api. Which is fine, however, I have been unable to follow the simple pattern:

import statsmodels.formula.api as smf

model = smf.ols(formula,data=subdata).fit()

x = dict(Capacity=[275],Age=[11.79],Type=['Jack'],SaleType=['Retail'])
outcome = model.predict(x)

********************
AttributeError: 'DataFrame' object has no attribute 'design_info'

This makes me assume two things: the statsmodels api predict method doesn't like dictionaries, and I should try a pandas dataframe and make sure that design_info gets passed along with it.

I import patsy and create a DesignMatrix:

   temp = list(x.values())
   design_info = patsy.DesignMatrix(temp,x.keys())

Which gives me the following error:

ValueError: wrong number of column names for design matrix (got 4, wanted 1)

I've tried recreating the temp variable so that it would be four columns and one row, but again had no luck. What am I doing wrong?


Source: (StackOverflow)