patsy
Describing statistical models in Python using symbolic formulas
I like using formulas in R. Python has a library patsy that can form matrices from the formulas, including categorical variable creation.
Is there a good way to use patsy when constructing RDDs in PySpark? I'd like to see examples.
Source: (StackOverflow)
I was curious whether there is an as_formula specifier (like in statsmodels) for sklearn.tree.DecisionTreeClassifier in Python, or some way to hack one in. Currently, I must use
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
but I would prefer to have something like
clf = clf.fit(formula='Y ~ X', data=df)
The reason is that I would like to specify more than one X without having to do a lot of array shaping. Thanks.
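As far as I know, scikit-learn estimators only take arrays in fit, but patsy can build those arrays from a formula. A minimal sketch, assuming patsy and scikit-learn are installed; fit_formula is a hypothetical helper name, and the columns and values are made up for illustration:

```python
import pandas as pd
from patsy import dmatrices
from sklearn import tree

# Made-up data: one numeric and one string-valued predictor.
df = pd.DataFrame({
    "Y": [0, 1, 0, 1, 1, 0],
    "X1": [1.0, 2.5, 0.7, 3.1, 2.9, 0.4],
    "X2": ["a", "b", "a", "b", "b", "a"],
})


def fit_formula(estimator, formula, data):
    """Hypothetical helper: expand the formula with patsy, then call fit."""
    y, X = dmatrices(formula, data, return_type="dataframe")
    # y has a single column here; pass it as a 1-D Series.
    estimator.fit(X, y.iloc[:, 0])
    return estimator, X


# "- 1" drops the intercept column, which a tree has no use for.
clf, X = fit_formula(tree.DecisionTreeClassifier(), "Y ~ X1 + X2 - 1", df)
predictions = clf.predict(X)
```

Adding more predictors then only means extending the right-hand side of the formula string.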
Source: (StackOverflow)
I had the same error as in this question.
What is weird is that it works (with the answer provided) in an ipython shell, but not in an ipython notebook. It seems related to the C() operator, because without it, it works (but not as an operator).
The same happens with this example:
import statsmodels.formula.api as smf
import numpy as np
import pandas
url = "http://vincentarelbundock.github.com/Rdatasets/csv/HistData/Guerry.csv"
df = pandas.read_csv(url)
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head()
mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
res = mod.fit()
print(res.summary())
This works well, both in the ipython notebook and in the shell, and patsy treats Region as a categorical variable because it's composed of strings.
But if I try this (as in the tutorial):
res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
I got an error in the ipython notebook:
TypeError: 'Series' object is not callable
Note that statsmodels and patsy are the same versions in both the notebook and the shell (0.5.0 and 0.3.0 respectively).
Do you get the same error?
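One way this message can arise is when the name C is bound to a pandas Series somewhere patsy looks names up (the data or the calling namespace), so that it hides patsy's own C(). Whether that is what happened in the notebook is only an assumption. A small sketch that reproduces the message with made-up data; the function names are hypothetical:

```python
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({
    "Wealth": [73, 22, 61, 76],
    "Region": ["E", "N", "C", "E"],
})


def with_shadowed_C(frame):
    # Assumption for illustration: a leftover variable called C lives in
    # the namespace that patsy captures when it evaluates the formula.
    C = frame["Wealth"]
    return dmatrix("C(Region)", frame)


def without_shadowing(frame):
    # No variable named C here, so patsy's own C() is used.
    return dmatrix("C(Region)", frame)


try:
    with_shadowed_C(df)
    error_message = ""
except Exception as exc:  # the exact exception class may differ by patsy version
    error_message = str(exc)

ok = without_shadowing(df)
```

If this is the cause, checking the notebook namespace for a variable named C (and deleting or renaming it) would be the first thing to try.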
Source: (StackOverflow)
I'm working on a data mining problem for my Master's thesis. I'm using Python for data analysis, but I have no experience with Pandas, which I need in order to convert my data to a DataFrame. To do survival regression with a Python package called Lifelines, I need to create a covariate matrix from my experiment_data dict, which contains over 16k dicts with Twitter data about Kickstarter projects (see the example dict below).
16041: {'goal': 1200, 'launch': 1353544772, 'days-before-deadline': 3, 'followers': 149, 'date-funded': 1355887690.9189188, 'id': 52687, 'tweet_ids': [280965208409796608, ... n], 'state': 1, 'deadline': 1356136772, 'retweets': 0, 'favorites': 0, 'duration': 31, 'timestamps': [1355876412.0], 'favourites': 0, 'runtime': 27, 'friends': 127, 'pledges': [0.0, 0.0625, 0.0625, ... n], 'statuses': 7460}
If I create a Pandas Dataframe from this dict, I'll be able to create a Covariate Matrix by using Patsy, for example like this:
X = patsy.dmatrix('friends + followers + retweets + favorites - 1', data, return_type='dataframe')
Now my question is how to create a Pandas Dataframe from the experiment_data dicts? The keys of the inner dictionaries (goal, launch, followers, etc.) should be columns for each Kickstarter project (i.e. index nr.: 0 to 16041).
Any help would be really appreciated. Thanks in advance!
P.S. If you have experience in Survival Regression using Python and Lifelines, please let me know!
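For a dict of dicts keyed by project number, pandas.DataFrame.from_dict with orient="index" gives one row per outer key and one column per inner key. A minimal sketch, using a cut-down, made-up version of experiment_data (only a few of the keys shown above):

```python
import pandas as pd
from patsy import dmatrix

# Made-up stand-in for experiment_data with two projects.
experiment_data = {
    0: {"goal": 1200, "followers": 149, "friends": 127,
        "retweets": 0, "favorites": 0, "pledges": [0.0, 0.0625]},
    1: {"goal": 5000, "followers": 820, "friends": 301,
        "retweets": 4, "favorites": 2, "pledges": [0.1, 0.25]},
}

# Outer keys become the index, inner keys become the columns.
# List-valued entries such as "pledges" stay as Python lists in their cells.
df = pd.DataFrame.from_dict(experiment_data, orient="index")

# "- 1" removes the intercept column.
X = dmatrix("friends + followers + retweets + favorites - 1",
            df, return_type="dataframe")
```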
Source: (StackOverflow)
I have a large data frame with 11 columns. I need to convert categorical variables into binary values, so I used Patsy:
attributes = "admit ~ C(gender) + age + C(ethnicity) + C(state) + gpa + sci_gpa + mcat + C(major) + C(tier) + C(same_ins)"
y, X = dmatrices(attributes, df, return_type="dataframe")
This works well. However, I want to test a new sample using data that is stored in the format of the original data frame, e.g.:
gender  age  ethnicity  state  gpa  sci_gpa  gre  major    tier  same_ins
male    21   Asian      NV     3.4  3.2      .99  Physics  1     1
Is there an easy way to convert this into the same format as X?
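The data frame returned by dmatrices carries a design_info attribute that remembers the categorical levels, and patsy.build_design_matrices can apply it to new rows. A minimal sketch with a shortened formula and made-up values (only some of the columns from the question are used):

```python
import pandas as pd
from patsy import build_design_matrices, dmatrices

# Made-up training data with a subset of the original columns.
df = pd.DataFrame({
    "admit": [1, 0, 1, 0],
    "gender": ["male", "female", "female", "male"],
    "age": [21, 23, 22, 24],
    "gpa": [3.4, 3.1, 3.8, 2.9],
    "major": ["Physics", "Biology", "Physics", "Chemistry"],
})
y, X = dmatrices("admit ~ C(gender) + age + gpa + C(major)",
                 df, return_type="dataframe")

# A new sample in the original (untransformed) layout.
new_sample = pd.DataFrame({
    "gender": ["male"], "age": [21], "gpa": [3.4], "major": ["Physics"],
})

# Reuse the stored design so the new row gets the same columns as X.
(X_new,) = build_design_matrices([X.design_info], new_sample,
                                 return_type="dataframe")
```

A level that never appeared in the training data (say, an unseen major) would be rejected rather than silently encoded.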
Source: (StackOverflow)
I have a categorical variable, let's say cat_var, which can assume the following values: cat_var = ["A", "B", "C", "D"]
I run a series of regressions, and patsy makes it easy to describe a regression: regr = "y ~ x + C(cat_var)"
I was wondering what the easiest way is to tune how the categorical variable is used.
For example, let's say I would like patsy to create dummies only for "A" and "B", i.e. "C" and "D" are treated as one single group. I could remap cat_var to another set of values, but is there some sugar in patsy to do this already?
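I am not aware of dedicated syntax for this, but since patsy formulas can call ordinary Python functions, the remapping can live inside the formula. A minimal sketch; lump is a hypothetical helper, "other" is an arbitrary label for the merged group, and the data are made up:

```python
import pandas as pd
from patsy import dmatrix


def lump(values, keep, other="other"):
    """Hypothetical helper: keep the listed levels, merge the rest."""
    series = pd.Series(values)
    return series.where(series.isin(keep), other).to_numpy()


df = pd.DataFrame({
    "x": [0.1, 0.4, 0.2, 0.9, 0.5, 0.3],
    "cat_var": ["A", "B", "C", "D", "A", "C"],
})

# Treatment(reference='other') makes the merged group the baseline,
# so dummies are created only for "A" and "B".
X = dmatrix(
    "x + C(lump(cat_var, ['A', 'B']), Treatment(reference='other'))",
    df,
    return_type="dataframe",
)
```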
Source: (StackOverflow)
I'm running a logistic regression and having trouble using Patsy's API to prepare the data when it is bigger than a small sample.
Using the dmatrices function directly on a DataFrame, I am left with this abrupt error (please note, I spun up an EC2 instance with 300GB of RAM after encountering this on my laptop, and got the same error):
Traceback (most recent call last):
  File "My_File.py", line 22, in <module>
    df, return_type="dataframe")
  File "/root/anaconda/lib/python2.7/site-packages/patsy/highlevel.py", line 297, in dmatrices
    NA_action, return_type)
  File "/root/anaconda/lib/python2.7/site-packages/patsy/highlevel.py", line 156, in do_highlevel_design
    return_type=return_type)
  File "/root/anaconda/lib/python2.7/site-packages/patsy/build.py", line 989, in build_design_matrices
    results.append(builder._build(evaluator_to_values, dtype))
  File "/root/anaconda/lib/python2.7/site-packages/patsy/build.py", line 821, in _build
    m = DesignMatrix(np.empty((num_rows, self.total_columns), dtype=dtype),
MemoryError
So, I combed through Patsy's docs and found this gem:
patsy.incr_dbuilder(formula_like, data_iter_maker, eval_env=0)
Construct a design matrix builder incrementally from a large data set.
However, the method is sparsely documented, and the source code is largely uncommented.
I have arrived at this code:
import csv
from patsy import dmatrix, incr_dbuilders

def iter_maker():
    with open("test.tsv", "r") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            yield row

y, dta = incr_dbuilders("s ~ C(x) + C(y):C(rgh) + \
    C(z):C(f) + C(r):C(p) + C(q):C(w) + \
    C(zr):C(rt) + C(ff):C(djjj) + C(hh):C(tt) + \
    C(bb):lat + C(jj):lng + C(ee):C(bb) + C(qq):C(uu)",
    iter_maker)
df = dmatrix(dta, {}, 0, "drop", return_type="dataframe")
But I receive PatsyError: Error evaluating factor: NameError: name 'ff' is not defined
This is being thrown because _try_incr_builders (called from dmatrix) is returning None on line 151 of highlevel.py.
What is the correct way to use these Patsy functions to prepare my data? Any examples or guidance you may have will be helpful.
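From what I understand of the patsy docs, the objects returned by incr_dbuilders are meant to be applied to real data with build_design_matrices, one chunk at a time, rather than passed to dmatrix with an empty dict. A minimal sketch under that assumption, with a much shorter formula and in-memory, made-up chunks standing in for pieces of the file:

```python
import pandas as pd
from patsy import build_design_matrices, incr_dbuilders

# Stand-in for chunks read from a large file; values are made up.
chunks = [
    pd.DataFrame({"s": [0, 1, 1], "x": ["a", "b", "a"], "lat": [1.0, 2.0, 3.0]}),
    pd.DataFrame({"s": [1, 0, 0], "x": ["b", "c", "c"], "lat": [4.0, 5.0, 6.0]}),
]


def iter_maker():
    # Must return a fresh iterator on every call, because patsy may
    # need more than one pass over the data.
    return iter(chunks)


# First: learn the design (e.g. all levels of x) across every chunk.
y_info, X_info = incr_dbuilders("s ~ C(x) + lat", iter_maker)

# Then: build a design matrix for each chunk separately.
widths = []
for chunk in iter_maker():
    y, X = build_design_matrices([y_info, X_info], chunk)
    widths.append(X.shape[1])
```

Every chunk gets the same columns even when a level is missing from it, so the pieces can be fed to an incremental estimator or written out one by one instead of being held in memory together.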
Source: (StackOverflow)
[Preface: I now realize I should've used a classification model (maybe decision tree) instead but I ended up using a linear regression model.]
I had a pandas dataframe as such:
I want to predict audience score using genre, year, and tomato-meter score. But as constructed, the genres for each movie come in a list, so I felt the need to isolate each genre in order to pass each one into my model as a separate variable.
After doing so, my modified dataframe looks like this, with duplicate rows for each movie but each genre element of that movie isolated (just one movie pulled from the dataframe to show):
Now, my question is: can I pass this second dataframe as is to Patsy and a statsmodels linear regression, or will the row duplication introduce bias into my model?
y1, X1 = dmatrices('Q("Audience Score") ~ Year + Q("Tomato-meter") + Genre',
data=DF2, return_type='dataframe')
In summary, I'm looking for a way for patsy and my model to treat each genre as a separate variable, but I want to make sure I'm not fudging the numbers/model by passing in a dataframe in this format as the data (as not every movie has the same number of genres).
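Duplicated rows count as separate observations, so a movie with three genres would weigh three times as much as a movie with one. An alternative that keeps one row per movie is a 0/1 indicator column per genre. A minimal sketch, assuming the genres sit in a list-valued column called Genre; titles are omitted and all values are made up:

```python
import pandas as pd
from patsy import dmatrices

movies = pd.DataFrame({
    "Year": [2010, 2012, 2014, 2015],
    "Tomato-meter": [85, 40, 72, 93],
    "Audience Score": [80, 55, 70, 90],
    "Genre": [["Action", "Comedy"], ["Drama"], ["Comedy", "Drama"], ["Action"]],
})

# One row per (movie, genre), turned into indicators, then collapsed
# back to one row per movie using the original index.
exploded = movies["Genre"].explode()
indicators = pd.get_dummies(exploded).groupby(level=0).max().astype(int)

wide = movies.drop(columns="Genre").join(indicators)

# Q("...") protects genre names that are not valid Python identifiers.
genre_terms = " + ".join('Q("{}")'.format(g) for g in indicators.columns)
formula = 'Q("Audience Score") ~ Year + Q("Tomato-meter") + ' + genre_terms
y, X = dmatrices(formula, data=wide, return_type="dataframe")
```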
Source: (StackOverflow)
I'm working on upgrading a LogisticRegression text classification from single word features to bigrams (two word features). However when I include the two word feature in the formula sent to patsy.dmatrices, I receive the following error...
y, X = dmatrices("is_host ~ dedicated + hosting + dedicated hosting", df, return_type="dataframe")
  File "<string>", line 1
    dedicated hosting
                    ^
SyntaxError: unexpected EOF while parsing
I've looked around online for any examples on how to approach this and haven't found anything. I tried throwing a few different syntax options at the formula and none seem to work.
"is_host ~ dedicated + hosting + {dedicated hosting}"
"is_host ~ dedicated + hosting + (dedicated hosting)"
"is_host ~ dedicated + hosting + [dedicated hosting]"
What is the proper way to include multi-word features in the formula passed to dmatrices?
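Patsy evaluates each term as Python code, so a column name containing a space has to be quoted with patsy's Q() helper. A minimal sketch, assuming the bigram is stored in a column literally named "dedicated hosting"; the values are made up:

```python
import pandas as pd
from patsy import dmatrices

df = pd.DataFrame({
    "is_host": [1, 0, 1, 0],
    "dedicated": [1, 0, 1, 1],
    "hosting": [1, 1, 0, 0],
    "dedicated hosting": [1, 0, 0, 0],
})

# Q("...") looks the column up by its exact name, spaces included.
y, X = dmatrices(
    'is_host ~ dedicated + hosting + Q("dedicated hosting")',
    df,
    return_type="dataframe",
)
```

Renaming the column to something like dedicated_hosting before building the formula would work as well.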
Source: (StackOverflow)
I have a function LogReg, which is as follows: (using justmarkham's code as inspiration)
def LogReg(self):
    formulA = "class ~"
    print(self.frame)  # dataframe used
    print(self.columnNames[:-1])
    for a in self.columnNames[:-1]:
        formulA += " {0} +".format(a)
    formula = formulA[:-2]  # strip the trailing " +"
    print("formula = " + formula)
    Y, X = dmatrices(formula, self.frame, return_type="dataframe")
    Y = np.ravel(Y)  # flatten Y to a 1D array
    model = LogisticRegression()  # from sklearn.linear_model
    model = model.fit(X, Y)
    print(model.score(X, Y))
with the following outcome:
          a0  a1  a2  a3  class
picture1   1   2   3  67      1
picture2   6   7  45  61      3
picture3   8   7   6   5      2
picture4   1   2   4   3      0
['a0', 'a1', 'a2', 'a3']
formula = class ~ a0 + a1 + a2 + a3
Traceback (most recent call last):
  File "classification.py", line 80, in <module>
    c.LogReg()
  File "classification.py", line 61, in LogReg
    Y,X = dmatrices(formula, self.frame, return_type="dataframe")
  File "/<path>/python2.7/site-packages/patsy/highlevel.py", line 297, in dmatrices
    NA_action, return_type)
  File "/<path>/python2.7/site-packages/patsy/highlevel.py", line 152, in _do_highlevel_design
    NA_action)
  File "/<path>/python2.7/site-packages/patsy/highlevel.py", line 57, in _try_incr_builders
    NA_action)
  File "/<path>/python2.7/site-packages/patsy/build.py", line 660, in design_matrix_builders
    NA_action)
  File "/<path>/python2.7/site-packages/patsy/build.py", line 424, in _examine_factor_types
    value = factor.eval(factor_states[factor], data)
  File "/<path>/python2.7/site-packages/patsy/eval.py", line 485, in eval
    return self._eval(memorize_state["eval_code"], memorize_state, data)
  File "/<path>/python2.7/site-packages/patsy/eval.py", line 468, in _eval
    code, inner_namespace=inner_namespace)
  File "/<path>/python2.7/site-packages/patsy/compat.py", line 117, in call_and_wrap_exc
    return f(*args, **kwargs)
  File "/<path>/python2.7/site-packages/patsy/eval.py", line 125, in eval
    code = compile(expr, source_name, "eval", self.flags, False)
  File "<string>", line 1
    class
        ^
SyntaxError: unexpected EOF while parsing
I do not see what goes wrong here: to my knowledge the string does not contain an EOF character, nor does the Python code seem erroneous. Hence the question: where does it go wrong, and preferably, how do I fix it?
P.S.: The software used are all the most recent stable packages as available on 04/09/2015.
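The last frame of the traceback shows patsy compiling the bare text class as Python code, and class is a reserved word, so it cannot be evaluated as a variable name. Wrapping the name in patsy's Q() (or renaming the column) avoids that. A minimal sketch using the four rows printed above:

```python
import pandas as pd
from patsy import dmatrices

frame = pd.DataFrame(
    {
        "a0": [1, 6, 8, 1],
        "a1": [2, 7, 7, 2],
        "a2": [3, 45, 6, 4],
        "a3": [67, 61, 5, 3],
        "class": [1, 3, 2, 0],
    },
    index=["picture1", "picture2", "picture3", "picture4"],
)

# Build the right-hand side with join instead of trimming a trailing " +".
predictors = [name for name in frame.columns if name != "class"]
formula = 'Q("class") ~ ' + " + ".join(predictors)

Y, X = dmatrices(formula, frame, return_type="dataframe")
```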
Source: (StackOverflow)
I use patsy to build a design matrix. I need to include powers of the original factors. For example, for a regression of y on powers of x1 and x2, I want to be able to write
patsy.dmatrix('y~x1 + x1**2 + x2 + x2**2 + x2**3', data)
where data is a dataframe that contains the columns y, x1, and x2. But it does not seem to work at all. Any solutions?
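Inside a patsy formula, ** is a formula operator (repeated interaction) rather than arithmetic, so x1**2 does not square the column. Wrapping the expression in I() makes patsy evaluate it as plain Python. A minimal sketch with made-up data; it uses dmatrices because the formula has a left-hand side:

```python
import numpy as np
import pandas as pd
from patsy import dmatrices

data = pd.DataFrame({
    "y": [1.0, 2.0, 1.5, 3.0],
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [0.5, 1.5, 2.5, 3.5],
})

# I(...) protects the arithmetic from the formula parser.
y, X = dmatrices(
    "y ~ x1 + I(x1**2) + x2 + I(x2**2) + I(x2**3)",
    data,
    return_type="dataframe",
)
```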
Source: (StackOverflow)
I have a question regarding regression in Python. To make a long story short, I need to find a model of the form yt = mt + st, where mt and st are the trend and seasonal components respectively. In my earlier regression analysis, I found that a good model for mt is a quadratic trend of the form mt = a0 + a1*t + a2*t^2. Adding the seasonal component is where I am having the hardest time. I approached this in two ways: one through R, calling R objects from Python, and the other purely in Python. Following the example in my book, I did the following using R:
%load_ext rmagic
import rpy2.robjects as R
import pandas.rpy.common as com
from rpy2.robjects.packages import importr
stats = importr('stats')
r_df = com.convert_to_r_dataframe(pd.DataFrame(data.logTotal))
%Rpush r_df
%R ss = as.factor(rep(1:12,length(r_df$logTotal)/12))
%R tt = 1:length(r_df$logTotal)
%R tt2 = cbind(tt,tt^2)
%R ts_model = lm(r_df$logTotal ~ tt2+ss-1)
%R print(summary(ts_model))
I get the right regression coefficients. But if I do the same thing in Python, I have trouble replicating it.
import statsmodels.formula.api as smf
ss_temp= pd.Categorical.from_array(np.repeat(np.arange(1,13),len(data.logTotal)/12))
dtemp = np.column_stack((t,t**2,data.logTotal))
dtemp = pd.DataFrame(dtemp,columns=['t','tsqr','logTotal'])
dtemp['ss'] = ss_temp
res_result = smf.ols(formula='logTotal ~ t+tsqr + C(ss) -1',data=dtemp).fit()
res_result.params
What am I doing wrong here? I first get an error saying 'data type not found', which points to the res_result formula. So then I tried changing ss_temp to a Series, and the above statements worked. However, my parameters were completely off compared to the R output. I have spent a day on this to no avail. Can someone please help or guide me as to what to do, or is there a Python equivalent to as.factor in R? I assumed that was Categorical in pandas.
Thanks
If the above is too hard, it's fine. I still have the residual model from my regression in R. But any ideas on how to convert this to a Python equivalent of what statsmodels interprets as a result from a regression? Thanks again.
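One difference between the two snippets stands out: R's rep(1:12, n) cycles through 1, 2, ..., 12 over and over, whereas np.repeat repeats each value in place (1, 1, ..., 2, 2, ...). np.tile reproduces R's ordering. Whether this fully explains the mismatch is an assumption, since data.logTotal is not shown. A small sketch with a made-up series length of three years:

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

n_years = 3
months = np.arange(1, 13)

r_like = np.tile(months, n_years)     # 1, 2, ..., 12, 1, 2, ... like R's rep()
blocked = np.repeat(months, n_years)  # 1, 1, 1, 2, 2, 2, ...

t = np.arange(1, 12 * n_years + 1)
frame = pd.DataFrame({"t": t, "tsqr": t ** 2, "ss": r_like})

# C(ss) marks the integer month as categorical, much like as.factor in R;
# "- 1" drops the intercept so all twelve month dummies are kept.
X = dmatrix("t + tsqr + C(ss) - 1", frame, return_type="dataframe")
```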
Source: (StackOverflow)
Because statsmodels.OLS.params returns only a np.array() without the corresponding keys from the dataframe, it is impossible to look up regressors when trying to use statsmodels.OLS.predict, especially with categorical regressors that have a lot of categories (I have 1k plus params) in a relatively large and rich dataset.
So I tried to go back to the statsmodels formula API, which is fine. However, I have been unable to follow the simple pattern:
import statsmodels.formula.api as smf
model = smf.ols(formula,data=subdata).fit()
x = dict(Capacity=[275],Age=[11.79],Type=['Jack'],SaleType=['Retail'])
outcome = model.predict(x)
********************
AttributeError: 'DataFrame' object has no attribute 'design_info'
This makes me assume two things: the sm API predict method doesn't like dictionaries, and I should try a pandas DataFrame and make sure that design_info gets passed along with it.
I import patsy and create a DesignMatrix:
temp = list(x.values())
design_info = patsy.DesignMatrix(temp,x.keys())
Which gives me the following error:
ValueError: wrong number of column names for design matrix (got 4, wanted 1)
I've tried recreating the temp variable so that it would have four columns and one row, but again had no luck. What am I doing wrong?
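As far as I can tell, DesignMatrix wraps an array that is already numeric, so it is not the tool for encoding raw values. The design_info stored on the training matrix can encode a new row via patsy.build_design_matrices, and it also supplies the column names for labelling coefficients. A sketch using patsy and NumPy directly rather than statsmodels' predict; the response name Price, the formula, and all values are made up because the original formula is not shown:

```python
import numpy as np
import pandas as pd
from patsy import build_design_matrices, dmatrices

# Made-up training data; "Price" is a hypothetical response column.
subdata = pd.DataFrame({
    "Price": [10.0, 12.0, 9.0, 15.0, 11.0, 14.0],
    "Capacity": [250, 275, 240, 300, 260, 290],
    "Age": [12.0, 11.79, 14.0, 8.0, 13.0, 9.5],
    "Type": ["Jack", "Lift", "Jack", "Lift", "Jack", "Lift"],
    "SaleType": ["Retail", "Retail", "Auction", "Auction", "Retail", "Auction"],
})
y, X = dmatrices("Price ~ Capacity + Age + C(Type) + C(SaleType)",
                 subdata, return_type="dataframe")

# Plain least squares, with coefficients labelled by design-matrix column.
coef, *_ = np.linalg.lstsq(X.to_numpy(), y.to_numpy().ravel(), rcond=None)
params = pd.Series(coef, index=X.design_info.column_names)

# Encode the new observation with the same design, then predict.
x = dict(Capacity=[275], Age=[11.79], Type=["Jack"], SaleType=["Retail"])
(X_new,) = build_design_matrices([X.design_info], x)
outcome = np.asarray(X_new) @ params.to_numpy()
```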
Source: (StackOverflow)