pandas
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Python Data Analysis Library — pandas: Python Data Analysis Library
i try to get the number of rows of dataframe df, both code snippets give me an error: TypeError: unsupported operand type(s) for +: 'instancemethod' and 'int'
total_rows = df.count
print total_rows +1
total_rows = df['First_columnn_label'].count
print total_rows +1
I'd be grateful for any suggestions what I'm doing wrong.
EDIT: According to the answer given by root the best (the fastest) way to check df length is to call:
len(df.index)
Source: (StackOverflow)
I have a dataframe in pandas which I would like to write to a CSV file. I am doing this using:
df.to_csv('out.csv')
And getting the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 20: ordinal not in range(128)
Is there any way to get around this easily (i.e. I have unicode characters in my data frame)? And is there a way to write to a tab delimited file instead of a CSV using e.g. a 'to-tab' method (that I dont think exists)?
Source: (StackOverflow)
I have a data table using pandas and column labels that I need to edit to replace the original column labels.
I'd like to change the column names in a data table A
where the original column names are:
['$a', '$b', '$c', '$d', '$e']
to
['a', 'b', 'c', 'd', 'e'].
I have the edited column names stored it in a list, but I don't know how to replace the column names.
Source: (StackOverflow)
I want to get a list of the column headers from a pandas dataframe. The dataframe will come from user input so I won't know how many columns there will be or what they will be called. For example, if I'm given a dataframe like this:
my_dataframe =
y gdp cap
0 1 2 5
1 2 3 9
2 8 7 2
3 3 4 7
4 6 7 7
5 4 8 3
6 8 2 8
7 9 9 10
8 6 6 4
9 10 10 7
I would want to get a list like this:
header_list =
[y, gdp, cap]
Source: (StackOverflow)
When deleting a column in a DataFrame I use del DF['column-name']
and all is well. Why does del DF.column_name
not work also?
I am using v0.9.1.
Source: (StackOverflow)
How to select rows from a DataFrame based on values in some column in pandas?
In SQL I would use:
select * from table where colume_name = some_value.
I tried to look at pandas documentation but did not immediately find the answer.
Source: (StackOverflow)
Is there a way to make matplotlib
behave identically to R, or almost like R, in terms of plotting defaults? For example R treats its axes pretty differently from matplotlib
. The following histogram
has "floating axes" with outward ticks, such that there are no inner ticks (unlike matplotlib
) and the axes do not cross "near" the origin. Also, the histogram can "spillover" to values that are not marked by the tick - e.g. the x-axis ends at 3 but the histograms extends slightly beyond it. How can this be achieved automatically for all histograms in matplotlib
?
Related question: scatter plots and line plots have different default axes settings in R, for example:
There no inner ticks again and the ticks face outward. Also, the ticks start slightly after the origin point (where the y and x axes cross at the bottom left of the axes) and the ticks end slightly before the axes end. This way the labels of the lowest x-axis tick and lowest y-axis tick can't really cross, because there's a space between them and this gives the plots a very elegant clean look. Note that there's also considerably more space between the axes ticklabels and the ticks themselves.
Also, by default there are no ticks on the non-labeled x or y axes, meaning the y-axis on the left that is parallel to the labeled y-axis on the right has no ticks, and same for the x-axis, again removing clutter from the plots.
Is there a way to make matplotlib look like this? And in general to look by default as much as default R plots? I like matplotlib
a lot but I think the R defaults / out-of-the-box plotting behavior really have gotten things right and its default settings rarely lead to overlapping tick labels, clutter or squished data, so I would like the defaults to be as much like that as possible.
Source: (StackOverflow)
I'm starting with input data like this
df1 = pandas.DataFrame( {
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )
Which when printed appears as this:
City Name
0 Seattle Alice
1 Seattle Bob
2 Portland Mallory
3 Seattle Mallory
4 Seattle Bob
5 Portland Mallory
Grouping is simple enough:
g1 = df1.groupby( [ "Name", "City"] ).count()
and printing yields a GroupBy object:
City Name
Name City
Alice Seattle 1 1
Bob Seattle 2 2
Mallory Portland 2 2
Seattle 1 1
But what I want eventually is another DataFrame object that contains all the rows in the GroupBy object. In other words I want to get the following result:
City Name
Name City
Alice Seattle 1 1
Bob Seattle 2 2
Mallory Portland 2 2
Mallory Seattle 1 1
I can't quite see how to accomplish this in the pandas documentation. Any hints would be welcome.
Source: (StackOverflow)
I have a list of dictionaries like this:
[{'points': 50, 'time': '5:00', 'year': 2010},
{'points': 25, 'time': '6:00', 'month': "february"},
{'points':90, 'time': '9:00', 'month': 'january'},
{'points_h1':20, 'month': 'june'}]
and I want to turn this into a pandas dataframe like this:
points, time, year, month, points_h1
50, 5:00, 2010, NONE, NONE
25, 6:00, NONE, february, NONE
90, 9:00, NONE, january, NONE
NONE, NONE, NONE, june, 20
Order of the columns does not matter. Ultimately, the goal is to write this to a text file and this seems like the best solution I could find. How can I turn the list of dictionaries into a panda data frame as shown above?
Source: (StackOverflow)
I have a DataFrames from pandas:
import pandas as pd
inp = [{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}]
df = pd.DataFrame(inp)
print df
Output:
c1 c2
0 10 100
1 11 110
2 12 120
Now I want to iterate over the rows of the above frame. For every row I want to be able to access its elements (values in cells) by the name of the columns. So, for example, I would like to have something like that:
for row in df.rows:
print row['c1'], row['c2']
Is it possible to do that in pandas?
I found similar question. But it does not give me the answer I need. For example, it is suggested there to use:
for date, row in df.T.iteritems():
or
for row in df.iterrows():
But I do not understand what the row
object is and how I can work with it.
Source: (StackOverflow)
I have a dataframe as below
itm Date Amount
67 420 2012-09-30 00:00:00 65211
68 421 2012-09-09 00:00:00 29424
69 421 2012-09-16 00:00:00 29877
70 421 2012-09-23 00:00:00 30990
71 421 2012-09-30 00:00:00 61303
72 485 2012-09-09 00:00:00 71781
73 485 2012-09-16 00:00:00 NaN
74 485 2012-09-23 00:00:00 11072
75 485 2012-09-30 00:00:00 113702
76 489 2012-09-09 00:00:00 64731
77 489 2012-09-16 00:00:00 NaN
when I try to .apply a function to the Amount column I get the following error.
ValueError: cannot convert float NaN to integer
I have tried applying a function using .isnan from the Math Module
I have tried the pandas .replace attribute
I tried the .sparse data attribute from pandas 0.9
I have also tried if NaN == NaN statement in a function.
I have also looked at this article How do I replace NA values with zeros in R? whilst looking at some other articles.
All the methods I have tried have not worked or do not recognise NaN.
Any Hints or solutions would be appreciated.
Source: (StackOverflow)
Most operations in pandas
can be accomplished with operator chaining (groupby
, aggregate
, apply
, etc), but the only way I've found to filter rows is via normal bracket indexing
df_filtered = df[df['column'] == value]
This is unappealing as it requires I assign df
to a variable before being able to filter on its values. Is there something more like the following?
df_filtered = df.mask(lambda x: x['column'] == value)
Source: (StackOverflow)
I want to be able to set the major and minor xticks and their labels for a time series graph plotted from a Pandas time series object.
The Pandas 0.9 "what's new" page says:
"you can either use to_pydatetime or register a converter for the
Timestamp type"
but I can't work out how to do that so that I can use the matplotlib ax.xaxis.set_major_locator
and ax.xaxis.set_major_formatter
(and minor) commands.
If I use them without converting the pandas times, the x-axis ticks and labels end up wrong.
By using the 'xticks' parameter I can pass the major ticks to pandas.plot, and then set the major tick labels. I can't work out how to do the minor ticks using this approach. (I can set the labels on the default minor ticks set by pandas.plot)
Here is my test code:
import pandas
print 'pandas.__version__ is ', pandas.__version__
print 'matplotlib.__version__ is ', matplotlib.__version__
dStart = datetime.datetime(2011,5,1) # 1 May
dEnd = datetime.datetime(2011,7,1) # 1 July
dateIndex = pandas.date_range(start=dStart, end=dEnd, freq='D')
print "1 May to 1 July 2011", dateIndex
testSeries = pandas.Series(data=np.random.randn(len(dateIndex)),
index=dateIndex)
ax = plt.figure(figsize=(7,4), dpi=300).add_subplot(111)
testSeries.plot(ax=ax, style='v-', label='first line')
# using MatPlotLib date time locators and formatters doesn't work with new
# pandas datetime index
ax.xaxis.set_minor_locator(matplotlib.dates.WeekdayLocator(byweekday=(1),
interval=1))
ax.xaxis.set_minor_formatter(matplotlib.dates.DateFormatter('%d\n%a'))
ax.xaxis.grid(True, which="minor")
ax.xaxis.grid(False, which="major")
ax.xaxis.set_major_formatter(matplotlib.dates.DateFormatter('\n\n\n%b%Y'))
plt.show()
# set the major xticks and labels through pandas
ax2 = plt.figure(figsize=(7,4), dpi=300).add_subplot(111)
xticks = pandas.date_range(start=dStart, end=dEnd, freq='W-Tue')
print "xticks: ", xticks
testSeries.plot(ax=ax2, style='-v', label='second line',
xticks=xticks.to_pydatetime())
ax2.set_xticklabels([x.strftime('%a\n%d\n%h\n%Y') for x in xticks]);
# set the text of the first few minor ticks created by pandas.plot
# ax2.set_xticklabels(['a','b','c','d','e'], minor=True)
# remove the minor xtick labels set by pandas.plot
ax2.set_xticklabels([], minor=True)
# turn the minor ticks created by pandas.plot off
# plt.minorticks_off()
plt.show()
print testSeries['6/4/2011':'6/7/2011']
and it's output:
pandas.__version__ is 0.9.1.dev-3de54ae
matplotlib.__version__ is 1.1.1
1 May to 1 July 2011 <class 'pandas.tseries.index.DatetimeIndex'>
[2011-05-01 00:00:00, ..., 2011-07-01 00:00:00]
Length: 62, Freq: D, Timezone: None
xticks: <class 'pandas.tseries.index.DatetimeIndex'>
[2011-05-03 00:00:00, ..., 2011-06-28 00:00:00]
Length: 9, Freq: W-TUE, Timezone: None
2011-06-04 -0.199393
2011-06-05 -0.043118
2011-06-06 0.477771
2011-06-07 -0.033207
Freq: D
Update: I've been able to get closer to the layout I wanted by using a loop to build the major xtick labels:
# only show month for first label in month
month = dStart.month - 1
xticklabels = []
for x in xticks:
if month != x.month :
xticklabels.append(x.strftime('%d\n%a\n%h'))
month = x.month
else:
xticklabels.append(x.strftime('%d\n%a'))
But this is a bit like doing the x-axis using ax.annotate, possible but not ideal.
Source: (StackOverflow)
Can you tell me when to use these vectorization methods with basic examples? I see that map
is a Series
method whereas the rest are DataFrame
methods. I got confused about apply
and applymap
methods though. Why do we have two methods for applying a function to a DataFrame? Again, simple examples which illustrate the usage would be great!
Thanks!
Source: (StackOverflow)
I have data in different columns but I don't know how to extract it to save it in another variable.
index a b c
1 2 3 4
2 3 4 5
How do I select b, c and save it in to df1?
I tried
df1 = df['a':'b']
df1 = df.ix[:, 'a':'b']
None seem to work. Any ideas would help thanks.
Source: (StackOverflow)