
PyTables

A Python package to manage extremely large amounts of data.

Floating Point Exception with Numpy and PyTables

I have a rather large HDF5 file generated by PyTables that I am attempting to read on a cluster. I am running into a problem with NumPy as I read in an individual chunk. Let's go with the example:

The total shape of the array within the HDF5 file is:

In [13]: data.shape
Out[13]: (21933063, 800, 3)

Each entry in this array is a np.float64.

I am having each node read slices of size (21933063,10,3). Unfortunately, NumPy seems to be unable to read all 21 million subslices at once. I have tried to do this sequentially by dividing up these slices into 10 slices of size (2193306,10,3) and then using the following reduce to get things working:

In [8]: a = reduce(lambda x,y : np.append(x,y,axis=0), [np.array(data[i*      \
        chunksize: (i+1)*chunksize,:10],dtype=np.float64) for i in xrange(k)])
In [9]:

where 1 <= k <= 10 and chunksize = 2193306. This code works for k <= 9; otherwise I get the following:

In [8]: a = reduce(lambda x,y : np.append(x,y,axis=0), [np.array(data[i*      \
        chunksize: (i+1)*chunksize,:10],dtype=np.float64) for i in xrange(k)])
Floating point exception
home@mybox  00:00:00  ~
$

I tried using Valgrind's memcheck tool to figure out what is going on and it seems as if PyTables is the culprit. The two main files that show up in the trace are libhdf5.so.6 and a file related to blosc.

Also, note that if I have k=8, I get:

In [12]: a.shape
Out[12]: (17546448, 10, 3)

But if I append the last subslice, I get:

In [14]: a = np.append(a,np.array(data[8*chunksize:9*chunksize,:10],   \
         dtype=np.float64))
In [15]: a.shape
Out[15]: (592192620,)
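
As an aside, the flattened shape above comes from calling np.append without an axis argument, which ravels both inputs. A minimal sketch that avoids both the flattening and the repeated copying (assuming data is the open PyTables array and chunksize is as above) would preallocate the result once and fill it slice by slice:

import numpy as np

k = 10
chunksize = 2193306
# preallocate the full result instead of growing it with np.append,
# and assign into an explicit slice so nothing gets flattened
a = np.empty((k * chunksize, 10, 3), dtype=np.float64)
for i in range(k):
    a[i * chunksize:(i + 1) * chunksize] = data[i * chunksize:(i + 1) * chunksize, :10]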

Does anyone have any ideas of what to do? Thanks!


Source: (StackOverflow)

PyTables vs. SQLite3 insertion speed

I bought Kibot's stock data and it is enormous. I have about 125,000,000 rows to load (1000 stocks * 125k rows/stock [1-minute bar data since 2010-01-01], each stock in a CSV file whose fields are Date,Time,Open,High,Low,Close,Volume). I'm totally new to python (I chose it because it's free and well-supported by a community) and I chose SQLite to store the data because of python's built-in support for it. (And I know the SQL language very well. SQLiteStudio is a gem of a free program.)

My loader program is working well, but is getting slower. The SQLite db is about 6 Gb and it's only halfway loaded. I'm getting about 500k rows/hour loaded using INSERT statements and committing the transaction after each stock (approx 125k rows).

So here's the question: is PyTables substantially faster than SQLite, making the effort to learn how to use it worth it? (And since I'm in learning mode, feel free to suggest alternatives to these two.) One thing that bothers me about PyTables is that the free version is really bare bones, almost like saving a binary file: no "where clause" functions or indexing, so you wind up scanning for the rows you need.

After I get the data loaded, I'm going to be doing statistical analysis (rolling regression & correlation, etc.) using something based on NumPy: Timeseries, larry, pandas, or a scikit. I haven't chosen the analysis package yet, so if you have a recommendation, and that recommendation is best used with either PyTables or pandas (or whatever), please factor that into your response.

(For @John) Python 2.6;
Windows XP SP3 32-bit;
Manufactured strings used as INSERT statements;
Memory usage is rock solid at 750M of the 2G physical memory;
CPU usage is 10% +/- 5%;
Totally I/O bound (the disk is always crunching).
DB schema:

create table MinuteBarPrices (
    SopDate smalldatetime not null,
    Ticker  char( 5 )     not null,
    Open    real,
    High    real,
    Low     real,
    Close   real          not null,
    Volume  int,
    primary key ( SopDate, Ticker )
);
create unique index MinuteBarPrices_IE1 on MinuteBarPrices (
    Ticker,
    SopDate
);
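
For what it's worth, here is a minimal PyTables sketch of an equivalent table (column names follow the schema above; the parsed-row iterable is an assumption). Note that later PyTables releases do include column indexing and in-kernel where() queries in the free version:

import tables as tb

class MinuteBar(tb.IsDescription):
    SopDate = tb.Time64Col(pos=0)   # bar timestamp stored as seconds since the epoch
    Ticker  = tb.StringCol(5, pos=1)
    Open    = tb.Float32Col(pos=2)
    High    = tb.Float32Col(pos=3)
    Low     = tb.Float32Col(pos=4)
    Close   = tb.Float32Col(pos=5)
    Volume  = tb.Int32Col(pos=6)

h5 = tb.openFile('bars.h5', 'w')
table = h5.createTable('/', 'MinuteBarPrices', MinuteBar, expectedrows=125000000)

row = table.row
for rec in parsed_csv_rows:          # assumed iterable of (timestamp, ticker, o, h, l, c, v)
    (row['SopDate'], row['Ticker'], row['Open'], row['High'],
     row['Low'], row['Close'], row['Volume']) = rec
    row.append()
table.flush()

table.cols.Ticker.createIndex()      # enables fast indexed lookups
hits = table.readWhere('Ticker == "IBM"')
h5.close()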

Source: (StackOverflow)


Using pytables, which is more efficient: scipy.sparse or numpy dense matrix?

When using pytables, there's no support (as far as I can tell) for the scipy.sparse matrix formats, so to store a matrix I have to do some conversion, e.g.

def store_sparse_matrix(self):
    csr = M.tocsr()  # convert once instead of three times
    grp1 = self.getFileHandle().createGroup(self.getGroup(), 'M')
    self.getFileHandle().createArray(grp1, 'data', csr.data)
    self.getFileHandle().createArray(grp1, 'indptr', csr.indptr)
    self.getFileHandle().createArray(grp1, 'indices', csr.indices)

def get_sparse_matrix(self):
    # reads the three arrays back from disk and rebuilds the CSR matrix in memory
    return sparse.csr_matrix((self.getGroup().M.data, self.getGroup().M.indices, self.getGroup().M.indptr))

The trouble is that get_sparse_matrix takes some time (reading from disk) and, if I understand it correctly, also requires the data to fit into memory.

The only other option seems to be to convert the matrix to dense format (a numpy array) and then use pytables normally. However, this seems rather inefficient, although I suppose pytables will deal with the compression itself?
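
If you do go the dense route, here is a minimal sketch of letting PyTables handle the compression itself (names are illustrative; whether this beats the CSR approach depends on how sparse M really is):

import numpy as np
import tables as tb

M_dense = np.asarray(M.todense(), dtype=np.float64)   # assumes M fits in memory once densified

fh = tb.openFile('matrix.h5', 'w')
# a chunked, compressed array; the filter squeezes the runs of zeros reasonably well,
# but every read still has to decompress whole chunks
carr = fh.createCArray(fh.root, 'M_dense', tb.Float64Atom(), shape=M_dense.shape,
                       filters=tb.Filters(complevel=5, complib='zlib'))
carr[:] = M_dense
fh.close()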


Source: (StackOverflow)

Reading a large table with millions of rows from Oracle and writing to HDF5

I am working with an Oracle database with millions of rows and 100+ columns. I am attempting to store this data in an HDF5 file using pytables with certain columns indexed. I will be reading subsets of these data in a pandas DataFrame and performing computations.

I have attempted the following:

Downloaded the table, using a utility, into a CSV file; read the CSV file chunk by chunk using pandas; and appended to an HDF5 table using pandas.HDFStore. I created a dtype definition and provided the maximum string sizes.

However, now when I am trying to download data directly from Oracle DB and post it to HDF5 file via pandas.HDFStore, I run into some problems.

pandas.io.sql.read_frame does not support chunked reading. I don't have enough RAM to be able to download the entire data to memory first.

If I try to use cursor.fetchmany() with a fixed number of records, the read operation takes ages because the DB table is not indexed and I have to read records falling within a date range. I am using DataFrame(cursor.fetchmany(), columns=['a','b','c'], dtype=my_dtype); however, the created DataFrame always infers the dtype rather than enforcing the dtype I have provided (unlike read_csv, which adheres to the dtype I provide). Hence, when I append this DataFrame to an already existing HDFStore, there is a type mismatch: e.g. a float64 may be interpreted as int64 in one chunk.
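
A rough sketch of a chunked read that casts each chunk column by column before appending (the cursor, column names, and my_dtype mapping here are placeholders):

import pandas as pd

cols = ['a', 'b', 'c']                                     # column names from the query
my_dtype = {'a': 'int64', 'b': 'float64', 'c': 'float64'}  # assumed dtype mapping

store = pd.HDFStore('data.h5')
while True:
    rows = cursor.fetchmany(100000)        # cursor from the already-executed Oracle query
    if not rows:
        break
    df = pd.DataFrame(list(rows), columns=cols)
    # enforce dtypes column by column so every chunk matches the HDF5 table
    for col in cols:
        df[col] = df[col].astype(my_dtype[col])
    store.append('mytable', df, data_columns=True)
store.close()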

I'd appreciate it if you could offer your thoughts and point me in the right direction.


Source: (StackOverflow)

finding a duplicate in a hdf5 pytable with 500e6 rows

Problem

I have a large (> 500e6 rows) dataset that I've put into a pytables database.

Let's say the first column is an ID and the second column is a counter for each ID. Each ID-counter combination has to be unique. There is one non-unique row amongst the 500e6 rows that I'm trying to find.

As a starter I've done something like this:

index1 = db.cols.id.create_index()
index2 = db.cols.counts.create_index()
for row in db:
    query = '(id == %d) & (counts == %d)' % (row['id'],  row['counts'])
    result = db.readWhere(query)
    if len(result) > 1:
        print row

It's a brute force method I'll admit. Any suggestions on improvements?

update

current brute force runtime is 8421 minutes.

solution

Thanks for the input, everyone. I managed to get the runtime down to 2364.7 seconds using the following method:

ex = tb.Expr('(x * 65536) + y', uservars={"x": th.cols.id, "y": th.cols.counts})
ex.setOutput(th.cols.hash)   # write the packed hash into its own column
ex.eval()
indexrows = th.cols.hash.create_csindex(filters=filters)

ref = None
dups = []
for row in th.itersorted(sortby=th.cols.hash):
  if row['hash'] == ref:
    dups.append(row['hash'] )
  ref = row['hash']

print("ids: ", np.right_shift(np.array(dups, dtype=np.int64), 16))
print("counts: ", np.array(dups, dtype=np.int64) & 65536-1)

I can generate a perfect hash because my maximum values are less than 2^16. I am effectively bit packing the two columns into a 32 bit int.

Once the csindex is generated it is fairly trivial to iterate over the sorted values and do a neighbor test for duplicates.

This method can probably be tweaked a bit, but I'm testing a few alternatives that may provide a more natural solution.


Source: (StackOverflow)

Large matrix multiplication in Python - what is the best option?

I have two boolean sparse square matrices of c. 80,000 x 80,000, generated from 12 MB of data (and am likely to have orders-of-magnitude larger matrices when I use GBs of data).

I want to multiply them (which would produce a triangular matrix); however, I don't actually get a triangular matrix, since I don't restrict the dot product to yield one.

I am wondering what the best way of multiplying them is (memory-wise and speed-wise) - I am going to do the computation on a m2.4xlarge AWS instance which has >60GB of RAM. I would prefer to keep the calc in RAM for speed reasons.

I appreciate that SciPy has sparse matrices and so does h5py, but have no experience in either.

What's the best option to go for?

Thanks in advance

UPDATE: the boolean matrices have fewer than 0.6% non-zero entries.
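
For reference, a small sketch of the scipy.sparse route on toy matrices (the size and density here are stand-ins; the real 80,000 x 80,000 boolean case behaves the same way, memory permitting):

import numpy as np
from scipy import sparse

n = 8000                                                     # stand-in for 80,000
A = sparse.random(n, n, density=0.006, format='csr') > 0     # random boolean CSR matrix
B = sparse.random(n, n, density=0.006, format='csr') > 0

# multiply on an integer dtype; the nonzero pattern of C is the boolean product
C = A.astype(np.int32).dot(B.astype(np.int32))
result_density = C.nnz / float(n * n)                        # fraction of nonzeros in the product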


Source: (StackOverflow)

Pytables vs. CSV for files that are not very large

I recently came across PyTables and find it to be very cool. It is clear that it is superior to the CSV format for very large data sets. I am running some simulations using Python. The output is not so large, say 200 columns and 2000 rows.

If someone has experience with both, can you suggest which format would be more convenient in the long run for such data sets, which are not very large? PyTables has data manipulation capabilities and browsing of the data with ViTables, but the browser does not have as much functionality as, say, Excel, which can be used with CSV. Similarly, do you find one better than the other for importing and exporting data, if working mainly in Python? Is one more convenient in terms of file organization? Any comments on issues such as these would be helpful.

Thanks.


Source: (StackOverflow)

In PyTables, how to create nested array of variable length?

I'm using PyTables 2.2.1 w/ Python 2.6, and I would like to create a table which contains nested arrays of variable length.

I have searched the PyTables documentation, and the tutorial example (PyTables Tutorial 3.8) shows how to create a nested array of length = 1. But for this example, how would I add a variable number of rows to data 'info2/info3/x' and 'info2/info3/y'?

For perhaps an easier to understand table structure, here's my homegrown example:

"""Desired Pytable output:

DIEM    TEMPUS  Temperature             Data
5       0       100         Category1 <--||-->  Category2
                         x <--| |--> y          z <--|
                        0           0           0
                        2           1           1
                        4           1.33        2.67
                        6           1.5         4.5
                        8           1.6         6.4
5       1       99
                        2           2           0   
                        4           2           2
                        6           2           4
                        8           2           6
5       2       96
                        4           4           0
                        6           3           3
                        8           2.67        5.33


Note that nested arrays have variable length.
"""

import tables as ts

tableDef =      {'DIEM': ts.Int32Col(pos=0),
                'TEMPUS': ts.Int32Col(pos=1), 
                'Temperature' : ts.Float32Col(pos=2),
                'Data': 
                    {'Category1': 
                        {
                        'x': ts.Float32Col(), 
                        'y': ts.Float32Col()
                        }, 
                    'Category2': 
                        {
                        'z': ts.Float32Col(), 
                        }
                    }
                }

# create output file
fpath = 'TestDb.h5'
fh = ts.openFile(fpath, 'w')
# define my table
tableName = 'MyData'
fh.createTable('/', tableName, tableDef)
tablePath = '/'+tableName
table = fh.getNode(tablePath)

# get row iterator
row = table.row
for i in xrange(3):
    print '\ni=', i
    # calc some fake data
    row['DIEM'] = 5
    row['TEMPUS'] = i
    row['Temperature'] = 100-i**2

    for j in xrange(5-i):
        # Note that nested array has variable number of rows
        print 'j=', j,
        # calc some fake nested data
        val1 = 2.0*(i+j)
        val2 = val1/(j+1.0)
        val3 = val1 - val2

        ''' Magic happens here...
        How do I write 'j' rows of data to the elements of 
        Category1 and/or Category2?

        In bastardized pseudo-code, I want to do:

        row['Data/Category1/x'][j] = val1
        row['Data/Category1/y'][j] = val2
        row['Data/Category2/z'][j] = val3
        '''

    row.append()
table.flush()

fh.close()

I have not found any indication in the PyTables docs that such a structure is not possible... but in case such a structure is in fact not possible, what are my alternatives to variable length nested columns?

  • EArray? VLArray? If so, how to integrate these data types into the structure described above? (See the sketch after this list.)
  • some other idea?
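
One possible arrangement, sketched below under the assumption that it is acceptable to keep the ragged columns outside the table: a flat Table holds DIEM/TEMPUS/Temperature, and one VLArray per variable-length column holds the ragged data, with row i of each VLArray belonging to row i of the table.

import tables as tb

class Meta(tb.IsDescription):
    DIEM = tb.Int32Col(pos=0)
    TEMPUS = tb.Int32Col(pos=1)
    Temperature = tb.Float32Col(pos=2)

fh = tb.openFile('TestDb.h5', 'w')
meta = fh.createTable('/', 'MyData', Meta)
# one VLArray per ragged column; row i of each VLArray pairs with row i of the table
x = fh.createVLArray('/', 'x', tb.Float32Atom())
y = fh.createVLArray('/', 'y', tb.Float32Atom())
z = fh.createVLArray('/', 'z', tb.Float32Atom())

row = meta.row
for i in range(3):
    row['DIEM'] = 5
    row['TEMPUS'] = i
    row['Temperature'] = 100 - i ** 2
    row.append()

    vals1 = [2.0 * (i + j) for j in range(5 - i)]
    vals2 = [v / (j + 1.0) for j, v in enumerate(vals1)]
    x.append(vals1)                       # each append stores one variable-length row
    y.append(vals2)
    z.append([a - b for a, b in zip(vals1, vals2)])

meta.flush()
fh.close()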

Any assistance is greatly appreciated!

EDIT w/ additional info: It appears that the PyTables gurus have already addressed the "is such a structure possible" question:

PyTables Mail Forum - Hierarchical Datasets

So has anyone figured out a way to create an analogous PyTable data structure?

Thanks again!


Source: (StackOverflow)

Building a huge numpy array using pytables

How can I create a huge numpy array using pytables? I tried the following, but it gives me a "ValueError: array is too big." error:

import numpy as np
import tables as tb
ndim = 60000
h5file = tb.openFile('test.h5', mode='w', title="Test Array")
root = h5file.root
h5file.createArray(root, "test", np.zeros((ndim,ndim), dtype=float))
h5file.close()
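
A sketch of the usual workaround, assuming the goal is an on-disk array rather than an in-memory one: create a chunked CArray of the target shape and fill it block by block, so the full 60000 x 60000 float64 array (roughly 29 GB) never has to exist in RAM at once.

import numpy as np
import tables as tb

ndim = 60000
h5file = tb.openFile('test.h5', mode='w', title="Test Array")
carray = h5file.createCArray(h5file.root, "test", tb.Float64Atom(),
                             shape=(ndim, ndim),
                             filters=tb.Filters(complevel=5, complib='blosc'))
block = 1000
for start in range(0, ndim, block):
    # write one horizontal strip at a time (zeros here, mirroring the original example)
    carray[start:start + block, :] = np.zeros((block, ndim), dtype=np.float64)
h5file.close()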

Source: (StackOverflow)

Unable to reinstall PyTables for Python 2.7

I am installing Python 2.7 alongside my existing Python installation. When installing PyTables again for 2.7, I get this error:

  • Found numpy 1.5.1 package installed. .. ERROR:: Could not find a local HDF5 installation. You may need to explicitly state where your local HDF5 headers and library can be found by setting the HDF5_DIR environment variable or by using the --hdf5 command-line option.

I am not clear on the HDF5 installation. I downloaded it again and copied it into a /usr/local/hdf5 directory, then tried to set the environment variables as suggested in the PyTables installation instructions. Has anyone else had this problem and found a fix?


Source: (StackOverflow)

merging two tables with millions of rows in python

I am using Python for some data analysis. I have two tables: the first (let's call it 'A') has 10 million rows and 10 columns, and the second ('B') has 73 million rows and 2 columns. They have one column with common IDs, and I want to intersect the two tables based on that column; in particular, I want the inner join of the tables.

I could not load table B into memory as a pandas DataFrame to use the normal pandas merge function. I tried reading the file of table B in chunks, intersecting each chunk with A, and then concatenating these intersections (the output of the inner joins). This is OK on speed, but every now and then it gives me problems and spits out a segmentation fault... not so great. The error is difficult to reproduce, but it happens on two different machines (Mac OS X Snow Leopard and UNIX, Red Hat).

I finally tried the combination of pandas and PyTables, by writing table B to disk and then iterating over table A and selecting the matching rows from table B. This last option works, but it is slow. Table B has already been indexed by default in PyTables. I wonder if you could suggest any other ideas for how to tackle this problem. Thanks in advance.
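
For reference, a sketch of the chunked inner join using pandas on top of a PyTables-backed store (the file names, keys, and the 'id' column are placeholders; this assumes table A fits in memory and table B was written to the store in table format):

import pandas as pd

A = pd.read_csv('A.csv')                  # 10M x 10; assumed to fit in memory

store = pd.HDFStore('B.h5')
pieces = []
# stream B (73M x 2) through memory in chunks and join each chunk against A
for chunk in store.select('B', chunksize=5000000):
    pieces.append(pd.merge(A, chunk, on='id', how='inner'))
store.close()

result = pd.concat(pieces, ignore_index=True)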


Source: (StackOverflow)

HDF5 taking more space than CSV?

Consider the following example:

Prepare the data:

import string
import random
import numpy as np
import pandas as pd

matrix = np.random.random((100, 3000))
my_cols = [random.choice(string.ascii_uppercase) for x in range(matrix.shape[1])]
mydf = pd.DataFrame(matrix, columns=my_cols)
mydf['something'] = 'hello_world'

Set the highest compression possible for HDF5:

store = pd.HDFStore('myfile.h5',complevel=9, complib='bzip2')
store['mydf'] = mydf
store.close()

Save also to CSV:

mydf.to_csv('myfile.csv', sep=':')

The result is:

  • myfile.csv is 5.6 MB big
  • myfile.h5 is 11 MB big

The difference grows bigger as the datasets get larger.

I have tried with other compression methods and levels. Is this a bug? (I am using Pandas 0.11 and the latest stable version of HDF5 and Python).


Source: (StackOverflow)

Convert large csv to hdf5

I have a 100M-line CSV file (actually many separate CSV files) totaling 84 GB. I need to convert it to an HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't do the final dataset without running out of memory.

How can I write to HDF5 without having to store the whole dataset in memory? I'm expecting actual code here, because it should be quite simple.

I was just looking into pytables, but it doesn't look like the array class (which corresponds to an HDF5 dataset) can be written to iteratively. Similarly, pandas has read_csv and to_hdf methods in its io_tools, but I can't load the whole dataset at one time, so that won't work. Perhaps you can help me solve the problem correctly with other tools in pytables or pandas.
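
A sketch of one way to do this with PyTables: an EArray is extendable along its first dimension, so the CSV can be streamed through memory in chunks and appended piece by piece (the file names, column count, and chunk size below are placeholders):

import numpy as np
import pandas as pd
import tables as tb

ncols = 10                                    # assumed number of float columns per row
f = tb.openFile('out.h5', 'w')
earray = f.createEArray(f.root, 'data', tb.Float64Atom(), shape=(0, ncols),
                        filters=tb.Filters(complevel=5, complib='blosc'),
                        expectedrows=100000000)
# read_csv with chunksize yields DataFrames of at most that many rows
for chunk in pd.read_csv('big.csv', chunksize=1000000):
    earray.append(chunk.values.astype(np.float64))
f.close()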


Source: (StackOverflow)

Efficient way of inputting large raster data into PyTables

I am looking for an efficient way to feed a 20 GB raster data file (GeoTIFF) into PyTables for further out-of-core computation.

Currently I am reading it as a numpy array using GDAL, and writing the numpy array into pytables using the code below:

import gdal, numpy as np, tables as tb

inraster = gdal.Open('infile.tif').ReadAsArray().astype(np.float32)
f = tb.openFile('myhdf.h5','w')
dataset = f.createCArray(f.root, 'mydata', atom=tb.Float32Atom(), shape=np.shape(inraster))
dataset[:] = inraster
dataset.flush()
dataset.close()
f.close()
inraster = None

Unfortunately, since my input file is extremely large, my PC runs out of memory while reading it as a numpy array. Is there any alternative way to feed the data into PyTables, or any suggestions to improve my code?
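
A sketch of reading the raster in horizontal strips instead of all at once (this assumes a single-band GeoTIFF; the strip height is a tunable placeholder):

import gdal
import numpy as np
import tables as tb

ds = gdal.Open('infile.tif')
band = ds.GetRasterBand(1)
nrows, ncols = ds.RasterYSize, ds.RasterXSize

f = tb.openFile('myhdf.h5', 'w')
dataset = f.createCArray(f.root, 'mydata', atom=tb.Float32Atom(),
                         shape=(nrows, ncols),
                         filters=tb.Filters(complevel=5, complib='blosc'))

blockrows = 1024                              # rows per read; tune to the available RAM
for r in range(0, nrows, blockrows):
    n = min(blockrows, nrows - r)
    # read only a strip of the raster: (xoff, yoff, xsize, ysize)
    strip = band.ReadAsArray(0, r, ncols, n).astype(np.float32)
    dataset[r:r + n, :] = strip

f.close()
ds = None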


Source: (StackOverflow)

How to store wide tables in pytables / hdf5

I have data coming from a csv which has a few thousand columns and ten thousand (or so) rows. Within each column the data is of the same type, but different columns have data of different type*. Previously I have been pickling the data from numpy and storing on disk, but it's quite slow, especially because usually I want to load some subset of the columns rather than all of them.

I want to put the data into hdf5 using pytables, and my first approach was to put the data in a single table, with one hdf5 column per CSV column. Unfortunately this didn't work; I assume because of the 512 (soft) column limit.

What is a sensible way to store this data?

* I mean, the type of the data after it has been converted from text.
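
One commonly suggested layout, sketched here with toy float data (real columns would each keep their own dtype): store one chunked, compressed array per column under a group, so any subset of columns can be read back without touching the rest.

import numpy as np
import tables as tb

# toy stand-in for the parsed CSV: column name -> 1-D numpy array
columns = dict(('col_%04d' % i, np.random.random(10000)) for i in range(3000))

f = tb.openFile('wide.h5', 'w')
grp = f.createGroup('/', 'columns')
filters = tb.Filters(complevel=5, complib='blosc')
for name, values in columns.items():
    arr = f.createCArray(grp, name, tb.Atom.from_dtype(values.dtype),
                         shape=values.shape, filters=filters)
    arr[:] = values
f.close()

# later: load just the columns that are needed
f = tb.openFile('wide.h5', 'r')
subset = dict((n, f.getNode('/columns', n)[:]) for n in ('col_0001', 'col_0042'))
f.close()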


Source: (StackOverflow)