
h5py

HDF5 for Python -- The h5py package is a Pythonic interface to the HDF5 binary data format.

Statistics on huge numpy (HDF5) arrays

I have fairly large 4D arrays [20x20x40x15000] that I save to disk as HDF5 files using h5py. The problem is that I want to calculate the average of the entire array, i.e. using:

numpy.average(HDF5_file)

I get a MemoryError. It seems that numpy tries to load the HDF5 file into memory to perform the average?

Does anyone have an elegant and efficient solution to this problem?
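
A sketch of a chunk-wise mean, assuming the dataset inside the file is named 'data' (both the file and dataset names below are hypothetical); only one slab along the last axis is ever in memory:

import numpy as np
import h5py

with h5py.File('HDF5_file.h5', 'r') as f:
    dset = f['data']                          # shape (20, 20, 40, 15000)
    total, count = 0.0, 0
    step = 1000                               # slab size along the last axis
    for start in range(0, dset.shape[-1], step):
        block = dset[..., start:start + step] # reads only this slab from disk
        total += block.sum(dtype=np.float64)
        count += block.size
    print(total / count)                      # same value numpy.average would give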


Source: (StackOverflow)

Combining hdf5 files

I have a number of hdf5 files, each of which have a single dataset. The datasets are too large to hold in RAM. I would like to combine these files into a single file containing all datasets separately (i.e. not to concatenate the datasets into a single dataset).

One way to do this is to create an hdf5 file and then copy the datasets one by one. This would be slow and complicated because it would have to be a buffered copy.

Is there a simpler way to do this? It seems like there should be, since it is essentially just creating a container file.

I am using python/h5py.
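
For what it's worth, a sketch of the copy-based route: h5py's Group.copy does the buffering internally, so nothing has to fit in RAM. The file names below are hypothetical.

import h5py

source_files = ['part1.h5', 'part2.h5', 'part3.h5']

with h5py.File('combined.h5', 'w') as dest:
    for i, filename in enumerate(source_files):
        with h5py.File(filename, 'r') as src:
            for name in src:
                # give each copied dataset a unique name in the destination
                src.copy(name, dest, name=f'{name}_{i}')

If a true merge is not required, h5py.ExternalLink can also make all the datasets visible from one file without copying any data.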


Source: (StackOverflow)


Input and output numpy arrays to h5py

I have Python code whose output is a large matrix (the dimensions were shown as an image in the original post), with all entries of type float. If I save it with the extension .dat, the file size is on the order of 500 MB. I read that using h5py reduces the file size considerably. So, let's say I have a 2D numpy array named A. How do I save it to an h5py file? Also, how do I read the same file back into a numpy array in a different script, as I need to do manipulations with the array?
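
A minimal sketch of the round trip, with a random array standing in for A (the gzip compression option is what typically shrinks the file):

import numpy as np
import h5py

A = np.random.rand(1000, 1000)            # stands in for the real matrix

# write
with h5py.File('A.h5', 'w') as f:
    f.create_dataset('A', data=A, compression='gzip')

# read back in a different script
with h5py.File('A.h5', 'r') as f:
    A_loaded = f['A'][...]                # [...] reads the dataset into a numpy array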


Source: (StackOverflow)

How to partially copy an HDF5 file into a new one using Python, keeping the same structure?

I have a large hdf5 file that looks something like this:

A/B/dataset1, dataset2
A/C/dataset1, dataset2
A/D/dataset1, dataset2
A/E/dataset1, dataset2

...

I want to create a new file containing only:

A/B/dataset1, dataset2
A/C/dataset1, dataset2

What is the easiest way in python?

I did:

fs = h5py.File('source.h5', 'r')
fd = h5py.File('dest.h5', 'w')
fs.copy('A/B', fd)

The problem is that in dest.h5 I get:

B/dataset1, dataset2

so I am missing part of the tree structure (the enclosing group A).
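
A sketch of one way to keep the enclosing group: create the parent group in the destination first, then copy each subgroup into it:

import h5py

with h5py.File('source.h5', 'r') as fs, h5py.File('dest.h5', 'w') as fd:
    fd.create_group('A')
    for name in ('B', 'C'):
        fs.copy('A/' + name, fd['A'])     # ends up at /A/B and /A/C in dest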


Source: (StackOverflow)

How to store dictionary in HDF5 dataset

I have a dictionary where the key is a datetime object and the value is a tuple of integers:

>>> d.items()[0]
(datetime.datetime(2012, 4, 5, 23, 30), (14, 1014, 6, 3, 0))

I want to store it in an HDF5 dataset, but if I just try to dump the dictionary, h5py raises an error:

TypeError: Object dtype dtype('object') has no native HDF5 equivalent

What would be "the best" way to transform this dictionary so that I can store it in an HDF5 dataset?

Specifically, I don't want to just dump the dictionary into a numpy array, as that would complicate data retrieval based on a datetime query.
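
One hedged option is a structured (compound) dataset with the datetime stored as a POSIX timestamp, which keeps it queryable; the field names below are made up:

import datetime
import numpy as np
import h5py

d = {datetime.datetime(2012, 4, 5, 23, 30): (14, 1014, 6, 3, 0)}

dtype = np.dtype([('timestamp', 'f8'), ('values', 'i4', (5,))])
records = np.array([(k.timestamp(), v) for k, v in d.items()], dtype=dtype)

with h5py.File('observations.h5', 'w') as f:
    f.create_dataset('records', data=records)
    # later queries can compare f['records']['timestamp'] against
    # datetime.datetime(...).timestamp()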


Source: (StackOverflow)

Experience with using h5py to do analytical work on big data in Python?

I do a lot of statistical work and use Python as my main language. Some of the data sets I work with though can take 20GB of memory, which makes operating on them using in-memory functions in numpy, scipy, and PyIMSL nearly impossible. The statistical analysis language SAS has a big advantage here in that it can operate on data from hard disk as opposed to strictly in-memory processing. But, I want to avoid having to write a lot of code in SAS (for a variety of reasons) and am therefore trying to determine what options I have with Python (besides buying more hardware and memory).

I should clarify that approaches like map-reduce will not help in much of my work because I need to operate on complete sets of data (e.g. computing quantiles or fitting a logistic regression model).

Recently I started playing with h5py and think it is the best option I have found for allowing Python to act like SAS and operate on data from disk (via hdf5 files), while still being able to leverage numpy/scipy/matplotlib, etc. I would like to hear if anyone has experience using Python and h5py in a similar setting and what they have found. Has anyone been able to use Python in "big data" settings heretofore dominated by SAS?

EDIT: Buying more hardware/memory certainly can help, but from an IT perspective it is hard for me to sell Python to an organization that needs to analyze huge data sets when Python (or R, or MATLAB, etc.) needs to hold the data in memory. SAS continues to have a strong selling point here because, while disk-based analytics may be slower, you can confidently deal with huge data sets. So, I am hoping that Stackoverflow-ers can help me figure out how to reduce the perceived risk around using Python as a mainstay big-data analytics language.


Source: (StackOverflow)

Fastest way to write hdf5 file with Python?

Given a large (10s of GB) CSV file of mixed text/numbers, what is the fastest way to create an hdf5 file with the same content, while keeping the memory usage reasonable? I'd like to use the h5py module if possible.

In the toy example below, I've found an incredibly slow and incredibly fast way to write data to hdf5. Would it be best practice to write to hdf5 in chunks of 10,000 rows or so? Or is there a better way to write a massive amount of data to such a file?

import h5py

n = 10000000
f = h5py.File('foo.h5','w')
dset = f.create_dataset('int',(n,),'i')

# this is terribly slow: one HDF5 write call per element
for i in range(n):
  dset[i] = i

# instantaneous
dset[...] = 42
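
A sketch of the middle ground hinted at above: writing blocks of roughly 10,000 rows per call keeps memory bounded while avoiding the per-element overhead of the slow loop:

import numpy as np
import h5py

n = 10000000
block = 10000

with h5py.File('foo.h5', 'w') as f:
    dset = f.create_dataset('int', (n,), 'i')
    for start in range(0, n, block):
        stop = min(start + block, n)
        dset[start:stop] = np.arange(start, stop, dtype='i')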

Source: (StackOverflow)

HDF5 for Python: high-level vs. low-level interfaces (h5py)

I've been working with HDF5 files from C and Matlab, in both cases reading from and writing to datasets the same way:

  • open file with h5f
  • open dataset with h5d
  • select space with h5s

and so on...

But now I'm working with Python, and with its h5py library I see that it has two ways to manage HDF5: the high-level and low-level interfaces. With the former it takes fewer lines of code to get the information from a single variable in the file.

Is there any noticeable loss of performance when using the high-level interface?
For example, when dealing with a file containing many variables, where we need to read just one of them.
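
For comparison, a rough sketch of reading one variable both ways; the file and dataset names are hypothetical, and the low-level calls shown are the ones that, as far as I know, mirror the C h5f/h5d/h5s routines:

import numpy as np
import h5py

# High-level interface: open, index by name, slice.
with h5py.File('data.h5', 'r') as f:
    x = f['temperature'][...]

# Low-level interface: roughly the same steps as the C API.
fid = h5py.h5f.open(b'data.h5', h5py.h5f.ACC_RDONLY)
did = h5py.h5d.open(fid, b'temperature')
x2 = np.empty(did.shape, dtype=did.dtype)
did.read(h5py.h5s.ALL, h5py.h5s.ALL, x2)
fid.close()

In my experience the high-level overhead is per-call Python bookkeeping around the same underlying HDF5 routines, so for reading a single variable it is rarely noticeable.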


Source: (StackOverflow)

How to export an HDF5 file to NumPy using h5py?

I have an existing HDF5 file with three arrays, and I want to extract one of the arrays using h5py.
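
A minimal sketch; 'array1' stands in for whichever of the three dataset names your file actually uses:

import h5py

with h5py.File('existing.h5', 'r') as f:
    print(list(f.keys()))        # list the dataset names in the file
    arr = f['array1'][...]       # read one dataset into a numpy array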


Source: (StackOverflow)

Updating h5py Datasets

Does anyone have an idea for updating HDF5 datasets from h5py? Assume we create a dataset like this:

import h5py
import numpy
f = h5py.File('myfile.hdf5', 'w')
dset = f.create_dataset('mydataset', data=numpy.ones((2,2),"=i4"))
new_dset_value=numpy.zeros((3,3),"=i4")

Is it possible to extend the dset to a 3x3 numpy array?
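
A sketch of the usual route: a dataset can only grow if it was created with a maxshape, so the example below recreates it as resizable and then writes the 3x3 values:

import numpy
import h5py

with h5py.File('myfile.hdf5', 'w') as f:
    dset = f.create_dataset('mydataset',
                            data=numpy.ones((2, 2), '=i4'),
                            maxshape=(None, None))     # None = unlimited along that axis
    dset.resize((3, 3))                                # newly added elements are zero-filled
    dset[...] = numpy.zeros((3, 3), '=i4')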


Source: (StackOverflow)

Is h5py capable of converting python dictionaries to hdf5 groups automatically?

I have been using scipy.io to save my structured data (lists and dictionaries filled with ndarrays of different shapes). Since the v7.3 MAT-file format is going to replace the old v7 format some day, I am thinking about switching to HDF5 to store my data, more specifically h5py for Python. However, I noticed that I cannot save my dictionaries as easily as:

import scipy.io as sio
data = {'data': 'Complicated structure data'}
sio.savemat('fileName.mat', data)

Instead, I have to use h5py's create_group one call at a time to replicate the structure of the Python dictionary. For very large structures, this is infeasible. Is there an easy way to automatically convert Python dictionaries to hdf5 groups?

Thank you!

-Shawn
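
There isn't, as far as I know, a single built-in call for this, but a short recursive helper covers the common case; the function name and sample data below are made up:

import numpy as np
import h5py

def save_dict_to_group(group, d):
    """Write a dict whose values are arrays/scalars or nested dicts."""
    for key, value in d.items():
        if isinstance(value, dict):
            save_dict_to_group(group.create_group(key), value)
        else:
            group.create_dataset(key, data=value)

data = {'run1': {'x': np.arange(5), 'y': np.ones((2, 2))},
        'offsets': [1, 2, 3]}

with h5py.File('fileName.h5', 'w') as f:
    save_dict_to_group(f, data)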


Source: (StackOverflow)

An XML file inside HDF5, h5py

I am using h5py to save data (float numbers) in groups. In addition to the data itself, I need to include an additional file (an .xml file containing necessary information) within the hdf5 file. How do I do this? Is my approach wrong?

f = h5py.File('filename.h5', 'w')
f.create_dataset('/data/1', data=numpy_array_1)
f.create_dataset('/data/2', data=numpy_array_2)
.
.

My h5 tree should then look like this:

/ 
/data
/data/1 (numpy_array_1)
/data/2 (numpy_array_2)
.
.
/morphology.xml (?)
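
One approach that seems to fit is storing the XML text itself as a string dataset; reading the file from disk first is my assumption:

import h5py

with open('morphology.xml') as xml_file:
    xml_text = xml_file.read()

with h5py.File('filename.h5', 'a') as f:
    # a scalar variable-length string dataset at the root of the file
    f.create_dataset('morphology.xml', data=xml_text)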

Source: (StackOverflow)

Check if node exists in h5py

I'm wondering if there's a simple way to check whether a node exists within an HDF5 file using h5py.

I couldn't find anything in the docs, so right now I'm using exceptions, which is ugly.

# check if node exists
# first assume it exists
e = True
try:
  h5File["/some/path"]
except KeyError:
  e = False # now we know it doesn't

To add context: I'm using this to determine if a node exists before trying to create a new node with the same name.
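
In the h5py versions I've used, groups support Python's in operator for exactly this check, which avoids the try/except:

if "/some/path" in h5File:
    node = h5File["/some/path"]
else:
    node = h5File.create_group("/some/path")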


Source: (StackOverflow)

Installing h5py on OS X

I've spent the day trying to get the h5py Python module working, but without success. I've installed the HDF5 shared libraries and followed the instructions I could find on the web, but it doesn't work. Below is the error message I get when trying to import the module into Python. I tried installing through MacPorts too, but again it wouldn't work.

I'm using Python 2.7 32-bit (I had to for another module, and thus installed the i386 HDF5 library... if that's right?)

Any help is very welcome!

Thank you!

import h5py

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/h5py/__init__.py", line 1, in <module>
    from h5py import _errors
ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/h5py/_errors.so, 2): Symbol not found: _H5E_ALREADYEXISTS_g
  Referenced from: /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/h5py/_errors.so
  Expected in: flat namespace
 in /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/h5py/_errors.so

Source: (StackOverflow)

What is a fast way to output an h5py dataset to text?

I am using the h5py Python package to read files in HDF5 format (e.g. somefile.h5). I would like to write the contents of a dataset to a text file.

For example, I would like to create a text file with the following contents: 1,20,31,75,142,324,78,12,3,90,8,21,1

I am able to access the dataset in python using this code:

import h5py
f     = h5py.File('/Users/Me/Desktop/thefile.h5', 'r')
group = f['/level1/level2/level3']
dset  = group['dsetname']

My naive approach is too slow, because my dataset has over 20000 entries:

# write all values to the file, comma-separated, with no comma after the last value
with open('output.txt', 'w') as txtfile:
    for index in range(len(dset)):
        if index == len(dset) - 1:
            txtfile.write(repr(dset[index]))
        else:
            txtfile.write(repr(dset[index]) + ',')

Is there a faster way to write this to a file? Perhaps I could convert the dataset into a NumPy array or even a Python list, and then use some file-writing tool?

(I could experiment with concatenating the values into a larger string before writing to file, but I'm hoping there's something entirely more elegant)
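
A sketch of the faster route the question hints at, reusing dset from the snippet above: pull the whole dataset into a numpy array with a single read, then format it in one go (fmt='%d' assumes integer values like the ones in the example):

import numpy as np

values = dset[...]                                   # one bulk read into a numpy array
np.savetxt('output.txt', values.reshape(1, -1), fmt='%d', delimiter=',')

# or, keeping repr-style formatting:
with open('output.txt', 'w') as txtfile:
    txtfile.write(','.join(repr(v) for v in values))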


Source: (StackOverflow)