
pycuda

CUDA integration for Python, plus shiny features

pyCUDA vs C performance differences?

I'm new to CUDA programming and I was wondering how the performance of pyCUDA compares to that of programs implemented in plain C. Will the performance be roughly the same? Are there any bottlenecks that I should be aware of?

EDIT: I obviously tried to google this issue first, and was surprised to not find any information, i.e. I would have expected that the pyCUDA people have this question answered in their FAQ.
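
For context, here is a minimal sketch (mine, not from the original question) of what a PyCUDA launch involves. The kernel itself is plain CUDA C compiled by nvcc, so it runs at native speed on the device; the usual suspects for overhead are the one-time compilation, the Python call layer, and implicit host/device transfers:

import numpy
import pycuda.autoinit          # context creation: a one-time cost
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# nvcc compilation happens here; PyCUDA caches the binary, so this is
# a one-time cost per kernel, not paid on every launch.
mod = SourceModule("""
__global__ void scale(float *a, float s, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) a[i] *= s;
}
""")
scale = mod.get_function("scale")

a = numpy.random.rand(1 << 20).astype(numpy.float32)

# drv.InOut copies the array to the device and back on every call;
# for repeated launches, keeping the data resident on the GPU avoids
# paying the PCIe transfer cost each time.
scale(drv.InOut(a), numpy.float32(2.0), numpy.int32(a.size),
      block=(256, 1, 1), grid=((a.size + 255) // 256, 1))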


Source: (StackOverflow)

Disappointing results in pyCUDA benchmark for distance computing between N points

The following script was set up for benchmark purposes. It computes the distance between N points using the Euclidean L2 norm. Three different routines are implemented:

  1. High-level solution using the scipy.spatial.distance.pdist function.
  2. Fairly low-level OpenMP powered scipy.weave.inline solution.
  3. pyCUDA powered GPGPU solution.

Here are the benchmark results on an i5-3470 (16GB RAM) using a GTX660 (2GB RAM):

    ------------
    Scipy Pdist
    Execution time: 3.01975 s
    First five elements: [ 0.74968684  0.71457213  0.833188    0.48084545  0.86407363]
    Last five elements: [ 0.65717077  0.76850474  0.29652017  0.856179    0.56074625]
    ------------
    Weave Inline
    Execution time: 2.48705 s
    First five elements: [ 0.74968684  0.71457213  0.83318806  0.48084542  0.86407363]
    Last five elements: [ 0.65717083  0.76850474  0.29652017  0.856179    0.56074625]
    ------------
    pyCUDA
    CUDA clock timing:  0.713028930664
    Execution time: 2.04364 s
    First five elements: [ 0.74968684  0.71457213  0.83318806  0.48084542  0.86407363]
    Last five elements: [ 0.65717083  0.76850468  0.29652017  0.856179    0.56074625]
    ------------

I am a bit disappointed by the pyCUDA performance. Since I am new to CUDA, there is probably something I am missing here. So where is the crux of the matter? Am I reaching the limits of global memory bandwidth? Is it a poor choice of block and grid sizes? (See the sanity check after the script.)

import numpy,time,math
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
from scipy.spatial.distance import pdist
from scipy import weave

def weave_solution(x):
    """
    OpenMP powered weave inline.
    """
    N,DIM     = numpy.shape(x)
    L         = ((N-1)**2+(N-1))/2
    solution  = numpy.zeros(L).astype(numpy.float32)
    ncpu      = 4
    weave_omp = {'headers'           : ['<omp.h>'],
                 'extra_compile_args': ['-fopenmp'],
                 'extra_link_args'   : ['-lgomp']}
    code = \
         r'''
         omp_set_num_threads(ncpu);
         #pragma omp parallel
         {            
            int j,d,pos;
            float r=0.0;

            #pragma omp for
               for (int i=0; i<(N-1); i++){
                  for (j=(i+1); j<N; j++){
                     r = 0.0;
                     for (d=0; d<DIM; d++){
                        r += (x[i*DIM+d]-x[j*DIM+d])*(x[i*DIM+d]-x[j*DIM+d]);
                     }
                     pos = (i*N+j)-(i*(i+1)/2)-i-1;
                     solution[pos] = sqrt(r);
                  }
               }

         }
         '''
    weave.inline(code,['x','N','DIM','solution','ncpu'],**weave_omp)
    return numpy.array(solution)

def scipy_solution(x):
    """
    SciPy High-level function
    """
    return pdist(x).astype(numpy.float32)

def cuda_solution(x):
    """
    pyCUDA
    """
    N,DIM     = numpy.shape(x)
    N         = numpy.int32(N)
    DIM       = numpy.int32(DIM)    
    L         = ((N-1)**2+(N-1))/2
    solution  = numpy.zeros(L).astype(numpy.float32)

    start = drv.Event()
    end   = drv.Event()       

    mod = SourceModule("""
    __global__ void distance(float *x,int N,int DIM,float *solution){

    const int i = blockDim.x * blockIdx.x + threadIdx.x;

    int j,d,pos;
    float r=0.0;

    if ( i < (N-1) ){

       for (j=(i+1); j<N; j++){

          r = 0.0;
          for (d=0; d<DIM; d++){
             r += (x[i*DIM+d]-x[j*DIM+d])*(x[i*DIM+d]-x[j*DIM+d]);
          }
          pos = (i*N+j)-(i*(i+1)/2)-i-1;
          solution[pos] = sqrt(r);
       }

    }
    }
    """)


    func = mod.get_function("distance")

    start.record()
    func(drv.In(x),N,DIM,drv.Out(solution),block=(192,1,1),grid=(192,1))
    end.record()
    end.synchronize()
    secs = start.time_till(end)*1e-3

    print "CUDA clock timing: ",secs
    return solution

if __name__ == '__main__':

    # Set up data points
    N   = 25000
    DIM = 3
    x   = numpy.random.rand(N,DIM).astype(numpy.float32)

    print "-"*12
    # Scipy solution
    print "Scipy Pdist"
    stime = time.time()
    spsolution = scipy_solution(x)
    stime = time.time()-stime
    print "Execution time: {0:.5f} s".format(stime)
    print "Frist five elements:", spsolution[:5]
    print "Last five elements:", spsolution[-5:]    
    print "-"*12

    # Weave solution
    print "Weave Inline"
    wtime = time.time()
    wsolution = weave_solution(x)
    wtime = time.time()-wtime
    print "Execution time: {0:.5f} s".format(wtime)
    print "Frist five elements:", wsolution[:5]
    print "Last five elements:", wsolution[-5:]
    print "-"*12

    # pyCUDA solution
    print "pyCUDA"
    ctime = time.time()
    csolution = cuda_solution(x)
    ctime = time.time()-ctime
    print "Execution time: {0:.5f} s".format(ctime)
    print "Frist five elements:", csolution[:5]
    print "Last five elements:", csolution[-5:]    
    print "-"*12

Edit:

I have added the hash bang line

#!/usr/bin/env python

at the top of the file and made it executable. After commenting out the computation using weave.inline and scipy.spatial.distance.pdist, the NVIDIA Visual Profiler reports the following results:

[NVIDIA Visual Profiler screenshot]


Source: (StackOverflow)


PyCUDA: C/C++ includes?

Something that isn't really mentioned anywhere (at least that I can see) is what library functions are exposed to inline CUDA kernels.

Specifically, I'm doing small/stupid matrix multiplications that don't deserve to be individually offloaded to the GPU, but I am offloading a larger section of the algorithm which includes this multiplication. No one ever liked writing their own linalg functions, since someone has always done it better.

TL;DR: What libraries can I play with from inside inline kernels under PyCUDA?
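
For a sense of what is available, a sketch (mine; it assumes nothing beyond the CUDA toolkit is installed): inline kernels are ordinary CUDA C++, so the built-in device math functions (sqrtf, expf, ...) always work, hand-rolled __device__ helpers are fine, and header-only code can be pulled in via SourceModule's include_dirs argument. Host-side libraries such as cuBLAS generally cannot be called from inside a kernel.

import numpy
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# A hand-rolled 3x3 matrix-vector multiply as a __device__ helper;
# built-in math functions like sqrtf() need no includes at all.
mod = SourceModule("""
__device__ void matvec3(const float *M, const float *v, float *out)
{
    for (int r = 0; r < 3; ++r)
        out[r] = M[r*3+0]*v[0] + M[r*3+1]*v[1] + M[r*3+2]*v[2];
}

__global__ void norm_of_Mv(const float *M, const float *v, float *result)
{
    float t[3];
    matvec3(M, v, t);
    *result = sqrtf(t[0]*t[0] + t[1]*t[1] + t[2]*t[2]);
}
""")  # pass include_dirs=[...] here to #include your own headers

func = mod.get_function("norm_of_Mv")
M = numpy.eye(3, dtype=numpy.float32).ravel()
v = numpy.array([3.0, 4.0, 0.0], dtype=numpy.float32)
r = numpy.zeros(1, dtype=numpy.float32)
func(drv.In(M), drv.In(v), drv.Out(r), block=(1, 1, 1), grid=(1, 1))
print(r[0])  # 5.0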


Source: (StackOverflow)

`Out of resources` error while doing loop unrolling

When I increase the unrolling from 8 to 9 loops in my kernel, it breaks with an out of resources error.

I read in How do I diagnose a CUDA launch failure due to being out of resources? that a mismatch of parameters and an overuse of registers could be a problem, but that doesn't seem to be the case here.

My kernel calculates the distance between n points and m centroids and selects the closest centroid for each point. It works for 8 dimensions but not for 9. When I set dimensions=9 and uncomment the two lines for the distance calculation, I get a pycuda._driver.LaunchError: cuLaunchGrid failed: launch out of resources.

What do you think could cause this behavior? What other issues can cause an out of resources error?

I use a Quadro FX580. Here is a minimal(ish) example. For the unrolling in the real code I use templates.

import numpy as np
from pycuda import driver, compiler, gpuarray, tools
import pycuda.autoinit


## preference
np.random.seed(20)
points = 512
dimensions = 8
nclusters = 1

## init data
data = np.random.randn(points,dimensions).astype(np.float32)
clusters = data[:nclusters]

## init cuda
kernel_code = """

      // the kernel definition 
    __device__ __constant__ float centroids[16384];

    __global__ void kmeans_kernel(float *idata,float *g_centroids,
    int * cluster, float *min_dist, int numClusters, int numDim) {
    int valindex = blockIdx.x * blockDim.x + threadIdx.x ;
    float increased_distance,distance, minDistance;
    minDistance = 10000000 ;
    int nearestCentroid = 0;
    for(int k=0;k<numClusters;k++){
      distance = 0.0;
      increased_distance = idata[valindex*numDim] -centroids[k*numDim];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+1] -centroids[k*numDim+1];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+2] -centroids[k*numDim+2];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+3] -centroids[k*numDim+3];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+4] -centroids[k*numDim+4];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+5] -centroids[k*numDim+5];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+6] -centroids[k*numDim+6];
      distance = distance +(increased_distance * increased_distance);
      increased_distance =  idata[valindex*numDim+7] -centroids[k*numDim+7];
      distance = distance +(increased_distance * increased_distance);
      //increased_distance =  idata[valindex*numDim+8] -centroids[k*numDim+8];
      //distance = distance +(increased_distance * increased_distance);

      if (distance < minDistance) {
        minDistance = distance;
        nearestCentroid = k;
      }
    }
    cluster[valindex] = nearestCentroid;
    min_dist[valindex] = sqrt(minDistance);
    }
 """
mod = compiler.SourceModule(kernel_code)
centroids_adrs = mod.get_global('centroids')[0]    
kmeans_kernel = mod.get_function("kmeans_kernel")
clusters_gpu = gpuarray.to_gpu(clusters)
cluster = gpuarray.zeros(points, dtype=np.int32)
min_dist = gpuarray.zeros(points, dtype=np.float32)

driver.memcpy_htod(centroids_adrs,clusters)

distortion = gpuarray.zeros(points, dtype=np.float32)
block_size= 512

## start kernel
kmeans_kernel(
    driver.In(data),driver.In(clusters),cluster,min_dist,
    np.int32(nclusters),np.int32(dimensions),
    grid = (points/block_size,1),
    block = (block_size, 1, 1),
)
print cluster
print min_dist
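
One way to see which resource the unrolling consumes (a sketch appended to the script above, not part of the original question) is to ask PyCUDA for the compiled kernel's per-thread requirements and compare them against the device limit; with block_size = 512, the launch fails as soon as num_regs * 512 exceeds the per-block register budget:

print("regs/thread:     %d" % kmeans_kernel.num_regs)
print("smem/block:      %d bytes" % kmeans_kernel.shared_size_bytes)
print("lmem/thread:     %d bytes" % kmeans_kernel.local_size_bytes)
print("max regs/block:  %d" % pycuda.autoinit.device.get_attribute(
    driver.device_attribute.MAX_REGISTERS_PER_BLOCK))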

Source: (StackOverflow)

Python Multiprocessing with PyCUDA

I've got a problem that I want to split across multiple CUDA devices, but I suspect my current system architecture is holding me back.

What I've set up is a GPU class, with functions that perform operations on the GPU (strange that). These operations are of the style

for iteration in range(maxval):
    result[iteration]=gpuinstance.gpufunction(arguments,iteration)

I'd imagined that there would be N gpuinstances for N devices, but I don't know enough about multiprocessing to see the simplest way of applying this so that each device is asynchronously assigned. Strangely, few of the examples I came across gave concrete demonstrations of collating results after processing.

Can anyone give me any pointers in this area?

UPDATE: Thank you, Kaloyan, for your guidance in terms of the multiprocessing area; if CUDA wasn't specifically the sticking point I'd be marking you as answered. Sorry.

Prior to playing with this implementation, the gpuinstance class initiated the CUDA device with import pycuda.autoinit, but that didn't appear to work, throwing invalid context errors as soon as each (correctly scoped) thread met a CUDA command. I then tried manual initialisation in the __init__ constructor of the class with...

pycuda.driver.init()
self.mydev=pycuda.driver.Device(devid) #this is passed at instantiation of class
self.ctx=self.mydev.make_context()
self.ctx.push()    

My assumption here is that the context is preserved between when the list of gpuinstances is created and when the threads use them, so each device is sitting pretty in its own context.

(I also implemented a destructor to take care of pop/detach cleanup)

Problem is, invalid context exceptions are still appearing as soon as the thread tries to touch CUDA.

Any ideas, folks? And thanks for getting this far. Automatic upvotes for people working 'banana' into their answer! :P
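
For reference, a minimal sketch (mine, not the original poster's code) of the pattern that usually works: one worker process per device, with the context created inside the process that uses it, since a CUDA context cannot be shared across process boundaries. Results are collated through a Queue:

import multiprocessing
import numpy

def worker(devid, data, result_queue):
    # Initialise CUDA inside the worker: the context must live in the
    # process that issues the CUDA calls.
    import pycuda.driver as cuda
    cuda.init()
    ctx = cuda.Device(devid).make_context()
    try:
        import pycuda.gpuarray as gpuarray
        out = (gpuarray.to_gpu(data) * 2.0).get()  # stand-in for gpufunction
        result_queue.put((devid, out))
    finally:
        ctx.pop()

if __name__ == '__main__':
    ndevices = 2  # assumed; query via pycuda.driver.Device.count() if needed
    q = multiprocessing.Queue()
    data = numpy.arange(16, dtype=numpy.float32)
    procs = [multiprocessing.Process(target=worker, args=(d, data, q))
             for d in range(ndevices)]
    for p in procs:
        p.start()
    results = dict(q.get() for _ in procs)  # collate: device id -> output
    for p in procs:
        p.join()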


Source: (StackOverflow)

scipy.interpolate.griddata equivalent in CUDA

I'm trying to perform Fitted Value Iteration (FVI) in Python (involving approximating a 5-dimensional function using piecewise linear interpolation).

scipy.interpolate.griddata works perfectly for this. However, I need to call the interpolation routine several thousand times (since FVI is an MC-based algorithm).

So basically, the set of points where the function is known is static (and large, say 32k), but the points I need to approximate (which are small perturbations of the original set) are very large (32k x 5000, say).

Is there an implementation of what scipy.interpolate.griddata does that's been ported to CUDA? Alternatively, is there a way to speed up the calculation somehow?

Thanks.
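
Not CUDA, but worth noting against the question (a sketch of mine, not the poster's code): since the set of known points is static, the expensive Delaunay triangulation can be built once with scipy.interpolate.LinearNDInterpolator and reused across the thousands of calls, whereas each call to scipy.interpolate.griddata rebuilds it from scratch:

import numpy as np
from scipy.interpolate import LinearNDInterpolator

known_pts = np.random.rand(1000, 5)   # 32k points in the real problem
known_vals = np.random.rand(1000)

# The triangulation is computed once, here, not on every query.
interp = LinearNDInterpolator(known_pts, known_vals)

for _ in range(10):                   # stands in for the MC iterations
    queries = known_pts + 0.01 * np.random.randn(*known_pts.shape)
    approx = interp(queries)          # reuses the stored triangulation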


Source: (StackOverflow)

How do I diagnose a CUDA launch failure due to being out of resources?

I'm getting an out-of-resources error when trying to launch a CUDA kernel (through PyCUDA), and I'm wondering if it's possible to get the system to tell me which resource it is that I'm short on. Obviously the system knows what resource has been exhausted, I just want to query that as well.

I've used the occupancy calculator, and everything seems okay, so either there's a corner case not covered, or I'm using it wrong. I know it's not registers (which seems to be the usual culprit) because I'm using <= 63 and it still fails with a 1x1x1 block and 1x1 grid on a CC 2.1 device.

Thanks for any help. I posted a thread on the NVidia boards:

http://forums.nvidia.com/index.php?showtopic=206261&st=0

But got no responses. If the answer is "you can't ask the system for that information" that would be nice to know too (sort of... ;).

Edit:

The most register usage I've seen has been 63. Edited the above to reflect that.
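
For what it's worth, PyCUDA will at least let you print both sides of the comparison by hand (a sketch, not a confirmed diagnostic): the compiled function exposes its per-thread and per-block requirements, and the device exposes its limits:

import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Hypothetical stand-in for the failing kernel.
mod = SourceModule("__global__ void k(float *a) { a[0] = 1.0f; }")
func = mod.get_function("k")

print("regs/thread:  %d" % func.num_regs)
print("smem/block:   %d bytes" % func.shared_size_bytes)
print("lmem/thread:  %d bytes" % func.local_size_bytes)
print("const mem:    %d bytes" % func.const_size_bytes)

dev = pycuda.autoinit.device
for name in ("MAX_THREADS_PER_BLOCK", "MAX_REGISTERS_PER_BLOCK",
             "MAX_SHARED_MEMORY_PER_BLOCK"):
    attr = getattr(drv.device_attribute, name)
    print("%s: %d" % (name, dev.get_attribute(attr)))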


Source: (StackOverflow)

Identify contiguous segments of a non-contiguous numpy array

In the example below, we have a contiguous array, and a view of the same array that is non-contiguous:

import numpy as np

shape = (5, 100)
A = np.arange(np.product(shape)).reshape(shape)

# Everything is contiguous at this point
assert A.flags.c_contiguous == True

# Since we're taking a view over the
# row minor dimension, we'll have 5
# segments of 50 contiguous elements
A[:,20:70].flags.c_contiguous == False

# Each segment is contiguous
A[0,20:70].flags.c_contiguous == True
A[1,20:70].flags.c_contiguous == True
A[2,20:70].flags.c_contiguous == True
A[3,20:70].flags.c_contiguous == True
A[4,20:70].flags.c_contiguous == True

The view references 5 segments of 50 contiguous elements.

If we look at a more general case

shape = (30, 20, 70, 50)
B = np.arange(np.product(shape)).reshape(shape)

then, the following holds:

B[0,:,:,:].flags.c_contiguous == True
B[0,0,:,:].flags.c_contiguous == True
B[0,0,0,:].flags.c_contiguous == True

A partial solution, based on _UpdateContiguousFlags in NumPy's flagsobject.c, might be:

def is_contiguous(ary):
    is_c_contig = True
    sd = ary.itemsize
    for i in reversed(range(ary.ndim)):
        dim = ary.shape[i]

        # Contiguous by default
        if dim == 0:
            return True

        if dim != 1:
            if ary.strides[i] != sd:
                is_c_contig = False

            sd *= dim

    return is_c_contig

But I don't think this handles cases where the array is transposed, e.g.:

B.transpose(3,0,1,2)

Question: Is there a method of identifying the contiguous segments in a general NumPy view/array?

I need this functionality in order to transfer data from a NumPy array to a GPU without copying data into pinned memory. Instead, I'd like to pin contiguous segments using cudaHostRegister and then do the transfer to the GPU. I'm aware of CUDA's support for pitched memory via MemCpy3D, but I'd like to handle more general cases.
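
To make "identify the contiguous segments" concrete, here is a sketch (mine, building on the partial solution above) that yields a (byte_offset, byte_length) pair for each maximal C-contiguous run; each segment could then be pinned with cudaHostRegister and transferred individually:

import itertools
import numpy as np

def contiguous_segments(ary):
    """Yield (byte_offset, nbytes) for each maximal C-contiguous run.

    Sketch only: assumes positive strides and a non-overlapping view.
    """
    sd = ary.itemsize
    k = ary.ndim
    # Find how many trailing axes form one contiguous block.
    for i in reversed(range(ary.ndim)):
        if ary.shape[i] != 1 and ary.strides[i] != sd:
            break
        sd *= ary.shape[i]
        k = i
    seg_bytes = ary.itemsize * int(np.prod(ary.shape[k:], dtype=np.int64))
    # Every combination of the leading indices addresses one segment.
    for idx in itertools.product(*map(range, ary.shape[:k])):
        yield sum(i * s for i, s in zip(idx, ary.strides[:k])), seg_bytes

A = np.arange(500, dtype=np.int64).reshape(5, 100)
print(list(contiguous_segments(A[:, 20:70])))
# [(0, 400), (800, 400), (1600, 400), (2400, 400), (3200, 400)]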

Edit 1: Added a basic solution to the question, and asked about the transpose case.

Edit 2: Clarified removing the need to copy data into pinned memory.


Source: (StackOverflow)

python + pycuda (linux) error

I've installed python+pycuda (and other libraries) through this link: http://wiki.tiker.net/PyCuda/Installation/Linux

But when I run the test program, it says:

Traceback (most recent call last):
  File "test_driver.py", line 17, in <module>
    import pycuda.gpuarray as gpuarray
  File "/usr/local/lib/python2.7/dist-packages/pycuda-2014.1-py2.7-linux-x86_64.egg/pycuda/gpuarray.py", line 3, in <module>
    import pycuda.elementwise as elementwise
  File "/usr/local/lib/python2.7/dist-packages/pycuda-2014.1-py2.7-linux-x86_64.egg/pycuda/elementwise.py", line 34, in <module>
    from pytools import memoize_method
  File "/usr/local/lib/python2.7/dist-packages/pytools-2014.3.5-py2.7.egg/pytools/__init__.py", line 5, in <module>
    from six.moves import range, zip, intern, input
ImportError: cannot import name intern
six is installed. I don't know what I should do!

Source: (StackOverflow)

pip install pycuda on windows

I'm using VS2008, Win XP, and the latest CUDA toolkit. I run pip install pycuda on Windows and get the following log in C:\Documents and Settings\User\Application Data\pip\pip.log.

I get the error

LINK : fatal error LNK1181: cannot open input file 'cuda.lib'

error: command '"C:\Program Files\Microsoft Visual Studio 9.0\VC\BIN\link.exe"' failed with exit status 1181

I think I need to specify some path variable to the CUDA lib, but I don't understand which variable, or why it isn't set during installation of the CUDA toolkit.

UPDATE: I managed to resolve this issue by installing prebuilt pycuda from here, but maybe it will work slower because it wasn't compiled on my machine.


Source: (StackOverflow)

How can I tell PyCUDA which GPU to use?

I have two NVIDIA cards in my machine, and both are CUDA capable. When I run the example script to get started with PyCUDA seen here: http://documen.tician.de/pycuda/ I get the error

nvcc fatal   : Value 'sm_30' is not defined for option 'gpu-architecture'

My computing GPU is compute capability 3.0, so sm_30 should be the right option for the nvcc compiler. My graphics GPU is only CC 1.2, so I thought maybe that's the problem. I've installed the CUDA 5.0 release for Linux with no errors, along with all the compiler and Python components.

Is there a way to tell PyCUDA explicitly which GPU to use?
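
For the device-selection part, a sketch (standard PyCUDA driver calls; the surrounding scaffolding is mine): skip pycuda.autoinit and create the context on the card you want. As far as I can tell, pycuda.autoinit's default context also honours a CUDA_DEVICE environment variable:

import atexit
import pycuda.driver as cuda

cuda.init()                       # no pycuda.autoinit here

# Pick the device explicitly; device 0 is assumed to be the CC 3.0 card.
dev = cuda.Device(0)
print("using %s (CC %d.%d)" % ((dev.name(),) + dev.compute_capability()))

ctx = dev.make_context()
atexit.register(ctx.pop)          # tear the context down on exit

# ... SourceModule compilation and kernel launches go here ...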


Source: (StackOverflow)

Automatic CudaMat conversion in Python

I'm looking into speeding up my python code, which is all matrix math, using some form of CUDA. Currently my code is using Python and Numpy, so it seems like it shouldn't be too difficult to rewrite it using something like either PyCUDA or CudaMat.

However, on my first attempt using CudaMat, I realized I had to rearrange a lot of the equations in order to keep the operations all on the GPU. This included the creation of many temporary variables so I could store the results of the operations.

I understand why this is necessary, but it turns what were once easy-to-read equations into somewhat of a mess that is difficult to inspect for correctness. Additionally, I would like to be able to easily modify the equations later on, which isn't easy in their converted form.

The package Theano manages to do this by first creating a symbolic representation of the operations, then compiling them to CUDA. However, after trying Theano out for a bit, I was frustrated by how opaque everything was. For example, just getting the actual value of myvar.shape[0] is made difficult, since the tree doesn't get evaluated until much later. I would also much prefer less of a framework in which my code must conform to a library that acts invisibly in place of Numpy.

Thus, what I would really like is something much simpler. I don't want automatic differentiation (there are other packages like OpenOpt that can do that if I require it), or optimization of the tree, but just a conversion from standard Numpy notation to CudaMat/PyCUDA/somethingCUDA. In fact, I want to be able to have it evaluate to just Numpy without any CUDA code for testing.

I'm currently considering writing this myself, but before even considering such a venture, I wanted to see if anyone else knows of similar projects or a good starting place. The only other project I know of that might be close to this is SymPy, but I don't know how easy it would be to adapt to this purpose.

My current idea would be to create an array class that looks like a Numpy.array class. Its only function would be to build a tree. At any time, that symbolic array class could be converted to a Numpy array class and be evaluated (there would also be one-to-one parity). Alternatively, the array class could be traversed and have CudaMat commands be generated. If optimizations are required, they can be done at that stage (e.g. re-ordering of operations, creation of temporary variables, etc.) without getting in the way of inspecting what's going on.

Any thoughts/comments/etc. on this would be greatly appreciated!

Update

A usage case may look something like the following (where sym is the theoretical module), where we might be calculating a gradient:

W = sym.array(np.random.rand(numVisible, numHidden))
delta_o = -(x - z)
delta_h = sym.dot(delta_o, W)*h*(1.0-h)
grad_W = sym.dot(X.T, delta_h)

In this case, grad_W would actually just be a tree containing the operations that needed to be done. If you wanted to evaluate the expression normally (i.e. via Numpy) you could do:

npGrad_W = grad_W.asNumpy()

which would just execute the Numpy commands that the tree represents. If, on the other hand, you wanted to use CUDA, you would do:

cudaGrad_W = grad_W.asCUDA()

which would convert the tree into expressions that can be executed via CUDA (this could happen in a couple of different ways).

That way it should be trivial to: (1) test grad_W.asNumpy() == grad_W.asCUDA(), and (2) convert your pre-existing code to use CUDA.
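
For what it's worth, the deferred-evaluation core of this idea is small; here is a toy sketch (all names hypothetical) of a class that records operations into a tree and replays them through Numpy, i.e. the asNumpy() half of the proposal. An asCUDA() backend would walk the same tree and emit CudaMat/PyCUDA calls instead:

import numpy as np

class SymArray(object):
    """Toy deferred array: builds an expression tree, evaluates on demand."""
    def __init__(self, value=None, op=None, args=()):
        self.value, self.op, self.args = value, op, args

    def __add__(self, other): return SymArray(op=np.add, args=(self, other))
    def __sub__(self, other): return SymArray(op=np.subtract, args=(self, other))
    def __mul__(self, other): return SymArray(op=np.multiply, args=(self, other))

    def asNumpy(self):
        # Evaluate the tree bottom-up with plain Numpy.
        if self.op is None:
            return self.value
        return self.op(*[a.asNumpy() if isinstance(a, SymArray) else a
                         for a in self.args])

def dot(a, b):
    return SymArray(op=np.dot, args=(a, b))

# Usage, loosely mirroring the gradient example above:
x = SymArray(np.random.rand(4, 3))
W = SymArray(np.random.rand(3, 3))
expr = dot(x, W) * 2.0 - SymArray(np.ones((4, 3)))
print(expr.asNumpy().shape)  # (4, 3)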


Source: (StackOverflow)

Difference between memcpy_htod and to_gpu in Pycuda?

I am learning PyCUDA, and while going through the documentation on pycuda.gpuarray, I am puzzled by the difference between the pycuda.driver.memcpy_htod (also _dtoh) and pycuda.gpuarray.to_gpu (also get) functions. The gpuarray documentation describes .get() as follows:

    Transfer the contents of self into array or a newly allocated numpy.ndarray. If array is given, it must have the right size (not necessarily shape) and dtype. If it is not given, pagelocked specifies whether the new array is allocated page-locked.

Is this saying that .get() is implemented exactly the same way as pycuda.driver.memcpy_dtoh? Somehow, I think I am mis-interpreting it.
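
To put the two routes side by side, a sketch (mine): gpuarray.to_gpu and .get() are convenience wrappers that allocate and then perform the same kind of copy you would otherwise issue by hand with mem_alloc plus memcpy_htod/memcpy_dtoh:

import numpy
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

a = numpy.random.rand(1024).astype(numpy.float32)

# Route 1: explicit driver calls.
dev_ptr = drv.mem_alloc(a.nbytes)   # raw device allocation
drv.memcpy_htod(dev_ptr, a)         # host -> device
b = numpy.empty_like(a)
drv.memcpy_dtoh(b, dev_ptr)         # device -> host

# Route 2: the gpuarray convenience layer (allocate + copy in one step).
a_gpu = gpuarray.to_gpu(a)          # host -> device
c = a_gpu.get()                     # device -> host

assert (a == b).all() and (a == c).all()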


Source: (StackOverflow)

operator overloading in Cuda

I successfully created an operator+ between two float4s by doing:

__device__ float4 operator+(float4 a, float4 b) {
 // ...
}

However, if in addition I want to have an operator+ for uchar4, by doing the same thing with uchar4, I get the following error: error: more than one instance of overloaded function "operator+" has "C" linkage

I get a similar error message when I declare multiple functions with the same name but different arguments. So, two questions:

  • Polymorphism: Is it possible to have multiple functions with the same name and different arguments in CUDA? If so, why do I get this error message? (See the sketch after this list.)
  • operator+ for float4: it seems that this feature is already provided by "cutil_math.h", but when I include that (#include <cutil_math.h>) it complains that there is no such file or directory... anything particular I should do? Note: I am using PyCUDA, which is CUDA for Python.
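
On the first bullet: CUDA C++ itself does allow overloading, but PyCUDA's SourceModule wraps the whole source in extern "C" { } by default to avoid name mangling, and C linkage forbids two functions with the same name. A sketch of the usual workaround (my example code): pass no_extern_c=True and declare only the kernel itself extern "C" so that get_function() can still find it by name:

import numpy
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__device__ float4 operator+(float4 a, float4 b)
{ return make_float4(a.x+b.x, a.y+b.y, a.z+b.z, a.w+b.w); }

__device__ uchar4 operator+(uchar4 a, uchar4 b)
{ return make_uchar4(a.x+b.x, a.y+b.y, a.z+b.z, a.w+b.w); }

extern "C" __global__ void add4(float4 *a, float4 *b, float4 *out)
{ out[0] = a[0] + b[0]; }
""", no_extern_c=True)

add4 = mod.get_function("add4")
a = numpy.array([1, 2, 3, 4], dtype=numpy.float32)
b = numpy.array([10, 20, 30, 40], dtype=numpy.float32)
out = numpy.zeros(4, dtype=numpy.float32)
add4(drv.In(a), drv.In(b), drv.Out(out), block=(1, 1, 1), grid=(1, 1))
print(out)  # [ 11.  22.  33.  44.]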

Thanks!


Source: (StackOverflow)

cudaBindTextureToArray in PyCuda

Is there a way to bind an array that is already on the GPU to a texture using PyCUDA?

There is already a cuda.bind_array_to_texref(cuda.make_multichannel_2d_array(...), texref) that binds an array on the CPU to a texture, but I couldn't find the equivalent of cudaBindTextureToArray in PyCUDA if the array is already on the device. For example, doing:

myArray = [1, 2, 3]
myArray_d = gpu.to_gpu(myArray)  # then performs some computations on  it, and then
cuda.bind_texture_to_array(myArray_d, texref)
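
For what it's worth, a sketch (mine) of the closest thing I know of for linear device memory: a gpuarray can be attached to a texture reference with bind_to_texref_ext, with no round trip through the host, even though it is not a literal cudaBindTextureToArray:

import numpy
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
texture<float, 1, cudaReadModeElementType> tex;

__global__ void copy_from_tex(float *out)
{
    int i = threadIdx.x;
    out[i] = tex1Dfetch(tex, i);
}
""")
texref = mod.get_texref("tex")

myArray_d = gpuarray.to_gpu(numpy.array([1, 2, 3], dtype=numpy.float32))
myArray_d.bind_to_texref_ext(texref)  # device data -> texture, no host copy

out = numpy.zeros(3, dtype=numpy.float32)
mod.get_function("copy_from_tex")(drv.Out(out), block=(3, 1, 1),
                                  grid=(1, 1), texrefs=[texref])
print(out)  # [ 1.  2.  3.]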

Source: (StackOverflow)