EzDevInfo.com

parakeet

Parakeet: a runtime compiler for numerical Python

Optimizing Python function with Parakeet

I need to optimize this function, as I am trying to make my OpenGL simulation run faster. I want to use Parakeet, but I can't quite work out how I would need to modify the code below to do so. Can you see what I should do?

def distanceMatrix(self,x,y,z):
    """Computes distances between all particles and places the result in a
    matrix such that the (i, j)-th entry is the distance between particles i and j"""
    xtemp = tile(x,(self.N,1))
    dx = xtemp - xtemp.T
    ytemp = tile(y,(self.N,1))
    dy = ytemp - ytemp.T
    ztemp = tile(z,(self.N,1))
    dz = ztemp - ztemp.T

    # Particles 'feel' each other across the periodic boundaries
    if self.periodicX:
        dx[dx>self.L/2]=dx[dx>self.L/2]-self.L
        dx[dx<-self.L/2]=dx[dx<-self.L/2]+self.L
    if self.periodicY:
        dy[dy>self.L/2]=dy[dy>self.L/2]-self.L
        dy[dy<-self.L/2]=dy[dy<-self.L/2]+self.L
    if self.periodicZ:
        dz[dz>self.L/2]=dz[dz>self.L/2]-self.L
        dz[dz<-self.L/2]=dz[dz<-self.L/2]+self.L

    # Total Distances
    d = sqrt(dx**2+dy**2+dz**2)

    # Mark zero entries with negative 1 to avoid divergences
    d[d==0] = -1

    return d, dx, dy, dz

From what I can tell, Parakeet should be able to handle the above function without modifications, since it only uses NumPy and math. But I always get the following error when calling the function through the Parakeet jit wrapper:

AssertionError: Unsupported function: <bound method Particles.distanceMatrix of <particles.Particles instance at 0x04CD8E90>>
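The error message suggests Parakeet was handed a bound method rather than a plain function. As a hedged sketch (this reorganization is my assumption about what Parakeet's jit wrapper expects, not code from the question), the same computation can be written as a module-level function that takes N, L, and the periodicity flags explicitly:

```python
import numpy as np

# Hypothetical reorganization: a free function with no `self`, which is the
# shape of code Parakeet's jit wrapper supported. Once parakeet is installed,
# one would decorate this with @parakeet.jit.
def distance_matrix(x, y, z, N, L, periodic_x, periodic_y, periodic_z):
    xtemp = np.tile(x, (N, 1))
    dx = xtemp - xtemp.T
    ytemp = np.tile(y, (N, 1))
    dy = ytemp - ytemp.T
    ztemp = np.tile(z, (N, 1))
    dz = ztemp - ztemp.T

    # Particles 'feel' each other across the periodic boundaries
    if periodic_x:
        dx[dx > L/2] -= L
        dx[dx < -L/2] += L
    if periodic_y:
        dy[dy > L/2] -= L
        dy[dy < -L/2] += L
    if periodic_z:
        dz[dz > L/2] -= L
        dz[dz < -L/2] += L

    # Total distances; mark zero entries with -1 to avoid divergences
    d = np.sqrt(dx**2 + dy**2 + dz**2)
    d[d == 0] = -1
    return d, dx, dy, dz
```

The method on the class could then delegate to this free function, passing its attributes as arguments.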

Source: (StackOverflow)

Simplify statement '.'.join( string.split('.')[0:3] )

I am used to coding in C/C++, and when I see the following string operation I feel some CPU cycles being wasted:

version = '1.2.3.4.5-RC4'                 # the end can vary a lot
api = '.'.join( version.split('.')[0:3] ) # extract '1.2.3'

Therefore I wonder:

  • Will this line be executed (interpreted) as the creation of a temporary list (memory allocation), followed by concatenation of its first three items (another allocation)?
    Or is the Python interpreter smart enough to avoid this?
    (I am also curious about optimizations made in this context by Pythran, Parakeet, Numba, Cython, and other Python interpreters/compilers...)

  • Is there a trick to write a more CPU-efficient replacement line that is still understandable/elegant?
    (Specific Python 2 and/or Python 3 tips are welcome.)
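One small trick worth mentioning (a sketch of a micro-optimization, not a claim about what CPython does automatically): `str.split` accepts a `maxsplit` argument, so the splitting can stop after the third dot and the tail is never fragmented into separate list items:

```python
version = '1.2.3.4.5-RC4'

# Stop splitting after the third dot: the remainder stays a single string,
# so the temporary list holds at most 4 items regardless of input length.
api = '.'.join(version.split('.', 3)[:3])  # '1.2.3'
```

As far as I know, CPython still allocates the temporary list and the result string either way; this mostly trims the constant factor while staying readable.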


Source: (StackOverflow)


Unexpected behavior for Python parakeet results in timing benchmarks

I recently reran some benchmarks and got quite unexpected behavior from parakeet. I reran it three times and the results are always the same. What troubles me is the sudden drop for parakeet at sample size 10**3 in the following plot.

My first thought is that this is due to the item ordering in the list at 10**3 (I will rerun it overnight with a different random seed), but another thought is that this is where the multiprocessing kicks in. Any thoughts?

PS: this drop didn't occur when I ran it on a 2-core processor machine

[benchmark timing plot]

I don't want to cram all the code into this post, so I will post the links to the IPython notebooks if you don't mind


Source: (StackOverflow)

Converting function to NumbaPro CUDA

I am comparing several Python modules/extensions or methods for achieving the following:

import numpy as np

def fdtd(input_grid, steps):
    grid = input_grid.copy()
    old_grid = np.zeros_like(input_grid)
    previous_grid = np.zeros_like(input_grid)

    l_x = grid.shape[0]
    l_y = grid.shape[1]

    for i in range(steps):
        np.copyto(previous_grid, old_grid)
        np.copyto(old_grid, grid)

        for x in range(l_x):
            for y in range(l_y):
                grid[x,y] = 0.0
                if 0 < x+1 < l_x:
                    grid[x,y] += old_grid[x+1,y]
                if 0 < x-1 < l_x:
                    grid[x,y] += old_grid[x-1,y]
                if 0 < y+1 < l_y:
                    grid[x,y] += old_grid[x,y+1]
                if 0 < y-1 < l_y:
                    grid[x,y] += old_grid[x,y-1]

                grid[x,y] /= 2.0
                grid[x,y] -= previous_grid[x,y]

    return grid

This function is a very basic implementation of the Finite-Difference Time-Domain (FDTD) method. I've implemented it in several ways:

  • with more NumPy routines
  • in Cython
  • using Numba (auto)jit.
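
For reference, the "more NumPy routines" variant can be sketched roughly like this (my own vectorization of the loop above, preserving its exact boundary conditions, including the asymmetric `0 < x-1` test):

```python
import numpy as np

def fdtd_vectorized(input_grid, steps):
    """Same update as the loop version, with the per-cell ifs expressed
    as shifted-slice additions."""
    grid = input_grid.copy()
    old = np.zeros_like(grid)
    previous = np.zeros_like(grid)

    for _ in range(steps):
        previous, old = old, grid

        acc = np.zeros_like(grid)
        acc[:-1, :] += old[1:, :]    # old[x+1, y] where x+1 < l_x
        acc[2:, :]  += old[1:-1, :]  # old[x-1, y] where 0 < x-1
        acc[:, :-1] += old[:, 1:]    # old[x, y+1] where y+1 < l_y
        acc[:, 2:]  += old[:, 1:-1]  # old[x, y-1] where 0 < y-1

        grid = acc / 2.0 - previous

    return grid
```

Rebinding `previous, old = old, grid` stands in for the `np.copyto` calls; it is safe here because each iteration allocates a fresh `acc` rather than mutating the old arrays.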

Now I would like to compare the performance with NumbaPro CUDA.

This is the first time I have written CUDA code, and the code below is what I came up with.

from numbapro import cuda, float32, int16
import numpy as np

@cuda.jit(argtypes=(float32[:,:], float32[:,:], float32[:,:], int16, int16, int16))
def kernel(grid, old_grid, previous_grid, steps, l_x, l_y):

    x,y = cuda.grid(2)

    for i in range(steps):
        previous_grid[x,y] = old_grid[x,y]
        old_grid[x,y] = grid[x,y]  

    for i in range(steps):

        grid[x,y] = 0.0

        if 0 < x+1 and x+1 < l_x:
            grid[x,y] += old_grid[x+1,y]
        if 0 < x-1 and x-1 < l_x:
            grid[x,y] += old_grid[x-1,y]
        if 0 < y+1 and y+1 < l_y:
            grid[x,y] += old_grid[x,y+1]
        if 0 < y-1 and y-1 < l_y:
            grid[x,y] += old_grid[x,y-1]

        grid[x,y] /= 2.0
        grid[x,y] -= previous_grid[x,y]


def fdtd(input_grid, steps):

    grid = cuda.to_device(input_grid)
    old_grid = cuda.to_device(np.zeros_like(input_grid))
    previous_grid = cuda.to_device(np.zeros_like(input_grid))

    l_x = input_grid.shape[0]
    l_y = input_grid.shape[1]

    kernel[(16,16),(32,8)](grid, old_grid, previous_grid, steps, l_x, l_y)

    return grid.copy_to_host()

Unfortunately I get the following error:

  File ".../fdtd_numbapro.py", line 98, in fdtd
    return grid.copy_to_host()
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/devicearray.py", line 142, in copy_to_host
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/driver.py", line 1702, in device_to_host
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/driver.py", line 772, in check_error
numbapro.cudadrv.error.CudaDriverError: CUDA_ERROR_LAUNCH_FAILED
Failed to copy memory D->H

I've used grid.to_host() as well, and that didn't work either. CUDA is definitely working with NumbaPro on this system.
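One arithmetic detail worth checking (an assumption on my part, not a confirmed diagnosis): the hard-coded launch configuration fixes how many threads index into the arrays, and the kernel writes `grid[x,y] = 0.0` without first guarding `x < l_x and y < l_y`, so any thread beyond the array edge performs an out-of-bounds device write, which typically surfaces as `CUDA_ERROR_LAUNCH_FAILED` on the next copy:

```python
# Launch configuration from the host code: kernel[(16,16),(32,8)]
blocks_per_grid = (16, 16)
threads_per_block = (32, 8)

# cuda.grid(2) then yields x in [0, 16*32) and y in [0, 16*8)
coverage = (blocks_per_grid[0] * threads_per_block[0],
            blocks_per_grid[1] * threads_per_block[1])
print(coverage)  # (512, 128)
```

Unless `input_grid` is exactly 512 x 128, some threads fall outside the array; the usual fix is an early `if x >= l_x or y >= l_y: return`-style guard at the top of the kernel.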


Source: (StackOverflow)