EzDevInfo.com

parakeet

Parakeet: a runtime compiler for numerical Python

Optimizing Python function with Parakeet

I need to optimize this function, as I am trying to make my OpenGL simulation run faster. I want to use Parakeet, but I can't quite work out how I would need to modify the code below to do so. Can you see what I should do?

def distanceMatrix(self,x,y,z):
    """Computes distances between all particles and places the result in a
    matrix such that the (i, j)-th entry is the distance between particles i and j"""
    xtemp = tile(x,(self.N,1))
    dx = xtemp - xtemp.T
    ytemp = tile(y,(self.N,1))
    dy = ytemp - ytemp.T
    ztemp = tile(z,(self.N,1))
    dz = ztemp - ztemp.T

    # Particles 'feel' each other across the periodic boundaries
    if self.periodicX:
        dx[dx>self.L/2]=dx[dx>self.L/2]-self.L
        dx[dx<-self.L/2]=dx[dx<-self.L/2]+self.L
    if self.periodicY:
        dy[dy>self.L/2]=dy[dy>self.L/2]-self.L
        dy[dy<-self.L/2]=dy[dy<-self.L/2]+self.L
    if self.periodicZ:
        dz[dz>self.L/2]=dz[dz>self.L/2]-self.L
        dz[dz<-self.L/2]=dz[dz<-self.L/2]+self.L

    # Total Distances
    d = sqrt(dx**2+dy**2+dz**2)

    # Mark zero entries with negative 1 to avoid divergences
    d[d==0] = -1

    return d, dx, dy, dz

From what I can tell, Parakeet should be able to handle the above function without modifications, since it only uses NumPy and math. But I always get the following error when calling the function through the Parakeet jit wrapper:

AssertionError: Unsupported function: <bound method Particles.distanceMatrix of <particles.Particles instance at 0x04CD8E90>>
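The error message suggests Parakeet was handed a bound method rather than a plain function. As a hedged sketch (this reorganization is my assumption about what Parakeet's jit wrapper expects, not code from the question), the same computation can be written as a module-level function that takes N, L, and the periodicity flags explicitly:

```python
import numpy as np

# Hypothetical reorganization: a free function with no `self`, which is the
# shape of code Parakeet's jit wrapper supported. Once parakeet is installed,
# one would decorate this with @parakeet.jit.
def distance_matrix(x, y, z, N, L, periodic_x, periodic_y, periodic_z):
    xtemp = np.tile(x, (N, 1))
    dx = xtemp - xtemp.T
    ytemp = np.tile(y, (N, 1))
    dy = ytemp - ytemp.T
    ztemp = np.tile(z, (N, 1))
    dz = ztemp - ztemp.T

    # Particles 'feel' each other across the periodic boundaries
    if periodic_x:
        dx[dx > L/2] -= L
        dx[dx < -L/2] += L
    if periodic_y:
        dy[dy > L/2] -= L
        dy[dy < -L/2] += L
    if periodic_z:
        dz[dz > L/2] -= L
        dz[dz < -L/2] += L

    # Total distances; mark zero entries with -1 to avoid divergences
    d = np.sqrt(dx**2 + dy**2 + dz**2)
    d[d == 0] = -1
    return d, dx, dy, dz
```

The method on the class could then delegate to this free function, passing its attributes as arguments.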

Source: (StackOverflow)

Simplify statement '.'.join( string.split('.')[0:3] )

I am used to coding in C/C++, and when I see the following string operation I feel some CPU cycles being wasted:

version = '1.2.3.4.5-RC4'                 # the end can vary a lot
api = '.'.join( version.split('.')[0:3] ) # extract '1.2.3'

Therefore I wonder:

  • Will this line be executed (interpreted) as the creation of a temporary list (memory allocation), followed by concatenation of its first three items (another allocation)?
    Or is the Python interpreter smart enough to avoid this?
    (I am also curious about optimizations made in this context by Pythran, Parakeet, Numba, Cython, and other Python interpreters/compilers...)

  • Is there a trick to write a more CPU-efficient replacement line that is still understandable/elegant?
    (Specific Python 2 and/or Python 3 tips are welcome.)
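One small trick worth mentioning (a sketch of a micro-optimization, not a claim about what CPython does automatically): `str.split` accepts a `maxsplit` argument, so the splitting can stop after the third dot and the tail is never fragmented into separate list items:

```python
version = '1.2.3.4.5-RC4'

# Stop splitting after the third dot: the remainder stays a single string,
# so the temporary list holds at most 4 items regardless of input length.
api = '.'.join(version.split('.', 3)[:3])  # '1.2.3'
```

As far as I know, CPython still allocates the temporary list and the result string either way; this mostly trims the constant factor while staying readable.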


Source: (StackOverflow)


Unexpected behavior for Python parakeet results in timing benchmarks

I recently reran some benchmarks and got quite unexpected behavior from parakeet. I reran it three times and the results are always the same. What troubles me is the sudden drop for parakeet at sample size 10**3 in the following plot.

My first thought is that this is due to the item ordering in the list at 10**3 (I will rerun it overnight with a different random seed), but another thought is that this is where the multiprocessing kicks in. Any thoughts?

PS: this drop didn't occur when I ran it on a 2-core processor machine

[benchmark timing plot]

I don't want to cram all the code into this post, so I will post the links to the IPython notebooks if you don't mind


Source: (StackOverflow)

Converting function to NumbaPro CUDA

I am comparing several Python modules/extensions or methods for achieving the following:

import numpy as np

def fdtd(input_grid, steps):
    grid = input_grid.copy()
    old_grid = np.zeros_like(input_grid)
    previous_grid = np.zeros_like(input_grid)

    l_x = grid.shape[0]
    l_y = grid.shape[1]

    for i in range(steps):
        np.copyto(previous_grid, old_grid)
        np.copyto(old_grid, grid)

        for x in range(l_x):
            for y in range(l_y):
                grid[x,y] = 0.0
                if 0 < x+1 < l_x:
                    grid[x,y] += old_grid[x+1,y]
                if 0 < x-1 < l_x:
                    grid[x,y] += old_grid[x-1,y]
                if 0 < y+1 < l_y:
                    grid[x,y] += old_grid[x,y+1]
                if 0 < y-1 < l_y:
                    grid[x,y] += old_grid[x,y-1]

                grid[x,y] /= 2.0
                grid[x,y] -= previous_grid[x,y]

    return grid

This function is a very basic implementation of the Finite-Difference Time-Domain (FDTD) method. I've implemented it in several ways:

  • with more NumPy routines
  • in Cython
  • using Numba (auto)jit.
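
For reference, the "more NumPy routines" variant can be sketched roughly like this (my own vectorization of the loop above, preserving its exact boundary conditions, including the asymmetric `0 < x-1` test):

```python
import numpy as np

def fdtd_vectorized(input_grid, steps):
    """Same update as the loop version, with the per-cell ifs expressed
    as shifted-slice additions."""
    grid = input_grid.copy()
    old = np.zeros_like(grid)
    previous = np.zeros_like(grid)

    for _ in range(steps):
        previous, old = old, grid

        acc = np.zeros_like(grid)
        acc[:-1, :] += old[1:, :]    # old[x+1, y] where x+1 < l_x
        acc[2:, :]  += old[1:-1, :]  # old[x-1, y] where 0 < x-1
        acc[:, :-1] += old[:, 1:]    # old[x, y+1] where y+1 < l_y
        acc[:, 2:]  += old[:, 1:-1]  # old[x, y-1] where 0 < y-1

        grid = acc / 2.0 - previous

    return grid
```

Rebinding `previous, old = old, grid` stands in for the `np.copyto` calls; it is safe here because each iteration allocates a fresh `acc` rather than mutating the old arrays.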

Now I would like to compare the performance with NumbaPro CUDA.

This is the first time I have written CUDA code, and the code below is what I came up with.

from numbapro import cuda, float32, int16
import numpy as np

@cuda.jit(argtypes=(float32[:,:], float32[:,:], float32[:,:], int16, int16, int16))
def kernel(grid, old_grid, previous_grid, steps, l_x, l_y):

    x,y = cuda.grid(2)

    for i in range(steps):
        previous_grid[x,y] = old_grid[x,y]
        old_grid[x,y] = grid[x,y]  

    for i in range(steps):

        grid[x,y] = 0.0

        if 0 < x+1 and x+1 < l_x:
            grid[x,y] += old_grid[x+1,y]
        if 0 < x-1 and x-1 < l_x:
            grid[x,y] += old_grid[x-1,y]
        if 0 < y+1 and y+1 < l_y:
            grid[x,y] += old_grid[x,y+1]
        if 0 < y-1 and y-1 < l_y:
            grid[x,y] += old_grid[x,y-1]

        grid[x,y] /= 2.0
        grid[x,y] -= previous_grid[x,y]


def fdtd(input_grid, steps):

    grid = cuda.to_device(input_grid)
    old_grid = cuda.to_device(np.zeros_like(input_grid))
    previous_grid = cuda.to_device(np.zeros_like(input_grid))

    l_x = input_grid.shape[0]
    l_y = input_grid.shape[1]

    kernel[(16,16),(32,8)](grid, old_grid, previous_grid, steps, l_x, l_y)

    return grid.copy_to_host()

Unfortunately I get the following error:

  File ".../fdtd_numbapro.py", line 98, in fdtd
    return grid.copy_to_host()
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/devicearray.py", line 142, in copy_to_host
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/driver.py", line 1702, in device_to_host
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/driver.py", line 772, in check_error
numbapro.cudadrv.error.CudaDriverError: CUDA_ERROR_LAUNCH_FAILED
Failed to copy memory D->H

I've used grid.to_host() as well, and that didn't work either. CUDA is definitely working with NumbaPro on this system.
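One arithmetic detail worth checking (an assumption on my part, not a confirmed diagnosis): the hard-coded launch configuration fixes how many threads index into the arrays, and the kernel writes `grid[x,y] = 0.0` without first guarding `x < l_x and y < l_y`, so any thread beyond the array edge performs an out-of-bounds device write, which typically surfaces as `CUDA_ERROR_LAUNCH_FAILED` on the next copy:

```python
# Launch configuration from the host code: kernel[(16,16),(32,8)]
blocks_per_grid = (16, 16)
threads_per_block = (32, 8)

# cuda.grid(2) then yields x in [0, 16*32) and y in [0, 16*8)
coverage = (blocks_per_grid[0] * threads_per_block[0],
            blocks_per_grid[1] * threads_per_block[1])
print(coverage)  # (512, 128)
```

Unless `input_grid` is exactly 512 x 128, some threads fall outside the array; the usual fix is an early `if x >= l_x or y >= l_y: return`-style guard at the top of the kernel.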


Source: (StackOverflow)