Parakeet: a runtime compiler for numerical Python
I need to optimize this function because I am trying to make my OpenGL simulation run faster. I want to use Parakeet, but I can't work out how I would need to modify the code below to do so. Can you see what I should do?
def distanceMatrix(self,x,y,z):
    """Computes distances between all particles and places the result in a matrix such that the ij th matrix entry corresponds to the distance between particle i and j"""
    xtemp = tile(x,(self.N,1))
    dx = xtemp - xtemp.T
    ytemp = tile(y,(self.N,1))
    dy = ytemp - ytemp.T
    ztemp = tile(z,(self.N,1))
    dz = ztemp - ztemp.T
    # Particles 'feel' each other across the periodic boundaries
    if self.periodicX:
        dx[dx>self.L/2]=dx[dx > self.L/2]-self.L
        dx[dx<-self.L/2]=dx[dx < -self.L/2]+self.L
    if self.periodicY:
        dy[dy>self.L/2]=dy[dy>self.L/2]-self.L
        dy[dy<-self.L/2]=dy[dy<-self.L/2]+self.L
    if self.periodicZ:
        dz[dz>self.L/2]=dz[dz>self.L/2]-self.L
        dz[dz<-self.L/2]=dz[dz<-self.L/2]+self.L
    # Total Distances
    d = sqrt(dx**2+dy**2+dz**2)
    # Mark zero entries with negative 1 to avoid divergences
    d[d==0] = -1
    return d, dx, dy, dz
From what I can tell, Parakeet should be able to use the above function without modification, since it only uses NumPy and math. However, I always get the following error when calling the function through the Parakeet jit wrapper:
AssertionError: Unsupported function: <bound method Particles.distanceMatrix of <particles.Particles instance at 0x04CD8E90>>
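The error message names a bound method (`Particles.distanceMatrix`), so one thing worth trying, as an assumption rather than a confirmed fix, is to move the numerical work into a module-level function that takes only arrays and scalars, and to hand that function to the jit wrapper instead. The sketch below is plain NumPy; `wrap_periodic`, `distance_matrix` and their parameter names are hypothetical, and whether Parakeet can compile every NumPy call used here has not been verified.

```python
import numpy as np

def wrap_periodic(delta, L):
    """Map separations outside [-L/2, L/2] back across the periodic boundary."""
    delta = np.where(delta > L / 2, delta - L, delta)
    return np.where(delta < -L / 2, delta + L, delta)

def distance_matrix(x, y, z, L, periodic_x, periodic_y, periodic_z):
    """Free-function version of distanceMatrix: no `self`, only arrays and scalars."""
    # Same sign convention as tile(x, (N, 1)) - tile(x, (N, 1)).T:
    # entry [i, j] holds x[j] - x[i].
    dx = x[np.newaxis, :] - x[:, np.newaxis]
    dy = y[np.newaxis, :] - y[:, np.newaxis]
    dz = z[np.newaxis, :] - z[:, np.newaxis]
    # Particles 'feel' each other across the periodic boundaries
    if periodic_x:
        dx = wrap_periodic(dx, L)
    if periodic_y:
        dy = wrap_periodic(dy, L)
    if periodic_z:
        dz = wrap_periodic(dz, L)
    # Total distances, with zero entries marked as -1 to avoid divergences
    d = np.sqrt(dx**2 + dy**2 + dz**2)
    d = np.where(d == 0, -1.0, d)
    return d, dx, dy, dz
```

Inside the class, `distanceMatrix` could then simply call `distance_matrix(x, y, z, self.L, self.periodicX, self.periodicY, self.periodicZ)`, and the free function would be the one passed to the jit wrapper.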
Source: (StackOverflow)
I am comparing several Python modules/extensions or methods for achieving the following:
import numpy as np

def fdtd(input_grid, steps):
    grid = input_grid.copy()
    old_grid = np.zeros_like(input_grid)
    previous_grid = np.zeros_like(input_grid)
    l_x = grid.shape[0]
    l_y = grid.shape[1]
    for i in range(steps):
        np.copyto(previous_grid, old_grid)
        np.copyto(old_grid, grid)
        for x in range(l_x):
            for y in range(l_y):
                grid[x,y] = 0.0
                if 0 < x+1 < l_x:
                    grid[x,y] += old_grid[x+1,y]
                if 0 < x-1 < l_x:
                    grid[x,y] += old_grid[x-1,y]
                if 0 < y+1 < l_y:
                    grid[x,y] += old_grid[x,y+1]
                if 0 < y-1 < l_y:
                    grid[x,y] += old_grid[x,y-1]
                grid[x,y] /= 2.0
                grid[x,y] -= previous_grid[x,y]
    return grid
This function is a very basic implementation of the finite-difference time-domain (FDTD) method. I've implemented this function in several ways:
- with more NumPy routines
- in Cython
- using Numba (auto)jit.
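The NumPy variant itself is not shown in the post. As a rough sketch of what "more NumPy routines" could mean, the two inner loops can be replaced by shifted-slice additions. The name `fdtd_vectorized` is hypothetical. The slices deliberately mirror the strict `0 < x-1` and `0 < y-1` tests of the loop version, which skip the neighbour at index 0, so the results should match the reference rather than correct it.

```python
import numpy as np

def fdtd_vectorized(input_grid, steps):
    """Slice-based equivalent of the nested-loop fdtd (float grids assumed)."""
    grid = input_grid.copy()
    old_grid = np.zeros_like(input_grid)
    previous_grid = np.zeros_like(input_grid)
    for _ in range(steps):
        previous_grid[...] = old_grid
        old_grid[...] = grid
        grid[...] = 0.0
        grid[:-1, :] += old_grid[1:, :]    # x+1 neighbour, valid for x <= l_x-2
        grid[2:, :] += old_grid[1:-1, :]   # x-1 neighbour, only where x-1 > 0
        grid[:, :-1] += old_grid[:, 1:]    # y+1 neighbour, valid for y <= l_y-2
        grid[:, 2:] += old_grid[:, 1:-1]   # y-1 neighbour, only where y-1 > 0
        grid /= 2.0
        grid -= previous_grid
    return grid
```

Each update reads only `old_grid` and `previous_grid`, which is why the whole array can be updated at once without changing the result.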
Now I would like to compare the performance with NumbaPro CUDA.
This is the first time I am writing code for CUDA and I came up with the code below.
from numbapro import cuda, float32, int16
import numpy as np

@cuda.jit(argtypes=(float32[:,:], float32[:,:], float32[:,:], int16, int16, int16))
def kernel(grid, old_grid, previous_grid, steps, l_x, l_y):
    x,y = cuda.grid(2)
    for i in range(steps):
        previous_grid[x,y] = old_grid[x,y]
        old_grid[x,y] = grid[x,y]
    for i in range(steps):
        grid[x,y] = 0.0
        if 0 < x+1 and x+1 < l_x:
            grid[x,y] += old_grid[x+1,y]
        if 0 < x-1 and x-1 < l_x:
            grid[x,y] += old_grid[x-1,y]
        if 0 < y+1 and y+1 < l_x:
            grid[x,y] += old_grid[x,y+1]
        if 0 < y-1 and y-1 < l_x:
            grid[x,y] += old_grid[x,y-1]
        grid[x,y] /= 2.0
        grid[x,y] -= previous_grid[x,y]

def fdtd(input_grid, steps):
    grid = cuda.to_device(input_grid)
    old_grid = cuda.to_device(np.zeros_like(input_grid))
    previous_grid = cuda.to_device(np.zeros_like(input_grid))
    l_x = input_grid.shape[0]
    l_y = input_grid.shape[1]
    kernel[(16,16),(32,8)](grid, old_grid, previous_grid, steps, l_x, l_y)
    return grid.copy_to_host()
Unfortunately I get the following error:
  File ".../fdtd_numbapro.py", line 98, in fdtd
    return grid.copy_to_host()
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/devicearray.py", line 142, in copy_to_host
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/driver.py", line 1702, in device_to_host
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/driver.py", line 772, in check_error
numbapro.cudadrv.error.CudaDriverError: CUDA_ERROR_LAUNCH_FAILED
Failed to copy memory D->H
I've also tried grid.to_host(), and that didn't work either.
CUDA definitely works with NumbaPro on this system.
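The post does not give the size of `input_grid`, so the following is a guess at the cause rather than a diagnosis. Three things in the kernel look suspicious. The launch `[(16,16),(32,8)]` starts 512 by 128 threads and nothing stops a thread whose `(x, y)` lies outside the array from indexing it. The `y` tests compare against `l_x` instead of `l_y`. The time loop also runs inside the kernel, so threads can read neighbours that other threads are rewriting. The sketch below emulates on the CPU, in plain NumPy, a kernel with a bounds guard, one launch per time step, and a block count derived from the array shape. The names `blocks_to_cover`, `step_thread` and `fdtd_emulated` are hypothetical, and nothing here has been run on a GPU.

```python
import numpy as np

def blocks_to_cover(shape, threads_per_block):
    """Ceiling division per axis: fewest blocks whose threads cover `shape`."""
    return tuple((n + t - 1) // t for n, t in zip(shape, threads_per_block))

def step_thread(x, y, grid, old_grid, previous_grid):
    """Work one GPU thread would do for a single time step (CPU stand-in)."""
    l_x, l_y = grid.shape
    # Guard: threads launched past the array edge must do nothing.
    if x >= l_x or y >= l_y:
        return
    acc = 0.0
    if 0 < x + 1 < l_x:
        acc += old_grid[x + 1, y]
    if 0 < x - 1 < l_x:
        acc += old_grid[x - 1, y]
    if 0 < y + 1 < l_y:          # y is compared against l_y, not l_x
        acc += old_grid[x, y + 1]
    if 0 < y - 1 < l_y:
        acc += old_grid[x, y - 1]
    # Each thread writes only its own cell and reads only old/previous grids.
    grid[x, y] = acc / 2.0 - previous_grid[x, y]

def fdtd_emulated(input_grid, steps, threads=(4, 4)):
    grid = input_grid.copy()
    old_grid = np.zeros_like(input_grid)
    previous_grid = np.zeros_like(input_grid)
    blocks = blocks_to_cover(input_grid.shape, threads)
    for _ in range(steps):       # host-side time loop: one "launch" per step
        previous_grid[...] = old_grid
        old_grid[...] = grid
        for x in range(blocks[0] * threads[0]):
            for y in range(blocks[1] * threads[1]):
                step_thread(x, y, grid, old_grid, previous_grid)
    return grid
```

In a real kernel the body of `step_thread` would follow `x, y = cuda.grid(2)`, and the two grid copies per step would need to happen on the device between launches; how best to do that with NumbaPro is left open here.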
Source: (StackOverflow)