
numba

NumPy-aware dynamic Python compiler using LLVM

How to speed up multiple inner products in python

I have some simple code that does the following.

It iterates over all possible length-n lists F with ±1 entries. For each one, it iterates over all possible length-2n lists S with ±1 entries whose first half is simply a copy of the second half. The code computes the inner product of F with each sublist of S of length n. For each F, S it counts the inner products that are zero until the first non-zero inner product.

Here is the code.

#!/usr/bin/python

from __future__ import division
import itertools
import operator
import math

n=14
m=n+1
def innerproduct(A, B):
    assert (len(A) == len(B))
    s = 0 
    for k in xrange(0,n):
        s+=A[k]*B[k]
    return s

leadingzerocounts = [0]*m
for S in itertools.product([-1,1], repeat = n):
    S1 = S + S
    for F in itertools.product([-1,1], repeat = n):
        i = 0
        while (i<m):
            ip = innerproduct(F, S1[i:i+n])
            if (ip == 0):
                leadingzerocounts[i] +=1
                i+=1
            else:
                break

print leadingzerocounts

The correct output for n=14 is

[56229888, 23557248, 9903104, 4160640, 1758240, 755392, 344800, 172320, 101312, 75776, 65696, 61216, 59200, 59200, 59200]

Using pypy this takes 1 min 18 seconds for n = 14. Unfortunately I would really like to run it for n = 16, 18, 20, 22, 24, 26. I don't mind using numba or cython, but I would like to stay close to python if at all possible.

Any help speeding this up is very much appreciated.
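
For reference, here is a minimal numpy sketch (not part of the original question, and the function name leading_zero_counts is made up) that vectorizes over all F at once for each fixed S. It is still exponential in n, so it only reduces the constant factor:

import numpy as np
from itertools import product

def leading_zero_counts(n):
    m = n + 1
    Fs = np.array(list(product((-1, 1), repeat=n)), dtype=np.int32)   # all F, shape (2**n, n)
    counts = np.zeros(m, dtype=np.int64)
    for S in product((-1, 1), repeat=n):
        S1 = np.array(S + S, dtype=np.int32)
        windows = np.array([S1[i:i + n] for i in range(m)])           # the m sublists of length n
        ips = Fs.dot(windows.T)                                       # all inner products, shape (2**n, m)
        zeros = (ips == 0)
        # column i counts only while every column 0..i in that row is zero
        counts += np.cumprod(zeros, axis=1).sum(axis=0)
    return counts

# leading_zero_counts(14) should reproduce the list above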


I'll keep a record of the fastest solutions here. (Please let me know if I miss an updated answer.)

  • n = 22 at 9m35.081s by Eisenstat (C)
  • n = 18 at 1m16.344s by Eisenstat (pypy)
  • n = 18 at 2m54.998s by Tupteq (pypy)
  • n = 14 at 26s by Neil (numpy)
  • n = 14 at 11m59.192s by kslote1 (pypy)

Source: (StackOverflow)

Getting python Numba working on Ubuntu 14.10 or Fedora 21 with python 2.7

Recently, I have had a frustrating time getting Numba working on Ubuntu or Fedora Linux. The main problem has been compiling llvmlite. What do I need to install for it to compile properly?


Source: (StackOverflow)


Optimize Double loop in python

I am trying to optimize the following loop:

def numpy(nx, nz, c, rho):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):
            a[ix, iz]  = sum(c*rho[ix-1:ix+3, iz])
            b[ix, iz]  = sum(c*rho[ix-2:ix+2, iz])
    return a, b

I tried different solutions and found that using numba to calculate the sum of the products leads to better performance:

import numpy as np
import numba as nb
import time

@nb.autojit
def sum_opt(arr1, arr2):
    s = arr1[0]*arr2[0]
    for i in range(1, len(arr1)):
        s+=arr1[i]*arr2[i]
    return s

def numba1(nx, nz, c, rho):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):        
            a[ix, iz]  = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz]  = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b

@nb.autojit
def numba2(nx, nz, c, rho):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):        
            a[ix, iz]  = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz]  = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b
nx = 1024
nz = 256    

rho = np.random.rand(nx, nz)
c = np.random.rand(4)
a = np.zeros((nx, nz))
b = np.zeros((nx, nz))

ti = time.clock()
a, b = numpy(nx, nz, c, rho)
print 'Time numpy  : ' + `round(time.clock() - ti, 4)`

ti = time.clock()
a, b = numba1(nx, nz, c, rho)
print 'Time numba1 : ' + `round(time.clock() - ti, 4)`

ti = time.clock()
a, b = numba2(nx, nz, c, rho)
print 'Time numba2 : ' + `round(time.clock() - ti, 4)`

This leads to

Time numpy : 4.1595

Time numba1 : 0.6993

Time numba2 : 1.0135

Using the numba version of the sum function (sum_opt) performs very well. But I am wondering why the numba version of the double loop function (numba2) leads to slower execution times. I tried to use jit instead of autojit, specifying the argument types, but it was worse.

I also noticed that making the smaller loop the outer one is slower than making the bigger loop the outer one. Is there any explanation?

In any case, I am sure this double-loop function can be improved a lot by vectorizing the problem (like this) or by using another method (map?), but I am a little confused about these methods.

In the other parts of my code, I used numba and numpy slicing to replace all explicit loops, but in this particular case, I don't know how to set it up.

Any ideas?
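
For what it's worth, one possible vectorization of the double loop treats each of the four kernel taps as a shifted slice of rho. This is only a sketch (the function name shifted_slices is made up, and it allocates a and b locally instead of using the globals from the question):

import numpy as np

def shifted_slices(nx, nz, c, rho):
    a = np.zeros((nx, nz))
    b = np.zeros((nx, nz))
    for k in range(4):
        # a[ix, iz] = sum_k c[k] * rho[ix-1+k, iz], vectorized over ix and iz
        a[2:nx-3, 2:nz-3] += c[k] * rho[1+k:nx-4+k, 2:nz-3]
        # b[ix, iz] = sum_k c[k] * rho[ix-2+k, iz]
        b[2:nx-3, 2:nz-3] += c[k] * rho[k:nx-5+k, 2:nz-3]
    return a, b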

EDIT

Thanks for all your comments. I worked a little on this problem:

import numba as nb
import numpy as np
from scipy import signal
import time


@nb.jit(['float64(float64[:], float64[:])'], nopython=True)
def sum_opt(arr1, arr2):
    s = arr1[0]*arr2[0]
    for i in xrange(1, len(arr1)):
        s+=arr1[i]*arr2[i]
    return s

@nb.autojit
def numba1(nx, nz, c, rho, a, b):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):        
            a[ix, iz]  = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz]  = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b


@nb.jit(nopython=True)
def numba2(nx, nz, c, rho, a, b):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):        
            a[ix, iz]  = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz]  = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b

@nb.jit(['float64[:,:](int16, int16, float64[:], float64[:,:], float64[:,:])'], nopython=True)
def numba3a(nx, nz, c, rho, a):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):        
            a[ix, iz]  = sum_opt(c, rho[ix-1:ix+3, iz])
    return a

@nb.jit(['float64[:,:](int16, int16, float64[:], float64[:,:], float64[:,:])'], nopython=True)
def numba3b(nx, nz, c, rho, b):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):        
            b[ix, iz]  = sum_opt(c, rho[ix-2:ix+2, iz])
    return b

def convol(nx, nz, c, aa, bb):
    s1 = rho[1:nx-1,2:nz-3]
    s2 = rho[0:nx-2,2:nz-3]
    kernel = c[:,None][::-1]
    aa[2:nx-3,2:nz-3] = signal.convolve2d(s1, kernel, boundary='symm', mode='valid')
    bb[2:nx-3,2:nz-3] = signal.convolve2d(s2, kernel, boundary='symm', mode='valid')
    return aa, bb


nx = 1024
nz = 256 
rho = np.random.rand(nx, nz)
c = np.random.rand(4)
a = np.zeros((nx, nz))
b = np.zeros((nx, nz))

ti = time.clock()
for i in range(1000):
    a, b = numba1(nx, nz, c, rho, a, b)
print 'Time numba1 : ' + `round(time.clock() - ti, 4)`

ti = time.clock()
for i in range(1000):
    a, b = numba2(nx, nz, c, rho, a, b)
print 'Time numba2 : ' + `round(time.clock() - ti, 4)`

ti = time.clock()
for i in range(1000):
    a = numba3a(nx, nz, c, rho, a)
    b = numba3b(nx, nz, c, rho, b)
print 'Time numba3 : ' + `round(time.clock() - ti, 4)`

ti = time.clock()
for i in range(1000):
    a, b = convol(nx, nz, c, a, b)
print 'Time convol : ' + `round(time.clock() - ti, 4)`

Your solution is very elegant, Divakar, but I have to use this function a large number of times in my code. So, for 1000 iterations, this leads to

Time numba1 : 3.2487

Time numba2 : 3.7012

Time numba3 : 3.2088

Time convol : 22.7696

autojit and jit are very close. However, when using jit, it seems important to specify all argument types.

I do not know if there is a way to specify argument types in the jit decorator when the function has multiple outputs. Does anyone know?

For now, I have not found a better solution than using numba. New ideas are welcome!


Source: (StackOverflow)

Need help vectorizing code or optimizing

I am trying to do a double integral by first interpolating the data to make a surface. I am using numba to try and speed this process up, but it's just taking too long.

Here is my code, with the images needed to run the code located here and here.


Source: (StackOverflow)

Numba and Cython aren't improving the performance compared to CPython significantly; maybe I am using them incorrectly?

BIG EDIT:

================

For the sake of clarity, I am removing the old results and replacing them with the more recent results. The question is still the same: am I using both Cython and Numba correctly, and what improvements to the code can be made? (I have a newer and more bare-bones temporary IPython notebook with all the code and results here)

1)

I think I figured out why there was initially no difference between Cython, Numba, and CPython: it was because I fed them numpy arrays as input:

x = np.asarray([x_i*np.random.randint(8,12)/10 for x_i in range(n)])

instead of lists:

x = [x_i*random.randint(8,12)/10 for x_i in range(n)]

Benchmark using NumPy arrays as data input: [chart not reproduced]

Benchmark using Python lists as input: [chart not reproduced]

2)

I replaced the zip() function with explicit loops; however, it didn't make much of a difference. The code would be:

CPython

def py_lstsqr(x, y):
    """ Computes the least-squares solution to a linear matrix equation. """
    len_x = len(x)
    x_avg = sum(x)/len_x
    y_avg = sum(y)/len(y)
    var_x = 0
    cov_xy = 0
    for i in range(len_x):
        temp = (x[i] - x_avg)
        var_x += temp**2
        cov_xy += temp*(y[i] - y_avg)
    slope = cov_xy / var_x
    y_interc = y_avg - slope*x_avg
    return (slope, y_interc) 

Cython

%load_ext cythonmagic

%%cython
def cy_lstsqr(x, y):
    """ Computes the least-squares solution to a linear matrix equation. """
    cdef double x_avg, y_avg, var_x, cov_xy,\
         slope, y_interc, x_i, y_i
    cdef int len_x
    len_x = len(x)
    x_avg = sum(x)/len_x
    y_avg = sum(y)/len(y)
    var_x = 0
    cov_xy = 0
    for i in range(len_x):
        temp = (x[i] - x_avg)
        var_x += temp**2
        cov_xy += temp*(y[i] - y_avg)
    slope = cov_xy / var_x
    y_interc = y_avg - slope*x_avg
    return (slope, y_interc)

Numba

from numba import jit

@jit
def numba_lstsqr(x, y):
    """ Computes the least-squares solution to a linear matrix equation. """
    len_x = len(x)
    x_avg = sum(x)/len_x
    y_avg = sum(y)/len(y)
    var_x = 0
    cov_xy = 0
    for i in range(len_x):
        temp = (x[i] - x_avg)
        var_x += temp**2
        cov_xy += temp*(y[i] - y_avg)
    slope = cov_xy / var_x
    y_interc = y_avg - slope*x_avg
    return (slope, y_interc)
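
For comparison, a fully vectorized numpy version of the same fit (not part of the benchmarked code above, and the name np_lstsqr is made up) would look roughly like this:

import numpy as np

def np_lstsqr(x, y):
    """ Same least-squares fit, expressed with numpy array operations. """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_avg, y_avg = x.mean(), y.mean()
    dx = x - x_avg
    slope = np.dot(dx, y - y_avg) / np.dot(dx, dx)
    return slope, y_avg - slope * x_avg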

Source: (StackOverflow)

Speeding up element-wise array multiplication in python

I have been playing around with numba and numexpr trying to speed up a simple element-wise matrix multiplication. I have not been able to get better results; they are both basically (speed-wise) equivalent to numpy's multiply function. Has anyone had any luck in this area? Am I using numba and numexpr wrong (I'm quite new to this), or is this altogether a bad approach to trying to speed this up? Here is a reproducible example; thank you in advance:

import numpy as np
from numba import autojit
import numexpr as ne

a=np.random.rand(10,5000000)

# numpy
multiplication1 = np.multiply(a,a)

# numba
def multiplix(X,Y):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, N), dtype=np.float)
    for i in range(M):
        for j in range(N):
            D[i,j] = X[i, j] * Y[i, j]
    return D

mul = autojit(multiplix)
multiplication2 = mul(a,a)

# numexpr
def numexprmult(X,Y):
    M = X.shape[0]
    N = X.shape[1]
    return ne.evaluate("X * Y")

multiplication3 = numexprmult(a,a) 
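
One small thing worth trying (a sketch, not a measured result): both numpy and numexpr can write into a preallocated output buffer, which avoids one large allocation per call. The gains are usually modest because an element-wise multiply is memory-bound.

import numpy as np
import numexpr as ne

a = np.random.rand(10, 5000000)
out = np.empty_like(a)

np.multiply(a, a, out=out)        # numpy, reuses the preallocated buffer
ne.evaluate("a * a", out=out)     # numexpr, writes into the same buffer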

Source: (StackOverflow)

Numba autojit error on comparing numpy arrays

When I compare two numpy arrays inside my function I get an error saying only length-1 arrays can be converted to Python scalars:

from numpy.random import rand
from numba import autojit

@autojit
def myFun():
    a = rand(10,1)
    b = rand(10,1)
    idx = a > b
    return idx

myFun()

The error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-f7b68c0872a3> in <module>()
----> 1 myFun()

/Users/Guest/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/numba/numbawrapper.so in numba.numbawrapper._NumbaSpecializingWrapper.__call__ (numba/numbawrapper.c:3764)()

TypeError: only length-1 arrays can be converted to Python scalars
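
A workaround that may help is to write the comparison as an explicit element-wise loop, which older numba releases generally handled better than whole-array operators. This is only a sketch (the name myFunLoop is hypothetical, and whether it reaches nopython mode depends on the numba version):

import numpy as np
from numba import autojit

@autojit
def myFunLoop():
    a = np.random.rand(10, 1)
    b = np.random.rand(10, 1)
    idx = np.empty((10, 1), dtype=np.bool_)
    for i in range(10):
        # compare one pair of scalars at a time instead of whole arrays
        idx[i, 0] = a[i, 0] > b[i, 0]
    return idx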

Source: (StackOverflow)

Numba slow when assigning to an array?

Numba seems to be a great solution for accelerating the execution of numeric code. However, when there are assignments to an array, Numba seems to be slower than standard Python code. Consider this example comparing four alternatives, with/without Numba, writing to an array/scalar:

(The calculations were kept very simple on purpose, to focus on the issue, which is assignment to a scalar vs. assignment to an array cell.)

@autojit
def fast_sum_arr(arr):
    z = arr.copy()
    M = len(arr)
    for i in range(M):
        z[i] += arr[i]

    return z

def sum_arr(arr):
    z = arr.copy()
    M = len(arr)
    for i in range(M):
        z[i] += arr[i]

    return z

@autojit
def fast_sum_sclr(arr):
    z = 0
    M = len(arr)
    for i in range(M):
        z += arr[i]

    return z

def sum_sclr(arr):
    z = 0
    M = len(arr)
    for i in range(M):
        z += arr[i]

    return z

Using IPython's %timeit to evaluate the four alternatives I got:

In [125]: %timeit fast_sum_arr(arr)
100 loops, best of 3: 10.8 ms per loop

In [126]: %timeit sum_arr(arr)
100 loops, best of 3: 4.11 ms per loop

In [127]: %timeit fast_sum_sclr(arr)
100000 loops, best of 3: 10 us per loop

In [128]: %timeit sum_sclr(arr)
100 loops, best of 3: 2.93 ms per loop

sum_arr, which was not compiled with Numba, is more than twice as fast as fast_sum_arr, which was compiled with Numba. On the other hand, fast_sum_sclr, which was compiled with Numba, is more than two orders of magnitude faster than sum_sclr, which was not compiled with Numba.

So Numba performs remarkably well at accelerating sum_sclr but actually makes sum_arr execute more slowly. The only difference between sum_sclr and sum_arr is that the former assigns to a scalar while the latter assigns to an array cell.

I don't know if there is any relation, but I recently read the following on the blog http://www.phi-node.com/:

"It turns out that when Numba is confronted with any construct it doesn't support directly, it switches to a (very) slow code path."

The blog author got Numba to perform much faster using an if statement instead of Python's max().
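
Following that reasoning, one thing worth checking is whether the arr.copy() call inside the jitted function is such an unsupported construct. A sketch (the name fast_sum_arr2 is made up) that moves the copy out to the caller, so the jitted body is just the typed loop:

import numpy as np
from numba import autojit

@autojit
def fast_sum_arr2(arr, z):
    M = arr.shape[0]
    for i in range(M):
        # only plain element reads/writes remain inside the jitted function
        z[i] += arr[i]
    return z

# usage: z = arr.copy(); fast_sum_arr2(arr, z)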

Any insights on this?

Thanks,

FS


Source: (StackOverflow)

Why is numba faster than numpy here?

I can't figure out why numba is beating numpy here (over 3x). Did I make some fundamental error in how I am benchmarking here? Seems like the perfect situation for numpy, no? Note that as a check, I also ran a variation combining numba and numpy (not shown), which as expected was the same as running numpy without numba.

(btw this is a followup question to: Fastest way to numerically process 2d-array: dataframe vs series vs array vs numba )

import numpy as np
from numba import jit
nobs = 10000 

def proc_numpy(x,y,z):

   x = x*2 - ( y * 55 )      # these 4 lines represent use cases
   y = x + y*2               # where the processing time is mostly
   z = x + y + 99            # a function of, say, 50 to 200 lines
   z = z * ( z - .88 )       # of fairly simple numerical operations

   return z

@jit
def proc_numba(xx,yy,zz):
   for j in range(nobs):     # as pointed out by Llopis, this for loop 
      x, y = xx[j], yy[j]    # is not needed here.  it is here by 
                             # accident because in the original benchmarks 
      x = x*2 - ( y * 55 )   # I was doing data creation inside the function 
      y = x + y*2            # instead of passing it in as an array
      z = x + y + 99         # in any case, this redundant code seems to 
      z = z * ( z - .88 )    # have something to do with the code running
                             # faster.  without the redundant code, the 
      zz[j] = z              # numba and numpy functions are exactly the same.
   return zz

x = np.random.randn(nobs)
y = np.random.randn(nobs)
z = np.zeros(nobs)
res_numpy = proc_numpy(x,y,z)

z = np.zeros(nobs)
res_numba = proc_numba(x,y,z)

results:

In [356]: np.all( res_numpy == res_numba )
Out[356]: True

In [357]: %timeit proc_numpy(x,y,z)
10000 loops, best of 3: 105 µs per loop

In [358]: %timeit proc_numba(x,y,z)
10000 loops, best of 3: 28.6 µs per loop

I ran this on a 2012 MacBook Air (13.3-inch), standard Anaconda distribution. I can provide more detail on my setup if it's relevant.


Source: (StackOverflow)

Python: JIT for known bottlenecks

Is there any way to somehow use pypy to compile just one function, and not the rest of my python program?

I have a known bottleneck where I spend 99% of my CPU time (containing mostly integer shifts and XORs) and have optimized it as much as I can in Python. I don't want to have to write and maintain C libraries unless absolutely positively necessary.

Right now I'm using Anaconda Python which is the normal python with a bunch of libraries. I would use pypy except that I don't want to have to make sure that all the rest of my program works ok w/ pypy.

Is there a way to explicitly run a JIT only on one Python function?


Edit: the function is a modular multiplication step in GF(2) (a Galois field):

https://bitbucket.org/jason_s/libgf2/src/a71a14a035468e703a7c24349be814419cdba2cb/src/libgf2/gf2.py?at=default

specifically:

def _gf2mulmod(x,y,m):
    z = 0
    while x > 0:
        if (x & 1) != 0:
            z ^= y
        y <<= 1
        y2 = y ^ m
        if y2 < y:
            y = y2
        x >>= 1
    return z

It needs to work with bigints, so I'm not sure how to rewrite it to be Cython-compatible.

I just tried numba's @autojit and it failed because it didn't know what variable types to use and assumed small integers. I can't seem to figure out how to tell it to use standard Python bigints.

Traceback (most recent call last):
  File "/Users/jason_s/Documents/python/libgf2/src/libgf2/gf2.py", line 440, in <module>
    dlog43 = GF2DiscreteLog(0x100000000065)
  File "/Users/jason_s/Documents/python/libgf2/src/libgf2/gf2.py", line 295, in __init__
    factors = _calculateFactors(poly)
  File "/Users/jason_s/Documents/python/libgf2/src/libgf2/gf2.py", line 264, in _calculateFactors
    if (e1 << m).value != 1:
  File "/Users/jason_s/Documents/python/libgf2/src/libgf2/gf2.py", line 379, in __lshift__
    return self._wrapraw(_gf2lshiftmod(self.value,k,self.poly))
  File "/Users/jason_s/Documents/python/libgf2/src/libgf2/gf2.py", line 195, in _gf2lshiftmod
    return _gf2mulmod(x,_gf2powmod(2,k,m),m)
  File "/Users/jason_s/Documents/python/libgf2/src/libgf2/gf2.py", line 189, in _gf2powmod
    z = _gf2mulmod(z,x,m)
  File "numbawrapper.pyx", line 193, in numba.numbawrapper._NumbaSpecializingWrapper.__call__ (numba/numbawrapper.c:3764)
OverflowError: value too large to convert to signed int

Source: (StackOverflow)

Numba code slower than pure python

I've been working on speeding up a resampling calculation for a particle filter. As python has many ways to speed it up, I thought I'd try them all. Unfortunately, the numba version is incredibly slow. As Numba should result in a speed-up, I assume this is an error on my part.

I tried 4 different versions:

  1. Numba
  2. Python
  3. Numpy
  4. Cython

The code for each is below:

import numpy as np
import scipy as sp
import numba as nb
from cython_resample import cython_resample

@nb.autojit
def numba_resample(qs, xs, rands):
    n = qs.shape[0]
    lookup = np.cumsum(qs)
    results = np.empty(n)

    for j in range(n):
        for i in range(n):
            if rands[j] < lookup[i]:
                results[j] = xs[i]
                break
    return results

def python_resample(qs, xs, rands):
    n = qs.shape[0]
    lookup = np.cumsum(qs)
    results = np.empty(n)

    for j in range(n):
        for i in range(n):
            if rands[j] < lookup[i]:
                results[j] = xs[i]
                break
    return results

def numpy_resample(qs, xs, rands):
    results = np.empty_like(qs)
    lookup = sp.cumsum(qs)
    for j, key in enumerate(rands):
        i = sp.argmax(lookup>key)
        results[j] = xs[i]
    return results

#The following is the code for the cython module. It was compiled in a
#separate file, but is included here to aid in the question.
"""
import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.float64

ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
def cython_resample(np.ndarray[DTYPE_t, ndim=1] qs, 
             np.ndarray[DTYPE_t, ndim=1] xs, 
             np.ndarray[DTYPE_t, ndim=1] rands):
    if qs.shape[0] != xs.shape[0] or qs.shape[0] != rands.shape[0]:
        raise ValueError("Arrays must have same shape")
    assert qs.dtype == xs.dtype == rands.dtype == DTYPE

    cdef unsigned int n = qs.shape[0]
    cdef unsigned int i, j 
    cdef np.ndarray[DTYPE_t, ndim=1] lookup = np.cumsum(qs)
    cdef np.ndarray[DTYPE_t, ndim=1] results = np.zeros(n, dtype=DTYPE)

    for j in range(n):
        for i in range(n):
            if rands[j] < lookup[i]:
                results[j] = xs[i]
                break
    return results
"""

if __name__ == '__main__':
    n = 100
    xs = np.arange(n, dtype=np.float64)
    qs = np.array([1.0/n,]*n)
    rands = np.random.rand(n)

    print "Timing Numba Function:"
    %timeit numba_resample(qs, xs, rands)
    print "Timing Python Function:"
    %timeit python_resample(qs, xs, rands)
    print "Timing Numpy Function:"
    %timeit numpy_resample(qs, xs, rands)
    print "Timing Cython Function:"
    %timeit cython_resample(qs, xs, rands)

This results in the following output:

Timing Numba Function:
1 loops, best of 3: 8.23 ms per loop
Timing Python Function:
100 loops, best of 3: 2.48 ms per loop
Timing Numpy Function:
1000 loops, best of 3: 793 µs per loop
Timing Cython Function:
10000 loops, best of 3: 25 µs per loop

Any idea why the numba code is so slow? I assumed it would be at least comparable to Numpy.

Note: if anyone has any ideas on how to speed up either the Numpy or Cython code samples, that would be nice too:) My main question is about Numba though.
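
One likely culprit with autojit on numba of that era is that calls such as np.cumsum and np.empty inside the jitted function may force a fall-back to the slow object mode. A sketch (untimed; the names resample_core and numba_resample2 are made up) that keeps only the plain loops inside the compiled function:

import numpy as np
from numba import njit

@njit
def resample_core(lookup, xs, rands, results):
    # only the nested search loops are compiled; cumsum and allocation stay outside
    n = lookup.shape[0]
    for j in range(n):
        for i in range(n):
            if rands[j] < lookup[i]:
                results[j] = xs[i]
                break

def numba_resample2(qs, xs, rands):
    lookup = np.cumsum(qs)
    results = np.empty_like(qs)
    resample_core(lookup, xs, rands, results)
    return results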


Source: (StackOverflow)

Inter segment distance using numba jit, Python

I have been asking related questions on this site for the past week to try to isolate things I didn't understand about using the @jit decorator with Numba in Python. However, I'm hitting a wall, so I'll just write out the whole problem.

The issue at hand is to calculate the minimal distance between pairs of a large number of segments. The segments are represented by their beginning and endpoints in 3D. Mathematically, every segment is parameterized as [AB] = A + (B-A)*s, where s in [0,1], and A and B are the beginning and endpoints of the segment. For two such segments, the minimal distance can be computed and the formula is given here.

I have already described this problem in another thread, and the answer given dealt with replacing the double loops of my code by vectorizing the problem, which, however, would hit memory issues for large sets of segments. Therefore, I've decided to stick with the loops and use numba's jit instead.

Since the solution to the problem requires a lot of dot products, and numpy's dot product is not supported by numba, I started by implementing my own 3D dot product.

import numpy as np
from numba import jit, autojit, double, float64, float32, void, int32

def my_dot(a,b):
    res = a[0]*b[0]+a[1]*b[1]+a[2]*b[2]
    return res

dot_jit = jit(double(double[:], double[:]))(my_dot)    #I know, it's not of much use here.

The function that computes the minimal distance over all pairs of N segments takes as input an Nx6 array (6 coordinates):

def compute_stuff(array_to_compute):
    N = len(array_to_compute)
    con_mat = np.zeros((N,N))
    for i in range(N):
        for j in range(i+1,N):

            p0 = array_to_compute[i,0:3]
            p1 = array_to_compute[i,3:6]
            q0 = array_to_compute[j,0:3]
            q1 = array_to_compute[j,3:6]

            s = ( dot_jit((p1-p0),(q1-q0))*dot_jit((q1-q0),(p0-q0)) - dot_jit((q1-q0),(q1-q0))*dot_jit((p1-p0),(p0-q0)))/( dot_jit((p1-p0),(p1-p0))*dot_jit((q1-q0),(q1-q0)) - dot_jit((p1-p0),(q1-q0))**2 )
            t = ( dot_jit((p1-p0),(p1-p0))*dot_jit((q1-q0),(p0-q0)) -dot_jit((p1-p0),(q1-q0))*dot_jit((p1-p0),(p0-q0)))/( dot_jit((p1-p0),(p1-p0))*dot_jit((q1-q0),(q1-q0)) - dot_jit((p1-p0),(q1-q0))**2 )

            con_mat[i,j] = np.sum( (p0+(p1-p0)*s-(q0+(q1-q0)*t))**2 ) 

    return con_mat

fast_compute_stuff = jit(double[:,:](double[:,:]))(compute_stuff)

So, compute_stuff(arg) takes as argument a 2D np.array (double[:,:]), performs a bunch of numba-supported (?) operations, and returns another 2D np.array (double[:,:]). However,

v = np.random.random( (100,6) )
%timeit compute_stuff(v)
%timeit fast_compute_stuff(v)

I get 134 and 123 ms per loop. Can you shed some light on why I fail to speed up my function? Any feedback would be much appreciated.
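
For comparison, here is a sketch (assuming a reasonably recent numba and using njit, i.e. nopython mode; the name fast_compute_stuff2 is made up) that inlines the dot products as scalar arithmetic, so the double loop contains no Python-level calls or temporary arrays. The formula, and the absence of any clamping of s and t, follow the code above:

import numpy as np
from numba import njit

@njit
def fast_compute_stuff2(segs):
    N = segs.shape[0]
    con_mat = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            # segment directions d = p1 - p0, e = q1 - q0, and offset f = p0 - q0
            dx = segs[i, 3] - segs[i, 0]; dy = segs[i, 4] - segs[i, 1]; dz = segs[i, 5] - segs[i, 2]
            ex = segs[j, 3] - segs[j, 0]; ey = segs[j, 4] - segs[j, 1]; ez = segs[j, 5] - segs[j, 2]
            fx = segs[i, 0] - segs[j, 0]; fy = segs[i, 1] - segs[j, 1]; fz = segs[i, 2] - segs[j, 2]
            a = dx*dx + dy*dy + dz*dz    # dot(d, d)
            b = dx*ex + dy*ey + dz*ez    # dot(d, e)
            c = ex*ex + ey*ey + ez*ez    # dot(e, e)
            d = dx*fx + dy*fy + dz*fz    # dot(d, f)
            e = ex*fx + ey*fy + ez*fz    # dot(e, f)
            den = a*c - b*b
            s = (b*e - c*d) / den
            t = (a*e - b*d) / den
            # squared distance between the two closest points
            gx = fx + dx*s - ex*t
            gy = fy + dy*s - ey*t
            gz = fz + dz*s - ez*t
            con_mat[i, j] = gx*gx + gy*gy + gz*gz
    return con_mat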


Source: (StackOverflow)

Optimizing access on numpy arrays for numba

I recently stumbled upon numba and thought about replacing some homemade C extensions with more elegant autojitted python code. Unfortunately I wasn't happy when I tried a first quick benchmark. It seems like numba is not doing much better than ordinary python here, though I would have expected nearly C-like performance:

from numba import jit, autojit, uint, double, int_
import numpy as np
import imp
import logging
logging.getLogger('numba.codegen.debug').setLevel(logging.INFO)

def sum_accum(accmap, a):
    res = np.zeros(np.max(accmap) + 1, dtype=a.dtype)
    for i in xrange(len(accmap)):
        res[accmap[i]] += a[i]
    return res

autonumba_sum_accum = autojit(sum_accum)
numba_sum_accum = jit(double[:](int_[:], double[:]), 
                      locals=dict(i=uint))(sum_accum)

accmap = np.repeat(np.arange(1000), 2)
np.random.shuffle(accmap)
accmap = np.repeat(accmap, 10)
a = np.random.randn(accmap.size)

ref = sum_accum(accmap, a)
assert np.all(ref == numba_sum_accum(accmap, a))
assert np.all(ref == autonumba_sum_accum(accmap, a))

%timeit sum_accum(accmap, a)
%timeit autonumba_sum_accum(accmap, a)
%timeit numba_sum_accum(accmap, a)

accumarray = imp.load_source('accumarray', '/path/to/accumarray.py')
assert np.all(ref == accumarray.accum(accmap, a))

%timeit accumarray.accum(accmap, a)

This gives on my machine:

10 loops, best of 3: 52 ms per loop
10 loops, best of 3: 42.2 ms per loop
10 loops, best of 3: 43.5 ms per loop
1000 loops, best of 3: 321 us per loop

I'm running the latest numba version from pypi, 0.11.0. Any suggestions on how to fix the code so that it runs reasonably fast with numba?
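
As a point of reference (it does not answer the numba question), the whole accumulation can also be expressed as a single numpy call; this is just a baseline sketch, and the name bincount_accum is made up:

import numpy as np

def bincount_accum(accmap, a):
    # equivalent to sum_accum: sums a[i] into the bin given by accmap[i]
    return np.bincount(accmap, weights=a)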


Source: (StackOverflow)

Why would using 8 threads be faster than 4 threads on a 4 core Hyper Threaded CPU?

I have a quad core i7 920 CPU. It is Hyperthreaded, so the computer thinks it has 8 cores.

From what I've read on the interweb, when doing parallel tasks, I should use the number of physical cores, not the number of hyper threaded cores.

So I have done some timings, and was surprised that using 8 threads in a parallel loop is faster than using 4 threads.

Why is this? My example code is too long to post here, but can be found by running the example here: https://github.com/jsphon/MTVectorizer

A chart of the performance is here:

[performance chart not reproduced]


Source: (StackOverflow)

Python and Numba for vectorized functions

Good day, I'm writing a Python module for some numeric work. Since there's a lot of stuff going on, I've been spending the last few days optimizing the code to improve calculation times. However, I have a question concerning Numba. Basically, I have a class with some fields which are numpy arrays, which I initialize in the following way:

def init(self):
    a = numpy.arange(0, self.max_i, 1)
    self.vibr_energy = self.calculate_vibr_energy(a)

def calculate_vibr_energy(self, i):
    return numpy.exp(-self.harmonic * i - self.anharmonic * (i ** 2))

So, the code is vectorized, and using Numba's JIT results in some improvement. However, sometimes I need to access the calculate_vibr_energy function from outside the class, and pass a single integer instead of an array in place of i. As far as I understand, if I use Numba's JIT on the calculate_vibr_energy, it will have to always take an array as an argument.

So, which of the following options is better?

  1. Create a new function calculate_vibr_energy_single(i), which will only take a single integer, and use Numba on it too.
  2. Replace all usages of the function that are similar to this one:

myclass.calculate_vibr_energy(1)

with this:

tmp = np.array([1])
myclass.calculate_vibr_energy(tmp)[0]

Or are there other, more efficient (or at least more Pythonic) ways of doing that?
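
A third option worth considering is numba's @vectorize, which compiles a ufunc that accepts either a scalar or an array for i. This is only a sketch: the standalone function and its parameter names are made up, and the constants are passed explicitly because numba cannot see self:

import numpy as np
from numba import vectorize

@vectorize(['float64(float64, float64, float64)'])
def vibr_energy(harmonic, anharmonic, i):
    # same formula as calculate_vibr_energy, with the class fields as arguments
    return np.exp(-harmonic * i - anharmonic * i ** 2)

# works for a single value ...
vibr_energy(0.1, 0.01, 3.0)
# ... and for a whole array of indices
vibr_energy(0.1, 0.01, np.arange(10.0))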


Source: (StackOverflow)