
pyopencl

OpenCL integration for Python, plus shiny features

Bigint in OpenCL and Python

I am trying to implement RSA in Python, but I want to run the intensive calculations on the GPU. I have successfully implemented my own modular exponentiation running in PyOpenCL, but I max out at six-digit integers for both the base and the exponent. With larger numbers than that the GPU crashes.

I need to create an array of very large integer values in numpy, send them to the PyOpenCL code, and do both multiplication and modulo operations with the large integers on the GPU.

Has anyone done anything similar before? Googling cuda and bigint doesn't give any good results. :(
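One common representation (a sketch of mine, not from the question): split each big integer into fixed-width limbs so a whole batch fits in a numpy array that PyOpenCL can transfer; the kernel then implements multi-precision multiply/modulo over the limbs.

import numpy as np

def int_to_limbs(value, num_limbs, bits=32):
    """Split a Python int into little-endian limbs of `bits` bits each."""
    mask = (1 << bits) - 1
    limbs = np.zeros(num_limbs, dtype=np.uint32)
    for i in range(num_limbs):
        limbs[i] = value & mask
        value >>= bits
    return limbs

def limbs_to_int(limbs, bits=32):
    """Reassemble the Python int from its limbs."""
    value = 0
    for i, limb in enumerate(limbs):
        value |= int(limb) << (bits * i)
    return value

# e.g. a 1024-bit number fits in 32 uint32 limbs
n = 2**1000 + 12345
limbs = int_to_limbs(n, 32)
assert limbs_to_int(limbs) == n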


Source: (StackOverflow)

Python LogicError: clGetPlatformIDs failed: platform not found khr

While running nosetests for a set of Python programs on Ubuntu, the following error occurs:

File "/home/fateeva/prog/deflectometry/SGMFMeasurement/_PhaseShifts.py", line 30
    devices = [ d for d in cl.get_platforms()[0].get_devices() if ...
LogicError: clGetPlatformIDs failed: platform not found khr

How can this be fixed?
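For reference, the failing call can be reproduced in isolation (a snippet of mine, not from the traceback):

import pyopencl as cl

# With no usable OpenCL ICD installed or visible, this call itself raises
# LogicError: clGetPlatformIDs failed: platform not found khr
platforms = cl.get_platforms()
print(platforms)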


Source: (StackOverflow)


Faster way to structure operations on offset neighborhoods in OpenCL

How can an operation on many overlapping but offset blocks of a 2D array be structured for more efficient execution in OpenCL?

For example, I have the following OpenCL kernel:

__kernel void test_kernel(
    read_only image2d_t src,
    write_only image2d_t dest,
    const int width,
    const int height
)
{
    const sampler_t sampler =  CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    int2 pos0 = (int2)(pos.x - pos.x % 16, pos.y - pos.y % 16);

    uint4 diff = (uint4)(0, 0, 0, 0);

    for (int i=0; i<16; i++)
    {
        for (int j=0; j<16; j++)
        {
            diff += read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) -
                read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j));
        }
    }
    write_imageui(dest, pos, diff);
}

It produces correct results, but is slow... only ~25 GFLOPS on an NVS 4200M with 1k by 1k input (the hardware spec is 155 GFLOPS). I'm guessing this has to do with the memory access patterns. Each work item reads one 16x16 block of data, which is the same for all its neighbors in a 16x16 area, and also another offset block of data that mostly overlaps with that of its immediate neighbors. All reads are through samplers. The host program is PyOpenCL (I don't think that actually changes anything) and the work-group size is 16x16.
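For readers who want the semantics without parsing the kernel, here is a rough numpy reference of what it computes (my reading of it; it ignores the unsigned wrap-around of uint4 arithmetic and is far too slow to benchmark):

import numpy as np

def block_diff_reference(src):
    """For each pixel p: the sum over a 16x16 window anchored at p's tile origin,
    minus the sum over a 16x16 window anchored at p, with clamp-to-edge
    addressing on the bottom/right (src has shape (H, W, 4))."""
    h, w, c = src.shape
    padded = np.pad(src.astype(np.int64), ((0, 16), (0, 16), (0, 0)), mode='edge')
    out = np.zeros((h, w, c), dtype=np.int64)
    for y in range(h):
        for x in range(w):
            y0, x0 = y - y % 16, x - x % 16
            out[y, x] = (padded[y0:y0 + 16, x0:x0 + 16].sum(axis=(0, 1))
                         - padded[y:y + 16, x:x + 16].sum(axis=(0, 1)))
    return out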

EDIT: New version of the kernel per the suggestion below, copying the work area to local memory:

__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void test_kernel(
    read_only image2d_t src,
    write_only image2d_t dest,
    const int width,
    const int height
)
{
    const sampler_t sampler =  CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 pos = (int2)(get_global_id(0), get_global_id(1));

    int dx = pos.x % 16;
    int dy = pos.y % 16;

    __local uint4 local_src[16*16];
    __local uint4 local_src2[32*32];

    local_src[(pos.y % 16) * 16 + (pos.x % 16)] = read_imageui(src, sampler, pos);
    local_src2[(pos.y % 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, pos);
    local_src2[(pos.y % 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y));
    local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, (int2)(pos.x, pos.y + 16));
    local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y + 16));
    barrier(CLK_LOCAL_MEM_FENCE);


    uint4 diff = (uint4)(0, 0, 0, 0);

    for (int i=0; i<16; i++)
    {
        for (int j=0; j<16; j++)
        {
            diff += local_src[ j*16 + i ] - local_src2[ (j+dy)*32 + i+dx ];
        }
    }
    write_imageui(dest, pos, diff);
}

Result: output is correct, running time is 56% slower. If using local_src only (not local_src2), the result is ~10% faster.

EDIT: Benchmarked on much more powerful hardware: an AMD Radeon HD 7850 gets 420 GFLOPS; the spec is 1751 GFLOPS. To be fair, the spec is for multiply-add, and there is no multiply here, so the expected figure is ~875 GFLOPS, but this is still off by quite a lot compared to the theoretical performance.

EDIT: To make it easier for anyone who would like to try this out, the host-side PyOpenCL program is below:

import pyopencl as cl
import numpy
import numpy.random
from time import time

CL_SOURCE = ''' 
// kernel goes here
'''

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
prg = cl.Program(ctx, CL_SOURCE).build()

h, w = 1024, 1024
src = numpy.zeros((h, w, 4), dtype=numpy.uint8)
src[:,:,:] = numpy.random.rand(h, w, 4) * 255

mf = cl.mem_flags
src_buf = cl.image_from_array(ctx, src, 4)
fmt = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.UNSIGNED_INT8)
dest_buf = cl.Image(ctx, mf.WRITE_ONLY, fmt, shape=(w, h))

# warmup
for n in range(10):
    event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
event.wait()

# benchmark
t1 = time()
for n in range(100):
    event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
event.wait()
t2 = time()
print "Duration (host): ", (t2-t1)/100
print "Duration (event): ", (event.profile.end-event.profile.start)*1e-9

EDIT: Thinking about the memory access patterns, the original naive version may actually be pretty good: when calling read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) all work-items in a work group read the same location (so this is just one read?), and when calling read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j)) they read sequential locations (so the reads can be coalesced perfectly?).


Source: (StackOverflow)

Allocation error with pyopencl with simple multiplication in for-loop

I am using pyopencl to speed up my calculations using a GPU and am at the moment mystified by the following problem.

I'm doing a simple multiplication of two arrays in a for loop, using the following code:

import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array
from pyopencl.elementwise import ElementwiseKernel

ctx = cl.create_some_context(0)
queue = cl.CommandQueue(ctx)

multiply = ElementwiseKernel(ctx,
           "float *x, float *y, float *z",
           "z[i] = x[i] * y[i]",
           "multiplication")

x = cl_array.arange(queue, 1000000, dtype=np.complex64)
y = cl_array.arange(queue, 1000000, dtype=np.complex64)
z = cl_array.empty_like(x)

for n in range(10000):
    z = x*y
    multiply(x.real, y.real, z.real)
    multiply(x, y, z)

The last three lines all do the same thing, namely the multiplication. However, the first two options result in the following error (I commented out the other two, of course):

pyopencl.MemoryError: clEnqueueNDRangeKernel failed: mem object allocation failure

I'm just lost as to why the first two options run into allocation errors.

NOTES:

GPU: [0] pyopencl.Device 'Capeverde' on 'AMD Accelerated Parallel Processing' at 0x2a76d90

>>> pyopencl.VERSION
(2013, 1)

I am aware that the complex type is not handled correctly, but if you change them to np.float32 I still get the same problem.


Source: (StackOverflow)

Optimization and speedup of a mathematical function in python

The purpose of this mathematical function is to compute a distance between two (or more) protein structures using dihedral angles:

[figure: the dihedral distance formula; per the code below, d(a, b) = sqrt( (1/m) * sum_j (1 - cos(a_j - b_j)) / 2 ) over the m dihedral angles]

It is very useful in structural biology, for example. I have already coded this function in Python using numpy, but the goal is a faster implementation. As a computation-time reference, I use the Euclidean distance function available in the scikit-learn package.

Here is the code I have at the moment:

import numpy as np
import numexpr as ne
from sklearn.metrics.pairwise import euclidean_distances

# We have 10000 structures with 100 dihedral angles
n = 10000
m = 100

# Generate some random data
c = np.random.rand(n,m)
# Generate random int number
x = np.random.randint(c.shape[0])

print c.shape, x

# First version with numpy of the dihedral_distances function
def dihedral_distances(a, b):
    l = 1./a.shape[0]
    return np.sqrt(l* np.sum((0.5)*(1. - np.cos(a-b)), axis=1))

# Accelerated version with numexpr
def dihedral_distances_ne(a, b):
    l = 1./a.shape[0]
    tmp = ne.evaluate('sum((0.5)*(1. - cos(a-b)), axis=1)')
    return ne.evaluate('sqrt(l* tmp)')

# The function of reference I try to be close as possible 
# in term of computation time
%timeit euclidean_distances(c[x,:], c)[0]
1000 loops, best of 3: 1.07 ms per loop

# Computation time of the first version of the dihedral_distances function
# We choose randomly 1 structure among the 10000 structures.
# And we compute the dihedral distance between this one and the others
%timeit dihedral_distances(c[x,:], c)
10 loops, best of 3: 21.5 ms per loop

# Computation time of the accelerated function with numexpr
%timeit dihedral_distances_ne(c[x,:], c)
100 loops, best of 3: 9.44 ms per loop

9.44 ms is very fast, but it is still too slow if you need to run it a million times. Now the question is: what is the next step? Cython? PyOpenCL? I have some experience with PyOpenCL, but I have never coded anything as elaborate as this. I don't know whether it is possible to compute the dihedral distances in one step on the GPU, as I do with numpy, or how to proceed.
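In case it helps to see the shape of a GPU version, here is a minimal, unbenchmarked sketch of the same formula as a raw PyOpenCL kernel, one work-item per structure (the kernel name, data layout, and reference index are my own choices, not from the question):

import numpy as np
import pyopencl as cl

KERNEL = """
__kernel void dihedral(__global const float *c, __global float *out,
                       const int m, const int x)
{
    int i = get_global_id(0);
    float acc = 0.0f;
    for (int j = 0; j < m; j++)
        acc += 1.0f - cos(c[x*m + j] - c[i*m + j]);
    out[i] = sqrt(acc / (2.0f * m));
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, KERNEL).build()

n, m = 10000, 100
c = np.random.rand(n, m).astype(np.float32)
x = 42  # index of the reference structure (arbitrary choice)

mf = cl.mem_flags
c_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=c)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, n * 4)
prg.dihedral(queue, (n,), None, c_buf, out_buf, np.int32(m), np.int32(x))

res = np.empty(n, dtype=np.float32)
cl.enqueue_copy(queue, res, out_buf)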

Thank you for helping me!

EDIT: Thank you guys! I am currently working on the full solution and once it's finished I will put the code here.

CYTHON VERSION:

%load_ext cython
import numpy as np

np.random.seed(1234)

n = 10000
m = 100

c = np.random.rand(n,m)
x = np.random.randint(c.shape[0])

print c.shape, x

%%cython --compile-args=-fopenmp --link-args=-fopenmp --force

import numpy as np
cimport numpy as np
from libc.math cimport sqrt, cos
cimport cython
from cython.parallel cimport parallel, prange

# Define a function pointer to a metric
ctypedef double (*metric)(double[: ,::1], np.intp_t, np.intp_t)

cdef extern from "math.h" nogil:
    double cos(double x)
    double sqrt(double x)

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cdef double dihedral_distances(double[:, ::1] a, np.intp_t i1, np.intp_t i2):
    cdef double res
    cdef int m
    cdef int j

    res = 0.
    m = a.shape[1]

    for j in range(m):
        res += 1. - cos(a[i1, j] - a[i2, j])

    res /= 2.*m

    return sqrt(res)

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cdef double dihedral_distances_p(double[:, ::1] a, np.intp_t i1, np.intp_t i2):
    cdef double res
    cdef int m
    cdef int j

    res = 0.
    m = a.shape[1]

    with nogil, parallel(num_threads=2):
        for j in prange(m, schedule='dynamic'):
            res += 1. - cos(a[i1, j] - a[i2, j])

    res /= 2.*m

    return sqrt(res)

@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise(double[: ,::1] c not None, np.intp_t x, p = True):
    cdef metric dist_func
    if p:
        dist_func = &dihedral_distances_p
    else:
        dist_func = &dihedral_distances

    cdef np.intp_t i, n_samples
    n_samples = c.shape[0]

    cdef double[::1] res = np.empty(n_samples)

    for i in range(n_samples):
        res[i] = dist_func(c, x, i)

    return res

%timeit pairwise(c, x, False)
100 loops, best of 3: 17 ms per loop    

# Parallel version
%timeit pairwise(c, x, True)
10 loops, best of 3: 37.1 ms per loop

So I followed your link to create the Cython version of the dihedral distances function. We gain some speed, but not much, and it is still slower than the numexpr version (17 ms vs 9.44 ms). So I tried to parallelize the function using prange, and it is worse (37.1 ms vs 17 ms vs 9.44 ms)!

Am I missing something?


Source: (StackOverflow)

Why Use pyopencl.array.Array

PyOpenCL has two ways of creating an OpenCL buffer:

pyopencl.Buffer takes a numpy array and turns it into a buffer.

pyopencl.array.Array takes a numpy array and turns it into a PyOpenCL array - an object that can still be edited like a numpy array, but provides a buffer when you call .data on it.

Is there a performance/function reason to choose one or the other? Or should I just choose the one that makes my code more readable?
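For concreteness, a short sketch (not from the question) of the two styles side by side:

import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
a = np.arange(10, dtype=np.float32)

# Raw buffer: plain device memory initialized from the host array.
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)

# pyopencl.array.Array: a numpy-like object living on the device.
a_arr = cl_array.to_device(queue, a)
print((2 * a_arr).get())  # elementwise arithmetic works like numpy
kernel_arg = a_arr.data   # .data exposes the underlying buffer for kernel calls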


Source: (StackOverflow)

Pyopencl: difference between to_device and Buffer

Let

import pyopencl as cl
import pyopencl.array as cl_array
import numpy
a = numpy.random.rand(50000).astype(numpy.float32)
mf = cl.mem_flags

What is the difference between

a_gpu = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)

and

a_gpu = cl_array.to_device(self.ctx, self.queue, a)

?

And what is the difference between

result =  numpy.empty_like(a)
cl.enqueue_copy(self.queue, result, result_gpu)

and

result = result_gpu.get()

?


Source: (StackOverflow)

How To Structure Large OpenCL Kernels?

I have worked with OpenCL on a couple of projects, but have always written the kernel as one (sometimes rather large) function. Now I am working on a more complex project and would like to share functions across several kernels.

But the examples I can find all show the kernel as a single file (very few even call secondary functions). It seems like it should be possible to use multiple files - clCreateProgramWithSource() accepts multiple strings (and combines them, I assume) - although pyopencl's Program() takes only a single source.
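For what it's worth, a sketch of the simple concatenation workaround (the file names are hypothetical):

import pyopencl as cl

ctx = cl.create_some_context()

# Hypothetical file names: common.cl holds shared helper functions,
# kernels.cl holds the __kernel entry points that call them.
sources = []
for fname in ["common.cl", "kernels.cl"]:
    with open(fname) as f:
        sources.append(f.read())

# pyopencl's Program() takes one string, so join the pieces first.
prg = cl.Program(ctx, "\n".join(sources)).build()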

So I would like to hear from anyone with experience doing this:

  • Are there any problems associated with multiple source files?
  • Is the best workaround for pyopencl to simply concatenate files?
  • Is there any way to compile a library of functions (instead of passing in the library source with each kernel, even if not all are used)?
  • If it's necessary to pass in the library source every time, are unused functions discarded (no overhead)?
  • Any other best practices/suggestions?

Thanks.


Source: (StackOverflow)

OpenCL matrix multiplication should be faster?

I'm trying to learn how to write GPU-optimized OpenCL kernels. I took as an example matrix multiplication using square tiles in local memory. However, in the best case I got only a ~10x speedup (~50 Gflops) compared to numpy.dot() (~5 Gflops, which uses BLAS).

I found studies where they got speedups of >200x (>1000 Gflops): ftp://ftp.u-aizu.ac.jp/u-aizu/doc/Tech-Report/2012/2012-002.pdf I don't know what I'm doing wrong, or whether it is just because of my GPU (an NVIDIA GTX 275), or because of some PyOpenCL overhead. I also measured how long it takes just to copy the result from the GPU to RAM, and it is only ~10% of the matrix multiplication time.

#define BLOCK_SIZE 22 
__kernel void matrixMul(
      __global float* Cij, 
      __global float* Aik, 
      __global float* Bkj, 
      __const int ni, 
      __const int nj,
      __const int nk
){
//   WARNING : interchanging the  i  and  j  dimensions lowers the performance >2x on my nV GT275 GPU
int gj = get_global_id(0);    int gi = get_global_id(1); 
int bj = get_group_id(0);     int bi = get_group_id(1);  // Block index
int tj = get_local_id(0);     int ti = get_local_id(1);  // Thread index
int oj = bi*BLOCK_SIZE;       int oi = bj*BLOCK_SIZE; 
float Csub =0; 
__local float As   [BLOCK_SIZE][BLOCK_SIZE];
__local float Bs   [BLOCK_SIZE][BLOCK_SIZE];
for (int ok = 0; ok < nk; ok += BLOCK_SIZE )   {
    As[ti][tj] = Aik[ nk*(gi   ) + tj + ok ];   // A[i][k]
    Bs[ti][tj] = Bkj[ nj*(ti+ok) + gj ];        // B[k][j]
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int k = 0; k < BLOCK_SIZE; ++k) Csub += As[ti][k] * Bs[k][tj];
    barrier(CLK_LOCAL_MEM_FENCE);
}
Cij[ nj * ( gi ) + gj ] = Csub;

}

NOTE - the strange BLOCK_SIZE=22 is the maximum BLOCK_SIZE that fits into the max work_group_size, which is 512 on my GPU. In this code the condition BLOCK_SIZE^2 < max work_group_size must hold. 22 = int(sqrt(512)). I also tried BLOCK_SIZE=16 or 8, but it was slower than with 22.

I also tried a simple matrixMul (without using local memory), but it was even 10 times slower than numpy.dot(). I copied the code from here: http://gpgpu-computing4.blogspot.cz/2009/10/matrix-multiplication-3-opencl.html They say that even the simple version (without local memory) should run 200x faster than on the CPU. I don't understand that.

The dependence of performance on N in my case is:

N =  220 numpy 3.680 [Gflops] GPU 16.428 [Gflops] speedUp 4.464 
N =  330 numpy 4.752 [Gflops] GPU 29.487 [Gflops] speedUp 6.205 
N =  440 numpy 4.914 [Gflops] GPU 37.096 [Gflops] speedUp 7.548 
N =  550 numpy 3.849 [Gflops] GPU 47.019 [Gflops] speedUp 12.217 
N =  660 numpy 5.251 [Gflops] GPU 49.999 [Gflops] speedUp 9.522 
N =  770 numpy 4.565 [Gflops] GPU 48.567 [Gflops] speedUp 10.638 
N =  880 numpy 5.452 [Gflops] GPU 44.444 [Gflops] speedUp 8.152 
N =  990 numpy 4.976 [Gflops] GPU 42.187 [Gflops] speedUp 8.478 
N = 1100 numpy 5.324 [Gflops] GPU 83.187 [Gflops] speedUp 15.625 
N = 1210 numpy 5.401 [Gflops] GPU 57.147 [Gflops] speedUp 10.581 
N = 1320 numpy 5.450 [Gflops] GPU 48.936 [Gflops] speedUp 8.979  

NOTE - the "Gflops" number is obtained as N^3/time and it does include time required to copy results from GPU to main memory, but this time is just few percent of total time especially for N>1000

Maybe more illustrative are the times in seconds:

N =  220 numpy 0.003 [s] GPU 0.001 [s] load 0.001 [s] speedUp 5.000 
N =  330 numpy 0.008 [s] GPU 0.001 [s] load 0.001 [s] speedUp 7.683 
N =  440 numpy 0.017 [s] GPU 0.002 [s] load 0.001 [s] speedUp 7.565 
N =  550 numpy 0.043 [s] GPU 0.004 [s] load 0.001 [s] speedUp 11.957 
N =  660 numpy 0.055 [s] GPU 0.006 [s] load 0.002 [s] speedUp 9.298 
N =  770 numpy 0.100 [s] GPU 0.009 [s] load 0.003 [s] speedUp 10.638 
N =  880 numpy 0.125 [s] GPU 0.010 [s] load 0.000 [s] speedUp 12.097 
N =  990 numpy 0.195 [s] GPU 0.015 [s] load 0.000 [s] speedUp 12.581 
N = 1100 numpy 0.250 [s] GPU 0.031 [s] load 0.000 [s] speedUp 8.065 
N = 1210 numpy 0.328 [s] GPU 0.031 [s] load 0.000 [s] speedUp 10.581 
N = 1320 numpy 0.422 [s] GPU 0.047 [s] load 0.000 [s] speedUp 8.979
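To make the convention explicit, here is a tiny helper (my restatement, not an additional measurement) of how the Gflops numbers above are obtained; with N = 1210 and 0.031 s it reproduces the ~57 Gflops row:

def gflops(N, seconds):
    # N**3 operations per multiplication, per the note above; a common
    # alternative convention counts 2*N**3 (one multiply + one add per term).
    return N**3 / float(seconds) / 1e9

print(gflops(1210, 0.031))  # ~57.1, matching the N = 1210 row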

I was thinking that maybe some speed improvement could be obtained using async_work_group_copy or even read_imageui to copy blocks to local memory. But I don't understand why I see such a big difference when I'm using basically the same code as people who say they get a 200x speedup.


Source: (StackOverflow)

How to explicitly set the work dimension in pyopencl?

When writing an OpenCL host program in C++, we use the following API to call an OpenCL kernel:

cl_int clEnqueueNDRangeKernel(cl_command_queue command_queue,
                              cl_kernel kernel,
                              cl_uint work_dim,
                              const size_t *global_work_offset,
                              const size_t *global_work_size,
                              const size_t *local_work_size,
                              cl_uint num_events_in_wait_list,
                              const cl_event *event_wait_list,
                              cl_event *event)

Here the third argument sets the work dimension. Whereas in Python, using pyopencl, we call the kernel as an attribute of the program:

<program_name>.<kernel_name>( <command_queue>, <Global_work_size>,
                              <Local_work_size>, <Parameters_to_kernel.....> )

for example:

event = program.square( queue, A.shape, None, 
                        A_buf, B_buf, cl.LocalMemory( A.size), np.int32(COUNT) )

So how do I set "work_dim" explicitly in Python using pyopencl?
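For illustration, a small self-contained sketch (not from the question) showing that the work dimension follows from the length of the global-size tuple: a 1-tuple gives work_dim = 1, a 2-tuple gives work_dim = 2, and so on.

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, """
__kernel void fill(__global int *out, const int w)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    out[y * w + x] = y * w + x;
}
""").build()

w, h = 64, 32
out = np.empty((h, w), dtype=np.int32)
out_buf = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, out.nbytes)

# The 2-tuple global size (w, h) implies work_dim = 2; passing (w*h,)
# instead would launch a 1-dimensional range.
prg.fill(queue, (w, h), None, out_buf, np.int32(w))
cl.enqueue_copy(queue, out, out_buf)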


Source: (StackOverflow)

Passing struct with pointer members to OpenCL kernel using PyOpenCL

Let's suppose I have a kernel to compute the element-wise sum of two arrays. Rather than passing a, b, and c as three parameters, I make them structure members as follows:

typedef struct
{
    __global uint *a;
    __global uint *b;
    __global uint *c;
} SumParameters;

__kernel void compute_sum(__global SumParameters *params)
{
    uint id = get_global_id(0);
    params->c[id] = params->a[id] + params->b[id];
    return;
}

There is information on structures if you RTFM of PyOpenCL [1], and others have addressed this question too [2] [3] [4]. But none of the OpenCL struct examples I've been able to find have pointers as members.

Specifically, I'm worried about whether host/device address spaces match, and whether host/device pointer sizes match. Does anyone know the answer?

[1] http://documen.tician.de/pyopencl/howto.html#how-to-use-struct-types-with-pyopencl

[2] Struct Alignment with PyOpenCL

[3] http://enja.org/2011/03/30/adventures-in-opencl-part-3-constant-memory-structs/

[4] http://acooke.org/cute/Somesimple0.html


Source: (StackOverflow)

pyopencl error: undefined symbol: clCreateSubDevices

I have installed pyopencl. When I run:

 python /home/a/pyopencl/examples/benchmark.py

I get this error:

 Traceback (most recent call last):
   File "/home/a/pyopencl/examples/benchmark.py", line 6, in <module>
     import pyopencl as cl
   File "/usr/local/lib/python2.7/dist-packages/pyopencl-2013.3-py2.7-linux-x86_64.egg/pyopencl/__init__.py", line 28, in <module>
     import pyopencl._cl as _cl
 ImportError: /usr/local/lib/python2.7/dist-packages/pyopencl-2013.3-py2.7-linux-x86_64.egg/pyopencl/_cl.so: undefined symbol: clCreateSubDevices

How can I solve it?

TIA

 Python 2.7.5+ (default, Sep 19 2013, 13:48:49) 
 [GCC 4.8.1] on linux2
 Linux zalman 3.11.0-15-generic #23-Ubuntu SMP Mon Dec 9 18:17:04 UTC 2013 x86_64   
 x86_64 x86_64 GNU/Linux
+------------------------------------------------------+                       
| NVIDIA-SMI 4.304.88   Driver Version: 304.88         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 770          | 0000:01:00.0     N/A |                  N/A |
| 25%   29C  N/A     N/A /  N/A |   7%  149MB / 2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
+-----------------------------------------------------------------------------+
 ls /etc/OpenCL/vendors
 nvidia.icd



 locate libOpenCL.so


 /etc/alternatives/x86_64-linux-gnu_libOpenCL.so
 /etc/alternatives/x86_64-linux-gnu_libOpenCL.so_lib32
 /usr/lib/libOpenCL.so
 /usr/lib/x86_64-linux-gnu/libOpenCL.so
 /usr/lib32/libOpenCL.so

  nvcc --version
  nvcc: NVIDIA (R) Cuda compiler driver
  Copyright (c) 2005-2012 NVIDIA Corporation
  Built on Fri_Sep_21_17:28:58_PDT_2012
  Cuda compilation tools, release 5.0, V0.2.1221

Source: (StackOverflow)

Why doesn't OpenCL run on my GPU? (Ubuntu)

I've set up pyopencl on my laptop by getting python-pyopencl from multiverse and installing the AMD APP SDK. To get the Nvidia ICDs I reinstalled the latest Nvidia driver from the driver manager.

My system is a ThinkPad T540p, i7-4700HQ, Nvidia GT 730M, 64-bit Ubuntu 14.04.

To test the opencl installation I ran this pyopencl example: http://wiki.tiker.net/PyOpenCL/Examples/MatrixMultiply

Unfortunately the performance is very bad: only 2 GFlop/s. Surely the laptop can do better. So I printed the vendor information. It's "GenuineIntel"; apparently the kernel is not run on the GPU but on the CPU. How can I change that?

It seems like pyopencl doesn't find the GPU.

for dev in ctx.devices:
    print dev.vendor

this returns only "GenuineIntel"

The context is created with:

import pyopencl as cl
ctx=cl.create_some_context()
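As a diagnostic, a short snippet (not from the question) that lists every platform and device the OpenCL runtime exposes, which shows whether the GT 730M is visible at all:

import pyopencl as cl

for platform in cl.get_platforms():
    print(platform.name + " / " + platform.vendor)
    for dev in platform.get_devices():
        print("    " + dev.name + " [" + cl.device_type.to_string(dev.type) + "]")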

UPDATE:

This seems to be a duplicate of: ERROR: pyopencl: creating context for specific device


Source: (StackOverflow)

OpenCL - Why Use READ_ONLY or WRITE_ONLY Buffers

In OpenCL, are there any performance benefits to flagging buffers as READ_ONLY or WRITE_ONLY?

This kernel is what I often see (a is READ_ONLY and b is WRITE_ONLY):

__kernel void two_buffer_double(__global float* a, __global float* b)
{
    int i = get_global_id(0);
    b[i] = a[i] * 2;
}

This kernel seems better, because it uses less global memory (a is READ_WRITE):

__kernel void one_buffer_double(__global float* a)
{
    int i = get_global_id(0);
    a[i] = a[i] * 2;
}

Do READ_ONLY and WRITE_ONLY flags just exist to help with debugging and catching errors?
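For reference, a host-side sketch in PyOpenCL (buffer names are mine) showing where those flags actually appear; the kernels above only show the device side:

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

a = np.arange(16, dtype=np.float32)
# Buffer the kernel only reads from
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
# Buffer the kernel only writes to
b_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)
# Buffer the kernel both reads and writes (the one_buffer_double case)
ab_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)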


Source: (StackOverflow)

Re-utilizing OpenCL memory

Hello everybody. As an example, consider two OpenCL kernels: one kernel, let us say, is add and the other is sub.

The add kernel is:

__kernel void add(global int *output1, global int *input1, global int *input2
    /* Put other parameters here */
    )
{
    int i = get_global_id(0);
    output1[i] = input1[i] + input2[i];
}

The sub kernel is:

__kernel void sub(global int *output2, global int *input1, global int *input2
    /* Put other parameters here */
    )
{
    int i = get_global_id(0);
    output2[i] = input1[i] - input2[i];
}

For these two kernels, whose two inputs are the same, I need to copy the same inputs (input1 & input2) twice to the device from host memory, and that may add some cost in terms of performance.
Is there any way I can copy the data once and re-use it in any kernel until I release the memory?
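For illustration, a hedged host-side sketch (buffer and source names are mine): the shared inputs are copied to the device once and the same buffers are passed to both kernels.

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

n = 1024
input1 = np.arange(n, dtype=np.int32)
input2 = np.arange(n, dtype=np.int32)

# Copy the shared inputs to the device once
in1_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=input1)
in2_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=input2)
out1_buf = cl.Buffer(ctx, mf.WRITE_ONLY, input1.nbytes)
out2_buf = cl.Buffer(ctx, mf.WRITE_ONLY, input1.nbytes)

prg = cl.Program(ctx, ADD_AND_SUB_SOURCE).build()  # the two kernels above, concatenated
# Both kernels re-use the same input buffers; they stay valid until released.
prg.add(queue, (n,), None, out1_buf, in1_buf, in2_buf)
prg.sub(queue, (n,), None, out2_buf, in1_buf, in2_buf)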


Source: (StackOverflow)