Using feval() to launch CUDA kernels from MATLAB

Crni on 13 Dec 2017
Commented: Crni on 14 Dec 2017
I'm using feval() for rapid prototyping of tests for my CUDA kernels, written in CUDA C. It works like a charm, but I have two questions:
1. The explanation of the treatment of input/output kernel arguments in the feval() documentation leaves me wondering whether there is any data-copying overhead involved. Namely, let's say that I have the following kernel implemented in the file addAssign.cu:
__global__ void addAssign(int const n, float* __restrict__ y, float const* __restrict__ x)
{
/* do y[i] += x[i] here */
}
and that I call it from MATLAB as follows:
n = 10000;
x = gpuArray(single(rand(n, 1)));
y = gpuArray(single(rand(n, 1)));
kernel = parallel.gpu.CUDAKernel('addAssign.ptx', 'addAssign.cu');
kernel.ThreadBlockSize = [128, 1, 1];
kernel.GridSize = [ceil(n / 128), 1, 1];
y = feval(kernel, n, y, x);
So, is the MATLAB runtime going to do any data copying for the assignment in the last statement, or is it going to recognize that 'y' appears both as input and output, and that the kernel will update it in place? I'm asking because I've noticed that if I omit the assignment, the values of 'y' don't get updated.
2. How can I use gputimeit() with feval() run as above? It seems like gputimeit() discards the return values from the function passed as its argument.
Thanks.

Answers (1)

Edric Ellis on 14 Dec 2017
1. Input/output variables to CUDAKernel.feval are operated on in place in the usual way for MATLAB data. See this blog entry for more details. In other words, provided the type of y exactly matches the prototype of your kernel (i.e. it doesn't need to be cast to a different numeric type), and provided that it appears on the left-hand side too, it is eligible for in-place operation.
2. You could use the wait method of GPUDevice rather than gputimeit to ensure you're timing exactly what you expect. (Behind the scenes, gputimeit uses the wait method to ensure execution has completed).
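One way to combine gputimeit with feval is to wrap the call in an anonymous function handle; a minimal sketch, reusing the variables from the question (the optional second argument of gputimeit, the number of outputs to request, is assumed to be available in your release):

```matlab
% Time the kernel launch; gputimeit synchronizes the device internally.
% The handle captures kernel, n, y, and x from the workspace.
f = @() feval(kernel, n, y, x);
t = gputimeit(f, 1);  % request 1 output so the assignment path is timed too
fprintf('kernel time: %g s\n', t);
```

Note that gputimeit discards the outputs of f, so this only measures the timing; run feval separately afterwards to obtain the updated y.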
  1 Comment
Crni on 14 Dec 2017
Thanks for your reply. This is the behavior I was hoping for, but I was not aware of the additional requirement that in-place operation is possible only if the given piece of MATLAB code is within a function. Let me present a full example. Here is the kernel code; let's say we put it in the file addAssign.cu:
__global__ void addAssign(const int n, float* __restrict__ y,
                          const float* __restrict__ x)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int incr = gridDim.x * blockDim.x;
    if (idx == 0)
        printf("%p %p\n", x, y);
    for ( ; idx < n; idx += incr)
        y[idx] += x[idx];
}
And here is the MATLAB script:
n = 1000;
x = gpuArray(single(ones(n, 1)));
y = gpuArray(single(ones(n, 1)));
kernel = parallel.gpu.CUDAKernel('addAssign.ptx', 'addAssign.cu');
kernel.ThreadBlockSize = [128, 1, 1];
kernel.GridSize = [ceil(n / 128), 1, 1];
y = feval(kernel, n, y, x);
y = feval(kernel, n, y, x);
The idea is that the kernel prints the pointers, so that across successive kernel calls we can check whether there was any re-allocation. Now, if we compile the kernel with:
nvcc -ptx addAssign.cu
and run the script, it prints something like:
0x7fa6f0a00000 0x7fa6f0a20000
0x7fa6f0a00000 0x7fa6f0a10000
which means that y gets re-allocated. However, if I put the MATLAB code above in a function and call that function, the output is:
0x7fa6f0a20000 0x7fa6f0a30000
0x7fa6f0a20000 0x7fa6f0a30000
which means y is used in place. I don't quite understand why this restriction exists, but in any case it's good to know that in-place operation is possible.
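For completeness, here is a sketch of the function-wrapped version that produced the matching pointer output (the function name runAddAssign is my own choice):

```matlab
function y = runAddAssign()
% Same code as the script above, wrapped in a function. Inside a
% function, y is eligible for in-place updating, so both feval calls
% should reuse the same device buffer (the kernel prints identical
% pointers on both calls).
n = 1000;
x = gpuArray(single(ones(n, 1)));
y = gpuArray(single(ones(n, 1)));
kernel = parallel.gpu.CUDAKernel('addAssign.ptx', 'addAssign.cu');
kernel.ThreadBlockSize = [128, 1, 1];
kernel.GridSize = [ceil(n / 128), 1, 1];
y = feval(kernel, n, y, x);
y = feval(kernel, n, y, x);
end
```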
As far as measuring elapsed time is concerned, I guess what you suggested is replacing the feval() call in the script above with something like the following:
device = gpuDevice;
tic;
y = feval(kernel, n, y, x);
device.wait();
toc;
That's fine, but I was hoping that gputimeit() uses high-precision CUDA timers. Is that correct, and if so, is there a way to use gputimeit() in combination with feval()?

