significant increase of memory when moving part of the code to GPU

Hi all,

I am experimenting a bit with MATLAB (R2014b) and a GPU (Tesla 2075). I am puzzled by a significant increase in memory usage after I "moved" the innermost loop of my code to the GPU. I am by no means an expert, and I'm possibly doing something wrong.

So my code is basically a wrapper for a function that integrates a set of coupled differential equations. The innermost loop iterates a Runge-Kutta integration a few hundred times. A fair number of ffts and iffts are involved, so I thought that moving that part to the GPU would speed up my code. I turned all the auxiliary vectors in the four RK steps into gpuArrays. When the innermost loop has finished, I gather only the gpuArray containing the state of my system, and leave all the auxiliary stuff on the GPU, ready for the next loop. It turns out that the speed does increase, for sufficiently large systems. However, this apparently comes at the price of a significant increase in memory.
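To make the structure clear, here is a minimal sketch of the pattern I mean. The names (rhs, Gz, the placeholder right-hand side) are illustrative only, not my actual code:

```matlab
% Sketch: move the state to the GPU once, iterate RK4 there,
% and gather only the state when the innermost loop is done.
L  = 128;
z0 = complex(rand(L), rand(L));            % initial state (CPU side)
Gz = gpuArray(z0);                         % state now lives on the GPU
dt = 1e-3;
rhs = @(z) ifft2(-(abs(z).^2) .* fft2(z)); % placeholder RHS with ffts

for n = 1:500                              % innermost loop, all on the GPU
    K1 = rhs(Gz);
    K2 = rhs(Gz + 0.5*dt*K1);
    K3 = rhs(Gz + 0.5*dt*K2);
    K4 = rhs(Gz + dt*K3);
    Gz = Gz + (dt/6)*(K1 + 2*K2 + 2*K3 + K4);
end

z = gather(Gz);                            % bring back only the state
```

The auxiliary arrays K1..K4 become gpuArrays automatically because they are computed from one, so only the initial transfer and the final gather cross the PCIe bus.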

The machine I'm using is on a cluster managed by HTCondor. I have noticed that the "GPU version" of my code uses way more memory than the "CPU version". The situation according to condor_q and top is the following:

        SIZE (condor_q)   VIRT (top)   RES (top)   SHR (top)
 GPU    73242.2           67,775g      468112      129700
 CPU    3418.0            3277324      186324      77264

The readings from top should be in KiB, those from condor_q in Kbytes.
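For anyone who wants to reproduce these readings outside of top, something along these lines works from the shell (on Linux, ps reports vsz and rss in KiB):

```shell
# Report virtual (VSZ) and resident (RSS) memory, in KiB, for a process.
# $$ is the current shell's PID; substitute the PID of the MATLAB job.
ps -o vsz=,rss= -p $$
```

The trailing `=` in the format specifiers suppresses the header line, which makes the output easier to parse in scripts.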

Update: in order to check whether this behavior was caused by the queuing system (HTCondor), I submitted one instance of my code directly on the node of our cluster that has the GPU, using nohup. The job is now running in background, but the figures from "top" are basically the same as above for GPU.

Is such a memory increase to be expected? Am I missing something?

Thanks a lot for your help

Francesco

3 Comments

So, I prepared some small self-contained examples of what I'm talking about. You can find them in the zip archive I'm attaching.
The function "test_CPU.m" is a bare-bones version of the function I use in my code. The two figures output every "snapshot" iterations are checks on conserved quantities (they should be zero, in principle).
The function "test_GPU.m" is, in my intention, the GPU version of the previous function. As you can see, I create a gpuArray for the state of the system (Gz0) and for the first auxiliary matrix (GeD). The remaining auxiliary matrices in the Runge-Kutta loop (K, K1, K2, K3, K4) should be gpuArrays themselves.
I did not create gpuArrays for (real or complex) scalars because somewhere I read that it is not necessary (can someone confirm?).
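A quick way to check what happens to plain scalars is something like this (a sketch, not from my actual code):

```matlab
% Sketch: mixing CPU-side scalars with a gpuArray.
g = gpuArray.rand(4);     % data on the GPU
h = 2.5 * g + 1i;         % plain scalars, real and complex
isa(h, 'gpuArray')        % should be true if scalars are promoted automatically
```

If the result stays a gpuArray, wrapping scalars explicitly would indeed be unnecessary.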
The function test_both.m runs both of the above functions and compares the results, just to make sure that they work in the same way. They do.
The functions test_GPU_driver.m and test_CPU_driver.m are simply drivers that I use to submit the codes in the background and to measure execution time. For L=128, as in the current settings, the GPU function is about 2/3 faster than the CPU function.
For L=64 the GPU version is slower, but I guess that for larger sizes it becomes more and more convenient (assuming the memory is enough).
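The timing comparison in the drivers is essentially of this form (a sketch using timeit/gputimeit, which synchronize the device and are therefore safer than raw tic/toc around asynchronous GPU calls):

```matlab
% Sketch: comparing one fft2-based step on CPU vs GPU.
% gputimeit waits for the device, so it times the actual work.
L = 128;
A = rand(L);
G = gpuArray(A);
tCPU = timeit(@() ifft2(fft2(A).^2));
tGPU = gputimeit(@() ifft2(fft2(G).^2));
fprintf('CPU: %.3g s, GPU: %.3g s\n', tCPU, tGPU);
```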
But the point here is exactly memory. I would expect that the two versions employ roughly the same memory. Instead, here is what I get from top
        VIRT        RES       SHR
CPU     2403588     176768    74492
GPU     66,961g     459856    127368
Can someone comment on this? Is this expected?
As I mention above, when launched from the queuing system the memory requirements of the GPU version are 20 times larger than those of the CPU version. This actually refers to my full function, not the minimal examples above. However, the differences between the CPU and GPU versions of my full function are exactly as in the examples.
Thanks a lot for any insight
I have also seen strange behavior regarding GPU memory in MATLAB; check here.
No convincing answer yet.
Despite loving MATLAB a lot, I am almost giving up on GPU computing in MATLAB. I don't get enough speedup, it seems to eat through memory, and I am forced to run much smaller programs.
If I write my kernels in CUDA-C and call them from MATLAB, I get better results, though. The set of built-in functions that accept gpuArrays is not bad either.
Mohammad,
thanks for your thoughts. I took a look at the post you mention. If I get it right, the problem affects a GPU that is used for both display and computing. In my case it's a separate GPU, dedicated to computing. I did not do the check you did, but I want to try as soon as I have the chance.
I know nothing of CUDA-C. I wish I had time to learn it. Right now I was experimenting with MATLAB, and actually changing a few lines of code gave me a significant speedup. That alone would be terrific, if it weren't for the lurking memory "problem".
In your other post you mention device resetting. I have to say I am not doing that. When should it be done? Could you point me to the documentation you mention in your post?
Thanks a lot
Francesco


 Accepted Answer

When you move the code to the GPU, MATLAB loads a suite of supporting CUDA libraries to provide implementations of fft etc. I believe this is the primary cause of the host-side memory increase you're seeing. The CUDA libraries supporting gpuArray are large because they contain specialised variants of many different algorithms, and support many different GPU hardware variants. On my system, I see the large increase in VSZ simply by invoking gpuDevice:
>> !ps -C MATLAB -O vsz,rsz
PID VSZ RSZ S TTY TIME COMMAND
4965 1865924 581852 S pts/4 00:00:31 /local/MATLAB/R2015a/bin/glnxa64/MATLAB
>> gpuDevice;
>> !ps -C MATLAB -O vsz,rsz
PID VSZ RSZ S TTY TIME COMMAND
4965 44297124 764552 S pts/4 00:00:32 /local/MATLAB/R2015a/bin/glnxa64/MATLAB
>> fft2(gpuArray.rand(2048));
>> !ps -C MATLAB -O vsz,rsz
PID VSZ RSZ S TTY TIME COMMAND
4965 44362660 830340 S pts/4 00:00:33 /local/MATLAB/R2015a/bin/glnxa64/MATLAB

4 Comments

Edric,
thanks a lot for your reply. It reassures me a bit.
So this increase of memory is to be expected? Nothing to worry about?
I used ps as you did, though from the shell rather than the interactive workspace. The figures below refer to my full codes (not the minimal examples), running in the background:

        VSZ         RSZ
GPU     71112568    463468
CPU     3207024     182084

While RSZ increases roughly by a factor of 2.5, VSZ is more than 20 times bigger for the GPU. That seems a lot.
From the examples I attached, you can see that I'm using only ffts, iffts and elementwise multiplications of matrices.
If this huge increase was explained by the loading of the CUDA libraries, wouldn't we end up using the same amount of memory? I mean, the difference between GPU and CPU for both your case and mine should be roughly the same.
I realize that, for a meaningful comparison, I should do exactly what you do. I plan to do that as soon as I am able to use that machine in interactive mode when it is reasonably free. Right now it is packed with running jobs of other people.
I still have a few doubts. Correct me if I'm wrong here, please.
The increase of memory due to the loading of libraries should be roughly fixed, so it should become less important as I increase the size of the system I'm addressing, right?
It is not entirely clear to me where these libraries are going, though. They are needed for the calculation, so I guess they're going in the GPU memory. This means that the memory available to my calculations is less than I expect. Is this correct?
Also, I see that the worst increase happens for the virtual memory, so it should not cause too much worry.
All of this may sound obvious, but I'm asking all the same because I'm not very expert in these matters.
Thanks again for your assistance.
Francesco
Ok, I was able to repeat what you did
>> !ps -C MATLAB -O vsz,rsz
PID VSZ RSZ S TTY TIME COMMAND
41447 776512 165672 S pts/3 00:00:02 /usr/local/MATLAB/R2014b ...
>> gpuDevice;
>> !ps -C MATLAB -O vsz,rsz
PID VSZ RSZ S TTY TIME COMMAND
41447 68526016 342684 S pts/3 00:00:03 /usr/local/MATLAB/R2014b ...
>> fft2(gpuArray.rand(2048));
>> !ps -C MATLAB -O vsz,rsz
PID VSZ RSZ S TTY TIME COMMAND
41447 68528236 400968 S pts/3 00:00:04 /usr/local/MATLAB/R2014b ...
These are not exactly the same numbers you get... which is a bit weird. But I think I get the picture.
I have two more questions
1) In my code I do not have an instruction
gpuDevice;
Is that necessary/useful?
2) I wonder whether it would be possible/convenient to load only selected libraries. For instance I need only fft and elementwise matrix multiplication.
Thanks a lot for any insight
Francesco
Yes, the increase in VSZ should hopefully not cause you too many problems in practice. The shared libraries mostly consume host-side memory, although some GPU memory is needed to load the specific device code.
To answer your subsequent questions:
  1. You do not need to add a call to gpuDevice to your code - I used that simply to force MATLAB to load the GPU libraries
  2. Unfortunately you cannot selectively load the GPU libraries - they all get loaded as soon as you use any GPU functionality.
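If you want an idea of how much device-side memory the loaded device code consumes, you can compare the device's free memory before and after first use. A sketch (note the free-memory property is AvailableMemory in recent releases; older releases called it FreeMemory):

```matlab
% Sketch: device-side memory before and after the fft device code loads.
d = gpuDevice;                      % select the device
before = d.AvailableMemory;         % bytes free on the device
fft2(gpuArray.rand(2048));          % force the fft device code to load
wait(d);                            % make sure the work has finished
after = gpuDevice().AvailableMemory;
fprintf('Device memory consumed: %g MB\n', (before - after)/2^20);
```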
Thanks Edric, this was really useful.
I tried my code on a larger lattice (L=256) and the memory requirements remained more or less the same, especially as reported by HTCondor. This should confirm that -- for what I'm doing now -- most of the memory is indeed used for the libraries.
Also, it seems to me that the code is running much faster.
This makes me happy! :)



Asked: pfb on 11 Apr 2015
Commented: pfb on 14 Apr 2015
