significant increase of memory when moving part of the code to GPU

Hi all,

I am experimenting a bit with MATLAB (R2014b) and a GPU (Tesla 2075). I am puzzled by a significant increase in memory usage after I "moved" the innermost loop of my code to the GPU. I am by no means an expert, and I'm possibly doing something wrong.

So my code is basically a wrapper for a function that integrates a set of coupled differential equations. The innermost loop iterates a Runge-Kutta integration a few hundred times. A fair number of ffts and iffts are involved, so I thought that moving that part to the GPU would speed up my code. I turned all the auxiliary vectors in the four RK steps into gpuArrays. When the innermost loop has finished, I gather only the gpuArray containing the state of my system, and leave all the auxiliary stuff on the GPU, ready for the next loop. It turns out that the speed does increase, for sufficiently large systems. However, this apparently comes at the price of a significant increase in memory.
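To make the structure clear, here is a minimal sketch of the pattern I mean. The names (rhs, Gz, the placeholder right-hand side) are illustrative only, not my actual code:

```matlab
% Sketch: move the state to the GPU once, iterate RK4 there,
% and gather only the state when the innermost loop is done.
L  = 128;
z0 = complex(rand(L), rand(L));            % initial state (CPU side)
Gz = gpuArray(z0);                         % state now lives on the GPU
dt = 1e-3;
rhs = @(z) ifft2(-(abs(z).^2) .* fft2(z)); % placeholder RHS with ffts

for n = 1:500                              % innermost loop, all on the GPU
    K1 = rhs(Gz);
    K2 = rhs(Gz + 0.5*dt*K1);
    K3 = rhs(Gz + 0.5*dt*K2);
    K4 = rhs(Gz + dt*K3);
    Gz = Gz + (dt/6)*(K1 + 2*K2 + 2*K3 + K4);
end

z = gather(Gz);                            % bring back only the state
```

The auxiliary arrays K1..K4 become gpuArrays automatically because they are computed from one, so only the initial transfer and the final gather cross the PCIe bus.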

The machine I'm using is on a cluster managed by HTCondor. I have noticed that the "GPU version" of my code uses way more memory than the "CPU version". The situation according to condor_q and top is the following:

        SIZE (condor_q)   VIRT (top)   RES (top)   SHR (top)
 GPU    73242.2           67,775g      468112      129700
 CPU    3418.0            3277324      186324      77264

The readings from top should be in KiB, those from condor_q in Kbytes.
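For anyone who wants to reproduce these readings outside of top, something along these lines works from the shell (on Linux, ps reports vsz and rss in KiB):

```shell
# Report virtual (VSZ) and resident (RSS) memory, in KiB, for a process.
# $$ is the current shell's PID; substitute the PID of the MATLAB job.
ps -o vsz=,rss= -p $$
```

The trailing `=` in the format specifiers suppresses the header line, which makes the output easier to parse in scripts.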

Update: in order to check whether this behavior was caused by the queuing system (HTCondor), I submitted one instance of my code directly on the node of our cluster that has the GPU, using nohup. The job is now running in background, but the figures from "top" are basically the same as above for GPU.

Is such a memory increase to be expected? Am I missing something?

Thanks a lot for your help

Francesco

3 Comments

So, I prepared some small self-contained examples of what I'm talking about. You can find them in the zip archive I'm attaching.
The function "test_CPU.m" is a bare-bones version of the function I use in my code. The two figures output every "snapshot" iterations are checks on conserved quantities (they should be zero, in principle).
The function "test_GPU.m" is, in my intention, the GPU version of the previous function. As you can see, I create a gpuArray for the state of the system (Gz0) and for the first auxiliary matrix (GeD). The remaining auxiliary matrices in the Runge-Kutta loop (K, K1, K2, K3, K4) should be gpuArrays themselves.
I did not create gpuArrays for (real or complex) scalars because somewhere I read that it is not necessary (can someone confirm?).
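A quick way to check what happens to plain scalars is something like this (a sketch, not from my actual code):

```matlab
% Sketch: mixing CPU-side scalars with a gpuArray.
g = gpuArray.rand(4);     % data on the GPU
h = 2.5 * g + 1i;         % plain scalars, real and complex
isa(h, 'gpuArray')        % should be true if scalars are promoted automatically
```

If the result stays a gpuArray, wrapping scalars explicitly would indeed be unnecessary.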
The function test_both.m runs both of the above functions and compares the results, just to make sure that they work in the same way. They do.
The functions test_GPU_driver.m and test_CPU_driver.m are simply drivers that I use to submit the codes in the background and to measure execution time. For L=128, as in the current settings, the GPU function is about 2/3 faster than the CPU function.
For L=64 the GPU version is slower, but I guess that for larger sizes it becomes more and more convenient (assuming the memory is enough).
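The timing comparison in the drivers is essentially of this form (a sketch using timeit/gputimeit, which synchronize the device and are therefore safer than raw tic/toc around asynchronous GPU calls):

```matlab
% Sketch: comparing one fft2-based step on CPU vs GPU.
% gputimeit waits for the device, so it times the actual work.
L = 128;
A = rand(L);
G = gpuArray(A);
tCPU = timeit(@() ifft2(fft2(A).^2));
tGPU = gputimeit(@() ifft2(fft2(G).^2));
fprintf('CPU: %.3g s, GPU: %.3g s\n', tCPU, tGPU);
```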
But the point here is exactly memory. I would expect that the two versions employ roughly the same memory. Instead, here is what I get from top
        VIRT        RES       SHR
CPU     2403588     176768    74492
GPU     66,961g     459856    127368
Can someone comment on this? Is this expected?
As I mention above, when launched from the queuing system the memory requirements of the GPU version are 20 times larger than those of the CPU version. This actually refers to my full function, not the minimal examples above. However, the differences between the CPU and GPU versions of my full function are exactly as in the examples.
Thanks a lot for any insight
I have also seen strange behavior regarding GPU memory in MATLAB; check here.
No convincing answer yet.
Despite loving MATLAB a lot, I am almost giving up on GPU computing in MATLAB. I don't get enough speedup, it seems to eat through memory, and I am forced to run much smaller programs.
If I write my kernels in CUDA-C and call them from MATLAB, I get better results, though. The set of built-in functions that accept gpuArrays is not bad either.
Mohammad,
thanks for your thoughts. I took a look at the post you mention. If I get it right, the problem affects a GPU that is used for both display and computing. In my case it's a separate GPU, dedicated to computing. I did not do the check you did, but I want to try as soon as I have the chance.
I know nothing of CUDA-C. I wish I had time to learn it. Right now I was experimenting with MATLAB, and actually changing a few lines of code gave me a significant speedup. That alone would be terrific, if it weren't for the lurking memory "problem".
In your other post you mention device resetting. I have to say I am not doing that. When should it be done? Could you point me to the documentation you mention in your post?
Thanks a lot
Francesco


 Accepted Answer

When you move the code to the GPU, MATLAB loads a suite of supporting CUDA libraries to provide implementations of fft etc. I believe this is the primary cause of the host-side memory increase you're seeing. The CUDA libraries supporting gpuArray are large because they contain specialised variants of many different algorithms, and support many different GPU hardware variants. On my system, I see the large increase in VSZ simply by invoking gpuDevice:
>> !ps -C MATLAB -O vsz,rsz
PID VSZ RSZ S TTY TIME COMMAND
4965 1865924 581852 S pts/4 00:00:31 /local/MATLAB/R2015a/bin/glnxa64/MATLAB
>> gpuDevice;
>> !ps -C MATLAB -O vsz,rsz
PID VSZ RSZ S TTY TIME COMMAND
4965 44297124 764552 S pts/4 00:00:32 /local/MATLAB/R2015a/bin/glnxa64/MATLAB
>> fft2(gpuArray.rand(2048));
>> !ps -C MATLAB -O vsz,rsz
PID VSZ RSZ S TTY TIME COMMAND
4965 44362660 830340 S pts/4 00:00:33 /local/MATLAB/R2015a/bin/glnxa64/MATLAB

4 Comments

Edric,
thanks a lot for your reply. It reassures me a bit.
So this increase of memory is to be expected? Nothing to worry about?
I used ps as you did, though from the shell rather than the interactive workspace. The figures below refer to my full codes (not the minimal examples), running in the background:

        VSZ         RSZ
GPU     71112568    463468
CPU     3207024     182084

While RSZ increases roughly by a factor of 2.5, VSZ is more than 20 times bigger for the GPU. That seems a lot.
From the examples I attached, you can see that I'm using only ffts, iffts and elementwise multiplications of matrices.
If this huge increase was explained by the loading of the CUDA libraries, wouldn't we end up using the same amount of memory? I mean, the difference between GPU and CPU for both your case and mine should be roughly the same.
I realize that, for a meaningful comparison, I should do exactly what you do. I plan to do that as soon as I am able to use that machine in interactive mode when it is reasonably free. Right now it is packed with running jobs of other people.
I still have a few doubts. Correct me if I'm wrong here, please.
The increase of memory due to the loading of libraries should be roughly fixed, so it should become less important as I increase the size of the system I'm addressing, right?
It is not entirely clear to me where these libraries are going, though. They are needed for the calculation, so I guess they're going in the GPU memory. This means that the memory available to my calculations is less than I expect. Is this correct?
Also, I see that the worst increase happens for the virtual memory, so it should not cause too much worry.
All of this may sound obvious, but I'm asking all the same because I'm not very expert in these matters.
Thanks again for your assistance.
Francesco
Ok, I was able to repeat what you did
>> !ps -C MATLAB -O vsz,rsz
PID VSZ RSZ S TTY TIME COMMAND
41447 776512 165672 S pts/3 00:00:02 /usr/local/MATLAB/R2014b ...
>> gpuDevice;
>> !ps -C MATLAB -O vsz,rsz
PID VSZ RSZ S TTY TIME COMMAND
41447 68526016 342684 S pts/3 00:00:03 /usr/local/MATLAB/R2014b ...
>> fft2(gpuArray.rand(2048));
>> !ps -C MATLAB -O vsz,rsz
PID VSZ RSZ S TTY TIME COMMAND
41447 68528236 400968 S pts/3 00:00:04 /usr/local/MATLAB/R2014b ...
These are not exactly the same numbers you get... which is a bit weird. But I think I get the picture.
I have two more questions
1) In my code I do not have an instruction
gpuDevice;
Is that necessary/useful?
2) I wonder whether it would be possible/convenient to load only selected libraries. For instance I need only fft and elementwise matrix multiplication.
Thanks a lot for any insight
Francesco
Yes, the increase in VSZ should hopefully not cause you too many problems in practice. The shared libraries mostly consume host-side memory, although some GPU memory is needed to load the specific device code.
To answer your subsequent questions:
  1. You do not need to add a call to gpuDevice to your code - I used that simply to force MATLAB to load the GPU libraries
  2. Unfortunately you cannot selectively load the GPU libraries - they all get loaded as soon as you use any GPU functionality.
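If you want an idea of how much device-side memory the loaded device code consumes, you can compare the device's free memory before and after first use. A sketch (note the free-memory property is AvailableMemory in recent releases; older releases called it FreeMemory):

```matlab
% Sketch: device-side memory before and after the fft device code loads.
d = gpuDevice;                      % select the device
before = d.AvailableMemory;         % bytes free on the device
fft2(gpuArray.rand(2048));          % force the fft device code to load
wait(d);                            % make sure the work has finished
after = gpuDevice().AvailableMemory;
fprintf('Device memory consumed: %g MB\n', (before - after)/2^20);
```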
Thanks Edric, this was really useful.
I tried my code on a larger lattice (L=256) and the memory requirements remained more or less the same, especially as reported by HTCondor. This should confirm that -- for what I'm doing now -- most of the memory is indeed used for the libraries.
Also, it seems to me that the code is running much faster.
This makes me happy! :)



Asked: pfb on 11 Apr 2015
Commented: pfb on 14 Apr 2015
