GPU memory build up from multiple mex calls in loop

Question

Benjamin Ballintyn on 28 Apr 2020

0
Link

Direct link to this question

https://uk.mathworks.com/matlabcentral/answers/521253-gpu-memory-build-up-from-multiple-mex-calls-in-loop

Commented: Joss Knight on 2 May 2020

runAEVLIFNetGPU.m

Hi,

I am having memory problems when repeatedly calling the mex version (generated using GPU Coder) of the following function (also attached):

function [GsynMax,V,Vth,Isra,GsynE,GsynI,D,F,r1,r2,o1,o2] = runAEVLIFNetGPU(V,Vreset,tau_ref,Vth,Vth0,Vth_max,...
              Isra,tau_sra,a,b,VsynE,VsynI,GsynE,GsynI,GsynMax,tau_D,tau_F,f_fac,D,F,has_facilitation,has_depression,...
              p0,tau_synE,tau_synI,Cm,Gl,El,dth,Iapp,std_noise,dt,ecells,icells,spikeGenProbs,cells2record,...
              is_plastic,plasticity_type,C,r1,r2,o1,o2,A2plus,A3plus,A2minus,A3minus,...
              tau_plus,tau_x,tau_minus,tau_y,nT,spkfid) %#codegen
coder.gpu.kernelfun; % for code generation
N = size(V,1); % # of simulated neurons
nSpikeGen = length(spikeGenProbs); % # of poisson spike generator neurons
useSpikeGen = (nSpikeGen > 0); % determine if there are any spike generators
n2record = length(cells2record); % # of neurons to record
useRecord = (n2record > 0); % determine if any neurons should be recorded
nSimulatedSpikes = 0;
nGeneratedSpikes = 0;
Fmax = 1./p0;
if (~isnan(D))
    useSynDynamics = true;
else
    useSynDynamics = false;
end
if (~isnan(C))
    usePlasticity = true;
else
    usePlasticity = false;
end
% if no spike file was given, don't record any spikes
if (spkfid < 0)
    useRecord = false;
end
% Loop through nT timepoints
for i=1:nT
    
    % Update spike thresholds
    vth1 = arrayfun(@minus,Vth0,Vth); % (Vth0 - Vth)
    dVthdt = arrayfun(@rdivide,vth1,tau_ref); % (Vth0 - Vth)./tau_ref
    Vth = arrayfun(@plus,Vth,dVthdt*dt); % Vth = Vth + dt*[(Vth0 - Vth)./tau_ref]
    
    % Update adaptation currents
    sra1 = arrayfun(@minus,V,El); % (V - El)
    sra2 = arrayfun(@times,a,sra1); % a.*(V - El)
    sra3 = arrayfun(@minus,sra2,Isra); % a.*(V - El) - Isra
    dIsradt = arrayfun(@rdivide,sra3,tau_sra); % [a.*(V - El) - Isra]./tau_sra
    Isra = arrayfun(@plus,Isra,dIsradt*dt); % Isra = Isra + dt*([a.*(V - El) - Isra]./tau_sra)
    
    if (useSynDynamics)
        % update depression/facilitation variables
        dDdt = arrayfun(@rdivide,(1 - D),tau_D);
        dFdt = arrayfun(@rdivide,(1 - F),tau_F);
        D = arrayfun(@plus,D,dDdt*dt);
        F = arrayfun(@plus,F,dFdt*dt);
    end
    
    if (usePlasticity)
        % decay plasticity variables
        dr1dt = arrayfun(@rdivide,-r1,tau_plus);
        dr2dt = arrayfun(@rdivide,-r2,tau_x);
        do1dt = arrayfun(@rdivide,-o1,tau_minus);
        do2dt = arrayfun(@rdivide,-o2,tau_y);
        r1 = arrayfun(@plus,r1,dr1dt*dt);
        r2 = arrayfun(@plus,r2,dr2dt*dt);
        o1 = arrayfun(@plus,o1,do1dt*dt);
        o2 = arrayfun(@plus,o2,do2dt*dt);
    end
    
    spiked = (V > Vth); % determine simulated neurons that spiked
    
    % check if any spike generators spiked
    if (useSpikeGen)
        spikeGenSpikes = (rand(nSpikeGen,1) < spikeGenProbs);
        allSpikes = [spiked; spikeGenSpikes];
        nGeneratedSpikes = nGeneratedSpikes + sum(spikeGenSpikes);
    else
        allSpikes = spiked;
    end
    nSimulatedSpikes = nSimulatedSpikes + sum(spiked);
    
    areSimSpikes = any(spiked); % determine if there were any spikes
    areAnySpikes = any(allSpikes);
    
    if (areSimSpikes) % If there are any simulated spikes
        V(spiked) = Vreset(spiked); % reset membrane voltages of spiking neurons
        Vth(spiked) = Vth_max(spiked); % set spike threshold to max
        Isra(spiked) = arrayfun(@plus,Isra(spiked),b(spiked)); % Increment the spike rate adaptation currents
    end
    
    e_spiked = logical(allSpikes.*ecells); % all excitatory simulated and spike generating neurons that spiked
    i_spiked = logical(allSpikes.*icells); % all inhibitory simulated and spike generating neurons that spiked
    dGsynEdt = arrayfun(@rdivide,-GsynE,tau_synE); % decay in excitatory synaptic conductances
    dGsynIdt = arrayfun(@rdivide,-GsynI,tau_synI); % decay in inhibitory synaptic conductances
    
    GsynE = arrayfun(@plus,GsynE,dGsynEdt*dt);
    GsynI = arrayfun(@plus,GsynI,dGsynIdt*dt);
    if (areAnySpikes)
        if (useSynDynamics)
            % multiply release probability with maximum synaptic conductance
            dGsynE1 = bsxfun(@times,GsynMax(:,e_spiked),p0(e_spiked)');
            dGsynE2 = bsxfun(@times,dGsynE1,F(e_spiked)');
            dGsynE3 = bsxfun(@times,dGsynE2,D(e_spiked)');
            dGsynI1 = bsxfun(@times,GsynMax(:,i_spiked),p0(i_spiked)');
            dGsynI2 = bsxfun(@times,dGsynI1,F(i_spiked)');
            dGsynI3 = bsxfun(@times,dGsynI2,D(i_spiked)');
            % sum across all inputs
            dGsynE_sum = sum(dGsynE3,2);
            dGsynI_sum = sum(dGsynI3,2);
            
            % update depression/facilitation variables for neurons that spiked
            d1 = arrayfun(@times,p0(allSpikes),F(allSpikes));
            d2 = arrayfun(@times,d1,D(allSpikes));
            d3 = arrayfun(@times,d2,has_depression(allSpikes));
            D(allSpikes) = arrayfun(@minus,D(allSpikes),d3);
            f1 = arrayfun(@minus,Fmax(allSpikes),F(allSpikes));
            f2 = arrayfun(@times,f_fac(allSpikes),f1);
            f3 = arrayfun(@times,f2,has_facilitation(allSpikes));
            F(allSpikes) = arrayfun(@plus,F(allSpikes),f3);
        else
            dGsynE = bsxfun(@times,GsynMax(:,e_spiked),p0(e_spiked)');
            dGsynI = bsxfun(@times,GsynMax(:,i_spiked),p0(i_spiked)');
            dGsynE_sum = sum(dGsynE,2);
            dGsynI_sum = sum(dGsynI,2);
        end
        GsynE = arrayfun(@plus,GsynE,dGsynE_sum);
        GsynI = arrayfun(@plus,GsynI,dGsynI_sum);
    end
    
    % Compute total synaptic current for each neuron
    Isyn = arrayfun(@plus,arrayfun(@times,GsynE,arrayfun(@minus,VsynE,V)),arrayfun(@times,GsynI,arrayfun(@minus,VsynI,V)));
    
    % add noise to any input currents
    curIapp = Iapp;
    curIapp = arrayfun(@plus,curIapp,arrayfun(@times,std_noise,randn(N,1)));
    
    % update membrane voltages
    f1 = (1./Cm); % (1./Cm)
    f2 = arrayfun(@minus,El,V); % El - V
    f3 = arrayfun(@minus,V,Vth); % V - Vth
    f4 = arrayfun(@rdivide,f3,dth); % (V - Vth)/dth
    f5 = arrayfun(@exp,f4); % exp((V - Vth)/dth)
    f6 = arrayfun(@times,dth,f5); % dth.*exp((V - Vth)/dth)
    f7 = arrayfun(@plus,f2,f6); % (El - V) + dth.*exp((V - Vth)/dth)
    f8 = arrayfun(@times,Gl,f7); % Gl.*[(El - V) + dth.*exp((V - Vth)/dth)]
    f9 = arrayfun(@plus,f8,Isyn); % Gl.*[(El - V) + dth.*exp((V - Vth)/dth)] + Isyn
    f10 = arrayfun(@plus,f9,curIapp); % Gl.*[(El - V) + dth.*exp((V - Vth)/dth)] + Isyn + Iapp
    f11 = arrayfun(@minus,f10,Isra); % Gl.*[(El - V) + dth.*exp((V - Vth)./dth)] + Isyn + Iapp - Isra
    dVdt = arrayfun(@times,f1,f11); % (1./Cm).*(Gl.*[(El - V) + dth.*exp((V - Vth)/dth)] + Isyn + Iapp - Isra)
    
    V = arrayfun(@plus,V,dVdt*dt); % V = V + dVdt*dt
    V = arrayfun(@max,Vreset,V); % bound membrane potentials to be >= than the reset value
    
    if (usePlasticity)
        % update synaptic weights
        % LTD first (includes simulated neuron and spikeGenerator spikes)
        if (areAnySpikes)
            ltd1 = bsxfun(@times,A2minus,C(:,allSpikes));
            ltd2 = bsxfun(@times,ltd1,o1);
            ltd3 = bsxfun(@times,C(:,allSpikes),A3minus);
            ltd4 = bsxfun(@times,ltd3,o1);
            ltd5 = bsxfun(@times,ltd4,r2(allSpikes)');
            ltd6 = ltd2 + ltd5;
            ltd7 = bsxfun(@times,ltd6,is_plastic(:,allSpikes));
            ltd8 = bsxfun(@times,ltd7,GsynMax(:,allSpikes));
            GsynMax(:,allSpikes) = GsynMax(:,allSpikes) - ltd8;
        end
        % LTP next (includes simulated neuron spikes only)
        if (areSimSpikes)
            ltp1 = bsxfun(@times,C(spiked,:),A2plus');
            ltp2 = bsxfun(@times,ltp1,r1');
            ltp3 = bsxfun(@times,C(spiked,:),A3plus');
            ltp4 = bsxfun(@times,ltp3,r1');
            ltp5 = bsxfun(@times,ltp4,o2(spiked));
            ltp6 = ltp2 + ltp5;
            ltp7 = bsxfun(@times,ltp6,is_plastic(spiked,:));
            ltp8 = bsxfun(@times,ltp7,GsynMax(spiked,:));
            GsynMax(spiked,:) = GsynMax(spiked,:) + ltp8;
        end
        % update plasticity variables
        if (strcmp(plasticity_type,'all-to-all'))
            r1(spiked) = r1(spiked) + 1;
            r2(spiked) = r2(spiked) + 1;
            o1(spiked) = o1(spiked) + 1;
            o2(spiked) = o2(spiked) + 1;
        elseif (strcmp(plasticity_type,'nearest'))
            r1(spiked) = 1;
            r2(spiked) = 1;
            o1(spiked) = 1;
            o2(spiked) = 1;
        end
    end
    
    % if there are any spikes from SIMULATED neurons, write them to file
    if (areSimSpikes)
        if (useRecord)
            fwrite(spkfid,-1,'int32'); % each timestep with spikes starts with -1
            fwrite(spkfid,i,'int32'); % then the integer # of the timestep
            % then the INDICES FROM THE cells2record VECTOR ARE WRITTEN. NOT THE NEURON IDS THEMSELVES
            fwrite(spkfid,find(spiked(cells2record)),'int32');
        end
    end
end
fprintf('%i simulated spikes\n',int32(nSimulatedSpikes));
fprintf('%i generated spikes\n',int32(nGeneratedSpikes));
end

I have a different function that creates all of the inputs (which are mostly gpuArrays except for 'plasticity_type', 'nT', and 'spkfid'. I have a loop where I initially call runAEVLIFNetGPU_mex with these created inputs. I then use the outputs of this function call ([GsynMax,V,Vth,Isra,GsynE,GsynI,D,F,r1,r2,o1,o2]) as inputs in a subsequent call to the function. I find that this rapidly eats up my GPUs memory (RTX 2070 with 8GB of memory) after only 60 or so function calls. I believe this to be a memory leak problem based on the googling around I have done about this problem. When I only make one call to this function I don't see any build up of memory, so I believe the problem arises from multiple calls.

I have tried calling the function repeatedly but without using the outputs (leaving the output field in the calling function blank), this did not change anything. I am looking for any best (or basic) practices I am missing when using MEX functions (particularly on the GPU). Any advice appreciated and please let me know if I can provide additional info to help.

System Info:

Using MATLAB 2019b

CUDA Version 10.2

NVIDIA Driver 440.64

GPU: GeForce RTX 2070

5 Comments
Show 3 older commentsHide 3 older comments

Benjamin Ballintyn on 28 Apr 2020

Hi Joss,

Thanks for the quick reply. I am using arrayfun and bsxfun based on the recommendations on this page: https://www.mathworks.com/help/parallel-computing/measure-and-improve-gpu-performance.html but perhaps I misunderstood it. I did initially write this function without it but I found that changing to arrayfun/bsxfun did seem to speed things up, at least on gpu. My guess is it must be less memory efficient though.

If I'm understanding your suggestion right, my function is holding all of the temporary variables in memory, but shouldn't their memory allocation be freed up once the function terminates (and they are not returned)?

Joss Knight on 2 May 2020

It looks like you're returning many of those temporary variables, so they can't be freed. Also, Coder may not know that temporary variables that are holding onto GPU memory are free to be released during the function execution. It is certainly a good place to start anyway - make your MATLAB code itself more efficient with GPU memory, and you may find that the generated code is also more efficient.

That doc page suggests grouping many element-wise operations into a single function and calling it with arrayfun - it is no advantage (in fact, a positive disadvantage) to call arrayfun independently for each operation. For many releases now MATLAB's automatic optimisations for element-wise operations give equally good performance in most typical circumstances. Also, for many releases bsxfun has been unnecessary since the MATLAB language itself supports singleton dimension expansion. For what it's worth gpuArray/arrayfun also supports dimension expansion and bsxfun just calls arrayfun under the hood for gpuArray inputs, so the only reason to use it would be if you are writing code that may take either numeric or gpuArray inputs.

bsxfun is sometimes needed to implement dimension expansion in generated code, since the Coder products do not support automatic dimension expansion.

Sign in to comment.

Sign in to answer this question.

GPU memory build up from multiple mex calls in loop

5 Comments
Show 3 older commentsHide 3 older comments

Answers (0)

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

GPU memory build up from multiple mex calls in loop

5 Comments Show 3 older commentsHide 3 older comments

Answers (0)

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

5 Comments
Show 3 older commentsHide 3 older comments