GPU optimization of looped vector operations

I think I am making a simple mistake. I am comparing a vectorized integration on the GPU and the CPU. This code takes ~365 s on the GPU (NVIDIA Quadro P5200) and 96 s on the parallel CPU (6 workers, Xeon E-2176M, 2.7 GHz). The integral is a straightforward operation on vectors 90,000 elements long in this example, repeated 90,000 times in the loop. A test multiplication of two 10,000x10,000 arrays of random numbers takes 0.65 s on my GPU and 10.8 s on my CPU. In the example below the GPU is slower for larger arrays. It seems as though the loop introduces a lot of overhead on the GPU operations.
What is the best strategy to optimize this problem for the GPU?
nubar_low = 2450;
nubar_high = 3350;
p_density = 100; %points per wavenumber
nu_bar = nubar_low:1/p_density:nubar_high;
K = zeros(size(nu_bar));
nub = nu_bar;
n_inf = 0;
k_max = 0.01; %max k
nub_0 = 2800; %nu bar center to absorption
gamma = 50; %width of the absorption
K = k_max * (gamma/2)^2 * ( ((nub-nub_0).^2 + (gamma/2)^2).^-1 - ((nub+nub_0).^2 + (gamma/2)^2).^-1);
% dK data is the derivative of K --> d(K)/d(nubar)
% Use value on either side of the point where possible
dK = zeros(size(K));
dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));
% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));
dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));
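% Aside (an assumption, not in the original post): MATLAB's gradient()
% computes the same central differences with one-sided endpoints in a
% single call, e.g.
%   dK_alt = gradient(K, nu_bar);
% worth verifying dK_alt matches dK before substituting it.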
len=length(nu_bar);
dN_KK = zeros(1,len);
% The integral
tic
try
canUseGPU = parallel.gpu.GPUDevice.isAvailable;
catch ME
canUseGPU = false;
end
%canUseGPU = false;
if canUseGPU
%integral using GPU
gnu_bar = gpuArray(nu_bar);
gK = gpuArray(K);
gdK = gpuArray(dK);
gdN_KK = gpuArray(dN_KK);
for i = 1:len
gdN_KK(i) = sum(gnu_bar([1:i-1, i+1:end]) .* gK([1:i-1, i+1:end]) ./ (gnu_bar([1:i-1, i+1:end]).^2 - gnu_bar(i).^2));
gdN_KK(i) = 2*gdN_KK(i) + gK(i)./(2*gnu_bar(i)) + gdK(i);
end
dN_KK =gather(gdN_KK);
else
%integral using parallel CPU
parfor i = 1:len
dN_KK(i) = sum(nu_bar([1:i-1, i+1:end]) .* K([1:i-1, i+1:end]) ./ (nu_bar([1:i-1, i+1:end]).^2 - nu_bar(i).^2));
dN_KK(i) = 2*dN_KK(i) + K(i)./(2*nu_bar(i)) + dK(i);
end
end
% Scales data
dN_KK = (1/(pi*p_density))*dN_KK;
% Adds constant for N infinity
N_KK = dN_KK + n_inf;
toc

4 Comments

Matt J
Matt J on 28 Aug 2019
Edited: Matt J on 28 Aug 2019
In the example below the GPU is slower for larger arrays. It seems as though the loop introduces a lot of overhead on the GPU operations.
You need to tell us what run times you observed. You also need to use gputimeit() or wait() to time the GPU implementation to make sure GPU synchronization issues don't affect the result.
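A minimal sketch of the timing advice above (assuming Parallel Computing Toolbox is installed; the 10000x10000 size mirrors the poster's matrix-multiply benchmark):

```matlab
% Kernel launches are asynchronous, so tic/toc can stop the clock before
% the GPU has finished. Either synchronize explicitly or use gputimeit.
A = gpuArray.rand(10000);      % random test matrix on the GPU

t1 = gputimeit(@() A*A);       % handles warm-up and synchronization

tic;                           % manual equivalent
B = A*A;
wait(gpuDevice);               % block until the GPU is idle
t2 = toc;
```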
Lloyd Bumm
Lloyd Bumm on 28 Aug 2019
Edited: Lloyd Bumm on 28 Aug 2019
The times are stated in the second sentence -- from tic to toc in the provided code. I'll take a look at the alternate timing you suggest.
Lloyd, have you tried running your computation in single precision? Your Quadro P5200 has respectable single precision performance of about 8 or 9 teraflops, but like most graphics cards apart from the special compute-oriented ones, its double precision performance is a small fraction of that, at about 280 gigaflops (figures from the Wikipedia page where NVIDIA posts their specs). This is why you're not getting much better matrix-multiply performance out of your GPU than your CPU; this would be dramatically different in single precision. It is perfectly normal for an algorithm to run faster on the CPU than on one of these graphics-focused cards, especially if it is an algorithm with a lot of unvectorized loops and a multiprocess parallelization.
I noted the single precision times in the benchmarks below. The effect is only a factor of 2. However, this is not the sort of problem that should be done in single precision.
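The precision comparison being discussed could be tested with something like this sketch (the size is illustrative, not taken from the thread):

```matlab
% Most consumer GPUs trade double-precision throughput for graphics
% performance; casting to single often reveals a large gap.
n  = 4000;
Ad = gpuArray.rand(n);                 % double precision (default)
As = single(Ad);                       % single precision copy
td = gputimeit(@() Ad*Ad);
ts = gputimeit(@() As*As);
fprintf('double %.4f s, single %.4f s, ratio %.1fx\n', td, ts, td/ts);
```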


 Accepted Answer

Matt J
Matt J on 28 Aug 2019
Edited: Matt J on 6 Sep 2019
This modification uses mat2tiles from the File Exchange to help divide the computation into bigger, vectorized chunks.
It runs in about 2 seconds on my graphics card (GeForce GTX 1080 Ti). Aside from increased vectorization, the key is to eliminate all the indexing expressions x([1:i-1, i+1:end]). Those are costly.
tic;
gnu_bar = gpuArray(nu_bar);
gK = gpuArray(K);
gdK = gpuArray(dK);
gdN_KK = gpuArray(dN_KK);
chunksize=1000;
vv=gnu_bar.^2;
vvchunks=mat2tiles( vv , [1,chunksize]);
numer=gnu_bar.*gK;
c=1;
for k=1:numel(vvchunks)
Q=numer(:)./(vv.'-vvchunks{k});
Q(c:len+1:end)=0;
c=c+size(Q,2);
vvchunks{k}=sum(Q,1);
end
gdN_KK=[vvchunks{:}];
gdN_KK = 2*gdN_KK + gK./(2*gnu_bar) + gdK;
wait(gpuDevice)
toc %Elapsed time is 2.027665 seconds.

15 Comments

You are using the vector speed of the GPU and reducing the loop iterations to a minimum. Makes sense.
Maybe I am missing something in your implementation. The reason for the indexing expression x([1:i-1, i+1:end]) is to jump over the singularity when the array element index equals the loop index. In each iteration of the loop the full arrays are multiplied except for the element at i. The second line in the loop fixes up that skip.
Is there a clever way to vectorize with the moving skip? I can do the square and the multiplication outside the loop, which is clearly more efficient, but they need to be combined and summed inside the loop, as you have done for "Q" in your example.
Matt J
Matt J on 28 Aug 2019
Edited: Matt J on 28 Aug 2019
Is there a clever way to vectorize with the moving skip?
None that I think would be more optimal than omitting the skip, as in my current proposal.
The skipping is highly sub-optimal, whether vectorized or not, because it will always involve allocating memory for copies of your arrays with the i-th element removed. Conversely, when you omit the skip, you do have to process the singularities, but the overhead for that is tiny.
Since it is a sum, it is probably faster to do everything including the item to be skipped, and then subtract off the value at the location to be skipped.
Or alternately to create an array of 1's everywhere except the elements to be skipped and multiply that by the elements before doing the sum. You could perhaps use 1-eye() to process by array.
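One way to realize the ideas above without per-iteration index slicing is to form the full quotient matrix and zero its singular diagonal, as the accepted answer does. An illustrative sketch on toy vectors (a and b stand in for nu_bar and K; this is not code from the thread):

```matlab
a = [1 2 3 4];                    % stand-in for nu_bar
b = [5 6 7 8];                    % stand-in for K
% Implicit expansion builds Q(j,i) = a(j)*b(j) / (a(j)^2 - a(i)^2)
Q = (a.*b).' ./ (a.'.^2 - a.^2);
Q(1:numel(a)+1:end) = 0;          % zero the singular diagonal (the skipped terms)
s = sum(Q,1);                     % s(i) is the sum with the i-th term skipped
```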
Lloyd Bumm
Lloyd Bumm on 28 Aug 2019
Edited: Lloyd Bumm on 28 Aug 2019
What about a circshift to move the zero along? Or using a boolean array index? The former adds another floating point operation. Is the latter any better than x([1:i-1, i+1:end])?
Matt J
Matt J on 28 Aug 2019
Edited: Matt J on 28 Aug 2019
All of those suggestions involve memory allocation and copying above and beyond what the current solution does.
sum(gnu_bar([1:i-1, i+1:end]) .* gK([1:i-1, i+1:end]))
can be replaced with
partials = gnu_bar .* gK;
before the loop, and then
sum(gnu_bar([1:i-1, i+1:end]) .* gK([1:i-1, i+1:end]))
calculated for all i values would be
sum(partials) - partials
Likewise,
gnu_bar([1:i-1, i+1:end]).^2 - gnu_bar(i).^2
is
gb_sq = gnu_bar.^2;
sum(gb_sq) - gb_sq
Matt J
Matt J on 28 Aug 2019
Edited: Matt J on 28 Aug 2019
sum(gnu_bar([1:i-1, i+1:end]) .* gK([1:i-1, i+1:end])) can be replaced with...
Unfortunately, the summation is not just over elements of gnu_bar.*gK. This is just the numerator of a more complicated quotient expression.
Thanks. Considering all the comments, it appears the best way to adapt this to a GPU is to vectorize it in two dimensions. As Matt J points out, I cannot simply sum the partials.
sum( a(:) .* b(:) ./ ( a(:).^2 - a(i).^2 ) )
This problem is in essence a convolution, with the integral performed in the original direction of vectorization, call it j, repeated in the i direction (looped over i in the original). j needs to be at least as dense as i, but denser is better. And i can be varied over a shorter interval than j, which reduces the end effects. I think the denominator needs to be a 2D array to be fully vectorized.
For the time being I'm going to remove the unnecessary repeated computations from the loops.
Using repmat to set up the 2D arrays needs some thought to not exceed the available memory, but it could be done in chunks.
But what you've described sounds no different from my originally posted solution, with the exception of repmat, which should not be necessary in a recent version of MATLAB. The implicit expansion in,
Q=numer{k}./(vvchunks{k}-vv.');
already creates a 2D array in the denominator for you.
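A tiny demonstration of the implicit expansion being referred to (available in R2016b and later; the values are arbitrary):

```matlab
row = 1:4;          % 1x4
col = (1:3).';      % 3x1
D   = row - col;    % expands to a 3x4 array, D(i,j) = row(j) - col(i)
size(D)             % [3 4]
```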
If you are saying that you are satisfied that the presented solution answers your posted question, then please do Accept-click it.
Thanks for your comments. I work incrementally. Plus most of the computers I work with don't have GPUs, so the basic 1D vectorization is most general and has greater impact. Your comments were key in pointing out how to implement that. I will be digesting your implementation of mat2tiles next.
I looked at your mat2tiles implementation. It is not giving the correct result; the graph looks nothing like the correct result. I implemented your solution as in the long code block below, which I tacked onto my original code.
I'm not sure I understand what this statement is doing,
vvchunks{k}-vv.'
but I don't think it is correct for what I need mathematically. The updated optimizations I posted should make it more clear.
I agree that mat2tiles will help the memory problem in the 2D operations. I just trapped those for testing in the updated optimizations.
Note, I expanded the range of the vectorized integral and made the number of integrals evaluated a subset of those points (wavenumbers here). This is better mathematically.
nubar_low = 2450;
nubar_high = 3350;
p_density = 100; %points per wavenumber
nu_bar = nubar_low:1/p_density:nubar_high;
K = zeros(size(nu_bar));
nub = nu_bar;
n_inf = 0;
nub = nu_bar;
k_max = 0.01; %max k
nub_0 = 2800; %nu bar center to absorption
gamma = 50; %width of the absorption
K = k_max * (gamma/2)^2 * ( ((nub-nub_0).^2 + (gamma/2)^2).^-1 - ((nub+nub_0).^2 + (gamma/2)^2).^-1);
% dK data is the derivative of K --> d(K)/d(nubar)
% Use value on either side of the point where possible
dK = zeros(size(K));
dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));
% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));
dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));
len=length(nu_bar);
dN_KK = zeros(1,len);
gGPU = gpuDevice(1);
reset(gGPU);
tic;
gnu_bar = gpuArray(nu_bar);
gK = gpuArray(K);
gdK = gpuArray(dK);
gdN_KK = gpuArray(dN_KK);
chunksize=1000;
vv=gnu_bar.^2;
vvchunks=mat2tiles( vv , [1,chunksize]);
numer=mat2tiles(gnu_bar.*gK,[1,chunksize]);
c=1;
for k=1:numel(vvchunks)
Q=numer{k}./(vvchunks{k}-vv.');
Q(c:len+1:end)=0;
c=c+size(Q,2);
vvchunks{k}=sum(Q,1);
end
gdN_KK=[vvchunks{:}];
gdN_KK = 2*gdN_KK + gK./(2*gnu_bar) + gdK;
MJ_dN_KK = (1/(pi*p_density))*gather(gdN_KK);
wait(gGPU)
toc
figure
plot(nu_bar,MJ_dN_KK);
I looked at your mat2tiles implemention. It is not giving the correct result, the graph looks nothing like the correct result
I compared the results of the code from your initial post with the version I posted in my answer. I got identical results...
They are not the same. Although I had left out a scaling term (1/(pi*p_density)) in the code block of your implementation above (now edited to correct this, added in the line with the gather), that does not change the shape of the curve. Yours is in blue.
This is what is being evaluated:
Okay, I did make a few fixes, but now, to be sure we're on the same page, I share the test code below, and I see strong agreement between the two versions.
nubar_low = 2450;
nubar_high = 2451;
p_density = 100; %points per wavenumber
nu_bar = nubar_low:1/p_density:nubar_high;
K = zeros(size(nu_bar));
nub = nu_bar;
n_inf = 0;
nub = nu_bar;
k_max = 0.01; %max k
nub_0 = 2800; %nu bar center to absorption
gamma = 50; %width of the absorption
K = k_max * (gamma/2)^2 * ( ((nub-nub_0).^2 + (gamma/2)^2).^-1 - ((nub+nub_0).^2 + (gamma/2)^2).^-1);
% dK data is the derivative of K --> d(K)/d(nubar)
% Use value on either side of the point where possible
dK = zeros(size(K));
dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));
% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));
dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));
len=length(nu_bar);
dN_KK = zeros(1,len);
len,
tic;
gnu_bar = gpuArray(nu_bar);
gK = gpuArray(K);
gdK = gpuArray(dK);
gdN_KK = gpuArray(dN_KK);
%%%% ORIGINAL %%%%%
for i = 1:len
gdN_KK(i) = sum(gnu_bar([1:i-1, i+1:end]) .* gK([1:i-1, i+1:end]) ./ (gnu_bar([1:i-1, i+1:end]).^2 - gnu_bar(i).^2));
gdN_KK(i) = 2*gdN_KK(i) + gK(i)./(2*gnu_bar(i)) + gdK(i);
end
version1 = gdN_KK ;
%%%% OPTIMIZED %%%%%%
chunksize=5;
vv=gnu_bar.^2;
vvchunks=mat2tiles( vv , [1,chunksize]);
numer=gnu_bar.*gK;
c=1;
for k=1:numel(vvchunks)
Q=numer(:)./(vv.'-vvchunks{k});
Q(c:len+1:end)=0;
c=c+size(Q,2);
vvchunks{k}=sum(Q,1);
end
gdN_KK=[vvchunks{:}];
gdN_KK = 2*gdN_KK + gK./(2*gnu_bar) + gdK;
%wait(gd)
toc %Elapsed time is 2.027665 seconds.
version2 = gdN_KK ;
plot(1:len,version1,'-',1:len,+version2,'x'); legend('Lloyd','Matt')
At first I didn't recognize what was going on because you had the interval set to one wavenumber (2450-2451), far away from the absorption (2800). I changed the interval back to 2450-3350, increased the chunksize to 1000, scaled it properly, and compared it to my 1D vectorized CPU code (a fairer comparison). It is spot on now.
LB CPU 1D vector optimized: 4.036714 seconds
MJ GPU mat2tiles optimized: 2.554221 seconds
I'll need to figure out if I can implement your solution when the integrals are being evaluated at a subset of the points in the integration.
MJ_LB_compare.2019_09_06.png
nubar_low = 2450;
nubar_high = 3350;
p_density = 100; %points per wavenumber
nu_bar = nubar_low:1/p_density:nubar_high;
K = zeros(size(nu_bar));
nub = nu_bar;
n_inf = 0;
nub = nu_bar;
k_max = 0.01; %max k
nub_0 = 2800; %nu bar center to absorption
gamma = 50; %width of the absorption
K = k_max * (gamma/2)^2 * ( ((nub-nub_0).^2 + (gamma/2)^2).^-1 - ((nub+nub_0).^2 + (gamma/2)^2).^-1);
% dK data is the derivative of K --> d(K)/d(nubar)
% Use value on either side of the point where possible
dK = zeros(size(K));
dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));
% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));
dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));
len=length(nu_bar);
dN_KK = zeros(1,len);
len,
%%%% 1D vector optimized %%%%%
tic;
part_a = nu_bar .* K;
part_b = nu_bar .^2;
for i = 1:len
part_c = part_a ./ (part_b - part_b(i));
part_c(i) = 0;
dN_KK(i) = sum(part_c);
end
dN_KK = (1/(pi*p_density))*(2*dN_KK + K./(2*nu_bar) + dK);
version1 = dN_KK ;
toc %Elapsed time is 4.036714 seconds.
%%%% OPTIMIZED %%%%%%
tic
gnu_bar = gpuArray(nu_bar);
gK = gpuArray(K);
gdK = gpuArray(dK);
gdN_KK = gpuArray(dN_KK);
chunksize=1000;
vv=gnu_bar.^2;
vvchunks=mat2tiles( vv , [1,chunksize]);
numer=gnu_bar.*gK;
c=1;
for k=1:numel(vvchunks)
Q=numer(:)./(vv.'-vvchunks{k});
Q(c:len+1:end)=0;
c=c+size(Q,2);
vvchunks{k}=sum(Q,1);
end
gdN_KK=[vvchunks{:}];
gdN_KK = (1/(pi*p_density))*(2*gdN_KK + gK./(2*gnu_bar) + gdK);
%wait(gd)
toc %Elapsed time is 2.554221 seconds.
version2 = gdN_KK ;
figure
plot(nu_bar,version1,nu_bar,+version2,'x'); legend('Lloyd','Matt')


More Answers (1)

I had some time to get back to this project today. The code below is set up differently from the OP in that the number of points and the range of the Kramers-Kronig integration differ from the range and density of points over which it is evaluated.
Below I compare the unoptimized method in the OP with an efficient 1D vectorization and the 2D vectorization, using for loops, parfor loops, and the GPU. The primary improvement is the 1D optimization implemented based on the discussions above.
The parameters and timings are in the code, but I will list them here for convenience.
990001 points per integral; 301 integrals
CPU for unoptimized: 19.893800 seconds.
CPU for 1D vector optimized: 0.319634 seconds.
CPU 2D vectorized: 1.318735 seconds.
CPU for 1D vector optimized single precision: 0.079821 seconds.
CPU 2D vectorized single precision: 0.861958 seconds.
CPU parfor unoptimized: 7.018130 seconds.
CPU parfor 1D vector optimized: 0.736592 seconds.
GPU for un-optimized: 11.348325 seconds.
GPU for 1D vector optimized: 0.540666 seconds.
GPU for 2D vector optimized: 33.338617 seconds.
GPU for un-optimized single precision: 11.187868 seconds.
GPU for 1D vector optimized single precision: 0.471209 seconds.
GPU for 2D vector optimized single precision: 16.836338 seconds.
% nubar_low and nubar_high are the starting and ending points for the integrals in wavenumbers (nu_bar),
% p_density is the number of points between wavenumbers
% The K-K integral is evaluated at a subset of wavenumbers given by
% the interval nubar_start to nubar_end (nu_bar_eval) with a point density lower than
% the integral by the factor eval_sub_density
% the indices for eval_ind must exactly correspond to points in nu_bar
close all
clear all
nubar_low = 100;
nubar_high = 10000;
nubar_start = 2450;
nubar_end = 3350;
p_density = 100; %points per wavenumber for integral
eval_sub_density = 300; % evaluate interval lower than the p_density by this factor
nu_bar = nubar_low:1/p_density:nubar_high; %wavenumber vector for integral
len_nubar = length(nu_bar);
nu_bar_eval = nubar_start:1/(p_density/eval_sub_density):nubar_end; %vector wavenumbers where integral is evaluated
len_eval=length(nu_bar_eval);
bb = (nu_bar >= nubar_start) & (nu_bar <= nubar_end);
ind_start = find(bb,1,'first');
ind_end = find(bb,1,'last');
eval_ind = ind_start:eval_sub_density:ind_end; %indices in nu_bar that correspond to nu_bar_eval
fprintf('%i points per integral; %i integrals\r',len_nubar, len_eval);
n_inf = 0;
%compute K spectrum
K = zeros(size(nu_bar));
k_max = 0.01; %max k
nu_bar_0 = 2800; %nu bar center to absorption
gamma = 50; %width of the absorption
K = k_max * (gamma/2)^2 * ( ((nu_bar-nu_bar_0).^2 + (gamma/2)^2).^-1 - ((nu_bar+nu_bar_0).^2 + (gamma/2)^2).^-1);
% dK data is the derivative of K --> d(K)/d(nubar)
% Use value on either side of the point where possible
dK = zeros(size(K));
dK(2:end-1) = (K(3:end)-K(1:end-2))./(nu_bar(3:end)-nu_bar(1:end-2));
% Endpoints are special case.
dK(1) = (K(2)-K(1))./(nu_bar(2)-nu_bar(1));
dK(end) = (K(end)-K(end-1))./(nu_bar(end)-nu_bar(end-1));
dN_KK = zeros(1,len_eval);
%times for len_nubar = 990001; len_nubar_eval = 301
fprintf('\rCPU for unoptimized\r'); % 19.9 s
tic
for i = 1:len_eval
jj = eval_ind(i);
dN_KK(i) = sum(nu_bar([1:jj-1, jj+1:end]) .* K([1:jj-1, jj+1:end]) ./ (nu_bar([1:jj-1, jj+1:end]).^2 - nu_bar(jj).^2)); %note jj (not i) in the skip indices
end
dN_KK = 2*dN_KK + K(eval_ind)./(2*nu_bar_eval) + dK(eval_ind);
dN_KK = (1/(pi*p_density))*dN_KK;
N_KK = dN_KK + n_inf;
toc
figure
plot(nu_bar_eval,dN_KK);
hold on
fprintf('\rCPU for 1D vector optimized\r'); % 0.32 s
tic
part_a = nu_bar .* K;
part_b = nu_bar .^2;
for i = 1:len_eval
jj = eval_ind(i);
part_c = part_a ./ (part_b - part_b(jj));
part_c(jj) = 0; %set singular points to zero before sum
dN_KK(i) = sum(part_c);
end
dN_KK = 2*dN_KK + K(eval_ind)./(2*nu_bar_eval) + dK(eval_ind);
dN_KK = (1/(pi*p_density))*dN_KK;
N_KK = dN_KK + n_inf;
toc
plot(nu_bar_eval,dN_KK);
hold on
fprintf('\rCPU 2D vectorized\r'); % 1.32 s
%test for sufficient available memory
mem = memory;
mem_avail = mem.MemAvailableAllArrays;
mem_need = len_nubar*len_eval*8*4;
if mem_need > mem_avail
fprintf('not enough memory, need %10.3e, only %10.3e available\r',mem_need, mem_avail);
proceed = false;
else
fprintf('have enough memory, need %10.3e, %10.3e available\r',mem_need, mem_avail);
proceed = true;
end
% proceed if sufficient available memory
if proceed
tic
R_temp = repmat(nu_bar.^2,len_eval,1); % conserve memory: use this variable as a temp
R_nubar_K = repmat(nu_bar.*K,len_eval,1);
R_nu_bar_eval_sq = repmat(nu_bar_eval'.^2,1,len_nubar);
R_temp = (R_nubar_K) ./ (R_temp - R_nu_bar_eval_sq);
for i = 1:len_eval
R_temp(i,eval_ind(i)) = 0; %set singular points to zero before sum
end
dN_KK = sum(R_temp,2)';
dN_KK = 2*dN_KK + K(eval_ind)./(2*nu_bar_eval) + dK(eval_ind);
dN_KK = (1/(pi*p_density))*dN_KK;
N_KK = dN_KK + n_inf;
toc
clear R_temp R_nubar_K R_nu_bar_eval_sq; %dump big arrays
plot(nu_bar_eval,dN_KK);
hold on
end
%%%%%%%%%%%%%%%%%%%%%%%%%% single precision %%%%%%%%%%%%%%%%
fprintf('\rCPU for 1D vector optimized single precision\r'); % 0.08 s
tic
snu_bar = single(nu_bar);
sK = single(K);
sdK = single(dK);
snu_bar_eval = single(nu_bar_eval);
seval_ind = single(eval_ind);
sdN_KK = single(dN_KK);
spart_a = snu_bar .* sK;
spart_b = snu_bar .^2;
for i = 1:len_eval
jj = seval_ind(i);
spart_c = spart_a ./ (spart_b - spart_b(jj));
spart_c(jj) = 0; %set singular points to zero before sum
sdN_KK(i) = sum(spart_c);
end
dN_KK = double(2*sdN_KK + sK(seval_ind)./(2*snu_bar_eval) + sdK(seval_ind));
dN_KK = (1/(pi*p_density))*dN_KK;
N_KK = dN_KK + n_inf;
toc
plot(nu_bar_eval,dN_KK);
hold on
fprintf('\rCPU 2D vectorized single precision\r'); % 0.86 s
%test for sufficient available memory
mem = memory;
mem_avail = mem.MemAvailableAllArrays;
mem_need = len_nubar*len_eval*4*4;
if mem_need > mem_avail
fprintf('not enough memory, need %10.3e, only %10.3e available\r',mem_need, mem_avail);
proceed = false;
else
fprintf('have enough memory, need %10.3e, %10.3e available\r',mem_need, mem_avail);
proceed = true;
end
% proceed if sufficient available memory
if proceed
tic
sR_temp = repmat(single(nu_bar).^2,len_eval,1); % conserve memory: use this variable as a temp
sR_nubar_K = repmat(single(nu_bar).*K,len_eval,1);
sR_nu_bar_eval_sq = repmat(single(nu_bar_eval)'.^2,1,len_nubar);
seval_ind = single(eval_ind);
sR_temp = (sR_nubar_K) ./ (sR_temp - sR_nu_bar_eval_sq);
for i = 1:len_eval
sR_temp(i,seval_ind(i)) = 0; %set singular points to zero before sum
end
dN_KK = double(sum(sR_temp,2)');
dN_KK = 2*dN_KK + K(eval_ind)./(2*nu_bar_eval) + dK(eval_ind);
dN_KK = (1/(pi*p_density))*dN_KK;
N_KK = dN_KK + n_inf;
toc
clear sR_temp sR_nubar_K sR_nu_bar_eval_sq; %dump big arrays
plot(nu_bar_eval,dN_KK);
hold on
end
%^^^^^^^^^^^^^^^^^^^^ end single precision ^^^^^^^^^^^^^^^^^^
fprintf('\rCPU parfor unoptimized\r'); % 7.02 s
poolobj = gcp;
tic
parfor i = 1:len_eval
jj = eval_ind(i);
dN_KK(i) = sum(nu_bar([1:jj-1, jj+1:end]) .* K([1:jj-1, jj+1:end]) ./ (nu_bar([1:jj-1, jj+1:end]).^2 - nu_bar(jj).^2)); %note jj (not i) in the skip indices
end
dN_KK = 2*dN_KK + K(eval_ind)./(2*nu_bar_eval) + dK(eval_ind);
dN_KK = (1/(pi*p_density))*dN_KK;
N_KK = dN_KK + n_inf;
toc
plot(nu_bar_eval,dN_KK);
hold on
fprintf('\rCPU parfor 1D vector optimized\r'); % 0.74 s
poolobj = gcp;
tic
part_a = nu_bar .* K;
part_b = nu_bar .^2;
parfor i = 1:len_eval
jj = eval_ind(i);
part_c = part_a ./ (part_b - part_b(jj));
part_c(jj) = 0; %set singular points to zero before sum
dN_KK(i) = sum(part_c);
end
dN_KK = 2*dN_KK + K(eval_ind)./(2*nu_bar_eval) + dK(eval_ind);
dN_KK = (1/(pi*p_density))*dN_KK;
N_KK = dN_KK + n_inf;
toc
plot(nu_bar_eval,dN_KK);
hold on
% Test for presence of GPU
try
canUseGPU = parallel.gpu.GPUDevice.isAvailable;
catch ME
canUseGPU = false;
end
if canUseGPU
fprintf('\rGPU for un-optimized\r'); % 11.35 s
gGPU = gpuDevice(1);
reset(gGPU);
tic
gnu_bar = gpuArray(nu_bar);
gK = gpuArray(K);
gdK = gpuArray(dK);
gdN_KK = gpuArray(dN_KK);
for i = 1:len_eval
jj = eval_ind(i);
gdN_KK(i) = sum(gnu_bar([1:jj-1, jj+1:end]) .* gK([1:jj-1, jj+1:end]) ./ (gnu_bar([1:jj-1, jj+1:end]).^2 - gnu_bar(jj).^2));
gdN_KK(i) = 2*gdN_KK(i) + gK(jj)./(2*gnu_bar(jj)) + gdK(jj);
end
gdN_KK = (1/(pi*p_density))*gdN_KK;
dN_KK =gather(gdN_KK);
wait(gGPU);
toc
plot(nu_bar_eval,dN_KK);
hold on
fprintf('\rGPU for 1D vector optimized\r'); % 0.54 s
gGPU = gpuDevice(1);
reset(gGPU);
tic
gnu_bar = gpuArray(nu_bar);
gK = gpuArray(K);
gdK = gpuArray(dK);
gnu_bar_eval = gpuArray(nu_bar_eval);
gdN_KK = gpuArray(dN_KK);
geval_ind = gpuArray(eval_ind);
gpart_a = gnu_bar .* gK;
gpart_b = gnu_bar .^2;
for i = 1:len_eval
jj = geval_ind(i);
gpart_c = gpart_a ./ (gpart_b - gpart_b(jj));
gpart_c(jj) = 0; %set singular points to zero before sum
gdN_KK(i)= sum(gpart_c);
end
gdN_KK = 2*gdN_KK + gK(geval_ind)./(2*gnu_bar_eval) + gdK(geval_ind);
gdN_KK = (1/(pi*p_density))*gdN_KK;
dN_KK =gather(gdN_KK);
wait(gGPU);
N_KK = dN_KK + n_inf;
toc
plot(nu_bar_eval,dN_KK);
hold on
fprintf('\rGPU for 2D vector optimized\r'); % 33.34 s
gGPU = gpuDevice(1);
reset(gGPU);
%test for sufficient memory
mem_GPU_avail = gGPU.AvailableMemory;
mem_need = len_nubar*len_eval*8*5;
if mem_need > mem_GPU_avail
fprintf('not enough memory, need %10.3e, only %10.3e available\r',mem_need, mem_GPU_avail);
proceed = false;
else
fprintf('have enough memory, need %10.3e, %10.3e available\r',mem_need, mem_GPU_avail);
proceed = true;
end
%proceed if sufficient memory is available
if proceed
tic
gnu_bar = gpuArray(nu_bar);
gnu_bar_eval = gpuArray(nu_bar_eval);
gK = gpuArray(K);
gdK = gpuArray(dK);
gdN_KK = gpuArray(dN_KK);
geval_ind = gpuArray(eval_ind);
R_gtemp = repmat(gnu_bar.^2,len_eval,1);
R_gnubar_K = repmat(gnu_bar.*gK,len_eval,1);
R_gnu_bar_eval_sq = repmat(gnu_bar_eval'.^2,1,len_nubar);
R_gtemp = (R_gnubar_K) ./ (R_gtemp - R_gnu_bar_eval_sq);
for i = 1:len_eval
R_gtemp(i,eval_ind(i)) = 0; %set singular points to zero before sum
end
gdN_KK = sum(R_gtemp,2)';
gdN_KK = 2*gdN_KK + gK(eval_ind)./(2*gnu_bar_eval) + gdK(geval_ind);
gdN_KK = (1/(pi*p_density))*gdN_KK;
dN_KK =gather(gdN_KK);
wait(gGPU);
N_KK = dN_KK + n_inf;
toc
plot(nu_bar_eval,dN_KK);
hold on
end
%%%%%%%%%%%%% single precision tests
fprintf('\rGPU for un-optimized single precision\r'); % 11.19 s
gGPU = gpuDevice(1);
reset(gGPU);
tic
gnu_bar = gpuArray(single(nu_bar));
gK = gpuArray(single(K));
gdK = gpuArray(single(dK));
gdN_KK = gpuArray(single(dN_KK));
for i = 1:len_eval
jj = eval_ind(i);
gdN_KK(i) = sum(gnu_bar([1:jj-1, jj+1:end]) .* gK([1:jj-1, jj+1:end]) ./ (gnu_bar([1:jj-1, jj+1:end]).^2 - gnu_bar(jj).^2));
gdN_KK(i) = 2*gdN_KK(i) + gK(jj)./(2*gnu_bar(jj)) + gdK(jj);
end
gdN_KK = (1/(pi*p_density))*gdN_KK;
dN_KK =double(gather(gdN_KK));
wait(gGPU);
toc
plot(nu_bar_eval,dN_KK);
hold on
fprintf('\rGPU for 1D vector optimized single precision\r'); % 0.47 s
gGPU = gpuDevice(1);
reset(gGPU);
tic
gnu_bar = gpuArray(single(nu_bar));
gK = gpuArray(single(K));
gdK = gpuArray(single(dK));
gdN_KK = gpuArray(single(dN_KK));
gnu_bar_eval = gpuArray(single(nu_bar_eval));
geval_ind = gpuArray(single(eval_ind));
gpart_a = gnu_bar .* gK;
gpart_b = gnu_bar .^2;
for i = 1:len_eval
jj = geval_ind(i);
gpart_c = gpart_a ./ (gpart_b - gpart_b(jj));
gpart_c(jj) = 0; %set singular points to zero before sum
gdN_KK(i)= sum(gpart_c);
end
gdN_KK = 2*gdN_KK + gK(geval_ind)./(2*gnu_bar_eval) + gdK(geval_ind);
gdN_KK = (1/(pi*p_density))*gdN_KK;
dN_KK =double(gather(gdN_KK));
wait(gGPU);
N_KK = dN_KK + n_inf;
toc
plot(nu_bar_eval,dN_KK);
hold on
fprintf('\rGPU for 2D vector optimized single precision\r'); % 16.84 s
gGPU = gpuDevice(1);
reset(gGPU);
%test for sufficient memory
mem_GPU_avail = gGPU.AvailableMemory;
mem_need = len_nubar*len_eval*4*5;
if mem_need > mem_GPU_avail
fprintf('not enough memory, need %10.3e, only %10.3e available\r',mem_need, mem_GPU_avail);
proceed = false;
else
fprintf('have enough memory, need %10.3e, %10.3e available\r',mem_need, mem_GPU_avail);
proceed = true;
end
%proceed if sufficient memory is available
if proceed
tic
gnu_bar = gpuArray(single(nu_bar));
gK = gpuArray(single(K));
gdK = gpuArray(single(dK));
gdN_KK = gpuArray(single(dN_KK));
gnu_bar_eval = gpuArray(single(nu_bar_eval));
geval_ind = gpuArray(single(eval_ind));
R_gtemp = repmat(gnu_bar.^2,len_eval,1);
R_gnubar_K = repmat(gnu_bar.*gK,len_eval,1);
R_gnu_bar_eval_sq = repmat(gnu_bar_eval'.^2,1,len_nubar);
R_gtemp = (R_gnubar_K) ./ (R_gtemp - R_gnu_bar_eval_sq);
for i = 1:len_eval
R_gtemp(i,eval_ind(i)) = 0; %set singular points to zero before sum
end
gdN_KK = sum(R_gtemp,2)';
gdN_KK = 2*gdN_KK + gK(eval_ind)./(2*gnu_bar_eval) + gdK(geval_ind);
gdN_KK = (1/(pi*p_density))*gdN_KK;
dN_KK =double(gather(gdN_KK));
wait(gGPU);
N_KK = dN_KK + n_inf;
toc
plot(nu_bar_eval,dN_KK);
hold on
end
end

3 Comments

I'm still curious to know what happens to your GPU numbers if you compute in single precision. I suspect the issue is that your GPU is just not suitable for this double precision computation.
I edited the above results to include single precision on the CPU and the GPU.
Thanks. The only explanation for that is that your cost is all overhead on the GPU, and not computation.



Asked: 28 Aug 2019
Last commented: 6 Sep 2019
