How to further speed up code that runs on the GPU
Hello, I previously built a MEX file from my code to speed up its execution on the GPU. It became about 5 times faster, and I would like to know whether there is a way to make it faster still. Here is my code:
function BPmimo2C(Efield) %#codegen
coder.gpu.kernelfun;
image = complex(zeros(17,54,54));
%% creating kaiser window
numT = 16;
numR= 16;
f = 10e9:0.5e9:20e9;
numF = numel(f);
w = ones(numel(f),1);
viq = repmat(w.', [1,numT*numR]);
c = physconst('LightSpeed');
%% grid points
xf = (-8:0.3:8)*0.01;
yf = (-8:0.3:8)*0.01;
[uf , vf] = meshgrid(xf,yf);
x1f = uf(:);
y1f = vf(:);
%% initialization
ArrRadius = 30;
TX = [ArrRadius.*cosd((360/15)*(0:14))*0.01 0];
TY = [ArrRadius.*sind((360/15)*(0:14))*0.01 0];
K = 2*pi*f/c;
z = 0.36:0.003:0.41;
% z = 0.4;
for dep = 1:numel(z)
%% making the matrix of <transmitter-grid point> distance
XYPos = [TX.' TY.' ones(size(TX,2),1)*(z(dep))];
UVPos = [x1f(:), y1f(:), zeros(size(y1f(:),1),1)];
dtXYUV = pdist2( XYPos, UVPos);
dtXYUV2 = zeros(numR,numel(x1f(:)));
expTerm1 = bsxfun(@times,dtXYUV(:)' , K');
expT1 = reshape(expTerm1,[numel(K),numel(TX),numel(x1f)]);
expT2 = zeros(numel(K),numR,numel(x1f),numel(TX));
for i = 1:numel(TX)
expT2(:,:,:,i) = repmat(expT1(:,i,:),[1 numR 1]);
dtXYUV2(:,:,i) = repmat(dtXYUV(i,:),[numR,1]);
end
expT = permute(reshape(permute(expT2,[1 3 2 4]),[numel(K),numel(x1f),numR*numel(TX)]),[1 3 2]);
%% making the matrix of <receiver-grid point> distance
XYPos = [real(Efield(1:numR,2,1)) , real(Efield(1:numR,3,1)), ones(numR,1)*(z(dep))];
UVPos = [x1f(:), y1f(:), zeros(size(y1f(:),1),1)];
dXYUV = pdist2( XYPos, UVPos);
expTerm1 = bsxfun(@times,dXYUV(:)' , K');
expR = repmat(reshape(expTerm1,[numel(K),numR,numel(x1f)]),[1 numel(TX) 1]);
%% making the exponential term
EXP = exp(1i*(expT + expR));
EXP2 = reshape(EXP,[numel(K)*numel(TX)*numR,numel(x1f)]);
Efield2 = reshape(permute(Efield(1:numT*numR,:,:),[3 1 2]),[numel(f)*numT*numR,6]);
image2 = reshape(((viq.').*Efield2(:,6)).'*EXP2,[sqrt(numel(x1f)),sqrt(numel(x1f))]);
%% gather to change matrix from GPU-array to normal array
image(dep,:,:) = image2;
end
image = abs(image);
uf = repmat(reshape(uf,[1,numel(xf),numel(yf)]),[numel(z) 1 1]);
vf = repmat(reshape(vf,[1,numel(xf),numel(yf)]),[numel(z) 1 1]);
hf = uf;
for j = 1:numel(z)
hf(j,:,:) = z(j);
end
figure(1);
er = squeeze((image(13,:,:)));
h = surf(squeeze(uf(1,:,:)),squeeze(vf(1,:,:)),er);
colormap(jet);
set(h,'LineStyle','none');
view(2);
end
Besides speed, the code sometimes fails with an "out of memory" error because some of the arrays are very large. I could implement it with multiple nested "for" loops; however, I found that it is faster on the CPU when I use MATLAB's matrix multiplication capability. Therefore, I preferred matrix-based code to multiple nested "for" loops.
Any advice, whether general or specific, would be appreciated.
Thank you
2 Comments
Joss Knight
on 8 Jul 2024
Can I just check that you are aware that you do not need to use code generation to accelerate your code on the GPU? You only need to adapt your code to use gpuArray data. GPU Coder can be useful for converting code that must be written as a loop, but if you can vectorize your loops and turn them into matrix, vector, or pagewise operations instead, you could get better performance without needing to use coder intrinsics or configure a compiler.
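As a rough illustration of the gpuArray route described in this comment, here is a minimal sketch on a deliberately tiny, made-up problem (4 antennas, 3 frequencies, a 5x5 grid at one depth). The antenna positions and the measurement matrix S are placeholders, not the poster's data, and canUseGPU requires R2019b or later, so this is an assumption-laden sketch rather than a drop-in replacement for the function in the question.

```matlab
% Hypothetical small problem; all values below are placeholders.
c   = 299792458;                                 % speed of light, m/s
f   = single([10e9 15e9 20e9]);                  % frequencies, Hz
K   = 2*pi*f(:)/c;                               % wavenumbers, 3x1
ant = single([0.3 0; 0 0.3; -0.3 0; 0 -0.3]);    % antenna x,y in metres (made up)
[gx, gy] = meshgrid(single(-0.02:0.01:0.02));    % 5x5 image grid
z   = single(0.4);                               % one depth slice
S   = complex(ones(3, 4, 'single'));             % placeholder data, frequency x antenna

% Move the inputs to the GPU only if one is available (canUseGPU: R2019b+).
if canUseGPU()
    K = gpuArray(K); ant = gpuArray(ant);
    gx = gpuArray(gx); gy = gpuArray(gy); S = gpuArray(S);
end

% Antenna-to-pixel distances, 4x25, using implicit expansion.
d = sqrt((ant(:,1) - gx(:).').^2 + (ant(:,2) - gy(:).').^2 + z^2);

% Phase for every (frequency, antenna, pixel) triple: 3x4x25.
phase = K .* reshape(d, [1 size(d)]);

% Weight by the data, then sum over frequency and antenna.
img = abs(reshape(sum(sum(S .* exp(1i*phase), 1), 2), size(gx)));

% Bring the result back to host memory if it lives on the GPU.
if isa(img, 'gpuArray')
    img = gather(img);
end
```

The same array expressions run unchanged on the CPU when no GPU is present, which makes it easy to compare timings before deciding whether code generation is needed at all.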
moh mor
on 9 Jul 2024
Accepted Answer
More Answers (2)
Chao Luo
on 3 Jul 2024
The generated code is already quite well optimized for the GPU. I tried rewriting the code with explicit for-loops, which gives similar performance. On top of that, I converted the data type from double to single, which speeds up execution about 10 times. Do the conversion if single precision is good enough for you. Here is my rewritten code, with the plotting part removed, for your reference:
function image = BPmimo2C4(Efield) %#codegen
coder.gpu.kernelfun;
%% creating kaiser window
numT = 16;
numR= 16;
f = 10e9:0.5e9:20e9;
numF = numel(f);
w = ones(numel(f),1);
viq = repmat(w.', [1,numT*numR]);
c = physconst('LightSpeed');
%% grid points
xf = (-8:0.3:8)*0.01;
yf = (-8:0.3:8)*0.01;
[uf , vf] = meshgrid(xf,yf);
x1f = uf(:);
y1f = vf(:);
%% initialization
ArrRadius = 30;
TX = [ArrRadius.*cosd((360/15)*(0:14))*0.01 0];
TY = [ArrRadius.*sind((360/15)*(0:14))*0.01 0];
K = 2*pi*f/c;
z = 0.36:0.003:0.41;
Efield2 = reshape(permute(Efield(1:numT*numR,:,:),[3 1 2]),[numel(f)*numT*numR,6]); % 5376x6
Efield2_6 = single(Efield2(:,6).');
% z = 0.4;
XYPos1 = single([TX.', TY.']);
UVPos = single([x1f(:), y1f(:)]);
dtXYUV1 = pdist2(XYPos1, UVPos);
XYPos2 = single([real(Efield(1:numR,2,1)) , real(Efield(1:numR,3,1))]);
dtXYUV2 = pdist2(XYPos2, UVPos);
EXP = coder.nullcopy(single((ones(21,16,16,17,2916) * 1i)));
for f_idx = 1:numel(x1f)
for dep = 1:17
for r_idx = 1:numR
for t_idx = 1:numel(TX)
for k_idx = 1:numel(K)
z2 = z(dep) * z(dep);
dt1 = dtXYUV1(r_idx,f_idx) * dtXYUV1(r_idx,f_idx) + z2;
dt1 = sqrt(dt1);
dt2 = dtXYUV2(t_idx,f_idx) * dtXYUV2(t_idx,f_idx) + z2;
dt2 = sqrt(dt2);
expV = exp((dt1 + dt2) * K(k_idx) * 1i);
EXP(k_idx, t_idx, r_idx, dep, f_idx) = expV;
end
end
end
end
end
EXP_resh = reshape(EXP, [21*16*16, 17*2916]);
image = Efield2_6 * EXP_resh;
image = reshape(image, [17,54,54]);
end
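Regarding the "out of memory" errors mentioned in the question, a quick back-of-the-envelope calculation shows how large the EXP array in the code above is. The dimensions are taken from the coder.nullcopy call; the per-depth figure at the end assumes, hypothetically, that one depth slice is processed at a time, which is not something the code above does.

```matlab
% Dimensions of EXP as allocated above: 21 x 16 x 16 x 17 x 2916.
nK = 21; nT = 16; nR = 16; nZ = 17; nPix = 54*54;
nElem = nK * nT * nR * nZ * nPix;

bytesSingle = nElem * 8;     % complex single: 4 bytes real + 4 bytes imaginary
bytesDouble = nElem * 16;    % complex double: 8 bytes real + 8 bytes imaginary

fprintf('EXP has %d elements\n', nElem);
fprintf('complex single: %.2f GB, complex double: %.2f GB\n', ...
    bytesSingle/1e9, bytesDouble/1e9);

% Hypothetical alternative: hold only one depth slice in memory at a time.
bytesPerDepth = bytesSingle / nZ;
fprintf('one depth slice (complex single): %.3f GB\n', bytesPerDepth/1e9);
```

So the full array needs roughly 2.13 GB in single precision and about twice that in double, before counting any temporaries. Whether that fits depends on the GPU, and splitting the work by depth is one possible way to trade some speed for a much smaller footprint.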
8 Comments
moh mor
on 6 Jul 2024
Umar
on 6 Jul 2024
Edited: Walter Roberson
on 8 Jul 2024
Hi Moh mor,
Sorry I couldn’t respond to your most recent comment, but I do appreciate Chao’s help. The crash and performance issues you are encountering may stem from incorrect data type definitions or from mismatches between CPU and GPU data types. When moving computations to the GPU, make sure the data types of your GPU arrays are specified correctly so that the GPU's parallel processing is used effectively. Here is an example of how to specify the data type when working with GPU arrays in MATLAB:
% Define input data
inputData = rand(100, 'single'); % Single precision data
% Transfer data to GPU
gpuData = gpuArray(inputData);
% Perform computations on GPU
result = someGPUFunction(gpuData);
% Retrieve results back to CPU
resultCPU = gather(result);
By explicitly defining the data type (e.g., 'single' for single precision) when creating GPU arrays and performing computations, you can avoid crashes and improve performance during GPU processing, as shown in Mr. Luo’s code.
I will wait for Mr. Luo’s comments and his recommendations on how to proceed to the next stage, along with the results of running his code on his system.
moh mor
on 6 Jul 2024
Chao Luo
on 8 Jul 2024
@moh mor Can you also make sure that
- The function runs in MATLAB without any error
- The MEX generated with MATLAB Coder runs without any error
This is because GPU Coder will not do any error checking, such as out-of-bounds checks. So make sure to check for those errors in MATLAB and with the MATLAB Coder MEX.
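The checks listed above could be run roughly as follows. This is only a sketch: it assumes BPmimo2C4.m is on the MATLAB path, that MATLAB Coder and GPU Coder are installed, and that Efield is a 256x6x21 complex array (a size inferred from the indexing in the code, with random placeholder values). The output names given with -o are arbitrary.

```matlab
% Placeholder input; the size is inferred from the indexing, the values are random.
Efield = complex(rand(256, 6, 21), rand(256, 6, 21));

% 1) Run the function in plain MATLAB first.
refImage = BPmimo2C4(Efield);

% 2) Build and run a CPU MEX with MATLAB Coder.
%    Run-time integrity checks are enabled by default for MEX targets.
cpuCfg = coder.config('mex');
codegen -config cpuCfg BPmimo2C4 -args {Efield} -o BPmimo2C4_cpu_mex
cpuImage = BPmimo2C4_cpu_mex(Efield);

% 3) Only then build and run the GPU MEX with GPU Coder.
gpuCfg = coder.gpuConfig('mex');
codegen -config gpuCfg BPmimo2C4 -args {Efield} -o BPmimo2C4_gpu_mex
gpuImage = BPmimo2C4_gpu_mex(Efield);

% Compare the GPU result against the MATLAB reference.
relErr = max(abs(gpuImage(:) - refImage(:))) / max(abs(refImage(:)));
fprintf('max relative difference, GPU MEX vs MATLAB: %g\n', relErr);
```

Because the function converts its inputs to single, a small nonzero difference between the three results is to be expected.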
moh mor
on 9 Jul 2024
moh mor
on 9 Jul 2024
Chao Luo
on 10 Jul 2024
R2018b is old enough that I cannot debug it and give you a workaround. Is it possible for you to upgrade MATLAB to at least R2019b?
Umar
on 6 Jul 2024
Hi Moh Mor,
Have you considered reaching out to MathWorks support for further assistance? Provide them with detailed information about your system configuration, your MATLAB version, and the steps leading to the internal error.