fmincon running on GPU

Hi
I implemented a modified copy of this example.
I tried to run it on the GPU, but when I convert the inputs (dlR, ...) from dlarray to gpuArray, the call
objFun = @(parameters) objectiveFunction(parameters,dlR,dlTheta,dlT,dlR0,dlTheta0,dlT0,dlUr0,parameterNames,parameterSizes);
fails with the error
"FMINCON requires all values returned by functions to be of data type double."
So how can I run this example with GPU power?
Best regards,
Chris

 Accepted Answer

objFun = @(parameters) objectiveFunction(parameters,dlR,dlTheta,dlT,dlR0,dlTheta0,dlT0,dlUr0,parameterNames,parameterSizes);
That is fine in itself, and if you want, any or all of the dl* variables can be gpuArray. However, the value passed in by fmincon, received here as parameters, will never be a gpuArray.
Inside objectiveFunction you can unpack parameters as needed into individual variables, potentially constructing gpuArray objects as needed.
Then you can do whatever calculation is appropriate.
When you get to an appropriate place in the calculation, gather() the gpuArray results; the gathered output will not be gpuArray.
The final output from objectiveFunction must not be a gpuArray.
If your logic is suitable, it would be perfectly fine to code something like
cost = gather(cost);
at the end of your code -- though it would be more efficient to use a different variable name so that you are not changing the type of an existing variable:
cost_non_gpu = gather(cost); % with the function defined as returning cost_non_gpu
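Putting those pieces together, a minimal sketch of the pattern might look like this (the model expression, the two-parameter unpacking, and the variable names are hypothetical stand-ins for your actual computation):

```matlab
function cost_non_gpu = objectiveFunction(parameters, dlR, dlTheta)
% parameters arrives from fmincon as a plain double vector -- never a gpuArray.
p = gpuArray(parameters(:));               % move the current search point to the GPU
% ... whatever GPU computation is appropriate; a hypothetical residual:
residual = dlR - p(1).*sin(dlTheta) - p(2);
cost = sum(residual.^2, 'all');            % still a gpuArray scalar here
cost_non_gpu = gather(cost);               % fmincon requires a plain double back
end
```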

9 Comments

Matt J on 2 Feb 2022
Edited: Matt J on 2 Feb 2022
If your logic is suitable, it would be perfectly fine to code something like cost = gather(cost);
It is a terribly unfortunate concession in performance that one has to do a CPU/GPU/CPU transfer every time the objective is called. I discussed the benefits of gpuArray support for the Optimization Toolbox solvers with MathWorks staff back in 2018;
@Joss Knight assured me that it would be looked into, but I've no idea what decision was made.
However, you cannot determine whether the last iteration improved the objective without doing a comparison, and comparisons used to make decisions are rather inefficient at the GPU level.
A scalar >=,<= operation is less efficient on the GPU? Why?
And would that outweigh the overhead of transfering your x array from CPU to GPU on the front end?
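One way to put numbers on that trade-off on your own hardware is to time the transfers directly with gputimeit (the sizes here are arbitrary placeholders, not anything from the original example):

```matlab
x = rand(50, 1);                               % a typical fmincon search point
tToGPU  = gputimeit(@() gpuArray(x));          % host -> device copy of x
g = gpuArray(rand(50, 1));
tGather = gputimeit(@() gather(sum(g.^2)));    % scalar objective value back to host
fprintf('to GPU: %g s, gather: %g s\n', tToGPU, tGather);
```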
Thank you all for replying to my question.
@Matt, thank you for the link and for raising the last question; this is also not clear to me.
By the way, next time, please let the questioner make the decision about accepting an answer.
Matt J on 2 Feb 2022
Edited: Matt J on 2 Feb 2022
By the way, next time, please let the questioner make the decision about accepting an answer.
MathWorks blocks anyone from accepting answers in your stead until a week passes without any activity from you. After that, the forum assumes you have abandoned the thread and relaxes that block. In your case, two months had passed since you last participated in the thread, so there was all the more reason to think you weren't coming back.
That probably makes sense in cases where at least one answer was given promptly. The assumption makes no sense if you only got a response after a long time. The one-week timeframe should be measured from the first answer, not from the date the question was posted.
A scalar >=,<= operation is less efficient on the GPU? Why?
NVIDIA GPUs have an unusual structure. They have thousands of computation cores, but the cores do not have independent decision making ability. Instead, the cores are grouped, and each group of cores have one instruction decoder (known as an SM); one SM for each 128 cores. Each of the cores for one SM always executes exactly the same instruction as the other cores for the same SM. Except... for each core there is also an inhibit flag.
Thus, to make a decision about a scalar, the SM would have to decode the comparison instruction, decode the information about which core is responsible for the memory address holding the scalar, create a logical vector that selects that one core, emit the vector to the selection bus, emit the microcode for the comparison onto the shared bus, and tell the cores to start the instruction. All 128 cores for the SM would receive the same comparison instruction, but 127 of them would see the "Not you!" flag and do nothing, while the one selected core would execute the instruction, calculating the result of the comparison. And on you would go, selecting one core out of the 128 to do actual work, until finally the code got to a point where it was time to do a vectorized operation again (probably after having loaded all new memory address mappings for the individual cores).
The same dynamic is in play for indexing. If the computation had been chunked into 64 x 64 subsets of an array, then 32 SMs might be involved controlling 4096 (64*64) cores, and A(I,J)+1 might end up having 31 of the 32 SMs telling all of their cores to do nothing, with the 32nd SM telling one core to undertake the addition while the others paused. But if I or J happened to be vectors that selected multiple locations then in this scenario up to the full [e.g.] 4096 arithmetic operations might take place in the same time as it would take for doing just one operation.
Is it faster to do that one comparison than to do a synchronization? Oh, perhaps -- but the consequence of the comparison is probably going to involve punting back through several layers and ending up loading new kernels from the MATLAB end anyhow.
The MATLAB end keeps the GPU kernels relatively small. On a Windows system, if you are using a "gaming" GPU instead of a "server" GPU, then the WDDM driver is probably active instead of the TCC driver, and the WDDM timeout is 2 seconds -- every computation has to finish within 2 seconds. (WDDM mode assumes that the Windows system is also using the NVIDIA to drive the display, and assumes that it will need the entire GPU to do a display update.)
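The effect is easy to observe from the MATLAB side: a single-element operation on a gpuArray costs nearly as much as an operation over the whole array, because most cores sit idle either way (timings vary by GPU, and the exact ratio is not guaranteed):

```matlab
A = rand(4096, 4096, 'gpuArray');
tVector = gputimeit(@() A + 1);        % all elements, one well-filled kernel launch
tScalar = gputimeit(@() A(1,1) + 1);   % one element; almost all cores inhibited
fprintf('vectorized: %g s, scalar: %g s\n', tVector, tScalar);
```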
To really get performance improvements for something like fmincon you would probably want to use GPU Coder on the entire chain -- or rather, on a modified version of the chain that did the equivalent of keeping the arrays needed to estimate gradients around as sort-of-global variables, only not exactly global. Sort-of-global to keep them "hot" in the GPU to make updates of the gradients faster. And so on -- reengineering the flow.
Another way that performance could perhaps be improved would be if there were something like a 'vectorized' option for the calculation, in which fmincon always passed in a 2D array for x, with the columns intended to be acted on independently and the expected return size being 1 x size(x,2). That would give users an opportunity to evaluate several proposed x in vectorized form. At present, fmincon is defined as passing in the x information in the same shape as the x0 that was passed to it. A 'vectorized' flag would, I suspect, offer more optimization possibilities for the objective function, whether on CPU or GPU.
Thank you for your explanation. It would be really helpful if you could give an example of how I can use vectorization together with fmincon.
Thank you.
Unfortunately, at present fmincon() does not support vectorization. My point about vectorization is that if MathWorks were looking to improve performance for fmincon, it would make more sense to start by permitting vectorization than by trying to execute the selection logic (of whether the location improved the fit) on the GPU. A GPU-improved fmincon would call for a redesign of the fmincon internals rather than simply adding gpuArray to the list of types the code permits to be returned.
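To make the idea concrete, a 'vectorized' objective of the kind described might look like the sketch below. This interface is purely hypothetical -- no released fmincon accepts a matrix of candidate points -- but some Global Optimization Toolbox solvers (e.g. ga) already support a UseVectorized option in this spirit:

```matlab
% HYPOTHETICAL interface: X is n-by-k, one candidate point per column,
% and the return value is expected to be 1-by-k.
function costs = vectorizedObjective(X)
G = gpuArray(X);               % one host->device transfer covers k candidates
costs = gather(sum(G.^2, 1));  % evaluate all k columns in one vectorized pass
end
```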


More Answers (1)

Shivam Singh on 2 Feb 2022


Hello Chris,
This error message indicates that a value returned to fmincon has a type different from what it expects. You may refer to the fmincon documentation to know more about it.
Also, the "fmincon" function currently does not have GPU support.

2 Comments

CSCh on 22 Jul 2024
Has MathWorks made any progress on the discussed issue, "fmincon using the GPU"?
Best regards,
Chris

