fmincon running on GPU

Question

0 votes

Hi

I implemented a modified copy of this example.

I tried to run it on GPU

but if I try to convert the inputs (dlR,..) from alarrays to gpuarray's, I got the error

objFun = @(parameters) objectiveFunction(parameters,dlR,dlTheta,dlT,dlR0,dlTheta0,dlT0,dlUr0,parameterNames,parameterSizes);

"FMINCON requires all values returned by functions to be of data type double."

So how can I run this example with GPU power?

Solve Partial Differential Equation with LBFGS Method and Deep Learning - MATLAB & Simulink - MathWorks Switzerland

Best regards,

Chris

0 Comments
Show -2 older comments Hide -2 older comments

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

Walter Roberson on 2 Feb 2022

Open in MATLAB Online

0 votes

objFun = @(parameters) objectiveFunction(parameters,dlR,dlTheta,dlT,dlR0,dlTheta0,dlT0,dlUr0,parameterNames,parameterSizes);

That is fine in itself, and if you want, any or all of the values such as dl* variables can be gpuArray --- however, the value passed by fmincon, received here as parameters will never be gpuArray .

Inside objectiveFunction you can unpack parameters as needed into individual variables, potentially constructing gpuArray objects as needed.

Then you can do whatever calculation is appropriate.

When you get an appropriate place in the calculation, gather() the gpuArray calculations. The resulting output will not be gpuArray.

The final output from objectiveFunction must not be gpuArray .

If your logic is suitable, it would be perfectly fine to code something like

cost = gather(cost);

at the end of your code -- though it would be more efficient to use different variable names so that you are not changing the type of the variable

cost_non_gpu = gather(cost);  %with the function having been defined as returning cost_non_gpu

9 Comments
Show 7 older comments Hide 7 older comments

Walter Roberson on 3 Feb 2022

A scalar >=,<= operation is less efficient on the GPU? Why?

NVIDIA GPUs have an unusual structure. They have thousands of computation cores, but the cores do not have independent decision making ability. Instead, the cores are grouped, and each group of cores have one instruction decoder (known as an SM); one SM for each 128 cores. Each of the cores for one SM always executes exactly the same instruction as the other cores for the same SM. Except... for each core there is also an inhibit flag.

Thus, to make a decision about a scalar, the SM would would have to decode the comparison instruction, and decode the information about which core is responsible for the memory address holding the scalar, and create a logical vector that selects that one core, and then emit the vector to the selection bus, and emit the microcode for the comparison onto the shared bus, and tell the cores to start the instruction. All 128 cores for the SM would receive the same comparison instruction, but 127 of them would see the "Not you!" flag and do nothing, while the one location would do the instruction, calculating the result of the comparison. And on you would go, selecting one core out of the 128 to do actual work, until finally the code got to a point where it was time to do a vectorized operation again (probably after having loaded new all new memory address mappings for the individual cores.)

The same dynamic is in play for indexing. If the computation had been chunked into 64 x 64 subsets of an array, then 32 SMs might be involved controlling 4096 (64*64) cores, and A(I,J)+1 might end up having 31 of the 32 SMs telling all of their cores to do nothing, with the 32nd SM telling one core to undertake the addition while the others paused. But if I or J happened to be vectors that selected multiple locations then in this scenario up to the full [e.g.] 4096 arithmetic operations might take place in the same time as it would take for doing just one operation.

Is it faster to do that one comparison than to do a synchronization? Oh, perhaps -- but the consequence of the comparison is probably going to involve punting back through several layers and ending up loading new kernels from the MATLAB end anyhow.

The MATLAB end keeps the GPU kernels relatively small. On a Windows system, if you are using a "gaming" GPU instead of a "server" GPU, then the WDDM driver is probably active instead of the TCC driver, and the WDDM timeout is 2 seconds -- every computation has to finish within 2 seconds. (WDDM mode assumes that the Windows system is also using the NVIDIA to drive the display, and assumes that it will need the entire GPU to do a display update.)

To really get performance improvements for something like fmincon you would probably want to use GPU Coder on the entire chain -- or rather, on a modified version of the chain that did the equivalent of keeping the arrays needed to estimate gradients around as sort-of-global variables, only not exactly global. Sort-of-global to keep them "hot" in the GPU to make updates of the gradients faster. And so on -- reengineering the flow.

Another way that performance could perhaps be improved would be if there were something like a 'vectorized' option for the calculation, in which fmincon always passed in a 2D array for x, with the columns intended to be acted on independently, and the expected return size being 1 x size(x,2) . That would give users an opportunity to calculate for several proposed x in vectorized form. At present, fmincon is defined as passing in the x information in the same shape as x0 was passed to it. A 'vectorized' flag would, I suspect, off more optimization possibilities for the objective function, whether on CPU or GPU.

CSCh on 3 Feb 2022

Thank you for you explanation. It would be really helpfull if you could give an example how I can use the vectorization together with Fmincon.

Thank you.

Walter Roberson on 3 Feb 2022

Unfortunately at present fmincon() does not support vectorization; my point about vectorization is that if Mathwrks were looking to improve performance for fmincon, it would make more sense for Mathworks to start with permitting vectorization than it would for Mathworks to start by trying to permit the selection logic (of whether the location improved the fit) to be executed on the GPU. A GPU-improved fmincon would call for a redesign of the fmincon internals rather than for simply adding gpuArray to the list of types the code permits to be returned.

Sign in to comment.

Answer 2

Shivam Singh on 2 Feb 2022

0 votes

Hello Chris,

This error message indicates that the type of input arguments to fmincom is different than the expected. You may refer the fmincom function to know more about it.

Also, currently “fmincom” function doesn't have GPU support.

2 Comments
Show None Hide None

CSCh on 22 Jul 2024

Is there any progress made by Mathworks in terms of the discussed issue "fmincon & using GPU"

Best regards,

Chris

Walter Roberson on 22 Jul 2024

@CSCh

I doubt it. My logic at https://www.mathworks.com/matlabcentral/answers/1605350-fmincon-running-on-gpu#comment_1967900 still holds

Sign in to comment.

fmincon running on GPU

0 Comments
Show -2 older comments Hide -2 older comments

Accepted Answer

9 Comments
Show 7 older comments Hide 7 older comments

More Answers (1)

2 Comments
Show None Hide None

Categories

Products

Release

Tags

Community Treasure Hunt

fmincon running on GPU

0 Comments Show -2 older comments Hide -2 older comments

Accepted Answer

9 Comments Show 7 older comments Hide 7 older comments

More Answers (1)

2 Comments Show None Hide None

Categories

Products

Release

Tags

See Also

Community Treasure Hunt

0 Comments
Show -2 older comments Hide -2 older comments

9 Comments
Show 7 older comments Hide 7 older comments

2 Comments
Show None Hide None