Can I have checkpoints in Bayesian optimization for tuning hyperparameters of a neural network?

Hi there,
I'm trying to implement Bayesian Optimization on a BiLSTM network.
I'm planning to run this code on a university cluster, but they give us a maximum of 2 days (48 hours) per job; if it runs beyond that, they automatically kill it, which would mean wasted time and resources for me and for the other students waiting in the queue.
I was wondering if it would be possible to implement some kind of checkpoint for bayesopt() so it can continue from where the job left off.
Basically, what I'm asking is: can I save my previous runs (the variables bayesopt() observed), load them in my next run, and continue from where it stopped?
I have not seen any documentation related to this (I may have missed it).
My understanding of bayesopt() is that the more points it observes, the more accurate its answers become. Is this right? If so, I might want to run it for more than 2 days. The number of cores I can request is limited (the more I request, the longer I wait in the queue), and from what I'm estimating, the most complex combination of variables can take between 40 minutes and 1 hour to train and return a result (obviously, not every combination will take that long).
Any help is appreciated.
Thank you.

 Accepted Answer

Currently, there is no checkpointing argument. However, you can use the 'OutputFcn' argument along with the 'SaveFileName' argument to save to a file, and the resume function to restart the process, as follows:
x1 = optimizableVariable('x1',[-5,5]);
x2 = optimizableVariable('x2',[-5,5]);
fun = @rosenbrocks;
if exist('BayesoptResults.mat','file')
    load('BayesoptResults.mat');
    results = resume(BayesoptResults, ...
        'SaveFileName', 'BayesoptResults.mat', ...
        'OutputFcn', {@saveToFile});
else
    results = bayesopt(fun, [x1, x2], ...
        'AcquisitionFunctionName', 'expected-improvement-plus', ...
        'SaveFileName', 'BayesoptResults.mat', ...
        'OutputFcn', {@saveToFile});
end

function f = rosenbrocks(x)
    f = 100*(x.x2 - x.x1^2)^2 + (1 - x.x1)^2;
end
Note that this saves to the file on every iteration, so you might want to replace saveToFile with a custom function that saves only occasionally, for performance reasons.
The relevant docs are available here: resume and bayesopt.
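A custom output function that saves less often could look like the sketch below. This is only an illustration: the function name saveEveryN, the interval of 5, and the file name are made up; the (results, state) signature is the one bayesopt output functions use, and the variable is saved under the name BayesoptResults so that the load/resume pattern above still works.
```matlab
function stop = saveEveryN(results, state)
% Checkpoint the BayesianOptimization object only every 5th evaluation,
% instead of on every iteration as saveToFile does.
stop = false;  % never request early termination
if strcmp(state, 'iteration') && mod(results.NumObjectiveEvaluations, 5) == 0
    BayesoptResults = results; %#ok<NASGU> % same variable name that resume/load expect
    save('BayesoptResults.mat', 'BayesoptResults');
end
end
```
You would pass it as 'OutputFcn', {@saveEveryN} in place of {@saveToFile}.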

10 Comments

I’ll look into it, Aditya, thank you so much. So, does this method allow me to continue from where I stopped if I manually stop the run or close the MATLAB window entirely? That’s what will happen automatically after 48 hours. Thank you.
To be more clear, I’m trying to avoid re-running the same combinations of parameters from previous runs. Thank you.
Yes, as long as the file doesn't get deleted, you should be able to resume the process from where it stopped.
Great! Thank you so much Aditya, I really appreciate it. I haven’t had time to try it yet but, I’ll surely post an update on how it went.
I'm just here to say that it really worked. Thank you very much Aditya, it works great!
It even kept the points that were previously evaluated and did not re-evaluate them, and when plotFcn was active, I could see the previously evaluated points in the plot along with the new ones.
The only additional thing that would be nice to have is saving the verbose output from the command window (and appending each subsequent run to it) so that I can see the full picture, but I think I can find a solution for that (probably a diary file?).
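The diary idea mentioned above might look like this minimal sketch (the log file name is illustrative); diary appends to an existing file, so the verbose output of successive jobs accumulates in one place:
```matlab
% Append this run's command-window output to a shared log file.
diary('bayesopt_log.txt')   % appends if the file already exists
% ... run or resume bayesopt here ...
diary off                   % stop logging before the script exits
```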
But I should give a warning that it did not work the second time with the spmd option for parallel bayesopt (in case anyone is using it, or any of the similar options documented for parallel bayesopt), even though the file was saved.
The error was:
ObjectiveFcn is a parallel.pool.Constant, but its 'Value' property does not exist on the current pool.
When I checked inside C = parallel.pool.Constant(valErrorFun), I noticed that C.Value "does not exist", probably because it is a function handle. However, when I first ran it using the automatic options and then ran the same code using the spmd option for parallel, it worked fine.
For now, I can't tell exactly why this happened, but I think there is an easy solution that I just don't know about yet.
Regardless, the solution that Aditya gave worked fine.
Thank you again Aditya.
I was unable to reproduce the issue. The following code works for me:
% Define the optimization variables at the client (they are used in the
% bayesopt call below)
x1 = optimizableVariable('x1',[-5,5]);
x2 = optimizableVariable('x2',[-5,5]);
spmd
    fun = makeFun();
end
C = parallel.pool.Constant(fun);
if exist('bayespar.mat','file')
    load('bayespar.mat');
    results = resume(BayesoptResults, ...
        'SaveFileName', 'bayespar.mat', ...
        'OutputFcn', {@saveToFile});
else
    results = bayesopt(C, [x1, x2], ...
        'AcquisitionFunctionName', 'expected-improvement-plus', ...
        'SaveFileName', 'bayespar.mat', ...
        'UseParallel', true, ...
        'OutputFcn', {@saveToFile});
end

function f = makeFun()
    f = @rosenbrocks;
    function f = rosenbrocks(x)
        f = 100*(x.x2 - x.x1^2)^2 + (1 - x.x1)^2;
    end
end
Can you provide the release info (the output of the version command), and any sample code that reproduces the issue?
Hi Aditya,
I apologize for not getting back to you sooner. I was working on something else that was taking a lot of time and didn't notice how much time had passed. I finally posted a question on MathWorks about it.
Here is the code with the issue:
clear;
clc;
close all;
spmd
valErrorFun = makeObjFcn();
end
C = parallel.pool.Constant(valErrorFun);
% Classical Neural Network - Hyperparameter tuning using Bayesian Optimization
simplefitInputs = [0 0.0498 0.0996 0.1550 0.2103 0.2657 0.3210 0.3825 ...
0.4440 0.5123 0.5807 0.6566 0.7409 0.8347 0.9388 1.0674 1.2102 1.3690 ...
1.5453 1.7041 1.8469 1.9898 2.1326 2.2755 2.4183 2.5612 2.7041 2.8469 ...
2.9898 3.1326 3.2755 3.4342 3.5929 3.7693 3.9457 4.1220 4.2984 4.4748 ...
4.6511 4.8275 4.9862 5.1450 5.3037 5.4466 5.5894 5.7323 5.8910 6.0674 ...
6.2437 6.3866 6.5295 6.6452 6.7389 6.8233 6.8992 6.9675 7.0290 7.0905 ...
7.1458 7.2012 7.2565 7.3119 7.3617 7.4115 7.4613 7.5167 7.5720 7.6273 ...
7.6827 7.7442 7.8057 7.8740 7.9499 8.0343 8.1384 8.2813 8.4577 8.6005 ...
8.7162 8.8100 8.8943 8.9702 9.0461 9.1145 9.1828 9.2511 9.3195 9.3878 ...
9.4637 9.5396 9.6240 9.7177 9.8334 9.9763];
simplefitTargets = [5.0472 5.3578 5.6632 5.9955 6.3195 6.6343 6.9389 ...
7.2645 7.5753 7.9020 8.2078 8.5216 8.8366 9.1432 9.4289 9.7007 9.8995 ...
10.0000 9.9786 9.8589 9.6876 9.4722 9.2283 8.9701 8.7099 8.4579 8.2217 ...
8.0065 7.8153 7.6494 7.5084 7.3793 7.2770 7.1912 7.1319 7.0972 7.0866 ...
7.1014 7.1440 7.2169 7.3100 7.4287 7.5699 7.7102 7.8544 7.9901 8.1120 ...
8.1811 8.1424 8.0056 7.7556 7.4618 7.1617 6.8445 6.5222 6.2041 5.8970 ...
5.5721 5.2664 4.9500 4.6250 4.2937 3.9920 3.6889 3.3863 3.0529 2.7252 ...
2.4056 2.0968 1.7695 1.4619 1.1469 0.8345 0.5391 0.2564 0.0263 0 0.1787 ...
0.4413 0.7207 1.0154 1.3092 1.6244 1.9214 2.2266 2.5356 2.8438 3.1469 ...
3.4723 3.7799 4.0938 4.3986 4.6956 4.9132];
%%Choose Variables to Optimize
minHiddenLayerSize = 10;
maxHiddenLayerSize = 20;
hiddenLayerSizeRange = [minHiddenLayerSize maxHiddenLayerSize];
optimVars = [
optimizableVariable('Layer1Size',hiddenLayerSizeRange,'Type','integer')
optimizableVariable('Layer2Size',hiddenLayerSizeRange,'Type','integer')];
%%Perform Bayesian Optimization
%ObjFcn = makeObjFcn(simplefitInputs, simplefitTargets);
% Print command window results into a text file
% Save to file and resume from where it is left
if exist('BayesoptResults.mat', 'file')
    load('BayesoptResults.mat');
    BayesObject = resume(BayesoptResults, 'SaveFilename', 'BayesoptResults.mat', 'OutputFcn', {@saveToFile});
else
    BayesObject = bayesopt(C, optimVars, ...
        "MaxObj", 30, ...
        "PlotFcn", [], ...
        "MaxTime", 8*60*60, ...
        "IsObjectiveDeterministic", false, ...
        "UseParallel", true, ...
        'SaveFileName', 'BayesoptResults.mat', 'OutputFcn', {@saveToFile});
end
%%Evaluate Final Network
bestIdx = BayesObject.IndexOfMinimumTrace(end);
fileName = BayesObject.UserDataTrace{bestIdx};
load(fileName);
YPredicted = net(simplefitInputs);
testError = perform(net,simplefitTargets,YPredicted);
testError
valError
%%etc.
% ...
%%Objective Function for Optimization
function ObjFcn = makeObjFcn()
%%Input-Output Fitting with a Neural Network and Bayesian Optimization
%%Prepare Data
XTrain = [0 0.0498 0.0996 0.1550 0.2103 0.2657 0.3210 0.3825 ...
0.4440 0.5123 0.5807 0.6566 0.7409 0.8347 0.9388 1.0674 1.2102 1.3690 ...
1.5453 1.7041 1.8469 1.9898 2.1326 2.2755 2.4183 2.5612 2.7041 2.8469 ...
2.9898 3.1326 3.2755 3.4342 3.5929 3.7693 3.9457 4.1220 4.2984 4.4748 ...
4.6511 4.8275 4.9862 5.1450 5.3037 5.4466 5.5894 5.7323 5.8910 6.0674 ...
6.2437 6.3866 6.5295 6.6452 6.7389 6.8233 6.8992 6.9675 7.0290 7.0905 ...
7.1458 7.2012 7.2565 7.3119 7.3617 7.4115 7.4613 7.5167 7.5720 7.6273 ...
7.6827 7.7442 7.8057 7.8740 7.9499 8.0343 8.1384 8.2813 8.4577 8.6005 ...
8.7162 8.8100 8.8943 8.9702 9.0461 9.1145 9.1828 9.2511 9.3195 9.3878 ...
9.4637 9.5396 9.6240 9.7177 9.8334 9.9763];
YTrain = [5.0472 5.3578 5.6632 5.9955 6.3195 6.6343 6.9389 ...
7.2645 7.5753 7.9020 8.2078 8.5216 8.8366 9.1432 9.4289 9.7007 9.8995 ...
10.0000 9.9786 9.8589 9.6876 9.4722 9.2283 8.9701 8.7099 8.4579 8.2217 ...
8.0065 7.8153 7.6494 7.5084 7.3793 7.2770 7.1912 7.1319 7.0972 7.0866 ...
7.1014 7.1440 7.2169 7.3100 7.4287 7.5699 7.7102 7.8544 7.9901 8.1120 ...
8.1811 8.1424 8.0056 7.7556 7.4618 7.1617 6.8445 6.5222 6.2041 5.8970 ...
5.5721 5.2664 4.9500 4.6250 4.2937 3.9920 3.6889 3.3863 3.0529 2.7252 ...
2.4056 2.0968 1.7695 1.4619 1.1469 0.8345 0.5391 0.2564 0.0263 0 0.1787 ...
0.4413 0.7207 1.0154 1.3092 1.6244 1.9214 2.2266 2.5356 2.8438 3.1469 ...
3.4723 3.7799 4.0938 4.3986 4.6956 4.9132];
ObjFcn = @valErrorFun;
function [valError,cons,fileName] = valErrorFun(optVars)
% Solve an Input-Output Fitting problem with a Neural Network
% Choose a Training Function
% For a list of all training functions type: help nntrain
% 'trainlm' is usually fastest.
% 'trainbr' takes longer but may be better for challenging problems.
% 'trainscg' uses less memory. Suitable in low memory situations.
trainFcn = 'trainlm'; % Levenberg-Marquardt backpropagation.
% Create a Fitting Network
layer1_size = optVars.Layer1Size;
layer2_size = optVars.Layer2Size;
hiddenLayerSizes = [layer1_size layer2_size];
net = fitnet(hiddenLayerSizes,trainFcn);
% Setup Division of Data for Training, Validation, Testing
net.divideParam.trainRatio = 70/100;
net.divideParam.valRatio = 15/100;
net.divideParam.testRatio = 15/100;
% Train the Network
net.trainParam.showWindow = false;
net.trainParam.showCommandLine = false;
[net,~] = train(net,XTrain,YTrain);
% Test the Network
YPredicted = net(XTrain);
valError = perform(net,YTrain,YPredicted);
fileName = num2str(valError) + ".mat";
save(fileName,'net','valError')
cons = [];
end
end
The error for me is:
Error using bayesoptim.BayesoptParallel/ensureObjFcnIsOnWorkers (line 82)
ObjectiveFcn is a parallel.pool.Constant, but its 'Value' property does not exist on the current pool.
Error in bayesoptim.BayesoptParallel (line 42)
this = ensureObjFcnIsOnWorkers(this, objFcn, Verbose);
Error in BayesianOptimization/initializeParallel (line 2002)
this.Parallel = bayesoptim.BayesoptParallel(this.ObjectiveFcn, this.PrivOptions.NumCoupledConstraints, this.PrivOptions.Verbose);
Error in BayesianOptimization/resume (line 444)
Results = initializeParallel(Results);
Error in Bayesian_optimization_classical_NN (line 52)
BayesObject = resume(BayesoptResults, 'SaveFilename', 'BayesoptResults.mat', 'OutputFcn', {@saveToFile});
"Can you provide the release info(output of version command)": The matlab release is R2020a.
I wasn't able to reproduce the issue with either R2020a or R2020b.
That's fine.
Your suggested method still works in general (automatic) and that's what I need for the time being.
I really appreciate your answering my question; it was a life saver, and I'm forever grateful for it.
I also really appreciate your trying to reproduce the issue, even though it was not successful. There may be other, non-code-related reasons for the issue that I can't pinpoint right now, but at least it's good to know it worked fine for you. This knowledge might help me in the future.
Have a great day.


More Answers (1)

There is one other possible solution to your problem: surrogateopt from Optimization Toolbox™ can checkpoint automatically. It does not perform Bayesian optimization, but surrogate optimization is closely related and might be similar enough for your purposes.
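A minimal sketch of that checkpoint/resume pattern, assuming a toy objective and an illustrative checkpoint file name (surrogateopt's CheckpointFile option writes the state; passing the checkpoint file back to surrogateopt resumes it):
```matlab
if exist('surrogate_check.mat', 'file')
    % A checkpoint exists: resume the optimization from where it stopped
    [x, fval] = surrogateopt('surrogate_check.mat');
else
    % First run: enable checkpointing and start fresh
    opts = optimoptions('surrogateopt', 'CheckpointFile', 'surrogate_check.mat');
    [x, fval] = surrogateopt(@(x) sum(x.^2), [-5 -5], [5 5], opts);
end
```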
Alan Weiss
MATLAB mathematical toolbox documentation

2 Comments

Thank you so much Mr. Weiss,
I wasn't aware of surrogateopt option and its ability to checkpoint automatically.
I will have to look into it as well and see which one works best for me.
Also, I noticed that when I enable parallel in bayesopt, it uses multiple cores for each iteration (I initially thought it would send each run to a different core simultaneously, i.e., observe multiple points at once). I will look into surrogateopt, but do you by any chance know if it has a parallel option and whether it works similarly to bayesopt?
Thank you.
Reference page: surrogateopt
Algorithm description, including parallel: Surrogate Optimization Algorithm
Options description: Surrogate Optimization Options
Alan Weiss
MATLAB mathematical toolbox documentation

