
Compare Agents on Continuous Pendulum Swing-Up Environment with Image Observation

This example shows how to create and train frequently used default agents on a continuous action space pendulum swing-up environment. This environment represents a simple frictionless pendulum that initially hangs in a downward position. The agent, which senses the pendulum position from an image, can apply a control torque to the pendulum, and its goal is to make the pendulum stand upright using minimal control effort. The example plots performance metrics such as the total training time and the total reward for each trained agent. The results that the agents obtain in this environment, with the selected initial conditions and random number generator seed, do not necessarily imply that specific agents are in general better than others. Also, note that the training times depend on the computer and operating system you use to run the example, and on other processes running in the background. Your training times might differ substantially from the training times shown in the example.

Due to the large observation space, the correspondingly large default neural networks used by the agents, and the fact that many agents reach the maximum number of training episodes (because training often does not converge), the example can take days to run, even when using a GPU. To run the complete example, set doTraining to true. Otherwise, to inspect the example or run only selected sections, keep doTraining set to false.

doTraining = false;
if ~doTraining, return, end

Fix Random Number Stream for Reproducibility

The example code might involve computation of random numbers at various stages. Fixing the random number stream at the beginning of various sections in the example code preserves the random number sequence in the section every time you run it, and increases the likelihood of reproducing the results. For more information, see Results Reproducibility.

Fix the random number stream with seed 0 and random number algorithm Mersenne Twister. For more information on controlling the seed used for random number generation, see rng.

previousRngState = rng(0,"twister");

The output previousRngState is a structure that contains information about the previous state of the stream. You will restore the state at the end of the example.
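
If you want to inspect the saved settings, display the structure. The output of rng is a structure with the fields Type, Seed, and State.

% Display the previously active generator settings (Type, Seed, and State).
disp(previousRngState)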

Continuous Action Space Simple Pendulum with Image MATLAB Environment

The reinforcement learning environment for this example is a simple frictionless pendulum that initially hangs in a downward position. The training goal is to make the pendulum stand upright using minimal control effort.

For this environment:

  • The balanced, upright pendulum position is zero radians, and the downward hanging pendulum position is pi radians.

  • The torque action signal from the agent to the environment is from –2 to 2 N·m.

  • The environment provides two observations: an image indicating the location of the pendulum mass, and the pendulum angular velocity.

  • The reward $r_t$, provided at every time step, is

$$r_t = -\left(\theta_t^2 + 0.1\,\dot{\theta}_t^2 + 0.001\,u_{t-1}^2\right)$$

Here:

  • $\theta_t$ is the angle of displacement from the upright position.

  • $\dot{\theta}_t$ is the derivative of the displacement angle.

  • $u_{t-1}$ is the control effort from the previous time step.

For more information on this model, see Load Predefined Control System Environments.
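
As a quick sanity check on the reward definition, the following minimal sketch evaluates the reward for sample values. The anonymous function and the sample inputs are for illustration only; the environment computes the reward internally.

% Minimal sketch of the step reward for hypothetical values of the angle
% (rad), angular velocity (rad/s), and previous torque (N*m).
rewardFcn = @(theta,thetaDot,uPrev) -(theta.^2 + 0.1*thetaDot.^2 + 0.001*uPrev.^2);
rewardFcn(pi,0,0)   % hanging straight down with no torque, about -9.87
rewardFcn(0,0,0)    % balanced upright with no torque, reward is 0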

Create Environment Object

Create a predefined environment object for the continuous pendulum environment.

env = rlPredefinedEnv("SimplePendulumWithImage-Continuous")

The environment reset function initializes and returns the environment state (the pendulum angle and angular velocity).

reset(env)

You can visualize the pendulum system using the plot function during training or simulation.

plot(env)

Obtain the observation and action information for later use when creating agents.

obsInfo = getObservationInfo(env)
actInfo = getActionInfo(env)
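
Optionally, inspect the dimensions and limits stored in the specification objects. This check assumes that the environment returns two observation channels, the image and the angular velocity, as described above, and uses the Dimension, LowerLimit, and UpperLimit properties of the specification objects.

% Optional check of the observation and action specifications.
imageDims = obsInfo(1).Dimension    % size of the image observation channel
rateDims  = obsInfo(2).Dimension    % size of the angular velocity channel
torqueLim = [actInfo.LowerLimit actInfo.UpperLimit]   % expected [-2 2] N*m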

Configure Training Options for All Agents

Set up an evaluator object to evaluate the agent 10 times without exploration every 100 training episodes.

evl = rlEvaluator(NumEpisodes=10,EvaluationFrequency=100);

Create a training options object. For this example, use the following options.

  • Run the training for a maximum of 5000 episodes, with each episode lasting 500 time steps.

  • Stop the training when the agent receives an average cumulative reward greater than –1000 over the 10 consecutive evaluation episodes. At this point, the agent can quickly balance the pendulum upright using minimal control effort.

  • To gain better insight into the agent's behavior during training, plot the training progress (the default option). To achieve faster training times, set the Plots option to "none", as shown after the training options code.

trainOpts = rlTrainingOptions(...
    MaxEpisodes=5000, ...
    MaxStepsPerEpisode=500, ...
    StopTrainingCriteria="EvaluationStatistic",...
    StopTrainingValue=-1000);

For more information on training options, see rlTrainingOptions.
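
If you prefer to train without the training progress plot, you can disable it by setting the Plots training option mentioned above.

% Uncomment to disable the training progress plot for faster training.
% trainOpts.Plots = "none";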

To simulate the trained agent, create a simulation options object and configure it to simulate for 500 steps.

simOptions = rlSimulationOptions(MaxSteps=500);

For more information on simulation options, see rlSimulationOptions.

Create, Train, and Simulate a PG Agent

The actor and critic networks are initialized stochastically. Ensure reproducibility of the section by fixing the seed used for random number generation.

rng(0,"twister")

First, create a default rlPGAgent object using the environment specification objects.

pgAgent = rlPGAgent(obsInfo,actInfo);

Set a lower learning rate and a lower gradient threshold to promote a smoother (though possibly slower) training.

pgAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-3;
pgAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-3;
pgAgent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
pgAgent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;

Set the entropy loss weight to increase exploration.

pgAgent.AgentOptions.EntropyLossWeight = 0.005;

If the computer can use the GPU, fix the random number generator seed used on the GPU for reproducibility, and set the actor and critic to use the GPU for gradient calculations. For more information, see canUseGPU and gpurng (Parallel Computing Toolbox).

if canUseGPU
    % Fix random seed
    gpurng(0,"threefry")
    % Use GPU for actor gradient calculations
    actor = getActor(pgAgent);
    actor.UseDevice = "gpu";
    setActor(pgAgent,actor);
    % Use GPU for critic gradient calculations
    critic = getCritic(pgAgent);
    critic.UseDevice = "gpu";
    setCritic(pgAgent,critic);
end

Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to the train function.

% Recreate the environment so it does not plot during training.
env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");
% Train the agent. Record the training time.
tic
pgTngRes = train(pgAgent,env,trainOpts,Evaluator=evl);
pgTngTime = toc;
% Extract the number of training episodes and the number of total steps.
pgTngEps = pgTngRes.EpisodeIndex(end);
pgTngSteps = sum(pgTngRes.TotalAgentSteps);
% Uncomment to save the trained agent and the training metrics.
% save("cpsuImgBchPGData.mat", ...
%     "pgAgent","pgTngEps","pgTngSteps","pgTngTime")

For the PG agent, the training does not converge to a solution. You can check the trained agent within the pendulum environment.

Ensure reproducibility of the simulation by fixing the seed used for random number generation.

rng(0,"twister")

Visualize the environment.

plot(env)

Configure the agent to use a greedy policy (no exploration) in simulation.

pgAgent.UseExplorationPolicy = false;

Simulate the environment with the trained agent for 500 steps and display the total reward. For more information on agent simulation, see sim.

experience = sim(env,pgAgent,simOptions);
pgTotalRwd = sum(experience.Reward)

The trained PG agent is not able to swing up the pendulum.

Create, Train, and Simulate an AC Agent

The actor and critic networks are initialized stochastically. Ensure reproducibility of the section by fixing the seed used for random number generation.

rng(0,"twister")

First, create a default rlACAgent object using the environment specification objects.

acAgent = rlACAgent(obsInfo,actInfo);

Set a lower learning rate and a lower gradient threshold to promote a smoother (though possibly slower) training.

acAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-3;
acAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-3;
acAgent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
acAgent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;

Set the entropy loss weight to increase exploration.

acAgent.AgentOptions.EntropyLossWeight = 0.005;

If the computer can use the GPU, fix the random number generator seed used on the GPU for reproducibility, and set the actor and critic to use the GPU for gradient calculations. For more information, see canUseGPU and gpurng (Parallel Computing Toolbox).

if canUseGPU
    % Fix random seed
    gpurng(0,"threefry")
    % Use GPU for actor gradient calculations
    actor = getActor(acAgent);
    actor.UseDevice = "gpu";
    setActor(acAgent,actor);
    % Use GPU for critic gradient calculations
    critic = getCritic(acAgent);
    critic.UseDevice = "gpu";
    setCritic(acAgent,critic);
end

Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to the train function.

% Recreate the environment so it does not plot during training.
env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");
% Train the agent. Record the training time.
tic
acTngRes = train(acAgent,env,trainOpts,Evaluator=evl);
acTngTime = toc;
% Extract the number of training episodes and the number of total steps.
acTngEps = acTngRes.EpisodeIndex(end);
acTngSteps = sum(acTngRes.TotalAgentSteps);
% Uncomment to save the trained agent and the training metrics.
% save("cpsuImgBchACData.mat", ...
%     "acAgent","acTngEps","acTngSteps","acTngTime")

For the AC agent, the training does not converge to a solution. You can check the trained agent within the pendulum environment.

Ensure reproducibility of the simulation by fixing the seed used for random number generation.

rng(0,"twister")

Visualize the environment.

plot(env)

Configure the agent to use a greedy policy (no exploration) in simulation.

acAgent.UseExplorationPolicy = false;

Simulate the environment with the trained agent for 500 steps and display the total reward. For more information on agent simulation, see sim.

experience = sim(env,acAgent,simOptions);
acTotalRwd = sum(experience.Reward)

The trained AC agent is not able to swing up the pendulum.

Create, Train, and Simulate a PPO Agent

The actor and critic networks are initialized stochastically. Ensure reproducibility of the section by fixing the seed used for random number generation.

rng(0,"twister")

First, create a default rlPPOAgent object using the environment specification objects.

ppoAgent = rlPPOAgent(obsInfo,actInfo);

Set a lower learning rate and a lower gradient threshold to promote a smoother (though possibly slower) training.

ppoAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-3;
ppoAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-3;
ppoAgent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
ppoAgent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;

If the computer can use the GPU, fix the random number generator seed used on the GPU for reproducibility, and set the actor and critic to use the GPU for gradient calculations. For more information, see canUseGPU and gpurng (Parallel Computing Toolbox).

if canUseGPU
    % Fix random seed
    gpurng(0,"threefry")
    % Use GPU for actor gradient calculations
    actor = getActor(ppoAgent);
    actor.UseDevice = "gpu";
    setActor(ppoAgent,actor);
    % Use GPU for critic gradient calculations
    critic = getCritic(ppoAgent);
    critic.UseDevice = "gpu";
    setCritic(ppoAgent,critic);
end

Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to the train function.

% Recreate the environment so it does not plot during training.
env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");
% Train the agent. Record the training time.
tic
ppoTngRes = train(ppoAgent,env,trainOpts,Evaluator=evl);
ppoTngTime = toc;
% Extract the number of training episodes and the number of total steps.
ppoTngEps = ppoTngRes.EpisodeIndex(end);
ppoTngSteps = sum(ppoTngRes.TotalAgentSteps);
% Uncomment to save the trained agent and the training metrics.
% save("cpsuImgBchPPOData.mat", ...
%     "ppoAgent","ppoTngEps","ppoTngSteps","ppoTngTime")

For the PPO agent, the training does not converge to a solution. You can check the trained agent within the pendulum environment.

Ensure reproducibility of the simulation by fixing the seed used for random number generation.

rng(0,"twister")

Visualize the environment.

plot(env)

Configure the agent to use a greedy policy (no exploration) in simulation.

ppoAgent.UseExplorationPolicy = false;

Simulate the environment with the trained agent for 500 steps and display the total reward. For more information on agent simulation, see sim.

experience = sim(env,ppoAgent,simOptions);
ppoTotalRwd = sum(experience.Reward)

The trained PPO agent does not swing up the pendulum.

Create, Train, and Simulate a DDPG Agent

The actor and critic networks are initialized stochastically. Ensure reproducibility of the section by fixing the seed used for random number generation.

rng(0,"twister")

First, create a default rlDDPGAgent object using the environment specification objects.

ddpgAgent = rlDDPGAgent(obsInfo,actInfo);

Set a lower learning rate and a lower gradient threshold to promote a smoother (though possibly slower) training.

ddpgAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-3;
ddpgAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-3;
ddpgAgent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
ddpgAgent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;

If the computer can use the GPU, fix the random number generator seed used on the GPU for reproducibility, and set the actor and critic to use the GPU for gradient calculations. For more information, see canUseGPU and gpurng (Parallel Computing Toolbox).

if canUseGPU
    % Fix random seed
    gpurng(0,"threefry")
    % Use GPU for actor gradient calculations
    actor = getActor(ddpgAgent);
    actor.UseDevice = "gpu";
    setActor(ddpgAgent,actor);
    % Use GPU for critic gradient calculations
    critic = getCritic(ddpgAgent);
    critic.UseDevice = "gpu";
    setCritic(ddpgAgent,critic);
end

Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to the train function.

% Recreate the environment so it does not plot during training.
env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");
% Train the agent. Record the training time.
tic
ddpgTngRes = train(ddpgAgent,env,trainOpts,Evaluator=evl);
ddpgTngTime = toc;
% Extract the number of training episodes and the number of total steps.
ddpgTngEps = ddpgTngRes.EpisodeIndex(end);
ddpgTngSteps = sum(ddpgTngRes.TotalAgentSteps);
% Uncomment to save the trained agent and the training metrics.
% save("cpsuImgBchDDPGData.mat", ...
%     "ddpgAgent","ddpgTngEps","ddpgTngSteps","ddpgTngTime")

For the DDPG agent, the training converges to a solution. You can check the trained agent within the pendulum environment.

Ensure reproducibility of the simulation by fixing the seed used for random number generation.

rng(0,"twister")

Visualize the environment.

plot(env)

Configure the agent to use a greedy policy (no exploration) in simulation.

ddpgAgent.UseExplorationPolicy = false;

Simulate the environment with the trained agent for 500 steps and display the total reward. For more information on agent simulation, see sim.

experience = sim(env,ddpgAgent,simOptions);
ddpgTotalRwd = sum(experience.Reward)

The trained DDPG agent is able to swing up the pendulum.

Create, Train, and Simulate a TD3 Agent

The actor and critic networks are initialized stochastically. Ensure reproducibility of the section by fixing the seed used for random number generation.

rng(0,"twister")

First, create a default rlTD3Agent object using the environment specification objects.

td3Agent = rlTD3Agent(obsInfo,actInfo);

Set a lower learning rate and a lower gradient threshold to promote a smoother (though possibly slower) training.

td3Agent.AgentOptions.CriticOptimizerOptions(1).LearnRate = 1e-3;
td3Agent.AgentOptions.CriticOptimizerOptions(2).LearnRate = 1e-3;
td3Agent.AgentOptions.CriticOptimizerOptions(1).GradientThreshold = 1;
td3Agent.AgentOptions.CriticOptimizerOptions(2).GradientThreshold = 1;

td3Agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-3;
td3Agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;

Use a larger experience buffer to store more experiences, thereby decreasing the likelihood of catastrophic forgetting.

td3Agent.AgentOptions.ExperienceBufferLength = 1e6;

If the computer can use the GPU, fix the random number generator seed used on the GPU for reproducibility, and set the actor and critic to use the GPU for gradient calculations. For more information, see canUseGPU and gpurng (Parallel Computing Toolbox).

if canUseGPU
    % Fix random seed
    gpurng(0,"threefry")
    % Use GPU for actor gradient calculations
    actor = getActor(td3Agent);
    actor.UseDevice = "gpu";
    setActor(td3Agent,actor);
    % Use GPU for critic gradient calculations
    critics = getCritic(td3Agent);
    critics(1).UseDevice = "gpu";
    critics(2).UseDevice = "gpu";
    setCritic(td3Agent,critics);
end

Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to the train function.

% Recreate the environment so it does not plot during training.
env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");
% Train the agent. Record the training time.
tic
td3TngRes = train(td3Agent,env,trainOpts,Evaluator=evl);
td3TngTime = toc;
% Extract the number of training episodes and the number of total steps.
td3TngEps = td3TngRes.EpisodeIndex(end);
td3TngSteps = sum(td3TngRes.TotalAgentSteps);
% Uncomment to save the trained agent and the training metrics.
% save("cpsuImgBchTD3Data.mat", ...
%     "td3Agent","td3TngEps","td3TngSteps","td3TngTime")

For the TD3 agent, the training does not converge to a solution. You can check the trained agent within the pendulum environment.

Ensure reproducibility of the simulation by fixing the seed used for random number generation.

rng(0,"twister")

Visualize the environment.

plot(env)

Configure the agent to use a greedy policy (no exploration) in simulation.

td3Agent.UseExplorationPolicy = false;

Simulate the environment with the trained agent for 500 steps and display the total reward. For more information on agent simulation, see sim.

experience = sim(env,td3Agent,simOptions);
td3TotalRwd = sum(experience.Reward)

The trained TD3 agent does not swing up the pendulum.

Create, Train, and Simulate a SAC Agent

The actor and critic networks are initialized stochastically. Ensure reproducibility of the section by fixing the seed used for random number generation.

rng(0,"twister")

First, create a default rlSACAgent object using the environment specification objects.

sacAgent = rlSACAgent(obsInfo,actInfo);

Set a lower learning rate and a lower gradient threshold to promote a smoother (though possibly slower) training.

sacAgent.AgentOptions.CriticOptimizerOptions(1).LearnRate = 1e-3;
sacAgent.AgentOptions.CriticOptimizerOptions(2).LearnRate = 1e-3;
sacAgent.AgentOptions.CriticOptimizerOptions(1).GradientThreshold = 1;
sacAgent.AgentOptions.CriticOptimizerOptions(2).GradientThreshold = 1;

sacAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-3;
sacAgent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;

Set the initial entropy weight and target entropy to increase exploration.

sacAgent.AgentOptions.EntropyWeightOptions.EntropyWeight = 5e-3;
sacAgent.AgentOptions.EntropyWeightOptions.TargetEntropy = 5e-1;

Use a larger experience buffer to store more experiences, thereby decreasing the likelihood of catastrophic forgetting.

sacAgent.AgentOptions.ExperienceBufferLength = 1e6;

If the computer can use the GPU, fix the random number generator seed used on the GPU for reproducibility, and set the actor and critic to use the GPU for gradient calculations. For more information, see canUseGPU and gpurng (Parallel Computing Toolbox).

if canUseGPU
    % Fix random seed
    gpurng(0,"threefry")
    % Use GPU for actor gradient calculations
    actor = getActor(sacAgent);
    actor.UseDevice = "gpu";
    setActor(sacAgent,actor);
    % Use GPU for critic gradient calculations
    critics = getCritic(sacAgent);
    critics(1).UseDevice = "gpu";
    critics(2).UseDevice = "gpu";
    setCritic(sacAgent,critics);
end

Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to the train function.

% Recreate the environment so it does not plot during training.
env = rlPredefinedEnv("SimplePendulumWithImage-Continuous");
% Train the agent. Record the training time.
tic
sacTngRes = train(sacAgent,env,trainOpts,Evaluator=evl);
sacTngTime = toc;
% Extract the number of training episodes and the number of total steps.
sacTngEps = sacTngRes.EpisodeIndex(end);
sacTngSteps = sum(sacTngRes.TotalAgentSteps);
% Uncomment to save the trained agent and the training metrics.
% save("cpsuImgBchSACData.mat", ...
%     "sacAgent","sacTngEps","sacTngSteps","sacTngTime")

For the SAC agent, the training does not converge to a solution. You can check the trained agent within the pendulum environment.

Ensure reproducibility of the simulation by fixing the seed used for random number generation.

rng(0,"twister")

Visualize the environment.

plot(env)

Configure the agent to use a greedy policy (no exploration) in simulation.

sacAgent.UseExplorationPolicy = false;

Simulate the environment with the trained agent for 500 steps and display the total reward. For more information on agent simulation, see sim.

experience = sim(env,sacAgent,simOptions);
sacTotalRwd = sum(experience.Reward)

The trained SAC agent is not able to swing up the pendulum.

Plot Training and Simulation Metrics

For each agent, collect the total reward from the final simulation episode, the number of training episodes, the total number of agent steps, and the total training time as shown in the Reinforcement Learning Training Monitor.

simReward = [
    pgTotalRwd
    acTotalRwd
    ppoTotalRwd
    ddpgTotalRwd
    td3TotalRwd
    sacTotalRwd
    ];

tngEpisodes = [
    pgTngEps
    acTngEps
    ppoTngEps
    ddpgTngEps
    td3TngEps
    sacTngEps
    ];

tngSteps = [
    pgTngSteps
    acTngSteps
    ppoTngSteps
    ddpgTngSteps
    td3TngSteps
    sacTngSteps
    ];

tngTime = [
    pgTngTime
    acTngTime
    ppoTngTime
    ddpgTngTime
    td3TngTime
    sacTngTime
    ];

Plot the simulation reward, number of training episodes, number of training steps, and training time. For better visualization, scale the data by dividing by the factors [1 1 5e5 5].

bar([simReward,tngEpisodes,tngSteps,tngTime]./[1 1 5e5 5])
xticklabels(["PG" "AC" "PPO" "DDPG" "TD3" "SAC"])
legend(["Simulation Reward","Training Episodes", ...
        "Training Steps","Training Time"], ...
    "Location","northwest")

The plot shows that, for this environment, and with the selected random number generator seed and initial conditions, only the DDPG agent converges to a solution and is able to swing up the pendulum. As expected, the TD3 and SAC agents take longer to train because of their more complex algorithms, which require calculating more gradients. With a different random seed, the initial agent networks would be different, and therefore the convergence results might differ. For more information on the relative strengths and weaknesses of each agent, see Reinforcement Learning Agents.

You can save all the variables created in this example, including the training results, for later use.

% Uncomment to save all the workspace variables.
% save cpsuImgAllVars.mat

Restore the random number stream using the information stored in previousRngState.

rng(previousRngState);
