Compare Agents on the Discrete Double-Integrator Environment
This example shows how to create and train frequently used agents on a discrete double-integrator environment. The training goal is to control the position of a mass in the second-order system by applying a force input. The example plots performance metrics such as the total training time and the total reward for each trained agent. The results that the agents obtain in this environment, with the selected initial conditions and random number generator seed, do not necessarily imply that specific agents are better than others. Also, note that the training times depend on the computer and operating system you use to run the example, and on other processes running in the background. Your training times might differ substantially from the training times shown in the example.
Fix Random Number Stream for Reproducibility
The example code might involve computation of random numbers at various stages. Fixing the random number stream at the beginning of various sections in the example code preserves the random number sequence in the section every time you run it, and increases the likelihood of reproducing the results. For more information, see Results Reproducibility.
Fix the random number stream with seed zero and random number algorithm Mersenne Twister. For more information on controlling the seed used for random number generation, see rng.
previousRngState = rng(0,"twister")
previousRngState = struct with fields:
Type: 'twister'
Seed: 0
State: [625×1 uint32]
The output previousRngState is a structure that contains information about the previous state of the stream. You will restore this state at the end of the example.
Discrete Action Space Double-Integrator MATLAB Environment
The reinforcement learning environment for this example is a second-order double-integrator system with a gain and a discrete action space. The training goal is to control the position of a mass in the second-order system by applying a force input.
For this environment:
The mass starts at an initial position of 2 m and zero velocity.
The agent can apply one of three possible force values to the mass: -2, 0, or 2 N.
The observations from the environment are the position and velocity of the mass.
The episode terminates if the mass moves more than 5 m from the original position or if the mass position comes within the goal threshold (0.01 m) of the origin.
The reward rt, provided at every time step, is a discretization of r(t):
r(t) = -(x(t)'Qx(t) + u(t)'Ru(t))
Here:
x is the state vector of the mass (position and velocity).
u is the force applied to the mass.
Q is the weight matrix on the state deviation from zero (the Q property of the environment).
R is the weight on the control effort; R = 0.01.
For more information on this model, see Load Predefined Control System Environments.
Create Environment Object
Create a predefined environment object for the double-integrator.
env = rlPredefinedEnv("DoubleIntegrator-Discrete")
env = 
  DoubleIntegratorDiscreteAction with properties:

             Gain: 1
               Ts: 0.1000
      MaxDistance: 5
    GoalThreshold: 0.0100
                Q: [2×2 double]
                R: 0.0100
         MaxForce: 2
            State: [2×1 double]
The environment reset function initializes and returns the environment state (position and velocity).
reset(env)
ans = 2×1
2
0
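To see how the reward relates to the quadratic cost described above, you can optionally take a single manual step in the environment. This sketch is not part of the original example; it only relies on the step function of MATLAB environment objects and on the Q and R properties shown in the environment display.
% Optional sketch: take one manual step and compare the returned reward
% with the quadratic cost computed from the environment weights.
x = reset(env);                 % initial state: position 2 m, zero velocity
u = 2;                          % apply the maximum force, 2 N
[nextObs,rwd,isDone] = step(env,u);
% Quadratic cost evaluated at the pre-step state (for illustration only; the
% environment might discretize the cost slightly differently).
rwdCheck = -(x'*env.Q*x + u'*env.R*u);
reset(env);                     % restore the initial state before continuing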
You can visualize the double-integrator system by using the plot function during training or simulation.
plot(env)
Obtain the observation and action information for later use when creating agents.
obsInfo = getObservationInfo(env)
obsInfo = 
  rlNumericSpec with properties:

     LowerLimit: -Inf
     UpperLimit: Inf
           Name: "states"
    Description: "x, dx"
      Dimension: [2 1]
       DataType: "double"
actInfo = getActionInfo(env)
actInfo = 
  rlFiniteSetSpec with properties:

       Elements: [-2 0 2]
           Name: "force"
    Description: [0×0 string]
      Dimension: [1 1]
       DataType: "double"
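The action specification is a finite-set specification containing the three allowed force values. As an aside (not required by this example), an equivalent specification could be built directly; the lines below are only an illustration of the rlFiniteSetSpec constructor.
% Illustration only: build an action specification equivalent to actInfo.
actInfoManual = rlFiniteSetSpec([-2 0 2]);
actInfoManual.Name = "force";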
Configure Training and Simulation Options for All Agents
Set up an evaluator object to evaluate the agent ten times without exploration every 100 training episodes.
evl = rlEvaluator(NumEpisodes=10,EvaluationFrequency=100);
Create a training options object. For this example, use the following options.
Run the training for a maximum of 5000 episodes, with each episode lasting a maximum of 200 time steps.
Stop training when the average reward in the evaluation episodes is greater than –40. At this point, the agent can control the position of the mass using minimal control effort.
To gain better insight into the agent's behavior during training, plot the training progress (the default option). To achieve faster training times, set the Plots option to "none", as shown after the options object is created below.
trainOpts = rlTrainingOptions(...
    MaxEpisodes=5000, ...
    MaxStepsPerEpisode=200, ...
    StopTrainingCriteria="EvaluationStatistic",...
    StopTrainingValue=-40);
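For example, you can turn off the training progress plot after creating the options object. The line is shown commented out here so that the default plotting behavior remains in effect for this example.
% Optional: disable the training progress plot for faster training.
% trainOpts.Plots = "none";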
For more information on training options, see rlTrainingOptions.
To simulate the trained agent, create a simulation options object and configure it to simulate for 500 steps.
simOptions = rlSimulationOptions(MaxSteps=500);
For more information on simulation options, see rlSimulationOptions.
Create, Train, and Simulate a DQN Agent
The actor and critic networks are initialized randomly. Ensure reproducibility of the section by fixing the seed used for random number generation.
rng(0,"twister")
First, create a default rlDQNAgent object using the environment specification objects.
dqnAgent = rlDQNAgent(obsInfo,actInfo);
Set a lower learning rate and a lower gradient threshold to promote smoother (though possibly slower) training.
dqnAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-3;
dqnAgent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
Use a larger experience buffer to store more experiences, thereby decreasing the likelihood of catastrophic forgetting.
dqnAgent.AgentOptions.ExperienceBufferLength = 1e6;
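You can also inspect the epsilon-greedy exploration settings that the DQN agent uses during training. The following line is optional and only displays the default values stored in the agent options.
% Optional: display the epsilon-greedy exploration options (Epsilon,
% EpsilonMin, and EpsilonDecay) used by the agent during training.
dqnAgent.AgentOptions.EpsilonGreedyExploration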
Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to train. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % To avoid plotting in training, recreate the environment.
    env = rlPredefinedEnv("DoubleIntegrator-Discrete");
    % Train the agent. Record the training time.
    tic
    dqnTngRes = train(dqnAgent,env,trainOpts,Evaluator=evl);
    dqnTngTime = toc;
    % Extract the number of training episodes and the number of total steps.
    dqnTngEps = dqnTngRes.EpisodeIndex(end);
    dqnTngSteps = sum(dqnTngRes.TotalAgentSteps);
    % Uncomment to save the trained agent and the training metrics.
    % save("ddiBchDQNAgent.mat", ...
    %     "dqnAgent","dqnTngEps","dqnTngSteps","dqnTngTime")
else
    % Load the pretrained agent and results for the example.
    load("ddiBchDQNAgent.mat", ...
        "dqnAgent","dqnTngEps","dqnTngSteps","dqnTngTime")
end
For the DQN agent, the training converges to a solution after 200 episodes. You can check the trained agent within the double-integrator environment.
Ensure reproducibility of the simulation by fixing the seed used for random number generation.
rng(0,"twister")
Visualize the environment.
plot(env)
Configure the agent to use a greedy policy (no exploration) in simulation.
dqnAgent.UseExplorationPolicy = false;
Simulate the environment with the trained agent for 500 steps and display the total reward. For more information on agent simulation, see sim.
experience = sim(env,dqnAgent,simOptions);
dqnTotalRwd = sum(experience.Reward)
dqnTotalRwd = -82.8390
The trained DQN agent stabilizes the mass at the origin.
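To verify this, you can plot the mass position logged during the simulation. This sketch assumes the observation channel keeps the name "states" shown in obsInfo, with the position as its first element.
% Plot the position trajectory from the logged simulation data (sketch).
pos = squeeze(experience.Observation.states.Data(1,1,:));
figure
plot(experience.Observation.states.Time,pos)
xlabel("Time (s)")
ylabel("Position (m)")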
Create, Train, and Simulate a PG Agent
The actor and critic networks are initialized randomly. Ensure reproducibility of the section by fixing the seed used for random number generation.
rng(0,"twister")
First, create a default rlPGAgent object using the environment specification objects.
pgAgent = rlPGAgent(obsInfo,actInfo);
Set a lower learning rate and a lower gradient threshold to promote smoother (though possibly slower) training.
pgAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-3;
pgAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-3;
pgAgent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
pgAgent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
Set the entropy loss weight to increase exploration.
pgAgent.AgentOptions.EntropyLossWeight = 0.005;
Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to train. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % To avoid plotting in training, recreate the environment.
    env = rlPredefinedEnv("DoubleIntegrator-Discrete");
    % Train the agent. Record the training time.
    tic
    pgTngRes = train(pgAgent,env,trainOpts,Evaluator=evl);
    pgTngTime = toc;
    % Extract the number of training episodes and the number of total steps.
    pgTngEps = pgTngRes.EpisodeIndex(end);
    pgTngSteps = sum(pgTngRes.TotalAgentSteps);
    % Uncomment to save the trained agent and the training metrics.
    % save("ddiBchPGAgent.mat", ...
    %     "pgAgent","pgTngEps","pgTngSteps","pgTngTime")
else
    % Load the pretrained agent and results for the example.
    load("ddiBchPGAgent.mat", ...
        "pgAgent","pgTngEps","pgTngSteps","pgTngTime")
end
For the PG agent, the training converges to a solution after 400 episodes. You can check the trained agent within the double-integrator environment.
Ensure reproducibility of the simulation by fixing the seed used for random number generation.
rng(0,"twister")
Visualize the environment.
plot(env)
Configure the agent to use a greedy policy (no exploration) in simulation.
pgAgent.UseExplorationPolicy = false;
Simulate the environment with the trained agent for 500 steps and display the total reward. For more information on agent simulation, see sim.
experience = sim(env,pgAgent,simOptions);
pgTotalRwd = sum(experience.Reward)
pgTotalRwd = -35.5296
The trained PG agent does not stabilize the mass at the origin.
Create, Train, and Simulate an AC Agent
The actor and critic networks are initialized randomly. Ensure reproducibility of the section by fixing the seed used for random number generation.
rng(0,"twister")
First, create a default rlACAgent object using the environment specification objects.
acAgent = rlACAgent(obsInfo,actInfo);
Set a lower learning rate and a lower gradient threshold to promote smoother (though possibly slower) training.
acAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-3;
acAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-3;
acAgent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
acAgent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
Set the entropy loss weight to increase exploration.
acAgent.AgentOptions.EntropyLossWeight = 0.005;
Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to train. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % To avoid plotting in training, recreate the environment.
    env = rlPredefinedEnv("DoubleIntegrator-Discrete");
    % Train the agent. Record the training time.
    tic
    acTngRes = train(acAgent,env,trainOpts,Evaluator=evl);
    acTngTime = toc;
    % Extract the number of training episodes and the number of total steps.
    acTngEps = acTngRes.EpisodeIndex(end);
    acTngSteps = sum(acTngRes.TotalAgentSteps);
    % Uncomment to save the trained agent and the training metrics.
    % save("ddiBchACAgent.mat", ...
    %     "acAgent","acTngEps","acTngSteps","acTngTime")
else
    % Load the pretrained agent and results for the example.
    load("ddiBchACAgent.mat", ...
        "acAgent","acTngEps","acTngSteps","acTngTime")
end
For the AC agent, the training converges to a solution after 400 episodes. You can check the trained agent within the double-integrator environment.
Ensure reproducibility of the simulation by fixing the seed used for random number generation.
rng(0,"twister")
Visualize the environment.
plot(env)
Configure the agent to use a greedy policy (no exploration) in simulation.
acAgent.UseExplorationPolicy = false;
Simulate the environment with the trained agent for 500 steps and display the total reward. For more information on agent simulation, see sim.
experience = sim(env,acAgent,simOptions);
acTotalRwd = sum(experience.Reward)
acTotalRwd = -37.1481
The trained AC agent stabilizes the mass at the origin.
Create, Train, and Simulate a PPO Agent
The actor and critic networks are initialized randomly. Ensure reproducibility of the section by fixing the seed used for random number generation.
rng(0,"twister")
First, create a default rlPPOAgent object using the environment specification objects.
ppoAgent = rlPPOAgent(obsInfo,actInfo);
Set a lower learning rate and a lower gradient threshold to promote smoother (though possibly slower) training.
ppoAgent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-3;
ppoAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-3;
ppoAgent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
ppoAgent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to train. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % To avoid plotting in training, recreate the environment.
    env = rlPredefinedEnv("DoubleIntegrator-Discrete");
    % Train the agent. Record the training time.
    tic
    ppoTngRes = train(ppoAgent,env,trainOpts,Evaluator=evl);
    ppoTngTime = toc;
    % Extract the number of training episodes and the number of total steps.
    ppoTngEps = ppoTngRes.EpisodeIndex(end);
    ppoTngSteps = sum(ppoTngRes.TotalAgentSteps);
    % Uncomment to save the trained agent and the training metrics.
    % save("ddiBchPPOAgent.mat", ...
    %     "ppoAgent","ppoTngEps","ppoTngSteps","ppoTngTime")
else
    % Load the pretrained agent and results for the example.
    load("ddiBchPPOAgent.mat", ...
        "ppoAgent","ppoTngEps","ppoTngSteps","ppoTngTime")
end
For the PPO agent, the training converges to a solution after 200 episodes. You can check the trained agent within the double-integrator environment.
Ensure reproducibility of the simulation by fixing the seed used for random number generation.
rng(0,"twister")
Visualize the environment.
plot(env)
Configure the agent to use a greedy policy (no exploration) in simulation.
ppoAgent.UseExplorationPolicy = false;
Simulate the environment with the trained agent for 500 steps and display the total reward. For more information on agent simulation, see sim.
experience = sim(env,ppoAgent,simOptions);
ppoTotalRwd = sum(experience.Reward)
ppoTotalRwd = -35.2849
The trained PPO agent stabilizes the mass at the origin.
Create, Train, and Simulate a SAC Agent
The constructor functions initialize the agent networks randomly. Ensure reproducibility of the section by fixing the seed used for random number generation.
rng(0,"twister")
First, create a default rlSACAgent object using the environment specification objects.
sacAgent = rlSACAgent(obsInfo,actInfo);
Set a lower learning rate and a lower gradient threshold to promote smoother (though possibly slower) training.
sacAgent.AgentOptions.CriticOptimizerOptions(1).LearnRate = 1e-3;
sacAgent.AgentOptions.CriticOptimizerOptions(2).LearnRate = 1e-3;
sacAgent.AgentOptions.CriticOptimizerOptions(1).GradientThreshold = 1;
sacAgent.AgentOptions.CriticOptimizerOptions(2).GradientThreshold = 1;
sacAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-3;
sacAgent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
Set the initial entropy weight and target entropy to increase exploration.
sacAgent.AgentOptions.EntropyWeightOptions.EntropyWeight = 0.005;
sacAgent.AgentOptions.EntropyWeightOptions.TargetEntropy = 0.5;
Use a larger experience buffer to store more experiences, thereby decreasing the likelihood of catastrophic forgetting.
sacAgent.AgentOptions.ExperienceBufferLength = 1e6;
Train the agent, passing the agent, the environment, and the previously defined training options and evaluator objects to train. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % To avoid plotting in training, recreate the environment.
    env = rlPredefinedEnv("DoubleIntegrator-Discrete");
    % Train the agent. Record the training time.
    tic
    sacTngRes = train(sacAgent,env,trainOpts,Evaluator=evl);
    sacTngTime = toc;
    % Extract the number of training episodes and the number of total steps.
    sacTngEps = sacTngRes.EpisodeIndex(end);
    sacTngSteps = sum(sacTngRes.TotalAgentSteps);
    % Uncomment to save the trained agent and the training metrics.
    % save("ddiBchSACAgent.mat", ...
    %     "sacAgent","sacTngEps","sacTngSteps","sacTngTime")
else
    % Load the pretrained agent and results for the example.
    load("ddiBchSACAgent.mat", ...
        "sacAgent","sacTngEps","sacTngSteps","sacTngTime")
end
For the SAC agent, the training converges to a solution after 200 episodes. You can check the trained agent within the double-integrator environment.
Ensure reproducibility of the simulation by fixing the seed used for random number generation.
rng(0,"twister")
Visualize the environment.
plot(env)
Configure the agent to use a greedy policy (no exploration) in simulation.
sacAgent.UseExplorationPolicy = false;
Simulate the environment with the trained agent for 500 steps and display the total reward. For more information on agent simulation, see rlSimulationOptions and sim.
simOptions = rlSimulationOptions(MaxSteps=500);
experience = sim(env,sacAgent,simOptions);
sacTotalRwd = sum(experience.Reward)
sacTotalRwd = -41.9512
The trained SAC agent stabilizes the mass at the origin.
Plot Training and Simulation Metrics
For each agent, collect the total reward from the final simulation episode, the number of training episodes, the total number of agent steps, and the total training time as shown in the Reinforcement Learning Training Monitor.
simReward = [
    dqnTotalRwd
    pgTotalRwd
    acTotalRwd
    ppoTotalRwd
    sacTotalRwd
    ];
tngEpisodes = [
    dqnTngEps
    pgTngEps
    acTngEps
    ppoTngEps
    sacTngEps
    ];
tngSteps = [
    dqnTngSteps
    pgTngSteps
    acTngSteps
    ppoTngSteps
    sacTngSteps
    ];
tngTime = [
    dqnTngTime
    pgTngTime
    acTngTime
    ppoTngTime
    sacTngTime
    ];
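Optionally, you can also view the same metrics in a table for a quick numerical comparison. The variable and column names in this sketch are arbitrary and not part of the original example.
% Optional sketch: collect the metrics in a table for an easier side-by-side
% comparison of the five agents.
agentNames = ["DQN";"PG";"AC";"PPO";"SAC"];
metricsTable = table(agentNames,simReward,tngEpisodes,tngSteps,tngTime, ...
    VariableNames=["Agent","SimReward","TrainingEpisodes", ...
    "TrainingSteps","TrainingTime"])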
Plot the simulation reward, the number of training episodes, the number of training steps (that is, the number of interactions between the agent and the environment), and the training time. For better visualization, scale the data by the factors [30 200 2e6 600].
bar([simReward,tngEpisodes,tngSteps,tngTime]./[30 200 2e6 600])
xticklabels(["DQN" "PG" "AC" "PPO" "SAC"])
legend( ...
    "Total Reward","Training Episodes", ...
    "Training Steps","Training Time", ...
    "Location","northeast")
The plot shows that, for this environment, and with the selected random number generator seed and initial conditions, the DQN agent obtains the lowest total reward, the PPO and AC agents require the least training time, and the SAC agent requires the most training time because its more complex algorithm must compute more gradients. With a different random seed, the initial agent networks would be different, so convergence results might also differ. For more information on the relative strengths and weaknesses of each agent, see Reinforcement Learning Agents.
Save all the variables created in this example, including the training results, for later use.
% Uncomment the following line to save all the workspace variables
% save ddiAllVariables.mat
Restore the random number stream using the information stored in previousRngState.
rng(previousRngState);