Train DDPG, TD3 or SAC agent using an evolutionary strategy within a specified environment

Since R2023b



    trainStats = trainWithEvolutionStrategy(agent,env,estOpts) trains agent within the environment env, using the evolution strategy training options object estOpts. Note that agent is a handle object and is updated during training, despite being an input argument. For more information on the training algorithm, see Train agent with evolution strategy.


    This example shows how to train a DDPG agent using an evolutionary strategy.

    Load the predefined environment object representing a cart-pole system with a continuous action space. For more information on this environment, see Load Predefined Control System Environments.

    env = rlPredefinedEnv("CartPole-Continuous");

    The agent networks are initialized randomly. Ensure reproducibility by fixing the seed of the random generator.
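    A minimal sketch of fixing the seed is shown here. The seed value 0 is an assumption made for illustration; any fixed nonnegative integer gives repeatable initialization.

```matlab
% Fix the seed of the global random number generator so that the
% random network initialization is repeatable from run to run.
% The value 0 is an illustrative assumption, not a required setting.
rng(0)
```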


    Create a DDPG agent with default networks.

    agent = rlDDPGAgent(getObservationInfo(env),getActionInfo(env));

    To create an evolution strategy options object, use rlEvolutionStrategyTrainingOptions.

    estOpts = rlEvolutionStrategyTrainingOptions(...
        PopulationSize=10 , ...
        ReturnedPolicy="BestPolicy" , ...
        StopTrainingCriteria="EpisodeCount");

    To train the agent, use trainWithEvolutionStrategy.

    trainStats = trainWithEvolutionStrategy(agent,env,estOpts);

    Display the reward accumulated during the last episode.
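    One way to read this value is sketched here. It assumes that trainStats is the result returned by the training call above and that its EpisodeReward property holds one reward per episode.

```matlab
% Index the last entry of the per-episode reward vector.
% Assumes trainStats exists in the workspace from the training call
% and exposes an EpisodeReward column vector.
trainStats.EpisodeReward(end)
```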

    ans = 496.2431

    This value means that the agent is able to balance the cart-pole system for the whole episode.

    Input Arguments

    Agent to train, specified as an rlDDPGAgent, rlTD3Agent, or rlSACAgent object.


    trainWithEvolutionStrategy updates the agent as training progresses. For more information on how to preserve the original agent, how to save an agent during training, and on the state of agent after training, see the notes and the tips section in train. For more information about handle objects, see Handle Object Behavior.
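    Because training modifies agent in place, one possible way to keep the original parameters is to save the agent to a MAT-file before training. This is a sketch only: the file name initialAgent.mat is a hypothetical choice, and agent is assumed to already exist in the workspace.

```matlab
% Save the untrained agent before training overwrites its parameters.
% The file name is an arbitrary, illustrative choice.
save("initialAgent.mat","agent")

% Later, reload the saved agent into a structure to compare it with
% the trained one or to restart training from the same starting point.
saved = load("initialAgent.mat","agent");
initialAgent = saved.agent;
```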

    For more information about how to create and configure agents for reinforcement learning, see Reinforcement Learning Agents.

    Environment in which the agent acts, specified as one of the following kinds of reinforcement learning environment object:


    Multiagent environments do not support training agents with an evolution strategy.

    For more information about creating and configuring environments, see:

    When env is a Simulink environment, calling trainWithEvolutionStrategy compiles and simulates the model associated with the environment.

    Parameters and options for training using an evolution strategy, specified as an rlEvolutionStrategyTrainingOptions object. Use this argument to specify parameters and options such as:

    • Population size

    • Population update method

    • Number of training epochs

    • Criteria for saving candidate agents

    • How to display training progress


    trainWithEvolutionStrategy does not support parallel computing.

    For details, see rlEvolutionStrategyTrainingOptions.
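    The following sketch shows how some of these settings might be specified. PopulationSize appears in the example on this page. The names TrainEpochs, Verbose, and Plots are assumptions about the options object, and all values are illustrative; check them against rlEvolutionStrategyTrainingOptions before use.

```matlab
% Create an options object with a few illustrative settings.
% TrainEpochs, Verbose, and Plots are assumed property names.
estOpts = rlEvolutionStrategyTrainingOptions( ...
    PopulationSize=20, ...
    TrainEpochs=5, ...
    Verbose=true, ...
    Plots="none");

% Options can also be changed after creation using dot notation.
estOpts.PopulationSize = 30;
```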

    Output Arguments

    Training episode data, returned as an rlTrainingResult object. The following properties pertain to the rlTrainingResult object:

    Episode numbers, returned as the column vector [1;2;…;N], where N is the number of episodes in the training run. This vector is useful if you want to plot the evolution of other quantities from episode to episode.

    Reward for each episode, returned in a column vector of length N. Each entry contains the reward for the corresponding episode.

    Number of steps in each episode, returned in a column vector of length N. Each entry contains the number of steps in the corresponding episode.

    Average reward over the averaging window specified in estOpts, returned as a column vector of length N. Each entry contains the average reward computed at the end of the corresponding episode.

    Total number of agent steps in training, returned as a column vector of length N. Each entry contains the cumulative sum of the entries in EpisodeSteps up to that point.

    Critic estimate of expected discounted cumulative long-term reward using the current agent and the environment initial conditions, returned as a column vector of length N. Each entry is the critic estimate (Q0) for the agent at the beginning of the corresponding episode. This field is present only for agents that have critics, such as rlDDPGAgent and rlDQNAgent.

    Information collected during the simulations performed for training, returned as:

    • For training in MATLAB environments, a structure containing the field SimulationError. This field is a column vector with one entry per episode. When the StopOnError option of rlTrainingOptions is "off", each entry contains any errors that occurred during the corresponding episode. Otherwise, the field contains an empty array.

    • For training in Simulink environments, a vector of Simulink.SimulationOutput objects containing simulation data recorded during the corresponding episode. Recorded data for an episode includes any signals and states that the model is configured to log, simulation metadata, and any errors that occurred during the corresponding episode.

    Evaluation statistic for each episode, returned as a column vector with as many elements as the number of episodes. Since trainWithEvolutionStrategy does not support evaluator objects, each element of this vector is NaN. For more information, see rlEvaluator and rlCustomEvaluator.

    Training options set, returned as an rlEvolutionStrategyTrainingOptions object.
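    As a sketch of how these properties might be used together, the following code plots the per-episode and averaged rewards against the episode number. It assumes that trainStats is the returned rlTrainingResult object and that the properties described above are named EpisodeIndex, EpisodeReward, and AverageReward.

```matlab
% Plot the reward of each episode and the windowed average reward.
% Assumes trainStats exists in the workspace and that the property
% names EpisodeIndex, EpisodeReward, and AverageReward are correct.
figure
plot(trainStats.EpisodeIndex,trainStats.EpisodeReward)
hold on
plot(trainStats.EpisodeIndex,trainStats.AverageReward)
hold off
xlabel("Episode")
ylabel("Reward")
legend("Episode reward","Average reward")
```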

    Version History

    Introduced in R2023b