# rlEvolutionStrategyTrainingOptions

Options for training off-policy reinforcement learning agents using an evolutionary strategy

*Since R2023b*

## Description

Use an `rlEvolutionStrategyTrainingOptions` object to specify options to train a DDPG, TD3, or SAC agent within an environment. Evolution strategy training options include the population size and its update method, the number of training epochs, as well as criteria for stopping training and saving agents. After setting its options, use this object as an input argument for `trainWithEvolutionStrategy`.

For more information on the training algorithm, see Train agent with evolution strategy. For more information on training agents, see Train Reinforcement Learning Agents.

## Creation

### Syntax

```
trainOpts = rlEvolutionStrategyTrainingOptions
trainOpts = rlEvolutionStrategyTrainingOptions(Name=Value)
```

### Description

`trainOpts = rlEvolutionStrategyTrainingOptions` returns the default options for training a DDPG, TD3, or SAC agent using an evolutionary strategy.

`trainOpts = rlEvolutionStrategyTrainingOptions(Name=Value)` creates the training option set `trainOpts` and sets its Properties using one or more name-value arguments.
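For example, you can create a default option set, or set properties at creation (the property values shown here are arbitrary):

```
% Default training options
trainOpts = rlEvolutionStrategyTrainingOptions;

% Options with nondefault population size and elite percentage
trainOpts = rlEvolutionStrategyTrainingOptions( ...
    PopulationSize=40, ...
    PercentageEliteSize=25);
```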

## Properties

`PopulationSize` — Number of individuals in the population

25 (default) | positive integer

Number of individuals in the population, specified as a positive integer. Every individual corresponds to an actor.

**Example:** `PopulationSize=50`

`PercentageEliteSize` — Percentage of surviving individuals

50 (default) | integer between 1 and 100

Percentage of individuals surviving to form the next population, specified as an integer between 1 and 100.

**Example:** `PercentageEliteSize=30`

`EvaluationsPerIndividual` — Maximum number of episodes run per individual

1 (default) | positive integer

Maximum number of episodes run per individual, specified as a positive integer.

**Example:** `EvaluationsPerIndividual=2`

`TrainEpochs` — Number of training epochs

10 (default) | nonnegative integer

Number of training epochs used to update the gradient-based agent, specified as a nonnegative integer. If you set `TrainEpochs` to `0`, then the agents are updated without using any gradient-based agent (therefore using only a pure evolutionary search strategy). For more information on the training algorithm, see Train agent with evolution strategy.

**Example:** `TrainEpochs=5`

`PopulationUpdateOptions` — Population update options

`GaussianUpdateOptions` object

Population update options, specified as a `GaussianUpdateOptions` object. For more information on the training algorithm, see Train agent with evolution strategy.

The properties of the `GaussianUpdateOptions` object, which determine how the evolution algorithm updates the distribution, and which you can modify using dot notation after creating the `rlEvolutionStrategyTrainingOptions` object, are as follows.

`UpdateMethod` — Update method for the population distribution

`"WeightedMixing"` (default) | `"UniformMixing"`

Update method for the population distribution, specified as either:

- `"WeightedMixing"` — When calculating the sum used to calculate the mean and standard deviation of the population distribution, weights each actor according to its fitness index (that is, better actors are weighted more).
- `"UniformMixing"` — When calculating the sum used to calculate the mean and standard deviation of the population distribution, weights each actor equally.

**Example:** `UpdateMethod="UniformMixing"`

`InitialMean` — Initial mean of the population distribution

`0` (default) | scalar

Initial mean of the population distribution, specified as a scalar.

**Example:** `InitialMean=-0.5`

`InitialStandardDeviation` — Initial standard deviation of the population distribution

`0.1` (default) | positive scalar

Initial standard deviation of the population distribution, specified as a positive scalar.

**Example:** `InitialStandardDeviation=0.5`

`InitialStandardDeviationBias` — Initial bias of the standard deviation of the population distribution

`0.1` (default) | positive scalar

Initial bias of the standard deviation of the population distribution, specified as a positive scalar. A larger value promotes exploration.

**Example:** `InitialStandardDeviationBias=0.2`

`FinalStandardDeviationBias` — Final bias of the standard deviation of the population distribution

`0.001` (default) | nonnegative scalar

Final bias of the standard deviation of the population distribution, specified as a nonnegative scalar.

**Example:** `FinalStandardDeviationBias=0.002`

`StandardDeviationBiasDecayRate` — Decay rate of the bias of the standard deviation of the population distribution

`0.95` (default) | positive scalar less than one

Decay rate of the bias of the standard deviation of the population distribution, specified as a positive scalar less than one.

At the end of each training time step, the bias of the population standard deviation, `StdBias`, is updated as follows.

```
StdBias = (1-StandardDeviationBiasDecayRate)*StdBias + ...
    StandardDeviationBiasDecayRate*FinalStandardDeviationBias
```

Note that `StdBias` is conserved between the end of an episode and the start of the next one. Therefore, it continues to evolve uniformly over multiple episodes until it reaches `FinalStandardDeviationBias`.

**Example:** `StandardDeviationBiasDecayRate=0.99`
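As an illustration of this recursion, the following standalone sketch (not part of the training API) iterates the update with the default property values and shows the bias approaching `FinalStandardDeviationBias`:

```
decayRate = 0.95;     % StandardDeviationBiasDecayRate
finalBias = 0.001;    % FinalStandardDeviationBias
stdBias   = 0.1;      % InitialStandardDeviationBias
for k = 1:5
    stdBias = (1-decayRate)*stdBias + decayRate*finalBias;
end
disp(stdBias)  % close to 0.001 after a few steps
```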

`ReturnedPolicy` — Type of the policy returned once training is terminated

`"AveragedPolicy"` (default) | `"BestPolicy"`

Type of the policy returned once training is terminated, specified as either `"AveragedPolicy"` or `"BestPolicy"`.

**Example:** `ReturnedPolicy="BestPolicy"`

`MaxGenerations` — Maximum number of generations

500 (default) | positive integer

Maximum number of generations for which the population is updated, specified as a positive integer.

**Example:** `MaxGenerations=1000`

`MaxStepsPerEpisode` — Maximum number of steps to run per episode

`500` (default) | positive integer

Maximum number of steps to run per episode, specified as a positive integer. In general, you define episode termination conditions in the environment. This value is the maximum number of steps to run in the episode if other termination conditions are not met.

**Example:** `MaxStepsPerEpisode=1000`

`ScoreAveragingWindowLength` — Window length for averaging

`5` (default) | positive integer

Window length for averaging the scores, rewards, and number of steps, specified as a positive integer.

For options expressed in terms of averages, `ScoreAveragingWindowLength` is the number of episodes included in the average. For instance, if `StopTrainingCriteria` is `"AverageReward"` and `StopTrainingValue` is `500`, training terminates when the average reward over the number of episodes specified in `ScoreAveragingWindowLength` equals or exceeds `500`.

**Example:** `ScoreAveragingWindowLength=10`

`StopTrainingCriteria` — Training termination condition

`"AverageReward"` | `"EpisodeReward"` | ...

Training termination condition, specified as one of the following strings:

- `"AverageReward"` — Stop training when the running average reward equals or exceeds the critical value.
- `"EpisodeReward"` — Stop training when the reward in the current episode equals or exceeds the critical value.

**Example:** `StopTrainingCriteria="AverageReward"`

`StopTrainingValue` — Critical value of training termination condition

`500` (default) | scalar

Critical value of the training termination condition, specified as a scalar.

Training ends when the termination condition specified by the `StopTrainingCriteria` option equals or exceeds this value.

For instance, if `StopTrainingCriteria` is `"AverageReward"` and `StopTrainingValue` is `100`, training terminates when the average reward over the number of episodes specified in `ScoreAveragingWindowLength` equals or exceeds `100`.

**Example:** `StopTrainingValue=100`
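For example, the following sketch (the values are arbitrary) stops training once the reward, averaged over a 10-episode window, reaches 100:

```
esOpts = rlEvolutionStrategyTrainingOptions;
esOpts.ScoreAveragingWindowLength = 10;
esOpts.StopTrainingCriteria = "AverageReward";
esOpts.StopTrainingValue = 100;
```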

`SaveAgentCriteria` — Condition for saving the agent during training

`"none"` (default) | `"EpisodeReward"` | `"AverageSteps"` | `"AverageReward"` | `"GlobalStepCount"` | `"EpisodeCount"` | ...

Condition for saving agents during training, specified as one of the following strings:

- `"none"` — Do not save any agents during training.
- `"EpisodeReward"` — Save the agent when the reward in the current episode equals or exceeds the critical value.
- `"AverageSteps"` — Save the agent when the running average number of steps per episode equals or exceeds the critical value. The average is computed using the window `ScoreAveragingWindowLength`.
- `"AverageReward"` — Save the agent when the running average reward over all episodes equals or exceeds the critical value.
- `"GlobalStepCount"` — Save the agent when the total number of steps in all episodes (the total number of times the agent is invoked) equals or exceeds the critical value.
- `"EpisodeCount"` — Save the agent when the number of training episodes equals or exceeds the critical value.

Set this option to store candidate agents that perform well according to the criteria you specify. When you set this option to a value other than `"none"`, the software sets the `SaveAgentValue` option to 500. You can change that value to specify the condition for saving the agent.

For instance, suppose you want to store for further testing any agent that yields an episode reward that equals or exceeds 100. To do so, set `SaveAgentCriteria` to `"EpisodeReward"` and set the `SaveAgentValue` option to 100. When an episode reward equals or exceeds 100, `train` saves the current agent in a MAT-file in the folder specified by the `SaveAgentDirectory` option. The MAT-file is called `AgentK.mat`, where `K` is the number of the corresponding episode. The agent is stored within that MAT-file as `saved_agent`.

**Example:** `SaveAgentCriteria="EpisodeReward"`
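For example, this sketch saves any agent whose episode reward reaches 100 (the folder name is arbitrary):

```
esOpts = rlEvolutionStrategyTrainingOptions;
esOpts.SaveAgentCriteria = "EpisodeReward";
esOpts.SaveAgentValue = 100;
esOpts.SaveAgentDirectory = "run1Agents";  % agents saved here as AgentK.mat
```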

`SaveAgentValue` — Critical value of condition for saving agents

`"none"` (default) | 500 | scalar

Critical value of the condition for saving agents, specified as a scalar.

When you specify a condition for saving candidate agents using `SaveAgentCriteria`, the software sets this value to 500. Change the value to specify the condition for saving the agent. See the `SaveAgentCriteria` option for more details.

**Example:** `SaveAgentValue=100`

`SaveAgentDirectory` — Folder name for saved agents

`"savedAgents"` (default) | string | character vector

Folder name for saved agents, specified as a string or character vector. The folder name can contain a full or relative path. When an episode occurs in which the conditions specified by the `SaveAgentCriteria` and `SaveAgentValue` options are satisfied, the software saves the current agent in a MAT-file in this folder. If the folder does not exist, `train` creates it. When `SaveAgentCriteria` is `"none"`, this option is ignored and `train` does not create a folder.

**Example:** `SaveAgentDirectory = pwd + "\run1\Agents"`

`Verbose` — Option to display training progress at the command line

`false` (`0`) (default) | `true` (`1`)

Option to display training progress at the command line, specified as the logical value `false` (`0`) or `true` (`1`). Set to `true` to write information from each training episode to the MATLAB® command line during training.

**Example:** `Verbose=false`

`StopOnError` — Option to stop training when error occurs

`"on"` (default) | `"off"`

Option to stop training when an error occurs during an episode, specified as `"on"` or `"off"`. When this option is `"off"`, errors are captured and returned in the `SimulationInfo` output of `train`, and training continues to the next episode.

**Example:** `StopOnError="off"`

`Plots` — Option to display training progress with Episode Manager

`"training-progress"` (default) | `"none"`

Option to display training progress with Episode Manager, specified as `"training-progress"` or `"none"`. By default, calling `train` opens the Reinforcement Learning Episode Manager, which graphically and numerically displays information about the training progress, such as the reward for each episode, average reward, number of episodes, and total number of steps. For more information, see `train`. To turn off this display, set this option to `"none"`.

**Example:** `Plots="none"`

## Object Functions

`trainWithEvolutionStrategy` — Train DDPG, TD3, or SAC agent using an evolutionary strategy within a specified environment

## Examples

### Configure Options for Training with Evolutionary Strategy

Create an options set for training a DDPG, TD3, or SAC agent using an evolutionary strategy. Set the population size, the number of training epochs, and the maximum number of steps per episode. You can set the options using name-value arguments when you create the options set. Any options that you do not explicitly set have their default values.

```
esOpts = rlEvolutionStrategyTrainingOptions(...
    PopulationSize=50, ...
    TrainEpochs=10, ...
    MaxStepsPerEpisode=500)
```

```
esOpts = 
  EvolutionStrategyTrainingOptions with properties:

              PopulationSize: 50
         PercentageEliteSize: 50
    EvaluationsPerIndividual: 1
                 TrainEpochs: 10
     PopulationUpdateOptions: [1×1 rl.option.GaussianUpdateOptions]
              ReturnedPolicy: "AveragedPolicy"
              MaxGenerations: 500
          MaxStepsPerEpisode: 500
  ScoreAveragingWindowLength: 5
        StopTrainingCriteria: "AverageSteps"
           StopTrainingValue: 500
           SaveAgentCriteria: "none"
              SaveAgentValue: "none"
          SaveAgentDirectory: "savedAgents"
                     Verbose: 0
                       Plots: "training-progress"
```

Alternatively, create a default options set and use dot notation to change some of the values.

```
esOpts = rlEvolutionStrategyTrainingOptions;
esOpts.PopulationSize = 30;
esOpts.TrainEpochs = 15;
esOpts.MaxStepsPerEpisode = 500;
```

Set the population update method and the initial standard deviation in the `PopulationUpdateOptions`

property.

```
esOpts.PopulationUpdateOptions.UpdateMethod = "UniformMixing";
esOpts.PopulationUpdateOptions.InitialStandardDeviation = 0.2;
```

To train a supported off-policy agent with an evolutionary strategy, you can now use `esOpts` as an input argument to `trainWithEvolutionStrategy`.

## Algorithms

### Train agent with evolution strategy

Each individual in the population is an actor identified by a vector of learnable parameters, which is sampled from a multivariate Gaussian distribution. Specifically, the training algorithm uses the `InitialMean` and `InitialStandardDeviation` properties to establish the initial Gaussian distribution for the population, and then samples a population of actors from that distribution. Additionally, the algorithm maintains a gradient-based actor, whose parameters are updated independently using a policy-gradient-based rule (in which the gradient is calculated using experience data from all the actors).

After interacting with the environment for the number of episodes specified by `EvaluationsPerIndividual`, each actor (including the gradient-based one) is assigned a fitness index, which corresponds to the reward accumulated during the episodes. New mean and standard deviation values are then calculated from the elite population, according to `PercentageEliteSize`, using a sum weighted according to `UpdateMethod`.

A standard deviation bias factor, which evolves independently according to the properties `InitialStandardDeviationBias`, `FinalStandardDeviationBias`, and `StandardDeviationBiasDecayRate`, is expanded through scalar expansion and then added to the standard deviation. The training algorithm then instantiates a new population of actors by sampling the new Gaussian distribution specified by the new mean and standard deviation, and the cycle resumes.
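One generation of this loop can be sketched as follows. The helper functions `sampleActors`, `evaluateFitness`, and `eliteStats` are hypothetical stand-ins for internal steps, not part of the toolbox API:

```
% Illustrative sketch of one evolution-strategy generation
mu    = 0;      % InitialMean
sigma = 0.1;    % InitialStandardDeviation
for gen = 1:maxGenerations
    actors  = sampleActors(mu, sigma, populationSize);        % draw population
    fitness = evaluateFitness(actors, env, evalsPerIndividual);
    [mu, sigma] = eliteStats(actors, fitness, ...
        percentageEliteSize, updateMethod);                   % elite statistics
    sigma   = sigma + stdBias;                                % add exploration bias
    stdBias = (1-decayRate)*stdBias + decayRate*finalBias;    % decay the bias
end
```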

## Version History

**Introduced in R2023b**

## See Also

### Functions

`trainWithEvolutionStrategy` | `train` | `trainFromData` | `inspectTrainingResult` | `rlDataLogger` | `rlDataViewer`
