Reinforcement Learning Environments
In a reinforcement learning scenario, where you train an agent to complete a task, the environment models the external system (that is, the world) with which the agent interacts. A multiagent environment interacts with more than one agent at the same time.
In control systems applications, this external system is often referred to as the plant. Any reference signal that might need to be tracked by some of the environment variables is also included in the environment.
The agent and the environment interact at each of a sequence of discrete time steps:
At a given time step t, the environment is in a state S(t), which results in the observation O(t).
Based on O(t) and its internal policy function, the agent calculates an action A(t).
Based on both the state S(t) and the action A(t), and according to its internal dynamics, the environment updates its state to S(t+1), which results in the next observation O(t+1).
Based on S(t), A(t), and S(t+1), the environment also calculates a scalar reward R(t+1). The reward is an immediate measure of how good the action A(t) is. Note that neither the next observation O(t+1) nor the reward R(t+1) depends on the next action A(t+1). In other words, there is no direct feedthrough between the action and either the observation or the reward.
At the next time step t+1 the agent receives the observation O(t+1) and the reward R(t+1).
Based on the history of observations and rewards received, the learning algorithm updates the agent policy parameters in an attempt to improve the policy function. The parameter update may occur at each step or after a number of steps.
Based on O(t+1) and on its policy function, the agent calculates the next action A(t+1), and the process is repeated.
Starting from time t=1 and using subscripts to indicate time, the causal sequence of events, often also called a trajectory, can be summarized as O1, A1, R2, O2, A2, R3, O3, and so on. The interaction between the environment and the agent is also illustrated in the following figure, where the dashed lines represent a delay of one step.
By convention, the observation or action can be divided into one or more channels, each of which carries a group of single elements all belonging to either a numeric (infinite and continuous) set or a finite (discrete) set. Each group can be organized according to any number of dimensions (for example, a vector or a matrix), and is defined by a specification object. The specification object can be either an rlNumericSpec object (for channels carrying continuous signals) or an rlFiniteSetSpec object (for channels carrying discrete signals).
For example, an agent tasked with controlling a rover might receive from the environment an observation composed of four channels: a continuous channel carrying acceleration measurements from accelerometers, another continuous channel carrying angular velocity estimates from an inertial measurement unit, a third continuous channel carrying a 100-by-100 pixel image, in which each pixel is represented by a uint8 value, and a fourth discrete channel carrying a logical value indicating whether the collision sensor detects a collision. In this case, the observation is specified by a vector containing three rlNumericSpec objects followed by one rlFiniteSetSpec object. If this agent uses neural networks as the underlying approximator model, the networks tasked with processing the observations must have an input layer for each observation channel.
For non-hybrid environments (that is, environments with an action space that is either discrete or continuous, but not both), only one channel is allowed for the action. The reward must be a numeric scalar. For more information on specification objects for groups of actions and observations, see rlNumericSpec and rlFiniteSetSpec.
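The following sketch shows how specification objects for the rover example might be created. The channel dimensions, limits, and action definition are illustrative assumptions, not part of the example description.

% Three continuous observation channels and one discrete channel (illustrative sizes)
accSpec  = rlNumericSpec([3 1]);                                        % accelerometer measurements
gyroSpec = rlNumericSpec([3 1]);                                        % angular velocity estimates
imgSpec  = rlNumericSpec([100 100 1],"LowerLimit",0,"UpperLimit",255);  % 100-by-100 image (uint8 range)
colSpec  = rlFiniteSetSpec([0 1]);                                      % collision flag

% Vector of specification objects: three rlNumericSpec followed by one rlFiniteSetSpec
obsInfo = [accSpec gyroSpec imgSpec colSpec];

% Single continuous action channel (for example, two torque commands)
actInfo = rlNumericSpec([2 1],"LowerLimit",-1,"UpperLimit",1);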
Multiagent environments are environments in which you can train and simulate multiple agents together.
Environment Objects
Reinforcement Learning Toolbox™ represents environments with MATLAB® objects. Such objects interact with agents using object functions (methods) such as step or reset. Specifically, at the beginning of each training or simulation episode, the reset function is called (by a built-in training or simulation function) to set the environment initial condition. Then, at each training or simulation time step, the step function is called to update the state of the environment and return the next observation along with the reward.
After you create an environment object in the MATLAB workspace, you can extract the observation and action specifications from the environment object, and use these specifications to create an agent object that works within your environment.
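For example, assuming an environment object env already exists in the MATLAB workspace, and that an agent with default settings (here a PPO agent, chosen only for illustration) is adequate for the task:

% Extract channel specifications from the existing environment object
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% Create an agent with default settings compatible with these specifications
agent = rlPPOAgent(obsInfo,actInfo);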
You can then use both the environment and agent objects as arguments for the built-in functions train and sim, which train and simulate the agent within the environment, respectively. Alternatively, you can create your own custom training or simulation loop that calls the environment reset and step functions directly.
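The following sketch contrasts the two workflows for a MATLAB environment. It assumes that env and agent already exist in the workspace and that the episode limits shown are acceptable illustrative values.

% Built-in training and simulation
trainOpts  = rlTrainingOptions("MaxEpisodes",200,"MaxStepsPerEpisode",500);
trainStats = train(agent,env,trainOpts);
simOpts    = rlSimulationOptions("MaxSteps",500);
experience = sim(env,agent,simOpts);

% Minimal hand-written simulation loop using reset and step (sketch)
obs = reset(env);                              % set the environment initial condition
isDone = false;
while ~isDone
    action = getAction(agent,{obs});           % query the agent policy
    [obs,reward,isDone] = step(env,action{1}); % advance the environment by one step
end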
Environments that rely on an underlying Simulink® model for the calculation of the state transition, reward, and observation are called Simulink environments. These environments do not support using the reset and step functions. Environments that instead rely on MATLAB functions or objects for the calculation of the state transition, reward, and observation are referred to as MATLAB environments.
The following sections summarize the different types of environment provided by the software.
Markov Decision Process (MDP) Environments
A Markov decision process (MDP) is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of the decision maker. Reinforcement Learning Toolbox represents MDPs using rlMDPEnv objects.
MDP environments are MATLAB environments based on rlMDPEnv objects. In these environments, which are useful for studying optimization problems that can be solved using reinforcement learning, the state and observation belong to finite spaces, and state transitions are in general governed by stochastic rules.
Grid world environments are a special case of MDP environments, in which the state represents a position in a two-dimensional grid and the action represents a move that the agent can attempt from its current position to another one. Grid world environments are often used in introductory reinforcement learning examples.
You can use three types of MDP environment.
Predefined grid world environments
Reinforcement Learning Toolbox provides three predefined grid world environment object types. For predefined environments, all states, actions, and rewards are already defined. You can use them to learn basic reinforcement learning concepts and gain familiarity with Reinforcement Learning Toolbox software features. For an introduction to predefined grid world environments, see Load Predefined Grid World Environments.
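For example, you can load the basic predefined grid world and inspect its specifications:

env = rlPredefinedEnv("BasicGridWorld");   % predefined 5-by-5 grid world
obsInfo = getObservationInfo(env)          % rlFiniteSetSpec listing the grid positions
actInfo = getActionInfo(env)               % rlFiniteSetSpec listing the move actions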
Custom grid world environments
You can create custom grid worlds of any size with your own custom rewards, state transitions, and obstacle configurations. To create a custom grid world environment, you typically use createGridWorld to create a GridWorld object. You can then modify some of this object's properties and pass it to rlMDPEnv to create an environment that agents can interact with for training and simulation. For an introduction to custom grid worlds, see Create Custom Grid World Environments.
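The following sketch outlines this workflow. The grid size, obstacle, terminal state, and reward values are illustrative assumptions.

% Create a 5-by-5 grid world and configure it (illustrative values)
GW = createGridWorld(5,5);
GW.TerminalStates = "[5,5]";                       % goal cell
GW.ObstacleStates = "[3,3]";                       % blocked cell
updateStateTranstionForObstacles(GW);              % make the obstacle impassable
nS = numel(GW.States);
nA = numel(GW.Actions);
GW.R = -1*ones(nS,nS,nA);                          % step penalty for every transition
GW.R(:,state2idx(GW,GW.TerminalStates),:) = 10;    % reward for reaching the goal
env = rlMDPEnv(GW);                                % environment agents can interact with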
Custom Markov Decision Process (MDP) environments
You can also create custom generic MDP environments by supplying your own state and action sets. To create generic MDP environments, you typically use createMDP to create a GenericMDP object. You can then modify some of this object's properties and pass it to rlMDPEnv to create an environment that agents can interact with for training and simulation. For an example, see Train Reinforcement Learning Agent in MDP Environment.
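A minimal sketch of this workflow, using an illustrative two-state MDP with hand-picked transition probabilities and rewards:

% Two-state, two-action MDP (illustrative transition and reward values)
MDP = createMDP(2,["up";"down"]);
MDP.T(1,2,1) = 1;  MDP.R(1,2,1) =  3;   % "up" from s1 moves to s2, reward 3
MDP.T(1,1,2) = 1;  MDP.R(1,1,2) = -1;   % "down" from s1 stays in s1, reward -1
MDP.T(2,2,:) = 1;                       % s2 is absorbing
MDP.TerminalStates = "s2";
env = rlMDPEnv(MDP);                    % environment agents can interact with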
Predefined Control System Environments
Control system environments are environments that represent dynamical systems in which state and observation typically belong to infinite (and uncountable) numerical vector spaces. Here, the state transition laws are deterministic and often derived by discretizing the dynamics of an underlying physical system that you want to model. Note that in these environments the action can still belong to a finite set.
Reinforcement Learning Toolbox provides several predefined control system environment objects that model dynamical systems such as a double integrator or a cart-pole system. In general, each predefined environment comes in two versions, one with a discrete (finite) action space and the other with a continuous (infinite and uncountable) action space.
Some of the predefined control system environments are Simulink environments, and some are multiagent environments.
You can use predefined control system environments to learn how to apply reinforcement learning to the control of physical systems, gain familiarity with Reinforcement Learning Toolbox software features, or test your own agents. For an introduction to predefined control system environments, see Load Predefined Control System Environments.
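For example, the predefined cart-pole environment comes in both versions, distinguished by the keyword passed to rlPredefinedEnv:

envDiscrete   = rlPredefinedEnv("CartPole-Discrete");    % finite set of force values
envContinuous = rlPredefinedEnv("CartPole-Continuous");  % continuous force range
getActionInfo(envDiscrete)      % returns an rlFiniteSetSpec object
getActionInfo(envContinuous)    % returns an rlNumericSpec object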
Custom Environments
You can create different types of custom environments. Once you create a custom environment, you can train and simulate agents as with any other environment.
For critical considerations on defining reward and observation signals in custom environments, see Define Observation and Reward Signals in Custom Environments.
You can create three different types of custom environment.
Custom function environments
Custom function environments rely on custom step and reset MATLAB functions for the calculation of the state transition, reward, observation, and initial state. For single-agent environments, once you define your action and observation specifications and write your custom step and reset functions, you use rlFunctionEnv to return an environment object that can interact with your agent in the same way any other environment does. For an example of custom function environments, see Create Custom Environment Using Step and Reset Functions.
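A minimal single-agent sketch is shown below. The dynamics, reward, and termination condition are illustrative assumptions; only the function signatures follow the rlFunctionEnv interface.

% Observation: scalar position; action: a move of -1 or 1 (illustrative specifications)
obsInfo = rlNumericSpec([1 1]);
actInfo = rlFiniteSetSpec([-1 1]);
env = rlFunctionEnv(obsInfo,actInfo,@localStep,@localReset);

function [initialObs,logged] = localReset()
    logged.State = 0;                            % start at the origin
    initialObs = logged.State;
end

function [nextObs,reward,isDone,logged] = localStep(action,logged)
    logged.State = logged.State + 0.1*action;    % simple integrator dynamics
    nextObs = logged.State;
    reward  = -abs(nextObs);                     % penalize distance from the origin
    isDone  = abs(nextObs) > 1;                  % terminate when too far away
end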
You can also create two different kinds of custom multiagent function environments:
Multiagent environments with universal sample time, in which all agents execute in the same step.
Turn-based function environments, in which agents execute in turns. Specifically, the environment assigns execution to only one group of agents at a time, and the group executes when it is its turn to do so. For an example, see Train Agent to Play Turn-Based Game.
For both kinds of multiagent environments, the observation and action specifications are cell arrays of specification objects in which each element corresponds to one agent. For example, for an environment with three agents, the observation is specified by a cell array with three elements. Each element can be, for example, a vector of specification objects representing the observation channels for the corresponding agent.
For custom multiagent function environments with universal sample time, use rlMultiAgentFunctionEnv to return an environment object. For custom turn-based multiagent function environments, use rlTurnBasedFunctionEnv. To specify options for training agents in multiagent environments, create and configure an rlMultiAgentTrainingOptions object. Doing so allows you to specify, for example, whether different groups of agents are trained in a decentralized or centralized manner. In a group of agents subject to decentralized training, each agent collects and learns from its own set of experiences. In a group of agents subject to centralized training, each agent shares its experiences with the other agents in the group, and each agent learns from the collective shared experiences. You can train and simulate your agents within a multiagent environment using train and sim, respectively. You can visualize the training progress of all the agents using the Reinforcement Learning Training Manager. For more information on training agents in multiagent environments, see Multiagent Training.
For more information on predefined multiagent environments, see Load Predefined Multiagent Environments.
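As a sketch, for a three-agent environment with universal sample time, the specifications and training options might look as follows. The channel dimensions and the names myStepFcn and myResetFcn are hypothetical placeholders; the grouping options follow the rlMultiAgentTrainingOptions interface.

% Cell arrays of specifications, one element per agent (illustrative dimensions)
obsInfo = {rlNumericSpec([4 1]), rlNumericSpec([4 1]), rlNumericSpec([2 1])};
actInfo = {rlFiniteSetSpec([-1 1]), rlFiniteSetSpec([-1 1]), rlNumericSpec([1 1])};
env = rlMultiAgentFunctionEnv(obsInfo,actInfo,@myStepFcn,@myResetFcn);  % hypothetical step and reset functions

% Train agents 1 and 2 as a centralized group and agent 3 in a decentralized manner
trainOpts = rlMultiAgentTrainingOptions( ...
    "AgentGroups",{[1 2],3}, ...
    "LearningStrategy",["centralized","decentralized"]);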
Custom template environments
Custom template environments are based on a class template that you modify.
To create a custom template environment, you use rlCreateEnvTemplate to open a MATLAB script that contains a template class for an environment. You then modify the template, specifying environment properties, required environment functions, and optional environment functions. While this process is more elaborate than writing custom step and reset functions, it gives you more flexibility to add properties or methods that your application might need. For example, you can write a custom plot method that displays a visual representation of the environment at a given time. For an introduction to creating environments using a template, see Create Custom Environment from Class Template.
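For example, assuming MyEnvironment is the (hypothetical) name you choose for your environment class:

% Open a new MATLAB script containing the template class MyEnvironment
rlCreateEnvTemplate("MyEnvironment");

% After editing the template, instantiate the class and check that it is a valid environment
env = MyEnvironment;
validateEnvironment(env)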
Custom Simulink environments
Custom Simulink environments are based on a Simulink model that you create.
You can also use Simulink to design multiagent environments. In particular, Simulink allows you to model environments with multi-rate execution, in which each agent can have its own execution rate.
For an introduction to creating custom Simulink environments, see Create Custom Simulink Environments. For an example that illustrates some differences between MATLAB and Simulink environments, highlighting the role of delays between action and observations, see Create and Simulate Same Environment in Both MATLAB and Simulink.
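A minimal sketch of creating a Simulink environment object, assuming a hypothetical model named myModel that contains an RL Agent block, and illustrative observation and action specifications:

obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([1 1]);
env = rlSimulinkEnv("myModel","myModel/RL Agent",obsInfo,actInfo);
env.ResetFcn = @(in) setVariable(in,"x0",0.1);   % optionally set the (hypothetical) model variable x0 at each episode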
Neural Network Environments
Neural network environments are custom environments that rely on a neural network for the calculation of the state transition. Here, the state and observation belong to continuous spaces, and the state transition laws can be deterministic or stochastic.
Neural network environments are typically used within model-based reinforcement learning agents, such as the Model-Based Policy Optimization (MBPO) Agent. However, you can also extract the environment from a trained rlMBPOAgent object and use it as a (potentially less computationally demanding, but also less accurate) approximation of the original environment in which the model-based agent was trained.
For more information on how to create neural network environments, see rlNeuralNetworkEnvironment.
See Also
Functions
rlPredefinedEnv | getActionInfo | getObservationInfo | train | sim | rlCreateEnvTemplate | validateEnvironment | rlSimulinkEnv | bus2RLSpec | createIntegratedEnv
Objects
rlNumericSpec | rlFiniteSetSpec | rlMDPEnv | rlFunctionEnv | rlMultiAgentFunctionEnv | rlTurnBasedFunctionEnv | SimulinkEnvWithAgent | rlNeuralNetworkEnvironment
Topics
- Train Reinforcement Learning Agent in MDP Environment
- Train Reinforcement Learning Agent in Basic Grid World
- Train PG Agent to Balance Discrete Cart-Pole System
- Create Custom Environment Using Step and Reset Functions
- Create Custom Environment from Class Template
- What Is Reinforcement Learning?
- Load MATLAB Environments in Reinforcement Learning Designer
- Load Predefined Grid World Environments
- Load Predefined Control System Environments
- Create Custom Simulink Environments