Reinforcement Learning Environments
In a reinforcement learning scenario, where you train an agent to complete a task, the environment models the external system (that is, the world) with which the agent interacts. A multiagent environment interacts with more than one agent at the same time.
In control systems applications, this external system is often referred to as the plant. Any reference signal that might need to be tracked by some of the environment variables is also included in the environment.
The agent and the environment interact at each of a sequence of discrete time steps:
At a given time step t, the environment is in a state S(t), which results in the observation O(t).
Based on O(t) and its internal policy function, the agent calculates an action A(t).
Based on both the state S(t) and the action A(t), and according to its internal dynamics, the environment updates its state to S(t+1), which results in the next observation O(t+1).
Based on S(t), A(t), and S(t+1), the environment also calculates a scalar reward R(t+1). The reward is an immediate measure of how good the action A(t) is. Note that neither the next observation O(t+1) nor the reward R(t+1) depends on the next action A(t+1). In other words, there is no direct feedthrough between the action and either the observation or the reward.
At the next time step t+1 the agent receives the observation O(t+1) and the reward R(t+1).
Based on the history of observations and rewards received, the learning algorithm updates the agent policy parameters in an attempt to improve the policy function. The parameter update may occur at each step or after a number of steps.
Based on O(t+1) and on its policy function, the agent calculates the next action A(t+1), and the process is repeated.
Starting from time t=1 and using subscripts to indicate time, the causal sequence of events, often also called a trajectory, can be summarized as O1, A1, R2, O2, A2, R3, O3, and so on. The interaction between the environment and the agent is also illustrated in the following figure, where the dashed lines represent a delay of one step.
By convention, the observation or action can be divided into one or more channels, each of which carries a group of single elements all belonging to either a numeric (infinite and continuous) set or a finite (discrete) set. Each group can be organized according to any number of dimensions (for example, a vector or a matrix), and is defined by a specification object. The specification object can be either an rlNumericSpec object (for channels carrying continuous signals) or an rlFiniteSetSpec object (for channels carrying discrete signals).
For example, an agent tasked with controlling a rover might receive from the environment an observation composed of four channels: a continuous channel carrying acceleration measurements from accelerometers, another continuous channel carrying angular velocity estimates from an inertial measurement unit, a third continuous channel carrying a 100-by-100 pixel image, in which each pixel is represented by a uint8 value, and a fourth discrete channel carrying a logical value indicating whether the collision sensor detects a collision. In this case, the observation is specified by a vector containing three rlNumericSpec objects followed by one rlFiniteSetSpec object. If this agent uses neural networks as the underlying approximator model, the networks tasked with processing the observations must have an input layer for each observation channel.
For non-hybrid environments (that is, environments with an action space that is either discrete or continuous, but not both), only one channel is allowed for the action. The reward must be a numeric scalar. For more information on specification objects for groups of actions and observations, see rlNumericSpec and rlFiniteSetSpec.
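The following sketch shows how specification objects for the rover example might be created. The channel dimensions, limits, and action definition are illustrative assumptions, not part of the example description.

% Three continuous observation channels and one discrete channel (illustrative sizes)
accSpec  = rlNumericSpec([3 1]);                                        % accelerometer measurements
gyroSpec = rlNumericSpec([3 1]);                                        % angular velocity estimates
imgSpec  = rlNumericSpec([100 100 1],"LowerLimit",0,"UpperLimit",255);  % 100-by-100 image (uint8 range)
colSpec  = rlFiniteSetSpec([0 1]);                                      % collision flag

% Vector of specification objects: three rlNumericSpec followed by one rlFiniteSetSpec
obsInfo = [accSpec gyroSpec imgSpec colSpec];

% Single continuous action channel (for example, two torque commands)
actInfo = rlNumericSpec([2 1],"LowerLimit",-1,"UpperLimit",1);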
Multiagent environments are environments in which you can train and simulate multiple agents together.
Environment Objects
Reinforcement Learning Toolbox™ represents environments with MATLAB® objects. Such objects interact with agents using object functions (methods) such as step or reset. Specifically, at the beginning of each training or simulation episode, the reset function is called (by a built-in training or simulation function) to set the environment initial condition. Then, at each training or simulation time step, the step function is called to update the state of the environment and return the next observation along with the reward.
After you create an environment object in the MATLAB workspace, you can extract the observation and action specifications from the environment object, and use these specifications to create an agent object that works within your environment.
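For example, assuming an environment object env already exists in the MATLAB workspace, and that an agent with default settings (here a PPO agent, chosen only for illustration) is adequate for the task:

% Extract channel specifications from the existing environment object
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% Create an agent with default settings compatible with these specifications
agent = rlPPOAgent(obsInfo,actInfo);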
You can then use both the environment and agent objects as arguments for the built-in functions train and sim, which train and simulate the agent within the environment, respectively. Alternatively, you can create your own custom training or simulation loop that calls the environment reset and step functions directly.
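The following sketch contrasts the two workflows for a MATLAB environment. It assumes that env and agent already exist in the workspace and that the episode limits shown are acceptable illustrative values.

% Built-in training and simulation
trainOpts  = rlTrainingOptions("MaxEpisodes",200,"MaxStepsPerEpisode",500);
trainStats = train(agent,env,trainOpts);
simOpts    = rlSimulationOptions("MaxSteps",500);
experience = sim(env,agent,simOpts);

% Minimal hand-written simulation loop using reset and step (sketch)
obs = reset(env);                              % set the environment initial condition
isDone = false;
while ~isDone
    action = getAction(agent,{obs});           % query the agent policy
    [obs,reward,isDone] = step(env,action{1}); % advance the environment by one step
end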
Environments that rely on an underlying Simulink® model for the calculation of the state transition, reward, and observation are called Simulink environments. These environments do not support using the reset and step functions. Environments that instead rely on MATLAB functions or objects for the calculation of the state transition, reward, and observation are referred to as MATLAB environments.
The following sections summarize the different types of environment provided by the software.
Markov Decision Process (MDP) Environments
A Markov decision process (MDP) is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of the decision maker. Reinforcement Learning Toolbox represents MDPs using rlMDPEnv objects.
MDP environments are MATLAB environments based on rlMDPEnv objects. In these environments, which are useful for studying optimization problems that can be solved using reinforcement learning, the state and observation belong to finite spaces, and state transitions are in general governed by stochastic rules.
Grid world environments are a special case of MDP environments, in which the state represents a position in a two-dimensional grid and the action represents a move that the agent can attempt from its current position to another one. Grid world environments are often used in introductory reinforcement learning examples.
You can use three types of MDP environment.
Predefined grid world environments
Reinforcement Learning Toolbox provides three predefined grid world environment object types. For predefined environments, all states, actions, and rewards are already defined. You can use them to learn basic reinforcement learning concepts and gain familiarity with Reinforcement Learning Toolbox software features. For an introduction to predefined grid world environments, see Load Predefined Grid World Environments.
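For example, you can load the basic predefined grid world and inspect its specifications:

env = rlPredefinedEnv("BasicGridWorld");   % predefined 5-by-5 grid world
obsInfo = getObservationInfo(env)          % rlFiniteSetSpec listing the grid positions
actInfo = getActionInfo(env)               % rlFiniteSetSpec listing the move actions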
Custom grid world environments
You can create custom grid worlds of any size with your own custom rewards, state transitions, and obstacle configurations. To create a custom grid world environment, you typically use createGridWorld to create a GridWorld object. You can then modify some of this object's properties and pass it to rlMDPEnv to create an environment that agents can interact with for training and simulation. For an introduction to custom grid worlds, see Create Custom Grid World Environments.
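The following sketch outlines this workflow. The grid size, obstacle, terminal state, and reward values are illustrative assumptions.

% Create a 5-by-5 grid world and configure it (illustrative values)
GW = createGridWorld(5,5);
GW.TerminalStates = "[5,5]";                       % goal cell
GW.ObstacleStates = "[3,3]";                       % blocked cell
updateStateTranstionForObstacles(GW);              % make the obstacle impassable
nS = numel(GW.States);
nA = numel(GW.Actions);
GW.R = -1*ones(nS,nS,nA);                          % step penalty for every transition
GW.R(:,state2idx(GW,GW.TerminalStates),:) = 10;    % reward for reaching the goal
env = rlMDPEnv(GW);                                % environment agents can interact with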
Custom Markov Decision Process (MDP) environments
You can also create custom generic MDP environments by supplying your own state and action sets. To create generic MDP environments, you typically use createMDP to create a GenericMDP object. You can then modify some of this object's properties and pass it to rlMDPEnv to create an environment that agents can interact with for training and simulation. For an example, see Train Reinforcement Learning Agent in MDP Environment.
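A minimal sketch of this workflow, using an illustrative two-state MDP with hand-picked transition probabilities and rewards:

% Two-state, two-action MDP (illustrative transition and reward values)
MDP = createMDP(2,["up";"down"]);
MDP.T(1,2,1) = 1;  MDP.R(1,2,1) =  3;   % "up" from s1 moves to s2, reward 3
MDP.T(1,1,2) = 1;  MDP.R(1,1,2) = -1;   % "down" from s1 stays in s1, reward -1
MDP.T(2,2,:) = 1;                       % s2 is absorbing
MDP.TerminalStates = "s2";
env = rlMDPEnv(MDP);                    % environment agents can interact with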
Predefined Control System Environments
Control system environments are environments that represent dynamical systems in which state and observation typically belong to infinite (and uncountable) numerical vector spaces. Here, the state transition laws are deterministic and often derived by discretizing the dynamics of an underlying physical system that you want to model. Note that in these environments the action can still belong to a finite set.
Reinforcement Learning Toolbox provides several predefined control system environment objects that model dynamical systems such as a double integrator or a cart-pole system. In general, each predefined environment comes in two versions, one with a discrete (finite) action space and the other with a continuous (infinite and uncountable) action space.
Some of the predefined control system environments are Simulink environments, and some are multiagent environments.
You can use predefined control system environments to learn how to apply reinforcement learning to the control of physical systems, gain familiarity with Reinforcement Learning Toolbox software features, or test your own agents. For an introduction to predefined control system environments, see Load Predefined Control System Environments.
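For example, the predefined cart-pole environment comes in both versions, distinguished by the keyword passed to rlPredefinedEnv:

envDiscrete   = rlPredefinedEnv("CartPole-Discrete");    % finite set of force values
envContinuous = rlPredefinedEnv("CartPole-Continuous");  % continuous force range
getActionInfo(envDiscrete)      % returns an rlFiniteSetSpec object
getActionInfo(envContinuous)    % returns an rlNumericSpec object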
Custom Environments
You can create different types of custom environments. Once you create a custom environment, you can train and simulate agents as with any other environment.
For critical considerations on defining reward and observation signals in custom environments, see Define Observation and Reward Signals in Custom Environments.
You can create three different types of custom environment.
Custom function environments
Custom function environments rely on custom step and reset MATLAB functions for the calculation of the state transition, reward, observation, and initial state. For single-agent environments, once you define your action and observation specifications and write your custom step and reset functions, you use rlFunctionEnv to return an environment object that can interact with your agent in the same way any other environment does. For an example of custom function environments, see Create Custom Environment Using Step and Reset Functions.
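A minimal single-agent sketch is shown below. The dynamics, reward, and termination condition are illustrative assumptions; only the function signatures follow the rlFunctionEnv interface.

% Observation: scalar position; action: a move of -1 or 1 (illustrative specifications)
obsInfo = rlNumericSpec([1 1]);
actInfo = rlFiniteSetSpec([-1 1]);
env = rlFunctionEnv(obsInfo,actInfo,@localStep,@localReset);

function [initialObs,logged] = localReset()
    logged.State = 0;                            % start at the origin
    initialObs = logged.State;
end

function [nextObs,reward,isDone,logged] = localStep(action,logged)
    logged.State = logged.State + 0.1*action;    % simple integrator dynamics
    nextObs = logged.State;
    reward  = -abs(nextObs);                     % penalize distance from the origin
    isDone  = abs(nextObs) > 1;                  % terminate when too far away
end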
You can also create two different kinds of custom multiagent function environments:
Multiagent environments with universal sample time, in which all agents execute in the same step.
Turn-based function environments, in which agents execute in turns. Specifically, the environment assigns execution to only one group of agents at a time, and the group executes when it is its turn to do so. For an example, see Train Agent to Play Turn-Based Game.
For both kinds of multiagent environments, the observation and action specifications are cell arrays of specification objects in which each element corresponds to one agent. For example, for an environment with three agents, the observation is specified by a cell array with three elements. Each element can be, for example, a vector of specification objects representing the observation channels for the corresponding agent.
For custom multiagent function environments with universal sample time, use rlMultiAgentFunctionEnv to return an environment object. For custom turn-based multiagent function environments, use rlTurnBasedFunctionEnv. To specify options for training agents in multiagent environments, create and configure an rlMultiAgentTrainingOptions object. Doing so allows you to specify, for example, whether different groups of agents are trained in a decentralized or centralized manner. In a group of agents subject to decentralized training, each agent collects and learns from its own set of experiences. In a group of agents subject to centralized training, each agent shares its experiences with the other agents in the group, and each agent learns from the collective shared experiences. You can train and simulate your agents within a multiagent environment using train and sim, respectively. You can visualize the training progress of all the agents using the Reinforcement Learning Training Manager. For more information on training agents in multiagent environments, see Multiagent Training.
For more information on predefined multiagent environments, see Load Predefined Multiagent Environments.
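As a sketch, for a three-agent environment with universal sample time, the specifications and training options might look as follows. The channel dimensions and the names myStepFcn and myResetFcn are hypothetical placeholders; the grouping options follow the rlMultiAgentTrainingOptions interface.

% Cell arrays of specifications, one element per agent (illustrative dimensions)
obsInfo = {rlNumericSpec([4 1]), rlNumericSpec([4 1]), rlNumericSpec([2 1])};
actInfo = {rlFiniteSetSpec([-1 1]), rlFiniteSetSpec([-1 1]), rlNumericSpec([1 1])};
env = rlMultiAgentFunctionEnv(obsInfo,actInfo,@myStepFcn,@myResetFcn);  % hypothetical step and reset functions

% Train agents 1 and 2 as a centralized group and agent 3 in a decentralized manner
trainOpts = rlMultiAgentTrainingOptions( ...
    "AgentGroups",{[1 2],3}, ...
    "LearningStrategy",["centralized","decentralized"]);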
Custom template environments
Custom template environments are based on a class template that you modify.
To create a custom template environment, you use rlCreateEnvTemplate to open a MATLAB script that contains a template class for an environment. You then modify the template, specifying environment properties, required environment functions, and optional environment functions. While this process is more elaborate than writing custom step and reset functions, it gives you more flexibility to add properties or methods that your application might need. For example, you can write a custom plot method that displays a visual representation of the environment at a given time. For an introduction to creating environments using a template, see Create Custom Environment from Class Template.
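For example, assuming MyEnvironment is the (hypothetical) name you choose for your environment class:

% Open a new MATLAB script containing the template class MyEnvironment
rlCreateEnvTemplate("MyEnvironment");

% After editing the template, instantiate the class and check that it is a valid environment
env = MyEnvironment;
validateEnvironment(env)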
Custom Simulink environments
Custom Simulink environments are based on a Simulink model that you create.
You can also use Simulink to design multiagent environments. In particular, Simulink allows you to model environments with multi-rate execution, in which each agent can have its own execution rate.
For an introduction to creating custom Simulink environments, see Create Custom Simulink Environments. For an example that illustrates some differences between MATLAB and Simulink environments, highlighting the role of delays between action and observations, see Create and Simulate Same Environment in Both MATLAB and Simulink.
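A minimal sketch of creating a Simulink environment object, assuming a hypothetical model named myModel that contains an RL Agent block, and illustrative observation and action specifications:

obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([1 1]);
env = rlSimulinkEnv("myModel","myModel/RL Agent",obsInfo,actInfo);
env.ResetFcn = @(in) setVariable(in,"x0",0.1);   % optionally set the (hypothetical) model variable x0 at each episode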
Neural Network Environments
Neural network environments are custom environments that rely on a neural network for the calculation of the state transition. Here, the state and observation belong to continuous spaces, and the state transition laws can be deterministic or stochastic.
Neural network environments are typically used within model-based reinforcement learning agents, such as the Model-Based Policy Optimization (MBPO) Agent. However, you can also extract the environment from a trained rlMBPOAgent object and use it as a (potentially less computationally demanding, but also less accurate) approximation of the original environment in which the model-based agent was trained.
For more information on how to create neural network environments, see rlNeuralNetworkEnvironment.
See Also
Functions
rlPredefinedEnv | getActionInfo | getObservationInfo | train | sim | rlCreateEnvTemplate | validateEnvironment | rlSimulinkEnv | bus2RLSpec | createIntegratedEnv
Objects
rlNumericSpec | rlFiniteSetSpec | rlMDPEnv | rlFunctionEnv | rlMultiAgentFunctionEnv | rlTurnBasedFunctionEnv | SimulinkEnvWithAgent | rlNeuralNetworkEnvironment
Topics
- Train Reinforcement Learning Agent in MDP Environment
- Train Reinforcement Learning Agent in Basic Grid World
- Train PG Agent to Balance Discrete Cart-Pole System
- Create Custom Environment Using Step and Reset Functions
- Create Custom Environment from Class Template
- What Is Reinforcement Learning?
- Load MATLAB Environments in Reinforcement Learning Designer
- Load Predefined Grid World Environments
- Load Predefined Control System Environments
- Create Custom Simulink Environments