Load Predefined Grid World Environments
Reinforcement Learning Toolbox™ software provides several predefined grid world environments for which the actions, observations, rewards, and dynamics are already defined. You can use these environments to:
Learn reinforcement learning concepts.
Gain familiarity with Reinforcement Learning Toolbox software features.
Test your own discrete-action-space reinforcement learning agents.
You can load the following predefined MATLAB® grid world environments using the rlPredefinedEnv function.
Environment | Agent Task
---|---
Basic grid world | Move from a starting location to a target location on a two-dimensional grid by selecting moves from the discrete action space {N,S,E,W}.
Waterfall grid world | Move from a starting location to a target location on a larger two-dimensional grid with unknown deterministic or stochastic dynamics.
In Reinforcement Learning Toolbox, these grid world environments are implemented as rlMDPEnv objects. For more information on the properties of grid world environments, see Create Custom Grid World Environments.
You can also load predefined MATLAB control system environments. For more information, see Load Predefined Control System Environments.
Basic Grid World
The basic grid world environment is a two-dimensional 5-by-5 grid with a starting location, a terminal location, and obstacles. The environment also contains a special jump from state [2,4] to state [4,4]. The goal of the agent is to move from the starting location to the terminal location while avoiding obstacles and maximizing the total reward.
To create a basic grid world environment, use the rlPredefinedEnv function with the keyword "BasicGridWorld". The function returns an rlMDPEnv object representing the grid world.
env = rlPredefinedEnv("BasicGridWorld")
env = 
  rlMDPEnv with properties:

       Model: [1×1 rl.env.GridWorld]
    ResetFcn: []
Environment Visualization
You can visualize the grid world environment using the plot function.
The agent location is a red circle.
The terminal location is a blue square.
The obstacles are black squares.
plot(env)
Actions
For this environment, the action channel carries a scalar integer from 1 to 4, which indicates an (attempted) move in one of four possible directions (north, south, east, or west, respectively). Therefore, the action specification is an rlFiniteSetSpec object. To extract the action specification, use getActionInfo.
actInfo = getActionInfo(env)
actInfo = 
  rlFiniteSetSpec with properties:

       Elements: [4×1 double]
           Name: "MDP Actions"
    Description: [0×0 string]
      Dimension: [1 1]
       DataType: "double"
Observations
The environment observation channel carries a scalar integer from 1 to 25, which indicates the current agent location (that is, its state) in columnwise fashion. For example, the observation 5 corresponds to the agent position [5,1] on the grid, and the observation 6 corresponds to the position [1,2], and so on. Therefore, the observation specification is an rlFiniteSetSpec object. To extract the observation specification, use getObservationInfo.
obsInfo = getObservationInfo(env)
obsInfo = 
  rlFiniteSetSpec with properties:

       Elements: [25×1 double]
           Name: "MDP Observations"
    Description: [0×0 string]
      Dimension: [1 1]
       DataType: "double"
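Because the numbering is columnwise, you can convert between the scalar observation and [row,column] grid coordinates using standard MATLAB indexing functions, as in the following sketch (the conversion itself is an illustration, not a toolbox feature).
% Convert observation 6 to grid coordinates on the 5-by-5 grid (columnwise numbering).
[row,col] = ind2sub([5 5],6)   % row = 1, col = 2, that is, position [1,2]
% Convert grid position [5,1] back to its observation number.
obs = sub2ind([5 5],5,1)       % obs = 5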
If the agent attempts an illegal move, such as an action that would take it off the grid or into an obstacle, its position remains unchanged; otherwise, the agent position is updated according to the action. For example, if the agent is in position 5, action 3 (an attempted move eastward) moves the agent to position 10 at the next time step, while action 4 leaves the agent in position 5 at the next time step.
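The following sketch illustrates this behavior by temporarily fixing the initial state to position 5; the assignments to ResetFcn are used here only for illustration, and the last line restores the default random reset.
% Temporarily start every episode in position 5, that is, [5,1].
env.ResetFcn = @() 5;
reset(env);
xEast = step(env,3)   % east is legal: the agent moves to position 10
reset(env);
xWest = step(env,4)   % west is illegal from column 1: the agent stays in position 5
env.ResetFcn = [];    % restore the default random reset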
Note that there is no direct feedthrough between the action and the observation, that is, the observation does not depend on the current value of the action. For more information, see Reinforcement Learning Environments.
Rewards
The action A(t) results in the transition from the current state S(t) to the next state S(t+1), which in turn results in one of the following rewards or penalties from the environment to the agent, represented by the scalar R(t+1):
+10 reward for reaching the terminal state at [5,5]
+5 reward for jumping from state [2,4] to state [4,4]
-1 penalty for every other action
As for the observation, there is no direct feedthrough between the action and the reward, that is, the reward R(t+1) does not depend on the next action A(t+1). For more information, see Reinforcement Learning Environments.
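These values are encoded in the underlying grid world model, which you can inspect through the Model property of the environment. The property names below are assumed from the rl.env.GridWorld object interface; see Create Custom Grid World Environments.
env.Model.TerminalStates   % terminal cell of the basic grid world, [5,5]
env.Model.ObstacleStates   % cells the agent cannot enter
size(env.Model.R)          % reward transition array: states-by-states-by-actions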
Reset Function
The default reset function for this environment sets the initial state of the agent on the grid randomly, while avoiding the obstacle and target cells.
x0 = reset(env)
x0 = 11
You can write your own reset function to specify a different initial state. For example, to specify that the initial state of the agent is always 2, create a reset function that returns the state number for the initial agent state.
env.ResetFcn = @() 2;
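You can also randomize the initial state over a restricted set of cells. The following sketch uses a hypothetical set of allowed starting positions; any function handle that returns a valid, non-obstacle, non-terminal state number works.
startStates = [2 3 4];                                      % hypothetical allowed starting cells
env.ResetFcn = @() startStates(randi(numel(startStates)));  % pick one at random each episode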
The reset function is called (by a training or simulation function) at the beginning of each training or simulation episode.
Step Function
The environment observation and action specifications allow you to create an agent that works with your environment. You can then use both the environment and the agent as arguments for the built-in functions train and sim, which train or simulate the agent within the environment, respectively.
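For example, the discrete observation and action spaces make this environment a natural fit for a tabular Q-learning agent. The following sketch assumes the rlTable, rlQValueFunction, rlQAgent, rlTrainingOptions, and rlSimulationOptions interfaces and shows one possible setup; other discrete-action agents work as well.
% Sketch: set up and train a tabular Q-learning agent (one possible agent choice).
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

qTable = rlTable(obsInfo,actInfo);                 % tabular Q-value representation
critic = rlQValueFunction(qTable,obsInfo,actInfo); % table-based critic
agent = rlQAgent(critic);                          % Q-learning agent with default options

trainOpts = rlTrainingOptions(MaxEpisodes=200,MaxStepsPerEpisode=50);
trainStats = train(agent,env,trainOpts);           % train the agent in the environment

simOpts = rlSimulationOptions(MaxSteps=50);
experience = sim(env,agent,simOpts);               % simulate the trained agent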
You can also call the step function to return the next observation, the reward, and an is-done scalar indicating whether a final state has been reached.
For example, reset the basic grid world environment and call the step function.
x0 = reset(env)
x0 = 1
[xn,rn,id] = step(env,3)
xn = 6
rn = -1
id = logical 0
The environment step and reset functions allow you to create a custom training or simulation loop.
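For example, the following sketch runs one episode with a random policy; the action selection here is illustrative and is not a toolbox policy object.
% Minimal custom simulation loop with random action selection.
actInfo = getActionInfo(env);
obs = reset(env);
episodeReward = 0;
for t = 1:100
    act = actInfo.Elements(randi(numel(actInfo.Elements)));  % pick a random action
    [obs,reward,isDone] = step(env,act);
    episodeReward = episodeReward + reward;
    if isDone
        break
    end
end
episodeReward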
Deterministic Waterfall Grid Worlds
The deterministic waterfall grid world environment is a two-dimensional 8-by-7 grid with a starting location and terminal location. The environment includes a waterfall that pushes the agent toward the bottom of the grid. The goal of the agent is to move from the starting location to the terminal location while maximizing the total reward.
To create a deterministic waterfall grid world, use the rlPredefinedEnv function. This function creates an rlMDPEnv object representing the grid world.
env = rlPredefinedEnv('WaterFallGridWorld-Deterministic');
As with the basic grid world, you can visualize the environment, where the agent is a red circle and the terminal location is a blue square.
plot(env)
Actions
The agent can move in one of four possible directions (north, south, east, or west).
Rewards
The agent receives the following rewards or penalties:
+10 reward for reaching the terminal state at [4,5]
-1 penalty for every other action
Waterfall Dynamics
In this environment, a waterfall pushes the agent toward the bottom of the grid.
The intensity of the waterfall varies between the columns, as shown at the top of the preceding figure. When the agent moves into a column with a nonzero intensity, the waterfall pushes it downward by the indicated number of squares. For example, if the agent goes east from state [5,2], it reaches state [7,3].
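A quick way to check this behavior is to fix the initial state and take the eastward step, as in the following sketch. It assumes the same columnwise state numbering and action ordering as the basic grid world, so [5,2] on the 8-by-7 grid is state 13, [7,3] is state 23, and action 3 is an eastward move.
env.ResetFcn = @() 13;   % always start in [5,2], assuming columnwise numbering
reset(env);
xn = step(env,3)         % eastward move; the waterfall should push the agent to [7,3], state 23
env.ResetFcn = [];       % restore the default reset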
Stochastic Waterfall Grid Worlds
The stochastic waterfall grid world environment is a two-dimensional 8-by-7 grid with a starting location and terminal locations. The environment includes a waterfall that pushes the agent towards the bottom of the grid with a stochastic intensity. The goal of the agent is to move from the starting location to the target terminal location while avoiding the penalty terminal states along the bottom of the grid and maximizing the total reward.
To create a stochastic waterfall grid world, use the rlPredefinedEnv function. This function creates an rlMDPEnv object representing the grid world.
env = rlPredefinedEnv('WaterFallGridWorld-Stochastic');
As with the basic grid world, you can visualize the environment, where the agent is a red circle and the terminal location is a blue square.
plot(env)
Actions
The agent can move in one of four possible directions (north, south, east, or west).
Rewards
The agent receives the following rewards or penalties:
+10 reward for reaching the terminal state at [4,5]
-10 penalty for reaching any terminal state in the bottom row of the grid
-1 penalty for every other action
Waterfall Dynamics
In this environment, a waterfall pushes the agent towards the bottom of the grid with a stochastic intensity. The baseline intensity matches the intensity of the deterministic waterfall environment. However, in the stochastic waterfall case, the agent has an equal chance of experiencing the indicated intensity, one level above that intensity, or one level below that intensity. For example, if the agent goes east from state [5,2], it has an equal chance of reaching state [6,3], [7,3], or [8,3].
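You can sample this behavior by repeatedly resetting to state [5,2] and taking the eastward step, as in the following sketch. It assumes the same columnwise numbering and action ordering as the basic grid world, so [5,2] is state 13 and [6,3], [7,3], and [8,3] are states 22, 23, and 24.
env.ResetFcn = @() 13;               % always start in [5,2], assuming columnwise numbering
nextStates = zeros(100,1);
for k = 1:100
    reset(env);
    nextStates(k) = step(env,3);     % eastward move
end
histcounts(nextStates,[21.5 22.5 23.5 24.5])  % roughly equal counts for states 22, 23, and 24
env.ResetFcn = [];                   % restore the default reset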