Deep Q-Network (DQN) Agents
The deep Q-network (DQN) algorithm is a model-free, online, off-policy reinforcement learning method. A DQN agent is a value-based reinforcement learning agent that trains a critic to estimate the expected discounted cumulative long-term reward. DQN is a variant of Q-learning. For more information on Q-learning, see Q-Learning Agents.
For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.
DQN agents can be trained in environments with the following observation and action spaces.
Observation space: continuous or discrete
Action space: discrete
DQN agents use an action-value (Q-value) function critic Q(S,A).
DQN agents do not use an actor.
During training, the agent:
Updates the critic properties at each time step during learning.
Explores the action space using epsilon-greedy exploration. During each control interval, the agent either selects a random action with probability ϵ or selects an action greedily with respect to the action-value function with probability 1-ϵ. The greedy action is the action for which the action-value function is greatest.
Stores past experiences using a circular experience buffer. The agent updates the critic based on a mini-batch of experiences randomly sampled from the buffer.
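The exploration and experience-storage behavior described above can be illustrated with a minimal, toolbox-independent sketch. It is written in plain Python rather than MATLAB, and the names `epsilon_greedy` and `CircularBuffer` are hypothetical helpers introduced only for this illustration; they are not part of any toolbox API.

```python
import random
from collections import deque


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy.

    q_values is a list holding one action-value estimate per discrete action.
    """
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action whose action-value estimate is greatest.
    return max(range(len(q_values)), key=lambda a: q_values[a])


class CircularBuffer:
    """Fixed-capacity experience buffer (hypothetical helper).

    Once the buffer is full, storing a new experience discards the oldest one.
    """

    def __init__(self, capacity):
        self._data = deque(maxlen=capacity)

    def store(self, experience):
        self._data.append(experience)

    def sample(self, batch_size, rng=random):
        # Uniform random mini-batch, sampled without replacement.
        return rng.sample(list(self._data), batch_size)

    def __len__(self):
        return len(self._data)
```

In this sketch, setting `epsilon` to 0 gives purely greedy behavior, and a buffer of capacity 3 that receives five experiences retains only the last three.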
Critic Function Approximators
To estimate the value of the optimal policy, a DQN agent uses two parametrized action-value functions, each maintained by a corresponding critic.
Critic Q(S,A;ϕ): Given observation S and action A, this critic stores the corresponding estimate of the expected discounted cumulative long-term reward when following the optimal policy (this is the value of the optimal policy).
Target critic Qt(S,A;ϕt): To improve the stability of the optimization, the agent periodically updates the target critic parameters ϕt using the latest critic parameter values.
Both Q(S,A;ϕ) and Qt(S,A;ϕt) are implemented by function approximator objects having the same structure and parameterization.
For more information on creating critics for value function approximation, see Create Policies and Value Functions.
During training, the agent tunes the parameter values in ϕ. After training, the parameters remain at their tuned value and the trained value function approximator is stored in critic Q(S,A).
You can create and train DQN agents at the MATLAB® command line or using the Reinforcement Learning Designer app. For more information on creating agents using Reinforcement Learning Designer, see Create Agents Using Reinforcement Learning Designer.
At the command line, you can create a default DQN agent based on the observation and action specifications from the environment. A default DQN agent uses default function approximators that rely on a deep neural network model. To do so, perform the following steps.
Create observation specifications for your environment. If you already have an environment interface object, you can obtain these specifications using
Create action specifications for your environment. If you already have an environment interface object, you can obtain these specifications using
If needed, specify the number of neurons in each learnable layer (the default is 256 neurons) or whether to use an LSTM layer (by default no LSTM layer is used). To do so, create an agent initialization option object using
If needed, specify agent options using an rlDQNAgentOptions object.
Create the agent using an
Alternatively, you can create a critic and use it to create your agent. In this case, ensure that the dimensions of the observation and action layers in the critic match the corresponding action and observation specifications of the environment.
Specify agent options using an rlDQNAgentOptions object. Alternatively, you can create the agent first (step 3) and then, using dot notation, access its option object and modify the options.
Create the agent using an
DQN agents support critics that use recurrent deep neural networks as function approximators.
For more information on creating actors and critics for function approximation, see Create Policies and Value Functions.
DQN agents use the following training algorithm, in which they update their critic model at each time step. To configure the training algorithm, specify options using an rlDQNAgentOptions object.
Initialize the critic Q(S,A;ϕ) with random parameter values ϕ, and initialize the target critic parameters ϕt with the same values.
For each training time step:
For the current observation S, select a random action A with probability ϵ. Otherwise, select the action for which the critic value function is greatest.
To specify ϵ and its decay rate, use the
Execute action A. Observe the reward R and next observation S'.
Store the experience (S,A,R,S') in the experience buffer. To specify the size of the experience buffer, use the
Sample a random mini-batch of M experiences (Si,Ai,Ri,S'i) from the experience buffer. To specify M, use the
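For all experiences in the mini-batch, if S'i is a terminal state, set the value function target yi to Ri. Otherwise, set it to
yi = Ri + γ Qt(S'i, Amax; ϕt)
where Amax is the action A' that maximizes Qt(S'i, A'; ϕt) for normal DQN, or the action A' that maximizes Q(S'i, A'; ϕ) for double DQN.
That is, the normal DQN algorithm selects the action that maximizes the action-value function maintained by the target critic, while double DQN selects the action that maximizes the action-value function maintained by the base critic. In both cases, the target critic evaluates the selected action.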
To set the discount factor γ, use the DiscountFactor option. To use double DQN, set the
If you specify a value of NumStepsToLookAhead equal to N, then the N-step return (which adds the rewards of the following N steps and the discounted estimated value of the state that caused the N-th reward) is used to calculate the target yi.
Update the critic parameters by one-step minimization of the loss L across all sampled experiences.
Update the target critic parameters depending on the target update method. For more information, see Target Update Methods.
Update the probability threshold ϵ for selecting a random action based on the decay rate you specify in the
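The target computation in the steps above, including the difference between normal and double DQN, can be sketched as follows. This is a toolbox-independent illustration in plain Python under stated assumptions: each experience carries an extra `done` flag marking a terminal next state, the critics are modeled as hypothetical callables that map an observation to a list of action values, and the loss is taken to be a mean squared error, which is a common choice but is not stated in the text above.

```python
def dqn_targets(batch, q, q_target, gamma, double_dqn=False):
    """Return the value-function targets y_i for a mini-batch.

    batch    : list of (s, a, r, s_next, done) tuples (done flag is an assumption)
    q        : callable, observation -> list of action values (base critic)
    q_target : callable, observation -> list of action values (target critic)
    gamma    : discount factor
    """
    targets = []
    for _s, _a, r, s_next, done in batch:
        if done:
            # Terminal next state: the target is just the reward.
            targets.append(r)
            continue
        target_values = q_target(s_next)
        # Normal DQN selects the action with the target critic;
        # double DQN selects it with the base critic.
        selector = q(s_next) if double_dqn else target_values
        a_max = max(range(len(selector)), key=lambda a: selector[a])
        # In both variants, the target critic evaluates the selected action.
        targets.append(r + gamma * target_values[a_max])
    return targets


def mse_loss(batch, targets, q):
    """Mean squared error between targets and critic values (assumed loss form)."""
    errors = [(y - q(s)[a]) ** 2 for (s, a, *_), y in zip(batch, targets)]
    return sum(errors) / len(errors)
```

Selecting the action with one critic and evaluating it with the other is what distinguishes double DQN here; with `double_dqn=False` both roles fall to the target critic.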
Target Update Methods
DQN agents update their target critic parameters using one of the following target update methods.
Smoothing: Update the target parameters at every time step using smoothing factor τ. To specify the smoothing factor, use the TargetSmoothFactor option.
Periodic: Update the target parameters periodically without smoothing (TargetSmoothFactor = 1). To specify the update period, use the
Periodic Smoothing: Update the target parameters periodically with smoothing.
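To configure the target update method, create an rlDQNAgentOptions object, and set the target update period and TargetSmoothFactor parameters as follows. For smoothing, use an update period of 1 and a smoothing factor less than 1. For periodic updates, use an update period greater than 1 and a smoothing factor of 1. For periodic smoothing, use an update period greater than 1 and a smoothing factor less than 1.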
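The three update methods can be viewed as one rule with two settings. The sketch below is a toolbox-independent illustration in plain Python; it assumes the usual convex-combination form of smoothing, ϕt ← τϕ + (1−τ)ϕt, and the function name `update_target` and its arguments `tau` and `period` are hypothetical names chosen for this example.

```python
def update_target(target_params, critic_params, step, tau, period=1):
    """Return the target critic parameters after training step `step` (1-based).

    tau    : smoothing factor (1 copies the critic parameters outright)
    period : number of steps between target updates (1 updates every step)
    """
    if step % period != 0:
        # Not an update step: keep the target parameters unchanged.
        return list(target_params)
    # Assumed smoothing rule: blend critic and target parameters with weight tau.
    return [tau * p + (1.0 - tau) * pt
            for p, pt in zip(critic_params, target_params)]
```

Under these assumptions, `period=1` with `tau < 1` corresponds to smoothing, `period > 1` with `tau = 1` to periodic updates, and `period > 1` with `tau < 1` to periodic smoothing.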
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. “Playing Atari with Deep Reinforcement Learning.” arXiv:1312.5602 [cs], December 19, 2013. https://arxiv.org/abs/1312.5602.