## Q-Learning Agent

The Q-learning algorithm is an off-policy reinforcement learning method for environments with a discrete action space. A Q-learning agent trains a Q-value function critic to estimate the value of the optimal policy, while following an epsilon-greedy policy based on the value estimated by the critic (it does not try to directly learn an optimal policy). For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

In Reinforcement Learning Toolbox™, a Q-learning agent is implemented by an `rlQAgent` object.

**Note**

Q-learning agents do not support recurrent networks.

Q-learning agents can be trained in environments with the following observation and action spaces.

Observation Space | Action Space |
---|---|
Continuous or discrete | Discrete |

Q agents use the following critic.

Critic | Actor |
---|---|
Q-value function critic | Q agents do not use an actor |

During training, the agent explores the action space using epsilon-greedy exploration.
During each control interval, the agent selects a random action with probability
*ϵ*; otherwise, with probability 1–*ϵ*, it selects the action for which the
action-value function is greatest.
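The selection rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the toolbox implementation: the function name `epsilon_greedy`, the list-based `q_values` argument, and the seeded `random.Random` generator are all assumptions made for the example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy action.

    q_values is a hypothetical list holding one critic estimate per action.
    """
    if rng.random() < epsilon:
        # Explore: every action is equally likely, including the greedy one.
        return rng.randrange(len(q_values))
    # Exploit: index of the largest estimate (first index wins on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
q_values = [0.1, 0.7, 0.2]

# With epsilon = 0 the agent never explores.
greedy_action = epsilon_greedy(q_values, 0.0, rng)

# With epsilon = 0.3, count how often the greedy action is chosen.
actions = [epsilon_greedy(q_values, 0.3, rng) for _ in range(10_000)]
greedy_fraction = actions.count(1) / len(actions)
```

Because the random branch can also land on the greedy action, the greedy action in this sketch is chosen slightly more often than 1–*ϵ* of the time (about 0.8 here, with three actions and *ϵ* = 0.3).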

### Critic Function Approximator

To estimate the value of the optimal policy, a Q-learning agent uses a critic. The
critic is a function approximator object that implements the parametrized action-value
function
*Q*(*S*,*A*;*ϕ*), using
parameters *ϕ*. For a given observation *S* and action
*A*, the critic stores the corresponding estimate of the expected
discounted cumulative long-term reward when following the optimal policy (this is the value
of the optimal policy). During training, the critic tunes its parameters to improve its
estimation.

For critics that use table-based value functions, the parameters in *ϕ*
are the actual *Q*(*S*,*A*) values in the
table.
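To make the role of the parameters *ϕ* concrete, the following sketch contrasts two hypothetical critics in plain Python. The class names `TableCritic` and `LinearCritic` and the linear form of the second critic are assumptions chosen for illustration; they are not toolbox objects.

```python
class TableCritic:
    """Table-based critic: the parameters are the Q(S, A) entries themselves."""

    def __init__(self, num_states, num_actions):
        self.table = [[0.0] * num_actions for _ in range(num_states)]

    def value(self, state, action):
        return self.table[state][action]


class LinearCritic:
    """Parametrized critic: Q(S, A; phi) = phi[A] . S (one weight vector per action)."""

    def __init__(self, obs_dim, num_actions):
        self.phi = [[0.0] * obs_dim for _ in range(num_actions)]

    def value(self, obs, action):
        return sum(w * x for w, x in zip(self.phi[action], obs))


table_critic = TableCritic(num_states=3, num_actions=2)
table_critic.table[1][0] = 0.5           # changing one parameter changes one Q value
table_q = table_critic.value(1, 0)

linear_critic = LinearCritic(obs_dim=2, num_actions=2)
linear_critic.phi[1] = [1.0, -2.0]       # parameters shared across all observations
linear_q = linear_critic.value([3.0, 1.0], 1)   # 1.0 * 3.0 + (-2.0) * 1.0
```

In the table case each parameter affects exactly one state-action pair, while in the linear case a change to *ϕ* shifts the estimate for every observation, which is what lets such a critic handle continuous observation spaces.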

For more information on creating critics for value function approximation, see Create Policies and Value Functions.

During training, the agent tunes the parameter values in *ϕ*. After
training, the parameters remain at their tuned values and the trained value function
approximator is stored in the critic
*Q*(*S*,*A*).
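As an illustration of what tuning *ϕ* can look like for a critic that is not a table, the sketch below takes one gradient-descent step on half the squared difference between a target value and the current estimate. It assumes a hypothetical linear critic with one weight vector per action and treats the target as a constant; the names `q_value` and `gradient_step` are made up for this example.

```python
def q_value(phi, obs, action):
    """Hypothetical linear critic: Q(S, A; phi) = phi[A] . S."""
    return sum(w * x for w, x in zip(phi[action], obs))

def gradient_step(phi, obs, action, target, learn_rate):
    """One gradient-descent step on 0.5 * (target - Q)^2, target held constant."""
    delta_q = target - q_value(phi, obs, action)
    # d(0.5 * delta_q**2) / d(phi[action][i]) = -delta_q * obs[i]
    grad = [-delta_q * x for x in obs]
    # Move against the gradient to reduce the loss.
    phi[action] = [w - learn_rate * g for w, g in zip(phi[action], grad)]
    return delta_q

phi = [[0.0, 0.0], [0.0, 0.0]]           # two actions, two observation features
obs, action, target = [1.0, 2.0], 1, 1.0

error_before = gradient_step(phi, obs, action, target, learn_rate=0.1)
error_after = target - q_value(phi, obs, action)
```

In this example the error shrinks from 1.0 to 0.5 after a single step, and only the weights of the selected action change.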

### Agent Creation

To create a Q-learning agent:

1. Create a critic using an `rlQValueFunction` or `rlVectorQValueFunction` object.
2. Specify agent options using an `rlQAgentOptions` object. Alternatively, you can create the agent first (step 3) and then, using dot notation, access its option object and modify the options.
3. Create the agent using an `rlQAgent` object.

### Training Algorithm

Q-learning agents use the following training algorithm. To configure the training
algorithm, specify options using an `rlQAgentOptions` object.

1. Initialize the critic *Q*(*S*,*A*;*ϕ*) with random parameter values in *ϕ*.
2. For each training episode:
   1. Get the initial observation *S* from the environment.
   2. Repeat the following for each step of the episode until *S* is a terminal state.
      1. For the current observation *S*, select a random action *A* with probability *ϵ*. Otherwise, select the action for which the critic value function is greatest.

         $$A=\arg\max_{A} Q\left(S,A;\varphi \right)$$

         To specify *ϵ* and its decay rate, use the `EpsilonGreedyExploration` option.
      2. Execute action *A*. Observe the reward *R* and next observation *S'*.
      3. If *S'* is a terminal state, set the value function target *y* to *R*. Otherwise, set it to

         $$y=R+\gamma \max_{A} Q\left(S',A;\varphi \right)$$

         To set the discount factor *γ*, use the `DiscountFactor` option.
      4. Compute the difference *ΔQ* between the value function target and the current *Q*(*S*,*A*;*ϕ*) value.

         $$\Delta Q=y-Q\left(S,A;\varphi \right)$$

      5. Update the critic using the learning rate *α*. Specify the learning rate by setting the `LearnRate` option in the `rlCriticOptimizerOptions` property within the agent options object.
         - For table-based critics, update the corresponding *Q*(*S*,*A*) value in the table.

           $$Q\left(S,A\right)=Q\left(S,A;\varphi \right)+\alpha \cdot \Delta Q$$

         - For all other types of critics, compute the gradients *Δϕ* of the loss function with respect to the parameters *ϕ*. Then, update the parameters by stepping against the computed gradients. In this case, the loss function is one half of the square of *ΔQ*.

           $$\begin{array}{l}\Delta \varphi =\frac{1}{2}{\nabla}_{\varphi}{\left(\Delta Q\right)}^{2}\\ \varphi =\varphi -\alpha \cdot \Delta \varphi \end{array}$$

      6. Set the observation *S* to *S'*.
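The steps above can be sketched for a table-based critic in plain Python. Everything about the environment is hypothetical: a five-state chain in which action 1 moves right, action 0 moves left, and reaching state 4 ends the episode with reward 1. The sketch also makes simplifying assumptions that are not part of the algorithm description: the table starts at zero rather than at random values, *ϵ* decays multiplicatively after each episode, and episodes are capped at `max_steps` steps.

```python
import random

NUM_STATES, GOAL = 5, 4          # hypothetical chain: states 0..4, state 4 is terminal
ACTIONS = (0, 1)                 # 0 = move left, 1 = move right

def step(state, action):
    """Hypothetical environment: reward 1 on reaching the goal, 0 otherwise."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def train(num_episodes=500, alpha=0.5, gamma=0.9, epsilon=1.0,
          epsilon_decay=0.99, epsilon_min=0.1, max_steps=100, seed=0):
    rng = random.Random(seed)
    # Table critic, initialized to zero for simplicity (an assumption).
    q = [[0.0 for _ in ACTIONS] for _ in range(NUM_STATES)]
    for _ in range(num_episodes):
        state = 0                                        # initial observation S
        for _ in range(max_steps):
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])
            next_state, reward, done = step(state, action)
            # Target: y = R at a terminal state, else R + gamma * max_A Q(S', A).
            target = reward if done else reward + gamma * max(q[next_state])
            delta_q = target - q[state][action]
            q[state][action] += alpha * delta_q          # table update
            state = next_state                           # set S to S'
            if done:
                break
        # Assumed decay schedule: shrink epsilon each episode down to a floor.
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
    return q

q_table = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
```

After training, `greedy_policy` holds the greedy action for each non-terminal state; in this chain the learned policy moves right everywhere, and the estimate for moving right from state 3 approaches the terminal reward of 1.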

## References

[1] Sutton, Richard S., and Andrew
G. Barto. *Reinforcement Learning: An Introduction*. Second edition.
Adaptive Computation and Machine Learning. Cambridge, Mass: The MIT Press,
2018.