LSPI Agent
The least-squares policy iteration (LSPI) algorithm is an off-policy reinforcement learning method for environments with a discrete action space. The algorithm trains a Q-value function critic to estimate the value of the optimal policy while following an epsilon-greedy policy based on the values estimated by the critic.
The LSPI algorithm uses a least-squares approach to directly approximate the Q-value function. This approach is unlike gradient-based algorithms such as deep Q-network (DQN) and policy gradient (PG), which update their learnable parameters using gradient descent to minimize a loss function. The approximation model used by the critic must be a custom basis function that is linear in its parameters. For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.
In Reinforcement Learning Toolbox™, an LSPI agent is implemented by an rlLSPIAgent
object.
This implementation uses an online version of the LSPI algorithm that is designed to update
the policy iteratively as new data becomes available, rather than relying on a fixed batch of
data.
LSPI agents can be trained in environments with the following observation and action spaces.
| Observation Space | Action Space |
| --- | --- |
| Continuous or discrete | Discrete |
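For example, you can create observation and action specifications consistent with this table using rlNumericSpec and rlFiniteSetSpec. The dimensions and action values below are only illustrative.

```matlab
% Continuous observation space: a 3-element column vector (illustrative size).
obsInfo = rlNumericSpec([3 1]);

% Discrete action space with two possible action values (illustrative values).
actInfo = rlFiniteSetSpec([-1 1]);
```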
LSPI agents use the following critic.
| Critic | Actor |
| --- | --- |
| Q-value function critic Q(S,A) that uses a custom basis function approximation model. You create this critic using rlQValueFunction. | LSPI agents do not use an actor. |
During training, the agent:

- Updates the critic learnable parameters at a certain frequency specified by the LearningFrequency agent option.
- Explores the action space using epsilon-greedy exploration. During each control interval, the agent selects a random action with probability ϵ; otherwise, with probability 1–ϵ, it selects the action for which the action-value function is greatest. A sketch of this selection rule follows the list.
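The following sketch illustrates the epsilon-greedy selection rule outside the toolbox. The basis function, weights, observation, exploration probability, and action set are hypothetical placeholders.

```matlab
% Illustration of epsilon-greedy action selection over a discrete action set,
% assuming Q(obs,A) = W'*phi(obs,A). All values below are hypothetical.
phi     = @(obs,act) [1; obs; act; act*obs];   % example custom basis function
W       = [0.1; -0.2; 0.4; 0.3];               % example critic parameters
obs     = 0.5;                                 % current observation
epsilon = 0.1;                                 % exploration probability
actions = [-1 1];                              % discrete action set

if rand < epsilon
    A = actions(randi(numel(actions)));        % explore: pick a random action
else
    [~,k] = max(arrayfun(@(a) W'*phi(obs,a), actions));
    A = actions(k);                            % exploit: pick the greedy action
end
```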
Critic Used by LSPI Agent
The LSPI agent uses a critic to estimate the value of the optimal policy. The critic is a function approximator object that implements the parameterized action-value function Q(S,A;W), with parameters W. For LSPI agents, Q must be linear in W, so you must implement the critic using a custom basis function ϕ(S,A) instead of a neural network or a table. For a given observation S and action A, the critic stores the corresponding estimate of the expected discounted cumulative long-term reward when following the optimal policy. This is the value of the optimal policy. During training, the critic tunes the parameters in W to improve its action-value function estimate. After training, the parameters remain at their tuned values in the critic internal to the trained agent.
For more information on critics, see Create Policies and Value Functions.
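For example, the following sketch shows a hypothetical action-value function with this linear-in-W structure. The feature choice and weight values are placeholders, not toolbox code.

```matlab
% Hypothetical action-value function that is linear in the parameters W.
phi = @(S,A) [1; S(1); S(2); A; A*S(1); A*S(2)];  % custom basis function phi(S,A)
W   = [0; 0.5; -0.3; 0.2; 0.1; -0.4];             % learnable parameters
Q   = @(S,A) W'*phi(S,A);                         % Q(S,A;W) = W'*phi(S,A)

Q([0.2; -0.1], 1)   % estimated value of action 1 for observation [0.2; -0.1]
```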
LSPI Agent Creation
To create an LSPI agent object, follow these steps. A code sketch of the workflow follows the list.

1. Create observation specifications for your environment. If you already have an environment object, you can obtain these specifications using the getObservationInfo function.
2. Create action specifications for your environment. If you already have an environment object, you can obtain these specifications using the getActionInfo function.
3. Create a custom basis function that takes an observation and an action as inputs and returns a vector as output. Each element of the vector is also referred to as a feature. Ideally, there is a linear combination of the features that approximates the value function of your reinforcement learning problem with reasonable accuracy. Functions defined in a separate file or anonymous functions are recommended over local functions.
4. Initialize a parameter (weights) vector W. This vector must contain as many elements as the number of features returned by your custom basis function.
5. Create an rlQValueFunction critic object, passing as the first input argument a cell array containing both a handle to your custom basis function and the initial weight vector.
6. Specify agent options using an rlLSPIAgentOptions object. Alternatively, you can skip this step and modify the agent options later using dot notation.
7. Create the agent using rlLSPIAgent.
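The following sketch runs through these steps for a hypothetical environment with a two-element continuous observation and two possible actions. The feature choice and option values are illustrative, and the final call assumes that rlLSPIAgent accepts a critic and an options object, as other agent constructors do.

```matlab
% Observation and action specifications (hypothetical environment).
obsInfo = rlNumericSpec([2 1]);
actInfo = rlFiniteSetSpec([-1 1]);

% Custom basis function: the feature choice here is only an illustration.
myBasisFcn = @(obs,act) [1; obs(1); obs(2); obs(1)^2; obs(2)^2; ...
                         act; act*obs(1); act*obs(2)];

% Initial weights: one element per feature returned by the basis function.
W0 = zeros(8,1);

% Q-value function critic that is linear in the parameters.
critic = rlQValueFunction({myBasisFcn,W0},obsInfo,actInfo);

% Agent options (option names are from this page; values are illustrative).
opts = rlLSPIAgentOptions;
opts.LearningFrequency = 20;
opts.DiscountFactor = 0.99;

% Create the LSPI agent, assuming the constructor accepts a critic and options.
agent = rlLSPIAgent(critic,opts);

% Check that the agent returns an action for a random observation.
getAction(agent,{rand(2,1)})
```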
LSPI Training Algorithm
LSPI agents use the following training algorithm. To configure the training algorithm,
specify options using an rlLSPIAgentOptions
object.
Initialize the feature matrix F as an identity matrix multiplied by a scalar δ. F has as many rows and columns as the number of features supplied by your custom basis function. To specify δ, use the FeatureMatrixInitializationConstant option. Also initialize the target vector b to zero. The vector b has as many elements as the number of rows of F.

For each training episode, perform these operations:

1. At the beginning of the episode, get the initial observation from the environment.
2. For the current observation S, select a random action A with probability ϵ. Otherwise, select the action for which the critic value function is greatest. To specify ϵ and its decay rate, use the EpsilonGreedyExploration option.
3. Execute action A. Observe the reward R and the next observation S'.
4. Store the experience (S,A,R,S').
5. If ϵ is greater than its minimum value, perform the decay operation as described in EpsilonGreedyExploration.
6. Every M time steps (to specify M, use the LearningFrequency option), perform these learning operations. A numerical sketch of one learning update follows this list.

   a. Assemble the matrices Φ and Φ' from the M stored experiences. The i-th row of Φ is ϕ(Si,Ai)ᵀ and the i-th row of Φ' is ϕ(S'i,A'i)ᵀ. Here, A'i is the action for which the critic value function calculated in S'i is the greatest:

      A'i = argmaxA Q(S'i,A;W)

      If S'i is a terminal state, the vector ϕ(S'i,A'i) is set to zero.

   b. Compute the matrix Fnew and the vector bnew:

      Fnew = Φᵀ(Φ – γΦ')
      bnew = ΦᵀR

      Here, R is the column vector containing the M stored rewards. To set the discount factor γ, use the DiscountFactor option.

   c. Update the feature matrix F and the target vector b using a running mean formula:

      F ← ((N–M)F + Fnew)/N
      b ← ((N–M)b + bnew)/N

      Here, N is the total number of samples collected from the beginning of training.

   d. Compute the new weights W by solving FW = b:

      W = F⁻¹b
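The following sketch walks through one learning update on a tiny hypothetical batch, following the equations above. The basis function, experiences, and parameter values are placeholders; the toolbox performs this update internally.

```matlab
% One LSPI learning update from M stored experiences (illustration only).
phi   = @(S,A) [1; S; A; A*S];   % hypothetical basis function, scalar S and A
gamma = 0.99;                    % discount factor
acts  = [-1 1];                  % discrete action set
W     = zeros(4,1);              % current critic weights
F     = 1e-3*eye(4);             % feature matrix, initialized to delta*I
b     = zeros(4,1);              % target vector
N     = 0;                       % total number of samples seen so far

% M = 2 hypothetical experiences; each row is {S, A, R, S', isTerminal}.
batch = {0.1, -1,  0.5, 0.3, false; ...
         0.3,  1, -0.2, 0.0, true};

M    = size(batch,1);
Phi  = zeros(M,4);               % rows are phi(Si,Ai)'
Phip = zeros(M,4);               % rows are phi(S'i,A'i)'
R    = zeros(M,1);               % stored rewards
for i = 1:M
    [S,A,r,Sp,isTerm] = batch{i,:};
    Phi(i,:) = phi(S,A)';
    R(i) = r;
    if ~isTerm
        % A'i is the greedy action in S'i under the current weights
        [~,k] = max(arrayfun(@(a) W'*phi(Sp,a), acts));
        Phip(i,:) = phi(Sp,acts(k))';
    end                          % terminal next state: row stays zero
end

Fnew = Phi'*(Phi - gamma*Phip);  % batch statistics
bnew = Phi'*R;

N = N + M;                       % running-mean update of F and b
F = ((N-M)*F + Fnew)/N;
b = ((N-M)*b + bnew)/N;

W = F\b                          % new weights solve F*W = b
```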
References
[1] Lagoudakis, Michail G., and Ronald Parr. “Least-Squares Policy Iteration.” Journal of Machine Learning Research 4, no. Dec (2003): 1107–49. https://www.jmlr.org/papers/v4/lagoudakis03a.html.
[2] Busoniu, L., D Ernst, B. De Schutter, and R. Babuska. “Online Least-Squares Policy Iteration for Reinforcement Learning Control.” In Proceedings of the 2010 American Control Conference, 486–91. Baltimore, MD: IEEE, 2010. https://doi.org/10.1109/ACC.2010.5530856.
See Also
Objects
rlLSPIAgent | rlLSPIAgentOptions | rlQAgentOptions | rlQValueFunction | rlQAgent | rlSARSAAgent | rlDQNAgent