LSPI Agent
The least-squares policy iteration (LSPI) algorithm is an off-policy reinforcement learning method for environments with a discrete action space. The algorithm trains a Q-value function critic to estimate the value of the optimal policy while following an epsilon-greedy policy based on the values estimated by the critic.
The LSPI algorithm uses a least-squares approach to directly approximate the Q-value function. This approach is unlike gradient-based algorithms such as deep Q-network (DQN) and policy gradient (PG), which update their learnable parameters using gradient descent to minimize a loss function. The approximation model used by the critic must be a custom basis function that is linear in its parameters. For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.
In Reinforcement Learning Toolbox™, an LSPI agent is implemented by an rlLSPIAgent
object.
This implementation uses an online version of the LSPI algorithm that is designed to update
the policy iteratively as new data becomes available, rather than relying on a fixed batch of
data.
LSPI agents can be trained in environments with the following observation and action spaces.
| Observation Space | Action Space |
| --- | --- |
| Continuous or discrete | Discrete |
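For example, you can create observation and action specifications consistent with this table using rlNumericSpec and rlFiniteSetSpec. The dimensions and action values below are only illustrative.

```matlab
% Continuous observation space: a 3-element column vector (illustrative size).
obsInfo = rlNumericSpec([3 1]);

% Discrete action space with two possible action values (illustrative values).
actInfo = rlFiniteSetSpec([-1 1]);
```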
LSPI agents use the following critic.
| Critic | Actor |
| --- | --- |
| Q-value function critic Q(S,A) that uses a custom basis function approximation model. You create this critic using rlQValueFunction. | LSPI agents do not use an actor. |
During training, the agent:

- Updates the critic learnable parameters at a certain frequency specified by the LearningFrequency agent option.
- Explores the action space using epsilon-greedy exploration. During each control interval, the agent selects a random action with probability ϵ; otherwise, with probability 1–ϵ, it selects the action for which the action-value function is greatest. A sketch of this selection rule follows the list.
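The following sketch illustrates the epsilon-greedy selection rule outside the toolbox. The basis function, weights, observation, exploration probability, and action set are hypothetical placeholders.

```matlab
% Illustration of epsilon-greedy action selection over a discrete action set,
% assuming Q(obs,A) = W'*phi(obs,A). All values below are hypothetical.
phi     = @(obs,act) [1; obs; act; act*obs];   % example custom basis function
W       = [0.1; -0.2; 0.4; 0.3];               % example critic parameters
obs     = 0.5;                                 % current observation
epsilon = 0.1;                                 % exploration probability
actions = [-1 1];                              % discrete action set

if rand < epsilon
    A = actions(randi(numel(actions)));        % explore: pick a random action
else
    [~,k] = max(arrayfun(@(a) W'*phi(obs,a), actions));
    A = actions(k);                            % exploit: pick the greedy action
end
```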
Critic Used by LSPI Agent
The LSPI agent uses a critic to estimate the value of the optimal policy. The critic is a function approximator object that implements the parameterized action-value function Q(S,A;W), with parameters W. For LSPI agents, Q must be linear in W, so you must implement the critic using a custom basis function ϕ(S,A) instead of a neural network or a table. For a given observation S and action A, the critic stores the corresponding estimate of the expected discounted cumulative long-term reward when following the optimal policy. This is the value of the optimal policy. During training, the critic tunes the parameters in W to improve its action-value function estimate. After training, the parameters remain at their tuned values in the critic internal to the trained agent.
For more information on critics, see Create Policies and Value Functions.
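For example, the following sketch shows a hypothetical action-value function with this linear-in-W structure. The feature choice and weight values are placeholders, not toolbox code.

```matlab
% Hypothetical action-value function that is linear in the parameters W.
phi = @(S,A) [1; S(1); S(2); A; A*S(1); A*S(2)];  % custom basis function phi(S,A)
W   = [0; 0.5; -0.3; 0.2; 0.1; -0.4];             % learnable parameters
Q   = @(S,A) W'*phi(S,A);                         % Q(S,A;W) = W'*phi(S,A)

Q([0.2; -0.1], 1)   % estimated value of action 1 for observation [0.2; -0.1]
```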
LSPI Agent Creation
To create an LSPI agent object, follow these steps. A code sketch of the workflow follows the list.

1. Create observation specifications for your environment. If you already have an environment object, you can obtain these specifications using the getObservationInfo function.
2. Create action specifications for your environment. If you already have an environment object, you can obtain these specifications using the getActionInfo function.
3. Create a custom basis function that takes an observation and an action as inputs and returns a vector as output. Each element of the vector is also referred to as a feature. Ideally, there is a linear combination of the features that approximates the value function of your reinforcement learning problem with reasonable accuracy. Functions defined in a separate file or anonymous functions are recommended over local functions.
4. Initialize a parameter (weights) vector W. This vector must contain as many elements as the number of features returned by your custom basis function.
5. Create an rlQValueFunction critic object, passing as the first input argument a cell array containing both a handle to your custom basis function and the initial weight vector.
6. Specify agent options using an rlLSPIAgentOptions object. Alternatively, you can skip this step and modify the agent options later using dot notation.
7. Create the agent using rlLSPIAgent.
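The following sketch runs through these steps for a hypothetical environment with a two-element continuous observation and two possible actions. The feature choice and option values are illustrative, and the final call assumes that rlLSPIAgent accepts a critic and an options object, as other agent constructors do.

```matlab
% Observation and action specifications (hypothetical environment).
obsInfo = rlNumericSpec([2 1]);
actInfo = rlFiniteSetSpec([-1 1]);

% Custom basis function: the feature choice here is only an illustration.
myBasisFcn = @(obs,act) [1; obs(1); obs(2); obs(1)^2; obs(2)^2; ...
                         act; act*obs(1); act*obs(2)];

% Initial weights: one element per feature returned by the basis function.
W0 = zeros(8,1);

% Q-value function critic that is linear in the parameters.
critic = rlQValueFunction({myBasisFcn,W0},obsInfo,actInfo);

% Agent options (option names are from this page; values are illustrative).
opts = rlLSPIAgentOptions;
opts.LearningFrequency = 20;
opts.DiscountFactor = 0.99;

% Create the LSPI agent, assuming the constructor accepts a critic and options.
agent = rlLSPIAgent(critic,opts);

% Check that the agent returns an action for a random observation.
getAction(agent,{rand(2,1)})
```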
LSPI Training Algorithm
LSPI agents use the following training algorithm. To configure the training algorithm,
specify options using an rlLSPIAgentOptions
object.
Initialize the feature matrix F as an identity matrix multiplied by a scalar δ. F has as many rows and columns as the number of features supplied by your custom basis function. To specify δ, use the FeatureMatrixInitializationConstant option. Also initialize the target vector b to zero. The vector b has as many elements as the number of rows of F.

For each training episode, perform these operations:

1. At the beginning of the episode, get the initial observation from the environment.
2. For the current observation S, select a random action A with probability ϵ. Otherwise, select the action for which the critic value function is greatest. To specify ϵ and its decay rate, use the EpsilonGreedyExploration option.
3. Execute action A. Observe the reward R and the next observation S'.
4. Store the experience (S,A,R,S').
5. If ϵ is greater than its minimum value, perform the decay operation as described in EpsilonGreedyExploration.
6. Every M time steps (to specify M, use the LearningFrequency option), perform these learning operations. A numerical sketch of one learning update follows this list.

   a. Assemble the matrices Φ and Φ' from the M stored experiences. The i-th row of Φ is ϕ(Si,Ai)ᵀ and the i-th row of Φ' is ϕ(S'i,A'i)ᵀ. Here, A'i is the action for which the critic value function calculated in S'i is the greatest:

      A'i = argmaxA Q(S'i,A;W)

      If S'i is a terminal state, the vector ϕ(S'i,A'i) is set to zero.

   b. Compute the matrix Fnew and the vector bnew:

      Fnew = Φᵀ(Φ – γΦ')
      bnew = ΦᵀR

      Here, R is the column vector containing the M stored rewards. To set the discount factor γ, use the DiscountFactor option.

   c. Update the feature matrix F and the target vector b using a running mean formula:

      F ← ((N–M)F + Fnew)/N
      b ← ((N–M)b + bnew)/N

      Here, N is the total number of samples collected from the beginning of training.

   d. Compute the new weights W by solving FW = b:

      W = F⁻¹b
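The following sketch walks through one learning update on a tiny hypothetical batch, following the equations above. The basis function, experiences, and parameter values are placeholders; the toolbox performs this update internally.

```matlab
% One LSPI learning update from M stored experiences (illustration only).
phi   = @(S,A) [1; S; A; A*S];   % hypothetical basis function, scalar S and A
gamma = 0.99;                    % discount factor
acts  = [-1 1];                  % discrete action set
W     = zeros(4,1);              % current critic weights
F     = 1e-3*eye(4);             % feature matrix, initialized to delta*I
b     = zeros(4,1);              % target vector
N     = 0;                       % total number of samples seen so far

% M = 2 hypothetical experiences; each row is {S, A, R, S', isTerminal}.
batch = {0.1, -1,  0.5, 0.3, false; ...
         0.3,  1, -0.2, 0.0, true};

M    = size(batch,1);
Phi  = zeros(M,4);               % rows are phi(Si,Ai)'
Phip = zeros(M,4);               % rows are phi(S'i,A'i)'
R    = zeros(M,1);               % stored rewards
for i = 1:M
    [S,A,r,Sp,isTerm] = batch{i,:};
    Phi(i,:) = phi(S,A)';
    R(i) = r;
    if ~isTerm
        % A'i is the greedy action in S'i under the current weights
        [~,k] = max(arrayfun(@(a) W'*phi(Sp,a), acts));
        Phip(i,:) = phi(Sp,acts(k))';
    end                          % terminal next state: row stays zero
end

Fnew = Phi'*(Phi - gamma*Phip);  % batch statistics
bnew = Phi'*R;

N = N + M;                       % running-mean update of F and b
F = ((N-M)*F + Fnew)/N;
b = ((N-M)*b + bnew)/N;

W = F\b                          % new weights solve F*W = b
```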
References
[1] Lagoudakis, Michail G., and Ronald Parr. “Least-Squares Policy Iteration.” Journal of Machine Learning Research 4, no. Dec (2003): 1107–49. https://www.jmlr.org/papers/v4/lagoudakis03a.html.
[2] Busoniu, L., D Ernst, B. De Schutter, and R. Babuska. “Online Least-Squares Policy Iteration for Reinforcement Learning Control.” In Proceedings of the 2010 American Control Conference, 486–91. Baltimore, MD: IEEE, 2010. https://doi.org/10.1109/ACC.2010.5530856.
See Also
Objects
rlLSPIAgent | rlLSPIAgentOptions | rlQAgentOptions | rlQValueFunction | rlQAgent | rlSARSAAgent | rlDQNAgent