Reinforcement learning is a type of machine learning technique where a computer agent learns to perform a task through repeated trial and error interactions with a dynamic environment. This learning approach enables the agent to make a series of decisions that maximize a reward metric for the task without human intervention and without being explicitly programmed to achieve the task.
AI programs trained with reinforcement learning beat human players in board games like Go and chess, as well as video games. While reinforcement learning is by no means a new concept, recent progress in deep learning and computing power made it possible to achieve some remarkable results in the area of artificial intelligence.
Reinforcement learning is a branch of machine learning (Figure 1). Unlike unsupervised and supervised machine learning, reinforcement learning does not rely on a static dataset, but operates in a dynamic environment and learns from collected experiences. Data points, or experiences, are collected during training through trial-and-error interactions between the environment and a software agent. This aspect of reinforcement learning is important, because it alleviates the need for data collection, preprocessing, and labeling before training, otherwise necessary in supervised and unsupervised learning. Practically, this means that, given the right incentive, a reinforcement learning model can start learning a behavior on its own, without (human) supervision.
Deep learning spans all three types of machine learning; reinforcement learning and deep learning are not mutually exclusive. Complex reinforcement learning problems often rely on deep neural networks, a field known as deep reinforcement learning.
Deep neural networks trained with reinforcement learning can encode complex behaviors. This allows an alternative approach to applications that are otherwise intractable or more challenging to tackle with more traditional methods. For example, in autonomous driving, a neural network can replace the driver and decide how to turn the steering wheel by simultaneously looking at multiple sensors such as camera frames and lidar measurements. Without neural networks, the problem would normally be broken down in smaller pieces like extracting features from camera frames, filtering the lidar measurements, fusing the sensor outputs, and making “driving” decisions based on sensor inputs.
While reinforcement learning as an approach is still under evaluation for production systems, some industrial applications are good candidates for this technology.
Advanced controls: Controlling nonlinear systems is a challenging problem that is often addressed by linearizing the system at different operating points. Reinforcement learning can be applied directly to the nonlinear system.
Automated driving: Making driving decisions based on camera input is an area where reinforcement learning is suitable considering the success of deep neural networks in image applications.
Robotics: Reinforcement learning can help with applications like robotic grasping, such as teaching a robotic arm how to manipulate a variety of objects for pick-and-place applications. Other robotics applications include human-robot and robot-robot collaboration.
Scheduling: Scheduling problems appear in many scenarios including traffic light control and coordinating resources on the factory floor towards some objective. Reinforcement learning is a good alternative to evolutionary methods to solve these combinatorial optimization problems.
Calibration: Applications that involve manual calibration of parameters, such as electronic control unit (ECU) calibration, may be good candidates for reinforcement learning.
The training mechanism behind reinforcement learning reflects many real-world scenarios. Consider, for example, pet training through positive reinforcement.
Using reinforcement learning terminology (Figure 2), the goal of learning in this case is to train the dog (agent) to complete a task within an environment, which includes the surroundings of the dog as well as the trainer. First, the trainer issues a command or cue, which the dog observes (observation). The dog then responds by taking an action. If the action is close to the desired behavior, the trainer will likely provide a reward, such as a food treat or a toy; otherwise, no reward will be provided. At the beginning of training, the dog will likely take more random actions like rolling over when the command given is “sit,” as it is trying to associate specific observations with actions and rewards. This association, or mapping, between observations and actions is called policy. From the dog’s perspective, the ideal case would be one in which it would respond correctly to every cue, so that it gets as many treats as possible. So, the whole meaning of reinforcement learning training is to “tune” the dog’s policy so that it learns the desired behaviors that will maximize some reward. After training is complete, the dog should be able to observe the owner and take the appropriate action, for example, sitting when commanded to “sit” by using the internal policy it has developed. By this point, treats are welcome but, theoretically, shouldn’t be necessary.
Keeping in mind the dog training example, consider the task of parking a vehicle using an automated driving system (Figure 3). The goal is to teach the vehicle computer (agent) to park in the correct parking spot with reinforcement learning. As in the dog training case, the environment is everything outside the agent and could include the dynamics of the vehicle, other vehicles that may be nearby, weather conditions, and so on. During training, the agent uses readings from sensors such as cameras, GPS, and lidar (observations) to generate steering, braking, and acceleration commands (actions). To learn how to generate the correct actions from the observations (policy tuning), the agent repeatedly tries to park the vehicle using a trial-and-error process. A reward signal can be provided to evaluate the goodness of a trial and to guide the learning process.
In the dog training example, training is happening inside the dog’s brain. In the autonomous parking example, training is handled by a training algorithm. The training algorithm is responsible for tuning the agent’s policy based on the collected sensor readings, actions, and rewards. After training is complete, the vehicle’s computer should be able to park using only the tuned policy and sensor readings.
One thing to keep in mind is that reinforcement learning is not sample efficient. That is, it requires a large number of interactions between the agent and the environment to collect data for training. As an example, AlphaGo, the first computer program to defeat a world champion at the game of Go, was trained non-stop for a period of a few days by playing millions of games, accumulating thousands of years of human knowledge. Even for relatively simple applications, training time can take anywhere from minutes, to hours or days. Also, setting up the problem correctly can be challenging as there is a list of design decisions that need to be made, which may require a few iterations to get right. These include, for example, selecting the appropriate architecture for the neural networks, tuning hyperparameters, and shaping of the reward signal.
The general workflow for training an agent using reinforcement learning includes the following steps (Figure 4):
First you need to define the environment within which the reinforcement learning agent operates, including the interface between agent and environment. The environment can be either a simulation model, or a real physical system, but simulated environments are usually a good first step since they are safer and allow experimentation.
Next, specify the reward signal that the agent uses to measure its performance against the task goals and how this signal is calculated from the environment. Reward shaping can be tricky and may require a few iterations to get it right.
Then you create the agent, which consists of the policy and the reinforcement learning training algorithm. So you need to:
a) Choose a way to represent the policy (such as using neural networks or look-up tables).
b) Select the appropriate training algorithm. Different representations are often tied to specific categories of training algorithms. But in general, most modern reinforcement learning algorithms rely on neural networks as they are good candidates for large state/action spaces and complex problems.
Set up training options (like stopping criteria) and train the agent to tune the policy. Make sure to validate the trained policy after training ends. If necessary, revisit design choices like the reward signal and policy architecture and train again. Reinforcement learning is generally known to be sample inefficient; training can take anywhere from minutes to days depending on the application. For complex applications, parallelizing training on multiple CPUs, GPUs, and computer clusters will speed things up (Figure 5).
Deploy the trained policy representation using, for example, generated C/C++ or CUDA code. At this point, the policy is a standalone decision-making system.
Training an agent using reinforcement learning is an iterative process. Decisions and results in later stages can require you to return to an earlier stage in the learning workflow. For example, if the training process does not converge to an optimal policy within a reasonable amount of time, you may have to update any of the following before retraining the agent:
MATLAB® and Reinforcement Learning Toolbox™ simplify reinforcement learning tasks. You can implement controllers and decision-making algorithms for complex systems such as robots and autonomous systems by working through every step of the reinforcement learning workflow. Specifically, you can:
1. Create environments and reward functions using MATLAB and Simulink®
2. Use deep neural networks, polynomials, and look-up tables to define reinforcement learning policies
3. Switch, evaluate, and compare popular reinforcement learning algorithms like DQN, DDPG, PPO, and SAC with only minor code changes, or create your own custom algorithm
5. Generate code and deploy reinforcement learning policies to embedded devices with MATLAB Coder™ and GPU Coder™
6. Get started with reinforcement learning using reference examples.