
Control a Quanser QUBE Pendulum with a Raspberry Pi using Reinforcement Learning

This example shows how to train a reinforcement learning policy deployed on a Raspberry Pi® board to control a Quanser QUBE™-Servo 2 inverted pendulum system. The goal of the policy is to swing up and balance the pendulum in the upright position. For more information regarding the Quanser pendulum system, see Quanser QUBE™-Servo 2.

Introduction

When an agent needs to interact with a physical system, you can use different architectures to allocate the required computations. Three different architectures, with their advantages and disadvantages, are discussed in Examine Approaches to Fine Tune a Deployed Policy. This example implements the third architecture.

Here, the learning algorithm runs in a desktop MATLAB process, while the policy (that is, the control process) executes on a Raspberry Pi board. The control process collects experiences by interacting with the pendulum system and periodically sends these experiences to the learning process. The learning process uses the received experiences to update the actor and critic parameters of the agent, and periodically sends the updated actor parameters back to the control process. The control process then updates its policy parameters with the new parameters received from the learning process.

The following figure illustrates this architecture.

In this figure, the control process is deployed (from a Simulink® model) on a Raspberry Pi board. The Raspberry Pi board interacts with the Quanser QUBE pendulum system using a serial peripheral interface (SPI) connection. The control process runs the agent policy and collects experiences by controlling the Quanser QUBE pendulum system. The process regularly updates its policy parameters with new policy parameters read from a parameter file that the learning process regularly writes on the Raspberry Pi board. The control process also periodically saves the collected experiences to experience files in the Raspberry Pi board file system.

The learning process runs (in MATLAB) on a desktop computer. This process reads the files on the Raspberry Pi board that contain the new collected experiences, uses the experiences to train the agent, and periodically writes to the Raspberry Pi board a file that contains new policy parameters for the control board to read.

Create Quanser QUBE Pendulum Environment Specifications

The Quanser QUBE-Servo 2 pendulum system is a rotational inverted pendulum with two degrees of freedom. The pendulum is attached to the motor arm by a free revolute joint. The arm is actuated by a DC motor. The control process containing the agent policy is designed in Simulink and deployed (using the Raspberry Pi® Blockset) on a Raspberry Pi board, which interacts with the pendulum system using a Serial Peripheral Interface (SPI) connection. This example uses the Quanser QFLEX 2 Embedded module to enable SPI communication with the QUBE-Servo 2. The goal of the agent is to swing up and balance the pendulum in the upright position. The pendulum system is an underactuated system, with the agent controlling the motor that rotates along the vertical axis. For more information on the Quanser pendulum system, wiring, and command packet structure, see Quanser QUBE™-Servo 2.

For this environment:

  • The observation is the vector sk = [sin θk, cos θk, θ̇k, sin φk, cos φk, φ̇k, uk−1]. Using the sine and cosine of the measured angles can facilitate training by representing the otherwise discontinuous angular measurements with a continuous two-dimensional parameterization.

  • The action is the normalized input voltage command to the servo motor.

  • The reward signal is defined as follows:

r(sk, uk−1, uk−2) = Fk − 0.1(θk² + φk² + 0.01 θ̇k² + 0.01 φ̇k² + uk−1² + 0.3 (uk−1 − uk−2)²)

Fk = 1 if |θk| ≤ 5π/8 rad and |φ̇k| ≤ 30 rad/s, and Fk = 0 otherwise.

The above reward function penalizes six terms:

  • Deviations from the forward position of the motor arm (θk = 0).

  • Deviations from the inverted position of the pendulum (φk = 0).

  • The angular speed of the motor arm, θ̇k.

  • The angular speed of the pendulum, φ̇k.

  • The control action, uk−1.

  • Changes in the control action, (uk−1 − uk−2).

The system constraints enforced through Fk are needed to prevent the motor arm from deviating too far from the center position and to avoid the pendulum potentially hitting the power cord. They also ensure the pendulum does not swing too quickly, because swinging too fast can cause the magnetically coupled pendulum to decouple from the base. The agent is rewarded while the system constraints are satisfied (that is, while Fk = 1). Additionally, the episode terminates early if any of the constraints are violated.
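As a concrete illustration, the reward and constraint logic can be sketched as a MATLAB function. This is a hypothetical helper written only for clarity; in the deployed model these signals are computed inside the Process Data subsystem.

```matlab
function [reward,isDone] = computeReward(theta,thetaDot,phi,phiDot,u1,u2)
% theta, phi: motor arm and pendulum angles; u1 = u(k-1), u2 = u(k-2).

% Constraint indicator Fk: 1 while the arm angle and pendulum speed
% stay within bounds, 0 otherwise.
Fk = double(abs(theta) <= 5*pi/8 && abs(phiDot) <= 30);

% Quadratic penalty on states, control effort, and control changes.
penalty = 0.1*(theta^2 + phi^2 + 0.01*thetaDot^2 + 0.01*phiDot^2 + ...
    u1^2 + 0.3*(u1 - u2)^2);

reward = Fk - penalty;

% Terminate the episode early when a constraint is violated.
isDone = (Fk == 0);
end
```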

Define the observation specification obsInfo and action specification actInfo. These specifications are needed to create the agent. The sample time for the environment and the agent is 0.005 seconds.

obsInfo = rlNumericSpec([7 1]);
obsInfo.Description = "sinTheta, cosTheta, thetaDot, sinPhi, cosPhi, phiDot, uPrevious";

actInfo = rlNumericSpec([1 1]);
actInfo.LowerLimit = -1;
actInfo.UpperLimit = 1;
actInfo.Description = "motorVoltage";

sampleTime = 0.005;

Review Control Process Simulink Model and Define Environment Parameters

The control process code is generated from a Simulink model and deployed on the Raspberry Pi board. This process runs indefinitely, because the stop time of the Simulink model used to generate the process executable is set to Inf.

Open the model.

mdl = 'deployQuanserQubeEnvironment';
open_system(mdl);

This model consists of two main subsystems:

  • The Environment subsystem interacts with the Quanser QUBE Servo-2 hardware.

  • The Experiment Controller subsystem contains the reinforcement learning policy, the experiment mode state switching control module, the policy parameter update module, and the experience saving module.

The Environment subsystem interacts with the Quanser QUBE Servo-2 hardware. Specifically, the RPI_Driver block takes the actuation (the motor voltage), the enable signal for the motor enableMotor, and the signal to reset the encoders as inputs, and returns the next measurement y, consisting of the motor angle (θ) and pendulum angle (φ), as output. It uses the SPI Controller Transfer block to write data to and read data from the SPI peripheral device; see SPI Controller Transfer (Raspberry Pi Blockset) for more information. The Furuta_State_Estimator block uses the next measurement coming from the RPI_Driver block to estimate the states θ, θ̇, φ, and φ̇.

Within the Experiment Controller subsystem:

  • The Mode State Machine subsystem determines the experiment mode states (no-op, reset, and run) based on the measurements and the isdone signals from the last time step. Because the Simulink model runs indefinitely (its stop time is set to Inf), the Mode State Machine is responsible for switching between different experiment states. It does so by enabling the Reset, Run and Null subsystems according to the needed mode.

  • When in the no-op state, the Experiment Controller subsystem generates a zero actuation signal and does not transmit experiences to the learning process running on the MATLAB desktop. The model starts in the no-op state and advances to the reset state after one second.

  • A reset system with appropriate logic and control is an essential part of the experiment process, as it is needed to safely return the system to a valid operating condition for further experiments or runs. In this example, the system is reset to its initial configuration (θ = 0 for the motor arm and φ = π or −π for the pendulum) using a linear-quadratic regulator (LQR), which takes the states (θ, θ̇, φ, φ̇) as inputs and outputs the motor voltage needed to quickly achieve the desired configuration.

  • When the run state is activated, a MATLAB function called by the Get Policy Parameters subsystem reads the latest policy parameters from a parameter file in the Raspberry Pi file system. The policy block inside the Run subsystem is then updated with the new parameters for the next "episode."

  • The Process Data subsystem (under the Run subsystem, and shown in the following figure) computes the observations, reward, and isdone signals given the measurements from the Environment subsystem. For this example, the measurements are θ, θ̇, φ, and φ̇. The seven observations sk = [sin θk, cos θk, θ̇k, sin φk, cos φk, φ̇k, uk−1] are computed from the measurements.

  • The Remote Agent subsystem (under the Run subsystem, and shown in the following figure) maps observations to actions using the updated policy parameters. Additionally, the Experience Writer subsystem collects experiences in a circular buffer and writes the buffer to experience files in the Raspberry Pi file system, with a frequency given by the envData.ExperiencesWriteFrequency parameter.
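The LQR reset law described above amounts to static state feedback. A minimal sketch follows, assuming the state ordering (θ, θ̇, φ, φ̇) and a gain vector such as envData.LQRK defined later in this example; resetController is a hypothetical helper name, and the exact state convention used inside the Reset subsystem is an assumption.

```matlab
function voltage = resetController(theta,thetaDot,phi,phiDot,K)
% Sketch of an LQR reset law: regulate the motor arm to theta = 0 and the
% pendulum to the nearest hanging position (phi = pi or -pi).
% K is a 1-by-4 gain vector such as envData.LQRK. Assumes phi is wrapped
% to (-pi, pi] and nonzero during reset.
phiErr = phi - sign(phi)*pi;  % offset from the nearest hanging position
voltage = -K*[theta; thetaDot; phiErr; phiDot];
end
```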

You can use the command generatePolicyBlock(agent) to generate a Simulink policy block. For more information, see generatePolicyBlock.

Use the local function environmentParameters to define the parameters needed by the environment and the reset controller. This function is defined at the end of this example.

envData = environmentParameters(sampleTime);

Create a structure representing a circular buffer to store experiences. The buffer is saved to a file based on the writing frequency specified by envData.ExperiencesWriteFrequency.

tempExperienceBuffer.NextObservation = ...
    zeros(obsInfo.Dimension(1),envData.ExperiencesWriteFrequency);
tempExperienceBuffer.Observation = ...
    zeros(obsInfo.Dimension(1),envData.ExperiencesWriteFrequency);
tempExperienceBuffer.Action = ...
    zeros(actInfo.Dimension(1),envData.ExperiencesWriteFrequency);
tempExperienceBuffer.Reward = ...
    zeros(1,envData.ExperiencesWriteFrequency);
tempExperienceBuffer.IsDone = ...
    zeros(1,envData.ExperiencesWriteFrequency,"uint8");

You use this structure as a reference when the experience data are converted from bytes back into a structure (deserialization).

agentData.TempExperienceBuffer = tempExperienceBuffer;
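As a sanity check of this (de)serialization path, you can round-trip the reference structure using the same structToBytes and bytesToStruct functions that this example uses later. This round trip is not part of the original workflow; it is shown only to illustrate how the reference structure acts as a deserialization template.

```matlab
% Serialize the reference buffer structure to a uint8 vector, then
% reconstruct it using the structure itself as the template, mirroring
% what the control and learning processes do with experience files.
byteVector = structToBytes(agentData.TempExperienceBuffer);
restored = bytesToStruct(byteVector,agentData.TempExperienceBuffer);
% restored should be identical to the original structure.
isequal(restored,agentData.TempExperienceBuffer)
```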

Create SAC Agent

Fix the random number stream with the seed 0 and random number algorithm Mersenne Twister to reproduce the same initial learnable parameters used in the agent. For more information on controlling the seed used for random number generation, see rng.

previousRngState = rng(0, "twister");

The output previousRngState is a structure that contains information about the previous state of the stream. You restore the state at the end of the example.

Create an agent initialization object to initialize the actor and critic networks with the hidden layer size 64. For more information on agent initialization options, see rlAgentInitializationOptions.

initOptions = rlAgentInitializationOptions("NumHiddenUnit",64);

Create a default SAC agent using the observation specifications, action specifications, and initialization options. For more information, see rlSACAgent.

agent = rlSACAgent(obsInfo,actInfo,initOptions);

Specify the agent options for training. For more information, see rlSACAgentOptions. For this training:

  • Specify the sample time, experience buffer length, and mini-batch size.

  • Set the actor and critic learning rates to 1e-3, and the entropy weight learning rate to 3e-4. A learning rate that is too large can cause drastic updates that lead to divergent behavior, while a learning rate that is too small can require many updates to reach the optimum.

  • Use a gradient threshold of 1 to clip the gradients. Clipping the gradients can improve training stability.

  • Specify the initial entropy component weight and the target entropy value.

agent.SampleTime = sampleTime;
agent.AgentOptions.ExperienceBufferLength = 1e6;
agent.AgentOptions.MiniBatchSize = 256;

agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-3;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
for criticIndex = 1:2
    agent.AgentOptions.CriticOptimizerOptions(criticIndex).LearnRate = 1e-3;
    agent.AgentOptions.CriticOptimizerOptions(criticIndex).GradientThreshold = 1;
end

agent.AgentOptions.EntropyWeightOptions.LearnRate = 3e-4;
agent.AgentOptions.EntropyWeightOptions.EntropyWeight = 0.1;
agent.AgentOptions.EntropyWeightOptions.TargetEntropy = -1;

Define offline training options using an rlTrainingFromDataOptions object. You use trainFromData to train the agent from the collected experience data appended to the experience buffer. Set the number of steps per epoch to 200 and the maximum number of epochs to 1, so that the agent takes 200 learning steps each time new experiences are collected. Because the actual training progress (episode rewards) is obtained from the control process, you do not need to plot the progress of offline training.

trainFromDataOptions = rlTrainingFromDataOptions;
trainFromDataOptions.MaxEpochs = 1;
trainFromDataOptions.NumStepsPerEpoch = 200;
trainFromDataOptions.Plots = "none";

agentData.Agent = agent;
agentData.TrainFromDataOptions = trainFromDataOptions;
agentData.Ts = sampleTime;

Build, Deploy, and Start the Control Process on Raspberry Pi Board

Connect to the Raspberry Pi board using the IP address, username, and password. For this example, you need Raspberry Pi® Blockset. See Get Started with Raspberry Pi Blockset (Raspberry Pi Blockset) for details on setting up your Raspberry Pi hardware.

Replace the information in the code below with the login credentials of your Raspberry Pi board.

ipaddress = "172.31.172.170";
username  = "guestuser";
password  = "guestpasswd";

Create a raspberrypi object and store it in the agentData structure.

agentData.RPi = raspberrypi(ipaddress,username,password)
agentData = struct with fields:
    TempExperienceBuffer: [1×1 struct]
                   Agent: [1×1 rl.agent.rlSACAgent]
    TrainFromDataOptions: [1×1 rl.option.rlTrainingFromDataOptions]
                      Ts: 0.0050
                     RPi: [1×1 raspberrypi]

Create variables that store folder and file information for parameters and experiences.

% Folder for storing experiences on Raspberry Pi
experiencesFolderRPi = "~/pendulumExperiments/experiences/";
% Folder for storing parameters on Raspberry Pi
parametersFolderRPi = "~/pendulumExperiments/parameters/";

% Folder for storing experiences on desktop computer
experiencesFolderDesktop = ...
    fullfile(".","pendulumExperiments","experiences/"); 
% Folder for storing parameters on desktop computer
parametersFolderDesktop = ...
    fullfile(".","pendulumExperiments","parameters/");

parametersFile = "parametersFile.dat";

Use the structure agentData to store information related to the agent.

agentData.ExperiencesFolderRPi = experiencesFolderRPi;
agentData.ExperiencesFolderDesktop = experiencesFolderDesktop;
agentData.ParametersFolderRPi = parametersFolderRPi;
agentData.ParametersFolderDesktop = parametersFolderDesktop;
agentData.ParametersFile = parametersFile;

Create a local folder to save the parameters and experiences if you do not already have one.

if ~exist(fullfile(".","pendulumExperiments/"),'dir')
    % Create the pendulumExperiments folder.
    mkdir(fullfile(".","pendulumExperiments/"));
end

Remove files containing variables that store old experience information if the experiences folder exists.

if exist(experiencesFolderDesktop,'dir')
    % If the experiences folder exists, 
    % remove the folder to delete any old files.
    rmdir(experiencesFolderDesktop,'s');
end

Create the folder to collect the experiences.

mkdir(experiencesFolderDesktop);

Remove files containing variables that store old parameter information if the parameters folder exists.

if exist(parametersFolderDesktop,'dir')
    % If the parameter folder exists, 
    % remove the folder to delete any old files.
    rmdir(parametersFolderDesktop,'s');
end

Create the parameters folder.

mkdir(parametersFolderDesktop);

Check if the folder where experience files are stored exists on the Raspberry Pi. If the folder does not exist, create one. To create the folder, use the supporting function createFolderOnRPi, which is defined at the end of this example. This function uses the system command, with the Raspberry Pi object as first input argument, to send a command to the Raspberry Pi Linux operating system. For more information, see Run Linux Shell Commands on Raspberry Pi Hardware (Raspberry Pi Blockset).

createFolderOnRPi(agentData.RPi, agentData.ExperiencesFolderRPi);

Check if the parameters folder exists on the Raspberry Pi board. If not, create one.

createFolderOnRPi(agentData.RPi, agentData.ParametersFolderRPi);

Read the current state of the experiences folder using listFolderContentsOnRPi. This function is defined at the end of this example.

dirContent = listFolderContentsOnRPi( ...
    agentData.RPi, agentData.ExperiencesFolderRPi);

Check for and remove all the old experience files from the Raspberry Pi board.

if ~isempty(dirContent)
    system(agentData.RPi,convertStringsToChars("rm -f " +...
       agentData.ExperiencesFolderRPi + "*.*"));
end

Use the writePolicyParameters function to save the initial policy parameters to a file on the desktop computer and to send the parameters in the agentData structure to the Raspberry Pi. This function is defined at the end of this example.

agentData.UseExplorationPolicy = true;
writePolicyParameters(agentData);

Store the policy parameter structure in envData. This structure is used as a reference to convert the serialized data to a structure in the Get Policy Parameters subsystem mentioned earlier.

envData.ParametersStruct = policyParameters( ...
    getExplorationPolicy(agentData.Agent));

Use the slbuild (Simulink) command to build and deploy the model on the Raspberry Pi board.

Use evalc to capture the text output from code generation, for possible later inspection.

buildLog = evalc("slbuild(mdl)");

The model is configured only to build and deploy. You can also specify the build directory.

Start the model deployed on the Raspberry Pi Board.

runModel(agentData.RPi,mdl);

You can use the isModelRunning function to determine whether the model is running on the board. isModelRunning returns true if the model is running on the Raspberry Pi board.

isModelRunning(agentData.RPi,mdl)
ans = logical
   1

Run Training Loop

The example executes the training loop, as a part of the learning process, on the desktop computer.

Create variables to monitor training.

trainingResultData.CumulativeRewardTemp = 0;
trainingResultData.CumulativeReward = [];
trainingResultData.AverageCumulativeReward = [];
trainingResultData.AveragingWindowLength = 15;
trainingResultData.NumOfEpisodes = 0;

Stop training when the agent receives an average cumulative reward greater than 800, or after a maximum of 5000 episodes.

rewardStopTrainingValue = 800;
maxEpisodes = 5000;

To train the agent, set doTraining to true.

doTraining = false;

Create a figure for training visualization using the buildRemoteTrainingMonitor local function. This function is defined at the end of this example.

if doTraining
    [trainingPlot,...
     trainingResultData.LineReward,...
     trainingResultData.LineAverageReward] = ...
         buildRemoteTrainingMonitor();
    % Enable the training visualization plot.
    set(trainingPlot,Visible="on");
end

The training repeats the following steps:

  1. Check if the deployed control process Simulink model has generated new experiences. If it has, transfer the experiences from the Raspberry Pi board to the desktop computer.

  2. Read the experiences and append them to the replay buffer using readNewExperiences. The readNewExperiences function is defined at the end of this example. Once the function reads the experiences, they are deleted from the Raspberry Pi board.

  3. Train the agent from the collected data using trainFromData.

  4. Save the updated policy parameters to the Raspberry Pi board using writePolicyParameters. This function is defined at the end of this example.

if doTraining
    while true
        % Look for any experience files.
        dirContent = listFolderContentsOnRPi( ...
            agentData.RPi,agentData.ExperiencesFolderRPi);

        if ~isempty(dirContent)
    
            % Display the new found experience files.
            fprintf('%s  %4d new experience file found\n', ...
                datetime('now'), numel(dirContent));
    
            % Use the readNewExperiences function to read 
            % new experience files and append experiences 
            % to the replay buffer.
            [agentData, trainingResultData] = readNewExperiences( ...
                dirContent,agentData,trainingResultData);
    
            if agentData.Agent.ExperienceBuffer.Length>=...
                agent.AgentOptions.MiniBatchSize
                % Perform a learning step for the agent 
                % using data in the experience buffer. 
                offlineTrainStats = trainFromData( ...
                    agentData.Agent,agentData.TrainFromDataOptions);
            end     

            % Save the updated actor parameters to a file on the 
            % desktop computer and send the file to the Raspberry Pi.
            writePolicyParameters(agentData);
        end
    
        if ~isempty(trainingResultData.AverageCumulativeReward)
            % Check if at least one experience was read to prevent
            % AverageCumulativeReward reward from being empty.
            if trainingResultData.AverageCumulativeReward(end) > ... 
                    rewardStopTrainingValue ...
                    || trainingResultData.NumOfEpisodes>=maxEpisodes
                % Use break to exit the training loop when 
                % the average cumulative reward is greater than 
                % rewardStopTrainingValue or the number of
                % episodes is more than maxEpisodes.
                break
            end
        end
    end
else
    % If you did not train the agent, load the agent from a file.
    load("trainedAgentQuanserQube.mat");
    agent.UseExplorationPolicy = false;
    agentData.Agent = agent;
end

Stop the model deployed to the Raspberry Pi Board.

if isModelRunning(agentData.RPi,mdl)
    stopModel(agentData.RPi,mdl);
end

Evaluate Trained Policy

Evaluate the trained policy on the Raspberry Pi board, using greedy actions. Change the policy to act greedily by setting the policy parameter Policy_UseMaxLikelihoodAction to true.

agentData.UseExplorationPolicy = false;

Send the policy parameters to the Raspberry Pi board.

writePolicyParameters(agentData);

Remove all the old experience files on the board.

system(agentData.RPi,convertStringsToChars("rm -f " +...
   agentData.ExperiencesFolderRPi + "*.*"));

Create variables to monitor evaluation.

evaluationResultData.CumulativeRewardTemp = 0;
evaluationResultData.CumulativeReward = [];
evaluationResultData.AverageCumulativeReward = [];
evaluationResultData.AveragingWindowLength = 10;
evaluationResultData.NumOfEpisodes = 0;
evaluationMaxEpisodes = 5;

Start the model deployed on the Raspberry Pi Board.

if ~isModelRunning(agentData.RPi,mdl)
    runModel(agentData.RPi,mdl);
end

The evaluation repeats the following steps:

  1. Check if the deployed control process model has generated new experiences. If it has, transfer the experiences from the Raspberry Pi board to the desktop computer.

  2. Read the experiences and append them to the replay buffer using readNewExperiences.

while true
    % See if there are any new experiences.
    dirContent = listFolderContentsOnRPi( ...
        agentData.RPi, agentData.ExperiencesFolderRPi);
    
    if ~isempty(dirContent)
        
        fprintf('%s  %4d new experience file found\n', ...
            datetime('now'), numel(dirContent));
        
        % Read new experience files, 
        % and append experiences to the replay buffer.
        [agentData, evaluationResultData] = readNewExperiences( ...
            dirContent,agentData,evaluationResultData);
    end

    if evaluationResultData.NumOfEpisodes>=evaluationMaxEpisodes
        % Use break to exit the evaluation loop when the number of
        % episodes is more than evaluationMaxEpisodes.
        break
    end
end
15-Dec-2025 08:30:59     1 new experience file found
15-Dec-2025 08:31:00     1 new experience file found
15-Dec-2025 08:31:01     1 new experience file found
15-Dec-2025 08:31:02     1 new experience file found
15-Dec-2025 08:31:03     1 new experience file found
15-Dec-2025 08:31:10     1 new experience file found
15-Dec-2025 08:31:11     1 new experience file found
15-Dec-2025 08:31:12     1 new experience file found
15-Dec-2025 08:31:13     1 new experience file found
15-Dec-2025 08:31:14     1 new experience file found
15-Dec-2025 08:31:22     1 new experience file found
15-Dec-2025 08:31:23     1 new experience file found
15-Dec-2025 08:31:24     1 new experience file found
15-Dec-2025 08:31:25     1 new experience file found
15-Dec-2025 08:31:26     1 new experience file found
15-Dec-2025 08:31:34     1 new experience file found
15-Dec-2025 08:31:35     1 new experience file found
15-Dec-2025 08:31:36     1 new experience file found
15-Dec-2025 08:31:37     1 new experience file found
15-Dec-2025 08:31:38     1 new experience file found
15-Dec-2025 08:31:46     1 new experience file found
15-Dec-2025 08:31:47     1 new experience file found
15-Dec-2025 08:31:48     1 new experience file found
15-Dec-2025 08:31:49     1 new experience file found
15-Dec-2025 08:31:50     1 new experience file found

Compute the average of episode rewards.

mean(evaluationResultData.CumulativeReward)
ans = 
664.4980

This value shows that the policy is able to swing up and balance the pendulum in the upright position, as shown in the video below.

Stop the model deployed on the Raspberry Pi Board.

if isModelRunning(agentData.RPi,mdl)
    stopModel(agentData.RPi,mdl);
end

Restore the random number stream using the information stored in previousRngState.

rng(previousRngState);

Local Functions

Write Policy Parameters

The writePolicyParameters function extracts the policy parameters from the agent, writes them to a file, and sends the file to the Raspberry Pi board.

function writePolicyParameters(agentData)

    % Open a file for saving the policy parameters on the desktop computer.
    filepath = fullfile( ...
        agentData.ParametersFolderDesktop, agentData.ParametersFile);
    fid = fopen(filepath,'w+');
    cln = onCleanup(@()fclose(fid));
    
    if agentData.UseExplorationPolicy
        % Get exploration policy parameters.
        parameters = policyParameters(getExplorationPolicy(agentData.Agent));
    else
        % Get greedy policy parameters.
        parameters = policyParameters(getGreedyPolicy(agentData.Agent));
    end

    % Convert the policy parameters into a vector of uint8 values 
    % (serialization) and save it to a file.
    fwrite(fid,structToBytes(parameters),'uint8');
    
    % Send the file from the desktop to the Raspberry Pi board.
    putFile(agentData.RPi,...
            convertStringsToChars(fullfile( ...
                agentData.ParametersFolderDesktop, ...
                agentData.ParametersFile)), ...
            convertStringsToChars( ...
                agentData.ParametersFolderRPi+agentData.ParametersFile));    
end

Read New Experiences

The readNewExperiences function moves new experience files from the Raspberry Pi board to the desktop computer, reads the experiences, and appends them to the replay buffer.

function [agentData, trainingResultData] = ...
    readNewExperiences(dirContent,agentData, trainingResultData)
       
    for ii=1:length(dirContent)

        % Move file from Raspberry Pi board to the desktop computer.
        getFile(agentData.RPi, ...
            convertStringsToChars( ...
                agentData.ExperiencesFolderRPi+dirContent(ii)), ...
            convertStringsToChars(agentData.ExperiencesFolderDesktop));

        % Remove the moved file from Raspberry Pi board.
        system(agentData.RPi,convertStringsToChars("rm -f " + ...
            agentData.ExperiencesFolderRPi+dirContent(ii)));

        % Open the moved file on desktop for reading the data.
        fid = fopen(fullfile( ...
            agentData.ExperiencesFolderDesktop,dirContent(ii)),'r');
        cln = onCleanup(@()fclose(fid));

        % Read data from the file.
        experiencesBytes = fread(fid,'*uint8');

        % Convert the uint8s to experience structure.
        tempExperienceBuffer = bytesToStruct( ...
            experiencesBytes,agentData.TempExperienceBuffer);

        % Compute the cumulative reward and update the plots.
        trainingResultData = computeCumulativeReward( ...
            trainingResultData,tempExperienceBuffer);

        % Get the experiences ready to append to the replay buffer.
        for kk=size(tempExperienceBuffer.Observation,2):-1:1
            experiences(kk).NextObservation = ...
                {tempExperienceBuffer.NextObservation(:,kk)};
            experiences(kk).Observation = ...
                {tempExperienceBuffer.Observation(:,kk)};
            experiences(kk).Action = {tempExperienceBuffer.Action(:,kk)};
            experiences(kk).Reward = tempExperienceBuffer.Reward(kk);
            experiences(kk).IsDone = tempExperienceBuffer.IsDone(kk);
        end

        % Append the new experiences to the replay buffer.
        append(agentData.Agent.ExperienceBuffer,experiences);
    end
end

Compute Cumulative Reward from Experiences

The computeCumulativeReward function computes the cumulative reward and updates the learning plot.

function trainingResultData = computeCumulativeReward( ...
    trainingResultData,tempExperienceBuffer) 

    isDone = tempExperienceBuffer.IsDone;
    reward = tempExperienceBuffer.Reward;
    N = length(isDone);
    while N > 0
        if any(isDone) % check whether any episode terminated in this batch
            endOfEpIdx = find(isDone,1);
            trainingResultData.CumulativeRewardTemp = ...
                trainingResultData.CumulativeRewardTemp ...
                + sum(reward(1:endOfEpIdx));

            % Append the new episode cumulative reward.
            trainingResultData.CumulativeReward = ...
                [trainingResultData.CumulativeReward; ...
                 trainingResultData.CumulativeRewardTemp];

            % Temporary variable used to compute the cumulative reward
            trainingResultData.CumulativeRewardTemp = 0;
            
            % Update the plots.
            trainingResultData.NumOfEpisodes = ...
                trainingResultData.NumOfEpisodes+1;
            trainingResultData.AverageCumulativeReward = ...
                movmean(trainingResultData.CumulativeReward,...
                trainingResultData.AveragingWindowLength,1);

            % Update the monitor.
            if isfield(trainingResultData, "LineReward") && ...
                     ~isempty(trainingResultData.LineReward)

                addpoints( ...
                    trainingResultData.LineReward, ...
                    trainingResultData.NumOfEpisodes,...
                    trainingResultData.CumulativeReward(end));

                addpoints( ...
                    trainingResultData.LineAverageReward, ...
                    trainingResultData.NumOfEpisodes,...
                    trainingResultData.AverageCumulativeReward(end));
            end
            drawnow;

            % Truncate the trajectory.
            isDone(1:endOfEpIdx) = [];
            reward(1:endOfEpIdx) = [];
            N = length(isDone);

        else
            % If there are no more termination flags, 
            % compute the cumulative reward and store it 
            % in the temporary variable.
            trainingResultData.CumulativeRewardTemp = ...
                trainingResultData.CumulativeRewardTemp ...
                + sum(reward(1:N));
            N = 0;
        end
    end
end

Create Figure for Training Visualization

The buildRemoteTrainingMonitor function creates a figure for training visualization.

function [trainingPlot, lineReward, lineAverageReward] = ...
    buildRemoteTrainingMonitor()
    
    plotRatio = 16/9;
    trainingPlot = figure( ...
                Visible="off", ...
                HandleVisibility="off", ...
                NumberTitle="off", ...
                Name="Reinforcement Learning with Hardware");

    trainingPlot.Position(3) = ...
         plotRatio * trainingPlot.Position(4);
    
    ax = gca(trainingPlot);
    
    lineReward = animatedline(ax);
    lineAverageReward = animatedline(ax,Color="r",LineWidth=3);
    xlabel(ax,"Episode");
    ylabel(ax,"Reward");
    legend(ax,"Cumulative Reward","Average Reward", ...
           Location="northwest")
    title(ax,"Training Progress");
end

Assign Parameter Values Related to Environment

The environmentParameters function assigns various parameter values needed for the environment and the reset controller.

function envData = environmentParameters(sampleTime)

    envData.MaxStepsPerSim = uint32(1000);
    envData.Ts = sampleTime;  % sec
    envData.ExperiencesWriteFrequency = 200;

    % For the reset controller: using LQR
    envData.LQRK = [1.3831 0.39192 -1.085 -0.082812];

end

Create Folder on Raspberry Pi Board

The createFolderOnRPi function creates a folder on the Raspberry Pi board if it does not already exist.

function createFolderOnRPi(RPi, FolderRPi)
    
    system(RPi,convertStringsToChars("if [ ! -d " +...
        FolderRPi + " ]; then mkdir -p " +...
        FolderRPi + "; fi"));
end

List Contents of Folder on Raspberry Pi Board

The listFolderContentsOnRPi function lists contents of a folder on the Raspberry Pi board.

function dirContent = listFolderContentsOnRPi(RPi, experiencesFolderRPi)

    dirTemp = system(RPi,convertStringsToChars("ls " +...
        experiencesFolderRPi));
    dirContent = strsplit(dirTemp);
    dirContent(end) = [];
end
