Deploy Image Recognition Network on FPGA with and Without Pruning
This example shows you how to deploy an image recognition network with and without convolutional filter pruning. Filter pruning is a compression technique that uses a criterion, such as Taylor scores, to identify and remove the least important convolutional filters from a network. Pruning reduces the overall memory footprint of the network without significantly reducing the network accuracy.
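As a simplified illustration of the idea, the following sketch ranks the filters of a hypothetical convolutional layer by the L1 norm of their weights, one common importance criterion. This is not the Taylor-score method used to produce the pruned network in this example; the weight array and the number of filters kept are placeholders.

```matlab
% Hypothetical 3-by-3 convolutional layer: 8 input channels, 16 filters.
% W has size filterHeight-by-filterWidth-by-numChannels-by-numFilters.
W = randn(3,3,8,16);

% Score each filter by the L1 norm of its weights.
scores = squeeze(sum(abs(W),[1 2 3]));

% Sort the scores in ascending order. The first entries are the least
% important filters, which are the candidates that pruning would remove.
[~,order] = sort(scores);
leastImportantFilters = order(1:4)
```

Criterion-based methods such as Taylor scoring follow the same pattern but estimate each filter's effect on the loss instead of using the weight magnitudes alone.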
Load Unpruned Network
Load the unpruned trained network. For information on network training, see Train Residual Network for Image Classification.
load("trainedYOLONet.mat");
Test Network
Load a test image. The test image is a part of the CIFAR-10 data set[1]. To download the data set, see the Prepare Data section in Train Residual Network for Image Classification.
load("testImage.mat");
Use the runOnHW function to:
Prepare the network for deployment.
Compile the network to generate weights, biases, and instructions.
Deploy the network to the FPGA board.
Retrieve the prediction results using MATLAB®.
To view the code for this function, see Helper Functions.
testImage = dlarray(testImage,'SSCB');
[~, speedInitial] = runOnHW(trainedNet,testImage,'zcu102_single');
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_single.
### An output layer called 'Output1_softmax' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'input' of type 'ImageInputLayer' is split into an image input layer 'input' and an addition layer 'input_norm' for normalization on hardware.
### The network includes the following layers:
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'Output1_softmax' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
### Compiling layer group: convInp>>reluInp ...
### Compiling layer group: convInp>>reluInp ... complete.
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ...
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ... complete.
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ...
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ... complete.
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ...
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ... complete.
### Compiling layer group: skipConv1 ...
### Compiling layer group: skipConv1 ... complete.
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ...
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ... complete.
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ...
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ... complete.
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ...
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ... complete.
### Compiling layer group: skipConv2 ...
### Compiling layer group: skipConv2 ... complete.
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ...
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ... complete.
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ...
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ... complete.
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ...
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ... complete.
### Compiling layer group: globalPool ...
### Compiling layer group: globalPool ... complete.
### Compiling layer group: fcFinal ...
### Compiling layer group: fcFinal ... complete.
### Allocating external memory buffers:

          offset_name          offset_address     allocated_space
    _______________________    ______________    __________________

    "InputDataOffset"           "0x00000000"     "480.0 kB"
    "OutputResultOffset"        "0x00078000"     "4.0 kB"
    "SchedulerDataOffset"       "0x00079000"     "300.0 kB"
    "SystemBufferOffset"        "0x000c4000"     "148.0 kB"
    "InstructionDataOffset"     "0x000e9000"     "312.0 kB"
    "ConvWeightDataOffset"      "0x00137000"     "1.1 MB"
    "FCWeightDataOffset"        "0x0025b000"     "4.0 kB"
    "EndOffset"                 "0x0025c000"     "Total: 2416.0 kB"

### Network compilation complete.
### Programming FPGA Bitstream using Ethernet...
### Attempting to connect to the hardware board at 172.21.88.150...
### Connection successful
### Programming FPGA device on Xilinx SoC hardware board at 172.21.88.150...
### Attempting to connect to the hardware board at 172.21.88.150...
### Connection successful
### Copying FPGA programming files to SD card...
### Setting FPGA bitstream and devicetree for boot...
# Copying Bitstream zcu102_single.bit to /mnt/hdlcoder_rd
# Set Bitstream to hdlcoder_rd/zcu102_single.bit
# Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd
# Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'
### Programming done. The system will now reboot for persistent changes to take effect.
### Rebooting Xilinx SoC at 172.21.88.150...
### Reboot may take several seconds...
### Attempting to connect to the hardware board at 172.21.88.150...
### Connection successful
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 29-Aug-2024 09:47:18
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 29-Aug-2024 09:47:18
### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)   FramesNum   Total Latency   Frames/s
                         -------------             -------------          ---------     ---------     ---------
Network                     820920                  0.00373                    1          823612         267.1
    input_norm                7285                  0.00003
    convInp                  14204                  0.00006
    S1U1_conv1               32243                  0.00015
    S1U1_conv2               32294                  0.00015
    add11                    30573                  0.00014
    S1U2_conv1               32564                  0.00015
    S1U2_conv2               32300                  0.00015
    add12                    30453                  0.00014
    S1U3_conv1               32113                  0.00015
    S1U3_conv2               32258                  0.00015
    add13                    30583                  0.00014
    skipConv1                20693                  0.00009
    S2U1_conv1               21316                  0.00010
    S2U1_conv2               26378                  0.00012
    add21                    15413                  0.00007
    S2U2_conv1               26655                  0.00012
    S2U2_conv2               26573                  0.00012
    add22                    15333                  0.00007
    S2U3_conv1               26371                  0.00012
    S2U3_conv2               26744                  0.00012
    add23                    15323                  0.00007
    skipConv2                25101                  0.00011
    S3U1_conv1               25062                  0.00011
    S3U1_conv2               41716                  0.00019
    add31                     7724                  0.00004
    S3U2_conv1               41621                  0.00019
    S3U2_conv2               41630                  0.00019
    add32                     7842                  0.00004
    S3U3_conv1               41307                  0.00019
    S3U3_conv2               42067                  0.00019
    add33                     7694                  0.00003
    globalPool               10349                  0.00005
    fcFinal                    951                  0.00000
 * The clock frequency of the DL processor is: 220MHz
Load Pruned Network
Load the trained, pruned network. For more information on network training, see Prune Image Classification Network Using Taylor Scores.
load("prunedNet.mat");
Test Network
Load a test image. The test image is a part of the CIFAR-10 data set[1]. To download the data set, see the Prepare Data section in Train Residual Network for Image Classification.
load("testImage.mat");
Use the runOnHW function to:
Prepare the network for deployment.
Compile the network to generate weights, biases, and instructions.
Deploy the network to the FPGA board.
Retrieve the prediction results using MATLAB®.
To view the code for this function, see Helper Functions.
testImage = dlarray(testImage,'SSCB');
[~, speedPruned] = runOnHW(prunedNet,testImage,'zcu102_single');
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_single.
### An output layer called 'Output1_softmax' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'input' of type 'ImageInputLayer' is split into an image input layer 'input' and an addition layer 'input_norm' for normalization on hardware.
### The network includes the following layers:
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'Output1_softmax' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
### Compiling layer group: convInp>>reluInp ...
### Compiling layer group: convInp>>reluInp ... complete.
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ...
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ... complete.
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ...
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ... complete.
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ...
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ... complete.
### Compiling layer group: skipConv1 ...
### Compiling layer group: skipConv1 ... complete.
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ...
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ... complete.
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ...
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ... complete.
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ...
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ... complete.
### Compiling layer group: skipConv2 ...
### Compiling layer group: skipConv2 ... complete.
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ...
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ... complete.
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ...
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ... complete.
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ...
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ... complete.
### Compiling layer group: globalPool ...
### Compiling layer group: globalPool ... complete.
### Compiling layer group: fcFinal ...
### Compiling layer group: fcFinal ... complete.
### Allocating external memory buffers:

          offset_name          offset_address     allocated_space
    _______________________    ______________    __________________

    "InputDataOffset"           "0x00000000"     "480.0 kB"
    "OutputResultOffset"        "0x00078000"     "4.0 kB"
    "SchedulerDataOffset"       "0x00079000"     "300.0 kB"
    "SystemBufferOffset"        "0x000c4000"     "148.0 kB"
    "InstructionDataOffset"     "0x000e9000"     "252.0 kB"
    "ConvWeightDataOffset"      "0x00128000"     "576.0 kB"
    "FCWeightDataOffset"        "0x001b8000"     "4.0 kB"
    "EndOffset"                 "0x001b9000"     "Total: 1764.0 kB"

### Network compilation complete.
### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 29-Aug-2024 09:48:23
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 29-Aug-2024 09:48:23
### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)   FramesNum   Total Latency   Frames/s
                         -------------             -------------          ---------     ---------     ---------
Network                     587699                  0.00267                    1          590366         372.7
    input_norm                7285                  0.00003
    convInp                  14043                  0.00006
    S1U1_conv1               20169                  0.00009
    S1U1_conv2               20366                  0.00009
    add11                    30713                  0.00014
    S1U2_conv1               20457                  0.00009
    S1U2_conv2               20071                  0.00009
    add12                    30580                  0.00014
    S1U3_conv1               32397                  0.00015
    S1U3_conv2               32007                  0.00015
    add13                    30573                  0.00014
    skipConv1                19206                  0.00009
    S2U1_conv1               17927                  0.00008
    S2U1_conv2               18705                  0.00009
    add21                    13432                  0.00006
    S2U2_conv1               23916                  0.00011
    S2U2_conv2               23985                  0.00011
    add22                    13352                  0.00006
    S2U3_conv1               21375                  0.00010
    S2U3_conv2               21686                  0.00010
    add23                    13400                  0.00006
    skipConv2                15211                  0.00007
    S3U1_conv1               16170                  0.00007
    S3U1_conv2               18288                  0.00008
    add31                     4810                  0.00002
    S3U2_conv1               17963                  0.00008
    S3U2_conv2               18158                  0.00008
    add32                     4830                  0.00002
    S3U3_conv1               16679                  0.00008
    S3U3_conv2               17484                  0.00008
    add33                     4830                  0.00002
    globalPool                6601                  0.00003
    fcFinal                    843                  0.00000
 * The clock frequency of the DL processor is: 220MHz
Quantize Pruned Network
You can quantize the pruned network to obtain an improved performance.
Create an augmentedImageDatastore object to store the calibration images.
imds = augmentedImageDatastore([32,32],testImage);
Create a dlquantizer object.
dlqObj = dlquantizer(prunedNet, ExecutionEnvironment="FPGA");
Calibrate the dlquantizer object using the images in the datastore.
calibrate(dlqObj,imds)
ans=100×5 table
Optimized Layer Name Network Layer Name Learnables / Activations MinValue MaxValue
____________________ __________________ ________________________ __________ _________
"input_Mean" {'input' } "Mean" 113.87 125.31
"convInp_Weights" {'convInp' } "Weights" -0.0060522 0.0076182
"convInp_Bias" {'convInp' } "Bias" -0.23065 0.79941
"S1U1_conv1_Weights" {'S1U1_conv1'} "Weights" -0.36637 0.37601
"S1U1_conv1_Bias" {'S1U1_conv1'} "Bias" 0.076761 0.79494
"S1U1_conv2_Weights" {'S1U1_conv2'} "Weights" -0.8197 0.54487
"S1U1_conv2_Bias" {'S1U1_conv2'} "Bias" -0.27783 0.85751
"S1U2_conv1_Weights" {'S1U2_conv1'} "Weights" -0.29579 0.27284
"S1U2_conv1_Bias" {'S1U2_conv1'} "Bias" -0.55448 0.85351
"S1U2_conv2_Weights" {'S1U2_conv2'} "Weights" -0.78735 0.52628
"S1U2_conv2_Bias" {'S1U2_conv2'} "Bias" -0.50762 0.56423
"S1U3_conv1_Weights" {'S1U3_conv1'} "Weights" -0.18651 0.12745
"S1U3_conv1_Bias" {'S1U3_conv1'} "Bias" -0.33809 0.73826
"S1U3_conv2_Weights" {'S1U3_conv2'} "Weights" -0.49925 0.55922
"S1U3_conv2_Bias" {'S1U3_conv2'} "Bias" -0.42145 0.64184
"S2U1_conv1_Weights" {'S2U1_conv1'} "Weights" -0.1328 0.121
⋮
Use the runOnHW function to:
Prepare the network for deployment.
Compile the network to generate weights, biases, and instructions.
Deploy the network to the FPGA board.
Retrieve the prediction results using MATLAB®.
To view the code for this function, see Helper Functions.
testImage = dlarray(testImage,'SSCB');
[~, speedQuantized] = runOnHW(dlqObj,testImage,'zcu102_int8');
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_int8.
### An output layer called 'Output1_softmax' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### The network includes the following layers:
### Notice: The layer 'input' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'Output1_softmax' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
### Compiling layer group: convInp>>reluInp ...
### Compiling layer group: convInp>>reluInp ... complete.
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ...
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ... complete.
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ...
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ... complete.
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ...
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ... complete.
### Compiling layer group: skipConv1 ...
### Compiling layer group: skipConv1 ... complete.
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ...
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ... complete.
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ...
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ... complete.
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ...
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ... complete.
### Compiling layer group: skipConv2 ...
### Compiling layer group: skipConv2 ... complete.
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ...
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ... complete.
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ...
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ... complete.
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ...
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ... complete.
### Compiling layer group: globalPool ...
### Compiling layer group: globalPool ... complete.
### Compiling layer group: fcFinal ...
### Compiling layer group: fcFinal ... complete.
### Allocating external memory buffers:

          offset_name          offset_address     allocated_space
    _______________________    ______________    _________________

    "InputDataOffset"           "0x00000000"     "240.0 kB"
    "OutputResultOffset"        "0x0003c000"     "4.0 kB"
    "SchedulerDataOffset"       "0x0003d000"     "208.0 kB"
    "SystemBufferOffset"        "0x00071000"     "52.0 kB"
    "InstructionDataOffset"     "0x0007e000"     "140.0 kB"
    "ConvWeightDataOffset"      "0x000a1000"     "172.0 kB"
    "FCWeightDataOffset"        "0x000cc000"     "4.0 kB"
    "EndOffset"                 "0x000cd000"     "Total: 820.0 kB"

### Network compilation complete.
### Programming FPGA Bitstream using Ethernet...
### Attempting to connect to the hardware board at 172.21.88.150...
### Connection successful
### Programming FPGA device on Xilinx SoC hardware board at 172.21.88.150...
### Attempting to connect to the hardware board at 172.21.88.150...
### Connection successful
### Copying FPGA programming files to SD card...
### Setting FPGA bitstream and devicetree for boot...
# Copying Bitstream zcu102_int8.bit to /mnt/hdlcoder_rd
# Set Bitstream to hdlcoder_rd/zcu102_int8.bit
# Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd
# Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'
### Programming done. The system will now reboot for persistent changes to take effect.
### Rebooting Xilinx SoC at 172.21.88.150...
### Reboot may take several seconds...
### Attempting to connect to the hardware board at 172.21.88.150...
### Connection successful
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 29-Aug-2024 09:51:25
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 29-Aug-2024 09:51:25
### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)   FramesNum   Total Latency   Frames/s
                         -------------             -------------          ---------     ---------     ---------
Network                     211348                  0.00085                    1          213980        1168.3
    convInp                   7538                  0.00003
    S1U1_conv1                7055                  0.00003
    S1U1_conv2                7329                  0.00003
    add11                     9155                  0.00004
    S1U2_conv1                7463                  0.00003
    S1U2_conv2                7452                  0.00003
    add12                     8925                  0.00004
    S1U3_conv1               11014                  0.00004
    S1U3_conv2               11194                  0.00004
    add13                     9025                  0.00004
    skipConv1                 7232                  0.00003
    S2U1_conv1                6655                  0.00003
    S2U1_conv2                7282                  0.00003
    add21                     4564                  0.00002
    S2U2_conv1                8910                  0.00004
    S2U2_conv2                9301                  0.00004
    add22                     4684                  0.00002
    S2U3_conv1                8767                  0.00004
    S2U3_conv2                9244                  0.00004
    add23                     4634                  0.00002
    skipConv2                 6456                  0.00003
    S3U1_conv1                6370                  0.00003
    S3U1_conv2                6768                  0.00003
    add31                     1420                  0.00001
    S3U2_conv1                6123                  0.00002
    S3U2_conv2                6567                  0.00003
    add32                     1400                  0.00001
    S3U3_conv1                6253                  0.00003
    S3U3_conv2                6818                  0.00003
    add33                     1450                  0.00001
    globalPool                3399                  0.00001
    fcFinal                    714                  0.00000
 * The clock frequency of the DL processor is: 250MHz
Compare the Original, Pruned, and Pruned and Quantized Network Performance
Determine the impact of pruning and quantization on network performance. Pruning alone improves the throughput from 267 frames per second to 373 frames per second. Pruning and quantizing the network further improves the throughput from 373 frames per second to 1168 frames per second.
fprintf('The performance achieved for the original network is %s frames per second. \n', speedInitial.("Frame/s")(1));
The performance achieved for the original network is 267.1161 frames per second.
fprintf('The performance achieved after pruning is %s frames per second. \n', speedPruned.("Frame/s")(1));
The performance achieved after pruning is 372.6502 frames per second.
fprintf('The performance achieved after pruning and quantizing the network to int8 fixed point is %s frames per second. \n', speedQuantized.("Frame/s")(1));
The performance achieved after pruning and quantizing the network to int8 fixed point is 1168.3335 frames per second.
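You can also report the relative speedups directly. This short sketch assumes, as in the fprintf calls above, that each profiler results table returned by runOnHW stores the throughput in a "Frame/s" variable:

```matlab
% Extract the throughput (frames per second) from each profiler results table.
fpsOriginal  = speedInitial.("Frame/s")(1);
fpsPruned    = speedPruned.("Frame/s")(1);
fpsQuantized = speedQuantized.("Frame/s")(1);

% Compute the speedup of each compressed network relative to the original.
fprintf('Pruning speedup: %.2fx \n', fpsPruned/fpsOriginal);
fprintf('Pruning + int8 quantization speedup: %.2fx \n', fpsQuantized/fpsOriginal);
```

With the throughput values shown above, pruning yields roughly a 1.4x speedup and pruning plus quantization roughly a 4.4x speedup.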
References
[1] Krizhevsky, Alex. 2009. "Learning Multiple Layers of Features from Tiny Images." https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
Helper Functions
The runOnHW function prepares the network for deployment, compiles the network, deploys the network to the FPGA board, and retrieves the prediction results.
function [result, speed] = runOnHW(network, image, bitstream)
    % Create a workflow object for the network and target bitstream.
    wfObj = dlhdl.Workflow(Network=network,Bitstream=bitstream);
    % Target a Xilinx board over an Ethernet connection.
    wfObj.Target = dlhdl.Target("Xilinx", Interface="Ethernet");
    % Compile the network, deploy it to the FPGA board, and run a
    % profiled prediction to retrieve the performance results.
    compile(wfObj);
    deploy(wfObj);
    [result,speed] = predict(wfObj,image, Profiler='on');
end
See Also
dlhdl.Target
| dlhdl.Workflow
| compile
| deploy
| predict
| dlquantizer
| calibrate