Deploy Image Recognition Network on FPGA with and Without Pruning

This example shows you how to deploy an image recognition network with and without convolutional filter pruning. Filter pruning is a compression technique that uses a ranking criterion to identify and remove the least important convolutional filters in a network, reducing the overall memory footprint of the network without significantly reducing its accuracy.
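To illustrate the idea behind filter pruning, the sketch below ranks the output filters of a convolution layer by L1-norm magnitude and keeps the strongest ones. This is a conceptual sketch only: the pruned network used later in this example was produced with Taylor scores, and the weight array here is random, not taken from the trained network.

```matlab
% Conceptual sketch of magnitude-based filter pruning (illustrative
% weights, not the example's trained network).
W = randn(3,3,16,32);                        % [h w cIn cOut] conv weights
filterNorms = squeeze(sum(abs(W),[1 2 3]));  % one L1 norm per output filter
pruneRatio = 0.25;                           % remove 25% of the filters
numKeep = round((1-pruneRatio)*size(W,4));
[~,order] = sort(filterNorms,"descend");
keepIdx = sort(order(1:numKeep));            % indices of filters to keep
Wpruned = W(:,:,:,keepIdx);                  % pruned weight tensor, 24 filters
```

Removing output filters from one layer also shrinks the input channel dimension of the next layer, which is why pruning reduces both memory footprint and compute.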

Load Unpruned Network

Load the unpruned trained network. For information on network training, see Train Residual Network for Image Classification.

load("trainedYOLONet.mat");

Test Network

Load a test image. The test image is a part of the CIFAR-10 data set[1]. To download the data set, see the Prepare Data section in Train Residual Network for Image Classification.

load("testImage.mat");

Use the runOnHW function to:

  • Prepare the network for deployment.

  • Compile the network to generate weights, biases, and instructions.

  • Deploy the network to the FPGA board.

  • Retrieve the prediction results using MATLAB®.

To view the code for this function, see Helper Functions.

testImage = dlarray(testImage,'SSCB');
[~, speedInitial] = runOnHW(trainedNet,testImage,'zcu102_single');
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_single.
### An output layer called 'Output1_softmax' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'input' of type 'ImageInputLayer' is split into an image input layer 'input' and an addition layer 'input_norm' for normalization on hardware.
### The network includes the following layers:

### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'Output1_softmax' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
### Compiling layer group: convInp>>reluInp ...
### Compiling layer group: convInp>>reluInp ... complete.
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ...
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ... complete.
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ...
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ... complete.
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ...
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ... complete.
### Compiling layer group: skipConv1 ...
### Compiling layer group: skipConv1 ... complete.
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ...
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ... complete.
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ...
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ... complete.
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ...
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ... complete.
### Compiling layer group: skipConv2 ...
### Compiling layer group: skipConv2 ... complete.
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ...
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ... complete.
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ...
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ... complete.
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ...
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ... complete.
### Compiling layer group: globalPool ...
### Compiling layer group: globalPool ... complete.
### Compiling layer group: fcFinal ...
### Compiling layer group: fcFinal ... complete.

### Allocating external memory buffers:

          offset_name          offset_address     allocated_space  
    _______________________    ______________    __________________

    "InputDataOffset"           "0x00000000"     "480.0 kB"        
    "OutputResultOffset"        "0x00078000"     "4.0 kB"          
    "SchedulerDataOffset"       "0x00079000"     "300.0 kB"        
    "SystemBufferOffset"        "0x000c4000"     "148.0 kB"        
    "InstructionDataOffset"     "0x000e9000"     "312.0 kB"        
    "ConvWeightDataOffset"      "0x00137000"     "1.1 MB"          
    "FCWeightDataOffset"        "0x0025b000"     "4.0 kB"          
    "EndOffset"                 "0x0025c000"     "Total: 2416.0 kB"

### Network compilation complete.

### Programming FPGA Bitstream using Ethernet...
### Attempting to connect to the hardware board at 172.21.88.150...
### Connection successful
### Programming FPGA device on Xilinx SoC hardware board at 172.21.88.150...
### Attempting to connect to the hardware board at 172.21.88.150...
### Connection successful
### Copying FPGA programming files to SD card...
### Setting FPGA bitstream and devicetree for boot...
# Copying Bitstream zcu102_single.bit to /mnt/hdlcoder_rd
# Set Bitstream to hdlcoder_rd/zcu102_single.bit
# Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd
# Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'
### Programming done. The system will now reboot for persistent changes to take effect.
### Rebooting Xilinx SoC at 172.21.88.150...
### Reboot may take several seconds...
### Attempting to connect to the hardware board at 172.21.88.150...
### Connection successful
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 29-Aug-2024 09:47:18
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 29-Aug-2024 09:47:18
### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                     820920                  0.00373                       1             823612            267.1
    input_norm                7285                  0.00003 
    convInp                  14204                  0.00006 
    S1U1_conv1               32243                  0.00015 
    S1U1_conv2               32294                  0.00015 
    add11                    30573                  0.00014 
    S1U2_conv1               32564                  0.00015 
    S1U2_conv2               32300                  0.00015 
    add12                    30453                  0.00014 
    S1U3_conv1               32113                  0.00015 
    S1U3_conv2               32258                  0.00015 
    add13                    30583                  0.00014 
    skipConv1                20693                  0.00009 
    S2U1_conv1               21316                  0.00010 
    S2U1_conv2               26378                  0.00012 
    add21                    15413                  0.00007 
    S2U2_conv1               26655                  0.00012 
    S2U2_conv2               26573                  0.00012 
    add22                    15333                  0.00007 
    S2U3_conv1               26371                  0.00012 
    S2U3_conv2               26744                  0.00012 
    add23                    15323                  0.00007 
    skipConv2                25101                  0.00011 
    S3U1_conv1               25062                  0.00011 
    S3U1_conv2               41716                  0.00019 
    add31                     7724                  0.00004 
    S3U2_conv1               41621                  0.00019 
    S3U2_conv2               41630                  0.00019 
    add32                     7842                  0.00004 
    S3U3_conv1               41307                  0.00019 
    S3U3_conv2               42067                  0.00019 
    add33                     7694                  0.00003 
    globalPool               10349                  0.00005 
    fcFinal                    951                  0.00000 
 * The clock frequency of the DL processor is: 220MHz

Load Pruned Network

Load the trained, pruned network. For more information on network training, see Prune Image Classification Network Using Taylor Scores.

load("prunedNet.mat");

Test Network

Load a test image. The test image is a part of the CIFAR-10 data set[1]. To download the data set, see the Prepare Data section in Train Residual Network for Image Classification.

load("testImage.mat");

Use the runOnHW function to:

  • Prepare the network for deployment.

  • Compile the network to generate weights, biases, and instructions.

  • Deploy the network to the FPGA board.

  • Retrieve the prediction results using MATLAB®.

To view the code for this function, see Helper Functions.

testImage = dlarray(testImage,'SSCB');
[~, speedPruned] = runOnHW(prunedNet,testImage,'zcu102_single');
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_single.
### An output layer called 'Output1_softmax' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'input' of type 'ImageInputLayer' is split into an image input layer 'input' and an addition layer 'input_norm' for normalization on hardware.
### The network includes the following layers:

### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'Output1_softmax' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
### Compiling layer group: convInp>>reluInp ...
### Compiling layer group: convInp>>reluInp ... complete.
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ...
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ... complete.
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ...
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ... complete.
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ...
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ... complete.
### Compiling layer group: skipConv1 ...
### Compiling layer group: skipConv1 ... complete.
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ...
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ... complete.
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ...
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ... complete.
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ...
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ... complete.
### Compiling layer group: skipConv2 ...
### Compiling layer group: skipConv2 ... complete.
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ...
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ... complete.
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ...
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ... complete.
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ...
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ... complete.
### Compiling layer group: globalPool ...
### Compiling layer group: globalPool ... complete.
### Compiling layer group: fcFinal ...
### Compiling layer group: fcFinal ... complete.

### Allocating external memory buffers:

          offset_name          offset_address     allocated_space  
    _______________________    ______________    __________________

    "InputDataOffset"           "0x00000000"     "480.0 kB"        
    "OutputResultOffset"        "0x00078000"     "4.0 kB"          
    "SchedulerDataOffset"       "0x00079000"     "300.0 kB"        
    "SystemBufferOffset"        "0x000c4000"     "148.0 kB"        
    "InstructionDataOffset"     "0x000e9000"     "252.0 kB"        
    "ConvWeightDataOffset"      "0x00128000"     "576.0 kB"        
    "FCWeightDataOffset"        "0x001b8000"     "4.0 kB"          
    "EndOffset"                 "0x001b9000"     "Total: 1764.0 kB"

### Network compilation complete.

### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 29-Aug-2024 09:48:23
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 29-Aug-2024 09:48:23
### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                     587699                  0.00267                       1             590366            372.7
    input_norm                7285                  0.00003 
    convInp                  14043                  0.00006 
    S1U1_conv1               20169                  0.00009 
    S1U1_conv2               20366                  0.00009 
    add11                    30713                  0.00014 
    S1U2_conv1               20457                  0.00009 
    S1U2_conv2               20071                  0.00009 
    add12                    30580                  0.00014 
    S1U3_conv1               32397                  0.00015 
    S1U3_conv2               32007                  0.00015 
    add13                    30573                  0.00014 
    skipConv1                19206                  0.00009 
    S2U1_conv1               17927                  0.00008 
    S2U1_conv2               18705                  0.00009 
    add21                    13432                  0.00006 
    S2U2_conv1               23916                  0.00011 
    S2U2_conv2               23985                  0.00011 
    add22                    13352                  0.00006 
    S2U3_conv1               21375                  0.00010 
    S2U3_conv2               21686                  0.00010 
    add23                    13400                  0.00006 
    skipConv2                15211                  0.00007 
    S3U1_conv1               16170                  0.00007 
    S3U1_conv2               18288                  0.00008 
    add31                     4810                  0.00002 
    S3U2_conv1               17963                  0.00008 
    S3U2_conv2               18158                  0.00008 
    add32                     4830                  0.00002 
    S3U3_conv1               16679                  0.00008 
    S3U3_conv2               17484                  0.00008 
    add33                     4830                  0.00002 
    globalPool                6601                  0.00003 
    fcFinal                    843                  0.00000 
 * The clock frequency of the DL processor is: 220MHz

Quantize Pruned Network

You can quantize the pruned network to further improve performance.

Create an augmentedImageDatastore object to store the calibration images.

imds = augmentedImageDatastore([32,32],testImage); 

Create a dlquantizer object.

dlqObj = dlquantizer(prunedNet, ExecutionEnvironment="FPGA");

Calibrate the dlquantizer object using the calibration images.

calibrate(dlqObj,imds)
ans=100×5 table
    Optimized Layer Name    Network Layer Name    Learnables / Activations     MinValue     MaxValue 
    ____________________    __________________    ________________________    __________    _________

    "input_Mean"              {'input'     }             "Mean"                   113.87       125.31
    "convInp_Weights"         {'convInp'   }             "Weights"            -0.0060522    0.0076182
    "convInp_Bias"            {'convInp'   }             "Bias"                 -0.23065      0.79941
    "S1U1_conv1_Weights"      {'S1U1_conv1'}             "Weights"              -0.36637      0.37601
    "S1U1_conv1_Bias"         {'S1U1_conv1'}             "Bias"                 0.076761      0.79494
    "S1U1_conv2_Weights"      {'S1U1_conv2'}             "Weights"               -0.8197      0.54487
    "S1U1_conv2_Bias"         {'S1U1_conv2'}             "Bias"                 -0.27783      0.85751
    "S1U2_conv1_Weights"      {'S1U2_conv1'}             "Weights"              -0.29579      0.27284
    "S1U2_conv1_Bias"         {'S1U2_conv1'}             "Bias"                 -0.55448      0.85351
    "S1U2_conv2_Weights"      {'S1U2_conv2'}             "Weights"              -0.78735      0.52628
    "S1U2_conv2_Bias"         {'S1U2_conv2'}             "Bias"                 -0.50762      0.56423
    "S1U3_conv1_Weights"      {'S1U3_conv1'}             "Weights"              -0.18651      0.12745
    "S1U3_conv1_Bias"         {'S1U3_conv1'}             "Bias"                 -0.33809      0.73826
    "S1U3_conv2_Weights"      {'S1U3_conv2'}             "Weights"              -0.49925      0.55922
    "S1U3_conv2_Bias"         {'S1U3_conv2'}             "Bias"                 -0.42145      0.64184
    "S2U1_conv1_Weights"      {'S2U1_conv1'}             "Weights"               -0.1328        0.121
      ⋮

Use the runOnHW function to:

  • Prepare the network for deployment.

  • Compile the network to generate weights, biases, and instructions.

  • Deploy the network to the FPGA board.

  • Retrieve the prediction results using MATLAB®.

To view the code for this function, see Helper Functions.

testImage = dlarray(testImage,'SSCB');
[~, speedQuantized] = runOnHW(dlqObj,testImage,'zcu102_int8');
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_int8.
### An output layer called 'Output1_softmax' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### The network includes the following layers:

### Notice: The layer 'input' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'Output1_softmax' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
### Compiling layer group: convInp>>reluInp ...
### Compiling layer group: convInp>>reluInp ... complete.
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ...
### Compiling layer group: S1U1_conv1>>S1U1_conv2 ... complete.
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ...
### Compiling layer group: S1U2_conv1>>S1U2_conv2 ... complete.
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ...
### Compiling layer group: S1U3_conv1>>S1U3_conv2 ... complete.
### Compiling layer group: skipConv1 ...
### Compiling layer group: skipConv1 ... complete.
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ...
### Compiling layer group: S2U1_conv1>>S2U1_conv2 ... complete.
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ...
### Compiling layer group: S2U2_conv1>>S2U2_conv2 ... complete.
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ...
### Compiling layer group: S2U3_conv1>>S2U3_conv2 ... complete.
### Compiling layer group: skipConv2 ...
### Compiling layer group: skipConv2 ... complete.
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ...
### Compiling layer group: S3U1_conv1>>S3U1_conv2 ... complete.
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ...
### Compiling layer group: S3U2_conv1>>S3U2_conv2 ... complete.
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ...
### Compiling layer group: S3U3_conv1>>S3U3_conv2 ... complete.
### Compiling layer group: globalPool ...
### Compiling layer group: globalPool ... complete.
### Compiling layer group: fcFinal ...
### Compiling layer group: fcFinal ... complete.

### Allocating external memory buffers:

          offset_name          offset_address     allocated_space 
    _______________________    ______________    _________________

    "InputDataOffset"           "0x00000000"     "240.0 kB"       
    "OutputResultOffset"        "0x0003c000"     "4.0 kB"         
    "SchedulerDataOffset"       "0x0003d000"     "208.0 kB"       
    "SystemBufferOffset"        "0x00071000"     "52.0 kB"        
    "InstructionDataOffset"     "0x0007e000"     "140.0 kB"       
    "ConvWeightDataOffset"      "0x000a1000"     "172.0 kB"       
    "FCWeightDataOffset"        "0x000cc000"     "4.0 kB"         
    "EndOffset"                 "0x000cd000"     "Total: 820.0 kB"

### Network compilation complete.

### Programming FPGA Bitstream using Ethernet...
### Attempting to connect to the hardware board at 172.21.88.150...
### Connection successful
### Programming FPGA device on Xilinx SoC hardware board at 172.21.88.150...
### Attempting to connect to the hardware board at 172.21.88.150...
### Connection successful
### Copying FPGA programming files to SD card...
### Setting FPGA bitstream and devicetree for boot...
# Copying Bitstream zcu102_int8.bit to /mnt/hdlcoder_rd
# Set Bitstream to hdlcoder_rd/zcu102_int8.bit
# Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd
# Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'
### Programming done. The system will now reboot for persistent changes to take effect.
### Rebooting Xilinx SoC at 172.21.88.150...
### Reboot may take several seconds...
### Attempting to connect to the hardware board at 172.21.88.150...
### Connection successful
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 29-Aug-2024 09:51:25
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 29-Aug-2024 09:51:25
### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                     211348                  0.00085                       1             213980           1168.3
    convInp                   7538                  0.00003 
    S1U1_conv1                7055                  0.00003 
    S1U1_conv2                7329                  0.00003 
    add11                     9155                  0.00004 
    S1U2_conv1                7463                  0.00003 
    S1U2_conv2                7452                  0.00003 
    add12                     8925                  0.00004 
    S1U3_conv1               11014                  0.00004 
    S1U3_conv2               11194                  0.00004 
    add13                     9025                  0.00004 
    skipConv1                 7232                  0.00003 
    S2U1_conv1                6655                  0.00003 
    S2U1_conv2                7282                  0.00003 
    add21                     4564                  0.00002 
    S2U2_conv1                8910                  0.00004 
    S2U2_conv2                9301                  0.00004 
    add22                     4684                  0.00002 
    S2U3_conv1                8767                  0.00004 
    S2U3_conv2                9244                  0.00004 
    add23                     4634                  0.00002 
    skipConv2                 6456                  0.00003 
    S3U1_conv1                6370                  0.00003 
    S3U1_conv2                6768                  0.00003 
    add31                     1420                  0.00001 
    S3U2_conv1                6123                  0.00002 
    S3U2_conv2                6567                  0.00003 
    add32                     1400                  0.00001 
    S3U3_conv1                6253                  0.00003 
    S3U3_conv2                6818                  0.00003 
    add33                     1450                  0.00001 
    globalPool                3399                  0.00001 
    fcFinal                    714                  0.00000 
 * The clock frequency of the DL processor is: 250MHz

Compare the Original, Pruned, and Pruned and Quantized Network Performance

Determine the impact of pruning and quantization on the network. Pruning alone improves the throughput from approximately 267 frames per second to 373 frames per second. Pruning and then quantizing the network to int8 further improves the throughput to 1168 frames per second.

fprintf('The performance achieved for the original network is %s frames per second. \n', speedInitial.("Frame/s")(1));
The performance achieved for the original network is 267.1161 frames per second. 
fprintf('The performance achieved after pruning is %s frames per second. \n', speedPruned.("Frame/s")(1));
The performance achieved after pruning is 372.6502 frames per second. 
fprintf('The performance achieved after pruning and quantizing the network to int8 fixed point is %s frames per second. \n', speedQuantized.("Frame/s")(1));
The performance achieved after pruning and quantizing the network to int8 fixed point is 1168.3335 frames per second. 
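You can also express these results as speedup factors relative to the original network. This sketch assumes the speed tables returned by the runOnHW calls above, accessed with the same "Frame/s" column name used in this example.

```matlab
% Compute speedups relative to the original (unpruned, single-precision)
% network, using the profiler speed tables returned by runOnHW above.
fpsOriginal  = speedInitial.("Frame/s")(1);
fpsPruned    = speedPruned.("Frame/s")(1);
fpsQuantized = speedQuantized.("Frame/s")(1);
fprintf("Pruning speedup: %.2fx\n", fpsPruned/fpsOriginal);
fprintf("Pruning + int8 quantization speedup: %.2fx\n", fpsQuantized/fpsOriginal);
```

With the numbers reported above, pruning gives roughly a 1.4x speedup and pruning plus quantization roughly a 4.4x speedup.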

References

[1] Krizhevsky, Alex. 2009. "Learning Multiple Layers of Features from Tiny Images." https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

Helper Functions

The runOnHW function prepares the network for deployment, compiles the network, deploys the network to the FPGA board, and retrieves the prediction results.

function [result, speed] = runOnHW(network, image, bitstream)
    % Create a workflow object for the network and target bitstream.
    wfObj = dlhdl.Workflow(Network=network,Bitstream=bitstream);
    % Target a Xilinx board over an Ethernet connection.
    wfObj.Target = dlhdl.Target("Xilinx",Interface="Ethernet");
    % Compile the network into weights, biases, and instructions.
    compile(wfObj);
    % Program the FPGA and load the compiled network.
    deploy(wfObj);
    % Run prediction with profiling enabled to collect speed metrics.
    [result,speed] = predict(wfObj,image,Profiler='on');
end
