Deep Network Quantization and Deployment Using Deep Learning Toolbox Model Quantization Library
See how to quantize, calibrate, and validate deep neural networks in MATLAB® using a white-box approach to make tradeoffs between performance and accuracy, then deploy the quantized DNN to an embedded GPU and an FPGA hardware board.
 
Using the Deep Learning Toolbox™ Model Quantization Library, you can quantize deep neural networks such as SqueezeNet. During calibration, the tool collects the required ranges for weights, biases, and activations, then visualizes them as histogram distributions of the calibrated dynamic ranges on a power-of-two scale. You can then use GPU Coder™ to deploy the quantized network to an NVIDIA® Jetson® AGX Xavier, where it achieves a 2x speedup and a 4x reduction in memory usage with only about 3% top-1 accuracy loss compared with the single-precision implementation.
 
See how to use the tool to quantize and deploy networks to a Xilinx® ZCU102 board connected to a high-speed camera. The original deep neural network had a throughput of 45 frames per second. Using the Deep Learning Toolbox Model Quantization Library, you can quantize the network to INT8, boosting the throughput to 139 frames per second while still producing the correct prediction results.
Published: 1 Nov 2020
In this demonstration, we’ll show the workflow to quantize deep learning networks and deploy them to GPUs and FPGAs from MATLAB.
Deploying deep learning networks to edge devices is challenging because these networks can be quite compute intensive. For example, a simple network like AlexNet is over 200 MB, while a larger one like VGG-16 is north of 500 MB.
Quantization helps to reduce the size of the network by converting floating point values used in the networks to smaller bit-widths while keeping the precision loss to a minimum.
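To make the idea concrete, here is a minimal Python sketch of symmetric INT8 quantization for a handful of weights. It is an illustration only, not the toolbox's implementation: the function names, the per-tensor scale, and the sample weights are all assumptions made for this example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization (illustrative assumption, not the toolbox method).

    Maps floats to integer codes in [-127, 127] using a single scale factor.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only to show the rounding behavior.
weights = [0.82, -1.27, 0.003, 0.5, -0.31]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# The worst-case rounding error is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each value now needs 8 bits instead of 32, and the price is a rounding error of at most half a step. Very small values, such as 0.003 here, collapse to zero, which is the kind of precision loss the calibration step is meant to expose.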
In R2020a, we introduced the ability to quantize deep learning networks using a white-box, easy-to-use, iterative workflow. This approach helps you make tradeoffs between performance and accuracy.
To see this workflow in action, let’s take an example of detecting defects in nuts and bolts that you might find in manufacturing.
Let’s say this is part of inspecting a production line, so we need to use a high-speed camera processing at 120 frames per second.
Requirements from system engineering involve metrics like accuracy, network latency, and overall hardware cost,
and they often drive tradeoffs during the design and implementation of the network.
This application includes:
1) Preprocessing logic that resizes the image and selects a region of interest,
2) The pretrained network, which detects whether the part is defective or not,
3) And finally, postprocessing to annotate the result on the screen.
Let’s get started with the quantization workflow by looking at deployment to embedded GPUs.
Quantizing and deploying to the GPU on an NVIDIA Jetson AGX Xavier achieves a 2x speedup and a 4x memory reduction, with only around 3% top-1 accuracy loss compared with the single-precision implementation.
This example uses SqueezeNet, which takes up 5 MB on disk.
To start, we first download the Deep Learning Quantization Support Package from the Add-on Explorer and then launch the app.
Once we load the network to quantize for a GPU target, we calibrate with a datastore that has already been set up. Calibration runs a set of images through the network to collect the required ranges for weights, biases, and activations.
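The range collection described here can be sketched in a few lines of Python. This is a toy model of the idea, assuming calibration simply tracks the running minimum and maximum per tensor; the layer names and values are hypothetical, and the actual tool may record more than this.

```python
def calibrate(batches):
    """Collect the running (min, max) observed for each named tensor.

    `batches` is a list of dicts mapping a tensor name to the values
    seen for one calibration batch (a simplification for illustration).
    """
    ranges = {}
    for batch in batches:
        for name, values in batch.items():
            lo, hi = min(values), max(values)
            if name in ranges:
                old_lo, old_hi = ranges[name]
                lo, hi = min(lo, old_lo), max(hi, old_hi)
            ranges[name] = (lo, hi)
    return ranges


# Hypothetical activations from two calibration batches.
batches = [
    {"conv1": [0.1, 2.5, -0.4], "relu1": [0.0, 2.5, 0.1]},
    {"conv1": [-1.2, 0.7, 3.1], "relu1": [0.0, 0.7, 3.1]},
]
ranges = calibrate(batches)
print(ranges)
```

The more representative the calibration images are, the closer these ranges come to what the network sees in deployment, which is why the calibration datastore matters.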
The visualization shows histogram distributions of the calibrated dynamic ranges on a power-of-two scale. Gray in the histograms marks data that cannot be represented by the quantized type, while blue marks data that can. Darker colors indicate higher-frequency bins.
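One way to picture such a histogram is to bin each value by the power of two of its magnitude and then mark which bins an 8-bit signed type can cover. The Python sketch below does that under a loud assumption: the type's 7 magnitude bits are aligned with the largest occupied bin, so smaller bins fall outside the representable range. The app's actual scaling choice is not described in the source, and the sample values are hypothetical.

```python
import math
from collections import Counter


def log2_histogram(values):
    """Count nonzero values per power-of-two bin.

    Bin e holds values with 2**e <= |v| < 2**(e + 1).
    """
    return Counter(math.floor(math.log2(abs(v))) for v in values if v != 0)


def representable_bins(hist, magnitude_bits=7):
    """Bins covered when the top magnitude bit is aligned with the largest
    occupied bin (an assumption made for this sketch)."""
    top = max(hist)
    return range(top - magnitude_bits + 1, top + 1)


# Hypothetical calibrated values spanning several orders of magnitude.
values = [3.0, 1.5, 0.75, 0.3, 0.02, 0.001, -2.2]
hist = log2_histogram(values)
covered = representable_bins(hist)

# Bins that hold data but fall outside the covered range (the "gray" data).
lost = sorted(e for e in hist if e not in covered)
print(dict(hist), list(covered), lost)
```

In this toy data, the two smallest values land in bins the 8-bit type cannot reach. If those bins held many values, that would be a cue to reconsider quantizing that layer.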
If this is acceptable, we quantize the network and load a datastore to validate the accuracy of the quantized network.
Here is the result. Memory has been reduced by 74 percent with no loss in top-1 accuracy compared with the original floating-point network when measured on a desktop GPU.
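As a quick sanity check on that figure, the arithmetic below computes the ideal size reduction when every stored value shrinks from 32 bits to 8 bits. It is a back-of-the-envelope Python sketch with a hypothetical helper name; the source does not explain the gap between the ideal figure and the measured 74 percent.

```python
def reduction_percent(bits_before, bits_after):
    """Percentage of storage saved when each value shrinks in width."""
    return 100.0 * (1 - bits_after / bits_before)


# Single precision (32-bit) to INT8 (8-bit): the ideal, all-values case.
ideal = reduction_percent(32, 8)
print(ideal)
```

The ideal case is a 4x reduction, or 75 percent, so a measured 74 percent is close to that upper bound.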
Once we have validated the results and exported the dlquantizer workflow object, we can use GPU Coder to deploy the quantized network onto the NVIDIA Jetson board.
We run inference on defective.png, and we expect this image to be classified as a defective bolt.
Now let’s turn our attention to quantizing and deploying networks to a Xilinx ZCU102 board. The network uses 34 MB of memory for learnable parameters and a runtime memory of 200 MB.
With these 5 lines of MATLAB code, we can load the single precision bitstream running on the ZCU102 board. We see that it uses 84 MB of memory with a throughput of 45 frames per second. This is not fast enough for our high-speed camera.
Let’s choose to quantize for FPGA.
Once the quantization workflow is completed, we’ll export the quantized network to the MATLAB workspace.
The quantized network needs to run on a processor quantized to INT8, so we’ll use the INT8 version of our downloaded ZCU102 bitstream.
After compiling, the parameters have been reduced to 68 MB and we can run the network at 139 frames per second. We also get the correct prediction results.
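The throughput numbers can be checked against the 120 frames per second camera requirement with a few lines of Python. The figures come from this walkthrough; the variable and function names are just for illustration.

```python
REQUIRED_FPS = 120  # high-speed camera rate from the requirements


def frame_time_ms(fps):
    """Time available or spent per frame, in milliseconds."""
    return 1000.0 / fps


# Measured throughput on the ZCU102, as reported in this walkthrough.
results = {"single precision": 45, "INT8": 139}

budget_ms = frame_time_ms(REQUIRED_FPS)
meets = {name: fps >= REQUIRED_FPS for name, fps in results.items()}
speedup = results["INT8"] / results["single precision"]

for name, fps in results.items():
    print(f"{name}: {frame_time_ms(fps):.2f} ms per frame, "
          f"budget {budget_ms:.2f} ms, meets requirement: {meets[name]}")
print(f"speedup: {speedup:.2f}x")
```

Only the INT8 network fits inside the per-frame budget, and the quantized version runs roughly three times faster than the single-precision one.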
So as you can see, the Deep Learning Quantization app helps you reduce the size of a deep learning network for GPUs and FPGAs while minimizing the loss in accuracy. If you’re interested in learning more, take a look at the Deep Learning Toolbox Model Quantization Library in R2020a or the latest R2020b.