Contrast Limited Adaptive Histogram Equalization with External Memory
This example shows how to implement the contrast-limited adaptive histogram equalization (CLAHE) algorithm for FPGA, including an external memory interface.
Supported Hardware
Xilinx® Zynq® ZC706 evaluation kit + FMC-HDMI-CAM mezzanine card
Introduction
Video processing algorithms often store a full frame of video data in memory. Implementing this storage on an FPGA increases BRAM utilization and can result in input video resolution constraints. This example shows how to implement vision algorithms on FPGAs by using an external memory resource to reduce use of BRAM and enable processing of higher resolution input video.
The external memory interface in this example uses AXI4 protocols and verifies the design against memory contention. The AXI4 Random Access interface provides a simple, direct interface to the memory interconnect. This protocol enables the algorithm to act as a memory master by providing the addresses and managing the burst transfer directly. The AXI4-Master Write Controller and AXI4-Master Read Controller blocks in this example model a simplified AXI4 interface in Simulink®. When you generate HDL code using the HDL Coder™ product, the generated code includes a fully compliant AXI4 interface IP.
Model External Memory
You can use SoC Blockset™ blocks and visualization tools for modeling, simulating, and analyzing hardware and software architectures for ASICs, FPGAs, and systems on a chip (SoC). These features can help you build system architecture using memory models, bus models, and interface models and help you simulate the architecture together with the algorithms. This example models external memory using the AXI4 Random Access Memory block from the SoC Blockset library. This block models the connection with hardware through external memory. Both the writer and the reader are managers, sending read and write requests to memory through this block. This block also logs and displays memory performance data. This feature enables you to analyze and debug the performance of the system at simulation time.
HDL Implementation
The CLAHE algorithm has three steps: tiling, histogram equalization, and bilinear interpolation. The bilinear interpolation step uses the pixel intensities from the input frame. Storing the full input frame of video data until the bilinear interpolation step requires external memory.
The figure shows the top level of the example model. The HDMI Rx block processes the video input and passes it to the CLAHEAlgorithm_fpga subsystem. The HDMI Rx block converts raw video data to a YCbCr 4:2:2 pixel stream format. The output data is a pixel stream suitable for hardware algorithm design. The HDMI Rx block also directs the SoC Builder tool to generate the IP blocks necessary to receive video data from the FMC-HDMI-CAM card that is attached to the hardware board.
In the model, the AXI4-Master Write Controller and AXI4-Master Read Controller blocks model the AXI4 memory mapped interfaces. The AXI4-Master Write Controller block writes the input frame into the external memory, and the AXI4-Master Read Controller block reads the frame from the external memory for bilinear interpolation. The AXI Read FIFO block sends the output pixel stream to the HDMI Tx block. The HDMI Tx block converts a pixel stream in YCbCr 4:2:2 format to raw video data for display during simulation. This block also directs the SoC Builder tool to generate the IP blocks that transmit video data back to the FMC-HDMI-CAM card. To indicate the status of the AXI Read FIFO and AXI Write FIFO blocks when running the design on hardware, four debug signals from these blocks are connected to LEDs on the board.
The next figure shows the CLAHEAlgorithm_fpga reference model. The input pixel stream connects to a Video Stream Connector block. This block provides a video streaming interface to connect any two IPs in the FPGA implementation. In this example, the Video Stream Connector blocks connect the HDMI input and output blocks with the rest of the FPGA algorithm.
The next figure shows the CLAHEAlgorithm_fpga/CLAHE subsystem, which implements the AXI write and read from external memory, and the CLAHE algorithm.
The subsystem contains these areas:
* AXI Write to Memory: This section writes the input data into the DDR. It consists of an AXI4-Master Write Controller block that receives the input video control information from the HDMI Rx block and models the AXI4 memory mapped interface for writing data into the DDR. It generates five signals: wr_addr, wr_len, wr_valid, rd_start, and frame. The wr_valid signal is an input to the AXI Write FIFO block, which stores the incoming pixel intensities. The SoC Bus Creator block generates the wrCtrlOut master-to-slave bus for writing the data into the DDR. The model writes one line of data per burst. After writing tileHeight/2 lines (where tileHeight corresponds to the height of each tile in CLAHE), the model asserts the rd_start signal to begin the read request. The frame signal indicates the input frame count.
* AXI Read from Memory: This section reads the data from the DDR. It consists of an AXI4-Master Read Controller block that receives the rd_start signal from the AXI4-Master Write Controller block. The AXI4-Master Read Controller block generates the rd_addr, rd_len, rd_avalid, and rd_dready signals. An SoC Bus Creator block combines these signals into a bus. The AXI4-Master Read Controller block also generates the pixelcontrol bus corresponding to the rd_data signal. The model slices the 32-bit rd_data signal to retrieve the 8-bit (LSB) luminance component and then writes it into the cache memory block of the CLAHE algorithm.
* CLAHE: For a detailed description of the implementation of the CLAHE algorithm for hardware, see the Contrast Limited Adaptive Histogram Equalization example. In this example, the CLAHEHDLAlgorithm subsystem operates on 8-bit grayscale images, which is why the 8-bit luminance (Y) component is separated from the 16-bit YCbCr pixel data.
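The write schedule and the luminance slice described in the list above can be sketched in Python. This is only an illustrative model under stated assumptions, not the Simulink implementation: the base address, the tile height of 60 lines, the burst length counted in 32-bit words, and the treatment of rd_start as a level that stays high are all hypothetical. The one-line-per-burst rule, the tileHeight/2 threshold, the 4-byte padded pixel, and the 8-bit LSB luminance slice come from the text.

```python
# Hypothetical constants for illustration. The frame size and the 4-byte
# padded pixel follow the example text; the base address is an assumption.
FRAME_WIDTH = 640        # pixels per line
FRAME_HEIGHT = 480       # lines per frame
BYTES_PER_PIXEL = 4      # each pixel is zero-padded to 4 bytes on the AXI4 interface
BASE_ADDR = 0x00000000   # assumed frame buffer base address


def write_bursts(tile_height, base_addr=BASE_ADDR):
    """Yield (line, wr_addr, wr_len, rd_start) for each line-sized write burst."""
    line_bytes = FRAME_WIDTH * BYTES_PER_PIXEL
    for line in range(FRAME_HEIGHT):
        wr_addr = base_addr + line * line_bytes  # byte address of this line
        wr_len = FRAME_WIDTH                     # assumed: length in 32-bit words
        # Reads may begin once tile_height / 2 lines have been written.
        # Modeled here as a level signal (an assumption).
        rd_start = (line + 1) >= tile_height // 2
        yield line, wr_addr, wr_len, rd_start


def extract_luma(rd_data):
    """Return the 8-bit LSB of a 32-bit read word (the luminance component)."""
    return rd_data & 0xFF


bursts = list(write_bursts(tile_height=60))  # tile height of 60 is hypothetical
first_read_line = next(line for line, _, _, rd in bursts if rd)
```

With these assumed values, each burst advances the write address by 2560 bytes, and `first_read_line` is the index of the line after which the read controller would be allowed to start.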
The CLAHEHDLAlgorithm subsystem performs the three steps of CLAHE: tiling, histogram equalization, and bilinear interpolation. In the first step, the subsystem divides the input frame into a grid of tiles. In the second step, it calculates the histogram of each tile and then performs distribution, redistribution, and cumulative distribution function (CDF) calculations. The calculated CDF values are stored in a buffer for further processing. The third step calculates the output pixel intensities by using a bilinear interpolation of the CDF values. The pixel intensities of the input frame are used as the address to the buffer that stores the CDF values. These pixel intensities are read from the external memory that stores the original input frame.
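As a rough software illustration of the second step, the following Python sketch computes a contrast-limited mapping for a single tile. It is a simplified model and not the hardware implementation: the function name, the clip_limit parameter, and the single-pass uniform redistribution of clipped counts are assumptions chosen for brevity, and the tile is assumed to be non-empty.

```python
def clipped_cdf_map(tile_pixels, clip_limit, num_bins=256):
    """Return a lookup table mapping each 8-bit intensity to an equalized value.

    Hypothetical helper: single-pass redistribution, software model only.
    """
    # Histogram of the tile.
    hist = [0] * num_bins
    for p in tile_pixels:
        hist[p] += 1

    # Clip each bin at clip_limit and collect the excess counts.
    excess = 0
    for i in range(num_bins):
        if hist[i] > clip_limit:
            excess += hist[i] - clip_limit
            hist[i] = clip_limit

    # Redistribute the excess uniformly; the remainder goes to the lowest bins.
    share, remainder = divmod(excess, num_bins)
    for i in range(num_bins):
        hist[i] += share + (1 if i < remainder else 0)

    # Cumulative distribution, scaled to the 8-bit output range.
    total = len(tile_pixels)
    mapping = []
    running = 0
    for count in hist:
        running += count
        mapping.append((running * (num_bins - 1)) // total)
    return mapping


# A tile with only two intensities; clipping limits how far they are stretched.
tile = [10] * 100 + [200] * 100
mapping = clipped_cdf_map(tile, clip_limit=20)
```

Because the redistribution preserves the total pixel count, the last entry of the table always maps to the maximum output value, and the table is non-decreasing.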
Because the data read back from the external memory is in burst mode, it cannot be used directly for bilinear interpolation. The cache buffer stores the burst of lines read from the external memory. The depth of the cache is sufficient to store a number of lines equal to tileHeight. The rdValid signal from the CLAHEHDLAlgorithm subsystem generates the rd_addr signal to read the data from the cache. The data read from the cache (pixValue) is then returned to the CLAHEHDLAlgorithm subsystem to complete the bilinear interpolation to calculate the output pixel intensity.
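The role of the cached pixel value in the interpolation step can be illustrated with a short Python sketch. It assumes that each of the four neighboring tiles has a 256-entry CDF lookup table and that the pixel position is expressed as two weights between 0 and 1. The function name, argument layout, and floating-point arithmetic are assumptions for illustration; a hardware design would typically use fixed-point arithmetic.

```python
def interpolate_pixel(pix_value, maps, wx, wy):
    """Blend the CDF lookups of four neighboring tiles for one pixel.

    pix_value: 8-bit input intensity, used as the address into each table.
    maps: (top_left, top_right, bottom_left, bottom_right) lookup tables.
    wx, wy: weights in [0, 1] toward the right and bottom tiles.
    """
    tl, tr, bl, br = (m[pix_value] for m in maps)
    top = (1 - wx) * tl + wx * tr        # blend along the top edge
    bottom = (1 - wx) * bl + wx * br     # blend along the bottom edge
    return int(round((1 - wy) * top + wy * bottom))


# Constant tables make the blend easy to check by hand.
maps = ([0] * 256, [100] * 256, [100] * 256, [200] * 256)
center = interpolate_pixel(128, maps, wx=0.5, wy=0.5)
```

In this sketch, the input intensity only selects the table entries; the weights alone determine how the four tile mappings are mixed, which is why the original pixel values must still be available when this step runs.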
Hardware Implementation
The SoC Builder tool builds, loads, and executes the model on the FPGA board. The hardware board used in this example is the Xilinx Zynq ZC706 evaluation kit. To build, load, and execute the design on the hardware, follow these steps.
Set up the Vivado® tool for synthesis, implementation, and generation of the FPGA bitstream.
The example model runs in Accelerator mode by default to speed up the simulation. However, the SoC Builder tool requires Normal simulation mode. In Simulink Configuration Parameters, set Simulation mode to Normal.
Launch the SoC Builder tool by clicking Configure, Build, & Deploy in the Simulink toolstrip.
On the Setup screen, select Build model. Click Next.
On the Select Build Action screen, select Build, load, and run. Click Next.
On the Select Project Folder screen, specify the project folder. Click Next.
On the Review Memory Map screen, to view the memory map, click View/Edit. Click Next.
On the Validate Model screen, to check the compatibility of the model for implementation, click Validate. Click Next.
On the Build Model screen, to build the model, click Build. An external shell opens when FPGA synthesis begins. Click Next.
When the bitstream generation is complete, on the Connect Hardware screen, to test the connectivity between the host computer and the hardware board, click Test Connection. Load the bitstream on the hardware by clicking Load.
This figure shows the final SoC Builder results after these steps are complete.
Simulation and Results
This example uses an input video of size 480-by-640 pixels. This size is configured in the HDMI Rx block. For the Xilinx Zynq ZC706 evaluation kit, the PL DDR controller is configured with a 64-bit AXI4-Slave interface running at 200 MHz. The resulting bandwidth is 8 bytes * 200 MHz = 1600 MB/s. This example has two AXI masters connected to the DDR controller: the DUT AXI4 read and write interfaces. The YCbCr 4:2:2 video format requires 2 bytes per pixel, but for the DUT AXI4 read and write interfaces, each pixel is zero-padded to 4 bytes. At 60 frames per second, the two interfaces therefore have a combined throughput requirement of 2*4*480*640*60 bytes/s = 147.456 MB/s.
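The bandwidth figures above can be reproduced with a few lines of Python. The variable names are illustrative, and the factor of 60 in the throughput formula is read here as the frame rate in frames per second.

```python
# DDR controller: 64-bit data path clocked at 200 MHz.
bus_bytes = 64 // 8
clock_hz = 200_000_000
ddr_bandwidth_mb_s = bus_bytes * clock_hz / 1e6

# Two AXI4 interfaces (read and write), 4 bytes per zero-padded pixel,
# 480-by-640 frames, 60 frames per second (assumed meaning of the factor 60).
interfaces = 2
bytes_per_pixel = 4
rows, cols = 480, 640
frames_per_second = 60
required_mb_s = (interfaces * bytes_per_pixel * rows * cols
                 * frames_per_second) / 1e6

# Fraction of the available DDR bandwidth that the design needs.
utilization = required_mb_s / ddr_bandwidth_mb_s

print(f"DDR bandwidth: {ddr_bandwidth_mb_s} MB/s")
print(f"Required throughput: {required_mb_s} MB/s")
print(f"Utilization: {utilization:.1%}")
```

Under these numbers, the design needs less than a tenth of the available DDR bandwidth, which leaves headroom for contention from other masters.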
This figure shows the performance plot of the AXI4 Random Access Memory block. To view the performance plot, first open the AXI4 Random Access Memory block. Then, on the Performance tab, click View performance plots. Select all masters under Bandwidth, and then click Update. After the DUT starts writing data to and reading data from external memory, the throughput remains around 154 MB/s, which meets the required throughput of 147.456 MB/s.
The signals in the example model are logged during simulation. View these signals by using the Logic Analyzer app. This figure shows the logged data of input and output frames.
This figure shows the input and output frames from the model. The result shows the improved contrast in the output image.
See Also
Memory Channel (SoC Blockset) | Memory Controller (SoC Blockset) | Memory Traffic Generator (SoC Blockset)