A VLSI Architecture For High Performance CABAC Encoding
Hassan Shojania and Subramania Sudharsanan
Department of Electrical and Computer Engineering
Queen’s University, Kingston, ON K7L 3N6, Canada
ABSTRACT
One key technique for improving the coding efficiency of the H.264 video standard is its entropy coder, the context-adaptive binary arithmetic coder (CABAC). However, the complexity of the CABAC encoding process is significantly higher than that of table-driven entropy encoding schemes such as Huffman coding. CABAC is also bit-serial, and its multi-bit parallelization is extremely difficult. For a high definition video encoder, multi-gigahertz RISC processors would be needed to implement the CABAC encoder. In this paper, we provide an efficient, pipelined VLSI architecture for CABAC encoding along with an analysis of critical issues. The solution encodes a binary symbol every cycle. An FPGA implementation of the proposed scheme capable of a 104 Mbps encoding rate and test results are presented. An ASIC synthesis and simulation for a 0.18 µm process technology indicates that the design is capable of encoding 190 million binary symbols per second using an area of 0.35 mm².∗
Keywords: H.264, CABAC, Arithmetic Coding, VLSI.
1. INTRODUCTION
The H.264 video standard includes several algorithmic improvements for the hybrid motion compensated, DCT-
based video codecs.1 One key technique for improving the coding efficiency is the entropy coder, context-adaptive
binary arithmetic coder (CABAC).2 The CABAC utilizes a context-sensitive, backward-adaptation mechanism
for calculating the probabilities of the input symbols. Context modeling is applied to a binary sequence obtained by binarizing the syntax elements of the video data, such as block types, motion vectors, and quantized coefficients, using predefined mechanisms. Each bin is then coded with either an adaptive or a fixed probability model.
Context values are used for appropriate adaptations of the probability models corresponding to a total of 399
contexts representing various different elements of the source data. Each processing step of binarization, context
assignment, probability estimation, and binary arithmetic coding is designed with some computational complexity
constraint. For instance, the binary arithmetic coder uses a version that has no divisions or multiplications.
However, the complexity of the encoding process in its totality is far higher than that of table-driven entropy encoding schemes such as Huffman coding.
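As a point of reference, the regular-bin encoding step can be sketched in C as follows. The fragment mirrors the flow of the standard's binary decision encoding process (Ref. 1); the 64-state lookup tables are only declared here, and the type and function names (Context, encode_decision, renorm_e) are chosen for illustration. The renormalization step it calls is discussed in Section 2.3.

    #include <stdint.h>

    typedef struct {
        uint8_t pStateIdx;   /* probability state index, 0..63 */
        uint8_t valMPS;      /* value of the most probable symbol */
    } Context;

    /* The standard's 64-state lookup tables (contents omitted; see Ref. 1). */
    extern const uint8_t rangeTabLPS[64][4];
    extern const uint8_t transIdxLPS[64];
    extern const uint8_t transIdxMPS[64];

    static uint32_t codIRange, codILow;   /* arithmetic coder state registers */
    void renorm_e(void);                  /* renormalization, see Section 2.3 */

    /* Encode one regular (context-coded) bin. */
    void encode_decision(Context *ctx, unsigned binVal)
    {
        /* Quantize codIRange to two bits to select one of four
           precomputed LPS sub-range estimates: no multiplication. */
        unsigned q = (codIRange >> 6) & 3;
        uint32_t codIRangeLPS = rangeTabLPS[ctx->pStateIdx][q];

        codIRange -= codIRangeLPS;
        if (binVal != ctx->valMPS) {      /* least probable symbol */
            codILow += codIRange;
            codIRange = codIRangeLPS;
            if (ctx->pStateIdx == 0)      /* state 0: MPS and LPS swap */
                ctx->valMPS = 1 - ctx->valMPS;
            ctx->pStateIdx = transIdxLPS[ctx->pStateIdx];
        } else {                          /* most probable symbol */
            ctx->pStateIdx = transIdxMPS[ctx->pStateIdx];
        }
        renorm_e();                       /* rescale the state registers */
    }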
The CABAC encoding process is also bit-serial, and multi-bit parallelization of the kind used in Huffman-type encoding is difficult to achieve. On a modern microprocessor, encoding a single bit consumes hundreds of cycles.3, 4 A high definition video encoder working at an average rate of 20 million symbols per second can therefore require a multi-gigahertz RISC processor. Such large frequencies may not suit low-power devices such as cameras, where H.264 is expected to become a dominant standard. Furthermore, instantaneous symbol rates
for such encoders can be significantly higher for multiple reasons: picture type (intra or inter) variations and
pipelined or stream-processing architectures with macro-block level granularity.5, 6 Such pipelined architectures
are preferred in processors that aim to reduce memory and inter-computational block bandwidth requirements.6
Additionally, if a motion estimator uses a rate-constrained motion estimation technique, the CABAC symbol rate requirement can rise significantly higher. Given these possibilities, a highly tuned hardware architecture for CABAC encoding is a better alternative than programmable processor-based solutions.
Email: {shojania, sudha}@ee.queensu.ca
∗This work was supported in part by the Natural Sciences and Engineering Research Council of Canada, Canadian Microelectronics Corporation, and Sun Microsystems, Inc.
Several recent papers have attempted to provide efficient schemes for this problem.3, 7, 8 Another paper
proposed an efficient binary arithmetic coder with a corresponding VLSI architecture as an alternative to the
highly complex CABAC process.4 The solution however is not compatible with the H.264 standard. The scheme
proposed in Ref. 7 uses a hybrid hardware-software approach with estimates of the number of cycles per bit and of the required silicon area. Our previous paper3 introduced a novel architecture for a CABAC coprocessor
that can be easily integrated on system-on-chip designs. It was shown, with FPGA implementation results,
that under certain circumstances, the circuit could achieve the speed of single bit encoding for every two clock
cycles.3 One critical step in arithmetic coding is the renormalization of the state registers.9 The design in
Ref. 3 addressed renormalization with a simple, bit-serial circuit that limited overall performance. The renormalization solution presented in Ref. 7 is based on a QM-coder implementation.9 That solution does not elaborate on how it applies to H.264, particularly with respect to handling “outstanding bits”, which is a complex problem (described in Section 2.3). This problem was addressed in our subsequent work to obtain an encoding rate of 54 million symbols per second.8 That architecture has a throughput of three cycles per binary symbol, which is significantly improved upon in this work.
In this paper, we provide efficient solutions for the arithmetic coder and the renormalizer that guarantee a
single cycle performance per binary symbol, and also address a number of issues that help reduce the silicon
area while maintaining the coprocessor architecture presented in Ref. 3. The proposed solution is tested using
encoder data generated by H.264 reference software10 for several standard video sequences. The remainder of
the paper provides an overview of the problem, discusses existing solutions, and details the proposed architecture. We describe two implementations, one based on an Altera FPGA platform and
the other for a 0.18 µm ASIC synthesis and simulation with timing, power, and area estimations.
2.1. Binarizer
Binarization is a pre-processing step that reduces the alphabet size of syntax elements to a maximally
reduced binary alphabet. The result is a unique intermediate binary codeword (bin string) for each syntax
element. The statistical behavior of individual bins can be better modeled in the subsequent context modeling
stage than that of the whole syntax element.2 Depending on the syntax element, each of its bins can be associated
with a context index which represents the probability model of the bin. Certain syntactical elements do not use
a context-adaptive model and are considered to be equiprobable. The binarization process comprises several schemes that depend on the syntax element1: kth-order exponential Golomb (EGk), unary, truncated unary, and fixed-length coding mechanisms. In addition to these four primary techniques, the CABAC
employs the concatenation of these methods. For example, a transform coefficient level is coded with a truncated
unary prefix and a 0th order exponential Golomb code suffix. A list of syntax elements and their associated types
of binarization is provided in Table 9-24 in the H.264 specification document.1
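To make the constructions concrete, the unary and EGk building blocks can be sketched in C as below. The routines follow the unary and exponential Golomb constructions described in Ref. 1; emit_bin is a hypothetical callback standing in for whatever sink receives the produced bins.

    /* Unary binarization: value N becomes N "1" bins followed by a
       terminating "0" (the truncated unary variant drops the terminator
       when the value reaches its maximum, cMax). */
    void binarize_unary(unsigned val, void (*emit_bin)(unsigned))
    {
        while (val--)
            emit_bin(1);
        emit_bin(0);
    }

    /* k-th order exponential Golomb (EGk) binarization. */
    void binarize_egk(unsigned val, unsigned k, void (*emit_bin)(unsigned))
    {
        while (val >= (1u << k)) {   /* unary prefix: absorb 2^k, 2^(k+1), ... */
            emit_bin(1);
            val -= 1u << k;
            k++;
        }
        emit_bin(0);                 /* prefix terminator */
        while (k--)                  /* fixed-length suffix, MSB first */
            emit_bin((val >> k) & 1);
    }

A transform coefficient level would thus pass through binarize_unary (truncated) for its prefix and binarize_egk with k = 0 for its suffix.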
2.3. Renormalization
The renormalization process rescales arithmetic coding states. It takes a variable number of iterations to scale
codIRange to a minimum value of 256 with successive left shifts.1 The number of iterations, iter, varies from
zero to eight depending on the incoming codIRange calculated in the arithmetic coding stage. Each iteration
updates codILow by potentially resetting one of its two top bits and then shifting it to the left. A single output
bit is generated at each iteration to be added to the output stream. The polarity of the generated bit depends on the branch taken. Figure 1 names the branches of Fig. 9-8 of Ref. 1 as 1, 1+ and 0 from right to left, and shows the flow of iterations for renormalization of a single bin as a state diagram. While the polarities of the generated bits for the 1 (one) and 0 (zero) branches are determined immediately, the polarity for the 1+ branch is unknown until a future bit (zero or one) is generated. This future bit could be generated either in the current renormalization process or in a renormalization step corresponding to the encoding of a future symbol, possibly several symbols away.
As suggested in Ref. 1, a counter, count, can keep track of the number of these 1+ bits (bits associated with
1+ branch, outstanding bits) until a future bit resolves them to a known value. This dependency on the future
bits introduces a serious challenge to hardware implementations as the length of these bits can grow with no
predetermined bounds. For example, the standard document1 does not set an upper limit on count and suggests
it could grow as large as the slice size. The outstanding bits are resolved to either a one followed by count zeros or a zero followed by count ones, depending on whether the resolving bit is a one or a zero, respectively.
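The iterative process, including this outstanding-bit bookkeeping, can be rendered in C roughly as follows. The sketch follows Fig. 9-8 of Ref. 1 and completes the renorm_e routine referenced in the earlier sketch; write_bit is a hypothetical routine that appends one bit to the output stream, and put_bit resolves the accumulated 1+ bits as described above.

    static uint32_t codIRange, codILow;   /* coder state registers */
    static unsigned count;                /* number of pending 1+ bits */

    void write_bit(unsigned b);           /* appends one bit to the stream */

    /* Emit a resolved bit and flush the outstanding bits as its
       complement: a resolving one is followed by count zeros, a
       resolving zero by count ones. */
    static void put_bit(unsigned b)
    {
        write_bit(b);
        while (count > 0) {
            write_bit(1 - b);
            count--;
        }
    }

    /* Bit-serial renormalization of a single bin (Fig. 9-8 of Ref. 1). */
    void renorm_e(void)
    {
        while (codIRange < 0x100) {       /* scale codIRange up to >= 256 */
            if (codILow < 0x100) {
                put_bit(0);               /* 0 branch */
            } else if (codILow >= 0x200) {
                codILow -= 0x200;
                put_bit(1);               /* 1 branch */
            } else {
                codILow -= 0x100;
                count++;                  /* 1+ branch: polarity deferred */
            }
            codIRange <<= 1;
            codILow <<= 1;
        }
    }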
The variable number of iterations could force frequent stalls in the arithmetic encoder if not addressed properly, since renormalization must complete before the next incoming bin can be processed, reducing the overall throughput of the coder.
[Figure: binarizer block diagram; recoverable labels show ctxData[5..0] and binIdx[2..0] inputs addressing a context RAM, with unary and fixed-length binarization units.]
5. CONCLUSION
We have presented a comprehensive CABAC encoding engine addressing issues related to the binarizer, arithmetic coding, and bit generation. The proposed solution provides a significant improvement over our previous work by using a fully pipelined approach. The design achieves an encoding rate of 104 Mbps on an Altera FPGA platform. Synthesis and simulation results for a 0.18 µm ASIC design showed that the design is capable of encoding up to 190 million symbols per second. The design issues related to outstanding bits were discussed in detail, and related empirical data from real test content were presented. Work in progress looks at integrating the CABAC engine into an H.264 encoder system-on-chip.
REFERENCES
1. ITU, ITU-T Recommendation H.264: Advanced video coding for generic audiovisual services, May 2003.
2. D. Marpe, H. Schwarz, and T. Wiegand, “Context-based adaptive binary arithmetic coding in the
H.264/AVC video compression standard,” IEEE Transactions on Circuits and Systems for Video Tech-
nology 13, pp. 620–636, July 2003.
3. S. Sudharsanan and A. Cohen, “A hardware architecture for a context adaptive binary arithmetic coder,”
in Proc. of the SPIE, Embedded Processors for Multimedia & Communications II, pp. 104–112, Mar. 2005.
4. J. L. Núñez and V. A. Chouliaras, “High-performance arithmetic coding VLSI macro for the H.264 video
compression standard,” IEEE Transactions on Consumer Electronics 51, pp. 144–151, Feb. 2005.
5. U. J. Kapasi, S. Rixner, W. J. Dally, B. Khailany, J. H. Ahn, P. Mattson, and J. D. Owens, “Programmable
stream processors,” IEEE Computer, pp. 54–62, Aug. 2003.
6. T.-C. Chen, Y.-W. Huang, and L.-G. Chen, “Analysis and design of macroblock pipeline for
H.264/AVC VLSI architecture,” in Proc. International Symposium on Circuits and Systems, II, pp. 273–276,
2004.
7. R. Osorio and J. Bruguera, “Arithmetic coding architecture for H.264/AVC CABAC compression system,”
in Proc. Euromicro Symposium on Digital System Design, pp. 62–69, 2004.
8. H. Shojania and S. Sudharsanan, “A high performance CABAC encoder,” in Proc. of the 3rd International
IEEE Northeast Workshop on Circuits and Systems (NEWCAS'05), June 2005.
9. J. Mitchell and W. Pennebaker, JPEG: Still Image Data Compression Standard, Van Nostrand Reinhold,
1993.
10. ITU, H.264/AVC Reference Software. http://iphome.hhi.de/suehring/tml, ver. JM 8.2, July 2004.