Advances in Video Compression System Using Deep Neural Network: A Review and Case Studies
D. Ding is with the School of Information Science and Engineering, Hangzhou Normal University, Hangzhou, Zhejiang, China.
Z. Ma is with the School of Electronic Science and Engineering, Nanjing University, Nanjing, Jiangsu, China.
D. Chen, Q. Chen, and F. Zhu are with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana, USA.
Z. Liu is with Visionular Inc., 280 2nd St., Los Altos, CA, USA.
* These authors contributed equally.

Abstract—Significant advances in video compression systems have been made in the past several decades to satisfy the nearly exponential growth of Internet-scale video traffic.

I. INTRODUCTION

In recent years, Internet traffic has been dominated by a wide range of applications involving video, including video on demand (VOD), live streaming, ultra-low latency real-time communications, etc. With ever increasing demands in resolution (e.g., 4K, 8K, gigapixel [1], high speed [2]) and fidelity (e.g., high dynamic range [3], and higher bit precision or bit depth [4]), more efficient video compression is imperative for content transmission and storage, by which networked video services can be successfully deployed. Fundamentally, video compression systems devise appropriate algorithms to minimize the end-to-end reconstruction distortion (or maximize the quality of experience (QoE)) under a given bit rate budget. This is a classical rate-distortion (R-D) optimization problem. In the past, the majority of effort had been focused on the development and standardization of video coding tools for optimized R-D performance, such as the intra/inter prediction, transform, entropy coding, etc.

• The “coding” block is the core unit that converts raw pixels or pixel blocks into a binary bit representation. Over the past decades, the “coding” R-D efficiency has been gradually improved by introducing more advanced tools to better exploit spatial, temporal, and statistical redundancy [23]. Nevertheless, this process inevitably incurs compression artifacts, such as blockiness and ringing, due to the R-D trade-off, especially at low bit rates.
• The “post-processing” block is introduced to alleviate visually perceptible impairments produced as byproducts of coding. Post-processing mostly relies on the designated adaptive filters to enhance the reconstructed video quality or QoE. Such “post-processing” filters can also be embedded into the “coding” loop to jointly improve reconstruction quality and R-D efficiency, e.g., in-loop deblocking [24] and sample adaptive offset (SAO) [25];
• The “pre-processing” block exploits the discriminative content preference of the human visual system (HVS), caused by the non-linear response and frequency selectivity (e.g., masking) of visual neurons in the visual pathway. Pre-processing can extract content semantics
(e.g., saliency, object instance) to improve the psychovisual performance of the “coding” block, for example, by allocating unequal qualities (UEQ) across different areas according to pre-processed cues [26].¹

Building upon the advancements in deep neural networks (DNN), numerous recently-created video processing algorithms have been greatly improved to achieve superior performance, mostly leveraging the powerful nonlinear representation capacity of DNNs. At the same time, we have also witnessed an explosive growth in the invention of DNN-based techniques for video compression from both academic research and industrial practices. For example, DNN-based filtering in post-processing was extensively studied when developing the VVC standard under the joint task force of ISO/IEC and ITU-T experts over the past three years. More recently, the standard committee issued a Call-for-Evidence (CfE) [27], [28] to encourage the exploration of deep learning-based video coding solutions beyond VVC.

In this article, we discuss recent advances in pre-processing, coding, and post-processing, with particular emphasis on the use of DNN-based approaches for efficient video compression. We aim to provide a comprehensive overview to bring readers up to date on recent advances in this emerging field. We also suggest promising directions for further exploration. As summarized in Fig. 1, we first dive into video pre-processing, emphasizing the analysis and application of content semantics, e.g., saliency, object, texture characteristics, etc., to video encoding. We then discuss recently-developed DNN-based video coding techniques for both modularized coding tool development and end-to-end fully learned framework exploration. Finally, we provide an overview of the adaptive filters that can be either embedded in the codec loop, or placed as a post enhancement to improve final reconstruction. We also present three case studies, including 1) switchable texture-based video coding in pre-processing; 2) end-to-end neural video coding; and 3) efficient neural filtering, to provide examples of the potential of DNNs to improve both subjective and objective efficiency over traditional video compression methodologies.

The remainder of the article is organized as follows: From Section II to IV, we extensively review the advances in respective pre-processing, coding, and post-processing. Traditional methodologies are first briefly summarized, and then DNN-based approaches are discussed in detail. As case studies, we propose three neural approaches in Sections V, VI, and VII, respectively. Regarding pre-processing, we develop a CNN-based texture analysis/synthesis scheme for the AV1 codec. For video compression, an end-to-end neural coding framework is developed. In our discussion of post-processing, we present different neural methods for in-loop and post filtering that can enhance the quality of reconstructed frames. Section VIII summarizes this work and discusses open challenges and future research directions. For your convenience, Table I provides an overview of abbreviations and acronyms that are frequently used throughout this paper.

¹ Although adaptive filters can also be used in pre-processing for pre-filtering, e.g., denoising, motion deblurring, contrast enhancement, edge detection, etc., our primary focus in this work will be on semantic content understanding for subsequent intelligent “coding”.

TABLE I: Abbreviations and Annotations
Abbreviation   Description
AE             AutoEncoder
CNN            Convolutional Neural Network
CONV           Convolution
ConvLSTM       Convolutional LSTM
DNN            Deep Neural Network
FCN            Fully-Connected Network
GAN            Generative Adversarial Network
LSTM           Long Short-Term Memory
RNN            Recurrent Neural Network
VAE            Variational AutoEncoder
BD-PSNR        Bjøntegaard Delta PSNR
BD-Rate        Bjøntegaard Delta Rate
GOP            Group of Pictures
MS-SSIM        Multiscale SSIM
MSE            Mean Squared Error
PSNR           Peak Signal-to-Noise Ratio
QP             Quantization Parameter
QoE            Quality of Experience
SSIM           Structural Similarity Index
UEQ            UnEqual Quality
VMAF           Video Multi-Method Assessment Fusion
AV1            AOMedia Video 1
AVS            Audio Video Standard
H.264/AVC      H.264/Advanced Video Coding
H.265/HEVC     H.265/High-Efficiency Video Coding
VVC            Versatile Video Coding
AOM            Alliance of Open Media
MPEG           Moving Picture Experts Group

II. OVERVIEW OF DNN-BASED VIDEO PRE-PROCESSING

Pre-processing techniques are generally applied prior to the video coding block, with the objective of guiding the video encoder to remove psychovisual redundancy and to maintain or improve visual quality, while simultaneously lowering bit rate consumption. One category of pre-processing techniques is the execution of pre-filtering operations. Recently, a number of deep learning-based pre-filtering approaches have been adopted for targeted coding optimization. These include denoising [29], [30], motion deblurring [31], [32], contrast enhancement [33], edge detection [34], [35], etc. Another important topic area is closely related to the analysis of video content semantics, e.g., object instance, saliency attention, texture distribution, etc., and its application to intelligent video coding. For the sake of simplicity, we refer to this group of techniques as “pre-processing” for the remainder of this paper. In our discussion below, we also limit our focus to saliency-based and analysis/synthesis-based approaches.

A. Saliency-Based Video Pre-processing

1) Saliency Prediction: Saliency is the quality of being particularly noticeable or important. Thus, the salient area refers to the region of an image that predominantly attracts the attention of subjects. This concept corresponds closely to the highly discriminative and selective behaviour displayed in visual neuronal processing [36], [37]. Content feature extraction, activation, suppression and aggregation also occur in the visual pathway [38].
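Before turning to the handcrafted and DNN-based predictors reviewed next, the sketch below illustrates the simplest form of the idea: a center-surround intensity-contrast saliency map computed with plain NumPy. It is only a toy illustration under our own choice of scales and normalization, not the model of any work cited here.

```python
# Minimal, illustrative center-surround saliency sketch (Itti-style intensity
# contrast); scales and normalization are arbitrary assumptions.
import numpy as np

def gaussian_kernel1d(sigma, radius):
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def blur(img, sigma):
    k = gaussian_kernel1d(sigma, radius=int(3 * sigma))
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, tmp)

def center_surround_saliency(gray, scales=((1, 4), (2, 8))):
    """gray: 2D float array in [0, 1]; returns a saliency map in [0, 1]."""
    sal = np.zeros_like(gray)
    for center_sigma, surround_sigma in scales:
        sal += np.abs(blur(gray, center_sigma) - blur(gray, surround_sigma))
    sal -= sal.min()
    return sal / (sal.max() + 1e-8)

if __name__ == "__main__":
    frame = np.random.rand(144, 176)          # stand-in for a luma frame
    print(center_surround_saliency(frame).shape)  # (144, 176)
```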
Fig. 1: Topic Outline. This article reviews DNN-based techniques used in pre-processing, coding, and post-processing of a
practical video compression system. The “pre-processing” module leverages content semantics (e.g., texture) to guide video
coding, followed by the “coding” step to represent the video content using more compact spatio-temporal features. Finally,
quality enhancement is applied in “post-processing” to improve reconstruction quality by alleviating processing artifacts.
Companion case studies are respectively offered to showcase the potential of DNN algorithms in video compression.
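To make the division of labor in Fig. 1 concrete, the toy sketch below chains the three stages as plain Python functions. All names and the trivial 16-bit float "codec" are hypothetical stand-ins; a practical system would invoke a real encoder (e.g., an AV1 or HEVC implementation) in the middle stage.

```python
# Toy sketch of the Fig. 1 pipeline: pre-processing -> coding -> post-processing.
# Everything here is a placeholder, not a real codec.
import numpy as np

def pre_process(frame):
    # Semantic analysis stand-in, e.g., a crude "is this frame textured?" cue.
    side_info = {"texture_cue": float(frame.std())}
    return frame, side_info

def encode(frame, side_info):
    # "Coding" stand-in: cast to float16 and serialize the bytes.
    return frame.astype(np.float16).tobytes(), side_info

def decode_and_post_process(bitstream, side_info, shape):
    recon = np.frombuffer(bitstream, dtype=np.float16).reshape(shape).astype(np.float32)
    return np.clip(recon, 0.0, 1.0)   # placeholder quality-enhancement filter

frame = np.random.rand(64, 64).astype(np.float32)
bits, info = encode(*pre_process(frame))
recon = decode_and_post_process(bits, info, frame.shape)
print(recon.shape, round(info["texture_cue"], 3))
```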
Earlier attempts typically utilized handcrafted image features, such as color, intensity, and orientation contrast [39], motion contrast [40], and camera motion [41], to predict saliency.

Later on, DNN-based semantic-level features were extensively investigated for both image content [42]–[48] and video sequences [49]–[55]. Among these features, image saliency prediction only exploits spatial information, while video saliency prediction often relies on spatial and temporal attributes jointly. One typical example of video saliency is a moving object that incurs spatio-temporal dynamics over time, and is therefore more likely to attract users’ attention. For example, Bazzani et al. [49] modeled the spatial relations in videos using 3D convolutional features and the temporal consistency with a convolutional long short-term memory (LSTM) network. Bak et al. [50] applied a two-stream network that exploited different fusion mechanisms to effectively integrate spatial and temporal information. Sun et al. [51] proposed a step-gained FCN to combine the time-domain memory information and space-domain motion components. Jiang et al. [52] developed an object-to-motion CNN that was applied together with an LSTM network. All of these efforts to efficiently predict video saliency leveraged spatio-temporal attributes. More details regarding the spatio-temporal saliency models for video content can be found in [56].

2) Salient Object: One special example of image saliency involves the object instance in a visual scene, specifically, the moving object in videos. A simple yet effective solution to the problem of predicting image saliency in this case involves segmenting foreground objects and background components. The segmentation of foreground objects and background components has mainly relied on foreground extraction or background subtraction. For example, motion information has frequently been used to mask out foreground objects [57]–[61]. Recently, both CNN and foreground attentive neural network (FANN) models have been developed to perform foreground segmentation [62], [63]. In addition to conventional Gaussian mixture model-based background subtraction, recent explorations have also shown that CNN models could be effectively used for the same purpose [64], [65]. To address these separated foreground objects and background attributes, Zhang et al. [66] introduced a new background mode to more compactly represent background information with better R-D efficiency. To the best of our knowledge, such foreground object/background segmentation has been mostly applied in video surveillance applications, where the visual scene lends itself to easier separation.

3) Video Compression with UEQ Scales: Recalling that saliency or objects refer to more visually attentive areas, it is straightforward to apply a UEQ setting in a video encoder, where light compression is used to encode the saliency area, while heavy compression is used elsewhere. Use of this technique often results in a lower level of total bit rate consumption without compromising QoE.

For example, Hadi et al. [67] extended the well-known Itti-Koch-Niebur (IKN) model to estimate saliency in the DCT domain, also considering camera motion. In addition, saliency-driven distortion was also introduced to accurately capture the salient characteristics, in order to improve R-D optimization in H.265/HEVC. Li et al. [68] suggested using graph-based visual saliency to adapt the quantizations in H.265/HEVC, to reduce total bits consumption. Similarly, Ku et al. [69] applied saliency-weighted Coding Tree Unit (CTU)-level bit allocation, where the CTU-aligned saliency weights were determined via low-level feature fusion.

The aforementioned methodologies rely on traditional handcrafted saliency prediction algorithms. As DNN-based saliency algorithms have demonstrated superior performance, we can safely assume that their application to video coding will lead to better compression efficiency. For example, Zhu et al. [70] adopted a spatio-temporal saliency model to accurately control the QP in an encoder whose spatial saliency was generated using a 10-layer CNN, and whose temporal saliency was calculated assuming the 2D motion model (resulting in an average of 0.24 dB BD-PSNR gain over the H.265/HEVC reference model (version HM16.8)). Performance improvement due to fine-grained quantization adaptation was reported using an open-source x264 encoder [71]. This was accomplished by jointly examining the input video frame and associated saliency maps. These saliency maps were generated by utilizing three
[...] rate control, where the static saliency map of each frame was extracted using a DNN model and the dynamic saliency region was tracked using a moving object segmentation algorithm. Experiment results revealed that the PSNR of salient regions was improved by 1.85 dB on average.

Though saliency-based pre-processing is mainly driven by psychovisual studies, it heavily relies on saliency detection to perform UEQ-based adaptive quantization with a lower rate of bit consumption but visually identical reconstruction. On the other hand, visual selectivity behaviour is closely associated with video content distribution (e.g., frequency response), leading to perceptually unequal preference. Thus, it is highly expected that such content semantics-induced discriminative features can be utilized to improve the system efficiency when integrated into the video encoder. To this end, we will discuss the analysis/synthesis-based approach for pre-processing in the next section.

B. Analysis/Synthesis Based Pre-processing

Fig. 2: Texture Coding System. A general framework of analysis/synthesis based video coding.

Since most videos are consumed by human vision, subjective perception of the HVS is the best way to evaluate quality. However, it is quite difficult to devise a profoundly accurate mathematical HVS model in an actual video encoder for rate and perceptual quality optimization, due to the complicated and unclear information processing that occurs in the human visual pathway. Instead, many pioneering psychovisual studies have suggested that the neuronal response to compound stimuli is highly nonlinear [74]–[81] within the receptive field. This leads to well-known visual behaviors, such as frequency selectivity, masking, etc., where such stimuli are closely related to the content texture characteristics. Intuitively, video scenes can be broken down into areas that are either “perceptually significant” (e.g., measured in an MSE sense) or “perceptually insignificant”. For “perceptually insignificant” regions, users will not perceive compression or processing impairments without a side-by-side comparison with the original sample. This is because the HVS gains semantic understanding by viewing content as a whole, instead of interpreting texture details pixel-by-pixel [82]. This notable effect of the HVS is also referred to as “masking,” where visually insignificant information, e.g., perceptually insignificant pixels, will be noticeably suppressed.

In practice, we can first analyze the texture characteristics of the original video content in the pre-processing step, e.g., the Texture Analyzer in Fig. 2, in order to sort textures by their significance. Subsequently, we can use any standard compliant video encoder to encode the perceptually significant areas as the main bitstream payload, and apply a statistical model to represent the perceptually insignificant textures with model parameters encapsulated as side information. Finally, we can use decoded areas and parsed textures to jointly synthesize the reconstructed sequences in the Texture Synthesizer. This type of texture modeling makes good use of statistical and psychovisual representation jointly, generally requiring fewer bits, despite yielding visually identical sensation, compared to the traditional hybrid “prediction+residual” method². Therefore, texture analysis and synthesis play a vital role for subsequent video coding. We will discuss related techniques below.

² A comprehensive survey of texture analysis/synthesis based video coding technologies can be found in [83].

1) Texture Analysis: Early developments in texture analysis and representation can be categorized into filter-based or statistical modeling-based approaches. The Gabor filter is one typical example of a filter-based approach, by which the input image is convolved with nonlinear activation for the derivation of the corresponding texture representation [84], [85]. At the same time, in order to identify static and dynamic textures for video content, Thakur et al. [86] utilized the 2D dual tree complex wavelet transform and steerable pyramid transform [87], respectively. To accurately capture the temporal variations in video, Bansal et al. [88] again suggested the use of optic flow for dynamic texture indication and later synthesis, where optical flow could be generated using temporal filtering. Leveraging statistical models such as the Markovian random field (MRF) [89], [90] is an alternative way to analyze and represent texture. For efficient texture description, statistical modeling such as this was then extended using handcrafted local features, e.g., the scale invariant feature transform (SIFT) [91], speeded up robust features (SURF) [92], and local binary patterns (LBP) [93].

Recently, stacked DNNs have demonstrated their superior efficiency in many computer vision tasks. This efficiency is mainly due to the powerful capacity of DNN features to be used for video content representation. The most straightforward scheme directly extracted features from the FC6 or FC7 layer of AlexNet [94] for texture representation. Furthermore, Cimpoi et al. [95] demonstrated that Fisher vectorized [96] CNN features were a decent texture descriptor candidate.

2) Texture Synthesis: Texture synthesis reverse-engineers the analysis in pre-processing to restore pixels accordingly. It generally includes both non-parametric and parametric methods. For non-parametric synthesis, texture patches are usually resampled from reference images [97]–[99]. In contrast, the parametric method utilizes statistical models to reconstruct the texture regions by jointly optimizing observation outcomes from the model and the model itself [87], [100], [101].
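As a hedged illustration of the non-parametric branch described above (patch resampling in the spirit of [97]–[99], not a reimplementation of those works), the following sketch fills a target region by copying randomly chosen patches from a decoded exemplar.

```python
# Minimal non-parametric texture synthesis sketch: tile randomly resampled
# patches from an exemplar into the synthesized region. Illustrative only.
import numpy as np

def synthesize_by_patch_resampling(exemplar, out_h, out_w, patch=16, rng=None):
    rng = rng or np.random.default_rng(0)
    out = np.zeros((out_h, out_w), dtype=exemplar.dtype)
    eh, ew = exemplar.shape
    for y in range(0, out_h, patch):
        for x in range(0, out_w, patch):
            sy = rng.integers(0, eh - patch + 1)
            sx = rng.integers(0, ew - patch + 1)
            h = min(patch, out_h - y)
            w = min(patch, out_w - x)
            out[y:y + h, x:x + w] = exemplar[sy:sy + h, sx:sx + w]
    return out

grass = np.random.rand(64, 64)   # stand-in for a decoded texture exemplar
print(synthesize_by_patch_resampling(grass, 128, 128).shape)
```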
DNN-based solutions exhibit great potential for texture synthesis applications. One notable example demonstrating this potential used a pre-trained image classification-based CNN model to generate texture patches [102]. Li et al. [103] then demonstrated that a Markovian GAN-based texture synthesis could offer remarkable quality improvement.

To briefly summarize, earlier “texture analysis/synthesis” approaches often relied on handcrafted models, as well as corresponding parameters. While they have shown good performance to some extent for a set of test videos, it is usually very difficult to generalize them to large-scale video datasets without fine-tuning parameters further. On the other hand, related neuroscience studies propose a broader definition of texture which is more closely related to perceptual sensation, although existing mathematical or data-driven texture representations attempt to fully fulfill such perceptual motives. Furthermore, recent DNN-based schemes present a promising perspective. However, the complexity of these schemes has not yet been appropriately exploited. So, in Section V, we will reveal a CNN-based pixel-level texture analysis approach to segment perceptually insignificant texture areas in a frame for compression and later synthesis. To model the textures both spatially and temporally, we introduce a new coding mode called the “switchable texture mode” that is determined at the group of pictures (GoP) level according to the bit rate saving.

III. OVERVIEW OF DNN-BASED VIDEO CODING

A number of investigations have shown that DNNs can be used for efficient image/video coding [104]–[107]. This topic has attracted extensive attention in recent years, demonstrating its potential to enhance the conventional system with better R-D performance.

There are three major directions currently under investigation. One is resolution resampling-based video coding, by which the input videos are first down-sampled prior to being encoded, and the reconstructed videos are up-sampled or super-resolved to the same resolution as the input [108]–[111]. This category generally develops up-scaling or super-resolution algorithms on top of standard video codecs. The second direction under investigation is modularized neural video coding (MOD-NVC), which has attempted to improve individual coding tools in the traditional hybrid coding framework using learning-based solutions. The third direction is end-to-end neural video coding (E2E-NVC), which fully leverages stacked neural networks to compactly represent the input image/video in an end-to-end learning manner. In the following sections, we will primarily review the latter two cases, since the first one has been extensively discussed in many other studies [112].

A. Modularized Neural Video Coding (MOD-NVC)

The MOD-NVC has inherited the traditional hybrid coding framework within which handcrafted tools are refined or replaced using learned solutions. The general assumption is that existing rule-based coding tools can be further improved via a data-driven approach that leverages powerful DNNs to learn robust and efficient mapping functions for more compact content representation. Two great articles have comprehensively reviewed relevant studies in this direction [106], [107]. We briefly introduce key techniques in intra/inter prediction, quantization, and entropy coding. Though in-loop filtering is another important piece in the “coding” block, due to its similarities with post filtering, we have chosen to review it in quality enhancement-aimed “post-processing” for the sake of creating a more cohesive presentation.

1) Intra Prediction: Video frame content presents a highly correlated distribution across neighboring samples spatially. Thus, block redundancy can be effectively exploited using causal neighbors. In the meantime, due to the presence of local structural dynamics, block pixels can be better represented from a variety of angular directed predictions.

In conventional standards, such as the H.264/AVC, H.265/HEVC, or even the emerging VVC, specific prediction rules are carefully designated to use weighted neighbors for respective angular directions. From the H.264/AVC to the recent VVC, intra coding efficiency has been gradually improved by allowing more fine-grained angular directions and flexible block sizes/partitions. In practice, an optimal coding mode is often determined by R-D optimization.

One would intuitively expect that coding performance can be further improved if better predictions can be produced. Therefore, there have been a number of attempts to leverage the powerful capacity of stacked DNNs for better intra predictor generation, including the CNN-based predictor refinement suggested in [113] to reduce prediction residual, additional learned modes trained using FCN models reported in [114], [115], using RNNs in [116], using CNNs in [108], or even using GANs in [117], etc. These approaches have actively utilized the neighbor pixels or blocks, and/or other context information (e.g., mode) if applicable, in order to accurately represent the local structures for better prediction. Many of these approaches have reported more than 3% BD-Rate gains against the popular H.265/HEVC reference model. These examples demonstrate the efficiency of DNNs in intra prediction.

2) Inter Prediction: In addition to the spatial intra prediction, temporal correlations have also been exploited via inter prediction, by which previously reconstructed frames are utilized to generate the inter predictor for compensation using displaced motion vectors.

Temporal prediction can be enhanced using references with higher fidelity, and more fine-grained motion compensation. For example, fractional-pel interpolation is usually deployed to improve prediction accuracy [118]. On the other hand, motion compensation with flexible block partitions is another major contributor to inter coding efficiency.

Similarly, earlier attempts have been made to utilize DNN solutions for better inter coding. For instance, CNN-based interpolations were studied in [119]–[121] to improve the half-pel samples. Besides, an additional virtual reference could be generated using CNN models for improved R-D decisions in [122]. Xia et al. [123] further extended this approach using multiscale CNNs to create an additional reference closer to the current frame by which accurate pixel-wise motion representation could be used. Furthermore, conventional references could also be enhanced using DNNs to refine the compensation [124].
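The common thread in [119]–[124] is a small network that improves a conventionally motion-compensated signal. The PyTorch sketch below captures that pattern with an assumed three-layer residual CNN; the layer widths and the single-channel input are our own illustrative choices, not those of any cited network.

```python
# Hedged sketch: a tiny residual CNN that refines a motion-compensated
# prediction. Architecture details are illustrative assumptions.
import torch
import torch.nn as nn

class PredictionRefineCNN(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, mc_prediction):
        # Predict a correction and add it back (residual learning).
        return mc_prediction + self.body(mc_prediction)

if __name__ == "__main__":
    pred = torch.rand(1, 1, 64, 64)            # motion-compensated luma block
    print(PredictionRefineCNN()(pred).shape)   # torch.Size([1, 1, 64, 64])
```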
3) Quantization and Entropy Coding: Quantization and entropy coding are used to remove statistical redundancy. Scalar quantization is typically implemented in video encoders to remove insensitive high-frequency components, without losing perceptual quality, while saving bit rate. Recently, a three-layer DNN was developed to predict the local visibility threshold C_T for each CTU, by which more accurate quantization could be achieved via the connection between C_T and the actual quantization stepsize. This development led to noticeable R-D improvement, e.g., up to 11% as reported in [125].

Context-adaptive binary arithmetic coding (CABAC) and its variants are techniques that are widely adopted to encode binarized symbols. The efficiency of CABAC is heavily reliant on the accuracy of probability estimation in different contexts. Since the H.264/AVC, handcrafted probability transfer functions (developed through exhaustive simulations, and typically implemented using look-up tables) have been utilized. In [115] and [126], the authors demonstrated that a combined FCN and CNN model could be used to predict intra mode probability for better entropy coding. Another example of a combined FCN and CNN model was presented in [127] to accurately encode transform indexes via stacked CNNs. And likewise, in [128], intra DC coefficient probability could also be estimated using DNNs for better performance.

All of these explorations have reported positive R-D gains when incorporating DNNs in traditional hybrid coding frameworks. A companion H.265/HEVC-based software model is also offered by Liu et al. [106], to advance the potential for society to further pursue this line of exploration. However, integrating DNN-based tools could exponentially increase both the computational and space complexity. Therefore, creating harmony between learning-based and conventional rule-based tools under the same framework requires further investigation.

It is also worth noting that an alternative approach is currently being explored in parallel. In this approach, researchers suggest using an end-to-end neural video coding (E2E-NVC) framework to derive the raw video content representation via layered feature extraction, activation, suppression, and aggregation, mostly in a supervised learning fashion, instead of refining individual coding tools.

B. End-to-End Neural Video Coding (E2E-NVC)

Representing raw video pixels as compactly as possible by massively exploiting their spatio-temporal and statistical correlations is the fundamental problem of lossy video coding. Over decades, traditional hybrid coding frameworks have utilized pixel-domain intra/inter prediction, transform, entropy coding, etc., to fulfill this purpose. Each coding tool is extensively examined under a specific codec structure to carefully justify the trade-off between R-D efficiency and complexity. This process led to the creation of well-known international or industry standards, such as the H.264/AVC, H.265/HEVC, AV1, etc.

On the other hand, DNNs have demonstrated a powerful capacity for video spatio-temporal feature representation for vision tasks, such as object segmentation, tracking, etc. This naturally raises the question of whether it is possible to encode those spatio-temporal features in a compact format for efficient lossy compression.

Recently, we have witnessed the growth of video coding technologies that rely completely on end-to-end supervised learning. Most learned schemes still closely follow the conventional intra/inter frame definition by which different algorithms are investigated to efficiently represent the intra spatial textures, inter motion, and the inter residuals (if applicable) [104], [129]–[131]. Raw video frames are fed into stacked DNNs to extract, activate, and aggregate appropriate compact features (at the bottleneck layer) for quantization and entropy coding. Similarly, R-D optimization is also facilitated to balance the rate and distortion trade-off. In the following paragraphs, we will briefly review the aforementioned key components.

1) Nonlinear Transform and Quantization: The autoencoder or variational autoencoder (VAE) architectures are typically used to transform the intra texture or inter residual into compressible features.

For example, Toderici et al. [132] first applied fully-connected recurrent autoencoders for variable-rate thumbnail image compression. Their work was then improved in [133], [134] with the support of full-resolution images, unequal bit allocation, etc. Variable bit rate is intrinsically enabled by these recurrent structures. The recurrent autoencoders, however, suffer from higher computational complexity at higher bit rates, because more recurrent processing is desired. Alternatively, convolutional autoencoders have been extensively studied in past years, where different bit rates are adapted by setting a variety of λs to optimize the R-D trade-off. Note that different network models may be required for individual bit rates, making hardware implementation challenging (e.g., model switch from one bit rate to another). Recently, conditional convolution [135] and scaling factors [136] were proposed to enable variable-rate compression using a single or very limited number of network models without noticeable coding efficiency loss, which makes convolutional autoencoders more attractive for practical applications.

To generate a more compact feature representation, Balle et al. [105] suggested replacing the traditional nonlinear activation, e.g., ReLU, with generalized divisive normalization (GDN), which is theoretically proven to be more consistent with human visual perception. A subsequent study [137] revealed that GDN outperformed other nonlinear rectifiers, such as ReLU, leakyReLU, and tanh, in compression tasks. Several follow-up studies [138], [139] directly applied GDN in their networks for compression exploration.

Quantization is a non-differentiable operation, basically converting arbitrary elements into symbols with a limited alphabet for efficient entropy coding in compression. Quantization must be derivable in the end-to-end learning framework for back propagation. A number of methods, such as adding uniform noise [105], stochastic rounding [132] and soft-to-hard vector quantization [140], were developed to approximate a continuous distribution for differentiation.
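The two most common differentiable proxies mentioned above can be written in a few lines. The sketch below (our own wrapper names) adds uniform noise during training, as in [105], and uses hard rounding with a straight-through gradient otherwise; it is illustrative rather than a faithful reproduction of any cited scheme.

```python
# Differentiable quantization proxies: additive uniform noise (training) and
# straight-through hard rounding (inference). Wrapper name is our own.
import torch

def quantize(latent, training):
    if training:
        # Add U(-0.5, 0.5) noise so gradients can flow through "quantization".
        return latent + torch.empty_like(latent).uniform_(-0.5, 0.5)
    # Hard rounding; the detach trick keeps the backward pass as identity.
    return latent + (torch.round(latent) - latent).detach()

x = torch.randn(2, 8, 4, 4, requires_grad=True)
y_train = quantize(x, training=True)
y_test = quantize(x, training=False)
print(y_train.shape, torch.allclose(y_test, torch.round(x.detach())))
```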
2) Motion Representation: Chen et al. [104] developed the DeepCoder where a simple convolutional autoencoder was applied for both intra and residual coding at fixed 32×32 blocks, and block-based motion estimation in traditional video coding was re-used for temporal compensation. Lu et al. [141] introduced the optical flow for motion representation in their DVC work, which, together with the intra coding in [142], demonstrated similar performance compared with the H.265/HEVC. However, coding efficiency suffered from a sharp loss at low bit rates. Liu et al. [143] extended their non-local attention optimized image compression (NLAIC) for intra and residual encoding, and applied second-order flow-to-flow prediction for more compact motion representation, showing consistent rate-distortion gains across different contents and bit rates.

Motion can also be implicitly inferred via temporal interpolation. For example, Wu et al. [144] applied RNN-based frame interpolation. Together with the residual compensation, RNN-based frame interpolation offered comparable performance to the H.264/AVC. Djelouah et al. [145] furthered interpolation-based video coding by utilizing advanced optical flow estimation and feature domain residual coding. However, temporal interpolation usually led to an inevitable structural coding delay.

Another interesting exploration made by Rippel et al. in [130] was to jointly encode motion flow and residual using compound features, where a recurrent state was embedded to aggregate multi-frame information for efficient flow generation and residual coding.

3) R-D Optimization: Li et al. [146] utilized a separate three-layer CNN to generate an importance map for spatial-complexity-based adaptive bit allocation, leading to noticeable subjective quality improvement. Mentzer et al. [140] further utilized the masked bottleneck layer to unequally weight features at different spatial locations. Such importance map embedding is a straightforward approach to end-to-end training. Importance derivation was later improved with the non-local attention [147] mechanism to efficiently and implicitly capture both global and local significance for better compression performance [136].

Probabilistic models play a vital role in data compression. Assuming the Gaussian distribution for feature elements, Balle et al. [142] utilized hyper priors to estimate the parameters of the Gaussian scale model (GSM) for latent features. Later, Hu et al. [148] used hierarchical hyper priors (coarse-to-fine) to improve the entropy models in multiscale representations. Minnen et al. [149] improved the context modeling using joint autoregressive spatial neighbors and hyper priors based on the Gaussian mixture model (GMM). Autoregressive spatial priors are commonly fused by PixelCNNs or PixelRNNs [150]. Reed et al. [151] further introduced multiscale PixelCNNs, yielding competitive density estimation and a great boost in speed (e.g., from O(N) to O(log N)). Prior aggregation was later extended from 2D architectures to 3D PixelCNNs [140]. Channel-wise weight sharing-based 3D implementations could greatly reduce network parameters without performance loss. A parallel 3D PixelCNN for practical decoding is presented in Chen et al. [136]. Previous methods accumulated all the priors to estimate the probability based on a single GMM assumption for each element. Recent studies have shown that weighted GMMs can further improve coding efficiency [152], [153].

Pixel-error losses, such as the MSE, are among the most popular loss functions used. Concurrently, SSIM (or MS-SSIM) was also adopted because of its greater consistency with visual perception. Simulations revealed that SSIM-based loss can improve reconstruction quality, especially at low bit rates. Towards perceptual-optimized encoding, perceptual losses that were measured by adversarial loss [154]–[156] and VGG loss [157] were embedded in learning to produce visually appealing results.

Though E2E-NVC is still in its infancy, its fast growing R-D efficiency holds a great deal of promise. This is especially true, given that we can expect neural processors to be deployed massively in the near future [158].

IV. OVERVIEW OF DNN-BASED POST-PROCESSING

Compression artifacts are inevitably present in both traditional hybrid coding frameworks and learned compression approaches, e.g., blockiness, ringing, cartoonishness, etc., severely impairing visual sensation and QoE. Thus, quality enhancement filters are often applied as a post-filtering step or in-loop module to alleviate compression distortions. Towards this goal, adaptive filters are usually developed to minimize the error between original and distorted samples.

A. In-loop Filtering

Existing video standards mainly utilize in-loop filters to improve the subjective quality of reconstruction, and also to offer better R-D efficiency due to enhanced references. Examples include deblocking [24], sample adaptive offset (SAO) [25], the constrained directional enhancement filter (CDEF) [159], loop-restoration (LR) [160], the adaptive loop filter (ALF) [161], etc.

Recently, numerous CNN models have been developed for in-loop filtering via a data-driven approach to learn the mapping functions. It is worth pointing out that prediction relationships must be carefully examined when designing in-loop filters, due to the frame referencing structure and potential error propagation. Both intra and inter predictions are utilized in popular video encoders, where an intra-coded frame only exploits the spatial redundancy within the current frame, while an inter-coded frame jointly explores the spatio-temporal correlations across frames over time.

Earlier explorations of this subject have mainly focused on designing DNN-based filters for intra-coded frames, particularly by trading network depth and parameters for better coding efficiency. For example, IFCNN [162] and VRCNN [163] are shallow networks with ≈50,000 parameters, providing up to 5% BD-Rate savings for the H.265/HEVC intra encoder. More gains can be obtained if we use a deeper and denser network [164]–[166], e.g., a 5.7% BD-Rate gain reported in [164] by using a model with 3,340,000 parameters, and an 8.50% BD-Rate saving obtained in [167] by using a model with 2,298,160 parameters. The more parameters a model has, the more complex it is. Unfortunately, greater complexity limits the network’s potential for practical application. Such intra-frame-based in-loop filters treat decoded frames equally, without the consideration of in-loop inter-prediction dependency.
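To give a feel for the scale of these intra-frame restoration networks, the sketch below defines a shallow residual CNN in the spirit of IFCNN/VRCNN [162], [163]. The layer widths are assumptions for illustration and do not reproduce the cited architectures or their parameter counts.

```python
# Hedged sketch of a shallow in-loop/post restoration CNN: a few convolutions
# predict the artifact residual of a decoded frame. Widths are illustrative.
import torch
import torch.nn as nn

class ShallowRestorationCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, decoded):
        return decoded + self.features(decoded)   # residual correction

decoded = torch.rand(1, 1, 128, 128)               # decoded luma frame stand-in
print(ShallowRestorationCNN()(decoded).shape)
```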
Nevertheless, the aforementioned networks can be used in post-filtering out of the coding loop.

It is necessary to include temporal prediction dependency while designing in-loop CNN-based filters for inter-frame coding. Some studies leveraged prior knowledge from the encoding process to assist the CNN training and inference. For example, Jia et al. [168] incorporated the co-located block information for in-loop filtering. Meng et al. [169] utilized the coding unit partition for further performance improvement. Li et al. [170] input both the reconstructed frame and the difference between the reconstructed and predicted pixels to improve the coding efficiency. Applying prior knowledge in learning may improve the coding performance, but it further complicates the CNN model by involving additional information in the networks. On the other hand, the contribution of this prior knowledge is quite limited because such additional priors are already implicitly embedded in the reconstructed frame.

If a CNN-based in-loop filter is applied to frame I0, the impact will be gradually propagated to frame I1 that has frame I0 as the reference. Subsequently, I1 is the reference of I2, and so on and so forth³. If frame I1 is filtered again by the same CNN model, an over-filtering problem will be triggered, resulting in severely degraded performance, as analyzed in [171]. To overcome this challenging problem, a CNN model called SimNet was built to carry the relationship between the reconstructed frame and its original frame in [172] to adaptively skip filtering operations in inter coding. SimNet reported 7.27% and 5.57% BD-Rate savings for intra- and inter-coding of AV1, respectively. A similar skipping strategy was suggested by Chen et al. [173] to enable a wide activation residual network, yielding 14.42% and 9.64% BD-Rate savings for respective intra- and inter-coding on the AV1 platform.

³ Even though more advanced inter referencing strategies can be devised, inter propagation-based behavior remains the same.

Alternative solutions resort to the more expensive R-D optimization to avoid the over-filtering problem. For example, Yin et al. [174] developed three sets of CNN filters for luma and chroma components, where the R-D optimal CNN model is used and signaled in the bitstream. Similar ideas are developed in [175], [176] as well, in which multiple CNN models are trained and the R-D optimal model is selected for inference.

It is impractical to use deeper and denser CNN models in applications. It is also very expensive to conduct R-D optimization to choose the optimal one from a set of pre-trained models. Note that a limited number of pre-trained models is theoretically insufficient to be generalized for large-scale video samples. To this end, in Section VII-A, we introduce a guided-CNN scheme which adapts shallow CNN models according to the characteristics of the input video content.

B. Post Filtering

Post filtering is generally applied to the compressed frames at the decoder side to further enhance the video quality for better QoE.

Previous in-loop filters designated for intra-coded frames can be re-used for single-frame post-filtering [163], [177]–[185]. Appropriate re-training may be applied in order to better capture the data characteristics. However, single-frame post-filtering may introduce quality fluctuation across frames. This may be due to the limited capacity of CNN models to deal with a great amount of video content. Thus, multi-frame post filtering can be devised to massively exploit the correlation across successive temporal frames. By doing so, it not only greatly improves the single-frame solution, but also offers better temporal quality over time.

Typically, a two-step strategy is applied for multi-frame post filtering. First, neighboring frames are aligned to the current frame via (pixel-level) motion estimation and compensation (MEMC). Then, all aligned frames are fed into networks for high-quality reconstruction. Thus, the accuracy of MEMC greatly affects reconstruction performance. In applications, learned optical flows, such as FlowNet [186], FlowNet2 [187], PWC-Net [188], and TOFlow [189], are widely used.

Some exploration has already been made in this arena: Bao et al. [190] and Wang et al. [191] implemented a general video quality enhancement framework for denoising, deblocking, and super-resolution, where Bao et al. [190] employed the FlowNet and Wang et al. [191] used pyramid, cascading, and deformable convolutions to respectively align frames temporally. Meanwhile, Yang et al. [192] proposed a multi-frame quality enhancement framework called MFQE-1.0, in which a spatial transformer motion compensation (STMC) network is used for alignment, and a deep quality enhancement network (QE-net) is employed to improve reconstruction quality. Then, Guan et al. [193] upgraded MFQE-1.0 to MFQE-2.0 by replacing QE-net with a dense CNN model, leading to better performance and less complexity. Later on, Tong et al. [194] suggested using FlowNet2 in MFQE-1.0 for temporal frame alignment (instead of the default STMC), yielding a 0.23 dB PSNR gain over the original MFQE-1.0. Similarly, FlowNet2 is also used in [195] for improved efficiency.

All of these studies suggested the importance of temporal alignment in post filtering. Thus, in the subsequent case study (see Section VII-B), we first examine the efficiency of alignment, and then further discuss the contributions from respective intra-coded and inter-coded frames for the quality enhancement of the final reconstruction. This will help audiences gain a deeper understanding of similar post filtering techniques.
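The alignment step of the two-step strategy above can be summarized as warping a neighboring frame toward the current one with a dense flow field. The sketch below performs bilinear warping with torch.nn.functional.grid_sample (recent PyTorch assumed); the random flow is only a placeholder for the output of a FlowNet2- or PWC-Net-style estimator.

```python
# Flow-based temporal alignment sketch: warp a neighboring frame toward the
# current frame with a dense (dx, dy) flow field via bilinear sampling.
import torch
import torch.nn.functional as F

def warp_with_flow(frame, flow):
    """frame: (N,C,H,W); flow: (N,2,H,W) pixel displacements (dx, dy)."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0).expand(n, -1, -1, -1)
    coords = grid + flow
    # Normalize to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)
    return F.grid_sample(frame, sample_grid, mode="bilinear", align_corners=True)

neighbor = torch.rand(1, 1, 64, 64)   # a decoded neighboring frame
flow = torch.randn(1, 2, 64, 64)      # placeholder for a learned optical flow
print(warp_with_flow(neighbor, flow).shape)
```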
V. CASE STUDY FOR PRE-PROCESSING: SWITCHABLE TEXTURE-BASED VIDEO CODING

This section presents a switchable texture-based video pre-processing that leverages DNN-based semantic understanding for subsequent coding improvement. In short, we exploit DNNs to accurately segment “perceptually InSIGnificant” (pInSIG) texture areas to produce a corresponding pInSIG mask. In many instances, this mask drives the encoder to perform separately for pInSIG textures that are typically inferred without additional residuals, and “perceptually SIGnificant” (pSIG) areas elsewhere using the traditional hybrid coding method. This approach is implemented on top of the AV1 codec [196]–[198] by enabling the GoP-level switchable mechanism. This yields significant bit rate reduction at the same level of perceptual sensation in most standard test videos, in comparison to the AV1 anchor. However, some videos did cause the model to perform more poorly. One reason for this effect is that higher QP settings typically incur more all-zero residual blocks. Alternatively, texture mode is also content-dependent: a relatively small number of texture blocks may be present for some videos. Both scenarios limit the bit rate savings, and an overhead of extra bits is mandatory for global motion signaling, if texture mode is enabled.

Fig. 4: Texture mode and switchable control scheme. (a) Texture mode encoder implementation. (b) Switchable texture mode decision.

To address these problems, we introduce a switchable scheme to determine whether texture mode could be potentially enabled for a GoP or a GF group. The criteria for switching are based on the texture region percentage that is calculated as the average ratio of texture blocks in B-frames, and on the potential bit rate savings with or without texture mode. Figure 4b illustrates the switchable texture mode decision. Currently, we use bit rate saving as a criterion for switch decisions when the texture mode is enabled. This assumes perceptual sensation will remain nearly the same, since these texture blocks are perceptually insignificant.

C. Experimental Results

[...] test sequences and the more challenging YouTube UGC data set⁴ [199]. The YouTube UGC dataset is a sample selected from thousands of User Generated Content (UGC) videos uploaded to YouTube. The names of the UGC videos follow the format Category_Resolution_UniqueID. We calculate the bit rate savings at different QP values for 150 frames of the test sequences. In our experiments, we used the following parameters for the AV1 codec⁵ as the baseline: 8-frame GoP or GF group using the random access configuration; 30 FPS; constant quality rate control policy; multi-layer coding structure for all GF groups; maximum intra frame interval at 150. We evaluate the performance of our proposed method in terms of bit rate savings and perceived quality.

⁵ AV1 codec change-Id: Ibed6015aa7cce12fcc6f314ffde76624df4ad2a1

1) Coding Performance: To evaluate the performance of the proposed switchable texture mode method, bit rate savings at four quantization levels (QP = 16, 24, 32, 40) are calculated for each test sequence in comparison to the AV1 baseline.

Texture Analysis. We compare two DNN-based texture analysis methods [210], [212] with a handcrafted feature-based approach [211] for selected standard test sequences. Results are shown in Table II. A positive bit rate saving (%) indicates a reduction compared with the AV1 baseline. Compared to the feature-based approach, DNN-based methods show improved performance in terms of bit rate saving. The feature-based approach relies on color and edge information to generate the texture mask and is less accurate and consistent both spatially and temporally. Therefore, the number of blocks that are reconstructed using texture mode is usually much smaller than that of DNN-based methods. Note that the parameters used in the feature-based approach require manual tuning for each video to optimize the texture analysis output. The pixel-level segmentation [210] shows further advantages compared with the block-level method [212], since the CNN model does not require the block size to be fixed.

Switchable Scheme. We also compare the proposed method, a.k.a. tex-switch, with our previous work in [210], a.k.a. tex-allgf, which enables texture mode for all frames in a GF group. All three methods use the same encoder setting for fair comparison. Bit rate saving results for various videos at different resolutions against the AV1 baseline are shown in Table III. A positive bit rate saving (%) indicates a reduction compared with the AV1 baseline.

In general, compared to the AV1 baseline, the coding performance of tex-allgf shows significant bit rate savings at lower QPs. However, as QP increases, the savings are diminished. In some cases, tex-allgf exhibits poorer coding performance than the AV1 baseline at a high QP (e.g., negative numbers at QP 40). At a high QP, most blocks have zero residual due to heavy quantization, leading to very limited margins for bit rate savings using texture mode. In addition, a few extra bits are required for the signalling of the global motion of texture mode coded blocks. The bit savings gained through residual skipping in texture mode still cannot compensate for the bits used as overhead for the side information.

Furthermore, the proposed tex-switch method retains the greatest bit rate savings offered by tex-allgf, and resolves the
TABLE II: Bit rate saving (%) comparison between handcrafted feature (FM) [211], block-level DNN (BM) [212], and pixel-level DNN (PM) [210] texture analysis against the AV1 baseline for selected standard test sequences using the tex-allgf method.
QP=16 (%) QP=24 (%) QP=32 (%) QP=40 (%)
Video Sequence
FM BM PM FM BM PM FM BM PM FM BM PM
Coastguard −0.17 7.80 9.14 −0.36 6.99 8.01 −0.43 4.70 5.72 −0.62 1.90 2.13
Flower 7.42 10.55 13.00 5.42 8.66 10.78 2.51 5.96 4.95 0.19 3.38 1.20
Waterfall 3.65 4.63 13.11 1.58 3.96 7.21 −0.14 −0.33 1.30 −3.00 −3.74 −3.48
Netflix aerial 1.15 8.59 9.15 −0.26 2.15 5.59 −1.32 −0.68 1.05 −2.10 −4.59 −4.01
Intotree 0.88 5.32 9.71 0.15 4.32 9.42 −0.14 1.99 8.46 −0.26 −2.83 4.92
TABLE III: Bit rate saving (%) comparison for tex-allgf and tex-switch methods against the AV1 baseline.
QP=16 (%) QP=24 (%) QP=32 (%) QP=40 (%)
Resolution Video Sequence
tex-allgf tex-switch tex-allgf tex-switch tex-allgf tex-switch tex-allgf tex-switch
Bridgeclose 15.78 15.78 10.87 10.87 4.21 4.21 2.77 2.77
Bridgefar 10.68 10.68 8.56 8.56 6.34 6.34 6.01 6.01
CIF Coastguard 9.14 9.14 8.01 8.01 5.72 5.72 2.13 2.13
Flower 13.00 13.00 10.78 10.78 4.95 4.95 1.20 1.20
Waterfall 13.11 13.11 7.21 7.21 1.30 1.30 −3.48 0.00
512×270 Netflix aerial 9.15 9.15 5.59 5.59 1.05 1.05 −4.01 0.00
NewsClip 360P-1e1c 10.77 10.77 9.27 9.27 5.23 5.23 1.54 1.54
NewsClip 360P-22ce 17.37 17.37 15.79 15.79 16.37 16.37 17.98 17.98
360P TelevisionClip 360P-3b9a 1.45 1.45 0.48 0.48 −1.09 0.00 −3.26 0.00
TelevisionClip 360P-74dd 1.66 1.66 1.17 1.17 0.36 0.36 −0.37 0.00
HowTo 480P-04f1 3.81 3.81 2.57 2.57 0.93 0.93 0.06 0.36
HowTo 480P-4c99 2.36 2.36 1.67 1.67 0.37 0.00 −1.16 0.00
MusicVideo 480P-1eee 3.31 3.31 3.29 3.29 2.53 2.53 −0.30 −0.30
480P
NewsClip 480P-15fa 6.31 6.31 6.05 5.79 0.53 0.11 −0.79 0.03
NewsClip 480P-7a0d 11.54 11.54 10.03 10.03 1.53 1.53 0.08 0.00
TelevisionClip 480P-19d3 3.13 3.13 2.86 2.86 1.66 1.66 0.58 0.00
HowTo 720P-0b01 12.72 12.72 11.84 11.84 9.31 9.31 6.35 6.35
720P MusicVideo 720P-3698 1.76 1.76 1.07 1.07 0.30 0.30 −0.17 0.00
MusicVideo 720P-4ad2 6.93 6.93 3.81 3.81 1.87 1.87 0.60 0.11
HowTo 1080P-4d7b 7.31 7.31 6.07 6.07 3.21 3.21 0.72 0.72
1080P MusicVideo 1080P-55af 3.88 3.88 1.78 1.78 0.31 0.33 −0.99 −0.68
intotree 9.71 9.71 9.42 9.42 8.46 8.46 4.92 4.92
Average 7.96 7.96 6.28 6.27 3.38 3.40 1.45 2.05
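Since BD-Rate and BD-PSNR are quoted throughout this article, a self-contained sketch of the usual cubic-fit BD-Rate computation is given below. The four R-D points are fabricated solely to demonstrate the call; a negative output means the test codec saves bits at equal quality.

```python
# Standalone Bjøntegaard Delta-Rate (BD-Rate) sketch using the common
# cubic polynomial fit of log-rate vs. quality. Sample points are made up.
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bit rate difference (%) of the test codec w.r.t. the anchor."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)   # log-rate as cubic in quality
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0

rate_a = [1000, 2000, 4000, 8000]; psnr_a = [33.0, 36.0, 39.0, 42.0]
rate_t = [900, 1800, 3600, 7200];  psnr_t = [33.1, 36.1, 39.1, 42.1]
print(f"BD-Rate: {bd_rate(rate_a, psnr_a, rate_t, psnr_t):.2f}%")
```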
D. Discussion And Future Direction

The proposed method can achieve noticeable bit rate reduction with satisfying visual quality for both standard test sets and user generated content, which is verified by a subjective study. We envision that video coding driven by semantic understanding will continue to improve in terms of both quality and bit rate, especially by leveraging advances of deep learning methods. However, there remain several open challenges that require further investigation.

Accuracy of region analysis is one of the major challenges for integrating semantic understanding into video coding. However, recent advances in scene understanding have significantly improved the performance of region analysis. Visual artifacts are still noticeable when a non-texture region is incorrectly included in the texture mask, particularly if the analysis/synthesis coding system is open loop. One potential solution is to incorporate some perceptual visual quality measures in-loop during the texture region reconstruction.

Video segmentation benchmark datasets are important for developing machine learning methods for video-based semantic understanding. Existing segmentation datasets are either based on images with texture [214], or contain general video objects only [215], [216], or focus on visual quality but lack segmentation ground truth.

VI. CASE STUDY FOR CODING: END-TO-END NEURAL VIDEO CODING (E2E-NVC)

This section presents a framework for end-to-end neural video coding. We include a discussion of its key components, as well as its overall efficiency. Our proposed method is extended from our pioneering work in [104] but with significant performance improvements by allowing fully end-to-end learning-based spatio-temporal feature representation. More details can be found in [131], [136], [217].

Fig. 6: End-to-End Neural Video Coding (E2E-NVC). The E2E-NVC in (a) consists of modularized intra and inter coding, where inter coding utilizes respective motion and residual coding. Each component is well exploited using a stacked CNNs-based VAE for efficient representations of intra pixels, displaced inter residuals, and inter motions. All modularized components are inter-connected and optimized in an end-to-end manner. (b) The general VAE model applies stacked convolutions (e.g., 5×5) with main encoder-decoder (Em, Dm) and hyper encoder-decoder pairs (Eh, Dh), where the main encoder Em includes four major convolutional layers (e.g., convolutional downsampling and three residual blocks (×3) for robust feature processing [201]). The hyper decoder Dh mirrors the steps in the hyper encoder Eh for hyper prior information generation. The prior aggregation (PA) engine collects the information from the hyper prior, autoregressive spatial neighbors, as well as temporal correspondences (if applicable) for the main decoder Dm to reconstruct the input scene. Non-local attention is adopted to simulate the saliency masking at bottlenecks, and the rectified linear unit (ReLU) is implicitly embedded with convolutions for enabling the nonlinearity. “Q” is for quantization, AE and AD for respective arithmetic encoding and decoding. 2↓ and 2↑ are downsampling and upsampling at a factor of 2 for both horizontal and vertical dimensions.

A. Framework

As with all modern video encoders, the proposed E2E-NVC compresses the first frame in each group of pictures as an intra-frame using a VAE based compression engine (neuro-Intra). It codes the remaining frames in each group using motion compensated prediction. As shown in Fig. 6a, the proposed E2E-NVC uses the VAE compressor (neuro-Motion) to generate the multiscale motion field between the current frame and the reference frame. Then, a multiscale motion compensation network (MS-MCN) takes multiscale compressed flows, warps the multiscale features of the reference frame, and combines these warped features to generate the predicted frame. The
MS-MCN takes the multiscale optical flows f_d^1, f_d^2, ..., f_d^s derived by the pyramid decoder in neuro-Motion, and then uses them to generate the predicted frame X̂_p2 by multiscale motion compensation. The displaced inter-residual r_2 = X_2 − X̂_p2 is then compressed in neuro-Res, yielding the reconstruction r̂_2. The final reconstruction X̂_2 is given by X̂_2 = X̂_p2 + r̂_2. All of the remaining P-frames in the group of pictures are then encoded using the same procedure.

Fig. 7: Efficiency of neuro-Intra. PSNR vs. rate performance of neuro-Intra in comparison to NLAIC [136], Minnen (2018) [149], BPG (4:4:4) and JPEG2000. Note that the curves for neuro-Intra and NLAIC overlap.

Fig. 6b illustrates the general architecture of the VAE model. The VAE model includes a main encoder-decoder pair that is used for latent feature analysis and synthesis, as well as a hyper encoder-decoder for hyper prior generation. The main encoder Em uses four stacked CNN layers. Each convolutional layer employs stride convolutions to achieve downsampling (at a factor of 2 in this example) and cascaded convolutions for efficient feature extraction (here, we use three ResNet-based residual blocks [201])⁶. We use a two-layer hyper encoder Eh to further generate the subsequent hyper priors as side information, which is used in the entropy coding of the latent features.

We apply stacked convolutional layers with a limited (3×3) receptive field to capture the spatial locality. These convolutional layers are stacked in order to simulate layer-wise feature extraction. These same ideas are used in many relevant studies [142], [149]. We utilize the simplest ReLU as the nonlinear activation function (although other nonlinear activation functions, such as the generalized divisive normalization in [105], could be used as well).

The human visual system operates in two stages: First, the observer scans an entire scene to gain a complete understanding of everything within the field of vision. Second, the observer focuses their attention on specific salient regions. During image and video compression, this mechanism of visual attention can be used to ensure that bit resources are allocated where they are most needed (e.g., via unequal feature quantization) [140], [218]. This allows resources to be assigned such that salient areas are more accurately reconstructed, while resources are conserved in the reconstruction of less-salient areas. To more accurately discern salient from non-salient areas, we adopt the non-local attention module (NLAM) at the bottleneck layers of both the main encoder and hyper encoder, prior to quantization, in order to include both global and local information.

To enable more accurate conditional probability density modeling for entropy coding of the latent features, we introduce the Prior Aggregation (PA) engine which fuses the inputs from the hyper priors, spatial neighbors, and temporal context (if applicable)⁷. Information theory suggests that more accurate context modeling requires fewer resources (e.g., bits) to represent information [219]. For the sake of simplicity, we assume the latent features (e.g., motion, image pixel, residual) follow the Gaussian distribution as in [148], [149]. We use the PA engine to derive the mean and standard deviation of the distribution for each feature.

B. Neural Intra Coding

Our neuro-Intra is a simplified version of the Non-Local Attention optimized Image Compression (NLAIC) that was originally proposed in [136].

One major difference between the NLAIC and the VAE model using autoregressive spatial context in [149] is the introduction of the NLAM inspired by [220]. In addition, we have applied a 3D 5×5×5 masked CNN⁸ to extract spatial priors, which are fused with the hyper priors in the PA for entropy context modeling (e.g., the bottom part of Fig. 9). Here, we have assumed the single Gaussian distribution for the context modeling of entropy coding. Note that temporal priors are not used for intra-pixel and inter-residual coding in this paper, which only utilizes the spatial priors.

The original NLAIC applies multiple NLAMs in both main and hyper coders, leading to excessive memory consumption at a large spatial scale. In E2E-NVC, NLAMs are only used at the bottleneck layers for both main and hyper encoder-decoder pairs, allowing bits to be allocated adaptively.

⁶ We choose to apply cascaded ResNets for stacked CNNs because they are highly efficient and reliable. Other efficient CNN architectures could also be applied.
⁷ Intra and residual coding only use joint spatial and hyper priors without temporal inference.
⁸ This 5×5×5 convolutional kernel shares the same parameters for all channels, offering great model complexity reduction as compared with the 2D CNN-based solution in [149].
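The autoregressive spatial prior mentioned above relies on masked ("causal") convolutions so that each latent element only sees already-decoded neighbors. The PyTorch sketch below shows the standard 2D masked-convolution idea; it does not reproduce the 3D 5×5×5 kernel with shared channel parameters used in our model.

```python
# PixelCNN-style masked 2D convolution sketch for autoregressive spatial
# context modeling; the real model above uses a 3D masked kernel instead.
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.kernel_size
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2:] = 0      # block the current position and later ones
        mask[kh // 2 + 1:, :] = 0        # block all rows below the center
        self.register_buffer("mask", mask)

    def forward(self, x):
        return nn.functional.conv2d(
            x, self.weight * self.mask, self.bias, self.stride,
            self.padding, self.dilation, self.groups)

latents = torch.randn(1, 8, 16, 16)                      # quantized latent features
ctx = MaskedConv2d(8, 16, kernel_size=5, padding=2)(latents)
print(ctx.shape)                                          # torch.Size([1, 16, 16, 16])
```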
Fig. 8: Multiscale Motion Estimation and Compensation. One-stage neuro-Motion with MS-MCN uses a pyramidal flow
decoder to synthesize the multiscale compressed optical flows (MCFs) that are used in a multiscale motion compensation
network for generating predicted frames.
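As a small, self-contained sketch of the pyramidal flow idea in Fig. 8, the snippet below builds a multiscale flow field by bilinearly downscaling a full-resolution flow and rescaling its displacements with the resolution. The function and variable names are our own; the actual MS-MCN generates the multiscale compressed flows with a learned pyramid decoder rather than by simple downscaling.

```python
# Build a finest-to-coarsest flow pyramid from a full-resolution flow field.
import torch
import torch.nn.functional as F

def flow_pyramid(flow_full, num_scales=4):
    """flow_full: (N,2,H,W) pixel displacements; returns finest-to-coarsest list."""
    pyramid = [flow_full]
    for s in range(1, num_scales):
        down = F.interpolate(flow_full, scale_factor=1 / 2 ** s,
                             mode="bilinear", align_corners=False)
        pyramid.append(down / 2 ** s)   # displacements shrink with the resolution
    return pyramid

flows = flow_pyramid(torch.randn(1, 2, 64, 64))
print([tuple(f.shape) for f in flows])   # (1,2,64,64), (1,2,32,32), (1,2,16,16), (1,2,8,8)
```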
p_{F|(F_1,...,F_{i-1},ẑ_t,h_{t-1})}(F_i | F_1, ..., F_{i-1}, ẑ_t, h_{t-1}) = ∏_i ( N(µ_F, σ_F²) ∗ U(−1/2, 1/2) )(F_i).    (2)
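Numerically, each factor in Eq. (2) is the probability mass that a Gaussian, convolved with a unit-width uniform, assigns to one quantization bin, i.e., a CDF difference over the bin. The sketch below evaluates one such factor; the mean and scale values are made up for demonstration.

```python
# Evaluate one factor of Eq. (2): (N(mu, sigma^2) * U(-1/2, 1/2))(f_hat),
# which equals the Gaussian CDF difference over the unit-width bin around f_hat.
import math

def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bin_probability(f_hat, mu, sigma):
    return gaussian_cdf(f_hat + 0.5, mu, sigma) - gaussian_cdf(f_hat - 0.5, mu, sigma)

# Per-element probabilities like this multiply into the product over i in Eq. (2).
print(f"{bin_probability(f_hat=2.0, mu=1.7, sigma=0.9):.4f}")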
Note that TUM is applied to embedded current quantized