
Advances in Video Compression System Using Deep Neural Network: A Review and Case Studies

Dandan Ding*, Member, IEEE, Zhan Ma*, Senior Member, IEEE, Di Chen, Member, IEEE, Qingshuang Chen, Member, IEEE, Zoe Liu, and Fengqing Zhu, Senior Member, IEEE

D. Ding is with the School of Information Science and Engineering, Hangzhou Normal University, Hangzhou, Zhejiang, China. Z. Ma is with the School of Electronic Science and Engineering, Nanjing University, Nanjing, Jiangsu, China. D. Chen, Q. Chen, and F. Zhu are with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana, USA. Z. Liu is with Visionular Inc, 280 2nd St., Los Altos, CA, USA. *These authors contributed equally.

arXiv:2101.06341v1 [eess.IV] 16 Jan 2021

Abstract—Significant advances in video compression system have been made in the past several decades to satisfy the nearly exponential growth of Internet-scale video traffic. From the application perspective, we have identified three major functional blocks, including pre-processing, coding, and post-processing, that have been continuously investigated to maximize the end-user quality of experience (QoE) under a limited bit rate budget. Recently, artificial intelligence (AI) powered techniques have shown great potential to further increase the efficiency of the aforementioned functional blocks, both individually and jointly. In this article, we review extensively recent technical advances in video compression system, with an emphasis on deep neural network (DNN)-based approaches, and then present three comprehensive case studies. On pre-processing, we show a switchable texture-based video coding example that leverages DNN-based scene understanding to extract semantic areas for the improvement of the subsequent video coder. On coding, we present an end-to-end neural video coding framework that takes advantage of stacked DNNs to efficiently and compactly code input raw videos via fully data-driven learning. On post-processing, we demonstrate two neural adaptive filters to respectively facilitate the in-loop and post filtering for the enhancement of compressed frames. Finally, a companion website hosting the contents developed in this work can be accessed publicly at https://purdueviper.github.io/dnn-coding/.

Index Terms—Deep Neural Networks, Texture Analysis, Neural Video Coding, Adaptive Filters

I. INTRODUCTION

In recent years, Internet traffic has been dominated by a wide range of applications involving video, including video on demand (VOD), live streaming, ultra-low latency real-time communications, etc. With ever increasing demands in resolution (e.g., 4K, 8K, gigapixel [1], high speed [2]) and fidelity (e.g., high dynamic range [3], and higher bit precision or bit depth [4]), more efficient video compression is imperative for content transmission and storage, by which networked video services can be successfully deployed. Fundamentally, video compression systems devise appropriate algorithms to minimize the end-to-end reconstruction distortion (or maximize the quality of experience (QoE)) under a given bit rate budget. This is a classical rate-distortion (R-D) optimization problem. In the past, the majority of effort had been focused on the development and standardization of video coding tools for optimized R-D performance, such as the intra/inter prediction, transform, entropy coding, etc., resulting in a number of popular standards and recommendation specifications (e.g., ISO/IEC MPEG series [5]–[11], ITU-T H.26x series [9]–[13], AVS series [14]–[16], as well as the AV1 [17], [18] from the Alliance for Open Media (AOM) [19]). All these standards have been widely deployed in the market and have enabled advanced and high-performing services to both enterprises and consumers. They have been adopted to cover all major video scenarios from VOD, to live streaming, to ultra-low latency interactive real-time communications, and are used for applications such as telemedicine, distance learning, video conferencing, broadcasting, e-commerce, online gaming, short video platforms, etc. Meanwhile, the system R-D efficiency can also be improved from pre-processing and post-processing, individually and jointly, for content adaptive encoding (CAE). Notable examples include saliency detection for subsequent region-wise quantization control, and adaptive filters to alleviate compression distortions [20]–[22].

In this article, we therefore consider pre-processing, coding, and post-processing as three basic functional blocks of an end-to-end video compression system, and optimize them to provide a compact and high-quality representation of the input original video.

• The "coding" block is the core unit that converts raw pixels or pixel blocks into a binary bit representation. Over the past decades, the "coding" R-D efficiency has been gradually improved by introducing more advanced tools to better exploit spatial, temporal, and statistical redundancy [23]. Nevertheless, this process inevitably incurs compression artifacts, such as blockiness and ringing, due to the R-D trade-off, especially at low bit rates.

• The "post-processing" block is introduced to alleviate visually perceptible impairments produced as byproducts of coding. Post-processing mostly relies on designated adaptive filters to enhance the reconstructed video quality or QoE. Such "post-processing" filters can also be embedded into the "coding" loop to jointly improve reconstruction quality and R-D efficiency, e.g., in-loop deblocking [24] and sample adaptive offset (SAO) [25].

• The "pre-processing" block exploits the discriminative content preference of the human visual system (HVS), caused by the non-linear response and frequency selectivity (e.g., masking) of visual neurons in the visual pathway. Pre-processing can extract content semantics (e.g., saliency, object instance) to improve the psychovisual performance of the "coding" block, for example, by allocating unequal qualities (UEQ) across different areas according to pre-processed cues [26].¹

¹ Although adaptive filters can also be used in pre-processing for pre-filtering, e.g., denoising, motion deblurring, contrast enhancement, edge detection, etc., our primary focus in this work will be on semantic content understanding for subsequent intelligent "coding".
Building upon the advancements in deep neural networks (DNN), numerous recently-created video processing algorithms have been greatly improved to achieve superior performance, mostly leveraging the powerful nonlinear representation capacity of DNNs. At the same time, we have also witnessed an explosive growth in the invention of DNN-based techniques for video compression from both academic research and industrial practices. For example, DNN-based filtering in post-processing was extensively studied when developing the VVC standard under the joint task force of ISO/IEC and ITU-T experts over the past three years. More recently, the standard committee issued a Call-for-Evidence (CfE) [27], [28] to encourage the exploration of deep learning-based video coding solutions beyond VVC.

In this article, we discuss recent advances in pre-processing, coding, and post-processing, with particular emphasis on the use of DNN-based approaches for efficient video compression. We aim to provide a comprehensive overview to bring readers up to date on recent advances in this emerging field. We also suggest promising directions for further exploration. As summarized in Fig. 1, we first dive into video pre-processing, emphasizing the analysis and application of content semantics, e.g., saliency, object, and texture characteristics, to video encoding. We then discuss recently-developed DNN-based video coding techniques for both modularized coding tool development and end-to-end fully learned framework exploration. Finally, we provide an overview of the adaptive filters that can be either embedded in the codec loop, or placed as a post enhancement to improve the final reconstruction. We also present three case studies, including 1) switchable texture-based video coding in pre-processing; 2) end-to-end neural video coding; and 3) efficient neural filtering, to provide examples of the potential of DNNs to improve both subjective and objective efficiency over traditional video compression methodologies.

The remainder of the article is organized as follows: From Section II to IV, we extensively review the advances in respective pre-processing, coding, and post-processing. Traditional methodologies are first briefly summarized, and then DNN-based approaches are discussed in detail. As the case studies, we propose three neural approaches in Sections V, VI, and VII, respectively. Regarding pre-processing, we develop a CNN-based texture analysis/synthesis scheme for the AV1 codec. For video compression, an end-to-end neural coding framework is developed. In our discussion of post-processing, we present different neural methods for in-loop and post filtering that can enhance the quality of reconstructed frames. Section VIII summarizes this work and discusses open challenges and future research directions. For your convenience, Table I provides an overview of abbreviations and acronyms that are frequently used throughout this paper.

TABLE I: Abbreviations and Annotations

AE: AutoEncoder
CNN: Convolutional Neural Network
CONV: Convolution
ConvLSTM: Convolutional LSTM
DNN: Deep Neural Network
FCN: Fully-Connected Network
GAN: Generative Adversarial Network
LSTM: Long Short-Term Memory
RNN: Recurrent Neural Network
VAE: Variational AutoEncoder
BD-PSNR: Bjøntegaard Delta PSNR
BD-Rate: Bjøntegaard Delta Rate
GOP: Group of Pictures
MS-SSIM: Multiscale SSIM
MSE: Mean Squared Error
PSNR: Peak Signal-to-Noise Ratio
QP: Quantization Parameter
QoE: Quality of Experience
SSIM: Structural Similarity Index
UEQ: UnEqual Quality
VMAF: Video Multi-Method Assessment Fusion
AV1: AOMedia Video 1
AVS: Audio Video Standard
H.264/AVC: H.264/Advanced Video Coding
H.265/HEVC: H.265/High-Efficiency Video Coding
VVC: Versatile Video Coding
AOM: Alliance for Open Media
MPEG: Moving Picture Experts Group

II. OVERVIEW OF DNN-BASED VIDEO PRE-PROCESSING

Pre-processing techniques are generally applied prior to the video coding block, with the objective of guiding the video encoder to remove psychovisual redundancy and to maintain or improve visual quality, while simultaneously lowering bit rate consumption. One category of pre-processing techniques is the execution of pre-filtering operations. Recently, a number of deep learning-based pre-filtering approaches have been adopted for targeted coding optimization. These include denoising [29], [30], motion deblurring [31], [32], contrast enhancement [33], edge detection [34], [35], etc. Another important topic area is closely related to the analysis of video content semantics, e.g., object instance, saliency attention, texture distribution, etc., and its application to intelligent video coding. For the sake of simplicity, we refer to this group of techniques as "pre-processing" for the remainder of this paper. In our discussion below, we also limit our focus to saliency-based and analysis/synthesis-based approaches.

A. Saliency-Based Video Pre-processing

1) Saliency Prediction: Saliency is the quality of being particularly noticeable or important. Thus, the salient area refers to the region of an image that predominantly attracts the attention of subjects. This concept corresponds closely to the highly discriminative and selective behaviour displayed in visual neuronal processing [36], [37]. Content feature extraction, activation, suppression and aggregation also occur in the visual pathway [38].
[Fig. 1 block diagram: Original Videos → Pre-processing (Semantic Understanding) → Coding (Feature Representation) → Transmission → Post-processing (Quality Enhancement) → Reconstructed Videos, with Case Study I (Switchable Texture-based Video Coding), Case Study II (End-to-End Neural Video Coding), and Case Study III (Neural Adaptive Filtering, In-loop & Post) attached to the three stages.]
Fig. 1: Topic Outline. This article reviews DNN-based techniques used in pre-processing, coding, and post-processing of a
practical video compression system. The “pre-processing” module leverages content semantics (e.g., texture) to guide video
coding, followed by the “coding” step to represent the video content using more compact spatio-temporal features. Finally,
quality enhancement is applied in “post-processing” to improve reconstruction quality by alleviating processing artifacts.
Companion case studies are respectively offered to showcase the potential of DNN algorithms in video compression.

Earlier attempts to predict saliency typically utilized handcrafted image features, such as color, intensity, and orientation contrast [39], motion contrast [40], camera motion [41], etc.

Later on, DNN-based semantic-level features were extensively investigated for both image content [42]–[48] and video sequences [49]–[55]. Among these features, image saliency prediction only exploits spatial information, while video saliency prediction often relies on spatial and temporal attributes jointly. One typical example of video saliency is a moving object that incurs spatio-temporal dynamics over time, and is therefore more likely to attract users' attention. For example, Bazzani et al. [49] modeled the spatial relations in videos using 3D convolutional features and the temporal consistency with a convolutional long short-term memory (LSTM) network. Bak et al. [50] applied a two-stream network that exploited different fusion mechanisms to effectively integrate spatial and temporal information. Sun et al. [51] proposed a step-gained FCN to combine the time-domain memory information and space-domain motion components. Jiang et al. [52] developed an object-to-motion CNN that was applied together with an LSTM network. All of these efforts to efficiently predict video saliency leveraged spatio-temporal attributes. More details regarding the spatio-temporal saliency models for video content can be found in [56].
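To make the spatio-temporal fusion idea concrete, the following is a minimal PyTorch sketch of a two-stream video saliency predictor in the spirit of the works above. The layer sizes, the late-fusion choice, and the use of a luma frame-difference map as the temporal input are illustrative assumptions, not a reproduction of any specific published architecture.

import torch
import torch.nn as nn

class TwoStreamSaliency(nn.Module):
    """Toy two-stream saliency predictor: a spatial stream over the current
    frame and a temporal stream over a frame-difference map, fused late."""
    def __init__(self, ch=32):
        super().__init__()
        def stream(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.spatial = stream(3)     # RGB frame
        self.temporal = stream(1)    # luma frame difference as a cheap motion cue
        self.fuse = nn.Conv2d(2 * ch, 1, 1)   # late fusion + 1x1 projection

    def forward(self, frame, prev_frame):
        diff = (frame.mean(dim=1, keepdim=True)
                - prev_frame.mean(dim=1, keepdim=True))
        feats = torch.cat([self.spatial(frame), self.temporal(diff)], dim=1)
        return torch.sigmoid(self.fuse(feats))   # per-pixel saliency in [0, 1]

# usage: saliency = TwoStreamSaliency()(frame_t, frame_t_minus_1), both N x 3 x H x W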
2) Salient Object: One special example of image saliency involves the object instance in a visual scene, specifically, the moving object in videos. A simple yet effective solution to the problem of predicting image saliency in this case involves segmenting foreground objects and background components. The segmentation of foreground objects and background components has mainly relied on foreground extraction or background subtraction. For example, motion information has frequently been used to mask out foreground objects [57]–[61]. Recently, both CNN and foreground attentive neural network (FANN) models have been developed to perform foreground segmentation [62], [63]. In addition to conventional Gaussian mixture model-based background subtraction, recent explorations have also shown that CNN models can be effectively used for the same purpose [64], [65]. To address these separated foreground objects and background attributes, Zhang et al. [66] introduced a new background mode to more compactly represent background information with better R-D efficiency. To the best of our knowledge, such foreground object/background segmentation has been mostly applied in video surveillance applications, where the visual scene lends itself to easier separation.

3) Video Compression with UEQ Scales: Recalling that saliency or an object refers to more visually attentive areas, it is straightforward to apply a UEQ setting in a video encoder, where light compression is used to encode the saliency area, while heavy compression is used elsewhere. Use of this technique often results in a lower total bit rate consumption without compromising QoE.

For example, Hadi et al. [67] extended the well-known Itti-Koch-Niebur (IKN) model to estimate saliency in the DCT domain, also considering camera motion. In addition, saliency-driven distortion was also introduced to accurately capture the salient characteristics, in order to improve R-D optimization in H.265/HEVC. Li et al. [68] suggested using graph-based visual saliency to adapt the quantizations in H.265/HEVC, to reduce total bits consumption. Similarly, Ku et al. [69] applied saliency-weighted Coding Tree Unit (CTU)-level bit allocation, where the CTU-aligned saliency weights were determined via low-level feature fusion.

The aforementioned methodologies rely on traditional handcrafted saliency prediction algorithms. As DNN-based saliency algorithms have demonstrated superior performance, we can safely assume that their application to video coding will lead to better compression efficiency. For example, Zhu et al. [70] adopted a spatio-temporal saliency model to accurately control the QP in an encoder, whose spatial saliency was generated using a 10-layer CNN, and whose temporal saliency was calculated assuming a 2D motion model (resulting in an average of 0.24 dB BD-PSNR gain over the H.265/HEVC reference model, version HM16.8). Performance improvement due to fine-grained quantization adaptation was reported using an open-source x264 encoder [71]. This was accomplished by jointly examining the input video frame and the associated saliency maps. These saliency maps were generated by utilizing three
CNN models suggested in [52], [56], [72]. Up to 25% bit rate reduction was reported when distortion was measured using the edge-weighted SSIM (EW-SSIM). Similarly, Sun et al. [73] implemented a saliency-driven CTU-level adaptive bit rate control, where the static saliency map of each frame was extracted using a DNN model and the dynamic saliency region was tracked using a moving object segmentation algorithm. Experiment results revealed that the PSNR of salient regions was improved by 1.85 dB on average.
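As a concrete illustration of how such saliency maps are typically turned into UEQ encoding, the snippet below maps a per-pixel saliency map to per-CTU QP offsets (lower QP, i.e., lighter compression, for more salient CTUs). The 128-pixel CTU size, the offset range, and the linear mapping are illustrative assumptions, not the exact rules used in the cited encoders.

import numpy as np

def ctu_qp_offsets(saliency, base_qp, ctu=128, max_offset=6):
    """Map a per-pixel saliency map in [0, 1] to a per-CTU QP grid.

    Salient CTUs get a negative offset (finer quantization), non-salient
    CTUs a positive one, roughly balanced around the base QP.
    """
    h, w = saliency.shape
    rows, cols = (h + ctu - 1) // ctu, (w + ctu - 1) // ctu
    qp_map = np.empty((rows, cols), dtype=np.int32)
    for r in range(rows):
        for c in range(cols):
            block = saliency[r * ctu:(r + 1) * ctu, c * ctu:(c + 1) * ctu]
            s = float(block.mean())                          # CTU-level saliency weight
            offset = int(round((0.5 - s) * 2 * max_offset))  # s=1 -> -max_offset, s=0 -> +max_offset
            qp_map[r, c] = np.clip(base_qp + offset, 0, 51)  # H.265/HEVC QP range
    return qp_map

# usage: qp_map = ctu_qp_offsets(saliency_map, base_qp=32)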
Though saliency-based pre-processing is mainly driven by psychovisual studies, it heavily relies on saliency detection to perform UEQ-based adaptive quantization with a lower rate of bit consumption but visually identical reconstruction. On the other hand, visual selectivity behaviour is closely associated with the video content distribution (e.g., frequency response), leading to perceptually unequal preference. Thus, it is highly expected that such content semantics-induced discriminative features can be utilized to improve the system efficiency when integrated into the video encoder. To this end, we will discuss the analysis/synthesis-based approach for pre-processing in the next section.

B. Analysis/Synthesis Based Pre-processing

Since most videos are consumed by human vision, subjective perception of the HVS is the best way to evaluate quality. However, it is quite difficult to devise a profoundly accurate mathematical HVS model in an actual video encoder for rate and perceptual quality optimization, due to the complicated and unclear information processing that occurs in the human visual pathway. Instead, many pioneering psychovisual studies have suggested that the neuronal response to compound stimuli is highly nonlinear [74]–[81] within the receptive field. This leads to well-known visual behaviors, such as frequency selectivity, masking, etc., where such stimuli are closely related to the content texture characteristics. Intuitively, video scenes can be broken down into areas that are either "perceptually significant" (e.g., measured in an MSE sense) or "perceptually insignificant". For "perceptually insignificant" regions, users will not perceive compression or processing impairments without a side-by-side comparison with the original sample. This is because the HVS gains semantic understanding by viewing content as a whole, instead of interpreting texture details pixel-by-pixel [82]. This notable effect of the HVS is also referred to as "masking," where visually insignificant information, e.g., perceptually insignificant pixels, will be noticeably suppressed.

In practice, we can first analyze the texture characteristics of the original video content in the pre-processing step, e.g., the Texture Analyzer in Fig. 2, in order to sort textures by their significance. Subsequently, we can use any standard compliant video encoder to encode the perceptually significant areas as the main bitstream payload, and apply a statistical model to represent the perceptually insignificant textures with model parameters encapsulated as side information. Finally, we can use the decoded areas and parsed textures to jointly synthesize the reconstructed sequences in the Texture Synthesizer. This type of texture modeling makes good use of statistical and psychovisual representation jointly, generally requiring fewer bits, despite yielding a visually identical sensation, compared to the traditional hybrid "prediction+residual" method². Therefore, texture analysis and synthesis play a vital role for subsequent video coding. We will discuss related techniques below.

² A comprehensive survey of texture analysis/synthesis based video coding technologies can be found in [83].

Fig. 2: Texture Coding System. A general framework of analysis/synthesis based video coding. [Block diagram: the Texture Analyzer guides the Encoder on the original video; the bitstream and side information pass through the channel to the Decoder and Texture Synthesizer, which produce the reconstructed video.]

1) Texture Analysis: Early developments in texture analysis and representation can be categorized into filter-based or statistical modeling-based approaches. The Gabor filter is one typical example of a filter-based approach, by which the input image is convolved with nonlinear activation for the derivation of the corresponding texture representation [84], [85]. At the same time, in order to identify static and dynamic textures for video content, Thakur et al. [86] utilized the 2D dual tree complex wavelet transform and the steerable pyramid transform [87], respectively. To accurately capture the temporal variations in video, Bansal et al. [88] again suggested the use of optic flow for dynamic texture indication and later synthesis, where the optical flow could be generated using temporal filtering. Leveraging statistical models such as the Markov random field (MRF) [89], [90] is an alternative way to analyze and represent texture. For efficient texture description, statistical modeling such as this was then extended using handcrafted local features, e.g., the scale invariant feature transform (SIFT) [91], speeded up robust features (SURF) [92], and local binary patterns (LBP) [93].

Recently, stacked DNNs have demonstrated their superior efficiency in many computer vision tasks. This efficiency is mainly due to the powerful capacity of DNN features to be used for video content representation. The most straightforward scheme directly extracted features from the FC6 or FC7 layer of AlexNet [94] for texture representation. Furthermore, Cimpoi et al. [95] demonstrated that Fisher vectorized [96] CNN features were a decent texture descriptor candidate.
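For readers who want to experiment with such deep texture descriptors, the sketch below pools convolutional features from a pre-trained classification backbone into a global texture descriptor. The choice of a torchvision ResNet-18 backbone and simple average pooling (rather than Fisher vector encoding) are assumptions made to keep the example short.

import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

# Pre-trained ImageNet backbone, truncated before the classifier so that the
# output is a spatial feature map rather than class logits.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()

@torch.no_grad()
def texture_descriptor(image_chw):
    """Return a global descriptor for a 3 x H x W float image in [0, 1]."""
    x = TF.normalize(image_chw, mean=[0.485, 0.456, 0.406],
                     std=[0.229, 0.224, 0.225]).unsqueeze(0)
    fmap = feature_extractor(x)               # 1 x 512 x h x w convolutional features
    return fmap.mean(dim=(2, 3)).squeeze(0)   # average-pooled 512-D texture descriptor

# usage: d = texture_descriptor(torch.rand(3, 224, 224)); the descriptors can then be
# clustered or classified to label texture regions.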
2) Texture Synthesis: Texture synthesis reverse-engineers the analysis in pre-processing to restore pixels accordingly. It generally includes both non-parametric and parametric methods. For non-parametric synthesis, texture patches are usually resampled from reference images [97]–[99]. In contrast, the parametric method utilizes statistical models to reconstruct the texture regions by jointly optimizing observation outcomes from the model and the model itself [87], [100], [101].

DNN-based solutions exhibit great potential for texture synthesis applications. One notable example demonstrating this
potential used a pre-trained image classification-based CNN model to generate texture patches [102]. Li et al. [103] then demonstrated that a Markovian GAN-based texture synthesis could offer remarkable quality improvement.

To briefly summarize, earlier "texture analysis/synthesis" approaches often relied on handcrafted models, as well as corresponding parameters. While they have shown good performance to some extent for a set of test videos, it is usually very difficult to generalize them to large-scale video datasets without further fine-tuning of the parameters. On the other hand, related neuroscience studies propose a broader definition of texture which is more closely related to perceptual sensation, although existing mathematical or data-driven texture representations have yet to fully fulfill such perceptual motives. Furthermore, recent DNN-based schemes present a promising perspective; however, their complexity has not yet been appropriately exploited. So, in Section V, we will present a CNN-based pixel-level texture analysis approach to segment perceptually insignificant texture areas in a frame for compression and later synthesis. To model the textures both spatially and temporally, we introduce a new coding mode called the "switchable texture mode" that is determined at the group of pictures (GoP) level according to the bit rate saving.

III. OVERVIEW OF DNN-BASED VIDEO CODING

A number of investigations have shown that DNNs can be used for efficient image/video coding [104]–[107]. This topic has attracted extensive attention in recent years, demonstrating its potential to enhance the conventional system with better R-D performance.

There are three major directions currently under investigation. One is resolution resampling-based video coding, by which the input videos are first down-sampled prior to being encoded, and the reconstructed videos are up-sampled or super-resolved to the same resolution as the input [108]–[111]. This category generally develops up-scaling or super-resolution algorithms on top of standard video codecs. The second direction under investigation is modularized neural video coding (MOD-NVC), which attempts to improve individual coding tools in the traditional hybrid coding framework using learning-based solutions. The third direction is end-to-end neural video coding (E2E-NVC), which fully leverages stacked neural networks to compactly represent the input image/video in an end-to-end learning manner. In the following sections, we will primarily review the latter two cases, since the first one has been extensively discussed in many other studies [112].

A. Modularized Neural Video Coding (MOD-NVC)

The MOD-NVC approach inherits the traditional hybrid coding framework, within which handcrafted tools are refined or replaced using learned solutions. The general assumption is that existing rule-based coding tools can be further improved via a data-driven approach that leverages powerful DNNs to learn robust and efficient mapping functions for more compact content representation. Two great articles have comprehensively reviewed relevant studies in this direction [106], [107]. We briefly introduce key techniques in intra/inter prediction, quantization, and entropy coding. Though in-loop filtering is another important piece in the "coding" block, due to its similarities with post filtering, we have chosen to review it in the quality enhancement-aimed "post-processing" section for the sake of creating a more cohesive presentation.

1) Intra Prediction: Video frame content presents a highly correlated distribution across neighboring samples spatially. Thus, block redundancy can be effectively exploited using causal neighbors. In the meantime, due to the presence of local structural dynamics, block pixels can be better represented from a variety of angular directed predictions.

In conventional standards, such as the H.264/AVC, H.265/HEVC, or even the emerging VVC, specific prediction rules are carefully designated to use weighted neighbors for respective angular directions. From the H.264/AVC to the recent VVC, intra coding efficiency has been gradually improved by allowing more fine-grained angular directions and flexible block sizes/partitions. In practice, an optimal coding mode is often determined by R-D optimization.

One would intuitively expect that coding performance can be further improved if better predictions can be produced. Therefore, there have been a number of attempts to leverage the powerful capacity of stacked DNNs for better intra predictor generation, including the CNN-based predictor refinement suggested in [113] to reduce the prediction residual, additional learned modes trained using FCN models reported in [114], [115], using RNNs in [116], using CNNs in [108], or even using GANs in [117], etc. These approaches have actively utilized the neighboring pixels or blocks, and/or other context information (e.g., mode) if applicable, in order to accurately represent the local structures for better prediction. Many of these approaches have reported more than 3% BD-Rate gains against the popular H.265/HEVC reference model. These examples demonstrate the efficiency of DNNs in intra prediction.
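The following sketch illustrates the general shape of such learned intra predictors: a small CNN maps an L-shaped causal context (reconstructed pixels above and to the left of the current block) to a predicted block, and the encoder then codes only the residual. The context size, block size, and layer widths are arbitrary assumptions made for illustration; none of the cited designs is reproduced exactly.

import torch
import torch.nn as nn

BLOCK, CTX = 8, 8   # assumed 8x8 block predicted from an 8-sample-wide causal border

class IntraPredictorCNN(nn.Module):
    """Predict an 8x8 luma block from its reconstructed causal neighborhood.

    The input is the (BLOCK+CTX) x (BLOCK+CTX) patch whose bottom-right
    BLOCK x BLOCK corner (the block being coded) is zeroed out; the network
    fills it in, acting as a learned intra prediction mode.
    """
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1))

    def forward(self, context_patch):           # N x 1 x 16 x 16
        pred_full = self.net(context_patch)
        return pred_full[..., CTX:, CTX:]        # keep only the predicted block

# training target: minimize MSE between the predicted block and the original block;
# at encode time, the residual (original minus prediction) is transform coded.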
2) Inter Prediction: In addition to the spatial intra prediction, temporal correlations have also been exploited via inter prediction, by which previously reconstructed frames are utilized to generate an inter predictor for compensation using displaced motion vectors.

Temporal prediction can be enhanced using references with higher fidelity, and more fine-grained motion compensation. For example, fractional-pel interpolation is usually deployed to improve prediction accuracy [118]. On the other hand, motion compensation with flexible block partitions is another major contributor to inter coding efficiency.

Similarly, earlier attempts have been made to utilize DNN solutions for better inter coding. For instance, CNN-based interpolations were studied in [119]–[121] to improve the half-pel samples. Besides, an additional virtual reference could be generated using CNN models for improved R-D decisions in [122]. Xia et al. [123] further extended this approach using multiscale CNNs to create an additional reference closer to the current frame, by which accurate pixel-wise motion representation could be used. Furthermore, conventional references could also be enhanced using DNNs to refine the compensation [124].
3) Quantization and Entropy Coding: Quantization and entropy coding are used to remove statistical redundancy. Scalar quantization is typically implemented in video encoders to remove insensitive high-frequency components, without losing the perceptual quality, while saving the bit rate. Recently, a three-layer DNN was developed to predict the local visibility threshold C_T for each CTU, by which more accurate quantization could be achieved via the connection between C_T and the actual quantization stepsize. This development led to noticeable R-D improvement, e.g., up to 11% as reported in [125].

Context-adaptive binary arithmetic coding (CABAC) and its variants are techniques that are widely adopted to encode binarized symbols. The efficiency of CABAC is heavily reliant on the accuracy of probability estimation in different contexts. Since the H.264/AVC, handcrafted probability transfer functions (developed through exhaustive simulations, and typically implemented using look-up tables) have been utilized. In [115] and [126], the authors demonstrated that a combined FCN and CNN model could be used to predict intra mode probability for better entropy coding. Another example of a combined FCN and CNN model was presented in [127] to accurately encode transform indexes via stacked CNNs. Likewise, in [128], the intra DC coefficient probability could also be estimated using DNNs for better performance.

All of these explorations have reported positive R-D gains when incorporating DNNs in traditional hybrid coding frameworks. A companion H.265/HEVC-based software model is also offered by Liu et al. [106], to help the community further pursue this line of exploration. However, integrating DNN-based tools could exponentially increase both the computational and space complexity. Therefore, creating harmony between learning-based and conventional rule-based tools under the same framework requires further investigation.

It is also worth noting that an alternative approach is currently being explored in parallel. In this approach, researchers suggest using an end-to-end neural video coding (E2E-NVC) framework to derive the raw video content representation via layered feature extraction, activation, suppression, and aggregation, mostly in a supervised learning fashion, instead of refining individual coding tools.

B. End-to-End Neural Video Coding (E2E-NVC)

Representing raw video pixels as compactly as possible by massively exploiting their spatio-temporal and statistical correlations is the fundamental problem of lossy video coding. Over the decades, traditional hybrid coding frameworks have utilized pixel-domain intra/inter prediction, transform, entropy coding, etc., to fulfill this purpose. Each coding tool is extensively examined under a specific codec structure to carefully justify the trade-off between R-D efficiency and complexity. This process led to the creation of well-known international or industry standards, such as the H.264/AVC, H.265/HEVC, AV1, etc.

On the other hand, DNNs have demonstrated a powerful capacity for video spatio-temporal feature representation for vision tasks, such as object segmentation, tracking, etc. This naturally raises the question of whether it is possible to encode those spatio-temporal features in a compact format for efficient lossy compression.

Recently, we have witnessed the growth of video coding technologies that rely completely on end-to-end supervised learning. Most learned schemes still closely follow the conventional intra/inter frame definition, by which different algorithms are investigated to efficiently represent the intra spatial textures, inter motion, and the inter residuals (if applicable) [104], [129]–[131]. Raw video frames are fed into stacked DNNs to extract, activate, and aggregate appropriate compact features (at the bottleneck layer) for quantization and entropy coding. Similarly, R-D optimization is also facilitated to balance the rate and distortion trade-off. In the following paragraphs, we will briefly review the aforementioned key components.

1) Nonlinear Transform and Quantization: The autoencoder or variational autoencoder (VAE) architectures are typically used to transform the intra texture or inter residual into compressible features.

For example, Toderici et al. [132] first applied fully-connected recurrent autoencoders for variable-rate thumbnail image compression. Their work was then improved in [133], [134] with the support of full-resolution images, unequal bit allocation, etc. Variable bit rate is intrinsically enabled by these recurrent structures. The recurrent autoencoders, however, suffer from higher computational complexity at higher bit rates, because more recurrent processing is required. Alternatively, convolutional autoencoders have been extensively studied in past years, where different bit rates are adapted by setting a variety of λs to optimize the R-D trade-off. Note that different network models may be required for individual bit rates, making hardware implementation challenging (e.g., requiring a model switch from one bit rate to another). Recently, conditional convolution [135] and scaling factors [136] were proposed to enable variable-rate compression using a single or a very limited number of network models without noticeable coding efficiency loss, which makes convolutional autoencoders more attractive for practical applications.

To generate a more compact feature representation, Ballé et al. [105] suggested replacing the traditional nonlinear activation, e.g., ReLU, with generalized divisive normalization (GDN), which is theoretically proven to be more consistent with human visual perception. A subsequent study [137] revealed that GDN outperformed other nonlinear rectifiers, such as ReLU, leakyReLU, and tanh, in compression tasks. Several follow-up studies [138], [139] directly applied GDN in their networks for compression exploration.
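As a reference point for the GDN transform discussed above, the sketch below implements the forward GDN operation y_i = x_i / sqrt(beta_i + sum_j gamma_ij * x_j^2) as a channel-wise normalization layer. The initialization values and the simple clamping used to keep beta and gamma non-negative are implementation assumptions rather than the exact scheme of [105].

import torch
import torch.nn as nn
import torch.nn.functional as F

class GDN(nn.Module):
    """Generalized divisive normalization: y_i = x_i / sqrt(beta_i + sum_j gamma_ij * x_j^2)."""
    def __init__(self, channels, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.beta = nn.Parameter(torch.ones(channels))          # per-channel offset
        self.gamma = nn.Parameter(0.1 * torch.eye(channels))    # cross-channel weights

    def forward(self, x):                                        # x: N x C x H x W
        beta = self.beta.clamp(min=self.eps)                     # keep parameters non-negative
        gamma = self.gamma.clamp(min=0.0)
        # A 1x1 convolution over the squared channels computes sum_j gamma_ij * x_j^2.
        norm = F.conv2d(x * x, gamma.unsqueeze(-1).unsqueeze(-1), bias=beta)
        return x * torch.rsqrt(norm + self.eps)

# usage inside an analysis transform: nn.Sequential(nn.Conv2d(3, 128, 5, 2, 2), GDN(128), ...)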
Quantization is a non-differentiable operation, basically converting arbitrary elements into symbols with a limited alphabet for efficient entropy coding in compression. Quantization must be made differentiable in the end-to-end learning framework for back propagation. A number of methods, such as adding uniform noise [105], stochastic rounding [132], and soft-to-hard vector quantization [140], were developed to approximate a continuous distribution for differentiation.
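Two of the proxies mentioned above are easy to state in a few lines: additive uniform noise during training (as in [105]) and straight-through rounding, where hard rounding is used in the forward pass but gradients flow as if quantization were the identity. The snippet is a generic sketch of these ideas, not code from any of the cited works.

import torch

def quantize(latents, training):
    """Quantization proxy for end-to-end training.

    Training: add U(-0.5, 0.5) noise so the operation stays differentiable and
    mimics rounding error. Inference: hard rounding for entropy coding.
    """
    if training:
        return latents + torch.empty_like(latents).uniform_(-0.5, 0.5)
    return torch.round(latents)

def quantize_ste(latents):
    """Straight-through estimator: rounded values forward, identity gradient backward."""
    return latents + (torch.round(latents) - latents).detach()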
2) Motion Representation: Chen et al. [104] developed the DeepCoder, where a simple convolutional autoencoder was applied for both intra and residual coding at fixed 32×32 blocks,
and block-based motion estimation in traditional video coding was re-used for temporal compensation. Lu et al. [141] introduced optical flow for motion representation in their DVC work, which, together with the intra coding in [142], demonstrated similar performance compared with the H.265/HEVC. However, coding efficiency suffered a sharp loss at low bit rates. Liu et al. [143] extended their non-local attention optimized image compression (NLAIC) for intra and residual encoding, and applied second-order flow-to-flow prediction for more compact motion representation, showing consistent rate-distortion gains across different contents and bit rates.

Motion can also be implicitly inferred via temporal interpolation. For example, Wu et al. [144] applied RNN-based frame interpolation. Together with the residual compensation, RNN-based frame interpolation offered comparable performance to the H.264/AVC. Djelouah et al. [145] furthered interpolation-based video coding by utilizing advanced optical flow estimation and feature-domain residual coding. However, temporal interpolation usually led to an inevitable structural coding delay.

Another interesting exploration made by Rippel et al. in [130] was to jointly encode the motion flow and residual using compound features, where a recurrent state was embedded to aggregate multi-frame information for efficient flow generation and residual coding.

3) R-D Optimization: Li et al. [146] utilized a separate three-layer CNN to generate an importance map for spatial-complexity-based adaptive bit allocation, leading to noticeable subjective quality improvement. Mentzer et al. [140] further utilized a masked bottleneck layer to unequally weight features at different spatial locations. Such importance map embedding is a straightforward approach to end-to-end training. Importance derivation was later improved with the non-local attention [147] mechanism to efficiently and implicitly capture both global and local significance for better compression performance [136].

Probabilistic models play a vital role in data compression. Assuming a Gaussian distribution for the feature elements, Ballé et al. [142] utilized hyper priors to estimate the parameters of a Gaussian scale model (GSM) for the latent features. Later, Hu et al. [148] used hierarchical (coarse-to-fine) hyper priors to improve the entropy models in multiscale representations. Minnen et al. [149] improved the context modeling using joint autoregressive spatial neighbors and hyper priors based on the Gaussian mixture model (GMM). Autoregressive spatial priors are commonly fused by PixelCNNs or PixelRNNs [150]. Reed et al. [151] further introduced multiscale PixelCNNs, yielding competitive density estimation and a great boost in speed (e.g., from O(N) to O(log N)). Prior aggregation was later extended from 2D architectures to 3D PixelCNNs [140]. Channel-wise weight sharing-based 3D implementations could greatly reduce network parameters without performance loss. A parallel 3D PixelCNN for practical decoding is presented in Chen et al. [136]. Previous methods accumulated all the priors to estimate the probability based on a single GMM assumption for each element. Recent studies have shown that weighted GMMs can further improve coding efficiency [152], [153].
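To tie the λ-based R-D trade-off and these entropy models together, the sketch below shows the standard training objective L = R + λ·D, where the rate term is estimated from a conditional Gaussian entropy model by integrating the density over each unit-width quantization bin. It is a generic illustration of the formulation shared by the cited works; the tensor names and the MSE distortion are assumptions.

import torch

def gaussian_bits(y_hat, mean, scale):
    """Estimated bits of quantized latents under a conditional Gaussian model.

    P(y_hat) is approximated by the Gaussian probability mass of the unit-width
    quantization bin centered at y_hat (mean/scale come from, e.g., a hyperprior).
    """
    dist = torch.distributions.Normal(mean, scale.clamp(min=1e-6))
    p = dist.cdf(y_hat + 0.5) - dist.cdf(y_hat - 0.5)
    return -torch.log2(p.clamp(min=1e-9)).sum()

def rd_loss(x, x_rec, y_hat, mean, scale, lmbda, num_pixels):
    """R-D objective: bits-per-pixel plus lambda-weighted MSE distortion."""
    bpp = gaussian_bits(y_hat, mean, scale) / num_pixels
    mse = torch.mean((x - x_rec) ** 2)
    return bpp + lmbda * mse

# A larger lmbda favors quality (higher rate); sweeping lmbda traces out the R-D curve.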
Pixel error, such as MSE, was one of the most popular loss functions used. Concurrently, SSIM (or MS-SSIM) was also adopted because of its greater consistency with visual perception. Simulations revealed that SSIM-based loss can improve reconstruction quality, especially at low bit rates. Towards perceptually-optimized encoding, perceptual losses measured by adversarial loss [154]–[156] and VGG loss [157] were embedded in learning to produce visually appealing results.

Though E2E-NVC is still in its infancy, its fast-growing R-D efficiency holds a great deal of promise. This is especially true, given that we can expect neural processors to be deployed massively in the near future [158].

IV. OVERVIEW OF DNN-BASED POST-PROCESSING

Compression artifacts are inevitably present in both traditional hybrid coding frameworks and learned compression approaches, e.g., blockiness, ringing, cartoonishness, etc., severely impairing the visual sensation and QoE. Thus, quality enhancement filters are often applied as a post-filtering step or an in-loop module to alleviate compression distortions. Towards this goal, adaptive filters are usually developed to minimize the error between the original and distorted samples.

A. In-loop Filtering

Existing video standards mainly utilize in-loop filters to improve the subjective quality of the reconstruction, and also to offer better R-D efficiency due to the enhanced references. Examples include deblocking [24], sample adaptive offset (SAO) [25], the constrained directional enhancement filter (CDEF) [159], loop restoration (LR) [160], the adaptive loop filter (ALF) [161], etc.

Recently, numerous CNN models have been developed for in-loop filtering via a data-driven approach to learn the mapping functions. It is worth pointing out that prediction relationships must be carefully examined when designing in-loop filters, due to the frame referencing structure and potential error propagation. Both intra and inter predictions are utilized in popular video encoders, where an intra-coded frame only exploits the spatial redundancy within the current frame, while an inter-coded frame jointly explores the spatio-temporal correlations across frames over time.

Earlier explorations of this subject have mainly focused on designing DNN-based filters for intra-coded frames, particularly by trading network depth and parameters for better coding efficiency. For example, IFCNN [162] and VRCNN [163] are shallow networks with ≈50,000 parameters, providing up to 5% BD-Rate savings for the H.265/HEVC intra encoder. More gains can be obtained if we use a deeper and denser network [164]–[166], e.g., a 5.7% BD-Rate gain was reported in [164] using a model with 3,340,000 parameters, and an 8.50% BD-Rate saving was obtained in [167] using a model with 2,298,160 parameters. The more parameters a model has, the more complex it is. Unfortunately, greater complexity limits the network's potential for practical application. Such intra-frame-based in-loop filters treat decoded frames equally, without consideration of the in-loop inter-prediction dependency.
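The shallow intra-frame filters mentioned above generally share one simple structure: a few convolutional layers predict a correction that is added back to the decoded frame (residual learning). The sketch below captures that pattern; the exact depth, kernel sizes, and channel counts of IFCNN or VRCNN are not reproduced here, so treat the numbers as placeholders.

import torch
import torch.nn as nn

class ShallowLoopFilter(nn.Module):
    """Residual-learning CNN filter for decoded frames (IFCNN/VRCNN-style sketch)."""
    def __init__(self, ch=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, ch, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1))

    def forward(self, decoded_luma):
        # Predict compression artifacts and add the correction to the decoded frame.
        return decoded_luma + self.body(decoded_luma)

# Training pairs are (decoded frame, pristine frame) collected at several QPs;
# the loss is typically MSE against the pristine frame.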
Nevertheless, the aforementioned networks can be used in post-filtering out of the coding loop.

It is necessary to include the temporal prediction dependency while designing in-loop CNN-based filters for inter-frame coding. Some studies leveraged prior knowledge from the encoding process to assist the CNN training and inference. For example, Jia et al. [168] incorporated the co-located block information for in-loop filtering. Meng et al. [169] utilized the coding unit partition for further performance improvement. Li et al. [170] input both the reconstructed frame and the difference between the reconstructed and predicted pixels to improve the coding efficiency. Applying prior knowledge in learning may improve the coding performance, but it further complicates the CNN model by involving additional information in the networks. On the other hand, the contribution of this prior knowledge is quite limited because such additional priors are already implicitly embedded in the reconstructed frame.

If a CNN-based in-loop filter is applied to frame I0, its impact will be gradually propagated to frame I1, which has frame I0 as its reference. Subsequently, I1 is the reference of I2, and so on and so forth³. If frame I1 is filtered again by the same CNN model, an over-filtering problem will be triggered, resulting in severely degraded performance, as analyzed in [171]. To overcome this challenging problem, a CNN model called SimNet was built to capture the relationship between the reconstructed frame and its original frame in [172], so as to adaptively skip filtering operations in inter coding. SimNet reported 7.27% and 5.57% BD-Rate savings for intra- and inter-coding of AV1, respectively. A similar skipping strategy was suggested by Chen et al. [173] to enable a wide activation residual network, yielding 14.42% and 9.64% BD-Rate savings for respective intra- and inter-coding on the AV1 platform.

³ Even though more advanced inter referencing strategies can be devised, the inter propagation-based behavior remains the same.

Alternative solutions resort to the more expensive R-D optimization to avoid the over-filtering problem. For example, Yin et al. [174] developed three sets of CNN filters for the luma and chroma components, where the R-D optimal CNN model is used and signaled in the bitstream. Similar ideas are developed in [175], [176] as well, in which multiple CNN models are trained and the R-D optimal model is selected for inference.
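A minimal sketch of this kind of R-D driven selection is given below: the encoder tries each candidate filter (including a pass-through for "skip filtering"), measures the resulting distortion, adds the signaling cost, and keeps the cheapest option, whose index is then signaled in the bitstream. The cost formula and the per-frame granularity are simplifying assumptions.

import math

def choose_filter(frame_rec, candidates, lmbda, distortion):
    """Pick the R-D optimal option among candidate CNN filters.

    candidates: list of callables mapping a reconstructed frame to a filtered one;
                index 0 can be a pass-through to represent 'skip filtering'.
    distortion: callable returning the distortion of a frame vs. the original,
                e.g. lambda f: float(((f - frame_org) ** 2).mean()).
    Returns (best_index, best_frame); best_index is signaled to the decoder.
    """
    signal_bits = math.ceil(math.log2(max(len(candidates), 2)))  # model-index side information
    best_idx, best_frame, best_cost = None, frame_rec, float("inf")
    for idx, apply_filter in enumerate(candidates):
        filtered = apply_filter(frame_rec)
        cost = distortion(filtered) + lmbda * signal_bits        # J = D + lambda * R
        if cost < best_cost:
            best_idx, best_frame, best_cost = idx, filtered, cost
    return best_idx, best_frame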
It is impractical to use deeper and denser CNN models in applications. It is also very expensive to conduct R-D optimization to choose the optimal model from a set of pre-trained models. Note that a limited number of pre-trained models is theoretically insufficient to generalize to large-scale video samples. To this end, in Section VII-A, we introduce a guided-CNN scheme which adapts shallow CNN models according to the characteristics of the input video content.

B. Post Filtering

Post filtering is generally applied to the compressed frames at the decoder side to further enhance the video quality for better QoE.

Previous in-loop filters designated for intra-coded frames can be re-used for single-frame post-filtering [163], [177]–[185]. Appropriate re-training may be applied in order to better capture the data characteristics. However, single-frame post-filtering may introduce quality fluctuation across frames. This may be due to the limited capacity of CNN models to deal with a great variety of video content. Thus, multi-frame post filtering can be devised to massively exploit the correlation across successive temporal frames. By doing so, it not only greatly improves upon the single-frame solution, but also offers better temporal quality over time.

Typically, a two-step strategy is applied for multi-frame post filtering. First, neighboring frames are aligned to the current frame via (pixel-level) motion estimation and compensation (MEMC). Then, all aligned frames are fed into networks for high-quality reconstruction. Thus, the accuracy of MEMC greatly affects the reconstruction performance. In applications, learned optical flow models, such as FlowNet [186], FlowNet2 [187], PWC-Net [188], and TOFlow [189], are widely used.
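The two-step strategy can be sketched as follows: neighboring frames are warped toward the current frame using a (learned) optical flow, and the stack of aligned frames is fused by a small CNN. The flow estimator is abstracted as an external input, the warping uses bilinear sampling, and details such as the fusion depth are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(frame, flow):
    """Bilinearly warp a frame (N x C x H x W) by a flow field (N x 2 x H x W)."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame.device)   # 2 x H x W base coordinates
    coords = grid.unsqueeze(0) + flow                               # absolute sampling positions
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                   # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(frame, torch.stack((coords_x, coords_y), dim=-1),
                         align_corners=True)

class MultiFrameFusion(nn.Module):
    """Fuse the current frame with warped neighbors for quality enhancement."""
    def __init__(self, num_frames=3, ch=48):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_frames, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1))

    def forward(self, current, neighbors, flows):
        # flows come from a FlowNet/PWC-Net-like estimator, one per neighbor frame
        aligned = [warp(nb, fl) for nb, fl in zip(neighbors, flows)]
        return current + self.net(torch.cat([current] + aligned, dim=1))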
Some exploration has already been made in this arena: Bao et al. [190] and Wang et al. [191] implemented general video quality enhancement frameworks for denoising, deblocking, and super-resolution, where Bao et al. [190] employed FlowNet and Wang et al. [191] used pyramid, cascading, and deformable convolutions to respectively align frames temporally. Meanwhile, Yang et al. [192] proposed a multi-frame quality enhancement framework called MFQE-1.0, in which a spatial transformer motion compensation (STMC) network is used for alignment, and a deep quality enhancement network (QE-net) is employed to improve the reconstruction quality. Then, Guan et al. [193] upgraded MFQE-1.0 to MFQE-2.0 by replacing QE-net with a dense CNN model, leading to better performance and less complexity. Later on, Tong et al. [194] suggested using FlowNet2 in MFQE-1.0 for temporal frame alignment (instead of the default STMC), yielding a 0.23 dB PSNR gain over the original MFQE-1.0. Similarly, FlowNet2 is also used in [195] for improved efficiency.

All of these studies suggested the importance of temporal alignment in post filtering. Thus, in the subsequent case study (see Section VII-B), we first examine the efficiency of alignment, and then further discuss the contributions from the respective intra-coded and inter-coded frames to the quality enhancement of the final reconstruction. This will help audiences gain a deeper understanding of similar post filtering techniques.

V. CASE STUDY FOR PRE-PROCESSING: SWITCHABLE TEXTURE-BASED VIDEO CODING

This section presents a switchable texture-based video pre-processing approach that leverages DNN-based semantic understanding for subsequent coding improvement. In short, we exploit DNNs to accurately segment "perceptually InSIGnificant" (pInSIG) texture areas to produce a corresponding pInSIG mask. In many instances, this mask drives the encoder to operate separately on pInSIG textures, which are typically inferred without additional residuals, and on "perceptually SIGnificant" (pSIG) areas elsewhere using the traditional hybrid coding method. This approach is implemented on top of the AV1 codec [196]–[198] by enabling a GoP-level switchable mechanism. This yields
noticeable bit rate savings for both standard test sequences and additional challenging sequences from the YouTube UGC dataset [199], under similar perceptual quality. The method we propose is a pioneering work that integrates learning-based texture analysis and reconstruction approaches with a modern video codec to enhance video compression performance.

A. Texture Analysis

Our previous attempt [202] yielded encouraging bit rate savings without decreasing visual quality. This was accomplished by perceptually differentiating pInSIG textures from the other areas to be encoded in a hybrid coding framework. However, the corresponding texture masks were derived using traditional methods, at the coding block level. On the other hand, building upon advancements created by DNNs and large-scale labeled datasets (e.g., ImageNet [203], COCO [204], and ADE20K [205]), learning-based semantic scene segmentation algorithms [200], [205], [206] have been tremendously improved to generate accurate pixel-level texture masks.

In this work, we first rely on the powerful ResNet-50 [201] with dilated convolutions [207], [208] to extract feature maps that effectively embed the content semantics. We then introduce the pyramid pooling module from PSPNet [200] to produce a pixel-level semantic segmentation map, as shown in Fig. 3. Our implementation starts with a pre-trained PSPNet model generated using the MIT SceneParse150 [209] scene parsing benchmark. We then retrained the model on a subset of the densely annotated ADE20K dataset [205]. In the end, the model offers a pixel segmentation accuracy of 80.23%.

Fig. 3: Texture Analyzer. Proposed semantic segmentation network using PSPNet [200] and ResNet-50 [201]. [Block diagram: Input Image → CNN (ResNet-50 with dilated CONV) → PSP module (pool, upsample, CONV) → scene segmentation.]

It is worthwhile to note that such pixel-level segmentation may result in a large number of semantic classes. Nevertheless, this study suggests grouping similar texture classes commonly found in nature scenes together into four major categories, e.g., "earth and grass", "water, sea and river", "mountain and hill", and "tree". Each texture category has an individual segmentation mask to guide the compression performed by the succeeding video encoder.

B. Switchable Texture-Based Video Coding

Texture masks are generally used to identify texture blocks so that texture blocks and non-texture blocks can be encoded separately, as illustrated in Fig. 4a. In this case study, the AV1 reference software platform is selected to exemplify the efficiency of our proposal.

Texture Blocks. Texture and non-texture blocks are identified by overlaying the segmentation mask from the texture analyzer on its corresponding frame. These frame-aligned texture masks provide pixel-level accuracy, which is capable of supporting arbitrary texture shapes. However, in order to support the block processing commonly adopted by video encoders, we propose refining the original pixel-level masks to their block-based representations. The minimum size of a texture block is 16×16. In order to avoid boundary artifacts and maintain temporal consistency, we implemented a conservative two-step strategy to determine the texture blocks. First, the block itself must be fully contained in the texture region marked by the pixel-level mask. Then, its warped representations in the temporal references (e.g., the preceding and succeeding frames in the encoding order) have to be inside the masked texture areas of the corresponding reference frames as well. Finally, these texture blocks are encoded using the texture mode, and non-texture blocks are encoded as usual using the hybrid coding structure.
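The conservative two-step block labeling just described can be sketched as follows: a 16×16 block is labeled a texture block only if it lies entirely inside the pixel-level mask and its warped footprint also falls inside the masks of the temporal references. The warping helper and the simple translational placement per block are placeholders for the global-motion-based warping used in the actual implementation.

import numpy as np

def block_texture_mask(pixel_mask, ref_masks, warp_block, block=16):
    """Refine a pixel-level texture mask (H x W, bool) into a block-level one.

    ref_masks:  list of pixel-level masks of the reference frames.
    warp_block: callable (row, col, ref_index) -> (row, col) giving the block's
                top-left position after warping into that reference (placeholder
                for warping with the estimated global motion parameters).
    """
    h, w = pixel_mask.shape
    blocks = np.zeros((h // block, w // block), dtype=bool)
    for by in range(h // block):
        for bx in range(w // block):
            y, x = by * block, bx * block
            if not pixel_mask[y:y + block, x:x + block].all():
                continue                              # step 1: fully inside the mask
            ok = True
            for i, ref in enumerate(ref_masks):       # step 2: warped block inside reference masks
                ry, rx = warp_block(y, x, i)
                inside = (0 <= ry <= h - block) and (0 <= rx <= w - block)
                if not (inside and ref[ry:ry + block, rx:rx + block].all()):
                    ok = False
                    break
            blocks[by, bx] = ok
    return blocks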
Texture Mode. A texture mode coded block is inferred from its temporal reference using the global motion parameters without incurring any motion compensation residuals. In contrast, non-texture blocks are compressed using the hybrid "prediction+residual" scheme. For each current frame and any one of its reference frames, the AV1 syntax specifies only one set of global motion parameters at the frame header. Therefore, to comply with the AV1 syntax, our implementation only considers one texture class for each frame. This guarantees the general compatibility of our solution with existing AV1 decoders. We further modified the AV1 global motion tool to estimate the motion parameters based on the texture regions of the current frame and its reference frame. We used the same feature extraction and model fitting approach as in the global motion coding tool in order to provide a more accurate motion model for the texture regions. This was done to prevent visual artifacts on the block edges between the texture and non-texture blocks in the reconstructed video. Although we have demonstrated our algorithms using the AV1 standard, we expect that the same methodology can be applied to other standards. For instance, when using the H.265/HEVC standard, we can leverage the SKIP mode syntax to signal the texture mode instead of utilizing the global motion parameters.

The previous discussion has suggested that the texture mode is enabled along with inter prediction. Our extensive studies have also demonstrated that it is better to activate the texture mode in frames where bi-directional predictions are allowed (e.g., B-frames), for the optimal trade-off between bit rate saving and perceived quality. As will be shown in the following performance comparisons, we use an 8-frame GoP (or Golden-Frame (GF) group as defined in AV1) and exemplify the texture mode in every other frame, by which the compound prediction from bi-directional references can be facilitated for prediction warping. Such bi-directional prediction could also alleviate possible temporal quality flickering.

Switchable Optimization. In our previous work [210], the texture mode was enabled for every B frame, demonstrating
Get scene change information from first pass encoding
10

Load pixel-based
based texture mask of the chosen texture class set4 [199]. YouTube UGC dataset is a sample selected from
thousands of User Generated Content (UGC) videos uploaded
Frame to YouTube. The names of the UGC videos follow the format
Level Calculate motion parameter
of Category Resolution UniqueID. We calculate the bit rate
savings at different QP values for 150 frames of the test se-
N
Is it a texture block? quences. In our experiments, we used the following parameters
for the AV1 codec5 as the baseline: 8-frame GoP or GF group
using random access configuration; 30 FPS; constant quality
Y
Block rate control policy; multi-layer coding structure for all GF
Level Choose texture mode RD optimization
groups; maximum intra frame interval at 150. We evaluate
the performance of our proposed method in terms of bit rate
Encode block
savings and perceived quality.
(a) 1) Coding Performance: To evaluate the performance of the
proposed switchable texture mode method, bit rate savings at
Texture region four quantization levels (QP = 16, 24, 32, 40) are calculated
percentage Texture N
Encode with region for each test sequence in comparison to the AV1 baseline.
texture mode Bit rate > 10%
enabled Texture Analysis. We compare two DNN-based texture
First GoP Y analysis methods [210], [212] with a handcrafted feature-based
approach [211] for selected standard test sequences. Results
Encode with Bit rate Has bit
texture mode rate
N are shown in Table II. A positive bit rate saving (%) indicates
disabled saving a reduction compared with the AV1 baseline. Compared to the
Y feature based approach, DNN-based methods show improved
Enable texture Disable texture performance in terms of bit rate saving. The feature based
mode for the GoP mode for the GoP
in the scene in the scene approach relies on color and edge information to generate
the texture mask and is less accurate and consistent both
(b)
spatially and temporally. Therefore, the number of blocks that
Fig. 4: Texture mode and switchable control scheme. (a) are reconstructed using texture mode is usually much smaller
Texture mode encoder implementation. (b) Switchable texture than that of DNN-based methods. Note that the parameters
mode decision. used in feature based approach require manually tuning for
Switchable Optimization. In our previous work [210], the texture mode was enabled for every B frame, demonstrating significant bit rate reduction at the same level of perceptual sensation in most standard test videos, in comparison to the AV1 anchor. However, some videos did cause the model to perform more poorly. One reason for this effect is that higher QP settings typically incur more all-zero residual blocks. Alternatively, texture mode is also content-dependent: a relatively small number of texture blocks may be present for some videos. Both scenarios limit the bit rate savings, and an overhead of extra bits is mandatory for global motion signaling if texture mode is enabled.

To address these problems, we introduce a switchable scheme to determine whether texture mode could be potentially enabled for a GoP or a GF group. The criteria for switching are based on the texture region percentage, which is calculated as the average ratio of texture blocks in B-frames, and on the potential bit rate savings with or without texture mode. Figure 4b illustrates the switchable texture mode decision. Currently, we use bit rate saving as the criterion for switch decisions when the texture mode is enabled. This assumes perceptual sensation will remain nearly the same, since these texture blocks are perceptually insignificant.
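The switch in Fig. 4b can be summarized by a short decision routine, sketched below under the stated criteria. The 10% area threshold follows Fig. 4b, while the frame records and the two encoder callables are hypothetical stand-ins for the modified AV1 encoder rather than its actual interface.

```python
def enable_texture_mode_for_scene(gf_group,
                                  encode_with_texture,
                                  encode_without_texture,
                                  area_threshold=0.10):
    """Decide whether texture mode is enabled for the GF groups of a scene.

    gf_group:                first GF group (list of frame records) after a scene change
    encode_with_texture:     callable returning the bit count with texture mode on
    encode_without_texture:  callable returning the bit count with texture mode off
    """
    # Criterion 1: average ratio of texture blocks over the B-frames.
    b_frames = [f for f in gf_group if f["type"] == "B"]
    if not b_frames:
        return False
    texture_ratio = sum(f["texture_blocks"] / f["total_blocks"]
                        for f in b_frames) / len(b_frames)
    if texture_ratio <= area_threshold:
        return False

    # Criterion 2: the first GF group must actually save bits with texture
    # mode enabled, despite the global-motion signaling overhead.
    return encode_with_texture(gf_group) < encode_without_texture(gf_group)
```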
C. Experimental Results

We selected sequences with texture regions from standard test sequences and the more challenging YouTube UGC data set⁴ [199]. The YouTube UGC dataset is a sample selected from thousands of User Generated Content (UGC) videos uploaded to YouTube. The names of the UGC videos follow the format Category_Resolution_UniqueID. We calculate the bit rate savings at different QP values for 150 frames of the test sequences. In our experiments, we used the following parameters for the AV1 codec⁵ as the baseline: 8-frame GoP or GF group using the random access configuration; 30 FPS; constant quality rate control policy; multi-layer coding structure for all GF groups; maximum intra frame interval of 150. We evaluate the performance of our proposed method in terms of bit rate savings and perceived quality.

4 https://media.withyoutube.com/
5 AV1 codec change-Id: Ibed6015aa7cce12fcc6f314ffde76624df4ad2a1

1) Coding Performance: To evaluate the performance of the proposed switchable texture mode method, bit rate savings at four quantization levels (QP = 16, 24, 32, 40) are calculated for each test sequence in comparison to the AV1 baseline.

Texture Analysis. We compare two DNN-based texture analysis methods [210], [212] with a handcrafted feature-based approach [211] for selected standard test sequences. Results are shown in Table II. A positive bit rate saving (%) indicates a reduction compared with the AV1 baseline. Compared to the feature-based approach, DNN-based methods show improved performance in terms of bit rate saving. The feature-based approach relies on color and edge information to generate the texture mask and is less accurate and consistent both spatially and temporally. Therefore, the number of blocks that are reconstructed using texture mode is usually much smaller than that of the DNN-based methods. Note that the parameters used in the feature-based approach require manual tuning for each video to optimize the texture analysis output. The pixel-level segmentation [210] shows further advantages compared with the block-level method [212], since the CNN model does not require the block size to be fixed.
TABLE II: Bit rate saving (%) comparison between handcrafted feature (FM) [211], block-level DNN (BM) [212] and pixel-level DNN (PM) [210] texture analysis against the AV1 baseline for selected standard test sequences using the tex-allgf method.
Video Sequence     QP=16 (%)          QP=24 (%)          QP=32 (%)          QP=40 (%)
                   FM    BM    PM     FM    BM    PM     FM    BM    PM     FM    BM    PM
Coastguard −0.17 7.80 9.14 −0.36 6.99 8.01 −0.43 4.70 5.72 −0.62 1.90 2.13
Flower 7.42 10.55 13.00 5.42 8.66 10.78 2.51 5.96 4.95 0.19 3.38 1.20
Waterfall 3.65 4.63 13.11 1.58 3.96 7.21 −0.14 −0.33 1.30 −3.00 −3.74 −3.48
Netflix aerial 1.15 8.59 9.15 −0.26 2.15 5.59 −1.32 −0.68 1.05 −2.10 −4.59 −4.01
Intotree 0.88 5.32 9.71 0.15 4.32 9.42 −0.14 1.99 8.46 −0.26 −2.83 4.92

TABLE III: Bit rate saving (%) comparison for tex-allgf and tex-switch methods against the AV1 baseline.
Resolution   Video Sequence        QP=16 (%)                QP=24 (%)                QP=32 (%)                QP=40 (%)
                                   tex-allgf   tex-switch   tex-allgf   tex-switch   tex-allgf   tex-switch   tex-allgf   tex-switch
Bridgeclose 15.78 15.78 10.87 10.87 4.21 4.21 2.77 2.77
Bridgefar 10.68 10.68 8.56 8.56 6.34 6.34 6.01 6.01
CIF Coastguard 9.14 9.14 8.01 8.01 5.72 5.72 2.13 2.13
Flower 13.00 13.00 10.78 10.78 4.95 4.95 1.20 1.20
Waterfall 13.11 13.11 7.21 7.21 1.30 1.30 −3.48 0.00
512×270 Netflix aerial 9.15 9.15 5.59 5.59 1.05 1.05 −4.01 0.00
NewsClip 360P-1e1c 10.77 10.77 9.27 9.27 5.23 5.23 1.54 1.54
NewsClip 360P-22ce 17.37 17.37 15.79 15.79 16.37 16.37 17.98 17.98
360P TelevisionClip 360P-3b9a 1.45 1.45 0.48 0.48 −1.09 0.00 −3.26 0.00
TelevisionClip 360P-74dd 1.66 1.66 1.17 1.17 0.36 0.36 −0.37 0.00
HowTo 480P-04f1 3.81 3.81 2.57 2.57 0.93 0.93 0.06 0.36
HowTo 480P-4c99 2.36 2.36 1.67 1.67 0.37 0.00 −1.16 0.00
MusicVideo 480P-1eee 3.31 3.31 3.29 3.29 2.53 2.53 −0.30 −0.30
480P
NewsClip 480P-15fa 6.31 6.31 6.05 5.79 0.53 0.11 −0.79 0.03
NewsClip 480P-7a0d 11.54 11.54 10.03 10.03 1.53 1.53 0.08 0.00
TelevisionClip 480P-19d3 3.13 3.13 2.86 2.86 1.66 1.66 0.58 0.00
HowTo 720P-0b01 12.72 12.72 11.84 11.84 9.31 9.31 6.35 6.35
720P MusicVideo 720P-3698 1.76 1.76 1.07 1.07 0.30 0.30 −0.17 0.00
MusicVideo 720P-4ad2 6.93 6.93 3.81 3.81 1.87 1.87 0.60 0.11
HowTo 1080P-4d7b 7.31 7.31 6.07 6.07 3.21 3.21 0.72 0.72
1080P MusicVideo 1080P-55af 3.88 3.88 1.78 1.78 0.31 0.33 −0.99 −0.68
intotree 9.71 9.71 9.42 9.42 8.46 8.46 4.92 4.92
Average 7.96 7.96 6.28 6.27 3.38 3.40 1.45 2.05

Switchable Scheme. We also compare the proposed method, a.k.a. tex-switch, with our previous work in [210], a.k.a. tex-allgf, which enables texture mode for all frames in a GF group. All three methods use the same encoder setting for a fair comparison. Bit rate saving results for various videos at different resolutions against the AV1 baseline are shown in Table III. A positive bit rate saving (%) indicates a reduction compared with the AV1 baseline.

In general, compared to the AV1 baseline, the coding performance of tex-allgf shows significant bit rate savings at lower QPs. However, as QP increases, the savings are diminished. In some cases, tex-allgf exhibits poorer coding performance than the AV1 baseline at a high QP (e.g., negative numbers at QP 40). At a high QP, most blocks have zero residual due to heavy quantization, leaving very limited margins for bit rate savings using texture mode. In addition, a few extra bits are required for the signalling of the global motion of texture mode coded blocks. The bit savings gained through residual skipping in texture mode still cannot compensate for the bits used as overhead for this side information.

Furthermore, the proposed tex-switch method retains the greatest bit rate savings offered by tex-allgf, and resolves the loss at higher QP settings. As shown in Table III, negative
numbers are mostly removed (highlighted in green) by the
introduction of a GoP-level switchable texture mode. In some
cases where tex-switch has zero bit rate savings compared to
the AV1 baseline, the texture mode is completely disabled
for all the GF groups, whereas tex-allgf has loss. In a few
cases, however, tex-switch has less bit rate saving than tex-
allgf (highlighted in red). This is because the bit rate saving
performance of the first GF group in the scene fails to
accurately represent the whole scene in some of the UGC
sequences with short scene cuts. A possible solution is to
identify additional GF groups that show potential bit rate
savings and enable texture mode for these GF groups.
2) Subjective Evaluation: Although significant bit rate savings have been achieved compared to the AV1 baseline, it is acknowledged that identical QP values do not necessarily imply the same video quality. We have performed a subjective visual quality study with 20 participants. Reconstructed videos produced by the proposed method (tex-switch) and the baseline AV1 codec at QP = 16, 24, 32 and 40 are arranged randomly and assessed by the participants using the double stimulus continuous quality scale (DSCQS) method [213]. Subjects have been asked to choose among three options: the first video has better visual quality, the second video has better visual quality, or there is no difference between the two versions.

Fig. 5: Subjective evaluation of visual preference. Results show average subjective preference (%) for QP = 16, 24, 32, 40 compared between the AV1 baseline and the proposed switchable texture mode.

The result of this study is summarized in Figure 5. “Same Quality” indicates the percentage of participants that cannot tell the difference between the reconstructed videos by the AV1 baseline codec and the proposed method tex-switch (69.03% on average). The term “tex-switch” indicates the percentage of participants that prefer the reconstructions by the proposed method tex-switch (14.32% on average); and “AV1” indicates the percentage of participants who think the visual quality of the reconstructed videos using the AV1 baseline is better (16.65% on average).

We observe that the results are sequence dependent and that spatial and temporal artifacts can appear in the reconstructed video. The main artifacts come from the inaccurate pixel-based texture mask. For example, in some frames of the TelevisionClip 360P-74dd sequence, the texture masks include parts of the moving objects in the foreground, which are reconstructed using texture mode. Since the motion of the moving objects is different from the motion of the texture area, there are noticeable artifacts around those parts of the frame. To further improve the accuracy of region analysis using DNN-based pre-processing, we plan to incorporate an in-loop perceptual visual quality metric for optimization during the texture analysis and reconstruction.

D. Discussion And Future Direction

We proposed a DNN-based texture analysis/synthesis coding tool for the AV1 codec. Experimental results show that our proposed method can achieve noticeable bit rate reduction with satisfying visual quality for both standard test sets and user generated content, which is verified by a subjective study. We envision that video coding driven by semantic understanding will continue to improve in terms of both quality and bit rate, especially by leveraging advances of deep learning methods. However, there remain several open challenges that require further investigation.

Accuracy of region analysis is one of the major challenges for integrating semantic understanding into video coding. Recent advances in scene understanding have significantly improved the performance of region analysis. Nevertheless, visual artifacts are still noticeable when a non-texture region is incorrectly included in the texture mask, particularly if the analysis/synthesis coding system is open loop. One potential solution is to incorporate perceptual visual quality measures in-loop during the texture region reconstruction.

Video segmentation benchmark datasets are important for developing machine learning methods for video-based semantic understanding. Existing segmentation datasets are either based on images with texture [214], or contain general video objects only [215], [216], or focus on visual quality but lack segmentation ground truth.

VI. CASE STUDY FOR CODING: END-TO-END NEURAL VIDEO CODING (E2E-NVC)

This section presents a framework for end-to-end neural video coding. We include a discussion of its key components, as well as its overall efficiency. Our proposed method is extended from our pioneering work in [104] but with significant performance improvements by allowing fully end-to-end learning-based spatio-temporal feature representation. More details can be found in [131], [136], [217].

Fig. 6: End-to-End Neural Video Coding (E2E-NVC). This E2E-NVC in (a) consists of modularized intra and inter coding, where inter coding utilizes respective motion and residual coding. Each component is well exploited using a stacked CNN-based VAE for efficient representations of intra pixels, displaced inter residuals, and inter motions. All modularized components are inter-connected and optimized in an end-to-end manner. (b) The general VAE model applies stacked convolutions (e.g., 5×5) with main encoder-decoder (Em, Dm) and hyper encoder-decoder (Eh, Dh) pairs, where the main encoder Em includes four major convolutional layers (e.g., convolutional downsampling and three residual blocks (×3) for robust feature processing [201]). The hyper decoder Dh mirrors the steps in the hyper encoder Eh for hyper prior information generation. The prior aggregation (PA) engine collects the information from the hyper prior, autoregressive spatial neighbors, as well as temporal correspondences (if applicable) for the main decoder Dm to reconstruct the input scene. Non-local attention is adopted to simulate the saliency masking at the bottlenecks, and the rectified linear unit (ReLU) is implicitly embedded with convolutions for enabling nonlinearity. “Q” is for quantization, AE and AD for respective arithmetic encoding and decoding. 2↓ and 2↑ are downsampling and upsampling at a factor of 2 for both horizontal and vertical dimensions.

A. Framework

As with all modern video encoders, the proposed E2E-NVC compresses the first frame in each group of pictures as an intra-frame using a VAE-based compression engine (neuro-Intra). It codes the remaining frames in each group using motion compensated prediction. As shown in Fig. 6a, the proposed E2E-NVC uses the VAE compressor (neuro-Motion) to generate the multiscale motion field between the current frame and the reference frame. Then, a multiscale motion compensation network (MS-MCN) takes the multiscale compressed flows, warps the multiscale features of the reference frame, and combines these warped features to generate the predicted frame. The prediction residual is then coded using another VAE-based compressor (neuro-Res).
A low-delay E2E-NVC based video encoder is specifically illustrated in this work. Given a group of pictures (GOP) X = {X1, X2, ..., Xt}, we first encode X1 using the neuro-Intra module and obtain its reconstructed frame X̂1. The following frame X2 is encoded predictively, using neuro-Motion, MS-MCN, and neuro-Res together, as shown in Fig. 6a. Note that MS-MCN takes the multiscale optical flows {f⃗d1, f⃗d2, ..., f⃗ds} derived by the pyramid decoder in neuro-Motion, and then uses them to generate the predicted frame X̂p2 by multiscale motion compensation. The displaced inter-residual r2 = X2 − X̂p2 is then compressed in neuro-Res, yielding the reconstruction r̂2. The final reconstruction X̂2 is given by X̂2 = X̂p2 + r̂2. All of the remaining P-frames in the group of pictures are then encoded using the same procedure.
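This low-delay coding loop can be sketched as follows. Here neuro_intra, neuro_motion, ms_mcn, and neuro_res are placeholder callables standing in for the trained sub-networks, each assumed to return its output together with the bits it spends; the stub networks in the usage example exist only to make the sketch executable.

```python
def encode_gop(frames, neuro_intra, neuro_motion, ms_mcn, neuro_res):
    """Low-delay E2E-NVC loop: intra-code the first frame, then predictively
    code every remaining frame from its reconstructed predecessor."""
    x1_hat, total_bits = neuro_intra(frames[0])        # I-frame
    recon, ref = [x1_hat], x1_hat
    for x in frames[1:]:                               # P-frames
        flows, motion_bits = neuro_motion(x, ref)      # multiscale compressed flows
        x_pred = ms_mcn(flows, ref)                    # multiscale motion compensation
        r = x - x_pred                                 # displaced inter-residual
        r_hat, res_bits = neuro_res(r)
        x_hat = x_pred + r_hat                         # final reconstruction
        recon.append(x_hat)
        total_bits += motion_bits + res_bits
        ref = x_hat                                    # closed-loop reference frame
    return recon, total_bits

if __name__ == "__main__":
    import numpy as np
    frames = [np.random.rand(64, 64) for _ in range(4)]
    recon, bits = encode_gop(frames,
                             neuro_intra=lambda x: (x, 0),          # lossless stubs
                             neuro_motion=lambda x, ref: (None, 0),
                             ms_mcn=lambda flows, ref: ref,
                             neuro_res=lambda r: (r, 0))
    print(len(recon), "frames,", bits, "bits (stub networks)")
```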
Fig. 6b illustrates the general architecture of the VAE model. The VAE model includes a main encoder-decoder pair that is used for latent feature analysis and synthesis, as well as a hyper encoder-decoder for hyper prior generation. The main encoder Em uses four stacked CNN layers. Each convolutional layer employs strided convolutions to achieve downsampling (at a factor of 2 in this example) and cascaded convolutions for efficient feature extraction (here, we use three ResNet-based residual blocks [201])⁶. We use a two-layer hyper encoder Eh to further generate the subsequent hyper priors as side information, which is used in the entropy coding of the latent features.

We apply stacked convolutional layers with a limited (3×3) receptive field to capture the spatial locality. These convolutional layers are stacked in order to simulate layer-wise feature extraction. The same ideas are used in many relevant studies [142], [149]. We utilize the simplest ReLU as the nonlinear activation function (although other nonlinear activation functions, such as the Generalized Divisive Normalization in [105], could be used as well).

The human visual system operates in two stages: First, the observer scans an entire scene to gain a complete understanding of everything within the field of vision. Second, the observer focuses their attention on specific salient regions. During image and video compression, this mechanism of visual attention can be used to ensure that bit resources are allocated where they are most needed (e.g., via unequal feature quantization) [140], [218]. This allows resources to be assigned such that salient areas are more accurately reconstructed, while resources are conserved in the reconstruction of less-salient areas. To more accurately discern salient from non-salient areas, we adopt the non-local attention module (NLAM) at the bottleneck layers of both the main encoder and the hyper encoder, prior to quantization, in order to include both global and local information.

To enable more accurate conditional probability density modeling for entropy coding of the latent features, we introduce the Prior Aggregation (PA) engine, which fuses the inputs from the hyper priors, spatial neighbors, and temporal context (if applicable)⁷. Information theory suggests that more accurate context modeling requires fewer resources (e.g., bits) to represent information [219]. For the sake of simplicity, we assume the latent features (e.g., motion, image pixel, residual) follow the Gaussian distribution as in [148], [149]. We use the PA engine to derive the mean and standard deviation of the distribution for each feature.

Fig. 7: Efficiency of neuro-Intra. PSNR vs. rate performance of neuro-Intra in comparison to NLAIC [136], Minnen (2018) [149], BPG (4:4:4) and JPEG2000. Note that the curves for neuro-Intra and NLAIC overlap.

B. Neural Intra Coding

Our neuro-Intra is a simplified version of the Non-Local Attention optimized Image Compression (NLAIC) that was originally proposed in [136].

One major difference between the NLAIC and the VAE model using autoregressive spatial context in [149] is the introduction of the NLAM inspired by [220]. In addition, we have applied a 3D 5×5×5 masked CNN⁸ to extract spatial priors, which are fused with hyper priors in PA for entropy context modeling (e.g., the bottom part of Fig. 9). Here, we have assumed a single Gaussian distribution for the context modeling of entropy coding. Note that temporal priors are not used for intra-pixel and inter-residual coding in this paper; only the spatial priors are utilized.

The original NLAIC applies multiple NLAMs in both the main and hyper coders, leading to excessive memory consumption at a large spatial scale. In E2E-NVC, NLAMs are only used at the bottleneck layers for both the main and hyper encoder-decoder pairs, allowing bits to be allocated adaptively.

6 We choose to apply cascaded ResNets for stacked CNNs because they are highly efficient and reliable. Other efficient CNN architectures could also be applied.
7 Intra and residual coding only use joint spatial and hyper priors without temporal inference.
8 This 5×5×5 convolutional kernel shares the same parameters for all channels, offering great model complexity reduction as compared with the 2D CNN-based solution in [149].
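As a rough illustration of the main encoder Em described above (stride-2 convolutions, each followed by three residual blocks, with attention applied only at the bottleneck), a PyTorch sketch is given below. The channel width and the simple sigmoid gate used in place of the full non-local attention module are assumptions made for brevity, not the exact NLAIC configuration.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Plain residual block used for cascaded feature extraction."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class MainEncoder(nn.Module):
    """Four 5x5 stride-2 convolutions, each followed by three residual blocks;
    a sigmoid gate stands in for the non-local attention at the bottleneck."""
    def __init__(self, in_ch=3, ch=192):
        super().__init__()
        layers = []
        for i in range(4):
            layers.append(nn.Conv2d(in_ch if i == 0 else ch, ch, 5, stride=2, padding=2))
            layers.append(nn.ReLU(inplace=True))
            layers += [ResBlock(ch) for _ in range(3)]
        self.features = nn.Sequential(*layers)
        self.attention_gate = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.Sigmoid())
    def forward(self, x):
        y = self.features(x)
        return y * self.attention_gate(y)   # attention-weighted latents before quantization

if __name__ == "__main__":
    y = MainEncoder()(torch.randn(1, 3, 256, 256))
    print(y.shape)   # 16x downsampled latent features
```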

Flows (MCF)
Rate Estimation
Reconstruction
Rate Distortion Error
Optimization

Fig. 8: Multiscale Motion Estimation and Compensation. One-stage neuro-Motion with MS-MCN uses a pyramidal flow
decoder to synthesize the multiscale compressed optical flows (MCFs) that are used in a multiscale motion compensation
network for generating predicted frames.

two concatenated frames (where one frame is the reference


Temporal ConvLSTM Updated from the past, and one is the current frame) into quantized
Priors Priors Next Time Step
temporal features that represent the inter-frame motion. These
Temporal Prior quantized features are decoded into compressed optical flow
Updating
in an unsupervised way for frame compensation via warping.
Quantized Autoregressive Obtained
This one-stage scheme does not require any pre-trained flow
1x1x1 Fusion
Motion Features
3D Masked Conv
Priors Probability network such as FlowNet2 or PWC-net to generate the optical
concate

flow explicitly. It allows us to quantize the motion features


Hyper Decoder
rather than the optical flows, and to train the motion feature
Hyper Priors encoder and decoder together with explicit consideration of
Spatial-temporal Prior quantization and rate constraint.
Aggregation
The neuro-Motion module is modified for multiscale motion
Fig. 9: Context-Adaptive Modeling Using Joint Spatio- generation, where the main encoder is used for feature fusion.
temporal and Hyper Priors. All priors are fused in PA to We replace the main decoder with a pyramidal flow decoder,
provide estimates of the probability distribution parameters. which generates the multiscale compressed optical flows
(MCFs). MCFs will be processed together with the reference
frame, using a multiscale motion compensation network (MS-
To overcome the non-differentiability of the quantization MCN) to obtain the predicted frame efficiently, as shown in
operation, quantization is usually simulated by adding uniform Fig. 8. Please refer to [217] for more details.
noise in [142]. However, such noise augmentation is not Encoding motion compactly is another important factor for
exactly consistent with the rounding in inference, which can overall performance improvement. We suggest the joint spatio-
yield performance loss (as reported by [135]). Thus, we apply temporal and hyper prior-based context-adaptive model shown
universal quantization (UQ) [135] in neuro-Intra. UQ is used in Fig. 9 for efficiently inferring current quantized features.
for neuro-Motion and neuro-Res as well. When applied to This is implemented in the PA engine of Fig. 6b.
the common Kodak dataset, neuro-Intra performed as well as The joint spatio-temporal and hyper prior-based context-
NLAIC [136], and outperformed Minnen (2018) [149], BPG adaptive model mainly consists of a spatio-temporal-hyper
(4:4:4) and JPEG2000, as shown in Fig. 7. aggregation module (STHAM) and a temporal updating mod-
ule (TUM), shown in Fig. 9. At timestamp t, STHAM is
C. Neural Motion Coding and Compensation introduced to accumulate all the accessible priors and estimate
the mean and standard deviation of Gaussian Mixture Model
Inter-frame coding plays a vital role in video coding. The
(GMM) jointly using:
key is how to efficiently represent motion in a compact format
for compensation. In contrast to the pixel-domain block-based (µF , σF ) = F(F1 , ..., Fi−1 , ẑt , ht−1 ), (1)
motion estimation and compensation in conventional video
coding, we rely on optical flow to accurately capture the Spatial priors are autoregressively derived using masked
temporal information for motion compensation. 5×5×5 3D convolutions and then concatenated with decoded
To improve inter-frame prediction, we extend our earlier hyper priors and temporal priors using stacked 1×1×1 con-
work [131] to multiscale motion generation and compensa- volutions. Fi , i = 0, 1, 2, ... are elements of quantized latent
tion. This multiscale motion processing directly transforms features (e.g., motion flow), ht−1 is aggregated temporal priors

from motion flows preceding the current frame. The neuro-


Motion module exploits temporal redundancy to further pre- 
diction efficiency, leveraging the correlation between second-

order moments of inter motion. A probabilistic model of each
element to be encoded is derived with the estimated µF and 
σF by:


3615 G%
pF |(F1 ,...,F i−1,ẑt ,ht−1 ) (Fi |F1 , ..., Fi−1 , ẑt , ht−1 )
Y 1 1 
2
= (N (µF , σF ) ∗ U(− , ))(Fi ). (2)
2 2 
i 19&
Note that TUM is applied to embedded current quantized /XB&935
 +
features Ft recurrently using a standard ConvLSTM [221]: +

(ht , ct ) = ConvLSTM(Ft , ht−1 , ct−1 ), (3)       
%SS
where ht are updated temporal priors for the next frame, ct (a)
is a memory state to control information flow across multiple
time instances (e.g., frames). Other recurrent units can also be
used to capture temporal correlations as in (3). 
It is worth noting that leveraging second-order information
for the representation of compact motion is also widely 
explored in traditional video coding approaches. For example,
motion vector predictions from spatial and temporal co-located
neighbors are standardized in H.265/HEVC, by which only 0666,0 
motion vector differences (after prediction) are encoded.

D. Neural Residual Coding 19&
/XB&935
Inter-frame residual coding is another significant module +

contributing to the overall efficiency of the system. It is used +
to compress the temporal prediction error pixels. It affects       
the efficiency of next frame prediction, since errors usually %SS
propagate temporally. (b)
Here we use the VAE architecture in Fig. 6b to encode the Fig. 10: BD-Rate Illustration Using PSNR & MS-SSIM.
residual rt . The rate-constrained loss function is used: (a) NVC offers averaged 35.34% gain against the anchor
L = λ · D2 (Xt , (Xpt + r̂t )) + R, (4) H.264/AVC when distortion is measured using PSNR. (b)
NVC shows over 50% gains against anchor H.264/AVC when
where D2 is the `2 loss between a residual compensated using MS-SSIM evaluation. MS-SSIM is usually studied as a
frame Xpt + r̂t and Xt . neuro-Res will be first pretrained perceptual quality metric in image compression, especially at
using the frames predicted by the pretrained neuro-Motion and a low bit rate.
MS-MCN, and a loss function in (4) where the rate R only
accounts for the bits for residual. Then we refine neuro-Res
jointly with neuro-Motion and MS-MCN, using a loss where
R incorporates the bits for both motion and residual with two We show the leading compression efficiency in Fig. 10
frames. using respective PSNR and MS-SSIM measures, across
H.265/HEVC and UVG test sequences. In Table IV, by setting
the same anchor using H.264/AVC, our NVC presents 35%
E. Experimental Comparison BD-Rate gains, while H.265/HEVC and DVC offer 30% and
We applied the same low-delay coding setting as DVC 22% gains, respectively. If the distortion is measured by
in [129] for our method and traditional H.264/AVC, and the MS-SSIM, our gains in efficiency are even larger. This
H.265/HEVC for comparison. We encoded 100 frames and demonstrates that NVC can achieve a 50% improvement in
used GOP of 10 on H.265/HEVC test sequences, and 600 efficiency, while both H.265/HEVC and DVC achieve only
frames with GOP of 12 on the UVG dataset. For H.265/HEVC, around 25%.
we applied the fast mode of the x2659 — a popular open- Our NVC rivals the recent DVC Pro [222], an upgrade
source H.265/HEVC encoder implementation; while the fast of the earlier DVC [141], e.g., 35.54% and 50.83% BD-
mode of the x26410 is used as the representative of the Rate reduction measured by PSNR and MS-SSIM distortion
H.264/AVC encoder. respectively for NVC, while 34.57% and 45.88% marked for
9 http://x265.org/ DVC Pro. DVC [141] has mainly achieved a higher level of
10 https://www.videolan.org/developers/x264.html coding efficiency than H.265/HEVC at high bit rates. However,

TABLE IV: BD-Rate Gains of NVC, H.265/HEVC and DVC against the H.264/AVC.
H.265/HEVC DVC NVC
PSNR MS-SSIM PSNR MS-SSIM PSNR MS-SSIM
Sequences
BDBR BD-(D) BDBR BD-(D) BDBR BD-(D) BDBR BD-(D) BDBR BD-(D) BDBR BD-(D)
ClassB -32.03% 0.78 -27.67% 0.0046 -27.92% 0.72 -22.56% 0.0049 -45.66% 1.21 -54.90% 0.0114
ClassC -20.88% 0.91 -19.57% 0.0054 -3.53% 0.13 -24.89% 0.0081 -17.82% 0.73 -43.11% 0.0133
ClassD -12.39% 0.57 -9.68% 0.0023 -6.20% 0.26 -22.44% 0.0067 -15.53% 0.70 -43.64% 0.0123
ClassE -36.45% 0.99 -30.82% 0.0018 -35.94% 1.17 -29.08% 0.0027 -49.81% 1.70 -58.63% 0.0048
UVG -48.53% 1.00 -37.5% 0.0056 -37.74% 1.00 -16.46% 0.0032 -48.91% 1.24 -53.87% 0.0100
Average -30.05% 0.85 -25.04% 0.0039 -22.26% 0.65 -23.08% 0.0051 -35.54% 1.11 -50.83% 0.0103
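For reference, the BD-Rate figures reported throughout this section follow the standard Bjøntegaard calculation; a compact version is sketched below (cubic fit of log rate against quality, integrated over the overlapping quality range). The rate-distortion points in the usage example are purely hypothetical.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate (%): average bit rate change of the test codec
    relative to the anchor at equal quality, from per-QP (rate, PSNR) points."""
    p_a = np.polyfit(psnr_anchor, np.log10(rate_anchor), 3)
    p_t = np.polyfit(psnr_test, np.log10(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (10 ** avg_diff - 1) * 100     # negative values mean bit rate savings

if __name__ == "__main__":
    anchor = ([800, 1200, 1900, 3100], [32.1, 34.0, 35.8, 37.5])   # hypothetical points
    test   = ([650, 1000, 1600, 2700], [32.2, 34.1, 35.9, 37.6])
    print("BD-Rate: %.2f%%" % bd_rate(anchor[0], anchor[1], test[0], test[1]))
```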

NVC H.265/HEVC H.264/AVC


(BPP: 0.1274/PSNR: 28.07dB) (BPP: 0.1347/PSNR: 27.61dB) (BPP: 0.1353/PSNR: 26.57dB)

NVC H.265/HEVC H.264/AVC


(BPP: 0.0634/PSNR: 34.63dB) (BPP: 0.0627/PSNR: 33.88dB) (BPP: 0.0687/PSNR: 32.57dB)

Fig. 11: Visual Comparison. Reconstructed frames of NVC, H.265/HEVC and H.264/AVC. We avoid blocky artifacts, visible
noise, etc., and provide better quality at lower bit rate.

a sharp decline in the performance of DVC is revealed at NVC uses 7 percent fewer bits despite an improvement in qual-
low bit rates (e.g., performing worse than H.264/AVC at some ity greater than 1.5 dB PSNR, compared with H.264/AVC. For
rates). We have also observed that DVC’s performance varies other cases, our method also shows robust improvement. Tradi-
for different test sequences. DVC Pro upgrades DVC with tional codec usually suffers from blocky artifacts and motion-
better intra/residual coding using [149] and λ fine-tuning, induced noise close to the edges of objects. In H.264/AVC,
showing state-of-the-art performance [222]. you clearly can observe block partition boundaries with severe
pixel discontinuity. Our results provide higher-quality recon-
Visual Comparison We provide a visual quality comparison struction and avoid noise and artifacts.
between NVC, H.264/AVC, and H.265/HEVC as shown in
Fig. 11. Generally, NVC yields reconstructions that are much
higher in quality than those of its competitors, even with a F. Discussion And Future Direction
lower bit rate cost. For the sample clip “RaceHorse”, which We developed an end-to-end deep neural video coding
NVC H.265/HEVC H.264/AVC
includes(BPP:
non-translational motion
0.0364/PSNR: 36.82dB)and a complex background, framework that can learn (BPP:
(BPP: 0.0368/PSNR: 36.24dB)
compact spatio-temporal
0.0395/PSNR: 35.15dB)represen-
sion
analysis Efficient neural in-loop filtering enhancement

17

tation of raw video input. Our extensive simulations yielded CNN model
very encouraging results, demonstrating that our proposed rcorr
method can offer consistent and stable gains over existing input xcorr
methods (e.g., traditional H.265/HEVC, recent learning-based x
approaches [129], etc.,) across a variety bit rates and a wide
range of content. skip connection
The H.264/AVC, H.264/HEVC, AVS, AV1, and even (a)
the VVC, are masterpieces of hybrid prediction/transform a0
CNN model
framework-based video coding. Rate-distortion optimization, r0
rcorr

rate control, etc., can certainly be incorporated to improve input rM-1
xcorr
learning-based solutions. For example, reference frame selec- x
tion is an important means by which we can embed and aggre- aM-1
gate the most appropriate information for reducing temporal skip connection
error and improving overall inter-coding efficiency. Making (b)
deep learning-based video coding practically applicable is Fig. 12: CNN-based Restoration. (a) Conventional model
another direction worthy of deeper investigation. structure. (b) Guided CNN model with adaptive weights.
High-quality Compensated
HF0 FlowNet HF0
VII. C ASE S TUDIES FOR P OST- PROCESSING :
Enhanced Low-quality Low-quality E
WARN model where a0 , a1 , · · · , aflow field parameters WARN
that aremodel
E FFICIENT
HF0Nand HF1 F ILTERING
EURAL LF M−1 are the weighting
LF
Motion
explicitly signaled in the compressed bitstream.
Compensated
demon-High-quality
In this case study, both in-loop and post filtering are compensation
HF1Our objective is to minimize HF the
1 distance between the
strated using stacked DNN-based neural filters for quality
restored block xcorr and its corresponding source s, i.e.,
enhancement of reconstructed frames. We specifically
(a) Single-frame design
processing 2 2
|xcorr − s| (b)=Pre-processing (c) Multi-frame
|rcorr − d| . Given the channel-wise processing
output
a single-frame guided CNN which adapts pre-trained CNN
features r0 , r1 , · · · , rM−1 , for a degraded input x, the
models to different video contents for in-loop filtering, and a
weighting parameters a0 , a1 , · · · , aM−1 can then be estimated
multi-frame CNN leveraging spatio-temporal information for
by least-square
Plain optimization as Proposed
post filtering. Both reveal noticeable performance gains. In ResBlk ResBlk
T
practice, neural filters can be devised, i.e., in-loop or post, [a0 , a1 , · · · , aM−1 ] = (RT R)−1 RT d, (6)
according to the application requirements.
where R = [r0 , r1 , . . . , rM−1 ] is the matrix at a size of N×M
comprised of stacked output features in column-wise order.
A. In-loop Filtering via Guided CNN The reconstruction error is given by
As reviewed in Section IV, most existing works design a e = |rcorr − d|2 = |d|2 − dT R(RT R)−1 RT d. (7)
CNN model to directly map a degraded input frame to its
restored version (e.g., ground truth label), as illustrated in Loss Function. Assuming that one training batch is com-
Fig. 12a. To ensure that the model is generalizable to other prised of T patch pairs: {si , xi }, i = 0, 1, , · · · , T − 1, the
contexts, CNN models are often designed to use deeper layers, overall reconstruction error over the training set is
denser connections, wider receptive fields, etc., with hundreds X 2 −1
of millions of parameters. As a consequence, such generalized E= {|di | − di T Ri (Ri T Ri ) Ri T di }, (8)
i
models are poorly suited to most practical applications. To where di = si − xi is the error for the ith patch. Ri =
address this problem, we propose that content adaptive weights [ri,0 , ri,1 , · · · , ri,M−1 ] is the corresponding channel-wise fea-
be used to guide a shallow CNN model (as shown in Fig. 12b) tures in matrix form, with ri,j being the j th channel when
instead. training sample xi is passed through the CNN model. Given
The principle underlying this approach is sparse signal 2
that |di | is independent of the network model, the loss
decomposition: We expect that the CNN model can represent function can be simplified as
any input as a weighted combination of channel-wise features. X −1
Note that weighting coefficients are dependent on input sig- L= {−di T Ri (Ri T Ri ) Ri T di }. (9)
i
nals, making this model generalizable to a variety of content
characteristics. Experimental Studies. A shallow baseline CNN model(as
Method. Let x be a degraded block with N pixels in a described in Table V) is used to demonstrate the efficiency
column-wise vector format. The corresponding source block of of the guided CNN model. This model is comprised of seven
x is s, which has a processing error d = s−x. We wish to have layers in total and has a fixed kernel size of 3×3. At the
rcorr from x so that the final reconstruction xcorr = x + rcorr bottleneck layer, the channel number of the output feature map
is closer to s. is M. After extensive simulations, M = 2 was selected. In
Let the CNN output layer have M channels, i.e., r0 , total, our model only requires 3,744 parameters, far fewer than
r1 ,· · · ,rM−1 . Then, the rcorr is assumed as a linear combi- the number required by existing methods.
nation of these channel-wise feature vectors, In training, 1000 pictures of DIV2K [223] were used. All
frames were compressed using the AV1 encoder with in-loop
rcorr = a0 r0 + a1 r1 + · · · + aM−1 rM−1 , (5) filters CDEF [159] and LR [160] turned off to generate
WARN model compensation
HF1 HF1 HF1

(a) (b) 18 (c)

TABLE V: Layered structure and parameter settings of base- Plain Proposed


line CNN model. ResBlk ResBlk

1×1, 192
Layer Kernel size Input channels Output channels Parameters
1 3×3 1 16 144
2 3×3 16 8 1152
3 3×3 8 8 576
4 3×3 8 8 576
Compensated 3×3,
5 3×3 8 8 576 32 3×3, Res ... Res 3×3,
HF 96 Blk Blk 1
6 3×3 8 8 576
Low-quality 3×3, Enhanced
7 3×3 8 M 144
LF 32 8 ResBlks LF
Total parameters 3744
Compensated 3×3,
MF 32

corresponding quantization-induced degraded reconstructions. Fig. 13: WARN. This wide activation residual network is used
GOP (GOP size = 16)
We divided the 64 QPs into six ranges and trained one model POC: to fuse/enhance input frame for improved quality. In MVE
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
for each QP range. The six ranges include QP values 7 to case, it takes three inputs to enhance the LFs; and in SVE
16, 17 to 26, 27 to 36, 47 to 56, and 57 to 63. Compressed case, it inputs a single frame and outputs its enhanced version.
frames falling into the same QP range were used to train This WARN generally follows the residual network structure
the corresponding CNN model. Frames were segmented into with residual link and ResBlk embedded. Note that ResBlk
64×64 patches. Each batch contained 1,000 patches. We isHFextended LF LF to LF support
HF LF LF wide LF activation
HF LF LF from
LF HFits LF
plain
LF version
LF HF

adopted the Adaptive moment estimation (Adam) algorithm, prior to ReLU activation.
with the initial learning rate set at 1e-4. The learning rate is
halved every 20 epochs.
We used the Tensorflow platform, which runs on NVIDIA approach on AV1 reconstructed frames and achieved signifi-
GeForce GTX 1080Ti GPU, to evaluate coding efficiency cant coding improvement. Similar observations are expected
across four QPs, e.g., {32, 43, 53, and 63}. Our test set with different anchors, such as the H.265/HEVC.
included 24 video sequences with resolutions ranging from Method. Single-frame video enhancement (SVE) refers to
2560×1600 to 352×288. The first 50 frames of each sequence the sole application of the fusion network without leveraging
were tested in both intra and inter configurations. temporal frame correlations. As discussed in Section IV, there
In our experiments, N was set to 64, 128, 256, and the are a great number of network models that can be used to do
whole frame, respectively. We found that N = 256 yields SVE. In most cases, the efficiency and complexity are at odds
the best performance. For each block, the linear combination with one another: In other words, efficiency and complexity
parameters ai (i = 0, 1) were derived accordingly. To strike come at the cost of deeper networks and higher numbers of
an appropriate balance between bit consumption and model parameters. Recently, Yu et al. [224] discovered that models
efficiency, our experiments suggest that the dynamic range of with more feature channels before activation could provide
ai is within 15. significantly better performance with the same parameters
We compared the respective BD-Rate reductions of our and computational budgets. We designed a wide activation
guided CNN model and a baseline CNN model against the residual network (WARN) by combining wide activation with
AV1 baseline encoder. All filters were enabled for the AV1 a powerful deep residual network (ResNet) [225], shown
anchor. For a description of the baseline CNN model, see in Fig. 13. This WARN illustrates the three inputs for an
Table V. Our guided CNN model is the baseline model plus enhanced output in the MVE framework. In contrast, SVE
the adaptive weights. normally inputs a single frame, and outputs a corresponding
Both baseline and guided CNN models were applied on top enhanced representation.
of the AV1 encoder with only the deblocking filter enabled, This MVE closely follows the two-step strategy reviewed
and other filters (including CDEF and LR) turned off. The find- in Section IV. It uses FlowNet2 [187] to perform pixel-
ings reported in Table VI demonstrate that either baseline or level motion estimation/compensation-based temporal frame
guided CNN models can be used to replace additional adaptive alignment. Next, a WARN-based fusion network is used for
in-loop filters, while improving R-D efficiency. Furthermore, final enhancement. We allow the two High-quality Frames
regardless of block size and frame types, our guided model (HF) immediately preceding and succeeding a low-quality
always outperformed the baseline CNN. This is mainly due frame (LF) to enhance the Low-quality Frame (LF) in between.
to the adaptive weights used to better characterize content dy- Bi-directional warping is performed for each LF to produce
namics. Similar lightweight CNN structures can be upgraded compensated HFs in Fig. 14.
using deep models [163], [164], [167] for potentially greater Experimental Studies. We evaluate both SVE and MVE
BD-Rate savings. against the AV1 baseline. A total of 118 video sequences
were selected to train network models. More specifically, the
first 200 frames of each sequence were encoded with AV1
B. Multi-frame Post Filtering encoder to generate the reconstructed frames. The QPs are
This section demonstrates how multi-frame video enhance- {32, 43, 53, 63}, yielding 23,600 reconstructed frames in total.
ment (MVE) scheme-based post filtering can be used to min- After frame alignment, we selected one training set containing
imize compression artifacts. We implemented our proposed compensated HF0 , compensated HF1 , and to-be-enhanced LF
19

TABLE VI: BD-Rate savings of baseline and guided CNN models against the AV1.
All Intra Random Access
Resolution Sequence Baseline Guided CNN Baseline Guided CNN
CNN N=64 N=128 N=256 Frame CNN N=64 N=128 N=256 Frame
PeopleOnStreet −1.15% −1.95% −2.84% −2.90% −2.81% −0.19% −0.22% −1.03% −1.02% −0.83%
2560 × 1600
Traffic −1.71% −1.76% −3.01% −3.16% −3.03% −0.26% +1.89% −1.64% −2.15% −2.17%
BasketballDrive −0.45% +2.95% −0.72% −1.06% −0.72% −0.02% +8.04% +0.87% +0.07% −0.05%
BQTerrace −0.98% −3.19% −3.66% −3.44% −2.10% −0.33% +0.68% −1.62% −1.91% −1.51%
Cactus −1.64% −1.38% −2.79% −2.89% −2.56% −0.21% +1.18% −1.13% −1.31% −0.96%
1920 × 1080 Kimono −0.23% +3.55% −0.18% −0.88% −0.95% −0.07% +6.07% +0.84% −0.07% −0.01%
ParkScene −1.21% +0.01% −1.92% −2.21% −2.11% −0.07% +1.11% −1.46% −1.82% −0.92%
blue-sky −2.89% −0.96% −2.58% −2.86% −2.56% +0.00% +3.46% −2.02% −2.96% −2.77%
crowd run −3.01% −2.34% −3.11% −3.22% −3.08% −0.13% −1.69% −2.19% −2.07% −1.09%
BasketballDrill −2.99% −5.55% −6.45% −6.26% −5.88% −0.25% −0.33% −2.10% −1.79% −1.55%
BQMall −1.74% −3.96% input
−4.48% −4.46% −4.35% −0.15% +0.16% xcorr −1.05% −1.13% −0.76%
832 × 480
PartyScene −0.83% −3.77% x −4.02% −3.97% −3.81% −0.20% −1.10% −1.43% −1.25% −0.13%
RaceHorsesC −1.91% −2.01% −2.58% −2.49% −2.38% −0.21% −0.70% −1.28% −1.03% −0.80%
BasketballPass −3.08% −3.66% −4.60% −4.72% −4.65% −0.20% +0.71% −0.63% −0.62%
skip connection −0.36%
BlowingBubbles −2.60% −3.36% −3.78% −3.77% −3.76% −0.34% −0.55% −1.05% −0.87% −0.86%
416 × 240
BQSquare −4.92% −6.09% −6.23% −6.27% (a) −6.22% −0.50% −0.54% −0.92% −1.13% −1.17%
RaceHorses −3.57% −5.39% −5.75% −5.75% −5.76%a0 −0.51% −2.82% −3.06% −2.69% −2.94%
CNN model
Johnny −2.01% −2.41% −4.03% −4.21% −4.12% r0 −0.31% +8.32% −0.94% −2.57% −2.63%
1280 × 720 FourPeople −1.94% −0.54% −3.49% −3.76% −2.85% … −0.29%rcorr
+17.99% +1.20% −1.65% −1.60%
KristenAndSara −2.71% input
−1.49% −3.97% −4.32% −4.26% rM-1 −0.42% +15.95%xcorr +0.53% −2.49% −2.31%
Harbour −0.79% x
−1.18% −1.43% −1.38% −1.42% −0.23% −1.00% −1.29% −1.40% −1.08%
Ice −3.59% −5.54% −6.88% −7.08% −7.19% −0.59% −1.59% −3.59% −3.65% −3.97%
352 × 288 aM-1 −0.21% +1.96% −0.29% −0.27%
Silent −1.68% −1.88% −2.80% skip −2.77% −2.79%
connection −0.70%
Students −3.08% −4.10% −4.77% −4.81% −4.88% −0.52% +1.25% −1.16% −1.44% −1.66%
Average −2.11% −2.33% −3.59% −3.69% (b) −3.51% −0.26% +2.43% −1.10% −1.55% −1.37%

Enhanced High-quality Compensated


WARN model
HF0 HF0 FlowNet HF0
Low-quality Low-quality Enhanced
flow Plain WARN model Proposed
LF LF ResBlk ResBlk LF

1×1, 192
Motion
Enhanced High-quality Compensated
WARN model compensation
HF1 HF1 HF1

(a) (b) (c)


Compensated Res 3×3,
Res 3×3,
3×3,
Fig. 14: Enhancement Framework. (a) Single-input WARN-based SVE
HF to HF. enhance
Blk (b)+(c)
... 32 the
BlkTwo-step
96
1 MVE using
FlowNet2 for temporal alignment, and three-input WARN- basedLow-quality
fusion to3×3, use preceding and succeeding HFsEnhanced
for LF
Plain LF Proposed 32 8 ResBlks LF
enhancement. ResBlk ResBlk Compensated 3×3,
1×1, 192

MF 32

from every 8 frames, which yielded a total of 2900 training GOP (GOP size = 16)
sets. These sets were used to train the WARN model as the POC: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
fusion network. Notice that we trained the WARN models
Compensated 3×3,
for SVE and MVE individually. HF The GoP32 size3×3,
was Res
16 with ... Res 3×3,
96 Blk Blk 1
a hierarchical prediction structure.
Low-quality The
3×3,LFs and HFs were Enhanced
identified using their QPs, i.e.,LF HFs with32 lower QP than the
8 ResBlks LF LF LF
HF LF LF LF HF LF HF LF LF LF HF LF LF LF HF
base QP were decoded, such as frames 0, 4, 8, 12, and 16 in
Fig. 15. Fig. 15: The hierarchical coding structure in the AV1
Original
encoder. The LFs AV1
frame arereconstructed
frame
SVE
enhanced enhancement
using MVE
HFsemhancement
following the
Algorithms were implemented using the Tensorflow plat- prediction structure via MVE scheme, and HFs are restored
form, NVIDIA GeForce GTX 1080Ti GPU. In training, frames using SVE method.
were segmented into 64×64 patches, with 64 patches included
in each batch. We adopted the Adam optimizer with the
initial learning rate set at 1e-4. The learning rate can be then We applied the proposed method on AV1 reconstructed
adjusted using the step strategy with γ = 0.5. An additional 18 frames. The results are presented in Table VII. Due to the
sequences were also employed for testing. These were mostly hierarchical coding structure in inter prediction, the LFs in
used to evaluate video quality. The first 50 frames of each Fig. 15 were enhanced using the neighboring HFs via MVE
test sequence were compressed. Then the reconstructed frames framework. The HFs themselves are enhanced using the SVE
were enhanced using the proposed SVE and MVE methods. method.
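A minimal sketch of this two-step enhancement is given below: the two neighboring high-quality frames are backward-warped to the low-quality frame with externally estimated flows (e.g., from FlowNet2) and then passed to the fusion network. The fusion_net callable and the flow tensors are placeholders for the trained WARN and flow estimator, not their actual interfaces.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp a frame (N,C,H,W) with an optical flow (N,2,H,W) given in pixels."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(frame.device)    # (2,H,W), x then y
    coords = base.unsqueeze(0) + flow                               # sampling positions
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0                         # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                            # (N,H,W,2)
    return F.grid_sample(frame, grid, align_corners=True)

def enhance_lf(lf, hf0, hf1, flow0, flow1, fusion_net):
    """Two-step MVE: align both neighboring high-quality frames to the
    low-quality frame, then fuse all three with the enhancement network."""
    aligned0 = warp(hf0, flow0)
    aligned1 = warp(hf1, flow1)
    return fusion_net(torch.cat([aligned0, lf, aligned1], dim=1))
```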
20

TABLE VII: BD-rate improvement


SVE of proposed
MVE SVE and Original AV1 reconstructed SVE MVE
MVE scheme against the AV1.
enhancement emhancement frame frame enhancement emhancement
All Intra Random Access
Class Sequence
SVE MVE SVE MVE
PeopleOnStreet −9.1% −14.7% −5.0% −8.1%
A
Traffic −7.6% −22.2% −5.8% −8.8%
BasketballDrive −5.9% −13.1% −4.4% −6.4%
BQTerrace −8.0% −23.7% −7.7% −9.8%
B Cactus −7.7% −21.9% −3.9% −6.0%
Kimono −3.8% −20.4% −3.9% −7.1%
ParkScene −5.1% −26.3% −4.9% −8.0%
BasketballDrill −12.5% −21.3% −5.6% −7.9%
BQMall −8.9% −18.7% −3.5% −6.1%
C
PartyScene −7.2% −19.0% −3.2% −5.0%
RaceHorsesC −5.9% −18.3% −3.3% −5.6%
BasketballPass −10.0% −18.5% −3.4% −6.2% Fig. 16: Qualitative Visualization Zoomed-in snapshots of
BlowingBubbles −7.0% −19.8% −4.6% −6.7% reconstructed frames for the AV1 baseline, SVE and MVE
D
BQSquare −10.8% −21.3% −11.0% −13.6% filtered restoration, as well as the ground truth label.
RaceHorses −9.2% −19.3% −4.9% −7.8%
FourPeople −9.7% −21.7% −5.1% −7.4%
E Johnny −9.6% −20.7% −5.5% −8.0% The quality of enhanced frames plays a significant role
KristenAndSara −9.6% −21.2% −4.4% −7.0%
for overall coding performance, since they serve as reference
Average −8.2% −20.1% −5.0% −7.5%
frames for the motion estimation of subsequent frames. Our
future work will investigate the joint effect of in-loop filtering
The overall BD-Rate savings of the SVE and MVE methods and motion estimation on reference frames to exploit the
are tabulated in Table VII, against the AV1. SVE achieves an inherent correlations of these coding tools, which could further
averaged reduction of 8.2% and 5.0% BD-rate for all intra improve coding performance.
and random access scenarios, respectively. On the other hand,
our MVE obtains 20.1% and 7.5% BD-rate savings on aver- VIII. D ISCUSSION AND C ONCLUSION
age, further demonstrating the effectiveness of our proposed As an old Chinese saying goes, “A journey of a thousand
scheme. When random access techniques are used, the HFs miles begins with a single step.” This is particularly true in the
selected are generally distant from a target LF, which reduces realm of technological advancement. Both the fields of video
the benefits provided from inter HFs. On the other hand, compression and machine learning have been established for
intra coding techniques uniformly demonstrate greater BD-rate many decades, but until recently, they evolved separately in
savings, because the neighboring frames nearest to target LFs both academic explorations and industrial practice.
can be used. This contributes significantly to enhancement. Lately, however, we have begun to witness the interdis-
Besides the objective measures, sample snapshots of recon- ciplinary advancements yielded by the proactive application
structed frames are illustrated in Fig. 16, clearly demonstrating of deep learning technologies [226] into video compression
that blocky and ringing artifacts from the AV1 baseline are systems. Benefits of these advances include remarkable im-
attenuated after applying either SVE or MVE based filtering. provements in performance in many technical aspects. To
Notably, MVE creates more visually appealing images than showcase the remarkable products of this disciplinary cross-
SVE. pollination, we have identified three major functional blocks
in a practical video system, e.g., pre-processing, coding, post-
C. Discussion And Future Direction processing. We then reviewed related studies and publications
In this section, we proposed DNN-based approaches for to help the audience familiarize themselves with these topics.
video quality enhancement. For in-loop filtering, we developed Finally, we presented three case studies to highlight the state-
a guided CNN framework to adapt pre-trained CNN models to of-the-art efficiency resulting from the application of DNNs to
various video contents. Under this framework, the guided CNN video compression systems, which demonstrates this avenue
learns to project an input signal onto a subspace of dimension of exploration’s great potential to bring about a new generation
M. The weighting parameters for a linear combination of these of video techniques, standards, and products.
channels are explicitly signaled in the encoded bitstream to ob- Though this article presents separate DNN-based case stud-
tain the final restoration. For post filtering, we devised a spatio- ies for pre-processing, coding, and post-processing, we believe
temporal multi-frame architecture to alleviate the compression that a fully end-to-end DNN model could potentially offer
artifacts. A two-step scheme is adopted in which optical flow a greater improvement in performance, while enabling more
is first obtained for accurate motion estimation/compensation, functionalities. For example, Xia et al. [227] applied deep
and then a wide activation residual network called WARN object segmentation in pre-processing, and used it to guide
is designed for information fusion and quality enhancement. neural video coding, demonstrating noticeable visual improve-
Our proposed enhancement approaches can be implemented ments at very low bit rates. Meanwhile, Lee et al. [228] and
on different CNN architectures. others observed similar effects, when a neural adaptive filter
21

was successfully used to further enhance neural compressed The exponential growth of Internet traffic, a majority of
images. which involves videos and images, has been the driving
Nevertheless, a number of open problems requiring substan- force for the development of video compression systems.
tial further study have been discovered. These include: The availability of a vast amount of images through the
Internet, meanwhile, has been critical for the renaissance of
• Model Generalization: It is vital for DNN models to be the field of machine learning. In this work, we show that
generalizable to a wide variety of video content, different artifacts, etc. Currently, most DNN-based video compression techniques utilize supervised learning, which often demands a large amount of labelled image/video data to cover the full spectrum of the aforementioned application scenarios. Continuously developing a large-scale dataset, such as ImageNet (http://www.image-net.org/), presents one possible solution to this problem. An alternative approach is to apply more advanced techniques that alleviate the uncertainty caused by limited training samples and thereby improve model generalization. These techniques include (but are not limited to) few-shot learning [229] and self-supervised learning [226].

• Complexity: Existing DNN-based methods are mainly criticized for their prohibitive complexity in both the computational and spatial (memory) dimensions. Compared to conventional video codecs, which require only tens of kilobytes of on-chip memory, most DNN algorithms require several megabytes or even gigabytes of memory (the first sketch after this list illustrates the magnitude of this footprint). In addition, although inference may be fast, training could take hours, days, or even weeks to obtain converged and reliable models [141]. All of these issues present serious barriers to the market adoption of DNN-based tools, particularly on energy-efficient mobile platforms. One promising solution is to design specialized hardware for the acceleration of DNN algorithms [158]. Currently, neural processing units (NPUs) have attracted significant attention and are gradually being deployed in heterogeneous platforms (e.g., the Qualcomm AI Engine in the Snapdragon chip series, the Neural Engine in Apple silicon, etc.). This paints a promising picture of a future in which DNN algorithms can be deployed on NPU-equipped devices at a massive scale.

• QoE Metric: Video quality matters. A video QoE metric that is better correlated with the human visual system is highly desirable, not only for quality evaluation but also for loss control in DNN-based video compression. There has been notable development in both subjective and objective video quality assessment, yielding several well-known metrics, such as SSIM [230], just-noticeable distortion (JND) [231], and VMAF [232], some of which are actively adopted for the evaluation of video algorithms, application products, etc. On the other hand, existing DNN-based video coding approaches can be optimized adaptively for a pre-defined loss function, such as MSE, SSIM, adversarial loss [157], or a VGG-feature-based semantic loss. However, none of these loss functions has shown a clear advantage. A unified, differentiable, and HVS-driven metric is of great importance if DNN-based video coding techniques are to offer perceptually better QoE (the second sketch after this list shows how such loss terms are typically combined in training).
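To make the memory-footprint concern raised in the Complexity item concrete, the first sketch below counts the parameters of a toy CNN filter. PyTorch is assumed, and the eight-layer, 64-channel filter is a purely illustrative stand-in for a DNN-based coding tool rather than any method reviewed in this article.

```python
# Hedged illustration of the spatial-complexity point: even a small CNN holds weights
# far exceeding the tens of kilobytes of on-chip memory used by conventional codecs.
import torch.nn as nn


def model_size_mb(model: nn.Module, bytes_per_param: int = 4) -> float:
    """Storage needed for the weights alone (FP32 by default), in megabytes."""
    n_params = sum(p.numel() for p in model.parameters())
    return n_params * bytes_per_param / 2**20


# A toy 8-layer, 64-channel filter operating on feature maps: roughly 1.1 MB of FP32 weights.
toy_filter = nn.Sequential(*[nn.Conv2d(64, 64, kernel_size=3, padding=1) for _ in range(8)])
print(f"{model_size_mb(toy_filter):.2f} MB")
```

The second sketch illustrates the QoE item: a composite distortion loss that mixes pixel-wise MSE with a VGG-feature (perceptual) term for end-to-end training. The codec network, the weights lmbda and w_vgg, and the rate estimate are illustrative placeholders and do not reproduce the formulation of any specific method surveyed here.

```python
# A minimal sketch (PyTorch assumed) of a composite rate-distortion training loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


class VGGFeatureLoss(nn.Module):
    """MSE between frozen VGG-16 features of the reconstruction and the reference."""

    def __init__(self, num_layers: int = 16):
        super().__init__()
        # pretrained=True is the classic torchvision API; newer releases use weights=...
        vgg = models.vgg16(pretrained=True).features[:num_layers].eval()
        for p in vgg.parameters():
            p.requires_grad = False  # the feature extractor stays frozen
        self.vgg = vgg

    def forward(self, rec: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # rec/ref: 3-channel tensors in [0, 1]; ImageNet normalization omitted for brevity.
        return F.mse_loss(self.vgg(rec), self.vgg(ref))


vgg_loss = VGGFeatureLoss()


def rd_loss(rec, ref, bpp, lmbda: float = 0.01, w_vgg: float = 0.1) -> torch.Tensor:
    """Rate + lambda * (MSE + w_vgg * perceptual); SSIM or adversarial terms could be
    added analogously once a differentiable implementation is chosen."""
    return bpp + lmbda * (F.mse_loss(rec, ref) + w_vgg * vgg_loss(rec, ref))


# Hypothetical training step (`codec` stands for any learned encoder/decoder pair):
# rec, bpp = codec(frame)
# rd_loss(rec, frame, bpp).backward()
```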
Advances in video compression sustain the Internet-scale video applications that supply deep learning with massive amounts of data, while recent progress in deep learning can, in return, improve video compression. These mutual positive feedbacks suggest that significant progress could be achieved in both fields when they are investigated together. Therefore, the approaches presented in this work could be the stepping stones for improving the compression efficiency of Internet-scale video applications.
From a different perspective, most compressed videos will ultimately be consumed by human beings or interpreted by machines for subsequent task decisions. The latter is a typical computer vision (CV) problem, i.e., content understanding and decision making for consumption or for task-oriented applications (e.g., detection, classification, etc.). Existing approaches perform these tasks by first decoding the video and then running learned or rule-based methods on the decoded pixels. Such separate processing, i.e., video decoding followed by CV tasks, is relied upon mainly because traditional pixel-prediction-based differential video compression methods break up the spatio-temporal features that could be potentially helpful for vision tasks. In contrast, recent DNN-based video compression algorithms rely on feature extraction, activation, suppression, and aggregation to obtain a more compact representation. For these reasons, it is expected that CV tasks can be fulfilled in the compressive domain, without bit decoding and pixel reconstruction (see the sketch following this paragraph). Our earlier attempts, reported in [233], [234], have shown very encouraging gains in classification and retrieval accuracy when operating directly on compressive formats, without resorting to traditional feature-based approaches on decoded pixels. Using powerful DNNs to unify video compression and computer vision techniques is an exciting new field. It is also worth noting that ISO/IEC MPEG is now actively working on a new project called “Video Coding for Machines” (VCM, https://mpeg.chiariglione.org/standards/exploration/video-coding-machines), with an emphasis on exploring video compression solutions for both human perception and machine intelligence.
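As a minimal illustration of compressive-domain analysis, the sketch below attaches a lightweight classification head directly to the quantized latents produced by a learned encoder. PyTorch is assumed, and analysis_transform, the latent channel count, and the training setup are hypothetical placeholders; this is not the pipeline of [233], [234].

```python
# Hedged sketch of a vision task performed on codec latents, without pixel reconstruction.
import torch
import torch.nn as nn


class LatentClassifier(nn.Module):
    """Classification head operating on latents of shape (N, C, H/16, W/16)."""

    def __init__(self, latent_channels: int = 192, num_classes: int = 1000):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(latent_channels, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),   # pool over the spatial latent grid
            nn.Flatten(),
            nn.Linear(256, num_classes),
        )

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        return self.head(latents)


# Usage (hypothetical): the learned encoder is frozen; only the task head is trained.
# y = analysis_transform(frame)       # quantized latents, no bit decoding or reconstruction
# logits = LatentClassifier()(y)
# loss = nn.functional.cross_entropy(logits, labels)
```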
REFERENCES

[1] D. J. Brady, M. E. Gehm, R. A. Stack, D. L. Marks, D. S. Kittle, D. R. Golish, E. Vera, and S. D. Feller, “Multiscale gigapixel photography,” Nature, vol. 486, no. 7403, pp. 386–389, 2012.
[2] M. Cheng, Z. Ma, S. Asif, Y. Xu, H. Liu, W. Bao, and J. Sun, “A dual camera system for high spatiotemporal resolution video acquisition,” IEEE Trans. Pattern Analysis and Machine Intelligence, no. 01, pp. 1–1, 2020.
[3] F. Dufaux, P. Le Callet, R. Mantiuk, and M. Mrak, High dynamic range video: from acquisition, to display and applications. Academic Press, 2016.
[4] M. Winken, D. Marpe, H. Schwarz, and T. Wiegand, “Bit-depth scalable video coding,” in 2007 IEEE International Conference on Image Processing, vol. 1. IEEE, 2007, pp. I–5.
[5] P. Tudor, “Mpeg-2 video compression,” Electronics & communication engineering journal, vol. 7, no. 6, pp. 257–264, 1995.
[6] B. G. Haskell, A. Puri, and A. N. Netravali, Digital video: an introduction to MPEG-2. Springer Science & Business Media, 1996.

[7] T. Sikora, “The mpeg-4 video standard verification model,” IEEE IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155,
Transactions on circuits and systems for video technology, vol. 7, no. 1, 2017.
pp. 19–31, 1997. [30] C. Tian, Y. Xu, L. Fei, and K. Yan, “Deep learning for image denoising:
[8] W. Li, “Overview of fine granularity scalability in mpeg-4 video stan- a survey,” in International Conference on Genetic and Evolutionary
dard,” IEEE Transactions on circuits and systems for video technology, Computing. Springer, 2018, pp. 563–572.
vol. 11, no. 3, pp. 301–317, 2001. [31] A. Chakrabarti, “A neural approach to blind motion deblurring,” in
[9] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview European conference on computer vision. Springer, 2016, pp. 221–
of the H.264/AVC video coding standard,” IEEE Transactions on 235.
circuits and systems for video technology, vol. 13, no. 7, pp. 560–576, [32] J. Koh, J. Lee, and S. Yoon, “Single-image deblurring with neural
2003. networks: A comparative survey,” Computer Vision and Image Under-
[10] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of standing, p. 103134, 2020.
the high efficiency video coding (HEVC) standard,” IEEE Transactions [33] Y. Zhu, X. Fu, and A. Liu, “Learning dual transformation networks for
on circuits and systems for video technology, vol. 22, no. 12, pp. 1649– image contrast enhancement,” IEEE Signal Processing Letters, 2020.
1668, 2012. [34] W. Guan, T. Wang, J. Qi, L. Zhang, and H. Lu, “Edge-aware con-
[11] V. Sze, M. Budagavi, and G. J. Sullivan, “High efficiency video coding volution neural network based salient object detection,” IEEE Signal
(hevc),” in Integrated circuit and systems, algorithms and architectures. Processing Letters, vol. 26, no. 1, pp. 114–118, 2018.
Springer, 2014, vol. 39, pp. 49–90. [35] L. Xu, J. Ren, Q. Yan, R. Liao, and J. Jia, “Deep edge-aware filters,” in
[12] G. J. Sullivan, P. N. Topiwala, and A. Luthra, “The h. 264/avc advanced International Conference on Machine Learning, 2015, pp. 1669–1678.
video coding standard: Overview and introduction to the fidelity range [36] L. Zhaoping, “A new framework for understanding vision from the per-
extensions,” in Applications of Digital Image Processing XXVII, vol. spective of the primary visual cortex,” Current opinion in neurobiology,
5558. International Society for Optics and Photonics, 2004, pp. 454– vol. 58, pp. 1–10, 2019.
474. [37] X. Chen, M. Zirnsak, G. M. Vega, E. Govil, S. G. Lomber, and
[13] A. Vetro, T. Wiegand, and G. J. Sullivan, “Overview of the stereo and T. Moore, “The contribution of parietal cortex to visual salience,”
multiview video coding extensions of the h. 264/mpeg-4 avc standard,” bioRxiv, 2019, doi: http://doi.org/10.1101/619643.
Proceedings of the IEEE, vol. 99, no. 4, pp. 626–642, 2011. [38] O. Schwartz and E. Simoncelli, “Natural signal statistics and sensory
[14] L. Yu, S. Chen, and J. Wang, “Overview of AVS-video coding gain control.” Nature neuroscience, vol. 4, no. 8, p. 819, 2001.
standards,” Signal processing: Image communication, vol. 24, no. 4, [39] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual
pp. 247–262, 2009. attention for rapid scene analysis,” IEEE Transactions on pattern
[15] S. Ma, S. Wang, and W. Gao, “Overview of ieee 1857 video coding analysis and machine intelligence, vol. 20, no. 11, pp. 1254–1259,
standard,” in 2013 IEEE International Conference on Image Process- 1998.
ing. IEEE, 2013, pp. 1500–1504. [40] L. Itti, “Automatic foveation for video compression using a neuro-
[16] J. Zhang, C. Jia, M. Lei, S. Wang, S. Ma, and W. Gao, “Recent biological model of visual attention,” IEEE transactions on image
development of avs video coding standard: AVS3,” in 2019 Picture processing, vol. 13, no. 10, pp. 1304–1318, 2004.
Coding Symposium (PCS). IEEE, 2019, pp. 1–5. [41] T. V. Nguyen, M. Xu, G. Gao, M. Kankanhalli, Q. Tian, and S. Yan,
[17] Y. Chen, D. Murherjee, J. Han, A. Grange, Y. Xu, Z. Liu, S. Parker, “Static saliency vs. dynamic saliency: a comparative study,” in Proceed-
C. Chen, H. Su, U. Joshi et al., “An overview of core coding tools in ings of the 21st ACM international conference on Multimedia, 2013,
the AV1 video codec,” in 2018 Picture Coding Symposium. IEEE, pp. 987–996.
2018, pp. 41–45. [42] E. Vig, M. Dorr, and D. Cox, “Large-scale optimization of hierarchical
[18] J. Han, B. Li, D. Mukherjee, C.-H. Chiang, C. Chen, H. Su, S. Parker, features for saliency prediction in natural images,” in Proceedings of
U. Joshi, Y. Chen, Y. Wang et al., “A technical overview of av1,” arXiv the IEEE Conference on Computer Vision and Pattern Recognition,
preprint arXiv:2008.06091, 2020. 2014, pp. 2798–2805.
[19] “AOM - alliance for open media,” http://www.aomedia.org/. [43] N. Liu, J. Han, D. Zhang, S. Wen, and T. Liu, “Predicting eye fixations
[20] A. Aaron, Z. Li, M. Manohara, J. De Cock, and D. Ronca, using convolutional neural networks,” in Proceedings of the IEEE
“Per-Title Encode Optimization,” The Netflix Tech Blog, https:// Conference on Computer Vision and Pattern Recognition, 2015, pp.
netflixtechblog.com/per-title-encode-optimization-7e99442b62a2, (14 362–370.
December 2015). [44] G. Li and Y. Yu, “Deep contrast learning for salient object detection,” in
[21] T. Shoham, D. Gill, S. Carmel, N. Terterov, and P. Tiktov, “Content- Proceedings of the IEEE Conference on Computer Vision and Pattern
adaptive frame level rate control for video encoding using a perceptual Recognition, 2016, pp. 478–487.
video quality measure,” in Applications of Digital Image Processing [45] N. Liu and J. Han, “Dhsnet: Deep hierarchical saliency network for
XLII, A. G. Tescher and T. Ebrahimi, Eds., vol. 11137, September salient object detection,” in Proceedings of the IEEE Conference on
2019, p. 26. Computer Vision and Pattern Recognition, 2016, pp. 678–686.
[22] Y.-C. Lin, H. Denman, and A. Kokaram, “Multipass encoding for [46] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. H. Torr,
reducing pulsing artifacts in cloud based video transcoding,” in 2015 “Deeply supervised salient object detection with short connections,” in
IEEE International Conference on Image Processing, September 2015, Proceedings of the IEEE Conference on Computer Vision and Pattern
pp. 907–911. Recognition, 2017, pp. 3203–3212.
[23] G. J. Sullivan and T. Wiegand, “Video compression-from concepts to [47] B. Yan, H. Wang, X. Wang, and Y. Zhang, “An accurate saliency
the H.264/AVC standard,” Proceedings of the IEEE, vol. 93, no. 1, pp. prediction method based on generative adversarial networks,” in 2017
18–31, 2005. IEEE International Conference on Image Processing. IEEE, 2017,
[24] A. Norkin, G. Bjontegaard, A. Fuldseth, M. Narroschke, M. Ikeda, pp. 2339–2343.
K. Andersson, M. Zhou, and G. Van der Auwera, “HEVC deblocking [48] Y. Xu, S. Gao, J. Wu, N. Li, and J. Yu, “Personalized saliency and
filter,” IEEE Transactions on Circuits and Systems for Video Technol- its prediction,” IEEE transactions on pattern analysis and machine
ogy, vol. 22, no. 12, pp. 1746–1754, 2012. intelligence, vol. 41, no. 12, pp. 2975–2989, 2018.
[25] C.-M. Fu, E. Alshina, A. Alshin, Y.-W. Huang, C.-Y. Chen, C.-Y. Tsai, [49] L. Bazzani, H. Larochelle, and L. Torresani, “Recurrent mixture
C.-W. Hsu, S.-M. Lei, J.-H. Park, and W.-J. Han, “Sample adaptive density network for spatiotemporal visual attention,” arXiv preprint
offset in the HEVC standard,” IEEE Transactions on Circuits and arXiv:1603.08199, 2016.
Systems for Video technology, vol. 22, no. 12, pp. 1755–1764, 2012. [50] C. Bak, A. Kocak, E. Erdem, and A. Erdem, “Spatio-temporal saliency
[26] R. Gupta, M. T. Khanna, and S. Chaudhury, “Visual saliency guided networks for dynamic saliency prediction,” IEEE Transactions on
video compression algorithm,” Signal Processing: Image Communica- Multimedia, vol. 20, no. 7, pp. 1688–1698, 2017.
tion, vol. 28, no. 9, pp. 1006–1022, 2013. [51] M. Sun, Z. Zhou, Q. Hu, Z. Wang, and J. Jiang, “SG-FCN: A motion
[27] S. Liu, X. Li, W. Wang, E. Alshina, K. Kawamura, K. Unno, Y. Kidani, and memory-based deep learning model for video saliency detection,”
P. Wu, A. Segall, M. Wien et al., “Ahg on neural network based coding IEEE transactions on cybernetics, vol. 49, no. 8, pp. 2900–2911, 2018.
tools,” Joint Video Expert Team, no. JVET-S0267/M54764, June 2020. [52] L. Jiang, M. Xu, T. Liu, M. Qiao, and Z. Wang, “Deepvs: A deep
[28] S. Liu, E. Alshina, J. Pfaff, M. Wien, P. Wu, and Y. Ye, “Report of learning based video saliency prediction approach,” in Proceedings of
ahg11 meeting on neural network-based video coding,” Joint Video the European Conference on Computer Vision, 2018, pp. 602–617.
Expert Team, no. JVET-T0042/M54848, July 2020. [53] Z. Wang, J. Ren, D. Zhang, M. Sun, and J. Jiang, “A deep-learning
[29] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a based feature hybrid framework for spatiotemporal saliency detection
gaussian denoiser: Residual learning of deep cnn for image denoising,” inside videos,” Neurocomputing, vol. 287, pp. 68–83, 2018.

[54] R. Cong, J. Lei, H. Fu, F. Porikli, Q. Huang, and C. Hou, “Video the National Academy of Sciences, vol. 111, no. 8, pp. 3170–3175,
saliency detection via sparsity-based reconstruction and propagation,” 2014.
IEEE Transactions on Image Processing, vol. 28, no. 10, pp. 4819– [76] J. Ukita, T. Yoshida, and K. Ohki, “Characterisation of nonlinear
4831, 2019. receptive fields of visual neurons by convolutional neural network,”
[55] K. Min and J. J. Corso, “Tased-net: Temporally-aggregating spatial Scientific reports, vol. 9, no. 1, pp. 1–17, 2019.
encoder-decoder network for video saliency detection,” in Proceedings [77] P. Neri, “Nonlinear characterization of a simple process in human
of the IEEE International Conference on Computer Vision, 2019, pp. vision,” Journal of Vision, vol. 9, no. 12, pp. 1–1, 2009.
2394–2403. [78] D. J. Heeger, “Normalization of cell responses in cat striate cortex,”
[56] W. Wang, J. Shen, J. Xie, M.-M. Cheng, H. Ling, and A. Borji, Visual Neuroscience, vol. 9, no. 2, pp. 181–197, 1992.
“Revisiting video saliency prediction in the deep learning era,” IEEE [79] N. J. Priebe and D. Ferster, “Mechanisms of neuronal computation in
transactions on Pattern Analysis and Machine Intelligence, 2019. mammalian visual cortex,” Neuron, vol. 75, no. 2, pp. 194–208, 2012.
[57] A. Vetro, T. Haga, K. Sumi, and H. Sun, “Object-based coding for long- [80] M. Carandini and D. J. Heeger, “Normalization as a canonical neural
term archive of surveillance video,” in Proceedings of International computation,” Nature Reviews Neuroscience, vol. 13, no. 1, p. 51, 2012.
Conference on Multimedia and Expo, vol. 2, July 2003, p. 417, [81] M. H. Turner and F. Rieke, “Synaptic rectification controls nonlinear
Baltimore, MD. spatial integration of natural visual inputs,” Neuron, vol. 90, no. 6, pp.
[58] T. Nishi and H. Fujiyoshi, “Object-based video coding using pixel 1257–1271, 2016.
state analysis,” in Proceedings of the 17th International Conference on [82] D. Doshkov and P. Ndjiki-Nya, “Chapter 6 - how to use texture analysis
Pattern Recognition, vol. 3, August 2004, pp. 306–309, Cambridge, and synthesis methods for video compression,” in Academic Press
UK. Library in signal Processing, ser. Academic Press Library in Signal
[59] L. Zhu and Q. Zhang, “Motion-based foreground extraction in com- Processing, S. Theodoridis and R. Chellappa, Eds. Oxford, UK:
pressed video,” in 2010 International Conference on Measuring Tech- Elsevier, 2014, vol. 5, pp. 197–225.
nology and Mechatronics Automation, vol. 2. IEEE, 2010, pp. 711– [83] P. Ndjiki-Nya, D. Doshkov, H. Kaprykowsky, F. Zhang, D. Bull, and
714. T. Wiegand, “Perception-oriented video coding based on image analysis
[60] Z. Zhang, T. Jing, J. Han, Y. Xu, and X. Li, “Flow-process foreground and completion: A review,” Signal Processing: Image Communication,
region of interest detection method for video codecs,” IEEE Access, vol. 27, no. 6, pp. 579–594, 2012.
vol. 5, pp. 16 263–16 276, 2017. [84] A. K. Jain and F. Farrokhnia, “Unsupervised texture segmentation
[61] Y. Guo, Z. Xuan, and L. Song, “Foreground target extraction method using gabor filters,” in Proceedings of the International Conference
based on neighbourhood pixel intensity correction,” Australian Journal on Systems, Man, and Cybernetics Conference proceedings. IEEE,
of Mechanical Engineering, pp. 1–10, 2019. 1990, pp. 14–19, Los Angeles, CA.
[62] A. Shahbaz, V.-T. Hoang, and K.-H. Jo, “Convolutional neural network [85] A. C. Bovik, M. Clark, and W. S. Geisler, “Multichannel texture
based foreground segmentation for video surveillance systems,” in analysis using localized spatial filters,” IEEE Transactions on Pattern
IECON 2019-45th Annual Conference of the IEEE Industrial Elec- Analysis and Machine Intelligence, vol. 12, no. 1, pp. 55–73, 1990.
tronics Society, vol. 1. IEEE, 2019, pp. 86–89. [86] U. S. Thakur and O. Chubach, “Texture analysis and synthesis using
[63] S. Zhou, J. Wang, D. Meng, Y. Liang, Y. Gong, and N. Zheng, steerable pyramid decomposition for video coding,” in Proceedings of
“Discriminative feature learning with foreground attention for person International Conference on Systems, Signals and Image Processing,
re-identification,” IEEE Transactions on Image Processing, vol. 28, September 2015, pp. 204–207, London, UK.
no. 9, pp. 4671–4684, 2019. [87] J. Portilla and E. P. Simoncelli, “A parametric texture model based on
joint statistics of complex wavelet coefficients,” International Journal
[64] M. Babaee, D. T. Dinh, and G. Rigoll, “A deep convolutional neural
of Computer Vision, vol. 40, no. 1, pp. 49–70, 2000.
network for background subtraction,” arXiv preprint arXiv:1702.01731,
[88] S. Bansal, S. Chaudhury, and B. Lall, “Dynamic texture synthesis
2017.
for video compression,” in Proceeding of National Conference on
[65] X. Liang, S. Liao, X. Wang, W. Liu, Y. Chen, and S. Z. Li, “Deep back-
Communications, Feb 2013, pp. 1–5, New Delhi, India.
ground subtraction with guided learning,” in 2018 IEEE International
[89] G. R. Cross and A. K. Jain, “Markov random field texture models,”
Conference on Multimedia and Expo. IEEE, 2018, pp. 1–6.
IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 1,
[66] S. Zhang, K. Wei, H. Jia, X. Xie, and W. Gao, “An efficient foreground- pp. 25–39, 1983.
based surveillance video coding scheme in low bit-rate compression,” [90] R. Chellappa and S. Chatterjee, “Classification of textures using
in 2012 Visual Communications and Image Processing. IEEE, 2012, Gaussian Markov random fields,” IEEE Transactions on Acoustics,
pp. 1–6. Speech, and Signal Processing, vol. 33, no. 4, pp. 959–963, 1985.
[67] H. Hadizadeh and I. V. Bajić, “Saliency-aware video compression,” [91] D. G. Lowe, “Distinctive image features from scale-invariant key-
IEEE Transactions on Image Processing, vol. 23, no. 1, pp. 19–33, points,” International Journal of Computer Vision, vol. 60, no. 2, pp.
2013. 91–110, 2004.
[68] Y. Li, W. Liao, J. Huang, D. He, and Z. Chen, “Saliency based percep- [92] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust
tual HEVC,” in 2014 IEEE International Conference on Multimedia features,” in Proceedings of the European Conference on Computer
and Expo Workshops. IEEE, 2014, pp. 1–5. Vision. Springer, 2006, pp. 404–417, Graz, Austria.
[69] C. Ku, G. Xiang, F. Qi, W. Yan, Y. Li, and X. Xie, “Bit allocation based [93] T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale
on visual saliency in HEVC,” in 2019 IEEE Visual Communications and rotation invariant texture classification with local binary patterns,”
and Image Processing. IEEE, 2019, pp. 1–4. IEEE Transactions on Pattern Analysis and Machine Intelligence,
[70] S. Zhu and Z. Xu, “Spatiotemporal visual saliency guided perceptual vol. 24, no. 7, pp. 971–987, 2002.
high efficiency video coding with neural network,” Neurocomputing, [94] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification
vol. 275, pp. 511–522, 2018. with deep convolutional neural networks,” Advances in Neural Infor-
[71] V. Lyudvichenko, M. Erofeev, A. Ploshkin, and D. Vatolin, “Improving mation Processing Systems, pp. 1097–1105, 2012.
video compression with deep visual-attention models,” in Proceedings [95] M. Cimpoi, S. Maji, and A. Vedaldi, “Deep filter banks for texture
of the 2019 International Conference on Intelligent Medicine and recognition and segmentation,” in Proceedings of the IEEE Conference
Image Processing, 2019, pp. 88–94. on Computer Vision and Pattern Recognition, 2015, pp. 3828–3836,
[72] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, “Predicting human Boston, Massachusetts.
eye fixations via an lstm-based saliency attentive model,” IEEE Trans- [96] F. Perronnin, J. Sánchez, and T. Mensink, “Improving the fisher kernel
actions on Image Processing, vol. 27, no. 10, pp. 5142–5154, 2018. for large-scale image classification,” in Proceedings of the European
[73] X. Sun, X. Yang, S. Wang, and M. Liu, “Content-aware rate control Conference on Computer Vision. Springer, 2010, pp. 143–156, Crete,
scheme for HEVC based on static and dynamic saliency detection,” Greece.
Neurocomputing, 2020. [97] A. A. Efros and T. K. Leung, “Texture synthesis by non-parametric
[74] M. Carandini, J. B. Demb, V. Mante, D. J. Tolhurst, Y. Dan, B. A. sampling,” in Proceedings of the Seventh IEEE International Confer-
Olshausen, J. L. Gallant, and N. C. Rust, “Do we know what the early ence on Computer Vision, vol. 2. IEEE, 1999, pp. 1033–1038, Kerkyra,
visual system does?” Journal of Neuroscience, vol. 25, no. 46, pp. Greece.
10 577–10 597, 2005. [98] L.-Y. Wei and M. Levoy, “Fast texture synthesis using tree-structured
[75] J. Kremkow, J. Jin, S. J. Komban, Y. Wang, R. Lashgari, X. Li, vector quantization,” in Proceedings of the 27th Annual Conference on
M. Jansen, Q. Zaidi, and J.-M. Alonso, “Neuronal nonlinearity explains Computer Graphics and Interactive Techniques, 2000, pp. 479–488,
greater visual spatial resolution for darks than lights,” Proceedings of New Orleans, LA.

[99] M. Ashikhmin, “Synthesizing natural textures,” in Proceedings of the [122] L. Zhao, S. Wang, X. Zhang, S. Wang, S. Ma, and W. Gao, “Enhanced
Symposium on Interactive 3D Graphics, 2001, pp. 217–226, New York, motion-compensated video coding with deep virtual reference frame
NY. generation,” IEEE Transactions on Image Processing, vol. 28, no. 10,
[100] H. Derin and H. Elliott, “Modeling and segmentation of noisy and pp. 4832–4844, 2019.
textured images using Gibbs random fields,” IEEE Transactions on [123] S. Xia, W. Yang, Y. Hu, and J. Liu, “Deep inter prediction via pixel-
Pattern Analysis and Machine Intelligence, no. 1, pp. 39–55, 1987. wise motion oriented reference generation,” in Proceedings of the IEEE
[101] D. J. Heeger and J. R. Bergen, “Pyramid-based texture analy- International Conference on Image Processing. IEEE, 2019, pp. 1710–
sis/synthesis,” in Proceedings of the 22nd annual Conference on 1774, Taipei, Taiwan.
Computer Graphics and Interactive Techniques, 1995, pp. 229–238, [124] S. Huo, D. Liu, F. Wu, and H. Li, “Convolutional neural network-based
Los Angeles, CA. motion compensation refinement for video coding,” in Proceedings of
[102] L. Gatys, A. S. Ecker, and M. Bethge, “Texture synthesis using convo- the IEEE International Symposium on Circuits and Systems. IEEE,
lutional neural networks,” Advances in Neural Information Processing 2018, pp. 1–4, Florence, Italy.
Systems, pp. 262–270, 2015. [125] M. M. Alam, T. D. Nguyen, M. T. Hagan, and D. M. Chandler, “A
[103] C. Li and M. Wand, “Precomputed real-time texture synthesis with perceptual quantization strategy for HEVC based on a convolutional
markovian generative adversarial networks,” in Proceedings of the neural network trained on natural images,” Applications of Digital
European Conference on Computer Vision. Springer, 2016, pp. 702– Image Processing XXXVIII, vol. 9599, p. 959918, 2015.
716, Amsterdam, The Netherlands. [126] R. Song, D. Liu, H. Li, and F. Wu, “Neural network-based arithmetic
[104] T. Chen, H. Liu, Q. Shen, T. Yue, X. Cao, and Z. Ma, “Deepcoder: A coding of intra prediction modes in HEVC,” in Proceedings of the
deep neural network based video compression,” in Proceedings of the IEEE Visual Communications and Image Processing. IEEE, 2017,
IEEE Visual Communications and Image Processing. IEEE, 2017, pp. pp. 1–4, St. Petersburg, FL.
1–4, St. Petersburg, FL. [127] S. Puri, S. Lasserre, and P. Le Callet, “CNN-based transform index
[105] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image prediction in multiple transforms framework to assist entropy coding,”
compression,” arXiv preprint arXiv:1611.01704, 2016. in Proceedings of the European Signal Processing Conference. IEEE,
[106] D. Liu, Y. Li, J. Lin, H. Li, and F. Wu, “Deep learning-based video 2017, pp. 798–802, Kos island, Greece.
coding: A review and a case study,” ACM Computing Surveys (CSUR), [128] C. Ma, D. Liu, X. Peng, and F. Wu, “Convolutional neural network-
vol. 53, no. 1, pp. 1–35, 2020. based arithmetic coding of dc coefficients for HEVC intra coding,” in
[107] S. Ma, X. Zhang, C. Jia, Z. Zhao, S. Wang, and S. Wanga, “Image and Proceedings of the IEEE International Conference on Image Process-
video compression with neural networks: A review,” IEEE Transactions ing. IEEE, 2018, pp. 1772–1776, Athens, Greece.
on Circuits and Systems for Video Technology, 2019. [129] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao, “Dvc: An
[108] Y. Li, D. Liu, H. Li, L. Li, F. Wu, H. Zhang, and H. Yang, “Convolu- end-to-end deep video compression framework,” in Proceedings of the
tional neural network-based block up-sampling for intra frame coding,” IEEE Conference on Computer Vision and Pattern Recognition, 2019,
IEEE Transactions on Circuits and Systems for Video Technology, pp. 11 006–11 015, Long Beach, CA.
vol. 28, no. 9, pp. 2316–2330, 2018. [130] O. Rippel, S. Nair, C. Lew, S. Branson, A. G. Anderson, and
[109] F. Jiang, W. Tao, S. Liu, J. Ren, X. Guo, and D. Zhao, “An end-to-end L. Bourdev, “Learned video compression,” in Proceedings of the IEEE
compression framework based on convolutional neural networks,” IEEE International Conference on Computer Vision, 2019, pp. 3454–3463,
Transactions on Circuits and Systems for Video Technology, vol. 28, South Korea.
no. 10, pp. 3007–3018, 2017.
[131] H. Liu, H. Shen, L. Huang, M. Lu, T. Chen, and Z. Ma, “Learned
[110] M. Afonso, F. Zhang, and D. R. Bull, “Video compression based on
video compression via joint spatial-temporal correlation exploration,”
spatio-temporal resolution adaptation,” IEEE Transactions on Circuits
in Proceedings of the AAAI Conference on Artificial Intelligence,
and Systems for Video Technology, vol. 29, no. 1, pp. 275–280, 2018.
vol. 34, no. 07, 2020, pp. 11 580–11 587.
[111] J. Lin, D. Liu, H. Yang, H. Li, and F. Wu, “Convolutional neural
[132] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Min-
network-based block up-sampling for HEVC,” IEEE Transactions on
nen, S. Baluja, M. Covell, and R. Sukthankar, “Variable rate image
Circuits and Systems for Video Technology, vol. 29, no. 12, pp. 3701–
compression with recurrent neural networks,” in Proceedings of the
3715, 2018.
International Conference on Learning Representations, 2016.
[112] W. Yang, X. Zhang, Y. Tian, W. Wang, J.-H. Xue, and Q. Liao,
“Deep learning for single image super-resolution: A brief review,” IEEE [133] G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen,
Transactions on Multimedia, vol. 21, no. 12, pp. 3106–3121, 2019. J. Shor, and M. Covell, “Full resolution image compression with
[113] W. Cui, T. Zhang, S. Zhang, F. Jiang, W. Zuo, and D. Zhao, “Con- recurrent neural networks,” in Proceedings of the IEEE Conference
volutional neural networks based intra prediction for HEVC,” arXiv on Computer Vision and Pattern Recognition, 2017, pp. 5306–5314,
preprint arXiv:1808.05734, 2018. Honolulu, Hawaii.
[114] J. Li, B. Li, J. Xu, R. Xiong, and W. Gao, “Fully connected network- [134] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen,
based intra prediction for image coding,” IEEE Transactions on Image S. Jin Hwang, J. Shor, and G. Toderici, “Improved lossy image
Processing, vol. 27, no. 7, pp. 3236–3247, 2018. compression with priming and spatially adaptive bit rates for recurrent
[115] J. Pfaff, P. Helle, D. Maniry, S. Kaltenstadler, W. Samek, H. Schwarz, networks,” in Proceedings of the IEEE Conference on Computer Vision
D. Marpe, and T. Wiegand, “Neural network based intra prediction and Pattern Recognition, 2018, pp. 4385–4393.
for video coding,” Applications of Digital Image Processing XLI, vol. [135] Y. Choi, M. El-Khamy, and J. Lee, “Variable rate deep image com-
10752, p. 1075213, 2018. pression with a conditional autoencoder,” in Proceedings of the IEEE
[116] Y. Hu, W. Yang, M. Li, and J. Liu, “Progressive spatial recurrent International Conference on Computer Vision, 2019, pp. 3146–3154.
neural network for intra prediction,” IEEE Transactions on Multimedia, [136] T. Chen, H. Liu, Z. Ma, Q. Shen, X. Cao, and Y. Wang, “Neural image
vol. 21, no. 12, pp. 3024–3037, 2019. compression via non-local attention optimization and improved context
[117] Z. Jin, P. An, and L. Shen, “Video intra prediction using convolutional modeling,” arXiv preprint arXiv:1910.06244, 2019.
encoder decoder network,” Neurocomputing, vol. 394, pp. 168–177, [137] J. Ballé, “Efficient nonlinear transforms for lossy image compression,”
2020. arXiv preprint arXiv:1802.00847, 2018.
[118] B. Girod, “Motion-compensating prediction with fractional-pel accu- [138] J. Lee, S. Cho, and S.-K. Beack, “Context-adaptive entropy
racy,” IEEE Transactions on Communications, vol. 41, no. 4, pp. 604– model for end-to-end optimized image compression,” arXiv preprint
612, 1993. arXiv:1809.10452, 2018.
[119] N. Yan, D. Liu, H. Li, and F. Wu, “A convolutional neural network [139] J. Klopp, Y.-C. F. Wang, S.-Y. Chien, and L.-G. Chen, “Learning
approach for half-pel interpolation in video coding,” in Proceedings of a code-space predictor by exploiting intra-image-dependencies.” in
the IEEE International Symposium on Circuits and Systems. IEEE, BMVC, 2018, p. 124.
2017, pp. 1–4, Baltimore, MD. [140] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool,
[120] H. Zhang, L. Song, Z. Luo, and X. Yang, “Learning a convolutional “Conditional probability models for deep image compression,” in IEEE
neural network for fractional interpolation in HEVC inter coding,” in Conference on Computer Vision and Pattern Recognition (CVPR),
Proceedings of the IEEE Conference on Visual Communications and vol. 1, no. 2, 2018, p. 3.
Image Processing. IEEE, 2017, pp. 1–4, St. Petersburg, FL. [141] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao, “DVC: An
[121] J. Liu, S. Xia, W. Yang, M. Li, and D. Liu, “One-for-all: Grouped end-to-end deep video compression framework,” in Proceedings of the
variation network-based fractional interpolation in video coding,” IEEE IEEE Conference on Computer Vision and Pattern Recognition, 2019,
Transactions on Image Processing, vol. 28, no. 5, pp. 2140–2151, 2018. pp. 11 006–11 015.

[142] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Vari- [165] X. Xu, J. Qian, L. Yu, H. Wang, X. Zeng, Z. Li, and N. Wang, “Dense
ational image compression with a scale hyperprior,” arXiv preprint inception attention neural network for in-loop filter,” in 2019 Picture
arXiv:1802.01436, 2018. Coding Symposium. IEEE, 2019, pp. 1–5.
[143] H. Liu, T. Chen, P. Guo, Q. Shen, X. Cao, Y. Wang, and Z. Ma, “Non- [166] K. Lin, C. Jia, Z. Zhao, L. Wang, S. Wang, S. Ma, and W. Gao,
local attention optimized deep image compression,” arXiv preprint “Residual in residual based convolutional neural network in-loop filter
arXiv:1904.09757, 2019. for avs3,” in 2019 Picture Coding Symposium. IEEE, 2019, pp. 1–5.
[144] C.-Y. Wu, N. Singhal, and P. Krähenbühl, “Video compression through [167] J. Kang, S. Kim, and K. M. Lee, “Multi-modal/multi-scale convolu-
image interpolation,” in Proceedings of the European Conference on tional neural network based in-loop filter design for next generation
Computer Vision, 2018, pp. 416–431. video codec,” in 2017 IEEE International Conference on Image Pro-
[145] A. Djelouah, J. Campos, S. Schaub-Meyer, and C. Schroers, “Neural cessing. IEEE, 2017, pp. 26–30.
inter-frame compression for video coding,” in Proceedings of the IEEE [168] C. Jia, S. Wang, X. Zhang, S. Wang, and S. Ma, “Spatial-temporal
International Conference on Computer Vision, 2019, pp. 6421–6429. residue network based in-loop filter for video coding,” in 2017 IEEE
[146] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, “Learning convolutional Visual Communications and Image Processing. IEEE, 2017, pp. 1–4.
networks for content-weighted image compression,” arXiv preprint [169] X. Meng, C. Chen, S. Zhu, and B. Zeng, “A new HEVC in-loop filter
arXiv:1703.10553, 2017. based on multi-channel long-short-term dependency residual networks,”
[147] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural in 2018 Data Compression Conference. IEEE, 2018, pp. 187–196.
networks,” in Proceedings of the IEEE conference on computer vision [170] D. Li and L. Yu, “An in-loop filter based on low-complexity cnn using
and pattern recognition, 2018, pp. 7794–7803. residuals in intra video coding,” in 2019 IEEE International Symposium
[148] Y. Hu, W. Yang, and J. Liu, “Coarse-to-fine hyper-prior modeling for on Circuits and Systems. IEEE, 2019, pp. 1–5.
learned image compression.” in Proceedings of the AAAI Conference [171] D. Ding, L. Kong, G. Chen, Z. Liu, and Y. Fang, “A switchable
on Artificial Intelligence, 2020, pp. 11 013–11 020. deep learning approach for in-loop filtering in video coding,” IEEE
[149] D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and Transactions on Circuits and Systems for Video Technology, vol. 30,
hierarchical priors for learned image compression,” in Advances in no. 7, pp. 1871–1887, 2020.
Neural Information Processing Systems, 2018, pp. 10 794–10 803. [172] D. Ding, G. Chen, D. Mukherjee, U. Joshi, and Y. Chen, “A CNN-
[150] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent based in-loop filtering approach for AV1 video codec,” in Proceedings
neural networks,” arXiv preprint arXiv:1601.06759, 2016. of the Picture Coding Symposium. IEEE, 2019, pp. 1–5, Ningbo,
[151] S. Reed, A. van den Oord, N. Kalchbrenner, S. G. Colmenarejo, China.
Z. Wang, Y. Chen, D. Belov, and N. De Freitas, “Parallel multiscale [173] G. Chen, D. Ding, D. Mukherjee, U. Joshi, and Y. Chen, “AV1 in-loop
autoregressive density estimation,” in Proceedings of the 34th Inter- filtering using a wide-activation structured residual network,” in Pro-
national Conference on Machine Learning. JMLR. org, 2017, pp. ceedings of the IEEE International Conference on Image Processing.
2912–2921. IEEE, 2019, pp. 1725–1729, Taipei, Taiwan.
[152] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learned image com- [174] H. Yin, R. Yang, X. Fang, and S. Ma, “Ce13-1.2: adaptive convolutional
pression with discretized gaussian mixture likelihoods and attention neural network loop filter,” JVET-N0480, 2019.
modules,” in Proceedings of the IEEE/CVF Conference on Computer [175] T. Li, M. Xu, C. Zhu, R. Yang, Z. Wang, and Z. Guan, “A deep learning
Vision and Pattern Recognition, 2020, pp. 7939–7948. approach for multi-frame in-loop filter of HEVC,” IEEE Transactions
[153] J. Lee, S. Cho, and M. Kim, “An end-to-end joint learning scheme of on Image Processing, vol. 28, no. 11, pp. 5663–5678, 2019.
image compression and quality enhancement with improved entropy [176] C. Jia, S. Wang, X. Zhang, S. Wang, J. Liu, S. Pu, and S. Ma,
minimization,” arXiv, pp. arXiv–1912, 2019. “Content-aware convolutional neural network for in-loop filtering in
[154] O. Rippel and L. Bourdev, “Real-time adaptive image compression,” high efficiency video coding,” IEEE Transactions on Image Processing,
arXiv preprint arXiv:1705.05823, 2017. vol. 28, no. 7, pp. 3343–3356, 2019.
[155] C. Huang, H. Liu, T. Chen, S. Pu, Q. Shen, and Z. Ma, “Extreme [177] C. Dong, Y. Deng, C. Change Loy, and X. Tang, “Compression artifacts
image coding via multiscale autoencoders with generative adversarial reduction by a deep convolutional network,” in Proceedings of the
optimization,” in Proceedings of IEEE Visual Communications and IEEE International Conference on Computer Vision, 2015, pp. 576–
Image Processing, 2019. 584, Santiago, Chile.
[156] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. V. [178] L. Cavigelli, P. Hager, and L. Benini, “CAS-CNN: A deep convo-
Gool, “Generative adversarial networks for extreme learned image lutional neural network for image compression artifact suppression,”
compression,” in Proceedings of the IEEE International Conference in Proceedings of the IEEE International Joint Conference on Neural
on Computer Vision, 2019, pp. 221–231. Networks. IEEE, 2017, pp. 752–759, Anchorage, Alaska.
[157] H. Liu, T. Chen, Q. Shen, T. Yue, and Z. Ma, “Deep image compression [179] J. Guo and H. Chao, “Building dual-domain representations for com-
via end-to-end learning,” in Proceedings of the IEEE International pression artifacts reduction,” in Proceedings of the European Confer-
Conference on Computer Vision Workshops, 2018. ence on Computer Vision. Springer, 2016, pp. 628–644, Amsterdam,
[158] J. L. Hennessy and D. A. Patterson, “A new golden age for computer The Netherlands.
architecture,” Communications of the ACM, vol. 62, no. 2, pp. 48–60, [180] L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo, “Deep
2019. generative adversarial compression artifact removal,” in Proceedings
[159] S. Midtskogen and J.-M. Valin, “The AV1 constrained directional of the IEEE International Conference on Computer Vision, 2017, pp.
enhancement filter (CDEF),” in 2018 IEEE International Conference 4826–4835, Venice, Italy.
on Acoustics, Speech and Signal Processing. IEEE, 2018, pp. 1193– [181] P. Liu, H. Zhang, K. Zhang, L. Lin, and W. Zuo, “Multi-level wavelet-
1197. cnn for image restoration,” in Proceedings of the IEEE Conference on
[160] D. Mukherjee, S. Li, Y. Chen, A. Anis, S. Parker, and J. Bankoski, “A Computer Vision and Pattern Recognition Workshops, 2018, pp. 773–
switchable loop-restoration with side-information framework for the 782, Salt Lake City, UT.
emerging AV1 video codec,” in 2017 IEEE International Conference [182] Y. Zhang, L. Sun, C. Yan, X. Ji, and Q. Dai, “Adaptive residual
on Image Processing. IEEE, 2017, pp. 265–269. networks for high-quality image restoration,” IEEE Transactions on
[161] C.-Y. Tsai, C.-Y. Chen, T. Yamakage, I. S. Chong, Y.-W. Huang, C.-M. Image Processing, vol. 27, no. 7, pp. 3150–3163, 2018.
Fu, T. Itoh, T. Watanabe, T. Chujoh, M. Karczewicz et al., “Adaptive [183] T. Wang, M. Chen, and H. Chao, “A novel deep learning-based method
loop filtering for video coding,” IEEE Journal of Selected Topics in of improving coding efficiency from the decoder-end for HEVC,” in
Signal Processing, vol. 7, no. 6, pp. 934–945, 2013. Proceedings of the Data Compression Conference. IEEE, 2017, pp.
[162] W.-S. Park and M. Kim, “Cnn-based in-loop filtering for coding 410–419, Snowbird, Utah.
efficiency improvement,” in 2016 IEEE 12th Image, Video, and Multi- [184] R. Yang, M. Xu, and Z. Wang, “Decoder-side HEVC quality en-
dimensional Signal Processing Workshop (IVMSP). IEEE, 2016, pp. hancement with scalable convolutional neural network,” in 2017 IEEE
1–5. International Conference on Multimedia and Expo. IEEE, 2017, pp.
[163] Y. Dai, D. Liu, and F. Wu, “A convolutional neural network approach 817–822.
for post-processing in HEVC intra coding,” in International Conference [185] X. He, Q. Hu, X. Zhang, C. Zhang, W. Lin, and X. Han, “Enhancing
on Multimedia Modeling. Springer, 2017, pp. 28–39. HEVC compressed videos with a partition-masked convolutional neural
[164] Y. Zhang, T. Shen, X. Ji, Y. Zhang, R. Xiong, and Q. Dai, “Residual network,” in 2018 25th IEEE International Conference on Image
highway convolutional neural networks for in-loop filtering in HEVC,” Processing. IEEE, 2018, pp. 216–220.
IEEE Transactions on image processing, vol. 27, no. 8, pp. 3827–3841, [186] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov,
2018. P. Van Der Smagt, D. Cremers, and T. Brox, “Flownet: Learning

optical flow with convolutional networks,” in Proceedings of the IEEE [207] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille,
international conference on computer vision, 2015, pp. 2758–2766. “Semantic image segmentation with deep convolutional nets and fully
[187] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, connected crfs,” arXiv preprint arXiv:1412.7062, 2014.
“Flownet 2.0: Evolution of optical flow estimation with deep networks,” [208] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated
in Proceedings of the IEEE conference on computer vision and pattern convolutions,” arXiv preprint arXiv:1511.07122, 2015.
recognition, 2017, pp. 2462–2470. [209] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba,
[188] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Pwc-net: Cnns for optical “Scene parsing through ade20k dataset,” in Proceedings of the IEEE
flow using pyramid, warping, and cost volume,” in Proceedings of the conference on computer vision and pattern recognition, 2017, pp. 633–
IEEE conference on computer vision and pattern recognition, 2018, 641.
pp. 8934–8943. [210] D. Chen, Q. Chen, and F. Zhu, “Pixel-level texture segmentation based
[189] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman, “Video en- AV1 video compression,” in 2019 IEEE International Conference on
hancement with task-oriented flow,” International Journal of Computer Acoustics, Speech and Signal Processing. IEEE, 2019, pp. 1622–1626.
Vision, vol. 127, no. 8, pp. 1106–1125, 2019. [211] M. Bosch, F. Zhu, and E. J. Delp, “Spatial texture models for video
[190] W. Bao, W.-S. Lai, X. Zhang, Z. Gao, and M.-H. Yang, “MEMC-Net: compression,” Proceedings of IEEE International Conference on Image
Motion estimation and motion compensation driven neural network for Processing, vol. 1, pp. 93–96, September 2007, San Antonio, TX.
video interpolation and enhancement,” IEEE transactions on pattern [212] C. Fu, D. Chen, E. Delp, Z. Liu, and F. Zhu, “Texture segmentation
analysis and machine intelligence, 2019. based video compression using convolutional neural networks,” Elec-
[191] X. Wang, K. C. Chan, K. Yu, C. Dong, and C. Change Loy, “EDVR: tronic Imaging, vol. 2018, no. 2, pp. 155–1, 2018.
Video restoration with enhanced deformable convolutional networks,” [213] I.-R. R. BT.500-14, “Methodologies for the subjective assessment of
in Proceedings of the IEEE Conference on Computer Vision and Pattern the quality of television images,” Geneva, Tech. Rep., 2019.
Recognition Workshops, 2019, pp. 0–0. [214] M. Haindl and S. Mikes, “Texture segmentation benchmark,” in Pro-
[192] R. Yang, M. Xu, Z. Wang, and T. Li, “Multi-frame quality enhancement ceedings of the 19th International Conference on Pattern Recognition.
for compressed video,” in Proceedings of the IEEE Conference on IEEE, 2008, pp. 1–4.
Computer Vision and Pattern Recognition, 2018, pp. 6664–6673. [215] N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, and T. Huang,
[193] Z. Guan, Q. Xing, M. Xu, R. Yang, T. Liu, and Z. Wang, “MFQE “YouTube-VOS: A large-scale video object segmentation benchmark,”
2.0: A new approach for multi-frame quality enhancement on com- arXiv preprint, p. arXiv:1809.03327, 2018.
pressed video,” IEEE Transactions on Pattern Analysis and Machine [216] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and
Intelligence, 2019. A. Sorkine-Hornung, “A benchmark dataset and evaluation method-
[194] J. Tong, X. Wu, D. Ding, Z. Zhu, and Z. Liu, “Learning-based ology for video object segmentation,” in Proceedings of the IEEE
multi-frame video quality enhancement,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp.
International Conference on Image Processing. IEEE, 2019, pp. 929– 724–732.
933, Taipei, Taiwan. [217] H. Liu, M. Lu, Z. Ma, F. Wang, Z. Xie, X. Cao, and Y. Wang, “Neural
[195] M. Lu, M. Cheng, Y. Xu, S. Pu, Q. Shen, and Z. Ma, “Learned video coding using multiscale motion compensation and spatiotemporal
quality enhancement via multi-frame priors for HEVC compliant low- context model,” accepted by IEEE Trans. Circuits and Systems for
delay applications,” in 2019 IEEE International Conference on Image Video Technology, Oct. 2020.
Processing. IEEE, 2019, pp. 934–938.
[218] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, “Learning convolutional
[196] U. Joshi, D. Mukherjee, J. Han, Y. Chen, S. Parker, H. Su, A. Chiang,
networks for content-weighted image compression,” in Proceedings of
Y. Xu, Z. Liu, Y. Wang et al., “Novel inter and intra prediction tools
the IEEE Conference on Computer Vision and Pattern Recognition,
under consideration for the emerging AV1 video codec,” in Applications
2018, pp. 3214–3223.
of Digital Image Processing XL, vol. 10396. International Society for
[219] T. M. Cover and J. A. Thomas, Elements of information theory. John
Optics and Photonics, 2017, p. 103960F.
Wiley & Sons, 2012.
[197] Z. Liu, D. Mukherjee, W.-T. Lin, P. Wilkins, J. Han, and Y. Xu,
“Adaptive multi-reference prediction using a symmetric framework,” [220] Y. Zhang, K. Li, K. Li, B. Zhong, and Y. Fu, “Residual non-local atten-
Electronic Imaging, vol. 2017, no. 2, pp. 65–72, 2017. tion networks for image restoration,” arXiv preprint arXiv:1903.10082,
[198] Y. Chen, D. Murherjee, J. Han, A. Grange, Y. Xu, Z. Liu, S. Parker, 2019.
C. Chen, H. Su, U. Joshi et al., “An overview of core coding tools in [221] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c.
the AV1 video codec,” in 2018 Picture Coding Symposium. IEEE, Woo, “Convolutional lstm network: A machine learning approach for
2018, pp. 41–45. precipitation nowcasting,” Advances in neural information processing
[199] Y. Wang, S. Inguva, and B. Adsumilli, “YouTube UGC dataset for video systems, vol. 28, pp. 802–810, 2015.
compression research,” IEEE International Workshop on Multimedia [222] G. Lu, X. Zhang, W. Ouyang, L. Chen, Z. Gao, and D. Xu, “An end-
Signal Processing, September 2019, Kuala Lumpur, Malaysia. to-end learning framework for video compression,” IEEE Transactions
[200] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing on Pattern Analysis and Machine Intelligence, 2020.
network,” in Proceedings of the IEEE conference on Computer Vision [223] R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang, and L. Zhang,
and Pattern Recognition, July 2017, pp. 2881–2890, Honolulu, HI. “Ntire 2017 challenge on single image super-resolution: Methods and
[201] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image results,” in Proceedings of the IEEE conference on Computer Vision
recognition,” in Proceedings of the IEEE conference on computer vision and Pattern Recognition Workshops, 2017, pp. 114–125.
and pattern recognition, 2016, pp. 770–778. [224] J. Yu, Y. Fan, J. Yang, N. Xu, Z. Wang, X. Wang, and T. Huang,
[202] M. Bosch, F. Zhu, and E. J. Delp, “Segmentation-Based Video Com- “Wide activation for efficient and accurate image super-resolution,”
pression Using Texture and Motion Models,” IEEE Journal of Selected arXiv preprint arXiv:1808.08718, 2018.
Topics in Signal Processing, vol. 5, no. 7, pp. 1366–1377, November [225] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep
2011. residual networks,” in European Conference on Computer Vision.
[203] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Springer, 2016, pp. 630–645.
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large [226] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521,
scale visual recognition challenge,” International Journal of Computer no. 7553, pp. 436–444, 2015.
Vision, vol. 115, no. 3, pp. 211–252, 2015. [227] Q. Xia, H. Liu, and Z. Ma, “Object-based image coding: A learning-
[204] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, driven revisit,” in 2020 IEEE International Conference on Multimedia
P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects and Expo (ICME). IEEE, 2020, pp. 1–6.
in context,” in Proceedings of the IEEE European Conference on [228] J. Lee, S. Cho, and M. Kim, “A hybrid architecture of jointly learning
Computer Vision, September 2014, pp. 740–755, Zürich, Switzerland. image compression and quality enhancement with improved entropy
[205] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and minimization,” arXiv preprint arXiv:1912.12817, 2019.
A. Torralba, “Semantic understanding of scenes through the ADE20K [229] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, “Generalizing from a few
dataset,” International Journal of Computer Vision, vol. 127, no. 3, pp. examples: A survey on few-shot learning,” ACM Computing Surveys
302–321, 2019. (CSUR), vol. 53, no. 3, pp. 1–34, 2020.
[206] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks [230] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural
for semantic segmentation,” in Proceedings of the IEEE Conference on similarity for image quality assessment,” in Proceedings of The Thrity-
Computer Vision and Pattern Recognition, June 2015, pp. 3431–3440, Seventh Asilomar Conference on Signals, Systems Computers, vol. 2,
Boston, MA. Nov 2003, pp. 1398–1402, pacific Grove, CA.

[231] D. Yuan, T. Zhao, Y. Xu, H. Xue, and L. Lin, “Visual jnd: a perceptual
measurement in video coding,” IEEE Access, vol. 7, pp. 29 014–29 022,
2019.
[232] Netflix, Inc., “VMAF: Perceptual video quality assessment based on
multi-method fusion,” https://github.com/Netflix/vmaf, 2017.
[233] Q. Shen, J. Cai, L. Liu, H. Liu, T. Chen, L. Ye, and Z. Ma,
“Codedvision: Towards joint image understanding and compression
via end-to-end learning,” in Pacific Rim Conference on Multimedia.
Springer, 2018, pp. 3–14.
[234] L. Liu, H. Liu, T. Chen, Q. Shen, and Z. Ma, “Codedretrieval: Joint
image compression and retrieval with neural networks,” in 2019 IEEE
Visual Communications and Image Processing (VCIP). IEEE, 2019,
pp. 1–4.
