
DVDNET: A FAST NETWORK FOR DEEP VIDEO DENOISING

Matias Tassano*†, Julie Delon*, Thomas Veit†

* MAP5, Université Paris Descartes
† GoPro France

ABSTRACT

In this paper, we propose a state-of-the-art video denoising algorithm based on a convolutional neural network architecture. Previous neural-network-based approaches to video denoising have been unsuccessful, as their performance cannot compete with that of patch-based methods. However, our approach outperforms these patch-based competitors with significantly lower computing times. In contrast to other existing neural network denoisers, our algorithm exhibits several desirable properties, such as a small memory footprint and the ability to handle a wide range of noise levels with a single network model. The combination of its denoising performance and lower computational load makes this algorithm attractive for practical denoising applications. We compare our method with different state-of-the-art algorithms, both visually and with respect to objective quality metrics. The experiments show that our algorithm compares favorably to other state-of-the-art methods. Video examples, code and models are publicly available at https://github.com/m-tassano/dvdnet.

Index Terms— video denoising, CNN, residual learning, neural networks, image restoration

1. INTRODUCTION

We introduce a network for Deep Video Denoising: DVDnet. The algorithm compares favorably to other state-of-the-art methods, while featuring fast running times. The outputs of our algorithm present remarkable temporal coherence, very low flickering, strong noise reduction, and accurate detail preservation.

1.1. Image Denoising

Compared to image denoising, video denoising appears as a largely underexplored domain. Recently, new image denoising methods based on deep learning techniques have drawn considerable attention due to their outstanding performance. Schmidt and Roth proposed in [1] the cascade of shrinkage fields method, which unifies the random-field-based model and half-quadratic optimization into a single learning framework. Based on this method, Chen and Pock proposed in [2] a trainable nonlinear reaction diffusion model, which can be expressed as a feed-forward deep network by concatenating a fixed number of gradient descent inference steps. Methods such as these two attain denoising performances comparable to those of well-known algorithms such as BM3D [3] or non-local Bayes (NLB [4]). However, their performance is restricted to specific forms of prior, and many hand-tuned parameters are involved in the training process. In [5], a multi-layer perceptron was successfully applied to image denoising. Nevertheless, a significant drawback of all these algorithms is that a specific model must be trained for each noise level.

Another popular approach involves the use of convolutional neural networks (CNNs), e.g. RBDN [6], DnCNN [7], and FFDNet [8]. Their performance compares favorably to that of other state-of-the-art image denoising algorithms, both quantitatively and visually. These methods are composed of a succession of convolutional layers with nonlinear activation functions between them. This type of architecture has been applied to the problem of joint denoising and demosaicing of RGB and raw images by Gharbi et al. in [9]. Contrary to other deep learning denoising methods, one of the remarkable features of these CNN-based methods is the ability to denoise several levels of noise with only one trained model. Proposed by Zhang et al. in [7], DnCNN is an end-to-end trainable deep CNN for image denoising. This method is able to denoise different noise levels (e.g. with standard deviation σ ∈ [0, 55]) with only one trained model. One of its main features is that it implements residual learning [10], i.e. it estimates the noise present in the input image rather than the denoised image. In a follow-up paper [8], Zhang et al. proposed FFDNet, which builds upon the work done for DnCNN.

1.2. Video Denoising

As for video denoising, the method proposed by Chen et al. in [11] is one of the few to approach this problem with neural networks (recurrent neural networks in their case). However, their algorithm only works on grayscale images and does not achieve satisfactory results, probably due to the difficulties associated with training recurrent neural networks [12].


Vogels et al. proposed in [13] an architecture based on kernel-predicting neural networks able to denoise Monte Carlo rendered sequences. The state of the art in video denoising is mostly defined by patch-based methods. Kokaram et al. proposed in [14] a 3D Wiener filtering scheme. We note in particular an extension of the popular BM3D to video denoising, V-BM4D [15], and Video non-local Bayes (VNLB [16]). Nowadays, VNLB is the best video denoising algorithm in terms of quality of results, as it outperforms V-BM4D by a large margin. Nonetheless, its long running times render the method impractical: it can take several minutes to denoise a single frame. The performance of our method compares favorably to that of VNLB for moderate to large values of noise, while featuring significantly faster inference times.

2. OUR METHOD

Methods based on neural networks are nowadays the state of the art in image denoising. However, the state of the art in video denoising still consists of patch-based methods. Generally speaking, most previous approaches based on deep learning have failed to employ the temporal information present in image sequences effectively. Temporal coherence and the lack of flickering are vital aspects of the perceived quality of a video. Most state-of-the-art algorithms in video denoising are extensions of their image denoising counterparts. Such is the case, for example, of V-BM4D and BM3D, or VNLB and NLB. There are mainly two factors in these video denoising approaches which enforce temporal coherence in the results, namely the extension of search regions from spatial neighborhoods to volumetric neighborhoods, and the use of motion estimation. The former implies that when denoising a given pixel (or patch), the algorithm looks for similar pixels (patches) not only in the same frame, but also in adjacent frames of the sequence. The latter, motion estimation and/or compensation, has been shown to help improve video denoising performance [17, 16, 15]. We thus incorporated these two elements into our algorithm, as well as different aspects of other relevant CNN-based denoising architectures [8, 9, 13]. Thanks to all these characteristics, our algorithm improves on state-of-the-art results, while featuring fast inference times.

Figure 1 displays a simplified diagram of the architecture of our method. When denoising a given frame, its 2T neighboring frames are also taken as inputs. The denoising process of our algorithm can be split into two stages. First, the 2T + 1 frames are individually denoised with a spatial denoiser. Although each individual frame output at this stage features relatively good image quality, the frames present evident flickering when considered as a sequence. In the second stage of the algorithm, the 2T denoised temporal neighbors are registered with respect to the central frame. We use optical flow for this purpose. Splitting denoising into two stages allows for an individual pre-processing of each frame. On top of this, motion compensation is performed on pre-denoised images, which facilitates the task. Finally, the 2T + 1 aligned frames are concatenated and input into the temporal denoising block. Using temporal neighbors when denoising each frame helps to reduce flickering, as the residual error in each frame will be correlated. We also add a noise map as input to the spatial and temporal denoisers. The inclusion of the noise map as input allows the processing of spatially varying noise [18]. Contrary to other denoising algorithms, our denoiser takes no other parameters as inputs apart from the image sequence and the estimation of the input noise.

Note that the experiments presented in this paper focus on the case of additive white Gaussian noise (AWGN). Nevertheless, this algorithm can be straightforwardly extended to other types of noise, e.g. spatially varying noise (e.g. Poissonian). Let I be a noiseless image and Ĩ its noisy version, corrupted by a realization of zero-mean white Gaussian noise N of standard deviation σ; then

Ĩ = I + N . (1)
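For concreteness, the two-stage inference described above can be sketched in PyTorch as follows. This is a minimal, illustrative reading of the text, not the released implementation: the module names SpatialDenoiser/TemporalDenoiser, the align() helper, and the constant noise map are our assumptions.

```python
import torch

def dvdnet_like_inference(frames, sigma, spatial_net, temporal_net, align, T=2):
    """Sketch of the two-stage pipeline: spatial denoising, optical-flow
    alignment to the central frame, then temporal denoising.

    frames: list of 2T + 1 tensors of shape (1, C, H, W): the noisy central
            frame and its 2T temporal neighbors.
    sigma:  AWGN standard deviation (known or estimated beforehand).
    align:  hypothetical helper warping a frame onto the central one
            using estimated optical flow.
    """
    _, _, h, w = frames[0].shape
    noise_map = torch.full((1, 1, h, w), sigma)  # constant map for AWGN

    # Stage 1: denoise each frame individually with the spatial block.
    spatial_out = [spatial_net(f, noise_map) for f in frames]

    # Stage 2: register the 2T neighbors with respect to the central frame.
    center = spatial_out[T]
    aligned = [center if i == T else align(f, center)
               for i, f in enumerate(spatial_out)]

    # Concatenate the 2T + 1 aligned frames and denoise them jointly.
    stacked = torch.cat(aligned, dim=1)  # (1, (2T+1)*C, H, W)
    return temporal_net(stacked, noise_map)
```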
2.1. Spatial and Temporal Denoising Blocks

The design characteristics of the spatial and temporal blocks strike a good compromise between performance and fast running times. Both blocks are implemented as standard feed-forward networks, as shown in fig. 2. The architecture of the spatial denoiser is inspired by the architectures in [8, 9], while the temporal denoiser also borrows some elements from [13].

The spatial and temporal denoising blocks are composed of Dspa = 12 and Dtemp = 6 convolutional layers, respectively. The number of feature maps is set to W = 96. The outputs of the convolutional layers are followed by pointwise ReLU [19] activation functions, ReLU(·) = max(·, 0). At training time, batch normalization layers (BN [20]) are placed between the convolutional and ReLU layers. At evaluation time, the batch normalization layers are removed and replaced by an affine layer that applies the learned normalization. The spatial size of the convolutional kernels is 3 × 3, and the stride is set to 1. In both blocks, the inputs are first downscaled to a quarter resolution. The main advantage of performing the denoising at a lower resolution is the large reduction in running times and memory requirements, without sacrificing denoising performance [8, 18]. The upscaling back to full resolution is performed with the technique described in [21]. Both blocks feature residual connections [10], which have been observed to ease the training process [18].
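As a rough illustration, here is a sketch of a simplified spatial block in PyTorch following the description above (D convolutional layers, W = 96 feature maps, 3 × 3 kernels, BN + ReLU, quarter-resolution processing via pixel (un)shuffling [21], and a residual connection). It is our reading of the text, assuming even H and W, not the authors' released model.

```python
import torch
import torch.nn as nn

class SpatialBlock(nn.Module):
    """Simplified FFDNet-style spatial denoising block (illustrative sketch)."""

    def __init__(self, channels=3, depth=12, width=96):
        super().__init__()
        self.down = nn.PixelUnshuffle(2)   # (C, H, W) -> (4C, H/2, W/2), quarter resolution
        self.up = nn.PixelShuffle(2)       # back to full resolution
        in_ch = 4 * channels + 1           # shuffled frame + downsampled noise map
        layers = [nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):         # middle layers: Conv + BN + ReLU
            layers += [nn.Conv2d(width, width, 3, padding=1),
                       nn.BatchNorm2d(width), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(width, 4 * channels, 3, padding=1)]
        self.body = nn.Sequential(*layers)  # in eval mode, BN acts as a fixed affine layer

    def forward(self, noisy, noise_map):
        x = self.down(noisy)                        # low-resolution representation
        m = noise_map[:, :, ::2, ::2]               # noise map at matching resolution
        residual = self.body(torch.cat([x, m], dim=1))
        return noisy - self.up(residual)            # residual connection, cf. eq. (2)
```

Under the same reading, the temporal block would use depth 6 and take the 2T + 1 concatenated, aligned frames plus the noise map as input.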
3. TRAINING DETAILS

The spatial and temporal denoising parts are trained separately, with the spatial denoiser trained first, as its outputs are used to train the temporal denoiser. Both blocks are trained using crops of images, or patches. The size of the patches should be larger than the receptive field of the networks.

Fig. 1. Simplified architecture of our method.

Fig. 2. Simplified architecture of the spatial (top) and temporal (bottom) denoising blocks.

In the case of the spatial denoiser, the training dataset is composed of pairs of input-output patches {((Ĩ^j, M^j), I^j)}_{j=0}^{m_s}, which are generated by adding AWGN with standard deviation σ ∈ [0, 55] to the clean patches I^j and building the corresponding noise map M^j (in this case constant, with all its elements equal to σ). A total of m_s = 1024000 patches are extracted from the Waterloo Exploration Database [22]. The patch size is 50 × 50. Patches are randomly cropped from randomly sampled images of the training dataset. Residual learning is used, which implies that if the network outputs an estimation of the input noise, F_spa(Ĩ; θ_spa) = N̂, then the denoised image is computed by subtracting the output noise from the noisy input,

Î(Ĩ; θ_spa) = Ĩ − F_spa(Ĩ; θ_spa) . (2)
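A data-generation step consistent with this description might look as follows in PyTorch. Patch sampling is elided, and the helper name make_training_pair is ours, not the paper's.

```python
import torch

def make_training_pair(clean_patch, sigma_max=55 / 255.0):
    """Build one ((noisy patch, noise map), clean patch) training sample.

    clean_patch: tensor (C, H, W) with values in [0, 1].
    The noise level is drawn uniformly in [0, sigma_max], since a single
    model is trained for sigma in [0, 55] (on the 8-bit scale).
    """
    sigma = torch.rand(1) * sigma_max
    noisy = clean_patch + sigma * torch.randn_like(clean_patch)  # AWGN, eq. (1)
    noise_map = sigma.expand(1, *clean_patch.shape[1:])          # constant map M
    return (noisy, noise_map), clean_patch
```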
The loss function of the spatial denoiser writes

L_spa(θ_spa) = (1 / 2m_s) Σ_{j=1}^{m_s} ‖Î^j(Ĩ^j; θ_spa) − I^j‖² , (3)

where θ_spa is the collection of all learnable parameters.

As for the temporal denoiser, the training dataset consists of input-output pairs

P_t^j = { ((w Î^j_{t−T}, …, Î^j_t, …, w Î^j_{t+T}), M^j), I^j_t }_{j=0}^{m_t} ,

where (w Î^j_{t−T}, …, Î^j_t, …, w Î^j_{t+T}) is a collection of 2T + 1 spatial patches cropped at the same location in contiguous frames (the prefix w denotes a warped, i.e. motion-compensated, patch). These are generated by adding AWGN of σ ∈ [0, 55] to clean patches of a given sequence, and denoising them using the spatial denoiser. Then, the 2T patches contiguous to the central reference patch Î^j_t are motion-compensated with respect to the latter, i.e. w Î^j_l = compensate(Î^j_l, Î^j_t). To compensate frames, we use the DeepFlow algorithm [23] to estimate the optical flow between frames. The noise map M^j is the same as the one used in the spatial denoising stage. A total of m_t = 450000 training samples are extracted from the training set of the DAVIS database [24]. The spatial size of the patches is 44 × 44, while the temporal size is 2T + 1 = 5. The loss function for the temporal denoiser is

L_temp(θ_temp) = (1 / 2m_t) Σ_{j=1}^{m_t} ‖Î^j_{temp,t} − I^j_t‖² , (4)

where Î^j_{temp,t} = F_temp(P_t^j; θ_temp).
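The paper uses DeepFlow [23] only to estimate the flow field; the compensation itself is a warping step. A generic bilinear warp (our sketch, independent of the flow estimator) could look like this:

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Warp `frame` towards a reference using a dense optical flow field.

    frame: (N, C, H, W); flow: (N, 2, H, W), with flow[:, 0] the horizontal
    and flow[:, 1] the vertical displacement in pixels (e.g. from DeepFlow).
    """
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # absolute sampling positions: base grid plus displacement
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow
    # normalize positions to [-1, 1], as required by grid_sample
    grid[:, 0] = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid[:, 1] = 2.0 * grid[:, 1] / (h - 1) - 1.0
    return F.grid_sample(frame, grid.permute(0, 2, 3, 1), align_corners=True)
```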

In both cases, the ADAM algorithm [25] is applied to minimize the loss function, with all its hyper-parameters set to their default values. The number of epochs is set to 80, and the mini-batch size is 128. The scheduling of the learning rate is also common to both cases. It starts at 1e−3 for the first 50 epochs, then changes to 1e−4 for the following 10 epochs, and finally switches to 1e−6 for the remainder of the training. Data is augmented five times by introducing rescaling by different scale factors and random flips. During the first 60 epochs, orthogonalization of the convolutional kernels is applied as a means of regularization. It has been observed that initializing the training with orthogonalization may be beneficial to performance [8, 18].
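A minimal training loop consistent with these details (Adam with defaults, batches of 128, the stated learning-rate schedule, pairing with the make_training_pair sketch above) might look as follows. The kernel-orthogonalization step is only summarized in a comment, since the text does not spell out its implementation.

```python
import torch
from torch import nn, optim

def train_denoiser(model, loader, epochs=80):
    """Sketch of the schedule described above: 1e-3 / 1e-4 / 1e-6."""
    criterion = nn.MSELoss()  # quadratic loss, as in eqs. (3) and (4)
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(epochs):
        lr = 1e-3 if epoch < 50 else (1e-4 if epoch < 60 else 1e-6)
        for group in optimizer.param_groups:
            group["lr"] = lr
        for (noisy, noise_map), clean in loader:  # mini-batches of 128
            optimizer.zero_grad()
            loss = criterion(model(noisy, noise_map), clean)
            loss.backward()
            optimizer.step()
        # During the first 60 epochs, the paper additionally
        # re-orthogonalizes the convolutional kernels (details omitted).
```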
Fig. 3. Comparison of results. Left to right: noisy frame with σ = 50 (PSNR_seq = 14.15 dB), output of V-BM4D (PSNR_seq = 24.91 dB), output of VNLB (PSNR_seq = 26.34 dB), output of Neat Video (PSNR_seq = 23.11 dB), output of DVDnet, ours (PSNR_seq = 26.62 dB). Note the clarity of the denoised text, and the lack of low-frequency residual noise and chroma noise for DVDnet. Best viewed in digital format.

4. RESULTS

Two different test sets were used for benchmarking our method: the DAVIS-test set, and Set8, which is composed of 4 color sequences from Derf's Test Media collection (https://media.xiph.org/video/derf) and 4 color sequences captured with a GoPro camera. The DAVIS set contains 30 color sequences of resolution 854 × 480. The sequences of Set8 have been downscaled to a resolution of 960 × 540. In all cases, sequences were limited to a maximum of 85 frames. We used the DeepFlow algorithm to compute flow maps for DVDnet and VNLB. We also compare our method to a commercial blind denoising software, Neat Video (NV [26]).

In general, DVDnet outputs sequences which feature remarkable temporal coherence. Flickering rendered by our method is notably small, especially in flat areas, where patch-based algorithms often leave behind low-frequency residual noise. An example can be observed in fig. 3 (best viewed in digital format). Temporally decorrelated low-frequency noise in flat areas is particularly annoying to the viewer. More video examples can be found on the website of the algorithm. The reader is encouraged to watch these examples to compare the visual quality of the results of our method.

Tables 1 and 2 show a comparison of PSNR on the Set8 and DAVIS test sets, respectively. It can be observed that for smaller values of noise, VNLB performs better; in effect, DVDnet tends to over-denoise in some of these cases. However, for larger values of noise, DVDnet surpasses VNLB.

Table 1. Comparison of PSNR on the Set8 testset.

          DVDnet   VNLB    V-BM4D   NV
σ = 10    36.08    37.26   36.05    35.67
σ = 20    33.49    33.72   32.19    31.69
σ = 30    31.79    31.74   30.00    28.84
σ = 40    30.55    30.39   28.48    26.36
σ = 50    29.56    29.24   27.33    25.46

Table 2. Comparison of PSNR on the DAVIS testset.

          DVDnet   VNLB    V-BM4D
σ = 10    38.13    38.85   37.58
σ = 20    35.70    35.68   33.88
σ = 30    34.08    33.73   31.65
σ = 40    32.86    32.32   30.05
σ = 50    31.85    31.13   28.80
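As a reminder of the metric, a sequence PSNR can be computed over the whole stack of frames; the paper does not specify its exact aggregation, so the following helper (ours, not the authors' evaluation code) assumes a single MSE over all frames.

```python
import torch

def psnr_seq(denoised, clean, peak=1.0):
    """PSNR over an entire sequence, frames stacked as one tensor.

    denoised, clean: tensors of shape (num_frames, C, H, W), values in [0, peak].
    """
    mse = torch.mean((denoised - clean) ** 2)
    return 10 * torch.log10(peak ** 2 / mse)
```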
4.1. Running times

Our method achieves fast inference times thanks to its design characteristics and simple architecture. DVDnet takes less than 8 s to denoise a 960 × 540 color frame, which is about 20 times faster than V-BM4D, and about 50 times faster than VNLB. Even running on CPU, DVDnet is about an order of magnitude faster than these methods. Of the 8 s it takes to denoise a frame, 6 s are spent on compensating the motion of the temporal neighboring frames. Table 3 compares the running times of different state-of-the-art algorithms.

Table 3. Comparison of running times: time to denoise a color frame of resolution 960 × 540. Note: values displayed for VNLB do not include the time required to estimate motion.

Method     V-BM4D   VNLB   DVDnet (CPU)   DVDnet (GPU)
Time (s)   156      420    19             8

5. CONCLUSIONS

In this paper, we presented DVDnet, a video denoising algorithm which improves the state of the art. Denoising results of DVDnet feature remarkable temporal coherence, very low flickering, and excellent detail preservation. The algorithm achieves running times which are at least an order of magnitude faster than those of other state-of-the-art competitors. Although the results presented in this paper hold for Gaussian noise, our method could be extended to denoise other types of noise.

6. REFERENCES

[1] U. Schmidt and S. Roth, "Shrinkage fields for effective image restoration," in Proc. CVPR, 2014, pp. 2774–2781.

[2] Y. Chen and T. Pock, "Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration," IEEE Trans. PAMI, vol. 39, no. 6, pp. 1256–1272, 2017.

[3] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, "Image denoising by sparse 3-D transform-domain collaborative filtering," IEEE Trans. IP, vol. 16, no. 8, pp. 2080–2095, 2007.

[4] M. Lebrun, A. Buades, and J.-M. Morel, "A nonlocal Bayesian image denoising algorithm," SIAM Journal IS, vol. 6, no. 3, pp. 1665–1688, 2013.

[5] H.C. Burger, C.J. Schuler, and S. Harmeling, "Image denoising: Can plain neural networks compete with BM3D?," in Proc. CVPR, 2012, pp. 2392–2399.

[6] V. Santhanam, V.I. Morariu, and L.S. Davis, "Generalized deep image to image regression," in Proc. CVPR, 2016.

[7] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising," IEEE Trans. IP, vol. 26, no. 7, pp. 3142–3155, 2017.

[8] K. Zhang, W. Zuo, and L. Zhang, "FFDNet: Toward a fast and flexible solution for CNN-based image denoising," IEEE Trans. IP, vol. 27, no. 9, pp. 4608–4622, 2018.

[9] M. Gharbi, G. Chaurasia, S. Paris, and F. Durand, "Deep joint demosaicking and denoising," ACM Trans. Graphics, vol. 35, no. 6, pp. 1–12, 2016.

[10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. CVPR, 2016, pp. 770–778.

[11] X. Chen, L. Song, and X. Yang, "Deep RNNs for video denoising," in Applications of Digital Image Processing XXXIX, International Society for Optics and Photonics, 2016, vol. 9971, p. 99711T.

[12] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in Proc. ICML, 2013, pp. 1310–1318.

[13] T. Vogels, F. Rousselle, B. McWilliams, G. Röthlin, A. Harvill, D. Adler, M. Meyer, and J. Novák, "Denoising with kernel prediction and asymmetric loss functions," ACM Trans. Graphics, vol. 37, no. 4, p. 124, 2018.

[14] A.C. Kokaram, Motion Picture Restoration, Ph.D. thesis, University of Cambridge, 1993.

[15] M. Maggioni, G. Boracchi, A. Foi, and K. Egiazarian, "Video denoising, deblocking, and enhancement through separable 4-D nonlocal spatiotemporal transforms," IEEE Trans. IP, vol. 21, no. 9, pp. 3952–3966, 2012.

[16] P. Arias and J.-M. Morel, "Video denoising via empirical Bayesian estimation of space-time patches," Journal of Mathematical Imaging and Vision, vol. 60, no. 1, pp. 70–93, 2018.

[17] A. Buades and J.-L. Lisani, "Patch-based video denoising with optical flow estimation," IEEE Trans. IP, vol. 25, no. 6, pp. 2573–2586, 2016.

[18] M. Tassano, J. Delon, and T. Veit, "An analysis and implementation of the FFDNet image denoising method," IPOL, vol. 9, pp. 1–25, 2019.

[19] A. Krizhevsky, I. Sutskever, and G.E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012, pp. 1–9.

[20] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. ICML, 2015, pp. 448–456.

[21] W. Shi, J. Caballero, F. Huszar, J. Totz, A.P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Proc. CVPR, 2016, pp. 1874–1883.

[22] K. Ma, Z. Duanmu, Q. Wu, Z. Wang, H. Yong, H. Li, and L. Zhang, "Waterloo Exploration Database: New challenges for image quality assessment models," IEEE Trans. IP, vol. 26, no. 2, pp. 1004–1016, 2017.

[23] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid, "DeepFlow: Large displacement optical flow with deep matching," in Proc. ICCV, 2013.

[24] A. Khoreva, A. Rohrbach, and B. Schiele, "Video object segmentation with language referring expressions," in Proc. ACCV, 2018.

[25] D.P. Kingma and J.L. Ba, "ADAM: A method for stochastic optimization," in Proc. ICLR, 2015, pp. 1–15.

[26] ABSoft, "Neat Video," https://www.neatvideo.com, 1999–2019.

