Video Shot Boundary Detection Using Hybrid Dual Tree Complex Wavelet Transform With Walsh Hadamard Transform
https://doi.org/10.1007/s11042-021-11052-2
Ravi Mishra 1
Abstract
Shot boundary detection (SBD) is the initial process in video analysis, indexing, summarization, and retrieval. The main factors in SBD are the detection of the correct transitions in a video sequence, the feature extraction, and the effectiveness of the extracted features in representing the visual content of the video frames. In this paper, the hybrid Dual-Tree Complex Wavelet Transform with Walsh Hadamard Transform (DTCWT-WHT) and an optimized deep belief network (DBN) are proposed for SBD. A new feature extraction technique is developed for extracting the feature vector from each block of the image frames. Preprocessing is the initial step, removing the illumination noise in the video frames; for this purpose, the Fast Averaging Peer Group filter is designed, whose significant feature is its computational efficiency. After preprocessing, the distance between adjacent frames is computed using the HSV color histogram distance. A hybrid approach is proposed to extract the features and the edge boundaries from the frames, and a continuity signal is constructed. Then, the extracted features are fed to the DBN for the classification process. The social ski driver optimization algorithm (SSDOA) is utilized to update the weights of the DBN. Finally, the proposed method detects the abrupt (cut) and gradual (fade in and fade out) transitions from the video frames. Four well-known datasets, TRECVID 2016, 2017, 2018 and 2019, are utilized to examine the proposed framework. The capability of the proposed work is reinforced by comparison with recent techniques. The experimental outcomes show the efficiency of the proposed framework compared with the existing techniques.
Keywords Shot boundary detection · Feature extraction · Classification · Cut transition · Gradual transition
* Ravi Mishra
ravi.mishra@raisoni.net
1 Department of Electronics and Telecommunication, G H Raisoni Institute of Engineering and Technology, Nagpur, India
1 Introduction
Digital video availability and usage have increased rapidly with the development of multimedia technology, and the number of videos uploaded to video databases has also increased drastically [23]. Due to this increase in video availability, searching for a specific video in a large database is a complex task, and manual searching consumes huge effort and time to retrieve the desired event. For instance, an extended-play video may include multiple events while being of interest for only a few activities [27]. Searching by name has the benefit that the user can retrieve the content of a particular video with the help of the video clip's name instead of searching for it in a huge video database [19]. For this reason, research is currently ongoing in video indexing, browsing, and retrieving; this process is termed content-based video retrieval (CBVR) and provides semantic searching to assist users [13].
Furthermore, the requirement for indexing, retrieval, browsing, and video data management tools, as well as the demand for advanced technologies, has risen due to the availability of massive video data [8]. To achieve this, the video data are attached to their index (storage) and then analyzed into their basic units using video structure analysis. Properties of video, such as the huge size of the raw data, which contains more information than a single image, and the lack of prior information about the video structure, make video structure analysis challenging [10]. The separation of a video into its fundamental elements is the major goal of video structure analysis. Frames, shots, scenes, and stories are the various levels of video structure [14]. A video is composed of shots along with gradual and abrupt transitions. Each shot comprises a series of related consecutive frames captured by a single camera, representing a set of continuous actions under a definite space and time [11].
Feature extraction from video frames and the analysis of the relationships among successive frames are the processes included in SBD. In some feature extraction techniques, edge features or color features are used to detect shot boundaries [4]. Most documentary or movie videos contain shots, scenes, and acts: a single act contains multiple scenes, and a scene contains multiple shots. The shots are switched smoothly by the director with the help of shot transitions to achieve a more coherent video connection [15, 26]. The main task in temporal video segmentation is detecting shot boundaries such as gradual and abrupt transitions. Editing effects cause gradual transitions, and switching the camera on and off causes abrupt (cut) transitions. Gradual transitions are categorized into fade-in, fade-out, wipe and dissolve [9, 12, 18].
A significant and required step in SBD during post-production is to identify the video transitions and allot a unique treatment to them. The detection accuracy and processing speed of SBD should be high, given the demands of computationally heavy post-production techniques and real-time requirements [20, 24, 29]. SBD is an essential operation in the indexing, classification, summarization, and retrieval of videos; it therefore aims to extract specific feature differences between adjacent frames [30]. Moreover, image denoising and reconstruction are also vital for natural and medical image quality and can effectively improve the performance of image analysis tasks such as SBD, classification, segmentation, and verification [7].
In recent years, many approaches have been introduced for the detection of shot boundaries. Based on how they operate, these methods are divided into four types, namely histogram, pixel, motion and edge based techniques. In pixel based strategies, the pixel differences among consecutive frames are compared. The main advantage of this strategy is easy implementation; its disadvantages are that it cannot identify gradual changes efficiently and that it is very sensitive to object motion [2]. Edge based approaches extract each frame's object edges and test whether the edges of continuous frames differ significantly. The primary goal of histogram based methods is to decrease the sensitivity to object and camera motion; they are also commonly combined with other approaches [3]. Motion based approaches utilize motion vectors and are robust to object motion; however, compared with histogram based and edge based methods, they are more complicated. Also, false detection is high in the existing methods, which decreases the overall efficiency of the detection system. Therefore, the proposed approach is needed for the detection of shot boundaries in large datasets.
Contributions:
The main contributions of this paper are:
(i) A fast averaging peer group (FAPG) filter is designed to enhance the contrast of the frames and remove the illumination noise. It also restores the images while preserving the edges as well as tiny image details.
(ii) DTCWT-WHT is proposed and applied to each block of the image for feature vector extraction. It reduces the computational complexity and is very effective in terms of robustness, capacity, and invisibility.
(iii) Finally, an optimized DBN is developed to classify the cut and gradual transitions; this new concept improves the performance of the proposed method in terms of simplification, and its training time is more effective. Based on the principle of empirical risk reduction, hybridized deep learning is performed.
(iv) For linearly non-separable samples, the proposed approach achieves enhanced outcomes and provides more robustness.
(v) The probability of false detection of cut, gradual, fade in and fade out transitions is minimized in the proposed framework.
(vi) The performance of the proposed framework is much higher than that of the state-of-the-art approaches.
Organization: The remaining part of the manuscript is arranged in the following way: Section 2 describes the overview of SBD, and Section 3 discusses the existing works and their challenges. Section 4 presents the proposed method of SBD. The experimental results of the proposed work are described in Section 5, and finally, Section 6 concludes the paper.
2 Preliminaries
SBD is otherwise known as video temporal segmentation, which is inextricably and intrinsically connected to structural data analysis. Generally, shots are the smaller units that compose a video, and several frames compose a single shot. All the frames are arranged in shots, with consecutive frames in temporal order. A shot transition is defined as the video's temporal movement from one shot to another. Figure 1 represents the general structure of SBD.
Initially, the video frames are preprocessed to eliminate illumination noise; next, feature extraction takes place to extract the visual features from the frames, so that each frame is represented mathematically. The feature extraction can be performed based on color histograms, pixels, motion vectors, singular value decomposition (SVD), descriptors, edges, or multiple features. A continuity signal is generated using the extracted visual features. The final step is classification, in which the extracted features are fed to a classifier for the detection of shot boundaries. Several conventional classification techniques are used in SBD, such as probabilistic models, convolutional neural networks (CNN), Naïve Bayes (NB), K-nearest neighbor (KNN), Support Vector Machines (SVM), etc. As a result, the cut and gradual transitions can be identified.
3 Related works
This section reviews the existing approaches to SBD and presents several challenges of these methods.
3.1 Literature review
Abdulhussain et al. [1] proposed a new SBD algorithm based on orthogonal polynomials (OP) for the detection of shot boundaries. In video sequences, the hard transitions are detected by deriving features from the orthogonal transform domain, i.e., the moments. Moments are used in this method because they can represent a video frame or signal without any data redundancy. These features are the smoothed and gradient moments of the video frames, computed using a developed OP, the squared Krawtchouk-Tchebichef polynomial. A feature vector was formed by fusing these gradient and smoothed moments. Finally, the hard transitions are detected using a support vector machine.
Based on visual perception, Gygli et al. [9] developed a novel "whole to local" model in terms of the human vision concept. Redundant frames are removed by browsing the video, which also reduces the computational cost. Then, a consistency function was constructed between frames from the video's visual consistency feature for the generation of pending shots. By combining motion features, the SBD outcomes are further optimized. The calculation time can be shortened by a new model that contains the optical flow feature and the inter-frame coherence for SBD. Without sacrificing detection performance, the shot boundaries were detected accurately and quickly.
Chakraborty and Thounaojam [5] suggested a novel SBD method that combines the benefits of the Gravitational Search Algorithm (GSA) and Particle Swarm Optimization (PSO) for the optimization of Feed-Forward Neural Network (FNN) weights. A continuity matrix (φ) was formed to analyze the hybrid technique's output and increase system performance. Then, a feasible transition frameset was extracted using an outlier analysis along with the continuity matrix. The abrupt and gradual transitions are classified by a set of thresholds δ1 and δ2 from the available set of possible transition frames. The input provided to the FNN comprises multiple image features, such as the normalized 3D Euclidean standard deviation, the CIEDE2000 color difference, and the 3D color difference. The weights of the FNN are optimized by PSOGSA, PSO, and GSA for the classification of frames into feasible normal frames and transition frames.
Wu et al. [28] introduced the TSSBD approach (a two-stage method for SBD) that detects abrupt shot changes via a fusion of the color histogram and deep features, while gradual shot variations are located using C3D-based deep analysis. Initially, the abrupt shot changes occurring between two frames are detected, separating the complete video into segments that include the gradual transitions. Over these video segments, a 3D convolutional neural network was used for the detection of gradual shot changes. The neural network classifies the clips into particular types of gradual shot change; finally, the positions of the gradual shot transitions were located by a proposed effective merging strategy.
Rashmi and Nagendraswamy [22] developed an effective SBD method for the detection of abrupt and gradual transitions in videos. An edge gradient fuzzified frame is generated from the correlation of global and local features. From each frame, a block based Mean Cumulative Sum Histogram (MCSH) was extracted. The gradual and abrupt shots were detected from the video by applying the relative standard deviation (RSD) statistical measure to the obtained MCSH. However, this approach does not have the capability to detect all the abrupt and gradual transitions, as the false detection rate was sometimes high.
Chakraborty and Thounaojam [6] proposed a novel SBD approach based on color and gradient information. Each frame's contrast and structural changes, such as luminance changes, are evaluated for luminance distortion and gradient similarity. An adaptive threshold technique was used to extract the feasible transitions by correlating the effects of changes in contrast-structure as well as luminance. This approach consumed less time for the detection of transitions.
3.2 Challenges
Special attention has to be paid to many difficulties in SBD. Some of the major challenges of the work are: (i) gradual transition detection, (ii) removal of disturbances due to abrupt illumination changes, and (iii) disturbances from large object/camera movement. These challenges are considered most complex during the construction of the mapping. Moreover, various illumination changes occur within a scene, and special effects may lead to detection errors. To obtain better detection performance for all transition types on any kind of arbitrary video sequence, a robust SBD is required. A single scene contains several shots in a video sequence; detecting scenes and sub-shots is termed macro-segmentation and micro-segmentation, respectively. This detection is a difficult process, and there is a need for an efficient SBD technique that overcomes illumination changes.
4 Proposed methodology
In this work, the video shot boundary is detected using the proposed framework. Initially, preprocessing and filtering are performed on each frame to avoid the difficulties of noise, contrast, and illumination. The FAPG filter is designed to remove the noise, eliminate the illumination noise, and improve the contrast of the frames. In the next step, the HSV color histogram distance is computed between each pair of adjacent frames. The proposed SBD architecture is illustrated in Fig. 2.
In the third step, the hybrid DTCWT-WHT is applied to each frame for the extraction of features; this method is applied to each block of the image to extract the feature vector, and the composite feature vector comprises those of all the frames of the video. In the final step, the detection of false boundaries is performed by an optimized DBN, in which SSDOA is used for optimizing the weight values to reduce the learning error. Finally, the cut and gradual transitions are detected.
4.1 Preprocessing using the FAPG filter
The FAPG filter is designed in the preprocessing stage for illumination noise removal and contrast enhancement of the frames. This filter can remove the noise and also restore the images while preserving small image details and the particulars of the edges [17]. In the filter design, computational efficiency is considered a significant merit because it improves the image enhancement operation. The filter uses an indication of the central pixel's degree of membership in a local neighborhood, based on its peer group size. Figure 3 illustrates the preprocessed image obtained from the raw input.
The two operations performed by this filter are pixel inspection and pixel replacement. In pixel inspection, the central pixel's degree of membership in its neighborhood is evaluated from the local window. In pixel replacement, a Weighted Average Filter (WAF) is used to replace the pixels that are classified as outliers. The pixel samples in the neighborhood relation with a processed pixel are analyzed to determine the weights of the WAF.
The first step in preprocessing is to determine the close neighbors (CN) of the central pixel c1 of the filtering window W, i.e., the peer group size. A pixel ci ≠ c1 belonging to W is considered a close neighbor of c1 if the normalized Euclidean distance δ(ci, c1) is less than a predefined threshold value th in the given color space. The primary parameter of this phase is the threshold 0 ≤ th ≤ 1, where th = 1 corresponds to the maximum Euclidean distance and th = 0 to two identical pixels. The size of the peer group determines the degree of pixel distortion produced by a noise process. Moreover, the pixel is considered corrupted if its peer group size is low; otherwise, it is said to be uncorrupted.
In the FAPG filter, pixel replacement is the second step. Once the peer group sizes pi of the image pixels are computed, the filtering is executed using the following procedure:

(i) If the peer group size of the central pixel c1 of W satisfies p1 ≤ 1, it is considered an outlier. It is then replaced using the pixels of the same operating window. The weights of the corresponding pixels cpi, i = 2, …, n, are computed as

$$ cp_i = \frac{\delta_i}{\sum_{i=2}^{n} \delta_i}, \qquad \delta_i = p_i^{\lambda}, \qquad (1) $$

where n denotes the size of W and the second parameter λ > 0 influences the quality of the results. After replacing c1, the output o1 of the WAF is expressed as

$$ o_1 = \frac{1}{\sum_{i=2}^{n} cp_i} \sum_{i=2}^{n} cp_i \, c_i. \qquad (2) $$
In the filter output, pixels with more CNs are regarded as more reliable and have a higher relative impact, whereas pixels without any CNs (pi = 0) contribute only as a plain average. The λ parameter provides the feasibility of standardizing the neighboring pixels' degree of membership: the neighboring pixels' peer group sizes in W are weighted according to λ, so that larger values of λ amplify the differences between the peer group sizes, while values close to zero diminish them.
(ii) The pixel is preserved if its peer group size is greater than 1. If c1 has enough CNs, the pixel has a sufficient degree of membership to be treated as uncorrupted and is left without any modification.
(iii) In some rare conditions, none of the pixels within W have any CNs, which may occur in highly noisy images. In this case, the filtering window size needs to be enlarged until at least two uncorrupted pixels can be identified.
Therefore, this filter is a fast impulse detection method coordinated with computationally effective pixel replacement. Then, the HSV color histogram distance is calculated between the adjacent frames.
4.2 HSV color histogram distance
The HSV color histogram is utilized to represent the color features and to calculate the distance between adjacent frames. It gives the number of pixels whose colors fall within each of a set of value ranges [16], spanning the set of all feasible colors of the image's color space. In the color histogram, the ordinate denotes the number of pixels and the abscissa represents the color range. The color histogram is expressed as

$$ CH(i) = \frac{n_i}{N}, \qquad i = 1, 2, \ldots, h, \qquad (3) $$

where N denotes the total number of pixels, h the number of color values in the histogram, and ni the number of pixels with color i. The color histogram vector of image M is denoted as CH(M) = (k1, k2, …, kh).
The SBD method extracts the color features by considering the HSV color histogram difference between frames belonging to a video sequence. The histogram difference is computed as

$$ H_{diff}[j] = \left( \sum_{i=1}^{N} \left( k_{j,i} - k_{j-1,i} \right)^2 \right)^{1/2}, \qquad (4) $$

where Hdiff[j] denotes the difference between frame j and frame j − 1, and kj the N-dimensional color histogram of frame j.
Figure 4 represents the calculation of distance using an HSV color histogram for the filtered image. When a large discontinuity occurs between histograms, a peak appears; this peak corresponds to an abrupt transition, so abrupt transitions can easily be distinguished from other video effects. Gradual transition detection, however, is a complex task: frequent changes in flash and illumination maximize the differences between adjacent frames and disturb the detection of abrupt transitions, although the shapes of illumination and flash variations differ from those of abrupt changes.
4.3 Feature extraction using hybrid DTCWT-WHT
The structural features of the frames are extracted using the proposed hybrid DTCWT-WHT method. DTCWT includes separate sub-bands for the negative and positive orientations. Each scale of the DTCWT contains two trees: one tree generates the real part and the other the imaginary part of the complex wavelet coefficients. A four-scale (six-angle) wavelet transform is performed on each frame, where each row represents one scale and each column the angle within that scale. Moreover, the WHT is real, orthogonal, and symmetric, Z = Z* = Z^T = Z^{-1}. The DTCWT-WHT matrix of order four is given as

$$ D = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{pmatrix}. \qquad (5) $$
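The matrix D in Eq. (5) is the order-4 sequency-ordered Walsh-Hadamard matrix. As a sanity check, it can be generated by sorting the rows of the natural-ordered Hadamard matrix by their number of sign changes:

```python
import numpy as np
from scipy.linalg import hadamard

def walsh_matrix(n):
    """Sequency-ordered Walsh-Hadamard matrix: the rows of the natural
    Hadamard matrix sorted by their number of sign changes."""
    H = hadamard(n)
    sign_changes = (np.diff(H, axis=1) != 0).sum(axis=1)
    return H[np.argsort(sign_changes)]

D = walsh_matrix(4)
# [[ 1  1  1  1]
#  [ 1  1 -1 -1]
#  [ 1 -1 -1  1]
#  [ 1 -1  1 -1]]  -- matches Eq. (5)
```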
For MPEG sequences, the frames are identified, and the entropy process is performed on each block. A vector vf = {g1, g2, …, gk} is formed from the coefficients of each frame h, where k = τρ. The projection value Ym of the block is obtained as an inner product:

$$ Y_m = \left( y_1^m, y_2^m, y_3^m, y_4^m \right), \qquad (6) $$

where m = 1, 2, 3, …, n and n is the number of blocks Bm. Using the inner product, the representations of every two consecutive frames are compared as follows:

$$ \psi(h, h + \phi) = \frac{v_h \cdot v_{h+\phi}}{\|v_h\| \, \|v_{h+\phi}\|}. \qquad (7) $$

A scene change in the video sequence from frame vh to frame vh+ϕ is declared if 1 − |ψ| > υi, where 0 < υi < 1 and ϕ denotes the resolution in the temporal domain. In a video sequence, every frame is processed by letting ϕ = 1 for the detection of scene changes. Then the color, texture, and edge features are computed for each frame in a block.
First, the color feature of a block is evaluated as

$$ CL = y_1^m = \langle B_m, v_1 \rangle. \qquad (8) $$

The second extracted feature is the edge feature of the frames. The edge gradients of the block Bm are taken as y2^m and y3^m, and the edge feature of the m-th block of the frame is

$$ EG \approx \left( y_2^m \right)^2 + \left( y_3^m \right)^2. \qquad (9) $$

The third is the texture feature, for which the vector is represented as a finite sum. The projection of the block onto the vector space, following Eq. (6), is

$$ Y = y_1^m v_1 + y_2^m v_2 + y_3^m v_3 = \sum_{j=1}^{3} \langle B_m, v_j \rangle \, v_j. \qquad (10) $$

The texture strength representation of block Bm is then given as

$$ TS_m = \| B_m \|^2 - \| Y \|^2. \qquad (11) $$
Therefore, the features are extracted from the blocks of the frames, and the next step is the construction of a continuity signal.
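A simplified sketch of the per-block feature computation of Eqs. (6)-(11) follows. It flattens a block and projects it onto the first four sequency-ordered Walsh basis vectors; treating the projection basis as purely Walsh vectors is a simplifying assumption, since the paper hybridizes these projections with DTCWT sub-band responses:

```python
import numpy as np
from scipy.linalg import hadamard

def block_features(block):
    """Per-block features of Eqs. (6)-(11), under the simplifying
    assumption of a pure Walsh projection basis. The flattened block
    size must be a power of two (e.g., an 8x8 block)."""
    b = block.astype(np.float64).ravel()
    n = b.size
    H = hadamard(n).astype(np.float64)
    order = np.argsort((np.diff(H, axis=1) != 0).sum(axis=1))
    v = H[order[:4]] / np.sqrt(n)        # first four orthonormal basis vectors
    y = v @ b                            # projections y_1..y_4, Eq. (6)
    CL = y[0]                            # color feature, Eq. (8)
    EG = y[1] ** 2 + y[2] ** 2           # edge feature, Eq. (9)
    Y = y[:3] @ v[:3]                    # projection onto span(v_1..v_3), Eq. (10)
    TS = (b ** 2).sum() - (Y ** 2).sum() # texture strength, Eq. (11)
    return CL, EG, TS

def frame_similarity(v_h, v_h_phi):
    """Cosine similarity between two frame feature vectors, Eq. (7);
    a scene change is flagged when 1 - |psi| exceeds a threshold."""
    return (v_h @ v_h_phi) / (np.linalg.norm(v_h) * np.linalg.norm(v_h_phi))
```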
4.4 Construction of the continuity signal
After the feature extraction, the continuity signal construction takes place. This signal is constructed by identifying the similarity or dissimilarity between the video sequence's consecutive frames, which is computed by integrating the obtained features with the distance metric D. The evaluated difference values between consecutive frames are given as the accumulation of continuity between the block features of frames h and h + 1. The similarity/dissimilarity is computed using the City block distance measure. For each of the color (CL), edge (EG), and texture strength (TS) features, the continuity values are constructed using the City block distance measure as defined in the equations below:
$$ \lambda(k) = D(h, h+1)_{CL} = \sum_{m=1}^{n} \left| CL_{m,h} - CL_{m,h+1} \right| \qquad (12) $$

$$ \alpha(k) = D(h, h+1)_{EG} = \sum_{m=1}^{n} \left| EG_{m,h} - EG_{m,h+1} \right| \qquad (13) $$

$$ \delta(k) = D(h, h+1)_{TS} = \sum_{m=1}^{n} \left| TS_{m,h} - TS_{m,h+1} \right| \qquad (14) $$
Initially, the features such as edge, texture strength, and color are computed for the blocks of each individual frame. Using Eqs. (12)-(14), the similarity/dissimilarity between consecutive frames is evaluated. Each feature's continuity signal is normalized so that the continuity values lie in the range [0, 1]. The individual feature continuity signals (λ, α, δ, θ) with normalized continuity values are fused to form a single continuity signal. The feature fusion is not performed directly, because the features contribute at different levels to the frame's visual representation. The significance of each feature is found to identify its level of contribution, and weights are assigned to the features accordingly, as expressed in Eq. (15):
$$ \gamma = w_1 \lambda + w_2 \alpha + w_3 \delta + w_4 \theta \qquad (15) $$
where w1, w2, w3, w4 denote the weights; the weight coefficients are computed by the DBN classifier for further classification, as discussed in the following section.
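A compact sketch of Eqs. (12)-(15) over a whole sequence is given below. The fixed fusion weights are placeholders for w1..w4, which the paper obtains from the DBN classifier, and each channel is normalized to [0, 1] as described above:

```python
import numpy as np

def continuity_signal(features, weights=(0.25, 0.25, 0.25, 0.25)):
    """Sketch of Eqs. (12)-(15). `features` has shape
    (n_frames, n_blocks, 4), holding four per-block feature channels
    (the lambda/alpha/delta/theta sources); `weights` stands in for
    w_1..w_4, which the paper obtains from the DBN."""
    # City block distance per feature channel between consecutive
    # frames, summed over blocks: Eqs. (12)-(14).
    d = np.abs(np.diff(features, axis=0)).sum(axis=1)   # (n_frames-1, 4)
    # Normalize each channel's continuity values into [0, 1].
    span = d.max(axis=0) - d.min(axis=0)
    d = (d - d.min(axis=0)) / (span + 1e-12)
    return d @ np.asarray(weights)                      # fusion, Eq. (15)

# Toy usage: 100 frames, 16 blocks, 4 feature channels per block.
gamma = continuity_signal(np.random.rand(100, 16, 4))
```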
4.5 Classification of frames using deep belief neural network with social ski driver
optimization
The extracted feature vectors and the continuity signal values are provided as input to the DBN model for the classification process. Neural networks are a commonly used classification tool because they can learn new and synthetic features in their layers. Within deep learning, the DBN is a generative model that explicitly learns nonlinear transformations [21]. It is built from a stack of Restricted Boltzmann Machines (RBMs), which serve as its building blocks; an RBM in each layer of the DBN enhances the training efficiency of the model. Moreover, the between-class separating power is enhanced by the DBN through the extraction of high-level features from the training data. Figure 5 represents the structure of the DBN.
Some existing models over-fit when the training error is low but the test error is high; a back-propagation algorithm is utilized to prevent over-fitting. The DBN model is directed from the visible layer to the hidden layer, where r, s ∈ [0, 1]. Each RBM consists of undirected weights wij with biases, positioned between the layers. The joint distribution of the visible and hidden layers is determined by the energy function:

$$ P(r, s) = \frac{e^{-EF(r,s)}}{\sum_{r,s} e^{-EF(r,s)}}, \qquad (16) $$

where the energy function EF(r, s) is obtained as

$$ EF(r, s) = - \sum_{i} u_i r_i - \sum_{j} v_j s_j - \sum_{i,j} r_i s_j w_{ij}, \qquad (17) $$

where wij denotes the weight between the hidden and the visible layer, and ui and vj represent the bias coefficients of the visible and hidden layers, respectively. A stochastic gradient descent (SGD) approach is utilized to obtain the optimal training. Thus, the marginal probability of a given visible sample is

$$ P(r) = \frac{\sum_{s} e^{-EF(r,s)}}{\sum_{r,s} e^{-EF(r,s)}}, \qquad (18) $$
where the conditional distributions are

$$ P(s_j = 1 \mid r) = \mathrm{sigmoid}\left( \sum_{i=1}^{b} w_{ij} r_i + v_j \right) \qquad (22) $$

and

$$ P(r_j = 1 \mid s) = \mathrm{sigmoid}\left( \sum_{i=1}^{a} w_{ij} s_i + u_j \right). \qquad (23) $$

Here sigmoid(⋅) denotes the logistic sigmoid function, r the visible units, and s the hidden units. This process is performed based on the gradient descent algorithm and the back-propagation algorithm, together with weight decay and momentum terms.
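The conditionals of Eqs. (22)-(23) are what drive RBM training; the sketch below shows one generic contrastive-divergence (CD-1) update step. The learning rate and sampling details are illustrative, and the paper's weight decay and momentum terms are omitted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(r, W, u, v, lr=0.01):
    """One contrastive-divergence (CD-1) step for an RBM with visible
    units r, weights W, visible biases u and hidden biases v, using the
    conditionals of Eqs. (22)-(23). A generic sketch, not the paper's
    exact training schedule."""
    p_s = sigmoid(r @ W + v)                        # up pass, Eq. (22)
    s = (np.random.rand(*p_s.shape) < p_s).astype(float)
    p_r = sigmoid(s @ W.T + u)                      # down pass, Eq. (23)
    p_s2 = sigmoid(p_r @ W + v)                     # second up pass
    # Stochastic gradient step on the log-likelihood approximation.
    W += lr * (np.outer(r, p_s) - np.outer(p_r, p_s2))
    u += lr * (r - p_r)
    v += lr * (p_s - p_s2)
    return W, u, v

# Toy usage: 8 visible units, 4 hidden units, one binary sample.
rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, (8, 4)); u = np.zeros(8); v = np.zeros(4)
W, u, v = cd1_update(rng.integers(0, 2, 8).astype(float), W, u, v)
```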
The cost function for the minimization problem is evaluated using the Mean Square Error (MSE), which describes the difference between the desired values and the output values of the DBN:

$$ E_{cf} = \frac{1}{M} \sum_{j=1}^{N} \sum_{i=1}^{M} \left( O_j(i) - D_j(i) \right)^2, \qquad (24) $$

where N and M represent the number of output-layer units and data samples, respectively, Oj(i) denotes the output of the j-th unit of the output layer at time ti, and Dj(i) denotes the j-th component of the desired output. The network is trained with this algorithm, and the process is repeated until the stopping criterion is reached. Solving the error function and updating the weights with typical algorithms is a complex operation because the problem is NP-hard; a meta-heuristic algorithm is therefore employed to solve these issues.
Here, SSDOA is utilized for optimizing the weight values of the DBN to reduce the learning error. The primary objective of this algorithm is to search for near-optimal or optimal solutions in the search space [25]. The four parameters in SSDOA are: (i) the positions of the agents (Yi ∈ Pn), (ii) the previous best position Bi, (iii) the mean global solution Gi, and (iv) the velocity of the agents Vi.
Steps for weight value optimization:

(a) The positions Yi of the agents are initialized randomly, where the user determines the number of agents.

(b) The position of an agent is updated by adding its velocity to its previous position:

$$ Y_i^{t+1} = Y_i^t + V_i^t. \qquad (25) $$

(c) The velocity of the agent is initialized randomly and then modified according to Eq. (26):

$$ V_i^{t+1} = \begin{cases} b\,\sin(l_1)\left( B_i^t - Y_i^t \right) + \sin(l_1)\left( G_i^t - Y_i^t \right), & l_2 \le 0.5 \\ b\,\cos(l_1)\left( B_i^t - Y_i^t \right) + \cos(l_1)\left( G_i^t - Y_i^t \right), & l_2 > 0.5 \end{cases} \qquad (26) $$
where Vi denotes the velocity of Yi, and l1 and l2 are uniform random numbers generated in the range [0, 1]. Moreover, Bi indicates the agent's best solution so far, Gi represents the entire population's mean global solution, and b is the parameter used to balance exploitation and exploration; it is computed as b^{t+1} = α b^t, where t denotes the current iteration and the value of b is decreased by 0 < α < 1.

(d) The velocity of the agents is adjusted based on (i) the distance between the current position Yi^t and the mean global solution Gi, and (ii) the distance between the previous best position Bi^t and the current position Yi^t.
(e) The agents in SSD move towards the mean of the three best solutions.

(f) Because of the sine and cosine functions, the agents do not move in a straight direction, which gives the algorithm better exploration capabilities (a sketch of these updates follows below).
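A minimal sketch of steps (a)-(f) using the updates of Eqs. (25)-(26) is shown below; the initialization bounds, the decay constant, and the toy quadratic fitness are illustrative stand-ins for the DBN learning error that SSDOA minimizes in the paper:

```python
import numpy as np

def ssd_optimize(fitness, n_agents, dim, iters=100, b=1.0, alpha=0.99):
    """Minimal Social Ski Driver sketch following Eqs. (25)-(26);
    `fitness` plays the role of the DBN learning error to be minimized
    over the weight vector. Bounds and initialization are illustrative."""
    Y = np.random.uniform(-1, 1, (n_agents, dim))   # agent positions, step (a)
    V = np.zeros((n_agents, dim))                   # agent velocities
    B = Y.copy()                                    # previous best positions
    B_fit = np.array([fitness(y) for y in Y])
    for _ in range(iters):
        # Mean of the three best solutions found so far, step (e).
        G = B[np.argsort(B_fit)[:3]].mean(axis=0)
        for i in range(n_agents):
            l1, l2 = np.random.rand(2)
            trig = np.sin if l2 <= 0.5 else np.cos  # Eq. (26)
            V[i] = b * trig(l1) * (B[i] - Y[i]) + trig(l1) * (G - Y[i])
            Y[i] = Y[i] + V[i]                      # Eq. (25)
            f = fitness(Y[i])
            if f < B_fit[i]:                        # keep the personal best
                B[i], B_fit[i] = Y[i].copy(), f
        b *= alpha                                  # decay, b_{t+1} = alpha * b_t
    return B[np.argmin(B_fit)]

# Toy usage: minimize a quadratic over a 5-dimensional weight vector.
best_w = ssd_optimize(lambda w: np.sum(w ** 2), n_agents=20, dim=5)
```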
Figure 6 shows the flowchart of the SSDOA process. Thus, the optimal solutions are obtained, and the DBN, combined with SSDOA, detects the cut and gradual transitions. To summarize: the noise in the frames is removed and the images are enhanced using the FAPG filter in preprocessing; the histogram distance is then evaluated, and the edge features of the frames are extracted using the hybrid DTCWT-WHT method; the continuity signal is constructed after the feature extraction and fed to the DBN classifier. The DBN computes the weight coefficients and classifies the cut and gradual transitions from the video frames with SSDOA, which optimizes the weight values and reduces the learning error.
5 Results and discussion
The proposed work is implemented in the MATLAB 2018a tool. Different video sequences from various datasets are utilized for the experiments on the proposed framework. Recent deep learning approaches such as DBN, recurrent neural network (RNN), deep neural network (DNN) and CNN are considered for comparison with the proposed DBN-SSDOA; these existing approaches are implemented in this work to show the efficacy of the proposed framework. Precision, recall and F-measure are the evaluation metrics used in this work.
5.1 Dataset description
In this work, the datasets used for SBD are the TRECVID 2016, 2017, 2018 and 2019 datasets; these TRECVID video sets are used to evaluate the proposed video SBD algorithm. Sample frames from the four datasets are shown in Fig. 7.

Fig. 7 Sample frames from the TRECVID 2016, 2017, 2018 and 2019 datasets
5.2 Performance metrics
Precision, F-measure, and recall are the performance metrics computed for the proposed approach; they show the efficiency of the proposed method when compared with the existing methods. A correct hit (C) is a correctly detected shot boundary, a missed hit (M) is a boundary that is not detected, and a false hit (F) is a falsely detected shot boundary. These counts are used to compute the efficiency of SBD techniques.
5.2.1 Precision
Precision evaluates the number of transitions correctly reported (i.e., C) out of the total number of transitions reported (C and F):

$$ \text{Precision} = \frac{C}{C + F} \qquad (27) $$
5.2.2 Recall
Recall evaluates the correctly predicted positive observations (C) out of the total number of actual boundaries (C and M):

$$ \text{Recall} = \frac{C}{C + M} \qquad (28) $$
5.2.3 F1-measure
The F-measure is defined as the harmonic mean of recall and precision; Eq. (29) is used for evaluating the F-measure:

$$ F1\text{-measure} = \frac{2 \times (\text{precision} \times \text{recall})}{\text{precision} + \text{recall}} \qquad (29) $$
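Given the correct-hit, missed-hit, and false-hit counts, Eqs. (27)-(29) reduce to a few lines of code; the counts in the example call are hypothetical:

```python
def sbd_metrics(correct_hits, missed_hits, false_hits):
    """Precision, recall, and F1-measure from the C/M/F counts of
    Eqs. (27)-(29)."""
    C, M, F = correct_hits, missed_hits, false_hits
    precision = C / (C + F)                               # Eq. (27)
    recall = C / (C + M)                                  # Eq. (28)
    f1 = 2 * precision * recall / (precision + recall)    # Eq. (29)
    return precision, recall, f1

# Example: 90 correct, 10 missed, 5 false boundaries
# -> precision ~0.947, recall 0.900, F1 ~0.923
print(sbd_metrics(90, 10, 5))
```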
5.3 Result analysis
The experimental results of the proposed work are analysed for the four datasets using the performance metrics above, and the proposed method is compared with existing approaches to show its efficiency. A video shot boundary can involve several transition types, such as fade in, fade out, cut, and dissolve. In the proposed work, we detect the cut and gradual (fade in and fade out) transitions from the video shot boundary. The results for these transitions are analysed individually as follows.
For the gradual transition, the experimentation of the proposed work considers fade in and fade out transitions. The proposed DBN-SSDOA performs effective gradual transition detection, and the analysis of the corresponding performance is provided in the subsections below.
Fade in transition Precision, recall and F-measure are evaluated for the gradual (fade in) transition detection from the video sequences in the four datasets. The proposed DBN-SSDOA is compared with the existing DBN, RNN, DNN and CNN in terms of precision, recall and F-measure, and graphical representations of these metrics are provided to show the effectiveness of the proposed approach. Figure 8 shows the resultant images of fade in transitions detected using the proposed framework.
Table 1 shows the comparative analysis of the proposed DBN-SSDOA and the existing DBN, RNN, DNN and CNN in terms of precision, recall and F-measure on the various datasets. A graphical representation is also provided for the precision, recall and F-measure attained by the proposed DBN-SSDOA as well as the existing methods on the four datasets TRECVID 2016, 2017, 2018 and 2019.
Fig. 8 Some of the resultant fade in frames detected using proposed approach
Figure 9 (a) illustrates the graphical representation of precision for the fade in transition of the proposed method and the existing approaches. The precision value obtained by the proposed framework is higher than that of the existing DBN, DNN, RNN and CNN: the precision values obtained in fade in transition detection by the proposed DBN-SSDOA for the TRECVID 2016, 2017, 2018 and 2019 datasets are 88.78%, 87.32%, 86.47% and 85.78% respectively. In general, if the precision value is high, then the performance of the system is more efficient. In this result analysis, the precision value obtained by the proposed approach is the highest, as shown in Table 1. Therefore, the efficiency of the proposed approach is better than that of the existing approaches.
Figure 9 (b) represents the recall values obtained by the proposed method and the existing techniques. The recall values achieved by the proposed DBN-SSDOA for the TRECVID 2016, 2017, 2018 and 2019 datasets are 89.58%, 88.25%, 87.89% and 87.12% respectively. From Table 1, the recall values obtained by the proposed DBN-SSDOA and the existing DBN, RNN, DNN and CNN can be compared; the recall achieved by the proposed method is higher than that of the existing approaches for fade in transition detection.
Figure 9 (c) shows the F-measure achieved by the proposed approach and the existing methods. The F-measure values obtained by the proposed DBN-SSDOA for the four datasets TRECVID 2016, 2017, 2018 and 2019 are 88.12%, 87.14%, 86.45% and 85.45% respectively. The effectiveness of an approach can be easily evaluated by the F-measure value: if the F-measure value is high, then the system performs well in the detection of transitions. Therefore, from the result analysis, the proposed approach attained the highest F-measure value, which shows the system performance of the proposed DBN-SSDOA.
Fade out transition The precision, recall and F-measure obtained by the proposed method for fade out transitions on the four datasets are evaluated here. A comparative analysis of the proposed method and the existing approaches is performed to measure the efficacy of the proposed detection approach.
Figure 10 shows some of the resultant fade out images from the four datasets. The fade out transition is a subdivision of the gradual transition. In the video sequences, there are many fade out frames among several thousand frames, and the proposed approach efficiently detects them.
Figure 11 (a) represents the precision for fade out transition detection of the proposed method and the existing approaches. The precision value obtained by the proposed DBN-SSDOA detection approach is higher than that of the existing DBN, DNN, RNN and CNN.
Table 1 Performance analysis of proposed and existing approaches for fade in transition detection using various datasets
Transitions Approaches TRECVID 2016 TRECVID 2017 TRECVID 2018 TRECVID 2019
Precision Recall F-measure Precision Recall F-measure Precision Recall F-measure Precision Recall F-measure
Fade in CNN 83.14 84.25 80.14 81.58 82.01 78.258 80.458 82.01 79.58 79.58 81.58 79.58
DNN 84.58 85.15 82.15 83.14 83.45 80.145 81.58 83.14 80.89 81.25 82.47 80.58
RNN 86.14 86.14 84.57 84.25 84.59 83.47 83.45 84.58 81.25 82.45 84.12 81.56
DBN 87.12 87.25 86.48 85.47 86.14 84.58 84.25 86.25 84.78 83.45 85.86 83.25
Proposed DBN-SSDOA 88.78 89.58 88.12 87.32 88.25 87.14 86.47 87.89 86.45 85.78 87.12 85.45
Fig. 9 a, b and c: Precision, recall and F-measure of proposed DBN-SSDOA and the existing approaches for
Fade in transition
The precision values obtained in fade out transition detection by the proposed DBN-SSDOA for the TRECVID 2016, 2017, 2018 and 2019 datasets are 87.58%, 86.58%, 86.01% and 85.96% respectively. In this result analysis, the precision value obtained by the proposed approach for fade out transition detection is the highest, as shown in Table 2. Therefore, the efficiency of the proposed approach is better than that of the existing approaches.
Figure 11 (b) shows the recall obtained by the proposed DBN-SSDOA and the existing techniques. The recall values achieved by the proposed DBN-SSDOA for the TRECVID 2016, 2017, 2018 and 2019 datasets are 88.23%, 87.45%, 87% and 86.14% respectively. The recall values obtained by the proposed DBN-SSDOA and the existing DBN, RNN, DNN and CNN can be compared from Table 2. Thus, the recall achieved by the proposed method is higher than that of the existing approaches for the detection of fade out transitions.
Figure 11 (c) illustrates the comparative analysis of the F-measure achieved by the proposed approach and the existing methods. The F-measure values obtained by the proposed DBN-SSDOA for the four datasets TRECVID 2016, 2017, 2018 and 2019 are 86.63%, 85.98%, 84.59% and 84.05% respectively. Therefore, from the result analysis, the proposed approach attained the highest F-measure value, which reinforces the detection performance of the proposed DBN-SSDOA.
Fig. 10 Few resultant Fade out frames detected using proposed approach
Fig. 11 a, b and c: Precision, recall and F-measure of proposed DBN-SSDOA and the existing approaches for
Fade out transition
Table 2 Performance analysis of proposed and existing approaches for fade out transition detection using various datasets
Transitions Approaches TRECVID 2016 TRECVID 2017 TRECVID 2018 TRECVID 2019
Precision Recall F-measure Precision Recall F-measure Precision Recall F-measure Precision Recall F-measure
Fade out CNN 77.45 79.25 76.47 77.82 78.25 77.58 75.25 78.58 76.25 76.58 78.25 72.58
DNN 78.36 81.25 79.07 79.58 80.5 78.25 79 79.2 77.45 78.6 80.5 78.56
RNN 81.25 83.89 80.45 81.58 82.69 79.69 80.56 81.25 78.58 80.25 82.65 79.25
DBN 84.57 85.25 83.25 83.25 85.45 81.25 83.12 84.59 81.56 82.69 84.59 81.58
Proposed DBN-SSDOA 87.58 88.23 86.63 86.58 87.45 85.98 86.012 87 84.59 85.96 86.14 84.05
Fig. 12 Few Cut transition frames detected by the proposed DBN-SSDOA from all datasets
The precision, recall and F-measure of the proposed DBN-SSDOA approach and the existing DBN, RNN, DNN and CNN are evaluated for cut transition detection to compute the efficiency of the proposed work. Figure 12 represents some of the resultant cut transition frames from TRECVID 2016, 2017, 2018 and 2019.
Table 3 shows the comparative analysis of the various approaches in terms of precision, recall and F-measure for the various datasets on cut transition detection. The graphical representation of precision, recall and F-measure for the proposed and the existing approaches on the various datasets is also provided below.
Figure 13 (a) represents the comparative analysis of the precision values obtained by the proposed DBN-SSDOA and the existing approaches for the detection of cut transitions. The precision value obtained by the proposed DBN-SSDOA detection approach is higher than that of the existing approaches: in cut transition detection, the precision values obtained by the proposed DBN-SSDOA for the TRECVID 2016, 2017, 2018 and 2019 datasets are 94.5%, 92.56%, 92.58% and 91.25% respectively.
Figure 13 (b) illustrates the recall values achieved by the proposed DBN-SSDOA and the existing DBN, RNN, DNN and CNN for cut transition detection. For the TRECVID 2016, 2017, 2018 and 2019 datasets, the recall values attained by the proposed DBN-SSDOA are 93.5%, 92.36%, 93% and 92.12% respectively. In Table 3, the recall values obtained by the proposed DBN-SSDOA and the existing approaches such as DBN, RNN, DNN and CNN are compared to measure the efficacy of the proposed approach. From this comparative analysis, the recall achieved by the proposed method is higher than that of the existing approaches for the detection of cut transitions.
Figure 13 (c) illustrates the comparative analysis of the F-measure achieved by the proposed approach and the existing methods for cut transition detection. For the four datasets TRECVID 2016, 2017, 2018 and 2019, the proposed DBN-SSDOA achieved F-measure values of 92.5%, 91.12%, 90.23% and 89.89% respectively. According to the F-measure value obtained, the efficiency of the approach can be evaluated for SBD: if the F-measure value is high, then the system performs well in the detection of transitions. Thus, from the result analysis, the proposed approach attained the highest F-measure value, which shows the system performance of the proposed DBN-SSDOA.
Furthermore, an overall analysis of SBD by the proposed method and the existing approaches in terms of precision, recall and F-measure is performed for different datasets, namely the TRECVID 2001 to 2005 tasks, as represented in Table 4. The existing approaches considered for this evaluation of SBD performance on the TRECVID 2001 to 2005 datasets are
Table 3 Comparative analysis of proposed DBN-SSDOA and the existing approaches for various datasets on cut transition detection
Transitions Approaches TRECVID 2016 dataset TRECVID 2017 dataset TRECVID 2018 dataset TRECVID 2019 dataset
Precision Recall F-measure Precision Recall F-measure Precision Recall F-measure Precision Recall F-measure
Cut CNN 88.02 87.23 87 86.02 84.12 86.23 87.25 85.25 86.28 86.13 84.32 85.56
DNN 90.23 88 87.15 87.25 86.25 87.25 88.14 86.01 86.12 87.12 85.02 86.58
RNN 91.26 89.23 88.23 88.25 87.25 88.01 89.15 87.02 87.25 88.23 86.5 87.5
DBN 93 90.25 89.01 90.23 89.01 89.01 90.23 88.5 89.25 89.01 88.-8 88.59
Proposed DBN-SSDOA 94.5 93.5 92.5 92.56 92.36 91.12 92.58 93 90.23 91.25 92.12 89.89
Fig. 13 a, b and c: Precision, recall and F-measure of proposed DBN-SSDOA and the existing approaches for
cut transition detection on four datasets
SVD-SBD and WHT-SBD [31]. The videos are chosen randomly from all the datasets for the overall performance analysis of the proposed and the existing approaches.
The precision value of the proposed method for the cut transition is 93% and for the gradual transition 89%. Moreover, the precision values obtained for the cut and gradual transitions of the existing SVD-SBD are 91% and 84% respectively, whereas the precision values for the cut and gradual transitions of the existing WHT-SBD are 89% and 87% respectively. The precision value obtained by the proposed approach is higher than that of the existing approaches. Moreover, the precision value indicates the false positive rate: if the precision value is high, then the rate of false positives is low. Thus, the proposed method obtained near-optimal performance.
Table 4 Precision, recall, and F-measure of various methods for cut and gradual transitions

Method            Cut: Precision (%)  Recall (%)  F1-measure (%)   Gradual: Precision (%)  Recall (%)  F1-measure (%)
SVD-SBD [31]      91                  85          81               84                      81          82
WHT-SBD [31]      89                  91          90               87                      85          86
Proposed method   93                  93          94               89                      90          89
The recall value of the proposed method for the cut transition is 93% and for the gradual transition 90%. The recall value reflects the miss-classification rate, i.e., a higher recall value represents a lower rate of miss-classification. Moreover, the recall values obtained for the cut and gradual transitions of the existing SVD-SBD are 85% and 81% respectively, whereas the recall values for the cut and gradual transitions of the existing WHT-SBD are 91% and 85% respectively. The effectiveness of the proposed strategy is shown by the improved recall value.
The F-measure is the harmonic mean of recall and precision. If both the precision and recall values are high, then the F-measure value is also high; this means the F-measure is high only if the rate of false positives and the rate of miss-classification are low. The F1-measure value of the proposed method for the cut transition is 94% and for the gradual transition 89%. Moreover, the F1-measure values obtained for the cut and gradual transitions of the existing SVD-SBD are 87% and 82% respectively, whereas the F1-measure values for the cut and gradual transitions of the existing WHT-SBD are 90% and 86% respectively.
From the experimental results, the proposed method achieved high precision, F-measure, and recall values, which indicates that the rate of false positives and the rate of miss-classification are low. Thus, the proposed method for SBD is efficient, as can be seen from the result analysis. The computational efficiency is also high in the proposed method when compared with the existing approaches.
6 Conclusion
SBD is performed efficiently using the proposed hybrid DTCWT-WHT and the optimized deep learning approach. The FAPG filter is designed for the denoising process; this filter decreased the illumination noise intensity while maintaining high contrast, and it can restore the frames. The significant feature of this filter is the computational efficiency required for the image enhancement operation. The hybrid DTCWT-WHT is proposed to extract the features of the frames on each block, reducing the complexity of the feature vector computation for each frame. Finally, the gradual and cut transitions are classified from the huge number of frames by the proposed optimized deep learning approach, which minimized the learning error. The datasets utilized for the implementation of the proposed work are TRECVID 2016, 2017, 2018 and 2019. The deep learning approaches DBN, RNN, DNN and CNN are considered for comparison with the proposed approach in order to show its performance efficiency. The efficiency of the proposed work is also compared with the existing SVD and WHT approaches on the TRECVID 2001 to 2005 datasets. The experimental results showed the efficiency of the proposed method, which is compared with state-of-the-art techniques to evaluate its performance. Therefore, the overall performance of the proposed method is better in terms of precision, recall, and F-measure.
References
1. Abdulhussain SH, Mahmmod BM, Saripan MI, Al-Haddad SAR, Jassim WA (2019) Shot boundary
detection based on orthogonal polynomial. Multimed Tools Appl 78(14):20361–20382
2. Asha D, Latha YM (2019) Content-based video shot boundary detection using multiple Haar transform
features. In: Soft computing and signal processing. Springer, Singapore, pp 703–713
3. Bhaumik H, Bhattacharyya S, Chakraborty S (2019) A vague set approach for identifying shot transition in
videos using multiple feature amalgamation. Appl Soft Comput 75:633–651
4. Bi C, Yuan Y, Zhang J, Shi Y, Xiang Y, Wang Y, Zhang R (2018) Dynamic mode decomposition based
video shot detection. IEEE Access 6:21397–21407
5. Chakraborty S, Thounaojam DM (2019) A novel shot boundary detection system using hybrid optimization
technique. Appl Intell 49(9):3207–3220
6. Chakraborty S, Thounaojam DM (2020) SBD-duo: a dual stage shot boundary detection technique robust to
motion and illumination effect. Multimed Tools Appl 80(2):3071–3087
7. Chen Y, Shi L, Feng Q, Yang J, Shu H, Luo L, Coatrieux JL, Chen W (2014) Artifact suppressed dictionary
learning for low-dose CT image processing. IEEE Trans Med Imaging 33(12):2271–2292
8. Fan J, Zhou S, Siddique MA (2017) Fuzzy color distribution chart-based shot boundary detection.
Multimed Tools Appl 76(7):10169–10190
9. Gygli M (2018) Ridiculously fast shot boundary detection with fully convolutional neural networks. In 2018
International Conference on Content-Based Multimedia Indexing (CBMI), pp. 1–4. IEEE.
10. Hannane R, Elboushaki A, Afdel K, Naghabhushan P, Javed M (2016) An efficient method for video shot
boundary detection and keyframe extraction using SIFT-point distribution histogram. Int J Multimed Inf
Retr 5(2):89–104
11. Kar T, Kanungo P (2017) A motion and illumination resilient framework for automatic shot boundary
detection. SIViP 11(7):1237–1244
12. Kumar GN, Reddy VSK, Kumar SS (2018) Video shot boundary detection and key frame extraction for
video retrieval. In Proceedings of the Second International Conference on Computational Intelligence and
Informatics, Springer, Singapore, 557–567.
13. Kumar R, Ray S, Sharma M, Kumar B (2020) Abrupt scene change detection using spatiotemporal
regularity of video cube. In: Advances in VLSI, communication, and signal processing. Springer,
Singapore, pp 991–1002
14. Li Y, Xia R, Huang Q, Xie W, Li X (2017) Survey of spatio-temporal interest point detection algorithms in
video. IEEE Access 5:10323–10331
15. Liang R, Zhu Q, Wei H, Liao S (2017) A video shot boundary detection approach based on CNN feature. In
2017 IEEE International Symposium on Multimedia (ISM) 489-494.
16. Liu F, Wan Y (2015) Improving the video shot boundary detection using the HSV color space and image
subsampling. In 2015 seventh international conference on advanced computational intelligence (ICACI),
IEEE, 351-354.
17. Malinski L, Smolka B (2016) Fast averaging peer group filter for the impulsive noise removal in color
images. J Real-Time Image Proc 11(3):427–444
18. Mondal J, Kundu MK, Das S, Chowdhury M (2018) Video shot boundary detection using
multiscale geometric analysis of nsct and least squares support vector machine. Multimed Tools
Appl 77(7):8139–8161
19. Parmar M, Angelides MC (2015) MAC-REALM: a video content feature extraction and modelling
framework. Comput J 58(9):2135–2171
20. Prabavathy AK, Shree JD (2019) Histogram difference with fuzzy rule base modeling for gradual shot
boundary detection in video cloud applications. Clust Comput 22(1):1211–1218
21. Prathiba T, Kumari RSS (2021) Eagle eye CBVR based on unique key frame extraction and deep belief
neural network. Wirel Pers Commun 116(1):411–441
22. Rashmi BS, Nagendraswamy HS (2020) Video shot boundary detection using block based cumulative
approach. Multimed Tools Appl 80(1):641–664
23. Sasithradevi A, Roomi SMM (2020) A new pyramidal opponent color-shape model based video shot
boundary detection. J Vis Commun Image Represent 67:102754
24. Singh A, Thounaojam DM, Chakraborty S (2019) A novel automatic shot boundary detection algorithm:
robust to illumination and motion effect. SIViP 1–9
25. Tharwat A, Gabel T (2019) Parameters optimization of support vector machines for imbalanced
data using social ski driver algorithm. Neural Comput Applic 32(11):6925–6938
26. Thounaojam DM, Khelchandra T, Singh K, Roy S (2016) A genetic algorithm and fuzzy logic approach for
video shot boundary detection. Comput Intell Neurosci 2016:1–11
27. Tippaya S, Sitjongsataporn S, Tan T, Khan MM, Chamnongthai K (2017) Multi-modal visual features-
based video shot boundary detection. IEEE Access 5:12563–12575
28. Wu L, Zhang S, Jian M, Lu Z, Wang D (2019) Two stage shot boundary detection via feature fusion and
spatial-temporal convolutional neural networks. IEEE Access 7:77268–77276
29. Xu J, Song L, Xie R (2016) Shot boundary detection using convolutional neural networks. In 2016 visual
communications and image processing (VCIP), IEEE, 1–4.
30. Yang SH, Lin YN, Chiou GJ, Chen MK, Shen VR, Tseng HY (2019) Novel shot boundary detection in
news streams based on fuzzy petri nets. Appl Artif Intell 33(12):1035–1057
31. Youssef B, Fedwa E, Driss A, Ahmed S (2017) Shot boundary detection via adaptive low rank and svd-
updating. Comput Vis Image Underst 161:20–28
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.