Encoding and Video Content Based HEVC Video Quality Prediction
DOI 10.1007/s11042-013-1795-z
Abstract Advances in multimedia devices and video compression techniques, together with the
availability of increased network bandwidth in both fixed and mobile networks, have
accelerated the proliferation of multimedia applications (e.g. IPTV, video streaming and
online gaming). However, this has also posed a real challenge to network and service
providers to deliver these applications with an acceptable Quality of Experience (QoE). In
these multimedia applications, it is highly desirable to predict and, if possible, control
video quality to meet such QoE and user expectations. Streamed video quality is affected
by both encoding and transmission processes, and the impacts of these processes are content
dependent. This issue has gradually been recognised in video quality modelling research
in recent years. In this paper, we carried out objective and subjective tests on video
sequences to investigate the impact of video content type and encoding parameter settings
on HEVC video quality. Initial results show that varying the video content type and encoding
parameters impacts video quality. Based on the test results, we developed a content-based
video quality prediction (CVQP) model that takes into account HEVC encoding parameters
such as the Quantization Parameter (QP) and video content type (characterised by the motion
activities and complexity of video sequences). We achieved an accuracy of 92 % for the
test dataset when model-predicted PSNR values were compared with full reference PSNR
measurements. The performance of the model was also evaluated by comparing predicted
PSNR with Double Stimulus Impairment Scale (DSIS) subjective quality ratings.
Results show a good correlation between actual MOS and predicted PSNR. The proposed
model could be used by content providers to determine the initial quality of videos based
on QP and content type.
1 Introduction
The model developed in this work is primarily aimed at content providers, who can use it for
better video provisioning and for intelligent monitoring of video quality and apportioning of
resources to ensure maximization of video quality for given encoder settings.
Although the impact of the encoding process on video quality has been investigated for the
MPEG-2, MPEG-4 and HEVC video coding standards [15, 20, 39], the motion activities and
complexities which define the content characteristics of a video sequence have not been
explicitly defined and used in encoded video quality estimation, especially for the recently
released HEVC standard [34]. The first motivation and novelty of this work is therefore to
explicitly define video content type in terms of the motion activities and complexity of video
sequences. Secondly, the defined content type is used alongside the QP to model the impact of
the encoding process on HEVC videos.
The contributions of this paper are threefold. Firstly, we provide an in-depth analysis of the
impact of video content type on video quality. Secondly, we develop a new metric based on the
motion activities and the complexity of a video sequence to define video content type. Thirdly,
we develop a video quality prediction model based on the newly proposed content type definition
metric and the QP parameter.
The rest of the paper is organized as follows: Section 2 presents some background
information and related work; Section 3 provides an overview of HEVC. Section 4
describes the experimental setup and the impact of video content type and encoding
process on video quality. Section 5 presents limitations with current video quality
prediction that does not take into account video content type and the development of
video quality prediction models based on the defined content type and encoding
parameters (e.g. QP). Section 6 presents the evaluation and comparison of proposed
model with subjective results, while Section 7 provides some conclusions and high-
lights the future direction of our work.
2 Related work
Objective quality metrics have been classified into five main categories by ITU standardization
activities [35]; these include:
1) Media-layer models
2) Parametric packet-layer models
3) Parametric planning models
4) Bitstream-layer models
5) Hybrid models
Media-layer models use the video signal to compute QoE without requiring any information
about the system under test. This category of objective measurement is suitable for
codec comparison and optimization scenarios. Parametric packet-layer models, on the other
hand, use packet-header information for QoE prediction without having access to the media
signals. Since they do not process the media signal, this category is considered a lightweight
solution for QoE prediction. For parametric planning models, QoE prediction is based on
quality planning parameters for networks and terminal devices; as a result, prior knowledge
about the system under test is required. Bitstream-layer models use information from the
encoded bitstream to measure QoE. Finally, hybrid models combine two or more of the above
models.
The goal of our work is to use a bitstream-layer model to estimate the quality of encoded
HEVC videos. The estimation takes into account motion features extracted from the encoded
bitstream. These features include the Motion Vectors (MV), the Quantization Parameters (QP)
and information on the number of bits used per video frame. The QP regulates how much
spatial detail is retained during encoding and is considered to be the main encoding quality
determining parameter [22]. An increased QP results in loss of detail and a drop in encoded
video bitrate, which may lead to degradation in video quality. The MV, on the other hand,
reflects the motion activities of a video sequence [43]. In this work, we used the MV alongside
the complexity of video sequences to define video content type in terms of Motion Amount (MA).
In the existing literature, the motion activities (temporal information) and complexity (spatial
information) of a video sequence have been used to determine video content types and to
classify videos into groups or categories. Seldom has the content type been defined and used as
an additional parameter in modelling video quality prediction. For instance, the work presented
in [14] identified the content types of video sequences and classified them into groups using
spatio-temporal features and a cluster analysis tool. The authors in [29] presented an approach
to estimate video quality based on content adaptive parameters and content classification.
Content classification is based on motion characteristics determined by MV information and the
pixel-wise sum of absolute differences (SAD). The reported approach classifies videos into
four groups. This is limited because videos grouped into a category may not have the same
spatial and temporal complexities. Additionally, the authors did not take into account videos
that may not fall into any of the limited categories. We propose a content-independent video
quality prediction model that takes into account the definition of content type in terms of its
motion characteristics. In [38], the authors compared the impact of packet loss impairment on
the H.264/AVC and HEVC compression standards by using the amount of motion and spatial
detail to classify HEVC video sequences into four classes. The conclusions arrived at assume
that all video sequences fall into four classes, which is inaccurate because more categories may
be needed for videos that do not fall into the limited categories.
In the literature, video content type has been identified as a significant parameter that
impacts video quality. For example, in [42], the authors concluded that video content type has a
significant impact on video quality and is the second most significant QoS parameter after
encoder type and settings. It is therefore important to further investigate the impact of video
content type on quality.
In [26], the authors monitored video quality using extracted bitstream information to
identify missing macroblocks and the spatial extent and temporal duration of each loss. The
reported approach only identified lost macroblocks, without any consideration of content
type. This also differs from our work, where extracted motion feature information is used
to define content type and then for quality modelling. A full reference model called the
MOVIE index is presented in [32]; the model design is based on the spatio-temporal features
of a video sequence. The authors in [1] developed a novel bitstream-layer model to estimate
the visibility and impact of packet loss on H.264/AVC HD and SD video quality by extracting
MV feature information from the H.264/AVC encoded bitstream to account for the
spatio-temporal characteristics of video content, and classifying packet loss events using
support vector regression (SVR) [40]. Although the work presented in [1] used extracted MV,
it is fundamentally different from our work, as the extracted MV were only used to identify
macroblocks (MBs) impacted by loss, without any consideration of the overall motion
activities and complexities of video sequences.
Besides bitstream-layer objective models, a large body of research has also used the
human vision system (HVS) to design objective Video Quality Assessment (VQA) metrics. This
approach uses the human eye-and-brain pathway to define video content type. For example,
the work presented in [4] proposed an HVS-based reference-free objective quality metric based on
macroblock error detection weighted by temporal and spatial saliency maps computed at the
decoder side of the video delivery process. This HVS-based approach used salient areas in a
video sequence to identify different content types. HVS-based systems use an elaborate
mechanism to capture human perception of distortion and spatial differences. However, these
systems have limitations in modelling the temporal aspects of human vision and distortions in
video [31]. Additionally, HVS-based systems only model temporal changes that occur in the
early stages of processing in the visual cortex [13]. Our approach, on the other hand, measures
the amount of motion activity/complexity of video sequences independently of human judgement
to define video content type, which is subsequently used with the QP to accurately predict
video quality.
3 HEVC: an overview
The increased demand for high quality video services has necessitated a substantial improvement
in video compression efficiency. HEVC is a video coding standard developed by the
Joint Collaborative Team on Video Coding (JCT-VC) [12] to improve on the encoding efficiency
of AVC [34]. HEVC uses a new coding structure called the coding tree unit (CTU), which
replaces the macroblock of the H.264 standard. Unlike H.264, which divides frames into
macroblocks of 16×16, 16×8, 8×16, 8×8, 8×4, 4×8 and 4×4 pixels, HEVC deploys Coding Units
(CU) of 64×64, 32×32, 16×16 and 8×8 pixels. By using larger CUs in image regions with similar
characteristics, HEVC can achieve highly efficient compression through intra-prediction
and transforms.
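To make the CU partitioning concrete, the following is a minimal Python sketch (not from the paper) of the recursive quadtree split from a 64×64 CTU down to the 8×8 minimum; the split-decision callback stands in for the encoder's rate-distortion mode decision.

```python
# Minimal sketch of HEVC's recursive CU quadtree: a 64x64 CTU may be split
# into four equal sub-CUs, recursively, down to 8x8 leaf CUs.

def split_cu(x, y, size, depth, decide_split, leaves):
    """Recursively partition a CU; decide_split models the encoder's choice."""
    if size > 8 and decide_split(x, y, size, depth):
        half = size // 2
        for dy in (0, half):
            for dx in (0, half):
                split_cu(x + dx, y + dy, half, depth + 1, decide_split, leaves)
    else:
        leaves.append((x, y, size))  # leaf CU that is actually coded

# Toy decision rule: split every CU larger than 16x16.
leaves = []
split_cu(0, 0, 64, 0, lambda x, y, s, d: s > 16, leaves)
print(len(leaves), "CUs of size", {s for _, _, s in leaves})  # 16 CUs of 16x16
```

In a real encoder the decision is made per CTU by comparing rate-distortion costs, which is what lets homogeneous regions be coded with large CUs.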
HEVC and AVC share some similarities in terms of video codec design. Conceptually, both
codecs have a Network Abstraction Layer (NAL) and a Video Coding Layer (VCL). The two main
features that differentiate HEVC from AVC and its predecessors are the larger block structure
and two additional new filters (Sample Adaptive Offset (SAO) and Adaptive Loop Filter (ALF))
which are applied after the deblocking filter in HEVC [25].
In this work, we used the low delay-high efficiency configuration setting recommended
by JCT-VC; this encoding configuration uses the previous frame for reference. The encoding
test conditions selected are also based on the JCT-VC recommended common testing conditions.
These include the QP, which regulates how much spatial detail is retained and has a great
impact on the compression rate, the visual quality of the video and its bitrate.
A. Experimental setup
Figure 1 shows a block diagram of the system that was designed and used to provide a
realistic investigation of the impact that the HEVC encoding process and MA have on video
quality. The system consists of three key components: encoding/decoding, content
definition using both motion features (MV) and the complexity of the video sequence, and
modelling of encoded video quality. It should be noted that transmission impairments were
not investigated in this work; their inclusion in the system block diagram is for completeness,
as the model designed in this work is intended for content providers (see later).
Fig. 1 Block diagram of a system for motion based video quality prediction
1) Encoding process
The encoding process involves encoding a source video sequence with an HEVC
encoder for a given test scenario. In this paper, video sequences were encoded using
the full range of Quantisation Parameter (QP) values recommended by the JCT-VC, which
published a list of recommended conditions under which HEVC should be tested [3].
Of the 12 listed test conditions (random access, low delay, intra, etc.), five are
optional and the rest are required, depending on the application scenario.
The work presented in this paper used a subset of six different video test sequences
from the recommended HEVC test sequences: ParkScene, BasketballDrive, Vidyo1, Johnny,
BQterrace and Kimono. All video sequences were encoded using HEVC encoder version 9.1.
A snapshot of each video sequence is shown in Fig. 2.
Fig. 2 Snapshots of the test video sequences: (a) BQterrace, (b) Johnny, (c) Kimono, (d) ParkScene, (e) Vidyo1
and (f) BasketballDrive
Given the need for low complexity and low delay for mobile services, a Low
Complexity (LC) and Low Delay (LD) configuration was used throughout the
encoding process. Additionally, the seven QP settings recommended by JCT-VC [3] were
used. Because home and mobile devices may range from low to high video resolutions,
a spatial resolution of 1280×720 and a temporal resolution of 30 frames per second
were used for testing.
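As an illustration of such a batch encoding run, the sketch below sweeps one sequence across seven QP values by invoking the HM reference encoder from Python. The binary name, configuration file and flag names follow common HM command-line conventions but should be treated as assumptions to be checked against the local HM 9.1 build, and the integer QP sweep is illustrative.

```python
# Hedged sketch: batch-encode one test sequence at several QPs with the HM
# reference encoder. Paths, config file and flags are assumed HM conventions.
import subprocess

HM_ENCODER = "./TAppEncoderStatic"        # assumed HM encoder binary
CFG = "cfg/encoder_lowdelay_main.cfg"     # assumed low-delay configuration
QPS = [22, 27, 32, 37, 42, 47, 51]        # illustrative seven-QP sweep

for qp in QPS:
    subprocess.run([
        HM_ENCODER, "-c", CFG,
        "-i", "Johnny_1280x720_30.yuv",   # raw source sequence (assumed name)
        "-wdt", "1280", "-hgt", "720",    # spatial resolution used in the paper
        "-fr", "30",                      # temporal resolution (fps)
        "-q", str(qp),                    # quantization parameter
        "-b", f"Johnny_qp{qp}.bin",       # encoded bitstream
        "-o", f"Johnny_qp{qp}_rec.yuv",   # reconstruction, for FR PSNR
    ], check=True)
```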
2) Content type definition with encoded bitstream motion information
As stated in previous sections, content type plays a significant role in video
quality estimation. We defined the content type of video sequences in terms of Motion
Amount (MA), which represents video motion characteristics and video complexity.
Video motion characteristics can be determined either during decoding (extraction of
motion features from the encoded bitstream) or by pixel-wise motion estimation (a
post-decoding process using matching metrics). What differentiates these two methods is
their simplicity and their accessibility on end users' devices. Motion feature extraction
from the encoded bitstream is considered the simplest, because the bitstream carries
motion information which can be extracted during decoding. We used motion features
extracted from the bitstream to estimate the motion activities of sequences by extracting
coding unit (CU) level Motion Vector (MV) information from the HEVC encoded bitstream.
The MVs reflect the motion activities of a video sequence and carry the motion
information of encoded videos [43].
A modified version of the HEVC decoder from the current reference software was
used to extract the MV, QP and bits information from the encoded bitstream and
store it for post-processing.
An MV is made up of vertical and horizontal components, which can be either positive
or negative. In this paper, we used the extracted information (MV, the number of bits
and QP) to define video content type in terms of the Motion Amount (MA). The MA
takes into account the motion activities and the complexity index of a video sequence.
To determine the motion activities of a video sequence, we counted all the MVs
extracted from an encoded bitstream (expressed as MVcount). The MVcount of a
video at a given QP was calculated as follows:
MVcount = count(MV)    (1)
Results from MVcount show that the number of MVs in all the video sequences in
our work decreases as QP increases.
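A minimal sketch of Eq. (1) is shown below, assuming the modified decoder dumps the CU-level MVs of each frame into a per-frame list; the data layout and values are illustrative.

```python
# Minimal sketch of Eq. (1): MVcount = count(MV) over the whole sequence,
# given MVs extracted from the bitstream by a modified decoder.

def mv_count(mv_log):
    """Count all motion vectors extracted from an encoded bitstream."""
    return sum(len(frame_mvs) for frame_mvs in mv_log)

# Example: three decoded frames with their CU-level (mvx, mvy) pairs.
mv_log = [[(4, -2), (0, 1)], [(3, 3)], [(-1, 0), (2, -5), (0, 0)]]
print(mv_count(mv_log))  # 6
```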
The complexity of a video sequence is defined in terms of a composite picture
complexity index (CPC); this approach is similar to the one used by the authors in [30] to
define the complexity index of MPEG videos. The CPC is comparable to the composite
complexity index (CC) in MPEG encoding, where the CC is calculated for individual
frame types (e.g. I, P and B frames) in order to allocate a suitable bitrate.
The extracted QP and bits information from the encoded bitstream were used to
calculate the CPC for the different frame types, using the following equation:
CPC = ∑_{i=1}^{N} Bi (Qi / 2)    (2)
where Bi and Qi are the number of bits and the QP of frame i respectively.
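A minimal sketch of Eq. (2), as reconstructed above, is given below; the per-frame bit counts and QPs are illustrative values rather than data from the paper.

```python
# Minimal sketch of Eq. (2): CPC = sum over N frames of Bi * (Qi / 2).

def cpc(bits, qps):
    """Composite picture complexity from per-frame bits Bi and QPs Qi."""
    assert len(bits) == len(qps)
    return sum(b * q / 2 for b, q in zip(bits, qps))

bits = [48_000, 12_500, 11_800]  # bits per frame (illustrative)
qps = [32, 32, 33]               # QP per frame (illustrative)
print(cpc(bits, qps))
```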
We then plotted MA values against full reference PSNR values to determine whether the
relationship between MA and PSNR is similar to the one shown by BR and PSNR.
Results show that the MA of all video sequences has a logarithmic relationship
with PSNR (Fig. 5). This is similar to the relationship shown by BR vs. PSNR.
This is important because, in order to design a quality model that takes video content
type into account, the relationship between the quality metric and the content type
metric must be established.
To quantify the relationship between MA and PSNR, we performed a further direct
correlation between MA and PSNR. This was also used to determine whether the
correlations of MA and of BR with PSNR are of any similarity. The results of these
correlations are given in terms of R² and summarized in Table 1, which clearly shows
that the MA correlation with PSNR is similar to the direct correlation between bitrate
(BR) and PSNR.
B. Impact of encoding process on HEVC video quality
Video quality depends on the content type and the initial encoding process
settings such as the QP. As expected, a lower QP results in a higher bitrate (BR),
which leads to increased video quality. Conversely, as the QP increases, the BR
reduces, resulting in a reduction of video quality. Meanwhile, different video content
types have different encoding bitrate requirements, because the encoding process uses
video complexity to derive a QP that gives a target BR for a given quality. Results
show that videos with higher motion activities have higher BR when compared to those
with lower motion activities, as shown in Fig. 6.
For all encoded videos, the quality generally drops with increased compression
(defined by QP). However, the drop in quality is steeper for videos with higher
motion activities than for those with lower motion. What is also evident from the
results is that, when encoding video, one needs to be mindful of the level of
compression because it determines the initial video quality; too much compression
might cause a higher degradation of the initial video quality. For example, it is
clear from the results presented in Fig. 7 that encoding the videos with a QP value
of more than 37 will not achieve a video quality greater than 37 dB. Furthermore,
all sequences require different compression levels for the same quality because of
the differences in motion characteristics. For example, the Johnny and Vidyo1 video
sequences, with lower motion activities, require lower compression for the same
quality when compared to the ParkScene, BasketballDrive, BQterrace and Kimono video
sequences, which have higher motion activities. Results also show that the QP has a
linear relationship with PSNR. This relationship is important because it enables us
to determine how QP impacts video quality and how it should be used in designing a
video quality model.
To further understand the relationship between video content type and PSNR,
we plotted MA against PSNR, as shown in Fig. 5 above. Results show
that MA has a logarithmic relationship with PSNR. This is also important
because it enables us to establish how video content type relates to PSNR and
how it can be used to model video quality.
Table 1 Direct correlation results (R²) for the MA vs. PSNR and BR vs. PSNR correlations: 0.410 and 0.413
Although these results are apparent and not HEVC-specific, we included them to
highlight the limitations of current content-blind video quality prediction. We
think this is important, especially given that the model derived from this work is
primarily aimed at addressing the limitations of past and current video quality
prediction models.
Results presented in Section 4 show that changes in QP and video content type affect PSNR
values. Encoding different videos with the same codec parameters, such as QP, frame rate, BR
and spatial resolution, produced different PSNR values. This indicates that parameters other
than the encoding parameters must be influencing the PSNR values. Because PSNR varies with
content type, we believe that the content type has an impact on PSNR values.
To determine whether the content type actually influences PSNR values, we used the
content type defined in Section 4(2) and the QP to evaluate whether the changes in PSNR
values are due to these parameters and whether the impact on PSNR is statistically significant.
To achieve this, we performed a two-way Analysis of Variance (ANOVA) [33] on the dataset
from the six video sequences. ANOVA enables us to calculate the p-value (probability), which
is derived from the cumulative distribution function of F based on the F-value [33]. The
p-value indicates how a parameter impacts PSNR: a parameter with a p-value of 0.05 or less
(p-value ≤ 0.05) indicates that PSNR is significantly affected by that parameter, whereas a
p-value of more than 0.05 indicates that the parameter has no significant impact on PSNR.
Results in Table 2 indicate that the interaction between QP and MA does not significantly
impact PSNR (p-value of 0.106). However, as independent parameters, they both impact PSNR,
with QP and MA having p-values of 0 and 0.016 respectively, as shown in Table 2; the ANOVA
results thus also indicate that the QP has more influence on PSNR than MA. This is important
because it enables us to understand how statistically significant these parameters are and
how they can be used in modelling video quality. Given that MA increases when QP decreases,
to accurately determine the impact of MA on PSNR values we used MA as a continuous variable
(covariate) [17] in the ANOVA.
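A hedged sketch of this analysis is shown below, using statsmodels with QP as a categorical factor and MA as a continuous covariate; the toy data frame and column names are our own and merely stand in for the paper's dataset.

```python
# Hedged sketch of a two-way ANOVA on PSNR with QP (factor) and MA (covariate).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({  # illustrative observations, not the paper's data
    "psnr": [41.2, 40.8, 42.0, 41.5, 38.7, 37.9,
             39.1, 38.2, 35.1, 34.2, 35.8, 34.9],
    "qp":   [22] * 4 + [32] * 4 + [42] * 4,
    "ma":   [3.1, 5.6, 2.2, 4.0, 2.4, 4.9, 1.9, 3.5, 1.8, 4.1, 1.5, 3.0],
})

model = ols("psnr ~ C(qp) * ma", data=df).fit()  # main effects + interaction
print(sm.stats.anova_lm(model, typ=2))           # F-statistics and p-values
```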
The findings of the ANOVA can be summarized as follows:
1) Encoded video quality depends on content type (defined by MA), as also indicated by the
authors in [16, 42]. It is therefore important to measure and consider content type when
designing a video quality prediction model.
2) The initial encoding settings, such as QP, determine the initial video quality, and this
is content dependent. To adequately predict video quality, encoding settings such as the QP
should be taken into consideration.
Because the ANOVA results show that the content type of a video sequence significantly
impacts PSNR, we explicitly defined content type in terms of motion activities (as presented in
Section 4(2)) and used it in modelling video quality. We believe this approach is simple and
robust, with low complexity, given that the parameters needed for quality estimation can be
extracted from the encoded bitstream with little computational effort. Additionally, by defining
content type rather than grouping videos, a single model can be used to estimate encoded video
quality with good accuracy, as presented in the section that follows.
Table 2 Two-way ANOVA results (source, degrees of freedom, sequential and adjusted sums of squares, mean squares, F-statistics and p-values)
Based on the results presented in Figs. 5 and 7 and the ANOVA, we also established that PSNR
has a logarithmic relationship with MA (Fig. 5). On the other hand, PSNR has a linear
relationship with QP (Fig. 7). Based on the logarithmic and linear relationships that exist
between QP, MA and PSNR, the encoding quality of a video sequence can therefore be predicted
using the following equation:
PSNR = g + h·QP + j·ln(MA)    (5)
where g = 82.432, h = −0.803 and j = −1.908 are model coefficients obtained through
regression. The derived coefficients hold for all video sequences.
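Eq. (5) with the reported coefficients reduces to a one-line function, sketched below; the QP and MA inputs in the example are illustrative, since the scale of MA depends on the MVcount and CPC of the sequence.

```python
# Eq. (5) of the CVQP model with the regression coefficients reported above.
import math

G, H, J = 82.432, -0.803, -1.908  # g, h, j from the paper's regression

def predict_psnr(qp, ma):
    """PSNR = g + h*QP + j*ln(MA)."""
    return G + H * qp + J * math.log(ma)

print(round(predict_psnr(32, 1.2e5), 2))  # illustrative QP and MA values
```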
B. Performance evaluation with unseen dataset
To evaluate the performance of the proposed model, we extracted and computed the
MA and QP from the bitstreams of video sequences not used in model derivation (testing
sequences). These sequences include Traffic, KristenAndSara, RaceHorses and
SlideShow. The PSNR for each QP of a test sequence was computed using the proposed
model. The accuracy of the model is given in terms of the correlation coefficient R² and the
Root Mean Square Error (RMSE), as summarized in Tables 3 and 4 and in Fig. 8, where a
scatter plot of actual PSNR (obtained from the encoding log files of the test and training
sequences) against model-predicted PSNRp is shown. Figures 9 and 10 show the model
predictions for the six testing and training sequences respectively. We achieved a correlation
coefficient of around 96 % with the training video sequences and 92 % with the testing
sequences when compared with full reference PSNR measurements. There were a total of
42 training and 28 test data points for model development and validation.
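A minimal sketch of this evaluation is given below, computing R² and RMSE between full reference and model-predicted PSNR; the short arrays are illustrative placeholders for the 42 training and 28 test points.

```python
# Minimal sketch: R^2 and RMSE between actual and model-predicted PSNR.
import numpy as np

psnr_actual = np.array([42.1, 39.4, 36.8, 33.9, 31.2])  # illustrative
psnr_pred = np.array([41.6, 39.9, 36.1, 34.5, 30.6])    # illustrative

ss_res = np.sum((psnr_actual - psnr_pred) ** 2)
ss_tot = np.sum((psnr_actual - psnr_actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
rmse = np.sqrt(np.mean((psnr_actual - psnr_pred) ** 2))
print(f"R^2 = {r2:.3f}, RMSE = {rmse:.3f} dB")
```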
A. Subjective test
Objective video quality assessment metrics such as PSNR are less computationally
intensive, but are limited because they do not factor in human visual perception.
Subjective video quality assessment techniques, on the other hand, are able to capture
the human visual perception of quality.
The goal of this section is to evaluate the performance of the CVQP model using the Double
Stimulus Impairment Scale (DSIS) subjective quality rating approach [23]. We compare the
degree of closeness of the predicted PSNR (PSNRp) to actual DSIS subjective ratings. The
subjective test plan followed the ITU recommendations for subjective video testing [10, 11].
B. Test sequences
A total of 42 impaired videos of 9 s each were generated from six reference video clips:
BasketballDrive, Vidyo1, Johnny, BQterrace, Kimono and ParkScene. The sequences
were encoded using the full range of JCT-VC recommended settings. These include actual
Quantisation Parameters (QP) of 19.25, 24.25, 29.25, 34.25, 39.25, 44.25 and 49.25, a frame
rate of 30 fps, a spatial resolution of 1280×720, a Coding Unit (CU) size of 64×64 and a
Group of Pictures (GoP) size of 4.
To capture the absolute perceived quality over time, we chose to use the ITU
recommended Double Stimulus Impairment Scale (DSIS) for the subjective video testing.
This testing method is most effective when the quality differences between the unimpaired
and the impaired sequences are small.
C. Design
We used DSIS quality evaluation, where participants are shown the impaired video after
the unimpaired video sequence. To restrict the subjective evaluation to between 10 and
15 min, as recommended by the ITU, the sequences were randomly split into three datasets. The
unimpaired and impaired videos were uploaded to three identical subjective test websites
[37], each containing two columns: one for the unimpaired sequences and the other for the
impaired sequences. However, the impaired video sequence column also included unimpaired
videos (used as hidden reference videos) needed to validate the test scores. It should be noted
that the DSIS ratings for the reference sequences were only used for screening and not for analysis.
Fig. 8 Scatter plot of Actual PSNR vs. Predicted PSNR for training and testing video sequences
Fig. 9 Impact of QP on actual and predicted PSNR for training video sequences
The voting period was non-restrictive, as participants had the option to watch a video clip
more than once before submitting their final vote through a “submit” button
located at the bottom of the website. To track how long participants took to complete the
test, all testing websites had hidden timers. To minimize memory effects, all sequences were
randomized using the Latin square randomization technique [7].
D. Participants
A sample of 63 students from the School of Computing and Mathematics (SoCM)
at Plymouth University took part in the test. This comprised 43 undergraduate and 20
graduate students, with a mix of males and females (the majority male). There was no
monetary compensation for the students who took the test. Although no attempt was made
to measure the test takers' computer skills or familiarity with video testing, the
participants were all computer science students with a high degree of computer literacy.
E. Test procedure
The subjective test was conducted over a two-and-a-half week period in a Plymouth
University computer lab. As shown in Fig. 11, the assessor used the first 5 min of a test
Fig. 10 Impact of QP on actual and predicted PSNR for testing video sequences
Fig. 11 Structure of a DSIS test session: 5 min of instructions, followed by pairs of 9 s original and 9 s degraded clips separated by 2 s switching gaps, each pair followed by a 10 s voting period on the 5-point impairment scale (5 = imperceptible, 4 = perceptible but not annoying, 3 = slightly annoying, 2 = annoying, 1 = very annoying); total test time ≈ 10 min to watch and grade 15 video clips
session to explain to the test takers the type of assessment, the grading scale, the sequences,
and the timing and voting procedure. This was immediately followed by the distribution of
printed pages containing the test web links. This procedure was repeated for all test sessions,
each of which had between 10 and 17 test takers.
To gauge the visible artefacts caused by HEVC encoder settings, DSIS ratings were mapped
onto a MOS scale from 1 to 5, where 1 = “very annoying”, 2 = “annoying”, 3 = “slightly
annoying”, 4 = “perceptible but not annoying” and 5 = “imperceptible”, as recommended by
ITU P.910 [11]. On average, it took between 10 and 15 min to complete a test (5 min to read
the test instructions and fill in and accept the terms and conditions, 9 s to watch each
original video clip, 2 s to switch to and watch the degraded video and a 10 s non-restrictive
voting time). After watching both the non-degraded and the degraded video sequences,
participants were asked: “On a scale of 1 to 5, grade the difference between the videos in
terms of quality”, where 1 indicates the worst quality (“very annoying”) and 5 the best
(“imperceptible”). To simplify numerical analysis and the plotting of graphs, individual
DSIS ratings were averaged to obtain the Mean Opinion Score (MOS).
F. Outlier detection
To ensure the validity of the subjective results, hidden reference videos and a timer were
incorporated into the websites. Test takers whose DSIS rating for a reference video was
lower than that for the corresponding degraded video were automatically rejected.
Additionally, test takers who failed to enter their first name and accept the terms and
conditions of the test were asked to go back and do so before their scores could be accepted.
To identify test takers who did not watch all the videos before scoring, we incorporated a
hidden timer on all testing websites. The timer calculated the entire test time by subtracting
the start time from the finish time. Test scores of participants whose test time fell below
the expected 10 to 15 min were automatically rejected. In total, 9 of the 63 test takers were
rejected because of poor scoring or a lack of credible ratings.
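The screening rules above can be summarized in a short sketch; the record fields and the 10-minute threshold are illustrative reconstructions of the website's checks, not its actual code.

```python
# Hedged sketch of the outlier screening: hidden-reference and timer checks.

def is_valid(result, min_minutes=10):
    """Reject raters who scored the hidden reference below its impaired
    version, or whose hidden-timer duration is too short."""
    if result["reference_rating"] < result["impaired_rating"]:
        return False  # failed hidden-reference check
    if result["finish_min"] - result["start_min"] < min_minutes:
        return False  # too quick to have watched every clip
    return True

results = [  # illustrative rater records
    {"reference_rating": 5, "impaired_rating": 3, "start_min": 0, "finish_min": 12},
    {"reference_rating": 2, "impaired_rating": 4, "start_min": 0, "finish_min": 13},
]
print([is_valid(r) for r in results])  # [True, False]
```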
G. Test results
To test the quality of our subjective data, we computed the distribution of the 95 %
confidence intervals (CI) for all DSIS ratings from valid test takers. The CI of the mean
for Vidyo1 and ParkScene is shown in Fig. 12. The average size of the confidence intervals
is 0.130 on a MOS scale of 1-5 across all video sequences, which indicates good agreement
between test takers.
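A minimal sketch of the confidence interval computation for one test condition is shown below, using a t-based 95 % interval; the ratings vector is illustrative.

```python
# Minimal sketch: 95% confidence interval of the mean DSIS rating (MOS).
import numpy as np
from scipy import stats

ratings = np.array([5, 4, 5, 4, 4, 5, 3, 4, 5, 4])      # one clip, valid raters
mean = ratings.mean()
sem = stats.sem(ratings)                                  # standard error of mean
half_width = sem * stats.t.ppf(0.975, len(ratings) - 1)   # t-based 95% CI
print(f"MOS = {mean:.2f} +/- {half_width:.2f}")
```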
The impact of encoder settings on video quality as perceived by users is presented in
Fig. 13, where the QP is plotted against the mean MOS of the subjective quality ratings for all
Fig. 12 95 % confidence interval of the MOS (plotted against QP from 19.25 to 49.25) for (a) Vidyo1 and (b) ParkScene video sequences
six video contents. Figure 13 clearly shows that increased compression (higher QP) leads to
highly visible quality degradation for all sequences. Furthermore, the results indicate that,
because of differences in motion characteristics, video sequences are impacted differently
under the same encoder settings. For example, Johnny and Vidyo1 had average MOS values of 4.9
and 4.8, compared to MOS scores of 4.2, 4.6, 4.7 and 4.7 for BasketballDrive, BQterrace,
Kimono and ParkScene respectively, under the same QP (a QP of 19.25). However, all
sequences show a high level of acceptability as the video compression decreases. The
threshold for acceptable MOS for encoded video was determined by calculating the mean
MOS score over the six video sequences (a mean of 3.7 MOS). It is also evident from the
results that perceived video quality depends on encoder settings and content type, as shown
by the ANOVA in Section 5 above. Therefore, one has to be mindful when selecting and
encoding video, as the quality of the encoded video increases with decreased compression.
H. PSNR to MOS mapping
In this section, we used the subjective quality ratings (MOS) and the Full Reference (FR)
test ratings (PSNR) to create a PSNR to MOS mapping metric. This metric enables us to
evaluate the accuracy and performance of the reference-free model proposed in Section 5(a)
above.
To derive a mapping metric, the full reference PSNR values from all six contents were
mapped to their corresponding MOS, as shown in Fig. 14, where a scatter plot of FR PSNR
versus MOS for all six sequences is presented.
In contrast to the mapping of MSE to MOS presented in [2], we chose to adopt the
polynomial function proposed by VQEG [36] and by the authors in [27] to map PSNR to MOS.
This approach is simple and gives a good fit to our data, with an overall Pearson correlation
of 0.94 when the actual MOS from DSIS was compared to the predicted MOS (MOSp) obtained from
the actual PSNR values. Additionally, because a perceptual quality metric cannot be linear,
our proposed metric also reflects the non-linearity of human vision [21], as shown in Eq. 6.
It can be observed from Fig. 14 that all video sequences show non-linearity in the scatter
plot and saturate between 4.2 and 4.8 MOS (predicted). Because of differences in content
type, sequences with lower MA saturate at a higher MOS than those with higher MA. For
example, BasketballDrive, BQterrace, Kimono and ParkScene saturate between 4 and 4.5
MOS, while Johnny and Vidyo1 saturate at 4.8 and 4.7 MOS respectively. The accuracy of the
mapping is measured by comparing the actual MOS (from DSIS ratings) to the predicted MOS
(from the mapping) in terms of RMSE and R², as presented in Table 5.
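A hedged sketch of the polynomial PSNR-to-MOS fitting approach described above is given below; the sample points and the cubic order are illustrative assumptions, not the actual coefficients of Eq. 6.

```python
# Hedged sketch: fit a low-order polynomial mapping FR PSNR to MOS,
# in the spirit of the VQEG-style fittings cited above.
import numpy as np

psnr = np.array([30.5, 33.0, 35.4, 37.8, 40.1, 42.3, 44.0])  # illustrative
mos = np.array([1.6, 2.3, 3.0, 3.7, 4.2, 4.6, 4.8])          # illustrative

coeffs = np.polyfit(psnr, mos, deg=3)             # cubic PSNR -> MOS mapping
mos_p = np.clip(np.polyval(coeffs, psnr), 1, 5)   # predictions clipped to scale
print(np.round(mos_p, 2))
```

The clipping step reflects the saturation behaviour discussed above: a bounded rating scale forces the mapping to flatten at the high-quality end.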
I. Mapping metric comparison
To further determine the accuracy of our PSNR to MOS mapping metric, in this
subsection we use the actual PSNR to compare our proposed mapping metric to the popular
Evalvid PSNR to MOS conversion metric [18] and to the mapping metrics proposed by the
authors in [6]. The result of this comparison is shown in Table 6.
1) Comparison of predicted PSNR with subjective video quality ratings
Recently, novel comparison approaches have been proposed to evaluate the
performance of objective video prediction models. This approach adds a human dimension to
objective quality metrics. For example, the work presented in [41] studied the
performance of the NTIA General Model [24] on HDTV video clips; the degree of
accuracy was measured by comparing objective results with the results of Single Stimulus
Continuous Quality Evaluation (SSCQE) subjective quality ratings. Using a similar
approach, we evaluate the performance and accuracy of the content-based Video Quality
Prediction (CVQP) model by comparing the predicted PSNR (converted to MOSp using Eq. 6
above) with the MOS obtained from the DSIS subjective ratings. The results of this
comparison indicate a high correlation between the predicted MOSp and the actual MOS
as perceived by end users, as shown in Fig. 15. Additionally, the amount of motion in a
video clip also influences video quality, as all the video clips in our experiment show
different quality levels under the same encoding settings and evaluation.
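A minimal sketch of this comparison is shown below, computing the Pearson correlation between the MOS predicted from PSNRp and the actual DSIS MOS; both vectors are illustrative.

```python
# Minimal sketch: Pearson correlation between predicted and actual MOS.
from scipy.stats import pearsonr

mos_actual = [4.8, 4.3, 3.6, 2.9, 2.1, 1.5]       # illustrative DSIS MOS
mos_from_psnrp = [4.6, 4.2, 3.8, 3.0, 2.2, 1.3]   # illustrative MOSp

r, p = pearsonr(mos_actual, mos_from_psnrp)
print(f"Pearson r = {r:.3f} (p = {p:.4f})")
```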
Fig. 15 Performance comparison between Predicted PSNR (MOS) and actual MOS
Figure 15 shows that the MOS from the predicted PSNR (PSNRp) for all six video
sequences has a good correlation with the actual MOS. However, the MOS from PSNRp
shows a higher degree of linearity than its subjective MOS counterpart. In general,
PSNRp underestimates the MOS score when the QP is high. This reflects the limitations
of PSNR-based metrics, which do not take into account human perception of quality,
and points to future work to develop a more accurate QoE model based on subjective
results.
7 Conclusion
From the results, we observed that it is possible to design a video quality prediction model
that does not take the amount of motion in a video into consideration. However, the results
show that different videos encoded with the same encoding settings have different PSNR and
MOS values, demonstrating the need to define video content type and use it as a parameter in
video quality prediction. It was also observed that, in objective video quality measurement,
the encoding parameters set the initial video quality and the content type determines how the
initial video quality is degraded during encoding. We also observed that the degradation of
the initial video quality is significantly impacted by motion characteristics. This is
important because different videos have different amounts of motion, which may impact video
quality model design and implementation.
In this work, we defined video content type by extracting motion features (MV), video
frame bits and QP information from the encoded bitstream. This extracted information was
used to calculate the motion activities and the complexity of a video sequence, and we
expressed the content type in terms of the motion amount (MA). Using the calculated MA, we
were able to design a motion-based video quality prediction model that enables us to predict
encoded video quality when the amount of motion activity/complexity and the encoding
parameters are known. This model had an overall correlation coefficient of 96 % with the
training sequences and 92 % accuracy with the testing video sequences when compared with FR
PSNR. The proposed model will enable the design of quality-adaptive video applications that
take into account the initial video encoding parameters and the amount of motion to deliver
the maximum possible video quality.
From the subjective results, we observed that visible quality is heavily impacted by the QP.
Participants in our subjective experiment gave lower grades to videos encoded with a QP above
37 than to those with a lower QP (17-32).
By mapping the predicted PSNR to MOS scores using our PSNR to MOS mapping metric, we
were able to determine the closeness of our predicted PSNR to MOS. Results from this
mapping indicate that our predicted PSNR correlates well with the actual MOS. However, the
predicted PSNR shows more linearity than the actual MOS, because PSNR does not take into
account the human perception of quality. Our work is important because it provides the basis
on which the motion characteristics of a video should be extracted and used (to define video
content type) to measure quality, and also some understanding of how the new HEVC codec can
be used for the effective provision of videos with acceptable quality. Although there are
computational issues that need to be resolved before HEVC is universally adopted, we believe
that this is not far off given the evolution of technology, and providing a basis for
understanding the quality issues associated with encoding and the amount of motion in HEVC
is paramount, as it has not been done before.
Future work will focus on identifying the non-technical parameters such as context,
environment and user type that may impact HEVC video quality.
References
1. Argyropoulos S, Raake A, Garcia M, List P (2011) No-reference video quality assessment for SD and HD
H.264/AVC sequences based on continuous estimates of packet loss visibility. Third Int Work Qual
Multimed Experience, pp. 31–36
2. Bhat A, Richardson I, Kannangara S (2009) A novel perceptual quality metric for video compression. IEEE
Picture Coding Symp PCS, pp. 1–4
3. Bossen F (2010) Common test conditions and software reference configurations. JCT-VC Doc. JCTVC-
G1200
4. Boujut H, Benois-Pineau J, Ahmed T, Bonnet P, Sheva B, Armstrong N (2011) A metric for no-reference
video quality assessment for HD TV delivery based on saliency maps. IEEE Int Conf Multimed Expo
(ICME), pp. 1–5
5. Choi H, Nam J, Sim D, Bajić IV (2011) Scalable video coding based on high efficiency video coding
(HEVC). IEEE Pac Rim Conf Commun Comput Sig Process, pp. 346–351
6. Dymarski P, Kula S, Huy TN (2011) QoS conditions for VoIP and VoD. J Telecommun Inf Technol:29–37
7. Hands DS (2004) A basic multimedia quality model. IEEE Trans Multimed 6(6):806–816
8. Hiramatsu K, Nakao S, Hoshino M, Imamura D (2010) Technology evolutions in LTE/LTE-advanced and its
applications. IEEE Int Conf Commun Syst, 161–165
9. Hu J, Wildfeuer H (2009) Use of content complexity factors in video over IP quality monitoring. Int Work
Qual Multimed Exp, 216–221
10. ITU-T Recomm. BT.500-13, Methodology for the subjective assessment of the quality of television pictures
11. ITU-T Recomm. P.910, Subjective video quality assessment methods for multimedia applications
12. ITU-T SG16 WP3/ISO/IEC JTC1/SC29/WG11 JCTVC-A124 (2010) Samsung's response to the call for
proposals on video compression technology. Dresden
13. Kanwisher N, Wojciulik E (2000) Visual attention: insights from brain imaging. Nat Rev Neurosci 1:1–10
14. Khan A, Sun L, Ifeachor E (2009) Content clustering based video quality prediction model for MPEG4 video
streaming over wireless networks. IEEE Int Conf Commun ICC, pp. 1–5
15. Khan A, Sun L, Ifeachor E (2009) Content-based video quality prediction for MPEG4 video streaming over
wireless networks. J Multimed 4:228–239
16. Khan A, Sun L, Ifeachor E (2012) QoE prediction model and its application in video quality adaptation over
UMTS networks. IEEE Trans Multimed 14(2):431–442
17. Kim J-O, Kohout FJ (1975) Analysis of variance and covariance: subprograms ANOVA and ONEWAY. Stat
Packag Soc Sci 2:398–433
18. Klaue J, Rathke B, Wolisz A (2003) EvalVid - a framework for video transmission and quality evaluation.
Model Tech Tools Comput Perform Eval, 255–272
19. Koumaras H, Lin C-H, Shieh C-K, Kourtis A (2010) A framework for end-to-end video quality prediction of
MPEG video. J Vis Commun Image Represent 21(2):139–154
20. Lee B, Kim M (2013) No-reference PSNR estimation for HEVC encoded video. IEEE Trans Broadcast
59(1):20–27
21. Ong E, Lin W, Lu Z, Yao S, Yang X, Moschetti F (2003) Low bit rate quality assessment based on perceptual
characteristics. IEEE Int Conf Image Process ICIP 1:3–5
22. Ou Y, Ma Z, Wang Y (2008) A novel quality metric for compressed video considering both frame rate and
quantization artifacts. Int Work Image Process Qual Metrics Consum
23. Pinson MH, Wolf S (2003) Comparing subjective video quality testing methodologies. SPIE Proc 5150(3):
573–582
24. Pinson MH, Wolf S (2004) A new standardized method for objectively measuring video quality. IEEE Trans
Broadcast 50(3):312–322
25. Pourazad MT, Doutre C, Azimi M, Nasiopoulos P (2012) HEVC: the new gold standard for video
compression: how does HEVC compare with H.264/AVC? IEEE Consum Electron Mag
26. Reibman AR, Vaishampayan VA, Sermadevi Y (2004) Quality monitoring of video over a packet network.
IEEE Trans Multimed 6(2):327–334
27. Ries M, Nemethova O, Badic B, Rupp M (2004) Assessment of H. 264 coded panorama sequences. First Int
Conf Multimed Serv Access Netw, pp. 12–15
28. Ries M, Nemethova O, Rupp M (2007) Motion based reference-free quality estimation for H.264/AVC video
streaming. 2nd Int Symp Wirel Pervasive Comput
29. Ries M, Nemethova O, Rupp M (2008) Video quality estimation for mobile H. 264/AVC video streaming. J
Commun 3(1):41–50
30. Rosdiana E, Ghanbari M (2000) Picture complexity based rate allocation algorithm for transcoded video over
ABR networks. Electron Lett 36(6):521–522
31. Seshadrinathan K, Bovik AC (2009) Motion-based perceptual quality assessment of video. IS&T/SPIE
Electron Imaging, Int Soc Opt Photonics
32. Seshadrinathan K, Bovik AC (2010) Motion tuned spatio-temporal quality assessment of natural videos.
IEEE Trans Image Process 19(2):335–350
33. Snedecor GW, Cochran WG (1989) Statistical methods, 8th ed. Ames: Iowa State Univ Press, p. 503.
34. Sullivan GJ, Ohm J-R, Han W-J, Wiegand T (2012) Overview of the High Efficiency Video Coding
(HEVC) standard. IEEE Trans Circuits Syst Video Technol 22(12):1649–1668
35. Takahashi A, Hands D, Barriac V (2008) Standardization activities in the ITU for a QoE assessment of IPTV.
IEEE Commun Mag 46(2):78–84
36. Takahashi A, Schmidmer C, Lee C, Speranza F, Okamoto J (2010) VQEG Report on the validation of video
quality models for high definition video content
37. University of Plymouth - SPMC Subjective Video Test. [Online]. Available: http://www.tech.plymouth.ac.
uk/spmc/staff/laanegekuh/subjective1/. [Accessed: 03-Aug-2013]
38. Van Wallendael G, Staelens N, Janowski L (2012) No-reference bitstream-based impairment detection for
high efficiency video coding. Fouth Int Work Qual Multimed Exp, 7–12
39. Verscheure O, Frossard P, Hamdi M (1998) MPEG-2 video services over packet networks: joint effect of
encoding rate and data loss on user-oriented QoS. 8th Int Work Netw Oper Syst Support Digit Audio Video
(NOSSDAV 98), pp. 257–264
40. Welling M (2004) Support vector regression. Dep Comput Sci Univ. Toronto
41. Wolf S, Pinson M (2007) Application of the NTIA general video quality metric (VQM) to HDTV quality
monitoring. Proc Third Int Work Video Process Qual Metrics Consum Electron (VPQM), pp. 4–8
42. Zhai G, Cai J, Lin W (2008) Cross-dimensional perceptual quality assessment for low bitrate
videos. IEEE Trans Multimed 10(7):1316–1324
43. Zhai J, Yu K, Li J, Li S (2005) A low complexity motion compensated frame interpolation method. IEEE Int
Symp Circuits Syst ISCAS, 4927–4930
Louis Anegekuh is a certified computer network engineer. He received the M.Sc. degree in computing from
the University of Wales, Cardiff, UK, in 2010. He is currently a PhD student at Plymouth University, UK. His
research interests include video quality of service over IP networks, perceptual modeling, and the analysis of
content and context-based factors in perceptual Quality of Experience (QoE). He has held many IT engineering
and lecturing positions, the latest being an associate computer network lecturer at Plymouth University.
Lingfen Sun received the B. Eng. degree in telecommunication engineering and the M.Sc. degree in commu-
nication and electronics system from the Institute of Communication Engineering, Nanjing, China, in 1985 and
1988, respectively, and the Ph.D. in computing and communications from the University of Plymouth, Plymouth,
U.K., in 2004. She is currently an Associate Professor (Reader) in Multimedia Communications and Networks in
the School of Computing and Mathematics, University of Plymouth. She has been involved in several European
and industry-funded projects related to multimedia QoE. She has published about 70 peer-refereed technical
papers since 2000, filed 1 patent, and published one book. Her current research interests include multimedia
(voice/video/audiovisual) quality assessment, QoS/QoE management/control, VoIP, and network performance
characterization. Dr. Sun was the Chair of the QoE Interest Group of IEEE MMTC during 2010–2012, Publicity
Co-Chair of IEEE ICME 2011, and Post & Demo Co-Chair of IEEE Globecom 2010.
Emmanuel Ifeachor received the M.Sc. degree in communication engineering from Imperial College, London,
U.K., and the Ph.D. degree in medical electronics from the University of Plymouth, Plymouth, U.K. He is a
Professor of Intelligent Electronic Systems and Head of Signal Processing and Multimedia Communications
research at the University of Plymouth. His primary research interests are in information processing and
computational intelligence techniques and their application to problems in communications and biomedicine.
His current research includes user-perceived QoS and QoE prediction and control for real-time multimedia
services, biosignals analysis for personalized healthcare, and ICT for health. He has published extensively in
these areas.