POLQA 2015 (V2.4) Investigated: - Technical White Paper
October 2014
Contents
Summary
Objectives for an Advanced POLQA
Length Dependency in NB Mode
Analysis of the Transparency Behavior
Other Changes in POLQA V2.4
Analysis of POLQA 2015 (V2.4)
POLQA Results for Reference Conditions – Backward Compatibility
General Prediction Performance
Comparison of EVRC and AMR codec Measurements
Analysis of the Sampling Error Detection
Shift Performance
Processing Requirements
POLQA and VoLTE
Availability of POLQA V2.4
Conclusions
Acknowledgments
Literature
Summary

POLQA, the third generation perceptual voice quality test method standardized as P.863 in 2011, has been widely adopted as the state-of-the-art MOS benchmarking technology for mobile networks. Since its first release (V1.1) in 2011, considerable field experience has become available and, along with the technical development in mobile communication (VoLTE, 5G), some areas for improvement were identified.

In close collaboration with ITU-T Study Group 12 (SG12), the members of the POLQA Coalition (OPTICOM, SwissQual and TNO) proposed an evolved version, which was approved as Rec. P.863 Edition 2.4 in September 2014. It is expected that V2.4 will supersede the earlier released V1.1, with products implementing it from 2015 on.

This white paper outlines the objectives for the update, compares them to the achieved improvements and highlights the most important changes for users. It will be demonstrated that this update marks a significant step towards even higher measurement accuracy and a broader range of applications for POLQA, while maximum backward compatibility of the measured scores is maintained. In addition, an investigation of the applicability of POLQA on live VoLTE networks is presented.

Objectives for an Advanced POLQA

POLQA V1.1 was found to have three minor issues, which had very little effect in most real measurement scenarios, but which nevertheless may have caused problems if users were unaware of standard best practices for the use of POLQA. These issues were:
- Length dependency of the results in narrowband (NB) mode if the signal duration exceeded approximately 10 s
- Transparency problems: Too often a comparison of the reference signal with itself resulted in MOS scores below the optimum (known as “the transparency problem”)
- Shift performance: Small shifts of the starting point of the degraded signal may have had unexpectedly large impacts on the measured MOS

Due to time constraints it was not feasible to tackle all three matters for this update. Where the shift performance is critical, a pragmatic solution already exists in the High Accuracy mode (HA-Mode); this matter was therefore deferred for possible future consideration.

The main focus of this update was therefore solving the length dependency of the MOS as well as the transparency problem. Along with this, the general behaviour should not become worse, and scores predicted by V2.4 should remain as close as possible to those measured with V1.1.

The update also provides several other improvements due to general bug fixes and optimization, as described herein.

Length Dependency in NB Mode

One problem of V1.1 was that narrowband scores were dependent on the signal length. It was observed that files longer than approximately 10s were consistently scored lower with increasing file length. This problem is completely solved with POLQA V2.4. For illustration, the speech samples of P.501 were used in the following example. The four Italian files which are included in that recommendation have durations of more than 15s, while all other files are mostly around 8s long. All samples were processed by a range of standard coding conditions in order to create degraded versions of the reference files. MOS scores were produced by applying either POLQA V1.1 or POLQA V2.4 to these files.

The following two diagrams show the resulting average scores for the Italian samples (solid line) and the remaining 28 speech samples (dotted line). As can be seen in Figure 1, P.863 V1.1 scores the long files significantly lower than the short files, while for P.863 V2.4 no such systematic bias against the longer files is visible anymore (Figure 2), indicating that the length dependency problem is resolved for POLQA V2.4.
Figure 1: P.863 V1.1 results in NB mode for long speech files compared to 8s regular speech files
Figure 2: P.863 V2.4 results in NB mode for long speech files compared to 8s regular speech files
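The comparison behind Figures 1 and 2 can be sketched as follows: average the MOS per coding condition separately for long and short files and inspect the difference. This is only a minimal sketch. The record layout, the function name `length_bias`, the 10 s split point (taken from the "approximately 10s" mentioned above) and the sample values are illustrative assumptions, not data or tooling from this paper.

```python
from collections import defaultdict
from statistics import mean

# Assumption: split point derived from the "approximately 10s" stated in the text.
LONG_FILE_THRESHOLD_S = 10.0


def length_bias(records, threshold_s=LONG_FILE_THRESHOLD_S):
    """Return {condition: mean MOS of long files - mean MOS of short files}.

    records: iterable of (condition, duration_s, mos) tuples (hypothetical layout).
    Conditions that lack either long or short files are skipped.
    """
    long_scores = defaultdict(list)
    short_scores = defaultdict(list)
    for condition, duration_s, mos in records:
        bucket = long_scores if duration_s > threshold_s else short_scores
        bucket[condition].append(mos)

    bias = {}
    for condition, scores in long_scores.items():
        if condition in short_scores:
            bias[condition] = mean(scores) - mean(short_scores[condition])
    return bias


# Made-up values for illustration only.
example = [
    ("AMR 12.2kbps", 8.0, 4.0),
    ("AMR 12.2kbps", 8.2, 4.2),
    ("AMR 12.2kbps", 16.0, 3.6),
    ("G.711", 8.0, 4.4),
]
print(length_bias(example))
```

A clearly negative value for most conditions would correspond to the behaviour described for V1.1, values close to zero to the behaviour described for V2.4.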
Analysis of the Transparency Behavior

A general feature of POLQA is that reference signals which are not optimal are idealized before they are compared to the degraded signal. The reasoning behind this is that in an ACR test, subjects will attribute all audible distortions to the degraded signal, even if they were already part of the reference signal. A consequence of this is that a sub-optimal reference signal compared to itself will be scored <4.75 (or <4.5 in NB mode). Such reference signals are generally described as non-transparent. While this is the desired behaviour of POLQA, V1.1 was apparently too sensitive in this aspect and consequently too many samples were considered as non-transparent. V2.4 addresses this topic and a revision of the internal reference signal handling reduced the effect without any negative impact on the prediction performance. The following table (Table 1) illustrates the increased number of samples considered as transparent in POLQA V2.4. For public repeatability reasons, this analysis was made on the 32 samples of P.501. All samples were filtered to NB or SWB (using the G.191 SWB band-pass filter) and levels adjusted prior to use.

              NB    SWB
P.863 V1.1    14    10
P.863 V2.4    19    26

Table 1: Transparent P.501 samples in V1.1 and V2.4

As reported earlier [C0085], not all P.501 speech samples are consistent with the requirements of P.863 and P.863.1. Mainly, the signal length, the duration of leading silence and especially the noise floor are seen to violate these requirements. As a consequence, for an additional analysis all 32 speech samples were manually cleaned and edited correctly according to the following specification:
- Leading silence duration ~0.5s
- Silence between sentences ~1s
- File length 8s
- Speech pauses muted and file interlaced with -85dB white SWB noise

As Table 2 shows, the strict requirements on the reference samples lead to a further, significant improvement of the number of transparent samples in NB mode.
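Two items of the specification listed above (leading silence and file length) lend themselves to a simple automated check. The following is a minimal sketch under stated assumptions: the function names, the amplitude threshold used to detect the speech onset and the 0.1 s tolerance are hypothetical and not taken from P.863, P.863.1 or this paper. The pause between sentences and the -85dB noise floor are not covered.

```python
def leading_silence_s(samples, sample_rate_hz, silence_threshold):
    """Time in seconds before the first sample whose magnitude exceeds the threshold."""
    for index, value in enumerate(samples):
        if abs(value) > silence_threshold:
            return index / sample_rate_hz
    return len(samples) / sample_rate_hz  # the whole signal counts as silence


def check_reference_sample(samples, sample_rate_hz, silence_threshold=100,
                           target_leading_s=0.5, target_length_s=8.0,
                           tolerance_s=0.1):
    """Check leading silence (~0.5s) and file length (8s) of a reference sample.

    silence_threshold and tolerance_s are illustrative assumptions.
    """
    duration_s = len(samples) / sample_rate_hz
    lead_s = leading_silence_s(samples, sample_rate_hz, silence_threshold)
    return {
        "leading_silence_ok": abs(lead_s - target_leading_s) <= tolerance_s,
        "length_ok": abs(duration_s - target_length_s) <= tolerance_s,
    }


# Synthetic 8 kHz signal: 0.5 s of digital silence followed by 7.5 s of constant level.
synthetic = [0] * 4000 + [1000] * 60000
print(check_reference_sample(synthetic, 8000))
```

In practice the samples would be read from the edited P.501 files; the synthetic signal above only demonstrates the call.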
Figure 3: POLQA V2.4 results for P.501 original and cleaned samples. (MOS per coding condition; series: P.863 v2.4 (aligned, -85dB floor) and P.863 v2.4 (P.501 original).)
Note that the use of the edited and cleaned P.501 samples also leads to a very small but visible increase of the predicted MOS scores (Figure 3).

              NB    SWB
P.863 V1.1    20    12
P.863 V2.4    27    26

Table 2: Transparent P.501 samples in V1.1 and V2.4 after cleaning P.501 samples
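The counts in Tables 1 and 2 amount to scoring each reference sample against itself and counting how many reach the transparency limit. A minimal sketch, assuming the self-comparison scores are already available as plain numbers: the function name `count_transparent` and the example scores are hypothetical, and treating a score at or above 4.75 (SWB) or 4.5 (NB) as transparent is an interpretation of the "<4.75 (or <4.5 in NB mode)" criterion given earlier.

```python
# Limits from the text: a non-transparent reference scores below these values.
TRANSPARENCY_LIMIT = {"NB": 4.5, "SWB": 4.75}


def count_transparent(self_scores, mode):
    """Count reference samples whose self-comparison MOS reaches the limit for the mode."""
    limit = TRANSPARENCY_LIMIT[mode]
    return sum(1 for mos in self_scores if mos >= limit)


# Made-up self-comparison scores for illustration only.
swb_scores = [4.75, 4.6, 4.75, 4.2]
nb_scores = [4.5, 4.49]
print(count_transparent(swb_scores, "SWB"))  # 2
print(count_transparent(nb_scores, "NB"))    # 1
```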
Figure 4: POLQA V2.4 results in SWB mode, average over 32 speech samples (P.501) compared to V1.1 results. (MOS per coding condition; series: P.863 v2.4 and P.863 v1.1.)
Analysis of POLQA 2015 (V2.4)

POLQA Results for Reference Conditions – Backward Compatibility

(Chart: P.863 Narrowband, 28 P.501 Ref Samples (w/o Italian); vertical axis: MOS.)

Figure 6: Comparison of the shift performance (horizontal axis: MOSshifted - MOScenter; vertical axis: Frequency)

POLQA and VoLTE

Currently, VoLTE networks are not yet widely deployed and very little field data exist which can be used for POLQA measurements. Consequently, even the updated V2.4 of P.863 correctly includes a remark that the use of POLQA with VoLTE networks must be further studied.

As a first step [C229] presented some POLQA measurement results for a limited set of field collected data (around 2000 files) which indicate the safe use of POLQA for VoLTE:
In a first analysis (Figure 7) the variability of the delay vs. time is investigated and compared for VoLTE (upper chart) and 3G (middle chart), using a typical signal as it is applied for drive testing (lower chart). It is obvious that the VoLTE case exhibits not only far more delay variations than the 3G case, but also larger delay steps. This is further analysed in Figure 8, which is a histogram of the distribution of the delay steps. As can be seen, in the 3G case delay steps typically do not exceed 20 ms, which corresponds to the frame size used in 3G networks. Usually these delay changes occur only during handovers between cells. Please note that in this chart the 0 to 20 ms range includes many cases where no delay variation is present at all. For the VoLTE case, in contrast, the majority of the delay steps (50%) are between 20 and 40 ms and range up to 80 ms. It is assumed that these variations coincide with jitter buffer adaptations and variations of the playout speed in the VoIP-like architecture of VoLTE.

It is now interesting to see how POLQA scores the VoLTE conditions compared to the 3G cases. A histogram of the MOS values is presented in Figure 9. For the 3G case, the result is clear: in roughly 30% of the cases a clean channel is found and the maximum score possible for the used codec is achieved, which is in the range of 4.0 to 4.2 MOS. With increasing amounts of transmission errors, the quality degrades rapidly. For VoLTE the situation is slightly different. Here the peak in the histogram is at a slightly lower MOS of 3.8 to 4.0 and only 20% of the cases reach the maximum quality. This can be well explained by the fact that 50% of the cases show potentially audible delay variations of 20 to 40 ms.

Figure 8: Histogram of delay variations in live VoLTE and 3G networks, for field collected data. (AMR-WB, occurrence of time warping; horizontal axis: delay variation from 0 - 20ms to >80ms; vertical axis: Probability.)
Figure 9: Histogram of MOS-LQO for live VoLTE and 3G networks. (AMR-WB; horizontal axis: MOS from 1.0 to 5.0 in steps of 0.2.)

The idea behind [C229] was based on the assumption that the main difference between cases for which POLQA is known to behave well (e.g. 3G networks) and the still little known behaviour on VoLTE is related to the delay variability. If it can be shown that POLQA works well for these delay variations, then this can be seen as a clear indication for the applicability of POLQA in VoLTE networks. The subsequent analysis therefore focuses on only those samples where delay variations actually occur. From subjective experience it is expected that the delay variation in VoLTE has some, but limited, effect on the resulting POLQA score since, in contrast to 3G networks, the system tries to conceal the audible effect.

The outcome of this analysis can be seen in Figure 10, where the MOS for different amounts of delay variation is presented. The blue bars indicate the VoLTE case, where it can be seen that the effect of the delay variability is small, but clearly increasing as the amount of delay variation increases. For the 3G case (orange bars), the effect is much stronger since the changes typically happen in a rather uncontrolled manner. For delay variations of more than 40 ms the amount of data for 3G networks is very small and does not allow drawing conclusions. It is assumed that these few cases are typically more related to other problems such as bad coverage and transmission errors. It can thus be concluded that, at least for the investigated cases, POLQA scores VoLTE as expected and no problems due to the increased delay variability, which is the main difference to 3G networks with regard to

(Figure 10 chart: 3G / VoLTE - live networks, AMR-WB, average MOS if time warping; horizontal axis: delay variation from 20 - 40ms to >80ms; vertical axis: MOS.)
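The two evaluations described above, the distribution of delay steps (Figure 8) and the average MOS per amount of delay variation (Figure 10), can be sketched as simple binning operations. This is a minimal sketch and not the tooling used in [C229]. The function names, the input layout (one delay value and one MOS per file) and the assignment of boundary values to bins are assumptions; only the 20 ms bin width and the bin labels are taken from the chart axes.

```python
from collections import Counter, defaultdict
from statistics import mean

# Bin labels as they appear on the chart axes.
BIN_LABELS = ["0 - 20ms", "20 - 40ms", "40 - 60ms", "60 - 80ms", ">80ms"]


def delay_bin(delay_step_ms):
    """Map a delay step to a 20 ms wide bin.

    Assumption: bins are half-open, so exactly 20 ms falls into "20 - 40ms"
    and exactly 80 ms is counted in the last bin.
    """
    index = min(int(abs(delay_step_ms) // 20), len(BIN_LABELS) - 1)
    return BIN_LABELS[index]


def delay_step_distribution(delay_steps_ms):
    """Relative frequency of delay steps per bin (empty dict for no data)."""
    counts = Counter(delay_bin(step) for step in delay_steps_ms)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in BIN_LABELS}


def mean_mos_per_bin(measurements):
    """Average MOS per delay bin; measurements is an iterable of (delay_ms, mos)."""
    grouped = defaultdict(list)
    for delay_ms, mos in measurements:
        grouped[delay_bin(delay_ms)].append(mos)
    return {label: mean(grouped[label]) for label in BIN_LABELS if label in grouped}


# Made-up values for illustration only.
print(delay_step_distribution([0, 10, 25, 30, 45, 90]))
print(mean_mos_per_bin([(25, 3.8), (30, 4.0), (90, 2.0)]))
```

Running the same functions separately on the VoLTE and the 3G files would give the two series that are compared in each chart.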
Acknowledgments
The POLQA Coalition would like to thank the following partners who have supported the development of POLQA
V2.4 by extensive Beta testing and useful comments: ASCOM, Dolby, Head Acoustics, Malden and Orange
Literature
[P863V1] Recommendation ITU-T P.863 (2011), Perceptual objective listening quality assessment
[C0085] ITU-T SG12 C229 (2014), Reference Speech Samples for POLQA, Selection Method and Available Samples,
OPTICOM GmbH, Rohde & Schwarz, TNO
[C229] ITU-T SG12 C229 (2014), P.863 under live VoLTE conditions, Rohde & Schwarz
For inquiries on POLQA Licensing please contact OPTICOM GmbH or visit www.polqa.info for further details.
For an updated reference list of available POLQA products and solutions please refer to our website:
www.polqa.info
Copyright and Trademark Information
© 2011 The POLQA Coalition: OPTICOM GmbH, Erlangen, Germany; SwissQual AG, Solothurn, Switzerland; TNO
Telecom, Delft, The Netherlands.
POLQA®, PESQ® and the OPTICOM logo are registered trademarks of OPTICOM GmbH; All other brand
and product names are trademarks and/or registered trademarks of their respective owners.
This information may be subject to change. All rights reserved.