
Hindawi Publishing Corporation

EURASIP Journal on Advances in Signal Processing


Volume 2010, Article ID 451695, 28 pages
doi:10.1155/2010/451695

Research Article
Audio Signal Processing Using Time-Frequency Approaches:
Coding, Classification, Fingerprinting, and Watermarking

K. Umapathy, B. Ghoraani, and S. Krishnan


Department of Electrical and Computer Engineering, Ryerson University, 350 Victoria Street, Toronto, ON, Canada M5B 2K3

Correspondence should be addressed to S. Krishnan, krishnan@ee.ryerson.ca

Received 24 February 2010; Accepted 14 May 2010

Academic Editor: Srdjan Stankovic

Copyright © 2010 K. Umapathy et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Audio signals are information-rich nonstationary signals that play an important role in our day-to-day communication, perception of environment, and entertainment. Due to their nonstationary nature, time-only or frequency-only approaches are inadequate for analyzing these signals. A joint time-frequency (TF) approach would be a better choice to efficiently process these signals. In this digital era, compression, intelligent indexing for content-based retrieval, classification, and protection of digital audio content are a few of the areas that encapsulate a majority of audio signal processing applications. In this paper, we present a comprehensive array of TF methodologies that successfully address applications in all of the above-mentioned areas. A TF-based audio coding scheme with a novel psychoacoustics model, music classification, audio classification of environmental sounds, audio fingerprinting, and audio watermarking will be presented to demonstrate the advantages of using time-frequency approaches in analyzing and extracting information from audio signals.

1. Introduction

A normal human can hear sound vibrations in the range of 20 Hz to 20 kHz. Signals that create such audible vibrations qualify as audio signals. Creating, modulating, and interpreting audio clues were among the foremost abilities that differentiated humans from the rest of the animal species. Over the years, methodical creation and processing of audio signals resulted in the development of different forms of communication, entertainment, and even biomedical diagnostic tools. With advancements in technology, audio processing was automated and various enhancements were introduced. The current digital era furthered audio processing with the power of computers: complex audio processing tasks were easily implemented and performed at blistering speeds. Digitally converted and formatted audio signals brought in high levels of noise immunity with guaranteed quality of reproduction over time. However, the benefits of the digital audio format came with the penalty of huge data rates and difficulties in protecting copyrighted audio content over the Internet. On the other hand, the ability to use computers brought in great power and flexibility in analyzing and extracting information from audio signals. These contrasting pros and cons of digital audio inspired the development of a variety of audio processing techniques.

In general, a majority of audio processing techniques address the following three application areas: (1) compression, (2) classification, and (3) security. The underlying theme (or motivation) for each of these areas is different and at times contrasting, which poses a major challenge in arriving at a single solution. In spite of bandwidth expansion and better storage solutions, compression still plays an important role, particularly in mobile devices and content delivery over the Internet. While the requirement of compaction (in terms of retaining major audio components) drives the audio coding approaches, audio classification requires the extraction of subtle, accurate, and discriminatory information to group or index a variety of audio signals. It also covers a wide range of subapplications where the accuracy of the extracted audio information plays a vital role in content-based retrieval, sensing the auditory environment for critical applications, and biometrics. Unlike compaction in audio coding or extraction of information in classification, protecting digital audio content requires the addition of information in the form of a security key, which would then prove the ownership of the audio content. The addition of the external message (or key) should be such that it does not cause perceptual distortions and remains robust against attacks that try to remove it. Considering the above requirements, it would be difficult to address all the above application areas with a universal methodology unless we could model the audio signal as accurately as possible in a joint TF plane and then adaptively process the model parameters depending upon the application. In line with the above three application areas, this paper presents and discusses a TF-based audio coding scheme, music classification, audio classification of environmental sounds, audio fingerprinting, and audio watermarking.

The paper is organized as follows. Section 2 is devoted to the theories and algorithms related to TF analysis. Section 3 deals with the use of TF analysis in audio coding and also presents comparisons among some of the audio coding technologies, including adaptive time-frequency transform (ATFT) coding, MPEG-Layer 3 (MP3) coding, and MPEG Advanced Audio Coding (AAC). In Section 4, TF analysis-based music classification and environmental sound classification are covered. Section 5 presents fingerprinting and watermarking of audio signals using TF approaches, and a summary of the paper is provided in Section 6.

2. Time-Frequency Analysis

Signals can be classified into different classes based on their characteristics. One such classification is into deterministic and random signals. Deterministic signals are those which can be represented mathematically; in other words, all information about the signals is known a priori. Random signals take random values and cannot be expressed in a simple mathematical form like deterministic signals; instead, they are represented using their probabilistic statistics. When the statistics of such signals vary over time, they qualify to form another subdivision called nonstationary signals. Nonstationary signals are associated with time-varying spectral content, and most real-world signals (including audio) fall into this category. Due to this time-varying behavior, it is challenging to analyze nonstationary signals.

Early signal processing techniques mainly used time-domain operations such as correlation, convolution, inner product, and signal averaging. While the time-domain operations provided some information about the signal, they were limited in their ability to extract the frequency content of a signal. The introduction of Fourier theory addressed this issue by enabling the analysis of signals in the frequency domain. However, the Fourier technique provides only the global frequency content of a signal and not the time occurrences of those frequencies. Hence, neither time-domain nor frequency-domain analysis is sufficient to analyze signals with time-varying frequency content. To overcome this difficulty and to analyze nonstationary signals effectively, techniques which could give joint time and frequency information were needed. This gave birth to TF transformations.

In general, TF transformations can be classified into two main categories based on (1) signal decomposition approaches and (2) bilinear TF distributions (also known as Cohen's class). In the decomposition-based approach, the signal is approximated into small TF functions derived from translating, modulating, and scaling a basis function having a definite time and frequency localization. Distributions are two-dimensional energy representations with high TF resolution. Depending upon the application at hand and the feature extraction strategies, either the TF decomposition approach or the TF distribution approach could be used.

2.1. Adaptive Time-Frequency Transform (ATFT) Algorithm—Decomposition Approach. The ATFT technique is based on the matching pursuit algorithm with TF dictionaries [1, 2]. ATFT has excellent TF resolution properties (better than wavelets and wavelet packets) and, due to its adaptive nature (handling nonstationarity), there is no need for signal segmentation. Flexible signal representations can be achieved as accurately as possible depending upon the characteristics of the TF dictionary.

In the ATFT algorithm, any signal x(t) is decomposed into a linear combination of TF functions g_{\gamma_n}(t) selected from a redundant dictionary of TF functions [2]. In this context, a redundant dictionary means that the dictionary is overcomplete, that is, a collection of nonorthogonal basis functions much larger than the minimum required to span the given signal space. Using ATFT, we can model any given signal x(t) as

    x(t) = \sum_{n=0}^{\infty} a_n g_{\gamma_n}(t),    (1)

where

    g_{\gamma_n}(t) = \frac{1}{\sqrt{s_n}} \, g\!\left(\frac{t - p_n}{s_n}\right) \exp\!\left(j\left(2\pi f_n t + \phi_n\right)\right)    (2)

and a_n are the expansion coefficients. The choice of the window function g(t) determines the characteristics of the TF dictionary. The dictionary of TF functions can be suitably modified or selected based on the application at hand. The scale factor s_n, also called the octave parameter, controls the width of the window function, and the parameter p_n controls the temporal placement. The parameters f_n and \phi_n are the frequency and phase of the exponential function, respectively. The index \gamma_n represents a particular combination of the TF decomposition parameters (s_n, p_n, f_n, and \phi_n). In the TF decomposition-based works presented later in this paper, a Gabor dictionary (Gaussian functions, i.e., g(t) = \exp(-2\pi t^2) in (2)) was used, which has the best TF localization properties [3]. In the discrete ATFT implementation used in these works, the octave parameter s_n could take any equivalent time-width value between 90 \mu s and 0.4 s; the phase parameter \phi_n could take any value between 0 and 1, scaled to 0 to 180 degrees; the frequency parameter f_n could take one of 8192 levels corresponding to 0 to 22,050 Hz

(i.e., a sampling frequency of 44,100 Hz for wideband audio); and the temporal position parameter p_n could take any value between 1 and the length of the signal.
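To make (2) concrete, the following is a minimal NumPy sketch (our illustration, not the authors' code) of one real, discrete Gabor TF function; the unit-energy normalization is our choice so that inner products in the decomposition below equal the energy captured by each atom.

    import numpy as np

    def gabor_atom(n_samples, scale, position, freq, phase, fs=44100.0):
        # Discrete, real-valued version of (2): a Gaussian window
        # g(t) = exp(-2*pi*t^2), scaled by s_n, centred at p_n, and
        # modulated by a cosine at f_n Hz with phase phi_n.
        t = np.arange(n_samples)
        u = (t - position) / scale
        g = np.exp(-2.0 * np.pi * u ** 2) / np.sqrt(scale)
        atom = g * np.cos(2.0 * np.pi * freq * t / fs + phase)
        return atom / np.linalg.norm(atom)   # unit energy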
The signal x(t) is projected over a redundant dictionary of TF functions with all possible combinations of scaling, translations, and modulations. When x(t) is real and discrete, like the audio signals in the presented technique, we use a dictionary of real and discrete TF functions. The redundant or overcomplete nature of the dictionary gives extreme flexibility in choosing the best fit for the local signal structures (local optimization) [2]. This flexibility enables modeling a signal as accurately as possible with the minimum number of TF functions, providing a compact approximation of the signal. At each iteration, the best matched TF function (i.e., the TF function that captured the maximum fraction of the signal energy) was searched for and selected from the Gabor dictionary. The best match depends on the choice function; in this work, maximum energy capture per iteration was used, as described in [1]. The remaining signal, called the residue, was further decomposed in the same way at each iteration, subdividing it into TF functions. Due to the sequential selection of the TF functions, the signal decomposition may take a long time, especially for long signals. To overcome this, there exist faster approaches that choose multiple TF functions in each iteration [4]. After M iterations, the signal x(t) could be expressed as

    x(t) = \sum_{n=0}^{M-1} \langle R^{n} x, g_{\gamma_n} \rangle \, g_{\gamma_n}(t) + R^{M} x(t),    (3)

where the first part of (3) is the decomposed TF functions up to M iterations, and the second part is the residue, which is decomposed in the subsequent iterations. This process is repeated until all the energy of the signal is decomposed. At each iteration, some portion of the signal energy is modeled with an optimal TF resolution in the TF plane. Over the iterations, the captured energy increases and the residue energy falls. Depending on the signal content, the value of M could be very high for a complete decomposition (i.e., residue energy = 0). Examples of Gaussian TF functions with different scales and modulation parameters are shown in Figure 1. The order of computational complexity for one iteration of the ATFT algorithm is O(N log N), where N is the length of the signal in samples. The time complexity of the ATFT algorithm increases with the number of iterations required to model a signal, which in turn depends on the nature of the signal. In comparison, the computational complexity of the Modified Discrete Cosine Transform (MDCT) used in a few of the state-of-the-art audio coders is only O(N log N) (the same as the FFT).

Once the signal is modeled accurately or decomposed into TF functions with definite time and frequency localization, the TF parameters governing the TF functions can be analyzed to extract application-specific information. In our case, we process the TF decomposition parameters of the audio signals to perform both audio compression and classification, as explained in the later sections.
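The greedy selection behind (3) can be sketched in a few lines. Here the dictionary is a toy, explicit matrix of unit-energy atoms (for example, built with gabor_atom above); the actual implementation searches a structured Gabor dictionary far too large to store explicitly.

    import numpy as np

    def matching_pursuit(x, dictionary, max_iter=1000):
        # 'dictionary' is an (n_atoms x n_samples) array of unit-energy atoms.
        residue = np.asarray(x, dtype=float).copy()
        coeffs, indices = [], []
        for _ in range(max_iter):
            proj = dictionary @ residue          # <R^n x, g_gamma> for all atoms
            k = int(np.argmax(np.abs(proj)))     # best match: max energy capture
            coeffs.append(proj[k])
            indices.append(k)
            residue -= proj[k] * dictionary[k]   # R^(n+1)x = R^n x - <R^n x, g> g
        return np.array(coeffs), np.array(indices), residue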
2.2. TF Distribution Approach. A TF distribution (TFD) is a two-dimensional energy representation of a signal in terms of the time and frequency domains. The work in the area of TFD methods is extensive [2, 5-7]. Some well-known TFD techniques are as follows.

2.2.1. Linear TFDs. The simplest linear TFD is the squared modulus of the STFT of a signal, which assumes that the signal is stationary over short durations, multiplies the signal by a window, and takes the Fourier transform of the windowed segments. This joint TF representation captures the localization of frequency in time; however, it suffers from the TF resolution tradeoff.

2.2.2. Quadratic TFDs. In quadratic TFDs, the analysis window is adapted to the analyzed signal. To achieve this, the quadratic TFD transforms the time-varying autocorrelation of the signal to obtain a representation of the signal energy distributed over time and frequency:

    X_{WV}(\tau, \omega) = \int x\!\left(t + \frac{1}{2}\tau\right) x^{*}\!\left(t - \frac{1}{2}\tau\right) \exp(-j\omega t) \, dt,    (4)

where X_{WV} is the Wigner-Ville distribution (WVD) of the signal. The WVD offers higher resolution than the STFT; however, when more than one component exists in the signal, the WVD contains interference cross-terms. Interference cross-terms do not belong to the signal and are generated by the quadratic nature of the WVD. They produce highly oscillatory interference in the TFD, and their presence can lead to incorrect interpretation of the signal properties. This drawback of the WVD is the motivation for introducing other TFDs, such as the Pseudo Wigner-Ville Distribution (PWVD), the SPWVD, the Choi-Williams Distribution (CWD), and the Cohen kernel distribution, which define a kernel in the ambiguity domain that can eliminate cross-terms. These distributions belong to a general class called Cohen's class of bilinear TF representations [3]. These TFDs are not always positive. In order to produce meaningful features, the value of the TFD should be positive at each point; otherwise, the extracted features may not be interpretable. For example, the WVD always results in a positive instantaneous frequency, but it also implies that the expectation value of the square of the frequency, for a fixed time, can become negative, which does not make any sense [8]. Additionally, it is very difficult to explain negative probabilities.
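A compact, unoptimized sketch of the discrete counterpart of (4) follows (our illustration): for every time instant, the instantaneous autocorrelation is formed over all symmetric lags and Fourier transformed; the discrete WVD's factor-of-two frequency scaling is glossed over here.

    import numpy as np

    def wigner_ville(x):
        x = np.asarray(x, dtype=complex)
        n = x.size
        wvd = np.zeros((n, n))
        for t in range(n):
            lag = min(t, n - 1 - t)                  # largest symmetric lag
            taus = np.arange(-lag, lag + 1)
            acf = np.zeros(n, dtype=complex)
            # instantaneous autocorrelation x(t + tau) x*(t - tau)
            acf[taus % n] = x[t + taus] * np.conj(x[t - taus])
            wvd[:, t] = np.fft.fft(acf).real         # FFT over the lag axis
        return wvd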
2.2.3. Positive TFDs. These produce a non-negative TFD of a signal and do not contain any cross-terms. Cohen and Posch [8] demonstrated the existence of an infinite set of positive TFDs and developed formulations to compute positive TFDs based on signal-dependent kernels. However, in order to calculate these kernels, the method requires the signal equation, which is not known in most cases. Therefore, although positive TFDs exist, their derivation process is very complicated to implement.

Figure 1: Gaussian TF functions with different scales and modulation parameters. (The annotations indicate the time position p_n, the centre frequency f_n, the scale or octave s_n, and TF functions with smaller scale and higher centre frequency.)

2.2.4. Matching Pursuit TFD. The matching pursuit TFD (MP-TFD) is constructed from matching pursuit as proposed by Mallat and Zhang [2] in 1993. As shown in (3), matching pursuit decomposes a signal into Gabor atoms with a wide variety of frequency modulations, phases, time shifts, and durations. After M iterations, the selected components may be taken to represent the coherent structures, and the residue represents the incoherent structures in the signal. The residue may be assumed to be due to random noise, since it does not show any TF localization. Therefore, in the MP-TFD, the decomposition residue in (3) is ignored, and the WVD of each of the M components is added as follows:

    X(\tau, \omega) = \sum_{n=0}^{M-1} \left| \langle R^{n} x, g_{\gamma_n} \rangle \right|^{2} W_{g_{\gamma_n}}(\tau, \omega),    (5)

where W_{g_{\gamma_n}}(\tau, \omega) is the WVD of the Gabor atom g_{\gamma_n}(t), and X(\tau, \omega) is the constructed MP-TFD. As previously mentioned, the WVD is a powerful TF representation; however, when more than one component is present in the signal, the TF resolution is confounded by cross-terms. In the MP-TFD, we apply the WVD to single components and add them up; therefore, the summation is a cross-term free distribution.
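Combining the two sketches above gives a direct (if inefficient) rendering of (5); in practice, the WVD of a Gaussian atom can also be written in closed form as a 2D Gaussian blob, which is much cheaper than calling wigner_ville per atom.

    import numpy as np

    def mp_tfd(coeffs, indices, dictionary):
        # Sum of per-atom WVDs, weighted by the captured energy
        # |<R^n x, g>|^2 as in (5); cross-term free because each WVD
        # sees only a single component.
        n = dictionary.shape[1]
        tfd = np.zeros((n, n))
        for c, k in zip(coeffs, indices):
            tfd += (c ** 2) * wigner_ville(dictionary[k])
        return tfd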
Despite the potential advantages of TFDs for quantifying the nonstationary information of real-world signals, they have mainly been used for visualization purposes. We review TFD quantification in the next section, and then we explain our proposed TFD quantification method.

2.3. TFD-Based Quantification. There have been some attempts in the literature at TF quantification by removing the redundancy and keeping only the representative parts of the TFD. In [9], the authors consider the TF representation of music signals as texture images, and then they look for the repeating patterns of a given instrument as the representative feature of that instrument. This approach is useful for music signals; however, it is not very efficient for environmental sound classification, where we cannot assume the presence of such structured TF patterns.

Another TF quantification approach is obtaining instantaneous features from the TFD. One of the first works in this area is that of Tacer and Loughlin [10], who derive two-dimensional moments of the TF plane as features. This approach simply obtains one instantaneous feature for every temporal sample, related to the spectral behavior of the signal at each point. However, the quantity of features is still very large. In [11, 12], instead of directly applying the instantaneous features in the classification process, some statistical properties of these features (e.g., mean and variance) are used. Although this solution reduces the dimension of the instantaneous features, its shortcoming is that the statistical analysis diminishes the temporal localization of the instantaneous features.

In a recent approach, the TFD is considered as a matrix, and a matrix decomposition (MD) technique is applied to the TF matrix (TFM) to derive the significant TF components. This idea has been used for separating instruments in music [13, 14], and has recently been used for music classification [15]. In this approach, the base components are used as feature vectors. The major disadvantage of this method is that the decomposed base vectors have a high dimension, and as a result they are not very appealing features for classification purposes.

Figure 2 depicts our proposed TF quantification approach. As shown in this figure, a signal x(t) is transformed into a TF matrix V, where V is the TFD of the signal x(t) (V = X(\tau, \omega)). Next, an MD is applied to the TFM to decompose the TF matrix into its base and coefficient matrices (W and H, resp.) in such a way that V = W \times H. We then extract features from each vector of the base matrix and use them as joint TF features of the signal x(t). This approach significantly reduces the dimensionality of the TFD compared to the previous TF quantification approaches. We call the proposed methodology the TFM decomposition feature extraction technique. In our previous paper [16], we applied the TFM decomposition feature extraction methodology to speech signals in order to automatically identify and measure speech pathology problems. We extracted meaningful and unique features from both the base and coefficient matrices, and showed that the proposed method extracts meaningful and unique joint TF features from speech and automatically identifies and measures the abnormality of the signal. We also employed the TFM decomposition technique to quantify the TFD and proposed novel features for environmental audio signal classification [17]. Our aim in the present work is to extract novel TF features based on the TFM decomposition technique in an attempt to increase the accuracy of environmental audio classification.

2.4. TFM Decomposition. The TFM of a signal x(t) is denoted by V_{K \times N}, where N is the signal length and K is the frequency resolution in the TF analysis. An MD technique with r decomposition components is applied to the matrix in such a way that each element in the TFM can be written as follows:

    V_{K \times N} = W_{K \times r} H_{r \times N} = \sum_{i=1}^{r} w_i h_i,    (6)

where the decomposed TF matrices, W and H, are defined as

    W_{K \times r} = [w_1 \; w_2 \; \cdots \; w_r], \qquad H_{r \times N} = [h_1; \; h_2; \; \ldots; \; h_r],    (7)

with the h_i stacked as rows. In (6), MD reduces the TF matrix (V) to the base and coefficient vectors ({w_i}_{i=1,...,r} and {h_i}_{i=1,...,r}, resp.) in such a way that the former represents the spectral components in the TF signal structure, and the latter indicates the location of the corresponding spectral component in time.

There are several well-known MD techniques in the literature, for example, Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Non-negative Matrix Factorization (NMF). Each MD technique considers different sets of criteria to choose the decomposed matrices with the desired properties; for example, PCA finds a set of orthogonal bases that minimize the mean squared error of the reconstructed data; ICA is a statistical technique that decomposes a complex dataset into components that are as independent as possible; and the NMF technique is applied to a non-negative matrix and decomposes the matrix into its non-negative components.

An MD technique is suitable for TF quantification when the decomposed matrices produce representative and meaningful features. In this work, we choose NMF as the MD method for the following two reasons.

(1) In a previous study [18], we showed that the NMF components promise a higher representation and localization property compared to the other MD techniques. Therefore, the features extracted from the NMF components represent the TFM with high time and frequency localization.

(2) NMF decomposes a matrix into non-negative components. Negative spectral and temporal distributions are not physically interpretable and therefore do not result in meaningful features. Since the PCA and ICA techniques do not guarantee the non-negativity of the decomposed factors, instead of directly using the W and H matrices to extract features, their squared values, W̃ and H̃, are used [19]. In other words, rather than extracting the features from V ≈ WH, the features are extracted from the TFM of Ṽ as defined below:

    Ṽ ≈ \sum_{i=1}^{r} |w̃_i(f)| \, |h̃_i(t)|.    (8)

It can be shown that Ṽ ≠ V, and the negative elements of W and H cause artifacts in the extracted TF features. NMF is the only one of these MD techniques that guarantees the non-negativity of the decomposed factors, and it is therefore a better MD technique for extracting meaningful features compared to ICA and PCA. Therefore, NMF is chosen as the MD technique in TFM decomposition.

The NMF algorithm starts with an initial estimate for W and H and performs an iterative optimization to minimize a given cost function. In [20], Lee and Seung introduce two updating algorithms using the least square error and the Kullback-Leibler (KL) divergence as the cost functions.

Least square error:

    W \leftarrow W \cdot \frac{V H^{T}}{W H H^{T}}, \qquad H \leftarrow H \cdot \frac{W^{T} V}{W^{T} W H}.

KL divergence:

    W \leftarrow W \cdot \frac{(V/WH) H^{T}}{1 \cdot H}, \qquad H \leftarrow H \cdot \frac{W^{T} (V/WH)}{W \cdot 1}.    (9)

In these equations, (\cdot) and (/) denote term-by-term multiplication and division of two matrices. Various alternative minimization strategies for NMF decomposition have been proposed in [21, 22]. In this work, we use the projected gradient bound-constrained optimization method by Lin [23]. The gradient-based NMF is computationally competitive and offers better convergence properties than the standard approach.

Figure 2: This block diagram represents the TFM quantification technique. First the TFD (V_{K \times N}) of a signal x(t) is estimated; then an MD technique decomposes the estimated TF matrix into r base components (W_{K \times r} and H_{r \times N}); finally, a discriminant and representative feature vector F is extracted from each decomposed component. (The diagram shows identical train and test paths: x(t) → MP-TFD → NMF → feature extraction (F_{r \times 20}) → LDA classifier, over ten classes: 1. aircraft, 2. helicopter, 3. drum, 4. flute, 5. piano, 6. male, 7. female, 8. animal, 9. bird, 10. insect.)
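As an illustration of how the Figure 2 pipeline might be wired together (a sketch under our own assumptions, reusing the matching_pursuit, mp_tfd, and nmf helpers above; the paper's actual feature vector F_{r \times 20} is defined later, so each NMF component is summarized here only by crude spectral and temporal centroids):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def tfm_features(signal, dictionary, r=5, mp_iters=200):
        coeffs, idx, _ = matching_pursuit(signal, dictionary, max_iter=mp_iters)
        V = np.maximum(mp_tfd(coeffs, idx, dictionary), 0.0)  # clip WVD negatives
        W, H = nmf(V, r)
        feats = []
        for i in range(r):
            w, h = W[:, i], H[i]
            k, t = np.arange(w.size), np.arange(h.size)
            feats.append(np.sum(k * w) / (w.sum() + 1e-12))   # spectral centroid
            feats.append(np.sum(t * h) / (h.sum() + 1e-12))   # temporal centroid
        return np.array(feats)

    # Training the LDA stage on labeled clips:
    # X = np.vstack([tfm_features(s, D) for s in clips])
    # clf = LinearDiscriminantAnalysis().fit(X, labels)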

Figure 3: Block diagram of the ATFT audio coder: wideband audio → TF modeling → TF parameter processing → perceptual filtering (threshold in quiet (TIQ) and masking) → quantizer → media or channel.

We apply the TFM decomposition of audio signals to perform environmental audio classification, as explained in Section 4.2.

3. Audio Coding

In order to address the high demand for audio compression, many compression methodologies have been introduced over the years to reduce bit rates without sacrificing much of the audio quality. Since it is out of the scope of this paper to cover all of the existing audio compression methodologies, the authors recommend the work of Painter and Spanias [24] for a comprehensive review of most of the existing audio compression techniques. Audio signals are highly nonstationary in nature, and the best way to analyze them is to use a joint TF approach. The presented coding methodology is based on ATFT and falls under the transform-like coder category. The usual methodology of a transform-based coding technique involves the following steps: (i) transforming the audio signal into frequency- or TF-domain coefficients, (ii) processing the coefficients using psychoacoustic models and computing the audio masking thresholds, (iii) controlling the quantizer resolution using the masking thresholds, (iv) applying intelligent bit allocation schemes, and (v) enhancing the compression ratio with further lossless compression schemes. The ATFT-based coder nearly follows this general transform coder methodology; however, unlike the existing techniques, the major part of the compression was achieved by exploiting the joint TF properties of the audio signals. The block diagram of the ATFT coder is shown in Figure 3. The ATFT approach provides higher TF resolution than existing TF techniques such as wavelets and wavelet packets [2]. This high-resolution sparse decomposition enables us to achieve a compact representation of the audio signal in the transform domain itself. Also, due to the adaptive nature of the ATFT, there was no need for signal segmentation.

Psychoacoustics were applied in a novel way to the TF decomposition parameters to achieve further compression. In most of the existing audio coding techniques, the fundamental decomposition components or building blocks are in the frequency domain with corresponding energies associated with them. This makes it much easier for them to adopt the conventional, well-modeled psychoacoustics techniques into their encoding schemes. In ATFT, on the other hand, the signal was modeled using TF functions which have a definite time and frequency resolution (i.e., each individual TF function is time limited and band limited); hence, the existing psychoacoustics models need to be adapted to apply to the TF functions [25].

3.1. ATFT of Audio Signals. Any signal can be expressed as a combination of coherent and noncoherent signal structures. Here, the term coherent signal structures means those signal structures that have a definite TF localization or exhibit high correlation with the TF dictionary elements. In general, the ATFT algorithm models the coherent signal structures well within the first few hundred iterations, which in most cases contribute >90% of the signal energy. On the other hand, the noncoherent noise-like structures

cannot be easily modeled, since they do not have a definite TF localization or correlation with the dictionary elements. Hence, these noncoherent structures are broken down by the ATFT into smaller components to search for coherent structures. This process is repeated until the whole residue information is diluted across the whole TF dictionary [2]. From a compression point of view, it would be desirable to keep the number of iterations M (M ≪ N) as low as possible and at the same time sufficient to model the audio signal without introducing perceptual distortions. Considering this requirement, an adaptive limit has to be set for controlling the number of iterations. The energy capture rate (signal energy captured per iteration) can be used to achieve this: by monitoring the cumulative energy capture over the iterations, we can set a limit to stop the decomposition when a particular amount of signal energy has been captured.

The minimum number of iterations required to model an audio signal without introducing perceptual distortions depends on the signal composition and the length of the signal. In theory, due to the adaptive nature of the ATFT decomposition, it is not necessary to segment the signals. However, due to computational resource limitations (Pentium III, 933 MHz with 1 GB RAM), we decomposed the audio signals in 5 s durations. The longer the duration decomposed, the more efficient the ATFT modeling, because if the signal is not sufficiently long, we cannot efficiently utilise longer TF functions (at the highest possible scale) to approximate the signal. As the longer TF functions cover larger signal segments and also capture more signal energy in the initial iterations, they help to reduce the total number of TF functions required to model an audio signal. Each TF function has a definite time and frequency localization, which means that all the information about the occurrence of each TF function in time and frequency is available. This flexibility helps us later in our processing to group the TF functions corresponding to any short time segment of the audio signal for computing the psychoacoustic thresholds. In other words, the complete length of the audio signal can first be decomposed into TF functions, and later the TF functions corresponding to any short time segment of the signal can be grouped together. In comparison, most of the existing DCT- and MDCT-based techniques have to segment the signals into time frames and process them sequentially. This is needed to account for the nonstationarity associated with audio signals and also to maintain a low signal delay in encoding and decoding.

In the presented technique, for a signal duration of 5 s, the decomposition limit was set to the number of iterations (M_x) needed to capture 99.5% of the signal energy, or to a maximum of 10,000 iterations, and is given by

    M_x = \begin{cases} M, & \text{if } M < 10000 \text{ and } \dfrac{\sum_{n=0}^{M-1} \left| \langle R^{n} x, g_{\gamma_n} \rangle \right|^{2}}{\int_{-\infty}^{\infty} |x(t)|^{2} \, dt} = 0.995, \\ 10000, & \text{otherwise}. \end{cases}    (10)

For a signal with fewer noncoherent structures, 99.5% of the signal energy can be modeled with a lower number of TF functions than for a signal with more noncoherent structures. In most cases, a 99.5% energy capture characterises the audio signal nearly completely. The upper limit was fixed at 10,000 iterations to reduce the computational load. Figure 4 demonstrates the number of TF functions needed for a sample audio signal: the lower panel shows the energy capture curve for the sample audio signal in the top panel, with the number of TF functions on the X-axis and the normalised energy on the Y-axis. On average, it was observed that 6000 TF functions are needed to represent a signal of 5 s duration sampled at 44.1 kHz.
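The stopping rule in (10) amounts to a cumulative-energy test over the matching pursuit coefficients; a minimal sketch (our own helper, to be fed the squared coefficients returned by matching_pursuit and the total signal energy):

    import numpy as np

    def decomposition_limit(atom_energies, signal_energy, target=0.995, cap=10000):
        # M_x of (10): the first iteration count at which the cumulative
        # captured energy reaches the target fraction, capped at 10,000.
        frac = np.cumsum(atom_energies) / signal_energy
        reached = np.flatnonzero(frac >= target)
        return min(int(reached[0]) + 1, cap) if reached.size else cap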
3.2. Implementation of Psychoacoustics. In conventional coding methods, the signal is segmented into short time segments and transformed into frequency-domain coefficients. These individual frequency components are used to compute the psychoacoustic masking thresholds, and their quantization resolutions are controlled accordingly. In contrast, in our approach we computed the psychoacoustic masking properties of individual TF functions and used them to decide whether a TF function with a certain energy was perceptually relevant or not, based on its time occurrence relative to other TF functions. TF functions are the basic components of the presented technique, and each TF function has a certain time and frequency support in the TF plane. So their psychoacoustical properties have to be studied by taking them as a whole to arrive at a suitable psychoacoustical model. More details on the implementation of psychoacoustics are covered in [25, 26].

3.3. Quantization. Most of the existing transform-based coders rely on controlling the quantizer resolution based on psychoacoustic thresholds to achieve compression. Unlike these, the presented technique achieves a major part of the compression in the transformation itself, followed by perceptual filtering. That is, when the number of iterations M needed to model a signal is very low compared to the length of the signal, we need just M × L bits, where L is the number of bits needed to quantize the 5 TF parameters that represent a TF function. Hence, we limited our research work to scalar quantizers, as the focus of the research lies mainly on the TF transformation block and the psychoacoustics block rather than on the usual sub-blocks of a data compression application.

As explained earlier, the five parameters energy (a_n), center frequency (f_n), time position (p_n), octave (s_n), and phase (φ_n) are needed to represent a TF function and thereby the signal itself. These five parameters were to be quantized in such a way that the quantization error introduced was imperceptible while, at the same time, obtaining good compression. Each of the five parameters has different characteristics and dynamic range. After careful analysis of them, the following bit allocations were made; in arriving at the final bit allocations, informal Mean Opinion Score (MOS) tests were conducted to compare the quality of the audio samples before and after the quantization stage. In total, 54 bits are needed to represent each TF function without introducing significant perceptual quantization noise in the reconstructed signal.

Figure 4: Energy cutoff of the sample signal in panel (a). Panel (a) shows the sample signal (amplitude in a.u. versus time samples, ×10^5); panel (b) shows the energy curve with the 99.5% energy-capture point marked against the number of TF functions. a.u.: arbitrary units.

The final form of the data for M TF functions will contain the following.

(i) Energy parameter (log companded) = M ∗ 12 bits.
(ii) Time position parameter = M ∗ 15 bits.
(iii) Center frequency parameter = M ∗ 13 bits.
(iv) Phase parameter = M ∗ 10 bits.
(v) Octave parameter = M ∗ 4 bits.

The sum of all the above (= 54 ∗ M bits) will be the total number of bits transmitted or stored to represent an audio segment of 5 s duration. The energy parameter after log companding was observed to be a very smooth curve, and fitting a curve to the energy parameter further reduces the bit rate [25, 26]. With just a simple scalar quantizer and curve fitting of the energy parameter, the presented coder achieves high compression ratios. Although a scalar quantizer was used to reduce the computational complexity of the presented coder, sophisticated vector quantization techniques can easily be incorporated to further increase the coding efficiency. The 5 parameters of a TF function can be treated as one vector and quantized accordingly using predefined codebooks. Once the vector is quantized, only the index of the codebook needs to be transmitted for each set of TF parameters, resulting in a large reduction of the total number of bits. However, designing the codebooks would be challenging, as the dynamic ranges of the 5 TF parameters are drastically different. Apart from reducing the number of total bits, the quantization stage can also be utilized to control the bit rates to suit CBR (Constant Bit Rate) applications.

3.4. Compression Ratios. Compression ratios achieved by the presented coder were computed for eight sample wideband audio signals (of 5 s duration) as described below; a worked numerical sketch follows the list. These eight sample signals (namely, ACDC, DEFLE, ENYA, HARP, HARPSICHORD, PIANO, TUBULARBELL, and VISIT) were representative of a wide range of music types.

(i) As explained earlier, the total number of bits needed to represent each TF function is 54.
(ii) The energy parameter is curve fitted, and only the first 150 points in addition to the curve fitted points need to be coded.
(iii) So the total number of bits needed for M iterations for a 5 s duration of the signal is TB1 = (M ∗ 42) + ((150 + C) ∗ 12), where C is the number of curve fitted points and M is the number of perceptually important functions (42 = 54 − 12 bits, since the 12-bit per-function energy parameter is replaced by the coded curve points).
(iv) The total number of bits needed for a CD quality 16 bit PCM technique for a 5 s duration of the signal sampled at 44,100 Hz is TB2 = 44100 ∗ 5 ∗ 16 = 3,528,000.
(v) The compression ratio can be expressed as the ratio of the number of bits needed by the CD quality 16 bit PCM technique to the number of bits needed by the presented coder for the same length of the signal, that is,

    Compression ratio = TB2 / TB1.    (11)

(vi) The overall compression ratio for a signal was then calculated by averaging over all the 5 s duration segments of the signal for both channels.

The presented coder is based on an adaptive signal transformation technique; that is, the content of the signal and the dictionary of basis functions used to model the signal play an important role in determining how compactly a signal can be represented (compressed). Hence, VBR (Variable Bit Rate) is the best way to present the performance benefit of using an adaptive decomposition approach. The inherent variability introduced in the number of TF functions required to model a signal, and thereby in the compression, is one of the highlights of using ATFT. Although VBR would be more appropriate for presenting the performance benefit of the presented coder, CBR mode has its own advantages in applications that demand network transmission over constant-bitrate channels with limited delays. The presented coder can also be used in CBR mode by fixing the number of TF functions used to represent the signal segments; however, due to the signal adaptive nature of the presented coder, this would compromise the quality at instances where signal segments demand a higher number of TF functions for perceptually lossless reproduction. Hence, we chose to present the results of the presented coder using only the VBR mode.

We compared the presented coder with two existing popular and state-of-the-art audio coders, namely, MP3 (MPEG 1 Layer 3) and MPEG-4 AAC/HE-AAC. Advanced Audio Coding (AAC) is the current industrial standard, which was initially developed for multichannel surround signals (MPEG-2 AAC [27]). As there are ample studies in the literature [27-32] available for both MP3 and MPEG-2/4 AAC, more details about these techniques are not provided in this paper. The average bit rates were used to calculate the compression ratios achieved by MP3 and MPEG-4 AAC as described below.

(i) The bitrate for a CD quality 16 bit PCM technique for a 1 s stereo signal is given by TB3 = 2 ∗ 44100 ∗ 16.
(ii) The average bit rate per second achieved by MP3 or MPEG-4 AAC in VBR mode = TB4.
(iii) Compression ratio achieved by MP3 or MPEG-4 AAC = TB3/TB4.

Table 1: Compression ratio (CR) and subjective difference grades (SDGs). MP3: Moving Picture Experts Group 1 Layer 3; MPEG-4 AAC: Moving Picture Experts Group 4 Advanced Audio Coding, VBR Main LTP profile; ATFT: Adaptive Time-Frequency Transform.

    Samples          MP3 CR   MP3 SDG   AAC CR   AAC SDG   ATFT CR   ATFT SDG
    ACDC              7.5      0.067     9.3     −0.067     8.4      −0.93
    DEFLE             7.7     −0.2       9.5     −0.067     8.3      −1.73
    ENYA              9        0         9.6     −0.133    20.6      −0.8
    HARP             11       −0.067     9.4     −0.067    36.3      −1
    HARPSICHORD       8.5     −0.067    10.2      0.33      9.3      −0.73
    PIANO            13.6      0.067     9.6     −0.2      40        −0.8
    TUBULARBELL       8.3      0        10.1      0.067    10.5      −0.53
    VISIT             8.4     −0.067    11.5      0        11.6      −2.27
    AVERAGE           9.3     −0.03      9.9     −0.02     18.3      −1.1

The 2nd, 4th, and 6th columns of Table 1 show the compression ratios (CRs) achieved by the MP3, MPEG-4 AAC, and presented ATFT coders for the set of 8 sample audio files. It is evident from the table that the presented coder has better compression ratios than MP3. When comparing with MPEG-4 AAC, 5 out of 8 signals have compression ratios that are either comparable to or better than those of MPEG-4 AAC. It is noteworthy that for slow music (classical type) the ATFT coder provides 3 to 4 times better compression than MPEG-4 AAC or MP3.

The compression ratio alone cannot be used to evaluate an audio coder. The compressed audio signals have to undergo a subjective evaluation to compare the quality achieved with respect to the original signal. The combination of the subjective rating and the compression ratio provides a true evaluation of the coder performance.

Before performing the subjective evaluation, the signal has to be reconstructed. The reconstruction process is a straightforward process of linearly adding all the TF functions with their corresponding five TF parameters. In order to do that, the TF parameters modified for reducing the bit rates first have to be expanded back to their original forms. The log compressed energy curve was log expanded after recovering all the curve points using interpolation on the equally placed 50-length points. The energy curve was multiplied by the normalization factor to restore the energy parameter to what it was during the decomposition of the signal. The restored parameters (energy, time position, center frequency, phase, and octave) were fed to the ATFT algorithm to reconstruct the signal. The reconstructed signal was then smoothed using a 3rd-order Savitzky-Golay filter [33] and saved in a playable format.
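The linear resynthesis described above is just the finite sum of (1) over the retained functions; a minimal sketch reusing the gabor_atom helper (the de-quantization and log expansion are assumed to have happened already):

    import numpy as np

    def reconstruct(params, n_samples, fs=44100.0):
        # 'params' is an iterable of (energy, position, freq, phase, scale)
        # tuples, one per retained TF function.
        y = np.zeros(n_samples)
        for energy, position, freq, phase, scale in params:
            y += energy * gabor_atom(n_samples, scale, position, freq, phase, fs)
        return y

    # The paper then smooths the result with a 3rd-order Savitzky-Golay
    # filter; the window length below is our assumption:
    # from scipy.signal import savgol_filter
    # y_smooth = savgol_filter(y, window_length=11, polyorder=3)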
Figure 5 demonstrates a sample signal (/"HARP"/), its reconstructed version, and the corresponding spectrograms. Comparing the reconstructed signal spectrogram with the original signal spectrogram, it can be clearly observed how accurately the ATFT technique has filtered out the irrelevant components from the signal (evident from Table 1 for /"HARP"/: high compression ratio versus acceptable quality). The accuracy in adaptively filtering the irrelevant components is made possible by the TF resolution provided by the ATFT algorithm.

3.5. Subjective Evaluation of ATFT Coder. Subjective evaluation of audio quality is needed to assess audio coder performance. Even though there are objective measures such as SNR, total harmonic distortion (THD), and noise-to-mask ratio [34], they would not give a true evaluation of the audio codec, particularly if it uses lossy schemes, as in the proposed technique. This is because, for example, in a perceptual coder SNR is lost even though the audio quality is claimed to be perceptually lossless; in such a case, the SNR measure may not give a correct performance evaluation of the coder. We used the subjective evaluation method recommended by the ITU-R standard BS.1116, called a "double blind triple stimulus with hidden reference" test [24, 34].

Figure 5: Example of a sample original signal (/"HARP"/) and the reconstructed signal with their respective spectrograms. The X-axes of the original and reconstructed signals are in time samples (×10^5), and the X-axes of the spectrograms are in equivalent time in seconds; frequency is in Hz. Sampling frequency = 44.1 kHz. a.u.: arbitrary units.

A Subjective Difference Grade (SDG) [24] was computed by subtracting the absolute score assigned to the hidden reference audio signal from the absolute score assigned to the compressed audio signal. It is given by

    SDG = Grade_{compressed} − Grade_{reference}.    (12)

Accordingly, the SDG scale ranges from −4 to 0, with the following interpretation: (−4) Unsatisfactory (or Very Annoying); (−3) Poor (or Annoying); (−2) Fair (or Slightly Annoying); (−1) Good (or Perceptible but not Annoying); and (0) Excellent (or Imperceptible). Fifteen randomly selected listeners participated in the MOS studies and evaluated all 3 audio coders (MP3, AAC, and ATFT in VBR mode). The average SDG was computed for each audio sample. The 3rd, 5th, and 7th columns of Table 1 show the SDGs obtained for the MP3, AAC, and ATFT coders, respectively. The MP3 and AAC SDGs fall very close to the Imperceptible (0) region, whereas the proposed ATFT SDGs are spread out between −0.53 and −2.27.
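Averaging (12) over listeners is all the grading arithmetic requires; a small sketch (our illustration):

    import numpy as np

    def average_sdg(grades_compressed, grades_reference):
        # (12) per listener, then averaged; 0 is imperceptible and -4 is
        # unsatisfactory on the ITU-R BS.1116 impairment scale.
        g_c = np.asarray(grades_compressed, dtype=float)
        g_r = np.asarray(grades_reference, dtype=float)
        return float(np.mean(g_c - g_r))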

3.6. Results and Discussion. The compression ratios (CRs) and the SDGs for all three coders (MP3, AAC, and ATFT) are shown in Table 1. All the coders were tested in the VBR mode. For the presented technique, VBR was the best way to present the performance benefit of using an adaptive decomposition approach. In ATFT, the type of signal and the characteristics of the TF functions (type of dictionary) control the number of transformation parameters required to approximate the signal and thereby the compression ratio. The inherent variability introduced in the number of TF functions required to model a signal is one of the highlights of using ATFT. Hence, we chose to present the comparison of the coders in the VBR mode.

The results show that the MP3 and AAC coders perform well, with excellent SDG scores (Imperceptible) at a compression ratio of around 10. The presented coder does not perform well on all of the eight samples. Out of the 8 samples, 6 have an SDG between −0.53 and −1 (Imperceptible to Perceptible but not annoying) and 2 have an SDG below −1. Of the 6 samples with SDGs between −0.53 and −1, 3 (ENYA, HARP, and PIANO) have compression ratios 2 to 4 times higher than MP3 and AAC, and 3 (ACDC, HARPSICHORD, and TUBULARBELL) have comparable compression ratios with moderate SDGs.

Figure 6 shows the comparison of all three coders, plotting the samples with their SDGs on the X-axis and compression ratios on the Y-axis.

Figure 6: Subjective Difference Grade (SDG) versus compression ratio (CR) for the MP3, AAC, and ATFT coders. (The SDG axis runs from −4, very annoying, to 0, imperceptible; the CR axis runs from 5 to 45.)

If we virtually divide this plot into segments of SDGs (horizontally) and compression ratios (vertically), then the ideal coder performance lies in the top right corner of the plot (high compression ratios and excellent SDG scores). This is followed by the bottom right corner (low compression ratios and excellent SDG scores), and so on as we move from right to left in the plot. Here the terms "low" and "high" compression ratios are used in a relative sense, based on the compression ratios achieved by all 3 coders in this study. From the plot it can be seen that the MP3 and AAC coders occupy the bottom right corner, whereas the samples from the ATFT coder are spread across the plot. As mentioned earlier, 3 of the 8 ATFT samples occupy the top right corner, but only with moderate SDGs that are much lower than those of MP3 and AAC. Three of the remaining 5 ATFT samples occupy the bottom right corner, again with only moderate SDGs lower than those of MP3 and AAC. The remaining 2 samples perform the worst, occupying the bottom left corner.

We analyzed the poorly performing ATFT coded signals DEFLE and VISIT. DEFLE is a rapidly varying rock-like signal with minimal voice components, and VISIT is a signal with dominant voice components. We observed that the symmetrical and smooth Gaussian dictionary used in this study does not model transients well, and transients are the main features of rapidly varying signals like DEFLE. This inefficient modeling of transients by the symmetrical Gaussian TF functions resulted in the poor SDG for DEFLE. A more appropriate dictionary would be a damped sinusoid dictionary [35], which can better model the transient-like decaying structures in audio signals. However, a single dictionary alone may not be sufficient to model all types of signal structures. The second signal, VISIT, has a significant amount of voice components. Even though the main voice components are modeled well by the ATFT, the noise-like hissing and shrilling sounds (noncoherent structures) could not be modeled within the decomposition limit of 10,000 iterations. These hissing and shrilling sounds actually add to the pleasantness of the music, and any distortion in them is easily perceived, which could have reduced the SDG of the signal to the lowest of the group, −2.27. The poor performance on these two audio samples could be addressed by using a hybrid dictionary of TF functions and residue coding the noncoherent structures separately; however, this would increase the computational complexity of the coder and reduce the compression ratios.

We have covered most of the details involved in a stage-by-stage implementation and evaluation of a transform-based audio coder. The approach demonstrated the application of ATFT to audio coding and the development of a novel psychoacoustics model adapted to TF functions. The compression strategy was changed from the conventional way of controlling the quantizer resolution to achieving the majority of the compression in the transformation itself. Listening tests were conducted, and a performance comparison of the presented coder with the MP3 and AAC coders was presented. From the preliminary results, although the proposed coder achieves high compression ratios, its SDG scores are well below those of the MP3 and AAC family of coders. The proposed coder, however, performs moderately well for slowly varying classical-type signals, with acceptable SDGs. The proposed coder is not as refined as the state-of-the-art commercial coders, which to some extent explains its poorer performance.

From the results presented for the ATFT coder, the signal adaptive performance of the coder for a specific TF dictionary is evident; that is, with a Gaussian TF dictionary the coder performed moderately well for slow-varying classical signals but less well for fast-varying rock-like signals. In other words, the ATFT algorithm demonstrated notable differences in the decomposition patterns of classical and rock-like signals. This is a valid clue and a motivating factor: these differences in the decomposition patterns, if quantified using the TF decomposition parameters, could be used as discriminating features for classifying audio signals. We apply this hypothesis in extracting TF features for classifying audio signals for a content-based audio retrieval application, as explained in Section 4.

3.7. Summary of Steps Involved in Implementing ATFT Audio Coder

Step 1 (ATFT algorithm and TF dictionaries). Existing implementations of matching pursuits can be adapted for the purpose: (1) LastWave (http://www.cmap.polytechnique.fr/∼bacry/LastWave/), (2) the Matching Pursuit Package (MPP) (ftp://cs.nyu.edu/pub/wave/software/mpp.tar.Z), and (3) the Matching Pursuit ToolKit (MPTK) [36].

Step 2 (Control decomposition). The number of TF functions required to model a fixed segment of audio signal can be arrived at using criteria similar to those described in Section 3.1.

Step 3 (Perceptual filtering). The TF functions obtained from Step 2 can be further filtered using the psychoacoustics thresholds discussed in Section 3.2.

Step 4 (Quantization). The simple quantization scheme presented in Section 3.3 can be used for bit allocation, or advanced vector quantization methods can be explored.

Step 5 (Lossless schemes). Further lossless schemes can be applied to the quantized TF parameters to further increase the compression ratio.
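A hypothetical glue for Steps 1-4, reusing the sketches given earlier in this paper; the energy-threshold test is only a stand-in for the perceptual filtering of Step 3, whose masking rules are detailed in [25, 26]:

    import numpy as np

    def atft_encode(segment, dictionary):
        # Steps 1-2: decompose and pick the adaptive iteration limit M_x.
        coeffs, idx, _ = matching_pursuit(segment, dictionary, max_iter=10000)
        e_total = float(np.dot(segment, segment))
        m_x = decomposition_limit(coeffs ** 2, e_total)
        # Step 3 (stand-in): drop functions with negligible energy capture.
        kept = [(c, int(k)) for c, k in zip(coeffs[:m_x], idx[:m_x])
                if c ** 2 > 1e-6 * e_total]
        # Step 4: each kept function would then be scalar quantized into the
        # 54-bit budget of Section 3.3, with the energy log companded first.
        return [(float(np.log1p(abs(c))), k) for c, k in kept]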
4. Audio Classification

Audio feature extraction plays an important role in analyzing and characterizing audio content. Auditory scene analysis, content-based retrieval, indexing, and fingerprinting of audio are a few of the applications that require efficient feature extraction. The general methodology of audio classification involves extracting discriminatory features from the audio data and feeding them to a pattern classifier. Different approaches and various kinds of audio features have been proposed with varying success rates. Audio feature extraction serves as the basis for a wide range of applications in the areas of speech processing [37], multimedia data management and distribution [38-41], security [42], and biometrics and bioacoustics [43]. The features can be extracted either directly from the time-domain signal or from a transformation domain, depending upon the choice of the signal analysis approach. Some of the audio features that have been successfully used for audio classification include mel-frequency cepstral coefficients (MFCCs) [40, 41], spectral similarity [44], timbral texture [41], band periodicity [38], LPCC (Linear Prediction Coefficient-derived cepstral coefficients) [45], zero crossing rate [38, 45], MPEG-7 descriptors [46], entropy [12], and octaves [39]. A few techniques generate a pattern from the features and use it for classification by the degree of correlation. A few other techniques use the numerical values of the features coupled with statistical classification methods.

4.1. Music Classification. In this section, we present a content-based audio retrieval application employing audio classification and explain the generic steps involved in performing successful audio classification. The simplest of all retrieval techniques is text-based searching, where information about the multimedia data is stored with the data file. However, the success of these types of text-based searches depends on how well the data are text-indexed by the author, and they do not provide any information on the real content of the data. To make the retrieval system automated, efficient, and intelligent, content-based retrieval techniques were introduced. The presented work focuses on one such way of automatically classifying audio signals for retrieval purposes. The block diagram of the proposed technique is shown in Figure 7.

Figure 7: Block diagram of the proposed music classification scheme: audio signal → adaptive signal decomposition → feature extraction → linear discriminant analysis → rock, classical, country, folk, jazz, or pop.

In content-based retrieval systems, audio data is analyzed, and discriminatory features are extracted. The selection of features depends on the domain of analysis and the perceptual characteristics of the audio signals under consideration. These features are used to generate subspaces dividing the audio signal types to fit in one of the subspaces. The division of subspaces and the level of classification vary from technique to technique. When a query is placed, the similarity of the query is checked against all subspaces, and the audio signals from the most highly correlated subspace are returned as the result. The classification accuracy and the discriminatory power of the extracted features determine the success of such retrieval systems.

Most of the existing techniques do not take into consideration the true nonstationary behavior of the audio signals while deriving their features. The presented approach uses the same ATFT transform that was discussed in the previous audio coding section. The ATFT approach is one of the best ways to handle the nonstationary behavior of audio signals and, due to its adaptive nature, does not require any signal segmentation techniques as used by most of the existing techniques. Unlike many existing techniques where multiple features are used for classification, in the proposed technique only one TF decomposition parameter is used to generate a feature set from different frequency bands for classification. Due to its strong discriminatory power, just one TF decomposition parameter is sufficient for accurate classification of music into six groups.
Figure 8: A sample music signal and its reconstructed version with 10 TF functions.
Most of the existing techniques do not take into consideration the true nonstationary behavior of the audio signals while deriving their features. The presented approach uses the same ATFT transform that was discussed in the preceding audio coding section. The ATFT approach is one of the best ways to handle the nonstationary behavior of audio signals and, due to its adaptive nature, does not require the signal segmentation used by most existing techniques. Unlike many existing techniques, where multiple features are used for classification, the proposed technique uses only one TF decomposition parameter to generate a feature set from different frequency bands. Due to its strong discriminatory power, this single TF decomposition parameter is sufficient for accurate classification of music into six groups.

4.1.1. Audio Database. A database consisting of 170 audio signals was used in the proposed technique. Each audio signal is a segment of 5 s duration extracted from an individual original CD music track (wideband audio at 44100 samples/second), and no more than one audio signal (5 s duration) was extracted from the same music track. The 170 audio signals consist of 24 rock, 35 classical, 31 country, 21 jazz, 34 folk, and 25 pop signals. As all signals of the database were extracted from commercial CD music tracks, they exhibited all the required characteristics of their respective music genres, such as guitars, drumbeats, vocals, and piano. The signal duration of 5 s was arrived at using the rationale that the longer the audio signal analyzed, the more accurately the extracted feature reflects the music characteristics. As the ATFT algorithm is adaptive and does not need any segmentation, there is theoretically no limit on the signal length; however, considering the hardware limitations of the processing facility (Pentium III @ 933 MHz and 1.5 GB RAM), we used 5 s duration samples. In the proposed technique, all the signals were first chosen between 15 s and 20 s of the original music tracks. Later, by inspecting those segments, the ones that were inappropriately selected were replaced by segments (5 s duration) at random locations of the original music track, chosen such that their music genre is exhibited.

4.1.2. Feature Extraction. All the signals were decomposed using the ATFT algorithm. The decomposition parameters provided by the ATFT algorithm were analyzed, and the octave parameter $s_n$ was observed to contain significant information on different types of music signals. In the decomposition process, the octave or scaling parameter is decided by the adaptive window duration of the Gaussian function that is used in the best possible approximation of the local signal structures. Higher octaves correspond to longer window durations, and lower octaves correspond to shorter window durations; in other words, combinations of these octaves represent the envelope of the signal. The envelope (temporal structure) [47] of an audio signal provides valid clues such as rhythmic structure [41], indirect pitch content [41], phonetic composition [48], and tonal and transient contributions. Figure 8 demonstrates a sample piece of a music signal and its reconstructed version using 10 TF functions; the relation between the octave parameter and the envelope of the signal is clearly seen. Based on the composition of different structures in a signal, the octave mapping or distribution varies significantly. For example, more lower-order octaves are needed for signals containing many transient-like structures, whereas more higher-order octaves are needed for signals containing rhythmic tonal components. As an illustration, from Figure 9 it can be observed that signals with similar spectral characteristics exhibit a similar pattern in their octave distribution. Signals 1 and 2 are rock-like music, whereas Signals 3 and 4 are instrumental classical. Comparing the spectrograms with the octave distributions, one can observe that the octave distribution reflects the spectral similarities for the same category of signals.
Figure 9: Comparison of octave distributions. Signals 1 and 2: rock-like signals; Signals 3 and 4: classical-like signals.
Figure 10: Octave distribution over three frequency bands for a rock signal.

Figure 11: Octave distribution over three frequency bands for a classical signal.
To further improve the discriminatory power of this parameter, its distribution is grouped into three frequency bands: 0-5 kHz, 5-10 kHz, and 10-20 kHz. This is done since analyzing the audio signals in subbands provides more precise information about their audio characteristics [49]. The band edges were arrived at considering the fact that most of the audio content lies well within the 10 kHz range, so this band needs to be examined in more detail and is hence broken further into 0-5 kHz and 5-10 kHz, with the remaining spectrum forming one band between 10 kHz and 20 kHz. By this frequency division we get an indirect measure of the signal envelope contribution from each frequency band. From Figure 9, even though we see a difference in the distribution of octaves between rock-like and classical music, it becomes more evident when the distribution is divided into three frequency bands, as shown for a sample rock and a classical signal in Figures 10 and 11. Dividing the octave distribution into frequency bands reveals the pattern in which the temporal structures occur over the range of frequencies. As music is the combination of different temporal structures with different frequencies occurring at the same or different time instants, each type of music exhibits a unique average pattern. Based on the subtlety of the differences between the patterns to be detected, the division of the octave distribution over finer frequency intervals and the dimension of the feature set can be controlled.

After decomposing all the audio signals using ATFT, the TF functions were grouped into three frequency bands based on their center frequencies $f_n$. Then the distribution of each of the 14 octave parameter $s_n$ values was calculated over the 3 frequency bands to get a total of 14 × 3 = 42 different distribution values. All these 42 values of each audio segment were used as a feature set for classification. As an illustration, in Figures 10 and 11 the X-axis represents the 14 octave parameters and the Y-axis represents the distribution of the octave parameters over the three frequency bands for 10,000 iterations. Each distribution value forms one of the 42 elements in the feature set.
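A sketch of this feature computation is given below, assuming the ATFT stage has already produced an octave index (taken here as 0-13) and a centre frequency in Hz for each of the 10,000 atoms; the array names and the normalization are illustrative assumptions.

```python
import numpy as np

def octave_band_features(octaves, center_freqs,
                         n_octaves=14, band_edges=(0, 5000, 10000, 20000)):
    """Histogram of atom octave indices within three frequency bands.
    octaves: dyadic-scale indices (0..n_octaves-1), one per TF atom.
    center_freqs: atom centre frequencies in Hz.
    Returns a 14 * 3 = 42-dimensional distribution vector."""
    feats = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        in_band = (center_freqs >= lo) & (center_freqs < hi)
        hist = np.bincount(octaves[in_band], minlength=n_octaves)[:n_octaves]
        feats.append(hist)
    f = np.concatenate(feats).astype(float)
    return f / max(f.sum(), 1)   # normalize to a distribution

# Toy usage with random atom parameters standing in for an ATFT decomposition
rng = np.random.default_rng(0)
octs = rng.integers(0, 14, size=10000)
cfs = rng.uniform(0, 20000, size=10000)
print(octave_band_features(octs, cfs).shape)   # (42,)
```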
4.1.3. Pattern Classification. The motivation for pattern classification is to automatically group audio signals of the same characteristics using the discriminatory features derived as explained in the previous subsection.

Pattern classification was carried out by a linear discriminant analysis- (LDA-) based classifier using the SPSS software [50]. In discriminant analysis, the feature vectors derived as explained above were transformed into canonical discriminant functions such as

$f = u_1 b_1 + u_2 b_2 + \cdots + u_q b_q + a$,  (13)
where $\{u\}$ is the set of features and $\{b\}$ and $a$ are the coefficients and the constant, respectively. The feature dimension $q$ represents the number of features used in the analysis. Using the discriminant scores and the prior probability values of each group, the posterior probabilities of each sample occurring in each of the groups were computed. The sample was then assigned to the group with the highest posterior probability [50].

The classification accuracy was estimated using the leave-one-out method, which is known to provide a least-bias estimate [51]. In the leave-one-out method, one sample is excluded from the dataset and the classifier is trained with the remaining samples. Then the excluded signal is used as the test data, and the classification accuracy is determined. This is repeated for all samples of the dataset. Since each signal is excluded from the training set in turn, the independence between the test and the training sets is maintained.
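The following sketch reproduces this protocol with scikit-learn's LDA standing in for the SPSS classifier; the random feature matrix is a placeholder for the actual 170 × 42 octave-distribution features and therefore yields chance-level accuracy.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut

# X: (170, 42) octave-distribution features; y: genre labels 0..5.
# Random placeholders here; in practice these come from the ATFT features.
rng = np.random.default_rng(1)
X = rng.random((170, 42))
y = rng.integers(0, 6, size=170)

correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    lda = LinearDiscriminantAnalysis()
    lda.fit(X[train_idx], y[train_idx])        # train without the held-out sample
    correct += int(lda.predict(X[test_idx])[0] == y[test_idx][0])
print("leave-one-out accuracy:", correct / len(y))
```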
4.1.4. Results and Discussion. A database of 170 audio signals consisting of 24 rock, 35 classical, 31 country, 21 jazz, 34 folk, and 25 pop signals, each of 5 s duration, was used. All 170 audio signals were decomposed, and the feature set of 42 octave distribution values was extracted. The extracted feature sets for the entire 170 signals were fed to the LDA-based classifier. Six-group classification was performed (rock, classical, country, jazz, folk, and pop). Table 2 shows the confusion matrices for the different classification procedures. An overall classification accuracy of 97.6% is achieved by the regular LDA method and 91.2% with the leave-one-out-based LDA method. In the regular LDA method, all the 24 rock, 35 classical, 31 country, and 25 pop signals were correctly classified with 100% classification accuracy. Two out of 21 jazz and 2 out of 34 folk signals were misclassified, with correct classification accuracies of 90.5% and 94.1%, respectively. The classification accuracy of 91.2% with the leave-one-out method proves the robustness of the proposed technique and the independence of the achieved results from the dataset size. Figure 12 shows the all-groups scatter plot with the first two canonical discriminant functions. One can clearly observe the significant separation between the group spaces, explaining the high discriminatory power of the feature set based on the octave distribution.

Table 2: Classification results. Method: Regular: linear discriminant analysis; Cross-validated: linear discriminant analysis with leave-one-out method; CA%: classification accuracy rate; Gr: groups; Ro: rock; Cl: classical; Co: country; Ja: jazz; Fo: folk; Po: pop.

Method           | Gr | Ro | Cl | Co | Ja | Fo | Po | CA%
Regular          | Ro | 24 |  0 |  0 |  0 |  0 |  0 | 100
                 | Cl |  0 | 35 |  0 |  0 |  0 |  0 | 100
                 | Co |  0 |  0 | 31 |  0 |  0 |  0 | 100
                 | Ja |  0 |  2 |  0 | 19 |  0 |  0 | 90.5
                 | Fo |  1 |  0 |  0 |  1 | 32 |  0 | 94.1
                 | Po |  0 |  0 |  0 |  0 |  0 | 25 | 100
                 | Overall                         | 97.6
Cross-validated  | Ro | 23 |  0 |  1 |  0 |  0 |  0 | 95.8
                 | Cl |  0 | 34 |  0 |  1 |  0 |  0 | 97.1
                 | Co |  1 |  0 | 29 |  0 |  1 |  0 | 93.5
                 | Ja |  0 |  3 |  0 | 18 |  0 |  0 | 85.7
                 | Fo |  1 |  1 |  0 |  2 | 30 |  0 | 88.2
                 | Po |  2 |  0 |  2 |  0 |  0 | 21 | 84
                 | Overall                         | 91.2

Figure 12: All-groups scatter plot with the first two canonical discriminant functions.

The misclassified signals were analyzed, but a clear auditory clue as to why they were misclassified could not be identified; however, their differences are observed in the feature set. Considering the known fact that no music genre has clear hard-line boundaries and that the perceptual boundaries are often subjective (e.g., rock and pop often overlap, and likewise jazz and classical), we may attribute the classification error of these signals to the natural overlap of the music genres and the amount of knowledge imparted to the classifier by the given database.

In this section, we have covered the details involved in a simple audio classification task using a time-frequency approach. The high classification accuracies achieved by the proposed technique clearly demonstrate the potential of a true nonstationary tool in the form of a joint TF approach for audio classification. More interestingly, a single TF decomposition parameter is used for feature extraction, proving the high discriminatory power provided by the TF approach compared to the existing techniques.

4.2. Classification of Environmental Sounds. In this section, we present an environmental audio classification. Audio signals are important sources of information for understanding
the content of multimedia. Therefore, developing audio classification techniques that better characterize audio signals plays an essential role in many multimedia applications, such as (a) multimedia indexing and retrieval and (b) auditory scene analysis.

4.2.1. Audio Database. The lack of a common dataset does not allow researchers to compare the performance of different audio classification methodologies in a fair manner. Some studies report an impressive accuracy rate, but they use only a small number of classes and/or a small dataset in their evaluations. The number of classes used in the literature varies from study to study. For example, in [52] the authors use two classes (i.e., speech and music), while the audio content analysis at Microsoft Research [53] uses four audio classes (i.e., speech, music, environment sound, and silence). Freeman et al. [54] use four classes of speech (i.e., babble, traffic noise, typing, and white noise), while the authors in [55] use 14 different environmental scenes (i.e., inside restaurants, playground, street traffic, train passing, inside moving vehicles, inside casinos, street with police car siren, street with ambulance siren, nature daytime, nature nighttime, ocean waves, running water, rain, and thunder). In this work we use an environmental audio dataset that was developed and compiled in our Signal Analysis Research (SAR) group at Ryerson University. This database consists of 192 audio signals of 5 s duration each, with a sampling rate of 22.05 kHz and a resolution of 16 bits/sample. It is designed to have 10 different classes, including 20 aircraft, 17 helicopters, 20 drums, 15 flutes, 20 pianos, 20 animals, 20 birds, 20 insects, and the speech of 20 males and 20 females. Most of the samples were collected from the Internet and suitably processed to have uniform sampling frequency and duration.

4.2.2. Feature Extraction. All signals were decomposed using the TFM decomposition method. First, we perform the MP-TFD on a 3 s duration of each signal and construct the TFM of each signal. Next, NMF with a decomposition order of 15 (r = 15) is performed on each MP-TF matrix, and 15 base vectors and 15 coefficient vectors are extracted for each signal. Figures 13 and 14 show the decomposition vectors of an aircraft and a piano signal, respectively.
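A minimal sketch of this decomposition step uses scikit-learn's NMF on a stand-in nonnegative TF matrix; the matrix dimensions, initialization, and iteration count below are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

# V: nonnegative TF matrix (n_freqs x n_frames) from the MP-TFD of one signal.
# A random nonnegative matrix stands in for it here.
V = np.abs(np.random.default_rng(6).standard_normal((257, 300)))
model = NMF(n_components=15, init="nndsvda", max_iter=400)
W = model.fit_transform(V)   # 15 spectral base vectors (columns of W)
H = model.components_        # 15 temporal coefficient vectors (rows of H)
print(W.shape, H.shape)      # (257, 15), (15, 300)
```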
Twenty features are extracted from each decomposed base and coefficient vector. Thirteen of the features are the first 13 MFCCs of each base vector, and the next seven features are $S_h$, $S_w$, $D_h$, $D_w$, $MO_h$, $MO_w$, and MP. These features are explained as follows.

(a) $S_{h_i}$ and $S_{w_i}$ are the sparsity of the coefficient and base vectors, respectively. This feature helps to distinguish between transient and continuous components. Several sparseness measures have been proposed and used in the literature. We propose a sparsity function as follows:

$S_{h_i} = \log_{10}\left(\dfrac{\sqrt{N} - \left(\sum_{n=1}^{N} h_i(n)\right)\big/\sqrt{\sum_{n=1}^{N} h_i(n)^2}}{\sqrt{N} - 1}\right)$,  (14)

$S_{w_i} = \log_{10}\left(\dfrac{\sqrt{K} - \left(\sum_{k=1}^{K} w_i(k)\right)\big/\sqrt{\sum_{k=1}^{K} w_i(k)^2}}{\sqrt{K} - 1}\right)$.  (15)

The sparsity is zero if and only if a vector contains a single nonzero component, and it is negative infinity if and only if all the components are equal. The sparsity measure in (15) has been used for applications such as NMF matrix decomposition with more part-based properties [56]; however, it has never been used for feature extraction.

(b) $D_h$ and $D_w$ represent the discontinuities and abrupt changes in each vector. These features are calculated as follows:

$D_{h_i} = \log_{10}\sum_{n=1}^{N-1} \Delta h_i(n)^2$,  $D_{w_i} = \log_{10}\sum_{k=1}^{K-1} \Delta w_i(k)^2$,  (16)

where $\Delta h_i$ and $\Delta w_i$ are the derivatives of the coefficient and base vectors, respectively:

$\Delta h_i(n) = h_i(n+1) - h_i(n)$, $n = 1, \dots, N-1$,  $\Delta w_i(k) = w_i(k+1) - w_i(k)$, $k = 1, \dots, K-1$.  (17)

(c) $MO_h$ and $MO_w$ represent the temporal and spectral moments, respectively. Our observations showed that the temporal and spectral spreads of the TF energy are discriminant characteristics for different audio groups. To quantify this property, we extract the second moment around the mean of each coefficient and base vector as follows:

$MO_{h_i} = \log_{10}\sum_{n=1}^{N} \left(n - \mu_{h_i}\right)^2 h_i(n)$,  $MO_{w_i} = \log_{10}\sum_{k=1}^{K} \left(k - \mu_{w_i}\right)^2 w_i(k)$,  (18)

where $\mu_{h_i}$ and $\mu_{w_i}$ are the means of coefficient and base vector $i$, respectively.

(d) MP is the matching pursuit feature. Using $M$ iterations of MP, we project an audio signal onto a linear combination of Gaussian functions $g_{\gamma_n}(t)$, as shown in (3). The amount of signal energy that is projected at each iteration depends on the signal structure: a signal with coherent structure needs a smaller number of iterations, while noncoherently structured signals take more iterations to be decomposed. In order to calculate the MP feature in a way that discriminates coherent signals from noncoherent ones and is independent of the signal's energy, we calculate the sum of the normalized projected energy per iteration as MP. The MP feature for the piano and aircraft signals is calculated as 2.9 and 10.6, respectively. As expected, the MP feature is high for the noncoherent segment (aircraft) and low for the coherent segment (piano).

Figure 15 demonstrates the feature vectors extracted from the aircraft (Figure 13(a)) and piano (Figure 14(a)) signals in the feature domain. As can be observed, the feature vectors from aircraft and piano are separate from each other in the feature space.
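The following sketch implements features (a)-(c) for a single vector. For (c), the mean is taken here as the energy-weighted mean index, which is one plausible reading of $\mu_{h_i}$; the MP feature would additionally require the per-iteration projection energies from the matching pursuit stage, so it is omitted.

```python
import numpy as np

def hoyer_log_sparsity(v):
    """Log10 of the Hoyer sparseness of a nonnegative vector, as in (14)-(15)."""
    n = len(v)
    s = (np.sqrt(n) - v.sum() / np.sqrt((v ** 2).sum())) / (np.sqrt(n) - 1)
    return np.log10(s)

def log_discontinuity(v):
    """Log10 energy of the first difference, as in (16)-(17)."""
    return np.log10((np.diff(v) ** 2).sum())

def log_second_moment(v):
    """Log10 second moment of v about its mean index, as in (18)."""
    idx = np.arange(1, len(v) + 1)
    mu = (idx * v).sum() / v.sum()          # assumed reading of mu_{h_i}
    return np.log10((((idx - mu) ** 2) * v).sum())

# Toy nonnegative vector standing in for one NMF coefficient vector h_i
h = np.abs(np.random.default_rng(2).standard_normal(100))
print(hoyer_log_sparsity(h), log_discontinuity(h), log_second_moment(h))
```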
Figure 13: (a) and (b) show a segment that belongs to an aircraft signal in time and TF representations, respectively. Applying NMF to the TF matrix, we extract 15 base and coefficient vectors, which are depicted in (c) and (d), respectively.

4.2.3. Pattern Classification. The pattern classification stage automatically groups audio signals of the same characteristics using the discriminatory features derived above. As in the music classification, pattern classification was carried out by an LDA-based classifier using the SPSS software [50].

4.2.4. Results and Discussion. The LDA classifier is trained using 75% of the signals in each group and is tested on all the audio samples in the dataset. For each signal, 15 feature vectors are classified, and the majority vote defines the class of that signal. Table 3 shows the classification accuracy. In this table, the first column contains the ten classes in the dataset, and the number in parentheses shows the number of signals in each class; for example, Aircraft includes 20 audio signals collected from different aircraft. The numbers of correctly classified and misclassified signals are shown in the next two columns, and the accuracy percentage is presented in the last column. As can be seen in Table 3, an overall classification accuracy of 85% is achieved. The classification rate is high for human speech (male and female), instruments (piano, drum, and flute), and aircraft; however, the accuracy rate is lower in the cases of animal, bird, and insect sounds. The reason is that these classes are created from a variety of creatures; for example, the animal class includes sounds of cow, elephant, hippo, hyena, wolf, sheep, horse, cat, and donkey, which are very diverse in nature.
Figure 14: (a) and (b) show a segment that belongs to a piano signal in time and TF representations, respectively. Applying NMF to the TF matrix, we extract 15 base and coefficient vectors, which are depicted in (c) and (d), respectively.

In order to evaluate the relative performance of the proposed features, we compared them with the well-known MFCC features. MFCCs are short-term spectral features and are widely used in the areas of audio and speech processing. In this paper, we computed the first 13 MFCCs for all the segments over the entire length of the audio signals and used the mean and variance of these 13 MFCCs as the MFCC features. For each audio signal we derived 26 features: 13 were the means of the segment MFCCs, and the remaining 13 were their variances. These 26 features were computed for all 192 signals and fed to an LDA-based classifier. Using the MFCC features, an overall classification accuracy of 75% was achieved, which is 10% lower than the overall classification accuracy of our proposed features. Our experiments demonstrated that the proposed TF features are very effective in characterizing environmental audio signals, such as aircraft, helicopter, bird, insect, and music instruments.
overall classification accuracy of 75% was achieved which As shown in this figure, the MP feature plays the most
is 10% lower that the overall classification accuracy of our significant role in the classification accuracy. It can also
proposed features. Our experiments demonstrated that the be observed that the proposed TF features show a higher
proposed TF features are very effective in characterizing significance compared to the fourth MFCC feature and
the nonstationary dynamics of the environmental audio higher. This is proven by comparing the accuracy results
Table 3: Classification results with the proposed feature extraction method.

Class (#)       | Correct | Misclassified | Accuracy (%)
Aircraft (20)   |   16    |      4        |   80
Helicopter (17) |   17    |      0        |  100
Drum (20)       |   18    |      2        |   90
Flute (15)      |   15    |      0        |  100
Piano (20)      |   20    |      0        |  100
Male (20)       |   18    |      2        |   90
Female (20)     |   19    |      1        |   95
Animal (20)     |   11    |      9        |   55
Bird (20)       |   14    |      6        |   70
Insect (20)     |   15    |      5        |   75
Total (192)     |  163    |     29        |   85

Figure 15: The aircraft and piano segments in the feature plane. Since at most three dimensions of the feature domain can be plotted, only three features of the feature vectors are shown. $MO_H$, $D_H$, and $S_W$ represent the second central moment of the coefficient vectors in H, the derivative of the coefficient vectors in H, and the sparsity of the base vectors in W, respectively. As can be observed, the feature vectors from aircraft and piano are separate from each other.

In this section, we proposed a novel methodology to extract TF features for the purpose of environmental audio classification. Our methodology was designed to address the tradeoff between the long-term analysis of audio signals and their nonstationary characteristics. The experiments performed with a diverse database and the high classification accuracies achieved by the proposed TFM decomposition feature extraction technique clearly demonstrate its potential as a true nonstationary tool for environmental audio classification.

5. Audio Fingerprinting and Watermarking

The technologies used for the security of multimedia data include encryption, fingerprinting, and watermarking. Encryption can be used to package the content securely and enforce access rules to the protected content; if the content is not packaged securely, it can easily be copied. Encryption scrambles the content and renders it unintelligible unless a decryption key is known. However, once an authorized user has decrypted the content, encryption provides no protection to the decrypted content: it does not prevent an authorized user from making and distributing illegal copies. Watermarking and fingerprinting are two technologies that can protect the data after it has been decrypted.

A watermark is a signal that is embedded in the content to produce a watermarked content. The watermark may contain information about the owner of the content and the access conditions of the content. When a watermark is added to the content, it introduces distortion, but the watermark is added in such a way that the watermarked content is perceptually similar to the original content. The embedded watermark may be extracted using a watermark detector. Since the watermark contains information that protects the content, the watermarking technique should be robust; that is, the watermark signal should be difficult to remove without causing significant distortion to the content.

In watermarking, the embedding process adds a watermark before the content is released, so watermarking cannot be used if the content has already been released. According to Venkatachalam et al. [57], there are about 0.5 trillion copies of sound recordings in existence, and 20 billion sound recordings are added every year. This underscores the importance of securing legacy content. Fingerprinting is a technology to identify and protect legacy content. In multimedia fingerprinting, the main objective is to establish the perceptual equality of two multimedia objects: not by comparing the objects themselves, but by comparing the associated fingerprints. The fingerprints of a large number of multimedia objects, along with their associated metadata (e.g., name of artist, title, album, and copyright), are stored in a database. This database is usually maintained online and can be accessed by recording devices.

In recent years, the digital format has become the standard for the representation of multimedia content. Today's technology allows the copying and redistribution of multimedia content over the Internet at very low or no cost. This has become a serious threat for multimedia content owners. Therefore, there is significant interest in protecting the copyright ownership of multimedia content (audio, image, and video). Watermarking is the process of embedding additional data into the host signal for identifying the copyright ownership. The embedded data characterizes the owner of the data and should be extractable to prove ownership. Besides copyright protection, watermarking may be used for data monitoring, fingerprinting, and observing content manipulations.
Figure 16: The relative height of each feature represents the relative importance of the feature compared to the other features.

All watermarking techniques should satisfy a set of requirements [58]. In particular, the embedded watermark should be:

(i) imperceptible,
(ii) undetectable, to prevent unauthorized removal,
(iii) resistant to signal manipulations, and
(iv) extractable, to prove ownership.

Before a proposed technique is made public, all the above requirements should be met. In order to propose watermarking algorithms that are robust to signal manipulations, we introduced two TF signatures for audio watermarking: the instantaneous mean frequency (IMF) of the signal, and a fixed-amplitude linear or quadratic phase signal (chirp). The following sections present an overview of the two proposed methods and their performance.

5.1. IMF-Based Watermarking. We proposed a watermarking scheme using the estimated IMF of the audio signal. Our motivation for this work is to address the two important requirements of security and imperceptibility, which can be achieved using spread spectrum techniques and the IMF. In fact, the estimated IMF of the signal is examined as an optimal point of insertion of the watermark, in order to maximize its energy while achieving imperceptibility.

5.1.1. Watermarking Algorithm. Figure 17 demonstrates the watermark embedding and extraction procedure. In this figure, $s_i$ is a nonoverlapping block of the windowed signal. Based on Gabor's work on instantaneous frequency [1], Ville devised the Wigner-Ville distribution (WVD), which shows the distribution of a signal over time and frequency. The IMF of a signal was then calculated as the first moment of the WVD with respect to frequency. In this work, instead of the WVD, the spectrogram was used, which is free of cross terms and yields a positive IMF. Therefore, the IMF of a signal can be expressed as [59]

$f_i(n) = \dfrac{\sum_{f=0}^{F_m} f\,\mathrm{TFD}(n,f)}{\sum_{f=0}^{F_m} \mathrm{TFD}(n,f)}$,  (19)

where $\mathrm{TFD}(n,f)$ refers to the energy of the signal at a given time and frequency; this IMF is computed over each time window of the spectrogram. Note that in (19), $F_m$ refers to the maximum frequency of the signal, $n$ is the time index, and $f$ is the frequency index. From this we can derive an estimate of the IMF of a nonstationary signal, assuming that the IMF is constant throughout the window.
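Equation (19) can be prototyped directly from a spectrogram; the window length below is an arbitrary choice.

```python
import numpy as np
from scipy.signal import spectrogram

def instantaneous_mean_frequency(x, fs, nperseg=512):
    """First moment of the spectrogram along frequency (Eq. (19)):
    one IMF value per analysis window."""
    f, t, S = spectrogram(x, fs=fs, nperseg=nperseg)   # S: (n_freqs, n_windows)
    return t, (f[:, None] * S).sum(axis=0) / S.sum(axis=0)

# Example: a linear chirp from 500 Hz to 2 kHz; the IMF should rise accordingly
fs = 8000
n = np.arange(fs * 2)
x = np.cos(2 * np.pi * (500 + 375 * n / fs) * n / fs)
t, imf = instantaneous_mean_frequency(x, fs)
print(imf[:3], imf[-3:])
```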
The watermark message is defined as a sequence of randomly generated bits; each bit is spread using a narrowband PN sequence, then shaped using BPSK modulation and an embedding strength. The modulated watermark signal can now be defined by

$w_i = m_i\, p_n\, a_i \cos\!\left(2\pi f_i\right)$,  (20)

where $m_i$ refers to the watermark or hidden message bit before spreading, and $p_n$ is the spreading code or PN sequence, which is low-passed by a filter $h$. The FIR low-pass filter should be chosen according to the frequency characteristics of the audio signal; the cutoff frequency of the filter was chosen empirically to be 1.5 kHz. $f_i$ refers to the time-varying carrier frequency, which represents the IMF of the audio signal. The power of the carrier signal is determined by $a_i$ and is adjusted according to the frequency masking properties of the human auditory system (HAS).
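A simplified sketch of the Eq. (20) modulator follows; it omits the FIR low-pass shaping of the PN sequence and assumes one constant IMF value per block, so it is a conceptual stand-in rather than the full embedder.

```python
import numpy as np

def modulate_watermark(bits, pn, imf, fs):
    """Eq. (20)-style spreading: each message bit is BPSK-spread by the PN
    sequence and modulated on a carrier that follows the block's IMF."""
    L = len(pn)
    chips = np.repeat(bits * 2 - 1, L) * np.tile(pn, len(bits))
    n = np.arange(len(chips))
    carrier = np.cos(2 * np.pi * np.repeat(imf, L) * n / fs)
    return chips * carrier

rng = np.random.default_rng(3)
bits = rng.integers(0, 2, 25)             # 25 hidden message bits
pn = rng.choice([-1.0, 1.0], size=1000)   # PN spreading sequence (not low-passed here)
imf = np.full(25, 1000.0)                 # one IMF estimate per block, in Hz
w = 0.01 * modulate_watermark(bits, pn, imf, fs=44100)   # 0.01 is an assumed a_i
```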
In order to understand the simultaneous masking phenomenon of the HAS, consider two different scenarios of simultaneous masking. First, in the case where a narrowband noise masks a simultaneously occurring tone within the same critical band, the signal-to-mask ratio is about 5 dB. Second, in the case of a tone masking noise, the noise needs to be about 24 dB below the masker excitation level.
Figure 17: Watermark embedding and recovery using IMF.

That is, it is generally easier for a broadband noise to mask a tonal sound than for a tonal sound to mask a broadband noise. Note that in both cases, the noise and tonal sounds need to occur within the same critical band for simultaneous masking to occur. In our case, the tone- or noise-like characteristic is determined for each window of the spectrogram and not for each component in the frequency domain. We found the entropy of the signal useful in determining whether a window is best classified as tone-like or noise-like. The entropy can be expressed as

$H(n) = -\sum_{f=0}^{F_m} P\!\left(f \mid \mathrm{TFD}(n,f)\right) \log_2 P\!\left(f \mid \mathrm{TFD}(n,f)\right)$,  (21)

where

$P\!\left(f \mid \mathrm{TFD}(n,f)\right) = \dfrac{\mathrm{TFD}(n,f)}{\sum_{f=0}^{F_m} \mathrm{TFD}(n,f)}$.  (22)

Since the maximum entropy can be written as

$H_{\max}(n) = \log_2 F_m$,  (23)

we assume that if the calculated entropy is greater than half the maximum entropy, the window can be considered noise-like; otherwise it is tone-like. Based on these values, the watermark energy is then scaled by the coefficients $a_i$ such that the watermark energy will be either 24 dB or 5 dB below that of the audio signal.
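The entropy-based tone/noise decision of (21)-(23) can be sketched as follows; the small constant inside the logarithm and the window length are implementation conveniences, and the number of spectrogram bins stands in for $F_m$.

```python
import numpy as np
from scipy.signal import spectrogram

def window_mask_levels(x, fs, nperseg=512):
    """Per-window tone/noise decision via spectral entropy (Eqs. (21)-(23)):
    returns the watermark-to-signal level in dB for each window
    (-5 dB if the window is noise-like, -24 dB if tone-like)."""
    f, t, S = spectrogram(x, fs=fs, nperseg=nperseg)
    P = S / S.sum(axis=0, keepdims=True)             # Eq. (22): normalized energy
    H = -(P * np.log2(P + 1e-12)).sum(axis=0)        # Eq. (21): spectral entropy
    H_max = np.log2(S.shape[0])                      # Eq. (23), with bins as F_m
    return np.where(H > 0.5 * H_max, -5.0, -24.0)    # noise-like vs tone-like

fs = 44100
x = np.random.default_rng(4).standard_normal(fs)    # noise-like test input
print(window_mask_levels(x, fs)[:5])                 # expect -5 dB decisions
```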
In order to recover the watermark and thus the hidden message, the user needs to know the PN sequence and the IMF of the original signal. Figure 17 illustrates the message recovery operation. The decoding stage consists of a demodulation step using the IMF frequencies and a despreading step using the PN sequence.
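A matching sketch of the correlator-based recovery (demodulate with the IMF carrier, despread with the PN sequence, threshold) follows; it is a noise-free sanity check rather than the full detector with low-pass filtering and channel noise.

```python
import numpy as np

def recover_bits(received, pn, imf, fs, block_len):
    """Decode: demodulate each block with its IMF carrier, despread with the
    PN sequence, and threshold the correlation."""
    bits = []
    for k in range(len(received) // block_len):
        n = np.arange(k * block_len, (k + 1) * block_len)
        seg = received[k * block_len:(k + 1) * block_len]
        demod = seg * np.cos(2 * np.pi * imf[k] * n / fs)  # coherent demodulation
        bits.append(int(demod @ pn > 0))                    # despread and threshold
    return np.array(bits)

# Round trip with an Eq. (20)-style modulator (noise-free sanity check)
rng = np.random.default_rng(7)
fs, B, L = 44100, 25, 1000
msg = rng.integers(0, 2, B)
pn = rng.choice([-1.0, 1.0], size=L)
imf = np.full(B, 1000.0)
n = np.arange(B * L)
w = np.repeat(msg * 2 - 1, L) * np.tile(pn, B) * np.cos(2 * np.pi * 1000.0 * n / fs)
print(np.array_equal(recover_bits(w, pn, imf, fs, L), msg))   # True
```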
5.1.2. Algorithm Performance. The proposed watermarking algorithm was applied to several different music files spanning classical, pop, rock, and country music. These files were sampled at a rate of 44.1 kHz, and 25 bits were embedded into a 5 s sample of the audio signal. Figure 18 gives an overview of the watermark procedure for a voiced pop segment. As can be seen from these plots, the watermark envelope follows the shape of the music signal; as a result, the strength of the watermark increases as the amplitude of the audio signal increases.

As demonstrated in this section, the proposed IMF-based watermarking is a robust watermarking method. In the following section, the proposed chirp-based watermarking technique is introduced, which uses linear chirps as the watermark message. The motivation for using linear chirps as a TF signature is to take advantage of a chirp detector in the final stage of watermark decoding, improving the robustness of the watermarking technique and also decreasing the complexity of the watermark detection stage compared to IMF-based watermarking.
Figure 18: Overview of the watermarking procedure for a pop voiced segment ("viorg.wav").

Several robustness tests based on StirMark benchmark [60] attacks were performed on the five different audio files to examine the reliability of our algorithm against signal manipulations. In an attempt to standardize such testing, Petitcolas et al. [60] observed that many claims of robustness have been made in several papers without following the same criteria. They published a study in which 4 popular audio watermarking algorithms, three of which were submitted by companies, were exposed to several attacks; the algorithms are referred to as A, B, C, and D. For each algorithm, 6 audio segments were watermarked, and it was noted whether the watermark was completely destroyed or somewhat changed by the attacks. The summary of these results can be seen in Table 4. As can be seen from these tests, our technique offers several improvements over existing algorithms.

Table 4: Performance of the IMF-based algorithm after various attacks.

Attacks                                | Average BER | Affected algorithms in StirMark (%)
(1) None                               |    0.00     | N/A
(2) HPF (100 Hz)                       |    0.05     | A, D
(3) LPF (4 kHz)                        |    0.06     | A, C, D
(4) Resampling factor (0.5)            |    0.04     | C, D
(5) Amplitude change (±10 dB)          |    0.08     | N/A
(6) Parametric equalizer (bass boost)  |    0.13     | A, B, C, D
(7) Noise reduction (hiss removal)     |    0.02     | C, D
(8) MP3 compression                    |    0.08     | N/A

5.2. Chirp-Based Watermarking. We proposed a chirp-based watermarking scheme [61], where a linear frequency-modulated signal, known as a chirp, is embedded as the watermark message. Our motivation in chirp-based watermarking is to utilize a chirp detection tool in the postprocessing stage to compensate for bit errors that occur in embedding and extracting the watermark signal. Some recent TF-based watermarking studies include the work in [62, 63].

5.2.1. Watermark Algorithm. Figure 19 provides an overview of the chirp-based watermarking scheme for a spread spectrum watermarking algorithm. The watermark message is a 1-bit quantized amplitude version of a normalized chirp $b$ on a TF plane, with initial and final frequencies $f_{0b}$ and $f_{1b}$, respectively. Each watermark bit is spread with a secret-key-generated binary PN sequence $p$.
Figure 19: Watermark embedding and detecting scheme.

The spread spectrum signal $w_k$ appears as wideband noise and occupies the entire frequency spectrum spanned by the audio signal $x$. In order for the embedded watermark to be imperceptible, the watermark signal is perceptually shaped by a scale factor $\alpha$ and a low-pass filter. The cutoff frequency of the low-pass filter is $0.05 f_{sx}$, where $f_{sx}$ is the sampling frequency of the audio signal. The low-pass filtering step allows us to increase the value of $\alpha$ while maintaining imperceptibility; we used the empirically determined value of 0.3 for the embedding strength parameter $\alpha$.

Since the watermark bit is embedded in the low-frequency bands of the transmitted signal, we extract the watermark bit by processing the low-frequency bands of the received signal and despreading the signal using the same PN sequence used in watermark embedding. We repeat the bit estimation process outlined above for each input block until we have an estimate of all the transmitted watermark bits. While it is possible to simply combine the estimated bit sequence, we can improve the performance of the watermark extraction algorithm by postprocessing the estimated bits. Here, since we know that the embedded watermark has a chirp structure, the original watermark message can be estimated using a chirp detector.

5.2.2. Postprocessing of the Estimated Bits for Watermark Message Extraction. After all watermark bits are extracted, we first construct the TFD of the extracted watermark. The TF representation resulting from the TFD of the estimated bits can be considered as an image in the TF plane. Once we generate the image of the TF plane, a parametric line detection algorithm based on the Hough-Radon transform (HRT) searches for the presence of the straight line and estimates its parameters. The HRT is a parametric tool to detect the pixels that belong to a parametric constraint, either a line or a curve, in a gray-level image [64]. The HRT divides the Hough-Radon parameter space into cells and then calculates the accumulator value for each cell; the cell with the highest accumulator value represents the parameters of the HRT constraint. Since we are looking for the embedded chirp as a straight line in the TF plane, we can apply the HRT to detect the embedded chirp: first, the extracted watermark bits are transformed to the TF plane; then the HRT detects the line representing the chirp in the TFD. In order to achieve good detection performance, the Wigner-Ville (WV) transform is used as the TFD representation of the signal, as it provides fine TF resolution.
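To make the idea concrete, the sketch below generates a 1-bit quantized linear-chirp message, corrupts it, and recovers the chirp parameters by exhaustive matching over a coarse ($f_{0b}$, $f_{1b}$) grid. This is a crude stand-in for the HRT/DPPT detectors, sufficient only to show how the chirp structure corrects bit errors; the grid spacing and error rate are arbitrary assumptions.

```python
import numpy as np

def chirp_bits(n_bits, f0, f1, fs=1000.0):
    """1-bit quantized linear chirp used as the watermark message."""
    t = np.arange(n_bits) / fs
    phase = 2 * np.pi * (f0 * t + 0.5 * (f1 - f0) / t[-1] * t ** 2)
    return (np.cos(phase) >= 0).astype(int)

def detect_chirp(noisy_bits, candidates, fs=1000.0):
    """Stand-in for the HRT/DPPT step: pick the (f0, f1) pair whose clean
    chirp agrees with the most extracted bits, then regenerate the message."""
    n = len(noisy_bits)
    best = max(candidates,
               key=lambda c: np.sum(chirp_bits(n, *c, fs) == noisy_bits))
    return best, chirp_bits(n, *best, fs)

msg = chirp_bits(176, f0=50.0, f1=200.0)
noisy = msg.copy()
flips = np.random.default_rng(5).choice(176, size=30, replace=False)
noisy[flips] ^= 1                                   # roughly 17% bit errors
grid = [(a, b) for a in range(0, 500, 25) for b in range(0, 500, 25)]
params, clean = detect_chirp(noisy, grid)
print(params, np.mean(clean == msg))                # typically (50, 200), 1.0
```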
5.2.3. Technique Evaluation. We implemented the time-domain spread spectrum watermarking algorithm to embed and extract the watermark. A sampling frequency of $f_{sb}$ = 1 kHz was used to generate the watermark signals; therefore, the initial and final frequencies $f_{0b}$ and $f_{1b}$ of the linear chirps representing all watermark messages are constrained to [0-500] Hz. As host signals, we used five different audio files with $f_{sx}$ = 44.1 kHz and 16 bits/sample quantization. These sample audio files represent rock, classical, harp, piano, and pop music, respectively. We embedded watermark messages into audio signals of 40 s duration for a chip length of 10,000 samples per watermark bit (corresponding to an embedding rate of 4.41 bps) and into audio signals of 20 s duration for a chip length of 5,000 samples per watermark bit (corresponding to an embedding rate of 8.82 bps). In both cases, these values result in 176-bit-long chirp sequences.

To measure the robustness of the watermarking algorithm, we performed 8 signal manipulation tests, which represent commonly used signal processing techniques. Table 5 shows the BER results, expressed as a percentage of the total number of watermark bits, for the two chip lengths and for each signal manipulation operation.

In all the robustness tests performed, the HRT was able to extract the watermark message parameters correctly, even in the worst-case scenario.

Table 5: Bit error rate (in percentage) for 5 different music signals under different signal manipulations.

Robustness test                     |  S1  |  S2  |  S3  |  S4  |  S5
No signal manipulation              | 1.14 | 0.57 | 0.00 | 0.57 | 0.00
MP3 128 kbps                        | 1.14 | 0.57 | 0.00 | 1.70 | 0.00
MP3 80 kbps                         | 1.14 | 0.57 | 0.00 | 1.70 | 0.00
4 kHz low-pass filtering            | 3.42 | 3.42 | 1.14 | 5.68 | 1.70
Resampling at 22.05 kHz             | 3.98 | 3.42 | 2.27 | 3.98 | 1.14
Amplitude scaling                   | 1.14 | 0.57 | 0.00 | 0.57 | 0.00
Inversion                           | 1.14 | 0.57 | 0.00 | 0.57 | 0.00
Addition of delayed signal          | 1.14 | 0.57 | 0.00 | 1.14 | 0.57
Additive noise                      | 2.27 | 2.84 | 1.70 | 2.27 | 1.14
Embedding multiple (two) watermarks | 2.27 | 2.84 | 1.70 | 2.27 | 1.14

Table 6: Performance comparison of the FEC-based postprocessing schemes and the DPPT-based technique under Checkmark benchmark attacks [70] for 10 images. REP and BCH are the error correction methods.

Attacks             | DPPT | REP | BCH (7,63)
Remodulation (4)    |  95  |  58 |  65
MAP (6)             | 100  |  97 | 100
Copy (1)            | 100  |  90 | 100
Wavelet (10)        |  98  |  90 |  92
JPEG (12)           | 100  | 100 | 100
ML (7)              |  79  |  57 |  67
Filtering (3)       | 100  | 100 | 100
Resampling (1)      | 100  | 100 | 100
Color reduce (2)    |  75  |  65 |  70
Total detection (%) |  95  |  85 |  89

The experiments showed that the HRT-based postprocessing is able to estimate the correct watermark message up to a BER of 20%, whereas the maximum BER reported in Table 5 was about 6%. The proposed chirp-based watermarking using the HRT as a postprocessing step offers robust watermark extraction performance; however, calculating the WVD and taking the HRT of the resulting WVD has a high complexity of at most $O(N^2 \log_2(N)) + O(N^3)$, where $N$ is the length of the chirp. In order to decrease the complexity of the postprocessing stage, we could use the discrete polynomial phase transform (DPPT) [65] as a faster chirp estimator to estimate the watermark message. The DPPT is a parametric signal analysis approach for estimating the phase parameters of constant-amplitude polynomial phase signals. The DPPT operates directly on the signal in the time domain and is computationally efficient compared to the HRT; its complexity is $O(N\log_2(N))$.

The proposed chirp-based watermark representation is fundamentally generic and inherently flexible for embedding and extraction purposes, such that it can be embedded and extracted in any domain. Accordingly, we can embed the chirp sequence into audio or image signals using any of the methods in [66, 67]. For example, if we were to use the algorithm developed in [68], we would embed the chirp sequence into the Fourier coefficients. At the receiver, we extract the chirp sequence, which is likely to have some bits in error. We then input the extracted chirp sequence to the HRT- or DPPT-based postprocessing stage to detect the slope of the chirp.

Table 6 presents the results of the chirp-based watermarking using the DPPT for images in the discrete cosine transform (DCT) domain [69]. As observed in this table, the robustness of the watermarking scheme is satisfactory.

Although the proposed chirp-based watermark representation is not a classical forward error correction (FEC) code, an analogy can be made between FEC codes and this new representation, as they both introduce performance improvements at the expense of code redundancy. FEC codes have been commonly used in watermarking to reduce the bit error rate (BER) in order to achieve the desired BER performance. The FEC codes most commonly used for audio watermarking are Bose-Chaudhuri-Hocquenghem (BCH) codes and repetition codes. Table 6 compares the performance of the chirp-based watermarking using the DPPT chirp detector, repetition coding, and BCH coding; all codes have a redundancy value of about 11/12. The chirp-based watermarking offers a higher amount of BER correction than the repetition and BCH coding.

6. Summary

In this paper we presented a stage-by-stage implementation and analysis of three important audio processing tasks, namely, (1) audio compression, (2) audio classification, and (3) securing audio content, using TF approaches. The proposed TF methodologies are best suited for analyzing highly nonstationary audio signals. Although the audio compression results were not on par with the state-of-the-art coders, we introduced a novel way of performing audio compression. Moreover, the proposed coder is not as refined as the state-of-the-art commercial coders, which to some extent explains its poorer performance. A content-based audio retrieval application was presented to explain the basic blocks of audio classification. TF features were extracted from the music signals and were segregated into 6 groups using a pattern classifier. High classification accuracies of >90% (cross-validated) were reported. We proposed a novel methodology to extract TF features for the purpose of environmental audio classification and called the developed technique TFM decomposition feature extraction. The features obtained from ten different environmental audio classes were fed into a multigroup classifier, and a classification accuracy of 85% was achieved, which was 10% higher than with the classical features.
Furthermore, highlights of the two proposed TF-signature watermarking schemes were presented. The IMF of the signal, a nonlinear TF signature, was used as the watermark carrier, and linear chirps were embedded as the watermark message, with the HRT used as a chirp detector in the postprocessing stage to compensate for the BERs in the estimated watermark signal. The method could correct errors up to a BER of 20%, and the robustness results were satisfactory. Since the HRT has high complexity and the postprocessing stage was time consuming, we used the DPPT instead of the HRT in postprocessing. The DPPT-based postprocessing was applied to chirp-based image watermarking. Due to the error correction property of the chirp-based watermarking, we also compared it with two well-known FEC schemes; it was shown that the chirp-based watermarking offered higher BER correction than the repetition and BCH coding.

Acknowledgments

Portions of this paper are reprinted, with permission, from [39, 69, 71] (© 2005, 2006, 2007 IEEE); some are reproduced from the open-access article [25]; others are reproduced from the book chapter [26]; and portions are reproduced from [72].

References

[1] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, New York, NY, USA, 1998.
[2] S. G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3397-3415, 1993.
[3] L. Cohen, "Time-frequency distributions—a review," Proceedings of the IEEE, vol. 77, no. 7, pp. 941-981, 1989.
[4] R. Gribonval, "Fast matching pursuit with a multiscale dictionary of Gaussian chirps," IEEE Transactions on Signal Processing, vol. 49, no. 5, pp. 994-1001, 2001.
[5] H. Choi and W. J. Williams, "Improved time-frequency representation of multicomponent signals using exponential kernels," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 6, pp. 862-871, 1989.
[6] I. Daubechies, "Wavelet transform, time-frequency localization and signal analysis," IEEE Transactions on Information Theory, vol. 36, no. 5, pp. 961-1005, 1990.
[7] Z. K. Peng, P. W. Tse, and F. L. Chu, "An improved Hilbert-Huang transform and its application in vibration signal analysis," Journal of Sound and Vibration, vol. 286, no. 1-2, pp. 187-205, 2005.
[8] L. Cohen and T. E. Posch, "Positive time-frequency distribution functions," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 1, pp. 31-38, 1985.
[9] H. Deshpande, R. Singh, and U. Nam, "Classification of music signals in the visual domain," in Proceedings of the COST G-6 Conference on Digital Audio Effects, 2001.
[10] B. Tacer and P. Loughlin, "Time-frequency-based classification," in Advanced Signal Processing Algorithms, Architectures, and Implementations VI, vol. 2846 of Proceedings of SPIE, pp. 186-192, Denver, Colo, USA, August 1996.
[11] I. Paraskevas and E. Chilton, "Audio classification using acoustic images for retrieval from multimedia databases," in Proceedings of the 4th EURASIP Conference Focused on Video/Image Processing and Multimedia Communications, vol. 1, pp. 187-192, July 2003.
[12] S. Esmaili, S. Krishnan, and K. Raahemifar, "Content based audio classification and retrieval using joint time-frequency analysis," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. V-665-V-668, Montreal, Canada, May 2004.
[13] B. Wan and M. D. Plumbley, "Musical audio stream separation by non-negative matrix factorization," in Proceedings of the Digital Music Research Network Summer Conference (DMRN '05), Glasgow, UK, 2005.
[14] P. Smaragdis, "Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs," in Proceedings of the 5th International Conference on Independent Component Analysis and Blind Signal Separation (ICA '04), vol. 3195 of Lecture Notes in Computer Science, pp. 494-499, Granada, Spain, September 2004.
[15] A. Holzapfel and Y. Stylianou, "Musical genre classification using nonnegative matrix factorization-based features," IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 2, Article ID 4432640, pp. 424-434, 2008.
[16] S. Krishnan and B. Ghoraani, "A joint time-frequency and matrix decomposition feature extraction methodology for pathological voice classification," EURASIP Journal on Advances in Signal Processing, vol. 2009, Article ID 928974, 11 pages, 2009.
[17] N. Shams, B. Ghoraani, and S. Krishnan, "Audio feature clustering for hearing aid systems," in Proceedings of IEEE Toronto International Conference: Science and Technology for Humanity (TIC-STH '09), pp. 976-980, September 2009.
[18] B. Ghoraani and S. Krishnan, "Quantification and localization of features in time-frequency plane," in Proceedings of IEEE Canadian Conference on Electrical and Computer Engineering (CCECE '08), pp. 1207-1210, May 2008.
[19] D. Groutage and D. Bennink, "Feature sets for nonstationary signals derived from moments of the singular value decomposition of Cohen-Posch (positive time-frequency) distributions," IEEE Transactions on Signal Processing, vol. 48, no. 5, pp. 1498-1503, 2000.
[20] D. Lee and H. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems 13, pp. 556-562, 2000.
[21] M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons, "Algorithms and applications for approximate nonnegative matrix factorization," Computational Statistics and Data Analysis, vol. 52, no. 1, pp. 155-173, 2007.
[22] I. Buciu, "Non-negative matrix factorization, a new tool for feature extraction: theory and applications," International Journal of Computers, Communications and Control, vol. 3, pp. 67-74, 2008.
[23] C.-J. Lin, "Projected gradient methods for nonnegative matrix factorization," Neural Computation, vol. 19, no. 10, pp. 2756-2779, 2007.
[24] T. Painter and A. Spanias, "Perceptual coding of digital audio," Proceedings of the IEEE, vol. 88, no. 4, pp. 451-512, 2000.
[25] K. Umapathy and S. Krishnan, "Perceptual coding of audio signals using adaptive time-frequency transform," EURASIP Journal on Audio, Speech and Music Processing, vol. 2007, Article ID 51563, 14 pages, 2007.
[26] K. Umapathy and S. Krishnan, Audio Coding and Classification: Principles and Algorithms, in Mobile Multimedia Broadcasting Multi-Standards, Springer, San Diego, Calif, USA, 2009.
EURASIP Journal on Advances in Signal Processing 27

[27] K. Brandenburg and M. Bosi, “MPEG-2 advanced audio coding: overview and applications,” in Proceedings of the 103rd Audio Engineering Society Convention, New York, NY, USA, 1997, Preprint 4641.
[28] E. Eberlein and H. Popp, “Layer-3, a flexible coding standard,” in Proceedings of the 94th Audio Engineering Society Convention, Berlin, Germany, March 1993, Preprint 3493.
[29] J. Herre, “Second generation ISO/MPEG audio layer-3 coding,” in Proceedings of the 98th Audio Engineering Society Convention, Paris, France, February 1995.
[30] ISO/IEC JTC1/SC29/WG11, “Overview of the MPEG-4 standard,” International Organisation for Standardisation, March 2002.
[31] http://www.iis.fraunhofer.de/amm/techinf/index.html.
[32] S. Meltzer and G. Moser, “MPEG-4 HE-AAC v2—audio coding for today’s digital media world,” EBU Technical Review, no. 305, pp. 37–48, 2006.
[33] S. J. Orfanidis, Introduction to Signal Processing, Prentice-Hall, New Jersey, USA, 1996.
[34] T. Ryden, “Using listening tests to assess audio codecs,” in Collected Papers on Digital Audio Bit-Rate Reduction, pp. 115–125, AES, 1996.
[35] M. M. Goodwin, Adaptive Signal Models: Theory, Algorithms and Audio Applications, Kluwer Academic Publishers, Norwell, Mass, USA, 1998.
[36] S. Krstulović and R. Gribonval, “MPTK: matching pursuit made tractable,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’06), vol. 3, pp. 496–499, May 2006.
[37] J. P. Campbell Jr., “Speaker recognition: a tutorial,” Proceedings of the IEEE, vol. 85, no. 9, pp. 1437–1462, 1997.
[38] L. Lu, H.-J. Zhang, and H. Jiang, “Content analysis for audio classification and segmentation,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 7, pp. 504–516, 2002.
[39] K. Umapathy, S. Krishnan, and S. Jimaa, “Multigroup classification of audio signals using time-frequency parameters,” IEEE Transactions on Multimedia, vol. 7, no. 2, pp. 308–315, 2005.
[40] G. Guo and S. Z. Li, “Content-based audio classification and retrieval by support vector machines,” IEEE Transactions on Neural Networks, vol. 14, no. 1, pp. 209–215, 2003.
[41] G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, 2002.
[42] C. J. C. Burges, J. C. Platt, and S. Jana, “Distortion discriminant analysis for audio fingerprinting,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 3, pp. 165–174, 2003.
[43] J.-L. Dugelay, J.-C. Junqua, C. Kotropoulos, R. Kuhn, F. Perronnin, and I. Pitas, “Recent advances in biometric person authentication,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 4060–4063, May 2002.
[44] M. Cooper and J. Foote, “Summarizing popular music via structural similarity analysis,” in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 127–130, 2003.
[45] C. Xu, N. C. Maddage, and X. Shao, “Automatic music classification and summarization,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 441–450, 2005.
[46] H.-G. Kim, N. Moreau, and T. Sikora, “Audio classification based on MPEG-7 spectral basis representations,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 716–725, 2004.
[47] H. Soltau, T. Schultz, M. Westphal, and A. Waibel, “Recognition of music type,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1137–1140, May 1998.
[48] B. C. J. Moore, An Introduction to the Psychology of Hearing, Academic Press, Toronto, Canada, 1992.
[49] E. Allamanche, J. Herre, O. Hellmuth, B. Froba, T. Kastner, and M. Cremer, “Content-based identification of audio material using MPEG-7 low level description,” in Proceedings of the 2nd Annual International Symposium on Music Information Retrieval, pp. 197–204, October 2001.
[50] SPSS Inc., “SPSS advanced statistics user’s guide,” in User Manual, SPSS Inc., Chicago, Ill, USA, 1990.
[51] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, San Diego, Calif, USA, 1990.
[52] C. Panagiotakis and G. Tziritas, “A speech/music discriminator based on RMS and zero-crossings,” IEEE Transactions on Multimedia, vol. 7, no. 1, pp. 155–166, 2005.
[53] Microsoft, http://research.microsoft.com/.
[54] G. Freeman, R. Dony, and S. Areibi, “Audio environment classification for hearing aids using artificial neural networks with windowed input,” in Proceedings of IEEE Symposium on Computational Intelligence in Image and Signal Processing, pp. 183–188, Honolulu, Hawaii, April 2007.
[55] S. Chu, S. Narayanan, and C.-C. J. Kuo, “Environmental sound recognition using MP-based features,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’08), pp. 1–4, March–April 2008.
[56] P. O. Hoyer, “Non-negative matrix factorization with sparseness constraints,” Journal of Machine Learning Research, vol. 5, pp. 1457–1469, 2004.
[57] V. Venkatachalam, L. Cazzanti, N. Dhillon, and M. Wells, “Automatic identification of sound recordings,” IEEE Signal Processing Magazine, vol. 21, no. 2, pp. 92–99, 2004.
[58] M. Arnold, “Audio watermarking: features, applications and algorithms,” in Proceedings of IEEE International Conference on Multimedia and Expo (ICME ’00), pp. 1013–1016, August 2000.
[59] S. Krishnan, “Instantaneous mean frequency estimation using adaptive time-frequency distributions,” in Proceedings of the Canadian Conference on Electrical and Computer Engineering, pp. 141–146, May 2001.
[60] F. A. P. Petitcolas et al., “StirMark benchmark: audio watermarking attacks,” in Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC ’01), pp. 49–55, April 2001.
[61] S. Erküçük, S. Krishnan, and M. Zeytinoǧlu, “A robust audio watermark representation based on linear chirps,” IEEE Transactions on Multimedia, vol. 8, no. 5, pp. 925–936, 2006.
[62] S. Stanković, I. Orović, and N. Žarić, “Robust speech watermarking procedure in the time-frequency domain,” EURASIP Journal on Advances in Signal Processing, vol. 2008, Article ID 519206, 9 pages, 2008.
[63] S. Stanković, I. Orović, and N. Žarić, “An application of multidimensional time-frequency analysis as a base for the unified watermarking approach,” IEEE Transactions on Image Processing, vol. 19, no. 3, pp. 736–745, 2010.
[64] R. Rangayyan and S. Krishnan, “Feature identification in the time-frequency plane by using the Hough-Radon transform,” Pattern Recognition, vol. 34, pp. 1147–1158, 2001.
[65] L. Lam, S. Krishnan, and B. Ghoraani, “Discrete polynomial transform for digital image watermarking application,” in Proceedings of IEEE International Conference on Multimedia and Expo (ICME ’06), pp. 1569–1572, July 2006.
[66] W.-N. Lie and L.-C. Chang, “Robust and high-quality time-domain audio watermarking subject to psychoacoustic masking,” in Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS ’01), pp. 45–48, May 2001.
[67] M. D. Swanson, B. Zhu, and A. H. Tewfik, “Current state of the art, challenges and future directions for audio watermarking,” in Proceedings of the 6th International Conference on Multimedia Computing and Systems, vol. 1, pp. 19–24, June 1999.
[68] J. W. Seok and J. W. Hong, “Audio watermarking for copyright protection of digital audio data,” Electronics Letters, vol. 37, no. 1, pp. 60–61, 2001.
[69] B. Ghoraani and S. Krishnan, “Chirp-based image watermarking as error-control coding,” in Proceedings of the International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP ’06), pp. 647–650, December 2006.
[70] S. Pereira, S. Voloshynovskiy, M. Madueno, S. Marchand-Maillet, and T. Pun, “Second generation benchmarking and application oriented evaluation,” in Proceedings of the Information Hiding Workshop III, Pittsburgh, Pa, USA, April 2001.
[71] K. Umapathy, S. Krishnan, and R. K. Rao, “Audio signal feature extraction and classification using local discriminant bases,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 4, pp. 1236–1246, 2007.
[72] S. Esmaili, S. Krishnan, and K. Raahemifar, “Audio watermarking using time-frequency characteristics,” Canadian Journal of Electrical and Computer Engineering, vol. 28, no. 2, pp. 57–61, 2003.