First Research Paper
1 Introduction
According to the ANSI definition, timbre depends primarily upon the spectrum of the stimulus,
but also upon the waveform, the sound pressure, the frequency location of the
spectrum, and the temporal characteristics of the stimulus [2], [5]. Still, musical
sounds must be very carefully parameterized to allow automatic timbre
recognition.
So far, there is no standard parameterization used as a classification basis.
The sound descriptors applied are based on various methods of analysis in the
time domain, spectral domain, time-frequency domain, and cepstrum, with the
Discrete Fourier Transform (DFT), usually computed via the Fast Fourier
Transform (FFT), being the most common tool for spectral analysis. Wavelet
analysis is also gaining increasing interest for sound analysis and
representation, especially for musical sounds.
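For illustration, a minimal sketch of such spectral and cepstral analysis of a single frame is given below; the windowing choice, frame length, and sample rate are our own assumptions, not a prescribed procedure.

```python
import numpy as np

def log_spectrum_and_cepstrum(frame):
    """FFT log-magnitude spectrum and real cepstrum of one analysis frame
    (assumed to be a 1-D array of mono float samples)."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))  # spectral analysis via FFT
    log_spec = np.log(spec + 1e-12)                             # guard against log(0)
    cepstrum = np.fft.irfft(log_spec)                           # real cepstrum: inverse FFT of log spectrum
    return log_spec, cepstrum

# hypothetical usage on a 440 Hz tone sampled at 44.1 kHz
sr = 44100
frame = np.sin(2 * np.pi * 440 * np.arange(8192) / sr)
log_spec, cep = log_spectrum_and_cepstrum(frame)
```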
Researchers have explored various statistical summaries to describe the signatures
of musical instruments, based on feature vectors or matrices such as
tristimulus parameters, brightness, spectral irregularity, etc. [6], [14],
[21]. Flattening these features for traditional classifiers increases the
dimensionality of the feature space. In [16], the authors used a new set of
features jointly with other features popular in musical instrument
identification. They built a database of musical instrument sounds for
training a number of classifiers, which are used by the MIRAI system to
identify musical instruments in polyphonic sounds.
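As an illustration of such descriptors, the following sketch computes tristimulus, brightness, and spectral irregularity from the amplitudes of detected harmonic peaks. The exact formulas vary between authors, so these are common variants rather than the definitions used in [6], [14], [21], and the harmonic amplitudes are hypothetical.

```python
import numpy as np

def tristimulus(a):
    """Tristimulus parameters from harmonic amplitudes a[0] = 1st harmonic, ...
    (one common variant: energy share of harmonic 1, harmonics 2-4, and the rest)."""
    total = np.sum(a)
    return a[0] / total, np.sum(a[1:4]) / total, np.sum(a[4:]) / total

def brightness(a):
    """Amplitude-weighted mean harmonic number (a simple brightness measure)."""
    n = np.arange(1, len(a) + 1)
    return np.sum(n * a) / np.sum(a)

def irregularity(a):
    """Spectral irregularity: squared differences between neighbouring harmonics."""
    return np.sum((a[:-1] - a[1:]) ** 2) / np.sum(a ** 2)

# hypothetical harmonic amplitudes of one frame
a = np.array([1.0, 0.6, 0.4, 0.25, 0.15, 0.1, 0.05])
print(tristimulus(a), brightness(a), irregularity(a))
```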
MIRAI is designed as a web-based storage and retrieval system which
can automatically index musical input (of polyphonic, polytimbral type),
transforming it into a database, and answer queries requesting specific musical
pieces, see http://www.mir.uncc.edu/. When MIRAI receives a musical
waveform, it divides the waveform into segments of equal size, and the
classifiers incorporated into the system identify the most dominant musical
instruments and the emotions associated with each segment. A database of
musical instrument sounds describing about 4,000 sound objects by more than
1,100 features is associated with MIRAI. Each sound object is represented as
a temporal sequence of approximately 150-300 tuples, which gives a temporal
database of more than 1,000,000 tuples, each represented as a vector of
about 1,100 features. This database is mainly used to train classifiers for
automatic indexing of musical instrument sounds. It is semantically rich enough
(in terms of successful sound separation and recognition) that the constructed
classifiers achieve a high level of accuracy in recognizing the dominant
musical instrument and/or its type when the music is polyphonic. Unfortunately,
the loss of information on non-dominant instruments in the sound separation
algorithm, due to the overlap of sound features, may significantly lower the
recognition confidence for the remaining instruments in a polyphonic sound.
This paper shows that by identifying a weighted set of dominating instruments
in a sequence of overlapping frames and using a special voting strategy, we can
improve the overall confidence of the indexing strategy for polyphonic music,
and thereby improve the precision and recall of the MIRAI retrieval engine.
[Figure: flow chart of timbre estimation — polyphonic sound → get pitch / get spectrum → power spectrum → timbre estimation]
Feature-based datasets are easier and more efficient for classifiers to work
with; however, there is usually information loss during the feature extraction
process. A feature is an abstract or compressed representation of the waveform or
spectrum, such as harmonic peaks, MFCC (Mel-Frequency Cepstral Coefficients),
zero-crossing rate, and so on. In the case of monophonic music sound
estimation tasks with only single, non-layered sounds, the features can be
easily extracted and identified. However, this is not the case for polyphonic,
polytimbral sound. It is difficult or even impossible to extract distinct,
clear features representing a single instrument from a polyphonic sound, because
of the overlapping of the signals and their spectra, especially when instruments
have similar patterns in their feature space.
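A minimal sketch of two such compressed features (zero-crossing rate and a crude harmonic-peak picker) is given below; the frame length, windowing, and peak-picking rule are illustrative assumptions, and the test tone is synthetic.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    signs = np.signbit(frame)
    return np.mean(signs[:-1] != signs[1:])

def harmonic_peaks(frame, sr, n_peaks=5):
    """Frequencies of the n_peaks strongest local maxima in the magnitude spectrum
    (a crude stand-in for harmonic-peak extraction)."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    local_max = (mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:])  # bins above both neighbours
    idx = np.where(local_max)[0] + 1
    strongest = idx[np.argsort(mag[idx])[::-1][:n_peaks]]
    return np.sort(np.fft.rfftfreq(len(frame), 1.0 / sr)[strongest])

sr = 44100
t = np.arange(8192) / sr
frame = sum(a * np.sin(2 * np.pi * f * t)               # hypothetical 3-harmonic tone
            for a, f in [(1.0, 262), (0.5, 524), (0.25, 786)])
print(zero_crossing_rate(frame), harmonic_peaks(frame, sr, 3))
```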
If only one (the sole or dominating) instrument playing in each frame of the music
sound is reported, then information about other possibly contributing
instruments is lost.
In fact, it is common for polyphonic music to have multiple instruments
playing simultaneously, which means that in each frame there are
representations of multiple timbres in the signal. Providing only one
candidate yields the predominant timbre while ignoring the remaining timbre
information. Also, there may be no dominating timbre in a frame at all, when
all instruments play equally loud; the classifier then has to randomly choose
one of the equally likely candidates. In order to solve this problem, we
introduce the Top-N winner strategy, which produces multiple candidates for each
evaluated frame, as sketched below.
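The following sketch illustrates the idea; the per-frame confidence dictionaries, the value of N, and the confidence-weighted voting are illustrative assumptions rather than the exact MIRAI implementation.

```python
from collections import defaultdict

def top_n_candidates(frame_scores, n=3):
    """Return the n instrument labels with the highest classifier confidence for one frame."""
    return sorted(frame_scores, key=frame_scores.get, reverse=True)[:n]

def vote_over_frames(per_frame_scores, n=3):
    """Weighted voting across overlapping frames: each frame contributes its
    top-n candidates, weighted by their confidences; instruments are ranked
    by accumulated weight."""
    totals = defaultdict(float)
    for scores in per_frame_scores:
        for label in top_n_candidates(scores, n):
            totals[label] += scores[label]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# hypothetical per-frame confidences from a timbre classifier
frames = [
    {"trumpet": 0.55, "piano": 0.40, "flute": 0.05},
    {"trumpet": 0.35, "piano": 0.45, "flute": 0.20},
    {"trumpet": 0.50, "piano": 0.48, "flute": 0.02},
]
print(vote_over_frames(frames, n=2))
```

Instead of keeping only the single winner per frame, each frame contributes its N best-supported instruments, so evidence for non-dominant instruments can accumulate across frames.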
The fact that discriminating one instrument from another depends on finer
details of the raw signal leads to another approach to pattern recognition:
directly detecting distinct patterns of instruments in a lower-level
representation of the signal, such as the power spectrum. Fig. 2 shows these two
different ways of pattern recognition.
Figure 3 shows the power spectra of trumpet, piano, and the mixture of
those two instruments. As we can see, the spectrum of the mixture preserves part
of the pattern of each single instrument.
Fig. 3. Power spectrum of trumpet, piano and their mixture; frequency axis is in
linear scale, whereas amplitude axis is in log [dB] scale
A similar preservation of spectral properties is also observed, e.g., for
flute, trombone, and their mixture, as Figure 4 shows.
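This effect can be reproduced on synthetic data. In the sketch below, two harmonic tones (hypothetical stand-ins for the flute and trombone sounds, with made-up fundamentals and harmonic amplitudes) are mixed, and the strongest spectral peaks of the mixture include peaks of both sources.

```python
import numpy as np

sr, n = 44100, 8192
t = np.arange(n) / sr

def harmonic_tone(f0, amps):
    """Synthetic harmonic tone used as a stand-in for an instrument sound."""
    return sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t) for k, a in enumerate(amps))

def db_spectrum(x):
    """Log power spectrum of one windowed frame, in dB."""
    return 10 * np.log10(np.abs(np.fft.rfft(x * np.hanning(n))) ** 2 + 1e-12)

flute_like = harmonic_tone(440.0, [1.0, 0.3, 0.1])              # hypothetical "flute"
trombone_like = harmonic_tone(146.8, [1.0, 0.8, 0.6, 0.4, 0.3])  # hypothetical "trombone"
mixture = flute_like + trombone_like

freqs = np.fft.rfftfreq(n, 1.0 / sr)
for name, sig in [("flute", flute_like), ("trombone", trombone_like), ("mix", mixture)]:
    spec = db_spectrum(sig)
    peaks = freqs[np.argsort(spec)[-8:]]    # strongest spectral bins
    print(name, np.sort(peaks).round(1))
```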
In order to index polyphonic sound, we need to detect the instrument
information in each small slice of the music sound. Such detection is hardly
feasible directly in the time domain. Therefore, in our experiments, we have
examined the short-term spectrum space, calculated via the short-time Fourier transform.
In order to represent the short-term spectrum accurately, with high resolution
along the frequency axis and thus allow more precise pattern matching, a long
analysis frame of 8192 samples was chosen. The Fourier transform performed
on these frames describes the frequency space for each slice (or frame). Instead of
parameterizing the spectrum (or the time-domain signal) and extracting a few
dozen features, we match the raw power-spectrum vectors directly.
Fig. 5. Sub-patterns of single instruments in the mixture sound slice for flute,
trombone, and their mix
It is hard for a small set of extracted features to yield good classification
models, and any classification model itself stands for some sort of abstraction,
which conflicts with an information-preserving strategy. However, one of the most
fundamental and simple classification methods, the K-Nearest Neighbor (KNN)
algorithm, needs no prior knowledge about the distribution of the data and
seems to be an appropriate classifier for numeric spectrum vectors.
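A minimal sketch of this step is given below; the Euclidean distance, the value of k, and the randomly generated training spectra are illustrative assumptions, standing in for labelled power spectra of 8192-sample frames from the training database.

```python
import numpy as np

def knn_classify(query_spectrum, train_spectra, train_labels, k=5):
    """Classify one power-spectrum vector by majority vote among its k nearest
    training spectra (Euclidean distance over the spectrum bins)."""
    dists = np.linalg.norm(train_spectra - query_spectrum, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(np.asarray(train_labels)[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# hypothetical data: 4097-bin power spectra (8192-sample frames, rfft) per labelled frame
rng = np.random.default_rng(0)
train_spectra = rng.random((100, 4097))
train_labels = ["trumpet"] * 50 + ["piano"] * 50
query = rng.random(4097)
print(knn_classify(query, train_spectra, train_labels, k=5))
```

For the Top-N winner strategy, the same neighbour search can return the label frequencies among the k neighbours as per-frame confidences instead of a single majority label.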
[Flow chart: polyphonic sound → get frame → FFT power spectrum → classifier (built from training data) → timbre estimation → instrument candidates]
Fig. 6. Flow chart of music instrument recognition system with new strategy
7 Conclusion
We have provided a new solution to an important problem of instrument
identification in polyphonic music: the loss of information on non-dominant
instruments caused by the overlap of sound features. By identifying a weighted
set of dominating instruments in a sequence of overlapping frames and applying
a special voting strategy, the confidence of indexing polyphonic music, and
consequently the precision and recall of the MIRAI retrieval engine, are improved.
Acknowledgments
This work was supported by the National Science Foundation under grant IIS-
0414815, and also by the Research Center of PJIIT, supported by the Polish
National Committee for Scientific Research (KBN).
We are grateful to Dr. Xin Zhang for many helpful discussions and for comments
which improved the quality and readability of the paper.
References
1. Agostini G, Longari M, Pollastri E (2001) Content-Based Classification of Musical
Instrument Timbres. International Workshop on Content-Based Multimedia Indexing
2. American National Standards Institute (1973) American national standard:
Psychoacoustical terminology. ANSI S3.20-1973
3. Aniola P, Lukasik E (2007) JAVA Library for Automatic Musical Instruments
Recognition. AES 122 Convention, Vienna, Austria
4. Brown JC (1999) Computer identification of musical instruments using pattern
recognition with cepstral coefficients as features. J. Acoust. Soc. Am. 105, 1933–1941
5. Fitzgerald R, Lindsay A (2004) Tying semantic labels to computational
descriptors of similar timbres. Sound and Music Computing '04
6. Fujinaga I, McMillan K (2000) Real Time Recognition of Orchestral Instru-
ments. International Computer Music Conference
7. Herrera P, Amatriain X, Batlle E, Serra X (2000) Towards instrument segmen-
tation for music content description: a critical review of instrument classification
techniques. International Symposium on Music Information Retrieval ISMIR
8. ISO/IEC JTC1/SC29/WG11 (2004) MPEG-7 Overview. Available at
http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm