Music Database Retrieval Based On Spectral Similarity
Cheng Yang
Department of Computer Science
Stanford University
yangc@cs.stanford.edu
(Footnote: Polyphony refers to the scenario where multiple notes occur at the same time, possibly by different instruments or vocal sounds. As we know, most music pieces are polyphonic.)
Two music pieces are considered similar if they are fully or partially based on the same score, even if they are performed by different people or at different tempo.

In the next section we will discuss some previous work in this area. Section 3 presents our algorithm to detect music similarity. Section 4 gives experimental results, and future directions will be discussed in Section 5.

[Figure: spectrogram of a music clip, frequency (Hz, 0-4000) vs. time]
[Figure: frequency components of a single note (middle G): intensity vs. frequency, 0-4000 Hz]

[Figure 4. Instantaneous power as a function of time for a 40-second sound clip of Tchaikovsky's Piano Concerto No. 1 (power vs. time in seconds)]

[Figure 5. True peak vs. bogus peak: a power plot with four local maxima labeled A, B, C and D]
A single note consists of a fundamental frequency plus integer multiples of it (its harmonics). Fundamental frequency corresponds to the pitch (middle G in this case), and the pattern of harmonics depends on the characteristics of the musical instrument that plays it.

When multiple notes occur at the same time ("polyphony"), their frequency components add. Figure 3(a)-(c) show the frequency components of C, E and G played individually, while Figure 3(d) shows that of all three notes played together. In this simple example it is still possible to design algorithms to extract individual pitches from the chord signal C-E-G, but in actual music recordings many more notes co-exist, played by many different instruments whose patterns of harmonics we do not know. In addition, there are sounds produced by percussion instruments, human voice, and noise. The task of automatic transcription of music from arbitrary audio data (i.e., conversion from raw audio format into MIDI) thus becomes extremely difficult, and remains unsolved today. Our algorithm, as in most other music retrieval systems, does not attempt to do transcription.

[Figure 3. Illustration of polyphony: frequency components (0-4000 Hz) of (a) C, (b) E and (c) G played individually, and (d) all three notes played together]

3.2 The Algorithm

The algorithm consists of three components, which are discussed separately.

1. Intermediate Data Generation.

For each music piece, we generate its spectrogram as discussed in Section 3.1 and plot its instantaneous power as a function of time. Figure 4 shows such a power plot for a 40-second sound clip of Tchaikovsky's Piano Concerto No. 1. Next, we identify peaks in this power plot, where a peak is defined as a local maximum value within a neighborhood of a fixed size. This definition helps remove bogus local "peaks" which are immediately followed or preceded by higher values; among the local maxima labeled in Figure 5, all but one are true peaks, while the remaining one is a bogus peak. Intuitively, these peaks roughly correspond to distinctive notes or rhythmic patterns. For the 60-second music clips used in our experiments, we typically find 100-200 peaks in each.

After the list of peaks is obtained, we extract the frequency components near each peak: we take 180 samples of frequency components between 200 Hz and 2000 Hz. Average values over a short time period following the peak are used in order to reduce sensitivity to noise and to avoid the "attack" portions produced by certain instruments (short, non-harmonic signal segments at the onset of each note).
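To make this component concrete, here is a minimal Python sketch of intermediate data generation, assuming numpy and scipy are available. The function name, FFT window size, neighborhood width, averaging span, and resampling step are illustrative assumptions rather than parameters taken from the paper; only the 200-2000 Hz band, the 180-sample vectors, and the fixed-size-neighborhood peak test come from the text.

```python
import numpy as np
from scipy import signal

def intermediate_data(audio, sr, neighborhood=1.0, n_bins=180):
    """Peak times and averaged spectral vectors for one music piece.

    audio: mono PCM samples; sr: sample rate (Hz).
    neighborhood: half-width (seconds) of the window in which a peak
        must be the maximum (illustrative value).
    Returns (peak_times, vectors), one 180-dim vector per peak.
    """
    # Spectrogram, and instantaneous power as a function of time.
    freqs, times, spec = signal.spectrogram(audio, fs=sr, nperseg=4096)
    power = spec.sum(axis=0)

    # A peak is a local maximum within a fixed-size neighborhood, which
    # removes bogus "peaks" immediately adjacent to higher values.
    frames_per_sec = len(times) / times[-1]
    radius = max(1, int(neighborhood * frames_per_sec))
    peaks = [t for t in range(radius, len(power) - radius)
             if power[t] == power[t - radius:t + radius + 1].max()]

    # 180 frequency samples between 200 Hz and 2000 Hz, averaged over a
    # short period *after* the peak to skip the note's "attack" portion.
    band = (freqs >= 200) & (freqs <= 2000)
    vectors = []
    for t in peaks:
        avg = spec[band, t:t + 4].mean(axis=1)            # short average
        idx = np.linspace(0, len(avg) - 1, n_bins).astype(int)
        vectors.append(avg[idx])                          # resample to 180
    return times[peaks], np.asarray(vectors)
```

In practice the window size, neighborhood width and averaging span would need tuning so that a 60-second clip yields on the order of the 100-200 peaks reported above.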
[Figure 6. The peaks x1, x2, x3, ..., xk of a music piece s, laid out along its time axis (from 1 to n)]

2. Matching.

This component matches two music pieces against each other and determines how close they are, based on the intermediate representation generated above. Matching comes in two stages: minimum-distance matching and linearity filtering.

Let the first piece s have m peaks with spectral vectors u1, ..., um, and the second piece r have n peaks with spectral vectors v1, ..., vn. Let d(i, j) be the mean-squared error between vectors ui and vj. It can be shown that d(i, j) is linearly related to the correlation coefficient of the original spectra near peak i of the first piece and peak j of the second one; a smaller d(i, j) value corresponds to a larger correlation coefficient. (See [13] for proof.) Therefore, d(i, j) is a natural indicator of the similarity of the original spectra at corresponding peaks.

Given a particular matching set M = {(i1, j1), ..., (il, jl)} that pairs peaks of s with peaks of r while preserving temporal order (i1 < ... < il and j1 < ... < jl), define the distance of s and r with respect to M as

$$\mathrm{dist}_{M}(s, r) \;=\; \sum_{(i,j)\in M} d(i, j) \;+\; \delta\,(m + n - 2|M|),$$

where $\delta$ is a fixed penalty charged for each unmatched peak. The minimum of this quantity over all matching sets is computed by dynamic programming, and the optimal matching set M* that leads to the minimum distance can also be traced from the dynamic programming algorithm.
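As a concrete illustration (not the paper's own code), minimum-distance matching and the traceback of M* can be sketched with a standard alignment-style dynamic program; the function name, array layout, and default skip penalty below are our assumptions.

```python
import numpy as np

def min_distance_matching(U, V, delta=1.0):
    """Align two peak sequences by dynamic programming.

    U: (m, 180) spectral vectors of piece s, one row per peak
    V: (n, 180) spectral vectors of piece r
    delta: penalty per unmatched peak (illustrative default)

    Returns (min_dist, matching), where matching is the optimal
    order-preserving set of (i, j) peak pairs traced from the table.
    """
    m, n = len(U), len(V)
    # d[i, j] = mean-squared error between spectral vectors u_i and v_j
    d = np.mean((U[:, None, :] - V[None, :, :]) ** 2, axis=2)

    # D[i, j] = minimum distance between the first i peaks of s
    # and the first j peaks of r.
    D = np.zeros((m + 1, n + 1))
    D[:, 0] = delta * np.arange(m + 1)      # peaks of s left unmatched
    D[0, :] = delta * np.arange(n + 1)      # peaks of r left unmatched
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i, j] = min(D[i - 1, j - 1] + d[i - 1, j - 1],  # match i with j
                          D[i - 1, j] + delta,                # skip peak of s
                          D[i, j - 1] + delta)                # skip peak of r

    # Trace back the optimal matching set M*.
    matching, i, j = [], m, n
    while i > 0 and j > 0:
        if D[i, j] == D[i - 1, j - 1] + d[i - 1, j - 1]:
            matching.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif D[i, j] == D[i - 1, j] + delta:
            i -= 1
        else:
            j -= 1
    matching.reverse()
    return D[m, n], matching
```

For clips with 100-200 peaks each, the O(mn) table is tiny, so this stage is fast.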
Based on the definitions above, the minimum distance between the two music pieces with spectral vectors u1, ..., um and v1, ..., vn is $\mathrm{dist}(s, r) = \min_{M} \mathrm{dist}_{M}(s, r)$. Although dist is a distance function, it is not robust enough for music comparison. Experiments have shown that certain subjectively dissimilar pieces may also end up with a small distance score, therefore appearing similar to the system. To make the algorithm more robust, further filtering is needed.

Figure 7 shows two ways to match s against r, both with 10 matches. Both may yield a low matching score, but the top one is obviously better than the bottom one. In the top one, the matches follow a consistent time correspondence: each match can be plotted on a 2-D graph, with the original location (time offset) of the matched peak in each piece as a coordinate, and the top match would give a straight line while the bottom one would not.

[Figure 7. A "good" match and a "bad" match between pieces s and r, each with 10 matched peaks]

[Figure 8. Summary of algorithm structure: the query music and the music database both go through Intermediate Data Generation, producing a query vector and database vectors; Minimum-Distance Matching then produces candidate matches]
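Linearity filtering can then be sketched as fitting a straight line to the matched time offsets and discarding matches that stray from it. The text does not give the fitting procedure or threshold, so the least-squares fit and the tolerance below are illustrative assumptions.

```python
import numpy as np

def linearity_filter(matching, times_s, times_r, tol=2.0):
    """Remove matches inconsistent with a straight-line time mapping.

    matching: (i, j) peak-index pairs from minimum-distance matching
    times_s:  peak times (seconds) of piece s
    times_r:  peak times (seconds) of piece r
    tol:      allowed deviation (seconds) from the fitted line;
              an illustrative threshold, not a value from the paper
    """
    x = np.array([times_s[i] for i, _ in matching])
    y = np.array([times_r[j] for _, j in matching])
    a, b = np.polyfit(x, y, 1)          # fit y = a*x + b
    keep = np.abs(y - (a * x + b)) <= tol
    return [pair for pair, ok in zip(matching, keep) if ok]
```

The number of surviving matches serves as the final similarity score: for truly similar performances the matched points are nearly colinear and most survive, while coincidental matches are scattered and mostly removed.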
[Figure 9. Power plots: instantaneous power vs. time (0-40 sec.) for the four pieces A, B, C and D]

[Figure 10. Pairwise matching result: final number of matching points for each pair among 10 items (Item 1 vs. Item 2 axes), plotted as a similarity surface]
4 Experiments

Our data collection is done by recording CDs or tapes into PCs through a low-quality PC microphone. No special efforts are taken to reduce noise. This setup is intentional, in order to test the algorithm's robustness and performance in a practical environment. Both classical music and modern music are included, with classical music being the focus. Instead of taking entire pieces, only 30- to 60-second clips are taken from each piece, because that much data is generally enough for similarity detection.

We identify five different types of "similar" music pairs, with increasing levels of difficulty:

Type I: Identical digital copy
Type II: Same analog source, different digital copies, possibly with noise
Type III: Same instrumental performance, different vocal components
Type IV: Same score, different performances (possibly at different tempo)
Type V: Same underlying melody, different otherwise, with possible transposition

Sound samples of each type can be found at http://www-db.stanford.edu/~yangc/musicir/.

Figure 9 shows the power plots of two different performances of Tchaikovsky's Piano Concerto No. 1 (A and B) and two different performances of Chopin's "Military" Polonaise (C and D). Both pairs are of Type-IV similarity. Each pair was performed by different orchestras, published by different companies. There were variations in tempo as well as in performance style. From the power plots it can be seen that notes are emphasized differently. Nevertheless, both pairs yield small distance scores after minimum-distance matching. On the other hand, a few dissimilar pairs also yield scores that are not large, such as Tchaikovsky's Piano Concerto No. 1 (A) vs. Brahms' Cradle Song (referred to as E from now on), and Chopin's "Military" Polonaise (D) vs. Mendelssohn's Spring Song (referred to as F from now on).

Figure 11 shows sample plots of optimal matching sets before linearity filtering (solid lines connecting the dots), where the horizontal axis is time (in seconds) of the first piece and the vertical axis is time of the second piece. A straight line is fitted through each set of matching points (dashed lines). As is clear from the plots, A and B are truly similar (almost all points are colinear), while A and E are not; C and D are truly similar, while D and F are not.

After certain matching points are removed by linearity filtering, Figure 11 becomes Figure 12. The pairs (A, B) and (C, D) have 49 and 54 matching points respectively, while the other two pairs have fewer than 15 remaining matching points.

Figure 10 shows the pairwise matching result for a set of 10 music pieces, of which two pairs ((A, B) and (C, D)) are different performances of the same scores (with Type-IV similarity). The result is shown as a 10 x 10 matrix where the entry (i, j) gives the final number of matching points between pieces i and j after linearity filtering. Because of symmetry only the upper triangle of the matrix is presented. Two peaks in the graph clearly indicate the discovery of the "correct" pairs.
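For concreteness, the pairwise experiment behind Figure 10 can be sketched as a loop over all pairs that records how many matches survive filtering; min_distance_matching and linearity_filter refer to the illustrative sketches given in Section 3.2, not to the paper's actual implementation.

```python
import numpy as np

def pairwise_scores(pieces):
    """pieces: list of (peak_times, spectral_vectors), one per clip.

    Returns a matrix whose (i, j) entry is the number of matching
    points left after linearity filtering; since the score is
    symmetric, only the upper triangle is filled.
    """
    k = len(pieces)
    scores = np.zeros((k, k), dtype=int)
    for i in range(k):
        for j in range(i + 1, k):
            (ti, ui), (tj, uj) = pieces[i], pieces[j]
            _, matching = min_distance_matching(ui, uj)
            scores[i, j] = len(linearity_filter(matching, ti, tj))
    return scores
```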
[Figure 11. Optimal matching sets before linearity filtering, for A vs. B, C vs. D, A vs. E and D vs. F: time of one piece vs. time of the other, in seconds]

[Figure 12. The same matching sets after linearity filtering, for A vs. B, C vs. D, A vs. E and D vs. F]
[Figure 13. Retrieval Accuracy: % retrieval accuracy for query types I through V]

More queries are conducted on a larger dataset of 120 music pieces, each of size 1MB. For each query, items from the database are ranked according to the number of final matching points with the query music, and the top 2 matches are returned. Figure 13 shows the retrieval accuracy for each of the five types of similarity queries. As can be seen from the graph, the algorithm performs very well on the first four types. Type V is the most difficult, and better algorithms need to be developed to handle it.
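The query procedure just described reduces to ranking database items by their final matching-point count; a minimal sketch, again reusing the earlier illustrative functions:

```python
def retrieve(query, database, top_k=2):
    """Return the top_k database items most similar to the query.

    query:    (peak_times, spectral_vectors) of the query clip
    database: list of (name, peak_times, spectral_vectors) entries
    """
    tq, uq = query
    ranked = []
    for name, td, ud in database:
        _, matching = min_distance_matching(uq, ud)
        score = len(linearity_filter(matching, tq, td))
        ranked.append((score, name))
    ranked.sort(reverse=True)           # most matching points first
    return [name for _, name in ranked[:top_k]]
```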
5 Conclusions and Future Work

We have presented an efficient algorithm to perform content-based music retrieval based on spectral similarity. Experiments have shown that the approach can detect similarity while tolerating tempo changes, some performance-style changes and noise, as long as the different performances are based on the same score.

Future research may include the study of the effects of various threshold parameters used in the algorithm, and finding ways to automate the selection of certain parameters to optimize performance.

We are experimenting with indexing schemes [12] in order to get faster retrieval response. We are also planning to augment the algorithm to handle transpositions (pitch shifts). Although transpositions of entire pieces are not very common, it is common to have small segments transposed to a different key, and it would be important that we detect such cases.

One other future direction is to design algorithms to extract high-level representations such as approximate melody contours. This task is certainly non-trivial, but it may be less difficult than transcription, and at the same time very powerful in similarity detection for complex cases.

Instead of using the peak-detection scheme during preprocessing, one can also incorporate existing rhythm-detection algorithms to improve performance. Also, different algorithms may be suited to different types of music, so it may be helpful to conduct some analysis of general statistical properties before deciding which algorithm to use.

Content-based retrieval of musical audio data is still a new area that is not well explored. There are many possible future directions, and this paper is only intended as a demonstration of the feasibility of certain prototype ideas, on which more extensive experiments and research will need to be done.

References

[1] J. P. Bello, G. Monti and M. Sandler, "Techniques for Automatic Music Transcription", in International Symposium on Music Information Retrieval, 2000.

[2] S. Blackburn and D. DeRoure, "A Tool for Content Based Navigation of Music", in Proc. ACM Multimedia, 1998.

[3] J. C. Brown and B. Zhang, "Musical Frequency Tracking using the Methods of Conventional and 'Narrowed' Autocorrelation", J. Acoust. Soc. Am. 89, pp. 2346-2354, 1991.

[4] J. Foote, "ARTHUR: Retrieving Orchestral Music by Long-Term Structure", in International Symposium on Music Information Retrieval, 2000.

[5] A. Ghias, J. Logan, D. Chamberlin and B. Smith, "Query By Humming: Musical Information Retrieval in an Audio Database", in Proc. ACM Multimedia, 1995.

[6] R. J. McNab, L. A. Smith, I. H. Witten, C. L. Henderson and S. J. Cunningham, "Towards the Digital Music Library: Tune Retrieval from Acoustic Input", in Proc. ACM Digital Libraries, 1996.

[7] E. D. Scheirer, "Pulse Tracking with a Pitch Tracker", in Proc. Workshop on Applications of Signal Processing to Audio and Acoustics, 1997.

[8] E. D. Scheirer, Music-Listening Systems, Ph.D. dissertation, Massachusetts Institute of Technology, 2000.

[9] A. S. Tanguiane, Artificial Perception and Music Recognition, Springer-Verlag, 1993.

[10] G. Tzanetakis and P. Cook, "Audio Information Retrieval (AIR) Tools", in International Symposium on Music Information Retrieval, 2000.

[11] E. Wold, T. Blum, D. Keislar and J. Wheaton, "Content-Based Classification, Search and Retrieval of Audio", in IEEE Multimedia, 3(3), 1996.
[12] C. Yang, "MACS: Music Audio Characteristic Sequence Indexing for Similarity Retrieval", in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2001.

[13] C. Yang and T. Lozano-Pérez, "Image Database Retrieval with Multiple-Instance Learning Techniques", in Proc. International Conference on Data Engineering, 2000, pp. 233-243.