
Music Database Retrieval Based on Spectral Similarity

Cheng Yang
Department of Computer Science
Stanford University
yangc@cs.stanford.edu

(Supported by a Leonard J. Shustek Fellowship, part of the Stanford Graduate Fellowship program, and NSF Grant IIS-9811904.)

Abstract

We present an efficient algorithm to retrieve similar music pieces from an audio database. The algorithm tries to capture the intuitive notion of similarity perceived by humans: two pieces are similar if they are fully or partially based on the same score, even if they are performed by different people or at different speeds.

Each audio file is preprocessed to identify local peaks in signal power. A spectral vector is extracted near each peak, and the list of such spectral vectors forms our intermediate representation of a music piece. A database of these intermediate representations is constructed, and two pieces are matched against each other using a specially defined distance function. Matching results are then filtered according to a linearity criterion to select the best answers to a user query.

1 Introduction

With the explosive amount of music data available on the internet in recent years, there has been much interest in developing new ways to search and retrieve such data effectively. Most on-line music databases today, such as Napster and mp3.com, rely on file names or text labels for searching and indexing, using traditional text-searching techniques. Although this approach has proven useful and widely accepted, it would be nice to have more sophisticated search capabilities, namely, searching by content. Potential applications include "intelligent" music retrieval systems, music identification, plagiarism detection, etc. Traditional techniques used in text searching do not easily carry over to the music domain, and people have built a number of special-purpose systems for content-based music retrieval.

Music can be represented in computers in two different ways. One way is based on musical scores, with one entry per note, keeping track of the pitch, duration (start time / end time), strength, etc., of each note. Examples of this representation include MIDI and Humdrum, with MIDI being the most popular format. The other way is based on acoustic signals, recording the audio intensity as a function of time, sampled at a certain frequency and often compressed to save space. Examples of this representation include .wav, .au, and MP3.

A simple software or hardware synthesizer can convert MIDI-style data into audio signals to be played back for human listeners. However, there is no known algorithm for reliable conversion in the other direction. For decades people have tried to design automatic transcription systems that extract musical scores from raw audio recordings, but they have succeeded only in monophonic and very simple polyphonic cases [1, 3, 9], not in the general polyphonic case. (Polyphony refers to the scenario where multiple notes occur at the same time, possibly from different instruments or vocal sounds; most music pieces are polyphonic.) In Section 3.1 we explain briefly why automatic transcription of general polyphonic music is a difficult task.

Score-based representations such as MIDI and Humdrum are much more structured and easier to handle than raw audio data. On the other hand, they have limited expressive power and are not as rich as what people would like to hear in music recordings. Therefore, only a small fraction of the music data on the internet is represented in score-based formats; most music data is found in various raw audio formats.

Most content-based music retrieval systems operate on score-based databases, with input methods ranging from note sequences to melody contours to user-hummed tunes [2, 5, 6]. Relatively few systems target raw audio databases. A brief review of related work is given in Section 2. Our work focuses on raw audio databases; both the underlying database and the user query are given in .wav audio format. We develop algorithms to search for music pieces similar to the user query. Similarity is based on the intuitive notion of similarity perceived by humans: two pieces are similar if they are fully or partially based on the same score, even if they are performed by different people or at different tempos.
In the next section we discuss some previous work in this area. In Section 3 we start with some background information and then give a detailed presentation of our algorithm to detect music similarity. Section 4 gives experimental results, and future directions are discussed in Section 5.

2 Related Work

Examples of score-based database (MIDI or Humdrum) retrieval systems include the ThemeFinder project (http://www.themefinder.org) developed at Stanford University, where users can query its Humdrum database by entering pitch sequences, pitch intervals, scale degrees or contours (up, down, etc.). The "Query-By-Humming" system [5] at Cornell University takes a user-hummed tune as input, converts it to contour sequences, and matches it against its MIDI database. Human-hummed tunes are monophonic melodies and can be automatically transcribed into pitches with reasonable accuracy, and melody contour information is generally sufficient for retrieval purposes [2, 5, 6].

Among music retrieval research conducted on raw audio databases, Scheirer [7, 8] studied pitch and rhythmic analysis, segmentation, as well as music similarity estimation at a high level such as genre classification. Tzanetakis and Cook [10] built tools to distinguish speech from music, and to do segmentation and simple retrieval tasks. Wold et al. at Muscle Fish LLC [11] developed audio retrieval methods for a wider range of sounds besides music, based on analyses of sound signals' statistical properties such as loudness, pitch, brightness, bandwidth, etc. Recently, *CD (http://www.starcd.com) commercialized a music identification system that can identify songs played on radio stations by analyzing each recording's audio properties.

Foote [4] experimented with music similarity detection by matching power and spectrogram values over time using a dynamic programming method. He defined a cost model for matching two pieces point-by-point, with a penalty added for non-matching points; lower cost means a closer match in the retrieval result. Test results on a small test corpus indicated that the method is feasible for detecting similarity in orchestral music. Part of our algorithm makes use of a similar idea, but with two important differences: we focus on spectrogram values near power peaks only, rather than over the entire time period, thereby making the algorithm more tolerant of tempo changes; furthermore, we evaluate final matching results by a linearity criterion which is more intuitive and robust than the cost models used for dynamic programming.

3 Detecting Similarity

In this section we start with some background information on signal processing techniques and musical signal properties, then give a detailed discussion of our algorithm.

3.1 Background

After decompression and parsing, each raw audio file can be regarded as a list of signal intensity values, sampled at a specific frequency. CD-quality stereo recordings have two channels, each sampled at 44.1kHz, with each sample represented as a 16-bit integer. In our experiments we use single-channel recordings of a lower quality, sampled at 22.05kHz, with each sample represented as an 8-bit integer. Therefore, a 60-second uncompressed sound clip takes 22050 x 60, or about 1.3 million, bytes.

We use the Short-Time Fourier Transform (STFT) to convert each signal into a spectrogram: split each signal into 1024-sample segments with 50% overlap, window each segment with a Hanning window, and perform a 2048-point zero-padded FFT on each windowed segment. Taking absolute values (magnitudes) of the FFT result, we obtain a spectrogram giving localized spectral content as a function of time. Since the details of this process are covered in most signal processing textbooks, we will not discuss them here.
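To make the preprocessing step concrete, here is a minimal NumPy sketch of the spectrogram computation just described. The framing parameters come from the paper; the function name and the use of numpy.fft.rfft are our own choices, not part of the paper.

    import numpy as np

    def spectrogram(signal, frame_len=1024, fft_len=2048):
        # 1024-sample frames with 50% overlap, Hanning-windowed,
        # zero-padded to a 2048-point FFT; keep magnitudes only.
        hop = frame_len // 2
        window = np.hanning(frame_len)
        frames = []
        for start in range(0, len(signal) - frame_len + 1, hop):
            seg = signal[start:start + frame_len] * window
            frames.append(np.abs(np.fft.rfft(seg, n=fft_len)))
        return np.array(frames)   # shape: (num_frames, fft_len // 2 + 1)

At the 22.05kHz sampling rate used here, each frame spans about 46 ms and adjacent FFT bins are about 10.8 Hz apart.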
indicated that the method is feasible for detecting similarity Figure 1 shows a sample spectrogram on the note se-
in orchestral music. Part of our algorithm makes use of a quence of middle C, E and G played on a piano. The
similar idea, but with two important differences: we focus horizontal axis is time in seconds, and the vertical axis is
on spectrogram values near power peaks only, rather than frequency component in Hz. Lighter pixels correspond to

over the entire time period, therefore making tempo changes higher values. If we zoom in to time  and look at the
more transparent; furthermore, we evaluate final matching frequency components of note G closely, we notice that it
results by some linearity criteria which is more intuitive and has many peaks (Figure 2), one at 392 Hz (its fundamental
robust than the cost models used for dynamic programming. frequency) and several others at integer multiples of 392 Hz

2
Figure 2. Frequency components of note G played by a piano (horizontal axis: frequency in Hz; vertical axis: intensity).

When multiple notes occur at the same time ("polyphony"), their frequency components add. Figures 3(a)-(c) show the frequency components of C, E and G played individually, while Figure 3(d) shows those of all three notes played together. In this simple example it is still possible to design algorithms to extract the individual pitches from the chord signal C-E-G, but in actual music recordings many more notes co-exist, played by many different instruments whose patterns of harmonics we do not know. In addition, there are sounds produced by percussion instruments, human voice, and noise. The task of automatic transcription of music from arbitrary audio data (i.e., conversion from raw audio format into MIDI) therefore becomes extremely difficult, and remains unsolved today. Our algorithm, like most other music retrieval systems, does not attempt to do transcription.

Figure 3. Illustration of polyphony: frequency components of (a) C, (b) E, (c) G, and (d) all three notes played together.

3.2 The Algorithm

The algorithm consists of three components, which are discussed separately.

1. Intermediate Data Generation.

For each music piece, we generate its spectrogram as discussed in Section 3.1, and plot its instantaneous power as a function of time. Figure 4 shows such a power plot for a 40-second sound clip of Tchaikovsky's Piano Concerto No. 1. Next, we identify peaks in this power plot, where a peak is defined as a local maximum within a neighborhood of a fixed size. This definition helps remove bogus local "peaks" which are immediately followed or preceded by higher values; in Figure 5, the marked local maxima that sit next to higher values are bogus peaks, while the others are true peaks. Intuitively, these peaks roughly correspond to distinctive notes or rhythmic patterns. For the 60-second music clips used in our experiments, we typically find 100-200 peaks in each of them.

Figure 4. Power plot of Tchaikovsky's Piano Concerto No. 1 (horizontal axis: time in seconds; vertical axis: power).

Figure 5. True peak vs. bogus peak.
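As an illustration, the sketch below computes the instantaneous power from the spectrogram of Section 3.1 and applies the fixed-neighborhood peak rule. The neighborhood half-width is an assumed value; the paper fixes the neighborhood size but does not state the number.

    import numpy as np

    def find_power_peaks(spec, half_width=20):
        # Instantaneous power per frame, summed over frequency bins.
        power = (spec ** 2).sum(axis=1)
        peaks = []
        for t in range(len(power)):
            lo = max(0, t - half_width)
            hi = min(len(power), t + half_width + 1)
            # A true peak is the maximum of its whole neighborhood; this
            # discards bogus "peaks" adjacent to higher values.
            if power[t] == power[lo:hi].max():
                peaks.append(t)
        return peaks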
After the list of peaks is obtained, we extract the frequency components near each peak: we take 180 samples of the frequency components between 200Hz and 2000Hz. Average values over a short time period following the peak are used, in order to reduce sensitivity to noise and to avoid the "attack" portions produced by certain instruments (short, non-harmonic signal segments at the onset of each note).
In the end, we get n spectral vectors of 180 dimensions each, where n is the number of peaks obtained. We normalize each spectral vector so that it has mean 0 and variance 1. After normalization, these vectors form our intermediate representation of the corresponding music piece. Typically each new note in a piece corresponds to a new peak, and therefore to a vector in this representation. Notice that we do not expect to capture all new notes in this way, and will almost certainly have some false positives and false negatives. However, later stages of the algorithm will compensate for this inaccuracy.
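Continuing the sketches above, the extraction and normalization step might look as follows. The number of frames averaged after each peak is an assumption, since the paper specifies only "a short time period".

    import numpy as np

    def spectral_vectors(spec, peaks, fs=22050, fft_len=2048, avg_frames=5):
        bin_hz = fs / fft_len                        # ~10.8 Hz per FFT bin
        lo, hi = int(200 / bin_hz), int(2000 / bin_hz)
        bins = np.linspace(lo, hi, 180).astype(int)  # 180 samples of 200-2000 Hz
        vectors = []
        for t in peaks:
            # Average a few frames after the peak to reduce noise
            # sensitivity and de-emphasize the note's "attack".
            v = spec[t:t + avg_frames].mean(axis=0)[bins]
            vectors.append((v - v.mean()) / v.std())  # mean 0, variance 1
        return np.array(vectors)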

2. Matching.

This component matches two music pieces against each other and determines how close they are, based on the intermediate representation generated above. Matching comes in two stages: minimum-distance matching and linearity filtering.

(a) Minimum-distance matching.

Suppose we would like to compare two music pieces with spectral vectors x_1, ..., x_m and y_1, ..., y_n respectively. Define d_ij to be the root-mean-squared error between vectors x_i and y_j. It can be shown that d_ij is linearly related to the correlation coefficient of the original spectra near peak i of the first piece and peak j of the second one: a smaller d_ij value corresponds to a larger correlation coefficient (see [13] for a proof). Therefore, d_ij is a natural indicator of the similarity of the original spectra at corresponding peaks.

Let M = {(x_i1, y_j1), (x_i2, y_j2), ..., (x_ik, y_jk)} be a set of k matches, pairing x_i1 with y_j1, x_i2 with y_j2, etc., as shown in Figure 6, where i1 < i2 < ... < ik <= m and j1 < j2 < ... < jk <= n.

Figure 6. Set of matching pairs: peaks x1, ..., xk of piece s paired with peaks y1, ..., yk of piece r along the time axis.

Given the prefix subsets X_1p = {x_1, ..., x_p} and Y_1q = {y_1, ..., y_q} of the x and y vectors, and a particular match set M, define the distance of X_1p and Y_1q with respect to M as

    D(X_1p, Y_1q, M) = d_i1,j1 + ... + d_ik,jk + lambda * ((p - k) + (q - k))

and the minimum distance between X_1p and Y_1q as

    D(X_1p, Y_1q) = min over M of D(X_1p, Y_1q, M).

The distance definition is basically a sum of all matching errors plus a penalty term for the number of non-matching points, weighted by lambda. Experiments have shown that a suitable constant value of lambda works reasonably well.

The minimum distance D(X_1p, Y_1q) can be found by a dynamic programming approach, because

    D(X_1p, Y_10) = lambda * p,    D(X_10, Y_1q) = lambda * q,

and for any 0 < p <= m, 0 < q <= n,

    D(X_1p, Y_1q) = min { D(X_1,p-1, Y_1,q-1) + d_pq,
                          D(X_1,p-1, Y_1q) + lambda,
                          D(X_1p, Y_1,q-1) + lambda }.

The optimal matching set M* that leads to the minimum distance can also be traced back from the dynamic programming table.

Based on the definitions above, the minimum distance between two music pieces with spectral vectors x_1, ..., x_m and y_1, ..., y_n is D(X_1m, Y_1n), and it can be found with dynamic programming; a sketch follows.
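Below is a short sketch of this dynamic program, including the traceback that recovers the optimal matching set. The penalty weight lam = 1.0 is a placeholder, not the value used in the paper.

    import numpy as np

    def min_distance_match(X, Y, lam=1.0):
        m, n = len(X), len(Y)
        # d[i, j] = root-mean-squared error between vectors X[i] and Y[j].
        d = np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).mean(axis=2))
        D = np.empty((m + 1, n + 1))
        D[:, 0] = lam * np.arange(m + 1)    # D(X_1p, Y_10) = lambda * p
        D[0, :] = lam * np.arange(n + 1)    # D(X_10, Y_1q) = lambda * q
        for p in range(1, m + 1):
            for q in range(1, n + 1):
                D[p, q] = min(D[p-1, q-1] + d[p-1, q-1],  # match x_p with y_q
                              D[p-1, q] + lam,            # x_p unmatched
                              D[p, q-1] + lam)            # y_q unmatched
        # Trace back to recover the optimal matching set M*.
        M, p, q = [], m, n
        while p > 0 and q > 0:
            if np.isclose(D[p, q], D[p-1, q-1] + d[p-1, q-1]):
                M.append((p - 1, q - 1)); p -= 1; q -= 1
            elif np.isclose(D[p, q], D[p-1, q] + lam):
                p -= 1
            else:
                q -= 1
        return D[m, n], M[::-1]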
(b) Linearity filtering.

Although the previous step gives the minimum distance and an optimal matching based on the distance function, it is not robust enough for music comparison. Experiments have shown that certain subjectively dissimilar pieces may also end up with a small distance score, therefore appearing similar to the system. To make the algorithm more robust, further filtering is needed.

Figure 7 shows two ways to match piece s against piece r, both with 10 matches. Both may yield a low matching score, but the top one is obviously better than the bottom one. In the top one, there is a slight tempo change between the two pieces, but the change is uniform in time. In the bottom one, however, there is no plausible explanation for the twisted matching. If we plot a 2-D graph of the matching points of s on the horizontal axis vs. the corresponding points of r on the vertical axis, the top match would give a straight line while the bottom one would not.
Figure 7. "Good" vs. "bad" matching: two sets of 10 matches between pieces s and r.

Formally, the matching set M = {(x_i1, y_j1), ..., (x_ik, y_jk)} can be plotted on a 2-D graph, with the original locations (time offsets) of peaks i1, ..., ik (of the first music piece) on the horizontal axis and those of peaks j1, ..., jk (of the second piece) on the vertical axis. If the two pieces were indeed mostly based on the same score, the plotted points should fall roughly on a straight line. Without tempo change, the line should be at a 45-degree angle. With a tempo change, the line may be at a different angle, but it should still be straight.

In this linearity filtering step, we examine the graph of the optimal matching set obtained from the dynamic programming above, fit a straight line through the points (using the least mean-square criterion), and check whether any points fall too far away from the line. If so, we remove the most outlying point and fit a new line through the remaining points. The process repeats until all remaining points lie within a small neighborhood of the fitted line. (In the worst case, only two points are left at the end; in practice we stop when fewer than 10 points remain.)
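The filtering loop just described, as a short sketch; the distance tolerance tol is an assumed parameter, since the paper speaks only of a "small neighborhood" of the fitted line.

    import numpy as np

    def linearity_filter(times_x, times_y, tol=1.0, min_points=10):
        tx = np.asarray(times_x, dtype=float)
        ty = np.asarray(times_y, dtype=float)
        while len(tx) >= min_points:
            a, b = np.polyfit(tx, ty, 1)      # least-squares line y = a*x + b
            resid = np.abs(ty - (a * tx + b))
            if resid.max() <= tol:            # every point lies near the line
                break
            worst = resid.argmax()            # drop the most outlying point
            tx = np.delete(tx, worst)
            ty = np.delete(ty, worst)
        return len(tx)                        # matching points after filtering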
The total number of matching points remaining after this filtering step is taken as an indicator of how well the two pieces match. As will be shown in Section 4, this criterion is remarkably effective in detecting similarity.
3. Query Processing.

All music files are preprocessed into the intermediate representation of spectral vectors discussed earlier. Given a query sound clip (also converted into the intermediate representation), the database is matched against the query using the minimum-distance matching and linearity filtering algorithms. The pieces that end up with the highest number of matching points (provided the number is above a certain threshold) are selected as answers to the user query.

Figure 8 summarizes the overall structure of the music retrieval algorithm.

Figure 8. Summary of algorithm structure: both the query music and the music database pass through Intermediate Data Generation; Minimum-Distance Matching compares the query vector with the vector database, and the candidate matches pass through Linearity Filtering to produce the final results.
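Putting the stages together, a sketch of the query step built on the functions above; the acceptance threshold is an assumed parameter.

    def answer_query(query_vecs, query_times, database, threshold=20):
        # database maps a piece name to its (spectral vectors, peak times).
        scores = []
        for name, (vecs, times) in database.items():
            _, M = min_distance_match(query_vecs, vecs)
            tx = [query_times[i] for i, j in M]   # matched peak times (query)
            ty = [times[j] for i, j in M]         # matched peak times (piece)
            scores.append((linearity_filter(tx, ty), name))
        scores.sort(reverse=True)
        # Rank by surviving matching points; keep those above the threshold.
        return [name for score, name in scores if score >= threshold]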
3.3 Complexity Analysis

The time complexity of the preprocessing step is O(N), where N is the size of the database. Because only "peak" information is recorded in the spectral vector representation, the space required is only a fraction of the original audio database.

Dynamic programming for minimum-distance matching takes O(k^2) time for each run, where k is the expected number of peaks in each piece. Because k is much less than N when the database is large, it can be regarded as a constant, and N is the dominant factor. Linearity filtering takes a negligible amount of time in practice, although its worst-case complexity is also up to O(k^2).

Overall, assuming k is a constant factor, the algorithm runs in O(N) time for each query. When the database gets large, O(N) running time may be too slow. We are experimenting with indexing schemes [12] which will give better performance.

4 Experiments

Our data collection is done by recording CDs or tapes into PCs through a low-quality PC microphone. No special effort is made to reduce noise. This setup is intentional, in order to test the algorithm's robustness and performance in a practical environment. Both classical and modern music are included, with classical music being the focus. Instead of taking entire pieces, we take only 30- to 60-second clips from each piece, because that much data is generally enough for similarity detection.

We identify five different types of "similar" music pairs, with increasing levels of difficulty:

Type I: Identical digital copy.
Type II: Same analog source, different digital copies, possibly with noise.
Type III: Same instrumental performance, different vocal components.
Type IV: Same score, different performances (possibly at different tempo).
Type V: Same underlying melody, different otherwise, with possible transposition.

Sound samples of each type can be found at http://www-db.stanford.edu/~yangc/musicir/ .

Figure 9 shows the power plots of two different performances of Tchaikovsky's Piano Concerto No. 1 (A and B) and two different performances of Chopin's "Military" Polonaise (C and D). Both pairs are of Type-IV similarity. Each pair was performed by different orchestras and published by different companies, with variations in tempo as well as in performance style. From the power plots it can be seen that notes are emphasized differently. Nevertheless, both pairs yield small distance scores after minimum-distance matching. On the other hand, a few dissimilar pairs also yield scores that are not large, such as Tchaikovsky's Piano Concerto No. 1 (A) vs. Brahms' Cradle Song (referred to as E from now on), and Chopin's "Military" Polonaise (D) vs. Mendelssohn's Spring Song (referred to as F from now on).

Figure 9. Power plots of pieces A, B, C and D (horizontal axis: time in seconds).

Figure 11 shows sample plots of optimal matching sets before linearity filtering (solid lines connecting the dots), where the horizontal axis is time (in seconds) of the first piece and the vertical axis is time of the second piece. A straight line is fitted through each set of matching points (dashed lines). As is clear from the plots, A and B are truly similar (almost all points are colinear), while A and E are not; C and D are truly similar, while D and F are not.

After certain matching points are removed by linearity filtering, Figure 11 becomes Figure 12. The pairs (A, B) and (C, D) have 49 and 54 matching points respectively, while the other two pairs have fewer than 15 remaining matching points.

Figure 10 shows the pairwise matching result for a set of 10 music pieces, of which two pairs ((A, B) and (C, D)) are different performances of the same scores (with Type-IV similarity). The result is shown as a 10 x 10 matrix where entry (i, j) gives the final number of matching points between pieces i and j after linearity filtering. Because of symmetry, only the upper triangle of the matrix is presented. Two peaks in the graph clearly indicate the discovery of the "correct" pairs.

Figure 10. Pairwise matching result for 10 music pieces (axes: item 1, item 2; height: number of matching points).
Figure 11. Matching plots before filtering: A vs. B, C vs. D, A vs. E, D vs. F.

Figure 12. Matching plots after filtering: A vs. B, C vs. D, A vs. E, D vs. F.
More queries were conducted on a larger dataset of 120 music pieces, each of size 1MB. For each query, items from the database are ranked according to the number of final matching points with the query music, and the top 2 matches are returned. Figure 13 shows the retrieval accuracy for each of the five types of similarity queries. As can be seen from the graph, the algorithm performs very well on the first four types. Type V is the most difficult, and better algorithms need to be developed to handle it.

Figure 13. Retrieval accuracy (%) for each similarity type I-V.

5 Conclusions and Future Work

We have presented an efficient algorithm to perform content-based music retrieval based on spectral similarity. Experiments have shown that the approach can detect similarity while tolerating tempo changes, some performance-style changes, and noise, as long as the different performances are based on the same score.

Future research may include studying the effects of the various threshold parameters used in the algorithm, and finding ways to automate the selection of certain parameters to optimize performance.

We are experimenting with indexing schemes [12] in order to get faster retrieval response. We are also planning to augment the algorithm to handle transpositions (pitch shifts). Although transpositions of entire pieces are not very common, it is common to have small segments transposed to a different key, and it would be important to detect such cases.

Another future direction is to design algorithms to extract high-level representations such as approximate melody contours. This task is certainly non-trivial, but it may be less difficult than transcription, and at the same time very powerful for similarity detection in complex cases.

Instead of using the peak-detection scheme during preprocessing, one can also incorporate existing rhythm-detection algorithms to improve performance. Also, different algorithms may be suited to different types of music, so it may be helpful to conduct some analysis of general statistical properties before deciding which algorithm to use.

Content-based retrieval of musical audio data is still a new area that is not well explored. There are many possible future directions, and this paper is intended only as a demonstration of the feasibility of certain prototype ideas, on which more extensive experiments and research remain to be done.

References

[1] J. P. Bello, G. Monti and M. Sandler, "Techniques for Automatic Music Transcription", in International Symposium on Music Information Retrieval, 2000.

[2] S. Blackburn and D. DeRoure, "A Tool for Content Based Navigation of Music", in Proc. ACM Multimedia, 1998.

[3] J. C. Brown and B. Zhang, "Musical Frequency Tracking using the Methods of Conventional and 'Narrowed' Autocorrelation", J. Acoust. Soc. Am. 89, pp. 2346-2354, 1991.

[4] J. Foote, "ARTHUR: Retrieving Orchestral Music by Long-Term Structure", in International Symposium on Music Information Retrieval, 2000.

[5] A. Ghias, J. Logan, D. Chamberlin and B. Smith, "Query By Humming - Musical Information Retrieval in an Audio Database", in Proc. ACM Multimedia, 1995.

[6] R. J. McNab, L. A. Smith, I. H. Witten, C. L. Henderson and S. J. Cunningham, "Towards the Digital Music Library: Tune Retrieval from Acoustic Input", in Proc. ACM Digital Libraries, 1996.

[7] E. D. Scheirer, "Pulse Tracking with a Pitch Tracker", in Proc. Workshop on Applications of Signal Processing to Audio and Acoustics, 1997.

[8] E. D. Scheirer, Music-Listening Systems, Ph.D. dissertation, Massachusetts Institute of Technology, 2000.

[9] A. S. Tanguiane, Artificial Perception and Music Recognition, Springer-Verlag, 1993.

[10] G. Tzanetakis and P. Cook, "Audio Information Retrieval (AIR) Tools", in International Symposium on Music Information Retrieval, 2000.

[11] E. Wold, T. Blum, D. Keislar and J. Wheaton, "Content-Based Classification, Search and Retrieval of Audio", IEEE Multimedia, 3(3), 1996.

[12] C. Yang, "MACS: Music Audio Characteristic Sequence Indexing for Similarity Retrieval", in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2001.

[13] C. Yang and T. Lozano-Pérez, "Image Database Retrieval with Multiple-Instance Learning Techniques", in Proc. International Conference on Data Engineering, 2000, pp. 233-243.