
Received: August 10, 2019. Revised: October 29, 2019.


Hybrid Approach of Structural Lyric and Audio Segments for Detecting Song Emotion

Fika Hastarita Rachman 1,2*, Riyanarto Sarno 1, Chastine Fatichah 1

1 Department of Informatics, Institut Teknologi Sepuluh Nopember, Indonesia
2 Department of Informatics, University of Trunojoyo Madura, Indonesia
* Corresponding author's Email: hastarita.fika@gmail.com

Abstract: Detecting song emotion is very important; however, many studies have been done based on song lyrics and song audio separately. This research proposes a method for detecting song emotion based on integrated song lyrics and audio. Synchronizing the right structural segment of the audio with its lyrics enables a hybrid approach that detects the right emotion. Song emotion is classified into Thayer emotion labels. The features of a song lyric are extracted using psycholinguistic and stylistic features, whereas the features of a song audio are extracted by analyzing the audio signal waveform with the Fast Fourier Transform (FFT) method. A song can be divided into five structural segments: intro, chorus, bridge, verse and outro. Preprocessing uses Correlation Feature Selection (CFS) for audio and text preprocessing for lyrics. Six classification methods are used to classify emotion based on the lyrics and audio of the song structural segments separately. The aggregate method is then used to analyze the classification results and obtain the structural segments that represent emotion. Finally, a hybrid approach combines the audio and lyric features for emotion detection, using the sum of matrices and the majority voting concept. The resulting F-Measure value is 0.823.
Keywords: Emotion detection, Song structural segment, Audio features, Lyric features, CFS, Aggregate method, Hybrid approach.

1. Introduction

Digital songs are popular among many people. The amount of song data makes it hard to choose a song that suits our emotions. Song emotion detection is therefore useful and necessary to support other applications such as song recommendation for drivers [1], children [2], teenagers, etc.

There are two types of data in songs that can be processed: audio and lyrics. Much research on detecting emotion has been done based on song lyrics and song audio separately. Previous research [3-6] uses audio features for song emotion and genre detection. In research [3] the mood of a song is classified using SVM. It used the audio power and audio harmonicity features of the audio, with the discrete wavelet transform (DWT) to reduce noise. Intensity, timbre and rhythm features are used in [4] on 20-second music audio clips to detect music mood. Other features, MFCC and Chroma from the Echonest analyzer, are used for music genre classification [5]. Mood tracking using audio features from the Marsyas extractor and a smoothing method is done in [6] to detect the mood of audio music. Their research shows that audio is an important feature for song emotion detection. Lyrics can also be a feature in song emotion detection. Previous research [7] uses SentiWordNet to extract sentiment features from lyrics; mood classification is then done using the sentiment features, a feature selection process and classification models. Research [8] builds ANCW from ANEW to detect the emotion of a Chinese song dataset using fuzzy clustering. The emotional classification of songs based on lyrics can also be done using a partial syntactic analysis model [9]. The result of [10] shows that the use of LSA for lyric features is not optimal.

The classification results are still low. Lyrics text mining is used to find the role of lyrics in improving the accuracy of mood detection [11]. Combined testing of audio and lyric features produced the conclusion that audio is not always the dominant feature in determining the emotion of a song, and not all mood labels yield high accuracy [11]. Even though these studies obtained good results, they used audio datasets in which the audio duration was determined by an expert. In fact, a song has a duration of two to five minutes, and it is difficult to determine which segment represents the emotion of the song.

Generally, Music Emotion Recognition (MER) uses a bimodal dataset [12], one that combines audio (vocals and instruments) and lyrics. Emotion detection usually uses all the lyrics of a song. The Bimodal Dataset [12] is one of the song emotion datasets; it uses 30 seconds of audio and all lyrics of each song. The 1000 Songs dataset [13] is also a song emotion dataset; it uses 45 seconds of audio data. Both datasets use short audio whose duration is determined by an expert. The segment duration determined by the expert differs for each song, so it is difficult to determine it automatically when a song has no information from the expert. Song emotion is wrongly detected if the system does not have the right audio segment.

Previous research shows that there are emotional differences between song segments [14]. The results show that the best accuracy for music mood detection uses durations of 8 and 16 seconds. Other research uses a spoken speech dataset (Emo-DB). The emotion of spoken speech is analyzed after an audio segmentation process with a 400 ms duration. The emotion prediction differs for each segment, and a mechanism fuses the segment emotions into a global emotion for the full audio [15].

Emotion detection from audio and lyrics was performed by Ricardo Malheiro, 2016 [12]. They used the Bimodal dataset with 133 songs. Each data item has 30 seconds of audio and all lyrics of the song. They used 1701 audio subset features from melody, timbre, rhythm, and others [12]. Features of the lyrics are obtained from lyric tools (Synesketch, ConceptNet, LIWC, and General Inquirer). The fusion of audio and lyric features in emotion recognition increased the F-Measure value to 88.4%. Although the results are very high, only songs that have the right audio segment can have their emotion detected accurately. The audio data used in that study is short audio, while the lyric data is the whole lyrics of the song; there is no synchronization between audio and lyrics. If there is a new song, an expert is needed to determine the audio segment. Ricardo [12] uses short audio to detect song emotions, and the short audio duration comes from experts. Ideally, without the expert's short audio, the system should still be able to detect emotions. This research detects song emotions automatically because the system determines the structural segments that represent emotion based on Correlation Feature Selection (CFS) and the aggregate method.

The structural segment of a song is a part of the song that can be processed for emotion detection besides audio and lyrics. Song structure has a unique form and is composed into a song. In songwriting, a song is generally composed of five parts: intro, verse, bridge, chorus and outro [16]. Each part can have a different position in a song. The intro is the beginning of a song and is usually just instrumental. The verse contains the verses of the song with lyrics. The chorus is the message or core of the song. The verse and chorus parts can be repeated. The bridge is the connecting part between the parts of the song, while the outro is the closing part of the song. The intro and outro are often at the beginning and end of the song, while the verse, chorus, and bridge are located in the middle of the music, possibly in different orders. An example of a song structure is: Intro - Verse - Chorus - Verse - Bridge - Chorus - Outro.

Chia [16] conducted research on a chorus detection algorithm and emotion detection of songs based on audio data. The values of intensity, frequency band and rhythm regularity in the chorus are used to detect song emotions. Their emotion detection algorithm provides similar results for the same melody in various languages and lyrics. The selected song structure therefore influences song emotion detection.

We propose song emotion detection that uses audio and lyrics synchronously. Song structural segments are also used in this research. The proposed system can automatically recognize the song emotion given structural segment data.

We have two contributions in this research. The first contribution is the automatic selection of the segments that represent the emotion of the whole song by analyzing the structure of the song. The second contribution is a hybrid approach that combines audio and lyric features from song structure segments for emotion detection using a prediction frequency matrix. With these contributions, to find out the emotion of a song we only need to know the structure of that song. So even though a song does not have audio samples from experts, we can still find out its emotion.

This paper is organized as follows. Related work and our contributions are presented in the introduction in Section 1. Section 2 describes the proposed method of this research. Section 3 describes the structural emotion song dataset used in this research. Section 4 describes the audio and lyric feature extraction. Section 5 presents the emotional analysis of the structural segments to find which structural segments represent the emotion of the whole song. Section 6 presents the model and method used for hybrid features in song emotion detection. Results and discussion are presented in Section 7. Finally, Section 8 provides conclusions and future work.

2. Proposed method

The research was carried out according to the proposed method in Fig. 1. The song data processed are structural audio data and lyrics. The lyric data are synchronized with the segments in the audio data. Audio segment data undergo feature extraction by analyzing the audio signal waveform with the Fast Fourier Transform (FFT) method in MIRToolbox [17]. The features of an audio segment are sub-features of the dynamics, rhythm, timbre, pitch, and tonality features. A feature selection process is then performed to reduce the number of features used in the classification. This research uses Correlation Feature Selection (CFS) [18-19] as the feature selection method. The audio features of each structural segment are classified using the Random Forest method. The classification results show which structural segments best represent the emotion of the song. Each song has several structural segments. Each structural segment is represented as a matrix, therefore there are several audio and lyric matrices. A row of a matrix represents the index of a song, and a column represents the count of emotion label predictions for that song's structural segments.

The lyric data are first synchronized with the structural audio segments using the lrc (short for LyRiCs) file format. Preprocessing of the lyrics is done in several steps: slang word repair, POS tagging, Porter stemming and stopword removal. Slang word repair is done using a slang word corpus of terms often found in lyrics. Stanford Part of Speech (POS) Tagging [18] is used to find the position of words in the lyrics. Several word positions rarely describe the emotion of the lyrics (shown in Table 1); words in those positions are deleted.

Table 1. Filtered POS tags
POS tag    Meaning of POS position
DT         Determiner
CD         Cardinal number
WRB        Wh-adverb
TO         To
CC         Coordinating conjunction
PRPS       Personal pronoun
MD         Modal

After preprocessing the lyrics, the next process is feature extraction. The features extracted from the lyrics are psycholinguistic and stylistic features. The psycholinguistic feature is a feature based on an emotional psychology dataset. This dataset is the CBE (Corpus Based Emotion) dataset from the results of previous research [19], which is expanded according to the Thayer emotion labels.

The stylistic features are words that are often found in lyrics but are not in the English dictionary (such as 'ah', 'ooh', 'yeah'). Those unique words, together with exclamation marks and question marks in the lyrics, form the stylistic features. These features are used as input in the emotion classification process for each lyric structural segment. The predicted emotion label of each segment is presented using a segment matrix.

The audio and lyric matrices of each structural segment are hybridized using the sum of matrices operation. This hybrid matrix is used by the emotion detection model to obtain a predicted song emotion label. In the emotion detection model, the sum of matrices and majority voting are used over several analyzed combinations of structural segments.

The emotion labels used in this research are the Thayer emotion labels [6]. The Thayer emotion model has four (4) quadrants depicted in a 2-dimensional model. The coordinate axes are Valence and Arousal, each with a 'low' and a 'high' area. Thayer's emotion labels are: Quadrant 1 (Q1), Quadrant 2 (Q2), Quadrant 3 (Q3), and Quadrant 4 (Q4). Q1 is the high valence and high arousal area; examples of Q1 are happy and excited emotions. Q2 is low valence and high arousal; angry and nervous emotions belong to Q2. Q3 is low valence and low arousal (for example: sad). Q4 is high valence and low arousal (for example: calm, relaxed) [20].

3. Structural emotion of song dataset

This research requires a dataset of structural segments of songs labeled with Thayer emotions. Previous datasets are separated into structural song datasets and emotional song datasets. The Bimodal dataset is an emotional song dataset of 133 songs with 30-second audio data, full lyric text and Thayer emotion labels [12].
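As an illustration of this preprocessing chain, the sketch below filters a lyric segment with the POS tags of Table 1 and then applies stopword removal and Porter stemming. It is only a minimal Python sketch: NLTK stands in for the Stanford POS tagger used in the paper, the slang dictionary is invented for the example, and the tag set treats the paper's "PRPS" as the Penn Treebank personal/possessive pronoun tags.

# Minimal sketch of the lyric preprocessing step (slang repair, POS filtering,
# stopword removal, Porter stemming). NLTK stands in for the Stanford POS tagger;
# requires the punkt, averaged_perceptron_tagger and stopwords NLTK data packages.
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# POS tags that rarely carry emotion (Table 1); words with these tags are dropped.
FILTERED_TAGS = {"DT", "CD", "WRB", "TO", "CC", "PRP", "PRP$", "MD"}
SLANG = {"u": "you", "luv": "love", "gonna": "going to"}   # illustrative slang corpus

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess_lyric_segment(text):
    # 1. Slang word repair
    tokens = [SLANG.get(w.lower(), w.lower()) for w in nltk.word_tokenize(text)]
    # 2. POS tagging and filtering of low-emotion positions
    tagged = nltk.pos_tag(tokens)
    kept = [w for w, tag in tagged if tag not in FILTERED_TAGS]
    # 3. Stopword removal and Porter stemming
    return [stemmer.stem(w) for w in kept if w not in stop_words]

print(preprocess_lyric_segment("Oh I luv you, yeah you know I do!"))

The token list returned by preprocess_lyric_segment() is what would feed the psycholinguistic and stylistic feature extraction of that segment.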

One of the structural song datasets that provides its data in xml files is Ep_groundtruth_excl_Paulus (Ep-dataset) [21]. The duration of each structural segment is between 20 and 45 seconds [21]. This structural song dataset was created with the concept of the Chroma feature [22] on song audio data. This research used a combination of the two datasets, the Bimodal dataset and the Ep-dataset. The Bimodal dataset has emotion labels for each song but does not have song structural data, so synchronization of the audio and lyrics of each song, based on the officially distributed song file (.wav) and the .lrc file, is needed to complete it. The Ep-dataset has structural segment data but does not have an emotion label for each song; a music expert is needed to complete it. Not all data in the two datasets are used. The availability of the full song data (.wav) and lyric data (.lrc), and the balance of the amount of data between emotion labels, are also considerations.

This research dataset consists of 100 songs, with 25 songs for each Thayer emotion label. From the 100 songs, there are 875 song structure data items along with the audio segments (.wav), the duration of each structural segment, and the Thayer emotion label of the song. Only 738 of these items have lyrics, so in this research we used 738 data items. The distribution of the data for each segment is shown in Table 2.

Table 2. Distribution of structural segment data
Structural Segment    Amount of data
Intro                 39
Chorus                283
Bridge                76
Verse                 299
Outro                 41

Whether the music genre should limit the scope of the research data is another consideration. For this purpose, we analyzed the influence of the parent music genre on Thayer's emotion labels in the Bimodal dataset. Fig. 2 shows that certain genres tend to give rise to certain emotions: for example, songs with the pop genre tend to have Q4 emotion labels, while songs with the rock genre tend to have Q2 emotion labels. Based on this analysis, the structural emotion song dataset does not differentiate between music genres.

Figure 1. Proposed method
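Because the Bimodal songs lack structural data, the audio and lyrics are aligned through .lrc files as described above. The following minimal sketch shows one way to do that alignment, assuming the common "[mm:ss.xx] text" LRC timestamp format; the authors' exact parsing and the timestamp precision of their files are not specified in the paper.

# Minimal sketch of aligning .lrc lyric lines with a structural segment's time range.
# Assumes standard "[mm:ss.xx] text" LRC timestamps; not the authors' exact tooling.
import re

LRC_TIME = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\]")

def parse_lrc(lrc_text):
    """Return a list of (seconds, lyric line) pairs."""
    lines = []
    for raw in lrc_text.splitlines():
        m = LRC_TIME.match(raw)
        if m:
            seconds = int(m.group(1)) * 60 + float(m.group(2))
            lines.append((seconds, raw[m.end():].strip()))
    return lines

def lyrics_for_segment(lrc_lines, start_sec, end_sec):
    """Collect the lyric lines whose timestamps fall inside one structural segment."""
    return " ".join(text for t, text in lrc_lines if start_sec <= t < end_sec)

lrc = "[00:10.10]Anna\n[00:12.50]You come and ask me, girl\n[00:31.00]All of my life\n"
lines = parse_lrc(lrc)
print(lyrics_for_segment(lines, 9.67, 29.41))   # lyrics of the first Verse segment

Each structural segment's start_sec and end_sec (as in the xml example of Section 6.1) then select the lyric lines belonging to that segment.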


Figure 2. Analysis graph of the Bimodal dataset by song genre

4. Feature extraction

Feature extraction is done for two kinds of data: audio data and song lyrics. Feature extraction uses different concepts for the different data. The feature extraction results are used in the structural segment classification process.

4.1 Audio feature extraction

The audio feature extraction analyzes the audio signal waveform with the Fast Fourier Transform (FFT) method [23]. This extraction method has been implemented in a Music Information Retrieval Matlab toolbox, namely MIRToolbox version 1.6.1 [17]. Besides MIRToolbox, MPEG-7 is another tool for extracting audio signals [26]. MPEG-7 was not used in this research, because MPEG-7 provides only Low Level Descriptions (LLD), including basic spectral features, basic signal parameters, and timbral descriptions. Previous research [26-27] using MPEG-7 used the audio power and audio harmonicity features. In MIRToolbox, the audio waveform is extracted into standard-level features: dynamics, rhythm, timbre, pitch, and tonality. From these features, several sub-features are extracted. These subset features are in scalar and signal form. Subset features in signal form are summarized statistically into statistical parameters: the average (avg), standard deviation (std), and median (med) of the signal data value vector. The total number of parameters used as sub-features is 54. The feature extraction scheme using MIRToolbox can be seen in Fig. 3.

Figure 3. Feature extraction scheme using MIRToolbox

From these 54 parameters, a feature selection process is carried out using Correlation Feature Selection (CFS). The CFS configuration uses the Best First Search method with the forward-backward (bidirectional) search model and a 60% threshold. With this method, 16 subset features are used in the next process. The sixteen subset features are: std-beatspectrum, eventdensity, pulseclarity, mean-attacktime, mean-decreaseslope, med-decreaseslope, std-decreaseslope, zerocross, kurtosis, mean-roughness, std-hcdf, mean-mfcc, std-mfcc, mean-spectrum, mean-chromagram, and envelope-halfwavediff.

4.2 Lyric feature extraction

The process of extracting lyric data can be seen in Fig. 4. Data preprocessing is done before the extraction process. The preprocessing includes repairing the data, filtering words using the Stanford POS tagger, Porter stemming and stopword removal.

Figure 4. The process of extracting lyric data

This research uses stylistic and psycholinguistic features as the lyric features of a song. The stylistic feature is taken from unique words and special punctuation in the lyrics. Unique and informal words are often found in lyrics, such as 'oh', 'ooh', 'ah', 'yeah', 'huuu', 'whoo' and others.
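A rough picture of how signal-valued descriptors become scalar sub-features (mean, std, median) is sketched below. librosa is only a stand-in for MIRToolbox 1.6.1, the three descriptors are chosen because related sub-features appear in the retained set, and the paper's full 54-parameter scheme is not reproduced.

# Minimal sketch of summarizing frame-wise audio descriptors into scalar sub-features
# (mean, std, median), in the spirit of the MIRToolbox pipeline described above.
import numpy as np
import librosa

def segment_audio_features(wav_path, start_sec, end_sec):
    # Load only the structural segment of interest.
    y, sr = librosa.load(wav_path, offset=start_sec, duration=end_sec - start_sec)
    frames = {
        "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),
        "chromagram": librosa.feature.chroma_stft(y=y, sr=sr),
        "zerocross": librosa.feature.zero_crossing_rate(y),
    }
    stats = {}
    for name, values in frames.items():
        stats[f"mean-{name}"] = float(np.mean(values))
        stats[f"std-{name}"] = float(np.std(values))
        stats[f"med-{name}"] = float(np.median(values))
    return stats

The CFS step itself (Best First, bidirectional search, 60% threshold) is not shown; it is the kind of correlation-based subset selection offered by tools such as Weka's CfsSubsetEval.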

Likewise, special punctuation found in the lyrics, such as exclamation marks (!) and question marks (?), is used. In previous research [19] the lyrics that were processed were the complete lyrics of each song. In this research, the features are extracted from the lyrics of the structural segment data that have been synchronized with the audio structural segments.

The psycholinguistic feature is obtained with the help of an existing emotion corpus. The emotion corpus that we use is an expansion of the Corpus Based Emotion (CBE). The CBE in previous research [10] used five emotion labels from MIREX [24]. Because the Bimodal dataset used here has Thayer's emotion labels, the emotion labels in the CBE are also adjusted. The change of emotion labels in the CBE causes a change in the cluster data centers. A further development is the growth in the number of corpus entries due to the automatic tagging procedure. Among all lyric data in the dataset, some data are found to be incomplete: incomplete data are terms that do not have a label or an Arousal-Valence emotion value. Not every term of the lyrics in the dataset has its own emotion label in the CBE. With the concept of term synonyms and the automatic tagging procedure, the CBE is expanded.

The automatic tagging procedure [23] is a procedure for labeling emotions, i.e. looking for Valence-Arousal dimension values, using the cluster center concept and the Lesk similarity measure [24]. The automatic tagging procedure is shown in Fig. 5.

Figure 5. Automatic tagging procedure

The amount of corpus data in the previous CBE was 2122 terms, while the expanded CBE (CBE-Ex) obtained an additional 1649 terms, giving 3771 terms. CBE is an emotion corpus that merges two corpora: WordNet Affect Emotion (WNA) [25] and Affective Norms for English Words (ANEW) [26]. The reliability of the four corpora (ANEW, WNA, CBE and CBE-Ex) is tested by calculating the F-measure for the emotion classification of the whole lyrics with 11 psycholinguistic and 19 stylistic features using the Random Forest method with 5-fold cross-validation. The results are shown in Fig. 6. It can be seen that CBE-Ex is better than CBE, with an F-Measure value of 0.399.

Figure 6. Comparison of F-measure results using CBE [24], ANEW [26], WNA [25] and CBE-Ex

This research uses CBE-Ex to obtain the psycholinguistic features. The overall lyric features used in this research are 30 features: 19 stylistic features and 11 psycholinguistic features.

5. Emotional structural segment analysis from audio and lyrics of song

The structural segments of a song consist of the intro, chorus, verse, bridge and outro [27]. In this research, the song structures that represent the emotion of the song are analyzed. The analysis considers both the audio and the lyric features. The purpose of this step is to reduce the duration of the audio data to be processed, so that all classification features can be obtained properly.

Emotion predictions for the song structural segments are obtained using six classification methods: Logistic (ML1), C4.5 (ML2), Random Forest (ML3), Multilayer Perceptron (ML4), Bayes Net (ML5) and Naive Bayes (ML6); the results are then combined using the aggregate method. Six types of classification methods are used because previous research [28] also uses them.
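To make this experimental setup concrete, the sketch below runs a set of classifiers with 5-fold cross-validation and a macro F-measure on the feature vectors of one segment type. It is an assumption-laden stand-in: scikit-learn models replace whatever toolchain the authors used, C4.5 is approximated by a plain decision tree, and the Bayes Net model (ML5) has no direct equivalent here and is omitted.

# Minimal sketch of the per-segment emotion classification experiment: several
# classifiers, 5-fold cross-validation, macro F-measure. The data below are dummy
# placeholders; X would hold one segment type's feature vectors, y the Thayer labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

classifiers = {
    "ML1 Logistic": LogisticRegression(max_iter=1000),
    "ML2 C4.5-like tree": DecisionTreeClassifier(),
    "ML3 Random Forest": RandomForestClassifier(),
    "ML4 Multilayer Perceptron": MLPClassifier(max_iter=1000),
    "ML6 Naive Bayes": GaussianNB(),
}

def evaluate_segment(X, y):
    """Return the mean 5-fold macro F-measure of each classifier for one segment type."""
    return {name: cross_val_score(clf, X, y, cv=5, scoring="f1_macro").mean()
            for name, clf in classifiers.items()}

# X: (n_segments, 16) selected audio features or (n_segments, 30) lyric features; y: Q1..Q4.
X, y = np.random.rand(120, 16), np.random.choice(["Q1", "Q2", "Q3", "Q4"], size=120)
print(evaluate_segment(X, y))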


Table 3. F-measure results of structural segment classification from song audio (aggregate rank in parentheses)
Song Structural Segment    ML1     ML2     ML3     ML4     ML5     ML6
Bridge (3)                 0.471   0.606   0.598   0.570   0.415   0.433
Chorus (1)                 0.539   0.574   0.707   0.683   0.500   0.557
Intro (4)                  0.432   0.382   0.429   0.469   0.236   0.475
Outro (5)                  0.425   0.285   0.364   0.314   0.138   0.329
Verse (2)                  0.534   0.598   0.658   0.615   0.528   0.490

Table 4. F-measure results of structural segment classification from song lyrics (aggregate rank in parentheses)
Song Structural Segment    ML1     ML2     ML3     ML4     ML5     ML6
Bridge (2)                 0.546   0.474   0.460   0.577   0.186   0.373
Chorus (1)                 0.609   0.644   0.770   0.678   0.447   0.487
Intro (4)                  0.283   0.327   0.287   0.254   0.167   0.323
Outro (5)                  0.201   0.179   0.218   0.216   0.155   0.206
Verse (3)                  0.449   0.394   0.510   0.458   0.274   0.362

The aggregate method [29] is used to determine the best choice, because the various classification methods produce different ranking results. If each classification result has n alternatives, then each alternative is given a value according to its rank: the first rank is given the highest value, n, the second rank is given n-1, the third rank n-2, and so on. The alternative with the highest sum of values takes the top aggregate rank.

Table 3 shows the results of the emotion classification using song audio. The aggregate ranking for audio shows that the chorus is in the first position, followed in order by verse, bridge, intro and outro. Table 4 shows the results of the emotion classification using song lyrics. The number in parentheses after each structural segment in Tables 3 and 4 is its aggregate rank. The aggregate ranking for lyrics shows that the chorus is in the first position, followed in order by bridge, verse, intro and outro. For the structural segments of audio and lyrics, the highest position is thus the same section; only bridge and verse swap places. The three highest positions, Chorus-Bridge-Verse, are used in the next process.

Among the six classification methods, the Random Forest method (ML3) in Tables 3 and 4 produces the ranking that most closely matches the aggregate ranking, so the predictions used in the next process are the emotion label predictions from ML3.

In the next process, to obtain the best analysis results, the three structural segments that represent the emotion of the song are used: chorus, bridge, and verse. The predicted emotion labels are presented in the form of a matrix for each structural segment, for lyrics and audio separately.


<segment label="Chorus" start="00:29:410" 𝑐𝑎11 , 𝑐𝑎1,2 𝑐𝑎1,3 𝑐𝑎1,4


𝑐𝑎2,1 𝑐𝑎2,2 𝑐𝑎2,3 𝑐𝑎2,4
end="00:36:239" start_sec="29.4107308" 𝐶𝐴 = [ ] (1)
⋮ ⋮ ⋮ ⋮
end_sec="36.2399923"/> 𝑐𝑎𝑖,1 𝑐𝑎𝑖,2 𝑐𝑎𝑖,3 𝑐𝑎𝑖,4
<segment label="Verse" start="00:36:239"
end="00:57:710" start_sec="36.2399923" 𝑏𝑎1,1 𝑏𝑎1,2 𝑏𝑎1,3 𝑏𝑎1,4
end_sec="57.7109001"/> 𝑏𝑎2,1 𝑏𝑎2,2 𝑏𝑎2,3 𝑏𝑎2,4
𝐵𝐴 = [ ] (2)
<segment label="Bridge" start="00:57:710" ⋮ ⋮ ⋮ ⋮
end="01:31:965" start_sec="57.7109001" 𝑏𝑎𝑖,1 𝑏𝑎𝑖,2 𝑏𝑎𝑖,3 𝑏𝑎𝑖,4
end_sec="91.9659766"/>
𝑣1𝑎1,1 𝑣1𝑎1,2 𝑣1𝑎1,3 𝑣1𝑎1,4
<segment label="Verse" start="01:31:965" 𝑣1𝑎2,1 𝑣1𝑎2,2 𝑣1𝑎2,3 𝑣1𝑎2,4
end="01:49:562" start_sec="91.9659766" 𝑉1𝐴 = [ ] (3)
⋮ ⋮ ⋮ ⋮
end_sec="109.5626313"/> 𝑣1𝑎𝑖,1 𝑣1𝑎𝑖,2 𝑣1𝑎𝑖,3 𝑣1𝑎𝑖,4
<segment label="Bridge" start="01:49:562"
end="02:24:200" start_sec="109.5626313"
end_sec="144.2006599"/>
<segment label="Verse" start="02:24:200"
end="02:39:659" start_sec="144.2006599"
end_sec="159.6600000"/>
<segment label="Chorus" start="02:39:659"
end="02:54:530" start_sec="159.6600000"
end_sec="174.5304761">
<alt_label>outro</alt_label>
</segment>

𝐶𝐴 is a prediction frequency matrix with audio


chorus vector elements. Matrix 𝐶𝐴 containing count
of Thayer label emotion predict in chorus audio with
elements 𝑐𝑎𝑖,𝑗 . Matrix 𝐶𝐴 shown in Eq.(1). The first
Figure. 7 Scheme of vector hybrid chorus formation
subscript (𝑐𝑎𝑖 ) will refer to the row position or row
of song id in the array. The second subscript (𝑐𝑎𝑗 ) Furthermore, matrix prediction of lyrics is
will refer to to the column position or column of formed with the same process as forming an audio
Thayer label emotions in audio. In this research the prediction matrix. Both matrix predictions of the
number of datasets is 100 songs, so the maximal audio and lyrics structural segment is hybridized with
value is 𝑖 = 100 and 𝑗= 4 (representing the data of the the concept of majority voting and the sum of hybrid
Thayer label emotion Q1, Q2, Q3 and Q4). 𝐵𝐴 is a matrices, to detect the emotions of the whole song.
Prediction frequency matrix with audio bridge vector Schematic of forming a hybrid vector, especially for
elements which contain the count of Thayer label chorus shows in Fig.7. Variable P1, P2, P3, Pn are
emotion predict in bridge audio with elements 𝑏𝑎𝑖,𝑗 . label prediction for each chorus of song
Prediction frequency matrix 𝐵𝐴 shown in Eq. (2). (Q1/Q2/Q3/Q4).
𝑉1𝐴 is a matrix that have a count of Thayer label
emotion predict in v1 audio as its elements data. The 6.2 Emotion detection algorithm
elements of the matrix 𝑉1𝐴 are 𝑣1𝑎𝑖,𝑗 . Prediction
The next step is detecting emotions by using
frequency matrix 𝑉1𝐴 shown in Eq.(3). Similarly emotion detection algorithm. There are 12 alternative
with Matrix 𝑉1𝐴, Prediction frequency matrix 𝑉2𝐴 combination from three (3) models to detect song
represent the v2 audio matrix and Prediction emotions. Model 1 is the first emotional detection
frequency matrix 𝑉3𝐴 represent the v3 audio matrix. model by using the concept of majority voting. In
Beside 5 audio matrices, we also have 5 lyric majority voting, each structural segment is decided its
Prediction frequency matrices namely: matrix 𝐶𝐿 emotional predicted label. It based on maximum
(chorus lyrics), matrix 𝐵𝐿 (bridge lyrics), matrix 𝑉1𝐿 value element of a hybrid vector. Emotion label
(v1 lyrics), matrix 𝑉2𝐿 (v2 lyrics), and matrix 𝑉3𝐿 prediction is taken based on the maximum value of
(v3 lyrics). vector elements. If each structural segment already
has a prediction label, the major prediction label is
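A minimal sketch of how such prediction frequency matrices (Eqs. (1)-(3)) can be built and hybridized by matrix addition is given below; the function and variable names are illustrative and not taken from the paper.

# Minimal sketch of building a prediction frequency matrix and of the hybrid step
# that sums the audio and lyric matrices of one structural segment type.
import numpy as np

QUADRANTS = ["Q1", "Q2", "Q3", "Q4"]

def prediction_frequency_matrix(predictions, n_songs):
    """predictions: list of (song_index, predicted_label) for one segment type.
    Returns an n_songs x 4 matrix of label counts (e.g. CA for chorus audio)."""
    m = np.zeros((n_songs, len(QUADRANTS)), dtype=int)
    for song_idx, label in predictions:
        m[song_idx, QUADRANTS.index(label)] += 1
    return m

# Example: two songs, chorus predictions from the audio and the lyric classifiers.
chorus_audio = prediction_frequency_matrix([(0, "Q1"), (0, "Q1"), (1, "Q3")], n_songs=2)   # CA
chorus_lyric = prediction_frequency_matrix([(0, "Q4"), (0, "Q1"), (1, "Q3")], n_songs=2)   # CL
hybrid_chorus = chorus_audio + chorus_lyric   # sum of matrices: hybrid chorus vectors per song
print(hybrid_chorus)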

6.2 Emotion detection algorithm

The next step is detecting emotions using an emotion detection algorithm. There are 12 alternative combinations of structural segments and three (3) models to detect song emotions. Model 1 is the first emotion detection model and uses the concept of majority voting. In majority voting, the emotion label of each structural segment is decided first, based on the maximum-value element of its hybrid vector. Once each structural segment has a prediction label, the majority label is taken through the concept of majority voting; that majority label is the song's predicted emotion label. Model 1 is shown in Fig. 8(a). Model 2 is an emotion detection model that uses the sum of the hybrid vectors of the analyzed structural segments followed by majority voting; it can be seen in Fig. 8(b). Pc, Pb and Pv2 are abbreviations of the predicted chorus label, predicted bridge label and predicted v2 label. The hybrid vectors for chorus, bridge and v2 are named hc, hb and hv2. The hybrid matrix has been synchronized between audio and lyrics, and Ps is the label prediction from the hybrid matrix of the song.

Figure 8. Emotion detection models using 3 structural segments (Chorus-Bridge-Verse2): (a) model 1 and (b) model 2

7. Result and discussion

This study was tested using 12 case studies. Each case study uses 2 to 7 structural segments to represent songs. The 12 case studies were tested using 3 emotion detection models: model 1, model 2 and model 3. The confusion matrix for the multi-class case is shown in Table 5. From this confusion matrix, we get the True Positive, False Negative and False Positive values. The F-Measure of each case study is calculated with the F-measure formula in Eq. (6); in this formula, recall and precision are calculated according to Eq. (4) and Eq. (5). True Positives (TP) are on the diagonal. False Positives (FP) are the column-wise sums without the diagonal (e.g. E_BA, E_CA, E_DA for class A). False Negatives (FN) are the row-wise sums without the diagonal (e.g. E_AB, E_AC, E_AD for class A).

Table 5. Confusion matrix for the multi-class case
                      Prediction
                      A       B       C       D
Actual      A         TP_A    E_AB    E_AC    E_AD
            B         E_BA    TP_B    E_BC    E_BD
            C         E_CA    E_CB    TP_C    E_CD
            D         E_DA    E_DB    E_DC    TP_D

Recall = \frac{TP}{TP + FN} \quad (4)

Precision = \frac{TP}{TP + FP} \quad (5)

F\text{-}Measure = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} \quad (6)
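A compact sketch of Eqs. (4)-(6) applied to a confusion matrix laid out as in Table 5 (rows = actual, columns = predicted) is given below. Per-class values are macro-averaged here; the averaging convention is an assumption, since the paper does not state it explicitly.

# Minimal sketch of multi-class recall, precision and F-measure from a confusion
# matrix laid out as in Table 5. The example matrix is illustrative only.
import numpy as np

def precision_recall_f1(confusion):
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)              # TP: diagonal
    fp = confusion.sum(axis=0) - tp      # FP: column sums without the diagonal
    fn = confusion.sum(axis=1) - tp      # FN: row sums without the diagonal
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return precision.mean(), recall.mean(), f1.mean()

# Illustrative 4x4 confusion matrix for Q1..Q4.
cm = [[20, 2, 2, 1],
      [3, 18, 2, 2],
      [1, 2, 19, 3],
      [2, 1, 2, 20]]
print(precision_recall_f1(cm))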
International Journal of Intelligent Engineering and Systems, Vol.13, No.1, 2020 DOI: 10.22266/ijies2020.0229.09
Received: August 10, 2019. Revised: October 29, 2019. 95

Table 6. Recall test results
Combination of Structural Segments          Audio   Lyrics  Model 1 Model 2 Model 3
Chorus-Bridge-V1-V2-V3 0.670 0.720 0.550 0.770 0.650
Chorus-V1 0.661 0.670 0.420 0.800 0.800
Chorus-V2 0.720 0.680 0.410 0.800 0.800
Chorus-V3 0.690 0.730 0.580 0.807 0.790
Bridge-V1 0.490 0.350 0.505 0.525 0.499
Bridge-V2 0.480 0.360 0.370 0.464 0.500
Bridge-V3 0.406 0.260 0.328 0.408 0.440
Chorus-bridge 0.680 0.700 0.590 0.810 0.810
Chorus-bridge-V1 0.700 0.700 0.510 0.790 0.730
Chorus-Bridge-V2 0.710 0.710 0.470 0.810 0.760
Chorus-bridge-V3 0.700 0.710 0.590 0.830 0.770
Chorus-bridge-V1-V2-V3-intro-outro 0.720 0.650 0.388 0.730 0.660

Table 7. Precision test results
Combination of Structural Segments          Audio   Lyrics  Model 1 Model 2 Model 3
Chorus-Bridge-V1-V2-V3 0.736 0.664 0.70 0.773 0.68
Chorus-V1 0.658 0.678 0.788 0.807 0.807
Chorus-V2 0.718 0.730 0.765 0.801 0.801
Chorus-V3 0.713 0.776 0.861 0.841 0.818
Bridge-V1 0.490 0.350 0.622 0.571 0.551
Bridge-V2 0.641 0.400 0.616 0.595 0.601
Bridge-V3 0.573 0.368 0.613 0.612 0.643
Chorus-bridge 0.698 0.745 0.817 0.833 0.833
Chorus-bridge-V1 0.703 0.711 0.82 0.796 0.75
Chorus-Bridge-V2 0.721 0.732 0.81 0.819 0.778
Chorus-bridge-V3 0.703 0.734 0.825 0.825 0.778
Chorus-bridge-V1-V2-V3-intro-outro 0.737 0.645 0.474 0.728 0.653

Table 8. F-Measure test results
Combination of Structural Segments          Audio   Lyrics  Model 1 Model 2 Model 3
Chorus-Bridge-V1-V2-V3 0.720 0.663 0.586 0.764 0.639
Chorus-V1 0.651 0.653 0.533 0.795 0.795
Chorus-V2 0.717 0.676 0.515 0.795 0.795
Chorus-V3 0.694 0.731 0.673 0.813 0.796
Bridge-V1 0.487 0.351 0.409 0.519 0.499
Bridge-V2 0.524 0.367 0.424 0.484 0.513
Bridge-V3 0.441 0.287 0.394 0.440 0.483
Chorus-bridge 0.681 0.699 0.671 0.811 0.811
Chorus-bridge-V1 0.692 0.681 0.578 0.785 0.722
Chorus-Bridge-V2 0.705 0.700 0.550 0.807 0.757
Chorus-bridge-V3 0.694 0.701 0.661 0.823 0.762
Chorus-bridge-V1-V2-V3-intro-outro 0.719 0.645 0.369 0.728 0.641

In Tables 6-8 it can be seen that the case study with the best accuracy is the structural combination chorus-bridge-v3 using emotion detection model 2. The F-Measure value obtained is 0.823. This is possible because in model 2 the hybrid vectors are combined first with the sum-of-vectors concept and the prediction label is determined afterwards, so a prediction label is always produced.

This study also compared the song emotion detection results with previous research [12]. We could not obtain the exact features used in [12], so what we compare in this case is the data used as input for the emotion detection system. System 1, from previous research, uses 30 seconds of audio and all lyrics [12]. System 2 uses the hybrid matrix for the chorus structural segment audio [16] and all lyrics. System 3 uses the hybrid matrix for all structural segment audio and all lyrics. System 4 uses [20] to detect emotion with the valence and arousal dimensions. System 5, our proposal, uses chorus-bridge-v3 synchronized between audio and lyrics. Using the emotion detection method of model 2 and the Bimodal dataset for testing, the F-Measure results shown in Fig. 9 show that our proposed input data model has better results, with a value of 0.695.

Figure 9. Comparison of song emotion detection results based on audio and lyric input data

From the results of this research it can be seen that, using the second emotion detection model, emotion detection based on lyrics reaches an F-measure value of 0.731 and emotion detection based on audio reaches 0.720. Emotion detection based on lyrics and audio together reaches a Recall value of 0.830 (shown in Table 6), a Precision value of 0.861 (shown in Table 7) and an F-Measure value of 0.823 (shown in Table 8). The highest Recall and F-Measure are obtained for the combination chorus-bridge-V3 with model 2, but the highest Precision is obtained for the combination chorus-V3 with model 1. The F-measure combines the precision and recall values, so the reference we use to assess system reliability is the highest F-measure value. The comparison of emotion detection based on audio and lyric features with the previous study also looks better; the F-measure value of the proposed system in that comparison is 0.695.
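The explanation above attributes model 2's robustness to summing the hybrid vectors before deciding on a label. The sketch below contrasts the two decision rules for a single song; tie-breaking by the first maximum is an assumption of the sketch, not something stated in the paper.

# Minimal sketch contrasting the two decision rules for one song: Model 1 votes over
# per-segment arg-max labels, Model 2 sums the hybrid vectors first and takes a single
# arg-max over the summed counts.
from collections import Counter
import numpy as np

QUADRANTS = ["Q1", "Q2", "Q3", "Q4"]

def model1(hybrid_vectors):
    """Majority voting over the predicted label of each structural segment."""
    labels = [QUADRANTS[int(np.argmax(v))] for v in hybrid_vectors]
    return Counter(labels).most_common(1)[0][0]

def model2(hybrid_vectors):
    """Sum the hybrid vectors of the segments, then take one arg-max label."""
    total = np.sum(hybrid_vectors, axis=0)
    return QUADRANTS[int(np.argmax(total))]

# Hybrid (audio + lyric) count vectors for chorus, bridge and v3 of one song.
hc, hb, hv3 = np.array([3, 0, 1, 0]), np.array([1, 2, 1, 0]), np.array([2, 0, 2, 0])
print(model1([hc, hb, hv3]), model2([hc, hb, hv3]))   # both 'Q1' for this example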


8. Conclusion

The goal of this research is to detect the emotion of a whole song by using synchronized audio and lyric features. The difference from previous research is the input data used and the integration of the audio and lyric features. Generally, song emotion detection has used all lyrics and audio data of 30-45 seconds whose duration comes from an expert. In this research, song structural segments of audio and lyrics are used as input data. The lyric data are lyrics that are synchronized with the audio of the song structural segments.

The analysis of which song structural segments represent the emotion of the whole song is done with six classification methods and the aggregate method. The results show that chorus, bridge, and verse are the best segments. The audio features are extracted using MIRToolbox and the lyric features are extracted psycholinguistically and stylistically. The audio extraction and the feature selection process with CFS yield 16 audio features. For the lyric features, 11 psycholinguistic features are extracted. The lyric extraction process uses CBE-Ex, because the system that uses it is better than the previous CBE, with an F-Measure value of 0.399.

A hybrid approach is used to integrate the already synchronized structural segment audio and lyrics. The data of each structural segment are presented in prediction frequency matrices. Then, with the sum of matrices and majority voting, a hybrid matrix is obtained to detect the emotion of the whole song. From the 3 emotion detection models over 12 combinations of structural segments, the best F-measure, for chorus-bridge-v3, was 0.823.

The development of this research can be done by adding other features to the audio and lyric features. Word sense disambiguation [30] can be used for additional lyric features. A dependency parser [31] can be used to preprocess the data before lyric extraction; it is used to know the relationships between words. When the psycholinguistic feature extraction process is done, related words can then be processed together, so that the extraction and detection results can be better.

References
[1] E. Cano, "Mood-Based On-Car Music Recommendation", Lecture Notes of the Institute for Computer Science, No. November, 2016.
[2] J. Jones, "The role of music in your classroom", in Music Curriculum Exchange, Lincoln, pp. 90–92, 2010.
[3] J. A. Ridoean and R. Sarno, "Music Mood Classification Using Audio Power and Audio Harmonicity Based on MPEG-7 Audio Features and Support Vector Machine", In: Proc. of International Conference on Science in Information Technology (ICSITech), pp. 72–77, 2017.
[4] L. Lu, D. Liu, and H. Zhang, "Automatic Mood Detection and Tracking of Music Audio Signals", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 1, pp. 5–18, 2006.
[5] A. Schindler and A. Rauber, "Capturing the Temporal Domain in Echonest Features for Improved Classification Effectiveness", In: Proc. of International Workshop on Adaptive Multimedia Retrieval, pp. 1–15, 2014.
[6] R. Panda and R. P. Paiva, Automatic Mood Tracking in Audio Music. Universidade de Coimbra, 2010.
[7] V. Kumar, "Mood Classification of Lyrics using SentiWordNet", In: Proc. of International Conference on Computer Communication and Informatics (ICCCI 2013), pp. 1–5, 2013.
[8] Y. Hu, X. Chen, and D. Yang, "Lyric-Based Song Emotion Detection with Affective Lexicon and Fuzzy Clustering Method", In: Proc. of International Society for Music Information Retrieval Conference, pp. 123–128, 2009.


[9] M. Kim and H. Kwon, "Lyrics-based Emotion Classification using Feature Selection by Partial Syntactic Analysis", In: Proc. of International Conference on Tools with Artificial Intelligence, 2011.
[10] C. Laurier, Automatic Classification of Musical Mood by Content Based Analysis. Barcelona, Spain: Universitat Pompeu Fabra, 2011.
[11] J. S. Downie and A. F. Ehmann, "Lyric text mining in music mood classification", In: Proc. of International Society for Music Information Retrieval Conference, pp. 411–416, 2009.
[12] R. Malheiro, R. Panda, P. Gomes, and R. Paiva, "Bi-modal music emotion recognition: Novel lyrical features and dataset", In: Proc. of International Workshop on Music and Machine Learning, pp. 1–5, 2016.
[13] M. Soleymani, M. N. Caro, E. M. Schmidt, C. Sha, and Y. Yang, "1000 Songs Database", In: Proc. of ACM International Workshop on Crowdsourcing for Multimedia, pp. 4–7, 2014.
[14] D. Liu, L. Lu, and H.-J. Zhang, "Automatic mood detection from acoustic music data", In: Proc. of the International Conference on Music Information Retrieval, pp. 13–17, 2003.
[15] M. A. Pandharipande and S. K. Kopparapu, "Audio segmentation based approach for improved emotion recognition", In: Proc. of TENCON 2015 - 2015 IEEE Region 10 Conference, Macao, No. I, pp. 1–4, 2015.
[16] C. Yeh, C. Tseng, W. Chen, C. Lin, Y. Tsai, Y. Bi, H. Lin, Y. Lin, and Ho-yi, "Popular music representation: chorus detection & emotion recognition", Multimedia Tools and Applications, Vol. 73, pp. 2103–2128, 2014.
[17] O. Lartillot, "MIRtoolbox 1.6.1", Denmark, 2014.
[18] C. D. Manning, "Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?", In: Proc. of Computational Linguistics and Intelligent Text Processing, pp. 171–189, 2011.
[19] F. H. Rachman, R. Sarno, and C. Fatichah, "Music emotion classification based on lyrics-audio using corpus based emotion", International Journal of Electrical and Computer Engineering, Vol. 8, No. 3, pp. 1720–1730, 2018.
[20] V. L. Nguyen, D. Kim, V. P. Ho, and Y. Lim, "A New Recognition Method for Visualizing Music Emotion", International Journal of Electrical and Computer Engineering (IJECE), Vol. 7, No. 3, pp. 1246–1254, 2017.
[21] E. Peiszer, T. Lidy, and A. Rauber, "Automatic Audio Segmentation: Segment Boundary and Structure Detection in Popular Music", In: Proc. of LSAS, Vol. 106, No. August, pp. 45–59, 2008.
[22] S. Ewert, "Chroma Toolbox: Matlab Implementations for Extracting Variants of Chroma-Based Audio Features", International Society for Music Information Retrieval, pp. 1–6, 2011.
[23] O. Lartillot and P. Toiviainen, "A Matlab toolbox for musical feature extraction from audio", In: Proc. of International Conference on Digital Audio Effects, pp. 1–8, 2007.
[24] F. H. Rachman, R. Sarno, and C. Fatichah, "CBE: Corpus-Based of Emotion for Emotion Detection in Text Document", In: Proc. of ICITACEE, pp. 331–335, 2016.
[25] C. Strapparava and A. Valitutti, "WordNet-Affect: an Affective Extension of WordNet", In: Proc. of LREC, pp. 1083–1086, 2004.
[26] M. M. Bradley and P. J. Lang, "Affective Norms for English Words (ANEW): Instruction Manual and Affective Ratings", The Center for Research in Psychophysiology, University of Florida, 1999.
[27] F. Kaiser, Music Structure Segmentation. Berlin: Universität Berlin, 2012.
[28] D. Unal, "Comparison of Data Mining Classification Algorithms Determining the Default Risk", Scientific Programming, pp. 1–8, 2019.
[29] H. Lee and C. Chang, "Comparative analysis of MCDM methods for ranking renewable energy sources in Taiwan", Renewable and Sustainable Energy Reviews, Vol. 92, No. April 2017, pp. 883–896, 2018.
[30] B. S. Rintyarna and R. Sarno, "Adapted Weighted Graph for Word Sense Disambiguation", In: Proc. of IcoICT, 2016.
[31] Z. Jie and W. Lu, "Dependency-based Hybrid Trees for Semantic Parsing", In: Proc. of Conference on Empirical Methods in Natural Language Processing, pp. 2431–2441, 2018.
