Hybrid Approach of Structural Lyric and Audio Segments for Detecting Song Emotion
1 Department of Informatics, Institut Teknologi Sepuluh Nopember, Indonesia
2 Department of Informatics, University of Trunojoyo Madura, Indonesia
* Corresponding author's Email: hastarita.fika@gmail.com
Abstract: Detecting song emotion is important, yet many studies have handled song lyrics and song audio separately. This research proposes a method for detecting song emotion from integrated song lyrics and audio. Synchronizing the right structural segment of the audio with its lyrics enables a hybrid approach that detects the right emotion. Song emotions are classified into the Thayer emotion labels. The features of the lyrics are psycholinguistic and stylistic features, whereas the audio features are extracted by analyzing the audio signal waveform with the Fast Fourier Transform (FFT) method. A song can be divided into five structural segments: intro, chorus, bridge, verse, and outro. Preprocessing uses Correlation Feature Selection (CFS) for the audio and text preprocessing for the lyrics. Six classification methods are used to classify emotion based on the lyrics and audio of the song structural segments separately. An aggregate method is then used to analyze the classification results in order to obtain the structural segments that represent emotion. Finally, a hybrid approach combines the audio and lyric features for emotion detection, using the sum of matrices and the majority voting concept. The resulting F-Measure value is 0.823.
Keywords: Emotion detection, Song structural segment, Audio features, Lyric features, CFS, Aggregate method,
Hybrid approach.
… still low. Lyric text mining has been used to find the role of lyrics in improving the accuracy of mood detection [11]. Besides lyrics, combined testing of audio and lyric features led to the conclusion that audio is not always the dominant feature in determining the emotion of a song, and that not all mood labels can be predicted with high accuracy [11]. Even though these studies obtained good results, they used audio datasets in which the audio duration was determined by experts. In fact, a song lasts two to five minutes, and it is difficult to determine which segment represents the emotion of the song.

Music Emotion Recognition (MER) generally uses a bimodal dataset [12], that is, a dataset that combines audio vocals and instruments, and emotion detection usually uses the full lyrics of the song. The Bimodal Dataset [1] is one such song emotion dataset; it uses 30 seconds of audio and the full lyrics of each song. The 1000 Songs dataset [13] is also a song emotion dataset and uses 45 seconds of audio. Both datasets use short audio clips whose duration is determined by experts. The duration chosen by the expert differs from song to song, so it is difficult to determine the segment automatically when a song has no information from an expert, and the song emotion is detected wrongly if the system does not have the right audio segment.

Previous research shows that there are emotional differences between song segments [14]. Its results show that the best accuracy for music mood detection is obtained with durations of 8 and 16 seconds. Other research uses a spoken speech dataset (Emo-DB). The emotion of spoken speech is analyzed after the audio is segmented into 400 ms frames. The emotion prediction differs between segments, and a fusion mechanism combines the segment emotions into a global emotion for the full audio [15].

Emotion detection from audio and lyrics was performed by Ricardo Malheiro in 2016 [12]. They used the Bimodal dataset with 133 songs; each entry has 30 seconds of audio and the full lyrics of the song. They used a subset of 1701 audio features covering melody, timbre, rhythm, and others [12]. The lyric features were obtained from lyric tools (Synesketch, ConceptNet, LIWC, and General Inquirer). Fusing the audio and lyric features in emotion recognition increased the F-Measure to 88.4%. Although the results are very high, only songs that have the right audio segment can have their emotion detected accurately. The audio data used in that study are short clips, while the lyric data cover the whole song, so there is no synchronization between the audio and the lyrics. If there is a new song, an expert is needed to determine the audio segment. Ricardo [12] uses short audio to detect song emotions, and the duration of the short audio comes from experts. Ideally, the system should be able to detect emotions without expert-selected short audio. This research can detect song emotions automatically because the system determines the structural segments that represent emotion based on Correlation Feature Selection (CFS) and an aggregate method.

The structural segment of a song is a part of the song that can be processed for emotion detection besides the audio and lyrics. Song structure has a unique form and is composed into a song. In songwriting, a song is generally composed of five parts: intro, verse, bridge, chorus, and outro [16]. Each part can occupy a different position in a song. The intro is the beginning of a song and is usually purely instrumental. The verses are the stanzas of the song with lyrics. The chorus carries the message or core of the song; the verse and chorus parts can be repeated. The bridge connects the other parts of the song, while the outro is the closing part. The intro and outro are usually at the beginning and end of the song, while the verse, chorus, and bridge are located in the middle and can appear in different orders. An example of the structure of a song is: Intro – Verse – Chorus – Verse – Bridge – Chorus – Outro.

Chia [16] conducted research on a chorus detection algorithm and emotion detection of songs based on audio data. The intensity, frequency band, and rhythm regularity of the chorus are used to detect song emotions. Their emotion detection algorithm gives similar results for the same melody across different languages and lyrics, which shows that the choice of song structure influences song emotion detection.

We propose song emotion detection that uses audio and lyrics synchronously. Song structural segments are also used in this research. The proposed system can automatically recognize the song emotion given the structural segment data.

We have two contributions in this research. The first contribution is automatic selection of the segments that represent the emotion of the whole song by analyzing the structure of the song. The second contribution is a hybrid approach that combines audio and lyric features from the song structural segments for emotion detection using a prediction frequency matrix. With these contributions, to find out the emotion of a song we only need to know the structure of that song, so even if the song has no expert-provided audio samples, we can still find out its emotion.
This paper is organized as follows. Related work and our contributions are presented in the introduction in Section 1. Section 2 presents the proposed method of this research. Section 3 describes the structural emotion song dataset used in this research. Section 4 describes the audio and lyric feature extraction methods. Section 5 presents the emotion analysis of the structural segments, which determines which structural segments represent the emotion of a whole song. Section 6 presents the model and method used for hybrid features in song emotion detection. Results and discussion are presented in Section 7. Finally, Section 8 provides conclusions and future work.

2. Proposed method

The research was carried out according to the proposed method in Fig. 1. The song data processed are structural audio data and lyrics, and the lyric data are synchronized with the segments in the audio data. The features of the audio segments are extracted by analyzing the audio signal waveform with the Fast Fourier Transform (FFT) method in the MIR Toolbox [17]. The features of an audio segment are sub-features of the dynamics, rhythm, timbre, pitch, and tonality features. A feature selection step is then performed to reduce the number of features used in the classification; this research uses Correlation Feature Selection (CFS) [18-19] as the feature selection method. The audio features of each structural segment are classified using the Random Forest method. The classification results show which structural segment best represents the emotion of the song. Each song has several structural segments, and each structural segment is represented as a matrix, so there are several audio and lyric matrices. A row of a matrix is the index of a song, and a column holds the count of predictions of each emotion label for that song structural segment.
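As an illustration of this audio branch, the following sketch pairs a simplified correlation-based merit heuristic (in the spirit of CFS) with a Random Forest classifier. It is a minimal sketch under stated assumptions, not the paper's implementation: the feature matrix, the label encoding, the number of selected features, and the use of scikit-learn are all illustrative.

```python
# Simplified CFS-style greedy feature selection followed by Random Forest.
# All data below is synthetic; in the paper, rows would be structural audio
# segments and columns MIR Toolbox features (dynamics, rhythm, timbre, ...).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def cfs_like_selection(X, y, k=10):
    """Greedy forward selection with a CFS-style merit:
    merit = n*avg(feature-class corr) / sqrt(n + n*(n-1)*avg(feature-feature corr))."""
    relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    selected = []
    while len(selected) < k:
        best_j, best_merit = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            cand = selected + [j]
            rcf = relevance[cand].mean()                      # feature-class correlation
            if len(cand) == 1:
                rff = 0.0
            else:                                             # feature-feature correlation
                corr = np.abs(np.corrcoef(X[:, cand], rowvar=False))
                rff = corr[np.triu_indices(len(cand), k=1)].mean()
            n = len(cand)
            merit = n * rcf / np.sqrt(n + n * (n - 1) * rff)
            if merit > best_merit:
                best_j, best_merit = j, merit
        selected.append(best_j)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40))       # hypothetical audio feature matrix
y = rng.integers(0, 4, size=100)     # hypothetical Thayer quadrant labels Q1..Q4 as 0..3

cols = cfs_like_selection(X, y, k=10)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("5-fold accuracy:", cross_val_score(clf, X[:, cols], y, cv=5).mean())
```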
The lyric data are first synchronized with the structural audio segments using the lrc file format (short for LyRiCs). The lyrics are preprocessed in several steps: slang word repair, POS tagging, Porter stemming, and stopword removal. Slang word repair uses a corpus of slang words that are often found in lyrics. Stanford Part-of-Speech (POS) Tagging [18] is used to find out the position of each word in the lyrics. Several word positions rarely describe the emotion of the lyrics (shown in Table 1); words in those positions are deleted.

Table 1. Filtering POS tagging
POS tag    Meaning of POS position
DT         Determiner
CD         Cardinal number
WRB        Wh-adverb
TO         To
CC         Coordinating conjunction
PRP$       Possessive pronoun
MD         Modal
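A minimal sketch of this preprocessing chain is shown below. It uses NLTK as a stand-in for the Stanford POS tagger (both emit Penn Treebank tags); the toy slang corpus and the example lyric line are illustrative assumptions, not the paper's resources.

```python
# Lyric preprocessing sketch: slang repair -> POS filtering (Table 1) ->
# Porter stemming and stopword removal.
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
#           nltk.download("stopwords")

SLANG = {"luv": "love", "gonna": "going to", "wanna": "want to"}   # toy slang corpus
FILTERED_TAGS = {"DT", "CD", "WRB", "TO", "CC", "PRP$", "MD"}      # positions from Table 1
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess_segment(lyric_line):
    # 1) slang word repair
    tokens = [SLANG.get(t.lower(), t.lower()) for t in nltk.word_tokenize(lyric_line)]
    tokens = " ".join(tokens).split()
    # 2) POS tagging and removal of the word positions listed in Table 1
    tagged = nltk.pos_tag(tokens)
    kept = [w for w, tag in tagged if tag not in FILTERED_TAGS and w.isalpha()]
    # 3) Porter stemming and stopword removal
    return [stemmer.stem(w) for w in kept if w not in stop_words]

print(preprocess_segment("I'm gonna luv you forever, yeah"))
```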
After preprocessing the lyrics, the next step is feature extraction. The features extracted from the lyrics are psycholinguistic and stylistic features. The psycholinguistic feature is based on an emotional psychology dataset, namely the CBE (Corpus-Based Emotion) dataset from previous research [19], which is expanded according to the Thayer emotion labels.

The stylistic features are words that are often found in lyrics but are not in the English dictionary (such as 'ah', 'ooh', 'yeah'). These unique words, together with the exclamation marks and question marks in the lyrics, form the stylistic features. These features are used as input to the emotion classification process for each lyric structural segment. The predicted emotion label of each segment is represented using a segment matrix.
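The stylistic features can be computed with a few simple counts per lyric segment, as in the short sketch below; the list of interjection-like words is an illustrative assumption.

```python
# Stylistic lyric features: interjection-like non-dictionary words plus
# exclamation and question marks, counted per structural segment.
import re

STYLISTIC_WORDS = {"ah", "ooh", "oh", "yeah", "la", "na", "hey"}   # assumed word list

def stylistic_features(raw_segment):
    tokens = re.findall(r"[a-zA-Z']+", raw_segment.lower())
    return {
        "interjections": sum(t in STYLISTIC_WORDS for t in tokens),
        "exclamations": raw_segment.count("!"),
        "questions": raw_segment.count("?"),
    }

print(stylistic_features("Ooh yeah! Do you feel it? Yeah, yeah!"))
```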
The audio and lyric matrices of each structural segment are combined into a hybrid matrix using the sum-of-matrices operation. This hybrid matrix is used in the emotion detection model to obtain the predicted emotion label of the song. In the emotion detection model, the sum of matrices and majority voting are applied to several combinations of the analyzed structural segments.
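The sum-of-matrices step for one structural segment can be illustrated as follows; the prediction counts are made-up values, not results from the paper.

```python
# Hybrid matrix for one segment (e.g. the chorus): rows index songs, columns
# count how often each Thayer quadrant Q1..Q4 was predicted for that segment.
import numpy as np

audio_counts = np.array([[3, 1, 0, 0],    # song 0: audio classifiers mostly predict Q1
                         [0, 2, 2, 0]])   # song 1
lyric_counts = np.array([[2, 0, 1, 0],
                         [0, 1, 3, 0]])

hybrid_chorus = audio_counts + lyric_counts            # sum of matrices
segment_label = hybrid_chorus.argmax(axis=1) + 1       # most frequent quadrant per song
print(hybrid_chorus)
print([f"Q{q}" for q in segment_label])                # ['Q1', 'Q3']
```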
The emotion labels used in this research are the Thayer emotion labels [6]. The Thayer emotion label has four quadrants depicted in a two-dimensional model whose coordinate axes are valence and arousal, each with a 'low' and a 'high' area. Thayer's emotion labels are Quadrant 1 (Q1), Quadrant 2 (Q2), Quadrant 3 (Q3), and Quadrant 4 (Q4). Q1 is the high-valence, high-arousal area; examples of Q1 are happy and excited emotions. Q2 is low valence and high arousal; angry and nervous emotions belong to Q2. Q3 is low valence and low arousal (for example, sad). Q4 is high valence and low arousal (for example, calm and relaxed) [20].
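A minimal sketch of this quadrant mapping, assuming valence and arousal scores normalized so that zero separates the 'low' and 'high' areas:

```python
# Map a (valence, arousal) pair to a Thayer quadrant label.
def thayer_quadrant(valence, arousal):
    if valence >= 0 and arousal >= 0:
        return "Q1"   # high valence, high arousal: happy, excited
    if valence < 0 and arousal >= 0:
        return "Q2"   # low valence, high arousal: angry, nervous
    if valence < 0 and arousal < 0:
        return "Q3"   # low valence, low arousal: sad
    return "Q4"       # high valence, low arousal: calm, relaxed

print(thayer_quadrant(0.7, -0.4))   # -> Q4
```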
3. Structural emotion of song dataset

This research requires a dataset of structural segments of songs labeled with Thayer emotions. Previous datasets separate the structural song data from the emotional song data. The Bimodal dataset is an emotional song dataset with 133 songs, 30-second audio data, full lyric text, and Thayer emotion labels [12]. One of the structural song datasets that provides its data in XML files is …
… taken through the concept of majority voting. That majority prediction label becomes the predicted emotion label of the song. Model 1 is shown in Fig. 8(a). Model 2 is an emotion detection model that uses the sum of the hybrid vectors of the analyzed structural segments followed by majority voting; it is shown in Fig. 8(b). Pc, Pb, and Pv2 are the abbreviations of the predicted chorus, bridge, and v2 labels, and the hybrid vectors for the chorus, bridge, and v2 are named hc, hb, and hv2. The hybrid matrix has been synchronized between audio and lyrics, and Ps is the label predicted from the hybrid matrix of the song.
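The sketch below illustrates Model 2 as described above: the hybrid (audio plus lyric) prediction-count vectors hc, hb, and hv2 are summed, and the quadrant with the largest total count becomes the song prediction Ps. The count values are illustrative assumptions.

```python
# Model 2 sketch: sum of hybrid vectors over the analyzed segments, then
# the majority (largest count) quadrant is taken as the song label Ps.
import numpy as np

hc  = np.array([5, 1, 1, 0])   # hybrid chorus vector, counts for Q1..Q4
hb  = np.array([2, 0, 3, 0])   # hybrid bridge vector
hv2 = np.array([4, 0, 1, 0])   # hybrid verse-2 vector

total = hc + hb + hv2
ps = "Q" + str(total.argmax() + 1)
print(total, ps)               # [11  1  5  0] Q1
```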
7. Result and discussion

This study was tested on 12 case studies. Each case study uses 2 to 7 structural segments to represent the songs. The 12 case studies were tested with three emotion detection models: Model 1, Model 2, and Model 3. The confusion matrix for the multi-class case is shown in Table 5.

Table 5. Confusion matrix for multi-class classification
                    Predicted
                    A       B       C       D
Actual      A       TP_A    E_AB    E_AC    E_AD
            B       E_BA    TP_B    E_BC    E_BD
            C       E_CA    E_CB    TP_C    E_CD
            D       E_DA    E_DB    E_DC    TP_D

From this confusion matrix we obtain the True Positive, False Negative, and False Positive values. The F-Measure of each case study is calculated with the F-Measure formula in Eq. (6), where recall and precision are calculated according to Eqs. (4) and (5). True Positives (TP) lie on the diagonal. False Positives (FP) are the column-wise sums (e.g. E_BA, E_CA, E_DA for class A) excluding the diagonal, and False Negatives (FN) are the row-wise sums (e.g. E_AB, E_AC, E_AD for class A) excluding the diagonal.

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{4} \]

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{5} \]

\[ \text{F-Measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6} \]
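A minimal sketch of Eqs. (4)-(6) applied per class to a confusion matrix of the form in Table 5; the entries of the example matrix are illustrative assumptions, not results of this study.

```python
# Per-class Recall, Precision and F-Measure from a 4x4 confusion matrix
# (rows = actual, columns = predicted), plus their macro averages.
import numpy as np

cm = np.array([[30,  2,  1,  0],   # actual Q1
               [ 3, 25,  2,  1],   # actual Q2
               [ 1,  2, 28,  2],   # actual Q3
               [ 0,  1,  3, 27]])  # actual Q4

tp = np.diag(cm).astype(float)
fn = cm.sum(axis=1) - tp           # row-wise errors, without the diagonal
fp = cm.sum(axis=0) - tp           # column-wise errors, without the diagonal

recall    = tp / (tp + fn)                                   # Eq. (4)
precision = tp / (tp + fp)                                   # Eq. (5)
f_measure = 2 * precision * recall / (precision + recall)    # Eq. (6)
print(recall.mean(), precision.mean(), f_measure.mean())
```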
… bridge-v3 that are synchronized between audio and lyrics. Using the emotion detection method of Model 2 and the Bimodal dataset for testing, the F-Measure results shown in Fig. 9 indicate that the data input model we propose gives better results, namely 0.695.

From the results of this research, analyzed with the second emotion detection model, emotion detection based on lyrics is able to reach an F-Measure of 0.731, and emotion detection based on audio reaches 0.720. Emotion detection based on lyrics and audio together reaches a Recall of 0.830 (shown in Table 6), a Precision of 0.861 (shown in Table 7), and an F-Measure of 0.823 (shown in Table 8). The highest Recall and F-Measure are obtained for the combined chorus-bridge-V3 on M2, but the highest precision …
[9] M. Kim and H. Kwon, "Lyrics-based Emotion Classification using Feature Selection by Partial Syntactic Analysis", In: Proc. of International Conference on Tools with Artificial Intelligence, 2011.
[10] C. Laurier, Automatic Classification of Musical Mood by Content Based Analysis. Barcelona, Spain: Universitat Pompeu Fabra, 2011.
[11] J. S. Downie and A. F. Ehmann, "Lyric text mining in music mood classification", In: Proc. of International Society for Music Information Retrieval Conference, pp. 411–416, 2009.
[12] R. Malheiro, R. Panda, P. Gomes, and R. Paiva, "Bi-modal music emotion recognition: Novel lyrical features and dataset", In: Proc. of International Workshop on Music and Machine Learning, pp. 1–5, 2016.
[13] M. Soleymani, M. N. Caro, E. M. Schmidt, C. Sha, and Y. Yang, "1000 Songs Database", In: Proc. of ACM International Workshop on Crowdsourcing for Multimedia, pp. 4–7, 2014.
[14] D. Liu, L. Lu, and H.-J. Zhang, "Automatic mood detection from acoustic music data", In: Proc. of the International Conference on Music Information Retrieval, pp. 13–17, 2003.
[15] M. A. Pandharipande and S. K. Kopparapu, "Audio segmentation based approach for improved emotion recognition", In: Proc. of TENCON 2015 - 2015 IEEE Region 10 Conference, Macao, pp. 1–4, 2015.
[16] C. Yeh, C. Tseng, W. Chen, C. Lin, Y. Tsai, Y. Bi, H. Lin, Y. Lin, and Ho-yi, "Popular music representation: chorus detection & emotion recognition", Multimedia Tools and Applications, Vol. 73, pp. 2103–2128, 2014.
[17] O. Lartillot, "MIRtoolbox 1.6.1", Denmark, 2014.
[18] C. D. Manning, "Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?", In: Proc. of Computational Linguistics and Intelligent Text Processing, pp. 171–189, 2011.
[19] F. H. Rachman, R. Sarno, and C. Fatichah, "Music emotion classification based on lyrics-audio using corpus based emotion", International Journal of Electrical and Computer Engineering, Vol. 8, No. 3, pp. 1720–1730, 2018.
[20] V. L. Nguyen, D. Kim, V. P. Ho, and Y. Lim, "A New Recognition Method for Visualizing Music Emotion", International Journal of Electrical and Computer Engineering (IJECE), Vol. 7, No. 3, pp. 1246–1254, 2017.
[21] E. Peiszer, T. Lidy, and A. Rauber, "Automatic Audio Segmentation: Segment Boundary and Structure Detection in Popular Music", In: Proc. of LSAS, Vol. 106, pp. 45–59, 2008.
[22] S. Ewert, "Chroma Toolbox: Matlab Implementations for Extracting Variants of Chroma-Based Audio Features", International Society for Music Information Retrieval, pp. 1–6, 2011.
[23] O. Lartillot and P. Toiviainen, "A matlab toolbox for musical feature extraction from audio", In: Proc. of International Conference on Digital Audio Effects, pp. 1–8, 2007.
[24] F. H. Rachman, R. Sarno, and C. Fatichah, "CBE: Corpus-Based of Emotion for Emotion Detection in Text Document", In: Proc. of ICITACEE, pp. 331–335, 2016.
[25] C. Strapparava and A. Valitutti, "WordNet-Affect: an Affective Extension of WordNet", In: Proc. of LREC, pp. 1083–1086, 2004.
[26] M. M. Bradley and P. J. Lang, "Affective Norms for English Words (ANEW): Instruction Manual and Affective Ratings", The Center for Research in Psychophysiology, University of Florida, 1999.
[27] F. Kaiser, Music Structure Segmentation. Berlin: Universität Berlin, 2012.
[28] D. Unal, "Comparison of Data Mining Classification Algorithms Determining the Default Risk", Scientific Programming, pp. 1–8, 2019.
[29] H. Lee and C. Chang, "Comparative analysis of MCDM methods for ranking renewable energy sources in Taiwan", Renewable and Sustainable Energy Reviews, Vol. 92, pp. 883–896, 2018.
[30] B. S. Rintyarna and R. Sarno, "Adapted Weighted Graph for Word Sense Disambiguation", In: Proc. of IcoICT, 2016.
[31] Z. Jie and W. Lu, "Dependency-based Hybrid Trees for Semantic Parsing", In: Proc. of Conference on Empirical Methods in Natural Language Processing, pp. 2431–2441, 2018.