
Minor Project Final Report (20bca19)

This document discusses analyzing music data from Spotify using data mining techniques. It describes how digital music collections contain large amounts of musical data that can be analyzed. The analysis aims to help navigate the world of music by exploring relationships between songs. Different types of musical data are available, including audio files and metadata. Various analyses can then be carried out on this data, such as genre classification and identification.

Uploaded by

Ramakrishna.ks
DATA ANALYSIS OF SPOTIFY

MINOR PROJECT REPORT

Submitted by
KAVI PRIYA.R
(20BCA019)

Under The Guidance Of


Dr. C. ARUN PRIYA, M.Sc., M.Phil., Ph.D.,
Associate Professor
Department of Computer Applications

In partial fulfillment of the requirement of the award of the Degree of


BACHELOR OF COMPUTER APPLICATION
Of Bharathiar University

DEPARTMENT OF COMPUTER APPLICATIONS


PSG COLLEGE OF ARTS & SCIENCE
An Autonomous College – Affiliated to Bharathiar University
Accredited with ‘A++’ Grade by NAAC (3rd cycle)
College with Potential For Excellence
(Status Awarded by the UGC)
Star College Status Awarded by DBT-MST
An ISO 9001:2015 Certified Institution
Coimbatore- 641014

NOVEMBER 2022


CERTIFICATE
This is to certify that this project work entitled “DATA ANALYSIS OF SPOTIFY”
is a bonafide record of work done by KAVI PRIYA.R (20BCA019) in partial fulfilment of the
requirements for the award of Degree of Bachelor of Computer Applications of Bharathiar University.

Faculty Guide Head of the Department

Submitted for Viva-Voce Examination held on ____________________

Internal Examiner External Examiner
DECLARATION

I, KAVI PRIYA R (20BCA019), do hereby declare that this Minor Project
work entitled “DATA ANALYSIS OF SPOTIFY”, submitted to PSG
College of Arts and Science (Autonomous), Coimbatore in partial fulfilment
for the award of Bachelor of Computer Applications, is a record of original
work done by me under the supervision and guidance of Dr. C. ARUN
PRIYA, M.Sc., M.Phil., Ph.D., Associate Professor in the Department of Computer
Applications, PSG College of Arts and Science, Coimbatore. This project work
has not been submitted by me for the award of any other
Degree/Diploma/Associateship/Fellowship or any other similar degree to any
other University.

PLACE : COIMBATORE S KAVI PRIYA


DATE : (20BCA019)

ACKNOWLEDGEMENT

My venture stands imperfect without dedicating my gratitude to a few people who have
contributed a lot towards the successful completion of my project work.

I would like to thank Thiru L. GOPALAKRISHNAN, Managing Trustee, PSG &
Sons Charities, for providing me with the opportunity and environment that made the work
possible.

I take this opportunity to express my deep sense of gratitude to Dr. T. KANNAIAN,
Secretary of PSG College of Arts & Science, Coimbatore, for permitting and doing the
needful towards the successful completion of this project.

I express my deep sense of gratitude and sincere thanks to our Principal
Dr. D. BRINDHA, M.Sc., M.Phil., Ph.D., MA (Yoga), for her valuable advice
and concern for students.

I am very thankful to Dr. A. ANGURAJ, M.Sc., M.Phil., Ph.D., Vice Principal
(Academics), Dr. M. JAYANTHI, M.Com., MBA., M.Phil., Ph.D., Vice
Principal (Student Affairs), and Prof. M. UMARANI, MBA, M.Phil., Faculty-In-Charge
(Student Affairs), for their support towards my project.

I also thank Dr. R. SUDHA, MCA., M.Phil., Ph.D., Head of the Department of
Computer Applications, for her help in completing this project successfully by giving
valuable suggestions.

I convey my heartfelt thanks to my project guide
Dr. C. ARUN PRIYA, M.Sc., M.Phil., Ph.D.,
Associate Professor, Department of Computer Applications, for her
timely suggestions, which enabled me to complete the project successfully.

This note of acknowledgement would be incomplete without expressing my heartfelt gratitude to
my parents, my friends and other people, for their blessings, encouragement, financial
support and patience, without which it would have been impossible for me to complete
the job.

ABSTRACT

The project “DATA ANALYSIS OF SPOTIFY” deals with the analysis of songs'
audio features from a statistical point of view. Digital music distribution is increasingly
powered by automated mechanisms that continuously capture, sort and analyze large amounts
of Web-based data.

In particular, it explores the data-capturing mechanisms enabled by the Spotify Web API, and
suggests statistical tools for the analysis of these data.

Identifying a model able to describe the relationship between audio features and popularity,
and determining which of these characteristics matter most in making a song popular, is a very
interesting topic for those who aim to predict the success of new products.

Everyone listens to music all day. Even I am hooked on music. I need music no matter which
activity I do. I have an eclectic taste in music; the genres I listen to vary from dance music
with a high tempo to sweet, mellow acoustic music. Being able to learn more about music and
how to analyze it will allow us to broaden our knowledge while also making us more
interesting human beings when we are conversing with others.

A variety of data cleaning and tidying techniques will be used before performing a
fundamental exploratory data analysis procedure. In terms of the research question, we want to
investigate the characteristics of songs that make them popular. With the help of this
analysis, we will have a much better understanding of listening tastes and habits.

TABLE OF CONTENTS
S.NO   CONTENTS                                PAGE NO
1.     Introduction                            1
       1.1 Project Overview                    1
       1.2 Module Description                  2
2.     System Specification                    3
       2.1 Hardware Specification              3
       2.2 Software Specification              3
       2.3 Software Description                4
3.     System Analysis                         7
       3.1 Existing System                     7
       3.2 Proposed System                     7
4.     System Design and Development           8
       4.1 Input Design                        8
       4.2 Output Design                       9
5.     Audio Features of Spotify               13
       5.1 Features                            13
       5.2 Beta Regression and GLMM            15
6.     Scope of Future Enhancement             19
7.     Conclusion                              20
       Bibliography                            21
       Appendices                              22
       A. Screen Layouts                       22
       B. Sample Coding                        26
1. INTRODUCTION
This section gives a detailed description of how the analysis is done on Spotify data. It gives
an overview of the analysis made and the reports generated.

1.1 Project Overview

Music plays an important role in people's everyday lives, and with digitalization, large collections of
musical data have formed, which music lovers tend to keep accumulating (Sloboda, 2011). As a result,
music collections have grown beyond what was previously possible, not only on the private shelf as audio
or video discs and domain discs, but also on hard disks and online. With the advent of new
technologies, it has become impossible for a single individual to keep track of all this music and the
relationships between different songs. Data mining and machine learning techniques can help in
navigating the world of music (Lerch, 2018).

Data mining strategies typically depend on two main questions: the type of data available and the use
to be made of them. What kind of data is music? A collection of music tracks consists of various types of
data; for example, music audio files or metadata such as track title and artist name (Pachet, 2011). What
kind of analysis can be carried out? Music data mining provides specific methods for answering a wide
variety of questions, e.g. genre classification, identification of artists/singers, mood/emotion detection,
instrument recognition, music similarity search, music synthesis and so on.

This research investigates the relationship between song audio features obtained from the Spotify
database (e.g. key and tempo) and song popularity, measured by the number of streams that a song has on
Spotify. Previous research on new product success prediction has identified multiple approaches to
this question. Moreover, the existing body of research, which defines many popularity prediction models,
stresses the complexity of the mechanisms of song popularity. Research from Lee and Lee (2018) shows
that it is feasible to predict the popularity metrics of a song significantly better than random chance
based on its audio signal. Additionally, Ni et al. (2015) show that certain audio features such as loudness,
duration and harmonic simplicity correlate with the evolution of musical trends. Dhanaraj and Logan
(2005) propose features from both songs' lyrics and audio content for hit prediction, and also study a hit
detection model based solely on lyric features. In an attempt to predict the popularity of a song from
Spotify's song data, Will Berger (2017) uses (Echo Nest) audio features similar to those in this research
and uses Spotify's own calculated metric “popularity” to measure popularity. Other attempts using
classical linear regression or quadratic models can be found online, but they are not exhaustive and do
not take into account the particular data structure and other aspects that could lead to biased
predictions. For this reason, this work can be considered an innovative way to look at popularity
prediction within the literature.

1.2 Module Description

Data cleaning
Data cleaning is the process of detecting and correcting (or removing) corrupt or
inaccurate records from a record set, table, or database and refers to identifying incomplete,
incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting
the dirty or coarse data. Data cleansing may be performed interactively with data wrangling
tools, or as batch processing through scripting.
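
As an illustration of the cleaning steps described above, the following minimal sketch uses pandas on a small made-up table. The column names (`track_name`, `popularity`, `duration_ms`) and the rules for what counts as an invalid record are assumptions for the example, not the project's actual cleaning procedure.

```python
import pandas as pd

# Small made-up table with typical problems: a missing value,
# an out-of-range value, and an exact duplicate row.
raw = pd.DataFrame(
    {
        "track_name": ["A", "B", "B", "C", "D"],
        "popularity": [55, 70, 70, 140, None],
        "duration_ms": [210000, 180000, 180000, 200000, 195000],
    }
)


def clean_tracks(df):
    """Return a cleaned copy: no duplicates, no missing or out-of-range popularity."""
    cleaned = df.drop_duplicates()                     # remove exact duplicate rows
    cleaned = cleaned.dropna(subset=["popularity"])    # remove rows with missing popularity
    in_range = cleaned["popularity"].between(0, 100)   # popularity must lie in [0, 100]
    return cleaned[in_range].reset_index(drop=True)


cleaned = clean_tracks(raw)
print(cleaned)
```

Each step returns a new DataFrame, so the raw table stays untouched and the cleaning can be rerun or adjusted without reloading the data.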

Data Preparation

The aim of this section is the identification of the determinants of song popularity. In
particular, we want to investigate the possible relationship between the audio characteristics
of the songs in the Spotify database (for example, Energy, Loudness, etc.) and the
popularity of the songs, which is also available in the Spotify dataset.

Coding

Coding refers to computer programming, the process of creating and maintaining the
source code of computer programs. In programming, code is a term used both for the
statements written in a particular programming language (the source code) and for the
source code after it has been processed by a compiler and made ready to run on the computer.
Simple linear regression is a linear regression model with a single explanatory
variable. That is, it concerns two-dimensional sample points with one independent variable
and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate
system) and finds a linear function (a non-vertical straight line) that, as accurately as possible,
predicts the dependent variable values as a function of the independent variable. The adjective
simple refers to the fact that the outcome variable is related to a single predictor. In this
project, Python is used for programming. The program is written to understand and identify
insights from the data.

Predictions for the next year were made using simple linear regression.
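
The simple linear regression described above can be sketched in a few lines of plain Python. This is a minimal illustration of the least-squares formulas, not the project's own code; the function name `fit_simple_linear_regression` and the yearly values are made up for the example.

```python
def fit_simple_linear_regression(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = sum of cross-deviations / sum of squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up yearly values (illustrative only, not project data).
years = [2018, 2019, 2020, 2021, 2022]
values = [10.0, 12.0, 14.0, 16.0, 18.0]

intercept, slope = fit_simple_linear_regression(years, values)
next_year = 2023
prediction = intercept + slope * next_year
print(f"slope={slope:.2f}, prediction for {next_year}: {prediction:.2f}")
```

The prediction for the next year is obtained by evaluating the fitted line at a value of x just outside the observed range, which is reasonable only if the linear trend is expected to continue.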

2. SYSTEM DESCRIPTION

2.1 Hardware Specification

COMPONENTS          REQUIREMENTS

PROCESSOR           AMD Ryzen 7 4800H

SPEED               2.90 GHz

RAM                 8.00 GB

HARD DISK DRIVE     1 TB

WAN                 MODEM, LAN, DATA CARD

2.2 Software Specification

➢ Operating System: Windows 11


➢ Third Party Tool : Jupyter Notebook
➢ Language : Python

2.3 SOFTWARE DESCRIPTION

This section gives a brief introduction to the tools used and an overview of
their features.

2.3.1 PYTHON

Python is an interpreted, high-level, general-purpose programming language.
Python is said to be relatively easy to learn and portable, meaning its
statements can be interpreted on a number of operating systems. Python is dynamically typed
and garbage-collected. It supports multiple programming paradigms, including procedural,
object-oriented, and functional programming.

PYTHON FEATURES

Python provides lots of features that are listed below.

Easy to Learn and Use - Python is easy to learn and use. It is a developer-friendly,
high-level programming language.

Expressive Language - Python is an expressive language, which means that its code is
understandable and readable.

Interpreted Language - Python is an interpreted language, i.e. the
interpreter executes the code line by line. This makes debugging easy and thus
suitable for beginners.

Cross-platform Language - Python runs on different platforms such as Windows,
Linux, UNIX and macOS. So, we can say that Python is a portable language.

Free and Open Source - Python is freely available from its official website. The
source code is also available; therefore it is open source.

Object-Oriented Language - Python supports object-oriented programming, with the concepts of classes and
objects.

Extensible - Code written in other languages such as C/C++ can be compiled
and then used from Python code.

Large Standard Library - Python has a large and broad standard library that provides a rich set of
modules and functions for rapid application development.

Dynamically-Typed Language - Python is not statically typed like Java. You don’t need to declare
a data type when defining a variable.
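
The dynamic typing mentioned in the last feature can be seen in a short sketch; the variable names are chosen only for illustration.

```python
# The same name can be rebound to values of different types at run time;
# no type declaration is needed when a variable is defined.
value = 42
first_type = type(value).__name__      # name of the type currently bound to `value`

value = "forty-two"
second_type = type(value).__name__

value = [4, 2]
third_type = type(value).__name__

print(first_type, second_type, third_type)
```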

2.3.2 Python IDLE

IDLE is Python’s Integrated Development and Learning Environment.

IDLE has the following features:

➢ coded in 100% pure Python, using the tkinter GUI toolkit


➢ cross-platform: works mostly the same on Windows, Unix, and macOS
➢ Python shell window (interactive interpreter) with colorizing of code input, output, and error
messages
➢ multi-window text editor with multiple undo, Python colorizing, smart indent, call tips, auto
completion, and other features
➢ search within any window, replace within editor windows, and search through multiple files (grep)
➢ debugger with persistent breakpoints, stepping, and viewing of global and local namespaces
➢ configuration, browsers, and other dialogs

2.3.3 JUPYTER NOTEBOOK

Jupyter Notebook is an open-source, cross-platform, web-based interactive environment that is
widely used for scientific programming in the Python language.

The Jupyter Notebook is an open-source web application that allows you to create and share
documents that contain live code, equations, visualizations, and narrative text. Its uses include data
cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine
learning, and much more.

Jupyter Notebook (formerly IPython Notebook) is a web-based interactive computational
environment for creating Jupyter notebook documents. The term “notebook” can colloquially
refer to several different entities, mainly the Jupyter web application, the Jupyter Python web server,
or the Jupyter document format, depending on context.

Jupyter Notebook can be extended with extensions and supports interactive
tools for data inspection.

It is available cross-platform on Windows, macOS and Linux, and can be installed,
for example, through the Anaconda distribution or with pip.

FEATURES:

• A browser-based editor for code cells, with syntax highlighting and tab completion

• Execution of code through kernels (IPython for Python), with results shown below each cell

• Rich output, including plots, tables and rendered equations

• Markdown cells for narrative text alongside the code

• Notebook documents stored as .ipynb files, which can be shared and exported to
other formats such as HTML

• A built-in file browser for managing notebooks and other files.

3. SYSTEM ANALYSIS

System analysis is a detailed study of the various operations performed by a system
and their relationships within and outside of the system. Analysis begins when a user or
manager begins a study of the program using the existing system.

3.1 EXISTING RESEARCH

The existing body of research which defines many popularity prediction models
stresses the complexity of the mechanisms of song popularity. Research from Lee and Lee
(2018) shows that it is feasible to predict the popularity metrics of a song significantly better
than random chance based on its audio signal. Additionally, Ni et al. (2015) also show that
certain audio features such as Loudness, duration and harmonic simplicity correlate with the
evolution of musical trends. Dhanaraj and Logan (2005) propose features from both songs’
lyrics and audio content for prediction of hits and also study a hit detection model based solely
on lyrics’ features.

3.2 PROPOSED SYSTEM

Other attempts through classical linear regression or quadratic models can be found on the net
but they are not exhaustive works and do not take into account the particular data structure and
other aspects that could lead to biased predictions. This is the reason why this research can be
considered as an innovative way to look at popularity predictions and represents an innovative
approach in the literature.

ADVANTAGES

➢ An accurate prediction of patterns could provide useful information for making suggestions on
Spotify.

4. SYSTEM DESIGN AND DEVELOPMENT

4.1 Input design:


Input design is the part of the overall system design that requires very careful
attention and is the most expensive phase. In this project, the input is the dataset, which is
imported in the analysis module.

➢ It ensures proper completion with accuracy.

➢ It should be easy to fill and straightforward.

➢ It should focus on user’s attention, consistency, and simplicity.

4.2 Output Design

The output design defines the output required and the format in which it is to
be produced. Care must be taken to present the right information so that the right
decisions are made. The output generated can be classified into three main categories:

• Screen Output

• Output to be stored as files in storage media.

• Hard copy of the output

5. AUDIO FEATURES OF SPOTIFY
The Spotify Web API lets users extract several audio features of songs. The available
features that have also been used in this paper are listed in Table 1:

5.1 Features

Among all the features returned by Spotify, song popularity plays an important role. The popularity of a
track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by an algorithm and
is based, for the most part, on the total number of plays the track has had and on how recent those plays
are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that
were played a lot in the past.

Duplicate tracks (e.g. the same track from a single and an album) are rated independently. Artist and album
popularity is derived mathematically from track popularity. Note that the popularity value may lag actual
popularity by a few days: the value is not updated in real time. Song popularity is an important issue for
the music industry.

In 2017 the music industry generated $8.72 billion in the United States alone. Thanks to growing streaming
services (Spotify, Apple Music, etc.) the industry continues to flourish. The top 10 artists in 2016 generated a
combined $362.5 million in revenue. The question of what makes a song popular has been studied before
with varying degrees of success (Giles, 2006).

Every song has key characteristics including lyrics, duration, artist information, tempo, beat, loudness, chord,
etc. Previous studies that considered lyrics to predict a song’s popularity had limited success.

Using Spotify data to predict what songs will be hits

The aim of this section is the identification of the determinants of songs’ popularity. In particular, we want to
investigate the possible relationship between the audio characteristics of the songs in the Spotify database
(for example, Energy, Loudness, etc. ...) and the popularity of the songs also available in the Spotify dataset.
The identification of a model able to describe this relationship, the determination within the set of
characteristics of those considered most important in making a song popular is a very interesting topic for
those who aim to predict the success of new products.

Then, the fundamental question is: What determines popularity? Why is a song popular? In cultural
markets like music, forecasting is very complex. Studies in this field, called Hit Song Science (HSS), are of
interest to record companies but also to consumers themselves and to Spotify (Middlebrook and Sheik,
2019). Previous attempts in this direction have always relied on linear or quadratic regression models
(Nasreldin, 2018).

In this paper the application of a Beta regression with random effects is proposed. The choice of this class of
models derives from the nature of the response variable (a continuous variable bounded in [0,100]) and from the
correlation structure in the data: it is assumed, in fact, that songs belonging to the same album can be related
to each other more than songs from different albums; ignoring this level of hierarchy in the data could lead to
biased or inefficient results.

5.2 Beta regression for correlated data

In this work we study the dependence of the popularity of songs on the musical characteristics by using
an extension of the Beta regression model, including random effects. The resulting model will be a
generalized Beta model with mixed effects (Beta GLMM).

Before defining the model from a theoretical point of view, we give a brief reminder of classical
Beta regression and of Generalized Linear Models with Mixed Effects (GLMMs); this brief summary
will be useful to understand the reasons why we have chosen to focus on this particular model and for the
theoretical definition of the model itself.

BETA REGRESSION :

The Beta distribution is a continuous probability distribution defined on the unit interval, with a density
function given by:

$$f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1-\mu)\phi)}\, y^{\mu\phi-1} (1-y)^{(1-\mu)\phi-1}, \quad 0 < y < 1, \tag{1}$$


where Γ (.) indicates the Gamma function.

The parameter µ indicates the expected value of Y, i.e. E(Y) = µ.


The parameter φ meets the definition of a precision parameter because, for fixed µ, the higher the value
of φ, the lower the variance of the dependent variable.

More specifically,

$$\mathrm{Var}(Y) = \frac{\mu(1-\mu)}{1+\phi}. \tag{2}$$

In Beta regression models (Ferrari and Cribari-Neto, 2004), the mean parameter µ ∈
(0,1) of the Beta distribution is expressed as a function of the covariates, while the precision parameter
φ > 0 is treated as a nuisance parameter.

In order to ensure that the linear predictor is mapped to values within the dependent variable’s
support, the logit is the most commonly chosen link function:

$$g(\mu_i) = \log\frac{\mu_i}{1-\mu_i} = x_i^T \beta, \tag{3}$$

where $x_i$ denotes the vector of explanatory variables for observation i, and β refers to the vector of regression
coefficients, i = 1,...,N.
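
To make the mean/precision parameterisation concrete, the following sketch evaluates the density in (1), checks E(Y) = µ and the variance in (2) by numerical integration, and shows the logit link in (3) with its inverse. It is only an illustration under assumed values (µ = 0.3, φ = 10); the function names are hypothetical and only the Python standard library is used.

```python
import math


def beta_pdf(y, mu, phi):
    """Beta density in the mean/precision parameterisation of equation (1)."""
    a = mu * phi            # first shape parameter
    b = (1.0 - mu) * phi    # second shape parameter
    log_norm = math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))


def logit(mu):
    """Logit link g(mu) = log(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))


def inv_logit(eta):
    """Inverse link: maps a linear predictor back to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def moments(mu, phi, steps=20000):
    """Approximate E(Y) and Var(Y) by midpoint integration of the density."""
    h = 1.0 / steps
    ys = [(k + 0.5) * h for k in range(steps)]
    mean = sum(y * beta_pdf(y, mu, phi) for y in ys) * h
    second = sum(y * y * beta_pdf(y, mu, phi) for y in ys) * h
    return mean, second - mean ** 2


mu, phi = 0.3, 10.0   # illustrative values
mean, var = moments(mu, phi)
print(mean, mu)                          # numerical mean vs. mu
print(var, mu * (1 - mu) / (1 + phi))    # numerical variance vs. equation (2)
print(inv_logit(logit(mu)))              # round trip through the link
```

Increasing `phi` while keeping `mu` fixed shrinks the variance, which is why φ is read as a precision parameter.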

The Beta distribution is defined only on the open unit interval. If exact one and zero values are admitted,
these values must be transformed in order to respect the support of the Beta distribution (Bonat et
al., 2014). The most frequently applied transformation is:

$$Y^* = \frac{Y(N-1) + 0.5}{N}, \tag{4}$$

where Y* is the transformed and Y is the untransformed dependent variable, and N is the sample size.
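
A small sketch of transformation (4) applied to popularity scores follows. Rescaling the 0 to 100 popularity index to [0, 1] before applying (4) is an assumption made here for the example; the function name and the scores are made up.

```python
def to_open_unit_interval(popularity, n):
    """Scale popularity from [0, 100] to [0, 1], then apply Y* = [Y(N - 1) + 0.5] / N."""
    y = popularity / 100.0             # assumption: rescale the 0-100 index to [0, 1] first
    return (y * (n - 1) + 0.5) / n


scores = [0, 25, 50, 100]              # made-up popularity values
n = len(scores)                        # N is the sample size
transformed = [to_open_unit_interval(s, n) for s in scores]
print(transformed)
```

After the transformation the boundary values 0 and 100 fall strictly inside (0, 1), so the Beta density is defined for every observation.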

Generalized Linear Mixed Models (GLMM)

Generalized Linear Mixed Models (or GLMMs) are an extension of the Generalized Linear Model (GLM)
in which the linear predictor contains random effects in addition to the usual fixed effects. For this model
class, the assumption of homogeneity and independence of the sample units is dropped. In addition, with
regard to the distribution of the response variable, GLMMs inherit from the GLM the idea of
extending linear mixed models to the case of non-normal data (Lovison et al., 2011). GLMMs provide a wide
range of models for the analysis of data that have some form of grouping, since differences between
groups can be modelled through the use of a random effect. The basic concept is the clustered structure:
data with a clustered structure have a univariate response variable y that is double-indexed as y_ij, with i
indexing the cluster and j the unit within the cluster, and a vector x_ij of p explanatory variables for the j-th unit in the i-th
cluster. It is important to remember that clusters can have different sizes and that this can influence the
results of the analysis. These models are useful in the analysis of many types of data, including
longitudinal data. The general form of the model, in matrix notation, is:

$$y = X\beta + Zb + \varepsilon \tag{5}$$

where y is an N × 1 column vector of responses;
X is an N × p matrix of the p explanatory variables;
β is a p × 1 column vector of fixed-effect regression coefficients;
Z is the N × q model matrix for the q random effects;
b is a q × 1 vector of random effects;
ε is an N × 1 column vector of residuals.
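
The roles of X and Z in (5) can be illustrated with a random-intercept example in which songs are grouped by album, as assumed in this report. The sketch below builds the indicator matrix Z and evaluates Xβ + Zb (the residual term is left out); all names and numbers are made up for illustration.

```python
def random_intercept_design(cluster_ids):
    """Build the N x q indicator matrix Z for a random intercept per cluster."""
    clusters = sorted(set(cluster_ids))
    column = {c: k for k, c in enumerate(clusters)}
    Z = [[0] * len(clusters) for _ in cluster_ids]
    for row, c in enumerate(cluster_ids):
        Z[row][column[c]] = 1          # row belongs to exactly one cluster
    return clusters, Z


def linear_predictor(X, beta, Z, b):
    """Compute X beta + Z b row by row (the model without the residual term)."""
    return [
        sum(x * bt for x, bt in zip(x_row, beta)) + sum(z * bk for z, bk in zip(z_row, b))
        for x_row, z_row in zip(X, Z)
    ]


# Made-up example: 5 songs in 2 albums, an intercept column plus one covariate.
albums = ["album_a", "album_a", "album_b", "album_b", "album_b"]
X = [[1, 0.2], [1, 0.5], [1, 0.1], [1, 0.9], [1, 0.4]]
beta = [1.0, 2.0]          # fixed effects (illustrative values)
b = [0.5, -0.5]            # one random intercept per album (illustrative values)

clusters, Z = random_intercept_design(albums)
eta = linear_predictor(X, beta, Z, b)
print(Z)
print(eta)
```

Songs from the same album share the same entry of b, which is how the model induces correlation within a cluster.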

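The assumptions that underlie this class of models can be summarized as follows:

$$
\begin{aligned}
& Y_{ij} \mid (x_{ij}, z_{ij}, b_i) \sim \mathcal{EF}(\theta_{ij}, \phi); \\
& \eta_{ij} = x_{ij}^T \beta + z_{ij}^T b_i; \\
& g(\mu_{ij}) = \eta_{ij}; \\
& \mu_{ij} = h(\eta_{ij}) = g^{-1}(\eta_{ij}); \\
& b_i \sim (0, \Sigma_q); \\
& Y_i \perp Y_j \quad \forall\, i \neq j,
\end{aligned}
$$

where $\mathcal{EF}(\theta_{ij}, \phi)$ denotes a distribution in the exponential family with canonical parameter θ_ij and
dispersion parameter φ.

By putting together all the assumptions the conditional distribution is easily derived:

$$f(y_{ij} \mid x_{ij}, z_{ij}, b_i) = \exp\left\{ \frac{y_{ij}\theta_{ij} - b(\theta_{ij})}{\phi} + c(y_{ij}, \phi) \right\}.$$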

Conditional on b_i, observations from the same cluster are assumed to be independent. In addition, the
conditional expected value is related to the linear predictor (containing both random and fixed effects) by
the following link function g(·):

$$g(\mu_{ij}) = x_{ij}^T \beta + z_{ij}^T b_i.$$

The Beta GLMM :

In longitudinal analyses or when subjects have any grouping structure, observations related to the same
unit will typically be correlated, violating the assumption of independence of observations typical in
regression models. The dependence within clusters can be accounted for by adding random cluster or subject
effects to the linear predictor (Bonat et al., 2014). Consider the case of longitudinal studies where j = 1,...,n_i
observations are nested within i = 1,...,N subjects. Let b_i denote the vector of random effects specific to
subject i. Adding random effects to the Beta regression model in (3), we get the Beta GLMM (Bonat et al.,
2014) given by

$$\log\frac{\mu_{ij}}{1-\mu_{ij}} = x_{ij}^T \beta + z_{ij}^T b_i \quad \text{with} \quad b_i \sim N(0, G), \tag{6}$$

where $z_{ij}$ is a vector of explanatory variables associated with the random effects, and G is the positive definite
covariance matrix of the random effects.

Note that although the assumption of normality for random effects is common and statistically convenient,
other distributional assumptions are also possible. In a longitudinal study, b_i is typically a scalar (for random
intercept models) or a bivariate vector (for random intercept models with a random regression coefficient),
i.e. $z_{ij}^T = (1, t_{ij})$, where t_ij is the measurement time j for subject i. In the Beta GLMM the regression
parameters have a unit-specific (conditional) interpretation and do not describe the effect of the respective
variable on the population average; this is due to the non-linear transformation of the mean response (i.e.
the logit link), from which it follows that

$$\mathrm{logit}(E(Y_{ij} \mid b_i)) = x_{ij}^T \beta + z_{ij}^T b_i, \tag{7}$$

but $\mathrm{logit}(E(Y_{ij})) \neq x_{ij}^T \beta$.
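
The difference between the conditional and the population-averaged mean can be checked numerically for a random-intercept model. The sketch below averages the inverse logit over a normal random intercept by midpoint integration; the fixed part of the predictor (1.0) and the random-effect standard deviation (2.0) are illustrative assumptions, and the function names are hypothetical.

```python
import math


def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def logit(p):
    """Logit link."""
    return math.log(p / (1.0 - p))


def marginal_mean(fixed_part, sigma, steps=4000, width=8.0):
    """E[inv_logit(fixed_part + b)] for b ~ N(0, sigma^2), by midpoint integration."""
    lo, hi = -width * sigma, width * sigma
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        b = lo + (k + 0.5) * h
        density = math.exp(-0.5 * (b / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        total += inv_logit(fixed_part + b) * density * h
    return total


fixed_part = 1.0   # x'beta for one observation (illustrative value)
sigma = 2.0        # standard deviation of the random intercept (illustrative value)

conditional = inv_logit(fixed_part)               # mean for a unit with b_i = 0
marginal = marginal_mean(fixed_part, sigma)       # mean averaged over all units
print(conditional, marginal, logit(marginal))
```

In this example the logit of the averaged mean is smaller than the fixed part of the predictor, so the coefficients describe effects for a given unit rather than for the population average.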

Model parameters can be estimated by maximizing the marginal likelihood, which is obtained by integrating
the joint distribution of [Y, b] over the random effects. The contribution to the likelihood of each group is as
follows:

$$f_i(y_i \mid \beta, \Sigma, \phi) = \int \prod_{j=1}^{n_i} f(y_{ij} \mid b_i, \beta, \phi)\, f(b_i \mid \Sigma)\, db_i. \tag{8}$$

Assuming independence among the N groups, the full likelihood is:

$$L(\beta, \Sigma, \phi) = \prod_{i=1}^{N} f_i(y_i \mid \beta, \Sigma, \phi).$$
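
For a scalar random intercept, the integral in (8) is one-dimensional and can be approximated directly. The following sketch does this with a simple midpoint rule; it illustrates the structure of the marginal likelihood and is not the estimation routine used in the cited literature (dedicated software typically relies on more efficient approximations). The data, the parameter values and the function names are made up.

```python
import math


def beta_logpdf(y, mu, phi):
    """Log of the Beta density in the mean/precision parameterisation."""
    a, b = mu * phi, (1.0 - mu) * phi
    return (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))


def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def cluster_likelihood(ys, etas, sigma, phi, steps=2000, width=8.0):
    """Equation (8) for one cluster with a scalar random intercept b_i ~ N(0, sigma^2).

    ys   : responses of the cluster, each strictly inside (0, 1)
    etas : fixed-effect parts x_ij' beta of the linear predictor
    """
    lo = -width * sigma
    h = 2.0 * width * sigma / steps
    total = 0.0
    for k in range(steps):
        b = lo + (k + 0.5) * h
        log_joint = sum(beta_logpdf(y, inv_logit(eta + b), phi) for y, eta in zip(ys, etas))
        log_prior = -0.5 * (b / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))
        total += math.exp(log_joint + log_prior) * h
    return total


def log_likelihood(clusters, sigma, phi):
    """Sum of log cluster contributions, i.e. the log of the product over clusters."""
    return sum(math.log(cluster_likelihood(ys, etas, sigma, phi)) for ys, etas in clusters)


# Made-up data: two albums with transformed popularity values and fixed-effect predictors.
clusters = [
    ([0.62, 0.70, 0.66], [0.4, 0.8, 0.6]),
    ([0.30, 0.35], [-0.7, -0.5]),
]
value = log_likelihood(clusters, sigma=0.5, phi=20.0)
print(value)
```

Maximizing this function over the regression coefficients (which enter through `etas`), `sigma` and `phi` would give the maximum likelihood estimates for this simplified setting.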

FIGURE 1: VISUALIZATION OF DATASET

FIGURE 2: CLASSIFICATION OF DATASET

7. SCOPE OF FUTURE ENHANCEMENT

The purpose of this project is to predict the patterns and relationships between the
songs in the dataset provided by Spotify. It provides useful
information to the users and the management of Spotify.

Here we take only sample datasets for prediction and analysis. In future we
will collect a better dataset and predict the patterns and relationships with the
help of regression and correlation for Spotify. This may be useful for the company to know
about the current trends and help to make the application better.

8. CONCLUSION

The project “DATA ANALYSIS OF SPOTIFY” analyses the relationships and the current
trends of the songs in the Spotify application.

The Spotify Web API audio features, used as covariates, have shown that not all the Spotify
characteristics have a high explanatory power for a higher stream count, but some of them are actually
important. Significant relationships were found, which lays a promising foundation for research on
prediction with these variables. In particular, Speechiness, Instrumentalness and Liveness are the features that
negatively affect the Popularity Index, while Energy, Valence and Duration of the song are the ones that
positively affect it.

This research contributes to further understanding in the field of HSS and of new product success
prediction. Creating effective prediction models is an interesting next step for this research, and a further
step would be to expand the set of variables used. We hope that this paper will also have practical implications
for Spotify, suggesting for example interesting ideas to further develop its database, with the hope
that data of increasing quality can lead to interesting discoveries and added value for the world of HSS.

BIBLIOGRAPHY

REFERENCES:

• Bonat, W. H., P. J. Ribeiro, and W. M. Zeviani (2014, Aug). Likelihood analysis for a
class of beta mixed models. Journal of Applied Statistics 42(2), 252–266.

• Brooks, M. E., K. Kristensen, K. J. van Benthem, A. Magnusson, C. W. Berg, A.
Nielsen, H. J. Skaug, M. Maechler, and B. M. Bolker (2017). glmmTMB balances
speed and flexibility among packages for zero-inflated generalized linear mixed
modeling. The R Journal 9(2), 378–400.

• Middlebrook, K. and K. Sheik (2019). Song hit prediction: Predicting billboard hits
using spotify data.

• Nijkamp, R. (2018, July). Prediction of product success: explaining song popularity by
audio features from Spotify data.

• Sloboda, J. A. (2011). Music in everyday life: The role of emotions. In P. N. Juslin and
J. Sloboda (Eds.), Handbook of Music and Emotion: Theory, Research, Applications.
Oxford University Press.

• Nasreldin, M. (2018). Song popularity predictor.
https://towardsdatascience.com/song-popularitypredictor-1ef69735e380.

• Pachet, F. (2011). Musical metadata and knowledge management. In Encyclopedia of
Knowledge Management.

WEBSITES

• https://www.programiz.com/python-programming

• https://open.spotify.com/

• https://jupyter.org/

• https://www.kaggle.com/

• https://towardsdatascience.com/

APPENDICES
A. SCREENSHOTS

B. APPENDIX

!pip install pandas


!pip install numpy
!pip install matplotlib
!pip install seaborn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#raw string so that the backslashes in the Windows path are not read as escape sequences
df_tracks = pd.read_csv(r'D:\spotify datas\SpotifyFeatures.csv')


df_tracks.head()

#null values

pd.isnull(df_tracks).sum()

#Info

df_tracks.info()

#ten least popular songs

sorted_df = df_tracks.sort_values('popularity',ascending = True).head(10)


sorted_df

#descriptive statistics

df_tracks.describe().transpose()

#ten most popular songs

most_popular=df_tracks.query('popularity>90',inplace=False).sort_values('popularity',ascending=False)
most_popular[:10]

#making release date column as Index column
#(note: pd.to_datetime expects date-like values in the column used as index)

df_tracks.set_index("time_signature",inplace=True)
df_tracks.index=pd.to_datetime(df_tracks.index)
df_tracks.head()

#finding the artist based on the location (row position)

df_tracks[["artists"]].iloc[18]

#converting the duration of the songs from milliseconds to seconds

df_tracks["duration"]=df_tracks["duration_ms"].apply(lambda x:round(x/1000))
df_tracks.drop("duration_ms",inplace=True, axis=1)

#Let’s Move Ahead and Sample Only 0.4 Percent of the Whole Dataset.

sample_df=df_tracks.sample(int(0.004*len(df_tracks)))
print(len(sample_df))

#correlation map

#drop the categorical columns and compute Pearson correlations on the numeric columns
#(numeric_only requires pandas 1.5 or newer)
corr_df=df_tracks.drop(["key","mode","explicit"],axis=1).corr(method="pearson",numeric_only=True)
plt.figure(figsize=(14,6))
heatmap=sns.heatmap(corr_df,annot=True,fmt=".1g",vmin=-1,vmax=1,center=0,cmap="inferno",linewidths=1,linecolor="black")
heatmap.set_title("Correlation Heatmap Between Variables")
heatmap.set_xticklabels(heatmap.get_xticklabels(),rotation=90)

#Regression plot between loudness and Energy

plt.figure(figsize=(10,6))
sns.regplot(data=sample_df,y="loudness",x="energy",color="c").set(title="Loudness vs Energy Correlation")

#Regression plot between popularity and Acousticness

plt.figure(figsize=(10,6))
sns.regplot(data=sample_df,y="popularity",x="acousticness",color="b").set(title="Popularity vs Acousticness Correlation")

#Plot a line graph to show the duration of the songs for each year

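#the year is taken from the datetime index created above
years=df_tracks.index.year
total_dr=df_tracks.duration
sns.set_style(style="whitegrid")
fig_dims=(10,5)
fig,ax=plt.subplots(figsize=fig_dims)
#pass the data themselves, not their names as strings
fig=sns.lineplot(x=years,y=total_dr,ax=ax).set(title="Year vs Duration")
plt.xticks(rotation=60)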

DATA ANALYSIS BASED ON GENRES OF THE SONGS :

df_genre=pd.read_csv(r"D:\spotify datas\SpotifyFeatures.csv")
df_genre.head()

#barplot functions present in seaborn library

plt.title("Duration of the songs in Different genres")
sns.color_palette("rocket",as_cmap=True)
sns.barplot(y='genre',x='duration_ms',data=df_genre)
plt.xlabel("Duration in milliseconds")
plt.ylabel("Genres")

#top five genres by popularity and plot a barplot for the same

sns.set_style(style="darkgrid")
plt.figure(figsize=(10,5))
famous=df_genre.sort_values("popularity",ascending=False).head(10)
sns.barplot(y='genre',x='popularity',data=famous).set(title="Top 5 genres by popularity")

ABSTRACT

The project “PREDICTING ELECTRICITY ENERGY CONSUMPTION”
analyses the consumption of electrical energy. Electricity is a significant form of energy that
cannot be stored physically and is usually generated as needed. In order to avoid waste or
shortage, a good system needs to be designed to constantly maintain the level of electricity
needed. Electricity consumption is an important economic index and plays a significant role in
drawing up an energy development policy for each country.

In Python, a simple linear regression algorithm is used to predict electrical energy
consumption based on previous data. This contribution could help to optimize the economics
of the energy producers’/distributors’ value chain. The generated report helps the government to know
about the electricity demand and to manage the electricity supply based on
the demand.

I am KEERTHANA.S (20BCA021) doing final year computer application in PSG college of arts and
science, batch 2020-2023
