Minor Project Final Report (20bca19)
Submitted by
KAVI PRIYA.R
(20BCA019)
NOVEMBER 2022
DEPARTMENT OF COMPUTER APPLICATIONS
CERTIFICATE
This is to certify that this project work entitled “DATA ANALYSIS OF SPOTIFY”
is a bona fide record of work done by KAVI PRIYA.R (20BCA019) in partial fulfilment of the
requirements for the award of the Degree of Bachelor of Computer Applications of Bharathiar University.
DECLARATION
ACKNOWLEDGEMENT
My venture would be incomplete without expressing my gratitude to the few people who
contributed greatly to the successful completion of my project work.
ABSTRACT
The project “DATA ANALYSIS OF SPOTIFY” deals with the analysis of songs’
audio features from a statistical point of view. Digital music distribution is increasingly
powered by automated mechanisms that continuously capture, sort and analyze large amounts
of Web-based data.
In particular, the project explores the data-capturing mechanisms enabled by the Spotify Web API and
suggests statistical tools for the analysis of these data.
Identifying a model able to describe the relationship between audio features and popularity, and
determining which characteristics are most important in making a song popular, is a very
interesting topic for those who aim to predict the success of new products.
Most people listen to music every day, and I am no exception: I need music whatever
activity I am doing. I have an eclectic taste in music; the genres I listen to range from dance music
with a high tempo to sweet, mellow acoustic music. Learning more about music and
how to analyze it allows us to broaden our knowledge and gives us more
to talk about when conversing with others.
A variety of data cleaning and tidying techniques are used before performing a
fundamental exploratory data analysis. The research question is to
investigate the characteristics of songs that make them popular. With the help of this
analysis, we gain a much better understanding of listening tastes and habits.
TABLE OF CONTENTS
S.NO CONTENTS
1. Introduction
   Project Overview
   Module Description
2. System Specification
   2.1 Hardware Specification
   2.2 Software Specification
   2.3 Software Description
3. System Analysis
   3.1 Existing System
   3.2 Proposed System
Appendices
   A. Screen Layouts
   B. Sample Coding
1. INTRODUCTION
This section gives a detailed description of how the analysis of Spotify data is carried out. It gives
an overview of the analysis made and the reports generated.
Music plays an important role in people’s everyday lives, and with digitalization large collections of
musical data have formed, which music lovers tend to accumulate further (Sloboda, 2011). As a result,
music collections, not only on the private shelf as audio or video discs and domain discs but also on the
hard disk and online, have grown beyond what was previously possible. With the advent of new
technologies, it has become impossible for a single individual to keep track of the music and the
relationships between different songs. The techniques of data mining and machine learning can help in
navigating the world of music (Lerch, 2018).
Data mining strategies are often shaped by two main questions: the type of data available and the use
one wants to make of them. What kind of data is music? A collection of music tracks consists of various
types of data; for example, music audio files or metadata such as track title and artist name (Pachet,
2011). What kind of analysis can be carried out? Music data mining provides specific methods to answer
a wide variety of questions, e.g. genre classification, identification of artists/singers, mood/emotion
detection, instrument recognition, music similarity search, musical synthesis and so on.
This research investigates the relationship between song audio features obtained from the Spotify
database (e.g. key and tempo) and song popularity, measured by the number of streams that a song has on
Spotify. Previous research on new product success prediction has identified multiple approaches to
this question. Moreover, the existing body of research, which defines many popularity prediction models,
stresses the complexity of the mechanisms of song popularity. Research from Lee and Lee (2018) shows
that it is feasible to predict the popularity metrics of a song significantly better than random chance
based on its audio signal. Ni et al. (2015) show that certain audio features such as Loudness, duration and
harmonic simplicity correlate with the evolution of musical trends. Dhanaraj and Logan (2005) propose
features from both songs’ lyrics and audio content for the prediction of hits, and also study a hit detection
model based solely on lyric features. In an attempt to predict the popularity of a song from Spotify’s song
data, Will Berger (2017) uses Echo Nest audio features similar to those in this research and uses
Spotify’s own calculated “popularity” metric to measure popularity.
Other attempts using classical linear regression or quadratic models can be found on the web, but they are
not exhaustive and do not take into account the particular data structure and other aspects that
could lead to biased predictions. For this reason, this work can be considered an innovative approach to
popularity prediction within the literature.
1.2 Module Description
Data cleaning
Data cleaning is the process of detecting and correcting (or removing) corrupt or
inaccurate records from a record set, table, or database and refers to identifying incomplete,
incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting
the dirty or coarse data. Data cleansing may be performed interactively with data wrangling
tools, or as batch processing through scripting.
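The cleaning steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, the column names (track_id, name, energy, popularity) and the helper names are hypothetical, and the standard library is used so that the sketch runs without extra packages.

```python
import math

# Hypothetical raw rows, as they might come out of a CSV export of track data.
RAW_ROWS = [
    {"track_id": "a1", "name": " Song A ", "energy": "0.81", "popularity": "64"},
    {"track_id": "a1", "name": "Song A", "energy": "0.81", "popularity": "64"},   # duplicate id
    {"track_id": "b2", "name": "Song B", "energy": "", "popularity": "40"},       # missing value
    {"track_id": "c3", "name": "Song C", "energy": "1.7", "popularity": "55"},    # out of range
    {"track_id": "d4", "name": "Song D", "energy": "0.35", "popularity": "n/a"},  # not a number
    {"track_id": "e5", "name": "Song E", "energy": "0.52", "popularity": "71"},
]


def to_float(text):
    """Return text as a finite float, or None if it cannot be parsed."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return None
    return value if math.isfinite(value) else None


def clean_rows(rows):
    """Drop duplicate, incomplete and out-of-range records; tidy the rest."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        track_id = row.get("track_id", "").strip()
        energy = to_float(row.get("energy"))
        popularity = to_float(row.get("popularity"))
        if not track_id or track_id in seen_ids:
            continue  # missing key or duplicate record
        if energy is None or popularity is None:
            continue  # incomplete or unparseable record
        if not (0.0 <= energy <= 1.0 and 0.0 <= popularity <= 100.0):
            continue  # value outside its expected range
        seen_ids.add(track_id)
        cleaned.append({
            "track_id": track_id,
            "name": row.get("name", "").strip(),
            "energy": energy,
            "popularity": popularity,
        })
    return cleaned


CLEANED = clean_rows(RAW_ROWS)
```

Here the dirty records are deleted; replacing or modifying them instead (for example, filling a missing value) is an equally valid choice and depends on the analysis that follows.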
Data Preparation
The aim of this section is to identify the determinants of song popularity. In
particular, we want to investigate the possible relationship between the audio characteristics
of the songs in the Spotify database (for example, Energy, Loudness, etc.) and the
popularity of the songs, which is also available in the Spotify dataset.
Coding
Coding refers to computer programming, the process of creating and maintaining the
source code of computer programs. In programming, “code” denotes both the
statements written in a particular programming language (the source code) and the
result of processing that source code with a compiler so that it is ready to run on the computer.
Simple linear regression is a linear regression model with a single explanatory
variable. That is, it concerns two-dimensional sample points with one independent variable
and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate
system) and finds a linear function (a non-vertical straight line) that, as accurately as possible,
predicts the dependent variable values as a function of the independent variable. The adjective
“simple” refers to the fact that the outcome variable is related to a single predictor. In this
project, Python is used for programming. The program is written to understand and identify
insights from the data.
Predictions for the next year were made using simple linear regression.
2. SYSTEM DESCRIPTION
COMPONENT REQUIREMENTS
2.3 SOFTWARE DESCRIPTION
This section gives a brief introduction to the tools used in the project and an overview of
their main features.
2.3.1 PYTHON
PYTHON FEATURES
Easy to Learn and Use - Python is easy to learn and use. It is a developer-friendly, high-level
programming language.
Cross-platform Language - Python runs equally well on different platforms such as Windows,
Linux, UNIX and Macintosh, so it is a portable language.
Free and Open Source - Python is freely available from its official website. The
source code is also available; therefore it is open source.
Object-Oriented Language - Python supports object-oriented programming, with the concepts of classes and
objects.
Extensible - Code written and compiled in other languages such as C/C++ can be
used from within Python code.
Large Standard Library - Python has a large and broad standard library that provides a rich set of
modules and functions for rapid application development.
Dynamically-Typed Language - Python is not statically typed like Java. You don’t need to declare
a data type when defining a variable.
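The dynamic typing mentioned above can be illustrated with a short example; the variable names are arbitrary and chosen only for this sketch.

```python
# The same name can be bound to objects of different types at run time;
# no type declaration is needed when the variable is defined.
value = 120                     # an int
first_type = type(value).__name__

value = "tempo"                 # now a str
second_type = type(value).__name__

value = [0.8, 0.6]              # now a list
third_type = type(value).__name__

TYPE_HISTORY = [first_type, second_type, third_type]
```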
2.3.3 JUPYTER NOTEBOOK
The Jupyter Notebook is an open-source web application that allows you to create and share
documents that contain live code, equations, visualizations, and narrative text. Its uses include data
cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine
learning, and much more.
The term “notebook” can refer to many different entities, mainly the Jupyter web application, the Jupyter
Python web server, or the Jupyter document format, depending on context.
Jupyter Notebook is extensible with first- and third-party plugins, includes support for interactive
tools for data inspection and embeds Python-specific code quality assurance and introspection
instruments, such as Pyflakes, Pylint and Rope.
FEATURES:
• A Help pane able to retrieve and render rich text documentation on functions,
classes and methods automatically or on-demand
• A debugger linked to IPdb, for step-by-step execution
• A built-in file explorer, for interacting with the filesystem and managing projects.
3. SYSTEM ANALYSIS
The existing body of research which defines many popularity prediction models
stresses the complexity of the mechanisms of song popularity. Research from Lee and Lee
(2018) shows that it is feasible to predict the popularity metrics of a song significantly better
than random chance based on its audio signal. Additionally, Ni et al. (2015) also show that
certain audio features such as Loudness, duration and harmonic simplicity correlate with the
evolution of musical trends. Dhanaraj and Logan (2005) propose features from both songs’
lyrics and audio content for prediction of hits and also study a hit detection model based solely
on lyrics’ features.
Other attempts using classical linear regression or quadratic models can be found on the web,
but they are not exhaustive and do not take into account the particular data structure and
other aspects that could lead to biased predictions. For this reason, this research can be
considered an innovative approach to popularity prediction within the literature.
ADVANTAGES
4. SYSTEM DESIGN AND DEVELOPMENT
4.2 Output Design
The output design defines the output required and the format in which it is to
be produced. Care must be taken to present the right information so that right
decisions are made. The output generated can be classified into three main categories,
• Screen Output
5. AUDIO FEATURES OF SPOTIFY
The Spotify Web API allows users to extract several audio features of songs. The available
features that have also been used in this paper are listed in Table 1:
5.1 Features
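As a rough sketch of how such features might be gathered into one record per song, the example below combines a track object (which carries the popularity value) with an audio-features object. The field names follow the Spotify Web API audio-features object as commonly documented; which of them this project actually used is an assumption, the sample payloads contain invented values, and no network request is made (a real request would also need an access token).

```python
# Field names of the audio-features object; the selection is an assumption.
AUDIO_FEATURE_KEYS = (
    "danceability", "energy", "key", "loudness", "mode", "speechiness",
    "acousticness", "instrumentalness", "liveness", "valence", "tempo",
    "duration_ms", "time_signature",
)


def audio_features_url(track_id):
    """Build the audio-features endpoint URL for one track id."""
    return f"https://api.spotify.com/v1/audio-features/{track_id}"


def to_row(track, features):
    """Flatten a track object and its audio features into one record."""
    row = {
        "id": track["id"],
        "name": track["name"],
        "popularity": track["popularity"],
    }
    for key in AUDIO_FEATURE_KEYS:
        row[key] = features.get(key)  # None if the field is absent
    return row


# Hypothetical payloads with invented values, shaped like API responses.
SAMPLE_TRACK = {"id": "track123", "name": "Example Song", "popularity": 73}
SAMPLE_FEATURES = {
    "danceability": 0.71, "energy": 0.64, "key": 5, "loudness": -6.2,
    "mode": 1, "tempo": 118.0, "duration_ms": 201000,
}

ROW = to_row(SAMPLE_TRACK, SAMPLE_FEATURES)
```

Keeping one flat record per track makes it straightforward to load the data into a table for cleaning and modelling.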
Among all the features returned by Spotify, song popularity plays an important role. The popularity of a
track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by an
algorithm and is based, for the most part, on the total number of plays the track has had and how recent
those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than
songs that were played a lot in the past.
Duplicate tracks (e.g. the same track from a single and an album) are rated independently. Artist and album
popularity is derived mathematically from track popularity. Note that the popularity value may lag actual
popularity by a few days: the value is not updated in real time. Song popularity is an important issue for
the music industry.
In 2017, the music industry generated $8.72 billion in the United States alone. Thanks to growing streaming
services (Spotify, Apple Music, etc.), the industry continues to flourish. The top 10 artists in 2016 generated a
combined $362.5 million in revenue. The question of what makes a song popular has been studied before
with varying degrees of success (Giles, 2006).
Every song has key characteristics including lyrics, duration, artist information, tempo, beat, Loudness, chord,
etc. Previous studies that considered lyrics to predict a song’s popularity had limited success.
The aim of this section is the identification of the determinants of songs’ popularity. In particular, we want to
investigate the possible relationship between the audio characteristics of the songs in the Spotify database
(for example, Energy, Loudness, etc.) and the popularity of the songs, which is also available in the Spotify dataset.
Identifying a model able to describe this relationship, and determining which characteristics matter most
in making a song popular, is a very interesting topic for
those who aim to predict the success of new products.
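A simple first look at such a relationship is the Pearson correlation between an audio feature and popularity, the same measure used for the correlation map in the appendix. The sketch below is only illustrative: the rows are made-up values, not real Spotify data, and the helper name `pearson` is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative rows (NOT real Spotify data): one dict per track.
tracks = [
    {"energy": 0.20, "loudness": -14.0, "popularity": 30},
    {"energy": 0.45, "loudness": -10.5, "popularity": 42},
    {"energy": 0.60, "loudness": -8.0, "popularity": 55},
    {"energy": 0.75, "loudness": -6.5, "popularity": 61},
    {"energy": 0.90, "loudness": -4.0, "popularity": 70},
]

popularity = [t["popularity"] for t in tracks]

# Correlation of each audio feature with popularity.
correlations = {
    feature: pearson([t[feature] for t in tracks], popularity)
    for feature in ("energy", "loudness")
}
for feature, r in correlations.items():
    print(f"{feature}: r = {r:.3f}")
```

A correlation of this kind only describes a pairwise linear association; it does not account for the bounded response or for the grouping of songs by album, which is what motivates the model discussed next.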
Then, the fundamental question is: what determines popularity? Why is a song popular? In cultural
markets like music, forecasting is very complex. Studies in this field, called Hit Song Science (HSS), are of
interest to record companies but also to consumers themselves and to Spotify (Middlebrook and Sheik,
2019). Previous attempts in this direction have always relied on linear or quadratic regression models
(Nasreldin, 2018).
In this paper the application of a Beta regression with random effects is proposed. The choice of this class of
models derives from the nature of the response variable (a continuous variable bounded in [0,100]) and from the
correlation structure in the data: it is assumed, in fact, that songs belonging to the same album can be related
to each other more than songs from different albums; ignoring this level of hierarchy in the data could lead to
biased or inefficient results.
In this work we study the dependence of the popularity of songs on their musical characteristics by using
an extension of the Beta regression model that includes random effects. The resulting model is a
generalized Beta model with mixed effects (Beta GLMM).
Before defining the model from a theoretical point of view, a brief reminder of classical
Beta regression and of Generalized Linear Models with Mixed Effects (GLMMs) follows; this summary
will be useful to understand why we have chosen to focus on this particular model and for the
theoretical definition of the model itself.
BETA REGRESSION:
The Beta distribution is a continuous probability distribution defined on the unit interval, with a density
function given by:
In Beta regression models (Ferrari and Cribari-Neto, 2004), the mean parameter µ ∈
(0,1) of the Beta distribution is expressed as a function of the covariates, while the precision parameter
φ ∈ R⁺ is treated as a nuisance parameter.
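For reference, under the mean and precision parameterization of Ferrari and Cribari-Neto (2004) described above, the Beta density is usually written in the following standard form (stated here as an assumption about the intended display):

```latex
f(y;\mu,\phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma\big((1-\mu)\phi\big)}\;
y^{\mu\phi-1}\,(1-y)^{(1-\mu)\phi-1}, \qquad 0<y<1,
```

with $E(y)=\mu$ and $\mathrm{Var}(y)=\mu(1-\mu)/(1+\phi)$, so that for a fixed mean a larger φ corresponds to a smaller variance.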
In order to ensure that the linear predictor takes on values in the space given by the dependent variable’s
support, the logit link is the most commonly chosen link function:
where $x_{ij}^T$ denotes a vector of explanatory variables, and β refers to the vector of regression
coefficients, i = 1,...,N.
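To make the role of the logit link concrete, the sketch below maps a linear predictor to a mean in (0, 1) and back. It is a minimal illustration, not the fitted model: the covariate values and coefficients are hypothetical numbers chosen only for the example, and the function names are our own.

```python
import math

def logit(mu):
    """Logit link: maps a mean in (0, 1) to the whole real line."""
    return math.log(mu / (1.0 - mu))

def inv_logit(eta):
    """Inverse logit: maps a linear predictor back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def linear_predictor(x, beta):
    """Inner product x^T beta for one observation."""
    return sum(xk * bk for xk, bk in zip(x, beta))

# Hypothetical values for illustration only (not estimates from this report):
# x = (intercept, energy, loudness in dB), beta = made-up coefficients.
x = [1.0, 0.8, -5.0]
beta = [0.2, 1.5, 0.1]

eta = linear_predictor(x, beta)  # 0.2 + 1.2 - 0.5 = 0.9
mu = inv_logit(eta)              # mean on the (0, 1) scale
print(f"eta = {eta:.2f}, mu = {mu:.3f}, popularity scale = {100 * mu:.1f}")
```

Whatever real value the linear predictor takes, the inverse logit returns a mean strictly between 0 and 1, which is why this link suits a response such as popularity rescaled to the unit interval.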
The Beta distribution is defined only on the open unit interval. If exact zero and one values occur,
these values must be transformed so that they fall within the support of the Beta distribution (Bonat et
al., 2014). The most frequently applied transformation is:
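One widely used choice is the transformation y' = (y(n − 1) + 0.5)/n of Smithson and Verkuilen (2006), where n is the sample size; that this is the formula intended here is an assumption. The sketch below first rescales popularity from [0, 100] to [0, 1] and then applies it, using made-up scores and a hypothetical function name.

```python
def squeeze_to_open_unit_interval(popularity_scores):
    """Rescale 0-100 scores to [0, 1], then shrink them into (0, 1).

    Uses y' = (y * (n - 1) + 0.5) / n, with n the number of observations,
    so that exact 0 and 1 values become admissible for a Beta model.
    """
    n = len(popularity_scores)
    unit = [p / 100.0 for p in popularity_scores]
    return [(y * (n - 1) + 0.5) / n for y in unit]

# Made-up popularity values, including the boundary cases 0 and 100.
scores = [0, 25, 50, 100]
squeezed = squeeze_to_open_unit_interval(scores)
print(squeezed)
```

The shrinkage depends on n: with a large sample the transformed values stay very close to the originals, while the boundary values 0 and 1 are moved just inside the interval.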
Generalized Linear Mixed Models (GLMMs) are an extension of the Generalized Linear Model (GLM)
in which the linear predictor contains random effects in addition to the usual fixed effects. For this model
class, the assumption of homogeneity and independence of the sample units is dropped. In addition, with
regard to the distribution of the response variable, GLMMs inherit from GLMs the idea of
extending linear mixed models to the case of non-normal data (Lovison et al., 2011). GLMMs provide a wide
range of models for the analysis of data that have some form of grouping, since differences between
groups can be modelled through the use of a random effect. The basic concept is the cluster structure:
data with a clustering structure have a univariate response variable y that is doubly indexed, with i for the
cluster and j for the unit within the cluster, and a vector $x_{ij}$ of p explanatory variables for the j-th unit in the i-th
cluster. It is important to remember that clusters can have different sizes and that this can influence the
results of the analysis. These models are useful in the analysis of many types of data, including
longitudinal data. The general form of the model, in matrix notation, is:
y = Xβ + Zb + ε (5)
The assumptions that underlie this class of models can be summarized as follows:
By putting together all the assumptions the conditional distribution is easily derived:
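For the linear mixed model in (5), the assumptions and the resulting conditional distribution are commonly stated as below. This is the standard textbook formulation, offered as a sketch: the symbol R for the error covariance matrix is an assumption not taken from the text, while G denotes the covariance matrix of the random effects.

```latex
b \sim N(0, G), \qquad \varepsilon \sim N(0, R), \qquad b \text{ and } \varepsilon \text{ independent}
\quad\Longrightarrow\quad
y \mid b \sim N(X\beta + Zb,\; R)
```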
Conditioning on $b_i$, observations from the same cluster are assumed to be independent. In addition, the
conditional expected value is related to the linear predictor (containing both random and fixed effects) by
the following link function g(·):
$g(\mu_{ij}) = x_{ij}^T \beta + z_{ij}^T b_i$.
In longitudinal analyses, or whenever subjects have a grouping structure, observations related to the same
unit will typically be correlated, violating the assumption of independence of observations typical in
regression models. The dependence within clusters can be accounted for by adding random cluster or subject
effects to the linear predictor (Bonat et al., 2014). Consider the case of longitudinal studies where $j = 1,\dots,n_i$
observations are nested within $i = 1,\dots,N$ subjects. Let $b_i$ denote the vector of random effects specific to
subject i. Adding random effects to the beta regression model in (3.3) we get the Beta GLMM (Bonat et al.,
2014) given by
where $z_{ij}^T$ is a vector of explanatory variables, and G is the positive definite covariance matrix of the random
effects.
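Putting these pieces together, the Beta GLMM of Bonat et al. (2014) can be sketched in the notation of this section as follows, assuming the usual Gaussian random effects:

```latex
Y_{ij} \mid b_i \sim \mathrm{Beta}(\mu_{ij}, \phi), \qquad
g(\mu_{ij}) = x_{ij}^T \beta + z_{ij}^T b_i, \qquad
b_i \sim N(0, G)
```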
Note that although the assumption of normality for random effects is common and statistically convenient,
other distributional assumptions are also possible. In a longitudinal study, $b_i$ is typically a scalar (for random
intercept models) or a bivariate vector (for random intercept models with a random regression coefficient),
i.e. $z_{ij}^T = (1, t_{ij})$, where $t_{ij}$ is the time of measurement j for subject i. In the Beta GLMM the regression
parameters have a unit-specific interpretation only and do not describe the effect of the respective
variable on the population average; this is due to the non-linear transformation of the mean response (i.e.
the logit link), as it can be deduced that
Model parameters can be estimated by maximizing the marginal likelihood, which is obtained by integrating
the joint distribution of [Y, b] over the random effects. The contribution to the log-likelihood by each group is as
follows:
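In generic notation this contribution has the form below. It is a sketch following the structure used by Bonat et al. (2014); here $f(y_{ij} \mid b_i; \beta, \phi)$ denotes the conditional Beta density of an observation and $f(b_i; G)$ the Gaussian density of the random effects, both notational assumptions.

```latex
\ell_i(\beta, \phi, G) = \log \int \prod_{j=1}^{n_i} f(y_{ij} \mid b_i; \beta, \phi)\; f(b_i; G)\; db_i
```

The integral generally has no closed form, so it is evaluated numerically, for example by a Laplace approximation or by Gaussian quadrature.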
FIGURE 1: VISUALIZATION OF DATASET
7.SCOPE OF FUTURE ENHANCEMENT
The purpose of this project is to identify patterns and relationships among
songs based on the dataset provided by Spotify. It provides useful
information to the users and to the management of Spotify.
Here we take only sample datasets for prediction and analysis. In future, we
plan to collect a better dataset and to predict the patterns and relationships with the
help of regression and correlation for Spotify. This may be useful for the company to learn
about current trends and to help make the application better.
8.CONCLUSION
The project “DATA ANALYSIS OF SPOTIFY” analyses the relationships and the current
trends of the songs in the Spotify application.
The Spotify Web API audio features, used as covariates, have shown that not all the Spotify
characteristics have a high explanatory power for a higher stream count but some of them are actually
important. Significant relationships were found, which lays a promising foundation for the research in
prediction with these variables. In particular, Speechiness, Instrumentalness and Liveness are the features that
negatively affect the Popularity Index, while Energy, Valence and Duration of the song are the ones that
positively affect it.
This research contributes to further understanding in the field of HSS and of new product success
prediction. Creating effective prediction models is an interesting next step for this research, and a further
step would be to expand the set of variables used. We hope that this paper will also have practical implications
for Spotify, suggesting for example interesting ideas to further develop its database, with the hope
that data of increasing quality can lead to interesting discoveries and added value for the world of HSS.
BIBLIOGRAPHY
REFERENCES:
• Bonat, W. H., P. J. Ribeiro, and W. M. Zeviani (2014, Aug). Likelihood analysis for a
class of beta mixed models. Journal of Applied Statistics 42(2), 252–266.
• Middlebrook, K. and K. Sheik (2019). Song hit prediction: Predicting Billboard hits
using Spotify data.
• Sloboda, J. A. (2011). Music in everyday life: The role of emotions. In P. N. Juslin and
J. Sloboda (Eds.), Handbook of Music and Emotion: Theory, Research, Applications.
Oxford University Press.
WEBSITES
• https://www.programiz.com/python-programming
• https://open.spotify.com/
• https://jupyter.org/
• https://www.kaggle.com/
• https://towardsdatascience.com/
APPENDICES
A.SCREENSHOTS
B.APPENDIX
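# Libraries for data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the tracks dataset (assumed file name; the original load step is not shown)
df_tracks=pd.read_csv("tracks.csv")
# Count null values per column
pd.isnull(df_tracks).sum()
# Column types and non-null counts
df_tracks.info()
# Descriptive statistics
df_tracks.describe().transpose()
# Ten most popular tracks (popularity above 90), highest first
most_popular=df_tracks.query('popularity>90',inplace=False).sort_values('popularity',ascending=False)
most_popular[:10]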
# Make the release date column the index and parse it as dates
df_tracks.set_index("release_date",inplace=True)
df_tracks.index=pd.to_datetime(df_tracks.index)
df_tracks.head()
# Artists of the track at row position 18
df_tracks[["artists"]].iloc[18]
# Convert the duration from milliseconds to whole seconds
df_tracks["duration"]=df_tracks["duration_ms"].apply(lambda x:round(x/1000))
df_tracks.drop("duration_ms",inplace=True, axis=1)
# Sample only 0.4 percent of the whole dataset (0.004 of the rows)
sample_df=df_tracks.sample(int(0.004*len(df_tracks)))
print(len(sample_df))
# Correlation map of the numeric columns, excluding key, mode and explicit
corr_df=df_tracks.drop(["key","mode","explicit"],axis=1).select_dtypes(include="number").corr(method="pearson")
plt.figure(figsize=(14,6))
heatmap=sns.heatmap(corr_df,annot=True,fmt=".1g",vmin=-1,center=0,cmap="inferno",linewidths=1,linecolor="Black")
heatmap.set_title("Correlation Heatmap Between Variables")
heatmap.set_xticklabels(heatmap.get_xticklabels(),rotation=90)
# Regression plot of loudness against energy on the sample
plt.figure(figsize=(10,6))
sns.regplot(data=sample_df,y="loudness",x="energy",color="c").set(title="Loudness vs Energy Correlation")
# Regression plot of popularity against acousticness on the sample
plt.figure(figsize=(10,6))
sns.regplot(data=sample_df,y="popularity",x="acousticness",color="b").set(title="Popularity vs Acousticness Correlation")
# Plot a line graph to show the duration of the songs for each year
total_dr=df_tracks.duration
# Release year of each track, taken from the datetime index
years=df_tracks.index.year
sns.set_style(style="whitegrid")
fig_dims=(10,5)
fig,ax=plt.subplots(figsize=fig_dims)
fig=sns.lineplot(x=years,y=total_dr,ax=ax).set(title="Year vs Duration")
plt.xticks(rotation=60)
# Load the genre dataset (raw string so the backslashes are kept literally)
df_genre=pd.read_csv(r"D:\spotify datas\SpotifyFeatures.csv")
df_genre.head()
# Take the ten most popular tracks and plot a bar plot of their genres by popularity
sns.set_style(style="darkgrid")
plt.figure(figsize=(10,5))
famous=df_genre.sort_values("popularity",ascending=False).head(10)
sns.barplot(y='genre',x='popularity',data=famous).set(title="Top 5 genres by popularity")
ABSTRACT
analyses the consumption of electrical energy. Electricity is a significant form of energy that
cannot be stored physically and is usually generated as needed. In order to avoid waste or
shortage, a good system needs to be designed to constantly maintain the level of electricity
needed. Electricity consumption is an important economic index and plays a significant role in
drawing up an energy development policy for each country.
In Python, a simple linear regression algorithm is used to predict electrical energy
consumption based on previous data. This contribution could help to optimize the economy
of the energy producers' and distributors' value chain. The generated report helps the government to know
the electricity demand and to manage the electricity supply based on
that demand.
I am KEERTHANA.S (20BCA021), a final-year Computer Applications student at PSG College of Arts and
Science, batch 2020-2023.