Final BE Project Report

A

Preliminary Project Report


on

Depression Detection using Sentiment Analysis of Social


Media Posts

SUBMITTED TOWARDS THE


PARTIAL FULFILLMENT OF THE REQUIREMENTS OF

Bachelor of Engineering (Computer Engineering)

BY

Aroop Kumar Roll No: 3423


Ashish Roll No: 3426
K Chaitanya Roll No: 3448
Saurabh Kulkarni Roll No: 7435

Under The Guidance of

Prof. Shubhada Bhalerao

Department of Computer Engineering


Army Institute of Technology, Pune - 411015.

SAVITRIBAI PHULE PUNE UNIVERSITY


2020-21
ARMY INSTITUTE OF TECHNOLOGY,
DEPARTMENT OF COMPUTER ENGINEERING

CERTIFICATE

This is to certify that the Project Entitled

Depression Detection using Sentiment Analysis of Social Media Posts

Submitted by

Aroop Kumar Roll No: 3423


Ashish Roll No: 3426
K Chaitanya Roll No: 3448
Saurabh Kulkarni Roll No: 7435

is a bona fide work carried out by the students under the supervision of Prof. Shubhada
Bhalerao and is submitted towards the partial fulfillment of the requirements of the
Bachelor of Engineering (Computer Engineering) project.

Prof. Shubhada Bhalerao Dr. SR Dhore


Internal Guide H.O.D

Dr. B P Patil
External Examiner Principal

Place : AIT, Pune


Date :
PROJECT APPROVAL SHEET

Project Stage-I Report

on

(Depression Detection using Sentiment Analysis of Social Media Posts)

is successfully completed by

Aroop Kumar (Roll No: 3423)


Ashish (Roll No: 3426)
K Chaitanya (Roll No: 3448)
Saurabh Kulkarni (Roll No: 7435)

at

Department Of Computer Engineering


Army Institute of Technology, Pune-411 015.

SAVITRIBAI PHULE PUNE UNIVERSITY


2020-21

Prof. Shubhada Bhalerao Dr. S.R Dhore


Project Guide HOD
—Depression Detection using Sentiment Analysis of Social Media Posts—– I

Abstract

Depression is a major public health problem worldwide. About 265 million
individuals of all ages suffer from depression. Of these, about 75%
remain untreated, with one million individuals taking their lives every year. Thus,
depression is amongst the leading causes of suicide, especially amongst adolescents.

Social media platforms are becoming an inseparable part of people’s daily lives.
They mirror users’ personal lives, as users share their happiness, joy, insecurities
and sorrow on social media. Researchers often use these platforms to identify
the causes of depression and to counter it. Early detection of depression could prove to
be an enormous step towards improving the mental health of society as a whole.

To address this problem, we propose a stacking-based ensemble machine learning
model which uses XGBoost and Multinomial Naive Bayes as the base learners
and Logistic Regression as the meta-learner. The model is developed for Twitter data
and flags tweets which are found to be depressive. The stacked model produced
a high accuracy and F1-score, exceeding those of the standalone models
proposed earlier. The project would help employ emotional AI on Twitter,
which could in turn contribute to lower suicide rates and improved mental health.

Keywords: depression, mental health, social networking sites, Twitter, machine
learning, emotional AI.

Department of Computer Engineering, AIT, Pune



Acknowledgments

It gives us great pleasure to present the final project report on ‘Depression
Detection using Sentiment Analysis of Social Media Posts’.

We would like to take this opportunity to thank our project guide, Prof. Shubhada
Bhalerao, for giving us all the guidance we needed. We are very grateful for her kind
support. Her valuable suggestions were extremely helpful.

We also extend our sincere gratitude to Dr. S.R Dhore, Head of the Computer
Engineering Department, for creating a competitive environment and providing us with
all the essential facilities and encouragement at the department and institute level.

We would also like to acknowledge all our friends and classmates for their
co-operation. Lastly, we express our gratitude to our parents and other family
members, whose continuous encouragement, love and affection enabled us to complete
this piece of work successfully.

Aroop Kumar
Ashish
K Chaitanya
Saurabh Kulkarni
(B.E. Computer Engg.)



INDEX

Abstract
Acknowledgment
List of Figures

1 INTRODUCTION
1.1 Problem Statement
1.2 Objectives
1.3 Scope Of The Project
1.4 Motivation of The project
1.5 Organization of report

2 LITERATURE SURVEY
2.1 Literature Survey
2.2 Possible Challenges
2.3 Inferences from Literature Survey

3 SOFTWARE REQUIREMENT SPECIFICATION
3.1 Introduction
3.2 Overall Description
3.3 System Features and Requirements

4 ALGORITHM ANALYSIS AND MATHEMATICAL MODELING
4.1 Naive Bayes
4.2 XGBoost
4.3 Proposed Algorithm
4.4 Mathematical Modelling

5 DETAILED DESIGN
5.1 Architectural Design
5.2 UML Diagrams
5.2.1 Class Diagram
5.2.2 Activity Diagram
5.2.3 Use Case Diagram
5.2.4 Sequence Diagram
5.2.5 Deployment Diagram
5.3 Data design
5.3.1 Internal software data structure
5.3.2 Global data structure
5.3.3 Temporary data structure
5.3.4 Database description

6 PROJECT PLANNING
6.1 Tasks Involved
6.2 Technical Risks
6.3 Budget/Time Constraints

7 CODING
7.1 Algorithms / Flowcharts
7.2 Software Used
7.2.1 Utility Packages / Applications
7.2.2 Model Development
7.2.3 Deployment
7.3 Hardware Specification
7.4 Programming Language
7.5 Platform
7.6 Coding Style Format

8 RESULT & ANALYSIS

9 TESTING
9.1 Formal Technical Reviews
9.2 Test Plan
9.3 Test Cases & Results

10 CONFIGURATION MANAGEMENT PLAN

11 SOFTWARE QUALITY ASSURANCE PLAN

12 CONCLUSION

13 References
ANNEXURE A - Plagiarism Report


List of Figures

4.1 XGBoost Objective Function
4.2 Bayes Theorem
4.3 Independence of Features
4.4 Proportionality Relation
4.5 Optimization Function

5.1 Machine Learning Life Cycle
5.2 Architecture : Block Diagram
5.3 Architecture : Flowchart
5.4 UML : Class Diagram
5.5 UML : Activity Diagram
5.6 UML : Use Case Diagram
5.7 UML : Sequence Diagram
5.8 UML : Deployment Diagram

6.1 Project Timeline

7.1 Proposed Model
7.2 Flowchart

8.1 State-of-the-art models vs Stacked Model
8.2 F1-Score Comparison
8.3 Accuracy Comparison

Chapter 1

INTRODUCTION

1.1 Problem Statement

The aim of the project is to detect whether a person is showing signs of clinical
depression. The project will be developed using various machine learning models
trained on social media data (e.g. tweets). The model would then predict whether
a person is showing symptoms of depression and if yes, necessary actions will be
taken.

1.2 Objectives
• To build a machine learning model that analyses Twitter posts and predicts
whether a person is showing signs of clinical depression.

• To collect ample data which could be used for future research.

• To improve the F1-score (primary) and accuracy (secondary) of the model through
cross validation and hyperparameter tuning.

• To successfully deploy the model.
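
The choice of F1-score as the primary metric can be illustrated with a small
sketch. The labels below are made-up toy data (not from our dataset), and the
function names are our own; the point is only that, on imbalanced data, a
classifier that never flags a tweet can still reach high accuracy while its
F1-score is zero.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive (depressive) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy, hypothetical labels: 2 depressive tweets (1) among 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10                      # never flags anything
finds_one = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]      # catches one of the two

print(accuracy(y_true, always_negative))             # high accuracy
print(precision_recall_f1(y_true, always_negative))  # but F1 is zero
print(precision_recall_f1(y_true, finds_one))
```

The second predictor misses half of the depressive tweets, which accuracy
barely registers but recall and F1 expose.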

1.3 Scope Of The Project


• The project deals with detecting depression in Twitter posts using a machine
learning model built as a combination of the XGBoost and Naïve Bayes
algorithms. The XGBoost model will work as a filter to keep data imbalance
minimal, and the Naive Bayes model will use this filtered data for making
predictions. The project specifically targets Twitter posts, for the
following reasons:

1. Twitter data is easy to handle.
2. Being text heavy, it is simple and easy to pre-process.
3. Tweets are available in sufficient quantity and quality.
4. It requires less storage than image and video data.

• End user identification is a crucial step in scope definition. For the purpose of
our project, the following scenarios have been identified:

1. The project could be deployed alongside existing social media
platforms (especially Twitter), where it can fetch user posts and analyse the
level of depression in the individual. Here, the social media user acts as the end user.
2. The project can also be beneficial to psychologists who wish to study
depression and mood disorders, especially in young adults. Hence, psychologists
can be taken to be the end user.
3. The project could be made open source, which could help other researchers
working in this field to improve the model further. Hence,
researchers make up the last set of end users.

1.4 Motivation of The project


• The motivation for this project was primarily an interest in undertaking
a challenging project in the domain of Machine Learning. The opportunity to
learn about various Machine Learning algorithms and their role in preventing
bias in imbalanced datasets was appealing. Depression is a major challenge
plaguing our society, especially millennials, and we are strongly motivated to find
a solution for its early detection.


1.5 Organization of report


The report will cover all the project work which has been done this year. The report
will further cover topics such as literature survey, Software Requirement Specifica-
tion, Algorithms used and Mathematical Model, Project design and Planning, Cod-
ing, Testing, SQA etc. The report will provide a comprehensive understanding of
various aspects of the project and will serve as a useful documentation of our work.



Chapter 2

LITERATURE SURVEY

2.1 Literature Survey


Depression detection from social media is an active area of research. Over the
last few years, social media has been used by many researchers to examine mental
health. In “Proceedings of the Fifth Workshop on Computational Linguistics and
Clinical Psychology: From Keyboard to Clinic” [1] the authors considered that
social media platforms can reflect the users’ personal life on many levels. Their
primary objective was to detect depression using the most effective deep neural
architecture from two of the most popular deep learning approaches in the field of
natural language processing:
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).

According to the report, “Exploring opportunities to support mental health care using
social media: A survey of social media users with mental illness” [2], it was found
that millennials are more open to talk about their mental health issues on social me-
dia. Machine Learning (ML) has advanced significantly in recent years, allowing
for the solution of real-world problems and also the implementation of automated
systems.

In “Predicting future mental illness from social media: A big-data approach” [3],
the author predicted future mental illness from an individual’s posts on Reddit,
by gathering posts from clinical subreddits and then classifying them according
to the corresponding mental illness. Clustering was then applied to those posts
to find the markers of mental illness present in their everyday language.

In “Depression detection using Emotional Artificial Intelligence” [4], Natural
Language Processing was applied to Twitter feeds to conduct emotion analysis
focusing on depression. Individual tweets were labelled as neutral or negative
using a curated word list to detect depression. For predictive modelling, a
support vector machine (SVM) and a Naive Bayes classifier were used. The results
showed that Naive Bayes gave a better accuracy and F1-score than SVM.

In “X-A-BiLSTM: a Deep Learning Approach for Depression Detection in Imbalanced
Data” [5] the authors proposed a deep learning model (X-A-BiLSTM) for
depression detection in imbalanced social media data. This approach focused on
solving the problem caused by data imbalance in the real world. The X-A-BiLSTM
model consists of two components: an XGBoost component, which yields more
balanced data by means of an end-to-end scalable tree boosting system, and a
BiLSTM component with an attention mechanism, which achieved good
classification performance.

In “Utilizing Neural Networks and Linguistic Metadata for Early Detection of De-
pression Indications in Text Sequences” [6] the authors used machine learning mod-
els focused on messages on a social network to identify depression early. In particu-
lar, a classification based on user-level linguistic metadata is compared to a convolu-
tional neural network based on different word embeddings. In addition, the current
common ERDE score as a metric for early detection systems is discussed in depth,
as well as its drawbacks in the context of shared tasks. Finally, a broad corpus was
used to train a new word embedding.

In “Detection of Mood Disorder Using Modulation Spectrum of Facial Action Unit
Profiles” [7] the authors constructed a database of facial expressions in response
to emotional stimuli from patients with bipolar disorder (BD) and unipolar
depression (UD) and from healthy controls. To detect mood disorder, the subjects’
facial expressions in the CHIMEI database were used to generate action unit (AU)
profiles. The modulation spectrum (MS) characterizing the fluctuation of the AU
profile sequence over a video segment was then used for mood disorder detection.
In the comparison of mood disorder detection results, the proposed
ANN-based method achieved the best performance.

In “Depression Detection by Analyzing Social Media Posts of User” [8] the authors
showed that depression can lead an individual to severe mental illness, and even
to suicide, and how a machine learning approach can detect depression in social
media users. Micro-blogging and social networking sites such as Twitter and
Facebook allow users to express their day-to-day thoughts and activities, which
reflect users’ behavioral attributes and personality traits. The paper proposed a
model that takes a username and analyzes the social media posts of the user to
determine the level of vulnerability to depression. The authors reported an
accuracy of 74% and a precision of 100% for this model.

In “Facebook Social Media for Depression Detection in the Thai Community” [9]
the author provides a tool for detecting depression easily and at an early stage.
This would help people to be aware of their emotional states and seek help from
professional services. The study uses Natural Language Processing (NLP) techniques
to create a depression detection algorithm for the Thai language on Facebook, a
social media platform where people share their thoughts, emotions, and life events.
Results from 35 Facebook users indicated that Facebook behaviors could predict
depression level.

In “Twitter Analysis for Depression on Social Networks based on Sentiment and
Stress” [10] the authors state that detecting words that express negativity in a
social media message is one step towards detecting depressive moods. They applied
a multistep approach which allowed them to identify potential users and then
discover the words these users used to express negativity. Results showed that the
sentiment of these words can be obtained and scored efficiently, as the computation
on these datasets was narrowed to only the selected users. They also obtained
stress scores which correlated well with the negative sentiment expressed in the
content.

In “A novel Co-training based approach for the classification of mental illnesses
using Social media posts” [11] the authors performed several experiments to classify
posts and their associated comments related to four mental health conditions:
Anxiety, ADHD, Depression and Bipolar disorder. They mined data from the Reddit
platform, where community-related posts are published. The authors used an API to
extract posts and associated comments and performed experiments using SVM, NB, and
RF classifiers. The experimental results indicate that SVM, NB, and RF performed
better with the co-training technique than when used individually, in terms of
Precision, Recall, and F-measure.

In “Realizing a Stacking Generalization Model to Improve the Prediction Accuracy
of Major Depressive Disorder in Adults” [12] the authors developed a stacking
generalization model for improving the accuracy of predicting major depressive
disorder (MDD). In the first step, they applied a KNN imputation preprocessing
technique to handle the missing values in the data. In the next step, they used
Random Forest-Based Backward Elimination (RF-BE), a wrapper-based feature
selection method, to reduce the feature dimension, which reduces feature
interactions and helps increase the prediction accuracy. RF-BE reduced the initial
22 features to 12, which were used in the rest of the process. The stacking
generalisation is accomplished by combining three low-level learners, MLP, SVM,
and RF, and then averaging them to create a meta-level learner (MLP). The
classifiers were also implemented individually to compare the results. The
accuracy of the individual classifiers MLP, SVM, and RF is 96.38%, 95.06%, and
96.90%, respectively. The accuracy of the stacking generalization model is 98.16%.
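
The general idea of stacking described above can be sketched in a few lines of
plain Python. This is only an illustration under our own assumptions, not the
method of [12]: the two base learners are toy one-feature centroid scorers
(standing in for learners such as MLP, SVM or RF), the meta-learner is a small
logistic regression trained by gradient descent, and the data, class names and
function names are hypothetical. The base learners are trained on one part of
the data and scored on the held-out part, so that the meta-learner only sees
out-of-fold predictions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class CentroidScorer:
    """Toy base learner: compares one feature with the two class means."""

    def __init__(self, feature):
        self.feature = feature

    def fit(self, X, y):
        pos = [x[self.feature] for x, t in zip(X, y) if t == 1]
        neg = [x[self.feature] for x, t in zip(X, y) if t == 0]
        self.mu_pos = sum(pos) / len(pos)
        self.mu_neg = sum(neg) / len(neg)
        return self

    def predict_proba(self, x):
        d_pos = abs(x[self.feature] - self.mu_pos)
        d_neg = abs(x[self.feature] - self.mu_neg)
        return sigmoid(d_neg - d_pos)  # closer to positive mean -> higher


def fit_logistic(rows, y, lr=1.0, epochs=2000):
    """Meta-learner: logistic regression by batch gradient descent."""
    w = [0.0] * (len(rows[0]) + 1)  # last entry is the bias
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, t in zip(rows, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1]) - t
            for j, xj in enumerate(x):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * g / len(rows) for wj, g in zip(w, grad)]
    return w


def fit_stack(X, y, make_base_learners, k=2):
    n = len(X)
    meta_rows = [None] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        learners = [m.fit([X[i] for i in train], [y[i] for i in train])
                    for m in make_base_learners()]
        for i in range(fold, n, k):  # held-out rows of this fold
            meta_rows[i] = [m.predict_proba(X[i]) for m in learners]
    meta_weights = fit_logistic(meta_rows, y)
    final_learners = [m.fit(X, y) for m in make_base_learners()]
    return final_learners, meta_weights


def predict_stack(model, x):
    learners, w = model
    feats = [m.predict_proba(x) for m in learners]
    z = sum(wj * fj for wj, fj in zip(w, feats)) + w[-1]
    return 1 if z > 0 else 0


# Hypothetical 2-feature toy data; label 1 = positive class.
X = [(3, 2), (4, 3), (0, 0), (1, 1), (3, 3), (4, 2), (0, 1), (1, 0)]
y = [1, 1, 0, 0, 1, 1, 0, 0]
model = fit_stack(X, y, lambda: [CentroidScorer(0), CentroidScorer(1)])
print(predict_stack(model, (4, 3)), predict_stack(model, (0, 0)))
```

Training the meta-learner on out-of-fold predictions, rather than on
predictions for rows the base learners have already seen, is what keeps it
from simply trusting an over-fitted base learner.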

In “A Machine Learning based Depression Analysis and Suicidal Ideation Detection
System using Questionnaires and Twitter” [13] the authors analyzed social media
posts (especially Twitter), conducted a questionnaire asking students and parents
for their opinions, and also scraped blogs on the internet. According to the
research, the major factors of depression in the 15-29 age group found during
the course of the project are parental pressure, love, failures, bullying, body
shaming, inferiority complex, exam pressure, peer pressure, physical and sexual
abuse, etc. Depression being a recurrent illness, repeated episodes are
common. Finally, little is known about the prevention and identification of the
disorder at an early stage. Among future directions, the authors sought to
understand how social media behavior analysis can lead to the development of
methods for analyzing depression at scale.

2.2 Possible Challenges


Some possible risks / challenges associated with the project:

1. Machine Learning algorithms cannot guarantee human-level accuracy in the
prediction of depression.
2. There is significant noise in the tweets collected before pre-processing,
including a lot of unnecessary data due to third-person and news references.
3. Social media posts are highly imbalanced, due to which machine learning
models often develop a bias, which in turn leads to erroneous results. On the other
hand, deep learning models require a huge amount of data to train, which is generally
not available in the datasets used in earlier models.
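
As a rough illustration of the kind of noise removal point 2 refers to, the
sketch below strips URLs, user mentions, hashtag symbols and punctuation from a
tweet using only the Python standard library. The regular expressions and the
function name clean_tweet are our own illustrative choices, not a fixed part of
the project pipeline, and the sketch does not address third-person or news
references, which need more than character-level cleaning.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s']")


def clean_tweet(text):
    """Lower-case a tweet and remove URLs, mentions and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # drop links
    text = MENTION_RE.sub(" ", text)     # drop @user mentions
    text = text.replace("#", " ")        # keep the hashtag word, drop '#'
    text = NON_LETTER_RE.sub(" ", text)  # drop digits and punctuation
    return " ".join(text.split())        # collapse whitespace


print(clean_tweet("@bob I feel SO alone today... http://t.co/xyz #sad"))
```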

2.3 Inferences from Literature Survey


1. The literature survey shows that, for limited datasets, Machine Learning
algorithms are the most effective choice.
2. Naïve Bayes works well with textual data and gave better results than
SVM [4]. Hence, we will use this algorithm for our purpose.
3. We will use the Sentiment140 dataset for extracting depressive tweets and
preparing our final dataset.
4. The Sentiment140 dataset contains 1,600,000 tweets extracted using the Twitter
API. The tweets have been annotated (0 = negative, 2 = neutral, 4 = positive) and
they can be used to detect sentiment.
5. The dataset would be highly imbalanced and thus, we will use the XGBoost
algorithm as a filter for avoiding bias.
6. Finally, model evaluation, cross validation and hyperparameter tuning will be
carried out to test the effectiveness of the model.
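
A minimal sketch of how the Sentiment140 annotations in point 4 could be used
to pick candidate depressive tweets is given below. It assumes the commonly
distributed column order of the dataset (polarity, id, date, query, user,
text); the three sample rows, the keyword set and the selection rule (negative
polarity plus at least one keyword) are hypothetical and serve only as an
illustration.

```python
import csv
import io

# Polarity codes as described for the Sentiment140 annotations.
POLARITY = {"0": "negative", "2": "neutral", "4": "positive"}

# Hypothetical stand-in for a curated list of depressive keywords.
DEPRESSIVE_KEYWORDS = {"depressed", "hopeless", "worthless"}


def candidate_depressive_tweets(rows, keywords=DEPRESSIVE_KEYWORDS):
    """Keep negative tweets that contain at least one keyword."""
    selected = []
    for polarity, _id, _date, _query, _user, text in rows:
        if POLARITY.get(polarity) != "negative":
            continue
        if set(text.lower().split()) & keywords:
            selected.append(text)
    return selected


# Made-up rows in the assumed six-column layout.
sample = io.StringIO(
    '"0","1","Mon Apr 06 2009","NO_QUERY","user_a","I feel hopeless and tired"\n'
    '"4","2","Mon Apr 06 2009","NO_QUERY","user_b","What a great day"\n'
    '"0","3","Mon Apr 06 2009","NO_QUERY","user_c","Traffic was bad today"\n'
)
rows = list(csv.reader(sample))
print(candidate_depressive_tweets(rows))
```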



Chapter 3

SOFTWARE REQUIREMENT
SPECIFICATION

3.1 Introduction
• Purpose
The aim of the project is to detect whether a person is showing signs of clinical
depression. The project will be developed using various machine learning
models trained on Twitter data. The model will predict whether a person
is showing symptoms of depression and, if so, necessary actions will be taken.
The model will be trained on a comprehensive dataset containing a
mix of depressive and non-depressive tweets. The dataset will be prepared
from a mix of various open source repositories and data scraped through the
Twitter API. A hybrid model comprising XGBoost and Naive Bayes has
been identified for predictive modelling.

• Intended Audience
For the purpose of our project, the following end users have been identified:

1. The project could be deployed alongside existing social media
platforms (especially Twitter), where it can fetch user posts and analyse the
level of depression in the individual. Here, the social media user acts as the end user.

2. The project can also be beneficial to psychologists who wish to study
depression and mood disorders, especially in young adults. Hence, psychologists
can be taken to be the end user.

3. The project could be made open source, which could help other researchers
working in this field to improve the model further. Hence,
researchers make up the last set of end users.

• Scope
The project deals with detecting depression in Twitter posts using a machine
learning model built as a combination of the XGBoost and Naïve
Bayes algorithms. The literature survey conducted in the initial phases indicated
that Naïve Bayes works best for textual data. The XGBoost model
will work as a filter to keep data imbalance minimal.
The project is limited to Twitter data. The reasons for choosing Twitter
are listed below:
1. Twitter data is easy to handle.
2. Being text heavy, it is simple and easy to pre-process.
3. Tweets are available in sufficient quantity and quality.
4. It requires less storage than image and video data.

The project will be deployed as an API developed using the
FastAPI framework and Uvicorn. The API will be hosted on Heroku, which
provides free cloud-based servers. We will use the Twitter API to interact
with Twitter databases for data retrieval.

• Definitions and Acronyms

1. API: Application Programming Interface. An API is a set of
programming code that enables data transmission between one software product
and another. It also contains the terms of this data exchange.

2. XGBoost: Extreme Gradient Boosting. XGBoost is an optimized
distributed gradient boosting library designed to be highly efficient,
flexible and portable. It implements machine learning algorithms under the
Gradient Boosting framework.

3. Naïve Bayes: Naive Bayes classifiers are a collection of classification
algorithms based on Bayes’ Theorem. It is not a single algorithm but a family
of algorithms that share a common principle: every pair of features being
classified is assumed to be independent of each other, given the class.

4. FastAPI: FastAPI is a modern, fast (high-performance) web framework
for building APIs with Python 3.6+ based on standard Python type hints. It is
one of the fastest Python frameworks available.

5. Uvicorn: An ASGI server based on uvloop and httptools, with an
emphasis on speed.

6. Heroku: Heroku is a platform as a service (PaaS) that enables developers
to build, run, and operate applications entirely in the cloud.

3.2 Overall Description


• Product Perspective
The main motive of the project is not only to detect depression in tweets but
also to ensure that the user experience is not hampered. It is therefore necessary
that our system works in the back-end as an independent module.
Hence, the project will be built as an API which hosts our hybrid ML
model on the cloud, is active only when the user tweets, and
is in a passive state at all other times.
The user will interact only with the Twitter interface, whereas the model will
monitor tweets via the Twitter API, which provides various features
for developers who wish to work with Twitter. The project will also use a
depression score metric for classifying a user as depressive. If the depression
score exceeds a particular threshold value, the user will be classified as
depressive and assistance in the form of medical help notifications, positive feeds,
etc. will be provided.
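
The threshold rule described above can be sketched as follows. The exact
definition of the depression score is not fixed in this section, so the sketch
assumes, purely for illustration, that the score is the mean of the per-tweet
depressive probabilities produced by the model; the function names and the
threshold value of 0.6 are likewise hypothetical.

```python
def depression_score(tweet_probabilities):
    """Assumed definition: mean per-tweet probability of being depressive."""
    if not tweet_probabilities:
        return 0.0
    return sum(tweet_probabilities) / len(tweet_probabilities)


def needs_assistance(tweet_probabilities, threshold=0.6):
    """True when the user's score exceeds the (hypothetical) threshold."""
    return depression_score(tweet_probabilities) > threshold


print(needs_assistance([0.9, 0.8, 0.7]))  # consistently high probabilities
print(needs_assistance([0.1, 0.2, 0.9]))  # one isolated depressive tweet
```

Averaging over several tweets, rather than reacting to a single one, is one
way to reduce the impact of an isolated false positive.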

• Constraints, Assumptions and Dependencies

For simplicity and to ensure computability, we make the following assumptions:
1. The model is highly accurate and provides human-level depression
predictions.
2. The servers always remain active and never crash.
3. The user expresses his/her true emotions in tweets and does not
post depressive tweets unless depressed.

In addition to the above assumptions, the system has some constraints,
some of which are identified below:
1. Some amount of latency will always be present, regardless of how fast the
servers are.
2. It is impossible to achieve 100% accuracy, and some false positives
will always occur.
3. In the development phase, the Heroku servers need some time
to start before they are fully functional. Hence, the server may not be
available at all times.

The dependencies for the successful development and deployment of the project
are listed below:
1. Software Dependencies: Windows OS, Anaconda, Jupyter, Python 3.7,
standard ML libraries, Pipenv, FastAPI, Uvicorn, Heroku CLI, an IDE (VS
Code, Sublime Text, etc.).
2. Hardware Dependencies: Intel i5/i7 processor, 4/8 GB RAM, Heroku cloud
server.
3. Other Dependencies: Sentiment140 dataset, Twitter Developer Account,
curated word list of depressive keywords (available on GitHub [14]), Google
Colab GPU.

3.3 System Features and Requirements


• External Interface Requirements
1. User Interface:
• Front-end: Twitter interface.
• Back-end: Python-based ML model (with FastAPI and Uvicorn) hosted on
Heroku.

2. Hardware Interface:
• Not applicable.

3. Software Interface:
• User-level interface: any OS (Windows preferable), web browser.

4. Communication Interface:
• Communication between components will be carried out over HTTP,
with data transferred in JSON format.
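
To make the JSON exchange concrete, the sketch below shows one possible shape
for a request and a response, using only the Python standard library. The
field names (tweet_id, text, depressive, probability), the 0.5 cut-off and the
handle_request function are hypothetical; the actual API schema is not
specified in this section, and the classifier is passed in as a plain function
so that the example stays independent of any web framework.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PredictionRequest:
    tweet_id: str   # hypothetical field
    text: str       # hypothetical field


@dataclass
class PredictionResponse:
    tweet_id: str
    depressive: bool
    probability: float


def handle_request(body, classify):
    """Decode a JSON request, run the classifier, encode a JSON response."""
    request = PredictionRequest(**json.loads(body))
    probability = classify(request.text)
    response = PredictionResponse(request.tweet_id, probability >= 0.5, probability)
    return json.dumps(asdict(response))


# A stand-in classifier that always returns the same probability.
reply = handle_request('{"tweet_id": "1", "text": "hi"}', lambda text: 0.75)
print(reply)
```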

• System Features
1. The model must correctly identify cases of clinical depression
from tweets with high accuracy and a good F1-score.
2. Latency should be minimal so as to increase the efficiency of our system.
3. The model must be able to deal with highly imbalanced data and
should not be biased.
4. The model should be easy to integrate with the Twitter API
so that it can be deployed in the real world.
5. The system must ensure that user data is protected and confidentiality is
maintained.

• Non-Functional Requirements
1. Performance Requirement
The response time of the model must be as low as possible. To achieve this,
we will build our API with FastAPI, which is one of the fastest Python
frameworks available.

2. Usability Requirement
The system must be easy to use and easy to integrate with existing software.
This will be achieved by hosting our API on the cloud, from where the model
can be plugged directly into any application.

3. Reliability Requirement
The system should be reliable and must produce accurate results. Also, the
model must be available at all times and should not break down in the event of
a failure (e.g., server failure). To achieve this, we will deploy on Heroku,
whose shared servers would help prevent total failure.



Chapter 4

ALGORITHM ANALYSIS AND MATHEMATICAL MODELING

4.1 Naive Bayes


Naive Bayes methods are supervised learning algorithms based on Bayes' theorem
with the "naive" assumption of conditional independence between every pair of
features given the value of the class variable.
It is a classification method based on Bayes' Theorem and the assumption of
independence among predictors. In simple terms, a Naive Bayes classifier
assumes that the presence of one feature in a class is unrelated to the
presence of any other feature.
Despite their oversimplified assumptions, Naive Bayes classifiers have
performed admirably in a number of real-world applications, most notably
document classification and spam filtering. They need only a small amount of
training data to estimate the necessary parameters.
Compared to more advanced methods, Naive Bayes learners and classifiers can be
extremely fast. Since the class-conditional feature distributions are
decoupled, each distribution can be estimated independently as a
one-dimensional distribution. This, in turn, helps alleviate problems caused
by the curse of dimensionality.
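The decoupling described above can be illustrated with a minimal from-scratch sketch of a multinomial Naive Bayes text classifier. This is only an illustration under stated assumptions: the toy tweets, the labels and the helper names (`train_nb`, `predict_nb`) are hypothetical and are not taken from the project code, and Laplace (add-one) smoothing is assumed.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Estimate class counts and per-class word counts from tokenized docs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> count
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(tokens, class_counts, word_counts, vocab):
    """Return the class with the highest log-posterior (up to a constant)."""
    total_docs = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for cls, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[cls].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Each word is scored on its own: the "naive" independence assumption.
            count = word_counts[cls][tok]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_class, best_score = cls, score
    return best_class


# Hypothetical toy data (1 = depressive, 0 = not depressive).
docs = [
    "i feel hopeless and empty".split(),
    "so tired of everything hopeless".split(),
    "great day with friends".split(),
    "happy and excited for the weekend".split(),
]
labels = [1, 1, 0, 0]

model = train_nb(docs, labels)
print(predict_nb("feeling hopeless and tired".split(), *model))
print(predict_nb("happy day with friends".split(), *model))
```

Because every word contributes a separate one-dimensional estimate, training reduces to counting, which is why the method needs little data and runs quickly.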

4.2 XGBoost
XGBoost is an optimized gradient boosting machine learning library. It was
originally written in C++, but it has APIs in many other languages. The core
XGBoost algorithm is parallelizable, which means that the construction of a
single tree can be parallelized.
A decision tree is made up of a set of binary questions, and the final
predictions are made at the leaves. XGBoost typically uses trees as the base
learner, and it is an ensemble method in and of itself. The trees are built in
stages until a stopping criterion is reached.
XGBoost uses CART (Classification and Regression Trees) decision trees. CART
refers to trees in which each leaf contains a real-valued score, regardless of
whether they are used for classification or regression. If required, the
real-valued scores can be converted to categories for classification.
XGBoost makes use of advanced regularisation to improve model generalisation.
It is more efficient than conventional gradient boosting, has a short learning
curve and can be parallelized across clusters.
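The stage-wise construction of trees can be sketched in a few lines. The example below is plain gradient boosting on squared error with one-split trees (stumps) over a single feature; it is not XGBoost's regularized second-order algorithm. The function names, the toy data, the learning rate and the fixed number of rounds used as the stopping criterion are all assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_val = sum(left) / len(left)
        right_val = sum(right) / len(right)
        error = sum((r - left_val) ** 2 for r in left)
        error += sum((r - right_val) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_val, right_val)
    _, threshold, left_val, right_val = best
    return threshold, left_val, right_val


def stump_predict(stump, x):
    """Each leaf of the stump holds a real-valued score."""
    threshold, left_val, right_val = stump
    return left_val if x <= threshold else right_val


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add one stump per round, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):  # stopping criterion: fixed number of rounds
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def boosted_predict(model, x):
    """Sum the base score and the scaled leaf scores of all stumps."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


# Hypothetical toy data: the target switches from 0 to 1 between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
print(boosted_predict(model, 2), boosted_predict(model, 5))
```

The real-valued output can be thresholded (for example at 0.5) to obtain a class label, which mirrors how CART leaf scores are converted to categories.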

4.3 Proposed Algorithm


The algorithms described above are state-of-the-art algorithms which work very
well with balanced datasets. However, we are dealing with highly imbalanced
datasets, and thus our algorithm must be able to remove this imbalance and
avoid any bias.
Our model will be developed as a combination of XGBoost and Naive Bayes. This
hybrid model would be able to extract the most relevant information using the
power of both algorithms. The XGBoost layer will act as a filter which would
remove imbalance in the data, and the Naive Bayes layer would finally make
predictions on the refined data.


4.4 Mathematical Modelling


Our algorithm will be implemented as a combination of two algorithms - XGBoost
and Naive Bayes. Thus, the mathematics of the proposed system is based on these
algorithms.

Let our dataset comprise the following sets of features: -

1. The set of independent features, X = {F1, F2, F3, ..., Fn}.
2. The dependent feature, y.
The data is first passed through the XGBoost layer. The objective function for
XGBoost is shown below: -

Figure 4.1: XGBoost Objective Function

The objective function above comprises the loss function as well as the
regularization function. Our goal is to minimize this function, which is done
internally using the Taylor approximation technique. Finally, we obtain our
prediction. Let the probability of prediction for XGBoost be P(xg). This will
be used further to calculate the final result.
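Figure 4.1 is not reproduced in this text. For reference, the regularized objective in the form commonly given in the XGBoost literature is sketched below. The symbols are the usual ones and are assumptions here rather than notation taken from the report: n samples, K trees f_k, loss l, T leaves per tree, leaf weights w, penalties γ and λ, and g_i, h_i the first and second derivatives of the loss with respect to the previous round's prediction.

```latex
% Regularized objective: training loss plus a complexity penalty per tree
\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}

% Second-order Taylor approximation minimized at boosting round t
% (terms that do not depend on f_t are dropped)
\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n}
  \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^{2} \right] + \Omega(f_t)
```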

At the Naïve Bayes layer, Bayes' Theorem is used for prediction. The standard
Bayes' Theorem is represented by the formula below: -


Figure 4.2: Bayes Theorem

Here,
P(y | X) is the posterior probability of class (y, target) given predictor (X, features).
P(y) is the prior probability of class.
P(X | y) is the likelihood which is the probability of predictor given class.
P(X) is the prior probability of predictor.
In Naïve Bayes, we make the naïve assumption that all the features are
conditionally independent given the class; hence we have: -

Figure 4.3: Independence of Features

P(X) does not depend on the class y, so it is constant across classes and our
earlier formula reduces to: -

Figure 4.4: Proportionality Relation


The goal of Naive Bayes is to choose the class y with the maximum probability.
Thus, our final optimization function will be: -

Figure 4.5: Optimization Function

Let the Naïve Bayes probability of prediction be P(nb).
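Figures 4.2 to 4.5 are not reproduced in this text. The chain of formulas they refer to can be written in its standard textbook form as follows, assuming the feature set is X = {F_1, ..., F_n} and writing ŷ for the predicted class (the index n and the symbol ŷ are notational assumptions).

```latex
% Bayes' theorem (Figure 4.2)
P(y \mid X) = \frac{P(X \mid y)\, P(y)}{P(X)}

% Independence of features given the class (Figure 4.3)
P(X \mid y) = \prod_{i=1}^{n} P(F_i \mid y)

% Proportionality relation, since P(X) is the same for every class (Figure 4.4)
P(y \mid X) \propto P(y) \prod_{i=1}^{n} P(F_i \mid y)

% Optimization function: pick the most probable class (Figure 4.5)
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(F_i \mid y)
```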

Finally, we will take a weighted average of the prediction probabilities of
both algorithms and set a threshold value. The tweet will be classified as
depressive only if the final probability exceeds the threshold.

FinalProbability = (k1 ∗ P(xg) + k2 ∗ P(nb))/(k1 + k2) (4.1)
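Equation (4.1) and the threshold rule can be sketched directly in code. The weights, the probabilities and the threshold values used below are hypothetical; the report does not state the values of k1, k2 or the threshold.

```python
def final_probability(p_xg, p_nb, k1=1.0, k2=1.0):
    """Weighted average of the two model probabilities, as in Eq. (4.1)."""
    return (k1 * p_xg + k2 * p_nb) / (k1 + k2)


def is_depressive(p_xg, p_nb, k1=1.0, k2=1.0, threshold=0.5):
    """Flag a tweet only when the combined probability exceeds the threshold."""
    return final_probability(p_xg, p_nb, k1, k2) > threshold


# Hypothetical example: XGBoost is trusted twice as much as Naive Bayes.
p = final_probability(0.9, 0.6, k1=2.0, k2=1.0)
print(round(p, 3))  # (2 * 0.9 + 1 * 0.6) / 3
print(is_depressive(0.9, 0.6, k1=2.0, k2=1.0, threshold=0.7))
print(is_depressive(0.4, 0.6, threshold=0.7))
```

Raising the threshold trades recall for precision, which matters here because false positives were listed as an unavoidable constraint of the system.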



Chapter 5

DETAILED DESIGN

5.1 Architectural Design


The requirement analysis done earlier makes way for identifying and analyzing
the various processes involved in the development of the project. Any machine
learning / deep learning project follows the data science life cycle, which is
shown below.

Figure 5.1: Machine Learning Life Cycle



The life cycle helps us identify the primary processes which need to be
followed for successful implementation of the project. For the purpose of our
project, four stages of technical work have been identified, which are shown
in the block diagram and flowchart below: -

Figure 5.2: Architecture : Block Diagram

Figure 5.3: Architecture : Flowchart


5.2 UML Diagrams

5.2.1 Class Diagram

A class diagram shows the relationships and dependencies between the various
classes in the system. For our purpose, we'll use pre-defined classes of the
Twitter API. The class diagram is shown below.

Figure 5.4: UML : Class Diagram

5.2.2 Activity Diagram

An activity diagram is a sequential representation of the various activities
involved in the project. It portrays the control flow from a start point to a
finish point, showing the various decision paths that exist while the activity
is being executed. The activity diagram for our system is shown below.


Figure 5.5: UML : Activity Diagram

5.2.3 Use Case Diagram

A use case diagram at its simplest is a representation of a user’s interaction with the
system that shows the relationship between the user and the different use cases in
which the user is involved. The use case diagram for our system has been shown
below.


Figure 5.6: UML : Use Case Diagram

5.2.4 Sequence Diagram

A sequence diagram shows object interactions arranged in time sequence. It
depicts the objects involved in the scenario and the sequence of messages
exchanged between the objects needed to carry out the functionality of the
scenario. Sequence diagrams show elements as they interact over time and they
are organized according to object (horizontally) and time (vertically).
Sequence diagrams are sometimes known as event diagrams or event scenarios.


Figure 5.7: UML : Sequence Diagram

5.2.5 Deployment Diagram

A UML deployment diagram shows the configuration of run-time processing nodes
and the components that live on them. Deployment diagrams are a kind of
structure diagram used in modeling the physical aspects of an object-oriented
system. They are often used to model the static deployment view of a system
(topology of the hardware). Deployment diagrams are important for visualizing,
specifying, and documenting embedded, client/server, and distributed systems
and also for managing executable systems through forward and reverse
engineering. A deployment diagram is just a special kind of class diagram,
which focuses on a system's nodes. Graphically, a deployment diagram is a
collection of vertices and arcs.


Figure 5.8: UML : Deployment Diagram

5.3 Data design

5.3.1 Internal software data structure

The Twitter API stores information internally in the form of various classes
and transmits this data in the form of JSON objects.

5.3.2 Global data structure

Our API will expose the api/predict interface as the global structure
accessible through the Twitter API. This will send prediction details in JSON
format.
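A framework-agnostic sketch of the JSON exchange behind such an endpoint is given below. The field names (`tweet`, `probability`, `depressive`), the handler name and the keyword-based `score_tweet` placeholder are all hypothetical; in the project the handler would sit behind a FastAPI route served by Uvicorn and would call the trained model, neither of which is shown here.

```python
import json


def score_tweet(text: str) -> float:
    """Placeholder scorer; the real system would call the trained model."""
    keywords = {"hopeless", "worthless", "empty"}  # hypothetical word list
    tokens = text.lower().split()
    hits = sum(token in keywords for token in tokens)
    return min(1.0, hits / 2)


def handle_predict(request_body: str) -> str:
    """Parse a JSON request, score the tweet and return a JSON response."""
    payload = json.loads(request_body)
    tweet = payload["tweet"]
    probability = score_tweet(tweet)
    response = {
        "tweet": tweet,
        "probability": probability,
        "depressive": probability > 0.5,  # assumed threshold
    }
    return json.dumps(response)


print(handle_predict('{"tweet": "I feel hopeless and empty"}'))
print(handle_predict('{"tweet": "lovely weather today"}'))
```

Keeping the handler a pure function of the request body makes it easy to test without starting a server.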

5.3.3 Temporary data structure

Some temporary files may be created for storing user data; these would be
deleted once our prediction is done.
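One way to guarantee that such temporary files are removed, even when the prediction step fails, is a try/finally block around the standard-library `tempfile` module. The function names and the JSON layout below are hypothetical and only illustrate the pattern.

```python
import json
import os
import tempfile


def predict_with_temp_file(user_data, predict):
    """Write user data to a temp file, run the prediction, then delete the file."""
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(user_data, handle)
        result = predict(path)
    finally:
        os.remove(path)  # runs even if the prediction raises
    return result, path


def count_tweets(path):
    """Stand-in for the real prediction step: just counts the stored tweets."""
    with open(path, encoding="utf-8") as handle:
        return len(json.load(handle)["tweets"])


result, used_path = predict_with_temp_file({"tweets": ["a", "b", "c"]}, count_tweets)
print(result, os.path.exists(used_path))
```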

5.3.4 Database description

No external database will be used as such. However, the Twitter API will enable
us to access the Twitter database server for tweets and other information.



Chapter 6

PROJECT PLANNING

6.1 Tasks Involved


• Data Collection : Sentiment140 dataset, Scraping data using Twitter API /
Tweepy.

• Data Pre-processing : NLTK library, Python ML libraries.

• Model Selection : Naïve Bayes, XGBoost etc.

• Model Evaluation : Precision, Recall, F1-score (Primary metric), Accuracy
(Secondary metric).

• Model Improvement : Cross Validation, Hyperparameter tuning, data cleaning,
Co-training etc.

6.2 Technical Risks


• Machine Learning algorithms cannot guarantee human-level accuracy in the
prediction of depression. This risk can be mitigated by developing hybrid
models.

• There is significant noise in the tweets collected before pre-processing,
which leads to a lot of unnecessary data due to third-person and news
references. This risk is mitigated using natural language pre-processing
libraries, which help in extracting the most useful words in textual data.

• Social media posts are highly imbalanced, due to which machine learning
models often develop a bias, which in turn leads to erroneous results. This
risk is mitigated using Gradient Boosting.

• We need a huge amount of data to train machine learning models, which is
generally not available in existing repositories. Thus, we can scrape extra
data using the Twitter API.

6.3 Budget/Time Constraints


• There is no significant budget constraint, as all the software used for this
project is open source and the hardware and cloud resources used are freely
available. However, in the future, more robust and faster hardware may be
required to deploy the model in the real world.

• Time is limited and can be a factor which could lead to failure of the
project. This risk can be mitigated by managing time properly using a
well-defined timeline.

The workflow timeline has been shown below: -

Figure 6.1: Project Timeline



Chapter 7

CODING

7.1 Algorithms / Flowcharts


The proposed model will be developed as a stacked generalization (stacking)
ensemble of XGBoost and Multinomial Naïve Bayes models. This hybrid model would
be able to extract relevant information using the power of both algorithms.
Here, the XGBoost and Multinomial Naïve Bayes models act as the base-learners
for our stack, while a Logistic Regression classifier acts as the meta-model.
The architecture of the stacking model is shown below: -

Figure 7.1: Proposed Model
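A minimal sketch of this architecture using scikit-learn's `StackingClassifier` is given below. Two assumptions are made to keep the example small and self-contained: scikit-learn's `GradientBoostingClassifier` stands in for XGBoost (the `xgboost` package's `XGBClassifier` could be substituted as the second base-learner), and the toy tweets, labels and hyperparameters are hypothetical rather than the project's actual data or settings.

```python
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (1 = depressive, 0 = not depressive).
tweets = [
    "i feel hopeless and empty inside",
    "nothing matters anymore i am so tired",
    "cannot sleep again feeling worthless",
    "i hate myself and feel alone",
    "had a great day with my friends",
    "so excited for the weekend trip",
    "loving the sunny weather today",
    "just finished a fun workout",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Base-learners feed their predicted probabilities to the meta-model.
stack = StackingClassifier(
    estimators=[
        ("mnb", MultinomialNB()),
        # Stand-in for XGBoost in this sketch.
        ("gb", GradientBoostingClassifier(n_estimators=20, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",
    cv=2,  # small value only because the toy dataset is tiny
)

# Word counts are the features; the stack is trained on top of them.
model = make_pipeline(CountVectorizer(), stack)
model.fit(tweets, labels)

print(model.predict(["feeling so hopeless and alone"]))
print(model.predict_proba(["feeling so hopeless and alone"]))
```

The meta-model is trained on out-of-fold predictions of the base-learners, so it learns how much to trust each one instead of relying on fixed weights.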

For developing the model, we need to follow the standard NLP procedures, which
comprise Data Collection, Data Cleaning, Tokenization, Model Selection with
Hyperparameter Tuning, Model Stacking and Evaluation. The flow of events is
depicted below: -

Figure 7.2: Flowchart
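The data cleaning and tokenization steps of this flow can be sketched with the standard library alone. The project itself lists NLTK for pre-processing; the regular expressions, the tiny stop-word list and the function name below are assumptions made for this illustration only.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")
# Hypothetical, deliberately tiny stop-word list.
STOPWORDS = {"a", "an", "the", "is", "am", "are", "i", "to", "of", "and", "so"}


def clean_tweet(text):
    """Lowercase, strip URLs, mentions and punctuation, then tokenize."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")  # keep the hashtag word, drop the symbol
    text = NON_LETTER_RE.sub(" ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]


print(clean_tweet("@friend I am SO tired of everything... https://t.co/abc #hopeless"))
# ['tired', 'everything', 'hopeless']
```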

7.2 Software Used

7.2.1 Utility Packages / Applications

• OS : Windows 10

• IDE : VSCode

• Package Manager (Python): pip

7.2.2 Model Development

• Dataset Storage : MS Excel (CSV format).

• Twitter API Scraping : tweepy & dotenv.

• Python ML Libraries : numpy, pandas, scikit-learn, matplotlib, seaborn,
xgboost, plotly, wordcloud etc.

• IPython Kernel : Jupyter Notebook & Google Colab.

• Saving Models : pickle module


7.2.3 Deployment

• Front-end web development : HTML, CSS, JavaScript & Bootstrap 5

• API development : FastAPI & Uvicorn

• Cloud Deployment: Heroku

7.3 Hardware Specification


The project utilizes resources openly available on the cloud and hence has no specific
hardware requirements. The project has been developed on a system having an Intel
i7 processor along with 8GB RAM.

7.4 Programming Language


The project has been developed in Python 3.7. The reason for choosing Python
lies in the fact that it is a simple & extremely powerful language which has a huge
repository of Machine Learning tools and libraries.

7.5 Platform
The model was developed on the Google Colab platform, which is a cloud based
IPython Kernel integrated with the standard ML libraries. For deployment, the
Heroku cloud deployment platform has been utilized.

7.6 Coding Style Format


The project has been developed using the PEP 8 coding style. PEP 8 (Python
Enhancement Proposal 8) defines useful naming conventions and other guidelines
for programming in Python and is extremely helpful in writing well organized,
consistent and neat code.



Chapter 8

RESULT & ANALYSIS

For testing the model, we need to define the metrics upon which the models must
be evaluated. In our case the metrics defined are: Precision, Recall, F1-score
(primary) and Accuracy (secondary). Evaluating the model using the traditional
train-test split method will not be of much use due to the imbalanced class
distribution. Hence, we use Stratified 5-Fold Cross Validation for obtaining
our prediction results.
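The reason F1-score is treated as the primary metric on imbalanced data can be seen in a short sketch. The label vectors below are hypothetical: a classifier that never flags a tweet still reaches high accuracy, while its precision, recall and F1-score are all zero.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical imbalanced sample: 2 depressive tweets out of 20.
y_true = [1, 1] + [0] * 18
always_negative = [0] * 20

print(accuracy(y_true, always_negative))             # 0.9
print(precision_recall_f1(y_true, always_negative))  # (0.0, 0.0, 0.0)
```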
The stacking model (MNB + XGB) was tested against standalone MNB and XGB
models to get a good understanding of the properties of our model. The results for
the 5-fold cross validation have been summarized below: -

Figure 8.1: State-of-the-art models vs Stacked Model



Figure 8.2: F1-Score Comparison

Figure 8.3: Accuracy Comparison


The results show that, on both the F1-score and Accuracy metrics, the proposed
stacking model outperforms the standalone MNB and XGB models it was compared
against. Our proposed stacking model gives an accuracy of 96% and an F1-score
of 93%.



Chapter 9

TESTING

9.1 Formal Technical Reviews


Formal Technical Reviews helped us assess how our model was performing at each
stage of development. In Machine Learning, the logic is not explicitly coded
by the programmer but is inferred by the model, so instead of performing
traditional software tests, we need to make sure that the model performs
consistently across different data sets. We therefore chose K-fold cross
validation testing for this purpose. For each fold, we assessed metrics such
as precision, recall, F1-score and accuracy to obtain our final result.
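How a stratified K-fold split keeps the class ratio in every fold can be sketched as follows. This is a simplified, unshuffled version written for illustration; the function name and the toy label vector are hypothetical, and a library implementation would normally be used in practice.

```python
from collections import defaultdict


def stratified_folds(labels, k=5):
    """Assign sample indices to k folds so each fold keeps the class ratio."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's samples out round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical imbalanced labels: 5 depressive tweets out of 25.
labels = [1] * 5 + [0] * 20
folds = stratified_folds(labels, k=5)

for test_fold in folds:
    # Every other fold would form the training set for this round.
    positives = sum(labels[i] for i in test_fold)
    print(len(test_fold), positives)
```

Each fold serves once as the test set, and the per-fold metrics are then averaged to give the reported result.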

9.2 Test Plan


In this phase, we analyzed our methodology and identified the key areas which
could lead to bugs. Using this information, we concluded that the following
types of tests should be carried out for our project.

• Unit Testing

• Integration Testing

• Alpha Testing

• Beta Testing

9.3 Test Cases & Results


• Unit Testing : Here, our main aim was to make sure that each model works
properly without overfitting or underfitting the data. We used cross validation
testing along with hyperparameter tuning to make sure that all units (models)
work consistently.

• Integration Testing : Once the model and the web interface were created, we
had to make sure that integration of these two components didn't cause the
system to break. Hence, we used Selenium WebDriver to test for any potential
bug arising due to integration.

• Alpha Testing : This testing was done after deployment of the project to
Heroku. This test helped us check the functionality of the final product.

• Beta Testing : The final deployed product was tested by different members of
our team on their own systems, and the project was put through different
situations to assess its robustness.

The testing phase helped us identify bugs and rectify them. The testing results
show that almost all identified bugs have been removed and that the project
works smoothly under the conditions tested.



Chapter 10

CONFIGURATION MANAGEMENT PLAN

Project configuration management is the management of the configuration of all
the project's key products and assets. This includes any end products that will
be delivered to the customer, as well as all management products, such as the
project management plan and performance management baseline. Configuration
management and project change management need to be implemented hand-in-hand.
Any change must be monitored and assessed to determine its impact on project
configuration. Thus, configuration management is an extremely important step in
project development.

Our project uses Git and the Heroku CLI for configuration management. Different
versions of the project can be developed and changes can be pushed to Heroku
directly using the Command Line Interface (CLI) provided by Heroku. During the
development phase we used Git for version control, as it is openly available
and easy to use. After deployment, all configuration related issues can be
handled through Heroku itself. These range from version control and configuring
add-ons to scaling of the dyno formation and analyzing usage.

Another very useful feature available on Heroku is the Project Dashboard, which
provides UI support for tasks like viewing app metrics, managing Heroku teams,
configuring deployment integrations etc.
Hence, Heroku comes with built-in features which take care of the configuration
phase.
Chapter 11

SOFTWARE QUALITY ASSURANCE PLAN

Software quality is one of the most significant factors determining the success of the
project. Software Quality Assurance Plan lays down the guidelines for ensuring that
at each step the software developed is up to the mark. The SQAP followed for our
project is as follows: -

• All modules need to be developed using proper naming conventions and other
specifications conforming to the Python PEP8 standard. This step ensures that
the code developed is consistent and neat.

• Code must be well documented, with docstrings and images wherever possible.

• Data collected through scraping via the Twitter API must be manually
evaluated to find bad data points and remove them. This step makes sure that we
have relevant data points in our data set and avoids the "Garbage In, Garbage
Out" phenomenon.

• No module must be put into production without adequate testing. This ensures
that bugs are detected and rectified early.

• During model development, the machine learning based models must be tested
and validated in each cycle to check for over-fitting and under-fitting of data.

• No confidential information should be exposed by the API to the outside
world. This would help in ensuring that privacy of users is not compromised.

• The web interface should be developed keeping in mind good design practices
for web apps. This would ensure that the web app is responsive, consistent
and easy to navigate.

• The project should be regularly updated even after deployment. This would
make sure that existing bugs are fixed and additional functionalities are added
from time to time.



Chapter 12

CONCLUSION

To sum up, it is well established that depression is one of the leading issues
faced by our society. Detecting depression early can play a major role in
preventing suicides and improving the mental health of society as a whole.
Social media has been a revelation for sentiment analysis and can be used to
tackle depression effectively.
Our project uses Twitter data to train a stacked ensemble model consisting of
Multinomial Naïve Bayes and XGBoost base-models along with a Logistic
Regression meta-classifier. The results show that, on both the F1-score and
Accuracy metrics, the proposed stacking model outperforms the standalone models
it was compared against. With an F1-score of 93% and an Accuracy of 96%, our
model is a strong performer among the models we evaluated.
This project could go a long way in integrating emotional AI with social media
for eradicating depression from our society.
Chapter 13

References

[1] A. Husseini Orabi, P. Buddhitha, M. Husseini Orabi and D. Inkpen, "Deep
Learning for Depression Detection of Twitter Users," Proceedings of the Fifth
Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to
Clinic, 2018, doi: 10.18653/v1/w18-0609.

[2] J. A. Naslund, K. A. Aschbrenner, G. J. McHugo, J. Unützer, L. A. Marsch,
and S. J. Bartels, "Exploring opportunities to support mental health care using
social media: A survey of social media users with mental illness," Early
Interv. Psychiatry, 2019.

[3] R. Thorstad and P. Wolff, "Predicting future mental illness from social
media: A big-data approach," Behav. Res. Methods, 2019.

[4] Mandar Deshpande and Vignesh Rao, "Depression detection using Emotional
Artificial Intelligence," 2017.

[5] Cong, Z. Feng, F. Li, Y. Xiang, G. Rao and C. Tao, "X-A-BiLSTM: a Deep
Learning Approach for Depression Detection in Imbalanced Data," 2018 IEEE
International Conference on Bioinformatics and Biomedicine (BIBM), Madrid,
Spain, 2018, pp. 1624-1627, doi: 10.1109/BIBM.2018.8621230.

[6] M. Trotzek, S. Koitka and C. Friedrich, "Utilizing Neural Networks and
Linguistic Metadata for Early Detection of Depression Indications in Text
Sequences," IEEE Transactions on Knowledge and Data Engineering, vol. 32,
pp. 588-601, 2018, doi: 10.1109/TKDE.2018.2885515.

[7] T. Yang, C. Wu, M. Su and C. Chang, "Detection of mood disorder using
modulation spectrum of facial action unit profiles," 2016 International
Conference on Orange Technologies (ICOT), Melbourne, VIC, 2016, pp. 5-8,
doi: 10.1109/ICOT.2016.8278966.

[8] N. A. Asad, M. A. Mahmud Pranto, S. Afreen and M. M. Islam, "Depression
Detection by Analyzing Social Media Posts of User," 2019 IEEE International
Conference on Signal Processing, Information, Communication Systems
(SPICSCON), Dhaka, Bangladesh, 2019, pp. 13-17,
doi: 10.1109/SPICSCON48833.2019.9065101.

[9] K. Katchapakirin, K. Wongpatikaseree, P. Yomaboot and Y. Kaewpitakkun,
"Facebook Social Media for Depression Detection in the Thai Community," 2018
15th International Joint Conference on Computer Science and Software
Engineering (JCSSE), Nakhonpathom, 2018, pp. 1-6,
doi: 10.1109/JCSSE.2018.8457362.

[10] X. Tao, R. Dharmalingam, J. Zhang, X. Zhou, L. Li and R. Gururajan,
"Twitter Analysis for Depression on Social Networks based on Sentiment and
Stress," 2019 6th International Conference on Behavioral, Economic and
Socio-Cultural Computing (BESC), Beijing, China, 2019, pp. 1-4,
doi: 10.1109/BESC48373.2019.8963550.

[11] S. Tariq et al., "A Novel Co-Training-Based Approach for the
Classification of Mental Illnesses Using Social Media Posts," in IEEE Access,
vol. 7, pp. 166165-166172, 2019, doi: 10.1109/ACCESS.2019.2953087.

[12] N. Mahendran, P. M. Durai Raj Vincent, K. Srinivasan, V. Sharma and D. K.
Jayakody, "Realizing a Stacking Generalization Model to Improve the Prediction
Accuracy of Major Depressive Disorder in Adults," in IEEE Access, vol. 8,
pp. 49509-49522, 2020, doi: 10.1109/ACCESS.2020.2977887.

[13] S. Jain, S. P. Narayan, R. K. Dewang, U. Bhartiya, N. Meena and V. Kumar,
"A Machine Learning based Depression Analysis and Suicidal Ideation Detection
System using Questionnaires and Twitter," 2019 IEEE Students Conference on
Engineering and Systems (SCES), Allahabad, India, 2019, pp. 1-6,
doi: 10.1109/SCES46477.2019.8977211.

[14] https://github.com/halolimat/Social-media-Depression-Detector/blob/master/depression_lexicon.json

[15] https://towardsdatascience.com/xgboost-mathematics-explained-58262530904a

[16] https://towardsdatascience.com/a-mathematical-explanation-of-naive-bayes-in-5-minutes-44adebcdb5f8

[17] https://www.kaggle.com/kazanova/sentiment140

[18] https://medium.com/topic/machine-learning



Annexure A

PLAGIARISM REPORT

Plagiarism report has been attached herewith.


Plagiarism Checker X Originality Report

Date: Saturday, June 05, 2021
Plagiarism Quantity: 14% Duplicate
Words: 1451 Plagiarized Words / Total 10110 Words
Sources: More than 165 Sources Identified (Internet Pages)
Remarks: Low Plagiarism Detected - Your Document needs Optional Improvement.
A Preliminary Project Report on Depression Detection using Sentiment Analysis of Social Media Posts <1% https://ourworldindata.org/rise-of-socia
SUBMITTED TOWARDS THE PARTIAL FULFILLMENT OF THE REQUIREMENTS OF Bachelor of
<1% https://www.researchgate.net/publication
Engineering (Computer Engineering) BY Aroop Kumar Roll No: 3423 Ashish Roll No: 3426 K Chaitanya Roll
<1% https://www.researchgate.net/publication
No: 3448 Saurabh Kulkarni Roll No: 7435 Under The Guidance of Prof. Shubhada Bhalerao Department of

Computer Engineering Army Institute of Technology, Pune - 411015. SAVITRIBAI PHULE PUNE <1% https://issuu.com/iasir/docs/hass_issue_

UNIVERSITY 2020-21 ARMY INSTITUTE OF TECHNOLOGY, DEPARTMENT OF COMPUTER <1% https://wwbp.org/papers/detecting_depres


ENGINEERING CERTIFICATE This is to certify that the Project Entitled Depression Detection using
<1% https://mea.gov.in/Portal/XML/Articles_i
Sentiment Analysis of Social Media Posts Submitted by Aroop Kumar Roll No: 3423 Ashish Roll No: 3426 K
<1% https://assets.publishing.service.gov.uk
Chaitanya Roll No: 3448 Saurabh Kulkarni Roll No: 7435 is a bona?de work carried out by students under the

supervision of Prof. <1% https://iitbhu.ac.in/contents/institute/

<1% https://www.mmit.edu.in/index.php/facult
Shubhada Bhalerao and it is submitted towards the partial ful?llment of the requirement of bachelor of
<1% https://sstc.ac.in/ssgi/4-Preliminary%20
engineering (Computer Engineering) Project. Prof. Shubhada Bhalerao Dr. SR Dhore Internal Guide H.O.D
<1% http://eprints.usm.my/23695/1/ADW_622_-_
Dr. B P Patil External Examiner Principal Place : AIT, Pune Date : PROJECT APPROVAL SHEET A Project

Stage-I Report on (Depression Detection using Sentiment Analysis of Social Media Posts) is successfully <1% https://sites.google.com/site/hecpm2013/

completed by Aroop Kumar (Roll No: 3423) Ashish (Roll No: 3426) K Chaitanya (Roll No: 3448) Saurabh <1% https://www.academia.edu/4066547/Researc
Kulkarni (Roll No: 7435) at Department Of Computer Engineering Army Institute of Technology, Pune-411
<1% https://www.easa.europa.eu/sites/default
015. SAVITRIBAI PHULE PUNE UNIVERSITY 2020-21 Prof. Shubhada Bhalerao Dr. S.R
<1% https://www.slideshare.net/SomnathLinKin

Dhore Project Guide HOD �Depression Detection using Sentiment Analysis of Social Media Posts�� I <1% https://www.slideshare.net/gajapandiyan/

Abstract

Depression has become a major problem plaguing the world today. About 265 million individuals of all ages suffer from depression worldwide. Of these, about 75% remain untreated, and about one million individuals take their own lives every year. Depression is thus amongst the leading causes of suicide, especially amongst adolescents. Social media platforms are becoming an inseparable part of people's daily lives. They mirror users' personal lives, as users share their happiness, joy, insecurities and sorrow on social media. Researchers often use these platforms to identify the causes of depression and to address it. Early detection of depression could prove to be an enormous step in improving the mental health of our society as a whole. To address this problem, we propose a stacking-based ensemble machine learning model which uses XGBoost and Multinomial Naive Bayes as the base learners and Logistic Regression as the meta-learner. The model is developed for Twitter data and flags tweets which are found to be depressive. The stacked model produced higher accuracy and F1-score than the standalone models proposed earlier. The project would help employ emotional AI on Twitter, which could in turn contribute to lower suicide rates and improved mental health.

Keywords: depression, mental health, social networking sites, twitter, machine learning, emotional AI.

Acknowledgments

It gives us great pleasure to present the final project report on "Depression Detection using Sentiment Analysis of Social Media Posts".

We would like to take this opportunity to thank our project guide, Prof. Shubhada Bhalerao, for giving us all the guidance we needed. We are very grateful for her kind support. Her valuable suggestions were extremely helpful.

We also extend our sincere gratitude to Dr. S.R Dhore, Head of the Computer Engineering Department, for creating a competitive environment and providing us with all the essential facilities and encouragement at the department and institute level. We would also like to acknowledge all our friends and classmates for their co-operation.

Lastly, we express our gratitude to our parents and other family members, whose continuous encouragement, love and affection enabled us to complete this piece of work successfully.

Aroop Kumar
Ashish
K Chaitanya
Saurabh Kulkarni
(B.E. Computer Engg.)
INDEX

Abstract
Acknowledgment
List of Figures
1 INTRODUCTION
  1.1 Problem Statement
  1.2 Objectives
  1.3 Scope Of The Project
  1.4 Motivation of The project
  1.5 Organization of report
2 LITERATURE SURVEY
  2.1 Literature Survey
  2.2 Possible Challenges
  2.3 Inferences from Literature Survey
3 SOFTWARE REQUIREMENT SPECIFICATION
  3.1 Introduction
  3.2 Overall Description
  3.3 System Features and Requirements
4 ALGORITHM ANALYSIS AND MATHEMATICAL MODELING
  4.1 Naive Bayes
  4.2 XGBoost
  4.3 Proposed Algorithm
  4.4 Mathematical Modelling
5 DETAILED DESIGN
  5.1 Architectural Design
  5.2 UML Diagrams
    5.2.1 Class Diagram
    5.2.2 Activity Diagram
    5.2.3 Use Case Diagram
    5.2.4 Sequence Diagram
    5.2.5 Deployment Diagram
  5.3 Data design
    5.3.1 Internal software data structure
    5.3.2 Global data structure
    5.3.3 Temporary data structure
    5.3.4 Database description
6 PROJECT PLANNING
  6.1 Tasks Involved
  6.2 Technical Risks
  6.3 Budget/Time Constraints
7 CODING
  7.1 Algorithms / Flowcharts
  7.2 Software Used
    7.2.1 Utility Packages / Applications
    7.2.2 Model Development
    7.2.3 Deployment
  7.3 Hardware Specification
  7.4 Programming Language
  7.5 Platform
  7.6 Coding Style Format
8 RESULT & ANALYSIS
9 TESTING
  9.1 Formal Technical Reviews
  9.2 Test Plan
  9.3 Test Cases & Results
10 CONFIGURATION MANAGEMENT PLAN
11 SOFTWARE QUALITY ASSURANCE PLAN
12 CONCLUSION
13 References

List of Figures

4.1 XGBoost Objective Function
4.2 Bayes Theorem
4.3 Independence of Features
4.4 Proportionality Relation
4.5 Optimization Function
5.1 Machine Learning Life Cycle
5.2 Architecture : Block Diagram
5.3 Architecture : Flowchart
5.4 UML : Class Diagram
5.5 UML : Activity Diagram
5.6 UML : Use Case Diagram
5.7 UML : Sequence Diagram
5.8 UML : Deployment Diagram
6.1 Project Timeline
7.1 Proposed Model
7.2 Flowchart
8.1 State-of-the-art models vs Stacked Model
8.2 F1-Score Comparison
8.3 Accuracy Comparison

Chapter 1 INTRODUCTION

1.1 Problem Statement

The aim of the project is to detect whether a person is showing signs of clinical depression. The project will be developed using various machine learning models trained on social media data (e.g. tweets). The model would then predict whether a person is showing symptoms of depression and, if so, necessary actions will be taken.

1.2 Objectives
• To build a machine learning model that analyses Twitter posts and predicts whether a person is showing signs of clinical depression.
• To collect ample data which could be used for future research.
• To improve the F1-score (primary) and accuracy (secondary) of the model through cross validation and hyperparameter tuning.
• To successfully deploy the model.

1.3 Scope Of The Project
• The project deals with detecting depression in Twitter posts using a machine learning model built as a combination of the XGBoost and Naïve Bayes algorithms. The XGBoost model will work as a filter which makes sure that the data imbalance is minimal. The Naive Bayes model will use this filtered data for making predictions. The project will specifically target Twitter posts, for the following reasons:
1. Twitter data is easy to handle.
2. Being text heavy, it is simple and easy to pre-process.
3. It is available in sufficient quantity and quality.
4. It requires less storage than image and video data.
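As an illustration of why tweet text is simple to pre-process, a minimal cleaning step can be sketched with the Python standard library alone. The function name `clean_tweet` and the specific rules (dropping URLs, @mentions, the `#` sign and non-letter characters) are assumptions made for this sketch, not the project's actual pipeline.

```python
import re

# Patterns assumed for this sketch; a real pipeline may need more rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text: str) -> str:
    """Lower-case a tweet and strip URLs, @mentions, '#' signs and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove links
    text = MENTION_RE.sub(" ", text)     # remove user mentions
    text = text.replace("#", " ")        # keep the hashtag word, drop the sign
    text = NON_LETTER_RE.sub(" ", text)  # drop digits, punctuation, emoji
    return " ".join(text.split())        # collapse repeated whitespace


if __name__ == "__main__":
    print(clean_tweet("@friend I feel so #alone today... https://t.co/xyz"))
```

Note that this simple rule also splits contractions (for example "can't" becomes "can t"); whether that matters depends on the tokenizer used downstream.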
• End-user identification is a crucial step in scope definition. For the purpose of our project, the following scenarios have been identified:
1. The project could be deployed alongside existing social media platforms (especially Twitter), where it can fetch user posts and analyse the depression level of the individual. Here, the social media user acts as the end user.
2. The project can also be beneficial to psychologists who wish to study depression and mood disorders, especially in young adults. Hence, psychologists can be taken to be the end users.
3. The project could be made open source, which could help other researchers make the necessary strides in this field and improve the model further. Hence, researchers make up the last set of end users.

1.4 Motivation of The project
• The motivation for doing this project was primarily an interest in undertaking a challenging project in the domain of Machine Learning. The opportunity to learn about various Machine Learning algorithms and their role in preventing bias in imbalanced datasets was appealing. Depression is a major challenge plaguing our society, especially millennials, and we are extremely motivated to find a solution for its early detection.

1.5 Organization of report
The report covers all the project work done this year. It further covers topics such as the literature survey, Software Requirement Specification, algorithms used and mathematical model, project design and planning, coding, testing and SQA. The report provides a comprehensive understanding of the various aspects of the project and serves as documentation of our work.

Chapter 2 LITERATURE SURVEY

2.1 Literature Survey
This is a field where immense research is taking place.

Over the last few years, many researchers have used social media to examine mental health. In "Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic" [1], the authors considered that social media platforms can reflect the users' personal lives on many levels. Their primary objective was to detect depression using the most effective deep neural architecture from two of the most popular deep learning approaches in the field of natural language processing: Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).

According to the report "Exploring opportunities to support mental health care using social media: A survey of social media users with mental illness" [2], millennials are more open to talking about their mental health issues on social media. Machine Learning (ML) has advanced significantly in recent years, allowing real-world problems to be solved and automated systems to be implemented. In "Predicting future mental illness from social media: A big-data approach" [3], the author predicted future mental illness from an individual's posts on Reddit, by gathering posts from clinical subreddits and then classifying them by the corresponding mental illness. After gathering the posts, clustering was applied to find the markers of mental illness present in their everyday spoken language.

In "Depression detection using Emotional Artificial Intelligence" [4], Natural Language Processing was applied to Twitter feeds to conduct emotion analysis focusing on depression. Specific tweets were labelled as neutral or negative using a curated word list to detect depression.
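The word-list labelling described above can be sketched as follows. The keyword set here is a small made-up placeholder, not the curated list used in [4], and the rule (any whole-word keyword match marks the tweet as negative) is an assumption made for illustration.

```python
import re

# Hypothetical placeholder keywords; the curated list from [4] is not reproduced here.
DEPRESSIVE_KEYWORDS = {"hopeless", "worthless", "depressed", "lonely"}


def label_tweet(text: str, keywords=DEPRESSIVE_KEYWORDS) -> str:
    """Return 'negative' if any keyword occurs as a whole word, else 'neutral'."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return "negative" if tokens & set(keywords) else "neutral"


if __name__ == "__main__":
    for tweet in ["I feel hopeless and lonely tonight", "Lovely weather in Pune today"]:
        print(label_tweet(tweet), "<-", tweet)
```

Matching on whole tokens rather than substrings avoids false hits such as a keyword embedded inside a longer, unrelated word.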

For predictive modelling, a support vector machine (SVM) and a Naive Bayes classifier were used. The results showed that Naive Bayes gave better accuracy and F1-score than the SVM.

In "X-A-BiLSTM: a Deep Learning Approach for Depression Detection in Imbalanced Data" [5], the authors proposed a deep learning model (X-A-BiLSTM) for depression detection in imbalanced social media data. This approach focused on solving the problem caused by data imbalance in the real world. The X-A-BiLSTM model comprises two components: an XGBoost component, which yields balanced data by means of an end-to-end scalable tree boosting system, and a BiLSTM component with an attention mechanism, which achieved good classification performance.

In "Utilizing Neural Networks and Linguistic Metadata for Early Detection of Depression Indications in Text Sequences" [6], the authors used machine learning models on messages from a social network to identify depression early. In particular, a classifier based on user-level linguistic metadata is compared with a convolutional neural network based on different word embeddings. In addition, the commonly used ERDE score for early detection systems is discussed in depth, along with its drawbacks in the context of shared tasks. Finally, a large corpus was used to train a new word embedding.

In "Detection of Mood Disorder Using Modulation Spectrum of Facial Action Unit Profiles" [7], the authors constructed a database of facial expressions of patients with bipolar disorder (BD), patients with unipolar depression (UD) and healthy controls responding to emotional stimuli. To detect mood disorder, the subjects' facial expressions in the CHIMEI database were used to generate action unit (AU) profiles. The modulation spectrum (MS) characterizing the fluctuation of the AU profile sequence over a video segment was then used for mood disorder detection. In the comparison of mood disorder detection results, the proposed ANN-based method achieved the best performance.

In "Depression Detection by Analyzing Social Media Posts of User" [8], it is demonstrated that depression can lead an individual to severe mental illness, even to suicide, and that a machine learning approach can detect depression in social media users. Micro-blogging and social networking sites such as Twitter and Facebook let users express their day-to-day thoughts and activities, which reflect their behavioral attributes and personality traits. The paper proposed a model that takes a username and analyzes the user's social media posts to determine the level of vulnerability to depression. The authors reported an accuracy of 74% and a precision of 100% for this model.

In "Facebook Social Media for Depression Detection in the Thai Community" [9], the author provides a tool by which depression could be detected easily and early. This would help people to be aware of their emotional states and seek help from professional services. The study uses Natural Language Processing (NLP) techniques to create a depression detection algorithm for the Thai language on Facebook, a social media platform where people share their thoughts, emotions and life events. Results from 35 Facebook users indicated that Facebook behaviors could predict depression level.

In "Twitter Analysis for Depression on Social Networks based on Sentiment and Stress" [10], the authors state that detecting words that express negativity in a social media message is one step towards detecting depressive moods. They applied a multistep approach which allowed them to identify potential users and then discover the words with which these users expressed negativity. Results showed that the sentiment of these words can be obtained and scored efficiently, as the computation on these datasets was narrowed to only the selected users. They also obtained stress scores which correlated well with the negative sentiment expressed in the content.

In "A novel Co-training based approach for the classification of mental illnesses using Social media posts" [11], the authors performed several experiments to classify posts and their associated comments related to four mental health conditions: Anxiety, ADHD, Depression and Bipolar disorder. They mined data from the Reddit platform, where community-related posts are published. The authors used an API to extract posts and associated comments and performed experiments using SVM, NB and RF classifiers. The experimental results indicate that SVM, NB and RF performed better with the co-training technique than when used individually, in terms of Precision, Recall and F-measure.

In "Realizing a Stacking Generalization Model to Improve the Prediction Accuracy of Major Depressive Disorder in Adults" [12], the authors developed a stacking generalization model for improving the accuracy of predicting MDD. In the first step, they implemented a KNN imputation preprocessing technique for handling missing values in the data. In the next step, they used Random Forest-Based Backward Elimination (RF-BE), a wrapper-based feature selection method, to reduce the feature dimension, which reduces feature interactions and helps increase prediction accuracy. RF-BE reduced the initial 22 features to 12, which were used for further processing. The stacking generalization is accomplished by combining three low-level learners (MLP, SVM and RF) and then averaging them to create a meta-level learner (MLP). The classifiers are also implemented individually to compare the results. The accuracies of the individual MLP, SVM and RF classifiers are 96.38%, 95.06% and 96.90%, respectively. The accuracy of the stacking generalization model is 98.16%.

In "A Machine Learning based Depression Analysis and Suicidal Ideation Detection System using Questionnaires and Twitter" [13], the authors analyzed social media posts (especially Twitter), conducted a questionnaire asking students and parents for their opinions, and also scraped blogs on the internet.

According to the research, the major factors of depression in the 15-29 age group identified during the course of the project are parental pressure, love, failures, bullying, body shaming, inferiority complex, exam pressure, peer pressure, and physical and sexual abuse. Depression being a recurrent illness, repeated episodes are common. Finally, little is known about the prevention and identification of the disorder at an early stage. As a future direction, the authors sought to understand how social media behavior analysis can lead to the development of methods for analyzing depression at scale.

2.2 Possible Challenges
Some possible risks / challenges associated with the project:
1. Machine Learning algorithms cannot guarantee human-level accuracy in the prediction of depression.
2. There is significant noise in the tweets collected before pre-processing, which leads to a lot of unnecessary data due to third-person and news references.
3. Social media posts are highly imbalanced, due to which machine learning models often develop a bias, which in turn leads to erroneous results. On the other hand, deep learning models require a huge amount of data to train, which is generally not available in the datasets used in earlier models.

2.3 Inferences from Literature Survey
1. The literature survey clearly shows that for limited datasets, Machine Learning algorithms will be most effective.
2. Naïve Bayes works extremely well with textual data and gave better results than SVM. Hence, we would use this algorithm for our purpose.
3. We would use the Sentiment140 dataset for extracting depressive tweets and preparing our final dataset.
4. The Sentiment140 dataset contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 2 = neutral, 4 = positive) and can be used to detect sentiment.
5. The dataset would be highly imbalanced and thus, we'll use the XGBoost algorithm as a filter for avoiding bias.
6. Finally, model evaluation, cross validation and hyperparameter tuning will be carried out to test the effectiveness of the model.

Chapter 3 SOFTWARE REQUIREMENT SPECIFICATION

3.1 Introduction
• Purpose
The aim of the project is to detect whether a person is showing signs of clinical depression. The project will be developed using various machine learning models trained on Twitter data. The model would then predict whether a person is showing symptoms of depression and, if so, necessary actions will be taken. The model will be trained on a comprehensive dataset containing a mix of depressive and non-depressive tweets. The dataset would be prepared from a mix of various open-source repositories and data scraped through the Twitter API. A hybrid model comprising XGBoost and Naive Bayes has been identified for predictive modelling.

• Intended Audience
For the purpose of our project, the following end users have been identified:
1. The project could be deployed alongside existing social media platforms (especially Twitter), where it can fetch user posts and analyse the depression level of the individual. Here, the social media user acts as the end user.
2. The project can also be beneficial to psychologists who wish to study depression and mood disorders, especially in young adults. Hence, psychologists can be taken to be the end users.
3. The project could be made open source, which could help other researchers make the necessary strides in this field and improve the model further. Hence, researchers make up the last set of end users.

• Scope
The project deals with detecting depression in Twitter posts using a machine learning model built as a combination of the XGBoost and Naïve Bayes algorithms. The literature survey conducted in the initial phases indicated that Naïve Bayes works well for textual data, where it outperformed SVM in [4].
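To make the choice of Naïve Bayes for text concrete, the following is a minimal sketch of a multinomial Naïve Bayes classifier over word counts with Laplace (add-one) smoothing, written with the standard library only. The class name `TinyMultinomialNB`, the whitespace tokenizer and the toy training tweets are assumptions for illustration; the project itself would presumably rely on a library implementation.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def _log_score(self, text, label):
        # log P(class) + sum over words of log P(word | class)
        total_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        denom = total_words + len(self.vocab)    # Laplace smoothing
        for word in text.lower().split():
            if word in self.vocab:               # ignore words never seen in training
                score += math.log((self.word_counts[label][word] + 1) / denom)
        return score

    def predict(self, text):
        return max(self.class_counts, key=lambda c: self._log_score(text, c))


if __name__ == "__main__":
    # Toy data made up for this sketch (1 = depressive, 0 = non-depressive).
    train_texts = [
        "i feel hopeless and tired of everything",
        "nothing matters and i feel so alone",
        "had a great day with friends",
        "excited about the match tonight",
    ]
    train_labels = [1, 1, 0, 0]
    model = TinyMultinomialNB().fit(train_texts, train_labels)
    print(model.predict("i feel alone"))
    print(model.predict("great match with friends"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, which is why the score is a sum of logarithms rather than a product.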

The XGBoost model will work as a filter which makes sure that the data imbalance is minimal. The project would be limited to Twitter data, for the following reasons:
1. Twitter data is easy to handle.
2. Being text heavy, it is simple and easy to pre-process.
3. It is available in sufficient quantity and quality.
4. It requires less storage than image and video data.
The project would be deployed as an API developed using the FastAPI framework and served with Uvicorn. The API will be hosted on Heroku, which provided free cloud-based servers at the time of writing. We would use the Twitter API to interact with Twitter databases for data retrieval.

• Definitions and Acronyms
1. API: Application Programming Interface. An API is a set of programming code that enables data transmission between one software product and another. It also contains the terms of this data exchange.
2. XGBoost: Extreme Gradient Boosting. XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework.
3. Naïve Bayes: Naive Bayes classifiers are a collection of classification algorithms based on Bayes' Theorem. It is not a single algorithm but a family of algorithms that share a common principle, i.e., every pair of features being classified is independent of each other, given the class label.
4. FastAPI: FastAPI is a modern, fast (high-performance) web framework for building APIs with Python 3.6+ based on standard Python type hints. It is one of the fastest Python frameworks available.
5. Uvicorn: An ASGI server based on uvloop and httptools, with an emphasis on speed.
6. Heroku: Heroku is a platform as a service (PaaS) that enables developers to build, run and operate applications entirely in the cloud.

3.2 Overall Description
• Product Perspective
The main motive of the project is not only to detect depression in tweets but also to ensure that the user experience is not hampered. It is therefore necessary that our system works in the back-end as an independent module. Hence, the project would be built as an API which hosts our hybrid ML model on the cloud; it will be active only when the user tweets and will remain in a passive state at all other times.

The user will interact only with the Twitter interface, whereas the model will constantly monitor tweets via the Twitter API, which provides various features for developers who wish to work with Twitter. The project will also use a depression score metric for classifying a user as depressive. If the depression score exceeds a particular threshold value, the user will be classified as depressive and assistance in the form of medical help notifications, positive feeds etc. will be provided.
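The report does not define how the depression score is computed, so the following is only a minimal sketch under a stated assumption: the score is taken to be the mean of the per-tweet depressive probabilities returned by the model for a user's recent tweets. The function names and the default threshold of 0.6 are hypothetical.

```python
from typing import Sequence


def depression_score(tweet_probabilities: Sequence[float]) -> float:
    """Assumed score: mean per-tweet probability (0.0 if no tweets were scored)."""
    if not tweet_probabilities:
        return 0.0
    return sum(tweet_probabilities) / len(tweet_probabilities)


def needs_assistance(tweet_probabilities: Sequence[float],
                     threshold: float = 0.6) -> bool:
    """Flag the user only when the score strictly exceeds the threshold."""
    return depression_score(tweet_probabilities) > threshold
```

Averaging over several tweets, rather than reacting to a single one, is one way to reduce the effect of an isolated false positive; other aggregations (for example a recency-weighted mean) would fit the same interface.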

• Constraints, Assumptions and Dependencies

For simplicity and to ensure computability, we make the following assumptions:

1. The model is highly accurate and provides human-level depression predictions.
2. The servers always remain active and never crash.
3. The user expresses his/her true emotions in tweets and does not post depressive tweets unless depressed.

In addition to the above assumptions, the system has some constraints, some of which are identified below:

1. Some latency will always be present regardless of how fast the servers are.
2. It is impossible to achieve 100% accuracy, and some false positives will always occur.
3. During the development phase, the Heroku servers need some time to start before they are fully functional. Hence, the server may not be available at all times.

The dependencies for the successful development and deployment of the project are listed below:

1. Software Dependencies: Windows OS, Anaconda, Jupyter, Python 3.7, Standard ML Libraries, Pipenv, FastAPI, Uvicorn, Heroku CLI, an IDE (VSCode, Sublime Text etc.).
2. Hardware Dependencies: Intel i5/i7 processor, 4/8 GB RAM, Heroku Cloud Server.
3. Other Dependencies: Sentiment140 Dataset, Twitter Developer Account, curated word list of depressive keywords (available on GitHub [14]), Google Colab GPU.

3.3 System Features and Requirements

• External Interface Requirements

1. User Interface:
   • Front-end: Twitter Interface.
   • Back-end: Python-based ML model (with FastAPI and Uvicorn) hosted on Heroku.
2. Hardware Interface:
   • Not applicable.
3. Software Interface:
   • User Level Interface: Any OS (Windows preferable), Web Browser.
4. Communication Interface:
   • Communication between components will be carried out over HTTP, with data transferred in JSON format.

• System Features

1. The model must be able to correctly identify cases of clinical depression from tweets with high accuracy and a decent F1-Score.
2. Latency should be minimal so as to increase the efficiency of our system.
3. The model must be able to deal with highly imbalanced data and should not be biased.
4. The model should facilitate easy integration with the Twitter API so that it can be deployed in the real world.
5. The system must ensure that user data is protected and confidentiality is maintained.

• Non-Functional Requirements

1. Performance Requirement: The response time of the model must be as short as possible. To achieve this we will build our API with FastAPI, which is one of the fastest Python frameworks available.
2. Usability Requirement: The system must be easy to use and easy to integrate with existing software. This will be achieved by hosting our API on the cloud, from where the model can be plugged directly into any application.
3. Reliability Requirement: The system should be reliable and must produce accurate results.

Also, the model must be available at all times and should not break down in the event of failure (e.g., server failure). To achieve this, we will use Heroku deployment with shared servers, which would prevent total failure.

Chapter 4 ALGORITHM ANALYSIS AND MATHEMATICAL MODELING

4.1 Naive Bayes

Naive Bayes methods are supervised learning algorithms for classification, based on Bayes' theorem with the "naive" assumption of conditional independence between any pair of features (predictors) given the value of the class variable.

A Naive Bayes classifier, in simple terms, assumes that the presence of one feature in a class is unrelated to the presence of any other feature. Despite their oversimplified assumptions, Naive Bayes classifiers have performed admirably in a number of real-world applications, most notably document classification and spam filtering. They need only a small amount of training data to estimate the necessary parameters. Compared with more advanced methods, Naive Bayes learners and classifiers can be extremely fast.
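To make the idea concrete, here is a minimal, standard-library sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is illustrative only and is not the project's implementation; the function names, the smoothing choice and the toy tokens are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """docs: list of token lists; labels: one class label per document."""
    class_counts = Counter(labels)        # documents per class (for priors)
    word_counts = defaultdict(Counter)    # word frequencies per class
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log-posterior (up to a constant)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)            # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue                                 # ignore unseen words
            # Add-one smoothing keeps the likelihood strictly positive.
            score += math.log((word_counts[label][tok] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Training is a single counting pass over the data, which is why these classifiers are fast and need comparatively little data. Log-probabilities are summed instead of multiplying raw probabilities to avoid numerical underflow on long documents.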

Since the class-conditional feature distributions are decoupled, each distribution can be estimated independently as a one-dimensional distribution. This, in turn, helps alleviate problems stemming from the curse of dimensionality.

4.2 XGBoost

XGBoost is an optimized machine learning library for gradient boosting. It was originally written in C++, but it has APIs in many other languages.

The core XGBoost algorithm is parallelizable: the work of building a single tree can be carried out in parallel. A decision tree is made up of a set of binary questions, and the final predictions are made at the leaves. XGBoost typically uses a tree as its base learner and is itself an ensemble method. Trees are added in stages until a stopping criterion is reached. XGBoost uses CART (Classification and Regression Trees) decision trees.
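The stage-wise construction can be sketched with plain gradient boosting on squared error, using one-split trees (stumps) on a single numeric feature. This is a simplified stand-in rather than XGBoost itself: it has no regularization term and no second-order gradient information, and all names and default values are illustrative.

```python
def fit_stump(xs, residuals):
    """Find the threshold split that best fits the residuals (least squares)."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lv) ** 2 for r in left)
               + sum((r - rv) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, (t, lv, rv))
    if best is None:                       # all xs identical: constant leaf
        mean = sum(residuals) / len(residuals)
        return (xs[0], mean, mean)
    return best[1]


def stump_value(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds=20, learning_rate=0.5, tol=1e-6):
    """Add stumps in stages; stop after n_rounds or when residuals vanish."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        if max(abs(r) for r in residuals) < tol:   # stopping criterion
            break
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_value(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    """Real-valued score: base value plus the shrunken sum of leaf scores."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_value(s, x) for s in stumps)
```

Each new stump is fitted to the errors left by the stumps before it, which is the sense in which the ensemble is built in stages. For classification, the real-valued output can be thresholded (here, for example, at 0.5).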

CART refers to trees in which each leaf contains a real-valued score, regardless of whether they are used for classification or regression. If required, real-valued scores can be converted to categories for classification. XGBoost makes use of advanced regularisation to improve model generalisation. XGBoost is typically more efficient than a plain Gradient Boosting implementation. It has a short learning curve and can be parallelized across clusters.

4.3 Proposed Algorithm

The algorithms described above are state-of-the-art algorithms which work very well with balanced datasets.

However, we are dealing with highly imbalanced datasets, so our algorithm must be able to remove this imbalance and avoid any bias. Our model will be developed as a combination of XGBoost and Naive Bayes. This hybrid model would be able to extract the most relevant information using the power of both algorithms. The XGBoost layer will act as a filter which removes imbalance in the data, and the Naive Bayes layer will then make the final predictions on the refined data.
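Section 4.4 formalizes the combination of the two models as a weighted average of their prediction probabilities, Equation (4.1), followed by a threshold. A minimal sketch of that rule is given below; the report does not state values for k1, k2 or the threshold, so the defaults used here (equal weights, threshold 0.5) are placeholders, and the function names are hypothetical.

```python
def final_probability(p_xg: float, p_nb: float,
                      k1: float = 1.0, k2: float = 1.0) -> float:
    """Weighted average of the XGBoost and Naive Bayes probabilities (Eq. 4.1)."""
    if k1 + k2 == 0:
        raise ValueError("k1 + k2 must be non-zero")
    return (k1 * p_xg + k2 * p_nb) / (k1 + k2)


def is_depressive(p_xg: float, p_nb: float,
                  k1: float = 1.0, k2: float = 1.0,
                  threshold: float = 0.5) -> bool:
    """Classify as depressive only if the combined probability exceeds the threshold."""
    return final_probability(p_xg, p_nb, k1, k2) > threshold
```

For example, with k1 = 3 and k2 = 1 the XGBoost probability counts three times as much as the Naive Bayes probability, so `final_probability(0.8, 0.2, k1=3, k2=1)` evaluates to approximately 0.65.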

4.4 Mathematical Modelling

Our algorithm will be implemented as a combination of two algorithms, XGBoost and Naive Bayes. Thus, the mathematics of the proposed system is based on these algorithms. Let our dataset comprise the following sets of features:

1. The set of independent features, X = {F1, F2, F3, ...}.
2. The dependent feature, y.

The data is first passed through the XGBoost layer.

The objective function for XGBoost is shown below:

Figure 4.1: XGBoost Objective Function

The objective function above comprises the loss function as well as the regularization function. Our aim is to minimize this function. This is done internally using a Taylor approximation. Finally, we obtain our prediction. Let the probability of prediction for XGBoost be P(xg).
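The figure itself is not reproduced in this text. For reference, the regularized objective as usually stated for XGBoost, together with the second-order Taylor approximation used when adding the t-th tree, is written out below. The symbols (l for the loss, f_k for the k-th tree, T for its number of leaves, w for its leaf scores, and γ, λ for the regularization strengths) follow the common presentation and are assumptions about what the original figure showed.

```latex
\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}

\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right)
  + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^{2} \right] + \Omega(f_t),
\qquad
g_i = \frac{\partial\, l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^{2} l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^{2}}
```

The first sum is the loss term and the second is the regularization term; g_i and h_i are the first and second derivatives of the loss with respect to the previous stage's prediction.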

This will be used further to calculate the final result.

At the Naïve Bayes layer, Bayes' Theorem is used for prediction. The standard Bayes' Theorem is represented by the formula below:

Figure 4.2: Bayes Theorem

Here, P(y|X) is the posterior probability of the class (y, target) given the predictor (X, features). P(y) is the prior probability of the class. P(X|y) is the likelihood, which is the probability of the predictor given the class. P(X) is the prior probability of the predictor.
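Since the figures of this section are not reproduced in this text, the standard textbook forms are written out here for reference, using the symbols defined above (features X = {F1, ..., Fn}, class y). They are given as the usual statement of Bayes' theorem, the naive independence factorization, the resulting proportionality and the decision rule, not as a transcription of the original figures.

```latex
P(y \mid X) = \frac{P(X \mid y)\, P(y)}{P(X)}

P(X \mid y) = \prod_{i=1}^{n} P(F_i \mid y)

P(y \mid X) \propto P(y) \prod_{i=1}^{n} P(F_i \mid y)

\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(F_i \mid y)
```

The denominator P(X) can be dropped in the third line because it does not depend on y, so it does not change which class attains the maximum.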

In Naïve Bayes we make the naïve assumption that all the features are independent given the class; hence we have:

Figure 4.3: Independence of Features

P(X) is constant for a given input, and thus our earlier formula reduces to:

Figure 4.4: Proportionality Relation

The goal of Naive Bayes is to choose the class y with the maximum probability. Thus, our final optimization function will be:

Figure 4.5: Optimization Function

Let the Naïve Bayes probability of prediction be P(nb). Finally, we take the weighted average of the prediction probabilities of both algorithms and set a threshold value. The tweet is classified as depressive only if the final probability surpasses the threshold.

FinalProbability = (k1 * P(xg) + k2 * P(nb)) / (k1 + k2)    (4.1)

Chapter 5 DETAILED DESIGN

5.1 Architectural Design

The requirement analysis done earlier makes way for identifying and analyzing the various processes involved in the development of the project. Any machine learning / deep learning project follows the data science life cycle, which is shown below.

Figure 5.1: Machine Learning Life Cycle

The life cycle helps us identify the primary processes which need to be followed for successful implementation of the project.

For the purpose of our project, four stages of technical work have been identified, which are shown in the block diagram and flowchart below:

Figure 5.2: Architecture : Block Diagram

Figure 5.3: Architecture : Flowchart

5.2 UML Diagrams

5.2.1 Class Diagram

A class diagram shows the relationships and dependencies between the various classes in the system.

For our purpose, we'll use pre-defined classes of the Twitter API. The class diagram is shown below.

Figure 5.4: UML : Class Diagram

5.2.2 Activity Diagram

An activity diagram shows the sequential representation of the various activities involved in the project. It portrays the control flow from a start point to a finish point, showing the various decision paths that exist while the activity is being executed. The activity diagram for our system is shown below.

Figure 5.5: UML : Activity Diagram

5.2.3 Use Case Diagram

A use case diagram at its simplest is a representation of a user's interaction with the system that shows the relationship between the user and the different use cases in which the user is involved. The use case diagram for our system is shown below.

Figure 5.6: UML : Use Case Diagram

5.2.4 Sequence Diagram

A sequence diagram shows object interactions arranged in time sequence. It depicts the objects involved in the scenario and the sequence of messages exchanged between the objects needed to carry out the functionality of the scenario. Sequence diagrams show elements as they interact over time, and they are organized according to object (horizontally) and time (vertically). Sequence diagrams are sometimes known as event diagrams or event scenarios.

Figure 5.7: UML : Sequence Diagram

5.2.5 Deployment Diagram

A UML deployment diagram shows the configuration of run-time processing nodes and the components that live on them. Deployment diagrams are a kind of structure diagram used in modeling the physical aspects of an object-oriented system. They are often used to model the static deployment view of a system (the topology of the hardware).

Deployment diagrams are important for visualizing, specifying, and documenting embedded, client/server, and distributed systems, and also for managing executable systems through forward and reverse engineering. A deployment diagram is just a special kind of class diagram, which focuses on a system's nodes. Graphically, a deployment diagram is a collection of vertices and arcs.

Figure 5.8: UML : Deployment Diagram

5.3 Data design

5.3.1 Internal software data structure

The Twitter API stores information internally in the form of various classes and transmits this data in the form of JSON objects.

5.3.2 Global data structure

Our API will extend the api/predict interface as the global structure accessible through the Twitter API. This will send prediction details in JSON format.

5.3.3 Temporary data structure

Some temporary files may be created for storing user data; these will be deleted once our prediction is done.

5.3.4 Database description

No external database will be used as such. However, the Twitter API will enable us to access the Twitter database server for tweets and other information.
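The report does not specify the fields of the JSON sent by the api/predict interface. As a hedged illustration only, the sketch below builds one plausible response body with the standard library; the field names (`tweet_id`, `probability`, `depressive`), the rounding and the threshold default are all assumptions.

```python
import json


def build_prediction_response(tweet_id: str, probability: float,
                              threshold: float = 0.5) -> str:
    """Serialize a single prediction as a JSON string (hypothetical schema)."""
    payload = {
        "tweet_id": tweet_id,
        "probability": round(probability, 4),
        "depressive": probability > threshold,
    }
    return json.dumps(payload)
```

In a FastAPI application, a route handler would typically return the dictionary itself and let the framework serialize it; the explicit `json.dumps` call here only keeps the sketch free of third-party dependencies.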

Chapter 6 PROJECT PLANNING

6.1 Tasks Involved

• Data Collection : Sentiment140 dataset, scraping data using the Twitter API / Tweepy.
• Data Pre-processing : NLTK library, Python ML libraries.
• Model Selection : Naïve Bayes, XGBoost etc.
• Model Evaluation : Precision, Recall, F1-score (primary metric), Accuracy (secondary metric).
• Model Improvement : Cross Validation, Hyperparameter tuning, data cleaning, Co-training etc.

6.2 Technical Risks

• Machine Learning algorithms cannot guarantee human-level accuracy in the prediction of depression.


This risk can be mitigated by developing hybrid models.
• There is significant noise in the tweets collected before pre-processing, which leads to a lot of unnecessary data due to third-person and news references. This risk is mitigated using natural language pre-processing libraries, which help in extracting the most useful words in textual data.

• Social media posts are highly imbalanced, due to which machine learning models often develop a bias, which in turn leads to erroneous results. This risk is mitigated using Gradient Boosting.

• We need a huge amount of data to train machine learning models, which is generally not available in existing repositories. Thus, we can scrape extra data using the Twitter API.

6.3 Budget/Time Constraints

• There is no significant budget constraint, as all the software and hardware being used for this project is open source or freely available. However, in the future more robust and faster hardware may be required to deploy the model in the real world.

• Time is limited and could lead to failure of the project. This risk can be mitigated by managing time properly using a well-defined timeline. The workflow timeline is shown below:

Figure 6.1: Project Timeline

Chapter 7 CODING

7.1 Algorithms / Flowcharts

The proposed model will be developed as a stacked generalization (stacking) of XGBoost and Multinomial Naive Bayes models.

This hybrid model would be able to extract relevant information using the power of both algorithms. Here, the XGBoost and Multinomial Naïve Bayes models act as the base learners for our stack, while a Logistic Regression classifier acts as the meta-model. The architecture of the stacking model is shown below:

Figure 7.1: Proposed Model

To develop the model, we follow the standard NLP procedure, which comprises Data Collection, Data Cleaning, Tokenization, Model Selection with Hyperparameter Tuning, Model Stacking and Evaluation.
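The stacking step can be illustrated with a small, dependency-free sketch: the probabilities produced by the two base learners become the input features of a logistic regression meta-model, fitted here by batch gradient descent. This is not the project's code; the function names, learning rate and epoch count are illustrative, and in practice the base-learner probabilities fed to the meta-model are usually out-of-fold predictions so that the meta-model does not learn from leaked training labels.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_model(base_probs: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   learning_rate: float = 0.5,
                   epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit logistic regression on base-learner probabilities (log-loss, batch GD)."""
    n_samples = len(labels)
    n_features = len(base_probs[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for feats, y in zip(base_probs, labels):
            p = sigmoid(bias + sum(w * f for w, f in zip(weights, feats)))
            err = p - y                       # derivative of log-loss w.r.t. z
            for j, f in enumerate(feats):
                grad_w[j] += err * f
            grad_b += err
        weights = [w - learning_rate * g / n_samples
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def meta_predict(weights: Sequence[float], bias: float,
                 probs: Sequence[float]) -> float:
    """Final probability from the base learners' probabilities for one tweet."""
    return sigmoid(bias + sum(w * p for w, p in zip(weights, probs)))
```

Each row of `base_probs` would hold, for one tweet, the XGBoost probability followed by the Multinomial Naive Bayes probability. Unlike a fixed weighted average, the meta-model learns from data how much to trust each base learner.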

The flow of events is depicted below:

Figure 7.2: Flowchart

7.2 Software Used

7.2.1 Utility Packages / Applications

• OS : Windows 10
• IDE : VSCode
• Package Manager (Python) : pip

7.2.2 Model Development

• Dataset Storage : MS-Excel (csv format).
• Twitter API Scraping : tweepy & dotenv.
• Python ML Libraries : numpy, pandas, scikit-learn, matplotlib, seaborn, xgboost, plotly, wordcloud etc.
• IPython Kernel : Jupyter Notebook & Google Colab.

• Saving Models : pickle module

7.2.3 Deployment

• Front-end web development : HTML, CSS, Javascript & Bootstrap 5
• API development : FastAPI & Uvicorn
• Cloud Deployment : Heroku

7.3 Hardware Specification

The project utilizes resources openly available on the cloud and hence has no specific hardware requirements. The project has been developed on a system with an Intel i7 processor and 8 GB RAM.

7.4 Programming Language

The project has been developed in Python 3.7.

Python was chosen because it is a simple yet extremely powerful language with a huge repository of Machine Learning tools and libraries.

7.5 Platform

The model was developed on the Google Colab platform, which is a cloud-based IPython kernel integrated with the standard ML libraries. For deployment, the Heroku cloud deployment platform has been utilized.

7.6 Coding Style Format

The project has been developed using the PEP 8 coding style. PEP 8 (Python Enhancement Proposal 8) defines useful naming conventions and other guidelines for programming in Python and is extremely helpful in writing well-organized, consistent and neat code.

Chapter 8 RESULT & ANALYSIS

For testing the model, we need to define the metrics on which the models are evaluated. In our case the metrics are Precision, Recall, F1-score (primary) and Accuracy (secondary). Evaluating the model using the traditional train-test split method would not be of much use because of the imbalanced class distribution. Hence, we use Stratified 5-Fold Cross Validation to obtain our prediction results.
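The four metrics can be computed directly from the counts of true positives, false positives and false negatives. The sketch below uses only the standard library and is meant to show why F1-score is the primary metric on imbalanced data; the function name and the returned dictionary layout are illustrative choices.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                           positive: int = 1) -> Dict[str, float]:
    """Precision, recall, F1 for the positive class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": correct / len(pairs)}
```

A classifier that labels every tweet as non-depressive on a set with 2 depressive tweets out of 10 still reaches an accuracy of 0.8, but its F1-score is 0.0, which is why accuracy alone is treated as a secondary metric.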

The stacking model (MNB + XGB) was tested against standalone MNB and XGB models to get a good understanding of the properties of our model. The results of the 5-fold cross validation are summarized below:

Figure 8.1: State-of-the-art models vs Stacked Model

Figure 8.2: F1-Score Comparison

Figure 8.3: Accuracy Comparison

The results show that, on both the F1-score and Accuracy metrics, the proposed stacking model outperforms the standalone models it was compared against.

Our proposed stacking model gives an accuracy of 96% and an F1-Score of 93%.

Chapter 9 TESTING

9.1 Formal Technical Reviews

Formal Technical Reviews helped us assess how our model was performing at each stage of development. In Machine Learning, the logic is not explicitly coded by the programmer but is inferred by the model. So, instead of performing traditional software tests, we need to make sure that the model performs consistently across different data sets. We therefore decided to use K-fold cross validation testing for this purpose.
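A stratified variant of K-fold keeps the class ratio roughly the same in every fold, which matters when one class is rare. The following is a minimal standard-library sketch of such a split; it assigns the indices of each class to folds in round-robin order and, unlike library implementations, does not shuffle. The function name is illustrative.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence


def stratified_folds(labels: Sequence[Hashable], k: int = 5) -> List[List[int]]:
    """Split sample indices into k folds, preserving class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's samples out to the folds one at a time.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds
```

Each fold serves once as the validation set while the remaining k - 1 folds are used for training, and the metrics are then averaged over the k runs.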

For each fold, we assessed various metrics such as precision, recall, F1-score and accuracy to obtain our final result.

9.2 Test Plan

In this phase, we analyzed our methodology and identified the key areas which could lead to bugs. Using this information, we concluded that the following types of tests were to be carried out for our project:

• Unit Testing
• Integration Testing
• Alpha Testing
• Beta Testing

9.3 Test Cases & Results

• Unit Testing : Here, our main aim was to make sure that each model works properly without overfitting or underfitting the data. We used cross validation testing along with hyperparameter tuning to make sure that all units (models) work consistently.
• Integration Testing : Once the model and the web interface were created, we had to make sure that integrating these two components didn't cause the system to break. Hence, we used Selenium WebDriver to test for any potential bugs arising from integration.
• Alpha Testing : This testing was done after deployment of the project to Heroku. This test helped us check the functionality of the final product.
• Beta Testing : The final deployed product was tested by different members of our team on their systems, and the project was put under different situations to assess its endurance.

The testing phase helped us identify bugs and rectify them. The testing results show that almost all bugs have been removed and that the project works smoothly under all tested conditions.

Chapter 10 CONFIGURATION MANAGEMENT PLAN

Project configuration management is the management of the configuration of all the project's key products and assets. This includes any end products that will be delivered to the customer, as well as all management products, such as the project management plan and performance management baseline. Configuration management and project change management need to be implemented hand-in-hand.

Any change must be monitored and assessed to determine its impact on the project configuration. Thus, configuration management is an extremely important step in project development. Our project uses Git and the Heroku CLI for configuration management. Different versions of the project can be developed and changes can be pushed to Heroku directly using the Command Line Interface (CLI) provided by Heroku. During the development phase we used Git for version control, as it is openly available and easy to use.

After deployment, configuration-related tasks can be handled through Heroku itself. These range from version control and configuring add-ons to scaling the dyno formation and analyzing usage. Another very useful feature available on Heroku is the Project Dashboard, which provides UI support for tasks such as viewing app metrics, managing Heroku teams and configuring deployment integrations. Hence, Heroku comes with built-in features which take care of the configuration phase.

Chapter 11 SOFTWARE QUALITY ASSURANCE PLAN

Software quality is one of the most significant factors determining the success of the project. The Software Quality Assurance Plan (SQAP) lays down the guidelines for ensuring that the software developed is up to the mark at each step. The SQAP followed for our project is as follows:

• All modules need to be developed using proper naming conventions and other specifications conforming to the Python PEP 8 standard. This step ensures that the code developed is consistent and neat.
• Code must be well documented, with docstrings and images wherever possible.

• Data collected through scraping via the Twitter API must be manually evaluated to find bad data points and remove them. This step makes sure that we have relevant data points in our data set and avoids the "Garbage In, Garbage Out" phenomenon.
• No module must be put into production without adequate testing. This ensures that bugs are detected and rectified early.
• During model development, the machine learning based models must be tested and validated in each cycle to check for over-fitting and under-fitting of data.

• No confidential information should be exposed by the API to the outside world. This would help in ensuring that the privacy of users is not compromised.
• The web interface should be developed keeping in mind good design practices for web apps. This would ensure that the web app is responsive, consistent and easy to navigate.
• The project should be regularly updated even after deployment. This would make sure that existing bugs are fixed and additional functionalities are added from time to time.

Chapter 12 CONCLUSION

To sum up, it has been well established that depression is one of the leading issues faced by our society. Detecting depression early can play a major role in preventing suicides and improving the mental health of society collectively. Social media has become a rich source of data for sentiment analysis and can be used to tackle depression effectively. Our project uses Twitter data to train a stacked ensemble model consisting of Multinomial Naïve Bayes and XGBoost base models along with a Logistic Regression meta-classifier.

The results show that, on both the F1-score and Accuracy metrics, the proposed stacking model outperforms the standalone models it was compared against. With an F1-score of 93% and an Accuracy of 96%, our model is the best performer among the models evaluated in this report. This project could go a long way in integrating emotional AI with social media to help tackle depression in our society.

Chapter 13 References

[1] A. H. B. P. O. M. H. I. Orabi, "Deep Learning for Depression Detection of Twitter Users," Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, 2018, doi: 10.18653/v1/w18-0609.
[2] J. A. Naslund, K. A. Aschbrenner, G. J. McHugo, J. Unützer, L. A. Marsch, and S. J. Bartels, "Exploring opportunities to support mental health care using social media: A survey of social media users with mental illness," Early Interv. Psychiatry, 2019.
[3] R. Thorstad and P. Wolff, "Predicting future mental illness from social media: A big-data approach," Behav. Res. Methods, 2019.
[4] Mandar Deshpande and Vignesh Rao, "Depression detection using Emotional Artificial Intelligence," 2017.
[5] Cong, Z. Feng, F. Li, Y. Xiang, G. Rao and C. Tao, "X-A-BiLSTM: a Deep Learning Approach for Depression Detection in Imbalanced Data," 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Madrid, Spain, 2018, pp. 1624-1627, doi: 10.1109/BIBM.2018.8621230.
[6] Trotzek, Marcel Koitka, Sven Friedrich, Christoph. (2018). "Utilizing Neural Networks and Linguistic Metadata for Early Detection of Depression Indications in Text Sequences". IEEE Transactions on Knowledge and Data Engineering. 32. 588-601. 10.1109/TKDE.2018.2885515.
[7] T. Yang, C. Wu, M. Su and C. Chang, "Detection of mood disorder using modulation spectrum of facial action unit profiles," 2016 International Conference on Orange Technologies (ICOT), Melbourne, VIC, 2016, pp. 5-8, doi: 10.1109/ICOT.2016.8278966.
[8] N. A. Asad, M. A. Mahmud Pranto, S. Afreen and M. M. Islam, "Depression Detection by Analyzing Social Media Posts of User," 2019 IEEE International Conference on Signal Processing, Information, Communication Systems (SPICSCON), Dhaka, Bangladesh, 2019, pp. 13-17, doi: 10.1109/SPICSCON48833.2019.9065101.
[9] K. Katchapakirin, K. Wongpatikaseree, P. Yomaboot and Y. Kaewpitakkun, "Facebook Social Media for Depression Detection in the Thai Community," 2018 15th International Joint Conference on Computer Science and Software Engineering (JCSSE), Nakhonpathom, 2018, pp. 1-6, doi: 10.1109/JCSSE.2018.8457362.
[10] X. Tao, R. Dharmalingam, J. Zhang, X. Zhou, L. Li and R. Gururajan, "Twitter Analysis for Depression on Social Networks based on Sentiment and Stress," 2019 6th International Conference on Behavioral, Economic and Socio-Cultural Computing (BESC), Beijing, China, 2019, pp. 1-4, doi: 10.1109/BESC48373.2019.8963550.
[11] S. Tariq et al., "A Novel Co-Training-Based Approach for the Classification of Mental Illnesses Using Social Media Posts," in IEEE Access, vol. 7, pp. 166165-166172, 2019, doi: 10.1109/ACCESS.2019.2953087.
[12] N. Mahendran, P. M. Durai Raj Vincent, K. Srinivasan, V. Sharma and D. K. Jayakody, "Realizing a Stacking Generalization Model to Improve the Prediction Accuracy of Major Depressive Disorder in Adults," in IEEE Access, vol. 8, pp. 49509-49522, 2020, doi: 10.1109/ACCESS.2020.2977887.
[13] S. Jain, S. P. Narayan, R. K. Dewang, U. Bhartiya, N. Meena and V. Kumar, "A Machine Learning based Depression Analysis and Suicidal Ideation Detection System using Questionnaires and Twitter," 2019 IEEE Students Conference on Engineering and Systems (SCES), Allahabad, India, 2019, pp. 1-6, doi: 10.1109/SCES46477.2019.8977211.
[14] https://github.com/halolimat/Social-media-Depression-Detector/blob/master/depression lexicon.json
[15] https://towardsdatascience.com/xgboost-mathematics-explained-58262530904a
[16] https://towardsdatascience.com/a-mathematical-explanation-of-naive-bayes-in-5-minutes-44adebcdb5f8
[17] https://www.kaggle.com/kazanova/sentiment140
[18] https://medium.com/topic/machine-learning
