Industrial Training Report
Depression Detection Using Tweets Through Machine Learning
Submitted By: Abhishek Pandey (2000321540003)
Submitted To: Dr. Jagriti Singh
I hereby declare that the Industrial Training Report entitled "Depression Detection Using Tweets Through Machine Learning" is an authentic record of my own work, completed as a requirement of Industrial Training during the period from 02 July 2023 to 12 August 2023 for the award of the degree of B.Tech. (Computer Science & Engineering), ABES Engineering College, Ghaziabad, under the guidance of Dr. Jagriti Singh.
(Signature of student)
ABHISHEK PANDEY
2000321540003
Date: ____________________
ACKNOWLEDGEMENT
I would like to express my special thanks and gratitude to Dr. Jagriti Singh for providing motivation, knowledge and support throughout the course of the project. This continuous support helped in the successful completion of the project, and the knowledge provided was very insightful for me.
I would also like to extend my sincere gratitude to Dr. Jagriti Singh for providing this golden opportunity, which led me to do a great deal of research that diversified my knowledge to a huge extent, for which I am thankful.
Also, I would like to thank my parents and friends, who supported me a lot in finalizing this project within the limited time frame.
Abhishek Pandey
ABOUT THE INSTITUTE
Centre for Advanced Studies is an in-campus, research-driven institute established by Dr. A.P.J. Abdul Kalam Technical University, Lucknow, to impart state-of-the-art education to postgraduate students and to facilitate quality research work in emerging areas of Engineering and Technology. The institute offers M.Tech. and Ph.D. programs in the disciplines of Computer Science and Engineering, Mechatronics, Nanotechnology, Manufacturing Technology and Automation, and Energy Science and Technology. It was established by the Uttar Pradesh State Government in 2017 with the objective of providing a stimulating platform for research scholars and academicians to create and disseminate research-based knowledge and technologies for the development of the State and the Country. The Institute is climbing consistently on the path to visibility across the globe. In a short span of time, significant progress has been made in quality education, impactful research, publications and patents, funded projects, and training and placement. The University is also making constant efforts to create a healthy environment for meaningful research outcomes, to mentor affiliated institutions by establishing world-class laboratories and facilities in the Institute, and to enhance the knowledge of faculty and students with the latest technologies and developments through training and education.
TABLE OF CONTENTS
Introduction
Tools & Technology Used
Objective of the Project
System Design
Methodology for Implementation
Implementation Details
Results
Conclusion
References
INTRODUCTION
Across the globe, millions of people experience depression, and it is one of the leading reasons why individuals die by suicide. People now use social networking sites like
Twitter and Facebook to share their ideas and feelings, which has prompted
experts to look into how this data may be used to track mental health
disorders. The real-time nature of social media posts enables researchers to
examine emotional well-being and observe changes over time, which
traditional surveys are unable to capture. Expressions of loneliness or
melancholy, negative language, self-deprecating remarks, or a lack of
participation in social events are just a few examples of language patterns on
social media that can be a sign of depression. Finding these patterns enables
the identification of those who are vulnerable and may benefit from care.
Monitoring social media data can also reveal how particular scenarios affect
the emotional well-being of individuals. Researchers analyze the
consequences of widely reported occurrences, such as celebrity deaths, and
how depressive symptoms are affected by campaigns to raise awareness of
mental health issues.
Even though it may be beneficial, relying solely on social media for information about depressed individuals has limitations. These drawbacks include representation bias, problems with precision (since not everyone has access to, or uses, a particular social media platform), and privacy concerns, as collected data must be protected so that it does not violate individuals' rights to confidentiality.
In summary, social media data can be used to track depression, but concerns about privacy and the need for appropriate interpretation of posts must be tackled. Moreover, multiple sources should be merged for an accurate assessment.
TOOLS & TECHNOLOGY USED
Hardware Requirements:
• Core i5/i7 processor
• At least 8 GB RAM
• At least 60 GB of Usable Hard Disk Space
Software Requirements:
• Python 3.x
• Anaconda Distribution
• NLTK Toolkit
• Jupyter Notebook / Google Colaboratory
OBJECTIVE OF THE PROJECT
Scraping tweets featuring news and reviews from various Twitter handles on Twitter.com, as sketched below.
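The report does not name a specific scraping tool, so the sketch below assumes the Tweepy library and a Twitter API bearer token purely for illustration; the query string and result limit are likewise assumptions.

```python
# A minimal sketch of scraping recent tweets with Tweepy (assumed tool).
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # hypothetical token

# Search recent English-language tweets, excluding retweets.
response = client.search_recent_tweets(
    query="depression lang:en -is:retweet", max_results=100
)

for tweet in response.data or []:
    print(tweet.id, tweet.text[:80])
```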
[Figure: a representation of the dataset]
Data Format:
The dataset we will use is in .csv format. [Figure: a sample of the dataset]
METHODOLOGY FOR IMPLEMENTATION
Resources:
To facilitate the preprocessing of the data, we introduce the following resources:
• A stop-word dictionary, corresponding to words that are filtered out before or after processing of natural-language data because they are not useful in our case (a sketch of such filtering follows).
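The exact stop-word dictionary used in the project is not specified; the sketch below assumes NLTK's standard English stop-word list.

```python
# A minimal sketch of stop-word filtering with NLTK (assumed dictionary).
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def remove_stopwords(tweet):
    """Lowercase, split on whitespace, and drop stop words."""
    tokens = tweet.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stopwords("I am feeling so alone and empty today"))
# ['feeling', 'alone', 'empty', 'today']
```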
Pre-Processing:
We can pre-process the tweets now that we have the corpus of tweets and all the
resources that might be helpful. It is crucial because all the changes we make during this
process will have an immediate effect on how well the classifier functions.
Preprocessing will produce uniform and consistent data that can be used to optimize the
performance of the classifier.
Since we need to extract features from our data set of tweets, we use three different
vectorization methods:
▪ TF-IDF
▪ Count Vectorizer
▪ N-gram vectorizer
In text classification, the count (number of times) each word appears in a document is used as a feature for training the classifier; the sketch below shows all three vectorizers.
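A minimal sketch of the three methods, assuming scikit-learn and a toy list of tweets (all parameters are defaults unless noted):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tweets = [
    "i feel so empty and alone",
    "what a great day with friends",
    "nothing matters anymore",
]

# 1) TF-IDF: weights word counts by how informative each word is.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(tweets)

# 2) Count Vectorizer: raw word counts per document.
counts = CountVectorizer()
X_counts = counts.fit_transform(tweets)

# 3) N-gram vectorizer: counts of contiguous word sequences (here unigrams
#    and bigrams), which lets phrases like "not bad" become features.
ngrams = CountVectorizer(ngram_range=(1, 2))
X_ngrams = ngrams.fit_transform(tweets)

print(X_tfidf.shape, X_counts.shape, X_ngrams.shape)
```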
First, we divide the data set into two parts, the training set and the test set. To do this, we shuffle the data set to remove any ordering of the data; the training set contains two-thirds of the data, while the rest forms the test set.
A third set of data, known as the validation set, is actually required after the training set and test set have been produced. It is utilized to test our model against previously unseen data and to tune the learning algorithm's parameters, in order to prevent underfitting and overfitting, among other things.
We need this validation set because the test set should be used only to verify how well the model generalizes. If we tuned on the test set rather than the validation set, our model could look overly optimistic and the results would be distorted.
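A minimal sketch of the shuffled split, assuming scikit-learn; the two-thirds training fraction comes from the text above, while splitting the remaining third evenly between validation and test is an assumption, as is the hypothetical load_tweet_features helper:

```python
from sklearn.model_selection import train_test_split

X, y = load_tweet_features()  # hypothetical helper returning features/labels

# First split off 2/3 for training (shuffle=True is the default).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=2 / 3, random_state=42
)

# Then split the remainder into validation and test halves.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)
```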
Classification Algorithms:
• A linear regression will predict values outside the acceptable range (e.g.,
predicting probabilities outside the range 0 to 1)
• Since the dichotomous experiments can only have one of two possible values for
each experiment, the residuals will not be normally distributed about the predicted line.
Contrarily, a logistic regression results in a logistic curve whose values can only lie between 0 and 1. Like a linear regression, a logistic regression builds a curve from the data, but it uses the natural logarithm of the target variable's "odds" rather than the probability. Furthermore, neither the predictors nor the variance in each group needs to be normally distributed.
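A minimal sketch of such a logistic-regression classifier, assuming scikit-learn, TF-IDF features, and hypothetical train/validation lists from the split above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Vectorize raw tweet text, then fit a logistic-regression classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_tweets, train_labels)        # hypothetical training lists
val_accuracy = clf.score(val_tweets, val_labels)
print(f"Validation accuracy: {val_accuracy:.3f}")
```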
For a linearly separable training set with labels y_i ∈ {+1, −1}, the SVM finds a separating hyperplane w·X + b = 0 such that w·X_i + b ≥ +1 if y_i = +1, and w·X_i + b ≤ −1 if y_i = −1.
If the data is linearly inseparable, the SVM uses a nonlinear mapping to transform the data into a higher dimension. It then solves the problem by finding a linear hyperplane. Functions that perform such transformations are called kernel functions. The kernel function selected for our experiment is the Gaussian Radial Basis Function (RBF):

K(X_i, X_j) = exp(−γ ||X_i − X_j||^2)

where X_i are support vectors, X_j are testing tuples, and γ is a free parameter that uses the default value from scikit-learn in our experiment. [Figure: classification example of SVM with the linear kernel and the RBF kernel]
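A minimal sketch of an RBF-kernel SVM, assuming scikit-learn (gamma="scale" is its default free parameter) and the hypothetical data lists from the earlier sketches:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# SVM with the Gaussian RBF kernel; gamma left at scikit-learn's default.
svm_clf = make_pipeline(TfidfVectorizer(), SVC(kernel="rbf", gamma="scale"))
svm_clf.fit(train_tweets, train_labels)    # hypothetical training lists
print("Validation accuracy:", svm_clf.score(val_tweets, val_labels))
```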
Confusion Matrix:
The performance of the model during the classification process may be visualized, and its accuracy assessed, using a confusion matrix. The figures are only approximate, because they can change based on, for instance, how we shuffle our data.
We can hopefully see that more tweets are classified as true positives and true negatives than as false positives and false negatives. Based on this outcome, we experiment with various strategies to try to increase the classifier's accuracy, and we repeat the procedure using k-fold cross-validation to assess its average accuracy, as sketched below.
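A minimal sketch of both steps, assuming scikit-learn, the classifier from the earlier sketches, and hypothetical test/full-dataset variables; k = 10 folds is an assumption, since the report does not state k:

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

y_pred = clf.predict(test_tweets)          # classifier from earlier sketches
print(confusion_matrix(test_labels, y_pred))
# rows = true labels, columns = predicted labels; for binary labels this is
# [[TN FP]
#  [FN TP]]

scores = cross_val_score(clf, all_tweets, all_labels, cv=10)
print("Mean 10-fold CV accuracy:", scores.mean())
```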
IMPLEMENTATION DETAILS
The training on the dataset consists of the following steps:
Loading of Dataset: A small Python script is written to load the CSV file.
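A minimal sketch of that loading step, assuming pandas; the file name and column names are assumptions, since the report does not specify them:

```python
import pandas as pd

df = pd.read_csv("tweets.csv")             # hypothetical file name
print(df.shape)
print(df.head())

tweets = df["text"].astype(str).tolist()   # hypothetical column names
labels = df["label"].tolist()
```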
Preprocessing Data:
This is a vital part of training on the dataset. Words present in the file are considered both as single words and as pairs of words because, for example, the word "bad" on its own is negative, but "not bad" reads as positive. In such cases, considering only single words would train the classifier on the wrong signal. Words are therefore also checked in pairs, to find modifiers occurring before an adjective which, if present, might give the phrase a different outlook, as the sketch below shows.
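A minimal sketch of the idea, assuming scikit-learn's CountVectorizer with unigrams and bigrams; this illustrates the pairing described above rather than the project's exact code:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(["the movie was not bad"])

print(vec.get_feature_names_out())
# includes the bigram 'not bad', so the classifier can learn that this
# phrase carries a positive outlook even though 'bad' alone is negative
```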
After pre-processing:
As a result of this training dataset of public comments, the machine is now able to determine whether an entered sentence will receive a positive or a negative response.
The proportion of relevant instances among the retrieved instances is known as precision (also called positive predictive value), whereas the proportion of relevant instances that have been retrieved, relative to the total number of relevant instances, is known as recall (also called sensitivity). Therefore, both precision and recall are founded on an understanding and a measurement of relevance; the sketch below makes the two measures concrete.
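A minimal sketch on a toy label set, computing precision and recall from the confusion-matrix counts and cross-checking against scikit-learn's implementations:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# precision = TP / (TP + FP); recall = TP / (TP + FN)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 3
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # 1
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # 1

print(tp / (tp + fp), precision_score(y_true, y_pred))  # 0.75 0.75
print(tp / (tp + fn), recall_score(y_true, y_pred))     # 0.75 0.75
```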
RESULTS
The classifiers evaluated were Naïve Bayes, Logistic Regression, and SVC; the SVC achieved an accuracy of 76.15%.
CONCLUSION
Texts are categorized using sentiment analysis based on the emotions they express. Data preparation, review analysis, and sentiment classification are the three key steps of a conventional sentiment analysis model. These three steps are the main topic of this report, which also discusses typical methods employed in each.
Sentiment analysis is a growing area of text mining and computational linguistics, and it has attracted a great deal of research attention in recent years.
Future research will focus on in-depth methods for extracting opinion and product attributes, as
well as cutting-edge classification models that can take the ordered labels property in rating
inference into account. Applications that utilize the sentiment analysis results are also anticipated
to surface soon.
REFERENCES
• Priya A, Garg S, Tigga NP (2020) Predicting anxiety, depression and stress in modern life using machine learning algorithms. Procedia Computer Science 167:1258–1267.
• Alsagri HS, Ykhlef M (2020) Machine learning-based approach for depression detection in Twitter using content and activity features. IEICE Transactions on Information and Systems E103.D(8):1825–1832. doi:10.1587/transinf.2020EDP7023
• Kumar P, Garg S, Garg A (2020) Assessment of anxiety, depression and stress using machine learning models. Procedia Computer Science 171:1989–1998. doi:10.1016/j.procs.2020.04.213
• Shelton J (2019) Depression definition and DSM-5 diagnostic criteria. Retrieved June 13, 2019, from https://www.psycom.net/depression-definition-dsm-5-diagnostic-criteria/
• Cavazos-Rehg PA, Krauss MJ, Sowles S, Connolly S, Rosas C, Bharadwaj M, Bierut LJ (2016) A content analysis of depression-related tweets. Computers in Human Behavior. doi:10.1016/j.chb.2015.08.023
• Neethu MS, Rajasree R (2014) Sentiment analysis in Twitter using machine learning techniques.