NLP - J - Final ReviewReport - Cyberbullying

The document summarizes a student project on cyberbullying detection using natural language processing. The project aims to automatically flag potentially harmful tweets and identify patterns of hatred. It reviews literature on sentiment analysis of product reviews and cyberbullying detection using word similarity and FastText models. The project trains models on tweet data labeled for cyberbullying and evaluates the models' performance on classification.

Uploaded by

KEERTHANA V 20BCE1561

School of Computer Science and Engineering

VIT Chennai
Vandalur - Kelambakkam Road, Chennai - 600 127

Final Review Report

Programme: B Tech CSE

Course: CSE4022 - Natural Language Processing
Slot: E1 + TE1
Faculty: Dr. M. Premalatha
Component: J Component Review 3

Title: CYBERBULLYING DETECTION USING NLP

Team Member(s):

VAIBHAV THALANKI (20BCE1385)

SIDDHARTH R (20BCE1286)
NEERAJ S (20BCE1656)
ROGHAN (20BCE1275)

Abstract

With the rise of social media coupled with the Covid-19 pandemic, cyberbullying has reached
an all-time high. We can combat this by creating models that automatically flag potentially
harmful tweets and break down the patterns of hatred. As social media usage becomes
increasingly prevalent in every age group, a vast majority of citizens rely on this essential
medium for day-to-day communication. Social media’s ubiquity means that cyberbullying
can affect anyone at any time and anywhere, and the relative anonymity of the
internet makes such personal attacks more difficult to stop than traditional bullying. By
leveraging Natural Language Processing, cyberbullying tweets can be detected and
traced.

Keywords: Tweet, Cyberbullying, NLP, Classification

Introduction

As social media usage becomes increasingly prevalent in every age group, a vast majority of
citizens rely on this essential medium for day-to-day communication. Social media’s ubiquity
means that cyberbullying can effectively impact anyone at any time or anywhere, and the
relative anonymity of the internet makes such personal attacks more difficult to stop than
traditional bullying.

On April 15th, 2020, UNICEF issued a warning in response to the increased risk of
cyberbullying during the COVID-19 pandemic due to widespread school closures, increased
screen time, and decreased face-to-face social interaction. The statistics of cyberbullying are
outright alarming: 36.5% of middle and high school students have felt cyberbullied and 87%
have observed cyberbullying, with effects ranging from decreased academic performance to
depression to suicidal thoughts.

Despite the severity of the issue, very few effective attempts to identify abusive conduct have
been made, either by the academic community or by social media platforms. This is because of a
number of intrinsic challenges, such as poor, error-ridden syntax and a very narrow context.
Beyond overtly harsh words, aggression and bullying may also take many other
forms, such as frequent sarcasm and trolling.

With the rise of social media coupled with the Covid-19 pandemic, cyberbullying has reached
an all-time high. We can combat this by creating models that automatically flag potentially harmful
tweets and break down the patterns of hatred. This project aims to achieve this by using
concepts of Natural Language Processing and Sentiment Analysis.

Literature Review

PAPER 1: SENTIMENT ANALYSIS USING PRODUCT REVIEW DATA [1]

Sentiment analysis, also referred to as opinion mining, studies and evaluates people’s
attitudes towards a certain entity. This paper tackles the problem of sentiment
polarity categorization for product reviews. The task is difficult for two main reasons.
First, since people can freely post reviews, the quality of the reviews cannot be guaranteed:
spam is sometimes posted instead of genuine product reviews, some of it meaningless and some
of it fake reviews. Second, the basis for classifying each review is arbitrary: it is unclear
by what standard a review should be judged positive, negative or neutral.

The data used for this paper is from Amazon product reviews collected between February and
April 2014. The above-mentioned problems are somewhat overcome in the following two
ways: each product review is inspected before it can be posted, and each review
must carry a rating that can be used as the ground truth. This rating is based on a 5-star
scale, with 5 being the highest rating (positive) and 1 the lowest (negative).

Figure 1: Amazon Review System [1]

An algorithm was proposed and implemented for negation phrase identification. A
mathematical approach was proposed for sentiment score computation. A feature vector
generation method was presented for sentiment polarity categorization. Then, two sentiment
polarity categorization experiments were performed, at sentence level and at
review level respectively. The performance of three classification models was evaluated and
compared based on the experimental results.

All objective content in a sentence was removed and only subjective content was used for
analysis. A sentiment sentence is one that contains at least one positive or negative
word. All the sentences were first tokenized into separate English words, and each sentence
was tagged using a POS tagger. 25 million adjectives, over 22 million adverbs, and over 56
million verbs were tagged across all the sentiment sentences. The paper used an algorithm for
negation phrase identification which classifies negations into negation-of-adjective (NOA)
and negation-of-verb (NOV). The algorithm identified 21,586 different phrases, each with a
negative prefix, with a total occurrence count of over 0.68 million.

For phrase tokens, 3,023 of the 21,586 identified sentiment phrases were selected,
each of which occurs at least 30 times. Given a token
t, the formula for t’s sentiment score (SS) computation is given as:

Feature vector formation

Two binary strings are used to represent each token’s appearance. One string with 11,478 bits
is used for word tokens, while the other, with a bit-length of 3,023, is used for phrase
tokens. For instance, if the ith word (phrase) token appears, the ith bit of the word (phrase)
string is flipped from “0” to “1”. A hash value of each string is then computed and saved.
Hence, a sentence-level feature vector has four elements in total: two hash values computed
from the flipped binary strings, an averaged sentiment score, and a ground truth label.
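
As a rough illustration of the feature vector described above, the sketch below builds the two
appearance strings and hashes them. The summary does not say which hash function the paper
uses, so SHA-256 is an assumption here, and the tiny vocabularies, the function names and the
example score are hypothetical stand-ins for the 11,478 word tokens and 3,023 phrase tokens.

```python
import hashlib

def appearance_bits(tokens, vocabulary):
    """Return a '0'/'1' string whose ith character is '1' if vocabulary[i] occurs in tokens."""
    present = set(tokens)
    return "".join("1" if term in present else "0" for term in vocabulary)

def sentence_feature_vector(words, phrases, word_vocab, phrase_vocab, avg_score, label):
    """Build the four-element vector: two hashes, an averaged score, and the label."""
    # Assumption: SHA-256 stands in for the unspecified hash function.
    word_hash = hashlib.sha256(appearance_bits(words, word_vocab).encode()).hexdigest()
    phrase_hash = hashlib.sha256(appearance_bits(phrases, phrase_vocab).encode()).hexdigest()
    return [word_hash, phrase_hash, avg_score, label]

# Hypothetical toy vocabularies, far smaller than the ones used in the paper.
word_vocab = ["good", "bad", "great"]
phrase_vocab = ["not good", "not bad"]

bits = appearance_bits(["great", "good"], word_vocab)
vector = sentence_feature_vector(
    ["great", "good"], [], word_vocab, phrase_vocab, avg_score=4.5, label="positive"
)
```

Here `bits` marks the first and third vocabulary entries as present, and `vector` holds the two
hash strings followed by the score and the label.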

Results and conclusions

On manually-labelled sentences, the classification models (SVM, Naïve Bayes, Random
Forest) show the same level of performance based on their F1-scores, where the three scores
all take the same value of 0.85. According to the ROC curves, all three models performed
quite well for testing data that have high posterior probability. As the probability goes lower,
the Naïve Bayesian classifier outperforms the SVM classifier, with a larger area under the curve.
In general, the Random Forest model performs the best.

On machine-labelled sentences, the SVM model shows the largest improvement, from
0.61 to 0.94, as its training data increased from 180 to 1.8 million. The model outperforms the
Naïve Bayesian model and becomes the second-best classifier on subset C and the full set. The
Random Forest model again performs the best for datasets on all scopes. The figure below shows
the ROC curves plotted based on the result of the full set.

Figure 2: ROC curves based on complete set [1]

PAPER 2: CYBERBULLYING DETECTION, BASED ON THE FASTTEXT AND
WORD SIMILARITY SCHEMES [2]

In the study, vocabulary and syntax are first used to analyse the features of cyberbullying.
Then, a new recognition technique based on word similarity and FastText is suggested. The
effectiveness and performance of the suggested method are then assessed through
experiments. The results indicate that the suggested approach significantly
improves both the accuracy and the recall of cyberbullying detection.

In cyberbullying situations, the word similarity technique is generally used to analyse the
text's morphology. The Word2vec model and cosine similarity are used to determine how similar
terms are to the vocabulary used in cyberbullying and to pinpoint the words that are definitely
used in bullying. To identify any implicit terms associated with
bullying in the text, the FastText approach then looks at how contextual texts relate to one
another.

Model Training

Each word in the sample is compared to the offensive terms in the training set to see how similar
they are, and the greatest value obtained is used to determine how likely the sample is to
constitute cyberbullying. Let B = {b1, b2, ... , bn} be the collection of insulting words and S
= {S1, S2, ... , SN} be the set of training samples. For a sample Si ∈ S, i = 1, 2, ... , N, the
word-segmentation result is denoted as Si = {s1, s2, ... , smi}. The Word2vec word vector of a
word s is represented as σ(s). Then, the possibility that Si is marked as cyberbullying can be
expressed as follows:
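
A minimal sketch of the scoring rule described in this paragraph: take the largest cosine
similarity between any word of the sample and any insulting word. The function names and the
two-dimensional toy vectors are hypothetical; real Word2vec vectors σ(s) would come from a
trained model, and the exact form of the expression in [2] may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def bullying_score(sample_words, insult_words, vectors):
    """Largest similarity between any sample word and any insulting word."""
    sims = [
        cosine(vectors[s], vectors[b])
        for s in sample_words
        for b in insult_words
        if s in vectors and b in vectors  # skip out-of-vocabulary words
    ]
    return max(sims, default=0.0)

# Hypothetical toy embeddings standing in for Word2vec vectors.
toy_vectors = {"idiot": [1.0, 0.0], "fool": [0.9, 0.1], "hello": [0.0, 1.0]}

score_hostile = bullying_score(["hello", "fool"], ["idiot"], toy_vectors)
score_neutral = bullying_score(["hello"], ["idiot"], toy_vectors)
```

With these toy vectors the sample containing "fool" scores close to 1, while the sample with
only "hello" scores 0.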

In order to reduce the similarity calculations, the following two measuring schemes are
applied:
(1) “Remove stop words. More than 891 common stop words, including “am,” “about,” and
“believe,” are collected.”
(2) “Remove common words from the non cyber bullying class. The frequency of words in
the training set is counted. NumCB (s) and NumNCB (s) denote the number of times the
word s appears in the cyberbullying and the non cyber bullying, respectively.” The following
calculation should be carried out:

If common(s) > 1, then s is considered a common word in the NCB class and can be
removed.

The words and phrases in the input layer are used by FastText to build the feature vector,
which is then linearly transformed and mapped to the hidden layer. The Huffman tree is built
using the weights assigned to each category and the model parameters after the hidden layer
has solved the maximum likelihood problem. The calculation of the output is optimised via
hierarchical softmax based on the Huffman tree.

Figure 3: Cyberbullying Dataset Summary used in [2]

The FTSW method performs the best when focusing on cyberbullying cases. It has the
highest precision and F1 score, although its recall is lower. The accuracy of the FTSW model is
the highest of all models, and in cyberbullying detection high accuracy is what matters
most.

Figure 4: Results of [2]

PAPER 3: MEASURING, UNDERSTANDING, AND CLASSIFYING NEWS MEDIA
SYMPATHY ON TWITTER AFTER CRISIS EVENTS [3]

The paper focuses on the coverage and sympathy bias between Arab and Western media after
the 2015 Beirut and Paris terrorist attacks. For 2,390 tweets in four different languages
(English, Arabic, French, and German), sympathy and sentiment labels were crowdsourced.
A regression model was then built to characterize sympathy, and a deep convolutional
neural network was trained to predict it. The paper applies media bias research to
news on Twitter. Instead of asking whether media biases exist on Twitter specifically as
a social media platform, it uses Twitter as a journalism tool to study timely news reporting
and offers first steps towards a system for categorising sympathetic tweets.

906,583 tweets on the Beirut bombing were collected shortly after the news broke on
November 12, 2015. This was done using the Twitter hashtags #beirut, #lebanon,
#beirut2paris, #beirutattacks and #beirutbombing. The dataset consisted of 667,073 retweets and
610,879 unique users. After removing the duplicates (retweets), 239,093 unique tweets
remained. For Paris, 5,339,452 tweets posted during the two days after the attacks were collected
using the hashtags #paris, #france, #parisattacks, #prayforparis and #porteouverte. 74.78% of
the tweets were retweets and there were 2,538,348 unique users. In addition to these, the Paris
dataset created by Nick Ruest [4] was also used.

Figure 5: Top 5 most frequent hashtags [3]

Temporal event slicing and sampling

Since the attacks in Beirut and Paris differed in how long they were covered, normalisation was
required, as tweets posted five days after the attacks may differ from those posted three
weeks after the attacks. The normalisation is constrained by the size and coverage of the
smaller Beirut dataset. Therefore, the Paris dataset was sliced to match the coverage length of
the Beirut data.

This time slicing further reduced the size of the dataset to 7,768 distinct tweets, including
coverage of Beirut by Western media (N=131), Paris by Western media (N=5,298), Beirut by
Arab media (N=287), and Paris by Arab media (N=1,566).

The next step was to send the data to the crowd for annotation. To avoid lengthy
crowdwork time and costs, it was decided to send only a sample of each dataset. However,
simple random sampling would not suffice, as the most important headlines appear in the
3 days after the attacks. Therefore, the data was divided into buckets of 24
hours each and samples were drawn from each bucket. The normalization constant was
calculated by dividing the size of the desired sample draw (1,000) by the total number of
rows in each dataset. For each bucket, the sample drawn was the number of records in that
bucket multiplied by the normalization constant, and rounded to ensure all day buckets cap at
1,000 records.
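
The bucketed sampling above can be sketched as follows. This is an illustration under stated
assumptions rather than the authors' code: the function name, the (timestamp, text) tuple
layout, the fixed random seed and the toy data are all hypothetical, and plain rounding is used
for the per-bucket draw.

```python
import random
from collections import defaultdict
from datetime import datetime, timedelta

def sample_by_day(tweets, event_time, target=1000, seed=0):
    """Draw a sample that keeps each 24-hour bucket's share of the data.

    tweets is a list of (timestamp, text) pairs; target is the desired sample size.
    """
    if not tweets:
        return []
    buckets = defaultdict(list)
    for ts, text in tweets:
        day = (ts - event_time) // timedelta(hours=24)  # whole days since the event
        buckets[day].append((ts, text))
    constant = target / len(tweets)  # the normalization constant
    rng = random.Random(seed)
    sample = []
    for day in sorted(buckets):
        size = min(len(buckets[day]), round(len(buckets[day]) * constant))
        sample.extend(rng.sample(buckets[day], size))
    return sample

# Hypothetical toy data: 300 tweets on day 0 and 100 tweets on day 1.
event = datetime(2015, 11, 12)
toy = [(event + timedelta(minutes=i), f"day0-{i}") for i in range(300)]
toy += [(event + timedelta(hours=24, minutes=i), f"day1-{i}") for i in range(100)]

sample = sample_by_day(toy, event, target=40)
```

With a target of 40 out of 400 toy tweets, the constant is 0.1, so 30 tweets are drawn from
the first day and 10 from the second.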

Results

COVERAGE BIAS: As expected from the plots, there was more coverage from Arab media
for the Beirut attacks, and inverse for the Paris attacks, which showed more Western media
coverage. A Chi-square test with Yates’ continuity correction was performed across all days
to compare the difference between Arab and Western media coverage. The result was to
accept the alternate hypothesis: there was a significant difference in coverage bias between
the attacks (χ²(1, N=7,768) = 1489, p < 0.001, φ = 0.44, odds ratio = 0.05). Correlation analysis
was also done to find out whether they followed a similar pattern of tweeting. It was observed
that for tweet activity volume, Western and Arab media were engaged at approximately
similar time points, which supports the fairness of collected data.
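
For readers unfamiliar with the test, the sketch below computes a Yates-corrected chi-square
statistic and an odds ratio for a 2×2 table in plain Python. The cell counts are hypothetical
and chosen only for illustration; they are not the paper's data and do not reproduce the
statistic reported above. The function names are likewise illustrative.

```python
def yates_chi_square(a, b, c, d):
    """Chi-square statistic with Yates' continuity correction for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    # The correction subtracts n/2 from |ad - bc|, never going below zero.
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    return n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def odds_ratio(a, b, c, d):
    """Odds ratio (a*d)/(b*c) for the same 2x2 table."""
    return (a * d) / (b * c)

# Hypothetical counts: rows could be media groups, columns the two events.
chi2 = yates_chi_square(10, 20, 30, 40)
ratio = odds_ratio(10, 20, 30, 40)
```

In practice a library routine such as `scipy.stats.chi2_contingency` (with its `correction`
argument enabled) would normally be used instead, since it also returns the p-value.
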
CLASSIFYING NEWS MEDIA SYMPATHY: The crowd annotation did not cover all tweets, so ML
models were trained to generalize the analysis. The task was to recognise whether a tweet is
sympathetic or not, which is essentially a form of sentiment analysis. A CNN fitted with a
word2vec model was used for classification. Western media was more sympathetic towards Paris,
while Arab media was more sympathetic towards Beirut. This aligns with prior work showing
strong regionalism in news geography and with producer-consumer attention asymmetries
across countries. Interestingly, retweeting behavior appears to be
impartial as to whether a tweet is sympathetic or not, and the same appears to apply
to sentiment labels.

Figure 6: Results [3]

PAPER 4: SOCIAL MEDIA CYBERBULLYING DETECTION USING MACHINE
LEARNING [5]

Cyberbullying is a form of abuse through electronic messages. Social media acts as the
perfect environment for bullies, who take advantage of the platform to attack its users.
These bullies usually use emails, messages and so on to carry out their malicious
activities. The paper tries to detect messages and emails with such ill intent by applying a few
supervised machine learning models, so that these messages do not reach the user and the user
is not harmed. These models help in detecting the patterns that bullies might
use. A Kaggle dataset is used for this purpose.

The steps for building these models are as follows:
• Tokenize: The dataset which is in the form of text is taken in as sentences/paragraphs
and processed to give as separate words in an array.
• Lowercase: The array of words generated in the previous step is then converted to
lowercase to normalize the data and to remove uneven casing. Example: “ROGHAN”
will be converted to “roghan”.
• Stop words and Cleaning: An essential part of the pre-processing is cleaning the
text of stop words, after the text has been split into sentences and
paragraphs at ‘\n’ or ‘\t’.

• Word Correction: Here, the Microsoft Bing word correction API takes the word from
the array and returns a JSON object with the similar words and the distance between
them and the original word.
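
The first three steps above can be sketched in a few lines of plain Python. This is only an
illustration: the stop-word set is a tiny hypothetical stand-in for a real list, the regular
expression is one possible tokenization rule, and the word-correction step is omitted because
it depends on an external API.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "you", "to", "of"}

def preprocess(text):
    """Tokenize on letters/apostrophes, lowercase, then drop stop words."""
    tokens = re.findall(r"[A-Za-z']+", text)   # also splits on '\n' and '\t'
    tokens = [token.lower() for token in tokens]
    return [token for token in tokens if token not in STOP_WORDS]

cleaned = preprocess("ROGHAN is\tthe best\nYou are a LOSER")
```

For this input, `cleaned` contains the three remaining lowercase tokens "roghan", "best" and
"loser".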

The next step in preparing data for the model is feature extraction. The textual data is
transformed into a format that is acceptable to the machine learning model. The features are
then separated from the array of features. Sentiment analysis is used to determine
whether the text is positive or negative in tone. The extracted features are finally
fed into the classification algorithm. Two classifiers are used: SVM and a neural network. The
network has 3 layers: input, hidden and output. The input layer has 128 nodes, whereas the
hidden layer contains 64 neurons. The output is a Boolean value. The
classifier models are evaluated using error metrics based on the confusion matrix:
accuracy, precision, recall and F-score.

The dataset consists of 12,773 rows. The data consists of questions along with the answers
given, with class labels that say whether the answer is bullying or a normal message.
After cleaning/pre-processing the dataset, a total of 1,608 (Cyberbullying) and 804 (Normal)
instances remained. The dataset is then split 80%/20% into training and test sets.
SVM and Neural Network (NN), the best-performing classifiers, are applied. Several
experiments are then run on different n-gram language models. The SVM classifier created
using a 4-gram model gives an accuracy of 90.3%, while the Neural Network (NN) model
gives an accuracy of 91.76%.
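
To make the n-gram language models concrete, the sketch below extracts word-level n-grams from
a token list. Whether the paper builds its n-grams over words or characters is not stated in
this summary, so word-level n-grams are an assumption, as are the function name and the example
sentence.

```python
def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["you", "are", "so", "dumb", "lol"]
four_grams = word_ngrams(tokens, 4)
bigrams = word_ngrams(tokens, 2)
```

A five-token message yields two 4-grams and four bigrams; a message shorter than n yields
none, which is one reason higher-order n-grams are sparse on short texts.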

Both classifiers are evaluated in terms of precision and recall for each
language model. The average accuracy, recall, precision, and F-score of the 2 classifiers are
compared, and the Neural Network model performs the best. In summary, the
proposed approach detects cyberbullying with two machine learning
models, SVM and Neural Network, using TF-IDF and sentiment analysis for the extracted
features.

PAPER 5: CYBER-BULLYING DETECTION IN HINGLISH LANGUAGES USING
MACHINE LEARNING [6]

Cyberbullying results in depression, low self-esteem and emotional problems for the victims.
Existing solutions to prevent cyberbullying use keyword matching techniques or lexicon
methods, which are common ways to identify abusive language. The idea proposed here is
to build a model that identifies and classifies cyberbullying in English and Hinglish
languages. A chat app is also built to detect and classify it in group chats.

The methodology for detecting malicious messages combines Natural Language Processing
and Machine Learning. The data, originally scraped from social media platforms like
WhatsApp, Twitter and YouTube, was taken from Kaggle. The dataset
consists of about 15,307 rows. It also contains a class label which signifies whether the
text is cyberbullying, which helps in building a supervised machine learning model.

Special characters, retweet symbols, hashtags etc. were removed. Very short words
were also removed, as they do not contribute to the “cyberbullying” part of the model and are
most often just articles and prepositions. NLP techniques like tokenization and lemmatization
are applied to extract the meaningful words from these texts. Tokenization splits the
sequence of words into smaller chunks, whereas lemmatization reduces
the inflectional forms of a word to the same root. The final step is vectorization, where
weights are assigned to the words based on the probability with which a certain word can be
found.

Two feature extraction methods were considered: Count Vectorization and Term Frequency-Inverse
Document Frequency (TF-IDF). Count Vectorization converts a collection
of words within the corpus into a vector of term counts: the vectorizer is fitted to learn the
vocabulary and then builds a word matrix accordingly. TF-IDF evaluates how relevant a word is
to a document by weighing how often the word occurs in that document against how common it is
across the corpus; a score of 0 means the word is so common that it carries no distinguishing
information. A comparative analysis of the two feature extraction methods concluded that Count
Vectorizer (CV) gives better accuracy than the TF-IDF method. Therefore, the CV
method is used for feature extraction.
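
The difference between the two representations can be seen in a small pure-Python sketch. It
uses the textbook definition tf × log(N / df) without smoothing, which is an assumption:
library implementations often use smoothed variants, so their numbers will differ. The function
names and the two toy documents are hypothetical.

```python
import math
from collections import Counter

def count_vectorize(docs):
    """Return the sorted vocabulary and one row of term counts per document."""
    vocab = sorted({word for doc in docs for word in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix

def tf_idf(docs):
    """Unsmoothed TF-IDF: (count / doc length) * log(N / document frequency)."""
    vocab, counts = count_vectorize(docs)
    n_docs = len(docs)
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    weights = []
    for row, doc in zip(counts, docs):
        weights.append(
            [(row[j] / len(doc)) * math.log(n_docs / doc_freq[j]) for j in range(len(vocab))]
        )
    return vocab, weights

# Hypothetical toy corpus of two tokenized, non-empty messages.
docs = [["you", "are", "stupid"], ["you", "are", "kind"]]
vocab, counts = count_vectorize(docs)
_, weights = tf_idf(docs)
```

Here "you" and "are" occur in both documents, so their TF-IDF weight is 0, while "stupid" and
"kind" receive a positive weight in the document that contains them.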

Various other machine learning models, such as Linear SVC, Decision Tree
and Naïve Bayes, are then applied to train the model and find the accuracy of each. After the
models are run, it was concluded that the Random Forest classifier shows the best accuracy
based on the evaluation metrics. The Random Forest classifier has an accuracy and F1 score of
96.5% and 97% respectively.

PAPER 6: CYBERBULLYING DETECTION: HYBRID MODELS BASED ON
MACHINE LEARNING AND NATURAL LANGUAGE PROCESSING TECHNIQUES [7]

Deep neural networks are favoured over conventional machine learning techniques for the
identification of cyberbullying because of their benefits. In addition to an algorithmic
comparison of eleven classification techniques, the research proposes a novel neural
network framework with parameter optimisation. Moreover, it investigates how natural
language processing based on word embedding methods and feature extraction affects
algorithmic efficiency. When it came to detecting cyberbullying, the neural networks Bi-GRU
and Bi-LSTM performed the best.

Detection frameworks for cyberbullying online often employ traditional machine learning
methods. These methods do, however, have a limitation in that
they are unable to give highly accurate results on extremely large volumes of data for
supervised categorization. Neural networks overcome this limitation and provide better
results and more trustworthy processes. The paper covers several well-known shallow neural
network and traditional machine learning techniques, as well as the structure of each proposed
network and the methodology proposed for its classification frameworks.

The text is transformed into vector notation so that the classification algorithms can handle it.
Before conversion, the raw text is heavily pre-processed, a step known as data cleansing.
It includes the elimination of empty rows, punctuation, special characters,
etc. Conventional feature extraction methods like Count Vectorization
and TF-IDF (unigram/bigram/trigram) were employed to evaluate the model's accuracy, while the
shallow neural network representations are built using GloVe, FastText, and Paragram.
The data is divided into training and testing sets using 5-fold cross-validation.
Count Vectorization turns a group of words from
the corpus into a vector of terms: the vectorizer learns the vocabulary and then creates a
word matrix in line with it. TF-IDF determines a word's relevance to the
document from how often or infrequently the term appears; 0 indicates that the term is most
common. Word embeddings can be
obtained from text input using the unsupervised method known as Global Vectors (GloVe)
for word representation. To obtain the representations, a term co-occurrence matrix is
used, which captures the semantic relationship between terms. Terms with a
high cosine similarity are considered semantically related.
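
The 5-fold split mentioned above can be illustrated with a short index-based sketch. It is a
simplified version under stated assumptions: folds are contiguous and unshuffled, class
balance is not preserved, and the function name is hypothetical. The paper's exact splitting
procedure is not described in this summary.

```python
def k_fold_indices(n_samples, k=5):
    """Return k (train_indices, test_indices) pairs covering every sample once as test."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds

folds = k_fold_indices(10, k=5)
```

Each of the five folds holds out two of the ten samples for testing and trains on the other
eight, so every sample is used for testing exactly once.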

The simplest shallow network has an accuracy of about 90% but is unsuitable for datasets
that are not balanced. SVM achieves an accuracy of about 98.12%, whereas Logistic
Regression achieves about 97.13%. Across all assessment
metrics, most results obtained using shallow neural networks are over 95%. These
numbers are also higher than those published for conventional machine learning models.

PAPER 7: IDENTIFICATION OF POTENTIAL CYBER BULLYING TWEETS USING
HYBRID APPROACH IN SENTIMENT ANALYSIS [8]

The paper notes that existing models for cyberbullying tweet detection use bag of
words with typical classification algorithms like Logistic Regression, decision trees and so on.
However, it is extremely important to reduce false positives. Only people who post
extremely negative comments targeted at a particular person should be penalised, not
everyone, because everyone has freedom of speech. Hence the paper proposes a
hybrid approach that combines the outputs of knowledge-based approaches and machine
learning approaches.

The paper also suggests the use of lexicon-based techniques. These techniques have
been used extensively with traditional text but very little with material from Twitter, since it is
so challenging to process. Twitter contains information that is made up of emojis, hashtags,
and many variations of acronyms like "lol." Because of this, Twitter data is challenging to
evaluate and is therefore less frequently studied.

The reinforced polarity is calculated by exhaustively reviewing the sentences,
analysing them with both models and computing a score based on the individual scores of words
and letters. Part-of-speech tagging can be exploited for this purpose. For each sentence the
paper therefore uses three polarities: sentiment analysis of emoticons, sentiment analysis
from the knowledge-based approach and sentiment analysis from the machine learning approach.
This hybrid approach helps to reduce the false positive cases so that freedom of speech
is not affected.

The confusion matrix in the paper shows around 30.3% true positives, 14.5%
false negatives, 15.1% false positives and 40.0% true negatives, yielding a
higher accuracy than traditional approaches.

PAPER 8: A BAG-OF-PHONETIC-CODES MODEL FOR CYBERBULLYING
DETECTION IN TWITTER [9]

The paper cites a survey conducted by Ditch the Label showing that almost everyone is
affected by cyberbullying these days. More than 47 percent of the people surveyed have received
hateful text messages and more than 62 percent have received harsh comments
on messaging and social platforms like WhatsApp, Instagram etc.

Even after a hateful comment is detected, penalising the person responsible for it is a difficult
task, as people who post hateful comments on these platforms tend to hide their real identity,
often referred to as the “social-mask”. The paper notes that there are two techniques to
perform sentiment analysis:
• Machine Learning approach
• Lexicon based approach.

The machine learning approach uses a training dataset and trains the model based on the
features of the data, which can then be put to use for the classification of real-world data.
Lexicon-based approaches are knowledge-based and need an efficient representation for
identifying the sentiment behind a text, because of which they do not fare well when
it comes to datasets with a neutral class. Hence the paper uses machine learning
approaches.

Sarcasm and hidden meanings pose another major difficulty, as they often lead to texts
being classified in the wrong context. As cited by the paper, feeding syntactic features such
as punctuation and part of speech to a Naïve Bayes model has achieved an F1-score
of 0.7.

The methodology used in this paper starts with dataset collection and preparation from online
sources and the Twitter API. Next comes pre-processing and tokenising this data, which poses a
huge difficulty because Twitter data is noisy and inconsistent, with emoticons,
symbols, abbreviations, and mixed-language texts. The paper first converts all
characters to lowercase, removes extra symbols and punctuation, stems words, converts
emoticons to appropriate words that convey the sentiment, and removes stop words.
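
A hedged sketch of the tweet clean-up just described, limited to emoticon conversion,
lowercasing and symbol removal. The emoticon-to-word mapping and the function name are
hypothetical, and stemming and stop-word removal are left out for brevity; the paper's own
rules are not given in this summary.

```python
import re

# Hypothetical mapping from emoticons to sentiment-bearing words.
EMOTICON_WORDS = {":)": "happy", ":(": "sad", ":D": "laugh"}

def normalise_tweet(text):
    """Replace emoticons with words, lowercase, strip symbols, and split into tokens."""
    # Emoticons are replaced first because lowercasing would break ones like ':D'.
    for emoticon, word in EMOTICON_WORDS.items():
        text = text.replace(emoticon, f" {word} ")
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # drop digits, punctuation, '#', '@', ...
    return text.split()

tweet_tokens = normalise_tweet("U r SO annoying :( #loser!!!")
```

For this example the sad-face emoticon becomes the token "sad" and the hashtag and exclamation
marks are stripped, leaving six lowercase tokens.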

Next comes feature selection and preparation of the feature vector. Every tweet needs to be
converted to a fixed-length vector before it can be fed into a machine learning model. This can be
done using the bag-of-words model or the TF-IDF approach. The bag-of-words model uses
the frequency of occurrence of words as the weights. This comes with the major drawback of
not being able to capture the semantic information in the texts or tweets.

After generating the feature vector, word embeddings have to be generated, which can be
done with either frequency-based or prediction-based approaches. Both clustering
(unsupervised) and classification (supervised) are then performed on the word embeddings to
see which method fares better in this scenario.

The paper uses 3 different datasets, each either collected previously from online
sources or gathered using the Twitter API, and each containing over 35 thousand tweets.
The first dataset was used for clustering analysis and the other two datasets were used for
supervised learning. Dataset 2 achieved an accuracy of 57%, whilst dataset 3 gave an average
F1-score of 0.98 when built using a Support Vector Machine model.

PAPER 9: DETECTING A TWITTER CYBERBULLYING USING MACHINE
LEARNING [10]

The paper uses machine learning techniques like Support Vector Machines, the Naïve Bayes
classifier and so on to predict and classify cyberbullying. One notable contribution is the
testing: the authors collected real-time data from the Twitter API and fed it to the trained
model to detect cyberbullying in real time and to see how the model fares on live data.
The paper starts by importing the Twitter dataset downloaded from Kaggle and GitHub.

In the pre-processing phase, the authors use the NLTK library for Python to perform
the necessary pre-processing steps on the tweets. They start with tokenisation, using
WhitespaceTokenizer, WordPunctTokenizer, TreebankWordTokenizer and
PunktWordTokenizer. They then lowercase all tokens and
remove stop words (a, an, the, I, am, etc.). Finally, they use the
WordNetLemmatizer built into NLTK to lemmatise the words into their base forms.

The next step is feature extraction, for which the researchers use the TF-IDF vectorizer.
The data's characteristics are extracted and listed as features. Additionally, each
text's polarity (i.e., whether it contains bullying or not) is extracted and saved in the list of
features.

Next come the algorithms and error metrics used in the model. Among the many
classification approaches, the researchers list SVM and Naïve Bayes as the best
classifiers. Support Vector Machines use a hyperplane equation to divide
datasets into their respective classes, making use of support vectors, which are the
points closest to the hyperplane. The Naïve Bayes classifier applies Bayes' theorem
to estimate the probability of a class given the observed features.

The evaluation metrics used to measure the models' success are:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-Score = 2 * (Precision * Recall) / (Precision + Recall)
where TP = number of true positives, TN = number of true negatives, FN = number of false
negatives and FP = number of false positives.
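
As a small illustration of the three formulas above, the following sketch computes them from raw counts. The function name and the example counts are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F-score) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={p:.3f} recall={r:.3f} f_score={f:.3f}")
```

The guards against a zero denominator are a design choice of this sketch; libraries differ in how they report undefined precision or recall.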

Finally, both models are trained on the downloaded Kaggle dataset and evaluated on test data
that is pre-processed, passed through the TF-IDF model and cleaned of texts with false
pretences such as satire. The Naïve Bayes classifier gave an accuracy of 52.70% whilst SVM
gave an accuracy of 71.25%, suggesting that SVM is better suited to this dataset.
With regard to precision, recall and F-score, the Naïve Bayes classifier produced 52%, 52% and
53% respectively and the Support Vector Machine produced 71%, 71% and 70% respectively, which
again suggests that Support Vector Machines work well in this scenario.
Finally, as the paper proposes, the models are tested with real-time data (apart from the 45%
test data used earlier). Even in this case, the Support Vector Machine outperformed the Naïve
Bayes classifier in all aspects.

Around ten tweets were fetched from the Twitter API, of which seven were classified as
non-bullying tweets and three were classified as bullying tweets.

PAPER 10: NLP AND MACHINE LEARNING TECHNIQUES FOR DETECTING
INSULTING COMMENTS ON SOCIAL NETWORKING PLATFORMS [11]

The approach involves extracting an appropriate dataset from a variety of web sources,
pre-processing it, generating ground truth, engineering features, and choosing a classifier.
The starting point is to collect relevant data from various online platforms. The second
step is pre-processing or cleaning of the dataset: noise reduction, lowercasing,
tokenization, stemming, lemmatization, stop-word removal, etc. The next step is
feature engineering: extracting user, textual, and network features. The final step is to
perform classification using the extracted features and the ground truth.

The data consists of two attribute fields together with an identifier. The first attribute is
the timestamp of the comment's posting. It has several null instances, so an exact timestamp
is not always available. The next attribute is the actual content, given as Unicode text in
double quotes. Labelling the dataset was the most time-consuming and labour-intensive step.
The data collected for machine learning is divided into two classes: "1" for offensive
remarks and "0" for neutral remarks. The final prediction should fall in the range [0, 1].

Data pre-processing is a data mining technique that transforms raw data into a
comprehensible format. Real-world data is often inaccurate, inconsistent and incomplete, and
may lack specific behaviours or patterns. Data pre-processing prepares raw data for further
processing.

Figure 7: Data pre-processing [11]

Four kinds of classifiers are used: the most basic Logistic Regression, Support Vector
Machine, and two of the most popular ensemble methods, Random Forest Classifier and
Gradient Boosting Machine. According to the paper, random forest and gradient boosting
machines are given dense feature matrices as inputs, whereas logistic regression and support
vector machines are given sparse ones. After training, the models were applied to the test
dataset provided by Kaggle. Finally, a file containing the test dataset's predictions is
created. All prediction values fall between 0 and 1, with a score of 0 to 0.5 designating a
"non-insulting" remark and a score of 0.5 to 1 designating an "insulting" remark.

Results

All four classifiers' training accuracy ranged from 75% to 90%; however, their test accuracy
was between 50% and 55%. Though the models produced a score between 77% and 90% on the
training dataset, this performance did not transfer to the test dataset. The models fit the
training dataset closely but are unable to accurately categorise the test dataset due to
over-fitting, i.e. a high degree of variance. According to the results of the trials and the
total project effort, Logistic Regression and Random Forest Classifier trained on the feature
stack outperformed Support Vector Machine and Gradient Boosting Machine in this specific
instance.

PAPER 11: AN APPLICATION TO DETECT CYBERBULLYING USING MACHINE
LEARNING AND DEEP LEARNING TECHNIQUES [12]

Proposed Methodology

Before being fed into stacked word embeddings, the training data is cleaned and prepared.
The CNN-BiLSTM deep learning model is then trained and compared against other deep learning
models trained separately. The trained model is then saved for online use.

Figure 8: Methodology used in [12]

After pre-processing is completed, the CNN-BiLSTM model is built. A word embedding approach
is used because it solves various issues that simple one-hot vector encodings have; most
importantly, word embeddings improve generalisation and effectiveness. GloVe and FastText
word embeddings are stacked, since the best outcomes were achieved using a mix of
embeddings. After the word embeddings are stacked, the CNN-BiLSTM model is built. The
proposed CNN-BiLSTM model is then compared with an ensemble ML model in terms of accuracy.

CNN-BiLSTM Architecture:

CNN has fewer hyperparameters and needs less supervision than LSTM, although LSTM often
delivers better results. While the LSTM takes longer to evaluate, it is more accurate for
lengthier texts. A plain RNN suffers from a serious vanishing-gradient problem while
processing sequences: the farther back an input lies in the sequence, the less of it the
network retains. An RNN is also restricted when the input and output must be the same size.
BiLSTM is used to address fixed sequence-to-sequence prediction.

A CNN-BiLSTM is a bidirectional LSTM and a CNN architecture concatenated together. In the
basic formulation, it learns both character-level and word-level features for classification
and prediction. The CNN layer is used to induce character-level features: for each word, the
model applies a convolution and a max-pooling layer to per-character feature vectors, such as
character embeddings and (ideally) character type, to create a new feature vector.

Results

The performance of various activation and optimizer setups for a basic LSTM model is
compared on the basis of accuracy.

The neural network's capabilities and performance are significantly influenced by the
activation function used, and different activation functions may be applied to different parts
of the model. The sigmoid function transforms any input into a number between 0 and 1: its
output is close to zero for large negative inputs and close to one for large positive inputs.
A two-element Softmax with the second element set to zero is equivalent to a sigmoid. The
sigmoid is therefore often used in binary classification. Besides sigmoid, ReLU is also
employed as the activation for the hidden layer of the CNN-BiLSTM model. Following the
comparison, the CNN-BiLSTM model performs the best of all the models examined. The model is
fitted to the data for 10 epochs after all the layers are combined, and it achieves an
accuracy of roughly 98%.
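
The statement that a two-element Softmax with the second element fixed at zero equals a sigmoid can be checked numerically. The sketch below is only an illustration; the function names and test inputs are chosen here and do not come from the paper.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    """Softmax over a list of numbers (shifted by the max for stability)."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# softmax([x, 0])[0] = e^x / (e^x + 1) = 1 / (1 + e^-x) = sigmoid(x)
for x in (-4.0, 0.0, 4.0):
    print(x, sigmoid(x), softmax([x, 0.0])[0])
```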

PAPER 12: DETECTING CYBERBULLYING AND AGGRESSION IN SOCIAL
COMMENTARY USING NLP AND MACHINE LEARNING [13]

Dataset

The study also gathers comment threads from popular yet contentious YouTube videos that have
the potential to incite hate speech, using an HTML/CSS parser. These are stored in JSON
format with delimiters. Similar criteria are used to choose appropriate entries for
establishing ground truth.
The cyberbullying detection dataset contributed to Kaggle by Impermium is selected as the
test dataset for validating the model. The collected dataset is labelled manually: each
sample of textual data is carefully examined, interpreted, and classified in order to identify
instances of cyberbullying via hate speech and insults. The potential classes are: "Bully"
and "Non-bully" for binary classification; "Bully," "Aggressor," "Spammer," and "None" for
multiclass classification; and "0" and "1" for bully and non-bully comments,
respectively. The dataset is divided into "0" and "1" categories.

Figure 9: Feature Engineering in [13]

All the extracted feature vector sets are stacked into a single feature set made up of count
vectors and TF-IDF vectors, with both words and characters as tokens and n-grams of up to
length five. Numerous hyperparameters are explored and tuned in order to make learning more
effective. For instance, "C", a parameter in logistic regression and support vector machines,
is the inverse of the regularisation strength: smaller values of C indicate stronger
regularisation. Other hyperparameters include "learning rate," "number of subsamples," and
"number of trees in the forest" in a random forest, among others.
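
To make the idea of stacking word-level and character-level n-gram counts concrete, here is a minimal sketch. It only covers the count-vector part, uses plain dictionaries rather than the paper's tooling, and the `w:` and `c:` prefixes and function names are assumptions made for this illustration.

```python
from collections import Counter


def word_ngrams(text, max_n):
    """All word n-grams of length 1..max_n, in order of n."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def char_ngrams(text, max_n):
    """All character n-grams of length 1..max_n, in order of n."""
    s = text.lower()
    return [s[i:i + n] for n in range(1, max_n + 1) for i in range(len(s) - n + 1)]


def stacked_counts(text, max_n=5):
    """One feature dictionary holding both word and character n-gram counts."""
    feats = Counter()
    feats.update("w:" + g for g in word_ngrams(text, max_n))
    feats.update("c:" + g for g in char_ngrams(text, max_n))
    return feats


print(word_ngrams("you are so mean", 2))
```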

Results

The table shows the evaluation metrics obtained after training on the dataset and validating
with a test dataset. Training accuracy varied from 75% to 90% for all four classifiers, while
the test accuracy lay between 70% and 75%.

Figure 10: Results [13]

Dataset Description

The dataset contains more than 47,000 tweets labelled according to the class of cyberbullying:
• Age
• Ethnicity
• Gender
• Religion
• Other type of cyberbullying
• Not cyberbullying

The dataset contains the tweet and the class (among the 6 listed above) it corresponds to. The
data has been balanced in order to contain ~8000 of each class.
https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification

Pre-processing Steps

Removing stop words, punctuation, Twitter handles and URLs

Stop-word removal is one of the most commonly used pre-processing steps across different
NLP applications. The idea is simply to remove words that add no significant meaning to the
text; these generally include pronouns and articles. They are treated as noise and removed,
since the classification of a tweet should not depend on them.
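
A minimal sketch of this cleaning step, using only the Python standard library, might look as follows. The stop-word set, the regular expressions and the function name are illustrative assumptions and are not the project's actual code; the sketch also lowercases the tokens so that stop words match regardless of case.

```python
import re
import string

# Tiny illustrative stop-word set; a real pipeline would use a fuller list.
STOP_WORDS = {"a", "an", "the", "i", "am", "is", "are", "you", "he", "she",
              "it", "we", "they", "to", "of", "and"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HANDLE_RE = re.compile(r"@\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text):
    """Strip URLs, @handles and punctuation, lowercase, then drop stop words."""
    text = URL_RE.sub(" ", text)       # URLs first, before punctuation is removed
    text = HANDLE_RE.sub(" ", text)    # then Twitter handles
    text = text.translate(PUNCT_TABLE)
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]


print(clean_tweet("@user You are the WORST, see https://example.com !!!"))
```

URLs and handles are removed before punctuation because stripping punctuation first would break the patterns that identify them.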

Figure 11: Pre-processing stop words, punctuations

Stemming the words

Stemming is the process of reducing a word to its stem by stripping affixes (suffixes and
prefixes); the related process of lemmatisation reduces a word to its dictionary root form,
known as the lemma. The same word may have different word forms, as illustrated below. The
idea is to use the root form of the word instead of its various forms.

Figure 12: Stemming process

Lowercasing each token

This is just to normalise the whole corpus. We could instead uppercase everything too.

Implementation

Methodology

The data comprises two columns: the tweet and the corresponding cyberbullying class. This
data has to be pre-processed before feature extraction, using the pre-processing steps
described above. After pre-processing, the tweet data has to be represented in numerical
form. This can be done with either the Continuous Bag of Words model or a TF-IDF vectorizer,
which serves as the feature extractor. In this project, we used the TF-IDF vectorizer.

Figure 13: Methodology used

TF-IDF/Term Frequency Technique:

TF-IDF scores each word in a document by how frequently it occurs in that document, scaled
down by how common the word is across the whole corpus. It addresses a limitation of the
plain Bag of Words technique, which turns text into numbers for classification but gives
every word the same importance regardless of how informative it is.

The TF-IDF is defined by
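
The report's own formula is not reproduced here. As an assumption, the sketch below uses one commonly used form, tf-idf(t, d) = tf(t, d) * log(N / df(t)), where tf is the relative frequency of term t in document d, N is the number of documents and df(t) is the number of documents containing t. Library vectorizers such as scikit-learn's apply smoothing and normalisation by default, so their numbers will differ from this plain version.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))              # document frequency of each term
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already tokenised.
docs = [["you", "are", "stupid"], ["you", "are", "kind"], ["have", "a", "nice", "day"]]
w = tf_idf(docs)
print(w[0])
```

In the toy corpus, "stupid" occurs in one document and "you" in two, so "stupid" receives the larger weight in the first document.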

Continuous Bag of words model:

CBOW, or Continuous Bag of Words, trains a neural network to learn word embeddings by
predicting a target word from its context, where the context is represented by multiple
surrounding words.
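
The training examples that a CBOW model consumes can be sketched as (context words, target word) pairs. The helper below is a hypothetical illustration of how such pairs could be generated; it does not train any network, and the window size is an assumed parameter.

```python
def cbow_pairs(tokens, window=2):
    """Build (context, target) pairs: up to `window` words on each side of the target."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:                      # skip one-word inputs with no context
            pairs.append((context, target))
    return pairs


for context, target in cbow_pairs(["stop", "being", "so", "mean"], window=1):
    print(context, "->", target)
```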

Figure 14: CBOW model as an alternative to TF-IDF

The embedding matrix contains the word feature representation (or embedding) that will be
used by the model for classification.

Classification algorithms like Logistic Regression, Support Vector Machines, Naïve Bayes,
Random Forest, Gradient Boosting were used to classify the type of cyberbullying.

Multinomial Naïve Bayes Classifier

The Multinomial NB classifier is a probabilistic machine learning technique mostly employed
in natural language processing. NB classifiers are built on Bayes' theorem and have been used
to design cyberbullying prediction models. The classifier operates under the assumption that
the features are independent of one another given the class: the presence or absence of one
feature has no bearing on the presence or absence of any other.

This approach assumes that a parametric model generates the text and uses the training data
to estimate the model's Bayes-optimal parameters. It then classifies the test data using
those estimates. NB classifiers can support an arbitrary number of distinct continuous or
categorical features. Under the assumption that the features are independent, the task of
estimating a high-dimensional density is reduced to estimating one-dimensional densities.
The NB algorithm is thus a learning algorithm built on the application of Bayes' theorem
with strong (naive) independence assumptions.
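
To show how the independence assumption turns into a calculation, here is a small from-scratch sketch of a multinomial Naïve Bayes text classifier. It is not the implementation used in the project; the class name, the Laplace smoothing constant `alpha` and the toy sentences are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Minimal multinomial Naive Bayes over token lists, with Laplace smoothing."""

    def fit(self, docs, labels, alpha=1.0):
        self.alpha = alpha
        self.n_docs = len(docs)
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {t for tokens in docs for t in tokens}
        return self

    def log_posterior(self, tokens, label):
        # log P(class) + sum of log P(word | class); the sum is where the
        # independence assumption is used.
        log_p = math.log(self.class_counts[label] / self.n_docs)
        total = sum(self.word_counts[label].values())
        denom = total + self.alpha * len(self.vocab)
        for t in tokens:
            if t in self.vocab:          # words never seen in training are ignored
                log_p += math.log((self.word_counts[label][t] + self.alpha) / denom)
        return log_p

    def predict(self, tokens):
        return max(self.class_counts, key=lambda c: self.log_posterior(tokens, c))


# Hypothetical toy training data.
docs = [["you", "are", "ugly"], ["nobody", "likes", "you"],
        ["have", "a", "nice", "day"], ["thanks", "for", "the", "help"]]
labels = ["bullying", "bullying", "not_bullying", "not_bullying"]
model = TinyMultinomialNB().fit(docs, labels)
print(model.predict(["you", "are", "ugly", "loser"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.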

Logistic Regression

Logistic regression is one of the well-known methods that machine learning borrowed from the
field of statistics. It uses the logistic function to model the probability of a class and
yields a linear decision boundary (a hyperplane) between two classes. Logistic regression
can take a sparse feature matrix as input. For training purposes, the sparse feature vector
matrix is transformed into a dense matrix where appropriate.

The logistic regression algorithm takes the features and generates a prediction based on the
probability that a class is appropriate for the input. For instance, the instance is assigned
to the positive class if the probability is at least 0.5; otherwise, it is assigned to the
other (negative) class. Logistic regression has been employed in the implementation of
predictive cyberbullying models.

$h_\theta(x) = \dfrac{1}{1 + e^{-\theta^{T} x}}$
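
The logistic hypothesis can be evaluated directly, as in the sketch below. It assumes the usual convention that a probability of 0.5 or more maps to the positive class; the parameter vector `theta` and the inputs are hypothetical values chosen for illustration.

```python
import math


def hypothesis(theta, x):
    """h_theta(x) = 1 / (1 + exp(-theta^T x))."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


def predict(theta, x, threshold=0.5):
    """Return 1 (positive class) if the estimated probability reaches the threshold."""
    return 1 if hypothesis(theta, x) >= threshold else 0


theta = [-1.0, 2.0]      # hypothetical parameters: bias and one weight
x = [1.0, 1.5]           # leading 1.0 is the bias input
print(hypothesis(theta, x), predict(theta, x))
```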

Random Forest Classifier

Random Forest is a machine learning algorithm used for both classification and regression
tasks. It is an ensemble learning method that combines multiple decision trees and makes
predictions based on the output of each individual tree.

The algorithm works by building a large number of decision trees, each trained on a different
subset of the training data, and with a different subset of features. Each tree in the forest
produces a prediction for the class of the input data point, and the final prediction is made by
taking the majority vote of all the trees.

Here are the steps involved in building a Random Forest Classification Model:

1. Randomly select a subset of the training data
2. Randomly select a subset of features from the available set of features
3. Build a decision tree using the selected subset of data and features
4. Repeat steps 1-3 to build multiple decision trees
5. Make predictions by taking the majority vote of the output of all the trees
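
The five steps above can be sketched in a few lines. To keep the example short, each "tree" here is a one-split decision stump on a single randomly chosen feature, and only two classes are supported; both are simplifications assumed for this illustration and not how a full random forest builds its trees. All names and the toy data are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """One-split 'tree' on one feature: threshold halfway between the two class means."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row[feature])
    if len(by_class) == 1:                       # sample happened to hold one class only
        only = next(iter(by_class))
        return lambda row: only
    (c0, v0), (c1, v1) = sorted(by_class.items(), key=lambda kv: sum(kv[1]) / len(kv[1]))
    threshold = (sum(v0) / len(v0) + sum(v1) / len(v1)) / 2
    return lambda row: c1 if row[feature] > threshold else c0


def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):                                   # step 4: repeat
        idx = [rng.randrange(n) for _ in range(n)]             # step 1: bootstrap sample
        feature = rng.randrange(n_features)                    # step 2: feature subset (size 1)
        forest.append(train_stump([rows[i] for i in idx],      # step 3: build the tree
                                  [labels[i] for i in idx], feature))
    return forest


def predict(forest, row):
    votes = Counter(tree(row) for tree in forest)              # step 5: majority vote
    return votes.most_common(1)[0][0]


# Hypothetical, well-separated two-class data with two features.
rows = [[i * 0.1, i * 0.1] for i in range(10)] + [[5 + i * 0.1, 5 + i * 0.1] for i in range(10)]
labels = [0] * 10 + [1] * 10
forest = train_forest(rows, labels)
print(predict(forest, [0.2, 0.3]), predict(forest, [5.5, 5.2]))
```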

Support Vector Machines

Support Vector Machines (SVM) is a powerful machine learning algorithm used for
classification, regression, and outlier detection tasks. It is a supervised learning algorithm that
works by finding the best boundary between classes in the data. SVMs are particularly
effective in dealing with complex datasets where there are many features or where the data is
not linearly separable.

The basic idea behind SVM is to find a hyperplane that separates the data points into different
classes with the largest possible margin. In other words, SVM tries to find the optimal
boundary that maximizes the distance between the closest data points from each class. This
distance is known as the margin. Here are the steps involved in building an SVM
Classification Model:
1. Given a training dataset with input features and class labels, the SVM algorithm looks
for a hyperplane that separates the data into two classes
2. Among all separating hyperplanes, the algorithm chooses the one that maximizes the margin
between the two classes, while penalising training points that are misclassified or
fall inside the margin.
3. If the data is not linearly separable, the algorithm transforms the data into a higher-
dimensional space using a kernel function. This helps to separate the data into
different classes.
4. Once the hyperplane is found, the algorithm makes predictions on new data points by
classifying them based on which side of the hyperplane they lie.
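
Step 4 and the notion of margin can be illustrated for the linear case. The sketch below does not train an SVM; it assumes a hyperplane w·x + b = 0 has already been found in canonical form (support vectors satisfy |w·x + b| = 1), in which case the margin width is 2 / ||w||. The weights, bias and points are hypothetical.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Width of the margin, 2 / ||w||, for a hyperplane in canonical form."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


w, b = [1.0, 1.0], -3.0          # hypothetical hyperplane x1 + x2 = 3
print(classify(w, b, [3.0, 2.0]), classify(w, b, [0.5, 1.0]), margin_width(w))
```

Because the margin is 2 / ||w||, maximizing the margin is equivalent to minimizing the norm of w.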

Gradient Boosting

Gradient Boosting is a machine learning algorithm used for both classification and regression
tasks. It is an ensemble learning method that combines multiple weak models to create a
strong model. Gradient Boosting works by sequentially adding weak learners to the model,
with each new model trying to correct the errors of the previous model.

In Gradient Boosting Classification, the algorithm learns to predict the probability of each
class by iteratively adding decision trees to the model. Each decision tree is fitted to the
errors (the gradient of the loss) of the current ensemble, and may optionally be built on a
subset of the training data and features. The algorithm then combines the predictions of all
the trees to make a final prediction.
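
The idea of each new learner correcting the errors of the previous ones can be sketched with one-dimensional inputs. As a simplifying assumption, the sketch uses squared-error loss with regression stumps and treats the 0/1 labels as numeric targets, whereas gradient boosting classifiers typically optimise a log-loss; the function names, learning rate and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regressor on 1-D inputs under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:       # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Return (predict function, training MSE recorded before each round)."""
    base = sum(ys) / len(ys)                     # start from the mean prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    history = []
    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]   # current errors
        history.append(sum(r * r for r in residuals) / len(xs))
        stumps.append(fit_stump(xs, residuals))                # fit the next weak learner
    return predict, history


xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]          # hypothetical labels used as numeric targets
predict, history = boost(xs, ys)
print(history[0], history[-1], predict(2), predict(5))
```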

Results and Discussion

The classification reports for all the models are shown below. GridSearchCV is also used to
tune the hyperparameters of each model. GridSearchCV (short for Grid Search Cross
Validation) is a hyperparameter tuning technique used in machine learning to find the optimal
set of hyperparameters for a given model. Hyperparameters are the parameters of the machine
learning algorithm that are set before the training process begins and are not learned from
the data.

GridSearchCV works by exhaustively searching over a specified parameter grid, which is a set
of combinations of hyperparameter values, to find the optimal set of hyperparameters for a
given model. The optimal set of hyperparameters is the one that produces the best
cross-validation performance on the training data.
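
The exhaustive search with cross-validation can be sketched without any library. The code below is a standalone illustration of the idea, not scikit-learn's GridSearchCV; the `evaluate` callback, the fold scheme and the toy scoring function are assumptions made here, with `toy_evaluate` standing in for "train on the training fold and score on the validation fold".

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val


def grid_search(param_grid, evaluate, n_samples, k=5):
    """Try every combination in param_grid; keep the best mean validation score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [evaluate(params, train, val)
                  for train, val in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_evaluate(params, train_idx, val_idx):
    # Hypothetical stand-in for model fitting; peaks at C=1.0 with a linear kernel.
    return -abs(params["C"] - 1.0) - (0.0 if params["kernel"] == "linear" else 0.1)


grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
print(grid_search(grid, toy_evaluate, n_samples=20, k=5))
```

The cost grows with the product of the grid sizes times the number of folds, which is why the grid is usually kept small.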

Figure 15: SVM Report

Figure 16: Logistic Regression Report

Figure 17: Gradient Boosting Report

Figure 18: Random Forest Report

Figure 19: Naive Bayes Report

From the above classification reports, it can be seen that the Random Forest Classifier
achieved the highest accuracy, 93% (Figure 18). Naïve Bayes (Figure 19) has the lowest
accuracy of all the algorithms, 84%. The remaining models, Logistic Regression, Gradient
Boosting and SVM, each achieve 92% accuracy. The classification reports also show the
precision, recall, f1-score and support for each class (Age, Ethnicity, Gender,
Non-cyberbullying, and Religion).

Conclusion

Recent research on cyberbullying detection uses supervised learning to build machine
learning models. The majority of that effort focuses on feature engineering, that is,
discovering traits that can distinguish between bullying and non-bullying remarks. According
to the results of the trials and the body of work, the random forest classifier trained on
the feature stack outperformed the logistic regression, naïve Bayes, SVM, and gradient
boosting models in this specific instance. The random forest classifier achieved an accuracy
of 93%, whereas naïve Bayes achieved only 84%.

Future Scope

Here are some potential areas for future development and research:

• Multimodal Detection: Currently, most of the research in this area is focused on text-
based detection, but cyberbullying can also involve images, videos, and other
multimedia. Future research can explore the development of models that can analyse
multiple modes of communication for better detection and classification of
cyberbullying. [15] proposes a model in this aspect.
• Real-time Detection: Most of the existing models are trained on historical data, which
limits their effectiveness in detecting new forms of cyberbullying. Developing models
that can detect cyberbullying in real time, along the lines of [14], is a potential
direction for future work.
• Context-based Detection: The context of a communication can have a significant
impact on whether it is considered cyberbullying or not. Future models can be
developed to take into account the context of a communication, such as the
relationship between the sender and the receiver, the language used, and other
contextual factors.
• Multilingual Detection: Currently, most of the research has been conducted on
English language-based cyberbullying detection. Future research can focus on
developing models for detecting cyberbullying in other languages to make the
detection process more effective.

References

[1] Fang, X., & Zhan, J. (2015). Sentiment analysis using product review data. Journal of Big
Data, 2(1). https://doi.org/10.1186/s40537-015-0015-2

[2] Wang, K., Cui, Y., Hu, J., Zhang, Y., Zhao, W., & Feng, L. (2020). Cyberbullying
Detection, Based on the FastText and Word Similarity Schemes. ACM Transactions on Asian
and Low-Resource Language Information Processing, 20(1), 1–15.
https://doi.org/10.1145/3398191

[3] El Ali, A., Stratmann, T. C., Park, S., Schöning, J., Heuten, W., & Boll, S. C. (2018).
Measuring, Understanding, and Classifying News Media Sympathy on Twitter after Crisis
Events. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems.
https://doi.org/10.1145/3173574.3174130

[4] Ruest, N. (2017, February 7). #PARIS #Bataclan #parisattacks #porteouverte tweets.
Borealis, from https://hdl.handle.net/10864/11312

[5] Hani, J., Mohamed, N., Ahmed, M., Emad, Z., Amer, E., & Ammar, M. (2019). Social
media cyberbullying detection using machine learning. International Journal of Advanced
Computer Science and Applications, https://dx.doi.org/10.14569/IJACSA.2019.0100587

[6] Shah, K., Phadtare, C., & Rajpara, K. (2022). Cyber-Bullying Detection in Hinglish
Languages Using Machine Learning. International Journal of Engineering and Technical
Research, 11, 439.

[7] Raj, C., Agarwal, A., Bharathy, G., Narayan, B., & Prasad, M. (2021). Cyberbullying
detection: hybrid models based on machine learning and natural language processing
techniques. Electronics, 10(22), 2810. https://doi.org/10.3390/electronics10222810

[8] A. Mody, S. Shah, R. Pimple and N. Shekokar, "Identification of Potential Cyber
Bullying Tweets using Hybrid Approach in Sentiment Analysis," 2018 International
Conference on Electrical, Electronics, Communication, Computer, and Optimization
Techniques (ICEECCOT), Mysuru, India, 2018, pp. 878-881, doi:
10.1109/ICEECCOT43722.2018.9001476.

[9] A. Shekhar and M. Venkatesan, "A Bag-of-Phonetic-Codes Model for Cyber-Bullying
Detection in Twitter," 2018 International Conference on Current Trends towards Converging
Technologies (ICCTCT), Coimbatore, India, 2018, pp. 1-7, doi:
10.1109/ICCTCT.2018.8550938.

[10] R. R. Dalvi, S. Baliram Chavan and A. Halbe, "Detecting A Twitter Cyberbullying
Using Machine Learning," 2020 4th International Conference on Intelligent Computing and
Control Systems (ICICCS), Madurai, India, 2020, pp. 297-301, doi:
10.1109/ICICCS48265.2020.9120893.

[11] Sharma, H. K., & Kshitiz, K. (2018, June). Nlp and machine learning techniques for
detecting insulting comments on social networking platforms. In 2018 International
Conference on Advances in Computing and Communication Engineering (ICACCE) (pp.
265-272). IEEE.

[12] Raj, M., Singh, S., Solanki, K., & Selvanambi, R. (2022). An application to detect
cyberbullying using machine learning and deep learning techniques. SN computer science,
3(5), 401.

[13] Sahay, K., Khaira, H. S., Kukreja, P., & Shukla, N. (2018). Detecting cyberbullying and
aggression in social commentary using nlp and machine learning. International Journal of
Engineering Technology Science and Research, 5(1), 1428-1435.

[14] Raj M, Singh S, Solanki K, Selvanambi R. An Application to Detect Cyberbullying
Using Machine Learning and Deep Learning Techniques. SN Comput Sci. 2022;3(5):401.
doi: 10.1007/s42979-022-01308-5. Epub 2022 Jul 26. PMID: 35911437; PMCID:
PMC9321314.

[15] Roy, P.K., Mali, F.U. Cyberbullying detection using deep transfer learning. Complex
Intell. Syst. 8, 5449–5467 (2022). https://doi.org/10.1007/s40747-022-00772-z

No ratings yet
Multivariate Time Series Data Prediction Based On
14 pages
Baughman Don Marianne 1977 Nigeria
No ratings yet
Baughman Don Marianne 1977 Nigeria
11 pages
Exam
No ratings yet
Exam
10 pages
R. P. Sethu Pillai
No ratings yet
R. P. Sethu Pillai
3 pages
(Data Structure AND Algorathims) : (Teacher: MR Yang Weichao)
No ratings yet
(Data Structure AND Algorathims) : (Teacher: MR Yang Weichao)
6 pages
K Mediods Clustering Implementation
No ratings yet
K Mediods Clustering Implementation
5 pages
Bio f4 Endocrine System
No ratings yet
Bio f4 Endocrine System
6 pages
Mid-Term Year 5 Paper 2 (2021)
No ratings yet
Mid-Term Year 5 Paper 2 (2021)
6 pages
Data Science Fundamentals and Practical Approaches: Understand Why Data Science Is the Next (English Edition)
From Everand
Data Science Fundamentals and Practical Approaches: Understand Why Data Science Is the Next (English Edition)
Dr. Gypsy Nandi
No ratings yet
Ethical Hacker's Certification Guide (CEHv11): A comprehensive guide on Penetration Testing including Network Hacking, Social Engineering, and Vulnerability Assessment
From Everand
Ethical Hacker's Certification Guide (CEHv11): A comprehensive guide on Penetration Testing including Network Hacking, Social Engineering, and Vulnerability Assessment
Mohd Sohaib
No ratings yet

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Keywords: Tweet, Cyberbullying, NLP, Classification

PAPER 1: SENTIMENT ANALYSIS USING PRODUCT REVIEW DATA [1]

Figure 1: Amazon Review System [1]

An algorithm was proposed and implemented for negation phrases identification. A

Results and conclusions

On manually-labelled sentences, the classification models (SVM, Naïve Bayes, Random

Figure 2: ROC curves based on complete set [1]

Figure 3: Cyberbullying Dataset Summary used in [2]

Figure 4: Results of [2]

PAPER 3: MEASURING, UNDERSTANDING, AND CLASSIFYING NEWS MEDIA

Figure 5: Top 5 most frequent hashtags [3]

Temporal event slicing and sampling

Figure 6: Results [3]

PAPER 4: SOCIAL MEDIA CYBERBULLYING DETECTION USING MACHINE

PAPER 5: CYBER-BULLYING DETECTION IN HINGLISH LANGUAGES USING

Cyberbullying can cause depression, low self-esteem, and emotional problems in victims.

PAPER 6: CYBERBULLYING DETECTION: HYBRID MODELS BASED ON

PAPER 7: IDENTIFICATION OF POTENTIAL CYBER BULLYING TWEETS USING

The error metrics used to assess the models’ success include,

Figure 7: Data pre-processing [11]

PAPER 11: AN APPLICATION TO DETECT CYBERBULLYING USING MACHINE

Figure 8: Methodology used in [12]

PAPER 12: DETECTING CYBERBULLYING AND AGGRESSION IN SOCIAL

Figure 9: Feature Engineering in [13]

Figure 10: Results [13]

Removing stop words, punctuation, Twitter handles, and URLs.
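The cleaning step described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library; the stop-word list, the regular expressions, and the function name `clean_tweet` are assumptions made for the example and are not taken from the report.

```python
import re
import string

# Small illustrative stop-word list (an assumption; the report's actual list is not shown).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "and", "of", "in", "you", "so"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # http(s) links and bare www links
HANDLE_RE = re.compile(r"@\w+")                  # Twitter handles such as @user
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @handles, punctuation, and stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)      # remove URLs before punctuation is stripped
    text = HANDLE_RE.sub(" ", text)   # remove handles while the '@' is still present
    text = text.translate(PUNCT_TABLE)
    return [tok for tok in text.split() if tok not in STOP_WORDS]
```

URLs and handles are removed before punctuation so that characters such as `@`, `:` and `/` are still available to mark them.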

Figure 11: Pre-processing of stop words and punctuation

Stemming the words

Figure 12: Stemming process

Figure 13: Methodology used

TF-IDF/Term Frequency Technique:

The TF-IDF is defined by
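As a rough illustration of how TF-IDF weights can be computed, the sketch below uses one common variant: term frequency as count divided by document length, and inverse document frequency as log(N / df). That choice of variant, and the function name `tf_idf`, are assumptions for the example and may differ from the definition used in the report.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document.

    Uses tf = count / document length and idf = log(N / df); this is one
    common variant and an assumption, not necessarily the report's formula.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

With this variant, a term that occurs in every document receives weight zero, which is the intended effect: such terms do not help tell documents apart.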

Figure 14: CBOW model as an alternative to TF-IDF

Multinomial Naïve Bayes Classifier

The probabilistic machine learning technique known as Multinomial NB classifier is mostly

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}$$
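Reading the hypothesis above as the standard logistic (sigmoid) function applied to the dot product of the parameters and the features, it can be evaluated as in the sketch below. The function names `sigmoid` and `hypothesis` are illustrative, and the two-branch form is only there to avoid overflow for large negative inputs.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # z < 0, so exp(z) cannot overflow
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x): sigmoid of the dot product theta^T x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)
```

The output lies between 0 and 1 and is usually read as the estimated probability of the positive class, with 0.5 as a typical decision threshold.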

Random Forest Classifier

1. Randomly select a subset of the training data
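The random selection in step 1 is commonly done by bootstrap sampling, that is, drawing rows with replacement until the sample is as large as the training set. Whether the report samples this way is an assumption; the helper name `bootstrap_sample` is hypothetical.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement (a bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]

# Example: a seeded generator makes the draw reproducible.
sample = bootstrap_sample(["t1", "t2", "t3", "t4"], random.Random(0))
```

Each tree in the forest would be trained on a different sample drawn this way, so individual trees see slightly different data.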

Support Vector Machines

Results and Discussion

GridSearchCV works by exhaustively searching over a specified parameter grid, which is a
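The idea of exhaustive search over a parameter grid can be sketched without any library as below. This is not scikit-learn's implementation: `grid_search` and `score_fn` are hypothetical names, and `score_fn` stands in for whatever evaluation is used (for example, a mean cross-validated score).

```python
from itertools import product

def grid_search(score_fn, param_grid):
    """Try every combination in param_grid and return (best_params, best_score).

    score_fn is a hypothetical callable that takes keyword parameters and
    returns a score to maximise (for example, mean cross-validated accuracy).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    # product(...) enumerates the full Cartesian product of candidate values.
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

The number of evaluations is the product of the list lengths in the grid, so the cost grows quickly as parameters or candidate values are added.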

Figure 15: SVM Report

Figure 19: Naive Bayes Report

[8] A. Mody, S. Shah, R. Pimple and N. Shekokar, "Identification of Potential Cyber

[9] A. Shekhar and M. Venkatesan, "A Bag-of-Phonetic-Codes Model for Cyber-Bullying

[10] R. R. Dalvi, S. Baliram Chavan and A. Halbe, "Detecting A Twitter Cyberbullying

[14] Raj M, Singh S, Solanki K, Selvanambi R. An Application to Detect Cyberbullying
