NLP - J - Final ReviewReport - Cyberbullying
VIT Chennai
Vandalur - Kelambakkam Road, Chennai - 600 127
Team Member(s):
With the rise of social media coupled with the Covid-19 pandemic, cyberbullying has reached
an all-time high. We can combat this by creating models to automatically flag potentially
harmful tweets as well as break down the patterns of hatred. As social media usage becomes
increasingly prevalent in every age group, a vast majority of citizens rely on this essential
medium for day-to-day communication. Social media’s ubiquity means that cyberbullying
can effectively impact anyone at any time or anywhere, and the relative anonymity of the
internet makes such personal attacks more difficult to stop than traditional bullying. By
leveraging Natural Language Processing, cyberbullying tweets can be detected and
traced.
Introduction
As social media usage becomes increasingly prevalent in every age group, a vast majority of
citizens rely on this essential medium for day-to-day communication. Social media’s ubiquity
means that cyberbullying can effectively impact anyone at any time or anywhere, and the
relative anonymity of the internet makes such personal attacks more difficult to stop than
traditional bullying.
On April 15th, 2020, UNICEF issued a warning in response to the increased risk of
cyberbullying during the COVID-19 pandemic due to widespread school closures, increased
screen time, and decreased face-to-face social interaction. The statistics of cyberbullying are
outright alarming: 36.5% of middle and high school students have felt cyberbullied and 87%
have observed cyberbullying, with effects ranging from decreased academic performance to
depression to suicidal thoughts.
Despite the severity of the issue, very few effective attempts to identify abusive conduct have
been made, either by the academic community or by social media platforms. This is because of a
number of intrinsic challenges, such as poor grammar, syntactic errors, and very narrow context.
Beyond overtly harsh words, aggression and bullying may also take many other
forms, such as frequent sarcasm and trolling.
With the rise of social media coupled with the Covid-19 pandemic, cyberbullying has reached an
all-time high. We can combat this by creating models to automatically flag potentially harmful
tweets as well as break down the patterns of hatred. This project aims to achieve this by using
concepts of Natural Language Processing and Sentiment Analysis.
Literature Review
Sentiment analysis, also referred to as opinion mining, studies and evaluates people’s
attitudes towards a certain entity. This paper tackles the problem of sentiment
polarity categorization for product reviews. This is a difficult process with several
obstacles. The first is that since people can freely post reviews, the quality of the reviews
cannot be guaranteed. There are cases where spam is posted instead of product reviews: some
spam is meaningless and some contains fake reviews. The second problem is that the basis
for classifying each review is arbitrary: it is unclear by what standard a review should be
judged positive, negative or neutral.
The data used for this paper is from Amazon product reviews collected between February and
April 2014. The above-mentioned problems are somewhat overcome in the following two
ways: each product review is inspected before it can be posted, and each review
must have a rating, which can be used as the ground truth. This rating is based on a 5-star
scale, with 5 being the highest (positive) rating and 1 the lowest (negative).
All objective content in a sentence is removed and only subjective content is used for
analysis. A sentiment sentence is one that contains at least one positive or negative
word. All the sentences were first tokenized into separate English words, and each sentence
was tagged using a POS tagger. 25 million adjectives, over 22 million adverbs, and over 56
million verbs were tagged across all the sentiment sentences. The paper used an algorithm for
negation phrase identification which classifies negations into negation-of-adjective (NOA)
and negation-of-verb (NOV). The algorithm identified 21,586 different phrases
with a total occurrence of over 0.68 million, each of which has a negative prefix.
For phrase tokens, 3,023 phrases were selected from the 21,586 identified sentiment phrases;
each of the 3,023 phrases occurs at least 30 times. Given a token
t, the formula for t’s sentiment score (SS) computation is given as:
Feature vector formation
Two binary strings are used to represent each token’s appearance. One string with 11,478 bits
is used for word tokens, while the other one with a bit-length of 3,023 is applied for phrase
tokens. For instance, if the ith word (phrase) token appears, the word (phrase) string’s ith bit
is flipped from “0” to “1”. A hash value of each string is then computed and saved.
Hence, a sentence-level feature vector has four elements in total: two hash values computed
from the flipped binary strings, an averaged sentiment score, and a ground truth label.
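The feature vector construction described above can be sketched roughly as follows. This is only an illustration: the toy vocabularies, the use of SHA-256 as the hash function, and the helper names are assumptions, since the hash function used in the paper is not stated here.

```python
import hashlib

def token_bit_string(tokens, vocabulary):
    """Return a '0'/'1' string whose ith bit is '1' if the ith vocabulary token appears."""
    present = set(tokens)
    return "".join("1" if term in present else "0" for term in vocabulary)

def hash_bits(bits):
    """Hash a bit string to an integer (SHA-256 is an assumption, not the paper's choice)."""
    return int(hashlib.sha256(bits.encode("ascii")).hexdigest(), 16)

def sentence_feature_vector(word_tokens, phrase_tokens, word_vocab, phrase_vocab,
                            sentiment_scores, label):
    """Build the four-element vector: two hashes, an averaged score, and the label."""
    word_bits = token_bit_string(word_tokens, word_vocab)
    phrase_bits = token_bit_string(phrase_tokens, phrase_vocab)
    # Average the sentiment scores of the tokens that have a known score.
    scored = [sentiment_scores[t] for t in word_tokens + phrase_tokens
              if t in sentiment_scores]
    average_score = sum(scored) / len(scored) if scored else 0.0
    return [hash_bits(word_bits), hash_bits(phrase_bits), average_score, label]
```

In the paper the two vocabularies would hold 11,478 word tokens and 3,023 phrase tokens; the sketch works the same way for any vocabulary size.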
On machine-labelled sentences, the SVM model shows the most significant improvement, from
0.61 to 0.94, as its training data increased from 180 to 1.8 million. The model outperforms the
Naïve Bayesian model and becomes the second-best classifier on subset C and the full set. The
Random Forest model again performs the best for datasets of all scopes. The paper plots
ROC curves based on the results of the full set.
In the study, vocabulary and syntax are first used to analyse the features of cyberbullying.
Then, a new recognition technique based on word similarity and FastText is proposed. The
effectiveness and performance of the proposed method are then assessed through
experiments. The results indicate that the proposed approach is capable of significantly
improving both the accuracy and the recall of cyberbullying detection.
In cyberbullying situations, the Word Similarity technique is generally used to analyse the
text's morphology. In order to determine how similar terms are to the vocabulary used in
cyberbullying and to pinpoint the words that are definitely utilised in bullying, the Word2vec
model and cosine similarity are also used. In order to identify any implicit terms associated with
bullying in the text, the FastText approach next looks at how contextual texts relate to one
another.
Model Training
Each word in the sample is compared with the offensive terms in the training set to measure how
similar they are, and the greatest value obtained is used to estimate how likely the sample is to
be cyberbullying. Let B = {b1, b2, ..., bn} be the collection of insulting words and S
= {S1, S2, ..., SN} be the training samples. For sample Si ∈ S, i = 1, 2, ..., N, the word
segmentation result is denoted as Si = {s1, s2, ..., smi}. The Word2vec word vector of the word s is
represented as σ(s). Then, the possibility that Si is marked as cyberbullying can be expressed
as follows:
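The scoring rule described in this paragraph can be sketched as below. This is an illustrative reading that takes the score as the maximum cosine similarity between any word of the sample and any insulting word; the small `embed` dictionary is a hypothetical stand-in for Word2vec vectors σ(s), and the function names are not from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def bullying_score(sample_words, insulting_words, embed):
    """Greatest similarity between any sample word and any insulting word.

    Words without a vector are skipped; the score is floored at 0.0.
    """
    best = 0.0
    for s in sample_words:
        if s not in embed:
            continue
        for b in insulting_words:
            if b not in embed:
                continue
            best = max(best, cosine_similarity(embed[s], embed[b]))
    return best
```

A sample would then be flagged when its score exceeds a chosen threshold; the threshold value is not given in this summary.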
In order to reduce the number of similarity calculations, the following two measures are
applied:
(1) “Remove stop words. More than 891 common stop words, including “am,” “about,” and
“believe,” are collected.”
(2) “Remove common words from the non cyber bullying class. The frequency of words in
the training set is counted. NumCB (s) and NumNCB (s) denote the number of times the
word s appears in the cyberbullying and the non cyber bullying, respectively.” The following
calculation should be carried out:
If common(s) > 1, then s is considered a common word in the NCB class and can therefore be
removed.
The words and phrases in the input layer are used by FastText to build the feature vector,
which is then linearly transformed and mapped to the hidden layer. The Huffman tree is built
using the weights assigned to each category and the model parameters after the hidden layer
has solved the maximum likelihood problem. The calculation of the output is optimised via
hierarchical softmax based on the Huffman tree.
The FTSW method performs best on cyberbullying cases. It has the
highest precision and F1 score, although its recall is lower. The accuracy of the FTSW model is
the highest of all models. For cyberbullying detection, high accuracy is what
matters most.
The paper focuses on the coverage and sympathy bias between Arab and Western media after
the 2015 Beirut and Paris terrorist attacks. For 2,390 tweets in four different languages
(English, Arabic, French, and German), sympathy and sentiment labels were crowdsourced.
Then a regression model was built to characterize sympathy, and a deep convolutional
neural network was trained to predict it. In the paper, media bias research was used to examine
news on Twitter. Instead of focusing on whether media biases exist on Twitter specifically as
a social media platform, Twitter is used as a journalism tool to study timely news reporting
and offer the first steps in developing a system for categorising sympathetic tweets.
906,583 tweets on the Beirut bombing were collected shortly after the news broke on
November 12, 2015. This was done using the Twitter hashtags #beirut, #lebanon,
#beirut2paris, #beirutattacks and #beirutbombing. The dataset consisted of 667,073 retweets and
610,879 unique users. After removing the duplicates (retweets), 239,093 unique tweets
remained. For Paris, 5,339,452 tweets posted during the two days after the attacks were collected
using the hashtags #paris, #france, #parisattacks, #prayforparis and #porteouverte. 74.78% of
the tweets were retweets and there were 2,538,348 unique users. In addition to these, the Paris
dataset created by Nick Ruest [4] was also used.
Since coverage of the attacks in Beirut and Paris differed temporally, normalisation was
required, as tweets posted five days after the attacks may differ from those posted three
weeks after the attacks. The normalisation is constrained by the size and coverage of the
smaller Beirut dataset. Therefore, the Paris dataset was sliced to match the coverage length
of the Beirut data.
This time slicing further reduced the size of the dataset to 7,768 distinct tweets, including
coverage of Beirut by Western media (N=131), Paris by Western media (N=5,298), Beirut by
Arab media (N=287), and Paris by Arab media (N=1,566).
The data then had to be sent to the crowd for annotation. To avoid lengthy
crowdwork time and costs, only a sample of each dataset was sent. However,
random sampling would not suffice, as the most important headlines appear in the three days
after the attacks. Therefore, the data was divided into 24-hour buckets
and samples were drawn from each bucket. The normalization constant was
calculated by dividing the size of the desired sample draw (1,000) by the total number of
rows in each dataset. For each bucket, the sample drawn was the number of records in that
bucket multiplied by the normalization constant, rounded so that the day buckets together cap at
1,000 records.
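The bucketed sampling described above can be sketched as follows. This is a minimal illustration that assumes each tweet is a pair of hours elapsed since the attack and its text; the function name, the input layout and the fixed random seed are assumptions made for the example.

```python
import random
from collections import defaultdict

def sample_by_day_bucket(tweets, desired_sample=1000, seed=0):
    """Draw a sample that keeps each 24-hour bucket's share of the dataset.

    tweets: list of (hours_since_attack, text) pairs (hypothetical layout).
    """
    if not tweets:
        return []
    # Group tweets into 24-hour buckets by elapsed time.
    buckets = defaultdict(list)
    for hours, text in tweets:
        buckets[int(hours // 24)].append((hours, text))
    # Normalization constant: desired sample size over total rows, capped at 1.
    constant = min(1.0, desired_sample / len(tweets))
    rng = random.Random(seed)
    sample = []
    for day in sorted(buckets):
        # Records in the bucket times the constant, rounded to a whole number.
        k = round(len(buckets[day]) * constant)
        sample.extend(rng.sample(buckets[day], k))
    return sample
```

Because each bucket is rounded separately, the total can differ from the desired size by a few records.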
Results
COVERAGE BIAS: As expected from the plots, there was more coverage from Arab media
for the Beirut attacks, and the inverse for the Paris attacks, which received more Western media
coverage. A Chi-square test with Yates’ continuity correction was performed across all days
to compare the difference between Arab and Western media coverage. The result was to
accept the alternative hypothesis: there was a significant difference in coverage bias between
the attacks (χ²(1, N=7,768) = 1489, p < 0.001, φ=0.44, odds ratio=0.05). Correlation analysis
was also done to find out whether they followed a similar pattern of tweeting. It was observed
that for tweet activity volume, Western and Arab media were engaged at approximately
similar time points, which supports the fairness of the collected data.
CLASSIFYING NEWS MEDIA SYMPATHY: Because the crowd annotation did not cover all of the
data, ML models were trained to generalize the analysis. The task was to recognise the sympathy
of the tweets, essentially a sentiment analysis: sympathetic or not. A CNN fitted with a word2vec
model was used for classification. Western media was more sympathetic towards Paris, while
Arab media was more sympathetic towards Beirut. This aligns with prior work showing
strong regionalism in news geography and with producer-consumer attention asymmetries
across countries. Interestingly, retweeting behavior appears to be
impartial as to whether a tweet is sympathetic or not, and the same appears to apply
to sentiment labels.
Cyberbullying is a form of abuse carried out through electronic messages. Social media is an
ideal environment for bullies, who take advantage of the platform to attack its users.
These bullies usually use emails, messages and so on to carry out these malicious
activities. The paper applies a few supervised machine learning models to detect messages and
emails with such ill intent, so that they do not reach the user and cause harm.
These models help in detecting the patterns that bullies might
use when carrying out such attacks. A Kaggle dataset is used for this purpose.
The steps for building these models are as follows:
• Tokenize: The dataset, which is in the form of text, is taken in as sentences/paragraphs
and split into separate words in an array.
• Lowercase: The array of words generated in the previous step is then converted to
lowercase to normalize the data and to remove uneven casing. Example: “ROGHAN”
will be converted to “roghan”.
• Stop words and Cleaning: An essential part of pre-processing is to
remove the stop words from the text and to split the text into sentences and
paragraphs at ‘\n’ or ‘\t’.
• Word Correction: Here, the Microsoft Bing word correction API takes the word from
the array and returns a JSON object with the similar words and the distance between
them and the original word.
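The first three steps above can be sketched as follows. This is a minimal illustration: the stop word set is a tiny hypothetical list, the regular expression is one simple way to tokenize, and the word correction step is left out because it depends on an external API.

```python
import re

# Tiny illustrative stop word list; a real pipeline would use a much larger one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "and", "of", "you"}

def preprocess(text):
    """Tokenize, lowercase, and remove stop words from a piece of text."""
    # Tokenize: runs of letters/apostrophes become words, so '\n' and '\t' act as separators.
    tokens = re.findall(r"[A-Za-z']+", text)
    # Lowercase: normalize uneven casing, e.g. "ROGHAN" becomes "roghan".
    tokens = [token.lower() for token in tokens]
    # Cleaning: drop stop words.
    return [token for token in tokens if token not in STOP_WORDS]
```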
The next step is feature extraction. The textual data is
transformed into a format that is acceptable to the machine learning model. The features are
then separated from the array of features. Sentiment analysis is used to determine
whether the text is positive or negative. The extracted features are finally
fed into the classification algorithm. Two classifiers are used: SVM and a neural network. The network
has 3 layers: input, hidden and output. The input layer has 128 nodes, whereas the
hidden layer contains 64 neurons. The output is a Boolean value. Evaluation
of these classifier models is done using error metrics based on the confusion matrix.
Criteria such as accuracy, precision, recall and F-score are used for the evaluation.
The dataset consists of 12,773 rows. The data consists of questions along with their answers,
with class labels indicating whether the answer is bullying or a normal message.
After cleaning/pre-processing the dataset, 1608 (Cyberbullying) and 804 (Normal)
samples remained. The dataset is then split 80%/20% into training and test sets.
SVM and Neural Network (NN), the best-performing classifiers, are applied. Several
experiments are then run on different n-gram language models. The SVM classifier built
using a 4-gram model gives an accuracy of 90.3%, while the Neural Network (NN) model
gives an accuracy of 91.76%.
Both classifiers are evaluated in terms of precision and recall for each
language model. The average accuracy, recall, precision, and F-score of the 2 classifiers are
compared. The Neural Network model performs best. The
proposed approach detects cyberbullying with two machine learning models,
SVM and Neural Network, using TF-IDF and sentiment analysis for the extracted
features.
The methodology for detecting malicious messages combines Natural Language Processing
and Machine Learning. The data was taken from Kaggle, where it had been extracted by scraping
different social media platforms such as WhatsApp, Twitter and YouTube. The dataset
consists of about 15,307 rows. It also contains a class label which indicates whether the
text is cyberbullying, which helps in building a supervised machine learning model.
Special characters, retweet symbols, hashtags etc. were removed. Very short words
are also removed, as they do not contribute to the “cyberbullying” part of the model and are
most often just articles and prepositions. NLP techniques such as tokenization and lemmatization
are applied to extract the meaningful words from these texts. Tokenization splits the
sequence of words into smaller chunks, whereas lemmatization reduces
the inflectional forms of a word to the same root. The final step is vectorization, where
weights are assigned to the words based on the probability with which a certain word can be
found.
Two feature extraction methods were applied to the text: Count Vectorization and Term
Frequency-Inverse Document Frequency (TF-IDF). Count Vectorization converts a collection
of words within the corpus into a vector of term counts. The vectorizer is
fitted to learn the vocabulary and then builds a word matrix accordingly. TF-IDF
is used to evaluate how relevant a word is to a document. It reflects how frequent or rare a
word is: a score close to 0 means that the word is common across documents. A comparative
analysis of the two feature extraction methods concludes that Count
Vectorizer (CV) gives better accuracy than the TF-IDF method. Therefore, the CV
method is used for feature extraction.
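The difference between the two representations can be illustrated with a small pure-Python sketch. It uses the textbook weighting tf × log(N / df); this is an assumption for illustration, as libraries often use smoothed variants, and the summary does not state which formula the paper used.

```python
import math
from collections import Counter

def count_vectorize(documents):
    """Learn the vocabulary and return it with a document-by-term count matrix."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    matrix = []
    for doc in documents:
        counts = Counter(doc.split())
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

def tf_idf(documents):
    """Weight each count by term frequency times log(N / document frequency)."""
    vocabulary, counts = count_vectorize(documents)
    n_docs = len(documents)
    # Number of documents in which each vocabulary term appears.
    doc_freq = [sum(1 for row in counts if row[j] > 0)
                for j in range(len(vocabulary))]
    weights = []
    for row in counts:
        total = sum(row) or 1  # avoid dividing by zero for empty documents
        weights.append([(row[j] / total) * math.log(n_docs / doc_freq[j])
                        for j in range(len(vocabulary))])
    return vocabulary, weights
```

With the documents "you are mean" and "you are kind", the words "you" and "are" appear in every document and receive a weight of 0, while "mean" and "kind" receive positive weights.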
Various machine learning models are then applied, such as Linear SVC, Decision Tree
and Naïve Bayes, and the accuracy of each model is measured. After the models
were run, it was concluded that the Random Forest classifier shows the best accuracy based on the
evaluation metrics. The Random Forest classifier has an accuracy of 96.5% and an F1 score of
97%.
Deep neural networks are favoured over conventional machine learning techniques for the
identification of cyberbullying because of their benefits. In addition to an algorithmic
comparison of eleven classification techniques, the research proposes a unique neural
network framework with parameter optimisation. Moreover, it investigates how
word embedding methods and feature extraction in natural language processing affect
algorithmic efficiency. For detecting cyberbullying, the Bi-GRU and Bi-LSTM neural
networks performed the best.
Detection frameworks for cyberbullying online often employ traditional machine learning
methods. These traditional machine learning methods do, however, have a limitation in that
they are unable to give highly accurate results on extremely large volumes of data for
supervised categorization. Neural networks overcome this limitation and provide better
results and trustworthy processes. Several well-known shallow neural network and traditional
machine learning techniques are covered in this section. The structure of each suggested
network as well as the suggested methodology for our classification frameworks are covered
in this article.
The text is transformed into vector notation so that the classification algorithms can handle it.
Before the conversion, the raw text is heavily pre-processed. This step is called data cleansing
and includes the elimination of empty rows, punctuation, special characters,
etc. For the conventional machine learning methods, Count Vectorization
and TF-IDF unigram/bigram/trigram features were employed to evaluate the model's accuracy.
Shallow neural network representations are built using GloVe, FastText, and Paragram.
The training and testing datasets are split using 5-fold cross-validation, with each dataset being
divided five ways. Count vectorization is a process used to turn a group of words from
the corpus into a vector of terms. The model that results from this is then used to learn and
fit the vocabulary before creating a word matrix in line with it. A word's relevance to the
document is determined using TF-IDF, which reveals how often or infrequently a term
appears in a document; a value of 0 indicates that the term is most common. Word embeddings can be
obtained from text input using the unsupervised method known as Global Vectors (GloVe)
for word representations. To obtain the representations, a term co-occurrence matrix is
used to examine the semantic relationship between terms. A
high cosine similarity between two word vectors indicates closely related terms.
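The 5-fold split mentioned above can be sketched as follows. This is a minimal, unshuffled version for illustration only; the function name is hypothetical, and in practice the data would normally be shuffled (and often stratified by class) before splitting.

```python
def k_fold_indices(n_samples, k=5):
    """Return k (train_indices, test_indices) pairs covering every sample once as test data."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds
```

Each model is then trained five times, once per fold, and the reported metric is the average over the five held-out portions.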
The simplest shallow network has an accuracy of about 90% but is unsuitable for datasets
that are not balanced. SVM provides an accuracy of about 98.12%, whereas Logistic
Regression provides an accuracy of about 97.13%. Across all assessment
metrics, most results obtained using shallow neural networks are over 95%. These
numbers are also greater than the published results for conventional machine learning
models.
The paper notes that existing models for cyberbullying tweet detection use bag of
words with typical classification algorithms such as Logistic Regression, decision trees and so on.
However, it is extremely important that false positives be reduced. Only people who actually direct
extreme negative comments at a particular person should be penalised, not
everyone, because everyone has freedom of speech. Hence it proposes
hybrid approaches combining the outputs of knowledge-based approaches and machine
learning approaches.
The paper also suggests the use of Lexicon based techniques. Lexicon-based techniques have
been used extensively with traditional text but very little with material from Twitter since it is
so challenging to process. Twitter contains information that is made up of emojis, hashtags,
and many variations of acronyms like "lol." Because of this, Twitter data is challenging to
evaluate and is therefore less frequently studied.
The reinforced polarity is calculated by exhaustively reviewing the sentences,
analysing them with both models, and computing a score based on the individual scores of words and
letters. Part-of-speech tagging can be exploited for this purpose. For each sentence, the
paper makes use of three polarities: sentiment analysis of emoticons, sentiment analysis
from the knowledge-based approach, and sentiment analysis from the machine learning approach.
This hybrid approach helps reduce false positives so that freedom of speech
is not affected.
The confusion matrix in the paper shows around 30.3% true positives, 14.5%
false negatives, 15.1% false positives and 40.0% true negatives, yielding a
higher accuracy than traditional approaches.
PAPER 8: A BAG-OF-PHONETIC-CODES MODEL FOR CYBERBULLYING
DETECTION IN TWITTER [9]
The paper cites a survey conducted by Ditch the Label showing that almost everyone is
affected by cyberbullying these days. More than 47 percent of people have received
hateful text messages and more than 62 percent have received harsh comments
on messaging platforms like WhatsApp, Instagram etc.
Even after detecting a hateful comment, penalising the person responsible for it is a difficult
task, as people who post hateful comments on these platforms tend to hide their real identity,
often referred to as the “social-mask”. The paper suggests that there are two techniques to
perform sentiment analysis:
• Machine Learning approach
• Lexicon based approach.
The machine learning approach trains a model on the
features of a training dataset; the model can then be used to classify real-world data.
Lexicon-based approaches are knowledge-based and need an efficient representation for
identifying the sentiment behind a text, because of which this approach doesn’t fare well
on datasets with a neutral class. Hence the paper uses machine learning
approaches.
Sarcasm and hidden meanings pose another major difficulty, as they often lead to texts
being classified in the wrong context. The paper cites prior work in which syntactic features
such as punctuation and part of speech, fed to a Naïve Bayes model, achieved an F1-score
of 0.7.
The methodology used in this paper starts with dataset collection and preparation from online
sources and the Twitter API. Next comes pre-processing and tokenising this data, which poses a huge
difficulty because Twitter data is noisy, with emoticons,
symbols, abbreviations, and mixed-language text. The paper first converts all
characters to lowercase, removes extra symbols and punctuation, stems words, converts
emoticons to words that convey the sentiment, and removes stop words.
Next comes feature selection and preparation of the feature vector. Every tweet needs to be
converted to a fixed-length vector before it can be fed into a machine learning model. This can be
done using the bag of words model or the TF-IDF approach. The bag-of-words model uses
the frequency of occurrence of words as the weights. This comes with a huge drawback of
not being able to capture the semantic information in the texts or tweets.
After generating the feature vector, word embeddings have to be generated, using
either frequency-based or prediction-based approaches. Both clustering (unsupervised) and
classification (supervised) are then performed on the word embeddings to
see which method fares better in this scenario.
The paper makes use of 3 different datasets, each either collected previously from online
sources or gathered using the Twitter API, and each containing over 35 thousand tweets.
The first dataset was used for clustering analysis and the other two datasets were used for
supervised learning. Dataset 2 achieved an accuracy of 57%, whilst dataset 3 gave an F1-score
of 0.98 on average when a Support Vector Machine model was used.
PAPER 9: DETECTING A TWITTER CYBERBULLYING USING MACHINE
LEARNING [10]
The paper uses machine learning techniques such as Support Vector Machines, the Naïve Bayes
classifier and so on to detect and classify cyberbullying. One major contribution is the
testing: the authors collected real-time data from the Twitter API and fed it to the trained
model to see how the model fares at detecting cyberbullying in real time.
The paper starts with importing the Twitter dataset downloaded from Kaggle and GitHub.
In the pre-processing phase, the authors use the NLTK library for Python to perform
the necessary pre-processing steps on the tweets, starting with tokenisation using
WhitespaceTokenizer, WordPunctTokenizer, TreebankWordTokenizer and
PunctWordTokenizer. They then lowercase all tokens and
remove stop words (a, an, the, I, am, etc.). Finally, they use the
WordNetLemmatizer built into NLTK to lemmatise the words into their base forms.
The next step is feature extraction. The researchers use the TF-IDF vectorizer for this
purpose. The data's characteristics are extracted and listed as features. Additionally, each
text's polarity (i.e., whether it contains bullying or not) is extracted and saved in the list of
features.
Next come the algorithms and the error metrics used in the model. Among the many
classification approaches, the researchers identify SVM and Naïve Bayes as the best
classifiers. Support Vector Machines use a hyperplane to divide
datasets into their respective classes, making use of support vectors, which are the
points closest to the hyperplane. The Naïve Bayes classifier applies Bayes' theorem
to compute the probability of a class given the observed features.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-Score = 2 * (Precision * Recall) / (Precision + Recall)
TP = true positives; TN = true negatives; FN = false negatives; FP = false positives
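These metrics can be computed directly from confusion matrix counts, as in the sketch below. The function name and the convention of returning 0.0 when a denominator is zero are assumptions; accuracy is included using its standard definition, (TP + TN) divided by the total.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F-score and accuracy from confusion matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_score = 2 * (precision * recall) / (precision + recall)
    else:
        f_score = 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "accuracy": accuracy}
```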
Finally, both models are trained using the downloaded Kaggle dataset and tested with test
data, which is pre-processed, passed through the TF-IDF model and cleaned of texts with false
pretences such as satire. The Naïve Bayes classifier gave an accuracy of 52.70 percent, whilst SVM
gave an accuracy of 71.25 percent, suggesting that SVM is better suited to this dataset.
With regard to precision, recall and F-score, the Naïve Bayes classifier produced 52%, 52% and 53%
respectively, and Support Vector Machines produced 71%, 71% and 70% respectively, which
also suggests that Support Vector Machines work well in this scenario.
Finally, the authors test the models with real-time data (apart from the 45%
test data used earlier). Even in this case, Support Vector Machines outperformed the Naïve
Bayes classifier in all aspects.
Around ten tweets were fetched from the Twitter API, of which seven were classified as
non-bullying tweets and three as bullying tweets.
PAPER 10: NLP AND MACHINE LEARNING TECHNIQUES FOR DETECTING
INSULTING COMMENTS ON SOCIAL NETWORKING PLATFORMS [11]
The approach involves extracting an appropriate data set from a variety of web sources,
pre-processing, generating ground truth, engineering features, and choosing a classifier.
The starting point is to collect relevant data from various online platforms. The second
step involves pre-processing or cleaning of the data set: noise reduction, lowercasing,
tokenization, stemming, lemmatization, stop word removal, etc. The next step is
feature engineering: extracting user, textual, and network features. The final step is to
perform classification using the extracted features and the ground truth.
The data consists of two attribute fields together with an identifier. The first attribute is the
timestamp of the comment's posting. There are several null instances, so
an exact and reliable timestamp is not always available. The next attribute is the actual content,
in double quotes, given as Unicode text. Data set labelling was the most time-consuming
and labour-intensive step. In plain English, the data collected for machine learning is
divided into two classes: "1" for offensive remarks and "0" for neutral remarks. The final
prediction ought to fall within [0, 1].
Data pre-processing is a data mining technique that transforms raw data into a
comprehensible format. Real-world data is often inaccurate, inconsistent and incomplete,
and may lack specific behaviours or patterns. Data pre-processing
prepares raw data for further processing.
Four kinds of classifiers are used – the most basic Logistic Regression, Support Vector
Machine, and the two most popular ensemble methods, Random Forest Classifier and
Gradient Boosting Machine. Random forest and gradient boosting machines need dense
feature matrices as inputs, but logistic regression and support vector machines need sparse
ones. The models were applied to the test dataset provided by Kaggle after being trained.
Finally, a file containing the test dataset's predictions is created. All prediction values fall
between 0 and 1, with a score of 0 to 0.5 designating a "non-insulting" remark and a score of
0.5 to 1 designating an "insulting" statement.
Results
Training accuracy for the four classifiers ranged from 75% to 90%, but test accuracy was only
between 50% and 55%. The models fit the training dataset closely yet fail to categorise the
test dataset accurately, which points to over-fitting (high variance). According to the
results of the trials and the overall project effort, Logistic Regression and Random Forest
Classifier trained on the feature stack outperformed Support Vector Machine and Gradient
Boosting Machine in this specific instance.
Proposed Methodology
The training data is cleaned and prepared before being fed into stacked word embeddings.
The CNN-BiLSTM deep learning model is then trained and compared against other deep learning
models trained separately. The trained model is saved so that it can be used online.
Once pre-processing is complete, the CNN-BiLSTM model is built. A word embedding approach is
used because it avoids several problems of simple one-hot vector encodings; most importantly,
word embeddings improve generalisation and effectiveness. GloVe and FastText word embeddings
are stacked, since a mix of embeddings has given the best outcomes. The CNN-BiLSTM model is
built on top of the stacked embeddings and is then compared with an ensemble ML model in
terms of accuracy.
CNN-BiLSTM Architecture:
A CNN has fewer hyperparameters and needs less supervision, but an LSTM often delivers better
results. While the LSTM takes longer to evaluate, it is more accurate for lengthier texts. A
plain RNN suffers from a serious vanishing-gradient problem when processing sequences: the
farther back an element lies in the sequence, the less of it the network retains. A plain RNN
is also restricted to cases where the input and output have the same size. BiLSTM is used to
solve fixed sequence-to-sequence prediction.
A CNN-BiLSTM is a concatenation of a CNN and a bidirectional LSTM architecture. In its basic
formulation, it learns both character-level and word-level features for classification and
prediction. The CNN layer induces the character-level features: for each word, the model
applies a convolution and a max pooling layer to per-character feature vectors, such as
character embeddings and (ideally) character type, to create a new feature vector.
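The character-level step can be sketched in a few lines. This is a minimal illustration, not the reviewed model: the two-dimensional character embeddings and the single hand-set filter are hypothetical, the filter weights would normally be learned, and the BiLSTM that consumes the resulting word vectors is omitted.

```python
def char_cnn_word_vector(word, char_embeddings, filters, width):
    """Convolve each filter over the word's character embeddings, then max-pool."""
    dim = len(next(iter(char_embeddings.values())))
    zero = [0.0] * dim
    pad = [zero] * (width - 1)                       # zero padding on both sides
    chars = pad + [char_embeddings.get(c, zero) for c in word] + pad
    features = []
    for filt in filters:                             # each filter is width x dim
        activations = []
        for start in range(len(chars) - width + 1):
            window = chars[start:start + width]
            activations.append(sum(f * x
                                   for f_row, x_row in zip(filt, window)
                                   for f, x in zip(f_row, x_row)))
        features.append(max(activations))            # max pooling over positions
    return features

# Hypothetical 2-dimensional character embeddings and one hand-set filter.
char_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
ab_detector = [[1.0, 0.0], [0.0, 1.0]]               # responds most to "a" then "b"

print(char_cnn_word_vector("ab", char_embeddings, [ab_detector], width=2))  # [2.0]
print(char_cnn_word_vector("ba", char_embeddings, [ab_detector], width=2))  # [1.0]
```

Each filter contributes one number per word regardless of word length, which is what allows the pooled vector to be concatenated with a word embedding before the BiLSTM.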
Results
The performance of various activation and optimizer setups for a basic LSTM model is compared
on the basis of accuracy.
The activation function significantly influences a neural network's capabilities and
performance, and different activation functions may be applied to different parts of the
model. The sigmoid function maps any input to a number between 0 and 1: its output is close to
zero for large negative inputs and close to one for large positive inputs. A two-element
Softmax with the second element set to zero is equivalent to a sigmoid, which is why the
sigmoid is so often used in binary classification. Besides the sigmoid, ReLU is employed as
the activation for the hidden layer of the CNN-BiLSTM model. The comparison shows that the
CNN-BiLSTM model performs best of all the models examined. After all the layers are combined,
the model is fitted to the data for 10 epochs and achieves an accuracy of roughly 98%.
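The relationship between the sigmoid and a two-element Softmax can be checked numerically. The snippet below is a small sketch using only the standard library; the input values are arbitrary.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(values):
    """Softmax over a list of scores (shifted by the max for stability)."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

for x in (-6.0, 0.0, 6.0):
    # First softmax component with the second score pinned to zero.
    two_way = softmax([x, 0.0])[0]
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  softmax([x, 0])[0]={two_way:.4f}")
```

The two columns agree for every input, and the values approach 0 and 1 at the two extremes.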
Dataset
The study also gathers comment threads from popular yet contentious YouTube videos that have
the potential to incite hate speech, using an HTML/CSS parser. These are stored in delimited
JSON format. Similar criteria are used to choose appropriate entries for establishing ground
truth.
The cyberbullying detection dataset contributed to Kaggle by Impermium is selected as the
test dataset for validating the model. The collected dataset is labelled manually: each sample
of textual data is carefully examined, interpreted, and classified in order to identify
instances of cyberbullying via hate speech and insults. The potential classes are "Bully" and
"Non-bully" for binary classification; "Bully," "Aggressor," "Spammer," and "None" for
multiclass classification; and "0" and "1" for bully and non-bully comments, respectively.
The dataset is divided into "0" and "1" categories.
All the extracted feature vector sets are stacked into a single feature set made up of count
vectors and TF-IDF vectors, with both words and characters as tokens and n-grams of up to
length five. Numerous hyperparameters are studied and tuned in order to make learning more
effective. For instance, "C" is a parameter of logistic regression and support vector machines
that is the inverse of the regularisation strength, so smaller values of C indicate stronger
regularisation. Other hyperparameters include the "learning rate," the "number of
subsamples," and the "number of trees in the forest" for a random forest.
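A rough sketch of the stacking idea follows. It covers only the count-vector half (TF-IDF weighting is left out), and the function names and the example sentence are illustrative assumptions rather than the reviewed paper's code.

```python
from collections import Counter

def ngrams(tokens, max_n):
    """Return all contiguous n-grams of length 1..max_n as tuples."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def stacked_counts(text, max_n=5):
    """Count word n-grams and character n-grams and stack them in one feature map."""
    text = text.lower()
    features = Counter()
    for gram in ngrams(text.split(), max_n):
        features[("word", gram)] += 1
    for gram in ngrams(list(text), max_n):
        features[("char", gram)] += 1
    return features

feats = stacked_counts("you are so so dumb")
print(feats[("word", ("so",))])        # word unigram "so" -> 2
print(feats[("word", ("so", "so"))])   # word bigram "so so" -> 1
print(feats[("char", ("s", "o"))])     # character bigram "so" -> 2
```

In scikit-learn the same idea is commonly expressed with CountVectorizer and TfidfVectorizer, using their analyzer and ngram_range parameters, and stacking the resulting sparse matrices column-wise.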
Results
The table shows the evaluation metrics obtained after training on the dataset and validating
on a test dataset. Training accuracy ranged from 75% to 90% for all four classifiers, while
test accuracy lay between 70% and 75%.
The dataset contains more than 47,000 tweets labelled according to the class of cyberbullying:
• Age
• Ethnicity
• Gender
• Religion
• Other type of cyberbullying
• Not cyberbullying
The dataset contains the tweet and the class (among the 6 listed above) it corresponds to. The
data has been balanced in order to contain ~8000 of each class.
https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification
Pre-processing Steps
Stop word removal is one of the most commonly used pre-processing steps across different
NLP applications. The idea is simply to remove words that add no significant meaning to the
text, such as pronouns and articles. They are treated as noise, because the classification of
a tweet does not depend on them.
Stemming is the process of reducing a word to its stem by stripping affixes (suffixes and
prefixes); the related process of lemmatization reduces a word to its dictionary root, known
as the lemma. The same word may have different word forms, as illustrated below. The idea is
to use the root form of the word instead of its various forms.
Lowercasing serves the same purpose of normalising the whole corpus; uppercasing everything
would work equally well.
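The pre-processing steps can be sketched as a small pipeline. The stop-word list and the suffix-stripping rule below are deliberately tiny, hypothetical stand-ins; a real project would use a fuller stop-word list and an established stemmer such as Porter's.

```python
import re

# Tiny illustrative stop-word list (assumption, not the list used in the project).
STOP_WORDS = {"a", "an", "the", "i", "you", "he", "she", "it", "is", "are", "and", "to", "of"}

# Crude suffix stripping as a stand-in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(tweet):
    tokens = re.findall(r"[a-z']+", tweet.lower())        # lowercase and tokenise
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [crude_stem(t) for t in tokens]                # stemming

print(preprocess("You are BULLYING the new kids"))  # ['bully', 'new', 'kid']
```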
Implementation
Methodology
The data comprises two columns: the tweet and the corresponding cyberbullying class. The
data must be pre-processed before feature extraction, using the pre-processing steps described
above. After pre-processing, the tweet data must be represented in numerical form. This can
be done with either a Continuous Bag of Words model or a TF-IDF Vectorizer, which serves as
the feature extractor. In this project, we used the TF-IDF Vectorizer.
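TF-IDF (term frequency-inverse document frequency) weights each word by how often it occurs
in a document and how rare it is across the corpus. It compensates for a weakness of the plain
Bag of Words technique, which is good for text classification and for turning words into
numbers a machine can read, but treats every word as equally informative.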
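A minimal sketch of the weighting is shown below, assuming the textbook definitions tf = count / document length and idf = log(N / df). Library implementations such as scikit-learn's TfidfVectorizer apply smoothed and normalised variants, so their absolute values differ. The toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n_docs / df[t])
                        for t, c in counts.items()})
    return weights

docs = [["you", "are", "ugly"], ["you", "are", "kind"], ["you", "rock"]]
for w in tf_idf(docs):
    print({t: round(v, 3) for t, v in w.items()})
```

A word that appears in every document ("you") receives weight zero, while a word unique to one document ("ugly") receives the highest weight in that document.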
CBOW (Continuous Bag of Words) trains a neural network to learn embeddings by predicting a
target word from the multiple words that make up its context.
The embedding matrix contains the word feature representation (or embedding) that will be
used by the model for classification.
Classification algorithms like Logistic Regression, Support Vector Machines, Naïve Bayes,
Random Forest, Gradient Boosting were used to classify the type of cyberbullying.
Naïve Bayes (NB) assumes that a parametric model generates the text and uses the training
data to estimate the Bayes-optimal parameters of that model. It then classifies the test data
using those estimates. NB classifiers can handle an arbitrary number of distinct continuous or
categorical features. Under the assumption that the features are independent, the task of
estimating a high-dimensional density is reduced to estimating one-dimensional kernel
densities. The NB algorithm is a learning algorithm built on the application of the Bayes
theorem with strong (naive) independence assumptions.
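As an illustration, the sketch below implements a multinomial Naïve Bayes classifier with add-one smoothing on a made-up four-document corpus. The multinomial variant, the smoothing, and the toy data are assumptions for this example, not a description of the project's implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    vocab = {w for doc in docs for w in doc}
    return class_docs, word_counts, vocab

def predict_nb(model, doc):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n in class_docs.items():
        score = math.log(n / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)   # add-one smoothing
        for w in doc:
            if w in vocab:   # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [["you", "are", "stupid"], ["stupid", "loser"],
        ["have", "a", "nice", "day"], ["nice", "work"]]
labels = ["bully", "bully", "not_bully", "not_bully"]
model = train_nb(docs, labels)
print(predict_nb(model, ["stupid", "loser"]))  # bully
print(predict_nb(model, ["nice", "day"]))      # not_bully
```

The independence assumption appears in the inner loop: each word contributes its own log likelihood, and the contributions are simply added.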
Logistic regression
Logistic regression is one of the well-known methods that machine learning borrowed from the
field of statistics. It is an algorithm that uses the logistic function to fit a separating
hyperplane between two classes. Logistic regression takes the sparse feature matrix as its
input. For training purposes, the sparse feature matrix is converted into a dense matrix where
required.
The logistic regression algorithm takes the features and generates a prediction based on the
probability that a class is appropriate for the input. For instance, the instance is assigned
to the positive class if the probability is 0.5 or greater; otherwise, the prediction is the
other class (the negative class). Logistic regression was used to implement the predictive
cyberbullying models.
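The prediction step can be sketched as follows, using the usual convention that a probability of 0.5 or more maps to the positive class. The weights and bias are hypothetical values chosen for illustration; in practice they are learned from the training data.

```python
import math

def predict_proba(weights, bias, features):
    """Logistic function applied to the linear score w.x + b."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(weights, bias, features, threshold=0.5):
    """Return 1 (positive class) when the probability reaches the threshold."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

# Hypothetical weights for two features, e.g. the TF-IDF scores of two terms.
weights, bias = [2.0, -1.0], -0.5
print(predict_label(weights, bias, [1.0, 0.0]))  # linear score 1.5 -> 1
print(predict_label(weights, bias, [0.0, 1.0]))  # linear score -1.5 -> 0
```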
Random Forest is a machine learning algorithm used for both classification and regression
tasks. It is an ensemble learning method that combines multiple decision trees and makes
predictions based on the output of each individual tree.
The algorithm works by building a large number of decision trees, each trained on a different
subset of the training data, and with a different subset of features. Each tree in the forest
produces a prediction for the class of the input data point, and the final prediction is made by
taking the majority vote of all the trees.
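The bootstrap-and-vote idea can be sketched with a heavily simplified forest. Here every "tree" is a one-split stump that thresholds a single randomly chosen feature at its mean, which is an assumption made to keep the example short; real random forests grow deeper trees and search for the best split. The toy data are made up.

```python
import random
from collections import Counter

def fit_stump(X, y, feature):
    """One-split 'tree': threshold a single feature at its mean."""
    threshold = sum(row[feature] for row in X) / len(X)
    left = [label for row, label in zip(X, y) if row[feature] <= threshold]
    right = [label for row, label in zip(X, y) if row[feature] > threshold]
    fallback = Counter(y).most_common(1)[0][0]
    left_label = Counter(left).most_common(1)[0][0] if left else fallback
    right_label = Counter(right).most_common(1)[0][0] if right else fallback
    return feature, threshold, left_label, right_label

def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # bootstrap sample of the rows
        feature = rng.randrange(len(X[0]))         # random feature for this tree
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]              # majority vote of all trees

X = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.7], [0.2, 0.9]]
y = ["bully", "bully", "not_bully", "not_bully"]
forest = fit_forest(X, y)
print(predict_forest(forest, [0.85, 0.2]))
```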
Here are the steps involved in building a Random Forest Classification Model:
A Support Vector Machine (SVM) is a powerful machine learning algorithm used for
classification, regression, and outlier detection tasks. It is a supervised learning algorithm
that works by finding the best boundary between classes in the data. SVMs are particularly
effective on complex datasets that have many features or are not linearly separable.
The basic idea behind SVM is to find a hyperplane that separates the data points into different
classes with the largest possible margin. In other words, SVM tries to find the optimal
boundary that maximizes the distance between the closest data points from each class. This
distance is known as the margin. Here are the steps involved in building an SVM
Classification Model:
1. Given a training dataset with input features and class labels, the SVM algorithm tries
to find a hyperplane that separates the data into two classes.
2. Among all separating hyperplanes, the algorithm chooses the one that maximizes the margin
between the two classes, while penalizing training points that are misclassified or fall
inside the margin.
3. If the data is not linearly separable, the algorithm transforms the data into a higher-
dimensional space using a kernel function. This helps to separate the data into
different classes.
4. Once the hyperplane is found, the algorithm makes predictions on new data points by
classifying them based on which side of the hyperplane they lie.
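The steps above can be sketched for the linear case. The example below trains a linear SVM by sub-gradient descent on the regularised hinge loss; the learning rate, the regularisation constant, and the toy data are assumptions, and the kernel transformation of step 3 is not covered.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Sub-gradient descent on the L2-regularised hinge loss; labels must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for row, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, row)) + b)
            if margin < 1:
                # Point is misclassified or inside the margin: hinge term is active.
                w = [wi - lr * (lam * wi - label * xi) for wi, xi in zip(w, row)]
                b += lr * label
            else:
                # Only the regulariser contributes, which shrinks w (widens the margin).
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, row):
    """Classify by the side of the hyperplane w.x + b = 0 on which the point lies."""
    return 1 if sum(wi * xi for wi, xi in zip(w, row)) + b >= 0 else -1

X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, row) for row in X])  # [1, 1, -1, -1]
```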
Gradient Boosting
Gradient Boosting is a machine learning algorithm used for both classification and regression
tasks. It is an ensemble learning method that combines multiple weak models to create a
strong model. Gradient Boosting works by sequentially adding weak learners to the model,
with each new model trying to correct the errors of the previous model.
In Gradient Boosting Classification, the algorithm learns to predict the probability of each
class by iteratively adding decision trees to the model. Each decision tree is fitted to the
errors left by the trees before it, and may be built on a subset of the training data and a
subset of the features. The algorithm then sums the contributions of all the trees to make a
final prediction.
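The sequential error-correcting idea can be sketched with one-split regression stumps. For brevity, this example fits 0/1 labels with squared loss on a single feature, which is a simplification: gradient boosting classifiers normally use a log-loss objective and deeper trees. The data and parameter values are made up.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regression stump under squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    return best[1:]

def boost(xs, ys, n_rounds=20, lr=0.5):
    base = sum(ys) / len(ys)                 # start from the mean label
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # Residuals are the negative gradient of the squared loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lmean, rmean = fit_stump(xs, residuals)
        stumps.append((t, lmean, rmean))
        preds = [p + lr * (lmean if x <= t else rmean) for x, p in zip(xs, preds)]
    return base, lr, stumps

def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, lr, stumps = model
    return base + sum(lr * (lmean if x <= t else rmean) for t, lmean, rmean in stumps)

xs, ys = [1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]
model = boost(xs, ys)
print(round(predict(model, 1.0), 3), round(predict(model, 4.0), 3))
```

Each round halves the remaining error on this toy data, so the predictions approach 0 and 1; thresholding the output at 0.5 turns it into a class label.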
The classification reports for all the models are shown below. GridSearchCV is also used to
tune the models. GridSearchCV (short for Grid Search Cross Validation) is a hyperparameter
tuning technique that evaluates every combination in a grid of candidate values with
cross-validation to find the best set of hyperparameters for a given model. Hyperparameters
are the parameters of the machine learning algorithm that are set before the training process
begins and are not learned from the data.
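The search procedure itself is simple enough to sketch by hand. The code below is not scikit-learn's GridSearchCV; it is a hand-rolled illustration of the same idea, and the toy threshold "model", the data, and all function names are assumptions.

```python
from itertools import product

def k_fold_indices(n, k):
    """Split range(n) into k contiguous (train, validation) index pairs."""
    fold = n // k
    for i in range(k):
        val = list(range(i * fold, (i + 1) * fold if i < k - 1 else n))
        train = [j for j in range(n) if j not in val]
        yield train, val

def grid_search(fit_and_score, param_grid, n_samples, k=3):
    """Try every parameter combination; keep the best mean validation score."""
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [fit_and_score(params, train, val)
                  for train, val in k_fold_indices(n_samples, k)]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score

xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]
ys = [0, 1, 0, 1, 0, 1]

def fit_and_score(params, train, val):
    # Toy "model": predict 1 when x exceeds the threshold hyperparameter.
    # Nothing is learned from `train` here; a real model would be fitted on it.
    hits = sum((xs[i] > params["threshold"]) == bool(ys[i]) for i in val)
    return hits / len(val)

best_params, best_score = grid_search(fit_and_score, {"threshold": [0.0, 0.5, 1.0]}, len(xs))
print(best_params, best_score)  # {'threshold': 0.5} 1.0
```

Because every combination is fitted once per fold, the cost grows with the size of the grid multiplied by the number of folds.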
The classification reports above show that the Random Forest Classifier achieved the highest
accuracy, 93% (Figure 18), while Naïve Bayes (Figure 19) had the lowest, 84%. The remaining
models, Logistic Regression, Gradient Boosting, and SVM, each achieved 92% accuracy. The
classification reports also show the precision, recall, f1-score, and support for each class
(Age, Ethnicity, Gender, Non-cyberbullying, and Religion).
Conclusion
The latest research on cyberbullying uses supervised learning to build a machine learning
model. Most of the study effort focuses on feature engineering, that is, discovering traits
that can distinguish between bullying and non-bullying remarks. According to the results of
the trials and the body of work, the random forest classifier trained on the feature stack
outperformed the logistic regression, naïve bayes, SVM, and gradient boosting models in this
specific instance. The random forest classifier achieved an accuracy of 93%, whereas naïve
bayes achieved only 84%.
Future Scope
Here are some potential areas for future development and research:
• Multimodal Detection: Currently, most of the research in this area is focused on
text-based detection, but cyberbullying can also involve images, videos, and other
multimedia. Future research can explore the development of models that analyse
multiple modes of communication for better detection and classification of
cyberbullying. A model of this kind is proposed in [15].
• Real-time Detection: Most of the existing models are trained on historical data, which
limits their effectiveness in detecting new forms of cyberbullying. Developing models
that can detect cyberbullying in real time, along the lines of [14], is a potential
direction for future work.
• Context-based Detection: The context of a communication can have a significant
impact on whether it is considered cyberbullying or not. Future models can be
developed to take into account the context of a communication, such as the
relationship between the sender and the receiver, the language used, and other
contextual factors.
• Multilingual Detection: Currently, most of the research has been conducted on
English language-based cyberbullying detection. Future research can focus on
developing models for detecting cyberbullying in other languages to make the
detection process more effective.
References
[1] Fang, X., & Zhan, J. (2015). Sentiment analysis using product review data. Journal of Big
Data, 2(1). https://doi.org/10.1186/s40537-015-0015-2
[2] Wang, K., Cui, Y., Hu, J., Zhang, Y., Zhao, W., & Feng, L. (2020). Cyberbullying
Detection, Based on the FastText and Word Similarity Schemes. ACM Transactions on Asian
and Low-Resource Language Information Processing, 20(1), 1–15.
https://doi.org/10.1145/3398191
[3] El Ali, A., Stratmann, T. C., Park, S., Schöning, J., Heuten, W., & Boll, S. C. (2018).
Measuring, Understanding, and Classifying News Media Sympathy on Twitter after Crisis
Events. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems.
https://doi.org/10.1145/3173574.3174130
[4] Ruest, N. (2017, February 7). #PARIS #Bataclan #parisattacks #porteouverte tweets.
Borealis. https://hdl.handle.net/10864/11312
[5] Hani, J., Mohamed, N., Ahmed, M., Emad, Z., Amer, E., & Ammar, M. (2019). Social
media cyberbullying detection using machine learning. International Journal of Advanced
Computer Science and Applications. https://dx.doi.org/10.14569/IJACSA.2019.0100587
[6] Shah, K., Phadtare, C., & Rajpara, K. (2022). Cyber-Bullying Detection in
Hinglish Languages Using Machine Learning. International Journal of Engineering and
Technical Research, 11, 439.
[7] Raj, C., Agarwal, A., Bharathy, G., Narayan, B., & Prasad, M. (2021). Cyberbullying
detection: hybrid models based on machine learning and natural language processing
techniques. Electronics, 10(22), 2810. https://doi.org/10.3390/electronics10222810
[11] Sharma, H. K., & Kshitiz, K. (2018, June). NLP and machine learning techniques for
detecting insulting comments on social networking platforms. In 2018 International
Conference on Advances in Computing and Communication Engineering (ICACCE) (pp.
265-272). IEEE.
[12] Raj, M., Singh, S., Solanki, K., & Selvanambi, R. (2022). An application to detect
cyberbullying using machine learning and deep learning techniques. SN Computer Science,
3(5), 401.
[13] Sahay, K., Khaira, H. S., Kukreja, P., & Shukla, N. (2018). Detecting cyberbullying and
aggression in social commentary using NLP and machine learning. International Journal of
Engineering Technology Science and Research, 5(1), 1428-1435.
[15] Roy, P. K., & Mali, F. U. (2022). Cyberbullying detection using deep transfer learning.
Complex & Intelligent Systems, 8, 5449–5467. https://doi.org/10.1007/s40747-022-00772-z