BTech Project Research Paper
The dataset, taken from Kaggle, is used for detecting fake
accounts. It contains 4000 rows in all: 2000 real users and
2000 fake users. Each row consists of the fields ID, name,
Statuses_Count, followers_count, friends_count, and
favorites_count.
Data preprocessing is an essential phase in the pipeline for
data analysis and machine learning. Several techniques are
used in the process, including lowercasing, stopword removal,
stemming, tokenization, and padding. Lowercasing, a common
preprocessing technique, converts all text data to lowercase.
This allows models to focus on the semantic content of the
text, mitigating the impact of variations in letter case.
Stopword removal eliminates common words, known as stop
words, from the text data, so that the content-rich words
carrying the meaningful information are retained.
Tokenization is a key preprocessing technique that breaks
text down into individual words or tokens, aiding the model's
understanding of textual data.
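The preprocessing steps described above can be sketched in plain Python as follows. This is a minimal illustration, not the pipeline used in this work: the stop-word list, the `<pad>`/`<unk>` index convention, the maximum length of 6, the sample descriptions, and the helper names `tokenize`, `build_vocab` and `encode_and_pad` are all assumptions, and stemming is omitted.

```python
import string

# Assumed, deliberately tiny stop-word list (a real list would be much longer).
STOP_WORDS = {"a", "an", "the", "is", "and", "of", "to", "in"}


def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOP_WORDS]


def build_vocab(token_lists):
    """Map each distinct token to an integer id; 0 and 1 are reserved."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_and_pad(tokens, vocab, max_len):
    """Convert tokens to ids, truncate to max_len, right-pad with <pad>."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


if __name__ == "__main__":
    # Hypothetical account descriptions, for illustration only.
    descriptions = [
        "Lover of Coffee and the Outdoors!",
        "FREE followers, click the link",
    ]
    token_lists = [tokenize(d) for d in descriptions]
    vocab = build_vocab(token_lists)
    for tokens in token_lists:
        print(tokens, encode_and_pad(tokens, vocab, max_len=6))
```

Padding every sequence to the same length is what allows a batch of descriptions to be fed to an embedding layer as a single rectangular array.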
Fig. 1. Block Diagram of Architecture

After that, the data was split into training and testing sets
using the train-test split method. After splitting the data,
the models are defined and compiled, tested on the training
data, and the results are plotted. The user account details
are processed using a total of 6 ML models. To calculate the
[...]

The LSTM neural network processes textual descriptions of
user accounts by sequentially analyzing and learning complex
dependencies in the data. Its ability to capture long-range
dependencies helps in discerning subtle linguistic patterns
indicative of fake accounts. By embedding the text and
utilizing the LSTM layer, the model gains contextual
understanding, allowing it to recognize intricate linguistic
structures associated with fake profiles. This analysis
improves the system's capability to distinguish between
authentic and malevolent accounts by utilizing the sequential
information encoded in the textual data. Traditional RNNs
have difficulty capturing long-term dependencies because, as
the network processes new items, knowledge about earlier
parts of the sequence fades away. This problem is known as
the vanishing gradient problem. To overcome this problem,
LSTM (Long Short-Term Memory) was designed.
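As a rough illustration of why knowledge about earlier parts of a sequence fades in a plain RNN, the sketch below scales a gradient by the same per-step factor once for every time step it is propagated back through. The factor 0.5 and the function name `gradient_after_steps` are assumptions chosen for illustration; in a real network the factor depends on the recurrent weights and the activation derivatives.

```python
def gradient_after_steps(per_step_factor, steps):
    """Magnitude of a unit gradient after being scaled `steps` times."""
    grad = 1.0
    for _ in range(steps):
        grad *= per_step_factor
    return grad


if __name__ == "__main__":
    # With a factor below 1 the gradient shrinks geometrically with distance.
    for steps in (1, 10, 50):
        print(steps, gradient_after_steps(0.5, steps))
```

With any factor below 1, the contribution of a token 50 steps back is negligible compared with that of the most recent token, which is the behavior the gating mechanism of an LSTM is designed to counteract.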
Algorithm – Fake Account Detection Model using LSTM

Input : User Account Details
Output : Detecting Real and Fake Accounts

1 : Loading Dataset
2 : Pre-processing the Input text
3 : for every sentence in training sentences do
4 :     Remove punctuations
5 :     Perform Lowercasing of words
6 :     Remove Stopwords
7 :     Perform tokenization and padding.
8 : end for
9 : Applying LSTM Model
10 : Compile, train, and fit the model
11 : Test the model and calculate the performance parameters
12 : Return whether an account is Real or Fake

The LSTM process begins with loading and shuffling the
dataset to ensure randomness. Text preprocessing steps
follow, including converting text to lowercase, removing
unnecessary words, and tokenizing. Numerical features are
extracted, labels are assigned, and the data is split into
training and testing sets. The LSTM model is used to classify
users as either real or fake based on their Twitter
descriptions and other numerical features. The model first
embeds the text descriptions into a vector representation
using an embedding layer. Then, it feeds the embedded vectors
into an LSTM layer, which learns to capture long-term
dependencies in the text. Finally, it combines the output of
the LSTM layer with the numerical features and passes them
through a dense layer to produce the final classification
prediction.

Equations (1), (2) and (3) describe the gates present in the
LSTM layer, where $f_t$, $i_t$ and $o_t$ represent the
forget, input and output gates respectively.

$f_t = \sigma(\omega_f h_{t-1} + U_f x_t + b_f)$    (1)

Where,
$\sigma$ is the sigmoid activation function
$\omega_f$ is the forget gate weight matrix
$b_f$ is the forget gate bias vector

Equation 1 represents the forget gate. The forget gate helps
the LSTM model focus on relevant and informative features in
the user's description, discarding irrelevant and potentially
misleading information.
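To make Equation (1) concrete, the following minimal sketch evaluates a forget gate for a toy hidden state of size 2 and an input of size 3 in plain Python. All numeric values, the helper names `sigmoid`, `matvec` and `forget_gate`, and the reading of $U_f$ as the weight matrix applied to the input $x_t$ are assumptions for illustration; none of them are taken from the trained model.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def forget_gate(w_f, u_f, b_f, h_prev, x_t):
    """Compute f_t = sigmoid(w_f h_prev + u_f x_t + b_f) element-wise."""
    recurrent_term = matvec(w_f, h_prev)
    input_term = matvec(u_f, x_t)
    return [sigmoid(r + i + b)
            for r, i, b in zip(recurrent_term, input_term, b_f)]


if __name__ == "__main__":
    # Assumed toy parameters: hidden size 2, input size 3.
    w_f = [[0.5, -0.2], [0.1, 0.3]]
    u_f = [[0.4, 0.0, -0.1], [0.2, 0.6, 0.0]]
    b_f = [0.0, 0.0]
    h_prev = [0.1, -0.3]
    x_t = [1.0, 0.5, -1.0]
    print(forget_gate(w_f, u_f, b_f, h_prev, x_t))
```

Each component of the result lies strictly between 0 and 1. A value near 0 means the corresponding part of the previous cell state is mostly discarded, and a value near 1 means it is mostly kept.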
$i_t = \sigma(\omega_i h_{t-1} + U_i x_t + b_i)$    (2)

Where,
$\omega_i$ is the input gate weight matrix
$b_i$ is the input gate bias vector

Equation 2 represents the input gate. The input gate decides
how much new data should be added by taking into
consideration the current input and the previous state's
output.

$o_t = \sigma(\omega_o h_{t-1} + U_o x_t + b_o)$    (3)

Where,
$\omega_o$ is the output gate weight matrix
$b_o$ is the output gate bias vector

Equation 3 represents the output gate. The output gate
controls how much of the cell state is passed on as the
layer's output. The model's final prediction is a probability
value between 0 and 1. A value nearer to 1 specifies that the
network is highly confident that the account is real, while a
value closer to 0 specifies that the network is highly
confident that the account is fake.

The LSTM structure represented by these equations provides
the model with the ability to analyze sequential data,
effectively capturing dependencies and discerning patterns
that are valuable for detecting fake accounts based on user
descriptions.

The Artificial Neural Network (ANN) learns patterns and
relationships within user features, enabling it to discern
distinguishing characteristics of fake accounts. By
iteratively adjusting weights during training, the model
captures complex representations associated with deceptive
behavior. The final trained ANN utilizes these learned
patterns to make predictions, effectively identifying
potential fake accounts based on the input feature patterns
it has internalized.

The Decision Tree model aids in detecting fake accounts by
recursively partitioning the feature space based on attribute
values. It makes binary decisions at each node to maximize
information gain or minimize impurity. During training, the
model learns to distinguish between genuine and fake accounts
by identifying crucial feature thresholds. In the evaluation
phase, the Decision Tree efficiently classifies instances by
traversing the learned tree structure.

The SVM is an ML algorithm that works by mapping the input
features (such as user attributes) and finding a hyperplane
that best separates the two classes (fake and genuine
accounts). The SVM optimization process involves maximizing
the margin between the classes while penalizing
misclassifications. The 'C' and 'gamma' parameters are tuned
through a grid search to optimize
the SVM's performance. The resulting SVM model classifies new
instances by assigning them to one of the two classes based
on their position relative to the hyperplane in the feature
space.

The Random Forest algorithm is an ensemble learning technique
that builds multiple decision trees and combines their
predictions for improved accuracy. Every tree is trained on a
random subset of the dataset, reducing overfitting and
enhancing generalization. The final prediction is determined
by a majority vote from the individual trees. This approach
is effective for detecting fake accounts as it can capture
complex patterns and relationships within diverse user
attribute data, offering a robust solution for classification
tasks.

XGBoost aids in fake account detection by constructing a
robust ensemble of decision trees that collectively model
complex relationships within the dataset. It can handle
imbalanced classes and can be weighted toward correct
classification of minority class instances, which is crucial
for identifying fake accounts. The algorithm's optimization
objective, which combines both loss minimization and
regularization terms, enhances its ability to generalize well
on unseen data, contributing to its effectiveness in
discerning genuine and fake accounts.

Fig. 2 shows the accuracy performance of the models. The
accuracy scores indicate the performance of different models
in classifying fake accounts. Decision Tree and ANN exhibit
the highest accuracies at 98.75% and 98.76%, respectively,
showcasing their proficiency in distinguishing between
genuine and fake accounts. Random Forest follows with an
accuracy of 94.5%, indicating robust ensemble learning.
XGBoost and SVM also demonstrate strong performance at 94.15%
and 92.91%, respectively.
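The accuracy, precision, recall and F1 scores reported for the models all derive from confusion-matrix counts. The sketch below shows the standard definitions in plain Python. The function name `classification_metrics`, the choice of label 1 as the positive class, and the toy labels are assumptions for illustration and do not reproduce the results reported in this paper.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no predicted/actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels: 3 true positives, 3 true negatives, 1 false positive,
    # 1 false negative.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
    print(classification_metrics(y_true, y_pred))
```

Because F1 is the harmonic mean of precision and recall, a model with high precision but low recall scores lower on F1 than its precision alone would suggest.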
Fig. 2. Accuracy Score of different Algorithms

[...] account detection.

Fig. 3. Precision Score of different Algorithms

Recall measures a classification model's ability to identify
every relevant case of a given class. Greater recall values
show how well a model captures a significant fraction of true
positive cases in comparison to the actual positives in the
dataset. In the presented results, ANN (Artificial Neural
Network) attains the highest recall, 0.9961, reflecting its
efficacy in identifying genuine accounts among the total
genuine instances. XGB Classifier, SVM, Random Forest, LSTM
and Decision Tree have recall values of 0.9020, 0.8649,
0.8986, 0.6753 and 0.9725 respectively.

Fig. 5 shows the F1-score performance parameter. ANN and
Decision Tree exhibit a strong F1 score of 0.9864,
highlighting their balance of precision and recall. Random
Forest follows with a score of 0.9449, indicating effective
performance in correctly identifying both classes. XGB
classifier, SVM and LSTM have F1-score values of 0.9418,
0.9275 and 0.7087 respectively.

Fig. 6. Scatter plot using K-means Clustering

Fig. 6 displays the outcomes of applying the k-means
procedure to a fake account detection dataset. The k-means
algorithm is an unsupervised learning method that clusters
data points into k clusters based on their similarity. The
left-hand plot shows the k-means clusters after the algorithm
has converged. The data points are colored according to their
assigned cluster. The right-hand plot shows the original data
before clustering.

The k-means algorithm has successfully identified two
clusters of data points. The first cluster (shown in blue)
contains data points for the real accounts. The second
cluster (shown in red) contains data points for the fake
accounts. The k-means algorithm can be used to detect fake
accounts by identifying data points that are anomalous or
outliers.

The area under the ROC curve (AUC) is a single metric used to
summarize the performance of a classifier. A higher AUC shows
better performance. In Fig 5, the Random Forest classifier
has the highest AUC (0.87), followed by the XGBoost
classifier (0.86), the SVM classifier (0.86), and the
Decision Tree classifier (0.78).

III. CONCLUSION
This work addressed the large-scale problem of detecting
malicious accounts on online social platforms using advanced
machine learning algorithms. The proposed solution helps
maintain the authenticity of social media interactions and
improves online environments owing to its accuracy,
precision, recall, adaptability, and scalability. A novelty
of the system is that, by focusing on the behavioral patterns
of user interactions, it provides an innovative approach and
an advanced understanding that enhances the detection of fake
accounts. Using advanced deep learning and machine learning
algorithms, the system exceeds traditional methods and
accurately detects fake profiles. Innovative optimization
techniques reduce false positives and false negatives while
maintaining a balance between precision and recall, which
increases the overall efficiency of detecting fake accounts.
According to the above observations, ANN has the maximum
accuracy of 98.76%, SVM and Decision Tree have the maximum
precision value of 1.00, ANN has the maximum recall value of
0.9961, and ANN has the maximum F1-score value of 0.9864. A
limitation of the system is that the lack of diverse training
data might affect its performance by reducing the ability to
detect fake accounts.