Cs Batchno29
VAISHNAVI.V
SIST
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade "A" by NAAC | 12B Status by UGC | Approved by AICTE
JEPPIAAR NAGAR, RAJIV GANDHI SALAI, CHENNAI 600119
MAY 2023
Internal Guide: Ms. VINODHINI K, M.Sc., Assistant Professor
TABLE OF CONTENTS

ABSTRACT
LIST OF ABBREVIATIONS

CHAPTER  TITLE
1        INTRODUCTION
2        SYSTEM ANALYSIS
3        SYSTEM REQUIREMENTS
         3.4 PYTHON
         3.5 ANACONDA
4        SYSTEM ARCHITECTURE
6        CONCLUSION
7        APPENDIX
         7.2 OUTPUT
8        REFERENCES

LIST OF FIGURES

LIST OF TABLES

LIST OF ABBREVIATIONS
4. SN - Social Network
6. AI - Artificial Intelligence
7. RF - Random Forest
8. DT - Decision Tree
9. NB - Naïve Bayes
ABSTRACT
At present, social network sites are part of daily life for most people. Every day, many people create profiles on social network platforms and interact with others regardless of their location and time. Social network sites not only provide advantages to their users but also raise security issues for the users and their information. To analyse who is spreading threats in a social network, we need to classify the social network profiles of the users. From this classification, we can separate the genuine profiles from the fake profiles on the social networks. Traditionally, there are different classification methods for detecting fake profiles on social networks, but the accuracy rate of fake profile detection needs to be improved. In this paper we propose Machine Learning and Natural Language Processing (NLP) techniques to improve the accuracy rate of fake profile detection, using the Support Vector Machine (SVM) and Naïve Bayes algorithms.
CHAPTER 1
INTRODUCTION
Social networking has become a popular activity on the web today, attracting millions of users who spend billions of minutes on such services. Online Social Network (OSN) services range from social interaction platforms such as Instagram, Facebook or MySpace, to information dissemination platforms such as Twitter or Google Buzz, to social features added to existing systems such as Flickr. On the other hand, growing security concerns and protecting OSN privacy still represent a major bottleneck and an open challenge.

When using Social Networks (SNs), different people share different amounts of their personal information. Having our personal information fully or partially exposed to the public makes us prime targets for various types of attacks, the worst of which is identity theft. Identity theft occurs when someone uses another person's information for personal gain or purposes. In recent years, online identity theft has been a major problem, affecting millions of people worldwide. Victims of identity theft may suffer different kinds of consequences; for example, they may lose time or money, be sent to prison, have their public image ruined, or have their relationships with friends and family damaged. At present, the vast majority of SNs do not verify ordinary users' accounts and have very weak privacy and security policies. In fact, most SN applications default their settings to minimal privacy; consequently, SNs have become an ideal platform for fraud and abuse. Social networking services have facilitated identity theft and impersonation attacks for serious as well as naive attackers. To make matters worse, users are required to provide accurate information to set up an account on social networking sites.
The details supplied by the user at the time of profile creation are known as static data, whereas the details recorded by the system within the network are called dynamic data. Static data includes the demographic attributes of a user and his/her interests, while dynamic data includes the user's runtime behaviour and locality in the network. The vast majority of existing research depends on both static and dynamic data. However, this is not applicable to many social networks, where only some static profile attributes are visible and dynamic behaviour is not observable to the user community. Several approaches have been proposed by different researchers to detect fake identities and malicious content in online social networks. Each approach has its own merits and demerits.
Problems involving social networking, such as privacy violations, online bullying, misuse and trolling, are in many instances carried out through fake profiles on social networking sites. Fake profiles are profiles that are not genuine, i.e. they are profiles of people presenting false credentials. Fake Facebook profiles, in particular, often engage in malicious and undesirable activities, causing problems for other social network users. People create fake profiles for social engineering, for online impersonation to defame a person, and for advertising and campaigning for an individual or a group of individuals. Facebook has its own security system to protect user credentials from spamming, phishing and similar attacks, known as the Facebook Immune System (FIS). However, the FIS has not been able to detect fake profiles created on Facebook by users to any large extent.
1.1 LITERATURE SURVEY

[1] Title: Understanding User Profiles on Social Media for Fake News Detection
Description:
Description:
Fake profiles have an adverse effect on the trustworthiness of the network as a
whole, and can represent significant costs in time and effort in building a
connection based on fake information. Unfortunately, fake profiles are difficult to
identify. Approaches have been proposed for some social networks; however,
these generally rely on data that are not publicly available for LinkedIn profiles. In
this research, we identify the minimal set of profile data necessary for identifying
fake profiles in LinkedIn, and propose an appropriate data mining approach for
fake profile identification. We demonstrate that, even with limited profile data, our
approach can identify fake profiles with 87% accuracy and 89% True Negative
Rate, which is comparable to the results obtained based on larger data sets and
more expansive profile information. Further, when compared to approaches using
similar amounts and types of data, our method provides an improvement of
approximately 14% accuracy.
Description:
Social networking platforms, particularly sites like Twitter and Facebook, have grown tremendously in the past decade and have attracted the interest of millions of users. They have become a preferred means of communication, and because of this they have also attracted the interest of various malicious entities such as spammers. The growing number of users on social media has also created the problem of fake accounts. These false and fake identities are intensively involved in malicious activities such as spreading abuse and misinformation, spamming, and artificially inflating the number of users in an application in order to promote it and sway public opinion. Detecting these fake identities thus becomes important in order to protect genuine users from malicious intents. To address this issue, we aim to use a feature-based approach to identify these fake profiles on social media platforms. We have used twenty-four features to identify fake accounts efficiently. To verify the classification results, three classification algorithms are used. Experimental results show that our model was able to reach 87.9% accuracy using the Random Forest algorithm. Hence, the proposed approach is efficient in detecting fake profiles.
[4] Title: Method for detecting spammers and fake profiles in social networks
Description:
A method for protecting user privacy in an online social network, according to which negative examples of fake profiles and positive examples of legitimate profiles are chosen from the database of existing users of the social network. Then, a predetermined set of features is extracted for each chosen fake and legitimate profile by dividing the friends or followers of the chosen examples into communities and analysing the relationships of each node inside and between the communities. Classifiers that can detect other existing fake profiles according to their features are constructed and trained using supervised learning.
[5] Title: Social Networks Fake Profiles Detection Using Machine Learning
Algorithms
Description:
Fake profiles play an important role in advanced persistent threats and are also involved in other malicious activities. The paper focuses on identifying fake profiles in social media. The approaches to identifying fake profiles in social media can be classified into approaches aimed at analysing profile data and approaches aimed at individual accounts. Fake profile creation in social networks is considered to cause more harm than any other form of cybercrime, and this crime has to be detected even before the user is notified about the fake profile creation. Many algorithms and methods have been proposed for the detection of fake profiles in the literature. The paper sheds light on the role of fake identities in advanced persistent threats and covers the mentioned approaches to detecting fake social media profiles. In order to make a relevant prediction of fake or genuine profiles, we will assess the impact of three supervised machine learning algorithms: Random Forest (RF), Decision Tree (DT-J48), and Naïve Bayes (NB).
CHAPTER 2
SYSTEM ANALYSIS
2.1 EXISTING SYSTEM

There are many issues that make this procedure difficult to implement, and one of the biggest problems associated with fraud detection is the lack of both literature providing experimental results and real-world data on which academic researchers can perform experiments. The reason for this is that the sensitive financial data associated with fraud has to be kept confidential to protect customers' privacy. Here we enumerate the different properties a fraud detection system should have in order to generate proper results:

The system should be able to handle skewed distributions, since only a very small percentage of all credit card transactions is fraudulent.

There should be a proper means of handling noise. Noise refers to the errors present in the data, for example incorrect dates. Another problem in this field is overlapping data: many transactions may resemble fraudulent transactions when they are actually genuine, and the opposite also happens, when a fraudulent transaction appears to be genuine.

The system should be able to adapt itself to new kinds of fraud, since successful fraud techniques decrease in efficiency over time as they become well known, and an efficient fraudster will always find new and inventive ways of working.

There is a need for good metrics to evaluate the classifier system. For example, overall accuracy is not suited for evaluation on a skewed distribution, since even with a very high accuracy almost all fraudulent transactions can be misclassified, as illustrated below.
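To illustrate the last point, the short sketch below (illustrative only, not part of the project's code; it assumes NumPy and scikit-learn are available) shows how a trivial classifier that never flags a fraudulent transaction still reaches 99% accuracy on a skewed dataset while detecting nothing:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 990 + [1] * 10)    # only 1% of the samples are fraudulent
y_pred = np.zeros_like(y_true)             # a "model" that always predicts genuine
print("accuracy:", accuracy_score(y_true, y_pred))                # 0.99
print("recall:", recall_score(y_true, y_pred, zero_division=0))   # 0.0, no fraud caught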
2.2 DISADVANTAGES OF EXISTING SYSTEM
• Most of the existing methods ignore poor-quality data such as noise, or handle complex features poorly.
• Problems involving social networking, such as privacy violations, online bullying, misuse, inaccurate analysis and trolling, are in many instances carried out through fake profiles on social networking sites.
• Fake profiles are profiles that are not genuine, i.e. profiles of people with false credentials.
2.3 PROPOSED SYSTEM

A thorough literature survey shows that there are various methods that can be used for fake profile detection; some of these approaches are based on Machine Learning and NLP. To analyse who is spreading threats in a social network, we need to classify the social network profiles of the users. From this classification, we can separate the genuine profiles from the fake profiles on the social networks. Traditionally, there are different classification methods for detecting fake profiles on social networks, but the accuracy rate of fake profile detection needs to be improved.
In this paper we present a machine learning and natural language processing system to detect fake profiles in online social networks. We apply five algorithms, namely Support Vector Machine (SVM), Random Forest classifier, Gradient Boosting classifier, Naïve Bayes and Logistic Regression, to increase the detection accuracy rate for fake profiles. For the final prediction we obtain the accuracy values, the classification report and the confusion matrix. The proposed system is used to evaluate which model gives the best detection accuracy rate for fake profiles.
2.4 ADVANTAGES OF PROPOSED SYSTEM
• Fake profiles in social networks can be detected easily and securely.
• More datasets are included.
• Profiles of all types on different social media applications can also be examined.
2.5 FEASIBILITY STUDY

The feasibility of the project is analysed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.
i. Economical Feasibility
This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited, and the expenditures must be justified. The developed system is well within the budget, and this was achieved because most of the technologies used are freely available; only the customized products had to be purchased.
ii. Technical Feasibility
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the available technical resources and on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.
iii. Social Feasibility
This study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate users about the system and to make them familiar with it. Users' confidence must be raised so that they are also able to offer constructive criticism, which is welcomed, as they are the final users of the system.
CHAPTER 3
SYSTEM REQUIREMENTS
Documentation : MS Office
Software      : Python
3.4 PYTHON

Python is a dynamic, high-level, free, open-source and interpreted programming language. It supports object-oriented programming as well as procedure-oriented programming. In Python we do not need to declare the type of a variable, because it is a dynamically typed language. For example, x = 10; here x can hold anything, such as a string or an int.
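As a small illustrative snippet (not taken from the project code), the same variable name can be rebound to values of different types, which is what dynamic typing means in practice:

x = 10            # x currently refers to an int
print(type(x))    # <class 'int'>
x = "hello"       # the same name can now refer to a str
print(type(x))    # <class 'str'>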
Features in Python
There are many features in Python, some of which are discussed below:
• Easy to code
• Free and Open Source
• Object-Oriented Language
• GUI Programming Support
• High-Level Language
• Extensible feature
• Python is Portable language
• Python is Integrated language
3.5 ANACONDA
Anaconda distribution comes with over 250 packages automatically installed, and
over 7,500 additional open-source packages can be installed from PyPI as well as
the conda package and virtual environment manager. It also includes a
GUI, Anaconda Navigator, as a graphical alternative to the Command Line Interface (CLI).
The big difference between conda and the pip package manager is in how package dependencies are managed, which is a significant challenge for Python data science and the reason conda exists. Packages can also be installed into a conda environment using pip, and conda will keep track of what it has installed itself and what pip has installed.
Custom packages can be made using the conda build command, and can be shared
with others by uploading them to Anaconda Cloud, PyPI or other repositories.
The default installation of Anaconda2 includes Python 2.7 and Anaconda3 includes
Python 3.7. However, it is possible to create new environments that include any
version of Python packaged with conda.
The following applications are available by default in Navigator:
• JupyterLab
• Jupyter Notebook
• QtConsole
• Spyder
• Glue
• Orange
• RStudio
• Visual Studio Code
The Notebook interface was added to IPython in the 0.12 release (December 2011) and was renamed to Jupyter Notebook in 2015 (IPython 4.0 – Jupyter 1.0). Jupyter
Notebook is similar to the notebook interface of other programs such
as Maple, Mathematica, and SageMath, a computational interface style that
originated with Mathematica in the 1980s. According to The Atlantic, Jupyter
interest overtook the popularity of the Mathematica notebook interface in early 2018.
CHAPTER 4
SYSTEM ARCHITECTURE
[System architecture diagram: the dataset is divided into a training dataset and a test dataset.]
4.2 DATA FLOW (DF) DIAGRAM
[Data flow diagram: the Instagram datasets pass through feature extraction and reduction, then through the learning algorithms, and are evaluated using the parameters True Positive Rate (TPR), False Positive Rate (FPR) and Area Under the ROC Curve (AUC), with a decision branch testing whether TPR > FPR.]
4.3 ENTITY RELATIONSHIP (ER) DIAGRAM
[ER diagram: the dataset attributes (username, full name, profile picture, follows, followers, private) are pre-processed; the pre-processed dataset is split into X/Y training and testing sets; the Naïve Bayes, Logistic Regression, Random Forest, Support Vector Machine and Gradient Boosting models are trained, validated and evaluated, producing an accuracy score percentage.]
CHAPTER 5
SYSTEM IMPLEMENTATION
5.1 ARTIFICIAL INTELLIGENCE
5.2 MACHINE LEARNING

Machine learning can be defined in a summarised way as follows: "Machine learning enables a machine to automatically learn from data, improve performance from experience, and predict things without being explicitly programmed." Suppose we have a complex problem in which we need to perform some predictions; instead of writing code for it, we just need to feed the data to generic algorithms, and with the help of these algorithms the machine builds the logic from the data and predicts the output. Machine learning has changed our way of thinking about such problems. In general, a machine learning algorithm learns a model from past data and uses that model to predict outputs for new data.
5.2.1 FEATURES OF MACHINE LEARNING
• Machine learning uses data to detect various patterns in a given dataset.
• It can learn from past data and improve automatically.
• It is a data-driven technology.
• Machine learning is very similar to data mining, as it also deals with huge amounts of data.
5.2.2 CLASSIFICATION OF MACHINE LEARNING
At a broad level, machine learning can be classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) SUPERVISED LEARNING
Supervised learning is a type of machine learning method in which we provide sample labelled data to the machine learning system in order to train it, and on that basis it predicts the output. The system creates a model using labelled data to understand the datasets and learn from each example; once training and processing are done, we test the model by providing sample data to check whether it predicts the correct output. The goal of supervised learning is to map input data to output data.
Supervised learning can be grouped further in two categories of algorithms:
• Classification
• Regression
2) UNSUPERVISED LEARNING
Unsupervised learning is a learning method in which a machine learns without any
supervision. The training is provided to the machine with the set of data that has not
been labelled, classified, or categorized, and the algorithm needs to act on that data
without any supervision. The goal of unsupervised learning is to restructure the input
data into new features or a group of objects with similar patterns.
It can be further classified into two categories of algorithms:
• Clustering
• Association
3) REINFORCEMENT LEARNING
Reinforcement learning is a feedback-based learning method in which a learning agent receives a reward for each correct action and a penalty for each wrong action; the agent learns from this feedback and automatically improves its performance.
5.3 NATURAL LANGUAGE PROCESSING (NLP)

Basic NLP tasks include tokenization and parsing, lemmatization/stemming, part-of-speech tagging, language detection and identification of semantic relationships. In general terms, NLP tasks break language down into shorter, elemental pieces, try to understand relationships between the pieces, and explore how the pieces work together to create meaning.
These underlying tasks are often used in higher-level NLP capabilities, such as:
• Content categorization. A linguistic-based document summary, including
search and indexing, content alerts and duplication detection.
• Topic discovery and modelling. Accurately capture the meaning and
themes in text collections, and apply advanced analytics to text, like optimization
and forecasting.
• Contextual extraction. Automatically pull structured information from text-
based sources.
• Sentiment analysis. Identifying the mood or subjective opinions within large
amounts of text, including average sentiment and opinion mining.
• Speech-to-text and text-to-speech conversion. Transforming voice
commands into written text, and vice versa.
• Document summarization. Automatically generating synopses of large
bodies of text.
• Machine translation. Automatic translation of text or speech from one
language to another.
In all these cases, the overarching goal is to take raw language input and use
linguistics and algorithms to transform or enrich the text in such a way that it delivers
greater value.
Text preprocessing techniques may be general so that they are applicable to many
types of applications, or they can be specialized for a specific task.
TOKENIZATION
Tokenization is used in natural language processing to split paragraphs and
sentences into smaller units that can be more easily assigned meaning. The first
step of the NLP process is gathering the data (a sentence) and breaking it into
understandable parts (words).
STOP WORD
Stop word removal is one of the most commonly used preprocessing steps across
different NLP applications. The idea is simply removing the words that occur
commonly across all the documents in the corpus. Typically, articles and pronouns
are generally classified as stop words.
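A minimal sketch of tokenization and stop word removal, assuming the NLTK library is installed (NLTK is not part of the project's appendix code, so this is illustrative only):

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# nltk.download('punkt'); nltk.download('stopwords')   # may be needed on first use
text = "This profile was created yesterday and it follows thousands of accounts"
tokens = word_tokenize(text)                      # break the sentence into word tokens
stop_words = set(stopwords.words('english'))      # common English articles, pronouns, etc.
filtered = [t for t in tokens if t.lower() not in stop_words]
print(tokens)
print(filtered)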
STEMMING AND LEMMATIZATION
Stemming basically removes the suffix from a word and reduces it to its root word. For example, "flying" is a word and its suffix is "ing"; if we remove "ing" from "flying", we get the base or root word, "fly". Such suffixes are used to create new words from the original stem word.
Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma. For instance, stemming the word "caring" may return "car", whereas lemmatizing the word "caring" returns "care".
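A brief comparison of stemming and lemmatization, again assuming NLTK is available (illustrative only; the exact outputs depend on the stemmer and lemmatizer chosen):

from nltk.stem import PorterStemmer, WordNetLemmatizer

# nltk.download('wordnet')   # may be needed on first use
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word in ["flying", "caring", "studies"]:
    # print the suffix-stripped stem next to the dictionary-based lemma
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos='v'))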
5.4 SUPPORT VECTOR MACHINE (SVM)

The goal of the SVM algorithm is to find the best decision boundary that separates n-dimensional space into classes, so that new data points can be placed in the correct category in the future. This best decision boundary is called a hyperplane. SVM chooses the extreme points/vectors that help in creating the hyperplane; these extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. The two different categories are thus classified using a decision boundary, or hyperplane.
HYPERPLANE
There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, and the best boundary is known as the hyperplane of the SVM. The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features, the hyperplane is a straight line, and if there are 3 features, the hyperplane is a two-dimensional plane. We always create the hyperplane that has the maximum margin.
SUPPORT VECTORS
The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
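As a minimal sketch (not the project's exact pipeline), an SVM classifier can be trained with scikit-learn as follows; the dataset here is synthetic and the parameters are illustrative:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
svm = SVC(kernel='rbf', C=1.0)     # maximum-margin classifier with an RBF kernel
svm.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, svm.predict(X_test)))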
5.5 NAIVE BAYES
1. Naïve Bayes algorithm is a supervised learning algorithm, which is based
on Bayes theorem and used for solving classification problems.
2. It is mainly used in text classification that includes a high-dimensional training
dataset.
3. The Naïve Bayes Classifier is one of the simplest and most effective classification algorithms and helps in building fast machine learning models that can make quick predictions.
4. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
5. Some popular examples of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
The Naïve Bayes algorithm is made up of two words, Naïve and Bayes: "Naïve" because it assumes that the features are independent of one another, and "Bayes" because it is based on Bayes' theorem.
BAYES' THEOREM
• Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used
to determine the probability of a hypothesis with prior knowledge. It
depends on the conditional probability.
• The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) · P(A) / P(B)
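A minimal Gaussian Naïve Bayes sketch with scikit-learn (illustrative only; the project's actual pipeline is given in the appendix). The predict_proba call shows the probabilistic nature of the classifier:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
nb = GaussianNB()                  # applies Bayes' theorem with a Gaussian likelihood per feature
nb.fit(X_train, y_train)
print("accuracy:", nb.score(X_test, y_test))
print("class probabilities for one sample:", nb.predict_proba(X_test[:1]))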
5.6 LOGISTIC REGRESSION

In logistic regression the dependent variable is binary in nature, with data coded as either 1 (which stands for success/yes) or 0 (which stands for failure/no). Generally, logistic regression means binary logistic regression with binary target variables, but there can be two more categories of target variables that can be predicted by it. Based on the number of categories, logistic regression can be divided into the following types:
BINARY OR BINOMIAL
In such a kind of classification, the dependent variable will have only two possible types, either 1 or 0. For example, these variables may represent success or failure, yes or no, win or loss, etc.
MULTINOMIAL
In such a kind of classification, the target variable can have three or more possible unordered types, for example "type A", "type B" or "type C".
ORDINAL
In such a kind of classification, the target variable deals with three or more ordered categories; for example, the categories "poor", "good", "very good" and "excellent" can each be given a score such as 0, 1, 2, 3.
In the case of binary logistic regression, the target variables must always be binary and the desired outcome is represented by the factor level 1. There should not be any multi-collinearity in the model, which means the independent variables must be independent of each other. We must include meaningful variables in our model, and we should choose a large sample size for logistic regression.
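A minimal binary logistic regression sketch with scikit-learn (illustrative only, on synthetic data); predict_proba returns the probability of the positive class (factor level 1):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=6, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)
logreg = LogisticRegression(max_iter=1000)   # binary (binomial) logistic regression by default
logreg.fit(X_train, y_train)
print("mean accuracy:", logreg.score(X_test, y_test))
print("P(class = 1) for one sample:", logreg.predict_proba(X_test[:1])[0, 1])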
5.7 RANDOM FOREST

Before understanding the working of the random forest, we must look into the ensemble technique. Ensemble simply means combining multiple models; thus, a collection of models is used to make predictions rather than an individual model. Ensemble methods use two types of techniques:
1. BAGGING – Creates different training subsets from the sample training data with replacement, and the final output is based on majority voting. Random Forest is an example.
2. BOOSTING – Combines weak learners into strong learners by creating sequential models such that the final model has the highest accuracy.
Steps involved in the random forest algorithm:
Step 1: In random forest, n random records are taken from a data set having k records.
Step 2: Individual decision trees are constructed for each sample.
Step 3: Each decision tree generates an output.
Step 4: The final output is based on majority voting (for classification) or averaging (for regression).
2. Immune to the curse of dimensionality – since each tree does not consider all the features, the feature space is reduced.
4. Train-test split – in a random forest we do not have to segregate the data into train and test sets, as there will always be about 30% of the data that is not seen by a given decision tree. These points are illustrated in the short code sketch below.
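The steps and features above correspond to what scikit-learn's RandomForestClassifier does internally; the sketch below (synthetic data, illustrative parameters, not the project's code) also shows the out-of-bag score computed from the samples each tree did not see:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=3)
rf.fit(X_train, y_train)                    # each tree is built on a bootstrap sample (bagging)
print("out-of-bag score:", rf.oob_score_)   # estimated from samples unseen by each tree
print("test accuracy:", rf.score(X_test, y_test))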
Fig 5.1 Comparing the models using Cross-Validation
Correlation heatmaps are a type of plot that visualise the strength of relationships between numerical variables. Correlation plots are used to understand which variables are related to each other and the strength of this relationship. A correlation plot typically contains a number of numerical variables, with each variable represented by a row and a column; each cell shows the relationship between the corresponding pair of variables. The values in the cells indicate the strength of the relationship, with positive values indicating a positive relationship and negative values indicating a negative relationship.
Fig 5.2 Correlation Heatmap Between Features
CONFUSION MATRIX
• For 2 prediction classes, the confusion matrix is a 2×2 table; for 3 classes it is a 3×3 table, and so on.
• The matrix is divided into two dimensions, predicted values and actual values, along with the total number of predictions.
• Predicted values are the values predicted by the model, and actual values are the true values of the given observations.
• It looks like the table below:

                  Predicted: No      Predicted: Yes
  Actual: No      True Negative      False Positive
  Actual: Yes     False Negative     True Positive
• True Negative: the model predicted No, and the real or actual value was also No.
• True Positive: the model predicted Yes, and the actual value was also Yes.
• False Negative: the model predicted No, but the actual value was Yes; this is also called a Type-II error.
• False Positive: the model predicted Yes, but the actual value was No; this is also called a Type-I error.
A small example of computing these counts is shown below.
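The following sketch (illustrative only, not from the project's appendix) computes these four counts with scikit-learn; note that in scikit-learn's convention the rows of the matrix are the actual values and the columns are the predicted values:

from sklearn.metrics import confusion_matrix

y_actual = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred   = [1, 0, 0, 1, 0, 1, 1, 0]
cm = confusion_matrix(y_actual, y_pred)
tn, fp, fn, tp = cm.ravel()        # unpack the 2x2 matrix
print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)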
CHAPTER 6
CONCLUSION
CHAPTER 7
APPENDIX
7.1 SOURCE CODE
#Import Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
def load_train_data():
    # Load the labelled training data; the 'fake' column is the target label
    train_data = pd.read_csv('train.csv', header=0)
    X_train = train_data.drop(columns='fake')
    y_train = train_data['fake']
    return X_train, y_train
from sklearn.datasets import load_files
def load_test_data():
    # Load the held-out test data with the same structure as the training set
    test_data = pd.read_csv('test.csv', header=0)
    X_test = test_data.drop(columns='fake')
    y_test = test_data['fake']
    return X_test, y_test
from sklearn.model_selection import cross_validate
def get_classifier_cv_score(model, X, y, scoring='accuracy', cv=7):
    # Cross-validate the model and return its mean train and validation scores
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring,
                            return_train_score=True)
    train_scores = scores['train_score']
    val_scores = scores['test_score']
    train_mean = np.mean(train_scores)
    val_mean = np.mean(val_scores)
    return train_mean, val_mean
def print_grid_search_result(grid_search):
    # Report the best parameter combination and its mean train/validation scores
    print(grid_search.best_params_)
    best_train = grid_search.cv_results_["mean_train_score"][grid_search.best_index_]
    print("best mean_train_score: {:.3f}".format(best_train))
    best_test = grid_search.cv_results_["mean_test_score"][grid_search.best_index_]
    print("best mean_test_score: {:.3f}".format(best_test))
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(y_actual, y_pred, labels, title=''):
    # Draw the confusion matrix as an annotated heatmap
    data = confusion_matrix(y_actual, y_pred)
    ax = sns.heatmap(data,
                     annot=True,
                     cbar=False,
                     fmt='d',
                     xticklabels=labels,
                     yticklabels=labels)
    ax.set_title(title)
    ax.set_xlabel("predicted values")
    ax.set_ylabel("actual values")
#data loading
X_data, y_data = load_train_data()
X_data.info()
X_data.head()
X_data.tail()
X_data.shape
y_data.shape
# Finding Missing Values
X_data.isnull().sum()
# Check if there is an imbalance in the labels:
# a ratio of about 1:1 would mean no imbalance, but here the ratio is closer to 2:1.
unique, freq = np.unique(y_data, return_counts=True)
for i, j in zip(unique, freq):
    print("Label: ", i, ", Frequency: ", j)
data_corr = X_data.corr(method='pearson')
ax = sns.heatmap(data_corr, vmin=-2, vmax=2, cmap='BrBG')
ax.set_title("Correlation Heatmap Between Features")
# Assumed reconstruction: the candidate-model definitions and the start of the
# cross-validation loop are not present in the extracted source.
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
model_list = [SVC(), RandomForestClassifier(random_state=55), GradientBoostingClassifier(),
              GaussianNB(), LogisticRegression(max_iter=1000)]
train_scores, val_scores = [], []
for model in model_list:
    train, val = get_classifier_cv_score(model, X_data, y_data)
    train_scores.append(train)
    val_scores.append(val)
models_score = sorted(list(zip(val_scores, train_scores, model_list)), reverse=True)
print("-------------------------------------")
for val, train, model in models_score:
    print("\nModel: {} ".format(model.__class__.__name__))
    print("\ntrain_score: {:.3f}".format(train))
    print("\nvalidation_score: {:.3f}".format(val))
    print("-------------------------------------")
model = RandomForestClassifier(random_state=55)
# Assumed reconstruction: the definition of 'grid2' and the train/test split were
# lost in extraction; a typical GridSearchCV setup over the model above is sketched.
from sklearn.model_selection import GridSearchCV, train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, random_state=55)
grid2 = GridSearchCV(model, {'n_estimators': [50, 100, 200]}, cv=7, return_train_score=True)
grid2.fit(X_train, y_train)
print_grid_search_result(grid2)
# Pipeline
# Final Evaluation
7.2 OUTPUT
CHAPTER 8
REFERENCES
9. Ojo, Adebola K. "Improved model for detecting fake profiles in online social network: A case study of Twitter." Journal of Advances in Mathematics and Computer Science (2019): 1-17.
10. Meel, Priyanka, and Dinesh Kumar Vishwakarma. "Fake news, rumor,
information pollution in social media and web: A contemporary survey of state-
of-the-arts, challenges and opportunities." Expert Systems with
Applications (2019): 112986.