
Classification Algorithms

INTELIGÊNCIA ARTIFICIAL E CIBERSEGURANÇA (INACS)


nunal@isep.ipp.pt
oms@isep.ipp.pt
icp@isep.ipp.pt
Classification Algorithms
• Machine Learning comprises different learning paradigms that relate to different tasks
• When focusing on Supervised Learning
• We can perform Classification and Regression tasks

• Classification will be addressed in INACS since it is widely used in Cybersecurity


• There are multiple algorithms for classification; in INACS we are going to address some of
them:
• K-Nearest Neighbors (KNN)
• A simple and versatile algorithm that is quite explainable on its own
• Decision Tree (DT)
• DTs are highly explainable and well-performing algorithms, especially when
multiple trees are combined in an Ensemble
• Ensemble Algorithms
• Well-performing models that take advantage of multiple Estimators
K-Nearest Neighbors
• It works for both classification and regression
problems

• It’s easy to implement and understand, but it has a major drawback: it becomes
significantly slower as the size of the dataset grows
K-Nearest Neighbors
• KNN requires:
• A set of labeled records

• A proximity metric to compute the distance/similarity between a pair of records


• Euclidean, Manhattan or Minkowski distance

• The value of k, the number of nearest neighbors to retrieve

• A method for using the class labels of the k nearest neighbors to determine the class label of an
unknown record
• It can be the majority voting of the k nearest neighbors of the unknown record
K-Nearest Neighbors
• KNN works as follows:

• In the training phase:


• Store all training records and labels in an efficient data structure

• Store all of the algorithm’s configurations

• E.g., the number of neighbors k and the distance metric

• There is no real “training” involved, as the algorithm simply stores a kind of “memory” of the data
K-Nearest Neighbors
• KNN works as follows:

• In the inference phase:


1. Determine the distance between the unknown record and all records of the
dataset according to a specific measure (e.g., Euclidean)
• This is the algorithm’s bottleneck, as it gets slower as the dataset grows
• Note: Manhattan distance is often preferred over Euclidean for high-dimensional
data

2. Take the k nearest neighbors

3. Among these k neighbors, count the data points that belong to each label

4. Assign the new data point to the most frequent label among its nearest neighbors
K-Nearest Neighbors
• In practice:
• https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
• Sklearn’s implementation
• https://towardsdatascience.com/make-knn-300-times-faster-than-scikit-learns-in-20-lines-5e29d74e76bb
• https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/
• With Facebook’s faiss, a custom implementation can be up to 300x faster (a minimal usage
sketch of sklearn’s classifier follows below)
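A minimal usage sketch of sklearn's KNeighborsClassifier; the Iris data set and the chosen parameter values are illustrative assumptions, not taken from the slides.

```python
# Minimal KNN sketch with scikit-learn (illustrative data and parameters)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Distance-based models are sensitive to feature scale, so standardize first
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# k and the distance metric are the main configuration choices;
# metric="minkowski" with p=2 is Euclidean distance, p=1 is Manhattan
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn.fit(X_train, y_train)          # "training" only stores the records and labels
print(knn.score(X_test, y_test))   # prediction by majority voting of the 5 neighbors
```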
Decision Trees
How can an algorithm be represented as a tree?

• Example from the Titanic dataset


• 3 features / attributes / columns
considered
• Sex, Age and SibSp (number of siblings or spouses aboard)
• Decision nodes test the features; leaf nodes assign the classes
Decision Trees
• Nodes:
• Root node
• Node
• Leaf node
• Parent and Child Node

• Processes:
• Splitting
• Pruning
Decision Trees
• Intuition

Hitters dataset to predict a baseball player’s Salary (mean log salary)
based on Years (the number of years played in the major leagues) and
Hits (the number of hits made in the previous year)

• R1 = {X | Years < 4.5}
• R2 = {X | Years >= 4.5, Hits < 117.5}
• R3 = {X | Years >= 4.5, Hits >= 117.5}
Decision Trees
• The decision boundaries are parallel to the axes because the condition tested at each tree node
involves only one attribute at a time
Decision Trees
• Greedy strategy
• Split the records based on a test on the attribute that optimizes a specific criterion

• Particularities
• How to split the records?
• How to determine the attribute(s) that lead to the best partition?
• How to specify the test condition of the attribute?
• Different algorithms lead to different decision trees, based on local optimization of
attributes

• When to stop partitioning the data?


Decision Trees
• Splitting in Decision Trees

• Start at the tree root and split the data on the feature that results in the largest
information gain (IG)
• The objective is to maximize the information gain (IG) at each split:

IG(Dp, f) = I(Dp) − Σj (Nj / Np) · I(Dj)

• f is the feature used to perform the split
• Dp and Dj are the data sets of the parent and the j-th child node
• I is the impurity measure
• Np is the total number of samples at the parent node
• Nj is the number of samples in the j-th child node
Decision Trees
• Most libraries (including scikit-learn) implement binary decision trees

• How to calculate the Impurity?


• Gini Impurity
• Entropy
• Misclassification Impurity

Note: the lower the impurity of the child nodes, the larger the information gain
Decision Trees
• Gini Impurity
• A measure of how often a randomly chosen element from the dataset would be incorrectly
labeled if it were randomly labeled according to the distribution of labels in the subset

IGini = 1 − Σj pj²

• where j indexes the classes present in the node and pj is the proportion of class j in
the node

• Total Gini Impurity = weighted average of the children's IGini


Decision Trees
• Gini Impurity - example
• A simple simulation with the Heart Disease data set, which has 303 rows and 13 attributes
• The target consists of 138 records with value 0 and 165 with value 1
• Focus on the attributes Sex, Fbs (fasting blood sugar) and Exang (exercise-induced angina)
• A short code sketch of these computations follows below

https://archive.ics.uci.edu/ml/datasets/Heart+Disease
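A small sketch (not from the slides) of how the Gini values on the next slides can be computed; only the target counts (138 vs 165) come from the slides, while per-attribute child counts would come from the data set itself.

```python
# Gini impurity helpers (illustrative; node counts are hypothetical except the target)
def gini(counts):
    """Gini impurity of a node given the class counts in that node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def total_gini(children):
    """Weighted average of the children's Gini impurities (the 'Total Gini Impurity')."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(gini([138, 165]))  # impurity of the Target node (~0.496)
```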
Decision Trees
• Gini Impurity - example
• Gini Impurity (Sex = 0)
• Gini Impurity (Sex = 1)
• Total Gini Impurity (Sex) = 0.399


Decision Trees
• Gini Impurity - example
• Gini Impurity (Fbs = 0)
• Gini Impurity (Fbs = 1)
• Total Gini Impurity (Fbs) = 0.249


Decision Trees
• Gini Impurity - example
• Gini Impurity (Exang = 0) = 0.495
• Gini Impurity (Exang = 1) = 0.24
• Total Gini Impurity (Exang) = 0.356
Decision Trees
• Gini Impurity – example
• Gini Impurity
• Sex = 0.399
• Fbs = 0.249
• Exang = 0.356

• Gini Impurity tells us the probability of misclassifying an observation

• Which is the best attribute to use for the partition?

• The lower the Gini, the better the split, as the likelihood of misclassification is lower
• Here, Fbs (0.249) gives the lowest Total Gini Impurity, so it is the best attribute to split on
Decision Trees
• Entropy
• Measures the amount of information of a variable that can take n values

IEntropy = − Σj pj · log2(pj)

• where j indexes the classes present in the node and pj is the proportion of class j in
the node

• The attribute with the highest Information Gain on the first iteration becomes the Root Node

• Information Gain = information before the partition − information after the partition


Decision Trees
• Entropy - example
• A simple simulation with the Heart Disease data set, which has 303 rows and 13 attributes
• The target consists of 138 records with value 0 and 165 with value 1
• Focus on the attributes Sex, Fbs (fasting blood sugar) and Exang (exercise-induced angina)

• Calculate the entropy of the Target attribute first, then the Information Gain of each
attribute (see the sketch below)

https://archive.ics.uci.edu/ml/datasets/Heart+Disease
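A minimal sketch (not from the slides) of the entropy and Information Gain computations; only the target distribution (138 vs 165) is taken from the slides, and child-node counts would come from the data set itself.

```python
# Entropy and information-gain helpers (illustrative)
from math import log2

def entropy(counts):
    """Entropy of a node given the class counts in that node."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy before the partition minus the weighted entropy after the partition."""
    n = sum(parent)
    weighted = sum(sum(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

print(entropy([138, 165]))  # entropy of the Target attribute (~0.994)
```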
Decision Trees
• Entropy - example
• Entropy (Sex = 0)
• Entropy (Sex = 1)
• Information Gain (Sex) = 0.328
Decision Trees
• Entropy - example
• Entropy (Fbs = 0)
• Entropy (Fbs = 1)
• Information Gain (Fbs) = 0.389
Decision Trees
• Entropy - example
• Entropy (Exang = 0)
• Entropy (Exang = 1)
• Information Gain (Exang) = 0.224
Decision Trees
• Entropy – example

• Information Gain
• Sex = 0.328
• Fbs = 0.389
• Exang = 0.224

• Which is the best attribute to use for the partition?

• The higher the Information Gain, the better the split
• Here, Fbs (0.389) gives the highest Information Gain, so it is the best attribute to split on


Decision Trees
• Misclassification Impurity

IMisclass = 1 − maxj(pj)

• Example nodes and their class counts (impurities computed in the sketch below):
• C1 = 0, C2 = 6
• C1 = 1, C2 = 5
• C1 = 3, C2 = 3
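A quick sketch computing the misclassification impurity for the three example nodes above.

```python
# Misclassification impurity of the three example nodes
def misclassification(counts):
    """Misclassification impurity 1 - max_j(p_j) of a node given its class counts."""
    total = sum(counts)
    return 1.0 - max(counts) / total

for counts in ([0, 6], [1, 5], [3, 3]):
    print(counts, round(misclassification(counts), 3))  # 0.0, 0.167 and 0.5 respectively
```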
Decision Trees
• Classification and Regression Trees (CART) algorithm
• Step 1: Start at the root node with all training instances
• Step 2: Select an attribute based on splitting criteria
• Step 3: Partition instances according to selected attribute recursively

• Partitioning stops when:


• There are no examples left
• All examples for a given node belong to the same class
• There are no remaining attributes for further partitioning: the majority class becomes
the leaf label

• Uses Gini Impurity (a scikit-learn sketch follows below)
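A hedged sketch of the same CART-style procedure using scikit-learn's DecisionTreeClassifier, which implements an optimized CART variant; the Iris data is used purely for illustration instead of the slides' weather dataset.

```python
# CART-style tree with scikit-learn (illustrative data)
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, y = data.data, data.target

# criterion="gini" reproduces the Gini-based splitting described above
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)

# Inspect the learned tree as text: root node, decision nodes and leaves
print(export_text(tree, feature_names=list(data.feature_names)))
```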


CART - example
◦ Number of instances: 14
◦ Number of attributes: 4

◦ Step 1: Start at the root node with all training


instances
◦ Step 2: Select an attribute based on splitting
criteria (use Gini Impurity)
◦ Step 3: Partition instances according to
selected attribute recursively

https://sefiks.com/2018/08/27/a-step-by-step-cart-decision-tree-example/
CART - example
Step 2: Select an attribute on the basis of splitting criteria
Outlook

Gini(Outlook)
CART - example
Step 2: Select an attribute on the basis of splitting criteria
Temp

Gini(Temp)
CART - example
Step 2: Select an attribute on the basis of splitting criteria
Humidity

Gini(Humidity)
CART - example
Step 2: Select an attribute on the basis of splitting criteria
Wind

Gini(Wind)
CART - example
Step 2: Select an attribute on the basis of splitting criteria
Time to decide

Let’s place Outlook at the top of the tree


CART - example
CART - example

Next decisions:
• Focus on the sub dataset for Sunny outlook
• Focus on the sub dataset for Rain outlook
CART - example
Focus on the sub dataset for Sunny outlook Step 2: Select an attribute on the basis of splitting criteria
Temp

Gini(Outlook=Sunny and Temp.)


CART - example
Focus on the sub dataset for Sunny outlook Step 2: Select an attribute on the basis of splitting criteria
Humidity

Gini(Outlook=Sunny and Humidity.)


CART - example
Focus on the sub dataset for Sunny outlook Step 2: Select an attribute on the basis of splitting criteria
Wind

Gini(Outlook=Sunny and Wind) = 0.466
CART - example
Focus on the sub dataset for Sunny outlook Step 2: Select an attribute on the basis of splitting criteria
Time to decide

Let’s place Humidity at the extension of Sunny Outlook


CART - example
• Next decisions:
• Focus on the sub dataset for Sunny outlook
CART - example
• Next decisions:
• Focus on the sub dataset for Rain outlook
CART - example
Focus on the sub dataset for Rain outlook Step 2: Select an attribute on the basis of splitting criteria
Temp

Gini(Outlook=Rain and Temp.)


CART - example
Focus on the sub dataset for Rain outlook Step 2: Select an attribute on the basis of splitting criteria
Humidity

Gini(Outlook=Rain and Humidity)


CART - example
Focus on the sub dataset for Rain outlook Step 2: Select an attribute on the basis of splitting criteria
Wind

Gini(Outlook=Rain and Wind)


CART - example
Focus on the sub dataset for Rain outlook Step 2: Select an attribute on the basis of splitting criteria
Time to decide

Let’s place Wind at the extension of Rain Outlook


CART - example
• Next decisions:
• Focus on the sub dataset for Rain outlook
CART - example
Decision Trees
• Stopping criteria

• All examples belong to the same class

• All examples have the same values in their attributes (but different classes)

• Few examples are left

• The gain of all possible splits is very low (see below for how these criteria map to
scikit-learn hyperparameters)
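A minimal sketch (an assumed mapping, not from the slides) of how these stopping criteria correspond to scikit-learn's DecisionTreeClassifier hyperparameters.

```python
# Pre-pruning: each argument stops the partitioning early
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,                 # never grow beyond a fixed depth
    min_samples_split=10,        # "few examples left": do not split small nodes
    min_samples_leaf=5,          # every leaf must keep a minimum number of examples
    min_impurity_decrease=1e-3,  # "gain is very low": require a minimum impurity drop per split
)
```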


Decision Trees
• Model Selection
• The possibility of model overfitting increases as the model becomes more complex

• Occam’s razor or the principle of parsimony


• given two models with the same errors, the simpler model is preferred over the more
complex model – keep it simple!
Decision Trees
• Advantages
• Simple to Understand, Visualize and Interpret
• Nonlinear relationships between parameters do not affect its performance
• Can handle both Numerical and Categorical data

• Disadvantages
• The model can become unstable due to small variations in the data
• Overfitting happens when the algorithm captures noise in the data
• A highly complex decision tree tends to have low bias / high variance
• Which makes it hard for the model to generalize to new data (pruning, sketched below, is
one way to mitigate this)
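One common way to fight this overfitting is cost-complexity pruning; the sketch below is an illustration, not the slides' own method, and it assumes a separate validation set is available.

```python
# Cost-complexity pruning sketch: pick the pruning strength on a validation set
from sklearn.tree import DecisionTreeClassifier

def prune_tree(X_train, y_train, X_val, y_val):
    """Fit one tree per candidate ccp_alpha and keep the best on validation data."""
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
    best_tree, best_score = None, -1.0
    for alpha in path.ccp_alphas:
        candidate = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
        candidate.fit(X_train, y_train)
        score = candidate.score(X_val, y_val)
        if score > best_score:
            best_tree, best_score = candidate, score
    return best_tree
```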
Decision Trees
• Using a Decision Tree Analysis for Intrusion Detection: A How-To Guide
• https://sansorg.egnyte.com/dl/6edQgfwngE

• Classification of Malware Attacks Using Machine Learning In Decision Tree


• http://repository.uwl.ac.uk/id/eprint/8022/1/IJS-155.pdf

• Network intrusion detection system using J48 Decision Tree


• https://ieeexplore.ieee.org/document/7275914

• A Consolidated Decision Tree-Based Intrusion Detection System for Binary and Multiclass
Imbalanced Datasets
• https://www.mdpi.com/2227-7390/9/7/751/pdf

• Decision Tree and Context Based Malicious Event Detection


• Chapter 7 of Book “Hands-On Machine Learning for Cybersecurity”
Ensemble Learning
• It uses multiple machine learning models to make better predictions

• Specific models do well in modeling one aspect of data, while others do well in modeling
another

• Instead of learning a single complex model, learn several simple models and combine their
output to build the final decision

• The combined strength of the models compensates for individual model variances and biases

• Delivers a combined prediction where the final accuracy is better than the accuracy of the
individual models
Bagging
• Bagging (Bootstrap Aggregating)
• Random samples of the training
data set (subsets of training data
set) are created

• A classifier is built for each sample

• The results of these multiple
classifiers are combined using
averaging or majority voting

• Bagging helps to reduce the
variance error (!), as sketched below
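A minimal sketch of bagging with scikit-learn's BaggingClassifier; the breast-cancer data set and parameter values are illustrative assumptions, not from the slides.

```python
# Bagging sketch: bootstrap samples, one tree per sample, majority voting at prediction time
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # classifier trained on each bootstrap sample
    n_estimators=50,
    bootstrap=True,                      # sample the training set with replacement
    random_state=0,
)
print(cross_val_score(bagging, X, y, cv=5).mean())
```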
Boosting
• Boosting
• Boosting provides sequential
learning of the predictors

• The first predictor is learned on the


whole data set, while the following
are learnt on the errors of the
previous ones
Boosting
• Boosting
• It starts by classifying original data set and giving equal weights to each observation

• If classes are predicted incorrectly by the first learner, a higher weight is given to
the misclassified observations

• Being an iterative process, it continues to add learners until a limit on the number of
models or on accuracy is reached

• Boosting has shown better predictive accuracy than bagging, but it also tends to overfit
the training data (!); a sketch follows below
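As a hedged sketch of the reweighting scheme above, AdaBoost is one concrete boosting algorithm available in scikit-learn; the slides do not name a specific algorithm here, and the data set is again illustrative.

```python
# Boosting sketch with AdaBoost (one concrete reweighting-based boosting algorithm)
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each new weak learner gives more weight to the observations the previous ones misclassified
boosting = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
print(cross_val_score(boosting, X, y, cv=5).mean())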
Stacking
• Stacking
• Use a learner to combine output
from different learners, learning a
function of their predictions

• This can lead to a decrease in either
the bias or the variance error, depending
on the combining learner we use (a sketch
follows below)
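A minimal sketch of stacking with scikit-learn; the choice of base learners and meta-learner is an illustrative assumption, not prescribed by the slides.

```python
# Stacking sketch: a logistic-regression meta-learner combines two base learners
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(max_depth=5)),
    ],
    final_estimator=LogisticRegression(),  # learns a function of the base learners' predictions
)
# stack.fit(X_train, y_train) trains the base learners and the combining learner
```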
Tree Ensembles
• Tree Ensembles

• Random Forest
• The “forest” it builds is an ensemble of decision trees, usually trained with the
“bagging” method
• Instead of searching for the most important feature while splitting a node, it
searches for the best feature among a random subset of features
• This results in a wide diversity that generally results in a better model
• Is based on the “Wisdom of Crowds” theory

• To assure good performance:


• Requires good representative data
• The correlation between the predictions of each tree must be low
Tree Ensembles
• Tree Ensembles

• XGBoost (eXtreme Gradient Boosting)


• Highly efficient tree-ensemble based on “boosting”
• It is the champion of many open Kaggle competitions
• Features:
• Sparse Aware implementation with automatic handling of missing data values
• Block Structure to support the parallelization of tree construction
• Continued Training so that you can further boost an already fitted model on
new data
• Supports several Boosting algorithms:
• Gradient Boosting algorithm with custom learning rate
• Stochastic Gradient Boosting with multiple sub-sampling split levels
• Regularized Gradient Boosting with both L1 and L2 regularization.
Tree Ensembles
• In practice:
• https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
• sklearn’s “RandomForestClassifier”
• https://xgboost.readthedocs.io/en/stable/tutorials/model.html
• Inner details of XGBoost framework
• https://practicaldatascience.co.uk/machine-learning/how-to-create-a-classification-model-using-xgboost
• https://www.kaggle.com/code/dansbecker/xgboost/notebook
• Practical examples with XGBoost
• https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html
• sklearn’s “StackingClassifier”
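A short sketch tying the links above together; it assumes the optional xgboost package is installed, and the data set is illustrative rather than taken from the slides.

```python
# Random Forest vs. XGBoost sketch (illustrative data and parameters)
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random Forest: bagged trees, each split limited to a random subset of features
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)

# XGBoost: regularized gradient boosting (reg_lambda is the L2 penalty)
xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, reg_lambda=1.0)
xgb.fit(X_train, y_train)

print("Random Forest accuracy:", rf.score(X_test, y_test))
print("XGBoost accuracy:", xgb.score(X_test, y_test))
```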
