
Unit-II

Supervised Machine Learning Algorithms

Supervised Machine Learning
● In supervised machine learning, models are trained using a dataset that consists of
input-output pairs.
● The supervised learning algorithm analyzes the dataset and learns the relation between the
input data (features) and the correct output (labels/targets). During training, the model
estimates its parameters by minimizing a loss function. The loss function
measures the difference between the model's predictions and the actual target values.
● The model iteratively updates its parameters until the loss/error has been sufficiently
minimized.
● Once training is complete, the model parameters have optimal values: the model has
learned the optimal mapping/relation between the inputs and targets. Now the model can
predict values for new, unseen input data (a minimal sketch of this training loop follows).
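The sketch below (not part of the original slides) illustrates this loop for a one-feature linear
model with a mean-squared-error loss; the data, learning rate, and number of steps are
illustrative assumptions.

import numpy as np

# Illustrative input-output pairs (assumed): y is roughly 2x + 1 with noise
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

w, b = 0.0, 0.0          # model parameters, initialized arbitrarily
lr = 0.01                # learning rate (assumed)

for step in range(1000):
    y_pred = w * X + b                  # model predictions
    error = y_pred - y
    loss = np.mean(error ** 2)          # mean-squared-error loss
    grad_w = 2 * np.mean(error * X)     # gradient of the loss w.r.t. w
    grad_b = 2 * np.mean(error)         # gradient of the loss w.r.t. b
    w -= lr * grad_w                    # iterative parameter update
    b -= lr * grad_b                    # that reduces the loss

print(f"learned parameters: w={w:.2f}, b={b:.2f}, final loss={loss:.4f}")
# The trained model can now predict a value for a new, unseen input:
print("prediction for x=5:", w * 5 + b)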

Types of Supervised Learning Algorithm
1. Classification
The key objective of classification-based tasks is to predict categorical output labels or responses for the
given input data, such as true-false, male-female, yes-no, etc. Categorical output responses are
unordered, discrete values; hence, each output response belongs to a specific class or category.

Some popular classification algorithms are decision trees, random forests, support vector machines (SVM),
logistic regression, etc.

2. Regression
The key objective of regression-based tasks is to predict output labels or responses, which are continuous
numeric values, for the given input data. Basically, regression models use the input data features
(independent variables) and their corresponding continuous numeric output values (dependent or outcome
variables) to learn specific associations between inputs and corresponding outputs.

Some popular regression algorithms are linear regression, polynomial regression, Lasso regression, etc.
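A minimal sketch contrasting the two task types with scikit-learn; the toy data (hours studied
versus pass/fail and exam score) is invented for illustration.

from sklearn.linear_model import LogisticRegression, LinearRegression

# Hypothetical data: one feature (hours studied)
X = [[1], [2], [3], [4], [5], [6]]
passed = [0, 0, 0, 1, 1, 1]          # categorical labels -> classification
score = [35, 48, 55, 62, 74, 88]     # continuous targets -> regression

clf = LogisticRegression().fit(X, passed)   # classification model
reg = LinearRegression().fit(X, score)      # regression model

print(clf.predict([[3.5]]))   # predicted class (0 or 1)
print(reg.predict([[3.5]]))   # predicted continuous value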
Decision Tree
● A Decision tree is a tree-like structure used to make decisions and analyze the
possible consequences. The algorithm splits the data into subsets based on
features, where each parent node represents internal decisions and the leaf node
represents final prediction.
● A decision tree is a supervised learning algorithm used for both classification and
regression problems. It is represented as a tree structure where each internal
node represents a test on an attribute, each branch represents the outcome of the
test, and each leaf node represents a class label or a predicted value. The goal of
a decision tree is to split the dataset into subsets based on the value of an
attribute, repeating this process until each subset contains only instances that
belong to a single class or have similar values.
Example
Attribute Selection Measures

Choosing the right attribute to split the data at each node is a critical step in
building an accurate decision tree. The most common methods used for attribute
selection are information gain and Gini index.

Information Gain

Information Gain is based on the concept of entropy, which measures the amount of
disorder or uncertainty in the dataset. The goal is to select the attribute that reduces
entropy the most when the data is split.

The formula for entropy is:

Entropy(S) = - Σ p(i) * log2(p(i))

where p(i) is the proportion of examples in S belonging to class i. The information gain of an
attribute is the entropy of the dataset minus the weighted average entropy of the subsets
produced by splitting on that attribute.
Gini Index
The Gini index measures the impurity of a dataset. A lower Gini index indicates a
purer dataset, meaning most of the instances belong to a single class. The Gini
index is computed as:

Gini = 1 - Σ p(i)²

where p(i) is the proportion of instances belonging to class i. Attributes that result in the lowest
Gini index after the split are considered the best candidates for splitting the data.
Example of Decision Tree
import pandas
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

df = pandas.read_csv("data.csv")

# Convert the non-numerical columns into numerical values
d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)
print(df)

# Separate the feature columns from the target column
features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']

# Fit a decision tree classifier and visualize it
dtree = DecisionTreeClassifier()
dtree = dtree.fit(X, y)
tree.plot_tree(dtree, feature_names=features)
plt.show()
Output
Rank <= 6.5 means that every comedian with a rank of 6.5 or lower will follow the True arrow (to the left), and the
rest will follow the False arrow (to the right).

gini = 0.497 refers to the quality of the split, and is always a number between 0.0 and 0.5, where 0.0 would mean
all of the samples got the same result, and 0.5 would mean that the split is done exactly in the middle.

samples = 13 means that there are 13 comedians left at this point in the decision, which is all of them since this is
the first step.

value = [6, 7] means that of these 13 comedians, 6 will get a "NO", and 7 will get a "GO".

The Gini method uses this formula:

Gini = 1 - (x/n)² - (y/n)²

Where x is the number of positive answers ("GO"), n is the number of samples, and y is the number of
negative answers ("NO"), which gives us this calculation:

1 - (7/13)² - (6/13)² = 0.497


True - 5 Comedians End Here:

gini = 0.0 means all of the samples got the same result.

samples = 5 means that there are 5 comedians left in this branch (5 comedians with a Rank of 6.5 or lower).

value = [5, 0] means that 5 will get a "NO" and 0 will get a "GO".

False - 8 Comedians Continue:

Nationality

Nationality <= 0.5 means that the comedians with a nationality value of less than 0.5 will follow the arrow to the left (which
means everyone from the UK), and the rest will follow the arrow to the right.

gini = 0.219 refers to the impurity of this node; the low value means that most of the samples at this node belong to the same class.

samples = 8 means that there are 8 comedians left in this branch (8 comedians with a Rank higher than 6.5).

value = [1, 7] means that of these 8 comedians, 1 will get a "NO" and 7 will get a "GO".
Advantages of DT
● Simple to understand and to interpret. Trees can be visualized.
● Requires little data preparation. Other techniques often require data normalization,
dummy variables need to be created and blank values to be removed. Some tree and
algorithm combinations support missing values.
● The cost of using the tree (i.e., predicting data) is logarithmic in the number of data
points used to train the tree.
● Able to handle both numerical and categorical data. However, the scikit-learn
implementation does not support categorical variables for now. Other techniques are
usually specialized in analyzing datasets that have only one type of variable.
● Able to handle multi-output problems.
Disadvantages of DT
● Decision-tree learners can create over-complex trees that do not generalize the data well.
This is called overfitting. Mechanisms such as pruning, setting the minimum number of
samples required at a leaf node or setting the maximum depth of the tree are necessary to
avoid this problem.
● Decision trees can be unstable because small variations in the data might result in a
completely different tree being generated. This problem is mitigated by using decision trees
within an ensemble.
● Predictions of decision trees are neither smooth nor continuous, but piecewise constant
approximations. Therefore, they are not good at extrapolation.
● The problem of learning an optimal decision tree is known to be NP-complete under several
aspects of optimality and even for simple concepts. There are concepts that are hard to learn
because decision trees do not express them easily, such as XOR, parity or multiplexer
problems.
● Decision tree learners create biased trees if some classes dominate. It is therefore
recommended to balance the dataset prior to fitting with the decision tree.
Appropriate Problems for Decision Tree Learning
Decision tree learning is generally best suited to problems with the following characteristics:
○ Instances are represented by attribute-value pairs.
■ There is a finite list of attributes (e.g. hair colour) and each instance stores a value for
that attribute (e.g. blonde).
■ When each attribute has a small number of distinct values (e.g. blonde, brown, red) it
is easier for the decision tree to reach a useful solution.
■ The algorithm can be extended to handle real-valued attributes (e.g. a floating point
temperature)
○ The target function has discrete output values.
■ A decision tree classifies each example as one of the output values.
■ Simplest case exists when there are only two possible classes (Boolean
classification).
■ However, it is easy to extend the decision tree to produce a target function with
more than two possible output values.
■ Although it is less common, the algorithm can also be extended to produce a target
function with real-valued outputs.
● Disjunctive descriptions may be required.
○ Decision trees naturally represent disjunctive expressions.
● The training data may contain errors.
○ Errors in the classification of examples, or in the attribute values describing those examples
are handled well by decision trees, making them a robust learning method.
● The training data may contain missing attribute values.
○ Decision tree methods can be used even when some training examples have unknown values
(e.g., humidity is known for only a fraction of the examples).
ID3 Algorithm
● ID3, or Iterative Dichotomiser 3, is an algorithm used in machine learning for building
decision trees from a given dataset. It was developed in 1986 by Ross Quinlan. It is a
greedy algorithm that builds a decision tree by recursively partitioning the dataset into
smaller and smaller subsets until all data points in each subset belong to the same class.
It employs a top-down approach, recursively selecting features to split the dataset based
on information gain.
● The ID3 algorithm selects the feature that provides the most information about the target
variable. The decision tree is built top-down, starting with the root node, which represents
the entire dataset. At each node, the ID3 algorithm selects the attribute that provides the
most information gain about the target variable. The attribute with the highest information
gain is the one that best separates the data points into different categories.
Steps in ID3 Algorithm
1. Determine the entropy of the overall dataset using the class distribution.
2. For each feature:
● Calculate the entropy of the subset produced by each unique categorical value of the
feature.
● Assess the information gain obtained by splitting on the feature.
3. Choose the feature that generates the highest information gain.
4. Iteratively apply the above steps to the resulting subsets to build the decision tree structure.
Step 1: Calculating Entropy for dataset
import math

# df: pandas DataFrame with an 'Outcome' target column (assumed to be loaded earlier)

def calculate_entropy(data, target_column):
    total_rows = len(data)
    target_values = data[target_column].unique()
    entropy = 0
    for value in target_values:
        # Calculate the proportion of instances with the current value
        value_count = len(data[data[target_column] == value])
        proportion = value_count / total_rows
        entropy -= proportion * math.log2(proportion)
    return entropy

entropy_outcome = calculate_entropy(df, 'Outcome')
print(f"Entropy of the dataset: {entropy_outcome}")


Step 2: Calculating Entropy and Information Gain
def calculate_information_gain(data, feature, target_column):
    # Calculate the weighted average entropy of the subsets produced by the feature
    unique_values = data[feature].unique()
    weighted_entropy = 0
    for value in unique_values:
        subset = data[data[feature] == value]
        proportion = len(subset) / len(data)
        weighted_entropy += proportion * calculate_entropy(subset, target_column)
    # Information gain = entropy of the current dataset minus the weighted entropy
    information_gain = calculate_entropy(data, target_column) - weighted_entropy
    return information_gain
Step 3: Assessing best feature with highest information gain
for column in df.columns[:-1]:
    entropy = calculate_entropy(df, column)
    information_gain = calculate_information_gain(df, column, 'Outcome')
    print(f"{column} - Entropy: {entropy:.3f}, Information Gain: {information_gain:.3f}")


Step 4: Build the ID3 Algorithm
def id3(data, target_column, features):
    # If all remaining examples share one class, return that class as a leaf
    if len(data[target_column].unique()) == 1:
        return data[target_column].iloc[0]
    # If no features are left, return the majority class as a leaf
    if len(features) == 0:
        return data[target_column].mode().iloc[0]
    # Choose the feature with the highest information gain
    best_feature = max(features, key=lambda x: calculate_information_gain(data, x, target_column))
    tree = {best_feature: {}}
    features = [f for f in features if f != best_feature]
    # Recursively build a subtree for each value of the best feature
    for value in data[best_feature].unique():
        subset = data[data[best_feature] == value]
        tree[best_feature][value] = id3(subset, target_column, features)
    return tree
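A minimal usage sketch, assuming df is the same DataFrame used above with an 'Outcome'
target column; the feature names shown in the comment are hypothetical.

import pprint

# Use every column except the target as a candidate feature
features = [c for c in df.columns if c != 'Outcome']
decision_tree = id3(df, 'Outcome', features)

# The tree is returned as nested dictionaries, e.g. (hypothetical):
# {'Weather': {'Sunny': 'No', 'Rain': {'Wind': {...}}}}
pprint.pprint(decision_tree)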
Issues in Decision Tree Learning
1. Overfitting:
Decision trees can become overly complex and fit the training data too closely, including noise and irrelevant
details. This leads to poor performance on new, unseen data.
2. Sensitivity to Data Changes:
Small variations in the training data can lead to significant differences in the resulting decision tree, making the
model unstable.
3. Bias Towards Features with More Levels:
Decision trees may favor features with many unique values, potentially leading to biased models that neglect other
important features.
4. Difficulty in Capturing Linear Relationships:
Decision trees inherently represent non-linear relationships and may struggle with datasets where relationships
are primarily linear.
K-Nearest Neighbor (KNN)
The k-Nearest Neighbors (KNN) algorithm is a supervised machine learning algorithm used for both
classification and regression tasks. It operates on the principle that similar data points exist in close
proximity to each other.

Supervised Learning:
KNN requires labeled training data, where each data point has a known class or value.
Distance Metric:
KNN relies on calculating distances between data points. The most common metric is Euclidean
distance, but others like Manhattan, Minkowski, and Hamming distances can also be used.
K Value:
The 'k' represents the number of nearest neighbors considered when making a prediction. The
choice of 'k' is crucial and can impact the model's performance.
No Training Phase:
KNN is a "lazy learner," meaning it doesn't have an explicit training phase. The training data is
simply stored and used during prediction.
KNN is a simple, supervised machine learning (ML) algorithm that can be used for
classification or regression tasks - and is also frequently used in missing value imputation. It
is based on the idea that the observations closest to a given data point are the most "similar"
observations in a data set, and we can therefore classify unforeseen points based on the
values of the closest existing points. By choosing K, the user can select the number of
nearby observations to use in the algorithm.

https://colab.research.google.com/drive/1l0t7JHRHXFAr_r_bd7FWLvy2rV5LL_Ge
?pli=1&authuser=1#scrollTo=vJOqkRd3nfxY
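Separate from the linked notebook, a minimal KNN classification sketch with scikit-learn; the
toy data and the choice of k are illustrative assumptions.

from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points and labels invented for illustration
X = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]]
y = [0, 0, 0, 1, 1, 1]

# k = 3: each prediction is a majority vote among the 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)          # "lazy learner": fit mainly stores the training data

print(knn.predict([[2, 2], [6, 5]]))   # expected output: [0 1]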
K Means Clustering

● K-Means Clustering is an unsupervised machine learning algorithm which groups an
unlabeled dataset into different clusters. It is used to organize data into groups based
on their similarity.
● The algorithm works by first randomly picking some central points called centroids;
each data point is then assigned to the closest centroid, forming a cluster. After all the
points are assigned to a cluster, the centroids are updated by finding the average
position of the points in each cluster. This process repeats until the centroids stop
changing. The goal of clustering is to divide the data points into clusters so that similar
data points belong to the same group.

https://colab.research.google.com/drive/1BDvrp2jol_afAow3ZAnzse77dK-yeF-Y
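Separate from the linked notebook, a minimal K-Means sketch with scikit-learn; the data and
the number of clusters are illustrative assumptions.

from sklearn.cluster import KMeans

# Unlabeled toy points invented for illustration: two visible groups
X = [[1, 2], [1, 4], [2, 3], [8, 8], [9, 9], [8, 9]]

# Ask for 2 clusters; the algorithm alternates assignment and centroid updates
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # final centroid positions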
Distance Metrics Used in KNN Algorithm

1. Euclidean Distance

2. Manhattan Distance

3. Minkowski Distance
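The formulas for these metrics were not reproduced above; a small sketch of how each could be
computed for two points, written here for illustration.

def euclidean(p, q):
    # square root of the sum of squared coordinate differences
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    # sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, r=3):
    # generalizes both: r = 1 gives Manhattan, r = 2 gives Euclidean
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

print(euclidean((1, 2), (4, 6)))   # 5.0
print(manhattan((1, 2), (4, 6)))   # 7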
Working of KNN algorithm
Step 1: Selecting the optimal value of K

● K represents the number of nearest neighbors that need to be considered while making a
prediction.

Step 2: Calculating distance

● To measure the similarity between the target point and the training data points, a distance
metric such as Euclidean distance is used. The distance is calculated between each data point
in the dataset and the target point.

Step 3: Finding Nearest Neighbors

● The k data points with the smallest distances to the target point are nearest neighbors.
Step 4: Voting for Classification or Taking Average for Regression

● When you want to classify a data point into a category like spam or not spam, the KNN
algorithm looks at the K closest points in the dataset. These closest points are called
neighbors. The algorithm then looks at which category the neighbors belong to and picks
the one that appears the most. This is called majority voting.
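A tiny sketch of this majority-voting step; the neighbor labels are assumed for illustration.

from collections import Counter

# Suppose the k = 5 nearest neighbors have these labels
neighbor_labels = ['spam', 'not spam', 'spam', 'spam', 'not spam']

# Majority voting: the most common label among the neighbors becomes the prediction
prediction = Counter(neighbor_labels).most_common(1)[0][0]
print(prediction)   # 'spam'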
