Unit-II
Some popular classification algorithms are decision trees, random forests, support vector machines (SVM),
logistic regression, etc.
2. Regression
The key objective of regression-based tasks is to predict output labels or responses, which are continuous
numeric values, for the given input data. Basically, regression models use the input data features
(independent variables) and their corresponding continuous numeric output values (dependent or outcome
variables) to learn specific associations between inputs and corresponding outputs.
Some popular regression algorithms are linear regression, polynomial regression, Lasso regression, etc.
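For illustration, a minimal regression sketch using scikit-learn's LinearRegression on made-up data (the feature values and outputs below are invented for demonstration only):
from sklearn.linear_model import LinearRegression
import numpy as np

# Illustrative data: one input feature and a continuous output value
X = np.array([[1], [2], [3], [4], [5]])               # e.g. years of experience
y = np.array([30000, 35000, 41000, 46000, 52000])     # e.g. salary

model = LinearRegression()
model.fit(X, y)                         # learn the association between inputs and outputs
print(model.predict(np.array([[6]])))   # predict a continuous value for an unseen input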
Decision Tree
● A Decision tree is a tree-like structure used to make decisions and analyze their
possible consequences. The algorithm splits the data into subsets based on
features, where each internal node represents a decision on a feature and each
leaf node represents the final prediction.
● A decision tree is a supervised learning algorithm used for both classification and
regression problems. It is represented as a tree structure where each internal
node represents a test on an attribute, each branch represents the outcome of the
test, and each leaf node represents a class label or a predicted value. The goal of
a decision tree is to split the dataset into subsets based on the value of an
attribute, repeating this process until each subset contains only instances that
belong to a single class or have similar values.
Attribute Selection Measures
Choosing the right attribute to split the data at each node is a critical step in
building an accurate decision tree. The most common methods used for attribute
selection are information gain and Gini index.
Information Gain
Information gain measures the reduction in entropy obtained by splitting the dataset
on an attribute; the attribute with the highest information gain is the best candidate
for the split.
Gini Index
Attributes that result in the lowest Gini index after the split are considered the best
candidates for splitting the data.
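For illustration, a minimal sketch of how these measures can be computed from class counts; the counts [6, 7], [5, 0] and [1, 7] are the node values that appear in the worked example below:
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

parent = [6, 7]                  # root node: 6 "NO" and 7 "GO" examples
children = [[5, 0], [1, 7]]      # the two child nodes produced by the first split

info_gain = entropy(parent) - sum(
    (sum(child) / sum(parent)) * entropy(child) for child in children)
print(gini(parent), info_gain)   # gini of the root is 0.497, as in the example below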
Example of Decision Tree
import pandas
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

df = pandas.read_csv("data.csv")

# Map the categorical columns to numbers so the tree can use them
d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)
print(df)

features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']

dtree = DecisionTreeClassifier()
dtree = dtree.fit(X, y)

tree.plot_tree(dtree, feature_names=features)
plt.show()
Output
Rank <= 6.5 means that every comedian with a rank of 6.5 or lower will follow the True arrow (to the left), and the
rest will follow the False arrow (to the right).
gini = 0.497 refers to the quality of the split, and is always a number between 0.0 and 0.5, where 0.0 would mean
all of the samples got the same result, and 0.5 would mean that the split is done exactly in the middle.
samples = 13 means that there are 13 comedians left at this point in the decision, which is all of them since this is
the first step.
value = [6, 7] means that of these 13 comedians, 6 will get a "NO", and 7 will get a "GO".
The Gini method uses this formula:
gini = 1 - (x/n)² - (y/n)²
where x is the number of positive answers ("GO"), n is the number of samples, and y is the number of
negative answers ("NO"), which gives us this calculation:
1 - (7/13)² - (6/13)² = 0.497
gini = 0.0 means all of the samples got the same result.
samples = 5 means that there are 5 comedians left in this branch (5 comedians with a Rank of 6.5 or lower).
value = [5, 0] means that 5 will get a "NO" and 0 will get a "GO".
Nationality
Nationality <= 0.5 means that the comedians with a nationality value of less than 0.5 will follow the arrow to the left (which
means everyone from the UK), and the rest will follow the arrow to the right.
gini = 0.219 refers to the quality of the split; being fairly close to 0.0, it means most of the samples in this branch get the
same result.
samples = 8 means that there are 8 comedians left in this branch (8 comedians with a Rank higher than 6.5).
value = [1, 7] means that of these 8 comedians, 1 will get a "NO" and 7 will get a "GO".
Advantages of DT
● Simple to understand and to interpret. Trees can be visualized.
● Requires little data preparation. Other techniques often require data normalization,
creation of dummy variables, and removal of blank values. Some tree and
algorithm combinations support missing values.
● The cost of using the tree (i.e., predicting data) is logarithmic in the number of data
points used to train the tree.
● Able to handle both numerical and categorical data. However, the scikit-learn
implementation does not support categorical variables for now. Other techniques are
usually specialized in analyzing datasets that have only one type of variable.
● Able to handle multi-output problems.
Disadvantages of DT
● Decision-tree learners can create over-complex trees that do not generalize the data well.
This is called overfitting. Mechanisms such as pruning, setting the minimum number of
samples required at a leaf node, or setting the maximum depth of the tree are necessary to
avoid this problem (a minimal sketch of these controls follows this list).
● Decision trees can be unstable because small variations in the data might result in a
completely different tree being generated. This problem is mitigated by using decision trees
within an ensemble.
● Predictions of decision trees are neither smooth nor continuous, but piecewise constant
approximations. Therefore, they are not good at extrapolation.
● The problem of learning an optimal decision tree is known to be NP-complete under several
aspects of optimality and even for simple concepts. There are concepts that are hard to learn
because decision trees do not express them easily, such as XOR, parity or multiplexer
problems.
● Decision tree learners create biased trees if some classes dominate. It is therefore
recommended to balance the dataset prior to fitting with the decision tree.
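As a minimal sketch of the overfitting controls mentioned in the list above (the parameter values are illustrative, not tuned):
from sklearn.tree import DecisionTreeClassifier

# Limiting the depth and requiring a minimum number of samples per leaf
# are common ways to keep the tree from fitting noise in the training data.
dtree = DecisionTreeClassifier(
    max_depth=3,            # stop splitting after three levels
    min_samples_leaf=5,     # each leaf must contain at least 5 training samples
    ccp_alpha=0.01)         # cost-complexity pruning strength
dtree = dtree.fit(X, y)     # X, y as prepared in the comedian example above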
Appropriate Problems for Decision Tree Learning
Decision tree learning is generally best suited to problems with the following characteristics:
● Instances are represented by attribute-value pairs.
○ There is a finite list of attributes (e.g. hair colour) and each instance stores a value for
that attribute (e.g. blonde).
○ When each attribute has a small number of distinct values (e.g. blonde, brown, red) it
is easier for the decision tree to reach a useful solution.
○ The algorithm can be extended to handle real-valued attributes (e.g. a floating point
temperature).
● The target function has discrete output values.
○ A decision tree classifies each example as one of the output values.
○ The simplest case exists when there are only two possible classes (Boolean
classification).
○ However, it is easy to extend the decision tree to produce a target function with
more than two possible output values.
○ Although it is less common, the algorithm can also be extended to produce a target
function with real-valued outputs.
● Disjunctive descriptions may be required.
○ Decision trees naturally represent disjunctive expressions.
● The training data may contain errors.
○ Errors in the classification of examples, or in the attribute values describing those examples
are handled well by decision trees, making them a robust learning method.
● The training data may contain missing attribute values.
○ Decision tree methods can be used even when some training examples have unknown values
(e.g., humidity is known for only a fraction of the examples).
ID3 Algorithm
● ID3 or Iterative Dichotomiser 3 Algorithm is used in machine learning for building
decision trees from a given dataset. It was developed in 1986 by Ross Quinlan. It is a
greedy algorithm that builds a decision tree by recursively partitioning the data set into
smaller and smaller subsets until all data points in each subset belong to the same class.
It employs a top-down approach, recursively selecting features to split the dataset based
on information gain.
● The ID3 algorithm selects the feature that provides the most information about the target
variable. The decision tree is built top-down, starting with the root node, which represents
the entire dataset. At each node, the ID3 algorithm selects the attribute that provides the
most information gain about the target variable. The attribute with the highest information
gain is the one that best separates the data points into different categories.
Steps in ID3 Algorithm
1. Determine the entropy of the overall dataset using the class distribution.
2. For each feature:
● Calculate the entropy for each of its unique categorical values.
● Assess the information gain obtained by splitting on that feature.
3. Choose the feature that generates the highest information gain.
4. Apply the above steps recursively to build the decision tree structure.
Step 1: Calculating entropy for the dataset
import math

def calculate_entropy(data, target_column):
    total_rows = len(data)
    target_values = data[target_column].unique()
    entropy = 0
    for value in target_values:
        p = len(data[data[target_column] == value]) / total_rows   # class proportion
        entropy -= p * math.log2(p)
    return entropy
Step 2: Calculating information gain for a feature
def calculate_information_gain(data, feature, target_column):
    unique_values = data[feature].unique()
    weighted_entropy = 0
    for value in unique_values:
        subset = data[data[feature] == value]
        weighted_entropy += (len(subset) / len(data)) * calculate_entropy(subset, target_column)
    information_gain = calculate_entropy(data, target_column) - weighted_entropy
    return information_gain
Step 3: Assessing the best feature with the highest information gain and building the tree
def id3(data, features, target_column):
    if len(data[target_column].unique()) == 1:    # all examples share one class
        return data[target_column].iloc[0]
    if len(features) == 0:                        # no features left: return the majority class
        return data[target_column].mode().iloc[0]
    # Split on the feature with the highest information gain
    gains = {f: calculate_information_gain(data, f, target_column) for f in features}
    best_feature = max(gains, key=gains.get)
    tree = {best_feature: {}}
    for value in data[best_feature].unique():
        subset = data[data[best_feature] == value]
        tree[best_feature][value] = id3(subset, [f for f in features if f != best_feature], target_column)
    return tree
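A small usage sketch of the functions above on a hand-made toy dataset (the rows below are invented for illustration):
import pandas as pd

data = pd.DataFrame({
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain'],
    'Windy':   ['False', 'True', 'False', 'False', 'True'],
    'Play':    ['No', 'No', 'Yes', 'Yes', 'No'],
})
tree = id3(data, ['Outlook', 'Windy'], 'Play')
print(tree)   # nested dict, e.g. {'Outlook': {'Sunny': 'No', 'Overcast': 'Yes', 'Rain': {...}}}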
Issues in Decision Tree Learning
1. Overfitting:
Decision trees can become overly complex and fit the training data too closely, including noise and irrelevant
details. This leads to poor performance on new, unseen data.
2. Sensitivity to Data Changes:
Small variations in the training data can lead to significant differences in the resulting decision tree, making the
model unstable.
3. Bias Towards Features with More Levels:
Decision trees may favor features with many unique values, potentially leading to biased models that neglect other
important features.
4. Difficulty in Capturing Linear Relationships:
Decision trees inherently represent non-linear relationships and may struggle with datasets where relationships
are primarily linear.
K-Nearest Neighbor (KNN)
The k-Nearest Neighbors (KNN) algorithm is a supervised machine learning algorithm used for both
classification and regression tasks. It operates on the principle that similar data points exist in close
proximity to each other.
Supervised Learning:
KNN requires labeled training data, where each data point has a known class or value.
Distance Metric:
KNN relies on calculating distances between data points. The most common metric is Euclidean
distance, but others like Manhattan, Minkowski, and Hamming distances can also be used.
K Value:
The 'k' represents the number of nearest neighbors considered when making a prediction. The
choice of 'k' is crucial and can impact the model's performance.
No Training Phase:
KNN is a "lazy learner," meaning it doesn't have an explicit training phase. The training data is
simply stored and used during prediction.
KNN is a simple, supervised machine learning (ML) algorithm that can be used for
classification or regression tasks - and is also frequently used in missing value imputation. It
is based on the idea that the observations closest to a given data point are the most "similar"
observations in a data set, and we can therefore classify unforeseen points based on the
values of the closest existing points. By choosing K, the user can select the number of
nearby observations to use in the algorithm.
https://colab.research.google.com/drive/1l0t7JHRHXFAr_r_bd7FWLvy2rV5LL_Ge?pli=1&authuser=1#scrollTo=vJOqkRd3nfxY
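A minimal scikit-learn sketch of the KNN idea described above (the toy points and labels are invented for illustration):
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points with two classes
X = [[1, 2], [2, 3], [3, 3], [6, 7], [7, 8], [8, 8]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3 nearest neighbors
knn.fit(X, y)                               # "lazy" learner: this mostly just stores the data
print(knn.predict([[2, 2], [7, 7]]))        # classify new points by majority vote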
K Means Clustering
https://colab.research.google.com/drive/1BDvrp2jol_afAow3ZAnzse77dK-yeF-Y
Distance Metrics Used in KNN Algorithm
1. Euclidean Distance
2. Manhattan Distance
3. Minkowski Distance
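A minimal pure-Python sketch of these three distance metrics, assuming two equal-length feature vectors:
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p=3):
    # p = 1 gives Manhattan distance, p = 2 gives Euclidean distance
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

print(euclidean([1, 2], [4, 6]), manhattan([1, 2], [4, 6]))   # 5.0 and 7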
Working of KNN algorithm
Step 1: Selecting the optimal value of K
● K represents the number of nearest neighbors that need to be considered while making a
prediction.
Step 2: Calculating the distance
● To measure the similarity between the target and training data points, Euclidean distance is
used. Distance is calculated between each data point in the dataset and the target point.
Step 3: Finding the nearest neighbors
● The k data points with the smallest distances to the target point are the nearest neighbors.
Step 4: Voting for Classification or Taking Average for Regression
● When you want to classify a data point into a category like spam or not spam, the KNN
algorithm looks at the K closest points in the dataset. These closest points are called
neighbors. The algorithm then looks at which category the neighbors belong to and picks
the one that appears the most. This is called majority voting.
● For regression, the prediction is instead the average (or a distance-weighted average) of the
values of the K nearest neighbors.
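A small sketch of the voting and averaging step, assuming the labels or values of the K nearest neighbors have already been collected:
from collections import Counter

# Classification: majority vote among the K nearest neighbors' labels
neighbor_labels = ['spam', 'not spam', 'spam']
prediction = Counter(neighbor_labels).most_common(1)[0][0]
print(prediction)                                  # 'spam'

# Regression: average of the K nearest neighbors' values
neighbor_values = [3.2, 2.8, 3.5]
print(sum(neighbor_values) / len(neighbor_values))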