Supervised Learning

Unit-1
Introduction to Machine Learning
Machine Learning (ML) is a subfield of artificial intelligence (AI) that focuses on
developing algorithms that allow computers to learn from and make predictions
based on data. Unlike traditional programming, where explicit rules are defined for
tasks, ML enables systems to automatically improve their performance through
experience.

Key Concepts:
1. Data: The foundation of ML models. It consists of features (input variables)
and labels (output variables).

2. Model: An algorithm that makes predictions or decisions based on input data.

3. Training: The process of teaching a model by feeding it data.

4. Inference: Using the trained model to make predictions on new, unseen data.

Importance of Machine Learning:


Automation: Reduces manual effort and improves efficiency.

Insights: Extracts patterns and trends from data.

Adaptability: Models can improve over time with new data.

Types of Machine Learning


Machine learning can be broadly categorized into three main types:

1. Supervised Learning
In supervised learning, models are trained on labeled datasets, meaning that each
training example is associated with an output label. The model learns to map
inputs to outputs based on this labeled data.

Input: Labeled data (features + corresponding labels).

Goal: Predict outcomes for new, unseen data based on learned relationships.

Common Algorithms:

Linear Regression

Logistic Regression

Decision Trees

Support Vector Machines (SVM)

Applications:

Email spam detection (classification)

House price prediction (regression)

2. Unsupervised Learning
In unsupervised learning, the model is trained using unlabeled data. The goal is to
find hidden patterns or structures in the data without pre-defined labels.

Input: Unlabeled data.

Goal: Discover patterns, group similar data points, or reduce dimensionality.

Common Algorithms:

K-Means Clustering

Hierarchical Clustering

Principal Component Analysis (PCA)

Applications:

Customer segmentation

Market basket analysis
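
To make clustering concrete, here is a minimal K-Means sketch in Python using scikit-learn; the synthetic two-cluster data and parameter choices are illustrative, not prescribed by these notes:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated Gaussian blobs (synthetic, illustrative data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(3, 0.5, (50, 2))])

# Fit K-Means with k=2 and inspect the learned cluster centers.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # roughly [0, 0] and [3, 3]
```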

3. Reinforcement Learning
In reinforcement learning, an agent interacts with an environment and learns to
make decisions by receiving feedback in the form of rewards or penalties. The
agent learns from the consequences of its actions to maximize cumulative reward
over time.

Input: Feedback from the environment based on actions taken.

Goal: Maximize cumulative rewards through trial and error.

Common Algorithms:

Q-Learning

Deep Q-Networks (DQN)

Applications:

Game playing (e.g., AlphaGo)

Robotics (e.g., autonomous navigation)
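
As a rough sketch of how reward feedback drives learning, here is a single tabular Q-learning update; the environment size, transition, and hyperparameters below are illustrative assumptions:

```python
import numpy as np

# Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))   # table of action-value estimates
alpha, gamma = 0.1, 0.99              # learning rate, discount factor

s, a, r, s_next = 0, 1, 1.0, 2        # one observed transition (illustrative)
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
print(Q[s, a])                        # estimate nudged toward the reward
```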

Supervised Learning Basics


Supervised learning is the most common type of machine learning. It involves
training a model on a labeled dataset to make predictions about new data.

Key Concepts:
1. Training Data: A subset of the dataset used to train the model.

2. Test Data: A separate subset used to evaluate the model's performance.

3. Features: Input variables used for making predictions.

4. Labels: Output variables that the model aims to predict.

Common Algorithms in Supervised Learning:


Linear Regression: Used for predicting continuous outcomes. It models the
relationship between independent variables and a dependent variable.

Logistic Regression: Used for binary classification. It estimates the probability that a given input belongs to a certain class.

Decision Trees: A tree-like model used for classification and regression. It splits the data into branches based on feature values.

Support Vector Machines (SVM): A classification algorithm that finds the
hyperplane that best separates classes in the feature space.

Workflow of Supervised Learning:


1. Data Collection: Gather data relevant to the problem.

2. Data Preprocessing: Clean and prepare data for training (handling missing
values, normalization, etc.).

3. Model Training: Use the training data to train the model.

4. Model Evaluation: Assess the model's performance using test data and
evaluation metrics.

5. Prediction: Use the trained model to make predictions on new data.
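
A minimal end-to-end sketch of this five-step workflow, assuming scikit-learn and its bundled iris dataset (both are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Data collection: load a toy labeled dataset.
X, y = load_iris(return_X_y=True)

# 2. Preprocessing: hold out test data and normalize features.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 3. Model training on the training split.
model = LogisticRegression(max_iter=200).fit(X_train, y_train)

# 4. Evaluation on the held-out test split.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 5. Prediction on new, unseen data.
print("Prediction:", model.predict(X_test[:1]))
```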

Regression and Classification in Machine Learning


In supervised learning, tasks can be broadly categorized into two types:
regression and classification. Both of these approaches are used for predicting
outcomes based on input data, but they differ in the nature of the output.

Regression
Regression is a type of predictive modeling technique that estimates the
relationships among variables. It is primarily used when the output variable is
continuous.

Key Concepts:
Continuous Output: The target variable can take any value within a range.

Objective: To find a function that best fits the data and can predict new
values.

Common Types of Regression:


1. Linear Regression: Models the relationship between the dependent variable
and one or more independent variables using a linear equation.

2. Polynomial Regression: Extends linear regression by considering polynomial
relationships between variables.

3. Ridge Regression: A type of linear regression that includes a penalty term to reduce overfitting.

4. Lasso Regression: Similar to Ridge but can shrink some coefficients to zero,
effectively performing variable selection.

Applications:
Predicting house prices based on features like square footage, number of
bedrooms, etc.

Estimating sales based on advertising spend.

Classification
Classification is a predictive modeling technique used when the output variable is
categorical (i.e., it can take on a limited number of classes).

Key Concepts:
Categorical Output: The target variable can take one of a limited set of values.

Objective: To assign a category label to new observations based on learned patterns from the training data.

Common Types of Classification:


1. Binary Classification: Involves two classes (e.g., spam vs. not spam).

2. Multi-Class Classification: Involves more than two classes (e.g., categorizing fruits as apples, bananas, or oranges).

3. Multi-Label Classification: Each instance can belong to multiple classes (e.g., tagging articles with multiple topics).

Applications:
Email filtering (spam detection).

Image recognition (classifying images of animals).

Linear Regression
Linear Regression is one of the simplest and most widely used regression
algorithms. It establishes a relationship between the dependent variable \(y\) and
one or more independent variables \(X\).

Model Representation:
The relationship can be represented mathematically as:
\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \epsilon
\]
Where:

\(y\) is the dependent variable.

\(\beta_0\) is the intercept.

\(\beta_1, \beta_2, ..., \beta_n\) are the coefficients for the independent variables \(x_1, x_2, ..., x_n\).

\(\epsilon\) is the error term (residuals).

Assumptions:
1. Linearity: The relationship between the independent and dependent variables
is linear.

2. Independence: Observations are independent of each other.

3. Homoscedasticity: Constant variance of residuals across all levels of the independent variables.

4. Normality: The residuals of the model are normally distributed.

Model Training:
The parameters (\(\beta\)) are typically estimated using the Ordinary Least
Squares (OLS) method, which minimizes the sum of the squared differences
between observed and predicted values.
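
Under the assumptions above, OLS has the closed-form solution \(\hat{\beta} = (X^T X)^{-1} X^T y\). A minimal NumPy sketch on synthetic data (the true coefficients are chosen purely for illustration):

```python
import numpy as np

# Synthetic data: y = 2 + 3x + Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=100)

# Design matrix with an intercept column; least-squares solve.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Estimated [intercept, slope]:", beta)  # close to [2, 3]
```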

Applications:
Financial forecasting.

Risk assessment.

Logistic Regression
Logistic Regression is used for binary classification problems. Unlike linear
regression, logistic regression predicts the probability that an instance belongs to
a certain class using a logistic function.

Model Representation:
The logistic regression model can be expressed as:
\[
P(y=1 | X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n
x_n)}}
\]
Where:

\(P(y=1 | X)\) is the probability that the output is in class 1.

The output is transformed using the logistic (sigmoid) function to ensure values are between 0 and 1.

Assumptions:
1. Binary Outcome: The dependent variable is binary (0 or 1).

2. Independence: Observations should be independent.

3. No Multicollinearity: Independent variables should not be too highly correlated.

Model Training:
Logistic regression is trained using Maximum Likelihood Estimation (MLE), which
finds the parameters that maximize the likelihood of observing the given data.
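
A short scikit-learn sketch on synthetic, illustrative data; note that scikit-learn's LogisticRegression applies L2 regularization by default, so it fits a penalized form of MLE:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary labels from a simple linear rule (illustrative).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)  # penalized MLE by default
print("P(y=1 | x):", clf.predict_proba([[0.5, 0.5]])[0, 1])
```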

Applications:
Disease prediction based on risk factors.

Customer churn prediction.

Model Evaluation Metrics
Evaluating the performance of regression and classification models is crucial for
understanding their effectiveness. Different metrics are used based on the type of
task.

Regression Metrics:
1. Mean Absolute Error (MAE):
\[
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
\]

Measures the average absolute difference between predicted and actual values.

2. Mean Squared Error (MSE):
\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\]

Measures the average squared difference between predicted and actual values.

3. R-squared (\(R^2\)):
\[
R^2 = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}
\]

Represents the proportion of variance explained by the model, where \( \text{SS}_{\text{res}} \) is the sum of squares of residuals and \( \text{SS}_{\text{tot}} \) is the total sum of squares.
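
These three metrics are straightforward to compute directly; a small NumPy sketch with illustrative values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # actual values (illustrative)
y_pred = np.array([2.5, 5.5, 6.0, 9.5])   # model predictions

mae = np.mean(np.abs(y_true - y_pred))            # Mean Absolute Error
mse = np.mean((y_true - y_pred) ** 2)             # Mean Squared Error
ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot                          # R-squared
print(f"MAE={mae:.3f}, MSE={mse:.3f}, R^2={r2:.3f}")
```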

Classification Metrics:
1. Accuracy:
\[
\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} +
\text{FP} + \text{FN}}
\]

Measures the proportion of correct predictions.

2. Precision:
\[
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
\]

Measures the proportion of true positive predictions among all positive predictions.

3. Recall (Sensitivity):
\[
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
\]

Measures the proportion of true positive predictions among all actual positives.

4. F1 Score:
\[
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} +
\text{Recall}}
\]

Combines precision and recall into a single metric.

5. Receiver Operating Characteristic (ROC) Curve and Area Under the Curve
(AUC):

The ROC curve plots the true positive rate against the false positive rate.

The AUC quantifies the overall ability of the model to discriminate between
positive and negative classes.
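
All of these classification metrics are available in scikit-learn; a minimal sketch with illustrative labels and scores:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                  # actual classes
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                  # hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]  # predicted P(y=1)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_score))  # uses scores, not labels
```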

Unit-2
Deep Learning (DL) is a subset of machine learning that focuses on algorithms
inspired by the structure and function of the brain, known as artificial neural
networks (ANNs). It excels at handling large datasets and complex tasks, such as
image and speech recognition, natural language processing, and game playing.

1. Introduction to Deep Learning
Definition
Deep Learning involves training artificial neural networks with many layers (hence
"deep") to learn representations of data. This approach enables models to
automatically learn features from raw data, reducing the need for manual feature
extraction.

Importance
Feature Learning: Automatically discovers patterns and features from raw
data.

Scalability: Performs well with large datasets and can leverage parallel
processing capabilities of GPUs.

Versatility: Applicable to various domains, including computer vision, natural language processing, and reinforcement learning.

Comparison with Traditional Machine Learning


Traditional ML: Requires manual feature extraction, and models are often less
complex.

Deep Learning: Automatically learns hierarchical features from raw data, requiring more computational resources but often achieving superior performance on complex tasks.

2. Artificial Neural Networks (ANNs)


Basic Structure
An ANN is composed of layers of interconnected nodes (neurons), where each
node represents a computational unit. The architecture consists of:

Input Layer: Receives the input data.

Hidden Layers: One or more layers where computations occur. Each layer can
have multiple neurons.

Output Layer: Produces the final output or prediction.

Neuron Model
Each neuron performs the following operations:

1. Weighted Sum: Each input \( x_i \) is multiplied by a weight \( w_i \), and a bias
\( b \) is added:
\[
z = \sum_{i=1}^{n} w_i x_i + b
\]

2. Activation Function: The weighted sum is passed through an activation function to introduce non-linearity:
\[
a = f(z)
\]
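
These two operations amount to a few lines of NumPy; a single-neuron sketch with a sigmoid activation (the input values and weights are illustrative):

```python
import numpy as np

def neuron(x, w, b):
    z = np.dot(w, x) + b              # weighted sum: z = sum_i w_i x_i + b
    return 1.0 / (1.0 + np.exp(-z))   # activation: a = f(z), sigmoid here

x = np.array([0.5, -1.2, 3.0])        # inputs (illustrative)
w = np.array([0.4, 0.1, -0.2])        # weights
print(neuron(x, w, b=0.1))
```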

Types of Neural Networks


1. Feedforward Neural Networks: Information moves in one direction (input to
output).

2. Convolutional Neural Networks (CNNs): Specialize in processing grid-like data (e.g., images).

3. Recurrent Neural Networks (RNNs): Designed for sequential data (e.g., time
series or text).

3. Activation Functions
Activation functions determine the output of a neuron given an input or set of
inputs. They introduce non-linearity into the model, allowing it to learn complex
patterns.

Common Activation Functions


1. Sigmoid Function

Formula:
\[
f(x) = \frac{1}{1 + e^{-x}}
\]

Range: (0, 1)

Use: Often used in binary classification problems.

2. Tanh (Hyperbolic Tangent)

Formula:
\[
f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
\]

Range: (-1, 1)

Use: Preferred over sigmoid as it is zero-centered.

3. ReLU (Rectified Linear Unit)

Formula:
\[
f(x) = \max(0, x)
\]

Range: [0, ∞)

Use: Widely used in hidden layers due to its simplicity and effectiveness.

4. Leaky ReLU

Formula:
\[
f(x) = \begin{cases}
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0
\end{cases}
\]

Use: Addresses the "dying ReLU" problem by allowing a small, non-zero gradient when \(x < 0\).

5. Softmax

Formula:
\[
f(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
\]

Use: Used in multi-class classification problems to produce probabilities that sum to 1.

Choosing Activation Functions


Hidden Layers: ReLU and its variants are generally preferred due to faster
convergence.

Output Layer: Softmax for multi-class classification; sigmoid for binary classification.
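
All five functions above are one-liners in NumPy; a compact sketch (the Leaky ReLU slope \(\alpha\) is set to an illustrative 0.01):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):        # small non-zero slope for x <= 0
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    e = np.exp(x - np.max(x))         # subtract max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(softmax(z), softmax(z).sum())   # probabilities that sum to 1
```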

4. Loss Functions
Loss functions quantify the difference between the predicted output and the
actual output, guiding the optimization of the model.

Common Loss Functions


1. Mean Squared Error (MSE): Used for regression tasks.
\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\]

2. Binary Cross-Entropy: Used for binary classification.
\[
\text{Loss} = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1-y_i) \log(1 -
\hat{y}_i)]
\]

3. Categorical Cross-Entropy: Used for multi-class classification.
\[
\text{Loss} = -\sum_{i} y_i \log(\hat{y}_i)
\]

4. Hinge Loss: Used for "maximum-margin" classification, primarily with SVMs.
\[
\text{Loss} = \sum_{i} \max(0, 1 - y_i \cdot \hat{y}_i)
\]

Choosing Loss Functions


Regression Problems: MSE or MAE.

Binary Classification: Binary cross-entropy.

Multi-Class Classification: Categorical cross-entropy.
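
A NumPy sketch of the first three losses (the clipping constant eps is an illustrative guard against log(0)):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)              # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    return -np.sum(y_onehot * np.log(np.clip(p, eps, 1.0)))

print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
```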

5. Optimization Algorithms
Optimization algorithms adjust the weights of the network during training to find the set of weights that minimizes the loss function.

Common Optimization Algorithms


1. Stochastic Gradient Descent (SGD): Updates weights based on each training
example.

Formula:
\[
w := w - \eta \nabla L(w)
\]
Where \( \eta \) is the learning rate and \( \nabla L(w) \) is the gradient of
the loss function.

2. Mini-Batch Gradient Descent: Combines the benefits of both batch and stochastic gradient descent. Updates weights using a small batch of training samples.

3. Momentum: Adds a fraction of the previous weight update to the current update to accelerate SGD.

Formula:
\[
v_t = \beta v_{t-1} + (1 - \beta) \nabla L(w)
\]
\[
w := w - \eta v_t
\]

4. Adagrad: Adapts the learning rate for each parameter based on its gradients.

Formula:
\[
w_t = w_{t-1} - \frac{\eta}{\sqrt{G_t + \epsilon}} \nabla L(w)
\]
Where \( G_t \) is the sum of the squares of the gradients.

5. RMSprop: An extension of Adagrad that maintains a moving average of squared gradients, preventing the learning rate from decaying too rapidly.

Formula:
\[
w_t = w_{t-1} - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} \nabla L(w)
\]

6. Adam (Adaptive Moment Estimation): Combines the benefits of momentum and RMSprop.

Formula:
\[
m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla L(w)
\]
\[
v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla L(w))^2
\]
\[
w_t = w_{t-1} - \frac{\eta m_t}{\sqrt{v_t} + \epsilon}
\]

Choosing Optimization Algorithms


SGD: Good for large datasets; simple and effective.

Adam: Recommended for most applications due to its adaptive learning rates
and momentum.
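
A minimal Adam sketch on a toy one-dimensional problem; note it includes the bias-correction terms \( \hat{m}_t = m_t/(1-\beta_1^t) \) and \( \hat{v}_t = v_t/(1-\beta_2^t) \) from the original paper, which the formulas above omit:

```python
import numpy as np

def adam_step(w, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize L(w) = (w - 3)^2, whose gradient is 2(w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    w, m, v = adam_step(w, 2 * (w - 3.0), m, v, t)
print(w)  # approaches the minimizer w = 3
```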

6. Backpropagation Algorithm
The backpropagation algorithm is a supervised learning method used for training
neural networks. It computes the gradient of the loss function with respect to each
weight by applying the chain rule, allowing for efficient weight updates.

Steps of the Backpropagation Algorithm


1. Forward Pass:

Compute the output of the network for a given input by passing it through
the layers.

Calculate the loss using the chosen loss function.

2. Backward Pass:

Compute the gradient of the loss with respect to the output layer.

Propagate the gradient backward through the network, layer by layer, using the chain rule to compute gradients for each weight.

For a weight \( w \) connected to a neuron, the gradient can be computed as:
\[
\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial
a}{\partial z} \cdot \frac{\partial z}{\partial w}
\]

3. Weight Update:

Update the weights using the computed gradients and the optimization algorithm.
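
These three steps can be traced by hand for a single sigmoid neuron with a squared-error loss; everything in this sketch (data, initial weights, learning rate) is illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = np.array([1.0, 2.0]), 1.0          # one training example
w, b, eta = np.array([0.1, -0.1]), 0.0, 0.5

for _ in range(100):
    # 1. Forward pass: compute output and loss.
    z = w @ x + b
    a = sigmoid(z)
    loss = 0.5 * (a - y) ** 2
    # 2. Backward pass: dL/dw = dL/da * da/dz * dz/dw (chain rule).
    dL_da = a - y
    da_dz = a * (1 - a)
    grad_w = dL_da * da_dz * x
    grad_b = dL_da * da_dz
    # 3. Weight update via gradient descent.
    w -= eta * grad_w
    b -= eta * grad_b

print(f"loss after training: {loss:.4f}")
```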

Challenges
Vanishing Gradients: Occurs when gradients are very small, making it difficult
to update weights effectively in deep networks.

Exploding Gradients: Occurs when gradients become very large, causing weights to diverge.

7. Regularization Techniques
Regularization techniques are used to prevent overfitting in neural networks by
adding constraints to the model training process. They help the model generalize
better to unseen data.

Common Regularization Techniques


1. L1 Regularization (Lasso): Adds a penalty equal to the absolute value of the
magnitude of coefficients.
\[
L(w) = L(w) + \lambda \sum_{i} |w_i|
\]

2. L2 Regularization (Ridge): Adds a penalty equal to the square of the magnitude of coefficients.
\[
L(w) = L(w) + \lambda \sum_{i} w_i^2
\]

3. Dropout: Randomly sets a fraction of the neurons to zero during training, preventing the network from becoming overly dependent on specific neurons.

4. Early Stopping: Monitors the model’s performance on a validation set during training and stops when performance begins to degrade, preventing overfitting.

5. Data Augmentation: Increases the diversity of training data by applying transformations (e.g., rotation, scaling) to the original dataset.

Choosing Regularization Techniques


L1 and L2: Commonly used when model complexity is a concern; L1 can also
help with feature selection.

Dropout: Effective for large networks; widely used in practice.
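
Two of these techniques fit in a few lines; the sketch below adds an L2 penalty to an arbitrary loss and applies the common "inverted dropout" variant, which rescales surviving activations so their expected value is unchanged (lambda and the dropout rate are illustrative):

```python
import numpy as np

def l2_penalized_loss(base_loss, weights, lam=0.01):
    # L(w) + lambda * sum_i w_i^2
    return base_loss + lam * np.sum(weights ** 2)

def dropout(activations, rate=0.5, training=True,
            rng=np.random.default_rng()):
    if not training:                  # dropout is disabled at inference time
        return activations
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)  # rescale to preserve expectation

a = np.ones(8)
print(dropout(a, rate=0.5))           # about half the entries zeroed, rest = 2.0
```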
