
Machine Learning

Arthur Samuel, an early American leader in the field of computer gaming and artificial intelligence, coined the term “Machine Learning” in 1959 while at IBM. That year, he published a paper in the IBM Journal of Research and Development with an, at the time, obscure and curious title. The paper investigated the use of machine learning in the game of checkers “to verify the fact that a computer can be programmed so that it will learn to play a better game of checkers than can be played by the person who wrote the program.”
He defined machine learning as “the field of study that gives computers the ability to
learn without being explicitly programmed.” However, there is no universally accepted
definition for machine learning. Different authors define the term differently.
 One widely used definition is: “machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.”
 Machine learning is a growing technology which enables computers to learn
automatically from past data.
Machine learning is recognized as a field of computer science; however, it does not follow the same approach as traditional computer science. Whereas traditional computer science is driven by algorithms that are created and managed by humans, machine learning is driven by algorithms through which the machine itself can learn and improve.
Machine Learning is one of the most popular sub-fields of Artificial Intelligence. Machine learning concepts are used almost everywhere: healthcare, finance, infrastructure, marketing, self-driving cars, recommendation systems, chatbots, social media, gaming, cyber security, and many more.

Features of Machine Learning


i. Machine learning uses data to detect various patterns in a given dataset.
ii. It can learn from past data and improve automatically.
iii. It is a data-driven technology.
iv. Machine learning is similar to data mining in that it also deals with huge amounts of data.

Why is machine learning important?


Machine learning is important because it gives enterprises a view of trends in customer
behavior and business operational patterns, as well as supports the development of new
products. Many of today's leading companies, such as Facebook, Google, and Uber,
make machine learning a central part of their operations. Machine learning has become
a significant competitive differentiator for many companies.

Types of Machine Learning


There are four basic approaches:
i. Supervised learning
ii. Unsupervised learning
iii. Semi-supervised learning and
iv. Reinforcement learning.

Supervised learning: In this type of machine learning, data scientists supply algorithms with labeled training data and define the variables they want the algorithm to assess for correlations. Both the input and the output of the algorithm are specified.

Types of Supervised Machine Learning Algorithms


Regression: Regression algorithms are used when there is a relationship between the input variable and the output variable. Regression is essentially classification where we forecast a number instead of a category. Examples include predicting a car's price from its mileage, traffic by time of day, or demand volume from a company's growth. Regression is a natural fit when something depends on time.
Classification: Classification algorithms are used when the output variable is categorical, taking one of a set of classes such as Yes-No, Male-Female, or True-False. Random forests, decision trees, logistic regression, and support vector machines are examples of classification models; spam filtering is a typical classification task.

How does supervised machine learning work?


Supervised machine learning requires the data scientist to train the algorithm with both
labeled inputs and desired outputs. Supervised learning algorithms are good for the
following tasks:
 Binary classification: Dividing data into two categories.
 Multi-class classification: Choosing between more than two types of answers.
 Regression modeling: Predicting continuous values.
 Ensembling: Combining the predictions of multiple machine learning models to
produce an accurate prediction.
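As an illustration of the idea described above, here is a minimal supervised-learning sketch using scikit-learn; the tiny dataset, the choice of logistic regression, and the split proportion are assumptions made for demonstration, not part of the text above.

# A minimal supervised-learning sketch (illustrative data and model choice).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = [[0.5], [1.1], [1.9], [2.8], [3.6], [4.2]]   # labeled inputs (one feature)
y = [0, 0, 0, 1, 1, 1]                           # desired outputs (two classes)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)     # learn the correlation between inputs and labels
print(model.predict(X_test))    # predictions for inputs the model has not seen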

Advantages of Supervised learning


i. With the help of supervised learning, the model can predict the output based on
prior experiences.
ii. In supervised learning, we can have an exact idea about the classes of objects.
Supervised learning models help us solve various real-world problems such as fraud detection, spam filtering, language detection, sentiment analysis, and recognition of handwritten characters and numbers.

Disadvantages of supervised learning


i. Supervised learning models are not suitable for handling complex tasks.
ii. Supervised learning cannot predict the correct output if the test data is different
from the training dataset.
iii. Training requires a lot of computation time.
iv. In supervised learning, we need enough knowledge about the classes of objects.

Unsupervised learning: This type of machine learning involves algorithms that train on unlabeled data. The algorithm scans through data sets looking for any meaningful connections. Unlike supervised learning, neither the groupings in the data nor the outputs are specified in advance; the algorithm discovers them on its own.

Types of Unsupervised Learning Algorithm


The unsupervised learning algorithm can be further categorized into two types of
problems:
 Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them according to the presence and absence of those commonalities.
 Association: An association rule is an unsupervised learning method used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategies more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rules is Market Basket Analysis.

Advantages of Unsupervised Learning


i. Unsupervised learning is used for more complex tasks as compared to supervised
learning because, in unsupervised learning, we don't have labeled input data.
ii. Unsupervised learning is preferable as it is easy to get unlabeled data in
comparison to labeled data.

Disadvantages of Unsupervised Learning


i. Unsupervised learning is intrinsically more difficult than supervised learning, as it does not have corresponding output labels.
ii. The result of an unsupervised learning algorithm might be less accurate, as the input data is not labeled and the algorithm does not know the exact output in advance.

How does unsupervised machine learning work?


Unsupervised machine learning algorithms do not require data to be labeled. They sift through unlabeled data to look for patterns that can be used to group data points into subsets. Some deep learning techniques, such as autoencoders, are unsupervised algorithms. Unsupervised learning algorithms are good for the following tasks:
 Clustering: Splitting the dataset into groups based on similarity.
 Anomaly detection: Identifying unusual data points in a data set.
 Association mining: Identifying sets of items in a data set that frequently occur
together.
 Dimensionality reduction: Reducing the number of variables in a dataset.
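As a sketch of the clustering task listed above, here is how unlabeled points could be grouped with scikit-learn's KMeans; the data and the choice of two clusters are illustrative assumptions.

# Clustering sketch: KMeans groups unlabeled points by similarity.
from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [1, 0],        # unlabeled data points
     [10, 2], [10, 4], [10, 0]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)      # cluster assignment discovered for each point
print(labels)                       # two groups found without any labels
print(kmeans.cluster_centers_)      # the center of each discovered cluster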

Semi-supervised Learning
This approach to machine learning involves a mix of the two preceding types. Data
scientists may feed an algorithm mostly labeled training data, but the model is free to
explore the data on its own and develop its own understanding of the dataset.

How does semi-supervised learning work?


In semi-supervised learning, data scientists feed a small amount of labeled training data to an algorithm. From this, the algorithm learns the dimensions of the dataset, which it can then apply to new, unlabeled data. The performance of algorithms
typically improves when they train on labeled datasets. But labeling data can be time
consuming and expensive. Semi-supervised learning strikes a middle ground between
the performance of supervised learning and the efficiency of unsupervised learning.
Some areas where semi-supervised learning is used include:
 Machine translation: Teaching algorithms to translate language based on less
than a full dictionary of words.
 Fraud detection: Identifying cases of fraud when you only have a few positive
examples.
 Labeling data: Algorithms trained on small data sets can learn to apply data labels to larger sets automatically.
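A minimal sketch of the mechanism described above, assuming scikit-learn's LabelPropagation (one of several semi-supervised approaches); the data is illustrative, and -1 marks the unlabeled samples.

# Semi-supervised sketch: a few labeled points plus unlabeled ones (label -1).
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[1.0], [1.2], [0.8], [4.0], [4.2], [3.8], [1.1], [0.9], [4.1], [3.9]])
y = np.array([0, 0, 0, 1, 1, 1, -1, -1, -1, -1])   # -1 = unlabeled

model = LabelPropagation().fit(X, y)    # labels spread from labeled to unlabeled points
print(model.transduction_)              # labels inferred for every training point
print(model.predict([[1.05], [4.05]]))  # predictions for new, unseen points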

Reinforcement learning
Data scientists typically use reinforcement learning to teach a machine to complete a
multi-step process for which there are clearly defined rules. Data scientists program an
algorithm to complete a task and give it positive or negative cues as it works out how to complete the task. But for the most part, the algorithm decides on its own what steps to take along the way.

How does reinforcement learning work?


Reinforcement learning works by programming an algorithm with a distinct goal and a
prescribed set of rules for accomplishing that goal. Data scientists also program the
algorithm to seek positive rewards -- which it receives when it performs an action that is
beneficial toward the ultimate goal -- and avoid punishments -- which it receives when it
performs an action that gets it farther away from its ultimate goal. Reinforcement
learning is often used in areas such as:
 Robotics: Robots can learn to perform tasks in the physical world using this technique.
 Video gameplay: Reinforcement learning has been used to teach bots to play a number of video games.
 Resource management: Given finite resources and a defined goal, reinforcement
learning can help enterprises plan out how to allocate resources.
Consider an example with a reward and an agent. There are many obstacles placed between the machine, or agent, and the reward, and the agent should identify the right path to reach the reward in the shortest time possible. Picture a robot, which is the machine or agent; fire, which represents the obstacles; and a diamond, which represents the reward. The robot should look for the different ways it can reach the diamond while avoiding the fire, aiming to reach the diamond in the shortest time possible. The robot is rewarded for every correct step taken, and for every incorrect step the reward is reduced. The total reward is calculated when the robot finally reaches the diamond.
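A minimal tabular Q-learning sketch of this robot-and-diamond idea; the grid layout, reward values, and hyperparameters below are assumptions chosen for illustration.

# Q-learning sketch: a 2x3 grid with a start, a fire cell, and a diamond.
import random

ROWS, COLS = 2, 3
START, FIRE, DIAMOND = (0, 0), (0, 1), (0, 2)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # up, down, left, right
Q = {(r, c): [0.0] * 4 for r in range(ROWS) for c in range(COLS)}
alpha, gamma, epsilon = 0.5, 0.9, 0.2                 # learning rate, discount, exploration

for episode in range(1000):
    s = START
    while s != DIAMOND:
        a = random.randrange(4) if random.random() < epsilon else Q[s].index(max(Q[s]))
        r2, c2 = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
        s2 = (r2, c2) if 0 <= r2 < ROWS and 0 <= c2 < COLS else s
        reward = 10 if s2 == DIAMOND else (-10 if s2 == FIRE else -1)
        Q[s][a] += alpha * (reward + gamma * max(Q[s2]) - Q[s][a])   # Q-update rule
        s = s2

s, path = START, [START]                # follow the learned greedy policy
while s != DIAMOND and len(path) < 10:
    a = Q[s].index(max(Q[s]))
    r2, c2 = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    s = (r2, c2) if 0 <= r2 < ROWS and 0 <= c2 < COLS else s
    path.append(s)
print(path)                             # routes around the fire to the diamond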

The main differences between supervised and unsupervised learning are given below:

Supervised learning:
 Trains on labeled data; both the inputs and the outputs are specified
 Typical tasks are classification and regression

Unsupervised learning:
 Trains on unlabeled data; the outputs are not predetermined
 Typical tasks are clustering, association mining, and dimensionality reduction
Overfitting in Machine Learning
A statistical model is said to be overfitted when it fits its training data too closely and, as a result, does not make accurate predictions on testing data. When a model is trained too intensively on the training data, it starts learning from the noise and inaccurate entries in the data set. Testing then results in high variance: the model fails to categorize new data correctly because it has absorbed too many details and too much noise. Non-parametric and non-linear methods are frequent causes of overfitting, because these types of machine learning algorithms have more freedom in building the model from the dataset and can therefore build unrealistic models.
The causes of overfitting can be numerous:

 Complex models. Using an overly complex model for a simple task can lead to
overfitting. For instance, using a high-degree polynomial regression for data
that's linear in nature.

 Insufficient data. If there's not enough data, the model might find patterns that
don't really exist.
 Noisy data. If the training data contains errors or random fluctuations, an
overfitted model will treat these as patterns.

The impact of overfitting is significant. While an overfitted model will have high accuracy
on its training data, it will perform poorly on new, unseen data because it's not
generalized enough.

How to Prevent Overfitting

Preventing overfitting is better than curing it. Here are some steps to take:

 Simpler models. Start with a simpler model and only add complexity if necessary.

 More data. If possible, collect more data. The more data a model is trained on,
the better it can generalize.

 Regularization. Techniques like L1 and L2 regularization can help prevent overfitting by penalizing certain model parameters if they're likely causing overfitting.

 Dropout. In neural networks, dropout is a technique where random neurons are "dropped out" during training, forcing the network to learn more robust features.

Overfitting vs Underfitting

While overfitting is a model's excessive adaptation to training data, underfitting is the opposite. An underfitted model fails to capture even the basic patterns in the training data.

 Overfitting: High accuracy on training data, low accuracy on new data. Imagine a
GPS that works perfectly in your hometown but gets lost everywhere else.

 Underfitting: Low accuracy on both training and new data. It's like a GPS that
can't even navigate your hometown.
Both overfitting and underfitting lead to poor predictions on new data, but for different
reasons. While overfitting is often due to an overly complex model or noisy data,
underfitting might result from an overly simple model or not enough features.

Train and Test datasets in Machine Learning


Machine Learning is one of the booming technologies across the world that enables
computers/machines to turn a huge amount of data into predictions. However, these
predictions highly depend on the quality of the data, and if we are not using the right
data for our model, then it will not generate the expected result. In machine learning
projects, we generally divide the original dataset into training data and test data. We
train our model over a subset of the original dataset, i.e., the training dataset, and then
evaluate whether it can generalize well to the new or unseen dataset or test
set. Therefore, train and test datasets are the two key concepts of machine learning,
where the training dataset is used to fit the model, and the test dataset is used to
evaluate the model.

What is Training Dataset?

The training data is the largest subset of the original dataset and is used to train or fit the machine learning model. First, the training data is fed to the ML algorithms, which lets them learn how to make predictions for the given task.
What is Test Dataset?

Once we train the model with the training dataset, it's time to test the model with the
test dataset. This dataset evaluates the performance of the model and ensures that the
model can generalize well with the new or unseen dataset. The test dataset is another
subset of the original data, which is independent of the training dataset. However, it has similar types of features and class probability distributions, and it serves as a benchmark for model evaluation once model training is completed. Test data is a well-organized dataset that contains data for each type of scenario the model would face when used in the real world. Usually, the test dataset is approximately 20-25% of the total original data for an ML project.

Need of Splitting dataset into Train and Test set

Splitting the dataset into train and test sets is one of the important parts of data pre-
processing, as by doing so, we can improve the performance of our model and hence
give better predictability.

To see why, suppose we train our model on one dataset and then test it on a completely different dataset collected elsewhere: the model will not have learned the correlations between features that the test data exhibits, and its measured performance will drop. Training and testing on two unrelated datasets therefore understates the model's performance. Hence it is important to split a single dataset into two parts, i.e., a train set and a test set.
In this way, we can easily evaluate the performance of our model. Such as, if it performs
well with the training data, but does not perform well with the test dataset, then it is
estimated that the model may be overfitted. For splitting the dataset, we can use
the train_test_split function of scikit-learn.
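A short sketch of this split; the synthetic data and the 80/20 proportion are illustrative.

# Splitting one original dataset into train and test sets with scikit-learn.
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]     # illustrative features
y = [i % 2 for i in range(100)]   # illustrative labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)   # hold out 20% as the test set
print(len(X_train), len(X_test))            # 80 20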

There is a three-step process followed to create a model:


i. Train the model
ii. Test the model
iii. Deploy the model
Training Set:
 The training set consists of examples given to the model to analyze and learn from
 Typically 70% of the total data is taken as the training dataset
 This is labeled data used to train the model

Test Set:
 The test set is used to test the accuracy of the hypothesis generated by the model
 The remaining 30% is taken as the testing dataset
 We test without labeled data and then verify the results against the labels

Consider a case where you have labeled data for 1,000 records. One way to train the
model is to expose all 1,000 records during the training process. Then you take a small
set of the same data to test the model, which would give good results in this case.
But this is not an accurate way of testing. So, we set aside a portion of that data called
the ‘test set’ before starting the training process. The remaining data is called the
‘training set’ that we use for training the model. The training set passes through the
model multiple times until the accuracy is high, and errors are minimized.

Missing or Corrupted Data in a Dataset

Missing values are a common issue in machine learning. This occurs when a particular
variable lacks data points, resulting in incomplete information and potentially harming
the accuracy and dependability of your models. One of the easiest ways to handle
missing or corrupted data is to drop those rows or columns or replace them entirely
with some other value. Missing values are data points that are absent for a specific
variable in a dataset. They can be represented in various ways, such as blank cells, null
values, or special symbols like “NA” or “unknown.” These missing data points pose a
significant challenge in data analysis and can lead to inaccurate or biased results.

Missing Values

There are two useful methods in Pandas:

 isnull() and dropna() will help find the columns/rows with missing data and drop them.

 fillna() will replace the missing values with a placeholder value.
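A minimal sketch of these methods on a toy DataFrame; the column names and placeholder choices are assumptions.

# Handling missing values with the pandas methods named above.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31], "city": ["Lagos", "Abuja", None]})

print(df.isnull())      # True wherever a value is missing
print(df.dropna())      # drop the rows that contain any missing value
print(df.fillna({"age": df["age"].mean(), "city": "unknown"}))  # placeholder values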


Confusion Matrix
A confusion matrix (or error matrix) is a specific table that is used to measure the
performance of an algorithm. It is mostly used in supervised learning; in unsupervised
learning, it’s called the matching matrix. The confusion matrix has two parameters:
 Actual
 Predicted
It also has identical sets of features in both of these dimensions.
Consider the binary confusion matrix shown below:

                  Actual: Yes   Actual: No
Predicted: Yes        12             3
Predicted: No          1             9

Here,
For actual values:
Total Yes = 12+1 = 13
Total No = 3+9 = 12
Similarly, for predicted values:
Total Yes = 12+3 = 15
Total No = 1+9 = 10
For a model to be accurate, the values across the diagonals should be high. The total
sum of all the values in the matrix equals the total observations in the test data set.
For the above matrix, total observations = 12+3+1+9 = 25
Now, accuracy = sum of the values across the diagonal/total dataset
= (12+9) / 25
= 21 / 25
= 84%
False Positive and False Negative
False positives are those cases that wrongly get classified as True but are False.
False negatives are those cases that wrongly get classified as False but are True.
In the term ‘False Positive,’ the word ‘Positive’ refers to the ‘Yes’ row of the predicted
value in the confusion matrix. The complete term indicates that the system has
predicted it as a positive, but the actual value is negative.

So, looking at the confusion matrix, we get:


False-positive = 3
True positive = 12
Similarly, in the term ‘False Negative,’ the word ‘Negative’ refers to the ‘No’ row of the
predicted value in the confusion matrix. And the complete term indicates that the
system has predicted it as negative, but the actual value is positive.
So, looking at the confusion matrix, we get:
False Negative = 1
True Negative = 9
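The same numbers can be reproduced with scikit-learn; the label arrays below are constructed to match the counts in this example (TP = 12, FP = 3, FN = 1, TN = 9).

# Reproducing the confusion matrix and accuracy above with scikit-learn.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1] * 12 + [0] * 3 + [1] * 1 + [0] * 9   # actual values (1 = Yes, 0 = No)
y_pred = [1] * 12 + [1] * 3 + [0] * 1 + [0] * 9   # predicted values

# Note: confusion_matrix puts actual classes in rows and predicted classes in columns.
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
# [[12  1]
#  [ 3  9]]
print(accuracy_score(y_true, y_pred))   # 0.84, i.e., (12 + 9) / 25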
Stages of building a hypothesis or model in machine learning

In machine learning, building a hypothesis or model typically involves three stages:

1. Model selection: This is the process of selecting the most appropriate type of model for the problem at hand, for example, choosing between linear regression, a decision tree, or a deep neural network.
2. Model training: Once a model is selected, the next step is to train the model on the available data. This is the process of adjusting the model's parameters so that it best fits the data. The model is provided with a set of input-output pairs and it learns from them to make predictions on new, unseen data.
3. Model evaluation: After the model is trained, the final step is to
evaluate its performance on a separate dataset, called the test set. This
helps to determine how well the model is likely to perform on new,
unseen data. Evaluation metrics such as accuracy, precision, recall, F1-
score, and AUC-ROC are used to evaluate the performance of a model.
These stages are iterative, meaning that the process may be repeated several times until
the best model is found. For example, different models may be trained and compared,
or different sets of features may be used to train the model.

Additionally, there is also a stage called 'hyperparameter tuning', where the model's hyperparameters are fine-tuned to get the best performance. This stage usually comes right after model selection or after model evaluation, so that performance can be further optimized.
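A minimal hyperparameter-tuning sketch with scikit-learn's GridSearchCV; the model, dataset, and parameter grid are illustrative assumptions.

# Hyperparameter tuning sketch: exhaustive search over a small parameter grid.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
grid = {"max_depth": [2, 3, 5, None], "min_samples_split": [2, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
search.fit(X, y)              # trains and cross-validates every combination
print(search.best_params_)    # the hyperparameters with the best score
print(search.best_score_)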

What is Deep Learning?


Deep learning is a subset of machine learning that involves systems that think and learn like humans using artificial neural networks. The term ‘deep’ comes from the fact that you can have several layers of neural networks.

One of the primary differences between machine learning and deep learning is that
feature engineering is done manually in machine learning. In the case of deep learning,
the model consisting of neural networks will automatically determine which features to
use (and which not to use).

Differences Between Machine Learning and Deep Learning

Machine Learning:
 Enables machines to take decisions on their own, based on past data
 Needs only a small amount of data for training
 Works well on low-end systems, so you don't need large machines
 Most features need to be identified in advance and manually coded
 The problem is divided into parts, solved individually, and then combined

Deep Learning:
 Enables machines to take decisions with the help of artificial neural networks
 Needs a large amount of training data
 Needs high-end machines because it requires a lot of computing power
 The machine learns the features from the data it is provided
 The problem is solved in an end-to-end manner

Applications of Supervised Machine Learning in Modern Businesses

Applications of supervised machine learning include:


 Email Spam Detection
Here we train the model using historical data that consists of emails categorized as spam or not spam. This labeled information is fed as input to the model.

 Healthcare Diagnosis
By providing images of a disease, a model can be trained to detect whether a person is suffering from that disease.

 Sentiment Analysis
This refers to the process of using algorithms to mine documents and determine whether they're positive, neutral, or negative in sentiment.

 Fraud Detection
By training the model to identify suspicious patterns, we can detect instances of possible fraud.

Unsupervised Machine Learning Techniques

There are two techniques used in unsupervised learning: clustering and association.

Clustering

Clustering problems involve dividing data into subsets. These subsets, also called clusters, contain data points that are similar to each other. Different clusters reveal different details about the objects, unlike classification or regression.
Association

In an association problem, we identify patterns of associations between different variables or items.

For example, an e-commerce website can suggest other items for you to buy, based on
the prior purchases that you have made, spending habits, items in your wishlist, other
customers’ purchase habits, and so on.

Compare K-means and KNN Algorithms.

K-Means:
 K-Means is unsupervised
 K-Means is a clustering algorithm
 The points in each cluster are similar to each other, and each cluster is different from its neighboring clusters

KNN:
 KNN is supervised in nature
 KNN is a classification algorithm
 It classifies an unlabeled observation based on its K (can be any number) surrounding neighbors

Naive Bayes Classifier

The classifier is called ‘naive’ because it makes assumptions that may or may not turn
out to be correct.

The algorithm assumes that the presence of one feature of a class is not related to the
presence of any other feature (absolute independence of features), given the class
variable.

For instance, a fruit may be considered to be a cherry if it is red in color and round in
shape, regardless of other features. This assumption may or may not be right (as an
apple also matches the description).

How Will You Know Which Machine Learning Algorithm to Choose for Your
Classification Problem?

While there is no fixed rule to choose an algorithm for a classification problem, you can
follow these guidelines:

 If accuracy is a concern, test different algorithms and cross-validate them

 If the training dataset is small, use models that have low variance and high
bias

 If the training dataset is large, use models that have high variance and little
bias
Random Forest

A ‘random forest’ is a supervised machine learning algorithm that is generally used for
classification problems. It operates by constructing multiple decision trees during the
training phase. The random forest chooses the decision of the majority of the trees as
the final decision.

Bias and Variance in a Machine Learning Model

Bias

Bias in a machine learning model occurs when the predicted values are far from the actual values. Low bias indicates a model whose predictions are very close to the actual ones.

Underfitting: High bias can cause an algorithm to miss the relevant relations between
features and target outputs.

Variance

Variance refers to the amount the target model will change when trained with different
training data. For a good model, the variance should be minimized.
Overfitting: High variance can cause an algorithm to model the random noise in the
training data rather than the intended outputs.

Precision and Recall.

Precision

Precision is the ratio of the number of events you correctly recall to the total number of events you recall (a mix of correct and incorrect recalls).

Precision = (True Positive) / (True Positive + False Positive)

Recall

Recall is the ratio of the number of events you can correctly recall to the total number of events.

Recall = (True Positive) / (True Positive + False Negative)

Decision Tree Classification

A decision tree builds classification (or regression) models as a tree structure, with
datasets broken up into ever-smaller subsets while developing the decision tree, literally
in a tree-like way with branches and nodes. Decision trees can handle both categorical
and numerical data.

Logistic Regression
Logistic regression is a classification algorithm used to predict a binary outcome for a
given set of independent variables.

The output of logistic regression is either 0 or 1, using a threshold value, generally 0.5. Any value above 0.5 is considered 1, and any value below 0.5 is considered 0.
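A small sketch of this probability-plus-threshold behaviour; the data is illustrative.

# Logistic regression outputs a probability; the 0.5 threshold maps it to 0 or 1.
from sklearn.linear_model import LogisticRegression

X = [[1], [2], [3], [4], [5], [6]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression().fit(X, y)
probs = model.predict_proba([[3.2], [4.8]])[:, 1]   # probability of class 1
print(probs)
print((probs >= 0.5).astype(int))                   # apply the 0.5 threshold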

K Nearest Neighbor Algorithm

The K nearest neighbor algorithm is a classification algorithm in which a new data point is assigned to the neighboring group to which it is most similar.

In K nearest neighbors, K can be an integer greater than 1. So, for every new data point we want to classify, we compute which neighboring group it is closest to.

Let us classify an object using the following example. Consider there are three clusters:

 Football

 Basketball

 Tennis ball
Suppose the new data point to be classified is a black ball. We use KNN to classify it. Assume K = 5 (initially).

Next, we find the K (five) nearest data points.

Observe that all five selected points do not belong to the same cluster. There are three
tennis balls and one each of basketball and football.

When multiple classes are involved, we prefer the majority. Here the majority is with
the tennis ball, so the new data point is assigned to this cluster.
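A sketch of this example with scikit-learn's KNeighborsClassifier; the 2-D coordinates standing in for the three ball types are illustrative assumptions.

# KNN sketch: three classes, K = 5, the majority vote decides the class.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1],            # footballs (class 0)
     [8, 8], [8, 9], [9, 8],            # basketballs (class 1)
     [5, 5], [5, 6], [6, 5], [6, 6]]    # tennis balls (class 2)
y = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
print(knn.predict([[5.5, 5.5]]))   # most of the five nearest points are tennis balls -> 2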
F1 score

The F1 score is a metric that combines both precision and recall: it is the harmonic mean of the two.

The F1 score can be calculated using the below formula:

F1 = 2 * (P * R) / (P + R)

The F1 score is one when both Precision and Recall scores are one.
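Continuing the confusion-matrix example from earlier (TP = 12, FP = 3, FN = 1, TN = 9), precision, recall, and F1 can be computed directly:

# Precision, recall, and F1 from the earlier confusion-matrix counts.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1] * 12 + [0] * 3 + [1] * 1 + [0] * 9
y_pred = [1] * 12 + [1] * 3 + [0] * 1 + [0] * 9

p = precision_score(y_true, y_pred)    # 12 / (12 + 3) = 0.80
r = recall_score(y_true, y_pred)       # 12 / (12 + 1) ≈ 0.923
print(p, r, f1_score(y_true, y_pred))  # F1 = 2 * p * r / (p + r) ≈ 0.857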

Correlation and Covariance

Correlation: Correlation tells us how strongly two random variables are related to each other. It takes values between -1 and +1.

Formula to calculate correlation:

Correlation(X, Y) = Cov(X, Y) / (σX * σY)

Covariance: Covariance tells us the direction of the linear relationship between two random variables. It can take any value between -∞ and +∞.

Formula to calculate covariance (sample covariance over n paired observations):

Cov(X, Y) = Σ (Xi - X̄)(Yi - Ȳ) / (n - 1)
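Both quantities are easy to compute with NumPy; the sample data below is illustrative.

# Covariance and correlation with NumPy.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 11.0])

print(np.cov(x, y)[0, 1])        # sample covariance of x and y
print(np.corrcoef(x, y)[0, 1])   # correlation: always between -1 and +1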


Support Vectors in SVM

Support vectors are the data points that are nearest to the hyperplane. They influence the position and orientation of the hyperplane, and removing them would alter its position. The support vectors help us build our support vector machine model.
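A quick sketch showing how scikit-learn exposes these points; the toy data is illustrative.

# Inspecting the support vectors of a linear SVM.
from sklearn.svm import SVC

X = [[1, 1], [2, 2], [2, 1], [6, 6], [7, 7], [6, 7]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)   # the data points nearest the hyperplane, which define it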

Ensemble learning

Ensemble learning is the combination of the results obtained from multiple machine learning models to increase accuracy for improved decision-making. Example: A random forest with 100 trees can provide much better results than using just one decision tree.
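A sketch of that comparison; the dataset and split are illustrative assumptions.

# Ensemble sketch: a 100-tree random forest versus a single decision tree.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(tree.score(X_te, y_te), forest.score(X_te, y_te))   # the forest usually scores at least as well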
Cross-Validation

Cross-validation in machine learning is a statistical resampling technique that uses different parts of the dataset to train and test a machine learning algorithm on different iterations. The aim of cross-validation is to test the model's ability to predict a new set of data that was not used to train it. Cross-validation helps avoid overfitting.

K-Fold Cross Validation is the most popular resampling technique that divides the whole
dataset into K sets of equal sizes.
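A minimal K-Fold cross-validation sketch; the model and dataset are illustrative.

# K-Fold cross-validation: 5 equal folds, each used once as the test set.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(scores)          # one accuracy score per fold
print(scores.mean())   # the averaged estimate of generalization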
