Machine Learning Notes
Machine Learning Notes
A Machine Learning system learns from historical data, builds the prediction
models, and whenever it receives new data, predicts the output for it. The
accuracy of predicted output depends upon the amount of data, as the huge amount
of data helps to build a better model which predicts the output more accurately.
Suppose we have a complex problem, where we need to perform some predictions,
so instead of writing a code for it, we just need to feed the data to generic
algorithms, and with the help of these algorithms, machine builds the logic as per the
data and predict the output.
Following are some key points which show the importance of Machine Learning:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
Supervised Learning
Supervised learning is a type of machine learning method in which we provide sample labeled data to
the machine learning system in order to train it, and on that basis, it predicts the output.
The system creates a model using labeled data to understand the datasets and learn about each data,
once the training and processing are done then we test the model by providing a sample data to check
whether it is predicting the exact output or not.
The goal of supervised learning is to map input data with the output data. The supervised learning is
based on supervision, and it is the same as when a student learns things in the supervision of the teacher.
The example of supervised learning is spam filtering.
o Classification
o Regression
Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any supervision.
The training is provided to the machine with the set of data that has not been labeled, classified, or
categorized, and the algorithm needs to act on that data without any supervision. The goal of
unsupervised learning is to restructure the input data into new features or a group of objects with similar
patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights
from the huge amount of data. It can be further classifieds into two categories of algorithms:
o Clustering
o Association
Reinforcement learning
The robotic dog, which automatically learns the movement of his arms, is an example
of Reinforcement learning.
Now machine learning has got a great advancement in its research, and it is present everywhere around
us, such as self-driving cars, Amazon Alexa, Catboats, recommender system, and many more. It
includes Supervised, unsupervised, and reinforcement learning with
clustering, classification, decision tree, SVM algorithms, etc.
Modern machine learning models can be used for making various predictions, including weather
prediction, disease prediction, stock market analysis, etc.
Prerequisites
Before learning machine learning, you must have the basic knowledge of followings
so that you can easily understand the concepts of machine learning:
In this step, we need to identify the different data sources, as data can be collected
from various sources such as files, database, internet, or mobile devices. It is one of
the most important steps of the life cycle. The quantity and quality of the collected
data will determine the efficiency of the output. The more will be the data, the more
accurate will be the prediction.
By performing the above task, we get a coherent set of data, also called as a dataset.
It will be used in further steps.
2. Data preparation
After collecting the data, we need to prepare it for further steps. Data preparation is a
step where we put our data into a suitable place and prepare it to use in our machine
learning training.
In this step, first, we put all data together, and then randomize the ordering of data.
o Data exploration:
It is used to understand the nature of data that we have to work with. We need
to understand the characteristics, format, and quality of data.
A better understanding of data leads to an effective outcome. In this, we find
Correlations, general trends, and outliers.
o Datapre-processing:
Now the next step is preprocessing of data for its analysis.
3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a useable
format. It is the process of cleaning the data, selecting the variable to use, and
transforming the data in a proper format to make it more suitable for analysis in the
next step. It is one of the most important steps of the complete process. Cleaning of
data is required to address the quality issues.
It is not necessary that data we have collected is always of our use as some of the data
may not be useful. In real-world applications, collected data may have various issues,
including:
o Missing Values
o Duplicate data
o Invalid data
o Noise
It is mandatory to detect and remove the above issues because it can negatively affect
the quality of the outcome.
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step
involves:
The aim of this step is to build a machine learning model to analyze the data using
various analytical techniques and review the outcome. It starts with the determination
of the type of the problems, where we select the machine learning techniques such
as Classification, Regression, Cluster analysis, Association, etc. then build the
model using prepared data, and evaluate the model.
5. Train Model
Now the next step is to train the model, in this step we train our model to improve its
performance for better outcome of the problem.
We use datasets to train the model using various machine learning algorithms.
Training a model is required so that it can understand the various patterns, rules, and,
features.
6. Test Model
Once our machine learning model has been trained on a given dataset, then we test
the model. In this step, we check for the accuracy of our model by providing a test
dataset to it.
Testing the model determines the percentage accuracy of the model as per the
requirement of project or problem.
7. Deployment
The last step of machine learning life cycle is deployment, where we deploy the model
in the real-world system.
AI vs Machine learning
Artificial intelligence is a technology which Machine learning is a subset of AI which allows a machine
enables a machine to simulate human to automatically learn from past data without programming
behavior. explicitly.
The goal of AI is to make a smart computer The goal of ML is to allow machines to learn from data so
system like humans to solve complex that they can give accurate output.
problems.
In AI, we make intelligent systems to perform In ML, we teach machines with data to perform a particular
any task like a human. task and give an accurate result.
Machine learning and deep learning are the Deep learning is a main subset of machine learning.
two main subsets of AI.
AI has a very wide range of scope. Machine learning has a limited scope.
AI is working to create an intelligent system Machine learning is working to create machines that can
which can perform various complex tasks. perform only those specific tasks for which they are trained.
AI system is concerned about maximizing the Machine learning is mainly concerned about accuracy and
chances of success. patterns.
The main applications of AI are Siri, customer The main applications of machine learning are Online
support using catboats, Expert System, recommender system, Google search
Online game playing, intelligent humanoid algorithms, Facebook auto friend tagging suggestions,
robot, etc. etc.
On the basis of capabilities, AI can be divided Machine learning can also be divided into mainly three
into three types, which are, Weak AI, General types that are Supervised learning, Unsupervised
AI, and Strong AI. learning, and Reinforcement learning.
It includes learning, reasoning, and self- It includes learning and self-correction when introduced
correction. with new data.
AI completely deals with Structured, semi- Machine learning deals with Structured and semi-
structured, and unstructured data. structured data.
Before knowing the sources of the machine learning dataset, let's discuss datasets.
What is a dataset?
A dataset is a collection of data in which data is arranged in some order. A dataset
can contain any data from a series of an array to a database table. Below table shows
an example of the dataset:
Country Age Salary Purchased
India 38 48000 No
Germany 30 54000 No
France 48 65000 No
Germany 40 Yes
A tabular dataset can be understood as a database table or matrix, where each column
corresponds to a particular variable, and each row corresponds to the fields of the
dataset. The most supported file type for a tabular dataset is "Comma Separated
File," or CSV. But to store a "tree-like data," we can use the JSON file more efficiently
Note: A real-world dataset is of huge size, which is difficult to manage and process at
the initial level. Therefore, to practice machine learning algorithms, we can use any
dummy dataset.
Need of Dataset
To work with machine learning projects, we need a huge amount of data, because,
without the data, one cannot train ML/AI models. Collecting and preparing the dataset
is one of the most crucial parts while creating an ML/AI project.
The technology applied behind any ML projects cannot work properly if the dataset is
not well prepared and pre-processed.
During the development of the ML project, the developers completely rely on the
datasets. In building ML applications, datasets are divided into two parts:
o Training dataset:
o Test Dataset
Kaggle Datasets
UCI Machine Learning Repository
Datasets via AWS
Google's Dataset Search Engine
Microsoft Datasets
Awesome Public Dataset Collection
Government Datasets
Computer Vision Datasets
SUPERVISED MACHINE LEARNING
Supervised learning is the types of machine learning in which machines are trained
using well "labelled" training data, and on basis of that data, machines predict the
output. The labelled data means some input data is already tagged with the correct
output.
In supervised learning, the training data provided to the machines work as the
supervisor that teaches the machines to predict the output correctly. It applies the
same concept as a student learns in the supervision of the teacher.
Supervised learning is a process of providing input data as well as correct output data
to the machine learning model. The aim of a supervised learning algorithm is to find
a mapping function to map the input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.
The working of Supervised learning can be easily understood by the below example
and diagram:
Now, after training, we test our model using the test set, and the task of the model is
to identify the shape.
The machine is already trained on all types of shapes, and when it finds a new shape,
it classifies the shape on the bases of a number of sides, and predicts the output.
Regression algorithms are used if there is a relationship between the input variable
and the output variable. It is used for the prediction of continuous variables, such as
Weather forecasting, Market Trends, etc. Below are some popular Regression
algorithms which come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means
there are two classes such as Yes-No, Male-Female, True-false, etc.
Spam Filtering,
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
We can understand the concept of regression analysis using the below example:
Now, the company wants to do the advertisement of $200 in the year 2019 and wants to
know the prediction about the sales for this year. So to solve such type of prediction
problems in machine learning, we need regression analysis.
In Regression, we plot a graph between the variables which best fits the given datapoints,
using this plot, the machine learning model can make predictions about the data. In
simple words, "Regression shows a line or curve that passes through all the
datapoints on target-predictor graph in such a way that the vertical distance
between the datapoints and the regression line is minimum." The distance between
datapoints and line tells whether a model has captured a strong relationship or not.
Some examples of regression can be as:
o Regression estimates the relationship between the target and the independent
variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing the regression, we can confidently determine the most important
factor, the least important factor, and how each factor is affecting the other
factors.
Types of Regression
There are various types of regressions which are used in data science and machine
learning. Each type has its own importance on different scenarios, but at the core, all the
regression methods analyze the effect of the independent variable on dependent
variables. Here we are discussing some important types of regression which are given
below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression:
Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the very simple and easy algorithms which works on regression and shows the
relationship between the continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-
axis) and the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
o The relationship between variables in the linear regression model can be explained using
the below image. Here we are predicting the salary of an employee on the basis of the
year of experience.
1. Y= aX+b
Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a
binary or discrete format such as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No,
True or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it is different from the linear regression
algorithm in the term how they are used.
o Logistic regression uses sigmoid function or logistic function which is a complex cost
function. This sigmoid function is used to model the data in logistic regression. The
function can be represented as:
When we provide the input values (data) to the function, it gives the S-curve as follows:
o It uses the concept of threshold levels, values above the threshold level are
rounded up to 1, and values below the threshold level are rounded up to 0.
o Binary(0/1, pass/fail)
o Multi(cats, dogs, lions)
o Ordinal(low, medium, high)
Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-linear dataset using
a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the value
of x and corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a non-linear
fashion, so for such case, linear regression will not best fit to those datapoints. To cover
such datapoints, we need Polynomial regression.
o In Polynomial regression, the original features are transformed into polynomial
features of given degree and then modeled using a linear model. Which means the
datapoints are best fitted using a polynomial line.
o The equation for polynomial regression also derived from linear regression
equation that means Linear regression equation Y= b0+ b1x, is transformed into
Polynomial regression equation Y= b0+b1x+ b2x2+ b3x3+.....+ bnxn.
o Here Y is the predicted/target output, b0, b1,... bn are the regression
coefficients. x is our independent/input variable.
o The model is still linear as the coefficients are still linear with quadratic