Unit 3
Explanation
1. SAS
SAS is a data science tool designed specifically for statistical operations. It is closed-source, proprietary software used by large organizations to analyze data.
2. Apache Spark
Apache Spark, or simply Spark, is a powerful analytics engine and one of the most widely used data science tools. Spark is designed to handle both batch processing and stream processing.
3. BigML
BigML provides a fully interactive, cloud-based GUI environment for building and running machine learning algorithms.
4. D3.js
JavaScript is mainly used as a client-side scripting language. D3.js, a JavaScript library, lets you build interactive visualizations in your web browser.
5. MATLAB
MATLAB is a multi-paradigm numerical computing environment for processing mathematical information. It is closed-source software that facilitates matrix operations, algorithm implementation, and statistical modeling of data, and it is widely used across several scientific disciplines.
6. Excel
Probably the most widely used data analysis tool. Microsoft developed Excel mainly for spreadsheet calculations; today it is widely used for data processing, visualization, and complex calculations.
7. ggplot2
ggplot2 is an advanced data visualization package for the R programming language. Its developers created it to replace R's native graphics package, and it uses concise, powerful commands to create polished visualizations. It is among the libraries data scientists use most for visualizing analyzed data.
8. Tableau
Tableau is data visualization software packed with powerful graphics for making interactive visualizations. It is focused on industries working in business intelligence. Tableau's most important capability is its ability to interface with databases, spreadsheets, OLAP (Online Analytical Processing) cubes, and more. It can also visualize geographical data by plotting longitudes and latitudes on maps.
Machine Learning
Traditional Programs vs ML
Key differences between AI and ML
Approach
Applications
Types of machine learning (ML)
Supervised learning
• Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher.
• Basically, supervised learning is learning in which we teach or train the machine using data that is well labeled, meaning each example is already tagged with the correct answer.
• After that, the machine is given a new set of examples (data) so that the supervised learning algorithm analyses the training data (the set of training examples) and produces a correct outcome from the labeled data.
Unsupervised learning
• Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance.
• Here the machine's task is to group unsorted information according to similarities, patterns, and differences, without any prior training on the data.
• Unlike supervised learning, no teacher is provided, which means no training is given to the machine.
• The machine is therefore left to find the hidden structure in the unlabeled data by itself.
Semi-supervised learning & Reinforcement learning
• Semi-supervised learning sits between supervised and unsupervised learning.
• It uses both labelled and unlabelled data for training.
• Reinforcement learning trains an algorithm with a reward system, providing feedback when an artificial intelligence agent performs the best action in a particular situation.
• In reinforcement learning, AI agents attempt to find the optimal way to accomplish a particular goal or to improve performance on a specific task.
• As the agent takes actions that move it toward the goal, it receives a reward.
Examples / Applications
Regression
• Regression analysis is a statistical method for modeling the relationship between a dependent (target) variable and one or more independent (predictor) variables.
• Regression is the process of finding the correlations between dependent and independent variables.
• It helps in predicting continuous variables, such as predicting market trends, house prices, etc.
ML Regression Algorithms
• Simple Linear Regression
• Multiple Linear Regression
• Polynomial Regression
• Support Vector Regression
• Decision Tree Regression
• Random Forest Regression
Classification
• A classification algorithm is a supervised learning technique used to identify the category of new observations on the basis of training data.
• In classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups.
• Examples: Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc.
ML Classification Algorithms
• Logistic Regression
• K-Nearest Neighbours
• Support Vector Machines
• Kernel SVM
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification
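As a minimal illustration of one entry in this list, a 1-nearest-neighbour classifier (the simplest form of K-Nearest Neighbours) can be written in plain Python; the points and the "cat"/"dog" labels below are invented toy data, not from the slides:

```python
import math

def nn_classify(train, query):
    """Return the label of the training point nearest to `query` (1-NN)."""
    best_label, best_dist = None, float("inf")
    for features, label in train:
        d = math.dist(features, query)  # Euclidean distance
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

# Toy labeled data: (feature vector, class label)
train = [((1.0, 1.0), "cat"), ((1.2, 0.8), "cat"),
         ((5.0, 5.0), "dog"), ((4.8, 5.2), "dog")]
print(nn_classify(train, (1.1, 0.9)))  # → cat
print(nn_classify(train, (5.1, 4.9)))  # → dog
```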
Difference between Regression and Classification
Clustering
• A group of similar data points is called a cluster.
• Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset.
Clustering Algorithms
• K-Means algorithm
• Agglomerative Hierarchical algorithm
• Mean-shift algorithm
• DBSCAN Algorithm (Density-Based Spatial
Clustering of Applications with Noise)
• Expectation-Maximization (EM) Clustering
using GMM (Gaussian Mixture Model)
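The K-Means algorithm from this list can be sketched in plain Python: it alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its cluster. The two-dimensional points below are invented toy data:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: alternate assignment and centroid update."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty)
        centroids = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 0.6), (8, 8), (9, 11), (8, 9)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```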
Feature selection
• In machine learning and statistics, feature selection is also known as variable selection, attribute selection, or variable subset selection.
• It is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.
• When the number of features is very large, there is no need to use every feature at your disposal when creating an algorithm.
• You can assist your algorithm by feeding in only those features that are really important.
Feature selection
• Machine learning works on a simple rule: if you put garbage in, you will only get garbage out (garbage = noise). "Sometimes, less is better!"
Top reasons to use feature selection:
• It enables the machine learning algorithm to train faster.
• It reduces the complexity of a model and makes it easier to interpret.
• It improves the accuracy of a model if the right subset is chosen.
• It reduces overfitting.
ML Feature Selection Algorithms
Filter Methods: In this method, the dataset is filtered, and a subset containing only the relevant features is taken.
• Pearson’s Correlation
• Linear Discriminant Analysis (LDA)
• ANOVA (Analysis of variance)
• Chi-Square
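As a sketch of the filter approach, the snippet below computes Pearson's correlation coefficient by hand and keeps only the features whose absolute correlation with the target clears a threshold. The feature names f1–f3, the data, and the 0.8 cutoff are all illustrative assumptions:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Filter step: keep features strongly correlated with the target
features = {"f1": [1, 2, 3, 4, 5], "f2": [2, 1, 4, 3, 5], "f3": [5, 3, 1, 4, 2]}
target = [1.1, 2.0, 2.9, 4.2, 5.0]
selected = [name for name, col in features.items()
            if abs(pearson(col, target)) > 0.8]
print(selected)  # → ['f1']
```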
Wrapper Methods
• The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, some features are fed to the ML model and its performance is evaluated. The performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than filtering but more complex to work with.
• Forward Selection
Embedded Methods
Embedded methods check the different training
iterations of the machine learning model and
evaluate the importance of each feature.
• Decision Tree
• ID3
• C4.5
• Classification And Regression Tree (CART)
Linear regression
• Linear regression is one of the easiest and most
popular Machine Learning algorithms.
• It is a statistical method that is used for predictive
analysis.
• Linear regression makes predictions for
continuous/real or numeric variables such as sales,
salary, age, product price, etc.
• The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more independent variables (x).
y = mx + c + ε
• y= Dependent Variable (Target Variable)
• x= Independent Variable (predictor Variable)
• c= y intercept of the line
• m= slope
• ε= error
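The slope m and intercept c above are usually estimated by ordinary least squares; a minimal sketch in plain Python, using made-up data that follows y = 2x + 1 exactly (so the error ε is zero here):

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope m and intercept c for y = mx + c."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    c = my - m * mx
    return m, c

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]   # exactly y = 2x + 1
m, c = fit_line(xs, ys)
print(m, c)             # → 2.0 1.0
```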
Logistic regression
• Logistic regression is one of the most popular machine learning algorithms; it comes under the supervised learning technique.
• It is used for predicting a categorical dependent variable from a given set of independent variables.
• The outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc.
• But instead of giving an exact value of 0 or 1, it gives probabilistic values lying between 0 and 1.
Logistic regression
• Logistic regression is used for solving classification problems.
• In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a dog is a puppy or not based on its weight, etc.
• Logistic regression is based on the concept of maximum likelihood estimation.
• According to this estimation, the observed data should be the most probable.
Logistic regression
• It has the ability to provide probabilities and to classify new data using both continuous and discrete datasets.
Logistic Function (Sigmoid Function)
• The sigmoid function is a mathematical function used to map predicted values to probabilities.
• It maps any real value to a value within the range 0 to 1.
• The value of logistic regression must be between 0 and 1 and cannot go beyond this limit, so it forms a curve like the "S" shape.
• The S-shaped curve is called the sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which defines the boundary between predicting 0 and predicting 1.
• Values above the threshold tend to 1, and values below the threshold tend to 0.
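The sigmoid mapping and the threshold rule described above can be sketched directly; the 0.5 threshold below is the conventional default, used here for illustration:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict(z, threshold=0.5):
    """Classify as 1 if the probability is above the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0))     # → 0.5
print(predict(2.0))   # → 1
print(predict(-2.0))  # → 0
```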
Linear vs Logistic Regression
Introducing the Gaussian
Carl Friedrich Gauss, ranked among history's most influential mathematicians, discovered the normal distribution. It is also called the Gaussian distribution.
=NORMDIST(x,mean,standard_dev,cumulative)
1. X (required argument) – This is the value for which we wish to calculate the
distribution.
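Outside Excel, the same quantity can be computed with the error function from Python's standard library. The mean of 25 and standard deviation of 2 below are illustrative values, not taken from the slides:

```python
import math

def norm_dist(x, mean, sd, cumulative=True):
    """Rough equivalent of Excel's NORMDIST / NORM.DIST."""
    if cumulative:
        # CDF via the error function
        return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))
    # Probability density function
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

print(norm_dist(25, 25, 2, True))  # → 0.5 (half the mass lies below the mean)

# A probability between two bounds is a difference of two CDF values,
# e.g. P(23 < X < 27) for a hypothetical mean of 25 and sd of 2:
p_23_27 = norm_dist(27, 25, 2) - norm_dist(23, 25, 2)
print(round(p_23_27, 4))  # ≈ 0.6827, the classic "within one sd" mass
```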
Worked examples: probability between 23 and 27; probability between 15 and 20; probability less than 20.
Central Limit Theorem
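The theorem can be demonstrated numerically: averages of many independent dice rolls cluster around the die's mean of 3.5 and are approximately normally distributed, even though a single roll is uniform. A sketch with a fixed seed (sample sizes are arbitrary choices):

```python
import random
import statistics

random.seed(42)

# Each sample mean averages 50 rolls of a fair six-sided die (mean 3.5)
sample_means = [statistics.mean(random.randint(1, 6) for _ in range(50))
                for _ in range(2000)]

# By the Central Limit Theorem, the sample means cluster tightly around 3.5
print(round(statistics.mean(sample_means), 2))
```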
Terminologies
▪ Portfolio --- a collection of investments.
▪ Expected risk --- the total amount of money that can be lost.
Example
Security   Expected Return Ri %   Proportion Xi %
1          10                     25
2          20                     75
3          30                     80
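The expected return of a portfolio is the proportion-weighted sum of the securities' expected returns, Σ Xi·Ri. The proportions in the table above sum past 100%, so the sketch below uses hypothetical weights that sum to 1:

```python
# Hypothetical two-security portfolio
returns = [0.10, 0.20]   # Ri: expected return of each security
weights = [0.25, 0.75]   # Xi: proportion invested in each (sums to 1)

expected_return = sum(x * r for x, r in zip(weights, returns))
print(round(expected_return, 3))  # → 0.175, i.e. 17.5%
```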
• Mean = 0
• SD = 1
Example 1
Coefficient of correlation formula
Regression equation
Example 2
Standardization Simplifies Linear Regression
To simplify the standardized equation of the line, the residual sum of squares (SSR) should be minimized.
Residual Sum of Squares
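Computing the residual sum of squares for a candidate line is a one-liner: square each vertical gap between an observed y and the line's prediction, then add them up. The data and the line y = 1.5x + 0.5 below are illustrative:

```python
def rss(xs, ys, m, c):
    """Residual sum of squares for the line y = m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3]
ys = [2, 4, 5]
print(rss(xs, ys, m=1.5, c=0.5))  # residuals: 0, 0.5, 0 → RSS = 0.25
```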
Modeling Error in Linear Regression
▪ The coefficient of determination, R², is a measure that provides information about the goodness of fit of a model.
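R² can be computed as 1 minus the ratio of the residual sum of squares to the total sum of squares; values near 1 indicate a good fit. A sketch with made-up observations and predictions:

```python
def r_squared(ys, preds):
    """Coefficient of determination: 1 - RSS / TSS."""
    mean_y = sum(ys) / len(ys)
    rss = sum((y - p) ** 2 for y, p in zip(ys, preds))
    tss = sum((y - mean_y) ** 2 for y in ys)
    return 1 - rss / tss

ys = [3, 5, 7, 9]
preds = [2.8, 5.1, 7.2, 8.9]
print(round(r_squared(ys, preds), 4))  # close to 1: the fit is good
```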
Information Gain from Linear Regression
▪ Information gain is calculated by comparing the entropy of the dataset before and after a transformation.
▪ For example, we may wish to evaluate the impact on purity of splitting a dataset S by a random variable with a range of values.
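Entropy and information gain can be computed directly: the gain is the parent's entropy minus the size-weighted entropy of the split subsets. The toy dataset S and the perfect split below are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, splits):
    """Entropy of the parent minus the weighted entropy of the splits."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in splits)

S = ["yes", "yes", "no", "no"]
# A split that separates the classes completely recovers all the entropy:
print(information_gain(S, [["yes", "yes"], ["no", "no"]]))  # → 1.0
```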