
Unit 3

Data science components


Tools for data science
Explanation
1. SAS
SAS is a data science tool designed specifically for statistical operations. It is closed-source, proprietary software used by large organizations to analyze data.

2. Apache Spark
Apache Spark, or simply Spark, is a powerful analytics engine and one of the most widely used data science tools. Spark is designed to handle both batch processing and stream processing.

3. BigML
It provides a fully interactive, cloud-based GUI environment that you can use for running machine learning algorithms.

4. D3.js
JavaScript is mainly used as a client-side scripting language. D3.js, a JavaScript library, allows you to create interactive visualizations in your web browser.
5. MATLAB
MATLAB is a multi-paradigm numerical computing environment for processing mathematical information. It is closed-source software that facilitates matrix operations, algorithm implementation, and statistical modeling of data. MATLAB is widely used across scientific disciplines.

In Data Science, MATLAB is used for simulating neural networks and fuzzy logic. Using the MATLAB graphics library, you can create powerful visualizations.

6. Excel
Probably the most widely used data analysis tool. Microsoft developed Excel mainly for spreadsheet calculations; today it is widely used for data processing, visualization, and complex calculations.
7. ggplot2
ggplot2 is an advanced data visualization package for the R programming language. The developers created it to replace R's native graphics package, and it uses powerful commands to create elegant visualizations. It is the library Data Scientists most widely use for creating visualizations from analyzed data.
8. Tableau
Tableau is data visualization software packed with powerful graphics for making interactive visualizations. It is focused on industries working in the field of business intelligence. The most important aspect of Tableau is its ability to interface with databases, spreadsheets, OLAP (Online Analytical Processing) cubes, etc. Along with these features, Tableau can visualize geographical data by plotting longitudes and latitudes on maps.

9. Jupyter
Project Jupyter is an open-source tool based on IPython that helps developers build open-source software and experience interactive computing. Jupyter supports multiple languages such as Julia, Python, and R. It is a web-application tool used for writing live code, visualizations, and presentations. Jupyter is a widely popular tool designed to address the requirements of Data Science.
10. Matplotlib
Matplotlib is a plotting and visualization library developed for Python. It is the most popular tool for generating graphs from analyzed data, and it is mainly used for plotting complex graphs with simple lines of code. Using it, one can generate bar plots, histograms, scatterplots, etc. Matplotlib has several essential modules; one of the most widely used is pyplot, which offers a MATLAB-like interface. Pyplot is also an open-source alternative to MATLAB's graphics modules.

13. TensorFlow
TensorFlow has become a standard tool for Machine Learning. It is widely used for advanced machine learning techniques like Deep Learning. The developers named TensorFlow after tensors, which are multidimensional arrays. It is an open-source, ever-evolving toolkit known for its performance and high computational abilities.
Artificial intelligence (AI)
• Artificial intelligence (AI) is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals, which involves consciousness and emotionality.
• AI is the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings.
• AI refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions.
Machine Learning
“Machine learning enables a machine to automatically learn from data, improve performance from experience, and predict things without being explicitly programmed.”

Machine Learning

Traditional Programs vs ML

Key differences between AI and ML

Approach

Applications

Types of machine learning (ML)
Supervised learning
• Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher.
• Basically, supervised learning is learning in which we teach or train the machine using data that is well labeled, meaning some of the data is already tagged with the correct answer.
• After that, the machine is provided with a new set of examples (data) so that the supervised learning algorithm analyses the training data (set of training examples) and produces a correct outcome from the labeled data.
Unsupervised learning
• Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance.
• Here the task of the machine is to group unsorted information according to similarities, patterns and differences without any prior training on the data.
• Unlike supervised learning, no teacher is provided, which means no training will be given to the machine.
• Therefore the machine is restricted to finding the hidden structure in unlabeled data by itself.
Semi-supervised learning & Reinforcement learning
• Semi-supervised learning falls between supervised and unsupervised learning.
• It uses both labelled and unlabelled data for training.
• Reinforcement learning trains an algorithm with a reward system, providing feedback when an artificial intelligence agent performs the best action in a particular situation.
• In reinforcement learning, AI agents attempt to find the optimal way to accomplish a particular goal, or improve performance on a specific task.
• As the agent takes actions that move toward the goal, it receives a reward.
Examples / Applications
Regression
• Regression analysis is a statistical method for modeling the relationship between a dependent (target) variable and one or more independent (predictor) variables.
• Regression is the process of finding the correlations between dependent and independent variables.
• It helps in predicting continuous variables, such as predicting market trends, house prices, etc.
ML Regression Algorithms
• Simple Linear Regression
• Multiple Linear Regression
• Polynomial Regression
• Support Vector Regression
• Decision Tree Regression
• Random Forest Regression
Classification
• A classification algorithm is a supervised learning technique used to identify the category of new observations on the basis of training data.
• In classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups.
• Examples: Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc.
ML Classification Algorithms
• Logistic Regression
• K-Nearest Neighbours
• Support Vector Machines
• Kernel SVM
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification
Difference between Regression and Classification
Clustering
• Grouping similar data points together is called clustering.
• Clustering, or cluster analysis, is a machine learning technique which groups an unlabelled dataset.
Clustering Algorithms
• K-Means algorithm
• Agglomerative Hierarchical algorithm
• Mean-shift algorithm
• DBSCAN Algorithm (Density-Based Spatial Clustering of Applications with Noise)
• Expectation-Maximization (EM) Clustering using GMM (Gaussian Mixture Model)
Feature selection
• In machine learning and statistics, feature selection is also known as variable selection, attribute selection, or variable subset selection.
• It is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.
• When the number of features is very large, you need not use every feature at your disposal for creating an algorithm.
• You can assist your algorithm by feeding in only those features that are really important.
Feature selection
• Machine learning works on a simple rule – if you put garbage in, you will only get garbage out (garbage = noise). “Sometimes, less is better!”
Top reasons to use feature selection are:
• It enables the machine learning algorithm to train faster.
• It reduces the complexity of a model and makes it easier to interpret.
• It improves the accuracy of a model if the right subset is chosen.
• It reduces overfitting.
ML Feature selection Algorithms
Filter Methods: In this method, the dataset is filtered, and a subset that contains only the relevant features is taken.
• Pearson's Correlation
• Linear Discriminant Analysis (LDA)
• ANOVA (Analysis of Variance)
• Chi-Square
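As an illustration of a filter method, Pearson's correlation between a candidate feature and the target can be computed directly. A minimal pure-Python sketch (the function name and sample data are illustrative, not from the slides):

```python
def pearson_correlation(xs, ys):
    # Pearson's r: covariance of x and y divided by the product
    # of their standard deviations; ranges from -1 to +1.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5
```

Features whose correlation with the target is near zero are candidates for removal.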
Wrapper Methods
• The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, some features are fed to the ML model and the performance is evaluated. The performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than the filter method but more complex to work with.

• Forward Selection
ML Feature selection Algorithms
Embedded Methods
Embedded methods check the different training iterations of the machine learning model and evaluate the importance of each feature.
• Decision Tree
• ID3
• C4.5
• Classification And Regression Tree (CART)
Linear regression
• Linear regression is one of the easiest and most popular Machine Learning algorithms.
• It is a statistical method that is used for predictive analysis.
• Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc.
• The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more independent (x) variables.
y = mx + c + ε
• y = dependent variable (target variable)
• x = independent variable (predictor variable)
• c = y-intercept of the line
• m = slope
• ε = error term
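The slope m and intercept c of the line above can be estimated from data by ordinary least squares. A minimal sketch (the function name and sample data are illustrative):

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = mx + c:
    # m = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2), c = y_bar - m*x_bar
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    c = mean_y - m * mean_x
    return m, c
```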
Logistic regression
• Logistic regression is one of the most popular Machine Learning algorithms; it comes under the supervised learning technique.
• It is used for predicting the categorical dependent variable using a given set of independent variables.
• The outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc.
• However, instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
• Logistic regression is used for solving classification problems.
• In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a dog is a puppy or not based on its weight, etc.
• Logistic regression is based on the concept of Maximum Likelihood Estimation.
• According to this estimation, the observed data should be most probable.
• Logistic regression can provide probabilities and classify new data using both continuous and discrete datasets.
Logistic Function (Sigmoid Function)
• The sigmoid function is a mathematical function used to map the predicted values to probabilities.
• It maps any real value into another value within a range of 0 and 1.
• The value of the logistic regression must be between 0 and 1, and cannot go beyond this limit, so it forms a curve like the "S" form.
• The S-form curve is called the sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1.
• Values above the threshold tend to 1, and values below the threshold tend to 0.
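The sigmoid and the threshold rule described above can be sketched in a few lines of Python (a threshold of 0.5 is an assumption here, a common default):

```python
import math

def sigmoid(z):
    # Maps any real value into the range (0, 1), forming the "S" curve
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    # Values above the threshold tend to class 1, below it to class 0
    return 1 if sigmoid(z) >= threshold else 0
```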
Linear vs Logistic Regression
Introducing the Gaussian
Carl Friedrich Gauss, ranked among history's most influential mathematicians, discovered the normal distribution. It is also called the Gaussian distribution.

It is often called the bell curve, because the graph of its probability density looks like a bell.

The normal distribution occurs naturally in many situations.

Examples: heights of people, measurement errors, blood pressure, test marks, IQ scores, salaries.
Gaussian Distribution
68% of the data falls within one standard deviation of the mean.
95% of the data falls within two standard deviations of the mean.
99.7% of the data falls within three standard deviations of the mean.
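These percentages follow from the normal CDF and can be verified with the standard-library error function (the helper name is illustrative):

```python
import math

def fraction_within(k):
    # Probability that a normal value lies within k standard
    # deviations of the mean: erf(k / sqrt(2))
    return math.erf(k / math.sqrt(2))
```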
Properties of normal distribution
- The mean, mode and median are all equal.
- The curve is symmetric at the center.
- Exactly half of the values are to the left of center and exactly half the values are to the right.
- The total area under the curve is 1.
Standard Deviation
▪ Standard deviation is a measure of the amount of variation.
▪ A low standard deviation indicates that the values tend to be close to the mean.
▪ A high standard deviation indicates that the values are spread out over a wider range.
Example 1: Mean, Variance, Standard Deviation

Example 2
Solution: Mean = 27, Variance = 24.86, SD = 4.96
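The quantities computed in the examples above can be expressed directly in Python; the sample data in the test is illustrative, not the data from the slides (which is shown only as a figure):

```python
def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    # Population variance: average squared deviation from the mean
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def std_dev(xs):
    # Standard deviation is the square root of the variance
    return variance(xs) ** 0.5
```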
Introduction to Standardization
Standardization is a scaling technique where the values are centered around the mean with a unit standard deviation.

It means that the mean of the attribute becomes 0 and the resulting distribution has a unit (1) standard deviation.

Standard scores are most commonly called z-scores.
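Standardization as described above (subtract the mean, divide by the standard deviation) in a minimal sketch:

```python
def standardize(xs):
    # z-score each value: the result has mean 0 and standard deviation 1
    n = len(xs)
    mu = sum(xs) / n
    sd = (sum((x - mu) ** 2 for x in xs) / n) ** 0.5
    return [(x - mu) / sd for x in xs]
```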
Standard Normal Probability Distribution in Excel

The NORMDIST function is available under Excel Statistical functions. It returns the normal distribution for a specified mean and standard deviation.

=NORMDIST(x, mean, standard_dev, cumulative)
1. X (required argument) – This is the value for which we wish to calculate the distribution.
2. Mean (required argument) – The arithmetic mean of the distribution.
3. Standard_dev (required argument) – This is the standard deviation of the distribution.
4. Cumulative (required argument) – This is a logical value. It specifies the type of distribution to be used: TRUE (Cumulative Normal Distribution Function) or FALSE (Normal Probability Density Function).
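A Python equivalent of the NORMDIST behavior described above can be written with only the standard library (the function name mirrors Excel's; this is a sketch, not Excel's implementation):

```python
import math

def normdist(x, mean, standard_dev, cumulative):
    z = (x - mean) / standard_dev
    if cumulative:
        # TRUE in Excel: cumulative normal distribution function
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))
    # FALSE in Excel: normal probability density function
    return math.exp(-0.5 * z * z) / (standard_dev * math.sqrt(2 * math.pi))
```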
Example
If we wish to calculate the probability density function for the data above, we use the NORMDIST formula with cumulative set to FALSE.
STANDARDIZE Z-Score Function

The STANDARDIZE function is available under Excel Statistical functions. It will return a normalized value (z-score) based on the mean and standard deviation.

=STANDARDIZE(x, mean, standard_dev)

The STANDARDIZE function uses the following arguments:
1. X (required argument) – This is the value that we want to normalize.
2. Mean (required argument) – The arithmetic mean of the distribution.
3. Standard_dev (required argument) – This is the standard deviation of the distribution.
Using z-Scores to find a Probability
Example:
The mean score for the population is 21, and the standard deviation is 5. How will you determine the probability that a score falls
- higher than 30
- between the range of 23 and 27
- between 15 and 20
- less than 20?
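The four probabilities in this example can be computed by converting each score to a z-score and applying the normal CDF. A sketch using only the standard library:

```python
import math

def norm_cdf(x, mean, sd):
    # P(X <= x) for a normal distribution, via the z-score
    z = (x - mean) / sd
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mean, sd = 21, 5
p_above_30 = 1 - norm_cdf(30, mean, sd)                       # z = 1.8
p_23_to_27 = norm_cdf(27, mean, sd) - norm_cdf(23, mean, sd)  # z = 0.4 to 1.2
p_15_to_20 = norm_cdf(20, mean, sd) - norm_cdf(15, mean, sd)  # z = -1.2 to -0.2
p_below_20 = norm_cdf(20, mean, sd)                           # z = -0.2
```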
Central Limit Theorem

The Central Limit Theorem states that the sampling distribution of the sample means approaches a normal distribution as the sample size gets larger.

“Given a dataset with unknown distribution (it could be uniform, binomial or completely random), the sample means will approximate the normal distribution.”

The Central Limit Theorem shows how the mean of a sample distribution approaches the normal distribution as the size of the sample gets larger.
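The theorem can be illustrated by simulation: draw repeated samples from a uniform distribution (which is far from normal) and observe that the sample means concentrate around the population mean. A sketch (the sample sizes and seed are arbitrary choices):

```python
import random
import statistics

random.seed(42)

# 2000 samples of size 50 from Uniform(0, 1)
sample_means = [
    statistics.mean(random.uniform(0, 1) for _ in range(50))
    for _ in range(2000)
]

# The sample means cluster around the population mean 0.5,
# with spread close to sigma / sqrt(n) = (1 / sqrt(12)) / sqrt(50)
center = statistics.mean(sample_means)
spread = statistics.stdev(sample_means)
```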
Algebra with Gaussians
The Gauss elimination method is used to solve a system of linear equations.
Gaussian elimination is the name of the method to perform the three types of matrix row operations:
- Interchanging two rows
- Multiplying a row by a constant (any constant other than 0)
- Adding a multiple of one row to another row

This technique is also called row reduction, and it consists of two stages:
- Forward elimination
- Back substitution

The forward elimination step refers to the row reduction needed to simplify the matrix.
The back substitution step refers to substituting the values back to solve the equations.
Example
If we were to have a system of linear equations containing three equations for three unknowns:

Row reducing (applying the Gaussian elimination method to) the augmented matrix yields the solution.
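The two stages above (forward elimination, then back substitution) can be sketched in pure Python; the 3×3 system used in the test is an illustrative example, since the slide's own system is shown only as an image:

```python
def gauss_solve(a, b):
    # Solve A x = b by forward elimination followed by back substitution.
    n = len(a)
    # Build the augmented matrix [A | b]
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    # Forward elimination: reduce to upper-triangular form
    for col in range(n):
        # Partial pivoting: swap in the row with the largest pivot
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution: solve from the last row upward
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        x[row] = (m[row][n]
                  - sum(m[row][k] * x[k] for k in range(row + 1, n))) / m[row][row]
    return x
```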
Markowitz Portfolio Optimization
Terminologies
▪ Portfolio --- a collection of investments.
▪ Expected risk --- the total amount of money that can be lost.
▪ Expected return --- future income from invested capital.
▪ Portfolio effect --- a portfolio that will reduce the total risk of investment.
▪ Portfolio manager --- the manager of the investment portfolio.
▪ Efficient portfolio --- provides the lowest risk for a given expected return.
Markowitz Portfolio Optimization - Approach
▪ According to the theory, the effects of one security purchase over the effects of another security purchase are taken into consideration.
▪ The results are evaluated and are helpful for risk minimization.
Example
Security | Expected Return Ri % | Proportion Xi %
1 | 10 | 25
2 | 20 | 75

The return on the portfolio on combining the two securities will be:
Rp = R1X1 + R2X2
Rp = 0.10(0.25) + 0.20(0.75)
Rp = 17.5%
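The portfolio-return formula Rp = R1X1 + R2X2 generalizes to any number of securities; a minimal sketch reproducing the example:

```python
def portfolio_return(returns, weights):
    # Rp = sum of Ri * Xi over the securities in the portfolio
    return sum(r * w for r, w in zip(returns, weights))
```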
Advantages
▪ It is believed that holding multiple securities is less risky than having only one investment in a person's portfolio.
▪ When multiple stocks are taken in a portfolio and they have negative correlation, risk can be completely reduced because the gain on one can offset the loss on the other.
▪ The effect of multiple securities can also be studied when one security is more risky compared to the other security.
Standardizing x and y Coordinates for Linear Regression
• Standardize the set of coordinates that deviate from the normal range of values.
• Standardization results in the mean of all the coordinates becoming zero, with a unit standard deviation.

• Mean = 0
• SD = 1
Example 1
Coefficient of correlation formula
Regression equation
Example 2
Standardization Simplifies Linear Regression
To simplify fitting the equation of the line, the Residual Sum of Squares (SSR) should be minimized.
Residual Sum of Squares

SSR - Residual Sum of Squares

SSR = ∑(yi − ŷi)²
ŷ = mx + c
where m = slope and c = intercept
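The SSR above can be computed directly for a candidate line (m, c); names and sample data are illustrative:

```python
def ssr(xs, ys, m, c):
    # Residual Sum of Squares: squared gaps between actual y
    # and the line's prediction y_hat = m*x + c
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
```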
Modeling Error in Linear Regression
▪ The coefficient of determination, or R², is a measure that provides information about the goodness of fit of a model.
▪ In the context of regression, it is a statistical measure of how well the regression line approximates the actual data.
▪ It is important when a statistical model is used either to predict future outcomes or in the testing of hypotheses.
R² Measure (coefficient of determination)
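R² compares the residual sum of squares to the total variation around the mean; a minimal sketch:

```python
def r_squared(ys, preds):
    # R^2 = 1 - SSR / SST; 1 means a perfect fit, 0 means the model
    # does no better than predicting the mean
    mean_y = sum(ys) / len(ys)
    sst = sum((y - mean_y) ** 2 for y in ys)
    ssr = sum((y - p) ** 2 for y, p in zip(ys, preds))
    return 1 - ssr / sst
```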
Information Gain from Linear Regression
▪ Information gain is calculated by comparing the entropy of the dataset before and after a transformation.
▪ The entropy of a random variable Y can be represented as H(Y), which tells us about the uncertainty of the random variable.
▪ Information gain provides a way to use entropy to calculate how a change to the dataset impacts the purity of the dataset.
▪ For example, we may wish to evaluate the impact on purity of splitting a dataset S by a random variable with a range of values; then
▪ IG(Y, X) = H(Y) – H(Y | X)
▪ IG(Y, X) is the information gain for the dataset Y for the variable X,
▪ H(Y) is the entropy for the dataset before any change, and
▪ H(Y | X) is the conditional entropy for the dataset given the variable X.
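Entropy and IG(Y, X) = H(Y) − H(Y | X) can be sketched for a categorical target; the split data in the test is illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    # H(Y) = -sum over classes of p * log2(p)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    # IG(Y, X) = H(Y) - H(Y | X); groups is the partition of labels
    # induced by the values of X
    n = len(labels)
    h_y_given_x = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - h_y_given_x
```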
Thank You
