Unit 2 Data Analytics (1)
Data analysis refers to the process of inspecting, cleaning, transforming, and modeling
data with the goal of discovering useful information, drawing conclusions, and
supporting decision-making. It is a critical component of many fields, including
business, finance, healthcare, engineering, and the social sciences. The data analysis
process typically involves the following steps:
1. Data collection: This step involves gathering data from various sources, such as
databases, surveys, sensors, and social media.
2. Data cleaning: This step involves removing errors, inconsistencies, and outliers
from the data. It may also involve imputing missing values, transforming
variables, and normalizing the data.
3. Data exploration: This step involves visualizing and summarizing the data to
gain insights and identify patterns. This may include statistical analyses, such as
descriptive statistics, correlation analysis, and hypothesis testing.
4. Data modeling: This step involves developing mathematical models to predict
or explain the behavior of the data. This may include regression analysis, time
series analysis, machine learning, and other techniques.
5. Data visualization: This step involves creating visual representations of the data
to communicate insights and findings to stakeholders. This may include charts,
graphs, tables, and other visualizations.
6. Decision-making: This step involves using the results of the data analysis to
make informed decisions, develop strategies, and take actions.
Data analysis is a complex and iterative process that requires expertise in statistics,
programming, and domain knowledge. It is often performed using specialized
software, such as R, Python, SAS, and Excel, as well as cloud-based platforms, such as
Amazon Web Services and Google Cloud Platform. Effective data analysis can lead to
better business outcomes, improved healthcare outcomes, and a deeper understanding
of complex phenomena.
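As an illustration of the cleaning and exploration steps, the following Python sketch uses pandas on a hypothetical file sales.csv with made-up column names (price, region); it is a minimal example of the workflow, not a prescribed procedure.

import pandas as pd

# Data collection: load data gathered from some source (hypothetical file name).
df = pd.read_csv("sales.csv")

# Data cleaning: drop duplicate rows, impute missing prices with the median,
# and remove extreme outliers more than three standard deviations from the mean.
df = df.drop_duplicates()
df["price"] = df["price"].fillna(df["price"].median())
df = df[(df["price"] - df["price"].mean()).abs() <= 3 * df["price"].std()]

# Data exploration: descriptive statistics and correlations between numeric columns.
print(df.describe())
print(df.corr(numeric_only=True))

# Data visualization: average price per region as a simple bar chart (requires matplotlib).
df.groupby("region")["price"].mean().plot(kind="bar")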
Regression Modeling
Regression modeling is a statistical technique used to examine the relationship
between a dependent variable (also called the outcome or response variable) and one
or more independent variables (also called predictors or explanatory variables). The
goal of regression modeling is to identify the nature and strength of the relationship
between the dependent variable and the independent variable(s) and to use this
information to make predictions about the dependent variable.
There are many different types of regression models, including linear regression,
logistic regression, polynomial regression, and multivariate regression. Linear
regression is one of the most commonly used types of regression modeling, and it
assumes that the relationship between the dependent variable and the independent
variable(s) is linear.
Regression modeling is used in a wide range of fields, including economics, finance,
psychology, and epidemiology, among others. It is often used to understand the
relationships between different factors and to make predictions about future outcomes.
Regression
Linear Regression
In statistics, linear regression is a linear approach to modeling the relationship between
a scalar response (or dependent variable) and one or more explanatory variables (or
independent variables). The case of one explanatory variable is called simple linear
regression.
Linear regression is used to predict a continuous dependent variable from a given set of independent variables.
Linear regression is used for solving regression problems.
In linear regression, the value of a continuous variable is predicted.
Linear regression tries to find the best-fit line, through which the output can be easily predicted.
The least squares method is used to estimate the model parameters.
The output of linear regression must be a continuous value, such as price, age, etc.
Linear regression requires the relationship between the dependent variable and the independent variables to be linear.
In linear regression, there may be collinearity between the independent variables.
Some regression examples:
Regression analysis is used in statistics to find trends in data. For example, you might guess that there is a connection between how much you eat and how much you weigh; regression analysis can help you quantify that.
Regression analysis will provide you with an equation for a graph so that you can make predictions about your data. For example, if you have been putting on weight over the last few years, it can predict how much you will weigh in ten years' time if you continue to put on weight at the same rate.
Regression with a single explanatory variable is also called simple linear regression; it establishes the relationship between two variables using a straight line. If two or more explanatory variables have a linear relationship with the dependent variable, the regression is called multiple linear regression.
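The following Python sketch shows simple linear regression fitted with the least squares formulas; the tiny eating-versus-weight data set is made up purely for illustration.

import numpy as np

# Made-up data: food intake (x, arbitrary units) versus body weight (y, kg).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 55.0, 59.0, 60.0, 64.0])

# Least squares estimates for the best-fit line y = b0 + b1 * x.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
print("predicted weight at x = 6:", b0 + b1 * 6)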
Logistic Regression
Logistic regression is used to solve classification problems, where a given element must be assigned to one of N categories. Typical examples include classifying an email as spam or not spam, or determining which category a vehicle belongs to (car, truck, van, etc.). In other words, the output is a finite set of discrete values.
Logistic regression is used to predict a categorical dependent variable from a given set of independent variables.
Logistic regression is used for solving classification problems.
In logistic regression, we predict the values of categorical variables.
In logistic regression, we find the S-curve by which we can classify the samples.
The maximum likelihood estimation method is used to estimate the model parameters.
The output of logistic regression must be a categorical value, such as 0 or 1, Yes or No, etc.
In logistic regression, a linear relationship between the dependent and independent variables is not required.
In logistic regression, there should not be collinearity between the independent variables.
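A minimal logistic regression sketch using scikit-learn's LogisticRegression on a made-up spam example (the feature and labels are invented for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: feature = count of suspicious words, label = spam (1) or not spam (0).
X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fit the S-shaped (sigmoid) model by maximum likelihood estimation.
model = LogisticRegression()
model.fit(X, y)

print(model.predict([[1], [4]]))        # predicted categories (0 or 1)
print(model.predict_proba([[1], [4]]))  # class probabilities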
Multivariate Analysis
In Bayesian learning, the maximum a posteriori (MAP) hypothesis and the maximum likelihood (ML) hypothesis over a hypothesis space H, given data D, are

$$h_{MAP} = \arg\max_{h \in H} p(h \mid D) = \arg\max_{h \in H} \frac{p(D \mid h)\, p(h)}{p(D)} = \arg\max_{h \in H} p(D \mid h)\, p(h)$$

$$h_{ML} = \arg\max_{h \in H} p(D \mid h)$$
Overall, Bayesian modeling is a powerful tool for making predictions and estimating
parameters in situations where there is uncertainty and prior information is available.
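As a toy illustration of the MAP and ML hypotheses defined above, the following sketch picks the best hypothesis from a small discrete hypothesis space; the priors and likelihoods are made-up numbers.

# Hypothetical priors p(h) and likelihoods p(D | h) for three hypotheses.
priors = {"h1": 0.6, "h2": 0.3, "h3": 0.1}
likelihoods = {"h1": 0.2, "h2": 0.5, "h3": 0.9}

h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])  # argmax p(D|h) p(h)
h_ml = max(likelihoods, key=likelihoods.get)                   # argmax p(D|h)

print("MAP hypothesis:", h_map)  # h2 (0.5 * 0.3 = 0.15 is the largest posterior score)
print("ML hypothesis:", h_ml)    # h3 (largest likelihood)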
Support vector machines (SVMs) and kernel methods are commonly used in machine
learning and pattern recognition to solve classification and regression problems.
SVMs are a type of supervised learning algorithm that aims to find the optimal
hyperplane that separates the data into different classes. The optimal hyperplane is the
one that maximizes the margin, or the distance between the hyperplane and the closest
data points from each class. SVMs can also use kernel functions to transform the
original input data into a higher dimensional space, where it may be easier to find a
separating hyperplane.
Kernel methods are a class of algorithms that use kernel functions to compute the
similarity between pairs of data points. Kernel functions can transform the input data
into a higher dimensional feature space, where linear methods can be applied more
effectively. Some commonly used kernel functions include linear, polynomial, and
radial basis functions.
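A minimal sketch of an SVM with an RBF kernel using scikit-learn; the six 2-D points and their labels are made up for illustration.

import numpy as np
from sklearn.svm import SVC

# Made-up 2-D points belonging to two classes.
X = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]])
y = np.array([0, 0, 0, 1, 1, 1])

# The RBF kernel implicitly maps the points into a higher-dimensional space,
# where the SVM finds the maximum-margin separating hyperplane.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))   # expected: [0 1]
print("support vectors:")
print(clf.support_vectors_)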
Working of Biological NN
If the cumulative inputs received by the soma raise the internal electrical potential of the cell, known as the cell membrane potential, beyond a threshold, the neuron 'fires' by propagating the action potential down the axon to excite other neurons. The axon terminates in a specialized contact called a synapse. The synapse is a minute gap at the end of the dendrite link that contains a neurotransmitter fluid. It is responsible for accelerating or retarding the electrical charges to the soma. In general, a single neuron can have many synaptic inputs and synaptic outputs. The size of the synapses is believed to be related to learning: synapses with a larger area are excitatory, while those with a smaller area are inhibitory.
Artificial Neuron and Its Model
An artificial neural network (ANN) is an efficient information processing system which resembles the biological neural network in its characteristics. An ANN possesses a large number of highly interconnected processing elements called nodes, units, or neurons, which usually operate in parallel and are configured in regular architectures.
Neurons are connected to each other by connection links. Each connection link is associated with a weight. This link carries information about the input signal, which is used by the neuron to solve a particular problem.
An ANN's collective behavior is characterized by its ability to learn, recall, and generalize patterns or data, similar to the human brain, which gives it the capability to model networks of biological neurons as found in the brain.
To understand the basic operation of a neural net, consider a simple net with two input neurons and one output neuron.
Neurons X1 and X2 transmit signals to neuron Y (the output neuron). The inputs are connected to the output neuron Y over interconnection links with weights W1 and W2.
Here, x1 and x2 are the inputs, and w1 and w2 are the weights attached to the input links. The weights are thus multiplicative factors on the inputs that account for the strength of the synapse.
Net input: $y_{in} = x_1 w_1 + x_2 w_2$
To generate the final output y, the net input is passed to an activation function (f), also called a transfer or squashing function, which releases the output. Hence
$y = f(y_{in})$
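A minimal sketch of this two-input neuron in Python, with made-up inputs, weights, and a simple threshold (step) activation function:

def step(v, threshold=0.5):
    # Activation (squashing) function: output 1 if the net input reaches the threshold.
    return 1 if v >= threshold else 0

# Inputs and the weights attached to the interconnection links (made-up values).
x1, x2 = 1, 0
w1, w2 = 0.6, 0.4

y_in = x1 * w1 + x2 * w2   # net input y_in = x1*w1 + x2*w2
y = step(y_in)             # final output y = f(y_in)
print(y_in, y)             # 0.6 1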
Competitive Learning
Competitive learning is a type of machine learning technique in which a set of neurons
compete to be activated by input data. The neurons are organized into a layer, and each
neuron receives the same input data. However, only one neuron is activated, and the
competition is based on a set of rules that determine which neuron is activated.
The competition in competitive learning is typically based on a measure of similarity
between the input data and the weights of each neuron. The neuron with the highest
similarity to the input data is activated, and the weights of that neuron are updated to
become more similar to the input data. This process is repeated for multiple iterations,
and over time, the neurons learn to become specialized in recognizing different types
of input data.
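A minimal winner-take-all sketch of this update rule in NumPy; the two synthetic clusters, the number of competing neurons, and the learning rate are all made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic input data drawn from two clusters centred at (0, 0) and (5, 5).
data = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
                  rng.normal([5, 5], 1.0, size=(100, 2))])
rng.shuffle(data)

weights = rng.normal(size=(2, 2))   # one weight vector per competing neuron
lr = 0.1                            # learning rate

for x in data:
    # Similarity measured by distance: the closest neuron wins the competition.
    winner = np.argmin(np.linalg.norm(weights - x, axis=1))
    # Only the winner's weights are moved towards the input.
    weights[winner] += lr * (x - weights[winner])

print(weights)   # each row should end up near one cluster centre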
Competitive learning is often used for unsupervised learning tasks, such as clustering
or feature extraction. In clustering, the neurons learn to group similar input data into
clusters, while in feature extraction, the neurons learn to recognize specific features in
the input data.
One of the advantages of competitive learning is that it can be used to discover hidden
structures and patterns in data without the need for labeled data. This makes it
particularly useful for applications such as image and speech recognition, where
labeled data can be difficult and expensive to obtain.
Overall, competitive learning is a powerful machine learning technique that can be
used for a variety of unsupervised learning tasks. It involves a set of neurons that
compete to be activated by input data, and over time, the neurons learn to become
specialized in recognizing different types of input data.
Principal Component Analysis and Neural Networks
Principal component analysis (PCA) and neural networks are both machine learning
techniques that can be used for a variety of tasks, including data compression, feature
extraction, and dimensionality reduction.
PCA is a linear technique that involves finding the principal components of a dataset,
which are the directions of greatest variance. The principal components can be used to
reduce the dimensionality of the data, while preserving as much of the original
variance as possible.
Neural networks, on the other hand, are nonlinear techniques that involve multiple
layers of interconnected neurons. Neural networks can be used for a variety of tasks,
including classification, regression, and clustering. They can also be used for feature
extraction, where the network learns to identify the most important features of the
input data.
PCA and neural networks can be used together for a variety of tasks. For example,
PCA can be used to reduce the dimensionality of the data before feeding it into a
neural network. This can help to improve the performance of the network by reducing
the amount of noise and irrelevant information in the input data. Neural networks can
also be used to improve the performance of PCA. In some cases, PCA can be limited
by its linear nature, and may not be able to capture complex nonlinear relationships in
the data. By combining PCA with a neural network, the network can learn to capture
these nonlinear relationships and improve the accuracy of the PCA results.
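A minimal sketch of this combination using scikit-learn, where PCA reduces the dimensionality of the built-in digits data set before a small neural network classifies it; the number of components and hidden units are arbitrary choices for illustration.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Reduce the 64-pixel digit images to 20 principal components, then classify.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(
    PCA(n_components=20),
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0),
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))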
Overall, PCA and neural networks are both powerful machine learning techniques that
can be used for a variety of tasks. When used together, they can improve the
performance and accuracy of each technique and help to solve more complex
problems.
Dimension Reduction
In pattern recognition, dimension reduction is defined as:
It is a process of converting a data set having vast dimensions into a data set with fewer dimensions.
It ensures that the converted data set conveys similar information concisely.
Benefits-
Dimension reduction offers several benefits such as
It compresses the data and thus reduces the storage space requirements.
It reduces the time required for computation, since fewer dimensions require less computation.
It eliminates the redundant features.
It improves the model performance.
Dimension Reduction Techniques-
The two popular and well-known dimension reduction techniques are-
1. Principal Component Analysis (PCA)
2. Fisher Linear Discriminant Analysis (LDA)
In this section, we will discuss Principal Component Analysis.
Principal Component Analysis-
Consider the two-dimensional data set
x1 = (2, 1), x2 = (3, 5), x3 = (4, 3), x4 = (5, 6), x5 = (6, 7), x6 = (7, 8)
whose mean vector is µ = (4.5, 5).

The covariance matrix is

$$C = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)(x_i - \mu)^T$$

Computing each term $m_i = (x_i - \mu)(x_i - \mu)^T$:

$$m_1 = \begin{pmatrix} -2.5 \\ -4 \end{pmatrix}\begin{pmatrix} -2.5 & -4 \end{pmatrix} = \begin{pmatrix} 6.25 & 10 \\ 10 & 16 \end{pmatrix}, \qquad
m_2 = \begin{pmatrix} -1.5 \\ 0 \end{pmatrix}\begin{pmatrix} -1.5 & 0 \end{pmatrix} = \begin{pmatrix} 2.25 & 0 \\ 0 & 0 \end{pmatrix}$$

$$m_3 = \begin{pmatrix} -0.5 \\ -2 \end{pmatrix}\begin{pmatrix} -0.5 & -2 \end{pmatrix} = \begin{pmatrix} 0.25 & 1 \\ 1 & 4 \end{pmatrix}, \qquad
m_4 = \begin{pmatrix} 0.5 \\ 1 \end{pmatrix}\begin{pmatrix} 0.5 & 1 \end{pmatrix} = \begin{pmatrix} 0.25 & 0.5 \\ 0.5 & 1 \end{pmatrix}$$

$$m_5 = \begin{pmatrix} 1.5 \\ 2 \end{pmatrix}\begin{pmatrix} 1.5 & 2 \end{pmatrix} = \begin{pmatrix} 2.25 & 3 \\ 3 & 4 \end{pmatrix}, \qquad
m_6 = \begin{pmatrix} 2.5 \\ 3 \end{pmatrix}\begin{pmatrix} 2.5 & 3 \end{pmatrix} = \begin{pmatrix} 6.25 & 7.5 \\ 7.5 & 9 \end{pmatrix}$$

Covariance Matrix = (m1 + m2 + m3 + m4 + m5 + m6) / 6

On adding the above matrices and dividing by 6, we get

$$C = \frac{1}{6}\begin{pmatrix} 17.5 & 22 \\ 22 & 34 \end{pmatrix} = \begin{pmatrix} 2.92 & 3.67 \\ 3.67 & 5.67 \end{pmatrix}$$
Step-05: Calculate the eigenvalues and eigenvectors of the covariance matrix. λ is an eigenvalue of a matrix M if it is a solution of the characteristic equation |M – λI| = 0.
So, we have

$$\begin{vmatrix} 2.92 - \lambda & 3.67 \\ 3.67 & 5.67 - \lambda \end{vmatrix} = 0$$

From here,

(2.92 – λ)(5.67 – λ) – (3.67 × 3.67) = 0
16.56 – 2.92λ – 5.67λ + λ² – 13.47 = 0
λ² – 8.59λ + 3.09 = 0

Solving this quadratic equation, we get λ = 8.22, 0.38.
Thus, the two eigenvalues are λ1 = 8.22 and λ2 = 0.38.
Clearly, the second eigenvalue is very small compared to the first eigenvalue, so the second eigenvector can be left out.
The eigenvector corresponding to the greatest eigenvalue is the principal component for the given data set, so we find the eigenvector corresponding to eigenvalue λ1.
We use the following equation to find the eigenvector: MX = λX
Where-
M = Covariance Matrix
X = Eigenvector
λ = Eigenvalue
Substituting the values in the above equation, we get

$$\begin{pmatrix} 2.92 & 3.67 \\ 3.67 & 5.67 \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = 8.22\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$$

From the first row, 2.92 X1 + 3.67 X2 = 8.22 X1, so 3.67 X2 = 5.30 X1 and X2 ≈ 1.44 X1. Any vector in this direction, for example (X1, X2) = (2.55, 3.67), can be taken as the principal component of the given data set.
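The same computation can be checked numerically with NumPy, using the six data points from the worked example above:

import numpy as np

X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)
mu = X.mean(axis=0)                    # mean vector (4.5, 5.0)

# Covariance matrix with the 1/n convention used above
# (note that np.cov uses 1/(n-1) by default).
C = (X - mu).T @ (X - mu) / len(X)
print(C)                               # approximately [[2.92 3.67] [3.67 5.67]]

eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
print(eigvals)                         # approximately [0.38 8.22]
print(eigvecs[:, -1])                  # principal component direction, roughly (0.57, 0.82)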
Extracting Fuzzy Models from Data
Fuzzy logic is a type of logic that allows for degrees of truth, rather than just true or false values. It is often used in machine learning to extract fuzzy models from data.
A fuzzy model is a model that uses fuzzy logic to make predictions or decisions based on uncertain or incomplete data. Fuzzy models are particularly useful in situations where traditional models may not work well, such as when the data is noisy or when there is a lot of uncertainty or ambiguity in the data.
To extract a fuzzy model from data, the first step is to define the input and output variables of the model. The input variables are the features or attributes of the data, while the output variable is the target variable that we want to predict or classify.
Next, we use fuzzy logic to define the membership functions for each input and output variable. The membership functions describe the degree of membership of each data point to each category or class. For example, a data point may have a high degree of membership to the category "low", but a low degree of membership to the category "high".
Once the membership functions have been defined, we can use fuzzy inference to
make predictions or decisions based on the input data. Fuzzy inference involves using
the membership functions to determine the degree of membership of each data point to
each category or class, and then combining these degrees of membership to make a
prediction or decision.
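A minimal sketch of membership functions and a single fuzzy inference step, using made-up triangular membership functions for a temperature input:

def tri(x, a, b, c):
    # Triangular membership function rising from a to b and falling from b to c.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Made-up membership functions for the categories "low" and "high" temperature.
def low(t):
    return tri(t, -10, 0, 20)

def high(t):
    return tri(t, 10, 30, 50)

t = 15
print("membership in 'low':", low(t))    # 0.25
print("membership in 'high':", high(t))  # 0.25

# A tiny inference step: the output "fan speed fast" fires to the degree that
# the input belongs to "high".
print("fan speed fast to degree", high(t))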
Overall, extracting fuzzy models from data involves using fuzzy logic to define the
membership functions for each input and output variable, and then using fuzzy
inference to make predictions or decisions based on the input data. Fuzzy models are
particularly useful in situations where traditional models may not work well, and can
help to improve the accuracy and robustness of machine learning models.
Fuzzy Decision Trees
Fuzzy decision trees are a type of decision tree that use fuzzy logic to make decisions
based on uncertain or imprecise data. Decision trees are a type of supervised learning
technique that involves recursively partitioning the input space into regions that
correspond to different classes or categories.
Fuzzy decision trees extend traditional decision trees by allowing for degrees of
membership to each category or class, rather than just a binary classification. This is
particularly useful in situations where the data is uncertain or imprecise, and where a
single, crisp classification may not be appropriate.
To build a fuzzy decision tree, we start with a set of training data that consists of input-
output pairs. We then use fuzzy logic to determine the degree of membership of each
data point to each category or class. This is done by defining the membership functions
for each input and output variable, and using these to compute the degree of
membership of each data point to each category or class.
Next, we use the fuzzy membership values to construct a fuzzy decision tree. The tree
consists of a set of nodes and edges, where each node represents a test on one of the
input variables, and each edge represents a decision based on the result of the test. The
degree of membership of each data point to each category or class is used to determine
the probability of reaching each leaf node of the tree.
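A highly simplified sketch of inference in such a tree, with one fuzzy split on a made-up temperature variable and made-up leaf class distributions; a real fuzzy decision tree would learn these from data.

def high_temp(t):
    # Fuzzy membership of "temperature is high" (a made-up linear ramp from 20 to 30).
    return min(1.0, max(0.0, (t - 20) / 10))

# Made-up class distributions at the two leaves of a single fuzzy split.
leaf_if_high = {"on": 0.9, "off": 0.1}
leaf_if_not = {"on": 0.2, "off": 0.8}

def fuzzy_tree_predict(t):
    mu = high_temp(t)   # degree to which the sample follows the "high" branch
    # The sample reaches both leaves, weighted by its membership degrees.
    return {c: mu * leaf_if_high[c] + (1 - mu) * leaf_if_not[c] for c in leaf_if_high}

print(fuzzy_tree_predict(23))   # {'on': 0.41, 'off': 0.59} (approximately)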
Fuzzy decision trees can be used for a variety of tasks, including classification,
regression, and clustering. They are particularly useful in situations where the data is
uncertain or imprecise, and where traditional decision trees may not work well.
Overall, fuzzy decision trees are a powerful machine learning technique that can be
used to make decisions based on uncertain or imprecise data. They extend traditional
decision trees by allowing for degrees of membership to each category or class, and
can help to improve the accuracy and robustness of machine learning models.
Stochastic Search Methods
Stochastic search methods are a class of optimization algorithms that use probabilistic
techniques to search for the optimal solution in a large search space. These methods
are commonly used in machine learning to find the best set of parameters for a model,
such as the weights in a neural network or the parameters in a regression model.
Stochastic search methods are often used when the search space is too large to
exhaustively search all possible solutions, or when the objective function is highly
nonlinear and has many local optima. The basic idea behind these methods is to
explore the search space by randomly sampling solutions and using probabilistic
techniques to move towards better solutions.
One common stochastic search method is called the stochastic gradient descent (SGD)
algorithm. In this method, the objective function is optimized by iteratively updating
the parameters in the direction of the negative gradient of the objective function. The update rule includes a learning rate, which controls the step size of each update. SGD is widely used in training neural networks and other deep learning
models.
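A minimal sketch of SGD fitting a straight line to noisy synthetic data; the data, learning rate, and number of epochs are made-up choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, 200)   # noisy samples of the line y = 3x + 1

w, b = 0.0, 0.0   # parameters to be learned
lr = 0.1          # learning rate (controls the step size)

for epoch in range(50):
    for xi, yi in zip(x, y):
        err = (w * xi + b) - yi   # prediction error for one sample
        # Move the parameters in the direction of the negative gradient
        # of the squared error for this single sample.
        w -= lr * err * xi
        b -= lr * err

print(w, b)   # should approach the true values 3 and 1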
Another stochastic search method is called simulated annealing. This method is based
on the physical process of annealing, which involves heating and cooling a material to
improve its properties. In simulated annealing, the search process starts with a high
temperature and gradually cools down over time. At each iteration, the algorithm
randomly selects a new solution and computes its fitness. If the new solution is better
than the current solution, it is accepted. However, if the new solution is worse, it may
still be accepted with a certain probability that decreases as the temperature decreases.
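A minimal simulated annealing sketch minimizing a simple one-dimensional function with several local minima; the objective, cooling schedule, and neighbourhood size are made up for illustration.

import math
import random

def f(x):
    # Objective with several local minima; its global minimum is near x = -1.3.
    return x * x + 10 * math.sin(x)

random.seed(0)
x = 5.0        # current solution
T = 10.0       # initial temperature
for step in range(5000):
    candidate = x + random.uniform(-1, 1)   # randomly selected neighbouring solution
    delta = f(candidate) - f(x)
    # Better solutions are always accepted; worse ones are accepted with a
    # probability exp(-delta / T) that shrinks as the temperature decreases.
    if delta < 0 or random.random() < math.exp(-delta / T):
        x = candidate
    T *= 0.999                              # gradually cool down

print(x, f(x))   # should be close to the global minimum near x = -1.3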
Other stochastic search methods include evolutionary algorithms, such as genetic
algorithms and particle swarm optimization, which mimic the process of natural
selection and evolution to search for the optimal solution.
Overall, stochastic search methods are powerful optimization techniques that are
widely used in machine learning and other fields. These methods allow us to
efficiently search large search spaces and find optimal solutions in the presence of
noise, uncertainty, and nonlinearity.