Unit III Poriyan Notes
Syllabus
Correlation - Scatter plots - correlation coefficient for quantitative data - computational formula
for correlation coefficient - Regression - regression line - least squares regression line - Standard
error of estimate - interpretation of R2 - multiple regression equations - regression towards the
mean.
Correlation
• When one measurement is made on each observation, univariate analysis is applied. If more than one measurement is made on each observation, multivariate analysis is applied. Here we focus on bivariate analysis, where exactly two measurements are made on each observation.
• The two measurements will be called X and Y. Since X and Y are obtained for each
observation, the data for one observation is the pair (X, Y).
• Some examples:
1. Height (X) and weight (Y) are measured for each individual in a sample.
2. Stock market valuation (X) and quarterly corporate earnings (Y) are recorded for each
company in a sample.
3. A cell culture is treated with varying concentrations of a drug; the growth rate (X) and drug concentration (Y) are recorded for each trial.
4. Temperature (X) and precipitation (Y) are measured on a given day at a set of weather
stations.
• There is a difference between bivariate data and two-sample data. In two-sample data, the X and Y values are not paired, and there need not be the same number of X and Y values.
• Correlation refers to a relationship between two or more objects. In statistics, the word
correlation refers to the relationship between two variables. Correlation exists between two
variables when one of them is related to the other in some way.
• Examples: One variable might be the number of hunters in a region and the other variable
could be the deer population. Perhaps as the number of hunters increases, the deer population
decreases. This is an example of a negative correlation: As one variable increases, the other
decreases.
• A positive correlation is where the two variables react in the same way, increasing or decreasing together. Temperature in Celsius and temperature in Fahrenheit have a positive correlation.
• The term "correlation" refers to a measure of the strength of association between two variables.
• The correlation coefficient r is a function of the data, so it really should be called the sample correlation coefficient. The (sample) correlation coefficient r estimates the population correlation coefficient ρ.
• If either the X or the Y values are constant (i.e. all have the same value), then one of the sample standard deviations is zero and therefore the correlation coefficient is not defined.
Types of Correlation
• Positive correlation : Association between variables such that high scores on one variable tend to go with high scores on the other variable. A direct relation between the variables.
• Negative correlation : Association between variables such that high scores on one variable tend to go with low scores on the other variable. An inverse relation between the variables.
• Simple correlation : When the study involves only two variables, the relationship is described as simple correlation.
• Partial correlation : Analysis recognizes more than two variables but considers only two
variables keeping the other constant. Example: Price and demand, eliminating the supply side.
• Total correlation is based on all the relevant variables, which is normally not feasible. In total
correlation, all the facts are taken into account.
• Linear correlation : Correlation is said to be linear when the amount of change in one variable
tends to bear a constant ratio to the amount of change in the other. The graph of the variables
having a linear relationship will form a straight line.
• Nonlinear correlation : The correlation is said to be nonlinear if the amount of change in one variable does not bear a constant ratio to the amount of change in the other variable.
Classification of correlation
1. Graphic methods
2. Mathematical methods.
• Graphic methods contain two sub methods: Scatter diagram and simple graph.
Coefficient of Correlation
Correlation : The degree of relationship between the variables under consideration is measured through correlation analysis.
• The measure of correlation is called the correlation coefficient. The degree of relationship is expressed by a coefficient which ranges from -1 to +1 (-1 ≤ r ≤ +1). The direction of change is indicated by the sign of the coefficient.
• The correlation analysis enables us to have an idea about the degree and direction of the
relationship between the two variables under study.
• Correlation is a statistical tool that helps to measure and analyze the degree of relationship
between two variables. Correlation analysis deals with the association between two or more
variables.
• Correlation denotes interdependency among the variables. For two phenomena to be correlated, it is essential that they have a cause-and-effect relationship; if such a relationship does not exist, the two phenomena cannot be correlated.
• If two variables vary in such a way that movements in one are accompanied by movements in the other, the variables are said to have a cause-and-effect relationship.
Properties of Correlation
1. Positive r indicates positive association between the variables and negative r indicates negative association.
2. The correlation coefficient measures clustering about a line, but only relative to the SDs.
3. Correlation measures association. But association does not necessarily show causation.
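• As an illustration of these ideas, the following Python sketch computes the correlation coefficient r for a small made-up data set (the values are hypothetical, not taken from the textbook tables):

```python
import math

# Hypothetical paired data (x = age in years, y = weight in kg);
# these values are made up for illustration only.
x = [7, 6, 8, 5, 6, 9]
y = [12, 8, 12, 10, 11, 13]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Computational formula:
# r = (n*Sxy - Sx*Sy) / sqrt((n*Sx2 - Sx^2) * (n*Sy2 - Sy^2))
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(f"r = {r:.4f}")  # always a value between -1 and +1
```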
Example 3.1.1: A sample of 6 children was selected, data about their age in years and
weight in kilograms was recorded as shown in the following table. It is required to find the
correlation between age and weight.
Solution :
• Because the relationship between two sets of data is seldom perfect, the majority of correlation coefficients are fractions (0.92, -0.80 and the like).
• If r = +1, then the correlation between the two variables is said to be perfect and positive.
• If r = -1, then the correlation between the two variables is said to be perfect and negative.
Example 3.1.2: A sample of 12 fathers and their elder sons gave the following data about
their heights in inches. Calculate the coefficient of rank correlation.
Solution:
Example 3.1.3: Calculate coefficient of correlation between age of cars and annual
maintenance and comment.
Solution : (The data table and intermediate sums are not reproduced here.)
Mean = 12600 / 7 = 1800
r = 3700 / 4427.188 = 0.8357
r = 46 / (5.29 × 9.165) = 0.9488
3. EXPLAIN SCATTER PLOT IN DETAIL. NOV / DEC 2023
Scatter Plots
• When two variables x and y have an association (or relationship), we say there exists
a correlation between them. Alternatively, we could say x and y are correlated. To find such an
association, we usually look at a scatterplot and try to find a pattern.
• Scatterplot (or scatter diagram) is a graph in which the paired (x, y) sample data are plotted
with a horizontal x axis and a vertical y axis. Each individual (x, y) pair is plotted as a single
point.
• One variable is called independent (X) and the second is called dependent (Y).
Example:
1. Positive relationship
2. Negative relationship
3. No relationship.
• The scattergram can indicate a positive relationship, a negative relationship or
a zero relationship.
Merits of scatter diagram :
1. It is a simple to implement and attractive method to find out the nature of correlation.
2. It is easy to understand.
3. User will get a rough idea about correlation (positive or negative correlation).
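• A minimal Python sketch of a scatter diagram using matplotlib (the data points are hypothetical):

```python
import matplotlib.pyplot as plt

# Hypothetical paired (x, y) data showing a roughly positive relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 3.8, 4.2, 5.1, 5.8, 7.0, 7.9]

plt.scatter(x, y)  # each (x, y) pair is plotted as a single point
plt.xlabel("X (independent variable)")
plt.ylabel("Y (dependent variable)")
plt.title("Scatter diagram showing a positive relationship")
plt.show()
```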
• The product moment correlation, r, summarizes the strength of association between two
metric (interval or ratio scaled) variables, say X and Y. It is an index used to determine whether a
linear or straight-line relationship exists between X and Y.
• As it was originally proposed by Karl Pearson, it is also known as the Pearson correlation
coefficient. It is also referred to as simple correlation, bivariate correlation or merely the
correlation coefficient.
• The correlation coefficient between two variables will be the same regardless of their
underlying units of measurement.
• It measures the nature and strength of association between two variables of the quantitative type. The sign of r denotes the nature of the association, while the magnitude of r denotes the strength of the association.
• If the sign is positive this means the relation is direct (an increase in one variable is associated
with an increase in the other variable and a decrease in one variable is associated with a decrease
in the other variable).
• If the sign is negative, this means an inverse or indirect relationship (an increase in one variable is associated with a decrease in the other).
• The value of r ranges between (-1) and (+1). The magnitude of r denotes the strength of the association: the closer |r| is to 1, the stronger the association. (The illustrative diagram is not reproduced here.)
• Pearson's r is the most common correlation coefficient. Karl Pearson's coefficient of correlation, denoted by r, measures the degree of linear relationship between two variables, say x and y.
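• For reference (a standard identity, since the notes do not reproduce it), the computational formula for the correlation coefficient is:
r = [n Σxy - (Σx)(Σy)] / √{[n Σx² - (Σx)²] [n Σy² - (Σy)²]}
Equivalently, in deviation form, r = Σ(x - x̄)(y - ȳ) / √[Σ(x - x̄)² Σ(y - ȳ)²].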
Example 3.3.1 : (The data table is not reproduced here.)
n = 10
X = maintenance cost
Y = sales cost
Example 3.3.2 : A random sample of 5 college students is selected and their grades in operating system and software engineering are recorded (the table is not reproduced here). Calculate Pearson's correlation coefficient.
Solution:
Example 3.3.3: Find Karl Pearson's correlation coefficient for the following paired data.
Solution: Let
Example 3.3.4: Find Karl Pearson's correlation coefficient for the following paired data.
Solution:
Regression
• For an input x, if the output is continuous, this is called a regression problem. For example, based on historical information of demand for toothpaste in your supermarket, you are asked to predict the demand for the next month.
• Regression is concerned with the prediction of continuous quantities. Linear regression is the
oldest and most widely used predictive model in the field of machine learning. The goal is to
minimize the sum of the squared errors to fit a straight line to a set of data points.
• It is one of the supervised learning algorithms. A regression model requires the knowledge of
both the dependent and the independent variables in the training data set.
• Simple Linear Regression (SLR) is a statistical model in which there is only one independent variable, and the functional relationship between the dependent variable and the regression coefficients is linear.
• Regression line is the line which gives the best estimate of one variable from the value of any
other given variable.
• The regression line gives the average relationship between the two variables in mathematical
form. For two variables X and Y, there are always two lines of regression.
• Regression line of Y on X : Gives the best estimate for the value of Y for any specific given values of X :
Y = a + bX
where
a = Y-intercept
b = Slope of the line
Y = Dependent variable
X = Independent variable
• By using the least squares method, we are able to construct a best fitting straight line to the
scatter diagram points and then formulate a regression equation in the form of:
ŷ = a + bx
ŷ = ȳ + b(x- x̄)
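• For reference, the standard least squares estimates of the slope and intercept (a well-known result, stated here without derivation) are:
b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and a = ȳ - b x̄
so that the fitted line always passes through the point (x̄, ȳ).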
• Regression analysis is the art and science of fitting straight lines to patterns of data. In a linear
regression model, the variable of interest ("dependent" variable) is predicted from k other
variables ("independent" variables) using a linear equation.
• If Y denotes the dependent variable and X1, ..., Xk are the independent variables, then the assumption is that the value of Y at time t in the data sample is determined by the linear equation:
Yt = β0 + β1X1t + β2X2t + ... + βkXkt + εt
where the betas are constants and the epsilons are independent and identically distributed normal random variables with mean zero.
Regression Line
• The regression line is a way of making a somewhat precise prediction based upon the relationship between two variables. The regression line is placed so that it minimizes the predictive error.
• The regression line does not go through every point; instead it balances the difference between
all data points and the straight-line model. The difference between the observed data value and
the predicted value (the value on the straight line) is the error or residual. The criterion to
determine the line that best describes the relation between two variables is based on the residuals.
• A negative residual indicates that the model is over-predicting. A positive residual indicates
that the model is under-predicting.
Linear Regression
• The simplest form of regression to visualize is linear regression with a single predictor. A linear
regression technique can be used if the relationship between X and Y can be approximated with a
straight line.
• Linear regression with a single predictor can be expressed with the equation:
y = θ2x + θ1 + e
• The regression parameters in simple linear regression are the slope of the line (θ2), which gives the change in y per unit change in x, and the y-intercept (θ1), the point where the line crosses the y-axis (x = 0).
• In this model, Y is a linear function of X : the value of Y increases or decreases in a linear manner as the value of X changes.
Nonlinear Regression:
• Often the relationship between x and y cannot be approximated with a straight line. In this case,
a nonlinear regression technique may be used.
• Alternatively, the data could be preprocessed to make the relationship linear. Fig. 3.4.2 (not reproduced here) shows nonlinear regression.
• If data does not show a linear dependence we can get a more accurate model using a nonlinear
regression model.
• The generalized linear model is the foundation on which linear regression can be extended to model categorical response variables.
Advantages:
a. Training a linear regression model is usually much faster than methods such as neural
networks.
b. Linear regression models are simple and require minimum memory to implement.
c. By examining the magnitude and sign of the regression coefficients you can infer how
predictor variables affect the target outcome.
Limitations:
1. Predictive ability : The linear regression fit often has low bias but high variance. Recall that expected test error is a combination of these two quantities. Prediction accuracy can sometimes be improved by sacrificing a small amount of bias in order to decrease the variance.
• The method of least squares is about estimating parameters by minimizing the squared
discrepancies between observed data, on the one hand and their expected values on the other.
• The Least Squares (LS) criterion states that the sum of the squares of errors is minimum. The
least-squares solutions yield y(x) whose elements sum to 1, but do not ensure the outputs to be in
the range [0, 1].
• How do we draw such a line based on observed data points? Suppose an imaginary line y = a + bx.
• Imagine the vertical distance between the line and a data point, e = Y - E(Y). This error is the deviation of the data point from the imaginary line, the regression line. Then what are the best values of a and b? The a and b that minimize the sum of such errors.
• Raw deviations do not have good properties for computation (positive and negative deviations cancel out), so instead we find the a and b that minimize the sum of squared deviations rather than the sum of deviations. This method is called least squares.
• The least squares method minimizes the sum of squares of errors. Such a and b are called least squares estimators, i.e. estimators of the parameters α and β.
• The process of obtaining parameter estimators (e.g. a and b) is called estimation. This least squares method of estimation is called Ordinary Least Squares (OLS).
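• The following Python sketch carries out the least squares computation just described, using the closed-form estimators for a and b (the data points are hypothetical):

```python
# Least squares fit of y = a + b*x using the closed-form estimators.
# The data points below are hypothetical.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 5.9, 8.2, 9.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# b = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),  a = ybar - b*xbar
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar

print(f"least squares line: y = {a:.3f} + {b:.3f} x")
```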
Example 3.4.1: Fit a straight line to the points in the table. Compute m and b by least
squares.
Standard Error of Estimate
• The standard error of estimate represents a special kind of standard deviation that reflects the magnitude of predictive error. The standard error of estimate, denoted S, tells us approximately how large the prediction errors (residuals) are for our data set, in the same units as Y.
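• For simple linear regression, the standard error of estimate is commonly defined (a standard formula, stated here for reference) as:
S = √[Σ(Y - Ŷ)² / (n - 2)]
where Ŷ is the predicted value and the divisor n - 2 reflects the two estimated parameters (a and b).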
Example 3.4.2: Define linear and nonlinear regression using figures. Calculate the value of
Y for X = 100 based on linear regression prediction method.
Solution
Interpretation of R2
• The following measures are used to validate the simple linear regression models:
1. Coefficient of determination (R-square).
2. Hypothesis test (t-test) for the regression coefficient.
3. Analysis of variance for overall model validity (relevant more for multiple linear regression).
• The primary objective of regression is to explain the variation in Y using the knowledge of X. The coefficient of determination (R-square) measures the percentage of variation in Y explained by the model (β0 + β1X).
Characteristics of R-square :
1. R2 always lies between 0 and 1.
2. If R2 = 1, all of the data points fall perfectly on the regression line. The predictor x accounts for all of the variation in y!
3. If R2 = 0, the estimated regression line is perfectly horizontal. The predictor x accounts for none of the variation in y!
• In general, a high R2 value indicates that the model is a good fit for the data, although
interpretations of fit depend on the context of analysis. An R2 of 0.35, for example, indicates that
35 percent of the variation in the outcome has been explained just by predicting the outcome
using the covariates included in the model.
• That percentage might be a very high portion of variation to predict in a field such as the social
sciences; in other fields, such as the physical sciences, one would expect R2 to be much closer to
100 percent.
• The theoretical minimum R2 is 0. However, since linear regression is based on the best possible
fit, R2 will always be greater than zero, even when the predictor and outcome variables bear no
relationship to one another.
• R2 increases when a new predictor variable is added to the model, even if the new predictor is
not associated with the outcome. To account for that effect, the adjusted R2 incorporates the
same information as the usual R2 but then also penalizes for the number of predictor variables
included in the model.
• As a result, R2 increases as new predictors are added to a multiple linear regression model, but the adjusted R2 increases only if the increase in R2 is greater than one would expect from chance alone. In such a model, the adjusted R2 is the most realistic estimate of the proportion of the variation that is predicted by the covariates included in the model.
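• The following Python sketch computes R2 and adjusted R2 from observed values and model predictions (both hypothetical), using the standard definitions R2 = 1 - SSres/SStot and adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1):

```python
# R-square and adjusted R-square for a fit with k = 1 predictor.
# Hypothetical observed values and model predictions.
y     = [2.0, 4.1, 6.2, 7.9, 10.1]
y_hat = [2.1, 4.0, 6.0, 8.0, 10.0]

n, k = len(y), 1
y_bar = sum(y) / n

ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
ss_tot = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # penalizes extra predictors
print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj_r2:.4f}")
```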
Spurious Regression
• The regression is spurious when we regress one random walk onto another independent random
walk. It is spurious because the regression will most likely indicate a non-existing relationship:
1. The coefficient estimate will not converge toward zero (the true value). Instead, in the limit the coefficient estimate will follow a non-degenerate distribution.
2. The t-statistics will be very large, suggesting a significant relationship where none exists.
• Granger and Newbold (1974) pointed out that, along with the large t-values, strong evidence of serially correlated errors will appear in regression analysis; they state that when a low value of the Durbin-Watson statistic is combined with a high value of the t-statistic, the relationship is not true.
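• The spurious regression effect is easy to reproduce. The sketch below (assuming the numpy and statsmodels libraries are available) regresses one simulated random walk on another, independent one; a typical run shows a large t-statistic together with a Durbin-Watson statistic far below 2, exactly the warning sign described above:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
n = 500
x = np.cumsum(rng.standard_normal(n))  # random walk 1
y = np.cumsum(rng.standard_normal(n))  # independent random walk 2

model = sm.OLS(y, sm.add_constant(x)).fit()
print("slope t-value     :", model.tvalues[1])          # often spuriously large
print("Durbin-Watson stat:", durbin_watson(model.resid))  # near 0, not 2
```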
• The regression coefficient (β1) captures the existence of a linear relationship between the response variable and the explanatory variable.
• If β1 = 0, we can conclude that there is no statistically significant linear relationship between the two variables.
• Using the Analysis of Variance (ANOVA), we can test whether the overall model is statistically significant. However, for a simple linear regression, the null and alternative hypotheses in ANOVA and the t-test are exactly the same, and thus there will be no difference in the p-value.
Residual analysis
• Residual (error) analysis is important to check whether the assumptions of regression models have been satisfied. It is performed to check, in particular, whether the residuals are normally distributed, have constant variance (homoscedasticity) and are independent (no autocorrelation).
Multiple Regression
• In a multiple regression model, two or more independent variables, i.e. predictors, are involved in the model. The simple linear regression model and the multiple regression model assume that the dependent variable is continuous.
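• A minimal Python sketch of a multiple regression fit with two predictors, using numpy's least squares solver (the data and the model y = β0 + β1x1 + β2x2 are hypothetical):

```python
import numpy as np

# Hypothetical data: one response y and two predictors x1, x2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([4.9, 5.1, 9.0, 9.2, 13.1, 12.8])

# Design matrix with an intercept column of ones.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares estimates of (beta0, beta1, beta2).
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("beta0, beta1, beta2 =", beta)
```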
Regression towards the Mean
• Regression toward the mean refers to a tendency for scores, particularly extreme scores, to
shrink toward the mean. Regression toward the mean appears among subsets of extreme
observations for a wide variety of distributions.
• The rule goes that, in any series with complex phenomena that are dependent on many
variables, where chance is involved, extreme outcomes tend to be followed by more moderate
ones.
• The effects of regression to the mean can frequently be observed in sports, where the effect
causes plenty of unjustified speculations.
• It basically states that if a variable is extreme the first time we measure it, it will be closer to
the average the next time we measure it. In technical terms, it describes how a random variable
that is outside the norm eventually tends to return to the norm.
• For example, our odds of winning on a slot machine stay the same. We might hit a "winning
streak" which is, technically speaking, a set of random variables outside the norm. But play the
machine long enough and the random variables will regress to the mean (i.e. "return to normal")
and we shall end up losing.
• Consider a sample taken from a population. The value of the variable will be some distance from the mean. For instance, we could take a sample of people (it could be just one person), measure their heights and then determine the average height of the sample. This value will be some distance away from the average height of the entire population of people, though the distance might be zero.
• Regression to the mean usually happens because of sampling error. A good sampling technique is to randomly sample from the population. If we sample asymmetrically, then results may be abnormally high or low relative to the average and would therefore regress back to the mean. Regression to the mean can also happen because we take a very small, unrepresentative sample.
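• A simple simulation illustrates the effect. Under a hypothetical test-retest model (each score = a stable "ability" component plus independent noise), the group that scores highest on the first test scores closer to the mean on the second test, even though nothing about the individuals has changed:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

ability = rng.standard_normal(n)          # stable underlying component
test1 = ability + rng.standard_normal(n)  # first measurement (with noise)
test2 = ability + rng.standard_normal(n)  # second measurement (fresh noise)

top = test1 >= np.quantile(test1, 0.95)   # the extreme scorers on test 1
print("top group, test 1 mean:", test1[top].mean())  # far above 0
print("top group, test 2 mean:", test2[top].mean())  # closer to 0 (the mean)
```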
Regression fallacy
• Regression fallacy assumes that a situation has returned to normal due to corrective actions
having been taken while the situation was abnormal. It does not take into consideration normal
fluctuations.
• An example of this could be a business program that fails, causes problems and is then cancelled. The return to "normal", which might be somewhat different from the original situation (a "new normal"), could fall into the category of regression fallacy. This is considered an informal fallacy.
Two Marks Questions with Answers
Q.1 What is correlation?
Ans. : Correlation refers to a relationship between two or more objects. In statistics, the word
correlation refers to the relationship between two variables. Correlation exists between two
variables when one of them is related to the other in some way.
Q.2 Define positive correlation and negative correlation.
Ans. :
• Positive correlation : Association between variables such that high scores on one variable tends
to have high scores on the other variable. A direct relation between the variables.
• Negative correlation: Association between variables such that high scores on one variable
tends to have low scores on the other variable. An inverse relation between the variables.
Q.3 What is a cause and effect relationship?
Ans. : If two variables vary in such a way that movements in one are accompanied by movements in the other, these variables are said to have a cause and effect relationship.
Q.4 What are the merits of a scatter diagram?
Ans. : 1. It is a simple to implement and attractive method to find out the nature of correlation.
2. It is easy to understand.
3. User will get rough idea about correlation (positive or negative correlation).
Q.5 What is a regression problem?
Ans. : For an input x, if the output is continuous, this is called a regression problem.
Q.6 What are the key assumptions of linear regression?
Ans. : The regression has five key assumptions : linear relationship, multivariate normality, no or little multicollinearity, no autocorrelation and homoscedasticity.
Q.7 What is regression analysis used for?
Ans. : Regression analysis is a form of predictive modelling technique which investigates the
relationship between a dependent (target) and independent variable (s) (predictor). This
technique is used for forecasting, time series modelling and finding the causal effect relationship
between the variables.
Q.8 What are the types of regression?
Ans. : Types of regression are linear regression, logistic regression, polynomial regression,
stepwise regression, ridge regression, lasso regression and elastic-net regression.
Q.9 What is the least squares method?
Ans. : Least squares is a statistical method used to determine a line of best fit by minimizing the
sum of squares created by a mathematical function. A "square" is determined by squaring the
distance between a data point and the regression line or mean value of the data set.
Q.10 How are correlation and regression related?
Ans. : Correlation is a statistical analysis used to measure and describe the relationship between
two variables. A correlation plot will display correlations between the values of variables in the
dataset. If two variables are correlated, X and Y then a regression can be done in order to predict
scores on Y from the scores on X.
Q.11 What is multiple linear regression?
Ans. : Multiple linear regression is an extension of linear regression, which allows a response
variable, y to be modelled as a linear function of two or more predictor variables. In a multiple
regression model, two or more independent variables, i.e. predictors are involved in the model.
The simple linear regression model and the multiple regression model assume that the dependent
variable is continuous.