INFERENTIAL STATISTICS:
CORRELATION AND REGRESSION
Written By:
Prof. Dr. Nasir Mahmood
Reviewed By:
Dr. Rizwan Akram Rana
Introduction
A correlation is a relationship between two variables. The purpose of using correlation in
research is to determine the degree to which a relationship exists between two or more
variables. Correlation is important in research because several hypotheses are stated in
terms of correlation or lack of correlation between two variables, so correlational studies
are directly related to such hypotheses.
Regression is used when the relationship includes a dependent variable and one or more
independent variables. It helps us understand which among the independent variables are
related to the dependent variable, and to explore the forms of these relationships.
Owing to the importance of correlation and regression in research, this unit discusses both in detail.
Objectives
After reading this unit, you will be able to:
1. explain correlation.
2. explain where and why to use correlation.
3. explain what considerations should be kept in mind while interpreting correlation.
4. explain the Pearson and Spearman correlation methods.
5. explain the situations in which the Spearman correlation can be used.
6. explain regression.
7. explain why we use regression analysis.
8. explain types of regression.
9. explain p-value.
7.1 Correlation
Correlation is a statistical technique used to measure and describe the relationship between two
variables. These variables are neither manipulated nor controlled; rather, they are simply
observed as they naturally exist in the environment. Suppose a researcher is interested in the
relationship between the number of children in a family and the IQ of the individual child. He
would take a group of students coming from different families, then simply observe or record
the number of children in each family and measure the IQ score of each individual student in
the same group. He would neither manipulate nor control any variable. Correlation requires two
separate scores for each individual (one score from each of the two variables). These scores are
normally identified as X and Y and can be presented in a table or in a graph.
i) The Direction of the Relationship
The direction of the relationship can be classified into two basic categories:
positive and negative.
In a positive correlation, both variables tend to change in the same direction: when
variable X increases, variable Y also increases, and when variable X decreases,
variable Y also decreases. In other words, the two variables are directly
proportional to each other.
In a negative correlation, the two variables tend to change in opposite directions:
when variable X increases, variable Y decreases, and when variable X decreases,
variable Y increases. In other words, the two variables are inversely proportional
to each other.
To calculate the Pearson correlation r we use the formula

$r = \frac{SP}{\sqrt{SS_X \cdot SS_Y}}$

Two formulas (definitional and computational) are available to calculate the sum of
products of deviations, SP. Both formulas are given in the following box.

1. The definitional formula is $SP = \sum (X - \bar{X})(Y - \bar{Y})$
2. The computational formula is $SP = \sum XY - \frac{(\sum X)(\sum Y)}{n}$

SS is the sum of squares: SSx is the sum of squares of the variable X and SSy is the sum of
squares of the variable Y. The definitional and computational formulas for the sum of squares
of variable X are:

$SS_X = \sum (X - \bar{X})^2$ (definitional)
$SS_X = \sum X^2 - \frac{(\sum X)^2}{n}$ (computational)

and SSy is obtained in the same way from the Y scores. It should be kept in mind that
whichever formula one uses, it will yield the same result.
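To illustrate, the computational formulas above translate directly into code. Below is a minimal Python sketch (the function name and sample data are hypothetical) that computes SP, SSx and SSy from two lists of raw scores and returns the Pearson r.

```python
import math

def pearson_r(x, y):
    """Compute the Pearson correlation using the computational formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)

    sp = sum_xy - (sum_x * sum_y) / n     # SP  = sum(XY) - (sum X)(sum Y)/n
    ss_x = sum_x2 - sum_x ** 2 / n        # SSx = sum(X^2) - (sum X)^2/n
    ss_y = sum_y2 - sum_y ** 2 / n        # SSy = sum(Y^2) - (sum Y)^2/n
    return sp / math.sqrt(ss_x * ss_y)

# Hypothetical example: number of children (X) and IQ score (Y)
x = [1, 2, 3, 4, 5, 6]
y = [110, 108, 105, 99, 95, 94]
print(round(pearson_r(x, y), 3))  # a value near -1 indicates a strong negative relationship
```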
ii) Validity
Suppose a researcher develops a new test for measuring intelligence. It is necessary
for him to show that this new test is valid and truly measures what it claims to
measure. One common technique for demonstrating validity is to use correlation.
If the newly constructed test actually measures intelligence, then the scores on this test
should be related to other already established measures of intelligence – for
example, standardized IQ tests, performance on learning tasks, problem-solving
ability, and so on. The newly constructed test can be correlated with each of these
measures to demonstrate that it is valid.
iii) Reliability
Apart from determining validity, correlations are also used to determine reliability.
A measurement procedure is reliable if it produces stable and consistent
measurements. It means a reliable measurement procedure will produce the same (or
nearly the same) scores when the same individuals are measured under the same
conditions. One common way to evaluate reliability is to use correlations to
determine the relationship between two sets of scores.
Now let us say a few words on interpreting correlation. For interpreting correlation,
the following considerations should be kept in mind.
i) Correlation simply describes a relationship between two variables. It does not
explain why the two variables are related, which is why a correlation cannot be
interpreted as proof of a cause-and-effect relationship between them.
ii) The value of the correlation can be greatly affected by the range of scores
represented in the data.
iii) One or two extreme data points, often called outliers, can have a dramatic effect on
the value of the correlation.
iv) When judging how good a relationship is, it is tempting to focus on the numerical
value of the correlation. For example, a correlation of +.5 is halfway between 0 and
1.00 and therefore appears to represent a moderate degree of relationship. It should
be noted, however, that we cannot interpret a correlation as a proportion. Although a
correlation of 1.00 means that there is a 100% perfectly predictable relationship
between variables X and Y, a correlation of .5 does not mean that we can make
a prediction with 50% accuracy. The appropriate way to describe how
accurately one variable predicts the other is to square the correlation. Thus a
correlation of r = .5 gives r² = .5² = .25, or 25% accuracy. (The value r² is called the
coefficient of determination because it measures the proportion of variability in one
variable that can be determined from the relationship with the other variable.)
7.3 The Spearman Correlation
The most commonly used measure of relationship is the Pearson correlation. It measures
the degree of linear relationship between two variables and is used with interval or ratio
data. However, other measures of correlation have been developed for non-linear
relationships and for other types of data (or scales of measurement). One such measure is
the Spearman correlation. The Spearman correlation is used in two situations.
i) The Spearman correlation is designed to measure the relationship between
variables measured on an ordinal scale of measurement.
ii) The Spearman correlation is used when the researcher wants to measure the
consistency of a relationship between the variables X and Y. In this case the
original scores are first converted into ranks, and then the Spearman correlation is
used to measure the relationship for the ranks. Incidentally, when there is a
consistently one-directional relationship between two variables, the relationship is
said to be monotonic. Thus, the Spearman correlation can be used to measure the
degree of monotonic relationship between two variables.
Whereas the Pearson correlation measures the degree of linear relationship between two
variables, the Spearman correlation measures the consistency of the relationship. It can be
used as a valuable alternative to the Pearson correlation even when the original raw scores
are on an interval or ratio scale. After the scores have been converted into ranks, the
Spearman correlation is generally computed with the special formula

$r_s = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}$

where D is the difference between the X rank and the Y rank for each individual and n is
the number of individuals. When there are no tied ranks, this formula yields the same
result as applying the Pearson correlation formula to the ranks.
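A minimal Python sketch of this procedure follows (the function names and sample data are hypothetical, and no tied scores are assumed). It converts the raw scores to ranks and then applies the special formula.

```python
def to_ranks(scores):
    """Replace each score with its rank (1 = smallest); assumes no ties."""
    sorted_scores = sorted(scores)
    return [sorted_scores.index(s) + 1 for s in scores]

def spearman_rs(x, y):
    """Spearman correlation via rs = 1 - 6*sum(D^2) / (n(n^2 - 1))."""
    n = len(x)
    rx, ry = to_ranks(x), to_ranks(y)
    sum_d2 = sum((rxi - ryi) ** 2 for rxi, ryi in zip(rx, ry))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

# Hypothetical example: raw scores (X) and a second rater's rankings (Y) for six essays
x = [12, 15, 9, 20, 17, 11]
y = [3, 4, 1, 6, 5, 2]
print(round(spearman_rs(x, y), 3))  # 1.0 here, since the two orderings agree perfectly
```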
7.4 Regression
A correlation quantifies the degree and direction to which two variables are related. It
does not fit a line through the data points, and it does not require us to think about cause
and effect. It does not matter which of the two variables is called dependent and which is
called independent.
Regression, on the other hand, finds the best line that predicts the dependent variable from the
independent variable. The decision of which variable is called dependent and which is called
independent is an important matter in regression, as we will get a different best-fit line if
we exchange the two variables, i.e. dependent to independent and independent to
dependent. The line that best predicts the independent variable from the dependent variable will
not be the same as the line that predicts the dependent variable from the independent variable.
Let us start with the simple case of studying the relationship between two variables X and
Y. The variable Y is the dependent variable and the variable X is the independent variable. We
are interested in seeing how various values of the independent variable X predict
corresponding values of the dependent variable Y. This statistical technique is called regression
analysis. We can say that regression analysis is a technique that is used to model the
dependency of one dependent variable upon one independent variable. The Merriam-Webster
online dictionary defines regression as a functional relationship between two or more
correlated variables that is often empirically determined from data and is used especially to
predict values of one variable from given values of the others. According to Gravetter and
Wallnau (2002), regression is a statistical technique for finding the best-fitting straight line
for a set of data, and the resulting straight line is called the regression line.
These benefits help a researcher to estimate and evaluate the best set of variables to be
used for building predictive models.
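To make the idea of a best-fitting line concrete, here is a minimal Python sketch (the data and names are hypothetical) that computes the least-squares regression line Ŷ = bX + a, using the standard formulas b = SP/SSx and a = Ȳ − b·X̄.

```python
def regression_line(x, y):
    """Least-squares line Y_hat = b*X + a, with b = SP/SSx and a = mean(Y) - b*mean(X)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    b = sp / ss_x                # slope
    a = mean_y - b * mean_x      # intercept
    return b, a

# Hypothetical example: hours studied (X) predicting exam score (Y)
x = [2, 4, 6, 8]
y = [60, 70, 75, 85]
b, a = regression_line(x, y)
print(f"Y_hat = {b:.2f}*X + {a:.2f}")  # predicts a score for any number of hours
```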
77
ii) Logistic Regression
Logistic regression is a statistical method for analyzing a dataset in which there are
one or more independent variables that determine an outcome. The outcome is
measured with a dichotomous (binary) variable. Like all regression analyses,
logistic regression is a predictive analysis. It is used to describe and explain the
relationship between one dependent binary variable and one or more nominal,
ordinal, interval or ratio level independent variables.
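As an illustration, the sketch below fits a logistic regression using scikit-learn (an assumed library choice; any statistics package would serve). The data and variable names are hypothetical: hours of study predicting a pass/fail outcome.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied (independent) and pass = 1 / fail = 0 (binary outcome)
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)

# Predicted probability of passing for a student who studied 2.2 hours
print(model.predict_proba([[2.2]])[0][1])
```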
v) Ridge Regression
It is a technique for analyzing multiple regression data that suffer from
multicollinearity (i.e. the independent variables are highly correlated). When
multicollinearity occurs, least squares estimates are unbiased, but their variances
are so large that they may be far from the true value. By adding a degree of bias
to the regression estimates, ridge regression reduces the standard errors.
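A minimal numpy sketch of this idea follows (the data are hypothetical; lam is the bias parameter). Ridge replaces the ordinary least-squares solution (XᵀX)⁻¹Xᵀy with (XᵀX + λI)⁻¹Xᵀy, which shrinks and stabilizes the coefficients when the columns of X are highly correlated.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Ridge solution (X'X + lam*I)^(-1) X'y; lam = 0 gives ordinary least squares."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Hypothetical data with two nearly collinear predictors
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.01, size=50)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=50)

print(ridge_coefficients(X, y, lam=0.0))    # unstable when predictors are nearly collinear
print(ridge_coefficients(X, y, lam=1.0))    # shrunken, stable coefficients near 1.5 each
```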
7.5 P-Value
The p-value is the level of marginal significance within a statistical hypothesis test,
representing the probability of obtaining a result at least as extreme as the one observed.
This value is used as an alternative to rejection points: it gives the smallest level of
significance at which the null hypothesis would be rejected.
In other words, the p-value is the calculated probability of finding the observed, or more
extreme, results when the null hypothesis is true. The p-value is also described in terms of
the probability of rejecting the null hypothesis when it is actually true.
A p-value is used in hypothesis testing to help the researcher support or reject the null
hypothesis. It is evidence against the null hypothesis: the smaller the p-value, the stronger
the evidence to reject the null hypothesis.
If the p-value gets lower (i.e. closer to 0% and farther away from 100%), a researcher is
more inclined to reject the null hypothesis and accept the research hypothesis.
A relatively simple way to interpret p-values is to think of them as representing how likely
a result would occur by chance. For a calculated p-value of .01, we can say that the
observed outcome would be expected to occur by chance only 1 time in 100 in repeated
tests on different samples of the population. Similarly, a p-value of .05 would represent an
outcome expected to occur by chance only 5 times out of 100, and a p-value of .001 an
outcome expected to occur by chance only once if the same treatment is repeated 1000
times on different samples of the population. In the case of a p-value of .01, the researcher
is 99% confident of getting similar results if the same test is repeated 100 times. Similarly,
for a p-value of .05 the researcher is 95% confident, and for a p-value of .001 he is 99.9%
confident, of getting similar results if the same test is repeated 100 times and 1000 times
respectively.
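As an illustration, the sketch below obtains a p-value by comparing two hypothetical groups with an independent-samples t-test using scipy (an assumed library choice; the same value could be computed by hand from the t distribution).

```python
from scipy import stats

# Hypothetical exam scores for two independently sampled groups
group_a = [78, 82, 85, 88, 90, 76, 84, 80]
group_b = [72, 75, 70, 78, 74, 69, 77, 71]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# If p < .05, reject the null hypothesis of equal population means
if p_value < 0.05:
    print("Reject the null hypothesis at the .05 level.")
```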
7.7 Activities
1. Think about and make a list of the uses of correlation.
2. List the considerations that you will keep in mind while using correlation.
3. Think about and write down the primary objective of regression analysis.
7.8 Bibliography
Argyrous, G. (2012). Statistics for Research, with a Guide to SPSS. India: SAGE Publications.
Bartz, A. E. (1981). Basic Statistical Concepts (2nd ed.). Minnesota: Burgess Publishing Company.
Dietz, T., & Kalof, L. (2009). Introduction to Social Statistics. UK: Wiley-Blackwell.
Fraenkel, J. R., Wallen, N. E., & Hyun, H. H. (2012). How to Design and Evaluate Research in Education (8th ed.). New York: McGraw-Hill.
Gravetter, F. J., & Wallnau, L. B. (2002). Essentials of Statistics for the Behavioral Sciences (4th ed.). California: Wadsworth.