Introduction To Correlation Analysis GB6023 2012
Scatterplot
Graphical display of two quantitative variables:
Horizontal axis: explanatory (independent) variable, x
Vertical axis: response (dependent) variable, y
To quantify the relationship between two quantitative variables, we use a correlation coefficient: Pearson's r or Spearman's rho. The correlation coefficient tells us:
- whether there is a relationship between the variables
- the strength of the relationship
- the direction of the relationship
Correlation coefficients range from -1.0 to +1.0. A correlation coefficient of 0.0 indicates that there is no relationship. A correlation coefficient of -1.0 or +1.0 indicates a perfect relationship, i.e. the scores on one variable can be exactly determined from the scores on the other variable.
If a correlation coefficient is negative, it implies an inverse relationship, i.e. the scores on the two variables move in opposite directions: higher scores on one variable are associated with lower scores on the other variable. If a correlation coefficient is positive, it implies a direct relationship, i.e. the scores on the two variables move in the same direction: higher scores on one variable are associated with higher scores on the other variable. When we talk about the size of a correlation, we refer to the value irrespective of the sign: a correlation of -.728 is just as large or strong as a correlation of +.728. Pearson's r treats the data as interval. Spearman's rho treats the data as ordinal, using the rank order of the scores for each variable rather than the scores themselves.
Suppose I had the data below showing the relationship between GPA and income. SPSS would calculate Pearson's r for this data to be .911 and Spearman's rho to be .900.

Using the ranks of the values for each variable as data, SPSS would calculate both Pearson's r and Spearman's rho to be .900.
GPA    Income
3.2    45000
3.3    42000
3.5    48000
3.7    50000
3.8    55000
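The two coefficients can be verified with a short pure-Python sketch of the computational formula (the helper names are mine; the slides used SPSS):

```python
def pearson_r(x, y):
    """Pearson's r via the computational formula from these slides."""
    n, sx, sy = len(x), sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5
    return num / den

def ranks(values):
    """Rank from 1 (smallest) to n; ties are not handled (none in this data)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rho(x, y):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

gpa = [3.2, 3.3, 3.5, 3.7, 3.8]
income = [45000, 42000, 48000, 50000, 55000]

print(round(pearson_r(gpa, income), 3))    # 0.911
print(round(spearman_rho(gpa, income), 3)) # 0.9
```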
Suppose the fifth subject had an income of 100,000 instead of 55,000. SPSS would calculate Pearson's r for this data to be .733 and Spearman's rho to be .900.
GPA    Income
3.2    45000
3.3    42000
3.5    48000
3.7    50000
3.8    100000
The ranks for the values did not change: the fifth subject still had the highest income, so Spearman's rho has the same value. Pearson's r decreased from .911 to .733. Outliers, and the skewing of the distribution by outliers, have a greater effect on Pearson's r than they do on Spearman's rho.
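The outlier's one-sided effect can be reproduced in a short pure-Python sketch (the helper function is mine; the slides used SPSS):

```python
def pearson_r(x, y):
    """Pearson's r via the computational formula."""
    n, sx, sy = len(x), sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx, syy = sum(a * a for a in x), sum(b * b for b in y)
    return (n * sxy - sx * sy) / (
        ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5)

gpa = [3.2, 3.3, 3.5, 3.7, 3.8]
income = [45000, 42000, 48000, 50000, 100000]  # outlier: 100000 replaces 55000

# Pearson's r drops because the outlier skews the income distribution...
print(round(pearson_r(gpa, income), 3))  # 0.733

# ...but the ranks are unchanged (100000 is still the largest value), so
# Spearman's rho, which is Pearson's r computed on the ranks, stays at .900.
print(round(pearson_r([1, 2, 3, 4, 5], [2, 1, 3, 4, 5]), 3))  # 0.9
```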
In the scatterplot, outliers (the case I changed from 55,000 to 100,000) draw the loess line toward them and away from the linear fit line, making the pattern of points appear less linear, or more non-linear.
The lines demonstrate the point, but the cyan line is really a quadratic fit rather than a loess line, because I can't do much smoothing with only 5 data points.
Outliers, and the skewing of the distribution by outliers, have a greater effect on Pearson's r than they do on Spearman's rho. As the outliers become more extreme and the distribution becomes more skewed, Spearman's rho becomes larger than Pearson's r, and the overall trend in the data is non-linear. To accurately model the relationship, we have three choices:
1. use a more complex non-linear model to analyze the relationship
2. re-express the data to reduce skewing and the impact of outliers, and analyze the relationship with a linear model
3. exclude outliers to reduce skewing, and analyze the relationship with a linear model
If the three following conditions are present, re-expressing the data may reduce the skewness and increase the size of Pearson's r enough to justify treating the relationship as linear:
1. the model appears non-linear because of the difference between the loess line and the linear fit line,
2. Spearman's rho is larger than Pearson's r (by .05 or more), and
3. one or both of the variables violates the skewness criterion for a normal distribution.
We will employ the transformations we have used previously: if the distribution is negatively skewed, we re-express the data as squares; if the distribution is positively skewed, we re-express the data as logarithms.
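As a rough check that re-expression helps, here is a pure-Python sketch applying a log transformation to the positively skewed income data from the earlier example (the specific transformation and the helper name are my assumptions; the slides used SPSS):

```python
from math import log

def pearson_r(x, y):
    """Pearson's r via the computational formula."""
    n, sx, sy = len(x), sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx, syy = sum(a * a for a in x), sum(b * b for b in y)
    return (n * sxy - sx * sy) / (
        ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5)

gpa = [3.2, 3.3, 3.5, 3.7, 3.8]
income = [45000, 42000, 48000, 50000, 100000]  # positively skewed by the outlier

r_raw = pearson_r(gpa, income)                     # 0.733 with the outlier
r_log = pearson_r(gpa, [log(v) for v in income])   # log re-expression

# The log pulls the outlier in toward the rest of the distribution,
# so Pearson's r on the re-expressed data is larger than on the raw data.
print(round(r_raw, 3), round(r_log, 3))
```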
There are two sets of guidelines used to translate the correlation coefficient into a narrative phrase: guidelines attributed to Tukey and guidelines attributed to Cohen.

Tukey's guidelines interpret a correlation:
- from 0.0 up to 0.20 as very weak;
- equal to or greater than 0.20 up to 0.40 as weak;
- equal to or greater than 0.40 up to 0.60 as moderate;
- equal to or greater than 0.60 up to 0.80 as strong; and
- equal to or greater than 0.80 as very strong.

Cohen's guidelines interpret a correlation:
- less than 0.10 as trivial;
- equal to or greater than 0.10 up to 0.30 as weak or small;
- equal to or greater than 0.30 up to 0.50 as moderate; and
- equal to or greater than 0.50 as strong or large.
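The Tukey guidelines can be encoded as a small helper (the function name is mine). Note that it works on the size of the correlation, ignoring the sign:

```python
def tukey_strength(r):
    """Map |r| to Tukey's narrative phrase from these slides."""
    size = abs(r)  # size ignores direction: -.728 is as strong as +.728
    if size < 0.20:
        return "very weak"
    if size < 0.40:
        return "weak"
    if size < 0.60:
        return "moderate"
    if size < 0.80:
        return "strong"
    return "very strong"

print(tukey_strength(-0.728))  # strong
print(tukey_strength(0.911))   # very strong
```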
[Scatterplots illustrating correlations of r = -1, r = -.6, r = 0, r = +.3, and r = +1]
The Pearson correlation coefficient is:

r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² · Σ(y − ȳ)²]

or, in computational form:

r = [n Σxy − (Σx)(Σy)] / √[(n Σx² − (Σx)²)(n Σy² − (Σy)²)]

where:
r = sample correlation coefficient
n = sample size
x = value of the independent variable
y = value of the dependent variable
Calculation Example
Tree Height, y   Trunk Diameter, x   xy      y²       x²
35               8                   280     1225     64
49               9                   441     2401     81
27               7                   189     729      49
33               6                   198     1089     36
60               13                  780     3600     169
21               7                   147     441      49
45               11                  495     2025     121
51               12                  612     2601     144
Σy = 321         Σx = 73             Σxy = 3142   Σy² = 14111   Σx² = 713
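The column sums and the resulting coefficient can be checked in a few lines of Python (variable names are mine):

```python
height = [35, 49, 27, 33, 60, 21, 45, 51]   # y: tree height
diameter = [8, 9, 7, 6, 13, 7, 11, 12]      # x: trunk diameter
n = len(height)

sx, sy = sum(diameter), sum(height)                 # 73, 321
sxy = sum(x * y for x, y in zip(diameter, height))  # 3142
sxx = sum(x * x for x in diameter)                  # 713
syy = sum(y * y for y in height)                    # 14111

# Computational formula for Pearson's r
r = (n * sxy - sx * sy) / (
    ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5)
print(round(r, 3))  # 0.886
```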
Calculation Example (continued)

r = [n Σxy − (Σx)(Σy)] / √[(n Σx² − (Σx)²)(n Σy² − (Σy)²)]
  = [8(3142) − (73)(321)] / √[(8(713) − 73²)(8(14111) − 321²)]
  = 0.886

r = 0.886: a relatively strong positive linear association between x and y.

[Scatterplot of Tree Height, y (0 to 70) against Trunk Diameter, x (0 to 14), with linear fit line]
Test statistic
The test statistic is

t = r / √[(1 − r²) / (n − 2)]   (with n − 2 degrees of freedom)

d.f. = 8 − 2 = 6

With α/2 = .025 in each tail, the critical values are t = ±2.4469. Reject H0 if t < −2.4469 or t > 2.4469; otherwise do not reject H0.

t = 0.886 / √[(1 − 0.886²) / 6] = 4.68

Since 4.68 > 2.4469, we reject H0: there is a significant linear relationship.
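The arithmetic for the test statistic is short enough to verify directly (the critical value 2.4469 is taken from the slide, not computed here):

```python
r, n = 0.886, 8

# t = r / sqrt((1 - r^2) / (n - 2)), with n - 2 degrees of freedom
t = r / ((1 - r ** 2) / (n - 2)) ** 0.5
print(round(t, 2))  # 4.68

t_crit = 2.4469  # two-tailed critical value for alpha = .05, d.f. = 6
print(t > t_crit)  # True -> reject H0
```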
Contingency tables enable us to compare one characteristic of the sample, e.g. degree of religious fundamentalism, across groups or subsets of cases defined by another categorical variable, e.g. gender. A contingency table, which SPSS calls a cross-tabulated table, is shown below:
Each cell in the table represents a combination of the characteristics associated with the two variables:
29 males were fundamentalists.
42 females were fundamentalists. While a larger number of females were fundamentalist, we cannot tell whether females were more likely to be fundamentalist, because the total number of females (146) differed from the total number of males (107). To answer the "more likely" question, we need to compare percentages.
There are three percentages that can be calculated for a contingency table:
- percentage of the total number of cases
- percentage of the total in each row
- percentage of the total in each column
Each of the three percentages provides different information and answers a different question.
The percentage of the total number of cases is computed by dividing the number in each cell (e.g. 29, 42, etc.) by the total number of cases (253).
11.5% of the cases were both male and fundamentalist. 16.6% of the cases were both female and fundamentalist.
We have two clues that the table contains total percentages. First, the rows containing the percentages are labeled "% of Total."
Second, the 100% figure appears ONLY in the grand total cell beneath the table total of 253.
The percentage of the total for each row is computed by dividing the number in each cell (e.g. 29, 42) by the total for the row (71).
40.8% of the fundamentalists were male.
The label for the percentage tells us that it is computed within the category for fundamentalist.
The percentages in each row sum to 100% in the total column for rows (the row margin).
The percentage of the total for each column is computed by dividing the number in each cell (e.g. 29, 36, and 42) by the total for the column (107).
27.1% of the males were fundamentalists.
33.6% of the males were moderates. The label for the percentage tells us that it is computed within the category for sex.
The percentages in each column sum to 100% in the total row for columns (the column margin).
The three percentages tell us:
- the percent that is in both categories (total percentage)
- the percent of each row that is found in each of the column categories (row percentages)
- the percent of each column that is found in each of the row categories (column percentages)
The row and column percentages are referred to as conditional or contingent percentages.
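All three percentages follow directly from the cell and marginal counts. A quick check in Python, using only the counts these slides report (the fundamentalist row and the marginal totals; the remaining cells are not reproduced here):

```python
# Counts from the cross-tabulation: fundamentalist row and marginal totals
fund_male, fund_female = 29, 42
total_male, total_female, grand_total = 107, 146, 253

# Total percentages: cell / grand total
print(round(100 * fund_male / grand_total, 1))    # 11.5
print(round(100 * fund_female / grand_total, 1))  # 16.6

# Row percentages: cell / row total (71 fundamentalists)
row_total = fund_male + fund_female
print(round(100 * fund_male / row_total, 1))      # 40.8

# Column percentages: cell / column total
print(round(100 * fund_male / total_male, 1))     # 27.1
print(round(100 * fund_female / total_female, 1)) # 28.8
```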
Our real interest is in conditional or contingent percentages because these tell us about the relationship between the variables. The relationship between variables is defined by a distinct role for each:
the variable which is affected or impacted by the other is the dependent variable; the variable which affects or impacts the other is the independent variable.
We assign the role to the variable. An independent variable in one analysis may be a dependent variable in another analysis.
A categorical variable has a relationship to another categorical variable if the probability of being in one category of the dependent variable differs depending on the category of the independent variable. For example, if there is a relationship between social class and college attendance, the percentage of upper class persons who attend college will be different from the percentage of middle class persons who attend college. Attending college is the dependent variable; social class is the independent variable.
Given that we can represent this statistically with either the row or column percentages in a contingency table, my practice is to always put the independent variable in the columns and the dependent variable in the rows, and compute column percentages. This order matches the order for many graphics where the dependent variable is on the vertical axis and the independent variable is on the horizontal axis.
Based on the column percentages, we can make statements like the following:
Males were most likely to be liberal (39.3%), while females were most likely to be moderate (45.5%).
Note that this is not equivalent to a statement about whether liberals were more likely to be male or female.