
Introduction To Correlation Analysis GB6023 2012

The document provides an introduction to correlation analysis and scatter plots. It discusses how scatter plots can be used to show the relationship between two variables and how correlation coefficients quantify the strength and direction of linear relationships. Pearson's r and Spearman's rho are introduced as measures of the correlation between two variables, where r ranges from -1 to 1 with values farther from 0 indicating stronger relationships. Examples are given of how to interpret the size of r based on different guidelines.

Uploaded by

Ina Amalina
Copyright
© Attribution Non-Commercial (BY-NC)

Introduction to Correlation Analysis

GB6023 2012

Scatter Plots and Correlation


A scatter plot (or scatter diagram) is used to show the relationship between two variables. Correlation analysis is used to measure the strength of the association (linear relationship) between two variables. It is concerned only with the strength of the relationship; no causal effect is implied.

Scatterplot
Graphical display of two quantitative variables:
Horizontal axis: explanatory (independent) variable, x
Vertical axis: response (dependent) variable, y

Scatter Plot Examples


[Scatter plots illustrating linear relationships and curvilinear relationships]

Scatter Plot Examples


(continued)
[Scatter plots illustrating strong relationships and weak relationships]

Scatter Plot Examples


(continued)
[Scatter plot illustrating no relationship]

To quantify the relationship between two quantitative variables, we use a correlation coefficient: Pearson's r or Spearman's rho. The correlation coefficient tells us:
whether there is a relationship between the variables;
the strength of the relationship;
the direction of the relationship.

Correlation coefficients vary from -1.0 to +1.0. A correlation coefficient of 0.0 indicates that there is no relationship. A correlation coefficient of -1.0 or +1.0 indicates a perfect relationship, i.e. the scores on one variable can be accurately determined from the scores on the other variable.

1/14/2013 Slide 7

If a correlation coefficient is negative, it implies an inverse relationship: the scores on the two variables move in opposite directions, i.e. higher scores on one variable are associated with lower scores on the other variable. If a correlation coefficient is positive, it implies a direct relationship: the scores on the two variables move in the same direction, i.e. higher scores on one variable are associated with higher scores on the other variable.

When we talk about the size of a correlation, we refer to the value irrespective of the sign: a correlation of -.728 is just as large, or strong, as a correlation of +.728.

The Pearson r correlation coefficient treats the data as interval. Spearman's rho treats the data as ordinal, using the rank order of the scores for each variable rather than the scores themselves.

Suppose we had the data below showing the relationship between GPA and income. SPSS would calculate Pearson's r for this data to be .911 and Spearman's rho to be .900.

GPA   Income
3.2   45000
3.3   42000
3.5   48000
3.7   50000
3.8   55000

The ranks for the values for each of the variables are shown below. Using the ranks as data, SPSS would calculate both Pearson's r and Spearman's rho to be .900.

GPA Rank   Income Rank
1          2
2          1
3          3
4          4
5          5
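The coefficients quoted for this example can be verified without SPSS. The following is a minimal Python sketch (the helper names are my own, not SPSS functions); Spearman's rho is computed as Pearson's r applied to the ranks:

```python
import math

def pearson_r(x, y):
    # Pearson's r: sum of cross-products of deviations, divided by
    # the square root of the product of the sums of squared deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def ranks(values):
    # Rank 1 = smallest value (this example has no ties)
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out

def spearman_rho(x, y):
    # Spearman's rho is Pearson's r computed on the rank-transformed data
    return pearson_r(ranks(x), ranks(y))

gpa = [3.2, 3.3, 3.5, 3.7, 3.8]
income = [45000, 42000, 48000, 50000, 55000]

print(round(pearson_r(gpa, income), 3))    # 0.911
print(round(spearman_rho(gpa, income), 3)) # 0.9
```

The income ranks come out as 2, 1, 3, 4, 5, matching the table above, and Pearson's r on those ranks is exactly .900.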

Suppose the fifth subject had an income of 100,000 instead of 55,000. SPSS would calculate Pearson's r for this data to be .733 and Spearman's rho to be .900.

GPA   Income
3.2   45000
3.3   42000
3.5   48000
3.7   50000
3.8   100000

The ranks for the values did not change: the fifth subject still had the highest income, so Spearman's rho has the same value. Pearson's r decreased from .911 to .733. Outliers, and the skewing of the distribution by outliers, have a greater effect on Pearson's r than they do on Spearman's rho.

GPA Rank   Income Rank
1          2
2          1
3          3
4          4
5          5
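The asymmetric effect of the outlier can be reproduced directly. A self-contained Python sketch (helper names are my own), assuming the same five subjects with the fifth income changed to 100,000:

```python
import math

def pearson_r(x, y):
    # Pearson's r from deviations about the means
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def ranks(values):
    # Rank 1 = smallest value (no ties here)
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out

gpa = [3.2, 3.3, 3.5, 3.7, 3.8]
income = [45000, 42000, 48000, 50000, 100000]  # fifth income is now an outlier

# Pearson's r reacts to the outlier; rho (Pearson's r on the ranks) does not,
# because 100,000 still holds the same rank that 55,000 did
print(round(pearson_r(gpa, income), 3))               # 0.733
print(round(pearson_r(ranks(gpa), ranks(income)), 3)) # 0.9
```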

In the scatterplot, outliers (the case I changed from 55,000 to 100,000) will draw the loess line toward them, away from the linear fit line, making the pattern of points appear less linear, or more non-linear.

[Scatterplot of income against GPA, with the outlier at 100,000 marked relative to the original value of 55,000]

The lines demonstrate the point, but the cyan line is really a quadratic fit rather than a loess line, because I can't do much smoothing with only 5 data points.

Outliers, and the skewing of the distribution by outliers, have a greater effect on Pearson's r than they do on Spearman's rho. As the outliers become more extreme and the distribution becomes more skewed, Spearman's rho becomes larger than Pearson's r, and the overall trend in the data is non-linear. To accurately model the relationship, we have three choices:
1. Use a more complex non-linear model to analyze the relationship.
2. Re-express the data to reduce skewing and the impact of outliers, and analyze the relationship with a linear model.
3. Exclude outliers to reduce skewing, and analyze the relationship with a linear model.

If the three following conditions are present, re-expressing the data may reduce the skewness and increase the size of Pearson's r enough to justify treating the relationship as linear:
1. The model appears non-linear because of the difference between the loess line and the linear fit line.
2. Spearman's rho is larger than Pearson's r (by .05 or more).
3. One or both of the variables violates the skewness criteria for a normal distribution.

We will employ the transformations we have used previously: if the distribution is negatively skewed, we re-express the data as squares; if the distribution is positively skewed, we re-express the data as logarithms.

There are two sets of guidelines used to translate the correlation coefficient into a narrative phrase: guidelines attributed to Tukey and guidelines attributed to Cohen.

Tukey's guidelines interpret a correlation:
from 0.0 up to 0.20 as very weak;
equal to or greater than 0.20 up to 0.40 as weak;
equal to or greater than 0.40 up to 0.60 as moderate;
equal to or greater than 0.60 up to 0.80 as strong; and
equal to or greater than 0.80 as very strong.

Cohen's guidelines interpret a correlation:
less than 0.10 as trivial;
equal to or greater than 0.10 up to 0.30 as weak or small;
equal to or greater than 0.30 up to 0.50 as moderate; and
equal to or greater than 0.50 as strong or large.
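The two sets of cut-points are easy to mechanize. A small Python sketch (function names are my own) that maps a correlation to the narrative phrases above, using the absolute value since the size of a correlation ignores its sign:

```python
def tukey_label(r):
    # Tukey's guidelines; size of a correlation ignores the sign
    size = abs(r)
    if size < 0.20:
        return "very weak"
    if size < 0.40:
        return "weak"
    if size < 0.60:
        return "moderate"
    if size < 0.80:
        return "strong"
    return "very strong"

def cohen_label(r):
    # Cohen's guidelines
    size = abs(r)
    if size < 0.10:
        return "trivial"
    if size < 0.30:
        return "weak or small"
    if size < 0.50:
        return "moderate"
    return "strong or large"

print(tukey_label(-0.728))  # strong
print(cohen_label(0.25))    # weak or small
```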

Examples of Approximate r Values


[Five scatter plots illustrating approximate r values: r = -1, r = -.6, r = 0, r = +.3, r = +1]

Calculating the Correlation Coefficient


Sample correlation coefficient:

r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}}

or the algebraic equivalent:

r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - (\sum x)^2\right]\left[n \sum y^2 - (\sum y)^2\right]}}

where:

r = sample correlation coefficient
n = sample size
x = value of the independent variable
y = value of the dependent variable

Calculation Example
Tree Height,   Trunk Diameter,
y              x          xy        y^2       x^2
35             8          280       1225      64
49             9          441       2401      81
27             7          189       729       49
33             6          198       1089      36
60             13         780       3600      169
21             7          147       441       49
45             11         495       2025      121
51             12         612       2601      144
Σy = 321       Σx = 73    Σxy = 3142   Σy^2 = 14111   Σx^2 = 713

Calculation Example (continued)

r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - (\sum x)^2\right]\left[n \sum y^2 - (\sum y)^2\right]}}
  = \frac{8(3142) - (73)(321)}{\sqrt{\left[8(713) - (73)^2\right]\left[8(14111) - (321)^2\right]}}
  = 0.886

r = 0.886 indicates a relatively strong positive linear association between x and y.

[Scatter plot of Tree Height, y (vertical axis, 0 to 70) against Trunk Diameter, x (horizontal axis, 0 to 14)]
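The arithmetic above can be checked with a short Python sketch using the computational form of r and the tree data (variable names are my own):

```python
import math

y = [35, 49, 27, 33, 60, 21, 45, 51]  # tree height
x = [8, 9, 7, 6, 13, 7, 11, 12]       # trunk diameter
n = len(x)

sum_x, sum_y = sum(x), sum(y)              # 73, 321
sum_xy = sum(a * b for a, b in zip(x, y))  # 3142
sum_x2 = sum(a * a for a in x)             # 713
sum_y2 = sum(b * b for b in y)             # 14111

# Computational (algebraic) form of the sample correlation coefficient
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
print(round(r, 3))  # 0.886
```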

Significance Test for Correlation


Hypotheses:
H0: ρ = 0 (no correlation)
HA: ρ ≠ 0 (correlation exists)

Test statistic:

t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}

(with n − 2 degrees of freedom)

Example: Produce Stores


Is there evidence of a linear relationship between tree height and trunk diameter at the .05 level of significance?

H0: ρ = 0 (no correlation)
H1: ρ ≠ 0 (correlation exists)

α = .05, df = n − 2 = 8 − 2 = 6

t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{.886}{\sqrt{\dfrac{1 - .886^2}{8 - 2}}} = 4.68

Example: Test Solution


t = \frac{.886}{\sqrt{\dfrac{1 - .886^2}{8 - 2}}} = 4.68

d.f. = 8 − 2 = 6, α/2 = .025, critical values ±t_{α/2} = ±2.4469

Since t = 4.68 > 2.4469, the test statistic falls in the rejection region.

Decision: Reject H0.
Conclusion: There is evidence of a linear relationship at the 5% level of significance.
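The decision can also be checked numerically. A Python sketch of the test statistic and the comparison against the tabled critical value (2.4469 for α/2 = .025 with 6 degrees of freedom):

```python
import math

r, n = 0.886, 8  # correlation and sample size from the tree example

# Test statistic for H0: rho = 0, with n - 2 degrees of freedom
t = r / math.sqrt((1 - r ** 2) / (n - 2))
print(round(t, 2))  # 4.68

t_crit = 2.4469  # two-tailed critical value, alpha = .05, d.f. = 6
print(abs(t) > t_crit)  # True, so reject H0
```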

SPSS/PASW applications: contingency tables and the chi-square test.

Contingency tables enable us to compare one characteristic of the sample, e.g. degree of religious fundamentalism, for groups or subsets of cases defined by another categorical variable, e.g. gender. A contingency table, which SPSS calls a cross-tabulated table, is shown below:


Each cell in the table represents a combination of the characteristics associated with the two variables: 29 males were fundamentalists, and 42 females were fundamentalists.

While a larger number of females were fundamentalist, we cannot tell whether females were more likely to be fundamentalist, because the total number of females (146) was different from the total number of males (107). To answer the "more likely" question, we need to compare percentages.


There are three percentages that can be calculated for a contingency table:
the percentage of the total number of cases;
the percentage of the total in each row;
the percentage of the total in each column.
Each of the three percentages provides different information and answers a different question.

The percentage of the total number of cases is computed by dividing the number in each cell (e.g. 29, 42, etc.) by the total number of cases (253).
11.5% of the cases were both male and fundamentalist. 16.6% of the cases were both female and fundamentalist.

We have two clues that the table contains total percentages. First, the rows containing the percentages are labeled % of Total.

Second, the 100% figure appears ONLY in the grand total cell beneath the table total of 253.


The percentage of the total for each row is computed by dividing the number in each cell (e.g. 29, 42) by the total for the row (71).
40.8% of the fundamentalists were male, and 59.2% of the fundamentalists were female.

The label for the percentage tells us that it is computed within the category for fundamentalist.

The percentages in each row sum to 100% in the total column for rows (the row margin).

The percentage of the total for each column is computed by dividing the number in each cell (e.g. 29, 36, and 42) by the total for the column (107).
27.1% of the males were fundamentalists, and 33.6% of the males were moderates.

The label for the percentage tells us that it is computed within the category for sex.

The percentages in each column sum to 100% in the total row for columns (the column margin).

The three percentages tell us:
the percent that is in both categories (total percentage);
the percent of each row that is found in each of the column categories (row percentages);
the percent of each column that is found in each of the row categories (column percentages).
The row and column percentages are referred to as conditional or contingent percentages.
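Using the cell counts quoted in the preceding slides (29 male and 42 female fundamentalists; column totals of 107 males and 146 females; a grand total of 253 cases), the three kinds of percentages can be computed directly. A Python sketch (variable names are my own):

```python
# Cell counts from the cross-tabulated table discussed above
male_fund, female_fund = 29, 42
total_male, total_female, total = 107, 146, 253
row_total_fund = male_fund + female_fund   # 71 fundamentalists in all

# Total percentages: cell / grand total
print(round(male_fund / total * 100, 1))    # 11.5 (male AND fundamentalist)
print(round(female_fund / total * 100, 1))  # 16.6 (female AND fundamentalist)

# Row percentages: cell / row total
print(round(male_fund / row_total_fund * 100, 1))    # 40.8 (% of fundamentalists who were male)
print(round(female_fund / row_total_fund * 100, 1))  # 59.2 (% of fundamentalists who were female)

# Column percentages: cell / column total
print(round(male_fund / total_male * 100, 1))      # 27.1 (% of males who were fundamentalist)
print(round(female_fund / total_female * 100, 1))  # 28.8 (% of females who were fundamentalist)
```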

Our real interest is in conditional or contingent percentages, because these tell us about the relationship between the variables. The relationship between variables is defined by a distinct role for each:
the variable which is affected or impacted by the other is the dependent variable;
the variable which affects or impacts the other is the independent variable.

We assign the role to the variable. An independent variable in one analysis may be a dependent variable in another analysis.

A categorical variable has a relationship to another categorical variable if the probability of being in one category of the dependent variable differs depending on the category of the independent variable. For example, if there is a relationship between social class and college attendance, the percentage of upper class persons who attend college will be different from the percentage of middle class persons who attend college. Attending college is the dependent variable.

Given that we can represent this statistically with either the row or column percentages in a contingency table, my practice is to always put the independent variable in the columns and the dependent variable in the rows, and compute column percentages. This order matches the order for many graphics where the dependent variable is on the vertical axis and the independent variable is on the horizontal axis.

Based on the column percentages, we can make statements like the following:

Males were most likely to be liberal (39.3%), while females were most likely to be moderate (45.5%).


Based on the column percentages, we can make statements like the following:

Males were more likely to be liberal (39.3%) compared to females (26.7%).

This is not equivalent to the statement that liberals are more likely to be male or female.

