CH 12

This document covers correlation and regression analysis, explaining how to assess the relationship between two variables using correlation coefficients and scatter diagrams. It details the significance testing of correlation, regression equations, and the least squares method to predict outcomes based on independent variables. Additionally, it discusses the coefficient of determination and standard error of estimate to evaluate the accuracy of predictions.

Uploaded by

김봉기
Copyright
© © All Rights Reserved
Correlation and Regression

Chapter 12
What is Correlation Analysis?
 Used to report the relationship between two variables
CORRELATION ANALYSIS A descriptive statistic that summarizes the
relationship between two variables with a single number
 In addition to graphing techniques, we’ll develop numerical measures
to describe the relationships
 Examples
 Does the amount Healthtex spends per month on training its
sales force affect its monthly sales?
 Does the number of hours students study for an exam
influence the exam score?
 The most common type of correlation: Pearson’s
product-moment correlation coefficient
Scatter Diagram
 A scatter diagram is a graphic tool used to portray the
relationship between two variables
 The independent variable is scaled on the X-axis and is
the variable used as the predictor
 The dependent variable is scaled on the Y-axis and is the
variable being estimated

Graphing the data in a scatter diagram will make the relationship between
sales calls and copier sales easier to see.
Scatter Diagram Example
North American Copier Sales sells copiers to businesses of all sizes throughout the
United States and Canada. The new national sales manager is preparing for an
upcoming sales meeting and would like to impress upon the sales representatives the
importance of making an extra sales call each day. She takes a random sample of 15
sales representatives and gathers information on the number of sales calls made last
month and the number of copiers sold. Develop a scatter diagram of the data.

Sales reps who make more calls tend to sell more copiers!
Correlation Coefficient
 Pearson’s product-moment correlation coefficient: a type of
correlation that summarizes linear relationships; it varies from −1
(representing a perfect negative relationship), through 0
(representing no relationship), up to +1 (representing a
perfect positive relationship)
 Also called Pearson’s r, Pearson’s correlation, or correlation
 There are two rules that govern the actual value of a
correlation:
 Magnitude (strength)
 The closer the correlation is to −1 or +1, the stronger the linear
relationship.
 Example: r = −.80 is stronger than r = +.08
 Direction
 In positive relationships, both variables move in the same
direction at the same time.
 Example: There is a positive relationship between customer
satisfaction and sales if, as sales increase, customer satisfaction also
increases.
 In negative relationships, one variable goes up while the
other goes down.
 Example: There is a negative relationship between workload and
employee satisfaction if, as workload increases, employee satisfaction
decreases.
 The following graphs summarize the strength and
direction of the correlation coefficient
Correlation Coefficient, r
How is the correlation coefficient determined? We’ll use the North American
Copier Sales as an example. We begin with a scatter diagram, but this time we’ll
draw a vertical line at the mean of the x-values (96 sales calls) and a horizontal line
at the mean of the y-values (45 copiers).
Correlation Coefficient, r, Continued
How is the correlation coefficient determined? Now we find the deviations from
the mean number of sales calls and the mean number of copiers sold, then multiply
them. The sum of these products is 6,672 and will be used in formula 13-1 to
find r. We also need the standard deviations. The result, r = .865, indicates a strong,
positive relationship. (A correlation must not be interpreted as a causal relationship;
beware of spurious correlation.)

$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n-1)\, s_x s_y} = \frac{6672}{(15-1)(42.76)(12.89)} = 0.865 $$
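The arithmetic above can be checked with a few lines of Python. This is only a sketch that plugs in the summary values quoted in the text (the raw data for the 15 sales representatives are not reproduced here); the function name `pearson_r_from_summary` is a hypothetical helper, not something from the source.

```python
# Summary values quoted in the text for the North American Copier Sales sample.
n = 15                     # sales representatives in the sample
sum_cross_products = 6672  # sum of (x - mean_x) * (y - mean_y)
s_x = 42.76                # standard deviation of sales calls
s_y = 12.89                # standard deviation of copiers sold


def pearson_r_from_summary(sum_cp, n, s_x, s_y):
    """Pearson's r = sum of cross-products / ((n - 1) * s_x * s_y)."""
    return sum_cp / ((n - 1) * s_x * s_y)


r = pearson_r_from_summary(sum_cross_products, n, s_x, s_y)
print(round(r, 3))  # 0.865
```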
Correlation Coefficient Example
The Applewood Auto Group’s marketing department believes younger buyers
purchase vehicles on which lower profits are earned and older buyers purchase
vehicles on which higher profits are earned. They would like to use this information as
part of an upcoming advertising campaign to try to attract older buyers. Develop a
scatter diagram and then determine the correlation coefficient. Would this be a useful
advertising feature?

The scatter diagram suggests that a positive relationship does exist between age
and profit, but it does not appear to be a strong relationship.

Next, calculate r; it is 0.262. The relationship is positive but weak. The data do
not support a business decision to create an advertising campaign to attract
older buyers!
Testing the Significance of r
 Research questions for correlations specify the two variables
and emphasize that we are looking for a linear relationship.
 Example: Is there a linear relationship between variable 1 and
variable 2?
 Recall that the sales manager from North American Copier
Sales found an r of 0.865
 Could the result be due to sampling error? Remember only 15
salespeople were sampled
 We ask the question, could there be zero correlation in the
population from which the sample was selected?
 We’ll let ρ (rho) represent the correlation in the population
and conduct a hypothesis test to find out whether ρ is
different from zero.
Testing the Significance of r Example
Step 1: State the null and the alternate hypothesis
H0: ρ = 0 The correlation in the population is zero
H1: ρ ≠ 0 The correlation in the population is different from zero
Step 2: Select the level of significance; we’ll use .05
Step 3: Select the test statistic; we use t with n − 2 = 13 degrees of freedom
Step 4: Formulate the decision rule: reject H0 if t < −2.160 or t > +2.160
Step 5: Make the decision: t = 6.216, so reject H0
Step 6: Interpret: there is a correlation between the number of sales calls
made and the number of copiers sold in the population of salespeople.
 The critical value for a correlation depends on alpha
and the degrees of freedom for the test
 Example: If α = .05 and n = 24, tcrit(22) = 2.074
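The test statistic in Step 5 can be reproduced from r and n alone. The sketch below assumes the usual t statistic for a correlation, t = r√(n − 2) / √(1 − r²) with n − 2 degrees of freedom, which is the same form used in the worked examples later in this chapter. The critical value is simply copied from the text rather than looked up in code.

```python
import math


def t_for_correlation(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# North American Copier Sales: r = .865 from a sample of n = 15.
t = t_for_correlation(0.865, 15)
t_crit = 2.160  # two-tailed critical value for alpha = .05, df = 13 (from the text)

reject_h0 = abs(t) > t_crit
print(round(t, 3), reject_h0)  # roughly 6.216, True
```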
Testing the Significance of the Correlation
Coefficient
In the Applewood Auto Group example, we found an r=0.262 which is positive, but
rather weak. We test our conclusion by conducting a hypothesis test that the
correlation is greater than 0.

Step 1: State the null and the alternate hypothesis


H0: ρ ≤ 0 The correlation in the population is negative or zero
H1: ρ > 0 The correlation in the population is positive
Step 2: Select the level of significance; we’ll use .05
Step 3: Select the test statistic; we use t
Step 4: Formulate the decision rule: reject H0 if t > 1.653
Step 5: Make the decision: t = 3.622, so reject H0
Step 6: Interpret: there is a positive correlation between profits and the age of the buyer
Formally stating the results

 We formally state the results of correlations using the format:
 r(df) = r, p = p-value, or state p relative to α (for example, p < α)
 Example: r(8) = .65, p = .04 or p < .05 (the corresponding test statistic is t(8) = 2.39)

 Conducting supplemental analyses
 If we found statistical significance, compute the coefficient
of determination (r²)
 the proportion of variance in one variable that can be explained
by the variance in the other variable.
 A regression analysis can also be computed if necessary.
 If we did not find statistical significance, no further analyses
are needed.
Regression Analysis
 In regression analysis, we estimate one variable based on
another variable
 The variable being estimated is the dependent variable
(or criterion)
 The variable used to make the estimate or predict the
value is the independent variable (or predictor)
 The relationship between the variables is linear
 Simple linear regression: a regression of exactly one
variable predicting exactly one variable
REGRESSION EQUATION An equation that expresses
the linear relationship between two variables.
Regression Analysis
 Regression line: the line created as the result of regression;
the line that minimizes the sum of the squared residuals for a
given dataset; also called the line of best fit
 Residual: the vertical distance any particular datum is from the
regression line.
Least Squares Principle
 In regression analysis, our objective is to use the data to
position a line that best represents the relationship between
two variables
 The first approach is to use a scatter diagram to visually
position the line
 But this depends on judgement; we would prefer a method
that results in a single, best regression line
Least Squares Regression Line
LEAST SQUARES PRINCIPLE A mathematical procedure that uses the data to
position a line with the objective of minimizing the sum of the squares of the
vertical distances between the actual y values and the predicted values of y.

 To illustrate, the same data are plotted in the three charts below
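Since the three charts are not reproduced here, a small numeric sketch can make the same point. It uses the five data points from the Learning check at the end of this chapter and compares the sum of squared vertical distances for the least squares line found there (ŷ = −7 + 3x) against two other lines. The two alternative lines are hypothetical, chosen only for comparison.

```python
# Data from the Learning check at the end of this chapter.
x = [3, 3, 4, 4, 4]  # customer satisfaction
y = [1, 3, 5, 5, 5]  # number of purchases


def sse(a, b, xs, ys):
    """Sum of squared vertical distances between each y and a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(xs, ys))


candidates = {
    "least squares line (a=-7, b=3)": (-7.0, 3.0),
    "hypothetical flatter line (a=-5, b=2.5)": (-5.0, 2.5),
    "hypothetical steeper line (a=-9, b=3.5)": (-9.0, 3.5),
}

for label, (a, b) in candidates.items():
    print(label, sse(a, b, x, y))
# The least squares line gives 2.0; both alternatives give 2.5.
```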
Least Squares Regression Line
 This is the equation of a line:
$$ \hat{y} = a + bx $$
 ŷ is the estimated value of y for a selected value of x
 a is the constant or intercept
 b is the slope of the fitted line
 x is the value of the independent variable
 The formulas for a and b are
$$ b = r\left(\frac{s_y}{s_x}\right), \qquad a = \bar{y} - b\bar{x} $$
Least Squares Regression Line Example
Recall the example of North American Copier Sales. The sales manager gathered
information on the number of sales calls made and the number of copiers sold. Use
the least squares method to determine a linear equation to express the relationship
between the two variables.

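The first step is to find the slope of the least squares regression line, b:
$$ b = r\left(\frac{s_y}{s_x}\right) = .865\left(\frac{12.89}{42.76}\right) = 0.2608 $$

Next, find a:
$$ a = \bar{y} - b\bar{x} = 45 - 0.2608(96) = 19.9632 $$

Then determine the regression line:
$$ \hat{y} = 19.9632 + 0.2608x $$

So if a salesperson makes 100 calls, he or she can expect to sell
19.9632 + 0.2608(100) = 46.0432 copiers.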
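The same calculation can be sketched in Python from the summary statistics quoted earlier (r = .865, s_x = 42.76, s_y = 12.89, means of 96 calls and 45 copiers), assuming the slope is b = r(s_y / s_x) and the intercept is a = ȳ − b·x̄. The figures in the text are consistent with rounding b to four decimals before computing a, so the sketch does the same; `predict` is a hypothetical helper name.

```python
# Summary statistics quoted in the text for North American Copier Sales.
r = 0.865
s_x, s_y = 42.76, 12.89  # standard deviations of sales calls and copiers sold
mean_x, mean_y = 96, 45  # mean sales calls and mean copiers sold

# Slope, rounded to four decimals as the text appears to do.
b = round(r * s_y / s_x, 4)
# Intercept: the line passes through (mean_x, mean_y).
a = mean_y - b * mean_x


def predict(calls):
    """Estimated copiers sold for a given number of sales calls."""
    return a + b * calls


print(b, round(a, 4))           # 0.2608 19.9632
print(round(predict(100), 4))   # 46.0432
print(round(predict(164), 4))   # 62.7344
```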
Drawing the Regression Line
The least squares equation can be drawn on the scatter diagram. For example, the fifth
sales representative is Jeff Hall. He made 164 calls. His estimated number of copiers sold
is 62.7344. The point x = 164 and ŷ = 62.7344 is located by moving to 164 on the x-axis
and then going vertically to 62.7344. The other points on the regression line can be
determined by substituting a particular value of x into the regression equation and
calculating ŷ.
Regression Equation Slope Test
 For a regression equation, the slope is tested for
significance
 We test the hypothesis that the slope of the line in the
population is 0
 If we do not reject the null hypothesis, we conclude there
is no relationship between the two variables
 We begin with the hypothesis statements
H0: β = 0
H1: 𝛽 ≠ 0
Regression Equation Slope Test Example
Recall the North American Copier Sales example. We identified the slope as b and it is
our estimate of the slope of the population, β. We conduct a hypothesis test.
Step 1: State the null and alternate hypothesis
H0: β ≤ 0
H1: β > 0
Step 2: Select the level of significance; we use .05
Step 3: Select the test statistic, t = b / s_b, with n − 2 = 13 degrees of freedom
Step 4: Formulate the decision rule: reject H0 if t > 1.771
Step 5: Make the decision: t = 6.205, so reject H0
Step 6: Interpret: the number of sales calls is useful in estimating copier sales

$$ s_b = \frac{s_e}{\sqrt{(n-1)\, s_x^2}}, \qquad s_e = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{n-2}} $$

In the output below, b is .2606 and its standard error is .0420.

              Coefficient   SE            t             p
Y intercept   19.98         4.389675533   4.551589258   0.000544
Sales calls   0.260625      0.042001817   6.205088676   0.00
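As a rough check on the output above, the standard error of the slope and the t statistic can be recomputed by hand. The sketch assumes s_b = s_e / √((n − 1)·s_x²) and t = b / s_b, and uses values that appear in this chapter: s_e = 6.720 (the standard error of estimate), s_x = 42.76, and the slope 0.260625 from the output.

```python
import math

n = 15          # sample size
s_e = 6.720     # standard error of estimate reported in this chapter
s_x = 42.76     # standard deviation of sales calls
b = 0.260625    # slope from the regression output

# Standard error of the slope.
s_b = s_e / math.sqrt((n - 1) * s_x ** 2)

# Test statistic for H0: beta <= 0 against H1: beta > 0.
t = b / s_b
t_crit = 1.771  # one-tailed critical value for alpha = .05, df = 13 (from the text)

print(round(s_b, 4), round(t, 3), t > t_crit)
```

Small differences from the printed output in the later decimals come from the rounded inputs.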


Evaluating a Regression Equation’s Ability
to Predict
 Perfect prediction is practically impossible in almost all
disciplines, including economics and business
 The North American Copier Sales example showed a
significant relationship between sales calls and copier
sales, the equation is
Number of copiers sold = 19.9632 + .2608(Number of sales calls)
 If the number of sales calls is 84, the predicted number of copiers
sold is 41.8704; we did have two employees with 84 sales calls, and
they sold just 30 and 43
 So, is the regression equation a good predictor?
 We need a measure that will tell how inaccurate the
estimate might be: the standard error of estimate (how accurately
the estimate predicts Y on the basis of X).
The Standard Error of Estimate
 The standard error of estimate measures the variation
around the regression line
STANDARD ERROR OF ESTIMATE A measure of the dispersion, or scatter,
of the observed values around the line of regression for a given value of x.

 It is based on squared deviations from the regression line


 Small values indicate that the points cluster closely about
the regression line
 It is computed using the following formula
$$ s_{y \cdot x} = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}} $$
The Standard Error of Estimate Example
We calculate the standard error of estimate in this example. We need the sum of
the squared differences between each observed value of y and the predicted value
of y, which is ŷ. We use a spreadsheet to help with the calculations.

The standard error of estimate is 6.720

If the standard error of estimate is small, this indicates that the data are relatively
close to the regression line and the regression equation can be used. If it is large,
the data are widely scattered around the regression line and the regression
equation will not provide a precise estimate of y.
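The copier data are not listed in this text, so the sketch below illustrates the calculation on the five data points from the Learning check at the end of this chapter, using the line fitted there (ŷ = −7 + 3x). It assumes the standard error of estimate is the square root of the sum of squared residuals divided by n − 2; the function name is a hypothetical helper.

```python
import math

# Data and fitted line from the Learning check at the end of this chapter.
x = [3, 3, 4, 4, 4]  # customer satisfaction
y = [1, 3, 5, 5, 5]  # number of purchases
a, b = -7, 3         # intercept and slope of the fitted line


def standard_error_of_estimate(xs, ys, a, b):
    """sqrt( sum of squared residuals / (n - 2) )."""
    squared_residuals = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(xs, ys)]
    return math.sqrt(sum(squared_residuals) / (len(xs) - 2))


s_yx = standard_error_of_estimate(x, y, a, b)
print(round(s_yx, 4))  # 0.8165
```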
Coefficient of Determination
COEFFICIENT OF DETERMINATION The proportion of the total variation in
the dependent variable Y that is explained, or accounted for, by the variation in
the independent variable X.
 It ranges from 0 to 1.0
 It is the square of the correlation coefficient
 It is found from the following formula
$$ r^2 = \frac{SSR}{SS\ Total} = 1 - \frac{SSE}{SS\ Total} $$
 SST = SSB + SSW; SSTotal = SSRegression + SSError
 In the North American Copier Sales example, the correlation coefficient was
.865; just square that: (.865)² = .748; this is the coefficient of determination
 This means 74.8% of the variation in the number of copiers sold is explained by
the variation in sales calls
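To see that the two routes to the coefficient of determination agree, the sketch below computes it both as 1 − SSE / SS Total and as the square of r. It uses the four workload and satisfaction observations from the hypothesis-testing example later in this chapter. The least squares line is fitted inside the code (slope = S_xy / S_xx), since the text does not report a regression line for those data.

```python
import math

# Workload (X) and satisfaction (Y) from the Acme Co. example in this chapter.
x = [2, 3, 4, 3]
y = [3, 2, 5, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n
s_xx = sum((xi - mean_x) ** 2 for xi in x)                       # variation in X
ss_total = sum((yi - mean_y) ** 2 for yi in y)                   # total variation in Y
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least squares line, fitted here for illustration only.
b = s_xy / s_xx
a = mean_y - b * mean_x

# Unexplained variation: squared residuals around the fitted line.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r = s_xy / math.sqrt(s_xx * ss_total)
r_squared_from_ss = 1 - sse / ss_total

print(round(r, 6), round(r ** 2, 6), round(r_squared_from_ss, 6))
```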
Relationships among r, r², and s_{y·x}
 Recall the standard error of estimate measures how close
the actual values are to the regression line
 When it is small, the two variables are closely related
 The correlation coefficient measures the strength of the
linear association between two variables
 When points on the scatter diagram are close to the
line, the correlation coefficient tends to be large
 Therefore, the correlation coefficient and the standard
error of estimate are inversely related
 As noted earlier, the coefficient of determination is the
correlation coefficient squared
Hypothesis testing using correlations:
Example
 Acme Co. is researching whether there is a linear
relationship between workload and employee
satisfaction. Workload and self-reported employee
satisfaction were recorded for 4 employees. Data are
provided below:

Workload   Satisfaction
2          3
3          2
4          5
3          5
 RQ: Is there a linear relationship between workload
and employee satisfaction?
 Hypotheses:
 H0: ρ = 0
 H1: ρ ≠ 0
 α = .05
 t(2)crit = ± 4.303
Workload (X)   Satisfaction (Y)   X²         Y²         XY
2              3                  4          9          6
3              2                  9          4          6
4              5                  16         25         20
3              5                  9          25         15
ΣX = 12        ΣY = 15            ΣX² = 38   ΣY² = 63   ΣXY = 47
$$ r = \frac{47 - \frac{12 \times 15}{4}}{\sqrt{\left(38 - \frac{12^2}{4}\right)\left(63 - \frac{15^2}{4}\right)}} = .544331 $$

$$ t = \frac{.544331\sqrt{4-2}}{\sqrt{1 - .544331^2}} = .917663 $$
 r(2) = .54, t(2) = .92, p > .05
 Retain the null. The correlation was not statistically
significant. These data do not provide evidence of a linear
relationship between workload and employee satisfaction.
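The worked example can be retraced in Python using the same column sums as the table. This is a sketch of the computational (raw-score) formula shown above. The critical value is copied from the text, and the variable names are illustrative.

```python
import math

workload = [2, 3, 4, 3]      # X
satisfaction = [3, 2, 5, 5]  # Y
n = len(workload)

# Column sums, as in the table.
sum_x = sum(workload)                                        # 12
sum_y = sum(satisfaction)                                    # 15
sum_x2 = sum(v ** 2 for v in workload)                       # 38
sum_y2 = sum(v ** 2 for v in satisfaction)                   # 63
sum_xy = sum(a * b for a, b in zip(workload, satisfaction))  # 47

# Computational formula for Pearson's r.
numerator = sum_xy - sum_x * sum_y / n
denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
r = numerator / denominator

# t statistic with n - 2 degrees of freedom.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
t_crit = 4.303  # two-tailed critical value for alpha = .05, df = 2 (from the text)

print(round(r, 6), round(t, 6), abs(t) > t_crit)  # 0.544331 0.917663 False
```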
Learning check
 Acme Co. is interested in whether there is a linear
relationship between customer satisfaction and number
of purchases. Customer satisfaction and number of
purchases were recorded for 5 customers. If there is a
linear relationship, calculate the predicted number of
purchases if customer satisfaction is 2.5. Data are provided
below:

Satisfaction   # of Purchases
3              1
3              3
4              5
4              5
4              5
Learning check
 RQ: Is there a linear relationship between satisfaction and
number of purchases?
 Hypotheses:
 H0: ρ = 0
 H1: ρ ≠ 0
 α = .05
 t(3)crit = ±3.182
Learning check

Satisfaction (X)   # of Purchases (Y)   X²         Y²         XY
3                  1                    9          1          3
3                  3                    9          9          9
4                  5                    16         25         20
4                  5                    16         25         20
4                  5                    16         25         20
ΣX = 18            ΣY = 19              ΣX² = 66   ΣY² = 85   ΣXY = 72
Learning check
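$$ r = \frac{72 - \frac{18 \times 19}{5}}{\sqrt{\left(66 - \frac{18^2}{5}\right)\left(85 - \frac{19^2}{5}\right)}} = .918559 $$

$$ t = \frac{.918559\sqrt{5-2}}{\sqrt{1 - .918559^2}} = 4.02 $$

 r(3) = .92, t(3) = 4.02, p < .05
 r² = .918559² = .843751

$$ b = .918559\left(\frac{1.788854}{.547723}\right) = 3 $$

$$ a = 3.8 - (3 \times 3.6) = -7 $$

$$ \hat{y} = 3(2.5) - 7 = .5 $$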
Learning check
 Reject the null and accept the alternative. The correlation
is statistically significant. There is a linear relationship
between customer satisfaction and number of purchases.
84% of the variance in number of purchases can be
explained by customer satisfaction. For every one-unit increase in
customer satisfaction, we’d expect 3 additional purchases.
If customer satisfaction was reported as 2.5, we’d expect
.5 purchases.

ŷ = −7 + 3x: for every one-unit increase in x, Y increases by 3 units
(the regression coefficient).
