Correlation & Regression

The document serves as a lab guide for biostatistics and epidemiology, focusing on inferential statistical procedures for correlation and regression analysis. It outlines the learning outcomes, graphical approaches, types of correlation coefficients, and regression analysis methods, including assumptions and interpretations. Additionally, it provides examples and hypothesis testing related to correlation and regression in health-related studies.

BIOSTATISTICS AND EPIDEMIOLOGY (L) LABORATORY
Lab Guide 03
Learning Outcomes
• Relate inferential statistical procedures for correlation and regression with data characteristics and sampling procedures
• Use appropriate inferential statistical procedures for correlation and regression for a given data set
• Compute key inferential statistics measures for correlation and regression
• Write narratives for inferential statistical analysis results for correlation and regression in report format
INVESTIGATING RELATIONSHIPS BETWEEN VARIABLES
Graphical approach
• Qualitative variables: comparative bar graph
• Quantitative variables: scatterplot (scatter diagram)

Scatterplot patterns:
• Direct/Positive: points rise from left to right.
• Inverse/Negative: points fall from left to right.
• No relationship: points are scattered randomly with no pattern.
• Inverted U-shape: points rise, reach a peak, and then fall.
Variable Type            | Objective   | Method                                     | Example
Qualitative              | Association | Chi-square test of association             | Relationship between smoking and lung disease
Quantitative             | Correlation | Correlation coefficient (Pearson/Spearman) | Relationship between study hours and exam scores
Quantitative/Qualitative | Prediction  | Regression analysis                        | Predicting blood pressure from age and weight
Correlation analysis
• Used when the objective is to measure the strength and direction of the linear relationship between 2 or more quantitative variables
• The value of the correlation coefficient provides a quantitative measure of the strength of the relationship between X and Y
Correlation analysis
• done through the computation of a correlation coefficient
• the correlation coefficient measures the extent or degree of linear relationship between two variables
    r – sample (statistic)
    rho (ρ) – population (parameter)
Correlation analysis: Key points
• Strength of the relationship between X and Y – indicated by r
    strong relationship: r is close to +1 or -1
    weak or nonexistent: r is close to 0
• Direction
    Positive correlation (r > 0): X↑Y↑ or X↓Y↓
    Negative correlation (r < 0): X↑Y↓ or X↓Y↑
Aspect                  | Parametric Correlation                                                                    | Non-Parametric Correlation
Common Tests            | Pearson's correlation coefficient (r)                                                     | Spearman's rank correlation coefficient (ρ), Kendall's tau
Assumptions             | Assumes the data are normally distributed and the relationship between variables is linear | Does not assume normality or linearity; suitable for ordinal or non-continuous data
Data Type               | Interval or ratio data (continuous)                                                       | Ordinal, interval, or ratio data
Sensitivity to Outliers | Highly sensitive; extreme values can significantly affect the correlation coefficient     | Less sensitive, as it uses the ranks of the data rather than their actual values
Calculation Method      | Based on the actual values of the variables                                               | Based on the ranks of the values, not their actual values
Interpretation          | Indicates the linear relationship between variables                                       | Indicates the monotonic (not necessarily linear) relationship between variables
Kinds of correlation coefficients

Parametric
• Pearson's correlation coefficient (r)

Non-parametric
• Spearman's rank-order correlation coefficient (rs)
• Kendall's rank correlation coefficient (τ – tau)

PEARSON'S CORRELATION COEFFICIENT
Pearson's correlation coefficient
• used when the variables X and Y are both measured on the interval or ratio scales of measurement
• can be validly applied when both X and Y, as well as their joint distribution, are normally distributed
• applicable only when determining the existence and magnitude of a linear relationship between X and Y
Pearson's correlation coefficient
ASSUMPTIONS:
• For each value of variable X, the corresponding subpopulation of values of variable Y is normally distributed.
• For each value of variable Y, the corresponding subpopulation of values of variable X is normally distributed.
• The joint distribution of the variables X and Y is also normally distributed (bivariate normal distribution).
Pearson's correlation coefficient
• The values of r (or ρ) range from -1 to +1.
• If r = 0, the variables have no linear relationship.
• Strength of the relationship
    The nearer the computed value is to +1 or -1, the stronger the linear relationship.
    (variables with r = 0.85 are more strongly correlated with each other than a pair with r = 0.30)
Pearson's correlation coefficient
• The direction and nature of the relationship between X and Y is determined by the sign and value of r
    Positive values – direct relationship: high values of Y are associated with high values of X, and vice versa
    Negative values – inverse relationship
• The WHO Research Manual recommends a standard scale for interpreting the magnitude of r.
Sample
• To determine the correlation between systolic blood pressure (SBP) in the recumbent position and the standing position among patients using Pearson's correlation coefficient.

Step 1: Construct a scatterplot of the data.

Systolic blood pressure (mmHg) of 10 patients in the recumbent and standing positions:

Patient | SBP Recumbent (mmHg) | SBP Standing (mmHg)
1       | 170                  | 180
2       | 180                  | 190
3       | 190                  | 185
4       | 150                  | 155
5       | 180                  | 185
6       | 135                  | 125
7       | 200                  | 205
8       | 180                  | 170
9       | 200                  | 190
10      | 140                  | 145

[Figure: scatterplot of SBP in the recumbent position (y-axis, mmHg) against SBP in the standing position (x-axis, mmHg)]

Reading the scatterplot:
• Linearity: linear
• Direction: positive
• Strength: strong
→ Strong positive linear relationship
Test of hypothesis
• H0: ρ = 0 (There is no correlation between the SBP of patients in the recumbent and standing positions.)
• H1: ρ ≠ 0 (There is a correlation between the SBP of patients in the recumbent and standing positions.)
• α = 0.05
• Test statistic: t = r√(n - 2) / √(1 - r²), with df = n - 2 = 10 - 2 = 8
• Critical region: t ≥ 2.306 or t ≤ -2.306
Test of hypothesis
• Statistical decision: The null hypothesis is rejected since the computed t statistic (7.82) falls within the critical region (t ≥ 2.306).
• Conclusion: Therefore, the systolic blood pressure of patients in the recumbent and standing positions is correlated. Further, there is a positive correlation between the variables.
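The computation above can be reproduced directly from the definition of Pearson's r and the t statistic t = r√(n - 2) / √(1 - r²). A minimal pure-Python sketch using the SBP data from the table:

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient from raw paired values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# SBP data from the lab guide (recumbent vs standing, mmHg)
recumbent = [170, 180, 190, 150, 180, 135, 200, 180, 200, 140]
standing  = [180, 190, 185, 155, 185, 125, 205, 170, 190, 145]

r = pearson_r(recumbent, standing)
t = t_statistic(r, len(recumbent))
print(f"r = {r:.3f}, t = {t:.2f}")  # r = 0.940, t = 7.82
```

Since 7.82 > 2.306, the null hypothesis is rejected, matching the decision above.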
Sample
Objective: To examine the relationship between the number of hours of physical activity per week and cholesterol levels (measured in mg/dL) in a sample of adults.

Participant | Hours of Physical Activity (X) | Cholesterol Level (Y, mg/dL)
1           | 2                              | 220
2           | 5                              | 210
3           | 3                              | 215
4           | 7                              | 180
5           | 1                              | 240
6           | 6                              | 185
7           | 4                              | 210
8           | 3                              | 225
9           | 2                              | 230
10          | 5                              | 200

[Figure: scatterplot of cholesterol level (Y, mg/dL) against hours of physical activity (X)]

→ Strong negative linear relationship
Test of hypothesis
• Null hypothesis (H₀): There is no significant relationship between the number of hours of physical activity and cholesterol levels (ρ = 0).
• Alternative hypothesis (H₁): There is a significant relationship between the number of hours of physical activity and cholesterol levels (ρ ≠ 0).
• α = 0.05
• Test statistic: t = r√(n - 2) / √(1 - r²), with df = n - 2 = 10 - 2 = 8
• Critical region: t ≥ 2.306 or t ≤ -2.306
Test of hypothesis
• Statistical decision: The null hypothesis is rejected since the computed t statistic (-9.56) falls within the critical region (t ≤ -2.306).
• Conclusion: There is a significant correlation between the number of hours of physical activity and cholesterol levels. Further, there is a strong negative correlation between the variables.
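As a cross-check, the same Pearson formulas can be applied to the physical-activity data (the helpers are repeated here so the sketch stands alone); the computed r is about -0.96 and t is about -9.56, well inside the rejection region:

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient from raw paired values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hours of physical activity (X) and cholesterol (Y, mg/dL) from the table
hours       = [2, 5, 3, 7, 1, 6, 4, 3, 2, 5]
cholesterol = [220, 210, 215, 180, 240, 185, 210, 225, 230, 200]

r = pearson_r(hours, cholesterol)
t = t_statistic(r, len(hours))
print(f"r = {r:.3f}, t = {t:.2f}")  # r = -0.959, t = -9.56
```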
Pearson's correlation coefficient
COMMENTS
• High values of r do not necessarily imply a cause-effect relationship between X and Y.
• Whether a value of r is considered high or low is frequently a subjective decision and depends to a large extent on the type of variables being correlated.
• The value of r measures only the degree of linear relationship between X and Y, not all types of relationships.
• The correlation coefficient r can be greatly affected by just one data point (or a few data points).
SPEARMAN RANK CORRELATION COEFFICIENT (ρ or rs)
Description and Use
a. A nonparametric version of the Pearson product-moment correlation
    ➢ There is no requirement of normality
b. This procedure uses the two sets of ranks that may be assigned to the X and Y sample values, the independent and continuous variables of a bivariate distribution.
Description, Use and Assumptions
a. Used when the variables X and Y are measured on at least an ordinal scale. It is also used on interval or ratio data when the assumptions of the Pearson correlation are markedly violated.
b. Determines the strength and direction of the monotonic relationship between the two variables, rather than the strength and direction of the linear relationship.
➢ A monotonic relationship is a relationship that does one of the following:
• direct relationship between X and Y [ ↑X ↑Y or ↓X ↓Y ]
• inverse relationship between X and Y [ ↑X ↓Y or ↓X ↑Y ]
➢ Monotonicity is "less restrictive" than linearity. Thus,
➢ the Spearman rank correlation coefficient is much less sensitive to outlying values than Pearson's correlation coefficient.
rs INTERPRETATION
a. What is the direction of the relationship between X and Y?
b. How strong is the MONOTONIC relationship between X and Y?
• rs ranges in value from -1.0 to 1.0
NOTE
✓ rs is a measure of a monotonic relationship, thus
✓ rs = 0 does NOT imply NO relationship between the variables
✓ for example, in a scatterplot of a perfect (symmetric) quadratic relationship, rs = 0 implies no monotonic correlation, yet there is a perfect quadratic relationship
Sample
To determine if there is a significant relationship between the participants' age and their performance scores using a correlation analysis.

[Figure: scatterplot of performance score (y-axis, 0–120) against age (x-axis, 0–40)]

The scatterplot shows a non-linear pattern.
Test of hypothesis
• Null hypothesis (H0): There is no significant correlation between age and performance score (ρ = 0).
• Alternative hypothesis (H1): There is a significant correlation between age and performance score (ρ ≠ 0).
• α = 0.05
• Test statistic: Spearman's rs, which indicates a strong monotonic inverse relationship between age and performance score.
• Statistical decision: Since the p-value is lower than α, we reject the null hypothesis.
• Conclusion: There is a significant correlation between age and performance score.
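Spearman's rs is simply Pearson's r computed on the ranks of the data. The guide does not list the raw age/score values, so the pairs below are hypothetical, chosen to show a non-linear but monotonically decreasing pattern; this is a pure-Python sketch including average ranks for ties:

```python
import math

def ranks(values):
    """Average ranks, 1-based (ties share the mean of their rank positions)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rk = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the group of tied values
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            rk[order[k]] = avg
        i = j + 1
    return rk

def spearman_rs(x, y):
    """Spearman's rank correlation = Pearson's r computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical age/performance pairs: a non-linear but monotonic decline
ages   = [18, 22, 25, 30, 33, 37]
scores = [98, 95, 88, 70, 45, 20]
print(round(spearman_rs(ages, scores), 2))  # -1.0: perfectly monotonic decrease
```

Because rs depends only on ranks, the curved decline still gives rs = -1 even though a straight line fits the raw points poorly.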
REGRESSION ANALYSIS
Regression analysis
• A statistical technique which determines and quantifies the extent to which one quantitative variable, X, varies with another quantitative variable, Y, without necessarily assuming that Y is dependent on X.
• The relationship between Y and X is described by means of the equation of the curve which best fits the data.
Regression analysis
• After the regression equation has been derived, values of the independent variables can be substituted in order to determine the predicted values of the dependent variable.
• The effect of each independent variable on the dependent variable is measured by the magnitude and sign of the corresponding regression coefficient.
Basis of Classification                        | Categories                                | Type of Regression Analysis
Number of independent variables                | One                                       | Simple regression
                                               | Two or more                               | Multiple regression
Shape of the relationship between the          | Straight line                             | Linear regression
dependent and independent variables            | Curvilinear                               | Polynomial or curvilinear regression
Level of measurement of the dependent variable | Quantitative                              | "Standard" (least squares) regression
                                               | Qualitative, binary                       | Logistic regression
                                               | Qualitative with several categories       | Categorical data analysis
Nature of the dependent variable               | Any quantitative variable other than time | "Standard" (least squares) regression
                                               | Time for an event to occur                | Survival analysis
SIMPLE LINEAR REGRESSION ANALYSIS
Simple linear regression analysis
• Looks at the linear relationship between one quantitative dependent variable (Y) and one independent variable (X).
Simple linear regression analysis
The assumptions for the valid application of simple linear regression analysis are as follows:
• For any fixed X, Y has a normal distribution.
• The variance of Y is the same for any value of X. This assumption of equality of variance is called "homoscedasticity".
• The value of Y at one value of X does not depend on, and is not affected by, the value of Y at another value of X. This assumption is called "independence".
Step 1: looking for the straight line that best fits the data
• This is done by the method of least squares.
• Under this method, the best line among all the possible lines which can be fitted to the data is the one which gives the minimum value of the sum of the squares of the vertical deviations of each data point from the corresponding value on the line.

General formula for the prediction equation in simple linear regression analysis:
Ŷ = b0 + b1X
General formula for the slope:
b1 = Σ(Xi - X̄)(Yi - Ȳ) / Σ(Xi - X̄)²

General formula for the intercept:
b0 = Ȳ - b1X̄
Sample
• A researcher is studying the relationship between the number of hours studied and the test scores achieved by students. The researcher wants to predict the test score based on the number of hours studied.

Student | Hours Studied (x) | Test Score (y)
1       | 1                 | 50
2       | 2                 | 55
3       | 3                 | 60
4       | 4                 | 65
5       | 5                 | 70
Regression result
• Intercept (b₀): 45
    predicted value of Y when X = 0
• Slope (b₁): 5
    rate of change in Y for a one-unit increase in X
• Regression equation: ŷ = 45 + 5x
R-squared (R²)
• how well the model fits the data
• R² ranges from 0 to 1; a value closer to 1 indicates a better fit.
• Here R² = 1: the model perfectly explains the variation in the dependent variable.
Standard Error
• tells you the precision of the regression coefficients (intercept and slope)
• smaller values indicate more precise estimates
• Here SE = 0, because the sample data lie exactly on the fitted line.
Prediction
• If a student studies for 6 hours, what is the predicted test score?
• ŷ = 45 + 5(6) = 75
• The predicted test score for a student who studies for 6 hours is 75.
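The slope and intercept formulas above can be applied directly to the hours-studied data; a minimal pure-Python least-squares sketch:

```python
def simple_linear_regression(x, y):
    """Return (b0, b1) minimizing the sum of squared vertical deviations:
    b1 = sum((xi - mean_x)(yi - mean_y)) / sum((xi - mean_x)^2), b0 = mean_y - b1 * mean_x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    b0 = my - b1 * mx
    return b0, b1

# Hours studied (x) and test scores (y) from the table
hours  = [1, 2, 3, 4, 5]
scores = [50, 55, 60, 65, 70]

b0, b1 = simple_linear_regression(hours, scores)
print(b0, b1)       # 45.0 5.0
print(b0 + b1 * 6)  # 75.0 — predicted score after 6 hours of study
```

Because these five points lie exactly on a line, the fit recovers b₀ = 45 and b₁ = 5 with zero residual, consistent with R² = 1 and SE = 0 above.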
Assessing the utility of the fitted regression equation
1. Compute the coefficient of determination, R².
2. Test the hypothesis H₀: β₁ = 0
• H₀: β₁ = 0 (Hours studied does not affect test score)
• Hₐ: β₁ ≠ 0 (Hours studied is a significant predictor of test score)
• α = 0.05
• Statistical decision: reject the null hypothesis since the p-value of the ANOVA F-test is less than 0.05.
MULTIPLE REGRESSION ANALYSIS
Multiple regression
• used to study the relationship between a dependent variable Y and several independent variables
Sample
• A public health researcher aims to understand the factors
influencing Blood Pressure (BP) among adults. The study
considers the following independent variables (predictors):
1.Age (in years) (X1)
2.Body Mass Index (BMI) (X2)
3.Daily Sodium Intake (in mg) (X3)
• The goal is to develop a multiple linear regression model
to predict Blood Pressure (BP) based on these three
predictors.
• Multiple R (0.9988): indicates a strong correlation between the predictors and the dependent variable.
• R Square (0.9976): the model accounts for approximately 99.76% of the variance in the dependent variable.
• Standard Error (0.4360): a small value indicates better model accuracy.
• Observations (15): the sample size used in the regression analysis.
• F-statistic (1522.58): a very high F-value with a p-value < 0.0001 indicates that the overall regression model is statistically significant.
Intercept (104.45)
• The predicted value of the dependent variable when all predictors are zero.
• Significant p-value: the intercept differs reliably from zero.
Age (X1)
• Coefficient (1.877): a one-unit increase in age increases blood pressure by 1.877 units, holding all other factors constant.
• Significant p-value: we can be confident that the effect of age on blood pressure is not due to random chance.
BMI (X2)
• Coefficient (-4.109): a one-unit increase in BMI decreases blood pressure by 4.109 units, holding all other factors constant.
• Significant p-value: we can be confident that the effect of BMI on blood pressure is not due to random chance.
Sodium intake (X3)
• Coefficient (0.039): a one-unit increase in sodium intake increases blood pressure by 0.039 units, holding all other factors constant.
• Significant p-value: we can be confident that the effect of sodium intake on blood pressure is not due to random chance.
What is the predicted value of the dependent variable Y for given values of the variables X1, X2, X3, ...?
• Determined by computing the value of Y after substituting the given values of all the X's and the constant into the regression equation.

General form: Y = b0 + b1X1 + b2X2 + b3X3

Fitted equation: Y = 104.45 + 1.88X1 – 4.11X2 + 0.04X3
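Substituting into the fitted equation is plain arithmetic. The predictor values below (age 50, BMI 25, sodium 2000 mg) are illustrative assumptions, not part of the study data:

```python
def predict_bp(age, bmi, sodium_mg):
    """Predicted blood pressure from the fitted equation in the example:
    Y = 104.45 + 1.88*X1 - 4.11*X2 + 0.04*X3."""
    return 104.45 + 1.88 * age - 4.11 * bmi + 0.04 * sodium_mg

# Illustrative values: a 50-year-old with BMI 25 and 2000 mg daily sodium
print(round(predict_bp(50, 25, 2000), 2))  # 175.7
```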
LOGISTIC REGRESSION
Logistic regression
• Most popular regression technique for modeling qualitative outcome variables
• Y: qualitative (binary or with more than two categories)
• X: quantitative or qualitative variables
Assumptions
• Binary outcome: the dependent variable must be binary (two possible outcomes)
• Independence of observations: the observations should be independent of each other
• Linearity of the log-odds: the relationship between the predictor variables and the log-odds of the outcome must be linear
• No multicollinearity: the independent variables should not be highly correlated.
Sample
• An epidemiologic study was conducted to assess the factors contributing to hypertension development in a population. The dataset includes variables such as age, gender, body mass index (BMI), smoking status, alcohol consumption, and physical activity level. The outcome variable is whether a person has hypertension (1 = Yes, 0 = No).

ID | Age (years) | Gender (1 = Male, 0 = Female) | BMI | Smoking (1 = Yes, 0 = No) | Alcohol Consumption (1 = Yes, 0 = No) | Physical Activity (1 = Yes, 0 = No) | Hypertension (1 = Yes, 0 = No)
1  | 55 | 1 | 30 | 1 | 1 | 0 | 1
2  | 45 | 0 | 28 | 0 | 0 | 1 | 0
3  | 60 | 1 | 32 | 1 | 1 | 0 | 1
4  | 40 | 0 | 22 | 0 | 0 | 1 | 0
5  | 50 | 1 | 35 | 1 | 0 | 1 | 1
6  | 35 | 0 | 25 | 0 | 1 | 1 | 0
7  | 48 | 1 | 33 | 1 | 0 | 0 | 1
8  | 52 | 0 | 29 | 0 | 1 | 0 | 0
9  | 44 | 1 | 31 | 1 | 1 | 1 | 1
10 | 60 | 0 | 36 | 1 | 1 | 0 | 1
```
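Logistic regression models the log-odds of the outcome as a linear function of the predictors. The sketch below is a simplification, not the full multivariable model from the study: it fits only the smoking predictor from the table, using plain gradient ascent on the log-likelihood. Note that in this tiny dataset smoking perfectly separates the hypertension outcome, so the coefficients drift upward without converging to a finite maximum (in practice software flags this or applies regularization); the fitted probabilities still illustrate the direction of the effect.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, lr=1.0, iters=5000):
    """Fit P(Y=1) = sigmoid(b0 + b1*x) by gradient ascent on the log-likelihood."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(iters):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            err = yi - sigmoid(b0 + b1 * xi)  # residual on the probability scale
            g0 += err
            g1 += err * xi
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Smoking status and hypertension outcome from the sample dataset
smoking      = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1]
hypertension = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(smoking, hypertension)
p_smoker, p_nonsmoker = sigmoid(b0 + b1), sigmoid(b0)
print(f"P(HTN | smoker) = {p_smoker:.2f}, P(HTN | non-smoker) = {p_nonsmoker:.2f}")
```

The fitted model assigns a high hypertension probability to smokers and a low one to non-smokers, consistent with the pattern in the table.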
