Unit 2 notes

The document provides an overview of methods for analyzing two-variable data, including techniques for comparing categorical and quantitative variables. It covers the use of two-way tables, segmented bar graphs, scatterplots, correlation coefficients, and linear regression models. Key concepts such as conditional distributions, relative risk, residuals, and the effects of outliers and leverage points are also discussed.

Uploaded by rishi.nrkr

Unit 2 Notes – Exploring Two-Variable Data

TPS – Chapter 3, Chapter 1 (p. 13-22)


Workshop Statistics (WS3) – Topics 6, 26-28
Compare 2 categorical variables: (TPS-ch. 1 p.13-22) (WS3-Topic 6)

 Be able to set up/read two-way tables.
 Be able to create segmented bar graphs.
 Calculate conditional distributions. (recognize explanatory/response variables)
 Know what makes two variables independent.
 Be able to calculate relative risk and interpret it in context.
Compare 2 quantitative variables: (TPS-ch. 3)(WS3-Topics 26-28)

 Be able to draw/read scatterplots. (explanatory variable on the x-axis!)
 Be able to interpret the correlation coefficient (r). Know the properties of r.
 Interpret the slope and y-intercept in context.
 Recognize the definition of r².
 Be able to write the equation of a least squares regression line either with the formulas,
the calculator, or computer printouts.
 Be able to find predicted/fitted values using the LSRL.
 Find residuals.
 Identify outliers, high leverage points and influential observations and their effects.

Two Categorical Variables:

 Calculate statistics for two categorical variables:
o The marginal relative frequencies are the row and column totals in a two-way
table divided by the total for the entire table.
o A conditional relative frequency is a relative frequency for a specific part of the
table – cell frequencies divided by the total for that row or column.
 A segmented bar graph can be used to compare two categorical variables. Be sure to
put the explanatory variable on the x-axis. Always use relative frequencies (percents) on
the y-axis, and each bar should always go up to 100%.
 The relative risk between two groups is the larger % divided by the smaller %. It is
interpreted as how many times more likely one group is than the other to display a
certain characteristic.
 Two variables are considered independent if knowing one does not affect the likelihood
of the other. In a segmented bar graph this leads to the bars having the same, or very
close percents.
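These calculations can be sketched in Python. The counts below are invented for
illustration (a hypothetical drug/placebo study), not from any real data set:

```python
# Hypothetical two-way table: rows = treatment group (explanatory),
# columns = outcome (response).
table = {
    "drug":    {"improved": 40, "not improved": 10},
    "placebo": {"improved": 20, "not improved": 30},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal relative frequency of each row: row total / grand total.
marginal = {grp: sum(row.values()) / grand_total for grp, row in table.items()}

# Conditional distribution of outcome within each row: cell / row total.
conditional = {
    grp: {out: n / sum(row.values()) for out, n in row.items()}
    for grp, row in table.items()
}

# Relative risk: the larger % improved divided by the smaller % improved.
p_drug = conditional["drug"]["improved"]        # 40/50 = 0.8
p_placebo = conditional["placebo"]["improved"]  # 20/50 = 0.4
relative_risk = max(p_drug, p_placebo) / min(p_drug, p_placebo)  # 2.0
```

Here the conditional distributions differ sharply (80% vs. 40% improved), so the
two variables are clearly not independent; with independence the two conditional
distributions would be (nearly) identical.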
Two Quantitative Variables:

 A bivariate quantitative data set consists of observations of two different quantitative
variables made on individuals in a sample or population.
 A scatterplot shows two numeric values for each observation, one corresponding to the
value on the x-axis and one to the value on the y-axis.
 An explanatory variable is a variable whose values are used to explain or predict
corresponding values for the response variable.
 A description of a scatterplot includes strength, direction, form, and unusual features.
o The direction of the association shown in a scatterplot, if any, can be described
as positive or negative.
 A positive association means that as values of one variable increase, the
values of the other variable tend to increase.
 A negative association means that as values of one variable increase,
values of the other variable tend to decrease.
o The strength of the association is how closely the individual points follow a
specific pattern (such as linear), as seen in the scatterplot. Strength can be
described as strong, moderate, or weak.
o The form of the association shown in a scatterplot, if any, can be described as
linear or non-linear.
o Unusual features of a scatterplot include clusters of points or points with
relatively large discrepancies between the value of the response variable and a
predicted value for the response variable.
 Correlation:
o The correlation, r, gives the direction and quantifies the strength of the linear
association between two quantitative variables.
o The correlation, r, is unit-free, and is always between -1 and 1, inclusive. A value
of 0 indicates there is no linear association. A value of 1 or -1 indicates that
there is a perfect linear association.
o Correlation does not make a distinction between explanatory and response
variables. Switching x and y does not change the correlation.
o r uses the standardized values of the observations (z-scores), so r does not
change when we change the units of measurement of x, y, or both. (Changing
from inches to cm does not affect the correlation.) Correlation has no units of
measurement; it is just a number.
o Correlation measures the strength of only a linear relationship between two
variables. Correlation does not describe curved relationships, no matter how
strong they are.
o Correlation is not resistant. It is strongly affected by outliers.
o A correlation close to 1 or -1 does not necessarily mean that a linear model is
appropriate. Remember to also look at the scatterplot and/or the residual plot
to see if the data is actually linear.
o A perceived or real relationship between two variables does not mean that
changes in one variable cause changes in the other. Correlation does not
necessarily imply causation.
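The z-score definition of r can be sketched directly. The heights/weights below are
made up for illustration; the sketch uses the sample correlation (n − 1 in the
denominator) and checks two of the properties above, unit-invariance and symmetry:

```python
import math

def correlation(xs, ys):
    """Sample correlation: average product of z-scores, using n - 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return sum((x - mx) / sx * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

heights_in = [60, 62, 65, 68, 71]        # invented data
weights_lb = [110, 120, 140, 155, 170]

r = correlation(heights_in, weights_lb)
# Unit-free: converting inches to centimeters leaves r unchanged.
r_cm = correlation([h * 2.54 for h in heights_in], weights_lb)
# Symmetric: swapping x and y leaves r unchanged.
r_swap = correlation(weights_lb, heights_in)
```

(Python 3.10+ also provides `statistics.correlation` in the standard library, which
could replace the helper above.)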
 Linear Regression Models:
o A simple linear regression model is an equation that uses an explanatory
variable, x, to predict the response variable, y.
o The predicted response value, ŷ, is calculated as ŷ = a + bx, where a is the
y-intercept, b is the slope of the regression line, and x is the value of the
explanatory variable.
o Extrapolation is predicting a response value using a value for the explanatory
variable that is beyond the interval of x-values used to determine the regression
line. The predicted value is less reliable as an estimate the further we
extrapolate.
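A tiny sketch of prediction and the extrapolation warning. The line (relating study
hours to exam score) is entirely hypothetical:

```python
# Hypothetical fitted line: predicted exam score = 50 + 5 * (hours studied),
# fit from data where x ranged roughly from 0 to 8 hours.
a, b = 50.0, 5.0

def predict(x):
    """Predicted response ŷ = a + bx."""
    return a + b * x

inside = predict(4)    # 70.0 — x is within the interval used to fit the line
outside = predict(40)  # 250.0 — extrapolation: an impossible exam score;
                       # the linear pattern need not extend this far out
```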
 Residuals:
o The residual is the difference between the actual value and the predicted value
(A-P): residual = y − ŷ.
o The sum and mean of the residuals are always 0.
o A residual plot is a plot of residuals vs. explanatory variable values or predicted
response values.
o Residual plots can be used to investigate the appropriateness of a selected
model. Apparent randomness in a residual plot for a linear model is evidence of
a linear form to the association between the variables. A pattern indicates that
the attempted model is not a good fit.
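A sketch of computing residuals for a least squares line, using three invented
points. The slope here is computed directly from the deviations (equivalent to the
b = r·(sy/sx) formula given below for the LSRL):

```python
xs = [1, 2, 3]   # invented data
ys = [2, 4, 5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

predicted = [a + b * x for x in xs]                  # fitted values ŷ
residuals = [y - p for y, p in zip(ys, predicted)]   # actual − predicted

# For a least squares line the residuals always sum (and average) to zero,
# up to floating-point error.
```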
 Least Squares Regression Line (LSRL): ŷ = a + bx
o The LSRL minimizes the sum of the squares of the residuals and always passes
through the point (x̄, ȳ).
o x and y are the variables of the LSRL, while a and b are the coefficients of the
line.
o b is the slope of the LSRL and can be calculated by b = r · (sy/sx).
o The slope can be interpreted as the predicted change in y for every one unit
increase in x.
o The y-intercept can be calculated by ȳ = a + bx̄, or a = ȳ − bx̄.
o The y-intercept is the predicted value of the response variable (y) when the
explanatory variable (x) is 0.
o Sometimes the y-intercept of the line does not have a logical interpretation in
context.
o r² is the square of the correlation, r. It is also called the coefficient of
determination. r² is the proportion of variation in the response variable (y) that
is explained by the regression line (LSRL) with the explanatory variable (x).
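The LSRL formulas above can be sketched from summary statistics. The data are
invented; the sketch verifies that the line passes through (x̄, ȳ) and that the
proportion of variation explained (1 − SSresid/SStotal) equals r²:

```python
import math

xs = [1, 2, 4, 5, 8]   # invented data
ys = [3, 5, 6, 9, 12]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sx = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
sy = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
r = sum((x - x_bar) / sx * ((y - y_bar) / sy)
        for x, y in zip(xs, ys)) / (n - 1)

b = r * sy / sx        # slope:       b = r · (sy/sx)
a = y_bar - b * x_bar  # y-intercept: a = ȳ − b·x̄

# The LSRL always passes through (x̄, ȳ).
assert abs((a + b * x_bar) - y_bar) < 1e-9

# r² as the proportion of variation in y explained by the line:
ss_tot = sum((y - y_bar) ** 2 for y in ys)
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - ss_res / ss_tot   # equals r ** 2
```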
 Departures from Linearity (unusual features)
o An outlier in regression is a point that does not follow the general trend shown
in the rest of the data and has a large residual when the LSRL is calculated. (will
stick out on the y-axis from the rest of the data – far off the line)
o A high-leverage point in regression has a substantially larger or smaller x-value
than the other observations. (will stick out on the x-axis from the rest of the
data)
o An influential point in regression is any point that, if removed, changes the
relationship substantially. Examples include much different slope, y-intercept,
and/or correlation. Outliers and high leverage points are often influential.
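A sketch of checking influence by refitting without the suspect point. The data
are invented: four points close to y = 2x, plus one high-leverage point far out
on the x-axis that falls well below the trend:

```python
def lsrl(xs, ys):
    """Slope and intercept (a, b) of the least squares regression line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
    return y_bar - b * x_bar, b

xs = [1, 2, 3, 4, 20]   # last x-value sticks out: high leverage
ys = [2, 4, 6, 8, 10]   # and its y-value falls far below the y = 2x trend

a_all, b_all = lsrl(xs, ys)              # slope dragged down by the point
a_trim, b_trim = lsrl(xs[:-1], ys[:-1])  # refit without it: slope = 2 exactly

# Removing the point changes the slope substantially, so it is influential.
```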
 Transformations of data
o Transformations of variables, such as taking the natural log of each value of
y, or squaring each value of x, can be used to create transformed data sets,
which may be more linear in form than the untransformed data.
o Increased randomness in residual plots after transformation of data and/or an r²
value closer to 1 offers evidence that the LSRL for the transformed data is a more
appropriate model for predicting the response from the explanatory variable than
the line for the untransformed data.
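A sketch of a log transformation linearizing curved data. The data are constructed
to be exactly exponential (y = 3·2^x), so ln(y) is a perfect linear function of x
and the correlation after transforming is essentially 1:

```python
import math

xs = [1, 2, 3, 4, 5, 6]            # invented data
ys = [3 * 2 ** x for x in xs]      # exponential growth: curved, not linear

def correlation(xs, ys):
    """Sample correlation via sums of squared deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r_raw = correlation(xs, ys)                         # high, but misleading
r_log = correlation(xs, [math.log(y) for y in ys])  # essentially 1

# r_log² is (essentially) 1, so the LSRL fit to (x, ln y) is the more
# appropriate model for these data than the line fit to (x, y).
```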
