Unit 2 notes
The document provides an overview of methods for analyzing two-variable data, including techniques for comparing categorical and quantitative variables. It covers the use of two-way tables, segmented bar graphs, scatterplots, correlation coefficients, and linear regression models. Key concepts such as conditional distributions, relative risk, residuals, and the effects of outliers and leverage points are also discussed.
Compare 2 categorical variables: (TPS-ch. 2 topics)
o Be able to create segmented bar graphs.
o Calculate conditional distributions (recognize explanatory/response variables).
o Know what makes two variables independent.
o Be able to calculate relative risk and interpret it in context.

Compare 2 quantitative variables: (TPS-ch. 3)(WS3-Topics 26-28)
o Be able to draw/read scatterplots (explanatory variable on the x-axis!).
o Be able to interpret the correlation coefficient (r) and know the properties of r.
o Interpret the slope and y-intercept in context.
o Recognize the definition of r².
o Be able to write the equation of a least squares regression line using the formulas, the calculator, or computer printouts.
o Be able to find predicted/fitted values using the LSRL.
o Find residuals.
o Identify outliers, high-leverage points, and influential observations, and their effects.
Two Categorical Variables:
Calculate statistics for two categorical variables:
o The marginal relative frequencies are the row and column totals in a two-way table divided by the total for the entire table.
o A conditional relative frequency is a relative frequency for a specific part of the table: a cell frequency divided by the total for that row or column.
o A segmented bar graph can be used to compare two categorical variables. Put the explanatory variable on the x-axis, always use relative frequencies (percents) on the y-axis, and make each bar go up to 100%.
o The relative risk compares two conditional percents: the larger % divided by the smaller %. It is interpreted as how many times more likely the top group is than the bottom group to display a certain characteristic.
o Two variables are considered independent if knowing one does not affect the likelihood of the other. In a segmented bar graph this leads to the bars having the same, or very close, percents. (A short computational sketch of these calculations follows this list.)
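As a minimal sketch of these calculations, the following Python snippet computes marginal relative frequencies, conditional distributions, and the relative risk for a hypothetical two-way table (the group names, responses, and counts are invented for illustration):

    # Hypothetical two-way table: explanatory variable = group, response = answer.
    table = {
        "Group A": {"Yes": 30, "No": 20},
        "Group B": {"Yes": 15, "No": 35},
    }

    grand_total = sum(sum(row.values()) for row in table.values())

    # Marginal relative frequency of each row: row total / grand total.
    for group, row in table.items():
        print(group, "marginal:", sum(row.values()) / grand_total)

    # Conditional distribution within each group: cell count / row total.
    conditional = {
        group: {resp: count / sum(row.values()) for resp, count in row.items()}
        for group, row in table.items()
    }
    print(conditional)

    # Relative risk of answering "Yes": larger conditional percent / smaller.
    p_a = conditional["Group A"]["Yes"]
    p_b = conditional["Group B"]["Yes"]
    print("relative risk:", max(p_a, p_b) / min(p_a, p_b))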
Two Quantitative Variables:

A bivariate quantitative data set consists of observations of two different quantitative variables made on individuals in a sample or population. A scatterplot shows two numeric values for each observation, one corresponding to the value on the x-axis and one to the value on the y-axis. An explanatory variable is a variable whose values are used to explain or predict corresponding values of the response variable.

A description of a scatterplot includes strength, direction, form, and unusual features.
o The direction of the association shown in a scatterplot, if any, can be described as positive or negative. A positive association means that as values of one variable increase, values of the other variable tend to increase; a negative association means that as values of one variable increase, values of the other variable tend to decrease.
o The strength of the association is how closely the individual points follow a specific pattern (such as linear), and can be seen in a scatterplot. Strength can be described as strong, moderate, or weak.
o The form of the association shown in a scatterplot, if any, can be described as linear or non-linear.
o Unusual features of a scatterplot include clusters of points or points with relatively large discrepancies between the value of the response variable and a predicted value for the response variable.

Correlation:
o The correlation, r, gives the direction and quantifies the strength of the linear association between two quantitative variables.
o The correlation, r, is unit-free and is always between -1 and 1, inclusive. A value of 0 indicates there is no linear association; a value of 1 or -1 indicates a perfect linear association.
o Correlation does not make a distinction between explanatory and response variables. Switching x and y does not change the correlation.
o r uses the standardized values of the observations (z-scores), so r does not change when we change the units of measurement of x, y, or both. (Changing from inches to cm does not affect the correlation.) Correlation has no units of measurement; it is just a number.
o Correlation measures the strength of only a linear relationship between two variables. Correlation does not describe curved relationships, no matter how strong they are.
o Correlation is not resistant. It is strongly affected by outliers.
o A correlation close to 1 or -1 does not necessarily mean that a linear model is appropriate. Remember to also look at the scatterplot and/or the residual plot to see whether the data are actually linear.
o A perceived or real relationship between two variables does not mean that changes in one variable cause changes in the other. Correlation does not necessarily imply causation.

Linear Regression Models:
o A simple linear regression model is an equation that uses an explanatory variable, x, to predict the response variable, y.
o The predicted response value, ŷ, is calculated as ŷ = a + bx, where a is the y-intercept, b is the slope of the regression line, and x is the value of the explanatory variable.
o Extrapolation is predicting a response value using a value of the explanatory variable that is beyond the interval of x-values used to determine the regression line. The further we extrapolate, the less reliable the prediction.

Residuals:
o A residual is the difference between the actual value and the predicted value (A - P): residual = y - ŷ.
o The sum and the mean of the residuals are always 0.
o A residual plot is a plot of residuals vs. explanatory variable values or predicted response values.
o Residual plots can be used to investigate the appropriateness of a selected model. Apparent randomness in a residual plot for a linear model is evidence of a linear form to the association between the variables; a pattern indicates that the attempted model is not a good fit. (A short sketch computing r, the LSRL, and residuals follows.)
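As a minimal sketch of these ideas, the following Python snippet computes r from z-scores, fits the LSRL using the slope and intercept formulas given in the list below, and checks that the residuals sum to 0 (the data values are invented for illustration):

    from statistics import mean, stdev

    # Made-up bivariate data; x is the explanatory variable, y the response.
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.1, 3.9, 6.2, 8.1, 9.8]
    n = len(x)

    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)

    # r is based on the standardized values (z-scores) of the observations.
    r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
            for xi, yi in zip(x, y)) / (n - 1)

    # LSRL coefficients: b = r * (s_y / s_x) and a = y_bar - b * x_bar,
    # so the line passes through (x_bar, y_bar).
    b = r * (s_y / s_x)
    a = y_bar - b * x_bar

    # Predicted (fitted) values and residuals; the residuals sum to 0
    # (up to floating-point rounding).
    y_hat = [a + b * xi for xi in x]
    residuals = [yi - yh for yi, yh in zip(y, y_hat)]
    print("r =", round(r, 4), " LSRL: y-hat =", round(a, 3), "+", round(b, 3), "x")
    print("sum of residuals:", round(sum(residuals), 12))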
Least Squares Regression Line (LSRL): ŷ = a + bx
o The LSRL minimizes the sum of the squares of the residuals and always passes through the point (x̄, ȳ).
o x and y are the variables of the LSRL, while a and b are the coefficients of the line.
o b is the slope of the LSRL and can be calculated by b = r·(sy / sx), where sy and sx are the standard deviations of y and x.
o The slope can be interpreted as the predicted change in y for every one-unit increase in x.
o The y-intercept can be calculated from ȳ = a + b·x̄, or a = ȳ - b·x̄.
o The y-intercept is the predicted value of the response variable (y) when the explanatory variable (x) is 0.
o Sometimes the y-intercept of the line does not have a logical interpretation in context.
o r² is the square of the correlation, r. It is also called the coefficient of determination. r² is the proportion of variation in the response variable (y) that is explained by the regression line (LSRL) with the explanatory variable (x).

Departures from Linearity (unusual features)
o An outlier in regression is a point that does not follow the general trend shown in the rest of the data and has a large residual when the LSRL is calculated (it sticks out from the rest of the data along the y-axis, far off the line).
o A high-leverage point in regression has a substantially larger or smaller x-value than the other observations have (it sticks out from the rest of the data along the x-axis).
o An influential point in regression is any point that, if removed, changes the relationship substantially; examples include a much different slope, y-intercept, and/or correlation. Outliers and high-leverage points are often influential.

Transformations of data
o Transformations of variables, such as taking the natural log of each value of y or squaring each value of x, can be used to create transformed data sets, which may be more linear in form than the untransformed data.
o Increased randomness in the residual plot after transformation of the data and/or an r² value closer to 1 offers evidence that the LSRL for the transformed data is a more appropriate model for predicting responses from the explanatory variable than the line for the untransformed data. (A short sketch of a log transformation follows this list.)
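As a minimal sketch of this idea, the following Python snippet compares r² for made-up, roughly exponential data before and after taking the natural log of y; the fit to the transformed data has an r² much closer to 1:

    from math import log
    from statistics import mean, stdev

    def corr(xs, ys):
        # Sample correlation computed from standardized values (z-scores).
        xb, yb, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
        return sum(((xi - xb) / sx) * ((yi - yb) / sy)
                   for xi, yi in zip(xs, ys)) / (len(xs) - 1)

    x = [1, 2, 3, 4, 5, 6]
    y = [2.7, 7.5, 19.8, 54.0, 150.4, 405.2]      # invented, roughly e**x

    r2_raw = corr(x, y) ** 2                      # linear fit of y on x
    r2_log = corr(x, [log(yi) for yi in y]) ** 2  # linear fit of ln(y) on x
    print("r^2 for y vs. x:    ", round(r2_raw, 4))
    print("r^2 for ln(y) vs. x:", round(r2_log, 4))  # much closer to 1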