BA 216 Lecture 5 Notes
We are often interested not just in the distribution of one variable, but also in the
relationships between two (or more) variables.
● Many analyses are motivated by a researcher looking for a relationship between
two or more variables. A social scientist may like to answer some of the following
questions:
1. If homeownership is lower than the national average in one county, will the
percent of multi-unit structures in that county tend to be above or below
the national average?
● A scatter plot provides a case-by-case view of data for two numerical variables.
● Two variables:
1. homeownership
2. multi_unit, (the % of units in multi-unit structures (e.g. apartments,
condos))
Scatterplots are one type of graph used to study the relationship between two numerical
variables.
● The scatterplot suggests a relationship between the two variables: counties with
a higher rate of multi-units tend to have lower homeownership rates.
● When two variables show some connection with one another, they are called
associated variables.
○ Associated variables can also be called dependent variables and
vice-versa.
● A positive association is shown in the relationship between median_hh_income
and pop_change.
● Counties with higher median household income tend to have higher rates of
population growth.
Visualizing bivariate data – scatterplots & nonlinear relationships
● Consider the case where your vertical axis represents something “good” and
your horizontal axis represents something that is only good in moderation.
● Health and water consumption fit this description: we require some water to
survive, but consume too much and it becomes toxic and can kill a person.
● Job satisfaction and age is another: people have a “dip” in satisfaction around
ages 40-50, which is often referred to as the midlife crisis.
Describing bivariate (two-variable) relationships and their scatterplots, and eyeballing a
trend line
○ Outliers: Do there appear to be any data points that are unusually far
away from the general pattern?
"This scatterplot shows a strong, negative, linear association between age of drivers
and number of accidents. There don't appear to be any outliers in the data."
● Notice that the description mentions the form (linear), the direction (negative),
the strength (strong), and the lack of outliers. It also mentions the context of
the two variables in question (age of drivers and number of accidents).
Eyeballing a Trend line (phone data example)
● Now she wants a trend line to describe the relationship between how much time
she spent on her phone and the battery life remaining. She drew three possible
trend lines:
● Q: Which line fits the data graphed?
Example - % of students graduating high school and % in poverty
The scatterplot below shows the relationship between the HS graduation rate in all 50 US
states and DC and the % of residents who live below the poverty line (income below
$23,050 for a family of 4 in 2012).
● Unit of analysis?
○ US states
● Response/dependent variable?
○ % in poverty
● Explanatory/independent variable?
○ % HS grad
● Relationship?
○ linear, negative, moderately strong
Prediction calculation, and eyeballing a line
Correlation
● NOTE: the new statistic we learn here is the R statistic, NOT to be confused with
the statistical program R!
The correlation coefficient goes from -1 to 1
● Only when the relationship between two variables is perfectly linear is the
correlation either -1 or 1.
○ If the relationship is strong and positive, the correlation will be near +1.
● The formula for correlation is quite complex, and usually completed using a
statistical program (like R). We will demo how to calculate the correlation
coefficient next week.
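Ahead of that demo, here is a minimal sketch of what such a calculation might look like in Python rather than R. It assumes the standard sample (Pearson) correlation formula. The function name `correlation` and the toy data are made up for illustration and are not from the lecture.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample (Pearson) correlation coefficient R for two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical toy data (not from the lecture)
x = [1, 2, 3, 4, 5]
perfect_positive = [2, 4, 6, 8, 10]   # y = 2x, perfectly linear and increasing
perfect_negative = [10, 8, 6, 4, 2]   # y = 12 - 2x, perfectly linear and decreasing
noisy_positive = [2, 5, 5, 9, 9]      # upward trend, but not perfectly linear

r_pos = correlation(x, perfect_positive)
r_neg = correlation(x, perfect_negative)
r_noisy = correlation(x, noisy_positive)
print(round(r_pos, 3), round(r_neg, 3), round(r_noisy, 3))
```

The two perfectly linear data sets give correlations of +1 and -1, while the noisy upward trend gives a value near, but below, +1.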
Correlation example 2
Summary of summarizing data
● So, we’ve sampled from a larger population, creating data, and now have
categorical and/or numerical data to analyze. You may also want to explore the
relationship between two variables.
1. Visualize the trends and distribution shape, using some kind of table or
graph (frequency plot, bar chart, histogram, box-and-whisker plot, time series,
scatterplot)*
*Reminder: the type of table, diagram, and figure depends primarily on whether the
variables in question are numerical or categorical.
We just covered how to describe data with visualizations and summary statistics.
Next, we’ll cover the topics below, and then prep for the first exam:
● How to recognize and work with the normal distribution, including calculating
z-scores and percentiles.
● An overview of the basics of probability theory
Back to univariate statistics: the normal distribution
Before we continue…
What is the difference between a standard histogram and the “smoothed curve”
histograms?
● How does changing the number of bins allow you to make different
interpretations of the data?
Continuous distributions
● This suggests the population height as a continuous numerical variable may best
be explained by a curve that represents the outline of EXTREMELY smooth
bins.
● May be tall and thin, or short and stocky, or slumping out very flatly
● But no matter the height and width, the area under the normal distribution curve
is the same! The area under the normal curve ALWAYS adds up to 1.
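One way to check this claim numerically is to add up thin slices under the curve. The sketch below uses Python's standard-library `statistics.NormalDist` and a simple trapezoid rule. The helper name `area_under_curve`, the three example curves, and the cutoff of 10 standard deviations on each side are illustrative choices, not part of the lecture.

```python
from statistics import NormalDist

def area_under_curve(mu, sigma, steps=20000):
    """Approximate the area under N(mu, sigma) with the trapezoid rule.

    Integrates from mu - 10*sigma to mu + 10*sigma; the area outside
    that range is negligibly small.
    """
    dist = NormalDist(mu, sigma)
    lo, hi = mu - 10 * sigma, mu + 10 * sigma
    width = (hi - lo) / steps
    total = 0.5 * (dist.pdf(lo) + dist.pdf(hi))
    for i in range(1, steps):
        total += dist.pdf(lo + i * width)
    return total * width

# Three hypothetical normal curves with very different shapes
areas = {
    "tall and thin": area_under_curve(mu=0, sigma=0.5),
    "short and stocky": area_under_curve(mu=0, sigma=2),
    "very flat": area_under_curve(mu=50, sigma=20),
}
for shape, area in areas.items():
    print(shape, round(area, 6))
```

Each shape should print an area that rounds to 1, regardless of the mean and standard deviation chosen.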
Practice
● A: Curve b looks the most like a normal distribution. Not only is it symmetrical
around a central peak (which c is not) but it also appears to have the right
proportions (which a does not).
Normal Distribution (aka Gaussian Distribution)
● The mean (µ) and standard deviation (σ) describe the normal distribution perfectly, and
so are called the distribution’s PARAMETERS.
● You can think of these like the “shorthand” for the shape of the (perfect-world,
mathematical) distribution curve.
● The way we write the “shorthand” for the normal distribution is N(µ = __, σ = __)
Real world data will never look like a completely perfect curve
● Important Points: The ‘normal’ curve means the idealized, perfect curve (or pattern, or
standard) against which we can compare real world distributions.
● Samples we get in everyday life will NEVER produce such a perfect curve. But,
even fairly small samples can produce a fairly bell-shaped distribution.
● A college admissions officer wants to determine which of the two applicants scored
better on their standardized test with respect to the other test takers:
○ Pam, who earned an 1800 on her SAT, or Jim, who scored a 24 on his ACT?
Using Z-scores to standardize comparisons, when working with normally distributed data
Standardizing with Z scores
Since we cannot just compare these two raw scores, we instead compare how many standard
deviations beyond the mean each observation is. This is the Z-SCORE.
○ Pam's score is (1800 - 1500) / 300 = 1 standard deviation above the mean.
○ Jim's score is (24 - 21) / 5 = 0.6 standard deviations above the mean.
○ Z scores are defined for distributions of any shape, but only when the distribution
is a normal distribution (symmetrical, unimodal, gently curved) can we use Z
scores to calculate percentiles.
■ We can’t use z-scores for comparison with skewed or other funky data.
○ Observations that are more than 2 SD away from the mean on either side
(|Z| > 2) are usually considered unusual. More than 3 SD away are very unusual.
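The Pam and Jim comparison can be sketched in a few lines of Python. The means and standard deviations (SAT: 1500 and 300; ACT: 21 and 5) are taken from the arithmetic above. The function names `z_score` and `describe` are hypothetical helpers, and `describe` simply encodes the |Z| > 2 and |Z| > 3 rules of thumb.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def describe(z):
    """Apply the rule of thumb: |Z| > 2 is unusual, |Z| > 3 is very unusual."""
    if abs(z) > 3:
        return "very unusual"
    if abs(z) > 2:
        return "unusual"
    return "not unusual"

pam_z = z_score(1800, mean=1500, sd=300)  # SAT
jim_z = z_score(24, mean=21, sd=5)        # ACT

print("Pam:", round(pam_z, 2), describe(pam_z))
print("Jim:", round(jim_z, 2), describe(jim_z))
print("Pam scored better relative to other test takers:", pam_z > jim_z)
```

Because Pam's z-score (1) is larger than Jim's (0.6), Pam did better relative to the other test takers, even though the raw scores are on different scales.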
How are z-scores & the normal distribution used?
Use 1 (previous slides): As we just saw, you can use z-scores to compare how unusual two
measurements/observations are, even when they come from different normal
distributions (i.e. the SAT vs. the ACT).
Use 2: It’s also very useful in statistics to be able to identify “tail areas” of distributions. The
tail area below a value is also called that value’s PERCENTILE. Examples:
● We can ask, “if you’re a 19-year-old man, and you are 5 feet 8 inches tall, what % of
people are you taller than?” According to the US CDC’s percentile calculator, you would
be in the 29.1th percentile for height with a z-score of -0.55.
● We can also ask: what percentage of SAT scores are below Ann's score of 1300?
Another way to ask this is, “what is Ann’s SAT score percentile?”
Example 1
What fraction of people have an SAT score below Ann's score of 1300?
N(µ = 1100, σ = 200), and our “cutoff value” is Ann’s score of x = 1300
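A minimal sketch of this calculation in Python, using the standard-library `statistics.NormalDist` class in place of a normal probability table. The variable names are illustrative only.

```python
from statistics import NormalDist

# SAT scores modeled as N(mu = 1100, sigma = 200), as stated above
sat = NormalDist(mu=1100, sigma=200)
ann_score = 1300

# Step 1: standardize the cutoff value to a z-score
ann_z = (ann_score - sat.mean) / sat.stdev

# Step 2: the percentile is the area under the curve to the LEFT of the cutoff
ann_percentile = sat.cdf(ann_score)

print("z-score:", ann_z)
print("fraction scoring below Ann:", round(ann_percentile, 4))
```

Ann's z-score is 1, and roughly 84% of test takers fall below her score, so she is at about the 84th percentile.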
How are z-scores used? (continued)
● We’ve learned:
○ Use 1: to compare distributions by standardizing measurements/observations to
z-scores
○ Use 2: to calculate a percentile of a cutoff value (i.e. show what % of
observations fall below a certain value)
● What is the probability that a random student scores at least 1190 on the SAT? We
don’t know anything about the student’s aptitude, so we can base our statistical guess
ONLY on the population’s average score (and the standard deviation).
● The picture shows the mean and the values at 2 standard deviations above and below
the mean.
● The simplest way to find the shaded area under the curve makes use of the Z-score of
the cutoff value.
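The "at least 1190" question asks for an upper tail area, which is 1 minus the percentile of the cutoff. The notes do not restate the SAT distribution for this question, so the sketch below ASSUMES the same N(µ = 1100, σ = 200) used in Example 1. It uses Python's standard-library `statistics.NormalDist`, and the variable names are illustrative.

```python
from statistics import NormalDist

# ASSUMPTION: same SAT distribution as Example 1, N(mu = 1100, sigma = 200)
mu, sigma = 1100, 200
cutoff = 1190

# Step 1: z-score of the cutoff value
z = (cutoff - mu) / sigma

# Step 2: area to the LEFT of the cutoff, from the standard normal curve
area_below = NormalDist().cdf(z)

# Step 3: "at least 1190" is the area to the RIGHT, so subtract from 1
prob_at_least = 1 - area_below

print("z-score:", round(z, 2))
print("P(score < 1190):", round(area_below, 4))
print("P(score >= 1190):", round(prob_at_least, 4))
```

Under that assumption, the cutoff sits 0.45 standard deviations above the mean, and about a third of students score at least 1190.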
Key Concept
Terminology note: