BA 216 Lecture 5 Notes

This document provides an introduction to describing and visualizing bivariate relationships between two variables using scatterplots and interpreting the correlation coefficient. Key concepts covered include distinguishing between univariate and bivariate statistics, visually assessing relationships in scatterplots including linear, nonlinear, positive and negative associations, and quantifying relationships using the correlation coefficient which ranges from -1 to 1. Examples are provided to demonstrate describing visual patterns in scatterplots and calculating correlation.

Uploaded by

Harrison Lim

BA 216

An introduction to describing and visualizing the relationship between 2 variables
(i.e. bivariate relationships), including creating scatterplots and interpreting the
correlation coefficient.

We are often interested not just in the distribution of one variable, but also in the
relationships between two (or more) variables.
● Many analyses are motivated by a researcher looking for a relationship between
two or more variables. A social scientist may like to answer some of the following
questions:

1. If homeownership is lower than the national average in one county, will the
percent of multi-unit structures in that county tend to be above or below
the national average?

2. Does a higher than average increase in county population tend to
correspond to counties with higher or lower median household incomes?

3. How useful a predictor is median education level for the median
household income for US counties?

New terms to learn -- univariate vs. bivariate statistics

Univariate statistics

● Translates to “one variable” - you are describing the single variable
● Example: you want to measure puppy weights at a shelter
● We can:
○ Find a central value using mean, median and mode
○ Find how spread out it is using range, quartiles and standard deviation
○ Make plots like Bar Graphs, Pie Charts and Histograms

Bivariate statistics

● Translates to “two variables” - with bivariate data we want to compare the two
sets of data and find any relationships.
● Example: An ice cream shop keeps track of how much ice cream they sell versus the
temperature on that day.
● We can use Tables, Scatter Plots, Correlation, and linear regression

Visualizing bivariate data - scatterplots

● A scatter plot provides a case-by-case view of data for two numerical variables.

● Scatterplots are helpful in quickly spotting associations between variables,
whether those associations come in the form of simple trends or whether those
relationships are more complex.
Scatterplots are one type of graph used to study the relationship between two numerical
variables.

● Two variables:
1. homeownership
2. multi_unit (the % of units in multi-unit structures, e.g. apartments and
condos)

● Each point on the plot represents a single county.


○ The highlighted dot corresponds to County 413 in the county data set:
Chattahoochee County, Georgia, with 39.4% of units in multi-unit
structures and a homeownership rate of 31.3%.


● The scatterplot suggests a relationship between the two variables: counties with
a higher rate of multi-units tend to have lower homeownership rates.
● When two variables show some connection with one another, they are called
associated variables.
○ Associated variables can also be called dependent variables and
vice-versa.
● A positive association is shown in the relationship between the
median_hh_income and pop_change

● Counties with higher median household income tend to have higher rates of
population growth.
Visualizing bivariate data – scatterplots & nonlinear relationships

Examples of non-linear relationship

● Consider the case where your vertical axis represents something “good” and
your horizontal axis represents something that is only good in moderation.

● Health and water consumption fit this description: we require some water to
survive, but consume too much and it becomes toxic and can kill a person.

● Job satisfaction and age is another example: people tend to have a “dip” in
satisfaction around ages 40-50, which is often referred to as the midlife crisis.
Describing bivariate (two-variable) relationships and their scatterplots, and eyeballing a
trend line

Bivariate relationships – strength and direction

● A quick description of the association in a scatterplot between two numerical
variables should always include a description of the following:

○ Form: Is the association linear or nonlinear?

○ Direction: Is the association positive or negative?

○ Strength: Does the association appear to be very strong, very weak, or
somewhere in between? (I tend to use the term “moderately strong,” etc.)

○ Outliers: Do there appear to be any data points that are unusually far
away from the general pattern?

Describing relationships in a scatterplot

"This scatterplot shows a strong, negative, linear association between age of drivers
and number of accidents. There don't appear to be any outliers in the data."

● Notice that the description mentions the form (linear), the direction (negative),
the strength (strong), and the lack of outliers. It also mentions the context of
the two variables in question (age of drivers and number of accidents).
Eyeballing a Trend line (phone data example)

● Now she wants a trend line to describe the relationship between how much time
she spent on her phone and the battery life remaining. She drew three possible trend
lines:
● Q: Which line fits the data graphed?
Example - % of students graduating high school and % in poverty

The scatterplot below shows the relationship between the high school graduation rate
in all 50 US states and DC and the % of residents who live below the poverty line
(income below $23,050 for a family of 4 in 2012).

● Unit of analysis?
○ US states
● Response/dependent variable?
○ % in poverty
● Explanatory/independent variable?
○ % HS grad
● Relationship?
○ linear, negative, moderately strong
Prediction calculation, and eyeballing a line

Correlation

We’ve already built the intuition for understanding correlation

● When we were describing bivariate relationships (between two variables), we
used terms like “strong” and “weak” relationships.

● Calculating a “Correlation Coefficient” is a way to attach a number to this
intuition (to “quantify” it).

● NOTE: the new statistic we learn here is the R statistic, NOT to be confused with
the statistical program R!
The correlation coefficient goes from -1 to 1

● Only when the relationship between two variables is perfectly linear is the
correlation either -1 or 1.

○ If there is no apparent linear relationship between the variables, then the
correlation will be near zero (though almost never exactly 0).

○ If the relationship is strong and positive, the correlation will be near +1.

○ If it is strong and negative, it will be near -1.

○ Correlations with a magnitude between about 0.3 and 0.7 (positive or negative)
can look deceptively messy.
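The two extremes of the range can be checked with a small Python sketch. The course itself uses R, so this is only an illustration. The helper name `pearson_r` and the made-up data are assumptions, not part of the lecture.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]                             # made-up data
r_up = pearson_r(x, [2 * v + 1 for v in x])     # perfectly linear, rising
r_down = pearson_r(x, [10 - 3 * v for v in x])  # perfectly linear, falling
print(r_up, r_down)
```

Any real data set with scatter around the line would give a value strictly between these two extremes.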


Visualizing correlation, & quantifying with the correlation coefficient (i.e. the R Statistic)

We’ve already built the intuition for understanding correlation

● The formula for correlation is quite complex, and the calculation is usually done
using a statistical program (like R). We will demo how to calculate the correlation
coefficient next week.

● However, if you are interested, correlation is calculated as follows:
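In symbols, the usual sample (Pearson) correlation for n paired observations can be written as below. This is the standard textbook definition, given here on the assumption that it is the formula the slide showed. The bars denote sample means, and s_x and s_y are the sample standard deviations of the two variables.

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```

Each term multiplies the standardized x value by the standardized y value for the same case, so points that sit above (or below) average on both variables push r toward +1, while points that are above average on one and below on the other push it toward -1.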


Correlation and non-linear trends

● The correlation is intended to quantify the strength of a linear trend.

● Nonlinear relationships between two variables will usually produce a correlation
coefficient that does not reflect the strength of the relationship, even when
the relationship is strong!
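A small Python sketch of this point, using made-up data (the course itself uses R; the helper name `pearson_r` and the choice of a symmetric U-shaped curve are illustrative assumptions): here y is completely determined by x, yet the correlation works out to zero because the relationship is not linear.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient of two equal-length numeric lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = list(range(-3, 4))        # -3, -2, ..., 3 (made-up data)
y = [v ** 2 for v in x]       # a perfect U-shaped (nonlinear) relationship
r_curve = pearson_r(x, y)
print(r_curve)                # essentially zero despite the perfect pattern
```

This is why a scatterplot should always be checked before trusting the correlation coefficient on its own.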

Concept check: correlation means linear association


Correlation example 1

Correlation example 2
Summary of summarizing data

● So, we’ve sampled from a larger population, creating data, and now have
categorical and/or numerical data to analyze. You may also want to explore the
relationship between two variables.

● Past a few observations, raw data is confusing and uninteresting.

● Instead, you’ve learned to:

1. Visualize the trends and distribution shape, using some kind of table or
graph (frequency plot, bar chart, histogram, box-and-whisker plot, time series,
scatterplot)*

2. Calculate summary statistics, including central tendency (e.g. mean,
median, or in the case of categorical data, mode) and dispersion (e.g.
range, IQR, or standard deviation). This includes a correlation coefficient if
the research question involves the relationship between two numerical
variables.

*Reminder: the type of table, diagram, and figure depends primarily on whether the
variables in question are numerical or categorical.

Between now and the exam….

We just covered how to describe data with visualizations and summary statistics

● Descriptive data visualizations: frequency charts/plots, relative frequency
charts/plots, bar charts, time series, histograms, box-and-whisker plots, and
scatterplots
● Summary statistics: measures of central tendency (mean, median, mode),
measures of dispersion (range, IQR, and standard deviation), and correlation.

Next, we’ll cover the topics below, and then prep for the first exam:

● How to recognize and work with the normal distribution, including calculating
z-scores and percentiles.
● An overview of the basics of probability theory
Back to univariate statistics: the normal distribution

Before we continue…

What is the difference between a standard histogram and the “smoothed curve”
histograms?

Bins & Heights of adults in the US

● How does changing the number of bins allow you to make different
interpretations of the data?

Continuous distributions

● Below is a histogram of the distribution of heights of US adults.


● The proportion of data that falls in the shaded bins gives the probability that a
randomly sampled US adult is between 180 cm and 185 cm (about 5'11" to 6'1").
From histograms to continuous distributions (the return of spaghetti)

● The bins are so slim, they’re starting to resemble a smooth curve.

● This suggests the population height as a continuous numerical variable may best
be explained by a curve that represents the outline of EXTREMELY slim
bins.

● Since height is a continuous numerical variable, its probability density function is
a smooth curve.
Probabilities from continuous distributions

● Therefore, the probability that a randomly sampled US adult is between 180 cm
and 185 cm can also be estimated as the shaded area under the curve.

The Normal Distribution

The “Normal Distribution” should be very familiar…

● The characteristic symmetrical, bell-shaped curve pops up everywhere in nature


Normal Distribution (aka Gaussian Distribution)

● A NORMAL (AKA GAUSSIAN) DISTRIBUTION is always bell-shaped and
symmetrical around a central mean.

● May be tall and thin, or short and stocky, or slumping out very flatly

● This largely depends on whether the standard deviation is large or small

○ (OR the scale on which we’ve chosen to draw the graph)

● But no matter the height and width, the area under the normal distribution curve
is the same! The area under the normal curve ALWAYS adds up to 1.
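As a quick numerical check of this claim, the Python sketch below approximates the total area under a few normal curves with the trapezoid rule. The helper name `area_under_curve`, the mean of 0, the three standard deviations, and the ±8 SD integration range are all illustrative assumptions rather than part of the lecture (which uses R).

```python
from statistics import NormalDist

def area_under_curve(mu, sigma, steps=20_000):
    """Approximate the total area under N(mu, sigma) with the trapezoid rule."""
    dist = NormalDist(mu, sigma)
    # Beyond 8 standard deviations the curve is essentially flat at zero
    lo, hi = mu - 8 * sigma, mu + 8 * sigma
    width = (hi - lo) / steps
    heights = [dist.pdf(lo + i * width) for i in range(steps + 1)]
    return width * (sum(heights) - 0.5 * (heights[0] + heights[-1]))

# A thin curve, a standard curve, and a wide, flat curve (hypothetical values)
areas = {sigma: area_under_curve(0, sigma) for sigma in (0.5, 1, 5)}
for sigma, area in areas.items():
    print(f"sigma = {sigma}: area is approximately {area:.4f}")
```

The curves differ a lot in height and width, but each area comes out at approximately 1.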

Practice

● Question: which of these three looks like a normal distribution?

● A: Curve b looks the most like a normal distribution. Not only is it symmetrical
around a central peak (which c is not) but it also appears to have the right
proportions (which a does not).
Normal Distribution (aka Gaussian Distribution)

“Statistical short hand” notation for the normal distribution

● The mean (µ) and standard deviation (σ) describe the normal distribution perfectly, and
so are called the distribution’s PARAMETERS.

● You can think of these like the “short hand” for the shape of the (perfect-world,
mathematical) distribution curve.

● The way we write the “shorthand” for the normal distribution is N (µ = __, σ = __)
Real world data will never look like a completely perfect curve

● Important Points: The ‘normal’ curve means the idealized, perfect curve (or pattern, or
standard) against which we can compare real world distributions.

● Remember: the normal curve is a mathematical abstraction, and it assumes an infinitely
large population. But we don’t ever measure infinitely large populations.

● Samples we get in everyday life will NEVER produce such a perfect curve. But,
even fairly small samples can produce a fairly bell-shaped distribution.

Normal distributions with different parameters


● SAT scores are distributed nearly normally with mean 1500 and standard deviation
300.
○ N(µ = 1500, σ = 300)
● ACT scores are distributed nearly normally with mean 21 and standard deviation 5.
○ N(µ = 21, σ = 5)

● A college admissions officer wants to determine which of the two applicants scored
better on their standardized test with respect to the other test takers:
○ Pam, who earned an 1800 on her SAT, or Jim, who scored a 24 on his ACT?

Using Z-scores to standardize comparisons, when working with normally distributed data
Standardizing with Z scores

Since we cannot just compare these two raw scores, we instead compare how many standard
deviations above or below the mean each observation is. This is the Z-SCORE.

● The formula is Z = (observation - population mean) / standard deviation

○ Pam's score is (1800 - 1500) / 300 = 1 standard deviation above the mean.

○ Jim's score is (24 - 21) / 5 = 0.6 standard deviations above the mean.

○ Relative to the other test takers, Pam therefore scored better than Jim.
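The same comparison can be sketched in a few lines of Python (the course uses R; the function name `z_score` is an illustrative assumption). The means and standard deviations are the SAT and ACT parameters given above.

```python
def z_score(observation, mean, sd):
    """Number of standard deviations an observation lies above (+) or below (-) the mean."""
    return (observation - mean) / sd

pam_z = z_score(1800, mean=1500, sd=300)  # SAT: N(mu = 1500, sigma = 300)
jim_z = z_score(24, mean=21, sd=5)        # ACT: N(mu = 21, sigma = 5)

# The larger Z score is the stronger result relative to other test takers
better = "Pam" if pam_z > jim_z else "Jim"
print(pam_z, jim_z, better)
```

Because both scores are converted to the same unit (standard deviations from the mean), they can be compared even though the tests use different scales.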

Standardizing with z-scores – visual example

Standardizing with Z scores

● These are called standardized scores, or Z scores

○ Z score of an observation is the number of standard deviations it falls above or
below the mean.

○ Z scores are defined for distributions of any shape, but only when the distribution
is a normal distribution (symmetrical, unimodal, gently curved) can we use Z
scores to calculate percentiles.

■ We can’t use z-scores to find percentiles for skewed or other funky data.

○ Observations that are more than 2 SD away from the mean on either side
(|Z| > 2) are usually considered unusual. More than 3 SD away are very unusual.
How are z-scores & the normal distribution used?

Use 1 (previous slides): As we just saw, you can use them to compare how unusual two
measurements/observations are, even when they come from different normal
distributions (i.e. the SAT vs. the ACT).

Use 2: It’s also very useful in statistics to be able to identify “tail areas” of distributions.
The area below a given value is called that value’s PERCENTILE. Examples:

● We can ask, “if you’re a 19-year-old man, and you are 5 feet 8 inches tall, what % of
people are you taller than?” According to the US CDC’s percentile calculator, you would
be in the 29.1th percentile for height with a z-score of -0.55.

● We can also ask: what percentage of SAT scores are below Ann's score of 1300?
Another way to ask this is, “what is Ann’s SAT score percentile?”

Example 1

What fraction of people have an SAT score below Ann's score of 1300?

N(µ = 1100, σ = 200), and our “cutoff value” is Ann’s score of x = 1300
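A minimal Python sketch of this calculation, using the standard library's `NormalDist` in place of the R functions used in the course (the variable names are illustrative assumptions):

```python
from statistics import NormalDist

sat = NormalDist(mu=1100, sigma=200)   # the distribution stated in this example

ann_z = (1300 - sat.mean) / sat.stdev  # Z score of the cutoff value
ann_percentile = sat.cdf(1300)         # area to the LEFT of 1300

print(f"Z = {ann_z:.2f}, fraction below = {ann_percentile:.4f}")
```

Ann's score is 1 standard deviation above the mean, so about 84% of test takers score below her.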
How are z-scores used? (continued)

● We’ve learned:
○ Use 1: to compare distributions by standardizing measurements/observations to
z-scores
○ Use 2: to calculate a percentile of a cutoff value (i.e. show what % of
observations fall below a certain value)

● We’re about to learn:
○ Use 3: related to Use 2, you can also calculate the probability that a measurement
will fall ABOVE a particular cutoff value.
○ Use 4: you can also calculate the probability that a measurement will fall
BETWEEN two values.

Normal probability – example 3

● What is the probability that a randomly selected student scores at least 1190 on the
SAT? We don’t know anything about their aptitude, so we can base our statistical guess
ONLY on the population’s average score (and the standard deviation).

● The picture shows the mean and the values at 2 standard deviations above and below
the mean.
● The simplest way to find the shaded area under the curve makes use of the Z-score of
the cutoff value.
Key Concept

● THE AREA UNDER THE NORMAL CURVE ALWAYS ADDS TO 1.


● R functions (and basically all other methods) default to the area to the LEFT of the cutoff
value or percentage.
● If you want the area to the RIGHT (see above), calculate: 1 – pnorm(value)
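The right-tail calculation for the 1190 cutoff can be sketched in Python, where `NormalDist.cdf` plays the role of R's `pnorm` (both return the area to the left). The example does not restate the SAT distribution, so this sketch assumes the N(µ = 1100, σ = 200) used in Example 1.

```python
from statistics import NormalDist

# ASSUMPTION: same SAT distribution as Example 1, N(mu = 1100, sigma = 200)
sat = NormalDist(mu=1100, sigma=200)

cutoff = 1190
z = (cutoff - sat.mean) / sat.stdev       # Z score of the cutoff value
area_left = NormalDist().cdf(z)           # standard normal area to the LEFT of z
area_right = 1 - area_left                # total area is 1, so subtract to get the RIGHT

print(f"Z = {z:.2f}, P(score >= {cutoff}) = {area_right:.4f}")
```

Under that assumption the cutoff sits 0.45 standard deviations above the mean, and the probability of scoring at least 1190 is roughly 0.33.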
Example 4 – finding area between cutoff values
Summary: Key points for z-scores

Terminology note:

● Area below a cutoff value = area to the left


● Area above a cutoff value = area to the right
● Area between two values = area to left of higher value – area to left of lower value
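The three rules above can be sketched together in Python. The cutoffs of 1000 and 1300 are hypothetical values chosen for illustration, and the N(µ = 1100, σ = 200) distribution is assumed from Example 1. `NormalDist.cdf` gives the area to the left, like R's `pnorm`.

```python
from statistics import NormalDist

# ASSUMPTIONS: distribution from Example 1; the two cutoffs are hypothetical
sat = NormalDist(mu=1100, sigma=200)
lower, upper = 1000, 1300

area_below_lower = sat.cdf(lower)                   # area to the left of the lower value
area_above_upper = 1 - sat.cdf(upper)               # area to the right of the higher value
area_between = sat.cdf(upper) - sat.cdf(lower)      # left of higher minus left of lower

print(f"below {lower}: {area_below_lower:.4f}")
print(f"above {upper}: {area_above_upper:.4f}")
print(f"between: {area_between:.4f}")
```

The three areas cover the whole curve without overlapping, so they add up to 1.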
