BA 216 Lecture 5 Notes

This document provides an introduction to describing and visualizing bivariate relationships between two variables using scatterplots and interpreting the correlation coefficient. Key concepts covered include distinguishing between univariate and bivariate statistics, visually assessing relationships in scatterplots including linear, nonlinear, positive and negative associations, and quantifying relationships using the correlation coefficient which ranges from -1 to 1. Examples are provided to demonstrate describing visual patterns in scatterplots and calculating correlation.

Uploaded by

Harrison Lim

BA 216

An introduction to describing and visualizing the relationship between 2 variables
(i.e. bivariate relationships), including creating scatterplots, and interpreting the
correlation coefficient.

We are often interested not just in the distribution of one variable, but also in the
relationships between two (or more) variables.
● Many analyses are motivated by a researcher looking for a relationship between
two or more variables. A social scientist may like to answer some of the following
questions:

1. If homeownership is lower than the national average in one county, will the
percent of multi-unit structures in that county tend to be above or below
the national average?

2. Does a higher than average increase in county population tend to
correspond to counties with higher or lower median household incomes?

3. How useful a predictor is median education level for the median
household income for US counties?

New terms to learn -- univariate vs. bivariate statistics

Univariate statistics

● Translates to “one variable” - you are describing a single variable.
● Example: you want to measure puppy weights at a shelter.
● We can:
○ Find a central value using mean, median and mode
○ Find how spread out it is using range, quartiles and standard deviation
○ Make plots like Bar Graphs, Pie Charts and Histograms

Bivariate statistics

● Translates to “two variables” - with bivariate data we want to compare the
two sets of data and find any relationships.
● Example: An ice cream shop keeps track of how much ice cream they sell
versus the temperature on that day.
● We can use Tables, Scatter Plots, Correlation, and linear regression

Visualizing bivariate data - scatterplots

● A scatter plot provides a case-by-case view of data for two numerical variables.

● Scatterplots are helpful in quickly spotting associations between variables,
whether those associations come in the form of simple trends or whether those
relationships are more complex.
Scatterplots are one type of graph used to study the relationship between two numerical
variables.

● Two variables:
1. homeownership
2. multi_unit (the % of units in multi-unit structures, e.g. apartments and
condos)

● Each point on the plot represents a single county.

○ The highlighted dot corresponds to County 413 in the county data set:
Chattahoochee County, Georgia, with 39.4% of units in multi-unit
structures and a homeownership rate of 31.3%.

● The scatterplot suggests a relationship between the two variables: counties with
a higher rate of multi-units tend to have lower homeownership rates.
● When two variables show some connection with one another, they are called
associated variables.
○ Associated variables can also be called dependent variables and
vice-versa.
● A positive association is shown in the relationship between the
median_hh_income and pop_change

● Counties with higher median household income tend to have higher rates of
population growth.
Visualizing bivariate data – scatterplots & nonlinear relationships

Examples of non-linear relationships

● Consider the case where your vertical axis represents something “good” and
your horizontal axis represents something that is only good in moderation.

● Health and water consumption fit this description: we require some water to
survive, but consume too much and it becomes toxic and can kill a person.

● Job satisfaction and age is another: people have a “dip” in satisfaction around
ages 40-50, which is often referred to as the midlife crisis.
Describing bivariate (two-variable) relationships and their scatterplots, and eyeballing a
trend line

Bivariate relationships – strength and direction

● A quick description of the association in a scatterplot between two numerical
variables should always include a description of the following:

○ Form: Is the association linear or nonlinear?

○ Direction: Is the association positive or negative?

○ Strength: Does the association appear to be very strong, very weak, or
somewhere in between? (I tend to use the term “moderately strong,” etc.)

○ Outliers: Do there appear to be any data points that are unusually far
away from the general pattern?

Describing relationships in a scatterplot

"This scatterplot shows a strong, negative, linear association between age of drivers
and number of accidents. There don't appear to be any outliers in the data."

● Notice that the description mentions the form (linear), the direction (negative),
the strength (strong), and the lack of outliers. It also mentions the context of
the two variables in question (age of drivers and number of accidents).
Eyeballing a Trend line (phone data example)

● Now she wants a trend line to describe the relationship between how much time
she spent on her phone and the battery life remaining. She drew three possible trend
lines:
● Q: Which line fits the data graphed?
Example - % of students graduating high school and % in poverty

The scatterplot below shows the relationship between the high school graduation rate in
all 50 US states and DC and the % of residents who live below the poverty line (income
below $23,050 for a family of 4 in 2012).

● Unit of analysis?
○ US states
● Response/dependent variable?
○ % in poverty
● Explanatory/independent variable?
○ % HS grad
● Relationship?
○ linear, negative, moderately strong
Prediction calculation, and eyeballing a line

Correlation

We’ve already built the intuition for understanding correlation

● When we were describing bivariate relationships (between two variables), we
used terms like “strong” and “weak” relationships.

● Calculating a “Correlation Coefficient” is a way to attach a number to this
intuition (to “quantify” it).

● NOTE: the new statistic we learn here is the R statistic, NOT to be confused with
the statistical program R!
The correlation coefficient goes from -1 to 1

● Only when the relationship between two variables is perfectly linear is the
correlation either -1 or 1.

○ If there is no apparent linear relationship between the variables, then the
correlation will be near zero (almost never exactly 0 though).

○ If the relationship is strong and positive, the correlation will be near +1.

○ If it is strong and negative, it will be near -1.

○ Correlations with a magnitude between 0.3 and 0.7 (positive or negative) can
look deceptively messy.

Visualizing correlation, & quantifying with the correlation coefficient (i.e. the R Statistic)

● The formula for correlation is quite complex, and the calculation is usually done with a
statistical program (like R). We will demo how to calculate the correlation
coefficient next week.

● However, if you are interested, correlation is calculated as follows:
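One common way to write the calculation, offered here as a sketch and not necessarily the form used in the lecture, is R = (1 / (n - 1)) × Σ [(x_i - x̄) / s_x] × [(y_i - ȳ) / s_y], where x̄ and ȳ are the sample means and s_x and s_y are the sample standard deviations. The Python below follows that formula step by step. The function name `correlation` and the ice cream shop numbers are hypothetical, made up for illustration.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (n - 1 in the denominator).
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    products = [((x - mean_x) / s_x) * ((y - mean_y) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


# Hypothetical data: daily temperature and ice cream sales (not from the lecture).
temps = [15, 20, 25, 30, 35]
sales = [110, 150, 210, 240, 300]

r = correlation(temps, sales)
print(round(r, 3))  # strong, positive, close to +1
```

Because every x and y is first converted to "standard deviations from its own mean", the result does not depend on the units of either variable.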

Correlation and non-linear trends

● The correlation is intended to quantify the strength of a linear trend.

● Nonlinear relationships between two variables will usually produce a correlation
coefficient that does not reflect the strength of the relationship, even when
the relationship is strong!
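A small illustration of this point, using made-up numbers: below, y is completely determined by x (y = x squared), yet the correlation comes out at essentially zero because the relationship is a U shape, not a line. The `correlation` helper is a hypothetical function written for this sketch.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient computed from standardized x and y values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical U-shaped data: y depends perfectly on x, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curved = correlation(xs, ys)
print(abs(r_curved) < 1e-9)  # the correlation is essentially zero

# For contrast, a perfectly linear relationship on the same x values.
r_linear = correlation(xs, [2 * x + 1 for x in xs])
print(round(r_linear, 6))
```

This is why it pays to look at the scatterplot before trusting the number.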

Concept check: correlation means linear association

Correlation example 1

Correlation example 2
Summary of summarizing data

● So, we’ve sampled from a larger population, creating data, and now have
categorical and/or numerical data to analyze. You may also want to explore the
relationship between two variables.

● Beyond a handful of observations, raw data is confusing and uninteresting.

● Instead, you’ve learned to:

1. Visualize the trends and distribution shape, using some kind of table or
graph (frequency plot, bar chart, histogram, box-and-whisker plot, time series,
scatterplot)*

2. Calculate summary statistics, including central tendency (e.g. mean,
median, or in the case of categorical data, mode) and dispersion (e.g.
range, IQR, or standard deviation). This includes a correlation coefficient if
the research question involves the relationship between two numerical
variables.

*Reminder: the type of table, diagram, and figure depends primarily on whether the
variables in question are numerical or categorical.

Between now and the exam….

We just covered how to describe data with visualizations, and summary statistics

● Descriptive data visualizations: frequency chart/plots, relative frequency
charts/plots, bar charts, time series, histograms, box-and-whisker plots, and
scatterplots
● Summary statistics: measures of central tendency (mean, median, mode),
measures of dispersion (range, IQR, and standard deviation), and correlation.

Next, we’ll cover the topics below, and then prep for the first exam:

● How to recognize and work with the normal distribution, including calculating
z-scores and percentiles.
● An overview of the basics of probability theory
Back to univariate statistics: the normal distribution

Before we continue…

What is the difference between a standard histogram and the “smoothed curve”
histograms?

Bins & Heights of adults in the US

● How does changing the number of bins allow you to make different
interpretations of the data?

Continuous distributions

● Below is a histogram of the distribution of heights of US adults.

● The proportion of data that falls in the shaded bins gives the probability that a
randomly sampled US adult is between 180 cm and 185 cm (about 5'11" to 6'1").
From histograms to continuous distributions (the return of spaghetti)

● The bins are so slim, they’re starting to resemble a smooth curve.

● This suggests the population height as a continuous numerical variable may best
be explained by a curve that represents the outline of EXTREMELY slim
bins.

● Since height is a continuous numerical variable, its probability density function is
a smooth curve.
Probabilities from continuous distributions

● Therefore, the probability that a randomly sampled US adult is between 180 cm
and 185 cm can also be estimated as the shaded area under the curve.
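To see the "proportion in the bins" and the "area under the curve" agree, here is a rough simulation in Python. Everything about the data is an assumption: the notes do not give the height distribution, so the sketch pretends heights follow N(µ = 170, σ = 10) in cm and draws a simulated sample from it.

```python
from statistics import NormalDist

# ASSUMPTION: hypothetical parameters for adult height in cm (not from the lecture).
heights = NormalDist(mu=170, sigma=10)

# Simulate a large sample; the seed only makes the run repeatable.
sample = heights.samples(100_000, seed=216)

# Histogram view: share of sampled people in the 180-185 cm bins.
proportion = sum(1 for h in sample if 180 <= h < 185) / len(sample)

# Smooth-curve view: area under the density curve between 180 and 185 cm.
area = heights.cdf(185) - heights.cdf(180)

print(f"proportion in bins: {proportion:.3f}")
print(f"area under curve:   {area:.3f}")
```

With a sample this large the two numbers should land very close to each other, which is the sense in which the curve "replaces" the histogram.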

The Normal Distribution

The “Normal Distribution” should be very familiar…

● The characteristic symmetrical, bell-shaped curve pops up everywhere in nature

Normal Distribution (aka Gaussian Distribution)

● A NORMAL (AKA GAUSSIAN) DISTRIBUTION is always bell-shaped and
symmetrical around a central mean.

● May be tall and thin, or short and stocky, or slumping out very flatly

● This largely depends on whether the standard deviation is large or small

○ (OR the scale on which we’ve chosen to draw the graph)

● But no matter the height and width, the area under the normal distribution curve
is the same! The area under the normal curve ALWAYS adds up to 1.
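A quick numerical check of both claims, sketched in Python with three arbitrarily chosen standard deviations: a smaller σ gives a taller, thinner curve, but the area stays at 1. The helper `area_under_curve` is a hypothetical name; it approximates the area by adding up thin trapezoids.

```python
from statistics import NormalDist


def area_under_curve(dist, steps=20_000):
    """Approximate the area under a normal curve with the trapezoid rule."""
    # Ten standard deviations either side of the mean covers essentially all of the curve.
    low = dist.mean - 10 * dist.stdev
    high = dist.mean + 10 * dist.stdev
    width = (high - low) / steps
    heights = [dist.pdf(low + i * width) for i in range(steps + 1)]
    return width * (sum(heights) - 0.5 * (heights[0] + heights[-1]))


sigmas = (0.5, 1, 5)  # arbitrary example values
peaks = {s: NormalDist(mu=0, sigma=s).pdf(0) for s in sigmas}
areas = {s: area_under_curve(NormalDist(mu=0, sigma=s)) for s in sigmas}

for s in sigmas:
    print(f"sigma={s}: peak height {peaks[s]:.3f}, area {areas[s]:.4f}")
```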

Practice

● Question: which of these three looks like a normal distribution?

● A: Curve b looks the most like a normal distribution. Not only is it symmetrical
around a central peak (which c is not) but it also appears to have the right
proportions (which a does not).
Normal Distribution (aka Gaussian Distribution)

“Statistical short hand” notation for the normal distribution

● The mean (µ) and standard deviation (σ) describe the normal distribution perfectly, and
so are called the distribution’s PARAMETERS.

● You can think of these like the “short hand” for the shape of the (perfect-world,
mathematical) distribution curve.

● The way we write the “shorthand” for the normal distribution is N (µ = __, σ = __)
Real world data will never look like a completely perfect curve

● Important Points: The ‘normal’ curve means the idealized, perfect curve (or pattern, or
standard) against which we can compare real world distributions.

● Remember: the normal curve is a mathematical abstraction, and it assumes an infinitely
large population. But we don’t ever measure infinitely large populations.

● Samples we get in everyday life will NEVER produce such a perfect curve. But,
even fairly small samples can produce a fairly bell-shaped distribution.

Normal distributions with different parameters

● SAT scores are distributed nearly normally with mean 1500 and standard deviation
300.
○ N(µ = 1500, σ = 300)
● ACT scores are distributed nearly normally with mean 21 and standard deviation 5.
○ N(µ = 21, σ = 5)

● A college admissions officer wants to determine which of the two applicants scored
better on their standardized test with respect to the other test takers:
○ Pam, who earned an 1800 on her SAT, or Jim, who scored a 24 on his ACT?

Using Z-scores to standardize comparisons, when working with normally distributed data
Standardizing with Z scores

Since we cannot just compare these two raw scores, we instead compare how many standard
deviations beyond the mean each observation is. This is the Z-SCORE.

● The formula is Z = (observation – population mean) / standard deviation

○ Pam's score is (1800 - 1500) / 300 = 1 standard deviation above the mean.

○ Jim's score is (24 - 21) / 5 = 0.6 standard deviations above the mean.
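The same arithmetic as a short Python sketch, using the SAT and ACT parameters given above. The function name `z_score` is just a label chosen for this example.

```python
def z_score(observation, mean, sd):
    """Number of standard deviations an observation sits above (+) or below (-) the mean."""
    return (observation - mean) / sd


pam_z = z_score(1800, mean=1500, sd=300)  # SAT ~ N(1500, 300)
jim_z = z_score(24, mean=21, sd=5)        # ACT ~ N(21, 5)

print(f"Pam: z = {pam_z:.1f}")
print(f"Jim: z = {jim_z:.1f}")
print("Pam scored better relative to other test takers" if pam_z > jim_z
      else "Jim scored better relative to other test takers")
```

Pam's z-score is larger, so relative to the other test takers she did better than Jim, even though the raw scores cannot be compared directly.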

Standardizing with z-scores – visual example

● These are called standardized scores, or Z scores

○ Z score of an observation is the number of standard deviations it falls above or
below the mean.

○ Z scores are defined for distributions of any shape, but only when the distribution
is a normal distribution (symmetrical, unimodal, gently curved) can we use Z
scores to calculate percentiles.

■ We can’t use z-scores for comparison with skewed or other funky data.

○ Observations that are more than 2 SD away from the mean on either side
(|Z| > 2) are usually considered unusual. More than 3 SD away are very unusual.
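To get a feel for why |Z| > 2 counts as unusual, this Python sketch computes how much of a normal distribution lies more than 2 or 3 standard deviations from the mean. It assumes the data really are normally distributed; the helper names are hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def share_beyond(k):
    """Share of a normal distribution more than k standard deviations from the mean (both tails)."""
    return 2 * (1 - standard_normal.cdf(k))


def is_unusual(z, cutoff=2):
    """Flag an observation whose z-score is more than `cutoff` SDs from the mean."""
    return abs(z) > cutoff


beyond_2 = share_beyond(2)
beyond_3 = share_beyond(3)
print(f"more than 2 SD away: {beyond_2:.1%}")
print(f"more than 3 SD away: {beyond_3:.2%}")
print(is_unusual(-0.55), is_unusual(2.4))
```

Under a normal model, fewer than 5% of observations fall beyond 2 SD and well under 1% fall beyond 3 SD.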
How are z-scores & the normal distribution used?

Use 1 (previous slides): You can use them to compare how unusual two
measurements/observations are, even when they come from different normal
distributions (i.e. the SAT vs. the ACT).

Use 2: It’s also very useful in statistics to be able to identify “tail areas” of distributions. The
tail area below a value is also called its PERCENTILE. Examples:

● We can ask, “if you’re a 19-year-old man, and you are 5 feet 8 inches tall, what % of
people are you taller than?” According to the US CDC’s percentile calculator, you would
be in the 29.1th percentile for height, with a z-score of -0.55.

● We can also ask: what percentage of SAT scores are below Ann's score of 1300?
Another way to ask this is, “what is Ann’s SAT score percentile?”

Example 1

What fraction of people have an SAT score below Ann's score of 1300?

N(µ = 1100, σ = 200), and our “cutoff value” is Ann’s score of x = 1300
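A sketch of this calculation in Python, using the standard library's `NormalDist` as a stand-in for a normal table or for R: first find the Z-score of the cutoff, then find the area to the left of it.

```python
from statistics import NormalDist

sat = NormalDist(mu=1100, sigma=200)  # N(µ = 1100, σ = 200) from the example
cutoff = 1300                          # Ann's score

ann_z = (cutoff - sat.mean) / sat.stdev   # standard deviations above the mean
ann_percentile = sat.cdf(cutoff)          # area to the LEFT of the cutoff

print(f"z = {ann_z:.2f}")
print(f"fraction scoring below Ann: {ann_percentile:.4f}")
```

Ann's Z-score is 1, and roughly 84% of scores fall below hers. In R, the matching call would be `pnorm(1300, mean = 1100, sd = 200)`.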
How are z-scores used? (continued)

● We’ve learned:
○ Use 1: to compare distributions by standardizing measurements/observations to
z-scores
○ Use 2: to calculate a percentile of a cutoff value (i.e. show what % of
observations fall below a certain value)

● We’re about to learn:

○ Use 3: related to Use 2, you can also calculate the probability that a measurement
will fall ABOVE a particular cutoff value.
○ Use 4: you can also calculate the probability that a measurement will fall
BETWEEN two values.

Normal probability – example 3

● What is the probability that a random student scores at least 1190 on the SAT? We
don’t know anything about the student’s aptitude, so we can base our statistical guess ONLY on
the population’s average score (and the standard deviation).

● The picture shows the mean and the values at 2 standard deviations above and below
the mean.
● The simplest way to find the shaded area under the curve makes use of the Z-score of
the cutoff value.
Key Concept

● THE AREA UNDER THE NORMAL CURVE ALWAYS ADDS TO 1.

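● R functions (and basically all other methods) default to the area to the LEFT of the cutoff
value or percentage.
● If you want the area to the RIGHT (see above), calculate: 1 – pnorm(value), where value
is the Z-score of the cutoff (pnorm assumes mean 0 and standard deviation 1 unless told otherwise).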
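The "1 minus the left area" step, sketched in Python for the "at least 1190" question. ASSUMPTION: the notes do not restate the SAT distribution for this example, so the sketch reuses N(µ = 1100, σ = 200) from Example 1.

```python
from statistics import NormalDist

# ASSUMPTION: same SAT distribution as Example 1.
sat = NormalDist(mu=1100, sigma=200)
cutoff = 1190

z = (cutoff - sat.mean) / sat.stdev   # Z-score of the cutoff
area_left = sat.cdf(cutoff)           # what tables and software report by default
area_right = 1 - area_left            # probability of scoring AT LEAST the cutoff

print(f"z = {z:.2f}")
print(f"area to the left:  {area_left:.4f}")
print(f"area to the right: {area_right:.4f}")
```

Under that assumption the Z-score is 0.45 and the right-tail area is about 0.33. The R equivalent would be `1 - pnorm(1190, mean = 1100, sd = 200)`, or `1 - pnorm(0.45)` using the Z-score.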
Example 4 – finding area between cutoff values
Summary: Key points for z-scores

Terminology note:

● Area below a cutoff value = area to the left

● Area above a cutoff value = area to the right
● Area between two values = area to left of higher value – area to left of lower value
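The "between" rule can be sketched the same way in Python. The cutoffs 900 and 1300 are hypothetical values picked for illustration (one standard deviation either side of the mean), and the SAT distribution N(µ = 1100, σ = 200) is borrowed from Example 1.

```python
from statistics import NormalDist

sat = NormalDist(mu=1100, sigma=200)  # borrowed from Example 1
lower, upper = 900, 1300               # hypothetical cutoff values

# Area between = area to the left of the higher value - area to the left of the lower value.
area_between = sat.cdf(upper) - sat.cdf(lower)

print(f"P({lower} < score < {upper}) = {area_between:.4f}")
```

About 68% of scores fall within one standard deviation of the mean. In R this would be `pnorm(1300, 1100, 200) - pnorm(900, 1100, 200)`.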
