Notes 2 - Scatterplots and Correlation
We have worked with data for bar graphs, box plots, dot plots, and histograms. The type of data
represented by these graphs is _________________________ data.
In our Vitruvian Man activity, we explored ____________________________ data. When we compare two
variables, we are exploring the relationship between them.
Most statistical studies involve more than one variable. On the AP Statistics exam, you will often be asked to
compare two data sets by using side-by-side boxplots, histograms, etc. However, there are times when we
want to examine relationships among several variables for the same group of individuals (in the Vitruvian Man
activity, each person had TWO measurements: height and arm span).
A scatterplot shows the relationship between two quantitative variables measured on the same individuals. The
values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical
axis. Each individual in the data appears as a point in the plot fixed by the values of both variables for that
individual.
[Figure: scatterplot with Arm Span on the horizontal axis]
**You will often find explanatory variables called independent variables, and response variables called dependent
variables. The idea behind this language is that the response variable depends on the explanatory variable. Because the
words independent and dependent have other, unrelated meanings in statistics, we won’t use them here.
● In any graph of data, look for the overall pattern and for striking deviations from that pattern.
● You can describe the overall pattern of a scatterplot by the direction, form, and strength of the
relationship.
● An important kind of deviation is an outlier, an individual value that falls outside the overall pattern of
the relationship.
Things to look for in a scatterplot:
● Form: Overall pattern (linear, exponential, etc.) or deviations from the pattern (outliers)
● Direction: Positive or negative slope
● Strength: How closely the points follow a simple form (such as a line)
[Figures: example scatterplots illustrating each feature]
Form:
Direction: Positive, Negative
Strength: Weak, Moderate, Strong
Example
Suppose we hypothesize that the number of doctor visits a person makes can be explained by the number of
cigarettes they smoke. So we want to see if there is a relationship between the number of cigarettes one smokes
per week and the number of times per year one visits a doctor. We ask 10 random people and get the following
information:
Unusual Features: Outliers and Influential Points, Gaps, Clusters
Correlation
To strengthen the analysis when comparing two variables, we can attach a number, called the
correlation coefficient (r), to describe the linear relationship between two variables. This number helps remove
subjectivity in reading a linear scatterplot.
The correlation measures the strength and direction of the linear relationship between two quantitative
variables.
While we will never have to find correlation by hand, the formula is provided to us on the AP Statistics formula
sheet. There are a few facts about the correlation that the formula can help us remember.
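For reference, the correlation formula is usually written in the form below. This is a reconstruction, since the formula itself does not appear in these notes; the symbol n (the number of individuals) is an addition that is not in the symbol list.

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
```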
____ : correlation
____ : each x-value
____ : each y-value
____ : mean of the x values
____ : mean of the y values
____ : standard deviation of x values
____ : standard deviation of y values
Essentially, the correlation coefficient, r, is the average of the products of the standardized scores (the sum is divided by n - 1 rather than n, where n is the number of individuals).
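The idea of averaging products of standardized scores can be sketched in a few lines of Python. This is only an illustration: the function name correlation and the arm span and height values are made up for this example and are not data from the Vitruvian Man activity.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sum the products of standardized scores, then divide by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Made-up measurements in centimeters, for illustration only
arm_span = [150, 158, 163, 170, 176, 184]
height = [152, 157, 165, 168, 178, 183]

r = correlation(arm_span, height)
print(round(r, 3))  # close to 1: a strong, positive linear relationship
```

In practice a calculator or statistical software does this work; the sketch is only meant to show where each symbol in the formula ends up.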
Fact #2 : Positive correlations between 0 and 1 have varying strengths, with the strongest positive correlations being closer to 1.
*Note: There is no magic “cutoff number” for describing weak/moderate/strong. Use your statistical intuition!
Fact #3 : Negative correlations between -1 and 0 have varying strengths, with the strongest negative correlations being closer to -1.
[Figures: three scatterplots with r = -0.9, r = -0.65, and r = -0.3]
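MANY students are tempted to say that this scatterplot has a correlation of -1
because it is a perfect negative quadratic relationship.
NO NO NO!!!! Correlation measures only the strength of a linear relationship, so points that follow a curve, however perfectly, do not have r = -1.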
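A quick numerical check makes the point. The sketch below assumes hypothetical points that lie exactly on the curve y = -x^2 (not the points from the scatterplot in these notes) and uses an illustrative helper named correlation.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    # Sum of products of standardized scores, divided by n - 1
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Hypothetical points that fall exactly on the curve y = -x^2
xs = [0, 1, 2, 3, 4, 5]
ys = [-x**2 for x in xs]

r_curve = correlation(xs, ys)
print(round(r_curve, 2))  # strong and negative, but not -1
```

Even though every point sits exactly on the curve, r stays above -1 because the pattern is not a straight line.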
Fact #5 : Correlation does not have units and changing units on either axis will not affect correlation.
Since we are standardizing all the x and y values, it does not matter what the units are! We take the product of
their standardized scores. Speaking of that….
Fact #6 : Switching the explanatory and response variables on the axes will not change the correlation.
Again, looking at the formula, this is because the order of the multiplication does not matter. Correlation makes
no distinction between explanatory and response variables. It makes no difference which variable you call x
and which you call y when calculating the correlation.
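Facts #5 and #6 can both be checked numerically. The sketch below uses made-up arm span and height values (assumptions for illustration only), converts one variable from centimeters to inches, and then swaps the roles of x and y.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    # Sum of products of standardized scores, divided by n - 1
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Made-up measurements in centimeters, for illustration only
arm_span_cm = [150, 158, 163, 170, 176, 184]
height_cm = [152, 157, 165, 168, 178, 183]

# Fact #5: change the units of one variable (cm to inches)
arm_span_in = [x / 2.54 for x in arm_span_cm]
r_cm = correlation(arm_span_cm, height_cm)
r_in = correlation(arm_span_in, height_cm)

# Fact #6: swap which variable is called x and which is called y
r_swapped = correlation(height_cm, arm_span_cm)

print(abs(r_cm - r_in) < 1e-9)       # same r after changing units
print(abs(r_cm - r_swapped) < 1e-9)  # same r after swapping x and y
```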
Use correlation with caution when outliers appear in your scatterplot. Don’t rely on correlation alone to
judge the strength of the linear relationship between two variables. Graph a scatterplot first!
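To see how much one outlier can matter, the sketch below computes r for a small made-up data set with and without a single unusual point. All values here are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    # Sum of products of standardized scores, divided by n - 1
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Made-up data with a clear positive linear pattern
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 3, 5, 4, 6, 8, 7, 9]
r_without = correlation(xs, ys)

# Add one hypothetical outlier far below the pattern
r_with = correlation(xs + [9], ys + [1])

print(round(r_without, 2), round(r_with, 2))
```

A single point pulls r down sharply here, which the number alone would not reveal without the scatterplot.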