Class 3 Disper&normalcurveh
• Standard Deviation
Compare the variability of the following groups of quiz scores by comparing the standard deviations:
Group 1: 0, 4, 4, 5, 7, 10
Group 2: 0, 0, 1, 9, 10, 10
1. Group 1:
Scores ($X_i$) | Deviations ($X_i - \bar{X}$) | Deviations Squared ($(X_i - \bar{X})^2$)
0 | |
4 | |
4 | |
5 | |
7 | |
10 | |
$\sum X_i =$ | $\sum (X_i - \bar{X}) =$ | $\sum (X_i - \bar{X})^2 =$
$\bar{X} = \frac{\sum X_i}{N} =$
$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{N}} =$
2. Group 2:
Scores ($X_i$) | Deviations ($X_i - \bar{X}$) | Deviations Squared ($(X_i - \bar{X})^2$)
0 | |
0 | |
1 | |
9 | |
10 | |
10 | |
$\sum X_i =$ | $\sum (X_i - \bar{X}) =$ | $\sum (X_i - \bar{X})^2 =$
$\bar{X} = \frac{\sum X_i}{N} =$
$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{N}} =$
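As a cross-check on the hand calculation, the worksheet formula can be sketched in R. The formula on this worksheet divides by N, whereas R's built-in sd() divides by N - 1, so the two give slightly different numbers. The helper name pop_sd is made up for this illustration and is not part of R.

```r
# Quiz scores from the worksheet
group1 <- c(0, 4, 4, 5, 7, 10)
group2 <- c(0, 0, 1, 9, 10, 10)

# Hypothetical helper: standard deviation using the worksheet formula (divide by N)
pop_sd <- function(x) {
  deviations <- x - mean(x)            # X_i minus the mean
  sqrt(sum(deviations^2) / length(x))  # square root of the mean squared deviation
}

pop_sd(group1)  # worksheet formula, Group 1
pop_sd(group2)  # worksheet formula, Group 2
sd(group1)      # R's sd() divides by N - 1 instead, so it is a little larger
sd(group2)
```

Both groups have the same mean, so the comparison of variability rests entirely on the standard deviations.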
• Opinion Polarization
Data: World Values Survey (Wave 6) (subset: only Japan, Sweden, and US are included)
Variables | Description
V198 – V210 | Opinions on controversial issues. Answers ranged from 1 (never justifiable) to 10 (always justifiable).
V2 | Country code: 392 = Japan; 752 = Sweden; 840 = United States
[Screenshot of the data: one row labeled as a case from Japan, another as a case from Sweden.]
1. Load the wvs6sn.csv data into RStudio. Be sure to click “Yes” for heading.
2. Measures of Dispersion:
Variable: V204 (opinions on abortion)
Commands you need:
range()
IQR()
sd()
boxplot()
Hint
You can use the summary() command to find out the measures of central tendency first.
summary(wvs6sn$V204)
Range
Codes:
range(wvs6sn$V204)
*Again, if there are missing values, use the option na.rm = TRUE to remove the NAs when doing the
range(), IQR(), and sd() commands. NA is the default label for missing values in R.
Interquartile Range
Codes:
IQR(wvs6sn$V204)
*Same as the note above: use the na.rm = TRUE option to remove the NAs.
Standard Deviation
Codes:
sd(wvs6sn$V204)
*Same as the note above: use the na.rm = TRUE option to remove the NAs.
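The effect of na.rm = TRUE can be seen on a small made-up vector. The values below are hypothetical answers on the 1 to 10 scale, not survey data.

```r
# Hypothetical answers on the 1-10 scale, with two missing values (NA)
answers <- c(1, 5, 10, NA, 3, 8, NA, 6)

range(answers)                # NA NA: the missing values propagate
sd(answers)                   # NA for the same reason
range(answers, na.rm = TRUE)  # 1 10 once the NAs are dropped
IQR(answers, na.rm = TRUE)    # interquartile range of the non-missing values
sd(answers, na.rm = TRUE)     # standard deviation of the non-missing values
```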
3. Optional: Which countries are more polarized in their views on controversial issues?
As before, the code below computes the measures of dispersion for a continuous variable (e.g., V204)
within each category of a categorical variable (e.g., V2). By comparing the standard deviation of
attitudes toward abortion across the three countries, you can see which countries are more polarized
on this issue. Two methods are shown.
*Method 1
*Codes: (392 = Japan; 752 = Sweden; 840 = USA)
sd(wvs6sn$V204[wvs6sn$V2 == 392], na.rm = TRUE)
sd(wvs6sn$V204[wvs6sn$V2 == 752], na.rm = TRUE)
sd(wvs6sn$V204[wvs6sn$V2 == 840], na.rm = TRUE)
Codes Breakdown
a. The command line:
sd(wvs6sn$V204[wvs6sn$V2 == 392], na.rm = TRUE)
In plain words: calculate the standard deviation of the V204 variable for the cases where V2 equals 392,
with missing values removed.
Hint
The conditions in the brackets here are somewhat different from the ones we encountered before.
See that we don’t use quotes around the country code here (i.e.,[wvs6sn$V2 == 392]). That’s
because the variable V2 is a numeric variable, whereas the other variables we saw before are
factors (i.e., categorical variables). You can find out the object class (i.e., whether a variable is
numeric or not) by using the class() command below:
class(wvs6sn$V2)
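The object class affects how the code behaves, so pay attention to it. For a numeric variable such as V2,
write the value without quotes. If you put quotes around 392, R compares the values as text instead of
as numbers, which can give results you do not expect.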
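The bracket condition and the class() check can be tried on a small made-up data frame. The object toy and its values are hypothetical; only the column names V2 and V204 are borrowed from the survey file.

```r
# Hypothetical toy data using the same column names as the survey file
toy <- data.frame(
  V2   = c(392, 392, 392, 752, 752, 752, 840, 840, 840),
  V204 = c(2, 3, NA, 8, 9, 10, 1, 5, 10)
)

class(toy$V2)                              # "numeric", so no quotes around 392
toy$V204[toy$V2 == 392]                    # V204 values for Japan only: 2 3 NA
sd(toy$V204[toy$V2 == 392], na.rm = TRUE)  # SD for Japan, with the NA dropped
```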
*Method 2
*Codes:
Hints
a. The tapply() command calculates the standard deviation for each category of the V2 variable at the
same time.
b. The first argument should be a continuous variable and the second argument must be a
categorical variable. The third argument, statistics, can be mean, median, sd, or other statistics.
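A call matching hints (a) and (b) might look like the sketch below. It runs on a made-up data frame (toy is hypothetical); with the real data, the first two arguments would presumably be wvs6sn$V204 and wvs6sn$V2.

```r
# Hypothetical toy data using the same column names as the survey file
toy <- data.frame(
  V2   = c(392, 392, 392, 752, 752, 752, 840, 840, 840),
  V204 = c(2, 3, NA, 8, 9, 10, 1, 5, 10)
)

# tapply(continuous variable, categorical variable, statistic, extra options)
# na.rm = TRUE is passed on to sd() for every country
sd_by_country <- tapply(toy$V204, toy$V2, sd, na.rm = TRUE)
sd_by_country  # one standard deviation per country code
```

Swapping sd for IQR, mean, or median changes the statistic without changing the rest of the call.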
*Codes:
boxplot(continuous variable ~ categorical variable)
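Filled in with variable names, the template might look like the sketch below. It uses a made-up data frame (toy is hypothetical); with the real data, the formula would presumably be wvs6sn$V204 ~ wvs6sn$V2.

```r
# Hypothetical toy data using the same column names as the survey file
toy <- data.frame(
  V2   = c(392, 392, 392, 752, 752, 752, 840, 840, 840),
  V204 = c(2, 3, NA, 8, 9, 10, 1, 5, 10)
)

# One box per country code, all drawn in the same chart
boxplot(V204 ~ V2, data = toy)

# The numbers behind the boxes (whisker ends, hinges, median), without drawing
box_stats <- boxplot(V204 ~ V2, data = toy, plot = FALSE)
box_stats$stats  # one column per country
```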
MJ’s interpretation: People in the United States are more polarized in their views on abortion. From the
box plots above, we can see that the IQR for the United States is the greatest, while the IQRs for Japan
and Sweden are smaller and almost the same. In addition to the IQRs, we can also see that the medians
differ across countries. Sweden has the highest median, meaning that people there generally hold
liberal opinions on abortion. People in Japan are quite conservative, and their opinions do not vary much.
In summary, Swedes tend to be more liberal, Japanese tend to be more conservative, and Americans show
no strong tendency (although they lean slightly toward the conservative side). Americans hold diverse
viewpoints on the issue of abortion.
Exercises
1. Compute the range, interquartile range, and standard deviation for the variable V203
(homosexuality). Draw a boxplot for V203 and interpret the boxplot.
2. Optional: People in which countries are more polarized in their views on homosexuality
(V203)? Report the range, IQR, and standard deviation of the three countries at the same
time. Please also plot the three boxplots in the same chart. (Hint: use the tapply()
command and change the statistics argument.)
3. Optional: Report the standard deviation of opinions on suicide (V207) for Japan (V2 = 392).
SOCI 1028 The Normal Curve 1
[Figure: axis labeled Height (cm)]
2. The mean height of the population is 170 cm. Are you above, below, or at the mean?
[Figure: axis labeled Height (cm)]
3. How would you describe your position? Use cm? (Step 1: find the distance
between your height and the mean)
4. Someone told you that human height is normally distributed. The person also said the
mean height is 170 cm and the standard deviation is 5 cm. How can the information
help you find your position in the population?
The empirical rule:
[Figure: Normal Distribution]
When you know the distance in units of standard deviation, you know your
position.
Step 2: Convert the distance into units of the standard deviation of the
distribution.
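Steps 1 and 2 can be sketched in R as follows. The mean and standard deviation come from the question above; the height of 180 cm is a hypothetical value chosen only for illustration. The last line recovers the usual empirical-rule percentages from the standard normal curve.

```r
mean_height <- 170  # population mean from the handout (cm)
sd_height   <- 5    # population standard deviation from the handout (cm)
my_height   <- 180  # hypothetical height (cm), for illustration only

distance_cm <- my_height - mean_height  # Step 1: distance from the mean in cm
z <- distance_cm / sd_height            # Step 2: distance in standard deviations
z                                       # 2, i.e. two standard deviations above the mean

# Share of a normal distribution within 1, 2, and 3 standard deviations of the mean
pnorm(1:3) - pnorm(-(1:3))              # roughly 0.68, 0.95, 0.997
```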
5. Step 3: Find the percentage of the population above or below you using the
empirical rule and/or the Z table.
• Exercises
1. Assume the distribution of the SAT score is a normal distribution. You take the SAT and
score 1500. The mean score for the SAT is 1060 and the standard deviation is 210. You
are _______ standard deviations above the mean.
SOCI 1028 The Normal Curve 2
3. Step 3: Find the percentage of the population above or below you using the empirical rule
and/or the Z table.
• Question last time: What percentage of the population is taller/shorter than you?
• Try to answer these after today’s class: What’s the probability of selecting an individual
who
• Is taller/shorter than you in the population?
• Has a height between Michael Jordan’s (198 cm) and yours?
• Exercises
1. Human height is normally distributed. The mean height is 170 cm and the standard
deviation is 5 cm. Jayson’s height is 179 cm. How many standard deviation(s) is his height
away from the mean?
Find the area beyond the Z score using the Z table (the Z table is on COOL’s lesson page):
Find the area below the Z score using the Z table (the Z table is on COOL’s lesson page):
3. Human height is normally distributed. The mean height for women is 160 cm, and the
standard deviation is 5 cm. Only 20.9% of women are taller than Emily. How tall is Emily
(in cm)?
$z \text{ score} = \frac{X_i - \bar{X}}{s}$
[Figure: Normal Distribution]
4. If a distribution of test scores is normal, with a mean of 78 and a standard deviation of
11, what percentage of the area lies below 60?
• Look up the Z table in R
*Since the defaults of the pnorm() command are mean = 0, sd = 1, and lower.tail = TRUE,
the above line can also be written as
pnorm(-1)
*Similar to (a), the defaults for mean and sd are 0 and 1 respectively, so just change the
lower.tail default to FALSE.
Negative Z:
0.5 - pnorm(-1, mean = 0, sd = 1, lower.tail = TRUE)
qnorm(0.1587, mean = 0, sd = 1)
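The lookups in this section can be gathered into one short sketch. It uses only pnorm() and qnorm() with the defaults described above; the comments are an interpretation of what each line returns, not the handout's original wording.

```r
pnorm(-1)                     # area below z = -1 (mean 0, sd 1, lower tail by default)
pnorm(1, lower.tail = FALSE)  # area above z = 1; equal to the line above by symmetry
0.5 - pnorm(-1)               # area between z = -1 and the mean
qnorm(0.1587)                 # reverse lookup: the z score with 0.1587 of the area below it
```

pnorm() goes from a z score to an area, and qnorm() goes from an area back to a z score, which is why the last line lands very close to -1.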
IMPORTANT Hints
a. See the mean and sd options? You can change the mean and sd according to the normal
distribution at hand. For example, the distribution in the Emily’s height example is N(160, 5),
which means it’s a normal distribution with a mean of 160 and a standard deviation of 5. The
code for finding the area below 167 cm is:
I believe you know how to find the area above 167 now. Simply change the option lower.tail
= TRUE to FALSE.
b. Similarly, you can answer the Emily’s height question with the qnorm() code.
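A sketch of what hints (a) and (b) describe, assuming the N(160, 5) distribution from the Emily example. The exact lines used in class may differ.

```r
# (a) Area below and above 167 cm in N(160, 5)
pnorm(167, mean = 160, sd = 5)                      # area below 167 cm
pnorm(167, mean = 160, sd = 5, lower.tail = FALSE)  # area above 167 cm

# (b) Emily: 20.9% of women are taller, so 0.209 of the area lies above her height
qnorm(0.209, mean = 160, sd = 5, lower.tail = FALSE)
```

With lower.tail = FALSE, qnorm() treats 0.209 as the area above the returned height; the same answer comes from qnorm(1 - 0.209, mean = 160, sd = 5).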
Exercises
1. If a normal distribution has a mean of 100 and a standard deviation of 20, what is the
probability of randomly selecting a score of 105 or more (i.e., the area beyond 105)?
Answers
> #Exercise
> #1)
> pnorm(105, mean = 100, sd = 20, lower.tail = FALSE)
[1] 0.4012937
>
> #2)
> qnorm(0.27, mean = 50, sd = 15)
[1] 40.80781
>
> #3)
> pnorm(29, mean = 25, sd = 3, lower.tail = TRUE) - pnorm(23, mean = 25, sd = 3, lower.tail = TRUE)
[1] 0.6562962