Data1901 Notes
Controlled Experiments
Controlled Experiment
Control Group
Confounding
What is Confounding
What is Bias
Selection Bias
Observer Bias
Consent Bias
Placebo
Precaution
What are some Precautions
Confounders can be hard to find, and can mislead about a cause and
effect relationship
First general look at the data without formally answering the research
questions
Helps to see whether the data can answer the research question
Identify the data's main qualities and suggest the population from
which a sample derives
Barplot
table(Dayweek)
barplot(table(Dayweek)) #or
plot(table(Dayweek))
Vs of Big Data
high volume
high velocity
high variety
high variability
low veracity
high vulnerability
high volatility
high value
Histogram
Age = data$Age
breaks = c(0, 18, 25, 70, 100) #choosing the class interval
AgeF = data$Age[data$Gender=="Female"]
AgeM = data$Age[data$Gender=="Male"]
hist(AgeF, breaks = breaks, freq = FALSE)
hist(AgeM, breaks = breaks, freq = FALSE)
Data and Graphical Summaries 5
library(ggplot2) #provides ggplot() and the diamonds data
p1 = ggplot(diamonds, aes(price))
p1 + geom_histogram(aes(fill = cut),
position = "dodge", binwidth = 1000)
Boxplot
Gender = data$Gender
summary(Age[Gender == "Female"])
summary(Age[Gender == "Male"])
p3 + geom_boxplot() + facet_grid(cut~.)
SpeedLimit = data$SpeedLimit
plot(Age, SpeedLimit)
Max
Min
Numerical Summaries 1
Spread SD, range, IQR
Centre
Mean
Mean is the unique point at which the data is balanced
$\bar{a} = \frac{1}{n}\sum_{i=1}^{n} a_i = \frac{a_1 + a_2 + \cdots + a_n}{n}$
Median
The median is the middle data point, where the data is ordered from the
smallest to largest.
Comparison
For symmetric data, we expect the mean and median to be the same
For left skewed data, we expect the mean to be smaller than the median
For right skewed data, we expect the mean to be bigger than the median
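Mean
Preferred for data which is basically symmetric and does not have many outliers
Median
Preferred for skewed data or data with many outliers, since it is not pulled by extreme values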
Spread
Standard Deviation
Gap between each data point and the mean
Root Mean Square measures the average of a set of numbers regardless of the
sign.
Square
Mean
Root
How is SD calculated
RMS of gaps
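sd(data$Sold) #sample SD, divides by n-1
popsd(data$Sold) #population SD; popsd() is not base R, e.g. it comes from the multicon package
#or
sd(data$Sold)*sqrt(55/56) #there are 56 data points
$\text{popSD} = \text{SD} \times \sqrt{\frac{n-1}{n}}$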
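A minimal sketch of the "RMS of gaps" idea, using a small made-up vector `x` (an assumption for illustration, not course data). It computes the population SD step by step and checks that it agrees with the `sd()` adjustment above.

```r
# Hypothetical data, chosen only so the arithmetic is easy to follow
x = c(2, 4, 4, 4, 5, 5, 7, 9)
n = length(x)

gaps = x - mean(x)                        # gap between each data point and the mean
pop_sd = sqrt(mean(gaps^2))               # Root of the Mean of the Squared gaps
adjusted = sd(x) * sqrt((n - 1) / n)      # sample SD converted to population SD

pop_sd
adjusted
```

Both values should agree, since `sd()` only differs by dividing by n-1 instead of n.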
IQR
What is IQR
Range of the middle 50% of the data
IQR = Q3 − Q1
quantile(data$Sold)
IQR on Boxplot
$\text{Lower Threshold} = Q_1 - 1.5\,IQR$
$\text{Upper Threshold} = Q_3 + 1.5\,IQR$
Outside of lower and upper thresholds are outliers
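A short sketch of the outlier rule, assuming a made-up vector `x` with one extreme value. The lower threshold is Q1 minus 1.5 IQR and the upper threshold is Q3 plus 1.5 IQR; the quartiles come from `quantile()`, which uses R's default quantile method.

```r
# Hypothetical data with one extreme value
x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q = quantile(x)                      # 0%, 25%, 50%, 75%, 100%
iqr = q[["75%"]] - q[["25%"]]        # range of the middle 50% of the data
lower = q[["25%"]] - 1.5 * iqr       # lower threshold
upper = q[["75%"]] + 1.5 * iqr       # upper threshold

outliers = x[x < lower | x > upper]  # points outside the thresholds
outliers
```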
Coefficient of Variation
$CV = \frac{SD}{\text{mean}}$
The higher the CV, the more volatile the sample
Normal Model
Normal Curve
General vs Standard Normal Curve
Finding Area under Normal Curve
Standard Normal Curve
General Normal Curve
Special Properties
Measurement Error
Chance Error
Bias
Outliers
Normal Curve
Normal curve approximates many natural phenomena
$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Standard normal curve (Z) has mean 0 and SD 1
pnorm(0.8, lower.tail = FALSE) #area to the right of 0.8; pnorm defaults to lower tail (finding area to the left)
pnorm(171, 161.9, 7.7) #pnorm(x value, mean, SD), defaults to lower tail
Special Properties
68, 95, 99.7 Rule
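The rule can be checked on the standard normal curve with `pnorm`. In this sketch `area_within` is a hypothetical helper name.

```r
# Area under the standard normal curve within k SDs of the mean
area_within = function(k) pnorm(k) - pnorm(-k)

areas = area_within(1:3)
round(areas, 4)   # roughly 68%, 95% and 99.7%
```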
Any General Normal can be Standardised
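As a sketch, standardising turns the general normal calculation from earlier (x = 171, mean 161.9, SD 7.7) into a standard normal one. The two areas should be identical.

```r
x = 171; mu = 161.9; sigma = 7.7

z = (x - mu) / sigma                 # standard units
area_standard = pnorm(z)             # area to the left on the standard normal curve
area_general = pnorm(x, mu, sigma)   # same area on the general normal curve

c(area_standard, area_general)
```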
Measurement Error
Individual Measurement = Exact Value + Chance Error + Bias
Chance Error
What is Chance Error
No matter how carefully a measurement is made, it could have turned out
differently
Replicate the measurement under the same conditions, and calculate the
SD
Bias
What is Bias
Outliers
What are Outliers
Reproducible Report
Dangers of Non-reproducible Report
Good Practices
People can edit an Excel file without documenting what has changed and why
People can photoshop images without keeping record of what changed and
why
Good Practices
Annotate your code as you go
Ensures understanding
Linear Model
Scatter Plot and Correlation
Bivariate Data
Correlation Coefficient
Misleading Correlation
Regression Line
SD Line vs Regression Line
SD Line
Regression Line
Predictions
Baseline Predictions using mean
Prediction in a strip
Based on the Regression line
Predicting using Percentile ranks
Limitations in Prediction
Residual
RMS Error
Residual Plot
Vertical Strips
Normal Distribution within strips
Examines the relationship between 2 quantitative variables
SpeedLimit = data$SpeedLimit
plot(Age, SpeedLimit)
Bivariate Data
What is Bivariate Data
Correlation Coefficient
How can we summarise a scatterplot
Mean and SD of X
Mean and SD of Y
Correlation Coefficient, r
Positive r value slopes up, Negative r value slopes down
SU_x=(data$fheight-mean(data$fheight))/popsd(data$fheight)
SU_y=(data$sheight-mean(data$sheight))/popsd(data$sheight)
mean(SU_x*SU_y)
#or
cor(data$fheight,data$sheight)
$r_{pop} = r_{sample}$
r is unchanged when X and Y are swapped
Misleading Correlation
Outliers can overly influence the correlation coefficient
Non-linear association cannot be detected by r
The r value can be high even when another model would fit the data better
Association is not causation
Regression Line
SD Line vs Regression Line
SD Line
Line connecting the point of averages $(\bar{x}, \bar{y})$ and $(\bar{x} + SD_x, \bar{y} + SD_y)$
Regression Line
Line connecting the point of averages $(\bar{x}, \bar{y})$ and $(\bar{x} + SD_x, \bar{y} + r\,SD_y)$
l = lm(NW~CE) #lm(Yvariable ~ Xvariable), stored as l
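A sketch of why the regression line passes through the point of averages with slope r × SDy / SDx. The data here are simulated (an assumption, not the NW and CE variables from the notes), and the hand-built line is compared with `lm`.

```r
set.seed(1)
# Hypothetical simulated data
x = rnorm(50, mean = 170, sd = 7)
y = 85 + 0.5 * x + rnorm(50, mean = 0, sd = 5)

r = cor(x, y)
slope = r * sd(y) / sd(x)                 # rise of r*SDy for a run of SDx
intercept = mean(y) - slope * mean(x)     # line goes through (mean x, mean y)

fit = lm(y ~ x)
coef(fit)
c(intercept, slope)
```

The ratio sd(y)/sd(x) is the same whether sample or population SDs are used, so either version gives the same slope.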
Predictions
Baseline Predictions using mean
For any X value, the Y value is the mean of Y
Prediction in a strip
Find the Y value that corresponds to the X value in the given data
If there are multiple values, take the average
Limitations in Prediction
Extrapolating
Regression fallacy
Regression to mediocrity
Residual
What is Residual
Residual is the vertical distance of a point above or below the regression
line
$e_i = y_i - \hat{y}_i$
What does the Residual represent
l$residuals[10] #residual of the 10th point
RMS Error
What does the RMS error represent
Population RMS error represents the average gap between the points and
the regression line
$\text{RMS error}_{pop} = \sqrt{\frac{e_1^2 + e_2^2 + \cdots + e_n^2}{n}} = \sqrt{1 - r^2}\, SD_y$
res = NW - l$fitted.values
sqrt(mean(res^2)) # RMSError(Pop)
sqrt(1-(cor(CE,NW))^2)*sd(NW) #RMSError(Sample)
sqrt(1-(cor(CE,NW))^2)*sd(NW)*sqrt((length(CE)-1)/(length(CE))) #RMSError(Pop)
Residual Plot
Graphs the residuals vs x
plot(CE,l$residuals, ylab="residuals")
abline(h=0)
Vertical Strips
How to determine if the data is homoscedastic or heteroscedastic
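If the data is homoscedastic, the SD within each vertical strip is roughly the same: $SD_{y^*} \approx$ RMS Error
Find the percentage of days above a certain y value, for a given x value
e.g. Of the days where x = value, what percentage of days had y > value?
Mean of the strip: $\bar{y}^* = \bar{y} + z_x\, r\, SD_y$
SD of the strip: $SD_{y^*} \approx$ RMS Error
Standardise the value of y using the strip's mean and SD, then use the normal curve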
Chance
Prosecutor's Fallacy
Chance
Independence & Dependence
Making List
Example: Two dice are thrown, what is the chance of getting a total of 6 spots
Addition Rule
Example
Prosecutor's Fallacy
What is the Prosecutor's Fallacy
Mistakenly assuming that $P(A \mid B) = P(B \mid A)$
Misunderstanding of how conditional probability works
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
Probability Table
Category    DNA Match    DNA doesn't match
Guilty      1            0
Innocent    9            4,999,990
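Using the counts in the table, a sketch of the two conditional probabilities that the fallacy confuses. The column label `NoMatch` and the variable names are assumptions made for the code.

```r
counts = matrix(c(1, 9, 0, 4999990), nrow = 2,
                dimnames = list(c("Guilty", "Innocent"), c("Match", "NoMatch")))

# P(Match | Innocent): tiny, this is what the prosecutor quotes
p_match_given_innocent = counts["Innocent", "Match"] / sum(counts["Innocent", ])

# P(Innocent | Match): what actually matters, and it is large
p_innocent_given_match = counts["Innocent", "Match"] / sum(counts[, "Match"])

p_match_given_innocent
p_innocent_given_match
```

Of the 10 people whose DNA matches, 9 are innocent, so a match alone is weak evidence of guilt.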
Chance
What is Chance
2 events are independent if the chance of the 2nd given the 1st is the
same as the chance of the 2nd, i.e. the 1st event happening does not
affect the probability of the 2nd
Making List
Write a list of all outcomes
Method 2: Simulate
set.seed(1)
totals = sample(1:6, 1000, replace = TRUE) + sample(1:6, 1000, replace = TRUE)
#sample(x, size (number of items to choose), replace, prob (probability vector))
table(totals)
Addition Rule
2 events are mutually exclusive when both cannot happen at the
same time
If mutually exclusive P (A ∩ B) =0
When to multiply
Example
What is the probability of at least 1 Ace, after tossing 4 times
$1 - P(\text{not Ace})^4 = 1 - \left(\frac{5}{6}\right)^4 = 0.518$
What is the probability of at least 1 double Ace, after tossing 24 times
$1 - P(\text{no double Ace})^{24} = 1 - \left(\frac{35}{36}\right)^{24} = 0.49$
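Both answers can be checked in R. This sketch computes them with the complement rule and then simulates the first one, assuming an Ace means rolling a 1 on a fair die.

```r
# Complement rule: P(at least one) = 1 - P(none)
p_one_ace = 1 - (5/6)^4
p_double_ace = 1 - (35/36)^24

# Simulation of the first game: 4 rolls, is there at least one 1?
set.seed(1)
sim = replicate(10000, any(sample(1:6, 4, replace = TRUE) == 1))

round(c(p_one_ace, p_double_ace), 3)
mean(sim)   # should be close to p_one_ace
```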
Binomial Formula
Binomial Coefficient
Binomial Model
Binomial Theorem
Example
Binomial Coefficient
Suppose we have n objects in a row, made up of 2 types: x and n − x
The number of possible arrangements is $\binom{n}{x} = \frac{n!}{x!(n-x)!}$
Binomial Model
Binomial Theorem
Suppose we have n independent, binary trials, each with chance p of the event. The chance that exactly x
events occur is $\binom{n}{x} p^x (1-p)^{n-x}$
Example
A fair coin is tossed 5 times, what is the probability of getting 3 heads?
A fair coin is tossed 500 times, what is the probability of getting 300
heads?
Suppose 100 babies are born with P (boy) = 0.51, what is the probability
that 55 boys were born?
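A sketch of how these three examples could be evaluated with `dbinom(x, n, p)`, plus the formula written out by hand for the first one. The variable names are assumptions.

```r
# dbinom(x, size, prob) = choose(size, x) * prob^x * (1 - prob)^(size - x)
p_3_heads = dbinom(3, 5, 0.5)          # 3 heads in 5 tosses
p_300_heads = dbinom(300, 500, 0.5)    # 300 heads in 500 tosses
p_55_boys = dbinom(55, 100, 0.51)      # 55 boys in 100 births

# First example by hand, using the binomial formula
by_hand = choose(5, 3) * 0.5^3 * (1 - 0.5)^2

c(p_3_heads, by_hand)
p_300_heads
p_55_boys
```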
Box Model
Law of Averages
Box Model
Sum of draws from a box model
Mean of draws from a box model
Modelling the Sum/Mean of a sample using Normal Curve
Example 1
Example 2
Example 3
Example 4
Law of Averages
As the number of draws increases, the absolute size of the chance error
increases, but the percentage size decreases. The proportion of the event
converges to the expected proportion
Box Model
CE = OV − EV
Examples
Example 1
What is the EV, OV and chance error
EV = 2 × 5 = 10
OV = 3 + 1 + 2 + 2 + 3 = 11
Chance Error = OV − EV = 1
Example 2
Suppose it costs $1 to play the game
Answer
set.seed(1)
dietosses = sample(c(1, -1, -1, -1, -1, -1), 25, replace = TRUE)
#or dietosses = sample(c(1, -1), 25, replace = TRUE, prob = c(1/6, 5/6))
sum(dietosses)
Example 3
38 pockets, numbered 0 (green), 00 (green), and 1 to 36 (alternating red and
black)
Answer
set.seed(1)
roulette = sample(c(1, -1), 10, repl = TRUE, prob = c(18/38,20/38))
Example 4
38 Numbers
Answer
set.seed(1)
number = sample(c(35, -1), 100, repl = TRUE, prob = c(1/38, 37/38))
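$\text{SD of box} = (\text{big} - \text{small}) \sqrt{\text{proportion of big} \times \text{proportion of small}}$
For the mean of the draws: $\text{Standard Error} = \frac{\text{SD of box}}{\sqrt{\text{number of draws}}}$
For the sum of the draws: $\text{Standard Error} = \sqrt{\text{number of draws}} \times \text{SD of box}$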
Answer
Method 1
EV = 50
$SE = \sqrt{100} \times \sqrt{0.5 \times 0.5} = 5$
$N(50, 5^2)$
Standardising 40 and 60 gives -2 and 2
pnorm(2)-pnorm(-2)
Method 2
Simulation using R
set.seed(1)
box = c(1, 0)
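totals = replicate(1000, sum(sample(box, 100, replace = TRUE))) #sum of 100 draws, repeated 1000 times
mean(totals >= 40 & totals <= 60) #proportion of sums between 40 and 60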
Example 2
Number 1 to 80, 20 numbers were chosen at random without replacement
You pick a single number
If you play 100 times, how much would you expect to win or lose?
Answer
Mean = -0.25
$SD = (2 - (-1))\sqrt{0.25 \times 0.75} = 1.299038$
$EV = -0.25 \times 100 = -25$
$SE = \sqrt{100} \times 1.299038 = 12.99038$
How often would you lose more than $20?
Answer
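One way to answer this is a normal approximation for the sum of 100 plays, using the EV and SE found above. This is a sketch: it assumes the normal curve is a reasonable fit here and ignores any continuity correction.

```r
ev = -0.25 * 100                          # expected total after 100 plays
box_sd = (2 - (-1)) * sqrt(0.25 * 0.75)   # SD of the box
se = sqrt(100) * box_sd                   # SE of the sum

# Losing more than $20 means the total is below -20
p_lose_more_than_20 = pnorm(-20, mean = ev, sd = se)
p_lose_more_than_20
```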
Example 3
A die is rolled 60 times, how many 6's do we expect?
Answer
EV = 10
$SE = \sqrt{60} \times \sqrt{\frac{1}{6} \times \frac{5}{6}} \approx 2.89$
In 60 plays we would expect to see 10 6's with a SE of around 3
Use SDpop
Example 4
A box containing 0, 2, 3, 4, 6
Answer
Method 1
$EV = 25 \times \frac{0+2+3+4+6}{5} = 75$
$SE = \sqrt{25} \times SD_{pop} = 5 \times 2 = 10$
In 25 plays, we would expect to see a sum of 75 with a SE of 10
Method 2
set.seed(1)
box = c(0, 2, 3, 4, 6)
totals = replicate(30, sum(sample(box, 25, replace = TRUE)))
table(totals)
length(totals[totals>=65 & totals<=85])/30 #proportion of sums within 1 SE of the EV
If you play 25 times, how often is the sum between 50 and 100?
Answer
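A sketch of two ways to approach this, using EV = 75 and SE = 10 for the sum of 25 draws from the box 0, 2, 3, 4, 6. The number of simulations (10000) is an arbitrary choice.

```r
box = c(0, 2, 3, 4, 6)
ev = 25 * mean(box)                                # 75
se = sqrt(25) * sqrt(mean((box - mean(box))^2))    # 5 * SDpop = 10

# Normal approximation: 50 and 100 are 2.5 SEs either side of the EV
p_normal = pnorm(100, ev, se) - pnorm(50, ev, se)

# Simulation check
set.seed(1)
totals = replicate(10000, sum(sample(box, 25, replace = TRUE)))
p_sim = mean(totals >= 50 & totals <= 100)

c(p_normal, p_sim)
```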
Normal Approximation
Types of Histograms
Central Limit Theorem
Continuity Correction
Types of Histograms
Type          Use
Data          Represents data by area
Probability   Represents chance by area
Simulation    Converges in shape to probability histogram
Data Histogram
Probability Histogram
Simulation Histogram
The probability histogram for the sum/mean will closely follow the
normal curve, if the sample size for the sum is sufficiently large when
drawing at random with replacement
Continuity Correction
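A sketch of the idea using the coin example from the Box Model section (100 tosses, EV 50, SE 5). Because the number of heads is discrete, each bar of the probability histogram extends half a unit either side of its value, so the normal curve is taken from 39.5 to 60.5 instead of 40 to 60.

```r
ev = 50; se = 5

# Exact chance of between 40 and 60 heads (inclusive) in 100 tosses
exact = sum(dbinom(40:60, 100, 0.5))

# Normal approximation without and with the continuity correction
without_cc = pnorm(60, ev, se) - pnorm(40, ev, se)
with_cc = pnorm(60.5, ev, se) - pnorm(39.5, ev, se)

round(c(exact = exact, without_cc = without_cc, with_cc = with_cc), 4)
```

The corrected value should sit closer to the exact binomial answer.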
Sample Survey
Populations vs Sample
Limitation of a census
Parameter and Estimate
Parameter
Estimate
Issues with finding the best estimate of the parameter
Choosing a Sample
Biases
Picking a good Sample
Multi-stage Cluster Sampling
Quota Sampling
Convenience Sampling
Correction Factor
What affects accuracy
Populations vs Sample
The population is the full amount of information being studied, collected
through a census
Limitation of a census
Collecting every unit of a population takes a lot of
Time
Money
Resources
Estimate
What are estimates
Calculation of sample values which best predicts the parameter, e.g. the sample
mean $\bar{x}$
Choosing a Sample
Biases
What are the 4 common types of Bias
Interviewers can have an effect on the answers given by participants
Sensitive Questions
Lack of clarity
The interviewer can compute the chance of any particular individual being
chosen → objectivity
Quota Sampling
Non-probability sampling where the assembled sample has the same
proportions of individuals as the entire population with respect to known
characteristics, traits or focused phenomenon.
→ Results in unintentional bias from the interviewer when they choose the
subjects to survey.
Convenience Sampling
Non-probability sampling where the subjects are selected because of their
convenient accessibility. Good for pilot surveys.
Correction Factor
$SE_{\text{without replacement}} = \sqrt{\frac{\text{population size} - \text{sample size}}{\text{population size} - 1}} \times SE_{\text{with replacement}}$
When the sample is only a small part of the population, the size of the
population has almost no effect on the SE of the estimate
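A small sketch of this point. The correction factor is the square root of (population size − sample size) / (population size − 1); the population sizes below are made up, and `correction_factor` is a hypothetical helper name.

```r
# Multiplier applied to the with-replacement SE
correction_factor = function(pop_size, sample_size) {
  sqrt((pop_size - sample_size) / (pop_size - 1))
}

cf_small_share = correction_factor(10000, 100)  # sample is 1% of the population
cf_large_share = correction_factor(200, 100)    # sample is half of the population

c(cf_small_share, cf_large_share)
```

When the sample is a small share of the population the factor is close to 1, so it barely changes the SE.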
Hypothesis Test
Steps in Hypothesis Testing
Common Mistakes
Hypotheses
Example 1
Z Test
Example 1
T Test
Example
Paired T Test
2 Sample T Test
Ways to check Assumptions
Comparative Boxplots - normality and equality of variance assumptions
QQ Plot - normality
Shapiro-Wilk Test - normality
Levene's Test (F Test) - equal spread
Welch 2 Sample T Test - for unequal variance
Mann-Whitney-Wilcoxon Test - for non-normality
Paired T Test
Comparing T Test and Z Test
Set up research question
Define H0 and H1
Weigh up evidence
Assumptions
Test Statistic
Measures the difference between what is observed in the data and the
expected from null hypothesis
$\text{test statistic} = \frac{OV - EV}{SE}$
P-value
Explain conclusion
Common Mistakes
P-value is not the chance that the null hypothesis is true
What to say
Hypotheses
What is H0
The null hypothesis assumes that the difference between the observed
value and expected value is due to chance alone
What is H1
Example 1
Question: Does the probiotic treatment work for 80% of patients?
Treatment 29 26
Placebo 28 2
Mean = 0.8
$SD = (1 - 0)\sqrt{0.8 \times 0.2} = 0.4$
$EV = 0.8 \times 29 = 23.2$
$SE = \sqrt{29} \times 0.4 \approx 2.2$
Assumptions
Assume normality
Test Statistic
$\text{test statistic} = \frac{OV - EV}{SE} = \frac{26 - 23.2}{2.2} \approx 1.3$
P-value
pnorm(1.3, lower.tail=FALSE)
Conclusion
From P-value, we conclude that the data is consistent with the null
hypothesis
The new treatment does not seem to have an effectiveness rate higher
than 0.8
We fail to find sufficient evidence to claim an effective rate of higher
than 0.8
Z Test
Example 1
caf0 = c(36.05,52.47,56.55,45.2,35.25,66.38,40.57,57.15,28.34)
caf13 = c(37.55,59.3,79.12,58.33,70.54,69.47,46.48,66.35,36.2)
mean(caf0) # 46.44
sd(caf0) # 12.48826
mean(caf13) # 58.14889
sd(caf13) # 15.13416
Research Question: Is the mean time to exhaustion with no caffeine equal to 45
minutes?
Answer
$H_0: \mu = 45$
$H_1: \mu \neq 45$ (2 sided test)
Assumptions
We assume the sample of cyclists is random and that they are all
independent of each other (not related)
Test Statistic
$\text{test statistic} = \frac{OV - EV}{SE} = \frac{46.44 - 45}{12/\sqrt{9}} \approx 0.36$
SD = 12 is from a previous study
P-value
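A sketch of how the 2 sided P-value for this Z test could be computed, using the observed mean 46.44, the hypothesised mean 45, the population SD of 12 and n = 9.

```r
ov = 46.44; ev = 45
se = 12 / sqrt(9)                 # SE of the sample mean

z = (ov - ev) / se                # test statistic
# 2 sided: area in both tails beyond |z|
p_value = 2 * pnorm(abs(z), lower.tail = FALSE)

round(c(z = z, p_value = p_value), 2)
```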
Conclusion
The probability of getting a test statistic at least this extreme is 0.72 if the
null hypothesis is true. Therefore the data appears to be consistent with the
null hypothesis.
T Test
When we don't know the population SD, we estimate it with the sample SD
This estimation adds extra variability to the test statistic, as the
sample SD varies from sample to sample
Example
Same question as above
Answer
t.test(caf0, mu = 45)
Paired T Test
Research Question: Is the mean time to exhaustion with no caffeine the same
as 13mg caffeine? Is there a difference?
Answer
cafdiff = caf13 - caf0 #This 1 sample is our focus: the "differences".
mean(cafdiff) # 11.70889
sd(cafdiff) # 10.79987
t.test(cafdiff, mu = 0)
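A sketch of what `t.test` is doing here, rebuilding the test statistic and P-value by hand from the differences. The data are the `caf0` and `caf13` vectors given earlier, repeated so the block runs on its own.

```r
caf0 = c(36.05, 52.47, 56.55, 45.2, 35.25, 66.38, 40.57, 57.15, 28.34)
caf13 = c(37.55, 59.3, 79.12, 58.33, 70.54, 69.47, 46.48, 66.35, 36.2)
cafdiff = caf13 - caf0
n = length(cafdiff)

se = sd(cafdiff) / sqrt(n)            # SE uses the sample SD
t_stat = (mean(cafdiff) - 0) / se     # (OV - EV) / SE
p_value = 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

res = t.test(cafdiff, mu = 0)
c(t_stat, unname(res$statistic))
c(p_value, res$p.value)
```

The same result comes from `t.test(caf13, caf0, paired = TRUE)`.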
2 Sample T Test
What are the assumptions
Check using
Context in question
Check using
Boxplots
Histogram
Variance Test
Check using
Histograms
QQ Plots
Normality Test
$\text{Test Statistic} = \frac{\bar{x}_1 - \bar{x}_2 - 0}{SE}$
QQ Plot - normality
p3 = ggplot(RB_data, aes(sample = rate, colour = group)) +
stat_qq() + stat_qq_line() + ggtitle("QQplot")
QQ plot graphs the theoretical quantiles based on the normal curve against
actual quantiles. If the line formed by the points is reasonably straight, we
can assume that the data is normally distributed.
shapiro.test(No_RB)
var.test(No_RB,RB)
Welch 2 Sample T Test - for unequal variance
Use Welch Test for unequal variance
t.test(No_RB, RB, var.equal = FALSE) #add var.equal = FALSE
Paired T Test
For times when analysis of dependent data is more desirable, we can use
paired T Test
Test   Distribution   SD used
Z      Normal         Population SD
T      t(n-1)         Sample SD
SD tells us how far each individual cyclist varies from the mean in this
sample
SE tells us how far the sample means vary from the true population
mean
Chi-Squared Tests and Regression Tests
Chi-Squared Test
Test for Goodness of Fit of Model
Example
Test for Independence
Example
Yates Continuity Correction
Fisher's Exact Test
Regression Tests
Example
Tests so far
Test Data
We can see that Z & T Tests can be used for counting and classifying, by
modelling a 0-1 box.
Goodness of Fit
Homogeneity
Independence
$\chi^2 = \sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k}$
Face    Observed   Expected   O - E   (O - E)^2 / E
1       7          10         -3      0.9
2       7          10         -3      0.9
3       9          10         -1      0.1
4       14         10         4       1.6
5       10         10         0       0
6       13         10         3       0.9
Total   60         60         0       4.4
Answer
P-value
pchisq(4.4, 5, lower.tail=F)
#pchisq(value calculated on top, degree of freedom)
chisq.test(throws1a, p = c(1/6,1/6,1/6,1/6,1/6,1/6))
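A sketch that rebuilds the test statistic from the observed counts in the table (7, 7, 9, 14, 10, 13) and compares it with `chisq.test`. The vector `observed` is an assumption standing in for the tabulated die throws.

```r
observed = c(7, 7, 9, 14, 10, 13)             # counts for faces 1 to 6
expected = rep(sum(observed) / 6, 6)          # fair die: 10 of each

chi_sq = sum((observed - expected)^2 / expected)
p_value = pchisq(chi_sq, df = 6 - 1, lower.tail = FALSE)

res = chisq.test(observed, p = rep(1/6, 6))
c(chi_sq, unname(res$statistic))
c(p_value, res$p.value)
```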
Smoke 9 4 13
No Smoke 6 6 12
Total 15 10 25
Answer
chisq.test(Smoke)
Conclusion
Given large P-value, background and smoking preference appear to be
independent suggesting that there is no background bias in smokers
fisher.test(Smoke)
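For the calls above to run, `Smoke` needs to be a 2 by 2 table of counts. A sketch of one way to build it from the table in the notes; the column labels `Group 1` and `Group 2` are placeholders, since the original column headings are not shown.

```r
# Rows: smoking status. Columns: the two background groups (labels are placeholders)
Smoke = matrix(c(9, 6, 4, 6), nrow = 2,
               dimnames = list(c("Smoke", "No Smoke"), c("Group 1", "Group 2")))

chi_res = suppressWarnings(chisq.test(Smoke))  # Yates continuity correction is applied by default for 2x2 tables
fisher_res = fisher.test(Smoke)                # exact test, useful when expected counts are small

chi_res$expected
c(chi_res$p.value, fisher_res$p.value)
```

One expected count is just under 5, so `chisq.test` warns that the approximation may be inaccurate (suppressed here), which is one reason to also look at Fisher's Exact Test.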
Regression Tests
We use Regression Test to check if there is a linear trend
Example
Answer
Test Statistic
summary(lm(Value~Time, PCBeer))
Conclusion
Therefore, we reject H0 and conclude that there is strong evidence to
suggest that the slope is significant
Hence, there is strong evidence to suggest that beer consumption has
been changing in Australia since Year 2000