Data1901 Notes

The document provides an overview of data science concepts, focusing on controlled experiments, confounding variables, and biases that can affect data interpretation. It discusses the importance of domain knowledge, various types of biases, and methods to control for confounding variables in experiments. Additionally, it covers data analysis techniques including graphical summaries, numerical summaries, and the significance of reproducibility in research.

Uploaded by Nanda gouri

Intro, Controlled Experiment

Intro to Data Science


Domain Knowledge
Controlled Experiments
Confounding
Precaution

Intro to Data Science

Domain Knowledge

Background context information that helps to understand the data

Controlled Experiments
Controlled Experiment

An experiment that controls the effects of other variables on the treatment

Control Group

Contemporaneous Control Group - same time as the Treatment Groups

Historical Control Group - earlier than the Treatment Groups

Confounding
What is Confounding

Confounding is when the effect of one variable mixes with the effect of another variable, causing confusion in interpretation

What causes Confounding

Caused by many types of bias, including Selection, Observer and Consent

What is Bias

Something that affects the ability of the data to accurately measure the treatment effect

What are the different types of Bias

Selection Bias

The bias in the selection of individuals into the groups eg. allocate based on health

What is the solution

Randomised Controlled Trial (RCT) - random allocation of individuals into groups

Observer Bias

Bias when the subjects (in response) or the investigators (in evaluation) are aware of the identity of the 2 groups, as they may deliberately or subconsciously report more or less favourable results.

What is the solution

Randomised Controlled Double-Blind Trial - both subjects and investigators are not aware of the identity of the 2 groups.

→ To control the patient's expectations and the investigator's observations

→ Administered by a third party

Consent Bias

Subjects choose whether or not they take part in the experiment

Placebo

Pretend treatment, designed to be neutral and indistinguishable from the treatment.

Placebo Effect is the effect which occurs when subjects think they had the treatment

Precaution
What are some Precautions

Observational studies cannot establish causation, only association

Association may suggest causation, but does not prove causation

Observational studies can have misleading hidden confounders

Confounders can be hard to find, and can mislead about a cause and
effect relationship

Confounders can be introduced into a randomised study via

Selection Bias - introduced by investigators in the selection of subjects for treatment eg. select healthier subjects for surgery

Survival Bias - caused by dropout of some subjects, eg improvement is due to the dropout of worse subjects who do not respond to the treatment

Adherers Bias - subjects who adhere to the treatment tend to be more compliant and healthier

How to deal with confounders (controlling for confounders → trying to reduce the influence of confounding variables)

Make the groups more comparable by dividing them into subgroups with respect to the confounder

eg. If alcohol consumption is a potential confounding factor for smoking's effect on liver cancer, we can divide our subjects into 3 different levels of drinkers.

What is the limitation in this strategy

Limited by the ability to identify all confounders and then divide the study by the confounders.

Observational studies with confounding variables can lead to Simpson's Paradox

Sometimes there is a clear trend in individual groups of data that disappears when the groups are pooled together.

It occurs when the relationship between percentages in subgroups is reversed when the subgroups are combined, because of a confounding or lurking variable
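The reversal can be reproduced in a few lines of R. The counts below are the classic kidney-stone treatment numbers (a standard textbook illustration, not from this course's data):

```r
# Treatment A looks better within each subgroup, but worse when pooled
a_small = c(success = 81, total = 87)
b_small = c(success = 234, total = 270)
a_large = c(success = 192, total = 263)
b_large = c(success = 55, total = 80)

rate = function(x) x["success"] / x["total"]

rate(a_small) > rate(b_small)                  # TRUE: A better on small stones
rate(a_large) > rate(b_large)                  # TRUE: A better on large stones
rate(a_small + a_large) > rate(b_small + b_large)  # FALSE: B looks better pooled
```

Stone size is the lurking variable: A was given the harder (large-stone) cases more often.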

Observational studies can result from using an historical control

Some studies appear to be controlled experiments, but use a historical control where time can be a confounding variable. A Contemporaneous Control Group (mentioned above) is preferred.


Data and Graphical Summaries
Qualitative Data
Barplot
Vs of Big Data
Histogram
Common Mistakes of histogram
Boxplot
Scatterplot

Initial Data Analysis (IDA)

First general look at the data without formally answering the research
questions

What are the functions of IDA

Helps to see whether the data can answer the research question

Pose other research questions

Identify the data's main qualities and suggest the population from
which a sample derives

Data is normally of the sample, not the population

What is involved in IDA

Data background - checking the quality and integrity of the data

Data structure - what information has been collected

Data wrangling - scraping, cleaning, tidying, reshaping, splitting, combining

Data summaries - graphical and numerical

Qualitative Data

Barplot

table(Dayweek)
barplot(table(Dayweek)) #or
plot(table(Dayweek))


Dayweek = data$Dayweek
Gender = data$Gender

data1 = table(Gender, Dayweek)

barplot(data1, main="Fatalities by Day of the Week and Biological Sex",
        xlab="Day of the week", col=c("lightblue","lightgreen"),
        legend = rownames(data1))

barplot(data1, main="Fatalities by Day of the Week and Biological Sex",
        xlab="Day of the week", col=c("lightblue","lightgreen"),
        legend = rownames(data1),
        beside=TRUE)


p = ggplot(diamonds, aes(x = cut))

p + geom_bar(aes(fill = clarity), position = "dodge")

Potential exam question: know your ggplot syntax.

Vs of Big Data

high volume

high velocity

high variety

high variability

low veracity

high vulnerability

high volatility

high value

Histogram
Area of each block represents the percentage of subjects in that particular class interval

Horizontal scale is divided into class intervals

Height of each block represents crowding (density)

Age = data$Age

breaks = c(0, 18, 25, 70, 100) #choosing the class intervals

table(cut(Age, breaks, right = FALSE))

hist(Age, br = breaks, freq = FALSE, right = FALSE)

#freq = FALSE generates histogram on density scale
#right = FALSE makes the intervals right-open (convention)

As mentioned in the intro, the code below shows controlling for a variable

AgeF=data$Age[data$Gender=="Female"] #This selects just the female ages.

AgeM = data$Age[data$Gender=="Male"]

par(mfrow=c(1,2)) #This puts the graphic output in 1 row with 2 columns

hist(AgeF,freq = FALSE)
hist(AgeM,freq = FALSE)

p1 = ggplot(diamonds, aes(price))

p1 + geom_histogram(aes(fill = cut),
                    position = "dodge", binwidth = 1000)

Common Mistakes of histogram

Making the block heights equal to the percentages

Too many class intervals (use between 10 and 15)

Boxplot

Gender = data$Gender

summary(Age[Gender == "Female"])
summary(Age[Gender == "Male"])

boxplot(Age~Gender, horizontal = TRUE)

p3 = ggplot(diamonds, aes(color, price))

p3 + geom_boxplot() + facet_grid(cut~.)



Scatterplot
Examines the relationship between 2 quantitative variables

SpeedLimit = data$SpeedLimit

plot(Age, SpeedLimit)

Random scatterplot → Does not appear to be a relationship between age and speed limit in fatalities

p2 = ggplot(diamonds, aes(carat, price))

p2 + geom_point(aes(color = clarity, shape = cut))


Numerical Summaries
Centre
Mean
Median
Comparison
Spread
Standard Deviation
Standard Units (Z Score)
IQR
IQR on Boxplot
Coefficient of Variation

What are the advantages of numerical summaries

Reduces all the data into 1 simple number

Loses a lot of information

But allows easy communication and comparisons

Major features for numerical summaries

Max

Min

Centre (mean, median)

Spread (SD, range, IQR)

Centre
Mean
Mean is the unique point at which the data is balanced

x̄ = (1/n) Σᵢ₌₁ⁿ aᵢ = (a₁ + a₂ + ⋯ + aₙ)/n

mean(data$Sold[data$Type == "House" & data$Bedrooms == "4"])

hist(data$Sold[data$Type == "House" & data$Bedrooms == "4"])

abline(v = mean(data$Sold[data$Type == "House" & data$Bedrooms == "4"]),
       col = "green")

Median
The median is the middle data point, where the data is ordered from the
smallest to largest.

median(data$Sold[data$Type == "House" & data$Bedrooms == "4"])

Comparison
For symmetric data, we expect the mean and median to be the same

For left skewed data, we expect the mean to be smaller than the median

For right skewed data, we expect the mean to be bigger than the median
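A quick R illustration of the skew effect, using made-up right-skewed values (one large outlier drags the mean up but barely moves the median):

```r
# Hypothetical right-skewed data: the mean is pulled above the median
x = c(20, 22, 23, 25, 26, 28, 30, 200)
mean(x)    # 46.75 - dragged up by the outlier
median(x)  # 25.5
```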

Which is optimal for describing centre?

Both have strengths and weaknesses → need to examine the nature of the data

When is neither sensible?

When the data is bimodal

Which is preferred if the data is skewed or has many outliers

Median

What is Mean helpful for

For data which is basically symmetric and does not have many outliers

Spread
Standard Deviation
Gap between each data point and the mean

Root Mean Square measures the average of a set of numbers regardless of the sign.

Square

Mean

Root

How is SD calculated
RMS of gaps from the mean

sd(data$Sold)

popsd(data$Sold)
#or
sd(data$Sold)*sqrt(55/56) #there are 56 data points

Important to decide if the data given is a sample or population, depending on the research question

popSD = SD × √((n − 1)/n)
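The popsd() used above comes from an add-on package; the conversion can be checked directly in base R with made-up numbers:

```r
# Check: popSD = SD * sqrt((n-1)/n), without needing an extra package
x = c(2, 4, 4, 4, 5, 5, 7, 9)
n = length(x)
popsd_manual = sqrt(mean((x - mean(x))^2))  # population SD computed directly
sd(x) * sqrt((n - 1) / n)                   # same value via the conversion (2 here)
```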

Standard Units (Z Score)

Will go into more depth in Normal Curve

IQR

What is IQR
Range of the middle 50% of the data

IQR = Q3 − Q1

quantile(data$Sold)

IQR on Boxplot
Lower Threshold = Q1 − 1.5 × IQR
Upper Threshold = Q3 + 1.5 × IQR
Outside of lower and upper thresholds are outliers
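The thresholds can be computed in a few lines of R (hypothetical data; note that quantile() supports several quartile conventions, so hand calculations may differ slightly):

```r
# Boxplot outlier thresholds from the quartiles
x = c(1, 3, 4, 5, 5, 6, 7, 8, 30)   # 30 looks like an outlier
q = quantile(x, c(0.25, 0.75))
iqr = q[2] - q[1]                    # same as IQR(x)
lower = q[1] - 1.5 * iqr
upper = q[2] + 1.5 * iqr
x[x < lower | x > upper]             # points flagged as outliers
```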

Coefficient of Variation
CV = SD / mean
The higher the CV, the more volatile the sample

Normal Model
Normal Curve
General vs Standard Normal Curve
Finding Area under Normal Curve
Standard Normal Curve
General Normal Curve
Special Properties
Measurement Error
Chance Error
Bias
Outliers

Normal Curve
Normal curve approximates many natural phenomena

f(x) = (1/√(2πσ²)) × e^(−(x−μ)²/(2σ²))

General vs Standard Normal Curve


What is the difference between general and standard normal curve

General normal curve (X) has any mean and SD

Standard normal curve (Z) has mean 0 and SD 1

Finding Area under Normal Curve


N(mean, σ²)

In the equation, it uses Variance, not SD

Standard Normal Curve

pnorm(0.8, lower.tail = FALSE) #lower.tail = FALSE gives the area to the right; the default finds the area to the left

General Normal Curve

pnorm(171, 161.9, 7.7) #pnorm(x value, mean, SD), defaults to lower tail

Special Properties
68, 95, 99.7 Rule

Area of 1 SD out is 0.68

Area of 2 SD out is 0.95

Area of 3 SD out is 0.997
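These three areas can be checked directly with pnorm on the standard normal curve:

```r
# The 68-95-99.7 rule from the standard normal curve
pnorm(1) - pnorm(-1)   # ≈ 0.683
pnorm(2) - pnorm(-2)   # ≈ 0.954
pnorm(3) - pnorm(-3)   # ≈ 0.997
```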

Any General Normal can be Standardised

Z Score = (data point − mean) / SD

Measurement Error
Individual Measurement = Exact Value + Chance Error + Bias

Chance Error
What is Chance Error

No matter how carefully any measurement is made, it could have turned out differently

What is the best way to estimate the Chance Error

Replicate the measurement under the same conditions, and calculate the
SD

Bias
What is Bias

Constant amount added to or subtracted from each measurement, can be deliberate or accidental

How can Bias be estimated

By comparing measurements against an external standard; replication alone cannot reveal a constant bias

Outliers

What are Outliers

In any large enough series of careful replicated measurements, we expect to see a small percentage of extreme measurements

What are defined as Outliers

3 SD from the mean, assuming normality

Reproducible Report
Dangers of Non-reproducible Report
Good Practices

Dangers of Non-reproducible Report

Data versions can change

People can edit an Excel file without documenting what has changed and why

Graphical summaries can change

People can photoshop images without keeping a record of what changed and why

Reproducible research is about being responsible with possible human errors, or worse, detecting intentionally changed results

Good Practices
Annotate your code as you go

Ensures understanding

Ensures transferability to someone else, either for collaboration, or to someone who inherits your project
Linear Model
Scatter Plot and Correlation
Bivariate Data
Correlation Coefficient
Misleading Correlation
Regression Line
SD Line vs Regression Line
SD Line
Regression Line
Predictions
Baseline Predictions using mean
Prediction in a strip
Based on the Regression line
Predicting using Percentile ranks
Limitations in Prediction
Residual
RMS Error
Residual Plot
Vertical Strips
Normal Distribution within strips

Scatter Plot and Correlation


Recall from earlier weeks how to plot Scatterplots

Examines the relationship between 2 quantitative variables

SpeedLimit = data$SpeedLimit

plot(Age, SpeedLimit)

Random scatterplot → Does not appear to be a relationship between age and speed limit in fatalities

p2 = ggplot(diamonds, aes(carat, price))

p2 + geom_point(aes(color = clarity, shape = cut))
Bivariate Data
What is Bivariate Data

It involves a pair of variables whose relationship we want to find

Can one variable be used to predict the other?

What variables are in X and Y

X is the independent variable

Y is the dependent variable

Correlation Coefficient
How can we summarise a scatterplot

Mean and SD of X

Mean and SD of Y

Correlation Coefficient, r

What is Correlation Coefficient

r is a numerical summary that measures the clustering around the line, −1 ≤ r ≤ 1

Positive r value slopes up, Negative r value slopes down

How is r calculated by hand

Mean of the products of the standardised X and Y units

How is r calculated using R

SU_x=(data$fheight-mean(data$fheight))/popsd(data$fheight)
SU_y=(data$sheight-mean(data$sheight))/popsd(data$sheight)

mean(SU_x*SU_y)

#or
cor(data$fheight,data$sheight)

SD used here is SDpop, but r is the same either way: r(pop) = r(sample)

Does the value of r change when

X and Y are swapped

No

X and Y are scaled

No

Misleading Correlation
Outliers can overly influence the correlation coefficient

Non-linear association cannot be detected by r

r value can be high, but in actual fact another model would fit it better

Very different data can give the same correlation

Rates or averages tend to inflate the correlation coefficient

Association is not causation

Small SDs can make the correlation bigger

Regression Line
SD Line vs Regression Line
SD Line
Line connecting the point of averages (x̄, ȳ) and (x̄ + SDx, ȳ + SDy)

What are the limitations of SD Line

It does not use the correlation coefficient so it is insensitive to the clustering around the line.

It underestimates on the LHS and overestimates on the RHS at the extremes

Regression Line

Line connecting the point of averages (x̄, ȳ) and (x̄ + SDx, ȳ + r × SDy)

lm(NW~CE) #lm(Yvariable ~ Xvariable)
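From the definition above, the slope of the regression line is r × SDy/SDx, which is exactly the slope lm() fits. A quick check on simulated data (the NW and CE variables from the notes are not available here, so hypothetical x and y are used):

```r
# The lm() slope equals r * SDy / SDx (the n-1 factors inside sd() cancel)
set.seed(1)
x = rnorm(50)
y = 2 * x + rnorm(50)
slope_by_hand = cor(x, y) * sd(y) / sd(x)
fit = lm(y ~ x)
slope_by_hand
coef(fit)[2]  # same value
```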

Predictions
Baseline Predictions using mean
For any X value, the Y value is the mean of Y

Prediction in a strip
Find the Y value that corresponds to the X value in the given data
If there are multiple values, take the average

Based on the Regression line

Use the formula of the regression line to predict the values

Predicting using Percentile ranks

Find the percentile rank of the X value, and find the Y value that has the corresponding percentile rank

Limitations in Prediction
Extrapolating

Why is extrapolation not reliable

If the given X value for prediction is outside the range of the data, the prediction is not reliable

When linear model is not the best

Check scatterplot first!

Regression fallacy

Regression to mediocrity

Residual
What is Residual

Residual is the vertical distance of a point above and below the regression
line
eᵢ = yᵢ − ŷᵢ
What does the Residual represent

The error between the actual value and the prediction

l$residuals[10] #l = lm(NW~CE)

RMS Error
What does the RMS error represent
Population RMS error represents the average gap between the points and the regression line

RMS error(pop) = √((e₁² + e₂² + ⋯ + eₙ²)/n) = √(1 − r²) × SDy

res = NW - l$fitted.values
sqrt(mean(res^2)) # RMSError(Pop)

sqrt(1-(cor(CE,NW))^2)*sd(NW) #RMSError(Sample)

sqrt(1-(cor(CE,NW))^2)*sd(NW)*sqrt((length(CE)-1)/(length(CE))) #RMSError(Pop)

Residual Plot
Graphs the residuals vs x

plot(CE,l$residuals, ylab="residuals")
abline(h=0)

Vertical Strips
How to determine if the data is homoscedastic or heteroscedastic

If data shows equal spread in the y direction → homoscedastic → RMS error can be used as a measure of spread for individual strips

If data shows unequal spread in the y direction → heteroscedastic → RMS error cannot be used as a measure of spread for individual strips

Normal Distribution within strips

When can we use Normal Distribution within vertical strips

If the data is homoscedastic, where

ȳ* = ȳ + z_x × r × SDy, where z_x is the Z Score of the strip

SDy* ≈ RMS Error

Find the percentage of days above a certain x value

Find the mean and SD

Calculate the standardised x value

pnorm(standardised x, lower.tail = FALSE)

eg. Of the days where x = value, what percentage of days had y > value

ȳ* = ȳ + z_x × r × SDy

SDy* ≈ RMS Error

Standardise the value of y

pnorm(standardised y, lower.tail = FALSE)

RMS Error using the formula above
Chance
Prosecutor's Fallacy
Chance
Independence & Dependence
Making List
Example: Two dice are thrown, what is the chance of getting a total of 6 spots
Addition Rule
Example

Prosecutor's Fallacy
What is the Prosecutor's Fallacy

A mistake in statistical thinking, whereby it is assumed that the probability of a random match is equal to the probability that the defendant is innocent

The fallacy assumes P(A∣B) = P(B∣A)

Misunderstanding of how conditional probability works:

P(A∣B) = P(A∩B) / P(B)

Example of Prosecutor's Fallacy

Probability Table

Category | DNA Match | DNA doesn't match
Guilty   | 1         | 0
Innocent | 9         | 4,999,990

Chance that DNA matches, given innocent person is tiny

P(DNA Match ∣ Innocent) = 9/4,999,999

Chance that the person is innocent, given a DNA match is high

P(Innocent ∣ DNA Match) = 9/10

Chance
What is Chance

Chance is the percentage of time a certain event is expected to happen, if the same process is repeated long-term

Independence & Dependence

What are independent events

2 events are independent if the chance of the 2nd given the 1st is the same as the 2nd, i.e. the probability of the 1st event happening does not affect the 2nd

How do you ensure independence

Draw randomly with replacement

If the 2 events are independent, P(A ∩ B) = P(A) × P(B), i.e. P(B∣A) = P(B)

If the 2 events are dependent, P(A ∩ B) = P(A) × P(B∣A)
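The multiplication rule for independent events can be checked by simulating two dice (the event here is both dice showing a 6):

```r
# Simulated check of P(A and B) ≈ P(A) * P(B) for independent dice
set.seed(1)
n = 100000
d1 = sample(1:6, n, replace = TRUE)
d2 = sample(1:6, n, replace = TRUE)
mean(d1 == 6 & d2 == 6)   # ≈ 1/36 ≈ 0.0278
(1/6) * (1/6)
```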

Making List

Write a list of all outcomes

Count which outcomes belong to the event of interest

Leads to a simpler way to summarise the outcomes

Example: Two dice are thrown, what is the chance of getting a total of 6 spots
Method 1: Summarise in a tree diagram

Method 2: Simulate

Using R to simulate throwing 2 dice 1000 times

set.seed(1)
totals = sample(1:6, 1000, rep = TRUE) + sample(1:6, 1000, rep = TRUE)
#sample(x, size(number of items to choose), replace, probability vector)
table(totals)

Addition Rule

2 events are mutually exclusive when both cannot happen at the same time

If mutually exclusive, P(A ∩ B) = 0

When to add vs when to multiply

What                | When                             | Formula       | Condition
Addition Rule       | P(at least 1 of 2 events occurs) | P(A) + P(B)   | Mutually Exclusive
Multiplication Rule | P(both events occur)             | P(A) × P(B)   | Independent
Multiplication Rule | P(both events occur)             | P(A) × P(B∣A) | Dependent

Example
What is the probability of at least 1 Ace, after tossing a die 4 times
1 − P(not Ace)⁴ = 1 − (5/6)⁴ ≈ 0.518
What is the probability of at least 1 double Ace, after tossing a pair of dice 24 times
1 − P(no double Ace)²⁴ = 1 − (35/36)²⁴ ≈ 0.49
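Both answers can be confirmed in one R call each:

```r
# Checking the two classic probabilities
1 - (5/6)^4     # at least one Ace in 4 die tosses ≈ 0.518
1 - (35/36)^24  # at least one double Ace in 24 tosses of a pair ≈ 0.491
```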

Binomial Formula
Binomial Coefficient
Binomial Model
Binomial Theorem
Example

Binomial Coefficient
Suppose we have n objects in a row, made up of 2 types: x of one type and n − x of the other

The number of ways of rearranging the n objects is given by the binomial coefficient:

C(n, x) = n! / (x! × (n − x)!)

Binomial Model

Binomial Trial is a trial where only 2 outcomes can happen, and trials are independent

Binomial Theorem

Suppose we have n independent, binary trials, each with success probability p; the chance of exactly x events occurring is C(n, x) × pˣ × (1 − p)ⁿ⁻ˣ

Example
A fair coin is tossed 5 times, what is the probability of getting 3 heads?

dbinom(3, 5, 0.5) #dbinom(x , number of trials, probability)

A fair coin is tossed 500 times, what is the probability of getting 300
heads?

dbinom(300, 500, 0.5)

Suppose 100 babies are born with P (boy) = 0.51, what is the probability
that 55 boys were born?

dbinom(55, 100, 0.51)
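dbinom is just the binomial formula evaluated for you, which can be verified with choose():

```r
# dbinom agrees with choose(n, x) * p^x * (1-p)^(n-x)
choose(5, 3) * 0.5^3 * 0.5^2   # 0.3125
dbinom(3, 5, 0.5)              # 0.3125
```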

Box Model
Law of Averages
Box Model
Sum of draws from a box model
Mean of draws from a box model
Modelling the Sum/Mean of a sample using Normal Curve
Example 1
Example 2
Example 3
Example 4

Law of Averages

The Law of Large Numbers (Law of Averages) states that the proportion of heads becomes more stable as the length of the simulation increases and approaches a fixed number called the relative frequency.

The absolute size of the chance error increases, but the percentage size decreases. The proportion of the event will converge to the expected proportion

Box Model

Simple way to describe many chance processes

What are the things we need to know

Distinct numbers that go into the box

Number of each kind of tickets in the box

Number of draws from the box

Box is the population, draws create the sample

CE = OV − EV

Examples

Example 1

(5 draws — 3, 1, 2, 2, 3 — from a box with mean 2)

What is the EV, OV and chance error

EV = 2 × 5 = 10
OV = 3 + 1 + 2 + 2 + 3 = 11
Chance Error = OV − EV = 1

Example 2
Suppose it costs $1 to play the game

If you roll 6, you get $2

If you roll other numbers, you lose

If you play 25 times, what is your net gain/loss?

Answer

set.seed(1)
dietosses = sample(c(1, -1, -1, -1, -1, -1), 25, repl = TRUE)
#or dietosses = sample(c(1, -1), 25, repl = TRUE, prob = c(1/6, 5/6))
sum(dietosses)

Example 3
38 pockets, numbered 0 (green), 00 (green), and 1–36 (alternating red and black)

You place a bet on red, costing $1

If you land on red, you get $2

If you don't land on red, you lose

If you play 10 times, what is your net gain/loss?

Answer

set.seed(1)
roulette = sample(c(1, -1), 10, repl = TRUE, prob = c(18/38,20/38))
sum(roulette)

Example 4
38 Numbers

You place a bet on a number, costing $1

If you win, you get $36

If you lose, you lose the $1

If you play 100 times, what is your net gain/loss?

Answer

set.seed(1)
number = sample(c(35, -1), 100, repl = TRUE, prob = c(1/38, 37/38))
sum(number)

Sum of draws from a box model

EV = number of draws × mean of box

Standard Error = √(number of draws) × SD of box

SD of box = (big − small) × √(proportion of big × proportion of small)

Mean of draws from a box model

EV = mean of box

Standard Error = SD of box / √(number of draws)
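These formulas can be checked by simulation; the sketch below uses the red-bet roulette box from Example 3 (18 win tickets, 20 loss tickets, 100 draws):

```r
# Box model SE formulas checked by simulation
set.seed(1)
box = c(rep(1, 18), rep(-1, 20))
p = 18/38
mean_box = p * 1 + (1 - p) * (-1)
sd_box = (1 - (-1)) * sqrt(p * (1 - p))        # (big - small) * sqrt(p_big * p_small)
sqrt(mean((box - mean(box))^2))                # same as sd_box, computed directly
ev = 100 * mean_box                            # expected sum of 100 draws
se = sqrt(100) * sd_box
sums = replicate(5000, sum(sample(box, 100, replace = TRUE)))
c(ev, se)
c(mean(sums), sd(sums))   # close to EV and SE
```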

Modelling the Sum/Mean of a sample using Normal Curve
Example 1
If a coin is tossed 100 times, what is the chance of getting between 40 and 60 heads

Answer

Method 1
EV = 50
SE = √100 × 0.5 = 5
N(50, 5²)
Standardise 40 and 60: −2 and 2

pnorm(2)-pnorm(-2)

Method 2

Simulation using R

set.seed(1)
box = c(1, 0)
totals = replicate(1000, sum(sample(box, 100, repl = TRUE)))
mean(totals >= 40 & totals <= 60) #proportion of simulated sums between 40 and 60

Example 2
Number 1 to 80, 20 numbers were chosen at random without replacement
You pick a single number

You win if the number is one of the 20, gain $2

If you lose, you lose $1

If you play 100 times, how much would you expect to win or lose?

Answer

Mean = −0.25
SD = (2 − (−1)) × √(0.25 × 0.75) = 1.299038
EV = −0.25 × 100 = −25
SE = √100 × 1.299038 = 12.99038
How often would you lose more than $20?

Answer

pnorm(-20, -25, 12.99038)

Example 3
A die is rolled 60 times, how many 6's do we expect?

Answer

EV = 10

SE = √60 × √(1/6 × 5/6) ≈ 2.89
In 60 plays we would expect to see 10 6's with a SE of around 3

Use SDpop

Example 4
A box containing 0, 2, 3, 4, 6

You pick a single number


If you play 25 times, adding each number you draw, what would your expected
sum be?

Answer

Method 1
EV = 25 × (0 + 2 + 3 + 4 + 6)/5 = 75
SE = √25 × SDpop = √25 × 2 = 10
In 25 plays, we would expect to see a sum of 75 with a SE of 10

Method 2

set.seed(1)
box = c(0, 2, 3, 4, 6)
totals = replicate(30, sum(sample(box, 25, rep = T)))
table(totals)
length(totals[totals>=65 & totals<=85])/30

If you play 25 times, how often is the sum between 50 and 100?

Answer

pnorm(100, 75, 10) - pnorm(50, 75, 10)

Normal Approximation
Types of Histograms
Central Limit Theorem
Continuity Correction

Types of Histograms
Types of Histograms

Type        | Use
Data        | Represents data by area
Probability | Represents chance by area
Simulation  | Converges in shape to probability histogram

Data Histogram

Probability Histogram

Simulation Histogram

For repeated simulations of a chance process resulting in a sum, the simulation histogram of the observed values converges to the probability histogram

Central Limit Theorem

The probability histogram for the sum/mean will closely follow the
normal curve, if the sample size for the sum is sufficiently large when
drawing at random with replacement

Does not work for products

Continuity Correction

To approximate a discrete distribution by the Normal Distribution (continuous), we adjust the endpoints by 0.5
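The correction can be seen in R with the 100-coin-toss example from the Box Model section:

```r
# Continuity correction: P(40 <= heads <= 60) for 100 coin tosses
exact = sum(dbinom(40:60, 100, 0.5))
approx_plain = pnorm(60, 50, 5) - pnorm(40, 50, 5)
approx_corrected = pnorm(60.5, 50, 5) - pnorm(39.5, 50, 5)
c(exact, approx_plain, approx_corrected)  # the corrected value is closer to exact
```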
Sample Survey
Populations vs Sample
Limitation of a census
Parameter and Estimate
Parameter
Estimate
Issues with finding the best estimate of the parameter
Choosing a Sample
Biases
Picking a good Sample
Multi-stage Cluster Sampling
Quota Sampling
Convenience Sampling
Correction Factor
What affects accuracy

Populations vs Sample
The population is the full amount of information being studied, collected
through a census

Sample is part of the population

Limitation of a census

Collecting every unit of a population takes a lot of

Time

Money

Resources

Parameter and Estimate


Parameter
What are parameters

Numerical fact about the population which we are interested in eg. population mean μ

Estimate
What are estimates

Calculation of sample values which best predicts the parameter eg. sample mean x̄

Issues with finding the best estimate of the parameter


How was the sample chosen

Is it representative of the population

What estimate is closest to the parameter

Choosing a Sample
Biases
What are the 4 common types of Bias

Selection Bias - A systematic tendency to exclude or include one type of person from the sample

Non-response Bias - Caused by participants who fail to complete the survey

Interviewer's Bias - When the interviewer has to make a choice of participants in the survey, or when the characteristics of the interviewer have an effect on the answers given by participants

Measurement Bias - The form of the question in the survey affects the response to the question

Some examples of Measurement Bias

Bias in question wording and order

Recall Bias (long questions) - People forget details

Sensitive Questions

Lack of clarity

Attributes of the interview process may cause bias

If a selection process is biased, taking a larger sample does not reduce bias, rather it can amplify it, as the mistake is repeated on a larger scale.

Some Biases are unavoidable

Picking a good Sample

We use a probability method to pick the sample so

The interviewer is not involved in the selection → impartial method of selection

The interviewer can compute the chance of any particular individual being chosen → objectivity

Multi-stage Cluster Sampling

Probability sampling that takes the sample in stages, where individuals or clusters are chosen at each stage
Quota Sampling
Non-probability sampling where the assembled sample has the same
proportions of individuals as the entire population with respect to known
characteristics, traits or focused phenomenon.
→ Results in unintentional bias from the interviewer when they choose the
subjects to survey.

Convenience Sampling
Non-probability sampling where the subjects are selected because of their
convenient accessibility. Good for pilot surveys.

Correction Factor

SE(without replacement) = SE(with replacement) × √((population size − sample size)/(population size − 1))
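A quick numerical sketch (hypothetical sizes: a sample of 100 from a population of 1000):

```r
# Correction factor for sampling n = 100 from a population of N = 1000
N = 1000
n = 100
cf = sqrt((N - n) / (N - 1))
cf  # ≈ 0.949: the without-replacement SE is about 5% smaller here
```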

What affects accuracy

When sampling with replacement, the SE is determined by the absolute size of the sample

When sampling without replacement, the SE will decrease by increasing the ratio of sample size to population size, as when a higher proportion of the population is sampled, the variability decreases

When the sample is only a small part of the population, the size of the population has almost no effect on the SE of the estimate

Hypothesis Test
Steps in Hypothesis Testing
Common Mistakes
Hypotheses
Example 1
Z Test
Example 1
T Test
Example
Paired T Test
2 Sample T Test
Ways to check Assumptions
Comparative Boxplots - normality and equality of variance assumptions
QQ Plot - normality
Shapiro-Wilk Test - normality
Levene's Test (F Test) - equal spread
Welch 2 Sample T Test - for unequal variance
Mann-Whitney-Wilcoxon Test - for non-normality
Paired T Test
Comparing T Test and Z Test

Why do we do hypothesis testing

Steps in Hypothesis Testing

Set up research question

Define H0 and H1

Weigh up evidence

Assumptions

A conclusion is not transparent if the assumptions are not stated

A conclusion is potentially invalid if the assumptions are not justified

Test Statistic

Measures the difference between what is observed in the data and what is expected under the null hypothesis

test statistic = (OV − EV) / SE

P-value

Way of weighing up whether the sample is consistent with H0

It is the chance of observing the test statistic (or something more extreme) if H0 is true

Statistically significant if P-value < 0.05

Explain conclusion

Common Mistakes
P-value is not the chance that the null hypothesis is true

Large p-value does not mean that H0 is true

What to say

Size of P-value | Do NOT say     | Can say
Small           | H0 is not true | There is evidence against H0
Small           | H0 is false    | We reject H0
Large           | We accept H0   | Data is consistent with H0
Large           |                | We retain H0

Hypotheses
What is H0

The null hypothesis assumes that the difference between the observed value and expected value is due to chance alone

What is H1

The alternative hypothesis assumes that the difference between the observed value and the expected value is not due to chance alone

Example 1
Question: Does the probiotic treatment work for 80% of patients?

Initial Trial Results

Name | Participants | Numbers showing desensitisation
Treatment | 29 | 26
Placebo | 28 | 2

H0 : 80% of the patients respond to the treatment


H1 : More than 80% of the patients respond to treatment (1-tail test)
Answer

Mean = 0.8
SD = (1 − 0) × √(0.8 × 0.2) = 0.4
EV = 0.8 × 29 = 23.2
SE = √29 × 0.4 ≈ 2.2
Assumptions

We assume each child in the trial was independent of the others (not related)

We assume each child had the same chance of showing improvement


with their peanut allergy by using the probiotic treatment

How can we check these assumptions

Looking at the records of the medical trial

Assume normality

Test Statistic
test statistic = (OV − EV) / SE = (26 − 23.2) / 2.2 ≈ 1.3

P-value

pnorm(1.3, lower.tail=FALSE)

Conclusion
From the P-value, we conclude that the data is consistent with the null
hypothesis.
The new treatment does not seem to have an effectiveness rate higher
than 0.8.
We fail to find sufficient evidence to claim an effectiveness rate higher
than 0.8.
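As a cross-check of the arithmetic above, here is a minimal sketch in Python (standard library only; the upper-tail normal probability comes from math.erfc, mirroring pnorm(..., lower.tail=FALSE)):

```python
import math

def norm_sf(z):
    # Upper tail of the standard normal, like pnorm(z, lower.tail=FALSE)
    return 0.5 * math.erfc(z / math.sqrt(2))

n, observed, p0 = 29, 26, 0.8    # treatment group size, successes, H0 proportion
sd = math.sqrt(p0 * (1 - p0))    # SD of the 0-1 box = 0.4
ev = n * p0                      # EV = 23.2
se = math.sqrt(n) * sd           # SE ~ 2.2
z = (observed - ev) / se         # test statistic ~ 1.3
p_value = norm_sf(z)             # 1-tail P-value ~ 0.10, so retain H0
```

With P ≈ 0.10 > 0.05, the data is consistent with H0, matching the R conclusion.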

Z Test

Only can use when we know population SD

Example 1

caf0 = c(36.05,52.47,56.55,45.2,35.25,66.38,40.57,57.15,28.34)
caf13 = c(37.55,59.3,79.12,58.33,70.54,69.47,46.48,66.35,36.2)

mean(caf0) = 46.44
sd(caf0) = 12.48826
mean(caf13) = 58.14889
sd(caf13) = 15.13416
Research Question: Is the mean time to exhaustion with no caffeine equal to 45
minutes?

Answer

H0 : μ = 45
H1 : μ ≠ 45 (2-sided test)
Assumptions

We assume the sample of cyclists is random and they are all
independent of each other (not related)

We assume normality in population, since sample size is small


We assume we know the population SD. For example, suppose a large
scale study on the exhaustion time without caffeine was conducted
previously, and the SD was 12 mins.

Test Statistic
test statistic = (OV − EV) / SE = (46.44 − 45) / (12 / √9) ≈ 0.36
SD = 12 is from previous study
P-value

2*pnorm(0.36, lower.tail=FALSE) #since 2 tail test

Conclusion
The probability of getting a test statistic at least this extreme is 0.72 if
the null hypothesis is true. Therefore the data appears to be consistent
with the null hypothesis.
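The same Z calculation can be sketched in Python (standard library only; the known population SD of 12 is taken from the earlier study mentioned in the assumptions):

```python
import math

caf0 = [36.05, 52.47, 56.55, 45.2, 35.25, 66.38, 40.57, 57.15, 28.34]
n = len(caf0)
xbar = sum(caf0) / n             # sample mean = 46.44
sigma = 12                       # population SD, known from the earlier large-scale study
se = sigma / math.sqrt(n)        # 12 / 3 = 4
z = (xbar - 45) / se             # test statistic = 0.36
p_value = math.erfc(z / math.sqrt(2))   # 2-sided P-value = 2 * upper tail ~ 0.72
```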

T Test
When we don't know the population SD

Estimate the population SD from the sample SD

This estimation adds extra variability to the test statistic, as the
sample SD varies from sample to sample

For large samples, the difference between the population SD and the
sample SD should be small, so the Z Test may still be appropriate. For
small samples, the difference is more noticeable → use the T Test

Example
Same question as on top

Answer

t.test(caf0, mu = 45)
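Under the hood, t.test forms the same statistic but with the sample SD in the SE; a sketch of the arithmetic in Python (standard library only; the P-value itself comes from the t(n − 1) curve, which R supplies):

```python
import math
import statistics

caf0 = [36.05, 52.47, 56.55, 45.2, 35.25, 66.38, 40.57, 57.15, 28.34]
n = len(caf0)
xbar = statistics.mean(caf0)     # 46.44
s = statistics.stdev(caf0)       # sample SD ~ 12.488, used in place of the population SD
se = s / math.sqrt(n)
t = (xbar - 45) / se             # ~ 0.35, compared against t(n - 1) = t(8)
```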

Paired T Test
Research Question: Is the mean time to exhaustion with no caffeine the same
as 13mg caffeine? Is there a difference?

Answer

cafdiff = caf13 - caf0 # This 1 sample is our focus: the "differences".

mean(cafdiff) = 11.70889
sd(cafdiff) = 10.79987

t.test(cafdiff, mu = 0)

As the p-value is so small, we reject H0 and conclude that there is a
difference: caffeine consumption affects endurance.
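The paired test is just a 1-sample T Test on the differences; a sketch in Python (standard library only, testing mu = 0):

```python
import math
import statistics

caf0  = [36.05, 52.47, 56.55, 45.2, 35.25, 66.38, 40.57, 57.15, 28.34]
caf13 = [37.55, 59.3, 79.12, 58.33, 70.54, 69.47, 46.48, 66.35, 36.2]
diffs = [b - a for a, b in zip(caf0, caf13)]   # the single sample of differences
n = len(diffs)
dbar = statistics.mean(diffs)    # ~ 11.709
s = statistics.stdev(diffs)      # ~ 10.800
t = (dbar - 0) / (s / math.sqrt(n))   # ~ 3.25 on n - 1 = 8 degrees of freedom
```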

2 Sample T Test
What are the assumptions

The 2 samples contain different people (independence)

Check using

Context in question

The 2 populations have the same variation

Check using

Boxplots

Histogram

Variance Test

If assumptions don't hold → use Welch 2 sample T Test

The 2 populations are normal

Check using

Boxplots (expect no or few outliers)

Histograms

QQ Plots

Normality Test

test statistic = (x̄1 − x̄2 − 0) / SE
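With equal variances assumed, the SE in this statistic uses the pooled SD. A sketch of the pooled calculation in Python (standard library only); purely for arithmetic illustration it reuses the two caffeine samples as if they were independent groups (the proper analysis for those data is the paired test):

```python
import math
import statistics

# Illustration only: treat the two caffeine samples as independent groups
x1 = [36.05, 52.47, 56.55, 45.2, 35.25, 66.38, 40.57, 57.15, 28.34]
x2 = [37.55, 59.3, 79.12, 58.33, 70.54, 69.47, 46.48, 66.35, 36.2]
n1, n2 = len(x1), len(x2)
s1, s2 = statistics.stdev(x1), statistics.stdev(x2)

# Pooled variance relies on the equal-variation assumption
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
se = math.sqrt(sp2 * (1/n1 + 1/n2))
t = (statistics.mean(x1) - statistics.mean(x2) - 0) / se   # compared against t(n1 + n2 - 2)
```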

Ways to check Assumptions


Comparative Boxplots - normality and equality of variance
assumptions

How does it show normality

The 2 samples look symmetrical and so are consistent with normality

How does it show equality in variance


The 2 samples have similar spread and so are consistent with equality of
variance assumption

QQ Plot - normality

p3 = ggplot(RB_data, aes(sample = rate, colour = group)) +
stat_qq() + stat_qq_line() + ggtitle("QQplot")

How does it show normality

QQ plot graphs the theoretical quantiles based on the normal curve against
the actual quantiles. If the points form a reasonably straight line, we
can assume that the data is normally distributed.

Shapiro-Wilk Test - normality

shapiro.test(No_RB)

What are the limitations of this test

This test is sensitive to sample size: small samples will almost always be
retained as normal (the test has little power), while very large samples can
flag trivial deviations from normality. Therefore it is recommended to use
this test together with graphical methods.

Levene's Test (F-Test) - equal spread

var.test(No_RB,RB)

Welch 2 Sample T Test - for unequal variance
Use Welch Test for unequal variance

t.test(No_RB, RB, var.equal = FALSE) #add var.equal = FALSE

Mann-Whitney-Wilcoxon Test - for non-normality



For non-normality, we can either

Transform non-normal data (using log or square root) before performing a T
Test

Use non-parametric tests like wilcox.test

Paired T Test
When analysis of dependent (paired) data is required, we can use the
paired T Test

t.test(No_RB, RB, paired = TRUE) #add paired=TRUE



Comparing T Test and Z Test


T Test vs Z Test

Test | P-value curve | SD used
Z | Normal | Population SD
T | t(n−1) | Sample SD

SD tells us how far each individual cyclist varies from the mean in this
sample

SE tells us how far the sample means vary from the true population
mean

Chi-Squared Tests and
Regression Tests
Chi-Squared Test
Test for Goodness of Fit of Model
Example
Test for Independence
Example
Yates Continuity Correction
Fisher's Exact Test
Regression Tests
Example

Tests so far

Test | Data
1-Sample Z & T Test | Cyclist endurance times
2-Sample T Test | Redbull effect on heart rate

We can see that Z & T Tests can be used for counting and classifying, by
modelling a 0-1 box.



Chi-Squared Test
How is Chi-Squared Test being used

Goodness of Fit

Test a hypothesis about the distribution of a qualitative variable in a


population

Homogeneity

Test a hypothesis about the distribution of a qualitative variable in


several populations

Independence

Test a hypothesis about the relationship between 2 qualitative variables


in a population

χ² = Σ_{k=1}^{n} (O_k − E_k)² / E_k

Test for Goodness of Fit of Model


Example
Throws

Face | Observed Frequency | Expected Frequency | |O − E| | (O − E)²/E
1 | 7 | 10 | 3 | 0.9
2 | 7 | 10 | 3 | 0.9
3 | 9 | 10 | 1 | 0.1
4 | 14 | 10 | 4 | 1.6
5 | 10 | 10 | 0 | 0.0
6 | 13 | 10 | 3 | 0.9
Total | 60 | 60 | | 4.4

Answer

H0 : Gambler is innocent, proportions are equal



H1 : Gambler is guilty; at least one of the proportions differs from the
stated proportions

P-value

The P-value is the chance of observing 4.4 or a more extreme value on a Chi-
Squared distribution with k − 1 degrees of freedom

k is the number of categories; here k = 6, so degrees of freedom = 6 − 1 = 5

pchisq(4.4, 5, lower.tail=F)
#pchisq(value calculated on top, degree of freedom)

chisq.test(throws1a, p = c(1/6,1/6,1/6,1/6,1/6,1/6))
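The χ² statistic above can be reproduced directly; a minimal sketch in Python (standard library only; the P-value then comes from pchisq(stat, df, lower.tail=FALSE) in R):

```python
observed = [7, 7, 9, 14, 10, 13]   # counts for faces 1-6
n = sum(observed)                  # 60 throws in total
expected = [n / 6] * 6             # 10 per face under H0 (fair die)
stat = sum((o - e)**2 / e for o, e in zip(observed, expected))   # 4.4
df = len(observed) - 1             # 6 categories - 1 = 5 degrees of freedom
```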

Test for Independence


Example
Is there a connection between background and smoking

Categories | Domestic | International | Total
Smoke | 9 | 4 | 13
No Smoke | 6 | 6 | 12
Total | 15 | 10 | 25

Answer

H0 : Smoking is independent of background


H1 : Smoking is NOT independent of background

chisq.test(Smoke)

Conclusion
Given the large P-value, background and smoking appear to be
independent, suggesting that there is no association between background
and smoking.
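The independence statistic can also be reproduced by hand; a sketch in Python (standard library only, with no Yates correction, so it corresponds to chisq.test(Smoke, correct = FALSE); for 1 degree of freedom the χ² upper tail can be written with math.erfc):

```python
import math

table = [[9, 4], [6, 6]]              # rows: Smoke / No Smoke; cols: Domestic / International
row = [sum(r) for r in table]         # row totals [13, 12]
col = [sum(c) for c in zip(*table)]   # column totals [15, 10]
n = sum(row)                          # 25

stat = 0.0
for i in range(2):
    for j in range(2):
        e = row[i] * col[j] / n       # expected count under independence
        stat += (table[i][j] - e)**2 / e

# For df = 1, P(chi2 > stat) = 2 * P(Z > sqrt(stat)) = erfc(sqrt(stat / 2))
p_value = math.erfc(math.sqrt(stat / 2))   # ~ 0.33, a large P-value
```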



Yates Continuity Correction
These notes do not go in depth into this, but R defaults to using it. Use
correct = FALSE to turn it off.

Fisher's Exact Test


For small expected cell counts (< 5), we can use Fisher's Exact Test

fisher.test(Smoke)

Regression Tests
We use Regression Test to check if there is a linear trend

Example

Answer

H0 : β1 = 0 (or no linear trend)


H1 : β1 ≠ 0 (or linear trend)
Assumptions

Residuals should be independent, normal, with constant variance


(homoscedasticity)
Check using Residual Plot, QQ Plot, Shapiro-Wilk Test



The relationship between the dependent and independent variable
should look linear

Check using Scatterplot

Test Statistic

summary(lm(Value~Time, PCBeer))

P-value = 7.2 × 10⁻¹⁰

Conclusion
Therefore, we reject H0 and conclude that there is strong evidence to
suggest that the slope is significant
Hence, there is strong evidence to suggest that beer consumption has
been changing in Australia since Year 2000
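The slope test reported by summary(lm(...)) can be sketched by hand. Since the PCBeer data is not reproduced in these notes, the numbers below are hypothetical:

```python
import math

# Hypothetical data (illustration only, not the PCBeer values)
x = [0, 1, 2, 3, 4]                 # years since 2000
y = [10.0, 9.5, 8.8, 8.1, 7.4]      # per-capita beer consumption, made up
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar)**2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx                      # least-squares slope estimate
b0 = ybar - b1 * xbar               # intercept
rss = sum((yi - (b0 + b1 * xi))**2 for xi, yi in zip(x, y))
se_b1 = math.sqrt(rss / (n - 2) / sxx)
t = (b1 - 0) / se_b1                # compared against t(n - 2) for the P-value
```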
