Bes Summary
Dillon Pretorius
August 23, 2016
1 Chapter 1: Introduction to Data
1.1 Terminology and Concepts
• Data - Observations
• Summary statistic - Single number summarising large amounts of data.
• Data matrix - A table with each specific case on a row and variables in
the columns.
• Types of variables
– Numerical - Can take a wide range of numerical values, and it makes
sense to add, subtract, or take averages of them.
∗ Discrete - Can only take specific values with jumps.
∗ Continuous - Can take any value
– Categorical - Non-numerical. Can only be specific values, called levels.
If levels have a natural ordering, they are ordinal.
• Associated/Dependent variables - Variables that show some connection
with each other. The association can be positive or negative.
• Independent variables - Variables not associated with each other.
• Bias - Can arise from non-response (i.e. people selected for the sample
do not respond) and from convenience sampling (individuals who are
more easily accessible are more likely to be selected).
• Explanatory and Response variables - Explanatory variables are variables
that we suspect may affect the response variables.
• Blind Study - When patients do not know which treatment they are
receiving. When the researcher also doesn’t know, it is called a
double-blind setup.
• Confounding Variable - A variable correlated with both the explanatory
and response variables. Must be taken into consideration before making
statements of causality.
• Prospective and Retrospective studies - Observational studies done as
events unfold, or afterwards, respectively.
• Random Sampling
– Simple - Each case in population has equal chance of selection
– Stratified - Similar cases are grouped into strata, then simple random
sampling occurs within each stratum.
– Cluster - Observations are grouped into clusters, then entire clusters
are selected at random.
– Multistage - Same as cluster, but only a simple random sample is
taken from each cluster instead of the whole cluster.
• Experiment - Conducted with suspected explanatory and response variables
to check for a causal connection. Researchers assign treatments to
cases. Differences between groups are controlled. Cases are randomised
into treatment groups to even out differences in variables that cannot
be controlled. Large samples of cases are preferable, and results should
be replicated, either within a single study or by multiple groups of
scientists. Cases may first be grouped by variables that may influence
the results; this is called blocking.
• Scatter plot - Type of graph useful for relationships between variables.
• Dot plot - A one-variable scatter plot
• Robust estimates - The median and IQR. Extreme observations have little
effect on them.
• Transformations - When some mathematical function is applied to a
variable.
• Intensity map - Shows geographical data with colours representing values
according to a key.
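The four random sampling schemes listed above can be sketched in Python as follows. This is only an illustration under stated assumptions: the population, the cluster names, and the helper function names are hypothetical and do not come from the notes, and strata are assumed to be identified by the first letter of each case label.

```python
import random
from collections import defaultdict

def simple_sample(cases, k, rng):
    """Simple: every case has an equal chance of selection."""
    return rng.sample(cases, k)

def stratified_sample(cases, stratum_of, k_per_stratum, rng):
    """Stratified: group similar cases, then sample simply within each stratum."""
    strata = defaultdict(list)
    for case in cases:
        strata[stratum_of(case)].append(case)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(k_per_stratum, len(members))))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Cluster: select whole clusters at random and keep all their cases."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [case for name in chosen for case in clusters[name]]

def multistage_sample(clusters, n_clusters, k_per_cluster, rng):
    """Multistage: select clusters at random, then sample simply within each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        members = clusters[name]
        sample.extend(rng.sample(members, min(k_per_cluster, len(members))))
    return sample

# Hypothetical population: 12 cases in 3 clusters; the first letter is the group.
clusters = {
    "a": ["a1", "a2", "a3", "a4"],
    "b": ["b1", "b2", "b3", "b4"],
    "c": ["c1", "c2", "c3", "c4"],
}
cases = [case for members in clusters.values() for case in members]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

simple = simple_sample(cases, 4, rng)
stratified = stratified_sample(cases, lambda case: case[0], 2, rng)
cluster = cluster_sample(clusters, 2, rng)
multistage = multistage_sample(clusters, 2, 2, rng)
```

Note the difference in what is guaranteed: the stratified sample always contains cases from every stratum, while the cluster and multistage samples contain cases from the selected clusters only.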
1.2 Formulae
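\[ \text{Proportion} = \frac{\text{Count}}{\text{Total}} \]
\[ \text{Mean} = \bar{x} = \frac{\text{sum of observations}}{\text{number of observations}} \]
\[ \text{Deviation} = x - \bar{x} \]
\[ \text{IQR} = Q_3 - Q_1 \]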
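As a small worked example of these formulae, the sketch below computes each quantity for a hypothetical data set (the numbers and the "greater than 5" condition are made up for illustration). Quartile conventions differ between textbooks and software; this sketch assumes the default method of Python's `statistics.quantiles`, which may not match the convention used in the course.

```python
import statistics

observations = [2, 4, 4, 5, 7, 9, 11]  # hypothetical data

# Proportion = Count / Total, here the share of observations above 5
count_above_5 = sum(1 for x in observations if x > 5)
proportion = count_above_5 / len(observations)

# Mean = sum of observations / number of observations
mean = sum(observations) / len(observations)

# Deviation = x - mean, one value per observation
deviations = [x - mean for x in observations]

# IQR = Q3 - Q1 (quartiles from the library's default method, an assumption)
q1, _median, q3 = statistics.quantiles(observations, n=4)
iqr = q3 - q1
```

The deviations always sum to zero, which is why the variance squares them before averaging.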
2 Chapter 2: Foundation for Inference
2.1 Terminology and Concepts
• Point estimate - A single value given as an estimate of a population
parameter.
• Hypothesis test - Statistical technique used to evaluate opposing claims
using data. Should be set up before seeing the data, to avoid choosing
between a one- or two-sided test incorrectly.
– Frame research question in terms of hypotheses (H0 and HA )
– Collect data (observational study or experiment)
– Analyse data (e.g. with p-value)
– Form conclusion (e.g. comparison of p-value to α)
• Null Hypothesis (H0 ) - A skeptical perspective of no difference. Any
observed relationship is attributed to chance. If this hypothesis
strongly disagrees with the data, we reject it in favour of the
alternative hypothesis.
• Alternative Hypothesis (HA ) - The variables are not independent. The
difference was not due to chance.
• Statistical Inference - Practice of making decisions and conclusions from
data in the context of uncertainty.
• Randomisation - Simulating the null hypothesis and calculating the
probability of a difference at least as large as the observed one
occurring by chance.
• p-value - Probability of observing data at least as favourable to HA as
our current data set if the null hypothesis were true. It is computed
from the test statistic but is not the test statistic itself.
• Statistical significance - If the p-value is lower than some significance
level, usually α = 0.05, it means the data provides strong enough evidence
against H0 that we reject it in favour of HA . We say it is statistically
significant.
• Decision errors - If we reject H0 when it is actually true, we have made
a Type 1 error. If we fail to reject H0 when HA is actually true, we
have made a Type 2 error. Depending on which error is more costly in a
practical implementation, the significance level can be adjusted. A
smaller α makes Type 1 errors less likely and a larger α makes Type 2
errors less likely.
• Confirmation bias - Looking for data that supports our ideas. Setting an
alternative hypothesis that agrees with our worldview.
• Null value - Reference value for H0 .
• Central limit theorem - If we look at a proportion (or difference in
proportions) and the scenario satisfies certain conditions (observations
in the sample are independent and the sample is large enough), then the
sample proportion will approximately follow a bell-shaped curve called
the normal distribution.
• Normal distribution - Symmetric, unimodal, bell curve. Also called normal
curve, or normal model. Its shape depends on two parameters, the mean µ
and the standard deviation σ. If µ = 0 and σ = 1 then it’s called the
standard normal distribution. A normal distribution can be written
as a function of these parameters: N (µ, σ)
• Z-Score - The number of standard deviations an observation is above or
below the mean. Can be used to identify how unusual an observation is.
• Normal approximation - How closely real data follows the normal
distribution can be seen by the shape of its histogram, or how close the
points are to a straight line in a normal probability plot.
• Standard Error (SE) - Standard deviation associated with an estimate
(i.e. a point estimate). This will either be given, or a formula to
calculate it will be provided.
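The randomisation and p-value ideas above can be illustrated with a short simulation. This is a minimal sketch, assuming a one-sided test on a difference in two proportions; the function name `randomisation_p_value`, the data, and the number of simulations are all hypothetical choices made for illustration.

```python
import random

def randomisation_p_value(group_a, group_b, n_sim=10_000, seed=0):
    """Estimate a one-sided p-value for the difference in proportions (a - b).

    Outcomes are coded 1 (success) or 0 (failure). Under H0 the group labels
    carry no information, so we shuffle the pooled outcomes, re-split them,
    and count how often the simulated difference is at least as large as the
    observed one.
    """
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = sum(group_a) / n_a - sum(group_b) / n_b
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_sim):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if diff >= observed:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_sim

# Hypothetical data: 14 of 20 successes with treatment, 7 of 20 with control
treatment = [1] * 14 + [0] * 6
control = [1] * 7 + [0] * 13
observed_diff, p_value = randomisation_p_value(treatment, control)

alpha = 0.05
reject_h0 = p_value < alpha  # statistically significant if True
```

The estimated p-value varies slightly with the seed and the number of simulations, so it should be read as an approximation.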
2.2 Formulae
In General
\[ \text{Z-Score} = \frac{\text{observation} - \text{mean}}{\text{standard deviation}} \]
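A brief sketch of the Z-score formula in use, with hypothetical numbers (a score of 1800 from a distribution with mean 1500 and standard deviation 300). The conversion to a percentile assumes the data are well described by a normal model, which the notes do not guarantee for any particular data set.

```python
from statistics import NormalDist

def z_score(observation, mean, sd):
    """Number of standard deviations the observation lies above the mean."""
    return (observation - mean) / sd

z_above = z_score(1800, mean=1500, sd=300)  # one SD above the mean
z_below = z_score(1200, mean=1500, sd=300)  # one SD below the mean

# Under a normal model, the share of observations below z_above
percentile = NormalDist().cdf(z_above)
```

A positive Z-score means the observation is above the mean and a negative one means it is below; the larger the absolute value, the more unusual the observation.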
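Formulas
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
\[ \text{var} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
\[ s = \sqrt{\text{var}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
\[ Q_1 - 1.5 \times \text{IQR} \qquad Q_3 + 1.5 \times \text{IQR} \]
\[ Z = \frac{x - \mu}{\sigma} \]
[Figure labels: negative Z, positive Z]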
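The sketch below evaluates these formulas on a hypothetical data set. The notes list the two expressions Q1 − 1.5 × IQR and Q3 + 1.5 × IQR without a label; treating them as limits beyond which an observation is flagged as unusually extreme is a common convention and is an assumption here, as are the data values and the quartile method (the default of `statistics.quantiles`).

```python
import math
import statistics

data = [2, 4, 4, 5, 7, 9, 30]  # hypothetical data with one extreme value
n = len(data)

# Sample mean, variance (n - 1 denominator) and standard deviation
x_bar = sum(data) / n
var = sum((x - x_bar) ** 2 for x in data) / (n - 1)
s = math.sqrt(var)

# Limits at 1.5 IQR beyond the quartiles (interpretation is an assumption)
q1, _median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
flagged = [x for x in data if x < lower_limit or x > upper_limit]
```

The hand-computed variance and standard deviation agree with `statistics.variance` and `statistics.stdev`, which use the same n − 1 denominator.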