Statistics
Frequency density = frequency / class width
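As a quick illustration of the formula, the sketch below computes frequency densities for a small grouped data set. The class boundaries and frequencies are made up for the example, and the function name is hypothetical.

```python
def frequency_density(frequency, lower, upper):
    """Frequency density of one class: frequency divided by class width."""
    width = upper - lower
    if width <= 0:
        raise ValueError("class width must be positive")
    return frequency / width


# Hypothetical grouped data: (lower bound, upper bound, frequency)
classes = [(0, 10, 5), (10, 15, 10), (15, 30, 6)]

densities = [frequency_density(f, lo, hi) for lo, hi, f in classes]
print(densities)
```

The narrow middle class has the highest density even though its frequency is not much larger, which is why histograms with unequal class widths plot density rather than frequency.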
The range is the difference between the largest and smallest values in a data set, and the
interquartile range is the difference between the upper and lower quartiles. While both can be
used as a measure of spread, neither considers all the values given.
To counter this, we can use the standard deviation, usually given by the symbol σ.
Let’s consider the data 2, 5, 8, which has a mean (usually denoted by x̄) of 5.
We can look at the difference of each data point from the mean:
x    x − x̄
2    −3
5    0
8    3
The mean of the differences will always be 0, as the negatives will cancel out the positives.
This means that it cannot be used as a measure of spread. Because of this, we can square
the differences to ensure that they are non-negative.
x    (x − x̄)²
2    9
5    0
8    9
The average is given by adding all the values of ( x - x̄ )² and dividing by n , the number of data
items. The symbol for adding up all the values is Σ.
In our case, the average would be (9 + 0 + 9)/3 = 18/3 = 6.
However, we need to undo the squaring to ensure the measure has the same units as
x. This means that the standard deviation for our data is √6 ≈ 2.45.
Standard deviation: σ = √( Σ(x − x̄)² / n )
or, equivalently,
σ = √( Σx²/n − x̄² )
In words, σ² is ‘the mean of the squares minus the square of the mean’:
σ² = mean(x²) − (x̄)²
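The two forms can be checked against each other on the data 2, 5, 8 used above. This is a minimal sketch; the variable names are chosen for the example only.

```python
import math

data = [2, 5, 8]
n = len(data)
mean = sum(data) / n

# Definition: the mean of the squared differences from the mean
variance_def = sum((x - mean) ** 2 for x in data) / n

# Shortcut: the mean of the squares minus the square of the mean
variance_short = sum(x ** 2 for x in data) / n - mean ** 2

# Undo the squaring to get back to the units of x
sigma = math.sqrt(variance_def)

print(variance_def, variance_short, sigma)
```

Both routes give 6 before the square root is taken. Python's standard library function `statistics.pstdev` divides by n in the same way, so it should agree with `sigma` here.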
Variance (σ ² ) is the square of standard deviation and has very useful mathematical
properties.
For data given in a frequency table, the mean is
x̄ = Σfx / n
where f is the frequency of each x value and n is the total frequency.
σ² = Σfx²/n − x̄²
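A small worked example may help with the frequency-table versions of the formulas. The table below is invented for illustration; it is not taken from the notes.

```python
import math

# Hypothetical frequency table: value x -> frequency f
table = {1: 3, 2: 5, 3: 2}

n = sum(table.values())                             # total frequency, Σf
sum_fx = sum(f * x for x, f in table.items())       # Σfx
sum_fx2 = sum(f * x ** 2 for x, f in table.items()) # Σfx²

mean = sum_fx / n                    # x̄ = Σfx / n
variance = sum_fx2 / n - mean ** 2   # σ² = Σfx²/n − x̄²
sigma = math.sqrt(variance)

print(n, sum_fx, sum_fx2)
print(mean, variance, sigma)
```

Here Σf = 10, Σfx = 19 and Σfx² = 41, so the mean is 1.9, the variance is 4.1 − 3.61 = 0.49, and the standard deviation is 0.7 (up to floating-point rounding).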
Now, let’s look at whether there is a relationship between two variables. Data that comes in
pairs in this fashion is said to be bivariate. When we have these two sets of data, there may
or may not be a relationship between them. We can describe the relationship between them
by investigating their correlation.
However, instead of describing the correlation with words, we can use a numerical value, the
correlation coefficient, r, which can only take values in the range −1 ≤ r ≤ 1.
Strong positive correlation    As x increases, y generally increases.    r ≈ 1
Strong negative correlation    As x increases, y generally decreases.    r ≈ −1
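The notes do not say how r is calculated. The sketch below assumes it is the product moment (Pearson) correlation coefficient, and uses two invented data sets to show values near 1 and −1. The function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Product moment correlation coefficient of paired data (an assumption:
    the notes only state that r lies between -1 and 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical bivariate data
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]     # y generally increases with x
falling = [10, 8, 6, 4, 2]   # y decreases with x on an exact straight line

r_rising = pearson_r(xs, rising)
r_falling = pearson_r(xs, falling)
print(r_rising, r_falling)
```

The `falling` data lie exactly on a straight line with negative gradient, so r comes out as −1; the `rising` data are scattered around an upward trend, so r is positive but below 1.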
Scatter diagrams can also reveal whether there are two separate groups within the data.
However, you must remember that correlation does not equal causation. Such correlation
may be due to a coincidence, or due to a third hidden variable. For example, there might be
a strong correlation between ice cream sales and number of swimmers at a beach. Clearly,
eating ice cream doesn’t make you want to swim; instead, the hidden variable of
temperature could cause both to rise.
When working with real-world data, there may be errors, missing data, or extreme values
that can distort results.
Often the most useful thing to do is to look at your data graphically. If the underlying
pattern is strong, outliers can become obvious.
There are also some calculations you can do to check for outliers:
An outlier is any value more than 1.5 interquartile ranges away from the nearest
quartile.
An outlier is any value more than 2 standard deviations away from the mean.
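Both checks can be sketched in a few lines. The data set is invented, and the quartiles come from Python's `statistics.quantiles` with the `"inclusive"` method, which is an assumption: other quartile conventions can move the boundaries slightly. The function names are hypothetical.

```python
import statistics


def outliers_by_iqr(data):
    """Values more than 1.5 interquartile ranges beyond the nearest quartile."""
    # Assumption: 'inclusive' quartiles; textbooks may use another convention.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]


def outliers_by_sd(data):
    """Values more than 2 standard deviations away from the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # divides by n, matching the formula for σ
    return [x for x in data if abs(x - mean) > 2 * sd]


# Hypothetical data with one suspiciously large value
data = [2, 4, 5, 5, 6, 7, 8, 30]

print(outliers_by_iqr(data))
print(outliers_by_sd(data))
```

For this data set both rules flag only the value 30. The two rules do not always agree, so it is worth stating which one is being used.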
Once an outlier has been spotted, you must then decide whether to include it in your
calculation. This often requires you to look at the data in context:
If there are several outliers it might be a distinctly different group which should be
analysed separately.
17 – Probability
Events are mutually exclusive if they cannot both happen at the same time, e.g. rolling a 6
and a 5 on a die in one roll. If the events are mutually exclusive:
P(A ∧ B) = 0
P(A ∨ B) = P(A) + P(B)
If the events are independent:
P(A ∧ B) = P(A) × P(B)
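These rules can be checked by listing outcomes for a fair die. The sketch below uses exact fractions: the addition rule is tested on one roll, where rolling a 6 and rolling a 5 are mutually exclusive, and the multiplication rule is tested on two separate rolls, which are independent of each other. The helper `prob` and the event names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def prob(event, sample_space):
    """Probability of an event when every outcome is equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


# One roll: A = "roll a 6" and B = "roll a 5" cannot both happen.
one_roll = set(range(1, 7))
A, B = {6}, {5}
p_a_and_b = prob(A & B, one_roll)
p_a_or_b = prob(A | B, one_roll)

# Two rolls: C = "first roll is 6" and D = "second roll is 5" are independent.
two_rolls = set(product(range(1, 7), repeat=2))
C = {r for r in two_rolls if r[0] == 6}
D = {r for r in two_rolls if r[1] == 5}
p_c_and_d = prob(C & D, two_rolls)

print(p_a_and_b, p_a_or_b, p_c_and_d)
```

On one roll, P(A ∧ B) is 0 and P(A ∨ B) equals P(A) + P(B) = 1/3. Over two rolls, P(C ∧ D) = 1/36, which equals P(C) × P(D).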
Binomial distribution