Stat210 FL17 LCN 1
Probability and
Statistics
Unit 1: Descriptive Statistics
Outline
Introduction to Statistics:
Graphical method:
Bar and pie charts, Histogram
Summary Statistics:
Measures of location, measures of variability,
boxplot
IE FF GC GC OT FF FF FF FF IE
GC FF FF OT FF FF IE GC FF FF
GC IE IE IE GC FF OT OT OT OT
FF IE IE IE OT IE FF OT IE FF
FF IE IE GC IE FF GC GC GC FF
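A bar chart or pie chart of categorical data is built from the frequency of each category. As a minimal sketch (Python is an illustrative choice, not a tool used in these notes, and the variable names are hypothetical), the labels above can be tallied like this:

```python
from collections import Counter

# The 50 category labels from the data set above, copied row by row.
raw = """
IE FF GC GC OT FF FF FF FF IE
GC FF FF OT FF FF IE GC FF FF
GC IE IE IE GC FF OT OT OT OT
FF IE IE IE OT IE FF OT IE FF
FF IE IE GC IE FF GC GC GC FF
"""
labels = raw.split()
n = len(labels)

# Frequency of each category: the bar heights in a bar chart.
counts = Counter(labels)

# Relative frequency = frequency / n: the slice sizes in a pie chart.
rel_freq = {category: count / n for category, count in counts.items()}

for category, count in counts.most_common():
    print(f"{category}: frequency={count}, relative frequency={rel_freq[category]:.2f}")
```

The relative frequencies add up to 1, which is why they can be drawn as slices of a pie.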
The distribution of the CPU times is skewed to the right with one potential outlier.
(3) 56, 52, 13, 34, 33, 18, 44, 41, 48, 75, 24, 19, 35, 27, 46,
62, 71, 24, 66, 94, 40, 18, 15, 39, 53, 23, 41, 78, 15, 35
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Similarly, the population mean, denoted by µ, is given by
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
where N is the population size.
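The mean is simply the sum of the observations divided by their count. A short sketch in Python (an illustrative choice; the function name is hypothetical), applied to the 30 values listed in item (3) above:

```python
def mean(values):
    """Return the sum of the observations divided by how many there are."""
    return sum(values) / len(values)

# The 30 observations from item (3) above.
data = [56, 52, 13, 34, 33, 18, 44, 41, 48, 75, 24, 19, 35, 27, 46,
        62, 71, 24, 66, 94, 40, 18, 15, 39, 53, 23, 41, 78, 15, 35]

x_bar = mean(data)
print(f"n = {len(data)}, sum = {sum(data)}, mean = {x_bar:.1f}")
```

The same function computes the population mean when the list holds every member of the population; only the notation (µ and N instead of x̄ and n) changes.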
Sometimes a sample may contain a few points that are much
larger or smaller than the rest. Such points are called outliers
and may affect the mean.
STAT210: Probability and Statistics 29
Median
The median is the value in the middle when the data are
arranged in ascending order (smallest value to largest value).
To find the median, the values in the sample are ordered from
smallest to largest, then:
If n is odd, the sample median is the number in position
(n+1)/2.
If n is even, the sample median is the average of the
numbers in positions n/2 and (n/2)+1.
Although the mean is the more commonly used measure of
central location, in some situations the median is preferred. The
mean is influenced by extremely small and large data values. In
such cases, the median is often the preferred measure of central
location.
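The odd and even rules for the median can be sketched in Python as follows. This is an illustration only: the function name and the two small samples are made up, and the 1-based positions from the rule are shifted by one because Python lists are indexed from 0.

```python
def sample_median(values):
    """Median by the position rule: order the data, then pick the middle."""
    ordered = sorted(values)          # smallest to largest
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the value in position (n+1)/2, i.e. index (n+1)//2 - 1.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of positions n/2 and n/2 + 1, i.e. indices n//2 - 1 and n//2.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

odd_sample = [7, 1, 5]        # hypothetical data, n = 3
even_sample = [7, 1, 5, 3]    # hypothetical data, n = 4

print(sample_median(odd_sample))
print(sample_median(even_sample))
```

For the odd sample the ordered values are 1, 5, 7 and the middle one is 5; for the even sample the ordered values are 1, 3, 5, 7 and the median is the average of 3 and 5.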
Mean vs. Median
The mean tends to be drawn in the direction of the tail of a
skewed distribution. The median is more appropriate when
the distribution is highly skewed.
The mean can be greatly affected by the presence of outliers,
whereas the median is not.
For symmetric distributions, the mean and median are the
same.
For skewed distributions, the mean lies towards the longer
tail relative to the median.
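A small numerical illustration of these points, using Python's standard statistics module. Both samples are hypothetical and chosen so the arithmetic is easy to check by hand:

```python
from statistics import mean, median

# A symmetric sample: mean and median coincide.
symmetric = [10, 12, 13, 14, 16]

# The same sample with the largest value replaced by an outlier.
with_outlier = [10, 12, 13, 14, 161]

print("symmetric:    mean =", mean(symmetric), " median =", median(symmetric))
print("with outlier: mean =", mean(with_outlier), " median =", median(with_outlier))
```

The single outlier pulls the mean from 13 up to 42, towards the long right tail, while the median stays at 13.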
Trimmed Mean:
The trimmed mean is a measure of center that is designed to be
unaffected by outliers.
With the p% trimmed mean, p% of the data is trimmed from each
end of the data set.
First, arrange the sample values in (ascending or descending)
order. Then, trim an equal number of them (np/100 points)
from each end. Finally, compute the sample mean of the
remaining points.
Note: Minitab prints the 5% trimmed mean.
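The three steps of the trimmed mean can be sketched in Python as below. Two assumptions are worth flagging: the function name and data are hypothetical, and when np/100 is not a whole number this sketch rounds it down, which may differ from the rounding rule a particular package such as Minitab applies.

```python
def trimmed_mean(values, p):
    """p% trimmed mean: drop np/100 points from each end, average the rest."""
    ordered = sorted(values)              # step 1: order the sample
    n = len(ordered)
    k = int(n * p / 100)                  # assumption: round np/100 down
    if 2 * k >= n:
        raise ValueError("p is too large: no observations would remain")
    kept = ordered[k:n - k]               # step 2: trim k points from each end
    return sum(kept) / len(kept)          # step 3: mean of what remains

# Hypothetical sample: the integers 1..19 plus one extreme value.
data = list(range(1, 20)) + [1000]

plain = sum(data) / len(data)
trimmed = trimmed_mean(data, 5)           # n = 20, so one point is cut from each end
print(f"mean = {plain}, 5% trimmed mean = {trimmed}")
```

Here the ordinary mean is 59.5 because of the value 1000, while the 5% trimmed mean is 10.5, the mean of the values 2 through 19.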
The first quartile, Q1, is the value that has approximately 25% of
the observations below it. It represents the median of the lower half
of the data and corresponds to the 25th percentile.
The second quartile or median is the 50th percentile.
The third quartile, Q3, has approximately 75% of the observations
below it and corresponds to the 75th percentile.
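Following the description of Q1 as the median of the lower half, the quartiles can be sketched in Python as below. This is one convention among several: here the middle observation is left out of both halves when n is odd, and statistical packages (including Minitab) may use an interpolation rule that gives slightly different values. Function names and data are hypothetical.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # lower half (middle value excluded if n is odd)
    upper = ordered[n - half:]      # upper half (middle value excluded if n is odd)
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)

q1, q2, q3 = quartiles([1, 2, 3, 4, 5, 6, 7, 8])   # hypothetical data
print(q1, q2, q3)
```

For these eight values the lower half is 1, 2, 3, 4 and the upper half is 5, 6, 7, 8, so Q1 = 2.5, the median is 4.5, and Q3 = 6.5.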
Measures of Variability: Variance and
Standard Deviation
The variance is the average of the squared deviations of the values from the
mean. The population variance (σ²) is given by
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$
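The population variance formula translates almost directly into code. A minimal Python sketch with hypothetical data, treating the list as the whole population (so the divisor is N); the standard deviation is taken as the square root of the variance:

```python
import math

def population_variance(values):
    """Average of the squared deviations from the population mean."""
    N = len(values)
    mu = sum(values) / N
    return sum((x - mu) ** 2 for x in values) / N

population = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical population, N = 8

sigma_sq = population_variance(population)
sigma = math.sqrt(sigma_sq)               # standard deviation
print(f"variance = {sigma_sq}, standard deviation = {sigma}")
```

Here µ = 5, the squared deviations sum to 32, and dividing by N = 8 gives a variance of 4 and a standard deviation of 2.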
The Interquartile Range (IQR) is the range for the middle 50% of
the data.
IQR = Q3 - Q1
It is not influenced by outliers but is used to detect them.
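One way the IQR is used to detect outliers is the common 1.5 × IQR convention: points more than 1.5 × IQR below Q1 or above Q3 are flagged. That threshold is an assumption here, since this section does not state the rule, and the quartiles below use a median-of-halves convention that may differ slightly from Minitab's. A Python sketch with hypothetical data:

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def iqr_outliers(values, k=1.5):
    """Return (IQR, lower fence, upper fence, flagged points)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median_of_sorted(ordered[:half])        # median of the lower half
    q3 = median_of_sorted(ordered[n - half:])    # median of the upper half
    iqr = q3 - q1                                # IQR = Q3 - Q1
    low, high = q1 - k * iqr, q3 + k * iqr       # assumed 1.5 x IQR fences
    flagged = [x for x in ordered if x < low or x > high]
    return iqr, low, high, flagged

iqr, low, high, flagged = iqr_outliers([1, 2, 3, 4, 5, 6, 7, 50])  # hypothetical data
print(f"IQR = {iqr}, fences = ({low}, {high}), outliers = {flagged}")
```

For this sample Q1 = 2.5 and Q3 = 6.5, so IQR = 4 and the fences are -3.5 and 12.5; only the value 50 falls outside them. The extreme value does not change Q1 or Q3, which is why the IQR itself is resistant to outliers.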
Minitab Output:
Descriptive Statistics: CPU Time
53 46 36 48 39 35 37 36 39 45
compare the number of intrusions before and after the change, construct
parallel boxplots and comment on your findings.
Exercise
(3) Match each histogram to the boxplot that represents the
same data set.
24.1 13.3 16.2 17.5 19.0 23.9 14.8 22.2 21.7 20.7
13.5 15.8 13.1 16.1 21.9 23.9 19.3 12.0 19.9 19.4
15.4 16.7 19.5 16.2 16.9 17.1 20.2 13.4 19.8 17.7
19.7 18.7 17.6 15.9 15.2 17.1 15.0 18.8 21.6 11.9