Statistical Analysis
o Data cleaning
o Data integration
o Data transformation
o Data reduction
o Data discretization: part of data reduction, but of particular importance, especially for numerical data
A measure of central tendency is a single value that attempts to describe a set of data
by identifying the central position within that set of data. As such, measures of
central tendency are sometimes called measures of central location.
Mean: the mean, or average, of n numbers is the sum of the numbers divided by n. That is:
mean = (x1 + x2 + ... + xn) / n
Example 1
The marks of seven students in a mathematics test with a maximum possible mark of 20
are given below:
15 13 18 16 14 17 12
Solution:
Mean = (15 + 13 + 18 + 16 + 14 + 17 + 12) / 7 = 105 / 7 = 15
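The arithmetic for Example 1 can be checked with a short Python sketch. The variable names are illustrative only; the data are the seven marks listed above.

```python
from statistics import mean

# Marks of the seven students from Example 1.
marks = [15, 13, 18, 16, 14, 17, 12]

total = sum(marks)      # sum of the numbers
n = len(marks)          # how many numbers there are
average = total / n     # mean = sum / n

print(total, n, average)

# The standard library computes the same value.
print(mean(marks))
```

Both approaches divide the sum of the marks by the number of marks, which is all the definition requires.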
Midrange
The midrange of a data set is the average of the minimum and maximum values.
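As a minimal sketch, the midrange can be computed directly from the definition. The function name `midrange` is hypothetical, and the data reuse the marks from Example 1.

```python
def midrange(values):
    """Average of the smallest and largest values in the data set."""
    return (min(values) + max(values)) / 2

marks = [15, 13, 18, 16, 14, 17, 12]

# Minimum is 12 and maximum is 18, so the midrange is (12 + 18) / 2.
print(midrange(marks))
```

Because only the two extreme values are used, a single unusually large or small value shifts the midrange strongly.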
Median: the median of n numbers is the middle number when the numbers are written in order.
If n is even, the median is the average of the two middle numbers.
Example 2
The marks of nine students in a geography test that had a maximum possible mark of 50
are given below:
47 35 37 32 38 39 36 34 35
Solution:
Arrange the data values in order from the lowest value to the highest value:
32 34 35 35 36 37 38 39 47
The fifth data value, 36, is the middle value in this arrangement.
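Note:
In general:
If the number of values in the data set is odd, then the median is the middle value,
that is, the ((n + 1)/2)th value in the ordered list.
If the number of values in the data set is even, then the median is the average of the
two middle values.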
Example 3
Solution:
Arrange the data values in order from the lowest value to the highest value:
10 12 13 16 17 18 19 21
The number of values in the data set is 8, which is even. So, the median is the average
of the two middle values:
Median = (16 + 17) / 2 = 16.5
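The odd and even cases of the median can be sketched in one small function. This is an illustrative implementation rather than a prescribed one; the function name and variable names are assumptions, and the data are the values from Examples 2 and 3.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

geography_marks = [47, 35, 37, 32, 38, 39, 36, 34, 35]   # Example 2, n = 9
example3_values = [10, 12, 13, 16, 17, 18, 19, 21]       # Example 3, n = 8

print(median(geography_marks))
print(median(example3_values))
```

Sorting first is essential: taking the middle element of the unsorted list would give 38 for Example 2 instead of 36.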
Trimmed mean
Mode: the mode of n numbers is the number that occurs most frequently. If two numbers tie for most
frequent occurrence, the collection has two modes and is called bimodal.
The mode has applications in printing. For example, it is important to print more of
the most popular books, because printing different books in equal numbers would
cause a shortage of some books and an oversupply of others.
48 44 48 45 42 49 48
Solution:
The mode is 48 since it occurs most often.
It is possible for a set of data values to have more than one mode.
If there are two data values that occur most frequently, we say that the set of data
values is bimodal.
If there are three data values that occur most frequently, we say that the set of data
values is trimodal.
If two or more data values occur most frequently, we say that the set of
data values is multimodal.
If no data value occurs more frequently than the others, we say that
the set of data values has no mode.
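The classification above can be sketched by counting how often each value occurs. The function names are hypothetical. As an assumption of this sketch, "no mode" is taken to mean that every value occurs exactly once, and the label "multimodal" is used for four or more modes (bimodal and trimodal sets are also multimodal in the broader sense given above).

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    counts = Counter(values)
    top = max(counts.values())
    # Assumption: if every value occurs exactly once, report no mode.
    if top == 1 and len(counts) > 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

def describe(values):
    """Name the data set by the number of modes it has."""
    labels = {0: "no mode", 1: "one mode", 2: "bimodal", 3: "trimodal"}
    return labels.get(len(modes(values)), "multimodal")

marks = [48, 44, 48, 45, 42, 49, 48]
print(modes(marks), describe(marks))
print(describe([1, 1, 2, 2, 3]))
print(describe([1, 2, 3]))
```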
The mean, median and mode of a data set are collectively known as measures of
central tendency as these three measures focus on where the data is centered or
clustered. To analyze data using the mean, median and mode, we need to use the most
appropriate measure of central tendency. The following points should be remembered:
The mean is useful for predicting future results when there are no extreme
values in the data set. However, the impact of extreme values on the mean may
be important and should be considered. E.g. the impact of a stock market crash
on average investment returns.
The median may be more useful than the mean when there are extreme
values in the data set as it is not affected by the extreme values.
The mode is useful when the most common item, characteristic or value of a
data set is required.
Measures of Dispersion
Measures of dispersion measure how spread out a set of data is. The two most
commonly used measures of dispersion are the variance and the standard deviation.
Rather than showing how data are similar, they show how the data differ, that is, their variation,
spread, or dispersion.
Other measures of dispersion that may be encountered include the quartiles, interquartile
range (IQR), five-number summary, range, and box plots.
Variance and Standard Deviation
Very different sets of numbers can have the same mean. You will now study two
measures of dispersion, which give you an idea of how much the numbers in a set differ
from the mean of the set. These two measures are called the variance of the set and the
standard deviation of the set.
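A minimal sketch of both measures follows. The formulas are not stated in the text above, so this sketch assumes the population form: the variance is the average squared deviation from the mean (dividing by n), and the standard deviation is its square root. A sample variance would divide by n - 1 instead. The function names and the two small data sets are illustrative.

```python
from math import sqrt

def population_variance(values):
    """Average of the squared deviations from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def population_std_dev(values):
    """Square root of the population variance."""
    return sqrt(population_variance(values))

marks = [15, 13, 18, 16, 14, 17, 12]   # marks from Example 1, mean 15
print(population_variance(marks), population_std_dev(marks))

# Two very different sets with the same mean (15) but different spread.
narrow = [14, 15, 16]
wide = [0, 15, 30]
print(population_variance(narrow), population_variance(wide))
```

The last two lines illustrate the opening remark of this section: the mean alone cannot distinguish the tightly clustered set from the widely spread one, while the variance can.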
Percentile
Percentiles are values that divide a sample of data into one hundred groups containing (as
far as possible) equal numbers of observations.
The pth percentile of a distribution is the value such that p percent of the observations fall
at or below it.
The most commonly used percentiles other than the median are the 25th percentile and
the 75th percentile.
The 25th percentile demarcates the first quartile, the median or 50th percentile
demarcates the second quartile, the 75th percentile demarcates the third quartile, and the
100th percentile demarcates the fourth quartile.
Quartiles
Quartiles are numbers that divide an ordered data set into four portions, each containing
approximately one-fourth of the data. Twenty-five percent of the data values come
before the first quartile (Q1). The median is the second quartile (Q2); 50% of the data
values come before the median. Seventy-five percent of the data values come before the
third quartile (Q3).
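Q1 = 25th percentile = (n*25/100)th value, where n is the total number of values in the given data set
Q2 = median = 50th percentile = (n*50/100)th value
Q3 = 75th percentile = (n*75/100)th value
When the position is not a whole number, it is rounded up to the next whole position, as in the worked examples.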
The inter quartile range is the length of the interval between the lower quartile (Q1) and
the upper quartile (Q3). This interval indicates the central, or middle, 50% of a data set.
IQR=Q3-Q1
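The position rule for the quartiles can be sketched as follows. This sketch assumes that a fractional position n*p/100 is rounded up to the next whole position (the nearest-rank convention), which is one of several conventions in use; other software may interpolate and give slightly different quartiles. The function name is hypothetical, and the data are the geography marks from Example 2.

```python
def nearest_rank(values, p):
    """Value at position ceil(n * p / 100) in the sorted data (1-based)."""
    ordered = sorted(values)
    n = len(ordered)
    position = -(-n * p // 100)      # integer ceiling of n * p / 100
    position = max(position, 1)      # guard against position 0 when p is 0
    return ordered[position - 1]

marks = [47, 35, 37, 32, 38, 39, 36, 34, 35]   # n = 9

q1 = nearest_rank(marks, 25)   # position ceil(2.25) = 3
q2 = nearest_rank(marks, 50)   # position ceil(4.5)  = 5
q3 = nearest_rank(marks, 75)   # position ceil(6.75) = 7
iqr = q3 - q1

print(q1, q2, q3, iqr)
```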
Range
The range of a set of data is the difference between its largest (maximum) and
smallest (minimum) values. In statistics, the range is reported as a single
number, the difference between the maximum and the minimum. Sometimes the range is
instead reported as "from (the minimum) to (the maximum)," i.e., as two numbers.
Example 1:
The range of the data set is 3–8. The range gives only minimal information about the spread
of the data, by defining the two extremes. It says nothing about how the data are
distributed between those two endpoints.
Example 2:
In this example we demonstrate how to find the minimum value, maximum value,
and range of the following data: 29, 31, 24, 29, 30, 25
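The three quantities for this data set can be found with a short Python sketch; the variable names are illustrative.

```python
data = [29, 31, 24, 29, 30, 25]

smallest = min(data)              # minimum value
largest = max(data)               # maximum value
data_range = largest - smallest   # range = maximum - minimum

print(smallest, largest, data_range)
```

Reported as a single number the range is 7; reported as two numbers it runs from 24 to 31.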
Five-Number Summary
The Five-Number Summary of a data set is a five-item list comprising the minimum
value, first quartile, median, third quartile, and maximum value of the set.
Box plots
A box plot is a graph used to represent the range, median, quartiles and interquartile range
of a set of data values.
(i) Draw a box to represent the middle 50% of the observations of the data set.
(ii) Show the median by drawing a vertical line within the box.
(iii) Draw the lines (called whiskers) from the lower and upper ends of the box to the
minimum and maximum values of the data set respectively, as shown in the following
diagram.
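Example:
76 79 76 74 75 71 85 82 82 79 81
Step 1: Arrange the data values in order from the lowest value to the highest value:
71 74 75 76 76 79 79 81 82 82 85
There are 11 values in the data set.
Step 2: Q1 = 11*(25/100)th value = 2.75, rounded up to the 3rd value
= 75
Step 3: Median = 11*(50/100)th value = 5.5, rounded up to the 6th value
= 79
Step 4: Q3 = 11*(75/100)th value = 8.25, rounded up to the 9th value
= 82
Step 5: Min X = 71, Max X = 85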
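The five values a box plot needs for this data set can be reproduced with Python's standard library. Note the assumption: `statistics.quantiles` interpolates between neighbouring values using its default "exclusive" method, which is a different convention from the position rule in these notes. For these 11 values the quartile positions happen to be whole numbers, so the results agree; for other data sizes they may differ slightly.

```python
from statistics import quantiles

scores = [76, 79, 76, 74, 75, 71, 85, 82, 82, 79, 81]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(scores, n=4)

# Five-number summary: minimum, Q1, median, Q3, maximum.
summary = (min(scores), q1, q2, q3, max(scores))
print(summary)

# The box spans Q1 to Q3, so its length is the interquartile range.
print(q3 - q1)
```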
Since the medians represent the middle points, they split the data into four equal parts. In
other words:
Outliers
An outlier is a data value that falls far outside the range of the rest of the data. Outliers will be
any points below Q1 – 1.5×IQR or above Q3 + 1.5×IQR.
Example:
10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4
To find out whether there are any outliers, we first have to find the IQR. There are fifteen data
points, so the median is at position 15/2 = 7.5, rounded up to the 8th value, 14.6. That is, Q2 =
14.6.
Q1 is the fourth value in the list and Q3 is the twelfth: Q1 = 14.4 and Q3 = 14.9.
So IQR = 14.9 – 14.4 = 0.5 and 1.5×IQR = 0.75.
The values for Q1 – 1.5×IQR and Q3 + 1.5×IQR are the "fences" that mark off
the "reasonable" values from the outlier values: 14.4 – 0.75 = 13.65 and 14.9 + 0.75 = 15.65.
Outliers lie outside the fences, so the outliers here are 10.2, 15.9, and 16.4.
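The fence rule can be sketched in a few lines. The function name `fences` and the parameter `k` are hypothetical; Q1 = 14.4 and Q3 = 14.9 are taken from the example above rather than recomputed.

```python
def fences(q1, q3, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6,
        14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4]

lower, upper = fences(14.4, 14.9)

# Any value strictly outside the fences is flagged as an outlier.
outliers = [x for x in data if x < lower or x > upper]

print(round(lower, 2), round(upper, 2))
print(outliers)
```

The fences are rounded only for display, since floating-point subtraction may leave tiny rounding differences in the last digits.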
1 Histogram
The histogram is only appropriate for variables whose values are numerical and measured
on an interval scale. It is generally used when dealing with large data sets
(more than 100 observations).
A histogram can also help detect any unusual observations (outliers), or any gaps in the
data set.
2 Scatter Plot
A scatter plot is a useful summary of a set of bivariate data (two variables), usually
drawn before working out a linear correlation coefficient or fitting a regression line. It
gives a good visual picture of the relationship between the two variables, and aids the
interpretation of the correlation coefficient or regression model.
Each unit contributes one point to the scatter plot, on which points are plotted but not
joined. The resulting pattern indicates the type and strength of the relationship between
the two variables.
Positively and Negatively Correlated Data
A scatter plot will also show up a non-linear relationship between the two variables and
whether or not there exist any outliers in the data.
3 Loess curve
It is another important exploratory graphic aid that adds a smooth curve to a scatter plot in
order to provide better perception of the pattern of dependence. The word loess is short
for "local regression."
4 Box plot
The picture produced consists of the most extreme values in the data set (maximum and
minimum values), the lower and upper quartiles, and the median.
5 Quantile plot
o Displays all of the data (allowing the user to assess both the overall behavior
and unusual occurrences)
o Plots quantile information
o For data values xi sorted in increasing order, fi indicates that
approximately 100 fi% of the data are below or equal to the value xi
The f quantile of the data is found. That data value is denoted q(f). Each data point can
be assigned an f-value. Let a time series x of length n be sorted from smallest to
largest values, such that the sorted values have rank i = 1, 2, ..., n. The f-value for each observation is
computed as fi = (i - 0.5)/n.
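As a sketch, the points of a quantile plot can be produced by pairing each sorted value with its f-value. This assumes the f-value of the observation with rank i is (i - 0.5)/n, the same expression these notes use for the expected normal scores; other texts use slightly different formulas. The function name and the small data set are hypothetical.

```python
def f_values(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n, for i = 1..n."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

# Each pair (f, x) says roughly 100*f percent of the data are <= x.
for f, x in f_values([30, 10, 40, 20]):
    print(f, x)
```

Plotting x against f for every observation gives the quantile plot described above.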
This kind of comparison is much more detailed than a simple comparison of means
or medians.
A normal distribution is often a reasonable model for the data. Without inspecting the
data, however, it is risky to assume a normal distribution. There are a number of graphs
that can be used to check the deviations of the data from the normal distribution. The
most useful tool for assessing normality is a quantile-quantile (QQ) plot. This is a scatter plot
with the quantiles of the scores on the horizontal axis and the expected normal
scores on the vertical axis.
In other words, it is a graph that shows the quantiles of one univariate distribution against
the corresponding quantiles of another. It is a powerful visualization tool in that it allows
the user to view whether there is a shift in going from one distribution to another.
First, we sort the data from smallest to largest. A plot of these scores against the
expected normal scores should reveal a straight line.
The expected normal scores are calculated by taking the z-scores of (i - ½)/n, where i is the
rank in increasing order.
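The expected normal scores can be sketched with the standard library, reading "the z-score of (i - ½)/n" as the standard normal value whose cumulative probability is (i - ½)/n. That reading, the function names, and the sample data are assumptions of this sketch.

```python
from statistics import NormalDist

def expected_normal_scores(n):
    """z-values whose cumulative probability is (i - 0.5) / n, for i = 1..n."""
    standard_normal = NormalDist()   # mean 0, standard deviation 1
    return [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

def qq_points(values):
    """Pair each sorted data value with its expected normal score."""
    ordered = sorted(values)
    return list(zip(ordered, expected_normal_scores(len(ordered))))

scores = expected_normal_scores(5)
print([round(z, 3) for z in scores])

# Data values go on the horizontal axis, normal scores on the vertical axis.
for x, z in qq_points([15, 13, 18, 16, 14, 17, 12]):
    print(x, round(z, 3))
```

If the data are roughly normal, the printed pairs fall close to a straight line when plotted.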
Curvature of the points indicates departures from normality. This plot is also useful
for detecting outliers. The outliers appear as points that are far away from the overall
pattern of points.
A quantile plot is a graphical method used to show the approximate percentage of values
below or equal to the independent variable. It displays quantile information for all the data, where the values
measured for the independent variable are plotted against their corresponding quantile.
Data Cleaning
Data cleaning routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.