0% found this document useful (0 votes)
14 views15 pages

Statistical Analysis

Uploaded by

Hello Hello
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

 Major Tasks in Data Preprocessing

o Data cleaning

o Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies

o Data integration

o Integration of multiple databases, data cubes, or files

o Data transformation

o Normalization and aggregation

o Data reduction

 Obtains reduced representation in volume but produces the same or similar analytical
results

o Data discretization

 Part of data reduction but with particular importance, especially for numerical data

 Forms of Data Preprocessing

Descriptive Data Summarization

Categorize the measures

 A measure is distributive if we can partition the dataset into smaller subsets,
compute the measure on the individual subsets, and then combine the partial
results in order to arrive at the measure's value on the entire (original) dataset
 A measure is algebraic if it can be computed by applying an algebraic function to
one or more distributive measures
 A measure is holistic if it must be computed on the entire dataset as a whole
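
The three categories can be illustrated with a small sketch. The two-way split of the geography marks used later in this section is a hypothetical partition chosen for illustration: the sum and count combine across subsets (distributive), the mean is derived from them (algebraic), and the median cannot be recovered from the medians of the subsets (holistic).

```python
from statistics import median

# Geography marks from Example 2, split into two hypothetical partitions.
data = [47, 35, 37, 32, 38, 39, 36, 34, 35]
chunks = [data[:3], data[3:]]

# Distributive: compute sum and count per partition, then combine.
total = sum(sum(chunk) for chunk in chunks)
count = sum(len(chunk) for chunk in chunks)

# Algebraic: the mean is a function of two distributive measures.
mean = total / count

# Holistic: combining per-partition medians does not give the true median.
median_of_medians = median([median(chunk) for chunk in chunks])
true_median = median(data)

print(total, count, mean)
print(median_of_medians, true_median)
```

Here the combined partition medians give 36.25 while the median of the whole data set is 36, which is why the median has to be computed on the entire data set.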

2.2.1 Measure the Central Tendency

A measure of central tendency is a single value that attempts to describe a set of data
by identifying the central position within that set of data. As such, measures of
central tendency are sometimes called measures of central location.

In other words, in many real-life situations, it is helpful to describe data by a single
number that is most representative of the entire collection of numbers. Such a number is
called a measure of central tendency. The most commonly used measures are the mean,
the median, and the mode.

Mean: the mean, or average, of n numbers is the sum of the numbers divided by n. That is:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Example 1

The marks of seven students in a mathematics test with a maximum possible mark of 20
are given below:
15 13 18 16 14 17 12

Find the mean of this set of data values.

Solution:

Mean = (15 + 13 + 18 + 16 + 14 + 17 + 12) / 7 = 105 / 7 = 15

So, the mean mark is 15.

Midrange

The midrange of a data set is the average of the minimum and maximum values.

Median: the median of n numbers is the middle number when the numbers are written in order.
If n is even, the median is the average of the two middle numbers.

Example 2

The marks of nine students in a geography test that had a maximum possible mark of 50
are given below:
47 35 37 32 38 39 36 34 35

Find the median of this set of data values.

Solution:

Arrange the data values in order from the lowest value to the highest value:

32 34 35 35 36 37 38 39 47

The fifth data value, 36, is the middle value in this arrangement.

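Note:

In general, when n values are written in order, the median is the ((n + 1)/2)th value.
Here n = 9, so the median is the 5th value.

If the number of values in the data set is even, then the median is the average of the
two middle values.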

Example 3

Find the median of the following data set:

12 18 16 21 10 13 17 19

Solution:

Arrange the data values in order from the lowest value to the highest value:

10 12 13 16 17 18 19 21

The number of values in the data set is 8, which is even. So, the median is the average
of the two middle values, 16 and 17:

Median = (16 + 17) / 2 = 16.5
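
Examples 1 to 3 can be checked with a short sketch using Python's standard `statistics` module. The helper name `midrange` and the variable names are illustrative choices, not part of the original notes.

```python
from statistics import mean, median

maths_marks = [15, 13, 18, 16, 14, 17, 12]                # Example 1
geography_marks = [47, 35, 37, 32, 38, 39, 36, 34, 35]    # Example 2
example_3 = [12, 18, 16, 21, 10, 13, 17, 19]              # Example 3


def midrange(values):
    """Average of the minimum and maximum values."""
    return (min(values) + max(values)) / 2


print("mean (Example 1):", mean(maths_marks))
print("median (Example 2):", median(geography_marks))
print("median (Example 3):", median(example_3))
print("midrange (Example 1):", midrange(maths_marks))
```

`median` sorts the data internally and averages the two middle values when the count is even, which matches the rule stated above.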

Trimmed mean

A trimmed mean eliminates the extreme observations by removing observations
from each end of the ordered sample. It is calculated by discarding a certain percentage
of the lowest and the highest scores and then computing the mean of the remaining scores.
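
A minimal sketch of this calculation follows. The function name `trimmed_mean` and the choice to round the number of discarded values down are assumptions made for illustration; other conventions for fractional counts exist.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end.

    Assumption: the number of values dropped per end is rounded down.
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)      # values to drop at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Geography marks from Example 2; 47 is an unusually high score.
marks = [47, 35, 37, 32, 38, 39, 36, 34, 35]
print(trimmed_mean(marks, 0.0))   # ordinary mean
print(trimmed_mean(marks, 0.2))   # drops the lowest and the highest mark
```

With 20% trimming on nine values, one value is removed from each end, so the high mark of 47 no longer pulls the result upward.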

Mode: the mode of n numbers is the number that occurs most frequently. If two numbers tie
for most frequent occurrence, the collection has two modes and is called bimodal.

The mode has applications in printing. For example, it is important to print more of
the most popular books, because printing different books in equal numbers would
cause a shortage of some books and an oversupply of others.

Likewise, the mode has applications in manufacturing. For example, it is important to
manufacture more of the most popular shoes, because manufacturing different shoes in
equal numbers would cause a shortage of some shoes and an oversupply of others.

Example 4

Find the mode of the following data set:

48 44 48 45 42 49 48 50

Solution:

The mode is 48 since it occurs most often.

 It is possible for a set of data values to have more than one mode.
 If there are two data values that occur most frequently, we say that the set of data
values is bimodal.
 If there are three data values that occur most frequently, we say that the set of data
values is trimodal.
 If two or more data values occur most frequently, we say that the set of
data values is multimodal.
 If no data value occurs more frequently than the others, we say that
the set of data values has no mode.
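
The cases above can be sketched with a small helper. The function name `modes` and the convention of returning an empty list when every value occurs equally often are illustrative assumptions that follow the "no mode" rule stated in the list.

```python
from collections import Counter


def modes(values):
    """Return all most frequent values, or [] when there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    # If several distinct values all occur equally often, there is no mode.
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


print(modes([48, 44, 48, 45, 42, 49, 48, 50]))  # Example 4: one mode
print(modes([1, 1, 2, 2, 3]))                   # bimodal
print(modes([1, 2, 3]))                         # no mode
```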

The mean, median and mode of a data set are collectively known as measures of
central tendency as these three measures focus on where the data is centered or
clustered. To analyze data using the mean, median and mode, we need to use the most
appropriate measure of central tendency. The following points should be remembered:

 The mean is useful for predicting future results when there are no extreme
values in the data set. However, the impact of extreme values on the mean may
be important and should be considered. E.g. the impact of a stock market crash
on average investment returns.
 The median may be more useful than the mean when there are extreme
values in the data set, because it is far less affected by them.
 The mode is useful when the most common item, characteristic or value of a
data set is required.
Measures of Dispersion

Measures of dispersion describe how spread out a set of data is. The two most
commonly used measures of dispersion are the variance and the standard deviation.
Rather than showing how the data are similar, they show how the data differ: their
variation, spread, or dispersion.
Other measures of dispersion that may be encountered include the quartiles, interquartile
range (IQR), five-number summary, range, and box plots.
Variance and Standard Deviation

Very different sets of numbers can have the same mean. You will now study two
measures of dispersion, which give you an idea of how much the numbers in a set differ
from the mean of the set. These two measures are called the variance of the set and the
standard deviation of the set.
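
The notes do not give the formulas at this point, so the following is a sketch under a stated assumption: it uses the population form, in which the variance is the mean of the squared deviations from the mean and the standard deviation is its square root. The two sample sets are hypothetical and were chosen because they share the same mean.

```python
import math


def variance(values):
    """Population variance: mean squared deviation from the mean (assumed form)."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)


def std_dev(values):
    """Population standard deviation: square root of the variance."""
    return math.sqrt(variance(values))


tight = [4, 5, 6]    # mean 5, values close together
wide = [1, 5, 9]     # mean 5, values far apart

print(variance(tight), std_dev(tight))
print(variance(wide), std_dev(wide))
```

Both sets have mean 5, but the second has a much larger variance, which is the distinction these measures are meant to capture. If the data are a sample rather than a whole population, the divisor n - 1 is commonly used instead of n.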

Percentile

Percentiles are values that divide a sample of data into one hundred groups containing (as
far as possible) equal numbers of observations.

The pth percentile of a distribution is the value such that p percent of the observations fall
at or below it.

The most commonly used percentiles other than the median are the 25th percentile and
the 75th percentile.

The 25th percentile demarcates the first quartile, the median or 50th percentile
demarcates the second quartile, the 75th percentile demarcates the third quartile, and the
100th percentile demarcates the fourth quartile.

Quartiles

Quartiles are numbers that divide an ordered data set into four portions, each containing
approximately one-fourth of the data. Twenty-five percent of the data values come
before the first quartile (Q1). The median is the second quartile (Q2); 50% of the data
values come before the median. Seventy-five percent of the data values come before the
third quartile (Q3).

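Q1 = 25th percentile = the (n*25/100)th value, where n is the total number of values in
the ordered data set

Q2 = median = 50th percentile = the (n*50/100)th value

Q3 = 75th percentile = the (n*75/100)th value

When the position is not a whole number, it is rounded up to the next whole number.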

Interquartile range (IQR)

The interquartile range is the length of the interval between the lower quartile (Q1) and
the upper quartile (Q3). This interval indicates the central, or middle, 50% of a data set.

IQR=Q3-Q1

Range

The range of a set of data is the difference between its largest (maximum) and
smallest (minimum) values. In the statistical world, the range is reported as a single
number, the difference between maximum and minimum. Sometimes, however, the range is
reported as "from (the minimum) to (the maximum)," i.e., two numbers.

Example 1:

Given data set: 3, 4, 4, 5, 6, 8

The range of the data set is 8 – 3 = 5 or, reported as two numbers, from 3 to 8. The range
gives only minimal information about the spread of the data, by defining the two extremes.
It says nothing about how the data are distributed between those two endpoints.

Example 2:

In this example we demonstrate how to find the minimum value, maximum value,
and range of the following data: 29, 31, 24, 29, 30, 25

1. Arrange the data from smallest to largest.

24, 25, 29, 29, 30, 31

2. Identify the minimum and maximum values:

Minimum = 24, Maximum = 31

3. Calculate the range:

Range = Maximum – Minimum = 31 – 24 = 7.

Thus the range is 7.

Five-Number Summary

The Five-Number Summary of a data set is a five-item list comprising the minimum
value, first quartile, median, third quartile, and maximum value of the set.

{MIN, Q1, MEDIAN (Q2), Q3, MAX}

Box plots

A box plot is a graph used to represent the range, median, quartiles and interquartile range
of a set of data values.

Constructing a Box plot: To construct a box plot:

(i) Draw a box to represent the middle 50% of the observations of the data set.
(ii) Show the median by drawing a vertical line within the box.
(iii) Draw the lines (called whiskers) from the lower and upper ends of the box to the
minimum and maximum values of the data set respectively, as shown in the following
diagram.

 X is the set of data values.
 Min X is the minimum value in the data set.
 Max X is the maximum value in the data set.

Example: Draw a boxplot for the following data set of scores:

76 79 76 74 75 71 85 82 82 79 81

Step 1: Arrange the score values in ascending order of magnitude:

71 74 75 76 76 79 79 81 82 82 85

There are 11 values in the data set.

Step 2: Q1=25th percentile value in the given data set

Q1=11*(25/100) th value

=2.75 =>3rd value

=75

Step 3: Q2=median=50th percentile value

=11 * (50/100) th value

=5.5th value => 6th value

=79

Step 4: Q3=75th percentile value

=11*(75/100)th value

=8.25th value=>9th value

= 82

Step 5: Min X= 71

Step 6: Max X=85

Step 7: Range= 85-71 = 14

Step 8: IQR = length of the box = Q3 – Q1 = 82 – 75 = 7

Since the quartiles mark the quarter points of the ordered data, they split the data into
four roughly equal parts. In other words:

 about one quarter of the data values are less than 75
 about one quarter of the data values are between 75 and 79
 about one quarter of the data values are between 79 and 82
 about one quarter of the data values are greater than 82
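
The box plot example can be reproduced with a short sketch. It assumes the positional rule used in the steps above: the pth percentile is the value at position n*p/100 in the ordered data, with a fractional position rounded up. The function names are illustrative, and statistical libraries often use interpolating quartile definitions that give slightly different values.

```python
import math


def value_at_percentile(ordered, p):
    """Value at position ceil(n * p / 100) of an ordered list (1-based)."""
    position = max(1, math.ceil(len(ordered) * p / 100))
    return ordered[position - 1]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) under the positional rule above."""
    ordered = sorted(values)
    return (
        ordered[0],
        value_at_percentile(ordered, 25),
        value_at_percentile(ordered, 50),
        value_at_percentile(ordered, 75),
        ordered[-1],
    )


scores = [76, 79, 76, 74, 75, 71, 85, 82, 82, 79, 81]
minimum, q1, q2, q3, maximum = five_number_summary(scores)
data_range = maximum - minimum
iqr = q3 - q1
print(minimum, q1, q2, q3, maximum)
print("range:", data_range, "IQR:", iqr)
```

For these scores the sketch gives the same quartiles as the worked steps (75, 79, 82), a range of 14, and an IQR of 7.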

Outliers

An outlier is a data value that lies far from the bulk of the data. By the usual rule,
outliers are any points below Q1 – 1.5×IQR or above Q3 + 1.5×IQR.

Example:

Find the outliers, if any, for the following data set:

10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4

To find out if there are any outliers, we first have to find the IQR. There are fifteen data
points, so the median is at position 15/2 = 7.5, which rounds up to the 8th value, 14.6.
That is, Q2 = 14.6.

Q1 is the fourth value in the list and Q3 is the twelfth: Q1 = 14.4 and Q3 = 14.9.

Then IQR = 14.9 – 14.4 = 0.5.

Outliers will be any points below:

Q1 – 1.5×IQR = 14.4 – 0.75 = 13.65 or above Q3 + 1.5×IQR = 14.9 + 0.75 = 15.65.

Then the outliers are at 10.2, 15.9, and 16.4.

The values for Q1 – 1.5×IQR and Q3 + 1.5×IQR are the "fences" that mark off
the "reasonable" values from the outlier values. Outliers lie outside the
fences.
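
The fence calculation above can be sketched as follows. The quartiles are taken directly from the worked example (Q1 = 14.4, Q3 = 14.9) rather than recomputed, and the function name `iqr_fences` is an illustrative choice.

```python
def iqr_fences(q1, q3, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


data = [10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6,
        14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4]

lower, upper = iqr_fences(14.4, 14.9)
outliers = [x for x in data if x < lower or x > upper]
print(round(lower, 2), round(upper, 2))
print(outliers)
```

The multiplier `k` is left as a parameter because 1.5 is a convention rather than a fixed law; a larger value flags only more extreme points.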

Graphic Displays of Basic Descriptive Data Summaries

1 Histogram

A histogram is a way of summarizing data that are measured on an interval scale
(either discrete or continuous). It is often used in exploratory data analysis to illustrate
the major features of the distribution of the data in a convenient form. It divides up
the range of possible values in a data set into classes or groups. For each group,
a rectangle is constructed with a base length equal to the range of values in that
specific group, and an area proportional to the number of observations falling into that
group. This means that the rectangles might be drawn of non-uniform height.

The histogram is only appropriate for variables whose values are numerical and measured
on an interval scale. It is generally used when dealing with large data sets
(more than 100 observations).

A histogram can also help detect any unusual observations (outliers), or any gaps in the
data set.
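
The grouping step behind a histogram can be sketched without any plotting library. The function name `histogram_counts`, the equal-width classes, and the choice of three classes from 70 to 85 are assumptions for illustration; the scores are the ones used in the box plot example, which is far fewer than the large data sets a histogram is normally used for.

```python
def histogram_counts(values, low, high, bins):
    """Count values falling into `bins` equal-width classes on [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for x in values:
        if not low <= x <= high:
            continue                      # ignore values outside the range
        # The top edge belongs to the last class.
        index = min(int((x - low) / width), bins - 1)
        counts[index] += 1
    return counts


scores = [76, 79, 76, 74, 75, 71, 85, 82, 82, 79, 81]
counts = histogram_counts(scores, low=70, high=85, bins=3)

# Text rendering: one row per class, one '#' per observation.
for i, c in enumerate(counts):
    start = 70 + i * 5
    print(f"{start}-{start + 5}: {'#' * c}")
```

Because the classes have equal width here, the bar length is proportional to the count; with unequal widths the area, not the height, should represent the count.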

2 Scatter Plot

A scatter plot is a useful summary of a set of bivariate data (two variables), usually
drawn before working out a linear correlation coefficient or fitting a regression line. It
gives a good visual picture of the relationship between the two variables, and aids the
interpretation of the correlation coefficient or regression model.

Each unit contributes one point to the scatter plot, on which points are plotted but not
joined. The resulting pattern indicates the type and strength of the relationship between
the two variables.

[Figure: positively and negatively correlated data]

A scatter plot will also show up a non-linear relationship between the two variables and
whether or not there exist any outliers in the data.
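
Since a scatter plot is usually drawn before working out a linear correlation coefficient, a minimal sketch of that coefficient (Pearson's r) may help. The paired data below are hypothetical and are not taken from the notes.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either variable is constant.
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired observations: hours studied and test mark.
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 70]
r = pearson_r(hours, marks)
print(round(r, 3))
```

A value near +1 corresponds to points rising along a line in the scatter plot, a value near -1 to points falling along a line, and a value near 0 to no linear pattern. A strong non-linear relationship can still give a small r, which is one reason to look at the plot first.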

3 Loess curve

A loess curve is another important exploratory graphic aid that adds a smooth curve to a
scatter plot in order to provide better perception of the pattern of dependence. The word
loess is short for "local regression."

4 Box plot

The picture produced consists of the most extreme values in the data set (maximum and
minimum values), the lower and upper quartiles, and the median.

5 Quantile plot

 Displays all of the data (allowing the user to assess both the overall behavior
and unusual occurrences)
 Plots quantile information
 For data xi sorted in increasing order, fi indicates that
approximately 100 fi% of the data are below or equal to the value xi

The f quantile of the data is the data value denoted q(f). Each data point can
be assigned an f-value. Let a time series x of length n be sorted from smallest to
largest values, such that the sorted values have rank i = 1, 2, ..., n. The f-value for
each observation is computed as

$$f_i = \frac{i - 0.5}{n}$$
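
A short sketch of the f-value assignment follows. It assumes the formula f_i = (i - 0.5)/n for the value of rank i, which matches the (i - ½)/n expression used for Q-Q plots later in this section; the function name and the sample data are illustrative.

```python
def f_values(values):
    """Pair each sorted value with its f-value (i - 0.5) / n, i = 1..n."""
    ordered = sorted(values)
    n = len(ordered)
    return [(x, (i - 0.5) / n) for i, x in enumerate(ordered, start=1)]


# Scores from the box plot example.
scores = [76, 79, 76, 74, 75, 71, 85, 82, 82, 79, 81]
for x, f in f_values(scores):
    print(f"{x}: f = {f:.3f}")
```

Plotting each value against its f-value gives the quantile plot: the f-value on one axis and the data value q(f) on the other.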

6 Quantile-Quantile plots (Q-Q plot)

Quantile-quantile plots allow us to compare the quantiles of two sets of numbers.

This kind of comparison is much more detailed than a simple comparison of means
or medians.

A normal distribution is often a reasonable model for the data. Without inspecting the
data, however, it is risky to assume a normal distribution. There are a number of graphs
that can be used to check the deviations of the data from the normal distribution. The
most useful tool for assessing normality is a quantile-quantile (QQ) plot. This is a scatter
plot with the quantiles of the scores on the horizontal axis and the expected normal
scores on the vertical axis.

In other words, it is a graph that shows the quantiles of one univariate distribution against
the corresponding quantiles of another. It is a powerful visualization tool in that it allows
the user to view whether there is a shift in going from one distribution to another.

The steps in constructing a QQ plot are as follows:

First, we sort the data from smallest to largest. A plot of these scores against the
expected normal scores should reveal a straight line.

The expected normal scores are calculated by taking the z-scores of (i - ½)/n, where i is the
rank in increasing order.

Curvature of the points indicates departures from normality. This plot is also useful
for detecting outliers. The outliers appear as points that are far away from the overall
pattern of points.
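
The expected normal scores can be sketched with the standard library's `statistics.NormalDist`, whose `inv_cdf` method returns the z-score for a given cumulative probability. The function name `normal_qq_points` is an illustrative choice, and the data reuse the scores from the box plot example.

```python
from statistics import NormalDist


def normal_qq_points(values):
    """Pair each sorted value with the z-score of (i - 0.5) / n."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()          # mean 0, standard deviation 1
    return [
        (x, standard_normal.inv_cdf((i - 0.5) / n))
        for i, x in enumerate(ordered, start=1)
    ]


scores = [76, 79, 76, 74, 75, 71, 85, 82, 82, 79, 81]
for score, z in normal_qq_points(scores):
    print(f"{score}: expected normal score {z:+.3f}")
```

Plotting the first element of each pair on the horizontal axis and the second on the vertical axis gives the QQ plot described above; points lying close to a straight line suggest the data are roughly normal.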

How is a quantile-quantile plot different from a quantile plot?

A quantile plot is a graphical method used to show the approximate percentage of values
below or equal to the independent variable in a univariate distribution. It displays
quantile information for all the data, where the values measured for the independent
variable are plotted against their corresponding quantile.

A quantile-quantile plot, however, graphs the quantiles of one univariate distribution
against the corresponding quantiles of another univariate distribution. Both axes
display the range of values measured for their corresponding distribution, and points
are plotted that correspond to the quantile values of the two distributions. A line (y
= x) can be added to the graph along with points representing where the first,
second and third quartiles lie, in order to increase the graph's informational value.
Points that lie above such a line indicate a correspondingly higher value for the
distribution plotted on the y-axis than for the distribution plotted on the x-axis at
the same quantile. The opposite effect is true for points lying below this line.

Data Cleaning

Data cleaning routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.
