Statistical Analysis

Uploaded by

Hello Hello

Major Tasks in Data Preprocessing

• Data cleaning
   o Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
   o Integration of multiple databases, data cubes, or files
• Data transformation
   o Normalization and aggregation
• Data reduction
   o Obtains a reduced representation in volume that produces the same or similar analytical results
• Data discretization
   o Part of data reduction, but of particular importance, especially for numerical data

[Figure: Forms of Data Preprocessing]

Descriptive Data Summarization

Categorize the measures

• A measure is distributive if we can partition the dataset into smaller subsets, compute the measure on the individual subsets, and then combine the partial results to arrive at the measure's value on the entire (original) dataset. Examples are the sum, the count, the minimum and the maximum.
• A measure is algebraic if it can be computed by applying an algebraic function to one or more distributive measures. An example is the mean, which is the sum divided by the count.
• A measure is holistic if it must be computed on the entire dataset as a whole. An example is the median.
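The three categories can be illustrated with a minimal Python sketch. The function names and the way the seven marks are split into two partitions are assumptions made only for illustration. The mean is rebuilt exactly from per-partition sums and counts, whereas combining per-partition medians does not, in general, give the median of the whole dataset.

```python
from statistics import median

def partial_summary(chunk):
    """Distributive pieces for one partition: (sum, count)."""
    return sum(chunk), len(chunk)

def combined_mean(partitions):
    """Algebraic measure: merge per-partition sums and counts, then divide."""
    partials = [partial_summary(chunk) for chunk in partitions]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Seven test marks, split into two partitions (the split is an arbitrary assumption)
partitions = [[15, 13, 18, 16], [14, 17, 12]]
all_values = [v for chunk in partitions for v in chunk]

mean_from_parts = combined_mean(partitions)   # 105 / 7 = 15.0
median_of_all = median(all_values)            # 15, needs every value at once
# Combining partition medians gives 14.75 here, which is not the true median
median_of_medians = median([median(chunk) for chunk in partitions])

print(mean_from_parts, median_of_all, median_of_medians)
```

This is why distributive and algebraic measures are cheap to compute over partitioned data, while holistic measures such as the median are not.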

2.2.1 Measure the Central Tendency

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set. As such, measures of central tendency are sometimes called measures of central location.

In other words, in many real-life situations it is helpful to describe data by a single number that is most representative of the entire collection of numbers. Such a number is called a measure of central tendency. The most commonly used measures are the mean, the median, and the mode.

Mean: the mean, or average, of n numbers is the sum of the numbers divided by n. That is:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Example 1

The marks of seven students in a mathematics test with a maximum possible mark of 20
are given below:
15 13 18 16 14 17 12

Find the mean of this set of data values.

Solution:

Mean = (15 + 13 + 18 + 16 + 14 + 17 + 12) / 7 = 105 / 7 = 15

So, the mean mark is 15.

Midrange

The midrange of a data set is the average of the minimum and maximum values.

Median: the median of n numbers is the middle number when the numbers are written in order. If n is even, the median is the average of the two middle numbers.

Example 2

The marks of nine students in a geography test that had a maximum possible mark of 50
are given below:
47 35 37 32 38 39 36 34 35

Find the median of this set of data values.

Solution:

Arrange the data values in order from the lowest value to the highest value:

32 34 35 35 36 37 38 39 47

The fifth data value, 36, is the middle value in this arrangement, so the median is 36.

Note:

In general, if the number of values n in the data set is odd, the median is the ((n + 1)/2)th value in the ordered list. If n is even, the median is the average of the two middle values.

Example 3

Find the median of the following data set:

12 18 16 21 10 13 17 19

Solution:

Arrange the data values in order from the lowest value to the highest value:

10 12 13 16 17 18 19 21

The number of values in the data set is 8, which is even. So, the median is the average of the two middle values, 16 and 17:

Median = (16 + 17) / 2 = 16.5
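The mean, midrange and median of the three example data sets can be checked with a short Python sketch. It relies on the standard library `statistics` module; the `midrange` helper and the variable names are assumptions introduced here for illustration.

```python
from statistics import mean, median

def midrange(values):
    """Average of the minimum and maximum values."""
    return (min(values) + max(values)) / 2

maths_marks = [15, 13, 18, 16, 14, 17, 12]              # Example 1
geography_marks = [47, 35, 37, 32, 38, 39, 36, 34, 35]  # Example 2
example3_data = [12, 18, 16, 21, 10, 13, 17, 19]        # Example 3

maths_mean = mean(maths_marks)              # sum 105 divided by 7
maths_midrange = midrange(maths_marks)      # (12 + 18) / 2
geography_median = median(geography_marks)  # odd count: the middle value
example3_median = median(example3_data)     # even count: average of 16 and 17

print(maths_mean, maths_midrange, geography_median, example3_median)
```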

Trimmed mean

A trimmed mean reduces the influence of extreme observations by removing observations from each end of the ordered sample. It is calculated by discarding a certain percentage of the lowest and the highest scores and then computing the mean of the remaining scores.
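A minimal sketch of this calculation in Python follows. The function name, the choice to round the number of discarded values down, and the 15% trimming proportion are all assumptions for illustration; conventions differ between textbooks and software.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from EACH end.

    `proportion` is a fraction in [0, 0.5). The number of values cut from
    each end is rounded down (an assumption; other conventions exist).
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)      # values to drop per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

geography_marks = [47, 35, 37, 32, 38, 39, 36, 34, 35]

plain_mean = sum(geography_marks) / len(geography_marks)   # 333 / 9 = 37.0
trimmed = trimmed_mean(geography_marks, 0.15)              # drops 32 and 47

print(plain_mean, round(trimmed, 2))
```

With nine marks and a 15% proportion, one value is dropped from each end, so the unusually high mark of 47 no longer pulls the average upward.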

Mode: the mode of a set of numbers is the number that occurs most frequently. If two numbers tie for most frequent occurrence, the collection has two modes and is called bimodal.

The mode has applications in printing. For example, it is important to print more of the most popular books, because printing different books in equal numbers would cause a shortage of some books and an oversupply of others.

Likewise, the mode has applications in manufacturing. For example, it is important to manufacture more of the most popular shoes, because manufacturing different shoes in equal numbers would cause a shortage of some shoes and an oversupply of others.

Example 4

Find the mode of the following data set:

48 44 48 45 42 49 48 50

Solution:

The mode is 48, since it occurs most often.

• It is possible for a set of data values to have more than one mode.
• If two data values tie for the highest frequency, we say that the set of data values is bimodal.
• If three data values tie for the highest frequency, we say that the set of data values is trimodal.
• In general, if two or more data values tie for the highest frequency, we say that the set of data values is multimodal.
• If no data value occurs more frequently than the others, we say that the set of data values has no mode.
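The cases above can be sketched in Python with `collections.Counter`. The function name `modes` is hypothetical, and treating "every distinct value occurs equally often" as "no mode" is an assumption that follows the last bullet; some references would call such a set multimodal instead.

```python
from collections import Counter

def modes(values):
    """Return all values tied for the highest frequency, in sorted order.

    Returns an empty list when there are several distinct values and all of
    them occur equally often (treated here as the "no mode" case).
    """
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []                      # no value stands out
    return sorted(v for v, c in counts.items() if c == top)

example4 = [48, 44, 48, 45, 42, 49, 48, 50]

print(modes(example4))          # [48]
print(modes([1, 1, 2, 2, 3]))   # [1, 2]  (bimodal)
print(modes([1, 2, 3]))         # []      (no mode)
```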

The mean, median and mode of a data set are collectively known as measures of
central tendency as these three measures focus on where the data is centered or
clustered. To analyze data using the mean, median and mode, we need to use the most
appropriate measure of central tendency. The following points should be remembered:

• The mean is useful for predicting future results when there are no extreme values in the data set. However, the impact of extreme values on the mean may be important and should be considered, e.g. the impact of a stock market crash on average investment returns.
• The median may be more useful than the mean when there are extreme values in the data set, as it is largely unaffected by them.
• The mode is useful when the most common item, characteristic or value of a data set is required.
Measures of Dispersion

Measures of dispersion describe how spread out a set of data is. The two most commonly used measures of dispersion are the variance and the standard deviation. Rather than showing how data are similar, they show how the data vary, that is, their variation, spread, or dispersion.

Other measures of dispersion that may be encountered include quartiles, the interquartile range (IQR), the five-number summary, the range, and box plots.

Variance and Standard Deviation

Very different sets of numbers can have the same mean. You will now study two measures of dispersion, which give you an idea of how much the numbers in a set differ from the mean of the set. These two measures are called the variance of the set and the standard deviation of the set.
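The text does not give the formulas, so the following Python sketch assumes the usual definitions: the variance is the average squared deviation from the mean, and the standard deviation is its square root. The population form divides by n and the sample form by n - 1. It uses the seven mathematics marks from Example 1, whose mean is 15.

```python
import math
from statistics import pvariance, pstdev, variance, stdev

maths_marks = [15, 13, 18, 16, 14, 17, 12]   # mean is 15

def population_variance(values):
    """Average squared deviation from the mean (divides by n)."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)

manual_var = population_variance(maths_marks)   # 28 / 7 = 4.0
manual_sd = math.sqrt(manual_var)               # 2.0

# The standard library gives the same population values ...
pop_var = pvariance(maths_marks)
pop_sd = pstdev(maths_marks)

# ... and also the sample versions, which divide by n - 1 instead of n.
samp_var = variance(maths_marks)                # 28 / 6
samp_sd = stdev(maths_marks)

print(manual_var, manual_sd, round(samp_var, 3), round(samp_sd, 3))
```

A standard deviation of 2 says that the marks typically lie about 2 marks away from the mean of 15.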

Percentile

Percentiles are values that divide a sample of data into one hundred groups containing (as
far as possible) equal numbers of observations.

The pth percentile of a distribution is the value such that p percent of the observations fall
at or below it.

The most commonly used percentiles other than the median are the 25th percentile and
the 75th percentile.

The 25th percentile demarcates the first quartile, the median or 50th percentile
demarcates the second quartile, the 75th percentile demarcates the third quartile, and the
100th percentile demarcates the fourth quartile.

Quartiles

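Quartiles are numbers that divide an ordered data set into four portions, each containing approximately one-fourth of the data. Twenty-five percent of the data values come before the first quartile (Q1). The median is the second quartile (Q2); 50% of the data values come before the median. Seventy-five percent of the data values come before the third quartile (Q3).

With n the total number of values in the ordered data set, the quartiles sit at the following positions. A fractional position is rounded up to the next whole position, as in the worked examples that follow. This positional rule is a simple approximation; other quartile conventions interpolate between values and can give slightly different results.

Q1 = 25th percentile = (n × 25/100)th value

Q2 = median = 50th percentile = (n × 50/100)th value

Q3 = 75th percentile = (n × 75/100)th value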

Interquartile range (IQR)

The interquartile range is the length of the interval between the lower quartile (Q1) and the upper quartile (Q3). This interval covers the central, or middle, 50% of a data set.

IQR=Q3-Q1
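The positional rule for quartiles can be sketched in Python as follows. The function name `percentile_value` is hypothetical, and rounding a fractional position up to the next whole position is an assumption taken from the worked examples in this text; statistical packages usually offer several other conventions.

```python
import math

def percentile_value(values, p):
    """Value at position ceil(n * p / 100) of the sorted data (1-based)."""
    ordered = sorted(values)
    position = math.ceil(len(ordered) * p / 100)
    position = max(position, 1)          # guard so p = 0 maps to the minimum
    return ordered[position - 1]

geography_marks = [47, 35, 37, 32, 38, 39, 36, 34, 35]   # n = 9

q1 = percentile_value(geography_marks, 25)   # 9 * 0.25 = 2.25 -> 3rd value
q2 = percentile_value(geography_marks, 50)   # 9 * 0.50 = 4.5  -> 5th value
q3 = percentile_value(geography_marks, 75)   # 9 * 0.75 = 6.75 -> 7th value
iqr = q3 - q1

print(q1, q2, q3, iqr)   # 35 36 38 3
```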

Range

The range of a set of data is the difference between its largest (maximum) and smallest (minimum) values. In statistics, the range is reported as a single number, the difference between the maximum and the minimum. Informally, the range is sometimes reported as "from (the minimum) to (the maximum)," i.e., as two numbers.

Example 1:

Given data set: 3, 4, 4, 5, 6, 8

Reported as an interval, the range of the data set is 3–8; as a single number, it is 8 - 3 = 5. The range gives only minimal information about the spread of the data, by defining the two extremes. It says nothing about how the data are distributed between those two endpoints.

Example 2:

In this example we demonstrate how to find the minimum value, maximum value,
and range of the following data: 29, 31, 24, 29, 30, 25

1. Arrange the data from smallest to largest.

24, 25, 29, 29, 30, 31

2. Identify the minimum and maximum values:

Minimum = 24, Maximum = 31

3. Calculate the range:

Range = Maximum-Minimum = 31–24 = 7.

Thus the range is 7.

Five-Number Summary

The Five-Number Summary of a data set is a five-item list comprising the minimum
value, first quartile, median, third quartile, and maximum value of the set.

{MIN, Q1, MEDIAN (Q2), Q3, MAX}

Box plots

A box plot is a graph used to represent the range, median, quartiles and interquartile range of a set of data values.

Constructing a box plot:

(i) Draw a box to represent the middle 50% of the observations of the data set.
(ii) Show the median by drawing a vertical line within the box.
(iii) Draw the lines (called whiskers) from the lower and upper ends of the box to the minimum and maximum values of the data set respectively, as shown in the following diagram.

• X is the set of data values.
• Min X is the minimum value in the data set.
• Max X is the maximum value in the data set.

Example: Draw a boxplot for the following data set of scores:

76 79 76 74 75 71 85 82 82 79 81

Step 1: Arrange the score values in ascending order of magnitude:

71 74 75 76 76 79 79 81 82 82 85

There are 11 values in the data set.

Step 2: Q1 = 25th percentile value in the given data set

Q1 = 11 × (25/100)th value = 2.75th value => 3rd value = 75

Step 3: Q2 = median = 50th percentile value

Q2 = 11 × (50/100)th value = 5.5th value => 6th value = 79

Step 4: Q3 = 75th percentile value

Q3 = 11 × (75/100)th value = 8.25th value => 9th value = 82

Step 5: Min X = 71

Step 6: Max X = 85

Step 7: Range = 85 - 71 = 14

Step 8: IQR = length of the box = Q3 - Q1 = 82 - 75 = 7

Since the quartiles mark the quarter points of the ordered data, they split the data into four approximately equal parts. In other words:

• about one quarter of the data values are less than 75
• about one quarter of the data values are between 75 and 79
• about one quarter of the data values are between 79 and 82
• about one quarter of the data values are greater than 82
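The values needed to draw this box plot can be recomputed with a short Python sketch. The helper names are hypothetical, and the quartile positions use the same round-up rule as the steps above, which is an assumption rather than the only accepted convention.

```python
import math

def nearest_rank(ordered, p):
    """Value at position ceil(n * p / 100) of already-sorted data (1-based)."""
    position = max(math.ceil(len(ordered) * p / 100), 1)
    return ordered[position - 1]

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    return (ordered[0],
            nearest_rank(ordered, 25),
            nearest_rank(ordered, 50),
            nearest_rank(ordered, 75),
            ordered[-1])

scores = [76, 79, 76, 74, 75, 71, 85, 82, 82, 79, 81]

minimum, q1, med, q3, maximum = five_number_summary(scores)
data_range = maximum - minimum   # 85 - 71
iqr = q3 - q1                    # 82 - 75

print(minimum, q1, med, q3, maximum)   # 71 75 79 82 85
print(data_range, iqr)                 # 14 7
```

The box spans Q1 to Q3 (75 to 82), the median line sits at 79, and the whiskers reach 71 and 85.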

Outliers

An outlier is a data value that lies far outside the range of most of the other values. A common rule treats as outliers any points below Q1 – 1.5×IQR or above Q3 + 1.5×IQR.

Example:

Find the outliers, if any, for the following data set:

10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4

To find out if there are any outliers, first find the IQR. There are fifteen data points, so the median is at position 15 × (50/100) = 7.5, which rounds up to the 8th value, 14.6. That is, Q2 = 14.6.

Q1 is at position 15 × (25/100) = 3.75, which rounds up to the fourth value in the list, and Q3 is at position 15 × (75/100) = 11.25, which rounds up to the twelfth: Q1 = 14.4 and Q3 = 14.9.

Then IQR = 14.9 – 14.4 = 0.5.

Outliers will be any points below:

Q1 – 1.5×IQR = 14.4 – 0.75 = 13.65 or above Q3 + 1.5×IQR = 14.9 + 0.75 = 15.65.

Then the outliers are at 10.2, 15.9, and 16.4.

The values for Q1 – 1.5×IQR and Q3 + 1.5×IQR are the "fences" that mark off
the "reasonable" values from the outlier values. Outliers lie outside the
fences.
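The fence calculation can be sketched in Python as below. The function names are hypothetical, the quartiles use the round-up positional rule from this text (an assumption; other quartile conventions can move the fences slightly), and the multiplier 1.5 is the one stated above.

```python
import math

def nearest_rank(ordered, p):
    """Value at position ceil(n * p / 100) of already-sorted data (1-based)."""
    position = max(math.ceil(len(ordered) * p / 100), 1)
    return ordered[position - 1]

def iqr_outliers(values, k=1.5):
    """Return (lower fence, upper fence, list of values outside the fences)."""
    ordered = sorted(values)
    q1 = nearest_rank(ordered, 25)
    q3 = nearest_rank(ordered, 75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in ordered if v < lower or v > upper]
    return lower, upper, outliers

data = [10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7,
        14.7, 14.9, 15.1, 15.9, 16.4]

lower, upper, outliers = iqr_outliers(data)

# Rounded for display because of floating-point arithmetic
print(round(lower, 2), round(upper, 2), outliers)
```

For this data set the fences come out at about 13.65 and 15.65, so 10.2, 15.9 and 16.4 are flagged.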

Graphic Displays of Basic Descriptive Data Summaries

1 Histogram

A histogram is a way of summarizing data that are measured on an interval scale (either discrete or continuous). It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form. It divides up the range of possible values in a data set into classes or groups. For each group, a rectangle is constructed with a base length equal to the range of values in that specific group, and an area proportional to the number of observations falling into that group. This means that the rectangles may be of non-uniform height.

The histogram is only appropriate for variables whose values are numerical and measured on an interval scale. It is generally used when dealing with large data sets (more than 100 observations).

A histogram can also help detect any unusual observations (outliers), or any gaps in the
data set.
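The grouping step behind a histogram can be sketched in Python. The function name, the class boundaries (four classes of width 5 starting at 70) and the text rendering are assumptions for illustration, and the eleven scores are far fewer than a histogram would normally be used for. Because the classes have equal width, each bar's height is simply the class count.

```python
def histogram_counts(values, start, width, n_classes):
    """Count observations in n_classes equal-width classes [lo, lo + width)."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_classes:       # values outside all classes are skipped
            counts[index] += 1
    return counts

scores = [76, 79, 76, 74, 75, 71, 85, 82, 82, 79, 81]

counts = histogram_counts(scores, start=70, width=5, n_classes=4)

# Crude text rendering: one '#' per observation in each class
for i, count in enumerate(counts):
    lo = 70 + 5 * i
    print(f"[{lo}, {lo + 5}): " + "#" * count)
```

The counts come out as 2, 5, 3 and 1, which shows most scores falling between 75 and 80.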

2 Scatter Plot

A scatter plot is a useful summary of a set of bivariate data (two variables), usually
drawn before working out a linear correlation coefficient or fitting a regression line. It
gives a good visual picture of the relationship between the two variables, and aids the
interpretation of the correlation coefficient or regression model.

Each unit contributes one point to the scatter plot, on which points are plotted but not
joined. The resulting pattern indicates the type and strength of the relationship between
the two variables.

[Figure: Positively and Negatively Correlated Data]

A scatter plot will also show up a non-linear relationship between the two variables and
whether or not there exist any outliers in the data.
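The linear correlation coefficient that a scatter plot helps to interpret can be sketched in Python. The text does not define the coefficient, so this assumes the usual Pearson formula; the paired observations below are invented purely for illustration and do not come from the text.

```python
import math

def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical paired data, invented for illustration only
x = [1, 2, 3, 4, 5]
y_rising = [2, 4, 5, 4, 5]      # tends to increase with x
y_falling = [10, 8, 6, 4, 2]    # decreases exactly linearly with x

r_rising = pearson_r(x, y_rising)     # positive, below 1
r_falling = pearson_r(x, y_falling)   # -1 for a perfect decreasing line

print(round(r_rising, 3), round(r_falling, 3))
```

A positive coefficient matches a cloud of points rising from left to right, and a negative coefficient matches a cloud falling from left to right.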

3 Loess curve

A loess curve is another important exploratory graphic aid that adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence. The word loess is short for "local regression."

4 Box plot

The picture produced consists of the most extreme values in the data set (maximum and
minimum values), the lower and upper quartiles, and the median.

5 Quantile plot

• Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
• Plots quantile information
• For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i % of the data are below or equal to the value x_i

For a given fraction f, the f quantile of the data is the data value, denoted q(f), at or below which approximately a fraction f of the data lies. Each data point can be assigned an f-value. Let a time series x of length n be sorted from smallest to largest values, such that the sorted values have ranks i = 1, 2, ..., n. The f-value for each observation is computed as

$$f_i = \frac{i - 0.5}{n}, \quad i = 1, 2, \ldots, n$$
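The points of a quantile plot can be computed with a few lines of Python. This sketch assumes the common convention f_i = (i - 0.5)/n for the f-value of the i-th smallest observation; the function name is hypothetical, and the data are the eight values from Example 3.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs with f_i = (i - 0.5) / n for the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

example3_data = [12, 18, 16, 21, 10, 13, 17, 19]

points = quantile_plot_points(example3_data)
for f, x in points:
    print(f"{f:.4f}  {x}")
```

Plotting x against f gives the quantile plot: the smallest value, 10, is paired with f = 0.0625 and the largest, 21, with f = 0.9375.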

6 Quantile-Quantile plots (Q-Q plot)

Quantile-quantile plots allow us to compare the quantiles of two sets of numbers.

This kind of comparison is much more detailed than a simple comparison of means or medians.

A normal distribution is often a reasonable model for the data. Without inspecting the data, however, it is risky to assume a normal distribution. There are a number of graphs that can be used to check the deviations of the data from the normal distribution. The most useful tool for assessing normality is a normal quantile-quantile (QQ) plot. This is a scatter plot with the quantiles of the scores on the horizontal axis and the expected normal scores on the vertical axis.

In other words, it is a graph that shows the quantiles of one univariate distribution against the corresponding quantiles of another. It is a powerful visualization tool in that it allows the user to view whether there is a shift in going from one distribution to another.

The steps in constructing a QQ plot are as follows:

First, we sort the data from smallest to largest. Next, we calculate the expected normal scores by taking the z-scores of (i - ½)/n, where i is the rank in increasing order and n is the number of observations. A plot of the sorted scores against the expected normal scores should reveal a straight line if the data are approximately normal.

Curvature of the points indicates departures from normality. This plot is also useful for detecting outliers. The outliers appear as points that are far away from the overall pattern of points.
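The expected normal scores can be computed with the Python standard library, as in this minimal sketch. It assumes that "z-score of (i - ½)/n" means the standard normal quantile of that fraction, obtained here with `statistics.NormalDist().inv_cdf`; the function name is hypothetical, and the data are the eleven scores from the box plot example.

```python
from statistics import NormalDist

def normal_qq_points(values):
    """Return (sorted score, expected normal score) pairs for a normal QQ plot."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()       # mean 0, standard deviation 1
    return [(x, standard_normal.inv_cdf((i - 0.5) / n))
            for i, x in enumerate(ordered, start=1)]

scores = [76, 79, 76, 74, 75, 71, 85, 82, 82, 79, 81]

points = normal_qq_points(scores)
for x, z in points:
    print(f"{x}  {z:+.3f}")
```

With 11 scores the sixth (middle) one gets fraction 0.5 and therefore an expected normal score of 0; if the points fall close to a straight line, a normal model is plausible.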

How is a quantile-quantile plot different from a quantile plot?

A quantile plot is a graphical method used to show the approximate percentage of values below or equal to a given value of the independent variable in a univariate distribution. It displays quantile information for all the data, where the values measured for the independent variable are plotted against their corresponding quantile.

A quantile-quantile plot, however, graphs the quantiles of one univariate distribution against the corresponding quantiles of another univariate distribution. Both axes display the range of values measured for their corresponding distribution, and points are plotted that correspond to the quantile values of the two distributions. A line (y = x) can be added to the graph, along with points representing where the first, second and third quartiles lie, in order to increase the graph's informational value. Points that lie above such a line indicate a correspondingly higher value for the distribution plotted on the y-axis than for the distribution plotted on the x-axis at the same quantile. The opposite is true for points lying below this line.

Data Cleaning

Data cleaning routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.
