DM Lec2 Getting To Know Your Data
— Chapter 2 —
Chapter 2: Getting to Know Your Data
Data Objects
s1 xyz 25 DHA M U
s2 abc 34 JT F M
s3 qwe 29 JT M M
…. … …. …. …. ….
Attributes
Attribute (or dimensions, features, variables):
a data field, representing a characteristic or feature
of a data object.
E.g., customer_ID, name, address
Types:
Nominal
Binary
Ordinal
Numeric: quantitative
Interval-scaled
Ratio-scaled
Qualitative Attribute Types
Nominal: categories, states, or “names of things”
No meaningful order (enumerations)
1 and so on)
May have integer values but is not considered numeric (e.g., an ID vs. an age)
Binary
Nominal attribute with only 2 states (0 and 1)
e.g., gender
Asymmetric binary: outcomes not equally important.
Qualitative Attribute Types
Ordinal
Values have a meaningful order (ranking) but magnitude between
successive values is not known.
Size = {small, medium, large}, grades, army rankings,
professional ranks
Ordinal attributes are used in surveys for ratings.
Can be obtained from the discretization of numeric quantities
Mode and Median can be used
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in °C or °F, calendar dates
No true zero-point (0 °C does not indicate no
temperature)
Cannot speak of values in terms of ratios
Mean, median and mode all valid
Numeric Attribute Types
Quantity (integer or real-valued)
Ratio
Inherent zero-point
Values can be represented in ratios of each other
We can speak of a value as being a multiple (or
ratio) of another value
(10 K is twice as high as 5 K).
e.g., temperature in Kelvin, length, counts,
monetary quantities, no. of words, years of
experience
0 K means particles have zero kinetic energy
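To see why ratios are only meaningful on a ratio scale, the short sketch below converts two Celsius readings to Kelvin before dividing them. The helper name `celsius_to_kelvin` is illustrative and not taken from the slides.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert a Celsius reading (interval scale) to Kelvin (ratio scale)."""
    return celsius + 273.15

# Dividing raw Celsius values suggests 10 °C is "twice" 5 °C,
# but Celsius has no true zero, so this ratio is not meaningful.
naive_ratio = 10 / 5

# Kelvin has an inherent zero-point, so this ratio is meaningful.
true_ratio = celsius_to_kelvin(10) / celsius_to_kelvin(5)

print(naive_ratio)           # 2.0
print(round(true_ratio, 3))  # 1.018
```

On the Kelvin scale the warmer reading is only about 1.8% higher, not twice as high.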
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
collection of documents
Sometimes, represented as integer variables
attributes
Continuous Attribute
Has real numbers as attribute values
Typically represented as floating-point variables
Basic Statistical Descriptions of Data
Motivation
To better understand the data: central tendency,
variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities
of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):
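$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$
where n is the sample size and N is the population size.
Weighted arithmetic mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$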
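As a minimal sketch, the arithmetic mean and a weighted variant, in which each value x_i carries a weight w_i, can be computed as follows. The values and weights are made up for illustration.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

values = [3, 5, 7, 10, 10]   # illustrative data
weights = [1, 1, 1, 2, 2]    # hypothetical weights

print(mean(values))                             # 7.0
print(round(weighted_mean(values, weights), 3)) # 7.857
```

Giving the two largest values double weight pulls the weighted mean above the plain mean.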
Measuring the Central Tendency
Median:
Good for skewed (asymmetric) data
Middle value if odd number of values, or average of
the middle two values otherwise
Concept can be extended to ordinal data (odd count:
the middle value; even count: the median is not unique)
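Expensive to compute when the data set is huge
Estimated by interpolation (for grouped data):
$$\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$
where
L_1 = lower boundary of the median interval
n = number of values
(\sum freq)_l = sum of the frequencies of all the intervals that are lower than the median interval
freq_median = frequency of the median interval
width = width of the median interval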
Median Estimation
$$\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$
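The interpolation can be sketched in a few lines of Python. The function name `grouped_median`, the equal-width assumption, and the frequency table are all hypothetical; they are not data from the lecture.

```python
def grouped_median(lower_bounds, width, freqs):
    """Estimate the median of grouped data by linear interpolation.

    lower_bounds[i] is the lower boundary of interval i, every interval
    has the same width (an assumption of this sketch), and freqs[i] is
    the number of values falling in interval i.
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("frequency table is empty")
    cumulative = 0  # sum of frequencies of intervals below the current one
    for lower, freq in zip(lower_bounds, freqs):
        if cumulative + freq >= n / 2:
            # This is the median interval: apply the interpolation formula.
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq

# Hypothetical table: intervals 0-10, 10-20, 20-30, 30-40
lower_bounds = [0, 10, 20, 30]
freqs = [5, 15, 20, 10]

print(grouped_median(lower_bounds, 10, freqs))  # 22.5
```

Here n = 50, so n/2 = 25 falls in the 20-30 interval; 20 values lie below it, giving 20 + (25 - 20)/20 × 10 = 22.5.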
Measuring the Central Tendency
Mode
Value that occurs most frequently in the data
For qualitative and quantitative
Greatest frequency can be of several values
Unimodal, bimodal, trimodal and multimodal
Another extreme is when each value occurs only once: no mode
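Empirical formula (for unimodal, moderately skewed data):
$$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$$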
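A minimal sketch of finding the mode or modes of a data set, including the case where every value occurs once. The function name `modes` and the sample lists are illustrative.

```python
from collections import Counter

def modes(values):
    """Return every value that has the greatest frequency.

    Returns an empty list when there is no data, or when each of several
    distinct values occurs exactly once (no mode).
    """
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []
    return [value for value, count in counts.items() if count == top]

print(modes([12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10]))  # [9, 8] (bimodal)
print(modes([7, 7, 7, 7, 7]))                         # [7] (unimodal)
print(modes([1, 2, 3]))                               # [] (no mode)
```

The same function works for qualitative data, since it only counts occurrences and never does arithmetic on the values.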
Symmetric vs. Skewed Data
Median, mean and mode of symmetric,
positively skewed and negatively skewed data
Quantiles
Suppose data is sorted in ascending order
Quantiles are data points that split the data
distribution into equal-size consecutive sets
First Quartile (Q1) = ((n + 1)/4)th term
Second Quartile (Q2) = ((n + 1)/2)th term
Third Quartile (Q3) = (3(n + 1)/4)th term
Inter-quartile range:
IQR = Q3 – Q1
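Python's standard `statistics.quantiles` function with `method="exclusive"` places the quartiles at positions based on n + 1 and interpolates between neighbouring values when a position is fractional. The sketch below applies it to an 11-value data set, for which all three positions (3rd, 6th and 9th term) are whole numbers.

```python
from statistics import quantiles

data = [6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10]  # 11 values, sorted internally

# n=4 asks for quartiles; "exclusive" uses positions i * (len(data) + 1) / 4
q1, q2, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q2, q3)  # 4.0 8.0 10.0
print(iqr)         # 6.0
```

Other quartile conventions exist and can give slightly different values on the same data, so it is worth stating which one is in use.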
Finding the median, quartiles and inter-quartile range.
Example 1: Find the median and quartiles for the data below.
12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10
Order the data:
4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12
Lower Quartile (Q1) = 5½
Median (Q2) = 8
Upper Quartile (Q3) = 9
Inter-Quartile Range = 9 - 5½ = 3½
Finding the median, quartiles and inter-quartile range.
Example 2: Find the median and quartiles for the data below.
6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10
Order the data:
3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15
Lower Quartile (Q1) = 4
Median (Q2) = 8
Upper Quartile (Q3) = 10
Inter-Quartile Range = 10 - 4 = 6
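The results of Examples 1 and 2 are consistent with taking the median of the lower half and of the upper half of the ordered data, leaving the overall median out of both halves when the count is odd. The sketch below assumes that convention; the function name `quartiles` is illustrative.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention.

    Needs at least two values, otherwise the halves are empty.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For an odd count, skip the middle value itself.
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

example1 = [12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10]
example2 = [6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10]

q1, q2, q3 = quartiles(example1)
print(q1, q2, q3, q3 - q1)  # 5.5 8.0 9.0 3.5

q1, q2, q3 = quartiles(example2)
print(q1, q2, q3, q3 - q1)  # 4 8 10 6
```

For Example 1 the ((n + 1)/4)th term is position 3.25, between 5 and 6, so interpolating at that position would give a Q1 slightly different from 5½. The two conventions agree on Example 2.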
Five-Number Summary
A summary consists of five values: the most
extreme values in the data set (the
maximum and minimum values), the lower
and upper quartiles, and the median.
These values are presented together and ordered
from lowest to highest:
minimum value
lower quartile (Q1)
median value (Q2)
upper quartile (Q3)
maximum value
Box and Whisker Diagrams.
Box plots are useful for comparing two or more sets of data like
that shown below for heights of boys and girls in a class.
[Figure: box plots of the heights of boys and girls on an axis from 130 to 190 cm]
Box Plots
Drawing a Box Plot.
Example 1: Draw a Box plot for the data below
4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12
Lower Quartile (Q1) = 5½
Median (Q2) = 8
Upper Quartile (Q3) = 9
Drawing a Box Plot.
Example 2: Draw a Box plot for the data below
3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15
Lower Quartile (Q1) = 4
Median (Q2) = 8
Upper Quartile (Q3) = 10
[Figure: box plot drawn on an axis from 3 to 15]
Drawing a Box Plot.
Question: Stuart recorded the heights in cm of boys in his
class as shown below. Draw a box plot for this data.
137, 148, 155, 158, 165, 166, 166, 171, 171, 173, 175, 180, 184, 186, 186
Lower Quartile (QL) = 158
Median (Q2) = 171
Upper Quartile (QU) = 180
[Figure: box plots labelled Boys and Girls]
1. The girls are taller on average.
2. The boys are taller on average.
3. The girls show less variability in height.
4. The boys show less variability in height.
5. The smallest person is a girl.
6. The tallest person is a boy.
Boxplot Analysis
Five-number summary of a distribution
Minimum, Q1, Median, Q3, Maximum
Boxplot
Data is represented with a box
The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
The median is marked by a line within the
box
Whiskers: two lines outside the box extended
to Minimum and Maximum
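The five values a box plot needs can be computed with a short helper. This is a sketch only: the function name `five_number_summary` is illustrative, and the quartiles use the default convention of `statistics.quantiles` (positions based on n + 1), which is an assumption. The data are the boys' heights from the question above.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # Assumption: default "exclusive" quartile convention of statistics.quantiles
    q1, q2, q3 = quantiles(data, n=4)
    return data[0], q1, q2, q3, data[-1]

heights = [137, 148, 155, 158, 165, 166, 166, 171,
           171, 173, 175, 180, 184, 186, 186]

minimum, q1, med, q3, maximum = five_number_summary(heights)
print(minimum, q1, med, q3, maximum)  # 137 158.0 171.0 180.0 186
print("box height (IQR):", q3 - q1)   # box height (IQR): 22.0
```

The box spans 158 to 180 with the median line at 171, and the whiskers extend to 137 and 186.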
Variance and Standard Deviation
Data Set 1: 3, 5, 7, 10, 10
Data Set 2: 7, 7, 7, 7, 7
Variance and Standard Deviation
Measures of data dispersion
A low SD means data observations tend to be
very close to mean
A high SD indicates that the data are spread out over
a large range of values
Variance is the square of SD
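$$\sigma^2 = \frac{\sum (x - \bar{X})^2}{N}$$
where \bar{X} is the mean of the data and N is the number of values.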
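As a small worked sketch, the population variance and standard deviation of Data Set 1 and Data Set 2 can be computed directly from the definition. The function names are illustrative.

```python
import math

def population_variance(values):
    """Mean of the squared deviations from the mean (divides by N)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def population_sd(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(population_variance(values))

data_set_1 = [3, 5, 7, 10, 10]
data_set_2 = [7, 7, 7, 7, 7]

print(round(population_variance(data_set_1), 3))  # 7.6
print(round(population_sd(data_set_1), 3))        # 2.757
print(population_variance(data_set_2))            # 0.0
print(population_sd(data_set_2))                  # 0.0
```

Both data sets have a mean of 7, yet only the first has any spread, which is exactly what the variance and SD capture.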