DM Lec2 Getting To Know Your Data

This chapter discusses getting to know data by describing data objects, attributes, and basic statistical descriptions. It covers: - Data objects represent entities and are described by attributes, like the columns in a database. - Attribute types include nominal, binary and ordinal (qualitative) and numeric (quantitative). - Statistical descriptions measure central tendency (mean, median, mode) and dispersion (variance, standard deviation, quartiles), and help identify outliers. - These descriptions provide a better understanding of the data distribution and its properties.

Uploaded by

JAMEEL AHMAD

Data Mining:

Concepts and Techniques

— Chapter 2 —

Chapter 2: Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

Data Objects

 Data sets are made up of data objects.


 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples, examples, instances, data points, objects, tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.
EXAMPLE

      Name   Age   Location   Gender   Marital status
s1    xyz    25    DHA        M        U
s2    abc    34    JT         F        M
s3    qwe    29    JT         M        M
…     …      …     …          …        …
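The row/column view can be sketched in code. This is a minimal illustration, assuming a hypothetical `Person` record type whose field names mirror the column headers of the example table; it is not part of the original slides.

```python
from dataclasses import dataclass, fields


@dataclass
class Person:
    """One data object (a row); each field is an attribute (a column)."""
    name: str
    age: int
    location: str
    gender: str
    marital_status: str


# The three rows of the example table (s1, s2, s3).
rows = [
    Person("xyz", 25, "DHA", "M", "U"),
    Person("abc", 34, "JT", "F", "M"),
    Person("qwe", 29, "JT", "M", "M"),
]

# Attributes are the columns: the field names of the record type.
attribute_names = [f.name for f in fields(Person)]

# Reading one attribute across all data objects gives one column.
ages = [p.age for p in rows]
```

Each `Person` instance plays the role of a tuple in a database table, and `ages` is the kind of single-attribute column that the statistical descriptions later in the chapter operate on.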
Attributes
 Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
 E.g., customer_ID, name, address
 Types:
 Nominal
 Binary
 Ordinal
 Numeric (quantitative)
 Interval-scaled
 Ratio-scaled
Qualitative Attribute Types
 Nominal: categories, states, or “names of things”
 No meaningful order (enumerations)

 Hair_color = {auburn, black, blond, brown, grey, red, white}

 marital status, occupation, ID numbers, zip codes

 Possible to represent with numbers (hair color: black = 0, brown = 1, and so on)
 May have integer values but is not considered numeric (e.g., an ID number vs. age)
 Mean and median are not meaningful; the mode can be useful

 Binary
 Nominal attribute with only 2 states (0 and 1)

 Symmetric binary: both outcomes equally important

 e.g., gender
 Asymmetric binary: outcomes not equally important.

 e.g., medical test (positive vs. negative)


 Convention: assign 1 to the most important outcome (e.g., HIV positive)
Qualitative Attribute Types
 Ordinal
 Values have a meaningful order (ranking) but magnitude between
successive values is not known.
 Size = {small, medium, large}, grades, army rankings,
professional ranks
 Ordinal attributes are used in surveys for ratings.
 Can be obtained from the discretization of numeric quantities
 Mode and Median can be used

 Nominal, binary and ordinal are qualitative attributes.
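As a rough illustration of which summaries make sense for qualitative attributes, the sketch below takes the mode of a nominal attribute and the median of an ordinal one. The sample values are made up for illustration, and mapping ordinal labels to rank positions is one possible approach, not something prescribed by the slides.

```python
from statistics import median_low, mode

# Nominal attribute: only the mode is meaningful.
hair_color = ["black", "brown", "black", "red", "blond", "black"]
most_common_color = mode(hair_color)

# Ordinal attribute: values have an order, so a median can be found
# by ranking the labels. The order itself is supplied by us.
size_order = ["small", "medium", "large"]
sizes = ["large", "small", "medium", "medium", "large"]

# Replace each label by its rank, take the (low) median rank,
# then translate the rank back into a label.
ranks = sorted(size_order.index(s) for s in sizes)
median_size = size_order[median_low(ranks)]
```

`median_low` is used so that the result is always one of the observed ranks; averaging two ranks would assume that the gap between successive ordinal values is known, which it is not.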

Numeric Attribute Types
 Quantity (integer or real-valued)
 Interval
 Measured on a scale of equal-sized units
 Values have order
 E.g., temperature in °C or °F, calendar dates
 No true zero-point (0 °C does not indicate "no temperature")
 Cannot speak of values in terms of ratios
 Mean, median and mode all valid

Numeric Attribute Types
 Quantity (integer or real-valued)
 Ratio
 Inherent zero-point
 Values can be represented in ratios of each other
 We can speak of values as being a multiple (or an order of magnitude larger) of the unit of measurement (10 K is twice as high as 5 K).
 e.g., temperature in Kelvin, length, counts, monetary quantities, number of words, years of experience
 0 K is a true zero-point (classically, particles have zero kinetic energy)
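The difference between interval and ratio scales can be checked numerically. The sketch below is an illustration only; the helper `celsius_to_kelvin` is a name introduced here, using the standard offset of 273.15.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert an interval-scaled Celsius value to ratio-scaled Kelvin."""
    return celsius + 273.15


# On the Celsius (interval) scale the quotient is computable but not
# meaningful: 10 °C is not "twice as hot" as 5 °C.
naive_ratio = 10 / 5

# On the Kelvin (ratio) scale the same two temperatures differ by
# under 2 percent.
kelvin_ratio = celsius_to_kelvin(10) / celsius_to_kelvin(5)
```

Differences, by contrast, are meaningful on both scales: the two temperatures are 5 units apart whether measured in °C or in K.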

Discrete vs. Continuous Attributes
 Discrete Attribute
 Has only a finite or countably infinite set of values
 E.g., zip codes, profession, or the set of words in a collection of documents
 Sometimes represented as integer variables
 Note: Binary attributes are a special case of discrete attributes
 Continuous Attribute
 Has real numbers as attribute values
 E.g., temperature, height, or weight
 Practically, real values can only be measured and represented using a finite number of digits
 Continuous attributes are typically represented as floating-point variables
Basic Statistical Descriptions of Data
 Motivation
 To better understand the data: central tendency,
variation and spread
 Data dispersion characteristics
 median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
 Data dispersion: analyzed with multiple granularities
of precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
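 Mean (algebraic measure) (sample vs. population):

\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N} \]

Note: n is sample size and N is population size.

 Weighted arithmetic mean (each weight w_i reflects the significance of x_i):

\[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]

 The mean is not always the best measure of central tendency (it is sensitive to extreme values)
 Trimmed mean: the mean computed after chopping off extreme values (e.g., the top and bottom 2%)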
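The three variants of the mean can be sketched as follows. The data values, the weights, and the helper names `weighted_mean` and `trimmed_mean` are all illustrative assumptions; in particular, trimming the same proportion from both ends is just one common reading of "chopping extreme values".

```python
from statistics import mean


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # values to drop per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical data with one extreme value (110).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

plain = mean(values)                  # pulled upward by 110
trimmed = trimmed_mean(values, 0.10)  # drops 30 and 110

# Hypothetical scores where the first one counts most.
weighted = weighted_mean([80, 90, 70], [5, 3, 2])
```

With these numbers the plain mean is 58, the trimmed mean is 55.6, and the weighted mean is 81, which shows how a single extreme value shifts the untrimmed result.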
Measuring the Central Tendency
 Median:
 Good for skewed (asymmetric) data
 Middle value if odd number of values, or average of the middle two values otherwise
 Concept can be extended to ordinal data (odd number of values: the middle value; even number of values: the median is not unique)
 Expensive to compute when the data set is huge
 Estimated by interpolation (for grouped data):

\[ \text{median} = L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width} \]

where
 L_1: lower boundary of the median interval
 n: number of values
 (\sum freq)_l: sum of the frequencies of all the intervals that are lower than the median interval
 freq_median: frequency of the median interval
 width: width of the median interval
Median Estimation
\[ \text{median} = L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width} \]

age       Freq    Cum Freq
1-5        200       200
6-15       450       650
16-20      300       950
21-50     1500      2450
51-80      700      3150
81-110      44      3194
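The interpolation can be worked through for the age table. This is a sketch under two stated assumptions: the lower boundary L1 is taken to be the interval's printed lower limit (21), and the width counts the integer ages in the interval (21 to 50 gives 30). Other conventions, such as using 20.5 as the boundary, shift the result slightly. The function name `grouped_median` is introduced here.

```python
# (lower limit, upper limit, frequency) for each age interval
intervals = [
    (1, 5, 200),
    (6, 15, 450),
    (16, 20, 300),
    (21, 50, 1500),
    (51, 80, 700),
    (81, 110, 44),
]


def grouped_median(intervals):
    """Approximate the median of grouped data by interpolation."""
    n = sum(freq for _, _, freq in intervals)
    cumulative = 0  # sum of frequencies below the current interval
    for lower, upper, freq in intervals:
        if cumulative + freq >= n / 2:
            # This is the median interval.
            width = upper - lower + 1  # assumption: integer-valued ages
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq
    raise ValueError("no intervals given")


approx_median = grouped_median(intervals)
```

Here n = 3194, so n/2 = 1597 falls in the 21-50 interval (cumulative frequency 950 below it), giving 21 + (1597 - 950) / 1500 × 30 ≈ 33.94 under the assumptions above.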

Measuring the Central Tendency

 Mode
 Value that occurs most frequently in the data
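 Applies to both qualitative and quantitative attributes
 The greatest frequency may be shared by several values
 Unimodal, bimodal, trimodal and multimodal
 At the other extreme, if each value occurs only once, there is no mode
 Empirical formula (for moderately skewed unimodal data):

\[ \text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median}) \]
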

 Midrange
 Central tendency for numeric dataset
 Average of largest and smallest value
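Mode and midrange are simple to compute; the sketch below uses the twelve values from the quartile examples in this deck purely as sample input. `multimode` from the standard library returns every value that shares the greatest frequency, which matches the unimodal/bimodal distinction above.

```python
from statistics import multimode

values = [12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10]

# All values with the greatest frequency; two results means bimodal.
modes = sorted(multimode(values))

# Midrange: average of the largest and smallest value.
midrange = (max(values) + min(values)) / 2
```

For this list both 8 and 9 occur three times, so the data are bimodal, and the midrange is (12 + 4) / 2 = 8.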

Symmetric vs. Skewed Data
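 Median, mean and mode of symmetric, positively skewed and negatively skewed data

[Figure: distribution curves showing the positions of the mean, median and mode for symmetric, positively skewed and negatively skewed data]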

March 12, 2022 Data Mining: Concepts and Techniques 17


Measuring the Dispersion of Data
 Quartiles, outliers and boxplots
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
 Five number summary: min, Q1, median, Q3, max
 Boxplot: ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
 Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
 Variance and standard deviation (sample: s, population: σ)
 Variance: (algebraic, scalable computation)
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n}x_i\Big)^2\right] \]

\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2 \]

 Standard deviation s (or σ) is the square root of variance s² (or σ²)
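The second form of each variance formula is what makes the computation scalable: it needs only the count, the sum, and the sum of squares, all of which can be accumulated in a single pass. The sketch below checks that form against the standard library on made-up data; the function name `one_pass_variances` is introduced here. Note that with floating-point values this shortcut can lose precision when the mean is large relative to the spread.

```python
from statistics import pvariance, variance


def one_pass_variances(values):
    """Return (sample variance, population variance) from running sums."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:  # single pass over the data
        n += 1
        total += x
        total_sq += x * x
    sample_var = (total_sq - total * total / n) / (n - 1)
    population_var = total_sq / n - (total / n) ** 2
    return sample_var, population_var


data = [4, 8, 6, 5, 3, 7]  # hypothetical values
s2, sigma2 = one_pass_variances(data)
```

For this list the sums are n = 6, Σx = 33 and Σx² = 199, so s² = 3.5 and σ² = 17.5 / 6, in agreement with `statistics.variance` and `statistics.pvariance`.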

Quantiles
 Suppose data is sorted in ascending order
 Quantiles are data points that split the data distribution into equal-size consecutive sets
 First Quartile (Q1) = ((n + 1)/4)th term
 Second Quartile (Q2) = ((n + 1)/2)th term
 Third Quartile (Q3) = (3(n + 1)/4)th term

 Inter-quartile range:
IQR = Q3 – Q1

Finding the median, quartiles and inter-quartile range.
Example 1: Find the median and quartiles for the data below.
12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10
Order the data:
4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12

Lower Quartile (Q1) = 5½
Median (Q2) = 8
Upper Quartile (Q3) = 9

Inter-Quartile Range = 9 - 5½ = 3½
Finding the median, quartiles and inter-quartile range.

Example 2: Find the median and quartiles for the data below.
6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10
Order the data:
3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15

Lower Quartile (Q1) = 4
Median (Q2) = 8
Upper Quartile (Q3) = 10

Inter-Quartile Range = 10 - 4 = 6
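Both worked examples can be reproduced in code. The answers in the examples are consistent with taking Q1 and Q3 as the medians of the lower and upper halves of the sorted data (leaving out the middle value when n is odd), so the sketch below assumes that convention. It is one of several in use: interpolating linearly at the ((n + 1)/4)th position would give 5.25 rather than 5½ for Example 1. The function name `quartiles` is introduced here.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, the middle value belongs to neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


example_1 = [12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10]
example_2 = [6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10]

q1_a, q2_a, q3_a = quartiles(example_1)
q1_b, q2_b, q3_b = quartiles(example_2)

iqr_a = q3_a - q1_a
iqr_b = q3_b - q1_b
```

The results are (5.5, 8, 9) with IQR 3.5 for Example 1 and (4, 8, 10) with IQR 6 for Example 2, matching the values worked out by hand.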
Five-Number Summary
 A five-number summary consists of five values: the most extreme values in the data set (the maximum and minimum values), the lower and upper quartiles, and the median.
 These values are presented together and ordered from lowest to highest:
 minimum value
 lower quartile (Q1)
 median value (Q2)
 upper quartile (Q3)
 maximum value
BoxPlots
 Ends of the box are the Quartiles so box length is
interquartile range
 The median is marked by a line within box
 Two lines (whiskers) outside the box extend to
the smallest and largest data point
 Outliers: points beyond a specified outlier
threshold, plotted individually
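The numbers behind a box plot can be computed without drawing anything. The sketch below is illustrative: it assumes the median-of-halves quartile convention and the common 1.5 × IQR outlier threshold, and it appends a made-up value of 30 to the data of Example 2 so that one point is actually flagged. The function name `boxplot_stats` is introduced here.

```python
from statistics import median


def boxplot_stats(values, k=1.5):
    """Five-number summary, whisker ends, and outliers for a box plot."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q2 = median(ordered)
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    summary = (ordered[0], q1, q2, q3, ordered[-1])
    # Whiskers stop at the most extreme points that are not outliers.
    whiskers = (inside[0], inside[-1])
    return summary, whiskers, outliers


# Example 2 data plus one hypothetical extreme value (30).
data = [6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10, 30]
summary, whiskers, outliers = boxplot_stats(data)
```

Here the summary is (3, 5, 8, 10, 30), the fences are -2.5 and 17.5, the whiskers end at 3 and 15, and 30 is plotted individually as an outlier. When no threshold is applied, the whiskers simply extend to the minimum and maximum.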

Box and Whisker Diagrams.

Box plots are useful for comparing two or more sets of data like
that shown below for heights of boys and girls in a class.

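Anatomy of a Box and Whisker Diagram.

[Figure: a box and whisker diagram on a scale from 4 to 12, labelled from left to right: lowest value, whisker, lower quartile, median, upper quartile (together forming the box), whisker, highest value]

[Figure: box plots of the heights of boys and girls on a common scale from 130 cm to 190 cm]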
Box Plots
Drawing a Box Plot.
Example 1: Draw a Box plot for the data below

4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12

Lower Quartile (Q1) = 5½
Median (Q2) = 8
Upper Quartile (Q3) = 9

[Figure: box plot drawn on a scale from 4 to 12]
Drawing a Box Plot.
Example 2: Draw a Box plot for the data below

3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15

Lower Quartile (Q1) = 4
Median (Q2) = 8
Upper Quartile (Q3) = 10

[Figure: box plot drawn on a scale from 3 to 15]
Drawing a Box Plot.
Question: Stuart recorded the heights in cm of boys in his
class as shown below. Draw a box plot for this data.
137, 148, 155, 158, 165, 166, 166, 171, 171, 173, 175, 180, 184, 186, 186

Lower Quartile (QL) = 158
Median (Q2) = 171
Upper Quartile (QU) = 180

[Figure: box plot drawn on a scale from 130 cm to 190 cm]

Drawing a Box Plot.
Question: Gemma recorded the heights in cm of girls in the same class and
constructed a box plot from the data. The box plots for both boys and girls
are shown below. Use the box plots to choose some correct statements
comparing heights of boys and girls in the class. Justify your answers.

[Figure: box plots of the boys' and girls' heights on a common scale from 130 cm to 190 cm]

1. The girls are taller on average.
2. The boys are taller on average.
3. The girls show less variability in height.
4. The boys show less variability in height.
5. The smallest person is a girl.
6. The tallest person is a boy.
Boxplot Analysis
 Five-number summary of a distribution
 Minimum, Q1, Median, Q3, Maximum
 Boxplot
 Data is represented with a box
 The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
 The median is marked by a line within the
box
 Whiskers: two lines outside the box extended
to Minimum and Maximum

Worksheet 1

Finding the median, quartiles and inter-quartile range.

Example 1: Find the median and quartiles for the data below.
12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10

Example 2: Find the median and quartiles for the data below.
6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10

Worksheet 2

Box Plots

[Blank number lines for drawing box plots: (1) 4 to 12; (2) 3 to 15; (3) 130 cm to 190 cm; (4) 0 to 60]
Variance and Standard Deviation
 Data Set 1: 3, 5, 7, 10, 10
Data Set 2: 7, 7, 7, 7, 7

Data Set 1: mean = 7, median = 7
Data Set 2: mean = 7, median = 7

But we know that the two data sets are not identical! The variance shows how they differ.

Variance and Standard Deviation
 Measures of data dispersion
 A low SD means the observations tend to be very close to the mean
 A high SD indicates that the data are spread out over a large range of values
 Variance is the square of SD

\[ \sigma^2 = \frac{\sum (x - \bar{X})^2}{N} \]

where \bar{X} is the mean of the N values.
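Applying the population formula to the two data sets from the previous slide makes the difference concrete. This is a small sketch using Python's standard `statistics` module; treating each data set as a whole population (dividing by N rather than n - 1) is an assumption made to match the formula above.

```python
from statistics import mean, median, pstdev, pvariance

data_set_1 = [3, 5, 7, 10, 10]
data_set_2 = [7, 7, 7, 7, 7]

# Identical measures of central tendency...
centers_1 = (mean(data_set_1), median(data_set_1))
centers_2 = (mean(data_set_2), median(data_set_2))

# ...but very different dispersion.
var_1, sd_1 = pvariance(data_set_1), pstdev(data_set_1)
var_2, sd_2 = pvariance(data_set_2), pstdev(data_set_2)
```

Both data sets have mean 7 and median 7, yet Data Set 1 has a population variance of 38 / 5 = 7.6 (standard deviation about 2.76), while Data Set 2 has a variance of 0.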
