DM Lec2 Getting To Know Your Data

This chapter introduces data objects, attributes, and basic statistical descriptions of data. It covers:
- Data objects represent entities and are described by attributes, like the rows and columns of a database table.
- Attribute types include nominal, binary and ordinal (qualitative) and numeric (quantitative).
- Statistical descriptions measure central tendency (mean, median, mode) and dispersion (variance, standard deviation, quartiles), and help identify outliers.
- These descriptions give a better understanding of the distribution and properties of the data.

Uploaded by

JAMEEL AHMAD


Data Mining: Concepts and Techniques

Chapter 2

Chapter 2: Getting to Know Your Data

- Data Objects and Attribute Types
- Basic Statistical Descriptions of Data

Data Objects

- Data sets are made up of data objects.
- A data object represents an entity.
- Examples:
  - sales database: customers, store items, sales
  - medical database: patients, treatments
  - university database: students, professors, courses
- Also called samples, examples, instances, data points, objects, or tuples.
- Data objects are described by attributes.
- Database rows -> data objects; columns -> attributes.

EXAMPLE

Object  Name  Age  Location  Gender  Marital status
s1      xyz   25   DHA       M       U
s2      abc   34   JT        F       M
s3      qwe   29   JT        M       M
...     ...   ...  ...       ...     ...

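The row/column view of the example table can be sketched in code. This is only an illustration: the dictionary keys and the variable names `people`, `ages` and `attribute_names` are assumptions chosen here, not part of the lecture.

```python
# Each dict is one data object (a table row); each key is an attribute (a column).
# Key names are illustrative assumptions based on the example table.
people = [
    {"name": "xyz", "age": 25, "location": "DHA", "gender": "M", "marital_status": "U"},
    {"name": "abc", "age": 34, "location": "JT", "gender": "F", "marital_status": "M"},
    {"name": "qwe", "age": 29, "location": "JT", "gender": "M", "marital_status": "M"},
]

# Reading one attribute across all objects gives one column of the table.
ages = [person["age"] for person in people]

# The attribute names are the keys shared by every data object.
attribute_names = list(people[0].keys())

print(attribute_names)
print(ages)
```

Each element of `people` corresponds to one of s1, s2, s3, and each key corresponds to one column heading.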
Attributes
- Attribute (also called dimension, feature, or variable): a data field representing a characteristic or feature of a data object.
  - E.g., customer_ID, name, address
- Types:
  - Nominal
  - Binary
  - Ordinal
  - Numeric (quantitative)
    - Interval-scaled
    - Ratio-scaled

Qualitative Attribute Types
- Nominal: categories, states, or "names of things"
  - No meaningful order (enumerations)
  - Hair_color = {auburn, black, blond, brown, grey, red, white}
  - Other examples: marital status, occupation, ID numbers, zip codes
  - Can be represented with numbers (e.g., hair color: black = 0, brown = 1, and so on)
  - May have integer values but is still not considered numeric (compare an ID number with an age)
  - Mean and median are not meaningful; the mode can be useful
- Binary
  - Nominal attribute with only 2 states (0 and 1)
  - Symmetric binary: both outcomes equally important
    - e.g., gender
  - Asymmetric binary: outcomes not equally important
    - e.g., medical test (positive vs. negative)
    - Convention: assign 1 to the most important outcome (e.g., HIV positive)

Qualitative Attribute Types
- Ordinal
  - Values have a meaningful order (ranking), but the magnitude between successive values is not known.
  - Size = {small, medium, large}, grades, army rankings, professional ranks
  - Ordinal attributes are often used in surveys for ratings.
  - Can be obtained by discretizing numeric quantities
  - Mode and median can be used
- Nominal, binary and ordinal attributes are qualitative.

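A minimal sketch of which summary statistics apply to qualitative attributes: the mode for a nominal attribute, and the median (taken over ranks) for an ordinal attribute. The sample values and the names `hair_colors`, `sizes` and `size_order` are hypothetical, made up for illustration.

```python
from statistics import median_low, mode

# Hypothetical nominal values: there is no order, so only the mode is meaningful.
hair_colors = ["black", "brown", "black", "red", "brown", "black"]
most_common_color = mode(hair_colors)

# Hypothetical ordinal values: the labels have a rank order but no known spacing.
size_order = ["small", "medium", "large"]
sizes = ["small", "large", "medium", "medium", "large"]

# Map each label to its rank, take the median rank, and map back to a label.
# median_low always returns an actual rank, so it can be used as an index.
ranks = sorted(size_order.index(s) for s in sizes)
median_size = size_order[median_low(ranks)]

print(most_common_color, median_size)
```

Using `median_low` avoids averaging two ranks, which would not correspond to any label when the number of values is even.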
Numeric Attribute Types
- Quantity (integer or real-valued)
- Interval
  - Measured on a scale of equal-sized units
  - Values have order
  - E.g., temperature in °C or °F, calendar dates
  - No true zero point (0 °C does not indicate "no temperature")
  - Cannot speak of values in terms of ratios
  - Mean, median and mode are all valid

Numeric Attribute Types
- Quantity (integer or real-valued)
- Ratio
  - Inherent zero point
  - Values can be expressed as ratios of each other
  - We can speak of one value as being a multiple of another (10 K is twice as high as 5 K).
  - E.g., temperature in Kelvin, length, counts, monetary quantities, number of words, years of experience
  - 0 K corresponds to particles with zero kinetic energy

Discrete vs. Continuous Attributes
- Discrete Attribute
  - Has only a finite or countably infinite set of values
  - E.g., zip codes, profession, or the set of words in a collection of documents
  - Sometimes represented as integer variables
  - Note: Binary attributes are a special case of discrete attributes
- Continuous Attribute
  - Has real numbers as attribute values
  - E.g., temperature, height, or weight
  - Practically, real values can only be measured and represented using a finite number of digits
  - Continuous attributes are typically represented as floating-point variables

Chapter 2: Getting to Know Your Data

- Data Objects and Attribute Types
- Basic Statistical Descriptions of Data

Basic Statistical Descriptions of Data
- Motivation
  - To better understand the data: central tendency, variation and spread
- Data dispersion characteristics
  - median, max, min, quantiles, outliers, variance, etc.
- Numerical dimensions correspond to sorted intervals
  - Data dispersion: analyzed with multiple granularities of precision
  - Boxplot or quantile analysis on sorted intervals
- Dispersion analysis on computed measures
  - Folding measures into numerical dimensions
  - Boxplot or quantile analysis on the transformed cube

Measuring the Central Tendency
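- Mean (algebraic measure) (sample vs. population):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$

  Note: n is the sample size and N is the population size.

- Weighted arithmetic mean (each weight w_i reflects the significance of x_i):

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

- The mean is not always the best measure of central tendency, because it is sensitive to extreme values
- Trimmed mean: the mean computed after chopping off extreme values (e.g., the top and bottom 2%)
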
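The three variants of the mean can be sketched as follows. The function names, the sample values and the trimming convention (drop the same fraction of values from each end, rounding the count down) are assumptions made for this illustration.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end (assumed convention)."""
    data = sorted(values)
    k = int(len(data) * fraction)  # number of values dropped at each end
    return mean(data[k:len(data) - k])


# Hypothetical values; 1000 is an extreme value that drags the plain mean upward.
values = [3, 5, 7, 10, 10, 1000]
plain = mean(values)
trimmed = trimmed_mean(values, 0.2)

# Hypothetical weights: the third value counts twice as much as the others.
weighted = weighted_mean([3, 5, 7], [1, 1, 2])

print(plain, trimmed, weighted)
```

With these values the plain mean is 172.5, while trimming one value from each end gives 8.0, which is much closer to the bulk of the data.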
Measuring the Central Tendency
- Median:
  - Good for skewed (asymmetric) data
  - Middle value if there is an odd number of values, or the average of the middle two values otherwise
  - The concept can be extended to ordinal data (odd number of values: the middle value; even number: the median is not unique)
  - Expensive to compute when the data set is huge
  - Estimated by interpolation (for grouped data):

$$\text{median} = L_1 + \left(\frac{n/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$

  where
  - L_1: lower boundary of the median interval
  - n: number of values
  - (Σ freq)_l: sum of the frequencies of all the intervals that are lower than the median interval
  - freq_median: frequency of the median interval
  - width: width of the median interval

Median Estimation
$$\text{median} = L_1 + \left(\frac{n/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$

Age      Freq   Cum Freq
1-5      200    200
6-15     450    650
16-20    300    950
21-50    1500   2450
51-80    700    3150
81-110   44     3194

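The interpolation formula can be applied to the age table with a short sketch. Two choices here are assumptions, since the slide does not state them: the lower boundary L1 is taken as the interval's lower limit (21), and the width counts both ends of the integer interval (21 to 50 gives 30). Other conventions, such as using 20.5 as the boundary, give a slightly different estimate.

```python
def grouped_median(intervals):
    """Estimate the median from (lower, upper, frequency) tuples in ascending order."""
    n = sum(freq for _, _, freq in intervals)
    cumulative = 0  # sum of the frequencies of all intervals below the current one
    for lower, upper, freq in intervals:
        if cumulative + freq >= n / 2:
            # Assumption: integer ages with both interval ends inclusive.
            width = upper - lower + 1
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq
    raise ValueError("no intervals given")


# Frequencies taken from the age table above.
age_table = [
    (1, 5, 200), (6, 15, 450), (16, 20, 300),
    (21, 50, 1500), (51, 80, 700), (81, 110, 44),
]

estimate = grouped_median(age_table)
print(round(estimate, 2))
```

Here n = 3194, so n/2 = 1597 falls in the 21-50 interval (cumulative frequency 950 before it), and under the stated assumptions the estimate is 21 + (1597 - 950)/1500 × 30 ≈ 33.94.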
Measuring the Central Tendency

- Mode
  - Value that occurs most frequently in the data
  - Applies to both qualitative and quantitative attributes
  - Several values can share the greatest frequency
  - Data sets can be unimodal, bimodal, trimodal or multimodal
  - At the other extreme, if each value occurs only once, there is no mode
  - Empirical formula:

$$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$$

- Midrange
  - A measure of central tendency for a numeric data set
  - Average of the largest and smallest values

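A small sketch of the mode and the midrange, using the two data sets from this lecture's quartile examples. The helper name `midrange` and the variable names are assumptions; `statistics.multimode` is the standard-library call that returns every value sharing the highest frequency.

```python
from statistics import multimode


def midrange(values):
    """Average of the largest and smallest values."""
    return (max(values) + min(values)) / 2


# Data sets from the quartile examples in this lecture, in sorted order.
example_1 = [4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12]
example_2 = [3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15]

modes_1 = multimode(example_1)  # 8 and 9 both occur three times: bimodal
modes_2 = multimode(example_2)  # only 8 occurs three times: unimodal

print(modes_1, midrange(example_1))
print(modes_2, midrange(example_2))
```

Note that when every value occurs exactly once, `multimode` returns all of the values, which corresponds to the "no mode" case described above.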
Symmetric vs. Skewed Data
- Median, mean and mode of symmetric, positively skewed and negatively skewed data

[Figure: distribution curves for positively skewed and negatively skewed data.]

March 12, 2022 Data Mining: Concepts and Techniques 17

Measuring the Dispersion of Data
- Quartiles, outliers and boxplots
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Inter-quartile range: IQR = Q3 - Q1
  - Five-number summary: min, Q1, median, Q3, max
  - Boxplot: the ends of the box are the quartiles; the median is marked; add whiskers, and plot outliers individually
  - Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
- Variance and standard deviation (sample: s, population: σ)
  - Variance (algebraic, scalable computation):

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$

- Standard deviation s (or σ) is the square root of the variance s² (or σ²)

Quantiles
- Suppose the data is sorted in ascending order
- Quantiles are data points that split the data distribution into equal-size consecutive sets
- First quartile (Q1) = ((n + 1)/4)th term
- Second quartile (Q2) = ((n + 1)/2)th term
- Third quartile (Q3) = (3(n + 1)/4)th term
- If a position is not a whole number, the quartile falls between the two neighbouring terms (the worked examples take their average)
- Inter-quartile range: IQR = Q3 - Q1

Finding the median, quartiles and inter-quartile range.
Example 1: Find the median and quartiles for the data below.
12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10
Order the data:
4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12

Lower Quartile (Q1) = 5½
Median (Q2) = 8
Upper Quartile (Q3) = 9

Inter-Quartile Range = 9 - 5½ = 3½
Finding the median, quartiles and inter-quartile range.

Example 2: Find the median and quartiles for the data below.
6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10
Order the data:
3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15

Lower Quartile (Q1) = 4
Median (Q2) = 8
Upper Quartile (Q3) = 10

Inter-Quartile Range = 10 - 4 = 6
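
The two worked examples can be reproduced with a short sketch. It assumes the "median of each half" convention (the middle value is left out of both halves when the count is odd), which matches the answers given in both examples; other quartile conventions can give slightly different values. The function name `quartiles` is an assumption made here.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention (an assumption)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For an odd count, the middle value belongs to neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)


example_1 = [12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10]
example_2 = [6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10]

for data in (example_1, example_2):
    q1, q2, q3 = quartiles(data)
    print(q1, q2, q3, "IQR =", q3 - q1)
```

For Example 1 this gives Q1 = 5.5, median = 8 and Q3 = 9 (IQR 3.5); for Example 2 it gives Q1 = 4, median = 8 and Q3 = 10 (IQR 6).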
Five-Number Summary
- A five-number summary consists of five values: the most extreme values in the data set (the maximum and minimum values), the lower and upper quartiles, and the median.
- These values are presented together, ordered from lowest to highest:
  - minimum value
  - lower quartile (Q1)
  - median value (Q2)
  - upper quartile (Q3)
  - maximum value

BoxPlots
- The ends of the box are the quartiles, so the box length is the interquartile range
- The median is marked by a line within the box
- Two lines (whiskers) outside the box extend to the smallest and largest data points
- Outliers: points beyond a specified outlier threshold, plotted individually

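One common outlier threshold is the 1.5 × IQR rule mentioned earlier in this lecture. The sketch below applies it to the Example 2 data. The function names and the extra value 30 are hypothetical; the quartiles come from `statistics.quantiles` with `method="exclusive"`, which places quartile i at position i(n + 1)/4 and interpolates linearly between neighbouring terms when that position is fractional.

```python
from statistics import quantiles


def outlier_fences(values, k=1.5):
    """Return the (low, high) thresholds Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    low, high = outlier_fences(values, k)
    return [x for x in values if x < low or x > high]


# Example 2 data from this lecture: Q1 = 4, Q3 = 10, so IQR = 6.
data = [6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10]
print(outlier_fences(data))
print(find_outliers(data))

# Hypothetical extra value far above the rest of the data.
print(find_outliers(data + [30]))
```

For the original data the fences are -5 and 19, so even the largest value, 15, is not flagged; the hypothetical value 30 lies beyond the upper fence and would be plotted individually.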
Box and Whisker Diagrams.

Box plots are useful for comparing two or more sets of data like
that shown below for heights of boys and girls in a class.

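Anatomy of a Box and Whisker Diagram.

[Figure: a box-and-whisker diagram on a scale from 4 to 12, labelling the lowest value, lower quartile, median, upper quartile and highest value, with the box and the two whiskers marked.]

[Figure: box plots of the heights of boys and girls on a scale from 130 to 190 cm.]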
Box Plots
Drawing a Box Plot.
Example 1: Draw a Box plot for the data below

4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12

Lower Quartile (Q1) = 5½
Median (Q2) = 8
Upper Quartile (Q3) = 9

[Figure: box plot of the data on a scale from 4 to 12.]
Drawing a Box Plot.
Example 2: Draw a Box plot for the data below

3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15

Lower Quartile (Q1) = 4
Median (Q2) = 8
Upper Quartile (Q3) = 10

[Figure: box plot of the data on a scale from 3 to 15.]
Drawing a Box Plot.
Question: Stuart recorded the heights in cm of boys in his
class as shown below. Draw a box plot for this data.
137, 148, 155, 158, 165, 166, 166, 171, 171, 173, 175, 180, 184, 186, 186

Lower Quartile (QL) = 158
Median (Q2) = 171
Upper Quartile (QU) = 180

[Figure: box plot of the boys' heights on a scale from 130 to 190 cm.]
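
The five values needed to draw this box plot can be checked with a short sketch. The function name `five_number_summary` is an assumption; `statistics.quantiles` with `method="exclusive"` places quartile i at position i(n + 1)/4, which for these 15 heights lands exactly on the 4th, 8th and 12th terms.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4, method="exclusive")
    return min(values), q1, q2, q3, max(values)


# Heights in cm of the boys in the class, from the question above.
boys = [137, 148, 155, 158, 165, 166, 166, 171,
        171, 173, 175, 180, 184, 186, 186]

summary = five_number_summary(boys)
print(summary)
```

The result matches the quartiles given above (158, 171, 180), with whiskers reaching to 137 and 186.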

Drawing a Box Plot.
Question: Gemma recorded the heights in cm of girls in the same class and
constructed a box plot from the data. The box plots for both boys and girls
are shown below. Use the box plots to choose some correct statements
comparing heights of boys and girls in the class. Justify your answers.

[Figure: box plots of the boys' and girls' heights on a shared scale from 130 to 190 cm.]

1. The girls are taller on average.
2. The boys are taller on average.
3. The girls show less variability in height.
4. The boys show less variability in height.
5. The smallest person is a girl.
6. The tallest person is a boy.
Boxplot Analysis
- Five-number summary of a distribution
  - Minimum, Q1, Median, Q3, Maximum
- Boxplot
  - Data is represented with a box
  - The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
  - The median is marked by a line within the box
  - Whiskers: two lines outside the box, extended to the minimum and maximum

Worksheet 1

Finding the median, quartiles and inter-quartile range.

Example 1: Find the median and quartiles for the data below.
12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10

Example 2: Find the median and quartiles for the data below.
6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10

Worksheet 2

Box Plots

[Figure: four blank axes for drawing box plots, scaled 4 to 12, 3 to 15, 130 to 190 cm, and 0 to 60.]
Variance and Standard Deviation
- Data Set 1: 3, 5, 7, 10, 10
- Data Set 2: 7, 7, 7, 7, 7

Data Set 1: mean = 7, median = 7
Data Set 2: mean = 7, median = 7

But we know that the two data sets are not identical! The variance shows how they differ.

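A minimal sketch that computes the variance of the two data sets, using both the definition and the shortcut form of the population variance given earlier in this lecture. The function names are assumptions made for this illustration.

```python
import math


def population_variance(values):
    """Definition: the average squared deviation from the mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def population_variance_shortcut(values):
    """Shortcut: the mean of the squares minus the square of the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2


def sample_variance(values):
    """Sample variance: divide the squared deviations by n - 1 instead of n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


set_1 = [3, 5, 7, 10, 10]
set_2 = [7, 7, 7, 7, 7]

for name, data in (("Data Set 1", set_1), ("Data Set 2", set_2)):
    var = population_variance(data)
    print(name, "variance =", var, "SD =", math.sqrt(var))
```

Both sets have mean 7, but Data Set 1 has a population variance of 38/5 = 7.6 (sample variance 38/4 = 9.5), while Data Set 2 has variance 0 because every value equals the mean.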
Variance and Standard Deviation
- Measures of data dispersion
- A low SD means the observations tend to be very close to the mean
- A high SD indicates that the data is spread out over a large range of values
- Variance is the square of the SD:

$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$$
