Data Mining

This document discusses key concepts in data mining and knowledge discovery from data. It explains that data needs to be prepared through preprocessing steps like cleaning, transformation, and reduction before it can be analyzed. Common techniques discussed include filling missing values, smoothing noisy data, normalization, discretization, and data reduction. The goal of these techniques is to handle issues like inconsistent, incomplete or noisy data to improve the quality and understanding that can be derived from the data.

Data Mining ~ Knowledge Discovery

Data <-> Choose data -> Preprocessing data -> Transforming data

Reason why we need to prepare the data:

 Noisy
 Incomplete
 Inconsistent

Data -> Data warehouse / Data Mining -> Decision

Data cleaning attempts to:

 Fill in missing values

 Smooth out noisy data
 Correct inconsistencies
 Remove irrelevant data

Example:

ID  Name    QT1  GK  CK  Group
1   Mickey  7    9   9   1
2   Donald  5    4   6   1
3   Pluto   9    -   5   2
4   Goofy   7    8   6   2

(- marks the missing value)

Record (Row) ~ 4 attributes: Name, QT1, GK, CK ~ Field (Column)

 If a value is missing from a field, fill in the blank with the mean (average) of the known values in that field (ex: Pluto – GK: (9+4+8)/3 = 7)
 If a field divides the data into groups, use only the mean of the values belonging to the same group (ex: Pluto – GK: 8, because Goofy's 8 is the only known GK value in group 2)
 Another way to fill in the blank is to sort the known values in ascending order and use the middle one, the median (ex: Pluto – GK: 4, 8, 9 -> 8)

Solving the missing data problem:

 Use a global constant to fill in missing values (NULL, N/A, unknown, Vắng ("absent"), etc.) -> the spreadsheet will automatically skip these values in calculations
 Use the mean of the attribute to fill in the missing values of that attribute
 Use the attribute mean over all samples belonging to the same class to fill in the missing values
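
The three fill-in strategies from the example can be sketched in Python as follows. This is a minimal illustration, not a fixed recipe: the dictionary layout and the function names are assumptions made here, and only the GK column of the example table is used.

```python
from statistics import mean, median

# GK scores from the example table; None marks Pluto's missing value.
rows = [
    {"name": "Mickey", "gk": 9, "group": 1},
    {"name": "Donald", "gk": 4, "group": 1},
    {"name": "Pluto", "gk": None, "group": 2},
    {"name": "Goofy", "gk": 8, "group": 2},
]


def known(values):
    """Keep only the values that are actually present."""
    return [v for v in values if v is not None]


def fill_with_mean(rows, field):
    """Replace missing values with the mean of the whole field."""
    fill = mean(known(r[field] for r in rows))
    return [fill if r[field] is None else r[field] for r in rows]


def fill_with_group_mean(rows, field, group_field):
    """Replace missing values with the mean of the same group.

    Raises statistics.StatisticsError if a group has no known value.
    """
    filled = []
    for r in rows:
        if r[field] is None:
            same_group = (x[field] for x in rows if x[group_field] == r[group_field])
            filled.append(mean(known(same_group)))
        else:
            filled.append(r[field])
    return filled


def fill_with_median(rows, field):
    """Replace missing values with the median of the whole field."""
    fill = median(known(r[field] for r in rows))
    return [fill if r[field] is None else r[field] for r in rows]


print(fill_with_mean(rows, "gk"))                 # Pluto gets 7
print(fill_with_group_mean(rows, "gk", "group"))  # Pluto gets 8
print(fill_with_median(rows, "gk"))               # Pluto gets 8
```

The group mean is usually the better guess when the groups differ a lot, but it needs at least one known value in each group.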

Smoothing Noisy Data:

 The purpose is to eliminate noise and “smooth out” the data fluctuations

Ex: Original data for “price” (after sorting): 4, 8, 15, 21, 21, 24, 25, 28, 34

 Binning: Partition into equidepth bins

o Bin1: 4, 8, 15
o Bin2: 21, 21, 24
o Bin3: 25, 28, 34
 Means: each value in a bin is replaced by the mean value of the bin
o Bin1: 9, 9, 9
o Bin2: 22, 22, 22
o Bin3: 29, 29, 29
 Boundaries: min and max values in each bin are identified (boundaries). Each value in a bin is
replaced with the closest boundary value
o Bin1: 4, 4, 15
o Bin2: 21, 21, 24
o Bin3: 25, 25, 34
 Other methods:
o Clustering: Similar values are organized into groups (clusters). Values falling outside of
clusters may be considered “outliers” and may be candidates for elimination.
o Regression: Fit data to a function. Linear regression finds the best line to fit 2 variables.
Multiple regression can handle multiple variables. The values given by the function are
used instead of the original values.
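
The binning example for “price” can be reproduced with a short Python sketch. The function names are assumptions made for illustration, bin means are rounded to whole numbers as in the example, and a value that is equally close to both boundaries goes to the lower one (the notes do not say how to break ties).

```python
def equidepth_bins(sorted_values, depth):
    """Split already sorted values into consecutive bins of `depth` values.

    The last bin is shorter when the count is not a multiple of `depth`.
    """
    return [sorted_values[i:i + depth] for i in range(0, len(sorted_values), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin with the rounded mean of that bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closest of the bin's min and max.

    Assumption: ties go to the lower boundary.
    """
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equidepth_bins(prices, 3)

print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

The same functions work when the last bin is incomplete, for example 14 values split into bins of 3.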

Temperature (stem-and-leaf, stem = tens digit):

5 | 8
6 | 5 8 9
7 | 0 1 2 3 5 5
8 | 0 1 3 5

Equidepth binning of the sorted temperatures and smoothing by bin means:

ID  Temperature  Bin   Smoothed
7   58           Bin1  64
6   65           Bin1  64
5   68           Bin1  64
9   69           Bin2  70
4   70           Bin2  70
10  71           Bin2  70
8   72           Bin3  73
12  73           Bin3  73
11  75           Bin3  73
14  75           Bin4  79
2   80           Bin4  79
13  81           Bin4  79
3   83           Bin5  84
1   85           Bin5  84

Humidity (stem-and-leaf, stem = tens digit):

6 | 5
7 | 0 0 0 5 8
8 | 0 0 0 5
9 | 0 0 5 6

Data Transformation (Normalization): We rescale the data to values ranging from 0 to 1

Ex: min = 65%, value = 75%, max = 96%, mapped to 0, x, 1

x = (75-65) / (96-65) = 0.32

Ex: min = 60%, value = 75%, max = 100%, mapped to 0, x, 1

x = (75-60) / (100-60) = 0.375
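
The two worked examples can be checked with a small Python sketch. The function names are assumptions made for illustration, and the humidity list is read off the stem-and-leaf display above.

```python
def min_max(value, low, high):
    """Map `value` from the range [low, high] onto [0, 1]."""
    if high == low:
        raise ValueError("min and max are equal, so the range has zero width")
    return (value - low) / (high - low)


def min_max_normalize(values):
    """Rescale a whole column using its own minimum and maximum."""
    low, high = min(values), max(values)
    return [min_max(v, low, high) for v in values]


print(round(min_max(75, 65, 96), 2))  # 0.32
print(min_max(75, 60, 100))           # 0.375

humidity = [65, 70, 70, 70, 75, 78, 80, 80, 80, 85, 90, 90, 95, 96]
normalized = min_max_normalize(humidity)
print(normalized[0], normalized[-1])  # 0.0 1.0
```

The result depends on which minimum and maximum are chosen: the observed range (65 to 96) and a fixed range (60 to 100) give different values for the same 75%.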

Data Transformation: Normalization (Quantitative)

 Min-Max normalization: linear transformation from v to v’

$$x'_1 = \frac{x_1 - \min x_1}{\max x_1 - \min x_1}$$

 Z-score normalization: normalization of v into v’ based on attribute value mean and standard
deviation

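$$v' = \frac{v - \text{Mean}}{\text{Standard Deviation}} = \frac{v - \mu}{\sigma}$$

$$\mu = \text{mean} = \frac{\sum_{i} v_i}{n}$$

$$\sigma = \sqrt{\frac{\sum_{i} (v_i - \mu)^2}{n - 1}}$$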

 Normalization by decimal scaling

o Moves the decimal point of v by j positions, where j is the minimum number of positions such that the maximum absolute value falls in [0, 1]

$$v' = \frac{v}{10^j}$$

Ex: if v in [-56, 9976] then j = 4 -> v’ in [-0.0056, 0.9976]
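
Z-score normalization and decimal scaling can be sketched in Python as follows. The function names and the sample values 2, 4, 6 are made up for illustration. The standard deviation uses the n - 1 form shown above. For decimal scaling, the sketch assumes the common convention that the largest absolute value must end up strictly below 1.

```python
from statistics import mean, stdev


def z_score_normalize(values):
    """v' = (v - mu) / sigma, with sigma the sample standard deviation (n - 1).

    Needs at least two values that are not all equal.
    """
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling_exponent(values):
    """Smallest j such that max(|v|) / 10**j is below 1 (assumed convention)."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j


def decimal_scale(values):
    """Divide every value by 10**j, with j chosen by decimal_scaling_exponent."""
    j = decimal_scaling_exponent(values)
    return [v / 10 ** j for v in values]


sample = [2, 4, 6]                # hypothetical data: mean 4, sigma 2
print(z_score_normalize(sample))  # [-1.0, 0.0, 1.0]

print(decimal_scaling_exponent([-56, 9976]))  # 4
print(decimal_scale([-56, 9976]))             # [-0.0056, 0.9976]
```

Unlike min-max normalization, z-scores are not confined to [0, 1]; they express how many standard deviations a value lies from the mean.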

Ex: attributes after normalization to [0, 1]

ID  Gender  Age   Salary
1   0       0.00  0.00
2   1       0.96  0.56
3   1       1.00  1.00
4   0       0.24  0.44
5   1       0.72  0.32

Data Transformation: Discretization (Qualitative)

 3 types of attributes:
o Nominal: values from an unordered set (also “categorical” attributes)
o Ordinal: values from an ordered set
o Numeric/Continuous: real numbers (but sometimes also integer values)

When converting qualitative values to quantitative codes => never compute the mean

Only percentage shares can be calculated, and they can be shown in charts
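
As an illustration of this rule, the Group column of the student table is a nominal code: averaging it gives 1.5, which names no group. A minimal Python sketch of the percentage summary follows; the function name is an assumption made here.

```python
from collections import Counter


def percentages(labels):
    """Share of each category, in percent of all labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: 100 * count / total for label, count in counts.items()}


# Group codes of Mickey, Donald, Pluto and Goofy from the example table.
groups = [1, 1, 2, 2]

print(sum(groups) / len(groups))  # 1.5, not a valid group
print(percentages(groups))        # {1: 50.0, 2: 50.0}
```

The percentages are what a bar or pie chart of the attribute would display.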

Data Reduction

 Data is often too large; reducing data can improve performance

 Data reduction consists of reducing the representation of the data set while producing the same
(or almost the same) results
 Data reduction includes:
o Data cube aggregation
o Dimensionality reduction
o Discretization
o Numerosity reduction
 Regression
 Histogram
 Clustering
 Sampling
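
As one concrete case of numerosity reduction, a histogram stores only a count per bucket instead of every value. The sketch below applies this to the “price” values used earlier; the bucket width of 10 and the function name are assumptions made for illustration.

```python
from collections import Counter


def equal_width_histogram(values, width):
    """Count how many values fall in each bucket [start, start + width)."""
    counts = Counter((v // width) * width for v in values)
    return {(start, start + width): counts[start] for start in sorted(counts)}


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
histogram = equal_width_histogram(prices, 10)

print(histogram)  # {(0, 10): 2, (10, 20): 1, (20, 30): 5, (30, 40): 1}
```

Nine values are reduced to four bucket counts. The individual prices are lost, but the overall shape of the distribution is kept.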

Regression Analysis
