
Data Mining

This document discusses key concepts in data mining and knowledge discovery from data. It explains that data needs to be prepared through preprocessing steps like cleaning, transformation, and reduction before it can be analyzed. Common techniques discussed include filling missing values, smoothing noisy data, normalization, discretization, and data reduction. The goal of these techniques is to handle issues like inconsistent, incomplete or noisy data to improve the quality and understanding that can be derived from the data.

Uploaded by

gianghytien
Copyright
© © All Rights Reserved

Data Mining ~ Knowledge Discovery

Data <-> Choose data -> Preprocessing data -> Transforming data

Reason why we need to prepare the data:

 Noisy
 Incomplete
 Inconsistent

Data -> Data warehouse / Data Mining -> Decision

Data cleaning attempts to:

 Fill in missing values
 Smooth out noisy data
 Correct inconsistencies
 Remove irrelevant data

Example:

ID  Name    QT1  GK  CK  Group
1   Mickey  7    9   9   1
2   Donald  5    4   6   1
3   Pluto   9    ?   5   2
4   Goofy   7    8   6   2

(? = missing value)

Each record (row) has 4 attributes: Name, QT1, GK, CK. Each attribute is a field (column).

 If one value in a field is missing, fill the blank with the mean (average) of the known values in that field (ex: Pluto – GK: (9+4+8)/3 = 7)
 If a field indicates that the data is divided into groups, use only the mean of the values belonging to the same group (ex: Pluto is in group 2 with Goofy, so Pluto – GK: 8)
 Another way to fill in the blank is to sort the known values in ascending order and use the middle value, i.e. the median (ex: 4, 8, 9, so Pluto – GK: 8)

Solving the missing data problem:

 Use a global constant to mark missing values (NULL, N/A, unknown, Vắng ("absent"), etc.) -> The sheet will automatically skip the marked cells in calculations
 Use the mean of the attribute to fill the missing values of that attribute
 Use the attribute mean of all samples belonging to the same class to fill in the missing values
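
The three fill-in strategies from the Pluto example can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: the helper names `known_gk` and `fill_value` and the dictionary layout are assumptions made for the example, and only the GK column from the table is modeled.

```python
from statistics import mean, median

# Rows from the example table; None marks Pluto's missing GK score.
rows = [
    {"name": "Mickey", "gk": 9, "group": 1},
    {"name": "Donald", "gk": 4, "group": 1},
    {"name": "Pluto", "gk": None, "group": 2},
    {"name": "Goofy", "gk": 8, "group": 2},
]

def known_gk(records):
    """Return the GK values that are present (missing values are skipped)."""
    return [r["gk"] for r in records if r["gk"] is not None]

def fill_value(records, target, strategy):
    """Pick a replacement for the target's missing GK using one strategy."""
    if strategy == "mean":
        return mean(known_gk(records))
    if strategy == "group_mean":
        same_group = [r for r in records if r["group"] == target["group"]]
        return mean(known_gk(same_group))
    if strategy == "median":
        return median(known_gk(records))
    raise ValueError(f"unknown strategy: {strategy}")

pluto = rows[2]
for strategy in ("mean", "group_mean", "median"):
    print(strategy, fill_value(rows, pluto, strategy))
```

The overall mean gives 7, while the group mean and the median both give 8, matching the worked values in the notes.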

Smoothing Noisy Data:

 The purpose is to eliminate noise and “smooth out” the data fluctuations

Ex: Original data for “price” (after sorting): 4, 8, 15, 21, 21, 24, 25, 28, 34

 Binning: Partition into equal-depth bins (here 3 values per bin)
o Bin1: 4, 8, 15
o Bin2: 21, 21, 24
o Bin3: 25, 28, 34
 Means: each value in a bin is replaced by the mean value of the bin
o Bin1: 9, 9, 9
o Bin2: 22, 22, 22
o Bin3: 29, 29, 29
 Boundaries: min and max values in each bin are identified (boundaries). Each value in a bin is
replaced with the closest boundary value
o Bin1: 4, 4, 15
o Bin2: 21, 21, 24
o Bin3: 25, 25, 34
 Other methods:
o Clustering: Similar values are organized into groups (clusters). Values falling outside of
clusters may be considered “outliers” and may be candidates for elimination.
o Regression: Fit data to a function. Linear regression finds the best line to fit 2 variables.
Multiple regression can handle multiple variables. The values given by the function are
used instead of the original values.
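
The binning steps for the “price” data can be sketched in Python as below. The function names are illustrative assumptions, means are rounded to whole numbers as in the notes, and a value exactly halfway between the two boundaries is sent to the lower one (a choice the notes do not specify).

```python
def equal_depth_bins(sorted_values, depth):
    """Split already-sorted values into consecutive bins of `depth` items."""
    return [sorted_values[i:i + depth] for i in range(0, len(sorted_values), depth)]

def smooth_by_means(bins):
    """Replace every value with the rounded mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the nearer of its bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Ties go to the lower boundary (an assumption of this sketch).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, depth=3)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

The last bin may hold fewer than `depth` values when the data does not divide evenly; its mean is then taken over the values it actually contains.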

Temperature (stem-and-leaf plot):

5 | 8
6 | 5 8 9
7 | 0 1 2 3 5 5
8 | 0 1 3 5

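Temperature values sorted and partitioned into equal-depth bins:

ID  Temperature  Bin
7   58           Bin1
6   65           Bin1
5   68           Bin1
9   69           Bin2
4   70           Bin2
10  71           Bin2
8   72           Bin3
12  73           Bin3
11  75           Bin3
14  75           Bin4
2   80           Bin4
13  81           Bin4
3   83           Bin5
1   85           Bin5
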
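Temperature values after smoothing by bin means (rounded):

ID  Temperature  Bin
7   64           Bin1
6   64           Bin1
5   64           Bin1
9   70           Bin2
4   70           Bin2
10  70           Bin2
8   73           Bin3
12  73           Bin3
11  73           Bin3
14  79           Bin4
2   79           Bin4
13  79           Bin4
3   84           Bin5
1   84           Bin5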

Humidity (stem-and-leaf plot):

6 | 5
7 | 0 0 0 5 8
8 | 0 0 0 5
9 | 0 0 5 6

Data Transformation (Normalization): We rescale the data so that every value falls in the range from 0 to 1

Ex: min = 65%, value = 75%, max = 96%, which map to 0, x, 1

x = (75-65) / (96-65) ≈ 0.32

Ex: min = 60%, value = 75%, max = 100%, which map to 0, x, 1

x = (75-60) / (100-60) = 0.375
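
The two worked examples above follow the same rule, which can be sketched as a small Python function. The name `min_max` and its parameter names are assumptions made for illustration.

```python
def min_max(value, low, high):
    """Linearly map value from the range [low, high] onto [0, 1]."""
    if high == low:
        raise ValueError("low and high must differ")
    return (value - low) / (high - low)

print(round(min_max(75, 65, 96), 2))  # 0.32
print(min_max(75, 60, 100))           # 0.375
```

The guard against `high == low` is there because an attribute whose values are all equal has no range to rescale.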

Data Transformation: Normalization (quantitative data)

 Min-Max normalization: linear transformation from v to v’ (written here as x_1 and x'_1)

$$x'_1 = \frac{x_1 - \min x_1}{\max x_1 - \min x_1}$$

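 Z-score normalization: normalization of v into v’ based on the attribute's mean and standard deviation

$$v' = \frac{v - \text{Mean}}{\text{Standard Deviation}} = \frac{v - \mu}{\sigma}$$

$$\mu = \text{mean} = \frac{\sum v}{n}$$

$$\sigma = \sqrt{\frac{\sum (v_i - \mu)^2}{n - 1}}$$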

 Normalization by decimal scaling
o Moves the decimal point of v by j positions, where j is the minimum number of positions needed so that the maximum absolute value falls in [0, 1]

$$v' = \frac{v}{10^j}$$

Ex: if v in [-56, 9976] and j = 4 -> v’ in [-0.0056, 0.9976]

Ex: records after normalization to [0, 1]:

ID  Gender  Age   Salary
1   0       0.00  0.00
2   1       0.96  0.56
3   1       1.00  1.00
4   0       0.24  0.44
5   1       0.72  0.32
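
Z-score normalization and decimal scaling can be sketched in Python as below. The function names are assumptions for illustration; `statistics.stdev` is used because it divides by n - 1, as in the standard deviation formula in these notes.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values using the mean and the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, divides by n - 1
    return [(v - mu) / sigma for v in values]

def decimal_scale(values):
    """Divide by the smallest power of ten that brings max |v| to 1 or less."""
    j = 0
    largest = max(abs(v) for v in values)
    while largest / 10 ** j > 1:
        j += 1
    return j, [v / 10 ** j for v in values]

print(z_scores([1, 2, 3]))
j, scaled = decimal_scale([-56, 9976])
print(j, scaled)
```

For the range in the notes, the loop stops at j = 4, which gives -0.0056 and 0.9976.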

Data Transformation: Discretization (qualitative data)


 3 types of attributes:
o Nominal: values from an unordered set (also “categorical” attributes)
o Ordinal: values from an ordered set
o Numeric/Continuous: real numbers (but sometimes also integer values)

When qualitative data is converted to numbers, never compute the mean of the resulting codes.

For such data, you can only compute percentages (proportions) and present them in charts.

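
As a small illustration of this rule, the sketch below reports the share of each category instead of averaging the codes. It reuses the Group column from the first example table (1, 1, 2, 2); the function name `percentages` is an assumption made for the example.

```python
from collections import Counter

def percentages(labels):
    """Return each category's share of the total, in percent."""
    counts = Counter(labels)
    total = len(labels)
    return {label: 100 * count / total for label, count in counts.items()}

# Group codes are labels, not quantities: a "mean group" of 1.5 means nothing.
groups = [1, 1, 2, 2]
print(percentages(groups))
```

The resulting percentages are the kind of summary that can be shown in a bar or pie chart.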
Data Reduction

 Data is often too large; reducing data can improve performance
 Data reduction consists of reducing the representation of the data set while producing the same (or almost the same) results
 Data reduction includes:
o Data cube aggregation
o Dimensionality reduction
o Discretization
o Numerosity reduction
 Regression
 Histogram
 Clustering
 Sampling
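
Of the numerosity reduction methods listed, sampling is the simplest to show in code. The sketch below draws a simple random sample without replacement from the temperature values used earlier; the function name, the sample size of 5, and the fixed seed are assumptions made for illustration.

```python
import random

def simple_random_sample(records, k, seed=0):
    """Draw k records without replacement; the seed makes runs repeatable."""
    rng = random.Random(seed)
    return rng.sample(records, k)

temperatures = [58, 65, 68, 69, 70, 71, 72, 73, 75, 75, 80, 81, 83, 85]
sample = simple_random_sample(temperatures, k=5)
print(sample)
```

Whether a sample this small still produces "almost the same" results depends on the analysis, so the sample size would normally be chosen with the downstream task in mind.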

Regression Analysis
