BI Lecture05A DataWrangling
CS459 - Business Intelligence - Abeera Tariq, February 24
Data Cleaning
• Mean: The "average" number; found by adding all data points and
dividing by the number of data points.
(Sensitive to outliers.)
• Median: The middle number; found by ordering all data points and
picking out the one in the middle (or, if there are two middle numbers,
taking the mean of those two numbers).
(Largely unaffected by outliers.)
• Mode: The most frequent number, that is, the number that occurs the
highest number of times. A data set can have more than one mode.
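To make the outlier remarks concrete, here is a minimal sketch using Python's standard `statistics` module. The sales figures and the `summarize` helper are made up for illustration; they are not from the lecture.

```python
from statistics import mean, median, multimode

# Hypothetical sales figures (illustrative data only).
sales = [10, 12, 12, 13, 15]
# The same data with one extreme value appended as an outlier.
sales_with_outlier = sales + [100]


def summarize(values):
    """Return the mean, median, and mode(s) of a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        "modes": multimode(values),  # a list, because ties are possible
    }


before = summarize(sales)
after = summarize(sales_with_outlier)

print(before)
print(after)
```

Adding the single value 100 moves the mean from 12.4 to 27, while the median only moves from 12 to 12.5 and the mode stays at 12.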
• Standardization typically means rescaling data to have a
mean of 0 and a standard deviation of 1 (unit variance).
• Normalization typically means rescaling the values into a
range of [0,1] or [-1,1].
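The two rescalings can be sketched in plain Python as follows. The function names, the sample ages, and the choice of the population standard deviation are assumptions made for this illustration; libraries may use different conventions.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("cannot standardize constant data")
    return [(v - mu) / sigma for v in values]


def normalize(values, low=0.0, high=1.0):
    """Min-max rescale values into the range [low, high]."""
    smallest, largest = min(values), max(values)
    if smallest == largest:
        raise ValueError("cannot normalize constant data")
    scale = (high - low) / (largest - smallest)
    return [low + (v - smallest) * scale for v in values]


# Hypothetical ages (illustrative data only).
ages = [20, 30, 40, 50, 60]

z_scores = standardize(ages)          # mean 0, standard deviation 1
unit_range = normalize(ages)          # values in [0, 1]
signed_range = normalize(ages, -1.0, 1.0)  # values in [-1, 1]

print(z_scores)
print(unit_range)
print(signed_range)
```

Standardization keeps the shape of the distribution but changes its units to "standard deviations from the mean"; min-max normalization pins the smallest value to the lower bound and the largest to the upper bound, so a single outlier can compress all other values.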
• Discretization is the process of transforming continuous variables,
models, or functions into a discrete form.
• For categorical variables, it is used to reduce the number of
possible groups.
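Both uses of discretization can be sketched as below: binning a continuous variable, and merging rare categories into one group. The bin edges, labels, city names, and the helper names `discretize` and `group_rare` are hypothetical choices for this example, not part of the lecture.

```python
from collections import Counter


def discretize(value, edges, labels):
    """Map a continuous value to the label of the bin it falls in.

    `edges` holds the exclusive upper bound of every bin except the last,
    so `labels` must have exactly one more entry than `edges`.
    """
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]


def group_rare(categories, min_count=2, other="Other"):
    """Replace categories seen fewer than `min_count` times with `other`."""
    counts = Counter(categories)
    return [c if counts[c] >= min_count else other for c in categories]


# Continuous variable -> discrete bins (hypothetical edges and labels).
ages = [5, 17, 18, 42, 65, 80]
age_groups = [discretize(a, [18, 65], ["minor", "adult", "senior"]) for a in ages]

# Categorical variable -> fewer groups (hypothetical data).
cities = ["Karachi", "Lahore", "Karachi", "Quetta", "Lahore", "Multan"]
city_groups = group_rare(cities)

print(age_groups)
print(city_groups)
```

Where to place the bin edges, and how rare a category must be before it is merged, are business decisions; the values here are placeholders.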
Investigate why they are occurring: where do they appear, and what might they mean?
The answer could differ from business to business, but it's important to have the
conversation rather than ignoring the data.
NUMERICAL
1. Filling the missing data with the mean.
2. Filling the missing data with the median.
CATEGORICAL
1. Filling the missing data with the mode.
2. Filling with a new type for the missing values.
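The four filling strategies above might look like this in plain Python. Missing values are represented as `None`, and the data, the `"Unknown"` label, and the helper names are assumptions made for illustration.

```python
from statistics import mean, median, mode


def observed(values):
    """Return only the values that are not missing (None)."""
    return [v for v in values if v is not None]


def fill_missing(values, fill_value):
    """Replace every missing entry (None) with `fill_value`."""
    return [fill_value if v is None else v for v in values]


# Numerical column (hypothetical incomes, with one large value).
incomes = [30, None, 50, 40, None, 200]
mean_filled = fill_missing(incomes, mean(observed(incomes)))
median_filled = fill_missing(incomes, median(observed(incomes)))

# Categorical column (hypothetical colours).
colors = ["red", None, "blue", "red", None]
mode_filled = fill_missing(colors, mode(observed(colors)))
new_type_filled = fill_missing(colors, "Unknown")  # new category for missing

print(mean_filled)
print(median_filled)
print(mode_filled)
print(new_type_filled)
```

Because the large income pulls the mean up to 80 while the median is 45, the choice between the two matters most when the column contains outliers.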
Last observation carried forward (LOCF)
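LOCF fills each missing value with the most recent observed value in an ordered series. A minimal sketch, assuming missing values are stored as `None` and that leading gaps (with no earlier observation) are left missing; the `locf` function and the readings are hypothetical.

```python
def locf(values):
    """Fill each None with the last observed value that precedes it.

    Leading missing values have nothing to carry forward, so they stay None.
    """
    filled = []
    last_seen = None
    for v in values:
        if v is not None:
            last_seen = v
        filled.append(last_seen)
    return filled


# Hypothetical time-ordered readings (illustrative data only).
readings = [None, 7, None, None, 9, None]
filled_readings = locf(readings)

print(filled_readings)
```

LOCF only makes sense when the rows are in a meaningful order, such as time, and it assumes the value did not change during the gap.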