
SCHOOL OF COMPUTING

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

SCSA1606 - Predictive and Advanced Analytics


UNIT – II

UNIT 2 DATA UNDERSTANDING AND PREPARATION

Introduction, Reading data from various sources, Data visualization, Distributions and summary
statistics, Relationships among variables, Extent of Missing Data. Segmentation, Outlier
detection, Automated Data Preparation, Combining data files, Aggregate Data, Duplicate
Removal, Sampling DATA, Data Caching, Partitioning data, Missing Values.

2.1 INTRODUCTION:
The objectives of data understanding are:

1. Understand the attributes of the data.


2. Summarize the data by identifying key characteristics, such as data volume and total
number of variables in the data.
3. Understand the problems with the data, such as missing values, inaccuracies, and
outliers.
4. Visualize the data to validate the key characteristics of the data or unearth problems
with the summary statistics.

Data Attributes:

A variable consists of two parts – the label and the data type. Data types can be numeric
(integers, real numbers) or strings. The data type can sometimes be tricky; for example, US
postal codes are numeric but need to be treated as strings. Once the labels and data types are
known, you can group attributes into two kinds for modeling:

1. Continuous Variables: These are numbers which can range from negative infinity to
positive infinity. You would associate with the labels a sense of magnitude, maximum
and minimum. You can sort on such variables and filter by ranges.
2. Categorical Variables: These variables can have a limited set of values, each of which
indicates a sub-type. For example, Direction is a categorical variable because it can be
North, South, East, or West. You can filter on or group by a specific value or values of a
categorical variable.

Some string data (like the names of people) can be transformed to either a continuous

variable (length of the name) or a categorical variable (first letter of the last name). You can also
transform a continuous variable into a categorical variable by binning. Binning means taking a
continuous variable and putting it into a discrete interval based on its value; the intervals can be
treated as values of a categorical variable.
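As an illustration, binning can be done in a few lines with pandas; this is a minimal sketch, and the data frame, column names, and interval boundaries are hypothetical.

import pandas as pd

# Hypothetical data with a continuous "age" variable
df = pd.DataFrame({"age": [22, 35, 47, 53, 61, 68, 74]})

# Bin the continuous values into discrete intervals and treat the intervals
# as values of a new categorical variable
df["age_group"] = pd.cut(df["age"],
                         bins=[0, 30, 50, 65, 120],
                         labels=["young", "middle", "senior", "elderly"])
print(df)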

Key Characteristics and Outliers

Once you have identified the variables of interest, summary statistics help you understand
the nature of each variable. The common summary statistics are mean, standard deviation,
skewness, and kurtosis.

Mean

The mean (µ) is a measure of central tendency of a distribution or discrete set of


numbers. In the latter case, it is the sum of all values divided by the count of values. The mean is
a good measure of the central tendency when the distribution is normal or uniform, but this is not
true of all cases. The mean by itself is a misleading statistic as it gets distorted by very low or
very high values (outliers) and should be reported with the standard deviation.

Standard Deviation

The standard deviation (σ) is a measure of the spread of the distribution. A larger
standard deviation represents a bigger spread of data points. For a discrete set of values, the
standard deviation is calculated as the square root of the variance. The variance is calculated by
dividing the sum of the squared differences between each value and the mean by the total number
of values. The standard deviation is considered in the context of a normal distribution.

Normal Distribution

This is a very common distribution that is observed in many natural phenomena. The
Bean Machine is a famous exhibit that demonstrates this phenomenon. In the Bean Machine,
balls are dropped from the top, bounce randomly as they hit the pins, and collect at the bottom
into a normal distribution of heights. Normal distributions have some additional properties:

 The distribution is symmetric.


 The mean, median, and mode are all the same value.
 The approximate proportions of data falling around the mean, at increasing numbers of
standard deviations, are:
 68% within +/- one standard deviation
 95% within +/- two standard deviations
 99.7% within +/- three standard deviations
 The data points outside of three standard deviations are considered outliers as they are
very unlikely to occur.

In order to check if the variable follows a normal distribution, you need to know its mean
and standard deviation. You can then use a normal distribution function to get the probability of
each value of the variable in the sorted set. If the distribution is indeed normal, you will get a
plot that is close to the following figure:

Normal distribution with μ = 0 and σ = 1
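As a sketch of this check (the sample drawn with numpy.random is only a stand-in for your own variable), the fitted normal CDF can be compared against the empirical CDF, and SciPy's normality test gives a formal verdict:

import numpy as np
from scipy import stats

values = np.random.default_rng(1).normal(loc=0.0, scale=1.0, size=1000)  # stand-in variable

mu, sigma = values.mean(), values.std()
sorted_vals = np.sort(values)

# Fitted normal CDF at each sorted value vs. the empirical CDF of the data;
# if the variable is normal, the two curves lie close to each other
fitted_cdf = stats.norm.cdf(sorted_vals, loc=mu, scale=sigma)
empirical_cdf = np.arange(1, len(sorted_vals) + 1) / len(sorted_vals)
print(np.max(np.abs(fitted_cdf - empirical_cdf)))  # small value -> close to normal

# A formal check: the D'Agostino-Pearson normality test
stat, p_value = stats.normaltest(values)
print(p_value)  # a large p-value gives no evidence against normality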

Skewness

Skewness is a measure of the balance of the distribution. A normal distribution has a


skewness value of 0 (zero). Positive skew (a value greater than 0) indicates a tail on the right side
of the distribution, while negative skew indicates a tail on the left side of the distribution. Negative
skewness is observed in distributions with outliers less than the mean, while positive skewness is
observed when there are outliers greater than the mean. Significant skewness indicates that the
mean and standard deviation are not good measures of the distribution.

Negative Skew

Positive Skew

Kurtosis

Kurtosis is another measure of the shape of the distribution. Kurtosis for a normal
distribution is 3. Kurtosis higher than 3 indicates that the shape is thinner than a normal
distribution and a kurtosis lower than 3 indicates that the shape is flatter than a normal
distribution. Kurtosis is important because it may present problems with the data even when the
distribution is symmetric. For example, a distribution may have long symmetric tails on both
sides, with a sharp peak – this indicates that the mean and standard deviation are not good
measures for the distribution. Some statistical packages report excess kurtosis, which is the
difference between the kurtosis value and the kurtosis of a normal distribution (i.e., the kurtosis
minus 3). On this scale, a positive excess kurtosis indicates a thinner distribution, while a negative
excess kurtosis indicates a flatter distribution (both relative to a normal distribution).
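These summary statistics can be computed directly with SciPy; the sample below is a stand-in, and note that scipy.stats.kurtosis returns excess kurtosis by default (0 for a normal distribution) unless fisher=False is passed:

import numpy as np
from scipy import stats

x = np.random.default_rng(4).exponential(scale=2.0, size=1000)  # stand-in, right-skewed sample

print("mean     :", np.mean(x))
print("std dev  :", np.std(x))
print("skewness :", stats.skew(x))                    # > 0 for a right (positive) skew
print("excess kurtosis :", stats.kurtosis(x))         # 0 for a normal distribution
print("raw kurtosis    :", stats.kurtosis(x, fisher=False))  # 3 for a normal distribution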

In addition to the normal distribution, some other well-known distributions are:
 Log Normal – Sometimes, the log of a set of values has a normal distribution. To check
this, you can plot the normal probability distribution of the log of the values.
 Pareto – Pareto distributions also occur in nature and have been used to describe the
distribution of wealth in the world. If your data has a Pareto distribution, you will get a
straight line by plotting the complementary cumulative distribution function (CCDF) on a
log-log scale.
 Exponential – These distributions are observed in the time intervals between events that
are equally likely to occur at any time. If your data has an exponential distribution, the
CCDF plot will be a straight line on a log-y (semi-log) scale.

If, after examining a variable against known distributions, the variable does not
satisfactorily fit any of them, it is necessary to look at rank-ordered statistics.

Rank-Ordered Statistics

The measures we’ve seen so far are heavily influenced by the presence of large outliers.
Rank-ordered statistics overcome this issue. If the variable is sorted and associated with a
percentile rank, insights can be gained regardless of the distribution's shape. The most common
rank-ordered statistics are:

Percentile Metric
0 Minimum
25 1st Quartile
50 Median
75 3rd Quartile
100 Maximum

The Inter Quartile Range (IQR) is the difference between the 75th and 25th percentile
values. A larger IQR indicates more variation in the data. This is a robust statistic that is not
affected by large outliers. It can be used in lieu of standard deviation for distributions that are not
normal.
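A short NumPy sketch of these rank-ordered statistics; the values are hypothetical, and the 1.5 x IQR fences used to flag outliers are a common convention rather than part of the definition above:

import numpy as np

x = np.array([12, 15, 15, 17, 18, 19, 20, 22, 24, 95])  # hypothetical values

minimum, q1, median, q3, maximum = np.percentile(x, [0, 25, 50, 75, 100])
iqr = q3 - q1  # robust measure of spread, not affected by the large outlier

# Common convention: flag points beyond 1.5 * IQR from the quartiles
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(minimum, q1, median, q3, maximum, iqr)
print(x[(x < lower_fence) | (x > upper_fence)])  # -> [95]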

Visualization of Single Variables

Visualization of a variable helps in obtaining insights into the distribution of the data.
The two common visualizations for single variables are a histogram and a box plot.

Histogram

A histogram depicts the shape of the distribution by binning the values into discrete
intervals. The intervals create the x-axis and the number of values in the interval creates the y-
axis. A histogram is a good indicator of the symmetry of the distribution, skew, and kurtosis.

Box Plot

A Box Plot represents quartile statistics. The box represents the IQR, and the median is a
line cutting the box in half (or close to it). A box plot can also have whiskers, which typically
extend out to the smallest (minimum) and largest (maximum) observations – this is known as a
box and whiskers plot.
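A minimal matplotlib sketch of both single-variable plots, using two hypothetical teams' test scores (similar in spirit to the Alpha/Beta example described next):

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
alpha_scores = rng.normal(70, 15, 200)   # hypothetical Alpha team scores
beta_scores = rng.normal(72, 5, 200)     # hypothetical Beta team scores

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: bins the values into discrete intervals (x-axis) and counts them (y-axis)
ax1.hist(alpha_scores, bins=20)
ax1.set_title("Histogram of Alpha scores")

# Box-and-whiskers plot: the box spans the IQR, the inner line marks the median
ax2.boxplot([alpha_scores, beta_scores], labels=["Alpha", "Beta"])
ax2.set_title("Box plot of test scores")

plt.show()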

This is a box plot of the test scores of two teams (Alpha and Beta). The IQR for the
Alpha team is much higher than that of the Beta team, which indicates a larger variation in the
Alpha team's test scores. The y-axis is the percentile range.

The data preparation pipeline consists of the following steps:

1. Access the data.


2. Ingest (or fetch) the data.
3. Cleanse the data.
4. Format the data.
5. Combine the data.
6. And finally, analyze the data.

Access:

There are many sources of business data within any organization. Examples include
endpoint data, customer data, marketing data, and all their associated repositories. This first
essential data preparation step involves identifying the necessary data and its repositories. This is
not simply identifying all possible data sources and repositories, but identifying all that are
applicable to the desired analysis. This means that there must first be a plan that includes the
specific questions to be answered by the data analysis.

Ingest:

Once the data is identified, it needs to be brought into the analysis tools. The data will
likely be some combination of structured and semi-structured data in different types of
repositories. Importing it all into a common repository is necessary for the subsequent steps in
the pipeline. Access and ingest tend to be manual processes with significant variations in exactly
what needs to be done. Both data preparation steps require a combination of business and IT
expertise and are therefore best done by a small team. This step is also the first opportunity for
data validation.

Cleanse:

Cleansing the data ensures that the data set can provide valid answers when the data is
analyzed. This step could be done manually for small data sets but requires automation for most
realistically sized data sets. There are software tools available for this processing. If custom
processing is needed, many data engineers rely on applications coded in Python. There are many
different problems possible with the ingested data. There could be missing values, out-of-range
values, nulls, and whitespaces that obfuscate values, as well as outlier values that could skew
analysis results. Outliers are particularly challenging when they are the result of combining two
or more variables in the data set. Data engineers need to plan carefully for how they are going to
cleanse their data.
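A small pandas sketch of typical cleansing fixes; the file name, column names, and the valid range are assumptions for illustration only:

import pandas as pd
import numpy as np

df = pd.read_csv("sales.csv")  # hypothetical ingested file

# Strip whitespace that obfuscates string values
df["region"] = df["region"].str.strip()

# Treat empty strings as missing and drop rows missing the key field
df = df.replace("", np.nan).dropna(subset=["order_id"])

# Remove out-of-range values (assumed valid range: non-negative amounts)
df = df[df["amount"] >= 0]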

Format:

Once the data set has been cleansed, it needs to be formatted. This step includes resolving
issues like multiple date formats in the data or inconsistent abbreviations. It is also possible that
some data variables are not needed for the analysis and should therefore be deleted from the
analysis data set. This is another data preparation step that will benefit from automation.
Cleansing and formatting steps should be saved into a repeatable recipe that data scientists or
engineers can apply to similar data sets in the future. For example, a monthly analysis of sales
and support data would likely have the same sources that need the same cleansing and formatting
steps each month.
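A sketch of such a repeatable formatting recipe in pandas; the column names and the abbreviation mapping are hypothetical:

import pandas as pd

def format_sales(df: pd.DataFrame) -> pd.DataFrame:
    # Normalize mixed date formats into a single datetime column
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

    # Resolve inconsistent abbreviations
    df["state"] = df["state"].replace({"Calif.": "CA", "California": "CA"})

    # Drop variables that are not needed for this analysis
    return df.drop(columns=["internal_notes"])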

Combine:

When the data set has been cleansed and formatted, it may be transformed by merging,
splitting, or joining the input sets. Once the combining step is complete, the data is ready to be
moved to the data warehouse staging area. Once data is loaded into the staging area, there is a
second opportunity for validation.
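A minimal pandas sketch of the combining step; the file names and join keys are hypothetical:

import pandas as pd

sales = pd.read_csv("sales_clean.csv")
support = pd.read_csv("support_clean.csv")
regions = pd.read_csv("regions.csv")

# Join sales and support data on a shared key, keeping all sales rows
combined = sales.merge(support, on="customer_id", how="left")

# Enrich with region attributes via another join
combined = combined.merge(regions, on="region_id", how="left")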

Analyze:

Once the analysis has begun, changes to the data set should only be made with careful
consideration. During analysis, algorithms are often adjusted and compared to other results.
Changes to the data can skew analysis results and make it impossible to determine whether the
different results are caused by changes to the data or the algorithms.

2.2 READING DATA FROM VARIOUS SOURCES:

The data which is collected initially is known as raw data; it is not directly useful until the
impurities are cleaned and the data is used for further analysis, which produces information. The
insight obtained from this information is known as "knowledge". Knowledge has many forms,
such as business knowledge about sales of enterprise products, knowledge about disease
treatment, etc. The main goal of data collection is to collect information-rich data. Data collection
starts with asking some questions, such as what type of data is to be collected and what the source
of collection is. Most of the data collected is of two types: "qualitative data", which is a group of
non-numerical data such as words and sentences and mostly focuses on the behavior and actions
of a group, and "quantitative data", which is in numerical form and can be calculated using
different scientific tools and sampling of data. The actual data is then further divided mainly into
two types, known as:

1. Primary data
2. Secondary data

1. Primary data:

The data which is raw, original, and extracted directly from the official sources is known
as primary data. This type of data is collected directly by performing techniques such as
questionnaires, interviews, and surveys. The data collected must be according to the demand and
requirements of the target audience on which the analysis is performed; otherwise it would be a
burden in the data processing. A few methods of collecting primary data are:

i. Interview method:
The data collected during this process is gathered through interviewing the target audience by a
person called the interviewer, and the person who answers the interview is known as the
interviewee. Some basic business or product related questions are asked and noted down
in the form of notes, audio, or video, and this data is stored for processing. Interviews can be
both structured and unstructured, like personal interviews or formal interviews through
telephone, face to face, email, etc.

ii. Survey method:


The survey method is the process of research where a list of relevant questions is asked
and answers are noted down in the form of text, audio, or video. The survey method can
be carried out in both online and offline mode, for example through website forms and email.
The survey answers are then stored for analyzing the data. Examples are online surveys or
surveys through social media polls.

iii. Observation method:


The observation method is a method of data collection in which the researcher keenly
observes the behavior and practices of the target audience using some data collecting tool

and stores the observed data in the form of text, audio, video, or any raw format. In this
method, the data is collected through direct observation rather than by posing questions to the
participants. For example, observing a group of customers and their behavior towards the
products. The data obtained will be sent for processing.

iv. Experimental method:


The experimental method is the process of collecting data through performing
experiments, research, and investigation. The most frequently used experiment methods
are CRD, RBD, LSD, FD.

a) CRD- Completely Randomized design


b) RBD- Randomized Block Design
c) LSD – Latin Square Design
d) FD- Factorial design

2. Secondary data:

Secondary data is data which has already been collected and is reused for some other
valid purpose. This type of data was previously derived from primary data, and it has two types
of sources, namely internal sources and external sources.

Internal source:

These types of data can easily be found within the organization, such as market records,
sales records, transactions, customer data, accounting resources, etc. The cost and time
required to obtain data from internal sources are low.

External source:

Data which cannot be found within the organization and must be gained through
external third-party resources is external source data. The cost and time required are higher
because such sources contain huge amounts of data. Examples of external sources are
government publications, news publications, the Registrar General of India, the Planning
Commission, the International Labour Bureau, syndicate services, and other non-governmental
publications.

Data from multiple sources is integrated into a common source known as a Data
Warehouse. Let's discuss the types of sources from which data can be mined:

1. Flat Files
2. Relational Databases
3. Data Warehouse
4. Transactional Databases
5. Multimedia Databases
6. Spatial Databases
7. Time Series Databases
8. World Wide Web(WWW)

1. Flat Files:

Flat files are defined as data files in text form or binary form with a structure that can
be easily extracted by data mining algorithms. Data stored in flat files has no relationships or
paths among the records; for example, if a relational database is exported to a flat file, the
relations between the tables are lost. Flat files are described by a data dictionary. Eg: CSV file.
Application: Used in data warehousing to store data, used for carrying data to and from a
server, etc.
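Reading such a flat file into an analysis tool is typically a one-liner; a pandas sketch with a hypothetical file name (postal codes are read as strings, echoing the earlier caveat about numeric-looking codes):

import pandas as pd

# Read a comma-separated flat file; treat blanks and "NA" as missing values
df = pd.read_csv("customers.csv", na_values=["", "NA"], dtype={"postal_code": str})
print(df.head())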

2. Relational Databases

A Relational database is defined as the collection of data organized in tables with rows
and columns. Physical schema in Relational databases is a schema which defines the structure
of tables. Logical schema in Relational databases is a schema which defines the relationship
among tables. Eg: SQL.
Application: Data Mining, ROLAP model, etc.

3. Data Warehouse

A data warehouse is defined as a collection of data integrated from multiple sources
that is used for querying and decision making. There are three types of data warehouse:
Enterprise data warehouse, Data Mart and Virtual Warehouse. Two approaches can be used to
update data in a data warehouse: the Query-driven Approach and the Update-driven Approach.
Application: Business decision making, Data mining, etc.

4. Transactional Databases

A transactional database is a collection of data organized by time stamps, dates, etc. to
represent transactions in databases. This type of database has the capability to roll back or undo
an operation when a transaction is not completed or committed. It is a highly flexible system in
which users can modify information without changing any sensitive information.
Application: Banking, Distributed systems, Object databases, etc.

5. Multimedia Databases

Multimedia databases consist of audio, video, images and text media. They can be stored
in Object-Oriented Databases. They are used to store complex information in pre-specified
formats.
Application: Digital libraries, video-on demand, news-on demand, musical database, etc.

6. Spatial Database

It stores geographical information. It stores the data in the form of coordinates,


topology, lines, polygons, etc.
Application: Maps, Global positioning, etc.

7. Time-series Databases

It contains data such as stock exchange data and user-logged activities. It handles arrays of
numbers indexed by time, date, etc. It requires real-time analysis.
Application: eXtremeDB, Graphite, InfluxDB, etc.

8. WWW

WWW refers to the World Wide Web, a collection of documents and resources such as audio,
video, text, etc., which are identified by Uniform Resource Locators (URLs) through web
browsers, linked by HTML pages, and accessible via the Internet. It is the most
heterogeneous repository as it collects data from multiple resources. It is dynamic in nature as
the volume of data is continuously increasing and changing.
Application: Online shopping, Job search, Research, studying, etc.

Challenges:

1. Heterogeneous Data: Different data sources may store data in different ways, using
different data formats. For example, you may need to take data from files, web APIs, e-
commerce databases, CRM systems and more. What’s more, this information may be
structured, semi-structured, or unstructured data.
2. Data Integrations: Each data source you use needs to be integrated with the larger
integration workflow. Not only is this a complex and technically demanding undertaking,
but it can also impact data integration success if the structure of the underlying data source
changes. What’s more, the problem scales as you add more data sources.
3. Scalability: As your business grows, the problem of how to get data from multiple sources
intensifies. If you don’t plan for efficiency and scalability, however, this can slow down the
integration process.
4. Data Duplication: Multiple sources may have the same data, requiring you to detect and
remove these duplicates. Even worse, the sources may not agree with each other, forcing
you to figure out which of them is correct.
5. Transformation Rules: Solving the problem of duplicate and conflicting data requires you
to have well-defined, robust transformation rules. Again, defining clear transformation
rules will help you automate the vast majority of the processes involved in how to get data
from multiple sources. As you get more familiar with data integration, you’ll get a better
sense of which kinds of reconciliations need to be performed over your data sources.

2.3 DATA VISUALIZATION:

Data visualization is a set of data points and information that are represented
graphically to make them easy and quick for users to understand. Data visualization is good if it has
a clear meaning, purpose, and is very easy to interpret, without requiring context. Tools of data
visualization provide an accessible way to see and understand trends, outliers, and patterns in
data by using visual effects or elements such as a chart, graphs, and maps.

Key Characteristics of Effective Graphical Visual:

 It shows or visualizes data very clearly in an understandable manner.


 It encourages viewers to compare different pieces of data.
 It closely integrates statistical and verbal descriptions of data set.
 It grabs our interest, focuses our mind, and keeps our eyes on the message, as the human
brain tends to focus on visual data more than written data.
 It also helps in identifying areas that need more attention and improvement.
 Using graphical representation, a story can be told more efficiently. Also, it requires
less time to understand a picture than it takes to understand textual data.

Categories of Data Visualization:

Data visualization is very critical to market research, where both numerical and
categorical data can be visualized; this helps to increase the impact of insights and also helps
in reducing the risk of analysis paralysis. Data visualization is categorized into the following
categories:

1. Numerical Data:

Numerical data is also known as quantitative data. Numerical data is any data that
generally represents an amount, such as the height, weight, or age of a person. Numerical data
visualization is the easiest way to visualize data. It is generally used to help others digest
large data sets and raw numbers in a way that makes them easier to interpret into action.
Numerical data is categorized into two categories:

i. Continuous Data – Data that can take any value within a range (Example: Height
measurements).
ii. Discrete Data – Data that can take only distinct, separate values and is not "continuous"
(Example: Number of cars or children a household has).

The types of visualization techniques that are used to represent numerical data
are charts and numerical values. Examples are Pie Charts, Bar Charts, Averages,
Scorecards, etc.

2. Categorical Data :

Categorical data is also known as Qualitative data. Categorical data is any data where
data generally represents groups. It simply consists of categorical variables that are used to
represent characteristics such as a person’s ranking, a person’s gender, etc. Categorical data
visualization is all about depicting key themes, establishing connections, and lending context.
Categorical data is classified into three categories:

 Binary Data – In this, classification is based on two mutually exclusive values (Example:
Agree or Disagree).
 Nominal Data – In this, classification is based on attributes with no inherent order
(Example: Male or Female).
 Ordinal Data – In this, classification is based on the ordering of information (Example:
Timeline or processes).

The types of visualization techniques that are used to represent categorical data are
graphics, diagrams, and flowcharts. Examples are Word Clouds, Sentiment Mapping, Venn
Diagrams, etc.

Advantages:

Data visualization is another form of visual art that grabs our interest and keeps our
eyes on the message. When we see a chart, we quickly see trends and outliers. If we can see
something, we internalize it quickly. It’s storytelling with a purpose. If you’ve ever stared at a

massive spreadsheet of data and couldn’t see a trend, you know how much more effective a
visualization can be. Some advantages of data visualization include:

i. Easily sharing information.


ii. Interactively explore opportunities.
iii. Visualize patterns and relationships.

Disadvantages:

While there are many advantages, some of the disadvantages may seem less obvious.
For example, when viewing visualization with many different data points, it’s easy to make an
inaccurate assumption. Or sometimes the visualization is just designed wrong so that it’s
biased or confusing. Some disadvantages of data visualization include:

 Biased or inaccurate information.


 Correlation doesn’t always mean causation.
 Core messages can get lost in translation.

Types of data visualizations:

 Tables: This consists of rows and columns used to compare variables. Tables can show
a great deal of information in a structured way, but they can also overwhelm users that
are simply looking for high-level trends.
 Pie charts and stacked bar charts: These graphs are divided into sections that
represent parts of a whole. They provide a simple way to organize data and compare the
size of each component to one another.
 Line charts and area charts: These visuals show change in one or more quantities by
plotting a series of data points over time and are frequently used within predictive
analytics. Line graphs utilize lines to demonstrate these changes while area charts
connect data points with line segments, stacking variables on top of one another and
using color to distinguish between variables.
 Histograms: This graph plots a distribution of numbers using a bar chart (with no
spaces between the bars), representing the quantity of data that falls within a particular

range. This visual makes it easy for an end user to identify outliers within a given
dataset.
 Scatter plots: These visuals are beneficial in revealing the relationship between two
variables, and they are commonly used within regression data analysis. However, these
can sometimes be confused with bubble charts, which are used to visualize three
variables via the x-axis, the y-axis, and the size of the bubble.
 Heat maps: These graphical representations are helpful in visualizing
behavioral data by location. This can be a location on a map, or even a webpage.
 Tree maps: These display hierarchical data as a set of nested shapes, typically
rectangles. Tree maps are great for comparing the proportions between categories via
their area size.

Different Types of Analysis for Data Visualization:

Mainly, there are three different types of analysis for Data Visualization:

i. Univariate Analysis: In the univariate analysis, we will be using a single feature to


analyze almost all of its properties.
ii. Bivariate Analysis: When we compare the data between exactly 2 features then it is
known as bivariate analysis.
iii. Multivariate Analysis: In the multivariate analysis, we will be comparing more than 2
variables.

2.4 DISTRIBUTIONS AND SUMMARY STATISTICS:

For data preprocessing to be successful, it is essential to have an overall picture of your


data. Basic statistical descriptions can be used to identify properties of the data and highlight
which data values should be treated as noise or outliers. A statistic can be defined as a summary
of a sample. This section discusses three areas of basic statistical descriptions. We start with
measures of central tendency, which measure the location of the middle or center of a data
distribution. Intuitively speaking, given an attribute, where do most of its values fall? In
particular, we discuss the mean, median, mode, and midrange. In addition to assessing the central
tendency of our data set, we also would like to have an idea of the dispersion of the data. That is,

how are the data spread out? The most common data dispersion measures are the range,
quartiles, and interquartile range; the five-number summary and box plots; and the variance and
standard deviation of the data. These measures are useful for identifying outliers.

Statistics is broadly divided into descriptive statistics (graphs and visualizations; measures of
central tendency, spread, and position) and statistical inference (hypothesis testing and
estimation).

2.4.1. Measuring the Central Tendency: Mean, Median, Mode and Midrange:
The most common and effective numeric measure of the “center” of a set of data is the
(arithmetic) mean. Let x1, x2, …, xN be a set of N values or observations, such as for some
numeric attribute X, like salary. The mean of this set of values is represented as

x̄ = (x1 + x2 + … + xN) / N

Example: Mean
Suppose we have the following values for salary (in thousands of dollars), shown in
increasing order: X = 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.

x̄ = (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110) / 12 = 696 / 12 = 58

Thus, the mean salary is $58,000.

Example: Median
Let’s find the median of the data from above mentioned Example. The data are already
sorted in increasing order. There is an even number of observations (i.e., 12); therefore, the
median is not unique. It can be any value within the two middlemost values of 52,000 and 56,000
(that is, within the sixth and seventh values in the list). By convention, we assign the average of
the two middlemost values as the median, that is, the average of 52,000 and 56,000:
Median = (52,000 + 56,000) / 2
       = 108,000 / 2
       = 54,000
Thus, the median is $54,000.

Example: Mode

The mode for a set of data is the value that occurs most frequently in the set. Therefore, it
can be determined for qualitative and quantitative attributes. The data from above mentioned
example are bimodal. The two modes are $52,000 and $70,000.

Example: Midrange

The midrange can also be used to assess the central tendency of a numeric data set. It is
the average of the largest and smallest values in the set. The midrange of the data of above
mentioned Example is
= (30,000 + 110,000) / 2
= $70,000.
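The four measures for the salary data above can be reproduced with Python's standard statistics module (a small sketch; multimode returns both modes of this bimodal data set):

import statistics

salary = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in thousands

print(statistics.mean(salary))           # 58   -> $58,000
print(statistics.median(salary))         # 54   -> $54,000
print(statistics.multimode(salary))      # [52, 70] -> $52,000 and $70,000
print((min(salary) + max(salary)) / 2)   # midrange: 70 -> $70,000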

2.4.2. Measuring the Dispersion of Data:

To assess the dispersion or spread of numeric data, the measures include range, quantiles,
quartiles, percentiles, and the interquartile range are used. The five-number summary, which can
be displayed as a box plot, is useful in identifying outliers. Variance and standard deviation also
indicate the spread of a data distribution.
i. Range
ii. Quartiles
iii. Variance

iv. Standard Deviation
v. Interquartile Range
vi. five-number summary
vii. Box plots
i. Range:

Let X1, X2, …, XN be a set of observations for some numeric attribute, X. The range of
the set is the difference between the largest and smallest values.
Largest (X) = $110,000
Smallest (X) = $30,000
Range = 110,000 – 30,000
      = $80,000
ii. Quantile:

Suppose that the data for attribute X are sorted in increasing numeric order. Imagine that
we can pick certain data points so as to split the data distribution into equal-size consecutive sets,
as mentioned in the above figure. These data points are called quantiles. Quantiles are points
taken at regular intervals of a data distribution, dividing it into essentially equal size consecutive
sets. The 100-quantiles are more commonly referred to as percentiles; they divide the data
distribution into 100 equal-sized consecutive sets. The median, quartiles, and percentiles are the
most widely used forms of quantiles.

iii. Quartiles:

The 2-quantile is the data point dividing the lower and upper halves of the data

distribution. It corresponds to the median. The 4-quantiles are the three data points that split the
data distribution into four equal parts; each part represents one-fourth of the data distribution.
They are more commonly referred to as quartiles. The quartiles give an indication of a
distribution’s center, spread, and shape. The first quartile, denoted by Q1, is the 25th percentile.
It cuts off the lowest 25% of the data. The third quartile, denoted by Q3, is the 75th percentile. It
cuts off the lowest 75% (or highest 25%) of the data. The second quartile is the 50th percentile.
As the median, it gives the center of the data distribution.

iv. Inter Quartile Range (IQR):

The distance between the first and third quartiles is a simple measure of spread that gives
the range covered by the middle half of the data. This distance is called the Inter Quartile
Range (IQR) and is defined as,

IQR = Q3 – Q1

= 63,000 – 47,000
IQR = $16,000.

v. Five-Number Summary:

Since Q1, the median, and Q3 together contain no information about the endpoints (e.g., tails)
of the data, a fuller summary of the shape of a distribution can be obtained by providing the
lowest and highest data values as well. This is known as the Five-Number Summary (FNS). The
Five-Number Summary of a distribution consists of the median (Q2), the quartiles Q1 and Q3,
and the smallest and largest individual observations, written in the order of Minimum, Q1,
Median, Q3 and Maximum.

vi. Box Plots:

Box plots are a popular way of visualizing a distribution. A box plot incorporates the
five-number summary as follows:

FNS (Branch 1) = {40, 60, 80, 100, 150}
FNS (Branch 2) = {60, 80, 100, 160, 180}
FNS (Branch 3) = {40, 60, 140, 160, 190}
FNS (Branch 4) = {30, 40, 80, 120, 130}

 Typically, the ends of the box are at the quartiles so that the box length is the interquartile
range.
 The median is marked by a line within the box.
 Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest
(Maximum) observations.
vii. Variance and Standard Deviation

Variance and standard deviation are measures of data dispersion. They indicate how
spread out a data distribution is. A low standard deviation means that the data observations tend
to be very close to the mean, while a high standard deviation indicates that the data are spread
out over a large range of values. The variance of N observations, x1, x2, …, xN, for a numeric
attribute X is

σ² = (1/N) Σ (xi − x̄)² = [ (1/N) Σ xi² ] − x̄²

where x̄ is the mean value of the observations, as defined in the above equation. The standard
deviation (σ) of the observations is the square root of the variance. We already found,
Mean (x̄) = $58,000
N = 12
Therefore,
σ² = (1/12)(30² + 36² + 47² + 50² + 52² + 52² + 56² + 60² + 63² + 70² + 70² + 110²) − 58²
   ≈ 379.17
σ ≈ √379.17 ≈ 19.47, that is, a standard deviation of about $19,470.

2.4.3. Graphic displays of Basic Statistical descriptions of data:

i. Quantile Plot
ii. Quantile – Quantile Plot (q-q Plot)
iii. Histograms
iv. Scatter Plots
i. Quantile Plot:

Suppose that the data for attribute X are sorted in increasing numeric order. Imagine that
we can pick certain data points so as to split the data distribution into equal-size consecutive sets,
as mentioned in the above figure. These data points are called quantiles. Quantiles are points
taken at regular intervals of a data distribution, dividing it into essentially equal size consecutive
sets. The 100-quantiles are more commonly referred to as percentiles; they divide the data
distribution into 100 equal-sized consecutive sets. The median, quartiles, and percentiles are the
most widely used forms of quantiles. A quantile plot is a simple and effective way to have a
first look at a univariate data distribution.

ii. Quantile – Quantile Plot (q-q Plot):

A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate distribution

against the corresponding quantiles of another. It is a powerful visualization tool in that it allows
the user to view whether there is a shift in going from one distribution to another.

iii. Histograms:

Histograms (or frequency histograms) are at least a century old and are widely used.
“Histos” means pole or mast, and “gram” means chart, so a histogram is a chart of poles. Plotting
histograms is a graphical method for summarizing the distribution of a given attribute, X. If X is
nominal, then a pole or vertical bar is drawn for each known value of X. The height of the bar
indicates the frequency (i.e., count) of that X value. The resulting graph is more commonly
known as a bar chart.

iv. Scatter Plot:

A scatter plot is one of the most effective graphical methods for determining if there
appears to be a relationship, pattern, or trend between two numeric attributes. To construct a
scatter plot, each pair of values is treated as a pair of coordinates in an algebraic sense and
plotted as points in the plane. The scatter plot is a useful method for providing a first look at
bivariate data to see clusters of points and outliers, or to explore the possibility of correlation
relationships. Two attributes, X and Y, are correlated if one attribute implies the other. It is
used to identify positive, negative, and null correlation between the attributes. Scatter plots can
be extended to n attributes, resulting in a scatter-plot matrix.

2.5 RELATIONSHIPS AMONG VARIABLES:

Given two attributes, correlation analysis can measure how strongly one attribute implies
the other, based on the available data. For nominal data, we use the χ² (chi-square) test. For
numeric attributes, we can use the correlation coefficient and covariance, both of which assess
how one attribute's values vary from those of another.

Correlation Test for Nominal Data:

For nominal data, a correlation relationship between two attributes, A and B, can be
discovered by a χ² (chi-square) test. Suppose A has r distinct values a1, …, ar and B has c
distinct values b1, …, bc. The χ² value is defined as

χ² = Σi Σj (Oij − eij)² / eij

Where Oij is the observed frequency (i.e., actual count) of the joint event (A = ai, B = bj) and eij
is the expected frequency of the joint event, which can be computed as

eij = ( count(A = ai) × count(B = bj) ) / n

where n is the number of data tuples, count(A = ai) is the number of tuples having value ai for
A, and count(B = bj) is the number of tuples having value bj for B. The sum is computed over
all of the r x c cells. Note that the cells that contribute the most to the χ² value are those for which
the actual count is very different from that expected. The χ² statistic tests the hypothesis that A
and B are independent, that is, there is no correlation between them. The test is based on a
significance level, with (r-1) x (c-1) degrees of freedom. If the hypothesis can be rejected, then
we say that A and B are statistically correlated.

Example: Are gender and preferred reading correlated?

Each person was polled as to whether his or her preferred type of reading material was
fiction or nonfiction. Thus, we have two attributes, gender and preferred reading. The observed
frequency (or count) of each possible joint event is summarized in the contingency table shown
below, where the numbers in parentheses are the expected frequencies.

                fiction        non-fiction     Total
   male         250 (90)       50 (210)        300
   female       200 (360)      1000 (840)      1200
   Total        450            1050            1500

The expected frequencies are calculated based on the data distribution for both attributes using
the expected frequency formula eij = ( count(A = ai) × count(B = bj) ) / n. For example,

Expected frequency for the cell (male, fiction) is

= (300*450) / 1500 = 90

Expected frequency for the cell (Female, fiction) is

= (1200*450) / 1500 = 360

Expected frequency for the cell (male, non-fiction) is

= (300*1050) / 1500 = 210

Expected frequency for the cell (female, non-fiction) is

= (1200*1050) / 1500 = 840.


Finally,

χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840
   = 284.44 + 121.90 + 71.11 + 30.48 = 507.93.

For this 2x2 table, the degrees of freedom are (2 − 1) × (2 − 1) = 1. For 1 degree of freedom,
the χ² value needed to reject the hypothesis at the 0.001 significance level is 10.828.
Since our computed value of 507.93 is above this, we can reject the hypothesis that gender and preferred
reading are independent and conclude that the two attributes are (strongly) correlated for the
given group of people.
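The same test can be run with scipy.stats.chi2_contingency on the observed counts from the contingency table above; correction=False turns off the Yates continuity correction so that the result matches the hand calculation (a sketch, not part of the original example):

import numpy as np
from scipy.stats import chi2_contingency

#                     fiction  non-fiction
observed = np.array([[250,      50],      # male
                     [200,    1000]])     # female

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ~507.93
print(dof)       # 1
print(p)         # far below 0.001 -> reject independence
print(expected)  # [[ 90. 210.] [360. 840.]]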

Correlation Coefficient for Numeric Data:

For numeric attributes, we can evaluate the correlation between two attributes, A and B,
by computing the correlation coefficient (also known as Pearson’s product moment
coefficient, named after its inventor, Karl Pearson). This is

r(A, B) = [ Σi (ai − Ā)(bi − B̄) ] / (n σA σB) = [ (Σi ai bi) − n Ā B̄ ] / (n σA σB)

where n is the number of tuples, ai and bi are the respective values of A and B in tuple i, Ā and B̄
are the respective mean values of A and B, σA and σB are the respective standard deviations of A
and B, and Σi ai bi is the sum of the AB cross-product (i.e., for each tuple, the
value for A is multiplied by the value for B in that tuple). Note that −1 ≤ r(A, B) ≤ +1. If r(A, B) is
greater than 0, then A and B are positively correlated, meaning that the values of A increase as
the values of B increase. The higher the value, the stronger the correlation (i.e., the more each
attribute implies the other). Hence, a higher value may indicate that A (or B) may be removed as
a redundancy. If the resulting value is equal to 0, then A and B are independent and there is no
correlation between them. If the resulting value is less than 0: then A and B are negatively
correlated where the values of one attribute increase as the values of the other attribute decrease.
This means that each attribute discourages the other. Scatter plots can also be used to view
correlations between attributes.

Note that correlation does not imply causality. That is, if A and B are correlated, this does
not necessarily imply that A causes B or that B causes A. For example, in analyzing a
demographic database, we may find that attributes representing the number of hospitals and the
number of car thefts in a region are correlated. This does not mean that one causes the other.
Both are actually causally linked to a third attribute, namely, population.
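A minimal NumPy sketch of Pearson's correlation coefficient on two hypothetical attributes:

import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # hypothetical attribute A
b = np.array([1.5, 3.9, 6.1, 7.8, 10.2])   # hypothetical attribute B

# Pearson's product moment coefficient, r in [-1, +1]
r = np.corrcoef(a, b)[0, 1]
print(r)   # close to +1 -> strongly positively correlated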

Covariance of Numeric Data:

In probability theory and statistics, correlation and covariance are two similar measures
for assessing how much two attributes change together. Consider two numeric attributes A and
B, and a set of n observations {(a1, b1), …, (an, bn)}. The mean values of A and B,
respectively, are also known as the expected values of A and B, that is,

E(A) = Ā = (Σi ai) / n

and

E(B) = B̄ = (Σi bi) / n

The covariance between A and B is defined as

Cov(A, B) = E((A − Ā)(B − B̄)) = [ Σi (ai − Ā)(bi − B̄) ] / n

If we compare r(A, B) (the correlation coefficient) and Cov(A, B), we see that

r(A, B) = Cov(A, B) / (σA σB)

where σA and σB are the standard deviations of A and B, respectively. It can also be shown that

Cov(A, B) = E(A·B) − Ā B̄

This equation may simplify calculations. For two attributes A and B that tend to change
together, if A is larger than Ā (the expected value of A), then B is likely to be larger than B̄ (the
expected value of B). Therefore, the covariance between A and B is positive. On the other hand,
if one of the attributes tends to be above its expected value when the other attribute is below its
expected value, then the covariance of A and B is negative. If A and B are independent (i.e.,
they are not correlated), then E(A·B) = E(A)·E(B). Therefore, Cov(A, B) = E(A·B) − Ā B̄ =
E(A)·E(B) − Ā B̄ = 0. However, the converse is not true. Some pairs of random variables
(attributes) may have a covariance of 0 but are not independent; only under some additional
assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0
imply independence.

Example: Covariance analysis of numeric attributes. Consider the table below, which presents a
simplified example of stock prices for two companies, A and B, observed at five time points. If
the stocks are affected by the same industry trends, will their prices
example of stock prices. If the stocks are affected by the same industry trends, will their prices
rise or fall together? Variance is a special case of covariance, where the two attributes are
identical (i.e., the covariance of an attribute with itself).

A B
6 20
5 10
4 14
3 5
2 5

E(A) = Ā = (6 + 5 + 4 + 3 + 2) / 5 = 20 / 5 = 4

E(B) = B̄ = (20 + 10 + 14 + 5 + 5) / 5 = 54 / 5 = 10.8

Cov(A, B) = E(A·B) − Ā B̄
          = (6×20 + 5×10 + 4×14 + 3×5 + 2×5) / 5 − (4 × 10.8)
          = 50.2 − 43.2 = 7

Therefore, it is positive Covariance. Given the positive covariance we can say that stock
prices for both A and B rise together.
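The stock-price example can be checked with NumPy; bias=True requests the population covariance (division by n), matching the hand calculation above:

import numpy as np

A = np.array([6, 5, 4, 3, 2])
B = np.array([20, 10, 14, 5, 5])

cov_matrix = np.cov(A, B, bias=True)   # divide by n rather than n - 1
print(cov_matrix[0, 1])                # 7.0 -> positive covariance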

2.6 EXTENT OF MISSING DATA

Missing data are errors because your data don’t represent the true values of what you set
out to measure. The reason for the missing data is important to consider, because it helps you
determine the type of missing data and what you need to do about it.

Types of missing data:

i. Missing Completely at Random (MCAR)


ii. Missing at Random (MAR)
iii. Missing not at Random (MNAR)

Type Definition

MCAR   Missing data are randomly distributed across the variable and unrelated to
       other variables. Ex: Skipping a question that does not apply to you.
MAR    Missing data are not randomly distributed, but they are accounted for by other
       observed variables. Ex: Declining to answer a few questions due to privacy concerns.
MNAR   Missing data systematically differ from the observed values. Ex: Purposefully
       ignoring a question even though it is mandatory.
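Before deciding how to handle missing values, it helps to quantify their extent; a small pandas sketch with a hypothetical file name:

import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical survey responses

missing_count = df.isna().sum()                 # missing values per column
missing_pct = df.isna().mean().round(3) * 100   # percentage missing per column
print(pd.concat([missing_count, missing_pct], axis=1,
                keys=["n_missing", "pct_missing"]))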

2.7 SEGMENTATION

Clustering is also called data segmentation in some applications because clustering


partitions large data sets into groups according to their similarity. Clustering can also be used for

outlier detection, where outliers (values that are “far away” from any cluster) may be more
interesting than common cases. Applications of outlier detection include the detection of credit
card fraud and the monitoring of criminal activities in electronic commerce. For example,
exceptional cases in credit card transactions, such as very expensive and infrequent purchases,
may be of interest as possible fraudulent activities.

Clustering techniques consider data tuples as objects. They partition the objects into
groups, or clusters, so that objects within a cluster are “similar” to one another and “dissimilar”
to objects in other clusters. Similarity is commonly defined in terms of how “close” the objects
are in space, based on a distance function. The “quality” of a cluster may be represented by its
diameter, the maximum distance between any two objects in the cluster. Centroid distance is an
alternative measure of cluster quality and is defined as the average distance of each cluster object
from the cluster centroid. In data reduction, the cluster representations of the data are used to
replace the actual data. The effectiveness of this technique depends on the data’s nature. It is
much more effective for data that can be organized into distinct clusters than for smeared data.
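A minimal scikit-learn sketch of clustering-based segmentation; the feature matrix is a random stand-in, and the number of clusters is an assumption:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(2).random((500, 3))   # stand-in: 500 records, 3 numeric attributes

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_                  # segment (cluster) assignment per record
centroids = kmeans.cluster_centers_      # cluster representations of the data

# Average distance of each object from its cluster centroid (centroid distance)
dists = np.linalg.norm(X - centroids[labels], axis=1)
print(dists.mean())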

This use of clustering for data reduction is also known as a numerosity reduction technique,
because it replaces the original data volume with smaller forms of representation. Segmentation
is the process of taking the data you hold, dividing it up, and grouping similar data together
based on the chosen parameters. Some examples are presented here:

 Probability based model


 Dimensionality reduction method
 Constraints model

2.8 OUTLIER DETECTION

Outlier detection, also called novelty detection or anomaly detection, is the
process of finding data objects with behaviors that are very different from expectation. Such
objects are called outliers or anomalies.

An outlier is a data object that deviates significantly from the rest of the objects, shown as
region R in the figure above. Outlier detection is important in many applications in
addition to fraud detection such as medical care, public safety and security, industry damage
detection, image processing, sensor/video network surveillance, and intrusion detection. Outlier
detection and clustering analysis are two highly related tasks. Clustering finds the majority
patterns in a data set and organizes the data accordingly, whereas outlier detection tries to
capture those exceptional cases that deviate substantially from the majority patterns. Outlier
detection and clustering analysis serve different purposes.

An outlier is a data object that deviates significantly from the rest of the objects, as if it
were generated by a different mechanism. For ease of presentation, we may refer to data objects
that are not outliers as “normal” or expected data. Similarly, we may refer to outliers as
“abnormal” data. Outliers are different from noisy data. Noise is a random error or variance in a
measured variable. In general, noise is not interesting in data analysis, including outlier
detection. For example, in credit card fraud detection, a customer’s purchase behavior can be
modeled as a random variable. A customer may generate some “noise transactions” that may
seem like “random errors” or “variance,” such as by buying a bigger lunch one day, or having
one more cup of coffee than usual. Such transactions should not be treated as outliers; otherwise,
the credit card company would incur heavy costs from verifying that many transactions. The
company may also lose customers by bothering them with multiple false alarms. As in many
other data analysis and data mining tasks, noise should be removed before outlier detection.

Outliers are interesting because they are suspected of not being generated by the same
mechanisms as the rest of the data. Therefore, in outlier detection, it is important to justify why
the outliers detected are generated by some other mechanisms. This is often achieved by making

various assumptions on the rest of the data and showing that the outliers detected violate those
assumptions significantly. Outlier detection is also related to novelty detection in evolving data
sets. For example, by monitoring a social media web site where new content is incoming, novelty
detection may identify new topics and trends in a timely manner. Novel topics may initially
appear as outliers. To this extent, outlier detection and novelty detection share some similarity in
modeling and detection methods. However, a critical difference between the two is that in
novelty detection, once new topics are confirmed, they are usually incorporated into the model of
normal behavior so that follow-up instances are not treated as outliers.

Types of Outliers:

1. Global Outliers
2. Contextual or Conditional Outliers
3. Collective Outliers

1. Global Outliers:

In a given data set, a data object is a global outlier if it deviates significantly from the
rest of the data set. Global outliers are sometimes called point anomalies, and are the simplest
type of outliers. Most outlier detection methods are aimed at finding global outliers. Global
outlier detection is important in many applications. Consider intrusion detection in computer
networks, for example. If the communication behavior of a computer is very different from the
normal patterns, this behavior may be considered a global outlier and the corresponding
computer is a suspected victim of hacking. As another example, in trading transaction auditing
systems, transactions that do not follow the regulations are considered as global outliers and
should be held for further examination.
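A simple sketch of flagging global outliers in a single numeric attribute using the three-standard-deviation rule mentioned earlier (the sample and injected extremes are hypothetical; real detectors are usually more elaborate):

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(50, 5, 1000)          # stand-in attribute
x = np.append(x, [120, -40])         # inject two extreme values

mu, sigma = x.mean(), x.std()
z_scores = (x - mu) / sigma

global_outliers = x[np.abs(z_scores) > 3]   # beyond three standard deviations
print(global_outliers)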

2. Contextual or Conditional Outliers:

“The temperature today is 28°C. Is it exceptional (i.e., an outlier)?” It depends, for


example, on the time and location! If it is in winter in Toronto, yes, it is an outlier. If it is a
summer day in Toronto, then it is normal. Unlike global outlier detection, in this case, whether or
not today’s temperature value is an outlier depends on the context—the date, the location, and
possibly some other factors. In a given data set, a data object is a contextual outlier if it deviates

significantly with respect to a specific context of the object. Contextual outliers are also known
as conditional outliers because they are conditional on the selected context. Therefore, in
contextual outlier detection, the context has to be specified as part of the problem definition.
Generally, in contextual outlier detection, the attributes of the data objects in question are
divided into two groups:

Contextual attributes: The contextual attributes of a data object define the object’s context. In
the temperature example, the contextual attributes may be date and location.

Behavioral attributes: These define the object’s characteristics, and are used to evaluate
whether the object is an outlier in the context to which it belongs. In the temperature example,
the behavioral attributes may be the temperature, humidity, and pressure. Unlike global outlier
detection, in contextual outlier detection, whether a data object is an outlier depends on not only
the behavioral attributes but also the contextual attributes.

3. Collective Outliers:

Suppose you are a supply-chain manager of a company and you handle thousands of
orders and shipments every day. If the shipment of an order is delayed, it may not be considered
an outlier because, statistically, delays occur from time to time. However, you have to pay
attention if 100 orders are delayed on a single day. Those 100 orders as a whole form an outlier,
although each of them may not be regarded as an outlier if considered individually. You may
have to take a close look at those orders collectively to understand the shipment problem.

Given a data set, a subset of data objects forms a collective outlier if the objects as a
whole deviate significantly from the entire data set. Importantly, the individual data objects may
not be outliers. Collective outlier detection has many important applications. For example, in
intrusion detection, a denial-of-service package from one computer to another is considered
normal and not an outlier at all. However, if several computers keep sending denial-of-service
packages to each other, they as a whole should be considered as a collective outlier. The
computers involved may be suspected of being compromised by an attack. As another example, a
stock transaction between two parties is considered normal. However, a large set of transactions
of the same stock among a small group of parties in a short period are collective outliers because they may
be evidence of some people manipulating the market. Unlike global or contextual outlier
detection, in collective outlier detection we have to consider not only the behavior of individual
objects, but also that of groups of objects. Therefore, to detect collective outliers, we need
background knowledge of the relationship among data objects such as distance or similarity
measurements between objects.

Challenges of Outlier detection:

Outlier detection is useful in many applications yet faces many challenges such as the
following:

1. Modeling normal objects and outliers effectively.


Outlier detection quality highly depends on the modeling of normal (non outlier) objects
and outliers. Often, building a comprehensive model for data normality is very challenging, if
not impossible. This is partly because it is hard to enumerate all possible normal behaviors in an
application. The border between data normality and abnormality (outliers) is often not clear cut.
Instead, there can be a wide range of gray area. Consequently, while some outlier detection
methods assign to each object in the input data set a label of either “normal” or “outlier,” other
methods assign to each object a score measuring the “outlier-ness” of the object.

2. Application-specific outlier detection.

Technically, choosing the similarity/distance measure and the relationship model to


describe data objects is critical in outlier detection. Unfortunately, such choices are often

application-dependent. Different applications may have very different requirements. For
example, in clinic data analysis, a small deviation may be important enough to justify an outlier.
In contrast, in marketing analysis, objects are often subject to larger fluctuations, and
consequently a substantially larger deviation is needed to justify an outlier. Outlier detection’s
high dependency on the application type makes it impossible to develop a universally applicable
outlier detection method. Instead, individual outlier detection methods that are dedicated to
specific applications must be developed.

3. Handling noise in outlier detection.

As mentioned earlier, outliers are different from noise. It is also well known that the
quality of real data sets tends to be poor. Noise often unavoidably exists in data collected in
many applications. Noise may be present as deviations in attribute values or even as missing
values. Low data quality and the presence of noise bring a huge challenge to outlier detection.
They can distort the data, blurring the distinction between normal objects and outliers. Moreover,
noise and missing data may “hide” outliers and reduce the effectiveness of outlier detection—an
outlier may appear “disguised” as a noise point, and an outlier detection method may mistakenly
identify a noise point as an outlier.

4. Understandability.

In some application scenarios, a user may want to not only detect outliers, but also
understand why the detected objects are outliers. To meet the understandability requirement, an
outlier detection method has to provide some justification of the detection. For example, a
statistical method can be used to justify the degree to which an object may be an outlier based on
the likelihood that the object was generated by the same mechanism that generated the majority
of the data. The smaller the likelihood, the more unlikely the object was generated by the same
mechanism, and the more likely the object is an outlier.
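
A hedged sketch of this statistical justification, assuming a roughly Gaussian attribute and the availability of scipy; the attribute values are illustrative. The likelihood of each object under the fitted model quantifies how plausibly it was generated by the same mechanism as the majority of the data.

    # A minimal sketch of likelihood-based outlier justification, assuming a
    # univariate attribute that is roughly Gaussian; values are illustrative.
    import numpy as np
    from scipy import stats

    values = np.array([24.1, 24.5, 23.8, 24.2, 24.0, 23.9, 24.3, 35.0])

    mu, sigma = values.mean(), values.std(ddof=1)
    likelihoods = stats.norm.pdf(values, loc=mu, scale=sigma)

    # The smaller the likelihood under the fitted "normal" mechanism,
    # the stronger the justification for flagging the object as an outlier.
    for v, lik in zip(values, likelihoods):
        print(f"value={v:6.1f}  likelihood={lik:.4f}")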

Outlier Detection Methods:

There are many outlier detection methods in the literature and in practice. Here, we first present outlier detection methods according to whether the sample of data for analysis is given with domain expert–provided labels that can be used to build an outlier detection model. Second, we divide methods into groups according to their assumptions regarding normal objects versus outliers.
1. Supervised Methods
2. Unsupervised Methods
3. Semi Supervised Methods

1. Supervised Methods:

Supervised methods model data normality and abnormality. Domain experts examine
and label a sample of the underlying data. Outlier detection can then be modeled as a
classification problem. The task is to learn a classifier that can recognize outliers. The sample is
used for training and testing. In some applications, the experts may label just the normal objects,
and any other objects not matching the model of normal objects are reported as outliers. Other
methods model the outliers and treat objects not matching the model of outliers as normal.
Although many classification methods can be applied, challenges to supervised outlier detection
include the following:

 The two classes (i.e., normal objects versus outliers) are imbalanced. That is, the
population of outliers is typically much smaller than that of normal objects. Therefore,
methods for handling imbalanced classes may be used, such as oversampling (i.e.,
replicating) outliers to increase their distribution in the training set used to construct the
classifier. Due to the small population of outliers in data, the sample data examined by
domain experts and used in training may not even sufficiently represent the outlier
distribution. The lack of outlier samples can limit the capability of classifiers built as
such. To tackle these problems, some methods "make up" artificial outliers (a minimal sketch of oversampling follows this list).
 In many outlier detection applications, catching as many outliers as possible (i.e., the
sensitivity or recall of outlier detection) is far more important than not mislabeling
normal objects as outliers. Consequently, when a classification method is used for
supervised outlier detection, it has to be interpreted appropriately so as to consider the
application interest on recall.
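
As a minimal, hedged sketch of the oversampling idea mentioned above: the rare outlier class in a labeled training sample is replicated before fitting an ordinary classifier, and the result is judged by its recall on outliers. The synthetic data, classifier choice, and replication factor are illustrative assumptions, not part of the original text.

    # Supervised outlier detection with oversampling of the rare outlier class.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X_normal = rng.normal(0.0, 1.0, size=(980, 2))
    X_outlier = rng.normal(4.0, 1.0, size=(20, 2))      # rare labeled outliers
    X = np.vstack([X_normal, X_outlier])
    y = np.array([0] * 980 + [1] * 20)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)

    # Oversample (replicate) the outlier class in the training set only.
    out_idx = np.where(y_train == 1)[0]
    extra = rng.choice(out_idx, size=10 * len(out_idx), replace=True)
    X_bal = np.vstack([X_train, X_train[extra]])
    y_bal = np.concatenate([y_train, y_train[extra]])

    clf = LogisticRegression().fit(X_bal, y_bal)
    pred = clf.predict(X_test)

    # Recall on the outlier class matters more than overall accuracy here.
    recall = (pred[y_test == 1] == 1).mean()
    print("outlier recall:", recall)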

2. Unsupervised Methods:

In some application scenarios, objects labeled as "normal" or "outlier" are not available. Thus, an unsupervised learning method has to be used. Unsupervised outlier detection methods make an implicit assumption: the normal objects are somewhat "clustered." In other words, an unsupervised outlier detection method expects that normal objects follow a pattern far more frequently than outliers. Normal objects do not have to fall into one group sharing high similarity. Instead, they can form multiple groups, where each group has distinct features. However, an outlier is expected to occur far away in feature space from any of those groups of normal objects. This assumption may not always hold. For example, the normal objects may not share any strong patterns and may instead be uniformly distributed. The collective outliers, however,
share high similarity in a small area. Unsupervised methods cannot detect such outliers
effectively. In some applications, normal objects are diversely distributed, and many such objects
do not follow strong patterns. For instance, in some intrusion detection and computer virus
detection problems, normal activities are very diverse and many do not fall into high-quality
clusters. In such scenarios, unsupervised methods may have a high false positive rate—they may
mislabel many normal objects as outliers (intrusions or viruses in these applications), and let
many actual outliers go undetected. Due to the high similarity between intrusions and viruses,
modeling outliers using supervised methods may be far more effective.

Many clustering methods can be adapted to act as unsupervised outlier detection methods. The central idea is to find clusters first, and then the data objects not belonging to any
cluster are detected as outliers. However, such methods suffer from two issues:

 First, a data object not belonging to any cluster may be noise instead of an outlier.
 Second, it is often costly to find clusters first and then find outliers.

It is usually assumed that there are far fewer outliers than normal objects. Having to
process a large population of non target data entries (i.e., the normal objects) before one can
touch the real meat (i.e., the outliers) can be unappealing. The latest unsupervised outlier
detection methods develop various smart ideas to tackle outliers directly without explicitly and
completely finding clusters.
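
The clustering-based idea can be illustrated with a short, hedged sketch: objects that a density-based clustering algorithm such as DBSCAN leaves unassigned to any cluster are reported as outliers. The synthetic data and parameter values below are assumptions for illustration, not recommendations.

    # Clustering-based unsupervised outlier detection: points DBSCAN labels -1
    # (belonging to no cluster) are treated as outliers.
    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(1)
    cluster_a = rng.normal(0.0, 0.3, size=(100, 2))
    cluster_b = rng.normal(3.0, 0.3, size=(100, 2))
    stray = np.array([[1.5, 6.0], [-2.0, -3.0]])     # far from both groups
    X = np.vstack([cluster_a, cluster_b, stray])

    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
    outliers = X[labels == -1]
    print("detected outliers:\n", outliers)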

3. Semi-supervised Methods:

In many applications, although obtaining some labeled examples is feasible, the number
of such labeled examples is often small. We may encounter cases where only a small set of the
normal and/or outlier objects are labeled, but most of the data are unlabeled. Semi-supervised
outlier detection methods were developed to tackle such scenarios. Semi-supervised outlier
detection methods can be regarded as applications of semi-supervised learning methods. For
example, when some labeled normal objects are available, we can use them, together with
unlabeled objects that are close by, to train a model for normal objects. The model of normal
objects then can be used to detect outliers—those objects not fitting the model of normal objects
are classified as outliers.

 If only some labeled outliers are available, semi-supervised outlier detection is trickier.
 A small number of labeled outliers are unlikely to represent all the possible outliers.

Therefore, building a model for outliers based on only a few labeled outliers is unlikely to
be effective. To improve the quality of outlier detection, we can get help from models for normal
objects learned from unsupervised methods.
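
A hedged sketch of this semi-supervised idea: labeled normal objects, together with unlabeled objects that lie close to them, train a one-class model of normality, and objects that do not fit the model are reported as outliers. The one-class model choice, distance threshold, and data are illustrative assumptions.

    # Semi-supervised outlier detection: build a normality model from labeled
    # normal objects plus nearby unlabeled objects, then flag poor fits.
    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(2)
    labeled_normal = rng.normal(0.0, 1.0, size=(50, 2))
    unlabeled = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                           rng.normal(6.0, 0.5, size=(5, 2))])  # a few anomalies

    # Keep only unlabeled points close to some labeled normal object.
    dists = np.linalg.norm(unlabeled[:, None, :] - labeled_normal[None, :, :], axis=2)
    near_normal = unlabeled[dists.min(axis=1) < 1.0]

    model = OneClassSVM(nu=0.05).fit(np.vstack([labeled_normal, near_normal]))
    flags = model.predict(unlabeled)        # -1 = does not fit the normal model
    print("objects flagged as outliers:", int((flags == -1).sum()))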

3.4 AUTOMATED DATA PREPARATION PRINCIPLES:

Many of the principles of functional programming can be applied to data preparation. It is not necessary to use a functional programming language to automate data preparation, but such languages are often used to do so.

1. Understand the data consumer – who is going to use the data and what questions do they
need answered.
2. Understand the data – where it is coming from and how it was generated.
3. Save the raw data. If the data engineer has the raw data, then all the data transformations can
be recreated. Additionally, don’t move or delete the raw data once it is saved.

4. If possible, store all the data, raw and processed. Of course, privacy regulations like the
European Union (EU)’s General Data Protection Regulation (GDPR) will influence what
data can be saved and for how long.
5. Ensure that transforms are reproducible, deterministic, and idempotent; a minimal sketch follows this list. Each transform must produce the same results each time it is executed given the same input data set, without harmful side effects.
6. Future-proof your data pipeline. Version not only the data and the code that performs the analysis, but also the transforms that have been applied to the data.
7. Ensure that there is adequate separation between the online system and the offline analysis
so that the ingest step does not impact user-facing services.
8. Monitor the data pipeline for consistency across data sets.
9. Employ data governance early, and be proactive. IT's need for security and compliance means that governance capabilities like data masking, retention, lineage, and role-based permissions are all important aspects of the pipeline.
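
A minimal sketch of principle 5, assuming pandas and illustrative column names: the transform copies its input, applies deterministic rules, and applying it a second time changes nothing further.

    # A deterministic, idempotent cleaning transform (column names illustrative).
    import pandas as pd

    def clean_prices(df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()                              # never mutate the raw data
        out["price"] = out["price"].clip(lower=0)    # no negative prices
        out["currency"] = out["currency"].str.upper().fillna("USD")
        return out

    raw = pd.DataFrame({"price": [10.0, -3.0, 25.0],
                        "currency": ["usd", "eur", None]})
    once = clean_prices(raw)
    twice = clean_prices(once)
    assert once.equals(twice)                        # idempotent: no further change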

3.5 COMBINING DATA FILES:

It is likely that your data analysis task will involve data integration, which combines data
from multiple sources into a coherent data store, as in data warehousing. These sources may
include multiple databases, data cubes, or flat files. There are a number of issues to consider
during data integration. Schema integration and object matching can be tricky. How can
equivalent real-world entities from multiple data sources be matched up? This is referred to as
the entity identification problem. For example, how can the data analyst or the computer be sure that customer id in one database and customer number in another refer to the same attribute?
Examples of metadata for each attribute include the name, meaning, data type, and range of
values permitted for the attribute, and null rules for handling blank, zero, or null values.

Such metadata can be used to help avoid errors in schema integration. The metadata may
also be used to help transform the data (e.g., where data codes for pay type in one database may
be “H” and “S” but 1 and 2 in another). Hence, this step also relates to data cleaning, as
described earlier. When matching attributes from one database to another during integration,
special attention must be paid to the structure of the data. This is to ensure that any attribute

functional dependencies and referential constraints in the source system match those in the target
system. For example, in one system, a discount may be applied to the order, whereas in another
system it is applied to each individual line item within the order. If this is not caught before
integration, items in the target system may be improperly discounted.
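
A hedged sketch of this integration step, assuming pandas; the table names, the differing key names (customer id versus customer number), and the pay-type code mapping ("H"/"S" versus 1/2) are illustrative assumptions.

    # Combining two sources whose key and code conventions differ.
    import pandas as pd

    customers = pd.DataFrame({"customer_id": [1, 2, 3],
                              "pay_type": ["H", "S", "H"]})
    orders = pd.DataFrame({"cust_number": [1, 2, 2],
                           "pay_type": [1, 2, 2],
                           "amount": [120.0, 75.5, 40.0]})

    # Reconcile the differing codes for pay type before integration.
    orders["pay_type"] = orders["pay_type"].map({1: "H", 2: "S"})

    combined = customers.merge(orders,
                               left_on="customer_id",
                               right_on="cust_number",
                               how="inner",
                               suffixes=("_cust", "_order"))
    print(combined)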

3.6 AGGREGATE DATA:

Imagine that you have collected the data for your analysis. These data consist of the sales
per quarter, for the years 2008 to 2010. You are, however, interested in the annual sales (total per
year), rather than the total per quarter. Thus, the data can be aggregated so that the resulting data
summarize the total sales per year instead of per quarter. This aggregation is illustrated in the
above figure. The resulting data set is smaller in volume, without loss of information necessary
for the analysis task. Data cubes store multidimensional aggregated information. For example,
Figure shows a data cube for multidimensional analysis of sales data with respect to annual sales
per item type for each branch. Each cell holds an aggregate data value, corresponding to the data
point in multidimensional space. (For readability, only some cell values are shown.) Concept
hierarchies may exist for each attribute, allowing the analysis of data at multiple abstraction
levels.
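
A minimal sketch of this quarterly-to-annual aggregation, assuming pandas; the sales figures are made up for illustration.

    # Aggregate quarterly sales into annual totals for 2008-2010.
    import pandas as pd

    quarterly = pd.DataFrame({
        "year":    [2008]*4 + [2009]*4 + [2010]*4,
        "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
        "sales":   [224, 408, 350, 586, 201, 389, 312, 540, 196, 370, 305, 520],
    })

    annual = quarterly.groupby("year", as_index=False)["sales"].sum()
    print(annual)   # one row per year: the total sales per year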

For example, a hierarchy for branch could allow branches to be grouped into regions,
based on their address. Data cubes provide fast access to pre-computed, summarized data,
thereby benefiting online analytical processing as well as data mining. The cube created at the
lowest abstraction level is referred to as the base cuboid. The base cuboid should correspond to
an individual entity of interest such as sales or customer. In other words, the lowest level should
be usable, or useful for the analysis. A cube at the highest level of abstraction is the apex

cuboid. For the sales data in Figure, the apex cuboid would give one total—the total sales for all
three years, for all item types, and for all branches. Data cubes created for varying levels of
abstraction are often referred to as cuboids, so that a data cube may instead refer to a lattice of
cuboids. Each higher abstraction level further reduces the resulting data size. When replying to
data mining requests, the smallest available cuboid relevant to the given task should be used.

3.7 DUPLICATE REMOVAL:

In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level (e.g., where there are two or more identical tuples for a given unique
data entry case). The use of denormalized tables (often done to improve performance by
avoiding joins) is another source of data redundancy. Inconsistencies often arise between various
duplicates, due to inaccurate data entry or updating some but not all data occurrences. For
example, if a purchase order database contains attributes for the purchaser’s name and address
instead of a key to this information in a purchaser database, discrepancies can occur, such as the
same purchaser’s name appearing with different addresses within the purchase order database.
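
A minimal sketch of tuple-level duplicate detection and removal, assuming pandas; the column names and rows are illustrative.

    # Detect and drop identical tuples.
    import pandas as pd

    orders = pd.DataFrame({
        "purchaser": ["Ann Smith", "Ann Smith", "Bob Lee"],
        "address":   ["12 Elm St", "12 Elm St", "9 Oak Ave"],
        "amount":    [100.0, 100.0, 55.0],
    })

    dupes = orders[orders.duplicated(keep=False)]   # inspect duplicates first
    deduped = orders.drop_duplicates()
    print(len(orders), "tuples reduced to", len(deduped), "after duplicate removal")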

3.8 SAMPLING DATA:

Sampling can be used as a data reduction technique because it allows a large data set to
be represented by a much smaller random data sample (or subset). Suppose that a large data set,
D, contains N tuples. Let’s look at the most common ways that we could sample D for data
reduction, as illustrated in Figure.

Simple random sample without replacement (SRSWOR) of size s: This is created by drawing
s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is,
all tuples are equally likely to be sampled.

Simple random sample with replacement (SRSWR) of size s: This is similar to SRSWOR,
except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a
tuple is drawn, it is placed back in D so that it may be drawn again.

Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters,” then an SRS
of s clusters can be obtained, where s < M. For example, tuples in a database are usually

retrieved a page at a time, so that each page can be considered a cluster. A reduced data
representation can be obtained by applying, say, SRSWOR to the pages, resulting in a cluster
sample of the tuples. Other clustering criteria conveying rich semantics can also be explored. For
example, in a spatial database, we may choose to define clusters geographically based on how
closely different areas are located.

Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified sample of
D is generated by obtaining an SRS at each stratum. This helps ensure a representative sample,
especially when the data are skewed. For example, a stratified sample may be obtained from
customer data, where a stratum is created for each customer age group. In this way, the age
group having the smallest number of customers will be sure to be represented.

An advantage of sampling for data reduction is that the cost of obtaining a sample is
proportional to the size of the sample, s, as opposed to N, the data set size. Hence, sampling
complexity is potentially sublinear in the size of the data. Other data reduction techniques can require at least one complete pass through D. For a fixed sample size, sampling complexity increases only linearly as the number of data dimensions, n, increases, whereas techniques using
histograms, for example, increase exponentially in n. When applied to data reduction, sampling
is most commonly used to estimate the answer to an aggregate query. It is possible (using the
central limit theorem) to determine a sufficient sample size for estimating a given function within
a specified degree of error. This sample size, s, may be extremely small in comparison to N.
Sampling is a natural choice for the progressive refinement of a reduced data set. Such a set can
be further refined by simply increasing the sample size.
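
The sampling schemes described above can be sketched briefly with pandas; the toy data set, sample sizes, and sampling fraction are illustrative assumptions.

    # SRSWOR, SRSWR, and stratified sampling on a toy data set D.
    import pandas as pd

    D = pd.DataFrame({"age_group": ["<20"]*5 + ["20-40"]*40 + [">40"]*55,
                      "spend": range(100)})

    srswor = D.sample(n=10, replace=False, random_state=0)   # without replacement
    srswr  = D.sample(n=10, replace=True,  random_state=0)   # with replacement

    # Stratified sample: a simple random sample within each age-group stratum.
    stratified = D.groupby("age_group", group_keys=False).apply(
        lambda g: g.sample(frac=0.2, random_state=0))
    print(len(srswor), len(srswr), len(stratified))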

3.9 DATA CACHING:

The idea behind caching is to temporarily copy data to a location that enables an
application or component to access the data faster than if retrieving it from the original position.
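
A minimal sketch of this idea: an expensive retrieval is performed once, the result is kept in a local dictionary, and later requests are served from that copy. The key name and the simulated slow source are illustrative assumptions.

    # A simple in-memory cache in front of a slow data source.
    import time

    _cache = {}

    def get_report(key):
        if key not in _cache:                 # miss: fetch from the slow source
            time.sleep(0.5)                   # stands in for a slow query or read
            _cache[key] = [len(key)] * 3
        return _cache[key]                    # hit: fast local copy

    get_report("daily_sales")   # slow (populates the cache)
    get_report("daily_sales")   # fast (served from the cache)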

3.10 PARTITIONING DATA:

Histograms (or frequency histograms) are at least a century old and are widely used.
“Histos” means pole or mast, and “gram” means chart, so a histogram is a chart of poles. Plotting
histograms is a graphical method for summarizing the distribution of a given attribute, X. If X is
nominal, then a pole or vertical bar is drawn for each known value of X. The height of the bar
indicates the frequency (i.e., count) of that X value. The resulting graph is more commonly
known as a bar chart.

If X is numeric, the term histogram is preferred. The range of values for X is partitioned
into disjoint consecutive sub ranges. The sub ranges, referred to as buckets or bins, are disjoint
subsets of the data distribution for X. The range of a bucket is known as the width. Typically,
the buckets are of equal width. For example, a price attribute with a value range of $40 to $200
(rounded up to the nearest dollar) can be partitioned into sub ranges 40 to 59, 60 to 79, 80 to 99,
and so on. For each sub range, a bar is drawn with a height that represents the total count of
items observed within the sub range.

Histograms. The following data are a list of item prices for commonly sold items (rounded to
the nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14,
14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21,
21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.

Figure shows a histogram for the data using singleton buckets. To further reduce the data,
it is common to have each bucket denote a continuous value range for the given attribute. In the figure below, each bucket represents a different $10 range for price. "How are the buckets determined
and the attribute values partitioned?” There are several partitioning rules, including the
following:

Equal-width: In an equal-width histogram, the width of each bucket range is uniform (e.g., the
width of $10 for the buckets in Figure).

Equal-frequency (or equal-depth): In an equal-frequency histogram, the buckets are created so that, roughly, the frequency of each bucket is constant (i.e., each bucket contains roughly the same number of contiguous data samples).

[Figure: Histogram of the item prices using singleton buckets; x-axis: price values ($1 to $30), y-axis: Count (0 to 9).]

Histograms are highly effective at approximating both sparse and dense data, as well as
highly skewed and uniform data. The histograms described before for single attributes can be
extended for multiple attributes. Multidimensional histograms can capture dependencies between
attributes. These histograms have been found effective in approximating data with up to five
attributes. More studies are needed regarding the effectiveness of multidimensional histograms
for high dimensionalities. Singleton buckets are useful for storing high-frequency outliers.

3.11 MISSING VALUES:

There are different ways of filling in the missing values for an attribute; they are presented here:
Ignore the tuple: This is usually done when the class label is missing. This method is not very effective, unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably. By ignoring the tuple, we do not make use of the remaining attributes' values in the tuple. Such data could have been useful to the task at hand.

Fill in the missing value manually: In general, this approach is time consuming and may not be
feasible given a large data set with many missing values.

Use a global constant to fill in the missing value: Replace all missing attribute values by the
same constant such as a label like “Unknown” or ∞. If missing values are replaced by, say,
“Unknown,” then the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common—that of “Unknown.” Hence, although this
method is simple, it is not foolproof.

Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the
missing value: For normal (symmetric) data distributions, the mean can be used, while skewed
data distribution should employ the median. For example, suppose that the data distribution
regarding the income of customers is symmetric and that the mean income is $56,000. Use this
value to replace the missing value for income.

Use the attribute mean or median for all samples belonging to the same class as the given
tuple: For example, if classifying customers according to credit risk, we may replace the missing
value with the mean income value for customers in the same credit risk category as that of the
given tuple. If the data distribution for a given class is skewed, the median value is a better
choice.

Use the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using a Bayesian formalism, or decision tree induction. For
example, using the other customer attributes in your data set, you may construct a decision tree
to predict the missing values for income. Methods 3 through 6 bias the data—the filled-in value
may not be correct. Method 6, however, is a popular strategy. In comparison to the other
methods, it uses the most information from the present data to predict missing values. By
considering the other attributes’ values in its estimation of the missing value for income, there is
a greater chance that the relationships between income and the other attributes are preserved. It is
important to note that, in some cases, a missing value may not imply an error in the data! For
example, when applying for a credit card, candidates may be asked to supply their driver’s

license number. Candidates who do not have a driver’s license may naturally leave this field
blank. Forms should allow respondents to specify values such as “not applicable.” Software
routines may also be used to uncover other null values (e.g., “don’t know,” “?” or “none”).
Ideally, each attribute should have one or more rules regarding the null condition. The rules may
specify whether or not nulls are allowed and/or how such values should be handled or
transformed. Fields may also be intentionally left blank if they are to be provided in a later step
of the business process. Hence, although we can try our best to clean the data after it is seized,
good database and data entry procedure design should help minimize the number of missing
values or errors in the first place.
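
A minimal sketch of two of the strategies above, assuming pandas and illustrative column names and values: a global constant fills a missing categorical value, and the median fills a missing value in a skewed numeric attribute.

    # Filling missing values: global constant and measure of central tendency.
    import pandas as pd

    df = pd.DataFrame({
        "income":     [42000, 56000, None, 61000, 250000],   # skewed by one value
        "occupation": ["clerk", None, "engineer", "teacher", "manager"],
    })

    df["occupation"] = df["occupation"].fillna("Unknown")       # global constant
    df["income"] = df["income"].fillna(df["income"].median())   # skewed -> median
    print(df)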
