
Data Mining

BITS Pilani M1: Introduction to Data Mining


BITS Pilani
Pilani|Dubai|Goa|Hyderabad

1.1 Data Mining Defined

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
• Data collection and data availability
  • Automated data collection tools, database systems, the Web, a computerized society
• Major sources of abundant data
  • Business: Web, e-commerce, transactions, stocks, …
  • Science: remote sensing, bioinformatics, scientific simulation, …
  • Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• "Necessity is the mother of invention": data mining, the automated analysis of massive data sets
Why Data Mining
A search engine (e.g., Google) receives hundreds of millions of queries every day. Each query can be
viewed as a transaction where the user describes her or his information need.

What novel and useful knowledge can a search engine learn from such a huge collection of queries
collected from users over time? Some patterns found in user search queries can disclose invaluable
knowledge that cannot be obtained by reading individual data items alone.

For example, Google's Flu Trends uses specific search terms as indicators of flu activity. It found a
close relationship between the number of people who search for flu-related information and the
number of people who actually have flu symptoms. A pattern emerges when all of the search
queries related to flu are aggregated. Using aggregated Google search data, Flu Trends can estimate
flu activity up to two weeks faster than traditional systems can. This example shows how data
mining can turn a large collection of data into knowledge that can help meet a current global
challenge.
Evolution of Database Technology

• 1960s:
• Data collection, database creation, IMS and network DBMS
• 1970s:
• Relational data model, relational DBMS implementation
• 1980s:
• RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
• Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
• Data mining, data warehousing, multimedia databases, and Web databases
• 2000s
• Stream data management and mining
• Data mining and its applications
• Web technology (XML, data integration) and global information systems

What Is Data Mining?

• Data mining (knowledge discovery from data)


• Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful)
patterns or knowledge from huge amounts of data
• Data mining: a misnomer?

• Alternative names
• Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern
analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

• Watch out: Is everything “data mining”?


• Simple search and query processing
• (Deductive) expert systems

What is (not) Data Mining?

What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about "Amazon"

What is Data Mining?
– Discover that certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly… in the Boston area)
– Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest vs. Amazon.com)
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
  • Enormity of data
  • High dimensionality of data
  • Heterogeneous, distributed nature of data
• Data mining sits at the intersection of statistics, machine learning/AI/pattern recognition, and database systems
Data Mining in Business Intelligence
Increasing potential to support business decisions (from bottom to top), with the typical role at each level:

• Decision Making (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Data Mining/KDD Process

Input Data → Data Pre-Processing → Data Mining → Post-Processing

• Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
• Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
• Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
Multi-Dimensional View of Data Mining
• Data to be mined
• Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse,
transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media,
graphs & social and information networks
• Knowledge to be mined (or: Data mining functions)
• Characterization, discrimination, association, classification, clustering, trend/deviation, outlier
analysis, etc.
• Descriptive vs. predictive data mining
• Multiple/integrated functions and mining at multiple levels
• Techniques utilized
• Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition,
visualization, high-performance, etc.
• Applications adapted
• Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text
mining, Web mining, etc.

Data Mining & Machine Learning
According to Tom M. Mitchell, Chair of Machine Learning at Carnegie Mellon
University and author of the book Machine Learning (McGraw-Hill),
A computer program is said to learn from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks in T, as measured by P, improves with
the experience E.
We now have a set of objects to define machine learning:
Task (T), Experience (E), and Performance (P)
With a computer running a set of tasks, the experience should be leading to performance
increases (to satisfy the definition)

Many data mining tasks are executed successfully with help of machine learning

Machine Learning: Hands-on for Developers and Technical Professionals by Jason Bell John Wiley & Sons
Data Mining on Diverse kinds of Data
Besides relational database data (from operational or analytical systems), there are many other kinds of data
that have diverse forms and structures and different semantic meanings.
Examples of such data include:
time-related or sequence data (e.g., historical records, stock exchange data, and time-series and biological
sequence data),
data streams (e.g., video surveillance and sensor data, which are continuously transmitted),
spatial data (e.g., maps),
engineering design data (e.g., the design of buildings, system components, or integrated circuits),
hypertext and multimedia data (including text, image, video, and audio data),
graph and networked data (e.g., social and information networks), and
the Web (a widely distributed information repository).
Diversity of data brings in new challenges such as handling special structures (e.g., sequences, trees, graphs,
and networks) and specific semantics (such as ordering, image, audio and video contents, and connectivity)
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han, Micheline
Kamber and Jian Pei Morgan Kaufmann Publishers
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

1.2 Data Mining Activities

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Data Mining Tasks
• Prediction Methods
• Use some variables to predict unknown or future values of other variables.

• Description Methods
• Find human-interpretable patterns that describe the data.

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Tasks...
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]
Classification: Definition
• Given a collection of records (training set )
• Each record contains a set of attributes, one of the attributes is the class.
• Find a model for class attribute as a function of the values of other
attributes.
• Goal: previously unseen records should be assigned a class as
accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given
data set is divided into training and test sets, with training set used to build
the model and test set used to validate it.
Classification Example

Training Set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set (class unknown):
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

The training set is used to learn the model (classifier); the learned model is then applied to the test set.
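The learn-then-validate workflow above can be sketched in a few lines. This is only an illustration, assuming scikit-learn is available; the synthetic data and the decision-tree model are stand-ins (any classifier could be used), not part of the slides.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic records: attribute vectors X and a class label y (illustrative only)
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

# Divide the given data into a training set (build the model) and a test set (validate it)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)   # learn the class as a function of the other attributes
print("test-set accuracy:", accuracy_score(y_test, model.predict(X_test)))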
Classification: Application 1
• Direct Marketing
• Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a
new cell-phone product.
• Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided otherwise. This {buy, don’t
buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and company-interaction related information
about all such customers.
• Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model.

From [Berry & Linoff] Data Mining Techniques, 1997


Classification: Application 2
• Fraud Detection
• Goal: Predict fraudulent cases in credit card transactions.
• Approach:
• Use credit card transactions and the information on its account-holder as attributes.
• When does a customer buy, what does he buy, how often he pays on time, etc
• Label past transactions as fraud or fair transactions. This forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card transactions on an account.
Classification: Application 3
• Customer Attrition/Churn:
• Goal: To predict whether a customer is likely to be lost to a competitor.
• Approach:
• Use detailed record of transactions with each of the past and present customers, to find
attributes.
• How often the customer calls, where he calls, what time-of-the day he calls most, his financial
status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997


Classification: Application 4
• Sky Survey Cataloging
• Goal: To predict class (star or galaxy) of sky objects, especially visually faint
ones, based on the telescopic survey images (from Palomar Observatory).
• 3000 images with 23,040 x 23,040 pixels per image.
• Approach:
• Segment the image.
• Measure image attributes (features) - 40 of them per object.
• Model the class based on these features.
• Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that
are difficult to find!

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Clustering Definition
• Given a set of data points, each having a set of attributes, and a
similarity measure among them, find clusters such that
• Data points in one cluster are more similar to one another.
• Data points in separate clusters are less similar to one another.
• Similarity Measures:
• Euclidean Distance if attributes are continuous.
• Other Problem-specific Measures.
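A minimal sketch of the Euclidean similarity measure mentioned above, in plain Python with made-up points:

import math

def euclidean(p, q):
    # Distance between two points whose attributes are continuous
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

a, b, c = (1.0, 2.0, 1.5), (1.2, 1.9, 1.4), (8.0, 9.0, 7.5)
print(euclidean(a, b))   # small distance: a and b would likely fall in the same cluster
print(euclidean(a, c))   # large distance: a and c would likely fall in separate clusters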
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.

Intracluster distances are minimized; intercluster distances are maximized.
Clustering: Application 1
• Market Segmentation:
• Goal: subdivide a market into distinct subsets of customers where any subset
may conceivably be selected as a market target to be reached with a distinct
marketing mix.
• Approach:
• Collect different attributes of customers based on their geographical and lifestyle related
information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of customers in same
cluster vs. those from different clusters.
Clustering: Application 2
• Document Clustering:
• Goal: To find groups of documents that are similar to each other based on the
important terms appearing in them.
• Approach: To identify frequently occurring terms in each document. Form a
similarity measure based on the frequencies of different terms. Use it to
cluster.
• Gain: Information Retrieval can utilize the clusters to relate a new document
or search term to clustered documents.
Clustering of S&P 500 Stock Data
• Observe Stock Movements every day.
• Clustering points: Stock-{UP/DOWN}
• Similarity Measure: Two points are more similar if the events
described by them frequently happen together on the same day.
• We used association rules to quantify a similarity measure.
Discovered Clusters and Industry Groups:

Cluster 1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-DOWN, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-DOWN, Tellabs-Inc-DOWN, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN

Cluster 2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN

Cluster 3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN

Cluster 4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
Association Rule Discovery: Definition
• Given a set of records each of which contain some number of items from a given
collection;
• Produce dependency rules which will predict occurrence of an item based on
occurrences of other items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
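As an illustration of how a rule such as {Milk} --> {Coke} would be scored against these transactions, the sketch below computes its support and confidence in plain Python (the rule-search algorithm itself is not shown):

transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]
antecedent, consequent = {"Milk"}, {"Coke"}

n_ante = sum(antecedent <= t for t in transactions)                  # transactions containing {Milk}
n_both = sum((antecedent | consequent) <= t for t in transactions)   # transactions containing {Milk, Coke}

print("support    =", n_both / len(transactions))   # 3/5 = 0.6
print("confidence =", n_both / n_ante)               # 3/4 = 0.75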
Association Rule Discovery: Application 1

• Marketing and Sales Promotion:


• Let the rule discovered be
{Bagels, … } --> {Potato Chips}
• Potato Chips as consequent => Can be used to determine what should be
done to boost its sales.
• Bagels in the antecedent => Can be used to see which products would be
affected if the store discontinues selling bagels.
• Bagels in antecedent and Potato chips in consequent => Can be used to see
what products should be sold with Bagels to promote sale of Potato chips!
Association Rule Discovery: Application

• Inventory Management:
• Goal: A consumer appliance repair company wants to anticipate the nature of
repairs on its consumer products and keep the service vehicles equipped with
right parts to reduce the number of visits to consumer households.
• Approach: Process the data on tools and parts required in previous repairs at
different consumer locations and discover the co-occurrence patterns.
Sequential Pattern Discovery: Definition
• Given is a set of objects, with each object associated with its own timeline of events, find rules that predict
strong sequential dependencies among different events.

(A B) (C) (D E)

• Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
• Timing constraints include maxgap (xg), mingap (ng), window size (ws), and maxspan (ms); in the diagram, consecutive elements must satisfy <= xg and > ng, the events within one element must fall within <= ws, and the whole pattern must fit within <= ms.
Sequential Pattern Discovery: Examples
• In telecommunications alarm logs,
• (Inverter_Problem Excessive_Line_Current)
(Rectifier_Alarm) --> (Fire_Alarm)
• In point-of-sale transaction sequences,
• Computer Bookstore:
(Intro_To_Visual_C) (C++_Primer) -->
(Perl_for_dummies,Tcl_Tk)
• Athletic Apparel Store:
(Shoes) (Racket, Racketball) --> (Sports_Jacket)
Regression
• Predict a value of a given continuous valued variable based on the values of
other variables, assuming a linear or nonlinear model of dependency.
• Greatly studied in statistics, neural network fields.
• Examples:
• Predicting sales amounts of a new product based on advertising expenditure.
• Predicting wind velocities as a function of temperature, humidity, air
pressure, etc.
• Time series prediction of stock market indices.
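A small illustrative sketch of fitting a linear model with NumPy least squares; the advertising/sales figures are invented for the example:

import numpy as np

advertising = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # advertising expenditure (made up)
sales = np.array([2.1, 4.3, 6.2, 8.1, 9.9])          # observed sales amounts (made up)

w, b = np.polyfit(advertising, sales, deg=1)          # least-squares fit of sales = w*advertising + b
print("predicted sales for a spend of 6.0:", w * 6.0 + b)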
Deviation/Anomaly Detection
• Detect significant deviations from normal behavior

• Applications:
• Credit Card Fraud Detection
• Network Intrusion Detection
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han, Micheline
Kamber and Jian Pei Morgan Kaufmann Publishers
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

1.3 DM Process & Challenges

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
DM Process
• The standard data mining process involves
1. understanding the problem,
2. preparing the data (samples),
3. developing the model,
4. applying the model on a data set to see how the model may work in real
world, and
5. production deployment.
• A popular data mining process framework is CRISP-DM (Cross-Industry Standard Process for Data Mining). This framework was developed by a consortium of companies involved in data mining.
Generic Data mining process
Prior Knowledge
• Data Mining tools/solutions identify hidden patterns.
• Generally we get many patterns
• Out of them many could be false or trivial.
• Filtering false patterns requires domain understanding.
• Understanding how the data is collected, stored, transformed,
reported, and used is essential.
• Causation vs. Correlation
• Example: a bank decides the interest rate based on the credit score, so in the data the two move together; but it would not make sense to derive the credit score from the interest rate.
Data Preparation
• Data needs to be understood. It requires descriptive statistics such as mean, median, mode, standard deviation, and range for each
attribute
• Data quality is an ongoing concern wherever data is collected, processed, and stored.
• The data cleansing practices include elimination of duplicate records, quarantining outlier records that exceed the bounds, standardization of attribute
values, substitution of missing values, etc.
• it is critical to check the data using data exploration techniques in addition to using prior knowledge of the data and business before building models to
ensure a certain degree of data quality

• Missing Values
• Need to track the data lineage of the data source to find right solution
• Data Types and Conversion
• The attributes in a data set can be of different types, such as continuous numeric (interest rate), integer numeric (credit score), or categorical
• data mining algorithms impose different restrictions on what data types they accept as inputs
• Transformation
• Can go beyond type conversion, may include dimensionality reduction or numerosity reduction
• Outliers are anomalies in the data set
• May occur legitimately or erroneously.
• Feature Selection
• Many data mining problems involve a data set with hundreds to thousands of attributes, most of which may not be helpful. Some attributes may be
correlated, e.g. sales amount and tax.
• Data Sampling may be adequate in many cases
Modeling
A model is the abstract representation of the data and its relationships in a given data set.

Data mining models can be classified into the following categories: classification, regression, association analysis, clustering, and outlier or anomaly detection.

Each category has a few dozen different algorithms; each takes a slightly different approach to solve the problem at hand.
Application
• The model deployment stage considerations:
• assessing model readiness, technical integration, response time, model maintenance, and assimilation
• Production Readiness
• Real-time response capabilities, and other business requirements
• Technical Integration
• Use of modeling tools (e.g. RapidMiner), Use of PMML for portable and consistent format of model
description, integration with other tools
• Timeliness
• The trade-offs between production responsiveness and build time need to be considered
• Remodeling
• The conditions in which the model is built may change after deployment
• Assimilation
• The challenge is to assimilate the knowledge gained from data mining in the organization. For example, the
objective may be finding logical clusters in the customer database so that separate treatment can be provided
to each customer cluster.
CRISP data mining framework

CRISP-DM is the most popular methodology for analytics, data mining, and data science projects, with a 43% share as per the 2014 KDnuggets Poll.

CRISP-DM was conceived in 1996. In 1997 it got underway as a European Union project, led by SPSS, Teradata, Daimler AG, NCR Corporation and OHRA.
DM Issues/Challenges – Mining Methodology
Mining Methodology involves the investigation of new kinds of knowledge, mining in multidimensional space, integrating methods from other
disciplines, and the consideration of semantic ties among data objects.
• Mining various and new kinds of knowledge: Data mining covers a wide spectrum of data analysis and knowledge discovery tasks, from
data characterization and discrimination to association and correlation analysis, classification, regression, clustering, outlier analysis,
sequence analysis, and trend and evolution analysis.
• Mining knowledge in multidimensional space: When searching for knowledge in large data sets, we can explore the data in
multidimensional space. That is, we can search for interesting patterns among combinations of dimensions (attributes) at varying levels of
abstraction. Data can be aggregated or viewed as a multidimensional data cube.
• Data mining—an interdisciplinary effort: For example, to mine data with natural language text, it makes sense to fuse data mining methods
with methods of information retrieval and natural language processing, e.g. consider the mining of software bugs in large programs, known
as bug mining, benefits from the incorporation of software engineering knowledge into the data mining process.
• Boosting the power of discovery in a networked environment: Most data objects reside in a linked or interconnected environment,
whether it be the Web, database relations, files, or documents. Semantic links across multiple data objects can be used to advantage in
data mining.
• Handling uncertainty, noise, or incompleteness of data: Data often contain noise, errors, exceptions, or uncertainty, or are incomplete.
Errors and noise may confuse the data mining process, leading to the derivation of erroneous patterns. Data cleaning, data preprocessing,
outlier detection and removal, and uncertainty reasoning are examples of techniques that need to be integrated with the data mining
process.
• Pattern evaluation and pattern- or constraint-guided mining: What makes a pattern interesting may vary from user to user. Therefore,
techniques are needed to assess the interestingness of discovered patterns based on subjective measures. These estimate the value of
patterns with respect to a given user class, based on user beliefs or expectations.
DM Issues/Challenges – User Interaction
The user plays an important role in the data mining process. Interesting areas include how to interact with a data mining system, how to
incorporate a user's background knowledge in mining, and how to visualize and comprehend data mining results.

• Interactive mining: The data mining process should be highly interactive. Thus, it is important to build flexible user interfaces and an
exploratory mining environment, facilitating the user's interaction with the system. A user may like to first sample a set of data, explore
general characteristics of the data, and estimate potential mining results. Interactive mining should allow users to dynamically change the
focus of a search, to refine mining requests based on returned results, and to drill, dice, and pivot through the data and knowledge space
interactively, dynamically exploring "cube space" while mining.

• Incorporation of background knowledge: Background knowledge, constraints, rules, and other information regarding the domain under
study should be incorporated into the knowledge discovery process. Such knowledge can be used for pattern evaluation as well as to guide
the search toward interesting patterns.

• Ad hoc data mining and data mining query languages: Query languages (e.g., SQL) have played an important role in flexible searching
because they allow users to pose ad hoc queries. Similarly, high-level data mining query languages or other high-level flexible user
interfaces will give users the freedom to define ad hoc data mining tasks. This should facilitate specification of the relevant sets of data for
analysis, the domain knowledge, the kinds of knowledge to be mined, and the conditions and constraints to be enforced on the discovered
patterns. Optimization of the processing of such flexible mining requests is another promising area of study.

• Presentation and visualization of data mining results: How can a data mining system present data mining results, vividly and flexibly, so
that the discovered knowledge can be easily understood and directly usable by humans? This is especially crucial if the data mining process
is interactive. It requires the system to adopt expressive knowledge representations, user-friendly interfaces, and visualization techniques.
DM Issues/Challenges - Efficiency and Scalability
Efficiency and scalability are always considered when comparing data mining algorithms. As data amounts continue to multiply, these two
factors are especially critical.

• Efficiency and scalability of data mining algorithms: Data mining algorithms must be efficient and scalable in order to effectively extract
information from huge amounts of data in many data repositories or in dynamic data streams. In other words, the running time of a data
mining algorithm must be predictable, short, and acceptable by applications. Efficiency, scalability, performance, optimization, and the
ability to execute in real time are key criteria that drive the development of many new data mining algorithms.

• Parallel, distributed, and incremental mining algorithms: The humongous size of many data sets, the wide distribution of data, and the
computational complexity of some data mining methods are factors that motivate the development of parallel and distributed data-
intensive mining algorithms. Such algorithms first partition the data into "pieces." Each piece is processed, in parallel, by searching for
patterns. The parallel processes may interact with one another. The patterns from each partition are eventually merged.

• Cloud computing and cluster computing, which use computers in a distributed and collaborative way to tackle very large-scale
computational tasks, are also active research themes in parallel data mining. In addition, the high cost of some data mining processes and
the incremental nature of input promote incremental data mining, which incorporates new data updates without having to mine the entire
data "from scratch." Such methods perform knowledge modification incrementally to amend and strengthen what was previously
discovered.
DM Issues/Challenges - Diversity of Database Types
The wide diversity of database types brings about challenges to data mining.

Handling complex types of data: Diverse applications generate a wide spectrum of new data types, from structured data such as relational and
data warehouse data to semi-structured and unstructured data; from stable data repositories to dynamic data streams; from simple data
objects to temporal data, biological sequences, sensor data, spatial data, hypertext data, multimedia data, software program code, Web data,
and social network data. It is unrealistic to expect one data mining system to mine all kinds of data, given the diversity of data types and the
different goals of data mining. Domain- or application-dedicated data mining systems are being constructed for in-depth mining of specific
kinds of data. The construction of effective and efficient data mining tools for diverse applications remains a challenging area.

Mining dynamic, networked, and global data repositories: Multiple sources of data are connected by the Internet and various kinds of
networks, forming gigantic, distributed, and heterogeneous global information systems and networks. The discovery of knowledge from
different sources of structured, semi-structured, or unstructured yet interconnected data with diverse data semantics poses great challenges to
data mining. Mining such gigantic, interconnected information networks may help disclose many more patterns and knowledge in
heterogeneous data sets than can be discovered from a small set of isolated data repositories. Web mining, multisource data mining, and
information network mining have become challenging and fast-evolving data mining fields.
DM Issues/Challenges - Society
How does data mining impact society? What steps can data mining take to preserve the privacy of individuals? Do we use data mining in our
daily lives without even knowing that we do?

Social impacts of data mining: With data mining penetrating our everyday lives, it is important to study the impact of data mining on society.
How can we use data mining technology to benefit society? How can we guard against its misuse? The improper disclosure or use of data and
the potential violation of individual privacy and data protection rights are areas of concern that need to be addressed.

Privacy-preserving data mining: Data mining will help scientific discovery, business management, economy recovery, and security protection
(e.g., the real-time discovery of intruders and cyberattacks). However, it poses the risk of disclosing an individual's personal information.
Studies on privacy-preserving data publishing and data mining are ongoing. The philosophy is to observe data sensitivity and preserve people's
privacy while performing successful data mining.

Invisible data mining: We cannot expect everyone in society to learn and master data mining techniques. More and more systems should have
data mining functions built within so that people can perform data mining or use data mining results simply by mouse clicking, without any
knowledge of data mining algorithms. Intelligent search engines and Internet-based stores perform such invisible data mining by incorporating
data mining into their components to improve their functionality and performance. This is done often unbeknownst to the user. For example,
when purchasing items online, users may be unaware that the store is likely collecting data on the buying patterns of its customers, which may
be used to recommend other items for purchase in the future.
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han, Micheline
Kamber and Jian Pei Morgan Kaufmann Publishers
R1 Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner
by Vijay Kotu and Bala Deshpande Morgan Kaufmann Publishers
Data Mining
BITS Pilani M2: Data Preprocessing
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

2.1 Data Preprocessing Concepts

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Preprocessing Objectives

• To improve data quality

• To modify data to better fit specific data mining technique


Major Tasks in Data Preprocessing

• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation

Data Quality: Multidimensional View

• Measures for data quality: A multidimensional view


• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling, …
• Timeliness: timely update?
• Believability: how much are the data trusted to be correct?
• Interpretability: how easily the data can be understood?

Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:


• Noise and outliers
• missing values
• duplicate data
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument faulty,
human or computer error, transmission error
• incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• e.g., Occupation = “ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Salary = “−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age = “42”, Birthday = “03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?

Incomplete (Missing) Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• not register history or changes of the data
• Missing data may need to be inferred

How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when doing classification)—
not effective when the % of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
• a global constant : e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based such as Bayesian formula or decision tree

Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data

How to Handle Noisy Data?
• Binning (also used for discretization)
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
• Binning methods smooth a sorted data value by consulting its "neighborhood," that
is, the values around it, i.e. they perform local smoothing.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with possible outliers)

Noise
• Noise refers to modification of original values
• Examples: distortion of a person’s voice when talking on a poor phone and
“snow” on television screen

Two Sine Waves Two Sine Waves + Noise


Duplicate Data
• Data set may include data objects that are duplicates, or almost
duplicates of one another
• Major issue when merging data from heterogeneous sources

• Examples:
• Same person with multiple email addresses

• Data cleaning
• Process of dealing with duplicate data issues
Outliers
• Outliers are data objects with characteristics that are considerably
different than most of the other data objects in the data set
Data Cleaning as a Process
• Data discrepancy detection
• Use metadata (e.g., domain, range, dependency, distribution)
• Check field overloading
• Check uniqueness rule, consecutive rule and null rule
• Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors
and make corrections
• Data auditing: by analyzing data to discover rules and relationship to detect violators (e.g.,
correlation and clustering to find outliers)
• Data migration and integration
• Data migration tools: allow transformations to be specified
• ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a
graphical user interface
• Integration of the two processes
• Iterative and interactive (e.g., Potter's Wheel)
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han, Micheline
Kamber and Jian Pei Morgan Kaufmann Publishers
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

2.2 Data Preprocessing Techniques

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Major Tasks in Data Preprocessing

• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation

Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store

• Schema integration: e.g., A.cust-id ≡ B.cust-#


• Integrate metadata from different sources

• Entity identification problem:


• Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton

• Detecting and resolving data value conflicts


• For the same real world entity, attribute values from different sources are different
• Possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration

• Redundant data occur often when integrating multiple databases


• Object identification: The same attribute or object may have different names
in different databases
• Derivable data: One attribute may be a “derived” attribute in another table,
e.g., annual revenue
• Redundant attributes may be able to be detected by correlation analysis and
covariance analysis
• Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality

Correlation Analysis (Nominal Data)
• χ² (chi-square) test:

    χ² = Σ (Observed − Expected)² / Expected

• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• Correlation does not imply causality
  • # of hospitals and # of car thefts in a city are correlated
  • Both are causally linked to a third variable: population
Chi-Square Calculation: An Example
                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)          450
Not like science fiction    50 (210)   1000 (840)         1050
Sum (col.)                 300         1200               1500

• χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):

    χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

• It shows that like_science_fiction and play_chess are correlated in the group
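The same statistic can be reproduced in a few lines of plain Python; each expected count is row_total * col_total / grand_total:

observed = {("fiction", "chess"): 250, ("fiction", "no_chess"): 200,
            ("no_fiction", "chess"): 50, ("no_fiction", "no_chess"): 1000}
row_totals = {"fiction": 450, "no_fiction": 1050}
col_totals = {"chess": 300, "no_chess": 1200}
grand_total = 1500

chi2 = 0.0
for (row, col), obs in observed.items():
    expected = row_totals[row] * col_totals[col] / grand_total   # e.g. 450 * 300 / 1500 = 90
    chi2 += (obs - expected) ** 2 / expected
print(chi2)   # about 507.9, matching the calculation above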
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson's product-moment coefficient):

    r(A,B) = Σ (ai − Ā)(bi − B̄) / ((n − 1)·σA·σB) = (Σ ai·bi − n·Ā·B̄) / ((n − 1)·σA·σB)

  where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ ai·bi is the sum of the AB cross-product.
• If r(A,B) > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
• r(A,B) = 0: no linear correlation; r(A,B) < 0: negatively correlated
Correlation (viewed as linear relationship)
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize data objects, A and B, and
then take their dot product

a'k = (ak − mean(A)) / std(A)
b'k = (bk − mean(B)) / std(B)

correlation(A, B) = A' • B'
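A quick numeric check of the correlation coefficient using NumPy; the two series reuse the stock values from the covariance example below and are otherwise illustrative:

import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Standardize, then average the cross-products (equivalent to the r(A,B) formula above)
A_std = (A - A.mean()) / A.std(ddof=1)
B_std = (B - B.mean()) / B.std(ddof=1)
r = np.sum(A_std * B_std) / (len(A) - 1)

print(r, np.corrcoef(A, B)[0, 1])   # both expressions give the same value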
Covariance (Numeric Data)

• Covariance is similar to correlation:

    Cov(A, B) = E[(A − mean(A))·(B − mean(B))] = Σ (ai − mean(A))(bi − mean(B)) / n

    Correlation coefficient: r(A,B) = Cov(A, B) / (σA·σB)

  where n is the number of tuples, mean(A) and mean(B) are the respective means (expected values) of A and B, and σA and σB are the respective standard deviations of A and B.
• Positive covariance: if Cov(A,B) > 0, then A and B both tend to be larger than their expected values.
• Negative covariance: if Cov(A,B) < 0, then when A is larger than its expected value, B is likely to be smaller than its expected value.
• If A and B are independent, Cov(A,B) = 0 (but Cov(A,B) = 0 does not by itself imply independence).
Co-Variance: An Example

• It can be simplified in computation as: Cov(A, B) = E(A·B) − mean(A)·mean(B)

• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).

• Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
• Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4

• Thus, A and B rise together since Cov(A, B) > 0.
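The computation above can be verified directly in plain Python using the simplified formula:

A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]
n = len(A)

mean_A = sum(A) / n                              # 4.0
mean_B = sum(B) / n                              # 9.6
E_AB = sum(a * b for a, b in zip(A, B)) / n      # E(A*B) = 42.4
print(E_AB - mean_A * mean_B)                    # 4.0 -> positive, so A and B rise together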


Discretization
• Three types of attributes
• Nominal—values from an unordered set, e.g., color, profession
• Ordinal—values from an ordered set, e.g., military or academic rank
• Numeric—real numbers, e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
• Interval labels can then be used to replace actual data values
• Reduce data size by discretization
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Discretization can be performed recursively on an attribute
• Prepare for further analysis, e.g., classification

Data Discretization Methods
• Typical methods: All the methods can be applied recursively
• Binning
• Top-down split, unsupervised
• Histogram analysis
• Top-down split, unsupervised
• Clustering analysis (unsupervised, top-down split or bottom-up merge)
• Decision-tree analysis (supervised, top-down split)
• Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)
Simple Discretization: Binning

• Equal-width (distance) partitioning


• Divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B
–A)/N.
• The most straightforward, but outliers may dominate presentation
• Skewed data is not handled well

• Equal-depth (frequency) partitioning


• Divides the range into N intervals, each containing approximately same number of samples
• Good data scaling
• Managing categorical attributes can be tricky

Binning Methods for Data Smoothing

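A small sketch of equal-width vs. equal-depth partitioning and of smoothing by bin means, covering the ideas of the two preceding slides in plain Python; the sorted values are illustrative:

data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])   # illustrative sorted values
N = 3   # number of bins

# Equal-width partitioning: interval width W = (max - min) / N
W = (data[-1] - data[0]) / N
width_bins = [[] for _ in range(N)]
for x in data:
    i = min(int((x - data[0]) // W), N - 1)   # bin index; the maximum value falls in the last bin
    width_bins[i].append(x)

# Equal-depth (equal-frequency) partitioning: roughly the same number of values per bin
depth = len(data) // N
depth_bins = [data[i * depth:(i + 1) * depth] for i in range(N)]

# Smoothing by bin means: replace each value by the mean of its (equal-depth) bin
smoothed = [[round(sum(b) / len(b), 2)] * len(b) for b in depth_bins]

print(width_bins)   # [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]
print(depth_bins)   # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smoothed)     # [[9.0, 9.0, 9.0, 9.0], [22.75, ...], [29.25, ...]]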
Discretization by Classification & Correlation Analysis

• Classification (e.g., decision tree analysis)


• Supervised: Given class labels, e.g., cancerous vs. benign
• Using entropy to determine split point (discretization point)
• Top-down, recursive split
• Details to be covered in Chapter “Classification”

• Correlation analysis (e.g., Chi-merge: χ2-based discretization)


• Supervised: use class information
• Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes,
i.e., low χ2 values) to merge
• Merge performed recursively, until a predefined stopping condition

Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that is much smaller in volume
but yet produces the same (or almost the same) analytical results
• Why data reduction? — A database/data warehouse may store terabytes of data. Complex data
analysis may take a very long time to run on the complete data set.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
• Data compression

Data Reduction : Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which is critical to clustering, outlier analysis, becomes less
meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
• Dimensionality reduction techniques
• Wavelet transforms
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection)

Mapping Data to a New Space
◼ Fourier transform
◼ Wavelet transform

Two Sine Waves Two Sine Waves + Noise Frequency

Wavelet Transformation
• Discrete wavelet transform (DWT) for linear signal processing, multi-resolution analysis
• Compressed approximation: store only a small fraction of the strongest of the wavelet
coefficients
• Similar to discrete Fourier transform (DFT), but better lossy compression, localized in
space
• Method:
• Length, L, must be an integer power of 2 (padding with 0’s, when necessary)
• Each transform has 2 functions: smoothing, difference
• Applies to pairs of data, resulting in two set of data of length L/2
• Applies two functions recursively, until reaches the desired length

Wavelet Decomposition
• Wavelets: A math tool for space-efficient hierarchical decomposition of functions
• S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to S^ = [2 3/4, -1 1/4, 1/2, 0, 0, -1, -1, 0]
• Compression: many small detail coefficients can be replaced by 0’s, and only the
significant coefficients are retained
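A sketch of the hierarchical pairwise averaging/differencing (Haar-style) that yields the coefficients quoted above; it assumes the input length is a power of 2:

def haar_decompose(s):
    # Returns [overall average, detail coefficients from coarsest to finest level]
    coeffs = []
    while len(s) > 1:
        avgs  = [(a + b) / 2 for a, b in zip(s[0::2], s[1::2])]
        diffs = [(a - b) / 2 for a, b in zip(s[0::2], s[1::2])]
        coeffs = diffs + coeffs     # coarser-level details go in front of finer-level ones
        s = avgs
    return s + coeffs

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0], i.e. [2 3/4, -1 1/4, 1/2, 0, 0, -1, -1, 0]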

Principal Component Analysis (PCA)

• Find a projection that captures the largest amount of variation in data


• The original data are projected onto a much smaller space, resulting in dimensionality
reduction.
Principal Component Analysis (Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal components)
that can be best used to represent data
• Normalize input data: Each attribute falls within the same range
• Compute k orthonormal (unit) vectors, i.e., principal components
• Each input data (vector) is a linear combination of the k principal component vectors
• The principal components are sorted in order of decreasing “significance” or strength
• Since the components are sorted, the size of the data can be reduced by eliminating the
weak components, i.e., those with low variance (i.e., using the strongest principal
components, it is possible to reconstruct a good approximation of the original data)
• Works for numeric data only
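A compact NumPy sketch of these steps (normalize, compute the covariance matrix, take its orthonormal eigenvectors, keep the strongest k); the random correlated data is only for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.5, 0.1],
                                          [0.5, 1.0, 0.2],
                                          [0.1, 0.2, 0.3]])   # correlated attributes (illustrative)

Xn = (X - X.mean(axis=0)) / X.std(axis=0)       # normalize: each attribute falls within the same range
cov = np.cov(Xn, rowvar=False)                  # covariance matrix of the normalized data
eigvals, eigvecs = np.linalg.eigh(cov)          # orthonormal eigenvectors = principal components

order = np.argsort(eigvals)[::-1]               # sort components by decreasing variance ("significance")
k = 2
strongest = eigvecs[:, order[:k]]
X_reduced = Xn @ strongest                      # project the data onto the k strongest components
print(X_reduced.shape)                          # (100, 2)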

Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
• Duplicate much or all of the information contained in one or more other
attributes
• E.g., purchase price of a product and the amount of sales tax paid
• Irrelevant attributes
• Contain no information that is useful for the data mining task at hand
• E.g., students' ID is often irrelevant to the task of predicting students' GPA

Heuristic Search in Attribute Selection

• There are 2^d possible attribute combinations of d attributes

• Typical heuristic attribute selection methods:
  • Best single attribute under the attribute independence assumption: choose by significance tests
  • Best step-wise feature selection:
    • The best single attribute is picked first
    • Then the next best attribute conditioned on the first, ...
  • Step-wise attribute elimination:
    • Repeatedly eliminate the worst attribute
  • Best combined attribute selection and elimination
  • Optimal branch and bound:
    • Use attribute elimination and backtracking
Attribute Creation (Feature Generation)
• Create new attributes (features) that can capture the important information in a
data set more effectively than the original ones
• Three general methodologies
• Attribute extraction
• Domain-specific
• Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet transformation
• Attribute construction
• Combining features
• Data discretization

Data Reduction: Numerosity Reduction

• Reduce data volume by choosing alternative, smaller forms of data


representation
• Parametric methods (e.g., regression)
• Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers)
• Ex.: Log-linear models—obtain value at a point in m-D space as the product
on appropriate marginal subspaces
• Non-parametric methods
• Do not assume models
• Major families: histograms, clustering, sampling, …

Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han, Micheline
Kamber and Jian Pei Morgan Kaufmann Publishers
BITS Pilani
Hyderabad Campus

Dimension Reduction using PCA


Today’s Agenda

• Curse of Dimensionality

• Introduction to Dimension Reduction

• Motivation for PCA

• PCA



Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies

• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful

• Illustration: randomly generate 500 points and compute the difference between the maximum and minimum distance between any pair of points


Dimensionality Reduction

• Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise

• Techniques
  – Principal Component Analysis
  – Singular Value Decomposition
  – Others: supervised and non-linear techniques


Dimensionality Reduction: PCA

• Goal is to find a projection that captures the largest amount of variation in data
  (figure: the direction u1 in the x1/x2 plane)


PCA

• Reduce high-dimensional data into something that can be explained in fewer dimensions.
• We need PCA since we suspect that in our data set of interest not all measures are independent, i.e., there exist correlations.
  (figure: x1 vs. x2 scatter; assume the data set represents the height and weight of people in a region)
Principal Component Analysis (PCA)

• Reduce higher-dimensional data into something that can be explained in fewer dimensions and gain an understanding of the data.

• We need PCA since we suspect that in our data set not all measures are independent and there exist correlations, structures, or patterns.


Motivation for PCA

Projection

Projection (Contd..)


Principal Component Analysis

PCA helps us in identifying the best projection. The goal is to find a lower-dimensional surface on which to project the data such that the sum of squared errors is minimal.


Salient features of PCA

• Directions are in the order of % of variance explained.
• Every PC is orthogonal.
• PCA can be solved using:
  • Maximum Variance
  • Minimum Error


PCA formulation

Steps for PCA

Example

How to derive S?

PCA overview
PCA Limitations

• Covariance is extremely sensitive to large values
  • Multiply some dimension by 1000
  • It dominates the covariance
  • It becomes the principal component
• Normalize each dimension to zero mean and unit variance:
    X' = (X − mean) / standard-deviation


PCA Limitations

• PCA assumes the underlying subspace is linear:
  • 1D – line
  • 2D – plane


PCA and classification

• PCA is unsupervised
• It maximizes the overall variance of the data along a small set of directions
• It does not know anything about class labels
• It can pick a direction that makes it hard to separate classes


Take home message

• As the number of dimensions increases, the complexity and computational power required to build the model also increase.

• Dimension reduction methods are employed to find the best representation of the data.

• PCA finds the vectors onto which the data can be projected while preserving the maximum variance.


BITS Pilani
Hyderabad Campus

Data
Today’s Learning objective

• Describe Data

• List various Data types

• List the issues in Data quality

• List and identify the right preprocessing techniques for given data


What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – Other names: variable, field, characteristic, feature, predictor, etc.
• A collection of attributes describes an object
  – Other names: record, point, case, sample, entity, or instance

In the table below, each row is an object and each column an attribute:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes


Attribute Values

• Each attribute draws its values from a set of possible values.

• The same attribute can be mapped to different attribute values
  • Example: temperature can be measured in Celsius or Fahrenheit

• Different attributes can be mapped to the same set of values
  • Example: attribute values for ID and age are integers


Types of Attributes

• Nominal / ordinal examples: hair color; car prices (low, medium, high)

• Discrete: has a finite or countably infinite set of values
  • Example: terms in a document

• Continuous: has real-number values
  • Examples: length, weight, temperature, etc.


Properties of Attribute Values

• The type of an attribute depends on which of the


following properties it possesses:
– Distinctness: =, ≠
– Order: <, >
– Addition: +, -
– Multiplication: *, /

– Nominal attribute: distinctness


– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & addition
– Ratio attribute: all 4 properties

BITS Pilani, Hyderabad Campus


Attribute Type | Description | Examples | Operations

Nominal  | The values of a nominal attribute are just different names,
           i.e., nominal attributes provide only enough information to
           distinguish one object from another. (=, ≠)
         | zip codes, employee ID numbers, eye color, sex: {male, female}
         | mode, entropy, contingency correlation, χ² test

Ordinal  | The values of an ordinal attribute provide enough information
           to order objects. (<, >)
         | hardness of minerals, {good, better, best}, grades, street numbers
         | median, percentiles, rank correlation, run tests, sign tests

Interval | For interval attributes, the differences between values are
           meaningful, i.e., a unit of measurement exists. (+, -)
         | calendar dates, temperature in Celsius or Fahrenheit
         | mean, standard deviation, Pearson's correlation, t and F tests

Ratio    | For ratio variables, both differences and ratios are
           meaningful. (*, /)
         | temperature in Kelvin, monetary quantities, counts, age, mass,
           length, electrical current
         | geometric mean, harmonic mean, percent variation
BITS Pilani, Hyderabad Campus
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: set of words in a collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes

• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.

BITS Pilani, Hyderabad Campus


Types of data sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data

BITS Pilani, Hyderabad Campus


Important Characteristics of Structured
Data
– Dimensionality
• Curse of Dimensionality

– Sparsity
• Only presence counts

– Resolution
• Patterns depend on the scale

BITS Pilani, Hyderabad Campus


Record Data

• Data that consists of a collection of records, each of


which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat

1 Yes Single 125K No


2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10

BITS Pilani, Hyderabad Campus


Data Matrix
• If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute

• Such data set can be represented by an m by n matrix,


where there are m rows, one for each object, and n
columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness

10.23                 5.27                  15.22     2.7   1.2

12.65                 6.25                  16.22     2.2   1.1

BITS Pilani, Hyderabad Campus


Document Data

• Each document becomes a `term' vector,


– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding term
occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season

Document 1    3     0     5     0      2     6    0     2       0       2

Document 2    0     7     0     2      1     0    0     3       0       0

Document 3    0     1     0     0      1     2    2     0       3       0

BITS Pilani, Hyderabad Campus
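
A minimal sketch of building such term vectors with scikit-learn's CountVectorizer; the three example sentences are assumptions, not the documents behind the table above.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the team lost the game despite a timeout",
        "the coach praised the team and the season",
        "a great play helped the team win the ball game"]

vectorizer = CountVectorizer()
term_matrix = vectorizer.fit_transform(docs)   # sparse document-term matrix

print(vectorizer.get_feature_names_out())      # the terms (one column per term)
print(term_matrix.toarray())                   # term counts per document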


Transaction Data
• A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of
products purchased by a customer during one
shopping trip constitute a transaction, while the
individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk

BITS Pilani, Hyderabad Campus


Graph Data
Examples: generic graph (with labelled nodes) and HTML links

<a href="papers/papers.html#bbbb"> Data Mining </a>
<li>
<a href="papers/papers.html#aaaa"> Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa"> Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff"> N-Body Computation and Dense Linear System Solvers

BITS Pilani, Hyderabad Campus


Chemical Data

Benzene Molecule: C6H6

BITS Pilani, Hyderabad Campus


Ordered Data
Sequences of transactions

An element of the
sequence

BITS Pilani, Hyderabad Campus


Ordered Data
Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

BITS Pilani, Hyderabad Campus


Ordered Data

Spatio-Temporal Data

Average Monthly
Temperature of land
and ocean

BITS Pilani, Hyderabad Campus


Data Quality

• What kinds of data quality problems?

• How can we detect problems with the data?

• What can we do about these problems?


• Examples of data quality problems:
– Noise and outliers
– missing values
– duplicate data

BITS Pilani, Hyderabad Campus


Noise
• Noise: An invalid signal overlapping valid data
– Examples: distortion of a person’s voice when talking
on a poor phone and “snow” on television screen

Two Sine Waves Two Sine Waves + Noise


BITS Pilani, Hyderabad Campus
Outliers
• Outliers are data objects with characteristics that are
  considerably different from those of most of the other data
  objects in the data set

BITS Pilani, Hyderabad Campus


Data pre-processing
1. Data cleaning: handling errors and missing values

2. Feature extraction: creating new features by combining and


transforming existing ones

• a crucial step! It determines what patterns you can find; it is
  application-specific and requires understanding of the domain

3. Data reduction

• Aggregation, sampling

• feature selection

• dimension reduction by transformations


BITS Pilani, Hyderabad Campus
Missing Values
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

• Handling missing values


– Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
– Replace with all possible values (weighted by their
probabilities)

BITS Pilani, Hyderabad Campus


Data Cleaning

• Strategies to handle Missing values

• If a feature has many missing values, prune the feature
  (keeping the features with mostly correct values)

• If a record has many missing values, prune the record

• Impute missing values

• If the modeling technique allows missing values, just


replace them with special values (like “NA”)

BITS Pilani, Hyderabad Campus
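
A minimal pandas sketch of these strategies; the toy DataFrame and column names are assumptions for illustration.

import pandas as pd
import numpy as np

df = pd.DataFrame({"age":    [25, np.nan, 47, 51],
                   "income": [50_000, 62_000, np.nan, 58_000]})

df_drop_rows = df.dropna()                            # prune records with missing values
df_drop_cols = df.dropna(axis=1, thresh=3)            # prune features with too many NaNs
df_imputed   = df.fillna(df.mean(numeric_only=True))  # impute with the column mean
df_flagged   = df.fillna("NA")                        # replace with a special value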


Duplicate Data

• Data set may include data objects that are duplicates, or


almost duplicates of one another
– Major issue when merging data from heterogeneous
sources

• Examples:
– Same person with multiple email addresses

• Data cleaning
– Process of dealing with duplicate data issues

BITS Pilani, Hyderabad Campus


Aggregation

• Combining two or more attributes (or objects) into a single


attribute (or object)

• Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc
– More “stable” data
• Aggregated data tends to have less variability

BITS Pilani, Hyderabad Campus


Aggregation
Variation of Precipitation in Australia

Standard Deviation of Average Monthly Precipitation vs.
Standard Deviation of Average Yearly Precipitation
BITS Pilani, Hyderabad Campus
Sampling

• Sampling is the main technique employed for data


selection.
– It is often used for both the preliminary investigation
of the data and the final data analysis.

• Statisticians sample because obtaining the entire set of


data of interest is too expensive or time consuming.

• Sampling is used in data mining because processing the


entire set of data of interest is too expensive or time
consuming.

BITS Pilani, Hyderabad Campus


Sampling

• The key principle for effective sampling is:

• A sample will work almost as well as using the entire data set
  if the sample is representative (what counts as representative
  differs from data set to data set).

• Sampling may remove outliers and if done improperly

can introduce noise.

BITS Pilani, Hyderabad Campus


Types of Sampling
• Simple Random Sampling
– There is an equal probability of selecting any particular item

• Sampling without replacement


– As each item is selected, it is removed from the population

• Sampling with replacement


– Objects are not removed from the population as they are selected
for the sample.
• In sampling with replacement, the same object can be picked up more than
once

• Stratified sampling
– Split the data into several partitions; then draw random samples
from each partition

BITS Pilani, Hyderabad Campus
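
A minimal pandas sketch of simple random and stratified sampling; the toy employee table and the "dept" column are assumptions.

import pandas as pd
import numpy as np

# Toy data: 1000 employees across 4 departments (illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame({"emp_id": range(1000),
                   "dept": rng.choice(["CS", "EE", "ME", "BIO"], size=1000)})

simple_random = df.sample(n=100, replace=False, random_state=42)   # without replacement
with_repl     = df.sample(n=100, replace=True,  random_state=42)   # with replacement

# Stratified sampling: draw the same fraction from every department
stratified = df.groupby("dept", group_keys=False).sample(frac=0.1, random_state=42)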


Sample Size

8000 points 2000 Points 500 Points

BITS Pilani, Hyderabad Campus


Feature extraction

• scaling and normalization: numerical → numerical

• discretization: numerical → categorical

• binarization: categorical → binary (0/1)

• creating similarity graphs: any type → graph

• transformations for dimension reduction: create new, less


redundant features and keep the best ones, both feature
extraction and data reduction

BITS Pilani, Hyderabad Campus


Scaling and Normalization

• Features with large magnitudes dominate the aggregate


functions like Euclidean distances.

• Hence, we can transform all features to the same scale or


standardize distributions.
• Normalization is particularly useful for classification algorithms.
• min-max normalization

• z-score normalization

• Normalization by decimal scaling

BITS Pilani, Hyderabad Campus


Scaling and Normalization (Contd..)

BITS Pilani, Hyderabad Campus


Min-max normalization

Transform the data from measured units to a new interval
[new_min_F, new_max_F] for feature F:

  v' = ((v - min_F) / (max_F - min_F)) * (new_max_F - new_min_F) + new_min_F

where v is the current value of feature F.

Suppose that the minimum and maximum values for the feature
income are $12,000 and $98,000, respectively, and we would like to
map income to the range [0.0, 1.0]. By min-max normalization, a
value of $73,600 for income is transformed to:

  (73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0.0) + 0.0 = 0.716

BITS Pilani, Hyderabad Campus


Standardization or Z-Score Normalization

BITS Pilani, Hyderabad Campus
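
A minimal NumPy sketch of min-max and z-score normalization; the three income values reuse the example above, everything else is illustrative.

import numpy as np

income = np.array([12_000, 73_600, 98_000], dtype=float)   # min, example value, max

# Min-max normalization to [0.0, 1.0]
minmax = (income - income.min()) / (income.max() - income.min())
print(minmax[1])        # approximately 0.716 for the $73,600 example

# Z-score normalization: zero mean, unit standard deviation
zscore = (income - income.mean()) / income.std()
print(zscore)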


Robust Scaling

• If there are many outliers, the mean and stdev are biased ⇒ use robust
  scaling with the median and interquartile range (IQR = Q3 - Q1):

  X' = (X - median) / IQR

• Lower Quartile (QL) or First Quartile (Q1): 25% of the data falls below this value
• Median or Second Quartile (Q2): 50% of the data falls below this value
• Upper Quartile (QU) or Third Quartile (Q3): 75% of the data falls below this value
BITS Pilani, Hyderabad Campus
Log Transformation

• Sometimes y = log2(x) helps to make distribution less


skewed or even normal.

BITS Pilani, Hyderabad Campus


Discretization
numerical → categorical
• Discretization of continuous attributes is most often
performed one attribute at a time, independent of other
attributes.

• This approach is known as static attribute discretization. At the
  other end of the spectrum is dynamic attribute discretization,
  where all attributes are discretized simultaneously while taking
  into account the interdependencies among them.

BITS Pilani, Hyderabad Campus


Discretization

• Unsupervised discretization
  • Class labels are ignored
  • Equal-interval (equal-width) binning
  • Equal-frequency binning
  • The best number of bins k is determined experimentally

• Supervised discretization

  • Entropy-based discretization

  • It tries to maximize the “purity” of the intervals (i.e., each
    interval should contain as small a mixture of class labels as possible)

BITS Pilani, Hyderabad Campus


Unsupervised Discretization
• Require the user to specify the number of intervals and/or how
many data points should be included in any given interval.
• The following heuristic is often used to choose intervals:
  • The number of intervals for each attribute should not be smaller
    than the number of classes (if known).
  • The other popular heuristic is to choose the number of intervals
    nFi for each attribute Fi (i = 1, ..., n, where n is the number of
    attributes) as nFi = M / (3 * C), where M is the number of training
    examples and C is the number of known categories.
BITS Pilani, Hyderabad Campus
Unsupervised Discretization

BITS Pilani, Hyderabad Campus


Unsupervised Discretization

• Equal-frequency binning

• An equal number of values are placed in each of the k bins.

• Disadvantage: Many occurrences of the same continuous

value could cause the values to be assigned into different

bins.

BITS Pilani, Hyderabad Campus


Example

BITS Pilani, Hyderabad Campus
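
A minimal pandas sketch of equal-width and equal-frequency binning; the 16 play-count values are reused from the revision exercise later in this document.

import pandas as pd

x = pd.Series([22, 12, 61, 57, 30, 1, 32, 37, 37, 68, 42, 11, 25, 7, 8, 16])

equal_width = pd.cut(x, bins=4)    # 4 intervals of equal width
equal_freq  = pd.qcut(x, q=4)      # 4 bins with (roughly) equal numbers of values

print(equal_width.value_counts())  # counts per interval differ
print(equal_freq.value_counts())   # counts per bin are (roughly) equal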


Supervised Discretization

• Suppose you are analyzing risk of Alzheimer's disease and


you split age data at age 16, age 24, and age 30.

Your bins look something like this: <=16, 16...24, 24...30, >30

Now you have a giant bin of people older than 30, where most
Alzheimer's patients are, and multiple bins split at lower values,
where you're not really getting much information.

• Because of this issue, we want to make meaningful splits in


our continuous variables.

BITS Pilani, Hyderabad Campus


Feature Subset Selection Techniques

• Brute-force approach: Try all possible feature subsets as


input to data mining algorithm

• Filter approaches: Compute a score for each feature and


then select features according to the score.

• Wrapper approaches: score feature subsets by seeing their


performance on a dataset using a classification algorithm.

• Embedded approaches: Select features during the process


of training.
BITS Pilani, Hyderabad Campus
Wrapper Methods

BITS Pilani, Hyderabad Campus


Sequential Forward Selection (SFS)

1. Start with an empty feature set

2. Try each remaining feature

3. Estimate classifier performance for adding each feature

4. Select feature that gives max improvement

5. Stop when there is no significant improvement

Disadvantage: Once a feature is retained, it cannot be discarded;

nesting problem

BITS Pilani, Hyderabad Campus
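
A minimal wrapper-style sketch using scikit-learn's SequentialFeatureSelector (available in recent scikit-learn releases); the k-NN classifier, the iris data, and the number of selected features are assumptions.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

sfs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=3),
                                n_features_to_select=2,
                                direction="forward")   # "backward" gives SBS
sfs.fit(X, y)
print(sfs.get_support())   # boolean mask of the selected features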


Sequential Backward Selection (SBS)

1. Start with a full feature set

2. Try removing each feature

3. Drop the feature with the smallest impact on classifier

performance

Disadvantage: SBS requires more computation than SFS

BITS Pilani, Hyderabad Campus


Search space for feature selection

Forward Feature subset selection

Backward Feature subset selection

BITS Pilani, Hyderabad Campus


Embedded Method for feature selection

• Embedded methods perform feature selection and


training of the algorithm in parallel.

• Example

• Lasso Regression

• Decision Trees

BITS Pilani, Hyderabad Campus


Feature Creation
• Create new attributes that can capture the important
information in a data set much more efficiently than the
original attributes

• Three general methodologies:

– Feature Extraction: e.g., multimedia features (low-, middle-, and
  high-level features)
• domain-specific

– Mapping Data to New Space

– Feature Construction: combining features


BITS Pilani, Hyderabad Campus
Take home message
• Features/attributes/measurements/independent variables can be of
  four different types: Nominal, Ordinal, Interval, or Ratio.
• Based on the type of data, the operations vary.
• The data set can be of the record, graph, or ordered type.
• Real-world data is dirty, so preprocessing is a very important step in
Data Mining.
• There are several methods for preprocessing, choosing the right
method depends on the problem and data obtained.

BITS Pilani, Hyderabad Campus


Take home message

• Missing values can be handled by eliminating features or


records or by imputation methods.

• Feature extraction methods like scaling, normalization,


and discretization need to be applied based on the
problem.

• Data reduction methods will be applied to reduce the


number of features required to build the model.

BITS Pilani, Hyderabad Campus


BITS Pilani
Hyderabad Campus

Similarity and Distance Measures


Today’s Learning objective
• What is Distance?
• Similarity vs. distance
• Properties of distance metrics
• Proximity Measures for Binary Nominal attributes
• Proximity Measures for Nominal Attributes
• Proximity Measures for Ordinal Attributes
• Proximity Measures for Numeric Attributes
• Proximity Measures for Mixed Attributes

BITS Pilani, Hyderabad Campus


What is Distance?

BITS Pilani, Hyderabad Campus


Similarity vs. distance

BITS Pilani, Hyderabad Campus


Metric: distance d that satisfies 4
properties
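
A distance d(x, y) is a metric if it satisfies the standard four properties:
• Non-negativity: d(x, y) >= 0
• Identity: d(x, y) = 0 if and only if x = y
• Symmetry: d(x, y) = d(y, x)
• Triangle inequality: d(x, z) <= d(x, y) + d(y, z)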

BITS Pilani, Hyderabad Campus


Proximity

➢ Examples:

✓ For an item bought by a customer, find other similar items

✓ Group together the customers of the site so that similar customers are shown the
same ad.

✓ Group together web documents so that you can separate the ones that talk about
politics and the ones that talk about sports.

✓ Find all the near-duplicate mirrored web documents.

✓ Find credit card transactions that are very different from previous transactions.

➢ To solve these problems, we need a definition of similarity or distance.

➢ For many problems, we need to quantify how close two objects are.

BITS Pilani, Hyderabad Campus


Proximity Measures for Binary attributes

BITS Pilani, Hyderabad Campus


Proximity Measures for Two or more
Binary attributes

BITS Pilani, Hyderabad Campus


Proximity Measure for Symmetric Binary
attribute

BITS Pilani, Hyderabad Campus


Proximity Measure with Symmetric
Binary

              object j = 1   object j = 0
object i = 1        1              1
object i = 0        2              2

BITS Pilani, Hyderabad Campus


Proximity Measure with Asymmetric
Binary

BITS Pilani, Hyderabad Campus


Proximity Measure with Asymmetric
Binary

              object j = 1   object j = 0
object i = 1        1              1
object i = 0        2              2

BITS Pilani, Hyderabad Campus
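
As a worked illustration of the two tables above (using the usual notation f11, f10, f01, f00 for the counts of matching/mismatching binary values; here f11 = 1, f10 = 1, f01 = 2, f00 = 2):

Symmetric binary (Simple Matching Coefficient):
  SMC = (f11 + f00) / (f11 + f10 + f01 + f00) = (1 + 2) / 6 = 0.5

Asymmetric binary (Jaccard coefficient, 0-0 matches ignored):
  J = f11 / (f11 + f10 + f01) = 1 / 4 = 0.25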


Proximity Measures for Nominal
(Categorical) Attribute

BITS Pilani, Hyderabad Campus


Proximity Measures for Nominal
(Categorical) Attribute

BITS Pilani, Hyderabad Campus


Proximity Measure for Ordinal Attribute

BITS Pilani, Hyderabad Campus


Proximity Measure for Ordinal Attribute

Consider the following set of records, where each record is defined
by two ordinal attributes Size = {S, M, L} and Quality = {A, B, C, Ex}
such that S < M < L and A < B < C < Ex. Each rank r is mapped to
[0, 1] via (r - 1) / (Mf - 1), where Mf is the number of levels:

Size (Mf = 3):                 Quality (Mf = 4):
S: (1 - 1)/(3 - 1) = 0         A:  (1 - 1)/(4 - 1) = 0
M: (2 - 1)/(3 - 1) = 0.5       B:  (2 - 1)/(4 - 1) = 0.33
L: (3 - 1)/(3 - 1) = 1         C:  (3 - 1)/(4 - 1) = 0.66
                               Ex: (4 - 1)/(4 - 1) = 1

BITS Pilani, Hyderabad Campus


Proximity Measure with Interval Scale

BITS Pilani, Hyderabad Campus


Proximity Measure with Interval Scale

BITS Pilani, Hyderabad Campus


Proximity Measure with Interval Scale

BITS Pilani, Hyderabad Campus


Proximity Measure with Interval Scale

BITS Pilani, Hyderabad Campus
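
A minimal SciPy sketch of the usual numeric (interval/ratio) proximity measures; the two example vectors are assumptions.

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(x, y))        # L2 distance
print(distance.cityblock(x, y))        # Manhattan / L1 distance
print(distance.minkowski(x, y, p=3))   # general Lp (Minkowski) distance
print(distance.cosine(x, y))           # 1 - cosine similarity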


Proximity Measure for Ratio scale

BITS Pilani, Hyderabad Campus


Proximity Measure for Ratio scale
Normalization:
➢ A major problem when using the similarity (or dissimilarity)
measures (such as Euclidean distance) is that the large
values frequently swamp the small ones.
➢ For example, consider the following data.

➢ Here, the contribution of Cost 2 and Cost 3 is insignificant
  compared to Cost 1 as far as the Euclidean distance is
  concerned.
➢ This problem can be avoided if we consider the normalized
values of all numerical attributes.
BITS Pilani, Hyderabad Campus
Proximity Measure for Mixed Attributes
➢ The previous metrics on similarity measures assume that all the
attributes were of the same type. Thus, a general approach is
needed when the attributes are of different types.

➢ One straightforward approach is to compute the similarity between


each attribute separately and then combine these attribute using a
method that results in a similarity between 0 and 1.

➢ Typically, the overall similarity is defined as the average of all the


individual attribute similarities.

BITS Pilani, Hyderabad Campus


Proximity Measure with Mixed Attributes

BITS Pilani, Hyderabad Campus


Take Home message

• Many algorithms compute proximity using either


similarity or dissimilarity.

• The distance metric used will depend on the type of


Feature/attribute.

BITS Pilani, Hyderabad Campus


BITS Pilani
Hyderabad Campus

Classification
Today’s Agenda

• Jargons used in Data Mining

• Tasks in Data Mining

• Decision Tree Algorithm

• Naïve Bayes Algorithm

BITS Pilani, Hyderabad Campus


Variables

X Y

BITS Pilani, Hyderabad Campus


Functions

If you give me one apple


I will give you three bananas

What is the
function between
X and Y?

Y=X+3

BITS Pilani, Hyderabad Campus


Functions (Contd..)

X: English Sentence Y: Hindi sentence

BITS Pilani, Hyderabad Campus


Functions (Contd..)

X: Board Configuration
Y: Next Move

BITS Pilani, Hyderabad Campus


Functions (Contd..)

BITS Pilani, Hyderabad Campus


Functions (Contd..)

X: English Sentence Y: Hindi sentence


?????????????????????????

BITS Pilani, Hyderabad Campus


Parameters

Y = 3X + 1:  X is the input, Y is the output

Y = WX + b:  W and b are parameters

Model:
  Input – fixed, comes from the training data
  Parameters – need to be estimated

BITS Pilani, Hyderabad Campus


Parameters

Training Data:  (X, Y) = (1, 0), (5, 16), (6, 20)        Model:  Y = WX + b

How to estimate the parameters W and b?
Assume random numbers for W and b, e.g.:

  Y = 1X + 0                    Y = 2X + 2
  X   Y    Y'                   X   Y    Y'
  1   0    1                    1   0    4
  5   16   5                    5   16   12
  6   20   6                    6   20   14

Which model is better?
BITS Pilani, Hyderabad Campus
Functions (Contd..)
X=1

Y=0

X=5

Y=16

X=6

Y=20

X=3

Y=??
BITS Pilani, Hyderabad Campus
Cost function
Model:  Yn = WXn + b
Cost:   C(W, b) = ∑ (Yn – Y'n)²  over n ∈ {0, 1, 2}
The model that gives us the lowest cost is the better model.

Y = 1X + 0                             Y = 2X + 2
n   X   Y    Y'   (Y – Y')²            n   X   Y    Y'   (Y – Y')²
0   1   0    1    1                    0   1   0    4    16
1   5   16   5    121                  1   5   16   12   16
2   6   20   6    196                  2   6   20   14   36
C(1, 0) = 318                          C(2, 2) = 68
BITS Pilani, Hyderabad Campus
Optimizer

Training Data
Model
n X Y
0 1 0
Yn = WXn + b
1 5 16
2 6 20
Optimizer
arg min C(W,b)
W,b ϵ [-∞ ∞]
Cost
C(W,b) = ∑ (Y – Y’)2
nϵ {0,1,2}

BITS Pilani, Hyderabad Campus


Gradients
Cost
C(W,b) = ∑ (Y – Y’)2
nϵ {0,1,2}

W0=2,b0=2; C(W,b)=68

BITS Pilani, Hyderabad Campus


Gradients

BITS Pilani, Hyderabad Campus
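
A minimal gradient-descent sketch for this cost function; the learning rate and iteration count are assumptions, and the three training points are the ones from the slides.

import numpy as np

X = np.array([1.0, 5.0, 6.0])
Y = np.array([0.0, 16.0, 20.0])

W, b = 2.0, 2.0          # initial guess: C(2, 2) = 68, as on the slide
lr = 0.01                # learning rate (assumed)

for _ in range(2000):
    Y_pred = W * X + b
    err = Y_pred - Y
    # Gradients of C(W, b) = sum((Y' - Y)^2) with respect to W and b
    dW = 2 * np.sum(err * X)
    db = 2 * np.sum(err)
    W -= lr * dW
    b -= lr * db

print(W, b)   # converges towards W = 4, b = -4, which fits all three points exactly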


Tasks in Data Mining

Some of the important tasks performed in data Mining:

Classification – Logistic Regression, Naïve Bayes, Decision Trees

Regression – Linear regression, Ridge Regression

Clustering – K-Means, Hierarchical Agglomerative Clustering

BITS Pilani, Hyderabad Campus


Tasks in Data Mining (contd..)

Some of the important tasks performed in data mining:

Classification – supervised
Regression – supervised
Clustering – unsupervised

BITS Pilani, Hyderabad Campus


The data and the goal

• Data: A set of data records (also called examples, instances


or cases) described by
– k features / attributes: f1, f2, … fk.
– a class: Each example is labelled with a pre-defined
class.
• Goal: To learn a classification model from the data that can
be used to predict the classes of new (future, or test)
cases/instances.

BITS Pilani, Hyderabad Campus


What do we mean by learning?
• Given
– a data set D,
– a task T, and
– a performance measure M,
• a computer system is said to learn from D to perform the task T if
after learning the system’s performance on T improves as measured
by M.
• In other words, the learned model helps the system to perform T
better than no learning.

BITS Pilani, Hyderabad Campus


An example

• Data: Loan application data

• Task: Predict whether a loan should be approved or not.

• Performance measure: accuracy.

• No learning: classify all future applications (test data) to


the majority class (i.e., Yes):

• Accuracy = 9/15 = 60%.

• We can do better than 60% with learning.

BITS Pilani, Hyderabad Campus


Fundamental assumption of learning

• Assumption: The distribution of training examples is


identical to that of test examples (including future unseen
examples).
• In practice, this assumption is often violated to a certain
degree.
• Strong violations will result in poor classification accuracy.
• To achieve good accuracy on the test data, training
examples must be sufficiently representative of the test data.

BITS Pilani, Hyderabad Campus


Introduction

• Decision tree learning is one of the most widely used


techniques for classification.

– Its classification accuracy is competitive with other


methods, and

– it is very efficient.

• The classification model is a tree called a decision tree.

BITS Pilani, Hyderabad Campus


Root node: 4 B | 3 C | 3 T

After splitting on an attribute:
  4 B | 0 C | 1 T      0 B | 3 C | 0 T      0 B | 0 C | 2 T

Keep adding features and splitting till you have all leaf nodes.

BITS Pilani, Hyderabad Campus


Entropy and Gini Index
Parent node: 6 Y | 3 N, split by f1 into c1 and c2:
  c1: 3 Y | 3 N        c2: 3 Y | 0 N

Entropy H(S) for c1 = -3/6 log2(3/6) - 3/6 log2(3/6) = 1   (impure split)
Entropy H(S) for c2 = -3/3 log2(3/3) - 0/3 log2(0/3) = 0   (pure split)
(by convention, 0 · log2(0) is taken as 0)

G.I. for c1 = 1 - ((3/6)² + (3/6)²) = 0.5

G.I. for c2 = 1 - ((3/3)² + (0/3)²) = 0

BITS Pilani, Hyderabad Campus


Information Gain 9 Y|5 N
f1

c1 c2
6 Y|2 N 3 Y|3 N

BITS Pilani, Hyderabad Campus
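
A small NumPy sketch that reproduces these impurity numbers and the information gain for the split above; the helper function names are assumptions.

import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()                 # convention: 0 * log2(0) = 0
    return -(p * np.log2(p)).sum()

def gini(counts):
    p = np.array(counts, dtype=float) / sum(counts)
    return 1.0 - (p ** 2).sum()

print(entropy([3, 3]), gini([3, 3]))       # impure split: 1.0, 0.5
print(entropy([3, 0]), gini([3, 0]))       # pure split:   0.0, 0.0

# Information gain for the 9 Y | 5 N parent split into (6 Y, 2 N) and (3 Y, 3 N)
parent = entropy([9, 5])                                  # about 0.940
children = (8/14) * entropy([6, 2]) + (6/14) * entropy([3, 3])
print(parent - children)                                  # gain of about 0.048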


Algorithm for decision tree learning

• Basic algorithm (a greedy divide-and-conquer algorithm)


– Assume attributes are categorical now (continuous attributes can
be handled too)
– Tree is constructed in a top-down recursive manner
– At start, all the training examples are at the root
– Examples are partitioned recursively based on selected attributes
– Attributes are selected based on an impurity function (e.g.,
information gain)
• Conditions for stopping partitioning
– All examples for a given node belong to the same class
– There are no remaining attributes for further partitioning –
majority class is the leaf
– There are no examples left

BITS Pilani, Hyderabad Campus
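
For reference, a minimal scikit-learn sketch of this greedy, top-down procedure; the iris data, the entropy criterion, and the depth limit are assumptions.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(criterion="entropy",   # or "gini"
                              max_depth=3,
                              random_state=0)
tree.fit(X, y)

print(export_text(tree))           # the learned splits, top-down
print(tree.predict(X[:5]))         # class predictions for new records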


Three Possible Partition Scenarios

BITS Pilani, Hyderabad Campus


Attribute Selection Measures
• An attribute selection measure is a heuristic for selecting the splitting
  criterion that "best" separates a given data partition D. Ideally:
  – Each resulting partition would be pure
  – A pure partition is a partition containing tuples that all belong to the
    same class
• Attribute selection measures (splitting rules)
– Determine how the tuples at a given node are to be split
– Provide ranking for each attribute describing the tuples
– The attribute with highest score is chosen
– Determine a split point or a splitting subset
• Methods
– Information gain, Gain ratio, Gini Index

BITS Pilani, Hyderabad Campus


Example

BITS Pilani, Hyderabad Campus


Example (contd..)

Based on the data, we can compute the probability of each class.

Transportation Mode:   Bus = 4   Car = 3   Train = 3   (out of 10)

Prob (Bus)   = 4 / 10 = 0.4
Prob (Car)   = 3 / 10 = 0.3
Prob (Train) = 3 / 10 = 0.3

Entropy = – 0.4 log2 (0.4) – 0.3 log2 (0.3) – 0.3 log2 (0.3) = 1.571

• Gini Index = 1 – (0.4² + 0.3² + 0.3²) = 0.660

BITS Pilani, Hyderabad Campus


BITS Pilani, Hyderabad Campus
Information gain
Split by attribute value (rows) vs. class = Transportation Mode (columns):

              Bus   Car   Train   Total
  Cheap        4     0      1       5
  Expensive    0     3      0       3
  Standard     0     0      2       2
                                   10

BITS Pilani, Hyderabad Campus
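
As a worked illustration, the information gain for this split can be computed from the table (using the parent entropy of 1.571 from the earlier example):

Entropy(Cheap)     = – 4/5 log2(4/5) – 1/5 log2(1/5) ≈ 0.722
Entropy(Expensive) = 0   (pure: all Car)
Entropy(Standard)  = 0   (pure: all Train)

Entropy after split = 5/10 × 0.722 + 3/10 × 0 + 2/10 × 0 ≈ 0.361
Information Gain    = 1.571 – 0.361 ≈ 1.210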


BITS Pilani, Hyderabad Campus
Second Iteration

Now we have only three attributes: Gender, car ownership and Income
level.
BITS Pilani, Hyderabad Campus
• Then, we repeat the procedure of computing degree of
impurity and information gain for the three attributes.
BITS Pilani, Hyderabad Campus
Third Iteration

BITS Pilani, Hyderabad Campus


Decision Tree

BITS Pilani, Hyderabad Campus


Probabilistic vs. Discriminative learning

“Probabilistic” learning
– Conditional models just explain y: p(y|x)
– Generative models also explain x: p(x,y)
• Often a component of unsupervised or semi-supervised learning
– Bayes and Naïve Bayes classifiers are generative models
BITS Pilani, Hyderabad Campus
Conditional Probability and Naïve Bayes
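
The rule applied on the next slides is Bayes' theorem:
  P(Ci | X) = P(X | Ci) P(Ci) / P(X)
Since P(X) is the same for every class, we predict the class Ci that maximizes
P(X | Ci) P(Ci); the "naïve" conditional-independence assumption lets us factor
P(X | Ci) into a product of per-attribute probabilities.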

BITS Pilani, Hyderabad Campus


BITS Pilani, Hyderabad Campus
Three flavors of Naïve Bayes

• Bernoulli naive bayes: Assumes that each feature is a


binary random variable.

• Multinomial naive bayes: Assumes each feature is a


random variable having discrete count.

• Gaussian naive bayes: Assumes each feature is a


random variable having continuous value.

BITS Pilani, Hyderabad Campus


X = (age = youth, income = medium, student = yes, credit = fair)
BITS Pilani, Hyderabad Campus
We need to maximize P(X|Ci)P(Ci), for i = 1, 2. P(Ci), the a priori
probability of each class, can be estimated based on the training
samples:
– P(buy = yes) = 9 / 14
– P(buy = no) = 5 / 14
To compute P(X|Ci), for i = 1, 2, we compute the following conditional
probabilities:
– P(age = youth|buy = yes) = 2/9
– P(age = youth|buy = no) = 3/ 5
– P(income = medium|buy = yes) = 4 / 9
– P(income = medium|buy = no) = 2 / 5
– P(student = yes|buy = yes) = 6 / 9
– P(student = yes|buy = no) = 1 / 5
– P(credit = fair|buy = yes) = 6 / 9
– P(credit = fair|buy = no) = 2 / 5
BITS Pilani, Hyderabad Campus
Using the probabilities from previous slide, we
obtain
P(X|buy = yes) = P(age = youth|buy = yes)
               × P(income = medium|buy = yes)
               × P(student = yes|buy = yes)
               × P(credit = fair|buy = yes)
               = 2/9 × 4/9 × 6/9 × 6/9
               = 0.044.

BITS Pilani, Hyderabad Campus


Similarly,
P(X|buy = no) = 3/5 * 2 /5 * 1 / 5 * 2 / 5
= 0.019
To find the class that maximizes P(X|Ci)P(Ci), we compute
P(X|buy = yes)P(buy = yes) = 0.028
P(X|buy = no)P(buy = no) = 0.007
Thus the naive Bayesian classifier predicts buy = yes for
sample X.

BITS Pilani, Hyderabad Campus
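
A small Python sketch that reproduces this calculation; the dictionaries simply restate the probabilities from the slides.

# Class priors and per-attribute conditional probabilities from the slides
prior = {"yes": 9/14, "no": 5/14}
cond = {
    "yes": {"age=youth": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit=fair": 6/9},
    "no":  {"age=youth": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit=fair": 2/5},
}

X = ["age=youth", "income=medium", "student=yes", "credit=fair"]

scores = {}
for c in prior:
    p = prior[c]
    for attr in X:
        p *= cond[c][attr]           # naive (conditional independence) assumption
    scores[c] = p

print(scores)                        # roughly {'yes': 0.028, 'no': 0.007}
print(max(scores, key=scores.get))   # 'yes' -> predicts buys_computer = yes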


Sentiment Analysis using Naïve Bayes
classifier

“(1) I bought an iPhone a few days ago. (2) It was such a


nice phone. (3) The touch screen was really cool. (4) The
voice quality was clear too. (5) Although the battery life
was not long, that is ok for me. (6) However, my mother
was mad with me as I did not tell her before I bought it. (7)
She also thought the phone was too expensive, and
wanted me to return it to the shop. … ”

BITS Pilani, Hyderabad Campus


Modelling Sentiment Analysis problem

• The solution to the Sentiment Analysis problem depends on the


granularity of the sentiment
• Positive / Negative or 1 to 5 stars: binary classification /
  multiclass classification
  • e.g., Naïve Bayes, support vector machines (SVM), logistic
    regression, maximum entropy, etc.

• Regression if the sentiment is a continuous value between 1 and 5

BITS Pilani, Hyderabad Campus


The Multinomial Naive Bayes’ Classifier

BITS Pilani, Hyderabad Campus


The Multinomial Naive Bayes’ Classifier

BITS Pilani, Hyderabad Campus


Gaussian naive bayes
X = (Refund = No, Married, Income = 120K)

P(X|Class=No) = P(Refund=No|Class=No)
              × P(Married|Class=No)
              × P(Income=120K|Class=No)
              = 4/7 × 4/7 × 0.0072 = 0.0024

P(X|Class=Yes) = P(Refund=No|Class=Yes)
               × P(Married|Class=Yes)
               × P(Income=120K|Class=Yes)
               = 1 × 0 × 1.2 × 10⁻⁹ = 0

BITS Pilani, Hyderabad Campus


Naïve Bayes

• Robust to noise points.

• Handle missing values by ignoring the instance during


probability estimate calculations.

• The independence assumption may not hold for some attributes;
  in that case we need to work with Bayesian Belief Networks.

BITS Pilani, Hyderabad Campus


Take home message

• Applications of supervised learning are in almost any field or


domain.
• There are still many other methods, e.g.,
– Support Vector Machines
– Logistic Regression
– This large number of methods also shows the importance of
classification and its wide applicability.
• It remains to be an active research area.

BITS Pilani, Hyderabad Campus


BITS Pilani
Hyderabad Campus

Revision
Topics for Mid Sem Exam

• Different types of data using applications


• Aggregation

• Sampling

• Dimensionality Reduction

• Feature subset selection

• Discretization (Example is available on the slide)

• Classification (Decision Tree, Naïve Bayes)

BITS Pilani, Hyderabad Campus


Different types of data using applications

• Suppose you are given movie reviews, and you are


asked to perform sentiment analysis. In this application,
what type of data would you use, and how will you create
the dataset?

BITS Pilani, Hyderabad Campus


Which preprocessing would you use?

• The Election Commission would like to find the voter


turnout in each state.

• The words in a document would need to be grouped based on the
  topics to which they belong.

BITS Pilani, Hyderabad Campus


Sampling

• There are 1000 employees in BITS out of which 100 of


them have to be selected for weekend work. All their
names will be put in a basket to pull 100 names out.

• Which sampling would you use?

• Suppose you are asked to find the employees to be


selected equally from all the departments which
sampling would you use and why?

BITS Pilani, Hyderabad Campus


Dimensionality Reduction - PCA

• What are the Principal Components?

• What do the eigenvalues and eigenvectors represent?

• How does dimension reduction happen?

• What are the constraints on the eigenvectors?

• What is the difference between feature selection and


feature reduction?

BITS Pilani, Hyderabad Campus


Feature subset selection

• If you are asked to find the top 3 features above a


threshold, which feature subset selection would you use
and why?

BITS Pilani, Hyderabad Campus


BITS Pilani, Hyderabad Campus
Discretization

• Pokémon video gaming company wishes to analyze their


data and has captured the data of each game played by the
customer. One attribute is the number of times the
customer has played the game. We have 16 examples:
{ 22, 12, 61, 57, 30, 1, 32, 37, 37, 68, 42, 11, 25, 7, 8, 16 }.
• Apply equal frequency and equi-width binning using
number of bins=4 and explain the difference between the
two methods

BITS Pilani, Hyderabad Campus


BITS Pilani, Hyderabad Campus
