
Data Mining

BITS Pilani M1: Introduction to Data Mining


BITS Pilani
Pilani|Dubai|Goa|Hyderabad

1.1 Data Mining Defined

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
• Data collection and data availability
  • Automated data collection tools, database systems, the Web, a computerized society
• Major sources of abundant data
  • Business: Web, e-commerce, transactions, stocks, …
  • Science: remote sensing, bioinformatics, scientific simulation, …
  • Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• "Necessity is the mother of invention": data mining, the automated analysis of massive data sets
Why Data Mining
A search engine (e.g., Google) receives hundreds of millions of queries every day. Each query can be
viewed as a transaction where the user describes her or his information need.

What novel and useful knowledge can a search engine learn from such a huge collection of queries
collected from users over time? Some patterns found in user search queries can disclose invaluable
knowledge that cannot be obtained by reading individual data items alone.

For example, Google's Flu Trends uses specific search terms as indicators of flu activity. It found a
close relationship between the number of people who search for flu-related information and the
number of people who actually have flu symptoms. A pattern emerges when all of the search
queries related to flu are aggregated. Using aggregated Google search data, Flu Trends can estimate
flu activity up to two weeks faster than traditional systems can. This example shows how data
mining can turn a large collection of data into knowledge that can help meet a current global
challenge.
Evolution of Database Technology

• 1960s:
• Data collection, database creation, IMS and network DBMS
• 1970s:
• Relational data model, relational DBMS implementation
• 1980s:
• RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
• Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
• Data mining, data warehousing, multimedia databases, and Web databases
• 2000s
• Stream data management and mining
• Data mining and its applications
• Web technology (XML, data integration) and global information systems

What Is Data Mining?

• Data mining (knowledge discovery from data)


• Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful)
patterns or knowledge from huge amounts of data
• Data mining: a misnomer?

• Alternative names
• Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern
analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

• Watch out: Is everything “data mining”?


• Simple search and query processing
• (Deductive) expert systems

What is (not) Data Mining?

What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about "Amazon"

What is Data Mining?
– Discover that certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly… in the Boston area)
– Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest vs. Amazon.com)
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
  • Enormity of data
  • High dimensionality of data
  • Heterogeneous, distributed nature of data
• Data mining sits at the intersection of statistics, machine learning/AI/pattern recognition, and database systems
Data Mining in Business Intelligence
Increasing potential to support business decisions (from bottom to top), with the typical role at each level:

• Decision Making (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Data Mining/KDD Process

Input Data → Data Pre-Processing → Data Mining → Post-Processing

• Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
• Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
• Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
Multi-Dimensional View of Data Mining
• Data to be mined
• Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse,
transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media,
graphs & social and information networks
• Knowledge to be mined (or: Data mining functions)
• Characterization, discrimination, association, classification, clustering, trend/deviation, outlier
analysis, etc.
• Descriptive vs. predictive data mining
• Multiple/integrated functions and mining at multiple levels
• Techniques utilized
• Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition,
visualization, high-performance, etc.
• Applications adapted
• Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text
mining, Web mining, etc.

Data Mining & Machine Learning
According to Tom M. Mitchell, Chair of Machine Learning at Carnegie Mellon
University and author of the book Machine Learning (McGraw-Hill),
A computer program is said to learn from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks in T, as measured by P, improves with
the experience E.
We now have a set of objects to define machine learning:
Task (T), Experience (E), and Performance (P)
With a computer running a set of tasks, the experience should be leading to performance
increases (to satisfy the definition)

Many data mining tasks are executed successfully with help of machine learning

Machine Learning: Hands-on for Developers and Technical Professionals by Jason Bell John Wiley & Sons
Data Mining on Diverse kinds of Data
Besides relational database data (from operational or analytical systems), there are many other kinds of data
that have diverse forms and structures and different semantic meanings.
Examples of such data include:
time-related or sequence data (e.g., historical records, stock exchange data, and time-series and biological
sequence data),
data streams (e.g., video surveillance and sensor data, which are continuously transmitted),
spatial data (e.g., maps),
engineering design data (e.g., the design of buildings, system components, or integrated circuits),
hypertext and multimedia data (including text, image, video, and audio data),
graph and networked data (e.g., social and information networks), and
the Web (a widely distributed information repository).
Diversity of data brings in new challenges such as handling special structures (e.g., sequences, trees, graphs,
and networks) and specific semantics (such as ordering, image, audio and video contents, and connectivity)
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han, Micheline
Kamber and Jian Pei Morgan Kaufmann Publishers
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

1.2 Data Mining Activities

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Data Mining Tasks
• Prediction Methods
• Use some variables to predict unknown or future values of other variables.

• Description Methods
• Find human-interpretable patterns that describe the data.

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Tasks...
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]
Classification: Definition
• Given a collection of records (training set )
• Each record contains a set of attributes, one of the attributes is the class.
• Find a model for class attribute as a function of the values of other
attributes.
• Goal: previously unseen records should be assigned a class as
accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given
data set is divided into training and test sets, with training set used to build
the model and test set used to validate it.
Classification Example

Training Set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set (class unknown):
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

The training set is used to learn the model (classifier); the learned model is then applied to the test set.
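The learn-then-validate workflow above can be sketched in a few lines. This is only an illustration, assuming scikit-learn is available; the synthetic data and the decision-tree model are stand-ins (any classifier could be used), not part of the slides.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic records: attribute vectors X and a class label y (illustrative only)
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

# Divide the given data into a training set (build the model) and a test set (validate it)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)   # learn the class as a function of the other attributes
print("test-set accuracy:", accuracy_score(y_test, model.predict(X_test)))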
Classification: Application 1
• Direct Marketing
• Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a
new cell-phone product.
• Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided otherwise. This {buy, don’t
buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and company-interaction related information
about all such customers.
• Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model.

From [Berry & Linoff] Data Mining Techniques, 1997


Classification: Application 2
• Fraud Detection
• Goal: Predict fraudulent cases in credit card transactions.
• Approach:
• Use credit card transactions and the information on its account-holder as attributes.
• When does a customer buy, what does he buy, how often he pays on time, etc
• Label past transactions as fraud or fair transactions. This forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card transactions on an account.
Classification: Application 3
• Customer Attrition/Churn:
• Goal: To predict whether a customer is likely to be lost to a competitor.
• Approach:
• Use detailed record of transactions with each of the past and present customers, to find
attributes.
• How often the customer calls, where he calls, what time-of-the day he calls most, his financial
status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997


Classification: Application 4
• Sky Survey Cataloging
• Goal: To predict class (star or galaxy) of sky objects, especially visually faint
ones, based on the telescopic survey images (from Palomar Observatory).
• 3000 images with 23,040 x 23,040 pixels per image.
• Approach:
• Segment the image.
• Measure image attributes (features) - 40 of them per object.
• Model the class based on these features.
• Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that
are difficult to find!

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Clustering Definition
• Given a set of data points, each having a set of attributes, and a
similarity measure among them, find clusters such that
• Data points in one cluster are more similar to one another.
• Data points in separate clusters are less similar to one another.
• Similarity Measures:
• Euclidean Distance if attributes are continuous.
• Other Problem-specific Measures.
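A minimal sketch of the Euclidean similarity measure mentioned above, in plain Python with made-up points:

import math

def euclidean(p, q):
    # Distance between two points whose attributes are continuous
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

a, b, c = (1.0, 2.0, 1.5), (1.2, 1.9, 1.4), (8.0, 9.0, 7.5)
print(euclidean(a, b))   # small distance: a and b would likely fall in the same cluster
print(euclidean(a, c))   # large distance: a and c would likely fall in separate clusters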
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.

Intracluster distances are minimized; intercluster distances are maximized.
Clustering: Application 1
• Market Segmentation:
• Goal: subdivide a market into distinct subsets of customers where any subset
may conceivably be selected as a market target to be reached with a distinct
marketing mix.
• Approach:
• Collect different attributes of customers based on their geographical and lifestyle related
information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of customers in same
cluster vs. those from different clusters.
Clustering: Application 2
• Document Clustering:
• Goal: To find groups of documents that are similar to each other based on the
important terms appearing in them.
• Approach: To identify frequently occurring terms in each document. Form a
similarity measure based on the frequencies of different terms. Use it to
cluster.
• Gain: Information Retrieval can utilize the clusters to relate a new document
or search term to clustered documents.
Clustering of S&P 500 Stock Data
• Observe Stock Movements every day.
• Clustering points: Stock-{UP/DOWN}
• Similarity Measure: Two points are more similar if the events
described by them frequently happen together on the same day.
• We used association rules to quantify a similarity measure.
Discovered Clusters and Industry Groups:

Cluster 1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-DOWN, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-DOWN, Tellabs-Inc-DOWN, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN

Cluster 2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN

Cluster 3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN

Cluster 4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
Association Rule Discovery: Definition
• Given a set of records each of which contain some number of items from a given
collection;
• Produce dependency rules which will predict occurrence of an item based on
occurrences of other items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
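As an illustration of how a rule such as {Milk} --> {Coke} would be scored against these transactions, the sketch below computes its support and confidence in plain Python (the rule-search algorithm itself is not shown):

transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]
antecedent, consequent = {"Milk"}, {"Coke"}

n_ante = sum(antecedent <= t for t in transactions)                  # transactions containing {Milk}
n_both = sum((antecedent | consequent) <= t for t in transactions)   # transactions containing {Milk, Coke}

print("support    =", n_both / len(transactions))   # 3/5 = 0.6
print("confidence =", n_both / n_ante)               # 3/4 = 0.75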
Association Rule Discovery: Application 1

• Marketing and Sales Promotion:


• Let the rule discovered be
{Bagels, … } --> {Potato Chips}
• Potato Chips as consequent => Can be used to determine what should be
done to boost its sales.
• Bagels in the antecedent => Can be used to see which products would be
affected if the store discontinues selling bagels.
• Bagels in antecedent and Potato chips in consequent => Can be used to see
what products should be sold with Bagels to promote sale of Potato chips!
Association Rule Discovery: Application

• Inventory Management:
• Goal: A consumer appliance repair company wants to anticipate the nature of
repairs on its consumer products and keep the service vehicles equipped with
right parts to reduce the number of visits to consumer households.
• Approach: Process the data on tools and parts required in previous repairs at
different consumer locations and discover the co-occurrence patterns.
Sequential Pattern Discovery: Definition
• Given is a set of objects, with each object associated with its own timeline of events, find rules that predict
strong sequential dependencies among different events.

(A B) (C) (D E)

• Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
• Timing constraints include maxgap (xg), mingap (ng), window size (ws), and maxspan (ms); in the diagram, consecutive elements must satisfy <= xg and > ng, the events within one element must fall within <= ws, and the whole pattern must fit within <= ms.
Sequential Pattern Discovery: Examples
• In telecommunications alarm logs,
• (Inverter_Problem Excessive_Line_Current)
(Rectifier_Alarm) --> (Fire_Alarm)
• In point-of-sale transaction sequences,
• Computer Bookstore:
(Intro_To_Visual_C) (C++_Primer) -->
(Perl_for_dummies,Tcl_Tk)
• Athletic Apparel Store:
(Shoes) (Racket, Racketball) --> (Sports_Jacket)
Regression
• Predict a value of a given continuous valued variable based on the values of
other variables, assuming a linear or nonlinear model of dependency.
• Greatly studied in statistics, neural network fields.
• Examples:
• Predicting sales amounts of a new product based on advertising expenditure.
• Predicting wind velocities as a function of temperature, humidity, air
pressure, etc.
• Time series prediction of stock market indices.
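A small illustrative sketch of fitting a linear model with NumPy least squares; the advertising/sales figures are invented for the example:

import numpy as np

advertising = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # advertising expenditure (made up)
sales = np.array([2.1, 4.3, 6.2, 8.1, 9.9])          # observed sales amounts (made up)

w, b = np.polyfit(advertising, sales, deg=1)          # least-squares fit of sales = w*advertising + b
print("predicted sales for a spend of 6.0:", w * 6.0 + b)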
Deviation/Anomaly Detection
• Detect significant deviations from normal behavior

• Applications:
• Credit Card Fraud Detection
• Network Intrusion Detection
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han, Micheline
Kamber and Jian Pei Morgan Kaufmann Publishers
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

1.3 DM Process & Challenges

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
DM Process
• The standard data mining process involves
1. understanding the problem,
2. preparing the data (samples),
3. developing the model,
4. applying the model on a data set to see how the model may work in real
world, and
5. production deployment.
• A popular data mining process framework is CRISP-DM (Cross-Industry Standard Process for Data Mining). This framework was developed by a consortium of companies involved in data mining.
Generic Data mining process
Prior Knowledge
• Data Mining tools/solutions identify hidden patterns.
• Generally we get many patterns
• Out of them many could be false or trivial.
• Filtering false patterns requires domain understanding.
• Understanding how the data is collected, stored, transformed,
reported, and used is essential.
• Causation vs. Correlation
• Example: a bank decides the interest rate based on the credit score, so in the data the two move together; but it would not make sense to derive the credit score from the interest rate.
Data Preparation
• Data needs to be understood. It requires descriptive statistics such as mean, median, mode, standard deviation, and range for each
attribute
• Data quality is an ongoing concern wherever data is collected, processed, and stored.
• The data cleansing practices include elimination of duplicate records, quarantining outlier records that exceed the bounds, standardization of attribute
values, substitution of missing values, etc.
• it is critical to check the data using data exploration techniques in addition to using prior knowledge of the data and business before building models to
ensure a certain degree of data quality

• Missing Values
• Need to track the data lineage of the data source to find right solution
• Data Types and Conversion
• The attributes in a data set can be of different types, such as continuous numeric (interest rate), integer numeric (credit score), or categorical
• data mining algorithms impose different restrictions on what data types they accept as inputs
• Transformation
• Can go beyond type conversion, may include dimensionality reduction or numerosity reduction
• Outliers are anomalies in the data set
• May occur legitimately or erroneously.
• Feature Selection
• Many data mining problems involve a data set with hundreds to thousands of attributes, most of which may not be helpful. Some attributes may be
correlated, e.g. sales amount and tax.
• Data Sampling may be adequate in many cases
Modeling
A model is the abstract representation of the data and its relationships in a given data set.

Data mining models can be classified into the following categories: classification, regression, association analysis, clustering, and outlier or anomaly detection.

Each category has a few dozen different algorithms; each takes a slightly different approach to solve the problem at hand.
Application
• The model deployment stage considerations:
• assessing model readiness, technical integration, response time, model maintenance, and assimilation
• Production Readiness
• Real-time response capabilities, and other business requirements
• Technical Integration
• Use of modeling tools (e.g. RapidMiner), Use of PMML for portable and consistent format of model
description, integration with other tools
• Timeliness
• The trade-offs between production responsiveness and build time need to be considered
• Remodeling
• The conditions in which the model is built may change after deployment
• Assimilation
• The challenge is to assimilate the knowledge gained from data mining in the organization. For example, the
objective may be finding logical clusters in the customer database so that separate treatment can be provided
to each customer cluster.
CRISP data mining framework

CRISP-DM is the most popular methodology for analytics, data mining, and data science projects, with a 43% share as per the 2014 KDnuggets Poll.

CRISP-DM was conceived in 1996. In 1997 it got underway as a European Union project, led by SPSS, Teradata, Daimler AG, NCR Corporation and OHRA.
DM Issues/Challenges – Mining Methodology
Mining Methodology involves the investigation of new kinds of knowledge, mining in multidimensional space, integrating methods from other
disciplines, and the consideration of semantic ties among data objects.
• Mining various and new kinds of knowledge: Data mining covers a wide spectrum of data analysis and knowledge discovery tasks, from
data characterization and discrimination to association and correlation analysis, classification, regression, clustering, outlier analysis,
sequence analysis, and trend and evolution analysis.
• Mining knowledge in multidimensional space: When searching for knowledge in large data sets, we can explore the data in
multidimensional space. That is, we can search for interesting patterns among combinations of dimensions (attributes) at varying levels of
abstraction. Data can be aggregated or viewed as a multidimensional data cube.
• Data mining—an interdisciplinary effort: For example, to mine data with natural language text, it makes sense to fuse data mining methods
with methods of information retrieval and natural language processing, e.g. consider the mining of software bugs in large programs, known
as bug mining, benefits from the incorporation of software engineering knowledge into the data mining process.
• Boosting the power of discovery in a networked environment: Most data objects reside in a linked or interconnected environment,
whether it be the Web, database relations, files, or documents. Semantic links across multiple data objects can be used to advantage in
data mining.
• Handling uncertainty, noise, or incompleteness of data: Data often contain noise, errors, exceptions, or uncertainty, or are incomplete.
Errors and noise may confuse the data mining process, leading to the derivation of erroneous patterns. Data cleaning, data preprocessing,
outlier detection and removal, and uncertainty reasoning are examples of techniques that need to be integrated with the data mining
process.
• Pattern evaluation and pattern- or constraint-guided mining: What makes a pattern interesting may vary from user to user. Therefore,
techniques are needed to assess the interestingness of discovered patterns based on subjective measures. These estimate the value of
patterns with respect to a given user class, based on user beliefs or expectations.
DM Issues/Challenges – User Interaction
The user plays an important role in the data mining process. Interesting areas include how to interact with a data mining system, how to
incorporate a user's background knowledge in mining, and how to visualize and comprehend data mining results.

• Interactive mining: The data mining process should be highly interactive. Thus, it is important to build flexible user interfaces and an
exploratory mining environment, facilitating the user's interaction with the system. A user may like to first sample a set of data, explore
general characteristics of the data, and estimate potential mining results. Interactive mining should allow users to dynamically change the
focus of a search, to refine mining requests based on returned results, and to drill, dice, and pivot through the data and knowledge space
interactively, dynamically exploring "cube space" while mining.

• Incorporation of background knowledge: Background knowledge, constraints, rules, and other information regarding the domain under
study should be incorporated into the knowledge discovery process. Such knowledge can be used for pattern evaluation as well as to guide
the search toward interesting patterns.

• Ad hoc data mining and data mining query languages: Query languages (e.g., SQL) have played an important role in flexible searching
because they allow users to pose ad hoc queries. Similarly, high-level data mining query languages or other high-level flexible user
interfaces will give users the freedom to define ad hoc data mining tasks. This should facilitate specification of the relevant sets of data for
analysis, the domain knowledge, the kinds of knowledge to be mined, and the conditions and constraints to be enforced on the discovered
patterns. Optimization of the processing of such flexible mining requests is another promising area of study.

• Presentation and visualization of data mining results: How can a data mining system present data mining results, vividly and flexibly, so
that the discovered knowledge can be easily understood and directly usable by humans? This is especially crucial if the data mining process
is interactive. It requires the system to adopt expressive knowledge representations, user-friendly interfaces, and visualization techniques.
DM Issues/Challenges - Efficiency and Scalability
Efficiency and scalability are always considered when comparing data mining algorithms. As data amounts continue to multiply, these two
factors are especially critical.

• Efficiency and scalability of data mining algorithms: Data mining algorithms must be efficient and scalable in order to effectively extract
information from huge amounts of data in many data repositories or in dynamic data streams. In other words, the running time of a data
mining algorithm must be predictable, short, and acceptable by applications. Efficiency, scalability, performance, optimization, and the
ability to execute in real time are key criteria that drive the development of many new data mining algorithms.

• Parallel, distributed, and incremental mining algorithms: The humongous size of many data sets, the wide distribution of data, and the
computational complexity of some data mining methods are factors that motivate the development of parallel and distributed data-
intensive mining algorithms. Such algorithms first partition the data into "pieces." Each piece is processed, in parallel, by searching for
patterns. The parallel processes may interact with one another. The patterns from each partition are eventually merged.

• Cloud computing and cluster computing, which use computers in a distributed and collaborative way to tackle very large-scale
computational tasks, are also active research themes in parallel data mining. In addition, the high cost of some data mining processes and
the incremental nature of input promote incremental data mining, which incorporates new data updates without having to mine the entire
data "from scratch." Such methods perform knowledge modification incrementally to amend and strengthen what was previously
discovered.
DM Issues/Challenges - Diversity of Database Types
The wide diversity of database types brings about challenges to data mining.

Handling complex types of data: Diverse applications generate a wide spectrum of new data types, from structured data such as relational and
data warehouse data to semi-structured and unstructured data; from stable data repositories to dynamic data streams; from simple data
objects to temporal data, biological sequences, sensor data, spatial data, hypertext data, multimedia data, software program code, Web data,
and social network data. It is unrealistic to expect one data mining system to mine all kinds of data, given the diversity of data types and the
different goals of data mining. Domain- or application-dedicated data mining systems are being constructed for in-depth mining of specific
kinds of data. The construction of effective and efficient data mining tools for diverse applications remains a challenging area.

Mining dynamic, networked, and global data repositories: Multiple sources of data are connected by the Internet and various kinds of
networks, forming gigantic, distributed, and heterogeneous global information systems and networks. The discovery of knowledge from
different sources of structured, semi-structured, or unstructured yet interconnected data with diverse data semantics poses great challenges to
data mining. Mining such gigantic, interconnected information networks may help disclose many more patterns and knowledge in
heterogeneous data sets than can be discovered from a small set of isolated data repositories. Web mining, multisource data mining, and
information network mining have become challenging and fast-evolving data mining fields.
DM Issues/Challenges - Society
How does data mining impact society? What steps can data mining take to preserve the privacy of individuals? Do we use data mining in our
daily lives without even knowing that we do?

Social impacts of data mining: With data mining penetrating our everyday lives, it is important to study the impact of data mining on society.
How can we use data mining technology to benefit society? How can we guard against its misuse? The improper disclosure or use of data and
the potential violation of individual privacy and data protection rights are areas of concern that need to be addressed.

Privacy-preserving data mining: Data mining will help scientific discovery, business management, economy recovery, and security protection
(e.g., the real-time discovery of intruders and cyberattacks). However, it poses the risk of disclosing an individual's personal information.
Studies on privacy-preserving data publishing and data mining are ongoing. The philosophy is to observe data sensitivity and preserve people's
privacy while performing successful data mining.

Invisible data mining: We cannot expect everyone in society to learn and master data mining techniques. More and more systems should have
data mining functions built within so that people can perform data mining or use data mining results simply by mouse clicking, without any
knowledge of data mining algorithms. Intelligent search engines and Internet-based stores perform such invisible data mining by incorporating
data mining into their components to improve their functionality and performance. This is done often unbeknownst to the user. For example,
when purchasing items online, users may be unaware that the store is likely collecting data on the buying patterns of its customers, which may
be used to recommend other items for purchase in the future.
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han, Micheline
Kamber and Jian Pei Morgan Kaufmann Publishers
R1 Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner
by Vijay Kotu and Bala Deshpande Morgan Kaufmann Publishers
Data Mining
BITS Pilani M2: Data Preprocessing
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

2.1 Data Preprocessing Concepts

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Preprocessing Objectives

• To improve data quality

• To modify data to better fit specific data mining technique


Major Tasks in Data Preprocessing

• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation

Data Quality: Multidimensional View

• Measures for data quality: A multidimensional view


• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling, …
• Timeliness: timely update?
• Believability: how much are the data trusted to be correct?
• Interpretability: how easily the data can be understood?

Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:


• Noise and outliers
• missing values
• duplicate data
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument faulty,
human or computer error, transmission error
• incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• e.g., Occupation = “ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Salary = “−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age = “42”, Birthday = “03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?

Incomplete (Missing) Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• not register history or changes of the data
• Missing data may need to be inferred

How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when doing classification)—
not effective when the % of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
• a global constant : e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based such as Bayesian formula or decision tree

Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data

How to Handle Noisy Data?
• Binning (also used for discretization)
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
• Binning methods smooth a sorted data value by consulting its "neighborhood," that
is, the values around it, i.e. they perform local smoothing.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with possible outliers)

Noise
• Noise refers to modification of original values
• Examples: distortion of a person’s voice when talking on a poor phone and
“snow” on television screen

Two Sine Waves Two Sine Waves + Noise


Duplicate Data
• Data set may include data objects that are duplicates, or almost
duplicates of one another
• Major issue when merging data from heterogeneous sources

• Examples:
• Same person with multiple email addresses

• Data cleaning
• Process of dealing with duplicate data issues
Outliers
• Outliers are data objects with characteristics that are considerably
different than most of the other data objects in the data set
Data Cleaning as a Process
• Data discrepancy detection
• Use metadata (e.g., domain, range, dependency, distribution)
• Check field overloading
• Check uniqueness rule, consecutive rule and null rule
• Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors
and make corrections
• Data auditing: by analyzing data to discover rules and relationship to detect violators (e.g.,
correlation and clustering to find outliers)
• Data migration and integration
• Data migration tools: allow transformations to be specified
• ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a
graphical user interface
• Integration of the two processes
• Iterative and interactive (e.g., Potter's Wheel)
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han, Micheline
Kamber and Jian Pei Morgan Kaufmann Publishers
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

2.2 Data Preprocessing Techniques

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Major Tasks in Data Preprocessing

• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation

Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store

• Schema integration: e.g., A.cust-id ≡ B.cust-#


• Integrate metadata from different sources

• Entity identification problem:


• Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton

• Detecting and resolving data value conflicts


• For the same real world entity, attribute values from different sources are different
• Possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration

• Redundant data occur often when integrating multiple databases


• Object identification: The same attribute or object may have different names
in different databases
• Derivable data: One attribute may be a “derived” attribute in another table,
e.g., annual revenue
• Redundant attributes may be able to be detected by correlation analysis and
covariance analysis
• Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality

Correlation Analysis (Nominal Data)
• χ² (chi-square) test:

    χ² = Σ (Observed − Expected)² / Expected

• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• Correlation does not imply causality
  • # of hospitals and # of car thefts in a city are correlated
  • Both are causally linked to a third variable: population
Chi-Square Calculation: An Example
                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)          450
Not like science fiction    50 (210)   1000 (840)         1050
Sum (col.)                 300         1200               1500

• χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):

    χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

• It shows that like_science_fiction and play_chess are correlated in the group
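The same statistic can be reproduced in a few lines of plain Python; each expected count is row_total * col_total / grand_total:

observed = {("fiction", "chess"): 250, ("fiction", "no_chess"): 200,
            ("no_fiction", "chess"): 50, ("no_fiction", "no_chess"): 1000}
row_totals = {"fiction": 450, "no_fiction": 1050}
col_totals = {"chess": 300, "no_chess": 1200}
grand_total = 1500

chi2 = 0.0
for (row, col), obs in observed.items():
    expected = row_totals[row] * col_totals[col] / grand_total   # e.g. 450 * 300 / 1500 = 90
    chi2 += (obs - expected) ** 2 / expected
print(chi2)   # about 507.9, matching the calculation above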
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson's product-moment coefficient):

    r(A,B) = Σ (ai − Ā)(bi − B̄) / ((n − 1)·σA·σB) = (Σ ai·bi − n·Ā·B̄) / ((n − 1)·σA·σB)

  where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ ai·bi is the sum of the AB cross-product.
• If r(A,B) > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
• r(A,B) = 0: no linear correlation; r(A,B) < 0: negatively correlated
Correlation (viewed as linear relationship)
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize data objects, A and B, and
then take their dot product

a'k = (ak − mean(A)) / std(A)
b'k = (bk − mean(B)) / std(B)

correlation(A, B) = A' • B'
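A quick numeric check of the correlation coefficient using NumPy; the two series reuse the stock values from the covariance example below and are otherwise illustrative:

import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Standardize, then average the cross-products (equivalent to the r(A,B) formula above)
A_std = (A - A.mean()) / A.std(ddof=1)
B_std = (B - B.mean()) / B.std(ddof=1)
r = np.sum(A_std * B_std) / (len(A) - 1)

print(r, np.corrcoef(A, B)[0, 1])   # both expressions give the same value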
Covariance (Numeric Data)

• Covariance is similar to correlation:

    Cov(A, B) = E[(A − mean(A))·(B − mean(B))] = Σ (ai − mean(A))(bi − mean(B)) / n

    Correlation coefficient: r(A,B) = Cov(A, B) / (σA·σB)

  where n is the number of tuples, mean(A) and mean(B) are the respective means (expected values) of A and B, and σA and σB are the respective standard deviations of A and B.
• Positive covariance: if Cov(A,B) > 0, then A and B both tend to be larger than their expected values.
• Negative covariance: if Cov(A,B) < 0, then when A is larger than its expected value, B is likely to be smaller than its expected value.
• If A and B are independent, Cov(A,B) = 0 (but Cov(A,B) = 0 does not by itself imply independence).
Co-Variance: An Example

• It can be simplified in computation as: Cov(A, B) = E(A·B) − mean(A)·mean(B)

• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).

• Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
• Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4

• Thus, A and B rise together since Cov(A, B) > 0.
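The computation above can be verified directly in plain Python using the simplified formula:

A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]
n = len(A)

mean_A = sum(A) / n                              # 4.0
mean_B = sum(B) / n                              # 9.6
E_AB = sum(a * b for a, b in zip(A, B)) / n      # E(A*B) = 42.4
print(E_AB - mean_A * mean_B)                    # 4.0 -> positive, so A and B rise together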


Discretization
• Three types of attributes
• Nominal—values from an unordered set, e.g., color, profession
• Ordinal—values from an ordered set, e.g., military or academic rank
• Numeric—real numbers, e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
• Interval labels can then be used to replace actual data values
• Reduce data size by discretization
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Discretization can be performed recursively on an attribute
• Prepare for further analysis, e.g., classification

Data Discretization Methods
• Typical methods: All the methods can be applied recursively
• Binning
• Top-down split, unsupervised
• Histogram analysis
• Top-down split, unsupervised
• Clustering analysis (unsupervised, top-down split or bottom-up merge)
• Decision-tree analysis (supervised, top-down split)
• Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)
Simple Discretization: Binning

• Equal-width (distance) partitioning


• Divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B
–A)/N.
• The most straightforward, but outliers may dominate presentation
• Skewed data is not handled well

• Equal-depth (frequency) partitioning


• Divides the range into N intervals, each containing approximately same number of samples
• Good data scaling
• Managing categorical attributes can be tricky

Binning Methods for Data Smoothing

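A small sketch of equal-width vs. equal-depth partitioning and of smoothing by bin means, covering the ideas of the two preceding slides in plain Python; the sorted values are illustrative:

data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])   # illustrative sorted values
N = 3   # number of bins

# Equal-width partitioning: interval width W = (max - min) / N
W = (data[-1] - data[0]) / N
width_bins = [[] for _ in range(N)]
for x in data:
    i = min(int((x - data[0]) // W), N - 1)   # bin index; the maximum value falls in the last bin
    width_bins[i].append(x)

# Equal-depth (equal-frequency) partitioning: roughly the same number of values per bin
depth = len(data) // N
depth_bins = [data[i * depth:(i + 1) * depth] for i in range(N)]

# Smoothing by bin means: replace each value by the mean of its (equal-depth) bin
smoothed = [[round(sum(b) / len(b), 2)] * len(b) for b in depth_bins]

print(width_bins)   # [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]
print(depth_bins)   # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smoothed)     # [[9.0, 9.0, 9.0, 9.0], [22.75, ...], [29.25, ...]]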
Discretization by Classification & Correlation Analysis

• Classification (e.g., decision tree analysis)


• Supervised: Given class labels, e.g., cancerous vs. benign
• Using entropy to determine split point (discretization point)
• Top-down, recursive split
• Details to be covered in Chapter “Classification”

• Correlation analysis (e.g., Chi-merge: χ2-based discretization)


• Supervised: use class information
• Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes,
i.e., low χ2 values) to merge
• Merge performed recursively, until a predefined stopping condition

Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that is much smaller in volume
but yet produces the same (or almost the same) analytical results
• Why data reduction? — A database/data warehouse may store terabytes of data. Complex data
analysis may take a very long time to run on the complete data set.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
• Data compression

Data Reduction : Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which is critical to clustering, outlier analysis, becomes less
meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
• Dimensionality reduction techniques
• Wavelet transforms
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection)

Mapping Data to a New Space
◼ Fourier transform
◼ Wavelet transform

Two Sine Waves Two Sine Waves + Noise Frequency

Wavelet Transformation
• Discrete wavelet transform (DWT) for linear signal processing, multi-resolution analysis
• Compressed approximation: store only a small fraction of the strongest of the wavelet
coefficients
• Similar to discrete Fourier transform (DFT), but better lossy compression, localized in
space
• Method:
• Length, L, must be an integer power of 2 (padding with 0’s, when necessary)
• Each transform has 2 functions: smoothing, difference
• Applies to pairs of data, resulting in two set of data of length L/2
• Applies two functions recursively, until reaches the desired length

Wavelet Decomposition
• Wavelets: A math tool for space-efficient hierarchical decomposition of functions
• S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to S^ = [2 3/4, -1 1/4, 1/2, 0, 0, -1, -1, 0]
• Compression: many small detail coefficients can be replaced by 0’s, and only the
significant coefficients are retained
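A sketch of the hierarchical pairwise averaging/differencing (Haar-style) that yields the coefficients quoted above; it assumes the input length is a power of 2:

def haar_decompose(s):
    # Returns [overall average, detail coefficients from coarsest to finest level]
    coeffs = []
    while len(s) > 1:
        avgs  = [(a + b) / 2 for a, b in zip(s[0::2], s[1::2])]
        diffs = [(a - b) / 2 for a, b in zip(s[0::2], s[1::2])]
        coeffs = diffs + coeffs     # coarser-level details go in front of finer-level ones
        s = avgs
    return s + coeffs

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0], i.e. [2 3/4, -1 1/4, 1/2, 0, 0, -1, -1, 0]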

Principal Component Analysis (PCA)

• Find a projection that captures the largest amount of variation in data


• The original data are projected onto a much smaller space, resulting in dimensionality
reduction.
Principal Component Analysis (Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal components)
that can be best used to represent data
• Normalize input data: Each attribute falls within the same range
• Compute k orthonormal (unit) vectors, i.e., principal components
• Each input data (vector) is a linear combination of the k principal component vectors
• The principal components are sorted in order of decreasing “significance” or strength
• Since the components are sorted, the size of the data can be reduced by eliminating the
weak components, i.e., those with low variance (i.e., using the strongest principal
components, it is possible to reconstruct a good approximation of the original data)
• Works for numeric data only
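A compact NumPy sketch of these steps (normalize, compute the covariance matrix, take its orthonormal eigenvectors, keep the strongest k); the random correlated data is only for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.5, 0.1],
                                          [0.5, 1.0, 0.2],
                                          [0.1, 0.2, 0.3]])   # correlated attributes (illustrative)

Xn = (X - X.mean(axis=0)) / X.std(axis=0)       # normalize: each attribute falls within the same range
cov = np.cov(Xn, rowvar=False)                  # covariance matrix of the normalized data
eigvals, eigvecs = np.linalg.eigh(cov)          # orthonormal eigenvectors = principal components

order = np.argsort(eigvals)[::-1]               # sort components by decreasing variance ("significance")
k = 2
strongest = eigvecs[:, order[:k]]
X_reduced = Xn @ strongest                      # project the data onto the k strongest components
print(X_reduced.shape)                          # (100, 2)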

Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
• Duplicate much or all of the information contained in one or more other
attributes
• E.g., purchase price of a product and the amount of sales tax paid
• Irrelevant attributes
• Contain no information that is useful for the data mining task at hand
• E.g., students' ID is often irrelevant to the task of predicting students' GPA

Heuristic Search in Attribute Selection

• There are 2^d possible attribute combinations of d attributes

• Typical heuristic attribute selection methods:
  • Best single attribute under the attribute independence assumption: choose by significance tests
  • Best step-wise feature selection:
    • The best single attribute is picked first
    • Then the next best attribute conditioned on the first, ...
  • Step-wise attribute elimination:
    • Repeatedly eliminate the worst attribute
  • Best combined attribute selection and elimination
  • Optimal branch and bound:
    • Use attribute elimination and backtracking
Attribute Creation (Feature Generation)
• Create new attributes (features) that can capture the important information in a
data set more effectively than the original ones
• Three general methodologies
• Attribute extraction
• Domain-specific
• Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet transformation
• Attribute construction
• Combining features
• Data discretization

Data Reduction: Numerosity Reduction

• Reduce data volume by choosing alternative, smaller forms of data


representation
• Parametric methods (e.g., regression)
• Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers)
• Ex.: Log-linear models—obtain value at a point in m-D space as the product
on appropriate marginal subspaces
• Non-parametric methods
• Do not assume models
• Major families: histograms, clustering, sampling, …

Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han, Micheline
Kamber and Jian Pei Morgan Kaufmann Publishers
BITS Pilani
Hyderabad Campus

Dimension Reduction using PCA


Today’s Agenda

• Curse of Dimensionality

• Introduction to Dimension Reduction

• Motivation for PCA

• PCA



Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies

• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful

• Illustration: randomly generate 500 points and compute the difference between the maximum and minimum distance between any pair of points


Dimensionality Reduction

• Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise

• Techniques
  – Principal Component Analysis
  – Singular Value Decomposition
  – Others: supervised and non-linear techniques


Dimensionality Reduction: PCA

• Goal is to find a projection that captures the largest amount of variation in data
  (figure: the direction u1 in the x1/x2 plane)


PCA

• Reduce high-dimensional data into something that can be explained in fewer dimensions.
• We need PCA since we suspect that in our data set of interest not all measures are independent, i.e., there exist correlations.
  (figure: x1 vs. x2 scatter; assume the data set represents the height and weight of people in a region)
Principal Component Analysis (PCA)

• Reduce higher-dimensional data into something that can be explained in fewer dimensions and gain an understanding of the data.

• We need PCA since we suspect that in our data set not all measures are independent and there exist correlations, structures, or patterns.


Motivation for PCA

Projection

Projection (Contd..)


Principal Component Analysis

PCA helps us in identifying the best projection. The goal is to find a lower-dimensional surface on which to project the data such that the sum of squared errors is minimal.


Salient features of PCA

• Directions are in the order of % of variance explained.
• Every PC is orthogonal.
• PCA can be solved using:
  • Maximum Variance
  • Minimum Error


PCA formulation

Steps for PCA

Example

How to derive S?

PCA overview
PCA Limitations

• Covariance is extremely sensitive to large values
  • Multiply some dimension by 1000
  • It dominates the covariance
  • It becomes the principal component
• Normalize each dimension to zero mean and unit variance:
    X' = (X − mean) / standard-deviation


PCA Limitations

• PCA assumes the underlying subspace is linear:
  • 1D – line
  • 2D – plane


PCA and classification

• PCA is unsupervised
• It maximizes the overall variance of the data along a small set of directions
• It does not know anything about class labels
• It can pick a direction that makes it hard to separate classes


Take home message

• As the number of dimensions increases, the complexity and computational power required to build the model also increase.

• Dimension reduction methods are employed to find the best representation of the data.

• PCA finds the vectors onto which the data can be projected while preserving the maximum variance.


BITS Pilani
Hyderabad Campus

Data
Today’s Learning objective

• Describe Data

• List various Data types

• List the issues in Data quality

• List and identify the right preprocessing techniques for given data


What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – Other names: variable, field, characteristic, feature, predictor, etc.
• A collection of attributes describes an object
  – Other names: record, point, case, sample, entity, or instance

In the table below, each row is an object and each column an attribute:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes


Attribute Values

• Each attribute draws its values from a set of possible values.

• The same attribute can be mapped to different attribute values
  • Example: temperature can be measured in Celsius or Fahrenheit

• Different attributes can be mapped to the same set of values
  • Example: attribute values for ID and age are integers


Types of Attributes

• Nominal / ordinal examples: hair color; car prices (low, medium, high)

• Discrete: has a finite or countably infinite set of values
  • Example: terms in a document

• Continuous: has real-number values
  • Examples: length, weight, temperature, etc.


Properties of Attribute Values

• The type of an attribute depends on which of the


following properties it possesses:
– Distinctness: =, ≠
– Order: <, >
– Addition: +, -
– Multiplication: *, /

– Nominal attribute: distinctness


– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & addition
– Ratio attribute: all 4 properties

BITS Pilani, Hyderabad Campus


Attribute Type | Description | Examples | Operations

Nominal  | The values of a nominal attribute are just different names,
           i.e., nominal attributes provide only enough information to
           distinguish one object from another. (=, ≠)
         | zip codes, employee ID numbers, eye color, sex: {male, female}
         | mode, entropy, contingency correlation, χ² test

Ordinal  | The values of an ordinal attribute provide enough information
           to order objects. (<, >)
         | hardness of minerals, {good, better, best}, grades, street numbers
         | median, percentiles, rank correlation, run tests, sign tests

Interval | For interval attributes, the differences between values are
           meaningful, i.e., a unit of measurement exists. (+, -)
         | calendar dates, temperature in Celsius or Fahrenheit
         | mean, standard deviation, Pearson's correlation, t and F tests

Ratio    | For ratio variables, both differences and ratios are
           meaningful. (*, /)
         | temperature in Kelvin, monetary quantities, counts, age, mass,
           length, electrical current
         | geometric mean, harmonic mean, percent variation
BITS Pilani, Hyderabad Campus
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: set of words in a collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes

• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.

BITS Pilani, Hyderabad Campus


Types of data sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data

BITS Pilani, Hyderabad Campus


Important Characteristics of Structured
Data
– Dimensionality
• Curse of Dimensionality

– Sparsity
• Only presence counts

– Resolution
• Patterns depend on the scale

BITS Pilani, Hyderabad Campus


Record Data

• Data that consists of a collection of records, each of


which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat

1 Yes Single 125K No


2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10

BITS Pilani, Hyderabad Campus


Data Matrix
• If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute

• Such data set can be represented by an m by n matrix,


where there are m rows, one for each object, and n
columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness

10.23                 5.27                  15.22     2.7   1.2

12.65                 6.25                  16.22     2.2   1.1

BITS Pilani, Hyderabad Campus


Document Data

• Each document becomes a `term' vector,


– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding term
occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season

Document 1    3     0     5     0      2     6    0     2       0       2

Document 2    0     7     0     2      1     0    0     3       0       0

Document 3    0     1     0     0      1     2    2     0       3       0

BITS Pilani, Hyderabad Campus
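
A minimal sketch of building such term vectors with scikit-learn's CountVectorizer; the three example sentences are assumptions, not the documents behind the table above.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the team lost the game despite a timeout",
        "the coach praised the team and the season",
        "a great play helped the team win the ball game"]

vectorizer = CountVectorizer()
term_matrix = vectorizer.fit_transform(docs)   # sparse document-term matrix

print(vectorizer.get_feature_names_out())      # the terms (one column per term)
print(term_matrix.toarray())                   # term counts per document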


Transaction Data
• A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of
products purchased by a customer during one
shopping trip constitute a transaction, while the
individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk

BITS Pilani, Hyderabad Campus


Graph Data
Examples: generic graph (with labelled nodes) and HTML links

<a href="papers/papers.html#bbbb"> Data Mining </a>
<li>
<a href="papers/papers.html#aaaa"> Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa"> Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff"> N-Body Computation and Dense Linear System Solvers

BITS Pilani, Hyderabad Campus


Chemical Data

Benzene Molecule: C6H6

BITS Pilani, Hyderabad Campus


Ordered Data
Sequences of transactions

An element of the
sequence

BITS Pilani, Hyderabad Campus


Ordered Data
Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

BITS Pilani, Hyderabad Campus


Ordered Data

Spatio-Temporal Data

Average Monthly
Temperature of land
and ocean

BITS Pilani, Hyderabad Campus


Data Quality

• What kinds of data quality problems?

• How can we detect problems with the data?

• What can we do about these problems?


• Examples of data quality problems:
– Noise and outliers
– missing values
– duplicate data

BITS Pilani, Hyderabad Campus


Noise
• Noise: An invalid signal overlapping valid data
– Examples: distortion of a person’s voice when talking
on a poor phone and “snow” on television screen

Two Sine Waves Two Sine Waves + Noise


BITS Pilani, Hyderabad Campus
Outliers
• Outliers are data objects with characteristics that are
  considerably different from those of most of the other data
  objects in the data set

BITS Pilani, Hyderabad Campus


Data pre-processing
1. Data cleaning: handling errors and missing values

2. Feature extraction: creating new features by combining and


transforming existing ones

• a crucial step! It determines what patterns you can find; it is
  application-specific and requires understanding of the domain

3. Data reduction

• Aggregation, sampling

• feature selection

• dimension reduction by transformations


BITS Pilani, Hyderabad Campus
Missing Values
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

• Handling missing values


– Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
– Replace with all possible values (weighted by their
probabilities)

BITS Pilani, Hyderabad Campus


Data Cleaning

• Strategies to handle Missing values

• If a feature has many missing values, prune the feature
  (keeping the features with mostly correct values)

• If a record has many missing values, prune the record

• Impute missing values

• If the modeling technique allows missing values, just


replace them with special values (like “NA”)

BITS Pilani, Hyderabad Campus
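
A minimal pandas sketch of these strategies; the toy DataFrame and column names are assumptions for illustration.

import pandas as pd
import numpy as np

df = pd.DataFrame({"age":    [25, np.nan, 47, 51],
                   "income": [50_000, 62_000, np.nan, 58_000]})

df_drop_rows = df.dropna()                            # prune records with missing values
df_drop_cols = df.dropna(axis=1, thresh=3)            # prune features with too many NaNs
df_imputed   = df.fillna(df.mean(numeric_only=True))  # impute with the column mean
df_flagged   = df.fillna("NA")                        # replace with a special value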


Duplicate Data

• Data set may include data objects that are duplicates, or


almost duplicates of one another
– Major issue when merging data from heterogeneous
sources

• Examples:
– Same person with multiple email addresses

• Data cleaning
– Process of dealing with duplicate data issues

BITS Pilani, Hyderabad Campus


Aggregation

• Combining two or more attributes (or objects) into a single


attribute (or object)

• Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc
– More “stable” data
• Aggregated data tends to have less variability

BITS Pilani, Hyderabad Campus


Aggregation
Variation of Precipitation in Australia

Standard Deviation of Average Monthly Precipitation vs.
Standard Deviation of Average Yearly Precipitation
BITS Pilani, Hyderabad Campus
Sampling

• Sampling is the main technique employed for data


selection.
– It is often used for both the preliminary investigation
of the data and the final data analysis.

• Statisticians sample because obtaining the entire set of


data of interest is too expensive or time consuming.

• Sampling is used in data mining because processing the


entire set of data of interest is too expensive or time
consuming.

BITS Pilani, Hyderabad Campus


Sampling

• The key principle for effective sampling is:

• A sample will work almost as well as using the entire data set
  if the sample is representative (what counts as representative
  differs from data set to data set).

• Sampling may remove outliers and if done improperly

can introduce noise.

BITS Pilani, Hyderabad Campus


Types of Sampling
• Simple Random Sampling
– There is an equal probability of selecting any particular item

• Sampling without replacement


– As each item is selected, it is removed from the population

• Sampling with replacement


– Objects are not removed from the population as they are selected
for the sample.
• In sampling with replacement, the same object can be picked up more than
once

• Stratified sampling
– Split the data into several partitions; then draw random samples
from each partition

BITS Pilani, Hyderabad Campus
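
A minimal pandas sketch of simple random and stratified sampling; the toy employee table and the "dept" column are assumptions.

import pandas as pd
import numpy as np

# Toy data: 1000 employees across 4 departments (illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame({"emp_id": range(1000),
                   "dept": rng.choice(["CS", "EE", "ME", "BIO"], size=1000)})

simple_random = df.sample(n=100, replace=False, random_state=42)   # without replacement
with_repl     = df.sample(n=100, replace=True,  random_state=42)   # with replacement

# Stratified sampling: draw the same fraction from every department
stratified = df.groupby("dept", group_keys=False).sample(frac=0.1, random_state=42)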


Sample Size

8000 points 2000 Points 500 Points

BITS Pilani, Hyderabad Campus


Feature extraction

• scaling and normalization: numerical → numerical

• discretization: numerical → categorical

• binarization: categorical → binary (0/1)

• creating similarity graphs: any type → graph

• transformations for dimension reduction: create new, less


redundant features and keep the best ones, both feature
extraction and data reduction

BITS Pilani, Hyderabad Campus


Scaling and Normalization

• Features with large magnitudes dominate the aggregate


functions like Euclidean distances.

• Hence, we can transform all features to the same scale or


standardize distributions.
• Normalization is particularly useful for classification algorithms.
• min-max normalization

• z-score normalization

• Normalization by decimal scaling

BITS Pilani, Hyderabad Campus


Scaling and Normalization (Contd..)

BITS Pilani, Hyderabad Campus


Min-max normalization

Transform the data from measured units to a new interval
[new_min_F, new_max_F] for feature F:

  v' = ((v - min_F) / (max_F - min_F)) * (new_max_F - new_min_F) + new_min_F

where v is the current value of feature F.

Suppose that the minimum and maximum values for the feature
income are $12,000 and $98,000, respectively, and we would like to
map income to the range [0.0, 1.0]. By min-max normalization, a
value of $73,600 for income is transformed to:

  (73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0.0) + 0.0 = 0.716

BITS Pilani, Hyderabad Campus


Standardization or Z-Score Normalization

BITS Pilani, Hyderabad Campus
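
A minimal NumPy sketch of min-max and z-score normalization; the three income values reuse the example above, everything else is illustrative.

import numpy as np

income = np.array([12_000, 73_600, 98_000], dtype=float)   # min, example value, max

# Min-max normalization to [0.0, 1.0]
minmax = (income - income.min()) / (income.max() - income.min())
print(minmax[1])        # approximately 0.716 for the $73,600 example

# Z-score normalization: zero mean, unit standard deviation
zscore = (income - income.mean()) / income.std()
print(zscore)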


Robust Scaling

• If there are many outliers, the mean and stdev are biased ⇒ use robust
  scaling with the median and interquartile range (IQR = Q3 - Q1):

  X' = (X - median) / IQR

• Lower Quartile (QL) or First Quartile (Q1): 25% of the data falls below this value
• Median or Second Quartile (Q2): 50% of the data falls below this value
• Upper Quartile (QU) or Third Quartile (Q3): 75% of the data falls below this value
BITS Pilani, Hyderabad Campus
Log Transformation

• Sometimes y = log2(x) helps to make distribution less


skewed or even normal.

BITS Pilani, Hyderabad Campus


Discretization
numerical → categorical
• Discretization of continuous attributes is most often
performed one attribute at a time, independent of other
attributes.

• This approach is known as static attribute discretization. At the
  other end of the spectrum is dynamic attribute discretization,
  where all attributes are discretized simultaneously while taking
  into account the interdependencies among them.

BITS Pilani, Hyderabad Campus


Discretization

• Unsupervised discretization
  • Class labels are ignored
  • Equal-interval (equal-width) binning
  • Equal-frequency binning
  • The best number of bins k is determined experimentally

• Supervised discretization

  • Entropy-based discretization

  • It tries to maximize the “purity” of the intervals (i.e., each
    interval should contain as small a mixture of class labels as possible)

BITS Pilani, Hyderabad Campus


Unsupervised Discretization
• Require the user to specify the number of intervals and/or how
many data points should be included in any given interval.
• The following heuristic is often used to choose intervals:
  • The number of intervals for each attribute should not be smaller
    than the number of classes (if known).
  • The other popular heuristic is to choose the number of intervals
    nFi for each attribute Fi (i = 1, ..., n, where n is the number of
    attributes) as nFi = M / (3 * C), where M is the number of training
    examples and C is the number of known categories.
BITS Pilani, Hyderabad Campus
Unsupervised Discretization

BITS Pilani, Hyderabad Campus


Unsupervised Discretization

• Equal-frequency binning

• An equal number of values are placed in each of the k bins.

• Disadvantage: Many occurrences of the same continuous

value could cause the values to be assigned into different

bins.

BITS Pilani, Hyderabad Campus


Example

BITS Pilani, Hyderabad Campus
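
A minimal pandas sketch of equal-width and equal-frequency binning; the 16 play-count values are reused from the revision exercise later in this document.

import pandas as pd

x = pd.Series([22, 12, 61, 57, 30, 1, 32, 37, 37, 68, 42, 11, 25, 7, 8, 16])

equal_width = pd.cut(x, bins=4)    # 4 intervals of equal width
equal_freq  = pd.qcut(x, q=4)      # 4 bins with (roughly) equal numbers of values

print(equal_width.value_counts())  # counts per interval differ
print(equal_freq.value_counts())   # counts per bin are (roughly) equal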


Supervised Discretization

• Suppose you are analyzing risk of Alzheimer's disease and


you split age data at age 16, age 24, and age 30.

Your bins look something like this: <=16, 16...24, 24...30, >30

Now you have a giant bin of people older than 30, where most
Alzheimer's patients are, and multiple bins split at lower values,
where you're not really getting much information.

• Because of this issue, we want to make meaningful splits in


our continuous variables.

BITS Pilani, Hyderabad Campus


Feature Subset Selection Techniques

• Brute-force approach: Try all possible feature subsets as


input to data mining algorithm

• Filter approaches: Compute a score for each feature and


then select features according to the score.

• Wrapper approaches: score feature subsets by seeing their


performance on a dataset using a classification algorithm.

• Embedded approaches: Select features during the process


of training.
BITS Pilani, Hyderabad Campus
Wrapper Methods

BITS Pilani, Hyderabad Campus


Sequential Forward Selection (SFS)

1. Start with an empty feature set

2. Try each remaining feature

3. Estimate classifier performance for adding each feature

4. Select feature that gives max improvement

5. Stop when there is no significant improvement

Disadvantage: Once a feature is retained, it cannot be discarded;

nesting problem

BITS Pilani, Hyderabad Campus
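
A minimal wrapper-style sketch using scikit-learn's SequentialFeatureSelector (available in recent scikit-learn releases); the k-NN classifier, the iris data, and the number of selected features are assumptions.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

sfs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=3),
                                n_features_to_select=2,
                                direction="forward")   # "backward" gives SBS
sfs.fit(X, y)
print(sfs.get_support())   # boolean mask of the selected features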


Sequential Backward Selection (SBS)

1. Start with a full feature set

2. Try removing each feature

3. Drop the feature with the smallest impact on classifier

performance

Disadvantage: SBS requires more computation than SFS

BITS Pilani, Hyderabad Campus


Search space for feature selection

Forward Feature subset selection

Backward Feature subset selection

BITS Pilani, Hyderabad Campus


Embedded Method for feature selection

• Embedded methods perform feature selection and


training of the algorithm in parallel.

• Example

• Lasso Regression

• Decision Trees

BITS Pilani, Hyderabad Campus


Feature Creation
• Create new attributes that can capture the important
information in a data set much more efficiently than the
original attributes

• Three general methodologies:

– Feature Extraction: e.g., multimedia features (low-, middle-, and
  high-level features)
• domain-specific

– Mapping Data to New Space

– Feature Construction: combining features


BITS Pilani, Hyderabad Campus
Take home message
• Features/attributes/measurements/independent variables can be of
  four different types: Nominal, Ordinal, Interval, or Ratio.
• Based on the type of data, the operations vary.
• The data set can be of the record, graph, or ordered type.
• Real-world data is dirty, so preprocessing is a very important step in
Data Mining.
• There are several methods for preprocessing, choosing the right
method depends on the problem and data obtained.

BITS Pilani, Hyderabad Campus


Take home message

• Missing values can be handled by eliminating features or


records or by imputation methods.

• Feature extraction methods like scaling, normalization,


and discretization need to be applied based on the
problem.

• Data reduction methods will be applied to reduce the


number of features required to build the model.

BITS Pilani, Hyderabad Campus


BITS Pilani
Hyderabad Campus

Similarity and Distance Measures


Today’s Learning objective
• What is Distance?
• Similarity vs. distance
• Properties of distance metrics
• Proximity Measures for Binary Nominal attributes
• Proximity Measures for Nominal Attributes
• Proximity Measures for Ordinal Attributes
• Proximity Measures for Numeric Attributes
• Proximity Measures for Mixed Attributes

BITS Pilani, Hyderabad Campus


What is Distance?

BITS Pilani, Hyderabad Campus


Similarity vs. distance

BITS Pilani, Hyderabad Campus


Metric: distance d that satisfies 4
properties
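
A distance d(x, y) is a metric if it satisfies the standard four properties:
• Non-negativity: d(x, y) >= 0
• Identity: d(x, y) = 0 if and only if x = y
• Symmetry: d(x, y) = d(y, x)
• Triangle inequality: d(x, z) <= d(x, y) + d(y, z)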

BITS Pilani, Hyderabad Campus


Proximity

➢ Examples:

✓ For an item bought by a customer, find other similar items

✓ Group together the customers of the site so that similar customers are shown the
same ad.

✓ Group together web documents so that you can separate the ones that talk about
politics and the ones that talk about sports.

✓ Find all the near-duplicate mirrored web documents.

✓ Find credit card transactions that are very different from previous transactions.

➢ To solve these problems, we need a definition of similarity or distance.

➢ For many problems, we need to quantify how close two objects are.

BITS Pilani, Hyderabad Campus


Proximity Measures for Binary attributes

BITS Pilani, Hyderabad Campus


Proximity Measures for Two or more
Binary attributes

BITS Pilani, Hyderabad Campus


Proximity Measure for Symmetric Binary
attribute

BITS Pilani, Hyderabad Campus


Proximity Measure with Symmetric
Binary

              object j = 1   object j = 0
object i = 1        1              1
object i = 0        2              2

BITS Pilani, Hyderabad Campus


Proximity Measure with Asymmetric
Binary

BITS Pilani, Hyderabad Campus


Proximity Measure with Asymmetric
Binary

              object j = 1   object j = 0
object i = 1        1              1
object i = 0        2              2

BITS Pilani, Hyderabad Campus
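
As a worked illustration of the two tables above (using the usual notation f11, f10, f01, f00 for the counts of matching/mismatching binary values; here f11 = 1, f10 = 1, f01 = 2, f00 = 2):

Symmetric binary (Simple Matching Coefficient):
  SMC = (f11 + f00) / (f11 + f10 + f01 + f00) = (1 + 2) / 6 = 0.5

Asymmetric binary (Jaccard coefficient, 0-0 matches ignored):
  J = f11 / (f11 + f10 + f01) = 1 / 4 = 0.25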


Proximity Measures for Nominal
(Categorical) Attribute

BITS Pilani, Hyderabad Campus


Proximity Measures for Nominal
(Categorical) Attribute

BITS Pilani, Hyderabad Campus


Proximity Measure for Ordinal Attribute

BITS Pilani, Hyderabad Campus


Proximity Measure for Ordinal Attribute

Consider the following set of records, where each record is defined
by two ordinal attributes Size = {S, M, L} and Quality = {A, B, C, Ex}
such that S < M < L and A < B < C < Ex. Each rank r is mapped to
[0, 1] via (r - 1) / (Mf - 1), where Mf is the number of levels:

Size (Mf = 3):                 Quality (Mf = 4):
S: (1 - 1)/(3 - 1) = 0         A:  (1 - 1)/(4 - 1) = 0
M: (2 - 1)/(3 - 1) = 0.5       B:  (2 - 1)/(4 - 1) = 0.33
L: (3 - 1)/(3 - 1) = 1         C:  (3 - 1)/(4 - 1) = 0.66
                               Ex: (4 - 1)/(4 - 1) = 1

BITS Pilani, Hyderabad Campus


Proximity Measure with Interval Scale

BITS Pilani, Hyderabad Campus


Proximity Measure with Interval Scale

BITS Pilani, Hyderabad Campus


Proximity Measure with Interval Scale

BITS Pilani, Hyderabad Campus


Proximity Measure with Interval Scale

BITS Pilani, Hyderabad Campus
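
A minimal SciPy sketch of the usual numeric (interval/ratio) proximity measures; the two example vectors are assumptions.

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(x, y))        # L2 distance
print(distance.cityblock(x, y))        # Manhattan / L1 distance
print(distance.minkowski(x, y, p=3))   # general Lp (Minkowski) distance
print(distance.cosine(x, y))           # 1 - cosine similarity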


Proximity Measure for Ratio scale

BITS Pilani, Hyderabad Campus


Proximity Measure for Ratio scale
Normalization:
➢ A major problem when using the similarity (or dissimilarity)
measures (such as Euclidean distance) is that the large
values frequently swamp the small ones.
➢ For example, consider the following data.

➢ Here, the contribution of Cost 2 and Cost 3 is insignificant
  compared to Cost 1 as far as the Euclidean distance is
  concerned.
➢ This problem can be avoided if we consider the normalized
values of all numerical attributes.
BITS Pilani, Hyderabad Campus
Proximity Measure for Mixed Attributes
➢ The previous metrics on similarity measures assume that all the
attributes were of the same type. Thus, a general approach is
needed when the attributes are of different types.

➢ One straightforward approach is to compute the similarity between


each attribute separately and then combine these attribute using a
method that results in a similarity between 0 and 1.

➢ Typically, the overall similarity is defined as the average of all the


individual attribute similarities.

BITS Pilani, Hyderabad Campus


Proximity Measure with Mixed Attributes

BITS Pilani, Hyderabad Campus


Take Home message

• Many algorithms compute proximity using either


similarity or dissimilarity.

• The distance metric used will depend on the type of


Feature/attribute.

BITS Pilani, Hyderabad Campus


BITS Pilani
Hyderabad Campus

Classification
Today’s Agenda

• Jargons used in Data Mining

• Tasks in Data Mining

• Decision Tree Algorithm

• Naïve Bayes Algorithm

BITS Pilani, Hyderabad Campus


Variables

X Y

BITS Pilani, Hyderabad Campus


Functions

If you give me one apple


I will give you three bananas

What is the
function between
X and Y?

Y=X+3

BITS Pilani, Hyderabad Campus


Functions (Contd..)

X: English Sentence Y: Hindi sentence

BITS Pilani, Hyderabad Campus


Functions (Contd..)

X: Board Configuration
Y: Next Move

BITS Pilani, Hyderabad Campus


Functions (Contd..)

BITS Pilani, Hyderabad Campus


Functions (Contd..)

X: English Sentence Y: Hindi sentence


?????????????????????????

BITS Pilani, Hyderabad Campus


Parameters

Y = 3X + 1:  X is the input, Y is the output

Y = WX + b:  W and b are parameters

Model:
  Input – fixed, comes from the training data
  Parameters – need to be estimated

BITS Pilani, Hyderabad Campus


Parameters

Training Data:  (X, Y) = (1, 0), (5, 16), (6, 20)        Model:  Y = WX + b

How to estimate the parameters W and b?
Assume random numbers for W and b, e.g.:

  Y = 1X + 0                    Y = 2X + 2
  X   Y    Y'                   X   Y    Y'
  1   0    1                    1   0    4
  5   16   5                    5   16   12
  6   20   6                    6   20   14

Which model is better?
BITS Pilani, Hyderabad Campus
Functions (Contd..)
X=1

Y=0

X=5

Y=16

X=6

Y=20

X=3

Y=??
BITS Pilani, Hyderabad Campus
Cost function
Model:  Yn = WXn + b
Cost:   C(W, b) = ∑ (Yn – Y'n)²  over n ∈ {0, 1, 2}
The model that gives us the lowest cost is the better model.

Y = 1X + 0                             Y = 2X + 2
n   X   Y    Y'   (Y – Y')²            n   X   Y    Y'   (Y – Y')²
0   1   0    1    1                    0   1   0    4    16
1   5   16   5    121                  1   5   16   12   16
2   6   20   6    196                  2   6   20   14   36
C(1, 0) = 318                          C(2, 2) = 68
BITS Pilani, Hyderabad Campus
Optimizer

Training Data
Model
n X Y
0 1 0
Yn = WXn + b
1 5 16
2 6 20
Optimizer
arg min C(W,b)
W,b ϵ [-∞ ∞]
Cost
C(W,b) = ∑ (Y – Y’)2
nϵ {0,1,2}

BITS Pilani, Hyderabad Campus


Gradients
Cost
C(W,b) = ∑ (Y – Y’)2
nϵ {0,1,2}

W0=2,b0=2; C(W,b)=68

BITS Pilani, Hyderabad Campus


Gradients

BITS Pilani, Hyderabad Campus
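
A minimal gradient-descent sketch for this cost function; the learning rate and iteration count are assumptions, and the three training points are the ones from the slides.

import numpy as np

X = np.array([1.0, 5.0, 6.0])
Y = np.array([0.0, 16.0, 20.0])

W, b = 2.0, 2.0          # initial guess: C(2, 2) = 68, as on the slide
lr = 0.01                # learning rate (assumed)

for _ in range(2000):
    Y_pred = W * X + b
    err = Y_pred - Y
    # Gradients of C(W, b) = sum((Y' - Y)^2) with respect to W and b
    dW = 2 * np.sum(err * X)
    db = 2 * np.sum(err)
    W -= lr * dW
    b -= lr * db

print(W, b)   # converges towards W = 4, b = -4, which fits all three points exactly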


Tasks in Data Mining

Some of the important tasks performed in data Mining:

Classification – Logistic Regression, Naïve Bayes, Decision Trees

Regression – Linear regression, Ridge Regression

Clustering – K-Means, Hierarchical Agglomerative Clustering

BITS Pilani, Hyderabad Campus


Tasks in Data Mining (contd..)

Some of the important tasks performed in data mining:

Classification – supervised
Regression – supervised
Clustering – unsupervised

BITS Pilani, Hyderabad Campus


The data and the goal

• Data: A set of data records (also called examples, instances


or cases) described by
– k features / attributes: f1, f2, … fk.
– a class: Each example is labelled with a pre-defined
class.
• Goal: To learn a classification model from the data that can
be used to predict the classes of new (future, or test)
cases/instances.

BITS Pilani, Hyderabad Campus


What do we mean by learning?
• Given
– a data set D,
– a task T, and
– a performance measure M,
• a computer system is said to learn from D to perform the task T if
after learning the system’s performance on T improves as measured
by M.
• In other words, the learned model helps the system to perform T
better than no learning.

BITS Pilani, Hyderabad Campus


An example

• Data: Loan application data

• Task: Predict whether a loan should be approved or not.

• Performance measure: accuracy.

• No learning: classify all future applications (test data) to


the majority class (i.e., Yes):

• Accuracy = 9/15 = 60%.

• We can do better than 60% with learning.

BITS Pilani, Hyderabad Campus


Fundamental assumption of learning

• Assumption: The distribution of training examples is


identical to that of test examples (including future unseen
examples).
• In practice, this assumption is often violated to a certain
degree.
• Strong violations will result in poor classification accuracy.
• To achieve good accuracy on the test data, training
examples must be sufficiently representative of the test data.

BITS Pilani, Hyderabad Campus


Introduction

• Decision tree learning is one of the most widely used


techniques for classification.

– Its classification accuracy is competitive with other


methods, and

– it is very efficient.

• The classification model is a tree called a decision tree.

BITS Pilani, Hyderabad Campus


Root node: 4 B | 3 C | 3 T

After splitting on an attribute:
  4 B | 0 C | 1 T      0 B | 3 C | 0 T      0 B | 0 C | 2 T

Keep adding features and splitting till you have all leaf nodes.

BITS Pilani, Hyderabad Campus


Entropy and Gini Index
Parent node: 6 Y | 3 N, split by f1 into c1 and c2:
  c1: 3 Y | 3 N        c2: 3 Y | 0 N

Entropy H(S) for c1 = -3/6 log2(3/6) - 3/6 log2(3/6) = 1   (impure split)
Entropy H(S) for c2 = -3/3 log2(3/3) - 0/3 log2(0/3) = 0   (pure split)
(by convention, 0 · log2(0) is taken as 0)

G.I. for c1 = 1 - ((3/6)² + (3/6)²) = 0.5

G.I. for c2 = 1 - ((3/3)² + (0/3)²) = 0

BITS Pilani, Hyderabad Campus


Information Gain 9 Y|5 N
f1

c1 c2
6 Y|2 N 3 Y|3 N

BITS Pilani, Hyderabad Campus
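
A small NumPy sketch that reproduces these impurity numbers and the information gain for the split above; the helper function names are assumptions.

import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()                 # convention: 0 * log2(0) = 0
    return -(p * np.log2(p)).sum()

def gini(counts):
    p = np.array(counts, dtype=float) / sum(counts)
    return 1.0 - (p ** 2).sum()

print(entropy([3, 3]), gini([3, 3]))       # impure split: 1.0, 0.5
print(entropy([3, 0]), gini([3, 0]))       # pure split:   0.0, 0.0

# Information gain for the 9 Y | 5 N parent split into (6 Y, 2 N) and (3 Y, 3 N)
parent = entropy([9, 5])                                  # about 0.940
children = (8/14) * entropy([6, 2]) + (6/14) * entropy([3, 3])
print(parent - children)                                  # gain of about 0.048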


Algorithm for decision tree learning

• Basic algorithm (a greedy divide-and-conquer algorithm)


– Assume attributes are categorical now (continuous attributes can
be handled too)
– Tree is constructed in a top-down recursive manner
– At start, all the training examples are at the root
– Examples are partitioned recursively based on selected attributes
– Attributes are selected based on an impurity function (e.g.,
information gain)
• Conditions for stopping partitioning
– All examples for a given node belong to the same class
– There are no remaining attributes for further partitioning –
majority class is the leaf
– There are no examples left

BITS Pilani, Hyderabad Campus
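
For reference, a minimal scikit-learn sketch of this greedy, top-down procedure; the iris data, the entropy criterion, and the depth limit are assumptions.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(criterion="entropy",   # or "gini"
                              max_depth=3,
                              random_state=0)
tree.fit(X, y)

print(export_text(tree))           # the learned splits, top-down
print(tree.predict(X[:5]))         # class predictions for new records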


Three Possible Partition Scenarios

BITS Pilani, Hyderabad Campus


Attribute Selection Measures
• An attribute selection measure is a heuristic for selecting the splitting
  criterion that "best" separates a given data partition D. Ideally:
  – Each resulting partition would be pure
  – A pure partition is a partition containing tuples that all belong to the
    same class
• Attribute selection measures (splitting rules)
– Determine how the tuples at a given node are to be split
– Provide ranking for each attribute describing the tuples
– The attribute with highest score is chosen
– Determine a split point or a splitting subset
• Methods
– Information gain, Gain ratio, Gini Index

BITS Pilani, Hyderabad Campus


Example

BITS Pilani, Hyderabad Campus


Example (contd..)

Based on the data, we can compute the probability of each class.

Transportation Mode:   Bus = 4   Car = 3   Train = 3   (out of 10)

Prob (Bus)   = 4 / 10 = 0.4
Prob (Car)   = 3 / 10 = 0.3
Prob (Train) = 3 / 10 = 0.3

Entropy = – 0.4 log2 (0.4) – 0.3 log2 (0.3) – 0.3 log2 (0.3) = 1.571

• Gini Index = 1 – (0.4² + 0.3² + 0.3²) = 0.660

BITS Pilani, Hyderabad Campus


BITS Pilani, Hyderabad Campus
Information gain
Split by attribute value (rows) vs. class = Transportation Mode (columns):

              Bus   Car   Train   Total
  Cheap        4     0      1       5
  Expensive    0     3      0       3
  Standard     0     0      2       2
                                   10

BITS Pilani, Hyderabad Campus
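
As a worked illustration, the information gain for this split can be computed from the table (using the parent entropy of 1.571 from the earlier example):

Entropy(Cheap)     = – 4/5 log2(4/5) – 1/5 log2(1/5) ≈ 0.722
Entropy(Expensive) = 0   (pure: all Car)
Entropy(Standard)  = 0   (pure: all Train)

Entropy after split = 5/10 × 0.722 + 3/10 × 0 + 2/10 × 0 ≈ 0.361
Information Gain    = 1.571 – 0.361 ≈ 1.210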


BITS Pilani, Hyderabad Campus
Second Iteration

Now we have only three attributes: Gender, car ownership and Income
level.
BITS Pilani, Hyderabad Campus
• Then, we repeat the procedure of computing degree of
impurity and information gain for the three attributes.
BITS Pilani, Hyderabad Campus
Third Iteration

BITS Pilani, Hyderabad Campus


Decision Tree

BITS Pilani, Hyderabad Campus


Probabilistic vs. Discriminative learning

“Probabilistic” learning
– Conditional models just explain y: p(y|x)
– Generative models also explain x: p(x,y)
• Often a component of unsupervised or semi-supervised learning
– Bayes and Naïve Bayes classifiers are generative models
BITS Pilani, Hyderabad Campus
Conditional Probability and Naïve Bayes
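
The rule applied on the next slides is Bayes' theorem:
  P(Ci | X) = P(X | Ci) P(Ci) / P(X)
Since P(X) is the same for every class, we predict the class Ci that maximizes
P(X | Ci) P(Ci); the "naïve" conditional-independence assumption lets us factor
P(X | Ci) into a product of per-attribute probabilities.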

BITS Pilani, Hyderabad Campus


BITS Pilani, Hyderabad Campus
Three flavors of Naïve Bayes

• Bernoulli naive bayes: Assumes that each feature is a


binary random variable.

• Multinomial naive bayes: Assumes each feature is a


random variable having discrete count.

• Gaussian naive bayes: Assumes each feature is a


random variable having continuous value.

BITS Pilani, Hyderabad Campus


X = (age = youth, income = medium, student = yes, credit = fair)
BITS Pilani, Hyderabad Campus
We need to maximize P(X|Ci)P(Ci), for i = 1, 2. P(Ci), the a priori
probability of each class, can be estimated based on the training
samples:
– P(buy = yes) = 9 / 14
– P(buy = no) = 5 / 14
To compute P(X|Ci), for i = 1, 2, we compute the following conditional
probabilities:
– P(age = youth|buy = yes) = 2/9
– P(age = youth|buy = no) = 3/ 5
– P(income = medium|buy = yes) = 4 / 9
– P(income = medium|buy = no) = 2 / 5
– P(student = yes|buy = yes) = 6 / 9
– P(student = yes|buy = no) = 1 / 5
– P(credit = fair|buy = yes) = 6 / 9
– P(credit = fair|buy = no) = 2 / 5
BITS Pilani, Hyderabad Campus
Using the probabilities from previous slide, we
obtain
P(X|buy = yes) = P(age = youth|buy = yes)
               × P(income = medium|buy = yes)
               × P(student = yes|buy = yes)
               × P(credit = fair|buy = yes)
               = 2/9 × 4/9 × 6/9 × 6/9
               = 0.044.

BITS Pilani, Hyderabad Campus


Similarly,
P(X|buy = no) = 3/5 * 2 /5 * 1 / 5 * 2 / 5
= 0.019
To find the class that maximizes P(X|Ci)P(Ci), we compute
P(X|buy = yes)P(buy = yes) = 0.028
P(X|buy = no)P(buy = no) = 0.007
Thus the naive Bayesian classifier predicts buy = yes for
sample X.

BITS Pilani, Hyderabad Campus
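
A small Python sketch that reproduces this calculation; the dictionaries simply restate the probabilities from the slides.

# Class priors and per-attribute conditional probabilities from the slides
prior = {"yes": 9/14, "no": 5/14}
cond = {
    "yes": {"age=youth": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit=fair": 6/9},
    "no":  {"age=youth": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit=fair": 2/5},
}

X = ["age=youth", "income=medium", "student=yes", "credit=fair"]

scores = {}
for c in prior:
    p = prior[c]
    for attr in X:
        p *= cond[c][attr]           # naive (conditional independence) assumption
    scores[c] = p

print(scores)                        # roughly {'yes': 0.028, 'no': 0.007}
print(max(scores, key=scores.get))   # 'yes' -> predicts buys_computer = yes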


Sentiment Analysis using Naïve Bayes
classifier

“(1) I bought an iPhone a few days ago. (2) It was such a


nice phone. (3) The touch screen was really cool. (4) The
voice quality was clear too. (5) Although the battery life
was not long, that is ok for me. (6) However, my mother
was mad with me as I did not tell her before I bought it. (7)
She also thought the phone was too expensive, and
wanted me to return it to the shop. … ”

BITS Pilani, Hyderabad Campus


Modelling Sentiment Analysis problem

• The solution to the Sentiment Analysis problem depends on the


granularity of the sentiment
• Positive / Negative or 1 to 5 stars: binary classification /
  multiclass classification
  • e.g., Naïve Bayes, support vector machines (SVM), logistic
    regression, maximum entropy, etc.

• Regression if the sentiment is a continuous value between 1 and 5

BITS Pilani, Hyderabad Campus


The Multinomial Naive Bayes’ Classifier

BITS Pilani, Hyderabad Campus


The Multinomial Naive Bayes’ Classifier

BITS Pilani, Hyderabad Campus


Gaussian naive bayes
X = (Refund = No, Married, Income = 120K)

P(X|Class=No) = P(Refund=No|Class=No)
              × P(Married|Class=No)
              × P(Income=120K|Class=No)
              = 4/7 × 4/7 × 0.0072 = 0.0024

P(X|Class=Yes) = P(Refund=No|Class=Yes)
               × P(Married|Class=Yes)
               × P(Income=120K|Class=Yes)
               = 1 × 0 × 1.2 × 10⁻⁹ = 0

BITS Pilani, Hyderabad Campus


Naïve Bayes

• Robust to noise points.

• Handle missing values by ignoring the instance during


probability estimate calculations.

• The independence assumption may not hold for some attributes;
  in that case we need to work with Bayesian Belief Networks.

BITS Pilani, Hyderabad Campus


Take home message

• Applications of supervised learning are in almost any field or


domain.
• There are still many other methods, e.g.,
– Support Vector Machines
– Logistic Regression
– This large number of methods also shows the importance of
classification and its wide applicability.
• It remains to be an active research area.

BITS Pilani, Hyderabad Campus


BITS Pilani
Hyderabad Campus

Revision
Topics for Mid Sem Exam

• Different types of data using applications


• Aggregation

• Sampling

• Dimensionality Reduction

• Feature subset selection

• Discretization (Example is available on the slide)

• Classification (Decision Tree, Naïve Bayes)

BITS Pilani, Hyderabad Campus


Different types of data using applications

• Suppose you are given movie reviews, and you are


asked to perform sentiment analysis. In this application,
what type of data would you use, and how will you create
the dataset?

BITS Pilani, Hyderabad Campus


Which preprocessing would you use?

• The Election Commission would like to find the voter


turnout in each state.

• The words in a document would need to be grouped based on the
  topics to which they belong.

BITS Pilani, Hyderabad Campus


Sampling

• There are 1000 employees in BITS out of which 100 of


them have to be selected for weekend work. All their
names will be put in a basket to pull 100 names out.

• Which sampling would you use?

• Suppose you are asked to find the employees to be


selected equally from all the departments which
sampling would you use and why?

BITS Pilani, Hyderabad Campus


Dimensionality Reduction - PCA

• What are the Principal Components?

• What do the eigenvalues and eigenvectors represent?

• How does dimension reduction happen?

• What are the constraints on the eigenvectors?

• What is the difference between feature selection and


feature reduction?

BITS Pilani, Hyderabad Campus


Feature subset selection

• If you are asked to find the top 3 features above a


threshold, which feature subset selection would you use
and why?

BITS Pilani, Hyderabad Campus


BITS Pilani, Hyderabad Campus
Discretization

• Pokémon video gaming company wishes to analyze their


data and has captured the data of each game played by the
customer. One attribute is the number of times the
customer has played the game. We have 16 examples:
{ 22, 12, 61, 57, 30, 1, 32, 37, 37, 68, 42, 11, 25, 7, 8, 16 }.
• Apply equal frequency and equi-width binning using
number of bins=4 and explain the difference between the
two methods

BITS Pilani, Hyderabad Campus


BITS Pilani, Hyderabad Campus
