Data Mining Merged Pdf CS1 CS8
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
• Data collection and data availability
• Automated data collection tools, database systems, Web, computerized
society
• Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”—Data mining—Automated analysis of
massive data sets
Why Data Mining
A search engine (e.g., Google) receives hundreds of millions of queries every day. Each query can be
viewed as a transaction where the user describes her or his information need.
What novel and useful knowledge can a search engine learn from such a huge collection of queries
collected from users over time? Some patterns found in user search queries can disclose invaluable
knowledge that cannot be obtained by reading individual data items alone.
For example, Google's Flu Trends uses specific search terms as indicators of flu activity. It found a
close relationship between the number of people who search for flu-related information and the
number of people who actually have flu symptoms. A pattern emerges when all of the search
queries related to flu are aggregated. Using aggregated Google search data, Flu Trends can estimate
flu activity up to two weeks faster than traditional systems can. This example shows how data
mining can turn a large collection of data into knowledge that can help meet a current global
challenge.
Evolution of Database Technology
• 1960s:
• Data collection, database creation, IMS and network DBMS
• 1970s:
• Relational data model, relational DBMS implementation
• 1980s:
• RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
• Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
• Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
• Stream data management and mining
• Data mining and its applications
• Web technology (XML, data integration) and global information systems
What Is Data Mining?
• Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
• Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
What is (not) Data Mining?
What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about "Amazon"
What is Data Mining?
– Certain names are more prevalent in certain US locations (O'Brien, O'Rurke, O'Reilly… in the Boston area)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition,
statistics, and database systems
• Traditional techniques may be unsuitable due to
• Enormity of data
[Figure: data mining at the intersection of Statistics/AI, Machine Learning/Pattern Recognition, and Database Systems]
[Figure fragment: "Data Exploration — Statistical Summary, Querying, and Reporting" layer of the knowledge discovery stack]
Multi-Dimensional View of Data Mining
• Data to be mined
• Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse,
transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media,
graphs & social and information networks
• Knowledge to be mined (or: Data mining functions)
• Characterization, discrimination, association, classification, clustering, trend/deviation, outlier
analysis, etc.
• Descriptive vs. predictive data mining
• Multiple/integrated functions and mining at multiple levels
• Techniques utilized
• Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition,
visualization, high-performance, etc.
• Applications adapted
• Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text
mining, Web mining, etc.
Data Mining & Machine Learning
According to Tom M. Mitchell, Chair of Machine Learning at Carnegie Mellon
University and author of the book Machine Learning (McGraw-Hill),
A computer program is said to learn from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks in T, as measured by P, improves with
the experience E.
We now have a set of objects to define machine learning: Task (T), Experience (E), and Performance (P).
For a computer program running a set of tasks, the experience should lead to performance improvements (to satisfy the definition).
Many data mining tasks are executed successfully with the help of machine learning.
Machine Learning: Hands-on for Developers and Technical Professionals by Jason Bell John Wiley & Sons
Data Mining on Diverse kinds of Data
Besides relational database data (from operational or analytical systems), there are many other kinds of data
that have diverse forms and structures and different semantic meanings.
Examples of such data include:
time-related or sequence data (e.g., historical records, stock exchange data, and time-series and biological
sequence data),
data streams (e.g., video surveillance and sensor data, which are continuously transmitted),
spatial data (e.g., maps),
engineering design data (e.g., the design of buildings, system components, or integrated circuits),
hypertext and multimedia data (including text, image, video, and audio data),
graph and networked data (e.g., social and information networks), and
the Web (a widely distributed information repository).
Diversity of data brings in new challenges such as handling special structures (e.g., sequences, trees, graphs,
and networks) and specific semantics (such as ordering, image, audio and video contents, and connectivity)
Data Mining Tasks
• Prediction Methods
• Use some variables to predict unknown or future values of other variables.
• Description Methods
• Find human-interpretable patterns that describe the data.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Tasks...
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]
Classification: Definition
• Given a collection of records (training set)
• Each record contains a set of attributes, one of the attributes is the class.
• Find a model for class attribute as a function of the values of other
attributes.
• Goal: previously unseen records should be assigned a class as
accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given
data set is divided into training and test sets, with training set used to build
the model and test set used to validate it.
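As a concrete illustration of this train/test workflow, here is a minimal sketch assuming scikit-learn and a tiny made-up data set (the slides themselves do not prescribe a specific tool):

```python
# Minimal sketch of the classification workflow: learn a model for the class
# attribute from a training set, then measure accuracy on a held-out test set.
# Assumes scikit-learn is available; the toy records and labels are hypothetical.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Each record: [income (in thousands), age]; class: 1 = buys, 0 = does not buy
X = [[30, 25], [45, 32], [60, 41], [25, 22], [80, 50], [52, 38], [28, 27], [70, 45]]
y = [0, 0, 1, 0, 1, 1, 0, 1]

# Divide the given data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # build the model on the training set
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))  # validate on the test set
```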
Classification Example
[Figure: a classifier is induced from the training set and then applied to the test set]
Classification: Application 1
• Direct Marketing
• Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a
new cell-phone product.
• Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided otherwise. This {buy, don’t
buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and company-interaction related information
about all such customers.
• Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Clustering Definition
• Given a set of data points, each having a set of attributes, and a
similarity measure among them, find clusters such that
• Data points in one cluster are more similar to one another.
• Data points in separate clusters are less similar to one another.
• Similarity Measures:
• Euclidean Distance if attributes are continuous.
• Other Problem-specific Measures.
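A minimal sketch of Euclidean-distance-based clustering, assuming scikit-learn's k-means as the grouping algorithm (an illustrative choice; the points are made up):

```python
# Group points so that members of a cluster are closer (in Euclidean distance)
# to one another than to points in other clusters. Toy 2-D data, hypothetical.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9], [9.0, 0.8], [8.8, 1.1]])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)
print(labels)   # e.g. [0 0 1 1 2 2] -- three groups of mutually similar points
```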
Illustrating Clustering
Euclidean distance based clustering in 3-D space found groups of stocks that moved together:
• Cluster 1 (Technology1-DOWN): Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-DOWN, Tellabs-Inc-DOWN, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
• Cluster 2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
• Cluster 3: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN
• Cluster 4 (Oil-UP): Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
Association Rule Discovery: Definition
• Given a set of records each of which contain some number of items from a given
collection;
• Produce dependency rules which will predict occurrence of an item based on
occurrences of other items.
TID | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Diaper, Milk
Rules Discovered: {Milk} --> {Coke}
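To make the discovered rule concrete, here is a small pure-Python check of its support and confidence on the five transactions above (no particular mining library assumed):

```python
# Compute support and confidence of the rule {Milk} -> {Coke}
# for the five example transactions listed above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"Milk"}, {"Coke"}
supp = support(antecedent | consequent)   # transactions with both Milk and Coke
conf = supp / support(antecedent)         # of the Milk transactions, how many also contain Coke
print(f"support = {supp:.2f}, confidence = {conf:.2f}")   # support = 0.60, confidence = 0.75
```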
Association Rule Discovery: Application 1
• Inventory Management:
• Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.
• Approach: Process the data on tools and parts required in previous repairs at
different consumer locations and discover the co-occurrence patterns.
Sequential Pattern Discovery: Definition
• Given a set of objects, each associated with its own timeline of events, find rules that predict strong sequential dependencies among different events, e.g. (A B) (C) (D E).
• Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints: maxgap (xg), mingap (ng), window size (ws), and maxspan (ms).
[Figure: the pattern (A B) (C) (D E) annotated with the constraints <= xg, > ng, <= ws, and <= ms]
Sequential Pattern Discovery: Examples
• In telecommunications alarm logs,
• (Inverter_Problem Excessive_Line_Current)
(Rectifier_Alarm) --> (Fire_Alarm)
• In point-of-sale transaction sequences,
• Computer Bookstore:
(Intro_To_Visual_C) (C++_Primer) -->
(Perl_for_dummies,Tcl_Tk)
• Athletic Apparel Store:
(Shoes) (Racket, Racketball) --> (Sports_Jacket)
Regression
• Predict a value of a given continuous valued variable based on the values of
other variables, assuming a linear or nonlinear model of dependency.
• Greatly studied in statistics, neural network fields.
• Examples:
• Predicting sales amounts of a new product based on advertising expenditure.
• Predicting wind velocities as a function of temperature, humidity, air
pressure, etc.
• Time series prediction of stock market indices.
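A minimal sketch of fitting such a linear dependency by least squares, assuming numpy; the advertising/sales numbers are hypothetical:

```python
# Fit a linear dependency y ≈ w*x + b by least squares, e.g. predicting
# sales amount from advertising expenditure (hypothetical numbers).
import numpy as np

advertising = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # expenditure
sales       = np.array([2.1, 4.3, 5.9, 8.2, 9.8])   # observed sales

w, b = np.polyfit(advertising, sales, deg=1)         # slope and intercept
print(f"sales ≈ {w:.2f} * advertising + {b:.2f}")
print("predicted sales at expenditure 6.0:", w * 6.0 + b)
```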
Deviation/Anomaly Detection
• Detect significant deviations from normal behavior
• Applications:
• Credit Card Fraud Detection
• Network Intrusion Detection
DM Process
• The standard data mining process involves
1. understanding the problem,
2. preparing the data (samples),
3. developing the model,
4. applying the model on a data set to see how the model may work in real
world, and
5. production deployment.
• A popular data mining process framework is CRISP-DM (Cross-Industry Standard Process for Data Mining). This framework was developed by a consortium of companies involved in data mining.
Generic Data mining process
Prior Knowledge
• Data Mining tools/solutions identify hidden patterns.
• Generally we get many patterns
• Out of them many could be false or trivial.
• Filtering false patterns requires domain understanding.
• Understanding how the data is collected, stored, transformed,
reported, and used is essential.
• Causation vs. Correlation
• A bank may set the interest rate based on the credit score. In the data, credit score and interest rate therefore move together, but it does not make sense to derive the credit score from the interest rate.
Data Preparation
• Data needs to be understood. It requires descriptive statistics such as mean, median, mode, standard deviation, and range for each
attribute
• Data quality is an ongoing concern wherever data is collected, processed, and stored.
• The data cleansing practices include elimination of duplicate records, quarantining outlier records that exceed the bounds, standardization of attribute
values, substitution of missing values, etc.
• It is critical to check the data using data exploration techniques, in addition to using prior knowledge of the data and business, before building models to ensure a certain degree of data quality
• Missing Values
• Need to track the data lineage of the data source to find right solution
• Data Types and Conversion
• The attributes in a data set can be of different types, such as continuous numeric (interest rate), integer numeric (credit score), or categorical
• data mining algorithms impose different restrictions on what data types they accept as inputs
• Transformation
• Can go beyond type conversion, may include dimensionality reduction or numerosity reduction
• Outliers are anomalies in the data set
• May occur legitimately or erroneously.
• Feature Selection
• Many data mining problems involve a data set with hundreds to thousands of attributes, most of which may not be helpful. Some attributes may be
correlated, e.g. sales amount and tax.
• Data Sampling may be adequate in many cases
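A short sketch of the exploration and cleansing steps listed above, assuming pandas; the column names, bounds, and values are hypothetical:

```python
# Descriptive statistics, duplicate elimination, simple outlier quarantine, and
# missing-value substitution. Column names, bounds, and values are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "credit_score": [720, 610, 610, None, 990, 655],
    "interest_rate": [3.1, 4.7, 4.7, 5.2, 2.9, 4.4],
})

print(df.describe())                                  # mean, std, min/max, quartiles per attribute
df = df.drop_duplicates()                             # eliminate duplicate records
in_bounds = df["credit_score"].isna() | df["credit_score"].between(300, 850)
df = df[in_bounds].copy()                             # quarantine records that exceed the bounds
df["credit_score"] = df["credit_score"].fillna(df["credit_score"].mean())  # substitute missing values
print(df)
```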
Modeling
A model is the abstract representation of the data and its relationships in a given data set.
Data mining models can be classified into the following categories: classification, regression, association analysis, clustering, and outlier or anomaly detection.
Each category has a few dozen different algorithms; each takes a slightly different approach to solve the problem at hand.
Application
• The model deployment stage considerations:
• assessing model readiness, technical integration, response time, model maintenance, and assimilation
• Production Readiness
• Real-time response capabilities, and other business requirements
• Technical Integration
• Use of modeling tools (e.g. RapidMiner), Use of PMML for portable and consistent format of model
description, integration with other tools
• Timeliness
• The trade-offs between production responsiveness and build time need to be considered
• Remodeling
• The conditions in which the model is built may change after deployment
• Assimilation
• The challenge is to assimilate the knowledge gained from data mining in the organization. For example, the
objective may be finding logical clusters in the customer database so that separate treatment can be provided
to each customer cluster.
CRISP data mining framework
DM Issues/Challenges - Mining Methodology and User Interaction
• Interactive mining: The data mining process should be highly interactive. Thus, it is important to build flexible user interfaces and an
exploratory mining environment, facilitating the user's interaction with the system. A user may like to first sample a set of data, explore
general characteristics of the data, and estimate potential mining results. Interactive mining should allow users to dynamically change the
focus of a search, to refine mining requests based on returned results, and to drill, dice, and pivot through the data and knowledge space
interactively, dynamically exploring "cube space" while mining.
• Incorporation of background knowledge: Background knowledge, constraints, rules, and other information regarding the domain under
study should be incorporated into the knowledge discovery process. Such knowledge can be used for pattern evaluation as well as to guide
the search toward interesting patterns.
• Ad hoc data mining and data mining query languages: Query languages (e.g., SQL) have played an important role in flexible searching
because they allow users to pose ad hoc queries. Similarly, high-level data mining query languages or other high-level flexible user
interfaces will give users the freedom to define ad hoc data mining tasks. This should facilitate specification of the relevant sets of data for
analysis, the domain knowledge, the kinds of knowledge to be mined, and the conditions and constraints to be enforced on the discovered
patterns. Optimization of the processing of such flexible mining requests is another promising area of study.
• Presentation and visualization of data mining results: How can a data mining system present data mining results, vividly and flexibly, so
that the discovered knowledge can be easily understood and directly usable by humans? This is especially crucial if the data mining process
is interactive. It requires the system to adopt expressive knowledge representations, user-friendly interfaces, and visualization techniques.
DM Issues/Challenges - Efficiency and Scalability
Efficiency and scalability are always considered when comparing data mining algorithms. As data amounts continue to multiply, these two
factors are especially critical.
• Efficiency and scalability of data mining algorithms: Data mining algorithms must be efficient and scalable in order to effectively extract
information from huge amounts of data in many data repositories or in dynamic data streams. In other words, the running time of a data
mining algorithm must be predictable, short, and acceptable by applications. Efficiency, scalability, performance, optimization, and the
ability to execute in real time are key criteria that drive the development of many new data mining algorithms.
• Parallel, distributed, and incremental mining algorithms: The humongous size of many data sets, the wide distribution of data, and the
computational complexity of some data mining methods are factors that motivate the development of parallel and distributed data-
intensive mining algorithms. Such algorithms first partition the data into "pieces." Each piece is processed, in parallel, by searching for
patterns. The parallel processes may interact with one another. The patterns from each partition are eventually merged.
• Cloud computing and cluster computing, which use computers in a distributed and collaborative way to tackle very large-scale
computational tasks, are also active research themes in parallel data mining. In addition, the high cost of some data mining processes and
the incremental nature of input promote incremental data mining, which incorporates new data updates without having to mine the entire
data "from scratch." Such methods perform knowledge modification incrementally to amend and strengthen what was previously
discovered.
DM Issues/Challenges - Diversity of Database Types
The wide diversity of database types brings about challenges to data mining.
Handling complex types of data: Diverse applications generate a wide spectrum of new data types, from structured data such as relational and
data warehouse data to semi-structured and unstructured data; from stable data repositories to dynamic data streams; from simple data
objects to temporal data, biological sequences, sensor data, spatial data, hypertext data, multimedia data, software program code, Web data,
and social network data. It is unrealistic to expect one data mining system to mine all kinds of data, given the diversity of data types and the
different goals of data mining. Domain- or application-dedicated data mining systems are being constructed for in-depth mining of specific
kinds of data. The construction of effective and efficient data mining tools for diverse applications remains a challenging area.
Mining dynamic, networked, and global data repositories: Multiple sources of data are connected by the Internet and various kinds of
networks, forming gigantic, distributed, and heterogeneous global information systems and networks. The discovery of knowledge from
different sources of structured, semi-structured, or unstructured yet interconnected data with diverse data semantics poses great challenges to
data mining. Mining such gigantic, interconnected information networks may help disclose many more patterns and knowledge in
heterogeneous data sets than can be discovered from a small set of isolated data repositories. Web mining, multisource data mining, and
information network mining have become challenging and fast-evolving data mining fields.
DM Issues/Challenges - Society
How does data mining impact society? What steps can data mining take to preserve the privacy of individuals? Do we use data mining in our
daily lives without even knowing that we do?
Social impacts of data mining: With data mining penetrating our everyday lives, it is important to study the impact of data mining on society.
How can we use data mining technology to benefit society? How can we guard against its misuse? The improper disclosure or use of data and
the potential violation of individual privacy and data protection rights are areas of concern that need to be addressed.
Privacy-preserving data mining: Data mining will help scientific discovery, business management, economy recovery, and security protection
(e.g., the real-time discovery of intruders and cyberattacks). However, it poses the risk of disclosing an individual's personal information.
Studies on privacy-preserving data publishing and data mining are ongoing. The philosophy is to observe data sensitivity and preserve people's
privacy while performing successful data mining.
Invisible data mining: We cannot expect everyone in society to learn and master data mining techniques. More and more systems should have
data mining functions built within so that people can perform data mining or use data mining results simply by mouse clicking, without any
knowledge of data mining algorithms. Intelligent search engines and Internet-based stores perform such invisible data mining by incorporating
data mining into their components to improve their functionality and performance. This is done often unbeknownst to the user. For example,
when purchasing items online, users may be unaware that the store is likely collecting data on the buying patterns of its customers, which may
be used to recommend other items for purchase in the future.
Preprocessing Objectives
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
Data Quality: Multidimensional View
Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
Incomplete (Missing) Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• not register history or changes of the data
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when doing classification)—
not effective when the % of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
• a global constant : e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based such as Bayesian formula or decision tree
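A sketch of the automatic fill-in strategies above, assuming pandas; the toy attribute values and class labels are hypothetical:

```python
# Fill missing 'income' values three ways: with a global constant, with the
# attribute mean, and with the mean of samples in the same class (smarter).
import pandas as pd

df = pd.DataFrame({
    "income": [30.0, None, 52.0, None, 75.0, 68.0],   # hypothetical values with gaps
    "class":  ["no", "no", "yes", "yes", "yes", "yes"],
})

filled_constant = df["income"].fillna(-1)                    # global constant / "unknown"
filled_mean     = df["income"].fillna(df["income"].mean())   # attribute mean
filled_by_class = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
print(filled_by_class.tolist())   # missing values replaced by their class's mean income
```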
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data
How to Handle Noisy Data?
• Binning (also used for discretization)
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
• Binning methods smooth a sorted data value by consulting its "neighborhood," that
is, the values around it, i.e. they perform local smoothing.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with possible outliers)
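A minimal sketch of equal-frequency binning followed by smoothing by bin means (the sorted price list is illustrative):

```python
# Equal-frequency binning followed by smoothing by bin means (local smoothing).
# The sorted price list is a hypothetical example.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
n_bins = 3
bin_size = len(prices) // n_bins

bins = [prices[i * bin_size:(i + 1) * bin_size] for i in range(n_bins)]
smoothed = [[sum(b) / len(b)] * len(b) for b in bins]   # replace each value by its bin mean
print(bins)      # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smoothed)  # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
```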
Noise
• Noise refers to modification of original values
• Examples: distortion of a person’s voice when talking on a poor phone and
“snow” on television screen
Duplicate Data
• Examples: the same person with multiple email addresses
• Data cleaning: the process of dealing with duplicate data issues
Outliers
• Outliers are data objects with characteristics that are considerably
different than most of the other data objects in the data set
Data Cleaning as a Process
• Data discrepancy detection
• Use metadata (e.g., domain, range, dependency, distribution)
• Check field overloading
• Check uniqueness rule, consecutive rule and null rule
• Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors
and make corrections
• Data auditing: by analyzing data to discover rules and relationship to detect violators (e.g.,
correlation and clustering to find outliers)
• Data migration and integration
• Data migration tools: allow transformations to be specified
• ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a
graphical user interface
• Integration of the two processes
• Iterative and interactive (e.g., Potter's Wheel)
Prescribed Text Books
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
Handling Redundancy in Data Integration
Correlation Analysis (Nominal Data)
• χ² (chi-square) test: χ² = Σ (Observed − Expected)² / Expected
• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• Correlation does not imply causality
• # of hospitals and # of car-theft in a city are correlated
• Both are causally linked to the third variable: population
Chi-Square Calculation: An Example
 | Play chess | Not play chess | Sum (row)
Like science fiction | 250 (90) | 200 (360) | 450
(observed counts, with expected counts in parentheses)
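A sketch of the χ² computation; since only the first row of the contingency table survives on the slide, the second row below (50 and 1000) is an assumed completion, used purely to make the arithmetic concrete:

```python
# Chi-square test of independence for a 2x2 contingency table.
# Observed counts: only the "like science fiction" row (250, 200) is given on the
# slide; the second row (50, 1000) is an assumed completion for illustration.
observed = [[250, 200],
            [50, 1000]]

n = sum(sum(row) for row in observed)
row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_sums[i] * col_sums[j] / n   # expected count under independence
        chi2 += (o - e) ** 2 / e
print(round(chi2, 2))   # a large value suggests "play chess" and "like science fiction" are related
```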
Covariance (Numeric Data)
• Cov(A, B) = E[(A − Ā)(B − B̄)] = (1/n) Σᵢ (aᵢ − Ā)(bᵢ − B̄) = E(A·B) − Ā·B̄
• Correlation coefficient: rA,B = Cov(A, B) / (σA σB)
where n is the number of tuples, Ā and B̄ are the respective mean or expected values of A and B, and σA and σB are the respective standard deviations of A and B
• Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected values
• Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is likely to be smaller
than its expected value
• Independence: if A and B are independent, CovA,B = 0 (but the converse does not hold: some pairs of variables with covariance 0 are not independent)
Co-Variance: An Example
• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
• Cov(A, B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4
• Since Cov(A, B) > 0, the two stocks tend to rise together
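The same computation in a couple of lines of numpy, as a check (population covariance, dividing by n as on the slide):

```python
# Verify the covariance computation for the two stock price series from the slide.
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

cov = np.mean(A * B) - A.mean() * B.mean()   # E(AB) - E(A)E(B), dividing by n
print(round(cov, 2))                         # 4.0 > 0: the two stocks tend to rise together
```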
Data Discretization Methods
• Typical methods: All the methods can be applied recursively
• Binning
• Top-down split, unsupervised
• Histogram analysis
• Top-down split, unsupervised
• Clustering analysis (unsupervised, top-down split or bottom-up merge)
• Decision-tree analysis (supervised, top-down split)
• Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)
Simple Discretization: Binning
Binning Methods for Data Smoothing
Discretization by Classification & Correlation Analysis
Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that is much smaller in volume
but yet produces the same (or almost the same) analytical results
• Why data reduction? — A database/data warehouse may store terabytes of data. Complex data
analysis may take a very long time to run on the complete data set.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
• Data compression
Data Reduction : Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
• Dimensionality reduction techniques
• Wavelet transforms
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection)
Mapping Data to a New Space
◼ Fourier transform
◼ Wavelet transform
Wavelet Transformation
• Discrete wavelet transform (DWT) for linear signal processing, multi-resolution analysis
• Compressed approximation: store only a small fraction of the strongest of the wavelet
coefficients
• Similar to discrete Fourier transform (DFT), but better lossy compression, localized in
space
• Method:
• Length, L, must be an integer power of 2 (padding with 0’s, when necessary)
• Each transform has 2 functions: smoothing, difference
• Applies to pairs of data, resulting in two sets of data of length L/2
• Applies two functions recursively, until reaches the desired length
Wavelet Decomposition
• Wavelets: A math tool for space-efficient hierarchical decomposition of functions
• S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to S^ = [2.75, −1.25, 0.5, 0, 0, −1, −1, 0]
• Compression: many small detail coefficients can be replaced by 0’s, and only the
significant coefficients are retained
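A minimal sketch of the pairwise averaging-and-differencing (Haar-style) decomposition behind this example; the halving convention below is one common choice and reproduces the transformed vector above:

```python
# Haar-style wavelet decomposition by repeated pairwise averaging and differencing.
# Reproduces S = [2, 2, 0, 2, 3, 5, 4, 4] -> [2.75, -1.25, 0.5, 0, 0, -1, -1, 0].
def haar_decompose(values):
    details = []
    while len(values) > 1:
        averages = [(a + b) / 2 for a, b in zip(values[0::2], values[1::2])]  # smoothing
        diffs    = [(a - b) / 2 for a, b in zip(values[0::2], values[1::2])]  # difference
        details = diffs + details   # coarser-level detail coefficients go in front
        values = averages           # recurse on the smoothed half-length signal
    return values + details         # [overall average] + detail coefficients

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
```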
Principal Component Analysis (PCA)
Principal Component Analysis (Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal components)
that can be best used to represent data
• Normalize input data: Each attribute falls within the same range
• Compute k orthonormal (unit) vectors, i.e., principal components
• Each input data (vector) is a linear combination of the k principal component vectors
• The principal components are sorted in order of decreasing “significance” or strength
• Since the components are sorted, the size of the data can be reduced by eliminating the
weak components, i.e., those with low variance (i.e., using the strongest principal
components, it is possible to reconstruct a good approximation of the original data)
• Works for numeric data only
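A minimal PCA sketch following these steps, assuming scikit-learn; the small 2-D data set is hypothetical:

```python
# Project normalized numeric data onto its k strongest principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data: 8 records, 2 attributes
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

X_norm = StandardScaler().fit_transform(X)   # step 1: normalize each attribute
pca = PCA(n_components=1).fit(X_norm)        # steps 2-3: orthonormal components, sorted by variance
X_reduced = pca.transform(X_norm)            # keep only the strongest component
print(pca.explained_variance_ratio_)         # fraction of the variance retained
```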
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
• Duplicate much or all of the information contained in one or more other
attributes
• E.g., purchase price of a product and the amount of sales tax paid
• Irrelevant attributes
• Contain no information that is useful for the data mining task at hand
• E.g., students' ID is often irrelevant to the task of predicting students' GPA
Heuristic Search in Attribute Selection
Attribute Creation (Feature Generation)
• Create new attributes (features) that can capture the important information in a
data set more effectively than the original ones
• Three general methodologies
• Attribute extraction
• Domain-specific
• Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet transformation
• Attribute construction
• Combining features
• Data discretization
Data Reduction: Numerosity Reduction
• Curse of Dimensionality
• PCA
• Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise
• Techniques
– Principal Component Analysis
– Singular Value Decomposition
– Others: supervised and non-linear techniques
[Figure: data points in the (x1, x2) plane with the first principal direction u1]
Principal Component Analysis (PCA)
• Every PC is orthogonal
• Maximum variance
• Minimum error
• Dominates covariance; X' = (X − mean) / standard-deviation
• PCA is unsupervised
– Maximizes overall variance of the data along a small set of directions
– Does not know anything about class labels
– Can pick a direction that makes it hard to separate classes
Data
Today's Learning Objective: Describe Data

Attribute Types
• Nominal: the values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another (=, ≠). Examples: zip codes, employee ID numbers, eye color, hair color, sex: {male, female}. Statistics: mode, entropy, contingency correlation, χ² test.
• Ordinal: the values of an ordinal attribute provide enough information to order objects (<, >). Examples: hardness of minerals, {good, better, best}, grades, street numbers. Statistics: median, percentiles, rank correlation, run tests, sign tests.
• Interval: for interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists (+, −). Examples: calendar dates, temperature in Celsius or Fahrenheit. Statistics: mean, standard deviation, Pearson's correlation, t and F tests.
• Ratio: for ratio variables, both differences and ratios are meaningful (*, /). Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Statistics: geometric mean, harmonic mean, percent variation.
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: set of words in a collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.
• Important characteristics of data
– Sparsity: only presence counts
– Resolution: patterns depend on the scale
Document Data
• Each document becomes a term vector; the value of each component is the number of times the corresponding term (e.g., team, coach, play, ball, score, game, win, lost, timeout, season) occurs in the document
[Table: term-frequency vectors for Document 1, Document 2, and Document 3]
Genomic Sequence Data — an element of the sequence:
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean]
Data Reduction
• Aggregation, sampling
• feature selection
Aggregation
• Purpose
– Data reduction: reduce the number of attributes or objects
– Change of scale: cities aggregated into regions, states, countries, etc.
– More "stable" data: aggregated data tends to have less variability
• Stratified sampling
– Split the data into several partitions; then draw random samples
from each partition
• z-score normalization: v' = (v − mean) / standard deviation
• Lower Quartile (QL) or First Quartile (Q1), the 25th percentile: 25% of the data falls below it
• Median or Second Quartile (Q2), the 50th percentile: 50% of the data falls below it
• Upper Quartile (QU) or Third Quartile (Q3), the 75th percentile: 75% of the data falls below it
Log Transformation
• Unsupervised discretization (class labels are ignored)
– Equal-interval binning
– Equal-frequency binning
– The best number of bins k is determined experimentally
• Supervised discretization
– Entropy-based discretization
• Examples: Lasso Regression, Decision Trees
➢ Examples:
✓ Group together the customers of the site so that similar customers are shown the
same ad.
✓ Group together web documents so that you can separate the ones that talk about
politics and the ones that talk about sports.
✓ Find credit card transactions that are very different from previous transactions.
➢ For many problems, we need to quantify how close two objects are.
Classification
Today’s Agenda
What is the function between X and Y?
• X is the input (e.g., a board configuration); Y is the output (e.g., the next move)
• Training pairs observed: (X = 1, Y = 0), (X = 5, Y = 16), (X = 6, Y = 20)
• Candidate functions can be proposed, e.g. Y = X + 3 or Y = 3X + 1; which fits the data better, and what is Y when X = 3?
Cost Function
• Model: Yn = W·Xn + b
• Training data: (X, Y) = (1, 0), (5, 16), (6, 20)
• Cost: C(W, b) = Σ over n ∈ {0, 1, 2} of (Yn − Y'n)², where Y'n is the model's prediction
• Candidate model Y = 1·X + 0: predictions Y' = 1, 5, 6; squared errors 1, 121, 196; C(1, 0) = 318
• Candidate model Y = 2·X + 2: predictions Y' = 4, 12, 14; squared errors 16, 16, 36; C(2, 2) = 68
• The model that gives the lowest cost is the better model
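A few lines of Python reproducing the two cost evaluations above (the model form and training pairs are taken from the slide):

```python
# Sum-of-squared-errors cost C(W, b) for the linear model Y' = W*X + b,
# evaluated on the three training pairs from the slide.
training = [(1, 0), (5, 16), (6, 20)]

def cost(W, b):
    return sum((y - (W * x + b)) ** 2 for x, y in training)

print(cost(1, 0))   # 318 -> model Y = 1*X + 0
print(cost(2, 2))   # 68  -> model Y = 2*X + 2, the better of the two
```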
Optimizer
• Model: Yn = W·Xn + b
• Training data: (X, Y) = (1, 0), (5, 16), (6, 20)
• Cost: C(W, b) = Σ over n ∈ {0, 1, 2} of (Yn − Y'n)²
• Optimizer: arg min over W, b ∈ [−∞, ∞] of C(W, b)
• Starting point W0 = 2, b0 = 2 gives C(W, b) = 68
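The slide leaves the optimizer abstract; as one illustrative choice, here is a plain gradient-descent sketch that searches for the (W, b) minimizing C(W, b) (the learning rate and iteration count are arbitrary illustrative values):

```python
# Minimize C(W, b) = sum (y - (W*x + b))^2 by plain gradient descent.
# Learning rate and iteration count are arbitrary illustrative choices.
training = [(1, 0), (5, 16), (6, 20)]
W, b = 2.0, 2.0            # starting point W0, b0 from the slide
lr = 0.01

for _ in range(5000):
    dW = sum(-2 * x * (y - (W * x + b)) for x, y in training)   # dC/dW
    db = sum(-2 * (y - (W * x + b)) for x, y in training)       # dC/db
    W -= lr * dW
    b -= lr * db

print(round(W, 2), round(b, 2))   # approaches W = 4, b = -4, which fits all three pairs exactly
```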
• Classification and Regression are supervised
• Clustering is unsupervised
[Figure: candidate splits with per-child class counts (e.g., children c1 and c2 with Yes/No counts), used to compare their impurity]
Entropy = – 0.4 log2 (0.4) – 0.3 log2 (0.3) – 0.3 log2 (0.3) = 1.571
Now we have only three attributes: Gender, car ownership and Income
level.
• Then, we repeat the procedure of computing degree of
impurity and information gain for the three attributes.
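A small helper reproducing the entropy value above; it can be reused when computing the information gain of the remaining attributes (the class proportions 0.4/0.3/0.3 come from the slide):

```python
# Entropy of a class distribution, in bits.
from math import log2

def entropy(probabilities):
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(round(entropy([0.4, 0.3, 0.3]), 3))   # 1.571, matching the slide
```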
"Probabilistic" learning
– Conditional models just explain y: p(y|x)
– Generative models also explain x: p(x,y)
• Often a component of unsupervised or semi-supervised learning
– Bayes and Naïve Bayes classifiers are generative models
Conditional Probability and Naïve Bayes
• X = (Refund = No, Married, Income = 120K)
• P(X | Class = Yes) = P(Refund = No | Class = Yes) × P(Married | Class = Yes) × P(Income = 120K | Class = Yes) = 1 × 0 × 1.2 × 10⁻⁹ = 0
Revision
Topics for Mid Sem Exam
• Sampling
• Dimensionality Reduction