Chapter 1 Data Mining (Cont.)
DATA MINING
7080809 Data Mining
WHY DATA MINING
The world is data rich but information poor.
WHAT IS DATA MINING
Data mining—searching for knowledge (interesting patterns) in data.
Data mining is looking for hidden, valid, and potentially useful patterns in huge
data sets.
Data mining is all about discovering unsuspected or previously unknown
relationships in the data.
It is a multidisciplinary skill that uses machine learning, statistics, AI, and
database technology.
Data mining is also called knowledge discovery from data (KDD), knowledge
extraction, data/pattern analysis, and information harvesting.
WHAT IS/IS NOT DATA MINING?
- Not data mining: looking up a phone number in a phone directory.
- Data mining: discovering that certain names are more prevalent in certain US locations
(e.g., O’Brien, O’Rourke, O’Reilly in the Boston area).
- Data mining: grouping together similar documents returned by a search engine according to
their context (e.g., Amazon rainforest, Amazon.com).
THE KNOWLEDGE DISCOVERY PROCESS
- Data cleaning (to remove noise and inconsistent data)
- Data selection (where data relevant to the analysis task are retrieved from the database)
- Data transformation (where data are transformed and consolidated into forms appropriate for
mining by performing summary or aggregation operations)
- Data mining (an essential process where intelligent methods are applied to extract data patterns)
- Pattern evaluation (to identify the truly interesting patterns representing knowledge based on
interestingness measures)
- Knowledge presentation (where visualization and knowledge representation techniques are used
to present mined knowledge to users)
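The steps above can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the records, the branch names, and the `MIN_TOTAL` threshold are hypothetical, and the "mining" step is a trivial ranking rather than a real mining algorithm.

```python
# Hypothetical raw records: (customer, branch, amount); None marks a missing value.
raw = [
    ("ann", "north", 120.0),
    ("bob", "north", None),   # incomplete record
    ("cat", "south", 80.0),
    ("dan", "south", 60.0),
    ("eve", "east", 300.0),
]

# 1. Data cleaning: drop records with missing amounts.
cleaned = [r for r in raw if r[2] is not None]

# 2. Data selection: keep only the branches relevant to the analysis task.
selected = [r for r in cleaned if r[1] in {"north", "south"}]

# 3. Data transformation: aggregate amounts per branch.
totals = {}
for _, branch, amount in selected:
    totals[branch] = totals.get(branch, 0.0) + amount

# 4. Data mining: here simply a ranking of branches by total sales.
patterns = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# 5. Pattern evaluation: keep patterns above an assumed interestingness threshold.
MIN_TOTAL = 100.0
interesting = [(b, t) for b, t in patterns if t >= MIN_TOTAL]

# 6. Knowledge presentation: print a readable summary.
for branch, total in interesting:
    print(f"{branch}: total sales {total:.2f}")
```

Each stage consumes the output of the previous one, which mirrors the order of the list above.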
WHAT KINDS OF DATA CAN BE MINED?
- Data mining can be applied to any kind of data as long as the data are meaningful for a
target application.
- The most basic forms of data for mining applications are database data, data
warehouse data, and transactional data.
- Data mining can also be applied to other forms of data (e.g., data streams, ordered or
sequence data, graph or networked data, spatial data, text data, multimedia data, and
the WWW).
DATABASE DATA
A database system, also called a database management system (DBMS),
consists of a collection of interrelated data, known as a database, and a set of
software programs to manage and access the data.
The software programs provide mechanisms for defining database structures
and data storage; for specifying and managing concurrent, shared, or
distributed data access; and for ensuring consistency and security of the
information stored despite system crashes or attempts at unauthorized
access.
- A relational database for AllElectronics.
- Relational data can be accessed by database queries written in a relational query
language (e.g., SQL) or with the assistance of graphical user interfaces.
- Show me a list of all items that were sold in the last quarter.
- Show me the total sales of the last month, grouped by branch.
- How many sales transactions occurred in the month of December?
- Which salesperson had the highest sales?
- When mining relational databases, we can go further by searching for trends or data
patterns.
- Analyze customer data to predict the credit risk of new customers based on their
income, age, and previous credit information.
- Detect deviations—that is, items with sales that are far from those expected in
comparison with the previous year.
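Two of the example queries can be sketched with Python's built-in `sqlite3` module. The `sales` table, its columns, and the rows are hypothetical stand-ins, not the actual AllElectronics schema.

```python
import sqlite3

# In-memory database with an assumed, simplified sales table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (trans_id INTEGER, branch TEXT, sale_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "north", "2023-11-28", 250.0),
        (2, "north", "2023-12-03", 100.0),
        (3, "south", "2023-12-15", 400.0),
        (4, "south", "2023-12-20", 50.0),
    ],
)

# "Show me the total sales, grouped by branch."
by_branch = conn.execute(
    "SELECT branch, SUM(amount) FROM sales GROUP BY branch ORDER BY branch"
).fetchall()

# "How many sales transactions occurred in the month of December?"
(december_count,) = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE strftime('%m', sale_date) = '12'"
).fetchone()

print(by_branch, december_count)
```

Queries like these retrieve and aggregate stored facts. Mining goes further by searching the same data for trends or deviations.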
DATA WAREHOUSES
A data warehouse is a repository of information collected from multiple
sources, stored under a unified schema, and usually residing at a single
site.
Data warehouses are constructed via a process of data cleaning, data
integration, data transformation, data loading, and periodic data
refreshing.
- A data cube for AllElectronics.
DATA MINING FUNCTIONALITIES
- These include
- Characterization and discrimination
- Mining of frequent patterns, associations, and correlations
- Classification and regression
- Clustering analysis
- Outlier analysis
CHARACTERIZATION & DISCRIMINATION
- Class/concept descriptions.
- These descriptions can be derived using
- Data characterization, by summarizing the data of the class under study (often
called the target class),
- Data discrimination, by comparison of the target class with one or a set of
comparative classes (often called the contrasting classes),
- Both data characterization and discrimination.
DATA CHARACTERIZATION
- Summarization of the general characteristics or features of a target class of data.
- The data corresponding to the user-specified class are typically collected by a query.
- The output can be presented in various forms. Examples include pie charts, bar charts,
curves, multidimensional data cubes, and multidimensional tables, including crosstabs.
- The resulting descriptions can also be presented as generalized relations or in rule form
(called characteristic rules).
- A customer relationship manager at AllElectronics may order the following data mining
task: Summarize the characteristics of customers who spend more than $5000 a year
at AllElectronics.
- The result is a general profile of these customers, such as that they are 40 to 50 years
old, employed, and have excellent credit ratings.
- The data mining system should allow the customer relationship manager to drill down
on any dimension, such as on occupation to view these customers according to their
type of employment.
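A minimal sketch of this characterization task follows. The customer records and field layout are hypothetical. The filter plays the role of the query that collects the target class, and the `Counter` on occupation stands in for a drill-down.

```python
from collections import Counter
from statistics import mean

# Hypothetical customer records: (age, occupation, credit_rating, yearly_spend)
customers = [
    (45, "engineer", "excellent", 7200),
    (42, "teacher", "excellent", 5600),
    (23, "student", "fair", 900),
    (48, "engineer", "excellent", 9100),
    (35, "clerk", "good", 3000),
]

# Collect the target class: customers who spend more than $5000 a year.
target = [c for c in customers if c[3] > 5000]

# Summarize the general characteristics of the target class.
profile = {
    "count": len(target),
    "mean_age": mean(c[0] for c in target),
    "credit_ratings": Counter(c[2] for c in target),
}

# Drill down on the occupation dimension.
by_occupation = Counter(c[1] for c in target)

print(profile)
print(by_occupation)
```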
DATA DISCRIMINATION
- A comparison of the general features of the target class data objects against the
general features of objects from one or multiple contrasting classes.
- The target and contrasting classes can be specified by a user, and the corresponding
data objects can be retrieved through database queries.
- “How are discrimination descriptions output?” The forms of output presentation are
similar to those for characteristic descriptions.
- A customer relationship manager at AllElectronics may want to compare two groups of
customers—those who shop for computer products regularly (e.g., more than twice a
month) and those who rarely shop for such products (e.g., less than three times a year).
- Drilling down on a dimension like occupation, or adding a new dimension like income
level, may help to find even more discriminative features between the two classes.
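The comparison can be sketched by computing, for each class, the share of customers per occupation. The records are hypothetical. The thresholds follow the example above, with "more than twice a month" taken as more than 24 purchases a year.

```python
from collections import Counter

# Hypothetical customers: (occupation, computer_purchases_per_year)
customers = [
    ("engineer", 30), ("engineer", 26), ("student", 28),
    ("teacher", 1), ("clerk", 2), ("teacher", 0), ("engineer", 1),
]

# Target class: regular shoppers (more than twice a month, i.e. > 24 a year).
target = [c for c in customers if c[1] > 24]
# Contrasting class: rare shoppers (less than three times a year).
contrast = [c for c in customers if c[1] < 3]

def share_by(rows, index):
    """Return the fraction of rows per distinct value of the chosen field."""
    counts = Counter(r[index] for r in rows)
    return {value: n / len(rows) for value, n in counts.items()}

target_profile = share_by(target, 0)
contrast_profile = share_by(contrast, 0)

print("regular:", target_profile)
print("rare:   ", contrast_profile)
```

Occupations whose share differs strongly between the two profiles are candidate discriminative features.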
MINING FREQUENT PATTERNS, ASSOCIATIONS, & CORRELATIONS
- Frequent patterns are patterns that occur frequently in data.
- A frequent itemset typically refers to a set of items that often appear together in a
transactional data set.
- Example: milk and bread are frequently bought together in grocery stores by many
customers.
- A frequently occurring subsequence, such as the pattern that customers tend to
purchase first a laptop, followed by a digital camera, and then a memory card, is a
(frequent) sequential pattern.
- A substructure can refer to different structural forms (e.g., graphs, trees, or lattices) that
may be combined with itemsets or subsequences.
- If a substructure occurs frequently, it is called a (frequent) structured pattern. Mining
frequent patterns leads to the discovery of interesting associations and correlations
within data.
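A rough sketch of finding frequent itemsets of size two by brute-force counting. The transactions and the `MIN_COUNT` threshold are assumptions for illustration. Practical algorithms such as Apriori avoid enumerating every pair.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"eggs"},
]
MIN_COUNT = 3  # assumed minimum number of supporting transactions

# Count every pair of items that appears together in a transaction.
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

# Keep only the pairs that meet the threshold.
frequent_pairs = {p: n for p, n in pair_counts.items() if n >= MIN_COUNT}
print(frequent_pairs)
```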
ASSOCIATION ANALYSIS
- As a marketing manager, you want to know which items are frequently purchased together (i.e.,
within the same transaction).
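- buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
- Here 1% of the transactions contain both a computer and software, and a customer who
buys a computer has a 50% chance of also buying software.
- A second rule: age(X, “20..29”) ∧ income(X, “40K..49K”) ⇒ buys(X, “laptop”)
[support = 2%, confidence = 60%]
- The second rule indicates that of the AllElectronics customers under study, 2% are 20 to
29 years old with an income of $40,000 to $49,000 and have purchased a laptop (computer)
at AllElectronics.
- There is a 60% probability that a customer in this age and income group will purchase a
laptop.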
- This is an association involving more than one attribute or predicate (i.e., age, income, and
buys).
- Adopting the terminology used in multidimensional databases, where each attribute is
referred to as a dimension, the above rule can be referred to as a multidimensional
association rule.
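Support and confidence can be computed directly from their definitions. The sketch below uses hypothetical baskets. It assumes the antecedent occurs in at least one transaction, since otherwise the confidence is undefined.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical transactions.
baskets = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
]

s = support(baskets, {"computer", "software"})       # 2 of 4 transactions
c = confidence(baskets, {"computer"}, {"software"})  # 2 of the 3 computer buyers
print(f"support = {s:.0%}, confidence = {c:.0%}")
```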
CLASSIFICATION AND REGRESSION FOR PREDICTIVE ANALYSIS
- Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts.
- The model is derived based on the analysis of a set of training data (i.e., data objects
for which the class labels are known).
- The model is used to predict the class label of objects for which the class label is
unknown.
- The derived model may be represented in various forms, such as classification rules
(i.e., IF-THEN rules), decision trees, mathematical formulae, or neural networks.
- A decision tree is a flowchart-like tree structure, where each node denotes a test on an
attribute value, each branch represents an outcome of the test, and tree leaves
represent classes or class distributions.
- A neural network, when used for classification, is typically a collection of neuron-like
processing units with weighted connections between the units.
- There are many other methods for constructing classification models, such as naïve
Bayesian classification, support vector machines, and k-nearest-neighbor classification.
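Of the methods listed, k-nearest-neighbor classification is the shortest to sketch. The training data, the two features, and the choice of k = 3 are hypothetical. Features are left unscaled to keep the example short.

```python
from collections import Counter
from math import dist

# Hypothetical training data: (income in $1000s, age) -> credit risk label.
training = [
    ((25, 22), "high"),
    ((30, 25), "high"),
    ((28, 30), "high"),
    ((70, 45), "low"),
    ((80, 50), "low"),
    ((65, 40), "low"),
]

def knn_predict(train, point, k=3):
    """Predict the majority label among the k training points nearest to `point`."""
    nearest = sorted(train, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict(training, (27, 24)))
print(knn_predict(training, (75, 48)))
```

The labeled tuples are the training data, and the calls to `knn_predict` assign class labels to objects whose label is unknown.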
- Whereas classification predicts categorical (discrete, unordered) labels, regression
models continuous-valued functions.
- Regression is used to predict missing or unavailable numerical data values rather than
(discrete) class labels.
- Regression analysis is a statistical methodology that is most often used for numeric
prediction. Regression also encompasses the identification of distribution trends based
on the available data.
- Classification and regression may need to be preceded by relevance analysis, which
attempts to identify attributes that are significantly relevant to the classification and
regression process.
- Such attributes will be selected for the classification and regression process. Other
attributes, which are irrelevant, can then be excluded from consideration.
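Numeric prediction can be illustrated with simple least-squares linear regression on one attribute. The data are hypothetical and deliberately lie on the line y = 2x + 1. The sketch assumes the x values are not all identical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: past discount level (x) versus revenue (y).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
predicted = slope * 5.0 + intercept  # predict a value not seen in the data
print(slope, intercept, predicted)
```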
CLASSIFICATION AND REGRESSION
- Example: classify a large set of items in the store based on three kinds of responses to a
sales campaign: good response, mild response, and no response.
- You want to derive a model for each of these three classes based on the descriptive
features of the items, such as price, brand, place made, type, and category.
- The resulting classification should maximally distinguish each class from the others,
presenting an organized picture of the data set.
- Suppose the resulting classification is expressed as a decision tree.
- The decision tree, for instance, may identify price as being the single factor that best
distinguishes the three classes.
- The tree may reveal that, in addition to price, other features that help to further
distinguish objects of each class from one another include brand and place made.
- Such a decision tree may help you understand the impact of the given sales campaign
and design a more effective campaign in the future.
- Regression example: predict the amount of revenue that each item will generate during an
upcoming sale, based on the previous sales data.
- In many cases, class-labeled data may simply not exist at the beginning. Clustering can be
used to generate class labels for a group of data.
- The objects are clustered or grouped based on the principle of maximizing the intraclass
similarity and minimizing the interclass similarity.
- That is, clusters of objects are formed so that objects within a cluster have high similarity
in comparison to one another, but are rather dissimilar to objects in other clusters.
- Each cluster so formed can be viewed as a class of objects, from which rules can be
derived.
CLUSTER ANALYSIS
- Cluster analysis can be performed on customer data to identify homogeneous
subpopulations of customers. These clusters may represent individual target groups for
marketing.
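One common clustering method is k-means. The slides do not name a specific algorithm, so this is only one possible choice. The sketch works on a single hypothetical attribute (yearly spend) and assumes the starting centers are supplied by the caller.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on 1-D data, starting from the given centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to its cluster mean (keep it if empty).
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical yearly spend per customer.
spend = [100, 120, 110, 900, 950, 1000]
centers, clusters = kmeans_1d(spend, centers=[0, 500])
print(centers, clusters)
```

Values within a cluster are close to one another and far from the other cluster, which is the intraclass/interclass similarity principle in miniature.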
OUTLIER ANALYSIS
- A data set may contain objects that do not comply with the general behavior or model
of the data. These data objects are outliers.
- Many data mining methods discard outliers as noise or exceptions. However, in some
applications (e.g., fraud detection) the rare events can be more interesting than the
more regularly occurring ones. The analysis of outlier data is referred to as outlier
analysis or anomaly mining.
- Outliers may be detected using statistical tests that assume a distribution or probability
model for the data, or using distance measures where objects that are remote from any
other cluster are considered outliers.
- Rather than using statistical or distance measures, density-based methods may identify
outliers in a local region, although they look normal from a global statistical distribution
view.
- Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases
of unusually large amounts for a given account number in comparison to regular
charges incurred by the same account.
- Outlier values may also be detected with respect to the locations and types of
purchase, or the purchase frequency.
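A simple statistical test for the credit card case compares a new purchase with the account's regular charges. The charge history and the threshold of three standard deviations are assumptions. The sketch also assumes at least two regular charges that are not all equal.

```python
from statistics import mean, stdev

def is_outlier(history, amount, threshold=3.0):
    """Flag `amount` if it lies more than `threshold` standard deviations
    from the mean of the account's regular charges."""
    mu = mean(history)
    sigma = stdev(history)
    return abs(amount - mu) / sigma > threshold

# Hypothetical regular charges for one account.
regular = [42.0, 38.0, 55.0, 47.0, 51.0, 40.0, 49.0]

print(is_outlier(regular, 2500.0))
print(is_outlier(regular, 50.0))
```

The statistics are estimated from the regular charges only. Including the suspicious purchase in the estimate would inflate the standard deviation and could hide the outlier.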
ARE ALL PATTERNS INTERESTING?
- What makes a pattern interesting?
- A pattern is interesting if it is easily understood by humans, valid on new or test data
with some degree of certainty, potentially useful, and novel.
- A pattern is also interesting if it validates a hypothesis that the user sought to confirm.
- Objective measures of pattern interestingness are based on the structure of discovered patterns
and the statistics underlying them, for example accuracy and coverage for classification
(IF-THEN) rules.
- Subjective interestingness measures are based on user beliefs in the data. These measures find
patterns interesting if the patterns are unexpected (contradicting a user’s belief) or offer strategic
information on which the user can act.
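Coverage and accuracy of an IF-THEN rule can be sketched as follows, using the common definitions: coverage is the fraction of tuples the rule's condition matches, and accuracy is the fraction of those matched tuples that the rule labels correctly. The data and the rule are hypothetical.

```python
# Hypothetical labeled tuples: (age, student, buys_computer)
data = [
    ("youth", "yes", "yes"),
    ("youth", "yes", "yes"),
    ("youth", "no", "no"),
    ("senior", "yes", "no"),
    ("youth", "yes", "no"),
]

def rule_quality(rows, condition, predicted_label):
    """Return (coverage, accuracy) of the rule IF condition THEN predicted_label."""
    covered = [r for r in rows if condition(r)]
    correct = [r for r in covered if r[-1] == predicted_label]
    coverage = len(covered) / len(rows)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

# Rule: IF age = youth AND student = yes THEN buys_computer = yes
coverage, accuracy = rule_quality(
    data, lambda r: r[0] == "youth" and r[1] == "yes", "yes"
)
print(coverage, accuracy)
```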
- Can a data mining system generate all of the interesting patterns?
- This question refers to the completeness of a data mining algorithm. It is often
unrealistic and inefficient for data mining systems to generate all possible patterns.
- Association rule mining is an example where the use of constraints and interestingness
measures can ensure the completeness of mining.
- Can the system generate only the interesting ones?
- This is an optimization problem in data mining.
- It is highly desirable for data mining systems to generate only interesting patterns.
- This would be efficient for users and data mining systems because neither would have
to search through the patterns generated to identify the truly interesting ones.
- Measures of pattern interestingness are essential for the efficient discovery of patterns
by target users.
- Such measures can be used after the data mining step to rank the discovered patterns
according to their interestingness, filtering out the uninteresting ones.
- More important, such measures can be used to guide and constrain the discovery
process, improving the search efficiency by pruning away subsets of the pattern space
that do not satisfy prespecified interestingness constraints.
WHICH TECHNOLOGIES ARE USED?