Data Mining Moodle Notes U1
Data mining has attracted a great deal of attention in the information industry
in recent years because of the wide availability of huge amounts of data and the
imminent need for turning such data into useful information and knowledge. The
information and knowledge gained can be used for applications ranging from
business management, production control, and market analysis to engineering
design and science exploration.
The evolution of database technology
Data collection and database creation (1960s and earlier):
1) Primitive file processing
Database management systems (1970s-early 1980s):
1) Hierarchical and network database systems
2) Relational database systems
3) Data modeling tools: entity-relationship models, etc.
4) Indexing and access methods: B-trees, hashing, etc.
5) Query languages: SQL, etc.; user interfaces, forms, and reports
6) Query processing and query optimization
7) Transactions, concurrency control, and recovery
8) Online transaction processing (OLTP)
Advanced database systems (mid 1980s-present):
1) Advanced data models: extended relational, object-relational, etc.
2) Advanced applications: spatial, temporal, multimedia, active, stream and
sensor, knowledge-based
Advanced data analysis: data warehousing and data mining (late 1980s-present):
1) Data warehouse and OLAP
2) Data mining and knowledge discovery: generalization, classification,
association, clustering, frequent pattern, outlier analysis, etc.
3) Advanced data mining applications: stream data mining, bio-data mining, text
mining, web mining, etc.
4) Data mining and society: privacy-preserving data mining
Web-based databases (1990s-present):
1) XML-based database systems
2) Integration with information retrieval
3) Data and information integration
New generation of integrated data and information systems (present-future)
What is data mining?
Data mining refers to extracting or "mining" knowledge from large amounts of
data. There are many other terms related to data mining, such as knowledge
mining, knowledge extraction, data/pattern analysis, data archaeology, and data
dredging. Many people treat data mining as a synonym for another popularly used
term, "knowledge discovery in databases", or KDD; alternatively, data mining can
be viewed as simply an essential step in the process of knowledge discovery in
databases. Knowledge discovery as a process is depicted in the following figure
and consists of an iterative sequence of the following steps:
1. Data cleaning: to remove noise or irrelevant data
2. Data integration: where multiple data sources may be combined
3. Data selection: where data relevant to the analysis task are retrieved from
the database
4. Data transformation: where data are transformed or consolidated into forms
appropriate for mining, by performing summary or aggregation operations
5. Data mining: an essential process where intelligent methods are applied in
order to extract data patterns
6. Pattern evaluation: to identify the truly interesting patterns representing
knowledge based on some interestingness measures
7. Knowledge presentation: where visualization and knowledge representation
techniques are used to present the mined knowledge to the user.
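The steps above can be sketched as a toy pipeline in Python. The student
records, field names, and the "above average" pattern are invented for
illustration; they are not part of the notes:

```python
# A minimal sketch of the knowledge-discovery steps over toy student records.
raw = [
    {"name": "Asha", "score": 82},
    {"name": "Asha", "score": 82},    # duplicate record, to be cleaned
    {"name": "Ravi", "score": None},  # missing/noisy value, to be cleaned
    {"name": "Mei", "score": 91},
    {"name": "Tom", "score": 58},
]

# 1. Data cleaning: drop records with missing values and drop duplicates.
cleaned, seen = [], set()
for r in raw:
    key = (r["name"], r["score"])
    if r["score"] is not None and key not in seen:
        seen.add(key)
        cleaned.append(r)

# 2-3. Integration and selection would merge sources and retrieve relevant
#      data; here we simply select the score attribute.
scores = [r["score"] for r in cleaned]

# 4. Transformation: consolidate by aggregation (here, an average).
average = sum(scores) / len(scores)

# 5. Data mining: apply a (trivial) method - find students above the average.
pattern = [r["name"] for r in cleaned if r["score"] > average]

# 6-7. Pattern evaluation and knowledge presentation: report the result.
print(average, pattern)
```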
Architecture of a typical data mining system/Major Components
Data mining is the process of discovering interesting knowledge from large
amounts of data stored either in databases, data warehouses, or other information
repositories. Based on this view, the architecture of a typical data mining system
may have the following major components:
1. A database, data warehouse, or other information repository, which consists
of a set of databases, data warehouses, spreadsheets, or other kinds of
information repositories containing the data to be mined (e.g., student and
course information).
2. A database or data warehouse server, which fetches the relevant data based on
the user's data mining request.
3. A knowledge base that contains the domain knowledge used to guide the search
or to evaluate the interestingness of resulting patterns. For example, the
knowledge base may contain metadata describing data from multiple heterogeneous
sources.
4. A data mining engine, which consists of a set of functional modules for tasks
such as characterization, association, classification, cluster analysis, and
evolution and deviation analysis.
5. A pattern evaluation module that works in tandem with the data mining modules
by employing interestingness measures to help focus the search towards
interesting patterns.
6. A graphical user interface that allows the user to interact with the data
mining system.
How is a data warehouse different from a database? How are they similar?
• Differences between a data warehouse and a database: A data warehouse is a
repository of information collected from multiple sources, over a history of time,
stored under a unified schema, and used for data analysis and decision support;
whereas a database, is a collection of interrelated data that represents the current
status of the stored data. There could be multiple heterogeneous databases where
the schema of one database may not agree with the schema of another. A database
system supports ad-hoc query and on-line transaction processing.
Similarities between a data warehouse and a database: Both are repositories of
information, storing huge amounts of persistent data. In principle, data mining
should be applicable to any kind of information repository. This includes relational
databases, data warehouses, transactional databases, advanced database systems,
flat files, and the World-Wide Web. Advanced database systems include
object-oriented and object-relational databases, and specific
application-oriented databases, such as spatial databases, time-series
databases, text databases, and multimedia databases.
Flat files: Flat files are actually the most common data source for data mining
algorithms, especially at the research level. Flat files are simple data files in text or
binary format with a structure known by the data mining algorithm to be applied.
The data in these files can be transactions, time-series data, scientific
measurements, etc.
Relational Databases: a relational database consists of a set of tables containing
either values of entity attributes, or values of attributes from entity relationships.
Tables have columns and rows, where columns represent attributes and rows
represent tuples. A tuple in a relational table corresponds to either an object or a
relationship between objects and is identified by a set of attribute values
representing a unique key.
The following figure presents some relations Customer, Items, and Borrow
representing business activity in a video store. These relations are just a
subset of what could be a database for the video store and are given as an
example. The most commonly used query language for relational databases is SQL,
which allows retrieval and manipulation of the data stored in the tables, as
well as the calculation of aggregate functions such as average, sum, min, max,
and count. For instance, an SQL query to count the videos in each category would
be: SELECT category, count(*) FROM Items WHERE type = 'video' GROUP BY category.
Data mining algorithms using relational databases can be more versatile than data
mining algorithms specifically written for flat files, since they can take advantage
of the structure inherent to relational databases. While data mining can benefit
from SQL for data selection, transformation and consolidation, it goes beyond
what SQL could provide, such as predicting, comparing, detecting deviations, etc.
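As an illustration, the aggregate query above can be run against a small
in-memory table using Python's built-in sqlite3 module. The table schema and
sample rows below are assumptions for the sketch, not the actual video-store
database:

```python
import sqlite3

# Hypothetical in-memory version of the video store's Items table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Items (item_id INTEGER, title TEXT, type TEXT, category TEXT)"
)
conn.executemany(
    "INSERT INTO Items VALUES (?, ?, ?, ?)",
    [
        (1, "Alien", "video", "sci-fi"),
        (2, "Heat", "video", "action"),
        (3, "Speed", "video", "action"),
        (4, "Chess Set", "game", "board"),
    ],
)

# The aggregate query from the text: count videos per category.
rows = conn.execute(
    "SELECT category, count(*) FROM Items WHERE type = 'video' "
    "GROUP BY category ORDER BY category"
).fetchall()
print(rows)  # [('action', 2), ('sci-fi', 1)]
```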
Data mining functionalities/Data mining tasks: Data mining functionalities are
used to specify the kind of patterns to be found in data mining tasks. In general,
data mining tasks can be classified into two categories:
• Descriptive
• Predictive
Descriptive mining tasks characterize the general properties of the data in the
database. Predictive mining tasks perform inference on the current data in order to
make predictions. Describe data mining functionalities and the kinds of patterns
they can discover (or) Define each of the following data mining functionalities:
characterization, discrimination, association and correlation analysis, classification,
prediction, clustering, and evolution analysis. Give examples of each data mining
functionality, using a real-life database that you are familiar with. Concept/class
description: characterization and discrimination 35 Data can be associated with
classes or concepts. It describes a given set of data in a concise and summarative
manner, presenting interesting general properties of the data. These descriptions
can be derived via
1. data characterization, by summarizing the data of the class under study (often
called the target class)
2. data discrimination, by comparison of the target class with one or a set of
comparative classes
3. both data characterization and discrimination Data characterization It is a
summarization of the general characteristics or features of a target class of data.
Example: A data mining system should be able to produce a description
summarizing the characteristics of a student who has obtained more than 75% in
every semester; the result could be a general profile of the student.
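A minimal sketch of data characterization in Python: select the target class
(students above 75% in every semester) and summarize a few of its general
features. The records and attribute names are invented for illustration:

```python
# Toy student records; the attributes are assumptions for the sketch.
students = [
    {"name": "Asha", "semesters": [82, 88, 91], "year": 4, "major": "computing science"},
    {"name": "Mei",  "semesters": [79, 85, 80], "year": 4, "major": "computing science"},
    {"name": "Ravi", "semesters": [91, 62, 70], "year": 2, "major": "biology"},
    {"name": "Tom",  "semesters": [55, 60, 58], "year": 1, "major": "history"},
]

# Select the target class: more than 75% in every semester.
target = [s for s in students if all(m > 75 for m in s["semesters"])]

# Characterize the class by summarizing its general features.
profile = {
    "count": len(target),
    "avg_year": sum(s["year"] for s in target) / len(target),
    "majors": sorted({s["major"] for s in target}),
}
print(profile)
```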
Data discrimination is a comparison of the general features of target class data
objects with the general features of objects from one or a set of contrasting
classes.
Example: The general features of students with high GPAs may be compared with
the general features of students with low GPAs. The resulting description could
be a general comparative profile of the students, such as: 75% of the students
with high GPAs are fourth-year computing science students, while 65% of the
students with low GPAs are not.
The output of data characterization can be presented in various forms. Examples
include pie charts, bar charts, curves, multidimensional data cubes, and
multidimensional tables, including crosstabs. The resulting descriptions can
also be presented as generalized relations, or in rule form called
characteristic rules. Discrimination descriptions expressed in rule form are
referred to as discriminant rules.
Mining frequent patterns, associations and correlations
It is the discovery of association rules showing attribute-value conditions that
occur frequently together in a given set of data. For example, a data mining
system may find association rules like
major(X, "computing science") ⇒ owns(X, "personal computer")
[support = 12%, confidence = 98%]
where X is a variable representing a student. The rule indicates that of the
students under study, 12% (support) major in computing science and own a
personal computer. There is a 98% probability (confidence, or certainty) that a
student in this group owns a personal computer.
Example: A grocery store retailer has to decide whether to put bread on sale. To
help determine the impact of this decision, the retailer generates association
rules that show what other products are frequently purchased with bread. He
finds that 60% of the time that bread is sold, pretzels are also sold, and that
70% of the time jelly is also sold. Based on these facts, he tries to capitalize
on the association between bread, pretzels, and jelly by placing some pretzels
and jelly at the end of the aisle where the bread is placed. In addition, he
decides not to place either of these items on sale at the same time.
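Support and confidence, as used in the rule above, can be computed directly from
a set of market-basket transactions. A minimal sketch, with invented
transactions (the 60%/70% figures in the text are the retailer's own numbers,
not derived from this data):

```python
# Toy market-basket transactions, each a set of purchased items.
transactions = [
    {"bread", "pretzels", "jelly"},
    {"bread", "pretzels"},
    {"bread", "jelly"},
    {"bread", "pretzels", "milk"},
    {"milk", "eggs"},
]

def support(itemset, db):
    """Fraction of all transactions containing every item in itemset."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """P(consequent | antecedent): support of both over support of antecedent."""
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"bread", "pretzels"}, transactions))       # 0.6
print(confidence({"bread"}, {"pretzels"}, transactions))  # 0.75
```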
Classification and prediction
Classification:
It predicts categorical class labels
It classifies data (constructs a model) based on the training set and the values
(class labels) in a classifying attribute and uses it in classifying new data
Typical applications:
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
Classification can be defined as the process of
finding a model (or function) that describes and distinguishes data classes or
concepts, for the purpose of being able to use the model to predict the class of
objects whose class label is unknown. The derived model is based on the analysis
of a set of training data (i.e., data objects whose class label is known).
Example: An airport security screening station is used to determine if
passengers are potential terrorists or criminals. To do this, the face of each
passenger is scanned and its basic pattern (distance between eyes, size and
shape of mouth, head, etc.) is identified. This pattern is compared to entries
in a database to see if it matches any patterns that are associated with known
offenders.
A classification model can be represented in various forms, such as:
1) IF-THEN rules:
student(class, "undergraduate") AND concentration(level, "high") ==> class A
student(class, "undergraduate") AND concentration(level, "low") ==> class B
student(class, "postgraduate") ==> class C
2) Decision tree
3) Neural network
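The IF-THEN rules above can be sketched as a small rule-based classifier. The
attribute names and the fall-through default are assumptions for illustration:

```python
# A minimal rule-based classifier mirroring the IF-THEN rules in the text.
def classify(student):
    if student["class"] == "undergraduate" and student["concentration"] == "high":
        return "A"
    if student["class"] == "undergraduate" and student["concentration"] == "low":
        return "B"
    if student["class"] == "postgraduate":
        return "C"
    return "unknown"  # no rule fired (assumed default)

print(classify({"class": "undergraduate", "concentration": "high"}))  # A
print(classify({"class": "postgraduate", "concentration": "low"}))    # C
```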
Prediction: Finding some missing or unavailable data values, rather than class
labels, is referred to as prediction. Although prediction may refer to both data
value prediction and class label prediction, it is usually confined to data
value prediction and thus is distinct from classification. Prediction also
encompasses the identification of distribution trends based on the available
data.
Example: Predicting flooding is a difficult problem. One approach uses monitors
placed at various points in the river. These monitors collect data relevant to
flood prediction: water level, rain amount, time, humidity, etc. The water
levels at a potential flooding point in the river can be predicted based on the
data collected by the sensors upriver from this point. The prediction must be
made with respect to the time the data were collected.
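Numeric prediction of this kind is often done with a least-squares linear fit. A
minimal sketch predicting water level from upriver rainfall; the readings are
invented for illustration:

```python
# Least-squares fit of the line y = a*x + b to paired observations.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    a = num / den
    b = mean_y - a * mean_x
    return a, b

rain_mm = [0, 10, 20, 30, 40]        # rainfall recorded upriver
level_m = [1.0, 1.5, 2.0, 2.5, 3.0]  # water level at the flood point

a, b = fit_line(rain_mm, level_m)
print(a * 25 + b)  # predicted level for 25 mm of rain -> 2.25
```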
Classification vs. Prediction
Classification differs from prediction in that the former is to construct a set of
models (or functions) that describe and distinguish data class or concepts, whereas
the latter is to predict some missing or unavailable, and often numerical, data
values. Their similarity is that they are both tools for prediction: Classification is
used for predicting the class label of data objects and prediction is typically used
for predicting missing numerical data values.
Clustering analysis
Clustering analyzes data objects without consulting a known class label. The
objects are clustered or grouped based on the principle of maximizing the
intraclass similarity and minimizing the interclass similarity. Each cluster that is
formed can be viewed as a class of objects. Clustering can also facilitate
taxonomy formation, that is, the organization of observations into a hierarchy
of classes that group similar events together.
Example: A certain national department store chain creates special catalogs
targeted to various demographic groups based on attributes such as income,
location and physical characteristics of potential customers (age, height, weight,
etc). To determine the target mailings of the various catalogs and to assist in the
creation of new, more specific catalogs, the company performs a clustering of
potential customers based on the determined attribute values. The results of the
clustering exercise are then used by management to create special catalogs and
distribute them to the correct target population based on the cluster for that
catalog.
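A minimal sketch of such clustering using plain k-means on 2-D points (e.g.
customer age/income pairs). The data points, k, and the deterministic
initialization are assumptions for illustration:

```python
# Plain k-means; initial centers are the first k points (deterministic).
def kmeans(points, k, iters=20):
    centers = list(points[:k])
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                            + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers

# Two obvious demographic groups: young/low-income vs older/high-income.
data = [(22, 20), (25, 24), (23, 22), (55, 90), (60, 95), (58, 88)]
print(sorted(kmeans(data, 2)))  # one center per group
```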
Classification vs. Clustering
In general, in classification you have a set of predefined classes and want to
know which class a new object belongs to.
Clustering tries to group a set of objects and find whether there is some
relationship between the objects.
In the context of machine learning, classification is supervised learning and
clustering is unsupervised learning.
Outlier analysis:
A database may contain data objects that do not comply with the general model of
the data. These data objects are outliers. In other words, data objects which do
not fall within any cluster are called outlier data objects. Noisy or
exceptional data are also called outlier data. The analysis of outlier data is
referred to as outlier mining.
Example Outlier analysis may uncover fraudulent usage of credit cards by
detecting purchases of extremely large amounts for a given account number in
comparison to regular charges incurred by the same account. Outlier values may
also be detected with respect to the location and type of purchase, or the purchase
frequency.
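One simple way to flag such extreme charges is a z-score test: mark values more
than a chosen number of standard deviations from the account's mean. The charge
amounts and the threshold of 3 are assumptions for illustration:

```python
# Flag values whose distance from the mean exceeds threshold * std deviations.
def outliers(values, threshold=3.0):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) / std > threshold]

# Regular charges on one account, plus one extremely large purchase.
charges = [42, 38, 55, 47, 40, 51, 44, 39, 46, 43, 48, 41, 5000]
print(outliers(charges))  # [5000]
```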
Data evolution analysis describes and models regularities or trends for objects
whose behavior changes over time.
Example: The results data of the last several years of a college would give an
idea of trends in the quality of the graduates it produces.
Correlation analysis
Correlation analysis is a technique used to measure the association between two
variables. A correlation coefficient (r) is a statistic used for measuring the
strength of a supposed linear association between two variables.
Correlations range from -1.0 to +1.0 in value. A correlation coefficient of 1.0
indicates a perfect positive relationship in which high values of one variable are
related perfectly to high values in the other variable, and conversely, low values on
one variable are perfectly related to low values on the other variable. A correlation
coefficient of 0.0 indicates no relationship between the two variables. That is, one
cannot use the scores on one variable to tell anything about the scores on the
second variable. A correlation coefficient of -1.0 indicates a perfect negative
relationship in which high values of one variable are related perfectly to low values
in the other variables, and conversely, low values in one variable are perfectly
related to high values on the other variable.
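The coefficient described above (Pearson's r) can be computed directly from two
samples. A minimal sketch; the sample values are invented for illustration:

```python
# Pearson correlation coefficient: covariance over the product of the
# standard deviations (here via sums of squared deviations).
def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sy = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfect positive relationship
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # perfect negative relationship
```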
What is the difference between discrimination and classification?
• Discrimination differs from classification in that the former refers to a
comparison of the general features of target class data objects with the general
features of objects from one or a set of contrasting classes, while the latter is the
process of finding a set of models (or functions) that describe and distinguish data
classes or concepts for the purpose of being able to use the model to predict the
class of objects whose class label is unknown. Discrimination and classification are
similar in that they both deal with the analysis of class data objects.
• Characterization differs from clustering in that the former refers to a
summarization of the general characteristics or features of a target class of data
while the latter deals with the analysis of data objects without consulting a known
class label. This pair of tasks is similar in that they both deal with grouping
together objects or data that are related or have high similarity in comparison
to one another.
• Classification differs from prediction in that the former is the process of finding a
set of models (or functions) that describe and distinguish data class or concepts
while the latter predicts missing or unavailable, and often numerical, data values.
This pair of tasks is similar in that they both are tools for Prediction: Classification
is used for predicting the class label of data objects and prediction is typically used
for predicting missing numerical data values.
Are all of the patterns interesting? / What makes a pattern interesting? A pattern is
interesting if, (1) It is easily understood by humans, (2) Valid on new or test data
with some degree of certainty, (3) Potentially useful, and (4) Novel. A pattern is
also interesting if it validates a hypothesis that the user sought to confirm. An
interesting pattern represents knowledge.
Classification of data mining systems There are many data mining systems
available or being developed. Some are specialized systems dedicated to a given
data source or are confined to limited data mining functionalities, other are more
versatile and comprehensive. Data mining systems can be categorized according to
various criteria, among which are the following:
· Classification according to the type of data source mined: this classification
categorizes data mining systems according to the type of data handled, such as
spatial data, multimedia data, time-series data, text data, World Wide Web data,
etc.
· Classification according to the data model drawn on: this classification
categorizes data mining systems based on the data model involved, such as
relational database, object-oriented database, data warehouse, transactional
database, etc.
· Classification according to the kind of knowledge discovered: this
classification categorizes data mining systems based on the kind of knowledge
discovered or data mining functionalities, such as characterization,
discrimination, association, classification, clustering, etc. Some systems tend
to be comprehensive systems offering several data mining functionalities
together.
· Classification according to the mining techniques used:
Data mining systems employ and provide different techniques. This classification
categorizes data mining systems according to the data analysis approach used such
as machine learning, neural networks, genetic algorithms, statistics, visualization,
database oriented or data warehouse-oriented, etc. The classification can also take
into account the degree of user interaction involved in the data mining process such
as query-driven systems, interactive exploratory systems, or autonomous systems.
A comprehensive system would provide a wide variety of data mining techniques
to fit different situations and options, and offer different degrees of user
interaction.
Five primitives for specifying a data mining task
• Task-relevant data: This primitive specifies the data upon which mining is to be
performed. It involves specifying the database and tables or data warehouse
containing the relevant data, conditions for selecting the relevant data, the relevant
attributes or dimensions for exploration, and instructions regarding the ordering or
grouping of the data retrieved.
• Knowledge type to be mined: This primitive specifies the specific data mining
function to be performed, such as characterization, discrimination, association,
classification, clustering, or evolution analysis. As well, the user can be more
specific and provide pattern templates that all discovered patterns must match.
These templates or meta patterns (also called meta rules or meta queries), can be
used to guide the discovery process.
• Background knowledge: This primitive allows users to specify knowledge they
have about the domain to be mined. Such knowledge can be used to guide the
knowledge discovery process and evaluate the patterns that are found. Of the
several kinds of background knowledge, this chapter focuses on concept
hierarchies.
• Pattern interestingness measure: This primitive allows users to specify
functions that are used to separate uninteresting patterns from knowledge and
may be used to guide the mining process, as well as to evaluate the discovered
patterns.
This allows the user to confine the number of uninteresting patterns returned by the
process, as a data mining process may generate a large number of patterns.
Interestingness measures can be specified for such pattern characteristics as
simplicity, certainty, utility and novelty.
• Visualization of discovered patterns: This primitive refers to the form in which
discovered patterns are to be displayed. In order for data mining to be effective in
conveying knowledge to users, data mining systems should be able to display the
discovered patterns in multiple forms such as rules, tables, cross tabs (cross-
tabulations), pie or bar charts, decision trees, cubes or other visual representations.