Data Mining MCA 3 Sem
Data Mining is the process of extracting information from huge sets of data in order to identify patterns, trends, and useful knowledge that allow a business to take data-driven decisions.
In other words, data mining is the process of investigating hidden patterns of information from various perspectives and categorizing them into useful data. This data is collected and assembled in particular areas such as data warehouses and is analysed with efficient data mining algorithms to support decision making.
Data mining is a process used by organizations to extract specific data from huge databases to solve business problems. It primarily turns raw data into useful information.
Relational Database:
A relational database is a collection of multiple data sets formally organized by tables, records, and columns from which data can be accessed in various ways without having to reorganize the database tables. Tables convey and share information, which facilitates data searchability, reporting, and organization.
Relational Databases:
A relational database consists of a set of tables containing either values of entity attributes, or values of attributes from entity relationships. Tables have columns and rows, where columns represent attributes and rows represent tuples. A tuple in a relational table corresponds to either an object or a relationship between objects and is identified by a set of attribute values representing a unique key. The following figure presents some relations, Customer, Items, and Borrow, representing business activity in a video store. These relations are just a subset of what could be the database for the video store and are given as an example.
The most commonly used query language for relational databases is SQL, which allows retrieval and manipulation of the data stored in the tables, as well as the calculation of aggregate functions such as average, sum, min, max, and count. For instance, an SQL query to count the videos grouped by category would be: SELECT category, count(*) FROM Items WHERE type = 'video' GROUP BY category. Data mining algorithms using relational databases can be more versatile than data mining algorithms specifically written for flat files, since they can take advantage of the structure inherent to relational databases. While data mining can benefit from SQL for data selection, transformation, and consolidation, it goes beyond what SQL can provide, such as predicting, comparing, and detecting deviations.
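The query above can be tried directly. Below is a minimal sketch (not part of the original notes) that runs it with Python's built-in sqlite3 module against a small, hypothetical Items table modelled on the video-store example; the table contents and column names are invented for illustration.

```python
# Illustrative only: count videos per category in a hypothetical Items table.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Items (item_id INTEGER, type TEXT, category TEXT)")
cur.executemany(
    "INSERT INTO Items VALUES (?, ?, ?)",
    [(1, "video", "action"), (2, "video", "drama"),
     (3, "video", "action"), (4, "game", "puzzle")],
)

# Note the quoted string literal 'video' and the grouped column in the SELECT list.
for category, n in cur.execute(
    "SELECT category, COUNT(*) FROM Items WHERE type = 'video' GROUP BY category"
):
    print(category, n)

conn.close()
```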
Data warehouses:
A Data Warehouse is the technology that collects the data from various sources within the organization to
provide meaningful business insights. The huge amount of data comes from multiple places such as
Marketing and Finance. The extracted data is utilized for analytical purposes and helps in decision-making
for a business organization. The data warehouse is designed for the analysis of data rather than transaction
processing.
The data cube structure that stores the primitive or lowest level of information is called a base cuboid. Its corresponding higher-level multidimensional (cube) structures are called (non-base) cuboids. A base cuboid together with all of its corresponding higher-level cuboids forms a data cube. By providing multidimensional data views and the precomputation of summarized data, data warehouse systems are well suited for OnLine Analytical Processing, or OLAP. OLAP operations make use of background knowledge regarding the domain of the data being studied in order to allow the presentation of data at different levels of abstraction. Such operations accommodate different user viewpoints. Examples of OLAP operations include drill-down and roll-up, which allow the user to view the data at differing degrees of summarization, as illustrated in the figure above.
Data Repositories:
The data repository generally refers to a destination for data storage. However, many IT professionals use the term more specifically to refer to a particular kind of setup within an IT structure, for example, a group of databases where an organization has kept various kinds of information.
Object-Relational Database:
A combination of an object-oriented database model and relational database model is called an object-relational
model. It supports Classes, Objects, Inheritance, etc.
One of the primary objectives of the Object-relational data model is to close the gap between the Relational database and
the object-oriented model practices frequently utilized in many programming languages, for example, C++, Java, C#, and
so on.
Transactional Database:
A transactional database refers to a database management system (DBMS) that has the ability to undo a database transaction if it is not performed appropriately. Although this was once a unique capability, today most relational database systems support transactional database activities.
• An objected-oriented database is designed based on the object-oriented programming paradigm where data are a large
number of objects organized into classes and class hierarchies. Each entity in the database is considered as an object.
The object contains a set of variables that describe the object, a set of messages that the object can use to
communicate with other objects or with the rest of the database system and a set of methods where each method
holds the code to implement a message.
• A spatial database contains spatial-related data, which may be represented in the form of raster or vector data. Raster data consists of n-dimensional bit maps or pixel maps, and vector data are represented by lines, points, polygons, or other kinds of primitives. Some examples of spatial databases include geographical (map) databases, VLSI chip designs, and medical and satellite image databases.
• Time-Series Databases: Time-series databases contain time-related data such as stock market data or logged activities. These databases usually have a continuous flow of new data coming in, which sometimes creates the need for challenging real-time analysis. Data mining in such databases commonly includes the study of trends and correlations between the evolution of different variables, as well as the prediction of trends and movements of the variables in time.
• A text database is a database that contains text documents or other word descriptions in the form of long sentences or
paragraphs, such as product specifications, error or bug reports, warning messages, summary reports, notes, or other
documents.
• A multimedia database stores images, audio, and video data, and is used in applications such as picture content-based retrieval, voice-mail systems, video-on-demand systems, the World Wide Web, and speech-based user interfaces.
• The World-Wide Web provides rich, world-wide, on-line information services, where data objects are linked
together to facilitate interactive access. Some examples of distributed information services associated with the
World-Wide Web include America Online, Yahoo!, AltaVista, and Prodigy.
The major reason that data mining has attracted a great deal of attention in the information industry in recent years is due to the wide availability of huge amounts of data and the imminent need for turning such data into useful
information and knowledge. The information and knowledge gained can be used for applications ranging from
business management, production control, and market analysis, to engineering design and science exploration.
The evolution of database technology
Data mining functionalities/Data mining tasks: what kinds of patterns can be mined?
Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. In general,
data mining tasks can be classified into two categories:
• Descriptive
• Predictive
Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks
perform inference on the current data in order to make predictions.
Describe data mining functionalities, and the kinds of patterns they can discover (or) Define each of the following
data mining functionalities: characterization, discrimination, association and correlation analysis, classification,
prediction, clustering, and evolution analysis. Give examples of each data mining functionality, using a real-life
database that you are familiar with.
Data can be associated with classes or concepts. It describes a given set of data in a concise and summarative
manner, presenting interesting general properties of the data. These descriptions can be derived via
1. data characterization, by summarizing the data of the class under study (often called the target class)
2. data discrimination, by comparison of the target class with one or a set of comparative classes
3. both data characterization and discrimination
Data characterization
A data mining system should be able to produce a description summarizing the characteristics of a student who has
obtained more than 75% in every semester; the result could be a general profile of the student.
Data Discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes.
Example: The general features of students with high GPAs may be compared with the general features of students with low GPAs. The resulting description could be a general comparative profile of the students, such as: 75% of the students with high GPAs are fourth-year computing science students, while 65% of the students with low GPAs are not. The output of data characterization can be presented in various forms. Examples include pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs. The resulting descriptions can also be presented as generalized relations, or in rule form called characteristic rules. Discrimination descriptions expressed in rule form are referred to as discriminant rules.
Classification:
It classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data.
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
Classification can be defined as the process of finding a model (or function) that describes and distinguishes data
classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label
is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label
is known).
Example:
An airport security screening station is used to determine if passengers are potential terrorists or criminals. To do this, the face of each passenger is scanned and its basic pattern (distance between eyes, size and shape of mouth, head, etc.) is identified. This pattern is compared to entries in a database to see if it matches any patterns that are associated with known offenders.
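To make the classification step concrete, the sketch below trains a decision tree with scikit-learn, assuming it is installed. The training records, features, and class labels are hypothetical and merely stand in for the labelled training data described above.

```python
# Illustrative classification sketch: learn a model from labelled records,
# then predict the class label of a new, unlabelled record.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: [age, income_in_thousands] -> credit decision
X_train = [[25, 30], [40, 80], [35, 60], [22, 20], [50, 120], [28, 25]]
y_train = ["reject", "approve", "approve", "reject", "approve", "reject"]

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Predict the class label of a previously unseen applicant.
print(clf.predict([[30, 70]]))   # e.g. ['approve']
```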
Prediction:
Finding some missing or unavailable data values rather than class labels is referred to as prediction. Although prediction may refer to both data value prediction and class label prediction, it is usually confined to data value prediction and thus is distinct from classification. Prediction also encompasses the identification of distribution trends based on the available data.
Example:
Predicting flooding is a difficult problem. One approach uses monitors placed at various points along the river. These monitors collect data relevant to flood prediction: water level, rainfall amount, time, humidity, etc. The water levels at a potential flooding point in the river can be predicted based on the data collected by the sensors upriver from this point. The prediction must be made with respect to the time the data were collected.
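As a rough illustration of numeric prediction, the sketch below fits a linear regression (assuming scikit-learn is available) on invented upriver sensor readings and predicts a downstream water level. The feature names and values are assumptions, not real flood data.

```python
# Illustrative prediction sketch: estimate a continuous value from sensor readings.
from sklearn.linear_model import LinearRegression

# Hypothetical upriver sensor data: [upriver_level_m, rainfall_mm, humidity_pct]
X_train = [[2.1, 10, 60], [3.4, 45, 85], [2.8, 25, 70], [4.0, 60, 90], [1.9, 5, 55]]
y_train = [2.3, 3.9, 3.0, 4.6, 2.0]   # observed level at the downstream point

model = LinearRegression()
model.fit(X_train, y_train)

# Predict the downstream level for a new set of readings.
print(model.predict([[3.0, 30, 75]]))
```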
Clustering analysis
Clustering analyzes data objects without consulting a known class label. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. Each cluster that is formed can be viewed as a class of objects.
In general, in classification you have a set of predefined classes and want to know which class a new object
belongs to.
Clustering tries to group a set of objects and find whether there is some relationship between the objects.
In the context of machine learning, classification is supervised learning and clustering is unsupervised
learning.
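The following minimal sketch (assuming scikit-learn) illustrates the idea: k-means groups unlabelled, hypothetical customer records into clusters without being given any class labels.

```python
# Illustrative clustering sketch: group unlabelled points by similarity.
from sklearn.cluster import KMeans

# Hypothetical customer data: [annual_spend, visits_per_month]
X = [[200, 2], [220, 3], [210, 2], [900, 12], [950, 14], [880, 11]]

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)               # e.g. [0 0 0 1 1 1] -- no class labels were supplied
print(km.cluster_centers_)  # the centre of each discovered group
```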
Outlier analysis: A database may contain data objects that do not comply with the general model of the data. These data objects are outliers. In other words, the data objects that do not fall within any cluster are called outlier data objects. Noisy or exceptional data are also called outlier data. The analysis of outlier data is referred to as outlier mining.
Example: Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of extremely large amounts for a given account number in comparison to regular charges incurred by the same account. Outlier values may also be detected with respect to the location and type of purchase, or the purchase frequency.
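A very simple way to flag such values is a z-score style rule. The sketch below is illustrative only, with invented charge amounts and an assumed two-standard-deviation threshold; it reports charges that deviate strongly from an account's usual spending.

```python
# Illustrative outlier-analysis sketch: flag unusually large charges.
import statistics

charges = [42.0, 55.5, 38.9, 61.2, 47.3, 49.9, 2500.0]   # hypothetical card charges
mean = statistics.mean(charges)
stdev = statistics.stdev(charges)

# A charge more than 2 standard deviations from the mean is treated as an outlier.
outliers = [c for c in charges if abs(c - mean) > 2 * stdev]
print(outliers)   # only the unusually large charge is reported, e.g. [2500.0]
```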
Data evolution analysis describes and models regularities or trends for objects whose behaviour changes over time.
Example: The results data of a college over the last several years would give an idea of the quality of the graduates it produces.
There are many data mining systems available or being developed. Some are specialized systems dedicated to a given data source or are confined to limited data mining functionalities; others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria; among others, the classifications are the following:
Classification according to the type of data source mined: this classification categorizes data mining systems
according to the type of data handled such as spatial data, multimedia data, time-series data, text data, World Wide
Web, etc.
Classification according to the data model drawn on: this classification categorizes data mining systems based on the data model involved, such as relational database, object-oriented database, data warehouse, transactional, etc.
Classification according to the kind of knowledge discovered: this classification categorizes data mining systems based on the kind of knowledge discovered or data mining functionalities, such as characterization, discrimination,
association, classification, clustering, etc. Some systems tend to be comprehensive systems offering several data
mining functionalities together.
Classification according to mining techniques used: Data mining systems employ and provide different techniques.
This classification categorizes data mining systems according to the data analysis approach used such as machine
learning, neural networks, genetic algorithms, statistics, visualization, database-oriented or data warehouse-oriented, etc. The classification can also take into account the degree of user interaction involved in the data mining
process such as query-driven systems, interactive exploratory systems, or autonomous systems. A comprehensive
system would provide a wide variety of data mining techniques to fit different situations and options, and offer
different degrees of user interaction.
1. Mining methodology and user-interaction issues: Mining different kinds of knowledge in databases: Since
different users can be interested in different kinds of knowledge, data mining should cover a wide spectrum of data
analysis and knowledge discovery tasks, including data characterization, discrimination, association, classification,
clustering, trend and deviation analysis, and similarity analysis. These tasks may use the same database in different
ways and require the development of numerous data mining techniques.
a. Interactive mining of knowledge at multiple levels of abstraction: Since it is difficult to know exactly what can
be discovered within a database, the data mining process should be interactive.
b. Incorporation of background knowledge: Background knowledge, or information regarding the domain under study, may be used to guide the discovery of patterns. Domain knowledge related to databases, such as integrity constraints and deduction rules, can help focus and speed up a data mining process, or judge the interestingness of discovered patterns.
c. Data mining query languages and ad-hoc data mining: Just as relational query languages (such as SQL) allow users to pose ad-hoc queries for data retrieval, data mining query languages are required to allow users to describe ad-hoc data mining tasks.
d. Presentation and visualization of data mining results: Discovered knowledge should be expressed in high-level languages or visual representations, so that the knowledge can be easily understood and directly usable by humans.
e. Handling outlier or incomplete data: The data stored in a database may reflect outliers: noise, exceptional cases, or incomplete data objects. These objects may confuse the analysis process, causing overfitting of the data to the knowledge model constructed. As a result, the accuracy of the discovered patterns can be poor. Data cleaning methods and data analysis methods that can handle outliers are required.
f. Pattern evaluation: refers to interestingness of pattern: A data mining system can uncover thousands of
patterns. Many of the patterns discovered may be uninteresting to the given user, representing common
knowledge or lacking novelty. Several challenges remain regarding the development of techniques to assess
the interestingness of discovered patterns
2. Performance issues. These include efficiency, scalability, and parallelization of data mining algorithms.
a. Efficiency and scalability of data mining algorithms: To effectively extract information from a huge amount
of data in databases, data mining algorithms must be efficient and scalable.
b. Parallel, distributed, and incremental updating algorithms: Such algorithms divide the data into partitions,
which are processed in parallel. The results from the partitions are then merged.
3. Issues relating to the diversity of database types:
Handling of relational and complex types of data: Since relational databases and data warehouses are widely used, the development of efficient and effective data mining systems for such data is important.
Mining information from heterogeneous databases and global information systems: Local and wide-area computer
networks (such as the Internet) connect many sources of data, forming huge, distributed, and heterogeneous
databases. The discovery of knowledge from different sources of structured, semi-structured, or unstructured data
with diverse data semantics poses great challenges to data mining
Unit III
Data preprocessing
Data preprocessing describes any type of processing performed on raw data to prepare it for another processing
procedure. Commonly used as a preliminary data mining practice, data preprocessing transforms the data into a
format that will be more easily and effectively processed for the purpose of the user.
Why Data Preprocessing?
Data in the real world is dirty: it can be incomplete, noisy, and inconsistent. Such data needs to be preprocessed in order to improve the quality of the data and the quality of the mining results.
If there is no quality data, then there will be no quality mining results; quality decisions must be based on quality data. If much irrelevant and redundant information is present, or the data is noisy and unreliable, then knowledge discovery during the training phase becomes more difficult.
Incomplete data: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
e.g., occupation=“ ”.
Noisy data: containing errors or outliers. e.g., Salary=“-10”
Inconsistent data: containing discrepancies in codes or names. e.g., Age=“42” Birthday=“03/07/1997”
Incomplete data may come from
“Not applicable” data value when collected
Different considerations between the time when the data was collected and when it is analyzed.
Human/hardware/software problems
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially for numerical data.
Forms of Data Preprocessing
Data Cleaning
Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct
inconsistencies in the data.
Missing Values
The various methods for handling the problem of missing values in data tuples include:
a. Ignoring the tuple: This is usually done when the class label is missing (assuming the mining task involves
classification or description). This method is not very effective unless the tuple contains several attributes
with missing values. It is especially poor when the percentage of missing values per attribute varies
considerably.
b. Manually filling in the missing value: In general, this approach is time-consuming and may not be a
reasonable task for large data sets with many missing values, especially when the value to be filled in is not
easily determined.
c. Using a global constant to fill in the missing value: Replace all missing attribute values by the same
constant, such as a label like “Unknown,” or −∞. If missing values are replaced by, say, “Unknown,” then
the mining program may mistakenly think that they form an interesting concept, since they all have a value
in common — that of “Unknown.” Hence, although this method is simple, it is not recommended.
d. Using the attribute mean for quantitative (numeric) values or attribute mode for categorical (nominal)
values, for all samples belonging to the same class as the given tuple: For example, if classifying
customers according to credit risk, replace the missing value with the average income value for customers
in the same credit risk category as that of the given tuple.
e. Using the most probable value to fill in the missing value: This may be determined with regression,
inference-based tools using Bayesian formalism, or decision tree induction. For example, using the other
customer attributes in your data set, you may construct a decision tree to predict the missing values for
income.
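Two of these strategies can be sketched with pandas (assuming it is available). The credit-risk table below is hypothetical and only illustrates filling with a global constant and with the per-class attribute mean.

```python
# Illustrative sketch of strategies (c) and (d) for missing values.
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [60.0, None, 30.0, None, 64.0],
})

# (c) Fill with a global constant (here a sentinel value).
df["income_const"] = df["income"].fillna(-1)

# (d) Fill with the mean income of customers in the same credit-risk category.
df["income_class_mean"] = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```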
Noisy data
Noise is a random error or variance in a measured variable. Data smoothing techniques are used for removing such noisy data.
Several data smoothing techniques:
1. Binning methods: Binning methods smooth a sorted data value by consulting its “neighborhood”, that is, the values around it. The sorted values are distributed into a number of “buckets”, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.
In this technique,
1. The data are first sorted.
2. The sorted list is partitioned into equi-depth bins.
3. Then one can smooth by bin means, by bin medians, or by bin boundaries:
a. Smoothing by bin means: each value in the bin is replaced by the mean value of the bin.
b. Smoothing by bin medians: each value in the bin is replaced by the bin median.
c. Smoothing by bin boundaries: the min and max values of a bin are identified as the bin boundaries, and each bin value is replaced by the closest boundary value.
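The sketch below (plain Python, with a small sorted price list chosen for illustration) applies smoothing by bin means and by bin boundaries using equi-depth bins of size 3.

```python
# Illustrative binning sketch: equi-depth bins, then two smoothing variants.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

smoothed_means, smoothed_bounds = [], []
for b in bins:
    mean = sum(b) / len(b)
    smoothed_means.append([round(mean, 1)] * len(b))    # every value -> bin mean
    lo, hi = b[0], b[-1]                                 # bin boundaries
    smoothed_bounds.append([lo if v - lo <= hi - v else hi for v in b])

print(smoothed_means)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smoothed_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```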
Data Integration and Transformation
Data Integration: It combines data from multiple sources into a coherent store. There are number of issues to
consider during data integration.
Issues:
Entity identification problem: identifying entities in one data source that correspond to entities in another source. For example, customer_id in one database and customer_no in another database may refer to the same entity.
Detecting and resolving data value conflicts: attribute values from different sources can differ due to different representations or different scales, e.g., metric vs. British units.
Redundancy: this is another issue encountered while performing data integration. Redundancy can occur due to the following reasons:
Object identification: the same attribute may have different names in different databases.
Data Reduction
Data reduction strategies include:
1. Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
2. Attribute subset selection, where irrelevant, weakly relevant or redundant attributes or dimensions may be
detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size. Examples: Wavelet
Transforms Principal Components Analysis
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations
such as parametric models (which need store only the model parameters instead of the actual data) or
nonparametric methods such as clustering, sampling, and the use of histograms.
5. Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies.
Data cube aggregation: reduce the data to the concept level needed in the analysis. Queries regarding aggregated information should be answered using a data cube when possible. Data cubes store multidimensional aggregated information. The following figure shows a data cube for multidimensional analysis of sales data with respect to annual sales per item type for each branch.
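The idea of data cube aggregation can be sketched with a pandas group-by (assuming pandas is available). The sales records below are invented; the aggregation keeps only the branch, item type, and year dimensions.

```python
# Illustrative data cube aggregation: collapse transaction-level detail
# to annual sales per item type for each branch.
import pandas as pd

sales = pd.DataFrame({
    "branch":    ["A", "A", "B", "B", "A"],
    "item_type": ["video", "video", "game", "video", "game"],
    "year":      [2023, 2023, 2023, 2024, 2024],
    "amount":    [100, 150, 80, 120, 60],
})

cube_view = sales.groupby(["branch", "item_type", "year"], as_index=False)["amount"].sum()
print(cube_view)
```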
Data Discretization
Top-down Discretization -
If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals, it is called top-down discretization or splitting.
Bottom-up Discretization -
If the process instead starts by considering all of the continuous values as potential split points and removes some by merging neighbourhood values to form intervals, repeating this recursively on the resulting intervals, it is called bottom-up discretization or merging.
Concept Hierarchies
Typical Methods of Discretization and Concept Hierarchy Generation for Numerical Data
1] Binning
2] Histogram Analysis
It is an unsupervised discretization technique because histogram analysis does not use class
information.
Histograms partition the values for an attribute into disjoint ranges called buckets.
Histograms are further classified into
o Equal-width histogram
o Equal frequency histogram
The histogram analysis algorithm can be applied recursively to each partition to automatically
generate a multilevel concept hierarchy, with the procedure terminating once a pre-specified
number of concept levels has been reached.
3] Cluster Analysis
4] Entropy-Based Discretization
It is a supervised, top-down (splitting) technique, in that it uses class information to select split points.
5] Interval Merging by Chi-Square Analysis (ChiMerge)
It is a bottom-up method: find the best neighboring intervals and merge them to form larger intervals recursively.
The method is supervised in that it uses class information.
ChiMerge treats intervals as discrete categories.
The basic notion is that for accurate discretization, the relative class frequencies should be fairly consistent within an interval.
Therefore, if two adjacent intervals have a very similar distribution of classes, then the intervals can be merged.
Otherwise, they should remain separate.
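For the histogram-based techniques above, the sketch below (assuming numpy) contrasts equal-width and equal-frequency (equi-depth) bucketing on a small, invented value list.

```python
# Illustrative discretization sketch: equal-width vs. equal-frequency buckets.
import numpy as np

values = np.array([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])

# Equal-width: split the full range into 3 buckets of the same width.
width_edges = np.linspace(values.min(), values.max(), num=4)
width_buckets = np.digitize(values, width_edges[1:-1])

# Equal-frequency: each bucket holds (roughly) the same number of values.
freq_edges = np.quantile(values, [1/3, 2/3])
freq_buckets = np.digitize(values, freq_edges)

print(width_buckets)  # most values fall into the lowest-range bucket
print(freq_buckets)   # values are spread evenly, four per bucket
```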
Architecture of a typical data mining system
Data Source:
Different processes:
Before passing the data to the database or data warehouse server, the data must be cleaned, integrated, and selected. As the information comes from various sources and in different formats, it cannot be used directly for the data mining procedure, because the data may not be complete and accurate. So the data first needs to be cleaned and unified. More information than needed will be collected from various data sources, and only the data of interest has to be selected and passed to the server. These procedures are not as easy as they sound; several methods may be performed on the data as part of selection, integration, and cleaning.
The database or data warehouse server contains the actual data that is ready to be processed. Hence, the server is responsible for retrieving the relevant data, based on the user's data mining request.
The data mining engine is a major component of any data mining system. It contains several modules for
operating data mining tasks, including association, characterization, classification, clustering, prediction,
time-series analysis, etc.
In other words, the data mining engine is the core of the data mining architecture. It comprises the instruments and software used to obtain insights and knowledge from the data collected from various data sources and stored within the data warehouse.
The Pattern evaluation module is primarily responsible for measuring how interesting a discovered pattern is by using a threshold value. It collaborates with the data mining engine to focus the search on interesting patterns. This segment commonly employs interestingness measures that cooperate with the data mining modules to focus the search towards interesting patterns, and it might utilize an interestingness threshold to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining techniques used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining procedure so as to confine the search to only interesting patterns.
Graphical User Interface:
The graphical user interface (GUI) module communicates between the data mining system and the user.
This module helps the user to easily and efficiently use the system without knowing the complexity of the
process. This module cooperates with the data mining system when the user specifies a query or a task and
displays the results.
Knowledge Base:
The knowledge base is helpful in the entire process of data mining. It might be used to guide the search or to evaluate the interestingness of the resulting patterns. The knowledge base may even contain user views and data from user experiences that might be helpful in the data mining process. The data mining engine may receive inputs from the knowledge base to make the results more accurate and reliable. The pattern evaluation module regularly interacts with the knowledge base to get inputs and also to update it.
Misconception: Data mining systems can autonomously dig out all of the valuable knowledge from a given large
database, without human intervention.
If there was no user intervention then the system would uncover a large set of patterns that may even surpass the
size of the database. Hence, user interference is required.
This user communication with the system is provided by using a set of data mining primitives.
Data Mining Primitives
Data mining primitives define a data mining task, which can be specified in the form of a data mining query.
It is important to specify the knowledge to be mined, as this determines the data mining function to be
performed.
Kinds of knowledge include concept description, association, classification, prediction and clustering.
Users can also provide pattern templates, also called metapatterns, metarules, or metaqueries. Example: a user studying the buying habits of AllElectronics customers may choose to mine association rules of the form:
P(X: customer, W) ^ Q(X, Y) => buys(X, Z)
Metarules such as the following can be specified:
age(X, “30...39”) ^ income(X, “40K...49K”) => buys(X, “VCR”) [2.2%, 60%]
occupation(X, “student”) ^ age(X, “20...29”) => buys(X, “computer”) [1.4%, 70%]
Background knowledge
Defines a sequence of mappings from a set of low-level concepts to higher-level (more general) concepts.
Allows data to be mined at multiple levels of abstraction.
These allow users to view data from different perspectives, allowing further insight into the relationships.
Example (location)
Data mining language must be designed to facilitate flexible and effective knowledge discovery.
Having a query language for data mining may help standardize the development of platforms for data
mining systems.
But designing such a language is challenging because data mining covers a wide spectrum of tasks and each task has different requirements.
Hence, the design of a language requires a deep understanding of the limitations and underlying mechanisms of the various kinds of tasks. So, how would you design an efficient query language?
Based on the primitives discussed earlier.
DMQL allows mining of different kinds of knowledge from relational databases and data warehouses at
multiple levels of abstraction.
Analytical characterization
Analytical characterization is used to help identify weakly relevant or irrelevant attributes. We can exclude these unwanted attributes when preparing our data for mining.
Why Analytical Characterization?
Analytical characterization is a very important activity in data mining for the following reasons:
Due to the limitations of OLAP tools in handling complex objects.
Due to the lack of automated generalization, we must explicitly tell the system which attributes are irrelevant and must be removed, and similarly, we must explicitly tell the system which attributes are relevant and must be included in the class characterization.
UNIT- III
What Is Association Mining?
Association rule mining – Finding frequent patterns, associations, correlations, or causal structures among
sets of items or objects in transaction databases, relational databases, and other information repositories.
Applications – Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Rule form: antecedent => consequent [support, confidence]
– computer => antivirus_software [support = 2%, confidence = 60%]
– buys(x, “computer”) => buys(x, “antivirus software”) [0.5%, 60%]
Association Rule: Basic Concepts
Given a database of transactions, where each transaction is a list of items (purchased by a customer in a visit):
Find all rules that correlate the presence of one set of items with that of another set of items.
Find frequent patterns.
A typical example of frequent itemset mining is market basket analysis.
Association rule performance measures:
• Confidence
• Support
• Minimum support threshold
• Minimum confidence threshold
Market Basket Analysis
Consider shopping baskets: each item has a Boolean variable representing the presence or absence of that item, and each basket can be represented by a Boolean vector of values assigned to these variables. Patterns can be identified from these Boolean vectors and represented by association rules.
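The sketch below (illustrative, with invented baskets) represents each basket as a Boolean vector and computes support and confidence for the rule computer => antivirus_software discussed earlier.

```python
# Illustrative support/confidence computation over Boolean basket vectors.
items = ["computer", "antivirus_software", "printer"]
baskets = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
]

n = len(baskets)
comp, anti = items.index("computer"), items.index("antivirus_software")

both = sum(1 for b in baskets if b[comp] and b[anti])
comp_only = sum(1 for b in baskets if b[comp])

support = both / n              # P(computer AND antivirus_software)
confidence = both / comp_only   # P(antivirus_software | computer)
print(f"support={support:.0%}, confidence={confidence:.0%}")   # support=60%, confidence=75%
```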
Knowledge type constraints: These specify the type of knowledge to be mined, such as association
or correlation.
Data constraints: These specify the set of task-relevant data.
Dimension/level constraints: These specify the desired dimensions (or attributes) of the data, or
levels of the concept hierarchies, to be used in mining.
Interestingness constraints: These specify thresholds on statistical measures of rule
interestingness, such as support, confidence, and correlation.
Rule constraints: These specify the form of rules to be mined. Such constraints may be expressed as
metarules (rule templates), as the maximum or minimum number of predicates that can occur in the rule
antecedent or consequent, or as relationships among attributes, attribute values, and/or aggregates.
The above constraints can be specified using a high-level declarative data mining query language and
user interface.
The first four of the above types of constraints have already been addressed in earlier parts of this book
and chapter. In this section, we discuss the use of rule constraints to focus the mining task. This form of
constraint-based mining allows users to describe the rules that they would like to uncover, thereby
making the data mining process more effective. In addition, a sophisticated mining query optimizer can
be used to exploit the constraints specified by the user, thereby making the mining process more
efficient. Constraint-based mining encourages interactive exploratory mining and analysis.
“How are metarules useful?” Metarules allow users to specify the syntactic form of rules that they are
interested in mining. The rule forms can be used as constraints to help improve the efficiency of the
mining process. Metarules may be based on the analyst’s experience, expectations, or intuition regarding
the data or may be automatically generated based on the database schema.
Rule constraints specify expected set/subset relationships of the variables in the mined rules, constant
initiation of variables, and aggregate functions. Users typically employ their knowledge of the application
or data to specify rule constraints for the mining task. These rule constraints may be used together with,
or as an alternative to, meta rule-guided mining. In this section, we examine rule constraints as to how
they can be used to make the mining process more efficient. Let’s study an example where rule
constraints are used to mine hybrid-dimensional association rules.
UNIT II
Data Warehouse and OLAP Technology for Data Mining
Data Warehouse Introduction: A data warehouse is a collection of data marts representing historical data from
different operations in the company. This data is stored in a structure optimized for querying and data analysis as
a data warehouse. Table design, dimensions and organization should be consistent throughout a data warehouse
so that reports or queries across the data warehouse are consistent. A data warehouse can also be viewed as a
database for historical data from different functions within a company.
The term Data Warehouse was coined by Bill Inmon in 1990, which he defined in the following way: "A warehouse
is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's
decision making process". He defined the terms in the sentence as follows:
Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing
operations. Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a
coherent whole.
Time-variant: All data in the data warehouse is identified with a particular time period.
Non-volatile: Data is stable in a data warehouse. More data is added but data is never removed.
This enables management to gain a consistent picture of the business. It is a single, complete and consistent store
of data obtained from a variety of different sources made available to end users in what they can understand and
use in a business context. It can be
Data warehouses are designed to perform well with aggregate queries running on large amounts of data.
The structure of data warehouses is easier for end users to navigate, understand and query against unlike
the relational databases primarily designed to handle lots of transactions.
Data warehouses enable queries that cut across different segments of a company's operation. E.g.
production data could be compared against inventory data even if they were originally stored in different
databases with different structures.
Queries that would be complex in very normalized databases could be easier to build and maintain in data
warehouses, decreasing the workload on transaction systems.
Data warehousing is an efficient way to manage and report on data that is from a variety of sources, non
uniform and scattered throughout a company.
Data warehousing is an efficient way to manage demand for lots of information from lots of users.
Data warehousing provides the capability to analyze large amounts of historical data for nuggets of wisdom
that can provide an organization with competitive advantage.
Operational and Informational Data
• Operational Data: focuses on transactional functions such as bank card withdrawals and deposits; detailed, updateable, reflects current data.
• Informational Data: focuses on providing answers to problems posed by decision makers; summarized, non-updateable.
ODS is an architecture concept to support day-to-day operational decision support and contains current
value data propagated from operational applications
ODS is subject-oriented, similar to a classic definition of a Data warehouse
ODS is integrated
Differences between Operational Database Systems and Data Warehouses
Features of OLTP and OLAP
1. Users and system orientation: An OLTP system is customer-oriented and is used for transaction and query
processing by clerks, clients, and information technology professionals. An OLAP system is market-oriented and is
used for data analysis by knowledge workers, including managers, executives, and analysts.
2. Data contents: An OLTP system manages current data that, typically, are too detailed to be easily used for
decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization
and aggregation, and stores and manages information at different levels of granularity. These features make the
data easier for use in informed decision making.
3. Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an application
oriented database design. An OLAP system typically adopts either a star or snowflake model and a subject -
oriented database design.
4. View: An OLTP system focuses mainly on the current data within an enterprise or department, without referring
to historical data or data in different organizations. In contrast, an OLAP system often spans multiple versions of a
database schema. OLAP systems also deal with information that originates from different organizations,
integrating information from many data stores. Because of their huge volume, OLAP data are stored on multiple
storage media.
5. Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a
system requires concurrency control and recovery mechanisms. However, accesses to OLAP systems are mostly
read-only operations although many could be complex queries.
From the architecture point of view, there are three data warehouse models: the enterprise warehouse,
the data mart, and the virtual warehouse.
3. Modeling: Modelling is a significant stage that involves designing the warehouse schema and views.
This may contain using a modeling tool if the data warehouses are sophisticated.
4. Physical modeling: For the data warehouses to perform efficiently, physical modeling is needed. This
contains designing the physical data warehouse organization, data placement, data partitioning, deciding
on access techniques, and indexing.
5. Sources: The information for the data warehouse is likely to come from several data sources. This step
contains identifying and connecting the sources using the gateway, ODBC drives, or another wrapper.
6. ETL: The data from the source systems will need to go through an ETL phase. The process of designing and implementing the ETL phase may involve selecting a suitable ETL tool vendor and purchasing and implementing the tools. This may include customizing the tool to suit the needs of the enterprise.
7. Populate the data warehouses: Once the ETL tools have been agreed upon, testing the tools will be
needed, perhaps using a staging area. Once everything is working adequately, the ETL tools may be used
in populating the warehouses given the schema and view definition.
8. User applications: For the data warehouses to be helpful, there must be end-user applications. This step
contains designing and implementing applications required by the end-users.
9. Roll-out the warehouses and applications: Once the data warehouse has been populated and the end-
client applications tested, the warehouse system and the operations may be rolled out for the user's
community to use.
Implementation Guidelines
3. Senior management support: A data warehouse project must be fully supported by senior management. Given the resource-intensive nature of such projects and the time they can take to implement, a warehouse project calls for a sustained commitment from senior management.
4. Ensure quality: Only data that has been cleaned and is of a quality accepted by the organization should be loaded into the data warehouse.
5. Corporate strategy: A data warehouse project must be suitable for corporate strategies and business
goals. The purpose of the project must be defined before the beginning of the projects.
6. Business plan: The financial costs (hardware, software, and peopleware), expected benefits, and a project plan for a data warehouse project must be clearly outlined and understood by all stakeholders. Without such understanding, rumors about expenditure and benefits can become the only sources of information, subverting the project.
7. Training: Data warehouses projects must not overlook data warehouses training requirements. For a
data warehouses project to be successful, the customers must be trained to use the warehouses and to
understand its capabilities.
8. Adaptability: The project should build in flexibility so that changes may be made to the data warehouses
if and when required. Like any system, a data warehouse will require to change, as the needs of an enterprise
change.
9. Joint management: The project must be handled by both IT and business professionals in the enterprise. To
ensure that proper communication with the stakeholder and which the project is the target for assisting the
enterprise's business, the business professional must be involved in the project along with technical
professionals.
Data is grouped or combined in multidimensional matrices called data cubes. The data cube method has a few alternative names or variants, such as "multidimensional databases," "materialized views," and "OLAP (On-Line Analytical Processing)."
The general idea of this approach is to materialize certain expensive computations that are frequently
inquired.
A data cube is created from a subset of attributes in the database. Specific attributes are chosen to be measure attributes, i.e., the attributes whose values are of interest. Other attributes are selected as dimensions or functional attributes. The measure attributes are aggregated according to the dimensions.
For example, XYZ may create a sales data warehouse to keep records of the store's sales for the dimensions
time, item, branch, and location. These dimensions enable the store to keep track of things like monthly
sales of items, and the branches and locations at which the items were sold. Each dimension may have a table associated with it, known as a dimension table, which describes the dimension. For example, a dimension table for items may contain the attributes item_name, brand, and type.
Data cube method is an interesting technique with many applications. Data cubes could be sparse in many
cases because not every cell in each dimension may have corresponding data in the database.
If a query contains constants at even lower levels than those provided in a data cube, it is not clear how to
make the best use of the precomputed results stored in the data cube.
The multidimensional data model views data in the form of a data cube. OLAP tools are based on the multidimensional data model. Data cubes usually model n-dimensional data.
3-Dimensional Cuboids
Suppose we would like to view the sales data with a third dimension. For example, suppose we would like to view the data according to time and item, as well as location, for the cities Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars sold (in thousands). These 3-D data are shown in the table, where the 3-D data of the table are represented as a series of 2-D tables.
In data warehousing, the data cubes are n-dimensional. The cuboid which holds the lowest level of summarization is called a base cuboid. For example, the 4-D cuboid in the figure is the base cuboid for the given time, item, location, and supplier dimensions.
The figure shows a 4-D data cube representation of sales data, according to the dimensions time, item, location, and supplier. The measure displayed is dollars sold (in thousands).
The topmost 0-D cuboid, which holds the highest level of summarization, is known as the apex cuboid. In this example, this is the total sales, or dollars sold, summarized over all four dimensions.
The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids creating a 4-D data cube for the dimensions time, item, location, and supplier. Each cuboid represents a different degree of summarization.
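The lattice can be enumerated directly: the sketch below lists every cuboid of the 4-D cube for the dimensions named above, from the apex (0-D) cuboid to the base (4-D) cuboid.

```python
# Illustrative sketch: each subset of dimensions is one cuboid in the lattice.
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]
for k in range(len(dimensions) + 1):
    for cuboid in combinations(dimensions, k):
        print(f"{k}-D:", cuboid if cuboid else "(apex cuboid: summarized over all dimensions)")

# A cube with n dimensions has 2**n cuboids (here 16), ignoring concept hierarchies.
```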
UNIT V
There are two forms of data analysis that can be used to extract models describing important classes or
predict future data trends. These two forms are as follows:
1. Classification
2. Prediction
We use classification and prediction to extract a model, representing the data classes to predict future data
trends. Classification predicts the categorical labels of data with the prediction models. This analysis
provides us with the best understanding of the data at a large scale.
Classification models predict categorical class labels, and prediction models predict continuous-valued
functions. For example, we can build a classification model to categorize bank loan applications as either
safe or risky or a prediction model to predict the expenditures in dollars of potential customers on computer
equipment given their income and occupation.
What is Classification?
Classification is to identify the category or the class label of a new observation. First, a set of data is used
as training data. The set of input data and the corresponding outputs are given to the algorithm. So, the
training data set includes the input data and their associated class labels. Using the training dataset, the algorithm derives a model or the classifier. The derived model can be a decision tree, mathematical formula, or a neural network. In classification, when unlabeled data is given to the model, it should find the class to which it belongs. The new data provided to the model is the test data set.
The bank needs to analyze whether giving a loan to a particular customer is risky or not. For example,
based on observable data for multiple loan borrowers, a classification model may be established that
forecasts credit risk. The data could track job records, homeownership or leasing, years of residency,
number, type of deposits, historical credit ranking, etc. The goal would be credit ranking, the predictors
would be the other characteristics, and the data would represent a case for each consumer. In this example, a model is constructed to find the categorical label. The labels are risky or safe.
How does Classification Work?
Sentiment Analysis: Sentiment analysis is highly helpful in social media monitoring. We can use it to extract social media insights. We can build sentiment analysis models that read and analyze even misspelled words with advanced machine learning algorithms. Accurately trained models provide consistently accurate outcomes in a fraction of the time.
Document Classification: We can use document classification to organize the documents into sections
according to the content. Document classification refers to text classification; we can classify the words in
the entire document. And with the help of machine learning classification algorithms, we can execute it
automatically.
Image Classification: Image classification is used for the trained categories of an image. These could be
the caption of the image, a statistical value, a theme. You can tag images to train your model for relevant
categories by applying supervised learning algorithms.
Machine Learning Classification: It uses the statistically demonstrable algorithm rules to execute
analytical tasks that would take humans hundreds of more hours to perform.
3. Data Classification Process: The data classification process can be categorized into five steps:
Create the goals of data classification, strategy, workflows, and architecture of data classification.
Classify confidential details that we store.
Mark the data using data labelling.
Use the results to improve protection and compliance.
Data is complex, and classification is a continuous process.
What is Data Classification Lifecycle?
What is Prediction?
Another process of data analysis is prediction. It is used to find a numerical output. As in classification, the training dataset contains the inputs and corresponding numerical output values. The algorithm derives the model or a predictor according to the training dataset. The model should find a numerical output when the new data is given. Unlike classification, this method does not have a class label. The model predicts a continuous-valued function or ordered value.
Regression is generally used for prediction. Predicting the value of a house depending on facts such as the number of rooms, the total area, etc., is an example of prediction.
For example, suppose the marketing manager needs to predict how much a particular customer will spend at his company during a sale. In this case we need to forecast a numerical value, so this data processing activity is an example of numeric prediction. Here, a model or a predictor will be developed that forecasts a continuous-valued or ordered function.
o Robustness: Robustness is the ability to make correct predictions or classifications. In the context of data mining, robustness is the ability of the classifier or predictor to make correct predictions even from noisy data or data with missing values.
Classification vs. Prediction:
• In classification, the accuracy depends on finding the class label correctly. In prediction, the accuracy depends on how well a given predictor can guess the value of a predicted attribute for new data.
• In classification, the model can be known as the classifier. In prediction, the model can be known as the predictor.
• In classification, a model or classifier is constructed to find the categorical labels. In prediction, a model or predictor is constructed that predicts a continuous-valued function or ordered value.
• For example, the grouping of patients based on their medical records can be considered a classification, whereas predicting the correct treatment for a particular disease for a person can be thought of as prediction.
The decision tree, applied to existing data, is a classification model. We can get a class prediction by
applying it to new data for which the class is unknown. The assumption is that the new data comes from a
distribution similar to the data we used to construct our decision tree. In many instances, this is a correct
assumption, so we can use the decision tree to build a predictive model. Classification and prediction are processes of finding a model that describes the classes or concepts of information; the purpose is to use this model to predict the class of objects whose class label is unknown. The major differences between classification and prediction are summarized in the comparison above.
o The intra-cluster similarities are high, It implies that the data present inside the cluster is similar to
one another.
o The inter-cluster similarity is low, and it means each cluster holds data that is not similar to other
data.
What is a Cluster?
o Clustering is the method of converting a group of abstract objects into classes of similar objects.
o Clustering is a method of partitioning a set of data or objects into a set of significant subclasses
called clusters.
o It helps users to understand the structure or natural grouping in a data set and used either as a
stand-alone instrument to get a better insight into data distribution or as a pre-processing step for
other algorithms
Important points:
o In many applications, clustering analysis is widely used, such as data analysis, market research,
pattern recognition, and image processing.
o It assists marketers in finding different groups in their client base; based on the purchasing patterns, they can characterize their customer groups.
o It helps in allocating documents on the internet for data discovery.
o Clustering is also used in tracking applications such as detection of credit card fraud.
o As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of
data to analyze the characteristics of each cluster.
o In terms of biology, it can be used to determine plant and animal taxonomies, categorize genes with the same functionalities, and gain insight into structures inherent to populations.
o It helps in the identification of areas of similar land that are used in an earth observation database
and the identification of house groups in a city according to house type, value, and geographical
location.
Clustering analysis has been an evolving problem in data mining due to its variety of applications. The advent of various data clustering tools in the last few years and their comprehensive use in a broad range of applications, including image processing, computational biology, mobile communication, medicine, and economics, has contributed to the popularity of these algorithms. The main issue with data clustering algorithms is that they cannot be standardized. An advanced algorithm may give the best results with one type of data set, but it may fail or perform poorly with other kinds of data sets. Although many efforts have been made to standardize algorithms that can perform well in all situations, no significant achievement has been made so far. Many clustering tools have been proposed so far; however, each algorithm has its advantages and disadvantages and cannot work in all real situations.
1. Scalability:
Scalability in clustering implies that as we increase the number of data objects, the time to perform clustering
should grow approximately in line with the complexity order of the algorithm. For example, K-means
clustering is roughly O(n), where n is the number of objects in the data. If we increase the number of data
objects tenfold, the time taken to cluster them should also increase roughly tenfold, i.e. there should be a
linear relationship. If that is not the case, there is likely an error in our implementation. The short sketch
after this paragraph illustrates one way to check this.
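A minimal sketch of such a check, not part of the notes; scikit-learn's KMeans, the synthetic data sizes, and the timing approach are all assumptions chosen for illustration.

    import time
    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans

    # Time K-means on data sets of increasing size and look for a roughly
    # linear relationship between n and the elapsed time.
    for n in (10_000, 20_000, 40_000):
        X, _ = make_blobs(n_samples=n, centers=5, n_features=10, random_state=0)
        start = time.perf_counter()
        KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
        print(f"n={n:6d}  time={time.perf_counter() - start:.2f}s")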
2. Interpretability:
The results of clustering should be interpretable, comprehensible, and usable.
3. Discovery of clusters with arbitrary shape:
The clustering algorithm should be able to find clusters of arbitrary shape. It should not be limited to
distance measures that tend to discover only small, spherical clusters.
5. Ability to deal with noisy data:
Databases contain data that is noisy, missing, or erroneous. Some algorithms are sensitive to such data and
may produce poor-quality clusters.
6. High dimensionality:
The clustering tools should be able to handle not only low-dimensional data but also high-dimensional
data spaces. A short sketch of one common way to cope with high dimensionality follows below.
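One common way to make clustering workable in a high-dimensional space, assumed here rather than prescribed by the notes, is to reduce the dimensionality first. The following minimal sketch applies PCA before K-means; the data, component count, and cluster count are purely illustrative.

    from sklearn.datasets import make_blobs
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    from sklearn.pipeline import make_pipeline

    # 100-dimensional synthetic data (illustrative only).
    X, _ = make_blobs(n_samples=1_000, centers=4, n_features=100, random_state=0)

    # Project onto a few principal components, then cluster in the reduced space.
    model = make_pipeline(PCA(n_components=10, random_state=0),
                          KMeans(n_clusters=4, n_init=10, random_state=0))
    labels = model.fit_predict(X)
    print("Cluster sizes:", [int((labels == k).sum()) for k in range(4)])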
Clustering Methods
Partitioning Clustering
Density-Based Clustering
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability of how a
data point belongs to a particular distribution. The grouping is done by assuming certain distributions,
most commonly the Gaussian distribution. A brief sketch follows this paragraph.
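As a hedged illustration, not part of the original notes, the following sketch uses scikit-learn's GaussianMixture on assumed synthetic data: each point is assigned to the Gaussian component most likely to have generated it, and membership probabilities are also available.

    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

    # Fit a mixture of three Gaussians; each component plays the role of a cluster.
    gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

    hard_labels = gmm.predict(X)        # most probable component per point
    soft_labels = gmm.predict_proba(X)  # probability of belonging to each component
    print(hard_labels[:10])
    print(soft_labels[:3].round(3))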
Hierarchical Clustering
Fuzzy Clustering
Fuzzy clustering is a type of soft clustering method in which a data object may belong to more than one group or
cluster. Each data point has a set of membership coefficients that indicate its degree of membership in each
cluster. The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes also
known as the Fuzzy k-means algorithm. A minimal sketch of the idea follows below.
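The following is a minimal, self-contained NumPy sketch of the fuzzy c-means idea, not taken from the notes; the function name, parameters, and synthetic data are illustrative assumptions. Each point receives a membership coefficient for every cluster rather than a single hard label.

    import numpy as np

    def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, seed=0):
        """Minimal fuzzy c-means sketch: returns cluster centers and membership matrix."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        # Random initial memberships; each row is normalized to sum to 1.
        U = rng.random((n, c))
        U /= U.sum(axis=1, keepdims=True)
        for _ in range(n_iter):
            Um = U ** m
            # Centers are membership-weighted means of the points.
            centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
            # Distances from every point to every center (epsilon avoids division by zero).
            dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
            # Standard FCM membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)).
            ratio = dist[:, :, None] / dist[:, None, :]
            U = 1.0 / (ratio ** (2.0 / (m - 1.0))).sum(axis=2)
        return centers, U

    # Usage on three assumed synthetic blobs.
    X = np.vstack([np.random.default_rng(1).normal(loc, 0.5, size=(50, 2))
                   for loc in ((0, 0), (4, 4), (0, 4))])
    centers, U = fuzzy_c_means(X, c=3)
    print("Centers:\n", centers.round(2))
    print("Memberships of first point:", U[0].round(3))   # soft assignment, sums to 1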
Clustering Algorithms
The clustering algorithms can be divided based on the models explained above. There are many types of
clustering algorithms published, but only a few are commonly used. The choice of clustering algorithm
depends on the kind of data we are using; for example, some algorithms require the number of clusters
in the given dataset to be specified, whereas others require a minimum distance between observations of
the dataset to be defined.
Here we discuss the most popular clustering algorithms that are widely used in machine learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It
partitions the dataset by dividing the samples into different clusters of equal variance. The number
of clusters must be specified in advance. It is fast and requires relatively few computations, with
linear complexity O(n). (A usage sketch of a few of these algorithms follows this list.)
2. Mean-shift algorithm: The mean-shift algorithm tries to find dense areas in a smooth density of
data points. It is an example of a centroid-based model, which works by updating candidate
centroids to be the mean of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with Noise.
It is an example of a density-based model similar to the mean-shift, but with some remarkable
advantages. In this algorithm, the areas of high density are separated by the areas of low density.
Because of this, the clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative
to the k-means algorithm, or in cases where k-means may fail. In a GMM, it is assumed
that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The agglomerative hierarchical algorithm performs
bottom-up hierarchical clustering. Each data point is treated as a single cluster at the outset,
and clusters are then successively merged. The cluster hierarchy can be represented as a tree structure.
6. Affinity Propagation: It differs from other clustering algorithms in that it does not require the
number of clusters to be specified. Instead, data points exchange messages in pairs
until convergence. Its O(N²T) time complexity, where N is the number of data points and T is the
number of iterations, is the main drawback of this algorithm.
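As referenced in item 1, here is a minimal sketch, not from the original notes, showing how a few of these algorithms are typically invoked; scikit-learn, the synthetic data, and all parameter values are assumptions for illustration.

    from sklearn.datasets import make_moons
    from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

    # Two interleaving half-moons: a classic example of arbitrarily shaped clusters.
    X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

    # K-means needs the number of clusters up front and prefers compact, convex clusters.
    km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # DBSCAN needs a neighborhood radius (eps) instead and can follow the moons' shape.
    db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

    # Agglomerative clustering merges points bottom-up into the requested number of clusters.
    ag_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

    print("K-means clusters:", set(km_labels))
    print("DBSCAN clusters (-1 = noise):", set(db_labels))
    print("Agglomerative clusters:", set(ag_labels))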
Applications of Clustering
Below are some commonly known applications of the clustering technique in machine learning:
o In Identification of Cancer Cells: Clustering algorithms are widely used for the identification of
cancerous cells. They divide cancerous and non-cancerous data points into different groups.
o In Search Engines: Search engines also make use of the clustering technique. Search results are returned
based on the objects closest to the search query, which is achieved by grouping similar data objects into one
group that is far from dissimilar objects. The accuracy of a query result depends on the
quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the customers based on their
choice and preferences.
o In Biology: It is used in the biology stream to classify different species of plants and animals using
the image recognition technique.
o In Land Use: The clustering technique is used to identify areas of similar land use in a GIS
database. This can be very useful for determining the purpose for which a particular piece of land
is most suitable.
Data Mining tools have the objective of discovering patterns/trends/groupings among large sets of data
and transforming data into more refined information.
A data mining tool is a framework, such as RStudio or Tableau, that allows you to perform different types of
data mining analysis. We can run various algorithms, such as clustering or classification, on a data set and
visualize the results within the same framework. Such a framework gives us better insight into our data and
the phenomena the data represent.
1. Orange Data Mining:
3. DMelt:
DMelt can be used for the analysis of large volumes of data, for data mining, and for statistical analysis. It is
extensively used in the natural sciences, financial markets, and engineering.
4. Rattle:
5. Rapid Miner:
Rapid Miner provides its server on-site as well as in public or private cloud infrastructure. It has a
client/server model as its base. Rapid Miner comes with template-based frameworks that enable fast
delivery with fewer errors (which are commonly expected in the manual code-writing process).