Data Mining MCA 3 Sem


UNIT- I

What is Data Mining?

 Data Mining is the process of extracting information from huge sets of data in order to identify patterns, trends, and useful insights that allow a business to take data-driven decisions.
 Data Mining is the process of investigating hidden patterns in information from various perspectives and categorizing it into useful data, which is collected and assembled in particular areas such as data warehouses, analyzed efficiently using data mining algorithms, and used to help decision making.
 Data Mining is a process used by organizations to extract specific data from huge databases to solve business problems. It primarily turns raw data into useful information.

Types of Data Mining

Relational Database:

A relational database is a collection of multiple data sets formally organized into tables, records, and columns from which data can be accessed in various ways without having to reorganize the database tables. Tables convey and share information, which facilitates data searchability, reporting, and organization.

Relational Databases:
A relational database consists of a set of tables containing either values of entity attributes or values of attributes from entity relationships. Tables have columns and rows, where columns represent attributes and rows represent tuples. A tuple in a relational table corresponds to either an object or a relationship between objects and is identified by a set of attribute values representing a unique key. The following figure presents some relations, Customer, Items, and Borrow, representing business activity in a video store. These relations are just a subset of what could be a database for the video store and are given as an example.

The most commonly used query language for relational databases is SQL, which allows retrieval and manipulation of the data stored in the tables, as well as the calculation of aggregate functions such as average, sum, min, max and count. For instance, an SQL query to count the videos grouped by category would be: SELECT count(*) FROM Items WHERE type = 'video' GROUP BY category. Data mining algorithms using relational databases can be more versatile than data mining algorithms specifically written for flat files, since they can take advantage of the structure inherent to relational databases. While data mining can benefit from SQL for data selection, transformation and consolidation, it goes beyond what SQL can provide, such as predicting, comparing, and detecting deviations.
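As a small runnable sketch of the aggregate query above (only the Items table name and its type/category columns come from the example; the rows and extra item_id column are invented):

```python
import sqlite3

# Build a tiny in-memory version of the hypothetical Items table
# from the video-store example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (item_id INTEGER, type TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO Items VALUES (?, ?, ?)",
    [(1, "video", "action"), (2, "video", "comedy"),
     (3, "video", "action"), (4, "game", "puzzle")],
)

# Count the videos in each category, as in the SQL example above.
query = """
SELECT category, COUNT(*) AS n_videos
FROM Items
WHERE type = 'video'
GROUP BY category
"""
for category, n in conn.execute(query):
    print(category, n)   # e.g. action 2, comedy 1
```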

Data warehouses:

A Data Warehouse is the technology that collects the data from various sources within the organization to
provide meaningful business insights. The huge amount of data comes from multiple places such as
Marketing and Finance. The extracted data is utilized for analytical purposes and helps in decision- making
for a business organization. The data warehouse is designed for the analysis of data rather than transaction
processing.

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed via a process of data cleansing, data transformation, data integration, data loading, and periodic data refreshing. The figure shows the basic architecture of a data warehouse.

In order to facilitate decision making, the data in a data warehouse are organized around major subjects, such as customer, item, supplier, and activity. The data are stored to provide information from a historical perspective and are typically summarized. A data warehouse is usually modeled by a multidimensional database structure, where each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure, such as count or sales amount. The actual physical structure of a data warehouse may be a relational data store or a multidimensional data cube. It provides a multidimensional view of data and allows the precomputation and fast accessing of summarized data.

The data cube structure that stores the primitive or lowest level of information is called a base cuboid. Its corresponding higher-level multidimensional (cube) structures are called (non-base) cuboids. A base cuboid together with all of its corresponding higher-level cuboids forms a data cube. By providing multidimensional data views and the precomputation of summarized data, data warehouse systems are well suited for OnLine Analytical Processing, or OLAP. OLAP operations make use of background knowledge regarding the domain of the data being studied in order to allow the presentation of data at different levels of abstraction. Such operations accommodate different user viewpoints. Examples of OLAP operations include drill-down and roll-up, which allow the user to view the data at differing degrees of summarization, as illustrated in the figure above.
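As a minimal sketch of roll-up on a tiny cube (not part of the original notes; pandas is assumed to be available, and the sales data, dimensions, and measure are invented):

```python
import pandas as pd

# Hypothetical sales records: each row is one transaction (the base level).
sales = pd.DataFrame({
    "branch":  ["B1", "B1", "B2", "B2", "B2"],
    "item":    ["TV", "PC", "TV", "PC", "PC"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2"],
    "amount":  [400, 900, 350, 950, 870],
})

# Base cuboid: the measure summarized by every dimension (branch, item, quarter).
base = sales.groupby(["branch", "item", "quarter"])["amount"].sum()

# Roll-up: drop the 'quarter' dimension to view coarser summaries.
rollup = sales.groupby(["branch", "item"])["amount"].sum()

# Drill-down is the reverse: moving from 'rollup' back to 'base'
# exposes the quarter-level detail again.
print(base)
print(rollup)
```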

Data Repositories:

The Data Repository generally refers to a destination for data storage. However, many IT professionals use the term more specifically to refer to a particular kind of setup within an IT structure, for example, a group of databases where an organization has kept various kinds of information.

Object-Relational Database:

A combination of an object-oriented database model and relational database model is called an object-relational
model. It supports Classes, Objects, Inheritance, etc.
One of the primary objectives of the Object-relational data model is to close the gap between the Relational database and
the object-oriented model practices frequently utilized in many programming languages, for example, C++, Java, C#, and
so on.

Transactional Database:

A transactional database refers to a database management system (DBMS) that has the ability to undo a database transaction if it is not performed appropriately. Although this was once a unique capability, today most relational database systems support transactional database activities.

In general, a transactional database consists of a flat file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction (such as items purchased in a store), as shown below:
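As a minimal sketch of such a flat-file view of transactions (the transaction IDs and items below are invented):

```python
# Each record pairs a transaction ID with the list of items bought.
transactions = {
    "T100": ["bread", "milk", "butter"],
    "T200": ["bread", "diapers", "beer"],
    "T300": ["milk", "diapers", "beer", "cola"],
}

# A typical mining question: in which transactions does 'milk' appear?
with_milk = [tid for tid, items in transactions.items() if "milk" in items]
print(with_milk)   # ['T100', 'T300']
```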

Advanced database systems and advanced database applications

• An object-oriented database is designed based on the object-oriented programming paradigm where data are a large
number of objects organized into classes and class hierarchies. Each entity in the database is considered as an object.
The object contains a set of variables that describe the object, a set of messages that the object can use to
communicate with other objects or with the rest of the database system and a set of methods where each method
holds the code to implement a message.

• A spatial database contains spatial-related data, which may be represented in the form of raster or vector data.
Raster data consists of n-dimensional bit maps or pixel maps, and vector data are represented by lines, points,
polygons or other kinds of processed primitives. Some examples of spatial databases include geographical (map)
databases, VLSI chip designs, and medical and satellite images databases.

• Time-Series Databases: Time-series databases contain time-related data such as stock market data or logged activities.
These databases usually have a continuous flow of new data coming in, which sometimes causes the need for a
challenging real time analysis. Data mining in such databases commonly includes the study of trends and correlations
between evolutions of different variables, as well as the prediction of trends and movements of the variables in time.

• A text database is a database that contains text documents or other word descriptions in the form of long sentences or
paragraphs, such as product specifications, error or bug reports, warning messages, summary reports, notes, or other
documents.

• A multimedia database stores images, audio, and video data, and is used in applications such as picture content-
based retrieval, voice-mail systems, video-on-demand systems, the World Wide Web, and speech-based user
interfaces.

• The World-Wide Web provides rich, world-wide, on-line information services, where data objects are linked
together to facilitate interactive access. Some examples of distributed information services associated with the
World-Wide Web include America Online, Yahoo!, AltaVista, and Prodigy.

What motivated data mining? Why is it important?

The major reason that data mining has attracted a great deal of attention in information industry in recent years is
due to the wide availability of huge amounts of data and the imminent need for turning such data into useful
information and knowledge. The information and knowledge gained can be used for applications ranging from
business management, production control, and market analysis, to engineering design and science exploration.
The evolution of database technology

Data mining functionalities/Data mining tasks: what kinds of patterns can be mined?

Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. In general,
data mining tasks can be classified into two categories:

• Descriptive

• Predictive

Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks
perform inference on the current data in order to make predictions.

Describe data mining functionalities, and the kinds of patterns they can discover (or) Define each of the following
data mining functionalities: characterization, discrimination, association and correlation analysis, classification,
prediction, clustering, and evolution analysis. Give examples of each data mining functionality, using a real-life
database that you are familiar with.

Concept/class description: characterization and discrimination

Data can be associated with classes or concepts. It describes a given set of data in a concise and summarized manner, presenting interesting general properties of the data. These descriptions can be derived via

1. data characterization, by summarizing the data of the class under study (often called the target class)
2. data discrimination, by comparison of the target class with one or a set of comparative classes
3. both data characterization and discrimination

Data characterization

It is a summarization of the general characteristics or features of a target class of data. Example:

A data mining system should be able to produce a description summarizing the characteristics of a student who has
obtained more than 75% in every semester; the result could be a general profile of the student.

Data Discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes. Example:

The general features of students with high GPAs may be compared with the general features of students with low GPAs. The resulting description could be a general comparative profile of the students, such as: 75% of the students with high GPAs are fourth-year computing science students, while 65% of the students with low GPAs are not. The output of data characterization can be presented in various forms. Examples include pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs. The resulting descriptions can also be presented as generalized relations, or in rule form called characteristic rules. Discrimination descriptions expressed in rule form are referred to as discriminant rules.

Classification and Prediction

Classification:

 It predicts categorical class labels

 It classifies data (constructs a model) based on the training set and the values (class labels) in a classifying
attribute and uses it in classifying new data

 Typical Applications

 credit approval
 target marketing
 medical diagnosis
 treatment effectiveness analysis

Classification can be defined as the process of finding a model (or function) that describes and distinguishes data
classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label
is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label
is known).

Example:

An airport security screening station is used to determine if passengers are potential terrorists or criminals. To do this, the face of each passenger is scanned and its basic pattern (distance between eyes, size and shape of mouth, head, etc.) is identified. This pattern is compared to entries in a database to see if it matches any patterns that are associated with known offenders.

A classification model can be represented in various forms, such as classification (IF-THEN) rules, a decision tree, or a neural network.
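As an illustrative sketch only (not the notes' own example), the following builds a small decision-tree classifier for the credit-approval application listed above; scikit-learn is assumed, and the features, training tuples, and labels are invented:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy training set: [income in thousands, years employed] and class labels.
X_train = [[25, 1], [40, 3], [60, 8], [90, 10], [30, 2], [75, 6]]
y_train = ["reject", "reject", "approve", "approve", "reject", "approve"]

# Learn a model from tuples whose class labels are known ...
model = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

# ... and use it to predict the class label of a new, unlabeled applicant.
print(model.predict([[55, 5]]))   # e.g. ['approve']
```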

Prediction:

Finding some missing or unavailable data values, rather than class labels, is referred to as prediction. Although prediction may refer to both data value prediction and class label prediction, it is usually confined to data value prediction and thus is distinct from classification. Prediction also encompasses the identification of distribution trends based on the available data.
Example:
Predicting flooding is a difficult problem. One approach uses monitors placed at various points in the river. These monitors collect data relevant to flood prediction: water level, rain amount, time, humidity, etc. The water levels at a potential flooding point in the river can then be predicted based on the data collected by the sensors upriver from this point. The prediction must be made with respect to the time the data were collected.
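A minimal sketch of such numeric prediction, fitting a simple linear model to invented sensor readings (NumPy is assumed):

```python
import numpy as np

# Invented readings: water level (m) at an upriver sensor vs. the level
# observed later at the potential flooding point.
upriver = np.array([1.0, 1.4, 1.9, 2.3, 2.8, 3.1])
downstream = np.array([0.8, 1.1, 1.6, 2.0, 2.5, 2.9])

# Fit a simple linear model: downstream ~ a * upriver + b.
a, b = np.polyfit(upriver, downstream, deg=1)

# Predict the (numeric) downstream level for a new upriver reading of 2.5 m.
print(a * 2.5 + b)
```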

Classification vs. Prediction

Classification differs from prediction in that the former constructs a set of models (or functions) that describe and distinguish data classes or concepts, whereas the latter predicts some missing or unavailable, and often numerical, data values. Their similarity is that they are both tools for prediction: classification is used for predicting the class label of data objects, and prediction is typically used for predicting missing numerical data values.

Clustering analysis

Clustering analyzes data objects without consulting a known class label. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. Each cluster that is formed can be viewed as a class of objects.

Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together, as shown below:

Example: A certain national department store chain creates special catalogs targeted to various demographic groups based on attributes such as income, location and physical characteristics of potential customers (age, height, weight, etc.). To determine the target mailings of the various catalogs and to assist in the creation of new, more specific catalogs, the company performs a clustering of potential customers based on the determined attribute values. The results of the clustering exercise are then used by management to create special catalogs and distribute them to the correct target population based on the cluster for that catalog.
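As a hedged sketch of such customer clustering (scikit-learn is assumed; the attribute values and number of clusters are invented):

```python
from sklearn.cluster import KMeans

# Invented customer attributes: [annual income (k$), age].
customers = [[30, 25], [32, 27], [85, 45], [90, 50], [28, 23], [88, 48]]

# Group the customers without any predefined class labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)

# Each cluster can then be treated as one target group for a catalog.
print(km.labels_)          # e.g. [0 0 1 1 0 1]
print(km.cluster_centers_)
```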

Classification vs. Clustering

 In general, in classification you have a set of predefined classes and want to know which class a new object
belongs to.
 Clustering tries to group a set of objects and find whether there is some relationship between the objects.
 In the context of machine learning, classification is supervised learning and clustering is unsupervised
learning.

Outlier analysis: A database may contain data objects that do not comply with the general model of the data. These data objects are outliers. In other words, the data objects which do not fall within any cluster are called outlier data objects. Noisy data or exceptional data are also called outlier data. The analysis of outlier data is referred to as outlier mining.

Example Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of extremely large
amounts for a given account number in comparison to regular charges incurred by the same account. Outlier values
may also be detected with respect to the location and type of purchase, or the purchase frequency.
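A minimal sketch of flagging such unusually large charges, using the common 1.5 x IQR rule on invented amounts (this is one possible technique, not one prescribed by the notes):

```python
import numpy as np

# Invented charge amounts for one account; the last value is unusually large.
charges = np.array([42, 55, 38, 61, 47, 52, 49, 940])

# Flag values outside the classic 1.5 * IQR fences as potential outliers.
q1, q3 = np.percentile(charges, [25, 75])
iqr = q3 - q1
outliers = charges[(charges < q1 - 1.5 * iqr) | (charges > q3 + 1.5 * iqr)]
print(outliers)   # [940]
```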

Data evolution analysis describes and models regularities or trends for objects whose behaviour changes over time.

Example: The results data of a college over the last several years would give an idea of whether the quality of the graduates it produces is changing over time.

Classification of data mining systems

There are many data mining systems available or being developed. Some are specialized systems dedicated to a
given data source or are confined to limited data mining functionalities, other are more versatile and
comprehensive. Data mining systems can be categorized according to various criteria; among others, the following classifications can be used:

Classification according to the type of data source mined: this classification categorizes data mining systems
according to the type of data handled such as spatial data, multimedia data, time-series data, text data, World Wide
Web, etc.

Classification according to the data model drawn on: this classification categorizes data mining systems based on
the data model involved such as relational database, object- oriented database, data warehouse, transactional, etc.

Classification according to the kind of knowledge discovered: this classification categorizes data mining systems
based on the kind of knowledge discovered or data mining functionalities, such as characterization, discrimination,
association, classification, clustering, etc. Some systems tend to be comprehensive systems offering several data
mining functionalities together.

Classification according to mining techniques used: Data mining systems employ and provide different techniques.
This classification categorizes data mining systems according to the data analysis approach used such as machine
learning, neural networks, genetic algorithms, statistics, visualization, database oriented or data warehouse -
oriented, etc. The classification can also take into account the degree of user interaction involved in the data mining
process such as query-driven systems, interactive exploratory systems, or autonomous systems. A comprehensive
system would provide a wide variety of data mining techniques to fit different situations and options, and offer
different degrees of user interaction.

Major issues in data mining

1. Mining methodology and user-interaction issues: Mining different kinds of knowledge in databases: Since
different users can be interested in different kinds of knowledge, data mining should cover a wide spectrum of data
analysis and knowledge discovery tasks, including data characterization, discrimination, association, classification,
clustering, trend and deviation analysis, and similarity analysis. These tasks may use the same database in different
ways and require the development of numerous data mining techniques.
a. Interactive mining of knowledge at multiple levels of abstraction: Since it is difficult to know exactly what can
be discovered within a database, the data mining process should be interactive.
b. Incorporation of background knowledge: Background knowledge, or information regarding the domain under study, may be used to guide the discovery process. Domain knowledge related to databases, such as integrity constraints and deduction rules, can help focus and speed up a data mining process, or judge the interestingness of discovered patterns.
c. Data mining query languages and ad-hoc data mining: Just as relational query languages (such as SQL) allow users to pose ad-hoc queries for data retrieval, high-level data mining query languages are needed to let users describe ad-hoc data mining tasks.
d. Presentation and visualization of data mining results: Discovered knowledge should be expressed in high-
level languages, visual representations, so that the knowledge can be easily understood and directly usable by
humans
e. Handling outlier or incomplete data: The data stored in a database may reflect outliers: noise, exceptional
cases, or incomplete data objects. These objects may confuse the analysis process, causing overfitting of the data to the knowledge model constructed. As a result, the accuracy of the discovered patterns can be poor.
Data cleaning methods and data analysis methods which can handle outliers are required.
f. Pattern evaluation: refers to interestingness of pattern: A data mining system can uncover thousands of
patterns. Many of the patterns discovered may be uninteresting to the given user, representing common
knowledge or lacking novelty. Several challenges remain regarding the development of techniques to assess
the interestingness of discovered patterns

2. Performance issues. These include efficiency, scalability, and parallelization of data mining algorithms.

a. Efficiency and scalability of data mining algorithms: To effectively extract information from a huge amount
of data in databases, data mining algorithms must be efficient and scalable.
b. Parallel, distributed, and incremental updating algorithms: Such algorithms divide the data into partitions,
which are processed in parallel. The results from the partitions are then merged.

3. Issues relating to the diversity of database types
Handling of relational and complex types of data: Since relational databases and data warehouses are widely used, the development of efficient and effective data mining systems for such data is important.

Mining information from heterogeneous databases and global information systems: Local and wide-area computer
networks (such as the Internet) connect many sources of data, forming huge, distributed, and heterogeneous
databases. The discovery of knowledge from different sources of structured, semi-structured, or unstructured data
with diverse data semantics poses great challenges to data mining
Unit III
Data preprocessing
Data preprocessing describes any type of processing performed on raw data to prepare it for another processing
procedure. Commonly used as a preliminary data mining practice, data preprocessing transforms the data into a
format that will be more easily and effectively processed for the purpose of the user.
Why Data Preprocessing?
Data in the real world is dirty. It can be incomplete, noisy, and inconsistent. These data need to be preprocessed in order to help improve the quality of the data, and the quality of the mining results.
 If there is no quality data, then there will be no quality mining results. Quality decisions are always based on quality data.
 If there is much irrelevant and redundant information present or noisy and unreliable data, then
knowledge discovery during the training phase is more difficult

Incomplete data: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
e.g., occupation=“ ”.
Noisy data: containing errors or outliers data. e.g., Salary=“-10”
Inconsistent data: containing discrepancies in codes or names. e.g., Age=“42” Birthday=“03/07/1997”
 Incomplete data may come from
a. "Not applicable" data values when collected
b. Different considerations between the time when the data was collected and when it is analyzed
c. Human/hardware/software problems

 Noisy data (incorrect values) may come from


a. Faulty data collection by instruments
b. Human or computer error at data entry
c. Errors in data transmission
 Inconsistent data may come from
1) Different data sources
2) Functional dependency violation (e.g., modifying some linked data)

 Major Tasks in Data Preprocessing

Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation in volume but produces the same or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially for numerical data

Forms of Data Preprocessing

Data Cleaning
Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct
inconsistencies in the data.
Various methods for handling this problem
Missing Values
The various methods for handling the problem of missing values in data tuples include:
a. Ignoring the tuple: This is usually done when the class label is missing (assuming the mining task involves
classification or description). This method is not very effective unless the tuple contains several attributes
with missing values. It is especially poor when the percentage of missing values per attribute varies
considerably.
b. Manually filling in the missing value: In general, this approach is time-consuming and may not be a
reasonable task for large data sets with many missing values, especially when the value to be filled in is not
easily determined.
c. Using a global constant to fill in the missing value: Replace all missing attribute values by the same
constant, such as a label like “Unknown,” or −∞. If missing values are replaced by, say, “Unknown,” then
the mining program may mistakenly think that they form an interesting concept, since they all have a value
in common — that of “Unknown.” Hence, although this method is simple, it is not recommended.
d. Using the attribute mean for quantitative (numeric) values or attribute mode for categorical (nominal)
values, for all samples belonging to the same class as the given tuple: For example, if classifying
customers according to credit risk, replace the missing value with the average income value for customers
in the same credit risk category as that of the given tuple.
e. Using the most probable value to fill in the missing value: This may be determined with regression,
inference-based tools using Bayesian formalism, or decision tree induction. For example, using the other
customer attributes in your data set, you may construct a decision tree to predict the missing values for
income.
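The following sketch illustrates options (c) and (d) with pandas on an invented table (the column names and values are assumptions):

```python
import pandas as pd

# Invented customer records with a missing income value.
df = pd.DataFrame({
    "risk":   ["low", "low", "high", "high"],
    "income": [52000, 48000, 21000, None],
})

# (c) Fill with a global constant (a sentinel value standing in for "Unknown").
print(df["income"].fillna(-1))

# (d) Fill with the attribute mean of tuples in the same class ("high" credit risk).
class_mean = df.groupby("risk")["income"].transform("mean")
df["income"] = df["income"].fillna(class_mean)
print(df)
```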
Noisy data
Noise is a random error or variance in a measured variable. Data smoothing techniques are used for removing such noisy data.
Several data smoothing techniques:
1 Binning methods: Binning methods smooth a sorted data value by consulting the "neighborhood", or values around it. The sorted values are distributed into a number of 'buckets', or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.
In this technique,
1. The data are first sorted.
2. The sorted list is then partitioned into equi-depth bins.
3. One can then smooth by bin means, bin medians, or bin boundaries:
a. Smoothing by bin means: each value in the bin is replaced by the mean value of the bin.
b. Smoothing by bin medians: each value in the bin is replaced by the bin median.
c. Smoothing by bin boundaries: the min and max values of a bin are identified as the bin boundaries, and each bin value is replaced by the closest boundary value.
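A small sketch of equi-depth binning with smoothing by bin means and by bin boundaries (NumPy is assumed; the sample values are a commonly used textbook example, not data from these notes):

```python
import numpy as np

# Sorted data split into three equi-depth (equal-frequency) bins of size 3.
data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = data.reshape(3, 3)                       # [[4 8 15] [21 21 24] [25 28 34]]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = np.repeat(bins.mean(axis=1), 3)      # [9 9 9 22 22 22 29 29 29]

# Smoothing by bin boundaries: every value snaps to the nearer of the bin min/max.
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo < hi - bins, lo, hi)

print(by_means)
print(by_bounds)
```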
Data Integration and Transformation
Data Integration: It combines data from multiple sources into a coherent store. There are number of issues to
consider during data integration.
Issues:

 Schema integration: refers integration of metadata from different sources.

 Entity identification problem: Identifying entities in one data source that refer to the same entity in another. For example, customer_id in one db and customer_no in another db refer to the same entity

 Detecting and resolving data value conflicts: Attribute values from different sources can be different due to
different representations, different scales. E.g. metric vs. British units

 Redundancy: is another issue while performing data integration. Redundancy can occur due to the following
reasons

 Object identification: The same attribute may have different names in different db

 Derived Data: one attribute may be derived from another attribute.


Data Transformation
 Smoothing: which works to remove noise from the data
 Aggregation: where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute weekly and annual totals.
 Generalization of the data: where low-level or “primitive” (raw) data are replaced by higher-level concepts
through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized
to higher-level concepts, like city or country.
 Normalization: where the attribute data are scaled so as to fall within a small specified range, such as −1.0
to 1.0, or 0.0 to 1.0.
 Attribute construction (feature construction): this is where new attributes are constructed and added from the
given set of attributes to help the mining process
Normalization
Normalization scales the attribute data to fall within a small, specified range. It is useful for classification algorithms involving neural networks, and for distance measurements such as nearest-neighbor classification and clustering. There are 3 methods for data normalization. They are:
 min-max normalization
 z-score normalization
 normalization by decimal scaling
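A minimal sketch of the three methods applied to invented income values (NumPy is assumed):

```python
import numpy as np

# Invented income values to illustrate the three normalization methods.
x = np.array([12000.0, 35000.0, 58000.0, 73600.0, 98000.0])

# Min-max normalization to the new range [0.0, 1.0].
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: (value - mean) / standard deviation.
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^j so the largest absolute value is < 1.
j = int(np.ceil(np.log10(np.abs(x).max())))
decimal = x / 10 ** j

print(min_max, z_score, decimal, sep="\n")
```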

Data Reduction
Data reduction strategies include:
1. Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
2. Attribute subset selection, where irrelevant, weakly relevant or redundant attributes or dimensions may be
detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size. Examples: Wavelet
Transforms Principal Components Analysis
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations
such as parametric models (which need store only the model parameters instead of the actual data) or
nonparametric methods such as clustering, sampling, and the use of histograms.
5. Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies.

Data cube aggregation: Reduce the data to the concept level needed in the analysis. Queries regarding aggregated information should be answered using a data cube when possible. Data cubes store multidimensional aggregated information. The following figure shows a data cube for multidimensional analysis of sales data with respect to annual sales per item type for each branch.

Data Discretization

 Dividing the range of a continuous attribute into intervals.


 Interval labels can then be used to replace actual data values.
 Reduce the number of values for a given continuous attribute.
 Some classification algorithms only accept categorical attributes.
 This leads to a concise, easy-to-use, knowledge-level representation of mining results.
 Discretization techniques can be categorized based on whether it uses class information or not
such as follows:
o Supervised Discretization - This discretization process uses class information.
o Unsupervised Discretization - This discretization process does not use class information.
 Discretization techniques can be categorized based on which direction it proceeds as follows:

Top-down Discretization -

 The process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals.

Bottom-up Discretization -

 Starts by considering all of the continuous values as potential split-points.


 Removes some by merging neighborhood values to form intervals, and then recursively applies
this process to the resulting intervals.

Concept Hierarchies

 Discretization can be performed recursively on an attribute to provide a hierarchical partitioning of the attribute values, known as a Concept Hierarchy.
 Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts
with higher-level concepts.
 In the multidimensional model, data are organized into multiple dimensions, and each dimension
contains multiple levels of abstraction defined by concept hierarchies.
 This organization provides users with the flexibility to view data from different perspectives.
 Data mining on a reduced data set means fewer input and output operations and is more efficient
than mining on a larger data set.
 Because of these benefits, discretization techniques and concept hierarchies are typically applied
before data mining, rather than during mining

Typical Methods of Discretization and Concept Hierarchy Generation for Numerical Data
1] Binning

 Binning is a top-down splitting technique based on a specified number of bins.


 Binning is an unsupervised discretization technique because it does not use class information.
 In this technique, the sorted values are distributed into several buckets or bins, and each value is then replaced by the bin mean or median.
 It is further classified into
o Equal-width (distance) partitioning
o Equal-depth (frequency) partitioning

2] Histogram Analysis

 It is an unsupervised discretization technique because histogram analysis does not use class
information.
 Histograms partition the values for an attribute into disjoint ranges called buckets.
 It is also further classified into
o Equal-width histogram
o Equal frequency histogram
 The histogram analysis algorithm can be applied recursively to each partition to automatically
generate a multilevel concept hierarchy, with the procedure terminating once a pre-specified
number of concept levels has been reached.

3] Cluster Analysis

 Cluster analysis is a popular data discretization method.


 A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups.
 Clustering considers the distribution of A, as well as the closeness of data points, and therefore
can produce high-quality discretization results.
 Each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy.

4] Entropy-Based Discretization

 Entropy-based discretization is a supervised, top-down splitting technique.


 It explores class distribution information in its calculation and determination of split points.
 Let D consist of data instances defined by a set of attributes and a class-label attribute.
 The class-label attribute provides the class information per instance.
 In this, the interval boundaries or split-points defined may help to improve classification accuracy.
 The entropy and information gain measures are used for decision tree induction.
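As a hedged sketch of the entropy calculation behind this technique, the following evaluates one candidate split point on invented data (NumPy is assumed; the values, labels, and split point are made up):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Invented (value, class) pairs for a numeric attribute A.
values = np.array([1, 3, 4, 6, 8, 10])
labels = np.array(["no", "no", "no", "yes", "yes", "yes"])

# Expected information after splitting D at split point T = 5:
# Info_A(D) = |D1|/|D| * Entropy(D1) + |D2|/|D| * Entropy(D2).
split = 5
left, right = labels[values <= split], labels[values > split]
info_after = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)

# The split point minimizing Info_A(D), i.e. maximizing information gain, is chosen.
print(entropy(labels) - info_after)   # information gain; here 1.0 (a perfect split)
```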

5] Interval Merge by χ2 Analysis

 It is a bottom-up method.
 Find the best neighboring intervals and merge them to form larger intervals recursively.
 The method is supervised in that it uses class information.
 ChiMerge treats intervals as discrete categories.
 The basic notion is that for accurate discretization, the relative class frequencies should be fairly
consistent within an interval.
 Therefore, if two adjacent intervals have a very similar distribution of classes, then the intervals
can be merged.
 Otherwise, they should remain separate.
Architecture of a typical data mining system
Data Source:

The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text files, and other documents. You need a huge amount of historical data for data mining to be successful. Organizations typically store data in databases or data warehouses. Data warehouses may comprise one or more databases, text files, spreadsheets, or other repositories of data. Sometimes, even plain text files or spreadsheets may contain information. Another primary source of data is the World Wide Web or the internet.

Different processes:

Before passing the data to the database or data warehouse server, the data must be cleaned, integrated, and selected. As the information comes from various sources and in different formats, it can't be used directly for the data mining procedure because the data may not be complete and accurate. So, the data first needs to be cleaned and unified. More information than needed will be collected from various data sources, and only the data of interest has to be selected and passed to the server. These procedures are not as easy as they seem. Several methods may be performed on the data as part of selection, integration, and cleaning.

Database or Data Warehouse Server:

The database or data warehouse server contains the actual data that is ready to be processed. Hence, the server is responsible for retrieving the relevant data, based on the user's data mining request.

Data Mining Engine:

The data mining engine is a major component of any data mining system. It contains several modules for
operating data mining tasks, including association, characterization, classification, clustering, prediction,
time-series analysis, etc.

In other words, we can say that the data mining engine is the core of the data mining architecture. It comprises the instruments and software used to obtain insights and knowledge from data collected from various data sources and stored within the data warehouse.

Pattern Evaluation Module:

The pattern evaluation module is primarily responsible for measuring how interesting a discovered pattern is, typically by using a threshold value. It collaborates with the data mining engine to focus the search on interesting patterns.

This module commonly employs interestingness measures that cooperate with the data mining modules to focus the search towards interesting patterns. It may use an interestingness threshold to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining techniques used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining procedure so as to confine the search to only the interesting patterns.
Graphical User Interface:

The graphical user interface (GUI) module communicates between the data mining system and the user.
This module helps the user to easily and efficiently use the system without knowing the complexity of the
process. This module cooperates with the data mining system when the user specifies a query or a task and
displays the results.

Knowledge Base:

The knowledge base is helpful in the entire process of data mining. It may be used to guide the search or to evaluate the interestingness of the resulting patterns. The knowledge base may even contain user beliefs and data from user experiences that can be helpful in the data mining process. The data mining engine may receive inputs from the knowledge base to make the results more accurate and reliable. The pattern evaluation module regularly interacts with the knowledge base to get inputs and also to update it.

Misconception: Data mining systems can autonomously dig out all of the valuable knowledge from a given large
database, without human intervention.
If there were no user intervention, the system might uncover a large set of patterns that could even surpass the size of the database. Hence, user interaction is required.
This user communication with the system is provided by using a set of data mining primitives.
Data Mining Primitives
Data mining primitives define a data mining task, which can be specified in the form of a data mining query.

 Task Relevant Data


 Kinds of knowledge to be mined
 Background knowledge
 Interestingness measure
 Presentation and visualization of discovered patterns
Task relevant data

 Data portion to be investigated.


 Attributes of interest (relevant attributes) can be specified.
 Initial data relation
 Minable view
Example:
 If a data mining task is to study associations between items frequently purchased at AllElectronics by
customers in Canada, the task relevant data can be specified by providing the following information: Name
of the database or data warehouse to be used (e.g., AllElectronics_db)
 Names of the tables or data cubes containing relevant data (e.g., item, customer, purchases and
items_sold)
 Conditions for selecting the relevant data (e.g., retrieve data pertaining to purchases made in Canada for
the current year)
 The relevant attributes or dimensions (e.g., name and price from the item table and income and age from
the customer table)
Kind of knowledge to be mined

 It is important to specify the knowledge to be mined, as this determines the data mining function to be
performed.
 Kinds of knowledge include concept description, association, classification, prediction and clustering.
 User can also provide pattern templates, also called metapatterns, metarules, or metaqueries.
Example: A user studying the buying habits of AllElectronics customers may choose to mine association rules of the form:
P(X: customer, W) ^ Q(X, Y) => buys(X, Z)
Metarules such as the following can be specified:
age(X, "30...39") ^ income(X, "40K...49K") => buys(X, "VCR") [2.2%, 60%]
occupation(X, "student") ^ age(X, "20...29") => buys(X, "computer") [1.4%, 70%]
Background knowledge

 It is the information about the domain to be mined


 Concept hierarchy: is a powerful form of background knowledge.
 Four major types of concept hierarchies: schema hierarchies, set-grouping hierarchies, operation-derived hierarchies, and rule-based hierarchies
Concept hierarchies

 A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level (more general) concepts.
 Allows data to be mined at multiple levels of abstraction.
 These allow users to view data from different perspectives, allowing further insight into the relationships.
Example (location)

 Rolling Up - Generalization of data. Allows viewing data at more meaningful and explicit abstractions, makes it easier to understand, compresses the data, and requires fewer input/output operations.
 Drilling Down - Specialization of data. Concept values are replaced by lower-level concepts.
 There may be more than one concept hierarchy for a given attribute or dimension, based on different user viewpoints.
Example: Regional sales manager may prefer the previous concept hierarchy but marketing manager might prefer
to see location with respect to linguistic lines in order to facilitate the distribution of commercial ads.
Data mining query languages

 Data mining language must be designed to facilitate flexible and effective knowledge discovery.
 Having a query language for data mining may help standardize the development of platforms for data
mining systems.
 But designing such a language is challenging because data mining covers a wide spectrum of tasks and each task has different requirements.
 Hence, the design of a language requires a deep understanding of the limitations and underlying mechanisms of the various kinds of tasks. So, how would you design an efficient query language?
 Based on the primitives discussed earlier.
 DMQL allows mining of different kinds of knowledge from relational databases and data warehouses at
multiple levels of abstraction.
Analytical Characterization

Analytical Characterization in Data Mining: Analytical characterization is a very important topic in data mining, and we will explain it with the following situation: we want to characterize a class or, in other words, compare classes. The confusing question is: what if we are not sure which attributes we should include for the class characterization or class comparison? If we specify too many attributes, these attributes can significantly slow down the overall process of data mining.
We can solve this problem with the help of analytical characterization.

Analytical characterization

Analytical characterization is used to help identify weakly relevant or irrelevant attributes. We can exclude these unwanted, irrelevant attributes when preparing our data for mining.
Why Analytical Characterization?
Analytical characterization is a very important activity in data mining due to the following reasons:
Due to the limitations of OLAP tools in handling complex objects.
Due to the lack of automated generalization, we must explicitly tell the system which attributes are irrelevant and must be removed, and similarly, we must explicitly tell the system which attributes are relevant and must be included in the class characterization.

UNIT- III
What Is Association Mining?

 Association rule mining – Finding frequent patterns, associations, correlations, or causal structures among
sets of items or objects in transaction databases, relational databases, and other information repositories.
 Applications – Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering,
classification, etc.
 Rule form: antecedent => consequent [support, confidence]
– computer => antivirus_software [support = 2%, confidence = 60%]
– buys(x, "computer") => buys(x, "antivirus software") [0.5%, 60%]
Association Rule: Basic Concepts
 Given a database of transactions, each transaction is a list of items (purchased by a customer in a visit)
 Find all rules that correlate the presence of one set of items with that of another set of items
 Find frequent patterns
An example of frequent itemset mining is market basket analysis.
Association rule performance measures
• Confidence
• Support
• Minimum support threshold
• Minimum confidence threshold
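A toy illustration of computing these two measures for the rule computer => antivirus_software (the transactions below are invented):

```python
# Invented transaction database represented as sets of items.
transactions = [
    {"computer", "antivirus_software"},
    {"computer", "printer"},
    {"computer", "antivirus_software", "printer"},
    {"printer"},
]

both = sum(1 for t in transactions if {"computer", "antivirus_software"} <= t)
antecedent = sum(1 for t in transactions if "computer" in t)

support = both / len(transactions)   # P(computer AND antivirus) = 0.5
confidence = both / antecedent       # P(antivirus | computer) ~ 0.67
print(support, confidence)
```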
Market Basket Analysis

 Shopping baskets
 Each item has a Boolean variable representing the presence or absence of that item
 Each basket can be represented by a Boolean vector of values assigned to these variables
 Identify patterns from the Boolean vectors
 Patterns can be represented by association rules.

Mining single-dimensional Boolean association rules from transactional databases: Apriori Algorithm

• Single-dimensional, single-level, Boolean frequent item sets
• Finding frequent item sets using candidate generation
• Generating association rules from frequent item sets
The Apriori Algorithm

 Join Step – Ck is generated by joining Lk-1 with itself
 Prune Step – Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
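A minimal sketch of these join and prune steps (not the full Apriori algorithm; the L2 itemsets below are invented):

```python
from itertools import combinations

def apriori_gen(frequent_k_minus_1, k):
    """Generate candidate k-itemsets C_k from the frequent (k-1)-itemsets L_{k-1}."""
    prev = [tuple(sorted(s)) for s in frequent_k_minus_1]
    candidates = set()
    for a in prev:
        for b in prev:
            # Join step: merge itemsets whose first k-2 items agree.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a + (b[-1],))
    # Prune step: drop candidates with an infrequent (k-1)-subset.
    prev_set = set(prev)
    return [c for c in candidates
            if all(tuple(sub) in prev_set for sub in combinations(c, k - 1))]

# L2 from a hypothetical run; generate and prune C3.
L2 = [("beer", "diaper"), ("beer", "milk"), ("diaper", "milk"), ("bread", "milk")]
print(apriori_gen(L2, 3))   # [('beer', 'diaper', 'milk')]
```

Support counting over the transaction database would then determine which of these candidates are actually frequent.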

How to Count Supports of Candidates?


Why is counting the supports of candidates a problem?
- The total number of candidates can be very huge
- One transaction may contain many candidates
• Method
– Candidate item sets are stored in a hash-tree
– A leaf node of the hash-tree contains a list of item sets and counts
– An interior node contains a hash table
– Subset function: finds all the candidates contained in a transaction

Example of Generating Candidates
Mining multilevel association rules from transactional databases
Mining various kinds of association rules
 Mining multilevel association rules – concepts at different levels
 Mining multidimensional association rules – more than one dimension
 Mining quantitative association rules – numeric attributes

Mining multidimensional association rules


Constraint-Based Association Mining.
Introduction: A data mining process may uncover thousands of rules from a given set of data, most of
which end up being unrelated or uninteresting to the users. Often, users have a good sense of which
“direction” of mining may lead to interesting patterns and the “form” of the patterns or rules they would
like to find. Thus, a good heuristic is to have the users specify such intuition or expectations as
constraints to confine the search space. This strategy is known as constraint-based mining. The
constraints can include the following:

 Knowledge type constraints: These specify the type of knowledge to be mined, such as association
or correlation.
 Data constraints: These specify the set of task-relevant data.
 Dimension/level constraints: These specify the desired dimensions (or attributes) of the data, or
levels of the concept hierarchies, to be used in mining.
 Interestingness constraints: These specify thresholds on statistical measures of rule
interestingness, such as support, confidence, and correlation.
Rule constraints: These specify the form of rules to be mined. Such constraints may be expressed as
metarules (rule templates), as the maximum or minimum number of predicates that can occur in the rule
antecedent or consequent, or as relationships among attributes, attribute values, and/or aggregates.

The above constraints can be specified using a high-level declarative data mining query language and
user interface.

The first four of the above types of constraints have already been addressed in earlier sections. In this section, we discuss the use of rule constraints to focus the mining task. This form of
constraint-based mining allows users to describe the rules that they would like to uncover, thereby
making the data mining process more effective. In addition, a sophisticated mining query optimizer can
be used to exploit the constraints specified by the user, thereby making the mining process more
efficient. Constraint-based mining encourages interactive exploratory mining and analysis.

Meta rule-Guided Mining of Association Rules

“How are metarules useful?” Metarules allow users to specify the syntactic form of rules that they are
interested in mining. The rule forms can be used as constraints to help improve the efficiency of the
mining process. Metarules may be based on the analyst’s experience, expectations, or intuition regarding
the data or may be automatically generated based on the database schema.

Constraint Pushing: Mining Guided by Rule Constraints

Rule constraints specify expected set/subset relationships of the variables in the mined rules, constant
initiation of variables, and aggregate functions. Users typically employ their knowledge of the application
or data to specify rule constraints for the mining task. These rule constraints may be used together with,
or as an alternative to, meta rule-guided mining. In this section, we examine rule constraints as to how
they can be used to make the mining process more efficient. Let’s study an example where rule
constraints are used to mine hybrid-dimensional association rules.
UNIT II
Data Warehouse and OLAP Technology for Data Mining
Data Warehouse Introduction: A data warehouse is a collection of data marts representing historical data from
different operations in the company. This data is stored in a structure optimized for querying and data analysis as
a data warehouse. Table design, dimensions and organization should be consistent throughout a data warehouse
so that reports or queries across the data warehouse are consistent. A data warehouse can also be viewed as a
database for historical data from different functions within a company.
The term Data Warehouse was coined by Bill Inmon in 1990, which he defined in the following way: "A warehouse
is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's
decision making process". He defined the terms in the sentence as follows:
Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing
operations. Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a
coherent whole.
Time-variant: All data in the data warehouse is identified with a particular time period.
Non-volatile: Data is stable in a data warehouse. More data is added but data is never removed.
This enables management to gain a consistent picture of the business. It is a single, complete and consistent store
of data obtained from a variety of different sources made available to end users in what they can understand and
use in a business context. It can be

 Used for decision Support


 Used to manage and control business
 Used by managers and end-users to understand the business and make judgments
Data Warehousing is an architectural construct of information systems that provides users with current and historical decision support information that is hard to access or present in traditional operational data stores.
Enterprise Data warehouse: It collects all information about subjects (customers, products, sales, assets,
personnel) that span the entire organization
Data Mart: Departmental subsets that focus on selected subjects. A data mart is a segment of a data warehouse
that can provide data for reporting and analysis on a section, unit, department or operation in the company, e.g.
sales, payroll, production. Data marts are sometimes complete individual data warehouses which are usually
smaller than the corporate data warehouse.
Decision Support System (DSS): Information technology to help the knowledge worker (executive, manager, and analyst) make faster and better decisions
Drill-down: Traversing the summarization levels from highly summarized data to the underlying current or old
detail
Metadata: Data about data. Containing location and description of warehouse system components: names,
definition, structure…

Benefits of data warehousing

 Data warehouses are designed to perform well with aggregate queries running on large amounts of data.
 The structure of data warehouses is easier for end users to navigate, understand and query against unlike
the relational databases primarily designed to handle lots of transactions.
 Data warehouses enable queries that cut across different segments of a company's operation. E.g.
production data could be compared against inventory data even if they were originally stored in different
databases with different structures.
 Queries that would be complex in very normalized databases could be easier to build and maintain in data
warehouses, decreasing the workload on transaction systems.
 Data warehousing is an efficient way to manage and report on data that is from a variety of sources, non
uniform and scattered throughout a company.
 Data warehousing is an efficient way to manage demand for lots of information from lots of users.
 Data warehousing provides the capability to analyze large amounts of historical data for nuggets of wisdom
that can provide an organization with competitive advantage.

Operational and informational Data
• Operational Data:
 Focusing on transactional functions such as bank card withdrawals and deposits
 Detailed
 Updateable
 Reflects current data
• Informational Data:
 Focusing on providing answers to problems posed by decision makers
 Summarized
 Non-updateable

Data Warehouse Characteristics

• A data warehouse can be viewed as an information system with the following attributes:
- It is a database designed for analytical tasks
- Its content is periodically updated
- It contains current and historical data to provide a historical perspective of information
Operational data store (ODS)

 ODS is an architecture concept to support day-to-day operational decision support and contains current
value data propagated from operational applications
 ODS is subject-oriented, similar to a classic definition of a Data warehouse

 ODS is integrated
Differences between Operational Database Systems and Data Warehouses
Features of OLTP and OLAP
1. Users and system orientation: An OLTP system is customer-oriented and is used for transaction and query
processing by clerks, clients, and information technology professionals. An OLAP system is market-oriented and is
used for data analysis by knowledge workers, including managers, executives, and analysts.
2. Data contents: An OLTP system manages current data that, typically, are too detailed to be easily used for decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use for informed decision making.
3. Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design. An OLAP system typically adopts either a star or a snowflake model and a subject-oriented database design.
4. View: An OLTP system focuses mainly on the current data within an enterprise or department, without referring
to historical data or data in different organizations. In contrast, an OLAP system often spans multiple versions of a
database schema. OLAP systems also deal with information that originates from different organizations,
integrating information from many data stores. Because of their huge volume, OLAP data are stored on multiple
storage media.
5. Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms. In contrast, accesses to OLAP systems are mostly read-only operations, although many could be complex queries (a small illustration follows this list).
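As a small, purely illustrative contrast between these access patterns, the following Python sketch (using pandas on a made-up sales table; all column names and values are hypothetical) performs an OLTP-style lookup of one current record and an OLAP-style read-only aggregation over many rows at a coarser granularity.

    import pandas as pd

    # Hypothetical transaction data standing in for an operational table.
    sales = pd.DataFrame({
        "order_id": [101, 102, 103, 104],
        "customer": ["Ann", "Bob", "Ann", "Cho"],
        "region":   ["East", "West", "East", "West"],
        "quarter":  ["Q1", "Q1", "Q2", "Q2"],
        "amount":   [250.0, 400.0, 150.0, 500.0],
    })

    # OLTP-style access: a short, record-level lookup on current data.
    one_order = sales.loc[sales["order_id"] == 102]

    # OLAP-style access: a read-only aggregation over many rows,
    # summarized at the coarser granularity of region and quarter.
    summary = sales.groupby(["region", "quarter"])["amount"].sum()

    print(one_order)
    print(summary)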
Multidimensional Data Model.
The most popular data model for data warehouses is a multidimensional model. This model can exist in the form
of a star schema, a snowflake schema, or a fact constellation schema. Let's have a look at each of these schema types.
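As a minimal sketch of the star-schema idea (an illustration, not the document's own example), the following Python snippet builds one fact table and two hypothetical dimension tables with pandas and answers a typical analytical question by joining and aggregating them. All table and column names are made up.

    import pandas as pd

    # Hypothetical dimension tables holding descriptive attributes.
    dim_item = pd.DataFrame({"item_key": [1, 2],
                             "item_name": ["TV", "Laptop"],
                             "brand": ["A", "B"]})
    dim_time = pd.DataFrame({"time_key": [10, 20],
                             "quarter": ["Q1", "Q2"]})

    # Fact table: foreign keys to each dimension plus a numeric measure.
    fact_sales = pd.DataFrame({"item_key": [1, 2, 1],
                               "time_key": [10, 10, 20],
                               "rs_sold": [1200.0, 3400.0, 800.0]})

    # A typical star-schema query: join the facts to the dimensions,
    # then aggregate the measure by the descriptive attributes.
    report = (fact_sales
              .merge(dim_item, on="item_key")
              .merge(dim_time, on="time_key")
              .groupby(["quarter", "item_name"])["rs_sold"].sum())
    print(report)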
From Tables and Spreadsheets to Data Cubes
Data warehouse architecture
Steps for the Design and Construction of Data Warehouse : This subsection presents a business analysis
framework for data warehouse design. The basic steps involved in the design process are also described.
The Design of a Data Warehouse: A Business Analysis Framework Four different views regarding the design of
a data warehouse must be considered: the top-down view, the data source view, the data warehouse view,
the business query view.
 The top-down view allows the selection of relevant information necessary for the data warehouse
 The data source view exposes the information being captured, stored and managed by operational
systems.
 The data warehouse view includes fact tables and dimension tables
 Finally the business query view is the Perspective of data in the data warehouse from the viewpoint of
the end user.

Introduction: Three-tier client-server architecture is also known as multi-tier architecture and signals the introduction of a middle tier to mediate between clients and
servers. The middle tier exists between the
user interface on the client side and database
management system (DBMS) on the server
side. This third layer executes process
management, which includes implementation
of business logic and rules. The three tier
models can accommodate hundreds of users.
It hides the complexity of process distribution
from the user, while being able to complete
complex tasks through message queuing,
application implementation, and data staging
or the storage of data before being uploaded
to the data warehouse.
The bottom tier is a warehouse database
server that is almost always a relational
database system. Back-end tools and utilities
are used to feed data into the bottom tier
from operational databases or other external sources (such as customer profile information provided by
external consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g.,
to merge similar data from different sources into a unified format), as well as load and refresh functions to
update the data warehouse (Section 3.3.3). The data are extracted using application program interfaces
known as gateways. A gateway is supported by the underlying DBMS and allows client programs to
generate SQL code to be executed at a server. Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity). This tier also contains a metadata repository, which stores information about the data warehouse and its contents. The metadata repository is further described in Section 3.3.4.
The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP
(ROLAP) model, that is, an extended relational DBMS that maps operations on multidimensional data to
standard relational operations; or (2) a multidimensional OLAP (MOLAP) model, that is, a special-
purpose server that directly implements multidimensional data and operations. OLAP servers are
discussed in Section 3.3.5.
The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or
data mining tools (e.g., trend analysis, prediction, and so on).

From the architecture point of view, there are three data warehouse models: the enterprise warehouse,
the data mart, and the virtual warehouse.

Enterprise warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire
organization. It provides corporate-wide data integration,
usually from one or more operational systems or external
information providers, and is cross-functional in
scope. It typically contains detailed data as well as
summarized data, and can range in size from a
few gigabytes to hundreds of gigabytes, terabytes,
or beyond.
Data mart: A data mart contains a subset of
corporate-wide data that is of value to a specific
group of users. The scope is connected to specific,
selected subjects. For example, a marketing data
mart may connect its subjects to customer, item,
and sales. The data contained in data marts tend to be
summarized. Depending on the source of data, data
marts can be categorized into the following two classes:
(i).Independent data marts are sourced from data
captured from one or more operational systems or
external information providers, or from data generated locally within a particular department or geographic area.
(ii).Dependent data marts are sourced directly from enterprise data warehouses.
Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers.

Figure: A recommended approach for data warehouse development.
Data Warehouse Implementation
1. Requirements analysis and capacity planning: The first process in data warehousing involves defining enterprise needs, defining the architecture, carrying out capacity planning, and selecting the hardware and software tools. This step also involves consulting senior management and the various stakeholders.

2. Hardware integration: Once the hardware and software have been selected, they need to be put together by integrating the servers, the storage systems, and the user software tools.

3. Modeling: Modeling is a significant stage that involves designing the warehouse schema and views. This may involve using a modeling tool if the data warehouse is sophisticated.

4. Physical modeling: For the data warehouse to perform efficiently, physical modeling is needed. This involves designing the physical data warehouse organization, data placement, data partitioning, deciding on access methods, and indexing.

5. Sources: The data for the data warehouse is likely to come from several sources. This step involves identifying and connecting the sources using gateways, ODBC drivers, or other wrappers.

6. ETL: The data from the source systems will need to go through an ETL phase. The process of designing and implementing the ETL phase may involve selecting suitable ETL tool vendors and purchasing and implementing the tools. It may also involve customizing the tools to suit the needs of the enterprise (a minimal sketch of the ETL idea follows these steps).

7. Populate the data warehouse: Once the ETL tools have been agreed upon, they will need to be tested, perhaps using a staging area. Once everything is working adequately, the ETL tools may be used to populate the warehouse given the schema and view definitions.

8. User applications: For the data warehouse to be helpful, there must be end-user applications. This step involves designing and implementing the applications required by the end users.

9. Roll-out the warehouse and applications: Once the data warehouse has been populated and the end-user applications tested, the warehouse system and the applications may be rolled out for the user community to use.
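A minimal, illustrative ETL sketch in Python for step 6: it extracts made-up operational rows, cleans and standardizes them, and loads them into an in-memory SQLite table standing in for the warehouse. The table names, columns, and values are hypothetical; a real implementation would read from operational sources through gateways such as ODBC or JDBC and load a production warehouse.

    import sqlite3
    import pandas as pd

    # Extract: pull rows from a hypothetical operational source (in-memory here).
    source = pd.DataFrame({
        "cust":   [" ann ", "BOB", None, "cho"],
        "amount": ["250", "400", "150", "500"],
    })

    # Transform: clean and standardize into the warehouse format.
    staged = source.dropna(subset=["cust"]).copy()
    staged["cust"] = staged["cust"].str.strip().str.title()
    staged["amount"] = staged["amount"].astype(float)

    # Load: populate the warehouse table (an in-memory SQLite database here).
    warehouse = sqlite3.connect(":memory:")
    staged.to_sql("fact_sales", warehouse, index=False)

    print(pd.read_sql(
        "SELECT cust, SUM(amount) AS total FROM fact_sales GROUP BY cust",
        warehouse))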

Implementation Guidelines

1. Build incrementally: Data warehouses must be built incrementally. Generally, it is recommended that a data mart be created with one particular project in mind; once it is implemented, several other sections of the enterprise may also want to implement similar systems. An enterprise data warehouse can then be implemented in an iterative manner, allowing all data marts to extract information from the data warehouse.

2. Need a champion: A data warehouse project must have a champion who is willing to carry out considerable research into the expected costs and benefits of the project. Data warehousing projects require inputs from many units in an enterprise and therefore need to be driven by someone who can interact with people across the enterprise and actively persuade colleagues.

3. Senior management support: A data warehouse project must be fully supported by senior management. Given the resource-intensive nature of such projects and the time they can take to implement, a warehouse project calls for a sustained commitment from senior management.

4. Ensure quality: Only data that has been cleaned and is of a quality accepted by the organization should be loaded into the data warehouse.

5. Corporate strategy: A data warehouse project must fit with the corporate strategy and business goals. The purpose of the project must be defined before the project begins.

6. Business plan: The financial costs (hardware, software, and peopleware), expected benefits, and a project plan for a data warehouse project must be clearly outlined and understood by all stakeholders. Without such understanding, rumors about expenditure and benefits can become the only sources of information, undermining the project.

7. Training: Data warehouse projects must not overlook training requirements. For a data warehouse project to be successful, the users must be trained to use the warehouse and to understand its capabilities.

8. Adaptability: The project should build in flexibility so that changes can be made to the data warehouse if and when required. Like any system, a data warehouse will need to change as the needs of the enterprise change.

9. Joint management: The project must be handled by both IT and business professionals in the enterprise. To ensure proper communication with stakeholders and to keep the project targeted at assisting the enterprise's business, business professionals must be involved in the project along with the technical professionals.

What is Data Cube?

When data is grouped or combined into multidimensional matrices, the resulting structures are called data cubes. The data cube method has a few alternative names or variants, such as "multidimensional databases," "materialized views," and "OLAP (On-Line Analytical Processing)."

The general idea of this approach is to materialize certain expensive computations that are frequently queried.

For example, a relation with the schema sales(part, supplier, customer, sale-price) can be materialized into a set of eight views as shown in the figure, where psc indicates a view consisting of aggregate function values (such as total-sales) computed by grouping the three attributes part, supplier, and customer; p indicates a view composed of the corresponding aggregate function values calculated by grouping part alone, and so on.

A data cube is created from a subset of attributes in the database. Specific attributes are chosen to be measure attributes, i.e., the attributes whose values are of interest. Other attributes are selected as dimensions or functional attributes. The measure attributes are aggregated according to the dimensions. For example, XYZ may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, branch, and location. These dimensions enable the store to keep track of things like monthly sales of items and the branches and locations at which the items were sold. Each dimension may have a table associated with it, known as a dimension table, which describes the dimension. For example, a dimension table for items may contain the attributes item_name, brand, and type.

Data cube method is an interesting technique with many applications. Data cubes could be sparse in many
cases because not every cell in each dimension may have corresponding data in the database.

Techniques should be developed to handle sparse cubes efficiently.

If a query contains constants at even lower levels than those provided in a data cube, it is not clear how to
make the best use of the precomputed results stored in the data cube.

This model views data in the form of a data cube. OLAP tools are based on the multidimensional data model. Data cubes usually model n-dimensional data.

A data cube enables data to be modeled and viewed in multiple dimensions. A multidimensional data model is organized around a central theme, like sales and transactions. A fact table represents this theme. Facts are numerical measures. Thus, the fact table contains measures (such as Rs_sold) and keys to each of the related dimension tables.

Dimensions are the perspectives or entities with respect to which an organization keeps records, and they define a data cube. Facts are generally quantities, which are used for analyzing the relationships between dimensions.

Example: In the 2-D representation, we look at the AllElectronics sales data for items sold per quarter in the city of Vancouver. The measure displayed is dollars sold (in thousands).

3-Dimensional Cuboids

Suppose we would like to view the sales data with a third dimension. For example, suppose we would like to view the data according to time and item, as well as location, for the cities Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars sold (in thousands). These 3-D data are shown in the table; the 3-D data of the table can be represented as a series of 2-D tables.

Conceptually, we may also represent the same data in the form of a 3-D data cube, as shown in the figure.
Let us suppose that we would like to view our sales data with an
additional fourth dimension, such as a supplier.

In data warehousing, the data cubes are n-dimensional. The cuboid which
holds the lowest level of summarization is called a base cuboid.

For example, the 4-D cuboid in the figure is the base cuboid for the given
time, item, location, and supplier dimensions.

The figure shows a 4-D data cube representation of sales data, according to the dimensions time, item, location, and supplier. The measure displayed is dollars sold (in thousands).

The topmost 0-D cuboid, which holds the highest level of summarization, is known as the apex cuboid. In
this example, this is the total sales, or dollars sold, summarized over all four dimensions.

The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids creating a 4-D data cube for the dimensions time, item, location, and supplier. Each cuboid represents a different degree of summarization.
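As a concrete illustration of the lattice idea, the following Python sketch (using pandas on made-up sales rows; all names and values are hypothetical) enumerates every cuboid of a small cube with three dimensions, from the base cuboid down to the apex cuboid. With three dimensions it prints 2^3 = 8 cuboids, mirroring the eight views (psc, ps, pc, sc, p, s, c, and the overall total) discussed earlier.

    from itertools import combinations
    import pandas as pd

    # Hypothetical sales records: three dimensions and one measure.
    sales = pd.DataFrame({
        "time":     ["Q1", "Q1", "Q2", "Q2"],
        "item":     ["TV", "PC", "TV", "PC"],
        "location": ["Delhi", "Delhi", "Mumbai", "Mumbai"],
        "rs_sold":  [605.0, 825.0, 400.0, 952.0],
    })
    dims = ["time", "item", "location"]

    # Enumerate the lattice of cuboids: every subset of the dimensions.
    for k in range(len(dims), -1, -1):
        for group in combinations(dims, k):
            if group:                     # k-D cuboids (k = 3 is the base cuboid)
                cuboid = sales.groupby(list(group))["rs_sold"].sum()
            else:                         # 0-D apex cuboid: total over all dimensions
                cuboid = sales["rs_sold"].sum()
            print(group or "(apex)", cuboid, sep="\n")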
UNIT V

Classification and Predication in Data Mining

There are two forms of data analysis that can be used to extract models describing important classes or
predict future data trends. These two forms are as follows:

1. Classification
2. Prediction

We use classification and prediction to extract a model representing the data classes or to predict future data trends. Classification predicts categorical class labels, while prediction models continuous-valued functions. This analysis provides us with a better understanding of the data at a large scale.

Classification models predict categorical class labels, and prediction models predict continuous-valued
functions. For example, we can build a classification model to categorize bank loan applications as either
safe or risky or a prediction model to predict the expenditures in dollars of potential customers on computer
equipment given their income and occupation.

What is Classification?

Classification is to identify the category or the class label of a new observation. First, a set of data is used
as training data. The set of input data and the corresponding outputs are given to the algorithm. So, the
training data set includes the input data and their
associated class labels. Using the training dataset, the
algorithm derives a model or the classifier. The derived
model can be a decision tree, mathematical formula, or
a neural network. In classification, when unlabeled data
is given to the model, it should find the class to which it
belongs. The new data provided to the model is the test
data set.

Classification is the process of classifying a record. One simple example of classification is to check whether it is raining or not. The answer can either be yes or no, so there are a fixed number of choices. Sometimes there can be more than two classes to classify; that is called multiclass classification.

The bank needs to analyze whether giving a loan to a particular customer is risky or not. For example,
based on observable data for multiple loan borrowers, a classification model may be established that
forecasts credit risk. The data could track job records, homeownership or leasing, years of residency,
number, type of deposits, historical credit ranking, etc. The goal would be credit ranking, the predictors
would be the other characteristics, and the data would represent a case for each consumer. In this example, a model is constructed to find the categorical label. The labels are risky or safe.
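To make the two learning steps concrete, here is a minimal sketch using scikit-learn's decision tree classifier on made-up loan data; the feature names, values, and labels are purely illustrative and are not taken from the text.

    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical training records: [income (thousands), years_employed, existing_loans]
    X_train = [[25, 1, 2], [60, 8, 0], [35, 3, 1], [80, 12, 0], [20, 0, 3]]
    y_train = ["risky", "safe", "risky", "safe", "risky"]   # known class labels

    # Learning step: derive a classifier (here, a decision tree) from the training set.
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_train, y_train)

    # Classification step: assign a class label to a new, unlabeled applicant.
    new_applicant = [[45, 5, 1]]
    print(clf.predict(new_applicant))    # e.g. ['safe']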
How does Classification Works?

The functioning of classification, with the assistance of the bank loan application, has been mentioned above. There are two stages in the data classification system: classifier or model creation, and applying the classifier for classification.

1. Developing the classifier or model creation: This level is the learning stage or the learning process. The classification algorithms construct the classifier in this stage. A classifier is constructed from a training set composed of database records and their corresponding class names. Each record that makes up the training set is assumed to belong to a predefined category or class; these records may also be referred to as samples, objects, or data points.
2. Applying the classifier for classification: The classifier is used for classification at this level. The test data are used here to estimate the accuracy of the classification algorithm. If the accuracy is deemed sufficient, the classification rules can be applied to new data records. Applications of classification include:

Sentiment Analysis: Sentiment analysis is highly helpful in social media monitoring. We can use it to extract
social media insights. We can build sentiment analysis models to read and analyze misspelled words with
advanced machine learning algorithms. Accurately trained models provide consistently accurate outcomes in a fraction of the time.

Document Classification: We can use document classification to organize the documents into sections
according to the content. Document classification refers to text classification; we can classify the words in
the entire document. And with the help of machine learning classification algorithms, we can execute it
automatically.

Image Classification: Image classification assigns an image to one of a set of trained categories, such as the caption of the image, a statistical value, or a theme. You can tag images to train your model for relevant categories by applying supervised learning algorithms.

Machine Learning Classification: It uses the statistically demonstrable algorithm rules to execute
analytical tasks that would take humans hundreds of more hours to perform.

3. Data Classification Process: The data classification process can be categorized into five steps:

 Create the goals, strategy, workflows, and architecture of data classification.
 Classify the confidential details that we store.
 Apply labels by tagging the data.
 Use the results to improve protection and compliance.
 Data is dynamic, and classification is a continuous process.
What is Data Classification Lifecycle?

The data classification life cycle provides an excellent structure for controlling the flow of data in an enterprise. Businesses need to account for data security and compliance at each level. With the help of data classification, we can do this at every stage, from origin to deletion. The data life cycle has the following stages:

1. Origin: Sensitive data is produced in various formats, such as emails, Excel, Word, Google documents, social media, and websites.
2. Role-based practice: Role-based security restrictions are applied to all sensitive data by tagging it based on in-house protection policies and compliance rules.
3. Storage: The collected data is stored with appropriate access controls and encryption.
4. Sharing: Data is continually distributed among agents, consumers, and co-workers from various devices and platforms.
5. Archive: Data is eventually archived within an industry's storage systems.
6. Publication: Through the publication of data, it can reach customers, who can then view and download it in the form of dashboards.

What is Prediction?

Another process of data analysis is prediction. It is used to find a numerical output. As in classification, the training dataset contains the inputs and corresponding numerical output values. The algorithm derives the model or a predictor according to the training dataset. The model should find a numerical output when the new data is given. Unlike in classification, this method does not have a class label. The model predicts a continuous-valued function or ordered value.

Regression is generally used for prediction. Predicting the value of a house based on facts such as the number of rooms, the total area, etc., is an example of prediction.

For example, suppose the marketing manager needs to predict how much a particular customer will spend at his company during a sale. In this case, we need to forecast a numerical value, so this data analysis task is an example of numeric prediction. A model or predictor will be developed that forecasts a continuous-valued or ordered function.
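As a small illustration of numeric prediction by regression, the sketch below fits a linear model to made-up house data with scikit-learn; the features, prices, and units are hypothetical.

    from sklearn.linear_model import LinearRegression

    # Hypothetical training data: [number_of_rooms, total_area_sq_m] -> price
    X_train = [[2, 60], [3, 85], [4, 120], [5, 160]]
    y_train = [30.0, 45.0, 70.0, 95.0]    # price (continuous-valued target)

    # The predictor models a continuous-valued function rather than a class label.
    reg = LinearRegression()
    reg.fit(X_train, y_train)

    print(reg.predict([[3, 90]]))         # predicted price for a new house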

Classification and Prediction Issues


1. Data Cleaning: Data cleaning involves removing the noise and treating missing values. The noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
2. Relevance Analysis: The database may also have irrelevant attributes. Correlation analysis is used to know whether any two given attributes are related.
3. Data Transformation and Reduction: The data can be transformed by any of the following methods.
4. Normalization: The data is transformed using normalization. Normalization involves scaling all values for a given attribute so that they fall within a small specified range. Normalization is used when neural networks or methods involving distance measurements are used in the learning step (a minimal sketch follows this list).
5. Generalization: The data can also be transformed by generalizing it to a higher-level concept. For this purpose, we can use concept hierarchies.
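The following minimal Python sketch illustrates the min-max normalization mentioned in step 4, using made-up income values; the attribute and the target range are hypothetical.

    import numpy as np

    # Hypothetical attribute values (e.g. annual income in thousands).
    income = np.array([20.0, 35.0, 60.0, 80.0])

    # Min-max normalization: rescale all values into [0, 1] so that attributes
    # with large ranges do not dominate distance-based or neural methods.
    normalized = (income - income.min()) / (income.max() - income.min())
    print(normalized)    # [0.   0.25 0.666...  1. ]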

Comparison of Classification and Prediction Methods

o Accuracy: The accuracy of the classifier refers to the ability of the classifier to predict the class label correctly, and the accuracy of the predictor refers to how well a given predictor can estimate the unknown value.
o Speed: The speed of the method depends on the computational cost of generating and using the
classifier or predictor.

o Robustness: Robustness is the ability to make correct predictions or classifications. In the context
of data mining, robustness is the ability of the classifier or predictor to make correct predictions from
incoming unknown data.

Classification | Prediction

Classification is the process of identifying which category a new observation belongs to, based on a training data set containing observations whose category membership is known. | Prediction is the process of identifying the missing or unavailable numerical data for a new observation.

In classification, the accuracy depends on finding the class label correctly. | In prediction, the accuracy depends on how well a given predictor can guess the value of a predicted attribute for new data.

In classification, the model can be known as the classifier. | In prediction, the model can be known as the predictor.

A model or classifier is constructed to find the categorical labels. | A model or predictor is constructed that predicts a continuous-valued function or ordered value.

For example, the grouping of patients based on their medical records can be considered classification. | For example, we can think of prediction as predicting the correct treatment for a particular disease for a person.

o Scalability: Scalability refers to the ability to construct the classifier or predictor efficiently given large amounts of data.

o Interpretability: Interpretability is how readily we can understand the reasoning behind predictions or classifications made by the predictor or classifier.

Difference between Classification and Prediction

The decision tree, applied to existing data, is a classification model. We can obtain a class prediction by applying it to new data for which the class is unknown. The assumption is that the new data come from a distribution similar to the data we used to construct our decision tree. In many instances this is a correct assumption, so we can use the decision tree to build a predictive model. Classification or prediction is the process of finding a model that describes the classes or concepts of the data; the purpose is to predict the class of objects whose class label is unknown using this model. The major differences between classification and prediction are summarized in the table above.

Clustering in Data Mining

Clustering helps to split data into several subsets. Each of these subsets contains data similar to each other, and these subsets are called clusters. Once the data from our customer base is divided into clusters, we can make an informed decision about who we think is best suited for this product.

Let's understand this with an example. Suppose we are a market manager, and we have a new tempting product to sell. We are sure that the product would bring enormous profit, as long as it is sold to the right people. So, how can we tell who is best suited for the product from our company's huge customer base?

Clustering, falling under the category of unsupervised machine learning, is one of the problems that machine learning algorithms solve. Clustering only utilizes input data to determine patterns, anomalies, or similarities in the data.

A good clustering algorithm aims to obtain clusters whose:

o Intra-cluster similarities are high; this implies that the data present inside the cluster is similar to one another.
o Inter-cluster similarity is low; each cluster holds data that is not similar to the data in other clusters.

What is a Cluster?

o A cluster is a subset of similar objects.
o A subset of objects such that the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object not located inside it.
o A connected region of a multidimensional space with a comparatively high density of objects.

What is clustering in Data Mining?

o Clustering is the method of converting a group of abstract objects into classes of similar objects.
o Clustering is a method of partitioning a set of data or objects into a set of significant subclasses
called clusters.
o It helps users to understand the structure or natural grouping in a data set, and it is used either as a stand-alone tool to get better insight into the data distribution or as a pre-processing step for other algorithms.

Important points:

o Data objects of a cluster can be considered as one group.
o We first partition the data set into groups while doing cluster analysis, based on data similarities, and then assign labels to the groups.
o The main advantage of clustering over classification is that it is adaptable to changes and helps single out the important characteristics that differentiate distinct groups.

Applications of cluster analysis in data mining:

o In many applications, clustering analysis is widely used, such as data analysis, market research,
pattern recognition, and image processing.
o It assists marketers in finding different groups in their client base and, based on purchasing patterns, characterizing their customer groups.
o It helps in classifying documents on the web for information discovery.
o Clustering is also used in tracking applications such as detection of credit card fraud.
o As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of
data to analyze the characteristics of each cluster.
o In terms of biology, It can be used to determine plant and animal taxonomies, categorization of
genes with the same functionalities and gain insight into structure inherent to populations.
o It helps in the identification of areas of similar land that are used in an earth observation database
and the identification of house groups in a city according to house type, value, and geographical
location.

Why is clustering used in data mining?

Clustering analysis has been an evolving problem in data mining due to its variety of applications. The
advent of various data clustering tools in the last few years and their comprehensive use in a broad range
of applications, including image processing, computational biology, mobile communication, medicine, and
economics, has contributed to the popularity of these algorithms. The main issue with data clustering algorithms is that they cannot be standardized. An advanced algorithm may give the best results with one type of data set but may fail or perform poorly with other kinds of data sets. Although many efforts have been made to standardize algorithms that can perform well in all situations, no significant achievement has been made so far. Many clustering tools have been proposed, but each algorithm has its own advantages and disadvantages and cannot work in all real situations.

1. Scalability:

Scalability in clustering implies that as we boost the amount of data objects, the time to perform clustering
should approximately scale to the complexity order of the algorithm. For example, if we perform K- means
clustering, we know it is O(n), where n is the number of objects in the data. If we raise the number of data
objects 10 folds, then the time taken to cluster them should also approximately increase 10 times. It means there
should be a linear relationship. If that is not the case, then there is some error with our implementation process.

The algorithm should be scalable; if it is not, we cannot get the appropriate result. The figure illustrates a graphical example where a lack of scalability may lead to the wrong result.

2. Interpretability:

The outcomes of clustering should be interpretable, comprehensible, and usable.

3. Discovery of clusters with arbitrary shape:

The clustering algorithm should be able to find clusters of arbitrary shape. It should not be limited to distance measures that tend to discover only spherical clusters of small size.

4. Ability to deal with different types of attributes:


Algorithms should be capable of being applied to any data such as data based on intervals (numeric), binary
data, and categorical data.

5. Ability to deal with noisy data:

Databases contain data that is noisy, missing, or incorrect. Some algorithms are sensitive to such data and may result in poor-quality clusters.

6. High dimensionality:

The clustering tools should be able to handle not only high-dimensional data space but also low-dimensional space.

Clustering Methods

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Partitioning Clustering

It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based method. The most common example of partitioning clustering is the K-Means Clustering algorithm.

In this type, the dataset is divided into a set of k groups, where k defines the number of pre-defined groups. The cluster centers are created in such a way that the distance between the data points of one cluster and their own centroid is minimal compared to the distance to another cluster's centroid.
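A minimal sketch of partitioning clustering with scikit-learn's K-Means on made-up customer data; the features, values, and the choice of k = 2 are purely illustrative.

    from sklearn.cluster import KMeans

    # Hypothetical customer data: [annual_spend, visits_per_month]
    X = [[200, 2], [220, 3], [800, 10], [850, 12], [210, 2], [790, 11]]

    # Partition the data into k = 2 non-hierarchical groups around centroids.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)

    print(labels)                   # cluster index assigned to each customer
    print(kmeans.cluster_centers_)  # the learned centroids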

Density-Based Clustering

The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped distributions are formed as long as the dense regions can be connected. The algorithm works by identifying different clusters in the dataset and connecting the areas of high density into clusters. The dense areas in the data space are separated from each other by sparser areas.

These algorithms can face difficulty in clustering the data points if the dataset has varying densities and high dimensionality.
Distribution Model-Based Clustering

In the distribution model-based clustering method, the data is divided based on the probability that a data point belongs to a particular distribution. The grouping is done by assuming some distribution, most commonly the Gaussian distribution.

The example of this type is the Expectation-Maximization clustering algorithm, which uses Gaussian Mixture Models (GMM).

Hierarchical Clustering

Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no requirement to pre-specify the number of clusters to be created. In this technique, the dataset is divided into clusters to create a tree-like structure, which is also called a dendrogram. The observations, or any desired number of clusters, can be selected by cutting the tree at the appropriate level. The most common example of this method is the Agglomerative Hierarchical algorithm.
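A minimal sketch of bottom-up (agglomerative) hierarchical clustering with scikit-learn, reusing the same kind of made-up customer data as above; the values and the requested number of clusters are illustrative.

    from sklearn.cluster import AgglomerativeClustering

    # Hypothetical customer data: [annual_spend, visits_per_month]
    X = [[200, 2], [220, 3], [800, 10], [850, 12], [210, 2], [790, 11]]

    # Agglomerative clustering: every point starts as its own cluster and the
    # closest clusters are merged until the requested number of clusters remains.
    agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
    print(agg.fit_predict(X))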

Fuzzy Clustering

Fuzzy clustering is a type of soft clustering method in which a data object may belong to more than one group or cluster. Each data object has a set of membership coefficients, which depend on its degree of membership in each cluster. The Fuzzy C-means algorithm is the example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.

Clustering Algorithms

The clustering algorithms can be divided based on the models explained above. Many types of clustering algorithms have been published, but only a few are commonly used. The choice of clustering algorithm depends on the kind of data we are using. For example, some algorithms require the number of clusters in the given dataset to be specified in advance, whereas others work by finding the minimum distance between observations of the dataset.

Here we discuss the most popular clustering algorithms that are widely used in machine learning:

1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It
classifies the dataset by dividing the samples into different clusters of equal variances. The number
of clusters must be specified in this algorithm. It is fast with fewer computations required, with the
linear complexity of O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth density of
data points. It is an example of a centroid-based model, that works on updating the candidates for
centroid to be the center of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with Noise. It is an example of a density-based model, similar to mean-shift, but with some remarkable advantages. In this algorithm, the areas of high density are separated by areas of low density, so clusters can be found in any arbitrary shape (a brief sketch follows this list).
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative to the k-means algorithm or for cases where k-means fails. In GMM, it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The agglomerative hierarchical algorithm performs bottom-up hierarchical clustering. In this, each data point is treated as a single cluster at the outset, and clusters are then successively merged. The cluster hierarchy can be represented as a tree structure.
6. Affinity Propagation: It is different from other clustering algorithms in that it does not require the number of clusters to be specified. In this algorithm, data points exchange messages in pairs until convergence. Its O(N²T) time complexity is the main drawback of this algorithm.
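As the brief sketch promised in item 3, the following example runs scikit-learn's DBSCAN on made-up 2-D points; the eps and min_samples values are illustrative choices, not prescribed settings.

    from sklearn.cluster import DBSCAN

    # Hypothetical 2-D points: two dense groups plus one isolated outlier.
    X = [[1, 1], [1.2, 0.9], [0.9, 1.1],
         [8, 8], [8.1, 7.9], [7.9, 8.2],
         [25, 25]]

    # DBSCAN groups points that lie in dense regions; sparse points are
    # labeled -1 (noise), and clusters may take any arbitrary shape.
    db = DBSCAN(eps=1.0, min_samples=2)
    print(db.fit_predict(X))    # e.g. [0 0 0 1 1 1 -1]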

Applications of Clustering

Below are some commonly known applications of clustering technique in Machine Learning:

o In Identification of Cancer Cells: The clustering algorithms are widely used for the identification of
cancerous cells. It divides the cancerous and non-cancerous data sets into different groups.
o In Search Engines: Search engines also work on the clustering technique. The search result appears
based on the closest object to the search query. It does it by grouping similar data objects in one
group that is far from the other dissimilar objects. The accurate result of a query depends on the
quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the customers based on their
choice and preferences.
o In Biology: It is used in the biology stream to classify different species of plants and animals using
the image recognition technique.
o In Land Use: The clustering technique is used to identify areas of similar land use in a GIS database. This can be very useful for determining the purpose for which a particular piece of land is most suitable.

Data Mining tools

Data Mining tools have the objective of discovering patterns/trends/groupings among large sets of data
and transforming data into more refined information.

It is a framework, such as RStudio or Tableau, that allows you to perform different types of data mining analysis.

We can run various algorithms, such as clustering or classification, on our data set and visualize the results. Such a framework, which provides better insight into our data and the phenomena the data represent, is called a data mining tool.
1. Orange Data Mining:

Orange is a comprehensive machine learning and data mining software suite. It supports visualization and is component-based software written in the Python computing language and developed at the bioinformatics laboratory of the Faculty of Computer and Information Science, Ljubljana University, Slovenia.

Because it is component-based software, the components of Orange are called "widgets." These widgets range from preprocessing and data visualization to the assessment of algorithms and predictive modeling.

Widgets deliver significant functionalities such as:

o Displaying data tables and allowing the selection of features
o Data reading
o Training predictors and comparing learning algorithms
o Visualizing data elements, etc.

2. SAS Data Mining:

SAS stands for Statistical Analysis System. It is a product of the SAS Institute created for analytics and data management. SAS can mine data, transform it, manage information from various sources, and perform statistical analysis. It offers a graphical UI for non-technical users.

The SAS data miner allows users to analyze big data and provides accurate insights for timely decision-making. SAS has a distributed memory processing architecture that is highly scalable. It is suitable for data mining, optimization, and text mining purposes.

3. DataMelt Data Mining:

DataMelt is a computation and visualization environment which offers an interactive framework for data analysis and visualization. It is primarily designed for students, engineers, and scientists. It is also known as DMelt.

DMelt is a multi-platform utility written in Java. It can run on any operating system compatible with the JVM (Java Virtual Machine). It consists of science and mathematics libraries:

o Scientific libraries: used for drawing 2D/3D plots.
o Mathematical libraries: used for random number generation, algorithms, curve fitting, etc.

DMelt can be used for the analysis of the large volume of data, data mining, and statistical analysis. It is
extensively used in natural sciences, financial markets, and engineering.

4. Rattle:

Rattle is a GUI-based data mining tool that uses the R statistical programming language. Rattle exposes the statistical power of R by offering significant data mining features. While Rattle has a comprehensive and well-developed user interface, it also has an integrated log tab that records the R code for any GUI operation.

The data set produced by Rattle can be viewed and edited. Rattle also provides the facility to review the code, use it for other purposes, and extend the code without any restriction.

5. Rapid Miner:

RapidMiner is one of the most popular predictive analytics systems, created by the company of the same name. It is written in the Java programming language. It offers an integrated environment for text mining, deep learning, machine learning, and predictive analytics.

The tool can be used for a wide range of applications, including business applications, commercial applications, research, education, training, application development, and machine learning.

RapidMiner can be deployed on-site as well as in public or private cloud infrastructure. It has a client/server model as its base. RapidMiner comes with template-based frameworks that enable fast delivery with few errors (which are commonly expected in the manual code writing process).
