
TITLE OF THE COURSE: DATA WAREHOUSING AND DATA

MINING

SEMESTER: V

PROGRAMME: BCA

E-CONTENT PREPARED BY: Dr.BHARANIDHARAN.G, Ph.D.


ASST. PROFESSOR, BCA

COLLEGE: THE NEW COLLEGE, CHENNAI-14.



UNIT I
DATA WAREHOUSING
Data warehousing Components –Building a Data warehouse – Mapping the Data Warehouse
to a Multiprocessor Architecture – DBMS Schemas for Decision Support – Data Extraction,
Cleanup, and Transformation Tools –Metadata.

Data Warehouse Introduction


A data warehouse is a collection of data marts representing historical data from different
operations in the company. This data is stored in a structure optimized for querying and data
analysis. Table design, dimensions and organization should be consistent
throughout a data warehouse so that reports or queries across the data warehouse are consistent.
A data warehouse can also be viewed as a database for historical data from different
functions within a company. The term Data Warehouse was coined by Bill Inmon in 1990, who
defined it in the following way: "A warehouse is a subject-oriented, integrated, time-variant and
non-volatile collection of data in support of management's decision making process". He defined
the terms in the sentence as follows:
 Subject Oriented: Data that gives information about a particular subject instead of about a
company's ongoing operations.
 Integrated: Data that is gathered into the data warehouse from a variety of sources and
merged into a coherent whole.
 Time-variant: All data in the data warehouse is identified with a particular time period.
 Non-volatile: Data is stable in a data warehouse. More data is added but data is never
removed. This enables management to gain a consistent picture of the business.
A data warehouse is thus a single, complete and consistent store of data obtained from a variety of
different sources, made available to end users in a form they can understand and use in a business
context. It can be used for decision support, used to manage and control the business, and used by
managers and end users to understand the business and make judgments.


Data Warehousing is an architectural construct of information systems that provides users
with current and historical decision support information that is hard to access or present in
traditional operational data stores.

Other important terminology


 Enterprise Data warehouse: It collects all information about subjects (customers,
products, sales, assets, personnel) that span the entire organization
 Data Mart: Departmental subsets that focus on selected subjects. A data mart is a segment
of a data warehouse that can provide data for reporting and analysis on a section, unit,
department or operation in the company, e.g. sales, payroll, production. Data marts are
sometimes complete individual data warehouses which are usually smaller than the
corporate data warehouse.
 Decision Support System (DSS): Information technology to help the knowledge worker
(executive, manager, and analyst) make faster and better decisions
 Drill-down: Traversing the summarization levels from highly summarized data to the
underlying current or old detail
 Metadata: Data about data. It contains the location and description of warehouse system
components: names, definitions, structure, etc.

Benefits of data warehousing


Data warehouses are designed to perform well with aggregate queries running on large
amounts of data.
The structure of data warehouses is easier for end users to navigate, understand and
query against than that of relational databases primarily designed to handle large volumes of transactions.
Data warehouses enable queries that cut across different segments of a company's
operation. E.g. production data could be compared against inventory data even if they were
originally stored in different databases with different structures.


Queries that would be complex in highly normalized databases can be easier to build
and maintain in data warehouses, decreasing the workload on transaction systems.
Data warehousing is an efficient way to manage and report on data that comes from a variety
of sources, is non-uniform, and is scattered throughout a company.
Data warehousing is an efficient way to manage demand for lots of information from
lots of users.
Data warehousing provides the capability to analyze large amounts of historical data for
nuggets of wisdom that can provide an organization with competitive advantage.

Operational and informational Data


Operational Data:
Focusing on transactional function such as bank card withdrawals and deposits
Detailed
Updateable
Reflects current data

Informational Data:
Focusing on providing answers to problems posed by decision makers
Summarized
Non updateable


Data Warehouse Characteristics


• A data warehouse can be viewed as an information system with the following attributes:
– It is a database designed for analytical tasks
– Its content is periodically updated
– It contains current and historical data to provide a historical perspective of information

Operational data store (ODS)


• ODS is an architecture concept to support day-to-day operational decision support and contains
current value data propagated from operational applications
• ODS is subject-oriented, similar to a classic definition of a Data warehouse
• ODS is integrated

Data warehouse Architecture and its seven components


1. Data sourcing, cleanup, transformation, and migration tools
2. Metadata repository
3. Warehouse/database technology
4. Data marts
5. Data query, reporting, analysis, and mining tools
6. Data warehouse administration and management
7. Information delivery system

A data warehouse is an environment, not a product. It is based on a relational database
management system that functions as the central repository for informational data. The central
repository is surrounded by a number of key components designed to make the environment
functional, manageable and accessible.


The data sources for a data warehouse are the operational applications. The data entering
the data warehouse is transformed into an integrated structure and format. The
transformation process involves conversion, summarization, filtering and condensation. The data
warehouse must be capable of holding and managing large volumes of data, as well as different
data structures, over time.
1. Data warehouse database
This is the central part of the data warehousing environment (item number 2 in the
architecture diagram above). It is implemented based on RDBMS technology.
2. Sourcing, Acquisition, Cleanup, and Transformation Tools
These are item number 1 in the architecture diagram above. They perform conversions,
summarization, key changes, structural changes and condensation. The data transformation is
required so that the information can be used by decision support tools. The transformation
produces programs, control statements, JCL code, COBOL code, UNIX scripts, SQL DDL code,
etc., to move the data into the data warehouse from multiple operational systems.
The functionalities of these tools are listed below (a sketch of such a transformation follows the
list):
Removing unwanted data from the operational database
Converting to common data names and attributes
Calculating summaries and derived data
Establishing defaults for missing data
Accommodating source data definition changes
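For illustration only, here is a minimal sketch of the kind of load statement such a tool might
generate, assuming hypothetical source table op_orders and target table dw_sales (generic SQL,
not the output of any particular product). It renames operational columns to common warehouse
names, establishes a default for missing region codes and calculates a daily summary.

-- Hypothetical generated transformation: load a daily sales summary.
INSERT INTO dw_sales (customer_id, sale_date, region_code, total_amount)
SELECT
    o.cust_no                        AS customer_id,   -- convert to common data name
    o.ord_dt                         AS sale_date,
    COALESCE(o.region_cd, 'UNKNOWN') AS region_code,   -- default for missing data
    SUM(o.line_amt)                  AS total_amount   -- calculated summary
FROM op_orders o
WHERE o.ord_dt >= DATE '2024-01-01'                    -- extract window only
GROUP BY o.cust_no, o.ord_dt, COALESCE(o.region_cd, 'UNKNOWN');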

Issues to be considered during data sourcing, cleanup, extraction and transformation:

Data heterogeneity: this refers to the differing nature of the source DBMSs: they may use
different data models, different access languages, and different data navigation methods,
operations, concurrency, integrity and recovery processes, etc.

3. Meta data
It is data about data. It is used for maintaining, managing and using the data warehouse. It
is classified into two:
1. Technical Meta data: It contains information about warehouse data used by the warehouse
designer and administrator to carry out development and management tasks. It includes:
Information about data stores
Transformation descriptions, that is, the mapping methods from the operational database to the
warehouse database
Warehouse object and data structure definitions for the target data
The rules used to perform clean up and data enhancement
Data mapping operations
Access authorization, backup history, archive history, information delivery history, data
acquisition history, data access, etc.
2. Business Meta data: It contains information that helps users understand the data stored in the
data warehouse. It includes:
Subject areas and information object types, including queries, reports, images, video and audio
clips, etc.
Internet home pages
Information related to the information delivery system
Data warehouse operational information such as ownership, audit trails, etc.
Meta data helps the users to understand the content and find the data. Meta data is stored in a
separate data store known as the information directory or Meta data repository, which
helps to integrate, maintain and view the contents of the data warehouse.
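As an illustration of technical Meta data, the sketch below defines a hypothetical mapping table
that such a repository might hold; the table and column names are assumptions, not part of any
specific product.

-- Hypothetical technical-metadata table recording source-to-warehouse mappings.
CREATE TABLE meta_column_mapping (
    mapping_id     INTEGER PRIMARY KEY,
    source_system  VARCHAR(50),    -- operational source name
    source_table   VARCHAR(100),
    source_column  VARCHAR(100),
    target_table   VARCHAR(100),   -- warehouse object
    target_column  VARCHAR(100),
    transform_rule VARCHAR(400),   -- e.g. 'COALESCE(region_cd, ''UNKNOWN'')'
    load_frequency VARCHAR(20)     -- e.g. 'DAILY'
);

-- A warehouse user or tool can then ask where a warehouse column comes from.
SELECT source_system, source_table, source_column, transform_rule
FROM   meta_column_mapping
WHERE  target_table = 'DW_SALES' AND target_column = 'REGION_CODE';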
The following lists the characteristics of info directory/ Meta data:
It is the gateway to the data warehouse environment
It supports easy distribution and replication of content for high performance and
availability
It should be searchable by business oriented key words
It should act as a launch platform for end user to access data and analysis tools
It should support the sharing of info
It should support scheduling options for request
It should support and provide interfaces to other applications
It should support end user monitoring of the status of the data warehouse environment
4. Access tools
Its purpose is to provide info to business users for decision making. There are five
categories:
Data query and reporting tools
Application development tools
Executive info system tools (EIS)
OLAP tools
Data mining tools
Query and reporting tools are used to generate queries and reports. There are two types of
reporting tools:


Production reporting tools are used to generate regular operational reports.
Desktop report writers are inexpensive desktop tools designed for end users.
Managed Query tools: used to generate SQL queries. They use metalayer software between users
and databases which offers point-and-click creation of SQL statements. These tools are a preferred
choice of users for segment identification, demographic analysis, territory management
and preparation of customer mailing lists, etc.
Application development tools: These provide a graphical data access environment which integrates
OLAP tools with the data warehouse and can be used to access all database systems.
OLAP Tools: are used to analyze the data in multidimensional and complex views. To enable
multidimensional properties they use MDDB or MRDB, where MDDB refers to a multidimensional
database and MRDB refers to a multirelational database.
Data mining tools: are used to discover knowledge from the data warehouse data; they can also be
used for data visualization and data correction purposes.
5. Data marts
Data marts are departmental subsets that focus on selected subjects. They are independent and used by a
dedicated user group. They are used for rapid delivery of enhanced decision support functionality
to end users. A data mart is used in the following situations:
Extremely urgent user requirements
The absence of a budget for a full-scale data warehouse strategy
The decentralization of business needs
The attraction of easy-to-use tools and a mind-sized project
Data marts present two problems:
1. Scalability: A small data mart can grow quickly in multiple dimensions, so while
designing it the organization has to pay more attention to system scalability, consistency
and manageability issues
2. Data integration
6. Data warehouse admin and management
The management of data warehouse includes,


Security and priority management


Monitoring updates from multiple sources
Data quality checks
Managing and updating meta data
Auditing and reporting data warehouse usage and status
Purging data
Replicating, subsetting and distributing data
Backup and recovery
Data warehouse storage management which includes capacity planning, hierarchical
storage management and purging of aged data etc.,
7. Information delivery system
• It is used to enable the process of subscribing for data warehouse info.
• Delivery to one or more destinations according to specified scheduling algorithm

Building a Data warehouse


There are two reasons why organizations consider data warehousing a critical need. In
other words, there are two factors that drive an organization to build and use a data warehouse:
Business factors:
Business users want to make decisions quickly and correctly using all available data.
Technological factors:
To address the incompatibility of operational data stores
IT infrastructure is changing rapidly; its capacity is increasing and its cost is decreasing, so
building a data warehouse is feasible
There are several things to be considered while building a successful data warehouse
Business considerations:
Organizations interested in development of a data warehouse can choose one of the
following two approaches:


Top - Down Approach (Suggested by Bill Inmon)


Bottom - Up Approach (Suggested by Ralph Kimball)
Top - Down Approach
In the top-down approach suggested by Bill Inmon, we build a centralized repository to
house corporate-wide business data. This repository is called the Enterprise Data Warehouse (EDW).
The data in the EDW is stored in a normalized form in order to avoid redundancy. The central
repository for corporate-wide data helps us maintain one version of the truth of the data. The data in
the EDW is stored at the most detailed level. The reasons to build the EDW at the most detailed
level are:
1. Flexibility to be used by multiple departments.
2. Flexibility to cater for future requirements.
The disadvantages of storing data at the detail level are:
1. The complexity of design increases with increasing level of detail.
2. It takes a large amount of space to store data at the detail level, hence increased cost.
Once the EDW is implemented we start building subject-area-specific data marts which
contain data in a denormalized form, also called a star schema. The data in the marts are usually
summarized based on the end users' analytical requirements. The reason to denormalize the data
in the mart is to provide faster access to the data for the end users' analytics. If we were to
query a normalized schema for the same analytics, we would end up with complex multi-level
joins that would be much slower compared to queries on the denormalized schema.
We should implement the top-down approach when
1. The business has complete clarity on all or multiple subject areas data warehouse
requirements.
2. The business is ready to invest considerable time and money.
The advantage of using the Top Down approach is that we build a centralized
repository to cater for one version of truth for business data. This is very important for the
data to be reliable, consistent across subject areas and for reconciliation in case of data
related contention between subject areas.


The disadvantage of using the Top-Down approach is that it requires more time and initial
investment. The business has to wait for the EDW to be implemented, followed by the building of
the data marts, before it can access its reports.
Bottom-Up Approach
The bottom-up approach suggested by Ralph Kimball is an incremental approach to building
a data warehouse. Here we build the data marts separately at different points in time as and when
the specific subject area requirements are clear. The data marts are then integrated or combined
to form a data warehouse. Separate data marts are combined through the use of
conformed dimensions and conformed facts. A conformed dimension or a conformed fact is one
that can be shared across data marts.
A conformed dimension has consistent dimension keys, consistent attribute names and
consistent values across separate data marts. A conformed dimension means exactly the same
thing with every fact table to which it is joined.
A conformed fact has the same definition of measures, the same dimensions joined to it and
the same granularity across data marts.
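A minimal sketch of a conformed dimension, using hypothetical sales and payroll marts: both
fact tables reference the same date dimension, with identical keys, attribute names and values.

-- Hypothetical conformed date dimension shared by two data marts.
CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,  -- surrogate key, e.g. 20240115
    calendar_date DATE,
    month_name    VARCHAR(10),
    quarter_name  VARCHAR(6),
    year_number   INTEGER
);

-- Sales mart fact table.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date (date_key),
    product_key  INTEGER,
    sales_amount DECIMAL(12,2)
);

-- Payroll mart fact table, built later, reuses the same dimension unchanged.
CREATE TABLE fact_payroll (
    date_key     INTEGER REFERENCES dim_date (date_key),
    employee_key INTEGER,
    gross_pay    DECIMAL(12,2)
);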
The bottom-up approach helps us incrementally build the warehouse by developing and
integrating data marts as and when their requirements are clear. We do not have to wait to know
the overall requirements of the warehouse.
We should implement the bottom-up approach when:
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear; we have clarity on only one data mart.
The advantage of using the Bottom-Up approach is that it does not require high initial costs and
has a faster implementation time; hence the business can start using the marts much earlier as
compared to the top-down approach.
The disadvantages of using the Bottom-Up approach are that it stores data in a denormalized
format, hence there is high space usage for detailed data. There is also a tendency not to keep
detailed data in this approach, losing the advantage of having detail data, i.e. the flexibility to
easily cater to future requirements. The bottom-up approach is more realistic, but the complexity
of the integration may become a serious obstacle.


Design considerations
To be successful, a data warehouse designer must adopt a holistic approach: consider all
data warehouse components as parts of a single complex system, and take into
account all possible data sources and all known usage requirements.
Most successful data warehouses that meet these requirements have these common
characteristics:
Are based on a dimensional model
Contain historical and current data
Include both detailed and summarized data
Consolidate disparate data from multiple sources while retaining consistency
A data warehouse is difficult to build due to the following reasons:
Heterogeneity of data sources
Use of historical data
Growing nature of the database
The data warehouse design approach must be a business-driven, continuous and iterative engineering
approach. In addition to the general considerations, the following specific points are relevant to
data warehouse design:
Data content
The content and structure of the data warehouse are reflected in its data model. The data
model is the template that describes how information will be organized within the integrated
warehouse framework. The data warehouse data must be detailed data. It must be formatted,
cleaned up and transformed to fit the warehouse data model.
Meta data
It defines the location and contents of data in the warehouse. Meta data is searchable by
users to find definitions or subject areas. In other words, it must provide decision support
oriented pointers to warehouse data and thus provides a logical link between warehouse data and
decision support applications.


Data distribution
One of the biggest challenges when designing a data warehouse is the data placement and
distribution strategy. Data volumes continue to grow. Therefore, it becomes necessary
to know how the data should be divided across multiple servers and which users should get
access to which types of data. The data can be distributed based on the subject area, location
(geographical region), or time (current, month, year).
Tools
A number of tools are available that are specifically designed to help in the
implementation of the data warehouse. All selected tools must be compatible with the given data
warehouse environment and with each other. All tools must be able to use a common Meta data
repository.

Design steps
The following nine-step method is followed in the design of a data warehouse:
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre calculations in the fact table
6. Rounding out the dimension table
7. Choosing the duration of the db
8. The need to track slowly changing dimensions
9. Deciding the query priorities and query models
Technical considerations
A number of technical issues are to be considered when designing a data warehouse
environment. These issues include:
The hardware platform that would house the data warehouse
The DBMS that supports the warehouse data


The communication infrastructure that connects data marts, operational systems and end
users
The hardware and software to support meta data repository
The systems management framework that enables admin of the entire environment
Implementation considerations
The following logical steps are needed to implement a data warehouse:
Collect and analyze business requirements
Create a data model and a physical design
Define data sources
Choose the DB tech and platform
Extract the data from operational DB, transform it, clean it up and load it into the
warehouse
Choose DB access and reporting tools
Choose DB connectivity software
Choose data analysis and presentation s/w
Update the data warehouse
Access tools
Data warehouse implementation relies on selecting suitable data access tools. The best
way to choose a tool is based on the type of data that can be selected using it and the kind of
access it permits for a particular user. The following lists the various types of data that can be
accessed:
Simple tabular form data
Ranking data
Multivariable data
Time series data
Graphing, charting and pivoting data
Complex textual search data
Statistical analysis data


Data for testing of hypothesis, trends and patterns


Predefined repeatable queries
Ad hoc user specified queries
Reporting and analysis data
Complex queries with multiple joins, multi level sub queries and sophisticated search
criteria
Data extraction, clean up, transformation and migration
Proper attention must be paid to data extraction, which represents a success factor for a
data warehouse architecture. When implementing a data warehouse, the following selection
criteria, which affect the ability to transform, consolidate, integrate and repair the data, should be
considered:
Timeliness of data delivery to the warehouse
The tool must have the ability to identify the particular data that can be read by the
conversion tool
The tool must support flat files and indexed files, since much corporate data is still stored in
these formats
The tool must have the capability to merge data from multiple data stores
The tool should have specification interface to indicate the data to be extracted
The tool should have the ability to read data from data dictionary
The code generated by the tool should be completely maintainable
The tool should permit the user to extract the required data
The tool must have the facility to perform data type and character set translation
The tool must have the capability to create summarization, aggregation and derivation of
records
The data warehouse database system must be able to perform loading data directly from
these tools


Data placement strategies


– As a data warehouse grows, there are at least two options for data placement. One is to put
some of the data in the data warehouse onto other storage media.
– The second option is to distribute the data in the data warehouse across multiple servers.

User levels
The users of data warehouse data can be classified on the basis of their skill level in
accessing the warehouse.
There are three classes of users:
Casual users: are most comfortable in retrieving info from warehouse in pre defined formats and
running pre existing queries and reports. These users do not need tools that allow for building
standard and ad hoc reports
Power Users: can use pre defined as well as user defined queries to create simple and ad hoc
reports. These users can engage in drill down operations. These users may have the experience of
using reporting and query tools.
Expert users: These users tend to create their own complex queries and perform standard
analysis on the info they retrieve. These users have the knowledge about the use of query and
report tools

Benefits of data warehousing


Data warehouse usage includes,
– Locating the right info
– Presentation of info
– Testing of hypothesis
– Discovery of info
– Sharing the analysis


The benefits can be classified into two:


Tangible benefits (quantified / measurable): These include,
– Improvement in product inventory
– Decrement in production cost
– Improvement in selection of target markets
– Enhancement in asset and liability management
Intangible benefits (not easy to quantify): These include,
– Improvement in productivity by keeping all data in single location and
eliminating rekeying of data
– Reduced redundant processing
– Enhanced customer relation

Mapping the data warehouse architecture to Multiprocessor architecture


The functions of a data warehouse are based on relational database technology, which is
implemented in a parallel manner. There are two advantages of having parallel relational
database technology for a data warehouse:
 Linear Speed-up: refers to the ability to increase the number of processors to reduce response
time
 Linear Scale-up: refers to the ability to provide the same performance on the same requests as
the database size increases
Types of parallelism
There are two types of parallelism:
 Inter query Parallelism: In which different server threads or processes handle multiple
requests at the same time.
 Intra query Parallelism: This form of parallelism decomposes the serial SQL query into
lower level operations such as scan, join, sort etc. Then these lower level operations are
executed concurrently in parallel.


Intra query parallelism can be done in either of two ways (a sketch using a parallel query hint
follows these descriptions):

 Horizontal parallelism: which means that the database is partitioned across multiple disks
and parallel processing occurs within a specific task that is performed concurrently on
different processors against different sets of data
 Vertical parallelism: This occurs among different tasks. All query components such as
scan, join, sort, etc. are executed in parallel in a pipelined fashion. In other words, the output
from one task becomes an input to another task.
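As a hedged illustration of intra-query parallelism, the query below uses an Oracle-style
PARALLEL hint to ask the optimizer to scan and aggregate a hypothetical fact table with several
parallel execution servers; other DBMSs expose the same idea through different syntax or
configuration.

-- Oracle-style hint requesting a degree of parallelism of 4 for this query.
-- The scan, join and aggregation steps can then run as concurrent
-- lower-level operations (horizontal and vertical parallelism).
SELECT /*+ PARALLEL(s, 4) */
       p.product_name,
       SUM(s.sales_amount) AS total_sales
FROM   fact_sales s
       JOIN dim_product p ON p.product_key = s.product_key
GROUP  BY p.product_name;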

Data partitioning:
Data partitioning is the key component for effective parallel execution of database
operations. Partitioning can be done randomly or intelligently.
Random partitioning includes random data striping across multiple disks on a single server.
Another option for random partitioning is round-robin partitioning, in which each record is
placed on the next disk assigned to the database.
Intelligent partitioning assumes that the DBMS knows where a specific record is located and does
not waste time searching for it across all disks.
The various intelligent partitioning schemes include the following (a DDL sketch follows these
descriptions):
Hash partitioning: A hash algorithm is used to calculate the partition number based on the value
of the partitioning key for each row


Key range partitioning: Rows are placed and located in the partitions according to the value of
the partitioning key. That is, all the rows with key values from A to K are in partition 1, L to T
are in partition 2, and so on.
Schema partitioning: an entire table is placed on one disk; another table is placed on a different
disk, etc. This is useful for small reference tables.
User-defined partitioning: It allows a table to be partitioned on the basis of a user-defined
expression.
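The sketch below shows hash and key range partitioning in Oracle-style DDL with hypothetical
table and column names; the syntax differs between vendors, but the partitioning ideas are the
same.

-- Hash partitioning: a hash of customer_key decides which of 4 partitions a row goes to.
CREATE TABLE fact_sales_hash (
    customer_key INTEGER,
    sale_date    DATE,
    sales_amount NUMBER(12,2)
)
PARTITION BY HASH (customer_key) PARTITIONS 4;

-- Key range partitioning: rows are placed according to the value of sale_date.
CREATE TABLE fact_sales_range (
    customer_key INTEGER,
    sale_date    DATE,
    sales_amount NUMBER(12,2)
)
PARTITION BY RANGE (sale_date) (
    PARTITION p_2023 VALUES LESS THAN (DATE '2024-01-01'),
    PARTITION p_2024 VALUES LESS THAN (DATE '2025-01-01')
);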

Data base architectures of parallel processing


There are three DBMS software architecture styles for parallel processing:
1. Shared memory or shared everything Architecture
2. Shared disk architecture
3. Shared nothing architecture
Shared Memory Architecture
Tightly coupled shared memory systems, illustrated in following figure have the following
characteristics:
Multiple PUs share memory.
Each PU has full access to all shared memory through a common bus.
Communication between nodes occurs via shared memory.
Performance is limited by the bandwidth of the memory bus.
Symmetric multiprocessor (SMP) machines are often nodes in a cluster. Multiple SMP
nodes can be used with Oracle Parallel Server in a tightly coupled system, where memory is
shared among the multiple PUs, and is accessible by all the PUs through a memory bus.
Examples of tightly coupled systems include the Pyramid, Sequent, and Sun SparcServer.
Performance is potentially limited in a tightly coupled system by a number of factors.
These include various system components such as the memory bandwidth, PU to PU
communication bandwidth, the memory available on the system, the I/O bandwidth, and the
bandwidth of the common bus.


Parallel processing advantages of shared memory systems are these:


Memory access is cheaper than inter-node communication. This means that internal
synchronization is faster than using the Lock Manager.
Shared memory systems are easier to administer than a cluster.
A disadvantage of shared memory systems for parallel processing is as follows:
Scalability is limited by bus bandwidth and latency, and by available memory.
Shared Disk Architecture
Shared disk systems are typically loosely coupled. Such systems, illustrated in following
figure, have the following characteristics:
Each node consists of one or more PUs and associated memory.
Memory is not shared between nodes.
Communication occurs over a common high-speed bus.
Each node has access to the same disks and other resources.
A node can be an SMP if the hardware supports it.
Bandwidth of the high-speed bus limits the number of nodes (scalability) of the system.


The cluster illustrated in figure is composed of multiple tightly coupled nodes. The
Distributed Lock Manager (DLM ) is required. Examples of loosely coupled systems are VAX
clusters or Sun clusters.
Since the memory is not shared among the nodes, each node has its own data cache. Cache
consistency must be maintained across the nodes and a lock manager is needed to maintain the
consistency. Additionally, instance locks using the DLM on the Oracle level must be maintained
to ensure that all nodes in the cluster see identical data.
There is additional overhead in maintaining the locks and ensuring that the data caches are
consistent. The performance impact is dependent on the hardware and software components, such
as the bandwidth of the high-speed bus through which the nodes communicate, and DLM
performance.
Parallel processing advantages of shared disk systems are as follows:
Shared disk systems permit high availability. All data is accessible even if one node
dies.
These systems have the concept of one database, which is an advantage over shared
nothing systems.


Shared disk systems provide for incremental growth.


Parallel processing disadvantages of shared disk systems are these:
Inter-node synchronization is required, involving DLM overhead and greater
dependency on high-speed interconnect.
If the workload is not partitioned well there may be high synchronization overhead.
There is operating system overhead of running shared disk software.
Shared Nothing Architecture
Shared nothing systems are typically loosely coupled. In shared nothing systems only one
CPU is connected to a given disk. If a table or database is located on that disk, access depends
entirely on the PU which owns it. Shared nothing systems can be represented as follows:

Shared nothing systems are concerned with access to disks, not access to memory.
Nonetheless, adding more PUs and disks can improve scale up. Oracle Parallel Server can access
the disks on a shared nothing system as long as the operating system provides transparent disk
access, but this access is expensive in terms of latency.


Shared nothing systems have advantages and disadvantages for parallel processing:
Advantages
Shared nothing systems provide for incremental growth.
System growth is practically unlimited.
MPPs are good for read-only databases and decision support applications.
Failure is local: if one node fails, the others stay up.
Disadvantages
More coordination is required.
More overhead is required for a process working on a disk belonging to another node.
If there is a heavy workload of updates or inserts, as in an online transaction processing system,
it may be worthwhile to consider data-dependent routing to alleviate contention.

Parallel DBMS features


Scope and techniques of parallel DBMS operations
Optimizer implementation
Application transparency
Parallel environment which allows the DBMS server to take full advantage of the existing
facilities on a very low level
DBMS management tools help to configure, tune, admin and monitor a parallel RDBMS as
effectively as if it were a serial RDBMS
Price / Performance: The parallel RDBMS can demonstrate a near-linear speed-up and scale-up
at reasonable costs.


Parallel DBMS vendors


1. Oracle: Parallel Query Option (PQO)
Architecture: shared disk arch
Data partition: Key range, hash, round robin
Parallel operations: hash joins, scan and sort
2. Informix: eXtended Parallel Server (XPS)
Architecture: Shared memory, shared disk and shared nothing models
Data partition: round robin, hash, schema, key range and user defined
Parallel operations: INSERT, UPDATE, DELETE
3. IBM: DB2 Parallel Edition (DB2 PE)
Architecture: Shared nothing models
Data partition: hash
Parallel operations: INSERT, UPDATE, DELETE, load, recovery, index creation, backup,
table reorganization
4. SYBASE: SYBASE MPP
Architecture: Shared nothing models
Data partition: hash, key range, Schema
Parallel operations: Horizontal and vertical parallelism

DBMS schemas for decision support


The basic concepts of dimensional modeling are: facts, dimensions and measures. A fact
is a collection of related data items, consisting of measures and context data. It typically
represents business items or business transactions. A dimension is a collection of data that
describe one business dimension. Dimensions determine the contextual background for the facts;
they are the parameters over which we want to perform OLAP. A measure is a numeric attribute
of a fact, representing the performance or behavior of the business relative to the dimensions.
Considering Relational context, there are three basic schemas that are used in dimensional
modeling:


1. Star schema
2. Snowflake schema
3. Fact constellation schema
Star schema
The multidimensional view of data that is expressed using relational database semantics
is provided by the database schema design called the star schema. The basis of the star schema is
that information can be classified into two groups:
Facts
Dimensions
A star schema has one large central table (the fact table) and a set of smaller tables (the dimensions)
arranged in a radial pattern around the central table.
Facts are the core data elements being analyzed, while dimensions are attributes about the facts.
The determination of which schema model should be used for a data warehouse should be based
upon the analysis of project requirements, available tools and project team preferences.


The star schema architecture is the simplest data warehouse schema. It is called a star
schema because the diagram resembles a star, with points radiating from a center. The center of
the star consists of the fact table and the points of the star are the dimension tables. Usually the fact
tables in a star schema are in third normal form (3NF), whereas dimension tables are de-
normalized. Despite the fact that the star schema is the simplest architecture, it is the most
commonly used nowadays and is recommended by Oracle.

Fact Tables
A fact table is a table that contains summarized numerical and historical data (facts) and a
multipart index composed of foreign keys from the primary keys of related dimension tables. A
fact table typically has two types of columns: foreign keys to dimension tables, and measures,
which contain numeric facts. A fact table can contain facts at the detail or aggregated level.

Dimension Tables
Dimensions are categories by which summarized data can be viewed. E.g. a profit
summary in a fact table can be viewed by a Time dimension (profit by month, quarter, year), a
Region dimension (profit by country, state, city), or a Product dimension (profit for product1,
product2). A dimension is a structure usually composed of one or more hierarchies that categorize
data. If a dimension has no hierarchies and levels it is called a flat dimension or list. The primary
keys of each of the dimension tables are part of the composite primary key of the fact table.
Dimensional attributes help to describe the dimensional value. They are normally
descriptive, textual values. Dimension tables are generally smaller in size than fact tables. Typical
fact tables store data about sales, while dimension tables hold data about geographic regions
(markets, cities), clients, products, times and channels.
Measures
Measures are numeric data based on columns in a fact table. They are the primary data
which end users are interested in. E.g. a sales fact table may contain a profit measure which
represents profit on each sale.
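To make the roles of facts, dimensions and measures concrete, here is a minimal star-join query
over hypothetical tables: the profit measure in the fact table is summarized by the Time and
Region dimensions.

-- Hypothetical star join: summarize the profit measure by month and country.
SELECT d.year_number,
       d.month_name,
       r.country,
       SUM(f.profit) AS total_profit
FROM   fact_sales f
       JOIN dim_date   d ON d.date_key   = f.date_key
       JOIN dim_region r ON r.region_key = f.region_key
GROUP  BY d.year_number, d.month_name, r.country;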


Aggregations are pre-calculated numeric data. By calculating and storing the answers to a
query before users ask for it, the query processing time can be reduced. This is key in providing
fast query performance in OLAP.
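A sketch of pre-calculation under the same hypothetical tables: the answer to a common question
(profit by product and month) is computed once into a summary table, so later queries read the
small aggregate instead of the detailed fact table.

-- Hypothetical pre-calculated aggregate, rebuilt after each warehouse load.
CREATE TABLE agg_profit_by_product_month AS
SELECT f.product_key,
       d.year_number,
       d.month_name,
       SUM(f.profit) AS total_profit
FROM   fact_sales f
       JOIN dim_date d ON d.date_key = f.date_key
GROUP  BY f.product_key, d.year_number, d.month_name;

-- User queries hit the small aggregate, not the detail-level fact table.
SELECT month_name, total_profit
FROM   agg_profit_by_product_month
WHERE  product_key = 101 AND year_number = 2024;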
Cubes are data processing units composed of fact tables and dimensions from the data
warehouse. They provide multidimensional views of data, querying and analytical capabilities to
clients.
The main characteristics of a star schema:
Simple structure -> easy-to-understand schema
Great query effectiveness -> small number of tables to join
Relatively long dimension table load time -> de-normalization and data redundancy mean the
tables can become large
The most commonly used schema in data warehouse implementations -> widely supported
by a large number of business intelligence tools
Snowflake schema: is the result of decomposing one or more of the dimensions. The many-to-one
relationships among sets of attributes of a dimension can be separated into new dimension tables,
forming a hierarchy. The decomposed snowflake structure visualizes the hierarchical structure of
dimensions very well.
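As a sketch with hypothetical names, snowflaking a product dimension separates the many-to-one
product-to-category relationship into its own table, forming a hierarchy.

-- Snowflaked product dimension: category attributes moved to a separate table.
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(50),
    department    VARCHAR(50)
);

CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    category_key INTEGER REFERENCES dim_category (category_key)  -- hierarchy link
);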
Fact constellation schema: For each star schema it is possible to construct a fact constellation
schema (for example by splitting the original star schema into more star schemas, each of which
describes facts at another level of the dimension hierarchies). The fact constellation architecture
contains multiple fact tables that share many dimension tables.
The main shortcoming of the fact constellation schema is a more complicated design,
because many variants for particular kinds of aggregation must be considered and selected.
Moreover, dimension tables are still large.
Multi relational Database:
The relational implementation of multidimensional data base systems is referred to as
multi relational database systems.


Data Extraction, Cleanup and Transformation Tools


• The task of capturing data from a source data system, cleaning and transforming it and
then loading the results into a target data system can be carried out either by separate
products, or by a single integrated solution. More contemporary integrated solutions can
fall into one of the categories described below:
– Code Generators
– Database data Replications
– Rule-driven Dynamic Transformation Engines (Data Mart Builders)
Code Generator:
– It creates 3GL/4GL transformation programs based on source and target data
definitions, and data transformation and enhancement rules defined by the
developer.
– This approach reduces the need for an organization to write its own data capture,
transformation, and load programs. These products employ DML Statements to
capture a set of the data from source system.
– These are used for data conversion projects, and for building an enterprise-wide
data warehouse, when there is a significant amount of data transformation to be
done involving a variety of different flat files, non-relational, and relational data
sources.

Database Data Replication Tools:


– These tools employ database triggers or a recovery log to capture changes to a
single data source on one system and apply the changes to a copy of the data source
located on a different system (a trigger-based sketch follows this list).
– Most replication products do not support the capture of changes to non-relational
files and databases, and often do not provide facilities for significant data
transformation and enhancement.
– These point-to-point tools are used for disaster recovery and to build an operational
data store, a data warehouse, or a data mart when the number of data sources
involved is small and a limited amount of data transformation and enhancement is
required.
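As a generic, hedged illustration of trigger-based change capture (not the mechanism of any
particular replication product), the Oracle-style trigger below records changes to a hypothetical
source table in a change table that a replication process could later apply to the remote copy.

-- Change table that the replication process reads and applies to the target copy.
CREATE TABLE customer_changes (
    change_time TIMESTAMP,
    change_type VARCHAR2(10),
    customer_id NUMBER,
    cust_name   VARCHAR2(100)
);

-- Oracle-style row trigger capturing inserts and updates on the source table.
CREATE OR REPLACE TRIGGER trg_capture_customer
AFTER INSERT OR UPDATE ON customers
FOR EACH ROW
BEGIN
    INSERT INTO customer_changes (change_time, change_type, customer_id, cust_name)
    VALUES (SYSTIMESTAMP,
            CASE WHEN INSERTING THEN 'INSERT' ELSE 'UPDATE' END,
            :NEW.customer_id,
            :NEW.cust_name);
END;
/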
Rule-driven Dynamic Transformation Engines (Data Mart Builders):
– They are also known as Data Mart Builders and capture data from a source system
at User-defined intervals, transform data, and then send and load the results into a
target environment, typically a data mart.
– To date most of the products in this category support only relational data sources,
though now this trend has started changing.
– Data to be captured from the source system is usually defined using query language
statements, and data transformation and enhancement is done using a script or
function logic defined to the tool.
– With most tools in this category, data flows from source systems to target systems
through one or more servers, which perform the data transformation and
enhancement. These transformation servers can usually be controlled from a single
location, making the management of such an environment much easier.
Meta Data:
Meta Data Definitions:
Metadata – additional data about the warehouse, used to understand what information is in the
warehouse and what it means
Metadata Repository – a specialized database designed to maintain metadata, together with
the tools and interfaces that allow a company to collect and distribute its metadata.
Operational Data – elements from operational systems, external data (or other sources)
mapped to the warehouse structures.
Industry trend:
Why were early Data Warehouses that did not include significant amounts of metadata collection
able to succeed?
• Usually a subset of data was targeted, making it easier to understand content,
organization, ownership.
• Usually targeted a subset of (technically inclined) end users


Early choices were made to ensure the success of initial data warehouse efforts.
Meta Data Transitions:
 Usually, metadata repositories are already in existence. Traditionally, metadata was
aimed at overall systems management, such as aiding in the maintenance of legacy
systems through impact analysis, and determining the appropriate reuse of legacy data
structures.
 Repositories can now aide in tracking metadata to help all data warehouse users
understand what information is in the warehouse and what it means. Tools are now being
positioned to help manage and maintain metadata.
Meta Data Lifecycle:
1. Collection: Identify metadata and capture it in a central repository.
2. Maintenance: Put in place processes to synchronize metadata automatically with the
changing data architecture.
3. Deployment: Provide metadata to users in the right form and with the right tools.
The key to ensuring a high level of collection and maintenance accuracy is to incorporate as
much automation as possible. The key to a successful metadata deployment is to correctly
match the metadata offered to the specific needs of each audience.
Meta Data Collection:
• Collecting the right metadata at the right time is the basis for success. If the user does
not already have an idea about what information would answer a question, the user will
not find anything helpful in the warehouse.
• Metadata spans many domains from physical structure data, to logical model data, to
business usage and rules.
• Typically the metadata that should be collected is already generated and processed by the
development team anyway. Metadata collection preserves the analysis performed by the
team.


Meta Data Categories:


Warehouse Data Sources:
• Information about the potential sources of data for a data warehouse (existing operational
systems, external data, manually maintained information). The intent is to understand
both the physical structure of the data and the meaning of the data. Typically the physical
structure is easier to collect as it may exist in a repository that can be parsed
automatically.
Data Models:
Correlate the enterprise model to the warehouse model.
• Map entities in the enterprise model to their representation in the warehouse model. This
will provide the basis for further change impact analysis and end user content analysis.
• Ensure the entity, element definition, business rules, valid values, and usage guidelines
are transposed properly from the enterprise model to the warehouse model.
Warehouse Mappings:
Map the operational data into the warehouse data structures
• Each time a data element is mapped to the warehouse, the logical connection between the
data elements, as well as any transformations should be recorded.
• Along with being able to determine that an element in the warehouse is populated from
specific sources of data, the metadata should also discern exactly what happens to those
elements as they are extracted from the data sources, moved, transformed, and loaded into
the warehouse.
Warehouse Usage Information:
Usage information can be used to:
• Understand what tables are being accessed, by whom, and how often. This can be used to
fine tune the physical structure of the data warehouse.
• Improve query reuse by identifying existing queries (catalog queries, identify query
authors, descriptions).
• Understand how data is being used to solve business problems.


This information is captured after the warehouse has been deployed. Typically, this information
is not easy to collect.
Maintaining Meta Data:
• As with any maintenance process, automation is key to maintaining current high-quality
information. The data warehouse tools can play an important role in how the metadata is
maintained.
• Most proposed database changes already go through appropriate verification and
authorization, so adding a metadata maintenance requirement should not be significant.
• Capturing incremental changes is encouraged since metadata (particularly structure
information) is usually very large.
Maintaining the Warehouse:
The warehouse team must have comprehensive impact analysis capabilities to respond to
change that may affect:
• Data extraction/movement/transformation routines
• Table structures
• Data marts and summary data structures
• Stored user queries
• Users who require new training (due to query or other changes)
What business problems are addressed in part using the element that is changing (help
understand the significance of the change, and how it may impact decision making).
Meta Data Deployment:
Supply the right metadata to the right audience
• Warehouse developers will primarily need the physical structure information for data
sources. Further analysis on that metadata leads to the development of more metadata
(mappings).
• Warehouse maintainers typically require direct access to the metadata as well.
• End Users require an easy-to-access format. They should not be burdened with technical
names or cryptic commands. Training, documentation and other forms of help, should be
readily available.


End Users:
Users of the warehouse are primarily concerned with two types of metadata.
1. A high-level topic inventory of the warehouse (what is in the warehouse and where it
came from).
2. Existing queries that are pertinent to their search (reuse).
The important goal is that the user is easily able to correctly find and interpret the data they need.
Integration with Data Access Tools:
1. Side by Side access to metadata and to real data. The user can browse metadata and write
queries against the real data.
2. Populate query tool help text with metadata exported from the repository. The tool can
now provide the user with context sensitive help at the expense of needing updating
whenever metadata changes and the user may be using outdated metadata.
3. Provide query tools that access the metadata directly to provide context sensitive help.
This eliminates the refresh issue, and ensures the user always sees current metadata.
4. Full interconnectivity between query tool and metadata tool (transparent transactions
between tools).


OLTP vs. OLAP

We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can
assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help
to analyze it.
- OLTP (On-line Transaction Processing) is characterized by a large number of short on-line
transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is put on
very fast query processing, maintaining data integrity in multi-access environments and an
effectiveness measured by the number of transactions per second. In an OLTP database there is
detailed and current data, and the schema used to store transactional data is the entity model
(usually 3NF).
- OLAP (On-line Analytical Processing) is characterized by a relatively low volume of
transactions. Queries are often very complex and involve aggregations. For OLAP systems,
response time is an effectiveness measure. OLAP applications are widely used by Data Mining
techniques. In an OLAP database there is aggregated, historical data, stored in multi-dimensional
schemas (usually a star schema).
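For example, with hypothetical tables, a typical OLTP statement touches one current row, while a
typical OLAP query aggregates historical data across dimensions:

-- OLTP: short transaction on current, detailed data (e.g. a bank card withdrawal).
UPDATE accounts
SET    balance = balance - 200
WHERE  account_id = 12345;

-- OLAP: complex analytical query with aggregation over historical data.
SELECT d.year_number,
       r.country,
       SUM(f.sales_amount) AS total_sales
FROM   fact_sales f
       JOIN dim_date   d ON d.date_key   = f.date_key
       JOIN dim_region r ON r.region_key = f.region_key
GROUP  BY d.year_number, r.country
ORDER  BY d.year_number, r.country;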


The following table summarizes the major differences between OLTP and OLAP system design.

Source of data
  OLTP System (Operational System): Operational data; OLTPs are the original source of the data.
  OLAP System (Data Warehouse): Consolidation data; OLAP data comes from the various OLTP databases.

Purpose of data
  OLTP System: To control and run fundamental business tasks.
  OLAP System: To help with planning, problem solving, and decision support.

What the data reveals
  OLTP System: A snapshot of ongoing business processes.
  OLAP System: Multi-dimensional views of various kinds of business activities.

Inserts and updates
  OLTP System: Short and fast inserts and updates initiated by end users.
  OLAP System: Periodic long-running batch jobs refresh the data.

Queries
  OLTP System: Relatively standardized and simple queries returning relatively few records.
  OLAP System: Often complex queries involving aggregations.

Processing speed
  OLTP System: Typically very fast.
  OLAP System: Depends on the amount of data involved; batch data refreshes and complex queries
  may take many hours; query speed can be improved by creating indexes.

Space requirements
  OLTP System: Can be relatively small if historical data is archived.
  OLAP System: Larger, due to the existence of aggregation structures and history data; requires
  more indexes than OLTP.

Database design
  OLTP System: Highly normalized with many tables.
  OLAP System: Typically de-normalized with fewer tables; uses star and/or snowflake schemas.

Backup and recovery
  OLTP System: Backup religiously; operational data is critical to run the business, and data loss
  is likely to entail significant monetary loss and legal liability.
  OLAP System: Instead of regular backups, some environments may consider simply reloading the
  OLTP data as a recovery method.


UNIT II BUSINESS ANALYSIS

Reporting and Query tools and Applications – Tool Categories – The Need for
Applications – Cognos Impromptu – Online Analytical Processing (OLAP) – Need –
Multidimensional Data Model – OLAP Guidelines – Multidimensional versus
Multirelational OLAP – Categories of Tools – OLAP Tools and the Internet.

Reporting Query tools and Applications

The data warehouse is accessed using an end-user query and reporting tool from
Business Objects. Business Objects provides several tools to securely access the data
warehouse or personal data files with a point-and-click interface including the following:

 BusinessObjects (Reporter and Explorer) – a Microsoft Windows-based query and
reporting tool.
 InfoView - a web based tool, that allows reports to be refreshed on demand (but
cannot create new reports).
 InfoBurst - a web based server tool that allows reports to be refreshed, scheduled
and distributed. It can be used to distribute reports and data to users or servers in
various formats (e.g. Text, Excel, PDF, HTML, etc.). For more information, see the
documentation below:
o InfoBurst Usage Notes (PDF)
o InfoBurst User Guide (PDF)
 Data Warehouse List Upload - a web based tool that allows lists of data to be
uploaded into the data warehouse for use as input to queries. For more
information, see the documentation below:

o Data Warehouse List Upload Instructions (PDF)

WSU has negotiated a contract with Business Objects for purchasing these tools at a
discount. View BusObj Rates.

Selecting your Query Tool:


a. The query tools discussed in the next several slides represent the most commonly
used query tools at Penn State.
b. A Data Warehouse user is free to select any query tool, and is not limited to the
ones mentioned.
c. What is a "Query Tool"?
d. A query tool is a software package that allows you to connect to the data
warehouse from your PC and develop queries and reports.

There are many query tools to choose from. Below is a listing of what is currently
being used on the PC:

1. Microsoft Access

2. Microsoft Excel

3. Cognos Impromptu

Data Warehousing Tools and Technologies

a) Building a data warehouse is a complex task because there is no vendor that provides an 'end-to-end' set of tools.

b) Necessitates that a data warehouse is built using multiple products from different
vendors.

c) Ensuring that these products work well together and are fully integrated is a major
challenge.

Cognos impromptu Query Tabs: Data

 Identify what to query


 Click and drag

Sort

 Hierarchy presentation
 Ascending or descending order

Group


 Summarized data by group order

Filter

 Defines criteria
 Specifies query range

Administrative

 Access
 Profile
 Client/Server

Generating reports:


Edit features on the toolbar allow changes to the report data after the query has been completed:

 Modify existing data


 Format numerical and date fields
 Perform calculations
 Group data
 Sort columns.

General Ledger System Data:

 Data Elements

Table Format

 Balances – Summary information


 Lines – Journal entry detail

Numeric

 Detail and summary


 Include calculations

Descriptive

 Accounting string segment values

Cognos Impromptu

What is Impromptu? Impromptu is an interactive database reporting tool. It allows power users to query data without programming knowledge. When using the Impromptu tool, no data is written or changed in the database; it is only capable of reading the data. Impromptu's main features include:

 Interactive reporting capability


 Enterprise-wide scalability
 Superior user interface
 Fastest time to result
 Lowest cost of ownership


Catalogs

Impromptu stores metadata in subject-related folders. This metadata is what will be used to develop a query for a report. The metadata set is stored in a file called a 'catalog'. The catalog does not contain any data; it just contains information about connecting to the database and the fields that will be accessible for reports. A catalog contains:

 Folders—meaningful groups of information representing columns from one or


more tables
 Columns—individual data elements that can appear in one or more folders
 Calculations—expressions used to compute required values from existing data

 Conditions—used to filter information so that only a certain type of information is


displayed
 Prompts—pre-defined selection criteria prompts that users can include in reports they
create
 Other components, such as metadata, a logical database name, join information, and user
classes

You can use catalogs to


 view, run, and print reports

 export reports to other applications


 disconnect from and connect to the database
 create reports

 change the contents of the catalog


 add user classes


Online Analytical Processing (OLAP), OLAP Need, Multidimensional


Data Model: The Multidimensional data Model

The multidimensional data model is an integral part of On-Line Analytical Processing, or OLAP. Because OLAP is on-line, it must provide answers quickly; analysts pose iterative queries during interactive sessions, not in batch jobs that run overnight. And because OLAP is also analytic, the queries are complex. The multidimensional data model is designed to solve complex queries in real time. The multidimensional data model is best viewed as a cube. The table at the left contains detailed sales data by product, market and time. The cube on the right associates sales numbers (units sold) with dimensions (product type, market and time), with the unit variables organized as cells in an array. This cube can be expanded to include another array, price, which can be associated with all or only some dimensions. As the number of dimensions increases, the number of cube cells increases exponentially. Dimensions are hierarchical in nature; for example, the time dimension may contain hierarchies for years, quarters, months, weeks and days, and GEOGRAPHY may contain country, state, city, etc.

In this cube we can observe that each side of the cube represents one of the elements of the question. The x-axis represents time, the y-axis represents the products, and the z-axis represents the different centers. The cells in the cube represent the number of products sold, or can represent the price of the items.


This figure also gives a different understanding of the drill-down operation. The dimensions in the cube need not be directly related to one another. As the size of the dimensions increases, the size of the cube increases exponentially, and the response time of the cube depends on the size of the cube.

Operations in Multidimensional Data Model:

• Aggregation (roll-up)

– dimension reduction: e.g., total sales by city

– summarization over aggregate hierarchy:

e.g., total sales by city and year -> total sales by region and by year

• Selection (slice) defines a subcube – e.g., sales where city = Palo Alto and date =
1/15/96

• Navigation to detailed data (drill-down) – e.g., (sales - expense) by city, top 3% of cities
by average income

• Visualization Operations (e.g., Pivot or dice)
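As a concrete illustration of the roll-up, slice and drill-down operations listed above, the following is a minimal sketch using pandas. The column names (product, market, year, units_sold) and the sample figures are invented for illustration and are not part of the notes.

import pandas as pd

# Detailed sales facts: one row (cell) per product, market and year combination.
sales = pd.DataFrame({
    "product":    ["TV", "TV", "Radio", "Radio", "TV", "Radio"],
    "market":     ["Chennai", "Delhi", "Chennai", "Delhi", "Chennai", "Delhi"],
    "year":       [2022, 2022, 2022, 2022, 2023, 2023],
    "units_sold": [120, 90, 60, 40, 150, 55],
})

# Roll-up (aggregation): total units sold by market and year,
# i.e. the product dimension is aggregated away.
rollup = sales.groupby(["market", "year"])["units_sold"].sum()

# Slice: fix one dimension to a single value to obtain a sub-cube.
chennai_slice = sales[sales["market"] == "Chennai"]

# Drill-down / pivot: view the detailed cells with all three dimensions visible.
detail = sales.pivot_table(index=["market", "product"], columns="year",
                           values="units_sold", aggfunc="sum")

print(rollup, chennai_slice, detail, sep="\n\n")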

OLAP

OLAP stands for Online Analytical Processing. It uses database tables (fact and dimension
tables) to enable multidimensional viewing, analysis and querying of large amounts of
data. E.g. OLAP technology could provide management with fast answers to complex
queries on their operational data or enable them to analyze their company's historical
data for trends and patterns. Online Analytical Processing (OLAP) applications and tools are those that are designed to ask "complex queries of large multidimensional collections of data"; for this reason OLAP is closely coupled with data warehousing.


Need
The key driver of OLAP is the multidimensional nature of the business problem. These problems are characterized by retrieving a very large number of records, which can reach gigabytes and terabytes, and summarizing this data into information that can be used by business analysts. One limitation of SQL is that it cannot represent these complex problems directly: a single analytical question is translated into several SQL statements involving multiple joins, intermediate tables, sorting, aggregations and a large amount of temporary storage, and these procedures require a long computation time. A second limitation of SQL is its inability to use mathematical models within the statements. Even if an analyst could express these complex questions using SQL statements, the amount of computation and memory needed would still be large. Therefore the use of OLAP is preferable to solve this kind of problem.

CATEGORIES OF OLAP TOOLS

MOLAP This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary
formats. That is, data stored in array-based structures.
Advantages:
 Excellent performance: MOLAP cubes are built for fast data retrieval, and are
optimal for slicing and dicing operations.
 Can perform complex calculations: All calculations have been pre-generated when
the cube is created. Hence, complex calculations are not only doable, but they
return quickly.

Disadvantages:
 Limited in the amount of data it can handle: Because all calculations are performed
when the cube is built, it is not possible to include a large amount of data in the
cube itself. This is not to say that the data in the cube cannot be derived from a


large amount of data. Indeed, this is possible. But in this case, only summary-level
information will be included in the cube itself.
 Requires additional investment: Cube technology are often proprietary and do not
already exist in the organization. Therefore, to adopt MOLAP technology, chances
are additional investments in human and capital resources are needed.

Examples: Hyperion Essbase, Fusion (Information Builders)

ROLAP

This methodology relies on manipulating the data stored in the relational database to
give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each
action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL
statement. Data stored in relational tables
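To make the point above concrete, here is a small sketch of how a ROLAP-style tool might translate a slice-and-dice gesture into an extra WHERE clause on the relational database. It uses Python's built-in sqlite3 module; the sales table and its columns are invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, year INT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("North", "TV", 2023, 1200.0),
    ("North", "Radio", 2023, 300.0),
    ("South", "TV", 2023, 900.0),
])

def slice_and_dice(region=None, year=None):
    # Each user gesture simply adds another predicate to the WHERE clause.
    sql = "SELECT region, product, SUM(amount) FROM sales"
    clauses, params = [], []
    if region is not None:
        clauses.append("region = ?")
        params.append(region)
    if year is not None:
        clauses.append("year = ?")
        params.append(year)
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " GROUP BY region, product"
    return conn.execute(sql, params).fetchall()

print(slice_and_dice(region="North", year=2023))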

Advantages:

 Can handle large amounts of data: The data size limitation of ROLAP technology is
the limitation on data size of the underlying relational database. In other words,
ROLAP itself places no limitation on data amount.
 Can leverage functionalities inherent in the relational database: Often, relational
database already comes with a host of functionalities. ROLAP technologies, since
they sit on top of the relational database, can therefore leverage these
functionalities.


Disadvantages:

 Performance can be slow: Because each ROLAP report is essentially a SQL query
(or multiple SQL queries) in the relational database, the query time can be long if
the underlying data size is large.
 Limited by SQL functionalities: Because ROLAP technology mainly relies on
generating SQL statements to query the relational database, and SQL statements
do not fit all needs (for example, it is difficult to perform complex calculations
using SQL), ROLAP technologies are therefore traditionally limited by what SQL
can do. ROLAP vendors have mitigated this risk by building into the tool out-of-
the-box complex functions as well as the ability to allow users to define their own
functions.

Examples: Microstrategy Intelligence Server, MetaCube (Informix/IBM)

HOLAP

(MQE: Managed Query Environment)

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For
summary-type information, HOLAP leverages cube technology for faster performance. It
stores only the indexes and aggregations in the multidimensional form while the rest of
the data is stored in the relational database.


Examples: PowerPlay (Cognos), Brio, Microsoft Analysis Services, Oracle Advanced


Analytic Services.

OLAP Guidelines:

Dr. E. F. Codd, the "father" of the relational model, created a list of rules for OLAP systems. Users should prioritize these rules according to their needs to match their business requirements (reference 3).

These rules are:

1) Multidimensional conceptual view: The OLAP tool should provide an appropriate multidimensional business model that suits the business problems and requirements.

2) Transparency: The OLAP tool should provide transparency to the input data for the
users.

3) Accessibility: The OLAP tool should access only the data required for the analysis needed.

4) Consistent reporting performance: The size of the database should not affect the reporting performance in any way.

5) Client/server architecture: The OLAP tool should use the client server architecture to
ensure better performance and flexibility.

6) Generic dimensionality: Data entered should be equivalent to the structure and


operation requirements.

7) Dynamic sparse matrix handling: The OLAP tool should be able to manage the sparse matrix and so maintain the level of performance.

8) Multi-user support: The OLAP tool should allow several users to work together concurrently.

9) Unrestricted cross-dimensional operations: The OLAP tool should be able to perform


operations across the dimensions of the cube.


10) Intuitive data manipulation: "Consolidation path re-orientation, drilling down across columns or rows, zooming out, and other manipulation inherent in the consolidation path outlines should be accomplished via direct action upon the cells of the analytical model, and should neither require the use of a menu nor multiple trips across the user interface." (Reference 4)

11) Flexible reporting: The ability of the tool to present rows and columns in a manner suitable for analysis.

12) Unlimited dimensions and aggregation levels: This depends on the kind of Business,
where multiple dimensions and defining hierarchies can be made.

Features of OLTP and OLAP

The major distinguishing features between OLTP and OLAP are summarized as follows.

1. Users and system orientation: An OLTP system is customer-oriented and is used for
transaction and query processing by clerks, clients, and information technology
professionals. An OLAP system is market-oriented and is used for data analysis by
knowledge workers, including managers, executives, and analysts.

2. Data contents: An OLTP system manages current data that, typically, are too detailed
to be easily used for decision making. An OLAP system manages large amounts of
historical data, provides facilities for summarization and aggregation, and stores and
manages information at different levels of granularity. These features make the data
easier for use in informed decision making.

3. Database design: An OLTP system usually adopts an entity-relationship (ER) data


model and an application oriented database design. An OLAP system typically adopts
either a star or snowflake model and a subject-oriented database design.

4. View: An OLTP system focuses mainly on the current data within an enterprise or
department, without referring to historical data or data in different organizations. In
contrast, an OLAP system often spans multiple versions of a database schema. OLAP


systems also deal with information that originates from different organizations,
integrating information from many data stores. Because of their huge volume, OLAP data
are stored on multiple storage media.

5. Access patterns: The access patterns of an OLTP system consist mainly of short,
atomic transactions. Such a system requires concurrency control and recovery
mechanisms. However, accesses to OLAP systems are mostly read-only operations
although many could be complex queries.

Comparison between OLTP and OLAP systems.


Multidimensional versus Multi relational OLAP, Categories of Tools


Representation of Multi-Dimensional Data:

 OLAP database servers use multi-dimensional structures to store data and


relationships between data.
 Multi-dimensional structures are best-visualized as cubes of data, and cubes within
cubes of data. Each side of a cube is a dimension.

Multi-dimensional OLAP supports common analytical operations, such as:

• Consolidation: involves the aggregation of data, such as 'roll-ups' or complex expressions involving interrelated data. For example, branch offices can be rolled up to cities and cities rolled up to countries.
• Drill-down: the reverse of consolidation; involves displaying the detailed data that comprises the consolidated data.
• Slicing and dicing: refers to the ability to look at the data from different viewpoints. Slicing and dicing is often performed along a time axis in order to analyze trends and find patterns.


Relational OLAP (ROLAP)

 ROLAP is the fastest-growing type of OLAP tools.


 ROLAP supports RDBMS products through the use of a metadata layer, thus
avoiding the requirement to create a static multi-dimensional data structure.
 This facilitates the creation of multiple multi-dimensional views of the two-
dimensional relation.
 To improve performance, some ROLAP products have enhanced SQL engines to
support the complexity of multi-dimensional analysis, while others recommend, or
require, the use of highly denormalized database designs such as the star schema.
 The development issues associated with ROLAP technology:
 Performance problems associated with the processing of complex queries that
require multiple passes through the relational data.
 Development of middleware to facilitate the development of multi-dimensional
applications. Development of an option to create persistent multi-dimensional
structures, together with facilities to assist in the administration of these structures.

OLAP Tools and the Internet

Categorization of OLAP Tools

OLAP tools are designed to manipulate and control multi-dimensional databases and help the sophisticated user to analyze the data using clear multidimensional complex views. Their typical applications include product performance and profitability, effectiveness of a sales program or a marketing campaign, sales forecasting, and capacity planning.



The most significant developments in computing have been the Internet and data warehousing, so the integration of these two technologies is a necessity. The advantages of using the Web for data access are clear. These advantages are:

1. The internet provides connectivity between countries acting as a free resource.

2. The web eases administrative tasks of managing scattered locations.

3. The Web allows users to store and manage data and applications on servers that can be
managed, maintained and updated centrally.


These reasons indicate the importance of the Web in data storage and manipulation. The
Web-enabled data access has many significant features, such as:

 The first
 The second
 The emerging third
 HTML publishing
 Helper applications
 Plug-ins
 Server-centric components
 Java and active-x applications

Products for OLAP

Microsoft Analysis Services (previously called OLAP Services, part of SQL Server), IBM's
DB2 OLAP Server, SAP BW and products from Brio, Business Objects, Cognos, Micro
Strategy and others.

Companies using OLAP

MIS AG Overview

MIS AG is the leading European provider of business intelligence solutions and services,
providing development, implementation, and service of systems for budgeting, reporting,
consolidation, and analysis.

Poet Overview

With FastObjects™, German Poet Software GmbH (Poet) provides developers with a
flexible Object-oriented Database Management System (ODBMS) solution optimized for
managing complexity in high-performance applications using Java technology, C++ and
.NET.


UNIT III

DATA MINING

Introduction – Data – Types of Data – Data Mining Functionalities – Interestingness of


Patterns – Classification of Data Mining Systems – Data Mining Task Primitives – Integration
of a Data Mining System with a Data Warehouse – Issues –Data Preprocessing.

Data

• Collection of data objects and their attributes


• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describe an object
– Object is also known as record, point, case, sample, entity, or instance
Example data set (each row is an object/record; each column is an attribute):

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values


• Example: Attribute values for ID and age are integers


• But properties of attribute values can be different
– ID has no limit but age has a maximum and minimum value
Types of Attributes
• There are different types of attributes
– Nominal
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-10),
grades, height in {tall, medium, short}
– Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit.
– Ratio
Examples: temperature in Kelvin, length, time, counts

Evolution of Database Technology


Data mining primitives.

A data mining query is defined in terms of the following primitives

1. Task-relevant data: This is the database portion to be investigated. For example, suppose
that you are a manager of All Electronics in charge of sales in the United States and Canada. In
particular, you would like to study the buying trends of customers in Canada. Rather than mining on the entire database, you can specify only the portion of the database that is relevant to the task; the attributes involved are referred to as relevant attributes.

2. The kinds of knowledge to be mined: This specifies the data mining functions to be
performed, such as characterization, discrimination, association, classification, clustering, or
evolution analysis. For instance, if studying the buying habits of customers in Canada, you
may choose to mine associations between customer profiles and the items that these
customers like to buy


3. Background knowledge: Users can specify background knowledge, or knowledge about


the domain to be mined. This knowledge is useful for guiding the knowledge discovery
process, and for evaluating the patterns found. There are several kinds of background
knowledge.

4. Interestingness measures: These functions are used to separate uninteresting patterns


from knowledge. They may be used to guide the mining process, or after discovery, to
evaluate the discovered patterns. Different kinds of knowledge may have different
interestingness measures.

5. Presentation and visualization of discovered patterns: This refers to the form in which
discovered patterns are to be displayed. Users can choose from different forms for knowledge
presentation, such as rules, tables, charts, graphs, decision trees, and cubes.


Figure : Primitives for specifying a data mining task.



Knowledge Discovery in Databases or KDD

Knowledge discovery as a process is depicted and consists of an iterative sequence of the


following steps:

 Data cleaning (to remove noise or irrelevant data),



 Data integration (where multiple data sources may be combined)

 Data selection (where data relevant to the analysis task are retrieved from the
database)

 Data transformation (where data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations, for
instance),

 Data mining (an essential process where intelligent methods are applied in order to
extract data patterns),

 Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on some interestingness measures;), and

 Knowledge presentation (where visualization and knowledge representation
techniques are used to present the mined knowledge to the user).

Figure: Data mining as a process of knowledge discovery.
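The steps above can be sketched as a small pandas pipeline. This is only an illustrative outline: the file name, the column names and the choice of a simple frequency count as the "mining" step are assumptions, not part of the notes.

import pandas as pd

def kdd_pipeline(path="transactions.csv"):            # hypothetical input file
    raw = pd.read_csv(path)

    # Data cleaning: drop records with missing item or amount values.
    clean = raw.dropna(subset=["item", "amount"])

    # Data integration: further sources would be merged here (omitted).

    # Data selection: keep only the task-relevant columns.
    selected = clean[["customer_id", "item", "amount"]]

    # Data transformation: aggregate amounts per customer and item.
    transformed = (selected.groupby(["customer_id", "item"])["amount"]
                           .sum().reset_index())

    # Data mining (a very simple stand-in): most frequently bought items.
    patterns = transformed["item"].value_counts()

    # Pattern evaluation: keep items bought by at least 10 customers.
    interesting = patterns[patterns >= 10]

    # Knowledge presentation: return the result for reporting or plotting.
    return interesting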


Architecture of a typical data mining system.

The architecture of a typical data mining system may have the following major components

1. Database, data warehouse, or other information repository. This is one or a set of


databases, data warehouses, spread sheets, or other kinds of information repositories. Data
cleaning and data integration techniques may be performed on the data.

2. Database or data warehouse server. The database or data warehouse server is


responsible for fetching the relevant data, based on the user's data mining request.

3. Knowledge base. This is the domain knowledge that is used to guide the search, or
evaluate the interestingness of resulting patterns. Such knowledge can include concept
hierarchies, used to organize attributes or attribute values into different levels of abstraction.
Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based
on its unexpectedness, may also be included.

4. Data mining engine. This is essential to the data mining system and ideally consists of a
set of functional modules for tasks such as characterization, association analysis,
classification, evolution and deviation analysis.

5. Pattern evaluation module. This component typically employs interestingness measures


and interacts with the data mining modules so as to focus the search towards interesting
patterns. It may access interestingness thresholds stored in the knowledge base. Alternatively,
the pattern evaluation module may be integrated with the mining module, depending on the
implementation of the data mining method used.

6. Graphical user interface. This module communicates between users and the data mining
system, allowing the user to interact with the system by specifying a data mining query or
task, providing information to help focus the search, and performing exploratory data mining
based on the intermediate data mining results.


Figure: Architecture of a typical data mining system

Data mining functionalities

Data mining functionalities are used to specify the kind of patterns to be found in data mining
tasks. In general, data mining tasks can be classified into two categories: descriptive and predictive.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions. In
some cases, users may have no idea of which kinds of patterns in their data may be
interesting, and hence may like to search for several different kinds of patterns in parallel.
Thus it is important to have a data mining system that can mine multiple kinds of patterns to
accommodate different user expectations or applications. Furthermore, data mining systems should be able to discover patterns at various granularities. To encourage interactive and exploratory mining, users should be able to easily "play" with the output patterns, such as by mouse clicking. Operations that can be specified by simple mouse clicks include adding or dropping a dimension (or an attribute), swapping rows and columns (pivoting, or axis rotation), changing dimension representations (e.g., from a 3-D cube to a sequence of 2-D cross tabulations, or crosstabs), or using OLAP roll-up or drill-down operations along dimensions. Such operations allow data patterns to be expressed from different angles of view and at multiple levels of abstraction.

Data mining systems should also allow users to specify hints to guide or focus the search for interesting patterns. Since some patterns may not hold for all of the data in the database, a measure of certainty or "trustworthiness" is usually associated with each discovered pattern.
Data mining functionalities, and the kinds of patterns they can discover, are described below.

Concept/class description: characterization and discrimination


Data can be associated with classes or concepts. For example, in the AllElectronics store,
classes of items for sale include computers and printers, and concepts of customers include
bigSpenders and budgetSpenders. It can be useful to describe individual classes and concepts
in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are
called class/concept descriptions. These descriptions can be derived via (1) data
characterization, by summarizing the data of the class under study (often called the target
class) in general terms, or (2) data discrimination, by comparison of the target class with one
or a set of comparative classes (often called the contrasting classes), or (3) both data
characterization and discrimination.
Data characterization is a summarization of the general characteristics or features of a target
class of data. The data corresponding to the user-specified class are typically collected by a
database query. For example, to study the characteristics of software products whose sales
increased by 10% in the last year, one can collect the data related to such products by
executing an SQL query. There are several methods for effective data summarization and characterization. For instance, the data cube-based OLAP roll-up operation can be used to
perform user-controlled data summarization along a specified dimension. This process is
further detailed in Chapter 2, which discusses data warehousing. An attribute-oriented induction technique can be used to perform data generalization and characterization without
step-by-step user interaction. The output of data characterization can be presented in various
forms. Examples include pie charts, bar charts, curves, multidimensional data cubes, and


multidimensional tables, including crosstabs. The resulting descriptions can also be presented
as generalized relations, or in rule form (called characteristic rules).

Association analysis
Association analysis is the discovery of association rules showing attribute-value conditions
that occur frequently together in a given set of data. Association analysis is widely used for
market basket or transaction data analysis. More formally, association rules are of the form X ⇒ Y, i.e., "A1 ∧ ... ∧ Am ⇒ B1 ∧ ... ∧ Bn", where the Ai (for i ∈ {1, ..., m}) and Bj (for j ∈ {1, ..., n}) are attribute-value pairs. The association rule X ⇒ Y is interpreted as "database tuples that satisfy the conditions in X are also likely to satisfy the conditions in Y".

An association between more than one attribute, or predicate (i.e., age, income, and buys).
Adopting the terminology used in multidimensional databases, where each attribute is
referred to as a dimension,the above rule can be referred to as a multidimensional association
rule. Suppose, as a marketing manager of AllElectronics, you would like to determine which
items are frequently purchased together within the same transactions. An example of such a
rule is
contains(T, "computer") ⇒ contains(T, "software") [support = 1%, confidence = 50%]

meaning that if a transaction T contains "computer", there is a 50% chance that it contains "software" as well, and 1% of all of the transactions contain both. This association rule involves a single attribute or predicate (i.e., contains) which repeats. Association rules that contain a single predicate are referred to as single-dimensional association rules. Dropping the predicate notation, the above rule can be written simply as "computer ⇒ software [1%, 50%]".


Classification and prediction


Classification is the processing of finding a set of models (or functions) which describe and
distinguish data classes or concepts, for the purposes of being able to use the model to predict
the class of objects whose class label is unknown. The derived model is based on the analysis
of a set of training data (i.e., data objects whose class label is known). The derived model may
be represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks. A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions. Decision trees can be easily converted to classification rules. A neural network is a collection of linear threshold units that can be trained to distinguish objects of different classes. Classification can be used for predicting the class label of data objects. However, in many applications, one may like to predict some missing or unavailable data values rather than class labels. This is usually the case when the predicted values are numerical data, and is often specifically referred to as prediction. Although prediction may refer to both data value prediction and class label prediction, it is usually confined to data value prediction and thus is distinct from classification. Prediction also encompasses the identification of distribution trends based on the available data. Classification and prediction may need to be preceded by relevance analysis, which attempts to identify attributes that do not contribute to the classification or
prediction process.
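A minimal sketch of classification with a decision tree, assuming scikit-learn is available. The attributes loosely echo the refund / marital status / taxable income example table earlier in this unit, but the numeric encoding and the training values here are invented for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# Training data: [refund (1 = yes), marital status code, taxable income in K]
X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120],
     [0, 2, 95], [0, 1, 60], [1, 2, 220], [0, 0, 85]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes"]   # class label: cheat

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The learned tree can be read off as IF-THEN classification rules.
print(export_text(model, feature_names=["refund", "marital", "income"]))

# Predict the class label of a previously unseen object.
print(model.predict([[0, 0, 90]]))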

Clustering analysis
Clustering analyzes data objects without consulting a known class label. In general, the class
labels are not present in the training data simply because they are not known to begin with.
Clustering can be used to generate such labels. The objects are clustered or grouped based on
the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
That is, clusters of objects are formed so that objects within a cluster have high similarity in
comparison to one another, but are very dissimilar to objects in other clusters. Each cluster
that is formed can be viewed as a class of objects, from which rules can be derived.


Evolution and deviation analysis


Data evolution analysis describes and models regularities or trends for objects whose
behavior changes over time. Although this may include characterization, discrimination,
association, classification, or clustering of time-related data, distinct features of such an
analysis include time-series data analysis, sequence or periodicity pattern matching, and
similarity-based data analysis.

Interestingness Patterns

A data mining system has the potential to generate thousands or even millions of patterns, or
rules. This raises some serious questions for data mining:
A pattern is interesting if (1) it is easily understood by humans, (2) valid on new or test data
with some degree of certainty, (3) potentially useful, and (4) novel. A pattern is also
interesting if it validates a hypothesis that the user sought to confirm. An interesting pattern represents knowledge. Several objective measures of pattern interestingness exist. These are based on the structure of discovered patterns and the statistics underlying them. An objective measure for association rules of the form X ⇒ Y is rule support, representing the percentage of data samples that the given rule satisfies. Another objective measure for association rules is confidence, which assesses the degree of certainty of the detected association. It is defined as the conditional probability that a pattern Y is true given that X is true. More formally, support and confidence are defined as

support(X ⇒ Y) = P(X ∪ Y)
confidence(X ⇒ Y) = P(Y | X)
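A short sketch showing how the support and confidence defined above can be computed from a list of transactions; the transactions themselves are made up for illustration.

def support_confidence(transactions, x, y):
    """Support and confidence of the rule X => Y over a list of item sets."""
    n = len(transactions)
    count_x = sum(1 for t in transactions if x <= t)
    count_xy = sum(1 for t in transactions if (x | y) <= t)
    support = count_xy / n                                  # P(X U Y)
    confidence = count_xy / count_x if count_x else 0.0     # P(Y | X)
    return support, confidence

transactions = [{"computer", "software"}, {"computer"},
                {"printer"}, {"computer", "software", "printer"}]
print(support_confidence(transactions, {"computer"}, {"software"}))  # (0.5, 0.666...)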

A classification of data mining systems


Data mining is an interdisciplinary field, the confluence of a set of disciplines including
database systems, statistics, machine learning, visualization, and information science.
Moreover, depending on the data mining approach used, techniques from other disciplines


may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge
representation, inductive logic programming, or high performance computing. Depending on
the kinds of data to be mined or on the given data mining application, the data mining system
may also integrate techniques from spatial data analysis, information retrieval, pattern
recognition, image analysis, signal processing, computer graphics, Web technology,
economics, or psychology. Because of the diversity of disciplines contributing to data mining,
data mining research is expected to generate
a large variety of data mining systems. Therefore, it is necessary to provide a clear
classification of data mining systems. Such a classification may help potential users distinguish data mining systems and identify those that best match their needs. Data mining systems can be categorized according to various criteria, as follows.

Classification according to the kinds of databases mined. A data mining system can be classified according to the kinds of databases mined. Database systems themselves can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly.
For instance, if classifying according to data models, we may have a relational, transactional,
object-oriented, object-relational, or data warehouse mining system. If classifying according
to the special types of data handled, we may have a spatial, time-series, text, or multimedia
data mining system, or a World-Wide Web mining system. Other system types include
heterogeneous data mining systems, and legacy data mining systems.

Classification according to the kinds of knowledge mined.


Data mining systems can be categorized according to the kinds of knowledge they mine, i.e.,
based on data mining functionalities, such as characterization, discrimination, association,
classification, clustering, trend and evolution analysis, deviation analysis, similarity analysis, etc. A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities. Moreover, data mining systems can also be distinguished based on the
granularity or levels of abstraction of the knowledge mined, including generalized knowledge
(at a high level of abstraction), primitive-level knowledge (at a raw data level), or knowledge
at multiple levels (considering several levels of abstraction). An advanced data mining system
should facilitate the discovery of knowledge at multiple levels of abstraction.

Classification according to the kinds of techniques utilized


Data mining systems can also be categorized according to the underlying data mining
techniques employed. These techniques can be described according to the degree of user
interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven
systems), or the methods of data analysis employed (e.g., database-oriented or data
warehouse-oriented techniques, machine learning, statistics, visualization, pattern
recognition, neural networks, and so on). A sophisticated data mining system will often adopt
multiple data mining techniques or work out an effective, integrated technique which
combines the merits of a few individual approaches.


Major issues in data mining


The scope of this book addresses major issues in data mining regarding mining methodology,
user interaction, performance, and diverse data types. These issues are introduced below:
1. Mining methodology and user-interaction issues. These reflect the kinds of knowledge
mined, the ability to mine knowledge at multiple granularities, the use of domain knowledge,
ad-hoc mining, and knowledge visualization.
Mining different kinds of knowledge in databases.
Since different users can be interested in different kinds of knowledge, data mining should
cover a wide spectrum of data analysis and knowledge discovery tasks, including data
characterization, discrimination, association, classification, clustering, trend and deviation
analysis, and similarity analysis. These tasks may use the same database in different ways and
require the development of numerous data mining techniques.
Interactive mining of knowledge at multiple levels of abstraction.
Since it is difficult to know exactly what can be discovered within a database, the data mining
process should be interactive. For databases containing a huge amount of data, appropriate
sampling technique can first be applied to facilitate interactive data exploration. Interactive
mining allows users to focus the search for patterns, providing and refining data mining
requests based on returned results. Specifically, knowledge should be mined by drilling-down,
rolling-up, and pivoting through the data space and knowledge space interactively, similar to
what OLAP can do on data cubes. In this way, the user can interact with the data mining
system to view data and discovered patterns at multiple granularities and from different
angles.

Incorporation of background knowledge.


Background knowledge, or information regarding the domain under study, may be used to
guide the discovery process and allow discovered patterns to be expressed in concise terms
and at different levels of abstraction. Domain knowledge related to databases, such as
integrity constraints and deduction rules, can help focus and speed up a data mining process,
or judge the interestingness of discovered patterns.


Data mining query languages and ad-hoc data mining.


Relational query languages (such as SQL) allow users to pose ad-hoc queries for data retrieval.
In a similar vein, high-level data mining query languages need to be developed to allow users
to describe ad-hoc data mining tasks by facilitating the specification of the relevant sets of data
for analysis, the domain knowledge, the kinds of knowledge to be mined, and the conditions
and interestingness constraints to be enforced on the discovered patterns. Such a language
should be integrated with a database or data warehouse query language, and optimized for
efficient and flexible data mining.

Presentation and visualization of data mining results.


Discovered knowledge should be expressed in high-level languages, visual representations, or
other expressive forms so that the knowledge can be easily understood and directly usable by
humans. This is especially crucial if the data mining system is to be interactive. This requires
the system to adopt expressive knowledge representation techniques, such as trees, tables,
rules, graphs, charts, crosstabs, matrices, or curves.

Handling outlier or incomplete data.


The data stored in a database may reflect outliers: noise, exceptional cases, or incomplete data objects. These objects may confuse the analysis process, causing overfitting of the data to the knowledge model constructed. As a result, the accuracy of the discovered patterns can be poor. Data cleaning methods and data analysis methods which can handle outliers are required. While most methods discard outlier data, such data may be of interest in itself, such as in fraud detection for finding unusual usage of telecommunication services or credit cards.
This form of data analysis is known as outlier mining.
Pattern evaluation: the interestingness problem.
A data mining system can uncover thousands of patterns. Many of the patterns discovered
may be uninteresting to the given user, representing common knowledge or lacking novelty.
Several challenges remain regarding the development of techniques to assess the


interestingness of discovered patterns, particularly with regard to subjective measures which


estimate the value of patterns with respect to a given user class, based on user beliefs or
expectations. The use of interestingness measures to guide the discovery process and reduce
the search space is another active area of research.

2. Performance issues. These include efficiency, scalability, and parallelization of data


mining algorithms.
Efficiency and scalability of data mining algorithms.
To effectively extract information from a huge amount of data in databases, data mining
algorithms must be efficient and scalable. That is, the running time of a data mining algorithm
must be predictable and acceptable in large databases. Algorithms with exponential or even
medium-order polynomial complexity will not be of practical use. From a database
perspective on knowledge discovery, efficiency and scalability are key issues in the
implementation of data mining systems. Many of the issues discussed above under mining
methodology and user-interaction must also consider efficiency and scalability.

Parallel, distributed, and incremental updating algorithms.


The huge size of many databases, the wide distribution of data, and the computational
complexity of some data mining methods are factors motivating the development of parallel
and distributed data mining algorithms. Such algorithms divide the data into partitions, which
are processed in parallel. The results from the partitions are then merged. Moreover, the high
cost of some data mining processes promotes the need for incremental data mining
algorithms which incorporate database updates without having to mine the entire data again
"from scratch". Such algorithms perform knowledge modification incrementally to amend and
strengthen what was previously discovered.


3. Issues relating to the diversity of database types.


Handling of relational and complex types of data.
There are many kinds of data stored in databases and data warehouses. Since relational
databases and data warehouses are widely used, the development of efficient and effective
data mining systems for such data is important. However, other databases may contain
complex data objects, hypertext and multimedia data, spatial data, temporal data, or
transaction data. It is unrealistic to expect one system to mine all kinds of data due to the
diversity of data types and different goals of data mining. Specific data mining systems should
be constructed for mining specific kinds of data. Therefore, one may expect to have different
data mining systems for different kinds of data.

Mining information from heterogeneous databases and global information systems.


Local and wide-area computer networks (such as the Internet) connect many sources of data,
forming huge, distributed, and heterogeneous databases. The discovery of knowledge from
different sources of structured, semi-structured, or unstructured data with diverse data
semantics poses great challenges to data mining. Data mining may help disclose high-level
data regularities in multiple heterogeneous databases that are unlikely to be discovered by
simple query systems and may improve information exchange and interoperability in
heterogeneous databases.


Data Preprocessing
Data cleaning.

Data cleaning routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.

(i). Missing values

1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining
task involves classification or description). This method is not very effective, unless the tuple
contains several attributes with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably.

2. Fill in the missing value manually: In general, this approach is time-consuming and may
not be feasible given a large data set with many missing values.

3. Use a global constant to fill in the missing value: Replace all missing attribute values by
the same constant, such as a label like “Unknown". If missing values are replaced by, say,
“Unknown", then the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common - that of “Unknown". Hence, although this
method is simple, it is not recommended.

4. Use the attribute mean to fill in the missing value: For example, suppose that the
average income of All Electronics customers is $28,000. Use this value to replace the missing
value for income.

5. Use the attribute mean for all samples belonging to the same class as the given tuple:
For example, if classifying customers according to credit risk, replace the missing value with
the average income value for customers in the same credit risk category as that of the given
tuple.

6. Use the most probable value to fill in the missing value: This may be determined with
inference-based tools using a Bayesian formalism or decision tree induction. For example,


using the other customer attributes in your data set, you may construct a decision tree to
predict the missing values for income.
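A minimal sketch of strategies 3 to 5 above using pandas; the income figures and the credit_risk grouping column are invented for illustration.

import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [28000.0, 30000.0, None, 45000.0, None],
})

# Strategy 3: fill with a global (sentinel) constant - simple, but not recommended.
const_filled = df["income"].fillna(-1)

# Strategy 4: fill with the overall attribute mean.
mean_filled = df["income"].fillna(df["income"].mean())

# Strategy 5: fill with the mean of the same class (credit risk category).
class_filled = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean()))

print(mean_filled, class_filled, sep="\n\n")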

(ii). Noisy data

Noise is a random error or variance in a measured variable.

1. Binning methods:

Binning methods smooth a sorted data value by consulting the "neighborhood", or


values around it. The sorted values are distributed into a number of 'buckets', or bins.
Because binning methods consult the neighborhood of values, they perform local smoothing.
Figure illustrates some binning techniques.

In this example, the data for price are first sorted and partitioned into equi-depth bins
(of depth 3). In smoothing by bin means, each value in a bin is replaced by the mean value of
the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original
value in this bin is replaced by the value 9. Similarly, smoothing by bin medians can be
employed, in which each bin value is replaced by the bin median. In smoothing by bin
boundaries, the minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value.

(i).Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

(ii). Partition into (equi-depth) bins of depth 3:

• Bin 1: 4, 8, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 28, 34
(iii).Smoothing by bin means:

- Bin 1: 9, 9, 9

- Bin 2: 22, 22, 22


- Bin 3: 29, 29, 29

(iv).Smoothing by bin boundaries:

• Bin 1: 4, 4, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 25, 34
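The equi-depth binning and smoothing-by-bin-means steps worked through above can be written as a short Python function; the same price values are reused so the output matches the example.

def smooth_by_bin_means(values, depth):
    # Equi-depth (equal-frequency) binning followed by smoothing by bin means.
    data = sorted(values)
    smoothed = []
    for i in range(0, len(data), depth):
        bin_values = data[i:i + depth]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([round(mean)] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, depth=3))
# -> [9, 9, 9, 22, 22, 22, 29, 29, 29], matching the worked example above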
2. Clustering:

Outliers may be detected by clustering, where similar values are organized into groups
or “clusters”. Intuitively, values which fall outside of the set of clusters may be considered
outliers.

Figure: Outliers may be detected by clustering analysis.
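A minimal sketch of clustering-based outlier detection, assuming scikit-learn is available. DBSCAN groups nearby points into clusters and labels points that fall outside every cluster as noise (label -1); the data points and the eps threshold are invented for illustration.

import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
                   [25.0, 1.0]])              # an obvious outlier

labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(points)
print(points[labels == -1])                   # prints only the outlier point [25. 1.]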

3. Combined computer and human inspection: Outliers may be identified through a


combination of computer and human inspection. In one application, for example, an
information-theoretic measure was used to help identify outlier patterns in a handwritten
character database for classification. The measure's value reflected the “surprise" content of
the predicted character label with respect to the known label. Outlier patterns may be
informative or “garbage". Patterns whose surprise content is above a threshold are output to


a list. A human can then sort through the patterns in the list to identify the actual garbage ones.

4. Regression: Data can be smoothed by fitting the data to a function, such as with regression.
Linear regression involves finding the “best" line to fit two variables, so that one variable can
be used to predict the other. Multiple linear regression is an extension of linear regression,
where more than two variables are involved and the data are fit to a multidimensional
surface.

(iii). Inconsistent data

There may be inconsistencies in the data recorded for some transactions. Some data
inconsistencies may be corrected manually using external references. For example, errors
made at data entry may be corrected by performing a paper trace. This may be coupled with
routines designed to help correct the inconsistent use of codes. Knowledge engineering tools
may also be used to detect the violation of known data constraints. For example, known
functional dependencies between attributes can be used to find values contradicting the
functional constraints.

Data transformation.

In data transformation, the data are transformed or consolidated into forms


appropriate for mining. Data transformation can involve the following:

1. Normalization, where the attribute data are scaled so as to fall within a small specified
range, such as -1.0 to 1.0, or 0 to 1.0.

There are three main methods for data normalization: min-max normalization, z-score normalization, and normalization by decimal scaling.

(i). Min-max normalization performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A] by computing

v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A


(ii). z-score normalization (or zero-mean normalization): the values for an attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v' by computing

v' = (v - mean_A) / std_dev_A

where mean_A and std_dev_A are the mean and standard deviation, respectively, of attribute A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers which dominate the min-max normalization.

(iii). Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal places moved depends on the maximum absolute value of A. A value v of A is normalized to v' by computing

v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.
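The three normalization methods just described can be sketched with plain Python; the income values are invented for illustration.

def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    j = len(str(int(max(abs(v) for v in values))))   # digits in the largest magnitude
    return [v / 10 ** j for v in values]

incomes = [73600, 12000, 54000, 98700]
print(min_max(incomes))          # values rescaled into [0, 1]
print(z_score(incomes))          # values centred on 0 with unit standard deviation
print(decimal_scaling(incomes))  # divided by 10^5 so every value falls in (-1, 1)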

2. Smoothing, which works to remove the noise from the data. Such techniques include binning, clustering, and regression; these methods were illustrated above under data cleaning.



3. Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual total
amounts.

4. Generalization of the data, where low level or 'primitive' (raw) data are replaced by
higher level concepts through the use of concept hierarchies. For example, categorical
attributes, like street, can be generalized to higher level concepts, like city or county.

Data reduction.

Data reduction techniques can be applied to obtain a reduced representation of the


data set that is much smaller in volume, yet closely maintains the integrity of the original data.
That is, mining on the reduced data set should be more efficient yet produce the same (or
almost the same) analytical results.

Strategies for data reduction include the following.

1. Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.

2. Dimension reduction, where irrelevant, weakly relevant or redundant attributes or


dimensions may be detected and removed.

3. Data compression, where encoding mechanisms are used to reduce the data set size.

4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller
data representations such as parametric models (which need store only the model


parameters instead of the actual data), or nonparametric methods such as clustering,


sampling, and the use of histograms.

5. Discretization and concept hierarchy generation, where raw data values for attributes
are replaced by ranges or higher conceptual levels. Concept hierarchies allow the mining of
data at multiple levels of abstraction, and are a powerful tool for data mining.

Data Cube Aggregation

• The lowest level of a data cube


– the aggregated data for an individual entity of interest
– e.g., a customer in a phone calling data warehouse.
• Multiple levels of aggregation in data cubes
– Further reduce the size of data to deal with
• Reference appropriate levels
– Use the smallest representation which is enough to solve the task
• Queries regarding aggregated information should be answered using data cube, when
possible

Dimensionality Reduction

Feature selection (i.e., attribute subset selection):

– Select a minimum set of features such that the probability distribution of different
classes given the values for those features is as close as possible to the original
distribution given the values of all features
– reduces the number of attributes appearing in the discovered patterns, making them easier to understand
Heuristic methods:

1. Step-wise forward selection: The procedure starts with an empty set of attributes. The
best of the original attributes is determined and added to the set. At each subsequent iteration
or step, the best of the remaining original attributes is added to the set.


2. Step-wise backward elimination: The procedure starts with the full set of attributes. At
each step, it removes the worst attribute remaining in the set.

3. Combination of forward selection and backward elimination: The step-wise forward
selection and backward elimination methods can be combined so that, at each step, the procedure
selects the best attribute and removes the worst from among the remaining attributes.

4. Decision tree induction: Decision tree algorithms, such as ID3 and C4.5, were originally
intended for classification. Decision tree induction constructs a flow-chart-like structure
where each internal (non-leaf) node denotes a test on an attribute, each branch corresponds
to an outcome of the test, and each external (leaf) node denotes a class prediction. At each
node, the algorithm chooses the "best" attribute to partition the data into individual classes.
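
A minimal sketch of step-wise forward selection under an assumed scoring function; the toy tuples, the simple class-purity score, and the fixed number of selected attributes are illustrative choices, not part of the notes.

```python
from collections import Counter

# Toy dataset: each row is (attribute dict, class label) -- purely illustrative
data = [
    ({"age": "young", "student": "no",  "credit": "fair"},      "no"),
    ({"age": "young", "student": "yes", "credit": "fair"},      "yes"),
    ({"age": "mid",   "student": "no",  "credit": "fair"},      "yes"),
    ({"age": "old",   "student": "no",  "credit": "excellent"}, "no"),
    ({"age": "old",   "student": "yes", "credit": "fair"},      "yes"),
]

def purity_score(attrs):
    """Average majority-class fraction of the groups induced by 'attrs' (higher is better)."""
    groups = {}
    for row, label in data:
        key = tuple(row[a] for a in attrs)
        groups.setdefault(key, []).append(label)
    return sum(Counter(ls).most_common(1)[0][1] / len(ls) for ls in groups.values()) / len(groups)

def forward_selection(all_attrs, k):
    selected = []
    while len(selected) < k:
        # At each step, add the remaining attribute that improves the score the most
        best = max((a for a in all_attrs if a not in selected),
                   key=lambda a: purity_score(selected + [a]))
        selected.append(best)
    return selected

print(forward_selection(["age", "student", "credit"], k=2))
```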


Data compression

In data compression, data encoding or transformations are applied so as to obtain a


reduced or ”compressed" representation of the original data. If the original data can be
reconstructed from the compressed data without any loss of information, the data
compression technique used is called lossless. If, instead, we can reconstruct only an
approximation of the original data, then the data compression technique is called lossy. The
two popular and effective methods of lossy data compression: wavelet transforms, and
principal components analysis.

Wavelet transforms

The discrete wavelet transform (DWT) is a linear signal processing technique that,
when applied to a data vector D, transforms it to a numerically different vector, D', of wavelet
coefficients. The two vectors are of the same length.

The DWT is closely related to the discrete Fourier transform (DFT), a signal processing
technique involving sines and cosines. In general, however, the DWT achieves better lossy
compression.

The general algorithm for a discrete wavelet transform is as follows.

1. The length, L, of the input data vector must be an integer power of two. This condition
can be met by padding the data vector with zeros, as necessary.
2. Each transform involves applying two functions. The first applies some data smoothing,


such as a sum or weighted average. The second performs a weighted difference.


3. The two functions are applied to pairs of the input data, resulting in two sets of data of
length L/2. In general, these respectively represent a smoothed version of the input
data and the high-frequency content of it.
4. The two functions are recursively applied to the sets of data obtained in the previous
loop, until the resulting data sets obtained are of desired length.
5. A selection of values from the data sets obtained in the above iterations are designated
the wavelet coefficients of the transformed data.
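
One possible realization of the recursive smoothing/differencing steps above is the (unnormalized) Haar transform; the input vector and the choice of plain pairwise averages and differences are assumptions made for this sketch.

```python
def haar_dwt(data):
    """Level by level: pairwise averages (smoothed half) and differences (detail coefficients)."""
    assert len(data) & (len(data) - 1) == 0, "length must be a power of two (pad with zeros)"
    output, current = [], list(data)
    while len(current) > 1:
        averages = [(current[i] + current[i + 1]) / 2 for i in range(0, len(current), 2)]
        details  = [(current[i] - current[i + 1]) / 2 for i in range(0, len(current), 2)]
        output = details + output   # keep the high-frequency coefficients
        current = averages          # recurse on the smoothed half (length L/2)
    return current + output         # overall average followed by the detail coefficients

coeffs = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
print(coeffs)   # same length as the input vector, as stated above
```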

Principal components analysis

Principal components analysis (PCA) searches for c k-dimensional orthogonal vectors


that can best be used to represent the data, where c << N. The original data is thus projected
onto a much smaller space, resulting in data compression. PCA can be used as a form of
dimensionality reduction. The initial data can then be projected onto this smaller set.

The basic procedure is as follows.

1. The input data are normalized, so that each attribute falls within the same range. This step
helps ensure that attributes with large domains will not dominate attributes with smaller
domains.

2. PCA computes N orthonormal vectors which provide a basis for the normalized input data.
These are unit vectors that each point in a direction perpendicular to the others. These
vectors are referred to as the principal components. The input data are a linear combination
of the principal components.

3. The principal components are sorted in order of decreasing “significance" or strength. The
principal components essentially serve as a new set of axes for the data, providing important
information about variance.


4. Since the components are sorted in decreasing order of "significance", the size of
the data can be reduced by eliminating the weaker components, i.e., those with low variance.
Using the strongest principal components, it should be possible to reconstruct a good
approximation of the original data.
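
A compact NumPy sketch of the four PCA steps above (normalize, compute orthonormal components, sort by significance, keep the strongest); the random input data and the choice of keeping two components are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # 100 tuples with 4 numeric attributes (toy data)

# 1. Normalize so that no attribute dominates because of its scale
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. The orthonormal principal components are the eigenvectors of the covariance matrix
cov = np.cov(Xn, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort the components by decreasing "significance" (variance explained)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep only the c strongest components and project the data onto them
c = 2
reduced = Xn @ eigvecs[:, :c]            # compressed representation, shape (100, 2)
print(reduced.shape, eigvals / eigvals.sum())   # fraction of variance per component
```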

Numerosity reduction

Regression and log-linear models

Regression and log-linear models can be used to approximate the given data. In linear
regression, the data are modeled to fit a straight line. For example, a random variable, Y
(called a response variable), can be modeled as a linear function of another random variable,
X (called a predictor variable), with the equation Y = α + βX, where the variance of Y is assumed to be
constant and α and β are regression coefficients. These coefficients can be solved for by the method of least squares, which minimizes
the error between the actual data and the estimate of the line.
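
A minimal sketch of fitting the line y = a + b·x by the method of least squares; the sample points are made up for illustration.

```python
# Least-squares estimates for the line y = a + b*x (made-up sample points)
xs = [1, 2, 3, 4, 5]
ys = [1.9, 4.1, 6.2, 7.8, 10.1]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x                    # intercept follows from the two means

print(f"y = {a:.3f} + {b:.3f} x")          # the fitted line
print("prediction at x = 6:", a + b * 6)   # numerosity reduction: only a and b are stored
```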

Multiple regression is an extension of linear regression allowing a response variable Y


to be modeled as a linear function of a multidimensional feature vector.

Log-linear models approximate discrete multidimensional probability distributions.


The method can be used to estimate the probability of each cell in a base cuboid for a set of
discretized attributes, based on the smaller cuboids making up the data cube lattice

Histograms

A histogram for an attribute A partitions the data distribution of A into disjoint subsets,
or buckets. The buckets are displayed on a horizontal axis, while the height (and area) of a
bucket typically reflects the average frequency of the values represented by the bucket.

1. Equi-width: In an equi-width histogram, the width of each bucket range is constant (for
example, a bucket width of $10).


2. Equi-depth (or equi-height): In an equi-depth histogram, the buckets are created so that,
roughly, the frequency of each bucket is constant (that is, each bucket contains roughly the
same number of contiguous data samples).

3. V-Optimal: If we consider all of the possible histograms for a given number of buckets, the
V-optimal histogram is the one with the least variance. Histogram variance is a weighted sum
of the original values that each bucket represents, where bucket weight is equal to the
number of values in the bucket.

4. MaxDiff: In a MaxDiff histogram, we consider the difference between each pair of adjacent
values. A bucket boundary is established between each pair for the pairs having the β − 1 largest
differences, where β is the user-specified number of buckets.

Clustering

Clustering techniques consider data tuples as objects. They partition the objects into
groups or clusters, so that objects within a cluster are “similar" to one another and
“dissimilar" to objects in other clusters. Similarity is commonly defined in terms of how
“close" the objects are in space, based on a distance function. The “quality" of a cluster may be
represented by its diameter, the maximum distance between any two objects in the cluster.
Centroid distance is an alternative measure of cluster quality, and is defined as the average
distance of each cluster object from the cluster centroid.


Sampling

Sampling can be used as a data reduction technique since it allows a large data set to be
represented by a much smaller random sample (or subset) of the data. Suppose that a large
data set, D, contains N tuples. Let's have a look at some possible samples for D.

1. Simple random sample without replacement (SRSWOR) of size n: This is created by
drawing n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N,
i.e., all tuples are equally likely.

2. Simple random sample with replacement (SRSWR) of size n: This is similar to SRSWOR,
except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a
tuple is drawn, it is placed back in D so that it may be drawn again.

3. Cluster sample: If the tuples in D are grouped into M mutually disjoint "clusters", then an
SRS of m clusters can be obtained, where m < M. A reduced data representation can be
obtained by applying, say, SRSWOR to the clusters, resulting in a cluster sample of the tuples.

4. Stratified sample: If D is divided into mutually disjoint parts called "strata", a stratified
sample of D is generated by obtaining an SRS at each stratum. This helps to ensure a
representative sample, especially when the data are skewed. For example, a stratified sample
may be obtained from customer data, where a stratum is created for each customer age group.

Figure : Sampling can be used for data reduction.
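
The sampling schemes above (except cluster sampling, omitted for brevity) can be sketched with Python's random module; the toy tuples and the age-group strata are assumptions.

```python
import random

random.seed(1)
D = [{"id": i, "age_group": random.choice(["<30", "30-50", ">50"])} for i in range(100)]
n = 10

# 1. Simple random sample without replacement (SRSWOR)
srswor = random.sample(D, n)

# 2. Simple random sample with replacement (SRSWR): a drawn tuple can be drawn again
srswr = [random.choice(D) for _ in range(n)]

# 4. Stratified sample: an SRS drawn inside every age-group stratum
strata = {}
for t in D:
    strata.setdefault(t["age_group"], []).append(t)
stratified = [t for group in strata.values()
              for t in random.sample(group, max(1, n * len(group) // len(D)))]

print(len(srswor), len(srswr), len(stratified))
```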


Major Issues in Data Warehousing and Mining

• Mining methodology and user interaction


– Mining different kinds of knowledge in databases
– Interactive mining of knowledge at multiple levels of abstraction
– Incorporation of background knowledge
– Data mining query languages and ad-hoc data mining
– Expression and visualization of data mining results
– Handling noise and incomplete data
– Pattern evaluation: the interestingness problem
• Performance and scalability
– Efficiency and scalability of data mining algorithms
– Parallel, distributed and incremental mining methods


• Issues relating to the diversity of data types


– Handling relational and complex types of data
– Mining information from heterogeneous databases and global information
systems (WWW)
• Issues related to applications and social impacts
– Application of discovered knowledge
• Domain-specific data mining tools


UNIT IV
ASSOCIATION RULE MINING AND CLASSIFICATION

Mining Frequent Patterns, Associations and Correlations – Mining Methods – Mining Various Kinds of
Association Rules – Correlation Analysis – Constraint Based Association Mining – Classification and
Prediction - Basic Concepts - Decision Tree Induction - Bayesian Classification – Rule Based
Classification – Classification by Back propagation – Support Vector Machines – Associative
Classification – Lazy Learners – Other Classification Methods – Prediction

Association Mining

• Association rule mining:


– Finding frequent patterns, associations, correlations, or causal structures among sets of
items or objects in transaction databases, relational databases, and other information
repositories.
• Applications:
– Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering,
classification, etc.
• Examples.
– Rule form: "Body ⇒ Head [support, confidence]".
– buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
– major(x, "CS") ^ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]

Association Rule: Basic Concepts

• Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a
customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
– E.g., 98% of people who purchase tires and auto accessories also get automotive services
done
• Applications


– * ⇒ Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
– Home Electronics ⇒ * (What other products should the store stock up on?)
– Attached mailing in direct marketing
– Detecting "ping-pong"ing of patients, faulty "collisions"

Rule Measures: Support and Confidence

• Find all the rules X ∧ Y ⇒ Z with minimum confidence and support
– support, s: probability that a transaction contains {X ∪ Y ∪ Z}
– confidence, c: conditional probability that a transaction having {X ∪ Y} also contains Z

With minimum support 50% and minimum confidence 50%, we have

– A ⇒ C (50%, 66.6%)
– C ⇒ A (50%, 100%)

Transaction ID Items Bought


2000 A,B,C
1000 A,C
4000 A,D
5000 B,E,F
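
The support and confidence figures quoted above can be checked directly against the small transaction table; a minimal sketch follows.

```python
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions containing every item of 'itemset'."""
    return sum(itemset <= t for t in transactions.values()) / len(transactions)

def confidence(lhs, rhs):
    """Conditional probability that a transaction with lhs also contains rhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"A", "C"}))           # 0.5   -> rule support 50%
print(confidence({"A"}, {"C"}))      # 0.666 -> A => C (50%, 66.6%)
print(confidence({"C"}, {"A"}))      # 1.0   -> C => A (50%, 100%)
```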

Association Rule Mining: A Road Map

• Boolean vs. quantitative associations (Based on the types of values handled)


– buys(x, "SQLServer") ^ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
– age(x, "30..39") ^ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
• Single dimension vs. multiple dimensional associations (see ex. Above)
• Single level vs. multiple-level analysis
– What brands of beers are associated with what brands of diapers?
• Various extensions
– Correlation, causality analysis
• Association does not necessarily imply correlation or causality


– Maxpatterns and closed itemsets


– Constraints enforced
• E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?

Market – Basket analysis

A market basket is a collection of items purchased by a customer in a single transaction, which


is a well-defined business activity. For example, a customer's visits to a grocery store or an online
purchase from a virtual store on the Web are typical customer transactions. Retailers accumulate
huge collections of transactions by recording business activities over time. One common analysis run
against a transactions database is to find sets of items, or itemsets, that appear together in many
transactions. A business can use knowledge of these patterns to improve the Placement of these items
in the store or the layout of mail- order catalog page and Web pages. An itemset containing i items is
called an i-itemset. The percentage of transactions that contain an itemset is called the itemset's
support. For an itemset to be interesting, its support must be higher than a user-specified minimum.
Such itemsets are said to be frequent.

Figure : Market basket analysis.


Rule support and confidence are two measures of rule interestingness. They respectively
reflect the usefulness and certainty of discovered rules. A support of 2% for association Rule means
that 2% of all the transactions under analysis show that computer and financial management
software are purchased together. A confidence of 60% means that 60% of the customers who
purchased a computer also bought the software. Typically, association rules are considered
interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.

Mining Frequent Patterns

The method that mines the complete set of frequent itemsets with candidate generation.
Apriori property & The Apriori Algorithm.

Apriori property

• All nonempty subsets of a frequent item set most also be frequent.


– If an itemset I does not satisfy the minimum support threshold, min_sup, then I is not
frequent, i.e., support(I) < min_sup
– If an item A is added to the itemset I, then the resulting itemset (I ∪ A) cannot occur
more frequently than I.
• Monotonic functions are functions that move in only one direction.
• This property is called anti-monotonic.
• If a set can not pass a test, all its supersets will fail the same test as well.
• This property is monotonic in failing the test.

The Apriori Algorithm

• Join Step: Ck is generated by joining Lk-1with itself


• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
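
A compact sketch of the Apriori level-wise loop with its join and prune steps; the transactions reuse the small basket table shown earlier and a minimum support of 50% is assumed.

```python
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
min_support = 0.5

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# L1: the frequent 1-itemsets
Lk = {frozenset([i]) for t in transactions for i in t if support({i}) >= min_support}
frequent = set(Lk)

k = 2
while Lk:
    # Join step: L(k-1) joined with itself to form candidate k-itemsets
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    # Prune step: drop any candidate with an infrequent (k-1)-subset (Apriori property)
    candidates = {c for c in candidates
                  if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
    # Scan the database to count support
    Lk = {c for c in candidates if support(c) >= min_support}
    frequent |= Lk
    k += 1

print(sorted(map(sorted, frequent)))   # [['A'], ['A', 'C'], ['B'], ['C']]
```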


Example


The method that mines the complete set of frequent itemsets without candidate generation (FP-growth).

• Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure


– highly condensed, but complete for frequent pattern mining
– avoid costly database scans
• Develop an efficient, FP-tree-based frequent pattern mining method
– A divide-and-conquer methodology: decompose mining tasks into smaller ones
– Avoid candidate generation: sub-database test only!

Construct FP-tree from a Transaction DB

TID    Items bought                    (Ordered) frequent items
100    {f, a, c, d, g, i, m, p}        {f, c, a, m, p}
200    {a, b, c, f, l, m, o}           {f, c, a, b, m}
300    {b, f, h, j, o}                 {f, b}
400    {b, c, k, s, p}                 {c, b, p}
500    {a, f, c, e, l, p, m, n}        {f, c, a, m, p}

min_support = 0.5
Steps:

1. Scan DB once, find frequent 1-itemset (single item pattern)


2. Order frequent items in frequency descending order
3. Scan DB again, construct FP-tree
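
A minimal sketch of steps 1-3 (scan for frequent 1-itemsets, reorder each transaction by descending global frequency, insert into a shared-prefix tree); the header-table node links and the recursive FP-growth mining step are omitted for brevity.

```python
from collections import Counter

transactions = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
min_count = 3   # min_support = 0.5 over 5 transactions, as in the table above

# Step 1: one scan to find the frequent single items
counts = Counter(item for t in transactions for item in t)
frequent = {i: c for i, c in counts.items() if c >= min_count}

# Step 2: fix one global frequency-descending order for the frequent items
rank = {item: r for r, (item, _) in enumerate(counts.most_common()) if item in frequent}

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

# Step 3: second scan -- insert each reordered transaction, sharing common prefixes
root = Node(None)
for t in transactions:
    node = root
    for item in sorted((i for i in t if i in frequent), key=rank.get):
        node = node.children.setdefault(item, Node(item))
        node.count += 1

def show(node, indent=0):
    for child in node.children.values():
        print(" " * indent + f"{child.item}:{child.count}")
        show(child, indent + 2)

show(root)   # the shared prefix f, c, a carries counts 4, 3, 3 -- a compact FP-tree
```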


Header Table

Item   frequency   head
f      4
c      4
a      3
b      3
m      3
p      3

Benefits of the FP-tree Structure

• Completeness:
– never breaks a long pattern of any transaction
– preserves complete information for frequent pattern mining
• Compactness
– reduce irrelevant information—infrequent items are gone
– frequency descending ordering: more frequent items are more likely to be shared
– never be larger than the original database (if not count node-links and counts)
– Example: For Connect-4 DB, compression ratio could be over 100

Mining Frequent Patterns Using FP-tree

• General idea (divide-and-conquer)


– Recursively grow frequent pattern path using the FP-tree
• Method
– For each item, construct its conditional pattern-base, and then its conditional FP-tree
– Repeat the process on each newly created conditional FP-tree
– Until the resulting FP-tree is empty, or it contains only one path (single path will generate
all the combinations of its sub-paths, each of which is a frequent pattern)


Major Steps to Mine FP-tree

1) Construct conditional pattern base for each node in the FP-tree


2) Construct conditional FP-tree from each conditional pattern-base
3) Recursively mine conditional FP-trees and grow frequent patterns obtained so far
 If the conditional FP-tree contains a single path, simply enumerate all the patterns

Principles of Frequent Pattern Growth

• Pattern growth property


– Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset
in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.
• "abcdef" is a frequent pattern, if and only if
– "abcde" is a frequent pattern, and
– "f" is frequent in the set of transactions containing "abcde"

Why Is Frequent Pattern Growth Fast?

• Our performance study shows


– FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-
projection
• Reasoning
– No candidate generation, no candidate test
– Use compact data structure
– Eliminate repeated database scan
Basic operation is counting and FP-tree building

Mining multilevel association rules from transactional databases.

• Items often form hierarchy.


• Items at the lower level are expected to have lower support.


• Rules regarding itemsets at


appropriate levels could be quite useful.

• Transaction database can be encoded based on dimensions and levels


• We can explore shared multi-level mining

Example concept hierarchy: Food → {Milk, Bread}; Milk → {Skim, 2%}; Bread → {Wheat, White}; with brand-level items (e.g., Fraser, Sunset) at the lowest level.

TID Items
T1 {111, 121, 211, 221}
T2 {111, 211, 222, 323}
T3 {112, 122, 221, 411}
T4 {111, 121}
T5 {111, 122, 211, 221, 413}

Mining Multi-Level Associations

• A top_down, progressive deepening approach:


– First find high-level strong rules:
milk ⇒ bread [20%, 60%].

– Then find their lower-level "weaker" rules:
2% milk ⇒ wheat bread [6%, 50%].

• Variations at mining multiple-level association rules.


– Level-crossed association rules:
2% milk ⇒ Wonder wheat bread


– Association rules with multiple, alternative hierarchies:
2% milk ⇒ Wonder bread

Multi-level Association: Uniform Support vs. Reduced Support

• Uniform Support: the same minimum support for all levels


– + One minimum support threshold. No need to examine itemsets containing any item
whose ancestors do not have minimum support.
– – Lower level items do not occur as frequently. If the support threshold is
• too high ⇒ we miss low-level associations
• too low ⇒ we generate too many high-level associations
• Reduced Support: reduced minimum support at lower levels
– There are 4 search strategies:
• Level-by-level independent
• Level-cross filtering by k-itemset
• Level-cross filtering by single item
• Controlled level-cross filtering by single item

Multi-level Association: Redundancy Filtering

• Some rules may be redundant due to “ancestor” relationships between items.


• Example
– milk ⇒ wheat bread [support = 8%, confidence = 70%]
– 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
• We say the first rule is an ancestor of the second rule.
• A rule is redundant if its support is close to the “expected” value, based on the rule’s ancestor

Multi-Level Mining: Progressive Deepening

• A top-down, progressive deepening approach:


– First mine high-level frequent items:
milk (15%), bread (10%)


– Then mine their lower-level “weaker” frequent itemsets:


2% milk (5%), wheat bread (4%)

• Different min_support threshold across multi-levels lead to different algorithms:

– If adopting the same min_support across multi-levels then toss t if any of t’s ancestors is
infrequent.
– If adopting reduced min_support at lower levels then examine only those descendents
whose ancestor’s support is frequent/non-negligible.

Correlation in detail.

• Interest (correlation, lift)


– taking both P(A) and P(B) in consideration
– P(A^B)=P(B)*P(A), if A and B are independent events
– A and B negatively correlated, if the value is less than 1; otherwise A and B positively
correlated

X: 1 1 1 1 0 0 0 0          Itemset   Support   Interest
Y: 1 1 0 0 0 0 0 0          X, Y      25%       2
Z: 0 1 1 1 1 1 1 1          X, Z      37.50%    0.9
                            Y, Z      12.50%    0.57

χ² Correlation

• χ² measures correlation between categorical attributes

2  
2
(observed _ exp ected )
exp ected

             game          not game      sum(row)
video        4000 (4500)   3500 (3000)   7500
not video    2000 (1500)   500 (1000)    2500
sum(col.)    6000          4000          10000

(expected frequencies are shown in parentheses)


• expected(i, j) = count(row i) * count(column j) / N

• χ² = (4000 − 4500)² / 4500 + (3500 − 3000)² / 3000 + (2000 − 1500)² / 1500 + (500 − 1000)² / 1000 = 555.6
• Since χ² > 1 and the observed value of the (game, video) slot is less than the expected value, game and video are negatively correlated
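
The χ² value above can be reproduced in a few lines; the observed counts are the ones from the contingency table.

```python
# Observed counts from the 2x2 contingency table (video vs. game) in the notes
observed = {("video", "game"): 4000, ("video", "not game"): 3500,
            ("not video", "game"): 2000, ("not video", "not game"): 500}

row_totals = {"video": 7500, "not video": 2500}
col_totals = {"game": 6000, "not game": 4000}
N = 10000

chi2 = 0.0
for (row, col), obs in observed.items():
    expected = row_totals[row] * col_totals[col] / N      # expected(i, j)
    chi2 += (obs - expected) ** 2 / expected               # sum over all cells

print(round(chi2, 1))   # 555.6, as computed above
```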

Numeric correlation

• Correlation concept in statistics


– Used to study the relationship existing between 2 or more numeric variables
– A correlation is a measure of the linear relationship between variables
Ex: number of hours spent studying in a class with grade received
– Outcomes:
• positively related
• Not related
• negatively related
– Statistical relationships
• Covariance
• Correlation coefficient

Constraint-Based Mining in detail.

• Interactive, exploratory mining giga-bytes of data?


– Could it be real? — Making good use of constraints!
• What kinds of constraints can be used in mining?
– Knowledge type constraint: classification, association, etc.
– Data constraint: SQL-like queries
• Find product pairs sold together in Vancouver in Dec.’98.
– Dimension/level constraints:
• in relevance to region, price, brand, customer category.
– Rule constraints
• small sales (price < $10) triggers big sales (sum > $200).


– Interestingness constraints:
• strong rules (min_support ≥ 3%, min_confidence ≥ 60%).

Rule Constraints in Association Mining

• Two kind of rule constraints:


– Rule form constraints: meta-rule guided mining.
• P(x, y) ^ Q(x, w) ⇒ takes(x, "database systems").
– Rule (content) constraint: constraint-based query optimization (Ng, et al., SIGMOD'98).
• sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000
• 1-variable vs. 2-variable constraints (Lakshmanan, et al. SIGMOD’99):
– 1-var: A constraint confining only one side (L/R) of the rule, e.g., as shown above.
– 2-var: A constraint confining both sides (L and R).
• sum(LHS) < min(RHS) ^ max(RHS) < 5* sum(LHS)

Constrain-Based Association Query

• Database: (1) trans (TID, Itemset ), (2) itemInfo (Item, Type, Price)
• A constrained asso. query (CAQ) is in the form of {(S1, S2 )|C },
– where C is a set of constraints on S1, S2 including frequency constraint
• A classification of (single-variable) constraints:
– Class constraint: S ⊆ A, e.g. S ⊆ Item
– Domain constraint:
• S θ v, θ ∈ { =, ≠, <, ≤, >, ≥ }, e.g. S.Price < 100
• v θ S, θ is ∈ or ∉, e.g. snacks ∉ S.Type
• V θ S, or S θ V, θ ∈ { ⊆, ⊂, ⊄, =, ≠ }
– e.g. {snacks, sodas} ⊆ S.Type
– Aggregation constraint: agg(S) θ v, where agg is in {min, max, sum, count, avg}, and θ ∈ { =, ≠, <, ≤, >, ≥ }.
• e.g. count(S1.Type) = 1, avg(S2.Price) < 100


Constrained Association Query Optimization Problem

• Given a CAQ = { (S1, S2) | C }, the algorithm should be :


– sound: It only finds frequent sets that satisfy the given constraints C
– complete: All frequent sets satisfy the given constraints C are found
• A naïve solution:
– Apply Apriori for finding all frequent sets, and then to test them for constraint
satisfaction one by one.
• Our approach:
– Comprehensive analysis of the properties of constraints and try to push them as deeply
as possible inside the frequent set computation.
Categories of Constraints.

1. Anti-monotone and Monotone Constraints


• A constraint Ca is anti-monotone iff for any pattern S not satisfying Ca, none of the super-patterns of S can satisfy Ca
• A constraint Cm is monotone iff. for any pattern S satisfying Cm, every super-pattern of S also
satisfies it

2. Succinct Constraint
• A subset of items Is ⊆ I is a succinct set, if it can be expressed as σp(I) for some selection predicate p, where σ is a selection operator
• SP ⊆ 2^I is a succinct power set, if there is a fixed number of succinct sets I1, …, Ik ⊆ I, s.t. SP can be expressed in terms of the strict power sets of I1, …, Ik using union and minus
• A constraint Cs is succinct provided SATCs(I) is a succinct power set

3. Convertible Constraint
• Suppose all items in patterns are listed in a total order R
• A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that
each suffix of S w.r.t. R also satisfies C
• A constraint C is convertible monotone iff a pattern S satisfying the constraint implies that
each pattern of which S is a suffix w.r.t. R also satisfies C


Property of Constraints: Anti-Monotone

• Anti-monotonicity: If a set S violates the constraint, any superset of S violates the constraint.
• Examples:
– sum(S.Price) ≤ v is anti-monotone
– sum(S.Price) ≥ v is not anti-monotone
– sum(S.Price) = v is partly anti-monotone
• Application:
– Push "sum(S.Price) ≤ 1000" deeply into iterative frequent set computation.

Example of Convertible Constraints: avg(S) ≥ v

• Let R be the value-descending order over the set of items

– E.g. I = {9, 8, 6, 4, 3, 1}
• avg(S) ≥ v is convertible monotone w.r.t. R
– If S is a suffix of S1, avg(S1) ≥ avg(S)
• {8, 4, 3} is a suffix of {9, 8, 4, 3}
• avg({9, 8, 4, 3}) = 6 ≥ avg({8, 4, 3}) = 5
– If S satisfies avg(S) ≥ v, so does S1
• {8, 4, 3} satisfies the constraint avg(S) ≥ 4, so does {9, 8, 4, 3}

Property of Constraints: Succinctness

• Succinctness:
– For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C
– Given A1, the set of size-1 itemsets satisfying C, any set S satisfying C is based on A1, i.e., it contains a subset that belongs to A1
• Example:
– sum(S.Price) ≥ v is not succinct
– min(S.Price) ≤ v is succinct


• Optimization:
– If C is succinct, then C is pre-counting prunable. The satisfaction of the constraint alone is
not affected by the iterative support counting.

Classification and Prediction

 Classification:

– predicts categorical class labels

– classifies data (constructs a model) based on the training set and the values (class labels)
in a classifying attribute and uses it in classifying new data
• Prediction

– models continuous-valued functions, i.e., predicts unknown or missing values

• Typical applications

– Credit approval
– Target marketing
– Medical diagnosis
– Fraud detection

Classification—A Two-Step Process

• Model construction: describing a set of predetermined classes


– Each tuple/sample is assumed to belong to a predefined class, as determined by the class
label attribute
– The set of tuples used for model construction: training set
– The model is represented as classification rules, decision trees, or mathematical formulae

• Model usage: for classifying future or unknown objects


– Estimate accuracy of the model


• The known label of test sample is compared with the classified result from the
model
• Accuracy rate is the percentage of test set samples that are correctly classified by
the model
• Test set is independent of training set, otherwise over-fitting will occur
Process (1): Model Construction

Process (2): Using the Model in Prediction

Supervised vs. Unsupervised Learning


 Supervised learning (classification)


 Supervision: The training data (observations, measurements, etc.) are accompanied by


labels indicating the class of the observations

 New data is classified based on the training set

 Unsupervised learning (clustering)

 The class labels of the training data are unknown

 Given a set of measurements, observations, etc. with the aim of establishing the
existence of classes or clusters in the data

Classification by Decision Tree Induction


Decision tree

– A flow-chart-like tree structure


– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree

Training Dataset

This follows an example from Quinlan’s ID3


age      income   student   credit_rating
<=30     high     no        fair
<=30     high     no        excellent
31…40    high     no        fair
>40      medium   no        fair
>40      low      yes       fair
>40      low      yes       excellent
31…40    low      yes       excellent
<=30     medium   no        fair
<=30     low      yes       fair
>40      medium   yes       fair
<=30     medium   yes       excellent
31…40    medium   no        excellent
31…40    high     yes       fair
>40      medium   no        excellent

Algorithm for Decision Tree Induction

• Basic algorithm (a greedy algorithm)


– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g.,
information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority voting is employed
for classifying the leaf
– There are no samples left
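
A small sketch of the attribute-selection heuristic mentioned in the list above, using information gain (entropy before the split minus the weighted entropy after it); the toy tuples and class labels are illustrative, not the full table from the notes.

```python
import math
from collections import Counter

# Toy training tuples: (attribute values, class label) -- illustrative only
data = [
    ({"age": "<=30",   "student": "no"},  "no"),
    ({"age": "<=30",   "student": "yes"}, "yes"),
    ({"age": "31..40", "student": "no"},  "yes"),
    ({"age": ">40",    "student": "no"},  "yes"),
    ({"age": ">40",    "student": "yes"}, "yes"),
    ({"age": ">40",    "student": "no"},  "no"),
]

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(attribute):
    """Entropy of the whole set minus the weighted entropy after splitting on 'attribute'."""
    labels = [label for _, label in data]
    partitions = {}
    for row, label in data:
        partitions.setdefault(row[attribute], []).append(label)
    after = sum(len(p) / len(data) * entropy(p) for p in partitions.values())
    return entropy(labels) - after

# The greedy algorithm picks the attribute with the highest gain at each node
for a in ("age", "student"):
    print(a, round(info_gain(a), 3))
```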

Extracting Classification Rules from Trees

• Represent the knowledge in the form of IF-THEN rules


• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand

• Example

IF age = “<=30” AND student = “no” THEN buys_computer = “no”


IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”

IF age = “31…40” THEN buys_computer = “yes”

IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”

IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”

Avoid Overfitting in Classification

• The generated tree may overfit the training data


– Too many branches, some may reflect anomalies due to noise or outliers
– Result is in poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not split a node if this would result in the
goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree—get a sequence of
progressively pruned trees
• Use a set of data different from the training data to decide which is the “best
pruned tree”

Tree Mining in Weka and Tree Mining in Clementine.

Tree Mining in Weka

• Example:
– Weather problem: build a decision tree to guide the decision about whether or not to play
tennis.
– Dataset
(weather.nominal.arff)

• Validation:
– Using training set as a test set will provide optimal classification accuracy.


– Expected accuracy on a different test set will always be less.


– 10-fold cross validation is more robust than using the training set as a test set.
• Divide data into 10 sets with about same proportion of class label values as in
original set.
• Run classification 10 times independently with the remaining 9/10 of the set as
the training set.
• Average accuracy.
– Ratio validation: 67% training set / 33% test set.
– Best: having a separate training set and test set.

• Results:
– Classification accuracy (correctly classified instances).
– Errors (absolute mean, root squared mean, …)
– Kappa statistic (measures agreement between the predicted and observed classifications as the
proportion of agreement after chance agreement has been excluded; it ranges from −100% to
100%, and 0% means the agreement is entirely due to chance)
• Results:
– TP (True Positive) rate per class label
– FP (False Positive) rate
– Precision = TP / (TP + FP) * 100%
– Recall = TP rate = TP / (TP + FN) * 100%
– F-measure = 2 * precision * recall / (precision + recall)
• ID3 characteristics:
– Requires nominal values
– Improved into C4.5
• Dealing with numeric attributes
• Dealing with missing values
• Dealing with noisy data
• Generating rules from trees


Tree Mining in Clementine

• Methods:
– C5.0: target field must be categorical, predictor fields may be numeric or categorical,
provides multiple splits on the field that provides the maximum information gain at
each level
– QUEST: target field must be categorical, predictor fields may be numeric ranges or
categorical, statistical binary split
– C&RT: target and predictor fields may be numeric ranges or categorical, statistical binary
split based on regression
– CHAID: target and predictor fields may be numeric ranges or categorical, statistical
binary split based on chi-square

Bayesian Classification:

• Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most practical
approaches to certain types of learning problems
• Incremental: Each training example can incrementally increase/decrease the probability that a
hypothesis is correct. Prior knowledge can be combined with observed data.
• Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
• Standard: Even when Bayesian methods are computationally intractable, they can provide a
standard of optimal decision making against which other methods can be measured

Bayesian Theorem

• Given training data D, the posteriori probability of a hypothesis h, P(h|D), follows the Bayes theorem:

P(h|D) = P(D|h) P(h) / P(D)

• MAP (maximum a posteriori) hypothesis:

h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)

• Practical difficulty: require initial knowledge of many probabilities, significant computational


cost

Naïve Bayes Classifier (I)

• A simplified assumption: attributes are conditionally independent:

P(Cj | V) ∝ P(Cj) · Π_{i=1..n} P(vi | Cj)

• Greatly reduces the computation cost: only count the class distribution.

Naive Bayesian Classifier (II)

Given a training set, we can compute the probabilities:

Outlook       P     N        Humidity    P     N
sunny         2/9   3/5      high        3/9   4/5
overcast      4/9   0        normal      6/9   1/5
rain          3/9   2/5

Temperature   P     N        Windy       P     N
hot           2/9   2/5      true        3/9   3/5
mild          4/9   2/5      false       6/9   2/5
cool          3/9   1/5

Bayesian classification

• The classification problem may be formalized using a-posteriori probabilities:
• P(C|X) = prob. that the sample tuple X = <x1,…,xk> is of class C.
• E.g. P(class=N | outlook=sunny, windy=true,…)
• Idea: assign to sample X the class label C such that P(C|X) is maximal

Estimating a-posteriori probabilities

• Bayes theorem:


P(C|X) = P(X|C)·P(C) / P(X)

• P(X) is constant for all classes


• P(C) = relative freq of class C samples
• C such that P(C|X) is maximum =

C such that P(X|C)·P(C) is maximum


• Problem: computing P(X|C) is unfeasible!

Naïve Bayesian Classification

• Naïve assumption: attribute independence


P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)

• If i-th attribute is categorical:


P(xi|C) is estimated as the relative freq of samples having value xi as i-th attribute in class C
• If i-th attribute is continuous:
P(xi|C) is estimated thru a Gaussian density function
• Computationally easy in both cases
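
A minimal sketch of the naïve Bayesian classifier for categorical attributes: class priors multiplied by per-attribute relative frequencies, with the class of maximal posterior chosen; the weather-style tuples are illustrative.

```python
from collections import Counter, defaultdict

# Illustrative training tuples: (attribute dict, class label)
train = [
    ({"outlook": "sunny",    "windy": "true"},  "N"),
    ({"outlook": "sunny",    "windy": "false"}, "N"),
    ({"outlook": "overcast", "windy": "false"}, "P"),
    ({"outlook": "rain",     "windy": "false"}, "P"),
    ({"outlook": "rain",     "windy": "true"},  "N"),
    ({"outlook": "overcast", "windy": "true"},  "P"),
]

priors = Counter(label for _, label in train)
freq = defaultdict(Counter)                      # freq[(attribute, class)][value]
for row, label in train:
    for attr, value in row.items():
        freq[(attr, label)][value] += 1

def posterior(x, label):
    """P(C) times the product of P(x_i | C), using relative frequencies (naive independence)."""
    p = priors[label] / len(train)
    for attr, value in x.items():
        p *= freq[(attr, label)][value] / priors[label]
    return p

sample = {"outlook": "sunny", "windy": "true"}
print(max(priors, key=lambda c: posterior(sample, c)))   # class C with maximal P(C|X)
```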

Bayesian Belief Networks

 Bayesian belief network allows a subset of the variables conditionally independent



 A graphical model of causal relationships

 Represents dependency among the variables

 Gives a specification of joint probability distribution


Bayesian Belief Network: An Example

The conditional probability table (CPT) for variable LungCancer:

CPT shows the conditional probability for each possible combination of its parents

Derivation of the probability of a particular combination of values of X, from CPT:

P(x1, ..., xn) = Π_{i=1..n} P(xi | Parents(Yi))

Association-Based Classification

• Several methods for association-based classification


– ARCS: Quantitative association mining and clustering of association rules (Lent et al’97)
• It beats C4.5 in (mainly) scalability and also accuracy
– Associative classification: (Liu et al’98)
• It mines high support and high confidence rules in the form of “cond_set => y”,
where y is a class label
– CAEP (Classification by aggregating emerging patterns) (Dong et al’99)
• Emerging patterns (EPs): the itemsets whose support increases significantly
from one class to another
• Mine EPs based on minimum support and growth rate


Pruning of decision trees

Discarding one or more subtrees and replacing them with leaves simplify a decision tree, and that
is the main task in decision-tree pruning. In replacing the subtree with a leaf, the algorithm expects to
lower the predicted error rate and increase the quality of a classification model. But computation of
error rate is not simple. An error rate based only on a training data set does not provide a suitable
estimate. One possibility to estimate the predicted error rate is to use a new, additional set of test
samples if they are available, or to use the cross-validation techniques. This technique divides initially
available samples into equal sized blocks and, for each block, the tree is constructed from all samples
except this block and tested with a given block of samples. With the available training and testing
samples, the basic idea of decision tree-pruning is to remove parts of the tree (subtrees) that do not
contribute to the classification accuracy of unseen testing samples, producing a less complex and thus
more comprehensible tree. There are two ways in which the recursive-partitioning method can be
modified:

1. Deciding not to divide a set of samples any further under some conditions. The stopping
criterion is usually based on some statistical tests, such as the χ2 test: If there are no significant
differences in classification accuracy before and after division, then represent a current node as a
leaf. The decision is made in advance, before splitting, and therefore this approach is called
prepruning.
2. Removing retrospectively some of the tree structure using selected accuracy criteria. The
decision in this process of postpruning is made after the tree has been built.

C4.5 follows the postpruning approach, but it uses a specific technique to estimate the predicted error rate. This
method is called pessimistic pruning. For every node in a tree, the estimation of the upper confidence limit Ucf is
computed using the statistical tables for the binomial distribution (given in most textbooks on statistics). The parameter Ucf
is a function of |Ti| and E for a given node. C4.5 uses the default confidence level of 25%, and compares U25%(|Ti|, E)
for a given node Ti with a weighted confidence of


its leaves. Weights are the total number of cases for every leaf. If the predicted error for a root node in
a subtree is less than weighted sum of U25% for the leaves (predicted error for the subtree), then a
subtree will be replaced with its root node, which becomes a new leaf in a pruned tree.

Let us illustrate this procedure with one simple example. A subtree of a decision tree is given
in Figure, where the root node is the test x1 on three possible values {1, 2, 3} of the attribute A. The
children of the root node are leaves denoted with corresponding classes and (∣Ti∣/E) parameters. The
question is to estimate the possibility of pruning the subtree and replacing it with its root node as a
new, generalized leaf node.

Figure : Pruning a subtree by replacing it with one leaf node

To analyze the possibility of replacing the subtree with a leaf node it is necessary to compute a
predicted error PE for the initial tree and for a replaced node. Using default confidence of 25%, the
upper confidence limits for all nodes are collected from statistical tables: U25% (6, 0) = 0.206, U25%(9,
0) = 0.143, U25%(1, 0) = 0.750, and U25%(16, 1) = 0.157. Using these values, the predicted errors for the
initial tree and the replaced node are

PEtree = 6 · 0.206 + 9 · 0.143 + 1 · 0.750 = 3.257

PEnode = 16 · 0.157 = 2.512

Since the existing subtree has a higher value of predicted error than the replaced node, it is
recommended that the decision tree be pruned and the subtree replaced with the new leaf node.

Rule Based Classification


Using IF-THEN Rules for Classification

 Represent the knowledge in the form of IF-THEN rules




R: IF age = youth AND student = yes THEN buys_computer = yes

 Rule antecedent/precondition vs. rule consequent



 Assessment of a rule: coverage and accuracy
 
 ncovers = # of tuples covered by R
 
ncorrect = # of tuples correctly classified by R

coverage(R) = ncovers /|D| /* D: training data set */

accuracy(R) = ncorrect / ncovers

 If more than one rule is triggered, need conflict resolution



 Size ordering: assign the highest priority to the triggering rule that has the "toughest"
requirement (i.e., with the most attribute tests)

 Class-based ordering: decreasing order of prevalence or misclassification cost per class

 Rule-based ordering (decision list): rules are organized into one long priority list,
according to some measure of rule quality or by experts

Rule Extraction from a Decision Tree

 Rules are easier to understand than large trees



 One rule is created for each path from the root to a leaf

 Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction

 Rules are mutually exclusive and exhaustive


 Example: Rule extraction from our buys_computer decision-tree

IF age = young AND student = no THEN buys_computer = no


IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = yes
IF age = young AND credit_rating = fair THEN buys_computer = no

Rule Extraction from the Training Data

 Sequential covering algorithm: Extracts rules directly from training data



 Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER

 Rules are learned sequentially, each for a given class Ci will cover many tuples of Ci but none (or
few) of the tuples of other classes

 Steps:

 Rules are learned one at a time

 Each time a rule is learned, the tuples covered by the rules are removed

 The process repeats on the remaining tuples unless termination condition, e.g., when no
more training examples or when the quality of a rule returned is below a user-specified
threshold

 Comp. w. decision-tree induction: learning a set of rules simultaneously
Classification by Backpropagation

 Backpropagation: A neural network learning algorithm



 Started by psychologists and neurobiologists to develop and test computational analogues of
neurons

 A neural network: A set of connected input/output units where each connection has a weight
associated with it

 During the learning phase, the network learns by adjusting the weights so as to be able to
predict the correct class label of the input tuples

 Also referred to as connectionist learning due to the connections between units
Neural Network as a Classifier


 Weakness

 Long training time

 Require a number of parameters typically best determined empirically, e.g., the
network topology or "structure"

 Poor interpretability: Difficult to interpret the symbolic meaning behind the learned
weights and of "hidden units" in the network

 Strength

 High tolerance to noisy data

 Ability to classify untrained patterns

 Well-suited for continuous-valued inputs and outputs

 Successful on a wide array of real-world data

 Algorithms are inherently parallel

 Techniques have recently been developed for the extraction of rules from trained
neural networks

A Neuron (= a perceptron)

 The n-dimensional input vector x is mapped into variable y by means of the scalar product and
a nonlinear function mapping

A Multi-Layer Feed-Forward Neural Network


 The inputs to the network correspond to the attributes measured for each training tuple

 Inputs are fed simultaneously into the units making up the input layer

 They are then weighted and fed simultaneously to a hidden layer

 The number of hidden layers is arbitrary, although usually only one

 The weighted outputs of the last hidden layer are input to units making up the output layer,
which emits the network's prediction

 The network is feed-forward in that none of the weights cycles back to an input unit or to an
output unit of a previous layer

 From a statistical point of view, networks perform nonlinear regression: Given enough hidden
units and enough training samples, they can closely approximate any function

Backpropagation

 Iteratively process a set of training tuples & compare the network's prediction with the actual
known target value

 For each training tuple, the weights are modified to minimize the mean squared error between
the network's prediction and the actual target value

 Modifications are made in the “backwards” direction: from the output layer, through each
hidden layer down to the first hidden layer, hence “backpropagation”

 Steps

 Initialize weights (to small random #s) and biases in the network

 Propagate the inputs forward (by applying activation function)

 Backpropagate the error (by updating weights and biases)

 Terminating condition (when error is very small, etc.)
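
A compact NumPy sketch of the four steps above for a single hidden layer trained on XOR; the sigmoid activation, layer sizes, learning rate, and epoch count are assumptions, and convergence depends on the random initialization.

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # training tuples
y = np.array([[0], [1], [1], [0]], dtype=float)               # known target values (XOR)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: initialize weights to small random numbers and biases to zero
W1, b1 = rng.normal(scale=0.5, size=(2, 3)), np.zeros((1, 3))
W2, b2 = rng.normal(scale=0.5, size=(3, 1)), np.zeros((1, 1))
lr = 0.5

for epoch in range(5000):
    # Step 2: propagate the inputs forward through the activation function
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Step 3: backpropagate the error from the output layer back towards the inputs
    err_out = (y - output) * output * (1 - output)
    err_hid = (err_out @ W2.T) * hidden * (1 - hidden)

    # Update weights and biases in the "backwards" direction
    W2 += lr * hidden.T @ err_out
    b2 += lr * err_out.sum(axis=0)
    W1 += lr * X.T @ err_hid
    b1 += lr * err_hid.sum(axis=0)

# Step 4: stop after a fixed number of epochs (or when the error is very small)
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2).ravel())
```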


 Efficiency of backpropagation: Each epoch (one iteration through the training set) takes O(|D| * w), with |D| tuples and w weights, but the number of epochs can be exponential in n, the number of inputs, in the worst case
 Rule extraction from networks: network pruning

 Simplify the network structure by removing weighted links that have the least effect on
the trained network

 Then perform link, unit, or activation value clustering

 The set of input and activation values are studied to derive rules describing the
relationship between the input and hidden unit layers

 Sensitivity analysis: assess the impact that a given input variable has on a network output. The
knowledge gained from this analysis can be represented in rules

SVM—Support Vector Machines


 A new classification method for both linear and nonlinear data

 It uses a nonlinear mapping to transform the original training data into a higher dimension

 With the new dimension, it searches for the linear optimal separating hyperplane (i.e.,

“decision boundary”)

 With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes
can always be separated by a hyperplane

 SVM finds this hyperplane using support vectors (“essential” training tuples) and margins

(defined by the support vectors)

 Features: training can be slow but accuracy is high owing to their ability to model
complex nonlinear decision boundaries (margin maximization)

 Used both for classification and prediction

 Applications:

 handwritten digit recognition, object recognition, speaker identification, benchmarking
time-series prediction tests

SVM—General Philosophy


SVM—Margins and Support Vectors

SVM—Linearly Separable
 A separating hyperplane can be written as
W ● X + b = 0

where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)

 For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
 The hyperplane defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors


 This becomes a constrained (convex) quadratic optimization problem: quadratic objective function and linear constraints ⇒ Quadratic Programming (QP) ⇒ Lagrangian multipliers
Why Is SVM Effective on High Dimensional Data?
 The complexity of trained classifier is characterized by the # of support vectors rather than the
dimensionality of the data

 The support vectors are the essential or critical training examples —they lie closest to the
decision boundary (MMH)
 If all other training examples are removed and the training is repeated, the same separating
hyperplane would be found

 The number of support vectors found can be used to compute an (upper) bound on the
expected error rate of the SVM classifier, which is independent of the data dimensionality

 Thus, an SVM with a small number of support vectors can have good generalization, even when
the dimensionality of the data is high

Associative Classification
 Associative classification

 Association rules are generated and analyzed for use in classification

 Search for strong associations between frequent patterns (conjunctions of attribute-
value pairs) and class labels

 Classification: Based on evaluating a set of rules in the form of

p1 ∧ p2 ∧ … ∧ pl ⇒ "Aclass = C" (conf, sup)
 Why effective?

 It explores highly confident associations among multiple attributes and may overcome
some constraints introduced by decision-tree induction, which considers only one
attribute at a time
In many studies, associative classification has been found to be more accurate than some traditional
classification methods, such as C4.5.

Typical Associative Classification Methods


 CBA (Classification By Association: Liu, Hsu & Ma, KDD’98)

 Mine association possible rules in the form of


 Cond-set (a set of attribute-value pairs) ⇒ class label



 Build classifier: Organize rules according to decreasing precedence based on confidence
and then support

 CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)

 Classification: Statistical analysis on multiple rules

 CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM’03)

 Generation of predictive rules (FOIL-like analysis)

 High efficiency, accuracy similar to CMAR

 RCBT (Mining top-k covering rule groups for gene expression data, Cong et al. SIGMOD’05)

 Explore high-dimensional classification, using top-k rule groups

 Achieve high classification accuracy and high run-time efficiency

Associative Classification May Achieve High Accuracy and Efficiency (Cong et al. SIGMOD05)

Other Classification Methods

The k-Nearest Neighbor Algorithm

 All instances correspond to points in the n-D space



 The nearest neighbor are defined in terms of Euclidean distance, dist(X1, X2)

 Target function could be discrete- or real- valued


 For discrete-valued, k-NN returns the most common value among the k training examples
nearest to xq
 k-NN for real-valued prediction for a given unknown tuple

 Returns the mean values of the k nearest neighbors

 Distance-weighted nearest neighbor algorithm

 Weight the contribution of each of the k neighbors according to their distance to the
query xq
 Give greater weight to closer neighbors

 Robust to noisy data by averaging k-nearest neighbors

 Curse of dimensionality: distance between neighbors could be dominated by irrelevant
attributes

 To overcome it, axes stretch or elimination of the least relevant attributes
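
A short sketch of k-NN with Euclidean distance and majority voting, including the distance-weighted variant described above; the 2-D points are made up.

```python
import math
from collections import Counter

# Made-up labelled points in 2-D space: (features, class)
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((3.0, 3.2), "B"),
         ((3.1, 2.9), "B"), ((0.9, 1.3), "A"), ((2.8, 3.0), "B")]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn(query, k=3, weighted=False):
    neighbors = sorted(train, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter()
    for point, label in neighbors:
        # Distance-weighted variant: closer neighbours contribute a larger vote
        votes[label] += 1.0 / (euclidean(query, point) ** 2 + 1e-9) if weighted else 1.0
    return votes.most_common(1)[0][0]

print(knn((1.1, 1.0)))                 # 'A' -- majority of the 3 nearest neighbours
print(knn((2.9, 3.0), weighted=True))  # 'B'
```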

Genetic Algorithms

 Genetic Algorithm: based on an analogy to biological evolution



 An initial population is created consisting of randomly generated rules

 Each rule is represented by a string of bits
 
 E.g., if A1 and ¬A2 then C2 can be encoded as 100
 If an attribute has k > 2 values, k bits can be used

 Based on the notion of survival of the fittest, a new population is formed to consist of the
fittest rules and their offsprings

 The fitness of a rule is represented by its classification accuracy on a set of training examples

 Offsprings are generated by crossover and mutation

 The process continues until a population P evolves when each rule in P satisfies a prespecified
threshold

 Slow but easily parallelizable

Rough Set Approach:

 Rough sets are used to approximately or “roughly” define equivalent classes


 A rough set for a given class C is approximated by two sets: a lower approximation (certain to
be in C) and an upper approximation (cannot be described as not belonging to C)
Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard but a
discernibility matrix (which stores the differences between attribute values for each pair of data
tuples) is used to reduce the computation intensity

Figure: A rough set approximation of the set of tuples of the class C using lower and upper
approximation sets of C. The rectangular regions represent equivalence classes

Fuzzy Set approaches


 Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of
membership (such as using fuzzy membership graph)

 Attribute values are converted to fuzzy values

 e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy
values calculated

 For a given new sample, more than one fuzzy value may apply

 Each applicable rule contributes a vote for membership in the categories

 Typically, the truth values for each predicted category are summed, and these sums are
combined


Prediction

 (Numerical) prediction is similar to classification



 construct a model

 use model to predict continuous or ordered value for a given input

 Prediction is different from classification

 Classification refers to predict categorical class label

 Prediction models continuous-valued functions

 Major method for prediction: regression

 model the relationship between one or more independent or predictor variables and a
dependent or response variable

 Regression analysis

 Linear and multiple regression

 Non-linear regression

 Other regression methods: generalized linear model, Poisson regression, log-linear
models, regression trees
Linear Regression
 Linear regression: involves a response variable y and a single predictor variable x:

y = w0 + w1 x

where w0 (y-intercept) and w1 (slope) are regression coefficients

 Method of least squares: estimates the best-fitting straight line

 Multiple linear regression: involves more than one predictor variable
 
 Training data is of the form (X1, y1), (X2, y2),…, (X|D|, y|D|)
 
Ex. For 2-D data, we may have: y = w0 + w1 x1+ w2 x2


 Solvable by extension of least square method or using SAS, S-Plus



 Many nonlinear functions can be transformed into the above

Nonlinear Regression

 Some nonlinear models can be modeled by a polynomial function

 A polynomial regression model can be transformed into a linear regression model. For example,

y = w0 + w1 x + w2 x² + w3 x³

is convertible to linear form with the new variables x2 = x², x3 = x³:

y = w0 + w1 x + w2 x2 + w3 x3

 Other functions, such as power function, can also be transformed to linear model

 Some models are intractable nonlinear (e.g., sum of exponential terms)

 possible to obtain least square estimates through extensive calculation on more
complex formulae
