Datawarehousing and Data Mining Full Notes PDF
DATA WAREHOUSING AND DATA MINING
SEMESTER: V
PROGRAMME: BCA
UNIT I
DATA WAREHOUSING
Data warehousing Components –Building a Data warehouse – Mapping the Data Warehouse
to a Multiprocessor Architecture – DBMS Schemas for Decision Support – Data Extraction,
Cleanup, and Transformation Tools –Metadata.
Queries that would be complex in highly normalized databases can be easier to build
and maintain in data warehouses, decreasing the workload on transaction systems.
Data warehousing is an efficient way to manage and report on data that comes from a variety
of sources and is non-uniform and scattered throughout a company.
Data warehousing is an efficient way to manage demand for lots of information from
lots of users.
Data warehousing provides the capability to analyze large amounts of historical data for
nuggets of wisdom that can provide an organization with competitive advantage.
Informational Data:
Focusing on providing answers to problems posed by decision makers
Summarized
Non updateable
The data sources for the data warehouse are the operational applications. The data
entered into the data warehouse is transformed into an integrated structure and format. The
transformation process involves conversion, summarization, filtering and condensation. The data
warehouse must be capable of holding and managing large volumes of data, as well as different
data structures, over time.
1. Data warehouse database
This is the central part of the data warehousing environment. It is item number 2 in
the architecture diagram above. It is implemented based on RDBMS technology.
2. Sourcing, Acquisition, Clean up, and Transformation Tools
This is item number 1 in the architecture diagram above. These tools perform conversions,
summarization, key changes, structural changes and condensation. The data transformation is
required so that the information can be used by decision support tools. The transformation
produces programs, control statements, JCL code, COBOL code, UNIX scripts, and SQL DDL
code, etc., to move the data into the data warehouse from multiple operational systems.
The functionalities of these tools are listed below:
To remove unwanted data from operational db
Converting to common data names and attributes
Calculating summaries and derived data
Establishing defaults for missing data
Accommodating source data definition changes.
3. Meta data
It is data about data. It is used for maintaining, managing and using the data warehouse. It
is classified into two types:
1.Technical Meta data: It contains information about data warehouse data used by warehouse
designer, administrator to carry out development and management tasks. It includes,
Info about data stores
Transformation descriptions. That is mapping methods from operational db to warehouse db
Warehouse Object and data structure definitions for target data
The rules used to perform clean up, and data enhancement
Data mapping operations
Access authorization, backup history, archive history, info delivery history, data acquisition
history, data access etc.
2. Business Meta data: It contains information that helps users understand the data stored in the data warehouse. It
includes,
Subject areas, and info object type including queries, reports, images, video, audio clips
etc.
Internet home pages
Info related to info delivery system
Data warehouse operational info such as ownerships, audit trails etc.,
Meta data helps the users to understand content and find the data. Meta data is stored in a
separate data store known as the information directory or Meta data repository, which
helps to integrate, maintain and view the contents of the data warehouse.
The following lists the characteristics of info directory/ Meta data:
It is the gateway to the data warehouse environment
It supports easy distribution and replication of content for high performance and
availability
It should be searchable by business oriented key words
It should act as a launch platform for end user to access data and analysis tools
It should support the sharing of info
It should support scheduling options for request
It should support and provide interfaces to other applications
It should support end user monitoring of the status of the data warehouse environment
4. Access tools
Its purpose is to provide info to business users for decision making. There are five
categories:
Data query and reporting tools
Application development tools
Executive info system tools (EIS)
OLAP tools
Data mining tools
Query and reporting tools are used to generate queries and reports. There are two types of
reporting tools. They are:
The disadvantage of using the Top Down approach is that it requires more time and initial
investment. The business has to wait for the EDW to be implemented and the data marts to be
built before they can access their reports.
Bottom Up Approach
The bottom up approach suggested by Ralph Kimball is an incremental approach to build
a data warehouse. Here we build the data marts separately at different points of time as and when
the specific subject area requirements are clear. The data marts are integrated or combined
together to form a data warehouse. Separate data marts are combined through the use of
conformed dimensions and conformed facts. A conformed dimension and a conformed fact are ones
that can be shared across data marts.
A conformed dimension has consistent dimension keys, consistent attribute names and
consistent values across separate data marts. A conformed dimension means exactly the same thing
with every fact table it is joined to.
A Conformed fact has the same definition of measures, same dimensions joined to it and
at the same granularity across data marts.
The bottom up approach helps us incrementally build the warehouse by developing and
integrating data marts as and when the requirements are clear. We do not have to wait until the
overall requirements of the warehouse are known.
We should implement the bottom up approach when
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear. We have clarity to only one data mart.
The advantage of using the Bottom Up approach is that it does not require high initial costs and
has a faster implementation time; hence the business can start using the marts much earlier than
with the top-down approach.
The disadvantages of using the Bottom Up approach are that it stores data in denormalized form,
so space usage for detailed data is high. There is also a tendency not to keep detailed data in this
approach, which loses the advantage of having detail data, i.e. the flexibility to easily cater to
future requirements. The bottom up approach is more realistic, but the complexity of the
integration may become a serious obstacle.
Design considerations
To be successful, a data warehouse designer must adopt a holistic approach: consider all data
warehouse components as parts of a single complex system, and take into account all possible
data sources and all known usage requirements.
Most successful data warehouses that meet these requirements have these common
characteristics:
Are based on a dimensional model
Contain historical and current data
Include both detailed and summarized data
Consolidate disparate data from multiple sources while retaining consistency
A data warehouse is difficult to build due to the following reasons:
Heterogeneity of data sources
Use of historical data
Growing nature of data base
The data warehouse design approach must be a business-driven, continuous and iterative
engineering approach. In addition to the general considerations, the following specific points are
relevant to the data warehouse design:
Data content
The content and structure of the data warehouse are reflected in its data model. The data
model is the template that describes how information will be organized within the integrated
warehouse framework. The data warehouse data must be detailed data. It must be formatted,
cleaned up and transformed to fit the warehouse data model.
Meta data
It defines the location and contents of data in the warehouse. Meta data is searchable by
users to find definitions or subject areas. In other words, it must provide decision support
oriented pointers to warehouse data and thus provides a logical link between warehouse data and
decision support applications.
Data distribution
One of the biggest challenges when designing a data warehouse is the data placement and
distribution strategy. Data volumes continue to grow. Therefore, it becomes necessary
to know how the data should be divided across multiple servers and which users should get
access to which types of data. The data can be distributed based on the subject area, location
(geographical region), or time (current, month, year).
Tools
A number of tools are available that are specifically designed to help in the
implementation of the data warehouse. All selected tools must be compatible with the given data
warehouse environment and with each other. All tools must be able to use a common Meta data
repository.
Design steps
The following nine-step method is followed in the design of a data warehouse:
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre calculations in the fact table
6. Rounding out the dimension table
7. Choosing the duration of the db
8. The need to track slowly changing dimensions
9. Deciding the query priorities and query models
Technical considerations
A number of technical issues are to be considered when designing a data warehouse
environment. These issues include:
The hardware platform that would house the data warehouse
The DBMS that supports the warehouse data
The communication infrastructure that connects data marts, operational systems and end
users
The hardware and software to support meta data repository
The systems management framework that enables admin of the entire environment
Implementation considerations
The following logical steps are needed to implement a data warehouse:
Collect and analyze business requirements
Create a data model and a physical design
Define data sources
Choose the DB tech and platform
Extract the data from operational DB, transform it, clean it up and load it into the
warehouse
Choose DB access and reporting tools
Choose DB connectivity software
Choose data analysis and presentation s/w
Update the data warehouse
Access tools
Data warehouse implementation relies on selecting suitable data access tools. The best
way to choose a tool is based on the type of data that can be selected using the tool and the kind of
access it permits for a particular user. The following lists the various types of data that can be
accessed:
Simple tabular form data
Ranking data
Multivariable data
Time series data
Graphing, charting and pivoting data
Complex textual search data
Statistical analysis data
User levels
The users of data warehouse data can be classified on the basis of their skill level in
accessing the warehouse.
There are three classes of users:
Casual users: are most comfortable retrieving info from the warehouse in predefined formats and
running pre-existing queries and reports. These users do not need tools that allow for building
standard and ad hoc reports.
Power users: can use predefined as well as user-defined queries to create simple and ad hoc
reports. These users can engage in drill-down operations. These users may have experience of
using reporting and query tools.
Expert users: These users tend to create their own complex queries and perform standard
analysis on the info they retrieve. These users have the knowledge about the use of query and
report tools
Data partitioning:
Data partitioning is the key component for effective parallel execution of database
operations. Partitioning can be done randomly or intelligently.
Random partitioning includes random data striping across multiple disks on a single server.
Another option for random partitioning is round-robin partitioning, in which each record is
placed on the next disk assigned to the database.
Intelligent partitioning assumes that DBMS knows where a specific record is located and does
not waste time searching for it across all disks.
The various intelligent partitioning methods include:
Hash partitioning: A hash algorithm is used to calculate the partition number based on the value
of the partitioning key for each row
Key range partitioning: Rows are placed and located in the partitions according to the value of
the partitioning key. That is all the rows with the key value from A to K are in partition 1, L to T
are in partition 2 and so on.
Schema partitioning: an entire table is placed on one disk; another table is placed on a different
disk, etc. This is useful for small reference tables.
User-defined partitioning: allows a table to be partitioned on the basis of a user-defined
expression.
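To make the two main intelligent partitioning schemes concrete, the short Python sketch below routes rows to partitions by hashing the partitioning key and by key range. It is only an illustration; the partition count, the A-K / L-T ranges and the sample customer names are assumptions, not taken from any particular DBMS.

```python
import zlib

def hash_partition(key, num_partitions=4):
    # Hash partitioning: derive the partition number from a hash of the key value.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def key_range_partition(key):
    # Key range partitioning: keys 'A'-'K' -> partition 1, 'L'-'T' -> partition 2, rest -> 3.
    first = key[0].upper()
    if "A" <= first <= "K":
        return 1
    if "L" <= first <= "T":
        return 2
    return 3

for customer in ["Adams", "Lopez", "Young"]:
    print(customer,
          "hash ->", hash_partition(customer),
          "| key range ->", key_range_partition(customer))
```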
The cluster illustrated in the figure is composed of multiple loosely coupled nodes. A
Distributed Lock Manager (DLM) is required. Examples of loosely coupled systems are VAX
clusters or Sun clusters.
Since the memory is not shared among the nodes, each node has its own data cache. Cache
consistency must be maintained across the nodes and a lock manager is needed to maintain the
consistency. Additionally, instance locks using the DLM on the Oracle level must be maintained
to ensure that all nodes in the cluster see identical data.
There is additional overhead in maintaining the locks and ensuring that the data caches are
consistent. The performance impact is dependent on the hardware and software components, such
as the bandwidth of the high-speed bus through which the nodes communicate, and DLM
performance.
Parallel processing advantages of shared disk systems are as follows:
Shared disk systems permit high availability. All data is accessible even if one node
dies.
These systems have the concept of one database, which is an advantage over shared
nothing systems.
Shared nothing systems are concerned with access to disks, not access to memory.
Nonetheless, adding more PUs and disks can improve scale up. Oracle Parallel Server can access
the disks on a shared nothing system as long as the operating system provides transparent disk
access, but this access is expensive in terms of latency.
Shared nothing systems have advantages and disadvantages for parallel processing:
Advantages
Shared nothing systems provide for incremental growth.
System growth is practically unlimited.
MPPs are good for read-only databases and decision support applications.
Failure is local: if one node fails, the others stay up.
Disadvantages
More coordination is required.
More overhead is required for a process working on a disk belonging to another node.
If there is a heavy workload of updates or inserts, as in an online transaction processing system,
it may be worthwhile to consider data-dependent routing to alleviate contention.
1. Star schema
2. Snowflake schema
3. Fact constellation schema
Star schema
The multidimensional view of data that is expressed using relational data base semantics
is provided by the database schema design called the star schema. The basic premise of the star
schema is that information can be classified into two groups:
Facts
Dimension
Star schema has one large central table (fact table) and a set of smaller tables (dimensions)
arranged in a radial pattern around the central table.
Facts are core data element being analyzed while dimensions are attributes about the facts.
The determination of which schema model should be used for a data warehouse should be based
upon the analysis of project requirements, accessible tools and project team preferences.
The star schema architecture is the simplest data warehouse schema. It is called a star
schema because the diagram resembles a star, with points radiating from a center. The center of
the star consists of a fact table and the points of the star are the dimension tables. Usually the fact
tables in a star schema are in third normal form (3NF), whereas dimension tables are de-
normalized. Despite the fact that the star schema is the simplest architecture, it is the most
commonly used nowadays and is recommended by Oracle.
Fact Tables
A fact table is a table that contains summarized numerical and historical data (facts) and a
multipart index composed of foreign keys from the primary keys of related dimension tables. A
fact table typically has two types of columns: foreign keys to dimension tables, and measures
(those that contain numeric facts). A fact table can contain fact data at a detailed or aggregated level.
Dimension Tables
Dimensions are categories by which summarized data can be viewed. E.g. a profit
summary in a fact table can be viewed by a Time dimension (profit by month, quarter, year),
Region dimension
(profit by country, state, city), or Product dimension (profit for product1, product2). A dimension is
a structure usually composed of one or more hierarchies that categorizes data. If a dimension
has no hierarchies and levels, it is called a flat dimension or list. The primary keys of each of
the dimension tables are part of the composite primary key of the fact table.
Dimensional attributes help to describe the dimensional value. They are normally
descriptive, textual values. Dimension tables are generally smaller in size than fact tables. Typical
fact tables store data about sales, while dimension tables store data about geographic regions (markets,
cities), clients, products, times and channels.
Measures
Measures are numeric data based on columns in a fact table. They are the primary data
which end users are interested in. E.g. a sales fact table may contain a profit measure which
represents profit on each sale.
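The following is a minimal, hypothetical sketch (Python with pandas) of how a star schema query works: the fact table holds foreign keys and the profit measure, the dimension tables hold descriptive attributes, and a report is produced by joining and grouping. All table names, keys and figures are invented for illustration.

```python
import pandas as pd

# Dimension tables: small, descriptive attributes keyed by surrogate keys.
time_dim = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"], "year": [2023, 2023]})
product_dim = pd.DataFrame({"product_key": [10, 11], "product": ["Laptop", "Printer"]})

# Fact table: foreign keys to the dimensions plus a numeric measure (profit).
sales_fact = pd.DataFrame({
    "time_key":    [1, 1, 2, 2],
    "product_key": [10, 11, 10, 11],
    "profit":      [500.0, 120.0, 650.0, 90.0],
})

# A typical star query: join the fact table to its dimensions,
# then view the profit measure by product and quarter.
report = (sales_fact
          .merge(time_dim, on="time_key")
          .merge(product_dim, on="product_key")
          .groupby(["product", "quarter"])["profit"].sum())
print(report)
```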
Aggregations are pre calculated numeric data. By calculating and storing the answers to a
query before users ask for it, the query processing time can be reduced. This is key in providing
fast query performance in OLAP.
Cubes are data processing units composed of fact tables and dimensions from the data
warehouse. They provide multidimensional views of data, querying and analytical capabilities to
clients.
The main characteristics of star schema:
Simple structure -> easy to understand schema
Great query effectiveness -> small number of tables to join
Relatively long time to load data into dimension tables -> de-normalization and redundancy
can cause the tables to become large
The most commonly used schema in data warehouse implementations -> widely supported
by a large number of business intelligence tools
Snowflake schema: is the result of decomposing one or more of the dimensions. The many-to-one
relationships among sets of attributes of a dimension can be separated into new dimension tables,
forming a hierarchy. The decomposed snowflake structure visualizes the hierarchical structure of
dimensions very well.
Fact constellation schema: For each star schema it is possible to construct fact constellation
schema (for example by splitting the original star schema into more star schemas, each of which
describes facts on another level of dimension hierarchies). The fact constellation architecture
contains multiple fact tables that share many dimension tables.
The main shortcoming of the fact constellation schema is a more complicated design
because many variants for particular kinds of aggregation must be considered and selected.
Moreover, dimension tables are still large.
Multi relational Database:
The relational implementation of multidimensional data base systems is referred to as
multi relational database systems.
involved are small and a limited amount of data transformation and enhancement is
required.
Rule-driven Dynamic Transformation Engines (Data Mart Builders):
– They are also known as Data Mart Builders and capture data from a source system
at User-defined intervals, transform data, and then send and load the results into a
target environment, typically a data mart.
– To date, most of the products in this category support only relational data sources,
though this trend has now started to change.
– Data to be captured from the source system is usually defined using query language
statements, and data transformation and enhancement is done using a script or
function logic defined in the tool.
– With most tools in this category, data flows from source systems to target systems
through one or more servers, which perform the data transformation and
enhancement. These transformation servers can usually be controlled from a single
location, making the management of such an environment much easier.
Meta Data:
Meta Data Definitions:
Metadata – additional data about the warehouse used to understand what information is in the
warehouse, and what it means
Metadata Repository – specialized database designed to maintain metadata, together with
the tools and interfaces that allow a company to collect and distribute its metadata.
Operational Data – elements from operation systems, external data (or other sources)
mapped to the warehouse structures.
Industry trend:
Why were early Data Warehouses that did not include significant amounts of metadata collection
able to succeed?
• Usually a subset of data was targeted, making it easier to understand content,
organization, ownership.
• Usually targeted a subset of (technically inclined) end users
Early choices were made to ensure the success of initial data warehouse efforts.
Meta Data Transitions:
Usually, metadata repositories are already in existence. Traditionally, metadata was
aimed at overall systems management, such as aiding in the maintenance of legacy
systems through impact analysis, and determining the appropriate reuse of legacy data
structures.
Repositories can now aid in tracking metadata to help all data warehouse users
understand what information is in the warehouse and what it means. Tools are now being
positioned to help manage and maintain metadata.
Meta Data Lifecycle:
1. Collection: Identify metadata and capture it in a central repository.
2. Maintenance: Put in place processes to synchronize metadata automatically with the
changing data architecture.
3. Deployment: Provide metadata to users in the right form and with the right tools.
The key to ensuring a high level of collection and maintenance accuracy is to incorporate as
much automation as possible. The key to a successful metadata deployment is to correctly
match the metadata offered to the specific needs of each audience.
Meta Data Collection:
• Collecting the right metadata at the right time is the basis for success. If the user does
not already have an idea about what information would answer a question, the user will
not find anything helpful in the warehouse.
• Metadata spans many domains from physical structure data, to logical model data, to
business usage and rules.
• Typically the metadata that should be collected is already generated and processed by the
development team anyway. Metadata collection preserves the analysis performed by the
team.
This information is captured after the warehouse has been deployed. Typically, this information
is not easy to collect.
Maintaining Meta Data:
• As with any maintenance process, automation is key to maintaining current high-quality
information. The data warehouse tools can play an important role in how the metadata is
maintained.
• Most proposed database changes already go through appropriate verification and
authorization, so adding a metadata maintenance requirement should not be significant.
• Capturing incremental changes is encouraged since metadata (particularly structure
information) is usually very large.
Maintaining the Warehouse:
The warehouse team must have comprehensive impact analysis capabilities to respond to
change that may affect:
• Data extraction/movement/transformation routines
• Table structures
• Data marts and summary data structures
• Stored user queries
• Users who require new training (due to query or other changes)
• What business problems are addressed in part using the element that is changing (this helps
to understand the significance of the change, and how it may impact decision making).
Meta Data Deployment:
Supply the right metadata to the right audience
• Warehouse developers will primarily need the physical structure information for data
sources. Further analysis on that metadata leads to the development of more metadata
(mappings).
• Warehouse maintainers typically require direct access to the metadata as well.
• End Users require an easy-to-access format. They should not be burdened with technical
names or cryptic commands. Training, documentation and other forms of help, should be
readily available.
End Users:
Users of the warehouse are primarily concerned with two types of metadata.
1. A high-level topic inventory of the warehouse (what is in the warehouse and where it
came from).
2. Existing queries that are pertinent to their search (reuse).
The important goal is that the user is easily able to correctly find and interpret the data they need.
Integration with Data Access Tools:
1. Side by Side access to metadata and to real data. The user can browse metadata and write
queries against the real data.
2. Populate query tool help text with metadata exported from the repository. The tool can
now provide the user with context sensitive help at the expense of needing updating
whenever metadata changes and the user may be using outdated metadata.
3. Provide query tools that access the metadata directly to provide context sensitive help.
This eliminates the refresh issue, and ensures the user always sees current metadata.
4. Full interconnectivity between query tool and metadata tool (transparent transactions
between tools).
We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can
assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help
to analyze it.
- OLTP (On-line Transaction Processing) is characterized by a large number of short on-line
transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is put on
very fast query processing, maintaining data integrity in multi-access environments, and
effectiveness measured by the number of transactions per second. An OLTP database contains detailed
and current data, and the schema used to store transactional data is the entity model (usually
3NF).
- OLAP (On-line Analytical Processing) is characterized by relatively low volume of
transactions. Queries are often very complex and involve aggregations. For OLAP systems,
response time is an effectiveness measure. OLAP applications are widely used by data mining
techniques. An OLAP database contains aggregated, historical data, stored in multidimensional
schemas (usually the star schema).
The following table summarizes the major differences between OLTP and OLAP system design.
UNIT II
Reporting and Query tools and Applications – Tool Categories – The Need for
Applications – Cognos Impromptu – Online Analytical Processing (OLAP) – Need –
Multidimensional Data Model – OLAP Guidelines – Multidimensional versus
Multirelational OLAP – Categories of Tools – OLAP Tools and the Internet.
The data warehouse is accessed using an end-user query and reporting tool from
Business Objects. Business Objects provides several tools to securely access the data
warehouse or personal data files with a point-and-click interface including the following:
WSU has negotiated a contract with Business Objects for purchasing these tools at a discount.
a. The query tools discussed in the next several slides represent the most commonly
used query tools at Penn State.
b. A Data Warehouse user is free to select any query tool, and is not limited to the
ones mentioned.
c. What is a "Query Tool"?
d. A query tool is a software package that allows you to connect to the data
warehouse from your PC and develop queries and reports.
There are many query tools to choose from. Below is a listing of what is currently
being used on the PC:
1. Microsoft Access
2. Microsoft Excel
3. Cognos Impromptu
a) Building a data warehouse is a complex task because there is no vendor that provides
an 'end-to-end' set of tools.
b) Necessitates that a data warehouse is built using multiple products from different
vendors.
c) Ensuring that these products work well together and are fully integrated is a major
challenge.
Sort
Hierarchy presentation
Ascending or descending order
Group
Filter
Defines criteria
Specifies query range
Administrative
Access
Profile
Client/Server
Generating reports:
Edit features on the toolbar allow changes to report data after the query has
been completed
Data Elements
Table Format
Numeric
Descriptive
Catalogs: Impromptu stores metadata in subject-related folders. This metadata is what
will be used to develop a query for a report. The metadata set is stored in a file called a
'catalog'. The catalog does not contain any data. It just contains information about
connecting to the database and the fields that will be accessible for reports. A catalog
contains:
The multidimensional data model is an integral part of On-Line Analytical Processing, or OLAP.
Because OLAP is on-line, it must provide answers quickly; analysts pose iterative queries during
interactive sessions, not in batch jobs that run overnight. And because OLAP is also analytic, the
queries are complex. The multidimensional data model is designed to solve complex queries in
real time. One way to view the multidimensional data model is as a cube. The table at the left contains
detailed sales data by product, market and time. The cube on the right associates sales numbers
(units sold) with dimensions (product type, market and time), with the unit variables organized as
cells in an array. This cube can be expanded to include another array, price, which can be
associated with all or only some dimensions. As the number of dimensions increases, the number of
cube cells increases exponentially. Dimensions are hierarchical in nature, i.e. the time dimension may
contain hierarchies for years, quarters, months, weeks and days. GEOGRAPHY may contain
country, state, city, etc.
In this cube we can observe, that each side of the cube represents one of the elements of
the question. The x-axis represents the time, the y-axis represents the products and the z-
axis represents different centers. The cells of the cube represent the number of
products sold, or can represent the price of the items.
This figure also gives a different understanding of the drill-down operation. The
dimensions need not be directly related to one another. As the size of the
dimensions increases, the size of the cube also increases exponentially. The response
time of the cube depends on the size of the cube.
• Aggregation (roll-up)
e.g., total sales by city and year -> total sales by region and by year
• Selection (slice) defines a subcube – e.g., sales where city = Palo Alto and date =
1/15/96
• Navigation to detailed data (drill-down) – e.g., (sales - expense) by city, top 3% of cities
by average income
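As a rough illustration of the roll-up, slice and drill-down operations listed above, the Python/pandas sketch below performs them on a small invented sales data set; the column names, the city-to-region mapping and the dates are assumptions made for the example.

```python
import pandas as pd

sales = pd.DataFrame({
    "city":    ["Palo Alto", "Palo Alto", "Chicago", "Chicago"],
    "region":  ["West", "West", "Midwest", "Midwest"],
    "product": ["Laptop", "Printer", "Laptop", "Printer"],
    "date":    ["1996-01-15", "1996-02-10", "1996-01-15", "1996-03-05"],
    "units":   [12, 5, 20, 7],
})

# Roll-up (aggregation): total units by region instead of by city.
rollup = sales.groupby(["region", "product"])["units"].sum()

# Slice: define a subcube, e.g. city = Palo Alto and date = 1996-01-15.
subcube = sales[(sales["city"] == "Palo Alto") & (sales["date"] == "1996-01-15")]

# Drill-down: navigate back to the more detailed city/date level.
drilldown = sales.groupby(["city", "date", "product"])["units"].sum()

print(rollup, subcube, drilldown, sep="\n\n")
```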
OLAP
OLAP stands for Online Analytical Processing. It uses database tables (fact and dimension
tables) to enable multidimensional viewing, analysis and querying of large amounts of
data. E.g. OLAP technology could provide management with fast answers to complex
queries on their operational data or enable them to analyze their company's historical
data for trends and patterns. Online Analytical Processing (OLAP) applications and tools
are those that are designed to ask "complex queries of large multidimensional
collections of data". Because of this, OLAP is closely associated with data warehousing.
Need
The key driver of OLAP is the multidimensional nature of the business problem. These
problems are characterized by retrieving a very large number of records, which can reach
gigabytes and terabytes, and summarizing this data into information that can be
used by business analysts. One limitation of SQL is that it cannot easily represent these
complex problems. A query will be translated into several SQL statements. These SQL
statements will involve multiple joins, intermediate tables, sorting, aggregations and a
huge amount of temporary memory to store these tables. These procedures require a lot of
computation and a long time to run. The second limitation of SQL is
its inability to use mathematical models within SQL statements. Even if an analyst could
create these complex statements using SQL, a large amount of computation and memory
would still be needed. Therefore the use of OLAP is preferable for solving this kind of problem.
MOLAP This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary
formats. That is, data stored in array-based structures.
Advantages:
Excellent performance: MOLAP cubes are built for fast data retrieval, and are
optimal for slicing and dicing operations.
Can perform complex calculations: All calculations have been pre-generated when
the cube is created. Hence, complex calculations are not only doable, but they
return quickly.
Disadvantages:
Limited in the amount of data it can handle: Because all calculations are performed
when the cube is built, it is not possible to include a large amount of data in the
cube itself. This is not to say that the data in the cube cannot be derived from a
large amount of data. Indeed, this is possible. But in this case, only summary-level
information will be included in the cube itself.
Requires additional investment: Cube technologies are often proprietary and do not
already exist in the organization. Therefore, to adopt MOLAP technology, chances
are that additional investments in human and capital resources are needed.
ROLAP
This methodology relies on manipulating the data stored in the relational database to
give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each
action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL
statement. Data is stored in relational tables.
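The sketch below illustrates, in Python, the idea that each ROLAP slice or dice simply adds another predicate to the WHERE clause of the generated SQL. The table and column names and the query-building helper are hypothetical; a real ROLAP engine generates far more sophisticated (and parameterized) SQL.

```python
def build_rolap_query(measure, table, slices):
    # Each slice/dice selection becomes one more predicate in the WHERE clause.
    sql = f"SELECT SUM({measure}) FROM {table}"
    if slices:
        predicates = " AND ".join(f"{col} = '{val}'" for col, val in slices.items())
        sql += " WHERE " + predicates
    return sql

# Slicing on region, then dicing further on quarter, simply grows the WHERE clause.
print(build_rolap_query("profit", "sales_fact", {"region": "West"}))
print(build_rolap_query("profit", "sales_fact", {"region": "West", "quarter": "Q1"}))
```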
Advantages:
Can handle large amounts of data: The data size limitation of ROLAP technology is
the limitation on data size of the underlying relational database. In other words,
ROLAP itself places no limitation on data amount.
Can leverage functionalities inherent in the relational database: Often, relational
database already comes with a host of functionalities. ROLAP technologies, since
they sit on top of the relational database, can therefore leverage these
functionalities.
Disadvantages:
Performance can be slow: Because each ROLAP report is essentially a SQL query
(or multiple SQL queries) in the relational database, the query time can be long if
the underlying data size is large.
Limited by SQL functionalities: Because ROLAP technology mainly relies on
generating SQL statements to query the relational database, and SQL statements
do not fit all needs (for example, it is difficult to perform complex calculations
using SQL), ROLAP technologies are therefore traditionally limited by what SQL
can do. ROLAP vendors have mitigated this risk by building into the tool out-of-
the-box complex functions as well as the ability to allow users to define their own
functions.
HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For
summary-type information, HOLAP leverages cube technology for faster performance. It
stores only the indexes and aggregations in the multidimensional form while the rest of
the data is stored in the relational database.
OLAP Guidelines:
Dr. E.F. Codd, the "father" of the relational model, created a list of rules for OLAP
systems. Users should prioritize these rules according to their needs to match their
business requirements (reference 3).
2) Transparency: The OLAP tool should provide transparency to the input data for the
users.
3) Accessibility: The OLAP tool should access only the data required for the analysis
needed.
4) Consistent reporting performance: The size of the database should not affect the
performance in any way.
5) Client/server architecture: The OLAP tool should use the client/server architecture to
ensure better performance and flexibility.
7) Dynamic sparse matrix handling: The OLAP tool should be able to manage sparse
matrices and so maintain the level of performance.
8) Multi-user support: The OLAP tool should allow several users to work concurrently.
10) Intuitive data manipulation. "Consolidation path re-orientation, drilling down across
columns or rows, zooming out, and other manipulation inherent in the consolidation path
outlines should be accomplished via direct action upon the cells of the analytical model,
and should neither require the use of a menu nor multiple trips across the user
interface." (Reference 4)
11) Flexible reporting: It is the ability of the tool to present the rows and column in a
manner suitable to be analyzed.
12) Unlimited dimensions and aggregation levels: This depends on the kind of Business,
where multiple dimensions and defining hierarchies can be made.
The major distinguishing features between OLTP and OLAP are summarized as follows.
1. Users and system orientation: An OLTP system is customer-oriented and is used for
transaction and query processing by clerks, clients, and information technology
professionals. An OLAP system is market-oriented and is used for data analysis by
knowledge workers, including managers, executives, and analysts.
2. Data contents: An OLTP system manages current data that, typically, are too detailed
to be easily used for decision making. An OLAP system manages large amounts of
historical data, provides facilities for summarization and aggregation, and stores and
manages information at different levels of granularity. These features make the data
easier for use in informed decision making.
4. View: An OLTP system focuses mainly on the current data within an enterprise or
department, without referring to historical data or data in different organizations. In
contrast, an OLAP system often spans multiple versions of a database schema. OLAP
systems also deal with information that originates from different organizations,
integrating information from many data stores. Because of their huge volume, OLAP data
are stored on multiple storage media.
5. Access patterns: The access patterns of an OLTP system consist mainly of short,
atomic transactions. Such a system requires concurrency control and recovery
mechanisms. However, accesses to OLAP systems are mostly read-only operations
although many could be complex queries.
Categorization of OLAP Tools OLAP tools are designed to manipulate and control multi-
dimensional databases and help the sophisticated user to analyze the data using clear
multidimensional complex views. Their typical applications include product performance
and profitability, effectiveness of a sales program or a marketing campaign, sales
forecasting, and capacity planning.
ROLAP
The most comprehensive premises in computing have been the Internet and data
warehousing; thus the integration of these two giant technologies is a necessity. The
advantages of using the Web for access are evident. These advantages are:
3. The Web allows users to store and manage data and applications on servers that can be
managed, maintained and updated centrally.
These reasons indicate the importance of the Web in data storage and manipulation. The
Web-enabled data access has many significant features, such as:
The first
The second
The emerging third
HTML publishing
Helper applications
Plug-ins
Server-centric components
Java and active-x applications
Microsoft Analysis Services (previously called OLAP Services, part of SQL Server), IBM's
DB2 OLAP Server, SAP BW and products from Brio, Business Objects, Cognos,
MicroStrategy and others.
MIS AG Overview
MIS AG is the leading European provider of business intelligence solutions and services,
providing development, implementation, and service of systems for budgeting, reporting,
consolidation, and analysis.
Poet Overview
With FastObjects™, German Poet Software GmbH (Poet) provides developers with a
flexible Object-oriented Database Management System (ODBMS) solution optimized for
managing complexity in high-performance applications using Java technology, C++ and
.NET.
UNIT III
DATA MINING
Data
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
1. Task-relevant data: This is the database portion to be investigated. For example, suppose
that you are a manager of All Electronics in charge of sales in the United States and Canada. In
particular, you would like to study the buying trends of customers in Canada. Rather than
mining on the entire database, you can specify that only the task-relevant data (such as the
attributes describing these customers and their purchases) be mined. The attributes involved
are referred to as relevant attributes.
2. The kinds of knowledge to be mined: This specifies the data mining functions to be
performed, such as characterization, discrimination, association, classification, clustering, or
evolution analysis. For instance, if studying the buying habits of customers in Canada, you
may choose to mine associations between customer profiles and the items that these
customers like to buy
5. Presentation and visualization of discovered patterns: This refers to the form in which
discovered patterns are to be displayed. Users can choose from different forms for knowledge
presentation, such as rules, tables, charts, graphs, decision trees, and cubes.
Gain(T, S) = Ent(S) - E(T, S)
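The fragment above appears to refer to the entropy-based attribute-selection (information gain) criterion used in decision tree induction, Gain(T, S) = Ent(S) - E(T, S). Under that assumption, the following minimal Python sketch computes the entropy of a label set, the expected entropy after a split, and the resulting gain; the class labels and the split are invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Ent(S): entropy of a list of class labels.
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def expected_entropy(partitions):
    # E(T, S): weighted entropy of the subsets produced by splitting S on attribute T.
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * entropy(p) for p in partitions)

def gain(labels, partitions):
    return entropy(labels) - expected_entropy(partitions)

# Invented example: 9 'yes' / 5 'no' labels split into two subsets by some attribute T.
S = ["yes"] * 9 + ["no"] * 5
split = [["yes"] * 6 + ["no"] * 1, ["yes"] * 3 + ["no"] * 4]
print(round(gain(S, split), 3))
```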
The architecture of a typical data mining system may have the following major components
3. Knowledge base. This is the domain knowledge that is used to guide the search, or
evaluate the interestingness of resulting patterns. Such knowledge can include concept
hierarchies, used to organize attributes or attribute values into different levels of abstraction.
Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based
on its unexpectedness, may also be included.
4. Data mining engine. This is essential to the data mining system and ideally consists of a
set of functional modules for tasks such as characterization, association analysis,
classification, evolution and deviation analysis.
6. Graphical user interface. This module communicates between users and the data mining
system, allowing the user to interact with the system by specifying a data mining query or
task, providing information to help focus the search, and performing exploratory data mining
based on the intermediate data mining results.
Data mining functionalities are used to specify the kind of patterns to be found in data mining
tasks. In general, data mining tasks can be classified into two categories
Descriptive and Predictive.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions. In
some cases, users may have no idea of which kinds of patterns in their data may be
interesting, and hence may like to search for several different kinds of patterns in parallel.
Thus it is important to have a data mining system that can mine multiple kinds of patterns to
accommodate different user expectations or applications. Furthermore, data mining systems
should be able to discover patterns at various granularities. To encourage interactive and
exploratory mining, users should be able to easily "play" with the output patterns, such as by
mouse clicking. Operations that can be specified by simple mouse clicks include adding or
dropping a dimension (or an attribute), swapping rows and columns (pivoting, or axis
rotation), changing dimension representations (e.g., from a 3-D cube to a sequence of 2-D
cross tabulations, or crosstabs), or using OLAP roll-up or drill-down operations along
dimensions. Such operations allow data patterns to be expressed from different angles of view
and at multiple levels of abstraction.
Data mining systems should also allow users to specify hints to guide or focus the search for
interesting patterns. Since some patterns may not hold for all of the data in the database, a
measure of certainty or "trustworthiness" is usually associated with each discovered pattern.
Data mining functionalities, and the kinds of patterns they can discover, are described below.
multidimensional tables, including crosstabs. The resulting descriptions can also be presented
as generalized relations, or in rule form (called characteristic rules).
Association analysis
Association analysis is the discovery of association rules showing attribute-value conditions
that occur frequently together in a given set of data. Association analysis is widely used for
market basket or transaction data analysis. More formally, association rules are of the form
X ⇒ Y, i.e., A1 ∧ ... ∧ Am → B1 ∧ ... ∧ Bn, where Ai (for i ∈ {1, ..., m}) and Bj (for j ∈ {1, ..., n})
are attribute-value pairs. The association rule X ⇒ Y is interpreted as "database tuples that
satisfy the conditions in X are also likely to satisfy the conditions in Y".
An association between more than one attribute, or predicate (i.e., age, income, and buys).
Adopting the terminology used in multidimensional databases, where each attribute is
referred to as a dimension,the above rule can be referred to as a multidimensional association
rule. Suppose, as a marketing manager of AllElectronics, you would like to determine which
items are frequently purchased together within the same transactions. An example of such a
rule is
contains(T, "computer") ⇒ contains(T, "software") [support = 1%, confidence = 50%]
meaning that if a transaction T contains "computer", there is a 50% chance that it contains
"software" as well, and 1% of all of the transactions contain both. This association rule
involves a single attribute or predicate (i.e., contains) which repeats. Association rules that
contain a single predicate are referred to as single-dimensional association rules. Dropping
the predicate notation, the above rule can be written simply as "computer ⇒ software [1%,
50%]".
Clustering analysis
Clustering analyzes data objects without consulting a known class label. In general, the class
labels are not present in the training data simply because they are not known to begin with.
Clustering can be used to generate such labels. The objects are clustered or grouped based on
the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
That is, clusters of objects are formed so that objects within a cluster have high similarity in
comparison to one another, but are very dissimilar to objects in other clusters. Each cluster
that is formed can be viewed as a class of objects, from which rules can be derived.
Interestingness Patterns
A data mining system has the potential to generate thousands or even millions of patterns, or
rules. This raises some serious questions for data mining:
A pattern is interesting if (1) it is easily understood by humans, (2) valid on new or test data
with some degree of certainty, (3) potentially useful, and (4) novel. A pattern is also
interesting if it validates a hypothesis that the user sought to confirm. An interesting pattern
represents knowledge. Several objective measures of pattern interestingness exist. These are
based on the structure of discovered patterns and the statistics underlying them. An objective
measure for association rules of the form X ⇒ Y is rule support, representing the percentage of
data samples that the given rule satisfies. Another objective measure for association rules is
confidence, which assesses the degree of certainty of the detected association. It is defined as
the conditional probability that a pattern Y is true given that X is true. More formally, support
and confidence are defined as
support(X ⇒ Y) = P(X ∪ Y)
confidence(X ⇒ Y) = P(Y | X)
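A small Python sketch of these two measures, computed over a handful of invented market-basket transactions (the itemsets and counts are assumptions for illustration):

```python
transactions = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
]

X = {"computer"}
Y = {"software"}

n = len(transactions)
both = sum(1 for t in transactions if X <= t and Y <= t)   # transactions containing X and Y
only_x = sum(1 for t in transactions if X <= t)            # transactions containing X

support = both / n            # fraction of all transactions containing both itemsets
confidence = both / only_x    # of those containing X, fraction that also contain Y

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```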
may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge
representation, inductive logic programming, or high performance computing. Depending on
the kinds of data to be mined or on the given data mining application, the data mining system
may also integrate techniques from spatial data analysis, information retrieval, pattern
recognition, image analysis, signal processing, computer graphics, Web technology,
economics, or psychology. Because of the diversity of disciplines contributing to data mining,
data mining research is expected to generate
a large variety of data mining systems. Therefore, it is necessary to provide a clear
classification of data mining systems. Such a classification may help potential users distinguish
data mining systems and identify those that best match their needs. Data mining systems can
be categorized according to various criteria, as follows.
Classification according to the kinds of databases mined. A data mining system can be
classified according to the kinds of databases mined. Database systems themselves can be
classified according to different criteria (such as data models, or the types of data or
applications involved), each of which may require its own data mining technique. Data mining
systems can therefore be classified accordingly.
For instance, if classifying according to data models, we may have a relational, transactional,
object-oriented, object-relational, or data warehouse mining system. If classifying according
to the special types of data handled, we may have a spatial, time-series, text, or multimedia
data mining system, or a World-Wide Web mining system. Other system types include
heterogeneous data mining systems, and legacy data mining systems.
Classification according to the kinds of knowledge mined. Data mining systems can be
categorized according to the kinds of knowledge they mine, i.e., based on data mining
functionalities, such as characterization, discrimination, association, classification, clustering,
trend and evolution analysis, deviation analysis, similarity analysis, etc. A comprehensive data
mining system usually provides multiple and/or integrated data mining functionalities.
Moreover, data mining systems can also be distinguished based on the granularity or levels of
abstraction of the knowledge mined, including generalized knowledge (at a high level of
abstraction), primitive-level knowledge (at a raw data level), or knowledge at multiple levels
(considering several levels of abstraction). An advanced data mining system should facilitate
the discovery of knowledge at multiple levels of abstraction.
Data Preprocessing
Data cleaning.
Data cleaning routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.
1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining
task involves classification or description). This method is not very effective, unless the tuple
contains several attributes with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming and may
not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by
the same constant, such as a label like “Unknown". If missing values are replaced by, say,
“Unknown", then the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common - that of “Unknown". Hence, although this
method is simple, it is not recommended.
4. Use the attribute mean to fill in the missing value: For example, suppose that the
average income of All Electronics customers is $28,000. Use this value to replace the missing
value for income.
5. Use the attribute mean for all samples belonging to the same class as the given tuple:
For example, if classifying customers according to credit risk, replace the missing value with
the average income value for customers in the same credit risk category as that of the given
tuple.
6. Use the most probable value to fill in the missing value: This may be determined with
inference-based tools using a Bayesian formalism or decision tree induction. For example,
using the other customer attributes in your data set, you may construct a decision tree to
predict the missing values for income.
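A hedged pandas sketch of methods 3 to 5 above; the customer table, the income values and the credit_risk classes are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high"],
    "income":      [28000.0, None, 40000.0, None],
})

# Method 3: fill missing values with a global constant such as "Unknown".
global_fill = df["income"].astype(object).fillna("Unknown")

# Method 4: fill missing values with the overall attribute mean.
mean_fill = df["income"].fillna(df["income"].mean())

# Method 5: fill missing values with the mean of samples in the same class (credit_risk).
class_fill = df.groupby("credit_risk")["income"].transform(lambda s: s.fillna(s.mean()))

print(global_fill.tolist())
print(mean_fill.tolist())
print(class_fill.tolist())
```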
1. Binning methods:
In this example, the data for price are first sorted and partitioned into equi-depth bins
(of depth 3). In smoothing by bin means, each value in a bin is replaced by the mean value of
the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original
value in this bin is replaced by the value 9. Similarly, smoothing by bin medians can be
employed, in which each bin value is replaced by the bin median. In smoothing by bin
boundaries, the minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value.
(i). Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
(ii). Partition into (equi-depth) bins:
• Bin 1: 4, 8, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 28, 34
(iii). Smoothing by bin means:
• Bin 1: 9, 9, 9
• Bin 2: 22, 22, 22
• Bin 3: 29, 29, 29
(iv). Smoothing by bin boundaries:
• Bin 1: 4, 4, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 25, 34
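The same worked example can be reproduced with a few lines of Python; this is only an illustrative sketch of equi-depth binning with smoothing by bin means and by bin boundaries.

```python
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
depth = 3
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: replace every value with its bin's mean.
by_means = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value with the closer of the bin min/max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)    # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```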
2. Clustering:
Outliers may be detected by clustering, where similar values are organized into groups
or “clusters”. Intuitively, values which fall outside of the set of clusters may be considered
outliers.
3. Combined computer and human inspection: Outliers may be identified through a combination of computer and human inspection; for example, patterns whose "surprise" content exceeds a threshold can be output to a list. A human can then sort through the patterns in the list to identify the actual garbage ones.
4. Regression: Data can be smoothed by fitting the data to a function, such as with regression.
Linear regression involves finding the “best" line to fit two variables, so that one variable can
be used to predict the other. Multiple linear regression is an extension of linear regression,
where more than two variables are involved and the data are fit to a multidimensional
surface.
There may be inconsistencies in the data recorded for some transactions. Some data
inconsistencies may be corrected manually using external references. For example, errors
made at data entry may be corrected by performing a paper trace. This may be coupled with
routines designed to help correct the inconsistent use of codes. Knowledge engineering tools
may also be used to detect the violation of known data constraints. For example, known
functional dependencies between attributes can be used to find values contradicting the
functional constraints.
Data transformation.
1. Normalization, where the attribute data are scaled so as to fall within a small specified
range, such as -1.0 to 1.0, or 0 to 1.0.
There are three main methods for data normalization : min-max normalization, z-
score normalization, and normalization by decimal scaling.
(i). Min-max normalization performs a linear transformation on the original data. A value v of A is mapped to v′ in the new range [new_min_A, new_max_A] by computing v′ = ((v − min_A) / (max_A − min_A)) · (new_max_A − new_min_A) + new_min_A.
(ii). In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v′ by computing v′ = (v − mean_A) / std_dev_A, where mean_A and std_dev_A are the mean and standard deviation, respectively, of attribute A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers which dominate the min-max normalization.
(iii). Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal points moved depends on the maximum absolute value of A. A value v of A is normalized to v′ by computing v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1.
2. Smoothing, which works to remove the noise from data. Such techniques include binning, clustering, and regression, as illustrated in the data cleaning section above.
3. Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual total
amounts.
4. Generalization of the data, where low level or 'primitive' (raw) data are replaced by
higher level concepts through the use of concept hierarchies. For example, categorical
attributes, like street, can be generalized to higher level concepts, like city or county.
Data reduction.
1. Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.
2. Dimensionality reduction, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
3. Data compression, where encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data), or nonparametric methods such as clustering, sampling, and the use of histograms.
5. Discretization and concept hierarchy generation, where raw data values for attributes
are replaced by ranges or higher conceptual levels. Concept hierarchies allow the mining of
data at multiple levels of abstraction, and are a powerful tool for data mining.
Attribute subset selection (feature selection):
– Select a minimum set of features such that the probability distribution of the different classes given the values for those features is as close as possible to the original distribution given the values of all features
– This reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand
Heuristic methods:
1. Step-wise forward selection: The procedure starts with an empty set of attributes. The
best of the original attributes is determined and added to the set. At each subsequent iteration
or step, the best of the remaining original attributes is added to the set.
2. Step-wise backward elimination: The procedure starts with the full set of attributes. At
each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The step-wise forward selection and backward elimination methods can be combined, so that at each step the procedure selects the best attribute and removes the worst from among the remaining attributes.
4. Decision tree induction: Decision tree algorithms, such as ID3 and C4.5, were originally intended for classification. Decision tree induction constructs a flow-chart-like structure where each internal (non-leaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses the "best" attribute to partition the data into individual classes.
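As a sketch of how step-wise forward selection might look in code (an assumption, not from the notes): the score() callback is a placeholder for whatever attribute-quality measure is used, e.g. information gain or a cross-validated accuracy; the attribute names and weights below are hypothetical.

```python
# Greedy step-wise forward selection driven by a caller-supplied score function.
def forward_selection(attributes, score, k):
    selected, remaining = [], list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))  # best remaining attribute
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage with a hypothetical scoring function.
weights = {"age": 0.9, "income": 0.7, "street": 0.1}
print(forward_selection(weights, score=lambda attrs: sum(weights[a] for a in attrs), k=2))
# -> ['age', 'income']
```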
Data compression
Wavelet transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that,
when applied to a data vector D, transforms it to a numerically different vector, D0, of wavelet
coefficients. The two vectors are of the same length.
The DWT is closely related to the discrete Fourier transform (DFT), a signal processing
technique involving sines and cosines. In general, however, the DWT achieves better lossy
compression.
1. The length, L, of the input data vector must be an integer power of two. This condition
can be met by padding the data vector with zeros, as necessary.
2. Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average; the second performs a weighted difference, which acts to bring out the detailed features of the data.
Principal components analysis (PCA)
The basic procedure is as follows:
1. The input data are normalized, so that each attribute falls within the same range. This step
helps ensure that attributes with large domains will not dominate attributes with smaller
domains.
2. PCA computes N orthonormal vectors which provide a basis for the normalized input data.
These are unit vectors that each point in a direction perpendicular to the others. These
vectors are referred to as the principal components. The input data are a linear combination
of the principal components.
3. The principal components are sorted in order of decreasing “significance" or strength. The
principal components essentially serve as a new set of axes for the data, providing important
information about variance.
4. Since the components are sorted according to decreasing order of "significance", the size of
the data can be reduced by eliminating the weaker components, i.e., those with low variance.
Using the strongest principal components, it should be possible to reconstruct a good
approximation of the original data.
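A compact sketch of steps 1-4 (an assumption, not from the notes), using numpy's eigen-decomposition of the covariance matrix; the data matrix is hypothetical.

```python
# PCA via eigen-decomposition, keeping only the strongest component.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

Xc = X - X.mean(axis=0)                      # step 1: normalize (center) the input
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # step 2: orthonormal principal components

order = np.argsort(eigvals)[::-1]            # step 3: sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 1                                        # step 4: drop the weaker components
reduced = Xc @ eigvecs[:, :k]
approx = reduced @ eigvecs[:, :k].T + X.mean(axis=0)   # approximate reconstruction
print(reduced.ravel())
print(approx.round(2))
```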
Numerosity reduction
Regression and log-linear models can be used to approximate the given data. In linear regression, the data are modeled to fit a straight line. For example, a random variable, Y (called a response variable), can be modeled as a linear function of another random variable, X (called a predictor variable), with the equation Y = α + βX, where the variance of Y is assumed to be constant. The coefficients α and β specify the Y-intercept and slope of the line, respectively, and can be solved for by the method of least squares, which minimizes the error between the actual data and the estimate of the line.
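A minimal worked example of the least-squares fit described above (an assumption, not from the notes; the x/y values are made up).

```python
# Fit y = a + b*x by the method of least squares (closed-form solution).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.1, 6.2, 7.9, 10.1]

n = len(xs)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
b = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
    sum((x - x_mean) ** 2 for x in xs)      # slope
a = y_mean - b * x_mean                     # intercept
print(f"y = {a:.2f} + {b:.2f} * x")         # roughly y = 0.14 + 1.98 * x
```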
Histograms
A histogram for an attribute A partitions the data distribution of A into disjoint subsets,
or buckets. The buckets are displayed on a horizontal axis, while the height (and area) of a
bucket typically reflects the average frequency of the values represented by the bucket.
1. Equi-width: In an equi-width histogram, the width of each bucket range is constant (such as
the width of $10 for the buckets in Figure 3.8).
2. Equi-depth (or equi-height): In an equi-depth histogram, the buckets are created so that,
roughly, the frequency of each bucket is constant (that is, each bucket contains roughly the
same number of contiguous data samples).
3. V-Optimal: If we consider all of the possible histograms for a given number of buckets, the
V-optimal histogram is the one with the least variance. Histogram variance is a weighted sum
of the original values that each bucket represents, where bucket weight is equal to the
number of values in the bucket.
4. MaxDiff: In a MaxDiff histogram, we consider the difference between each pair of adjacent values. A bucket boundary is established between each pair of adjacent values having one of the β − 1 largest differences, where β, the number of buckets, is user-specified.
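As a rough illustration (an assumption, not from the notes), the difference between equi-width and equi-depth bucketing can be shown in a few lines; the price list is hypothetical.

```python
# Equi-width bucket ranges vs. equi-depth buckets for a sorted price list.
prices = sorted([1, 1, 5, 5, 5, 8, 8, 10, 10, 12, 14, 14, 15, 18, 20, 21, 25, 28, 28, 30])

def equi_width_ranges(data, n_buckets):
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_buckets                 # constant bucket width
    return [(lo + i * width, lo + (i + 1) * width) for i in range(n_buckets)]

def equi_depth_buckets(data, n_buckets):
    depth = len(data) // n_buckets                # roughly constant frequency
    return [data[i:i + depth] for i in range(0, len(data), depth)]

print(equi_width_ranges(prices, 3))
print(equi_depth_buckets(prices, 4))
```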
Clustering
Clustering techniques consider data tuples as objects. They partition the objects into
groups or clusters, so that objects within a cluster are “similar" to one another and
“dissimilar" to objects in other clusters. Similarity is commonly defined in terms of how
“close" the objects are in space, based on a distance function. The “quality" of a cluster may be
represented by its diameter, the maximum distance between any two objects in the cluster.
Centroid distance is an alternative measure of cluster quality, and is defined as the average
distance of each cluster object from the cluster centroid.
Sampling
Sampling can be used as a data reduction technique since it allows a large data set to be
represented by a much smaller random sample (or subset) of the data. Suppose that a large
data set, D, contains N tuples. Let's have a look at some possible samples for D.
1. Simple random sample without replacement (SRSWOR) of size n: This is created by drawing n of the N tuples from D (n < N), where every tuple in D is equally likely to be drawn.
2. Simple random sample with replacement (SRSWR) of size n: This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.
3. Cluster sample: If the tuples in D are grouped into M mutually disjoint "clusters", then an SRS of m clusters can be obtained, where m < M. For example, tuples in a database are usually retrieved a page at a time, so that each page can be considered a cluster; a reduced data representation can then be obtained by applying, say, SRSWOR to the pages, resulting in a cluster sample of the tuples.
4. Stratified sample: If D is divided into mutually disjoint parts called "strata", a stratified sample of D is generated by obtaining an SRS at each stratum. This helps to ensure a representative sample, especially when the data are skewed. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group.
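The three sampling schemes can be sketched with the standard library (an assumption, not from the notes; the toy tuples and the age_group stratification key are hypothetical).

```python
# SRSWOR, SRSWR, and stratified sampling over a tiny tuple list.
import random
from collections import defaultdict

D = [{"tid": i, "age_group": g} for i, g in
     enumerate(["young"] * 6 + ["middle"] * 3 + ["senior"] * 1)]

srswor = random.sample(D, k=4)                 # without replacement
srswr = [random.choice(D) for _ in range(4)]   # with replacement (tuples may repeat)

def stratified_sample(data, key, frac):
    strata = defaultdict(list)
    for t in data:
        strata[t[key]].append(t)               # split D into disjoint strata
    sample = []
    for group in strata.values():              # SRS within each stratum
        k = max(1, round(len(group) * frac))
        sample.extend(random.sample(group, k))
    return sample

print(stratified_sample(D, key="age_group", frac=0.3))
```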
UNIT IV
ASSOCIATION RULE MINING AND CLASSIFICATION
Mining Frequent Patterns, Associations and Correlations – Mining Methods – Mining Various Kinds of
Association Rules – Correlation Analysis – Constraint Based Association Mining – Classification and
Prediction - Basic Concepts - Decision Tree Induction - Bayesian Classification – Rule Based
Classification – Classification by Back propagation – Support Vector Machines – Associative
Classification – Lazy Learners – Other Classification Methods – Prediction
Association Mining
• Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a
customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
– E.g., 98% of people who purchase tires and auto accessories also get automotive services
done
• Applications
• Find all the rules X ∧ Y ⇒ Z with minimum confidence and support
– support, s, probability that a transaction contains {X ∪ Y ∪ Z}
– confidence, c, conditional probability that a transaction having {X ∪ Y} also contains Z
– A ⇒ C (50%, 66.6%)
– C ⇒ A (50%, 100%)
Rule support and confidence are two measures of rule interestingness. They respectively
reflect the usefulness and certainty of discovered rules. A support of 2% for an association rule means
that 2% of all the transactions under analysis show that computer and financial management
software are purchased together. A confidence of 60% means that 60% of the customers who
purchased a computer also bought the software. Typically, association rules are considered
interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.
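Support and confidence are easy to compute directly, as in the sketch below (an assumption, not from the notes; the five transactions are invented to echo the computer/financial-software example).

```python
# Support and confidence of the rule {computer} => {financial_software}.
transactions = [
    {"computer", "financial_software"},
    {"computer", "printer"},
    {"computer", "financial_software", "printer"},
    {"printer"},
    {"computer"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({"computer", "financial_software"}))        # 0.4  (40% support)
print(confidence({"computer"}, {"financial_software"}))   # 0.5  (50% confidence)
```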
Apriori: the method that mines the complete set of frequent itemsets with candidate generation.
Apriori property and the Apriori Algorithm
Apriori property: every nonempty subset of a frequent itemset must also be frequent; equivalently, if an itemset is infrequent, none of its supersets can be frequent. This property is used to prune candidate itemsets during mining.
Example
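The worked example from the source did not survive extraction, so the following is a hedged sketch (an assumption, not the textbook algorithm verbatim) of Apriori-style level-wise mining with candidate generation and subset pruning on an invented transaction set.

```python
# Level-wise frequent itemset mining with Apriori-style pruning.
from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"},
                {"A", "B", "C", "E"}, {"B", "E"}]
min_support = 2   # absolute support count

def frequent_itemsets(db, min_sup):
    items = sorted({i for t in db for i in t})
    level = [frozenset([i]) for i in items]          # candidate 1-itemsets
    frequent = {}
    while level:
        counts = {c: sum(c <= t for t in db) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(current)
        keys = list(current)
        size = len(keys[0]) + 1 if keys else 0
        # join step: build (k+1)-candidates from frequent k-itemsets
        candidates = {a | b for a, b in combinations(keys, 2) if len(a | b) == size}
        # prune step (Apriori property): drop candidates with an infrequent subset
        level = [c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, size - 1))]
    return frequent

print(frequent_itemsets(transactions, min_support))
```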
FP-growth: the method that mines the complete set of frequent itemsets without candidate generation.
(FP-tree with header table: figure not reproduced.)
Benefits of the FP-tree structure:
• Completeness:
– never breaks a long pattern of any transaction
– preserves complete information for frequent pattern mining
• Compactness
– reduce irrelevant information—infrequent items are gone
– frequency descending ordering: more frequent items are more likely to be shared
– never larger than the original database (not counting node-links and counts)
– Example: For Connect-4 DB, compression ratio could be over 100
(Concept hierarchy figure not reproduced: Food is refined into Milk and Bread, which are further refined into brands such as Fraser and Sunset.)
Encoded transaction table:
TID   Items
T1    {111, 121, 211, 221}
T2    {111, 211, 222, 323}
T3    {112, 122, 221, 411}
T4    {111, 121}
T5    {111, 122, 211, 221, 413}
– If adopting the same min_support across multi-levels then toss t if any of t’s ancestors is
infrequent.
– If adopting reduced min_support at lower levels then examine only those descendents
whose ancestor’s support is frequent/non-negligible.
Correlation in detail.
The chi-square (χ²) statistic measures the correlation between categorical attributes:
χ² = Σ (observed − expected)² / expected
where the sum runs over all cells of the contingency table and the expected count of a cell is (row total · column total) / n.
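A short sketch of the computation (an assumption, not from the notes), using invented counts for a 2x2 contingency table:

```python
# Chi-square statistic for a 2x2 contingency table of observed counts.
observed = [[4000, 3500],   # e.g. rows: buys A / does not, cols: buys B / does not
            [2000, 500]]

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n   # expected cell count
        chi2 += (obs - expected) ** 2 / expected
print(round(chi2, 1))   # large values indicate correlated attributes
```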
Numeric correlation: for numeric attributes, the correlation between attributes A and B can be measured by the correlation coefficient r(A, B) = Σ (a_i − mean_A)(b_i − mean_B) / (n · std_A · std_B); r > 0 indicates positive correlation, r = 0 indicates no linear correlation, and r < 0 indicates negative correlation.
– Interestingness constraints:
• strong rules (min_support ≥ 3%, min_confidence ≥ 60%).
• Database: (1) trans (TID, Itemset), (2) itemInfo (Item, Type, Price)
• A constrained association query (CAQ) is in the form of {(S1, S2) | C},
– where C is a set of constraints on S1, S2 including frequency constraint
• A classification of (single-variable) constraints:
– Class constraint: S ⊆ A. e.g. S ⊆ Item
– Domain constraint:
• S θ v, θ ∈ { =, ≠, <, ≤, >, ≥ }. e.g. S.Price < 100
• v θ S, θ is ∈ or ∉. e.g. snacks ∈ S.Type
• V θ S, or S θ V, θ ∈ { ⊆, ⊂, ⊄, =, ≠ }
– e.g. {snacks, sodas} ⊆ S.Type
– Aggregation constraint: agg(S) θ v, where agg is in {min, max, sum, count, avg}, and θ ∈ { =, ≠, <, ≤, >, ≥ }.
• e.g. count(S1.Type) = 1, avg(S2.Price) ≥ 100
2. Succinct Constraint
• A subset of items Is ⊆ I is a succinct set if it can be expressed as σp(I) for some selection predicate p, where σ is the selection operator
• SP ⊆ 2^I is a succinct power set if there is a fixed number of succinct sets I1, …, Ik ⊆ I, s.t. SP can be expressed in terms of the strict power sets of I1, …, Ik using union and minus
• A constraint Cs is succinct provided SATCs(I) is a succinct power set
3. Convertible Constraint
• Suppose all items in patterns are listed in a total order R
• A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that
each suffix of S w.r.t. R also satisfies C
• A constraint C is convertible monotone iff a pattern S satisfying the constraint implies that
each pattern of which S is a suffix w.r.t. R also satisfies C
• Anti-monotonicity: If a set S violates the constraint, any superset of S violates the constraint.
• Examples:
– sum(S.Price) ≤ v is anti-monotone
– sum(S.Price) ≥ v is not anti-monotone
– sum(S.Price) = v is partly anti-monotone
• Application:
– Push "sum(S.Price) ≤ 1000" deeply into iterative frequent set computation.
• Succinctness:
– For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C
– Given A1, the set of size-1 itemsets satisfying C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
• Examples:
– sum(S.Price) ≥ v is not succinct
– min(S.Price) ≤ v is succinct
• Optimization:
– If C is succinct, then C is pre-counting prunable. The satisfaction of the constraint alone is
not affected by the iterative support counting.
Classification:
– predicts categorical class labels
– classifies data (constructs a model) based on the training set and the values (class labels)
in a classifying attribute and uses it in classifying new data
• Prediction: models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications
– Credit approval
– Target marketing
– Medical diagnosis
– Fraud detection
Classification—A Two-Step Process
• The known label of test sample is compared with the classified result from the
model
• Accuracy rate is the percentage of test set samples that are correctly classified by
the model
• Test set is independent of training set, otherwise over-fitting will occur
Process (1): Model Construction
Training Dataset
• Example:
– Weather problem: build a decision tree to guide the decision about whether or not to play
tennis.
– Dataset
(weather.nominal.arff)
• Validation:
– Using the training set as a test set gives an over-optimistic ("optimal") classification accuracy; an independent test set or cross-validation should be used instead.
• Results:
– Classification accuracy (correctly classified instances).
– Errors (absolute mean, root squared mean, …)
– Kappa statistic (measures agreement between predicted and observed classifications on a scale from -100% to 100%, as the proportion of agreement remaining after chance agreement has been excluded; 0% means the agreement is no better than chance)
• Results:
– TP (True Positive) rate per class label
– FP (False Positive) rate
– Precision = TP / (TP + FP) * 100%
– Recall = TP rate = TP / (TP + FN) * 100%
– F-measure = 2 * recall * precision / (recall + precision)
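These measures can be computed directly from the confusion-matrix counts, as in this minimal sketch (an assumption, not from the notes; the counts are invented):

```python
# Per-class evaluation measures from TP/FP/FN/TN counts.
def metrics(tp, fp, fn, tn):
    tp_rate = recall = tp / (tp + fn)
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"TP rate": tp_rate, "FP rate": fp_rate,
            "precision": precision, "recall": recall, "F-measure": f_measure}

print(metrics(tp=40, fp=10, fn=5, tn=45))
```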
• ID3 characteristics:
– Requires nominal values
– Improved into C4.5
• Dealing with numeric attributes
• Dealing with missing values
• Dealing with noisy data
• Generating rules from trees
• Methods:
– C5.0: target field must be categorical, predictor fields may be numeric or categorical,
provides multiple splits on the field that provides the maximum information gain at
each level
– QUEST: target field must be categorical, predictor fields may be numeric ranges or
categorical, statistical binary split
– C&RT: target and predictor fields may be numeric ranges or categorical, statistical binary
split based on regression
– CHAID: target and predictor fields may be numeric ranges or categorical, statistical
binary split based on chi-square
Bayesian Classification:
• Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most practical
approaches to certain types of learning problems
• Incremental: Each training example can incrementally increase/decrease the probability that a
hypothesis is correct. Prior knowledge can be combined with observed data.
• Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
• Standard: Even when Bayesian methods are computationally intractable, they can provide a
standard of optimal decision making against which other methods can be measured
Bayesian Theorem
• Given training data D, posteriori probability of a hypothesis h, P(h|D) follows the Bayes theorem
P(h|D) = P(D|h) P(h) / P(D)
• MAP (maximum a posteriori) hypothesis: h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h)
• Greatly reduces the computation cost: only the class distribution needs to be counted.
For the play-tennis data (9 samples of class P, 5 of class N), the class-conditional probabilities are:

Outlook        P    N        Humidity    P    N
sunny        2/9  3/5        high      3/9  4/5
overcast     4/9    0        normal    6/9  1/5
rain         3/9  2/5
Temperature    P    N        Windy       P    N
hot          2/9  2/5        true      3/9  3/5
mild         4/9  2/5        false     6/9  2/5
cool         3/9  1/5

Bayesian classification
• The classification problem may be formalized using a-posteriori probabilities:
• P(C|X) = prob. that the sample tuple
• X=<x1,…,xk> is of class C.
• E.g. P(class=N | outlook=sunny, windy=true,…)
• Idea: assign to sample X the class label C such that P(C|X) is maximal
• Bayes theorem: P(C|X) = P(X|C) · P(C) / P(X); since P(X) is the same for all classes, X is assigned to the class C that maximizes P(X|C) · P(C).
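Using the conditional probabilities tabulated above, a naive Bayesian classifier can score an unseen sample in a few lines. The sketch below (an assumption, not from the notes) classifies the sample X = (outlook=sunny, temperature=cool, humidity=high, windy=true):

```python
# Naive Bayesian classification with the play-tennis conditional probabilities.
import math

priors = {"P": 9 / 14, "N": 5 / 14}
cond = {
    "P": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "true": 3/9},
    "N": {"sunny": 3/5, "cool": 1/5, "high": 4/5, "true": 3/5},
}
x = ["sunny", "cool", "high", "true"]

# P(C) * product of P(x_i | C), assuming conditional independence of attributes
scores = {c: priors[c] * math.prod(cond[c][v] for v in x) for c in priors}
print(scores)                        # N scores higher than P for this sample
print(max(scores, key=scores.get))   # -> 'N'
```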
The CPT (conditional probability table) of a variable in a Bayesian belief network shows the conditional probability for each possible combination of values of its parents.
Association-Based Classification
Discarding one or more subtrees and replacing them with leaves simplify a decision tree, and that
is the main task in decision-tree pruning. In replacing the subtree with a leaf, the algorithm expects to
lower the predicted error rate and increase the quality of a classification model. But computation of
error rate is not simple. An error rate based only on a training data set does not provide a suitable
estimate. One possibility to estimate the predicted error rate is to use a new, additional set of test
samples if they are available, or to use the cross-validation techniques. This technique divides initially
available samples into equal sized blocks and, for each block, the tree is constructed from all samples
except this block and tested with a given block of samples. With the available training and testing
samples, the basic idea of decision tree-pruning is to remove parts of the tree (subtrees) that do not
contribute to the classification accuracy of unseen testing samples, producing a less complex and thus
more comprehensible tree. There are two ways in which the recursive-partitioning method can be
modified:
1. Deciding not to divide a set of samples any further under some conditions. The stopping
criterion is usually based on some statistical tests, such as the χ2 test: If there are no significant
differences in classification accuracy before and after division, then represent a current node as a
leaf. The decision is made in advance, before splitting, and therefore this approach is called
prepruning.
2. Removing retrospectively some of the tree structure using selected accuracy criteria. The decision in this process of postpruning is made after the tree has been built.
C4.5 follows the postpruning approach, but it uses a specific technique to estimate the predicted error rate. This method is called pessimistic pruning. For every node in a tree, the estimation of the upper confidence limit Ucf is computed using the statistical tables for the binomial distribution (given in most textbooks on statistics). Parameter Ucf is a function of ∣Ti∣ and E for a given node. C4.5 uses the default confidence level of 25%, and compares U25%(∣Ti∣, E) for a given node Ti with a weighted confidence of
its leaves. Weights are the total number of cases for every leaf. If the predicted error for a root node in
a subtree is less than weighted sum of U25% for the leaves (predicted error for the subtree), then a
subtree will be replaced with its root node, which becomes a new leaf in a pruned tree.
Let us illustrate this procedure with one simple example. A subtree of a decision tree is given
in Figure, where the root node is the test x1 on three possible values {1, 2, 3} of the attribute A. The
children of the root node are leaves denoted with corresponding classes and (∣Ti∣/E) parameters. The
question is to estimate the possibility of pruning the subtree and replacing it with its root node as a
new, generalized leaf node.
To analyze the possibility of replacing the subtree with a leaf node it is necessary to compute a
predicted error PE for the initial tree and for a replaced node. Using default confidence of 25%, the
upper confidence limits for all nodes are collected from statistical tables: U25%(6, 0) = 0.206, U25%(9, 0) = 0.143, U25%(1, 0) = 0.750, and U25%(16, 1) = 0.157. Using these values, the predicted errors for the initial tree and the replaced node are
PE(subtree) = 6 · 0.206 + 9 · 0.143 + 1 · 0.750 ≈ 3.27
PE(root as a leaf) = 16 · 0.157 ≈ 2.51
Since the existing subtree has a higher value of predicted error than the replaced node, it is
recommended that the decision tree be pruned and the subtree replaced with the new leaf node.
Classification by Backpropagation (Neural Networks)
Weakness
Long training time
Require a number of parameters typically best determined empirically, e.g., the network topology or "structure"
Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and of "hidden units" in the network
Strength
High tolerance to noisy data
Ability to classify untrained patterns
Well-suited for continuous-valued inputs and outputs
Successful on a wide array of real-world data
Algorithms are inherently parallel
Techniques have recently been developed for the extraction of rules from trained
neural networks
A Neuron (= a perceptron)
The n-dimensional input vector x is mapped into the variable y by means of a scalar product and a nonlinear function mapping; for example, y = f(Σi wi·xi + b), where the wi are weights, b is a bias, and f is a nonlinear activation function.
The inputs to the network correspond to the attributes measured for each training tuple
Inputs are fed simultaneously into the units making up the input layer
They are then weighted and fed simultaneously to a hidden layer
The number of hidden layers is arbitrary, although usually only one
The weighted outputs of the last hidden layer are input to units making up the output layer,
which emits the network's prediction
The network is feed-forward in that none of the weights cycles back to an input unit or to an
output unit of a previous layer
From a statistical point of view, networks perform nonlinear regression: Given enough hidden
units and enough training samples, they can closely approximate any function
Backpropagation
Iteratively process a set of training tuples & compare the network's prediction with the actual
known target value
For each training tuple, the weights are modified to minimize the mean squared error between
the network's prediction and the actual target value
Modifications are made in the “backwards” direction: from the output layer, through each
hidden layer down to the first hidden layer, hence “backpropagation”
Steps
Initialize weights (to small random #s) and biases in the network
Propagate the inputs forward (by applying activation function)
Backpropagate the error (by updating weights and biases)
Terminating condition (when error is very small, etc.)
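The steps above can be made concrete with a tiny feed-forward network trained on XOR. This is a hedged sketch (an assumption, not from the notes), assuming numpy, one hidden layer of sigmoid units, and a squared-error loss:

```python
# Backpropagation for a 2-4-1 network on the XOR problem.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Step 1: initialize weights to small random numbers, biases to zero
W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for epoch in range(10000):
    # Step 2: propagate the inputs forward through the activation function
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Step 3: backpropagate the error (gradient of the squared error)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0, keepdims=True)
    # Step 4 (terminating condition) is simplified here to a fixed epoch count

print(out.round(2).ravel())   # should approach [0, 1, 1, 0]
```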
Efficiency of backpropagation: Each epoch (one iteration through the training set) takes O(|D|
* w), with |D| tuples and w weights, but # of epochs can be exponential to n, the number of
inputs, in the worst case
Rule extraction from networks: network pruning
Simplify the network structure by removing weighted links that have the least effect on
the trained network
Then perform link, unit, or activation value clustering
The set of input and activation values are studied to derive rules describing the
relationship between the input and hidden unit layers
Sensitivity analysis: assess the impact that a given input variable has on a network output. The
knowledge gained from this analysis can be represented in rules
SVM—General Philosophy
SVM—Linearly Separable
A separating hyperplane can be written as
W ● X + b = 0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
The hyperplane defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
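A trained linear SVM classifies a tuple by which side of the margin it falls on. The snippet below is a sketch only (an assumption, not from the notes), with made-up weights w0, w1, w2:

```python
# Decide the class of a 2-D point relative to the hyperplane w0 + w1*x1 + w2*x2 = 0.
w0, w1, w2 = -3.0, 1.0, 1.0        # hypothetical hyperplane parameters

def classify(x1, x2):
    g = w0 + w1 * x1 + w2 * x2
    if g >= 1:
        return +1                  # on or beyond H1
    if g <= -1:
        return -1                  # on or beyond H2
    return 0                       # inside the margin

print(classify(3, 2), classify(0.5, 0.5), classify(1.4, 1.5))   # 1 -1 0
```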
Associative Classification
Associative classification
Association rules are generated and analyzed for use in classification
Search for strong associations between frequent patterns (conjunctions of attribute-
value pairs) and class labels
Classification: Based on evaluating a set of rules in the form of
p1 ∧ p2 ∧ … ∧ pl ⇒ "Aclass = C" (conf, sup)
Why effective?
It explores highly confident associations among multiple attributes and may overcome
some constraints introduced by decision-tree induction, which considers only one
attribute at a time
In many studies, associative classification has been found to be more accurate than some traditional classification methods, such as C4.5.
Associative Classification May Achieve High Accuracy and Efficiency (Cong et al. SIGMOD05)
For discrete-valued target functions, k-NN returns the most common value among the k training examples nearest to xq
k-NN for real-valued prediction for a given unknown tuple
Returns the mean values of the k nearest neighbors
Distance-weighted nearest neighbor algorithm
Weight the contribution of each of the k neighbors according to their distance to the
query xq
Give greater weight to closer neighbors
Robust to noisy data by averaging k-nearest neighbors
Curse of dimensionality: distance between neighbors could be dominated by irrelevant
attributes
To overcome it, axes stretch or elimination of the least relevant attributes
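A minimal sketch of both k-NN variants described above (an assumption, not from the notes; the training tuples are invented):

```python
# k-NN classification and distance-weighted k-NN prediction.
import math
from collections import Counter

train = [((1.0, 1.0), "yes"), ((1.2, 0.8), "yes"),
         ((4.0, 4.2), "no"),  ((3.8, 4.0), "no")]

def knn_classify(xq, k=3):
    neighbors = sorted(train, key=lambda t: math.dist(t[0], xq))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def knn_weighted_predict(xq, numeric_train, k=3):
    neighbors = sorted(numeric_train, key=lambda t: math.dist(t[0], xq))[:k]
    weights = [1.0 / (math.dist(x, xq) ** 2 + 1e-9) for x, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

print(knn_classify((1.1, 1.0)))                    # 'yes'
print(knn_weighted_predict((1.1, 1.0),
      [((1.0, 1.0), 10.0), ((1.2, 0.8), 12.0), ((4.0, 4.0), 30.0)]))
```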
Genetic Algorithms
A rough set for a given class C is approximated by two sets: a lower approximation (certain to
be in C) and an upper approximation (cannot be described as not belonging to C)
Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard but a
discernibility matrix (which stores the differences between attribute values for each pair of data
tuples) is used to reduce the computation intensity
Figure: A rough set approximation of the set of tuples of the class C using lower and upper approximation sets of C. The rectangular regions represent equivalence classes
Prediction