
DATA WAREHOUSE & DATA MINING

SECTION – A
What is a Data Warehouse?
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than
transaction processing. It includes historical data derived from transaction data from single and multiple
sources. A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing
support for decision-makers for data modeling and analysis. A Data Warehouse is a group of data
specific to the entire organization, not only to a particular group of users.
It is not used for daily operations and transaction processing but used for making
decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various
applications.
o It supports a relatively small number of clients with relatively long
interactions.
o It includes current and historical data to provide a historical perspective of
information.
o Its usage is read-intensive.
o It contains a few large tables.
"Data Warehouse is a subject-oriented, integrated, and time-variant store of
information in support of management's decisions."

Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view
around a particular subject, such as customer, product, or sales, rather than the
organization's ongoing operations as a whole. This is done by excluding data that are
not useful for the subject and including all data needed by the users to understand the
subject.
Integrated
A data warehouse integrates various heterogeneous data sources, such as RDBMSs, flat
files, and online transaction records. It requires data cleaning and integration during
data warehousing to ensure consistency in naming conventions, attribute types, etc.,
among the different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even older periods from a data warehouse.
This contrasts with a transaction system, where often only the most current data is
kept.
Non-Volatile
The data warehouse is a physically separate data store, into which data is transformed from
the source operational RDBMS. Operational updates of data do not occur in the
data warehouse, i.e., update, insert, and delete operations are not performed there. It
usually requires only two procedures in data accessing: initial loading of data and
read access to data. Therefore, the DW does not require transaction processing, recovery,
and concurrency capabilities, which allows for substantial speedup of data retrieval.
Non-volatile means that once data has entered the warehouse, it should not
change.
Data Warehouse Usage:-
1. Data warehouses and data marts are used in a wide range of applications.
2. Business executives use the data in data warehouses and data marts to perform data
analysis and make strategic decisions.
3. In many areas, data warehouses are used as an integral part of enterprise management.
4. The data warehouse is mainly used for generating reports and answering predefined queries.
5. It is used to analyze summarized and detailed data, where the results are presented in the
form of reports and charts.
6. Later, the data warehouse is used for strategic purposes, performing multidimensional
analysis and sophisticated operations.
7. Finally, the data warehouse may be employed for knowledge discovery and strategic
decision making using data mining tools.
8. In this context, the tools for data warehousing can be categorized into access and retrieval
tools, database reporting tools, data analysis tools, and data mining tools.
What is Data Mart?
A Data Mart is a subset of an organizational data store, generally oriented to a
specific purpose or primary data subject, which may be distributed to support
business needs. Data Marts are analytical record stores designed to focus on
particular business functions for a specific community within an organization.
Data marts are derived from subsets of data in a data warehouse, though in the
bottom-up data warehouse design methodology, the data warehouse is created
from the union of organizational data marts.

The fundamental use of a data mart is Business Intelligence (BI) applications. BI is
used to gather, store, access, and analyze records. A data mart can be used by
smaller businesses to utilize the data they have accumulated, since it is less
expensive to implement than a data warehouse.
Reasons for creating a data mart

o Creates collective data for a group of users
o Easy access to frequently needed data
o Ease of creation
o Improves end-user response time
o Lower cost than implementing a complete data warehouse
o Potential users are more clearly defined than in a comprehensive data
warehouse
o It contains only essential business data and is less cluttered.

Types of Data Marts

There are mainly two approaches to designing data marts. These approaches are

o Dependent Data Marts


o Independent Data Marts

Dependent Data Marts


A dependent data mart is a logical subset or a physical subset of a larger data
warehouse. According to this technique, the data marts are treated as
subsets of a data warehouse. In this technique, a data warehouse is first
created, from which various data marts can then be created. These data marts
are dependent on the data warehouse and extract the essential records from it.
Because the data warehouse creates the data marts, there is no need for data
mart integration. This is also known as the top-down approach.

Independent Data Marts


The second approach is independent data marts (IDM). Here, independent
data marts are created first, and then a data warehouse is designed using these
multiple independent data marts. In this approach, since all the data marts are
designed independently, integration of the data marts is required. It is
also termed the bottom-up approach, as the data marts are integrated to
develop the data warehouse.

What is Meta Data?


Metadata is data about data, or documentation about the information that is
required by the users. In data warehousing, metadata is one of the essential
aspects.

Metadata includes the following:

1. The location and descriptions of warehouse systems and components.


2. Names, definitions, structures, and content of the data warehouse and end-user
views.
3. Identification of authoritative data sources.
4. Integration and transformation rules used to populate data.
5. Integration and transformation rules used to deliver information to end-
user analytical tools.
6. Subscription information for information delivery to analysis subscribers.
7. Metrics used to analyze warehouse usage and performance.
8. Security authorizations, access control list, etc.

Metadata is used for building, maintaining, managing, and using the data
warehouse. Metadata allows users to understand the content and find the data
they need.

Several examples of metadata are:

1. A library catalog may be considered metadata. The catalog metadata
consists of several predefined components representing specific attributes
of a resource, and each component can have one or more values. These
components could be the name of the author, the name of the document,
the publisher's name, the publication date, and the categories to which it
belongs.
2. The table of contents and the index in a book may be treated as metadata for
the book.
3. Suppose we say that a data item about a person is 80. This must be
defined by noting that it is the person's weight and that the unit is kilograms.
Therefore, (weight, kilograms) is the metadata about the data value 80.
4. Other examples of metadata are data about the tables and figures in a
report like this book. A table has a name (e.g., a table title), and the
column names of the table may be treated as metadata. The figures also
have titles or names.

Why is metadata necessary in a data warehouse?

o First, it acts as the glue that links all parts of the data warehouse.
o Next, it provides information about the contents and structures to the
developers.
o Finally, it opens the doors to the end-users and makes the contents
recognizable in their terms.

Metadata is like a nerve center. Various processes during the building and
administering of the data warehouse generate parts of the data warehouse
metadata, and one process uses parts of the metadata generated by another. In the data
warehouse, metadata assumes a key position and enables communication
among the various processes. It acts as the nerve centre of the data warehouse.

What is Data Cube?


When data is grouped or combined into multidimensional matrices, the result is called a Data
Cube. The data cube method has a few alternative names or variants,
such as "multidimensional databases," "materialized views," and "OLAP (On-Line
Analytical Processing)."

The general idea of this approach is to materialize certain expensive
computations that are frequently queried.

For example, a relation with the schema sales (part, supplier, customer, and
sale-price) can be materialized into a set of eight views as shown in fig,
where psc indicates a view consisting of aggregate function value (such as total-
sales) computed by grouping three attributes part, supplier, and
customer, p indicates a view composed of the corresponding aggregate function
values calculated by grouping part alone, etc.

A data cube is created from a subset of attributes in the database. Specific attributes
are chosen to be measure attributes, i.e., the attributes whose values are of interest.
Other attributes are selected as dimensions or functional attributes. The measure
attributes are aggregated according to the dimensions. For example, XYZ may create
a sales data warehouse to keep records of the store's sales for the dimensions time,
item, branch, and location. These dimensions enable the store to keep track of things
like monthly sales of items, and the branches and locations at which the items were
sold. Each dimension may have a table associated with it, known as a dimension table,
which describes the dimension. For example, a dimension table for items may contain
the attributes item_name, brand, and type.
The data cube method is an interesting technique with many applications. Data cubes can be sparse
in many cases, because not every cell in each dimension may have corresponding data in the
database. Techniques should be developed to handle sparse cubes efficiently. Also, if a query contains
constants at even lower levels than those provided in a data cube, it is not clear how to make the
best use of the precomputed results stored in the data cube.

The multidimensional data model views data in the form of a data cube, and OLAP tools are based on
this model. Data cubes usually model n-dimensional data: a data cube enables data to be modeled and
viewed in multiple dimensions. A multidimensional data model is organized around a central theme,
like sales or transactions, and a fact table represents this theme. Facts are numerical measures, so the
fact table contains measures (such as Rs_sold) and keys to each of the related dimension tables.
Dimensions are the entities that define a data cube, while facts are generally the quantities used for
analyzing the relationships between dimensions.
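To make this concrete, the following is a minimal Python sketch (assuming the pandas library is
available, with a made-up sales table) of materializing the eight group-by views of
sales(part, supplier, customer, sale_price) described above:

```python
# Materialize every group-by view of the sales relation: (), (p), (s), (c), (ps), (pc), (sc), (psc).
import itertools
import pandas as pd

sales = pd.DataFrame({
    "part":       ["p1", "p1", "p2", "p2"],
    "supplier":   ["s1", "s2", "s1", "s2"],
    "customer":   ["c1", "c1", "c2", "c2"],
    "sale_price": [100, 150, 200, 250],
})

dimensions = ["part", "supplier", "customer"]
views = {}
for r in range(len(dimensions) + 1):
    for combo in itertools.combinations(dimensions, r):
        if combo:                                   # e.g. (part,), (part, supplier), (part, supplier, customer)
            views[combo] = sales.groupby(list(combo))["sale_price"].sum()
        else:                                       # the empty grouping: the grand total (apex of the cube)
            views[combo] = sales["sale_price"].sum()

print(views[("part", "supplier", "customer")])      # the "psc" view
print(views[("part",)])                              # the "p" view
```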

What is Star Schema?


A star schema is the elementary form of a
dimensional model, in which data are organized
into facts and dimensions. A fact is an event that is
counted or measured, such as a sale or a login. A
dimension contains reference data about the fact,
such as date, item, or customer.

A star schema is a relational schema whose design represents a
multidimensional data model. It is the simplest data warehouse schema. It is known as a star
schema because the entity-relationship diagram of
this schema resembles a star, with points diverging from a central table. The center of the
schema consists of a large fact table, and the points of the star are the dimension tables.

Fact Tables

A fact table is a table in a star schema that contains facts and is connected to the dimension
tables. A fact table has two types of columns: those that contain facts and those that are
foreign keys to the dimension tables. The primary key of a fact table is generally a composite
key made up of all of its foreign keys.
A fact table may contain either detail-level facts or facts that have been aggregated (fact
tables that contain aggregated facts are often instead called summary tables). A fact table
generally contains facts at the same level of aggregation.

Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that
categorize data. If a dimension does not have hierarchies and levels, it is called a flat
dimension or list. The primary key of each dimension table is part of the
composite primary key of the fact table. Dimensional attributes help to describe the
dimensional values; they are generally descriptive, textual values. Dimension tables are
usually smaller in size than fact tables.
Fact tables store data about sales, while dimension tables store data about the geographic
region (markets, cities), clients, products, times, and channels.
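As an illustration of how a fact table joins to its dimension tables, here is a small sketch in
Python using pandas; the tables, keys, and values are hypothetical and only meant to show a
typical star join followed by aggregation of the measure:

```python
import pandas as pd

sales_fact = pd.DataFrame({          # fact table: foreign keys plus the measure
    "time_key": [1, 1, 2, 2],
    "item_key": [10, 11, 10, 11],
    "branch_key": [100, 100, 101, 101],
    "rs_sold": [500, 300, 700, 200],
})
time_dim = pd.DataFrame({"time_key": [1, 2], "month": ["Jan", "Feb"]})
item_dim = pd.DataFrame({"item_key": [10, 11], "brand": ["A", "B"], "type": ["TV", "Radio"]})

# "Monthly sales per brand": join the fact table with two dimension tables,
# then group by the descriptive dimension attributes and sum the measure.
report = (sales_fact
          .merge(time_dim, on="time_key")
          .merge(item_dim, on="item_key")
          .groupby(["month", "brand"])["rs_sold"].sum())
print(report)
```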
Characteristics of Star Schema
The star schema is highly suitable for data warehouse database design because of the
following features:
o It creates a denormalized database that can quickly provide query responses.
o It provides a flexible design that can be changed easily or added to throughout the
development cycle, and as the database grows.
o Its design parallels how end-users typically think of and use the data.
o It reduces the complexity of metadata for both developers and end-users.
Advantages of Star Schema

The characteristics listed above are also its principal advantages: quick query responses from the
denormalized design, a flexible structure that is easy to extend, and reduced metadata complexity
for developers and end-users.
Disadvantage of Star Schema


There are some situations that a star schema cannot model. For example, the relationship
between a user and a bank account cannot be described as a star schema, because the
relationship between them is many-to-many.

What is Snowflake Schema?


A snowflake schema is a variant of the star schema. "A schema is known as a
snowflake if one or more dimension tables do not connect directly to the fact table but
must join through other dimension tables." The snowflake schema is an expansion of the
star schema in which each point of the star explodes into more points. It is called a
snowflake schema because its diagram resembles a
snowflake. Snowflaking is a method of normalizing the dimension tables in a star
schema. When we normalize all the dimension tables entirely, the resultant structure
resembles a snowflake with the fact table in the middle.

Snowflaking is used to improve the
performance of specific queries. The
schema is diagrammed with each fact
surrounded by its associated dimensions,
and those dimensions are related to other
dimensions, branching out into a
snowflake pattern. The snowflake schema
consists of one fact table which is linked
to many dimension tables, which can in turn be
linked to other dimension tables through
a many-to-one relationship. Tables in a
snowflake schema are generally
normalized to third normal form. Each
dimension table represents exactly one level in a hierarchy. The following diagram shows a
snowflake schema with two dimensions, each having three levels. A snowflake schema
can have any number of dimensions, and each dimension can have any number of levels.

Example: Figure shows a snowflake schema with a Sales fact table, with Store, Location,
Time, Product, Line, and Family dimension tables. The Market dimension has two
dimension tables with Store as the primary dimension table, and Location as the
outrigger dimension table. The product dimension has three dimension tables with
Product as the primary dimension table, and the Line and Family table are the outrigger
dimension tables.

A snowflake schema is designed for flexible querying across more complex dimensions
and relationships. It is suitable for many-to-many and one-to-many relationships between
dimension levels.

Advantage of Snowflake Schema

1. The primary advantage of the snowflake schema is the improvement in query
performance due to minimized disk storage requirements and joining smaller lookup
tables.
2. It provides greater scalability in the interrelationship between dimension levels and
components.
3. No redundancy, so it is easier to maintain.

Disadvantage of Snowflake Schema

1. The primary disadvantage of the snowflake schema is the additional maintenance effort
required due to the increased number of lookup tables.
2. Queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and therefore longer query execution time.

What is Fact Constellation Schema?


A fact constellation consists of two or more fact tables sharing one or more dimensions. It is
also called a galaxy schema.

The fact constellation schema describes a logical structure of a data warehouse or data mart.
It can be designed with a collection of de-normalized fact tables and shared, conformed
dimension tables.

The fact constellation schema is a sophisticated database design in which it is difficult to
summarize information. It can be implemented by aggregating fact tables or by decomposing
a complex fact table into independent simple fact tables.

Example: A fact constellation schema is shown in the figure below.


This schema defines two fact tables, sales and shipping. Sales is considered along four
dimensions, namely time, item, branch, and location. The schema contains a fact table
for sales that includes keys to each of the four dimensions, along with two measures:
Rupee_sold and units_sold. The shipping table has five dimensions, or keys: item_key,
time_key, shipper_key, from_location, and to_location, and two measures: Rupee_cost
and units_shipped.

The primary disadvantage of the fact constellation schema is that it is a more
challenging design, because many variants for specific kinds of aggregation must be
considered and selected.

Data Warehouse Process Architecture


The process architecture defines an architecture in which the data from the data
warehouse is processed for a particular computation.

Following are the two fundamental process architectures:

Centralized Process Architecture

In this architecture, the data is collected into a single centralized store and processed,
upon completion, by a single machine with large memory, processing, and storage
capacity.

Centralized process architecture evolved with transaction processing and is well suited for
small organizations with one location of service.

It requires minimal resources both from people and system perspectives.


It is very successful when the collection and consumption of data occur at the same
location.
Distributed Process Architecture
In this architecture, information and its processing are allocated across data centers;
processing of the data is localized, and the results are grouped into centralized storage.
Distributed architectures are used to overcome the limitations of the centralized process
architecture, where all the information needs to be collected at one central location and
the results are available in one central location.
There are several architectures of the distributed process:
Client-Server
In this architecture, the user does all the information collecting and presentation, while the
server does the processing and management of data.
Three-tier Architecture
With a plain client-server architecture, the client machines need to be connected directly to a
server machine, mandating finite states and introducing latencies and overhead in terms of the
records carried between clients and servers. A three-tier architecture introduces a middle tier
between the clients and the data server to reduce this coupling.
N-tier Architecture
The n-tier or multi-tier architecture is where clients, middleware, applications, and servers
are isolated into tiers.
Cluster Architecture
In this architecture, machines connected in a network (by software or
hardware) work together to process information or compute
requirements in parallel. Each device in the cluster is associated with a function that is
processed locally, and the result sets are collected by a master server that returns them to the
user.
Peer-to-Peer Architecture
This is a type of architecture where there are no dedicated servers and clients. Instead, all
the processing responsibilities are allocated among all machines, called peers. Each
machine can perform the function of a client or server or just process data.
Difference between OLTP and OLAP
OLTP (On-Line Transaction Processing) is characterized by a large number of short on-line
transactions (INSERT, UPDATE, and DELETE). The primary emphasis of OLTP operations
is very rapid query processing, maintaining data integrity in multi-access
environments, and effectiveness measured by the number of transactions per second. An
OLTP database contains accurate, current records, and the schema used to store the
transactional database is the entity model (usually 3NF).

OLAP (On-Line Analytical Processing) is characterized by a relatively low volume of
transactions. Queries are often very complex and involve aggregations. For OLAP operations,
response time is an effectiveness measure. OLAP applications are widely used by data
mining techniques. An OLAP database contains aggregated, historical information, stored in
multidimensional schemas (generally a star schema).

Following are the difference between OLAP and OLTP system.


1) Users: OLTP systems are designed for office workers, while OLAP systems are
designed for decision-makers. Therefore, while an OLTP system may be accessed by
hundreds or even thousands of clients in a huge enterprise, an OLAP system is likely to
be accessed only by a select class of managers and may be used only by dozens of users.
2) Functions: OLTP systems are mission-critical. They support the day-to-day operations of
an enterprise and are largely performance- and availability-driven. They carry
out simple, repetitive operations. OLAP systems are management-critical; they support the
decision-making tasks of an enterprise using detailed investigation.
3) Nature: Although SQL queries return a set of data, OLTP systems are designed to process
one record at a time, for example, a record related to a customer who may be on the phone
or in the store. OLAP systems are not designed to deal with individual customer records.
Instead, they involve queries that deal with many records at a time and provide summary or
aggregate information to a manager. OLAP applications involve data stored in a data
warehouse that has been extracted from many tables and possibly from more than one
enterprise database.
4) Design: OLTP database operations are designed to be application-oriented,
while OLAP operations are designed to be subject-oriented. OLTP systems view the
enterprise data as a collection of tables (possibly based on an entity-relationship
model), whereas OLAP operations view enterprise information as multidimensional.
5) Data: OLTP systems usually deal only with the current status of data. For example, a
record about an employee who left three years ago may not be available in the Human
Resources system; the old data may have been archived to some type of stable storage
media and may not be accessible online. On the other hand, OLAP systems need
historical data over several years, since trends are often essential in decision making.
6) Kind of use: OLTP systems are used for read and write operations, while OLAP
systems normally do not update the data.
7) View: An OLTP system focuses primarily on the current data within an enterprise or
department, without referring to historical data or data in other organizations. In
contrast, an OLAP system spans multiple versions of a database schema, due to the
evolutionary process of an organization. OLAP systems also deal with information that
originates from different organizations, integrating information from many data stores.
Because of their huge volume, these data are stored on multiple storage media.
8) Access Patterns: The access pattern of an OLTP system consists primarily of short,
atomic transactions. Such a system requires concurrency control and recovery techniques.
However, access to OLAP systems consists mostly of read-only operations, because these data
warehouses store historical information.
The biggest difference between an OLTP and an OLAP system is the amount of data
analyzed in a single transaction. Whereas an OLTP system handles many concurrent
users and queries touching only a single record or a limited collection of
records at a time, an OLAP system must be able to operate on
millions of records to answer a single query.
Difference between ROLAP and MOLAP

o ROLAP stands for Relational Online Analytical Processing; MOLAP stands for Multidimensional
Online Analytical Processing.
o ROLAP is usually used when the data warehouse contains relational data; MOLAP is used when the
data warehouse contains relational as well as non-relational data.
o ROLAP uses an analytical server; MOLAP uses an MDDB (multidimensional database) server.
o ROLAP creates a multidimensional view of the data dynamically; MOLAP contains prefabricated
(precomputed) data cubes.
o ROLAP is easy to implement; MOLAP is difficult to implement.
o ROLAP has a higher response time; MOLAP has a lower response time due to the prefabricated
cubes.
o ROLAP requires less memory; MOLAP requires a large amount of memory.


Types of OLAP
There are three main types of OLAP servers, as follows:

ROLAP stands for Relational OLAP, an application based on relational DBMSs.

MOLAP stands for Multidimensional OLAP, an application based on multidimensional DBMSs.

HOLAP stands for Hybrid OLAP, an application using both relational and multidimensional
techniques.

Relational OLAP (ROLAP) Server

These are intermediate servers which stand between a relational back-end server and
end-user front-end tools. They use a relational or extended-relational DBMS to store and manage
warehouse data, and OLAP middleware to provide the missing pieces. ROLAP servers include
optimization for each DBMS back end, implementation of aggregation navigation logic, and
additional tools and services. ROLAP technology tends to have higher scalability than
MOLAP technology. ROLAP systems work primarily from the data that resides in a relational
database, where the base data and dimension tables are stored as relational tables. This
model permits multidimensional analysis of the data. The technique relies on manipulating
the data stored in the relational database to give the appearance of traditional OLAP's slicing
and dicing functionality. In essence, each action of slicing and dicing is equivalent to
adding a "WHERE" clause to the SQL statement.

Relational OLAP Architecture


ROLAP Architecture includes the following components

o Database server.
o ROLAP server.
o Front-end tool.

Relational OLAP (ROLAP) is the latest and
fastest-growing OLAP technology segment in
the market. This method allows multiple
multidimensional views of two-dimensional
relational tables to be created, avoiding the need to
structure records around the desired view.

Some products in this segment have
supported sophisticated SQL engines to handle the
complexity of multidimensional analysis. This includes creating multiple SQL statements to
handle user requests, being 'RDBMS'-aware, and being capable of generating SQL
statements tuned to the optimizer of the DBMS engine.
Advantages
Can handle large amounts of information: The data size limitation of ROLAP
technology depends on the data size of the underlying RDBMS, so ROLAP itself does not
restrict the amount of data.

Can leverage functionalities inherent in the relational database: The RDBMS already
comes with a lot of features, so ROLAP technologies (which work on top of the RDBMS) can
leverage these functionalities.

Disadvantages
Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple
SQL queries) against the relational database, query time can be long if the underlying data
size is large.

Limited by SQL functionalities: ROLAP technology relies on generating SQL
statements to query the relational database, and SQL statements do not suit all needs.

Multidimensional OLAP (MOLAP) Server

A MOLAP system is based on a native logical model that directly supports multidimensional
data and operations. Data are stored physically in multidimensional arrays, and positional
techniques are used to access them.

One of the significant distinctions of MOLAP from ROLAP is that the data are summarized
and stored in an optimized format in a multidimensional cube, instead of in a relational
database. In the MOLAP model, data are structured into proprietary formats according to the
client's reporting requirements, with the calculations pre-generated on the cubes.
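A minimal sketch of this storage idea, assuming NumPy and a made-up set of dimension members,
is shown below; measures live in a dense multidimensional array, and cells are located by
position rather than by relational joins:

```python
import numpy as np

times = ["Q1", "Q2"]
items = ["TV", "Radio", "Phone"]
locations = ["Delhi", "Mumbai"]

# One cell per (time, item, location) combination holds the total sales measure.
cube = np.zeros((len(times), len(items), len(locations)))
cube[times.index("Q1"), items.index("TV"), locations.index("Delhi")] = 500.0

# Positional access and aggregation are simple array operations:
q1_tv_delhi = cube[0, 0, 0]            # single-cell lookup
sales_by_item = cube.sum(axis=(0, 2))  # roll up over time and location
print(q1_tv_delhi, dict(zip(items, sales_by_item)))
```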

MOLAP Architecture
MOLAP Architecture includes the following components

o Database server.
o MOLAP server.
o Front-end tool.

o The MOLAP structure primarily reads precompiled data. It has limited
capabilities to dynamically create aggregations or to evaluate results that have not been
pre-calculated and stored.

o Applications requiring iterative and comprehensive time-series analysis of trends are well
suited for MOLAP technology (e.g., financial analysis and budgeting).

o Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's
Lightship Server, Sniper's TM/1, Planning Science's Gentium, and Kenan Technology's
Multiway.
o Some of the problems faced by clients relate to maintaining support for multiple
subject areas in an RDBMS. Some vendors address these problems by providing access
from MOLAP tools to detailed data in an RDBMS.

o This can be very useful for organizations with performance-sensitive multidimensional


analysis requirements and that have built or are in the process of building a data
warehouse architecture that contains multiple subject areas.

o An example would be the creation of sales data measured by several dimensions (e.g.,
product and sales region) to be stored and maintained in a persistent structure. This
structure would be provided to reduce the application overhead of performing calculations
and building aggregation during initialization. These structures can be automatically
refreshed at predetermined intervals established by an administrator.

Advantages
o Excellent performance: A MOLAP cube is built for fast information retrieval and is
optimal for slicing and dicing operations.

o Can perform complex calculations: All calculations have been pre-generated when the
cube is created. Hence, complex calculations are not only possible, but they return quickly.

Disadvantages
o Limited in the amount of information it can handle: Because all calculations are
performed when the cube is built, it is not possible to include a large amount of data in the
cube itself.

o Requires additional investment: Cube technology is generally proprietary and does not
already exist in the organization. Therefore, to adopt MOLAP technology, additional
investments in human and capital resources are likely to be needed.

Hybrid OLAP (HOLAP) Server

HOLAP incorporates the best features of MOLAP and ROLAP into a single architecture.
HOLAP systems store larger quantities of detailed data in relational tables,
while the aggregations are stored in pre-calculated cubes. HOLAP can also drill through
from the cube down to the relational tables for detailed data. Microsoft SQL
Server 2000 provides a hybrid OLAP server.

Advantages of HOLAP
1. HOLAP provides the benefits of both MOLAP
and ROLAP.
2. It provides fast access at all levels of
aggregation.
3. HOLAP balances the disk space
requirement, as it only stores the
aggregate information on the OLAP
server while the detailed records remain in
the relational database, so no duplicate
copy of the detailed records is maintained.

Disadvantages of HOLAP
1. HOLAP architecture is very complicated because it supports both MOLAP and ROLAP
servers.

Other Types

There are also less common OLAP styles that one may encounter from time to time. Some of
them are listed below.

Web-Enabled OLAP (WOLAP) Server


WOLAP refers to OLAP applications that are accessible via a web browser. Unlike
traditional client/server OLAP applications, WOLAP has a three-tiered
architecture consisting of three components: a client, middleware, and a database
server.

Desktop OLAP (DOLAP) Server


DOLAP permits a user to download a section of the data from the database or source, and
work with that dataset locally, or on their desktop.

Mobile OLAP (MOLAP) Server


Mobile OLAP enables users to access and work on OLAP data and applications remotely
through the use of their mobile devices.

Spatial OLAP (SOLAP) Server


SOLAP includes the capabilities of both Geographic Information Systems (GIS) and OLAP
into a single user interface. It facilitates the management of both spatial and non-spatial
data.

Three-Tier Data Warehouse Architecture


Data Warehouses usually have a three-level (tier) architecture that includes:

1. Bottom Tier (Data Warehouse Server)


2. Middle Tier (OLAP Server)
3. Top Tier (Front end Tools).

A bottom-tier that consists of the Data Warehouse server, which is almost always an
RDBMS. It may include several specialized data marts and a metadata repository.

Data from operational databases and external sources (such as user profile data provided
by external consultants) are extracted using application program interfaces called
gateways. A gateway is provided by the underlying DBMS and allows client programs to
generate SQL code to be executed at a server.

Examples of gateways include ODBC (Open Database Connectivity) and OLE-DB (Object
Linking and Embedding, Database) by Microsoft, and JDBC (Java Database
Connectivity).

A middle-tier which consists of an OLAP


server for fast querying of the data
warehouse.

The OLAP server is implemented using


either

(1) A Relational OLAP (ROLAP) model,


i.e., an extended relational DBMS that maps
functions on multidimensional data to
standard relational operations.

(2) A Multidimensional OLAP (MOLAP)
model, i.e., a special-purpose server
that directly implements multidimensional
data and operations.

A top-tier that contains front-end


tools for displaying results provided by
OLAP, as well as additional tools for data
mining of the OLAP-generated data.

The overall Data Warehouse Architecture is shown in the figure.
The metadata repository stores information that defines DW objects. It includes the
following parameters and information for the middle and the top-tier applications:

1. A description of the DW structure, including the warehouse schema, dimensions, hierarchies,
data mart locations and contents, etc.
2. Operational metadata, which usually describes the currency level of the stored data (i.e.,
active, archived, or purged) and warehouse monitoring information (i.e., usage statistics,
error reports, audit trails, etc.).
3. System performance data, which includes indices used to improve data access and
retrieval performance.
4. Information about the mapping from operational databases, which includes the
source RDBMSs and their contents, cleaning and transformation rules, etc.
5. Summarization algorithms, predefined queries and reports, and business data, which include
business terms and definitions, ownership information, etc.

Distributed Data Warehouses


The concept of a distributed data warehouse suggests two types of distributed data
warehouses: local enterprise warehouses, which are distributed throughout the
enterprise, and a global warehouse, as shown in the figure.

Characteristics of Local data warehouses


o Activity occurs at the local level
o The bulk of the operational processing is done locally
o The local site is autonomous
o Each local data warehouse has its own unique architecture and data content
o The data is unique and of prime importance to that locality only
o The majority of the data is local and not replicated
o Any intersection of data between local data warehouses is circumstantial
o Local warehouses serve different technical communities
o The scope of a local data warehouse is limited to the local site
o Local warehouses also include historical data and are integrated only within the local site.
Virtual Data Warehouses
A Virtual Data Warehouse is created in the following stages:
1. Installing a set of data access, data dictionary, and process management facilities.
2. Training end-users.
3. Monitoring how the DW facilities will be used.
4. Based upon actual usage, physically creating a data warehouse to provide the high-
frequency results.
This strategy means that end users are allowed to access operational databases directly,
using whatever tools are enabled for the data access network. This approach provides
ultimate flexibility as well as the minimum amount of redundant data that must be
loaded and maintained. A data warehouse is a great idea, but it is difficult to build and
requires investment. Why not use a cheap and fast approach by eliminating the
transformation phase and the physical repositories for metadata and data? This approach is
termed the 'virtual data warehouse.'
To accomplish this, there is a need to define four kinds of information:
1. A data dictionary containing the definitions of the various databases.
2. A description of the relationships among the data elements.
3. A description of the way users will interface with the system.
4. The algorithms and business rules that describe what to do and how to do it.
Disadvantages
1. Since queries compete with production transactions, performance can be
degraded.
2. There is no metadata, no summary data, and no individual DSS (Decision Support
System) integration or history. All queries must be repeated, causing an additional burden
on the system.
3. There is no refreshing process, which causes the queries to be very complex.
Warehouse Manager
The warehouse manager is responsible for the warehouse management process. It
consists of third-party system software, C programs, and shell scripts. The size and
complexity of a warehouse manager varies between specific solutions.
Warehouse Manager Architecture
A warehouse manager includes the following −
 The controlling process
 Stored procedures or C with SQL
 Backup/Recovery tool
 SQL scripts

Functions of Warehouse Manager


A warehouse manager performs the following
functions −
 Analyzes the data to perform consistency and referential integrity checks.
 Creates indexes, business views, and partition views against the base data.
 Generates new aggregations and updates the existing aggregations.
 Generates normalizations.
 Transforms and merges the source data into the published data warehouse.
 Backs up the data in the data warehouse.
 Archives the data that has reached the end of its captured life.
Note − A warehouse manager analyzes query
profiles to determine whether the indexes and
aggregations are appropriate.
SECTION - B
Data Mining Query Language
Data Mining is a process in which useful data are extracted and processed from a heap of
unprocessed raw data. By aggregating these datasets into a summarized format, many
problems arising in finance, marketing, and many other fields can be solved. In the modern
world with enormous data, Data Mining is one of the growing fields of technology that acts
as an application in many industries we depend on in our life. Many developments and
research efforts have taken place in this field, and many systems have also been produced. Since
there are numerous processes and functions to be performed in Data Mining, a well-
developed user interface is needed. Even though there are many well-developed user
interfaces for relational systems, Han, Fu, Wang, et al. proposed the Data Mining Query
Language (DMQL) to further build more developed systems and to enable many kinds of
research in this field. Though DMQL cannot be considered a standard language, it is a derived
language that stands as a general query language for performing data mining techniques. DMQL
is implemented in the DBMiner system for collecting data from several layers of databases.
Ideas in designing DMQL:
DMQL is designed based on the Structured Query Language (SQL), which in turn is a relational
query language.
 Data Mining request: For a given data mining task, the corresponding datasets must be defined
in the form of a data mining request. Let us see this with an example. Since the user can request
any specific part of a dataset in the database, the data miner can use a database query to retrieve
the suitable datasets before the process of data mining. If the aggregation of that specific data is
not possible, the data miner collects supersets from which the required data can be derived. This
shows the need for a query language in data mining, acting as a subtask of the mining process.
Since the extraction of relevant data from huge datasets cannot be performed manually, many
development methods are present in data mining; even so, the task of collecting the relevant data
requested by the user may sometimes fail. Using DMQL, a command can retrieve specific datasets
or data from the database, giving the desired result to the user and a better experience in
fulfilling the user's expectations.
 Background Knowledge: Prior knowledge of datasets and their relationships in a database helps in
mining the data. Knowing the relationships, or any other useful information, can ease the process of
extraction and aggregation. For instance, a concept hierarchy over the datasets can increase the
efficiency and accuracy of the process by making it easier to collect the desired data. By knowing
the hierarchy, the data can be generalized with ease.
 Generalization: When the data in the datasets of a data warehouse is not generalized, it is often in
the form of unprocessed primitive integrity constraints and loosely associated multi-valued
datasets and their dependencies. Using the generalization concept through the query language can
help process the raw data into a precise abstraction. It also supports multi-level collection
of data with quality aggregation. When larger databases come into the scene, generalization
plays a major role in giving desirable results at a conceptual level of data collection.
 Flexibility and Interaction: To avoid the collection of less desirable or unwanted data from
databases, appropriate threshold values must be specified for flexible data mining and to provide
the interaction that makes the mining process engaging for the user. Such threshold
values can be provided with data mining queries.
The four parameters of data mining:
 The first parameter is the set of task-relevant data to be fetched from the database in the form of a
relational query. By specifying this primitive, the relevant data are retrieved.
 The second parameter is the kind of knowledge to be mined. This primitive includes
generalization, association, classification, characterization, and discrimination rules.
 The third parameter is the hierarchy of datasets, the generalization relation, or other background
knowledge, as described earlier in the design of DMQL.
 The final parameter is the interestingness of the discovered patterns, which can be represented by
specific threshold values that in turn depend on the type of rules used in data mining.
Kinds of thresholds in rule mining:
In the process of data mining, maintaining a set of threshold values is very important for extracting
useful and interesting datasets from a heap of data. Threshold values also help in measuring the
relevance of the data and in driving the search toward interesting datasets.
The types of thresholds in rule mining can be categorized into three classes.
 Significance Threshold: To be presented in the data mining process, a dataset must be verified to
have at least some rationally significant evidence of a pattern within itself. In association rule
mining, this is called the minimum support threshold, and the patterns that meet it are called
frequent itemsets. For characteristic rules, it is called the noise threshold, and patterns that cannot
cross it are regarded as noise.
 Rule Redundancy Threshold: This threshold prevents redundancy in the rules that are going to be
presented; that is, the rules provided should not be the same as existing ones.
 Rule Confidence Threshold: For a rule (X -> Y), the conditional probability of Y given X must
exceed this rule confidence threshold for the rule to be reported.
Transaction ID   Rice   Pulse   Oil   Milk   Apple
t1                 1      1      1     0      0
t2                 0      1      1     1      0
t3                 0      0      0     1      1
t4                 1      1      0     1      0
t5                 1      1      1     0      1
t6                 1      1      1     1      1
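Using the transaction table above, the following Python sketch shows how the minimum support
and rule confidence thresholds would be applied to a candidate rule such as {Rice, Pulse} -> {Oil};
the threshold values chosen here are only examples:

```python
# Each transaction is represented as the set of items it contains (taken from the table above).
transactions = {
    "t1": {"Rice", "Pulse", "Oil"},
    "t2": {"Pulse", "Oil", "Milk"},
    "t3": {"Milk", "Apple"},
    "t4": {"Rice", "Pulse", "Milk"},
    "t5": {"Rice", "Pulse", "Oil", "Apple"},
    "t6": {"Rice", "Pulse", "Oil", "Milk", "Apple"},
}

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions.values()) / len(transactions)

antecedent, consequent = {"Rice", "Pulse"}, {"Oil"}
rule_support = support(antecedent | consequent)     # 3/6 = 0.50
confidence = rule_support / support(antecedent)     # 0.50 / 0.667 = 0.75

min_support, min_confidence = 0.4, 0.7               # example thresholds
print("keep rule:", rule_support >= min_support and confidence >= min_confidence)
```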
SECTION – C
Data Mining – Cluster Analysis

Cluster Analysis is the process of finding similar groups of objects in order to form clusters. It is an
unsupervised machine-learning-based approach that acts on unlabelled data. A group of data points
comes together to form a cluster, in which all the objects belong to the same group.
Cluster:
The given data is divided into different groups by combining similar objects into a group. This group
is nothing but a cluster: a collection of similar data grouped together.
For example, consider a dataset of vehicles containing information about different
vehicles like cars, buses, bicycles, etc. As this is unsupervised learning, there are no class labels like
Cars, Bikes, etc. for the vehicles; all the data is combined and is not structured.
Now our task is to convert the unlabelled data into labelled data, and this can be done using clusters.
The main idea of cluster analysis is to arrange all the data points by forming clusters, such as a
cars cluster containing all the cars and a bikes cluster containing all the bikes.
Simply put, it is the partitioning of similar objects, applied to unlabelled data.
Properties of Clustering:
1. Clustering Scalability: Nowadays there is a vast amount of data, so clustering must deal with
huge databases. In order to handle extensive databases, the clustering algorithm should be
scalable; if it is not, we cannot get appropriate results, which would lead to wrong conclusions.
2. High Dimensionality: The algorithm should be able to handle high-dimensional data, even when
the number of objects is small.
3. Algorithm Usability with multiple data kinds: Different kinds of data can be used with
clustering algorithms. An algorithm should be capable of dealing with different types of data, like
discrete, categorical and interval-based data, binary data, etc.
4. Dealing with unstructured data: Some databases contain missing values and noisy or erroneous
data. If an algorithm is sensitive to such data, it may lead to poor-quality clusters. So it should be
able to handle unstructured data and give some structure to the data by organising it into groups
of similar data objects. This makes the job of the data expert easier when processing the data and
discovering new patterns.
5. Interpretability: The clustering outcomes should be interpretable, comprehensible, and
usable. Interpretability reflects how easily the data is understood.
Clustering Methods:
The clustering methods can be classified into the following categories:
 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method
Partitioning Method: It is used to make partitions of the data in order to form clusters. If "n"
partitions are made on "p" objects of the database, then each partition is represented by a cluster,
and n < p. The two conditions which need to be satisfied by this partitioning clustering method
are:
 Each object should belong to exactly one group.
 There should be no group without even a single object.
In the partitioning method there is a technique called iterative relocation, which means an object
will be moved from one group to another to improve the partitioning.
Hierarchical Method: In this method, a hierarchical decomposition of the given set of data objects
is created. Hierarchical methods can be classified on the basis of how the hierarchical
decomposition is formed. There are two types of approaches for creating the hierarchical
decomposition:
 Agglomerative Approach: The agglomerative approach is also known as the bottom-up approach.
Initially, each object forms a separate group. Thereafter, it keeps on merging the objects or groups
that are close to one another, which means that they exhibit similar properties. This merging
process continues until the termination condition holds.
 Divisive Approach: The divisive approach is also known as the top-down approach. In this
approach, we start with all the data objects in the same cluster. The group is divided into smaller
clusters by continuous iteration. The iteration continues until the termination condition is met or
until each cluster contains one object.
Once a group is split or merged, it can never be undone, as this is a rigid method and is not
flexible. The two approaches which can be used to improve hierarchical clustering quality in
data mining are:
 One should carefully analyze the linkages between objects at every partitioning of the hierarchical
clustering.
 One can integrate hierarchical agglomeration with other clustering approaches: first, the objects
are grouped into micro-clusters, and then macro-clustering is performed on the micro-clusters.
Density-Based Method: The density-based method mainly focuses on density. In this method, a
given cluster keeps growing as long as the density in the neighbourhood exceeds some threshold,
i.e., for each data point within a given cluster, the neighbourhood of a given radius has to contain
at least a minimum number of points (a short code sketch follows this list of methods).
Grid-Based Method: In the grid-based method, the object space is quantized into a finite number
of cells that form a grid structure. One of the major advantages of the grid-based method is its
fast processing time, which depends only on the number of cells in each dimension of the
quantized space.
Model-Based Method: In the model-based method, a model is hypothesized for each of the
clusters in order to find the data that best fits the model. A density function is used to
locate the clusters for a given model. It reflects the spatial distribution of the data points and also
provides a way to automatically determine the number of clusters based on standard statistics,
taking outliers or noise into account. It therefore yields robust clustering methods.
Constraint-Based Method: The constraint-based clustering method is performed by the
incorporation of application or user-oriented constraints. A constraint refers to the user expectation
or the properties of the desired clustering results. Constraints provide us with an interactive way of
communication with the clustering process. The user or the application requirement can specify
constraints.
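As a brief illustration of the density-based method described in the list above, here is a sketch
that assumes scikit-learn is available; the points and parameter values are made up:

```python
# A cluster keeps growing while each point's eps-neighbourhood contains at least
# min_samples points; sparse points are labelled -1 (noise).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],      # dense group A
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],      # dense group B
              [4.0, 15.0]])                            # isolated point -> noise

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # e.g. [0 0 0 1 1 1 -1]
```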
Applications Of Cluster Analysis:
 It is widely used in image processing, data analysis, and pattern recognition.
 It helps marketers to find the distinct groups in their customer base and they can characterize
their customer groups by using purchasing patterns.
 It can be used in the field of biology, by deriving animal and plant taxonomies and identifying
genes with the same capabilities.
 It also helps in information discovery by classifying documents on the web.
Partitioning Method (K-Means) in Data Mining

Figure: K-means clustering flowchart (not reproduced in this text).
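In place of the flowchart, the following is a minimal NumPy sketch of the same K-means loop:
choose k initial centroids, assign every point to its nearest centroid, recompute the centroids as
cluster means, and repeat until the assignments stabilize. The data points are made up for
illustration:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initial centroids
    for _ in range(max_iter):
        # assignment step: nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # assignments stabilized
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
labels, centroids = k_means(X, k=2)
print(labels)
print(centroids)
```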


Hierarchical Clustering in Data Mining
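A short sketch of the agglomerative (bottom-up) approach described earlier, assuming SciPy is
available: every object starts as its own cluster, the closest clusters are merged step by step, and
cutting the resulting tree yields the final groups (the data is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 8.0], [5.1, 8.2], [9.0, 1.0]])

Z = linkage(X, method="ward")                      # the merge history (dendrogram data)
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
print(labels)
```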
Decision Tree
Decision tree mining is a type of data mining technique that is used to build classification
models. It builds classification models in the form of a tree-like structure, as its name
suggests. This type of mining belongs to supervised learning.
In supervised learning, the target result is already known. Decision trees can be used for
both categorical and numerical data. Categorical data represent gender, marital
status, etc., while numerical data represent age, temperature, etc.
What Is The Use Of A Decision Tree?
A decision tree is used to build classification and regression models. It is used to create
data models that predict class labels or values for the decision-making process. The
models are built from the training dataset fed to the system (supervised learning).
Using a decision tree, we can visualize the decisions, which makes them easy to understand;
thus it is a popular data mining technique.
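A compact sketch of building such a classification model, assuming scikit-learn is available; the
training records, feature names, and class labels are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# features: [age, is_married (0/1)]; target: buys_product (0/1)
X = [[22, 0], [25, 1], [47, 1], [52, 0], [46, 1], [56, 1], [30, 0], [28, 1]]
y = [0, 0, 1, 1, 1, 1, 0, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "is_married"]))   # the learned tree structure
print(tree.predict([[40, 1]]))                                  # class label for a new record
```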

What is a Genetic Algorithm?


Genetic algorithms (GA) are adaptive search algorithms: adaptive in terms of the
number and types of parameters you provide. The algorithm selects the best,
near-optimal solution from among several candidate solutions, and its
design is based on natural genetics.
A genetic algorithm emulates the principle of natural evolution, i.e., survival of the
fittest. Natural evolution propagates the genetic material of the fittest individuals from
one generation to the next.
The genetic algorithm applies the same technique in data mining: it iteratively
performs the selection, crossover, mutation, and encoding processes to evolve
successive generations of models.
The components of genetic algorithms consist of:

 Population incorporating individuals.


 Encoding or decoding mechanism of individuals.
 The objective function and an associated fitness evaluation criterion.
 Selection procedure.
 Genetic operators like recombination or crossover, mutation.
 Probabilities to perform genetic operations.
 Replacement technique.
 Termination condition.
At every iteration, the algorithm delivers a model that
inherits its traits from the previous model and
competes with the other models until the most
predictive model survives.
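The loop just described can be sketched in a few lines of Python; the bit-string encoding and the
simple count-the-ones fitness function below are toy choices standing in for whatever encoding and
fitness criterion a real mining task would use:

```python
import random

random.seed(1)
GENES, POP_SIZE, GENERATIONS, MUTATION_RATE = 12, 20, 40, 0.05

def fitness(individual):
    return sum(individual)                     # illustrative objective to maximize

def select(population):                        # tournament selection of one parent
    return max(random.sample(population, 3), key=fitness)

def crossover(a, b):                           # single-point crossover
    point = random.randrange(1, GENES)
    return a[:point] + b[point:]

def mutate(individual):                        # flip bits with a small probability
    return [1 - g if random.random() < MUTATION_RATE else g for g in individual]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):                   # selection -> crossover -> mutation, each generation
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print(best, fitness(best))
```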
Genetic Algorithms in Data Mining
So far, we have studied that the genetic algorithm is a
classification method that is adaptive, robust, and usable
as a global search method in situations where the search
space is large. The algorithm optimizes a fitness function based
on the criteria preferred by the data mining task so as to
obtain an optimal solution, for example:
 Knowledge discovery systems
 The MASSON system
However, data mining applications based on the genetic algorithm are not as rich as
applications based on fuzzy sets. In the section ahead, we have categorized some
systems based on the genetic algorithm used in data mining.
Regression
Data mining identifies human-interpretable patterns; it includes prediction, which
determines a future value from the available variables or attributes in the database. The
basic assumption of the linear multi-regression model is that there is no interaction
among the attributes.
GA handles the interaction among attributes in a far better way. The non-linear
multi-regression model uses GA to derive the model from the training data set.
Association Rules
A multi-objective GA deals with problems that have multiple objective functions and constraints, determining an optimal set of solutions: no other solution in the search space dominates any member of this set.
Such algorithms are used for rule mining over large search spaces with many attributes and records. To obtain optimal solutions, the multi-objective GA performs a global search with multiple objectives, such as a combination of predictive accuracy, comprehensibility, and interestingness.

Advantages and Disadvantages


Advantages

 Easy to understand, as it is based on the concept of natural evolution.
 Selects an optimal solution from a set of candidate solutions.
 GA uses payoff information instead of derivatives to reach an optimal solution.
 GA supports multi-objective optimization.
 GA is an adaptive search algorithm.
 GA also operates well in noisy environments.
Disadvantages

 An improper implementation may lead to a solution that is not optimal.
 Evaluating the fitness function iteratively may lead to computational challenges.
 GA is time-consuming as it involves a great deal of computation.
Applications of GA
GA is used in many applications; let's discuss a few of them.
 Economics: In economics, GA is used to implement models for competitive analysis, decision making, and effective scheduling.
 Aircraft design: GA is used to identify the parameters that must be modified and upgraded in order to get a better design.
 DNA analysis: GA is used to establish DNA structure using spectrometric information.
 Transport: GA is used to develop transport plans that are time- and cost-efficient.
 Data mining: GA classifies large sets of data to determine the optimal solution to the problem concerned.

Rough Set
The notion of rough sets was introduced by Z. Pawlak in his seminal paper of 1982 (Pawlak 1982). It is a formal theory derived from fundamental research on the logical properties of information systems. Rough set theory has become a methodology for database mining, or knowledge discovery, in relational databases. In its abstract form, it is a new area of uncertainty mathematics closely related to fuzzy theory. We can use the rough set approach to discover structural relationships within imprecise and noisy data. Rough sets and fuzzy sets are complementary generalizations of classical sets: the approximation spaces of rough set theory are sets with multiple memberships, while fuzzy sets are concerned with partial memberships. The rapid development of these two approaches provides a basis for "soft computing," initiated by Lotfi A. Zadeh. Soft computing includes, along with rough sets, at least fuzzy logic, neural networks, probabilistic reasoning, belief networks, machine learning, evolutionary computing, and chaos theory.
Basic problems in data analysis solved by Rough Set:
 Characterization of a set of objects in terms of attribute values.
 Finding dependency between the attributes.
 Reduction of superfluous attributes.
 Finding the most significant attributes.
 Decision rule generation.
Goals of Rough Set Theory –
 The main goal of rough set analysis is the induction of (learning of) approximations of concepts. Rough sets constitute a sound basis for KDD, offering mathematical tools to discover patterns hidden in data.
 It can be used for feature selection, feature extraction, data reduction, decision rule generation, and pattern extraction (templates, association rules), etc.
 It identifies partial or total dependencies in data, eliminates redundant data, and provides an approach to null values, missing data, dynamic data, and more.
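As a small illustration of approximating a concept, the sketch below computes the lower and upper approximations of a target set from the indiscernibility (equivalence) classes induced by chosen attributes; the information table and the target set are hypothetical.

from collections import defaultdict

# Hypothetical information table: object id -> attribute values.
table = {
    1: {"headache": "yes", "temp": "high"},
    2: {"headache": "yes", "temp": "high"},
    3: {"headache": "no",  "temp": "normal"},
    4: {"headache": "no",  "temp": "high"},
}
X = {1, 4}  # target concept, e.g. the objects diagnosed with flu

def equivalence_classes(objects, attributes):
    # Group objects that are indiscernible on the chosen attributes.
    classes = defaultdict(set)
    for obj, values in objects.items():
        classes[tuple(values[a] for a in attributes)].add(obj)
    return classes.values()

def approximations(objects, attributes, target):
    lower, upper = set(), set()
    for cls in equivalence_classes(objects, attributes):
        if cls <= target:       # wholly inside the target: certainly in the concept
            lower |= cls
        if cls & target:        # overlaps the target: possibly in the concept
            upper |= cls
    return lower, upper

lower, upper = approximations(table, ["headache", "temp"], X)
print("lower approximation:", lower)      # {4}
print("upper approximation:", upper)      # {1, 2, 4}
print("boundary region:", upper - lower)  # {1, 2}: the rough, uncertain part of the concept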

Support Vector Machine Algorithm


SECTION – D
Complex Data Types in Data Mining
Complex data types require advanced data mining techniques. Examples of complex data types are sequence data, which includes time-series, symbolic sequences, and biological sequences. Additional preprocessing steps are needed to mine these complex data types.

1. Time-Series Data Mining:

In time-series data, values are measured as a long series of numerical or textual data at equal time intervals, e.g. per minute, per hour, or per day. Time-series data mining is performed on data obtained from stock markets, scientific observations, and medical measurements. In time-series mining it is usually not possible to find data that exactly match a given query, so we employ a similarity search that finds data sequences similar to the given query sequence. In the similarity search method, subsequence matching is performed to find the subsequences that are similar to a given query sequence. In order to perform the similarity search, dimensionality reduction is applied to transform the time-series data into a compact numerical representation.
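A minimal sketch of this idea follows: each window of the series and the query are reduced with piecewise aggregate approximation (one common dimensionality-reduction choice), and the most similar subsequence is found by Euclidean distance; the readings and the query pattern are made up.

import numpy as np

def paa(values, segments):
    # Piecewise aggregate approximation: shrink a sequence to `segments` averaged values.
    chunks = np.array_split(np.asarray(values, dtype=float), segments)
    return np.array([chunk.mean() for chunk in chunks])

def best_match(series, query, segments=4):
    # Slide a query-length window over the series and compare in the reduced space.
    reduced_query = paa(query, segments)
    window = len(query)
    distances = [np.linalg.norm(paa(series[start:start + window], segments) - reduced_query)
                 for start in range(len(series) - window + 1)]
    best = int(np.argmin(distances))
    return best, distances[best]

# Hypothetical hourly sensor readings and a query pattern to look for.
readings = [10, 11, 12, 20, 25, 24, 23, 12, 11, 10, 21, 26, 25, 22]
pattern = [20, 25, 24, 23]
print(best_match(readings, pattern))  # start index of the most similar subsequence and its distance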

2. Sequential Pattern Mining in Symbolic Sequences:

Symbolic sequences are composed of long sequences of nominal data, which dynamically change their behavior over time intervals. Examples of symbolic sequences include online customer shopping sequences and sequences of events in experiments. Mining of symbolic sequences is called sequential pattern mining. A sequential pattern is a subsequence that occurs frequently in a set of sequences, so mining finds the most frequent subsequences in a set of sequences. Many scalable algorithms have been built to find frequent subsequences, and there are also algorithms to mine multidimensional and multilevel sequential patterns.
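The sketch below shows the core support-counting step on hypothetical customer shopping sequences; real sequential pattern miners such as GSP or PrefixSpan organize this search far more efficiently, so this is only the idea, not a scalable algorithm.

def is_subsequence(pattern, sequence):
    # True if `pattern` appears in order (not necessarily contiguously) in `sequence`.
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern, sequences):
    # Support of a candidate pattern: fraction of sequences that contain it.
    return sum(is_subsequence(pattern, s) for s in sequences) / len(sequences)

# Hypothetical customer shopping sequences (each list is one customer's purchases over time).
shopping = [
    ["milk", "bread", "butter", "jam"],
    ["bread", "milk", "butter"],
    ["milk", "eggs", "butter"],
    ["bread", "jam"],
]
print(support(["milk", "butter"], shopping))  # 0.75 -> frequent if min_support <= 0.75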

3. Data mining of Biological Sequences:

Biological sequences are long sequences of nucleotides, and data mining of biological sequences is required to find the features of, for example, human DNA. Biological sequence analysis, the first step of such data mining, compares the alignment of biological sequences. Two species are similar to each other only if their nucleotide (DNA, RNA) and protein sequences are close and similar. During the data mining of biological sequences, the degree of similarity between nucleotide sequences is measured. The degree of similarity obtained by sequence alignment of nucleotides is essential in determining the homology between two sequences.
Two or more input biological sequences may be aligned by identifying similar sequences with long common subsequences. Amino acid (protein) sequences are also compared and aligned.
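A minimal sketch of pairwise alignment scoring follows, using Needleman–Wunsch-style dynamic programming; the match, mismatch, and gap scores and the two short sequences are purely illustrative.

def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    # Global alignment score of sequences a and b via dynamic programming.
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = score[i - 1][0] + gap          # prefix of `a` aligned against gaps
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] + gap          # prefix of `b` aligned against gaps
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag,
                              score[i - 1][j] + gap,  # gap in `b`
                              score[i][j - 1] + gap)  # gap in `a`
    return score[-1][-1]

# Hypothetical short nucleotide sequences; a higher score means greater similarity.
print(alignment_score("GATTACA", "GACTATA"))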

4. Graph Pattern Mining:

Graph pattern mining can be done using Apriori-based and pattern-growth-based approaches. We can mine the subgraphs of a graph and the set of closed graphs. A closed graph g is a graph that has no supergraph with the same support count as g. Graph pattern mining is applied to different types of graphs such as frequent graphs, coherent graphs, and dense graphs, and mining efficiency can be improved by applying user constraints on the graph patterns. Graph patterns are of two types: homogeneous graphs, where the nodes or links of the graph are of the same type and have similar features, and heterogeneous graphs, where the nodes and links are of different types.
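The sketch below illustrates the support measure behind both approaches: it counts in how many graphs of a collection a candidate pattern occurs as a subgraph. It assumes the networkx library is available, and the three small graphs and the triangle pattern are hypothetical.

# A minimal sketch assuming the networkx library; the graph data is hypothetical.
import networkx as nx
from networkx.algorithms import isomorphism

def contains_pattern(graph, pattern):
    # Subgraph isomorphism check: does `pattern` occur inside `graph`?
    return isomorphism.GraphMatcher(graph, pattern).subgraph_is_isomorphic()

def support(pattern, graphs):
    # Support of a graph pattern: fraction of graphs in the collection containing it.
    return sum(contains_pattern(g, pattern) for g in graphs) / len(graphs)

# Three small undirected graphs (e.g., molecules or social subnetworks).
g1 = nx.Graph([("a", "b"), ("b", "c"), ("c", "a")])               # contains a triangle
g2 = nx.Graph([("x", "y"), ("y", "z")])                           # a simple path
g3 = nx.Graph([("p", "q"), ("q", "r"), ("r", "p"), ("r", "s")])   # triangle plus a tail

triangle = nx.Graph([(1, 2), (2, 3), (3, 1)])
print(support(triangle, [g1, g2, g3]))   # 2/3 of the graphs contain the triangle pattern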
5. Statistical Modeling of Networks:

A network is a collection of nodes where each node represents the data and the nodes
are linked through edges, representing relationships between data objects. If all the
nodes and links connecting the nodes are of the same type, then the network is
homogeneous such as a friend network or a web page network. If the nodes and links
connecting the nodes are of different types, then the network is heterogeneous such as
health-care networks (linking the different parameters such as doctors, nurses,
patients, diseases together in the network). Graph Pattern Mining can be further
applied to the network to derive the knowledge and useful patterns from the network.

6. Mining Spatial Data:

Spatial data is geospatial data stored in large data repositories, represented in "vector" format or as geo-referenced multimedia. A spatial database is constructed from large geographic data warehouses by integrating geographical data from multiple sources and areas. We can construct spatial data cubes that contain information about spatial dimensions and measures, and it is possible to perform OLAP operations on spatial data for spatial data analysis. Spatial data mining is performed on spatial data warehouses, spatial databases, and other geospatial data repositories, and it discovers knowledge about geographic areas. The preprocessing of spatial data involves several operations such as spatial clustering, spatial classification, spatial modeling, and outlier detection in spatial data.

7. Mining Cyber-Physical System Data:

Cyber-Physical System Data can be mined by constructing a graph or network of data.


A cyber-physical system (CPS) is a heterogeneous network that consists of a large number of interconnected nodes that store, for example, patient or medical information. The links in the CPS network represent the relationships between the nodes. Cyber-physical systems store dynamic, inconsistent, and interdependent data that contains spatiotemporal information. Mining cyber-physical data treats a situation as a query to access data from a large information database; it involves real-time calculations and analysis to prompt responses from the CPS. CPS analysis requires rare-event detection and anomaly analysis in cyber-physical data streams and networks, and the processing of cyber-physical data involves integrating stream data with real-time automated control processes.

8. Mining Multimedia Data:

Multimedia data objects include image data, video data, audio data, website hyperlinks, and linkages. Multimedia data mining tries to find interesting patterns in multimedia databases. This includes processing digital data and performing tasks such as image processing, image classification, video and audio data mining, and pattern recognition. Multimedia data mining is becoming an increasingly interesting research area because data from social media platforms such as Twitter and Facebook can be analyzed this way to derive interesting trends and patterns.

9. Mining Web Data:

Web mining is essential to discover crucial patterns and knowledge from the Web. Web content mining analyzes the data of many websites, including web pages and the multimedia data, such as images, embedded in them. Web mining is done to understand the content of web pages, unique users of a website, unique hypertext links, web page relevance and ranking, web page content summaries, the time that users spend on a particular website, and user search patterns. Web mining also compares search engines and the search algorithms they use, which helps improve search efficiency and find the best search engine for users.
10. Mining Text Data:

Text mining is a subfield of data mining, machine learning, natural language processing, and statistics. Most of the information in our daily life is stored as text, such as news articles, technical papers, books, email messages, and blogs. Text mining helps us retrieve high-quality information from text, for tasks such as sentiment analysis, document summarization, text categorization, and text clustering. We apply machine learning models and NLP techniques to derive useful information from text. This is done by finding hidden patterns and trends by means such as statistical pattern learning and statistical language modeling. In order to perform text mining, we need to preprocess the text by applying techniques such as stemming and lemmatization in order to convert the textual data into data vectors.
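A minimal preprocessing sketch follows, assuming NLTK for stemming and scikit-learn for vectorization; the three documents are made up, and lemmatization is omitted for brevity.

# A minimal sketch assuming NLTK (stemming) and scikit-learn (vectorization); the corpus is hypothetical.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The markets were trading higher on strong earnings.",
    "Traders expect the market to trade sideways next week.",
    "A new study summarizes trends in text categorization.",
]

stemmer = PorterStemmer()

def stem_text(text):
    # Reduce each word to its stem so "trading", "traders", "trade" map to similar tokens.
    return " ".join(stemmer.stem(word) for word in text.lower().split())

stemmed = [stem_text(doc) for doc in documents]

vectorizer = TfidfVectorizer()              # convert the preprocessed text into data vectors
vectors = vectorizer.fit_transform(stemmed)
print(vectors.shape)                        # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_)[:8])   # a few of the learned vocabulary terms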

11. Mining Spatiotemporal Data:

Data that is related to both space and time is spatiotemporal data. Spatiotemporal data mining retrieves interesting patterns and knowledge from such data; it helps us, for example, estimate the value of land, determine the age of rocks and precious stones, and predict weather patterns. Spatiotemporal data mining has many practical applications such as GPS in mobile phones, timers, Internet-based map services, weather services, satellites, RFID, and sensors.

12. Mining Data Streams:

Stream data changes dynamically; it is noisy and inconsistent and contains multidimensional features of different data types, so it is often stored in NoSQL database systems. The volume of stream data is very high, which is the main challenge for effective mining of stream data. While mining data streams we need to perform tasks such as clustering, outlier analysis, and online detection of rare events in the streams.

Spatial Databases
Spatial data is associated with geographic locations such as cities, towns, etc. A spatial database is optimized to store and query data representing objects that are defined in a geometric space.
Characteristics of Spatial Database
A spatial database system has the following characteristics
 It is a database system
 It offers spatial data types (SDTs) in its data model and query language.
 It supports spatial data types in its implementation, providing at least spatial indexing
and efficient algorithms for spatial join.
Example
A road map is a visualization of geographic information. A road map is a 2-dimensional
object which contains points, lines, and polygons that can represent cities, roads, and
political boundaries such as states or provinces.
In general, spatial data can be of two types −
 Vector data: data represented as discrete points, lines, and polygons.
 Raster data: data represented as a matrix of square cells.
Spatial data in the form of points, lines, polygons, etc. is used by many different kinds of databases.
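As a small illustration of vector spatial data, the sketch below represents a point and a polygon and answers simple spatial queries; it assumes the shapely library, and the coordinates are hypothetical.

# A minimal sketch assuming the shapely library; the coordinates are hypothetical.
from shapely.geometry import Point, Polygon

# Vector data: a city as a point, a state boundary as a polygon.
city = Point(2.0, 3.0)
state_boundary = Polygon([(0, 0), (5, 0), (5, 5), (0, 5)])

print(state_boundary.contains(city))    # spatial predicate: is the city inside the boundary?
print(city.distance(Point(4.0, 3.0)))   # distance between two point locations
print(state_boundary.area)              # area of the polygon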

Multimedia Database
A multimedia database is a collection of interrelated multimedia data that includes text, graphics (sketches, drawings), images, animations, video, audio, etc., and can involve vast amounts of multisource multimedia data. The framework that manages different types of multimedia data that can be stored, delivered, and utilized in different ways is known as a multimedia database management system. There are three classes of multimedia database: static media, dynamic media, and dimensional media.
Content of Multimedia Database management system :
1. Media data – The actual data representing an object.
2. Media format data – Information such as sampling rate, resolution, encoding scheme
etc. about the format of the media data after it goes through the acquisition,
processing and encoding phase.
3. Media keyword data – Keywords description relating to the generation of data. It is
also known as content descriptive data. Example: date, time and place of recording.
4. Media feature data – Content dependent data such as the distribution of colors, kinds
of texture and different shapes present in data.
Types of multimedia applications based on data management characteristic are :
1. Repository applications – A large amount of multimedia data as well as metadata (media format data, media keyword data, media feature data) is stored for retrieval purposes, e.g., repositories of satellite images, engineering drawings, or radiology scans.
2. Presentation applications – They involve delivery of multimedia data subject to
temporal constraint. Optimal viewing or listening requires DBMS to deliver data at
certain rate offering the quality of service above a certain threshold. Here data is
processed as it is delivered. Example: Annotating of video and audio data, real-time
editing analysis.
3. Collaborative work using multimedia information – It involves executing a
complex task by merging drawings, changing notifications. Example: Intelligent
healthcare network.
There are still many challenges to multimedia databases, some of which are :
1. Modelling – Work in this area can improve database techniques versus information-retrieval techniques; documents constitute a specialized area and deserve special consideration.
2. Design – The conceptual, logical, and physical design of multimedia databases has not yet been addressed fully, as performance and tuning issues at each level are far more complex; the data consist of a variety of formats such as JPEG, GIF, PNG, and MPEG, which are not easy to convert from one form to another.
3. Storage – Storage of multimedia database on any standard disk presents the problem
of representation, compression, mapping to device hierarchies, archiving and buffering
during input-output operation. In DBMS, a ”BLOB”(Binary Large Object) facility allows
untyped bitmaps to be stored and retrieved.
4. Performance – For applications involving video playback or audio-video synchronization, physical limitations dominate. The use of parallel processing may alleviate some problems, but such techniques are not yet fully developed. Apart from this, multimedia databases consume a lot of processing time as well as bandwidth.
5. Queries and retrieval –For multimedia data like images, video, audio accessing data
through query opens up many issues like efficient query formulation, query execution
and optimization which need to be worked upon.
Areas where multimedia database is applied are :
 Documents and record management : Industries and businesses that keep detailed
records and variety of documents. Example: Insurance claim record.
 Knowledge dissemination : Multimedia database is a very effective tool for
knowledge dissemination in terms of providing several resources. Example: Electronic
books.
 Education and training : Computer-aided learning materials can be designed using
multimedia sources which are nowadays very popular sources of learning. Example:
Digital libraries.
 Marketing, advertising, retailing, entertainment and travel. Example: a virtual tour of
cities.
 Real-time control and monitoring : Coupled with active database technology, multimedia presentation of information can be a very effective means of monitoring and controlling complex tasks. Example: manufacturing operation control.
Data Mining – Time-Series, Symbolic and Biological Sequences Data
Data mining refers to extracting or mining knowledge from large amounts of data. In other words, data mining is the science, art, and technology of exploring large and complex bodies of data in order to discover useful patterns. Theoreticians and practitioners are continually seeking improved techniques to make the process more efficient, cost-effective, and accurate.
This article discusses sequence data. The variety of data continues to grow and will likely continue to do so; to generalize, such data can be classified as sequence data, graphs and networks, and other kinds of data.

A sequence is an ordered list of events. Sequence data are classified based on their characteristics as:
 Time-series data (data with respect to time)
 Symbolic data (events recorded with or without a concrete notion of time)
 Biological data (data related to DNA and proteins)

Time-Series Data:

In this type of sequence, the data are of numeric type and recorded at regular intervals. They are generated by processes such as stock market activity and medical observation, and they are useful for studying natural phenomena.
Nowadays these time series are often reduced to piecewise approximations for further analysis. In time-series data we find a subsequence that matches the query we search for.
 Time Series Forecasting: Forecasting is a method of making predictions based on past and present data to estimate what happens in the future. Trend analysis is one method of forecasting time series: it extracts historic patterns in the series that are used for short- and long-term predictions. We can observe various patterns in a time series, such as cyclic movements, trend movements, and seasonal movements, with respect to time or season. ARIMA, SARIMA, and long-memory time-series modeling are some popular methods for such analysis; a minimal forecasting sketch follows.
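The forecasting sketch referenced above assumes the statsmodels library; the sales series and the ARIMA order (1, 1, 1) are illustrative only, and in practice the order would be chosen by model selection.

# A minimal sketch assuming the statsmodels library; the series and (p, d, q) order are illustrative.
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly sales figures (a short time series).
sales = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
         115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140]

model = ARIMA(sales, order=(1, 1, 1))    # AR(1), first differencing, MA(1)
fitted = model.fit()
print(fitted.forecast(steps=3))          # predict the next three periods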

Symbolic Data:

This type of ordered set of elements or events is recorded with or without a concrete notion of time. Symbolic sequences such as customer shopping sequences and web clickstreams are examples of symbolic data. Sequential pattern mining is mainly used for symbolic sequences, and constraint-based pattern mining is one of the best ways to incorporate user-defined constraints. Apriori-style algorithms are used for this type of analysis. For example, customers c1 and c2 purchasing products at different time intervals form symbolic sequences.

Biological Data:

These are DNA and protein sequences. They are very long and complicated but carry hidden meaning. Such data represent sequences of nucleotides or amino acids, and their analysis is used for aligning, indexing, and analyzing biological sequences; it plays a crucial role in bioinformatics and modern biology. Substitution matrices are used to score the probabilities of amino acid substitutions. BLAST (Basic Local Alignment Search Tool) is the most widely used tool for biological sequence search.

Data Mining - Mining Text Data


Text databases consist of huge collections of documents. They collect this information from several sources such as news articles, books, digital libraries, e-mail messages, web pages, etc. Due to the increase in the amount of information, text databases are growing rapidly. In many text databases the data is semi-structured.
For example, a document may contain a few structured fields, such as title, author, publishing_date, etc. But along with the structured data, the document also contains unstructured text components, such as the abstract and contents. Without knowing what could be in the documents, it is difficult to formulate effective queries for analyzing and extracting useful information from the data. Users require tools to compare documents and rank their importance and relevance. Therefore, text mining has become a popular and essential theme in data mining.
Information Retrieval
Information retrieval deals with the retrieval of information from a large number of text-based documents. Information retrieval systems differ from typical database systems because the two handle different kinds of data. Examples of information retrieval systems include −
 Online Library catalogue system
 Online Document Management Systems
 Web Search Systems etc.
Note − The main problem in an information retrieval system is to locate relevant
documents in a document collection based on a user's query. This kind of user's query
consists of some keywords describing an information need.
In such search problems, the user takes an initiative to pull relevant information out
from a collection. This is appropriate when the user has ad-hoc information need, i.e., a
short-term need. But if the user has a long-term information need, then the retrieval
system can also take an initiative to push any newly arrived information item to the
user.
This kind of access to information is called Information Filtering. And the corresponding
systems are known as Filtering Systems or Recommender Systems.
Basic Measures for Text Retrieval
We need to check the accuracy of a system when it retrieves a number of documents on the basis of a user's input. Let the set of documents relevant to a query be denoted as {Relevant} and the set of retrieved documents as {Retrieved}. The set of documents that are both relevant and retrieved can be denoted as {Relevant} ∩ {Retrieved}. This relationship can be shown as a Venn diagram.

There are three fundamental measures for assessing the quality of text retrieval −
 Precision
 Recall
 F-score
Precision
Precision is the percentage of retrieved documents that are in fact relevant to the
query. Precision can be defined as −
Precision= |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall
Recall is the percentage of documents that are relevant to the query and were in fact
retrieved. Recall is defined as −
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
F-score
F-score is the commonly used trade-off measure. An information retrieval system often needs to trade off recall for precision or vice versa. F-score is defined as the harmonic mean of recall and precision:
F-score = 2 × (Precision × Recall) / (Precision + Recall)
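A small worked example of the three measures, using hypothetical document ids:

# Hypothetical document ids: the relevant set for a query and the set the system retrieved.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5", "d6"}

hits = relevant & retrieved                   # {Relevant} ∩ {Retrieved}

precision = len(hits) / len(retrieved)        # 2 / 4 = 0.5
recall = len(hits) / len(relevant)            # 2 / 4 = 0.5
f_score = 2 * precision * recall / (precision + recall)

print(precision, recall, f_score)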

Data Mining - Mining World Wide Web

Web mining can broadly be seen as the application of adapted data mining techniques to the web, whereas data mining is defined as the application of algorithms to discover patterns in mostly structured data embedded in a knowledge discovery process. The web offers a distinctive mix of data types: web pages consist of text, web pages are linked via hyperlinks, and user activity can be monitored via web server logs. These three features lead to the differentiation of three areas: web content mining, web structure mining, and web usage mining.

There are three types of web mining:

1. Web Content Mining:

Web content mining can be used to extract useful data, information, knowledge from
the web page content. In web content mining, each web page is considered as an
individual document. The individual can take advantage of the semi-structured nature
of web pages, as HTML provides information that concerns not only the layout but also
logical structure. The primary task of content mining is data extraction, where
structured data is extracted from unstructured websites. The objective is to facilitate
data aggregation over various web sites by using the extracted structured data. Web
content mining can be utilized to distinguish topics on the web. For Example, if any
user searches for a specific task on the search engine, then the user will get a list of
suggestions.

2. Web Structured Mining:

The web structure mining can be used to find the link structure of hyperlink. It is used
to identify that data either link the web pages or direct link network. In Web Structure
Mining, an individual considers the web as a directed graph, with the web pages being
the vertices that are associated with hyperlinks. The most important application in this
regard is the Google search engine, which estimates the ranking of its outcomes
primarily with the PageRank algorithm. It characterizes a page to be exceptionally
relevant when frequently connected by other highly related pages. Structure and
content mining methodologies are usually combined. For example, web structured
mining can be beneficial to organizations to regulate the network between two
commercial sites.
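As a rough illustration, the sketch below scores a tiny hypothetical hyperlink graph with networkx's implementation of PageRank; the page names and the link structure are made up.

# A minimal sketch assuming the networkx library; the hyperlink graph is hypothetical.
import networkx as nx

# Directed graph: an edge A -> B means page A links to page B.
links = nx.DiGraph([
    ("home", "products"), ("home", "blog"),
    ("products", "home"), ("blog", "products"),
    ("partner", "products"),
])

ranks = nx.pagerank(links, alpha=0.85)    # damping factor 0.85 is the usual default
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))          # frequently linked-to pages get the highest scores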

3. Web Usage Mining:

Web usage mining is used to extract useful data, information, and knowledge from weblog records, and it assists in recognizing user access patterns for web pages. In mining the usage of web resources, one considers records of requests from visitors to a website, which are often collected in web server logs. While the content and structure of the collection of web pages follow the intentions of the authors of the pages, the individual requests demonstrate how consumers actually use these pages. Web usage mining may disclose relationships that were not intended by the creator of the pages.

Some of the methods to identify and analyze the web usage patterns are given below:

I. Session and visitor analysis:

The analysis of the preprocessed data is accomplished in session analysis, which incorporates visitor records, days, times, sessions, etc. This data can be utilized to analyze the visitors' behavior.
A report is created after this analysis, which contains details of the repeatedly visited web pages and the common entry and exit points.

II. OLAP (Online Analytical Processing):

OLAP performs a multidimensional analysis of aggregated log data.
OLAP can be applied to various parts of the log-related data over a specific period.
OLAP tools can be used to derive important business intelligence metrics.

Challenges in Web Mining:

The web presents considerable challenges for resource and knowledge discovery, based on the following observations:

o The complexity of web pages:

The site pages don't have a unifying structure. They are extremely complicated as compared to traditional text
documents. There are enormous amounts of documents in the digital library of the web. These libraries are not
organized according to a specific order.

o The web is a dynamic data source:


The data on the internet is quickly updated. For example, news, climate, shopping, financial news, sports, and
so on.

o Diversity of client networks:

The client network on the web is quickly expanding. These clients have different
interests, backgrounds, and usage purposes. There are over a hundred million
workstations that are associated with the internet and still increasing tremendously.

o Relevancy of data:

It is considered that a specific user is generally concerned with only a small portion of the web, while the rest of the web contains data that is not relevant to the user and may swamp the desired results.

o The web is too broad:

The size of the web is tremendous and rapidly increasing. It appears that the web is too
huge for data warehousing and data mining.

Mining the Web's Link Structures to recognize Authoritative Web Pages:

The web consists of pages as well as hyperlinks pointing from one page to another. When the creator of a web page adds a hyperlink pointing to another web page, this can be considered an endorsement of the other page. The collective endorsement of a given page by many creators on the web may indicate the significance of the page and may naturally lead to the discovery of authoritative web pages. Web linkage data therefore provide rich information about the relevance, quality, and structure of the web's content, and are thus a rich source for web mining.

Application of Web Mining:

Web mining has extensive applications because of the many uses of the web. Some applications of web mining are listed below.

o Marketing and conversion tools
o Data analysis of website and application performance
o Audience behavior analysis
o Advertising and campaign performance analysis
o Testing and analysis of a site
