DATA WAREHOUSE
SECTION – A
What is a Data Warehouse?
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than
transaction processing. It includes historical data derived from transaction data from single or multiple
sources. A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on supporting
decision-makers in data modeling and analysis. A Data Warehouse is a collection of data
specific to the entire organization, not only to a particular group of users.
It is not used for daily operations and transaction processing, but for making
decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various
applications.
o It supports a relatively small number of clients with relatively long
interactions.
o It includes current and historical data to provide a historical perspective of
information.
o Its usage is read-intensive.
o It contains a few large tables.
"Data Warehouse is a subject-oriented, integrated, and time-variant store of
information in support of management's decisions."
Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view
around a particular subject, such as customer, product, or sales, instead of the
organization's global ongoing operations. This is done by excluding data that are not useful
for the subject and including all data needed by the users to understand the
subject.
Integrated
A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat
files, and online transaction records. It requires performing data cleaning and
integration during data warehousing to ensure consistency in naming conventions,
attribute types, etc., among the different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even earlier from a data warehouse.
This differs from a transaction system, where often only the most current data is
kept.
Non-Volatile
The data warehouse is physically separate data storage, transformed from the source
operational RDBMS. Operational updates of data do not occur in the data warehouse,
i.e., update, insert, and delete operations are not performed. It usually requires only
two procedures for data access: the initial loading of data and read access to the data.
Therefore, the DW does not require transaction processing, recovery, and concurrency
capabilities, which allows for a substantial speedup of data retrieval. Non-volatile means
that, once entered into the warehouse, data should not change.
Data Warehouse Usage:-
1. Data warehouses and data marts are used in a wide range of applications.
2. Business executives use the data in data warehouses and data marts to perform data
analysis and make strategic decisions.
3. In many areas, data warehouses are used as an integral part of enterprise management.
4. The data warehouse is mainly used for generating reports and answering predefined queries.
5. It is used to analyze summarized and detailed data, where the results are presented in the
form of reports and charts.
6. Later, the data warehouse is used for strategic purposes, performing multidimensional
analysis and sophisticated operations.
7. Finally, the data warehouse may be employed for knowledge discovery and strategic
decision making using data mining tools.
8. In this context, the tools for data warehousing can be categorized into access and retrieval
tools, database reporting tools, data analysis tools, and data mining tools.
What is Data Mart?
A Data Mart is a subset of an organizational data store, generally oriented to a
specific purpose or primary data subject, which may be distributed to support
business needs. Data Marts are analytical record stores designed to focus on
particular business functions for a specific community within an organization.
Data marts are derived from subsets of data in a data warehouse, though in the
bottom-up data warehouse design methodology, the data warehouse is created
from the union of organizational data marts.
There are mainly two approaches to designing data marts: the top-down approach, in which
dependent data marts are built from an existing enterprise data warehouse, and the bottom-up
approach, in which independent data marts are built first and later integrated into a data warehouse.
Metadata
Metadata is used for building, maintaining, managing, and using the data
warehouse. Metadata allows users to understand the content and to find data.
o First, it acts as the glue that links all parts of the data warehouses.
o Next, it provides information about the contents and structures to the
developers.
o Finally, it opens the doors to the end-users and makes the contents
recognizable in their terms.
Metadata is like a nerve center. Various processes during the building and
administering of the data warehouse generate parts of the data warehouse
metadata, and each process also uses parts of the metadata generated by other
processes. In the data warehouse, metadata assumes a key position and enables
communication among the various processes. It acts as a nerve centre in the data warehouse.
For example, a relation with the schema sales (part, supplier, customer, sale-price)
can be materialized into a set of eight views as shown in the figure,
where psc indicates a view consisting of aggregate function values (such as total-
sales) computed by grouping the three attributes part, supplier, and
customer; p indicates a view composed of the corresponding aggregate function
values calculated by grouping part alone, and so on.
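As a hedged illustration (not from the source text), the eight views correspond to the 2^3 possible group-by combinations of the three grouping attributes; a short Python sketch can enumerate this lattice of views:

from itertools import combinations

# The three grouping attributes of the sales relation (sale-price is the measure).
dimensions = ["part", "supplier", "customer"]

# Every subset of the dimensions defines one view (cuboid) of the lattice:
# () is the apex view (grand total); ("part", "supplier", "customer") is the base view psc.
views = [combo
         for r in range(len(dimensions) + 1)
         for combo in combinations(dimensions, r)]

for v in views:
    print("".join(name[0] for name in v) or "(all)", "-> group by", v)
# Prints 8 views: (all), p, s, c, ps, pc, sc, psc.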
A data cube is created from a subset of attributes in the database. Specific attributes
are chosen to be measure attributes, i.e., the attributes whose values are of interest.
Other attributes are selected as dimensions or functional attributes. The measure
attributes are aggregated according to the dimensions. For example, XYZ may create
a sales data warehouse to keep records of the store's sales for the dimensions time,
item, branch, and location. These dimensions enable the store to keep track of things
like monthly sales of items and the branches and locations at which the items were
sold. Each dimension may have a table associated with it, known as a dimension table,
which describes the dimension. For example, a dimension table for items may contain
the attributes item_name, brand, and type.
The data cube method is an interesting technique with many applications. Data cubes can be sparse
in many cases because not every cell in each dimension may have corresponding data in the
database. Techniques should be developed to handle sparse cubes efficiently. If a query contains
constants at even lower levels than those provided in a data cube, it is not clear how to make the
best use of the precomputed results stored in the data cube. The multidimensional data model views
data in the form of a data cube, and OLAP tools are based on this multidimensional data model. Data
cubes usually model n-dimensional data: a data cube enables data to be modeled and viewed in
multiple dimensions. A multidimensional data model is organized around a central theme, like sales
or transactions, and a fact table represents this theme. Facts are numerical measures, so the fact
table contains measures (such as Rs_sold) and keys to each of the related dimension tables.
Dimensions are the perspectives or entities with respect to which the facts are recorded, and
together they define the data cube. Facts are generally quantities, which are used for analyzing the
relationships between dimensions.
Fact Tables
A fact table is a table in a star schema that contains facts and is connected to the dimension
tables. A fact table has two types of columns: those that contain facts and those that are foreign
keys to the dimension tables. The primary key of a fact table is generally a composite key that is
made up of all of its foreign keys.
A fact table might contain either detail-level facts or facts that have been aggregated (fact
tables that contain aggregated facts are often instead called summary tables). A fact table
generally contains facts at the same level of aggregation.
Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that
categorize data. If a dimension has no hierarchies or levels, it is called a flat
dimension or list. The primary key of each dimension table is part of the
composite primary key of the fact table. Dimensional attributes help to define the
dimensional values. They are generally descriptive, textual values. Dimension tables are
usually smaller in size than fact tables.
Fact tables store data about sales, while dimension tables store data about the geographic
region (markets, cities), clients, products, times, and channels.
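The following is a minimal sketch (not from the source) of a star-schema fact table and one dimension table using pandas; the table and column names (fact_sales, dim_item, item_key, Rs_sold) are illustrative assumptions:

import pandas as pd

# Hypothetical dimension table describing items (item_key is its primary key).
dim_item = pd.DataFrame({
    "item_key":  [1, 2],
    "item_name": ["Laptop", "Phone"],
    "brand":     ["Acme", "Globex"],
    "type":      ["electronics", "electronics"],
})

# Hypothetical fact table: the measure Rs_sold plus foreign keys to the dimension tables.
fact_sales = pd.DataFrame({
    "time_key": [101, 101, 102],
    "item_key": [1, 2, 1],
    "Rs_sold":  [55000, 30000, 60000],
})

# A typical star-schema query: join the fact table to a dimension table
# and aggregate the measure by a descriptive dimensional attribute.
report = (fact_sales
          .merge(dim_item, on="item_key")
          .groupby("brand", as_index=False)["Rs_sold"]
          .sum())
print(report)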
Characteristics of Star Schema
The star schema is highly suitable for data warehouse database design because of the
following features:
o It creates a de-normalized database that can quickly provide query responses.
o It provides a flexible design that can be changed or extended easily throughout the
development cycle, and as the database grows.
o Its design parallels how end-users typically think of and use the data.
o It reduces the complexity of metadata for both developers and end-users.
Snowflake Schema
Example: The figure shows a snowflake schema with a Sales fact table and Store, Location,
Time, Product, Line, and Family dimension tables. The Market dimension has two
dimension tables, with Store as the primary dimension table and Location as the
outrigger dimension table. The Product dimension has three dimension tables, with
Product as the primary dimension table and the Line and Family tables as the outrigger
dimension tables.
A snowflake schema is designed for flexible querying across more complex dimensions
and relationships. It is suitable for many-to-many and one-to-many relationships between
dimension levels.
1. The primary disadvantage of the snowflake schema is the additional maintenance effort
required due to the increased number of lookup tables.
2. Queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and therefore longer query execution times.
Fact Constellation Schema describes the logical structure of a data warehouse or data mart.
It is also known as a galaxy schema or multi-fact star schema. A Fact Constellation Schema
can be designed with a collection of de-normalized fact tables and shared, conformed
dimension tables.
Centralized Process Architecture
In this architecture, the data is collected into a single centralized storage and processed
by a single machine with a huge structure in terms of memory, processor, and storage.
Centralized process architecture evolved with transaction processing and is well suited for
small organizations with one location of service.
ROLAP vs MOLAP:
o ROLAP stands for Relational Online Analytical Processing; MOLAP stands for Multidimensional
Online Analytical Processing.
o ROLAP is usually used when the data warehouse contains relational data; MOLAP is used when the
data warehouse contains relational as well as non-relational data.
o ROLAP has a high response time; MOLAP has a lower response time due to prefabricated cubes.
HOLAP stands for Hybrid OLAP, an application using both relational and multidimensional
techniques.
ROLAP Servers
These are intermediate servers that stand in between a relational back-end server and
client front-end tools. They use a relational or extended-relational DBMS to store and manage
warehouse data, and OLAP middleware to provide the missing pieces. ROLAP servers contain
optimizations for each DBMS back end, implementation of aggregation navigation logic, and
additional tools and services. ROLAP technology tends to have greater scalability than
MOLAP technology. ROLAP systems work primarily from the data that resides in a relational
database, where the base data and dimension tables are stored as relational tables. This
model permits the multidimensional analysis of data. The technique relies on manipulating
the data stored in the relational database to give the appearance of traditional OLAP's slicing
and dicing functionality. In essence, each method of slicing and dicing is equivalent to
adding a "WHERE" clause to the SQL statement.
ROLAP Architecture includes the following components:
o Database server.
o ROLAP server.
o Front-end tool.
Disadvantages
Performance can be slow: Each ROLAP report is a SQL query (or multiple SQL queries) against
the relational database, so query time can be prolonged if the underlying data size is
large.
MOLAP
A MOLAP system is based on a native logical model that directly supports multidimensional
data and operations. Data are stored physically in multidimensional arrays, and positional
techniques are used to access them.
One of the significant distinctions of MOLAP from ROLAP is that data are summarized
and stored in an optimized format in a multidimensional cube, instead of in a relational
database. In the MOLAP model, data are structured into proprietary formats according to
clients' reporting requirements, with the calculations pre-generated on the cubes.
MOLAP Architecture
MOLAP Architecture includes the following components
o Database server.
o MOLAP server.
o Front-end tool.
o The MOLAP structure primarily reads precompiled data. MOLAP structures have limited
capabilities to dynamically create aggregations or to evaluate results that have not been
pre-calculated and stored.
o Applications requiring iterative and comprehensive time-series analysis of trends are well
suited for MOLAP technology (e.g., financial analysis and budgeting).
o Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's
Lightship Server, Sinper's TM/1, Planning Science's Gentium, and Kenan Technology's
Multiway.
o Some of the problems faced by clients are related to maintaining support for multiple
subject areas in an RDBMS. Some vendors can solve these problems by continuing access
from MOLAP tools to detailed data in an RDBMS.
o An example would be the creation of sales data measured by several dimensions (e.g.,
product and sales region) to be stored and maintained in a persistent structure. This
structure would be provided to reduce the application overhead of performing calculations
and building aggregation during initialization. These structures can be automatically
refreshed at predetermined intervals established by an administrator.
Advantages
o Excellent performance: A MOLAP cube is built for fast information retrieval and is
optimal for slicing and dicing operations.
o Can perform complex calculations: All calculations have been pre-generated when the
cube is created. Hence, complex calculations are not only possible, but they also return quickly.
Disadvantages
o Limited in the amount of information it can handle: Because all calculations are
performed when the cube is built, it is not possible to include a large amount of data in the
cube itself.
o Requires additional investment: Cube technology is generally proprietary and does not
already exist in the organization. Therefore, to adopt MOLAP technology, chances are that
additional investments in human and capital resources are needed.
HOLAP incorporates the best features of MOLAP and ROLAP into a single architecture.
HOLAP systems store the larger quantities of detailed data in relational tables,
while the aggregations are stored in pre-calculated cubes. HOLAP can also drill through
from the cube down to the relational tables for detailed data. Microsoft SQL
Server 2000 provides a hybrid OLAP server.
Advantages of HOLAP
1. HOLAP provides the benefits of both MOLAP and ROLAP.
2. It provides fast access at all levels of aggregation.
3. HOLAP balances the disk space requirement, as it only stores the aggregate information on the
OLAP server while the detail records remain in the relational database, so no duplicate copy of the
detail records is maintained.
Disadvantages of HOLAP
1. HOLAP architecture is very complicated because it supports both MOLAP and ROLAP
servers.
Other Types
There are also less popular types of OLAP styles on which one could stumble every
so often. Some of the less common variants existing in the OLAP industry are listed below.
A bottom tier that consists of the data warehouse server, which is almost always an
RDBMS. It may include several specialized data marts and a metadata repository.
Data from operational databases and external sources (such as user profile data provided
by external consultants) are extracted using application program interfaces called
gateways. A gateway is provided by the underlying DBMS and allows client programs to
generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE-DB (Object
Linking and Embedding for Databases) by Microsoft, and JDBC (Java Database Connectivity).
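As a hedged sketch of gateway-style access (the DSN name, credentials, and sales table below are placeholders, and the third-party pyodbc package plus a configured ODBC driver are assumed), a client program hands SQL to the gateway, which executes it at the server:

import pyodbc  # third-party ODBC bridge; requires an installed ODBC driver and DSN

# Hypothetical data source name and credentials.
conn = pyodbc.connect("DSN=warehouse_dsn;UID=report_user;PWD=secret")
cursor = conn.cursor()

# The client generates SQL; the gateway executes it at the warehouse server.
cursor.execute("SELECT COUNT(*) FROM sales")
print(cursor.fetchone()[0])
conn.close()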
[Example transaction data (rows t1-t6 are transactions; each column is an item, 1 = item present):]
t1: 1 1 1 0 0
t2: 0 1 1 1 0
t3: 0 0 0 1 1
t4: 1 1 0 1 0
t5: 1 1 1 0 1
t6: 1 1 1 1 1
SECTION – C
Data Mining – Cluster Analysis
Cluster Analysis is the process of finding similar groups of objects in order to form clusters. It is an
unsupervised machine learning-based technique that acts on unlabelled data. A group of similar data
points comes together to form a cluster, in which all the objects belong to the same group.
Cluster:
The given data is divided into different groups by combining similar objects into a group. This group
is nothing but a cluster: a collection of similar data grouped together.
For example, consider a dataset of vehicles that contains information about different
vehicles such as cars, buses, and bicycles. As this is unsupervised learning, there are no class
labels like Cars or Bikes for the vehicles; all the data is combined and is not in a structured form.
Now our task is to convert the unlabelled data into labelled data, and it can be done using clusters.
The main idea of cluster analysis is that it arranges all the data points by forming clusters, like a
cars cluster which contains all the cars, a bikes cluster which contains all the bikes, and so on.
Simply put, it is the partitioning of similar objects, applied to unlabelled data.
Properties of Clustering :
1. Clustering Scalability: Nowadays there is a vast amount of data, so clustering must deal with
huge databases. In order to handle extensive databases, the clustering algorithm should be
scalable; if it is not, the results may be inappropriate or misleading.
2. High Dimensionality: The algorithm should be able to handle high dimensional space along with
the data of small size.
3. Algorithm Usability with multiple data kinds: Different kinds of data can be used with
algorithms of clustering. It should be capable of dealing with different types of data like discrete,
categorical and interval-based data, binary data etc.
4. Dealing with unstructured data: There would be some databases that contain missing values,
and noisy or erroneous data. If the algorithms are sensitive to such data then it may lead to poor
quality clusters. So it should be able to handle unstructured data and give some structure to the
data by organising it into groups of similar data objects. This makes the job of the data expert easier
in order to process the data and discover new patterns.
5. Interpretability: The clustering outcomes should be interpretable, comprehensible, and
usable. The interpretability reflects how easily the data is understood.
Clustering Methods:
The clustering methods can be classified into the following categories:
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method: It is used to make partitions of the data in order to form clusters. If "n"
partitions are made of "p" objects of the database, then each partition is represented by a cluster
and n ≤ p. The two conditions which need to be satisfied by this partitioning clustering method
are:
Each object should belong to exactly one group.
There should be no group without at least one object.
In the partitioning method, there is a technique called iterative relocation, which means an object
may be moved from one group to another to improve the partitioning.
Hierarchical Method: In this method, a hierarchical decomposition of the given set of data objects
is created. Hierarchical methods can be classified on the basis of how the hierarchical
decomposition is formed. There are two types of approaches for the creation of a hierarchical
decomposition:
Agglomerative Approach: The agglomerative approach is also known as the bottom-up approach.
Initially, each object forms its own separate group. Thereafter it keeps on merging the objects or
the groups that are close to one another, i.e., that exhibit similar properties. This merging process
continues until the termination condition holds.
Divisive Approach: The divisive approach is also known as the top-down approach. In this
approach, we would start with the data objects that are in the same cluster. The group of individual
clusters is divided into small clusters by continuous iteration. The iteration continues until the
condition of termination is met or until each cluster contains one object.
Once a group is split or merged it can never be undone, as this is a rigid method and not so
flexible. The two approaches which can be used to improve the quality of hierarchical clustering in
data mining are:
One should carefully analyze the linkages of the objects at every partitioning of the hierarchical
clustering.
One can combine hierarchical agglomeration with iterative relocation: first, the objects are
grouped into micro-clusters; after grouping data objects into micro-clusters, macro clustering is
performed on the micro-clusters (see the sketch below).
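A minimal sketch of this micro-cluster/macro-cluster idea (not the source's own example; it assumes scikit-learn and NumPy and uses synthetic 2-D data):

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
# Toy 2-D data: three loose blobs of points.
data = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ([0, 0], [3, 3], [0, 4])])

# Phase 1: group the objects into many small micro-clusters.
micro = KMeans(n_clusters=15, n_init=10, random_state=0).fit(data)

# Phase 2: macro clustering performed on the micro-cluster centroids.
macro = AgglomerativeClustering(n_clusters=3).fit(micro.cluster_centers_)

# Map every original point to its macro cluster via its micro-cluster label.
labels = macro.labels_[micro.labels_]
print(labels[:10])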
Density-Based Method: The density-based method mainly focuses on density. In this method, a
given cluster keeps on growing as long as the density in the neighbourhood exceeds some
threshold, i.e., for each data point within a given cluster, the neighbourhood of a given radius has
to contain at least a minimum number of points. A sketch of this idea follows.
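A brief, hedged sketch of a density-based algorithm (DBSCAN from scikit-learn, which the source does not name explicitly): eps is the neighbourhood radius and min_samples is the minimum number of points that radius must contain.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two dense blobs plus a few scattered noise points.
data = np.vstack([
    rng.normal([0, 0], 0.2, size=(40, 2)),
    rng.normal([3, 3], 0.2, size=(40, 2)),
    rng.uniform(-2, 5, size=(5, 2)),
])

# A cluster keeps growing while each point's eps-neighbourhood holds >= min_samples points.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(data)
print(set(labels))   # e.g. {0, 1, -1}; the label -1 marks points treated as noise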
Grid-Based Method: In the grid-based method, a grid is formed over the objects, i.e., the
object space is quantized into a finite number of cells that form a grid structure. One of the major
advantages of the grid-based method is its fast processing time, which is dependent only on the
number of cells in each dimension of the quantized space.
Model-Based Method: In the model-based method, a model is hypothesized for each cluster in
order to find the data that best fits that model. A density function may be used to
locate the clusters for a given model. This approach reflects the spatial distribution of the data
points and also provides a way to automatically determine the number of clusters based on
standard statistics, taking outliers or noise into account. Therefore it yields robust clustering methods.
Constraint-Based Method: The constraint-based clustering method is performed by the
incorporation of application or user-oriented constraints. A constraint refers to the user expectation
or the properties of the desired clustering results. Constraints provide us with an interactive way of
communication with the clustering process. The user or the application requirement can specify
constraints.
Applications Of Cluster Analysis:
It is widely used in image processing, data analysis, and pattern recognition.
It helps marketers to find the distinct groups in their customer base and they can characterize
their customer groups by using purchasing patterns.
It can be used in the field of biology, by deriving animal and plant taxonomies and identifying
genes with the same capabilities.
It also helps in information discovery by classifying documents on the web.
Partitioning Method (K-Mean) in Data Mining
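The heading above refers to k-means, the classic partitioning method based on iterative relocation. What follows is a minimal, hedged sketch of Lloyd's algorithm in plain Python (the sample points and the value of k are illustrative, not from the source):

import random

def k_means(points, k, iterations=20):
    # Start from k randomly chosen points as the initial centroids.
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                              + (p[1] - centroids[i][1]) ** 2)
            clusters[idx].append(p)
        # Relocation step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

pts = [(1, 1), (1.2, 0.8), (5, 5), (5.1, 4.9), (9, 1), (8.8, 1.2)]
centroids, clusters = k_means(pts, k=3)
print(centroids)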
Rough Set
The notion of rough sets was introduced by Z. Pawlak in his seminal paper of 1982 (Pawlak
1982). It is a formal theory derived from fundamental research on the logical properties of information
systems. Rough set theory has become a methodology for database mining or knowledge discovery
in relational databases. In its abstract form, it is a new area of uncertainty mathematics closely
related to fuzzy theory. We can use the rough set approach to discover structural relationships within
imprecise and noisy data. Rough sets and fuzzy sets are complementary generalizations of
classical sets: the approximation spaces of rough set theory are sets with multiple memberships,
while fuzzy sets are concerned with partial memberships. The rapid development of these two
approaches provides a basis for "soft computing," a term initiated by Lotfi A. Zadeh. Soft computing
includes, along with rough sets, at least fuzzy logic, neural networks, probabilistic reasoning, belief
networks, machine learning, evolutionary computing, and chaos theory.
Basic problems in data analysis solved by Rough Set:
Characterization of a set of objects in terms of attribute values.
Finding dependency between the attributes.
Reduction of superfluous attributes.
Finding the most significant attributes.
Decision rule generation.
Goals of Rough Set Theory –
The main goal of rough set analysis is the induction of (learning of) approximations of concepts.
Rough sets constitute a sound basis for KDD: they offer mathematical tools to discover patterns
hidden in data.
They can be used for feature selection, feature extraction, data reduction, decision rule generation,
and pattern extraction (templates, association rules), etc.
Rough set analysis identifies partial or total dependencies in data, eliminates redundant data, and
gives an approach to handling null values, missing data, dynamic data, and more. A small sketch of
the lower and upper approximations follows.
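A small, hedged sketch of the core rough-set constructs (equivalence classes from indiscernibility, and lower/upper approximations of a concept) on a toy information table of my own invention:

from collections import defaultdict

# Toy information table: each object is described by two condition attributes.
objects = {
    "o1": ("high", "yes"), "o2": ("high", "yes"), "o3": ("low", "no"),
    "o4": ("low", "no"),   "o5": ("high", "no"),
}
# Target concept X: the set of objects we want to approximate.
X = {"o1", "o3", "o5"}

# Indiscernibility: objects with identical attribute values fall into one equivalence class.
classes = defaultdict(set)
for name, values in objects.items():
    classes[values].add(name)

# Lower approximation: classes entirely contained in X (objects certainly in X).
lower = {o for c in classes.values() if c <= X for o in c}
# Upper approximation: classes that intersect X (objects possibly in X).
upper = {o for c in classes.values() if c & X for o in c}

print("lower:", lower)   # {'o5'}
print("upper:", upper)   # all five objects; the difference is the boundary region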
In time-series data, the data is measured as a long series of numerical or textual values
at equal time intervals, e.g., per minute, per hour, or per day. Time-series data mining is
performed on data obtained from stock markets, scientific observations, and medical
data. In time-series mining it is often not possible to find data that exactly matches a
given query, so we employ the similarity search method, which finds data sequences
that are similar to the given query sequence. In the similarity search method, subsequence
matching is performed to find the subsequences that are similar to a given query
sequence. In order to perform the similarity search, dimensionality reduction is applied to the
complex data to transform the time-series data into a reduced numerical representation.
Symbolic sequences are composed of long nominal data sequences, which dynamically
change their behavior over time intervals. Examples of symbolic sequences
include online customer shopping sequences as well as sequences of events in
experiments. Mining of symbolic sequences is called sequential pattern mining. A sequential
pattern is a subsequence that occurs frequently in a set of sequences, so mining finds
the most frequent subsequences in a set of sequences. Many scalable algorithms have been
built to find frequent subsequences, and there are also algorithms to mine multidimensional
and multilevel sequential patterns.
Biological sequences are long sequences of nucleotides or amino acids, and data mining of
biological sequences is required to find features of, for example, human DNA. Comparing and
aligning biological sequences is the first step of such data mining. Two species are similar to each
other only if their nucleotide (DNA, RNA) and protein sequences are close and similar. During the
data mining of biological sequences, the degree of similarity between nucleotide sequences is
measured. The degree of similarity obtained by sequence alignment of nucleotides is
essential in determining the homology between two sequences.
Alignment may involve two or more input biological sequences, identifying similar sequences
that share long subsequences. Amino acid (protein) sequences are also compared and aligned.
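As a hedged, simplified illustration of a similarity degree between sequences (real tools such as BLAST use local alignment and substitution scoring; this toy function only measures percent identity between two already-aligned sequences of equal length):

def percent_identity(seq_a, seq_b):
    # Compare two pre-aligned sequences of equal length, position by position.
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)

# Hypothetical short nucleotide sequences, assumed to be already aligned.
print(percent_identity("ACGTTGCA", "ACGTAGCA"))   # 87.5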
Graph pattern mining can be done using Apriori-based and pattern-growth-based
approaches. We can mine the subgraphs of a graph and the set of closed graphs. A
closed graph g is a graph that does not have a supergraph carrying the same
support count as g. Graph pattern mining is applied to different types of graphs such as
frequent graphs, coherent graphs, and dense graphs. We can also improve the mining
efficiency by applying user constraints on the graph patterns. Graph patterns are of
two types: homogeneous graphs, where the nodes or links of the graph are of the same
type and have similar features, and heterogeneous graph patterns, where the nodes and
links are of different types.
5. Statistical Modeling of Networks:
A network is a collection of nodes where each node represents the data and the nodes
are linked through edges, representing relationships between data objects. If all the
nodes and links connecting the nodes are of the same type, then the network is
homogeneous such as a friend network or a web page network. If the nodes and links
connecting the nodes are of different types, then the network is heterogeneous such as
health-care networks (linking the different parameters such as doctors, nurses,
patients, diseases together in the network). Graph Pattern Mining can be further
applied to the network to derive the knowledge and useful patterns from the network.
Spatial data is geospatial data that is stored in large data repositories. Spatial data is
represented in "vector" format and as geo-referenced multimedia.
A spatial database is constructed from large geographic data warehouses by
integrating geographical data from multiple sources and areas. We can construct spatial
data cubes that contain information about spatial dimensions and measures, and it is
possible to perform OLAP operations on the spatial data for spatial data analysis.
Spatial data mining is performed on spatial data warehouses, spatial databases, and
other geospatial data repositories; it discovers knowledge about geographic areas.
The preprocessing of spatial data involves several operations like spatial clustering,
spatial classification, spatial modeling, and outlier detection in spatial data.
Multimedia data objects include image data, video data, audio data, website hyperlinks,
and linkages. Multimedia data mining tries to find interesting patterns in
multimedia databases. This includes the processing of digital data and tasks like image
processing, image classification, video and audio data mining, and pattern recognition.
Multimedia data mining is becoming a most interesting research area because data from
social media platforms like Twitter and Facebook can be analyzed with it to derive
interesting trends and patterns.
Web mining is essential to discover crucial patterns and knowledge from the Web. Web
content mining analyzes data of several websites which includes the web pages and
the multimedia data such as images in the web pages. Web mining is done to
understand the content of web pages, unique users of the website, unique hypertext
links, web page relevance and ranking, web page content summaries, time that the
users spent on the particular website, and understand user search patterns. Web
mining also finds out the best search engine and determines the search algorithm used
by it. So it helps improve search efficiency and finds the best search engine for the
users.
10. Mining Text Data:
Text mining is the subfield of data mining, machine learning, Natural Language
processing, and statistics. Most of the information in our daily life is stored as text such
as news articles, technical papers, books, email messages, blogs. Text Mining helps us
to retrieve high-quality information from text such as sentiment analysis, document
summarization, text categorization, text clustering. We apply machine learning models
and NLP techniques to derive useful information from the text. This is done by finding
out the hidden patterns and trends by means such as statistical pattern learning and
statistical language modeling. In order to perform text mining, we need to preprocess
the text by applying the techniques of stemming and lemmatization in order to convert
the textual data into data vectors.
Data that is related to both space and time is spatiotemporal data. Spatiotemporal
data mining retrieves interesting patterns and knowledge from spatiotemporal data.
Spatiotemporal data mining helps us to find the value of land, the age of rocks and
precious stones, and to predict weather patterns. Spatiotemporal data mining has
many practical applications like GPS in mobile phones, timers, Internet-based map
services, weather services, satellites, RFID tags, and sensors.
Stream data is data that changes dynamically; it is noisy and inconsistent and contains
multidimensional features of different data types, so it is often stored in NoSQL database
systems. The volume of stream data is very high, and this is the challenge for effective
mining of stream data. While mining data streams we need to perform tasks such as
clustering, outlier analysis, and the online detection of rare events in data streams.
Spatial Databases
Spatial data is associated with geographic locations such as cities, towns, etc. A spatial
database is optimized to store and query data representing objects that are defined in a
geometric space.
Characteristics of Spatial Database
A spatial database system has the following characteristics
It is a database system
It offers spatial data types (SDTs) in its data model and query language.
It supports spatial data types in its implementation, providing at least spatial indexing
and efficient algorithms for spatial join.
Example
A road map is a visualization of geographic information. A road map is a 2-dimensional
object which contains points, lines, and polygons that can represent cities, roads, and
political boundaries such as states or provinces.
In general, spatial data can be of two types −
Vector data: This data is represented as discrete points, lines, and polygons.
Raster data: This data is represented as a matrix of square cells.
The spatial data in the form of points, lines, polygons etc. is used by many different
databases as shown above.
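A brief, hedged sketch of vector spatial objects and typical spatial predicates, using the third-party shapely library (the coordinates are arbitrary placeholders):

from shapely.geometry import Point, LineString, Polygon  # pip install shapely

# Vector representations: a city as a point, a road as a line, a boundary as a polygon.
city     = Point(77.2, 28.6)
road     = LineString([(77.0, 28.5), (77.4, 28.7)])
boundary = Polygon([(76.8, 28.4), (77.6, 28.4), (77.6, 28.9), (76.8, 28.9)])

# Typical spatial predicates that a spatial database would index and answer.
print(boundary.contains(city))       # True
print(road.intersects(boundary))     # True
print(boundary.area)                 # area of the polygon in coordinate units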
Multimedia Database
A multimedia database is a collection of interrelated multimedia data that includes
text, graphics (sketches, drawings), images, animations, video, audio, etc., and has vast
amounts of multisource multimedia data. The framework that manages different types
of multimedia data so that they can be stored, delivered, and utilized in different ways is
known as a multimedia database management system. There are three classes of
multimedia database: static media, dynamic media, and dimensional media.
Content of Multimedia Database management system :
1. Media data – The actual data representing an object.
2. Media format data – Information such as sampling rate, resolution, encoding scheme
etc. about the format of the media data after it goes through the acquisition,
processing and encoding phase.
3. Media keyword data – Keyword descriptions relating to the generation of the data. It is
also known as content descriptive data. Example: date, time and place of recording.
4. Media feature data – Content dependent data such as the distribution of colors, kinds
of texture and different shapes present in data.
Types of multimedia applications based on data management characteristic are :
1. Repository applications – A large amount of multimedia data as well as meta-
data (media format data, media keyword data, media feature data) that is stored for
retrieval purposes, e.g., a repository of satellite images, engineering drawings, radiology
scanned pictures.
2. Presentation applications – They involve delivery of multimedia data subject to
temporal constraint. Optimal viewing or listening requires DBMS to deliver data at
certain rate offering the quality of service above a certain threshold. Here data is
processed as it is delivered. Example: Annotating of video and audio data, real-time
editing analysis.
3. Collaborative work using multimedia information – It involves executing a
complex task by merging drawings, changing notifications. Example: Intelligent
healthcare network.
There are still many challenges to multimedia databases, some of which are :
1. Modelling – Work in this area draws on both database and information retrieval
techniques; multimedia documents constitute a specialized area and deserve special
consideration.
2. Design – The conceptual, logical, and physical design of multimedia databases has not
yet been fully addressed, as performance and tuning issues at each level are far more
complex; the data comes in a variety of formats like JPEG, GIF, PNG, and MPEG, which are
not easy to convert from one form to another.
3. Storage – Storage of multimedia database on any standard disk presents the problem
of representation, compression, mapping to device hierarchies, archiving and buffering
during input-output operation. In DBMS, a ”BLOB”(Binary Large Object) facility allows
untyped bitmaps to be stored and retrieved.
4. Performance – For an application involving video playback or audio-video
synchronization, physical limitations dominate. The use of parallel processing may
alleviate some problems, but such techniques are not yet fully developed. Apart from
this, multimedia databases consume a lot of processing time as well as bandwidth.
5. Queries and retrieval – For multimedia data like images, video, and audio, accessing data
through queries opens up many issues like efficient query formulation, query execution,
and optimization, which need to be worked upon.
Areas where multimedia database is applied are :
Documents and record management : Industries and businesses that keep detailed
records and a variety of documents. Example: insurance claim records.
Knowledge dissemination : Multimedia database is a very effective tool for
knowledge dissemination in terms of providing several resources. Example: Electronic
books.
Education and training : Computer-aided learning materials can be designed using
multimedia sources which are nowadays very popular sources of learning. Example:
Digital libraries.
Marketing, advertising, retailing, entertainment and travel. Example: a virtual tour of
cities.
Real-time control and monitoring : Coupled with active database technology,
multimedia presentation of information can be very effective means for monitoring and
controlling complex tasks Example: Manufacturing operation control.
Data Mining – Time-Series, Symbolic and Biological Sequences Data
Data mining refers to extracting or mining knowledge from large amounts of data. In other
words, data mining is the science, art, and technology of exploring large and complex bodies of
data in order to discover useful patterns. Theoreticians and practitioners are continually seeking
improved techniques to make the process more efficient, cost-effective, and accurate.
This article discusses sequence data. The evolution of data has reached a great extent and will
continue in the future. To generalize, we classify such data as sequence data, graphs and
networks, and other kinds of data.
Time-Series Data:
In this type of sequence, the data are of numeric type and recorded at regular intervals. They are
generated by dynamic processes such as stock market activity and medical observations, and they
are useful for studying natural phenomena.
Nowadays these time series are often reduced to piecewise approximations for further analysis. In
time-series data, we look for subsequences that match the query we search for.
Time Series Forecasting: Forecasting is a method of making predictions based on past and
present data in order to know what will happen in the future. Trend analysis is one method of
forecasting time series: it finds recurring patterns in historical time series that are used for short-
and long-term predictions. We can observe various patterns in time series, like cyclic movements,
trend movements, and seasonal movements, with respect to time or season. ARIMA,
SARIMA, and long-memory time-series modeling are some of the popular methods for such analysis.
A small forecasting sketch follows.
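A minimal, hedged forecasting sketch with statsmodels (the series is synthetic and the ARIMA(1, 1, 1) order is an arbitrary illustrative choice, not a recommendation):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA  # third-party: statsmodels

# Synthetic monthly series with a mild upward trend plus noise.
rng = np.random.default_rng(0)
series = 100 + 0.5 * np.arange(48) + rng.normal(0, 2, 48)

# Fit a simple ARIMA(1, 1, 1) model and forecast the next 6 periods.
fitted = ARIMA(series, order=(1, 1, 1)).fit()
print(fitted.forecast(steps=6))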
Symbolic Data:
This type is an ordered set of elements or events, recorded with or without a concrete notion of
time. Symbolic sequences such as customer shopping sequences and web clickstreams are
examples of symbolic data. Sequential pattern mining is mainly used for symbolic sequences, and
constraint-based pattern matching is one of the best ways to incorporate user-defined constraints.
Apriori is an algorithm used for this type of analysis. An example of symbolic data is customers c1
and c2 purchasing products at different time intervals.
Biological Data:
They are made of DNA and protein sequences; they are very long and complicated but carry
hidden meaning. These data consist of sequences of nucleotides or amino acids. Their analysis
involves aligning, indexing, and analyzing biological sequences, and plays a crucial role in
bioinformatics and modern biology. Substitution matrices are used to score the probabilities of
amino acid substitutions. BLAST (Basic Local Alignment Search Tool) is one of the most effective
tools for biological sequence search.
There are three fundamental measures for assessing the quality of text retrieval −
Precision
Recall
F-score
Precision
Precision is the percentage of retrieved documents that are in fact relevant to the
query. Precision can be defined as −
Precision= |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall
Recall is the percentage of documents that are relevant to the query and were in fact
retrieved. Recall is defined as −
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
F-score
F-score is the commonly used trade-off measure. An information retrieval system often needs to
trade off recall for precision or vice versa. F-score is defined as the harmonic mean of recall and
precision:
F-score = (2 × Precision × Recall) / (Precision + Recall)
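A small sketch implementing the three formulas above on sets of document identifiers (the document IDs are made up; both sets are assumed non-empty):

def precision_recall_fscore(relevant, retrieved):
    # Set-based versions of the formulas above; assumes both sets are non-empty.
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

relevant  = {"d1", "d2", "d3", "d4"}   # documents truly relevant to the query
retrieved = {"d2", "d3", "d5"}         # documents the system actually returned
print(precision_recall_fscore(relevant, retrieved))   # approximately (0.667, 0.5, 0.571)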
Web mining can broadly be seen as the application of adapted data mining techniques
to the web, whereas data mining is defined as the application of algorithms to
discover patterns on mostly structured data, embedded into a knowledge discovery
process. A distinctive property of web mining is that it deals with a variety of data types.
The web has multiple aspects that yield different approaches for the mining process:
web pages consist of text, web pages are linked via hyperlinks, and user
activity can be monitored via web server logs. These three features lead to the
differentiation between three areas: web content mining, web structure mining,
and web usage mining.
Web content mining can be used to extract useful data, information, and knowledge from
web page content. In web content mining, each web page is considered as an
individual document. One can take advantage of the semi-structured nature
of web pages, as HTML provides information that concerns not only the layout but also
the logical structure. The primary task of content mining is data extraction, where
structured data is extracted from unstructured websites. The objective is to facilitate
data aggregation over various web sites by using the extracted structured data. Web
content mining can be utilized to distinguish topics on the web. For example, if a
user searches for a specific topic on a search engine, the user will get a list of
suggestions.
Web structure mining can be used to discover the link structure of hyperlinks. It is used
to identify the relationships between web pages given by the direct link network. In web
structure mining, one considers the web as a directed graph, with the web pages being
the vertices that are connected by hyperlinks. The most important application in this
regard is the Google search engine, which estimates the ranking of its results
primarily with the PageRank algorithm: it characterizes a page as highly
relevant when it is frequently linked to by other highly relevant pages. Structure and
content mining methodologies are usually combined; for example, web structure
mining can help organizations examine the network between two
commercial sites. A simple PageRank sketch follows.
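A compact, hedged sketch of the PageRank power iteration on a made-up four-page web graph (the damping factor 0.85 is the conventional choice; this is not Google's production algorithm):

def pagerank(links, damping=0.85, iterations=50):
    # links maps each page to the list of pages it links to.
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:                    # dangling page: spread its rank evenly
                share = damping * rank[page] / len(pages)
                for p in pages:
                    new_rank[p] += share
            else:                               # pass rank along each outgoing hyperlink
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical tiny web graph: a page pointed to by many pages ranks higher.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))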
Web usage mining is used to extract useful data, information, and knowledge from
weblog records, and it assists in recognizing user access patterns for web pages. In
mining the usage of web resources, one considers the records of requests made by
visitors to a website, which are often collected in web server logs. While the content
and structure of the collection of web pages follow the intentions of the authors of the
pages, the individual requests demonstrate how the consumers see these pages. Web
usage mining may therefore disclose relationships that were not intended by the creators
of the pages.
Some of the methods to identify and analyze web usage patterns are given below:
A report is created after this analysis, which contains details of repeatedly
visited web pages and common entry and exit points.
OLAP can be performed on various parts of the log-related data for a specific period.
The web presents incredible challenges for resource and knowledge discovery, based
on the following observations:
The web pages do not have a unifying structure. They are extremely complex compared to
traditional text documents. There are enormous numbers of documents in the digital library of
the web, and these libraries are not arranged according to any particular order.
The client network on the web is quickly expanding. These clients have different
interests, backgrounds, and usage purposes. There are over a hundred million
workstations connected to the internet, and the number is still increasing rapidly.
o Relevancy of data:
The size of the web is tremendous and rapidly increasing. It appears that the web is too
huge for data warehousing and data mining.
The web comprises pages as well as hyperlinks pointing from one page to another.
When the creator of a web page creates a hyperlink pointing to another web page, this can
be considered as the creator's endorsement of the other page. The collective
endorsement of a given page by various creators on the web may indicate the
importance of the page and may naturally lead to the discovery of authoritative web
pages. The web linkage data thus provide rich information about the relevance, quality, and
structure of the web's contents, and are a rich source for web mining.
Web mining has an extensive application because of various uses of the web. The list of
some applications of web mining is given below.