Dmda M1
UNIT 1
DATA WAREHOUSE
Top-Tier
Middle-Tier
Bottom-Tier
Each tier can have different components based on the requirements presented
by the decision-makers of the project, but these components are subject to the
nature of their respective tier.
1. Bottom Tier
The Bottom Tier in the three-tier architecture of a data warehouse consists of
the Data Repository. Data Repository is the storage space for the data extracted
from various data sources, which undergoes a series of activities as a part of the
ETL process. ETL stands for Extract, Transform and Load. As a preliminary
step, before the data is loaded into the repository, all the relevant and required
data are identified from the several source systems. These data are then cleaned
up to remove duplicate or junk records from their current storage units. The
next step is to transform all these data into a single storage format. The final
step of ETL is to load the data into the repository; a small sketch of this flow
appears after the tool list below. A few commonly used ETL tools are:
Informatica
Microsoft SSIS
Snaplogic
Confluent
Apache Kafka
Alooma
Ab Initio
IBM Infosphere
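As an illustration of the Extract-Transform-Load flow described above, the following is a minimal sketch using pandas, with SQLite standing in for the repository. The file contents, table name and columns (order_id, customer_id, fact_orders, and so on) are hypothetical.

import sqlite3
from io import StringIO
import pandas as pd

# Extract: in practice this would read from source systems, e.g. pd.read_csv("orders.csv");
# inline CSV text keeps the sketch self-contained
orders = pd.read_csv(StringIO(
    "order_id,customer_id,order_date,amount\n"
    "1,10,2023-01-05,250\n"
    "1,10,2023-01-05,250\n"
    "2,11,2023-02-17,90\n"))
customers = pd.read_csv(StringIO("customer_id,name\n10,Asha\n11,Ravi\n"))

# Transform: remove duplicate/junk rows and bring everything into a single format
orders = orders.drop_duplicates()
orders["order_date"] = pd.to_datetime(orders["order_date"])
staged = orders.merge(customers, on="customer_id", how="left")

# Load: write the unified data into the repository (SQLite used as a stand-in)
with sqlite3.connect("warehouse.db") as conn:
    staged.to_sql("fact_orders", conn, if_exists="replace", index=False)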
The storage type of the repository can be a relational database management
system or a multidimensional database management system. A relational
database system can hold simple relational data, whereas a multidimensional
database system can hold data that has more than one dimension. Whenever the
Repository includes both relational and multidimensional database management
systems, there exists a metadata unit. As the name suggests, the metadata unit
consists of all the metadata fetched from both the relational database and
multidimensional database systems. This Metadata unit provides incoming data
to the next tier, that is, the middle tier. From the user’s standpoint, the data from
the bottom tier can be accessed only with the use of SQL queries. The
complexity of the queries depends on the type of database. Data from the
relational database system can be retrieved using simple queries, whereas the
multidimensional database system demands complex queries with multiple joins
and conditional statements.
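To illustrate the difference in query complexity, here is a small sketch using Python's built-in sqlite3 module. The tables and columns (customers, dim_time, fact_sales) are invented for the example, and a real multidimensional system would usually be queried through an OLAP interface rather than plain SQL.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A flat relational table plus a small star schema (fact + dimension) for illustration
cur.executescript("""
CREATE TABLE customers (id INTEGER, name TEXT, city TEXT);
CREATE TABLE dim_time  (time_id INTEGER, year INTEGER, quarter INTEGER);
CREATE TABLE fact_sales(time_id INTEGER, customer_id INTEGER, amount REAL);
""")

# Simple query against the relational table
cur.execute("SELECT name, city FROM customers WHERE city = 'Pune'")

# Multidimensional-style query: multiple joins plus grouping over the star schema
cur.execute("""
SELECT t.year, c.city, SUM(f.amount)
FROM fact_sales f
JOIN dim_time  t ON f.time_id = t.time_id
JOIN customers c ON f.customer_id = c.id
GROUP BY t.year, c.city
""")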
2. Middle Tier
The Middle tier here is the tier with the OLAP servers. The Data Warehouse can
have more than one OLAP server, and it can have more than one type of OLAP
server model as well, which depends on the volume of the data to be processed
and the type of data held in the bottom tier. There are three types of OLAP
server models:
ROLAP
Relational online analytical processing is a model of online analytical
processing that carries out dynamic multidimensional analysis of data
stored in a relational database, instead of redesigning the relational
database into a multidimensional database.
This is applied when the repository consists of only the relational
database system in it.
MOLAP
Multidimensional online analytical processing is another model of online
analytical processing that builds its catalogs and indexes directly on a
multidimensional database system.
This is applied when the repository consists of only the multidimensional
database system in it.
HOLAP
Hybrid online analytical processing is a hybrid of both relational and
multidimensional online analytical processing models.
When the repository contains both the relational database management
system and the multidimensional database management system, HOLAP
is the best solution for a smooth functional flow between the database
systems. HOLAP allows for storing data in both relational and
multidimensional formats.
The Middle Tier acts as an intermediary between the top tier and the data
repository, that is, between the top tier and the bottom tier. From the user's
standpoint, the middle tier presents a conceptual view of the database.
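As a rough illustration of what an OLAP server does conceptually, the sketch below rolls relational rows up into a small two-dimensional "cube" using a pandas pivot table. The data and column names are made up, and real OLAP servers are of course far more sophisticated.

import pandas as pd

# Relational (row-oriented) sales records; the values are invented for illustration
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount":  [120.0, 150.0, 90.0, 110.0],
})

# Roll the rows up into a small cube: one dimension per axis, SUM as the measure
cube = sales.pivot_table(index="region", columns="quarter",
                         values="amount", aggfunc="sum")
print(cube)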
3. Top Tier
The Top Tier is a front-end layer, that is, the user interface that allows the user
to connect with the database systems. This user interface is usually a tool or an
API call, which is used to fetch the required data for Reporting, Analysis, and
Data Mining purposes. The type of tool depends purely on the form of outcome
expected. It could be a Reporting tool, an Analysis tool, a Query tool or a Data
mining tool.
Data mining is a key part of data analytics overall and one of the core
disciplines in data science, which uses advanced analytics techniques to find
useful information in data sets. At a more granular level, data mining is a step in
the knowledge discovery in databases (KDD) process, a data science
methodology for gathering, processing and analyzing data. Data mining and
KDD are sometimes referred to interchangeably, but they're more commonly
seen as distinct things.
KDD Process
KDD (Knowledge Discovery in Databases) is a process that involves the
extraction of useful, previously unknown, and potentially valuable information
from large datasets. The KDD process is iterative and typically requires
multiple passes over its steps to extract accurate knowledge from the data. The
following steps are included in the KDD process:
Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the
collection. It involves the following (a short sketch appears after this list):
1. Cleaning in case of Missing values.
2. Cleaning noisy data, where noise is a random or variance error.
3. Cleaning with Data discrepancy detection and Data transformation tools.
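A minimal sketch of these cleaning steps with pandas, assuming a hypothetical table with 'age' and 'salary' columns:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 31, 400],            # a missing value and an implausible entry
    "salary": [50000, 52000, None, None, 58000],
})

df = df.drop_duplicates()                            # discrepancy detection: drop repeated rows
df["age"] = df["age"].fillna(df["age"].median())     # handle missing values
df["salary"] = df["salary"].fillna(df["salary"].mean())
df["age"] = df["age"].clip(upper=100)                # crude treatment of a noisy outlier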
Data Integration
Data integration is defined as combining heterogeneous data from multiple
sources into a common store (the data warehouse). It is performed using data
migration tools, data synchronization tools and the ETL (Extract, Transform,
Load) process.
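A small sketch of data integration with pandas, assuming two hypothetical sources (a CRM extract and a billing extract) that are merged into one warehouse-ready table:

import pandas as pd

# Two heterogeneous sources; names and columns are invented for the example
crm     = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Ravi"]})
billing = pd.DataFrame({"customer": [1, 2], "total_spent": [500.0, 320.0]})

# Integrate them into one common table keyed on the customer identifier
unified = crm.merge(billing, left_on="cust_id", right_on="customer", how="inner")
unified = unified.drop(columns=["customer"])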
Data Selection
Data selection is defined as the process where data relevant to the analysis is
decided and retrieved from the data collection. For this we can use neural
networks, decision trees, Naive Bayes, clustering, and regression methods.
Data Transformation
Data Transformation is defined as the process of transforming data into the
appropriate form required by the mining procedure. It is a two-step process
(a brief sketch follows the list):
1. Data Mapping: Assigning elements from source base to destination to
capture transformations.
2. Code generation: Creation of the actual transformation program.
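A tiny sketch of the two steps, assuming hypothetical source fields CUST_NM and AMT_USD that are mapped to destination fields customer_name and amount_usd:

import pandas as pd

source = pd.DataFrame({"CUST_NM": ["Asha"], "AMT_USD": ["1,200"]})

# Data mapping: source fields -> destination fields (the mapping itself is made up)
column_map = {"CUST_NM": "customer_name", "AMT_USD": "amount_usd"}

# Code generation/execution: the actual transformation program derived from the mapping
dest = source.rename(columns=column_map)
dest["amount_usd"] = dest["amount_usd"].str.replace(",", "").astype(float)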
Data Mining
Data mining is defined as the set of techniques applied to extract potentially
useful patterns. It transforms task-relevant data into patterns and decides the
purpose of the model using classification or characterization.
Pattern Evaluation
Pattern Evaluation is defined as identifying interesting patterns representing
knowledge based on given measures. It finds the interestingness score of each
pattern and uses summarization and visualization to make the data
understandable to the user.
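As an illustration of an interestingness score, the sketch below computes support and confidence for a made-up association rule {bread} -> {butter} over a toy set of transactions:

# Interestingness of the rule {bread} -> {butter}, measured by support and confidence
transactions = [
    {"bread", "butter"},
    {"bread"},
    {"bread", "butter", "jam"},
    {"milk"},
]

n = len(transactions)
support_both = sum(1 for t in transactions if {"bread", "butter"} <= t) / n   # 0.50
support_lhs  = sum(1 for t in transactions if "bread" in t) / n               # 0.75

confidence = support_both / support_lhs                                       # ~0.67
print(f"support={support_both:.2f}, confidence={confidence:.2f}")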
Knowledge Representation
This involves presenting the results in a way that is meaningful and can be used
to make decisions.
Note: KDD is an iterative process where evaluation measures can be enhanced,
mining can be refined, new data can be integrated and transformed in order to
get different and more appropriate results. Preprocessing of databases consists
of Data cleaning and Data Integration.
CHALLENGES
Incomplete and noisy data:
The process of extracting useful data from large volumes of data is data mining.
The data in the real world is heterogeneous, incomplete, and noisy. Data in huge
quantities is usually inaccurate or unreliable. These problems may occur due to
errors in the measuring instruments or because of human errors. Suppose a retail
chain collects the phone numbers of customers who spend more than $500, and
the accounting employees enter this information into their system. A person may
mistype a digit when entering the phone number, which results in incorrect data.
Some customers may not be willing to disclose their phone numbers, which
results in incomplete data. The data could also get changed due to human or
system error. All of these issues (noisy and incomplete data) make data mining
challenging.
Data Distribution:
Real-world data is usually stored on various platforms in a distributed
computing environment. It might be in databases, individual systems, or even
on the internet. Practically, it is quite a tough task to bring all the data into a
centralized data repository, mainly due to organizational and technical concerns.
For example, various regional offices may have their own servers to store their
data, and it is not feasible to store all the data from all the offices on a central
server. Therefore, data mining requires the development of tools and algorithms
that allow the mining of distributed data.
Complex Data:
Real-world data is heterogeneous, and it could be multimedia data, including
audio and video, images, complex data, spatial data, time series, and so on.
Managing these various types of data and extracting useful information is a
tough task. Most of the time, new technologies, tools, and methodologies have
to be developed or refined to obtain the specific information.
Performance:
The data mining system's performance relies primarily on the efficiency of
algorithms and techniques used. If the designed algorithm and techniques are
not up to the mark, then the efficiency of the data mining process will be
affected adversely.
Data Visualization:
In data mining, data visualization is a very important process because it is the
primary method that shows the output to the user in a presentable way. The
extracted data should convey the exact meaning of what it intends to express.
But many times, representing the information to the end user in a precise and
easy way is difficult. Because both the input data and the output information are
complicated, very efficient and effective data visualization techniques need to
be applied to make the presentation successful.
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the
mining process. It involves the following ways (a small sketch follows this list):
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0
or 0.0 to 1.0)
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute with interval
levels or conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the
hierarchy. For example, the attribute “city” can be converted to
“country”.
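A small sketch of normalization and concept hierarchy generation with pandas; the salary values and the city-to-country mapping are invented for the example:

import pandas as pd

df = pd.DataFrame({"salary": [30000, 45000, 60000],
                   "city": ["Pune", "Lyon", "Osaka"]})

# Normalization: min-max scaling of salary into the range 0.0 to 1.0
s = df["salary"]
df["salary_scaled"] = (s - s.min()) / (s.max() - s.min())

# Concept hierarchy generation: map the lower-level "city" to the higher-level "country"
city_to_country = {"Pune": "India", "Lyon": "France", "Osaka": "Japan"}
df["country"] = df["city"].map(city_to_country)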
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves
reducing the size of the dataset while preserving the important information.
This is done to improve the efficiency of data analysis and to avoid
overfitting of the model. Some common steps involved in data reduction are
described below, followed by a short sketch of two of them:
Feature Selection: This involves selecting a subset of relevant features from
the dataset. Feature selection is often performed to remove irrelevant or
redundant features from the dataset. It can be done using various techniques
such as correlation analysis, mutual information, and principal component
analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-
dimensional space while preserving the important information. Feature
extraction is often used when the original features are high-dimensional and
complex. It can be done using techniques such as PCA, linear discriminant
analysis (LDA), and non-negative matrix factorization (NMF).
Sampling: This involves selecting a subset of data points from the dataset.
Sampling is often used to reduce the size of the dataset while preserving the
important information. It can be done using techniques such as random
sampling, stratified sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into clusters.
Clustering is often used to reduce the size of the dataset by replacing similar
data points with a representative centroid. It can be done using techniques
such as k-means, hierarchical clustering, and density-based clustering.
Compression: This involves compressing the dataset while preserving the
important information. Compression is often used to reduce the size of the
dataset for storage and transmission purposes. It can be done using
techniques such as wavelet compression, JPEG compression, and gzip
compression.
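The sketch below illustrates two of these steps, feature extraction with PCA and random sampling, on synthetic data using NumPy and scikit-learn:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))          # synthetic high-dimensional data

# Feature extraction: project the 20 original features onto 3 principal components
X_reduced = PCA(n_components=3).fit_transform(X)

# Sampling: keep a random 10% of the rows
sample_idx = rng.choice(len(X_reduced), size=100, replace=False)
X_sample = X_reduced[sample_idx]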
Missing data is defined as values that are not stored (or not present) for some
variable(s) in the given dataset. In a sample from the Titanic dataset, for
example, the columns ‘Age’ and ‘Cabin’ have some missing values; the blanks
in the data represent the missing values.
In pandas, missing values are usually represented by NaN, which stands for Not
a Number.
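A short sketch of detecting and handling NaN values with pandas, using a made-up slice resembling the 'Age' and 'Cabin' columns mentioned above:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, np.nan],
    "Cabin": ["C85", None, None, "E46"],
})

print(df.isna().sum())                               # count missing values per column
df["Age"] = df["Age"].fillna(df["Age"].median())     # impute Age with the median
df = df.drop(columns=["Cabin"])                      # or drop a column that is mostly missing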
DIMENSIONALITY REDUCTION
Dimensionality reduction is a technique used to reduce the number of features
in a dataset while retaining as much of the important information as possible. In
other words, it is a process of transforming high-dimensional data into a lower-
dimensional space that still preserves the essence of the original data.
In machine learning, high-dimensional data refers to data with a large number of
features or variables. The curse of dimensionality is a common problem in
machine learning, where the performance of the model deteriorates as the
number of features increases. This is because the complexity of the model
increases with the number of features, and it becomes more difficult to find a
good solution. In addition, high-dimensional data can also lead to overfitting,
where the model fits the training data too closely and does not generalize well to
new data.
Dimensionality reduction can help to mitigate these problems by reducing the
complexity of the model and improving its generalization performance. There
are two main approaches to dimensionality reduction: feature selection and
feature extraction.
Selecting the best features helps the model to perform well. For example,
suppose we want to create a model that automatically decides which car should
be crushed for spare parts, and to do this, we have a dataset. This dataset
contains the model of the car, the year, the owner's name, and the miles driven.
In this dataset, the name of the owner does not contribute to the model's
performance, as it does not help decide whether the car should be crushed, so
we can remove this column and select the rest of the features (columns) for
model building.
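A minimal sketch of this kind of feature selection with pandas; the column names (model, year, owner_name, miles) are hypothetical:

import pandas as pd

cars = pd.DataFrame({
    "model": ["A", "B"], "year": [2005, 2012],
    "owner_name": ["Asha", "Ravi"], "miles": [180000, 60000],
})

# "owner_name" carries no signal for the crush/keep decision, so drop it
features = cars.drop(columns=["owner_name"])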
Binning
Binning refers to a data smoothing technique that helps to group a huge number
of continuous values into a smaller number of bins. This technique can also be
used for data discretization and the development of concept hierarchies.
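A small sketch of binning with pandas.cut, grouping made-up age values into three equal-width interval labels:

import pandas as pd

ages = pd.Series([3, 17, 25, 41, 68, 90])

# Equal-width bins smooth the raw values into a few interval labels
age_groups = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])
print(age_groups)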
Cluster Analysis
Cluster analysis is a form of data discretization. A clustering algorithm can be
applied to discretize a numeric attribute x by partitioning its values into clusters
or groups.
What is Binarization?
Binarization is the process of transforming data features of any entity
into vectors of binary numbers to make classifier algorithms more efficient. In a
simple example, transforming an image’s grey scale from the 0-255 spectrum to
a 0-1 spectrum is binarization.
Binarization is a process that is used to transform data features of any entity
into binary numbers so that classifier algorithms can work more efficiently. To
convert data into binary form, we can apply a binary threshold: all values above
the threshold are marked as 1, and all values equal to or below the threshold are
marked as 0. This is called binarizing your data. It can be helpful when you
have values that you want to turn into crisp values.
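A minimal sketch of threshold binarization with NumPy; the values and the 0.5 threshold are arbitrary:

import numpy as np

values = np.array([0.2, 0.7, 0.5, 0.9, 0.1])

# Everything above the threshold becomes 1, everything at or below becomes 0
threshold = 0.5
binary = (values > threshold).astype(int)   # -> [0, 1, 0, 1, 0]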
Since the ultimate goal is to make this data easier for the classifier to read while
minimizing memory usage, it is not always necessary to encode the whole
sentence or all the details of a complex concept. In such cases, only the current
state of how the data is parsed is needed for the classifier, for example, when
the top word on the stack is used as the first word in the input queue. Since
order is quite important, a simple binary vector is preferable.
What is data transformation?
Data transformation is the process of converting data from one format, such as a
database file, XML document or Excel spreadsheet, into another.
Transformations typically involve converting a raw data source into
a cleansed, validated and ready-to-use format. Data transformation is crucial to
data management processes that include data integration, data migration, data
warehousing and data preparation.
It is also a critical component for any organization seeking to leverage its data
to generate timely business insights. As the volume of data has proliferated,
organizations must have an efficient way to harness data to effectively put it to
business use. Data transformation is one element of harnessing this data,
because -- when done properly -- it ensures data is easy to access, consistent,
secure and ultimately trusted by the intended business users.
Data analysts, data engineers and data scientists are typically in charge of data
transformation within an organization. They identify the source data, determine
the required data formats and perform data mapping, as well as execute the
actual transformation process before moving the data into appropriate databases
for storage and use.
Their work involves five main steps:
1. data discovery, in which data professionals use data profiling tools or
profiling scripts to understand the structure and characteristics of the data
and also to determine how it should be transformed;
2. data mapping, during which data professionals connect, or match, data
fields from one source to data fields in another;
3. code generation, a part of the process where the software code required
to transform the data is created (either by data transformation tools or the
data professionals themselves writing script);
4. execution of the code, where the data undergoes the transformation; and
5. review, during which data professionals or the business/end users confirm
that the output data meets the established transformation requirements
and, if not, address and correct any anomalies and errors.
Similarity measure
A similarity measure is a numerical measure of how alike two data objects are.
It is higher when objects are more alike and often falls in the range [0, 1].
Similarity might be used to identify:
1. duplicate data that may have differences due to typos;
2. equivalent instances from different data sets, e.g. names and/or addresses
that are the same but have misspellings;
3. groups of data that are very close (clusters).
Dissimilarity measure
A dissimilarity measure is a numerical measure of how different two data
objects are. It is lower when objects are more alike. The minimum dissimilarity
is often 0, while the upper limit varies depending on how much variation there
can be.
Dissimilarity might be used to identify:
1. outliers;
2. interesting exceptions, e.g. credit card fraud;
3. boundaries of clusters.
Proximity refers to either a similarity or a dissimilarity.
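A short sketch of one common similarity and one common dissimilarity measure with NumPy, cosine similarity and Euclidean distance, on two made-up vectors:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 4.0])

# Dissimilarity: Euclidean distance (0 when identical, grows as objects differ more)
euclidean = np.linalg.norm(x - y)

# Similarity: cosine similarity (closer to 1 when the objects are more alike)
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))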