
DMDA MODULE 1

UNIT 1
DATA WAREHOUSE

Introduction to Data Warehouse:


A Data Warehouse (DW) is a relational database designed for query and
analysis rather than transaction processing. It includes historical data derived
from transaction data from single and multiple sources.
Data Warehouse is a subject-oriented, integrated, and time-variant store of
information in support of management's decisions.
A Data Warehouse provides integrated, enterprise-wide, historical data and
focuses on providing support for decision-makers for data modelling and
analysis.
A Data Warehouse is a group of data specific to the entire organisation, not only
to a particular group of users.
It is not used for daily operations and transaction processing but used for
making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various
applications.
o It supports a relatively small number of clients with relatively long
interactions.
o It includes current and historical data to provide a historical information
perspective.
o Its usage is read-intensive.
o It contains a few large tables.
Difference Between Operational Database Systems and Data Warehouses:

Characteristics of Data Warehouse


Subject-Oriented
A data warehouse targets the modelling and analysis of data for decision-makers. Therefore, data warehouses typically provide a concise and straightforward view of a particular subject, such as customer, product, or sales, instead of the organization's ongoing global operations. This is done by
excluding data that are not useful concerning the subject and including all data
needed by the users to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS,
flat files, and online transaction records. It requires performing data cleaning
and integration during data warehousing to ensure consistency in naming
conventions, attribute types, etc., among different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, or 12 months ago, or even older, from a data warehouse. This contrasts with a transaction system, where often only the most current data is kept.
Non-Volatile
The data warehouse is a physically separate data storage, which is transformed
from the source operational RDBMS. The operational updates of data do not
occur in the data warehouse, i.e., update, insert, and delete operations are not
performed. It usually requires only two procedures in data accessing: Initial
loading of data and access to data. Therefore, the DW does not require
transaction processing, recovery, and concurrency capabilities, which allows for
substantial speedup of data retrieval. Non-volatile means that once entered into
the warehouse, data should not change.
Data Warehouse Architecture
Three-Tier Data Warehouse Architecture
The Three-Tier Data Warehouse Architecture is the most commonly used design for building a Data Warehouse. It brings together the required Data Warehouse schema model, the required OLAP server type, and the required front-end tools for reporting or analysis. As the name suggests, it contains three tiers, the Bottom Tier, the Middle Tier and the Top Tier, which are procedurally linked with one another from the Bottom Tier (data sources) through the Middle Tier (OLAP servers) to the Top Tier (front-end tools).
Data Warehouse Architecture is the design based on which a Data Warehouse is
built, to accommodate the desired type of Data Warehouse Schema, user
interface application and database management system, for data organization
and repository structure. The type of Architecture is chosen based on the
requirements provided by the project team. The Three-Tier Data Warehouse Architecture is the commonly used choice due to the level of detail in its structure.
The three different tiers here are termed as:

Top-Tier
Middle-Tier
Bottom-Tier
Each tier can have different components based on the requirements presented by the project's decision-makers, but those components are specific to the role of their respective tier.
1. Bottom Tier
The Bottom Tier in the three-tier architecture of a data warehouse consists of
the Data Repository. Data Repository is the storage space for the data extracted
from various data sources, which undergoes a series of activities as a part of the
ETL process. ETL stands for Extract, Transform and Load. As a preliminary step, before the data is loaded into the repository, all the relevant and required data is identified from the several source systems. This data is then cleaned to remove duplicate or junk records from its current storage units. The next step is to transform all this data into a single storage format. The final step of ETL is to load the data into the repository (a minimal code sketch of these steps follows the tool list below). A few commonly used ETL tools are:
 Informatica
 Microsoft SSIS
 Snaplogic
 Confluent
 Apache Kafka
 Alooma
 Ab Initio
 IBM Infosphere
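
The following is a minimal, illustrative ETL sketch in Python with pandas, not a reproduction of any of the tools above. The in-memory source tables stand in for real extracts (in practice they would come from CSV dumps, APIs, or operational databases), and the table, column, and database names are assumptions made for the example.

import sqlite3
import pandas as pd

# Extract: in practice, pull data from operational sources; here, toy tables.
orders = pd.DataFrame({"order_id": [1, 2, 2, 3],
                       "cust_id": [10, 11, 11, None],
                       "amount": [250.0, 99.5, 99.5, 40.0]})
customers = pd.DataFrame({"cust_id": [10, 11], "region": ["North", "South"]})

# Transform: remove duplicate/junk rows and bring the data into one format.
orders = orders.drop_duplicates().dropna(subset=["cust_id"])
merged = orders.merge(customers, on="cust_id", how="left")

# Load: write the consolidated table into the warehouse repository.
with sqlite3.connect("warehouse.db") as conn:
    merged.to_sql("fact_orders", conn, if_exists="replace", index=False)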
The storage type of the repository can be a relational database management
system or a multidimensional database management system. A relational
database system can hold simple relational data, whereas a multidimensional
database system can hold data with more than one dimension. Whenever the
Repository includes both relational and multidimensional database management
systems, there exists a metadata unit. As the name suggests, the metadata unit
consists of all the metadata fetched from both the relational database and
multidimensional database systems. This Metadata unit provides incoming data
to the next tier, that is, the middle tier. From the user’s standpoint, the data from
the bottom tier can be accessed only with the use of SQL queries. The
complexity of the queries depends on the type of database. Data from the
relational database system can be retrieved using simple queries, whereas the
multidimensional database system demands complex queries with multiple joins
and conditional statements.
2. Middle Tier
The Middle tier here is the tier with the OLAP servers. The Data Warehouse can
have more than one OLAP server, and it can have more than one type of OLAP
server model as well, which depends on the volume of the data to be processed
and the type of data held in the bottom tier. There are three types of OLAP server models:
ROLAP
 Relational online analytical processing is a model of online analytical
processing which carries out an active multidimensional breakdown of
data stored in a relational database, instead of redesigning a relational
database into a multidimensional database.
 This is applied when the repository consists of only the relational
database system in it.

MOLAP
 Multidimensional online analytical processing is another model of online analytical processing that maintains its catalogues and directories directly on a multidimensional database system.
 This is applied when the repository consists of only the multidimensional
database system in it.

HOLAP
 Hybrid online analytical processing is a hybrid of both relational and
multidimensional online analytical processing models.
 When the repository contains both the relational database management
system and the multidimensional database management system, HOLAP
is the best solution for a smooth functional flow between the database
systems. HOLAP allows for storing data in both relational and
multidimensional formats.
The Middle Tier acts as an intermediary component between the top tier and the
data repository, that is, the top tier and the bottom tier respectively. From the
user’s standpoint, the middle tier gives an idea about the conceptual outlook of
the database.
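
The ROLAP model described above essentially performs a multidimensional breakdown (aggregation along several dimensions) over data kept in relational tables. Below is a minimal, illustrative sketch in Python with pandas standing in for an OLAP server; the sales table, its columns, and its values are assumptions for the example.

import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "B"],
    "year":    [2023, 2023, 2024, 2024],
    "amount":  [100, 150, 200, 120],
})

# Multidimensional summary (region x product) with totals, akin to a slice of a cube.
cube = sales.pivot_table(index="region", columns="product",
                         values="amount", aggfunc="sum", margins=True)
print(cube)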

3. Top Tier

The Top Tier is a front-end layer, that is, the user interface that allows the user
to connect with the database systems. This user interface is usually a tool or an
API call, which is used to fetch the required data for Reporting, Analysis, and
Data Mining purposes. The type of tool depends purely on the form of outcome
expected. It could be a Reporting tool, an Analysis tool, a Query tool or a Data
mining tool.

The Top Tier must be uncomplicated in terms of usability; only user-friendly tools can give effective outcomes. Even when the bottom tier and middle tier are designed with the utmost care and clarity, if the Top Tier is equipped with a clumsy front-end tool, the whole Data Warehouse architecture can become an utter failure. This makes the selection of the user interface / front-end tool for the Top Tier, which serves as the face of the Data Warehouse system, a very significant part of designing the Three-Tier Data Warehouse Architecture.
Below are a few commonly used Top-Tier tools.
 IBM Cognos
 Microsoft BI Platform
 SAP Business Objects Web
 Pentaho
 Crystal Reports
 SAP BW
 SAS Business Intelligence
UNIT 2
DATA MINING

What is data mining?


Data mining is the process of sorting through large data sets to identify patterns
and relationships that can help solve business problems through data analysis.
Data mining techniques and tools enable enterprises to predict future trends and
make more informed business decisions.

Data mining is a key part of data analytics overall and one of the core
disciplines in data science, which uses advanced analytics techniques to find
useful information in data sets. At a more granular level, data mining is a step in
the knowledge discovery in databases (KDD) process, a data science
methodology for gathering, processing and analyzing data. Data mining and
KDD are sometimes referred to interchangeably, but they're more commonly
seen as distinct things.

KDD Process
KDD (Knowledge Discovery in Databases) is a process that involves the
extraction of useful, previously unknown, and potentially valuable information
from large datasets. The KDD process is iterative and may require multiple passes over these steps to extract accurate knowledge from the data. The following steps are included in the KDD process:

Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the collection.
1. Handling missing values.
2. Cleaning noisy data, where noise is a random or variance error.
3. Cleaning with data discrepancy detection and data transformation tools.

Data Integration
Data integration is defined as combining heterogeneous data from multiple sources into a common store (a Data Warehouse). Data integration uses data migration tools, data synchronization tools and the ETL (Extract, Transform, Load) process.

Data Selection
Data selection is defined as the process where data relevant to the analysis is decided upon and retrieved from the data collection. Methods such as neural networks, decision trees, naive Bayes, clustering, and regression can be used for this.

Data Transformation
Data transformation is defined as the process of transforming data into the appropriate form required by the mining procedure. It is a two-step process:
1. Data mapping: assigning elements from the source base to the destination to capture transformations.
2. Code generation: creation of the actual transformation program.

Data Mining
Data mining is defined as the application of techniques to extract potentially useful patterns. It transforms task-relevant data into patterns and decides the purpose of the model, using classification or characterization.

Pattern Evaluation
Pattern evaluation is defined as identifying the patterns representing knowledge, based on given interestingness measures. It finds an interestingness score for each pattern and uses summarization and visualization to make the data understandable to the user.

Knowledge Representation
This involves presenting the results in a way that is meaningful and can be used
to make decisions.
Note: KDD is an iterative process where evaluation measures can be enhanced,
mining can be refined, new data can be integrated and transformed in order to
get different and more appropriate results. Preprocessing of databases consists
of Data cleaning and Data Integration.
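
The following is a compact sketch of the KDD steps on toy, in-memory data; in practice the sources would be real files or databases. All names and values here are assumptions for illustration, and clustering is used only as one possible mining step.

import pandas as pd
from sklearn.cluster import KMeans

# 1-2. Cleaning + integration: combine two hypothetical sources, drop noise.
store_a = pd.DataFrame({"age": [23, 45, 45, None], "spend": [120, 430, 430, 60]})
store_b = pd.DataFrame({"age": [31, 52, 36], "spend": [200, 510, 150]})
df = pd.concat([store_a, store_b]).drop_duplicates().dropna()

# 3. Selection: keep only the attributes relevant to the analysis task.
task_data = df[["age", "spend"]]

# 4. Transformation: rescale to a common [0, 1] range.
task_data = (task_data - task_data.min()) / (task_data.max() - task_data.min())

# 5. Data mining: discover customer groups with clustering.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(task_data)

# 6-7. Pattern evaluation + knowledge presentation: summarise each group.
print(df.assign(cluster=labels).groupby("cluster")[["age", "spend"]].mean())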

CHALLENGES

Incomplete and noisy data:

The process of extracting useful data from large volumes of data is data mining.
The data in the real-world is heterogeneous, incomplete, and noisy. Data in huge
quantities will usually be inaccurate or unreliable. These problems may occur due to errors in data measuring instruments or because of human errors. Suppose a retail chain collects the phone numbers of customers who spend more than $500, and the accounting employees put the information into their system. A person may make a digit mistake when entering the phone number, which results in incorrect data. Some customers may also not be willing to disclose their phone numbers, which results in incomplete data. The data could even get changed due to human or system error. All these consequences (noisy and incomplete data) make data mining challenging.

Data Distribution:
Real-world data is usually stored on various platforms in a distributed computing environment. It might be in databases, individual systems, or even on the internet. Practically, it is quite a tough task to bring all the data into a centralized data repository, mainly due to organizational and technical concerns. For example, various regional offices may have their own servers to store their data, and it is not feasible to store all the data from all the offices on a central server.
Therefore, data mining requires the development of tools and algorithms that
allow the mining of distributed data.

Complex Data:
Real-world data is heterogeneous, and it could be multimedia data, including
audio and video, images, complex data, spatial data, time series, and so on.
Managing these various types of data and extracting useful information is a
tough task. Most of the time, new technologies, new tools, and methodologies
would have to be refined to obtain specific information.

Performance:
The data mining system's performance relies primarily on the efficiency of
algorithms and techniques used. If the designed algorithm and techniques are
not up to the mark, then the efficiency of the data mining process will be
affected adversely.

Data Privacy and Security:


Data mining usually leads to serious issues in terms of data security,
governance, and privacy. For example, if a retailer analyzes the details of the
purchased items, then it reveals data about buying habits and preferences of the
customers without their permission.

Data Visualization:
In data mining, data visualization is a very important process because it is the
primary method that shows the output to the user in a presentable way. The
extracted data should convey the exact meaning of what it intends to express.
But many times, representing the information to the end-user in a precise and
easy way is difficult. Since the input data and the output information can be complicated, very efficient and effective data visualization processes need to be implemented to make this successful.

Tasks of Data Mining


Data mining involves six common classes of tasks:
Anomaly detection (Outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or data errors that require further investigation.
Association rule learning (Dependency modelling) – Searches for
relationships between variables. For example a supermarket might
gather data on customer purchasing habits. Using association rule
learning, the supermarket can determine which products are
frequently bought together and use this information for marketing
purposes. This is sometimes referred to as market basket analysis.
Clustering – is the task of discovering groups and structures in the
data that are in some way or another "similar", without using known
structures in the data.
Classification – is the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam" (a small sketch of this task follows the list).
Regression – attempts to find a function which models the data with
the least error.
Summarization – providing a more compact representation of
the data set, including visualization and report generation.
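
Below is a minimal sketch of the classification task mentioned in the list: learning a known structure (labelled examples) and applying it to new data. The tiny e-mail texts, labels, and model choice (a naive Bayes classifier over word counts) are assumptions made purely for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "meeting agenda attached",
          "free money claim now", "project status update"]
labels = ["spam", "legitimate", "spam", "legitimate"]

vec = CountVectorizer()
X = vec.fit_transform(emails)          # bag-of-words features
model = MultinomialNB().fit(X, labels) # learn the known structure

new_email = ["claim your free prize"]
print(model.predict(vec.transform(new_email)))   # expected: ['spam']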

Data preprocessing is an important step in the data mining process. It refers to


the cleaning, transforming, and integrating of data in order to make it ready for
analysis. The goal of data preprocessing is to improve the quality of the data and
to make it more suitable for the specific data mining task.
Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or
inconsistencies in the data, such as missing values, outliers, and duplicates.
Various techniques can be used for data cleaning, such as imputation, removal,
and transformation.
Data Integration: This involves combining data from multiple sources to
create a unified dataset. Data integration can be challenging as it requires
handling data with different formats, structures, and semantics. Techniques such
as record linkage and data fusion can be used for data integration.
Data Transformation: This involves converting the data into a suitable format
for analysis. Common techniques used in data transformation include
normalization, standardization, and discretization. Normalization is used to
scale the data to a common range, while standardization is used to transform the
data to have zero mean and unit variance. Discretization is used to convert
continuous data into discrete categories.
Data Reduction: This involves reducing the size of the dataset while
preserving the important information. Data reduction can be achieved through
techniques such as feature selection and feature extraction. Feature selection
involves selecting a subset of relevant features from the dataset, while feature
extraction involves transforming the data into a lower-dimensional space while
preserving the important information.
Data Discretization: This involves dividing continuous data into discrete
categories or intervals. Discretization is often used in data mining and machine
learning algorithms that require categorical data. Discretization can be achieved
through techniques such as equal width binning, equal frequency binning, and
clustering.
Data Normalization: This involves scaling the data to a common range, such
as between 0 and 1 or -1 and 1. Normalization is often used to handle data with
different units and scales. Common normalization techniques include min-max
normalization, z-score normalization, and decimal scaling.
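
The following is a short sketch of the normalization techniques just named, applied to assumed toy values with scikit-learn scalers and a manual decimal-scaling step.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[18.0], [25.0], [40.0], [60.0], [75.0]])

print(MinMaxScaler().fit_transform(ages))     # min-max: scaled to [0, 1]
print(StandardScaler().fit_transform(ages))   # z-score: zero mean, unit variance

# Decimal scaling: divide by 10^j, where j is the smallest integer such that
# max(|value|) / 10^j < 1 (here j = 2, so divide by 100).
j = int(np.ceil(np.log10(np.abs(ages).max())))
print(ages / (10 ** j))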
Data preprocessing plays a crucial role in ensuring the quality of data and the
accuracy of the analysis results. The specific steps involved in data
preprocessing may vary depending on the nature of the data and the analysis
goals.
By performing these steps, the data mining process becomes more efficient
and the results become more accurate.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique which is used to transform the raw data into a useful and efficient format.
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part,
data cleaning is done. It involves handling of missing data, noisy data etc.

 (a). Missing Data:


This situation arises when some values are missing in the data. It can be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite
large and multiple values are missing within a tuple.
2. Fill the Missing values:
There are various ways to do this task. You can choose to fill the
missing values manually, by attribute mean or the most probable
value.

 (b). Noisy Data:


Noisy data is meaningless data that can’t be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and then various methods are performed to complete the task. Each segment is handled separately. One can replace all data in a segment by its mean, or boundary values can be used to complete the task (a small binning sketch follows this list).
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the clusters.
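
Below is a small sketch of smoothing noisy values by binning, as referenced in the binning item above. It uses equal-frequency (equal-depth) bins and smoothing by bin means; the price values and the choice of three bins are assumptions for the example.

import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

bins = pd.qcut(prices, q=3, labels=False)            # 3 equal-frequency bins
smoothed = prices.groupby(bins).transform("mean")    # smoothing by bin means
print(pd.DataFrame({"original": prices, "bin": bins, "smoothed": smoothed}))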

2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following ways:
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0
or 0.0 to 1.0)
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of numeric attribute by interval
levels or conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from lower level to higher level in
hierarchy. For example, the attribute “city” can be converted to “country”.
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves
reducing the size of the dataset while preserving the important information.
This is done to improve the efficiency of data analysis and to avoid
overfitting of the model. Some common steps involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features from
the dataset. Feature selection is often performed to remove irrelevant or
redundant features from the dataset. It can be done using various techniques
such as correlation analysis, mutual information, and principal component
analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-
dimensional space while preserving the important information. Feature
extraction is often used when the original features are high-dimensional and
complex. It can be done using techniques such as PCA, linear discriminant
analysis (LDA), and non-negative matrix factorization (NMF).
Sampling: This involves selecting a subset of data points from the dataset.
Sampling is often used to reduce the size of the dataset while preserving the
important information. It can be done using techniques such as random
sampling, stratified sampling, and systematic sampling.
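
The following is a short sketch of reducing a dataset by sampling, using assumed toy data: a simple random sample with pandas and a stratified sample (preserving class proportions) with scikit-learn.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"value": range(100), "label": ["a"] * 80 + ["b"] * 20})

random_sample = df.sample(frac=0.2, random_state=0)            # 20% simple random sample
strat_sample, _ = train_test_split(df, train_size=0.2,
                                   stratify=df["label"], random_state=0)
print(strat_sample["label"].value_counts())                    # ~80/20 ratio preserved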
Clustering: This involves grouping similar data points together into clusters.
Clustering is often used to reduce the size of the dataset by replacing similar
data points with a representative centroid. It can be done using techniques
such as k-means, hierarchical clustering, and density-based clustering.
Compression: This involves compressing the dataset while preserving the
important information. Compression is often used to reduce the size of the
dataset for storage and transmission purposes. It can be done using
techniques such as wavelet compression, JPEG compression, and gzip
compression.

MISSING DATA/ MISSING IMPUTATIONS

Missing data is defined as values that are not stored (or not present) for some variable(s) in the given dataset. For example, in the Titanic dataset, the columns ‘Age’ and ‘Cabin’ have some missing values; in the dataset, a blank indicates a missing value. In Pandas, missing values are usually represented by NaN, which stands for Not a Number.
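
Below is a small sketch of detecting missing values with pandas; the tiny DataFrame is an assumption standing in for the Titanic sample mentioned above.

import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Allen", "Bonnell", "Dooley"],
                   "Age": [29.0, np.nan, 27.0],
                   "Cabin": ["C85", None, None]})

print(df.isnull())          # True where a value is missing (NaN/None)
print(df.isnull().sum())    # count of missing values per column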

Why Is Data Missing From the Dataset?


There can be multiple reasons why certain values are missing from the data.
The reason data is missing from the dataset affects the approach to handling the missing data. So it’s necessary to understand why the data could be missing.
Some of the reasons are listed below:
 Past data might get corrupted due to improper maintenance.
 Observations are not recorded for certain fields due to some reasons.
There might be a failure in recording the values due to human error.
 The user has not provided the values intentionally
 Item nonresponse: This means the participant refused to respond.

Types of Missing Values


Formally the missing values are categorized as follows:
Missing Completely At Random (MCAR)
 In MCAR, the probability of data being missing is the same for all the
observations. In this case, there is no relationship between the missing
data and any other values observed or unobserved (the data which is not
recorded) within the given dataset. That is, missing values are completely
independent of other data. There is no pattern.
 In the case of MCAR data, the value could be missing due to human
error, some system/equipment failure, loss of sample, or some
unsatisfactory technicalities while recording the values. For Example,
suppose in a library there are some overdue books. Some values of
overdue books in the computer system are missing. The reason might be a
human error, like the librarian forgetting to type in the values. So, the
missing values of overdue books are not related to any other variable/data
in the system. MCAR should not be assumed by default, as it is a rare case. The advantage of such data is that the statistical analysis remains unbiased.

Missing At Random (MAR)


 MAR data means that the reason for missing values can be explained by
variables on which you have complete information, as there is some
relationship between the missing data and other values/data. In this case,
the data is not missing for all the observations. It is missing only within
sub-samples of the data, and there is some pattern in the missing values.
 For example, if you check the survey data, you may find that all the
people have answered their ‘Gender,’ but ‘Age’ values
are mostly missing for people who have answered their ‘Gender’ as
‘female.’ (The reason being most of the females don’t want to reveal their
age.)
 So, the probability of data being missing depends only on the observed
value or data. In this case, the variables ‘Gender’ and ‘Age’ are related.
The reason for missing values of the ‘Age’ variable can be explained by
the ‘Gender’ variable, but you can not predict the missing value itself.
 Suppose a poll is taken for overdue books in a library. Gender and the
number of overdue books are asked in the poll. Assume that most of the
females answer the poll and men are less likely to answer. So why the
data is missing can be explained by another factor, that is gender. In this
case, the statistical analysis might result in bias. Getting an unbiased
estimate of the parameters can be done only by modeling the missing
data.

Missing Not At Random (MNAR)


 Missing values depend on the unobserved data. If there is some
structure/pattern in missing data and other observed data can not
explain it, then it is considered to be Missing Not At Random (MNAR).
 If the missing data does not fall under the MCAR or MAR, it can be
categorized as MNAR. It can happen due to the reluctance of people to
provide the required information. A specific group of respondents may
not answer some questions in a survey.
 For example, suppose the name and the number of overdue books are
asked in the poll for a library. So most of the people having no overdue
books are likely to answer the poll. People having more overdue books
are less likely to answer the poll. So, in this case, the missing value of the
number of overdue books depends on the people who have more books
overdue.
 Another example is that people having less income may refuse to share
some information in a survey or questionnaire.
 In the case of MNAR as well, the statistical analysis might result in bias.

7 ways to handle missing values in the dataset:


1. Deleting Rows with missing values
2. Impute missing values for continuous variable
3. Impute missing values for categorical variable
4. Other Imputation Methods
5. Using Algorithms that support missing values
6. Prediction of missing values
7. Imputation using Deep Learning Library — Datawig

Deleting Rows with Missing Values:


One straightforward approach is to remove entire rows that contain missing
values. While this method ensures that there are no missing values left in the
dataset, it may lead to a significant loss of data, especially if many rows have
missing values.

Impute Missing Values for Continuous Variables:


For numerical (continuous) variables, missing values can be filled in using
various methods such as mean, median, or mode imputation. Choosing the
appropriate method depends on the distribution of the data. Mean imputation
involves replacing missing values with the mean of the available data, while
median imputation uses the median, and mode imputation uses the mode (most
frequent value).

Impute Missing Values for Categorical Variables:


Categorical variables can be imputed using the mode, i.e., replacing missing
values with the most frequently occurring category in the variable. This method
works well when the categorical variables have a clear dominant category.
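
The following is a minimal sketch of the first three strategies above on an assumed toy DataFrame: deleting rows, median imputation for a numeric column, and mode imputation for a categorical column.

import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22, np.nan, 35, 29, np.nan],
                   "Embarked": ["S", "C", None, "S", "S"]})

dropped = df.dropna()                                             # 1. delete rows with any missing value
df["Age"] = df["Age"].fillna(df["Age"].median())                  # 2. continuous: median imputation
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])  # 3. categorical: mode imputation
print(dropped, df, sep="\n")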

Other Imputation Methods:


There are more advanced imputation techniques like k-nearest neighbors (KNN)
imputation, regression imputation, and interpolation methods. KNN imputation
involves finding 'k' nearest neighbors for a missing value and imputing it based
on the values from these neighbors. Regression imputation involves predicting
missing values using a regression model built on the non-missing data.
Interpolation methods estimate missing values based on the patterns observed in
the existing data points.
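
Below is a short sketch of KNN imputation using scikit-learn's KNNImputer, with toy values assumed: each missing entry is filled in from the k nearest rows, measured on the observed features.

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [np.nan, 8.0]])

imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))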

Using Algorithms that Support Missing Values:


Some machine learning algorithms, like Random Forests and XGBoost,
inherently handle missing values. These algorithms can make splits and
decisions based on the available data without requiring imputation. It's
important to note that not all algorithms can handle missing values, so choosing
the right algorithm is crucial.

Prediction of Missing Values:


Missing values can be predicted using machine learning models. For instance, a
regression model can be trained to predict missing continuous values, and a
classification model can be used for missing categorical values. Once the model
is trained, it can predict missing values based on the patterns found in the rest of
the data.

Imputation Using Deep Learning Library — Datawig:


Datawig is a deep learning library specifically designed for imputing missing
values. It can handle both numerical and categorical variables. Datawig uses
neural networks to learn patterns from the available data and predict missing
values accordingly. Deep learning techniques can capture complex relationships
in the data, making them useful for imputation tasks.

DIMENSIONALITY REDUCTION
Dimensionality reduction is a technique used to reduce the number of features
in a dataset while retaining as much of the important information as possible. In
other words, it is a process of transforming high-dimensional data into a lower-
dimensional space that still preserves the essence of the original data.
In machine learning, high-dimensional data refers to data with a large number of
features or variables. The curse of dimensionality is a common problem in
machine learning, where the performance of the model deteriorates as the
number of features increases. This is because the complexity of the model
increases with the number of features, and it becomes more difficult to find a
good solution. In addition, high-dimensional data can also lead to overfitting,
where the model fits the training data too closely and does not generalize well to
new data.
Dimensionality reduction can help to mitigate these problems by reducing the
complexity of the model and improving its generalization performance. There
are two main approaches to dimensionality reduction: feature selection and
feature extraction.

There are two components of dimensionality reduction:


 Feature selection: In this, we try to find a subset of the original set of
variables, or features, to get a smaller subset which can be used to model
the problem. It usually involves three ways:
1. Filter
2. Wrapper
3. Embedded
 Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional space, i.e., a space with a smaller number of dimensions.
Methods of Dimensionality Reduction
The various methods used for dimensionality reduction include:
 Principal Component Analysis (PCA)
 Linear Discriminant Analysis (LDA)
 Generalized Discriminant Analysis (GDA)
Dimensionality reduction may be either linear or non-linear, depending upon the method used. The prime linear method, called Principal Component Analysis, or PCA, is illustrated with a short sketch below.
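
A minimal PCA sketch with scikit-learn, using randomly generated toy data (an assumption for illustration): 4-dimensional samples are projected onto the 2 principal components that retain most of the variance.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 3] = X[:, 0] * 2 + X[:, 1]          # make one feature redundant

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # variance retained by each component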
Advantages of Dimensionality Reduction
 It helps in data compression, and hence reduced storage space.
 It reduces computation time.
 It also helps remove redundant features, if any.
 Improved Visualization: High dimensional data is difficult to visualize,
and dimensionality reduction techniques can help in visualizing the data
in 2D or 3D, which can help in better understanding and analysis.
 Overfitting Prevention: High dimensional data may lead to overfitting in
machine learning models, which can lead to poor generalization
performance. Dimensionality reduction can help in reducing the
complexity of the data, and hence prevent overfitting.
 Feature Extraction: Dimensionality reduction can help in extracting
important features from high dimensional data, which can be useful in
feature selection for machine learning models.
 Data Preprocessing: Dimensionality reduction can be used as a
preprocessing step before applying machine learning algorithms to reduce
the dimensionality of the data and hence improve the performance of the
model.
 Improved Performance: Dimensionality reduction can help in improving
the performance of machine learning models by reducing the complexity
of the data, and hence reducing the noise and irrelevant information in the
data.
Disadvantages of Dimensionality Reduction
 It may lead to some amount of data loss.
 PCA tends to find linear correlations between variables, which is
sometimes undesirable.
 PCA fails in cases where mean and covariance are not enough to define
datasets.
 We may not know how many principal components to keep; in practice, some rules of thumb are applied.
 Interpretability: The reduced dimensions may not be easily interpretable,
and it may be difficult to understand the relationship between the original
features and the reduced dimensions.
 Overfitting: In some cases, dimensionality reduction may lead to
overfitting, especially when the number of components is chosen based
on the training data.
 Sensitivity to outliers: Some dimensionality reduction techniques are
sensitive to outliers, which can result in a biased representation of the
data.
 Computational complexity: Some dimensionality reduction techniques,
such as manifold learning, can be computationally intensive, especially
when dealing with large datasets.

What is Feature Selection?


A feature is an attribute that has an impact on a problem or is useful for the
problem, and choosing the important features for the model is known as feature
selection. Each machine learning process depends on feature engineering, which
mainly contains two processes: Feature Selection and Feature
Extraction. Although feature selection and extraction processes may have the
same objective, both are completely different from each other. The main
difference between them is that feature selection is about selecting the subset of
the original feature set, whereas feature extraction creates new features. Feature
selection is a way of reducing the input variables for the model by using only relevant data, in order to reduce overfitting in the model.

So, we can define feature Selection as, "It is a process of automatically or


manually selecting the subset of most appropriate and relevant features to be
used in model building." Feature selection is performed by either including the
important features or excluding the irrelevant features in the dataset without
changing them.

Need for Feature Selection


Before implementing any technique, it is really important to understand the need for it, and the same applies to feature selection. As we know, in machine learning it is necessary to provide a pre-processed and good input dataset in order to get better outcomes. We collect a huge amount of data to train our model and help it learn better. Generally, the dataset consists of noisy data, irrelevant data, and some useful data. Moreover, the huge amount of data also slows down the training process of the model, and with noise and irrelevant data, the model may not predict and perform well. So, it is necessary to remove such noise and less important data from the dataset, and feature selection techniques are used to do this.

Selecting the best features helps the model to perform well. For example,
Suppose we want to create a model that automatically decides which car should
be crushed for a spare part, and to do this, we have a dataset. This dataset
contains the model of the car, the year, the owner's name, and the mileage. In this dataset, the name of the owner does not contribute to the model performance, as it does not decide whether the car should be crushed, so we can remove this column and select the rest of the features (columns) for model building.

Below are some benefits of using feature selection in machine learning:


o It helps in avoiding the curse of dimensionality.
o It helps in the simplification of the model so that it can be easily
interpreted by the researchers.
o It reduces the training time.
o It reduces overfitting and hence enhances generalization.

Feature Selection Techniques


There are mainly two types of Feature Selection techniques, which are:
o Supervised Feature Selection technique
Supervised Feature selection techniques consider the target variable and
can be used for the labelled dataset.
o Unsupervised Feature Selection technique
Unsupervised Feature selection techniques ignore the target variable and
can be used for the unlabelled dataset.
Feature selection is a very complicated and vast field of machine learning, and
lots of studies are already made to discover the best methods. There is no fixed
rule for the best feature selection method. However, the choice of method depends on the machine learning engineer, who can combine and innovate approaches to find the best method for a specific problem. One should try a variety of model fits on different subsets of features selected through different statistical measures (a small filter-method sketch follows below).
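
Below is a short sketch of a supervised, filter-style feature selection with scikit-learn's SelectKBest; the Iris dataset and the choice of the ANOVA F-score with k=2 are assumptions made purely for illustration.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=2)   # keep the 2 most relevant features
X_selected = selector.fit_transform(X, y)

print(selector.get_support())    # boolean mask of the chosen features
print(X_selected.shape)          # (150, 2)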

Discretization in data mining

Data discretization refers to a method of converting a huge number of data


values into smaller ones so that the evaluation and management of data become
easy. In other words, data discretization is a method of converting attributes
values of continuous data into a finite set of intervals with minimum data loss.
There are two forms of data discretization: the first is supervised discretization, and the second is unsupervised discretization. Supervised discretization refers to a method in which the class data is used. Unsupervised discretization refers to a method that depends on the way the operation proceeds, i.e., it works with a top-down splitting strategy or a bottom-up merging strategy.

Now, we can understand this concept with the help of an example


Suppose we have an attribute Age with the given values:

Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77
(Table before discretization)

After discretization:

Age 1, 5, 4, 9, 7             -> Child
Age 11, 14, 17, 13, 18, 19    -> Young
Age 31, 33, 36, 42, 44, 46    -> Mature
Age 70, 74, 77, 78            -> Old
Another example is web analytics, where we gather statistics about website visitors. For example, all visitors who visit the site with an IP address from India are shown under the country-level value "India".

Some Famous techniques of data discretization


Histogram analysis
A histogram is a plot used to represent the underlying frequency distribution of a continuous data set. Histograms assist in inspecting the data distribution, for example for outliers, skewness, or normality.

Binning
Binning refers to a data smoothing technique that helps to group a huge number of continuous values into a smaller number of bins. This technique can also be used for data discretization and the development of concept hierarchies.

Cluster Analysis
Cluster analysis is a form of data discretization: a clustering algorithm is executed to partition the values of a numeric attribute x into clusters or groups, and each cluster becomes one discrete interval (see the sketch below).
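
The following sketch loosely maps the three techniques above onto scikit-learn's KBinsDiscretizer strategies: 'uniform' for equal-width (histogram-style) bins, 'quantile' for equal-frequency binning, and 'kmeans' for cluster-based discretization. The age values and the choice of four bins are assumptions for the example.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[1], [5], [9], [4], [7], [11], [14], [17], [13], [18],
                 [19], [31], [33], [36], [42], [44], [46], [70], [74], [78]], dtype=float)

for strategy in ("uniform", "quantile", "kmeans"):
    disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    print(strategy, disc.fit_transform(ages).ravel())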

Data discretization using decision tree analysis


This approach to data discretization uses decision tree analysis, in which a top-down splitting technique is applied. It is done through a supervised procedure. In a numeric
attribute discretization, first, you need to select the attribute that has the least
entropy, and then you need to run it with the help of a recursive process. The
recursive process divides it into various discretized disjoint intervals, from top
to bottom, using the same splitting criterion.

Data discretization using correlation analysis


When discretizing data using a correlation (e.g., linear regression) analysis, you can find the best neighbouring intervals, and then the large intervals are combined to develop a larger overlap and form the final overlapping intervals. It is a supervised procedure.

What is Binarization?
Binarization is the process of transforming data features of any entity
into vectors of binary numbers to make classifier algorithms more efficient. In a
simple example, transforming an image’s grey scale from the 0-255 spectrum to
a 0-1 spectrum is binarization.

Binarization is a process used to transform the data features of an entity into binary numbers so that algorithms can classify more efficiently. To convert data to binary, we can apply a binary threshold: every value above the threshold is marked as 1, and all values equal to or below the threshold are marked as 0. This is called binarizing your data. It can be helpful when you have values that you want to turn into crisp (0/1) values.

How is Binarization used?


In machine learning, even the most complex concepts can be transformed into
binary form. For example, to binarize the sentence “The dog ate the cat,” every
word is assigned an ID (for example dog-1, ate-2, the-3, cat-4). Then each word is replaced with its ID to give the vector <3,1,2,3,4>, which can be refined into a binary form by providing each word with four possible slots and setting the slot that corresponds to the specific word: <0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1>. This is commonly referred to as the bag-of-words method.

Since the ultimate goal is to make this data easier for the classifier to read while
minimizing memory usage, it’s not always necessary to encode the whole
sentence or all the details of a complex concept. In this case, only the current
state of how the data is parsed is needed for the classifier. For example, when
the top word on the stack is used as the first word in the input queue. Since
order is quite important, a simpler binary vector is preferable.
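
Below is a small sketch of the two ideas above: thresholding numeric values with scikit-learn's Binarizer, and producing a binary bag-of-words vector for the example sentence. The pixel values and the 0.5 threshold are assumptions for illustration.

from sklearn.preprocessing import Binarizer
from sklearn.feature_extraction.text import CountVectorizer

# Threshold binarization: values above 0.5 become 1, the rest become 0.
pixels = [[0.1, 0.6, 0.9], [0.4, 0.5, 0.7]]
print(Binarizer(threshold=0.5).fit_transform(pixels))

# Binary bag-of-words: each slot marks whether a vocabulary word occurs.
vec = CountVectorizer(binary=True)
print(vec.fit_transform(["The dog ate the cat"]).toarray())
print(vec.get_feature_names_out())   # vocabulary order of the slots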
What is data transformation?
Data transformation is the process of converting data from one format, such as a
database file, XML document or Excel spreadsheet, into another.
Transformations typically involve converting a raw data source into
a cleansed, validated and ready-to-use format. Data transformation is crucial to
data management processes that include data integration, data migration, data
warehousing and data preparation.

The process of data transformation can also be referred to as


extract/transform/load (ETL). The extraction phase involves identifying and
pulling data from the various source systems that create data and then moving
the data to a single repository. Next, the raw data is cleansed, if needed. It's then
transformed into a target format that can be fed into operational systems or into
a data warehouse, a data lake or another repository for use in business
intelligence and analytics applications. The transformation may involve
converting data types, removing duplicate data and enriching the source data.

Data transformation is crucial to processes that include data integration, data


management, data migration, data warehousing and data wrangling.

It is also a critical component for any organization seeking to leverage its data
to generate timely business insights. As the volume of data has proliferated,
organizations must have an efficient way to harness data to effectively put it to
business use. Data transformation is one element of harnessing this data,
because -- when done properly -- it ensures data is easy to access, consistent,
secure and ultimately trusted by the intended business users.

Data analysts, data engineers and data scientists are typically in charge of data
transformation within an organization. They identify the source data, determine
the required data formats and perform data mapping, as well as execute the
actual transformation process before moving the data into appropriate databases
for storage and use.
Their work involves five main steps:
1. data discovery, in which data professionals use data profiling tools or
profiling scripts to understand the structure and characteristics of the data
and also to determine how it should be transformed;
2. data mapping, during which data professionals connect, or match, data
fields from one source to data fields in another;
3. code generation, a part of the process where the software code required
to transform the data is created (either by data transformation tools or the
data professionals themselves writing script);
4. execution of the code, where the data undergoes the transformation; and
5. review, during which data professionals or the business/end users confirm
that the output data meets the established transformation requirements
and, if not, address and correct any anomalies and errors.

Examples of data transformation


There are various data transformation methods, including the following:
 aggregation, in which data is collected from multiple sources and stored
in a single format;
 attribute construction, in which new attributes are added or created
from existing attributes;
 discretization, which involves converting continuous data values into
sets of data intervals with specific values to make the data more
manageable for analysis;
 generalization, where low-level data attributes are converted into high-
level data attributes (for example, converting data from multiple brackets
broken up by ages into the more general "young" and "old" attributes) to
gain a more comprehensive view of the data;
 integration, a step that involves combining data from different sources
into a single view;
 manipulation, where the data is changed or altered to make it more
readable and organized;
 normalization, a process that converts source data into another format to
limit the occurrence of duplicated data; and
 smoothing, which uses algorithms to reduce "noise" in data sets, thereby
helping to more efficiently and effectively identify trends in the data.

Measures of Similarity and Dissimilarity

Similarity measure
 is a numerical measure of how alike two data objects are.
 higher when objects are more alike.
 often falls in the range [0,1]
Similarity might be used to identify
 duplicate data that may have differences due to typos.
 equivalent instances from different data sets. E.g. names and/or addresses
that are the same but have misspellings.
 groups of data that are very close (clusters)
Dissimilarity measure
 is a numerical measure of how different two data objects are
 lower when objects are more alike
 minimum dissimilarity is often 0, while the upper limit varies depending on how much variation can occur
Dissimilarity might be used to identify
 outliers
 interesting exceptions, e.g. credit card fraud
 boundaries to clusters
Proximity refers to either a similarity or a dissimilarity.
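
The following is a short sketch of proximity measures on two assumed data objects: Euclidean distance as a dissimilarity (0 when identical, larger when more different) and cosine similarity as a similarity (closer to 1 when more alike).

import numpy as np

x = np.array([1.0, 3.0, 5.0])
y = np.array([2.0, 4.0, 6.0])

dissimilarity = np.linalg.norm(x - y)                                # Euclidean distance
similarity = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))  # cosine similarity
print(dissimilarity, similarity)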
