
UNIT2-NOTES

Data transformation in data mining refers to the process of converting raw data into a format
that is suitable for analysis and modelling. The goal of data transformation is to prepare the
data for data mining so that it can be used to extract useful insights and knowledge. Data
transformation typically involves several steps, including:
1. Data cleaning: Removing or correcting errors, inconsistencies, and missing values
in the data.
2. Data integration: Combining data from multiple sources, such as databases and
spreadsheets, into a single format.
3. Data normalization: Scaling the data to a common range of values, such as
between 0 and 1, to facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a subset of
relevant features or attributes.
5. Data discretization: Converting continuous data into discrete categories or bins.
6. Data aggregation: Combining data at different levels of granularity, such as by
summing or averaging, to create new features or attributes.
Data transformation is an important step in the data mining process as it helps to ensure that the data is in a format that is suitable for analysis and modelling, and that it is free of errors and inconsistencies. Data transformation can also help to improve the performance of data mining algorithms, by reducing the dimensionality of the data and by scaling the data to a common range of values.
The data are transformed in ways that make them ideal for mining. Data transformation involves the following steps:
1. Smoothing: It is a process used to remove noise from the dataset using algorithms. It highlights the important features present in the dataset and helps in predicting patterns. When collecting data, it can be manipulated to eliminate or reduce variance or any other form of noise. The idea behind data smoothing is that it identifies simple changes that help predict trends and patterns. This helps analysts or traders who need to look at a lot of data, which can often be difficult to digest, to find patterns they would not otherwise see.
2. Aggregation: Data aggregation is the method of storing and presenting data in a summary format. The data may be obtained from multiple data sources and integrated into a single description for data analysis. This is a crucial step, since the accuracy of data analysis insights is highly dependent on the quantity and quality of the data used. Gathering accurate data of high quality, and in a large enough quantity, is necessary to produce relevant results. Aggregated data is useful for everything from decisions concerning product financing and business strategy to pricing, operations, and marketing strategies. For example, sales data may be aggregated to compute monthly and annual totals.
3. Discretization: It is the process of transforming continuous data into a set of small intervals. Many real-world data mining activities involve continuous attributes, yet many existing data mining frameworks are unable to handle them. Even when a data mining task can manage a continuous attribute, its efficiency can be significantly improved by replacing the continuous attribute with its discrete values. For example, numeric ranges such as 1-10 and 11-20, or age converted into the categories young, middle-aged, and senior.
4. Attribute Construction: New attributes are constructed from the given set of attributes and applied to assist the mining process. This simplifies the original data and makes the mining more efficient.
5. Generalization: It converts low-level data attributes into high-level data attributes using a concept hierarchy. For example, age values initially in numerical form (22, 25) are converted into categorical values (young, old). Similarly, categorical attributes such as house addresses may be generalized to higher-level definitions such as town or country.
6. Normalization: Data normalization involves converting all data variables into a given range. Techniques that are used for normalization are:
o Min-Max Normalization: This transforms the original data linearly. Suppose min_A is the minimum and max_A is the maximum of an attribute A, v is the value you want to map into the new range [new_min_A, new_max_A], and v' is the new value you get after normalizing the old value. Then v is normalized to v' by computing
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
o Z-Score Normalization: In z-score normalization (or zero-mean normalization), the values of an attribute A are normalized based on the mean of A and its standard deviation. A value, v, of attribute A is normalized to v' by computing
v' = (v - mean_A) / std_A
o Decimal Scaling: It normalizes the values of an attribute by shifting the position of their decimal point. The number of places by which the decimal point is moved is determined by the maximum absolute value of attribute A. A value, v, of attribute A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that Max(|v'|) < 1. Suppose the values of an attribute A vary from -99 to 99. The maximum absolute value is 99, so to normalize the values we divide them by 100 (i.e., j = 2, the number of digits in the largest absolute value), and the values come out as 0.98, 0.97 and so on.
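To make the three normalization techniques concrete, here is a minimal Python sketch assuming NumPy is available; the attribute values and the target range [0, 1] are made up for the illustration.

import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # hypothetical attribute values

# Min-max normalization into the new range [0, 1]
new_min, new_max = 0.0, 1.0
min_max = (values - values.min()) / (values.max() - values.min()) * (new_max - new_min) + new_min

# Z-score normalization: subtract the mean of A, divide by its standard deviation
z_score = (values - values.mean()) / values.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(values).max() + 1)))
decimal_scaled = values / 10 ** j

print(min_max, z_score, decimal_scaled, sep="\n")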

NUMERICALS DONE IN CLASS

ADVANTAGES AND DISADVANTAGES:

Advantages of Data Transformation in Data Mining:

1. Improves Data Quality: Data transformation helps to improve the quality of data by
removing errors, inconsistencies, and missing values.
2. Facilitates Data Integration: Data transformation enables the integration of data
from multiple sources, which can improve the accuracy and completeness of the
data.
3. Improves Data Analysis: Data transformation helps to prepare the data for analysis
and modeling by normalizing, reducing dimensionality, and discretizing the data.
4. Increases Data Security: Data transformation can be used to mask sensitive data, or
to remove sensitive information from the data, which can help to increase data
security.
5. Enhances Data Mining Algorithm Performance: Data transformation can improve
the performance of data mining algorithms by reducing the dimensionality of the
data and scaling the data to a common range of values.

Disadvantages of Data Transformation in Data Mining:

1. Time-consuming: Data transformation can be a time-consuming process, especially when dealing with large datasets.
2. Complexity: Data transformation can be a complex process, requiring specialized
skills and knowledge to implement and interpret the results.
3. Data Loss: Data transformation can result in data loss, such as when discretizing
continuous data, or when removing attributes or features from the data.
4. Biased transformation: Data transformation can result in bias, if the data is not
properly understood or used.
5. High cost: Data transformation can be an expensive process, requiring significant
investments in hardware, software, and personnel.

Data cleaning is a crucial process in data mining and plays an important part in building a model. It is a necessary step, yet it is often neglected. Data quality is the main issue in quality information management; data quality problems can occur anywhere in information systems, and these problems are addressed by data cleaning.

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabelled.

In most cases, data cleaning is a laborious process, and because cleaning data before mining is so time-consuming, it typically requires IT resources to help with the initial step of evaluating the data. But without proper data quality, your final analysis will suffer from inaccuracy, or you could potentially arrive at the wrong conclusion.

Steps of Data Cleaning

While the techniques used for data cleaning may vary according to the types of data your company stores, you can follow these basic steps to clean your data:

1. Remove duplicate or irrelevant observations


Remove unwanted observations from your dataset, including duplicate observations and irrelevant observations. Duplicate observations most often arise during data collection: when you combine data sets from multiple places, scrape data, or receive data from clients or multiple departments, there are opportunities to create duplicate data. De-duplication is one of the largest areas to be considered in this process. Irrelevant observations are observations that do not fit into the specific problem you are trying to analyze.

For example, if you want to analyze data regarding millennial customers but your dataset includes older generations, you might remove those irrelevant observations. This can make analysis more efficient, minimize distraction from your primary target, and create a more manageable and better-performing dataset.
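As a brief illustration, here is a pandas sketch of removing duplicate and irrelevant observations; the column names and the millennial birth-year range are hypothetical assumptions, not part of the notes.

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "birth_year": [1985, 1985, 1990, 1960],
})

# Remove exact duplicate observations
df = df.drop_duplicates()

# Remove irrelevant observations, e.g. keep only millennial customers (assumed birth years 1981-1996)
df = df[df["birth_year"].between(1981, 1996)]
print(df)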

2. Fix structural errors

Structural errors arise when you measure or transfer data and notice strange naming conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or classes. For example, you may find both "N/A" and "Not Applicable" in the same column, but they should be analyzed as the same category.
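A small pandas sketch of fixing such a structural error; the "status" column and its values are hypothetical.

import pandas as pd

df = pd.DataFrame({"status": ["Yes", "N/A", "Not Applicable", "no", "NO"]})

# Unify capitalization and map equivalent labels into one category
df["status"] = (df["status"]
                .str.strip()
                .str.lower()
                .replace({"not applicable": "n/a"}))
print(df["status"].value_counts())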

3. Filter unwanted outliers

Often, there will be one-off observations that, at a glance, do not appear to fit within the data you are analysing. If you have a legitimate reason to remove an outlier, such as improper data entry, doing so will improve the quality of the data you are working with.

However, sometimes, the appearance of an outlier will prove a theory you are working on. And
just because an outlier exists doesn't mean it is incorrect. This step is needed to determine the
validity of that number. If an outlier proves to be irrelevant for analysis or is a mistake, consider
removing it.
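One common way to filter such outliers is the interquartile-range (IQR) rule, sketched below with pandas; the "amount" column and its values are invented for the example, and the 1.5 x IQR fences are a conventional choice rather than a rule from these notes.

import pandas as pd

df = pd.DataFrame({"amount": [10, 12, 11, 13, 12, 400]})

q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows whose 'amount' falls inside the IQR fences
filtered = df[df["amount"].between(lower, upper)]
print(filtered)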

4. Handle missing data

You can't ignore missing data because many algorithms will not accept missing values. There are a few ways to deal with missing data. None of them is optimal, but all of them can be considered, such as:

o You can drop observations with missing values, but this will drop or lose information, so
be careful before removing it.
o You can impute missing values based on other observations; again, there is an opportunity to lose the integrity of the data, because you may be operating from assumptions rather than actual observations.
o You might alter how the data is used to navigate null values effectively.
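A minimal pandas sketch of the first two options, dropping versus imputing; the "age" column is a made-up example, and mean imputation is just one possible choice.

import pandas as pd

df = pd.DataFrame({"age": [25.0, None, 31.0, None, 40.0]})

# Option 1: drop observations with missing values
dropped = df.dropna()

# Option 2: impute missing values from other observations, e.g. with the attribute mean
imputed = df.fillna({"age": df["age"].mean()})

print(dropped, imputed, sep="\n")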

Methods of Data Cleaning


There are many data cleaning methods through which the data should be run. The methods are
described below:

1. Ignore the tuples: This method is not very feasible, as it is only useful when a tuple has several attributes with missing values.
2. Fill the missing value: This approach is also not very effective or feasible, and it can be time-consuming. In this approach, one has to fill in the missing values. This is usually done manually, but it can also be done using the attribute mean or the most probable value.
3. Binning method: This approach is very simple to understand. The data is first sorted and then divided into several segments (bins) of equal size. After that, the values in each bin are smoothed using the values around them, for example by the bin mean or the bin boundaries.
4. Regression: The data is smoothed with the help of a regression function. The regression can be linear or multiple: linear regression has only one independent variable, while multiple regression has more than one independent variable.
5. Clustering: This method mainly operates on groups. Clustering arranges similar values into a "group" or "cluster", and outliers are then detected with the help of the clusters.

Data binning, or bucketing, is a data pre-processing method used to minimize the effects of small observation errors. The original data values are divided into small intervals known as bins and are then replaced by a general value calculated for that bin. This has a smoothing effect on the input data and may also reduce the chance of overfitting in the case of small datasets.
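A short Python sketch of equal-size binning with smoothing by bin means; the data values and the bin size are illustrative only.

import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bin_size = 4  # assumed; the data length must be divisible by it here

# Split the sorted data into equal-size bins and replace each value by its bin mean
smoothed = np.concatenate([
    np.full(bin_size, chunk.mean())
    for chunk in np.split(data, len(data) // bin_size)
])
print(smoothed)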

NUMERICALS DISCUSSED IN CLASS

Data integration is the process of combining data from multiple sources into a cohesive and
consistent view. This process involves identifying and accessing the different data sources,
mapping the data to a common format, and reconciling any inconsistencies or discrepancies
between the sources. The goal of data integration is to make it easier to access and analyze data
that is spread across multiple systems or platforms, to gain a more complete and accurate
understanding of the data.
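A minimal pandas sketch of integrating two sources into a single consistent view; the table and column names are hypothetical.

import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Ravi"]})
orders = pd.DataFrame({"cust_id": [1, 1, 2], "amount": [250, 100, 75]})

# Map both sources onto a common key and combine them into one view
combined = customers.merge(orders, on="cust_id", how="left")
print(combined)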

NUMERICALS DISCUSSED IN CLASS

Data reduction techniques ensure the integrity of the data while reducing its volume. Data reduction is a process that reduces the volume of the original data and represents it in a much smaller form. Data reduction techniques are used to obtain a reduced representation of the dataset that is much smaller in volume while maintaining the integrity of the original data. By reducing the data, the efficiency of the data mining process is improved while producing the same analytical results. Data reduction aims to represent the data more compactly. When the data size is smaller, it is simpler to apply sophisticated and computationally expensive algorithms. The reduction may be in terms of the number of rows (records) or the number of columns (dimensions).

Techniques of Data Reduction

The following are the techniques or methods of data reduction in data mining:

1. Dimensionality Reduction

Whenever we encounter weakly relevant data, we keep only the attributes required for our analysis. Dimensionality reduction eliminates attributes from the data set under consideration, thereby reducing the volume of the original data. It reduces data size by eliminating outdated or redundant features. Here are three methods of dimensionality reduction.

i. Wavelet Transform: In the wavelet transform, a data vector A is transformed into a numerically different data vector A' such that both A and A' are of the same length. It is useful in reducing data because the data obtained from the wavelet transform can be truncated: the compressed data is obtained by retaining only the smallest fragment of the strongest wavelet coefficients. The wavelet transform can be applied to data cubes, sparse data, or skewed data.
ii. Principal Component Analysis: Suppose we have a data set to be analysed whose tuples have n attributes. Principal component analysis searches for k n-dimensional orthogonal vectors (the principal components, with k ≤ n) that can best be used to represent the data set.
In this way, the original data can be projected onto a much smaller space, and dimensionality reduction can be achieved. Principal component analysis can be applied to sparse and skewed data. A short sketch using PCA is shown after this list.
iii. Attribute Subset Selection: A large data set has many attributes, some of which are irrelevant to data mining and some of which are redundant. Attribute subset selection reduces the data volume and dimensionality by eliminating these redundant and irrelevant attributes.
Attribute subset selection ensures that we get a good subset of the original attributes even after eliminating the unwanted ones, so that the resulting probability distribution of the data is as close as possible to the original distribution obtained using all the attributes.
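As mentioned under Principal Component Analysis above, here is a minimal sketch assuming scikit-learn is available; the toy data set and the choice of k = 2 components are arbitrary.

import numpy as np
from sklearn.decomposition import PCA

# Toy data set: 6 tuples with n = 4 attributes
X = np.array([
    [2.5, 2.4, 0.5, 1.0],
    [0.5, 0.7, 1.1, 0.9],
    [2.2, 2.9, 0.4, 1.2],
    [1.9, 2.2, 0.6, 0.8],
    [3.1, 3.0, 0.3, 1.1],
    [2.3, 2.7, 0.5, 1.0],
])

# Project the data onto k = 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (6, 2)
print(pca.explained_variance_ratio_)   # share of variance kept by each component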

2. Numerosity Reduction

The numerosity reduction technique reduces the original data volume and represents it in a much smaller form. It includes two types: parametric and non-parametric numerosity reduction.

i. Parametric: Parametric numerosity reduction stores only the data parameters instead of the original data. One method of parametric numerosity reduction is the regression and log-linear method.
o Regression and Log-Linear: Linear regression models the relationship between two attributes by fitting a linear equation to the data set. Suppose we need to model a linear function between two attributes:
y=wx+b

Here, y is the response attribute and x is the predictor attribute. In data mining terms, attribute x and attribute y are numeric database attributes, whereas w and b are the regression coefficients.
Multiple linear regression lets the response variable y be modelled as a linear function of two or more predictor variables.
The log-linear model discovers the relationship between two or more discrete attributes in the database. Suppose we have a set of tuples presented in an n-dimensional space; the log-linear model is then used to study the probability of each tuple in that multidimensional space.
Regression and log-linear methods can be used for sparse data and skewed data.

ii. Non-Parametric: A non-parametric numerosity reduction technique does not assume any model. Non-parametric techniques result in a more uniform reduction irrespective of data size, but they may not achieve as high a volume of data reduction as the parametric ones. Common non-parametric data reduction techniques are the histogram, clustering, sampling, data cube aggregation, and data compression.
o Histogram: A histogram is a graph that represents a frequency distribution, describing how often each value appears in the data. A histogram uses the binning method to represent an attribute's data distribution; it uses disjoint subsets which we call bins or buckets.
A histogram can represent dense, sparse, uniform, or skewed data. Instead of only one attribute, a histogram can be implemented for multiple attributes; it can effectively represent up to five attributes.
o Clustering: Clustering techniques group similar objects from the data so that the objects in a cluster are similar to each other but dissimilar to objects in other clusters.
How similar the objects inside a cluster are can be measured with a distance function: the more similar the objects in a cluster, the closer they appear within the cluster.
The quality of a cluster depends on its diameter, i.e., the maximum distance between any two objects in the cluster.
The cluster representation replaces the original data. This technique is more effective if the data can be classified into distinct clusters.
o Sampling: One of the methods used for data reduction is sampling, as it can reduce a large data set to a much smaller data sample. Below we discuss the different methods in which we can sample a large data set D containing N tuples (a short sketch follows after the list):

a. Simple random sample without replacement (SRSWOR) of size s: Here, s tuples are drawn from the N tuples of data set D (s < N). The probability of drawing any tuple from the data set D is 1/N, which means all tuples have an equal probability of being sampled.
b. Simple random sample with replacement (SRSWR) of size s: It is like SRSWOR, except that each tuple drawn from data set D is recorded and then placed back into D so that it can be drawn again.

c. Cluster sample: The tuples in data set D are clustered into M mutually disjoint subsets. Data reduction can be applied by implementing SRSWOR on these clusters, generating a simple random sample of s clusters where s < M.

d. Stratified sample: The large data set D is partitioned into mutually disjoint sets called 'strata'. A simple random sample is taken from each stratum to get stratified data. This method is effective for skewed data.
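A brief pandas sketch of these sampling schemes; the data set D, the sample size s, and the "region" stratum column are hypothetical.

import pandas as pd

D = pd.DataFrame({
    "value": range(1, 11),
    "region": ["north", "south"] * 5,
})
s = 4

# SRSWOR: simple random sample of size s without replacement
srswor = D.sample(n=s, replace=False, random_state=0)

# SRSWR: simple random sample of size s with replacement
srswr = D.sample(n=s, replace=True, random_state=0)

# Stratified sample: draw s/2 tuples from each 'region' stratum
stratified = D.groupby("region", group_keys=False).apply(lambda g: g.sample(n=s // 2, random_state=0))

print(srswor, srswr, stratified, sep="\n\n")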

3. Data Cube Aggregation

This technique is used to aggregate data in a simpler form. Data Cube Aggregation is a
multidimensional aggregation that uses aggregation at various levels of a data cube to represent
the original data set, thus achieving data reduction.

For example, suppose you have the data of All Electronics sales per quarter for the years 2018 to 2022. If you want the annual sales per year, you just have to aggregate the sales per quarter for each year. In this way, aggregation provides you with the required data, which is much smaller in size, and we thereby achieve data reduction without losing any information.
The data cube aggregation is a multidimensional aggregation that eases multidimensional analysis. The data cube holds precomputed and summarized data, which gives data mining fast access to it.
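A small pandas sketch of the quarterly-to-annual aggregation described above; the sales figures are invented.

import pandas as pd

sales = pd.DataFrame({
    "year":    [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [200, 250, 300, 400, 220, 270, 310, 420],
})

# Aggregate quarterly sales into annual totals: 8 rows reduce to 2
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)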

4. Data Compression

Data compression employs modification, encoding, or conversion of the structure of the data in a way that consumes less space. It involves building a compact representation of information by removing redundancy and representing data in binary form. Compression from which the original data can be restored exactly is called lossless compression; in contrast, compression from which the original form cannot be fully restored is called lossy compression. Dimensionality reduction and numerosity reduction methods are also used for data compression.

This technique reduces the size of files using different encoding mechanisms, such as Huffman encoding and run-length encoding. We can divide it into two types based on the compression technique:
i. Lossless Compression: Encoding techniques such as run-length encoding allow a simple and minimal reduction of data size. Lossless data compression uses algorithms to restore the precise original data from the compressed data.
ii. Lossy Compression: In lossy data compression, the decompressed data may differ from the original data but is still useful enough to retrieve information from. For example, the JPEG image format uses lossy compression, but we can still find meaning equivalent to the original image. Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples of this kind of compression.
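A tiny Python sketch of run-length encoding, the lossless technique mentioned above; the input string is an arbitrary example.

from itertools import groupby

def run_length_encode(text):
    # Replace each run of identical characters with a (character, run length) pair
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def run_length_decode(pairs):
    # Lossless: the exact original string is reconstructed
    return "".join(ch * count for ch, count in pairs)

encoded = run_length_encode("AAAABBBCCDAA")
print(encoded)                     # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 2)]
print(run_length_decode(encoded))  # AAAABBBCCDAA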

5. Data Discretization

The data discretization technique is used to divide continuous attributes into intervals. We replace the many constant values of an attribute with labels for small intervals, so that mining results can be shown in a concise and easily understandable way.

i. Top-down discretization: If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole set of attribute values and repeat this method until the end, the process is known as top-down discretization, also known as splitting.
ii. Bottom-up discretization: If you first consider all the constant values as split points and then discard some of them by combining neighbouring values into intervals, that process is called bottom-up discretization.
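A minimal pandas sketch of discretizing a continuous attribute into labelled intervals; the ages, interval boundaries, and labels are illustrative assumptions.

import pandas as pd

ages = pd.Series([22, 25, 34, 47, 52, 68])

# Discretize the continuous 'age' attribute into labelled intervals
labels = pd.cut(ages, bins=[0, 30, 55, 120], labels=["young", "middle-aged", "senior"])
print(labels)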
