Unit-2 Data Warehouse Notes
Data transformation in data mining refers to the process of converting raw data into a format
that is suitable for analysis and modelling. The goal of data transformation is to prepare the
data for data mining so that it can be used to extract useful insights and knowledge. Data
transformation typically involves several steps, including:
1. Data cleaning: Removing or correcting errors, inconsistencies, and missing values
in the data.
2. Data integration: Combining data from multiple sources, such as databases and
spreadsheets, into a single format.
3. Data normalization: Scaling the data to a common range of values, such as
between 0 and 1, to facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a subset of
relevant features or attributes.
5. Data discretization: Converting continuous data into discrete categories or bins.
6. Data aggregation: Combining data at different levels of granularity, such as by
summing or averaging, to create new features or attributes.
Data transformation is an important step in the data mining process, as it helps to ensure
that the data is in a format suitable for analysis and modelling and is free of errors and
inconsistencies. Data transformation can also improve the performance of data mining
algorithms by reducing the dimensionality of the data and scaling the data to a common
range of values.
The data are transformed in ways that are ideal for mining the data. The data transformation
involves steps that are:
1. Smoothing: Smoothing is a process used to remove noise from the dataset using algorithms
such as binning, regression, or clustering. It highlights the important features present in the
dataset and helps in predicting patterns. When data is collected, it can be processed to eliminate
or reduce variance and other forms of noise. The idea behind data smoothing is that it exposes
simple changes that help predict trends and patterns. This is a help to analysts or traders who
must look at large amounts of data, which can be difficult to digest, and lets them find patterns
they would not see otherwise.
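A minimal sketch of one simple smoothing technique, a moving average, assuming pandas is available; the readings are purely illustrative values, not data from the notes.

```python
import pandas as pd

# Hypothetical noisy daily readings (illustrative values).
readings = pd.Series([21.0, 29.5, 22.3, 30.1, 21.8, 28.9, 22.5])

# Smooth with a 3-point moving average: each value is averaged with its
# neighbours, damping random fluctuations while preserving the overall trend.
smoothed = readings.rolling(window=3, center=True).mean()
print(smoothed)
```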
2. Aggregation: Aggregation is the method of collecting, storing, and presenting data in a
summary format. The data may be obtained from multiple data sources and integrated into a
single description for analysis. This is a crucial step, since the accuracy of data analysis insights
depends heavily on the quantity and quality of the data used. Gathering accurate data of high
quality and in large enough quantity is necessary to produce relevant results. Aggregated data is
useful for everything from decisions concerning financing or business strategy of a product to
pricing, operations, and marketing strategies. For example, daily sales data may be aggregated
to compute monthly and annual totals.
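A minimal sketch of that sales example, assuming pandas; the dates, amounts, and column names are hypothetical.

```python
import pandas as pd

# Hypothetical daily sales records (illustrative values and column names).
sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-10", "2023-03-15"]),
    "amount": [250.0, 300.0, 175.0, 420.0],
})

# Aggregate the daily records into monthly and annual totals.
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
annual = sales.groupby(sales["date"].dt.year)["amount"].sum()
print(monthly)
print(annual)
```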
3. Discretization: Discretization is the process of transforming continuous data into a set of small
intervals. Many real-world data mining tasks involve continuous attributes, yet many existing
data mining frameworks cannot handle these attributes directly. Even when a data mining task
can manage a continuous attribute, its efficiency can often be improved significantly by replacing
the continuous values with discrete ones. For example, numeric values may be grouped into
intervals (1-10, 11-20, ...), or age may be mapped to labels such as young, middle age, senior.
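A minimal sketch of discretizing the age attribute with pandas; the ages and the cut points for young / middle age / senior are illustrative assumptions.

```python
import pandas as pd

# Hypothetical continuous ages to be discretized into labelled intervals.
ages = pd.Series([15, 22, 34, 47, 58, 66, 73])

# Map the continuous attribute "age" onto three discrete categories.
categories = pd.cut(ages, bins=[0, 30, 60, 120],
                    labels=["young", "middle age", "senior"])
print(categories)
```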
4. Attribute Construction: New attributes are constructed from the given set of attributes and
added to assist the mining process. This simplifies the original data and makes the mining more
efficient.
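A short sketch of attribute construction with pandas, using a hypothetical "area" attribute derived from "width" and "height"; the columns are illustrative, not from the notes.

```python
import pandas as pd

# Hypothetical records; "width" and "height" are the given attributes.
items = pd.DataFrame({"width": [2.0, 3.5, 1.2], "height": [4.0, 2.0, 5.0]})

# Construct a new attribute "area" from the existing ones to aid mining.
items["area"] = items["width"] * items["height"]
print(items)
```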
5. Generalization: Generalization converts low-level data attributes to high-level attributes using
a concept hierarchy. For example, age values initially in numerical form (22, 25) may be converted
into categorical values (young, old), and categorical attributes such as house addresses may be
generalized to higher-level concepts such as town or country.
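A minimal sketch of climbing a concept hierarchy in Python; the city-to-country mapping is a hypothetical hierarchy used only for illustration.

```python
# Hypothetical concept hierarchy mapping low-level city values
# to the higher-level concept "country".
hierarchy = {"Mumbai": "India", "Delhi": "India", "Paris": "France"}

cities = ["Mumbai", "Paris", "Delhi"]
countries = [hierarchy[c] for c in cities]
print(countries)  # ['India', 'France', 'India']
```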
6. Normalization: Data normalization involves converting all data variables into a given
range. Techniques that are used for normalization are:
Min-Max Normalization:
This transforms the original data linearly.
Suppose min_A is the minimum and max_A is the maximum value of an attribute A, and
[new_min_A, new_max_A] is the new range.
A value, v, of attribute A is normalized to v' by computing
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
where v is the value you want to map into the new range and v' is the new value you get
after normalizing the old value.
Z-Score Normalization:
In z-score normalization (or zero-mean normalization) the values of an attribute A are
normalized based on the mean of A (mean_A) and its standard deviation (std_A).
A value, v, of attribute A is normalized to v' by computing
v' = (v - mean_A) / std_A
Decimal Scaling:
It normalizes the values of an attribute by moving the position of their decimal point.
The number of places by which the decimal point is moved is determined by the maximum
absolute value of attribute A.
A value, v, of attribute A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that Max(|v'|) < 1.
Suppose the values of an attribute P vary from -99 to 99. The maximum absolute value of P
is 99, so to normalize the values we divide them by 100 (i.e., j = 2, the number of digits in the
largest absolute value), and the values come out as 0.99, 0.98, 0.97, and so on.
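A minimal sketch of the three normalization techniques above, assuming NumPy; the attribute values are illustrative and chosen to match the decimal-scaling example (range -99 to 99).

```python
import numpy as np

values = np.array([-99.0, -50.0, 0.0, 45.0, 99.0])  # illustrative attribute P

# Min-max normalization to the new range [0, 1].
min_p, max_p = values.min(), values.max()
min_max = (values - min_p) / (max_p - min_p) * (1.0 - 0.0) + 0.0

# Z-score normalization using the mean and standard deviation of the attribute.
z_score = (values - values.mean()) / values.std()

# Decimal scaling: divide by 10^j, the smallest power of ten that brings
# every absolute value below 1 (here j = 2, so 99 becomes 0.99).
j = int(np.ceil(np.log10(np.abs(values).max() + 1)))
decimal_scaled = values / (10 ** j)

print(min_max, z_score, decimal_scaled, sep="\n")
```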
ADVANTAGES OF DATA TRANSFORMATION:
1. Improves Data Quality: Data transformation helps to improve the quality of data by
removing errors, inconsistencies, and missing values.
2. Facilitates Data Integration: Data transformation enables the integration of data
from multiple sources, which can improve the accuracy and completeness of the
data.
3. Improves Data Analysis: Data transformation helps to prepare the data for analysis
and modeling by normalizing, reducing dimensionality, and discretizing the data.
4. Increases Data Security: Data transformation can be used to mask sensitive data, or
to remove sensitive information from the data, which can help to increase data
security.
5. Enhances Data Mining Algorithm Performance: Data transformation can improve
the performance of data mining algorithms by reducing the dimensionality of the
data and scaling the data to a common range of values.
Data cleaning is a crucial process in data mining and plays an important part in building a model.
Although it is a necessary step, it is often neglected. Data quality is the main issue in quality
information management, and data quality problems can occur anywhere in an information
system; these problems are solved by data cleaning.
In most cases, data cleaning in data mining is a laborious process and typically requires IT
resources to help in the initial step of evaluating your data, because cleaning data before mining
it is time-consuming. But without proper data quality, your final analysis will suffer from
inaccuracy, or you could potentially arrive at the wrong conclusion.
While the techniques used for data cleaning may vary according to the types of data your
company stores, you can follow these basic steps to clean your data:
1. Remove irrelevant observations: For example, if you want to analyze data regarding millennial
customers but your dataset includes older generations, you might remove those irrelevant
observations. This makes analysis more efficient, minimizes distraction from your primary
target, and creates a more manageable dataset.
2. Fix structural errors: Structural errors appear when you measure or transfer data and notice
strange naming conventions, typos, or incorrect capitalization. These inconsistencies can cause
mislabeled categories or classes. For example, you may find both "N/A" and "Not Applicable" in
the same dataset, but they should be analyzed as the same category.
3. Handle unwanted outliers: Often there will be one-off observations that, at a glance, do not
appear to fit the data you are analysing. If you have a legitimate reason to remove an outlier,
such as improper data entry, doing so will improve the performance of the data you are working
with. However, sometimes the appearance of an outlier will support a theory you are working
on, and the mere existence of an outlier does not mean it is incorrect. This step is needed to
determine the validity of that value: if an outlier proves to be irrelevant for analysis or is a
mistake, consider removing it.
4. Handle missing data: You can't ignore missing data, because many algorithms will not accept
missing values. There are a few ways to deal with missing data; none is optimal, but all can be
considered:
o You can drop observations with missing values, but this drops or loses information, so
be careful before removing them.
o You can impute missing values based on other observations; again, there is a risk of
losing the integrity of the data, because you may be operating from assumptions rather
than actual observations.
o You might alter how the data is used so that null values are navigated effectively.
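A minimal sketch of the first two options, dropping versus imputing, assuming pandas; the customer attributes and values are hypothetical.

```python
import pandas as pd
import numpy as np

# Hypothetical customer data with missing values (NaN).
df = pd.DataFrame({"age": [25, np.nan, 41, 38],
                   "income": [50000, 62000, np.nan, 48000]})

# Option 1: drop observations that contain missing values (loses rows).
dropped = df.dropna()

# Option 2: impute missing values from other observations (here, the column mean).
imputed = df.fillna(df.mean(numeric_only=True))

print(dropped)
print(imputed)
```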
In addition, the following methods are commonly used to handle missing and noisy values:
1. Ignore the tuples: This method is not very feasible, as it is only useful when a tuple has
several attributes with missing values.
2. Fill in the missing value: This approach is also not very effective or feasible, and it can be
time-consuming. The missing value is usually filled in manually, but it can also be filled
with the attribute mean or the most probable value.
3. Binning method: This approach is simple to understand. The sorted data is divided into
several segments (bins) of equal size, and the values in each bin are then smoothed using
the values around them, for example by replacing them with the bin mean or the bin
boundaries.
4. Regression: The data is smoothed by fitting it to a regression function. The regression can
be linear or multiple: linear regression has only one independent variable, while multiple
regression has more than one independent variable.
5. Clustering: This method operates on groups. Similar values are arranged into a "group"
or "cluster", and values that fall outside the clusters are detected as outliers.
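A minimal sketch of smoothing by bin means with NumPy; the sorted values and the choice of four bins are illustrative assumptions.

```python
import numpy as np

# Hypothetical sorted attribute values.
values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Split into equal-size bins and smooth by replacing each value with its bin mean.
bins = np.array_split(values, 4)  # 4 bins of 3 values each
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)
```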
Data binning, or bucketing, is a data pre-processing method used to minimize the effects of small
observation errors. The original data values are divided into small intervals known as bins and
then replaced by a general value calculated for that bin. This has a smoothing effect on the input
data and may also reduce the chance of overfitting in the case of small datasets.
Data integration is the process of combining data from multiple sources into a cohesive and
consistent view. This process involves identifying and accessing the different data sources,
mapping the data to a common format, and reconciling any inconsistencies or discrepancies
between the sources. The goal of data integration is to make it easier to access and analyze data
that is spread across multiple systems or platforms, to gain a more complete and accurate
understanding of the data.
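A minimal sketch of integrating two sources into a single view, assuming pandas and a shared customer_id key; the source names and columns are hypothetical.

```python
import pandas as pd

# Hypothetical data from two sources that share the key "customer_id".
crm = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Asha", "Ben", "Chen"]})
billing = pd.DataFrame({"customer_id": [1, 2, 4], "total_spent": [120.0, 80.5, 45.0]})

# Integrate the two sources into one consistent view on the common key.
integrated = crm.merge(billing, on="customer_id", how="outer")
print(integrated)
```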
Data reduction techniques ensure the integrity of data while reducing the data. Data reduction is
a process that reduces the volume of original data and represents it in a much smaller volume.
Data reduction techniques are used to obtain a reduced representation of the dataset that is much
smaller in volume by maintaining the integrity of the original data. By reducing the data, the
efficiency of the data mining process is improved, which produces the same analytical results.
Data reduction aims to represent the data more compactly. When the data size is smaller, it is
simpler to apply sophisticated and computationally expensive algorithms. The reduction of the
data may be in terms of the number of rows (records) or the number of columns (dimensions).
Here are the following techniques or methods of data reduction in data mining, such as:
1. Dimensionality Reduction
Whenever we encounter weakly relevant or redundant attributes, we keep only the attributes
required for our analysis. Dimensionality reduction eliminates such attributes from the data set
under consideration, thereby reducing the volume of the original data. It reduces data size by
eliminating outdated or redundant features. Three common methods of dimensionality
reduction are the wavelet transform, principal component analysis (PCA), and attribute subset
selection.
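A sketch of one of these methods, PCA, assuming scikit-learn is available; the synthetic five-attribute dataset is purely illustrative and built so that its columns are largely redundant.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset with 5 attributes, several of which are redundant.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])  # 5 correlated columns

# Project onto 2 principal components, discarding the redundant dimensions.
reduced = PCA(n_components=2).fit_transform(X)
print(X.shape, "->", reduced.shape)  # (100, 5) -> (100, 2)
```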
2. Numerosity Reduction
Numerosity reduction reduces the original data volume and represents it in a much smaller
form. This technique includes two types: parametric and non-parametric numerosity reduction.
c. Cluster sample: The tuples in data set D are clustered into M mutually
disjoint subsets (clusters). Data reduction can then be applied by drawing a
simple random sample without replacement (SRSWOR) of s of these
clusters, where s < M.
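A minimal sketch of cluster sampling with NumPy, under the assumption of 1000 tuples pre-assigned to M = 10 clusters; the cluster labels and s = 3 are illustrative choices.

```python
import numpy as np

# Hypothetical dataset D of 1000 tuples, grouped into M = 10 clusters.
rng = np.random.default_rng(42)
tuples = np.arange(1000)
cluster_of = rng.integers(0, 10, size=1000)  # cluster label per tuple

# Cluster sampling: pick s = 3 whole clusters by SRSWOR and keep all their tuples.
chosen_clusters = rng.choice(10, size=3, replace=False)
sample = tuples[np.isin(cluster_of, chosen_clusters)]
print(len(sample), "tuples kept from clusters", chosen_clusters)
```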
3. Data Cube Aggregation
This technique is used to aggregate data into a simpler form. Data cube aggregation is a
multidimensional aggregation that uses aggregation at various levels of a data cube to represent
the original data set, thus achieving data reduction.
For example, suppose you have the data of All Electronics sales per quarter for the year 2018 to
the year 2022. If you want to get the annual sale per year, you just have to aggregate the sales per
quarter for each year. In this way, aggregation provides you with the required data, which is
much smaller in size, and thereby we achieve data reduction even without losing any data.
Data cube aggregation is a multidimensional aggregation that eases multidimensional analysis.
The data cube stores precomputed and summarized data, which gives data mining fast access
to the aggregated results.
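A minimal sketch of rolling the quarterly level of a sales cube up to the annual level with pandas; the figures and column names are illustrative, not the actual All Electronics data.

```python
import pandas as pd

# Hypothetical quarterly sales (one row per quarter for two of the years).
sales = pd.DataFrame({
    "year": [2018] * 4 + [2019] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 2,
    "amount": [200, 220, 250, 300, 210, 230, 260, 310],
})

# Roll the quarterly level up to the annual level of the cube.
annual = sales.groupby("year")["amount"].sum()
print(annual)  # one aggregated row per year instead of four quarterly rows
```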
4. Data Compression
Data compression employs modification, encoding, or conversion of the structure of the data in
a way that consumes less space. It builds a compact representation of the information by
removing redundancy and representing data in binary form. Compression from which the
original data can be restored exactly is called lossless compression; in contrast, compression
from which the original form cannot be fully restored is called lossy compression.
Dimensionality reduction and numerosity reduction methods are also used for data
compression.
This technique reduces the size of files using different encoding mechanisms, such as Huffman
encoding and run-length encoding. Based on the compression technique used, we can divide it
into two types.
i. Lossless Compression: Encoding techniques (such as run-length encoding) allow a simple
and modest reduction in data size. Lossless data compression uses algorithms to restore
the precise original data from the compressed data.
ii. Lossy Compression: In lossy data compression, the decompressed data may differ from
the original data but is still useful enough to retrieve information from. For example, the
JPEG image format uses lossy compression, yet the result conveys a meaning equivalent
to the original image. Methods such as the discrete wavelet transform (DWT) and
principal component analysis (PCA) are examples of this type of compression.
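A short sketch of lossless compression and exact recovery using Python's standard zlib module, whose DEFLATE codec internally combines LZ77 with Huffman coding; the repeated text is an illustrative, highly redundant input.

```python
import zlib

# Hypothetical text with heavy redundancy; lossless compression removes it.
data = b"quarterly sales report " * 100

compressed = zlib.compress(data)
restored = zlib.decompress(compressed)

print(len(data), "->", len(compressed), "bytes")
print(restored == data)  # True: the exact original is recovered (lossless)
```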
6. Data Discretization
The data discretization technique is used to divide attributes of a continuous nature into data
with intervals. Many continuous values of an attribute are replaced with the labels of small
intervals, so that mining results can be shown in a concise and easily understandable way.