Unit-2 Data Warehouse Notes
Data transformation in data mining refers to the process of converting raw data into a format
that is suitable for analysis and modelling. The goal of data transformation is to prepare the
data for data mining so that it can be used to extract useful insights and knowledge. Data
transformation typically involves several steps, including:
1. Data cleaning: Removing or correcting errors, inconsistencies, and missing values
in the data.
2. Data integration: Combining data from multiple sources, such as databases and
spreadsheets, into a single format.
3. Data normalization: Scaling the data to a common range of values, such as
between 0 and 1, to facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a subset of
relevant features or attributes.
5. Data discretization: Converting continuous data into discrete categories or bins.
6. Data aggregation: Combining data at different levels of granularity, such as by
summing or averaging, to create new features or attributes.
Data transformation is an important step in the data mining process, as it helps to ensure
that the data is in a format suitable for analysis and modelling and is free of errors and
inconsistencies. Data transformation can also improve the performance of data mining
algorithms by reducing the dimensionality of the data and scaling the data to a common
range of values.
The data are transformed in ways that are ideal for mining the data. The data transformation
involves steps that are:
1. Smoothing: Smoothing is a process used to remove noise from the dataset using algorithms
such as binning, regression, or clustering. It highlights the important features present in the
dataset and helps in predicting patterns. When data is collected, it can be processed to eliminate
or reduce variance and other forms of noise. The idea behind data smoothing is that it exposes
simple changes that help predict trends and patterns. This is a help to analysts or traders who
must look at large amounts of data, which can be difficult to digest, and lets them find patterns
they would not see otherwise.
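A minimal sketch of one simple smoothing technique, a moving average, assuming pandas is available; the readings are purely illustrative values, not data from the notes.

```python
import pandas as pd

# Hypothetical noisy daily readings (illustrative values).
readings = pd.Series([21.0, 29.5, 22.3, 30.1, 21.8, 28.9, 22.5])

# Smooth with a 3-point moving average: each value is averaged with its
# neighbours, damping random fluctuations while preserving the overall trend.
smoothed = readings.rolling(window=3, center=True).mean()
print(smoothed)
```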
2. Aggregation: Aggregation is the method of collecting, storing, and presenting data in a
summary format. The data may be obtained from multiple data sources and integrated into a
single description for analysis. This is a crucial step, since the accuracy of data analysis insights
depends heavily on the quantity and quality of the data used. Gathering accurate data of high
quality and in large enough quantity is necessary to produce relevant results. Aggregated data is
useful for everything from decisions concerning financing or business strategy of a product to
pricing, operations, and marketing strategies. For example, daily sales data may be aggregated
to compute monthly and annual totals.
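A minimal sketch of that sales example, assuming pandas; the dates, amounts, and column names are hypothetical.

```python
import pandas as pd

# Hypothetical daily sales records (illustrative values and column names).
sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-10", "2023-03-15"]),
    "amount": [250.0, 300.0, 175.0, 420.0],
})

# Aggregate the daily records into monthly and annual totals.
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()
annual = sales.groupby(sales["date"].dt.year)["amount"].sum()
print(monthly)
print(annual)
```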
3. Discretization: Discretization is the process of transforming continuous data into a set of small
intervals. Many real-world data mining tasks involve continuous attributes, yet many existing
data mining frameworks cannot handle these attributes directly. Even when a data mining task
can manage a continuous attribute, its efficiency can often be improved significantly by replacing
the continuous values with discrete ones. For example, numeric values may be grouped into
intervals (1-10, 11-20, ...), or age may be mapped to labels such as young, middle age, senior.
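A minimal sketch of discretizing the age attribute with pandas; the ages and the cut points for young / middle age / senior are illustrative assumptions.

```python
import pandas as pd

# Hypothetical continuous ages to be discretized into labelled intervals.
ages = pd.Series([15, 22, 34, 47, 58, 66, 73])

# Map the continuous attribute "age" onto three discrete categories.
categories = pd.cut(ages, bins=[0, 30, 60, 120],
                    labels=["young", "middle age", "senior"])
print(categories)
```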
4. Attribute Construction: New attributes are constructed from the given set of attributes and
added to assist the mining process. This simplifies the original data and makes the mining more
efficient.
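A short sketch of attribute construction with pandas, using a hypothetical "area" attribute derived from "width" and "height"; the columns are illustrative, not from the notes.

```python
import pandas as pd

# Hypothetical records; "width" and "height" are the given attributes.
items = pd.DataFrame({"width": [2.0, 3.5, 1.2], "height": [4.0, 2.0, 5.0]})

# Construct a new attribute "area" from the existing ones to aid mining.
items["area"] = items["width"] * items["height"]
print(items)
```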
5. Generalization: Generalization converts low-level data attributes to high-level attributes using
a concept hierarchy. For example, age values initially in numerical form (22, 25) may be converted
into categorical values (young, old), and categorical attributes such as house addresses may be
generalized to higher-level concepts such as town or country.
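A minimal sketch of climbing a concept hierarchy in Python; the city-to-country mapping is a hypothetical hierarchy used only for illustration.

```python
# Hypothetical concept hierarchy mapping low-level city values
# to the higher-level concept "country".
hierarchy = {"Mumbai": "India", "Delhi": "India", "Paris": "France"}

cities = ["Mumbai", "Paris", "Delhi"]
countries = [hierarchy[c] for c in cities]
print(countries)  # ['India', 'France', 'India']
```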
6. Normalization: Data normalization involves converting all data variables into a given
range. Techniques that are used for normalization are:
Min-Max Normalization:
This transforms the original data linearly.
Suppose min_A is the minimum and max_A is the maximum value of an attribute A, and
[new_min_A, new_max_A] is the new range.
A value, v, of attribute A is normalized to v' by computing
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
where v is the value you want to map into the new range and v' is the new value you get
after normalizing the old value.
Z-Score Normalization:
In z-score normalization (or zero-mean normalization) the values of an attribute A are
normalized based on the mean of A (mean_A) and its standard deviation (std_A).
A value, v, of attribute A is normalized to v' by computing
v' = (v - mean_A) / std_A
Decimal Scaling:
It normalizes the values of an attribute by moving the position of their decimal point.
The number of places by which the decimal point is moved is determined by the maximum
absolute value of attribute A.
A value, v, of attribute A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that Max(|v'|) < 1.
Suppose the values of an attribute P vary from -99 to 99. The maximum absolute value of P
is 99, so to normalize the values we divide them by 100 (i.e., j = 2, the number of digits in the
largest absolute value), and the values come out as 0.99, 0.98, 0.97, and so on.
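A minimal sketch of the three normalization techniques above, assuming NumPy; the attribute values are illustrative and chosen to match the decimal-scaling example (range -99 to 99).

```python
import numpy as np

values = np.array([-99.0, -50.0, 0.0, 45.0, 99.0])  # illustrative attribute P

# Min-max normalization to the new range [0, 1].
min_p, max_p = values.min(), values.max()
min_max = (values - min_p) / (max_p - min_p) * (1.0 - 0.0) + 0.0

# Z-score normalization using the mean and standard deviation of the attribute.
z_score = (values - values.mean()) / values.std()

# Decimal scaling: divide by 10^j, the smallest power of ten that brings
# every absolute value below 1 (here j = 2, so 99 becomes 0.99).
j = int(np.ceil(np.log10(np.abs(values).max() + 1)))
decimal_scaled = values / (10 ** j)

print(min_max, z_score, decimal_scaled, sep="\n")
```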
ADVANTAGES OF DATA TRANSFORMATION:
1. Improves Data Quality: Data transformation helps to improve the quality of data by
removing errors, inconsistencies, and missing values.
2. Facilitates Data Integration: Data transformation enables the integration of data
from multiple sources, which can improve the accuracy and completeness of the
data.
3. Improves Data Analysis: Data transformation helps to prepare the data for analysis
and modeling by normalizing, reducing dimensionality, and discretizing the data.
4. Increases Data Security: Data transformation can be used to mask sensitive data, or
to remove sensitive information from the data, which can help to increase data
security.
5. Enhances Data Mining Algorithm Performance: Data transformation can improve
the performance of data mining algorithms by reducing the dimensionality of the
data and scaling the data to a common range of values.
Data cleaning is a crucial process in data mining and plays an important part in building a model.
Although it is a necessary step, it is often neglected. Data quality is the main issue in quality
information management, and data quality problems can occur anywhere in an information
system; these problems are solved by data cleaning.
In most cases, data cleaning in data mining is a laborious process and typically requires IT
resources to help in the initial step of evaluating your data, because cleaning data before mining
it is time-consuming. But without proper data quality, your final analysis will suffer from
inaccuracy, or you could potentially arrive at the wrong conclusion.
While the techniques used for data cleaning may vary according to the types of data your
company stores, you can follow these basic steps to clean your data:
1. Remove irrelevant observations: For example, if you want to analyze data regarding millennial
customers but your dataset includes older generations, you might remove those irrelevant
observations. This makes analysis more efficient, minimizes distraction from your primary
target, and creates a more manageable dataset.
2. Fix structural errors: Structural errors appear when you measure or transfer data and notice
strange naming conventions, typos, or incorrect capitalization. These inconsistencies can cause
mislabeled categories or classes. For example, you may find both "N/A" and "Not Applicable" in
the same dataset, but they should be analyzed as the same category.
3. Handle unwanted outliers: Often there will be one-off observations that, at a glance, do not
appear to fit the data you are analysing. If you have a legitimate reason to remove an outlier,
such as improper data entry, doing so will improve the performance of the data you are working
with. However, sometimes the appearance of an outlier will support a theory you are working
on, and the mere existence of an outlier does not mean it is incorrect. This step is needed to
determine the validity of that value: if an outlier proves to be irrelevant for analysis or is a
mistake, consider removing it.
4. Handle missing data: You can't ignore missing data, because many algorithms will not accept
missing values. There are a few ways to deal with missing data; none is optimal, but all can be
considered:
o You can drop observations with missing values, but this drops or loses information, so
be careful before removing them.
o You can impute missing values based on other observations; again, there is a risk of
losing the integrity of the data, because you may be operating from assumptions rather
than actual observations.
o You might alter how the data is used so that null values are navigated effectively.
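A minimal sketch of the first two options, dropping versus imputing, assuming pandas; the customer attributes and values are hypothetical.

```python
import pandas as pd
import numpy as np

# Hypothetical customer data with missing values (NaN).
df = pd.DataFrame({"age": [25, np.nan, 41, 38],
                   "income": [50000, 62000, np.nan, 48000]})

# Option 1: drop observations that contain missing values (loses rows).
dropped = df.dropna()

# Option 2: impute missing values from other observations (here, the column mean).
imputed = df.fillna(df.mean(numeric_only=True))

print(dropped)
print(imputed)
```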
In addition, the following methods are commonly used to handle missing and noisy values:
1. Ignore the tuples: This method is not very feasible, as it is only useful when a tuple has
several attributes with missing values.
2. Fill in the missing value: This approach is also not very effective or feasible, and it can be
time-consuming. The missing value is usually filled in manually, but it can also be filled
with the attribute mean or the most probable value.
3. Binning method: This approach is simple to understand. The sorted data is divided into
several segments (bins) of equal size, and the values in each bin are then smoothed using
the values around them, for example by replacing them with the bin mean or the bin
boundaries.
4. Regression: The data is smoothed by fitting it to a regression function. The regression can
be linear or multiple: linear regression has only one independent variable, while multiple
regression has more than one independent variable.
5. Clustering: This method operates on groups. Similar values are arranged into a "group"
or "cluster", and values that fall outside the clusters are detected as outliers.
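A minimal sketch of smoothing by bin means with NumPy; the sorted values and the choice of four bins are illustrative assumptions.

```python
import numpy as np

# Hypothetical sorted attribute values.
values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Split into equal-size bins and smooth by replacing each value with its bin mean.
bins = np.array_split(values, 4)  # 4 bins of 3 values each
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)
```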
Data binning, or bucketing, is a data pre-processing method used to minimize the effects of small
observation errors. The original data values are divided into small intervals known as bins and
then replaced by a general value calculated for that bin. This has a smoothing effect on the input
data and may also reduce the chance of overfitting in the case of small datasets.
Data integration is the process of combining data from multiple sources into a cohesive and
consistent view. This process involves identifying and accessing the different data sources,
mapping the data to a common format, and reconciling any inconsistencies or discrepancies
between the sources. The goal of data integration is to make it easier to access and analyze data
that is spread across multiple systems or platforms, to gain a more complete and accurate
understanding of the data.
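A minimal sketch of integrating two sources into a single view, assuming pandas and a shared customer_id key; the source names and columns are hypothetical.

```python
import pandas as pd

# Hypothetical data from two sources that share the key "customer_id".
crm = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Asha", "Ben", "Chen"]})
billing = pd.DataFrame({"customer_id": [1, 2, 4], "total_spent": [120.0, 80.5, 45.0]})

# Integrate the two sources into one consistent view on the common key.
integrated = crm.merge(billing, on="customer_id", how="outer")
print(integrated)
```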
Data reduction techniques ensure the integrity of data while reducing the data. Data reduction is
a process that reduces the volume of original data and represents it in a much smaller volume.
Data reduction techniques are used to obtain a reduced representation of the dataset that is much
smaller in volume by maintaining the integrity of the original data. By reducing the data, the
efficiency of the data mining process is improved, which produces the same analytical results.
Data reduction aims to represent the data more compactly. When the data size is smaller, it is
simpler to apply sophisticated and computationally expensive algorithms. The reduction of the
data may be in terms of the number of rows (records) or the number of columns (dimensions).
Here are the following techniques or methods of data reduction in data mining, such as:
1. Dimensionality Reduction
Whenever we encounter weakly relevant or redundant attributes, we keep only the attributes
required for our analysis. Dimensionality reduction eliminates such attributes from the data set
under consideration, thereby reducing the volume of the original data. It reduces data size by
eliminating outdated or redundant features. Three common methods of dimensionality
reduction are the wavelet transform, principal component analysis (PCA), and attribute subset
selection.
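A sketch of one of these methods, PCA, assuming scikit-learn is available; the synthetic five-attribute dataset is purely illustrative and built so that its columns are largely redundant.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset with 5 attributes, several of which are redundant.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])  # 5 correlated columns

# Project onto 2 principal components, discarding the redundant dimensions.
reduced = PCA(n_components=2).fit_transform(X)
print(X.shape, "->", reduced.shape)  # (100, 5) -> (100, 2)
```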
2. Numerosity Reduction
Numerosity reduction reduces the original data volume and represents it in a much smaller
form. This technique includes two types: parametric and non-parametric numerosity reduction.
c. Cluster sample: The tuples in data set D are clustered into M mutually
disjoint subsets (clusters). Data reduction can then be applied by drawing a
simple random sample without replacement (SRSWOR) of s of these
clusters, where s < M.
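A minimal sketch of cluster sampling with NumPy, under the assumption of 1000 tuples pre-assigned to M = 10 clusters; the cluster labels and s = 3 are illustrative choices.

```python
import numpy as np

# Hypothetical dataset D of 1000 tuples, grouped into M = 10 clusters.
rng = np.random.default_rng(42)
tuples = np.arange(1000)
cluster_of = rng.integers(0, 10, size=1000)  # cluster label per tuple

# Cluster sampling: pick s = 3 whole clusters by SRSWOR and keep all their tuples.
chosen_clusters = rng.choice(10, size=3, replace=False)
sample = tuples[np.isin(cluster_of, chosen_clusters)]
print(len(sample), "tuples kept from clusters", chosen_clusters)
```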
3. Data Cube Aggregation
This technique is used to aggregate data into a simpler form. Data cube aggregation is a
multidimensional aggregation that uses aggregation at various levels of a data cube to represent
the original data set, thus achieving data reduction.
For example, suppose you have the data of All Electronics sales per quarter for the year 2018 to
the year 2022. If you want to get the annual sale per year, you just have to aggregate the sales per
quarter for each year. In this way, aggregation provides you with the required data, which is
much smaller in size, and thereby we achieve data reduction even without losing any data.
Data cube aggregation is a multidimensional aggregation that eases multidimensional analysis.
The data cube stores precomputed and summarized data, which gives data mining fast access
to the aggregated results.
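A minimal sketch of rolling the quarterly level of a sales cube up to the annual level with pandas; the figures and column names are illustrative, not the actual All Electronics data.

```python
import pandas as pd

# Hypothetical quarterly sales (one row per quarter for two of the years).
sales = pd.DataFrame({
    "year": [2018] * 4 + [2019] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 2,
    "amount": [200, 220, 250, 300, 210, 230, 260, 310],
})

# Roll the quarterly level up to the annual level of the cube.
annual = sales.groupby("year")["amount"].sum()
print(annual)  # one aggregated row per year instead of four quarterly rows
```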
4. Data Compression
Data compression employs modification, encoding, or conversion of the structure of the data in
a way that consumes less space. It builds a compact representation of the information by
removing redundancy and representing data in binary form. Compression from which the
original data can be restored exactly is called lossless compression; in contrast, compression
from which the original form cannot be fully restored is called lossy compression.
Dimensionality reduction and numerosity reduction methods are also used for data
compression.
This technique reduces the size of files using different encoding mechanisms, such as Huffman
encoding and run-length encoding. Based on the compression technique used, we can divide it
into two types.
i. Lossless Compression: Encoding techniques (such as run-length encoding) allow a simple
and modest reduction in data size. Lossless data compression uses algorithms to restore
the precise original data from the compressed data.
ii. Lossy Compression: In lossy data compression, the decompressed data may differ from
the original data but is still useful enough to retrieve information from. For example, the
JPEG image format uses lossy compression, yet the result conveys a meaning equivalent
to the original image. Methods such as the discrete wavelet transform (DWT) and
principal component analysis (PCA) are examples of this type of compression.
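A short sketch of lossless compression and exact recovery using Python's standard zlib module, whose DEFLATE codec internally combines LZ77 with Huffman coding; the repeated text is an illustrative, highly redundant input.

```python
import zlib

# Hypothetical text with heavy redundancy; lossless compression removes it.
data = b"quarterly sales report " * 100

compressed = zlib.compress(data)
restored = zlib.decompress(compressed)

print(len(data), "->", len(compressed), "bytes")
print(restored == data)  # True: the exact original is recovered (lossless)
```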
6. Data Discretization
The data discretization technique is used to divide attributes of a continuous nature into data
with intervals. Many continuous values of an attribute are replaced with the labels of small
intervals, so that mining results can be shown in a concise and easily understandable way.