Data Mining Unit 3

Forms of Data Pre-processing

• Requirements
• Quality Data
• Major Tasks in Data Pre-processing
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
• Data Discretization
Requirement of Data Pre-processing

• Pre-processing is required before the data mining task because real-world data are generally:
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
• Noisy: containing errors or outliers.
• Inconsistent: containing discrepancies in codes or names.
Quality Data
• The reasons behind dirty data are noise, redundancy and inconsistency.
• Quality data satisfy the requirements of their intended use.
• Quality data satisfy the following:
– Accuracy: noiseless or error free
– Completeness: no lacking attribute values
– Consistency: no discrepancies
– Timeliness: all records updated within the deadline
– Believability: trusted data
– Interpretability: easy to understand
Major Tasks in Data Pre-processing
• Data cleaning: fill in missing values, smooth noisy data,
identify or remove outliers, and resolve inconsistencies.
• Data integration: using multiple databases, data cubes, or
files.
• Data transformation: normalization and aggregation.
• Data reduction: reducing the volume but producing the same
or similar analytical results.
• Data discretization: part of data reduction, replacing
numerical attributes with nominal ones.
Data Cleaning

1. Handling Data with Missing Values


2. Handling Noisy Data(Data Smoothing)
3. Handling Inconsistent Data.
Methods used to handle data with missing values

• Ignore the tuple: usually done when the class label is missing.
• Use the attribute mean (or majority nominal value) to fill in the missing value.
• Use the attribute mean (or majority nominal value) of all samples belonging to the same class.
• Fill in missing values manually.
• Use a global constant (e.g. UNKNOWN or ∞) for missing values.
• Predict the missing value by using a learning algorithm: consider the attribute with the missing value as a dependent (class) variable and run a learning algorithm (usually Bayes or decision tree) to predict the missing value.
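A minimal Python/pandas sketch of the fill-in strategies above, assuming a small made-up table with an income attribute and a class label (the column names and values are illustrative only):

import numpy as np
import pandas as pd

# Hypothetical data: "income" has missing values, "class" is the class label.
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50.0, np.nan, 30.0, 35.0, np.nan],
})

# 1) Fill with the global attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 2) Fill with the attribute mean of samples belonging to the same class
df["income_class_mean"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean()))

# 3) Fill with a global constant
df["income_constant"] = df["income"].astype(object).fillna("UNKNOWN")

print(df)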
Handling Noisy Data (Data Smoothing)

Outliers can be identified and noisy data smoothed out by the following methods:
1. Binning methods
2. Regression
3. Clustering and outlier analysis
4. Computer and human inspection
Binning Methods
This method smooths sorted values by consulting their neighbourhood, i.e. the closest values. It performs only local smoothing because only neighbouring values are consulted.
Steps are as below:
1. Sort the attribute values.
2. Partition them into equal-size bins (the last bin may be smaller).
3. Smooth the data by replacing each value in a bin using any one of the following:
I. Mean of the bin
II. Median of the bin
III. Closest boundary value of the bin
Binning Methods Example:
Smooth the following price list using binning methods:
4, 8, 15, 21, 25, 28, 34, 21, 24. Let the bin size be 3 (given).
Step 1:sorted data:4,8,15,21,21,24,25,28,34
Step2: Partition data into bins of given size-3:
Bin1: 4, 8, 15
Bin2: 21, 21,24
Bin3: 25, 28, 34
Step3: Data smoothing using mean value:
Bin1: 9, 9, 9 [mean of 4,8,15 is 9]
Bin2: 22, 22,22 [mean of 21,21,24 is 22]
Bin3: 29, 29, 29 [mean of 25,28,34 is 29]
Binning Methods Example:
Bin1: 4, 8, 15
Bin2: 21, 21,24
Bin3: 25, 28, 34
Step3: Data smoothing using median value:
Bin1: 8, 8, 8 [median of 4,8,15 is 8]
Bin2: 21, 21,21 [median of 21,21,24 is 21]
Bin3: 28, 28, 28 [median of 25,28,34 is 28]
Binning Methods Example:
Bin1: 4 , 8, 15
Bin2: 21, 21,24
Bin3: 25, 28, 34

Step 3: Data smoothing using closest boundary value:
Bin 1: 4, 4, 15   [4 and 15 are the boundaries, so 8 is replaced by the closest boundary, 4]
Bin 2: 21, 21, 24 [21 and 24 are the boundaries]
Bin 3: 25, 25, 34 [25 and 34 are the boundaries, so 28 is replaced by the closest boundary, 25]
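A minimal Python sketch that reproduces the three smoothing variants on the price list above (bin size 3):

prices = [4, 8, 15, 21, 25, 28, 34, 21, 24]
size = 3

data = sorted(prices)                                        # step 1: sort
bins = [data[i:i + size] for i in range(0, len(data), size)] # step 2: partition

# Step 3: smooth each bin by mean, by median, or by closest boundary
by_mean = [[round(sum(b) / len(b))] * len(b) for b in bins]
by_median = [[b[len(b) // 2]] * len(b) for b in bins]        # bins are already sorted
by_boundary = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_mean)      # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_median)    # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(by_boundary)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]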
Regression Methods
• Regression and log-linear models can be used to approximate the given data.
• In simple linear regression, the data are modelled to fit a straight line between two attributes, so that one attribute can be used to predict the other.
• Multiple regression can be used to smooth the noise.
• y = a·x + b, where x and y are attributes and a, b are the regression coefficients.
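A minimal Python/NumPy sketch of fitting y = a·x + b by least squares and using the fitted line to smooth the y values; the x and y arrays are made-up illustrative data:

import numpy as np

# Made-up attribute values: y is roughly linear in x but noisy.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

a, b = np.polyfit(x, y, deg=1)   # least-squares straight-line fit
y_smoothed = a * x + b           # fitted values can replace the noisy y

print(f"y = {a:.2f}*x + {b:.2f}")
print(y_smoothed)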
Regression Methods
Clustering and Outliers
• Clustering is the process of grouping a set of data values into multiple groups such that objects within a group have high similarity and objects in different groups have high dissimilarity.
• Data that fall outside the clusters are treated as outliers and can be removed or ignored.
Clustering
(Figure: clustered data points; points lying outside every cluster are labelled as noise or outliers.)
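A minimal Python sketch of clustering-based outlier detection, assuming scikit-learn is available; points that DBSCAN leaves outside every cluster (label -1) are treated as outliers, and the sample points are made up for illustration:

import numpy as np
from sklearn.cluster import DBSCAN   # assumed available

X = np.array([[1, 2], [1, 3], [2, 2],    # cluster 1
              [8, 8], [8, 9], [9, 8],    # cluster 2
              [25, 30]])                 # isolated point

labels = DBSCAN(eps=2.0, min_samples=2).fit_predict(X)
outliers = X[labels == -1]               # -1 marks points outside every cluster

print(labels)    # e.g. [ 0  0  0  1  1  1 -1]
print(outliers)  # [[25 30]]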
Outlier Analysis by Box Plot

• A box plot is a simple way of representing statistical data on a plot in which a rectangle is drawn to represent the second and third quartiles, usually with a vertical line inside to indicate the median value.
• The lower and upper quartiles are shown as lines on either side of the rectangle.
• For a box plot, arrange the data in increasing order of value.
• Then calculate Q1, Q2, Q3 and the inter-quartile range IQR (Q3 − Q1), and draw the box plot.
Outlier Analysis by Box Plot
Drawing a box plot.

Example: draw a box plot for the data below.
10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4

Q2 = 14.6 (median value of the whole data set)
Q1 = 14.4 (lower quartile: median of the left half of the data set)
Q3 = 14.9 (upper quartile: median of the right half of the data set)
Then the IQR is given by:
IQR = Q3 − Q1 = 14.9 − 14.4 = 0.5
Outliers are any points satisfying:
outlier < Q1 − 1.5 × IQR   or   outlier > Q3 + 1.5 × IQR
outlier < 14.4 − 0.75 = 13.65   or   outlier > 14.9 + 0.75 = 15.65
Here the outlier below the lower limit is 10.2, and the outliers above the upper limit are 15.9 and 16.4.
The outliers of the data set are: 10.2, 15.9, 16.4
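A minimal Python sketch of the same IQR rule applied to the data set above:

data = sorted([10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6,
               14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4])

# With 15 sorted values, the quartiles used above are the 4th, 8th and 12th values
# (medians of the lower half, the whole set and the upper half).
q1, q2, q3 = data[3], data[7], data[11]
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in data if x < lower or x > upper]
print(q1, q2, q3)                                       # 14.4 14.6 14.9
print(round(iqr, 2), round(lower, 2), round(upper, 2))  # 0.5 13.65 15.65
print(outliers)                                         # [10.2, 15.9, 16.4]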


Computer and Human Inspection
• Suspicious values are detected by computer and then checked manually by a human.
• Outliers can be identified through a combination of computer and human inspection. An outlier pattern may be informative or garbage.
• Patterns whose surprise content is above a threshold are output to a list; a human can then sort through the patterns in the list to identify the actual garbage.
• This is much faster than manually searching through the entire database.
• The identified garbage patterns can then be excluded from subsequent data mining.
• Example: identifying outliers in a handwritten character database for classification, in which many different varieties of characters are used.
Handling Inconsistent Data
• Inconsistency means discrepancies in codes or names. It is an important issue to consider during data integration.
• The main reasons for data inconsistency are:
✓ Faulty hardware or software used for data collection.
✓ Human or computer error.
✓ Errors during data transmission.
✓ Technology limitations (data generated faster than it can be received).
✓ Inconsistency in naming conventions (02/05/2018 may be read as 5 Feb 2018 or 2 May 2018).
✓ Duplicate tuples in the database.
Data Integration
• Data integration is the process of merging data from multiple, heterogeneous data sources.
• It is necessary to maintain consistency and avoid redundancy in the resulting data set.
• This helps to improve the accuracy and speed of the subsequent mining process.
• The major challenges in the integration process are:
1. Entity identification problem
2. Tuple duplication
3. Data value conflict detection and resolution
4. Redundancy and correlation analysis
Entity Identification Problem
• This refers to how real-world entities from different databases are matched up.
– Example: customer_id in one relation and customer_no in another relation.
• During integration, special attention must be paid when merging data from various sources.
– Example: a discount attribute for individuals in one database may differ from that in another database.
• Metadata resolve these problems by defining the name, meaning, data type and value range of each attribute.
Tuple Duplication
• In addition to detecting redundancy to maintain data consistency, duplicate data tuples should be deleted.
• Inconsistencies often arise between duplicates when only some occurrences of the data are updated.
• Example: a customer name may be repeated with a different address against each occurrence.
Data Value Conflict Detection and Resolution

• Data integration also involves the detection and resolution of conflicting data values for the same real-world entity.
• Attribute values may differ across sources.
• This may be due to differences in scaling, encoding or representation.
• Example: in a university grading system, marks may be recorded as a percentage, as a letter grade (A/B/C) or under a credit system.
Redundancy and Correlation Analysis

• Inconsistency in values or attributes causes redundancy in the data set. An attribute may be redundant if it can be derived from another attribute or set of attributes.
• Example: the annual revenue of an organization may be derived from several sources, so ignoring any source during integration may create redundancy.
• Redundancy can be detected by correlation analysis, which detects how strongly two or more attributes depend on each other.
• The chi-square (χ²) test can be applied to detect how strongly one attribute implies another.
Example:
A group of 1500 people is surveyed. The contingency table below gives the observed frequencies of preferred reading by male and female respondents. Perform a chi-square (χ²) test of the hypothesis that "gender and preferred reading are independent" at significance level 0.001.

Observed frequencies:
                          Male    Female    Total
Preferred   Fiction        250       200      450
Reading     Non-fiction     50      1000     1050
            Total          300      1200     1500

The expected frequency (e) is computed by the formula:
e = (count(gender) × count(preferred_reading)) / N

Expected frequencies:
                          Male    Female    Total
Preferred   Fiction         90       360      450
Reading     Non-fiction    210       840     1050
            Total          300      1200     1500

Degrees of freedom = (rows − 1) × (columns − 1) = (2 − 1) × (2 − 1) = 1

χ² = Σ (o − e)²/e
   = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840
   ≈ 284.44 + 121.90 + 71.11 + 30.48 ≈ 507.93

Since 507.93 is far above the critical value of 10.828 for 1 degree of freedom at the 0.001 significance level, the hypothesis that gender and preferred reading are independent is rejected: the two attributes are strongly correlated.
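A minimal Python sketch, assuming SciPy is available, that reproduces the chi-square test on the observed table above:

import numpy as np
from scipy.stats import chi2_contingency   # assumed available

observed = np.array([[250,  200],    # fiction:     male, female
                     [ 50, 1000]])   # non-fiction: male, female

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected)        # [[ 90. 360.] [210. 840.]]
print(dof)             # 1
print(round(chi2, 2))  # about 507.93
print(p < 0.001)       # True -> reject "gender and preferred reading are independent"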
Chi-Square Table (critical values for the independence test)
Example 2: A researcher might want to know if there is a significant association between the variables gender and soft drink choice (Coke and Pepsi were considered). The null hypothesis would be:
H0: There is no significant association between gender and soft drink choice (gender and preferred soft drink are independent).
Significance level: 5%.
Data Transformation
In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Strategies used for data transformation are:
1. Smoothing: remove the noise.
2. Attribute construction: a new set of attributes is generated.
3. Aggregation: summarization of data values.
4. Normalization: attribute values are scaled into a new range (−1 to +1, 0 to 1, etc.).
5. Discretization: data are divided into discrete intervals (e.g. 0–100, 101–200, 201–300).
6. Concept hierarchy generation, from higher to lower levels. For example, an address hierarchy is country > state > city > street.
Data Transformation by Normalization
Normalization changes the unit of measurement, for example from metres to kilometres.
For better performance, data should be scaled into a smaller range such as [−1, +1] or [0, 1].
The following methods can be used for data normalization:
1. Min-max normalization
2. Z-score (zero-mean) normalization
3. Normalization by decimal scaling
Example:
Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. Map income $73,600 to the range [0.0, 1.0].
Min-max normalization: v′ = ((v − minA)/(maxA − minA)) × (new_maxA − new_minA) + new_minA
Here minA = 12,000, maxA = 98,000, new_minA = 0, new_maxA = 1 and v = 73,600.
v′ = ((73,600 − 12,000)/(98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
So 73,600 is represented as 0.716 in the new range [0, 1].
Example:
Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, v′ = (v − mean)/standard deviation, so a value of $73,600 for income is transformed to:
v′ = (73,600 − 54,000)/16,000 = 1.225
So the new representation of 73,600 is 1.225.


Example:
Suppose that the recorded values of A range from −986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e. j = 3, so that 986 becomes less than 1, i.e. 0.986). Thus −986 normalizes to −0.986 and 917 normalizes to 0.917.
Normalization Practice
Use the two methods below to normalize the
following group of data:
200, 300, 400, 600, 1000
(a) min-max normalization by setting
min = 0 and max = 1
(b) z-score normalization
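A minimal Python/NumPy sketch of both normalizations applied to the practice data (using the population standard deviation for the z-score):

import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# (a) min-max normalization to the range [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())
print(minmax)                         # [0.    0.125 0.25  0.5   1.   ]

# (b) z-score normalization (population standard deviation)
zscore = (x - x.mean()) / x.std()
print(x.mean(), round(x.std(), 2))    # 500.0 282.84
print(np.round(zscore, 2))            # [-1.06 -0.71 -0.35  0.35  1.77]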
Data Reduction
• Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume.
• The reduced representation closely maintains the integrity of the original data.
• The following strategies are used for data reduction:
1. Dimensionality reduction
2. Attribute subset selection
3. Numerosity reduction
4. Data cube aggregation
5. Discretization and concept hierarchy generation
6. Decision tree induction
7. Data compression
Dimensionality Reduction
• In dimensionality reduction, data encoding or transformations are applied to obtain a reduced representation of the original data.
• If the original data can be reconstructed from the reduced data without any loss of information, the data reduction is called lossless: the reduced data retain all the information of the original data.
• If we can reconstruct only an approximation of the original data, the data reduction is called lossy: some information is lost, but the reduced representation captures the most important features.
• Two popular and effective methods of lossy dimensionality reduction are:
1. Wavelet transforms
2. Principal components analysis
Attribute Subset Selection

• Data sets for analysis may contain hundreds of attributes, many of


which may be irrelevant to the mining task or redundant.
• It reduces the data set size by removing irrelevant or redundant
attributes (or dimensions).
• The goal of attribute subset selection is to find a minimum set of
attributes.
• Attribute subset selection include the following techniques:
1. Stepwise forward selection:
The procedure starts with an empty set of attributes as the reduced
set. The best of the original attributes is determined and added to
the reduced set. At each subsequent iteration or step, the best of
the remaining original attributes is added to the set.
Attribute Subset Selection
2. Stepwise backward elimination:
The procedure starts with the full set of attributes. At each step, it
removes the worst attribute remaining in the set.

3. Combination of forward selection and backward


elimination:
The stepwise forward selection and backward elimination methods
can be combined so that, at each step, the procedure selects the
best attribute and removes the worst from among the remaining
attributes.
4. Decision tree induction:
When decision tree induction is used for attribute subset
selection, a tree is constructed from the given data. All attributes
that do not appear in the tree are assumed to be irrelevant. The
set of attributes appearing in the tree form the reduced subset of
attributes.
Decision Tree Induction
1. A flow-chart-like tree structure.
2. Supports decision making.
3. Defines rules visually in the form of a tree.
Types of nodes:
1. Root node: the main question.
2. Branch node: intermediate processes.
3. Leaf node: the answer.
Decision Tree
Attribute Selection Measures
1. Information gain: how much information we gain from the answer to a specific question.
2. Entropy: measures the amount of uncertainty in the information.
The two are inversely proportional to each other, as sketched below.
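A minimal Python sketch of entropy and information gain on a tiny made-up class-label distribution (the labels are illustrative only):

from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

parent = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
left   = ["yes", "yes", "yes", "no"]   # one branch of a candidate split
right  = ["no", "no", "no", "no"]      # the other branch

# Information gain = entropy before the split - weighted entropy after it
gain = entropy(parent) \
       - (len(left) / len(parent)) * entropy(left) \
       - (len(right) / len(parent)) * entropy(right)

print(round(entropy(parent), 3))  # about 0.954
print(round(gain, 3))             # about 0.549 (higher gain = lower remaining entropy)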

Numerosity Reduction
• Numerosity reduction reduces the data volume by choosing alternative, 'smaller' forms of data representation.
• These techniques may be parametric or nonparametric.
• For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored instead of the actual data. Regression and log-linear models are examples, e.g. fitting a line to represent the trend in the data.
• Nonparametric methods for storing reduced representations of the data include:
1. Histograms: the data are divided into intervals (bins), and the frequency of data points in each bin is stored.
2. Clustering: data points are grouped into clusters based on similarity, and representative values for each cluster are used to summarize the data.
3. Sampling: a subset of the data is selected (randomly or systematically) to represent the entire data set.
4. Data cube aggregation: multidimensional data are summarized by computing aggregates (e.g. sum, average) for different combinations of dimensions in a data cube.
Histograms
• A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, or buckets.
• If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets.
• There are several partitioning rules, including the following:
• Equal-width: in an equal-width histogram, the width of each bucket range is uniform (such as a width of $10 for the price buckets). The range of the attribute A is divided into buckets of uniform width; for example, if A ranges from 0 to 100 and we use 10 buckets, each bucket has a width of 10 (e.g. [0–10), [10–20), ..., [90–100]).
• Equal-frequency (or equidepth): in an equal-frequency histogram, the buckets are created so that, roughly, the frequency of each bucket is constant (that is, each bucket contains roughly the same number of contiguous data samples). For example, if there are 100 data points and 10 buckets, each bucket will contain roughly 10 data points.
Histograms
• Example :The following data are a list of prices of commonly
sold items at AllElectronics (rounded to the nearest dollar):
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20,
20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
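A minimal Python/NumPy sketch of equal-width and equal-frequency bucketing of the price list above:

import numpy as np

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20,
          20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25,
          28, 28, 30, 30, 30]

# Equal-width: three buckets of uniform width 10 over the range 1-31
counts, edges = np.histogram(prices, bins=3, range=(1, 31))
print(edges)    # [ 1. 11. 21. 31.]
print(counts)   # [13 25 14] -> frequency per width-10 bucket

# Equal-frequency (equidepth): bucket edges at quantiles, so each bucket
# holds roughly the same number of values
edges_eq = np.quantile(prices, [0, 1/3, 2/3, 1])
print(edges_eq)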
Sampling
Sampling can be used as a data reduction technique because it
allows a large data set to be represented by a much smaller
random sample (or subset) of the data.
1. Simple random sample without replacement (SRSWOR) of size s:
This is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely to be sampled.
2. Simple random sample with replacement (SRSWR) of size s:
This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D before the next draw, so the same tuple may be sampled more than once.
A comparison of the two is sketched below.
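A minimal Python sketch contrasting SRSWOR and SRSWR with the standard library (the data set D is made up for illustration):

import random

D = list(range(1, 101))   # a made-up data set of N = 100 tuples
s = 10

srswor = random.sample(D, s)      # without replacement: no tuple repeats
srswr  = random.choices(D, k=s)   # with replacement: the same tuple may repeat

print(srswor)
print(srswr)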
3. Cluster sampling:
Cluster sampling is a probabilistic sampling technique in which the population is divided into groups or clusters, often by geographical location, and a random selection of these clusters is made. All members within the selected clusters are included in the sample. Groups are heterogeneous, meaning the individual characteristics vary within each group. This method is particularly effective when the population is large and widely dispersed geographically, and it is often used to reduce costs and increase efficiency.

4. Stratified sampling:
Stratified sampling is a probabilistic sampling technique in which the population is divided into smaller groups, known as strata, based on shared characteristics, and a sample is drawn from each stratum. Groups are homogeneous, meaning the units within a stratum share characteristics. This method ensures that specific subgroups of the population are adequately represented in the final sample, improves precision and representation, and can help researchers make better estimates.
Data Cube Aggregation
• Data cube aggregation is used to summarize a multidimensional data cube.
• Data aggregation is any process in which information is gathered and expressed in summary form, for purposes such as statistical analysis.
• A common aggregation purpose is to get more information about particular groups based on specific variables such as age, profession, sales or income, as in the sketch below.
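A minimal Python/pandas sketch of aggregating detailed sales records into summary cells per (year, branch); the sales table is made up for illustration:

import pandas as pd

sales = pd.DataFrame({
    "year":   [2022, 2022, 2022, 2023, 2023, 2023],
    "branch": ["A",  "B",  "A",  "A",  "B",  "B"],
    "amount": [100,  150,  120,  130,  170,  160],
})

# Summarize the detailed records into one cell per (year, branch) combination
cube = sales.groupby(["year", "branch"])["amount"].sum().reset_index()
print(cube)

# Roll up further along the year dimension: total sales per year
print(sales.groupby("year")["amount"].sum())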
Data Discretization
Data discretization is the process of converting continuous data attributes into discrete intervals or bins. This transformation is particularly useful for simplifying data analysis, visualizing patterns, and preparing data for algorithms that work better with discrete values (e.g. certain classification algorithms). Discretization can be categorized based on how it is performed:
1. Supervised discretization uses class information.
2. Unsupervised discretization does not use prior class information.
3. Other techniques are:
• Binning methods
• Histogram analysis
• Cluster analysis
• Decision tree analysis
• Correlation analysis
Discretization refers to transforming continuous numerical data into discrete categories or intervals. Concept hierarchy generation involves replacing raw attribute values with higher-level categories or concepts (e.g. replacing an age value such as 25 with "young").
Purpose: 1) these techniques simplify the representation of data for analysis; 2) they are essential for mining data at multiple abstraction levels, enabling pattern detection and insight.

Discretization and Concept Hierarchy Generation for Numerical Data
• Discretization reduces the volume of data by grouping continuous values into intervals, lowering complexity without significant information loss. Example: replace individual age values (1, 2, 3, ...) with intervals (e.g. "1–10", "11–20").
• In discretization and concept hierarchy generation, raw data values for attributes are replaced by ranges or higher conceptual levels.
• Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies.
• Discretization and concept hierarchy generation are powerful tools for data mining, in that they allow the mining of data at multiple levels of abstraction: algorithms can focus on relationships at different levels of abstraction, improving the flexibility of analysis. Example: a retail store could analyse data by specific prices (numerical level) or by price ranges (conceptual level, e.g. "low", "medium", "high").
Strategies are:
1. Binning
2. Histogram analysis
3. Entropy-based discretization
4. Cluster analysis
5. Discretization by intuitive partitioning
Discretization and Concept Hierarchy Generation for
Numerical Data

1. Binning: Binning is a top-down splitting technique


based on a specified number of bins.
These methods are also used as discretization
methods for numerosity reduction and concept
hierarchy generation.
2. Histogram analysis: Like binning, histogram analysis
is an unsupervised discretization technique because
it does not use class information. Histograms
partition the values for an attribute, A, into disjoint
ranges called buckets.
Discretization and Concept Hierarchy Generation for Numerical Data

3. Entropy-based discretization: entropy-based discretization is a supervised, top-down splitting technique. It is a recursive process that starts with the entire range of the attribute and progressively splits it into smaller intervals based on entropy minimization.
• It explores class distribution information in its calculation and determination of split-points (data values for partitioning an attribute range). A split-point is a numerical value used to divide an attribute's range into two intervals.
• For each possible split-point in the attribute range, the entropy of the resulting intervals is calculated; the split-point that minimizes the entropy is selected.
• To discretize a numerical attribute, A, the method selects the value of A that has the minimum entropy as a split-point and recursively partitions the resulting intervals (once the first split-point is chosen, the process is repeated within each resulting interval) to arrive at a hierarchical discretization. Such a discretization forms a concept hierarchy for A.
• The recursion may stop when a minimum interval size, a maximum number of intervals, or a desired entropy threshold is reached.
• Entropy quantifies the amount of uncertainty or impurity in a data set. A small sketch of split-point selection follows.
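A minimal Python sketch of selecting one entropy-minimizing split-point for a numerical attribute; the (value, class) pairs are illustrative only:

from collections import Counter
from math import log2

# Illustrative (value, class) pairs for a numerical attribute
data = [(23, "low"), (25, "low"), (30, "low"), (45, "high"),
        (52, "high"), (56, "high"), (60, "high")]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_entropy(split, rows):
    # Weighted entropy of the two intervals produced by the split-point
    left  = [c for v, c in rows if v <= split]
    right = [c for v, c in rows if v > split]
    n = len(rows)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

# Candidate split-points: midpoints between consecutive sorted values
values = sorted(v for v, _ in data)
candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

best = min(candidates, key=lambda s: split_entropy(s, data))
print(best)   # 37.5 -> this split separates the "low" and "high" classes perfectly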


Discretization and Concept Hierarchy Generation for Numerical Data

4. Cluster analysis: a clustering algorithm can be applied to discretize a numerical attribute, A, by partitioning the values of A into clusters or groups.
Clustering takes the distribution of A into consideration, as well as the closeness of data points, and is therefore able to produce high-quality discretization results.
Discretization and Concept Hierarchy Generation for Numerical Data

5. Discretization by intuitive partitioning: intuitive partitioning creates intervals for numerical attributes based on domain knowledge or human understanding rather than purely statistical or algorithmic approaches. This approach is unsupervised and focuses on logical, easy-to-interpret divisions.
A concept hierarchy for the attribute price can be built from such intervals, where an interval ($X ... $Y] denotes the range from $X (exclusive lower bound) to $Y (inclusive upper bound). For example, ($100 ... $500] includes values greater than $100 and up to $500.
Concept Hierarchy Generation for Categorical Data

• Categorical data are discrete data. Categorical attributes have a finite (but possibly large) number of distinct values, with no ordering among the values. Examples include geographic location (country, state, city), job category (industry, role, title) and item type (category, subcategory, brand).
• There are several methods for the generation of concept hierarchies for categorical data:
1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts: users or domain experts define a hierarchy directly within the schema by specifying the relationships between attributes.
2. Specification of a portion of a hierarchy by explicit data grouping: users specify groups or categories directly based on the values of the attribute.
3. Specification of a set of attributes, but not of their partial ordering: users specify which attributes belong to a concept hierarchy, but do not define their explicit order.
1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts:
This approach involves manual creation of a concept hierarchy by users or domain experts who specify the relationships and order of attributes based on their knowledge of the domain, creating a hierarchy at the schema level.
Example: in a geographic location hierarchy, experts may specify that "City" is below "State," which is below "Country." The hierarchy would look like:
• Country → State → City

2. Specification of a portion of a hierarchy by explicit data grouping:
Rather than defining an entire hierarchy in advance, the focus is on creating clusters or groups of related data. This method involves grouping data based on the inherent relationships within the data; portions of the hierarchy are generated by clustering or organizing similar items together.
Example: for job categories, we could group roles such as "Software Engineer," "Data Scientist," and "Product Manager" under the broader category "Technical Roles," while "Accountant" and "Financial Analyst" fall under "Finance Roles." This might create a hierarchy like:
• Job Category → Technical Roles, Finance Roles

3. Specification of a set of attributes without partial ordering:
Here, a set of attributes is defined, but no explicit ordering is assigned. This allows flexibility in how data is grouped at different levels, without a fixed sequence.
Example: for an item hierarchy, we could specify attributes like "Brand," "Type," and "Model" for products without specifying a strict order. This enables the hierarchy to adapt based on the context, such as grouping by "Brand" first in one case or "Type" in another.
Suppose we have three attributes for a set of products:
• Brand (e.g. "Apple," "Samsung," "Sony")
• Type (e.g. "Smartphone," "Tablet," "Laptop")
• Model (specific model names, like "iPhone 14," "Galaxy S23")
Without a partial ordering, we can group or organize these attributes in various ways based on different analytical needs or perspectives.

Attributes defined:
• Brand: Apple, Samsung, Sony
• Type: Smartphone, Tablet, Laptop
• Model: specific models like iPhone 14, Galaxy S21, etc.

Dynamic hierarchy generation:
Case 1: Group by Brand first (Brand → Type → Model), e.g. Apple → Smartphone → iPhone 14; Samsung → Smartphone → Galaxy S21.
Case 2: Group by Type first (Type → Brand → Model), e.g. Smartphone → Apple → iPhone 14; Smartphone → Samsung → Galaxy S21.
Case 3: Group by Model first, context-specific (Model → Brand → Type), e.g. iPhone 14 → Apple → Smartphone; Galaxy S21 → Samsung → Smartphone.
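A minimal Python/pandas sketch of grouping the same product attributes in different orders when no partial ordering is fixed; the product rows (including the extra iPad and VAIO entries) are illustrative only:

import pandas as pd

products = pd.DataFrame({
    "Brand": ["Apple",      "Samsung",    "Apple",  "Sony"],
    "Type":  ["Smartphone", "Smartphone", "Tablet", "Laptop"],
    "Model": ["iPhone 14",  "Galaxy S21", "iPad",   "VAIO"],
})

# Case 1: group by Brand first, then Type
print(products.groupby(["Brand", "Type"])["Model"].apply(list))

# Case 2: group by Type first, then Brand
print(products.groupby(["Type", "Brand"])["Model"].apply(list))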
