UNIT-3: DWDM
Why do we need Data Mining?
The volume of information that we need to handle is increasing every day, coming from business transactions, scientific data, sensor data, pictures, videos, etc. So, we need a system capable of extracting the essence of the available information and automatically generating reports, views, or summaries of the data for better decision-making.
Some people treat data mining as a synonym for Knowledge Discovery from Data (KDD), while others view data mining as an essential step in the process of knowledge discovery. Here is the list of steps involved in the knowledge discovery process −
Data Cleaning − In this step, noise and inconsistent data are removed.
Data Integration − In this step, multiple data sources are combined.
Data Selection − In this step, data relevant to the analysis task are retrieved from the database.
Data Transformation − In this step, data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
Data Mining − In this step, intelligent methods are applied in order to extract data patterns.
Pattern Evaluation − In this step, the discovered patterns are evaluated using interestingness measures.
Knowledge Presentation − In this step, the mined knowledge is represented and presented to the user using visualization and knowledge representation techniques.
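The steps above form a pipeline. As a rough illustration (not part of the original notes), the sketch below strings the steps together with pandas; the file name "transactions.csv" and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical input file and column names, used only for illustration.
raw = pd.read_csv("transactions.csv", parse_dates=["date"])

clean = raw.dropna().drop_duplicates()                    # data cleaning
selected = clean[["date", "customer_id", "amount"]]       # data selection
monthly = (selected
           .groupby([selected["date"].dt.to_period("M"), "customer_id"])["amount"]
           .sum()
           .reset_index())                                # data transformation (aggregation)
patterns = monthly[monthly["amount"] > 1000]              # a trivial "mining"/evaluation step
print(patterns.head())                                    # knowledge presentation
```

The kinds of data sources on which mining can be performed include the following: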
1. Flat Files
Flat files are defined as data files in text form or binary form with a structure that
can be easily extracted by data mining algorithms.
Data stored in flat files have no relationships or paths among themselves; for example, if a relational database is exported to flat files, the relationships between its tables are lost.
Flat files are described by a data dictionary, e.g., a CSV file.
Application: used in data warehousing to store data, used for carrying data to and from a server, etc.
2. Relational Databases
A Relational database is defined as the collection of data organized in tables with
rows and columns.
Physical schema in Relational databases is a schema which defines the structure of
tables.
Logical schema in Relational databases is a schema which defines the relationship
among tables.
The standard query language for relational databases is SQL.
Application: Data Mining, ROLAP model, etc.
Page 2
ASHISH DIXIT AJAY KUMAR GARG ENGINEERING COLLEGE
3. Data Warehouse
A data warehouse is defined as a collection of data integrated from multiple sources that supports querying and decision making.
There are three types of data warehouse: Enterprise data warehouse, Data Mart, and Virtual Warehouse.
Two approaches can be used to integrate and update data in a data warehouse: the Query-driven approach and the Update-driven approach.
Application: Business decision making, Data mining, etc.
4. Transactional Databases
Transactional databases are collections of data organized by time stamps, dates, etc., where each record represents a transaction.
This type of database has the capability to roll back or undo an operation when a transaction is not completed or committed.
It is a highly flexible system where users can modify information without changing sensitive information.
It follows the ACID properties of a DBMS.
Application: Banking, Distributed systems, Object databases, etc.
5. Multimedia Databases
Multimedia databases consist of audio, video, image, and text media.
They can be stored on Object-Oriented Databases.
They are used to store complex information in pre-specified formats.
Application: Digital libraries, video-on demand, news-on demand, musical
database, etc.
6. Spatial Database
Store geographical information.
Stores data in the form of coordinates, topology, lines, polygons, etc.
Application: Maps, Global positioning, etc.
7. Time-series Databases
Time-series databases contain data such as stock exchange data and user-logged activities.
They handle arrays of numbers indexed by time, date, etc.
They often require real-time analysis.
Application: eXtremeDB, Graphite, InfluxDB, etc.
8. WWW
The WWW (World Wide Web) is a collection of documents and resources such as audio, video, text, etc., which are identified by Uniform Resource Locators (URLs) through web browsers, linked by HTML pages, and accessible via the Internet.
It is the most heterogeneous repository, as it collects data from multiple sources.
It is dynamic in nature, as the volume of data is continuously increasing and changing.
Application: Online shopping, Job search, Research, studying, etc.
a) Descriptive Function
The descriptive function deals with the general properties of data in the database.
Here is the list of descriptive functions −
1. Class/Concept Description
2. Mining of Frequent Patterns
3. Mining of Associations
4. Mining of Correlations
5. Mining of Clusters
1. Class/Concept Description
Class/Concept refers to the data to be associated with classes or concepts. For example, in a company, the classes of items for sale include computers and printers, and concepts of customers include big spenders and budget spenders. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived in the following two ways −
Data Characterization − This refers to summarizing the data of the class under study. This class under study is called the target class.
Data Discrimination − This refers to comparing the target class with one or a set of comparative (contrasting) classes.
4. Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two item sets, to determine whether they have a positive, negative, or no correlation with each other.
5. Mining of Clusters
A cluster refers to a group of similar objects. Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.
b) Classification and Prediction
Classification is the process of finding a model that describes the data classes or
concepts. The purpose is to be able to use this model to predict the class of objects whose class
label is unknown. This derived model is based on the analysis of sets of training data. The
derived model can be presented in the following forms −
3. Decision Trees − A decision tree is a structure that includes a root node, branches, and
leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the
outcome of a test, and each leaf node holds a class label.
6. Outlier Analysis − Outliers may be defined as the data objects that do not comply with
the general behavior or model of the data available.
Note − These primitives allow us to communicate in an interactive manner with the data
mining system. Here is the list of Data Mining Task Primitives −
Set of task relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in discovery process.
Interestingness measures and thresholds for pattern evaluation.
Representation for visualizing the discovered patterns.
1. Statistics:
It uses mathematical analysis to express representations, models, and summaries of empirical data or real-world observations.
Statistical analysis involves a collection of methods applicable to large amounts of data in order to draw conclusions and report trends.
2. Machine learning
Arthur Samuel defined machine learning as a field of study that gives computers the ability to learn without being explicitly programmed.
When new data are entered into the computer, machine learning algorithms help the learned models to grow or change accordingly.
In machine learning, an algorithm is constructed to predict the data from the available
database (Predictive analysis).
It is related to computational statistics.
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In this world of connectivity, security has become a major issue. The increased usage of the Internet and the availability of tools and tricks for intruding into and attacking networks have prompted intrusion detection to become a critical component of network administration.
Major Issues in data mining:
Data mining is a dynamic and fast-expanding field with great strengths. The major issues can be divided into five groups:
a) Mining methodology
b) User interaction
c) Efficiency and scalability
d) Diverse data types
e) Data mining and society
a) Mining Methodology:
It refers to the following kinds of issues −
Mining different kinds of knowledge in databases − Different users may be
interested in different kinds of knowledge. Therefore it is necessary for data mining
to cover a broad range of knowledge discovery tasks.
Mining knowledge in multidimensional space – when searching for knowledge in
large datasets, we can explore the data in multidimensional space.
Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining data regularities. Without such methods, the accuracy of the discovered patterns will be poor.
Pattern evaluation − Many of the patterns discovered may be uninteresting because they represent common knowledge or lack novelty; interestingness measures are needed to evaluate the discovered patterns.
b) User Interaction:
Interactive mining of knowledge at multiple levels of abstraction − The data mining
process needs to be interactive because it allows users to focus the search for patterns,
providing and refining data mining requests based on the returned results.
Incorporation of background knowledge − To guide discovery process and to
express the discovered patterns, the background knowledge can be used. Background
knowledge may be used to express the discovered patterns not only in concise terms
but at multiple levels of abstraction.
Data mining query languages and ad hoc data mining − Data Mining Query
language that allows the user to describe ad hoc mining tasks, should be integrated
with a data warehouse query language and optimized for efficient and flexible data
mining.
Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
2. Binary Attributes: Binary data has only 2 values/states. For Example yes or no,
affected or unaffected, true or false.
i) Symmetric: Both values are equally important (e.g., gender).
ii) Asymmetric: The two values are not equally important (e.g., a medical test result, where the positive outcome is the rarer and more significant one).
3. Ordinal Attributes: An ordinal attribute contains values that have a meaningful sequence or ranking (order) between them, but the magnitude between successive values is not known; the order of the values shows what is important but does not indicate how important it is.
Example:
Attribute: Grade; Values: O, S, A, B, C, D, F
5. Discrete: Discrete attributes have a finite or countably infinite set of values; they can be numeric or categorical.
Example:
Attribute: Profession; Values: Teacher, Businessman, Peon
Attribute: ZIP Code; Values: 521157, 521301
6. Continuous: Continuous attributes have an infinite number of possible values and are typically of floating-point type; for example, there can be many values between 2 and 3.
Example:
Attribute: Height; Values: 5.4, 5.7, 6.2, etc.
Attribute: Weight; Values: 50, 65, 70, 73, etc.
Mean: The (arithmetic) mean of a set of N values x1, x2, …, xN is
Mean (x̄) = (x1 + x2 + … + xN) / N
Sometimes, each value xi in a set may be associated with a weight wi, for i = 1, 2, …, N. Then the mean is as follows:
Weighted mean = (w1·x1 + w2·x2 + … + wN·xN) / (w1 + w2 + … + wN)
This is called the weighted arithmetic mean or weighted average.
Median: The median is the middle value among all (sorted) values.
For an odd number N of values, the median is the ((N + 1)/2)-th value.
For an even number N of values, the median is taken as the average of the (N/2)-th and (N/2 + 1)-th values.
Mode:
The mode is another measure of central tendency.
Datasets with one, two, or three modes are respectively called unimodal, bimodal,
and trimodal.
A dataset with two or more modes is multimodal. If each value occurs only once, then there is no mode.
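As a small illustration (the data values are hypothetical), these measures can be computed with Python's statistics module:

```python
from statistics import mean, median, multimode

# Hypothetical data set (already sorted), with weights all equal to 1 for illustration.
values  = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
weights = [1] * len(values)

arithmetic_mean = mean(values)                                              # 58
weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)  # same here
med = median(values)          # N is even, so the two middle values are averaged -> 54
modes = multimode(values)     # [52, 70] -> the data set is bimodal
```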
The data values can be represented as bar charts, pie charts, line graphs, etc.
Quantile plots:
A quantile plot is a simple and effective way to have a first look at a univariate data
distribution.
It plots quantile information: for data xi sorted in increasing order, fi indicates that approximately 100·fi % of the data are below or equal to the value xi.
Note that
the 0.25 quantile corresponds to quartile Q1,
the 0.50 quantile is the median, and
the 0.75 quantile is Q3.
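A minimal matplotlib sketch of a quantile plot, using an illustrative data set and fi = (i − 0.5)/N:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical univariate data for illustration.
data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
N = len(data)
f = (np.arange(1, N + 1) - 0.5) / N       # f_i: fraction of data <= x_i

plt.plot(f, data, marker="o")
plt.axvline(0.25), plt.axvline(0.50), plt.axvline(0.75)   # Q1, median, Q3
plt.xlabel("f-value (quantile)")
plt.ylabel("data value")
plt.title("Quantile plot")
plt.show()
```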
Scatter Plot:
A scatter plot is one of the most effective graphical methods for determining whether there appears to be a relationship, clusters of points, or outliers between two numeric attributes.
Each pair of values is treated as a pair of coordinates and plotted as a point in the plane.
Data Visualization:
Visualization is the use of computer graphics to create visual images which aid in the
understanding of complex, often massive representations of data.
Categorization of visualization methods:
a) Pixel-oriented visualization techniques
b) Geometric projection visualization techniques
c) Icon-based visualization techniques
d) Hierarchical visualization techniques
e) Visualizing complex data and relations
a) Pixel-oriented visualization techniques
For a data set of m dimensions, create m windows on the screen, one for each
dimension
The m dimension values of a record are mapped to m pixels at the corresponding
positions in the windows
The colors of the pixels reflect the corresponding values
To save space and show the connections among multiple dimensions, space filling is
often done in a circle segment
a) Euclidean Distance
Assume that we have measurements xik, i = 1, … , N, on variables k = 1, … , p (also
called attributes).
The Euclidean distance between the ith and jth objects is
d(i, j) = sqrt( (xi1 − xj1)² + (xi2 − xj2)² + … + (xip − xjp)² )
Note that λ (the exponent used in the more general Minkowski distance) and p (the number of variables) are two different parameters; the dimension of the data matrix remains finite.
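For instance (illustrative values, p = 3):

```python
import numpy as np

# Measurements on p = 3 attributes for two objects (illustrative values).
xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 6.0, 3.0])

d_euclidean = np.sqrt(np.sum((xi - xj) ** 2))   # same as np.linalg.norm(xi - xj) -> 5.0
```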
c) Mahalanobis Distance
Let X be an N × p data matrix. Then the ith row of X is xi = (xi1, xi2, …, xip).
The Mahalanobis distance between the ith and jth objects is
d(i, j) = sqrt( (xi − xj) S⁻¹ (xi − xj)ᵀ )
where S is the p × p sample covariance matrix of X.
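A minimal sketch with NumPy, assuming a hypothetical N × p data matrix X:

```python
import numpy as np

# Hypothetical N x p data matrix X.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

S = np.cov(X, rowvar=False)          # p x p sample covariance matrix of X
S_inv = np.linalg.inv(S)

def mahalanobis(xi, xj, S_inv):
    """Mahalanobis distance between two rows of X."""
    d = xi - xj
    return float(np.sqrt(d @ S_inv @ d))

dist = mahalanobis(X[0], X[1], S_inv)
```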
In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equal width, where the interval range of values in each bin is constant.
Example 2: Remove the noise in the following price data (in dollars) using smoothing techniques:
8, 4, 9, 21, 25, 24, 29, 26, 28, 15, 21, 34
Sorted data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into equal-frequency (equi-depth) bins:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
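The same example can be reproduced in Python; this sketch uses the 12 sorted values above and bins of size 4:

```python
# Reproduces the smoothing example above: equal-frequency bins of size 4.
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [data[i:i + 4] for i in range(0, len(data), 4)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value is replaced by the closer boundary.
def smooth_by_boundaries(b):
    lo, hi = b[0], b[-1]
    return [lo if v - lo <= hi - v else hi for v in b]

by_boundaries = [smooth_by_boundaries(b) for b in bins]
# by_means      -> [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
# by_boundaries -> [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```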
Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the "best" line to fit two attributes (or variables), so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or "clusters." Intuitively, values that fall outside of the set of clusters may be considered outliers.
Inconsistent Data
Inconsistencies may exist in the data stored in transactions. They occur due to errors during data entry, violations of functional dependencies between attributes, and missing values. The inconsistencies can be detected and corrected either manually or with knowledge engineering tools.
Data cleaning as a process
a) Discrepancy detection
b) Data transformations
a) Discrepancy detection
The first step in data cleaning is discrepancy detection. It uses knowledge of metadata and examines the following rules for detecting discrepancies.
Unique rules – each value of the given attribute must be different from all other values for that attribute.
Consecutive rules – there must be no missing values between the lowest and highest values for the attribute, and all values must also be unique.
Null rules – specify the use of blanks, question marks, special characters, or other strings that may indicate the null condition.
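As a small illustration (column names and value ranges are hypothetical), such rules can be checked with pandas:

```python
import pandas as pd

# Illustrative data frame; the column names and value ranges are hypothetical.
df = pd.DataFrame({"customer_id": [1, 2, 2, 4],
                   "age": [25, None, 40, 999]})

# Unique rule: every customer_id should be different from all others.
unique_violations = df[df["customer_id"].duplicated(keep=False)]

# Null rule: blanks / NaN (or agreed special strings) mark missing values.
null_violations = df[df["age"].isna()]

# Simple domain-knowledge check (data-scrubbing style): age outside a plausible range.
range_violations = df[(df["age"] < 0) | (df["age"] > 120)]
```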
Discrepancy detection Tools:
Data scrubbing tools – use simple domain knowledge (e.g., knowledge of postal addresses and spell-checking) to detect errors and make corrections in the data.
Data auditing tools – analyze the data to discover rules and relationships, and detect data that violate such conditions.
b) Data transformations
This is the second step in data cleaning as a process. After detecting discrepancies, we
need to define and apply (a series of) transformations to correct them.
Data Transformations Tools:
Data migration tools – allow simple transformations to be specified, such as replacing the string "gender" with "sex".
ETL (Extraction/Transformation/Loading) tools – allow users to specify transformations through a graphical user interface (GUI).
3. Data Integration
Data mining often requires data integration – the merging of data from multiple stores into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files.
Issues in Data Integration
a) Schema integration & object matching.
b) Redundancy.
c) Detection & Resolution of data value conflict
a) Schema Integration & Object Matching
Schema integration and object matching can be tricky because the same entity can be represented in different forms in different tables. This is referred to as the entity identification problem. Metadata can be used to help avoid errors in schema integration; the metadata may also be used to help transform the data.
b) Redundancy:
Redundancy is another important issue: an attribute (such as annual revenue, for instance) may be redundant if it can be "derived" from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set. Some redundancies can be detected by correlation analysis and covariance analysis.
For nominal data, we use the χ² (chi-square) test.
For numeric attributes, we can use the correlation coefficient and covariance.
χ² correlation analysis for nominal data:
For nominal data, a correlation relationship between two attributes, A and B, can be discovered by a χ² (chi-square) test. Suppose A has c distinct values, namely a1, a2, a3, …, ac, and B has r distinct values, namely b1, b2, b3, …, br. The data tuples are described by a contingency table, with the c values of A as columns and the r values of B as rows.
The χ² value is computed as
χ² = Σ(i=1..c) Σ(j=1..r) (oij − eij)² / eij
where oij is the observed frequency of the joint event (Ai, Bj) and eij is the expected frequency of (Ai, Bj), which can be computed as
eij = (count(A = ai) × count(B = bj)) / n
where n is the number of data tuples.
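Since the example table itself is not included in these notes, the sketch below computes χ² for a hypothetical 2×2 contingency table (gender vs. preferred reading, with illustrative counts):

```python
# Hypothetical 2x2 contingency table: attribute A = gender (male, female),
# attribute B = preferred reading (fiction, non-fiction). Counts are illustrative.
observed = {("male", "fiction"): 250, ("male", "non-fiction"): 50,
            ("female", "fiction"): 200, ("female", "non-fiction"): 1000}

a_values = ["male", "female"]
b_values = ["fiction", "non-fiction"]
n = sum(observed.values())

count_a = {a: sum(observed[(a, b)] for b in b_values) for a in a_values}
count_b = {b: sum(observed[(a, b)] for a in a_values) for b in b_values}

chi2 = 0.0
for a in a_values:
    for b in b_values:
        e = count_a[a] * count_b[b] / n          # expected frequency e_ij
        o = observed[(a, b)]                     # observed frequency o_ij
        chi2 += (o - e) ** 2 / e

print(round(chi2, 2))   # ~507.94: far above typical critical values, so A and B are correlated
```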
4. Data Reduction:
Obtain a reduced representation of the data set that is much smaller in volume but yet
produces the same (or almost the same) analytical results.
Why data reduction? — A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete data set.
globally optimal solution. Many other attribute evaluation measures can be used, such as the information gain measure used in building decision trees for classification.
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the
reduced set. The best of original attributes is determined and added to the reduced set. At each
subsequent iteration or step, the best of the remaining original attributes is added to the set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward
selection and backward elimination methods can be combined so that, at each step, the
procedure selects the best attribute and removes the worst from among the remaining attributes.
4. Decision tree induction: Decision tree induction constructs a flowchart like structure where
each internal node denotes a test on an attribute, each branch corresponds to an outcome of the
test, and each leaf node denotes a class prediction. At each node, the algorithm chooses the "best" attribute to partition the data into individual classes. A tree is constructed from the given data. All attributes that do not appear in the tree are assumed to be irrelevant; the set of attributes appearing in the tree form the reduced subset of attributes. A threshold measure may be used as the stopping criterion. A sketch of stepwise forward selection (method 1) is given below.
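A minimal sketch of stepwise forward selection; the evaluation function score(subset) is an assumption (e.g., information gain or model accuracy on the subset) and is not defined in these notes:

```python
def stepwise_forward_selection(attributes, score, k):
    """Greedy forward selection of at most k attributes.

    `score(subset)` is an assumed evaluation function (e.g., information gain
    of a subset, or cross-validated accuracy of a model built on it); it must
    be supplied by the caller.
    """
    selected, remaining = [], list(attributes)
    while remaining and len(selected) < k:
        # Pick the remaining attribute that improves the current subset the most.
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected
```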
Numerosity Reduction:
Numerosity reduction is used to reduce the data volume by choosing alternative, smaller
forms of the data representation
Techniques for Numerosity reduction:
Parametric – In these models, only the model parameters need to be stored instead of the actual data, e.g., log-linear models and regression.
Nonparametric – These methods store reduced representations of the data, including histograms, clustering, and sampling.
Parametric model
1. Regression
Linear regression
In linear regression, the data are modeled to fit a straight line. For example, a random variable Y (called a response variable) can be modeled as a linear function of another random variable X (called a predictor variable), with the equation Y = αX + β, where the variance of Y is assumed to be constant. The coefficients α and β (called regression coefficients) specify the slope of the line and the Y-intercept, respectively.
Multiple linear regression
Multiple linear regression is an extension of (simple) linear regression, allowing a
response variable Y, to be modeled as a linear function of two or more predictor
variables.
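As a brief illustration (the data values are hypothetical), the regression coefficients α and β can be estimated by least squares, after which only the two coefficients need to be stored:

```python
import numpy as np

# Illustrative data: model Y as a linear function of X (Y = alpha*X + beta).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

alpha, beta = np.polyfit(X, Y, deg=1)   # least-squares slope and Y-intercept
Y_hat = alpha * X + beta                # the data can now be represented by (alpha, beta)
```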
2. Log-Linear Models
Log-Linear Models can be used to estimate the probability of each point in a
multidimensional space for a set of discretized attributes, based on a smaller subset
of dimensional combinations.
Nonparametric Model
1. Histograms
A histogram for an attribute A partitions the data distribution of A into disjoint subsets,
or buckets. If each bucket represents only a single attribute-value/frequency pair, the buckets
are called singleton buckets.
Ex: The following data are a list of prices (in dollars) of commonly sold items at All Electronics. The numbers have been sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
Equal-frequency (or equi-depth): the frequency of each bucket is constant.
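As a small illustration, the sketch below partitions the 51 sorted prices above into three equal-frequency buckets (the choice of three buckets is mine, not from the notes):

```python
# Equal-frequency (equi-depth) buckets for the 51 sorted prices listed above.
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
          15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 21, 21,
          21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

k = 3                                   # number of buckets (chosen for illustration)
size = len(prices) // k                 # 17 values per bucket
buckets = [prices[i * size:(i + 1) * size] for i in range(k)]
for b in buckets:
    print(f"bucket {b[0]}-{b[-1]}: frequency {len(b)}")
```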
2. Clustering
Clustering technique consider data tuples as objects. They partition the objects into
groups or clusters, so that objects within a cluster are similar to one another and dissimilar to
objects in other clusters. Similarity is defined in terms of how close the objects are in space,
based on a distance function. The quality of a cluster may be represented by its diameter, the
maximum distance between any two objects in the cluster. Centroid distance is an alternative
measure of cluster quality and is defined as the average distance of each cluster object from the
cluster centroid.
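As a small illustration (not from the notes), the sketch below computes the two cluster-quality measures just described, the diameter and the centroid distance, for a hypothetical 2-D cluster:

```python
import numpy as np
from itertools import combinations

# A hypothetical cluster of 2-D objects, for illustration.
cluster = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.8, 2.1]])

centroid = cluster.mean(axis=0)
diameter = max(float(np.linalg.norm(a - b)) for a, b in combinations(cluster, 2))
centroid_distance = float(np.mean([np.linalg.norm(p - centroid) for p in cluster]))
```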
3. Sampling:
Sampling can be used as a data reduction technique because it allows a large data set
to be represented by a much smaller random sample (or subset) of the data. Suppose that a large
data set D, contains N tuples, then the possible samples are Simple Random sample without
Replacement (SRS WOR) of size n: This is created by drawing „n‟ of the „N‟ tuples from D
(n<N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely
to be sampled.
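A minimal sketch of SRSWOR using Python's standard library; the data set D here is just a hypothetical list of tuple identifiers:

```python
import random

D = list(range(1, 1001))              # a hypothetical data set of N = 1000 tuples
n = 50

srswor = random.sample(D, n)                    # simple random sample WITHOUT replacement
srswr = [random.choice(D) for _ in range(n)]    # WITH replacement (SRSWR), for comparison
```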
Dimensionality Reduction:
In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or "compressed" representation of the original data.
Dimension Reduction Types
Lossless – If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless.
Lossy – If the original data can be reconstructed from the compressed data only with some loss of information, the data reduction is called lossy.
Effective methods in lossy dimensional reduction
a) Wavelet transforms
b) Principal components analysis.
a) Wavelet transforms:
The discrete wavelet transform (DWT) is a linear signal processing technique that, when
applied to a data vector, transforms it to a numerically different vector, of wavelet coefficients.
The two vectors are of the same length. When applying this technique to data reduction, we
consider each tuple as an n-dimensional data vector, that is, X=(x1,x2,…………,xn), depicting
n measurements made on the tuple from n database attributes.
For example, all wavelet coefficients larger than some user-specified threshold can be
retained. All other coefficients are set to 0. The resulting data representation is therefore very sparse, so operations that can take advantage of data sparsity are computationally very fast if performed in wavelet space.
The number next to a wavelet name is the number of vanishing moments of the wavelet; this is a set of mathematical relationships that the coefficients must satisfy and is related to the number of coefficients.
1. The length, L, of the input data vector must be an integer power of 2. This condition
can be met by padding the data vector with zeros as necessary (L >=n).
2. Each transform involves applying two functions
The first applies some data smoothing, such as a sum or weighted average.
The second performs a weighted difference, which acts to bring out the detailed
features of data.
3. The two functions are applied to pairs of data points in X, that is, to all pairs of
measurements (X2i , X2i+1). This results in two sets of data of length L/2. In general,
these represent a smoothed or low-frequency version of the input data and high
frequency content of it, respectively.
4. The two functions are recursively applied to the sets of data obtained in the previous
loop, until the resulting data sets obtained are of length 2.
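The following is a minimal sketch of the (unnormalized) Haar transform, the simplest DWT, using the pairwise average as the smoothing function and the pairwise half-difference as the weighted difference; the input vector is illustrative:

```python
def haar_dwt(x):
    """Haar wavelet transform of a vector whose length L is a power of 2.

    At each pass, pairs (x[2i], x[2i+1]) are replaced by their average
    (smoothed, low-frequency part) and half-difference (detail,
    high-frequency part); the averages are then transformed again.
    """
    assert len(x) > 0 and (len(x) & (len(x) - 1)) == 0, "length must be a power of 2"
    x = list(x)
    details = []
    while len(x) > 1:
        smooth = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
        detail = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
        details = detail + details
        x = smooth
    return x + details   # overall average followed by detail coefficients

coeffs = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
# coeffs -> [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
# Small coefficients can then be thresholded to 0, giving a sparse representation.
```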
b) Principal Components Analysis (PCA):
In principal components analysis, the given set of data, originally mapped to the axes X1 and X2, is projected onto new axes Y1 and Y2 (the principal components). This information helps identify groups or patterns within the data. The sorted axes are such that the first axis shows the most variance among the data, the second axis shows the next highest variance, and so on.
The size of the data can be reduced by eliminating the weaker components.
Advantage of PCA
PCA is computationally inexpensive
Multidimensional data of more than two dimensions can be handled by reducing the
problem to two dimensions.
Principal components may be used as inputs to multiple regression and cluster analysis.
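As a rough sketch (not from the notes), PCA can be carried out with NumPy by centering the data, computing the covariance matrix, and projecting onto its eigenvectors sorted by decreasing eigenvalue; the data matrix below is illustrative:

```python
import numpy as np

# Illustrative 2-D data set; rows are objects, columns are the original axes X1, X2.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

Xc = X - X.mean(axis=0)                   # center the data
cov = np.cov(Xc, rowvar=False)            # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigen-decomposition (ascending eigenvalues)
order = np.argsort(eigvals)[::-1]         # sort components by decreasing variance
components = eigvecs[:, order]

Y = Xc @ components                       # data expressed on the new axes Y1, Y2
X_reduced = Y[:, :1]                      # keep only the strongest (first) component
```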
a) Min-max normalization performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of an attribute, A. Min-max normalization maps a value, vi, of A to vi' in the range [new_minA, new_maxA] by computing
vi' = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for A.
Example: Min-max normalization. Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. We would like to map income to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
b) Z-Score Normalization
The values for an attribute, A, are normalized based on the mean (i.e., average) and standard deviation of A. A value, vi, of A is normalized to vi' by computing
vi' = (vi − Ā) / σA
where Ā and σA are the mean and standard deviation, respectively, of attribute A.
Example: z-score normalization. Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to
(73,600 − 54,000) / 16,000 = 1.225
c) Normalization by decimal scaling: A value, vi, of A is normalized to vi' by computing vi' = vi / 10^j, where j is the smallest integer such that max(|vi'|) < 1.
Example: Decimal scaling. Suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
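The three worked examples above can be reproduced with a few lines of Python (a sketch; variable names are mine):

```python
# The three worked normalization examples above, reproduced in Python.
v = 73600.0

# a) Min-max normalization of income from [12000, 98000] to [0.0, 1.0]
min_a, max_a, new_min, new_max = 12000.0, 98000.0, 0.0, 1.0
v_minmax = (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min   # ~0.716

# b) Z-score normalization with mean 54000 and standard deviation 16000
mean_a, std_a = 54000.0, 16000.0
v_zscore = (v - mean_a) / std_a                                            # 1.225

# c) Decimal scaling: max |value| is 986, so divide by 10**3
v_decimal = -986 / 10 ** 3                                                 # -0.986
```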
state their partial ordering. The system can then try to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy.
4. Specification of only a partial set of attributes: Sometimes a user can be careless when defining a hierarchy, or have only a vague idea about what should be included in a hierarchy. Consequently, the user may have included only a small subset of the relevant attributes in the hierarchy specification.
Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
Data integration combines data from multiple sources to form a coherent data store. The resolution of semantic heterogeneity, metadata, correlation analysis, tuple duplication detection, and data conflict detection contribute to smooth data integration.
Data reduction techniques obtain a reduced representation of the data while minimizing the loss of information content. These include methods of dimensionality reduction, numerosity reduction, and data compression.
Data transformation routines convert the data into appropriate forms for mining. For example, in normalization, attribute data are scaled so as to fall within a small range such as 0.0 to 1.0. Other examples are data discretization and concept hierarchy generation.
Data discretization transforms numeric data by mapping values to interval or concept labels. Such methods can be used to automatically generate concept hierarchies for the data, which allows for mining at multiple levels of granularity.
1.7 Major issues in data mining
Major issues in data mining are mining methodology, user interaction, performance, and diverse data types. These issues
are introduced below:
Mining methodology and user interaction issues:
Mining different kinds of knowledge in databases:
Because different users can be interested in different kinds of knowledge, data mining should cover a wide spectrum of data analysis and knowledge discovery tasks, which may use the same database in different ways and require the development of numerous data mining techniques.
Interactive mining of knowledge at multiple levels of abstraction:
Interactive mining allows users to focus the search for patterns, providing and refining data mining requests based on returned results, and to view data and discovered patterns at multiple granularities and from different angles.
Incorporation of background knowledge:
Background knowledge, or domain knowledge, guides the discovery process, allows discovered patterns to be expressed in concise terms at different levels of abstraction, and can speed up a data mining process or help judge the interestingness of discovered patterns.
Data mining query languages and ad hoc data mining :
High-level data mining query languages need to be developed to allow users to describe ad hoc data mining tasks by
facilitating the specification of the relevant sets of data for analysis, the domain knowledge, the kinds of knowledge to be
mined, and the conditions and constraints to be enforced on the discovered patterns.
Presentation and visualization of data mining results:
Using visual representations or other expressive forms, the knowledge can be easily understood and directly used by humans. This requires the system to adopt expressive knowledge representation techniques, such as trees, tables, rules, graphs, charts, crosstabs, matrices, or curves.
Handling noisy or incomplete data:
Noise or incomplete data may confuse the process, causing the constructed knowledge model to overfit the data, which in turn makes the accuracy of the discovered patterns poor. Data cleaning methods and data analysis methods that can handle noise are required, as well as outlier mining methods for the discovery and analysis of exceptional cases.
Pattern evaluation – the interestingness problem:
A data mining system can uncover thousands of patterns. Many of the patterns discovered may be uninteresting to the given user, either because they represent common knowledge or lack novelty. Several challenges remain regarding the development of techniques to assess the interestingness of discovered patterns, particularly with regard to subjective measures that estimate the value of patterns with respect to a given user class, based on user beliefs or expectations. The use of interestingness measures or user-specified constraints to guide the discovery process and reduce the search space is another active area of research.
Performance issues:
These include the efficiency, scalability, and parallelization of data mining algorithms.
Efficiency and scalability of data mining algorithms:
In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable, and the running time of a data mining algorithm must be predictable and acceptable.
Parallel, distributed, and incremental mining algorithms:
The huge size of databases, the wide distribution of data, and the high cost and computational complexity of data mining methods lead to the development of parallel and distributed data mining algorithms. Moreover, incremental data mining algorithms incorporate database updates without having to mine the entire data again "from scratch".
Issues relating to the diversity of data types:
It is unrealistic to expect one system to mine all kinds of data, given the diversity of data types and the different goals of data mining. Specific data mining systems should be constructed for mining specific kinds of data. Therefore, one may expect to have different data mining systems for different kinds of data.
Mining information from heterogeneous databases and global information systems:
Data mining may help disclose high-level data regularities in multiple heterogeneous databases that are unlikely to be discovered by simple query systems, and may improve information exchange and interoperability in heterogeneous databases. Web mining, which uncovers interesting knowledge about web contents, web structures, web usage, and web dynamics, has become a very challenging and fast-evolving field in data mining.
Data mining systems rely on databases to supply the raw data for input, and this raises problems in that databases tend to be dynamic, incomplete, noisy, and large. Other problems arise as a result of the adequacy and relevance of the information stored. The above issues are considered major requirements and challenges for the further evolution of data mining technology.