DMW Notes UNIT-1 2023-24
Many definitions:
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful)
information or patterns from data in large databases
Look for hidden patterns & trends that are not immediately apparent from summarizing the data.
E.g. correlation between grades in two subjects.
i) Characterization: provides a concise and succinct summarization of the given collection of data
ii) Comparison/Discrimination: provides descriptions comparing two or more collections of data.
For example, among customers who frequently purchase computer products, 80% are aged between
20 and 40 and have a university degree, whereas among customers who do not frequently purchase
computer products, 60% are either senior citizens or youths without a university degree.
4) Clustering
Given points in some space, often a high-dimensional space, group the points into a small number
of clusters, each cluster consisting of points that are “near” in some sense.
Points in the same cluster are “similar” to one another and “dissimilar” to points in other clusters.
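The notes do not name a particular clustering algorithm; as one illustrative sketch, a bare-bones
k-means in plain Python (the sample points and k = 2 are made up for the example):

```python
import random
import math

def kmeans(points, k, iters=100):
    # Start from k randomly chosen points as the initial cluster centers.
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        centers = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return clusters

# Two obvious groups in 2-D: points near (0, 0) and points near (10, 10).
data = [(0, 0), (1, 1), (0, 1), (10, 10), (11, 10), (10, 11)]
print(kmeans(data, k=2))
```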
Examples
BANK AGENT:
◦ Must I grant a mortgage to this customer?
SUPERMARKET MANAGER:
◦ When customers buy eggs, do they also buy oil?
PERSONNEL MANAGER:
◦ What kind of employees do I have?
AGRICULTURAL SCIENTIST:
◦ What would be the wheat yield this year?
NETWORK ADMINISTRATOR:
◦ Which website visitor is a hacker?
◦ Which incoming mail is spam?
TRADER in a RETAIL COMPANY:
◦ How many flat TVs do we expect to sell next month?
− Mining different kinds of knowledge in databases − Different users may be interested in different
kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge
discovery tasks.
− Interactive mining of knowledge at multiple levels of abstraction − The data mining process
needs to be interactive, allowing users to focus the search for patterns and to provide and refine
data mining requests based on the returned results.
− Incorporation of background knowledge − Background knowledge can be used to guide the
discovery process and to express the discovered patterns, not only in concise terms but at multiple
levels of abstraction.
− Data mining query languages and ad hoc data mining − A data mining query language that
allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query
language and optimized for efficient and flexible data mining.
− Presentation and visualization of data mining results − Once patterns are discovered, they need
to be expressed in high-level languages and visual representations. These representations should be
easily understandable.
Performance Issues
− Efficiency and scalability of data mining algorithms − In order to effectively extract
information from the huge amounts of data in databases, data mining algorithms must be efficient
and scalable.
− Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of
databases, the wide distribution of data, and the complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the data
into partitions, which are processed in parallel, and the results from the partitions are then
merged. Incremental algorithms incorporate database updates without mining the data again from
scratch.
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies (a small cleaning sketch follows this list)
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction (sampling)
Obtains reduced representation in volume but produces the same or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially for numerical data
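As one possible illustration of the data cleaning step, a minimal sketch using pandas (the column
names and fill strategy are assumptions for the example, not prescribed by the notes):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, None, 45, 31],                   # one missing value
    "city": ["Pune", "pune", "Mumbai", "Pune"],  # inconsistent casing
})

df["age"] = df["age"].fillna(df["age"].mean())  # fill the missing value with the mean
df["city"] = df["city"].str.title()             # resolve the inconsistent labels
print(df)
```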
v' = ((v - min_A) / (max_A - min_A)) × (new_max_A - new_min_A) + new_min_A
Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
((73,600 - 12,000) / (98,000 - 12,000)) × (1.0 - 0) + 0 = 0.716
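The formula above translates directly into a few lines of Python; a minimal sketch:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(min_max(73_600, 12_000, 98_000))  # ~0.716
```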
ii) z-score normalization (zero-mean normalization) – the values for an attribute, A, are normalized
based on the mean, μ and standard deviation σ of A as
v' = (v - μ_A) / σ_A
This method of normalization is useful when the actual minimum and maximum of attribute A
are unknown, or when there are outliers that dominate the min-max normalization.
Ex. Let μ = 54,000 and σ = 16,000. Then 73,600 is normalized to
(73,600 - 54,000) / 16,000 = 1.225
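Again a minimal Python sketch of the formula above:

```python
def z_score(v, mean_a, std_a):
    # v' = (v - μ_A) / σ_A
    return (v - mean_a) / std_a

print(z_score(73_600, 54_000, 16_000))  # 1.225
```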
iii) normalization by decimal scaling – it normalizes by moving the decimal point of values of
attribute A. The number of decimal places the point is moved depends on the maximum absolute value of A.
Here v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1.
Ex. Suppose that the recorded values of A range from -986 to 917.
The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide
each value by 1,000 (i.e., j = 3) so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
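A minimal Python sketch of decimal scaling, finding j by search (the sample values are from the
example above):

```python
def decimal_scale(values):
    m = max(abs(v) for v in values)
    j = 0
    while m / 10 ** j >= 1:  # smallest j such that max(|v'|) < 1
        j += 1
    return [v / 10 ** j for v in values]

print(decimal_scale([-986, 917]))  # [-0.986, 0.917], i.e. j = 3
```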
Attribute/feature construction
New attributes constructed from the given ones
Ex. we may wish to add the attribute “area” based on the attributes “height” and “width”.
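A one-line sketch of this construction with pandas (hypothetical values; the column names follow
the example above):

```python
import pandas as pd

df = pd.DataFrame({"height": [2.0, 3.5], "width": [4.0, 2.0]})
df["area"] = df["height"] * df["width"]  # new attribute built from the given ones
print(df)
```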
Data discretization technique can be used to reduce the number of values for a given continuous
attribute by dividing the range of the attribute into intervals.
Interval labels are then used to replace actual data values, which reduces and simplifies the original data.
This leads to a concise, easy-to-use, knowledge-level representation of mining results.
Categorization based on the use of class information:-
Supervised discretization- this type of discretization process uses class information.
Unsupervised discretization- it does not use class information.
Categorization based on the direction in which it proceeds:-
Top-down discretization or splitting - If the process starts by first finding one or a few
points (called split points or cut points) to split the entire attribute range, and then repeats
this recursively on the resulting intervals, it is called top-down discretization or splitting.
Bottom-up discretization or merging - it starts by considering all of the continuous values
as potential split-points, removes some by merging neighborhood values to form intervals,
and then recursively applies this process to the resulting intervals.
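A minimal sketch of one unsupervised scheme, equal-width binning (the bin count k = 3 and the
sample values are arbitrary choices for the example):

```python
def equal_width_bins(values, k=3):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Map each value to a bin index 0..k-1 (clamping the maximum into the last bin).
    return [min(int((v - lo) / width), k - 1) for v in values]

print(equal_width_bins([4, 8, 15, 16, 23, 42]))  # interval labels replace raw values
```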
Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts
(such as numerical values for the attribute age) with higher-level concepts (such as youth, middle-
aged, or senior).
Concept hierarchies for numerical attributes can be constructed automatically based on data
discretization by: binning, histogram analysis, entropy-based discretization, χ²-merging, cluster
analysis, and discretization by intuitive partitioning.
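As a minimal sketch of such a hierarchy for age (the thresholds 40 and 60 are assumptions for
illustration, not fixed by the notes):

```python
def age_concept(age):
    # Replace a low-level numerical value with a higher-level concept.
    if age < 40:
        return "youth"
    elif age < 60:
        return "middle-aged"
    return "senior"

print([age_concept(a) for a in [25, 47, 68]])  # ['youth', 'middle-aged', 'senior']
```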
======== *****=======