
B.K. BIRLA INSTITUTE OF ENGINEERING & TECHNOLOGY, Pilani


5CS5-16/5IT6-16 – DATA MINING & WAREHOUSING, Classroom Notes Unit – I
What is Data Mining?
 Discovery of useful summaries of data - Ullman
 Extracting or “mining” knowledge from large amounts of data
 The efficient discovery of previously unknown patterns in large databases
 Technology that predicts future trends based on historical data
 It helps businesses make proactive and knowledge-driven decisions

Many definitions:
 Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful)
information or patterns from data in large databases
 Look for hidden patterns & trends that are not immediately apparent from summarizing the data.
 E.g. correlation between grades in two subjects.

1| Prepared by: Manoj Kumar Saini


Phases / Steps in data mining
Data Mining: A KDD (Knowledge Discovery from Data) Process

Stages of Data Mining Process


1. Data gathering, e.g., operational sources, www.
2. Data cleansing: eliminate errors and/or bogus data, e.g., patient fever = 125.
3. Feature extraction/ Selection & Transformation: obtaining only the interesting attributes of the data,
e.g., “date acquired” is probably not useful for clustering celestial objects, as in Skycat.
4. Pattern extraction and discovery. This is the stage that is often thought of as “data mining” and is
where we shall concentrate our effort.
5. Visualization of the data.
6. Evaluation of results; not every discovered fact is useful, or even true! Judgment is necessary before
following your software's conclusions.

Data Mining Functionalities — What Kinds of Patterns Can Be Mined?
Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. In
general, data mining tasks can be classified into two categories:
 Predictive Mining
 Use some variables to predict unknown or future values of other variables.
 Descriptive Mining
 Find human-interpretable patterns that describe the data.

Data Mining Functionalities:

1) Concept/Class Description: Characterization and Discrimination


Concept description is a form of data generalization. A concept typically refers to a collection of
data, such as frequent_buyers or graduate_students. A description of a concept in summarized,
concise, and yet precise terms is known as a concept description.
These descriptions can be derived via:

i) Characterization: provides a concise and succinct summarization of the given collection of data.
ii) Comparison/Discrimination: provides descriptions comparing two or more collections of data.

For example, customers who purchase computer products frequently → 80% of such customers
are between 20 and 40 years old and have a university degree.
Whereas customers who do not purchase computer products frequently → 60% of such customers
are either senior citizens or youths without a university degree.

2) Mining Frequent Patterns, Associations, and Correlations


Frequent patterns are patterns that occur frequently in data. There are many kinds of frequent patterns,
including itemsets (sets of items), subsequences, and substructures.
Associations and Item-sets
An association is a rule of the form “if X then Y”, denoted by X → Y.
Ex: If India wins in cricket, sales of sweets go up.
If a customer buys a computer, he also buys an antivirus.
For any rule, if both X → Y and Y → X hold,
then X and Y are called “interesting item-sets”.
Ex: People buying school uniforms in June also buy school bags
(people buying school bags in June also buy school uniforms).
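The rule-checking idea above can be sketched in a few lines of Python. The transaction data and the `rule_holds` helper below are made-up illustrations, not from the notes: a rule X → Y is checked by counting how often Y appears among transactions containing X, and an item-set is “interesting” in the notes' sense when the rule holds in both directions.

```python
# Sketch: check an association rule "if X then Y" by co-occurrence counting.
# Transactions and item names are assumed for illustration.

def rule_holds(transactions, x, y):
    """Return the fraction of transactions containing x that also contain y."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if y in t) / len(with_x)

transactions = [
    {"uniform", "bag"},
    {"uniform", "bag", "shoes"},
    {"uniform"},
    {"bag"},
]

# 2 of the 3 uniform-buyers also bought bags, and 2 of the 3 bag-buyers
# also bought uniforms, so the rule holds (to the same degree) both ways:
# {uniform, bag} would be an "interesting item-set" in the notes' sense.
print(rule_holds(transactions, "uniform", "bag"))
print(rule_holds(transactions, "bag", "uniform"))
```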

3) Classification and Prediction


Classification and prediction are two forms of data analysis that can be used to extract models
describing important data classes and to predict future data trends. Whereas classification predicts
categorical (discrete, unordered) labels, prediction models continuous-valued functions.


4) Clustering
 Given points in some space, often a high-dimensional space, group the points into a small number
of clusters
 Each cluster consists of points that are “near” one another in some sense
 Points in the same cluster are “similar” to each other and “dissimilar” to points in other clusters

5) Anomaly Detection/Outliers
 Objects whose characteristics are significantly different from the rest of the data
 Such observations are known as ANOMALIES or OUTLIERS
 False alarms to be avoided
 Applications
 Fraud detection
 Network intrusions
 Unusual patterns of disease
 Ecosystem disturbances
Examples of Discovered Patterns
 Association rules
o 98% of people who purchase diapers also buy beer
 Classification
o People with age less than 25 and salary > 40k drive sports cars
 Similar time sequences
o Stocks of companies A and B perform similarly
 Outlier Detection
o Residential customers for telecom company with businesses at home

Data Mining Applications
Some examples of “successes":
1. Decision trees constructed from bank-loan histories to produce algorithms to decide whether to grant a
loan.
2. Patterns of traveler behavior mined to manage the sale of discounted seats on planes, rooms in hotels,
etc.
3. “Diapers and beer" Observation that customers who buy diapers are more likely to buy beer than
average allowed supermarkets to place beer and diapers nearby, knowing many customers would walk
between them. Placing potato chips between increased sales of all three items.
4. Skycat and the Sloan Digital Sky Survey: clustering sky objects by their radiation levels in different
bands allowed astronomers to distinguish between galaxies, nearby stars, and many other kinds of
celestial objects.
(168 million records and some 500 attributes)
5. Comparison of the genotypes of people with and without a condition allowed the discovery of a set of
genes that together account for many cases of diabetes. This sort of mining has become much more
important as the human genome has been fully decoded.

Examples
 BANK AGENT:
◦ Must I grant a mortgage to this customer?
 SUPERMARKET MANAGER:
◦ When customers buy eggs, do they also buy oil?
 PERSONNEL MANAGER:
◦ What kind of employees do I have?
 AGRICULTURAL SCIENTIST:
◦ What would be the wheat yield this year?
 NETWORK ADMINISTRATOR:
◦ Which website visitor is a hacker?
◦ Which incoming mail is spam?
 TRADER in a RETAIL COMPANY:
◦ How many flat TVs do we expect to sell next month?

Data Mining Issues

 Mining Methodology and User Interaction Issues

− Mining different kinds of knowledge in databases − Different users may be interested in different
kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge
discovery tasks.
− Interactive mining of knowledge at multiple levels of abstraction − The data mining process
needs to be interactive so that users can focus the search for patterns, providing and refining
data mining requests based on the returned results.
− Incorporation of background knowledge − Background knowledge can be used to guide the
discovery process and to express the discovered patterns, not only in concise terms but at
multiple levels of abstraction.
− Data mining query languages and ad hoc data mining − A data mining query language that allows
the user to describe ad hoc mining tasks should be integrated with a data warehouse query language
and optimized for efficient and flexible data mining.
− Presentation and visualization of data mining results − Once patterns are discovered, they need
to be expressed in high-level languages and visual representations that are easily understandable.

− Handling noisy or incomplete data − Data cleaning methods are required to handle noise and
incomplete objects while mining data regularities. Without such methods, the accuracy of the
discovered patterns will be poor.
− Pattern evaluation − The patterns discovered may be uninteresting because they represent
common knowledge or lack novelty; evaluating which patterns are truly interesting is therefore
an important problem.

 Performance Issues
− Efficiency and scalability of data mining algorithms − In order to effectively extract
information from the huge amounts of data in databases, data mining algorithms must be efficient
and scalable.
− Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of
databases, the wide distribution of data, and the complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the data
into partitions that are processed in parallel, and the results from the partitions are then
merged. Incremental algorithms update existing mining results when the database changes, rather
than mining all the data again from scratch.

 Diverse Data Types Issues


− Handling of relational and complex types of data − A database may contain complex data
objects, multimedia objects, spatial data, temporal data, etc. It is not possible for one system to
mine all these kinds of data.
− Mining information from heterogeneous databases and global information systems − Data
is available from different data sources on a LAN or WAN. These data sources may be structured,
semi-structured, or unstructured. Mining knowledge from them therefore adds challenges to data
mining.

DATA PREPROCESSING
Why Preprocess Data?

 Data in the real world is dirty


 Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data. e.g., occupation=“ ”
 Noisy: containing errors or outliers. e.g., Salary=“-10”
 Inconsistent: containing discrepancies in codes or names
 e.g., Age=“30” on 10/10/2013 and Birthday=“22/04/1984”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records
 No quality data, no quality mining results!
 Quality decisions must be based on quality data. e.g., duplicate or missing data may cause
incorrect or even misleading statistics.

Sources of Dirty Data

 Incomplete data may come from


 “Not applicable” data value when collected
 Different considerations between the time when the data was collected and when it is
analyzed.
 Human/hardware/software problems
 Noisy data (incorrect values) may come from
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission
 Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning

Forms of data preprocessing

Major Tasks in Data Preprocessing:

 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction (sampling)
 Obtains reduced representation in volume but produces the same or similar analytical results
 Data discretization
 Part of data reduction but with particular importance, especially for numerical data

Data Cleaning
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines
attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in
the data.
1) Missing Data
Data is not always available - E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
Missing data may be due to:
a. equipment malfunction
b. data deleted because it was inconsistent with other recorded data
c. data not entered due to misunderstanding
d. certain data not considered important at the time of entry
e. history or changes of the data not being registered
Missing data may need to be inferred.
Ways to Handle Missing Data-
 Ignore the tuple: usually done when the class label is missing (assuming the task is
classification). This is not effective when the percentage of missing values per attribute
varies considerably.
 Fill in the missing value manually: this approach is time-consuming and may not be
feasible for a large data set with many missing values.
 Use a global constant to fill in the missing value: replace all missing attribute values by
the same constant, such as a label like “Unknown” or −∞. The problem here is that the
mining program may mistakenly interpret “Unknown” as an interesting new class.
 Use the attribute mean to fill in the missing value: for example, suppose that the average
income of the customers is Rs. 56,000. Use this value to replace any missing value for
income.
 Use the attribute mean for all samples belonging to the same class to fill in the missing
value: a smarter choice.
 Use the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using a Bayesian formalism, or decision tree induction.
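Two of the fill strategies listed above, the overall attribute mean and the class-wise attribute mean, can be sketched in Python. The data values and class labels below are assumed for illustration:

```python
# Sketch of two missing-value fill strategies; None marks a missing value.

def fill_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

# Mean of the known incomes is (40000 + 72000 + 56000) / 3 = 56000.0
incomes = [40000, 72000, None, 56000]
print(fill_with_mean(incomes))

def fill_with_class_mean(rows):
    """rows: (class_label, value) pairs; fill None with the mean of its class."""
    by_class = {}
    for label, v in rows:
        if v is not None:
            by_class.setdefault(label, []).append(v)
    means = {c: sum(vs) / len(vs) for c, vs in by_class.items()}
    return [(c, means[c] if v is None else v) for c, v in rows]

# The missing "gold" income is filled with the mean of the gold class only.
rows = [("gold", 80000), ("gold", None), ("basic", 30000), ("basic", 34000)]
print(fill_with_class_mean(rows))
```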
2) Noisy Data - Noise is a random error or variance in a measured variable
Incorrect attribute values may be due to
a. faulty data collection instruments
b. data entry problems
c. data transmission problems
d. technology limitation
e. inconsistency in naming convention
Other data problems that require data cleaning
a) duplicate records
b) incomplete data
c) inconsistent data
Smooth out the data to remove noise
Smoothing Techniques
 Binning – Binning methods smooth a sorted data value by consulting its “neighborhood”, that is,
the values around it. The sorted values are distributed into a number of “buckets,” or bins.
Binning method –
 First sort the data and partition it into (equi-depth or equi-width) bins.
 Smooth by bin means, bin medians, or bin boundaries.
 Equal-width (distance) partitioning:
 It divides the range into N intervals of equal size (a uniform grid)
 If A and B are the lowest and highest values of the attribute, the width of the intervals
is W = (B - A)/N
 Most straightforward
 But outliers may dominate the presentation
 Skewed data is not handled well
 Equi-depth (frequency) partitioning:
 It divides the range into N intervals, each containing approximately the same number
of samples
 Good data scaling
 Managing categorical attributes can be tricky
Example: Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
 Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
 Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
 Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
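The worked example above can be reproduced with a short Python sketch. The tie-breaking rule in boundary smoothing (ties go to the lower boundary) is an assumption, since the notes do not specify one:

```python
# Reproduce the binning example: 12 sorted prices, 3 equi-depth bins,
# smoothing by bin means and by bin boundaries.

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

# Equal-width bin width for comparison: W = (B - A)/N = (34 - 4)/3 = 10.0
width = (max(prices) - min(prices)) / 3

def equi_depth_bins(sorted_values, n_bins):
    """Partition already-sorted values into n_bins bins of equal depth."""
    depth = len(sorted_values) // n_bins
    return [sorted_values[i:i + depth]
            for i in range(0, len(sorted_values), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b)) for _ in b] for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value by the nearer of the bin's min and max
    (ties go to the lower boundary - an assumed convention)."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

bins = equi_depth_bins(prices, 3)
print(bins)                       # the three bins from the notes
print(smooth_by_means(bins))      # 9s, 23s, 29s
print(smooth_by_boundaries(bins)) # matches the boundary-smoothed bins
```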
3) Clustering: Outliers may be detected by clustering, where similar values are organized into groups,
or “clusters.” Intuitively, values that fall outside of the set of clusters may be considered outliers
Combined computer and human inspection: detect suspicious values automatically and have a human verify them.
4) Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear
regression involves finding the “best” line to fit two attributes (or variables), so that one attribute
can be used to predict the other. Multiple linear regression is an extension of linear regression,
where more than two attributes are involved and the data are fit to a multidimensional surface.

Data transformation
In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Data transformation can involve the following:
 Smoothing: remove noise from data. (done in data cleaning)
 Aggregation: summarization, data cube construction where aggregation operations are applied to the
data in the construction of a data cube.
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small, specified range
i) Min-max normalization - performs a linear transformation on the original data.
It maps a value v of an attribute A from the original range [minA, maxA] to a value v' in the
new range [new_minA, new_maxA] by computing

v' = ((v - minA) / (maxA - minA)) × (new_maxA - new_minA) + new_minA

Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
((73,600 - 12,000) / (98,000 - 12,000)) × (1.0 - 0) + 0 = 0.716

ii) Z-score normalization (zero-mean normalization) - the values of an attribute A are normalized
based on the mean μA and standard deviation σA of A as

v' = (v - μA) / σA

This method of normalization is useful when the actual minimum and maximum of attribute A
are unknown, or when there are outliers that dominate a min-max normalization.
Ex. Let μA = 54,000 and σA = 16,000. Then 73,600 becomes (73,600 - 54,000) / 16,000 = 1.225
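The same example as a Python sketch:

```python
# Z-score normalization: shift by the mean, scale by the standard deviation.

def z_score(v, mean, std):
    return (v - mean) / std

# The example from the notes: 73,600 with mean 54,000 and std 16,000.
print(z_score(73600, 54000, 16000))  # 1.225
```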

iii) Normalization by decimal scaling - normalizes by moving the decimal point of values of
attribute A. The number of decimal points moved depends on the maximum absolute value of A:

v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1

Ex. Suppose that the recorded values of A range from -986 to 917.
The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide
each value by 1,000 (i.e., j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
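A sketch of decimal scaling in Python, reproducing the example above (the loop for finding j is one possible implementation):

```python
# Decimal scaling: divide by 10^j, where j is the smallest integer
# making every scaled absolute value less than 1.

def decimal_scale(values):
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values], j

# The example from the notes: values in [-986, 917] give j = 3.
scaled, j = decimal_scale([-986, 917])
print(j)       # 3
print(scaled)  # approximately [-0.986, 0.917]
```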

 Attribute/feature construction
 New attributes constructed from the given ones
Ex. We may wish to add the attribute “area” based on the attributes “height” and “width”.

Data Discretization and Concept Hierarchy Generation

Data discretization techniques can be used to reduce the number of values for a given continuous
attribute by dividing the range of the attribute into intervals.
 Interval labels are used to replace actual data values, which reduces and simplifies the original data.
 This leads to a concise, easy-to-use, knowledge-level representation of mining results.
Categorization based on the use of class information:
 Supervised discretization - this type of discretization process uses class information.
 Unsupervised discretization - it does not use class information.
Categorization based on the direction in which it proceeds:
 Top-down discretization or splitting - If the process starts by first finding one or a few
points (called split points or cut points) to split the entire attribute range, and then repeats
this recursively on the resulting intervals, it is called top-down discretization or splitting.
 Bottom-up discretization or merging - it starts by considering all of the continuous values
as potential split-points, removes some by merging neighborhood values to form intervals,
and then recursively applies this process to the resulting intervals.

Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts
(such as numerical values for the attribute age) with higher-level concepts (such as youth, middle-
aged, or senior).
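A minimal sketch of such a replacement in Python. The interval boundaries (30 and 60) are assumed for illustration, not taken from the notes:

```python
# Replace numerical ages with higher-level concepts, as a concept
# hierarchy for the attribute "age" would. Boundaries are assumed.

def age_concept(age):
    if age < 30:
        return "youth"
    if age < 60:
        return "middle-aged"
    return "senior"

print([age_concept(a) for a in [22, 45, 71]])
```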

Benefits of Data Discretization and Concept Hierarchy OR


Why are discretization techniques and concept hierarchies typically applied before data mining as
a preprocessing step, rather than during mining?
 The generalized data may be more meaningful and easier to interpret, and it contributes to a
consistent representation of data mining results among multiple mining tasks.
 In addition, mining on a reduced data set requires fewer input/output operations and is more
efficient than mining on a larger, ungeneralized data set.

Concept hierarchies for numerical attributes can be constructed automatically based on data
discretization by: binning, histogram analysis, entropy-based discretization, χ²-merging, cluster
analysis, and discretization by intuitive partitioning.

======== *****=======
