Unit 1

Data Mining

By
Tanaya Priyadarshini Pradhan

1
Data Mining
• Data Mining is a process used by organizations to extract specific data
from huge databases in order to solve business problems.
• It primarily turns raw data into useful information.
• Data Mining is the process of extracting information from huge sets of
data to identify patterns, trends, and useful data that allow a business
to make data-driven decisions.

2
Data & Data Sets
• Data mining, also known as knowledge discovery in data (KDD),
is the process of uncovering patterns and other valuable information
from large data sets.
• Through data mining, data scientists assist in the analysis of gathered
data.
• A dataset is a set of numbers or values that pertain to a specific topic.
• For example
Each student's test scores in a certain class.

3
Raw Data vs. Processed Data

4
Types of Data
Relational Database: A relational database is a collection of multiple data
sets formally organized into tables, records, and columns, from which data can
be accessed in various ways without having to reorganize the database tables.
Data Warehouse: A Data Warehouse is the technology that collects the data
from various sources within the organization to provide meaningful business
insights. The huge amount of data comes from multiple places such as
Marketing and Finance.
Data Repositories: The Data Repository generally refers to a destination for
data storage. However, many IT professionals use the term more precisely to
refer to a specific kind of setup within an IT structure, for example, a group
of databases where an organization has kept various kinds of information.
5
Types of Data (Contd…)
Object-Relational Database: A combination of an object-oriented
database model and relational database model is called an object-
relational model. It supports Classes, Objects, Inheritance, etc. One of
the primary objectives of the Object-relational data model is to close the
gap between the Relational database and the object-oriented model
practices frequently utilized in many programming languages, for
example, C++, Java, C#, and so on.
Transactional Database: A transactional database refers to a database
management system (DBMS) that can roll back (undo) a database
transaction if it is not completed appropriately.

6
KDD- Knowledge Discovery in Databases
• The term KDD stands for Knowledge Discovery in Databases.
• It is a field of interest to researchers in various fields, including
artificial intelligence, machine learning, pattern recognition, databases,
statistics, knowledge acquisition for expert systems, and data
visualization.
• The main objective of the KDD process is to extract information from
data in the context of large databases.
• It does this by using Data Mining algorithms to identify what is
deemed knowledge.

7
KDD Process
• The knowledge discovery process (illustrated in the figure below) is
iterative and interactive and comprises nine steps.
• The process begins with determining the KDD objectives and ends
with the implementation of the discovered knowledge.
For example, offering various features to cell phone users in order to reduce
churn. This closes the loop: the impacts are measured on the new data
repositories, and the KDD process is run again.

8
KDD Process(Contd…)

9
KDD Process(Contd…)
1. Building up an understanding of the application domain
This is the initial preliminary step. It sets the scene for understanding what should be done
with the various decisions (transformation, algorithms, representation, etc.). The individuals who are
in charge of a KDD venture need to understand and characterize the objectives of the end user and the
environment in which the knowledge discovery process will occur (this involves relevant prior
knowledge).
2. Choosing and creating a data set on which discovery will be performed
Once the objectives are defined, the data that will be used for the knowledge discovery process
should be determined. This includes discovering what data is accessible, obtaining important data,
and afterwards integrating all the data for knowledge discovery into one data set, including the attributes
that will be considered for the process. This step is important because Data Mining learns and
discovers only from the accessible data.
3. Preprocessing and cleansing
In this step, data reliability is improved. It includes data cleaning, for example, handling
missing values and removing noise or outliers. Complex statistical techniques or a Data Mining
algorithm may be used in this context.
10
KDD Process(Contd…)
4. Data Transformation
In this stage, appropriate data for Data Mining is prepared and developed. Techniques here
include dimension reduction (for example, feature selection and extraction, and record sampling) and attribute
transformation (for example, discretization of numerical attributes and functional transformation). This step can be
essential for the success of the entire KDD project, and it is typically very project-specific.
5. Prediction and description
We are now ready to decide which kind of Data Mining to use, for example, classification, regression,
clustering, etc. This mainly depends on the KDD objectives and on the previous steps. There are two significant
objectives in Data Mining: the first is prediction, and the second is description. Prediction is usually
referred to as supervised Data Mining, while descriptive Data Mining covers the unsupervised and visualization
aspects of Data Mining. Most Data Mining techniques rely on inductive learning, where a model is built explicitly or
implicitly by generalizing from an adequate number of training examples.
6. Selecting the Data Mining algorithm
Having chosen the technique, we now decide on the strategy. This stage includes choosing a particular algorithm to
be used for searching patterns, which may involve multiple inducers.
For example
Considering precision versus understandability, the former is better with neural networks, while the latter is
better with decision trees.
11
KDD Process(Contd…)
7. Utilizing the Data Mining algorithm
At last, the Data Mining algorithm is applied. In this stage, we may need to
run the algorithm several times until a satisfying outcome is obtained.
For example
By tuning the algorithm's control parameters, such as the minimum number of instances in a single
leaf of a decision tree.
8. Evaluation
In this step, we assess and interpret the mined patterns and rules, and their reliability, with respect to the
objectives characterized in the first step. This step focuses on the comprehensibility and utility of the induced model.
The identified knowledge is also documented for further use. The last step is the use of, and overall feedback on,
the discovery results acquired by Data Mining.
9. Using the discovered knowledge
Now, we are ready to incorporate the knowledge into another system for further action. The
knowledge becomes effective in the sense that we may make changes to the system and measure the impacts.
The success of this step determines the effectiveness of the whole KDD process.

12
Functionalities of Data Mining
• Data mining functionalities are used to represent the type of patterns
that have to be discovered in data mining tasks.
• The ultimate objective of Data Mining Functionalities is to observe
the various trends in the data.
• Data mining is extensively used in many areas or sectors.
• There are several data mining functionalities that the organized and
scientific methods offer, such as:

13
Functionalities of Data Mining(contd…)

14
Class/Concept Descriptions
A class or concept implies there is a data set or set of features that define
the class or a concept. There are two concepts here, one that helps with
grouping and the other that helps in differentiating.
• Data Characterization: This refers to the summary of general characteristics
or features of the class, resulting in specific rules that define a target class. A
data analysis technique called Attribute-oriented Induction is employed on the
data set for achieving characterization.

• Data Discrimination: Discrimination is used to separate distinct data sets


based on the disparity in attribute values. It compares features of a class with
features of one or more contrasting classes. bar charts, curves and pie charts.

15
Mining Frequent Patterns
One of the functions of data mining is finding data patterns. Frequent
patterns are things that are discovered to be most common in data.
Various types of frequency can be found in the dataset.
• Frequent item set: This term refers to a group of items that are commonly
found together, such as milk and sugar.

• Frequent substructure: It refers to the various types of data structures that


can be combined with an item set or subsequences, such as trees and graphs.

• Frequent Subsequence: A regular pattern series, such as buying a phone


followed by a cover.
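To make the idea concrete, here is a minimal sketch (not from the slides) that counts frequent item pairs in a small transactional data set using only the Python standard library; the baskets and the minimum-support threshold are invented for illustration.

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions; each row is one customer's basket.
transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"bread", "butter"},
    {"milk", "sugar", "butter"},
]

min_support = 0.5  # a pair must appear in at least 50% of transactions

# Count every 2-item combination that occurs in a transaction.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Keep only the pairs whose support meets the threshold.
n = len(transactions)
frequent_pairs = {pair: cnt / n for pair, cnt in pair_counts.items()
                  if cnt / n >= min_support}
print(frequent_pairs)   # {('milk', 'sugar'): 0.75}
```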

16
Association Analysis
It analyses the set of items that generally occur together in a
transactional dataset.
• It is also known as Market Basket Analysis for its wide use in retail
sales.
• Two parameters are used for determining the association rules:
• It provides which identifies the common item set in the database.

• Confidence is the conditional probability that an item occurs when another


item occurs in a transaction.
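A rough illustration of the two parameters: the sketch below computes support and confidence for a single hypothetical rule (milk → sugar) over made-up transactions; the item names and values are assumptions, not from the slides.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"milk", "sugar"},
    {"milk", "sugar", "bread"},
    {"milk", "bread"},
    {"sugar", "bread"},
]

antecedent, consequent = {"milk"}, {"sugar"}

n = len(transactions)
both = sum(1 for t in transactions if (antecedent | consequent) <= t)
ante = sum(1 for t in transactions if antecedent <= t)

support = both / n          # how often milk and sugar occur together
confidence = both / ante    # P(sugar | milk)

print(f"support={support:.2f}, confidence={confidence:.2f}")
# support=0.50, confidence=0.67
```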

17
Classification
Classification is a data mining technique that categorizes items in a
collection based on some predefined properties.
• It uses methods like if-then rules, decision trees or neural networks to
predict a class or essentially classify a collection of items.
• A training set containing items whose properties are known is used to
train the system to predict the category of items from an unknown
collection of items.
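A minimal training-set sketch, assuming scikit-learn is installed; the features (study hours, attendance) and labels are invented for illustration and are not part of the slides.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training set: [study_hours, attendance_percent] -> pass/fail.
X_train = [[2, 60], [8, 90], [5, 75], [1, 40], [7, 85], [3, 55]]
y_train = ["fail", "pass", "pass", "fail", "pass", "fail"]

# Fit a decision tree on the items whose class is known.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

# Predict the category of items from an unknown collection.
X_new = [[6, 80], [2, 45]]
print(clf.predict(X_new))   # e.g. ['pass' 'fail']
```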

18
19
An example of a decision tree with the dataset

20
Prediction
It refers to predicting some unavailable data values or upcoming trends.
An object can be anticipated based on the attribute values of the object
and the attribute values of the classes. It can be a prediction of missing
numerical values or of increasing or decreasing trends in time-related
information. There are primarily two types of predictions in data
mining: numeric and class predictions (a small numeric-prediction
sketch follows after this list).
• Numeric predictions are made by creating a linear regression model that is
based on historical data. Prediction of numeric values helps businesses ramp
up for a future event that might impact the business positively or negatively.
• Class predictions are used to fill in missing class information for products
using a training data set where the class for products is known.
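As a rough sketch of a numeric prediction, the snippet below fits a simple linear regression on made-up historical sales figures (scikit-learn assumed to be installed) and extrapolates to the next period; the numbers are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: month index -> sales (in units).
months = np.array([[1], [2], [3], [4], [5], [6]])
sales = np.array([100, 120, 130, 150, 165, 180])

model = LinearRegression()
model.fit(months, sales)

# Numeric prediction for a future month the business wants to plan for.
print(model.predict([[7]]))   # roughly 196 units, given this toy trend
```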
21
Cluster Analysis
• In image processing, pattern recognition and bioinformatics, clustering
is a popular data mining functionality.
• It is similar to classification, but the classes are not predefined.
• Data attributes represent the classes.
• Similar data are grouped together, with the difference being that a
class label is not known.
• Clustering algorithms group data based on similar features and
dissimilarities.
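To ground the idea, here is a minimal k-means sketch, assuming scikit-learn is available; the two-dimensional points and the choice of k=2 are illustrative assumptions, not data from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled points (e.g. two customer attributes).
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.5], [8.3, 9.0], [7.9, 8.1]])

# No class labels are given; k-means groups points by similarity alone.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)   # one centroid per discovered group
```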

22
An example of k-means with the dataset

23
Outlier Analysis
• Outlier analysis is important to understand the quality of data.
• If there are too many outliers, you cannot trust the data or draw
patterns.
• An outlier analysis determines whether there is something out of the
ordinary in the data and whether it indicates a situation that a business
needs to consider and take measures to mitigate.
• Data that cannot be grouped into any class by the algorithms is pulled
up for outlier analysis.
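One common way to flag outliers is a z-score rule; the sketch below (plain NumPy, with a threshold of 2 chosen arbitrarily) is an illustration of the idea, not the specific method implied by the slides.

```python
import numpy as np

# Hypothetical measurements with one suspicious value.
values = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 35.0, 10.3])

# Standardize each value and flag points far from the mean.
z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 2]

print(outliers)   # [35.] -- worth checking before trusting any patterns
```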

24
Evolution and Deviation Analysis

• Evolution Analysis pertains to the study of data sets that change over
time.
• Evolution analysis models are designed to capture evolutionary trends
in data, helping to characterize, classify, cluster or discriminate
time-related data.

25
Correlation Analysis
• Correlation is a mathematical technique for determining whether and
how strongly two attributes are related to one another.
• It determines how well two numerically measured continuous
variables are linked.
• Researchers can use this type of analysis to see if there are any
possible correlations between variables in their study.
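A minimal sketch of the Pearson correlation coefficient between two continuous attributes, using NumPy; the sample values are invented for illustration.

```python
import numpy as np

# Hypothetical continuous measurements of two attributes.
hours_studied = np.array([1, 2, 3, 4, 5, 6])
exam_score    = np.array([52, 55, 61, 64, 70, 74])

# Pearson correlation: +1 strong positive, -1 strong negative, 0 none.
r = np.corrcoef(hours_studied, exam_score)[0, 1]
print(round(r, 3))   # close to 1, i.e. the attributes move together
```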

26
Architecture Of Typical Data Mining System
• Data mining is a significant method where previously unknown and potentially
useful information is extracted from the vast amount of data.
• The data mining process involves several components, and these components
constitute a data mining system architecture.
• The significant components of data mining systems are a data source, data mining
engine, data warehouse server, the pattern evaluation module, graphical user
interface, and knowledge base.

27
Architecture Of Typical Data Mining System

28
Architecture Of Typical Data Mining System
• Data Source:
The actual source of data is the Database, data warehouse, World Wide Web (WWW), text files, and
other documents. You need a huge amount of historical data for data mining to be successful.
Organizations typically store data in databases or data warehouses. Data warehouses may comprise
one or more databases, text files, spreadsheets, or other repositories of data. Sometimes, even plain
text files or spreadsheets may contain useful information. Another primary source of data is the World Wide
Web, or the internet.
• Different processes:
Before passing the data to the database or data warehouse server, the data must be cleaned,
integrated, and selected. As the information comes from various sources and in different formats, it
can't be used directly in the data mining procedure because the data may not be complete and
accurate. So, the data first needs to be cleaned and unified. More information than needed will be
collected from various data sources, and only the data of interest has to be selected and passed
to the server. These procedures are not as easy as they sound. Several methods may be performed on
the data as part of selection, integration, and cleaning.
29
Architecture Of Typical Data Mining System
• Database or Data Warehouse Server:
The database or data warehouse server contains the original data that is ready to
be processed. Hence, the server is responsible for retrieving the relevant data, based
on the user's data mining request.
• Data Mining Engine:
The data mining engine is a major component of any data mining system. It
contains several modules for carrying out data mining tasks, including association,
characterization, classification, clustering, prediction, time-series analysis, etc.
• In other words, the data mining engine is the core of the data mining architecture.
It comprises the instruments and software used to obtain insights and knowledge from
data collected from various data sources and stored within the data warehouse.

30
Architecture Of Typical Data Mining System
• Graphical User Interface:
The graphical user interface (GUI) module communicates between the data mining system and
the user. This module helps the user to easily and efficiently use the system without knowing
the complexity of the process. This module cooperates with the data mining system when the
user specifies a query or a task and displays the results.
• Knowledge Base:
The knowledge base is helpful in the entire process of data mining. It may be used to guide
the search or to evaluate the interestingness of the resulting patterns. The knowledge base may even contain
user views and data from user experiences that might be helpful in the data mining process. The
data mining engine may receive inputs from the knowledge base to make the results more
accurate and reliable. The pattern evaluation module regularly interacts with the knowledge
base to get inputs and also to update it.
31
Architecture Of Typical Data Mining System
• Pattern Evaluation Module:
The pattern evaluation module is primarily responsible for measuring the
interestingness of patterns using a threshold value. It collaborates with the data
mining engine to focus the search on interesting patterns.
• This segment commonly employs interestingness measures that cooperate with the data
mining modules to focus the search towards interesting patterns. It might utilize an
interestingness threshold to filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module, depending on
the implementation of the data mining technique used. For efficient data mining,
it is highly recommended to push the evaluation of pattern interestingness as deep as
possible into the mining procedure, so as to confine the search to only the interesting
patterns.

32
Classification of Data Mining Systems
• Data mining refers to the process of extracting important data from raw data.
It analyses the data patterns in huge sets of data with the help of several kinds of
software. Ever since its development, data mining has been incorporated
by researchers in the research and development field.
• With data mining, businesses have been found to gain more profit. It has not only
helped in understanding customer demand but also in developing effective
strategies to improve overall business turnover. It has helped in determining
business objectives and making clear decisions.
• Data collection, data warehousing, and computer processing are some of
the strongest pillars of data mining. Data mining utilizes mathematical
algorithms to segment the data and assess the possibility of future events
occurring.
33
Data Mining - Issues

• Data mining is not an easy task, as the algorithms used can get very
complex and data is not always available at one place.

• It needs to be integrated from various heterogeneous data sources.

34
Data Mining - Issues

35
Data Preprocessing in Data Mining
• Data preprocessing is an important step in the data mining process.
• It refers to the cleaning, transforming, and integrating of data in order
to make it ready for analysis.
• The goal of data preprocessing is to improve the quality of the data
and to make it more suitable for the specific data mining task.

36
Forms of data preprocessing
Data Preprocessing in Data Mining
Data preprocessing is an important step in the data mining process that involves
cleaning and transforming raw data to make it suitable for analysis. Some common
steps in data preprocessing include:
• Data Cleaning: This involves identifying and correcting errors or inconsistencies
in the data, such as missing values, outliers, and duplicates. Various techniques can
be used for data cleaning, such as imputation, removal, and transformation.
• Data Integration: This involves combining data from multiple sources to create a
unified dataset. Data integration can be challenging as it requires handling data
with different formats, structures, and semantics. Techniques such as record linkage
and data fusion can be used for data integration.
38
Data Preprocessing in Data Mining
• Data Transformation: This involves converting the data into a suitable format for analysis.
Common techniques used in data transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common range, while
standardization is used to transform the data to have zero mean and unit variance.
Discretization is used to convert continuous data into discrete categories.
• Data Reduction: This involves reducing the size of the dataset while preserving the
important information. Data reduction can be achieved through techniques such as feature
selection and feature extraction. Feature selection involves selecting a subset of relevant
features from the dataset, while feature extraction involves transforming the data into a lower-
dimensional space while preserving the important information.
• Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms that
require categorical data. Discretization can be achieved through techniques such as equal
width binning, equal frequency binning, and clustering.
39
Data Preprocessing in Data Mining
• Data Normalization: This involves scaling the data to a common range,
such as between 0 and 1 or between -1 and 1. Normalization is often used to
handle data with different units and scales. Common normalization
techniques include min-max normalization, z-score normalization, and
decimal scaling (a small sketch follows below).
• Data preprocessing plays a crucial role in ensuring the quality of data
and the accuracy of the analysis results. The specific steps involved in
data preprocessing may vary depending on the nature of the data and the
analysis goals.
• By performing these steps, the data mining process becomes more
efficient and the results become more accurate.
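A minimal sketch of min-max and z-score normalization with NumPy; the attribute values are made up for illustration.

```python
import numpy as np

# Hypothetical attribute measured in arbitrary units.
x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: rescale to the range [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean and unit variance.
z_score = (x - x.mean()) / x.std()

print(min_max)   # [0.    0.125 0.25  0.5   1.   ]
print(z_score)
```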
40
Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling missing data, noisy data,
etc.

(a). Missing Data: This situation arises when some data is missing in the data set. It can be handled in various ways.

• Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.
• Fill the missing values:
There are various ways to do this. You can choose to fill the missing values manually, by the attribute mean, or by the most probable value.

(b). Noisy Data: Noisy data is meaningless data that can't be interpreted by machines. It can be generated due to faulty data collection, data entry
errors, etc. It can be handled in the following ways:
• Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and each segment is handled
separately. One can replace all the data in a segment by its mean, or boundary values can be used to complete the task (see the sketch
after this list).

• Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable)
or multiple (having multiple independent variables).

• Clustering:
This approach groups similar data into clusters. The outliers may go undetected, or they will fall outside the clusters.
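A minimal sketch of smoothing by bin means on sorted data, assuming equal-size bins; the values and the bin size are illustrative, not from the slides.

```python
# Hypothetical sorted, noisy attribute values.
values = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bin_size = 4

smoothed = []
for i in range(0, len(values), bin_size):
    bin_ = values[i:i + bin_size]
    mean = sum(bin_) / len(bin_)                   # replace every value in
    smoothed.extend([round(mean, 1)] * len(bin_))  # the segment by its mean

print(smoothed)
# [9.0, 9.0, 9.0, 9.0, 22.8, 22.8, 22.8, 22.8, 29.2, 29.2, 29.2, 29.2]
```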
41
Data Transformation:

This step is taken in order to transform the data into forms appropriate for the mining process. This
involves the following ways:
Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute
“city” can be converted to “country” (a small sketch follows below).
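As a very small sketch of concept hierarchy generation, the dictionary below rolls the low-level attribute "city" up to the higher-level attribute "country"; the city names and records are hypothetical.

```python
# Hypothetical city -> country concept hierarchy.
city_to_country = {
    "Bhubaneswar": "India",
    "Mumbai": "India",
    "Paris": "France",
    "Lyon": "France",
}

records = [{"city": "Mumbai", "amount": 120},
           {"city": "Paris", "amount": 90}]

# Roll each record up from the "city" level to the "country" level.
for r in records:
    r["country"] = city_to_country.get(r["city"], "Unknown")

print(records)
```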

42
Data Reduction:
• Data Reduction: Data reduction is a crucial step in the data mining process that involves reducing the size of the dataset
while preserving the important information. This is done to improve the efficiency of data analysis and to avoid
overfitting of the model. Some common steps involved in data reduction are:
• Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature selection is often
performed to remove irrelevant or redundant features from the dataset. It can be done using various techniques such as
correlation analysis, mutual information, and principal component analysis (PCA).
• Feature Extraction: This involves transforming the data into a lower-dimensional space while preserving the important
information. Feature extraction is often used when the original features are high-dimensional and complex. It can be done
using techniques such as PCA, linear discriminant analysis (LDA), and non-negative matrix factorization (NMF) (see the
sketch after this list).
• Sampling: This involves selecting a subset of data points from the dataset. Sampling is often used to reduce the size of
the dataset while preserving the important information. It can be done using techniques such as random sampling,
stratified sampling, and systematic sampling.
• Clustering: This involves grouping similar data points together into clusters. Clustering is often used to reduce the size
of the dataset by replacing similar data points with a representative centroid. It can be done using techniques such as k-
means, hierarchical clustering, and density-based clustering.
• Compression: This involves compressing the dataset while preserving the important information. Compression is often
used to reduce the size of the dataset for storage and transmission purposes. It can be done using techniques such as
wavelet compression, JPEG compression, and gzip compression.
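To illustrate one data reduction route (feature extraction with PCA), here is a minimal scikit-learn sketch; the four-dimensional sample data and the choice of two components are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical high-dimensional data: 6 records, 4 features each.
X = np.array([[2.5, 2.4, 0.5, 1.0],
              [0.5, 0.7, 2.1, 0.3],
              [2.2, 2.9, 0.4, 1.1],
              [1.9, 2.2, 0.6, 0.9],
              [3.1, 3.0, 0.2, 1.3],
              [2.3, 2.7, 0.5, 1.0]])

# Project onto 2 principal components while keeping most of the variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (6, 2) -- smaller representation
print(pca.explained_variance_ratio_)   # how much information is preserved
```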
43
