
Chapter - 5 - Data Mining

Chapter 5 discusses Data Mining, defined as the process of discovering patterns and insights from large datasets using various techniques. It covers the steps in the data mining process, key techniques, and the differences between Knowledge Discovery in Databases (KDD) and data mining, as well as the advantages and challenges associated with data mining. Applications of data mining span across multiple industries, aiding in decision-making and predictive analytics.

Chapter 5

Data Mining

Amol D. Vibhute (PhD)


Assistant Professor

Email ID:- amol.vibhute@sicsr.ac.in


Roadmap of Chapter:
• Introduction
• What is data mining?
• KDD vs. data mining
• Information extraction
• Data mining characteristics
• Issues and challenges in DM
• Application of DM

Tuesday, March 4, 2025 Dr. Amol 2


Introduction:
• Data Mining is the process of discovering patterns, trends, and insights from large datasets using techniques
from machine learning, statistics, and database systems. It helps in making data-driven decisions in
business, healthcare, finance, and many other fields.
• Key Features of Data Mining
– Extracts hidden patterns from raw data.
– Uses algorithms to find relationships in large datasets.
– Supports decision-making in various industries.
– Improves efficiency in predictive analytics & business intelligence (BI).



Cont.…
• Steps in Data Mining Process
• Data Collection & Preprocessing
– Gathering data from multiple sources (databases, IoT devices, logs).
– Cleaning and transforming data (handling missing values, normalization).
• Data Exploration & Transformation
– Identifying key attributes (feature selection).
– Removing noise & duplicates for accurate results.
• Applying Data Mining Techniques
– Classification – Predicting categories (Spam/Not Spam).
– Clustering – Grouping similar data (Customer Segmentation).
– Association Rule Mining – Finding patterns (Market Basket Analysis).
– Regression – Predicting continuous values (Stock Prices).
• Pattern Evaluation & Interpretation
– Extracting meaningful insights from discovered patterns.
• Deployment & Decision Making
– Using insights for fraud detection, customer analytics, healthcare, finance, etc.
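
The preprocessing step above (removing duplicates, handling missing values, normalization) can be sketched in plain Python. This is only a minimal illustration: the records, the field names `age` and `income`, and the choice of mean imputation and min-max scaling are assumptions made for the example, not methods prescribed by the slides.

```python
# Hypothetical toy records; the field names are illustrative only.
records = [
    {"age": 25, "income": 30000},
    {"age": None, "income": 50000},   # missing value
    {"age": 35, "income": 70000},
    {"age": 35, "income": 70000},     # exact duplicate
]

def drop_duplicates(rows):
    """Keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def impute_mean(rows, field):
    """Replace missing (None) values in `field` with the mean of the known values."""
    known = [r[field] for r in rows if r[field] is not None]
    mean = sum(known) / len(known)
    return [{**r, field: mean if r[field] is None else r[field]} for r in rows]

def min_max_normalize(rows, field):
    """Rescale `field` linearly to the range [0, 1]."""
    values = [r[field] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in rows]

cleaned = drop_duplicates(records)
cleaned = impute_mean(cleaned, "age")
cleaned = min_max_normalize(cleaned, "age")
cleaned = min_max_normalize(cleaned, "income")
```

In this sketch the duplicate is removed first so that it does not distort the mean used for imputation.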



Cont.…
• Key Data Mining Techniques
• Classification (Supervised Learning)
– Example: Email spam detection (Spam or Not Spam).
– Algorithms: Decision Trees, Naïve Bayes, Random Forest, SVM.
• Clustering (Unsupervised Learning)
– Example: Grouping customers by shopping behavior.
– Algorithms: K-Means, DBSCAN, Hierarchical Clustering.
• Association Rule Mining
– Example: "People who buy milk also buy bread" (Market Basket Analysis).
– Algorithms: Apriori, FP-Growth.
• Anomaly Detection
– Example: Fraud detection in credit card transactions.
– Techniques: Isolation Forest, One-Class SVM.
• Regression Analysis
– Example: Predicting house prices based on location, size, and amenities.
– Algorithms: Linear Regression, Decision Trees, Neural Networks.
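
As a small illustration of association rule mining, the rule "milk → bread" can be scored with the standard support, confidence, and lift measures. The baskets below are invented for the example, and the code only scores one rule; it is not an implementation of Apriori or FP-Growth, which additionally search for frequent itemsets.

```python
# Hypothetical shopping baskets, invented for illustration.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) from the baskets."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence relative to how common the consequent is on its own."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

rule_support = support({"milk", "bread"}, transactions)
rule_confidence = confidence({"milk"}, {"bread"}, transactions)
rule_lift = lift({"milk"}, {"bread"}, transactions)
```

A lift above 1 suggests the two items appear together more often than independence would predict; a lift below 1 suggests the opposite.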



KDD:
• KDD (Knowledge Discovery in Databases) is a process that involves the extraction of
useful, previously unknown, and potentially valuable information from large datasets.
• The KDD process is iterative: the steps below usually have to be repeated several
times to extract accurate knowledge from the data.
– Data Cleaning
• Data cleaning is the removal of noisy and irrelevant data from the
collection.
• Handling missing values.
• Cleaning noisy data, where noise is random error or variance in a measured variable.
• Cleaning with data discrepancy detection and data transformation tools.
– Data Integration
• Data integration combines heterogeneous data from multiple sources
into a common store (a data warehouse). It is carried out with data
migration tools, data synchronization tools, and the ETL (Extract,
Transform, Load) process.



Cont.…
• Data Selection
– Data selection is the process of deciding which data is relevant to the analysis and retrieving it from the data collection. Methods such as
neural networks, decision trees, Naive Bayes, clustering, and regression can be used for this.

• Data Transformation
– Data transformation is the process of converting data into the form required by the mining procedure. It is a
two-step process:

• Data Mapping: Assigning elements from source base to destination to capture transformations.
• Code generation: Creation of the actual transformation program.
• Data Mining
– Data mining is the application of techniques to extract potentially useful patterns. It transforms task-relevant data into patterns and decides the
purpose of the model using classification or characterization.

• Pattern Evaluation
– Pattern evaluation is the identification of the truly interesting patterns that represent knowledge, based on given measures. It finds an interestingness score
for each pattern and uses summarization and visualization to make the data understandable to the user.

• Knowledge Representation
– This involves presenting the results in a way that is meaningful and can be used to make decisions.
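
The KDD steps can be strung together as a chain of small functions. The sketch below is a toy under stated assumptions: the two data sources, their field names, the spending threshold, and the trivial "high/low spender" labelling that stands in for a real mining algorithm are all hypothetical.

```python
from collections import defaultdict

# Two hypothetical sources with different field names (illustrative data only).
store_sales = [{"cust": "a", "amt": 120.0}, {"cust": "b", "amt": None}, {"cust": "a", "amt": 80.0}]
web_sales = [{"customer_id": "b", "total": 40.0}, {"customer_id": "c", "total": 300.0}]

def clean(rows, field):
    """Data cleaning: drop records whose `field` is missing."""
    return [r for r in rows if r[field] is not None]

def integrate(store, web):
    """Data integration: map both sources onto one common schema."""
    unified = [{"customer": r["cust"], "amount": r["amt"], "channel": "store"} for r in store]
    unified += [{"customer": r["customer_id"], "amount": r["total"], "channel": "web"} for r in web]
    return unified

def select(rows):
    """Data selection: keep only the attributes relevant to the analysis."""
    return [(r["customer"], r["amount"]) for r in rows]

def transform(pairs):
    """Data transformation: aggregate to one total per customer."""
    totals = defaultdict(float)
    for customer, amount in pairs:
        totals[customer] += amount
    return dict(totals)

def mine(totals, threshold):
    """Data mining (trivial stand-in): label each customer as a high or low spender."""
    return {c: ("high" if t >= threshold else "low") for c, t in totals.items()}

def evaluate(labels):
    """Pattern evaluation: summarize how many customers fall in each group."""
    summary = defaultdict(int)
    for label in labels.values():
        summary[label] += 1
    return dict(summary)

rows = integrate(clean(store_sales, "amt"), clean(web_sales, "total"))
totals = transform(select(rows))
labels = mine(totals, threshold=150.0)
summary = evaluate(labels)
```

In practice the process is iterative: if the summary is not useful, one would go back and adjust the selection, the transformation, or the threshold.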



Cont.…
• Advantages of KDD
– Improves decision-making: KDD provides valuable insights and knowledge that can help organizations make better
decisions.
– Increased efficiency: KDD automates repetitive and time-consuming tasks and makes the data ready for analysis,
which saves time and money.
– Better customer service: KDD helps organizations gain a better understanding of their customers’ needs and
preferences, which can help them provide better customer service.
– Fraud detection: KDD can be used to detect fraudulent activities by identifying patterns and anomalies in the data
that may indicate fraud.
– Predictive modeling: KDD can be used to build predictive models that can forecast future trends and patterns.



Cont.…
• Disadvantages of KDD
– Privacy concerns: KDD can raise privacy concerns as it involves collecting and analyzing large amounts of data,
which can include sensitive information about individuals.
– Complexity: KDD can be a complex process that requires specialized skills and knowledge to implement and
interpret the results.
– Unintended consequences: KDD can lead to unintended consequences, such as bias or discrimination, if the data or
models are not properly understood or used.
– Data quality: The KDD process depends heavily on the quality of the data. If the data is not accurate or consistent, the results
can be misleading.
– High cost: KDD can be an expensive process, requiring significant investments in hardware, software, and
personnel.
– Overfitting: The KDD process can lead to overfitting, a common problem in machine learning in which a model
learns the detail and noise in the training data so closely that its performance suffers
on new, unseen data.
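
Overfitting can be made concrete with a toy comparison. In the sketch below, all data is invented: the true rule is assumed to be "label 1 when x >= 5", two training labels are deliberately flipped to act as noise, and a 1-nearest-neighbour model (which memorizes the training set) is compared with a single fitted threshold.

```python
# Training data with two deliberately noisy labels: (2, 1) and (8, 0).
train = [(0, 0), (1, 0), (2, 1), (3, 0), (4, 0), (6, 1), (7, 1), (8, 0), (9, 1), (10, 1)]
# Clean test data that follows the assumed true rule: label 1 when x >= 5.
test = [(0.4, 0), (1.8, 0), (2.3, 0), (3.6, 0), (4.5, 0),
        (5.6, 1), (6.7, 1), (7.8, 1), (8.2, 1), (9.4, 1)]

def accuracy(predict, data):
    """Fraction of points in `data` that `predict` labels correctly."""
    return sum(predict(x) == y for x, y in data) / len(data)

def nearest_neighbor(x):
    """1-NN: copy the label of the closest training point (memorizes noise too)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def fit_threshold(data):
    """Pick the cut point t that maximizes training accuracy for 'predict 1 if x >= t'."""
    xs = sorted(x for x, _ in data)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda t: accuracy(lambda x: int(x >= t), data))

t = fit_threshold(train)

def threshold_model(x):
    return int(x >= t)

results = {
    "1-NN": (accuracy(nearest_neighbor, train), accuracy(nearest_neighbor, test)),
    "threshold": (accuracy(threshold_model, train), accuracy(threshold_model, test)),
}
```

On this toy data the memorizing model is perfect on the training set but worse on the test set, while the simpler threshold model tolerates the two noisy training points and generalizes better.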
Difference between KDD and Data Mining:
Parameter | KDD | Data Mining
Definition | KDD refers to a process of identifying valid, novel, potentially useful, and ultimately understandable patterns and relationships in data. | Data mining refers to a process of extracting useful and valuable information or patterns from large data sets.
Objective | To find useful knowledge from data. | To extract useful information from data.
Techniques used | Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation and visualization. | Association rules, classification, clustering, regression, decision trees, neural networks, and dimensionality reduction.
Output | Structured information, such as rules and models, that can be used to make decisions or predictions. | Patterns, associations, or insights that can be used to improve decision-making or understanding.
Focus | Focus is on the discovery of useful knowledge, rather than simply finding patterns in data. | Focus is on the discovery of patterns or relationships in data.
Role of domain expertise | Domain expertise is important in KDD, as it helps in defining the goals of the process, choosing appropriate data, and interpreting the results. | Domain expertise is less critical in data mining, as the algorithms are designed to identify patterns without relying on prior knowledge.



Data mining characteristics:
• Data mining is a method of gathering information in which all the relevant data passes through
some form of identification process.
• Working through this process brings out the main characteristics of data mining, listed here.
• 1. Increased quantities of data:
– Earlier, the data for a mining system had to be obtained with the help of clients and customers; today, any amount of information can be
acquired without their help.
– This shift has also introduced a new problem: a much larger quantity of work.
– With information technology, a large amount of information can be acquired without extra burden or trouble.

• 2. Provides incomplete data:


– Most people provide incomplete information about themselves in surveys conducted with the help of data mining systems.
– People undervalue their information, which is why the answers they give in surveys
conducted for mining systems are incomplete.
– These mining systems have also changed people's perspective: many now fear sharing their personal information.

• 3. Complicated data structure:


– Data mining gathers and combines information using information collection techniques, some of which are manual and the rest technological.
As a result, understanding and interpreting mined data can be somewhat more complicated than with other structures in information technology.



Issues and challenges in DM:
• Data Mining Issues:
– 1. Mining methodology and user interaction issues:
• i. Mining different kinds of knowledge in databases:
– Different users want different kinds of knowledge, obtained in different ways. It is therefore difficult to cover the wide range of
data needed to meet every client's requirements.

• ii. Interactive mining of knowledge at multiple levels of abstraction:


– Interactive mining allows users to focus the search for patterns from different angles. The data mining process should be interactive because it is difficult to
know in advance what can be discovered within a database.

• iii. Incorporation of background knowledge:


– Background knowledge is used to guide the discovery process and to express the discovered patterns.

• iv. Query languages and ad hoc mining:


– Relational query languages (such as SQL) allow users to pose ad hoc queries for data retrieval. A data mining query language should be
well matched with the query language of the data warehouse.

• v. Handling noisy or incomplete data:


– In a large database, many attribute values may be incorrect, whether through human error or instrument failure. Data cleaning methods
and data analysis methods are used to handle noisy data.
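
One common way to handle noisy numeric values is smoothing by bin means: sort the values, split them into equal-size bins, and replace each value with the mean of its bin. The slides do not name a specific cleaning method, so the choice of binning, the bin size, and the sample values below are assumptions made for illustration.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of `bin_size`, and replace
    each value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

# Hypothetical noisy measurements.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, bin_size=3)
```

Here the three bins (4, 8, 15), (21, 21, 24), and (25, 28, 34) are replaced by their means 9, 22, and 29. Larger bins smooth more aggressively but also discard more detail.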



Cont.…
• 2. Performance issues:
– i. Efficiency and scalability of data mining algorithms:
• To effectively extract information from a huge amount of data in databases, data mining algorithms must be efficient and scalable.

– ii. Parallel, distributed, and incremental mining algorithms:


• The huge size of many databases, the wide distribution of data, and the complexity of some data mining methods are factors motivating the
development of parallel and distributed data mining algorithms. Such algorithms divide the data into partitions, which are processed in parallel.

• 3. Issues relating to the diversity of database types:


– i. Handling of relational and complex types of data:
• There are many kinds of data stored in databases and data warehouses. It is not possible for one system to mine all these kinds of data, so
different data mining systems should be constructed for different kinds of data.

– ii. Mining information from heterogeneous databases and global information systems:
• Data is fetched from different sources over Local Area Networks (LAN) and Wide Area Networks (WAN). Discovering knowledge from
different sources of structured data is a great challenge for data mining.
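
The idea of dividing data into partitions that are processed in parallel can be sketched with Python's standard `concurrent.futures` module. The baskets, the helper names, and the choice of item counting as the "mining" step are assumptions for illustration. Threads are used only to keep the sketch short; in CPython, CPU-bound work would normally use a process pool instead.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(data, n_parts):
    """Split `data` into `n_parts` roughly equal slices."""
    return [data[i::n_parts] for i in range(n_parts)]

def count_items(baskets):
    """Local mining step: count in how many baskets of this partition each item occurs."""
    counts = Counter()
    for basket in baskets:
        counts.update(set(basket))
    return counts

def parallel_item_counts(baskets, n_parts=2):
    """Count items per partition in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(count_items, partition(baskets, n_parts)))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical baskets, invented for illustration.
baskets = [["milk", "bread"], ["milk"], ["bread", "eggs"], ["milk", "eggs"], ["bread"]]
counts = parallel_item_counts(baskets, n_parts=2)
```

Because the partial counts are simply added together, the merged result matches what a single sequential pass over all the baskets would produce.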



Cont.…
• Major Challenges In Data Mining:
1. Security and Social Challenges:
• Decision-making strategies rely on collecting and sharing data, which demands considerable security. Private and sensitive
information about individuals is gathered for customer profiles and for understanding user behavior patterns, so illegal access
to information and the confidential nature of information become significant issues.

2. Noisy and Incomplete Data:


• Data mining is the process of obtaining information from large volumes of data. Real-world data is noisy,
incomplete, and heterogeneous, and data in large quantities is often unreliable or inaccurate. These problems may be
due to human error or to errors in the instruments that measure the data.

3. Distributed Data:
• Real-world data is usually stored on different platforms in distributed computing environments. It may be on the internet,
on individual systems, or in databases. It is practically hard to bring all the data into a centralized data repository, mainly
for technical and organizational reasons.



Cont.…
• Major Challenges In Data Mining:
4. Complex Data:
• Real-world data is truly heterogeneous: it may be multimedia data, including natural language text, time series, spatial data,
temporal data, audio, video, images, and other complex data. It is hard to handle these different kinds of data and
extract the required information. More often than not, new tools and methods have to be developed to
extract the relevant information.

5. Performance:
• The performance of a data mining system depends mainly on the efficiency of the techniques and algorithms used.
If these techniques and algorithms are not well designed, the performance of
the data mining process suffers.

6. Scalability and Efficiency of the Algorithms:


• Data mining algorithms must be scalable and efficient to extract information from the huge amounts of data in a
database.
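
One way to keep an algorithm scalable is to process the data in a single pass with constant memory. As an illustration that is not taken from the slides, the sketch below uses Welford's online algorithm to maintain a running mean and sample variance; the class name `RunningStats` and the sample values are hypothetical.

```python
class RunningStats:
    """Welford's online algorithm: mean and variance in one pass with O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 in the denominator); 0.0 for fewer than two values."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
```

Because each value is seen only once and nothing is stored, the same loop works whether the values come from a small list or from a stream too large to hold in memory.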



Cont.…
• Major Challenges In Data Mining:
7. Improvement of Mining Algorithms:
• Factors such as the complexity of data mining methods, the enormous size of databases, and the overall data flow
motivate the development of parallel and distributed data mining algorithms.

8. Incorporation of Background Knowledge:


• If background knowledge can be incorporated, more accurate and reliable data mining solutions can be
found. Predictive tasks can make more accurate predictions, while descriptive tasks can produce more useful findings.
However, collecting and incorporating background knowledge is a complex process.

9. Ethics:
• Data mining raises ethical concerns related to the collection, use, and dissemination of data. The data may be used to
discriminate against certain groups, violate privacy rights, or perpetuate existing biases. Moreover, data mining algorithms
may not be transparent, making it challenging to detect biases or discrimination.



Application of DM:



Thank You !!!

