
Fundamentals of Data Science Notes (Module - 1)

Data mining is the process of extracting knowledge from large datasets using techniques from artificial intelligence, machine learning, and statistics. It involves several essential steps including data selection, preprocessing, transformation, mining, evaluation, and presentation, and can be applied in various fields such as market analysis and fraud detection. The document also distinguishes between predictive and descriptive tasks in data mining, outlines the architecture of data mining systems, and discusses the differences between data mining and traditional database management systems.

Uploaded by

cssanjaycs438
Copyright
© All Rights Reserved

FUNDAMENTALS OF DATA SCIENCE

MODULE -1

 Fundamentals of Data Mining:

Data mining refers to extracting or mining knowledge from large amounts of data. The term is
actually a misnomer: the process would more appropriately be named knowledge mining,
which emphasizes mining knowledge from large amounts of data. It is the
computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems. The
overall goal of the data mining process is to extract information from a data set and transform
it into an understandable structure for further use. The key properties of data mining are:

o Automatic discovery of patterns
o Prediction of likely outcomes
o Creation of actionable information
o Focus on large datasets and databases

 Data mining is the process of automatically discovering useful information in large
data repositories.

Or

 Data mining is the non-trivial process of identifying valid, novel, potentially useful,
and ultimately understandable patterns in data.
 It also provides the capability to predict the outcome of a future observation.
 Data mining turns a large collection of data into knowledge.
 It attempts to extract hidden patterns and trends from large databases.
 It also supports automatic exploration of data.
 It is also called exploratory data analysis, data-driven discovery, and deductive learning.

Eg:- The amount a customer will spend online, sales prediction, the stock market, etc.

 Definition of Data Mining :-

“Data mining is the process of discovering meaningful new correlations, patterns and trends by
sifting through large amounts of data stored in repositories, using pattern recognition
techniques as well as statistical and mathematical techniques.”
Or

Data Mining is defined as extracting information from huge sets of data. In other words, we
can say that data mining is the procedure of mining knowledge from data.

The information or knowledge so extracted can be used for any of the following applications:

o Market Analysis
o Fraud Detection
o Customer Retention
o Production Control
o Science Exploration

 Essential steps in Data Mining :-

 Data Selection ( Relevant data from various sources ) :

o The first step in the DM process is to select the relevant data for analysis.
o This involves identifying the data sources and selecting the data that is necessary for
the analysis.

 Data Preprocessing (Consistent state, removal of unnecessary information) :

o The data obtained from different sources may be in different formats and may have
errors and inconsistencies.
o The data preprocessing step involves cleaning and transforming the data to make it
suitable for analysis.

 Data Transformation (Suitable format ) :

o Once the data has been cleaned, it may need to be transformed to make it more
meaningful for analysis.
o This involves converting the data into a form that is suitable for data mining
algorithms.
 Data Mining ( Algorithms) :

o The data mining step involves applying various data mining techniques to identify
patterns and relationships in the data.
o This involves selecting the appropriate algorithms and models that are suitable for
the data and the problem being addressed.

 Pattern Evaluation ( Patterns and knowledge ) :

o After the data mining step, the patterns and relationships identified in the data need
to be evaluated to determine their usefulness.
o This involves examining the patterns to determine whether they are meaningful and
can be used to make predictions or decisions.

 Knowledge Presentation ( Data Visualization ) :

o The patterns and relationships identified in the data need to be represented in a form
that is understandable and useful to the end-user.
o This involves presenting the results in a way that is meaningful and can be used to
make decisions.
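
The six steps above can be sketched as a chain of small functions. This is only an illustrative sketch: the records, field names, and the threshold of 100 are hypothetical, and the "mining" step is a simple average per group, not a real mining algorithm.

```python
# Hypothetical raw records; the field names and values are illustrative only.
raw_records = [
    {"customer": "A", "amount": "120.5", "region": "north"},
    {"customer": "B", "amount": None, "region": "south"},  # missing value
    {"customer": "C", "amount": "80", "region": "north"},
    {"customer": "D", "amount": "300", "region": "south"},
]

def select(records):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{"amount": r["amount"], "region": r["region"]} for r in records]

def preprocess(records):
    """Data preprocessing: drop records with missing values."""
    return [r for r in records if r["amount"] is not None]

def transform(records):
    """Data transformation: convert amounts from text to numbers."""
    return [{"amount": float(r["amount"]), "region": r["region"]} for r in records]

def mine(records):
    """Data mining: a trivial 'pattern', the average amount per region."""
    grouped = {}
    for r in records:
        grouped.setdefault(r["region"], []).append(r["amount"])
    return {region: sum(v) / len(v) for region, v in grouped.items()}

def evaluate(patterns, threshold=100.0):
    """Pattern evaluation: keep only averages above an assumed threshold."""
    return {region: avg for region, avg in patterns.items() if avg > threshold}

def present(patterns):
    """Knowledge presentation: format the result for the end-user."""
    return [f"{region}: average spend {avg:.2f}" for region, avg in sorted(patterns.items())]

report = present(evaluate(mine(transform(preprocess(select(raw_records))))))
print(report)
```

Each function corresponds to one step, so a step can be swapped out (for example, a different cleaning rule) without touching the others.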

 Architecture of Data Mining :-


A typical data mining system may have the following major components.

1. Knowledge Base: - This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns.
o Such knowledge can include concept hierarchies, used to organize attributes or
attribute values into different levels of abstraction.
o Knowledge such as user beliefs, which can be used to assess a pattern’s
interestingness based on its unexpectedness, may also be included.
o Other examples of domain knowledge are additional interestingness constraints or
thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).

2. Data Mining Engine:- This is essential to the data mining system and ideally consists of
a set of functional modules for tasks such as characterization, association and correlation
analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

3. Pattern Evaluation Module:- This component typically employs interestingness
measures and interacts with the data mining modules so as to focus the search toward
interesting patterns.
o It may use interestingness thresholds to filter out discovered patterns.
o Alternatively, the pattern evaluation module may be integrated with the mining
module, depending on the implementation of the data mining method used.
o For efficient data mining, it is highly recommended to push the evaluation of pattern
interestingness as deep as possible into the mining process, so as to confine the search to
only the interesting patterns.

4. User Interface:- This module communicates between users and the data mining
system, allowing the user to interact with the system by specifying a data mining query or
task, providing information to help focus the search, and performing exploratory
data mining based on the intermediate data mining results.
o In addition, this component allows the user to browse database and data warehouse
schemas or data structures, evaluate mined patterns, and visualize the patterns in
different forms.

 Data Mining Tasks :-

There are two different types of data mining tasks :-

1) Predictive Task
2) Descriptive Task

 Predictive Task :-

A predictive task predicts the unknown or future value of one attribute from the known
values of other attributes. With previously available or historical data, data mining can
be used to make predictions about critical business metrics.

Techniques used in predictive Data Mining Task :-

 Classification : Predefined set of classes


Classification is a data mining technique that categorizes items in a collection based on
some predefined properties.

o It uses methods like if-then, decision trees or neural networks to predict a class or
essentially classify a collection of items.

( Eg :- Credit Card )

 Regression : Forecasting future data values

This technique is used to create a model that can predict the value of a target variable
based on the values of several input variables.

o Regression analysis is often used for prediction tasks.

( Eg :- Pension )

 Prediction : Predicting future state

It is used to predict unavailable data values or spending trends.

( Eg :- Online )

 Time Series Analysis : Predicting future values based on time

It helps to analyse and predict data based on time and historical data.

(Eg : - Stock Market )
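
As a minimal sketch of predicting a future value from historical, time-ordered data, the example below uses a simple moving average. The moving average is just one assumed technique chosen for illustration, and the price series is hypothetical.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or len(series) < window:
        raise ValueError("window must be positive and no longer than the series")
    return sum(series[-window:]) / window

# Hypothetical daily closing prices, oldest first.
closing_prices = [100.0, 102.0, 101.0, 105.0, 106.0]
forecast = moving_average_forecast(closing_prices, window=3)
print(forecast)
```

A larger window smooths out short-term fluctuations but reacts more slowly to recent changes.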

 Descriptive Task :-

It enables you to determine patterns and relationships in a sample of data.

o It provides knowledge for understanding what is happening within the data
without any prior idea of what to look for. The common features of the data set are
highlighted, for example, count, average, etc.

Techniques used in descriptive Data Mining Task :-

 Clustering : Grouping with no predefined classes

This technique is used to identify groups of data points that share similar
characteristics. Clustering can be used for segmentation, anomaly detection, and
summarization.
( Eg :- Marketing )

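 Sequence discovery :

Sequence discovery finds patterns of events that occur in a particular order over time.
It is similar to time series analysis (TSA), but it looks for time-ordered sequences of
events rather than predicting values.

( Eg:- Web page linking )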

 Summarization :

This technique is used to represent the data in a visual format that can help
users to identify patterns or trends that may not be apparent in the raw data.

 Association Rule :

This technique is used to identify relationships between variables in the data.
It can be used to discover co-occurring events or to identify patterns in
transaction data.

( Eg :- Retail )

 KDD Vs Data Mining :-

 Knowledge Discovery in Databases (KDD) was formalised in 1989.
 The term data mining was coined later; it refers to the high-level application techniques
used to present and analyse data for decision-makers.
 KDD consists of various steps such as data selection, cleaning, preprocessing, data
transformation, reduction, DM, post-processing, and interpretation.
 DM is one of the steps in KDD.
 The KDD process tends to be highly iterative and interactive.
 DM is used to support decisions or to draw conclusions.
 Fayyad et al. give the following definition:
 “KDD is the process of identifying a valid, potentially useful and ultimately
understandable structure in data. This process involves selecting or sampling data from
a data warehouse, cleaning or processing it, transforming or reducing it, applying a data
mining component to produce a structure, and then evaluating the derived structure.”
 Data mining is a step in the KDD process concerned with the algorithmic means by
which patterns or structures are enumerated from the data under acceptable
computational efficiency limitations.
 The structures that are the outcome of the data mining process must meet certain conditions
so that they can be considered knowledge.
 The conditions are: validity, understandability, utility, novelty and interestingness.

Stages of KDD:-

 Selection :- This stage is concerned with selecting or segmenting the data that are relevant
to some criteria.
 Preprocessing :- Preprocessing is the data cleaning stage where unnecessary
information is removed.
 Transformation :- Transforming the data so that it is suitable for the task of data mining.
 Data Mining :- This stage is concerned with the extraction of patterns from the data.
 Interpretation and Evaluation :- The patterns obtained in the data mining stage are
converted into knowledge, which, in turn, is used to support decision making.
 Data Visualization :- Data visualization helps users to examine large volumes of data
and detect the patterns visually.

 DBMS Vs DM :-

 A DBMS supports query languages, which are useful for query-triggered data
exploration, whereas data mining supports automatic data exploration.
 Data known – DBMS: if we know exactly what information we are seeking, then
DBMS queries are useful.
 Data unknown – DM: if we only vaguely know the possible correlations or patterns,
then data mining techniques are useful.
 One of the tasks of DM is hypothesis testing, and this task can be handled by a DBMS query.
Thus, in this sense, a DBMS supports some primitive data mining tasks.

There are three different ways in which DM systems use a relational DBMS:

 May not use it at all :- A majority of data mining systems do not use any DBMS
and have their own memory and storage management.
o They treat the database simply as a data repository from which data is expected to
be downloaded into their own memory structures before the data mining algorithm
starts.
 Loosely coupled :- In this approach, the DBMS is used only for storage and
retrieval of data.
o For instance, one can use loosely coupled SQL to fetch data records as required
by the mining algorithm.
o The front-end of the application is implemented in a host programming
language, with embedded SQL statements.
o The application uses an SQL select statement to retrieve the set of records of interest
from the database.
 Tightly coupled :-

o In this approach, portions of the application programs are selectively pushed to
the database system to perform the necessary computation.

o Data are stored in the database and all processing is done at the database end.

o The performance of this approach depends on how well the data mining
process is optimised when it is mapped to a query.

o There are two suggested approaches: we can leave the optimisation task to the built-in
query optimiser of the DBMS, or we can use an external optimiser.
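
The loosely coupled approach can be sketched with Python's built-in sqlite3 module, with Python playing the role of the host language. The in-memory database, the `sales` table, and its columns are hypothetical and serve only to illustrate the split between retrieval in the DBMS and mining in the application.

```python
import sqlite3

# In-memory database standing in for the DBMS; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("A", "bread", 2.0), ("A", "milk", 1.5), ("B", "bread", 2.0), ("C", "milk", 1.5)],
)

# Loosely coupled: the DBMS only stores the data and retrieves the records of interest.
rows = conn.execute(
    "SELECT customer, item FROM sales WHERE amount > ?", (1.0,)
).fetchall()
conn.close()

# The mining step (here, a simple item frequency count) runs in the host language.
item_counts = {}
for _customer, item in rows:
    item_counts[item] = item_counts.get(item, 0) + 1

print(item_counts)
```

In a tightly coupled design, the counting would instead be pushed into the database, for example as a `GROUP BY` query, so that the processing happens at the database end.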
 DM Techniques :-

 Association Rules:

 This data mining technique helps to discover a link between two or more items. It
finds hidden patterns in the data set.
 Association rules are if-then statements that show the probability of
relationships between data items within large data sets in different types of
databases.
 Association rule mining has several applications and is commonly used to find sales
correlations in transactional data or in medical data sets.
 For example, given a list of the grocery items that you have been buying for the last six
months, it calculates the percentage of items being purchased together.

There are three major measurement techniques:

o Lift:
This measures the strength of the confidence relative to how often item B is
purchased overall.

Lift = (Confidence) / ((Item B) / (Entire dataset))

o Support:
This measures how often multiple items are purchased together, compared
to the overall dataset.

Support = (Item A + Item B) / (Entire dataset)

o Confidence:
This measures how often item B is purchased when item A is
purchased as well.

Confidence = (Item A + Item B) / (Item A)

Here (Item A) is the number of transactions containing item A, (Item A + Item B) is the
number of transactions containing both items, and (Entire dataset) is the total number of
transactions.
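
The three measures can be computed directly from a list of transactions. The sketch below assumes a small hypothetical basket data set and represents each transaction as a Python set; the function names are chosen for illustration.

```python
# Hypothetical grocery transactions; each set is one purchase.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of all transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(a, b):
    """How often B is purchased when A is purchased: support(A and B) / support(A)."""
    return support(a | b) / support(a)

def lift(a, b):
    """Confidence of A -> B divided by how often B is purchased overall."""
    return confidence(a, b) / support(b)

item_a, item_b = {"bread"}, {"milk"}
print(f"support    = {support(item_a | item_b):.2f}")
print(f"confidence = {confidence(item_a, item_b):.2f}")
print(f"lift       = {lift(item_a, item_b):.2f}")
```

A lift above 1 suggests that the two items are bought together more often than would be expected if they were independent, while a lift below 1 suggests the opposite.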

 Classification: -

 This technique is used to obtain important and relevant information about data.
 This data mining technique helps to classify data into different classes.

Data mining frameworks can be classified by different criteria, as follows:

o Classification of data mining frameworks as per the type of data sources mined:

• This classification is as per the type of data handled.

• For example, multimedia, spatial data, text data, time-series data, World Wide
Web, and so on.

o Classification of data mining frameworks as per the database involved:

• This classification is based on the data model involved.

• For example, object-oriented database, transactional database, relational
database, and so on.

o Classification of data mining frameworks as per the kind of knowledge
discovered:

• This classification depends on the types of knowledge discovered or data mining
functionalities.

• For example, discrimination, classification, clustering, characterization, etc.

• Some frameworks are comprehensive, offering several data mining
functionalities together.

o Classification of data mining frameworks according to data mining techniques
used:

• This classification is as per the data analysis approach utilized, such as neural
networks, machine learning, genetic algorithms, visualization, statistics, data
warehouse-oriented or database-oriented, etc.

• The classification can also take into account the level of user interaction involved
in the data mining procedure, such as query-driven systems, autonomous systems,
or interactive exploratory systems.

 Prediction:

Prediction uses a combination of other data mining techniques such as trends,
clustering, classification, etc. It analyzes past events or instances in the right sequence
to predict a future event.

 Outlier detection:

This type of data mining technique relates to the observation of data items in the data
set that do not match an expected pattern or expected behavior.

o This technique may be used in various domains like intrusion detection, fraud
detection, etc.
o It is also known as outlier analysis or outlier mining. An outlier is a data point that
diverges too much from the rest of the dataset.
o The majority of real-world datasets contain outliers. Outlier detection plays a
significant role in the data mining field.
o Outlier detection is valuable in numerous fields like network intrusion
identification, credit or debit card fraud detection, detecting outliers in wireless sensor
network data, etc.
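
One simple way to flag data points that diverge too much from the rest is the z-score, which measures how many standard deviations a value lies from the mean. This is a sketch of one assumed method among many; the transaction amounts and the threshold of 2.0 are hypothetical.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical card transaction amounts; one is far larger than the rest.
amounts = [10, 12, 11, 13, 12, 11, 95]
outliers = zscore_outliers(amounts)
print(outliers)
```

The threshold is a design choice: a lower value flags more points as outliers, and in very small samples a single extreme value can only reach a limited z-score, so the threshold has to be chosen with the sample size in mind.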

 Regression :

o Regression can be defined as a statistical modeling method in which
previously obtained data is used to predict a continuous quantity for new
observations.
o Regression is also known as the Continuous Value Classifier.
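
As a minimal sketch of predicting a continuous quantity from previously obtained data, the example below fits a straight line by ordinary least squares. The choice of a single input variable and the data points are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical previously obtained observations.
xs = [1, 2, 3, 4, 5]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(xs, ys)
predicted = slope * 6 + intercept  # prediction for a new observation x = 6
print(slope, intercept, predicted)
```

Real data rarely lies exactly on a line, in which case the fitted line minimises the sum of squared differences between the observed and predicted values.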

 Clustering :

o Clustering analysis identifies data items that are similar to each other.

o It clarifies the similarities and differences between the data.

o It is also known as segmentation and provides an understanding of the events taking
place in the database.
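
The grouping of similar data items can be sketched with k-means, one commonly used clustering algorithm. The notes do not prescribe a specific algorithm, so k-means, the one-dimensional data, and the fixed starting centroids are all assumptions made to keep the example small and repeatable.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Cluster 1-D points around the given starting centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customer spend values forming two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = kmeans_1d(points, centroids=[1.0, 9.0])
print(centroids)
print(clusters)
```

No class labels are supplied: the two groups emerge only from how close the values are to each other.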

 Artificial Neural network (ANN) Classifier Method :

o An artificial neural network (ANN), also referred to simply as a “Neural Network”
(NN), is a computational model inspired by biological neural networks.
o It consists of an interconnected collection of artificial neurons.
o A neural network is a set of connected input/output units where each connection
has a weight associated with it.
o During the learning phase, the network learns by adjusting the weights so as to be
able to predict the correct class label of the input samples.
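
The idea of learning by adjusting connection weights can be sketched with a single artificial neuron (a perceptron). This is a deliberately tiny illustration, not a full neural network: the logical AND data set, the learning rate, and the number of epochs are assumptions.

```python
def step(value):
    """Activation function: fire (1) only if the weighted sum is positive."""
    return 1 if value > 0 else 0

def predict(weights, bias, inputs):
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Adjust the weights whenever the predicted class label is wrong."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Input samples with their class labels (logical AND).
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
predictions = [predict(weights, bias, inputs) for inputs, _ in and_gate]
print(predictions)
```

A single neuron can only separate classes with a straight line, which is why practical networks connect many neurons in layers.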

 Genetic algorithms:-

o Genetic algorithms are adaptive heuristic search algorithms that belong to the
larger class of evolutionary algorithms.

o Genetic algorithms are based on the ideas of natural selection and genetics.

o They are an intelligent exploitation of random search, using historical data
to direct the search into the region of better performance in the solution space.

o They are commonly used to generate high-quality solutions for optimization
problems and search problems.
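
A genetic algorithm can be sketched on a toy optimization problem: finding a bit string with as many ones as possible. The problem, the population size, the mutation rate, and the fixed random seed are all assumptions chosen to keep the sketch short and repeatable.

```python
import random

def fitness(bits):
    """Toy objective: the number of ones in the bit string."""
    return sum(bits)

def evolve(pop_size=20, length=12, generations=30, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    initial_best = max(fitness(ind) for ind in population)
    for _ in range(generations):
        # Selection: the fitter half of the population becomes the parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        # Elitism: carry the best individual over unchanged.
        children = [parents[0][:]]
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)  # single-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return initial_best, best

initial_best, best = evolve()
print(initial_best, fitness(best), best)
```

Because the best individual is always carried over, the best fitness never decreases from one generation to the next; selection and crossover reuse what earlier generations found, which is the "historical data" that directs the otherwise random search.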
 Applications of Data Mining :-

Data mining is primarily used by organizations with intense consumer demands, such as retail,
communication, financial, and marketing companies, to determine prices, consumer preferences,
product positioning, and the impact on sales, customer satisfaction, and corporate profits.

 Data Mining in Healthcare:

o Data mining in healthcare has excellent potential to improve the health system.
o It uses data and analytics for better insights and to identify best practices that will
enhance health care services and reduce costs.
o Analysts use data mining approaches such as Machine learning, Multi-dimensional
database, Data visualization, Soft computing, and statistics.
o Data mining can be used to forecast the number of patients in each category.

 Data Mining in Market Basket Analysis:

o Market basket analysis is a modeling method based on the hypothesis that
if you buy a specific group of products, then you are more likely to buy another
group of products.
o This technique may enable the retailer to understand the purchase behavior of a buyer.

 Data Mining in Education:

o Educational data mining (EDM) is a newly emerging field, concerned with developing
techniques that explore knowledge from the data generated in educational
environments.
o EDM objectives are recognized as predicting students' future learning behavior,
studying the impact of educational support, and promoting learning science.

 Data Mining in Manufacturing Engineering:

o Knowledge is the best asset possessed by a manufacturing company.


o Data mining tools can be beneficial to find patterns in a complex manufacturing
process.
o Data mining can be used in system-level designing to obtain the relationships
between product architecture, product portfolio, and data needs of the customers.

 Data Mining in CRM (Customer Relationship Management):

Customer Relationship Management (CRM) is all about obtaining and holding
customers, as well as enhancing customer loyalty and implementing customer-oriented
strategies.

 Data Mining in Fraud detection:

o Billions of dollars are lost to fraud.
o Traditional methods of fraud detection are time-consuming and complex.
o Data mining provides meaningful patterns and turns data into information.
o An ideal fraud detection system should protect the data of all the users.

 Data Mining in Lie Detection:

o Apprehending a criminal is not a big deal, but bringing out the truth from him is a
very challenging task.
o Law enforcement may use data mining techniques to investigate offenses, monitor
suspected terrorist communications, etc.
o This technique includes text mining also, and it seeks meaningful patterns in data,
which is usually unstructured text.

 Data Mining in Financial Banking:

o The digitalization of the banking system generates an enormous
amount of data with every new transaction.
o Data mining techniques can help bankers solve business-related problems
in banking and finance by identifying trends, causalities, and correlations in business
information and market costs that are not instantly evident to managers or
executives because the data volume is too large or is generated too quickly for
experts to screen.

 Challenges in Data Mining :-

Although data mining is very powerful, it faces many challenges during its execution.
These challenges can relate to performance, data, methods, techniques, and so on.

 Incomplete and noisy data:

o The process of extracting useful data from large volumes of data is data mining.
o The data in the real-world is heterogeneous, incomplete, and noisy.
o Data in huge quantities will usually be inaccurate or unreliable.
o These problems may occur due to faulty data measuring instruments or because of human
errors.

 Data Distribution:

o Real-world data is usually stored on various platforms in a distributed computing
environment.
o It might be in a database, individual systems, or even on the internet.
o In practice, it is quite difficult to bring all the data into a centralized data
repository, mainly due to organizational and technical concerns.

 Complex Data:
o Real-world data is heterogeneous, and it could be multimedia data, including audio
and video, images, complex data, spatial data, time series, and so on.
o Managing these various types of data and extracting useful information is a tough
task.

 Performance:

o The data mining system's performance relies primarily on the efficiency of the
algorithms and techniques used.
o If the designed algorithms and techniques are not up to the mark, then the efficiency
of the data mining process will be affected adversely.

 Data Privacy and Security:

Data mining usually leads to serious issues in terms of data security, governance, and
privacy.

 Data Visualization:

o In data mining, data visualization is a very important process because it is the
primary method that shows the output to the user in a presentable way.
o The extracted data should convey the exact meaning of what it intends to express.
o But many times, representing the information to the end-user in a precise and easy
way is difficult.

 Advantages of data mining :-

 Data mining helps in gathering information

o Businesses use data mining to analyze the massive data available to them.

o It lets them comprehend what is working for them and what needs to be improved.

o The real-time data analysis allows them to examine various trends including
purchasing patterns based on demographics.
 It predicts future trends

Businesses use data mining to analyze historical and current data. This allows them to
generate a model that helps in predicting future outcomes.

 It aids in making informed judgments

o With data mining, businesses find meaningful relationships among data. This helps them
immensely in making business decisions.

o Also, before mining the data, the purpose should be specified as it makes it easier to
discover hidden relationships.

 Data mining is cost-effective

o When compared to other data-oriented applications, data mining is cost-efficient. Also,
it assists users in predicting future demands.

o That way, the cost of the operation is reduced as businesses can develop accurate
inventory forecasts and accordingly purchase supplies.

 It detects fraud and risks

o In the banking sector, data mining makes predicting loan payments and credit reporting
more manageable.

o It helps in tracking spending habits for the detection of fraudulent transactions.

 Disadvantages of data mining :-

 Data mining is expensive

o Undoubtedly, data mining helps businesses to understand their customers, predict their
behavior, and plan actions accordingly.
o However, the process of mining data is expensive. For efficient mining, the data should
be collected from multiple sources.

 It needs a large database

o For data mining to be effective, it requires a considerable database, because a
small database limits the process of extracting valuable information.

o With a large database, more information is available to analyze trends and make data
mining successful.

 It raises privacy concerns

With data mining, businesses can track their customers' purchasing habits and transactions.
If the data is used solely for improving the customer experience, then customers rarely have a
problem with it.

 Data mining can provide inaccurate information

As the saying goes, you reap what you sow: the information you receive from data mining
depends on the data you analyze. This means that if the dataset is inaccurate or of poor quality,
then the information provided to you may be inaccurate too.

 It can cause security issues

Even though the information is kept anonymous in legitimate data mining, there is still
potential for a data breach, which can leave the business as well as its customers vulnerable.
Critical information about employees and customers can be stolen.
