Fundamentals of Data Science Notes (Module - 1)
MODULE -1
Data mining refers to extracting or mining knowledge from large amounts of data. The term is actually a misnomer; a more appropriate name would have been knowledge mining, which emphasizes mining knowledge from large amounts of data. It is the computational process of discovering patterns in large data sets using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. The key properties of data mining are:
o Automatic discovery of patterns
o Prediction of likely outcomes
o Creation of actionable information
o Focus on large datasets and databases
Or
Data mining is the non-trivial process of identifying valid, novel, potentially useful,
and ultimately understandable patterns in data.
It also provides the capability to predict the outcome of a future observation.
Data mining turns a large collection of data into knowledge.
It attempts to extract hidden patterns and trends from large databases.
It also supports automatic exploration of data.
It is also called exploratory data analysis, data-driven discovery, and deductive learning.
Eg :- The amount a customer will spend at an online store, sales prediction, stock market analysis, etc.
“ Data mining is the process of discovering meaningful, new correlation patterns and trends by
sifting through large amounts of data stored in repositories, using pattern recognition
techniques as well as statistical and mathematical techniques. ”
Or
Data Mining is defined as extracting information from huge sets of data. In other words, we
can say that data mining is the procedure of mining knowledge from data.
The information or knowledge extracted in this way can be used for any of the following applications:
o Market Analysis
o Fraud Detection
o Customer Retention
o Production Control
o Science Exploration
Data Selection :
o The first step in the DM process is to select the relevant data for analysis.
o This involves identifying the data sources and selecting the data that is necessary for
the analysis.
Data Preprocessing :
o The data obtained from different sources may be in different formats and may have
errors and inconsistencies.
o The data preprocessing step involves cleaning and transforming the data to make it
suitable for analysis.
Data Transformation :
o Once the data has been cleaned, it may need to be transformed to make it more
meaningful for analysis.
o This involves converting the data into a form that is suitable for data mining
algorithms.
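The cleaning and transformation steps above can be sketched in code. This is a minimal illustration only: the age values, the mean-fill rule for missing entries, and the min-max scaling are all assumed choices, not part of the notes.

```python
# A minimal sketch of preprocessing: cleaning (filling missing values
# with the mean) and transformation (min-max scaling into [0, 1]).
# The "ages" data is hypothetical.

def clean_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values into [0, 1] so attributes become comparable."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [25, None, 35, 45]          # raw data with a missing entry
cleaned = clean_missing(ages)      # -> [25, 35.0, 35, 45]
scaled = min_max_scale(cleaned)    # -> [0.0, 0.5, 0.5, 1.0]
```

Other fill rules (median, a constant, dropping the record) or scalers (z-score) would serve equally well; the point is that the mining algorithm receives complete, comparable values.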
Data Mining ( Algorithms) :
o The data mining step involves applying various data mining techniques to identify
patterns and relationships in the data.
o This involves selecting the appropriate algorithms and models that are suitable for
the data and the problem being addressed.
Pattern Evaluation :
o After the data mining step, the patterns and relationships identified in the data need
to be evaluated to determine their usefulness.
o This involves examining the patterns to determine whether they are meaningful and
can be used to make predictions or decisions.
Knowledge Representation :
o The patterns and relationships identified in the data need to be represented in a form
that is understandable and useful to the end-user.
o This involves presenting the results in a way that is meaningful and can be used to
make decisions.
1. Knowledge Base: - This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns.
o Such knowledge can include concept hierarchies, used to organize attributes or
attribute values into different levels of abstraction.
o Knowledge such as user beliefs, which can be used to assess a pattern’s
interestingness based on its unexpectedness, may also be included.
o Other examples of domain knowledge are additional interestingness constraints or
thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).
2. Data Mining Engine :- This is essential to the data mining system and ideally consists of
a set of functional modules for tasks such as characterization, association and correlation
analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
3. Pattern Evaluation Module :- This component typically employs interestingness
measures and interacts with the data mining modules so as to focus the search toward
interesting patterns.
o It may use interestingness thresholds to filter out discovered patterns.
o Alternatively, the pattern evaluation module may be integrated with the mining
module, depending on the implementation of the data mining method used.
o For efficient data mining, it is highly recommended to push the evaluation of pattern
interestingness as deep as possible into the mining process, so as to confine the search
to only the interesting patterns.
4. User Interface :- This module communicates between users and the data mining
system, allowing the user to interact with the system by specifying a data mining query or
task, providing information to help focus the search, and performing exploratory
data mining based on the intermediate data mining results.
o In addition, this component allows the user to browse database and data warehouse
schemas or data structures, evaluate mined patterns, and visualize the patterns in
different forms.
1) Predictive Task
2) Descriptive Task
Predictive Task :-
Classification :
o It uses methods like if-then rules, decision trees, or neural networks to predict a class,
that is, to classify a collection of items.
( Eg :- Credit Card )
Regression :
o This technique is used to create a model that can predict the value of a target variable
based on the values of several input variables.
( Eg :- Pension )
Time Series Analysis :
( Eg :- Online )
o It helps to analyse and predict data based on time and historical data.
Descriptive Task :-
Clustering :
o This technique is used to identify groups of data points that share similar
characteristics. Clustering can be used for segmentation, anomaly detection, and
summarization.
( Eg :- Marketing )
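The clustering idea described above can be sketched with a tiny one-dimensional k-means (Lloyd's algorithm). The customer-spend figures and the initial centroids are made-up illustrations, not data from the notes.

```python
# A toy sketch of clustering: one-dimensional k-means. Points are
# repeatedly assigned to their nearest centroid, and each centroid
# moves to the mean of its cluster. Assumes no cluster becomes empty.

def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            # assign each point to the nearest centroid
            i = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        # move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) for c in clusters]
    return centroids, clusters

spend = [10, 12, 11, 95, 100, 98]      # two natural customer segments
centroids, clusters = kmeans_1d(spend, centroids=[10, 100])
print(centroids)                       # low spenders vs high spenders
```

With these well-separated values the algorithm converges immediately, splitting the customers into a low-spend and a high-spend segment.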
Sequence discovery :
o This technique is used to determine sequential patterns in data, i.e., patterns based on
a time sequence of related events.
Summarization :
o This technique is used to represent the data in a visual format that can help
users identify patterns or trends that may not be apparent in the raw data.
Association Rule :
o This technique uncovers relationships among items that tend to occur together.
( Eg :- Retails )
Stages of KDD:-
Selection :- This stage is concerned with selecting or segmenting the data that are relevant
to some criteria.
Preprocessing :- Preprocessing is the data cleaning stage, where unnecessary
information is removed.
Transformation :- Transforming the data so that it is suitable for the task of data mining.
Data Mining :- This stage is concerned with the extraction of patterns from the data.
Interpretation and Evaluation :- The patterns obtained in the data mining stage are
converted into knowledge, which in turn, is used to support decision making.
Data Visualization :- Data visualization helps users to examine large volumes of data
and detect the patterns visually.
DBMS Vs DM :-
DBMS supports query languages, which are useful for query-triggered data
exploration, whereas data mining supports automatic data exploration.
Data known – DBMS , if we know exactly what information we are seeking then
DBMS queries are useful.
Data Unknown – DM, if we only vaguely know the possible correlations or patterns,
then data mining techniques are useful.
One of the tasks of DM is hypothesis testing; this task can be handled by a DBMS query.
Thus, in this sense, a DBMS supports some primitive data mining tasks.
There are three different ways in which a DM system can use a relational DBMS:
May not use it at all :- A majority of data mining systems do not use any DBMS
and have their own memory and storage management.
o They treat the database simply as a data repository from which data is expected to
be downloaded into their own memory structures, before the data mining algorithm
starts.
Loosely coupled :- In this approach, the DBMS is used only for the storage and
retrieval of data.
o For instance, one can use SQL in a loosely-coupled fashion to fetch data records as
required by the mining algorithm.
o The front-end of the application is implemented in a host programming
language, with embedded SQL statements.
o The application uses an SQL SELECT statement to retrieve the set of records of interest
from the database.
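The loosely-coupled approach can be sketched as follows: the DBMS (here SQLite, via Python's standard library) only stores and retrieves the records, and the mining itself (a simple per-item aggregate) runs in application memory. The sales table and its contents are made up for illustration.

```python
# Loosely-coupled DM sketch: SQL fetches the records of interest,
# then the algorithm processes them outside the DBMS.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("milk", 2.5), ("bread", 1.5), ("milk", 3.0)])

# The SELECT statement retrieves the set of records of interest ...
rows = conn.execute("SELECT item, amount FROM sales").fetchall()
conn.close()

# ... and the "mining" runs in the application's own memory structures.
totals = {}
for item, amount in rows:
    totals[item] = totals.get(item, 0.0) + amount
print(totals)    # -> {'milk': 5.5, 'bread': 1.5}
```

In a tightly-coupled system, by contrast, this aggregation would be pushed into the database itself (e.g. as a `GROUP BY` query) so the optimiser can exploit indexes and avoid moving raw records.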
Tightly coupled :-
o In this approach, the portions of the application programs are selectively pushed to
the database system to perform the necessary computation.
o Data are stored in the database and all processing is done at the database end.
o The performance of this approach depends on how well the data mining
process is optimised while mapping it to a query.
o There are two suggested approaches: we can leave the optimization task to the built-
in query optimiser of the DBMS, or we can use an external optimiser.
DM Techniques :-
Association Rules:
This data mining technique helps to discover a link between two or more items. It
finds a hidden pattern in the data set.
Association rules are if-then statements that support to show the probability of
interactions between data items within large data sets in different types of
databases.
Association rule mining has several applications and is commonly used to find sales
correlations in transactional data or in medical data sets.
For example, given the list of grocery items that you have been buying for the last six
months, it calculates the percentage of items being purchased together.
o Confidence:
This measure gives how often item B is purchased when item A is purchased:
confidence(A → B) = support(A and B) / support(A).
o Lift:
This measure compares the confidence of the rule with how often item B is purchased
on its own: lift(A → B) = confidence(A → B) / support(B). A lift greater than 1 means
A and B occur together more often than expected by chance.
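These measures can be sketched directly in code. The four market baskets below are hypothetical; support, confidence, and lift follow their standard definitions.

```python
# A minimal sketch of support, confidence, and lift for a rule A -> B,
# computed over a hypothetical list of market baskets.

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(a, b):
    """How often B appears in the baskets that contain A."""
    return support(a | b) / support(a)

def lift(a, b):
    """Confidence relative to how often B is bought anyway."""
    return confidence(a, b) / support(b)

print(confidence({"bread"}, {"milk"}))   # 2 of the 3 bread baskets have milk
print(lift({"bread"}, {"milk"}))         # < 1: together less than by chance
```

Here confidence(bread → milk) = (2/4)/(3/4) = 2/3, and lift = (2/3)/(3/4) ≈ 0.89, so in this toy data bread slightly suppresses milk purchases rather than promoting them.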
Classification: -
This technique is used to obtain important and relevant information about data.
This data mining technique helps to classify data in different classes.
o Classification of Data mining frameworks as per the type of data sources mined:
• For example, multimedia, spatial data, text data, time-series data, World Wide
Web, and so on.
• This classification is as per the data analysis approach utilized, such as neural
networks, machine learning, genetic algorithms, visualization, statistics, data
warehouse-oriented or database-oriented, etc.
• The classification can also take into account, the level of user interaction involved
in the data mining procedure, such as query-driven systems, autonomous systems,
or interactive exploratory systems
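The classification technique described above can be sketched with a toy 1-nearest-neighbour classifier that assigns a class label to a new record. The (income, debt) training pairs and the "good"/"bad" credit-risk labels are hypothetical; real systems would use decision trees, neural networks, or similar models.

```python
# A toy sketch of classification: 1-nearest-neighbour. The new record
# receives the label of its closest training example.

def predict(train, point):
    """Return the label of the training example closest to `point`."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))   # squared Euclidean
    _, label = min(train, key=lambda row: dist(row[0], point))
    return label

# hypothetical (income, debt) records labelled as credit risk classes
train = [((60, 5), "good"), ((20, 30), "bad"), ((55, 10), "good")]
print(predict(train, (50, 8)))    # closest to (55, 10) -> "good"
```

The same interface generalises to any distance-based classifier: only the `dist` function and the choice of how many neighbours to consult would change.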
Prediction:
o This technique analyses past events or instances to predict a future value, often by
combining other mining techniques such as trends, clustering, or classification.
Outlier detection:
This type of data mining technique relates to the observation of data items in the data
set which do not match an expected pattern or expected behavior.
o This technique may be used in various domains like intrusion detection, fraud
detection, etc.
o It is also known as Outlier Analysis or Outlier Mining. An outlier is a data point that
diverges too much from the rest of the dataset.
o The majority of the real-world datasets have an outlier. Outlier detection plays a
significant role in the data mining field.
o Outlier detection is valuable in numerous fields like network intrusion
identification, credit or debit card fraud detection, detecting outlying values in wireless
sensor network data, etc.
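A simple way to flag points that "diverge too much from the rest of the dataset" is the z-score: values more than a chosen number of standard deviations from the mean are reported as outliers. The transaction amounts and the threshold of 2 are illustrative assumptions.

```python
# A minimal sketch of outlier detection via the z-score.

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` std devs from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) / std > threshold]

# hypothetical card transaction amounts; 120 diverges from the rest
amounts = [12, 14, 13, 15, 11, 13, 14, 120]
print(zscore_outliers(amounts))    # -> [120]
```

The z-score works well for roughly normal data; for skewed data, robust alternatives (e.g. median absolute deviation) are usually preferred, since the outlier itself inflates the mean and standard deviation.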
Regression :
o This technique is used to model the relationship between a target variable and one or
more input variables, typically to predict a continuous value.
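As an illustration of the regression technique, the sketch below fits a straight line y = m·x + c by ordinary least squares. The (advertising spend, sales) pairs are made up and chosen to lie exactly on a line.

```python
# A minimal sketch of regression: ordinary least-squares line fit.

def fit_line(xs, ys):
    """Return slope m and intercept c minimising squared error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    c = my - m * mx
    return m, c

xs = [1, 2, 3, 4]            # hypothetical advertising spend
ys = [3, 5, 7, 9]            # hypothetical sales: exactly y = 2x + 1
m, c = fit_line(xs, ys)      # -> (2.0, 1.0)
```

The fitted model then predicts the target for unseen inputs, e.g. `m * 5 + c` gives 11.0 for a spend of 5.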
Clustering :
o This technique divides the data into groups of similar objects without using
predefined class labels.
Genetic algorithms:-
o Genetic algorithms are adaptive heuristic search algorithms that belong to the
larger class of evolutionary algorithms.
o Genetic algorithms are based on the ideas of natural selection and genetics.
o These are intelligent exploitation of random search provided with historical data
to direct the search into the region of better performance in solution space.
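The selection-crossover-mutation loop described above can be sketched on the classic OneMax problem (maximise the number of 1-bits in a bit string). All parameters (population size, mutation rate, number of generations) are illustrative choices.

```python
# A toy genetic algorithm: selection keeps the fitter half of the
# population, single-point crossover mixes two parents, and mutation
# flips random bits to keep exploring the solution space.
import random

random.seed(0)                                 # reproducible run
LENGTH, POP, GENS = 12, 20, 60

def fitness(bits):
    return sum(bits)                           # count of 1-bits

def crossover(a, b):
    cut = random.randrange(1, LENGTH)          # single-point crossover
    return a[:cut] + b[cut:]

def mutate(bits, rate=0.05):
    return [1 - v if random.random() < rate else v for v in bits]

population = [[random.randint(0, 1) for _ in range(LENGTH)]
              for _ in range(POP)]
for _ in range(GENS):
    population.sort(key=fitness, reverse=True)
    parents = population[:POP // 2]            # "natural selection"
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

best = max(population, key=fitness)
print(fitness(best))    # close to the optimum of 12
```

Because the fitter half survives unchanged (elitism), the best fitness never decreases, and the historical information in the surviving parents directs the search toward the better-performing region of the solution space.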
Data mining is primarily used by organizations with intense consumer demands, such as
retail, communication, financial, and marketing companies. It helps them determine prices,
consumer preferences, and product positioning, and assess the impact on sales, customer
satisfaction, and corporate profits.
o Data mining in healthcare has excellent potential to improve the health system.
o It uses data and analytics for better insights and to identify best practices that will
enhance health care services and reduce costs.
o Analysts use data mining approaches such as Machine learning, Multi-dimensional
database, Data visualization, Soft computing, and statistics.
o Data mining can be used to forecast the number of patients in each category.
o Apprehending a criminal is not a big deal, but bringing out the truth from him is a
very challenging task.
o Law enforcement may use data mining techniques to investigate offenses, monitor
suspected terrorist communications, etc.
o This technique includes text mining also, and it seeks meaningful patterns in data,
which is usually unstructured text.
Although data mining is very powerful, it faces many challenges during its execution.
Various challenges could be related to performance, data, methods, and techniques, etc.
Noisy and Incomplete Data:
o The process of extracting useful information from large volumes of data is data mining.
o The data in the real world is heterogeneous, incomplete, and noisy.
o Data in huge quantities will usually be inaccurate or unreliable.
o These problems may occur due to errors of the measuring instruments or because of
human errors.
Data Distribution:
o Real-world data is usually stored on different platforms in distributed computing
environments, such as databases, individual systems, or the Internet. Bringing all of this
data into a centralized repository for mining is practically very difficult.
Complex Data:
o Real-world data is heterogeneous, and it could be multimedia data, including audio
and video, images, complex data, spatial data, time series, and so on.
o Managing these various types of data and extracting useful information is a tough
task.
Performance:
o The performance of a data mining system depends primarily on the efficiency and
scalability of the algorithms and techniques used.
Data Privacy and Security:
o Data mining usually leads to serious issues in terms of data security, governance, and
privacy.
Data Visualization:
o Data visualization is the primary method for presenting the mining output to the user;
without effective visualization, the extracted patterns are hard to understand and act on.
Advantages of Data Mining :-
o Businesses use data mining to analyze the massive data available to them.
o It lets them comprehend what is working for them and what needs to be improved.
o The real-time data analysis allows them to examine various trends, including
purchasing patterns based on demographics.
It predicts future trends
Businesses use data mining to analyze historical and current data. This allows them to
generate a model that helps in predicting future outcomes.
o With data mining, businesses find meaningful relationships among data. This helps them
immensely in making business decisions.
o Also, before mining the data, the purpose should be specified as it makes it easier to
discover hidden relationships.
o That way, the cost of the operation is reduced as businesses can develop accurate
inventory forecasts and accordingly purchase supplies.
o In the banking sector, data mining makes predicting loan payments and credit reporting
more manageable.
o Undoubtedly data mining helps businesses to understand their customer, predict their
behavior, and accordingly plan actions.
Disadvantages of Data Mining :-
o The process of mining data is expensive. For efficient mining, the data should
be collected from multiple sources.
o For data mining to be effective, it requires a considerable database. The reason is the
small database limits the process of extracting valuable information.
o With a large database, more information is available to analyze trends and make data
mining successful.
With data mining, businesses can track their customers' purchasing habits and transactions.
If the data is used solely for improving the customer experience, then this rarely causes a
problem.
Just like the saying "you reap what you sow", the information you receive from data mining
depends on the data you analyze. This means that if the dataset is inaccurate or of poor
quality, then the information provided to you may be inaccurate too.
Even though with legitimate data mining, the information is kept anonymous, there is still
potential for a data breach, which can leave the business as well as customers vulnerable.
The critical information of the employees and customers can be hacked.