Data Mining Notes

Motivation or Importance of Data Mining
Data mining is the field in which large quantities of data are gathered and analyzed to extract valuable, structured information. As time goes on, the need for it has grown. Everyone wants succinct and accurate knowledge, and although extracting it is not an easy job, a well-defined set of processes and technologies makes it possible.
Major Sources of Abundant Data
 Business – Web, E-commerce, Transactions, Stocks
 Science – Remote Sensing, Bioinformatics, Scientific Simulation
 Society and Everyone – News, Digital Cameras, YouTube
 Industry – Ratings, preferences, and likes of individuals and groups
Data Mining Motivation
The following areas, in which data mining is used extensively, demonstrate the motivation behind data mining:
1. Market Analysis
Data mining and market analysis are among the best ways to get a more holistic view of your clients. We can learn more about customer tastes by looking at purchase histories and collecting demographics such as gender, location, and other profile information. With this mining research we can then deliver more customized customer experiences, update the marketing strategy, maintain a rigorous analysis process, and pitch goods to which customers are more likely to react well.
For example, email marketers use data mining to provide users with more personalized content. With the aid of a CRM or another big data collection tool, they can learn things like gender, location, weather conditions, and more. Email marketers can then use this information to segment lists and send more specific content.
Adidas does this by gathering gender information about its clients. It then segments its email lists and data sets so that the new men's apparel collection is promoted to men and the new women's apparel collection to women.
2. Fraud Detection
"Usage of one's career for personal reasons enrichment by the malicious misuse or execution of the wealth
or properties of the recruiting company" in technological systems have dishonest processes, This has
happened in many aspects of everyday life, such as Network Telecommunications, Mobile Communications,
E-commerce and internet banking. Detection of fraud includes detecting fraud as rapidly as Once it is
perpetrated, as possible.
Methods for identifying theft are increasingly being built to protect offenders by responding to their tactics.
New strategies for detecting fraud are being developed. More complicated owing to the extreme constraint
of the exchange of views in the identification of fraud now, fraud a variety of approaches have been
introduced to detect data processing, statistics, and artificial intelligence, for instance. Fraud is uncovered
from data and trend irregularities
Type of Fraud - The types of frauds maybe credit card frauds, telecommunication frauds, and computer
intrusion.
3. Customer Retention
Customer retention refers to the ability of a business or product to keep its customers for a given period. High customer retention means that buyers of the product or company prefer to return, continue to shop, and do not defect to another product or company or stop using it altogether.
4. Production Control
Production control is a rich source of potential applications for data mining. The collection and cleaning of data are reasonably simple, organizations own their input records, and there are virtually no regulatory or privacy challenges. Since companies have a long history of setting up operating procedures to optimize production processes, cost justification and return-on-investment forecasts are simple to do.
5. Scientific Exploration
Data discovery is a method close to initial data analysis, whereby a data scientist uses visual exploration rather than conventional data processing systems to describe what is in a dataset and the characteristics of the data.
Such characteristics can include the size or quantity of the data, its completeness and consistency, and potential relationships between data elements or data files/tables. Usually, data exploration is done using a mixture of automatic and manual operations.
To give the analyst an initial view of the data and an interpretation of its main aspects, automated tasks may include data profiling, data visualization, or tabular reports.

Data Mining: Introduction, Advantages, Disadvantages, and Applications
Introduction to Data Mining
In today's world, the amount of data is increasing exponentially, whether it is biomedical data, security data, or online shopping data. Many industries preserve this data in order to analyze it, so that they can serve their customers more effectively through the information they extract from the large stores of preserved data. This extraction, or digging out, of information from huge data sets obtained from different sources and industries is known as Data Mining.
What is Knowledge Discovery?
Knowledge discovery is the overall process of extracting knowledge from the
huge data sets. It involves the following steps:
 Data Cleaning − noise and inconsistent data are removed.
 Data Integration − multiple data sources are combined.
 Data Selection − only the relevant data is selected from the database.
 Data Transformation − data is consolidated into appropriate forms for
mining by performing summary or aggregation operations.
 Data Mining − this is an intelligent step in which various methods are applied
to extract data patterns.
 Pattern Evaluation − data patterns, which can be in different forms like
trees, associations, clusters, etc. are evaluated.
 Knowledge Presentation − this step finally provides knowledge.
Advantages of Data Mining
The following are the advantages of data mining:
 Easy analysis of huge amounts of data in one go
 Profitable decision-making process
 Prediction of trends
 Knowledge-based information
 Profitable production
 Discovery of hidden patterns
 Cost effective, time efficient, and effective prediction
Disadvantages of Data Mining
The following are the disadvantages of data mining:
 Companies can sell the concise information they obtain to other companies for money; for example, American Express has sold information about its customers' credit card purchases to other companies.
 Data mining requires advanced training and prior knowledge of the tools and software used.
 Various data mining tools work in different ways due to the different algorithms employed in their design. Therefore, selecting the correct data mining tool is a very tough task.
 Sometimes predictions can go wrong and can play havoc with a company that takes a decision based on that prediction.
Applications of Data Mining
1. Communications
Data mining techniques are used in the communication sector to predict
customer behavior to offer highly targeted and relevant campaigns.
2. Insurance
Data mining helps insurance companies to price their products profitably and promote new offers to their new or existing customers.
3. Education
Data mining benefits educators to access student data, predict achievement
levels and find students or groups of students who need extra attention. For
example, students who are weak in a science subject.
4. Manufacturing
With the help of data mining, manufacturers can predict wear and tear of production assets. They can anticipate maintenance, which helps them minimize downtime.
5. Banking
Data mining helps the finance sector to get a view of market risks and manage
regulatory compliance. It helps banks to identify probable defaulters to decide
whether to issue credit cards, loans, etc.
6. Retail
Data mining techniques help retail malls and grocery stores identify their best-selling items and arrange them in the most prominent positions. They also help store owners come up with offers that encourage customers to increase their spending.
7. Service Providers
Service providers such as mobile phone and utility companies use data mining to predict the reasons why a customer might leave. They analyze billing details, customer service interactions, and complaints made to the company to assign each customer a probability score and offer incentives.
8. E-Commerce
E-commerce websites use Data Mining to offer cross-sells and up-sells through
their websites. One of the most famous names is Amazon, which uses Data
mining techniques to get more customers into their eCommerce store.
Data Types in Data Mining
Overview
Data mining is the process of extracting potentially valuable patterns from large data sets. It is a multidisciplinary skill that uses machine learning, statistics, and AI to extract knowledge and predict the likelihood of future events. Data mining insights are used for business purposes, fraud detection, scientific exploration, etc.
Data mining is the process of automatically scanning vast data stores to find
patterns and developments that go beyond basic research. Data mining uses
advanced statistical algorithms to slice data and calculate the possibility of
future events. Data mining is often referred to as Knowledge Discovery in
Databases (KDD).
In computer science, data mining is also known as knowledge discovery from databases. It is a method of finding interesting and useful patterns and relationships in large data sets. To analyze massive data sets, the field combines computational and artificial intelligence tools (such as neural networks and machine learning) with database management. Data mining is commonly used in business (insurance, banking, retail), scientific research (astronomy, medicine), and government security (detection of criminals and terrorists).
Data Mining Data Types (Types of Sources of Data)
The following are the data types (types of sources of data) in data mining:
1. Relational Databases
A relational database is a set of records linked together using a set of pre-defined constraints. These records are arranged in tables of columns and rows. Tables are used to store data about the items to be described in the database.
A relational database is characterized as a set of data arranged in the rows and columns of database tables. In relational databases, the database structure can be defined using a physical and a logical schema. The logical schema describes the tables and how they are linked with one another, while the physical schema describes how the data is actually stored. The relational database's standard API is SQL. Its applications include data processing, ROLAP models, etc.
2. Data Warehouses
A data warehouse is built by combining data from several heterogeneous sources into a single repository that supports analytical reporting, standardized and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data storage. To support historical analysis, a data warehouse typically preserves several months or years of data. The data in a data warehouse is usually loaded from multiple data sources by an extraction, transformation, and loading (ETL) process. Modern data warehouses shift towards an extract, load, transform (ELT) architecture in which all or much of the data transformation is carried out on the database that hosts the data warehouse. It is important to remember that a very significant part of a data warehouse design initiative is describing the ETL (Extraction, Transformation, and Loading) method; ETL activities are the backbone of the data warehouse.
3. Transactional Databases
To explain what a transactional database is, let's first see what a transaction entails. A transaction is, in technical terms, a sequence of actions that are treated as a single unit of work. A transaction is said to be complete only if all the activities that are part of it finish successfully. If even one action fails, the whole transaction is considered to have failed, and all the actions need to be rolled back or undone.
There is a given starting point for any database transaction, followed by steps
to change the data inside the database. In the end, before the transaction can
be tried again, the database either commits the changes to make them
permanent or rolls back the changes to the starting point.
Example - Consider a bank transaction. A bank transaction is said to be correct only when the amount debited from one account is successfully credited to another account. If the amount is withdrawn but not received by the recipient, it is appropriate to roll back the whole transaction to the original point.
4. Database Management System
A DBMS is an application for database creation and management. It offers a structured way for users to create, retrieve, update, and manage data. A person who uses a DBMS to communicate with the database need not be concerned about how and where the data is stored; the DBMS takes care of it.
A DBMS stores data in a structured manner and records information that has some significance. As an example, if we have to create a student database, we have to add certain attributes such as student ID, student name, student address, student mobile number, and student email, and all students' records share the same record type. The DBMS provides the end user with a reliable platform for working with this data.
5. Advanced Database System
Specialized database management systems target a new range of databases such as NoSQL and NewSQL. New developments in data storage, driven by application demands such as support for predictive analytics, research, and data processing, are also supported by advanced database management systems. Advanced data management has always been at the center of effective database and information systems. It covers a wealth of different data models and surveys the foundations of structuring, sorting, storing, and querying data according to these models.
Data Mining Tasks – Overview
Overview
Data mining functionalities describe the various kinds of patterns that can be identified in data mining activities; they are used to define the type of patterns to be discovered. Data mining is widely applied for forecasting and characterizing data in big data.
Data Mining Tasks Categories
Data mining tasks are majorly categorized into two categories: descriptive
and predictive.
1. Descriptive Data Mining
Descriptive data mining offers a detailed description of the data; it gives insight into what is going on inside the data without any prior assumptions and demonstrates the common characteristics of the results.
2. Predictive Data Mining
Predictive data mining allows users to infer features that are not directly available, for example, projecting the market analysis for the next quarter from the output of the previous quarters. In general, predictive analysis forecasts or infers features from the data already available, for instance judging the likely outcome for a patient from the results of their medical records.
Key Data Mining Tasks
1. Characterization and Discrimination
 Data Characterization: Data characterization is a description of the general characteristics of objects in a target class, producing what are called characteristic rules.
A database query usually computes the data applicable to a user-specified class and runs it through a description module to retrieve the meaning of the data at various abstraction levels.
E.g., bar charts, curves, and pie charts.
 Data Discrimination: Data discrimination produces a set of rules called discriminant rules, which are essentially a comparison of the general characteristics of objects in the target class against those of objects in a contrasting class.
2. Prediction
Prediction uses regression analysis to estimate inaccessible data and detect missing numeric values in the data. If the class label is absent, classification is used to make the prediction. Prediction is common because of its relevance in business intelligence. There are two ways of predicting data: predicting the class label using a previously built class model, and predicting missing or incomplete data using regression analysis.
3. Classification
Classification builds a model from data with predefined classes, and the model is then used to classify new instances whose class is not known. The instances used to produce the model are known as training data. Such a classification process can produce a decision tree or a set of classification rules that can be used to classify future data, for example estimating the likely compensation of an employee based on the classification of salaries of related employees in the company.
4. Association Analysis
Association analysis discovers the links between data items and the rules that bind them; two or more data attributes are associated. It associates attributes that are frequently transacted together, producing what are called association rules, which are commonly used in market basket analysis. Two measures are used to link the attributes: confidence, which indicates the probability of the associated items occurring together, and support, which indicates the past frequency of the association.
5. Outlier Analysis
Data components that cannot be grouped into a given class or cluster are outliers. They are often referred to as anomalies or surprises, and they are important to keep in mind.
Although in some contexts outliers can be treated as noise and discarded, in other areas they can disclose useful information, so their study can be very important and beneficial.
6. Cluster Analysis
Clustering is the arrangement of data into groups. Unlike classification, however, class labels are unknown in clustering, and it is up to the clustering algorithm to discover suitable classes. Clustering is often called unsupervised classification because the grouping is not guided by provided class labels. Most clustering methods are based on the principle of maximizing the similarity between objects of the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity).
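As a small illustration of this idea, the sketch below groups unlabeled two-dimensional points with k-means using scikit-learn; the points and the choice of two clusters are assumptions made purely for the example:

# Minimal sketch: k-means clustering of unlabeled points, which tries to
# maximize intra-cluster similarity and minimize inter-cluster similarity.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                   [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])   # invented data

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_)           # cluster assignment of each point, e.g. [0 0 0 1 1 1]
print(model.cluster_centers_)  # the two learned cluster centroids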
7. Evolution & Deviation Analysis
With evolution and deviation analysis we can uncover patterns and shifts in behaviour over time, finding features such as time-series trends, periodicity, and similarities in patterns. These techniques are applied across many domains, from space science to retail marketing.
Data Mining Functionalities
Overview
The following functionalities describe the various kinds of patterns that can be identified in data mining activities; they are used to define and find the type of patterns to be discovered. Data mining is widely applied for forecasting and characterizing big data.
Data mining tasks can be grouped into two main categories -
 Descriptive data mining:
Descriptive data mining demonstrates the common characteristics of the results. It offers knowledge of the data and gives insight into what is going on inside the data without any prior assumptions; the popular characteristics of the data are revealed in the data collection.
 Predictive Data Mining:
Predictive data mining provides prediction features from data to its users. For example: what will the projection of the market analysis be in the next quarter, given the output of the previous quarters? In general, predictive analysis forecasts or infers features based on the data previously available.
For example: judging the likely outcome for a patient from the results of their medical exams.
Data Mining Functionalities
1. Class/Concept Description: Characterization and
Discrimination
It is important to link data with groups or related items. For example, computers and printers are types of goods for sale in a hardware shop.
 Data Characterization: Data characterization is a description of the key characteristics of objects in a target class, producing what is called a characteristic rule. To do this, a user can run a database query that computes the user-specified class through predefined modules to retrieve the desired results from the data at various abstraction levels.
E.g., bar charts, curves, and pie charts.
 Data Discrimination: Data discrimination produces a set of rules called discriminant rules, which are essentially a comparison of the general characteristics of objects in the target class against those of objects in a contrasting class.
2. Prediction
To estimate inaccessible data items, prediction uses regression analysis and detects missing numeric values in the data. If the class label is absent, classification is used to make the prediction. Prediction is common because of its relevance in business intelligence. There are two main methods of data prediction: predicting the class label using the developed class model, and predicting incomplete data using regression analysis.
3. Classification
Classification builds a model from data sets with predefined classes, and the model is then used to classify new instances whose class is not known. The instances used to produce the model are known as training data. For example: classifying the employees of a company on the basis of their salaries.
4. Association Analysis
Association analysis discovers the links between data items and the rules that bind them; two or more data attributes are associated. It associates attributes that are frequently transacted together, producing what are called association rules, which are commonly used in market basket analysis.
5. Outlier Analysis
Data components that cannot be grouped into a given class or cluster are outliers. They are often referred to as data anomalies. In some instances outliers are treated as unwanted noise that needs to be discarded from the data set, but in other areas they may disclose useful information and can therefore be an important and beneficial aspect of data analysis.
6. Cluster Analysis
Clustering is a process of arranging the data in data sets into groups based on similar features. Clustering is often called unsupervised classification because the grouping is not guided by provided class labels. The most relevant clustering methods are based on the concept of maximizing the similarity between objects of the same class and minimizing the similarity between objects of different classes.
7. Evolution & Deviation Analysis
Evolution analysis groups data over time. With this approach, a user can uncover hidden patterns in data sets, and the analysis helps users find features using similarities in patterns.

Data Exploration in Data Mining


1. What is Data Exploration?
Data exploration is the process of gathering information relevant to a target object or field and examining its characteristics. These characteristics include the size or quantity of the data, its completeness and correctness, and possible relationships amongst data elements or files/tables within the data.
Data exploration is usually conducted using a combination of automatic and manual activities. Automatic activities can include data profiling, data visualization, or tabular reports that offer the analyst an initial view of the data and an understanding of its key characteristics. Usually this is followed by manual drill-down or filtering of the data to spot anomalies or patterns identified through the automatic actions.
Data exploration can also require manual scripting and queries against the data (e.g. using languages like SQL or R), or the use of spreadsheets or similar tools to look at the data. All of these activities are aimed at building a mental model and understanding of the data in the mind of the analyst, and at capturing basic information (statistics, structure, relationships) about the data set that can be used in further analysis. Once this initial understanding of the data is reached, the data is pruned or refined by removing unusable parts (data cleansing), correcting poorly formatted elements, and defining relevant relationships across datasets. This process is also referred to as determining data quality.
2. Statistical Description of Data
Statistics plays an important role in every field. It helps in collecting data and, along with that, in analyzing data using statistical techniques. Statistics is all about the collection of data, with the goal of maintaining the data for the benefit of everyone in the field; based on various calculations, several predictions can be made. The main statistical methods include the following.
2.1. Measure of Central Tendency
In statistics, a central tendency may be referred to as the middle or location of a distribution. Measures of central tendency are often called averages. The most common measures of central tendency are,
a. The arithmetic mean: the sum of all numerical values divided by the total
number of numerical values.
b. Median: It refers to the midpoint of data after arranging the data in ascending
order.
c. Mode: It refers to the most frequently occurring number in the data.
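A minimal sketch of these three measures using Python's standard statistics module, on an illustrative list of prices (the same values reused later in the binning example):

# Minimal sketch: the three common measures of central tendency.
import statistics

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

print(statistics.mean(data))    # arithmetic mean: sum / count -> 20.33...
print(statistics.median(data))  # midpoint of the sorted data -> 22.5
print(statistics.mode(data))    # most frequent value -> 21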
2.2. Measure of Dispersion
In statistics, dispersion is related to variability, scatter, and spread; it is the extent to which a distribution is stretched or squeezed. It describes how much the data values vary from each other and gives a clear picture of the distribution of the data. The measure of dispersion shows the homogeneity or heterogeneity of the distribution of the observations. Common measures of statistical dispersion are,
a. Range: It refers to the difference between the highest and the lowest value.
b. Variance: It refers to the sum of the square of deviations from the sample
mean which is divided by one less than the sample size.
c. Standard Deviation: It refers to the square root of the variance.
d. Interquartile Range: The IQR is a measure of variability based on dividing the data set into quartiles. Quartiles divide a rank-ordered data set into four equal parts. The values that separate the parts are known as the first, second, and third quartiles, denoted by Q1, Q2, and Q3.
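A minimal sketch of these dispersion measures with NumPy, on the same illustrative values:

# Minimal sketch: range, sample variance, standard deviation, and IQR.
import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

print(data.max() - data.min())      # range: highest minus lowest value
print(data.var(ddof=1))             # sample variance (divides by n - 1)
print(data.std(ddof=1))             # standard deviation: square root of variance
q1, q3 = np.percentile(data, [25, 75])
print(q3 - q1)                      # interquartile range: Q3 - Q1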
2.3. Measure of Skewness and Kurtosis
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A data set is symmetric if it looks the same to the left and right of the center point.
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails or outliers, while data sets with low kurtosis tend to have light tails or a lack of outliers. A uniform distribution would be an extreme case.
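A small sketch with SciPy showing how these two measures behave on roughly symmetric versus right-skewed data (the generated samples are purely illustrative):

# Minimal sketch: skewness and (excess) kurtosis with scipy.stats.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
symmetric = rng.normal(size=10_000)           # roughly symmetric, light tails
right_skewed = rng.exponential(size=10_000)   # asymmetric, heavier right tail

print(skew(symmetric), kurtosis(symmetric))        # both close to 0
print(skew(right_skewed), kurtosis(right_skewed))  # both clearly positive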
3. Concept of Data Visualization
Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data. Visualization is an increasingly key tool for making sense of the trillions of rows of data generated every day.
Data visualization helps to tell stories by curating information into a form that is easier to understand, highlighting the trends and outliers. A good visualization tells a story, removing the noise from data and highlighting the useful information. In the world of big data, data visualization tools and technologies are essential for investigating huge amounts of data and making data-driven decisions.
4. Various Technique of Data Visualization
4.1 Common general types of data visualization
 Charts
 Tables
 Graphs
 Maps
 Infographics
 Dashboards
4.2. More specific examples of methods to visualize data
 Area Chart
 Bar Chart
 Box-and-whisker Plots
 Bubble Cloud
 Bullet Graph
 Cartogram
 Circle View
 Dot Distribution Map
 Gantt Chart
 Heat Map
 Highlight Table
 Histogram
 Matrix
 Network
 Polar Area
 Radial Tree
 Scatter Plot (2D or 3D)
 Streamgraph
 Text Tables
 Timeline
 Treemap
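As a simple illustration of two of the chart types listed above, the sketch below draws a histogram and a scatter plot with Matplotlib on invented data:

# Minimal sketch: a histogram and a scatter plot on made-up data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
prices = rng.normal(loc=25, scale=8, size=200)    # one attribute
x = rng.uniform(0, 10, size=100)
y = 2 * x + rng.normal(scale=2, size=100)         # two related attributes

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.hist(prices, bins=20)        # distribution of a single attribute
ax1.set_title("Histogram")
ax2.scatter(x, y, s=12)          # relationship between two attributes
ax2.set_title("Scatter plot")
plt.tight_layout()
plt.show()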

OLAP: What It Is, Applications, Types, Advantages, and Disadvantages
What is Online Analytical Processing or OLAP?
OLAP is an advanced technology for data analysis used in business intelligence. It can perform complex analytics efficiently and also allows users to extract data for analysis. OLAP software packages have the functionality to examine data from different information systems at the same time. Analysts can aggregate the data to find desired results; because OLAP data can be pre-calculated and pre-aggregated, data discovery, report viewing, and complex analytics become faster.
Online Analytical Processing (OLAP) Applications
The following are the applications of an OLAP:
 Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting, and similar areas, with new applications arising, such as agriculture.
 The term OLAP was created as a slight modification of the traditional database term OLTP (Online Transaction Processing).
 Two leading OLAP products are Hyperion Solutions' Essbase and Oracle's Express Server.
 OLAP products are usually designed for multiple-user environments, with the price of the software based on the number of users.
Online Analytical Processing (OLAP) Types
OLAP is mainly categorized into three types, which are as follows -
1. Relational OLAP (ROLAP): Star Schema based
ROLAP is a type of data processing model specifically designed to work with large data volumes based on a multidimensional data model, while following relational database principles. Hence, data is stored in relational tables and accessed from the data warehouse using complex SQL queries.
2. Multidimensional OLAP (MOLAP): Cube based
MOLAP is also a type of online analytical processing. It is widely used in businesses to analyze data quickly and efficiently. This analysis helps business organisations find trends and insights, and the analytical results give direction for sound business decisions.
3. Hybrid OLAP (HOLAP)
HOLAP is a combination of ROLAP and MOLAP. A HOLAP server permits storing large volumes of detailed information. Generally, HOLAP leverages cube technology for fast performance on summary-type information.
OLAP Other Common Types
Some other common types of OLAP's are as follows –
1. Web OLAP (WOLAP)
As its name implies, WOLAP is an online analytical processing model based on
web browsers. Its architecture is based on client, middleware, and database
server.
A web-based application requires no deployment on the client machine; all that is needed is a web browser and a network connection to the intranet or Internet.
2. Desktop OLAP (DOLAP)
DOLAP stands for desktop online analytical processing. In DOLAP, a user can download the data from the source and work with it on their desktop.
In DOLAP technology, cubes are used to store the data. Users feel as if they have their own spreadsheet and a system dedicated to them only. Hence, they do not need to be concerned about performance impacts on the server, because the data is stored locally.
3. Mobile OLAP (MOLAP)
As its name implies, it’s a type of online analytical processing which works for
mobile devices. Now days, users are able to access, upload, and process their
data using mobile devices.
4. Spatial OLAP (SOLAP)
SOLAP stands for Spatial Online Analytical Processing. It is a method that gives users the ability to analyse geographical data stored in a multidimensional data warehouse. SOLAP technology enables its users to examine spatial patterns and trends in data more effectively.
Advantages of OLAP
 OLAP technology supports businesses by designing, reporting, and analysing
business data.
 Fast processing of large size data.
 Quickly analyze "What if" situations.
 Multidimensional data is represented easily.
 OLAP provides the building blocks for business modelling.
 It allows users to slice and dice cube data using different dimensions, measures, and filters.
 It is convenient to end users.
 No specific skills are required for business people to use the application.
 It is user-friendly and scalable. It suits all users, from small and medium to large businesses.
Disadvantages of OLAP
 It requires data to be organized into a star or snowflake schema.
 You cannot have a large number of dimensions in a single OLAP cube.
 It is not best suited for transactional data.
 An OLAP cube needs a full update of the cube, which can be a lengthy process.
OLAP Cube and Operations
Introduction
1. OLAP Cube
An OLAP cube is a multi-dimensional array of data. Online analytical processing (OLAP) is a computer-based technique of analyzing data to look for insights. The term cube here refers to a multi-dimensional dataset, which is also known as a hypercube if the number of dimensions is greater than three. A cube can be thought of as a multi-dimensional generalization of a two- or three-dimensional spreadsheet. For instance, a company might like to summarize financial data by product, by time period, and by city to compare actual and budget expenses. Product, time, city, and scenario (actual and budget) are the data's dimensions, as shown in the figure. A cube is not a "cube" in the strict mathematical sense, as all the sides are not necessarily equal.

Fig 1: OLAP Cube


2. OLAP Cube Operations
Since OLAP servers work on a multidimensional view of data, we will discuss OLAP operations on multidimensional data.
Here is the list of OLAP operations,
 Roll-up
 Drill-down
 Slice
 Dice
 Pivot (rotate)
2.1. ROLL UP
Roll-up is performed by climbing up a concept hierarchy for the dimension location. Initially the concept hierarchy was "street < city < province < country". On rolling up, the data is aggregated by ascending the location hierarchy from the level of street to the level of country. A roll-up involves summarizing the data along a dimension. The summarization rule may be an aggregate function, such as computing totals along a hierarchy, or applying a set of formulas such as "profit = sales - expenses". General aggregation functions can be expensive to compute when rolling up: if they cannot be determined from the cells of the cube, they must be computed from the base data, either by computing them online (slow) or by precomputing them for possible rollups (large space). Aggregation functions that can be determined from the cells are known as decomposable aggregation functions and allow efficient computation.
For example, COUNT, MAX, MIN, and SUM are easy to support in OLAP, since these can be computed for each cell of the OLAP cube and then rolled up: an overall sum (or count, etc.) is the sum of sub-sums. It is difficult to support MEDIAN, as that must be computed for every view separately: the median of a set is not the median of the medians of subsets.
Let us understand it with a diagrammatic flow,
Fig 2.1 : Roll up Operation
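The sketch below mimics a roll-up on a small, invented fact table using pandas; this is only a relational analogy to the cube operation, not an OLAP engine, and the column names and figures are assumptions for illustration:

# Minimal sketch: rolling sales up from the city level to the country level.
import pandas as pd

sales = pd.DataFrame({
    "country": ["Canada", "Canada", "Canada", "USA", "USA"],
    "city":    ["Toronto", "Toronto", "Vancouver", "Chicago", "New York"],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q1"],
    "amount":  [605, 680, 825, 440, 1087],
})

city_level    = sales.groupby(["country", "city"])["amount"].sum()
country_level = sales.groupby("country")["amount"].sum()   # rolled up one level
print(country_level)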
2.2. DRILL DOWN
Drill-down is performed by stepping down a concept hierarchy for the dimension time. Initially the concept hierarchy was "day < month < quarter < year." On drilling down, the time dimension descends from the level of quarter to the level of month. When drill-down is performed, one or more dimensions of the data cube are added. It navigates from less detailed data to highly detailed data.

Fig 2.2 : Drill Down Operation


2.3. OLAP SLICING
The slice operation selects one particular dimension from a given cube and produces a new sub-cube.
A slice is a rectangular subset of a cube obtained by selecting a single value for one of its dimensions, creating a new cube with one fewer dimension.
The picture shows a slicing operation.
Slicing is performed for the dimension "time" using the criterion time = "Q1", and the sub-cube is formed by selecting that single dimension value.
Fig 2.3 : OLAP Slicing
2.4. OLAP DICING
Dice selects two or more dimensions from a given cube and produces a new sub-cube. Consider the following diagram, which shows the dice operation. The dice operation produces a sub-cube by allowing the analyst to pick specific values of multiple dimensions. The dicing operation on the cube shown involves three dimensions and is based on the following selection criteria,
 (location = "Toronto" or "Vancouver")
 (time = "Q1" or "Q2")
 (item =" Mobile" or "Modem")

Fig 2.4 : Cube Dicing
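Expressed against a flat fact table (an analogy only; the table and values below are invented), slice and dice correspond to simple pandas filters:

# Minimal sketch: slice (one dimension fixed) and dice (several dimensions).
import pandas as pd

cube = pd.DataFrame({
    "location": ["Toronto", "Vancouver", "Toronto", "Vancouver", "Chicago"],
    "time":     ["Q1", "Q1", "Q2", "Q3", "Q1"],
    "item":     ["Mobile", "Modem", "Mobile", "Phone", "Modem"],
    "sales":    [605, 825, 680, 512, 440],
})

slice_q1 = cube[cube["time"] == "Q1"]                         # slice: time = "Q1"

dice = cube[cube["location"].isin(["Toronto", "Vancouver"])   # dice: chosen values
            & cube["time"].isin(["Q1", "Q2"])                 # on three dimensions
            & cube["item"].isin(["Mobile", "Modem"])]
print(slice_q1)
print(dice)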


2.5. OLAP PIVOT
The pivot operation is also referred to as rotation. It rotates the data axes in view to provide an alternative presentation of the data. Consider the following diagram, which shows the pivot operation. Pivot allows an analyst to rotate the cube in space to see its various faces. For example, cities could be arranged vertically and products horizontally while viewing data for a particular quarter. Pivoting could then replace products with time periods to see the data across time for a single product. The picture shows a pivoting operation: the full cube is turned, giving another perspective on the data.

Fig 2.5 : Olap Pivot
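The pivot can likewise be imitated on the same kind of invented fact table with a pandas pivot table; swapping the index and columns arguments gives the rotated view:

# Minimal sketch: pivoting Q1 sales so cities run down the rows and
# items across the columns (rotating the presentation of the data).
import pandas as pd

cube = pd.DataFrame({
    "location": ["Toronto", "Toronto", "Vancouver", "Vancouver"],
    "item":     ["Mobile", "Modem", "Mobile", "Modem"],
    "time":     ["Q1", "Q1", "Q1", "Q1"],
    "sales":    [605, 825, 680, 512],
})

pivoted = cube[cube["time"] == "Q1"].pivot_table(
    index="location", columns="item", values="sales", aggfunc="sum")
print(pivoted)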

Data Preprocessing in Data Mining


1. Need of Data Preprocessing
Data preprocessing refers to the set of techniques implemented on the
databases to remove noisy, missing, and inconsistent data. Different Data
preprocessing techniques involved in data mining are data cleaning, data
integration, data reduction, and data transformation.
The need for data preprocessing arises from the fact that real-world data, and often the data already in the database, is frequently incomplete and inconsistent, which may result in improper and inaccurate data mining results. Thus, to improve the quality of the data on which observation and analysis are to be done, it is treated with these four steps of data preprocessing. The more the data is improved, the more accurate the observations and predictions will be.

Fig 1: Steps of Data Preprocessing

2. Data Cleaning Process


Data in the real world is usually incomplete, inconsistent, and noisy. The data cleaning process includes procedures that aim to fill in missing values, smooth out the noise while identifying outliers, and rectify inconsistencies in the data. Let us discuss the basic methods of data cleaning,
2.1. Missing Values
Assume that you are dealing with any data like sales and customer data and
you observe that there are several attributes from which the data is missing.
One cannot compute data with missing values. In this case, there are some
methods which sort out this problem. Let us go through them one by one,
2.1.1. Ignore the tuple:
If no class label is specified, we can use this method. It is not effective when the percentage of missing values per attribute varies considerably.
2.1.2. Enter the missing value manually or fill it with global
constant:
When the database contains many missing values, filling them in manually is not feasible, as this method is time-consuming. Another option is to fill them with some global constant.
2.1.3. Filling the missing value with attribute mean or by using
the most probable value:
Filling the missing value with the attribute mean is another option. Filling with the most probable value uses regression, Bayesian formulation, or a decision tree.
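A small sketch of these strategies with pandas, on an invented table whose column names are assumptions for illustration:

# Minimal sketch: three ways of dealing with missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":    [25, np.nan, 47, 31, np.nan],
                   "income": [50000, 62000, np.nan, 58000, 45000]})

dropped   = df.dropna()                             # ignore the tuple
constant  = df.fillna(-1)                           # fill with a global constant
mean_fill = df.fillna(df.mean(numeric_only=True))   # fill with the attribute mean
print(mean_fill)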
2.2. Noisy Data
Noise refers to random error in a measured variable. If a numerical attribute is given, you need to smooth out the data by eliminating the noise. Some data smoothing techniques are as follows,
2.2.1. Binning:
1. Smoothing by bin means: In smoothing by bin means, each value in a bin is
replaced by the mean value of the bin.
2. Smoothing by bin median: In this method, each bin value is replaced by its
bin median value.
3. Smoothing by bin boundary: In smoothing by bin boundaries, the minimum
and maximum values in a given bin are identified as the bin boundaries. Every
value of bin is then replaced with the closest boundary value.
Let us understand with an example,
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:


- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

Smoothing by bin median:


- Bin 1: 9, 9, 9, 9
- Bin 2: 24, 24, 24, 24
- Bin 3: 29, 29, 29, 29
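The sketch below reproduces the bin-means and bin-boundaries smoothing shown above for the same sorted price data, using equal-depth bins of four values:

# Minimal sketch: smoothing by bin means and by bin boundaries.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]   # equal-depth bins

for b in bins:
    mean = round(sum(b) / len(b))
    by_means = [mean] * len(b)                               # smooth by bin means
    lo, hi = b[0], b[-1]                                     # bin boundaries
    by_bounds = [lo if v - lo <= hi - v else hi for v in b]  # closest boundary
    print(b, "->", by_means, "or", by_bounds)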
2.2.2. Regression:
Regression is used to predict values. Linear regression uses the formula of a straight line to predict the value of y for a specified value of x, whereas multiple linear regression predicts the value of a variable using the given values of two or more variables.
3. Data Integration Process
Data integration is a data preprocessing technique that combines data from multiple heterogeneous data sources into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes, databases, or flat files.
3.1. Approaches
There are mainly 2 major approaches for data integration – one is "tight
coupling approach" and another is the "loose coupling approach".
Tight Coupling:
Here, a data warehouse is treated as an information retrieval component. In this coupling, data from different sources is combined into one physical location through the process of ETL – Extraction, Transformation, and Loading.
Loose Coupling:
Here, an interface is provided that takes the query from the user, transforms it into a form the source databases can understand, and then sends the query directly to the source databases to obtain the result. The data remains only in the actual source databases.
3.2. Issues in Data Integration
There are several issues to consider during data integration: schema integration, redundancy, and detection and resolution of data value conflicts. These are explained briefly below,
3.2.1. Schema Integration:
Integrate metadata from different sources.
Matching real-world entities from multiple sources is referred to as the entity identification problem.
For example, how can the data analyst and the computer be certain that customer id in one database and customer number in another refer to the same attribute?
3.2.2. Redundancy:
An attribute may be redundant if it can be derived or obtained from another attribute or set of attributes.
Inconsistencies in attributes can also cause redundancies in the resulting data set.
Some redundancies can be detected by correlation analysis.
3.2.3. Detection and resolution of data value conflicts:
This is the third important issue in data integration. Attribute values from different sources may differ for the same real-world entity. An attribute in one system may be recorded at a lower level of abstraction than the "same" attribute in another.
4. Data Reduction Process
Data warehouses usually store large amounts of data, and the data mining operation takes a long time to process this data. Data reduction techniques help to minimize the size of the dataset without affecting the result. The following are the methods that are commonly used for data reduction,
a. Data cube aggregation
Refers to a method where aggregation operations are performed on data to
create a data cube, which helps to analyze business trends and performance.
b. Attribute subset selection
Refers to a method where redundant attributes or dimensions or irrelevant data
may be identified and removed.
c. Dimensionality reduction
Refers to a method where encoding techniques are used to minimize the size of
the data set.
d. Numerosity reduction
Refers to a method where smaller data representation replaces the data.
e. Discretization and concept hierarchy generation
Refers to methods where higher conceptual values replace raw data values for
attributes. Data discretization is a type of numerosity reduction for the
automatic generation of concept hierarchies.
5. Data Transformation Process
In the data transformation process, data is transformed from one format into another format that is more appropriate for data mining.
Some data transformation strategies are,
a. Smoothing
Smoothing is a process of removing noise from the data.
b. Aggregation
Aggregation is a process where summary or aggregation operations are applied to the data.
c. Generalization
In generalization, low-level data are replaced with high-level data by using
concept hierarchies climbing.
d. Normalization
Normalization scales attribute data so that it falls within a small specified range, such as 0.0 to 1.0.
e. Attribute Construction
In Attribute construction, new attributes are constructed from the given set of
attributes.
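As a small illustration of the normalization strategy above, here is a min-max scaling sketch on invented attribute values:

# Minimal sketch: min-max normalization of an attribute into [0.0, 1.0].
values = [200, 300, 400, 600, 1000]          # illustrative attribute values

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)                            # [0.0, 0.125, 0.25, 0.5, 1.0]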

KDD Process in Data Mining


Data Mining Knowledge
Data mining is the extraction of knowledge from enormous data sets stored in large databases. In other words, data mining is the process of mining knowledge from data. To recognize meaningful patterns, the data mining process relies on data compiled in the data warehousing stage.
For instance, just as gold is mined from rock or sand, knowledge is mined from data.
Data mining may also refer to data analysis activity. It is the computer-
supported process of analyzing huge data sets that have either been compiled
or downloaded into the computer by large data sources. The computer
analyzes the data and extracts key information from it in the data mining
process. It looks for hidden patterns and attempts to predict future behaviour
within the data set.
Why it is important?
There are various areas that show why data mining is important. The most common application areas of data mining are -
 Market Analysis
 Detection of fraud
 Customer retention
 Control of Production
 Scientific exploration
In contrast to data analytics, where discovery goals are often not known or well
defined at the outset, data mining efforts are usually driven by a specific lack of
information that cannot be satisfied through standard data queries or reports.
Data mining produces data from which it is possible to derive and then test
predictive models, leading to a greater understanding of the marketplace.
Data mining's business application is broad. It can be used for everything from
pharmaceutical research to traffic pattern modelling. However, the classic use
case is to predict customer behaviour to optimize sales and marketing
activities. For example, retailers often use data mining to predict what their
customers might be buying next.
Other terms of reference for data mining:
 Mining of Knowledge
 Extraction of Knowledge
 Analysis of the pattern
 Archaeology of data
 Dredging of data
Data mining relies on effective data collection and warehousing as well as computer processing. It uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also known as Knowledge Discovery in Databases (KDD).
KDD and its Process
The term Knowledge Discovery in Databases, or KDD for short, refers to the broad process of discovering knowledge in data and emphasizes the "high-level" application of specific data mining methods. It is of interest to researchers in machine learning, pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition for expert systems, and data visualization.
In the context of big databases, the unifying objective of the KDD process is to
extract knowledge from data.
This is done by using data mining methods (algorithms) to extract (identify) what is considered knowledge, according to the specified measurements and thresholds, using the database along with any required pre-processing, sub-sampling, and transformation of that database.
The diagram below shows the KDD process.

Figure: KDD process


Data Selection - Data relevant to the analysis is retrieved from the database
Data cleaning and pre-processing - Noisy and inconsistent information is eliminated
Data integration - Multiple data sources are combined
Data Transformation - Data is transformed into a form suitable for data mining
Data Mining - Data patterns are extracted using intelligent methods
Evaluation of Pattern - Interesting patterns are identified
Knowledge representation - The mined knowledge is represented and presented to the user
Example of Data mining
Well-known users of data mining techniques are grocery stores. Many
supermarkets offer customers free loyalty cards that give them access to
reduced prices that are not available to non-members. Cards make it easy for
stores to track who buys what, when they buy it, and at what price. After
analysis of the data, stores can then use this data to offer customers coupons
tailored to their purchase habits and decide when to put items on sale or when
to sell them at full price.

Data Cleaning in Data Mining


What is Data Cleaning?
Data cleaning is a method of removing all possible noise from data. Proper, cleaned data is then used for data analysis to find key insights, patterns, etc. Data cleaning increases data consistency and entails normalizing the data.
The data derived from existing sources may be inaccurate, unreliable, complex,
and sometimes incomplete. So, before data mining, certain low-level data has
to be cleaned up. Data cleaning is not only about erasing data to make room
for new information, but rather finding a way to improve the accuracy of a data
set without actually deleting information.
Why Data Cleaning?
Data cleaning is important for both individuals and organizations. A company accumulates a lot of data as it expands, and clean, structured data allows the organization's executives and administrators to make decisions that improve the organization's efficiency.
An effective organizational strategy supports retention over the long term and leads to better choices and improved efficiency. To achieve this, efficient data cleaning is important.
Data Cleaning Process
The data cleaning process handles data cleaning; but before handling the
inconsistent data, it should be identified first. Following phases are used in the
data cleaning process.
1. Identify Inconsistent Details - Discrepancies in data can arise from different factors, such as the data type, or from forms with many optional fields that allow candidates to leave details missing. Candidates may also have made a mistake while entering the data, or some details may be out of date, such as an address or phone number that was never updated. These can all be causes of contradictory details.
2. Identify Missing Values - If a record lacks several attributes and their values, it can be ignored.
3. Remove Noisy Data and Missing Values - Noisy data is information without meaning; the term is also used for corrupt records. Noisy data cannot be turned into valuable information by the data mining process, and it increases the volume of data in the data warehouse, so it should be removed efficiently to allow data mining.

Fig: Data Cleaning Process


Overall, the following methods are used to eliminate noisy data -
 Binning - Depending on how the bins are created, we may remove the noise by substituting boundary values or bin means for the values in each bin.
 Regression - Regression is used to smooth noisy data. Regression fits the data attributes to a function that identifies the relationship between two variables, such as linear regression, so that one attribute helps identify the value of another attribute.
 Clustering - Comparable data points are grouped into clusters by this method. Values that fall outside of the clusters can be treated as outliers.
Benefits of data cleaning
1. Removal of noises from various data sources.
2. Error detection improves working efficiency and gives users a way to identify where mistakes from various sources come from.
3. Using the data cleaning process, we can get an effective business process and
better decision-making.

Classification of Data Mining Systems


Overview
Data mining is an interdisciplinary field in which different disciplines intersect, including database systems, statistics, machine learning, visualization, and information science. Classifying data mining systems lets users understand a system and match such systems to their requirements.
Also, methods from other fields, such as neural networks, fuzzy and/or rough
set theory, information representation, inductive logic programming, or high-
performance computation, may be implemented based on the data mining
approach used. The data mining framework can also implement techniques
from spatial data analysis, data extraction, pattern recognition, image analysis,
signal processing, computer graphics, depending on the types of data to be
mined or on the specified data-mining program.

Data Mining Systems Classification


A data mining system can be classified according to the following disciplines −
 Database Technology
 Statistics
 Machine Learning
 Information Science
 Visualization
 Other Disciplines
The following points describe data mining system classifications -
1. Classification according to the type of databases mined
A data mining system can be classified according to the type of databases mined. Database systems can be distinguished according to various criteria (such as data models, data types, or the applications involved), each of which may require its own data mining technique. Data mining systems can therefore be categorized accordingly.
2. Classification according to the forms of information
derived
Data mining systems can be classified by the kinds of knowledge they mine, that is, based on data mining functionalities such as characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis. A comprehensive data mining system usually provides several and/or integrated data mining functionalities.
In other words, a data mining system can be classified according to the kind of knowledge that is extracted, i.e. on the basis of functionalities including
 Characterization
 Discrimination
 Association and Correlation Analysis
 Classification
 Prediction
 Outlier Analysis
 Evolution Analysis
3. Classification by the kind of techniques used
This classification is based on the data analysis techniques involved or the degree of user interaction, for example machine learning, statistics, visualization, pattern recognition, neural networks, or database-oriented and data-warehouse-oriented techniques.
4. Classification according to the adapted applications
Data mining systems can also be classified according to the applications they are adapted to. For instance, data mining systems can be tailored specifically for finance, telecommunications, DNA, stock markets, e-mail, and so on. Different applications often require application-oriented approaches; therefore, a generic, all-purpose data mining system may not fit domain-specific mining tasks.
These applications are as follows −
 Finance
 Telecommunications
 DNA
 Stock Markets
 E-mail
Data Mining Systems Integration Schemes
1. No Coupling
The data mining system does not use any database or data warehouse features. It retrieves data from a particular source, processes the data with certain data mining algorithms, and stores the result in another file.
2. Loose Coupling
In this scheme, the data mining system may use some features of the database and data warehouse system. It fetches data from the data repositories managed by these systems, performs data mining on that data, and then stores the mining result either in a file or in a designated place in a database or data warehouse.
3. Semi-tight coupling
In semi-tight coupling, the data mining system is linked to a database or data warehouse system and, in addition, efficient implementations of a few data mining primitives can be provided within the database.
4. Strong coupling
In tight coupling, the data mining system is smoothly integrated into the database or data warehouse system. The data mining subsystem is treated as one functional component of the overall information system.
Difference Between Classification and Prediction in Data Mining
What is Classification?
Classification is the discovery of a model that distinguishes data classes and concepts. The purpose of the model is to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data.
A classification task begins with a data set in which the class assignments are known. For example, a classification model that predicts credit risk could be developed based on observed data for many loan applicants over a period of time. The data might track employment history, home ownership or rental, years of residence, number and type of deposits, historical credit rating, and so on. The credit rating would be the target, the other attributes would be the predictors, and the data for each customer would constitute a case.
How Does Classification Work?
The working of classification can be illustrated with the bank loan application mentioned above. The data classification process has two stages: building the classifier (model creation) and using the classifier for classification.
 Classifier or model creation:
This stage is the learning step, in which the classification algorithm builds the classifier. The classifier is built from a training set composed of database records and their associated class labels. Each record that makes up the training set is referred to as a sample, object, or data point.
 Using classifier for classification:
The classifier is used for classification at this stage. Test data are used here to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to new data records (a minimal sketch of both stages appears after the process steps below).
 Data Classification Process:
The data classification process can be categorized into five steps:
1. Define the goals, strategy, workflows, and architecture of data classification.
2. Identify the confidential or sensitive data that is stored.
3. Label the data by applying tags or marks.
4. Use the results to improve security and compliance.
5. Data keeps changing, so classification is a continuous process.
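The sketch below illustrates the two classification stages described above (building a classifier from labelled training data, then using it on test data). It assumes scikit-learn is available; the credit-risk features, values, and labels are invented for illustration, not taken from the text.

```python
# Minimal sketch of the two classification stages, assuming scikit-learn.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Stage 1: model creation from a training set (records + known class labels).
X_train = [[5, 1, 40000], [1, 0, 18000], [10, 1, 90000], [2, 0, 25000]]
y_train = ["low_risk", "high_risk", "low_risk", "high_risk"]
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Stage 2: using the classifier; test data with known labels estimate accuracy.
X_test = [[7, 1, 60000], [1, 0, 15000]]
y_test = ["low_risk", "high_risk"]
print(accuracy_score(y_test, clf.predict(X_test)))  # e.g. 1.0 if both are correct
```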
What is a Prediction?
Prediction is used to estimate missing or unavailable numeric data values, typically by means of regression analysis. When the class label is missing, classification is used to make the prediction. Prediction is widely used because of its relevance in business intelligence.
The following is an example of a situation where the data mining task is prediction. Suppose a marketing manager needs to predict how much a particular customer will spend at the company during a sale. In this case we are asked to forecast a numeric value, so the data mining task is an example of numeric prediction, in which a model or predictor is built that forecasts a continuous-valued or ordered function.
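A minimal numeric-prediction sketch for the scenario above, assuming scikit-learn; the customer attributes (age, past spend) and the amounts are invented for illustration.

```python
# Build a predictor for a continuous value (customer spend during a sale).
from sklearn.linear_model import LinearRegression

X_train = [[25, 120.0], [40, 560.0], [33, 300.0], [51, 820.0]]  # [age, past spend]
y_train = [150.0, 600.0, 330.0, 900.0]                          # spend during sale

predictor = LinearRegression().fit(X_train, y_train)
print(predictor.predict([[45, 700.0]]))  # predicted (continuous) spend for a new customer
```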
Comparison of classification and prediction methods
Comparison of classification and prediction methods are described below -
1. Accuracy -
The accuracy of a classifier refers to its ability to correctly predict the class label of new or previously unseen data, while the accuracy of a predictor refers to how well it estimates the value of the predicted attribute for new data.
2. Speed -
This refers to the expense of producing and using the classifier or predictor for
estimation.
3. Robustness -
It refers to the classifier or predictor's ability to make correct predictions from
the noisy data given.
4. Scalability -
It refers to the capacity to effectively build the classifier or predictor, given a
large amount of data.
5. Interpretability -
It refers to the level of understanding and insight that the classifier or predictor provides.
Difference between classification and prediction
A decision tree applied to existing data is a classification model. If we apply it to new data for which the class is unknown, we get a class prediction. The assumption is that the new data come from a distribution similar to the data used to build the decision tree; in many cases this assumption holds, which is why the decision tree can be used to build a predictive model. Classification, then, is the process of finding a model that describes the classes or concepts of the data, and the purpose is to use this model to predict the class of objects whose class label is unknown.
Classification and Prediction Issues
The following are the key challenges in classification and prediction -
 Data Cleaning -
Data cleaning involves removing noise and treating missing values. Noise can be removed with smoothing techniques, and the problem of missing values is commonly solved by replacing a missing value with the most frequently occurring value for that attribute (a minimal sketch appears after this list).
 Relevance Analysis -
The data set may also contain irrelevant attributes. Correlation analysis is used to check whether two given attributes are related to each other.
 Normalization & Generalization -
The data may also be transformed by generalization to higher-level concepts. Normalization is used when neural networks or methods involving distance measurements are used in the learning step; it involves scaling all values of a given attribute so that they fall within a small specified range.
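The sketch below shows the missing-value treatment mentioned in the data cleaning item, assuming scikit-learn; the small matrix is invented for illustration.

```python
# Replace missing values with the most frequent value of each attribute (column).
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan], [2.0, 5.0], [np.nan, 5.0], [2.0, 7.0]])
imputer = SimpleImputer(strategy="most_frequent")
print(imputer.fit_transform(X))  # NaNs replaced by each column's most frequent value
```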
Cluster Analysis: What It Is, Methods, Applications, and Needs in Data Mining
Why Clustering Analysis Is Used?
Clustering analysis in data mining is used to group data points with similar features: the data is partitioned into groups by identifying similarities between objects and arranging them into useful classes using various techniques (such as the density-based, grid-based, model-based, constraint-based, partitioning, and hierarchical methods). Because of this capability, clustering is commonly used in research to identify patterns, process images, and analyze data.
What is Cluster Analysis in Data Mining?
Clustering is the arrangement of data into groups of similar objects. Unlike classification, class labels are not predefined in clustering; it is up to the clustering algorithm to discover suitable classes. Clustering is therefore often called unsupervised classification, because the grouping is not guided by given class labels. Many clustering methods are based on the principle of maximizing the similarity between objects of the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity).
Fig 1: Example of clustering in data sets
In the above example there are three different clusters; the data has been arranged and grouped into clusters according to the similarity of features in the dataset.
Methods of Data Mining Cluster Analysis
There are several methods by which clustering is conducted in data mining.
1. Hierarchical Method
Hierarchical clustering, also called hierarchical cluster analysis (HCA), is an unsupervised clustering algorithm that builds a hierarchy of clusters ordered from top to bottom. The result is a set of clusters in which each cluster is distinct from the others and the objects within each cluster are broadly similar to each other.
Fig 2: Example of hierarchical clustering in data sets
For example, files and directories on our hard drive are arranged in a hierarchy.
With the algorithm, related artifacts are clustered into classes called clusters.
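A minimal hierarchical-clustering sketch using SciPy's agglomerative linkage; the 2-D points and the cut of three clusters are invented for illustration.

```python
# Build a bottom-up hierarchy of clusters and cut it into at most 3 groups.
from scipy.cluster.hierarchy import linkage, fcluster

points = [[1, 2], [1, 3], [8, 8], [9, 8], [25, 30]]
Z = linkage(points, method="ward")               # the hierarchy (dendrogram data)
labels = fcluster(Z, t=3, criterion="maxclust")  # flat cluster labels from the hierarchy
print(labels)
```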
2. Partitioning Method
Partitioning clustering decomposes a data collection into a set of disjoint clusters. Given a data set of N points, a partitioning method constructs K (N ≥ K) partitions of the data, with each partition representing a cluster. That is, it classifies the data into K groups that satisfy the following conditions:
1. Each group contains at least one point, and
2. Each point belongs to exactly one group (in fuzzy partitioning, a point may belong to more than one group).
Fig 3: Example of partitioning clustering in data sets
Many partitioning clustering algorithms aim to minimize an objective function. For example, the objective (also referred to as the distortion function) in K-means and K-medoids is
∑_{i=1}^{K} ∑_{x_j ∈ C_i} Dist(x_j, center(C_i)).
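The sketch below runs K-means and recomputes the distortion above, assuming scikit-learn; the points are invented, and with squared Euclidean distance the sum coincides with scikit-learn's inertia_ attribute.

```python
# Fit K-means and compute the distortion: sum over clusters of distances to the center.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 2], [8, 8], [9, 8], [25, 30], [24, 29]])
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

distortion = sum(np.sum((X[km.labels_ == i] - c) ** 2)
                 for i, c in enumerate(km.cluster_centers_))
print(distortion, km.inertia_)  # the two values agree (squared Euclidean distance)
```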
3. Density-Based Clustering Method
Density-based clustering refers to unsupervised learning methods that define
distinctive groups/clusters in the data based on the assumption that a cluster in
data space is a contiguous region of high point density, isolated from other
such clusters by contiguous regions of low point density. In this clustering
approach in Data Mining, the primary emphasis is density.
Fig 4: An example of the Density-Based Clustering Method
The notion of density is the basis for this clustering process. A cluster keeps growing as long as each data point has at least a minimum number of neighbouring points within a given radius.
4. Constraint-based Method
Clustering is performed subject to application- or user-specified constraints; a constraint expresses the user's expectation. This clustering method is highly interactive, since the constraints supplied by the user guide how objects are grouped.
5. Model-based Method
A model is hypothesized for each cluster, and the method finds the data that best fits that model. Density functions are commonly used to locate the clusters in this process.
6. Grid-based Method
In the grid-based clustering method, the object space is quantized into a finite number of cells, which together form a grid structure; clustering is then performed on this grid.
Application of Mining Cluster Analysis
Data clustering analysis has many uses, such as image processing, data
analysis, recognition of patterns, market research and many more. Using data
clustering, firms can discover new classes in the consumer database. Data
grouping can also be achieved based on purchase patterns.
It helps to understand each cluster and its characteristics, shows how the data is distributed, and serves as a tool for other data mining functions. Cluster analysis has a wide range of applications in today's world:
1. Cluster analysis is commonly used in market analysis, whether it is for pattern
recognition, or image manipulation or exploratory data analysis.
2. It lets advertisers find the various categories of their consumer base and they
can use purchase habits to define their customer groups.
3. Clustering analysis is used widely by financial institutions to detect fraud using
clusters alongside outlier identification.
4. It may be used in the field of biology to categorize genes with similar functionality and to derive plant and animal taxonomies.
5. In the market, a consumer segment cluster is used to target the selling of
various goods.
Needs of Clustering in Data Mining
1. Scalability - We need highly efficient clustering algorithms to handle massive data sets.
2. Discovery of clusters with arbitrary shape - The algorithm should be able to detect clusters of arbitrary shape and should not be limited to distance measures.
3. Interpretability - The results of clustering should be interpretable, comprehensible, and usable.
4. High dimensionality - The algorithm should be able to handle high-dimensional space, not just low-dimensional data.
5. Ability to deal with different kinds of attributes - Algorithms should be applicable to any kind of data, such as categorical, binary, and interval-based (numerical) data.
Data Mining Outlier Analysis: What It Is, Why It Is Used?
Outlier Detection
Outlier detection in data mining seeks to identify trends in data that do not
comply with expected behavior.
Fig: An example of an outlier
What are Outliers?
Outliers are a special concern in data analysis. Outlier analysis is most widely used in fraud detection, where outliers may indicate fraudulent behaviour. Outlier analysis is the technique of finding anomalous observations in a sample, and the discovery and interpretation of outliers is an interesting data mining task in its own right. An outlier is an element of a data set that differs strongly from the rest of the data.
Outlier analysis, also known as outlier mining, has various application areas, such as detecting irregular use of credit cards or telecommunication systems, discovering unusual responses to medical treatments in healthcare research, and analyzing the spending behaviour of consumers.
Why Outlier Analysis?
Most data mining techniques discard outliers as noise or anomalies, but in some applications, such as fraud detection, the rare events may be more interesting than the frequent ones, which is why outlier analysis becomes important in such cases.
How Outlier Detection Can Improve Business Analysis?
An organization should first think about whether they want to identify the
outliers and what they can do with the information before evaluating the use of
outlier analysis. To reveal the results they need to see and comprehend, this
emphasis will help the organization to choose the correct form of analysis using
diagrams or plotting. When an organization uses outlier analysis, it is necessary
to validate the findings with an overall dataset.
How to Detect an Outlier?
A common approach is clustering-based outlier detection using the distance to the nearest cluster. In the K-means clustering technique, each cluster has a mean value, and objects belong to the cluster whose mean they are nearest to. We first initialize a threshold value such that any data point whose distance from its nearest cluster centre is greater than the threshold is marked as an outlier. We then compute the distance between the test data point and each cluster mean; if the distance to the nearest cluster exceeds the threshold, the test data point is labelled an outlier.
Common Steps of the Algorithm (a minimal sketch follows)
 Calculate the mean of each cluster.
 Initialize the threshold value.
 Calculate the distance of the test data from each cluster mean.
 Find the cluster closest to the test data.
 If Distance > Threshold, mark the test data as an outlier.
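The sketch below follows the steps above, assuming scikit-learn; the data points, the number of clusters, the test point, and the threshold value are all invented for illustration.

```python
# Clustering-based outlier check: distance of a test point to its nearest cluster mean.
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)   # cluster means

threshold = 4.0                                                   # chosen threshold
test_point = np.array([20, 20])
distances = np.linalg.norm(km.cluster_centers_ - test_point, axis=1)  # to each mean
nearest = distances.min()                                         # nearest cluster
print("outlier" if nearest > threshold else "normal")             # final decision
```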
Outlier Analysis Techniques
The simplest method for outlier analysis is sorting. Load the dataset into a data processing tool, such as a spreadsheet, sort the values, and then look at the range of the data points. If some data points are substantially higher or lower than the rest of the dataset, they can be viewed as outliers.
Example
Let's look at a sorting example. Suppose a company's CEO earns a salary that is many times that of the other staff. When reviewing the data, analysts should check whether any outliers are present in the dataset. Sorting by salary makes exceptionally high values easy to spot: knowing that it lies far above the average pay, the CEO's salary will stand out as an outlier in the analysis.
Association Analysis in Data Mining
What is Association Analysis?
Association analysis is most widely used to discover hidden patterns in large data sets. The task of identifying interesting relationships in large databases is association analysis, and the uncovered relationships can be represented as association rules or as sets of frequent items. These interesting relationships are therefore of two types: frequent itemsets and association rules. Frequent itemsets are collections of items that often occur together, while association rules are a way of expressing interesting relationships: a rule states that a strong relationship exists between two or more items.
Fig: Market Basket Analysis
Market Basket Analysis
In transactional data, each case is associated with a set of items. In principle, the set may contain any of the possible items in the collection; in practice, however, only a small subset of all potential items appears in any given transaction, since the items in a market basket represent only a small fraction of the items available for sale in the shop.
Market basket analysis is a typical example of frequent pattern (itemset) mining for association rules. It studies the purchasing habits of customers by finding associations among the different items that customers place in their shopping baskets.
An example of an association rule is {milk} → {bread}. If a shop sells milk, there is a good chance of selling bread too, because a customer who buys milk is also likely to buy bread; this shows that milk and bread are associated with each other.
Association Rule
An association rule is an implication of the form X → Y, where X and Y are disjoint itemsets (X ∩ Y = ∅).
The strength of an association rule can be measured in terms of its support and confidence. A rule with very low support may occur purely by chance, while confidence measures the reliability of the inference made by the rule.
Support of an association rule X → Y
Let σ(X) be the support count of X and N the number of transactions in the transaction set T. Then
s(X → Y) = σ(X ∪ Y) / N
Confidence of an association rule X → Y
conf(X → Y) = σ(X ∪ Y) / σ(X)
Interest of an association rule X → Y
Let P(Y) = s(Y) be the support of Y (the fraction of baskets that contain Y). Then
I(X → Y) = P(X, Y) / (P(X) × P(Y))
If the interest of a rule is close to 1, the rule is uninteresting:
I(X → Y) = 1 → X and Y are independent
I(X → Y) > 1 → X and Y are positively correlated
I(X → Y) < 1 → X and Y are negatively correlated
For example, given a table of market basket transactions:
TID Items
1 {Bread, Milk}
2 {Bread, Diaper, Beer, Eggs}
3 {Milk, Diaper, Beer, Coke}
4 {Bread, Milk, Diaper, Beer}
5 {Bread, Milk, Diaper, Coke}
We can conclude that:
s({Milk, Diaper} → {Beer}) = 2/5 = 0.4
conf({Milk, Diaper} → {Beer}) = 2/3 ≈ 0.67
I({Milk, Diaper} → {Beer}) = (2/5) / ((3/5) × (3/5)) = 10/9 ≈ 1.11
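The sketch below recomputes the support, confidence, and interest (lift) of {Milk, Diaper} → {Beer} for the five transactions in the table above, using only the definitions given in this section.

```python
# Support, confidence and interest (lift) for the example rule.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
X, Y = {"Milk", "Diaper"}, {"Beer"}
N = len(transactions)

def count(items):
    # support count sigma(items): number of transactions containing all the items
    return sum(items <= t for t in transactions)

support = count(X | Y) / N                            # 2/5 = 0.4
confidence = count(X | Y) / count(X)                  # 2/3 ~ 0.67
interest = support / ((count(X) / N) * (count(Y) / N))  # 0.4 / (0.6 * 0.6) ~ 1.11
print(support, confidence, interest)
```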
Data Integration in Data Mining
What is Data Integration?
Data integration is an integral part of data pre-processing, because data may be obtained from various sources. It is a technique that combines data from different sources and presents it to users in a single unified view. The sources involved may include several databases, data cubes, or flat files.
Data integration merges data from several heterogeneous sources to obtain usable data. Inconsistencies, redundancies, and discrepancies must be removed from the consolidated result.
Systemic View of Data Integration Process
The below-mentioned diagram is a systemic view of the data integration
process –
Data Integration Process
Integration of data is important because it not only gives a coherent view of the
fragmented data; it also ensures the consistency of the data. This allows the
data-mining software to extract valuable information, which in turn encourages
managers and executives to take rational measures to develop the business.
 Extract, Transform, and Load (ETL): copies of datasets from various sources are gathered, harmonized, and loaded into a data warehouse or database.
 Extract, Load, and Transform (ELT): data is loaded as-is into a big data system and transformed later for specific analytics uses.
 Change Data Capture: identifies data changes in databases in real time and applies them to a data warehouse or other repository.
 Data Replication: data in one database is replicated to other databases to keep the information synchronized for operational and backup uses.
Why is Data Integration Important?
Companies that want to stay competitive and relevant embrace big data along with all of its advantages and problems. Data integration makes it possible to query these massive datasets, benefiting everything from business intelligence and customer data analytics to data enrichment and real-time information delivery. The collection of business and customer data is one of the foremost use cases for data integration services and technologies. Enterprise data integration feeds integrated data into data warehouses or hybrid data integration architectures to support enterprise reporting, business intelligence (BI), and predictive analytics.
Customer data integration gives business managers and data analysts a complete picture of key performance indicators (KPIs), financial risks, customers, manufacturing and supply chain operations, regulatory compliance efforts, and other aspects of business processes.
Data Integration Problems
We have to deal with many issues that are discussed below while integrating
the results.
1. Detection and Resolution of Data Value Conflicts
A data value conflict means that the data combined from multiple sources do not match. Attribute values may differ between data sets, or the same information may be represented differently in different data sets. For example, the price of a hotel room may be expressed in different currencies in different cities; this kind of problem is detected and resolved during data integration.
2. Redundancy and Study of Correlation
Redundancy is one of the major challenges during data integration. Redundant data is irrelevant data or data that is no longer needed; it can also arise when one attribute in the data set can be derived from another attribute.
Example
One data set has the age of the client and another data set has the date of
birth of the client, so age will be a redundant attribute since it could be
calculated using the date of birth.
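The sketch below illustrates correlation-based redundancy checking for the example above, assuming pandas; the birth years, ages, and the 0.95 cut-off are invented. Because age is derivable from the date of birth, the two columns are almost perfectly (negatively) correlated and one can be dropped.

```python
# Detect a redundant attribute via the Pearson correlation coefficient.
import pandas as pd

df = pd.DataFrame({
    "birth_year": [1990, 1985, 2000, 1978],
    "age":        [34,   39,   24,   46],
})
corr = df["birth_year"].corr(df["age"])   # close to -1.0 -> strongly correlated
print(corr)
if abs(corr) > 0.95:                      # illustrative cut-off
    df = df.drop(columns=["age"])         # keep only one of the two attributes
```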
Data Integration Tools and Techniques
Techniques for data integration are available, from fully integrated to manual
approaches, across a wide variety of organizational levels. Typical data
integration tools and strategies include:
1. Manual Integration or Common User Interface
No unified view of the data is provided; users access all of the source systems directly and work with the related details themselves.
2. Application Based Integration
It allows all the integration efforts to be applied by each application;
manageable with a limited number of applications.
3. Middleware Data Integration
This approach moves an application's integration logic into a separate middleware layer. The middleware program gathers data from multiple sources, normalizes it, and stores it in the resulting data set. This strategy is adopted when a company needs to integrate data from existing systems with new systems.
4. Integration of Uniform Access
This approach integrates data from more disparate sources, but the data is not moved: it remains in its original location, and only a single unified view reflecting the combined results is presented to the end user. No separate storage is needed for the integrated data, since only the integrated view is generated.
Major Issues in Data Mining - Purpose and Challenges
Overview
Today, data mining is in great demand as it helps companies to provide
insights and study how their product sales can increase. Data mining has great
strengths and is a competitive and fast-expanding industry. Consider, for example, a fashion shop that records every customer who purchases a product from the store. Based on customer data such as age, gender, income group, and occupation, the store can figure out which kinds of consumers buy which products. In this example the customer's name has no predictive value, so we cannot forecast a buying pattern by name. Instead, the age group, gender, income group, occupation, and similar attributes are used to find useful information. "Data mining" is the search for facts or interesting patterns in data.
Data mining applications in today's world face a number of difficulties and
problems; many of these problems have been resolved to a certain degree in
recent research and development of data mining and are now considered
criteria for data mining; some are still at the research level.
Major Issues in Data Mining
The following are the some of the most common challenges in data mining -
1. Mining Methodology
New mining tasks continue to evolve as there are diverse applications. These
activities may use the same database in numerous ways and require new
techniques for data mining to be created. When searching for knowledge in large datasets we need to traverse a multidimensional space, and various interestingness measures need to be applied to identify interesting patterns. Uncertain, noisy, and incomplete data may also lead to incorrect conclusions.
2. User Interaction Issue
The data mining process can be highly interactive, and interactivity is important for facilitating the mining process. Domain knowledge, background knowledge, constraints, and so on should be incorporated in the course of data mining. The knowledge discovered by data mining should be understandable to humans, so the system should provide expressive knowledge representations, user-friendly visualization techniques, and the like.
3. Performance and Scalability
Data mining algorithms must be efficient and scalable in order to extract interesting information from the huge amounts of data held in databases and data warehouses. The enormous size of many databases, the wide distribution of data, and the computational complexity of some data mining methods motivate the development of parallel and distributed data mining algorithms, which divide the data into partitions that are processed simultaneously.
4. Data type diversity
Handling relational and complex data types: databases and data warehouses store many different kinds of data, and no single system can be expected to mine all of them, so data mining solutions should be built for the different types of data. Mining information from heterogeneous databases and global information systems: because data is gathered from numerous data sources on Local Area Networks (LAN) and Wide Area Networks (WAN), discovering knowledge from such varied, structured sources is a major challenge for data mining.
5. Data mining and society
The social issues that need to be discussed include how mined data is used and the potential infringement of individual privacy and data protection rights. On the application side, data mining also calls for domain-specific mining solutions and tools, intelligent query answering, process monitoring, and decision support.
Data Transformation in Data Mining
Data Transformation
Data transformation is the process of converting raw data into a format suitable for data mining, so that strategic information can be extracted effectively and quickly. Raw data is hard to trace or interpret, which is why it must be pre-processed before anything is mined from it. To convert the data into the required form, data transformation involves data cleaning techniques as well as data reduction strategies; the data is generalized and normalized. Normalization ensures that no information is redundant, that everything is stored in a single place, and that all dependencies are logical. Data transformation is therefore one of the essential data pre-processing techniques that must be applied to the data before mining, so that the resulting patterns are easier to understand.
Data Transformation or Data Scaling Techniques
1. Data Cleaning
Cleaning the data means eliminating noise from the data set under consideration, using techniques such as binning, regression, and clustering.
2. Attribute Construction
New attributes can be constructed from the given attributes to make the data mining process more effective; this is called attribute construction. For example, the attributes gender and student could be combined into a single male-student/female-student attribute. This may be helpful when we want to study how many male and/or female students there are but are not interested in their field of study.
3. Data Aggregation
Aggregation of data is any process by which data is gathered and expressed in a summarized form. As data is aggregated, atomic data rows, usually collected from multiple sources, are replaced by totals or summary statistics. Aggregated data is typically stored in a data warehouse, where it can answer analytical queries and greatly reduce the time needed to query massive data sets. Data aggregation allows analysts to view and analyze large volumes of data in a reasonable time frame: a single row of aggregate data can represent hundreds, thousands, or even more atomic data records, and once the data is aggregated it can be queried quickly instead of spending computation cycles to access and aggregate each underlying atomic row at query time.
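A minimal aggregation sketch, assuming pandas: atomic quarterly sales rows are replaced by yearly totals. The years, quarters, and figures are invented for illustration.

```python
# Replace atomic quarterly rows with one summary (total) row per year.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [100,  120,   90,  150,  110,  130,  100,  160],
})
yearly = sales.groupby("year", as_index=False)["amount"].sum()
print(yearly)   # one aggregate row per year instead of four atomic rows
```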
4. Data Normalization
Normalization scales the data into a small range before analysis or model training, which helps to prevent building inaccurate ML models. If the values in a data set span a very wide range, they are difficult to compare. The original data can be transformed linearly (min-max normalization), by decimal scaling, or by Z-score normalization.
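The sketch below shows the three normalization techniques just mentioned, assuming NumPy; the sample values are invented for illustration.

```python
# Min-max scaling, Z-score normalization, and decimal scaling.
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

min_max = (x - x.min()) / (x.max() - x.min())              # values mapped into [0, 1]
z_score = (x - x.mean()) / x.std()                         # mean 0, unit variance
j = np.floor(np.log10(np.abs(x).max())) + 1                # smallest power of 10 so |v| < 1
decimal = x / 10 ** j                                      # decimal scaling
print(min_max, z_score, decimal, sep="\n")
```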
5. Discretization
The raw values of numeric attributes are replaced by discrete or conceptual intervals, which can in turn be grouped into higher-level intervals.
6. Generation of concept hierarchy for nominal data
Values are extended to higher-order concepts for nominal data.
Advantages of Data Transformation
The following are the advantages of data transformation:
 Data is transformed to make it better organized; transformed data can be easier for both humans and computers to use.
 Properly formatted and validated data improves data quality and protects applications from potential landmines such as null values, unexpected duplicates, incorrect indexing, and incompatible formats.
 Data transformation enables compatibility between applications, systems, and types of data; data used for different purposes may need to be transformed in different ways.
Data Reduction in Data Mining
What is Data Reduction?
Data reduction is a process that decreases the size of the original data and represents it in a much smaller volume. Data reduction techniques are useful because they preserve the integrity and quality of the data while shrinking it. When we gather data from multiple data warehouses for analysis, the result is a very large volume of data, which is daunting for a data scientist to work with; running complicated queries over such massive data takes a long time, and it is often hard to track down the correct data. That is why data reduction becomes important.
The data reduction strategy decreases the data volume while retaining the integrity of the data, so the result obtained from mining is the same (or almost the same) before and after data reduction. Data reduction does not make sense on its own; it is always combined with a particular goal, and that goal determines the criteria for the corresponding reduction techniques. A naive goal for data reduction is simply to decrease disk space; this involves compressing the data into a more compact format and, because the data still has to be examined, being able to recover the original data.
To achieve a minimized representation of the data set that is much reduced in
volume but still contains important details, data reduction techniques may be
applied.
1. Data Cube Aggregation: Aggregation operations are applied to the data in the construction of a data cube; this approach is used to summarize data in a compact form. For example, suppose the data we received for a report covering the years 2010 to 2015 contains the company's sales for every quarter. If we are interested in yearly sales rather than quarterly figures, we can summarize the data so that the resulting data reports total sales per year instead of per quarter, which condenses the details.
2. Dimensionality Reduction: Attributes that are weakly relevant or redundant for the analysis are removed, and only the attributes needed for the study are kept; dropping irrelevant or obsolete attributes reduces the data size.
3. Data Compression: Data compression reduces the size of files using various encoding mechanisms. Based on the compression process, it can be divided into two types:
o Lossless compression
o Lossy compression
4. Numerosity Reduction: In this technique the actual data is replaced either by a mathematical model or by a smaller representation of the data, so that only the model parameters need to be stored (parametric methods), or by non-parametric methods such as clustering, histograms, and sampling.
5. Discretization and Concept Hierarchy Generation:
a. Discretization: Data discretization techniques divide continuous attributes into interval data; several constant attribute values are replaced by small interval labels, so that the mining results can be presented in a concise and easily understood manner. Discretization can proceed in two directions:
 Top-down discretization
 Bottom-up discretization
b. Concept Hierarchies: The data is reduced in scale by collecting low-level concepts and replacing them with high-level concepts (for example, replacing exact ages with categories such as middle-aged or senior).
It is possible to adopt the following techniques for numeric data:
Binning - The method of changing numerical variables into categorical
equivalent is called binning; the number of categorical counterparts depends
on how many bins the user has defined.
Review of histograms - Like the binning process, the histogram is used to
divide the value into disjoint ranges called brackets for the X attribute.
Cluster Analysis - Cluster analysis is a common form of data discretization. A clustering algorithm can be applied to discretize a numeric attribute A by partitioning the values of A into clusters or groups. Each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy.
Data Cube Technology in Data Mining
What is Data Cube Technology?
A data cube is a three-dimensional (3D) (or higher) set of values that are
typically used to describe the time series of data from an image. It is an
abstraction of data to analyze aggregated information from a number of points
of view. As a spectrally-resolved picture is interpreted as a 3-D volume, it is
often useful for imaging spectroscopy.
A data cube can also be seen as a multidimensional extension of a two-dimensional table: it can be viewed as a collection of similar 2-D tables stacked on top of one another. Data cubes are used to represent data that is too complex to be described by a simple table of columns and rows, so the data is grouped or combined into multidimensional matrices called data cubes. Alternative names for the data cube approach include "multidimensional databases," "materialized views," and "OLAP (On-Line Analytical Processing)." The general idea of the approach is to materialize those aggregates that are commonly requested. In computer programming contexts, a data cube is a multi-dimensional ("n-D") array of values, and the term is usually used where these arrays are massively bigger than the main memory of the hosting computer; examples include multi-terabyte/petabyte data warehouses and time series of image data.
A data cube is generated from a subset of attributes in the database. Specific attributes are chosen as measure attributes, i.e. the attributes whose values are of interest, while the other attributes are selected as dimensions. The measure attributes are aggregated according to the dimensions. For instance, company XYZ may create a sales data warehouse to keep records of the store's sales with respect to the dimensions time, item, branch, and location. These dimensions make it possible to store and track information such as monthly sales and the branches and locations where the goods were sold. Each dimension may be described by a table, referred to as a dimension table; for example, a dimension table for items may contain the attributes item_name, brand, and type.

The data cube technique, with many implementations, is a fascinating method.


In several examples, data cubes may be sparse and not every cell in each
dimension would have matching data in the database.
What is Data Cube Technology used for?
A data cube is a multi-dimensional structure: an abstraction of data used to display aggregated data from a variety of viewpoints. The dimensions are aggregated over the 'measure' attribute, and the remaining dimensions are known as the 'feature' attributes. The data is viewed on a cube in a multidimensional manner, and the aggregated and summarized facts can be displayed against variables or attributes; this is where OLAP comes into play.
Data cubes are widely used for simple data analysis. They are used to represent quantities that the business needs along with dimensions, where each dimension of the cube reflects some characteristic of the database, such as revenue per day, month, or year.
Data Cube Classifications
Data cubes are specifically grouped into two classifications. These are
described below -
1. Multidimensional Data Cube
Centered on a structure where the cube is patterned as a multidimensional
array, most OLAP products are created. Compared to other methods, these
multidimensional OLAP (MOLAP) products typically provide a better
performance, primarily because they can be indexed directly into the data cube
structure to capture data subsets. The cube gets sparser as the number of
dimensions is larger. That ensures that no aggregated data would be stored in
multiple cells that represent unique attribute combinations. This in turn raises
the storage needs, which can at times exceed undesirable thresholds,
rendering the MOLAP solution untenable for massive, multi-dimensional data
sets. Compression strategies may help, but their use may damage MOLAP's
natural indexing.
2. Relational OLAP
Relational OLAP (ROLAP) uses the relational database architecture. Instead of a multidimensional array, the ROLAP data cube is stored as a collection of relational tables (approximately twice as many as the number of dimensions). Each of these tables, referred to as a cuboid, represents a particular view of the data.
Data Discretization in Data Mining
What is Data discretization?
Data discretization is a method of converting the values of continuous attributes into a finite set of intervals with minimal loss of information. It facilitates the handling of data by substituting interval labels for numeric values: for example, values of an 'age' attribute may be replaced by interval labels such as (0-10, 11-20, ...) or by conceptual labels such as (child, youth, adult, senior). Data discretization can be divided into supervised discretization, in which the class information is used, and unsupervised discretization, which depends only on the direction in which the operation proceeds, i.e. a 'top-down splitting strategy' or a 'bottom-up merging strategy.'
Many real-world data mining tasks involve continuous attributes, yet many of the existing data mining techniques cannot handle such attributes directly. Moreover, even when a machine learning algorithm can manage a continuous attribute, its performance can improve significantly if the continuous attribute is replaced with its quantized values. Data discretization converts continuous data into intervals and then assigns a specific label to each value within an interval; it can also mean expressing time in units of time intervals rather than exact values. The intervals of the discretized attribute need not contain all of the original values, but they must induce an ordering on the domain of the attribute. As a result, discretization can significantly increase the quality of the discovered knowledge and also reduce the running time of various data mining tasks such as association rule discovery, classification, and prediction. The improvement is gradual for domains with a small number of continuous attributes and grows as the number of attributes increases.
Discretization of the Top-down
If the procedure begins by first finding one or a few points to divide the whole
set of attributes (called split points or cut points) and then performs this
recursively at the resulting intervals, then it is called top-down discretization or
slicing.
Discretization from the Bottom-up
If the procedure begins by considering all the continuous values as possible
split-points, others are discarded by combining neighborhood values to form
intervals, so it is called bottom-up discretization or merging.
To have a hierarchical partitioning of the attribute values, known as a definition
hierarchy, discretization can be done quickly on an attribute.
Using the methods discussed below, data discretization can be extended to the
data to be converted.
Binning
This approach can be used for data discretization and, further, for concept hierarchy generation. The values observed for an attribute are distributed into a number of equal-width or equal-frequency bins, and the values in each bin are then smoothed using the bin mean or bin median. Applying the method recursively generates a concept hierarchy. Binning is an unsupervised discretization technique because it does not use any class information (a minimal sketch follows).
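The sketch below shows equal-width and equal-frequency binning with bin-mean smoothing, assuming pandas; the prices and the number of bins are invented for illustration.

```python
# Equal-width and equal-frequency binning, then smoothing by bin means.
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

equal_width = pd.cut(prices, bins=3)     # 3 bins covering equal value ranges
equal_freq  = pd.qcut(prices, q=3)       # 3 bins with roughly equal numbers of points
print(equal_freq.value_counts())

smoothed = prices.groupby(equal_width, observed=True).transform("mean")
print(smoothed.tolist())                 # each value replaced by its bin mean
```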
Histogram Analysis
The histogram distributes an attribute's observed value into a disjoint subset,
often called buckets or bins.
Cluster Analysis
Cluster analysis is a common form of data discretization. A clustering algorithm
may be implemented by partitioning the values of A into clusters or classes to
isolate a computational feature of A.
Each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy.
Data discretization using decision tree analysis
In decision tree analysis, data discretization is performed using a top-down splitting approach and is a supervised procedure. To discretize a numeric attribute, the split point with the lowest entropy is selected, and the process is applied recursively to break the attribute's range into several disjoint discretized intervals using the same splitting criterion.
Data discretization using correlation analysis
In correlation-based discretization, the best neighbouring intervals are identified and then merged, step by step, into larger intervals to form the final set of intervals. It is a supervised procedure.
Generation Concept Hierarchy for Nominal Data
A nominal attribute is one that has a finite number of distinct values with no ordering among them; examples include job category, age category, geographic region, and item category. Concept hierarchies for nominal attributes are formed by grouping attributes, for example street, city, state, and country together. A concept hierarchy transforms the data into multiple levels; it can be generated by specifying a partial or total ordering among the attributes, and this can be done at the schema level.
Why Discretization is Important?
Continuous data has an effectively infinite number of degrees of freedom (DoF), which creates mathematical difficulties. Data scientists therefore need discretization for a variety of purposes.
 Feature Interpretation - Because of their infinite degrees of freedom, continuous features have a lower probability of correlating cleanly with the target variable and may have complex non-linear interactions with it, so such features can be harder to interpret. After discretizing a variable, groups that correspond to the target can be observed directly.
 Signal-to-Noise Ratio - When we discretize a variable, we fit values into bins and reduce the influence of minor fluctuations in the data, which are often just noise. Discretization reduces this noise: it is a "smoothing" method in which fluctuations are smoothed out within each bin, thereby reducing noise in the results.
What is Multi-Dimensional Data Model?
A multidimensional model views data in the form of a data-cube. A data cube enables data to be
modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
The dimensions are the perspectives or entities with respect to which an organization keeps records. For example, a shop may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, and location. These dimensions allow the store to keep track of things such as the monthly sales of items and the locations at which the items were sold. Each dimension has a table related to it, called a dimension table, which describes the dimension further; for example, a dimension table for item may contain the attributes item_name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales. This theme is
represented by a fact table. Facts are numerical measures. The fact table contains the names of the
facts or measures of the related dimensional tables.
Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the table. In this 2-D representation, the sales for Delhi are shown with respect to the time dimension (organized by quarters) and the item dimension (organized by the types of items sold). The fact or measure displayed is rupees_sold (in thousands).
Now suppose we want to view the sales data with a third dimension. For example, suppose we consider the data according to time and item as well as location, for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3-D data are shown in the table, where the 3-D data are represented as a series of 2-D tables.
Conceptually, the same data may also be represented in the form of a 3-D data cube, as shown in the figure.
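The sketch below builds the time x item x location view described above as a pandas pivot table, so each location slice is one 2-D time x item table; pandas is assumed and the rupee figures are invented for illustration.

```python
# A 3-D sales view (time x item x location) flattened into stacked 2-D tables.
import pandas as pd

sales = pd.DataFrame({
    "location": ["Delhi", "Delhi", "Chennai", "Chennai", "Mumbai", "Mumbai"],
    "quarter":  ["Q1",    "Q2",    "Q1",      "Q2",      "Q1",     "Q2"],
    "item":     ["phone", "phone", "phone",   "phone",   "phone",  "phone"],
    "rupees_sold": [605, 680, 825, 952, 14, 31],
})
cube = sales.pivot_table(index="quarter", columns=["location", "item"],
                         values="rupees_sold", aggfunc="sum")
print(cube)           # all 2-D slices side by side
print(cube["Delhi"])  # one 2-D slice: time x item for Delhi
```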
Single Dimensional Boolean Association Rule Mining for Transaction Databases
Single dimensional Boolean association rule mining is a technique used to discover interesting
relationships or patterns in transaction databases. In this approach, the focus is on analyzing the
presence or absence of items in transactions and identifying associations between them.
Transaction Databases
A transaction database is a collection of transactions, where each transaction represents a set of
items purchased or used together. For example, in a retail store, each transaction may represent a
customer's purchase, and the items bought by the customer form the transaction.
Association Rule Mining
Association rule mining aims to find associations or relationships between items in a transaction
database. An association rule consists of an antecedent (a set of items) and a consequent
(another set of items). The rule indicates that if the antecedent is present in a transaction, the
consequent is likely to be present as well.
Single Dimensional Boolean Association Rule Mining
In single dimensional Boolean association rule mining, the focus is on analyzing the presence or
absence of items in transactions. It involves identifying frequent itemsets and generating
association rules based on these itemsets.
1. Frequent Itemsets: A frequent itemset is a set of items that appears frequently in the transaction
database. To identify frequent itemsets, the algorithm scans the transaction database and counts
the occurrences of each item or itemset. The support of an itemset is the proportion of
transactions in which the itemset appears. The algorithm selects itemsets with support above a
predefined threshold as frequent itemsets.
2. Association Rule Generation: Once frequent itemsets are identified, association rules can be
generated. An association rule has the form "antecedent => consequent," where both the
antecedent and consequent are itemsets. The confidence of a rule is the proportion of transactions
containing the antecedent that also contain the consequent. The algorithm selects rules with
confidence above a predefined threshold as interesting association rules.
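The sketch below illustrates the two steps just described: it enumerates frequent itemsets by scanning a small transaction database and then generates association rules whose confidence clears a threshold. The transactions and the support/confidence thresholds are invented for illustration, and the brute-force enumeration is only suitable for small examples.

```python
# Frequent itemsets and association rules from Boolean (presence/absence) transactions.
from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "diaper", "beer"},
                {"milk", "diaper", "beer"}, {"bread", "milk", "diaper"}]
min_support, min_conf = 0.5, 0.7
N = len(transactions)
items = sorted(set().union(*transactions))

def support(itemset):
    return sum(set(itemset) <= t for t in transactions) / N

# Frequent itemsets: every item combination whose support clears the threshold.
frequent = [set(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if support(c) >= min_support]

# Rules: split each frequent itemset into antecedent => consequent.
for itemset in (f for f in frequent if len(f) > 1):
    for k in range(1, len(itemset)):
        for antecedent in map(set, combinations(itemset, k)):
            consequent = itemset - antecedent
            conf = support(itemset) / support(antecedent)
            if conf >= min_conf:
                print(f"{sorted(antecedent)} => {sorted(consequent)} (conf={conf:.2f})")
```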
Benefits and Applications
Single dimensional Boolean association rule mining provides valuable insights into the
relationships between items in transaction databases. It has several benefits and applications,
including:
 Market Basket Analysis: By analyzing association rules, retailers can identify items that are
frequently purchased together. This information can be used for product placement, cross-selling,
and targeted marketing strategies.
 Web Usage Mining: Association rules can be used to analyze user behavior on websites. By
identifying patterns in users' navigation paths, website owners can optimize website design,
recommend relevant content, and personalize user experiences.
 Healthcare: Association rule mining can be applied to healthcare data to discover relationships
between symptoms, diseases, and treatments. This information can aid in diagnosis, treatment
planning, and disease prevention.
In conclusion, single dimensional Boolean association rule mining is a powerful technique for
discovering associations between items in transaction databases. It helps uncover valuable
insights that can be applied in various domains, such as retail, web analytics, and healthcare.
Data Mining - Applications & Trends
Data mining is widely used in diverse areas. There are a number of commercial data
mining system available today and yet there are many challenges in this field. In this
tutorial, we will discuss the applications and the trend of data mining.
Data Mining Applications
Here is the list of areas where data mining is widely used −
 Financial Data Analysis
 Retail Industry
 Telecommunication Industry
 Biological Data Analysis
 Other Scientific Applications
 Intrusion Detection
Financial Data Analysis
The financial data in banking and financial industry is generally reliable and of high
quality which facilitates systematic data analysis and data mining. Some of the typical
cases are as follows −
 Design and construction of data warehouses for multidimensional data analysis and data
mining.
 Loan payment prediction and customer credit policy analysis.
 Classification and clustering of customers for targeted marketing.
 Detection of money laundering and other financial crimes.
Retail Industry
Data mining has great application in the retail industry because the industry collects large amounts of data on sales, customer purchase history, goods transportation, consumption, and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability, and popularity of the web.
Data mining in retail industry helps in identifying customer buying patterns and trends
that lead to improved quality of customer service and good customer retention and
satisfaction. Here is the list of examples of data mining in the retail industry −
 Design and Construction of data warehouses based on the benefits of data mining.
 Multidimensional analysis of sales, customers, products, time and region.
 Analysis of effectiveness of sales campaigns.
 Customer Retention.
 Product recommendation and cross-referencing of items.
Telecommunication Industry
Today the telecommunication industry is one of the most rapidly emerging industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, and web data transmission. Due to the development of new computer and communication technologies, the telecommunication industry is expanding quickly, which is why data mining has become very important in helping to understand the business.
Data mining in telecommunication industry helps in identifying the telecommunication
patterns, catch fraudulent activities, make better use of resource, and improve quality of
service. Here is the list of examples for which data mining improves telecommunication
services −
 Multidimensional Analysis of Telecommunication data.
 Fraudulent pattern analysis.
 Identification of unusual patterns.
 Multidimensional association and sequential patterns analysis.
 Mobile Telecommunication services.
 Use of visualization tools in telecommunication data analysis.
Biological Data Analysis
In recent times, we have seen tremendous growth in fields of biology such as genomics, proteomics, functional genomics, and biomedical research. Biological data mining is a very important part of bioinformatics. Following are the aspects in which data mining contributes to biological data analysis −
 Semantic integration of heterogeneous, distributed genomic and proteomic databases.
 Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide sequences.
 Discovery of structural patterns and analysis of genetic networks and protein pathways.
 Association and path analysis.
 Visualization tools in genetic data analysis.
Other Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous data
sets for which the statistical techniques are appropriate. Huge amounts of data have been
collected from scientific domains such as geosciences, astronomy, etc. A large amount of
data sets is being generated because of the fast numerical simulations in various fields
such as climate and ecosystem modeling, chemical engineering, fluid dynamics, etc.
Following are the applications of data mining in the field of Scientific Applications −
 Data Warehouses and data preprocessing.
 Graph-based mining.
 Visualization and domain specific knowledge.
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In today's connected world, security has become a major issue, and the increased usage of the internet together with the availability of tools and tricks for intruding into and attacking networks has made intrusion detection a critical component of network administration. Here is the list of areas in which data mining technology may be applied for intrusion detection −
 Development of data mining algorithms for intrusion detection.
 Association and correlation analysis, aggregation to help select and build discriminating
attributes.
 Analysis of Stream data.
 Distributed data mining.
 Visualization and query tools.
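The sketch below shows one possible shape of a mining-based detector: a decision tree trained on a few hand-made connection records. The features (duration in seconds, bytes sent, failed-login count), the labels and all the values are assumptions for illustration only.

```python
from sklearn.tree import DecisionTreeClassifier

# Invented connection records: [duration_s, bytes_sent, failed_logins]
X_train = [
    [2,   300, 0],   # normal
    [1,   150, 0],   # normal
    [3,   500, 1],   # normal
    [60, 9000, 8],   # intrusion (many failed logins, heavy traffic)
    [45, 7000, 6],   # intrusion
    [50, 8000, 7],   # intrusion
]
y_train = ["normal", "normal", "normal", "intrusion", "intrusion", "intrusion"]

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Classify two new connection records.
print(clf.predict([[55, 8500, 9], [2, 200, 0]]))  # expected: ['intrusion' 'normal']
```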
Data Mining System Products
There are many data mining system products and domain-specific data mining
applications. New data mining systems and applications are continually being added to
existing ones, and efforts are being made to standardize data mining query languages.
Choosing a Data Mining System
The selection of a data mining system depends on the following features −
 Data Types − A data mining system may handle formatted text, record-based data
and relational data. The data could also be ASCII text, relational database data or data
warehouse data. Therefore, we should check exactly which formats the data mining
system can handle.
 System Issues − We must consider the compatibility of a data mining system with
different operating systems. One data mining system may run on only one operating
system or on several. There are also data mining systems that provide web-based user
interfaces and allow XML data as input.
 Data Sources − Data sources refer to the data formats on which the data mining system
will operate. Some data mining systems may work only on ASCII text files, while others
work on multiple relational sources. The data mining system should also support ODBC
connections or OLE DB for ODBC connections.
 Data Mining functions and methodologies − Some data mining systems provide
only one data mining function, such as classification, while others provide multiple
data mining functions such as concept description, discovery-driven OLAP
analysis, association mining, linkage analysis, statistical analysis, classification,
prediction, clustering, outlier analysis, similarity search, etc.
 Coupling data mining with databases or data warehouse systems − Data mining
systems need to be coupled with a database or a data warehouse system, so that the
coupled components are integrated into a uniform information processing environment.
The types of coupling are listed below; a small loose-coupling sketch follows the list −
o No coupling
o Loose Coupling
o Semi-tight Coupling
o Tight Coupling
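The sketch below illustrates loose coupling under simple assumptions: the data lives in a relational store (here an in-memory SQLite table with an invented schema), is fetched with SQL, and is then analysed outside the database in the mining environment rather than inside the database engine.

```python
import sqlite3
import pandas as pd

# Set up a small relational source (the table and columns are invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("north", 80.0), ("south", 200.0)])

# Loose coupling: fetch with SQL, then analyse in the mining environment.
df = pd.read_sql_query("SELECT region, amount FROM sales", conn)
print(df.groupby("region")["amount"].agg(["count", "sum", "mean"]))
conn.close()
```

In tighter forms of coupling, parts of the mining computation would instead be pushed into the database or warehouse itself.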
 Scalability − There are two scalability issues in data mining (a rough timing sketch
follows the two items below) −
o Row (Database size) Scalability − A data mining system is considered row scalable
if, when the number of rows is enlarged 10 times, it takes no more than 10 times as long
to execute a query.
o Column (Dimension) Scalability − A data mining system is considered column
scalable if the mining query execution time increases linearly with the number of
columns.
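One informal way to probe row scalability is to time the same operation on a data set and on a version ten times larger, as in the sketch below. The stand-in "query" is a simple aggregation written for illustration, not a real mining query.

```python
import time
import random

def run_query(rows):
    # Stand-in for a mining query: bucket the values and count each bucket.
    counts = {}
    for value in rows:
        counts[value % 100] = counts.get(value % 100, 0) + 1
    return counts

random.seed(0)
small = [random.randrange(10_000) for _ in range(100_000)]
large = small * 10   # ten times more rows

t0 = time.perf_counter(); run_query(small); t1 = time.perf_counter()
t2 = time.perf_counter(); run_query(large); t3 = time.perf_counter()

ratio = (t3 - t2) / (t1 - t0)
print(f"10x rows took {ratio:.1f}x as long (roughly <= 10x suggests row scalability)")
```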
 Visualization Tools − Visualization in data mining can be categorized as follows (a small
clustering-plot sketch follows the list) −
o Data Visualization
o Mining Results Visualization
o Mining process visualization
o Visual data mining
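The sketch below is a minimal example of mining-results visualization: synthetic two-dimensional points are clustered with k-means and plotted with each point coloured by its assigned cluster. The data and the number of clusters are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Two synthetic groups of 2-D points.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(4, 4), scale=0.5, size=(50, 2)),
])

# Mine the data (k-means clustering), then visualize the mining results.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

plt.scatter(points[:, 0], points[:, 1], c=labels)
plt.title("Mining results visualization: k-means clusters")
plt.show()
```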
 Data Mining query language and graphical user interface − An easy-to-use
graphical user interface is important to promote user-guided, interactive data mining.
Unlike relational database systems, which share SQL as a standard query language, data
mining systems do not share a common underlying data mining query language.
Trends in Data Mining
Data mining concepts are still evolving; here are the latest trends that we see
in this field −
 Application Exploration.
 Scalable and interactive data mining methods.
 Integration of data mining with database systems, data warehouse systems and web
database systems.
 Standardization of data mining query languages.
 Visual data mining.
 New methods for mining complex types of data.
 Biological data mining.
 Data mining and software engineering.
 Web mining.
 Distributed data mining.
 Real time data mining.
 Multi database data mining.
 Privacy protection and information security in data mining.