Data Mining in Business Intelligence
BUSINESS
INTELLIGENCE
ISBN:XXXXXXXXXXXXX
Printed @KLEF
DEDICATION
Table of Contents
Foreword
Preface
Introduction
The Software Process
Software Engineering Practice
Software Development Life Cycle
Reverse Engineering
Software process model
Types of software process models
Waterfall Model
V Model
Incremental Model
Iterative Model
RAD Model
Spiral Model
Agile model
Introduction - A Strategic Approach to Software Testing
Strategic Issues
Test Strategies for Conventional Software
Validation Testing
White-Box Testing
Black-Box Testing
FOREWORD
The authors have meticulously crafted each chapter, striking a
balance between theoretical foundations and practical applications.
Drawing from their collective expertise, they offer real-world case
studies, illustrating how data mining has revolutionized diverse
industries, including finance, marketing, healthcare, and more. These
examples demonstrate how organizations have gained a competitive edge
by extracting meaningful knowledge from vast data repositories and
converting it into actionable insights.
As you immerse yourself in the world of data mining, you will come
to appreciate the ethical implications surrounding data privacy, security,
and bias. Understanding the responsibility of wielding data as a powerful
tool is crucial in preserving trust and integrity in the digital landscape.
Dr. KARTHIKEYAN J.
Professor of English & Dean, Career Development, Sri Venkateswara
College of Engineering & Technology, Chittoor – 517127. Andhra
Pradesh
PREFACE
practical knowledge, and real-world examples to guide you on this
transformative journey.
security concerns are addressed to ensure that data mining is carried out
in a manner that respects individuals' rights and complies with
regulations.
INTRODUCTION
M1.U1.1. Data
Any unprocessed fact, value, text, sound, or image that has not yet been interpreted and analyzed falls under this category. Data is the most crucial component of data analytics, machine learning, and artificial intelligence: without data we cannot train any model, and none of this current research could be done.
M1.U1.1.1. Why analyze data:
Through data analysis, businesses can collect pertinent, accurate information that can be used to create future marketing strategies and business plans, and to realign the company's vision or mission.
M1.U1.1.2. Raw data to valuable information:
Computers require data. People require knowledge. Data is a building block; information provides context and meaning.
Turn Raw Data into Valuable Information:
Step One: Raw Data Extraction
This is the first step, and it enables the next. We must begin putting blocks together in a zero-to-one approach. Data extraction is the process of obtaining data from numerous online resources. Once the sources are prepared, you can begin the extraction process. Data can be extracted in a variety of ways; web scraping is a particularly useful technique today. An automated web scraping solution is beneficial because it eliminates the need to write scripts independently or hire developers.
For companies with a tight budget but high data consumption, it is the most practical and long-lasting solution.
Step Two: Data Analytics
As the quality of the data may directly affect the analysis result, it is important to verify its accuracy during the analysis stage. During this phase, data will be delivered to the consumers in a variety of reported formats, including dashboards and visualizations.
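To make the two steps concrete, here is a minimal Python sketch of an extraction-and-summary pipeline. It assumes a page that exposes tabular data in plain HTML; the URL is a hypothetical placeholder, and the requests and BeautifulSoup libraries stand in for whatever scraping solution a company actually adopts.

import requests
from bs4 import BeautifulSoup

def extract_rows(url):
    # Step One: pull raw data out of an online resource.
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows

def summarize(rows):
    # Step Two: verify the extracted data before it reaches a report.
    clean = [r for r in rows if all(r)]  # drop rows with empty cells
    return {"rows_extracted": len(rows), "rows_kept": len(clean)}

# Example usage (the URL is a placeholder, not a real data source):
# print(summarize(extract_rows("https://example.com/prices")))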
M1.U1.2. Lifecycle of Data:
M1.U1.2.1. Generation
M1.U1.2.1.5. Purging:
All copies of the data are erased during this stage. This is typically done with material that has already been archived. Since the growth of big data and the continued development of the Internet of Things (IoT), data lifecycle management (DLM) has gained importance. Globally, an ever-increasing number of devices are producing enormous amounts of data. It is crucial to maintain proper control over data throughout its life cycle to maximize its value and reduce the chance of mistakes. The last step is to archive or remove data when it has served its purpose.
M1.U2.1. What is Business Intelligence?
The procedural and technological framework that gathers, saves, and analyses the data generated by a company's operations is known as business intelligence (BI). BI is a broad term that includes descriptive analytics, performance benchmarking, process analysis, and data mining. Business intelligence organizes all the data that a company generates into manageable reports, performance metrics, and trends.
BI stands for the technological framework that gathers, organizes, and evaluates corporate data.
Managers can make better decisions thanks to BI, which analyses data and generates reports and insights.
BI solutions are created by software companies for businesses that want to use their data more effectively.
Spreadsheets, reporting/query software, data visualization software, data mining tools, and online analytical processing (OLAP) tools are just a few examples of the many different types of BI tools and software available.
Self-service BI is an analytical method that enables non-technical people to access and explore data.
M1.U2.2. Benefits of BI
Numerous factors influence why businesses use BI.
It is frequently utilized to support activities as varied as hiring, compliance, production, and marketing.
It is challenging to identify a company segment that does not
benefit from having better information to work with because BI
is a basic business value.
Faster, more accurate reporting and analysis, better data
quality, improved employee satisfaction, decreased costs and
increased revenues, and the capacity to make better business
decisions are just a few of the many advantages businesses can
experience after incorporating BI into their business models.
For instance, if you are in charge of setting up the production
schedules for a number of beverage factories and sales are
increasing significantly month over month in a specific area, you
can approve more shifts almost immediately to make sure your
factories can meet demand.
M1.U2.3. BI and DW in today's perspective:
According to Gartner, business intelligence is "an umbrella term that encompasses the applications, infrastructure, tools, and best practices that enable access to and analysis of information to enhance and optimise choices and performance."
In order to help users derive business insights, BI systems collect, organise, analyse, and show proprietary data. They may combine data from many sources, find trends or patterns in the data, and recommend best practices for visualisations and further steps. Insights can contain measurements from the past, projections for the future, analyses of competitor performance, and much more.
Among the advantages of business intelligence are:
Control over and access to private data
Better data literacy
Imaginative displays
Data analysis
Benchmarking
Performance supervision
Sales information
Streamlined processes
Eliminated speculation
It regulates data load.
It allows users to manage schemas such as tables, indexes, etc.
It allows users to retrieve massive volumes of data.
It allows users to produce reports.
It secures data.
M1.U2.5. BIDW:
The Kimball Group asserts that "data warehousing was rebranded as 'business intelligence.'" Because it correctly conveyed the transfer of the initiative and ownership of the data assets to the business, this relabeling was much more than a marketing strategy. However, the idea that corporate data users should own the information suggests that accessing and storing data (data warehousing) is equivalent to processing, analysing, and interpreting it (business intelligence).
To comprehend how BI and DW interact, it is necessary to first distinguish between the idea of business intelligence and the technologies that support it.
Business intelligence relies on gathering data from across the organisation and using data analysis to provide reports and global views. BI tools are software programmes that enable OLAP (online analytical processing), create reports and visualisations, and enhance BI analysis. Another component of a BI toolkit, one that focuses exclusively on gathering data, is the data warehouse.
M1.U3.1. Data warehousing:
A data warehousing (DW) process is used to gather and manage data from many sources in order to produce insightful business information. Business data from many sources is often connected and analysed using a data warehouse. The data warehouse is the central component of the BI system and is designed for data analysis and reporting.
The combination of several technologies and elements facilitates the strategic use of data. Large amounts of data are electronically stored by a company and are intended for analysis and inquiry rather than transaction processing. It is a process of converting data into information and promptly making it accessible to users so that it can have an impact.
One or more data sources send information to a data warehouse, which acts as a central store for that information. The transactional system and other relational databases feed data into a data warehouse.
Data could be:
Structured
Semi-structured
Unstructured
This makes sure that all the information is taken into
account. Data mining is made possible by data
warehousing. Data mining searches for patterns in the
data that could result in increased revenue and
profitability.
M1.U3.2. Need for data warehousing:
Airline: It is utilised for operational purposes in the airline system, such as personnel assignment, studies of route profitability, frequent flyer programme promotions, etc.
Banking: It is frequently used in the banking industry to manage resources efficiently. A few banks also use it for operations, product performance analysis, and market research.
Healthcare: Data warehouses are also utilised by the healthcare sector to plan and forecast outcomes, provide patient treatment reports, and share data with affiliated insurance firms, medical aid services, etc.
Government sector: Data warehouses are utilised for intelligence gathering in the public sector. They aid government authorities in the upkeep and analysis of each person's tax records, health policy records, and other data.
Investment and insurance sectors: In this industry, warehouses are largely used to track market trends, assess consumer trends, and analyse data patterns.
Retail chain: Data warehouses are frequently utilised in retail chains for distribution and marketing. Additionally, they aid in keeping track of products, consumer purchasing trends, promotions, and pricing policy.
Telecommunication: In this industry, distribution decisions, sales decisions, and product marketing decisions are all made using a data warehouse.
Hospitality sector: Based on customer feedback and travel habits, this industry uses warehouse services to plan and predict the locations for its advertising and promotion efforts.
M1.U3.2.1. Advantages:
Business users can easily access crucial data from a variety of sources using data warehouses.
Consistent data on multiple cross-functional operations is provided via the data warehouse.
Ad hoc reporting and querying are also supported.
To lessen the strain on the production system, data warehouses assist in integrating several data sources.
Using a data warehouse can speed up analysis and reporting overall.
Restructuring and integration make the data easier to use for reporting and analysis.
Data warehouses let users obtain crucial data from numerous sources in a single location, saving users' time when obtaining data from various sources.
A substantial amount of historical data is kept in data warehouses, which facilitates analysis of various time periods and patterns to make future predictions.
M1.U3.2.2. Disadvantages:
It takes a lot of time to create and implement a data warehouse.
A data warehouse can become out of date rather quickly.
Changes to data types, ranges, indexes, and searches are challenging to implement.
The data warehouse may appear simple, but it is actually too complicated for most consumers.
The scope of a data warehousing project will constantly expand despite best efforts at project management.
Different business rules may occasionally be developed by warehouse users.
Organizations must devote a significant amount of their resources to implementation and training.
M1.U3.3. Components of Data warehousing:
There are four components.
M1.U3.3.1. Load manager:
The load manager is also known as the front component. It completes all tasks necessary for the extraction and loading of data into the warehouse. To get the data ready for the data warehouse, these tasks also involve transformations.
M1.U3.3.2. Warehouse Manager:
The warehouse manager carries out tasks related to the administration of the data stored in the warehouse. It carries out tasks including data analysis to check for consistency, index and view building, denormalization and aggregate generation, transformation and merging of source data, and data archiving and backing up.
M1.U3.3.3. Query Manager:
The query manager is also known as the backend component. It executes all actions necessary for the administration of user queries. The activities of this data warehouse component schedule the execution of queries through direct queries to the necessary tables.
M1.U3.3.4. Tools for end-user access:
These are divided into five categories: 1. data reporting tools, 2. query tools, 3. application development tools, 4. EIS tools, and 5. OLAP and data mining tools.
M1.U3.4. Trends in data warehousing:
More capable data warehouses are needed as a result of the enterprise's "datafication":
An increasing stream of data is being produced by mobile devices, social media usage, networked sensors (i.e., the Internet of Things), and other sources. Some have referred to this stream of data as a "fire hose of data." IT teams are responding by enhancing data warehouse capabilities so they can manage more data, and more types of data, more quickly than ever.
Physical and logical consolidation help reduce costs:
Spending more money on these technologies is not the solution to the datafication problem. Or, to put it another way, ten times as much data shouldn't cost ten times as much money. In order to consolidate these expanding data warehouses, a mix of virtualization, compression, multi-tenant databases, and servers designed to handle significantly higher data volumes and workloads is required.
Optimized environment using Hadoop:
With its distributed file system (HDFS) and parallel MapReduce paradigm, the open source Hadoop framework excels at processing very large data volumes. Because of this, Hadoop is a fantastic complement to "traditional" data warehouses, which helps to explain why a growing number of data warehouse managers are turning to Hadoop to handle some of the busiest workloads.
Real-time analytics are used in customer experience (CX) initiatives to enhance marketing campaigns:
Data warehouses are essential to CX projects because they hold the information required to create a complete, 360-degree perspective of your client base. A data warehouse of customer information can be used for sentiment analysis, personalization, marketing automation, sales, and customer support.
Engineered systems are increasingly the preferred method for managing massive amounts of information:
Data warehouses can easily become a complicated assembly of several parts, including servers, storage, database software, and other elements, but that doesn't have to be the case. Engineered solutions that are preconfigured and tuned for certain workloads, like Oracle Big Data Appliance and Oracle Exadata Database Machine, give the highest levels of performance without the integration and configuration hassles.
M1.U3.5. Data Marts:
A "Data Mart" is a subset of a central information store that is often focused on a single use or key data topic and can be distributed to meet business needs. Data Marts are analytical record repositories created with a certain community within a company in mind, focusing on particular business processes. In the bottom-up data warehouse design process, organisational data marts are combined to build the data warehouse; in the top-down process, data marts are derived from subsets of data in a data warehouse.
Applications for business intelligence (BI) are the main usage of a data mart. Records are gathered, stored, accessed, and analysed using BI. Since a data mart is less expensive than constructing a data warehouse, smaller organisations can use it to make use of the data they have amassed.
Purpose of creating data marts:
Creates collective data for a group of users
Easy access to frequently needed data
Ease of creation
Improves end-user response time
Lower cost than implementing a complete data warehouse
Potential clients are more clearly defined than in a comprehensive data warehouse
It contains only essential business data and is less cluttered.
Types of data marts:
There are primarily two methods for designing data marts:
(i) Dependent Data Marts
(ii) Independent Data Marts
(i) Dependent Data Marts:
A dependent data mart is a logical subset of a physical subset of a larger data warehouse. In accordance with this method, the data marts are regarded as a data warehouse's subsets. This method starts by building a data warehouse, from which other data marts can be made. These data marts rely on the data warehouse and pull the crucial information from it. Because the data warehouse builds the data mart, this method eliminates the need for data mart integration. It is also referred to as a top-down strategy.
(ii) Independent Data Marts:
The second strategy is independent data marts (IDM). In this case, separate multiple data marts are first constructed, and then a data warehouse is designed using them. This method requires the integration of data marts because each data mart is independently built. As the data marts are combined to create a data warehouse, it is also known as a bottom-up strategy.
Steps involved in preparing Data Marts:
Designing the schema, building the physical storage,
populating the data mart with data from source systems,
accessing it to make educated decisions, and managing it over
time are the key implementation processes. These are the steps:
(i) Designing
The data mart process starts with the design step. Initiating
the request for a data mart, acquiring information about the
needs, and creating the data mart's logical and physical design
are all covered in this step.
The tasks involved are as follows:
Gathering the technological and business requirements
Finding the sources of data
Choosing the suitable subset of data
Designing the logical and physical architecture of the data mart
(ii) Constructing
This step involves building the physical storage: creating the physical database and the logical structures for the data mart.
(iii)Populating
This stage involves obtaining data from the source, cleaning
it up, transforming it into the appropriate format and level of
detail, and transferring it into the data mart.
The tasks involved are as follows:
Mapping the intended data sources to the target data structures
Extracting the data
Converting and cleaning up the data
Loading the data into the data mart
Establishing and preserving metadata
(iv) Accessing
In this step, the data is put to use through querying, analysis,
report creation, chart and graph creation, and publication.
The tasks involved are as follows:
Create a meta layer (intermediate layer) for the front-end tool to use. This layer converts database operations and object names into business terms so that end users can communicate with the data mart using language that is related to business processes.
Create and maintain database structures, such as summarised tables, to aid in the quick and effective execution of queries through front-end tools.
(v) Managing
This step covers the management of the data mart over its lifespan. The management duties carried out at this step are as follows:
Granting safe access to the information
Controlling the expansion of the data.
Improving the performance of the system.
Ensuring data accessibility in the case of system
breakdowns.
Data Warehouse vs. Data Mart:
A Data Warehouse is a vast repository of information collected from various organizations or departments within a corporation. A Data Mart is only a subtype of a Data Warehouse, architected to meet the requirements of a specific group of users.
A Data Warehouse may hold multiple subject areas. A Data Mart holds only one subject area (for example, finance or sales).
A Data Warehouse is a Centralized System. A Data Mart is a Decentralized System.
In a data warehouse, the objects are defined through metadata.
Metadata acts as a directory: the decision support system uses this directory to locate a data warehouse's contents. In a data warehouse, metadata is created for the names and meanings of the data. Additional metadata is created to time-stamp any extracted data and record the extracted data's source.
M1.U3.6.1. Categories of Metadata:
Metadata can be grouped into three major categories:
(i) Business Metadata: It contains details about who owns the data, how the business is defined, and how its policies change.
(ii) Technical Metadata: It consists of database system names, the names and sizes of tables and columns, and the data types and permitted values. Technical metadata also includes structural data such as indexes and the properties of primary and foreign keys.
(iii) Operational Metadata: It consists of data lineage and currency. The currency of data refers to whether it is live, archived, or erased. The history of data migration and alteration is referred to as the lineage of the data.
M1.U3.6.2. Role of Metadata:
In a data warehouse, metadata plays a crucial role. Although it has a distinct function from the warehouse data, metadata nonetheless has a significant impact. The numerous functions of metadata are described below.
M1.U3.6.3. Metadata Repository:
A metadata repository is a crucial component of a data warehouse system. It holds the metadata described below:
Definition of the data warehouse: It offers a description of the data warehouse's structure. The description covers the schema, views, hierarchies, derived data definitions, and the locations and contents of the data marts.
(i) Business metadata: It includes details about who owns the data, how the business is defined, and how its policies change.
(ii) Operational metadata: Data lineage and currency are both included in operational metadata. Data's status as live, archived, or purged is referred to as its currency. Data's "lineage" refers to its migration and modification history.
(iii) Data for mapping from the operational environment to the data warehouse: This includes the source databases and their contents, data extraction, data partition cleaning, transformation rules, data refresh rules, and purging rules.
One challenge is that metadata cannot be passed between tools in a standard or simple manner.
Terminal questions:
1. Define business intelligence.
2. Why is data analysis required in business intelligence?
3. Describe briefly the data warehousing architecture.
4. Interpret the concept of data marts and explain the different types of data marts.
5. Extract the different components of data warehousing and explain the need for data warehousing.
6. Contrast the differences between data marts and the data warehouse.
7. Explain briefly the lifecycle of data.
8. Illustrate the characteristics of DW and BI and how they interconnect with one another in today's working methods.
Module 2
Unit 1
M2.U1.1. Business intelligence:
Business intelligence is a term used to describe a set of ideas and approaches used to enhance business decision-making through the use of data and fact-based systems. It is the talk of a changing and expanding world. The aim of business intelligence is to enhance decision-making in business concepts and analysis. Business intelligence is more than simply one idea; it is a collection of ideas and approaches. Business intelligence relies on both intuition and analytics to make judgments.
Process applied in Business Intelligence:
Business intelligence (BI) transforms raw data into relevant information and then transforms that information into knowledge using a variety of procedures, technologies, and tools (such as Informatica or IBM tools). Decision-makers can then base their decisions on the insights, which are first extracted either manually or with software.
To keep it short and simple, business intelligence is about giving correct information to the organization's decision-makers in a proper and ethical manner. Important characteristics of business intelligence include:
Decision-making based on facts.
A 360-degree view of your company.
Keeping the virtual team on the same page.
Measurement for the purpose of developing KPIs (Key Performance Indicators) using historical data that has been fed into the system.
Identifying benchmarks and then establishing benchmarks for the various processes.
Business intelligence systems can be used to spot market trends as well as problems in the business that need to be discovered and fixed.
Business intelligence makes possible data visualisation, which improves the quality of data and, in turn, improves decision-making.
Because they are relatively inexpensive, business intelligence systems can be employed by large corporations and organisations as well as Small and Medium Enterprises.
M2.U1.2. Business intelligence user types:
Analyst (Data Analyst or Business Analyst): The company's statisticians, known as analysts, apply BI to the historical data that has been saved in the system.
Head of the company or manager: The company's head employs business intelligence to boost profitability by making judgements more effectively and using all the information available.
IT specialist: Also uses BI for his business.
Small business owners: A small businessman may use it because it is also reasonably priced.
Applications of Business Intelligence:
In decision-making by the decision-makers of organizations.
In data mining, while extracting knowledge.
In operational analytics and operational management.
In predictive analytics.
In prescriptive analytics.
In making structured data from unstructured data.
In decision support systems.
In Executive Information Systems (EIS).
Business Intelligence vs. Data Warehouse:
Business intelligence is a Decision Support System (DSS); a data warehouse is a data storage system.
Business intelligence deals with OLAP (Online Analytical Processing), data visualization, data mining, and query/reporting tools. A data warehouse deals with acquiring/gathering data, metadata management, cleaning of data, transforming data, data dissemination, and data recovery/backup planning.
Examples of BI software: SAP, Sisense, Datapine, Looker, etc. Examples of data warehouse software: BigQuery, Snowflake, Amazon Redshift, Panoply, etc.
M2.U1.4. Architecture of BI and DW:
(i) Data collection:
Businesses collect data from a variety of operational systems and more. Users can also gather it from secondary sources like market research reports and customer databases. To combine data from many sources, contemporary BI applications make use of powerful data connectors. The formats of the data can be structured, semi-structured, or unstructured.
(ii) Data integration:
To present a coherent view when analysing data, information from several sources must be combined. Data is extracted from various systems and then loaded into data warehouses. This procedure is known as ETL (extract, transform, and load). During data extraction, raw data is pulled from source sites such as databases, web pages, flat files, and SQL servers.
Filtering, cleaning, de-duplicating, and performing calculations and summaries on the raw data are all part of the transformation step. Additionally, converting the data into tables to conform to the desired data warehouse schema may involve changing the row and column headers, editing text strings, and changing the data. The data is then imported into the data warehouse as the final stage.
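As an illustration of the ETL procedure just described, the following sketch extracts rows from a flat-file source, transforms them (cleaning and de-duplicating), and loads them into a small warehouse table. It is a toy Python pipeline with made-up data, not a production ETL tool.

import csv, io, sqlite3

# Extract: read raw rows from a flat-file source (here an in-memory CSV).
RAW = io.StringIO("order_id,region,amount\n1,East,100\n2,West,80\n2,West,80\n3,East,\n4,West,250\n")
rows = list(csv.DictReader(RAW))

# Transform: filter out incomplete rows, de-duplicate, and convert types.
seen, clean = set(), []
for r in rows:
    if r["amount"] and r["order_id"] not in seen:
        seen.add(r["order_id"])
        clean.append((int(r["order_id"]), r["region"], float(r["amount"])))

# Load: import the conformed rows into the warehouse table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (order_id INTEGER, region TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean)
print(db.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall())
# [('East', 100.0), ('West', 330.0)]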
Data Storage:
In order to facilitate further analysis, data warehouses store structured data as a relational, columnar, or multi-dimensional database. This supports maintaining a single version of the truth throughout the organisation, data summarization, and cross-functional analysis.
Data Analyzer:
Deriving insightful conclusions comes next, after the data has been collected, cleaned, and transformed. Data analysis pulls pertinent, usable information from the dataset to aid organisations in decision-making. Graphs, charts, tables, maps, and other visual representations of these statistics or insights are frequently used.
With the drag-and-drop capabilities found in modern BI applications, business users can easily construct intuitive dashboards, reports, and visualisations without requiring a deep understanding of the technology.
Distribution of Data:
To encourage teamwork, share reports and dashboards with your teammates. Dashboards automatically update in real time, daily, weekly, or monthly. Users can also share dashboards in a secure viewer environment.
Other users cannot alter the material, but they can
manipulate and interact with it using assigned filters. Another
choice is to share reports and dashboards with stakeholders
outside the organisation by using a public URL.
Data insights
Finding patterns in table representations or numbers rising in
a line chart is a key step in gaining insightful information. It can
also refer to using a pie chart to represent the distribution of
income or the number of hours dedicated to various chores on a
given day. Analyzing historical data can reveal how a company
responds to a variety of factors, such as market changes,
seasonality, patterns, economic cycles, and more. Analyze data
points and trends that may be consistent with the state of the
economy so that firms can make more informed decisions.
Unit 2
M2.U2.1. Data Mining:
Data mining is the process of examining data from various angles to uncover hidden patterns and categorize it into useful information. This data is gathered and assembled in common areas such as data warehouses, where efficient analysis and data mining algorithms aid decision-making and other data requirements, ultimately reducing costs and generating income.
Organizations employ the data mining method to extract usable data from sizable databases in order to address business issues. It mostly transforms unprocessed data into insightful knowledge.
Data mining is the act of examining large amounts of data to look for patterns, identify trends, and develop an understanding of how to use the data. These results can then be used by data miners to anticipate outcomes or make judgements.
M2.U2.2. Motivation for Data Mining:
Data mining is the process of sifting through a large amount
of data stored in repositories using pattern recognition
technologies, including statistical and mathematical
methodologies, in order to discover new connections, patterns,
and trends that are helpful. It is via the examination of factual
datasets that new relationships are found and records are
compiled in ways that are both logical and beneficial to the data
owner.
In order to identify regularities or relations that were initially
unknown, a large amount of data must be selected, explored, and
modelled. The goal is to produce results that are both clear and
helpful to the database owner.
It is not restricted to the application of statistical methods or computer algorithms. It is a method of business intelligence that may be used to support corporate decisions when combined with information technology.
Data mining is related to data science. It is performed with a purpose by a person under certain circumstances, using a particular data set. Text mining, web mining, audio and video mining, picture data mining, and social media mining are just a few of the services available in this phase. It is carried out using straightforward or highly specialised software.
Data mining can be seen as an outcome of the ongoing development of data technology. Features including data gathering and database building, data management, and sophisticated data analysis have all evolved thanks to the database system market.
Classification according to the applications adapted:
A domain-specific application is considered here. For instance, data mining systems can be adjusted to suit the needs of the telecommunications, financial, stock market, email, and other industries.
Classification according to the type of techniques
utilized:
This strategy takes into account the level of user interaction
or the data analysis method used.
For instance, techniques focused on databases or data
warehouses, neural networks, visualisation, pattern recognition,
and machine learning.
Classification based on the kinds of knowledge that was
mined:
This is based on features including classification,
association, discrimination, correlation, and prediction, among
others.
Classification based on the kinds of databases that were
mined:
A database system can be categorised as a "data type," "data
use," or "data application" model.
M2.U2.4. DM Task Primitives
A data mining query is an input to the data mining system that can be used to specify a data mining task. Data mining task primitives are used to define a data mining query. With the help of these primitives, the user can interact with the data mining system to control it during discovery or to explore the results at various depths and angles.
The following are detailed by the data mining primitives:
1. Set of task-relevant data to be mined.
2. Kind of knowledge to be mined.
3. Background knowledge to be used in the discovery
process.
4. Interestingness measures and thresholds for pattern
evaluation.
5. Representation for visualizing the discovered patterns.
M2.U2.4.1. The set of task-relevant data to be mined
This details the parts of the database or collection of data the user is interested in, including the relevant attributes or dimensions of the data warehouse or database of interest.
This understanding of the domain to be mined is helpful for directing the knowledge discovery process and assessing the patterns discovered.
Concept hierarchies are a common form of background knowledge that enables data to be mined at various levels of abstraction.
A concept hierarchy is a series of mappings from basic concepts to higher-level, more abstract concepts.
Rolling Up - Generalization of Data: Enables viewing of data at more general and meaningful abstractions and facilitates understanding. Because the data is compressed, fewer input/output operations are needed.
Drilling Down - Specialization of Data: Higher-level concept values are replaced by lower-level concepts.
There may be more than one concept hierarchy for a given attribute or dimension, depending on the user's point of view.
Another type of background knowledge is user perceptions of the relationships in the data. As a simple illustration, consider a concept hierarchy for the attribute (or dimension) age, sketched below.
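Here is a minimal Python sketch of such a hierarchy for age; the cut-offs (young under 40, middle-aged under 60, senior otherwise) are illustrative assumptions, not values from the text.

def age_concept(age, level):
    # Map a raw age upward through the hierarchy: value -> range -> label.
    if level == "raw":
        return age
    if level == "range":  # first generalization step (roll-up)
        lo = (age // 10) * 10
        return f"{lo}-{lo + 9}"
    if level == "label":  # most abstract level
        return "young" if age < 40 else "middle-aged" if age < 60 else "senior"

ages = [23, 37, 45, 61]
print([age_concept(a, "range") for a in ages])  # ['20-29', '30-39', '40-49', '60-69']
print([age_concept(a, "label") for a in ages])  # ['young', 'young', 'middle-aged', 'senior']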
M2.U2.4.4. The interestingness measures and thresholds for pattern evaluation
Different kinds of knowledge may have different interestingness measures. They could be employed to direct the mining operation or, after discovery, to assess the patterns found. Support and confidence are two interestingness metrics for association rules. Rules whose support and confidence fall below user-specified thresholds are deemed uninteresting.
This refers to the format in which discovered patterns are to be displayed, which could be rules, tables, cross tabs, charts, graphs, decision trees, cubes, or other visual representations. Users must specify the types of presentation to be used for displaying the found patterns. Certain types of knowledge may be better represented by some representation forms than others.
M2.U3.1. Integration of a Data Mining System with a Database or a Data Warehouse:
To perform its functions effectively, the data mining system is connected with a database or data warehouse system. In order to function, a data mining system must be able to interface with other data systems, such as a database system. These systems can be integrated using the following potential integration schemes.
M2.U3.1.2. No coupling:
A data mining system will not use any database or
data warehouse system functionality if there is no
coupling.
Without such coupling, a data mining system may spend a lot of time looking for, gathering, cleaning, and altering data.
M2.U3.1.5. Tight coupling:
Tight coupling refers to the seamless integration of a
data mining technology with a database/data
warehousing system. One functional component of an
information system is the data mining subsystem. Data
structures, indexing schemes, and query processing
techniques used in database and data warehousing
systems are used to build and establish data mining
queries and functions. Due to its support for efficient
data mining functions implementation, high system
performance, and an integrated data processing
environment, it is highly desired.
M2.U3.3.Mining different kinds of knowledge in
databases:
Users may have varying levels of interest in various
types of knowledge. Data mining must therefore be able
to handle a variety of knowledge finding tasks.
Depending on the returned findings, the data mining process must be interactive.
Once the patterns are identified, they must be
communicated using high level languages and visual
aids. These visual representations ought to be simple to
understand.
The results from the partitions are then combined.
Databases are updated using incremental algorithms
rather than by mining the data anew from the beginning.
Terminal Questions:
1) Extract the importance of business intelligence and explain its working strategies in data mining techniques.
2) Demonstrate the working methodologies in data mining techniques with an architecture diagram.
3) Compare and contrast the working characteristics of business intelligence and data mining with a diagram.
4) Describe briefly the different functionalities of data mining.
5) Illustrate the different issues in data mining techniques when applied in applications.
6) Interpret the concept of business intelligence in data mining techniques.
7) Define the motivation of data mining.
8) Illustrate the concept of integration of a data mining system with a database or a data warehouse.
9) Explain the importance of the data warehouse.
10) Differentiate the characteristics of a database and a data warehouse.
M3.U1.1. KDD Process:
KDD is an iterative process that allows for the refinement of mining techniques, the integration of fresh data, and the transformation of existing data to provide new and better findings.
Data Cleaning:
Data cleaning is the process of removing noisy and pointless information from a collection.
Cleaning in case values are missing.
Removing random or variance errors from noisy data.
Using technologies for data transformation and discrepancy detection.
Data Integration:
Data integration is the process of combining heterogeneous data from various sources into a single source (data warehouse).
Integrating data with the aid of data migration technologies.
Integrating data with the aid of data synchronisation tools.
ETL (Extract-Transform-Load)-based data integration.
Data Selection:
Data selection is the process of deciding which data from the data collection are pertinent to the analysis and retrieving them.
Selecting data with a neural network.
Selecting data using decision trees.
Selecting data using Naive Bayes.
Selecting data by employing clustering or regression.
Data Transformation:
The process of changing data into the right form needed by mining procedures is known as data transformation. The two steps of data transformation are as follows:
Data Mapping: To record transformations, elements from the source base are assigned to the destination.
Code Generation: Creation of the actual transformation program.
Data Mining:
Data mining is characterised as the application of intelligent methods to identify potentially relevant patterns. It creates patterns from task-relevant data and uses characterisation or classification to determine the model's purpose.
Pattern Evaluation:
Pattern evaluation is defined as the identification of strictly increasing patterns that indicate knowledge, based on predetermined metrics.
Find each pattern's interestingness rating.
Summarising and visualising data to make it user-understandable.
Knowledge Representation:
Knowledge representation is a method for displaying the findings of data mining using visualisation tools.
Produce reports.
Create tables.
Create classification rules, characterization rules, and other discriminatory rules.
2) Data Preprocessing:
Before data can be used, preparation is necessary. Data preparation is the process of transforming unclean data into clean data. Before the algorithm is applied, the dataset is preprocessed to look for missing values, noisy data, and other irregularities.
3) Data Cleaning:
The practice of correcting or deleting inaccurate, damaged, improperly formatted, duplicate, or incomplete data from a dataset is known as data cleaning. When merging multiple data sources, there are numerous ways for data to become duplicated or incorrectly categorised.
Missing Value:
"Low," "Medium," or "High." Suppose that our aim is to develop a model that can forecast a student's success in college. Data rows without the success column are not helpful in predicting success, so they could very well be disregarded and eliminated before the algorithm is executed.
Noisy Value:
necessary, but it can also have a negative impact on the outcomes of any data mining analysis. Data mining can be made easier through statistical analysis, which can leverage information obtained from historical data to sort out noisy data.
The "tight coupling technique" and the "loose coupling
approach" are the two main strategies for integrating data.
Tight Coupling:
Loose Coupling:
In this case, a user-friendly interface is offered to accept the
user's query, change it so the source database can understand it,
and then send the transformed query directly to the source
databases to get the desired result.Additionally, the data is only
kept in the original source databases.
Issues in Data Integration:
Three factors must be taken into account during data integration: schema integration, redundancy detection, and resolving data value conflicts. Below is a quick explanation of each.
1. Integration of Schemas:
This involves combining metadata from several sources. The entity identification problem refers to matching equivalent real-world entities from various sources.
2. Detecting Redundancy:
If an attribute can be retrieved or deduced from another attribute or collection of attributes, it may be redundant. Inconsistent attribute values can also cause redundancies in the final data set. Correlation analysis can identify some redundancies.
3. Resolving Data Value Conflicts:
This is the third major problem with data integration. The values of an attribute for the same real-world entity may vary depending on the source. The "same" attribute may be recorded at a lower level of abstraction in one system than in another.
M3.U2.1. Data Reduction:
The data reduction technique produces a condensed representation of the original data that is substantially smaller in volume but maintains the original data's quality.
M3.U2.2. Data Cube Aggregation:
Using this method, data is compiled into a simpler form. Consider the data you acquired for your research for the years 2012 to 2014, which includes your company's revenue every three months. If you are interested in the annual sales rather than the quarterly figures, you can aggregate the data so that the resulting data summarises the total sales per year instead of per quarter. It provides a data summary.
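In practice this aggregation is a one-line group-by. The sketch below uses Python with pandas and made-up quarterly figures to roll twelve quarterly rows up into three annual ones.

import pandas as pd

# Quarterly revenue for 2012-2014 (the figures are invented for illustration).
df = pd.DataFrame({
    "year":    [2012]*4 + [2013]*4 + [2014]*4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "sales":   [224, 408, 350, 586, 310, 402, 390, 610, 355, 420, 415, 680],
})

# Aggregate along the quarter dimension: 12 rows shrink to 3 annual totals.
annual = df.groupby("year", as_index=False)["sales"].sum()
print(annual)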
M3.U2.2.1. Dimension Reduction:
When we encounter attributes that are only marginally relevant to our study, we apply attribute relevance analysis. Dimension reduction decreases data size by getting rid of irrelevant or outmoded attributes.
Step-wise Forward Selection:
We start with an empty set of attributes and, at each step, add whichever remaining attribute is best, judged by how relevant it is (in statistics, via a p-value). Assume that the data collection has the following attributes, just a few of which are redundant:
Initial attribute set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step 1: {X1}
Step 2: {X1, X2}
Step 3: {X1, X2, X5}
Step-wise Backward Selection:
This selection begins with the full set of attributes in the initial data and, at each stage, eliminates the worst attribute remaining in the set. Assume that the data collection has the following attributes, just a few of which are redundant:
Initial attribute set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6}
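Below is a compact Python sketch of both selection strategies. It assumes a scoring function that rates an attribute subset; here that is faked with a toy additive relevance table (the numbers are invented), whereas a real implementation would fit a model or compute p-values.

RELEVANCE = {"X1": 0.9, "X2": 0.7, "X3": 0.1, "X4": 0.05, "X5": 0.6, "X6": 0.02}

def score(attrs):
    # Stand-in for a statistical relevance measure of the subset.
    return sum(RELEVANCE[a] for a in attrs)

def forward_selection(attrs, k):
    chosen = []
    while len(chosen) < k:
        # Add the remaining attribute that improves the score the most.
        best = max((a for a in attrs if a not in chosen),
                   key=lambda a: score(chosen + [a]))
        chosen.append(best)
    return chosen

def backward_selection(attrs, k):
    chosen = list(attrs)
    while len(chosen) > k:
        # Drop the attribute whose removal hurts the score the least.
        drop = max(chosen, key=lambda a: score([x for x in chosen if x != a]))
        chosen.remove(drop)
    return chosen

attrs = ["X1", "X2", "X3", "X4", "X5", "X6"]
print(forward_selection(attrs, 3))   # ['X1', 'X2', 'X5'], matching the steps above
print(backward_selection(attrs, 3))  # ['X1', 'X2', 'X5']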
Data after compression may seem different from the original data, but it is still usable for extracting information.
M3.U2.3.1. Data Mining Primitives and Languages:
These primitives serve as the definition of a data mining query.
Task-relevant data: This is the database portion that needs to be investigated. Suppose you oversee All Electronics and are in charge of the company's sales throughout the US and Canada, and you want to focus in particular on Canadian consumers' purchasing patterns; you can specify just this portion instead of using the complete database for mining. The attributes involved are called the relevant attributes.
Kind of knowledge to be mined: This details the required data mining operations, such as characterisation, discrimination, association, classification, clustering, or evolution analysis. For instance, if you are researching Canadian consumers' purchasing patterns, you can decide to look for links between their consumer profiles and the products they prefer.
UNIT 1
M4.U1.1. Association Rule Mining:
Association rule mining is a data mining technique whose goal is to find the rules that may govern relationships and causal objects between sets of items. In a particular transaction involving numerous items, it looks for the principles governing how or why such items are frequently purchased together.
Large volumes of data are analysed using association rule mining to uncover intriguing linkages and relationships. An association rule displays the number of times an itemset appears in a transaction. Market basket analysis serves as a common illustration.
M4.U1.2. Market Basket Analysis:
Market basket analysis is one of the most important methods used by large organisations to demonstrate correlations between goods. It enables retailers to discover connections between the products that customers usually purchase together. Given a set of transactions, we can discover rules that anticipate the occurrence of an item based on the occurrences of other items in the transaction.
Before we start defining the rule, let us first see the basic definitions.
Frequent Itemset: An itemset whose support is greater than or equal to the minsup threshold.
Association Rule: An implication expression of the form X -> Y, where X and Y are any two itemsets.
Example: {Milk, Diaper} -> {Beer}
Rule Evaluation Metrics:
Support(s):
The number of transactions that include items in both the {X} and {Y} parts of the rule, as a fraction of the total number of transactions. It is a measure of how frequently the collection of items occurs together across all transactions.
Support(X -> Y) = (number of transactions containing X and Y) / (total number of transactions)
It is interpreted as the fraction of transactions that contain both X and Y.
Confidence(c):
The ratio of the number of transactions that include all items in both {X} and {Y} to the number of transactions that include all items in {X}.
Conf(X -> Y) = Supp(X ∪ Y) / Supp(X)
It measures how often items in Y appear in transactions that also contain the items in X.
Lift(l):
The lift of the rule X -> Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other. The expected confidence is the frequency (support) of {Y}.
Lift(X -> Y) = Conf(X -> Y) / Supp(Y)
A lift value near 1 indicates that X and Y appear together about as often as expected; greater than 1 means they appear together more than expected, and less than 1 means they appear together less than expected. Greater lift values indicate a stronger association.
c = Supp({Milk, Diaper, Beer}) / Supp({Milk, Diaper}) = 2/3 = 0.67
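These metrics can be checked with a few lines of Python. The exact transactions below are an assumption (the original table is not shown here), chosen as the classic five-transaction illustration consistent with the 2/3 figure above.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support(X | Y)               # support of the rule X -> Y
c = support(X | Y) / support(X)  # confidence of X -> Y
l = c / support(Y)               # lift of X -> Y
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
# support=0.40 confidence=0.67 lift=1.11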
information about the item, like its item ID, name, brand, category, supplier, and place of manufacture. Being able to condense a big amount of data and deliver it at a high conceptual level is helpful. For instance, summarising a huge number of elements related to Christmas season sales into a basic overview can be quite beneficial for sales and marketing managers. This calls for a crucial feature known as data generalisation.
Data generalisation is a method for abstracting a sizable collection of task-relevant information in a database from a low conceptual level to a higher one. It generates what are referred to as characteristic rules by summarising the general characteristics of objects in a target class. A database query is typically used to gather data related to a user-specified class, and the data are then run through a summarization module to extract the key information at various levels of abstraction. One can choose to describe, for instance, "OurVideoStore" customers who routinely rent more than 30 movies each year. The attribute-oriented induction approach, for instance, can be used to do data summarization when concept hierarchies on the attributes describing the target class are present. It should be noted that straightforward OLAP procedures are appropriate for characterisation of data when a data cube contains a summary of the data.
Presentation of Generalized Results:
Generalized Relation: Relations where some or all attributes are generalized, with counts or other aggregation values accumulated.
Cross-Tabulation: Mapping results into cross-tabulation form (similar to contingency tables).
Visualization Techniques: Pie charts, bar charts, curves, cubes, and other visual forms.
Quantitative characteristic rules: Mapping generalized results into characteristic rules with quantitative information associated with them.
Approaches:
Data cube approach (OLAP approach).
Attribute-oriented induction approach.
Summary:
Data generalization is the process that abstracts a large set of task-relevant data in a database from a low conceptual level to higher ones.
3) Attribute Relevance:
The fundamental idea behind attribute relevance analysis is to compute a measure of an attribute's relevance to a particular class or concept. Such metrics include the correlation coefficient, ambiguity, and information gain.
It is a statistical method for preparing data by ranking or filtering out the pertinent attributes. Measures of attribute relevance analysis can identify inappropriate features that should be excluded from the concept description process. The inclusion of this preprocessing step in class characterization or comparison is defined as analytical characterisation. By comparing the general characteristics of items between two classes, the target class and the opposing class, data discrimination creates discriminating rules.
It can result in a significant amount of generalisation.
It may reduce the number of characteristics, which helps us recognise patterns quickly.
Attribute relevance analysis for concept description is carried out as follows:
M4.U2.1) Finding Frequent Item Sets:
In association mining, frequent elements are found in the data collection. Frequent mining often identifies the intriguing connections and links between item sets in relational and transactional databases. In a nutshell, frequent mining identifies the elements that frequently coexist in a transaction or relation.
Need for Mining Associations:
The creation of association rules from a transactional dataset is known as frequent mining. If customers usually buy items X and Y together, it is wise to group them together in stores or offer a discount on one item when the other is bought. This has a significant impact on sales. For instance, it is likely that a consumer who purchases milk and bread will also purchase butter. As a result, the association rule is {milk, bread} => {butter}.
Important Definitions:
Confidence: A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.
Example On finding Frequent Itemsets – Consider the
given dataset with given transactions.
3-frequent itemsets:
{A, B, C} = 2 // ignore: not frequent because support count < minimum support count
{A, B, D} = 2 // ignore: not frequent because support count < minimum support count
{A, C, D} = 3 // maximal frequent
{B, C, D} = 3 // maximal frequent
2) Apriori Algorithm:
Apriori Property:
All non-empty subsets of a frequent itemset must be frequent. The key concept of the Apriori algorithm is the anti-monotonicity of the support measure: Apriori assumes that all subsets of a frequent itemset must be frequent, so if an itemset is infrequent, all of its supersets are infrequent.
minimum support count is 2
minimum confidence is 60%
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set).
Step-2: K=2
(II) Compare the candidate set (C2) support counts with the minimum support count (here min_support = 2); if the support_count of a candidate itemset is less than min_support, remove it. This gives us itemset L2.
Step-3:
(II) Compare the candidate set (C3) support counts with the minimum support count (here min_support = 2); if the support_count of a candidate itemset is less than min_support, remove it. This gives us itemset L3.
Step-4:
Confidence:
A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
sets from candidate itemsets; it will also scan the database many times repeatedly to find candidate itemsets. Apriori will be very slow and inefficient when memory capacity is limited and the number of transactions is large.
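The following is a compact Python sketch of the Apriori loop just described, written from the description above rather than from any particular library; the transactions are a toy dataset, and the minimum support count of 2 matches the example.

from itertools import combinations

transactions = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
                {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
                {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
MIN_SUPPORT = 2

def count(itemset):
    # Support count: number of transactions containing the itemset.
    return sum(itemset <= t for t in transactions)

# C1 -> L1: frequent single items.
items = sorted({i for t in transactions for i in t})
L = [frozenset([i]) for i in items if count(frozenset([i])) >= MIN_SUPPORT]
k = 2
while L:
    print(f"L{k-1}:", sorted((sorted(s), count(s)) for s in L))
    # Join step: combine (k-1)-itemsets into k-item candidates, then prune
    # by the Apriori property (every (k-1)-subset must already be frequent).
    frequent = set(L)
    candidates = {a | b for a in L for b in L if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
    L = [c for c in candidates if count(c) >= MIN_SUPPORT]
    k += 1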
IF-THEN Rules
Points to remember −
Rule Extraction
Here we will learn how to build a rule-based classifier by
extracting IF-THEN rules from a decision tree.
Points to remember −
One rule is created for each path from the root to the leaf
node.
To form a rule antecedent, each splitting criterion is
logically ANDed.
The leaf node holds the class prediction, forming the rule
consequent.
Input:
D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.
repeat
Rule = Learn_One_Rule(D, Att_vals, c);
remove tuples covered by Rule from D;
until termination condition;
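Below is a runnable toy version of this sequential covering loop in Python. The dataset and the greedy Learn_One_Rule (which simply picks the single attribute test with the highest accuracy) are illustrative stand-ins, not the book's exact algorithm.

DATA = [  # (attribute dictionary, class label)
    ({"outlook": "sunny", "windy": "no"},  "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stop"),
    ({"outlook": "rain",  "windy": "no"},  "play"),
    ({"outlook": "rain",  "windy": "yes"}, "stop"),
]
ATTS = ["outlook", "windy"]

def learn_one_rule(data, target):
    # Greedily pick the attribute=value test whose covered tuples are purest.
    best, best_acc = None, 0.0
    for att in ATTS:
        for val in {row[0][att] for row in data}:
            covered = [row for row in data if row[0][att] == val]
            acc = sum(lbl == target for _, lbl in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = (att, val), acc
    return best

def sequential_covering(data, target):
    rules = []
    while any(lbl == target for _, lbl in data):  # termination condition
        att, val = learn_one_rule(data, target)
        rules.append(f"IF {att} = {val} THEN class = {target}")
        data = [row for row in data if row[0][att] != val]  # remove covered tuples
    return rules

print(sequential_covering(DATA, "play"))  # ['IF windy = no THEN class = play']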
Rule Pruning
Note − This value will increase with the accuracy of R on
the pruning set. Hence, if the FOIL_Prune value is higher for the
pruned version of R, then we prune R.
3) Associative Classification:
To assist decision-makers, data mining is the process of identifying and extracting hidden patterns from various forms of data. Associative classification, a popular classification learning technique in data mining, combines classification and association rule discovery techniques to produce classification models.
Several data mining techniques can be utilised to determine the specific analysis and outcome, including classification analysis, clustering analysis, and multivariate analysis. The main purposes of association rules are to study and forecast consumer behavior. Classification analysis is mostly used to make judgements, ask questions, and forecast behaviour. Clustering analysis is mostly employed when no assumptions are made regarding potential correlations in the data. Regression analysis is used when attempting to forecast the value of a continuous dependent variable from a series of independent variables.
Types of Associative Classification:
Keep in mind: These business intelligence tools all vary in
robustness, integration capabilities, ease-of-use (from a technical
perspective) and of course, pricing.
1. SAP
Website: www.sap.com
2. Datapine
Website: www.datapine.com
3. MicroStrategy
Website: www.microstrategy.com
4. SAS Business Intelligence
Website: www.sas.com
5. Yellowfin BI
Website: www.yellowfinbi.com
6. QlikSense
Website: www.qlik.com
7. Zoho Analytics
Website: www.zoho.com/analytics
8. Sisense
Website: www.sisense.com
9. Microsoft Power BI
Website: www.powerbi.microsoft.com
10. Looker
Website: www.looker.com
11. Clear Analytics
Website: www.clearanalyticsbi.com
12. Tableau
Website: www.tableau.com
13. Oracle BI
Website: www.oracle.com
14. Domo
Website: www.domo.com
15. IBM Cognos Analytics
Website: www.ibm.com