
DATA MINING IN

BUSINESS
INTELLIGENCE

__________

ASOK KUMAR RATHINAM


Data Mining in Business Intelligence

Copyright © 2022 by ASOKKUMAR RATHINAM

All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means without written permission from the author.

ISBN:XXXXXXXXXXXXX

Printed @KLEF
DEDICATION

To my beloved mother, Subbulakshmi Rathinam, whose unwavering love and support have been the guiding light of my life. Your boundless encouragement and sacrifices have shaped me into the person I am today.
With heartfelt gratitude, I dedicate my journey in the realm of Data
Mining and Business Intelligence to you. Your wisdom and strength
have instilled in me the determination to explore the depths of
knowledge, uncovering hidden insights that drive success in the business
world. As I embark on this path, I carry your values and teachings as a
beacon, reminding me to embrace challenges with resilience and to
approach every endeavor with integrity. This dedication is a small token
of my admiration and appreciation for the exceptional mother you are.
Your presence in my life continues to inspire and motivate me, and I will
forever be grateful for your love and guidance.

Table of Contents
Foreword....................................................................................................7
Preface.......................................................................................................9
Introduction.............................................................................................11
The Software Process .......................................................................14
Software Engineering Practice................................................................15
Software Development Life Cycle...........................................................20
Reverse Engineering ..............................................................................23
Software process model..........................................................................27
Types of software process models..........................................................28
Waterfall Model......................................................................................28
V Model..................................................................................................30
Incremental Model..................................................................................31
Iterative Model........................................................................................33
RAD Model.............................................................................................35
Spiral Model............................................................................................36
Agile model.............................................................................................37
Introduction - A Strategic Approach to Software Testing......................64
Strategic Issues........................................................................................66
Test Strategies for Conventional Software..............................................73
Validation Testing...................................................................................86
White-Box Testing.................................................................................105
Black Box Testing.................................................................................109

FOREWORD

In this rapidly evolving digital era, businesses face an overwhelming challenge: to navigate through the vast sea of data that surrounds them
and extract valuable insights that can lead to smarter decisions and
strategic advantages. The art and science of accomplishing this
monumental task lie at the heart of data mining in business intelligence.

As technology has advanced, data has emerged as the lifeblood of organizations, flowing in from countless sources, spanning various
formats and types. The data deluge is both an opportunity and a
hindrance, presenting unparalleled prospects for innovation while also
posing significant obstacles to harnessing its true potential. Herein lies
the significance of data mining – the systematic process of discovering
hidden patterns, trends, and relationships within this ocean of
information.

This book, "Data Mining in Business Intelligence," delves deep into


the realms of data-driven decision-making, empowering readers to
understand, utilize, and maximize the transformative power of data
mining techniques. Whether you are a seasoned business executive, an
aspiring data scientist, or a curious learner, this comprehensive guide
offers a roadmap to navigate the intricacies of modern data analysis and
its role in driving business success.

Leading you through an enlightening journey, the book covers fundamental concepts, methodologies, and cutting-edge tools that
facilitate data mining endeavors. By demystifying complex algorithms
and statistical techniques, it enables you to grasp the essence of
predictive modeling, clustering, association rule mining, and anomaly
detection – all critical components of business intelligence.

The authors have meticulously crafted each chapter, striking a
balance between theoretical foundations and practical applications.
Drawing from their collective expertise, they offer real-world case
studies, illustrating how data mining has revolutionized diverse
industries, including finance, marketing, healthcare, and more. These
examples demonstrate how organizations have gained a competitive edge
by extracting meaningful knowledge from vast data repositories and
converting it into actionable insights.

As you immerse yourself in the world of data mining, you will come
to appreciate the ethical implications surrounding data privacy, security,
and bias. Understanding the responsibility of wielding data as a powerful
tool is crucial in preserving trust and integrity in the digital landscape.

In closing, "Data Mining in Business Intelligence" is a powerful


resource that equips you to unlock the potential of data-driven decision-
making. The knowledge gained from this book empowers you to make
informed choices, optimize operations, and innovate with confidence.
Embrace this transformative journey, and let the revelations within these
pages serve as your guiding light in a data-powered future.

Dr. KARTHIKEYAN J.
Professor of English & Dean, Career Development, Sri Venkateswara
College of Engineering & Technology, Chittoor – 517127. Andhra
Pradesh

PREFACE

Welcome to "Data Mining in Business Intelligence," a


comprehensive guide that explores the powerful intersection of data
mining and business intelligence to unlock valuable insights and drive
informed decision-making in today's competitive landscape. In this book,
we delve into the world of data mining and its vital role in transforming
raw data into actionable knowledge, enabling businesses to thrive in an
increasingly data-driven era.

Data has become an abundant resource in our digital age, with organizations generating vast amounts of data every day. However, the
real challenge lies in extracting meaningful information from this data
deluge and transforming it into valuable insights. This is where data
mining steps in, providing the essential tools and techniques to discover
patterns, trends, and relationships hidden within the data, which can lead
to valuable discoveries and opportunities.

Business Intelligence (BI) complements data mining by providing the framework to gather, analyze, and present the data-driven insights to
business stakeholders, enabling them to make informed decisions and
drive positive outcomes. When data mining and business intelligence
converge, they form a powerful synergy, capable of revolutionizing the
way organizations operate, innovate, and gain a competitive edge.

In this book, we aim to provide a comprehensive and accessible understanding of data mining and its integration into the broader context
of business intelligence. Whether you are a seasoned data professional
looking to enhance your skills or a business leader seeking to leverage
data for strategic decision-making, this book offers valuable insights, practical knowledge, and real-world examples to guide you on this
transformative journey.

Key features of this book:

Fundamentals of Data Mining: We begin by laying the groundwork, exploring the core principles and techniques of data mining, including
classification, clustering, association rule mining, and more. We
emphasize the importance of data preparation and cleaning to ensure
high-quality results.

Data Mining Algorithms: Delve into various data mining algorithms, from traditional statistical methods to cutting-edge machine learning
techniques. We provide clear explanations and code examples to help
you implement these algorithms in your own projects.

Data Visualization: Learn the art of presenting data mining results effectively through data visualization. Discover how to create compelling
visual representations that communicate complex insights in a simple
and understandable manner.

Integration with Business Intelligence: Understand how data mining fits into the broader landscape of business intelligence. Explore the
synergy between data mining and BI tools, and how they work together
to drive data-driven decision-making.

Real-world Case Studies: Throughout the book, we showcase real-world case studies from various industries, illustrating how data mining
has been successfully applied to solve practical business challenges,
drive revenue, reduce costs, and improve overall efficiency.

Ethical and Privacy Considerations: We emphasize the importance of ethical data mining practices and the responsible use of data. Privacy and security concerns are addressed to ensure that data mining is carried out
in a manner that respects individuals' rights and complies with
regulations.

We hope that "Data Mining in Business Intelligence" serves as a


valuable resource, empowering you to harness the full potential of data
mining to drive innovation and growth within your organization. As the
business landscape continues to evolve, the ability to extract knowledge
from data and convert it into actionable insights will remain a critical
competitive advantage. Let this book be your guide as you embark on
this exciting journey of data discovery and business intelligence.
Asokkumar Rathinam

INTRODUCTION

Welcome to the fascinating world of "Data Mining in Business


Intelligence." In this illuminating book, we embark on a transformative
journey into the realm of data-driven decision-making and uncover the
powerful insights that lie hidden within vast and complex datasets. As
businesses increasingly harness the potential of data to gain a
competitive edge, the art and science of data mining have emerged as
pivotal tools in extracting valuable patterns, trends, and knowledge from
information overload. With a focus on practical applications and cutting-
edge techniques, this book serves as an indispensable guide for
professionals, students, and enthusiasts eager to unravel the secrets of
data mining and leverage its potential to drive innovation, optimize
processes, and make informed, strategic choices that steer businesses
toward success in the dynamic landscape of the 21st century.

M1.U1.1. Data
Any unprocessed fact, value, text, sound, or image that has not yet been interpreted and analyzed falls under this category.
Data is the most crucial component of data analytics, machine learning, and artificial intelligence. Without data we cannot train any model, and much of today's research could not be carried out.
M1.U1.1.1. Why analyze data:
Through data analysis, businesses can collect pertinent, accurate information that can be used to create future marketing strategies and business plans, and to realign the company's vision or mission.
M1.U1.1.2. Raw data to valuable information:

 Primary data is another name for raw data.
 It is gathered from a single source and must be processed in order to be informative.
 Data in business can be found everywhere.
 Data that circulates on the internet, such as pictures, Instagram posts, Facebook followers, comments, and competitors' followers, could be external.
 Internal data from a business operating system, such as a content management system (CMS), could also be used.
 For better decision-making, data is sorted and processed before being presented as patterns and trends.
 As Martin Doyle put it, "Computers require data. People require knowledge. Data is a building block. Information provides context and meaning."
 Turn raw data into valuable information:
 Step One: Raw Data Extraction
 This is the first step, and it enables the ones that follow.
 We must begin putting the blocks together in a zero-to-one approach.
 Data extraction is the process of obtaining data from numerous online resources.

M1.U1.1.3. Identify the source:

 Before we begin the extraction, it is important to confirm the legality of the source and the accuracy of the data.
 To acquire more information, you can check the Terms of Service (ToS).
 For instance, everyone is aware of the enormous benefits LinkedIn offers in terms of sales opportunities.
 LinkedIn, accordingly, has strong policies about any kind of automated site-access method known as "scraping."

M1.U1.1.4. Start extraction:

Once the sources are prepared, you can begin the extraction process. Data can be extracted in a variety of ways; web scraping is a particularly useful technique today. An automated web scraping solution is beneficial because it removes the need to write scripts independently or hire developers. For companies with a tight budget but high data consumption, it is the most practical and sustainable option.
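As a rough illustration of Step One, here is a minimal Python sketch of scripted extraction. The URL and the page layout (a simple HTML table of rows) are hypothetical, and in practice an automated scraping tool or an official API, used within the source's Terms of Service, would take the place of hand-written code like this.

# Minimal extraction sketch (hypothetical URL and page layout).
import csv
import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"        # hypothetical source
html = requests.get(url, timeout=30).text    # fetch the raw page
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.select("table tr")[1:]:       # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

# Persist the raw extract so the analytics step can pick it up later.
with open("raw_products.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)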
Step Two: Data Analytics
Because the quality of the data directly affects the analysis result, it is important to verify accuracy during the analysis stage. During this phase, data is delivered to consumers in a variety of reporting formats, including dashboards and visualizations.

As an example, we can monitor and analyse social media platforms using Power BI in order to test a marketing plan, product quality, and crisis management. Information in business must be interpreted in light of the organization's context; context is the main factor in producing actionable insights.

Take Cheetos as an illustration. The trend line shows 50k mentions in April: Twitter mentions of Cheetos reached 50,000 that month. Despite being a large number, on its own it only tells us how many mentions there were at that particular instant.

50k mentions in April

40k mentions in March

1,673 mentions in February

From March to April, Cheetos receives about 10,000 more mentions.

From February to March, Cheetos receives about 38k more mentions.
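The month-over-month changes quoted above can be reproduced with a few lines of Python; the mention counts are simply the figures from the example.

# Month-over-month change in mentions, using the counts from the example.
mentions = {"February": 1_673, "March": 40_000, "April": 50_000}

months = list(mentions)
for prev, curr in zip(months, months[1:]):
    delta = mentions[curr] - mentions[prev]
    print(f"{prev} -> {curr}: {delta:+,} mentions")
# February -> March: +38,327 mentions (roughly 38k)
# March -> April: +10,000 mentions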

The most important part is data storage, because businesses depend on it to keep their data safe. The speed and capacity of data storage vary.

M1.U1.2. Lifecycle of Data:

The data life cycle is the process by which a specific piece of data moves from being created or captured to being archived and/or deleted at the end of its useful life. Although the details can vary, data management specialists frequently count six or more stages in the data life cycle. Here's an illustration:

M1.U1.2.1. Generation

Data enters an organization during this phase through data entry, acquisition from an external source, or signal reception, such as the receipt of transmitted sensor data.

M1.U1.2.2. Maintenance

Data is processed here before being used. Processes such as integration, cleaning, and extract-transform-load (ETL) may be applied to the data.

M1.U1.2.3. Active use

Data is put to use in this stage to support the organization's goals and activities.

M1.U1.2.4. Publication

During this phase, data is provided outside the organization, though not necessarily made available to the general public. A particular unit of data's life cycle may or may not include publication. During the archiving phase, data is removed from all active production environments; although it is no longer processed, used, or published, it is kept on file in case it is needed again.

M1.U1.2.5. Purging
All copies of the data are erased during this stage. This is typically done on data that has already been archived.
With the growth of big data and the continued development of the Internet of Things (IoT), data lifecycle management (DLM) has gained importance. Globally, an ever-increasing number of devices are producing enormous amounts of data. It is crucial to maintain proper control over data throughout its life cycle to maximize its value and reduce the chance of mistakes. A final step is to archive or remove data once it has served its purpose.

M1.U2.1. What is Business Intelligence?
The procedural and technological framework that gathers, stores, and analyses the data generated by a company's operations is known as business intelligence (BI). BI is a broad term that includes descriptive analytics, performance benchmarking, process analysis, and data mining. Business intelligence organizes all the data that a company generates into manageable reports, performance metrics, and trends.
 BI stands for the technological framework that gathers,
organizes, and evaluates corporate data.
 Managers may make better decisions thanks to BI,
which analyses data and generates reports and insights.
 BI solutions are created by software companies for
businesses that want to use their data more effectively.
 Spreadsheets, reporting/query software, data visualization software, data mining tools, and online analytical processing (OLAP) are just a few examples of the many different types of BI tools and software available.
 Self-service BI is an analytical method that enables non-
technical people to access and explore data.

M1.U2.2. Benefits of BI
Numerous factors influence why businesses use BI. It is frequently utilized to support activities as varied as hiring, compliance, production, and marketing.

It is challenging to identify a company segment that does not
benefit from having better information to work with because BI
is a basic business value.
Faster, more accurate reporting and analysis, better data
quality, improved employee satisfaction, decreased costs and
increased revenues, and the capacity to make better business
decisions are just a few of the many advantages businesses can
experience after incorporating BI into their business models.
For instance, if you are in charge of setting up the production
schedules for a number of beverage factories and sales are
increasing significantly month over month in a specific area, you
can approve more shifts almost immediately to make sure your
factories can meet demand.
M1.U2.3. BI and DW in today’s perspective:
According to Gartner, business intelligence is "an umbrella
phrase that encompasses the applications, infrastructure, tools,
and best practises that enable access to and analysis of
information to enhance and optimise choices and performance.
In order to help users derive business insights, BI systems
collect, organise, analyse, and show proprietary data. It may
combine data from many sources, find trends or patterns in the
data, and recommend best practises for visualisations and further
steps.Insights can contain measurements from the past,
projections for the future, analyses of competition performance,
and much more.
Among the advantages of business intelligence are:
 Control over and access to private data
 Better data literacy
 Imaginative displays
 Data analysis
 Benchmarking
 Performance supervision

 Sales information
 Streamlined processes
 Eliminated speculation

M1.U2.4. Data warehousing:


 A data warehouse is described by Gartner as "a storage
architecture designed to retain data derived from
operational data stores, transaction systems, and external
sources."
 The data is subsequently combined in the warehouse into
an aggregate, summary form suitable for reporting and
enterprise-wide data analysis for predetermined business
purposes.
 Transactional databases are unable to handle complicated queries, whereas data warehouses can. A warehouse can also initiate the cleaning process by reconciling different data storage schemas according to the type of data.
 Data cannot be changed once it is stored in a warehouse.
 Data warehouses can only analyse past data; they cannot
provide real-time data or forecast the future.

M1.U2.4.1. A data warehouse's fundamental characteristics are:
 It uses a lot of historical data
 It allows both ad hoc and planned inquiries

 It regulates data load
 It allows users to manage schema objects such as tables and indexes
 It retrieves massive volumes of data
 It allows users to produce reports
 It secures data

M1.U2.5. BIDW:
The Kimball Group asserts that "data warehousing was rebranded as 'business intelligence.'" Because it correctly conveyed the transfer of the initiative and ownership of the data assets to the business, this relabeling was much more than a marketing strategy. The idea that corporate data users should own the information can, however, suggest that accessing and storing data (data warehousing) is equivalent to processing, analysing, and interpreting it (business intelligence).
It is necessary to first distinguish between the idea of business intelligence and the technologies that support it in order to comprehend how BI and DW interact.
Business intelligence relies on gathering data from across the organisation and using data analysis to provide reports and global views. BI tools are software programmes that enable online analytical processing (OLAP), create reports and visualisations, and enhance BI analysis. A data warehouse is another component of a BI toolkit, one that focuses exclusively on gathering and storing data.

M1.U3.1. Data warehousing:
A data warehousing (DW) process is used to gather and
manage data from many sources in order to produce insightful
business information. Business data from many sources is often
connected and analysed using a data warehouse. The central
component of the BI system, which is designed for data analysis
and reporting, is the data warehouse.
The combination of several technologies and elements facilitates the strategic use of data. A data warehouse stores large amounts of a company's data electronically, intended for analysis and querying rather than transaction processing. It is a process of converting data into information and making it promptly accessible to users so that it can have an impact.
One or more data sources send information to a data
warehouse, which acts as a central store for that information. The
transactional system and other relational databases feed data into
a data warehouse.
Data could be:
 Structured
 Semi-structured
 Unstructured data

 Users can access the transformed data in the Data Warehouse through Business Intelligence tools, SQL clients, and spreadsheets after the data has been cleaned, transformed, and ingested.
 A data warehouse compiles information from various
sources into a single thorough database.
 An organisation can examine its clients more thoroughly
by combining all of this data in one location.

 This makes sure that all the information is taken into
account. Data mining is made possible by data
warehousing. Data mining searches for patterns in the
data that could result in increased revenue and
profitability.
M1.U3. 2. Need for data warehousing:
 Airline: It is utilised for operational purposes in the airline system, such as personnel assignment, studies of route profitability, frequent flyer programme promotions, etc.
 Banking: It is frequently used in the banking industry to manage the resources on the desk efficiently. A few banks also use it for operations, product performance analysis, and market research.
 Healthcare: Data warehouses are also utilised by the healthcare sector to plan and forecast outcomes, provide patient treatment reports, and share data with affiliated insurance firms, medical aid services, etc.
 Government sector: Data warehouses are utilised for intelligence collection in the public sector. They aid government authorities in the upkeep and analysis of each person's tax records, health policy records, and other data.
 Investment and insurance sector: In this industry, warehouses are largely used to track market trends, assess consumer trends, and analyse data patterns.
 Retail chain: Data warehouses are frequently utilised in retail chains for distribution and marketing. They also aid in keeping track of products, consumer purchasing trends, promotions, and pricing policy.
 Telecommunication: In this industry, distribution decisions, sales decisions, and product marketing decisions are all made using a data warehouse.
 Hospitality sector: Based on customer feedback and travel habits, this industry uses warehouse services to plan and predict the locations for its advertising and promotion efforts.
M1.U3. 2.1 Advantages:
 Business users can easily access crucial data from a
variety of sources using data warehouses.
 Consistent data on multiple cross-functional operations
is provided via data warehouse.
 Ad hoc reporting and querying are also supported.
 To lessen the strain on the production system, data
warehouses assist in integrating several data sources.
 Using a data warehouse can speed up analysis and
reporting overall.
 The user can utilise it more easily for reporting and
analysis thanks to restructuring and integration.
 Users can obtain crucial data from numerous sources in
a single location thanks to data warehouses.
 As a result, it saves users' time when obtaining data from
various sources.
 A substantial amount of historical data is kept in data
warehouses.
 This facilitates user analysis of various time periods and
patterns to make future predictions.
M1.U3. 2.2.Disadvantages:
 It takes a lot of time to create and implement a data
warehouse.
 Data Warehouse can get out of date rather soon.
 Changes to data types, ranges, indexes, and searches are
challenging to implement.

 The data warehouse may appear simple, but it is actually too complicated for many users.
 The scope of a data warehousing project will constantly expand despite best efforts at project management.
 Different business rules may occasionally be developed by warehouse users.
 Organizations must devote a significant amount of their resources to implementation and training.
M1.U3. 3. Components of Data warehousing:
There are four components
M1.U3. 3. 1. Load manager:
The front component is another name for the load manager.
It completes all tasks necessary for the extraction and
loading of data into the warehouse.
To get the data ready for the data warehouse, these activities
also involve transformations.
M1.U3. 3.2. Warehouse Manager:
The warehouse manager carries out tasks related to the administration of the data stored in the warehouse. It performs tasks including data analysis to check for consistency, index and view building, denormalization and aggregate generation, transformation and merging of source data, and data archiving and backing up.
M1.U3. 3.3. Query Manager:
The term "backend component" also applies to the query
manager.
It executes all actions necessary for the administration of
user inquiries.
The operations of this data warehouse component include directing queries to the appropriate tables and scheduling the execution of queries.
M1.U3. 3.4. Tools for end-user access:

These are divided into five categories: 1. data reporting tools, 2. query tools, 3. application development tools, 4. EIS tools, and 5. OLAP and data mining tools.
M1.U3. 4 Trends in data warehousing:
More capable data warehouses are needed as a result of
the enterprise's "datafication.":
A increasing stream of data is being produced by mobile
devices, social media usage, networked sensors (i.e., the Internet
of Things), and other sources. Some have referred to this stream
of data as a "fire hose of data."IT teams are responding by
enhancing data warehouse capabilities so they can manage more
data, more types of data, and do so more quickly than ever.
Physical and logical consolidation help reduce costs.
Simply spending more money on these technologies is not the solution to the datafication problem. Or, to put it another way, ten times as much data shouldn't cost ten times as much money. Consolidating these expanding data warehouses requires a mix of virtualization, compression, multi-tenant databases, and servers designed to handle significantly higher data volumes and workloads.
Optimized environment using Hadoop.
With its distributed file system (HDFS) and parallel MapReduce paradigm, the open source Hadoop platform excels at processing very large data volumes. Because of this, Hadoop is a strong complement to "traditional" data warehouses, which helps to explain why a growing number of data warehouse managers are turning to Hadoop to handle some of the busiest workloads.
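To make the MapReduce idea concrete, here is a toy, single-machine sketch in plain Python. Real Hadoop jobs are written against the Hadoop APIs and run over HDFS across a cluster; this only mimics the map, shuffle, and reduce phases on a small in-memory list.

# Toy MapReduce-style word count: map each record to (key, 1) pairs,
# group the pairs by key, then reduce each group by summing the counts.
from collections import defaultdict

records = ["error warn info", "info info error", "warn error error"]

# Map phase: emit (word, 1) for every word in every record.
mapped = [(word, 1) for line in records for word in line.split()]

# Shuffle phase: group the intermediate pairs by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each group.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {'error': 4, 'warn': 2, 'info': 3}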
Real-time analytics are used in customer experience (CX) initiatives to enhance marketing campaigns:
Data warehouses are essential to CX projects because they hold the information required to create a complete, 360-degree perspective of the client base. A data warehouse of customer information can be used for sentiment analysis, personalization, marketing automation, sales, and customer support.
Engineered systems are increasingly the preferred method for managing massive amounts of information.
Data warehouses can easily become a complicated assembly of many parts, including servers, storage, database software, and other elements, but that doesn't have to be the case. Engineered solutions that are preconfigured and tuned for certain workloads, such as Oracle Big Data Appliance and Oracle Exadata Database Machine, deliver the highest levels of performance without the integration and configuration hassles.
M1.U3. 5. Data Marts:
A subset of a central information store called a "Data Mart"
is often focused on a single use or key data topic and can be
dispersed to meet business needs. Data Marts are analytical
record repositories created with a certain community within a
company in mind, focusing on particular business processes.
Despite the fact that organisational data marts are combined to
build the data warehouse in the bottom-up data warehouse
design process, data marts are derived from subsets of data in a
data warehouse.
Applications for business intelligence (BI) are the main
usage of a data mart. Records are gathered, stored, accessed, and
analysed using BI. Since it is less expensive than constructing a
data warehouse, smaller organisations can use it to make use of
the data they have amassed.

Purpose of creating data marts:
 Provides a collective view of data for a group of users
 Easy access to frequently needed data
 Ease of creation
 Improves end-user response time
 Lower cost than implementing a complete data warehouse
 Potential clients are more clearly defined than in a comprehensive data warehouse
 It contains only essential business data and is less cluttered.
Types of datamarts:
There are primarily two methods for designing data marts:
(i) Dependent data marts
(ii) Independent data marts
(i) Dependent data marts
A dependent data mart is a logical subset or a physical subset of a larger data warehouse. In this method, the data marts are regarded as subsets of a data warehouse. This method starts by building a data warehouse, from which further data marts can be created. These data marts rely on the data warehouse and pull the crucial information from it. Because the data warehouse creates the data mart, this method eliminates the need for data mart integration. It is also referred to as a top-down strategy.

(ii) Independent data marts
The second method is independent data marts (IDM). In this case, multiple separate data marts are constructed first, and a data warehouse is then designed using them. This method requires the integration of the data marts because each one is built independently. Since the data marts are combined to create a data warehouse, it is also known as a bottom-up strategy.
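As a small sketch of the dependent approach, the snippet below carves a region-specific data mart out of an existing warehouse table using SQLite; the table names, columns, and rows are all hypothetical.

# Dependent data mart: a subset of an existing warehouse table (hypothetical schema).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE warehouse_sales (
    region TEXT, product TEXT, sale_date TEXT, amount REAL)""")
con.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?, ?)",
    [("South", "Chips", "2022-04-01", 120.0),
     ("North", "Soda",  "2022-04-02",  80.0),
     ("South", "Soda",  "2022-04-03",  60.0)])

# The data mart keeps only the subject area one user group needs (southern sales).
con.execute("""CREATE TABLE mart_south_sales AS
               SELECT product, sale_date, amount
               FROM warehouse_sales
               WHERE region = 'South'""")

print(con.execute("SELECT * FROM mart_south_sales").fetchall())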

Steps involved in preparing DataMarts:
Designing the schema, building the physical storage,
populating the data mart with data from source systems,
accessing it to make educated decisions, and managing it over
time are the key implementation processes. These are the steps:

(i) Designing

The data mart process starts with the design step. Initiating
the request for a data mart, acquiring information about the
needs, and creating the data mart's logical and physical design
are all covered in this step.
The tasks involved are as follows:
Assembling the technological and business needs
Finding the sources of data
Designing the logical and physical architecture of the data
mart; choosing the suitable subset of data.

(ii) Constructing

In order to enable quick and effective access to the data, this step involves building the physical database and the logical structures related to the data mart.

The tasks involved are as follows:

 Establishing the physical database and the data mart's logical structures, such as tablespaces.
 Producing the schema objects described in the design process, such as the tables and indexes.
 Deciding on the best way to set up the access structures and tables.

(iii)Populating
This stage involves obtaining data from the source, cleaning
it up, transforming it into the appropriate format and level of
detail, and transferring it into the data mart.
The tasks involved are as follows:
 Mapping the target data structures to the source data
 Extracting the data
 Converting and cleaning up the data
 Loading the data into the data mart
 Creating and maintaining metadata
(iv) Accessing
In this step, the data is put to use through querying, analysis,
report creation, chart and graph creation, and publication.
The tasks involved are as follows:
 Create a meta layer (intermediate layer) for the front-end tool to use. This layer converts database operations and object names into business terms so that end users can communicate with the data mart using language that is related to business processes.
 Create and maintain database structures, such as summarised tables, to aid the quick and effective execution of queries through front-end tools.
(v) Managing
In this step, the data mart's lifespan management is
included. Management duties are carried out at this step as
follows:
 Granting safe access to the information
 Controlling the expansion of the data.
 Improving the performance of the system.
 Ensuring data accessibility in the case of system
breakdowns.

Difference between Data Warehouse and Data Mart

Data Warehouse: A data warehouse is a vast repository of information collected from various organizations or departments within a corporation.
Data Mart: A data mart is only a subtype of a data warehouse, architected to meet the requirements of a specific user group.

Data Warehouse: It may hold multiple subject areas.
Data Mart: It holds only one subject area (for example, finance or sales).

Data Warehouse: It holds very detailed information.
Data Mart: It may hold more summarized data.

Data Warehouse: It works to integrate all data sources.
Data Mart: It concentrates on integrating data from a given subject area or group of source systems.

Data Warehouse: In data warehousing, the fact constellation schema is used.
Data Mart: In a data mart, the star schema and snowflake schema are used.

Data Warehouse: It is a centralized system.
Data Mart: It is a decentralized system.

Data Warehouse: Data warehousing is data-oriented.
Data Mart: A data mart is project-oriented.

M1.U3. 6. Meta data:


Data about data is the simplest definition of metadata.
Metadata refers to information that is used to describe other
information. A book's index, for instance, functions as metadata
for the book's contents. In other words, metadata is the distilled
data that directs us to the detailed data. The following definition
of metadata applies to data warehouses.
A data warehouse's route map is its metadata.

In a data warehouse, the objects are defined through
metadata.
Metadata acts as a directory. This directory helps the decision support system locate a data warehouse's contents. In a data warehouse, metadata is created for the names and meanings of the data. Additional metadata is also created to time-stamp any extracted data and to record the extracted data's source.
M1.U3. 6. 1. Categories of Metadata:
Three major categories can be used to group metadata:
(i) Business metadata: It contains details about who owns the data, what the business definitions are, and how its policies change.
(ii) Technical metadata: It consists of database system names, the names and sizes of tables and columns, as well as the data types and permitted values. Technical metadata also includes structural data such as indices and primary- and foreign-key properties.
(iii) Operational metadata: It consists of data lineage and currency. The currency of data refers to whether it is live, archived, or erased. The history of data migration and alteration is referred to as the lineage of the data.
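As a tiny, purely illustrative example, the three categories above can be pictured as one metadata record for a hypothetical sales table; every field and value here is invented for the illustration.

# Hypothetical metadata record combining the three categories described above.
table_metadata = {
    "business": {              # ownership and business meaning
        "owner": "Sales department",
        "definition": "Completed customer orders",
        "policy": "Retain for 7 years",
    },
    "technical": {             # names, sizes, types, and keys
        "database": "dw_prod",
        "table": "fact_sales",
        "columns": {"order_id": "INTEGER", "amount": "REAL"},
        "primary_key": ["order_id"],
    },
    "operational": {           # lineage and currency
        "currency": "live",    # live, archived, or erased
        "lineage": ["crm_orders -> staging -> fact_sales"],
    },
}
print(table_metadata["operational"]["currency"])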

M1.U3. 6. 2. Role of Metadata:
In a data warehouse, metadata plays a crucial function.
Although it has a distinct function from the warehouse data,
metadata nonetheless has a significant impact. The following
describes the numerous functions of metadata.

Metadata acts as a directory.

 This directory aids the decision support system in finding the data warehouse's contents.
 When data is transformed from an operational environment to a data warehouse environment, metadata aids the decision support system in mapping the data.
 Metadata aids in summarising current detailed data into highly summarised data.
 Metadata also helps differentiate highly summarised data from lightly summarised, detailed data.
 Query tools use metadata.
 Extraction and cleansing tools use metadata.
 Reporting tools use metadata.
 Transformation tools use metadata.
 Metadata is crucial to how loading operations work.

The following diagram shows the roles of metadata.

M1.U3. 6. 3 Metadata Repository:
A data warehouse system's metadata store is a crucial
component. It has the metadata shown below:
It offers a definition of the data warehouse's structure. This definition covers the schema, views, hierarchies, derived data definitions, and the locations and contents of the data marts.
(i) Business metadata: It includes details about who
owns the data, how the business is defined, and how
its policies change.
(ii) Operational Metadata: Data lineage and currency are
both included in operational metadata. Data's status
as live, archived, or purged is referred to as its
currency. Data's "lineage" refers to its migration and
modification history.

(iii) Data for mapping from the operational environment to the data warehouse: This includes the source databases and their contents, data extraction, data partition cleaning, transformation rules, data refresh rules, and purging rules.

(iv) Algorithms for summarization: This includes dimension algorithms and data on granularity, aggregation, summarising, and so on.

M1.U3. 6. 4 . Challenges for Metadata Management:

It is impossible to exaggerate the value of metadata. The use of metadata ensures calculation accuracy, confirms data transformation, and drives report accuracy. Metadata also enforces the definition of business terms for business end users. There are difficulties in using metadata for all these purposes; some of them are discussed here.
 A large organization's metadata is dispersed all over the
place. Applications, databases, and spreadsheets all
contain this metadata.
 Text files or multimedia files may contain metadata. The
definition of this data must be accurate in order to be
used for information management solutions.
 There are no established standards throughout the
business. Vendors of data management solutions have a
limited emphasis.

 Metadata cannot be passed from one tool to another in an agreed or simple manner.

Terminal questions:
1. Define business intelligence.
2. Why is data analysis required in business intelligence?
3. Describe briefly the data warehousing architecture.
4. Interpret the concept of data marts and explain the different types of data marts.
5. Extract the different components of data warehousing and explain the need for data warehousing.
6. Contrast the differences between data marts and the data warehouse.
7. Explain briefly the lifecycle of data.
8. Illustrate the characteristics of DW and BI and how they interconnect with one another in today's working methods.

Module 2
Unit 1
M2.U1.1. Business intelligence:
Business intelligence is a term used to describe a set of ideas and approaches used to enhance business decision-making through the use of data and fact-based systems. It is much discussed in a changing and expanding world. Enhancing decision-making through business concepts and analysis is the aim of business intelligence. Business intelligence is more than a single idea; it is a collection of ideas and approaches. Business intelligence relies on both intuition and analytics to make judgments.
Process applied in Business Intelligence:
Business intelligence (BI) transforms raw data into relevant information, and then transforms that information into knowledge, using a variety of procedures, technologies, and tools (such as those from Informatica or IBM). Helpful insights are extracted, either manually or with software, and decision-makers can then base their decisions on those insights.

To keep it short and simple, business intelligence is about giving correct information to the organization's decision-makers in a proper and ethical manner. Important characteristics of business intelligence include:
Decision-making based on facts.
A 360-degree view of your company.
Keeping the virtual team on the same page.
Measurement for developing KPIs (Key Performance Indicators) using historical data that has been fed into the system.
Identifying benchmarks and then establishing them for the various processes.
Business intelligence systems can be used to spot market trends as well as problems in the business that need to be discovered and fixed.
Business intelligence makes data visualisation possible, which improves the quality of data and, in turn, improves decision-making.
Because they are very inexpensive, business intelligence systems can be employed by large corporations and organisations as well as Small and Medium Enterprises.
M2.U1.2. Business intelligence user types:
Analyst (data analyst or business analyst): The company's statisticians, known as analysts, apply BI to historical data that has been stored in the system.
Head of the company or manager: The company's head employs business intelligence to boost profitability by making judgements more effectively and using all the information they have learned.
IT specialist: uses business intelligence for his or her business.
Small business owners: A small businessman may also use it because it is reasonably priced.
Applications of Business Intelligence:
 In Decision Making of the company by decision-makers of
the organizations.
 In Data Mining while extracting knowledge.
 In Operational Analytics and operational management.
 In Predictive Analytics.
 In Prescriptive Analytics.
 Creating structured data from unstructured data.
 In Decision Support System.
 In Executive Information System (EIS).

M2.U1.3. Relation between Business Intelligence and Data Warehousing:
Business intelligence: Major corporations frequently obtain a lot of data from many sources. This data can always be used to derive a variety of information sets that support wiser business decisions. This useful information could be descriptive, predictive, or prescriptive. BI stands for the different approaches and instruments used for gathering, integrating, analysing, and visualising corporate data. In the corporate world it might be regarded as roughly synonymous with data analytics.
Data warehouse: A data warehouse is a system and group of back-end technologies that help gather vast quantities of disparate data from multiple sources and store it for later use. A good data warehouse will have business logic built into it to make future extraction and analysis easier. Business intelligence is one application that uses data warehouses. Similar to the OLAP paradigm, data in a warehouse is typically stored in fact tables (tables containing quantities such as revenue or costs) and dimensions (the things we want to view the facts by, such as region, office, or week).
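A minimal sketch of the fact-and-dimension idea, using SQLite with made-up revenue figures: the fact table holds the quantities, the dimension table holds the things we want to view them by (here, region), and a join plus an aggregate produces the "facts by region" view.

# Star-schema sketch: one fact table plus one dimension table (hypothetical data).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim_office (office_id INTEGER PRIMARY KEY, region TEXT)")
con.execute("CREATE TABLE fact_revenue (office_id INTEGER, week INTEGER, revenue REAL)")
con.executemany("INSERT INTO dim_office VALUES (?, ?)", [(1, "North"), (2, "South")])
con.executemany("INSERT INTO fact_revenue VALUES (?, ?, ?)",
                [(1, 14, 100.0), (1, 15, 150.0), (2, 14, 90.0)])

# View the facts "by region": join the fact table to the dimension and aggregate.
query = """SELECT d.region, SUM(f.revenue) AS total_revenue
           FROM fact_revenue f
           JOIN dim_office d ON f.office_id = d.office_id
           GROUP BY d.region"""
print(con.execute(query).fetchall())   # [('North', 250.0), ('South', 90.0)]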

Business Intelligence vs. Data Warehouse

Business Intelligence: A set of tools and methods to analyze data and to discover, extract, and formulate actionable information that is useful for business decisions.
Data Warehouse: A system for storing data from various sources in an orderly manner so as to facilitate business-minded reads and writes.

Business Intelligence: It is a Decision Support System (DSS).
Data Warehouse: It is a data storage system.

Business Intelligence: Serves at the front end.
Data Warehouse: Serves at the back end.

Business Intelligence: The aim of business intelligence is to enable users to make informed, data-driven decisions.
Data Warehouse: A data warehouse's main aim is to provide the users of business intelligence with a structured and comprehensive view of the available data of an organization.

Business Intelligence: Collects data from the data warehouse for analysis.
Data Warehouse: Collects data from various disparate sources and organizes it for efficient BI analysis.

Business Intelligence: Comprises business reports, charts, graphs, etc.
Data Warehouse: Comprises data held in "fact tables" and "dimensions" with business meaning incorporated into them.

Business Intelligence: BI as such does not have much use without a data warehouse, as large amounts of varied and useful data are required for analysis.
Data Warehouse: BI is one of many use cases for data warehouses; there are more applications for this system.

Business Intelligence: Handled by executives and analysts relatively higher up in the hierarchy.
Data Warehouse: Handled and maintained by data engineers and system administrators who report to or work for the executives and analysts.

Business Intelligence: The role of business intelligence lies in improving the performance of the business by utilizing tools and approaches that focus on counts, statistics, and visualization.
Data Warehouse: The data warehouse reflects the actual database development and integration process, along with data profiling and business validation standards.

Business Intelligence deals with:
 OLAP (Online Analytical Processing)
 Data Visualization
 Data Mining
 Query/Reporting Tools
Data Warehouse deals with:
 Acquiring/gathering of data
 Metadata management
 Cleaning of data
 Transforming data
 Data dissemination
 Data recovery/backup planning

Examples of BI software: SAP, Sisense, Datapine, Looker, etc.
Examples of data warehouse software: BigQuery, Snowflake, Amazon Redshift, Panoply, etc.
M2.U1.4. Architecture of BI and DW:

The layers and components of a strong BI architecture are described as having various capabilities and as producing dashboards and reports. A crucial component of the BI architecture is the data warehouse.
A strong BI architecture makes use of:
(i) Data Collection: Businesses collect data through business operations systems including CRM, ERP, finance, manufacturing, supply chain management, and more. Users can also gather it from secondary sources such as market research reports and customer databases. To combine data from many sources, contemporary BI applications make use of powerful data connectors. The data can be structured, semi-structured, or unstructured.
(ii) Data Integration:
When analysing data, information from several sources is combined to present a coherent view. Data is extracted from various systems and then loaded into data warehouses. This procedure is called ETL (extract, transform, and load). During data extraction, raw data is pulled from source sites such as databases, web pages, flat files, and SQL servers.
Filtering, cleaning, de-duplicating, and performing calculations and summaries on the raw data are all part of the transformation step. Converting the data into tables that conform to the desired data warehouse schema may also involve changing the row and column headers, editing text strings, and otherwise modifying the data. As the final stage, the data is loaded into the data warehouse.
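A compact ETL sketch in Python with pandas; the file name, columns, and cleaning rules are hypothetical and only illustrate the extract, transform, and load stages described above.

# Minimal ETL sketch: extract from a flat file, transform, load into a warehouse table.
import sqlite3
import pandas as pd

# Extract: read raw data from a (hypothetical) flat file.
raw = pd.read_csv("orders_raw.csv")

# Transform: de-duplicate, fix types, rename, and summarize to fit the target schema.
raw = raw.drop_duplicates()
raw["order_date"] = pd.to_datetime(raw["order_date"])
raw = raw.rename(columns={"amt": "amount"})
summary = raw.groupby(["region", "order_date"], as_index=False)["amount"].sum()

# Load: append the transformed rows into the data warehouse.
warehouse = sqlite3.connect("warehouse.db")
summary.to_sql("fact_orders", warehouse, if_exists="append", index=False)
warehouse.close()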
(iii) Data Storage
In order to facilitate further analysis, data warehouses store
structured data as a relational, columnar, or multi-dimensional
database. It supports maintaining a single version of the truth
throughout the organisation, data summarization, and cross-
functional analysis.
(iv) Data Analyzer:
Once the data has been integrated, cleaned, and transformed, the next step is deriving insightful conclusions. Data analysis pulls pertinent, usable information from the dataset to aid organisations in decision-making. Graphs, charts, tables, maps, and other visual representations of these statistics or insights are frequently used.
With the use of drag-and-drop capabilities found in modern
BI applications, business users can easily construct intuitive
dashboards, reports, and visualisations without requiring a deep
understanding of technology.
(v) Distribution of Data
To encourage teamwork, share reports and dashboards with your teammates. Dashboards update automatically in real time, or on a daily, weekly, or monthly schedule. Users can also share dashboards in a secure viewer environment.
Other users cannot alter the material, but they can
manipulate and interact with it using assigned filters. Another
choice is to share reports and dashboards with stakeholders
outside the organisation by using a public URL.
(vi) Data Insights
Finding patterns in table representations or numbers rising in
a line chart is a key step in gaining insightful information. It can
also refer to using a pie chart to represent the distribution of
income or the number of hours dedicated to various chores on a
given day. Analyzing historical data can reveal how a company
responds to a variety of factors, such as market changes,
seasonality, patterns, economic cycles, and more. Analyze data
points and trends that may be consistent with the state of the
economy so that firms can make more informed decisions.


Unit 2
M2.U2.1. Data Mining:
Data mining is the process of examining data from various angles to uncover hidden patterns and to categorize it into useful information. This data is gathered and assembled in common areas such as data warehouses, where efficient analysis and data mining algorithms aid decision-making and other data requirements and, ultimately, reduce costs and generate income.
Organizations employ the data mining method to extract specific data from sizable databases in order to address business issues. It mostly transforms unprocessed data into insightful knowledge.
Data mining is the act of examining large amounts of data to look for patterns, identify trends, and develop an understanding of how to use the data. Data miners can then use these results to anticipate outcomes or make judgements.
M2.U2.2. Motivation for Data Mining:
Data mining is the process of sifting through a large amount
of data stored in repositories using pattern recognition
technologies, including statistical and mathematical
methodologies, in order to discover new connections, patterns,
and trends that are helpful. It is via the examination of factual
datasets that new relationships are found and records are
compiled in ways that are both logical and beneficial to the data
owner.
In order to identify regularities or relations that were initially
unknown, a large amount of data must be selected, explored, and
modelled. The goal is to produce results that are both clear and
helpful to the database owner.
It is not restricted to the application of statistical methods or computer algorithms. It is a method of business intelligence that may be used to support corporate decisions when combined with information technology.
Data science and data mining are related. Data mining is performed with a purpose by a person under certain circumstances, using a particular data set. Text mining, web mining, audio and video mining, image data mining, and social media mining are just a few of the services available in this phase. It is carried out using straightforward or highly specialised software.
Data mining can be seen as an outcome of the ongoing
development of data technology. The following features,
including data gathering and database building, data
management, and sophisticated data analysis, have all seen
evolution thanks to the database system market.

An efficient structure for data storage and retrieval, query and transaction processing, for instance, was later developed as a result of the recent development of data gathering and database
result of the recent development of data gathering and database
building structures. Since query and transaction processing are
now features offered by many database systems, advanced data
analysis has become the next target.
A variety of databases and data repositories can store data. One data repository structure that has emerged is the data warehouse: a repository of several heterogeneous data sources grouped under a single, unified schema at a single site to facilitate management decision-making.
Data warehouse technology includes, in particular, online analytical processing (OLAP): analysis techniques with functions such as summarization, consolidation, and aggregation, as well as the capacity to view data from a variety of perspectives.
M2.U2.3. Classification of Data Mining:
Classification adapted according to the application:
A domain-specific application is required here. For instance, data mining systems can be adjusted to suit the needs of the telecommunications, financial, stock market, email, and other industries.
Classification according to the type of techniques
utilized:
This strategy takes into account the level of user interaction
or the data analysis method used.
For instance, techniques focused on databases or data
warehouses, neural networks, visualisation, pattern recognition,
and machine learning.
Classification based on the kinds of knowledge that was
mined:
This is based on features including classification,
association, discrimination, correlation, and prediction, among
others.
Classification based on the kinds of databases that were mined:
A database system can be categorised according to its data model, the type of data it handles, or its intended application.
M2.U2.4. Data mining task primitives
A data mining query is an input to the data mining system that specifies a data mining task.
Data mining task primitives are used to define a data mining query. With the help of these primitives, the user can interact with the data mining system to control it during discovery or to explore the results at various depths and from various angles.
The following are detailed by the data mining primitives:
1. Set of task-relevant data to be mined.
2. Kind of knowledge to be mined.

3. Background knowledge to be used in the discovery
process.
4. Interestingness measures and thresholds for pattern
evaluation.
5. Representation for visualizing the discovered patterns.

These primitives can be incorporated into a data mining query language, enabling flexible user interaction with data mining systems. A data mining query language also offers a base on which graphical user interfaces can be created.
It is difficult to design a complete data mining language
since data mining encompasses a broad range of functions, from
data classification to evolution analysis.
Every task has its own requirements. A thorough grasp of the capabilities, constraints, and underlying workings of the various kinds of data mining tasks is necessary for the construction of an efficient data mining query language. Such a language also makes it easier for a data mining system to communicate with other information systems and to integrate into the overall information processing environment.

M2.U2.4.1 The set of task-relevant data to be mined
This details the parts of the database or collection of data the
user is interested in.
This contains the desired dimensions of the interest-driven
data warehouse or database (the relevant attributes or
dimensions).

The set of task-relevant data can be gathered from a relational database using a relational query that involves operations such as selection, projection, join, and aggregation. The data collection procedure produces a new data relation, called the initial data relation. The initial data relation can be ordered or grouped according to the conditions specified in the query. This data retrieval can be thought of as a subtask of the data mining task. The initial relation may or may not correspond to a physical relation in the database.
M2.U2.4.2. The kind of knowledge to be mined
The data mining operations that will be carried out are
specified below, including characterisation, discrimination,
association or correlation analysis, classification, prediction,
clustering, outlier analysis, and evolution analysis.
M2.U2.4.3 The background knowledge to be used in the
discovery process:

 For directing the knowledge discovery process and
assessing the patterns discovered, this understanding of
the domain to be mined is helpful.
 A common method of background knowledge that
enables data to be mined at various levels of abstraction
is concept hierarchies.
 A series of mappings from basic concepts to higher-
level, more abstract concepts is known as a concept
hierarchy.
 Rolling Up - Generalization of Data: Allows data to be viewed at more general, higher-level abstractions and thus facilitates understanding.
 Less input/output procedures would be needed as a result
of the data's compression.
 Drilling Down - Specialization of Data: Lower-level
concepts are used in place of concept values.
 There may be more than one concept hierarchy for a
given attribute or dimension depending on the user's
point of view.
 As an example, a concept hierarchy for the attribute (or dimension) age is sketched below. Another type of background knowledge is user beliefs about the relationships in the data.
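As a minimal sketch (not from the original text), such a concept hierarchy for age can be written down as a simple mapping in Python; the level names and age ranges below are assumptions chosen only for illustration.

# Hypothetical concept hierarchy for the attribute "age":
# raw age value -> {young, middle_aged, senior} -> all(age)
def age_concept(age):
    """Map a primitive age value to a higher-level concept."""
    if age <= 39:
        return "young"
    elif age <= 59:
        return "middle_aged"
    return "senior"

# Rolling up replaces the detailed values with more general concepts.
ages = [23, 35, 41, 58, 67]
print([age_concept(a) for a in ages])  # ['young', 'young', 'middle_aged', 'middle_aged', 'senior']

Drilling down reverses the mapping, replacing the higher-level concepts with the lower-level values they cover.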
M2.U2.4.4. The interestingness measures and thresholds for pattern evaluation
Different kinds of knowledge may have different interestingness measures, which can be employed to guide the mining process or, after discovery, to evaluate the patterns found. Support and confidence are two interestingness measures for association rules. Rules whose support and confidence fall below the user-specified thresholds are considered uninteresting.

M2.U2.4.5. The expected representation for visualizing the discovered patterns:
The format in which the patterns are to be displayed is specified here; it could be rules, tables, cross tabs, charts, graphs, decision trees, cubes, or other visual representations. The users must specify the forms of presentation to be used for displaying the discovered patterns. Some representation forms may suit certain kinds of knowledge better than others. For example, decision trees are frequently used for classification, whereas generalised relations and the corresponding cross tabs or pie/bar charts are well suited to presenting characteristic descriptions.

M2.U3.1. Integration of a Data Mining System with a Database or a Data Warehouse:
To perform its functions effectively, the data mining
system is connected with a database or data warehouse
system. In order to function, a data mining system must
be able to interface with other data systems, such as a
database system. These systems can be integrated using
the following potential integration schemes.

M2.U3.1.2. No coupling:
A data mining system will not use any database or
data warehouse system functionality if there is no
coupling.

It can retrieve data from a particular source (such as a file system), process it using some data mining techniques, and then store the mining results in a separate file. Although the method is straightforward, it suffers from a number of shortcomings. First, a database system offers a great deal of flexibility and efficiency in data storage, organisation, access, and processing. Without a database or data warehouse, a data mining system may spend a lot of time looking for, gathering, cleaning, and transforming data.

M2.U3.1.3. Loose Coupling :


Some database or data warehouse services are used
in this data mining system.
These systems manage a data repository from which the data is retrieved. The data is analysed using data mining techniques, and the results are then saved either in a file or in a designated place in a database or data warehouse. Loose coupling is preferable to no coupling because it can retrieve part of the data from databases by using query processing or other system facilities.

M2.U3.1.4. Semi-tight Coupling:

In this case, the database/data warehouse system can support the efficient execution of a few essential data mining primitives. These primitives may include the pre-computation of several significant statistical measures, such as sum, count, max, min, standard deviation, and so on, as well as sorting, indexing, aggregation, histogram analysis, and multi-way join operations.

M2.U3.1.5. Tight Coupling:
Tight coupling refers to the seamless integration of a
data mining technology with a database/data
warehousing system. One functional component of an
information system is the data mining subsystem. Data
structures, indexing schemes, and query processing
techniques used in database and data warehousing
systems are used to build and establish data mining
queries and functions. Due to its support for efficient
data mining functions implementation, high system
performance, and an integrated data processing
environment, it is highly desired.

M2.U3.2. Issues in DM:

Data mining is not a simple task, because the algorithms used can be complex and the data is not always available in one place; it often has to be combined from numerous heterogeneous data sources. These factors also give rise to a number of problems. The major issues relating to data mining are described below.
M2.U3.3. Mining different kinds of knowledge in databases:
Users may have varying levels of interest in various
types of knowledge. Data mining must therefore be able
to handle a variety of knowledge finding tasks.

M2.U3.3.1. Interactive mining of knowledge at multiple levels of abstraction:
Because it enables users to narrow their search for patterns and to provide and modify data mining requests depending on the returned findings, the data mining process must be interactive.

M2.U3.3.2. Incorporation of background knowledge:
Background knowledge can be utilised to direct the
discovery process and to communicate the patterns that
are found. Background information can be used to
represent the patterns that have been found not only
succinctly but also at different levels of abstraction.

M2.U3.3.3. Data mining query languages and ad hoc data mining:
Ad hoc mining tasks should be defined in a query
language designed for efficient and flexible data mining
that is integrated with a data warehouse query language.

M2.U3.3.4. Presentation and visualization of data mining results:
The patterns must be expressed in high level
languages and visual representations once they have
been found. These illustrations ought to be simple to
comprehend.

M2.U3.3.5. Handling noisy or incomplete data:

Data cleaning methods that can handle noise and incomplete objects are required while mining the data regularities; otherwise, the accuracy of the discovered patterns will be poor.

M2.U3.3.6. Pattern evaluation: The patterns discovered may not all be interesting; a pattern is uninteresting if it merely represents common knowledge or lacks novelty.

M2.U3.3.7 Performance Challenges


There may be problems with performance like the
ones listed below:

M2.U3.3.8. Efficiency and scalability of data mining algorithms:
Data mining algorithms must be efficient and
scalable in order to efficiently extract information from
enormous amounts of data in databases.

M2.U3.3.9. Algorithms for parallel, distributed, and incremental mining:
The development of parallel and distributed data mining algorithms is motivated by factors such as the enormous size of databases, the wide distribution of data, and the complexity of data mining methods. These algorithms divide the data into partitions that are processed in parallel, and the results from the partitions are then merged. Incremental algorithms incorporate database updates without having to mine the entire data again from scratch.

M2.U3.4. Diverse Data Types Issues

Handling of relational and complex types of data: A database may contain complex data objects, multimedia data objects, location data, temporal data, and so on. It is not possible for one system to mine all these kinds of data.

Mining information from heterogeneous databases and global information systems: The data is available from many data sources on a LAN or WAN. These sources may be structured, semi-structured, or unstructured. Mining knowledge from them therefore adds further challenges to data mining.

Terminal Questions:
1) Extract the importance of Business Intelligence and explain its working strategies in data mining techniques.
2) Demonstrate the working methodologies of data mining techniques with an architecture diagram.
3) Compare and contrast the working characteristics of Business Intelligence and data mining with a diagram.
4) Describe briefly the different functionalities of data mining.
5) Illustrate the different issues faced when applying data mining techniques in applications.
6) Interpret the concept of Business Intelligence in data mining techniques.
7) Define the motivation for data mining.
8) Illustrate the concept of integration of a data mining system with a database or a data warehouse.
9) Explain the importance of a data warehouse.
10) Differentiate the characteristics of a database and a data warehouse.

M3.U1.1.KDD Process:
KDD is an iterative process that allows for the refinement of
mining techniques, the integration of fresh data, and the
transformation of existing data to provide new and better
findings.

Data Cleaning:
 Data cleaning is the process of removing noisy and irrelevant information from the collection.
 It cleans the data where values are missing.
 It removes random or variance errors from noisy data.
 It uses data transformation and discrepancy detection technologies.
Data Integration:
 Data integration is the process of combining heterogeneous data from various sources into a single source (a data warehouse).
 Data is integrated with the aid of data migration tools.
 Data is integrated with the aid of data synchronisation tools.
 Data is integrated through the ETL (Extract-Transform-Load) process.
Data Selection:
Data selection is the process of deciding which data from the
data collection are pertinent to the analysis and retrieving them.
 Choosing data using a neural network.
 Choosing data using decision trees.
 Choosing data using Naive Bayes.
 Choosing data using clustering or regression.
Data Transformation:
Data transformation is the process of changing data into the form required by the mining procedure. Data transformation involves two steps:
Data Mapping:
To record transformations, elements from the source base are
assigned to the destination.
Code Generation:
Generation of the actual transformation program.
Data Mining:
Data mining is the application of intelligent methods to extract potentially useful patterns. It derives patterns from the task-relevant data and determines the purpose of the model using characterisation or classification.
Pattern Evaluation:
 Pattern evaluation is defined as identifying strictly interesting patterns that represent knowledge, based on predetermined measures.
 It finds the interestingness score of each pattern.
 It uses summarisation and visualisation to make the data understandable to the user.
Knowledge Representation:
 Knowledge representation is a method for displaying the
findings of data mining that makes use of visualisation
tools.
 Produce reports.
 Create some tables.
 Create classification rules, characterization rules, and
other discriminatory rules.
2) Data Preprocessing:
Before data can be used, preprocessing is necessary. Data preprocessing is the idea of transforming unclean data into clean data. Before the algorithm is applied, the dataset is preprocessed to check for missing values, noisy data, and other irregularities.
3) Data Cleaning:
Data cleaning is the practice of correcting or deleting inaccurate, corrupted, improperly formatted, duplicate, or incomplete data from a dataset. When merging multiple data sources, there are numerous ways for data to be duplicated or incorrectly labelled.
Missing Value:

1) Ignore the data row:
This is typically done when the class label is absent (assuming the data mining goal is classification) or when a significant number of the row's attributes are missing, not just one. However, if the proportion of such rows is large, performance will undoubtedly suffer. Consider the case where we have a database of student enrolment information (age, SAT score, state of residence, etc.) and a column that categorises students' likelihood of succeeding in college as "Low", "Medium", or "High". Suppose our aim is to develop a model that can forecast a student's success in college. Data rows with the success column missing are not helpful in predicting success, so they can be disregarded and removed before running the algorithm.

2) Fill the missing value with a global constant:
Choose a new global constant value, such as "unknown", "N/A", or minus infinity, that will be used to fill in all the missing values. This strategy is employed because it sometimes makes no sense to try to predict the missing value. Take another look at the student enrolment database, for instance, and assume the state-of-residence attribute is missing for some students. Filling it in with some arbitrary state, rather than something like "N/A", does not really make sense.

3) Fill the missing value with the mean:
Replace missing values with the mean (or the median, if the attribute is discrete) of that attribute over the database. If the average income of a US family is X, for instance, you may use that value to fill in the gaps in a database of US family incomes.
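The following is a minimal sketch (not from the original text) of the three strategies above using pandas; the small table and the column names "income", "state", and "success" are hypothetical.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income":  [42000, np.nan, 58000, 61000, np.nan],
    "state":   ["TX", None, "CA", "NY", "WA"],
    "success": ["High", "Low", None, "Medium", "High"],
})

# 1) Ignore rows whose class label ("success") is missing.
df = df.dropna(subset=["success"])

# 2) Fill a categorical attribute with a global constant such as "N/A".
df["state"] = df["state"].fillna("N/A")

# 3) Fill a numeric attribute with the mean of that attribute.
df["income"] = df["income"].fillna(df["income"].mean())

print(df)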

Noisy Values:

Noisy data is data that contains a significant amount of extraneous, meaningless information. The phrase is frequently used as a synonym for corrupt data, so it also covers data corruption, as well as any information that a user system is unable to understand and interpret correctly.

Noisy data is essentially useless data. Although the phrase has often been used interchangeably with faulty data, it now refers to any data, including unstructured data, that cannot be effectively processed and analysed by machines. Noisy data is any data that has been received, stored, or altered in a way that prevents the programme that originally created it from reading or using it. Noisy data not only requires more storage space than necessary, but it can also have a negative impact on the results of any data mining analysis. Statistical analysis can leverage information obtained from historical data to filter out noisy data and thereby make data mining easier.

Hardware malfunctions, coding mistakes, and illegible voice or optical character recognition (OCR) input can all result in noisy data. Spelling mistakes, trade abbreviations, and slang can also make text difficult for machines to read.

Data Integration and Transformation:

The process of changing data from one format to another, usually from the format of a source system into that required by a destination system, is known as data transformation. Data transformation is a component of most data integration and management tasks, including data wrangling and data warehousing.

Data integration is a data preparation approach that creates a coherent data store by combining data from various heterogeneous data sources, giving users a unified view of the data. These sources could be different databases, flat files, or data cubes.

Formally, a data integration system is defined as a triple <G, S, M>, where

G stands for the global schema,
S for the heterogeneous set of source schemas, and
M for the mapping between queries over the source and global schemas.

The "tight coupling technique" and the "loose coupling
approach" are the two main strategies for integrating data.

Tight Coupling:

Here a data warehouse is treated as an information retrieval component. Data from various sources is combined into a single physical location using the ETL (Extraction, Transformation, and Loading) process.

Loose Coupling:
Here a user-friendly interface is provided that accepts the user's query, transforms it into a form the source databases can understand, and then sends the transformed query directly to the source databases to obtain the result. The data is kept only in the original source databases.
Issues in Data Integration:
Schema Integration, Redundancy Detection, and Resolving
Data Value Conflicts are the three factors to be taken into
account during data integration.
Below is a quick explanation of each.
1. Schema Integration:
This involves combining metadata from several sources. Matching equivalent real-world entities from various sources is referred to as the entity identification problem.

2. Detecting redundancy:
An attribute may be redundant if it can be derived or deduced from another attribute or set of attributes. Inconsistent attribute values can also introduce redundancies in the final data set. Some redundancies can be detected by correlation analysis.
3. Resolving data value conflicts:
This is the third major problem in data integration. The values of an attribute for the same real-world entity may differ from source to source, and the "same" attribute may be recorded at a lower level of abstraction in one system than in another.

M3.U2.1. Data Reduction:
The data reduction technique yields a condensed representation of the original data that is much smaller in volume but preserves the quality of the original data.
M3.U2.2. Data Cube Aggregation:
With this method, data is aggregated into a simpler form. Suppose the data gathered for your analysis for the years 2012 to 2014 includes your company's revenue for every three months, but you are interested in the annual sales rather than the quarterly figures. We can then aggregate the data so that the resulting data summarises the total sales per year instead of per quarter. It provides a summary of the data.
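A minimal sketch (not from the original text) of this kind of aggregation with pandas is shown below; the revenue figures are invented purely for illustration.

import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2012]*4 + [2013]*4 + [2014]*4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "revenue": [224, 408, 350, 586, 310, 420, 390, 610, 330, 450, 410, 650],
})

# Roll up from the quarter level to the year level.
annual = quarterly.groupby("year", as_index=False)["revenue"].sum()
print(annual)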
M3.U2.2.1. Dimension Reduction:
Whenever we encounter attributes that are only weakly relevant to our analysis, we apply attribute selection. It reduces the data size by removing irrelevant or obsolete attributes.
Step-wise Forward Selection:
We start with an empty set of attributes and later decide which attributes are best, depending on how relevant they are to the other attributes; in statistics, this relevance is measured by a p-value. Assume that the data set has the following attributes, just a few of which are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }

Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}

Step-wise Backward Selection:
This selection begins with the complete set of attributes in the initial data and, at each step, eliminates the worst attribute remaining in the set. Assume that the data set has the following attributes, just a few of which are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }

Step-1: {X1, X2, X3, X4, X5}


Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}


Combination of Forward and Backward Selection:
This combines both approaches, choosing the best attributes and eliminating the worst ones at each step, which speeds up the procedure and saves time.
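The sketch below (not from the original text) illustrates forward and backward step-wise selection using scikit-learn's SequentialFeatureSelector; the synthetic data and the logistic-regression estimator are assumptions made only for illustration.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Six attributes X1..X6, only some of which are informative.
X, y = make_classification(n_samples=200, n_features=6,
                           n_informative=3, n_redundant=3, random_state=0)

est = LogisticRegression(max_iter=1000)

# Forward selection: start from an empty set and add attributes.
forward = SequentialFeatureSelector(est, n_features_to_select=3,
                                    direction="forward").fit(X, y)

# Backward selection: start from all attributes and remove the worst ones.
backward = SequentialFeatureSelector(est, n_features_to_select=3,
                                     direction="backward").fit(X, y)

print("forward keeps attributes:", forward.get_support(indices=True))
print("backward keeps attributes:", backward.get_support(indices=True))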
M3.U2.3. Data Compression:
The data compression approach reduces the size of the files using various encoding mechanisms (Huffman encoding and run-length encoding). Depending on the method of compression, we can classify it into two categories.
Lossless Compression: Run-length encoding is a lossless compression technique that provides a simple, modest reduction in data size. Lossless data compression uses algorithms that allow the exact original data to be reconstructed from the compressed data.
Lossy Compression: Examples of this compression are the Discrete Wavelet Transform technique and PCA (Principal Component Analysis). For instance, although the JPEG image format uses lossy compression, the content of the original image can still be discerned. Data decompressed after lossy compression may differ from the original data, but it is still usable for extracting information.
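A minimal sketch (not from the original text) of run-length encoding, the lossless technique mentioned above, is given below; the input string is arbitrary.

from itertools import groupby

def rle_encode(seq):
    """Compress a sequence into (value, run-length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(seq)]

def rle_decode(pairs):
    """Reconstruct the exact original sequence from the pairs."""
    return "".join(value * count for value, count in pairs)

original = "AAAABBBCCDAAA"
encoded = rle_encode(original)      # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 3)]
assert rle_decode(encoded) == original
print(encoded)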
M3.U2.3.1. Data Mining Primitives and Languages:
These primitives serve as the definition of a data mining query.
Task-relevant data: This is the portion of the database to be investigated. Suppose you manage All Electronics and are in charge of the company's sales throughout the US and Canada, and you want to focus in particular on Canadian consumers' purchasing patterns. Then, instead of mining the complete database, only the relevant portion is used; the attributes involved are called the relevant attributes.
The kind of knowledge to be mined: This specifies the data mining functions to be performed, such as characterisation, discrimination, association, classification, clustering, or evolution analysis. For instance, if you are studying Canadian consumers' purchasing patterns, you may decide to look for associations between their customer profiles and the products they prefer.

UNIT 1
M4.U1.1.Association Rule Mining:
Finding the rules that may control relationships and causal objects
between sets of things is the goal of association rule mining, a data
mining technique. Therefore, in a particular transaction involving
numerous things, it looks for the principles governing how or why such
items are frequently purchased together.
Large volumes of data are analysed using association rule mining to
uncover intriguing linkages and relationships. This rule displays the
number of times an itemset appears in a transaction. Market-basket analysis serves as a common illustration.
M4.U1.2. Market-basket analysis:
One of the most important methods used by large organisations to reveal associations between goods is market-basket analysis. It enables retailers to discover connections between the products that customers usually purchase together. Given a set of transactions, we can discover rules that predict the occurrence of an item based on the occurrences of other items in the transaction.

Before we start defining the rule, let us first see the basic
definitions.

 Support Count (σ) – the frequency of occurrence of an itemset.
Here σ({Milk, Bread, Diaper}) = 2

Frequent Itemset – An itemset whose support is greater than or
equal to minsup threshold.
Association Rule – An implication expression of the form X ->
Y, where X and Y are any 2 itemsets.
Example: {Milk, Diaper}->{Beer}
Rule Evaluation Metrics –
 Support(s) –
The fraction of transactions that contain the items in both the {X} and {Y} parts of the rule, out of the total number of transactions. It measures how frequently the collection of items occurs together as a proportion of all transactions.
Support(X -> Y) = σ(X ∪ Y) / |T|
It is interpreted as the fraction of transactions that contain both X and Y.
 Confidence(c) –
The ratio of the number of transactions that include all items in both X and Y to the number of transactions that include all items in X.
Conf(X -> Y) = Supp(X ∪ Y) / Supp(X)
It measures how often the items in Y appear in transactions that also contain the items in X.
 Lift(l) –
The lift of the rule X -> Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other. The expected confidence is the support (frequency) of {Y}.
Lift(X -> Y) = Conf(X -> Y) / Supp(Y)
A lift value near 1 indicates that X and Y appear together about as often as expected; a value greater than 1 means they appear together more often than expected; and a value less than 1 means they appear together less often than expected. Greater lift values indicate a stronger association.

Example – From the above table, for {Milk, Diaper} -> {Beer}:

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4

c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 = 0.67

l = Supp({Milk, Diaper, Beer}) / (Supp({Milk, Diaper}) * Supp({Beer})) = 0.4 / (0.6 * 0.6) = 1.11
The association rule is very useful in analysing datasets. The data is collected using bar-code scanners in supermarkets. Such databases consist of a large number of transaction records which list all items bought by a customer in a single purchase. The manager could then know whether certain groups of items are consistently purchased together and use this information for adjusting store layouts, cross-selling, and promotions, based on these statistics.
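The calculation above can be reproduced with a short Python sketch (not from the original text). The five transactions listed are assumed for illustration; they are consistent with the counts used in the example (σ({Milk, Diaper, Beer}) = 2, σ({Milk, Diaper}) = 3, |T| = 5).

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """sigma(itemset): number of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
n = len(transactions)

support    = support_count(X | Y) / n                  # 2/5 = 0.4
confidence = support_count(X | Y) / support_count(X)   # 2/3 = 0.67
lift       = confidence / (support_count(Y) / n)       # 0.67 / 0.6 = 1.11

print(round(support, 2), round(confidence, 2), round(lift, 2))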
1a) Concept Description:
Concept description is the most fundamental form of descriptive data mining. Usually, the term "concept" refers to a grouping of data, such as "frequent buyers" or "graduate students". Describing a concept involves more than simply listing the data; concept description instead produces descriptions for characterising and comparing the data.
When the notion to be expressed pertains to a class of objects, it is
also referred to as a class description.
Characterisation: It offers a clear and concise summary of the given data set.
Comparison: It offers descriptions comparing two or more data collections.
2) Data Generalization and Summarization:
Databases contain detailed information at the level of primitive concepts, in the form of data and objects. For instance, the item relation in a sales database might contain attributes that describe basic information about each item, such as its item ID, name, brand, category, supplier, and place of manufacture. Being able to condense a large amount of data and present it at a high conceptual level is useful. For instance, summarising a huge number of records related to Christmas-season sales into a concise overview can be very beneficial for sales and marketing managers. This calls for a crucial capability known as data generalisation.
Data generalisation is a method for abstracting a large, task-relevant collection of data in a database from a low conceptual level to higher ones. It produces what are referred to as characteristic rules by summarising the general characteristics of objects in a target class. A database query is typically used to gather the data relevant to a user-specified class, and the data is then run through a summarisation module to extract the essential information at various levels of abstraction. One can choose, for instance, to describe "OurVideoStore" customers who regularly rent more than 30 movies each year. When concept hierarchies exist on the attributes describing the target class, the attribute-oriented induction approach can be used to carry out the data summarisation. Note that simple OLAP operations are suitable for data characterisation when a data cube already contains a summary of the data.
Presentation Of Generalized Results

Generalized Relation:
 Relations where some or all attributes are generalized, with counts or
other aggregation values accumulated.
Cross-Tabulation:
 Mapping results into cross-tabulation form (similar to contingency
tables).
Visualization Techniques:
 Pie charts, bar charts, curves, cubes, and other visual forms.
Quantitative characteristic rules:
 Mapping generalized results in characteristic rules with quantitative
information associated with it.
Approaches:

 Data cube approach(OLAP approach).
 Attribute-oriented induction approach.

Summary:
Data generalization is the process that abstracts a large set
of task-relevant data in a database from a low conceptual level
to higher ones.

3) Attribute Relevance:
The fundamental idea behind attribute relevance analysis is computing a measure of an attribute's relevance to a particular class or concept. Such measures include the correlation coefficient, ambiguity, and information gain.
It is a statistical method for preparing data by ranking or filtering out the relevant attributes. Irrelevant features that can be excluded from the concept description process are identified using measures of attribute relevance analysis. The inclusion of this preprocessing step in class characterisation or comparison is referred to as analytical characterisation. By comparing the general characteristics of objects between two classes (the target class and the contrasting class), data discrimination produces discriminating rules.

It is a comparison of the general properties of data objects from the target class with the general properties of objects from one or more contrasting classes. The target and contrasting classes can be specified by the user. Except that the findings of data discrimination include comparative measures, the techniques employed for data discrimination are essentially similar to those used for data characterisation.
Why attribute relevance analysis is done
The following are some of the justifications for attribute relevance analysis:
− It can decide which dimensions should be included.
− It can result in a significant amount of generalisation.
− It may reduce the number of attributes, which helps us recognise patterns quickly.
Attribute relevance analysis for concept description is carried out as follows:

Data collection: Collect data for both the target class and the contrasting class by query processing.

Preliminary relevance analysis using conservative AOI: A preliminary relevance analysis using conservative attribute-oriented induction (AOI) identifies a set of dimensions and attributes to which the chosen relevance measure is to be applied. AOI can be used for a preliminary analysis of the data by removing attributes that have a large number of distinct values. To allow more attributes to be considered in the subsequent relevance analysis by the chosen measure, the AOI performed here should use attribute generalisation thresholds that are set reasonably large.

Remove irrelevant and weakly relevant attributes: Using the chosen relevance measure, this step eliminates the irrelevant and weakly relevant attributes.

Generate the concept description using AOI: Perform AOI using a less conservative set of attribute generalisation thresholds to generate the concept description. If the descriptive mining function is class characterisation, only the initial target class working relation is included at this point. If the descriptive mining function is class comparison, both the initial target class working relation and the initial contrasting class working relation are included.

M4.U2.1) Finding frequent item sets:
In association mining, frequent elements are found in the
data collection.
Frequent mining often identifies the intriguing connections
and links between item sets in relational and transactional
databases.
In a nutshell, frequent mining identifies the elements that
frequently coexist in a transaction or relation.
Mining Association Need:
The creation of association rules from a transactional dataset
is known as frequent mining.
It is wise to group items X and Y together in stores or offer a
discount on one item when you buy the other if you usually buy
both.
This has a significant impact on sales.
For instance, it is likely that a consumer who purchases milk
and bread will also purchase butter.
As a result, ['milk']['bread']=>['butter'] is the association rule.

So the seller can suggest that the customer buy butter if he/she buys milk and bread.

Important Definitions :

 Support: It is one of the measures of interestingness. It tells about the usefulness and certainty of rules. A 5% support means that 5% of all transactions in the database follow the rule.

Support(A -> B) = Support_count(A ∪ B)

 Confidence: A confidence of 60% means that 60% of
the customers who purchased a milk and bread also
bought butter.

Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)

If a rule satisfies both minimum support and minimum


confidence, it is a strong rule.

 Support_count(X) : Number of transactions in which X


appears. If X is A union B then it is the number of
transactions in which A and B both are present.
 Maximal Itemset: An itemset is maximal frequent if none of its supersets are frequent.
 Closed Itemset: An itemset is closed if none of its immediate supersets has the same support count as the itemset itself.
 K-Itemset: An itemset that contains K items is a K-itemset.
So it can be said that an itemset is frequent if the corresponding support count is greater than or equal to the minimum support count.

Example On finding Frequent Itemsets – Consider the
given dataset with given transactions.

 Let us say the minimum support count is 3.
 The relation that holds is: maximal frequent => closed => frequent.

1-frequent itemsets:
{A} = 3   // not closed due to {A, C}; not maximal
{B} = 4   // not closed due to {B, D}; not maximal
{C} = 4   // not closed due to {C, D}; not maximal
{D} = 5   // closed, since no immediate superset has the same count; not maximal

2-frequent itemsets:
{A, B} = 2   // not frequent because support count < minimum support count, so ignore
{A, C} = 3   // not closed due to {A, C, D}
{A, D} = 3   // not closed due to {A, C, D}
{B, C} = 3   // not closed due to {B, C, D}
{B, D} = 4   // closed but not maximal due to {B, C, D}
{C, D} = 4   // closed but not maximal due to {B, C, D}

3-frequent itemsets:
{A, B, C} = 2   // not frequent (support count < minimum support count), ignore
{A, B, D} = 2   // not frequent (support count < minimum support count), ignore
{A, C, D} = 3   // maximal frequent
{B, C, D} = 3   // maximal frequent

4-frequent itemsets:
{A, B, C, D} = 2   // not frequent, ignore

2) Apriori algorithm:

The Apriori algorithm was given by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. We apply an iterative, level-wise search in which k-frequent itemsets are used to find (k+1)-itemsets.

To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used, which helps by reducing the search space.

Apriori Property –
All non-empty subset of frequent itemset must be frequent. The
key concept of Apriori algorithm is its anti-monotonicity of
support measure. Apriori assumes that

All subsets of a frequent itemset must be frequent(Apriori


property).
If an itemset is infrequent, all its supersets will be infrequent.

Before we start working through the algorithm, recall the definitions explained in the previous section. Consider the following dataset; we will find the frequent itemsets and generate association rules for them.

minimum support count is 2
minimum confidence is 60%

Step-1: K=1
(I) Create a table containing support count of each item present
in dataset – Called C1(candidate set)

(II) Compare each candidate set item's support count with the minimum support count (here min_support = 2); if the support count of a candidate item is less than min_support, remove it. This gives us the itemset L1.

Step-2: K=2

 Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common.
 Check whether all subsets of an itemset are frequent; if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
 Now find the support count of these itemsets by searching the dataset.

(II) Compare each candidate's (C2) support count with the minimum support count (here min_support = 2); if the support count of a candidate item is less than min_support, remove it. This gives us the itemset L2.

Step-3:

o Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common; so here, for L2, the first element should match. The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5} and {I2, I3, I5}.
o Check whether all subsets of these itemsets are frequent; if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3} and {I1, I3}, which are frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Check every itemset similarly.)
o Find the support count of the remaining itemsets by searching the dataset.

(II) Compare each candidate's (C3) support count with the minimum support count (here min_support = 2); if the support count of a candidate item is less than min_support, remove it. This gives us the itemset L3.

Step-4:

o Generate candidate set C4 using L3 (join step).


Condition of joining Lk-1 and Lk-1 (K=4) is that,
they should have (K-2) elements in common. So
here, for L3, first 2 elements (items) should
match.
o Check all subsets of these itemsets are frequent
or not (Here itemset formed by joining L3 is {I1,
I2, I3, I5} so its subset contains {I1, I3, I5},
which is not frequent). So no itemset in C4
o We stop here because no frequent itemsets are
found further

Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules comes into the picture. For that we need to calculate the confidence of each rule.

Confidence –
A confidence of 60% means that 60% of the customers,
who purchased milk and bread also bought butter.

Confidence(A->B)=Support_count(A∪B)/
Support_count(A)

So here, taking one frequent itemset as an example, we will show the rule generation.
Itemset {I1, I2, I3} // from L3
So the rules can be:
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) =
2/4*100=50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) =
2/4*100=50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) =
2/4*100=50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) =
2/6*100=33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) =
2/7*100=28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) =
2/6*100=33%

So if the minimum confidence is 50%, then the first three rules can be considered strong association rules.

Limitations of the Apriori Algorithm

The Apriori algorithm can be slow. Its main limitation is the time required to hold the vast number of candidate sets produced when there are many frequent itemsets, a low minimum support, or large itemsets; it is not an efficient approach for large numbers of datasets. For example, if there are 10^4 frequent 1-itemsets, more than 10^7 candidate 2-itemsets need to be generated, tested, and accumulated. Furthermore, to detect a frequent pattern of size 100, i.e. v1, v2, ..., v100, it has to generate 2^100 candidate itemsets, which makes candidate generation costly and time-consuming. The algorithm therefore checks many candidate itemsets and scans the database repeatedly to find them. Apriori becomes very slow and inefficient when memory capacity is limited and the number of transactions is large.
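The following is a minimal plain-Python sketch (not from the original text) of the level-wise join-and-prune search that Apriori performs. The nine transactions are assumed for illustration; they are consistent with the support counts used in the worked example above (e.g. sup(I1, I2) = 4, sup(I1) = 6, sup(I2) = 7).

from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_support_count = 2

def support_count(itemset):
    return sum(itemset <= t for t in transactions)

# C1 / L1: frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
frequent = [{frozenset([i]) for i in items if support_count({i}) >= min_support_count}]

# Level-wise loop: Lk-1 is joined with itself to build Ck, pruned, then filtered to Lk.
k = 2
while frequent[-1]:
    prev = frequent[-1]
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # Prune step: every (k-1)-subset of a candidate must itself be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    frequent.append({c for c in candidates if support_count(c) >= min_support_count})
    k += 1

for level, itemsets in enumerate(frequent[:-1], start=1):
    print(f"L{level}:", [sorted(s) for s in itemsets])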

4) Rule-based Classification:

IF-THEN Rules

A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form −

IF condition THEN conclusion

Let us consider a rule R1,

R1: IF age = youth AND student = yes


THEN buy_computer = yes

Points to remember −

 The IF part of the rule is called the rule antecedent or precondition.
 The THEN part of the rule is called the rule consequent.
 The antecedent part (the condition) consists of one or more attribute tests, and these tests are logically ANDed.
 The consequent part consists of the class prediction.

Note − We can also write rule R1 as follows −

R1: (age = youth) ^ (student = yes) => (buys_computer = yes)

If the condition holds true for a given tuple, then the


antecedent is satisfied.

Rule Extraction

Here we will learn how to build a rule-based classifier by
extracting IF-THEN rules from a decision tree.

Points to remember −

To extract a rule from a decision tree −

 One rule is created for each path from the root to the leaf
node.
 To form a rule antecedent, each splitting criterion is
logically ANDed.
 The leaf node holds the class prediction, forming the rule
consequent.
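As a minimal sketch (not from the original text), each root-to-leaf path of a decision tree trained with scikit-learn can be read off as one IF-THEN rule; the tiny buys_computer table and the numeric encoding of age and student are assumptions made only for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# Features: age (0 = youth, 1 = middle_aged, 2 = senior), student (0 = no, 1 = yes)
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = ["no", "yes", "yes", "yes", "no", "yes"]   # buys_computer

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Each printed path is a conjunction of splitting criteria ending in a class
# label, which can be read directly as an IF-THEN rule.
print(export_text(tree, feature_names=["age", "student"]))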

Rule Induction Using Sequential Covering Algorithm

The Sequential Covering Algorithm can be used to extract IF-THEN rules from the training data. We do not need to generate a decision tree first. In this algorithm, each rule for a given class covers many of the tuples of that class.

Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed and the process continues for the remaining tuples.

Note − Decision tree induction can be considered as learning a set of rules simultaneously, because the path to each leaf in a decision tree corresponds to a rule.

The following is the sequential learning algorithm, where rules are learned for one class at a time. When learning a rule for a class Ci, we want the rule to cover all the tuples of class Ci only and no tuple of any other class.

Algorithm: Sequential Covering

Input:
D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.

Output: A set of IF-THEN rules.

Method:
Rule_set = { };  // initial set of rules learned is empty
for each class c do
    repeat
        Rule = Learn_One_Rule(D, Att_vals, c);
        remove tuples covered by Rule from D;
    until termination condition;
    Rule_set = Rule_set + Rule;  // add a new rule to the rule set
end for
return Rule_set;

Rule Pruning

A rule is pruned for the following reasons −

 The Assessment of quality is made on the original set of


training data. The rule may perform well on training data
but less well on subsequent data. That's why the rule
pruning is required.
 The rule is pruned by removing conjunct. The rule R is
pruned, if pruned version of R has greater quality than
what was assessed on an independent set of tuples.

FOIL is one of the simple and effective methods for rule pruning. For a given rule R,

FOIL_Prune(R) = (pos - neg) / (pos + neg)

where pos and neg are the numbers of positive and negative tuples covered by R, respectively.

Note − This value will increase with the accuracy of R on
the pruning set. Hence, if the FOIL_Prune value is higher for the
pruned version of R, then we prune R.

3) Associative Classification:
To assist decision-makers, data mining is the process of
identifying and extracting hidden patterns from various forms of
data. Associative classification, a popular classification learning
technique in data mining, combines classification and association
rule detection techniques to produce classification models.

Association Rule learning in Data Mining:

A machine learning technique called association rule learning can be used to find interesting relationships between variables in large databases. It is intended to discover strong rules in a database on the basis of various interestingness measures. For any given multi-item transaction, association rules seek rules that specify how or why certain items are related.
Finding information on typical if-then patterns and utilising
precise criteria with help and trust to identify the crucial linkages
are how association rules are developed. Since confidence is
determined by the number of times an if-then statement is found
to be true, they aid in illustrating how frequently an item appears
in a set of data. To contrast expected and actual confidence, a
third measure known as lift is frequently used. Lift displays the
frequency with which the if-then sentence was expected to be
true. To construct itemsets based on the data produced by two or
more items, create association rules. Typically, association rules
are made up of laws that the data does a good job of illustrating.

To determine the specific analysis and outcome, several data mining techniques can be utilised, including classification analysis, clustering analysis, and multivariate analysis. The main purposes of association rules are to study and forecast consumer behaviour. In classification analysis, they are mostly used to make judgements, ask questions, and forecast behaviour. In clustering analysis, they are mostly employed when no assumptions are made about potential relationships in the data. Regression analysis is used when attempting to forecast the value of a continuous dependent variable from a set of independent variables.

Associative Classification in Data Mining:

The idea of associative classification was first put forth by


Bing Liu et al., who described a model whose principle is that
"the right-hand side is restricted to be the attribute of the
classification class." A supervised learning model called an
associative classifier assigns a target value using association
rules.

Association rules that yield class labels make up the model


that the association classifier creates and uses to label new
records. As a result, they can also be viewed as a list of "if-then"
statements: if a record satisfies the requirements (described on
the left side of the rule, also known as the antecedents), it is
marked (or scored) in accordance with the rule's category on the
right. The majority of associative classifiers mark new data by
applying the first matching rule after reading the list of rules in
order. Some metrics, such as Support or Confidence, which can
be used to sort or rank the rules in the model and assess their
quality, are inherited by association classifier rules from
association rules.

Types of Associative Classification:

There are different types of Associative Classification


Methods, Some of them are given below.
1. CBA (Classification Based on Associations):
It classifies data using association rule techniques, which are
more precise than conventional classification methods. It must
deal with the minimum support threshold's sensitivity. A lot
more rules are generated when a lower minimum support
criterion is selected.
2. CMAR (Classification based on Multiple Association
Rules): Compared to Classification Based on Associations, it
employs an effective FP-tree that uses less memory and
storage. When there are many attributes, the FP-tree may not
always fit in the main memory.
3. CPAR (Classification based on Predictive Association
Rules):
Predictive association rules-based classification combines
the benefits of association and conventional rule-based
categorization. A greedy algorithm is used in classification based
on predicted association rules to produce rules directly from
training data. In order to prevent overlooking crucial rules,
classification based on predictive association rules creates and
tests more rules than typical rule-based classifiers.

For starters, data discovery, which used to be limited to the expertise of advanced analytics specialists, is now something everyone can do using these tools. Not only that, these tools give you the insights you need to achieve things like growth, resolve urgent issues, collect all your data in one place, and forecast future outcomes. This book explains the top 15 Business Intelligence tools in 2021 and will hopefully put you on the right path towards selecting a tool fit for your business.

Keep in mind: These business intelligence tools all vary in
robustness, integration capabilities, ease-of-use (from a technical
perspective) and of course, pricing.

1. SAP Business Objects

SAP Business Objects is a business intelligence software which


offers comprehensive reporting, analysis and interactive data
visualisation. The platform focuses heavily on categories such as
Customer Experience (CX) and CRM, digital supply chain, ERP
and more. What’s really nice about this platform is the self-
service, role-based dashboards its offers enabling users to build
their own dashboards and applications. SAP is a robust software
intended for all roles (IT, end uses and management) and offers
tons of functionalities in one platform. The complexity of the
product, however, does drive up the price so be prepared for that.

Website: www.sap.com

2. Datapine

Datapine is an all-in-one business intelligence platform that


facilitates the complex process of data analytics even for non-
technical users. Thanks to a comprehensive self-service analytics
approach, datapine’s solution enables data analysts and business
users alike to easily integrate different data sources, perform
advanced data analysis, build interactive business dashboards
and generate actionable business insights.

Website: www.datapine.com

3. MicroStrategy

MicroStrategy is an enterprise business intelligence tool that


offers powerful (and high speed) dashboarding and data
analytics, cloud solutions and hyperintelligence. With this
solution, users can identify trends, recognise new opportunities,
improve productivity and more. Users can also connect to one or
various sources, whether the incoming data is from a
spreadsheet, cloud-based or enterprise data software. It can be
accessed from your desktop or via mobile. Setup, however can
involve multiple parties and some rather extensive knowledge of
the application in order to get started.

Website: www.microstrategy.com

4. SAS Business Intelligence

While SAS’ most popular offering is its advanced predictive


analytics, it also provides a great business intelligence platform.
This well-seasoned self-service tool, which was founded back in
the 1970s, allows users to leverage data and metrics to make
informed decisions about their business. Using their set of APIs,
users are provided with lots of customisation options, and SAS
ensures high-level data integration and advanced analytics &
reporting. They also have a great text analytics feature to give
you more contextual insights into your data.

Website: www.sas.com

5. Yellowfin BI

Yellowfin BI is a business intelligence tool and ‘end-to-end’


analytics platform that combines visualisation, machine learning,
and collaboration. You can also easily filter through tons of data
with intuitive filtering (e.g. checkboxes and radio buttons) as
well open up dashboards just about anywhere (thanks to this
tool’s flexibility in accessibility (mobile, webpage, etc.). The
nice thing about this BI tool is that you can easily take
dashboards and visualisations to the next level using a no
code/low code development environment.

Website: www.yellowfinbi.com

6. QlikSense

A product of Qlik, QlikSense is a complete data analytics


platform and business intelligence tool. You can use QlikSense
from any device at any time. The user interface of QlikSense is
optimised for touchscreen, which makes it a very popular BI
tool. It offers a one-of-a-kind associative analytics engine,
sophisticated AI and high performance cloud platform, making it
all the more attractive. An interesting feature within this platform
is its Search & Conversational Analytics which enables a faster
and easier way to ask questions and discover new insights by
way of natural language.

Website: www.qlik.com

7. Zoho Analytics

Zoho Analytics is great BI tool for in-depth reporting and data


analysis. This business intelligence tool has automatic data
syncing and can be scheduled periodically. You can easily build
a connector by using the integration APIs. Blend and merge data
from different sources and create meaningful reports. With an
easy editor you create personalised reports and dashboards
enabling you to zoom into the important details. It also offers a
unique commenting section in the sharing options which is great
for collaboration purposes.

Website: www.zoho.com/analytics

8. Sisense

Sisense is a user-friendly data analytics and business intelligence


tool that allows anyone within your organisation to manage large
and complex datasets as well as analyse and visualise this data
without your IT department getting involved. It lets you bring
together data from a wide variety of sources as well including
Adwords, Google Analytics and Salesforce. Not to mention,
because it uses in-chip technology, data is processed quite
quickly compared to other tools. This platform is even recognised as a leading cloud analytics platform by various industry experts such as Gartner, G2 and Dresner.

Website: www.sisense.com

9. Microsoft Power BI

Microsoft Power BI is a web-based business analytics tool suite


which excels in data visualisation. It allows users to identify
trends in real-time and has brand new connectors that allow you
to up your game in campaigns. Because it’s web-based,
Microsoft Power BI can be accessed from pretty much
anywhere. This software also allows users to integrate their apps
and deliver reports and real-time dashboards.

Website: www.powerbi.microsoft.com

10. Looker

Data discovery app, Looker is another business intelligence tool


to look out for! Now part of Google Cloud, this unique platform
integrates with any SQL database or warehouse and is great for
startups, midsize-businesses or enterprise-grade businesses.
Some benefits of this particular tool include ease of use, handy visualisations, powerful collaboration features (data and reports can be shared via email or URL as well as integrated with other applications), and reliable support (tech team).

Website: www.looker.com

11. Clear Analytics

This is for all of the Excel-lovers out there…This BI tool is an


intuitive Excel-based software that can be used by employees
with even the most basic knowledge of Excel. What you get is a
self-service Business Intelligence system that offers several BI features such as creating, automating, analysing and visualising your company's data. This solution also works with the
aforementioned Microsoft Power BI, using Power Query and
Power Pivot to clean and model different datasets.

Website: www.clearanalyticsbi.com

12. Tableau

Tableau is a Business Intelligence tool specialised in data


discovery and data visualisation. With the software you can
easily analyse, visualise and share data, without IT having to
intervene. Tableau supports multiple data sources such as MS
Excel, Oracle, MS SQL, Google Analytics and SalesForce. Users
will gain access to well-designed dashboards that are very easy
to use. Additionally Tableau also offers several standalone
products including Tableau Desktop (for anyone) and Tableau
Server (analytics for organisations), which can be run locally,
Tableau Online (hosted analytics for organisations) and many
more.

Website: www.tableau.com

13. Oracle BI

Oracle BI is an enterprise portfolio of technology and


applications for business intelligence. This technology gives
users pretty much all business intelligence capabilities, such as
dashboards, proactive intelligence, ad hoc, and more. Oracle is
also great for companies who need to analyse large data volumes
(from Oracle and non-Oracle sources) as it is a very robust
solution. Additional key features include data archiving,
versioning, a self-service portal and alerts/notifications.

Website: www.oracle.com

14. Domo

Domo is a completely cloud-based business intelligence platform


that integrates multiple data sources, including spreadsheets,
databases and social media. Domo is used by both small
companies and large multinationals. The platform offers micro
and macro level visibility and analyses (including predictive
analysis powered by Mr. Roboto, their AI engine). From cash
balances and listings of your best selling products by region to
calculations of the marketing return on investment (ROI) for
each channel. The only setbacks with Domo are the difficulty in
downloading analyses from the cloud for personal use and the
steep learning curve.

Website: www.domo.com

15. IBM Cognos Analytics

Cognos Analytics is an AI-fueled business intelligence platform


that supports the entire analytics cycle. From discovery to
operationalisation, you can visualise, analyse and share
actionable insights about your data with your colleagues. A great
benefit of AI is that you are able to discover hidden patterns,
because the data is being interpreted and presented to you in a
visualized report. Keep in mind, however, that it can take a while
to get familiar with all of the features within this solution.

Website: www.ibm.com

Business Intelligence tools are very versatile and provide you with a lot of useful information regarding your business's performance and where it is headed. However, while they are great for collating and gathering data from various sources and helping you make sense of it, they do little in terms of collecting data directly from your customers. So why not take it a step further?

The Voice of Customer (VoC) is a critical factor in not only boosting your profits (as we mentioned before), but also in creating a sense of loyalty among your customers and appreciation for your efforts to provide them with a meaningful online experience. That is the true source of success. So how can you supplement your existing BI tool and gather this Voice of Customer data? Try collecting and analysing customer feedback.

