
Editorial Board

Dr. Charu Gupta


Assistant Professor, Department of Computer Science, School of Open Learning, DDCE,
Campus of Open Learning, University of Delhi
Ms. Aishwarya Anand Arora
Assistant Professor, Department of Computer Science, School of Open Learning, DDCE,
Campus of Open Learning, University of Delhi
Ms. Asha Yadav
Assistant Professor, Department of Computer Science, School of Open Learning, DDCE,
Campus of Open Learning, University of Delhi

Content Writers
Aditi Priya, Hemraj Kumawat, Dr. Charu Gupta
Academic Coordinator
Mr. Deekshant Awasthi

© Department of Distance and Continuing Education


ISBN: 978-81-19417-96-4
1st Edition: 2023
E-mail: ddceprinting@col.du.ac.in
management@col.du.ac.in

Published by:
Department of Distance and Continuing Education
Campus of Open Learning/School of Open Learning,
University of Delhi, Delhi-110007

Printed by:
School of Open Learning, University of Delhi
DISCLAIMER

Corrections/Modifications/Suggestions proposed by Statutory Body, DU/
Stakeholder/s in the Self Learning Material (SLM) will be incorporated in
the next edition. However, these corrections/modifications/suggestions will be
uploaded on the website https://sol.du.ac.in.
Any feedback or suggestions can be sent to the email-feedbackslm@col.du.ac.in.

Printed at: Taxmann Publications Pvt. Ltd., 21/35, West Punjabi Bagh,
New Delhi - 110026 (200 Copies, 2023)

Contents

PAGE

Lesson 1 : Data Unveiled: An In-Depth Look at Types, Warehouses, and Marts


1.1 Learning Objectives 1
1.2 Introduction 2
1.3 Exploring Data Types 3
1.4 Data Warehousing: Foundations and Concepts 9
1.5 Exploring Data Marts 16
1.6 Summary 21
1.7 Answers to In-Text Questions 22
1.8 Self-Assessment Questions 22
1.9 References 23
1.10 Suggested Reading 23

Lesson 2 : Data Quality, Data Cleaning, Handling Missing Data, Outliers,


and Overview of Big Data
2.1 Learning Objectives 24
2.2 Introduction 25
2.3 Data Quality 25
2.4 Data Cleaning 32
2.5 Handling Missing Data and Outliers 39
2.6 Overview of Big Data 46
2.7 Summary 46
2.8 Answers to In-Text Questions 47
2.9 Self-Assessment Questions 48
2.10 References 49
2.11 Suggested Readings 49


Lesson 3 : Navigating the Data Analytics Journey: Lifecycle, Exploration,


and Visualization
3.1 Learning Objectives 50
3.2 Introduction 51
3.3 Understanding Data Analytics 54
3.4 Data Exploration: Uncovering Insights 58
3.5 Data Visualisation: Communicating Insights 67
3.6 Summary 70
3.7 Answers to In-Text Questions 71
3.8 Self-Assessment Questions 71
3.9 References 72
3.10 Suggested Reading 72

Lesson 4 : Foundations of Predictive Modelling: Linear and Logistic


Regression, Model Comparison, and Decision Trees
4.1 Learning Objectives 74
4.2 Introduction 74
4.3 Linear Regression 75
4.4 Logistic Regression 80
4.5 Model Comparison 82
4.6 Decision Trees 86
4.7 Summary 91
4.8 Answers to In-Text Questions 92
4.9 Self-Assessment Questions 92
4.10 References 93
4.11 Suggested Reading 93

Lesson 5 : Unveiling Data Patterns: Clustering and Association Rules


5.1 Learning Objectives 94
5.2 Introduction 95
5.3 Understanding Clustering 96


5.4 The Mechanics of Clustering Algorithms 100


5.5 Real-World Applications of Clustering 107
5.6 Evaluating Cluster Quality 108
5.7 Unravelling Connections: The World of Association Rules 110
5.8 Real-World Applications of Association Rules 115
5.9 Summary 117
5.10 Answers to In-Text Questions 118
5.11 Self-Assessment Questions 118
5.12 References 118
5.13 Suggested Readings 119

Lesson 6 : From Data to Strategy: Classification and Market Basket


Analysis Driving Action
6.1 Learning Objectives 121
6.2 Introduction 121
6.3 Classification: A Predictive Modelling
6.4 Understanding Popular Classification Algorithms
6.5 Naive Bayes Classification: Probability and Independence
6.6 K-NN (K-Nearest Neighbour) Algorithm: The Power of Proximity 133
6.7 Evaluation Measures for Classification
6.8 Real-World Applications of Classification
6.9 Strategic Insights with Market Basket Analysis 141
6.10 Summary 144
6.11 Answers to In-Text Questions 145
6.12 Self-Assessment Questions 145
6.13 References 146
6.14 Suggested Readings 147


Lesson 7 : Predictive Analytics and its Use


7.1 Learning Objectives 148
7.2 Introduction 149
7.3 Predictive Analytics 149
7.4 Predictive Analytics Applications and Benefits: Marketing, Healthcare,
Operations and Finance 159
7.5 Text Analysis 171
7.6 Analysis of Unstructured Data 172
7.7 In-Database Analytics 172
7.8 Summary 174
7.9 Answers to In-Text Questions 175
7.10 Self-Assessment Questions 175
7.11 References 175
7.12 Suggested Readings 176

Lesson 8 : Technology (Analytics) Solutions and Management of their


Implementation in Organizations
8.1 Learning Objectives 177
8.2 Introduction 178
8.3 Management of Analytics Technology Solution in Predictive Modelling 180
8.4 Predictive Modelling Technology Solutions 183
8.5 Summary 185
8.6 Answers to In-Text Questions 185
8.7 Self-Assessment Questions 186
8.8 References 186
8.9 Suggested Readings 186
Glossary 187

L E S S O N

1
Data Unveiled: An
In-Depth Look at Types,
Warehouses, and Marts
Aditi Priya
Research Scholar
School of Computer & Systems Sciences
JNU, New Delhi - 110067
Email-Id: kajal.aditi@gmail.com

STRUCTURE
1.1 Learning Objectives
1.2 Introduction
1.3 Exploring Data Types
1.4 Data Warehousing: Foundations and Concepts
1.5 Exploring Data Marts
1.6 Summary
1.7 Answers to In-Text Questions
1.8 Self-Assessment Questions
1.9 References
1.10 Suggested Reading

1.1 Learning Objectives


‹ To understand the different types of data and their importance.
‹ To understand the concept of data warehouses and their benefits in data management.
‹ To understand the outline of the architecture of a data mart, including data sources,
data storage, and user access methods.

1.2 Introduction
In the fast-paced world of contemporary enterprise, data has evolved
into a game-changer that influences how companies conduct operations,
cultivate novel ideas, and arrive at significant decisions. The evolution
of data from a mere byproduct to a strategic asset highlights its ever-
increasing importance in our digitalized world. Amidst the information
overload of today’s highly connected world, businesses must navigate
a torrent of data from many sources – customer touch points, digital
interactions, social media platforms, IoT sensors, and additional avenues.
Amidst these data surges, challenges abound, yet so do opportunities.
Converting data into actionable insights maximizes efficiency and growth
potential in a competitive market. Data is more than just numbers and
figures; it provides a comprehensive picture of consumer behaviour, market
trends, operational efficiency, and emerging possibilities. It revolutionizes
conventional decision-making by providing an in-depth comprehension of
consumer desires, enabling enterprises to tailor their approaches to align
with constantly evolving expectations. Decision-making has witnessed
data’s seamless transition from bit player to star performer. In the past,
trusting intuition and emotions might have steered crucial decisions. In
today’s business landscape, data-driven decision-making is becoming
more critical. Sophisticated algorithms and cutting-edge tools enable
organizations to gain valuable insights, forecast future developments, and
make strategic decisions that give them an edge over their competitors.
Data’s significance transcends immediate operations and spurs creativity
and ongoing enhancement. Mining historical data and process optimization
leads to a deeper understanding, empowering businesses to pursue new
avenues of growth and enhance existing practices.
Moreover, the integration of data is not confined to internal operations
alone. It facilitates a deep understanding of customer behaviour, preferences,
and pain points. This customer-centric approach enables businesses
to precisely tailor their offerings, marketing campaigns, and customer
experiences. However, it is essential to acknowledge that data, in its raw
form, is not a magic elixir. It requires the right tools, methodologies, and
strategies to extract its full potential. Data quality, accuracy, and relevance
are paramount, as erroneous data can lead to misguided decisions and
faulty strategies.

In essence, data has become the cornerstone of modern business and
decision-making. Its transformative influence has ushered in an era where
businesses that embrace data-driven strategies stand poised to thrive,
adapt, and innovate. The ability to harness data effectively is not just
a competitive advantage – it is a prerequisite for relevance and success
in a rapidly evolving landscape. As we journey deeper into the digital
age, the role of data in shaping the business landscape will continue to
evolve, redefining what is possible and propelling organizations toward
new horizons of achievement.
This chapter mainly explores different data types, the concept of data
warehouses, and the significance of data marts.

1.3 Exploring Data Types


Data and information exhibit different characteristics in knowledge and
decision-making but are closely interrelated concepts. Data often encompasses
facts, figures, symbols, or observations that lack context, meaning, and
interpretation. One can think of it as the building blocks for information
and knowledge. Data can take forms like numbers, text, images, audio
or video. On its own, data holds little significance; it needs processing and
interpretation to unlock its meaning and usefulness. Some examples of
data include temperature readings, survey responses, transaction amounts,
and sensor measurements. Information, by contrast, represents processed data
that has been organized and interpreted to provide context, meaning, and significance. It
emerges from analyzing and structuring data to extract insights that make
it helpful in understanding or making decisions. Information is presented
in a format that promotes comprehension and communicates knowledge
effectively. It addresses questions raised or contributes to understanding
a subject or context. Examples of information include weather forecasts,
sales reports, research findings, and summarized analyses.
In conclusion, data acts as the raw material from which information is
derived. It gains value when it undergoes analysis, interpretation, and
organization to become information that guides decision-making, supports
research endeavours, and fosters comprehensive understanding. Information
can take many forms, whereas data refers specifically to raw, unprocessed
facts; data is thus the foundation of information and knowledge. Data can
be broadly categorized into three main types:

‹ Structured Data: This data type exhibits a strong structure and
adheres to a predetermined format. It is typically housed in
databases and tables with predefined columns and rows. Examples
include numerical values, dates, identifiers, and location data.
‹ Semi-structured Data: This data type does not conform to the structure
of traditional structured data, like data stored in relational databases,
but it is not entirely unstructured. It lies in between structured and
unstructured data regarding organization and flexibility.
‹ Unstructured Data: Unstructured data lacks any predefined organization
or schema. It spans formats such as images, audio, video, and free
text; social media posts, emails, and multimedia content are typical
examples. (A short Python sketch after this list illustrates the three
categories.)
Based on their nature, format, and attributes, common types of data are
as follows:
1. Numerical Data: As quantitative data, these values are measurable
and quantifiable, ranging from physical measurements such as size or
weight to numeric scores assigned to subjective qualities such as
happiness.
2. Categorical Data: Also classified as qualitative data, categorical
data identifies categories or groups. It can be organized on a nominal
or an ordinal scale.
3. Text Data: Including words, sentences, paragraphs, or any written
content, textual data spans a wide range. This scope includes
documents, social media posts, emails, and others.
4. Time Series Data: Chronologically arranged, the data captures
changes over time. It is commonly used in fields such as finance,
economics, and environmental monitoring.
5. Spatial Data: Spatial data consists of critical geographic details
like coordinates and maps. In GIS, it is utilized for mapping and
analytical purposes.
6. Binary Data: Two discrete values are apparent through binary
representation: 0 and 1. This construct finds frequent use in computer


systems and programming as a versatile means of denoting On/Off,
True/False, or Presence/Absence.
7. Image Data: Image data carries visual information and is represented
in pixels. In fields like computer vision and image processing, it
finds extensive application.
8. Audio Data: Audio data represents sound waves and enables speech
recognition and audio processing capabilities.
9. Video Data: A series of frames containing visual data compose video
data. Such applications include video analysis and surveillance.
10. Sensor Data: Information from sensors and gadgets, such as temperature
sensors, GPS devices, and accelerometers, is gathered.
11. Big Data: Big data refers to massive datasets that are too complex
to be processed using traditional data management tools. It is
characterized by the 3Vs: volume, velocity, and variety. These data
types often overlap or combine, especially in complex analyses. The
choice of data type depends on the context of the analysis and the
specific insights or information being sought.
Data of different kinds are crucial in a business setting to facilitate
knowledgeable decision-making, optimize processes, analyze consumer
conduct, and guide strategic planning. Here are some key types of data
commonly used in business:
1. Transactional Data: Comprehensive documentation of commercial
transactions, including sales receipts, purchase orders, invoices, and
payments. Such data provides a comprehensive view of financial
performance by tracking revenue and expenses.
2. Customer Data: Included in this compilation are demographical
details, contact data, buying history, personal tastes, and client
feedback. Data science applications equip businesses with the
ability to tailor their marketing tactics to diverse customer groups,
augmenting effectiveness.
3. Sales Data: Details regarding sales activities, covering product sales,
sales distribution methods, pricing, and promotion strategies, are
fundamental to companies. Such data helps evaluate sales performance,
guide pricing strategies, and identify best-selling products.


4. Marketing Data: Across various channels, marketing metrics comprise
website traffic, social media involvement, click-through rates, and
conversion rates. Analytical data help determine marketing success
and fine-tune subsequent approaches.
5. Financial Data: These important financial reports cover balance sheets,
income statements, cash flow statements, and more. These reports are
indispensable for financial analysis, budgeting, and forecasting.
6. Supply Chain Data: Detailed information regarding transportation,
warehousing, supplier networks, and logistics procedures. Supply chain
optimization and inventory control are its prime functions.
7. Employee Data: Employee-related information covers payroll,
attendance, performance measurements, and training documentation.
It is essential for, and used widely in, HR functions.
8. Operational Data: Day-to-day operations data includes production
steps, equipment maintenance, and resource usage information. Such
data is primarily used to improve processes through optimization
and efficiency enhancements.
9. Market Research Data: Insights gained from survey responses,
focus group discussions, and market analyses yield a comprehensive
understanding of market dynamics and customer inclinations.
Informing product development and marketing strategies is the
primary function of customer feedback.
10. Competitor Data: Data regarding competitors' offerings, cost structures,
advertising campaigns, and market position. Such data facilitates
competitive assessment and differentiation analysis.
11. Social Media Data: Data from social media platforms, including
user interactions, sentiment analysis, and brand mentions. They
are used for social media marketing, reputation management, and
understanding customer sentiment.
12. Web Analytics Data: Data about website traffic, user behaviour,
and online interactions. They are used for improving website design,
user experience, and conversion rates. These data types are often
integrated, analyzed, and interpreted to drive data-driven decision-
making, innovation, and business growth.


Data takes centre stage in the corporate world, offering knowledge,
guiding strategic choices, streamlining workflows, and pioneering novel
approaches. Here is how data is essential in a business model:
1. Informed Decision-Making: Objective, accurate, and data-driven
decisions contrast with those fuelled by intuition. Businesses can
make strategic decisions supporting their goals by analyzing trends,
customer behaviour, and market dynamics.
2. Customer Insights: Data provides a clear picture of customer wants,
needs, and buying tendencies, allowing businesses to make informed
decisions. Businesses can skilfully craft products and marketing
campaigns that resonate with their intended market.
3. Personalization: Organizations can craft tailor-made experiences
for clients, leading to increased customer contentment and brand
loyalty. Marketing strategies that cater to individual preferences are
classified under personalization.
4. Operational Efficiency: Data-driven insights facilitate operational
optimization. Through data analysis, organizations discover areas for improvement,
optimize workflows, and reduce costs to boost productivity.
5. Forecasting and Planning: Combining past and present data allows
for reliable predictions, benefiting businesses by helping them prepare
for future requirements, allocate resources effectively, maintain
inventory levels, and adjust production accordingly.
6. Market Insights: Market trends, rival movements, and new potential
emerge from data insights. With this data, enterprises can effectively
respond to market shifts and maintain their position.
7. Innovation and Product Development: Data visualization uncovers
market gaps and innovative areas to explore. Analyzing customer
opinions allows businesses to generate products and services that
resonate with their needs.
8. Risk Management: Businesses conduct detailed data assessments to
evaluate financial, operational, and compliance risks. Taking such
actions permits them to minimize potential dangers.
9. Marketing and Advertising: Data serves as a marketing roadmap,
pinpointing optimal channels, messaging, and timelines. For seamless,
targeted campaigns, it excels.


10. Customer Retention: Data helps identify at-risk customers and
allows businesses to implement retention strategies. By analyzing
customer behaviour, businesses can offer targeted incentives and
personalized communication.
11. Measuring Performance: Data provides key performance indicators
(KPIs) that allow businesses to measure their success and progress
toward goals. Performance measuring helps evaluate the effectiveness
of strategies and make adjustments as needed.
12. Business Intelligence and Reporting: The cornerstone of BI and
reporting tools, data warehouses lay the groundwork. Users can
design dashboards and visualizations with these tools.
13. Separation from Transactional Systems: Data warehouses are
distinct from transactional databases used in day-to-day operations.
Transactional databases focus on capturing real-time transactions,
while data warehouses provide a platform for analysis and reporting.
14. Continuous Improvement: Data-driven insights lead to continuous
improvement across various business aspects, including operations,
customer experience, and employee performance.
Incorporating data into the business model empowers organizations to
make proactive, efficient, and customer-centric decisions. It fosters a
culture of innovation and adaptability, enabling businesses to thrive in a
rapidly changing marketplace.

IN-TEXT QUESTIONS
1. What is a characteristic of structured data?
(a) It lacks a predefined schema
(b) It is typically stored in relational databases
(c) It includes textual documents
(d) It is challenging to query and analyze
2. Unstructured data is commonly found in which of the following
forms?
(a) Database tables
(b) XML files


(c) Textual documents

(d) Spreadsheet data

1.4 Data Warehousing: Foundations and Concepts


A data warehouse is a central storage facility crafted to store, collate,
and organize data from different parts of an organization for efficient
data utilization and analysis. This singular data source enables seamless
accessibility for data-driven decision-making, analysis, and reporting.
Optimized for querying and reporting, data warehouses grant users access
to valuable historical and contemporary data.
Key characteristics of a data warehouse include:
‹ Integration of Data: Integrating information from diverse sources,
including transactional databases, operational systems, spreadsheets,
and outside sources, data warehouses centralize data. The integration
of data reveals a complete perspective on an organization’s activities.
‹ Structured Data: Standardization is crucial to a data warehouse’s
structure, involving predefined tables with consistent data organization.
‹ Historical Data: Data collection and storage in data warehouses
enables long-term trend analysis and cross-sectional comparisons.
‹ Optimized for Analysis: Unlike transactional databases, which optimize
data capture and updates, data warehouses prioritize query speed
and reporting capabilities. Techniques like indexing, materialized
views, and columnar storage enable faster analytical queries.
‹ Support for Complex Queries: Designed to accommodate intricate
queries involving joins, aggregations, and transformations, data
warehouses feature centralized storage systems.
‹ Data Transformation: The ETL process often transforms and cleans
data to ensure data quality and consistency.
‹ Data Security and Governance: Data warehouses commonly follow
stringent security protocols to restrict access to confidential data
and conform to privacy laws.
‹ Scalability: The warehouse can be scaled horizontally or vertically
to meet growing data and workload needs.


Data warehouses have evolved, and different architectural approaches


have emerged, including:
‹ Traditional On-Premises Data Warehouses: Physical storage requires
equipment upkeep within an organization’s infrastructure.
‹ Cloud-Based Data Warehouses: Cloud platforms are where these
reside, providing features like scalability, flexibility, and optimized
infrastructure management. Examples include Amazon Redshift,
Google BigQuery, and Snowflake.
In the organizational context, data warehouses translate raw data into
actionable insights, empowering decision-makers to make better choices,
optimize strategies, and enhance competitiveness. We will discuss more
about the data warehousing process, types of data warehouses, data
warehousing models, and benefits of data warehouses in the following
section:
1. Data Warehousing Process:
‹ Extraction: From internal databases to external systems, spreadsheets,
and beyond, a wide range of data sources are tapped.
‹ Transformation: Extracted data is cleaned and reshaped to fit the
data warehouse schema, resulting in uniformity, correctness, and
seamless integration. Data validation, formatting, and structuring
all require careful attention.
‹ Loading: Two standard techniques for loading transformed data are
batch processing and real-time streaming.
‹ Query and Analysis: Once the data is loaded and integrated, users
can query it with a range of analysis and reporting tools. (A minimal
ETL sketch in Python follows this list.)
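A minimal sketch of the extract-transform-load flow in Python (using pandas) is given below. The source values, column names, and cleaning rules are assumptions made only for illustration; a real warehouse load would be driven by the organization's own schema and tooling.

import pandas as pd

# Extract: pull raw records from a source system (an in-memory table stands in here).
raw = pd.DataFrame({
    "order_id":   [1, 2, 2, 3],
    "order_date": ["2023-01-05", "2023-01-06", "2023-01-06", None],
    "amount":     ["1200", "950", "950", "430"],
})

# Transform: de-duplicate, validate, and standardize so the data fits the warehouse schema.
clean = (raw.drop_duplicates(subset="order_id")      # remove repeated source rows
            .dropna(subset=["order_date"])           # reject records missing a mandatory field
            .assign(order_date=lambda d: pd.to_datetime(d["order_date"]),
                    amount=lambda d: d["amount"].astype(float)))

# Load: write the conforming rows to the warehouse table (a local file stands in for it).
clean.to_csv("fact_orders.csv", index=False)

In practice the load step would write to the warehouse platform itself, either in batches or as a stream, as noted above.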
2. Types of Data Warehouses:
‹ Enterprise Data Warehouse (EDW): Repository comprising
information from various segments and departments.
‹ Departmental Data Warehouse: This warehouse provides tailored
data storage and reporting features to cater to the distinct data needs
of a specific department.
‹ Operational Data Store (ODS): Holds frequently refreshed operational
data, so analysis can be performed on near-real-time information.


3. Data Warehouse Architecture:


The choice of data warehouse architecture depends on the organization’s
size, data volume, performance requirements, budget, and existing IT
infrastructure. Many organizations are adopting hybrid or cloud-based data
warehousing solutions to gain flexibility and scalability while minimizing
infrastructure management overhead:
‹ Single-Tier Architecture: All components (extraction, transformation,
loading, storage, and querying) operate on the same server. This
type of architecture is best suited for small-
scale data warehouses because they are simple to implement. They
cannot be extended to large businesses because of poor scalability
and performance.
‹ Two-Tier Architecture: Separate servers handle querying and data
management tasks. This architecture separates data storage from data
querying. Data is loaded into a central repository, and users query
the data directly from this repository. It offers better scalability and
performance than a single-tier architecture but may have limitations
for complex data integration.
‹ Three-Tier Architecture: Operations are compartmentalized by
dividing ETL, storage, and querying into separate layers. Also known
as a data warehouse client-server architecture, this model separates
data storage, management, and presentation into three tiers.
The first tier is the data source layer, the second is the data warehouse
layer (ETL and data storage), and the third is the client layer, where
users access and visualize data.
It provides improved scalability, flexibility, and ease of maintenance.
4. Data Warehouse Models
Data warehouse models define the structure and organization of data
within a data warehouse. These models determine how data is stored,
accessed, and used for analytical and reporting purposes. There are several
standard data warehouse models:
‹ Relational Data Warehouse Model: The relational model organizes
data into tables with rows and columns, much like traditional
relational databases. It uses SQL for querying and is well-suited for
structured data. Data is typically normalized to reduce redundancy,

PAGE 11
© Department of Distance & Continuing Education, Campus of Open Learning,
School of Open Learning, University of Delhi
MBAFT 7708 PREDICTIVE ANALYTICS AND BIG DATA

Notes but denormalization may be used for performance optimization. Star


and snowflake schemas are common variations of the relational
model used in data warehousing.
‹ Multidimensional Data Warehouse Model: The multidimensional
model represents data in a matrix-like structure, with dimensions
(attributes) and facts (measures). It is highly intuitive for business
users and simplifies complex data relationships. OLAP (Online
Analytical Processing) tools are often used to navigate and analyze
multidimensional data (a small pivot-table sketch after this list
illustrates the idea).
‹ Dimensional Data Warehouse Model: Dimensional modelling is
a specific approach within the multidimensional model. It uses
dimensions (e.g., time, product, location) and facts (e.g., sales
revenue, quantity sold) to create star or snowflake schemas. Star
schemas are simple and intuitive, making them popular in data
warehousing.
‹ Data Vault Data Warehouse Model: The Data Vault model is designed
for scalability and flexibility. It uses hubs (containing business keys),
links (defining relationships), and satellites (holding attributes) to
model data. Data Vault is known for its ability to handle changing
data sources and evolving business requirements.
‹ Columnar Data Warehouse Model: The columnar model stores
data in column-wise format rather than row-wise. It is optimized
for analytical queries, offering faster query performance for read-
heavy workloads. Columnar databases are used in data warehousing
systems to improve analytics.
‹ In-Memory Data Warehouse Model: In this model, data is stored
entirely in RAM for rapid access and query processing. It is designed
for ultra-fast query performance, making it suitable for real-time
analytics.
‹ NoSQL Data Warehouse Model: Some organizations use NoSQL
databases, such as document-oriented or column-family databases, for
storing and processing data in data warehousing scenarios. NoSQL
databases can handle semi-structured and unstructured data, making
them useful for modern data warehousing needs.
‹ Data Lake Data Warehouse Model: Data lakes are not traditional
data warehouses but can complement them. Data lakes store structured


and unstructured data in their raw format, providing flexibility


for diverse data sources. Data lakes can be integrated with data
warehouses to support more extensive analytics.
‹ Hybrid Data Warehouse Model: Hybrid data warehouse models
combine different architectural approaches to meet specific business
needs. For example, organizations may combine on-premises relational
databases with cloud-based data lakes and NoSQL databases.
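To make the multidimensional idea concrete, the short sketch below (with invented sales figures) aggregates a small fact table across two dimensions, which is roughly what an OLAP tool does when it slices a cube. It is an illustration only, not the behaviour of any particular product.

import pandas as pd

# A tiny fact table: each row records a measure (sales) at one point in the dimension space.
facts = pd.DataFrame({
    "year":    [2022, 2022, 2023, 2023],
    "region":  ["North", "South", "North", "South"],
    "product": ["A", "A", "B", "B"],
    "sales":   [120, 80, 150, 95],
})

# Aggregating the measure over two dimensions yields one face of the "cube".
cube_slice = facts.pivot_table(values="sales", index="region",
                               columns="year", aggfunc="sum")
print(cube_slice)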
The choice of a data warehouse model depends on factors like the
nature of the data, business requirements, scalability needs, and existing
technology infrastructure. Many organizations combine these models to
accommodate diverse data sources and analytical demands. The essential
components of all data warehouse schemas are fact and dimension
tables, as shown below:

Figure 1.1: Components of a data warehouse


There are three different types of data warehouse schemas:
‹ Star Schema: In this model, dimension tables surround a central
fact table in a star-shaped arrangement. (A small query sketch over
a star schema appears after the schema figures below.)



Figure 1.2: Example of A Star Schema


‹ Snowflake Schema: Normalized dimension tables facilitate the
development of a more elaborate snowflake structure, which evolves
from the original star schema.

Figure 1.3: Example of A Snowflake Schema


‹ Galaxy Schema: A connected structure is created by combining
various star schemas.

Figure 1.4: Example of a Galaxy Schema
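The sketch below sets up an assumed, minimal star schema in Python: one fact table of sales keyed to two dimension tables, joined and aggregated the way a typical reporting query would be. The table and column names are illustrative only.

import pandas as pd

# Dimension tables: descriptive attributes keyed by surrogate keys.
dim_product = pd.DataFrame({"product_key": [1, 2],
                            "category": ["Electronics", "Grocery"]})
dim_date = pd.DataFrame({"date_key": [20230101, 20230102],
                         "month": ["Jan", "Jan"]})

# Fact table: foreign keys into each dimension plus a numeric measure.
fact_sales = pd.DataFrame({
    "product_key": [1, 2, 1],
    "date_key":    [20230101, 20230101, 20230102],
    "revenue":     [500.0, 120.0, 700.0],
})

# A typical star-schema query: join the fact table to its dimensions, then aggregate.
report = (fact_sales
          .merge(dim_product, on="product_key")
          .merge(dim_date, on="date_key")
          .groupby(["month", "category"], as_index=False)["revenue"].sum())
print(report)

A snowflake schema would differ only in that the dimension tables themselves would be normalized into further lookup tables.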


Data Warehouse Benefits
‹ Improved Decision-Making: Presents integrated data insights for
insightful analysis, facilitating informed decision-making.
‹ Enhanced Data Quality: Incoming data is cleaned and transformed,
ensuring accuracy and consistency.
‹ Historical Analysis: By analyzing historical data, trends and patterns
can be identified, helping with accurate forecasting.
‹ Reduced Data Redundancy: Information is unified, eliminating
unnecessary variability and bringing consistency.
‹ Centralized Data: Functions as a unified data hub for the organization.
Challenges and Considerations:
‹ Data Integration: Synthesizing information from diverse origins
necessitates diligent preparation.
‹ Performance: Timely insights depend on swift query performance.
‹ Scalability: With rising data volumes, the data warehouse must be
capable of scaling to meet needs.
‹ Data Governance and Security: Data security and adherence to
regulations must be ensured.


Notes Data warehouses are a foundational element of data-driven decision-


making, allowing organizations to analyze historical and current data to
extract insights and drive business growth.

IN-TEXT QUESTIONS
3. What is the primary purpose of a data warehouse?
(a) Real-time data processing
(b) Long-term data storage
(c) Data analysis and reporting
(d) Data transmission between servers
4. What is the primary goal of data transformation in a data
warehouse during the ETL process?
(a) Storing raw data as-is
(b) Aggregating data for reporting
(c) Preparing data for analysis and reporting
(d) Extracting data from source systems

1.5 Exploring Data Marts


Data marts are a subset of data warehouses tailored to provide information
essential to a particular organizational department or business line. A
smaller, more focused data repository was developed uniquely for a
select group’s statistical requirements. Built to cater to specific business
needs, data marts present a condensed perspective on relevant data in a
particular domain.
Key characteristics of data marts include:
‹ Focused Data: Data marts target specific business functions such as
sales, marketing, finance, and human resources.
‹ User-Centric: Data marts offer user groups specialized data, increasing
organizational performance.
‹ Simplified Structure: The specific structure of data marts accommodates
user requirements. Boosting query speed often involves denormalization
and pre-aggregation.


‹ Reduced Complexity: Data marts are split into various business
segments, each containing a smaller portion of the data warehouse’s
information. With this advancement, the data retrieval and evaluation
process has become more straightforward.
‹ Improved Performance: Data marts improve query response times
and reporting efficiency by focusing on a specific business sector
and pre-processing data.
‹ Independent Development: Each department can independently
build and maintain its data marts without compromising the data
warehouse framework.
‹ Scalability: Organizations can expand their data infrastructure from
a single data mart over time.
Two approaches to organizing and structuring data exist within an
organization’s data architecture: Dependent Data Marts and Independent
Data Marts. Each approach has its benefits and considerations:
‹ Dependent Data Marts:
(i) Definition: Dependent Data Marts are built as subsets of an
enterprise data warehouse (EDW). They cater to various
departmental or business unit requirements by extracting and
adapting data from the EDW. These data marts offer valuable
insights from a consolidated data source.
(ii) Benefits:
(a) Consistency: By deriving from a centralized EDW, Dependent
Data Marts maintain data consistency organization-wide.
(b) Data Governance: Implementing data governance and
control becomes simpler when data is managed in EDW,
decreasing the likelihood of accuracy issues.
(c) Cost Efficiency: Leveraging existing EDW infrastructure
and data, Dependent Data Marts tend to be less expensive
to construct and maintain.
(d) Scalability: As the organization expands, Dependent Data
Marts adapt well because they are part of a broader EDW
environment.


(iii) Considerations:


(a) Complexity: Dependent Data Marts necessitate coordinated
planning and execution to satisfy each business unit’s
requirements.
(b) Latency: Some business units need more recent data, which
the EDW’s refresh schedule may not provide.
(c) Customization: With data customization for each business
unit, complexity is possible, and data fragmentation
could result.
‹ Independent Data Marts:
(i) Definition: These data marts serve specific business areas or
departments and exist independently of any central data
warehouse; data is managed autonomously by the owning unit.
(ii) Benefits:
(a) Speed: As independent marts do not coordinate with
centralized EDW teams, they can be developed more
quickly.
(b) Customization: Customized solutions tailored to the specific
needs of a business unit offer flexibility.
(c) Autonomy: Business units can alter data as per their
requirements.
(iii) Considerations:
(a) Data Consistency: Variations in data definitions and standards
between units can lead Independent Data Marts astray.
(b) Data Governance: With decentralized data, implementing
governance can be challenging.
(c) Data Integration: Holistic view integration can be complicated
when combining Independent Data Marts information.
(d) Scalability: Managing numerous independent data repositories
becomes more challenging and expensive as the organization
expands.


The choice between a Dependent and an Independent Data Mart depends
on an organization's requirements and priorities. Dependent Data Marts
offer more centralized management and best suit companies with strict
data governance requirements, owing to their enhanced control and
consistency. Independent Data Marts offer
expanded freedom and independence but require more work. Many
organizations choose a hybrid approach integrating centralized control
and autonomous business units to strike a balance.
Following a series of steps, data marts are developed, designed, and
maintained to accommodate the distinct analytical requirements of different
business units or departments within an organization. Here is a step-by-
step guide to creating data marts:
‹ Identify Business Requirements: It starts with understanding how
various departments or units require distinct data analytics and
reporting. To determine the needs and aspirations of stakeholders,
conduct interviews and workshops.
‹ Data Mart Planning: Determining the scope and objectives of each
data mart involves specifying the data it will house and how it will
be employed. It is essential to recognize the data sources needed
for each data mart.
‹ Data Source Integration: Locate and utilize the applicable data
sources for the data mart. Data is obtained through source systems
such as databases, warehouses, external providers, or APIs.
‹ Data Transformation: Data processing is required to guarantee
quality and consistency. Data Transformation is done to boost the
analytical worth of the data, enrich the data, or perform feature
engineering as per the requirements.
‹ Data Modelling: To create a data mart, one must design the schema
by defining tables, relationships, and metadata. The specific
requirements determine the chosen data modelling approach: star
schema or snowflake schema.
‹ ETL (Extract, Transform, Load): Transformed data should be
loaded into the data mart through ETL processes. Regular updates
or refreshes are crucial to keeping the data mart current.


‹ Data Storage and Architecture: When deciding on the best storage
platform for the data mart, one could consider relational databases,
columnar databases, data lakes, or cloud-based options. Sensitive
information remains secure thanks to robust access controls and
data security protocols.
‹ Data Documentation: By recording data definitions, lineage, and
metadata, we ensure data control is maintained.
‹ Data Quality Assurance: Regular checks on data quality ensure
continued monitoring and maintenance of data integrity. Establishing
data quality metrics and KPIs is crucial for tracking improvement
over time.
‹ User Access and Visualization: Provide business users with tools
to access, query, and visualize data from the data mart. Consider using
data visualization platforms or BI (Business Intelligence) tools for
reporting and analytics.
‹ User Training and Support: Offer training to users on how to access
and utilize the data mart effectively. Provide ongoing support and
documentation.
‹ Performance Optimization: Continuously monitor and optimize the
performance of the data mart to ensure timely access to data. Indexing,
partitioning, and caching are standard optimization techniques.
‹ Data Governance: Establish data governance policies, including
data ownership, data stewardship, and data access controls. Ensure
compliance with data privacy regulations.
‹ Scalability and Future Planning: Consider future scalability needs
as the organization grows. Be prepared to expand or modify data
marts as business requirements evolve.
‹ Monitoring and Maintenance: Implement monitoring and alerting
systems to proactively identify issues and maintain data mart health.
Perform regular data quality checks and audits.
‹ Documentation and Knowledge Sharing: Document the data mart’s
architecture, data models, and ETL processes. Share knowledge and
best practices within the data mart development and management
team.


Creating data marts is a complex process that requires collaboration between
data engineers, data analysts, business stakeholders, and IT teams. It is
essential to align data mart development with the organization’s overall
data strategy and ensure that the resulting data marts meet the analytical
needs of the business units they serve.
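As a hedged illustration of the steps above, the sketch below shows one way a dependent, sales-focused data mart might be carved out of a warehouse table in Python: a subset of columns is selected and pre-aggregated so departmental queries run against a smaller, simpler table. All names and figures are assumptions for the example, not a prescribed design.

import pandas as pd

# Assume the enterprise warehouse exposes a wide, integrated sales fact table.
warehouse_sales = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "region":   ["North", "North", "South", "South"],
    "channel":  ["online", "store", "online", "store"],
    "revenue":  [250.0, 400.0, 150.0, 300.0],
    "cost":     [180.0, 260.0, 90.0, 210.0],
})

# The sales data mart keeps only the columns its users need and pre-aggregates them,
# reflecting the denormalization/pre-aggregation mentioned earlier in the lesson.
sales_mart = (warehouse_sales
              .loc[:, ["region", "channel", "revenue"]]
              .groupby(["region", "channel"], as_index=False)["revenue"].sum())

sales_mart.to_csv("mart_sales_summary.csv", index=False)  # persisted for BI and reporting tools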

IN-TEXT QUESTIONS
5. What is the primary advantage of using data marts in a data
warehousing strategy?
(a) Centralized data storage
(b) Simplified data transformation
(c) Tailored data access for specific user needs
(d) Real-time data processing
6. In a data mart architecture, what typically serves as the data
source for the data mart?
(a) Enterprise Data Warehouse (EDW)
(b) Operational Data Store (ODS)
(c) External data sources only
(d) All of the above

1.6 Summary
This unit comprehensively explores different data types, along with data
warehouses and data marts. Here are some of the critical points that are
covered in this unit:
‹ The first section distinguishes between structured, semi-structured,
and unstructured data, describing their characteristics and showcasing
real-world applications. Types of data based on their nature, format,
and attributes are also discussed in this section. These sections also
describe data utilized in a business to facilitate informed decision-
making, optimize processes, analyze consumer conduct, and guide
strategic planning.
‹ The concept of data warehousing is discussed in the second section. It is a
centralized repository that stores, manages, and analyzes data from


multiple sources. This section also covers the data warehousing
process, types of data warehouses, data warehousing models, and
benefits of data warehouses.
‹ The concept of data marts is discussed in the last section. Different
types of data marts and their definition, benefits, and considerations
related to them are explored in this section. The process of creating
a data mart is also covered in this.

1.7 Answers to In-Text Questions

1. (b) It is typically stored in relational databases


2. (c) Textual documents
3. (c) Data analysis and reporting
4. (c) Preparing data for analysis and reporting
5. (c) Tailored data access for specific user needs
6. (a) Enterprise Data Warehouse (EDW)

1.8 Self-Assessment Questions


1. What key characteristics differentiate structured, semi-structured,
and unstructured data? Provide examples of each.
2. Why is understanding the types of data essential for businesses
today? How does each data type impact decision-making?
3. Define a data warehouse and explain its primary role in data
management.
4. List and describe the core components of a data warehouse architecture.
5. Explain the concept of a data mart. How does it differ from a
traditional data warehouse regarding scope and purpose?
6. When should an organization consider using a data warehouse, a
data mart, or both? Provide examples of scenarios for each.


1.9 References

‹ Bruce, P., Bruce, A., & Gedeck, P. (2020). Practical statistics for
data scientists: 50+ essential concepts using R and Python. O’Reilly
Media.
‹ Yau, N. (2013). Data points: Visualization that means something.
John Wiley & Sons.
‹ Maheshwari, A. (2014). Data analytics made accessible. Seattle:
Amazon Digital Services.

1.10 Suggested Reading


‹ Provost, F., & Fawcett, T. (2013). Data Science for Business: What
you need to know about data mining and data-analytic thinking.
“O’Reilly Media, Inc.”.

L E S S O N

2
Data Quality, Data
Cleaning, Handling
Missing Data, Outliers,
and Overview of Big Data
Aditi Priya
Research Scholar
School of Computer & Systems Sciences
JNU, New Delhi - 110067
Email-Id: kajal.aditi@gmail.com

STRUCTURE
2.1 Learning Objectives
2.2 Introduction
2.3 Data Quality
2.4 Data Cleaning
2.5 Handling Missing Data and Outliers
2.6 Overview of Big Data
2.7 Summary
2.8 Answers to In-Text Questions
2.9 Self-Assessment Questions
2.10 References
2.11 Suggested Readings

2.1 Learning Objectives


‹ To understand the concept of data quality and its importance.


‹ To learn various methods and tools for data cleaning.


‹ To explore strategies for dealing with missing data.
‹ To provide an overview of big data concepts.

2.2 Introduction
The significance of data quality in the realms of decision-making
and analytics is of utmost importance. Data forms the foundation for
informative decision-making and insight extraction. Efficient decision-making
and precise analytics are possible only with good data quality, which
spans multiple domains and industries and is critical to
decision-making and analytics. Data success relies on the accuracy and
dependability of statistics in a data-driven era. Decision-making and
analytics rely heavily on data quality, with precision, thoroughness, and
trustworthiness paramount. Data quality problems can cause inconclusive
results, flawed forecasts, and costly errors. High-quality data allows
organizations to make insightful decisions, identify trends, and achieve
a leg up on competitors. With each new level of data exploration, the
importance of accuracy, completeness, and consistency inevitably rises,
transforming them from trivial issues to critical foundations for success
in a data-dependent landscape. Maintaining good data quality relies upon
processes that include data cleaning, handling missing data, and outliers.
Interconnected roles enhance data accuracy and decision-making. The
emergence of big data has transformed data management practices and
quality.

2.3 Data Quality


Data quality pertains to the degree of accuracy, reliability, consistency,
timeliness, and relevance of data for its intended use. It comprises several
dimensions, including:
‹ Accuracy: Data should be error-free and correctly represent the
real-world facts it describes.
‹ Completeness: All data must be present for accurate analysis and
results, with no missing values.


‹ Consistency: Data should be standardized and uniform across systems
and over time, with no conflicting values for the same item.
‹ Timeliness: Data should be up-to-date and relevant for the time
frame in which it is used.
‹ Relevance: Data should be pertinent to the context and purpose for
which it is collected and used. (A short Python sketch after this
list shows how some of these dimensions can be checked.)
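As a small, assumed example of how some of these dimensions can be checked programmatically, the Python sketch below measures completeness, consistency (duplicate keys), and timeliness on an invented customer table. Real checks would be driven by the organization's own data standards and thresholds.

import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email":       ["a@example.com", None, "b@example.com", "c@example.com"],
    "last_update": pd.to_datetime(["2023-01-10", "2022-05-01", "2023-02-20", "2021-12-31"]),
})

# Completeness: share of non-missing values in each column.
completeness = customers.notna().mean()

# Consistency: duplicate keys point to repeated or conflicting records.
duplicate_keys = customers["customer_id"].duplicated().sum()

# Timeliness: records not refreshed within an assumed 365-day window are flagged as stale.
stale = (pd.Timestamp("2023-06-30") - customers["last_update"]).dt.days > 365

print(completeness)
print("duplicate keys:", duplicate_keys)
print("stale records:", int(stale.sum()))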
The importance of data quality is paramount in numerous industries, and
this is due to several compelling reasons:
‹ Healthcare: Healthcare requires accurate patient records and medical
histories for proper diagnosis, treatment, and patient safety. With poor
data quality, medical errors and compromised patient care become likely.
‹ Finance: High-quality data is imperative for financial institutions to
accurately evaluate creditworthiness, catch fraud, and make proficient
investment decisions. Data inaccuracies can have significant financial
implications.
‹ Retail: Data-driven insights allow retailers to improve their inventory
management, pricing tactics, and customer engagement. Accurate sales
and inventory data are vital for successful enterprise operations.
‹ Manufacturing: With high-quality data, quality control, optimization,
and supply chain management can improve. Manufacturing data
errors can cause defects and hold up production.
‹ Government and Public Sector: Agencies’ work depends on data
for policymaking, public service delivery, and resource allocation
management. Data quality plays a central role in ensuring accurate
reporting and informed decision-making.
‹ Marketing and Advertising: Customer data is essential to marketers
to create personalized ads and tailored customer interactions. Data
quality often helps successful marketing campaigns.
‹ Agriculture: Precision agriculture depends on high-quality data to
maximize crop yields, manage resources effectively, and lower the
environmental footprint.
‹ Energy: Data-driven insights enable energy sector organizations to
optimize their production, distribution, and consumption processes.
Data accuracy plays a vital role in optimizing energy efficiency.


‹ Transportation and Logistics: On-time delivery, route optimization,


and quality data are intricately connected. Inaccurate data can give
rise to logistical complications.
‹ Education: Data plays a crucial role in evaluating academic programs
and monitoring student performance at educational institutions.
Planning for education relies on accurate student information.
‹ Research and Development: Excellent data quality is necessary for
replicable results and reliable conclusions in scientific research and
pharmaceuticals.
‹ Environmental Monitoring: Environmental agencies analyse climate
change, biodiversity, and air/water quality data. Data accuracy
is essential for adequate environmental protection and informed
policymaking.
Data quality directly impacts decision-making, efficiency, customer
satisfaction, and overall success in all these industries and many more. Poor
data quality can lead to errors, inefficiencies, financial losses, and public
safety concerns. Therefore, ensuring data quality is a top priority for
organizations across various sectors.
Poor data quality can have significant and far-reaching consequences for
organizations. Here are some of the critical implications of poor data
quality, including errors, bias, and business inefficiencies:
‹ Errors in Decision-Making: Poor decisions arise from misleading
data. Information errors cause organizations to make poor financial,
operations, or strategy decisions. Due to this, significant consequences
can occur, such as financial losses, missed opportunities, reputational
damage, and non-compliance with regulations.
‹ Biased Insights: Bias in analysis and reporting can stem from poor
data quality. The origins of bias can lie in sampling errors, data
collection methods, or incomplete datasets. Insights skewed by
prejudice can create unfair impressions, biased market research,
and discriminatory behaviour. Damage may be done to customer
relationships and brand reputation.
‹ Operational Inefficiencies: Incorrect or inconsistent data may
jeopardize operational workflows. Data issues can consume too


much time for employees, requiring them to reconcile or resolve
issues manually. Operational inefficiency can result in increased expenses, decreased
output, and missed deadlines.
‹ Customer Dissatisfaction: Due to poor data quality, mistakes
in billing and delivery can occur, leading to negative customer
experiences and subpar service quality. A decline in revenue and
permanent damage to the organization’s image may ensue when
dissatisfied customers shift to rival companies.
‹ Regulatory Non-Compliance: Tightly controlled sectors require
high-quality data for regulatory compliance. Legal consequences
and non-compliance can result from incorrect data. Legal action
and regulatory fines can cause financial harm and damage an
organization’s reputation.
‹ Supply Chain Disruptions: Poor quality supply chain data can
impact procurement, production, and distribution. Potential delays,
stockouts, surplus inventory, and higher operational costs are all
possible consequences of such disruptions.
‹ Financial Losses: Poor data quality can lead to significant financial losses through billing mistakes, erroneous pricing, fraud, and inaccurate financial reporting. Such losses can have a lasting impact on an organization’s profitability and viability.
‹ Missed Opportunities: Inaccurate or incomplete customer data can
lead to missed sales and marketing opportunities. Organizations may
fail to identify potential customers or upsell opportunities. Missed
opportunities can result in lost revenue and hinder business growth.
‹ Reputational Damage: Data quality issues can damage an organization’s
reputation, especially when they lead to customer dissatisfaction
or public data breaches. Reputational damage can be long-lasting,
impacting customer trust and stakeholder confidence.
‹ Wasted Resources: Organizations may allocate significant resources
to correct data quality issues after the fact, diverting resources
from more strategic initiatives. Wasted resources can reduce an
organization’s competitiveness and hinder innovation.
In summary, poor data quality affects many aspects of an organization, including operations, finances, reputation, and the capacity to make accurate decisions. Addressing data quality issues proactively is essential for minimizing these negative consequences and ensuring that data serves as a reliable and valuable asset.
Data quality frameworks and methodologies such as Six Sigma and Total Quality Management (TQM) provide a systematic strategy for enhancing data quality within organizations. These frameworks focus on developing processes, benchmarks, and best practices to ensure data accuracy, completeness, and reliability. Here is an overview of these two widely recognized data quality frameworks:
1. Six Sigma: From its origins in manufacturing, Six Sigma has become
a comprehensive data-driven quality management methodology that
addresses data quality issues alongside others. The critical principles
related to this framework are:
(i) Define: Clearly define the goals and objectives of the data quality improvement effort, and identify the relevant data quality metrics and performance indicators.
(ii) Measure: Assess the current data quality by leveraging
statistical techniques and data analysis. When evaluating data,
dimensions like completeness, accuracy, and consistency must
be considered.
(iii) Analyze: The root causes of data quality problems can be
determined by examining data procedures and workflows.
This step involves root cause analysis and process mapping.
(iv) Improve: Implement data quality improvement strategies that address the underlying issues; process redesign, automation, and data cleansing are all viable options.
(v) Control: Put monitoring and control measures in place to sustain the improvement, and establish and track data quality standards.
By employing the Six Sigma methodology, businesses can reduce data errors, standardize process execution, and streamline data handling practices, making data more dependable and decisions more accurate.
2. Total Quality Management (TQM): TQM takes a holistic, continuous-improvement approach in which all aspects of organizational processes, including data quality, receive ongoing attention. The key principles related to this framework are:
(i) Customer Focus: Fully comprehend the demands and expectations of data users and stakeholders; data quality standards must reflect their needs.
(ii) Continuous Improvement: A data quality culture that encourages
continuous improvement can be promoted. Addressing data
quality concerns should be a collaborative effort involving
all levels of employees.
(iii) Employee Involvement: Promoting teamwork and involvement of
all employees in data quality efforts to achieve a collaborative
environment.
(iv) Data-Driven Decision-Making: Utilize data and metrics to inform
decisions concerning data quality, consistently measuring and
evaluating data quality.
(v) Process Orientation: Consider data quality an integral component
of organizational processes and ensure that data quality
standards are seamlessly integrated into workflows.
Benefits: Total Quality Management (TQM) nurtures a culture of
excellence within organizations and extends this ethos to data
quality, fostering precision, consistency, and enhanced customer
satisfaction. This, in turn, leads to improved decision-making and
a competitive edge.
Both Six Sigma and TQM emphasize the following elements in the
context of data quality:
‹ Data Metrics: Data quality is monitored and evaluated using metrics
and KPIs shared by both frameworks—these metrics aid organizations
by tracking progress and identifying areas for improvement.
‹ Continuous Improvement: Continuous improvement holds a key
place in both frameworks. They encourage efforts towards ongoing
data quality improvement and process optimization.
‹ Customer Focus: In both frameworks, customer requirements drive data quality efforts. Aligning data quality with the ways customers actually use and value data keeps those efforts tied to business goals.

‹ Process Orientation: Both frameworks treat data quality as an integral part of broader organizational processes. Data quality is not a standalone issue; it is deeply ingrained in workflows and practices.
‹ Employee Involvement: Engagement and involvement of employees
are vital to the success of data quality endeavours. Addressing data
quality concerns is an effort shared by employees across all levels.
Six Sigma and Total Quality Management both offer comprehensive methods for data quality management and organizational improvement. They give organizations a structured way to address data quality concerns while cultivating a culture built on high-quality data.

IN-TEXT QUESTIONS
1. What is a key dimension of data quality?
(a) Quantity
(b) Volume
(c) Accuracy
(d) Variety
2. Which phase of Six Sigma focuses on identifying and addressing
the root causes of data quality issues?
(a) Define
(b) Measure
(c) Analyze
(d) Control
3. In the context of data quality, what is a key principle of Total
Quality Management (TQM)?
(a) Customer focus
(b) Employee isolation
(c) Data secrecy
(d) Irregular improvements

2.4 Data Cleaning


Correcting data mistakes, inconsistencies, and inaccuracies is a necessary part of data preparation, a process known as data cleaning, cleansing, or scrubbing. Clean, accurate, complete, and reliable data is critical for sound decision-making and analysis. Here is an exploration of the data-cleaning process:
1. Data Assessment:
(i) Objective: First, data quality must be evaluated. Analyzing the
dataset involves detecting questionable data points such as missing
values, duplicate entries, irregularities, and inconsistencies.
(ii) Methods: Data condition assessment employs procedures like
visual inspection, statistical summaries, and data profiling
tools.
2. Handling Missing Data:
(i) Objective: A significant aspect of data cleaning involves
addressing missing data. Analysis may suffer from biased or
incomplete data.
(ii) Methods:
(a) Imputation: Select the appropriate technique, such as mean,
median, mode, or regression, to fill in missing values.
(b) Deletion: It is essential to eliminate records or variables
with excessive missing values.
(c) Advanced Imputation: Techniques such as multiple imputation can be applied in more challenging missing-data cases.
3. Removing Duplicates:
(i) Objective: Skewed results and errors could arise from duplicate
records or entries.
(ii) Methods: Retain one copy of each distinct record while removing
the rest, marking them accordingly. By employing algorithms,
the process of deduplication can be automated.


4. Standardization and Formatting:
(i) Objective: Data consistency demands adherence to common formats and standards.
(ii) Methods:
(a) Apply uniform date formats, units of measurement, and category labels.
(b) Correct typographical errors and inconsistencies.
5. Handling Outliers:
(i) Objective: Outliers can distort modelling results and statistical analyses.
(ii) Methods:
(a) Discover outliers through statistical analysis, graphical representations, or visual inspection.
(b) Use domain knowledge to decide whether outliers should be removed, transformed, or otherwise treated (steps 2 to 5 are illustrated in a brief pandas sketch after this list).
6. Data Validation and Cross-Validation:
(i) Objective: To confirm the accuracy of data, it must be compared
to established business standards or limitations.
(ii) Methods: Validation checks help identify data that does not
match expected patterns or ranges.
7. Data Integration:
(i) Objective: The key is merging multiple data sources into a
consolidated single dataset.
(ii) Methods: Properly aligning data elements is crucial to resolving
inconsistencies and mapping issues in datasets.
8. Documentation:
(i) Objective: Document the entire data cleaning process to ensure
accuracy and traceability.
(ii) Methods: By documenting the actions, explanations of decisions,
and changes implemented, ensure a complete record.


9. Iteration:
(i) Objective: Data cleaning is generally an iterative process. As cleaning proceeds, re-evaluate data quality and fix any newly surfaced concerns.
10. Validation and Quality Assurance:
(i) Objective: Verify that the cleaned data meets the required standards and is fit for its intended application.
(ii) Methods: Quality assurance checks include comparing the cleaned data against the original data and conducting validation tests.
11. Final Data Export:
(i) Objective: Once data quality has been ensured, export the cleaned dataset for analysis or further processing.
12. Monitoring and Maintenance:
(i) Objective: Continuous data monitoring and maintenance
procedures are necessary to prevent errors.
Data cleaning is labour-intensive, iterative work that combines domain knowledge, statistical methods, and data manipulation skills. Because analytical insights depend on dependable and correct data, cleaning is a critical step in data preparation. Standard data cleaning techniques include:
‹ Deduplication: Remove redundant data or entries within the dataset.
Keep only one instance of each unique record, discarding any
duplicate versions. Avoiding double-counting is key to ensuring
data accuracy.
‹ Standardization: Consistency in data presentation and organization is essential. Date formats can be standardized, for example by converting “MM/DD/YYYY” to “YYYY-MM-DD”; units of measurement can be harmonized, such as converting all weights to kilograms; and categorical values can be unified, for instance recording variants like “NY” and “new york” as a single value. This improves data accessibility.
‹ Validation: Validate data against defined business standards, constraints, or logical rules. Validation checks identify data that does not conform to expected norms, for example confirming that birthdates fall within a reasonable range and that product IDs correspond to existing records.
‹ Data Transformation: Data transformation into a suitable format is
crucial for analysis and modelling. Data can be modified through
mathematical rules like logarithms or square roots. From complex
data, derived variables or aggregations can aid analysis.
‹ Parsing and Extraction: Specific details can be obtained from unformatted or loosely formatted data through extraction. Regular expressions and parsing techniques help pull relevant data elements from text fields, emails, or other sources; a typical example, sketched below, is extracting email addresses from a larger text document.
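As an illustration of parsing, the short sketch below uses Python’s re module to pull email addresses out of a free-text field; the sample text and the simplified (not fully RFC-compliant) pattern are assumptions for demonstration only.

import re

# Hypothetical free-text field from which email addresses are to be extracted.
text = "Contact support@example.com or sales@example.org for details."

# Simplified email pattern; real-world validation may require a stricter rule.
pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
emails = re.findall(pattern, text)
print(emails)  # ['support@example.com', 'sales@example.org']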
These data cleaning techniques are central to data preparation. Data analysts and scientists tailor their approach to how the data is formatted, the questions they wish to answer, and their areas of expertise. Data profiling tools are essential for discovering data quality problems because they thoroughly analyse a dataset’s features and tendencies. They highlight data quality issues in the following ways:
‹ Identifying Missing Values: Data profiling tools show analysts which columns contain missing values and how widespread the gaps are, which is essential for addressing data completeness issues.
‹ Assessing Data Types: These tools help us to automatically recognize
the data types of columns, including numeric, text, date, or categorical.
Any mismatch between expected and actual data types should be
flagged as a potential data quality issue.
‹ Detecting Duplicate Records: Data profiling tools can identify
duplicate records or entries by evaluating values across rows.
Duplicates can be removed according to the problem defined.
‹ Analyzing Data Patterns: Profiling examines data patterns, including minimum and maximum values, distributions, and unique value counts. Deviations from expected patterns can signal data quality problems or anomalies.
‹ Detecting Outliers: Statistical methods pinpoint outliers in numerical
data column analysis. Identifying outliers can be accomplished with
box plots or scatter plots, which make them visible.


‹ Assessing Data Consistency: Profiling tools check whether data conforms to expected formats (for example, date formats) and rules, and highlight inconsistent entries.
‹ Identifying Data Quality Rules Violations: Data profiling tools help
define and apply unique data quality criteria. By flagging violations,
these rules evaluate data against defined standards.
‹ Summarizing Data Quality Statistics: With data profiling tools,
metrics and summary statistics like mean, median, standard deviation,
and quartiles are available for numerical columns. Analysts better
understand data distribution and quality due to these statistics.
‹ Profiling Text and Categorical Data: These tools analyse text and categorical data to determine unique values, frequencies, and patterns, which is valuable for identifying discrepancies or unexpected values.
‹ Visualizing Data Quality Issues: Analysts benefit from the visual
displays offered by data profiling tools, which include histograms,
bar charts, and scatter plots for highlighting data quality problems.
‹ Data Quality Reports: These tools generate reports detailing data quality issues; such reports might feature data quality scores, heatmaps, and dashboards for easy tracking.
‹ Data Quality Assessment and Prioritization: Data profiling tools
help evaluate and rank data quality issues according to their severity
and impact on analytics. It helps in prioritizing issues.
Organizations can gain valuable knowledge about data quality issues
promptly and efficiently with data profiling tools. Automating data
exploration and quality assessment frees up data professionals to focus
on insight enhancement.
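Much of this basic profiling can be reproduced with a few pandas calls. The sketch below is illustrative only and assumes a hypothetical CSV file and column name; it simply shows the kind of quick checks a profiling pass performs.

import pandas as pd

df = pd.read_csv("customers.csv")          # hypothetical input file

print(df.dtypes)                            # data type of each column
print(df.isnull().sum())                    # missing values per column
print(df.duplicated().sum())                # number of duplicate rows
print(df.describe(include="all"))           # summary statistics and unique-value counts
print(df["city"].value_counts().head())     # frequencies of a categorical column (hypothetical)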
Data cleaning in the real world can present complex challenges that
vary across industries and data types. Some examples of real-world data
cleaning challenges and their solutions are given below:
‹ Inconsistent Data Entry:
(i) Challenge: Inconsistent data entry methods, such as those using
different date formats (MM/DD/YYYY vs DD-MM-YYYY)
or measurement units, can cause problems.


(ii) Solution: Standardize data formats and units using data transformation techniques (a short sketch of date standardization follows this list).
‹ Missing Data in Healthcare Records:
(i) Challenge: The absence of data can negatively impact patient
care and analysis through electronic health records.
(ii) Solution: Imputation techniques can fill in missing values by
leveraging the patient’s historical data.
‹ Duplicate Customer Records in CRM Systems:
(i) Challenge: Duplicate customer records may arise from mistakes
during manual data entry in CRM systems.
(ii) Solution: By pairing similarity thresholds with data profiling,
databases are deduplicated by removing or combining redundant
records.
‹ Inaccurate Geospatial Data:
(i) Challenge: Geospatial datasets may hold the wrong coordinates
for locations.
(ii) Solution: Correct and verify the data using external sources or geocoding services; outlier detection can reveal suspicious coordinate values.
‹ Incomplete Financial Data:
(i) Challenge: Missing or incomplete data points in financial datasets can undermine financial analysis.
(ii) Solution: Imputation methods such as forward-fill, backward-fill, or interpolation may be applied to estimate missing financial values.
‹ Sensor Data with Outliers:
(i) Challenge: Outliers appear in the data stream, possibly caused by sensor faults or noise.
(ii) Solution: Statistical methods such as smoothing or imputation can help detect and treat the outliers.
‹ Inconsistent Product Names in E-commerce:
(i) Challenge: Inconsistencies in e-commerce product data, particularly names and descriptions, hamper search and categorization.
(ii) Solution: Use natural language processing (NLP) to group similar products and standardize their names and descriptions.
‹ Incorrect Billing Data in Utilities:
(i) Challenge: Incorrect charges and disputes stem from utility
billing data containing flawed information.
(ii) Solution: Identifying billing irregularities and setting up a
correction process is accomplished through validation checks
and a bill review.
‹ Data Integration Across Multiple Systems:
(i) Challenge: Different systems’ data structures and standards
must be aligned to integrate data.
(ii) Solution: Consistency and accuracy when merging data require
integrated data processes that encompass mapping, transformation,
and validation.
‹ Social Media Data with Spelling Errors:
(i) Challenge: Social media data frequently contains misspellings, slang, and abbreviations.
(ii) Solution: Improving text quality with text mining and spell-checking algorithms simplifies downstream tasks such as sentiment analysis or topic modelling.
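For the inconsistent date-entry challenge above, one possible approach is to parse each known format separately and combine the results. This is only a sketch under the assumption that exactly two conventions (MM/DD/YYYY and DD-MM-YYYY) occur in the column.

import pandas as pd

# Hypothetical column with dates entered in two different conventions.
raw = pd.Series(["03/25/2023", "25-03-2023", "04/02/2023", "02-04-2023"])

# Parse each convention separately; values that do not match become NaT.
us_style = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")
eu_style = pd.to_datetime(raw, format="%d-%m-%Y", errors="coerce")

# Combine the two parses and emit a single standard format.
standardized = us_style.fillna(eu_style).dt.strftime("%Y-%m-%d")
print(standardized.tolist())  # ['2023-03-25', '2023-03-25', '2023-04-02', '2023-04-02']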
Data-cleaning challenges are a reality for organizations of every size and in every domain. Solutions that combine data profiling, transformation, validation checks, and domain expertise can deliver accurate, consistent, and reliable data for analysis and decision-making.

IN-TEXT QUESTIONS
4. Which of the following is a common data-cleaning task?
(a) Increasing data complexity
(b) Adding noise to data
(c) Transforming data
(d) Ignoring missing values


5. What is the primary purpose of detecting and removing duplicates in a dataset during data cleaning?
(a) To increase the size of the dataset
(b) To prevent data scaling
(c) To avoid double-counting and skewed results
(d) To introduce outliers for analysis

2.5 Handling Missing Data and Outliers


Identifying and categorizing missing data is vital for practical data analysis.
Handling missing data is essential to prevent biased results and maintain
the integrity of data analysis. The choice of method depends on the type
of missing data and the specific research question. Depending on the
reason why data is missing, it can negatively impact the reliability of
data analysis. There are three primary types of missing data:
1. Missing Completely at Random (MCAR):
(i) Cause: The data are missing purely by chance; the missingness is unrelated to any other variables, observed or unobserved.
(ii) Example: In a survey, a few randomly chosen respondents accidentally skip a question.
(iii) Handling: Because MCAR introduces no systematic bias, it is the most benign form of missing data. It can be managed with listwise deletion, imputation, or statistical techniques designed for missing values.
2. Missing at Random (MAR):
(i) Cause: The pattern of missingness can be explained by the values of other observed variables, so the available data can be used to account for it.
(ii) Example: In a health survey, men may be less likely to report their body weight; the missingness of weight then depends on observed variables such as gender rather than on the unreported weight itself.
(iii) Handling: Imputation approaches that exploit the observed relationships can effectively recover the missing information; multiple imputation is an effective way to manage MAR data.
3. Missing Not at Random (MNAR):
(i) Cause: The missingness depends on the unobserved value itself or on other unmeasured factors, so it cannot be explained by the observed data alone.
(ii) Example: In a drug trial, patients with severe side effects may be the ones least likely to report their symptoms.
(iii) Handling: MNAR is the most challenging case because the cause of missingness is not observable. Researchers must model the missingness mechanism or acknowledge the potential bias when working with MNAR data.
Maintaining an analysis’s reliability and validity requires handling any
missing data. Various methods are possible, given the specific nature of
the data and the research inquiry. Here are various methods for handling
missing data:
‹ Imputation Techniques: Imputation estimates missing values from the observed data. Standard imputation techniques include:
(i) Mean, Median, or Mode Imputation: Replace missing values with the mean, median, or mode of the variable. This simple method is most defensible when the data are missing completely at random.
(ii) Regression Imputation: Use regression models built on other variables to predict the missing values. The approach captures relationships between variables, but it assumes an (often linear) association and that the data are missing at random.
‹ Deletion Techniques: Deletion removes cases or variables that contain missing data. Although simple, it can discard valuable information.
(i) Listwise Deletion (Complete Case Analysis): Remove every case that has a missing value. If the data are not missing completely at random, this reduces the sample size and can bias the analysis.
(ii) Pairwise Deletion: Each analysis uses only the cases that have complete data for the variables involved. Because all available data are used, different analyses may end up with different sample sizes.
‹ Advanced Imputation Methods: Advanced imputation methods are more sophisticated and can handle complex missing data patterns:
(i) K-Nearest Neighbours (K-NN): Fill in a missing value with the average of the k most similar records, where similarity is measured in feature space (for example, by Euclidean distance).
(ii) Multiple Imputation (MI): Generate several imputed datasets to reflect the uncertainty about the missing values, analyse each dataset separately, and then pool the results. By accounting for imputation uncertainty, MI supports sound statistical conclusions.
(iii) Expectation-Maximization (EM): Estimate missing values with an iterative algorithm that maximizes the likelihood function of the data. EM works best when the missingness is ignorable (MCAR or MAR).
(iv) Interpolation and Extrapolation: Estimate missing values mathematically from neighbouring observations; these techniques are especially common for time series and other ordered data.
(v) Domain-Specific Imputation: Imputation can also draw on domain expertise; medical knowledge, for instance, can guide how missing clinical values are filled in.
(vi) Imputation Software: Imputation functions and packages are available in tools such as R, Python (with pandas and scikit-learn), and specialized statistical software; a brief sketch using pandas and scikit-learn follows.
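The sketch below shows what simple, K-NN, and deletion-based handling of missing values might look like with pandas and scikit-learn; the tiny dataset and parameter choices are illustrative assumptions only.

import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical numeric dataset with missing values.
df = pd.DataFrame({"age": [25, None, 40, 35, None],
                   "income": [50000, 62000, None, 58000, 61000]})

# Mean imputation (defensible mainly when values are missing completely at random).
mean_imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df),
                            columns=df.columns)

# K-Nearest Neighbours imputation: fill gaps from the most similar rows.
knn_imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                           columns=df.columns)

# Listwise deletion (complete-case analysis) for comparison.
complete_cases = df.dropna()

# Multiple imputation is available through specialized packages or scikit-learn's
# experimental IterativeImputer.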
Outliers are data points that diverge markedly from the remainder of the data. They matter in data analysis for several reasons:
‹ Detection of Errors: Outliers can point to problems with data collection, including data entry mistakes and measurement errors. Identifying and addressing them improves data quality.
‹ Impact on Summary Statistics: Outliers strongly influence summary statistics such as the mean and standard deviation, distorting estimates of central tendency and variability, so their presence must be considered when analysing a dataset.
‹ Influence on Machine Learning Models: Outliers significantly affect sensitive machine learning models such as linear regression and can harm model stability and generalization to new data.
‹ Identification of Anomalies: In some instances, outliers indicate
unusual occurrences or notable exceptions. Detecting these outliers
is essential for successful anomaly detection in fraud detection or
quality control applications.
‹ Understanding Data Distribution: Examining outliers gives a clearer
picture of the data distribution. Their appearance may signal an
irregular data distribution or multiple populations within the data.
‹ Robust Statistics: Robust statistical methods, such as the median and the interquartile range, are far less affected by outliers and give more dependable estimates of central tendency and spread.
‹ Visualization: Identifying outliers can be achieved by scrutinizing data
through histograms, scatter plots, and box plots. Data visualization
can help analysts identify and comprehend outliers.
Correctly identifying outliers is therefore vital in data analysis. Outliers can be found using statistical tests, data visualization methods, or machine learning algorithms. Commonly used methods include:
‹ Descriptive Statistics:
(i) Z-Score: Determine how far beyond the mean each data point
falls and calculate the z-score accordingly. Data points with
high absolute z-scores (often exceeding 2 or 3) often signal
potential outliers.


(ii) IQR (Interquartile Range): Compute the first and third quartiles (Q1 and Q3) and their difference, the IQR. Data points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are commonly treated as outliers (both rules are sketched in the example after this list).
‹ Visualization Techniques:
(i) Box Plots: Box plots graphically depict the data distribution,
pointing to possible outliers outside the “whiskers.”
(ii) Scatter Plots: Scatter plots reveal outliers as points that fall far from the dominant pattern of the data.
‹ Machine Learning Algorithms:
(i) Clustering Methods: Outliers can be identified through the use
of k-means clustering. Points that defy categorization as part
of any cluster could be considered outliers.
(ii) Isolation Forests: An ensemble learning method designed specifically for outlier detection. It recursively partitions the data at random; points that can be isolated in very few splits are likely outliers.
(iii) One-Class SVM (Support Vector Machine): The algorithm learns a boundary around the bulk of the data; points falling outside that boundary are labelled as outliers.
‹ Mahalanobis Distance:
(i) The Mahalanobis distance measures how far a data point lies from the centroid of the dataset, taking correlations between variables into account. Points with high Mahalanobis distances are potential outliers.
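The z-score and IQR rules, together with an Isolation Forest, can be sketched as follows; the sample values, the z-score threshold of 2, and the contamination setting are illustrative assumptions rather than fixed recommendations.

import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical numeric series containing one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 95], dtype=float)

# Z-score rule: flag points more than 2 standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 2]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Isolation Forest: fit_predict returns -1 for points it isolates as anomalies.
labels = IsolationForest(contamination=0.1, random_state=0).fit_predict(values.to_frame())
forest_outliers = values[labels == -1]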
Outliers can arise for several reasons, and their presence can significantly influence statistical results. Data entry mistakes caused by human error or equipment failure can produce values that deviate sharply from the genuine distribution, and measurement errors or poorly calibrated instruments have the same effect. Natural variability in the system being studied can also generate legitimate extreme values, while small or non-representative samples increase the chance of anomalous readings. Whatever their origin, unhandled outliers can skew summary statistics, distort relationships between variables, and threaten the reliability of statistical models. Correctly identifying, understanding, and managing outliers is therefore essential for maintaining the integrity of statistical analysis.
Careful consideration is necessary when dealing with outliers in data
analysis due to the dependence of the appropriate approach on the nature
and context of the data. The following considerations should be taken
into account:
‹ Understand the Context: Before taking any action, examine the context and the likely cause of each outlier. Understanding what accounts for the unusual behaviour helps identify the right approach.
‹ Visualize the Data: By employing box plots, scatter plots, and
histograms, data analysis can detect outliers and comprehend the
data distribution. Insights into their nature and the impact they have
are provided through visual examination.
‹ Assess Their Impact: Check how the outliers affect summary statistics and data distributions, for example by temporarily removing them and comparing the results. This shows how much they influence the analysis.
‹ Consider Domain Knowledge: Consult subject-matter experts and stakeholders; their domain knowledge helps distinguish genuine extreme values from errors and shows which outliers carry valuable information.
‹ Outlier Treatment Options: Depending on the context, the following
approaches can be considered:
z Removal: Outliers that clearly stem from data entry errors or measurement problems can be removed, but be cautious about discarding data points excessively.
z Transformation: Logarithmic or square-root transformations often make the data less skewed, reducing the influence of outliers.


z Winsorization: Cap extreme values at chosen percentiles (for example, the 1st and 99th). The overall data structure is maintained while the influence of outliers is reduced (see the sketch after this list).
z Imputation: Outlying or implausible values can be replaced with estimates based on the observed data, much as missing values are imputed, when the context justifies it.
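Winsorization and a simple log transformation can be sketched as follows; the sample data and the 10% limits are illustrative assumptions only.

import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical skewed sample with extreme values at the upper end.
data = np.array([1, 2, 2, 3, 3, 4, 4, 5, 250, 300], dtype=float)

# Winsorization: replace the most extreme 10% at each end with the nearest retained value.
capped = winsorize(data, limits=[0.1, 0.1])

# Log transformation: compress the range so extreme values exert less influence.
log_scaled = np.log1p(data)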

IN-TEXT QUESTIONS
6. Why is handling missing data important in data analysis?
(a) It makes the data look complete
(b) It can lead to biased results
(c) It simplifies data processing
(d) It reduces the need for data visualization
7. What is a potential consequence of missing data not at random
(MNAR) in data analysis?
(a) Increased statistical power
(b) Biased and inaccurate results
(c) Smaller effective sample size
(d) Enhanced data completeness
8. Outliers can be detected using:
(a) Mean imputation
(b) Median absolute deviation
(c) Deleting all data points beyond a specific value
(d) Ignoring them during analysis
9. Which statistical method measures how many standard deviations
a data point is away from the mean for outlier detection?
(a) Interquartile Range (IQR)
(b) Modified Z-Score
(c) Box Plot
(d) Z-Score


2.6 Overview of Big Data


Traditional methods of data management and analysis are not suited to the massive amounts of data generated today, which are referred to as big data. Big data is commonly described by four primary characteristics, the four V's: volume, velocity, variety, and veracity.
Handling big data raises several challenges: storing and processing very large volumes of data, the need for distributed processing, real-time analytics, the complexity of data integration, and the need for scalable systems. A data lake is a central repository that stores raw data in its original form. Because it accommodates structured, semi-structured, and unstructured data, it is well suited to big data storage and supports data analysis, exploration, and transformation.
A range of technologies has emerged to address big data: Hadoop for distributed storage, Spark for efficient processing, NoSQL databases for managing diverse data, real-time analytics platforms such as Kafka and Flink, machine learning and AI systems, and cloud services offering scalable resources. These technologies allow organizations to extract valuable insights from big data.
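As a flavour of how such technologies are used, the following is a minimal PySpark sketch that aggregates a large CSV file; the file name and column name are hypothetical, and a working Spark installation is assumed.

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("big-data-sketch").getOrCreate()

# Read a large CSV file in a distributed fashion (hypothetical file and column names).
events = spark.read.csv("events.csv", header=True, inferSchema=True)

# Count events per category and show the ten most frequent categories.
events.groupBy("category").count().orderBy("count", ascending=False).show(10)

spark.stop()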

IN-TEXT QUESTION
10. Which of the following is NOT one of the four V’s of big data?
(a) Volume
(b) Velocity
(c) Validity
(d) Variety

2.7 Summary
This unit comprehensively explores key concepts and practices in data
management and analysis. Here are some of the key points that are
covered in this unit:
‹ In the first section, the concept of data quality and its importance in business is examined. The dimensions of data quality include accuracy, completeness, consistency, reliability, and relevance. The consequences and impact of poor data quality are covered, along with data quality frameworks and methodologies, such as Six Sigma and TQM, that help reduce the impact of poor data quality on organizations.
‹ The second section includes data cleaning, which is an iterative
process and a crucial step in the process of data preparation. It
is done to ensure the accuracy and reliability of the data used
for analysis. It involves several critical tasks, including handling
missing data through removal or imputation, identifying and treating
outliers, converting data types to appropriate formats, standardizing
inconsistent data, and removing duplicates. Categorical variables
are encoded for analysis, and text data undergoes pre-processing.
‹ The unit also introduces big data, which represents a significant shift in data management techniques.
‹ The third section consists of two important elements of data
analysis, dealing with missing data and outliers. Data absence or
missing values can lead to bias and inaccurate analysis. Mean or
median imputation and other imputation techniques like regression
imputation and multiple imputation can replace missing values with
estimates. Deleting rows or columns may be necessary for small
amounts of missing data. Data points that deviate markedly from the rest must be attended to because they can distort analyses. Z-scores and transformation techniques, combined with domain knowledge, can help reduce the impact of outliers. The approach to handling missing data and outliers should suit the characteristics of the data, the research objectives, and the broader analytical context in order to obtain accurate and reliable results.
‹ The final section offers an overview of big data, a paradigm shift
in data management and analysis. It explores the four V’s of big
data: Volume, Velocity, Variety, and Veracity.

2.8 Answers to In-Text Questions

1. (c) Accuracy
2. (c) Analyze


3. (a) Customer focus


4. (c) Transforming data
5. (c) To avoid double-counting and skewed results
6. (b) It can lead to biased results
7. (b) Biased and inaccurate results
8. (b) Median absolute deviation
9. (d) Z-Score
10. (c) Validity

2.9 Self-Assessment Questions


1. What are the critical dimensions of data quality, and why are they
essential in decision-making?
2. Give an example of how poor data quality can impact an organization’s
operations or decision-making processes.
3. How can data quality be improved in an organization, and what role
does data governance play in this process?
4. Explain the concept of data cleaning and why it is a necessary step
in data analysis.
5. What are some everyday data-cleaning tasks, and can you provide
an example?
6. Why is it important to distinguish between missing data mechanisms
like MCAR, MAR, and MNAR in data analysis?
7. Provide examples of imputation techniques for handling missing data,
and explain when each method is most appropriate.
8. What are outliers, and how can they affect statistical analysis?
Describe a method for detecting and addressing outliers in a dataset.
9. Define what is meant by the four V’s of big data: Volume, Velocity,
Variety, and Veracity.


2.10 References

‹ Olson, J. E. (2003). Data quality: the accuracy dimension. Elsevier.


‹ Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current
approaches. IEEE Data Eng. Bull., 23(4), 3–13.
‹ Graham, J. W. (2012). Missing data: Analysis and design. Springer Science & Business Media.
‹ Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt.

2.11 Suggested Readings


‹ Provost, F., & Fawcett, T. (2013). Data Science for Business: What you need to know about data mining and data-analytic thinking. O’Reilly Media, Inc.
‹ Warren, J., & Marz, N. (2015). Big Data: Principles and best
practices of scalable real-time data systems. Simon and Schuster.

LESSON 3
Navigating the Data Analytics Journey: Lifecycle, Exploration, and Visualization
Aditi Priya
Research Scholar
School of Computer & Systems Sciences
JNU, New Delhi - 110067
Email-Id: kajal.aditi@gmail.com

STRUCTURE
3.1 Learning Objectives
3.2 Introduction
3.3 Understanding Data Analytics
3.4 Data Exploration: Uncovering Insights
3.5 Data Visualisation: Communicating Insights
3.6 Summary
3.7 Answers to In-Text Questions
3.8 Self-Assessment Questions
3.9 References
3.10 Suggested Reading

3.1 Learning Objectives


‹ To understand the significance of data analytics in decision-making and problem-
solving.
‹ To understand the data analytics lifecycle and its components.


‹ To understand the importance of data exploration in the data analysis process.
‹ To study the role of data visualisation in conveying complex
information.

3.2 Introduction
In recent years, the amount of data being created has increased immensely. Approximately 328.77 million terabytes of data are now created each day, and the rate of data creation continues to rise. It is estimated that 90% of the world's data was generated in the last two years alone. Annual data generation has grown roughly 60-fold in 13 years, from about two zettabytes in 2010 to 120 zettabytes in 2023, and is expected to reach 181 zettabytes by 2025, an increase of roughly 50%.
The data we encounter is highly varied. People develop a wide array of
content, including blog posts, tweets, interactions on social networks,
and images. The internet, the ultimate data source, is unfathomable and
extensive. This remarkable surge in the volume of data being created and
generated profoundly impacts businesses. Conventional database systems
cannot handle such a large volume of data. Moreover, these systems
struggle to cope with the demands of ‘Big Data.’
Here comes the question: What exactly is Big Data? Big data refers to
massive, complex, and varied sets of data. It is characterized by four V's: Volume, Velocity, Variety, and Veracity. Volume involves significant
amounts of data ranging in terabytes and more, generated from several
sources, such as sensors, social media, online transactions and many more.
Velocity is the unprecedented speed at which data is generated in today’s
digital world. Social media interactions such as emails, messages, chats,
IoT devices, and financial transactions come under real-time data, which
contributes to the velocity of big data. Variety refers to various formats of
data, such as databases, spreadsheets, text, images, etc. This diverse form
of data requires specific tools for processing and analysis. Veracity refers
to the quality and reliability of data. Big data can sometimes be noisy
or incomplete, and data scientists need to ensure that the insights drawn
from the data are accurate and trustworthy. Big data is generated across


industries, from business and healthcare to academia and government.


Analysis of big data involves both challenges and opportunities. There
are several challenges related to big data. One of them is storage and
processing, in which the primary concern is the requirement of specialised
infrastructure and technologies to store data. Second is data integration,
which involves complexities in combining data from different sources
because they have varied file formats. Third is maintaining data quality
and reliability so accurate decisions can be made. Fourth is privacy and
security, which involves handling sensitive data that requires robust methods
and procedures. There are several opportunities associated with big data.
One of them is predictive analysis, which consists of big data analysis
that can help to develop prediction models to forecast future trends and
behaviour. The second one is Personalization, which helps businesses
gain insights and findings from big data to personalise products and
services for individual users to increase customer satisfaction and promote
business. The third is data-driven decision-making, which enables the
development of processes using informed decisions and better strategies.
The fourth is innovation, which can be achieved by discovering patterns
and trends leading to innovative solutions. The last one is operational improvement: big data enables better optimisation of operations and resource allocation, increasing efficiency.
Traditional data management tools and techniques cannot process Big Data; advanced technologies are required. Conventional systems and the data management methodologies linked to them have proven inadequate for scaling up to this astonishing growth, which has profoundly affected businesses: traditional database systems, such as relational databases, have been pushed to the limit and are increasingly breaking under the pressure. To handle the challenges of Big Data, newer technologies built on a fundamentally different set of techniques have been introduced. With the vast amount of data generated from various sources, often in terabytes or petabytes, it is almost impossible to analyse and process data manually. The data generated can have
different formats, such as structured, unstructured, and semi-structured.
The data includes text, images, videos, audio, documents, etc. In the real world, the data is generated rapidly and is complex. The complexity
and interconnectedness involved in real-world data require using newer
and advanced technologies to discover underlying relationships and
patterns present in the data. Understanding and forecasting future trends
to minimise risk and strategise actions to make effective and accurate
decisions is often helpful.
To cope with these challenges of real-world data, the concept of data science emerges. Data science is a multidisciplinary field that combines various methods and algorithms to
extract valuable information from different file formats, such as structured,
unstructured, and semi-structured. It consists of various processes of
collecting, transforming, analysing, and interpreting data to make more
informed and accurate decisions. The key components related to data
science are as follows:
(i) Collection of relevant data from various sources such as user-generated
data, sensor data, social media interaction data, transaction data,
healthcare data, financial data, etc.
(ii) Pre-processing of the data to ensure quality and maintain accuracy
by eliminating any discrepancies present in the data.
(iii) Selection of relevant and important features and characteristics to
increase the performance of any system.
(iv) Interpreting and discovering hidden patterns, distributions, and relations
through data visualisation tools and applications.
(v) Applications of statistical methods and techniques to discover trends
in data.
(vi) Using Natural Language Processing to analyse and process human
language to extract meaningful information.
(vii) Applications of machine-learning and deep-learning models to
develop our models for performing any task.
(viii) Processing bulk volume of data to reduce complexity and increase
performance.
(ix) Considering the ethical and legal aspects of data collection, storage,
and usage.


(x) Planning and executing controlled experiments to test hypotheses


and gather meaningful information.
(xi) Discovering patterns and relationships in data using clustering,
association rule mining, and anomaly detection techniques.
The techniques utilised in data science include automated processing to
handle large volumes of data effectively and also reduce the required
time. It also helps in extracting insights from diverse file formats quickly
and reliably. The interconnectedness and complexity of data sets often
need advanced techniques, such as deep learning and network analysis,
to understand underlying patterns and relationships. The vast amount of
data requires data science because traditional data analysis and manual
processing methods are no longer sufficient to deal with the complexity,
volume, and variety of data available today. Data science provides the
tools, techniques, and methodologies to turn this data into valuable insights
that drive informed decision-making and innovation. The primary goal
of data science is to extract value from data and gain an accurate and
deep understanding of it. One of the key aspects of data science is the
availability of data and the capability to extract valuable insights from
the data. The fundamental concepts of data science are obtained from
various fields related to data analytics. This chapter focuses on the data
analytics lifecycle, the process of data exploration, and the role of data
visualisation. These topics are important in extracting meaningful insights
from the data and making informed and data-driven decisions.

3.3 Understanding Data Analytics


Data analytics and data science are interrelated disciplines that extract
meaningful observations and insights from data. However, they possess
unique emphases and functions within the broader context of data use.
Data science is a much broader discipline that entails data analytics
but extends its scope. It combines data analysis, machine learning, and
domain knowledge to comprehend past data, forecast and predict future
trends, and provide better actions. Data analytics mainly involves studying
and analysing past data to uncover hidden patterns and discover trends
to make informed and data-driven decisions. It also helps organisations


to optimise their methods and techniques to mitigate risk and increase
their output. The work of a data analyst usually involves statistical tools
like Excel, SQL, Tableau, etc. These tools help the data analyst perform
simple calculations for complex data modelling and visualisation. They
aim to answer specific business questions, identify opportunities, and guide
operational improvements. Data analytics involves examining, cleaning,
transforming, and interpreting data to extract valuable insights and make
informed decisions. It involves using various techniques, tools, and
methodologies to analyse data and uncover patterns, trends, correlations,
and other meaningful information that can aid in understanding business,
scientific, or other phenomena. Data analytics encompasses a wide range
of activities, from simple descriptive analysis summarising historical
data to advanced predictive and prescriptive analysis using statistical
and machine learning models to make predictions and recommendations.
It involves working with structured and unstructured data from various
sources, including databases, spreadsheets, text documents, images, videos,
sensor data, social media, and more. Data analytics is used across industries, including marketing, finance, healthcare, manufacturing, and more. It enables organisations to make data-driven decisions, optimise processes, identify new opportunities, and gain a competitive edge in the market. The tools and techniques used in data analytics continue to evolve, allowing businesses to extract more valuable insights from their data.
The data analytics lifecycle consists of a sequence of stages and steps
commonly adhered to while conducting tasks related to data analytics. It
gives a systematic approach to deriving meaningful and valuable insights
from data to make data-driven decisions. While the steps and procedures
depend on the type of organisation, a generalised framework is followed
in the data analytics lifecycle. The most common steps utilised in the
data analytics lifecycle are as follows—


‹ The first step of the data analytics lifecycle begins with defining the
problem with utmost clarity so that the analysis’s purposes, goals,
and objectives are aligned with the organisation’s requirements.
‹ The second step of the data analytics lifecycle involves the collection
of data from different sources of data such as sensors, social media,
online transactions, healthcare data, etc. It also ensures that the
collected data is accurate, reliable, and meets the organisation’s needs.
‹ The third step of the data analytics lifecycle consists of removing
the discrepancies present in the data and transforming the data using
feature engineering.
‹ The fourth step of the data analytics lifecycle involves exploring
data using visual and statistical methods to understand the features
and identify patterns and relations in the data.
‹ The fifth step involves the selection of an appropriate model
according to the problem and characteristics of the data. This helps
to develop a predictive or descriptive analysis of data using various
models related to machine learning, deep learning, etc.
‹ The sixth step consists of evaluating the performance of various
models using several measures. This is done to validate the model
and check its efficacy, whether it is a generalised model, i.e., it
gives accurate results when newer datasets are considered.
‹ The seventh step involves the interpretation of the results obtained
after the analysis of data to extract useful information to make
well-informed decisions.
‹ The eighth step involves making informed decisions to take the
required actions.
‹ The last step involves continuously performing surveillance and
monitoring the performance of the implemented actions to update
and refine the procedures to yield better results. This also involves
considering the feedback and responses to refine the model to adapt
to the needs of the business processes.


Figure 3.1: Data Analytics Lifecycle


Collaboration between data analysts, domain experts, and stakeholders
throughout the lifecycle is crucial for successful data-driven decision-making.
Each phase builds upon the previous one, creating a cyclical process that
continuously refines and enhances the insights extracted from data.

IN-TEXT QUESTIONS
1. What is the primary role of data analytics in decision-making
and problem-solving?
(a) To make data look presentable
(b) To make data collection efficient
(c) To extract insights and inform decisions
(d) To replace traditional decision-making processes
2. Which of the following statements is true regarding the Data
Analytics Lifecycle?
(a) It is a rigid and unmodifiable process
(b) It is only applicable to large organisations
(c) It guides structured data analysis but lacks flexibility
(d) It provides a framework that can adapt to diverse data
analysis needs


3.4 Data Exploration: Uncovering Insights


This section will discuss the integral components of the data analytics
lifecycle, i.e., data exploration and visualisation. Data exploration is a
vital aspect of the process of data analytics. It consists of examining
and exploring the dataset to understand and learn the basic structure,
framework, and content and discover patterns to apply advanced data
analytics tools and techniques. Data exploration is the elementary step
in the lifecycle of data analytics. Data analysts use various tools and
techniques to derive preliminary information to visualise and summarise
the data. This is also sometimes referred to as exploratory data analysis,
a statistical-based method used to analyse data available in raw form to
study their features and characteristics from a broader perspective. It also
aids the data analyst in identifying trends, patterns, missing values, and
potential areas for further analysis. With the help of data exploration,
analysts can make well-informed and data-driven decisions to determine
the best-fitted approach for other stages of the analysis process. Data
exploration lays the foundation for data analysis, unveiling initial insights and aiding analysts in comprehending the strengths and weaknesses
of the data. It guides the upcoming stages, including data transformation,
modelling, and hypothesis testing, guaranteeing that the analysis is firmly
rooted in a thorough grasp of the dataset. It also helps businesses quickly
analyse large chunks of data to understand the future stages for further
analysis. This gives the business a great kick-start and a method to focus
on its area of interest. Usually, data exploration involves using several
data visualisation tools to investigate and assess the data at a high level.
Businesses can utilise this high-level approach to determine which chunks of data are essential and which fragments should be set aside because they would only degrade performance. Data exploration can reduce the time
invested in less valuable analysis by choosing the optimal direction from
the outset. The process of data exploration can be conducted using both
automated as well as manual techniques. Automated techniques used in
data exploration include data profiling, which gives a data analyst an
initial insight into understanding the primary characteristics of data. Data
profiling involves scrutinising the data that is already available from a database. It collects statistics and summaries of the data for various reasons, such as the following (a short profiling sketch appears after this list)—


‹ To check whether the collected data suits the process and can be used.
‹ Enhance data search capabilities by attaching keyword descriptions
or categorising them.
‹ Check the quality of data to see whether it aligns with particular
standards.
‹ Check the risk involved when the data is used in different applications.
‹ Uncover metadata from the source database, encompassing value
patterns, distributions, potential key candidates, foreign key candidates,
and functional dependencies.
‹ Check whether the existing metadata precisely represents the real
values present within the source database.
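As a brief illustration of automated profiling, the following minimal Python sketch uses the pandas library on a small hypothetical dataset (the column names and values are invented for illustration); in practice the DataFrame would be loaded from the source database or file:

import pandas as pd

# Hypothetical dataset; in practice this would be read from a database or file.
df = pd.DataFrame({
    "age": [25, 32, 47, None, 51, 29],
    "income": [42000, 58000, 61000, 52000, None, 48000],
    "segment": ["A", "B", "B", "A", "C", "B"],
})

df.info()                              # column types and non-null counts
print(df.describe(include="all"))      # summary statistics for every column
print(df.isna().sum())                 # missing values per column
print(df.duplicated().sum())           # number of duplicate rows
print(df["segment"].value_counts())    # value distribution of a categorical column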
After these automated steps, analysts typically drill down or filter the data manually to investigate irregularities or items flagged during automation. In addition, data exploration commonly entails manually writing commands or queries against the data, for example in Structured Query Language (SQL) or a related programming language, and leveraging specialised applications that enable visual analysis of unstructured data. This flexible, hands-on approach gives analysts room to form and test their own questions about the data rather than relying only on fixed rules or automated summaries.
Upon gaining an initial understanding of the data, it can be refined: unnecessary or erroneous records are removed (data cleaning), incorrectly structured entries are corrected, and previously hidden relationships between data collections are uncovered. This stage also surfaces gaps in the reliability and uniformity of the datasets and produces concise statements about their quality. Carrying out these steps carefully helps prevent small errors from propagating into later stages, ensures that subsequent statistical interpretations rest on a sound understanding of the data, and keeps the evolving dataset aligned with the standards and requirements of the analysis.
Apart from organised inquiry and prediction methods, another way to examine data is to pose open-ended questions in search of hidden patterns. This practice entails generating discoveries with little initial hypothesis building. Data analysis professionals and scientists centre their work around exploring data rather than just focusing on traditional statistics.
Experts use Exploratory Data Analysis (EDA) to examine datasets employing
assorted strategies to unveil underlying patterns, establish early indications,
and isolate anomalous features. Understanding the character of the data
through these techniques facilitates ensuing analysis and decision-making.
Here are some critical exploratory data analysis techniques:
1. Summary Statistics:
(i) Mean, Median, Mode: Central tendency measures reveal a variable's typical or common value. The mean is calculated by adding up every value and dividing the total by the number of values in the dataset, and it indicates where the centre of the data lies. Notably, the mean can be strongly influenced by exceptional data points (outliers), which pull it towards extreme values. The median is the middle value when the data are organised in ascending or descending order; it resists the impact of outliers and is particularly useful for skewed or irregular distributions. Together, the mean and median show how the data cluster: when the two values are close, the distribution tends to be roughly symmetrical, whereas a substantial gap between them may indicate outliers or skewness. The mode is the value that occurs most frequently among the dataset's elements and offers a further view of central tendency; a multimodal dataset frequently signals several distinct clusters hidden within it.
(ii) Standard Deviation, Variance: Measures of dispersion describe how widely values spread out from the average. The standard deviation is the typical deviation of data points from the mean; a larger value implies greater spread or variability in the dataset. In the case of exam scores, for example, a low standard deviation implies that scores are stable and close together, whereas a high standard deviation signals wide diversity. The variance is the average of the squared differences between each data point and the mean, and the standard deviation is its square root; squaring the deviations allows a more direct assessment of dispersion, and a larger variance likewise suggests more widely scattered data. Percentiles offer a complementary view of spread: alongside the median (the 50th percentile, which splits the data into two halves), the 25th and 75th percentiles (the quartiles) help analysts understand how variable the data are across different parts of the distribution.
(iii) Percentiles and Quartiles: Percentiles describe the position of values within the distribution of a dataset, while quartiles divide the data into four groups, each containing 25% of the observations. The 25th percentile (the first quartile, Q1) is the point below which 25% of the data fall, and the 75th percentile (the third quartile, Q3) is the point below which 75% of the data fall. Comparing these cut-off points, together with the median (the second quartile), shows how the data are spread across the lower and upper parts of the distribution.
These measures help identify central tendencies (where the data cluster) and variation (spread) within the dataset. Investigating them gives professionals an intimate understanding of the data, helping them make educated choices, discover patterns, and extract valuable insights; a short Python sketch computing these statistics follows.
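The sketch below assumes the pandas library and a hypothetical series of exam scores (the values are invented for illustration):

import pandas as pd

scores = pd.Series([56, 61, 61, 70, 72, 75, 78, 83, 90, 95])  # hypothetical exam scores

print(scores.mean())                       # mean
print(scores.median())                     # median
print(scores.mode().tolist())              # mode(s)
print(scores.std(), scores.var())          # standard deviation and variance
print(scores.quantile([0.25, 0.5, 0.75]))  # quartiles: 25th, 50th and 75th percentiles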
2. Data Visualisation:
(i) Scatter Plots: Used for examining relationships between two continuous variables; patterns, connections, and outliers become evident.
(ii) Histograms: Illustrate the distribution of a single variable by displaying the frequency or count of data falling within each interval (bin).
(iii) Box Plots (Box-and-Whisker Plots): Highlight the distribution's median, spread, and unusual observations.
(iv) Bar Charts and Pie Charts: Employed for categorical data, depicting frequencies or proportions.
(v) Line Charts: Used to analyse trends and repeating patterns in time series data.
3. Correlation Analysis:
(i) Pearson Correlation Coefficient: Quantifies the intensity and
orientation of the link between two continuous measures.
(ii) Spearman Rank Correlation: Designed for data where parametric
measures fail, this correlation metric is non-parametric and
reliable.
4. Categorical Data Analysis:
(i) Frequency Tables and Cross-Tabulations: Illuminate how often
various categories surface in a dataset and their interconnections.

(ii) Bar Plots and Stacked Bar Plots: Represent the distribution of categorical variables visually.
5. Multivariate Analysis:
(i) Scatter plot Matrix: These plots effectively illustrate the
relationships between pairs of variables, significantly contributing
to data analysis.
(ii) Parallel Coordinates Plot: Illustrates how multiple factors
interrelate by depicting each as an upright line.
(iii) Principal Component Analysis (PCA): Reduces many correlated variables to a few components that retain the vital details, generating a more digestible representation.
6. Outlier Detection:
(i) Z-Scores: Determine how far away a data point is from the
mean by measuring in standard deviations.
(ii) Interquartile Range (IQR): Determining the gap between the
first and third quartiles highlights potential outliers.
7. Distribution Fitting:
(i) Normality Tests: Statistical tests that check whether a variable follows a normal distribution.
(ii) Kernel Density Estimation: Estimates the probability density of a variable, giving a smooth picture of how it is distributed.
8. Geospatial Analysis:
(i) Heat Maps: Depict data density using colour gradients on a
geographical map.
(ii) Choropleth Maps: Use colours or patterns to represent data
values in different geographical areas.
9. Time Series Analysis:
(i) Time Plots: Display data over time to identify trends, seasonality,
and patterns.
(ii) Autocorrelation Function (ACF) and Partial Autocorrelation
Function (PACF): Identify time-dependent relationships.


10. Dimensionality Reduction:


(i) PCA and t-SNE: Techniques to reduce high-dimensional data
to lower dimensions for visualisation.
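Several of the techniques listed above can be tried out in a few lines of Python. The following minimal sketch, assuming pandas and NumPy and a small hypothetical height/weight dataset, computes Pearson and Spearman correlations and flags outliers using Z-scores and the interquartile range:

import numpy as np
import pandas as pd

df = pd.DataFrame({"height": [150, 160, 165, 170, 172, 175, 210],
                   "weight": [50, 55, 60, 68, 70, 74, 120]})   # hypothetical data

# Correlation analysis: Pearson and Spearman coefficients.
print(df["height"].corr(df["weight"], method="pearson"))
print(df["height"].corr(df["weight"], method="spearman"))

# Outlier detection with Z-scores: points more than 2 standard deviations from the mean.
z = (df["weight"] - df["weight"].mean()) / df["weight"].std()
print(df[np.abs(z) > 2])

# Outlier detection with the IQR rule: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["weight"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["weight"] < q1 - 1.5 * iqr) | (df["weight"] > q3 + 1.5 * iqr)])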
Data exploration is a vital step that lies at the core of the data analysis process. It holds significant importance, providing insights that offer an in-depth comprehension of the dataset and substantially influencing decision-making and the success of data-driven initiatives. Here are some
key aspects highlighting the importance of data exploration:
1. Understanding Data Characteristics:
(i) Data exploration reveals the fundamental aspects of the dataset,
such as size, structure, and complexity.
(ii) Identifying data types and their respective value ranges within
variables is made easier with its aid.
2. Data Quality Assessment:
(i) Data exploration enables analysts to determine data quality
by discovering absent entries, multiple entries, and irregular
patterns.
(ii) Timely detection of data quality concerns allows for efficient
data cleaning and pre-processing.
3. Pattern Detection:
(i) Examining data uncovers patterns, trends, and connections
between factors.
(ii) Analysts uncover obscure connections and interdependencies
during their investigations.
4. Anomaly Detection:
(i) Exploratory data analysis assists in identifying anomalies, outliers, or unusual data points.
(ii) Detecting anomalies is essential for both quality assurance and fraud protection.
5. Hypothesis Generation:
(i) Data investigation helps discover valuable insights and formulate
queries.


(ii) Within the data, it sheds light on potential cause-and-effect connections.
6. Feature Selection:
(i) Uncovering essential variables (features) is critical in machine
learning and predictive modelling.
(ii) Efficient model-building can be achieved through this method.
7. Data Visualisation:
(i) To simplify the communication of findings to stakeholders, data
visualisation techniques play a crucial role in data exploration.
(ii) Visual representations allow us to comprehend complex ideas
in a more manageable way.
8. Data Pre-processing Guidance:
(i) The progression of pre-processing hinges on the exploration
of data to address missing values, outliers, and skewed
distributions successfully.
(ii) Providing critical context for scaling and transformation, this
information makes key decisions easier.
9. Model Assumptions and Selection:
(i) To select statistical models correctly, comprehending data
distribution and qualities is indispensable.
10. Improved Decision-Making:
(i) By analysing data, new insights emerge, guiding decision-
making.
(ii) Data-driven decision-making enables businesses to profit from
strategic choices.
11. Data-Driven Storytelling:
(i) Exploring data enables crafting captivating stories around it.
(ii) Enhancing data visualisation brings discoveries closer to home, making them more tangible and actionable.
12. Risk Mitigation:
(i) Spotting data quality concerns early on ensures that decisions
are based on complete and accurate information.


13. Time and Resource Efficiency:


(i) Expert data investigation can make future analysis faster by
avoiding repetition or unnecessary work.
(ii) This enables efficient allocation of resources towards data
segments offering the most potential.
Exploratory Data Analysis is not a linear process; instead, it involves
iterative exploration, visualisation, and refinement to unveil meaningful
insights and inform subsequent steps in the data analysis lifecycle. It
helps analysts uncover hidden patterns, anomalies, and relationships that
may not be immediately apparent, laying the groundwork for more in-
depth analysis and decision-making. In essence, data exploration forms
the bedrock of effective data analysis. It unearths concealed insights,
prioritises data accuracy, and directs forthcoming analysis and judgment-
making actions. For individuals working with data, this elemental step is
central to all data-oriented disciplines.

IN-TEXT QUESTIONS
3. What is the primary goal of data exploration in the data analysis
process?
(a) To make data look presentable
(b) To understand data characteristics, patterns, and relationships
(c) To clean and pre-process data
(d) To build predictive models
4. Which of the following statements is true regarding data
exploration?
(a) Data exploration is primarily focused on building predictive
models
(b) Data exploration aids in uncovering hidden insights and
patterns in data
(c) Data exploration is only concerned with data cleaning
(d) Data exploration is the final stage in the data analysis
process


3.5 Data Visualisation: Communicating Insights

In the process of turning raw data into compelling and understandable


visual representations, data visualisation is both an art and a science. The
focus shifts beyond mere numbers to uncover the hidden narrative, trends,
and insights lurking within the data. Individuals can efficiently understand
complex topics and reach sound conclusions by visually representing data.
Key aspects of data visualisation include:
‹ Choosing the Right Visualisations: Select the correct chart, graph,
or visualisation technique for the data and message you wish to
communicate. Common graphic representations include bar charts,
line graphs, scatter plots, and pie charts.
‹ Enhancing Clarity: Ensuring the visuals are neat, uncluttered, and quickly graspable. Labels, legends, and colours must be used appropriately to avoid confusion.
‹ Revealing Patterns: Raw data takes shape into meaningful insights
through the power of data visualisation. It may yield valuable
insights for decision-making.
‹ Highlighting Outliers: Visualisations come in handy for identifying data points such as outliers, anomalies, or unusual patterns.
‹ Storytelling: Through thoughtfully crafted visuals, data takes on
depth and significance, illuminating understandings and motivating
viewers to act on them.
‹ Interactivity: Users can immerse themselves in data through interactive
visualisations, delving deeper and discovering hidden insights.
‹ Decision Support: Insightful visualisations foster sound decision-
making by leveraging data-driven perspectives. Acting as a bridge
between statistics and choices, they contribute significantly.
‹ Audience Consideration: Designing visualisations tailored to viewers’
respective levels of understanding and encounter with the specific
topic.
Data visualisation techniques allow for diverse approaches to convey
meaning and foster collaboration. Here are some common types of data
visualisation:


1. Bar Charts: Used for comparing categorical data or analysing the distribution of a single variable. Bar charts come in various forms: grouped, stacked, and horizontal.
2. Histograms: A single variable’s distribution may be visualised by
separating values into bins and displaying how many instances fall
under each container.
3. Line Charts: These showcase trends and changes observed in data
over time. These charts are most commonly used for time series
data.
4. Scatter Plots: Individual data points are plotted as discrete dots on a two-dimensional plane, making them handy for visualising relationships between two continuous variables.
5. Pie Charts: A circle represents the whole, and its segments represent the parts. They are typically used for categorical or nominal data with a limited number of categories.
6. Area Charts: Similar to line charts in overall appearance, but distinct in that the area beneath the line is filled in. They are useful for depicting cumulative information, such as stacked area charts tracking several variables over specified periods.
7. Heatmaps: The intensity of colours depicts data values in a matrix or grid. They are well suited to revealing nuanced connections and correlations.
8. Bubble Charts: Like scatter plots, but with the added feature of
bubble sizes representing an extra dimension. Tools like these are
handy because they can visualise multiple variables simultaneously.
9. Box Plots (Box-and-Whisker Plots): Display a dataset's median, quartiles, and detectable irregularities (outliers).
10. Violin Plots: Combine box plots with kernel density plots to depict the distribution of the data. Most effective when comparing several datasets.
11. Treemaps: Depicting hierarchical data, use nested rectangles with
each level exemplifying a different visual hierarchy.


12. Word Clouds: Determine the size of words based on their frequency
and visualise the text data accordingly. Textual data can be analysed
for common terms.
13. Network Diagrams: Graphically representing networks, displaying
nodes as entities and connections as edges. Through social network
analysis and systems modelling, this is often used.
14. Choropleth Maps: Represent data by region using colour shading or patterns, making them ideal for displaying regional data such as population density or election results.
15. Sankey Diagrams: Visualise the movement of data or resources across categories or stages.
16. Radar Charts (Spider Charts): Showcasing multivariate data on a
radial grid, each axis represents a different variable. Across various
dimensions, entities can be compared using this technique.
17. 3D Visualisations: A third dimension elevates standard chart styles,
such as 3D scatter plots and surface plots. Effective for visualising
volumetric data.
18. Interactive Dashboards: Visualisations and controls can be incorporated
into an interactive interface. The application allows for timely
insights and data analysis.

Figure 3.2: Different Types of Graph


These examples provide a glimpse into the rich diversity of data visualisation
approaches. Choosing a visualisation type hinges on the data, audience,


and desired analysis revelations. Through more effective communication,


data visualisation can deepen comprehension, facilitate decision-making,
and clarify complex information.
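As a small, hedged illustration of a few of the chart types above, the following Python sketch assumes the matplotlib library and invented values; any real dataset could be substituted:

import matplotlib.pyplot as plt

sales = [120, 95, 140, 110]                      # hypothetical quarterly sales
scores = [56, 61, 61, 70, 72, 75, 78, 83, 90]    # hypothetical numeric values

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].bar(["Q1", "Q2", "Q3", "Q4"], sales)         # bar chart: categorical comparison
axes[0, 1].hist(scores, bins=5)                         # histogram: distribution of one variable
axes[1, 0].plot(range(1, 5), sales, marker="o")         # line chart: trend over time
axes[1, 1].scatter(scores, [s * 1.1 for s in scores])   # scatter plot: two numeric variables
plt.tight_layout()
plt.show()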

IN-TEXT QUESTIONS
5. What is the primary purpose of data visualisation in data
analysis?
(a) To replace data collection processes
(b) To make data look presentable
(c) To summarise data characteristics in tables
(d) To communicate data insights effectively
6. In data visualisation, what is the purpose of creating a line chart?
(a) To display the distribution of a single variable
(b) To show relationships between two numerical variables
over time
(c) To display categorical data using bars
(d) To create 3D visualisations

3.6 Summary
This unit comprehensively explores key concepts of the data analytics
process. Here are some of the key points that are covered in this unit:
‹ In the first section, we explored the fundamentals of data analytics.
Data analytics is a systematic process that involves collecting,
preparing, analysing, visualising, and interpreting data to extract
valuable insights and inform decision-making. We introduced the
Data Analytics Lifecycle, a structured framework that outlines the
stages of data analysis, including data collection, preparation, analysis,
visualisation, and interpretation. This section laid the foundation for
understanding data analytics’s key components and processes.
‹ In the second section, we covered data exploration, the initial phase
of data analysis, focusing on understanding data characteristics,
identifying patterns, and assessing data quality. We discussed
techniques such as summary statistics and data profiling, which


help summarise and examine each variable's characteristics in a dataset.
‹ The last section discusses data visualisation, which is a crucial
component of data analysis, enabling data representation through
charts, graphs, and visuals to facilitate data understanding and
communication. We explored various data visualisation types, such
as histograms and scatter plots.

3.7 Answers to In-Text Questions

1. (c) To extract insights and inform decisions


2. (d) It provides a framework that can adapt to diverse data analysis
needs
3. (b) To understand data characteristics, patterns, and relationships
4. (b) Data exploration aids in uncovering hidden insights and patterns
in data
5. (d) To communicate data insights effectively
6. (b) To show relationships between two numerical variables over
time

3.8 Self-Assessment Questions


1. Why is the Data Analytics Lifecycle considered an iterative process?
Provide an example of how iteration benefits data analysis.
2. Explain the role of data interpretation in the Data Analytics Lifecycle
and its importance in drawing meaningful conclusions.
3. Define data exploration and its importance in the data analysis
process.
4. What are summary statistics, and how do they assist in understanding
data characteristics?
5. Discuss the significance of data profiling in data exploration and list
the types of information it reveals.
6. What is the primary purpose of data visualisation, and how does it
aid in data analysis and communication?


7. List some best practices in data visualisation and explain why they
are important for creating compelling visualisations.

3.9 References
‹ Bruce, P., Bruce, A., & Gedeck, P. (2020). Practical statistics for
data scientists: 50+ essential concepts using R and Python. O’Reilly
Media.
‹ Yau, N. (2013). Data points: Visualisation that means something.
John Wiley & Sons.
‹ Maheshwari, A. (2014). Data analytics made accessible. Seattle:
Amazon Digital Services.

3.10 Suggested Reading


‹ Provost, F., & Fawcett, T. (2013). Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.

LESSON 4
Foundations of Predictive Modelling: Linear and Logistic Regression, Model Comparison, and Decision Trees
Aditi Priya
Research Scholar
School of Computer & Systems Sciences
JNU, New Delhi - 110067
Email-Id: kajal.aditi@gmail.com

STRUCTURE
4.1 Learning Objectives
4.2 Introduction
4.3 Linear Regression
4.4 Logistic Regression
4.5 Model Comparison
4.6 Decision Trees
4.7 Summary
4.8 Answers to In-Text Questions
4.9 Self-Assessment Questions
4.10 References
4.11 Suggested Reading


4.1 Learning Objectives


‹ To understand the concept of linear regression, types of linear
regression and its application areas.
‹ To understand the concept of logistic regression and its types.
‹ To study the different techniques available for model comparison.
‹ To understand the concept of the decision tree, the process of
formation of a decision tree and its advantages.

4.2 Introduction
Predictive modelling is integral to data science, playing a vital role in unearthing essential insights and directing intelligent choices. Historical
data-based predictions or trend forecasts are created through mathematical
and statistical modelling techniques. The importance of predictive modelling
in data science cannot be overstated, as it enables organizations to:
‹ Anticipate Trends: Forecasting future trends is where predictive
models come into play, keeping businesses competitive and informed.
Anticipating customer preferences and market shifts is vital for
maintaining a competitive edge.
‹ Optimize Operations: Predictive models provide data-driven insights into operational efficiencies, helping businesses optimize their processes and resource allocation. These models support efficient inventory management, supply chain optimization, and workforce scheduling, reducing costs and improving productivity.
‹ Improve Decision-Making: Models predict outcomes by offering probabilistic estimates and identifying factors that contribute to them. Data
scientists help make informed and critical decisions about different
business strategies and resource allocation by analyzing past data
and trends and identifying valuable insights.
‹ Enhance Customer Experience: Marketing and e-commerce benefit
from predictive modelling through enhanced customer experiences
and targeted audiences.


‹ Mitigate Risks: Organizations can take proactive measures by


assessing risks and uncertainties through predictive modelling.
This chapter explains the fundamental predictive modelling methods and
their practical application in data science initiatives. This chapter delves
into the fundamentals of predictive modelling by focusing on four essential
techniques. These methods are decision trees, linear regression, logistic
regression, and model comparison. These methods are part of every data
scientist’s baseline and are extensively utilized for diverse predictive work.
The key is understanding how they operate and assessing their uses and
contrasts. An outline of these methods is discussed below:
‹ Linear Regression: Linear regression is an approach to model the
connection between a dependent variable and one or more independent
variables. Applications of linear regression are utilized in tasks
related to regression and classification.
‹ Logistic Regression: Logistic regression provides a powerful way to tackle binary classification problems. Its applications, interpretation, and evaluation will be discussed.
‹ Decision Trees: Intuitive approaches and decision trees are offered
for predictive modelling. Decision trees are assembled, and their
benefits and capacity for classification and regression duties are
discussed.
‹ Model Comparison: Critical in predictive modelling is evaluating
and comparing models. Common evaluation metrics, cross-validation
techniques, and model selection strategies will be explored.
We will discuss these in detail in the following sections.

4.3 Linear Regression


By fitting a linear equation to observed data, linear regression is a statistical
technique used in predictive modelling to model the relationship between
a dependent variable and independent variables. Linear regression aims
to find the best-suited linear relationship that matches the dependency
of the dependent variable on the values of the independent variables. A
linear equation is formed through linear regression, which helps predict
the value of the dependent variable for any new data points.


Purpose in Predictive Modelling:


Predictions and understanding relationships between variables are the
fundamental objectives of linear regression in predictive modelling. A
detailed explanation is presented below to explain the purpose of linear
regression in predictive modelling.
‹ Predictive Power: Linear regression is a straightforward yet potent
means of making predictions. Training on historical data allows the
model to predict future values based on the independent variables.
‹ Interpretability: Models that demonstrate linear regression are often
high in interpretability. The relationships between predictors and
targets become clear using the model equation coefficients. Analysts
and decision-makers can decipher factors contributing to predictions
due to transparency.
‹ Variable Importance: Identifying the independent variables affecting
the dependent variable entails examining the coefficients. Decision-
making and prioritizing resources demand access to this data.
‹ Assumptions Testing: Linear regression models require certain assumptions to be met, including linearity, independence of errors, and homoscedasticity. Checking these assumptions through testing ensures the model's reliability and suitability.
‹ Feature Selection: Feature selection is aided by linear regression,
highlighting which predictors are most pertinent to the target variable.
Efficient models and deeper understanding come from analyzing
data this way.
‹ Baseline Model: Linear regression is commonly used as a starting
point in modelling. It serves as a reference point to compare against
intricate models. If a linear model performs well, it suggests that
the data may not necessitate a more complex approach.
‹ Regression analysis: In addition to prediction, linear regression
is employed for regression analysis. This involves examining the
connections between variables, comprehending how changes in
predictors influence the target variable, and conducting hypothesis
testing.
Linear regression relates variables through the best-fitting linear equation.
This linear equation adopts the form of a straight line in simple linear regression or a hyperplane in multiple linear regression. The main objective is to comprehend how the independent variables influence the dependent variable.
There are several steps involved in the modelling of the relationship
between dependent and independent variables using linear regression—
‹ Defining the linear equation: The equation for a linear regression
model is represented by:
Y = b0 + b1X + ε
where Y represents the dependent variable being predicted, X represents the independent variable, b0 represents the intercept, b1 represents the slope of the line, and ε represents the error term.
‹ Fitting the line or hyperplane: The objective is to find the values of b0 and b1 that minimize the sum of the squared differences between the observed and predicted values. The result is the line or hyperplane that fits the data points in the best possible way.
‹ Interpretation of the coefficients: After the model is fitted, the values
of coefficients are obtained. These coefficients provide information
about the relationship between the variables.
‹ Making Predictions: Once the values of the coefficients are obtained, the model can be used to make predictions for new data points.
‹ Evaluating Model Fit: The fitness of a model is estimated using
various measures. The most commonly used evaluation measures
are mean squared error, R-squared, etc. These measures tell how
well the predicted values match the actual values.
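The steps above can be sketched in Python with the scikit-learn library; the data here is hypothetical and the snippet is only an illustrative outline of the workflow, not a prescribed implementation:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical data: X is the independent variable, y the dependent variable.
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.8, 12.2])

model = LinearRegression().fit(X, y)   # fit the line (estimate b0 and b1)
print(model.intercept_, model.coef_)   # interpret the coefficients
y_pred = model.predict(X)              # make predictions
print(mean_squared_error(y, y_pred))   # evaluate fit: mean squared error
print(r2_score(y, y_pred))             # evaluate fit: R-squared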
Simple linear regression is a statistical technique to model the relationship
between two variables. In statistical analyses, response variables are
paired with independent variables. By doing so, an equation that most
closely aligns with the relationship between the predictor variable and the
response variable is aimed to be established. The primary objective is to
measure the relationship. Simple linear regression is employed when a
single predictor variable influences a response variable. It is used when
you want to model and understand how changes in a single predictor
variable (X) relate to changes in a response variable (Y).


The applications of simple linear regression in various scenarios are as


follows:
‹ Preliminary Analysis: Simple linear regression plays a vital role
in the initial exploration of the relationship between the response
variable and a single predictor. It provides a reliable way to check
whether a linear connection exists before moving on to complex
models.
‹ Single-Predictor Relationships: Simple linear regression can help
investigate how predictors relate to each other and the response
variable when multiple predictors are involved.
‹ Teaching and Illustration: Simple linear regression models are
employed in educational institutions to understand basic regression
concepts. These models offer modelling connections between variables
directly and simply.
‹ Simplified Models: Single predictor variables can sometimes drive
the response variable, making it possible to model them effectively
through simple linear regression. Complex calculations are reduced
when only one predictor is involved.
Multiple linear regression is a statistical technique examining the
connection between a response variable and one or more predictor
variables. Compared to simple linear regression, which concentrates on
one predictor, multiple linear regression accounts for multiple factors. By
creating a linear equation, the main goal centres around explaining how
changes in predictor variables impact the response variable.
Multiple linear regression is employed in different scenarios, such as:
‹ Economics and Finance: Financial models involve various factors
such as interest rates, inflation, and economic indicators. The
relationships between these factors can be modelled through multiple
linear regression.
‹ Marketing and Sales: Multiple linear regression is essential for
businesses. It helps to unravel how marketing strategies, including
advertising spend and pricing, affect sales and customer conduct.
‹ Medical Research: In medical studies, age, gender, lifestyle, and
genetic markers are often involved in predicting health outcomes
and disease risk.


‹ Environmental Science: Multiple linear regression is a tool


environmental researchers employ when examining how pollution
levels, climate data, and land use affect environmental indicators
like air quality and biodiversity.
‹ Quality Control: To understand how temperature, humidity, and
production parameters influence product quality, multiple linear
regression is used in manufacturing.
‹ Real Estate: Multiple regression models are used to examine property
values.
‹ Education: Multiple linear regression helps educational researchers understand the effect of class size, teacher experience, and funding on student performance.
Assumptions of Linear Regression Models:
‹ Linearity: A linear association exists between predictors and responses
in a linear regression model. When all other factors remain constant,
changes in predictors result in proportional changes in responses.
Scatter plots can be used to check the dependent variable’s linearity
against each independent variable.
‹ Independence of Errors: In this, each data point has its error, unrelated
to the errors of others. This assumption is essential to ensure the
validity of statistical tests and model estimates. Violations of this
can be identified through residual plots or time series analysis,
depending on the nature of the data.
‹ Homoscedasticity: There should be a consistency in the variance of
errors across all levels of the predictor. The same spread of residuals
is observed across varying values of the predictors. Widening or
narrowing the spread of residuals as the predictors shift can violate
homoscedasticity. Robust standard errors or data transformation
techniques like weighted least squares can address this issue.
‹ Normality of Residuals: The residuals should be normally distributed, ideally with a mean of zero. If the errors are not normally distributed, the accuracy of hypothesis tests and confidence intervals is affected. Histograms or quantile-quantile (Q-Q) plots of the residuals can be used to check normality.
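A minimal sketch of checking some of these assumptions visually, assuming matplotlib, SciPy, and scikit-learn and reusing the kind of hypothetical data shown earlier, might look like this:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5], [6]])      # hypothetical predictor
y = np.array([2.1, 4.3, 6.2, 8.1, 9.8, 12.2])     # hypothetical response
y_pred = LinearRegression().fit(X, y).predict(X)
residuals = y - y_pred

# Residuals vs. fitted values: a roughly random scatter around zero suggests
# the linearity and homoscedasticity assumptions are reasonable.
plt.scatter(y_pred, residuals)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Q-Q plot: points lying close to the line suggest approximately normal residuals.
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()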


IN-TEXT QUESTIONS


1. What is the primary purpose of linear regression in predictive
modelling?
(a) To visualize data patterns
(b) To model the relationship between a dependent variable
and one or more independent variables
(c) To discover patterns in unstructured data
(d) To perform clustering analysis
2. Which type of regression is used when there is a single predictor
variable and a single response variable?
(a) Logistic regression
(b) Multiple regression
(c) Simple linear regression
(d) Polynomial regression

4.4 Logistic Regression


Logistic regression is a method for handling binary and multi-class
classification problems. It is a regression analysis that models the
relationship between a binary dependent variable and one or more
independent variables. Logistic regression measures the probability that an
input belongs to a specific category. The logistic function, also known as
the sigmoid function, is a key element of logistic regression. It transforms
linear combinations of the predictor variables into values between 0 and
1. The logistic function has the following form:
P(Y = 1) = 1 / (1 + e^-(b0 + b1X1 + b2X2 + ... + bpXp))
Where P(Y = 1) represents the probability of the dependent variable Y belonging to class 1, b0 is the intercept term, and b1, b2, ..., bp represent the coefficients of the independent variables X1, X2, ..., Xp.
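As an illustrative sketch (not an example from the text), logistic regression can be fitted in Python with scikit-learn on hypothetical binary-outcome data, here hours studied versus passing an exam:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied (X) and whether the exam was passed (y: 0/1).
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.intercept_, clf.coef_)    # b0 and b1 on the log-odds scale
print(clf.predict_proba([[4.5]]))   # estimated probabilities for class 0 and class 1
print(clf.predict([[4.5]]))         # predicted class label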
The application areas of Logistic regression include:
‹ Medical Diagnosis: Logistic regression allows medical professionals
to forecast if a patient has a particular medical condition based on
test results and patient information.


‹ Marketing: Based on demographics and past behaviour, logistic


regression can help predict how customers respond to marketing
campaigns.
‹ Employee Attrition: Logistic regression can predict whether an
employee will leave a job based on job satisfaction, salary, and
work environment.
The difference between Linear and Logistic Regression is as follows:
1. Dependant variables:
(a) Linear Regression: In this, the dependent variables are continuous.
An example of this is the prediction of house prices.
(b) Logistic Regression: The dependent variable is a discrete value,
such as a binary variable. An example of this is determining
whether an email is spam or not.
2. Output:
(a) Linear Regression: No limits on output. The prediction includes
a continuous range of values.
(b) Logistic Regression: Observations get classified by predicting
probabilities between 0 and 1.
3. Equation:
(a) Linear Regression: Continuous variables are best represented
by the linear equation, which forms a straight line.
(b) Logistic Regression: Based on the logistic function (S-shaped
curve), the equation calculates probabilities from inputs.
4. Assumptions:
(a) Linear Regression: Assumes a linear relationship between
predictors and the dependent variable.
(b) Logistic Regression: Assumes a linear relationship between
predictors and the log odds of the dependent variable, which
allows the modelling of probabilities in a binary classification
context.


IN-TEXT QUESTIONS


3. What is the primary purpose of logistic regression in predictive
modelling?
(a) To model the relationship between two continuous variables
(b) To predict a categorical outcome or probability of an
event occurring
(c) To handle missing data
(d) To perform clustering analysis
4. What is the logistic function used for in logistic regression?
(a) To calculate the mean squared error of the model
(b) To transform the predictor variables
(c) To estimate the probability of a binary outcome
(d) To reduce multicollinearity in the dataset

4.5 Model Comparison


Comparison of different predictive models plays an essential role in creating
practical and precise predictive systems. The importance of comparing
predictive models involves several critical aspects:
‹ Performance Evaluation: Comparing models makes it possible to assess their predictive performance and to answer which model performs better. Performance metrics such as accuracy, precision, recall, F1-score, and AUC-ROC are crucial for quantifying and comparing model performance.
‹ Bias and Variance Trade-off: Balancing model bias and variance
requires comparing different models. Lower bias models are often
paired with higher variance, whereas others show the opposite
characteristics. Models that generalize well require understanding
this trade-off between generalization and performance.
‹ Resource Efficiency: Model comparison examines computational
as well as resource needs. The computational expense can be a
hurdle for models intended for real-time use or large datasets. By


comparing models, we can choose resource-efficient models that can meet deployment constraints.
‹ Interpretability and Explainability: Different models might possess
different interpretability and explainability levels. Model comparison allows you to trade off predictive performance against interpretability.
‹ Robustness and Generalization: Different models perform differently
when noise is present or the data is missing. Through comparison,
evaluating model robustness ensures that the chosen model is less
affected by data imperfections and can generalize well to new data.
Comparing predictive models involves a comprehensive evaluation process
to assess their performance, generalization capability, and suitability
for a specific task. Several techniques and concepts are crucial in this
comparison, including cross-validation, evaluation metrics, and the bias-
variance trade-off. Here is an overview of each:
1. Cross-Validation: Cross-validation is one of the most commonly
used techniques for comparison of different models. It helps to
predict the performance of the model on unknown data. The dataset
is segregated into multiple sets or folds for training and testing.
There are different cross-validation techniques, such as K-fold cross-
validation, Hold-out cross-validation, Leave-one-out cross-validation,
Leave-p-out cross-validation, Stratified K-fold cross-validation, etc.
An outline of these techniques is as follows:
(a) K-Fold cross-validation: The data is segregated into k subsets
in this technique. The model is trained on k-1 subsets, and
testing is done on the remaining one. This process is repeated
iteratively for k number of times, with each subset taken as
the test set once. At last, the average of results obtained for
each iteration is taken to determine the model’s performance.
(b) Hold-out cross-validation: The dataset is randomly segregated
into training and test sets. Mostly, 70% of the data is taken as
a training set, and the remaining 30% is taken as a test set.
The dataset is divided into only two sets, so the execution of the model is fast.


(c) Leave-one-out cross-validation: In this, n-1 samples of the dataset are taken as training data, and only one sample is taken as test data. This method is computationally expensive for large datasets.
(d) Leave-p-out cross-validation: This is a generalized version of the Leave-one-out cross-validation. From a dataset of n samples, n-p samples are taken as training data and p samples are taken as test data.
(e) Stratified K-fold cross-validation: This is an improved version of the K-fold cross-validation. The dataset is divided into k folds, where each fold preserves the same ratio of target classes as the complete dataset. This performs well for imbalanced datasets.
The advantages of Cross-validation are:
‹ Reducing Overfitting: The techniques used for cross-validation
reduce the chances of overfitting by incorporating robustness.
‹ Generalization: Cross-validation techniques provide a better
assessment of the performance of the model to unknown data.
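A minimal sketch of K-fold cross-validation with scikit-learn, using a hypothetical synthetic dataset, could look like the following; the choice of model and of five folds is only illustrative:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic dataset with 200 samples and 5 features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")  # 5-fold cross-validation
print(scores)          # accuracy on each fold
print(scores.mean())   # average accuracy as an estimate of generalisation performance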
2. Evaluation Metrics: Different evaluation measures are used to
compare the performance of different models. The choice of metrics
is based on the task type, whether it is classification or regression.
A few of them are described below:
(a) Classification Metrics:
(i) Accuracy: It is defined as the ratio of correctly classified instances to the total number of instances.
(ii) Precision and Recall: These measures are used for the
evaluation of trade-offs between true positives and
false positives and true positives and false negatives,
respectively.
(iii) F1-Score: This measure combines precision and recall.
(iv) ROC-AUC: This measure is used for finding out the area
under the Receiver Operating Characteristic curve.


(b) Regression Metrics:


(i) Mean Absolute Error (MAE): This is used for calculating
the mean absolute difference between the observed and
target values.
(ii) Mean Squared Error (MSE): This is used for calculating
the mean squared difference between the observed and
target values.
(iii) Root Mean Squared Error (RMSE): This is the square root of the MSE.
(c) Clustering Metrics:
(i) Silhouette Score: This finds out the quality of clustering
by measuring the intra-class similarity and inter-class
dissimilarity.
(ii) Davies-Bouldin Index: This metric is used to find the
average similarity between each cluster and its most
related cluster.
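The sketch below, assuming scikit-learn and invented true and predicted values, shows how several of these metrics can be computed:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score,
                             mean_absolute_error, mean_squared_error)

# Classification metrics on hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_prob = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_prob))

# Regression metrics on hypothetical observed and predicted values.
obs = [3.0, 5.0, 7.5, 10.0]
pred = [2.8, 5.4, 7.0, 9.5]
mse = mean_squared_error(obs, pred)
print(mean_absolute_error(obs, pred), mse, mse ** 0.5)   # MAE, MSE, RMSE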
3. Bias-Variance Trade-Off: This is one of the vital concepts used
to compare models. It represents the trade-off between bias and
variance described below:
‹ Bias: When the model is too simple and cannot capture the complexities present in the data, high bias occurs, resulting in underfitting.
‹ Variance: When the model is too complex, high variance occurs,
which results in overfitting.

IN-TEXT QUESTIONS
5. Which one of them is not a common technique used for
comparing models?
(a) Hypothesis testing
(b) Cross-validation
(c) Fitting the model to different subsets
(d) ROC-AUC


6. What is model comparison primarily used for in data science?


(a) To select the most complex model
(b) To determine the best-fitting model for a specific dataset
(c) To eliminate all but one predictor variable
(d) To visualize data patterns

4.6 Decision Trees


Decision tree learning is one of the most extensively used and practical
methods for inductive inference. It is a method for approximating discrete-
valued functions that is robust to noisy data and capable of learning
disjunctive expressions. Decision trees are used in machine learning
and statistics as a fundamental tool for classification and regression
tasks. These help depict decision-making procedures, mirroring human
rationality in a systematic and decipherable fashion. The intuitive nature
and flexibility to handle various data types make decision trees popular.
Decision trees classify data points into different categories or classes
during classification tasks. Each decision node shows an attribute test
outcome, with branches pointing to other decision nodes or leaf nodes
containing class labels. Traversing through the tree, a data point can have
its class label assigned from the root to the leaf node.
For regression tasks, decision trees predict numerical values instead of
class labels. Every decision node, much like classification trees, tests an
attribute, and each branch signifies an outcome, differing from them in
that leaf nodes hold numerical values. Predictions are made along the
path from the root to a leaf node. Because they mirror the way humans make decisions, decision trees are highly intuitive. Decision trees evaluate data and make
predictions/classifications based on a series of questions and criteria. Visual
representation in decision trees makes interpretation simple. The path
from the top to the bottom helps us understand how the model came to a
decision or forecast. Valuable for predictive modelling and understanding
data patterns, this transparency is a key aspect of decision trees.
Figure 4.1 illustrates a decision tree. This decision tree classifies Saturday
mornings according to their suitability for playing tennis. For example,
the instance (Outlook = Sunny, Temperature = Hot, Humidity = High,


Wind = Strong) would be sorted down the leftmost branch of this decision
tree and would, therefore, be classified as a negative instance (i.e., the
tree predicts that Play Tennis = no).

Figure 4.1: Decision Tree for Play Tennis: An example is classified by sorting it through the tree to the appropriate leaf node, then returning the classification associated with this leaf (in this case, Yes or No). This tree classifies Saturday mornings according to whether or not they are suitable for playing tennis.
Building decision trees involves various stages, such as selecting attributes,
determining splitting criteria, and potentially pruning the tree to improve its
generalization performance. The description of each stage is given below:
‹ Selection of Attributes:
(a) Root Node Selection: In the initial stage of building the decision
tree, all the attributes are considered for selection as a root
node. The attribute selected is the one that best separates the data points into different classes for classification tasks or has the maximum predictive power for regression tasks.
(b) Attribute Subset Selection: Given the root node, a subset of
attributes is evaluated at each decision node. The ultimate
goal is to find an attribute that divides the data finely and
leads to better predictions or homogenous classes.
‹ Splitting Criteria:
(a) Gini Impurity: The Gini impurity metric assesses the likelihood of misclassifying a randomly selected data point if it were labelled according to the class distribution at the node. Formally, Gini(D) = 1 – Σ pi², where pi is the proportion of data points belonging to class i. Lower Gini impurity indicates a purer, more homogeneous node.


(b) Information Gain: In classification tasks, information gain measures the reduction in entropy (randomness) achieved by splitting the data based on an attribute. A high information gain value indicates that the attribute is a better choice for splitting.
IG(D, A) = Entropy(D) – Entropy(D|A)
where IG(D, A) is the information gain when splitting data D based on attribute A, Entropy(D) is the entropy of data D, and Entropy(D|A) is the conditional entropy of data D given attribute A. (A small worked sketch of this computation is given after this list.)
‹ Recursive Splitting: The recursive application of attribute selection and splitting criteria creates decision nodes and branches. This process continues until a stopping criterion is reached, such as a maximum tree depth, a minimum number of data points per node, or the point at which the model's performance can no longer be improved.
‹ Pruning: Pruning is employed to quell overfitting. It involves
removing branches that do not substantially affect the performance
of the model on the validation set. Reduced-error pruning and cost-
complexity pruning are two commonly used approaches for pruning.
‹ Leaf Node Assignments: Leaf nodes represent the final predictions
or classifications. In classification, the class label with the majority
of data points in a leaf node is assigned as the predicted class. In
regression, the average of the target values in a leaf node is set as
the predicted value.
‹ Tree Visualization: Trees can be visualized, which makes them
highly interpretable. You can see the decision path from the root
to the leaves, showing the attributes and thresholds used at each
decision node.
‹ Model Evaluation: After building the tree, it is essential to evaluate
its performance on a separate validation dataset to ensure it
generalizes well to new, unseen data. Common evaluation metrics
include accuracy, precision, recall, F1-score (for classification), and
mean squared error (for regression).
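To make the splitting criteria above concrete, here is a minimal, illustrative Python sketch (not taken from the text) that computes entropy and information gain for a small split; the helper names and the tiny weather-style dataset are invented for demonstration only.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(records, attribute, target):
    # IG(D, A) = Entropy(D) - Entropy(D|A) over a list of dict records
    base = entropy([row[target] for row in records])
    conditional = 0.0
    for value in set(row[attribute] for row in records):
        subset = [row[target] for row in records if row[attribute] == value]
        conditional += (len(subset) / len(records)) * entropy(subset)
    return base - conditional

# Hypothetical mini version of the Play Tennis data
records = [
    {"Outlook": "Sunny", "Play": "No"}, {"Outlook": "Sunny", "Play": "No"},
    {"Outlook": "Overcast", "Play": "Yes"}, {"Outlook": "Rain", "Play": "Yes"},
    {"Outlook": "Rain", "Play": "No"},
]
print(information_gain(records, "Outlook", "Play"))  # about 0.57 bits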


In summary, building decision trees involves selecting attributes,
using splitting criteria like Gini impurity or information gain,
recursively creating decision nodes, optionally pruning the tree to
prevent overfitting, and evaluating its performance on validation data.
Decision trees are powerful tools for classification and regression
tasks, known for their simplicity and interpretability.
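As a brief, hedged illustration of how a decision tree can be built and inspected in practice, the sketch below uses scikit-learn (assumed to be installed); the toy weather table and column names are invented for the example and are not part of the original material.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented toy weather data in the spirit of the Play Tennis example
data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"],
    "Humidity": ["High", "Normal", "High", "High", "Normal", "Normal"],
    "Wind":     ["Weak", "Strong", "Weak", "Strong", "Weak", "Strong"],
    "Play":     ["No", "Yes", "Yes", "No", "Yes", "Yes"],
})

X = pd.get_dummies(data[["Outlook", "Humidity", "Wind"]])  # one-hot encode categorical attributes
y = data["Play"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Text rendering of the learned splits (root-to-leaf decision paths)
print(export_text(tree, feature_names=list(X.columns)))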
Several advantages make decision trees popular for machine learning and
data analysis. Some of the key benefits of decision trees include:
‹ Interpretability: Decision trees are highly interpretable models. The tree structure represents the decision-making process, where each node represents a choice or feature test and each branch symbolizes a possible outcome. This visual representation is easy for humans to understand and interpret, and experts and analysts can trace a path through the tree to see where a decision originated.
‹ Ease of Visualization: Users can see the entire decision-making process
thanks to the graphical visualization of decision trees. Decision tree
visualizations can be easily created with Graphviz or Python libraries
(e.g., Matplotlib, Scikit-learn). Models can be more efficiently
evaluated and validated via visualization, helping identify overfitting
issues.
‹ No Assumptions about Data Distribution: Decision trees function
without assuming anything about the data distribution. Unlike
some models, they perform well with non-normal or non-specific
distribution data. Decision trees’ flexibility makes them appropriate
for a myriad of datasets.
‹ Handling Non-Linear Relationships: Decision trees can capture non-linear relationships between features and the target variable. They are effective at modelling complex decision boundaries in the data where linear models may not perform well.
‹ Feature Selection: Decision tree construction naturally places the most informative attributes near the top of the tree, while features that do not contribute significantly to classification or regression are pushed toward the leaf nodes or left unused. This implicit feature selection simplifies the model and reduces the overfitting risk.


‹ Handling Mixed Data Types: Decision trees can process both categorical and numerical data, and some implementations handle missing values without extra pre-processing. Tree-based methods such as Random Forest inherit this ability to accommodate a variety of data types.
‹ Scalability: Decision trees work well on small to moderate-sized datasets, and ensemble methods that combine multiple trees, such as Random Forests and gradient-boosted trees, help address the limitations of a single tree on larger problems.
‹ Versatility: Decision trees can be used for classification, regression, and ranking tasks. Complex problems can also be tackled by combining multiple trees through ensemble methods like Random Forests or AdaBoost.
While decision trees have several advantages, they also come with
limitations that need to be considered when using them in machine
learning and data analysis:
‹ Overfitting: Deep decision trees can fit the noise in the training data, leading to overfitting. An overfitted tree fits the training data almost perfectly but may generalize poorly to unseen data. Pruning and setting a maximum tree depth are common tools used to mitigate overfitting.
‹ Instability: Minor alterations in the training data can significantly change the tree's structure, so very similar datasets may produce quite different trees. Ensemble methods like Random Forests help reduce this instability by averaging the predictions of multiple trees.
‹ Bias Toward Features with More Levels or Values: Splitting criteria tend to favour features with many distinct levels or values, so categorical attributes with many levels can bias decision tree construction and degrade performance.
‹ Inadequate for Capturing Complex Relationships: Although decision trees handle non-linear relationships, they can struggle with more intricate patterns; even deep trees may fail to capture complicated structure in high-dimensional and noisy datasets.
‹ High Variance: Decision trees are known for high variance; small changes in the training data can lead to large changes in their predictions, which is another reason ensembles are often preferred.


‹ Difficulty with Imbalanced Data: Datasets with significant class imbalance challenge decision trees. Results may be biased toward the majority class unless specific measures, such as class weighting or resampling, are taken to cater to the minority class.
‹ Limited Extrapolation: Decision trees cannot extrapolate beyond the range of values observed in the training data. Predictions for test data that fall outside this range can be unreliable.

IN-TEXT QUESTIONS
7. What is the primary role of decision trees in machine learning
and data science?
(a) To perform clustering analysis
(b) To model the relationship between two continuous variables
(c) To create a visual representation of data
(d) To make decisions by mapping data features to outcomes
8. Which criteria are used for splitting in decision trees?
(a) Mean
(b) Variance
(c) Gini impurity
(d) Chi-square test

4.7 Summary
This unit comprehensively explores linear regression, logistic regression, model comparison, and decision trees. Here are some of the key points covered in this unit:
‹ The first section introduces Linear Regression, a powerful tool
for modelling the relationship between dependent and independent
variables. It is commonly used for predicting continuous numerical
outcomes. It also covers simple and multiple linear regression,
assumptions, and interpretation of coefficients.
‹ The second section covers Logistic Regression, which is used for classification tasks. This technique is ideal for binary or multi-class classification problems, where we model the probability of an event occurring.
‹ The third section discusses different techniques to assess and compare the performance of models, including cross-validation and evaluation metrics for model comparison.
‹ In the last section, the concept of decision trees is discussed. They
can be used both for regression and classification tasks. The process
of creating a decision tree is also discussed in this section.

4.8 Answers to In-Text Questions

1. (b) To model the relationship between a dependent variable and


one or more independent variables
2. (c) Simple linear regression
3. (b) To predict a categorical outcome or probability of an event
occurring
4. (c) To estimate the probability of a binary outcome
5. (c) Fitting the model to different subsets
6. (b) To determine the best-fitting model for a specific dataset
7. (d) To make decisions by mapping data features to outcomes
8. (c) Gini impurity

4.9 Self-Assessment Questions


1. Explain the difference between simple linear regression and multiple
linear regression.
2. Describe the logistic function and its role in logistic regression.
3. Why is model comparison critical in predictive modelling?
4. Name and briefly explain two standard techniques for model comparison.
5. What is the primary advantage of decision trees in predictive
modelling?


4.10 References

‹ Bruce, P., Bruce, A., & Gedeck, P. (2020). Practical statistics for
data scientists: 50+ essential concepts using R and Python. O’Reilly
Media.
‹ Mitchell, T. M. (1997). Machine Learning. McGraw Hill.

4.11 Suggested Reading


‹ Provost, F., & Fawcett, T. (2013). Data Science for Business: What
you need to know about data mining and data-analytic thinking.
“O’Reilly Media, Inc.”.

L E S S O N

5
Unveiling Data Patterns:
Clustering and Association
Rules
Hemraj Kumawat
Research Scholar
School of Computer & Systems Sciences
JNU, New Delhi - 110067
Email-Id: hkumawat077@gmail.com

STRUCTURE
5.1 Learning Objectives
5.2 Introduction
5.3 Understanding Clustering
5.4 The Mechanics of Clustering Algorithms
5.5 Real-World Applications of Clustering
5.6 Evaluating Cluster Quality
5.7 Unravelling Connections: The World of Association Rules
5.8 Real-World Applications of Association Rules
5.9 Summary
5.10 Answers to In-Text Questions
5.11 Self-Assessment Questions
5.12 References
5.13 Suggested Readings

5.1 Learning Objectives


‹ To explore the role of clustering as a data analysis technique for uncovering natural
groupings and patterns within data.

‹ To gain insight into association rules and their application in
identifying connections and relationships within datasets.
‹ To learn about the practical applications of clustering and association rules across various domains, from retail to healthcare.

5.2 Introduction
It is exciting, even amazing that we live in an era when we have more
data than ever. In this digital era, data is like oxygen for business. Every
click, transaction, and interaction on social media or e-commerce sites
generates an overwhelming amount of information that helps understand
customer behaviour, optimise business operations, and drive strategic
decisions. Every single organisation, be it a small start-up or a big tech
giant, is in a race to leverage the benefits of the vast amount of available
data. However, amid this vast sea of data lies a challenge: How do we
make sense of such vast data? Is there any way to gain meaningful insight
and extract hidden patterns from the data, which can guide us to make
better marketing decisions for our business?
Well, that is where data analysis comes into the picture. It is the key
for these organisations and businesses to survive and thrive in this data-
rich landscape. In this chapter, we plan to learn about two fundamental
pillars of data analysis: Clustering and Association rules. These popular
data analysis techniques help us to make sense of the enormous expanse
of data and uncover hidden but crucial information from the data.
The power of clustering: Amidst this chaos of data, imagine being able
to group similar kinds of data. Wouldn’t it make our life easy? Well,
certainly it would, and clustering helps us in doing so. In the business
sense, one can use clustering to categorise customers automatically based
on their preferences and cluster products based on their features. Clustering
empowers us to make data-driven decisions, whether identifying customer
groups for targeted marketing or finding anomalies in network traffic.
On the other side of the coin, association rules mining or association
analysis helps us uncover hidden relationships and connections within
our data. You might have noticed how e-commerce sites recommend the
products as if they know what you are planning to buy. Well, association
rules are the secrets to this magic. From a business perspective, by uncovering the subtle associations between different items, products or events, we are able to enhance recommendations, optimise supply chains,
and, more importantly, unlock valuable market insights.
In this era of data proliferation, people have learned that data analysis is
not just a tool but a necessity. It is the engine, with data being its fuel,
that drives profitability and, more importantly, ensures that products and services are not just delivered but are also beneficial for customers.
During our journey to understand the clustering and association rules in
this chapter, we will delve deep into the intricacies of these concepts and
their applications. We will explore different clustering algorithms in detail
and try to understand the mathematics behind association rules; alongside,
we will relate these to real-world scenarios to better understand them.
From marketing strategies that rely on boosting product sales through
intelligent product recommendations, we will discover the practical power
of these data analysis tools.
So, whether you are a student looking to enhance your analytics skills, a
business professional seeking to harness the potential of data, or simply
a curious reader amazed by the magic of data analysis, let us embark on
this exciting journey to understand the world of clustering and association
rules. Let us discover some data patterns.

5.3 Understanding Clustering


Clustering, sometimes called unsupervised segmentation, is the process
of finding some natural grouping in the data. This grouping is based on
specific characteristics, creating natural division within our data. For
example, imagine you have scattered puzzle pieces and you want to arrange them into some coherent pattern.
Clustering is an unsupervised data analysis technique that does not require
labelled data. It is about finding structure or patterns within data when
we do not have pre-defined categories.
However, wait, why is clustering so significant in data analysis?
Clustering has a multitude of uses in a variety of industries. A few of
the applications include the following—


1. Pattern Discovery: You might love how detectives find hidden clues and information while solving a case. Well, clustering is like a detective of the world of data, searching for hidden patterns. It groups data based on associations and structures that might not be apparent initially. It helps market researchers segment customers with similar purchasing behaviour, and it can aid biologists in grouping similar species in diverse ecosystems.
2. Decision Support: It is often daunting to make sense of complex data.
When you have a myriad of data points, it can be overwhelming.
Clustering simplifies strategic decision-making by creating categories.
For instance, retailers can use clustering to optimise inventory
management and tailor marketing campaigns.
3. Data reduction: It is the era of big data, and managing such a vast
amount of data has become increasingly challenging. Clustering helps
organise vast amounts of data in manageable clusters, simplifying
decision-making. This is a boon in fields such as network security,
where anomaly detection for large datasets is a critical task.
4. Anomaly detection: We have discussed how clustering helps group
similar objects. Apart from that, clustering also plays a crucial role
in detecting outliers or identifying anomalies. A typical example is
fraud detection, where clustering can point out unusual patterns of
transactions that might be indicators of fraudulent activity.
Different flavours of clustering
There are various clustering approaches. You can refer to [1] for a
complete list of approaches. Each approach is suitable for a particular
type of data distribution:—
1. Centroid-based clustering: This type of clustering partitions data
based on proximity to centroids. It organises data into non-hierarchical clusters, in contrast to hierarchical clustering, which we will discuss shortly. One of the popular algorithms in this category
is k-means. Centroid-based algorithms are efficient; however, they
are susceptible to initial conditions and outliers. Figure 5.1 shows
an example of centroid-based clustering.


Figure 5.1: Centroid-based clustering


2. Density-based clustering: The core idea behind this approach is that it groups data points that are close to each other in terms of density. It proves helpful in cases where clusters exhibit diverse shapes and sizes, since it allows for arbitrarily shaped distributions, provided that the dense regions can be linked. However, these algorithms struggle with high-dimensional data and variable density. The DBSCAN algorithm is a widely used clustering algorithm in this category. We will discuss it in detail in upcoming sections.

Figure 5.2: Density-based clustering example


3. Distribution-based clustering: This approach operates on the premise
that data follow an underlying distribution, such as the Gaussian
distribution. Data points are clustered based on their proximity to
the centres of these distributions. As the distance from the centre
of the distribution increases, the likelihood of a point belonging
to that distribution decreases. In Figure 5.3, you can observe three
clusters based on Gaussian distributions. As the shades lighten, it
signifies a decrease in the probability of a point belonging to that
particular cluster.


Figure 5.3: Distribution-based clustering example


4. Hierarchical clustering: This clustering approach is well suited for data with some taxonomical hierarchy. A common variant is the bottom-up (agglomerative) approach, which starts with each data point as its own cluster and successively merges clusters until a single cluster remains. This approach creates a hierarchical structure of clusters that can be represented by a dendrogram, a tree-like diagram. Hierarchical methods are generally used for exploratory visualisation purposes.

Figure 5.4: Hierarchical clustering example


Moving further in this chapter, we will look at these clustering approaches
in detail. We will know how they work and when you should use them,
and we will also look at various applications in the business world.
So, as we delve deeper, remember not just to look at the numbers and algorithms; instead, try to understand that clustering is more than that.
It is about revealing order in the chaos, identifying some structure that
seems absent at first glance, and, more importantly, gaining insights that
eventually result in better decision-making in business and beyond.

5.4 The Mechanics of Clustering Algorithms


The fundamental idea: At its core, clustering tries to group data points
into clusters such that data points within the same cluster have a higher
level of similarity than those in other clusters. However, how does
clustering achieve this? Let us discover it. Most partition-based clustering algorithms follow a basic framework, shown in the following steps:
1. Initialisation: It involves randomly selecting data points as initial
cluster centroids, or you may begin with each data point as its
cluster.
2. Assignment: In this step, each data point is allocated to one of the
clusters that are closest to it based on some distance measure.
3. Update: Following the allocation of data points, the centroids are
recalculated. Typically, this is achieved by determining the mean or
average of all the data points within the cluster. These new centroids
provide a more accurate representation of the cluster’s members.
4. Iteration: Steps 2 and 3 are repeated iteratively until stopping criteria
are met, and final clusters are obtained. Standard stopping criteria
are the maximum number of iterations, no further change in cluster
members, or reaching a pre-defined quality of clusters.
K-means algorithm [2][3]: The algorithm initiates by choosing the
desired number of clusters and then partitioning the provided data points
or objects accordingly. It proceeds to randomly designate ‘k’ centroids
as the initial representatives for these clusters. Once the centroids are
defined, each data point is assigned to the cluster with the closest centroid,
as determined by a distance metric.
Once all data points have been assigned to clusters, the algorithm
computes the mean value for each cluster, and these mean values are
then established as the new centroids. Using these new centroids, all
members are reassigned to clusters. Some members may switch clusters during this process unless the initial centroid estimates were exceptionally
accurate. This process continues until there are no further changes in
cluster membership. It is important to note that different initial centroid
selections can result in different cluster outcomes.
The working of k-means algorithms can be summarised in the following
steps:
1. Decide the value of k, that is, the number of clusters.
2. Randomly initialise k centroids for clusters (It can be selected from
data points randomly).
3. Allocate each data point to its closest centroid based on Euclidean distance.
4. Compute the mean value of each cluster; these means become the new centroids for the clusters.
5. Repeat step 3, i.e., reallocate data points to the new closest centroids.
6. If any change in cluster membership occurred, go to step 4; otherwise, terminate the process.
Note that the k-means algorithm uses Euclidean distance to compute the distance of data points from cluster centroids. The Euclidean distance between two points P(x1, y1) and Q(x2, y2) is computed using equation (1):

d(P, Q) = √((x2 – x1)² + (y2 – y1)²)    (1)


In essence, this algorithm aims to create clusters with a high level of
similarity among their members while maintaining a low degree of
similarity between different clusters. The K-means algorithm is a scalable
and efficient clustering method. However, it should be noted that it may
converge to a local minimum rather than identifying the global minimum.
Now, let us understand the algorithm clearly with the help of an example-
Player Age Score 1 Score 2 Score 3
P1 18 73 75 57
P2 18 79 85 75
P3 23 70 70 52
P4 20 55 55 55
P5 22 85 86 87
P6 19 91 90 89
P7 20 70 65 65
P8 21 53 56 59
P9 19 82 82 60
P10 40 76 60 78
Let us choose the value of k to be 3 and the first three records to be the
centroids for the clusters.
Player Age Score 1 Score 2 Score 3
P1 18 73 75 57
P2 18 79 85 75
P3 23 70 70 52
The distance between the data points or objects on the four attributes is
computed using the Euclidean distance formula.
Player Age Score 1 Score 2 Score 3 Euclidean Distance Cluster assigned
P1 18 73 75 57 Centroid 1 C1
P2 18 79 85 75 Centroid 2 C2
P3 23 70 70 52 Centroid 3 C3
P4 20 55 55 55 27/43/22 C3
P5 22 85 86 87 34/14/41 C2
P6 19 91 90 89 40/19/47 C2
P7 20 70 65 65 13/24/14 C1
P8 21 53 56 59 28/42/23 C3
P9 19 82 82 60 12/16/19 C1
P10 40 76 60 78 33/34/32 C3
Now, the clusters contain the following elements: C1 = {P1, P7, P9}; C2 = {P2, P5, P6}; C3 = {P3, P4, P8, P10}.
The next step is to compute the means for the new clusters-
Cluster Age Score 1 Score 2 Score 3
Centroid 1 18 73 75 57
Centroid 2 18 79 85 75
Centroid 3 23 70 70 52
C1 19.0 75.0 74.0 60.7
C2 19.7 85.0 87.0 83.7
C3 26.0 63.5 60.3 60.8


We then initiate the process with new centroids, that is, the means,
and calculate the Euclidean distance from each data point to these new
centroids-
Player Age Score 1 Score 2 Score 3 Euclidean Distance Cluster assigned
C1 19.0 75.0 74.0 60.7
C2 19.7 85.0 87.0 83.7
C3 26.0 63.5 60.3 60.8
P1 18 73 75 57 4.44/31.68/19.62 C1
P2 18 79 85 75 12.34/10.89/33.40 C2
P3 23 70 70 52 11.52/39.11/14.93 C1
P4 20 55 55 55 28.19/52.42/13.04 C3
P5 22 85 86 87 30.73/4.14/42.72 C2
P6 19 91 90 89 36.23/8.58/49.83 C2
P7 20 70 65 65 11.20/32.53/10.86 C3
P8 21 53 56 59 28.55/50.96/12.53 C3
P9 19 82 82 60 10.65/24.42/29.37 C1
P10 40 76 60 78 30.62/35.42/25.45 C3
Now, the current state of 3 clusters is as follows- C1 = {P1, P3, P9};
C2 = {P2, P5, P6};
C3 = {P4, P7, P8, P10}. You may notice the change in the members of
the three clusters. We keep following the same procedure until no further
changes occur in cluster membership. The rest of the steps have been
left for the students to try by themselves.
The k-means method may use other distance measures, such as Manhattan
distance; however, employing various distance measures on the same
dataset can yield different outcomes, making it challenging to determine
the most optimal result.
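For readers who would like to reproduce this computation programmatically, here is a brief, illustrative sketch using scikit-learn (assumed to be installed). The data mirrors the player table above; the cluster labels may differ slightly from the hand computation because scikit-learn chooses its own initial centroids.

import numpy as np
from sklearn.cluster import KMeans

# Columns: Age, Score 1, Score 2, Score 3 for players P1..P10 (from the table above)
X = np.array([
    [18, 73, 75, 57], [18, 79, 85, 75], [23, 70, 70, 52], [20, 55, 55, 55],
    [22, 85, 86, 87], [19, 91, 90, 89], [20, 70, 65, 65], [21, 53, 56, 59],
    [19, 82, 82, 60], [40, 76, 60, 78],
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster label for each player:", labels)
print("Final centroids:\n", kmeans.cluster_centers_)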
A significant challenge in k-means clustering is deciding the optimal
value of k because the algorithm’s performance is highly dependent on an
optimal number of clusters. However, how do we decide on an optimal
number of clusters? There are many ways to decide the value of k, but
here we will discuss the most suitable method, the Elbow method.


Elbow Method:


This method is widely adopted for calculating the optimal value of ‘k.’
It leverages the concept of WCSS, which stands for Within Clusters Sum
of Squares, representing the dispersion within a cluster. The formula to
calculate WCSS is given by equation 2-
WCSS = Σ(Pi in Cluster 1) distance(Pi, C1)² + Σ(Pi in Cluster 2) distance(Pi, C2)² + ...    (2)
where Σ(Pi in Cluster 1) distance(Pi, C1)² is the sum of squared distances between each data point in cluster 1 and its centroid C1. The same procedure is applied to each cluster. Once again, the distance between the centroid and data points is computed using either Euclidean or Manhattan distance.
Now, to find the optimal value of k, we follow the given steps-
1. Apply the k-means algorithm for a given dataset with different
values of k (say, in the range of 1-10).
2. For each k, compute WCSS using equation 2.
3. Plot a curve of WCSS against different values of k.
4. The obtained plot looks like an arm. The value of k at the sharp bend of the plot, or the elbow of the arm (hence the name elbow method), is the optimal value of k for the k-means algorithm.
For example, in Figure 5.5, the optimal value of k is 3.

Figure 5.5: Elbow curve to find the optimal value of k, i.e. the number of clusters.
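In practice, the elbow curve can be produced with a few lines of code. The sketch below is illustrative only and assumes scikit-learn and matplotlib are available; X can be any numeric data matrix, such as the player data used earlier.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_elbow(X, k_range=range(1, 11)):
    wcss = []
    for k in k_range:
        model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        wcss.append(model.inertia_)  # inertia_ is the WCSS of equation (2)
    plt.plot(list(k_range), wcss, marker="o")
    plt.xlabel("Number of clusters k")
    plt.ylabel("WCSS")
    plt.title("Elbow curve")
    plt.show()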


DBSCAN
The Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
algorithm forms clusters based on the density of data points in the space.
Here, clusters are defined by areas of high-density data points separated
by lower-density areas. It is advantageous when dealing with data of
varying densities and variable shapes. This algorithm can automatically
find several clusters and can identify outliers as noise.
The fundamental concept behind the DBSCAN algorithm is that for every
data point within a cluster, its neighbourhood, defined by a given radius,
should encompass a minimum number of data points.
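As a brief illustration (not part of the original text), scikit-learn's DBSCAN implementation exposes exactly these two ideas as parameters: eps, the neighbourhood radius, and min_samples, the minimum number of points required in that neighbourhood. The toy data below is invented.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered points (invented toy data)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.3, size=(50, 2)),
    rng.uniform(low=-2, high=7, size=(5, 2)),
])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print("Labels found:", set(db.labels_))  # the label -1 marks points treated as noise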
Hierarchical Clustering Algorithm
Hierarchical clustering can have two approaches. An approach involves
gradually merging various data points into one cluster, a technique known
as the agglomerative approach. Another approach, which is called the
divisive approach, involves dividing large clusters into small clusters.
Here, we will consider the Agglomerative approach. This method begins
with each cluster initially containing a single data point. Then, using a
distance measure, the algorithm merges the two nearest clusters into a
single cluster. As the algorithm proceeds, the number of clusters gets
reduced. The process terminates once all the clusters are merged into a
single cluster.
Various methods can be used to compute the distance between clusters.
A few methods are discussed here-
Single link method (nearest neighbour): It determines the distance
between two clusters by considering the distance between their closest
points, where one point is chosen from each cluster.
Complete linkage method (furthest neighbour): The distance between
two clusters is determined by considering the distance between their most
distant points, with one point chosen from each cluster.
Centroid method: The distance is calculated by finding the difference
between the centroids of two clusters.
Unweighted pair-group average method: It calculates the distance between two clusters by finding the average distance among all pairs of objects taken from these clusters. This entails computing a total of p * n distances, where 'p' and 'n' denote the numbers of objects in the two clusters.


The algorithm works as follows–


1. Each data point is allocated to its exclusive cluster.
2. Generate a distance matrix by calculating the distances between each
pair of data points.
3. Identify the two closest clusters with minimum distance.
4. Combine the closest clusters and eliminate them from the list.
5. Update the distance matrix after recalculating the distance of each
data point from the newly formed cluster.
6. Repeat till all data points form a single cluster.
Students are encouraged to apply this algorithm to the example used in the
k-means algorithm and draw a dendrogram to see how this algorithm works.
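As a starting point for that exercise, here is an illustrative sketch using SciPy (assumed to be installed) that builds and plots a dendrogram for the same player data; the linkage method can be swapped among the distance measures described above.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Player data (Age, Score 1, Score 2, Score 3) from the k-means example
X = np.array([
    [18, 73, 75, 57], [18, 79, 85, 75], [23, 70, 70, 52], [20, 55, 55, 55],
    [22, 85, 86, 87], [19, 91, 90, 89], [20, 70, 65, 65], [21, 53, 56, 59],
    [19, 82, 82, 60], [40, 76, 60, 78],
])

# method options: "single" (nearest neighbour), "complete" (furthest neighbour),
# "average" (unweighted pair-group average), "centroid" (centroid method)
Z = linkage(X, method="single", metric="euclidean")

dendrogram(Z, labels=[f"P{i}" for i in range(1, 11)])
plt.title("Dendrogram: agglomerative clustering of the player data")
plt.show()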
Choosing the appropriate algorithm
Now, you have multiple clustering algorithms to choose from. But how do
you know which algorithm is best suited to your data? Well, it depends
upon the nature of your data and your data analysis goals. K-means works
well when you have well-separated and spherical-shaped clusters. At the
same time, DBSCAN is more robust to irregularly shaped clusters and
varying densities in data and can handle noise in data.
In a later section, we will explore how to evaluate the quality of clusters and make informed decisions about which algorithm to employ.

IN-TEXT QUESTIONS
1. What is the primary objective of clustering in data analysis?
(a) To perform binary classification
(b) To group similar data points together
(c) To predict future data patterns
(d) To perform statistical hypothesis testing
2. _______ is employed to determine the optimal value of ‘k’ in
the k-means clustering algorithm.
(a) Arm method
(b) Threshold method
(c) Elbow method
(d) None of the above


5.5 Real-World Applications of Clustering

In the previous section, we discussed clustering, its definition, and its


importance in data analysis. Now, it is time to see its significance in
the real world and its impact on industries. Following are some of the
applications of clustering in real-world scenarios:—
1. Customer Segmentation: Imagine a big retail company with millions of
customers worldwide. Each customer has unique preferences, tastes,
and buying habits. How does this retail company tailor marketing
efforts effectively? To do so, it has to categorise customers to
understand their behaviour and purchasing habits.
Clustering plays an important role here: by applying clustering algorithms to huge amounts of customer data, businesses can discover distinct customer segments. Customer segments contain individuals
with similar buying behaviour, preferences, and demographics. For
retail companies, this could mean identifying a group of budget-
conscious shoppers, luxury pundits, or tech enthusiasts. Each customer
segment presents an opportunity to develop personalised marketing
strategies which enhance customer engagement and loyalty.
2. Healthcare and patient stratification: Every patient is unique in
medicine. They have different health conditions, different responses
to treatment, and distinct risk factors. Patients with similar health
profiles must be grouped so doctors can prescribe personalised
treatment plans. For example, a specific medication may be more
effective for one group of patients than another group. Clustering
helps healthcare professionals cluster patients with similar profiles
together for the stated purpose. Clustering is also helpful in early
detection of disease, i.e. identifying patients with higher risk of
developing certain conditions.
3. Image and document classification: Clustering is vital in image and
document analysis. Finding a specific photo can be daunting when
we have a vast collection of unorganised digital photos without
proper organisation. Clustering can help organise photos by grouping
similar images based on visual features, making retrieval easy.


In the case of textual data, clustering can be employed for topic
modelling and document classification, which further aid in efficient
content management and retrieval.
4. Anomaly detection: Anomaly detection plays an integral part in
various domains. For instance, in cybersecurity, unusual patterns
in network traffic may indicate a cyberattack. Another example
could be manufacturing, where product quality anomalies may
indicate faults in the production line. We can use clustering to
create clusters of normal behaviour. Anything that falls outside of
clusters is considered an anomaly and flagged immediately, leading
to further investigation.
In this section, we saw that clustering’s applications range from
customer-centric marketing strategies to advancement in healthcare
and beyond, and it proves itself to be a versatile tool that transforms
large amounts of data into actionable knowledge.
In the following section, we will learn how to evaluate the quality of the clusters produced by these algorithms.

5.6 Evaluating Cluster Quality


Clustering, a powerful data analysis technique, is not free from challenges.
One of the crucial challenges in clustering lies in evaluating the quality
of clusters that clustering algorithms produce. A cluster might appear
coherent visually, but will it be genuinely helpful in decision-making?
Evaluation assists in finding the answer to this question.
Clustering evaluation metrics can be broadly categorised into two groups:
Internal evaluation metrics and External evaluation metrics [5]. Let us
understand both of them one by one—
Internal evaluation metrics: These evaluation metrics do not use external
information to assess the quality of clusters. These metrics provide
insight into how well the data points within the clusters are related and
how dissimilar the clusters are from each other. Commonly used internal
evaluation metrics include:—
1. Inertia (also known as Within-Cluster Sum of Squares): This
measures how distant the data points within a cluster are from the cluster's centroid. The lower the inertia, the tighter or more compact
the clusters are.
2. Silhouette Score: It quantifies cohesiveness in the cluster, i.e. how
similar an object is to its cluster compared to other clusters, which
indicates separation. A higher value of silhouette score indicates
better-defined clusters.
External evaluation metrics: These evaluation metrics rely on external
information, such as ground truth labels of expert knowledge, in order to
assess the quality of clusters. This is particularly useful when you have
access to labelled data. Common metrics in this category include:—
1. Adjusted Rand Index (ARI): This metric assesses the similarity between the true labels and the cluster assignments while correcting for chance. It produces a score ranging from -1 to 1, where values around 0 indicate chance-level agreement and 1 denotes perfect agreement.
2. Normalised Mutual Information (NMI): This metric quantifies the
mutual information between cluster assignments and actual labels
while normalising the result to a value from 0 to 1. A score of 0
indicates no similarity, while a score of 1 signifies perfect similarity.
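The internal and external metrics described above are all available in scikit-learn. Below is a minimal, illustrative sketch (library assumed installed) that computes them on an invented toy dataset whose ground-truth labels are known.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, adjusted_rand_score,
                             normalized_mutual_info_score)

# Toy data with known ground-truth labels (for illustration only)
X, true_labels = make_blobs(n_samples=300, centers=3, random_state=7)

model = KMeans(n_clusters=3, n_init=10, random_state=7)
pred_labels = model.fit_predict(X)

print("Inertia (WCSS):", model.inertia_)                                      # internal
print("Silhouette score:", silhouette_score(X, pred_labels))                  # internal
print("Adjusted Rand Index:", adjusted_rand_score(true_labels, pred_labels))  # external
print("Normalised Mutual Information:",
      normalized_mutual_info_score(true_labels, pred_labels))                 # external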
The visual aspect of evaluation: Evaluation metrics discussed so far
provide a quantitative aspect of cluster quality; however, visual inspection
has a crucial role in cluster quality assessment. Visualisation techniques
such as scatter plots, dendrogram, and t-SNE (t-Distributed Stochastic
Neighbour Embedding) help understand aspects metrics might miss. It
may uncover complex cluster structures and anomalies that cannot be
captured only through metrics.
Note that choosing appropriate evaluation metrics depends on the problem
you are tackling. Some metrics are more suitable for specific data types
and cluster shapes. Therefore, it is essential to consider the nature of
the data and the task at hand in choosing evaluation metrics. Clustering
quality evaluation is an important step in the data analysis journey. It
ensures that clusters created are not just mathematical constructs but
provide meaningful insights that help the decision-making process.

PAGE 109
© Department of Distance & Continuing Education, Campus of Open Learning,
School of Open Learning, University of Delhi
MBAFT 7708 PREDICTIVE ANALYTICS AND BIG DATA

5.7 Unravelling Connections: The World of Association Rules
We have explored the world of clustering in previous sections, and now let
us shift our focus to another data analysis technique: Association Rules.
These rules are full of magic, which can uncover hidden connections and
relationships within the data.
As we know, a substantial volume of data is currently electronically
stored in retail establishments, primarily because of the barcoding system
used to track sold goods. If you have such a mountain of data, why not
leverage that to find helpful information? We can extract compelling
association rules from these extensive databases to achieve this. The
field of association rule mining was pioneered by Rakesh Agrawal at
IBM in 1993.
Association rules reveal which items frequently appear together in
transactions or events. This data analysis technique can be applied to
various applications such as market basket analysis, Recommendation
systems, healthcare, finance and many more.
Consider an example of ten transactions in a retail shop selling only nine
electronic items.
Transaction ID Items
1 Laptop, Smart Watch, Smartphone
2 Laptop, Smart Watch, Keyboard
3 Laptop, Mouse
4 Smart Watch, Keyboard, Mouse, Hard Disk
5 RAM, Ethernet Cable, Hard Disk, Biscuits, Smartphone
6 RAM, Ethernet Cable, Hard Disk, Biscuits, Mouse, Key-
board, Smartphone
7 Laptop, Smart Watch
8 Laptop, Smart Watch, Keyboard, Hard Disk
9 Laptop, Mouse
10 RAM, Ethernet Cable, Hard Disk, Laptop, Mouse, Key-
board, Smartphone
Our objective is to identify sets of items that exhibit a frequent co-
occurrence pattern. These association rules are typically expressed as
A→B, signifying that item B tends to be present whenever item A is


present. It is important to clarify that A and B can represent individual items or sets of items. However, the same item cannot appear in both sets.
Let us consider a scenario where items A and B appear together in just
2% of the transactions. However, when item A is present, there is a 70%
likelihood that item B accompanies it. In this context, the 2% occurrence
of A and B together is referred to as the support of the rule, and the 70%
represents the confidence or predictability of the rule. These support and
confidence metrics help assess the rule’s significance.
Confidence reflects the strength of the association between items A and
B, while support quantifies how often this pattern occurs. An association’s
support must exceed the minimum support threshold to be of business
value. Suppose we know P(A ∪ B). In that case, we can calculate the probability of B appearing when we already know that A exists, denoted as P(B|A). The support for the rule A→B represents the probability of A and B appearing together, i.e., P(A ∪ B). Confidence, however, represents
the conditional probability of B occurring when A is already present and
is written as P(B|A).
Our goal is to identify all associations that exhibit a support of at least
p% and a confidence of at least q%, with the condition that:—
‹ Rules that meet user-defined constraints are identified.
‹ Rules are efficiently extracted from extensive databases.
‹ Ensuring discovered rules are actionable.
The example above contains 9 × 8 = 72 ordered pairs, and half of them are duplicates, so there are 36 unique pairs. However, 36 are too many to analyse by hand. Let us consider an even simpler example with four items (n=4)
and four transactions (N=4).
Transaction ID Items
10 Laptop, Smart Watch
20 Laptop, Smart Watch, Keyboard
30 Laptop, Mouse
40 Smart Watch, Keyboard, Mouse
We aim to identify rules with a support threshold of at least 50% and a
confidence level of at least 75%.


When N = 4, the outcomes are as follows: all items meet the support criterion of 50%, only two pairs satisfy the minimum support criterion, and no 3-item set meets the minimum support threshold.
Itemsets Frequency
Laptop 3
Smart Watch 3
Keyboard 2
Mouse 2
(Laptop, Smart Watch) 2
(Laptop, Keyboard) 1
(Laptop, Mouse) 1
(Smart Watch, Keyboard) 2
(Smart Watch, Mouse) 1
(Keyboard, Mouse) 1
(Laptop, Smart Watch, Keyboard) 1
(Laptop, Smart Watch, Mouse) 0
(Laptop, Keyboard, Mouse) 0
(Smart Watch, Keyboard, Mouse) 1
(Laptop, Smart Watch, Keyboard, Mouse) 0
We will examine the pairs that meet the minimum support value to assess
whether they also meet the minimum confidence level. Items or item sets
meeting the minimum support threshold are referred to as ‘frequent.’ All
individual items and two pairs are deemed frequent in the above example.
Next, we evaluate whether the pairs {Laptop, Smart Watch} and {Smart
Watch, Keyboard} satisfy the 75% confidence threshold for association
rules. For each pair {X, Y}, we can derive two rules, namely X→Y and Y→X, provided both rules meet the minimum confidence requirement. The confidence of X→Y is determined by dividing the support of X and Y occurring together by the support of X.
We have four potential rules, and their respective confidence levels are
as follows:
‹ Laptop → Smart Watch, having confidence of 2/3 ≈ 67%
‹ Smart Watch → Laptop, having confidence of 2/3 ≈ 67%
‹ Smart Watch → Keyboard, having confidence of 2/3 ≈ 67%
‹ Keyboard → Smart Watch, having confidence of 2/2 = 100%


Hence, only the final rule 'Keyboard → Smart Watch' possesses a confidence level exceeding the minimum 75% threshold and is eligible. Rules surpassing the user-defined minimum confidence are called 'confident'.
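The whole calculation above can be checked with a few lines of code. The following sketch is purely illustrative (the helper functions are invented for this example) and reuses the four-transaction dataset; it prints only the rules meeting 50% support and 75% confidence.

from itertools import combinations

transactions = [
    {"Laptop", "Smart Watch"},
    {"Laptop", "Smart Watch", "Keyboard"},
    {"Laptop", "Mouse"},
    {"Smart Watch", "Keyboard", "Mouse"},
]
N = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item in `itemset`
    return sum(itemset <= t for t in transactions) / N

def confidence(antecedent, consequent):
    # Confidence of the rule antecedent -> consequent
    return support(antecedent | consequent) / support(antecedent)

items = set().union(*transactions)
for a, b in combinations(sorted(items), 2):
    if support({a, b}) >= 0.5:                      # minimum support 50%
        for X, Y in [({a}, {b}), ({b}, {a})]:
            if confidence(X, Y) >= 0.75:            # minimum confidence 75%
                print(X, "->", Y, f"support={support({a, b}):.0%}",
                      f"confidence={confidence(X, Y):.0%}")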
This straightforward algorithm employs a brute-force approach and performs
effectively when dealing with four items, as it requires the examination
of only 16 combinations. However, when confronted with many items,
such as 100, the number of combinations becomes astronomically large. For instance, with 20 items, the number of combinations amounts to approximately one million, as the total number of combinations is given by 2^n for 'n' items.
algorithm’s efficiency when handling larger datasets, it still struggles to
cope with many items and transactions. To address this limitation, we
introduce a more advanced algorithm known as the Apriori algorithm.
The Apriori algorithm
The popular algorithm for association rule mining is the Apriori algorithm.
It works on the principle of “Apriori” or “prior knowledge.” It operates
under the assumption that if an item set is frequent, then all of its subsets
are also frequent. The algorithm reduces the search space and speeds up
the rule discovery process.
This algorithm works as follows:—
1. Find Frequent Itemsets: The algorithm begins by identifying frequent
itemsets—Sets of items that regularly appear together in the dataset.
2. Create Association Rules: Using these frequent itemsets, association
rules are generated. These rules have the form “If {A} then {B},”
indicating that if itemset A is present, itemset B is also likely to
be present.
3. Calculate Support and Confidence: Support and Confidence are two
important metrics used with association rules. Support measures
how frequently a rule is applicable, while confidence quantifies the
strength of the rule.
4. Pruning and Optimisation: The Apriori algorithm employs pruning
techniques to eliminate uninteresting or redundant rules, focusing
on those with the highest support and confidence. Pruning and
Optimisation techniques adhere to the anti-monotone property,
meaning any subset of a frequent itemset must also be frequent. It


is essential to remember that no superset of an infrequent itemset
should be created or evaluated.
For example, consider Figure 5.6.

Figure 5.6: Pruning example in Apriori algorithm.


The step-by-step algorithm may be described as follows:—
1. L1 computation: Go through all transactions and identify the items whose support exceeds the minimum threshold 'p%'. Label these frequent items as L1.
2. Apriori-gen function: Utilise the items from L1 to generate all possible item pairs within L1, forming the candidate set C2.
3. Pruning: Go through all transactions and detect the frequent item pairs within the candidate set C2. Label these frequent pairs as L2.
4. Candidate generation (generalisation of step 2): Build a candidate set Ck of k-item sets by merging all frequent itemsets from the Lk-1 set.
5. Pruning (generalisation of step 3): Scan all transactions and find the frequent itemsets within the candidate set Ck. Label these frequent itemsets as Lk.


6. Repeat from step 4 while Lk is not empty.
7. Terminate when Lk is empty.
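For completeness, here is a short, illustrative sketch that runs Apriori end-to-end using the third-party mlxtend library (assumed to be installed; its exact API may vary between versions). It reuses the four-transaction example and the same 50% support and 75% confidence thresholds.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["Laptop", "Smart Watch"],
    ["Laptop", "Smart Watch", "Keyboard"],
    ["Laptop", "Mouse"],
    ["Smart Watch", "Keyboard", "Mouse"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

frequent = apriori(onehot, min_support=0.5, use_colnames=True)   # L1, L2, ... in one table
rules = association_rules(frequent, metric="confidence", min_threshold=0.75)

print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence"]])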

IN-TEXT QUESTIONS
3. In the context of association rules, what is “market basket
analysis” commonly used for?
(a) Identifying customer demographics
(b) Predicting stock market trends
(c) Discovering patterns in customer purchasing behaviour
(d) Analysing website traffic data
4. In the context of association rule mining, what does the term
“support” refer to?
(a) The probability of an association rule being true
(b) The number of transactions containing all items in the
antecedent and consequent of a rule
(c) The lift of the association rule
(d) The confidence of the association rule

5.8 Real-World Applications of Association Rules


In our exploration of association rules and the Apriori algorithm, we
significantly understand how they work. It is time to witness these
concepts in action as we explore real-world applications across diverse
domains. Association rules are invaluable tools in solving practical
challenges with their ability to unveil hidden connections. Some of the
applications include—
Retail: Market Basket Analysis: Imagine you are pushing a cart down
the aisles at a supermarket. You pick up a bag of chips and some salsa. As
you proceed, you spot guacamole dip nearby. Without a second thought,
you add it to your cart. This sequence of choices might seem ordinary, but
in retail, it is gold. Market basket analysis examines purchasing patterns
to discover items that tend to be purchased together. Retailers use this


insight to optimise store layouts, suggest complementary products, and
plan promotions. For instance, if chips and salsa are frequently bought
together, the store might place them closer on the shelves or offer discounts
when purchased as a bundle.
Healthcare: Disease Symptom Analysis: In healthcare, association rules
play a pivotal role in understanding disease symptoms and their co-
occurrence. By analysing patient data, these rules can reveal patterns in
symptom presentation. Consider a scenario where patients with a specific
respiratory condition frequently report symptoms like cough, shortness of
breath, and wheezing. Association rules can uncover these patterns, aiding
in the early diagnosis of the condition. Additionally, they can assist in
identifying common co-occurring symptoms that may require attention.
E-commerce: Personalised Recommendations: The recommendation
engines that power e-commerce platforms owe much of their effectiveness
to association rules. These engines analyse user behaviour, identifying
products or content that tend to be consumed together. When you receive
personalised product recommendations on an online store, association
rules are at work. They analyse your browsing and purchase history,
suggesting items that align with your preferences. Whether it is books,
electronics, or movies, these recommendations are tailored to enhance
your shopping experience.
Manufacturing: Quality Control: In manufacturing, association rules
can be a game-changer for quality control. They can identify factors
contributing to defects or inefficiencies in the production process. Suppose
a manufacturer of electronic devices notices that specific components are
often found together in devices that later experience issues. Association
rules can highlight these component combinations, prompting closer
inspection. This proactive approach can lead to improved product quality
and reduced warranty claims.
Web Content Personalisation: Content providers on the web, from news
websites to streaming platforms, leverage association rules to personalise
user experiences. By analysing what content users engage with, these
platforms can recommend articles, videos, or music that align with
individual interests. For instance, a news website might use association
rules to suggest related articles based on your reading. Similarly, a music
streaming service can recommend songs similar to those you have listened
to, creating a more engaging experience.


Beyond Business: Scientific Discovery: Association rules find applications
beyond commerce. In scientific research, they are used to discover
relationships in data. For example, genomics researchers employ association
rules to identify connections between genes and diseases, contributing to
advancements in medical science.
In summary, association rules are versatile tools that transcend industry
boundaries. They empower organisations and researchers to uncover hidden
connections, enhance decision-making, and transform data into actionable
insights. Whether in retail, healthcare, e-commerce, manufacturing, or
scientific research, association rules are your allies in solving complex
challenges.

5.9 Summary
This lesson serves as an introductory gateway into the dynamic world
of data analysis, where we embark on a journey to unlock the hidden
patterns, relationships, and insights concealed within data. We delve
into the fundamental data analysis techniques, focusing on two pivotal
ones: clustering and association. Following are the key highlights of the
chapter:—
‹ Clustering: It is the art of grouping similar data points, revealing
natural patterns and structures. It finds applications in customer
segmentation, biological data organisation, and scientific classifications.
‹ Association Rules: Association rules mining uncovers connections and
relationships within data. It is vital for optimising retail strategies,
enhancing recommendation systems, and gaining insights into
healthcare and other domains.
‹ Clustering and association rules are foundational techniques for data
pattern discovery, providing the tools to extract valuable information
from complex datasets.
The lesson concludes by setting the stage for further exploration, promising
advanced algorithms, real-world case studies, and practical applications
in subsequent lessons.

5.10 Answers to In-Text Questions

1. (b) To group similar data points together


2. (c) Elbow method
3. (c) Discovering patterns in customer purchasing behaviour
4. (b) The number of transactions containing all items in the antecedent
and consequent of a rule

5.11 Self-Assessment Questions


1. Describe the main objective of clustering in data analysis and how
it differs from classification.
2. Discuss two clustering algorithms and briefly describe a scenario
where one algorithm is preferred.
3. How do you evaluate clusters created by the clustering algorithm?
Highlight the importance of visual inspection in cluster quality
assessment.
4. Imagine you have a dataset containing information about customers’
purchasing habits. How might clustering be applied to this dataset,
and what insights could it reveal for a business?
5. Define what support and confidence are in the context of association
rules. How do these metrics help assess the significance of an
association rule?
6. Suppose Li represents frequent items of length i, and Ci represents
the candidate set, a set of frequent items of length i. How do we
compute C3 from L2 or, more generally, Ci from Li-1?
7. How can you determine if an association rule is actionable or valuable
practically? What factors might influence this determination?

5.12 References
‹ Xu, Dongkuan, and Yingjie Tian. “A comprehensive survey of
clustering algorithms.” Annals of Data Science 2 (2015): 165-193.

‹ Lloyd, S. P. “Least Squares Quantisation in PCM. Technical Report
RR-5497, Bell Lab, September 1957.” (1957).
‹ MacQueen, James. “Some methods for classification and analysis
of multivariate observations.” Proceedings of the fifth Berkeley
Symposium on Mathematical Statistics and Probability. Vol. 1. No.
14. 1967.
‹ Müllner, Daniel. “Modern hierarchical, agglomerative clustering
algorithms.” arXiv preprint arXiv:1109.2378 (2011).
‹ Amigó, Enrique, et al. “A comparison of extrinsic clustering
evaluation metrics based on formal constraints.” Information retrieval
12 (2009): 461–486.
‹ Agrawal, Rakesh, Tomasz Imieliński, and Arun Swami. “Mining
association rules between sets of items in large databases.” Proceedings
of the 1993 ACM SIGMOD international conference on Management
of data. 1993.

5.13 Suggested Readings


‹ “Data Science for Business” by Foster Provost and Tom Fawcett.
‹ “Introduction to Data Mining” by Pang-Ning Tan, Michael Steinbach,
and Vipin Kumar.
‹ “Business Analytics: Methods, Models, and Decisions” by James R.
Evans.
‹ “Predictive Analytics: The Power to Predict Who Will Click, Buy,
Lie, or Die” by Eric Siegel.

L E S S O N

6
From Data to Strategy:
Classification and Market
Basket Analysis Driving
Action
Hemraj Kumawat
Research Scholar
School of Computer & Systems Sciences
JNU, New Delhi - 110067
Email-Id: hkumawat077@gmail.com

STRUCTURE
6.1 Learning Objectives
6.2 Introduction
6.3 Classification: A Predictive Modelling
6.4 Understanding Popular Classification Algorithms
6.5 Naive Bayes Classification: Probability and Independence
6.6 K-NN (K-Nearest Neighbour) Algorithm: The Power of Proximity
6.7 Evaluation Measures for Classification
6.8 Real-World Applications of Classification
6.9 Strategic Insights with Market Basket Analysis
6.10 Summary
6.11 Answers to In-Text Questions
6.12 Self-Assessment Questions
6.13 References
6.14 Suggested Readings

6.1 Learning Objectives

‹ To study predictive modelling and its significance in data-driven decision-making.
‹ To understand the role of classification in supervised learning and
discuss its applications.
‹ To learn different classification algorithms and their use cases.
‹ To understand the concept of market basket analysis and its role
in understanding customer behaviour.

6.2 Introduction
“It is a capital mistake to theorise before one has data. Insensibly, one
begins to twist facts to suit theories, instead of theories to suit facts”,
the famous fictional character Sherlock Holmes proclaims in the short story
‘A Scandal in Bohemia’ by Sir Arthur Conan Doyle.
This basic idea lies at the core of the data analysis. Data serves as
the bedrock of knowledge and fuels well-informed decision-making. In
the contemporary data-driven landscape, organisations grapple with an
abundance of information. The power to transform this data into actionable
insights is not just a competitive advantage but a strategic necessity. In
this lesson, we will journey through predictive modelling, classification,
and market basket analysis, tracing the path from raw data to informed
strategies that drive business success.
There is more to data than just numbers and records; it is a treasure
trove of opportunities and hidden patterns waiting to be uncovered. In
the quest for excellence and competitiveness, organisations have realised
that harnessing data’s power is paramount. However, it is not the data
itself but the intelligent application of it that makes the difference. This
lesson focuses on how data is transformed into actionable insights and
strategies that shape the future.
At the heart of this lesson is classification—a supervised learning technique
that categorises data into pre-defined groups or classes. Classification
serves as the bridge between raw data and informed decision-making.
It empowers organisations to not only understand their data but also to
predict future outcomes. By exploring various classification algorithms
and real-world applications, we will unveil the transformative potential
of this technique.
Additionally, we delve into market basket analysis, a powerful tool for
understanding consumer behaviour and optimising business operations.
Market basket analysis unveils hidden insights that can guide pricing
strategies, cross-selling, and upselling efforts by examining associations
among products or services. Its applications extend beyond retail, offering
strategic advantages across diverse industries.
This lesson will explore practical examples and case studies that showcase
the real-world impact of predictive modelling, classification, and market
basket analysis. We will also address challenges and considerations,
equipping you with the knowledge to navigate this data-driven landscape
effectively.
Let us embark on this journey, “From Data to Strategy”, where we
unleash the power of classification and market basket analysis to help
in an informed decision-making process that can shape the future of the
business world.

6.3 Classification: A Predictive Modelling


Descriptive models serve the purpose of summarising data conveniently
or informally, enhancing our comprehension of underlying patterns and
mechanisms. In contrast, predictive models are explicitly designed to enable
us to forecast the unknown value of a variable of interest when provided
with known values of other variables. For instance, this could involve
diagnosing a medical patient based on a set of test results, estimating
the likelihood that customers will purchase product A considering their
previous purchases of other products, or predicting the future value of
the Dow Jones index six months ahead, using current and historical index
values. The classification technique produces the predictive models.
In data-driven decision-making, the journey from raw data to actionable
insights begins with classification. Classification is the art of categorising
data into distinct groups or classes, making it a fundamental technique in
supervised learning. It presupposes the presence of pre-defined categories
represented by class labels.

Classification involves instructing machines to identify patterns and make
informed predictions by leveraging past observations. Classification is
central to our data-driven society, whether it involves organising emails into
spam, diagnosing medical conditions like distinguishing between malignant
and benign cells through MRI scan results or predicting customer churn.

Figure 6.1: Email spam detection


The classification algorithm acquires the ability to associate input data
with a particular output category, relying on labelled instances of input-
output pairs. Classification involves discovering a model or function that
characterises and discriminates between data categories or concepts, enabling
this model to predict the category of objects with unknown class labels.
Classification is a supervised learning technique, i.e. it requires labelled
data. It is trained on labelled datasets containing input and label pairs.
Each instance is denoted by a tuple (X, Y), where ‘X’ represents the
attribute set and ‘Y’ signifies the corresponding label set, also called
target labels or categories. Based on the number of labels, classification
can be divided into two types:—
1. Single label or Multinomial Classification: This classification
method involves a single label as the target variable and can be
further categorised into two types:
(a) Binary Classification: If the label has only two possible
categories or classes, it is called binary classification. For
instance, labels like ‘safe’ or ‘risky’ can be used in the context
of loan application data.
(b) Multi-class Classification: If the label has more than two
categories, it is known as multi-class classification.

2. Multi-label Classification: When an instance is classified into multiple
categories simultaneously, it is called multi-label classification. For
example, movie genre prediction is where a movie can simultaneously
have multiple genres, such as action, horror, and romance.
In this lesson, we will mainly be covering the single-label classification
approach. We will focus on binary classification primarily, but it can be
scaled to multi-class classification easily.
Definition: Classification involves training a target function ‘f’ that
associates each attribute set ‘x’ with one of the pre-established class
labels ‘y’ (see Figure 6.2).
This target function is known informally as a classification model or
classifier and serves various practical purposes.

Figure 6.2: A Classification model, mapping input x to output label y.


So far, we have discussed the importance of classification in data analysis
and the basics of classification. Now, let us dive into the process of
building a classification model and testing the model with unseen test data.
Data classification involves a two-step procedure consisting of a learning
phase, where a classification model is developed, and a classification
phase, where the model is applied to predict class labels for given data.
In the initial stage, a classifier is formulated to represent pre-defined data
classes or concepts. This phase is the learning or training step, during
which a classification algorithm constructs the classifier by analysing and
learning from a training dataset that includes database tuples and their
corresponding class labels.
A tuple, represented as X, is characterised by an n-dimensional attribute
vector denoted as X = (x1, x2, ..., xn). In this representation, each component
(xi) corresponds to a measurement extracted from the n database attributes,
labelled as A1, A2, ..., An. Every tuple, X, is affiliated with a pre-defined
class, as determined by another database attribute known as the class label
attribute. This class label attribute is discrete and unordered, consisting
of categorical or nominal values, where each value represents a distinct
category or class. The individual tuples comprising the training set are
termed training examples and are randomly selected from the analysed
database. To understand classification fully, it is essential to grasp the
training and testing data concept. When building a classification model,
we do not just feed it data and expect it to work miracles. Instead, we
work with two parts of our dataset: the training data and the testing data.
The training data serves as the foundation upon which our model learns.
It is akin to a teacher instructing students—showing them examples of
correct answers. The model learns from this data, discerning patterns and
relationships. However, the actual test of its abilities lies in the testing
data, representing unseen data that the model has never encountered. The
model’s success is measured by how accurately it can classify this new
data, making it akin to a student taking an exam.
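To make the idea of training and testing data concrete, here is a minimal Python sketch, assuming the scikit-learn library is available; the tiny numeric dataset, the choice of a decision tree, and the 25% test split are illustrative assumptions rather than part of the discussion above.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Each row of X is an attribute set; y holds the pre-defined class labels.
X = [[2, 60], [4, 85], [1, 70], [3, 75], [3, 80], [0, 50], [6, 95], [2, 65]]
y = [0, 1, 0, 1, 1, 0, 1, 0]

# Hold back 25% of the data as unseen "exam" data for the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)   # learning (training) phase
predictions = model.predict(X_test)                      # classification (testing) phase
print("Accuracy on unseen data:", accuracy_score(y_test, predictions))

Holding out a test set in this way is what allows us to estimate how the model will behave on data it has never encountered.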
In the realm of classification, we have a variety of algorithms at our
disposal, each with its strengths and weaknesses. From the simplicity of
logistic regression to the complexity of decision trees and the elegance
of support vector machines, these algorithms offer different tools for
the data scientist’s toolkit. The following sections will delve into these
algorithms, exploring their inner workings and real-world applications.
We will showcase how classification is not a one-size-fits-all approach
but a versatile set of techniques tailored to different data and business
needs. We will also learn standard evaluation measures to see how well
our classification model performs.

6.4 Understanding Popular Classification Algorithms


During the previous section, we established the fundamentals of classification
and its role in transforming data into actionable insights. Now, let us
dive deeper into the classification world by exploring some popular
classification algorithms that data scientists rely on to make predictions
and guide strategic decision-making.
Logistic Regression: A Simple Yet Effective Tool: Our journey begins
with logistic regression, a tried-and-true algorithm known for its simplicity
and effectiveness. Despite its name, logistic regression is a classification
algorithm commonly employed for binary classification tasks. It models
the probability of an observation corresponding to a particular class and
is especially useful when the relationship between features and classes is
linear. Figure 6.3 shows how logistic regression observations are separated
into two categories.
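As a rough illustration, the following Python sketch fits a logistic regression model with scikit-learn (an assumed library choice); the two features and the churn-style labels are invented for illustration, and the model outputs class probabilities as described above.

from sklearn.linear_model import LogisticRegression

# Feature vectors: [number of service calls, monthly charges]; label: 1 = churn.
X = [[1, 55], [2, 60], [0, 50], [5, 90], [6, 95], [4, 85]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression().fit(X, y)

# The model returns the probability of each class for a new observation,
# and a hard label obtained by thresholding that probability at 0.5.
print(clf.predict_proba([[3, 70]]))   # e.g. [[P(class 0), P(class 1)]]
print(clf.predict([[3, 70]]))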
Decision Trees: Unveiling the Decision-Making Process: Next, we
encounter decision trees, which offer a more intuitive way to make
decisions. Decision trees break down a decision-making process into a
series of choices, forming a tree-like structure. Nodes in the tree represent
decision points, and each branch represents a possible outcome. Decision
trees are versatile and can handle both binary and multi-class classification
tasks. Figure 6.4 shows a basic decision tree designed to predict if a
loan applicant will fail to pay. For more details, please refer to lesson 4.

Figure 6.3: Logistic Regression.


Figure 6.4: Decision Tree example.


Random Forest: Harnessing the Power of Ensembles: Random Forest,
an ensemble learning method, takes the idea of decision trees to the
next level. It combines several decision trees to establish a robust and
precise classification model. Each tree in the forest undergoes training
on a distinct subset of data and features, reducing the risk of overfitting
and enhancing model performance.
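The contrast between a single tree and an ensemble of trees can be sketched as follows, assuming scikit-learn; the synthetic dataset produced by make_classification simply stands in for real data, so the printed accuracies are only indicative.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# A synthetic two-class problem standing in for real data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Each tree in the forest is trained on a bootstrap sample and a random
# feature subset; averaging their votes usually reduces overfitting.
print("Single decision tree:", accuracy_score(y_test, tree.predict(X_test)))
print("Random forest       :", accuracy_score(y_test, forest.predict(X_test)))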
Support Vector Machines (SVMs): Maximising Margin: Support Vector
Machines are known for their ability to discover the ideal hyperplane,
which maximises the margin between class boundaries. SVMs excel in
scenarios where the data is not linearly separable; such data can be handled
effectively by transforming it into a higher-dimensional space. This
technique allows SVMs to handle complex classification tasks effectively.
K-Nearest Neighbours (K-NN): The Power of Proximity: K-Nearest
Neighbours is a basic yet powerful algorithm that categorises data points
based on their proximity to neighbouring points. It assumes that similar
data points belong to the same class. K-NN’s strength lies in its ability
to adapt to the data’s local structure, making it particularly useful for
non-linear classification tasks.
Naive Bayes: Leveraging Probabilistic Reasoning: It is a probabilistic
classification algorithm which utilises Bayes’ theorem. Its effectiveness
is notably pronounced in tasks like text classification and spam filtering,
although its applicability extends across various domains. Despite its
“naive” assumption of feature independence, Naive Bayes often performs
surprisingly well in practice.
As we journey through the world of classification, we will explore
some of these algorithms in greater detail, delving into their strengths,
weaknesses, and real-world applications. These algorithms, each with
its unique characteristics, provide the tools we need to extract valuable
insights from data and pave the way for data-driven strategies.

IN-TEXT QUESTIONS
1. Which classification algorithm is known for modelling the
probability of an observation belonging to a particular class,
making it suitable for binary classification tasks?
(a) Random Forest
(b) Decision Trees
(c) Logistic regression
(d) Support Vector Machines
2. What is the primary advantage of using Random Forest, an
ensemble learning method, for classification?
(a) It is simple and easy to interpret
(b) It can handle complex, non-linear classification tasks
effectively
(c) It relies on the concept of maximising margin
(d) It works well with small datasets

6.5 Naive Bayes Classification: Probability and Independence


Naive Bayes is a family of probabilistic classification algorithms that
utilises Bayes’ theorem with a strong assumption of independence between
the features. Despite its simplicity and the “naive” assumption, Naive
Bayes is surprisingly effective in many real-world classification tasks,
particularly in text and spam detection.
At the core of Naive Bayes lies Bayes’ theorem, a formula that links the
likelihood of a hypothesis (class) given specific evidence (features) with
the likelihood of the evidence given the hypothesis.
Bayes’ theorem can be represented as:

P(H|E) = P(E|H) · P(H) / P(E)    (1)
where,
‹ P(H): The prior probability of hypothesis H (prior).
‹ P(E): The probability of evidence E (marginal likelihood).
‹ P(H|E): The probability of hypothesis H given evidence E (posterior
probability).
‹ P(E|H): The probability of evidence E given hypothesis H (likelihood).
The term “naive” in Naive Bayes stems from the assumption that all
features are independent when considering the class. In simpler terms, the
presence or absence of one feature is assumed not to influence the presence
or absence of any other feature. While this independence assumption is
rarely valid in real-world data, Naive Bayes can perform well, especially
when features are approximately conditionally independent.
Types of Naive Bayes:
There are several variants of the Naive Bayes algorithm, each suited to
different types of data:
‹ Gaussian Naive Bayes: This variant is used when the features follow
a Gaussian (normal) distribution. It is appropriate for continuous
data.
‹ Multinomial Naive Bayes: This variant is commonly used for text
classification tasks, where features represent word frequencies (e.g.,
in document classification). It assumes a multinomial distribution
for the features.
‹ Bernoulli Naive Bayes: Suited for binary or Boolean features, such
as word presence/absence within a document in text classification.
The Naïve Bayes algorithm works in two phases:—
1. Training Phase: During the training phase, Naive Bayes calculates
the prior probabilities of each class (P(H)) and the conditional
probabilities of each feature given each class (P(E|H)). These
probabilities are estimated from the training data.
2. Prediction Phase: When making predictions for new data, Naive
Bayes calculates the posterior probabilities of each class given
the observed features using Bayes’ theorem. The predicted class is
determined as the one with the highest posterior probability.
An Example: Spam Email Detection:
Let us suppose you work for an email service provider, and you want to
classify incoming emails as either “Spam” or “Not Spam” (ham). You
have a dataset of emails and their corresponding labels, indicating whether
each email is spam (1) or not (0). Your goal is to build a Naive Bayes
classifier to classify new incoming emails automatically.
Here is a small portion of the dataset you have:—
Email Text                                               Label (1=Spam, 0=Not Spam)
“Congratulations! You’ve won a million dollars!”         1
“Hello, please find attached the report.”                0
“Get a free iPhone now!”                                 1
“Meeting agenda for tomorrow’s conference.”              0
“Claim your prize today!”                                1
“Reminder: Your appointment is tomorrow.”                0
Step 1: Data Pre-processing: For this example, let us assume that the
text has been pre-processed and converted into a numerical format. We
have already calculated the prior and conditional probabilities from the
training data.
Prior Probabilities (from training data): Calculate the prior probabilities:
P(Spam) and P(Not spam) based on the training data:
‹ P(Spam) = Count of Spam Emails / Total Count of Emails in the
Training Set.
‹ P(Not spam) = Count of Not Spam Emails / Total Count of Emails
in the Training Set.
In our example, P(Spam) = 3/6 = 0.5, and P(Not Spam) = 3/6 = 0.5.
Conditional Probabilities (from training data): Calculate the conditional
probabilities of each word in the email text given the class (Spam or
Not Spam). For example, calculate P(Claim|Spam), P(Claim|Not Spam),
P(free|spam), P(free|Not spam), etc., using the training data. Let us assume
the following conditional probabilities for some example words:
‹ P(Claim|Spam) ≈ 0.50
‹ P(Claim|Not Spam) ≈ 0.20
‹ P(free|Spam) ≈ 0.50
‹ P(free|Not Spam) ≈ 0.20
‹ P(prize|Spam) ≈ 0.50
‹ P(prize|Not Spam) ≈ 0.25
‹ P(now|Spam) ≈ 0.20
‹ P(now|Not Spam) ≈ 0.50
Step 2: Making Predictions
Now, let us calculate the posterior probabilities for both classes (Spam
and Not Spam) based on the provided email text: “Claim your free prize
now!”.
1. Calculate the likelihood of the email text for each class:
‹ Likelihood of the email text given Spam: P(“Claim your free prize now!”|Spam)
≈ P(Claim|Spam) * P(your|Spam) * P(free|Spam) * P(prize|Spam) * P(now|Spam)
≈ 0.50 * 0.50 * 0.50 * 0.50 * 0.20 = 0.0125
(taking P(your|Spam) = 0.50 for the one word not listed above; the product is pulled down mainly by the small P(now|Spam))
‹ Likelihood of the email text given Not Spam: P(“Claim your free prize now!”|Not Spam)
≈ P(Claim|Not Spam) * P(your|Not Spam) * P(free|Not Spam) * P(prize|Not Spam) * P(now|Not Spam)
≈ 0.20 * 0.50 * 0.20 * 0.25 * 0.50 = 0.0025
(taking P(your|Not Spam) = 0.50; this product is much smaller because words such as “Claim”, “free” and “prize” rarely appear in the legitimate emails)
2. Calculate the posterior probabilities using Bayes’ theorem:
‹ P(Spam | “Claim your free prize now!”) ∝ P(“Claim your free prize now!”|Spam) * P(Spam)
‹ P(Not Spam | “Claim your free prize now!”) ∝ P(“Claim your free prize now!”|Not Spam) * P(Not Spam)
3. Normalise the probabilities:
‹ P(Spam | “Claim your free prize now!”) = P(Spam | “Claim
your free prize now!”) / (P(Spam | “Claim your free prize
now!”) + P(Not Spam | “Claim your free prize now!”))
‹ P(Not Spam | “Claim your free prize now!”) = P(Not Spam |
“Claim your free prize now!”) / (P(Spam | “Claim your free
prize now!”) + P(Not Spam | “Claim your free prize now!”))
4. Choose the class with the maximum posterior probability as the
predicted class.
‹ P(Spam|“Claim your free prize now!”) ≈ 0.0125 * 0.5 / (0.0125 * 0.5 + 0.0025 * 0.5) ≈ 0.8333
‹ P(Not Spam|“Claim your free prize now!”) ≈ 0.0025 * 0.5 / (0.0125 * 0.5 + 0.0025 * 0.5) ≈ 0.1667
Since P(Spam|”Claim your free prize now!”) > P(Not Spam|”Claim your
free prize now!”), the email is predicted to be “Spam.”
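The same example can be reproduced programmatically. The sketch below is a minimal illustration, assuming scikit-learn; its multinomial Naive Bayes applies Laplace smoothing by default, so the exact probabilities will differ from the assumed values used in the hand calculation above, although the predicted class remains “Spam”.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "Congratulations! You've won a million dollars!",
    "Hello, please find attached the report.",
    "Get a free iPhone now!",
    "Meeting agenda for tomorrow's conference.",
    "Claim your prize today!",
    "Reminder: Your appointment is tomorrow.",
]
labels = [1, 0, 1, 0, 1, 0]             # 1 = Spam, 0 = Not Spam

vectoriser = CountVectorizer()
X = vectoriser.fit_transform(emails)    # word-count features for each email

model = MultinomialNB()                 # training phase: priors and likelihoods
model.fit(X, labels)

X_new = vectoriser.transform(["Claim your free prize now!"])
print(model.predict(X_new))             # prediction phase, e.g. [1] -> Spam
print(model.predict_proba(X_new))       # posterior probability for each class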
Advantages of Naive Bayes:
‹ Simplicity: Naive Bayes is straightforward to comprehend and apply,
making it appropriate for novices and experienced practitioners.
‹ Efficiency: It demonstrates computational efficiency and can manage
datasets with high dimensionality, featuring numerous attributes.
‹ Interpretability: The probabilities assigned to each class can provide
insights into the classification process.
‹ Effective for Text Classification: Naive Bayes is particularly effective
in text classification tasks, such as sentiment analysis and spam
detection.
Challenges and Limitations:
‹ Independence Assumption: The strong independence assumption
rarely holds exactly in real-world scenarios.
‹ Data Scarcity: It may perform poorly when limited training data or
certain feature combinations are unseen in the training set.
‹ Continuous Data: Gaussian Naive Bayes assumes that features follow
a Gaussian distribution, which may not be accurate for all datasets.

Applications:
Naive Bayes has numerous applications, including:
‹ Spam email detection.
‹ Document classification.
‹ Disease diagnosis based on medical test results.
‹ Credit risk assessment in finance.
‹ Sentiment analysis in NLP.
Despite its simplicity, Naive Bayes is a robust classification algorithm
that can yield excellent results in various real-world scenarios, mainly
when the independence assumption is approximately satisfied or when
dealing with text data.

6.6 K-NN (K-Nearest Neighbour) Algorithm: The Power of Proximity
K-Nearest Neighbours, often abbreviated as K-NN, is a versatile and
intuitive classification algorithm under the umbrella of instance-based
or lazy learning methods. K-NN is used for classification and regression
tasks, but we will focus on its classification application in this discussion.
Working of the K-NN algorithm
The essence of K-NN lies in its simplicity. At its core, K-NN makes
predictions based on the majority class among its k-nearest neighbours.
Here is how it works:
‹ Training Phase: K-NN stores the entire training dataset in the
training phase. It does not build an explicit model but retains the
data points and their associated class labels.
‹ Prediction Phase: When a new data point needs to be classified,
K-NN identifies the k-nearest data points from the training dataset
based on a distance metric. Euclidean distance is a common choice,
but other metrics like Manhattan distance can be utilised. Euclidean
distance and Manhattan distance between two ‘n’ dimensional data
points can be given by equations (2) and (3), respectively-

d(P, Q) = √((x2 − x1)² + (y2 − y1)²)    (2)
D(x, y) = Σi=1..n |xi − yi|    (3)


‹ Majority Voting: Once the k-nearest neighbours are determined,
K-NN counts the number of data points in each class among these
neighbours. The class with the highest count is selected as the
predicted class label for the new data point.
An Example: Customer Churn Prediction: Suppose you work for a
telecommunications company and want to predict whether a customer is
likely to churn (leave the company) based on two features: the number of
customer service calls and the total monthly charges. You have historical
data with churn labels (1 for churned, 0 for not churned). Here is the
simplified hypothetical dataset-
Customer ID   Customer Service Calls   Total Monthly Charges ($)   Churn (0=No, 1=Yes)
1             2                        60                          0
2             4                        85                          1
3             1                        70                          0
4             3                        75                          1
5             3                        80                          1
6             0                        50                          0
7             6                        95                          1
8             2                        65                          0
9             1                        60                          0
10            5                        90                          1
Step 1: Data Pre-processing
We start by scaling or normalising the features to ensure they have the
same impact on the model. For this example, let us assume the data is
already pre-processed and scaled.
Step 2: Choosing a Value for k
We must decide on an appropriate value for “k” in K-NN. Let us choose
k = 3 for this example.
Step 3: Making Predictions (Churn Prediction)
Now, let us predict whether a new customer with the following characteristics
is likely to churn:

Customer Service Calls: 4
Total Monthly Charges: $88
Here is how K-NN can help with churn prediction:
1. Calculate the Euclidean distance (using equation 2) between the new
customer and all existing customers in the dataset based on the two
features (Customer Service Calls and Total Monthly Charges).
‹ Distance to Customer 1 = √((4-2)² + (88-60)²) = √788 ≈ 28.07
‹ Distance to Customer 2 = √((4-4)² + (88-85)²) = √9 = 3.00
‹ Distance to Customer 3 = √((4-1)² + (88-70)²) = √333 ≈ 18.25
‹ Distance to Customer 4 = √((4-3)² + (88-75)²) = √170 ≈ 13.04
‹ Distance to Customer 5 = √((4-3)² + (88-80)²) = √65 ≈ 8.06
‹ Distance to Customer 6 = √((4-0)² + (88-50)²) = √1460 ≈ 38.21
‹ Distance to Customer 7 = √((4-6)² + (88-95)²) = √53 ≈ 7.28
‹ Distance to Customer 8 = √((4-2)² + (88-65)²) = √533 ≈ 23.09
‹ Distance to Customer 9 = √((4-1)² + (88-60)²) = √793 ≈ 28.16
‹ Distance to Customer 10 = √((4-5)² + (88-90)²) = √5 ≈ 2.24
Select the three nearest neighbours with the smallest distances. In this
case, the three nearest neighbours are Customers 2, 7, and 10.
Count the churn status of these 3 nearest neighbours:
Customer 2: Churned (Churn = 1)
Customer 7: Churned (Churn = 1)
Customer 10: Churned (Churn = 1)
Apply majority voting. Since all three nearest neighbours churned, the
new customer may be classified as likely to churn (Churn = 1).
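For readers who prefer code, the following minimal sketch reproduces the churn example with k = 3, assuming scikit-learn; in practice the two features should first be scaled, as noted in Step 1, but the raw values are kept here so the distances match the hand calculation above.

from sklearn.neighbors import KNeighborsClassifier

# [customer service calls, total monthly charges] for customers 1 to 10.
X = [[2, 60], [4, 85], [1, 70], [3, 75], [3, 80],
     [0, 50], [6, 95], [2, 65], [1, 60], [5, 90]]
y = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1]          # churn labels from the table above

knn = KNeighborsClassifier(n_neighbors=3)   # Euclidean distance by default
knn.fit(X, y)

new_customer = [[4, 88]]                    # 4 service calls, $88 monthly charges
print(knn.predict(new_customer))            # -> [1], i.e. likely to churn
distances, indices = knn.kneighbors(new_customer)
print(distances, indices)                   # the 3 nearest: customers 10, 2 and 7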
Advantages of K-NN:
‹ Simplicity: K-NN is straightforward to comprehend and implement.
It does not require complex model training or optimisation.
‹ Non-Linearity: It can accommodate non-linear decision boundaries,
rendering it applicable to diverse classification tasks.

‹ Instance-Based Learning: It avoids making stringent assumptions
about the underlying data distribution, which enhances its robustness
and adaptability to various datasets.
Challenges in K-NN algorithm
‹ Choosing the Right Value for k: One of the most critical challenges
in using K-NN is selecting an appropriate value for the parameter
“k”, which represents the number of nearest neighbours to consider
when making predictions. Choosing an incorrect value for k can
significantly impact the algorithm’s performance:
„ A smaller value of k (e.g., 1 or 3) can lead to noisy predictions,
especially when the dataset contains outliers or noise. The
model might be overly sensitive to individual data points.
„ A larger value of k (e.g., 10 or 20) can result in smoother
decision boundaries but may lead to less accurate predictions,
mainly if the data is complex and non-linear.
„ The choice of k often requires experimentation and validation
on the specific dataset to find the optimal value.
‹ Curse of Dimensionality: K-NN may face difficulties from the “curse
of dimensionality,” especially when dealing with high-dimensional
data. This challenge arises because the distance between data points
becomes less meaningful with an increase in the number of features
or dimensions in the dataset:
„ In higher dimensions, data points tend to be far apart from
each other, and the concept of proximity breaks down. The
far-apart data points can lead to poor performance as K-NN
relies on measuring distances between data points to determine
neighbours.
„ Dimensionality reduction approaches like Principal Component
Analysis (PCA), or feature selection may become essential
to address this concern. These techniques help decrease the
number of dimensions while retaining the most pertinent
information.
‹ Scalability: K-NN can pose computational challenges, especially when
dealing with large datasets. With an increase in dataset size, the
algorithm needs to compute distances between the new data point
and all the data points within the training set, potentially leading
to impractical computational demands.
‹ Storage and Memory Requirements: The K-NN algorithm requires
storing the entire training dataset in memory for prediction, which
can be memory-intensive for large datasets.
Despite these challenges, K-NN remains a valuable algorithm in various
domains, primarily when used appropriately and with a good understanding
of its strengths and limitations. Addressing these challenges often involves
careful pre-processing, parameter tuning, and keeping in mind the specific
characteristics of the dataset and task at hand.

IN-TEXT QUESTIONS
3. What is a key assumption of the Naive Bayes algorithm?
(a) All features are dependent on each other
(b) All features exhibit conditional independence with respect
to the class
(c) Features have a linear relationship
(d) Features are normally distributed
4. What happens if you choose a small value of ‘K,’ such as 1
in K-NN?
(a) The algorithm becomes computationally slower
(b) The decision boundary becomes more complex
(c) The model becomes less prone to overfitting
(d) The predictions may be sensitive to noise or outliers

6.7 Evaluation Measures for Classification


Evaluation measures for classification algorithms are essential for assessing
the performance and effectiveness of these algorithms in various applications.
These measures help quantify how well a classifier is performing and
provide insights into its strengths and weaknesses. In this section, we will
delve into the details of crucial evaluation measures for classification,
including accuracy, precision, recall, F1-score, and the ROC-AUC curve.

1. Confusion Matrix:
‹ Definition: It is a table that summarises the classifier’s predictions
and the actual class labels. It provides a breakdown encompassing
true positives, true negatives, false positives, and false negatives.
‹ TP: True Positives (correctly predicted positive instances).
‹ TN: True Negatives (correctly predicted negative instances).
‹ FP: False Positives (incorrectly predicted positive instances).
‹ FN: False Negatives (incorrectly predicted negative instances).
‹ Use Case: The confusion matrix helps understand the types of errors
made by a classifier and can be used to calculate various evaluation
measures.
2. Accuracy:
‹ Definition: Accuracy measures the ratio of correctly predicted
instances to the total number of instances in the dataset. It provides
an overall assessment of a classifier’s correctness. It is given by
equation (4).
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)
‹ Use Case: Accuracy is suitable when the class distribution in the
dataset is roughly balanced.
3. Precision:
‹ Definition: It measures the fraction of true positive predictions among
all positive predictions by the classifier. It indicates the classifier’s
ability to avoid false positive errors. It is given by equation (5).
Precision = TP / (TP + FP)    (5)
‹ Use Case: Precision is crucial when false positives are costly or
when the focus is on minimising type I errors (e.g., in medical
diagnoses).
4. Recall (True Positive Rate or Sensitivity):
‹ Definition: It computes the fraction of true positive predictions among
all truly positive instances in the dataset. It is given by equation
(6).

Recall = TP / (TP + FN)    (6)
‹ Use Case: Recall is important when missing positive instances (false
negatives) is costly or when the goal is to maximise true positive
identification (e.g., in disease detection).
5. F1-Score:
‹ Definition: It is computed using the harmonic mean of precision
and recall. It provides a balance between these two metrics and
is particularly useful when dealing with imbalanced datasets. It is
given by equation (7).
F1-Score = 2 * (Precision * Recall) / (Precision + Recall)    (7)
‹ Use Case: The F1-score is valuable when both precision and recall
need to be considered, especially in scenarios with imbalanced
classes.
6. ROC-AUC (Receiver Operating Characteristic Curve and Area Under
the Curve):
‹ Definition: The ROC curve is a graphical representation of a
classifier’s performance across different threshold values. It generates
a plot depicting the True Positive Rate (Recall) in relation to the
False Positive Rate (1 - Specificity) at different thresholds. The
AUC quantifies the area under the ROC curve, providing a single
numeric value for classifier performance.

Figure 6.5: AUC-ROC evaluation.

‹ Use Case: ROC-AUC helps assess binary classifiers and compare
models’ performance. A higher AUC indicates a better classifier.
7. Specificity (True Negative Rate):
‹ Definition: Specificity quantifies the ratio of correct negative predictions
to all actual negative instances, indicating the classifier’s capacity
to detect all negative instances correctly. It can be calculated using
the equation (8).
Specificity = TN / (TN + FP)    (8)
‹ Use Case: Specificity is valuable when avoiding false alarms (false
positives) is crucial, such as in security or fraud detection.
8. F-beta Score:
‹ Definition: The F-beta score is a generalisation of the F1-score that
allows you to adjust the balance between precision and recall. It
incorporates a parameter (beta) that controls the relative significance
of precision and recall. It is given by equation (9).
F-beta Score = (1 + β²) * (Precision * Recall) / (β² * Precision + Recall)    (9)
‹ Use Case: F-beta score is helpful when a specific trade-off between
precision and recall needs to be considered.
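All of the measures above are available in standard libraries. The short sketch below, assuming scikit-learn, computes them for a small set of invented true labels, hard predictions, and probability scores.

from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, fbeta_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]    # actual class labels
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]    # hard predictions from a classifier
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3, 0.95, 0.05]  # P(class = 1)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("Accuracy   :", accuracy_score(y_true, y_pred))      # (TP+TN)/(TP+TN+FP+FN)
print("Precision  :", precision_score(y_true, y_pred))     # TP/(TP+FP)
print("Recall     :", recall_score(y_true, y_pred))        # TP/(TP+FN)
print("F1-score   :", f1_score(y_true, y_pred))
print("F2-score   :", fbeta_score(y_true, y_pred, beta=2)) # recall weighted higher
print("Specificity:", tn / (tn + fp))                      # TN/(TN+FP)
print("ROC-AUC    :", roc_auc_score(y_true, y_score))      # uses scores, not labels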

6.8 Real-World Applications of Classification


In the previous sections, we laid the foundation for understanding
classification—a powerful technique in the realm of predictive modelling
and also discussed some popular classification algorithms in detail. Now, we embark on
a journey through the real-world applications of classification, showcasing
its significance in various industries and how it aids in solving critical
challenges.
Classification in Healthcare : Imagine a scenario where medical
professionals need to predict whether a patient is at risk of a life-threatening
disease based on their medical history and diagnostic tests. This is where
classification comes into play. By analysing historical patient data and
training a classification model, healthcare providers can identify patterns
and make early predictions. Classification enables timely interventions,
potentially saving lives and reducing healthcare costs.
Classification in Finance: Financial institutions deal with vast amounts
of data daily, and making informed decisions is paramount. Classification
is instrumental in credit risk assessment. Banks can determine the
creditworthiness of loan applicants by categorising them into different risk
groups based on their financial history and other relevant factors. This
helps mitigate financial risks and ensures responsible lending practices.
Classification in Manufacturing: Quality control is another area where
classification shines. In manufacturing, defects or deviations from product
specifications can have severe consequences. By employing classification
models on production line data, manufacturers can detect anomalies and
identify products that do not meet quality standards in real time. This
proactive approach ensures product quality and reduces waste.
Challenges and Considerations: While classification offers immense
benefits, it has challenges. Real-world data is often messy, and class
imbalances can skew model performance. Additionally, ensuring model
interpretability is crucial, especially in healthcare, where decisions directly
impact lives. As we navigate the world of classification, we will address
these challenges and explore strategies to overcome them.
Classification beyond Borders: Classification is not confined to a
single industry. Its versatility extends across sectors, from e-commerce
and marketing to cyber security and environmental science. The ability
to categorise data, make predictions, and inform strategic decisions is a
skill that transcends boundaries.
As we journey through the diverse landscapes where classification is
indispensable, remember that it is not just about the algorithms—it is about
harnessing the power of data to make decisions that matter. Classification
is the compass that guides us through this data-driven world, helping us
navigate with confidence and precision.

6.9 Strategic Insights with Market Basket Analysis


Market Basket Analysis is a data mining technique that focuses on
understanding the purchasing behaviour of customers. This method assists
businesses in discovering connections and correlations between products
or services commonly bought together by customers. It derives its name
from the traditional shopping “basket,” where items are placed during a
shopping trip.
The Essence of Association Rules: At the core of Market Basket Analysis
are association rules. These rules describe the relationships between items
in a dataset and can be expressed as “If-Then” statements. For example,
an association rule might state, “If a customer buys item A, then they are
likely to buy item B as well”. For more details about association rules,
please refer to lesson 5.
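To make support and confidence concrete, here is a small, library-free Python sketch that scores a single “If bread, then butter” rule; the five transactions are invented purely for illustration.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    """Fraction of all baskets that contain every item in the itemset."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

# Rule: if a basket contains bread, it also contains butter.
antecedent, consequent = {"bread"}, {"butter"}
rule_support = support(antecedent | consequent)      # P(bread and butter)
confidence = rule_support / support(antecedent)      # P(butter | bread)

print(f"support = {rule_support:.2f}, confidence = {confidence:.2f}")
# support = 0.60, confidence = 0.75: bread appears in 4 of 5 baskets,
# and 3 of those 4 baskets also contain butter.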
Practical Applications
Market Basket Analysis is not limited to just understanding customer
buying patterns; its applications extend across various domains:
‹ Retail: Retailers use Market Basket Analysis to optimise product
placement on shelves. They can encourage cross-selling and increase
sales revenue by placing complementary items together.
‹ E-commerce: Online retailers leverage Market Basket Analysis to
provide product recommendations to customers, enhancing the
shopping experience and boosting sales.
‹ Pricing Strategies: Market Basket Analysis has a direct influence on
pricing strategies. Businesses can implement effective bundling or
discount strategies by identifying items that are frequently purchased
together. For example, a grocery store might offer a discount on
a combination of bread and butter, enticing customers to buy both
items. This not only increases sales but also enhances the overall
customer experience.
‹ Cross-Selling and Upselling: One of the most potent applications of
Market Basket Analysis lies in cross-selling and upselling. Cross-
selling involves offering customers complementary products based
on their current purchase. For instance, an online bookstore may
suggest related books when a customer adds an item to their cart.
Upselling, on the other hand, involves persuading customers to
upgrade to a more expensive or premium version of a product or
service. Market Basket Analysis identifies opportunities for cross-
selling and upselling, enabling businesses to increase revenue while
providing valuable solutions to customers.

‹ Beyond Retail: While Market Basket Analysis has traditionally been
associated with the retail sector, its principles extend far beyond.
Its versatility and the insights it provides can be applied across
various industries. Here are some examples:
„ Media and Entertainment: Streaming platforms utilise Market
Basket Analysis to suggest additional content to users,
considering their viewing history. This keeps users engaged
and increases content consumption.
„ Hospitality: In the hospitality industry, Market Basket Analysis
helps hotels and resorts offer personalised experiences by
suggesting additional services or amenities based on guest
preferences. For example, a guest booking a room may receive
offers for spa services or restaurant reservations.
„ E-commerce: Besides retail, e-commerce platforms also use Market
Basket Analysis extensively. They recommend complementary
products, enhance user experiences, and maximise sales revenue.
Market Basket Analysis provides a data-driven approach to understanding
customer behaviour and preferences in all these cases, leading to more
targeted and effective marketing strategies.
As we conclude our exploration of Market Basket Analysis, it becomes
clear that this technique is not just about product recommendations. It is
a strategic tool that drives revenue, enhances customer satisfaction, and
extends its reach beyond retail. The insights gained from Market Basket
Analysis can reshape business strategies, making them more customer-
centric, profitable, and sustainable.

IN-TEXT QUESTIONS
5. What is the primary goal of Market Basket Analysis (MBA)?
(a) To identify the most frequently purchased product.
(b) To discover associations between products frequently
bought together.
(c) To analyse customer demographics for targeted marketing.
(d) To optimise product pricing strategies.

6. In MBA, what term is used to describe sets of items often
purchased together by customers?
(a) Product Clusters
(b) Basket Groups
(c) Association Rules
(d) Itemsets

6.10 Summary
In this lesson, we embarked on a journey through the world of predictive
modelling, classification, and Market Basket Analysis, discovering how
data-driven insights shape the strategies of modern businesses. We have
explored the significance and real-world applications of these techniques,
shedding light on their transformative power. The following key points
were discussed in this lesson:—
1. Predictive Modelling with Classification: We began by understanding
the fundamentals of classification—an essential technique in
supervised learning. Classification bridges raw data and informed
decision-making, enabling organisations to predict future outcomes,
from diagnosing diseases to identifying fraudulent activities. We
explored various classification algorithms and their critical role in
data-driven strategies.
2. Strategic Insights with Market Basket Analysis: Moving on to
Market Basket Analysis, we unveiled its core concepts, including
association rules, support, confidence, and lift. We learned how
this technique deciphers consumer behaviour, optimises product
placement, and drives cross-selling and upselling strategies. Beyond
the retail sector, we discovered how Market Basket Analysis extends
its reach to various industries, providing insights that enhance the
customer experience and drive business growth.
3. The Profound Impact on Business Strategies: Finally, we examined
the tangible impact of Market Basket Analysis on business strategies.
We saw how it influences pricing strategies, enabling businesses to
bundle products, offer discounts, and increase the average transaction
value. We explored the art of cross-selling and upselling, where
Market Basket Analysis identifies opportunities to provide customers
with additional value. Moreover, we highlighted how this technique’s
applicability transcends traditional retail, benefiting sectors like
media, hospitality, and e-commerce.
4. A Data-Driven Future: In a world awash with data, the ability to
transform raw information into actionable insights is more critical
than ever. Predictive modelling, classification, and Market Basket
Analysis offer the tools and methodologies needed to navigate this
data-driven landscape effectively.
Remember that the journey “From Data to Strategy” is an ongoing one.
With the knowledge gained from predictive modelling and Market Basket
Analysis, you can make informed decisions, enhance customer experiences,
and drive business success.

6.11 Answers to In-Text Questions

1. (c) Logistic regression


2. (b) It can handle complex, non-linear classification tasks effectively
3. (b) All features exhibit conditional independence with respect to
the class
4. (d) The predictions may be sensitive to noise or outliers
5. (b) To discover associations between products frequently bought
together
6. (d) Itemsets

6.12 Self-Assessment Questions


1. Define predictive modelling, and how does it differ from traditional
data analysis techniques?
2. Reflect upon the concept of classification in supervised learning and
its role in decision-making.
3. Highlight the importance of prior probabilities in Naive Bayes. How
does it affect the classification process?

4. Briefly explain the key characteristics of the classification algorithms:
Naive Bayes, Logistic regression, and K-Nearest Neighbours (K-
NN).
5. What are the potential challenges when working with imbalanced
data, and how can they be mitigated using classification algorithms?
6. Define and differentiate between accuracy, precision, recall, and F1-
score. When would you prioritise one metric over the others?
7. Describe a scenario where specificity (True Negative Rate) is a
critical evaluation metric and why it matters.
8. Reflect on a situation where you had to choose between different
classification algorithms for a specific problem. What factors
influenced your decision?

6.13 References
‹ Doyle, Arthur Conan, “A Scandal in Bohemia,” in “The Adventures
of Sherlock Holmes,” Strand Magazine, 1891.
‹ Hastie, Trevor, et al., “The Elements of Statistical Learning: Data
Mining, Inference, and Prediction,” 2nd ed. Springer, 2009.
‹ Mitchell, Tom M., “Machine Learning,” McGraw-Hill, 1997.
‹ Quinlan, J. Ross, “C4.5: Programs for Machine Learning,” Morgan
Kaufmann, 1993.
‹ Powers, David M. W., “Evaluation: From Precision, Recall and
F-Measure to ROC, Informedness, Markedness & Correlation,” in
“Journal of Machine Learning Technologies,” 2011.
‹ Fawcett, Tom, “An Introduction to ROC Analysis,” in “Pattern
Recognition Letters,” 2006.
‹ Agrawal, Rakesh, and Ramakrishnan Srikant, “Fast Algorithms for
Mining Association Rules,” in “Proceedings of the 20th International
Conference on Very Large Data Bases (VLDB),” 1994.
‹ Chen, Ming-Syan, et al., “A Fast Parallel Algorithm for Thinning
Digital Patterns,” in “Communications of the ACM,” 1993.

6.14 Suggested Readings

‹ “Data Science for Business” by Foster Provost & Tom Fawcett (First
edition, 2013).
‹ “Mining of Massive Datasets” by Jure Leskovec, Anand Rajaraman,
& Jeffrey D. Ullman (Third edition, 2014).
‹ “Data Mining: Concepts and Techniques” by Jiawei Han, Micheline
Kamber, & Jian Pei (Third Edition, 2011).
‹ “Data Science for Executives” by Nir Kaldero (First Edition, 2018).
‹ “Data Mining: Practical Machine Learning Tools and Techniques”
by Ian H. Witten & Eibe Frank (Second Edition, 2005).

L E S S O N

7
Predictive Analytics
and its Use
Dr. Charu Gupta
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi
Email-Id: charugupta.sol.2023@gmail.com; charu.gupta@sol-du.ac.in

STRUCTURE
7.1 Learning Objectives
7.2 Introduction
7.3 Predictive Analytics
7.4 Predictive Analytics Applications and Benefits - Marketing, Healthcare, Operations and Finance
7.5 Text Analysis
7.6 Analysis of Unstructured Data
7.7 In-Database Analytics
7.8 Summary
7.9 Answers to In-Text Questions
7.10 Self-Assessment Questions
7.11 References
7.12 Suggested Readings

7.1 Learning Objectives


‹ To gain insights into Predictive Analysis, processes, and models.
‹ To learn the applications of Predictive Analysis.
‹ To explore the usage of predictive models in Marketing, Healthcare, Operations and Finance.
‹ To understand Text Analysis, including Unstructured Data.
7.2 Introduction

It has been a human tendency to want to know about the future. As such, various
techniques have been developed over the ages to predict events that may
occur in the future based on historical as well as current data.
With the advent of science and technology, the sophistication of prediction
using historical and current data has increased manifold, and the arena of
Predictive Analysis and its applications in varied fields has grown
accordingly. Forecasting weather reports, predicting future market trends,
and predicting sales during a predefined period are a few examples of using
data and prediction methods. Nowadays, there is a plethora of data, vast in
volume, veracity and variety, that can be mined to unravel hidden patterns
and trends and to gain insights into future outcomes and performance:
developing video games, translating voice-to-text messages, making decisions
regarding customer-oriented services, developing investment portfolios,
improving operational efficiencies, reducing risks, detecting deviations,
detecting cyber frauds, anticipating potential losses, analysing sentiments
and social behaviour, and many more.

7.3 Predictive Analytics


Today, digitisation in every field has led to the generation of enormous
volumes of data ranging from log files to text to audio to images
and video, and such data resides in varied data repositories across an
organisation. Sophisticated algorithms of machine learning and deep
learning are extensively used to find trends and patterns to predict future
events. The statistical models of logistic and linear regression models,
neural networks and decision trees play an essential role in discovering
patterns from historical and current data.
Predictive Analytics is a branch of advanced analytics that makes
predictions about future outcomes by unravelling hidden patterns from
historical data combined with statistical modelling, data mining techniques
and machine learning.
The term predictive analytics refers to the use of statistics and modelling
techniques to make predictions about future outcomes and performance.


Predictive Analytics - The Process (7Ds - Define Problem, Data Collection, Data Analysis, Develop Model, Decide Model, Deploy Model, Depurate Model)
Predictive Analysis is a process of collecting, transforming, cleaning,
and modelling data to arrive at a future forecast and insights for making
informed decisions. This process consists of the following iterative phases.
[Figure: a cycle through Define the Problem → Data Collection → Data Analysis → Develop Model → Decide Model → Deploy Model → Depurate (Monitor and Refine) Model, returning to Define the Problem]
Figure 7.1 : Predictive Analytics Process Cycle


(1) DEFINE - PROBLEM UNDERSTANDING AND DEFINING
REQUIREMENTS
The first and vital stage in Predictive Analysis is to understand the problem
and frame the possible solution. During this process, the requirements of
all the stakeholders, the utilities available, the deliverables and the business
perspective of the solution are documented. This step clearly defines the
requirements, including the answer to “What is to be predicted?” and
“Will the outcome solve the problem defined?”
In this step, an analyst should define the project objectives and the scope
of the effort and identify the data sets and the outcome.
(2) DATA COLLECTION
In the second step, data in enormous amounts, volumes, and varieties
are collected from multiple reliable sources, including stakeholders,
organisations, and open sources. The data collected is raw and may be
structured or unstructured; different types of past and present data range from transactional systems, sensors, and logs to ensure in-depth results.
However, it is to be ensured that the data is collected from reliable sources
and complies with the data privacy and governance policies. The results
of prediction derived from the predictive model depend entirely on the
data being utilised; it is essential to collect the most relevant data aligned
with the problem and requirements. While collecting data, format, period
and attributes are also needed in the form of metadata.
(3) DATA ANALYSIS - EXPLORATORY DATA ANALYSIS
In the next step, after collection of the requisite data in sufficient volume,
the Analysis is performed. The data relevancy, suitability, quality, and
cleanliness are analysed. The Data collected is then cleaned, structured,
and formatted as desired using various data cleaning, transforming, and
wrangling processes.
Once the data in the desired form is transformed, it is analysed using
various statistical tools and methods. It is crucial to know the properties of
the data. This process of analysing and exploring the data before applying
the predictive model is called Exploratory Data Analysis. In this step,
the dependent and independent attributes and correlation among features/
attributes of a dataset are determined. The data types, dimensions, and
data distribution among the datasets collected are identified. It is also
analysed for any missing data, duplicate values, redundant data, outliers,
and any prominent pattern in data distribution. The correlation among the
features and attributes of data sets is calculated, and their impact on the
outcome is identified. Raw data sets may also be transformed into new
features for calculating a prediction. For example, different age groups
are created to predict the sales of a clothing apparel brand, like kids,
teenagers, youngsters, middle-aged people, and senior citizens, instead
of accurate age values.
Exploratory Data Analysis (EDA) is a pivotal phase in the data analysis
process that involves scrutinizing and understanding raw data before formal
statistical modelling. It serves as the lens through which data analysts
and scientists unravel patterns, relationships, and outliers, facilitating the
formulation of hypotheses and guiding subsequent analytical decisions. EDA
encompasses a myriad of techniques, from descriptive statistics and data
visualization to summary statistics and hypothesis testing. Visualization
tools such as histograms, scatter plots, and box plots provide a visual narrative, aiding in the identification of trends and irregularities. EDA
not only unveils the story within the data but also helps in determining
the appropriate data preprocessing steps and informs the choice of
modelling techniques. A meticulous EDA lays the foundation for robust
and meaningful analyses, fostering a comprehensive understanding of the
underlying patterns and nuances within the dataset.
Exploring and analysing the data is essential to identify and treat all the
outliers, null values and other unnecessary elements. This process helps
in improving the accuracy of the model.
In this step, numerical calculations and data visualisations are performed.
Numerical calculations include Calculating Standard Deviation, Z-score,
Inter-Quartile Range, Mean, Median, and Mode, and identifying the
skewness in the data to understand the dispersion of data across the
dataset. Graphical representations such as heat maps, scatter plots, bar
graphs, and box plots provide a more comprehensive dataset view.
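As a small illustration of these numerical EDA calculations, the Python sketch below computes the mean, median, mode, standard deviation, inter-quartile range, skewness and z-scores for a hypothetical numeric column (the values are invented purely for illustration and are not taken from this lesson):
```python
import pandas as pd

# Hypothetical daily sales figures (illustrative values only)
sales = pd.Series([120, 135, 128, 140, 500, 132, 125, 138, 130, 127])

mean, median, mode = sales.mean(), sales.median(), sales.mode().iloc[0]
std = sales.std()                                    # sample standard deviation
q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1                                        # inter-quartile range
z_scores = (sales - mean) / std                      # z-score of each observation

# Flag potential outliers with the common 1.5 * IQR rule
outliers = sales[(sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)]

print(f"mean={mean:.1f}, median={median:.1f}, mode={mode}, std={std:.1f}, IQR={iqr:.1f}")
print("skewness:", round(sales.skew(), 2))
print("potential outliers:", outliers.tolist())
```
A histogram or box plot of the same column (for example, sales.plot.box()) would give the corresponding visual view.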
(4) DEVELOP MODEL
In this phase of predictive Analysis, sophisticated algorithms from machine
learning, deep learning, regression algorithms, and neural networks are
used to build predictive models. Various tools, programming languages
and packages such as Microsoft Excel, MATLAB, Python, R, IBM SPSS Modeler, RapidMiner, Tableau, SAP Analytics, scikit-learn, and Minitab
are available for predictive modelling.
While developing a predictive model, experiments with different features,
algorithms and processes are conducted using predictive modelling tools
to maximise performance, accuracy, and other performance metrics.
Predictors: The independent attributes in the dataset that are used to
predict the value of the target/ dependent variable.
Target: The dependent variable whose values are to be predicted using
a prediction model.
Types of Predictive Models
Predictive analytics models that are widely used are:
(a) Classification models - Deciphering Patterns for Decision-Making
Classification models are supervised machine learning models. These
models are used to classify the data into groups. They can also be used to answer questions with binary outputs, for example, yes/no or true/false in fraud detection cases. Types of classification models
include decision trees, random forest, and Naïve Bayes.
Classification models are a cornerstone in machine learning, serving
as a powerful tool for organizing data into distinct categories or
classes. The primary goal is to develop a predictive algorithm that
can learn patterns from labelled training data and subsequently
classify new, unseen instances into predefined categories. These
models are widely employed in diverse domains, from spam filtering
in emails to medical diagnosis and sentiment analysis in natural
language processing. Common classification algorithms include
logistic regression, decision trees, support vector machines, and
ensemble methods like random forests. Evaluation metrics such
as accuracy, precision, recall, and F1 score assess the model’s
performance. The versatility of classification models makes them
invaluable for tackling a broad spectrum of real-world problems,
empowering decision-makers with insights derived from data-driven
predictions and facilitating informed decision-making.
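As a minimal illustrative sketch of a classification model, the code below trains a decision tree on scikit-learn's bundled breast-cancer dataset (chosen only because it ships with the library, not because it features in this lesson) and reports accuracy on a held-out test set:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load a small labelled dataset: predictors X and a binary target y
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit a decision tree classifier on the training data
clf = DecisionTreeClassifier(max_depth=4, random_state=42)
clf.fit(X_train, y_train)

# Classify unseen examples and measure accuracy
y_pred = clf.predict(X_test)
print("Test accuracy:", round(accuracy_score(y_test, y_pred), 3))
```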
(b) Regression Models - Unraveling Relationships in Data
Regression analysis is a statistical technique for determining the
relationship between a single dependent (criterion) variable and
one or more independent (predictor) variables. The Analysis gives a
predicted value for the criterion resulting from a linear combination
of the predictors.
Regression models form a fundamental category within the realm of
statistical, machine learning methodologies and predictive modelling,
aiming to comprehend and quantify the relationships between
variables. These models are employed when the objective is to
predict a continuous outcome based on one or more input features.
Through the analysis of historical data, regression algorithms discern
patterns and trends, enabling the formulation of predictive equations.
Linear regression, for instance, models relationships as straight
lines, while more complex algorithms like polynomial regression
and support vector regression accommodate non-linear patterns. The
application of regression models spans various domains, from finance
predicting stock prices to healthcare forecasting patient outcomes.

These models provide a quantitative understanding of how changes
in one variable can influence others, offering valuable insights for
decision-making and trend extrapolation. As indispensable tools in
the data analyst’s arsenal, regression models contribute significantly
to uncovering and leveraging patterns within datasets.
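A minimal linear-regression sketch on synthetic data (the slope, intercept and noise level below are arbitrary assumptions made only to generate an example) shows how a continuous target is predicted and how MSE and R-squared are reported:
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data: y depends linearly on x plus random noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))               # single predictor
y = 3.5 * X[:, 0] + 2.0 + rng.normal(0, 1.5, 200)   # assumed slope 3.5, intercept 2.0

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

print("Estimated slope:", round(model.coef_[0], 2))
print("Estimated intercept:", round(model.intercept_, 2))
print("MSE:", round(mean_squared_error(y, y_pred), 2))
print("R-squared:", round(r2_score(y, y_pred), 3))
```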
(c) Clustering models - Unveiling Inherent Patterns in Data
Clustering models are unsupervised learning models that group data
based on similar attributes. For example, an e-commerce platform
groups customers according to age and location and develops a
marketing strategy to boost sales. Different clustering algorithms
include k-means clustering, mean-shift clustering, Density-Based
Spatial Clustering of Applications with Noise (DBSCAN), Expectation-
Maximisation (EM) clustering using Gaussian Mixture Models
(GMM), and hierarchical clustering.
Clustering models are pivotal in the realm of unsupervised machine
learning, tasked with uncovering inherent structures and groupings
within datasets. The primary objective is to categorize data points
into clusters based on similarities, allowing for the identification
of patterns or natural groupings without predefined labels. Popular
clustering algorithms include K-means, hierarchical clustering, and
DBSCAN, each with its own strengths and applications. These
models find utility in diverse fields, such as customer segmentation
in marketing, anomaly detection in cyber security, and image
segmentation in computer vision. By grouping similar data points
together, clustering models provide valuable insights into the
underlying structure of datasets, aiding in data exploration, pattern
recognition, and decision-making processes. Their versatility makes
them indispensable tools for uncovering hidden patterns in complex
and unstructured data.
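The sketch below applies k-means to synthetic two-dimensional points generated with scikit-learn's make_blobs helper (purely illustrative data), grouping them into three clusters without using any labels:
```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Unlabelled synthetic data with three natural groupings
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=7)

# Fit k-means with k=3; no labels are used (unsupervised learning)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=7)
labels = kmeans.fit_predict(X)

print("Cluster sizes:", [int((labels == k).sum()) for k in range(3)])
print("Cluster centres:\n", kmeans.cluster_centers_.round(2))
```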
(d) Time series models - Unraveling Temporal Trends for Predictive
Insights
Time series models use various data inputs at a specific frequency,
such as daily, weekly, or monthly. This model plots the dependent
variable over time to assess the data for seasonality, trends, and
cyclical behaviour. Popular time series models are Autoregressive
(AR), Moving Average (MA), ARMA, and ARIMA.

Time series models are a specialized class of predictive modelling
techniques designed to analyze and forecast data points collected
over sequential time intervals. These models are particularly adept
at capturing temporal dependencies, trends, and seasonality within
the data, making them invaluable for predicting future values based
on historical patterns. Time series analysis involves exploring
the temporal dynamics, identifying trends, and accounting for
seasonality or cyclic patterns. Common time series models include
Autoregressive Integrated Moving Average (ARIMA), Exponential
Smoothing State Space Models (ETS), and Long Short-Term Memory
(LSTM) networks in deep learning. Applications of time series
models span various domains, from financial forecasting and stock
price prediction to weather forecasting and demand planning in
supply chain management. The ability to extrapolate future trends
from historical data makes time series models crucial for making
informed decisions in dynamic and time-dependent scenarios.
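A small time-series sketch, assuming the statsmodels package is available; the monthly series is synthetic (trend plus noise) and the ARIMA order (1, 1, 1) is an arbitrary illustrative choice rather than a tuned model:
```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly series with an upward trend plus noise (illustrative only)
rng = np.random.default_rng(1)
index = pd.date_range("2020-01-01", periods=36, freq="MS")
values = 100 + 2.0 * np.arange(36) + rng.normal(0, 3, 36)
series = pd.Series(values, index=index)

# Fit a simple ARIMA(1, 1, 1) model and forecast three months ahead
model = ARIMA(series, order=(1, 1, 1))
result = model.fit()
print(result.forecast(steps=3).round(1))
```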
(e) Neural Networks - Mimicking the Intricacies of the Human Brain
for Complex Learning
Neural networks are biologically inspired data processing systems
that use historical and present data to forecast future values. This
model identifies complicated connections lying deep in data. The
neural network model consists of layers that accept data (input
layer), compute predictions (hidden layer), and provide output
(output layer) in the form of a single prediction.
Neural networks represent a class of machine learning models inspired
by the architecture of the human brain. Comprising interconnected
nodes, or neurons, organized into layers, neural networks are adept
at learning intricate patterns and relationships within data. They
consist of an input layer, hidden layers that process information, and
an output layer that generates predictions. The strength of neural
networks lies in their capacity to automatically extract features
from raw data, making them particularly powerful in tasks such
as image recognition, natural language processing, and complex
pattern recognition. Deep neural networks, often referred to as
deep learning, involve numerous hidden layers, enhancing their
ability to capture intricate and hierarchical representations. While their computational complexity and demand for substantial data can
pose challenges, neural networks stand at the forefront of cutting-
edge artificial intelligence applications, showcasing their prowess
in addressing complex problems across diverse domains.
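A compact neural-network sketch using scikit-learn's MLPClassifier on the bundled digits dataset (chosen only for convenience) illustrates the input-hidden-output structure described above; a deep-learning framework such as TensorFlow or PyTorch would be used for larger networks:
```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Input layer = 64 pixel features, one hidden layer of 32 neurons, output = 10 classes
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
net.fit(X_train, y_train)

print("Test accuracy:", round(net.score(X_test, y_test), 3))
```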
(5) DECIDE - MODEL EVALUATION- ASSESSING PERFORMANCE
FOR INFORMED DECISION-MAKING
In this step, the efficiency of the predictive model applied in the previous
step is analysed by performing various tests. Different measures are used
for evaluating the performance of predictive models.
For regression models: Mean Squared Error (MSE), Root Mean Squared
Error (RMSE), R Squared (R2 Score).
For classification models: F1 Score, Confusion Matrix, Precision, Recall,
AUC-ROC, and Percent Correct Classification.

Figure 7.2 : Confusion Matrix


Other performance metrics, such as the Concordant-Discordant Ratio, Gain and Lift Chart, and the Kolmogorov-Smirnov chart, are also used for
evaluating a prediction model.
Model evaluation is a critical phase in predictive learning, essential for
assessing the performance and reliability of machine learning models.
The effectiveness of a predictive model is determined by its ability to generalize well to new, unseen data. Common evaluation metrics include
accuracy, precision, recall, F1 score, and area under the Receiver Operating
Characteristic (ROC) curve, among others. These metrics offer insights
into the model’s performance in terms of correctly classified instances,
false positives, false negatives, and the trade-offs between precision and
recall. Cross-validation techniques, such as k-fold cross-validation, aid
in robustly assessing a model’s performance by mitigating the impact
of data variability. The ultimate goal of model evaluation is to provide
stakeholders with a clear understanding of the model’s strengths and
limitations, enabling informed decision-making regarding its deployment,
optimization, or potential adjustments to meet specific business or
application requirements.
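To make these metrics concrete, the sketch below computes a confusion matrix, accuracy, precision, recall and F1 score for a small set of hypothetical true and predicted labels (both label vectors are invented for illustration):
```python
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# Hypothetical binary labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", round(precision_score(y_true, y_pred), 2))
print("Recall   :", round(recall_score(y_true, y_pred), 2))
print("F1 score :", round(f1_score(y_true, y_pred), 2))
```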
(6) DEPLOY - MODEL DEPLOYMENT - BRIDGING INSIGHT TO
ACTION
During the sixth step of predictive Analysis, the evaluated model is
deployed in a real-world environment for day-to-day decision-making
processes. For example, a new sales-boosting engine will be integrated
into the e-commerce platform to recommend high-rated products to the
customer.
Model deployment marks the culmination of the predictive learning
journey, transitioning from insightful analyses to practical applications.
Once a machine learning model has been trained, validated, and fine-
tuned, deployment involves integrating it into real-world systems to make
predictions on new, unseen data. The deployment process encompasses
considerations of scalability, efficiency, and integration with existing
infrastructure. Cloud-based solutions, containerization, and Application
Programming Interfaces (APIs) play pivotal roles in streamlining
this transition. Continuous monitoring and feedback mechanisms are
implemented to ensure the model’s ongoing relevance and accuracy in
dynamic environments. Model deployment is a crucial bridge connecting
predictive insights to actionable outcomes, enabling organizations to
leverage the power of machine learning for informed decision-making and
improved business processes. Successful deployment hinges not only on
the technical prowess of the model but also on the seamless integration
of predictive analytics into the operational fabric of the organization.

(7) DEPURATE - MONITOR AND REFINE THE MODEL - NURTURING CONTINUOUS IMPROVEMENT
The predictive analysis process is a continuous, iterative, and cyclic
process that regularly monitors and refines the model as per the actual
data sets collected over time.
The journey of a predictive model does not conclude with deployment;
instead, it enters a phase of continuous monitoring and refinement.
Model performance can be influenced by shifts in the underlying data
distribution, changes in user behaviour, or evolving business dynamics.
Regular monitoring involves tracking key performance metrics, identifying
deviations, and ensuring the model’s adaptability to unforeseen circumstances.
Anomalies or deteriorating accuracy may prompt the initiation of refinement
processes. This refinement can involve retraining the model with updated
data, adjusting hyper parameters, or even considering the incorporation
of more advanced techniques. The iterative nature of model monitoring
and refinement ensures that predictive models remain agile, resilient, and
aligned with the ever-changing landscape they seek to interpret. It is a
crucial practice for organizations committed to extracting enduring value
from their predictive learning initiatives, fostering a culture of continuous
improvement in the realm of data-driven decision-making.

IN-TEXT QUESTIONS
1. How many steps are there in predictive modelling?
(a) Five
(b) Six
(c) Seven
(d) Eight
2. The process termed ‘Depurate’ means:
(a) Declare
(b) Deploy
(c) Derive
(d) Refine

3. R-squared is a performance metric for:
(a) Classification model
(b) Regression model
(c) Fashion business
(d) Healthcare

7.4 Predictive Analytics Applications and Benefits - Marketing, Healthcare, Operations and Finance
Predictive analytics is being extensively used in various industries for
different business problems for the decision-making process. Predictive
models have helped businesses improve their operations, customer service,
behaviour, and outreach. Many business owners have been able to identify why regular customers switched to a competitor. Predictive analytics plays a crucial
role in advertising and marketing. A few of the arenas where predictive
analytics is being used widely are briefed below:
1. MARKETING:
‹ Applications: Predictive analytics in marketing helps businesses
anticipate customer behaviours, tailor marketing strategies, and
optimize campaigns. It enables customer segmentation, personalized
recommendations, and churn prediction.
‹ Benefits: Improved targeting leads to higher conversion rates, increased
customer satisfaction, and more effective resource allocation.
Marketers can anticipate trends, optimize ad spend, and enhance
overall campaign ROI.
(A) Customer Segmentation:
Applications: Predictive analytics assists in dividing
customers into segments based on their behaviours,
preferences, and demographics. This segmentation allows
marketers to tailor their messaging and campaigns to
specific audience groups.
Benefits: Improved targeting ensures that marketing efforts
resonate with the right audience, leading to higher engagement, increased conversion rates, and more
effective use of resources.
(B) Churn Prediction:
Applications: Predictive analytics helps identify customers
who are likely to churn or stop using a product or service.
By analyzing historical data, the model predicts potential
churners based on patterns and behaviours.
Benefits: Early identification of potential churners allows
marketers to implement targeted retention strategies, such as
personalized offers or loyalty programs, to retain valuable
customers.
(C) Personalized Recommendations:
Applications: E-commerce and content platforms leverage
predictive analytics to offer personalized product or content
recommendations. By analyzing user behaviours and preferences,
the system predicts what items or content a user is likely to
be interested in.
Benefits: Enhanced user experience, increased customer satisfaction,
and higher conversion rates as customers are presented with
products or content that align with their interests.
(D) Campaign Optimization:
Applications: Predictive analytics helps optimize marketing
campaigns by forecasting the performance of different channels,
messages, or creatives. Marketers can allocate budgets based
on predicted outcomes to maximize impact.
Benefits: Improved return on investment (ROI) as marketers can
focus resources on the most effective channels and messages,
leading to better campaign results and cost efficiency.
(E) Lead Scoring:
Applications: Predictive analytics aids in lead scoring by
evaluating the likelihood that a prospect will convert into a
customer. By analyzing various factors such as engagement,
demographics, and behaviour, the model assigns scores to
leads.

Benefits: Sales teams can prioritize leads with higher scores,
focusing efforts on prospects more likely to convert, resulting
in improved conversion rates and a more efficient sales process.
(F) Cross-Sell and Upsell:
Applications: Predictive analytics identifies opportunities for
cross-selling or upselling by analyzing customer purchase
history and behaviour. The model predicts which additional
products or services a customer is likely to be interested in.
Benefits: Increased revenue as marketers can strategically
promote complementary products or premium upgrades to
existing customers, maximizing the lifetime value of each
customer.
Common Benefits in Marketing:
Increased ROI: Predictive analytics helps marketers allocate
resources more efficiently, leading to improved campaign
performance and a higher return on investment.
Enhanced Customer Experience: Personalized marketing efforts
based on predictive insights result in a more tailored and
relevant experience for customers, fostering brand loyalty.
Strategic Decision-Making: Marketers can make data-driven
decisions, optimizing strategies and tactics based on insights
derived from predictive analytics.
Competitive Edge: Organizations leveraging predictive analytics
gain a competitive advantage by staying ahead in the rapidly
evolving landscape of marketing strategies.
In the marketing arena, predictive analytics serves as a powerful tool for
maximizing the impact of campaigns, improving customer engagement,
and driving business growth through strategic decision-making.
2. HEALTHCARE:
‹ Applications: Predictive analytics plays a crucial role in healthcare for
patient risk stratification, disease prediction, and resource allocation.
It aids in identifying high-risk patients, optimizing treatment plans,
and reducing readmission rates.

‹ Benefits: Early detection of diseases, personalized treatment plans,
and efficient resource utilization contribute to improved patient
outcomes, cost reduction, and enhanced overall healthcare delivery.
(A) Disease Prediction and Prevention:
‹ Applications: Predictive analytics models analyze patient
data to predict the likelihood of diseases such as
diabetes, heart disease, or sepsis. By identifying high-
risk individuals, healthcare providers can implement
preventive measures and interventions.
‹ Benefits: Early detection and prevention lead to improved
patient outcomes, reduced healthcare costs, and a shift
towards proactive and personalized healthcare.
(B) Patient Readmission Prediction:
‹ Applications: Predictive analytics helps forecast the likelihood
of patient readmission based on factors such as medical
history, treatment adherence, and post-discharge care.
This allows healthcare providers to allocate resources
effectively.
‹ Benefits: Reduced hospital readmissions, improved patient
care coordination, and optimized resource utilization
contribute to enhanced patient care quality.
(C) Resource Optimization and Capacity Planning:
‹ Applications: Predictive analytics assists hospitals in
forecasting patient admission rates, optimizing bed
utilization, and planning staffing levels. This is particularly
crucial during peak times or public health crises.
‹ Benefits: Efficient resource allocation, reduced wait times,
and improved patient flow within healthcare facilities
result in enhanced operational efficiency and patient
satisfaction.
(D) Medication Adherence:
‹ Applications: Predictive analytics models predict patient
adherence to medication regimens based on historical data and patient behavior. This helps healthcare providers
identify individuals at risk of non-adherence.
‹ Benefits: Improved medication adherence, better disease
management, and reduced healthcare costs through
prevention of complications associated with non-compliance.
(E) Fraud Detection and Healthcare Billing:
‹ Applications: Predictive analytics is employed to detect
fraudulent activities in healthcare billing and insurance
claims. By analyzing patterns and anomalies, the model
identifies potentially fraudulent claims.
‹ Benefits: Cost savings for healthcare payers, reduced fraud-
related losses, and a more streamlined and transparent
healthcare billing system.
(F) Population Health Management:
‹ Applications: Predictive analytics supports population
health management by identifying at-risk populations
and predicting healthcare trends within communities.
This facilitates targeted interventions and public health
strategies.
‹ Benefits: Improved public health outcomes, optimized
resource allocation, and a proactive approach to managing
the health of communities.
(G) Telehealth Triage and Remote Patient Monitoring:
‹ Applications: Predictive analytics aids in triaging patients
for telehealth consultations based on symptom severity
or monitoring patients remotely. Algorithms can predict
deteriorating health conditions.
‹ Benefits: Enhanced accessibility to healthcare, early
intervention for high-risk patients, and reduced healthcare
system strain during emergencies.
Common Benefits in Healthcare:
‹ Early Intervention: Predictive analytics enables early identification
of health risks, allowing for timely interventions and preventive
measures.

‹ Cost Savings: By predicting and preventing health issues, healthcare
costs associated with emergency care and complications are reduced.
‹ Enhanced Patient Outcomes: Improved patient care coordination,
personalized treatment plans, and proactive management contribute
to better patient outcomes.
‹ Efficient Resource Utilization: Predictive analytics optimizes the
allocation of healthcare resources, reducing waste and ensuring
resources are directed where they are needed most.
In the healthcare sector, predictive analytics serves as a transformative
force, offering a data-driven approach to patient care, resource management,
and public health strategies. It contributes to a shift from reactive to
proactive healthcare, ultimately improving patient outcomes and the
overall efficiency of healthcare systems.
3. OPERATIONS:
‹ Applications: In operations, predictive analytics is used for demand
forecasting, supply chain optimization, maintenance prediction, and
quality control. It helps businesses streamline processes, reduce
downtime, and enhance overall efficiency.
‹ Benefits: Efficient inventory management, timely maintenance, and
optimized production schedules result in cost savings, minimized
disruptions, and improved customer satisfaction through timely
deliveries.
(A) Demand Forecasting:
‹ Applications: Predictive analytics aids in forecasting product
demand by analyzing historical sales data, market trends,
and external factors. This helps organizations optimize
inventory levels and production schedules.
‹ Benefits: Reduced stockouts, minimized excess inventory,
and improved overall supply chain efficiency, leading
to cost savings and enhanced customer satisfaction.
(B) Supply Chain Optimization:
Applications: Predictive analytics models analyze various
elements of the supply chain, including supplier performance, logistics, and production processes. This optimization ensures
timely and cost-effective delivery of goods.
‹ Benefits: Improved supply chain visibility, reduced lead
times, and enhanced overall efficiency in logistics and
distribution.
(C) Maintenance Prediction:
‹ Applications: Predictive analytics is utilized for predicting
equipment and machinery failures by analyzing historical
maintenance data and equipment performance metrics.
This enables proactive maintenance planning.
‹ Benefits: Minimized downtime, extended equipment lifespan,
and optimized maintenance schedules, resulting in cost
savings and improved operational reliability.
(D) Quality Control:
‹ Applications: Predictive analytics is applied in quality control
processes to predict potential defects or deviations from
quality standards. This allows for timely interventions
and process adjustments.
‹ Benefits: Reduced waste, improved product quality, and
streamlined manufacturing processes contribute to cost
savings and increased customer satisfaction.
(E) Workforce Planning:
‹ Applications: Predictive analytics assists in workforce
planning by analyzing historical data on employee
performance, absenteeism, and turnover. This helps
organizations optimize staffing levels and skills.
‹ Benefits: Efficient workforce management, improved
productivity, and cost-effective labor allocation contribute
to overall operational effectiveness.
(F) Equipment Optimization:
‹ Applications: Predictive analytics models analyze equipment
data to predict when maintenance or upgrades are needed.
This ensures that equipment operates at peak efficiency
levels.

‹ Benefits: Improved equipment reliability, optimized energy
usage, and reduced operational costs through efficient
use of machinery and technology.
(G) Order Fulfillment Optimization:
‹ Applications: Predictive analytics is employed to optimize
order fulfillment processes, predicting order lead times,
and identifying potential delays. This allows for better
management of customer expectations.
‹ Benefits: Improved on-time delivery, enhanced customer
satisfaction, and streamlined order fulfillment processes
contribute to overall operational efficiency.
Common Benefits in Operations:
‹ Cost Reduction: Predictive analytics helps organizations identify
inefficiencies, streamline processes, and optimize resource allocation,
leading to overall cost reduction.
‹ Increased Efficiency: By predicting and mitigating potential issues,
operations become more streamlined and efficient, resulting in
improved productivity.
‹ Enhanced Decision-Making: Predictive analytics provides actionable
insights for decision-makers, enabling them to make informed and
strategic decisions to optimize operations.
‹ Improved Customer Satisfaction: Optimized operations, efficient
supply chains, and timely deliveries contribute to improved customer
satisfaction and loyalty.
In operations, predictive analytics transforms traditional approaches by
providing organizations with the tools to anticipate, plan, and respond to
dynamic challenges, ultimately enhancing overall operational efficiency
and effectiveness.
4. FINANCE:
‹ Applications: In finance, predictive analytics is employed for credit
scoring, fraud detection, investment risk assessment, and customer
churn prediction. It aids in making informed lending decisions,
identifying fraudulent activities, and optimizing investment portfolios.

‹ Benefits: Enhanced risk management, improved fraud prevention, and
data-driven investment decisions contribute to financial stability,
regulatory compliance, and increased profitability for financial
institutions.
(A) Credit Scoring:
‹ Applications: Predictive analytics models analyze various
financial and non-financial factors to assess the
creditworthiness of individuals or businesses. This aids
in making informed lending decisions.
‹ Benefits: Improved risk management, reduced default rates,
and optimized lending practices contribute to a healthier
loan portfolio and financial stability.
(B) Fraud Detection:
‹ Applications: Predictive analytics is used for detecting
fraudulent activities in financial transactions. Models
analyze patterns and anomalies to identify potentially
fraudulent transactions.
‹ Benefits: Enhanced security, minimized financial losses
due to fraud, and improved trust among customers and
stakeholders.
(C) Investment Risk Assessment:
‹ Applications: Predictive analytics aids in assessing
investment risks by analyzing market trends, historical
data, and macroeconomic factors. This helps investors
make informed decisions.
‹ Benefits: Improved portfolio management, reduced exposure
to high-risk assets, and increased returns on investments
through data-driven decision-making.
(D) Customer Lifetime Value Prediction:
‹ Applications: Predictive analytics models analyze customer
behavior and historical data to predict the potential
lifetime value of a customer. This assists in customer
acquisition and retention strategies.

‹ Benefits: Targeted marketing efforts, personalized customer
engagement, and optimized resource allocation for customer
management contribute to increased profitability.
(E) Market Trend Forecasting:
‹ Applications: Predictive analytics is employed to forecast
market trends and fluctuations by analyzing historical
market data, news sentiment, and economic indicators.
‹ Benefits: Informed investment decisions, proactive risk
management, and strategic planning based on predicted
market conditions contribute to financial success.
(F) Personalized Financial Planning:
‹ Applications: Predictive analytics assists in providing
personalized financial advice and planning by analyzing
individual financial behaviors, goals, and market conditions.
‹ Benefits: Improved financial literacy, customized investment
strategies, and enhanced customer satisfaction through
tailored financial services.
(G) Loan Default Prediction:
‹ Applications: Predictive analytics models predict the likelihood
of loan default by analyzing borrower characteristics,
financial history, and economic indicators.
‹ Benefits: Enhanced risk management, reduced non-performing
loans, and improved overall financial stability for lending
institutions.
Common Benefits in Finance:
‹ Risk Management: Predictive analytics enhances risk assessment,
allowing financial institutions to proactively manage and mitigate
risks associated with lending, investments, and fraud.
‹ Operational Efficiency: Efficient resource allocation, streamlined
processes, and improved decision-making contribute to overall
operational efficiency in financial institutions.
‹ Customer Satisfaction: Personalized financial services, targeted
marketing, and efficient customer management lead to increased
customer satisfaction and loyalty.

‹ Financial Performance: Informed decision-making based on predictive
analytics contributes to improved financial performance, profitability,
and long-term sustainability.
In the financial sector, predictive analytics serves as a strategic tool for
making data-driven decisions, managing risks, and optimizing various
financial processes. It enables financial institutions to stay competitive,
adapt to market changes, and provide personalized services to their clients.
Predictive Analysis – Wide Usage and common Benefits Across
Industries:
An organisation uses predictions based on past patterns to manage
inventories, workforce, marketing campaigns, and most other facets of
operation.
In the banking sector, predictive analysis has emerged as a powerful
tool with multifaceted applications. One crucial area is credit risk
assessment, where predictive models analyze vast datasets to evaluate the
creditworthiness of borrowers, aiding in prudent lending decisions and
risk management. Fraud detection is another pivotal application, where
predictive analytics scrutinizes transaction patterns, user behaviour, and
anomalies to identify potentially fraudulent activities swiftly, safeguarding
both financial institutions and their clients. Moreover, predictive analytics
plays a vital role in customer relationship management by predicting
customer preferences, behaviour, and potential churn, enabling banks to
tailor services, marketing campaigns, and retention strategies.
In the realm of cybercrime detection, predictive analysis acts as a
frontline defense against evolving threats. By leveraging machine learning
algorithms, predictive analytics can identify patterns associated with
malicious activities, detect anomalies in network behaviour, and predict
potential security breaches. This proactive approach enables cybersecurity
teams to stay ahead of cyber threats, implement timely countermeasures,
and fortify the resilience of digital infrastructures. From predicting
phishing attempts to identifying unusual patterns in user access, predictive
analytics provides a dynamic and adaptive defense mechanism, crucial in
the ever-evolving landscape of cyber threats. The combined application
of predictive analysis in banking and cybercrime detection underscores
its transformative impact on safeguarding financial systems and digital
assets in an increasingly interconnected world.


HR Management: Predictive analysis plays a pivotal role in revolutionizing
HR management by offering valuable insights across various domains. In
talent acquisition, predictive models analyze historical hiring data, candidate
attributes, and performance metrics to predict the success of candidates in
specific roles. This facilitates data-driven recruitment strategies, reducing
time-to-fill positions and ensuring optimal candidate fit. Employee retention
benefits from predictive analytics as models predict potential churn by
analyzing factors such as job satisfaction, performance, and historical
trends, allowing organizations to implement proactive retention strategies.
Workforce planning is optimized as predictive analysis forecasts future
staffing needs, aligning HR strategies with organizational growth and
changes. Overall, predictive analysis empowers HR professionals to make
strategic decisions, enhance talent acquisition, and foster a more engaged
and productive workforce.
Boosting Sales: In the realm of sales, predictive analysis is a game-
changer, offering insights that empower sales teams to boost performance
and revenue. Lead scoring and prioritization benefit from predictive models
that assess the likelihood of a lead converting into a customer, enabling
sales teams to focus on high-potential opportunities. Sales forecasting is
enhanced as predictive analytics analyzes historical sales data, market
trends, and external factors, providing accurate insights into future sales
performance. Customer segmentation becomes more effective through
predictive analysis, allowing sales teams to tailor strategies based on
buying behaviour, preferences, and demographics. Churn prediction helps
in retaining customers by identifying those at risk of leaving, leading to
targeted retention efforts. Overall, predictive analysis in sales contributes
to more efficient resource allocation, improved customer satisfaction, and
increased revenue through strategic decision-making and personalized
selling strategies:
‹ Data-Driven Decision-Making: Predictive analytics empowers
organizations to base decisions on data and insights rather than
intuition, fostering a culture of informed decision-making.
‹ Cost Savings: By anticipating future trends, businesses can optimize
resources, reduce unnecessary expenses, and minimize operational
inefficiencies, leading to cost savings.

‹ Customer Satisfaction: Tailoring products and services based on
predictive insights enhances customer satisfaction by meeting their
specific needs and expectations.
‹ Competitive Advantage: Organizations leveraging predictive analytics
gain a competitive edge by staying ahead of market trends, adapting
to changing conditions, and outperforming competitors.
‹ Security: A hybrid automation and predictive analytics model helps
organisations keep their data secure. Patterns of suspicious and
fraudulent activities over internet usage, weak passwords, unusual
behaviour, and fake reviews are detected using predictive analytics.
‹ Risk reduction: Finance and banking exploit predictive models to
identify credit frauds, provide adequate insurance coverage, and
reduce risk profiles.
‹ Operational efficiency: Operational costs are optimised using
predictive models in delivery partners, travel management, room
occupancy, and employee retention policy.
‹ Improved decision-making: Predictive analytics can provide insight
to inform the decision-making process in business and help expand
or add to a product line.

7.5 Text Analysis


A type of data that is ubiquitous and highly unstructured is text. All
applications, organisations, and industries communicate using text data.
This text data exists in different languages across the world. Log Analysis,
medical records, educational records, customer profiles, customer feedback,
comments, tweets, complaints, inquiries, emails, status updates, search
queries and any communication between people is “coded” as text. The
existence of text data in a large variety of form, language and structure
makes it highly unstructured. Exploiting this vast amount of data requires
converting it to a meaningful form.
Text is collected in a document. A document is one piece of text, whether
large or small. A document could be a single sentence, a 100-page report,
a Twitter comment or a Facebook post. A document is composed of
individual tokens or terms or words. A collection of documents is called
a corpus.


Bag of Words: It is a collection of individual words, ignoring grammar,
word order, sentence structure, and punctuation but treating every word
in a document as an essential keyword.

7.6 Analysis of Unstructured Data


Text is often referred to as “unstructured” data. It implies that text does
not have the structure that is usually expected, for instance, tables of
records with fields having fixed meanings (essentially, collections of
feature vectors) and links between the tables. Text is linguistic and, as
such, follows different language structures intended for human usage,
not for computers.
The text comprises words with varying lengths and text fields with
varying numbers of words, grammatically poor sentences, misspelt words,
running words, abbreviations, and unexpected punctuation. Consider the
following review of a movie.
“the film is “uneven, disjointed, hmm…..plot makes no sense.. real sense”.
Ouch. “It is a movie enormously pleased with itself. It never lets us forget
how clever it is, every exhausting minute.”
The above movie review is mixed, and it is difficult to say whether the
review is for or against the movie. Therefore, text must undergo much
pre-processing before it can be used as input to an analytical algorithm.

7.7 In-Database Analytics


Consider the following reviews of a product:
Review 1: The night gel is sticky and oily.
Review 2: The gel cream is very, very light weight.
Review 3: The gel cream makes the skin very oily.
Bag of words representation
The bag-of-words representation treats documents as bags—multisets—of
words, ignoring word order and other grammatical structures. It represents each document as a vector of word counts over the vocabulary.

For the above three reviews, the Bag-of-Words vectors are formed as below:
Review the night Gel is Sticky oily cream very light weight skin makes Total
1 1 1 1 1 1 1 0 0 0 0 0 0 6
2 1 0 1 1 0 0 1 2 1 1 0 0 8
3 1 0 1 0 0 1 1 1 0 0 1 1 7
Drawbacks of Bag-of-Words
‹ New words will increase the size and length of the vector.
‹ Vectors will contain a large number of zeroes for other vectors.
‹ No information on the word order is maintained.
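The same kind of bag-of-words vectors can be produced programmatically. The sketch below uses scikit-learn's CountVectorizer on the three reviews; note that the library lowercases the text and keeps every token (including "and"), so its vocabulary may differ slightly from the hand-built table above:
```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "The night gel is sticky and oily.",
    "The gel cream is very, very light weight.",
    "The gel cream makes the skin very oily.",
]

vectorizer = CountVectorizer()           # tokenise, lowercase, build vocabulary
bow = vectorizer.fit_transform(reviews)  # sparse document-term count matrix

print(vectorizer.get_feature_names_out())
print(bow.toarray())                     # one row of word counts per review
```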
Term Frequency (TF)
Term frequency is a numerical statistic that reflects the importance of a
word in a document collection or corpus.
TF = (frequency of the word in the document) / (total number of terms in the document)
TF of each word
R the night Gel is Sticky oily cream very light weight skin makes Total
1 1/6 1/6 1/6 1/6 1/6 1/6 0 0 0 0 0 0 6
2 1/8 0 1/8 1/8 0 0 1/8 2/8 1/8 1/8 0 0 8
3 1/7 0 1/7 0 0 1/7 1/7 1/7 0 0 1/7 1/7 7
Inverse Document Frequency (IDF)
The sparseness (low occurrence) of a term t is commonly measured by an equation called inverse document frequency:
IDF(t) = 1 + log(Total number of documents / Number of documents containing term t)
Term Frequency-Inverse Document Frequency (TFIDF)
The product of Term Frequency (TF) and Inverse Document Frequency (IDF) is commonly referred to as TFIDF. The TFIDF value of a term t in a given document d is thus:
TFIDF(t, d) = TF(t, d) × IDF(t)
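The sketch below computes TF, IDF and TFIDF exactly as defined above for a few terms of the three reviews; library implementations such as scikit-learn's TfidfVectorizer use slightly different IDF and normalisation conventions, so their numbers will not match this formula exactly:
```python
import math

docs = [
    "the night gel is sticky and oily".split(),
    "the gel cream is very very light weight".split(),
    "the gel cream makes the skin very oily".split(),
]

def tf(term, doc):
    # frequency of the word in the document / total number of terms in the document
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # 1 + log(total number of documents / number of documents containing the term)
    df = sum(1 for d in corpus if term in d)
    return 1 + math.log(len(corpus) / df)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

for term in ["gel", "oily", "skin"]:
    print(term, [round(tfidf(term, d, docs), 3) for d in docs])
```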
N-grams sequence
A sequence of n words together is called an n-gram. The sequence of two words in a document is called a bi-gram, and the sequence of three words together is called a tri-gram. For example,
in the above cases, the sequence of Gel Cream is a bigram.
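A short sketch listing the bigrams and trigrams of one review with a plain zip-based helper (CountVectorizer(ngram_range=(2, 2)) would do the same with scikit-learn):
```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the gel cream makes the skin very oily".split()
print("bigrams :", ngrams(tokens, 2))
print("trigrams:", ngrams(tokens, 3))
```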
Named Entity Extraction
Named Entity Extraction (NER) is a data pre-processing step to identify critical information in the text, such as names, organisations, and locations. An
entity is a thing that is being referred to in a document. For example,
Gel cream is a Named Entity in the above cases.
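Named entities can be extracted with an off-the-shelf NLP library. The sketch below assumes spaCy and its small English model en_core_web_sm are installed; the example sentence is invented, and the labels produced depend on the model used:
```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple opened a new research centre in Bengaluru in 2023.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)   # e.g. ORG, GPE, DATE
```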
Topic Models
Topic models treat the set of topics in a corpus as a separate, intermediate layer. Each
document constitutes a sequence of words, and the words are mapped
to one or more topics. The topics are learned from the data (often via
unsupervised data mining). The final classifier is defined in terms of the
intermediate topics rather than words. A few methods for creating topic
models include matrix factorisation, Latent Semantic Indexing, Probabilistic
Topic Models, and Latent Dirichlet Allocation.
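A compact Latent Dirichlet Allocation sketch using scikit-learn; the four-sentence corpus is invented, so the "topics" it discovers are only illustrative (real topic modelling requires a much larger corpus):
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the gel cream makes the skin oily",
    "the night gel is sticky and oily",
    "stock prices and market trends moved higher",
    "investors watch market risk and stock returns",
]

# Build a document-term count matrix, dropping common English stop words
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(corpus)
vocab = vectorizer.get_feature_names_out()

# Learn two latent topics from the word counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the top words of each learned topic
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:4]
    print(f"topic {k}:", [vocab[i] for i in top])
```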

7.8 Summary
In this lesson, the basic concepts of predictive analytics are introduced: a model is built that can provide an estimated value of a target variable for a new, unseen example. The 7Ds of predictive modelling are then described,
comprising Defining requirements, data collection, data analysis, data
modelling, deciding on the model based on various performance metrics,
and deploying the model for use in business processes. The deployed
predictive model is then further monitored and refined to estimate better
the target variable that provides solutions to real-world business problems.
For example, historical data about employee retention and moving to other
competitive companies will help an organisation frame “employee retention
policies” and detect frauds that will, in turn, optimise the recruitment
process in the long run. The lesson briefs the benefits and applications
of predictive modelling in different organisations, marketing, sales,
healthcare, e-commerce platforms, operations, finance, fraud detection,
reducing risk, education, and data security. The lesson further concluded
with text analytics and various parameters for its evaluation. The lesson
introduces a common way to turn text into a feature vector: to break
each document into individual words (its “bag of words” representation) and assign values to each term using the TFIDF formula. The approach
is relatively simple, inexpensive, versatile, and requires little domain
knowledge.

7.9 Answers to In-Text Questions

1. (c) seven
2. (d) refine
3. (b) regression model

7.10 Self-Assessment Questions


1. How is predictive analytics revolutionising business and industry
growth?
2. What is meant by predictive analytics? What are the basic steps in
the process of predictive analytics?
3. Explain the role of data in predictive analytics.
4. What are the different models and performance metrics used in
predictive analytics?
5. What is Text analysis, and what are its applications?
6. How is predictive analytics being used in marketing, for example,
forecasting sales in the fashion business?
7. Enumerate challenges of text analytics in industry applications.

7.11 References
‹ What is predictive analytics? | IBM (Oct. 2023). https://www.ibm.com/topics/predictive-analytics
‹ Provost, F., & Fawcett, T. (2013). Data Science for Business: What
You Need to Know about Data Mining and Data-Analytic Thinking.
O’Reilly Media, Inc.
‹ Miller, T.W. (2014). Modeling Techniques in Predictive Analytics:
Business Problems and Solutions with R. Pearson FT Press.

‹ Ittoo, A., & van den Bosch, A. (2016). Text analytics in industry:
Challenges, desiderata and trends. Computers in Industry, 78, 96-
107.
‹ Sarkar, D. (2016). Text analytics with Python (Vol. 2). New York,
NY, USA: Apress.

7.12 Suggested Readings


‹ Waller, M. A., & Fawcett, S. E. (2013). Data science, predictive
analytics, and big data: a revolution that will transform supply
chain design and management. Journal of Business Logistics, 34(2),
77–84.
‹ Erhard, J., & Bug, P. (2016). Application of predictive analytics to
sales forecasting in the fashion business.
‹ Malik, M. M., Abdallah, S., & Ala’raj, M. (2018). Data mining and
predictive analytics applications for the delivery of healthcare services:
a systematic literature review. Annals of Operations Research, 270,
287-312.
‹ Attaran, M., & Attaran, S. (2019). Opportunities and challenges of
implementing predictive analytics for competitive advantage. Applying
business intelligence initiatives in healthcare and organisational
settings, 64-90.
‹ Kakulapati, V., Chaitanya, K. K., Chaitanya, K. V. G., & Akshay, P.
(2020). Predictive analytics of HR-A machine learning approach.
Journal of Statistics and Management Systems, 23(6), 959-969.

LESSON 8
Technology (Analytics) Solutions and Management of their Implementation in Organizations
Dr. Charu Gupta
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi
Email-Id: charugupta.sol.2023@gmail.com; charu.gupta@sol-du.ac.in

STRUCTURE
8.1 Learning Objectives
8.2 Introduction
8.3 Management of Analytics Technology Solution in Predictive Modelling
8.4 Predictive Modelling Technology Solutions
8.5 Summary
8.6 Answers to In-Text Questions
8.7 Self-Assessment Questions
8.8 References
8.9 Suggested Readings

8.1 Learning Objectives


‹ To gain insights into predictive analysis tools and techniques.
‹ To understand management of Technology solutions.


8.2 Introduction


In the ever-evolving landscape of data analytics, effective technology
management is paramount for organizations seeking to harness the full
potential of their data. Technology serves as the backbone of analytics,
enabling the collection, processing, and interpretation of vast amounts of
information. This lesson explores the crucial role of technology management in analytics, examining key considerations, challenges, and emerging trends. Modern businesses utilise sophisticated technologies to operate efficiently, respond immediately as market demand changes, and grow in an increasingly competitive environment. Technology has evolved to automate day-to-day business operations, streamline processes and provide quality services.
Technology managers must keep themselves updated with emerging
technologies and features that contribute to optimise the costs, enhance
innovation, reduce security risks and accomplish the business goals.
Technology Management: Navigating the Intersection of Business and
Technology
Technology management refers to the systematic planning, implementation,
monitoring, and control of technology resources within an organization. It
involves the strategic alignment of technology with the overall business
goals and objectives. The primary aim of technology management is to
ensure that technological resources are effectively leveraged to enhance
organizational performance, innovation, and competitiveness.
Key Components of Technology Management are briefed below:
1. Strategic Planning:
‹ Aligning technology initiatives with the broader business
strategy is a foundational aspect of technology management.
This involves identifying how technology can support and
drive the achievement of organizational goals.
2. Technology Acquisition:
‹ Selecting and acquiring appropriate technologies to meet
organizational needs is a critical function. This includes
evaluating vendor solutions, negotiating contracts, and ensuring
compatibility with existing systems.


3. Implementation and Integration:


‹ Effectively introducing new technologies into the organization
and integrating them with existing systems is crucial. This
involves careful planning, testing, and deployment to minimize
disruptions and maximize benefits.
4. Risk Management:
‹ Assessing and mitigating risks associated with technology,
including cybersecurity threats, data breaches, and system
failures, is an integral part of technology management. This
ensures the security and reliability of technological assets.
5. Resource Allocation:
‹ Efficiently allocating resources, both human and financial, is
essential for the successful implementation and maintenance of
technology initiatives. This includes budgeting for technology
upgrades, training, and ongoing support.
6. Performance Monitoring and Optimization:
‹ Continuously monitoring the performance of technology systems
and processes allows for timely identification of issues and
optimization opportunities. This includes evaluating the
efficiency, reliability, and effectiveness of technology solutions.
7. Innovation and Research:
‹ Fostering a culture of innovation and staying abreast of
technological advancements is vital for organizations to remain
competitive. Technology managers often engage in ongoing
research to identify emerging trends and potential opportunities.
8. Change Management:
‹ Managing the human aspect of technological changes is critical.
This involves addressing employee concerns, providing
training programs, and facilitating a smooth transition to new
technologies.
Technology management plays a key role in different fields, as briefed below:
1. Business and Industry:
‹ In business and industry, technology management ensures
that organizations stay technologically competitive, adapt to


market changes, and leverage innovations to improve products, services, and processes.
2. Information Technology (IT) Management:
‹ In the realm of IT, technology management involves overseeing
the development, deployment, and maintenance of IT systems.
This includes managing infrastructure, networks, software
development, and IT support.
3. Innovation Management:
‹ For organizations focused on innovation, technology management
plays a key role in identifying, nurturing, and implementing
innovative technologies that can give them a competitive edge.
4. Project Management:
‹ Technology management is closely linked to project management,
ensuring that technology projects are executed effectively,
within budget, and in alignment with organizational objectives.

8.3 Management of Analytics Technology Solution in


Predictive Modelling
In essence, technology management is about strategically navigating the
complex and ever-changing landscape of technology to drive positive
outcomes for organizations. It requires a blend of technical expertise,
business acumen, and leadership skills to make informed decisions that
contribute to the overall success and sustainability of the organization.
In the era of big data, organizations are increasingly turning to analytics
technology solutions to derive meaningful insights from vast and
complex datasets. Predictive modelling, a subset of analytics, empowers
businesses to anticipate future trends, make informed decisions, and gain a
competitive edge. Effective management of analytics technology solutions
is paramount to harnessing the full potential of predictive modelling.
This essay explores the key aspects of managing analytics technology
solutions for predictive modelling, addressing challenges, best practices,
and the evolving landscape.


8.3.1 Strategic Planning and Alignment

The foundation of successful technology management for predictive


modelling lies in strategic planning and alignment with organizational
goals. It is crucial to identify how predictive modelling can contribute
to the overall business strategy. This involves collaboration between data
scientists, analysts, and business leaders to ensure that the technology
solution aligns with the organization’s objectives, whether they be
improving customer satisfaction, optimizing operations, or enhancing
decision-making processes.

8.3.2 Data Acquisition and Preparation


A critical phase in predictive modelling is the acquisition and preparation
of data. Technology managers must oversee the implementation of data
pipelines, ensuring data quality, cleanliness, and relevance. Robust data
governance practices, metadata management, and the use of data cataloguing
tools are essential for creating a solid foundation for predictive modelling.
Effective management in this phase involves addressing data integration
challenges, handling missing values, and ensuring that the data used is
representative and unbiased.
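A minimal sketch of this phase, using Python and pandas, is given below. The file name, column names, and imputation choices are illustrative assumptions rather than a prescribed pipeline.

    import pandas as pd

    # Load raw records from a hypothetical source file (the name is illustrative).
    df = pd.read_csv("customer_records.csv")

    # Remove exact duplicate rows.
    df = df.drop_duplicates()

    # Impute missing numeric values with the column median (one of several possible strategies).
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

    # Drop rows where the assumed target column itself is missing.
    df = df.dropna(subset=["target"])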

8.3.3 Model Development and Implementation


The heart of predictive modelling lies in the development and implementation
of models. Technology managers play a crucial role in selecting the right
modelling techniques, overseeing the work of data scientists, and ensuring
that the models align with business requirements. This phase requires a
balance between accuracy and interpretability, and technology managers
must be vigilant in evaluating the ethical implications of the models,
particularly when they impact decision-making processes.
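As a minimal sketch of this phase, the snippet below trains and evaluates a simple classifier with scikit-learn. The synthetic dataset stands in for an organization's historical records, and the choice of logistic regression is purely illustrative.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    # Synthetic data as a stand-in for historical business records.
    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Fit an interpretable baseline model and assess it on held-out data.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"Hold-out AUC: {auc:.3f}")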

8.3.4 Scalability and Performance


As the volume and complexity of data grow, scalability becomes a key
concern. Effective management involves choosing technologies that can
scale with the organization’s evolving needs. Cloud-based solutions,


distributed computing frameworks, and parallel processing technologies are


often essential components for ensuring the scalability and performance
of predictive modelling solutions. Regular performance monitoring and
optimization efforts are also critical to maintaining efficiency.
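One small illustration of parallel processing is batch scoring across all available CPU cores with joblib, shown below. It reuses the model and test data from the earlier sketch and conveys the idea only; it is not a full distributed architecture.

    import numpy as np
    from joblib import Parallel, delayed

    def score_batch(fitted_model, batch):
        # Score one chunk of records; chunks are independent, so they can run in parallel.
        return fitted_model.predict_proba(batch)[:, 1]

    # Split the feature matrix into chunks and score them on all CPU cores.
    batches = np.array_split(X_test, 8)
    chunks = Parallel(n_jobs=-1)(delayed(score_batch)(model, b) for b in batches)
    scores = np.concatenate(chunks)
    print(scores.shape)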

8.3.5 Interpretability and Explainability


The interpretability of predictive models is gaining increased attention,
especially in industries with regulatory requirements or ethical considerations.
Technology managers must ensure that the chosen technology solutions
provide transparency into model outputs. This involves utilizing
interpretable algorithms, implementing model explainability techniques,
and communicating effectively with stakeholders to build trust in the
predictive modelling process.
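A simple, model-agnostic explainability technique is permutation importance, sketched below with scikit-learn for the model fitted earlier; feature indices are reported because the synthetic data has no named features.

    from sklearn.inspection import permutation_importance

    # How much does hold-out AUC drop when each feature is randomly shuffled?
    result = permutation_importance(
        model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=42
    )
    for i in result.importances_mean.argsort()[::-1]:
        print(f"feature {i}: mean importance {result.importances_mean[i]:.4f}")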

8.3.6 Continuous Monitoring and Model Maintenance


Predictive models are not static; they require continuous monitoring and
maintenance. Technology managers must implement monitoring systems to
detect model drift, changes in data patterns, or shifts in model performance.
This proactive approach ensures that predictive models remain accurate
and relevant over time. Additionally, managers must establish protocols
for regular model updates, retraining, and version control to keep the
predictive modelling system agile and responsive to changing conditions.
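One lightweight way to watch for data drift is to compare the distribution of each input feature at training time against a recent production batch, for example with a two-sample Kolmogorov–Smirnov test. In the sketch below, the production batch is simulated by reusing the test split from the earlier example; in practice it would come from the live scoring pipeline.

    from scipy.stats import ks_2samp

    X_new = X_test  # stand-in for a recent batch of production data

    # Flag features whose distribution differs markedly from training time.
    drifted = []
    for i in range(X_train.shape[1]):
        _, p_value = ks_2samp(X_train[:, i], X_new[:, i])
        if p_value < 0.01:
            drifted.append(i)
    print("Features flagged for possible drift:", drifted)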

8.3.7 Challenges and Mitigation Strategies


Managing analytics technology solutions for predictive modelling comes
with its set of challenges. These may include the need for skilled personnel,
data security concerns, and the dynamic nature of data. Technology
managers should address these challenges through comprehensive training
programs, robust cybersecurity measures, and the adoption of flexible
and adaptable technologies.
In conclusion, the effective management of analytics technology solutions
for predictive modelling is a multifaceted endeavour that requires strategic
planning, technological acumen, and a keen understanding of business
objectives. By navigating the complexities of data acquisition, model


development, scalability, interpretability, and ongoing maintenance, technology


managers can ensure that predictive modelling becomes a powerful tool
for organizations seeking actionable insights and a competitive advantage
in an increasingly data-driven world. As the landscape of analytics
continues to evolve, adept management will be the key to unlocking the
full potential of predictive modelling technology.

8.4 Predictive Modelling Technology Solutions


There are numerous predictive modelling technology solutions available,
catering to various industries and business needs. Here are some notable
predictive modelling tools and platforms:
1. Python with Scikit-Learn and TensorFlow:
Python is a widely used programming language for data science and machine learning. Scikit-Learn provides simple and efficient tools for predictive data analysis, while TensorFlow offers a comprehensive framework for building and deploying machine learning models (a brief illustrative sketch using TensorFlow appears after this list).
2. R:
R is another popular programming language for statistical computing
and predictive modelling. It offers a wide range of packages, such
as caret and randomForest, for building predictive models.
3. SAS Enterprise Miner:
SAS Enterprise Miner is a comprehensive data mining and predictive
analytics solution. It provides a visual interface for building, assessing,
and deploying predictive models, making it accessible to both data
scientists and business analysts.
4. IBM SPSS Modeler:
IBM SPSS Modeler is a powerful, versatile predictive analytics
platform. It supports the entire data mining process, from data
preparation to model deployment, and includes a variety of algorithms
for predictive modelling.
5. Microsoft Azure Machine Learning:
Azure Machine Learning is a cloud-based service by Microsoft that
enables the development, deployment, and management of machine


learning models. It supports various languages and frameworks,


including Python and R.
6. KNIME:
KNIME (Konstanz Information Miner) is an open-source platform
for data analytics, reporting, and integration. It provides a visual
workflow environment that allows users to design, execute, and
evaluate predictive models.
7. Alteryx:
Alteryx is a data blending and advanced analytics platform that
enables users to prepare, blend, and analyze data. It supports
predictive modelling through a user-friendly interface.
8. H2O.ai:
H2O.ai offers an open-source platform called H2O, as well as a commercial product called Driverless AI. It provides Automated Machine Learning (AutoML) capabilities, making it easier to build
and deploy predictive models.
9. Databricks:
Databricks provides a unified analytics platform built on top of
Apache Spark. It facilitates collaborative and scalable predictive
modelling with support for languages like Python, R, and Scala.
10. Google Cloud AI Platform:
Google Cloud AI Platform offers a suite of tools for building,
deploying, and managing machine learning models on Google Cloud.
It supports popular machine learning frameworks like TensorFlow
and scikit-learn.
These tools vary in terms of complexity, features, and integration capabilities,
allowing organizations to choose the one that best fits their requirements
and technical preferences. Additionally, many cloud platforms, such as
AWS (Amazon SageMaker), provide services and tools for predictive
modelling, contributing to the flexibility and scalability of predictive
analytics solutions.
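To give a flavour of the first tool in the list, the sketch below defines and trains a small neural network with TensorFlow's Keras API, reusing the synthetic training and test splits from the earlier scikit-learn example; the layer sizes and training settings are arbitrary illustrative choices.

    import tensorflow as tf

    # A minimal feed-forward network for the same binary classification task.
    nn = tf.keras.Sequential([
        tf.keras.Input(shape=(X_train.shape[1],)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    nn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
    nn.fit(X_train, y_train, epochs=5, batch_size=32, validation_split=0.2, verbose=0)
    loss, auc = nn.evaluate(X_test, y_test, verbose=0)
    print(f"Test AUC: {auc:.3f}")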


IN-TEXT QUESTIONS

1. Full form of KNIME is:


(a) Konstanz Information Miner
(b) Konstanz Information Modelling
(c) Knowledge Information Miner
(d) Knowledge Information Modelling
2. Which of the following is not open-source:
(a) H2O.ai
(b) IBM SPSS
(c) KNIME
(d) R

8.5 Summary
Effective predictive modelling technology management is crucial for
organizations aiming to extract meaningful insights from vast datasets. This
involves strategic planning, aligning technology initiatives with overarching
business goals, and ensuring that the chosen technology solutions integrate
seamlessly into existing infrastructures. Key aspects of management include
overseeing data acquisition and preparation processes, selecting appropriate
modelling techniques, and addressing scalability concerns. Interpretability
and explainability of models, along with continuous monitoring and
maintenance, are also paramount. Challenges such as talent acquisition,
data security, and dynamic data nature must be met with comprehensive
strategies. The successful management of predictive modelling technology
empowers organizations to make informed decisions, stay competitive,
and navigate the evolving landscape of data analytics.

8.6 Answers to In-Text Questions

1. (a) Konstanz Information Miner


2. (b) IBM SPSS


8.7 Self-Assessment Questions


1. What are the key components of Technology Management?
2. What are the challenges and mitigation strategies for managing
technology in predictive modelling?
3. Briefly elaborate on the features of predictive modelling tools that are open-source in nature.

8.8 References
‹ Gordon, J., Spillecke, D., et al. (2015). Marketing & Sales: Big Data, Analytics, and the Future of Marketing & Sales. McKinsey & Company.
‹ Miller, T.W. (2014). Modeling Techniques in Predictive Analytics: Business Problems and Solutions with R. Pearson FT Press.
‹ Attaran, M., & Attaran, S. (2019). Opportunities and challenges of
implementing predictive analytics for competitive advantage. Applying
business intelligence initiatives in healthcare and organizational
settings, 64-90.

8.9 Suggested Readings


‹ Sharon, C. I., & Suma, V. (2022). Predictive Analytics in IT Service
Management (ITSM). Data Mining and Machine Learning Applications,
175-193.
‹ Predictive Analytics: Opportunities, Challenges and Use Cases available
at https://lityx.com/wp-content/uploads/2016/05/Predictive-Analytics-
eBook.pdf

Glossary
Area Under the Curve (AUC): A numerical measure quantifying the area under the ROC
curve, indicating the classifier’s ability to distinguish between classes.
Association Rules: Data mining technique used to identify interesting patterns and
relationships between items or events in a dataset, often employed in market basket analysis
and recommendation systems.
AutoML (Automated Machine Learning): The use of automated tools and processes to
streamline and accelerate the end-to-end process of developing machine learning models,
from data preparation to model deployment.
Bernoulli Naive Bayes: A type of the Naive Bayes algorithm for binary or Boolean
features, typically used in text classification tasks like spam detection.
Bias: The error introduced in a predictive model due to oversimplified data or model
structure assumptions.
Big Data: Large amounts of data consisting of the four V’s of Volume, Velocity, Variety,
and Veracity.
Classification: A supervised machine learning technique that assigns pre-defined categories
or labels to data points based on their features.
Cloud-Based Solutions: Technology solutions hosted on cloud platforms, offering scalability,
flexibility, and accessibility for predictive modelling tasks without the need for extensive
on-premises infrastructure.
Clustering: A data analysis technique that involves grouping similar data points together
based on specific criteria or features, revealing inherent patterns or structures.
Confidence: It measures the robustness of the relationship between items A and B. The confidence for A→B represents the probability of B occurring when A is already present, expressed as P(B|A).
Confusion Matrix: A table summarising a classifier’s predictions and actual class labels,
providing detailed information about true positives, true negatives, false positives, and
false negatives.
Continuous Monitoring: The ongoing surveillance of predictive models to detect changes
in data patterns, model performance, and potential deviations, requiring timely intervention
for recalibration.


Customer Segmentation: The practice of categorising customers into


distinct groups based on shared characteristics or behaviours, aiding in
targeted marketing and personalised experiences.
Data Accuracy: It refers to how well the data accurately represents real-
world objects or events.
Data Analysis: The systematic process of inspecting, cleaning, transforming,
and interpreting data to discover valuable insights and patterns.
Data Analytics: The process of examining, cleaning, transforming, and
interpreting data to extract valuable insights and support decision-making.
Data Analytics Lifecycle: A structured framework that outlines the
stages and processes involved in data analysis, including data collection,
preparation, analysis, visualisation, and interpretation.
Data Completeness: It refers to the amount of expected data present in
the dataset without missing values.
Data Consistency: This ensures that data is uniform and adheres to
predefined standards.
Data Drift: The phenomenon where the statistical properties of input
data change over time, necessitating adjustments to predictive models to
maintain accuracy and relevance.
Data Exploration: The initial phase of data analysis that focuses on
understanding data characteristics, identifying patterns, and assessing
data quality.
Data Governance: The framework of policies, processes, and controls
that ensure high data quality, integrity, and security throughout the entire
data lifecycle.
Data Interpretation: The process of making sense of analysed data and
drawing meaningful conclusions to inform decision-making.
Data Lake: A storage repository that holds raw data in its native format,
allowing for flexible data processing and analysis.
Data Mart: A smaller, departmental, or subject-specific data warehouse
focusing on specific data needs and user groups within an organization.


Data Modelling: It is the process of designing data structure and


relationships in a data warehouse, including defining schemas and data
hierarchies.
Data Privacy and Security: The implementation of measures to protect
sensitive data, ensuring compliance with privacy regulations and safeguarding
against unauthorized access or data breaches.
Data Profiling: The process of summarising and examining the characteri-
stics of each variable in a dataset, including data types and unique values.
Data Reliability: This ensures that data is trustworthy and consistent
over some time.
Data Timeliness: Depending on the purpose, data quality depends on relevance and timeliness.
Data Cleaning: This involves the correction of errors, inconsistencies, and inaccuracies in datasets.
Data Transformation: This refers to data manipulation to make it uniform and consistent.
Data Visualisation: The representation of data through charts, graphs,
and visuals to facilitate data understanding and communication.
Data Warehouse: A centralized repository for storing, managing, and
analyzing data from various sources, designed to support business
intelligence and reporting.
Decision Tree: A classification algorithm that uses a tree-like structure
to model decisions and outcomes, often visualised as a flowchart.
Duplicate Detection: This refers to identifying and removing duplicate
records or entries.
Encoding: The process of mapping data values to visual properties like
position, colour, size, and shape in a visualisation.
Enterprise Data Warehouse (EDW): A comprehensive data warehouse
that serves the data needs of an entire organization, often the central
repository for enterprise data.
ETL (Extract, Transform, Load): The process of extracting data from
source systems, transforming it into a suitable format for analysis, and
loading it into a data warehouse.
F1-Score: It is computed by taking the harmonic mean of precision and
recall. It provides a balanced measure of a classifier’s performance, which
is especially useful for imbalanced datasets.


F-beta Score: A generalisation of the F1-score that allows adjusting the
balance between precision and recall using a parameter (beta).
Feature Engineering: The process of selecting, transforming, or creating
new features (variables) from raw data to enhance the performance of
predictive models.
Gaussian Naive Bayes: A type of the Naive Bayes algorithm suitable for
continuous data, assuming a Gaussian (normal) distribution of features.
Google Cloud AI Platform: Google Cloud AI Platform offers a suite
of tools for building, deploying, and managing machine learning models
on Google Cloud. It supports popular machine learning frameworks like
TensorFlow and scikit-learn.
Histogram: A graphical representation of the distribution of a single
variable, showing the frequency of data values within specific intervals.
Imbalanced Data: A situation in which one class of a classification
problem is significantly more prevalent than the other class, potentially
leading to biased models.
Imputation: This involves estimating values based on techniques such
as mean imputation or regression imputation to replace missing values.
Interpretability: The degree to which a predictive model’s results and
decisions can be understood and explained.
Label: Text added to a visualisation to provide context and identify data
points or categories.
Legend: A key that explains the meaning of colours, symbols, or other
visual elements used in a visualisation.
Line Chart: A type of data visualisation that displays trends and
relationships between two numerical variables over time.
Linear Regression: A statistical method used to model the relationship
between a dependent variable and one or more independent variables.
Logistic Regression: A statistical method used for binary classification,
modelling the relationship between predictor variables and the probability
of an event occurring.


MAR (Missing at Random): Missing data whose missingness depends on other observed variables but not on the missing values themselves.
Market Basket Analysis: A data mining technique that analyses customer
buying patterns to identify associations between products frequently bought
together and often used in retail for product placement and recommendation.
MCAR (Missing Completely at Random): Randomly occurring missing
data with no specific pattern.
Missing Data: Data values absent or unrecorded in the dataset are included.
MNAR (Missing Not at Random): Missing data whose missingness is related to the missing values themselves rather than to other observed variables.
Model Deployment: The process of integrating a trained predictive
model into the operational environment, allowing it to make real-time
predictions on new, unseen data.
Model Development: The phase in predictive modelling where mathematical
algorithms are selected, trained, and tested using historical data to create
a predictive model.
Model Interpretability: The ability to explain and understand how a
machine learning model makes predictions or classifications important
for making informed decisions.
Model Optimization: The iterative process of refining and improving
predictive models to enhance performance, often involving adjustments
to hyperparameters or feature selection.
Multicollinearity: A situation in regression analysis where two or more
independent variables are highly correlated, which can lead to unstable
coefficient estimates.
Multinomial Naive Bayes: A type of the Naive Bayes algorithm commonly
used for text classification, especially in cases where features represent
word frequencies.
Outlier Handling: Outliers must be found and dealt with during data
examination to ensure accuracy.
Overfitting: A modelling problem where a predictive model is too complex and fits the training data too closely, leading to poor generalization to new data.


Precision: A classification metric that quantifies the ratio of true positive
predictions to all positive predictions made by the model, with a focus
on minimising false positives.
Predictive Analytics: It is a branch of advanced analytics that makes
predictions about future outcomes by unravelling hidden patterns from
historical data combined with statistical modelling, data mining techniques
and machine learning.
Predictive Modelling: the process of employing data and statistical
algorithms to predict or categorise future events or results by drawing
insights from past data.
Predictors: The independent attributes in the dataset that are used to
predict the value of the target/ dependent variable.
Pruning: Removing branches or nodes from a decision tree to prevent
overfitting and improve model performance.
Recall (Sensitivity): A classification metric that calculates the ratio of
true positive predictions to all actual positive instances, emphasising
detecting all positives.
Receiver Operating Characteristic (ROC) Curve: A visual representation
illustrating a classifier’s performance, depicting the True Positive Rate
(Recall) against the False Positive Rate (1 - Specificity) at multiple
thresholds.
Regression: A predictive modelling technique used to model the relationship
between one or more independent variables and a continuous dependent
variable.
ROC Curve: Receiver Operating Characteristic Curve, a graphical
representation of a model’s ability to distinguish between true and false
positives.
Scalability: The ability of technology solutions to handle growing
volumes of data and increased computational demands without sacrificing
performance.
Scatter Plot: A visualisation that displays the relationship between two
numerical variables through points on a graph.


Semi-Structured Data: Data that does not conform to the structure of
traditional structured data but has some level of organization, often in a
hierarchical or nested format, e.g., XML or JSON.
Specificity (True Negative Rate): A classification metric that evaluates
the ratio of true negative predictions to all actual negative instances,
particularly crucial in mitigating false alarms.
Splitting Criteria: Criteria used to decide how to split nodes in a decision
tree, including Gini impurity and information gain.
Strategic Planning: The process of defining the long-term objectives
of predictive modelling initiatives and aligning them with the overall
business strategy.
Structured Data: Data organized into a fixed format with defined fields
and an explicit schema, typically stored in relational databases.
Support: It quantifies the frequency of the pattern. The support for association rule A→B corresponds to the likelihood of both items A and B co-occurring, which can be denoted by P(A ∪ B).
Talent Development: The cultivation of skilled professionals capable of
effectively managing and implementing predictive modelling technology,
often through training programs and skill enhancement initiatives.
Target: The dependent variable whose values are to be predicted using
a prediction model.
Text Analytics: It is the process of transforming unstructured text
documents into usable, structured data.
Tokenisation: It is the process of breaking apart a sentence or phrase
into its component pieces. Tokens are usually words or numbers.
Underfitting: A modelling problem where a predictive model is too
simple to capture the underlying patterns in the data.
Unstructured Data: Data that lacks a predefined structure and is typically
in the form of text, images, audio, or video.
Variety: The diversity of data types and sources, including structured,
semi-structured, and unstructured data.
Velocity: The speed at which data is generated, collected, and processed
in real-time.


Veracity: The trustworthiness and accuracy of data in a big data context.
Volume: The volume of data is considerable and exceeds current data
processing capabilities.

