7708 - Predictive Analytics and Big Data 2024
Content Writers
Aditi Priya, Hemraj Kumawat, Dr. Charu Gupta
Academic Coordinator
Mr. Deekshant Awasthi
Published by:
Department of Distance and Continuing Education
Campus of Open Learning/School of Open Learning,
University of Delhi, Delhi-110007
Printed by:
School of Open Learning, University of Delhi
DISCLAIMER
Printed at: Taxmann Publications Pvt. Ltd., 21/35, West Punjabi Bagh,
New Delhi - 110026 (200 Copies, 2023)
© Department of Distance & Continuing Education, Campus of Open Learning,
School of Open Learning, University of Delhi
LESSON 1
Data Unveiled: An In-Depth Look at Types, Warehouses, and Marts
Aditi Priya
Research Scholar
School of Computer & Systems Sciences
JNU, New Delhi - 110067
Email-Id: kajal.aditi@gmail.com
STRUCTURE
1.1 Learning Objectives
1.2 Introduction
1.3 Exploring Data Types
1.4 Data Warehousing: Foundations and Concepts
1.5 Exploring Data Marts
1.6 Summary
1.7 Answers to In-Text Questions
1.8 Self-Assessment Questions
1.9 References
1.10 Suggested Reading
1.2 Introduction
In the fast-paced world of contemporary enterprise, data has evolved
into a game-changer that influences how companies conduct operations,
cultivate novel ideas, and arrive at significant decisions. The evolution
of data from a mere byproduct to a strategic asset highlights its ever-
increasing importance in our digitalized world. Amidst the information
overload of today’s highly connected world, businesses must navigate
a torrent of data from many sources – customer touch points, digital
interactions, social media platforms, IoT sensors, and additional avenues.
Amidst these data surges, challenges abound, yet so do opportunities.
Converting data into actionable insights maximizes efficiency and growth
potential in a competitive market. Data is more than just numbers and
figures; it provides a comprehensive picture of consumer behaviour, market
trends, operational efficiency, and emerging possibilities. It revolutionizes
conventional decision-making by providing an in-depth comprehension of
consumer desires, enabling enterprises to tailor their approaches to align
with constantly evolving expectations. Decision-making has witnessed
data’s seamless transition from bit player to star performer. In the past,
trusting intuition and emotions might have steered crucial decisions. In
today’s business landscape, data-driven decision-making is becoming
more critical. Sophisticated algorithms and cutting-edge tools enable
organizations to gain valuable insights, forecast future developments, and
make strategic decisions that give them an edge over their competitors.
Data’s significance transcends immediate operations and spurs creativity
and ongoing enhancement. Mining historical data and process optimization
leads to a deeper understanding, empowering businesses to pursue new
avenues of growth and enhance existing practices.
Moreover, the integration of data is not confined to internal operations
alone. It facilitates a deep understanding of customer behaviour, preferences,
and pain points. This customer-centric approach enables businesses
to precisely tailor their offerings, marketing campaigns, and customer
experiences. However, it is essential to acknowledge that data, in its raw
form, is not a magic elixir. It requires the right tools, methodologies, and
strategies to extract its full potential. Data quality, accuracy, and relevance
are paramount, as erroneous data can lead to misguided decisions and
faulty strategies.
In essence, data has become the cornerstone of modern business and
decision-making. Its transformative influence has ushered in an era where
businesses that embrace data-driven strategies stand poised to thrive,
adapt, and innovate. The ability to harness data effectively is not just
a competitive advantage – it is a prerequisite for relevance and success
in a rapidly evolving landscape. As we journey deeper into the digital
age, the role of data in shaping the business landscape will continue to
evolve, redefining what is possible and propelling organizations toward
new horizons of achievement.
This chapter mainly explores different data types, the concept of data
warehouses, and the significance of data marts.
Structured Data: This data type exhibits a strong structure and
adheres to a predetermined format. It is typically housed in
databases and tables with predetermined columns and rows. Examples
include numerical values, dates, identifiers, and spatial references.
Semi-structured Data: This data type does not conform to the structure
of traditional structured data, like data stored in relational databases,
but it is not entirely unstructured. It lies in between structured and
unstructured data regarding organization and flexibility.
Unstructured Data: Formless, chaotic, unstructured data defies
organization and structure. It encompasses multiple formats like
images, audio, video, and text. Within this category of unstructured
data, social media posts, emails, and multimedia content are typical
occurrences.
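To make these three categories concrete, here is a brief, illustrative Python sketch; the record shown and its field names are invented for the example and are not taken from the text.

```python
import json
import pandas as pd

# Structured: fixed rows and columns, as in a relational table.
structured = pd.DataFrame(
    [{"customer_id": 101, "name": "Asha", "purchase_amount": 2500.0}]
)

# Semi-structured: JSON carries tags but no rigid schema;
# nested fields may vary from record to record.
semi_structured = json.loads(
    '{"customer_id": 101, "name": "Asha",'
    ' "orders": [{"item": "laptop", "amount": 2500.0}]}'
)

# Unstructured: free text with no predefined organization.
unstructured = "Asha wrote: 'Delivery was quick and the laptop works great!'"

print(structured.dtypes)           # the schema is explicit
print(semi_structured["orders"])   # structure must be navigated, not queried by column
print(len(unstructured.split()))   # text needs parsing or NLP before analysis
```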
Data can also be classified by its nature, format, and attributes, as
follows:
1. Numerical Data: Also called quantitative data, these values are
measurable and quantifiable. They range from subjective measures
that are hard to count directly (say, happiness ratings) to directly
quantifiable attributes such as size or weight.
2. Categorical Data: Also classified as qualitative data, categories
or groups are identified using categorical data. Organizing data
according to nominal or ordinal principles is possible.
3. Text Data: Including words, sentences, paragraphs, or any written
content, textual data spans a wide range. This scope includes
documents, social media posts, emails, and others.
4. Time Series Data: Chronologically arranged, the data captures
changes over time. It is commonly used in fields such as finance,
economics, and environmental monitoring.
5. Spatial Data: Spatial data consists of critical geographic details
like coordinates and maps. In GIS, it is utilized for mapping and
analytical purposes.
6. Binary Data: Binary data takes only two discrete values, 0 and 1.
This construct finds frequent use in computer systems and digital storage.
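As a rough illustration of how several of these types surface in practice, the following hedged sketch builds a small pandas DataFrame whose columns correspond to numerical, categorical, text, time-series, and binary data; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical records illustrating several of the data types listed above.
df = pd.DataFrame({
    "weight_kg": [61.5, 72.0, 68.3],                               # numerical (quantitative)
    "segment": pd.Categorical(["retail", "wholesale", "retail"]),  # categorical
    "feedback": ["great service", "late delivery", "ok"],          # text
    "recorded_at": pd.to_datetime(
        ["2023-01-01", "2023-01-02", "2023-01-03"]),               # time series timestamps
    "is_active": [True, False, True],                              # binary (0/1)
})

print(df.dtypes)  # pandas assigns a distinct dtype to each kind of data
```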
Data takes centre stage in the corporate world, offering knowledge,
guiding strategic choices, streamlining workflows, and pioneering novel
approaches. Here is how data is essential in a business model:
1. Informed Decision-Making: Objective, accurate, and data-driven
decisions contrast with those fuelled by intuition. Businesses can
make strategic decisions supporting their goals by analyzing trends,
customer behaviour, and market dynamics.
2. Customer Insights: Data provides a clear picture of customer wants,
needs, and buying tendencies, allowing businesses to make informed
decisions. Businesses can skilfully craft products and marketing
campaigns that resonate with their intended market.
3. Personalization: Organizations can craft tailor-made experiences
for clients, leading to increased customer contentment and brand
loyalty. Marketing strategies that cater to individual preferences are
classified under personalization.
4. Operational Efficiency: Data-driven insights facilitate entrepreneurial
optimization. Through data analysis, we discover areas for improvement,
optimize workflows, and reduce costs to boost productivity.
5. Forecasting and Planning: Combining past and present data allows
for reliable predictions, benefiting businesses by helping them prepare
for future requirements, allocate resources effectively, maintain
inventory levels, and adjust production accordingly.
6. Market Insights: Market trends, rival movements, and new potential
emerge from data insights. With this data, enterprises can effectively
respond to market shifts and maintain their position.
7. Innovation and Product Development: Data visualization uncovers
market gaps and innovative areas to explore. Analyzing customer
opinions allows businesses to generate products and services that
resonate with their needs.
8. Risk Management: Businesses conduct detailed data assessments to
evaluate financial, operational, and compliance risks. Taking such
actions permits them to minimize potential dangers.
9. Marketing and Advertising: Data serves as a marketing roadmap,
pinpointing optimal channels, messaging, and timelines, and it
excels at enabling seamless, targeted campaigns.
10. Customer Retention: Data helps identify at-risk customers and
allows businesses to implement retention strategies. By analyzing
customer behaviour, businesses can offer targeted incentives and
personalized communication.
11. Measuring Performance: Data provides key performance indicators
(KPIs) that allow businesses to measure their success and progress
toward goals. Performance measuring helps evaluate the effectiveness
of strategies and make adjustments as needed.
12. Business Intelligence and Reporting: The cornerstone of BI and
reporting tools, data warehouses lay the groundwork. Users can
design dashboards and visualizations with these tools.
13. Separation from Transactional Systems: Data warehouses are
distinct from transactional databases used in day-to-day operations.
Transactional databases focus on capturing real-time transactions,
while data warehouses provide a platform for analysis and reporting.
14. Continuous Improvement: Data-driven insights lead to continuous
improvement across various business aspects, including operations,
customer experience, and employee performance.
Incorporating data into the business model empowers organizations to
make proactive, efficient, and customer-centric decisions. It fosters a
culture of innovation and adaptability, enabling businesses to thrive in a
rapidly changing marketplace.
IN-TEXT QUESTIONS
1. What is a characteristic of structured data?
(a) It lacks a predefined schema
(b) It is typically stored in relational databases
(c) It includes textual documents
(d) It is challenging to query and analyze
2. Unstructured data is commonly found in which of the following
forms?
(a) Database tables
(b) XML files
IN-TEXT QUESTIONS
3. What is the primary purpose of a data warehouse?
(a) Real-time data processing
(b) Long-term data storage
(c) Data analysis and reporting
(d) Data transmission between servers
4. What is the primary goal of data transformation in a data
warehouse during the ETL process?
(a) Storing raw data as-is
(b) Aggregating data for reporting
(c) Preparing data for analysis and reporting
(d) Extracting data from source systems
Reduced Complexity: Data marts partition the data warehouse's
information into smaller, business-specific segments, which makes
data retrieval and evaluation more straightforward.
Improved Performance: Data marts improve query response times
and reporting efficiency by focusing on a specific business sector
and pre-processing data.
Independent Development: Each department can independently
build and maintain its data marts without compromising the data
warehouse framework.
Scalability: Organizations can expand their data infrastructure from
a single data mart over time.
Two approaches to organizing and structuring data exist within an
organization’s data architecture: Dependent Data Marts and Independent
Data Marts. Each approach has its benefits and considerations:
Dependent Data Marts:
(i) Definition: Dependent Data Marts are subsets of an enterprise
data warehouse (EDW). They cater to departmental or business
unit requirements by extracting and adapting data from the
EDW, offering valuable insights from a consolidated data source.
(ii) Benefits:
(a) Consistency: By deriving from a centralized EDW, Dependent
Data Marts maintain organization-wide consistency in data.
(b) Data Governance: Implementing data governance and
control becomes simpler when data is managed in EDW,
decreasing the likelihood of accuracy issues.
(c) Cost Efficiency: Leveraging existing EDW infrastructure
and data, Dependent Data Marts tend to be less expensive
to construct and maintain.
(d) Scalability: As the organization expands, a dependent data
mart adapts more easily because it is part of the broader EDW environment.
Data Storage and Architecture: When deciding on the best storage
platform for the data mart, one could consider relational databases,
columnar databases, data lakes, or cloud-based options. Sensitive
information remains secure thanks to robust access controls and
data security protocols.
Data Documentation: By recording data definitions, lineage, and
metadata, we ensure data control is maintained.
Data Quality Assurance: Regular checks on data quality ensure
continued monitoring and maintenance of data integrity. Establishing
data quality metrics and KPIs is crucial for tracking improvement
over time.
User Access and Visualization: Provide business users with tools
and query access to explore and visualize data from the data mart.
Consider using data visualization platforms or BI (Business Intelligence)
tools for reporting and analytics.
User Training and Support: Offer training to users on how to access
and utilize the data mart effectively. Provide ongoing support and
documentation.
Performance Optimization: Continuously monitor and optimize the
performance of the data mart to ensure timely access to data. Indexing,
partitioning, and caching are standard optimization techniques.
Data Governance: Establish data governance policies, including
data ownership, data stewardship, and data access controls. Ensure
compliance with data privacy regulations.
Scalability and Future Planning: Consider future scalability needs
as the organization grows. Be prepared to expand or modify data
marts as business requirements evolve.
Monitoring and Maintenance: Implement monitoring and alerting
systems to proactively identify issues and maintain data mart health.
Perform regular data quality checks and audits.
Documentation and Knowledge Sharing: Document the data mart’s
architecture, data models, and ETL processes. Share knowledge and
best practices within the data mart development and management
team.
Creating data marts is a complex process that requires collaboration between
data engineers, data analysts, business stakeholders, and IT teams. It is
essential to align data mart development with the organization’s overall
data strategy and ensure that the resulting data marts meet the analytical
needs of the business units they serve.
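As a simplified sketch of the dependent data mart idea, the Python snippet below carves a sales-focused, pre-aggregated subset out of a hypothetical enterprise warehouse extract; the table layout and column names are assumptions made for illustration, not a prescribed design.

```python
import pandas as pd

# Hypothetical extract from an enterprise data warehouse (EDW).
warehouse = pd.DataFrame({
    "order_id":   [1, 2, 3, 4],
    "department": ["sales", "hr", "sales", "finance"],
    "region":     ["north", "south", "north", "east"],
    "amount":     [1200.0, 0.0, 800.0, 450.0],
})

# A dependent data mart: a department-specific, pre-aggregated slice of the EDW.
sales_mart = (
    warehouse[warehouse["department"] == "sales"]
    .groupby("region", as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "total_sales"})
)

print(sales_mart)  # a smaller, focused table that sales analysts can query quickly
```

In practice the same idea is usually expressed as views or tables built by the ETL process on top of the warehouse; pandas is used here only to keep the example self-contained.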
IN-TEXT QUESTIONS
5. What is the primary advantage of using data marts in a data
warehousing strategy?
(a) Centralized data storage
(b) Simplified data transformation
(c) Tailored data access for specific user needs
(d) Real-time data processing
6. In a data mart architecture, what typically serves as the data
source for the data mart?
(a) Enterprise Data Warehouse (EDW)
(b) Operational Data Store (ODS)
(c) External data sources only
(d) All of the above
1.6 Summary
This unit comprehensively explores different data types, including data
warehouses and marts. Here are some of the critical points that are
covered in this unit:
The first section distinguishes between structured, semi-structured,
and unstructured data, describing their characteristics and showcasing
real-world applications. Types of data based on their nature, format,
and attributes are also discussed in this section. These sections also
describe data utilized in a business to facilitate informed decision-
making, optimize processes, analyze consumer conduct, and guide
strategic planning.
The concept of data warehousing is discussed in the second section. A
data warehouse is a centralized repository that stores, manages, and analyzes data from
multiple sources. This section also covers the data warehousing
process, types of data warehouses, data warehousing models, and
benefits of data warehouses.
The concept of data marts is discussed in the last section. Different
types of data marts and their definition, benefits, and considerations
related to them are explored in this section. The process of creating
a data mart is also covered in this.
Bruce, P., Bruce, A., & Gedeck, P. (2020). Practical statistics for
data scientists: 50+ essential concepts using R and Python. O’Reilly
Media.
Yau, N. (2013). Data points: Visualization that means something.
John Wiley & Sons.
Maheshwari, A. (2014). Data analytics made accessible. Seattle:
Amazon Digital Services.
LESSON 2
Data Quality, Data Cleaning, Handling Missing Data, Outliers, and Overview of Big Data
Aditi Priya
Research Scholar
School of Computer & Systems Sciences
JNU, New Delhi - 110067
Email-Id: kajal.aditi@gmail.com
STRUCTURE
2.1 Learning Objectives
2.2 Introduction
2.3 Data Quality
2.4 Data Cleaning
2.5 Handling Missing Data and Outliers
2.6 Overview of Big Data
2.7 Summary
2.8 Answers to In-Text Questions
2.9 Self-Assessment Questions
2.10 References
2.11 Suggested Readings
2.2 Introduction
The significance of data quality in the realms of decision-making
and analytics is of utmost importance. Data forms the foundation for
informed decision-making and insight extraction. Efficient decision-making
and precise analytics are possible only when data quality is good. Data
quality spans multiple domains and industries and is critical to
decision-making and analytics. Data success relies on the accuracy and
dependability of statistics in a data-driven era. Decision-making and
analytics rely heavily on data quality, with precision, thoroughness, and
trustworthiness paramount. Data quality problems can cause inconclusive
results, flawed forecasts, and pricey errors. High-quality data allows
organizations to make insightful decisions, identify trends, and achieve
a leg up on competitors. With each new level of data exploration, the
importance of accuracy, completeness, and consistency inevitably rises,
transforming them from trivial issues to critical foundations for success
in a data-dependent landscape. Maintaining good data quality relies upon
processes that include data cleaning, handling missing data, and outliers.
These interconnected processes enhance data accuracy and decision-making. The
emergence of big data has transformed data management practices and
quality.
IN-TEXT QUESTIONS
1. What is a key dimension of data quality?
(a) Quantity
(b) Volume
(c) Accuracy
(d) Variety
2. Which phase of Six Sigma focuses on identifying and addressing
the root causes of data quality issues?
(a) Define
(b) Measure
(c) Analyze
(d) Control
3. In the context of data quality, what is a key principle of Total
Quality Management (TQM)?
(a) Customer focus
(b) Employee isolation
(c) Data secrecy
(d) Irregular improvements
9. Iteration:
(i) Objective: Data cleaning is generally an iterative process.
During cleaning, it is essential to re-evaluate data
quality and fix any newly discovered issues.
10. Validation and Quality Assurance:
(i) Objective: Validate that the cleaned data meets quality standards
and is fit for its intended application.
(ii) Methods: Quality assurance checks include comparing the cleaned
data to the original data and conducting validation tests.
11. Final Data Export:
(i) Objective: Once data quality has been ensured, export the
cleaned dataset for analysis or further processing.
12. Monitoring and Maintenance:
(i) Objective: Continuous data monitoring and maintenance
procedures are necessary to prevent errors.
With a combination of domain knowledge, statistical methods, and data
manipulation abilities, data cleaning entails labour-intensive and iterative
work. Analytical insights rely on dependable and correct data, making
data preparation a critical step. Cleaning data is paramount when it comes
to reliability and accuracy. Standard data cleaning techniques include:
Deduplication: Remove redundant data or entries within the dataset.
Keep only one instance of each unique record, discarding any
duplicate versions. Avoiding double-counting is key to ensuring
data accuracy.
Standardization: Consistency in data presentation and organization
is essential. Date formats can be standardized, for example by
converting "MM/DD/YYYY" to "YYYY-MM-DD". Units of measurement can be
standardized, such as converting all weights to kilograms.
Standardizing categorical values, such as representing "New York"
the same way in every record, improves data accessibility.
Validation: Validating data against set business standards, limitations,
or logical rules is essential. Validation checks are executed to
identify data that does not conform to expected norms.
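The hedged Python sketch below applies two of the techniques just described, deduplication and standardization, to a tiny invented dataset; the column names, formats, and conversion factor are assumptions made for the example.

```python
import pandas as pd

# Invented raw records with duplicates and inconsistent formats.
raw = pd.DataFrame({
    "customer":  ["Asha", "Asha", "Ravi"],
    "city":      ["new york", "New York", "NEW YORK"],
    "joined":    ["01/15/2023", "01/15/2023", "03/02/2023"],   # MM/DD/YYYY
    "weight_lb": [132.0, 132.0, 176.0],
})

clean = raw.copy()
clean["city"] = clean["city"].str.strip().str.title()                 # standardize categorical values
clean["joined"] = pd.to_datetime(clean["joined"], format="%m/%d/%Y")  # standardize dates (ISO order)
clean["weight_kg"] = clean["weight_lb"] * 0.4536                      # standardize units
clean = clean.drop(columns="weight_lb").drop_duplicates()             # deduplication

print(clean)
```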
(ii) Solution: Standardize data formats and units, employing data
transformation techniques to bring them to a consistent form.
Missing Data in Healthcare Records:
(i) Challenge: The absence of data can negatively impact patient
care and analysis through electronic health records.
(ii) Solution: Imputation techniques can fill in missing values by
leveraging the patient’s historical data.
Duplicate Customer Records in CRM Systems:
(i) Challenge: Duplicate customer records may arise from mistakes
during manual data entry in CRM systems.
(ii) Solution: By pairing similarity thresholds with data profiling,
databases are deduplicated by removing or combining redundant
records.
Inaccurate Geospatial Data:
(i) Challenge: Geospatial datasets may hold the wrong coordinates
for locations.
(ii) Solution: Correcting data and verifying it takes place through
external sources or geocoding services. Detecting outliers can
reveal suspicious data values.
Incomplete Financial Data:
(i) Challenge: Missing or incomplete data points in financial
datasets can hamper financial analysis.
(ii) Solution: When data is incomplete, imputation methods like
forward-fill, backward-fill, or interpolation may be applied to
estimate missing financial values.
Sensor Data with Outliers:
(i) Challenge: Possibly caused by sensor issues or noise, outliers
appear in the data stream.
(ii) Solution: Statistical methods such as smoothing or imputation
can help discover and take care of outliers.
Inconsistent Product Names in E-commerce:
(i) Challenge: Inconsistencies in e-commerce product data,
particularly names and descriptions, affect search and
categorization procedures.
IN-TEXT QUESTIONS
4. Which of the following is a common data-cleaning task?
(a) Increasing data complexity
(b) Adding noise to data
(c) Transforming data
(d) Ignoring missing values
(ii) Pairwise Deletion: Each analysis uses every case that has the
values required for that particular analysis. Because all available
data are used, different analyses may be based on different sample sizes.
Advanced Imputation Methods: Advanced imputation methods are
more sophisticated and can handle complex missing data patterns:
(i) K-Nearest Neighbours (K-NN): Missing values are filled with
the average of the k most similar records in feature space, using
measures such as Euclidean distance to determine similarity.
(ii) Multiple Imputation (MI): Several imputed datasets are
generated to reflect the uncertainty about the missing values.
Each dataset is analyzed separately and the results are then
combined. By accounting for imputation uncertainty, MI
enables sound statistical conclusions.
(iii) Expectation-Maximization (EM): An iterative algorithm estimates
missing values by maximizing the likelihood function of the data.
EM is most effective when the data are missing at random (MAR).
(iv) Interpolation and Extrapolation: Mathematical techniques estimate
missing values from neighbouring observations; interpolation fills
gaps within the observed range, while extrapolation extends estimates
beyond it. These methods are commonly used for time series and other ordered data.
(v) Domain-Specific Imputation: When domain-specific expertise
is involved, imputation can take cues from that knowledge.
Imputing missing values is one area where medical expertise
can be applied.
(vi) Imputation Software: Imputation functions and packages can
be accessed through various software tools such as R, Python
(with pandas and scikit-learn), and other specialized statistical
software.
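To illustrate a couple of the imputation options listed above, the sketch below applies scikit-learn's SimpleImputer and KNNImputer to an invented numeric table; the data and parameter choices are assumptions for demonstration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Invented dataset with missing values.
df = pd.DataFrame({
    "age":    [25, np.nan, 40, 35, np.nan],
    "income": [30000, 42000, np.nan, 52000, 39000],
})

# Simple imputation: replace each missing value with the column median.
median_filled = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(df), columns=df.columns
)

# K-NN imputation: fill each gap from the k most similar rows (Euclidean distance).
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)

print(median_filled)
print(knn_filled)
```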
Outliers are data points that diverge significantly from the remainder
of the data, lying far from the bulk of the observations.
Outliers are significant in data analysis for several reasons:
Detection of Errors: Outliers can point to problems with data
collection, including data entry mistakes and measurement errors.
(ii) IQR (Interquartile Range): The IQR is the difference between
the third quartile (Q3) and the first quartile (Q1). Data points
below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are flagged as
outliers.
Visualization Techniques:
(i) Box Plots: Box plots graphically depict the data distribution,
pointing to possible outliers outside the “whiskers.”
(ii) Scatter Plots: Outliers appear in scatter plots as points that
deviate from the dominant pattern in the data.
Machine Learning Algorithms:
(i) Clustering Methods: Outliers can be identified through the use
of k-means clustering. Points that defy categorization as part
of any cluster could be considered outliers.
(ii) Isolation Forests: An ensemble method designed specifically
for outlier detection. The data is repeatedly partitioned at
random, and points that become isolated after only a few splits
are treated as outliers.
(iii) One-Class SVM (Support Vector Machine): The algorithm
learns a boundary around the bulk of the data, and anything
falling outside that boundary is labelled as an outlier.
Mahalanobis Distance:
(i) The Mahalanobis distance measures how far a data point lies
from the dataset's centroid while accounting for the covariance
between variables. High Mahalanobis distances signal potential
outliers among data points.
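A minimal sketch of the z-score and IQR rules described above, applied to an invented one-dimensional sample; the thresholds (3 standard deviations and 1.5 × IQR) follow the usual conventions and can be tuned to the data at hand.

```python
import numpy as np

# Invented sample: 20 typical readings plus one obvious outlier (95).
values = np.array([10, 11, 12, 13, 14] * 4 + [95])

# Z-score rule: flag points more than 3 standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers, iqr_outliers)  # both rules single out the value 95
```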
In data analysis, outliers may occur for multiple reasons, and they can
significantly influence the outcomes of statistical analysis. Data entry
mistakes caused by human error or equipment failure often produce values
that deviate markedly from the genuine data distribution, and such
misleading outliers can skew summary statistics. Measurement errors or
inaccurate instruments can also produce outliers, and statistical
dependability hinges on identifying values that distort the relationships
between variables. When examining a system, natural variability can
itself give rise to outliers.
Outliers can affect statistical parameters and must be taken into account.
Small or non-representative sampling increases the chance of including
error values, leading to anomalous readings. Outliers, unless handled,
can threaten the reliability of statistical models. Correctly identifying,
understanding, and managing outliers is essential for maintaining the
integrity of statistical analysis.
Careful consideration is necessary when dealing with outliers in data
analysis due to the dependence of the appropriate approach on the nature
and context of the data. The following considerations should be taken
into account:
Understand the Context: Examine the context and the likely cause of
each outlier before taking action. Ask what accounts for the unusual
behaviour; context clues help identify the appropriate approach.
Visualize the Data: By employing box plots, scatter plots, and
histograms, data analysis can detect outliers and comprehend the
data distribution. Insights into their nature and the impact they have
are provided through visual examination.
Assess Their Impact: Evaluate how outliers affect summary statistics
and data distributions, for example by temporarily removing them and
comparing the results. Assessing their effect on the analysis guides
the choice of treatment.
Consider Domain Knowledge: Consult subject-matter experts and
stakeholders; domain knowledge helps distinguish genuine extreme
values from errors and reveals what the outliers may mean.
Outlier Treatment Options: Depending on the context, the following
approaches can be considered:
(i) Removal: Outliers caused by data entry errors or measurement
issues can be removed so they do not distort the analysis, but
be cautious about removing too many data points.
(ii) Transformation: Logarithmic or square root transformations can
make the data more normally distributed, reducing the influence of outliers.
IN-TEXT QUESTIONS
6. Why is handling missing data important in data analysis?
(a) It makes the data look complete
(b) It can lead to biased results
(c) It simplifies data processing
(d) It reduces the need for data visualization
7. What is a potential consequence of missing data not at random
(MNAR) in data analysis?
(a) Increased statistical power
(b) Biased and inaccurate results
(c) Smaller effective sample size
(d) Enhanced data completeness
8. Outliers can be detected using:
(a) Mean imputation
(b) Median absolute deviation
(c) Deleting all data points beyond a specific value
(d) Ignoring them during analysis
9. Which statistical method measures how many standard deviations
a data point is away from the mean for outlier detection?
(a) Interquartile Range (IQR)
(b) Modified Z-Score
(c) Box Plot
(d) Z-Score
IN-TEXT QUESTION
10. Which of the following is NOT one of the four V’s of big data?
(a) Volume
(b) Velocity
(c) Validity
(d) Variety
2.7 Summary
This unit comprehensively explores key concepts and practices in data
management and analysis. Here are some of the key points that are
covered in this unit:
In the first section, the concept of data quality and its importance in
business is examined. The dimensions of the data quality include
accuracy, completeness, consistency, reliability, and relevance. The
consequences and impact of poor data quality have been covered, along
with data quality frameworks and methodologies, such as Six Sigma
and TQM, that reduce the impact of poor data quality on organizations.
The second section includes data cleaning, which is an iterative
process and a crucial step in the process of data preparation. It
is done to ensure the accuracy and reliability of the data used
for analysis. It involves several critical tasks, including handling
missing data through removal or imputation, identifying and treating
outliers, converting data types to appropriate formats, standardizing
inconsistent data, and removing duplicates. Categorical variables
are encoded for analysis, and text data undergoes pre-processing.
An overview of big data, which marks a significant shift in data
management techniques, is also introduced.
The third section consists of two important elements of data
analysis, dealing with missing data and outliers. Data absence or
missing values can lead to bias and inaccurate analysis. Mean or
median imputation and other imputation techniques like regression
imputation and multiple imputation can replace missing values with
estimates. Deleting rows or columns may be necessary for small
amounts of missing data. Data points deviating markedly from the rest
must be attended to as they can impact analyses. Both z-scores and
transformation techniques, when combined with domain knowledge,
can help reduce the impact of outliers. The approach to handling
missing data and outliers should mesh with data qualities, research
objectives, and the more significant analytical setting to get accurate
and reliable results.
The final section offers an overview of big data, a paradigm shift
in data management and analysis. It explores the four V’s of big
data: Volume, Velocity, Variety, and Veracity.
1. (c) Accuracy
2. (c) Analyze
LESSON 3
Navigating the Data Analytics Journey: Lifecycle, Exploration, and Visualization
Aditi Priya
Research Scholar
School of Computer & Systems Sciences
JNU, New Delhi - 110067
Email-Id: kajal.aditi@gmail.com
STRUCTURE
3.1 Learning Objectives
3.2 Introduction
3.3 Understanding Data Analytics
3.4 Data Exploration: Uncovering Insights
3.5 Data Visualisation: Communicating Insights
3.6 Summary
3.7 Answers to In-Text Questions
3.8 Self-Assessment Questions
3.9 References
3.10 Suggested Reading
3.2 Introduction
Recently, the amount of data being created and generated has immensely
increased. As of the current year, approximately 328.77 million terabytes
of data are created each day, and the rate of data creation is consistently
on the rise. Since 2010, the amount of data generated annually has grown
every year; an estimated 90% of the world's data was created in the
last two years alone. In 13 years, annual data creation has increased
roughly 60x from just two zettabytes in 2010. The 120 zettabytes generated
in 2023 are expected to grow to about 181 zettabytes by 2025, an increase of roughly 50%.
The data we encounter is highly varied. People develop a wide array of
content, including blog posts, tweets, interactions on social networks,
and images. The internet, the ultimate data source, is unfathomable and
extensive. This remarkable surge in the volume of data being created and
generated profoundly impacts businesses. Conventional database systems
cannot handle such a large volume of data. Moreover, these systems
struggle to cope with the demands of ‘Big Data.’
Here comes the question: What exactly is Big Data? Big data refers to
massive, complex, and varied sets of data. It is characterized by four V's:
Volume, Velocity, Variety, and Veracity. Volume involves significant
amounts of data ranging in terabytes and more, generated from several
sources, such as sensors, social media, online transactions and many more.
Velocity is the unprecedented speed at which data is generated in today’s
digital world. Real-time streams such as social media interactions, emails,
messages, chats, IoT device readings, and financial transactions
contribute to the velocity of big data. Variety refers to various formats of
data, such as databases, spreadsheets, text, images, etc. This diverse form
of data requires specific tools for processing and analysis. Veracity refers
to the quality and reliability of data. Big data can sometimes be noisy
or incomplete, and data scientists need to ensure that the insights drawn
from the data are accurate and trustworthy. Big data is generated across
real world, the data is generated rapidly and is complex. The complexity
and interconnectedness of real-world data require newer and more advanced
technologies to discover the underlying relationships and patterns present
in the data. Understanding and forecasting future trends helps organisations
minimise risk, strategise actions, and make effective and accurate decisions.
So, to cope with all the challenges related to data generation in the real
world, the concept of data science becomes evident. Data science is a
multidisciplinary field that combines various methods and algorithms to
extract valuable information from data in different formats, such as structured,
semi-structured, and unstructured data. It consists of various processes of
collecting, transforming, analysing, and interpreting data to make more
informed and accurate decisions. The key components related to data
science are as follows:
(i) Collection of relevant data from various sources such as user-generated
data, sensor data, social media interaction data, transaction data,
healthcare data, financial data, etc.
(ii) Pre-processing of the data to ensure quality and maintain accuracy
by eliminating any discrepancies present in the data.
(iii) Selection of relevant and important features and characteristics to
increase the performance of any system.
(iv) Interpreting and discovering hidden patterns, distributions, and relations
through data visualisation tools and applications.
(v) Applications of statistical methods and techniques to discover trends
in data.
(vi) Using Natural Language Processing to analyse and process human
language to extract meaningful information.
(vii) Applications of machine-learning and deep-learning models to
build models for performing specific tasks.
(viii) Processing bulk volume of data to reduce complexity and increase
performance.
(ix) Considering the ethical and legal aspects of data collection, storage,
and usage.
to optimise their methods and techniques to mitigate risk and increase
their output. The work of a data analyst usually involves statistical tools
like Excel, SQL, Tableau, etc. These tools help the data analyst perform
simple calculations for complex data modelling and visualisation. They
aim to answer specific business questions, identify opportunities, and guide
operational improvements. Data analytics involves examining, cleaning,
transforming, and interpreting data to extract valuable insights and make
informed decisions. It involves using various techniques, tools, and
methodologies to analyse data and uncover patterns, trends, correlations,
and other meaningful information that can aid in understanding business,
scientific, or other phenomena. Data analytics encompasses a wide range
of activities, from simple descriptive analysis summarising historical
data to advanced predictive and prescriptive analysis using statistical
and machine learning models to make predictions and recommendations.
It involves working with structured and unstructured data from various
sources, including databases, spreadsheets, text documents, images, videos,
sensor data, social media, and more. Data analytics is used across multiple
sectors like marketing, finance, healthcare, manufacturing, businesses,
etc. It aids organisations in achieving process optimisation by making
data-driven decisions and identifying newer prospects and opportunities
to grow exponentially and quickly. There has been an increase in the
development of tools and techniques used in Data analytics, enabling
businesses to keep up with the market competition.
The data analytics lifecycle consists of a sequence of stages and steps
commonly adhered to while conducting tasks related to data analytics. It
gives a systematic approach to deriving meaningful and valuable insights
from data to make data-driven decisions. While the steps and procedures
depend on the type of organisation, a generalised framework is followed
in the data analytics lifecycle. The most common steps utilised in the
data analytics lifecycle are as follows—
The first step of the data analytics lifecycle begins with defining the
problem with utmost clarity so that the analysis’s purposes, goals,
and objectives are aligned with the organisation’s requirements.
The second step of the data analytics lifecycle involves the collection
of data from different sources of data such as sensors, social media,
online transactions, healthcare data, etc. It also ensures that the
collected data is accurate, reliable, and meets the organisation’s needs.
The third step of the data analytics lifecycle consists of removing
the discrepancies present in the data and transforming the data using
feature engineering.
The fourth step of the data analytics lifecycle involves exploring
data using visual and statistical methods to understand the features
and identify patterns and relations in the data.
The fifth step involves the selection of an appropriate model
according to the problem and characteristics of the data. This helps
to develop a predictive or descriptive analysis of data using various
models related to machine learning, deep learning, etc.
The sixth step consists of evaluating the performance of various
models using several measures. This is done to validate the model
and check whether it generalises, i.e., whether it gives accurate
results on newer datasets.
The seventh step involves the interpretation of the results obtained
after the analysis of data to extract useful information to make
well-informed decisions.
The eighth step involves making informed decisions to take the
required actions.
The last step involves continuously performing surveillance and
monitoring the performance of the implemented actions to update
and refine the procedures to yield better results. This also involves
considering the feedback and responses to refine the model to adapt
to the needs of the business processes.
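As an illustrative, deliberately simplified sketch of how these stages might be chained together in code, the snippet below walks through collection, cleaning, exploration, modelling, and evaluation with pandas and scikit-learn; the dataset, features, and choice of model are invented for the example and are not part of the lifecycle itself.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Steps 1-2: problem definition and data collection (here, an invented customer table).
data = pd.DataFrame({
    "age":     [22, 35, 58, 44, 31, 50, 27, 39],
    "spend":   [120, 460, 300, 510, 150, 620, 90, 400],
    "churned": [1, 0, 0, 0, 1, 0, 1, 0],
})

# Step 3: cleaning and preparation - drop duplicates and rows with missing values.
data = data.drop_duplicates().dropna()

# Step 4: exploration - a quick look at distributions and relationships.
print(data.describe())
print(data.corr())

# Steps 5-6: model selection and evaluation on held-out data.
X_train, X_test, y_train, y_test = train_test_split(
    data[["age", "spend"]], data["churned"], test_size=0.25, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Steps 7-9: interpretation, decision-making, and ongoing monitoring would follow.
```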
IN-TEXT QUESTIONS
1. What is the primary role of data analytics in decision-making
and problem-solving?
(a) To make data look presentable
(b) To make data collection efficient
(c) To extract insights and inform decisions
(d) To replace traditional decision-making processes
2. Which of the following statements is true regarding the Data
Analytics Lifecycle?
(a) It is a rigid and unmodifiable process
(b) It is only applicable to large organisations
(c) It guides structured data analysis but lacks flexibility
(d) It provides a framework that can adapt to diverse data
analysis needs
To check whether the collected data suits the process and can be used.
Enhance data search capabilities by attaching keyword descriptions
or categorising them.
Check the quality of data to see whether it aligns with particular
standards.
Check the risk involved when the data is used in different applications.
Uncover metadata from the source database, encompassing value
patterns, distributions, potential key candidates, foreign key candidates,
and functional dependencies.
Check whether the existing metadata precisely represents the real
values present within the source database.
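The short, hedged sketch below shows how a few of these profiling checks (data types, completeness, value distributions, and candidate keys) might be run with pandas on an invented table; the column names are assumptions for illustration.

```python
import pandas as pd

# Invented source table to profile.
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "email":    ["a@x.com", "b@x.com", None, "d@x.com"],
    "amount":   [120.0, 80.0, 80.0, 450.0],
})

print(df.dtypes)                   # metadata: data types of each column
print(df.isna().mean())            # completeness: share of missing values per column
print(df.describe(include="all"))  # value patterns and distributions
print(df.nunique() == len(df))     # columns with all-unique values: candidate keys
```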
After automated profiling, analysts drill down or filter the data manually
to locate irregularities or anomalies flagged during automation. In
addition, data exploration commonly entails writing commands or queries
directly against the data, either through Structured Query Language (SQL)
or other programming languages, and leveraging specialised applications
that enable visual analysis of unstructured data.
This hands-on approach helps analysts build intuition about the data and
form hypotheses in context, rather than relying solely on fixed,
preconceived rules or purely automated checks.
Once an initial understanding of the data has been gained, it can be
refined by removing unnecessary or irrelevant portions (data cleaning),
correcting badly structured records, and uncovering latent relationships
between data collections. This stage also involves identifying gaps in
the reliability and uniformity of the data sets and documenting concise
statements about their quality. Executing these steps carefully and
resourcefully helps protect against the repercussions of small
errors committed along the way, so that later statistical interpretations
and the decisions built on them rest on a sound, well-documented foundation.
Apart from organised inquiry and prediction methods, another way to
examine data is to pose ad hoc questions in search of hidden
patterns. This practice generates discoveries with little initial
hypothesis building. Data analysis professionals and scientists centre
their work around exploring data rather than just focusing on traditional
statistics.
Experts use Exploratory Data Analysis (EDA) to examine datasets employing
assorted strategies to unveil underlying patterns, establish early indications,
and isolate anomalous features. Understanding the character of the data
through these techniques facilitates ensuing analysis and decision-making.
Here are some critical exploratory data analysis techniques:
1. Summary Statistics:
(i) Mean, Median, Mode: Central tendency measures reveal a
variable’s typical or common value. The mean is calculated
by adding up every value and dividing the final result by how
many values are in the dataset. Indicating the location of the
central point serves as a tool for assessing central tendency.
Notably, the mean can be influenced by exceptional data
points (outliers), which pull it towards the extreme values. The
median is the middle value when the data are organised in
ascending or descending order; unlike the mean, it is resistant
to the impact of outliers. Particularly useful for skewed
distributions, the median represents the central value around
which the data points gather.
(ii) Bar Plots and Stacked Bar Plots: Represent the distribution
of categorical variables visually.
5. Multivariate Analysis:
(i) Scatter plot Matrix: These plots effectively illustrate the
relationships between pairs of variables, significantly contributing
to data analysis.
(ii) Parallel Coordinates Plot: Illustrates how multiple variables
interrelate by depicting each variable as a vertical axis.
(iii) Principal Component Analysis (PCA): These retain vital details
and generate a more digestible representation.
6. Outlier Detection:
(i) Z-Scores: Determine how far away a data point is from the
mean by measuring in standard deviations.
(ii) Interquartile Range (IQR): Determining the gap between the
first and third quartiles highlights potential outliers.
7. Distribution Fitting:
(i) Normality Tests: Statistical tests that check whether a variable
adheres to a normal distribution.
(ii) Kernel Density Estimation: Estimates the probability density of a
variable, giving a smooth picture of how its values are distributed.
8. Geospatial Analysis:
(i) Heat Maps: Depict data density using colour gradients on a
geographical map.
(ii) Choropleth Maps: Use colours or patterns to represent data
values in different geographical areas.
9. Time Series Analysis:
(i) Time Plots: Display data over time to identify trends, seasonality,
and patterns.
(ii) Autocorrelation Function (ACF) and Partial Autocorrelation
Function (PACF): Identify time-dependent relationships.
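The techniques above are easiest to grasp with a small example. Below is a minimal Python sketch, using made-up values, that computes summary statistics and applies the z-score and IQR rules for spotting potential outliers; the pandas library is assumed to be available.

import pandas as pd

# Hypothetical exam scores; the single extreme value (120) illustrates
# how an outlier pulls the mean while the median barely moves.
scores = pd.Series([55, 58, 60, 62, 63, 65, 66, 70, 72, 120])
print("mean:", scores.mean(), "median:", scores.median())

# Z-scores: distance from the mean measured in standard deviations.
z = (scores - scores.mean()) / scores.std()
print("possible outliers (|z| > 2):", scores[z.abs() > 2].tolist())

# IQR rule: values beyond 1.5 * IQR from the quartiles are flagged.
q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1
mask = (scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr)
print("IQR outliers:", scores[mask].tolist())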
IN-TEXT QUESTIONS
3. What is the primary goal of data exploration in the data analysis
process?
(a) To make data look presentable
(b) To understand data characteristics, patterns, and relationships
(c) To clean and pre-process data
(d) To build predictive models
4. Which of the following statements is true regarding data
exploration?
(a) Data exploration is primarily focused on building predictive
models
(b) Data exploration aids in uncovering hidden insights and
patterns in data
(c) Data exploration is only concerned with data cleaning
(d) Data exploration is the final stage in the data analysis
process
12. Word Clouds: Size words according to how frequently they occur and
visualise the text data accordingly; common terms in textual data stand
out at a glance.
13. Network Diagrams: Graphically represent networks, displaying entities
as nodes and connections as edges. They are often used in social network
analysis and systems modelling.
14. Choropleth Maps: Represent data by region using colour shading or
patterns. They are ideal for displaying regional data such as population
density or election results.
15. Sankey Diagrams: Visualise the movement of data or resources across
categories or stages.
16. Radar Charts (Spider Charts): Showcasing multivariate data on a
radial grid, each axis represents a different variable. Across various
dimensions, entities can be compared using this technique.
17. 3D Visualisations: A third dimension elevates standard chart styles,
such as 3D scatter plots and surface plots. Effective for visualising
volumetric data.
18. Interactive Dashboards: Combine visualisations and controls in an
interactive interface, allowing for timely insights and hands-on data
analysis.
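As a small illustration of the chart types discussed above, the following Python sketch (assuming the matplotlib library and hypothetical sales figures) draws a time plot and a bar plot side by side.

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 150, 170, 165]                     # hypothetical monthly sales
by_product = {"Laptop": 40, "Smart Watch": 30, "Keyboard": 20, "Mouse": 10}

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Time plot: reveals trend and seasonality over time.
axes[0].plot(months, sales, marker="o")
axes[0].set_title("Monthly sales (time plot)")

# Bar plot: distribution of a categorical variable.
axes[1].bar(list(by_product), list(by_product.values()))
axes[1].set_title("Sales by product (bar plot)")

plt.tight_layout()
plt.show()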
IN-TEXT QUESTIONS
5. What is the primary purpose of data visualisation in data
analysis?
(a) To replace data collection processes
(b) To make data look presentable
(c) To summarise data characteristics in tables
(d) To communicate data insights effectively
6. In data visualisation, what is the purpose of creating a line chart?
(a) To display the distribution of a single variable
(b) To show relationships between two numerical variables
over time
(c) To display categorical data using bars
(d) To create 3D visualisations
3.6 Summary
This unit comprehensively explores key concepts of the data analytics
process. Here are some of the key points that are covered in this unit:
In the first section, we explored the fundamentals of data analytics.
Data analytics is a systematic process that involves collecting,
preparing, analysing, visualising, and interpreting data to extract
valuable insights and inform decision-making. We introduced the
Data Analytics Lifecycle, a structured framework that outlines the
stages of data analysis, including data collection, preparation, analysis,
visualisation, and interpretation. This section laid the foundation for
understanding data analytics’s key components and processes.
In the second section, we covered data exploration, the initial phase
of data analysis, focusing on understanding data characteristics,
identifying patterns, and assessing data quality. We discussed
techniques such as summary statistics and data profiling, which
7. List some best practices in data visualisation and explain why they
are important for creating compelling visualisations.
3.9 References
Bruce, P., Bruce, A., & Gedeck, P. (2020). Practical statistics for
data scientists: 50+ essential concepts using R and Python. O’Reilly
Media.
Yau, N. (2013). Data points: Visualisation that means something.
John Wiley & Sons.
Maheshwari, A. (2014). Data analytics made accessible. Seattle:
Amazon Digital Services.
L E S S O N
4
Foundations of Predictive
Modelling:
Linear and Logistic
Regression, Model
Comparison, and
Decision Trees
Aditi Priya
Research Scholar
School of Computer & Systems Sciences
JNU, New Delhi - 110067
Email-Id: kajal.aditi@gmail.com
STRUCTURE
4.1 Learning Objectives
4.2 Introduction
4.3 Linear Regression
4.4 Logistic Regression
4.5 Model Comparison
4.6 Decision Trees
4.7 Summary
4.8 Answers to In-Text Questions
4.9 Self-Assessment Questions
4.10 References
4.11 Suggested Reading
4.2 Introduction
Predictive modelling is integral to data science, playing a vital role in
unearthing essential insights and guiding intelligent choices. Historical
data-based predictions or trend forecasts are created through mathematical
and statistical modelling techniques. The importance of predictive modelling
in data science cannot be overstated, as it enables organizations to:
Anticipate Trends: Forecasting future trends is where predictive
models come into play, keeping businesses competitive and informed.
Anticipating customer preferences and market shifts is vital for
maintaining a competitive edge.
Optimize Operations: Data-driven approaches provide insight into operational
efficiency, helping to optimize processes and resource allocation. Predictive
models support efficient inventory management, supply chain optimization,
and workforce scheduling, reducing costs and improving productivity.
Improve Decision-Making: Models offer probabilistic predictions of outcomes
and identify the factors that contribute to them. By analyzing past data
and trends and surfacing valuable insights, data scientists help
organizations make informed, critical decisions about business strategies
and resource allocation.
Enhance Customer Experience: Marketing and e-commerce benefit from
predictive modelling through enhanced customer experiences and
better-targeted audiences.
IN-TEXT QUESTIONS
5. Which one of them is not a common technique used for
comparing models?
(a) Hypothesis testing
(b) Cross-validation
(c) Fitting the model to different subsets
(d) ROC-AUC
Wind = Strong) would be sorted down the leftmost branch of this decision
tree and would, therefore, be classified as a negative instance (i.e., the
tree predicts that Play Tennis = no).
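As an illustrative sketch (not the textbook's own figure), the following Python code fits a decision tree with scikit-learn on a small, hand-made Play-Tennis-style dataset and classifies the instance described above; the data values are assumptions chosen to reproduce the usual rule.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training examples following the familiar Play Tennis pattern.
data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Sunny", "Overcast", "Overcast", "Rain", "Rain", "Rain", "Rain"],
    "Humidity": ["High", "High", "Normal", "High", "Normal", "High", "Normal", "High", "Normal"],
    "Wind":     ["Weak", "Strong", "Weak", "Weak", "Strong", "Weak", "Weak", "Strong", "Strong"],
    "Play":     ["No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"],
})

# Trees in scikit-learn need numeric inputs, so one-hot encode the categories.
X = pd.get_dummies(data[["Outlook", "Humidity", "Wind"]])
y = data["Play"]

tree = DecisionTreeClassifier(criterion="gini").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # text view of the learned tree

# Classify a new instance: Outlook = Sunny, Humidity = High, Wind = Strong.
new = pd.get_dummies(pd.DataFrame({"Outlook": ["Sunny"], "Humidity": ["High"], "Wind": ["Strong"]}))
new = new.reindex(columns=X.columns, fill_value=0)
print(tree.predict(new))   # -> ['No'] for this toy data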
Handling Mixed Data Types: Decision trees can process categorical and
numerical data, and many implementations tolerate missing values, without
extra pre-processing. Some tree-based algorithms, including Random Forest,
accommodate a variety of data types.
Scalability: Decision trees are well suited to small and moderate-sized
datasets. Ensemble methods that combine multiple trees, such as random
forests and gradient-boosted trees, help address their limitations on
larger problems.
Versatility: Decision trees serve different purposes, including
classification, regression, and ranking. By combining multiple trees
through ensemble methods like Random Forests or AdaBoost, complex problems
can also be solved.
While decision trees have several advantages, they also come with
limitations that need to be considered when using them in machine
learning and data analysis:
Overfitting: Deep decision trees can end up fitting the noise in the
training data. An overfitted tree fits the training data almost perfectly
but may generalize poorly to unseen data. Pruning and setting a maximum
tree depth are common tools for mitigating overfitting.
Instability: Minor alterations in the data can significantly change the
tree's structure. Because of this instability, small changes in the
training data can lead to very different trees for nearly identical
datasets. By averaging the predictions of multiple trees, ensemble methods
like Random Forests help reduce prediction instability.
Bias Toward Features with More Levels or Values: Splitting criteria tend
to favour features with many distinct levels or values, so categorical
attributes with many levels can bias decision tree performance.
Inadequate for Capturing Complex Relationships: Decision trees handle
simple non-linear relationships well but can fail with more intricate
ones. Even deep tree models may struggle to capture complicated patterns
in high-dimensional and noisy datasets.
High Variance: Decision trees are known for high variance; small
perturbations of the training data can produce very different models.
IN-TEXT QUESTIONS
7. What is the primary role of decision trees in machine learning
and data science?
(a) To perform clustering analysis
(b) To model the relationship between two continuous variables
(c) To create a visual representation of data
(d) To make decisions by mapping data features to outcomes
8. Which criteria are used for splitting in decision trees?
(a) Mean
(b) Variance
(c) Gini impurity
(d) Chi-square test
4.7 Summary
This unit comprehensively explores linear regression, logistic regression,
model comparison, and decision tree. Here are some of the key points
that are covered in this unit:
The first section introduces Linear Regression, a powerful tool
for modelling the relationship between dependent and independent
variables. It is commonly used for predicting continuous numerical
outcomes. It also covers simple and multiple linear regression,
assumptions, and interpretation of coefficients.
The second section covers Logistic Regression, which is used for
classification tasks. This technique is ideal for binary or multi-class
Bruce, P., Bruce, A., & Gedeck, P. (2020). Practical statistics for
data scientists: 50+ essential concepts using R and Python. O’Reilly
Media.
Mitchell, T. M. (1997). Machine Learning. McGraw Hill.
L E S S O N
5
Unveiling Data Patterns:
Clustering and Association
Rules
Hemraj Kumawat
Research Scholar
School of Computer & Systems Sciences
JNU, New Delhi - 110067
Email-Id: hkumawat077@gmail.com
STRUCTURE
5.1 Learning Objectives
5.2 Introduction
5.3 Understanding Clustering
5.4 The Mechanics of Clustering Algorithms
5.5 Real-World Applications of Clustering
5.6 Evaluating Cluster Quality
5.7 Unravelling Connections: The World of Association Rules
5.8 Real-World Applications of Association Rules
5.9 Summary
5.10 Answers to In-Text Questions
5.11 Self-Assessment Questions
5.12 References
5.13 Suggested Readings
To gain insight into association rules and their application in
identifying connections and relationships within datasets.
Learn about the practical applications of clustering and association
rules across various domains, from retail to healthcare.
5.2 Introduction
It is exciting, even amazing that we live in an era when we have more
data than ever. In this digital era, data is like oxygen for business. Every
click, transaction, and interaction on social media or e-commerce sites
generates an overwhelming amount of information that helps understand
customer behaviour, optimise business operations, and drive strategic
decisions. Every single organisation, be it a small start-up or a big tech
giant, is in a race to leverage the benefits of the vast amount of available
data. However, amid this vast sea of data lies a challenge: How do we
make sense of such vast data? Is there any way to gain meaningful insight
and extract hidden patterns from the data, which can guide us to make
better marketing decisions for our business?
Well, that is where data analysis comes into the picture. It is the key
for these organisations and businesses to survive and thrive in this data-
rich landscape. In this chapter, we plan to learn about two fundamental
pillars of data analysis: Clustering and Association rules. These popular
data analysis techniques help us to make sense of the enormous expanse
of data and uncover hidden but crucial information from the data.
The power of clustering: Amidst this chaos of data, imagine being able
to group similar kinds of data. Wouldn’t it make our life easy? Well,
certainly it would, and clustering helps us in doing so. In the business
sense, one can use clustering to categorise customers automatically based
on their preferences and cluster products based on their features. Clustering
empowers us to make data-driven decisions, whether identifying customer
groups for targeted marketing or finding anomalies in network traffic.
On the other side of the coin, association rules mining or association
analysis helps us uncover hidden relationships and connections within
our data. You might have noticed how e-commerce sites recommend the
products as if they know what you are planning to buy. Well, association
rules are the secrets to this magic. From a business perspective, by
1. Pattern Discovery: You might love how detectives find hidden
clues and information while solving the case. Well, clustering is
like a detective of the world of data, which searches for hidden
patterns in the data. It groups data based on associations and
structures that might not be apparent initially. It is beneficial for
market researchers to segment customers with similar purchasing
behaviour. To biologists, it may aid in grouping similar species in
diverse ecosystems.
2. Decision Support: It is often daunting to make sense of complex data.
When you have a myriad of data points, it can be overwhelming.
Clustering simplifies strategic decision-making by creating categories.
For instance, retailers can use clustering to optimise inventory
management and tailor marketing campaigns.
3. Data reduction: It is the era of big data, and managing such a vast
amount of data has become increasingly challenging. Clustering helps
organise vast amounts of data in manageable clusters, simplifying
decision-making. This is a boon in fields such as network security,
where anomaly detection for large datasets is a critical task.
4. Anomaly detection: We have discussed how clustering helps group
similar objects. Apart from that, clustering also plays a crucial role
in detecting outliers or identifying anomalies. A typical example is
fraud detection, where clustering can point out unusual patterns of
transactions that might be indicators of fraudulent activity.
Different flavours of clustering
There are various clustering approaches. You can refer to [1] for a
complete list of approaches. Each approach is suitable for a particular
type of data distribution:—
1. Centroid-based clustering: This type of clustering partitions data
based on proximity to centroids. It organises data into non-hierarchical
clusters, as opposed to hierarchical clustering, which we will discuss
shortly. One of the popular algorithms in this category
is k-means. Centroid-based algorithms are efficient; however, they
are susceptible to initial conditions and outliers. Figure 5.1 shows
an example of centroid-based clustering.
algorithms; instead, try to understand that clustering is more than that.
It is about revealing order in the chaos, identifying some structure that
seems absent at first glance, and, more importantly, gaining insights that
eventually result in better decision-making in business and beyond.
during this process unless the initial centroid estimates were exceptionally
accurate. This process continues until there are no further changes in
cluster membership. It is important to note that different initial centroid
selections can result in different cluster outcomes.
The working of k-means algorithms can be summarised in the following
steps:
1. Decide the value of k, that is, the number of clusters.
2. Randomly initialise k centroids for clusters (It can be selected from
data points randomly).
3. Assign each data point to the cluster whose centroid is closest,
based on Euclidean distance.
4. Compute the mean value of each cluster, which becomes new centroids
for the clusters.
5. Repeat step 3, i.e. reallocate data points to the new closest centroid.
6. If any change in cluster membership has occurred, go to step 4;
otherwise, terminate the process.
Note that the k-means algorithm uses Euclidean distance to compute the
distance of data points from cluster centroids. The Euclidean distance
between two points P(x1, y1) and Q(x2, y2) is d(P, Q) = √((x2 − x1)² + (y2 − y1)²);
with more than two attributes, the squared differences of all the attributes
are summed under the square root. A minimal code sketch of these steps is
given below.
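Here is a minimal sketch of those steps in Python (using numpy), applied to the small hypothetical student dataset (Age, Score 1, Score 2, Score 3) from the worked example; because the starting centroids are chosen at random, the intermediate numbers may differ from the ones shown in the text.

import numpy as np

# Students P1..P10: Age, Score 1, Score 2, Score 3 (values from the worked example).
X = np.array([
    [18, 73, 75, 57], [18, 79, 85, 75], [23, 70, 70, 52], [20, 55, 55, 55],
    [22, 85, 86, 87], [19, 91, 90, 89], [20, 70, 65, 65], [21, 53, 56, 59],
    [19, 82, 82, 60], [40, 76, 60, 78],
])

k = 3
rng = np.random.default_rng(0)
centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # step 2

for _ in range(100):
    # Step 3: assign each point to the nearest centroid (Euclidean distance).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 4: recompute each centroid as the mean of its current members.
    new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    if np.allclose(new_centroids, centroids):   # step 6: no change, so stop
        break
    centroids = new_centroids

print("final cluster membership:", labels)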
We then initiate the process with new centroids, that is, the means,
and calculate the Euclidean distance from each data point to these new
centroids-
Student   Age    Score 1   Score 2   Score 3   Distance to C1 / C2 / C3   Cluster assigned
C1        19.0   75.0      74.0      60.7
C2        19.7   85.0      87.0      83.7
C3        26.0   63.5      60.3      60.8
P1        18     73        75        57        4.44 / 31.68 / 19.62       C1
P2        18     79        85        75        12.34 / 10.89 / 33.40      C2
P3        23     70        70        52        11.52 / 39.11 / 14.93      C1
P4        20     55        55        55        28.19 / 52.42 / 13.04      C3
P5        22     85        86        87        30.73 / 4.14 / 42.72       C2
P6        19     91        90        89        36.23 / 8.58 / 49.83       C2
P7        20     70        65        65        11.20 / 32.53 / 10.86      C3
P8        21     53        56        59        28.55 / 50.96 / 12.53      C3
P9        19     82        82        60        10.65 / 24.42 / 29.37      C1
P10       40     76        60        78        30.62 / 35.42 / 25.45      C3
Now, the current state of 3 clusters is as follows- C1 = {P1, P3, P9};
C2 = {P2, P5, P6};
C3 = {P4, P7, P8, P10}. You may notice the change in the members of
the three clusters. We keep following the same procedure until no further
changes occur in cluster membership. The rest of the steps have been
left for the students to try by themselves.
The k-means method may use other distance measures, such as Manhattan
distance; however, employing various distance measures on the same
dataset can yield different outcomes, making it challenging to determine
the most optimal result.
A significant challenge in k-means clustering is deciding the optimal
value of k because the algorithm’s performance is highly dependent on an
optimal number of clusters. However, how do we decide on an optimal
number of clusters? There are many ways to decide the value of k, but
here we will discuss the most suitable method, the Elbow method.
DBSCAN
The Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
algorithm forms clusters based on the density of data points in the space.
Here, clusters are defined by areas of high-density data points separated
by lower-density areas. It is advantageous when dealing with data of
varying densities and variable shapes. This algorithm can automatically
find several clusters and can identify outliers as noise.
The fundamental concept behind the DBSCAN algorithm is that for every
data point within a cluster, its neighbourhood, defined by a given radius,
should encompass a minimum number of data points.
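A minimal sketch of DBSCAN with scikit-learn is given below; the eps radius, the min_samples value, and the synthetic data are all illustrative assumptions.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
dense_a = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))   # one dense region
dense_b = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))   # another dense region
noise = rng.uniform(low=-2, high=7, size=(10, 2))           # scattered points
X = np.vstack([dense_a, dense_b, noise])

# eps is the neighbourhood radius; min_samples is the minimum number of points
# a neighbourhood must contain for a point to seed a cluster.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("clusters found:", sorted(set(labels) - {-1}))   # label -1 marks noise
print("points flagged as noise:", int((labels == -1).sum()))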
Hierarchical Clustering Algorithm
Hierarchical clustering can have two approaches. An approach involves
gradually merging various data points into one cluster, a technique known
as the agglomerative approach. Another approach, which is called the
divisive approach, involves dividing large clusters into small clusters.
Here, we will consider the Agglomerative approach. This method begins
with each cluster initially containing a single data point. Then, using a
distance measure, the algorithm merges the two nearest clusters into a
single cluster. As the algorithm proceeds, the number of clusters gets
reduced. The process terminates once all the clusters are merged into a
single cluster.
Various methods can be used to compute the distance between clusters.
A few methods are discussed here-
Single link method (nearest neighbour): It determines the distance
between two clusters by considering the distance between their closest
points, where one point is chosen from each cluster.
Complete linkage method (furthest neighbour): The distance between
two clusters is determined by considering the distance between their most
distant points, with one point chosen from each cluster.
Centroid method: The distance is calculated by finding the difference
between the centroids of two clusters.
Unweighted pair-group average method: It calculates the distance between
two clusters by finding the average distance among all pairs of objects
taken from these clusters. This process entails computing a total of p * n
distances, where ‘p’ and ‘n’ denote the number of objects in each cluster.
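The linkage methods above can be tried directly with SciPy; the following minimal sketch (made-up 2-D points, SciPy's hierarchy module assumed) builds the merge history for each method and then cuts the tree into two clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 0.8],    # one small group
              [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])   # another small group

for method in ["single", "complete", "centroid", "average"]:
    Z = linkage(X, method=method)                     # agglomerative merge history
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
    print(method, labels)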
IN-TEXT QUESTIONS
1. What is the primary objective of clustering in data analysis?
(a) To perform binary classification
(b) To group similar data points together
(c) To predict future data patterns
(d) To perform statistical hypothesis testing
2. _______ is employed to determine the optimal value of ‘k’ in
the k-means clustering algorithm.
(a) Arm method
(b) Threshold method
(c) Elbow method
(d) None of the above
In the case of textual data, clustering can be employed for topic
modelling and document classification, which further aid in efficient
content management and retrieval.
4. Anomaly detection: Anomaly detection plays an integral part in
various domains. For instance, in cybersecurity, unusual patterns
in network traffic may indicate a cyberattack. Another example
could be manufacturing, where product quality anomalies may
indicate faults in the production line. We can use clustering to
create clusters of normal behaviour. Anything that falls outside of
clusters is considered an anomaly and flagged immediately, leading
to further investigation.
In this section, we saw that clustering’s applications range from
customer-centric marketing strategies to advancement in healthcare
and beyond, and it proves itself to be a versatile tool that transforms
large amounts of data into actionable knowledge.
In the following section, we will learn the inner workings of the
clustering algorithms and showcase their uses in real-world scenarios.
cluster’s centroid. The lower the inertia, the tighter or more compact
the clusters are.
2. Silhouette Score: It quantifies cohesiveness in the cluster, i.e. how
similar an object is to its cluster compared to other clusters, which
indicates separation. A higher value of silhouette score indicates
better-defined clusters.
External evaluation metrics: These evaluation metrics rely on external
information, such as ground truth labels of expert knowledge, in order to
assess the quality of clusters. This is particularly useful when you have
access to labelled data. Common metrics in this category include:—
1. Adjusted Rand Index (ARI): This metric assesses the similarity
between the true labels and cluster assignments while correcting for
chance. It produces a score ranging from -1 to 1, where a score near
0 indicates chance-level agreement and 1 denotes perfect similarity.
2. Normalised Mutual Information (NMI): This metric quantifies the
mutual information between cluster assignments and actual labels
while normalising the result to a value from 0 to 1. A score of 0
indicates no similarity, while a score of 1 signifies perfect similarity.
The visual aspect of evaluation: Evaluation metrics discussed so far
provide a quantitative aspect of cluster quality; however, visual inspection
has a crucial role in cluster quality assessment. Visualisation techniques
such as scatter plots, dendrogram, and t-SNE (t-Distributed Stochastic
Neighbour Embedding) help understand aspects metrics might miss. It
may uncover complex cluster structures and anomalies that cannot be
captured only through metrics.
Note that choosing appropriate evaluation metrics depends on the problem
you are tackling. Some metrics are more suitable for specific data types
and cluster shapes. Therefore, it is essential to consider the nature of
the data and the task at hand in choosing evaluation metrics. Clustering
quality evaluation is an important step in the data analysis journey. It
ensures that clusters created are not just mathematical constructs but
provide meaningful insights that help the decision-making process.
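As a minimal sketch of these metrics in practice, the following Python code (scikit-learn assumed, with synthetic data whose true labels are known) computes inertia and the silhouette score as internal measures, and ARI and NMI as external measures.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, normalized_mutual_info_score,
                             silhouette_score)

# Synthetic data with known ground-truth labels.
X, true_labels = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("Inertia (internal):   ", round(km.inertia_, 1))
print("Silhouette (internal):", round(silhouette_score(X, km.labels_), 3))
print("ARI (external):       ", round(adjusted_rand_score(true_labels, km.labels_), 3))
print("NMI (external):       ", round(normalized_mutual_info_score(true_labels, km.labels_), 3))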
With N = 4 transactions, the outcomes are as follows: all items meet the
support criterion of 50%, only two pairs satisfy the minimum support
criterion, and no 3-item set meets the minimum support threshold.
Itemsets Frequency
Laptop 3
Smart Watch 3
Keyboard 2
Mouse 2
(Laptop, Smart Watch) 2
(Laptop, Keyboard) 1
(Laptop, Mouse) 1
(Smart Watch, Keyboard) 2
(Smart Watch, Mouse) 1
(Keyboard, Mouse) 1
(Laptop, Smart Watch, Keyboard) 1
(Laptop, Smart Watch, Mouse) 0
(Laptop, Keyboard, Mouse) 0
(Smart Watch, Keyboard, Mouse) 1
(Laptop, Smart Watch, Keyboard, Mouse) 0
We will examine the pairs that meet the minimum support value to assess
whether they also meet the minimum confidence level. Items or item sets
meeting the minimum support threshold are referred to as ‘frequent.’ All
individual items and two pairs are deemed frequent in the above example.
Next, we evaluate whether the pairs {Laptop, Smart Watch} and {Smart
Watch, Keyboard} satisfy the 75% confidence threshold for association
rules. For each pair {X, Y}, we can derive two rules, namely X → Y and
Y → X, provided both rules meet the minimum confidence requirement.
The confidence of X → Y is determined by dividing the support of X and
Y occurring together by the support of X.
We have four potential rules, and their respective confidence levels are
as follows:
Laptop → Smart Watch, having a confidence of 2/3 ≈ 66.7%
Smart Watch → Laptop, having a confidence of 2/3 ≈ 66.7%
Smart Watch → Keyboard, having a confidence of 2/3 ≈ 66.7%
Keyboard → Smart Watch, having a confidence of 2/2 = 100%
Hence, only the final rule ‘Keyboard → Smart Watch’ possesses a
confidence level exceeding the minimum 75% threshold and is eligible.
Rules surpassing the user-defined minimum confidence are called ‘confident’.
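The calculation above can be reproduced with a few lines of Python. The sketch below assumes four hypothetical transactions chosen to be consistent with the frequency table in this example; only the rule Keyboard → Smart Watch should clear both thresholds.

from itertools import combinations

transactions = [
    {"Laptop", "Smart Watch", "Keyboard"},
    {"Smart Watch", "Keyboard", "Mouse"},
    {"Laptop", "Mouse"},
    {"Laptop", "Smart Watch"},
]
n = len(transactions)
min_support, min_confidence = 0.5, 0.75

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / n

items = sorted(set().union(*transactions))
frequent_pairs = [set(p) for p in combinations(items, 2) if support(set(p)) >= min_support]

for pair in frequent_pairs:
    for antecedent in pair:
        consequent = next(iter(pair - {antecedent}))
        confidence = support(pair) / support({antecedent})
        if confidence >= min_confidence:
            print(f"{antecedent} -> {consequent}  (confidence {confidence:.0%})")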
This straightforward algorithm employs a brute-force approach and performs
effectively when dealing with four items, as it requires the examination
of only 16 combinations. However, when confronted with many items,
such as 100, the number of combinations becomes astronomically large. For
instance, with just 20 items the number of combinations already amounts
to approximately one million, since the total number of combinations is
2^n for ‘n’ items.
algorithm’s efficiency when handling larger datasets, it still struggles to
cope with many items and transactions. To address this limitation, we
introduce a more advanced algorithm known as the Apriori algorithm.
The Apriori algorithm
The popular algorithm for association rule mining is the Apriori algorithm.
It works on the principle of “Apriori” or “prior knowledge”: if an itemset
is frequent, then all of its subsets must also be frequent. Exploiting
this property lets the algorithm reduce the search space and speed up
the rule discovery process.
This algorithm works as follows:—
1. Find Frequent Itemsets: The algorithm begins by identifying frequent
itemsets—Sets of items that regularly appear together in the dataset.
2. Create Association Rules: Using these frequent itemsets, association
rules are generated. These rules have the form “If {A} then {B},”
indicating that if itemset A is present, itemset B is also likely to
be present.
3. Calculate Support and Confidence: Support and Confidence are two
important metrics used with association rules. Support measures
how frequently a rule is applicable, while confidence quantifies the
strength of the rule.
4. Pruning and Optimisation: The Apriori algorithm employs pruning
techniques to eliminate uninteresting or redundant rules, focusing
on those with the highest support and confidence. Pruning and
Optimisation techniques adhere to the anti-monotone property,
meaning any subset of a frequent itemset must also be frequent. It
6. Step 5: Scan all transactions and find frequent items within the
candidate set Ck. Label these frequent itemsets as Lk.
7. Go to Step 4 while Lk is not empty.
8. Terminate when Lk is empty.
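In practice, the level-wise search does not have to be coded by hand. The sketch below shows one possible way to run Apriori, assuming the third-party mlxtend library and the same four hypothetical transactions used earlier in this lesson.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["Laptop", "Smart Watch", "Keyboard"],
    ["Smart Watch", "Keyboard", "Mouse"],
    ["Laptop", "Mouse"],
    ["Laptop", "Smart Watch"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

frequent = apriori(onehot, min_support=0.5, use_colnames=True)   # level-wise L1, L2, ...
rules = association_rules(frequent, metric="confidence", min_threshold=0.75)
print(rules[["antecedents", "consequents", "support", "confidence"]])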
IN-TEXT QUESTIONS
3. In the context of association rules, what is “market basket
analysis” commonly used for?
(a) Identifying customer demographics
(b) Predicting stock market trends
(c) Discovering patterns in customer purchasing behaviour
(d) Analysing website traffic data
4. In the context of association rule mining, what does the term
“support” refer to?
(a) The probability of an association rule being true
(b) The number of transactions containing all items in the
antecedent and consequent of a rule
(c) The lift of the association rule
(d) The confidence of the association rule
5.9 Summary
This lesson serves as an introductory gateway into the dynamic world
of data analysis, where we embark on a journey to unlock the hidden
patterns, relationships, and insights concealed within data. We delve
into the fundamental data analysis techniques, focusing on two pivotal
ones: clustering and association. Following are the key highlights of the
chapter:—
Clustering: It is the art of grouping similar data points, revealing
natural patterns and structures. It finds applications in customer
segmentation, biological data organisation, and scientific classifications.
Association Rules: Association rules mining uncovers connections and
relationships within data. It is vital for optimising retail strategies,
enhancing recommendation systems, and gaining insights into
healthcare and other domains.
Clustering and association rules are foundational techniques for data
pattern discovery, providing the tools to extract valuable information
from complex datasets.
The lesson concludes by setting the stage for further exploration, promising
advanced algorithms, real-world case studies, and practical applications
in subsequent lessons.
5.12 References
Xu, Dongkuan, and Yingjie Tian. “A comprehensive survey of
clustering algorithms.” Annals of Data Science 2 (2015): 165-193.
L E S S O N
6
From Data to Strategy:
Classification and Market
Basket Analysis Driving
Action
Hemraj Kumawat
Research Scholar
School of Computer & Systems Sciences
JNU, New Delhi - 110067
Email-Id: hkumawat077@gmail.com
STRUCTURE
6.1 Learning Objectives
6.2 Introduction
6.3 Classification: A Predictive Modelling
6.4 Understanding Popular Classification Algorithms
6.5 Naive Bayes Classification: Probability and Independence
6.6 K-NN (K-Nearest Neighbour) Algorithm: The Power of Proximity
6.7 Evaluation Measures for Classification
6.8 Real-World Applications of Classification
6.9 Strategic Insights with Market Basket Analysis
6.10 Summary
6.11 Answers to In-Text Questions
6.12 Self-Assessment Questions
6.13 References
6.14 Suggested Readings
6.2 Introduction
“It is a capital mistake to theorise before one has data. Insensibly, one
begins to twist facts to suit theories, instead of theories to suit facts”,
the famous fictional character Sherlock Holmes proclaims in the short story
‘A Scandal in Bohemia’ by Sir Arthur Conan Doyle.
This basic idea lies at the core of the data analysis. Data serves as
the bedrock of knowledge and fuels well-informed decision-making. In
the contemporary data-driven landscape, organisations grapple with an
abundance of information. The power to transform this data into actionable
insights is not just a competitive advantage but a strategic necessity. In
this lesson, we will journey through predictive modelling, classification,
and market basket analysis, tracing the path from raw data to informed
strategies that drive business success.
There is more to data than just numbers and records; it is a treasure
trove of opportunities and hidden patterns waiting to be uncovered. In
the quest for excellence and competitiveness, organisations have realised
that harnessing data’s power is paramount. However, it is not the data
itself but the intelligent application of it that makes the difference. This
lesson focuses on how data is transformed into actionable insights and
strategies that shape the future.
At the heart of this lesson is classification—a supervised learning technique
that categorises data into pre-defined groups or classes. Classification
serves as the bridge between raw data and informed decision-making.
It empowers organisations to not only understand their data but also to
predict future outcomes. By exploring various classification algorithms
IN-TEXT QUESTIONS
1. Which classification algorithm is known for modelling the
probability of an observation belonging to a particular class,
making it suitable for binary classification tasks?
(a) Random Forest
(b) Decision Trees
(c) Logistic regression
(d) Support Vector Machines
2. What is the primary advantage of using Random Forest, an
ensemble learning method, for classification?
(a) It is simple and easy to interpret
(b) It can handle complex, non-linear classification tasks
effectively
(c) It relies on the concept of maximising margin
(d) It works well with small datasets
P(H|E) = P(E|H) · P(H) / P(E)    (1)
where,
P(H): The prior probability of hypothesis H (prior).
P(E): The probability of evidence E (marginal likelihood).
P(H|E): The probability of hypothesis H given evidence E (posterior
probability).
P(E|H): The probability of evidence E given hypothesis H (likelihood).
The term “naive” in Naive Bayes stems from the assumption that all
features are independent when considering the class. In simpler terms, the
presence or absence of one feature is assumed not to influence the presence
or absence of any other feature. While this independence assumption is
rarely valid in real-world data, Naive Bayes can perform well, especially
when features are approximately conditionally independent.
Types of Naive Bayes:
There are several variants of the Naive Bayes algorithm, each suited to
different types of data:
Gaussian Naive Bayes: This variant is used when the features follow
a Gaussian (normal) distribution. It is appropriate for continuous
data.
Multinomial Naive Bayes: This variant is commonly used for text
classification tasks, where features represent word frequencies (e.g.,
in document classification). It assumes a multinomial distribution
for the features.
Bernoulli Naive Bayes: Suited for binary or Boolean features, such
as document presence/absence in text classification.
The Naïve Bayes algorithm works in two phases:—
1. Training Phase: During the training phase, Naive Bayes calculates
the prior probabilities of each class (P(H)) and the conditional
probabilities of each feature given each class (P(E|H)). These
probabilities are estimated from the training data.
2. Prediction Phase: When making predictions for new data, Naive
Bayes calculates the posterior probabilities of each class given
the observed features using Bayes’ theorem. The predicted class is
determined as the one with the highest posterior probability.
An Example: Spam Email Detection:
Let us suppose you work for an email service provider, and you want to
classify incoming emails as either “Spam” or “Not Spam” (ham). You
have a dataset of emails and their corresponding labels, indicating whether
each email is spam (1) or not (0). Your goal is to build a Naive Bayes
classifier to classify new incoming emails automatically.
Here is a small portion of the dataset you have:—
Email Text                                              Label (1 = Spam, 0 = Not Spam)
“Congratulations! You’ve won a million dollars!” 1
“Hello, please find attached the report.” 0
“Get a free iPhone now!” 1
“Meeting agenda for tomorrow’s conference.” 0
“Claim your prize today!” 1
“Reminder: Your appointment is tomorrow.” 0
Step 1: Data Pre-processing: For this example, let us assume that the
text has been pre-processed and converted into a numerical format. We
have already calculated the prior and conditional probabilities from the
training data.
Prior Probabilities (from training data): Calculate the prior probabilities:
P(Spam) and P(Not spam) based on the training data:
P(Spam) = Count of Spam Emails / Total Count of Emails in the
Training Set.
P(Not spam) = Count of Not Spam Emails / Total Count of Emails
in the Training Set.
In our example, P(Spam) = 3/6 = 0.5, and P(Not Spam) = 3/6 = 0.5.
Conditional Probabilities (from training data): Calculate the conditional
probabilities of each word in the email text given the class (Spam or
Not Spam). For example, calculate P(Claim|Spam), P(Claim|Not Spam),
P(free|spam), P(free|Not spam), etc., using the training data. Let us assume
the following conditional probabilities for some example words:
P(Claim | Spam) ≈ …
P(Not Spam | “Claim your free prize now!”) ∝ P(“Claim your
free prize now!” | Not Spam) * P(Not Spam)
3. Normalise the probabilities:
P(Spam | “Claim your free prize now!”) = P(Spam | “Claim
your free prize now!”) / (P(Spam | “Claim your free prize
now!”) + P(Not Spam | “Claim your free prize now!”))
P(Not Spam | “Claim your free prize now!”) = P(Not Spam |
“Claim your free prize now!”) / (P(Spam | “Claim your free
prize now!”) + P(Not Spam | “Claim your free prize now!”))
4. Choose the class with the maximum posterior probability as the
predicted class.
P(Spam | “Claim your free prize now!”) ≈ 0.8333
P(Not Spam | “Claim your free prize now!”) ≈ 0.1667
Since P(Spam|”Claim your free prize now!”) > P(Not Spam|”Claim your
free prize now!”), the email is predicted to be “Spam.”
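The same example can be reproduced end to end with scikit-learn's multinomial Naive Bayes; the sketch below is a minimal illustration using a simple bag-of-words representation of the six training emails, so the exact probabilities will differ slightly from the hand calculation.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "Congratulations! You've won a million dollars!",
    "Hello, please find attached the report.",
    "Get a free iPhone now!",
    "Meeting agenda for tomorrow's conference.",
    "Claim your prize today!",
    "Reminder: Your appointment is tomorrow.",
]
labels = [1, 0, 1, 0, 1, 0]             # 1 = Spam, 0 = Not Spam

vectorizer = CountVectorizer()          # word counts stand in for the pre-processing step
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)  # training phase: priors and likelihoods

new_email = ["Claim your free prize now!"]
X_new = vectorizer.transform(new_email)
print(model.predict(X_new))             # prediction phase: expected [1], i.e. Spam
print(model.predict_proba(X_new))       # posterior probabilities for each class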
Advantages of Naive Bayes:
Simplicity: Naive Bayes is straightforward to comprehend and apply,
making it appropriate for novices and experienced practitioners.
Efficiency: It demonstrates computational efficiency and can manage
datasets with high dimensionality, featuring numerous attributes.
Interpretability: The probabilities assigned to each class can provide
insights into the classification process.
Effective for Text Classification: Naive Bayes is particularly effective
in text classification tasks, such as sentiment analysis and spam
detection.
Challenges and Limitations:
Independence Assumption: The strong independence assumption
may only hold in some real-world scenarios.
Data Scarcity: It may perform poorly when limited training data or
certain feature combinations are unseen in the training set.
Continuous Data: Gaussian Naive Bayes assumes that features follow
a Gaussian distribution, which may not be accurate for all datasets.
Applications:
Naive Bayes has numerous applications, including:
Spam email detection.
Document classification.
Disease diagnosis based on medical test results.
Credit risk assessment in finance.
Sentiment analysis in NLP.
Despite its simplicity, Naive Bayes is a robust classification algorithm
that can yield excellent results in various real-world scenarios, mainly
when the independence assumption is approximately satisfied or when
dealing with text data.
and all the data points within the training set, potentially leading
to impractical computational demands.
Storage and Memory Requirements: The K-NN algorithm requires
storing the entire training dataset in memory for prediction, which
can be memory-intensive for large datasets.
Despite these challenges, K-NN remains a valuable algorithm in various
domains, primarily when used appropriately and with a good understanding
of its strengths and limitations. Addressing these challenges often involves
careful pre-processing, parameter tuning, and keeping in mind the specific
characteristics of the dataset and task at hand.
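A minimal K-NN sketch with scikit-learn is shown below on a tiny, hypothetical dataset; note that "training" merely stores the data, and every prediction computes a distance to all stored points, which is exactly where the computational and memory costs discussed above come from.

from sklearn.neighbors import KNeighborsClassifier

# Hypothetical (height cm, weight kg) measurements with two made-up classes.
X_train = [[160, 55], [165, 60], [170, 68], [180, 80], [185, 85], [175, 75]]
y_train = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3)   # a small K is more sensitive to noise
knn.fit(X_train, y_train)                   # simply stores the training set

print(knn.predict([[168, 63]]))             # majority vote of the 3 nearest neighbours -> ['A']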
IN-TEXT QUESTIONS
3. What is a key assumption of the Naive Bayes algorithm?
(a) All features are dependent on each other
(b) All features exhibit conditional independence with respect
to the class
(c) Features have a linear relationship
(d) Features are normally distributed
4. What happens if you choose a small value of ‘K,’ such as 1
in K-NN?
(a) The algorithm becomes computationally slower
(b) The decision boundary becomes more complex
(c) The model becomes less prone to overfitting
(d) The predictions may be sensitive to noise or outliers
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)
Use Case: Accuracy is suitable when the class distribution in the
dataset is roughly balanced.
3. Precision:
Definition: It measures the fraction of true positive predictions among
all positive predictions by the classifier. It indicates the classifier’s
ability to avoid false positive errors. It is given by equation (5).
Precision = TP / (TP + FP)    (5)
Use Case: Precision is crucial when false positives are costly or
when the focus is on minimising type I errors (e.g., in medical
diagnoses).
4. Recall (True Positive Rate or Sensitivity):
Definition: It computes the fraction of true positive predictions among
all truly positive instances in the dataset. It is given by equation
(6).
Recall = TP / (TP + FN)    (6)
Use Case: Recall is important when missing positive instances (false
negatives) is costly or when the goal is to maximise true positive
identification (e.g., in disease detection).
5. F1-Score:
Definition: It is computed using the harmonic mean of precision
and recall. It provides a balance between these two metrics and
is particularly useful when dealing with imbalanced datasets. It is
given by equation (7).
F1-Score = 2 · (Precision · Recall) / (Precision + Recall)    (7)
Use Case: The F1-score is valuable when both precision and recall
need to be considered, especially in scenarios with imbalanced
classes.
6. ROC-AUC (Receiver Operating Characteristic Curve and Area Under the Curve):
Definition: The ROC curve is a graphical representation of a
classifier’s performance across different threshold values. It generates
a plot depicting the True Positive Rate (Recall) in relation to the
False Positive Rate (1 - Specificity) at different thresholds. The
AUC quantifies the area under the ROC curve, providing a single
numeric value for classifier performance.
Use Case: ROC-AUC helps assess binary classifiers and compare
models’ performance. A higher AUC indicates a better classifier.
7. Specificity (True Negative Rate):
Definition: Specificity quantifies the ratio of correct negative predictions
to all actual negative instances, indicating the classifier’s capacity
to detect all negative instances correctly. It can be calculated using
the equation (8).
Specificity = TN / (TN + FP)    (8)
Use Case: Specificity is valuable when avoiding false alarms (false
positives) is crucial, such as in security or fraud detection.
8. F-beta Score:
Definition: The F-beta score is a generalisation of the F1-score that
allows you to adjust the balance between precision and recall. It
incorporates a parameter (beta) that controls the relative significance
of precision and recall. It is given by equation (9).
F-beta Score = (1 + β²) · (Precision · Recall) / (β² · Precision + Recall)    (9)
Use Case: F-beta score is helpful when a specific trade-off between
precision and recall needs to be considered.
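All of the measures above are available in scikit-learn; the following minimal sketch computes them for a small set of hypothetical true labels, hard predictions, and predicted probabilities.

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             fbeta_score, precision_score, recall_score,
                             roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]                          # hard class predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3, 0.95, 0.05]    # predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Accuracy   :", accuracy_score(y_true, y_pred))
print("Precision  :", precision_score(y_true, y_pred))
print("Recall     :", recall_score(y_true, y_pred))
print("F1-score   :", f1_score(y_true, y_pred))
print("F2-score   :", fbeta_score(y_true, y_pred, beta=2))         # recall weighted more heavily
print("Specificity:", tn / (tn + fp))                              # TN / (TN + FP)
print("ROC-AUC    :", roc_auc_score(y_true, y_score))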
Beyond Retail: While Market Basket Analysis has traditionally been
associated with the retail sector, its principles extend far beyond.
Its versatility and the insights it provides can be applied across
various industries. Here are some examples:
Media and Entertainment: Streaming platforms utilise Market
Basket Analysis to suggest additional content to users,
considering their viewing history. This keeps users engaged
and increases content consumption.
Hospitality: In the hospitality industry, Market Basket Analysis
helps hotels and resorts offer personalised experiences by
suggesting additional services or amenities based on guest
preferences. For example, a guest booking a room may receive
offers for spa services or restaurant reservations.
E-commerce: Besides retail, e-commerce platforms also use Market
Basket Analysis extensively. They recommend complementary
products, enhance user experiences, and maximise sales revenue.
Market Basket Analysis provides a data-driven approach to understanding
customer behaviour and preferences in all these cases, leading to more
targeted and effective marketing strategies.
As we conclude our exploration of Market Basket Analysis, it becomes
clear that this technique is not just about product recommendations. It is
a strategic tool that drives revenue, enhances customer satisfaction, and
extends its reach beyond retail. The insights gained from Market Basket
Analysis can reshape business strategies, making them more customer-
centric, profitable, and sustainable.
IN-TEXT QUESTIONS
5. What is the primary goal of Market Basket Analysis (MBA)?
(a) To identify the most frequently purchased product.
(b) To discover associations between products frequently
bought together.
(c) To analyse customer demographics for targeted marketing.
(d) To optimise product pricing strategies.
6.10 Summary
In this lesson, we embarked on a journey through the world of predictive
modelling, classification, and Market Basket Analysis, discovering how
data-driven insights shape the strategies of modern businesses. We have
explored the significance and real-world applications of these techniques,
shedding light on their transformative power. The following key points
were discussed in this lesson:
1. Predictive Modelling with Classification: We began by understanding
the fundamentals of classification—an essential technique in
supervised learning. Classification bridges raw data and informed
decision-making, enabling organisations to predict future outcomes,
from diagnosing diseases to identifying fraudulent activities. We
explored various classification algorithms and their critical role in
data-driven strategies.
2. Strategic Insights with Market Basket Analysis: Moving on to
Market Basket Analysis, we unveiled its core concepts, including
association rules, support, confidence, and lift. We learned how
this technique deciphers consumer behaviour, optimises product
placement, and drives cross-selling and upselling strategies. Beyond
the retail sector, we discovered how Market Basket Analysis extends
its reach to various industries, providing insights that enhance the
customer experience and drive business growth.
3. The Profound Impact on Business Strategies: Finally, we examined
the tangible impact of Market Basket Analysis on business strategies.
We saw how it influences pricing strategies, enabling businesses to
bundle products, offer discounts, and increase the average transaction
value. We explored the art of cross-selling and upselling, where
6.13 References
Doyle, Arthur Conan, “A Scandal in Bohemia,” in “The Adventures
of Sherlock Holmes,” Strand Magazine, 1891.
Hastie, Trevor, et al., “The Elements of Statistical Learning: Data
Mining, Inference, and Prediction,” 2nd ed. Springer, 2009.
Mitchell, Tom M., “Machine Learning,” McGraw-Hill, 1997.
Quinlan, J. Ross, “C4.5: Programs for Machine Learning,” Morgan
Kaufmann, 1993.
Powers, David M. W., “Evaluation: From Precision, Recall and
F-Measure to ROC, Informedness, Markedness & Correlation,” in
“Journal of Machine Learning Technologies,” 2011.
Fawcett, Tom, “An Introduction to ROC Analysis,” in “Pattern
Recognition Letters,” 2006.
Agrawal, Rakesh, and Ramakrishnan Srikant, “Fast Algorithms for
Mining Association Rules,” in “Proceedings of the 20th International
Conference on Very Large Data Bases (VLDB),” 1994.
Chen, Ming-Syan, et al., “A Fast Parallel Algorithm for Thinning
Digital Patterns,” in “Communications of the ACM,” 1993.
“Data Science for Business” by Foster Provost & Tom Fawcett (First
edition, 2013).
“Mining of Massive Datasets” by Jure Leskovec, Anand Rajaraman,
& Jeffrey D. Ullman (Third edition, 2014).
“Data Mining: Concepts and Techniques” by Jiawei Han, Micheline
Kamber, & Jian Pei (Third Edition, 2011).
“Data Science for Executives” by Nir Kaldero (First Edition, 2018).
“Data Mining: Practical Machine Learning Tools and Techniques”
by Ian H. Witten & Eibe Frank (Second Edition, 2005).
L E S S O N
7
Predictive Analytics
and its Use
Dr. Charu Gupta
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi
Email-Id: charugupta.sol.2023@gmail.com; charu.gupta@sol-du.ac.in
STRUCTURE
7.1 Learning Objectives
7.2 Introduction
7.3 Predictive Analytics
7.4 Predictive Analytics Applications and Benefits: Marketing, Healthcare, Operations and Finance
7.5 Text Analysis
7.6 Analysis of Unstructured Data
7.7 In-Database Analytics
7.8 Summary
7.9 Answers to In-Text Questions
7.10 Self-Assessment Questions
7.11 References
7.12 Suggested Readings
It has been a human tendency to want to know the future. Over the ages, various
techniques have been developed to predict events that may occur in the future
based on historical as well as current data. With the advent of science and
technology, the sophistication of prediction using historical and current data
has increased manifold, and Predictive Analysis is now applied across a wide
variety of fields. Forecasting the weather, predicting future market trends,
and estimating sales during a predefined period are a few examples of using
data and prediction methods. Nowadays, vast amounts of data, rich in volume,
variety, and veracity, are available to be mined for hidden patterns and trends,
giving insights into future outcomes and performance. These insights are used
for developing video games, translating voice to text, making decisions regarding
customer-oriented services, building investment portfolios, improving operational
efficiency, reducing risk, detecting deviations, detecting cyber fraud, anticipating
potential losses, analysing sentiment and social behaviour, and much more.
[Figure: The predictive modelling cycle, comprising Data Collection, Data Analysis, Develop Model, Decide Model, Deploy Model, and Depurate (Monitor and Refine) Model.]
from transactional systems, sensors, and logs to ensure in-depth results.
However, it is to be ensured that the data is collected from reliable sources
and complies with the data privacy and governance policies. The results
of prediction derived from the predictive model depend entirely on the
data being utilised; it is essential to collect the most relevant data aligned
with the problem and requirements. While collecting data, the format, time period, and attributes of the data should also be recorded as metadata.
(3) DATA ANALYSIS - EXPLORATORY DATA ANALYSIS
In the next step, after the requisite data has been collected in sufficient volume,
the analysis is performed. The relevancy, suitability, quality, and cleanliness
of the data are assessed. The collected data is then cleaned, structured, and
formatted as desired using various data cleaning, transformation, and wrangling
processes.
Once the data has been transformed into the desired form, it is analysed using
various statistical tools and methods. It is crucial to know the properties of
the data. This process of analysing and exploring the data before applying
the predictive model is called Exploratory Data Analysis. In this step,
the dependent and independent attributes and correlation among features/
attributes of a dataset are determined. The data types, dimensions, and
data distribution among the datasets collected are identified. It is also
analysed for any missing data, duplicate values, redundant data, outliers,
and any prominent pattern in data distribution. The correlation among the
features and attributes of data sets is calculated, and their impact on the
outcome is identified. Raw data sets may also be transformed into new features
for making a prediction. For example, to predict the sales of a clothing brand,
customers may be grouped into age bands such as kids, teenagers, young adults,
middle-aged people, and senior citizens instead of using exact age values.
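As a brief illustration of this step, the following Python sketch (using pandas) profiles a dataset and derives an age-group feature; the file name, column names, and bin boundaries are assumptions made for demonstration only.

import pandas as pd

df = pd.read_csv("sales.csv")                  # hypothetical customer sales dataset

print(df.dtypes)                               # data types of each attribute
print(df.shape)                                # dimensions of the dataset
print(df.isnull().sum())                       # missing values per column
print(df.duplicated().sum())                   # duplicate rows
print(df.describe())                           # distribution summary; helps flag outliers
print(df.corr(numeric_only=True))              # correlation among numeric attributes

# Feature derivation: replace exact ages with age groups
bins   = [0, 12, 19, 35, 60, 120]
labels = ["kids", "teenagers", "young adults", "middle-aged", "senior citizens"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels)
print(df["age_group"].value_counts())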
Exploratory Data Analysis (EDA) is a pivotal phase in the data analysis
process that involves scrutinizing and understanding raw data before formal
statistical modelling. It serves as the lens through which data analysts
and scientists unravel patterns, relationships, and outliers, facilitating the
formulation of hypotheses and guiding subsequent analytical decisions. EDA
encompasses a myriad of techniques, from descriptive statistics and data
visualization to summary statistics and hypothesis testing. Visualization
tools such as histograms, scatter plots, and box plots provide a visual summary of distributions, relationships among variables, and potential outliers.
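A minimal plotting sketch of these three chart types is given below; the small in-memory DataFrame is purely illustrative.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "age":    [23, 45, 31, 62, 18, 54, 37, 29],
    "sales":  [120, 340, 200, 410, 90, 380, 260, 150],
    "region": ["North", "South", "North", "South", "North", "South", "North", "South"],
})

df["age"].plot(kind="hist", bins=5, title="Age distribution")        # histogram
plt.show()
df.plot(kind="scatter", x="age", y="sales", title="Age vs. sales")   # scatter plot
plt.show()
df.boxplot(column="sales", by="region")                              # box plots per region
plt.show()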
Although their computational complexity and demand for substantial data can
pose challenges, neural networks stand at the forefront of cutting-edge
artificial intelligence applications, showcasing their prowess in addressing
complex problems across diverse domains.
(5) DECIDE - MODEL EVALUATION - ASSESSING PERFORMANCE FOR INFORMED DECISION-MAKING
In this step, the efficiency of the predictive model applied in the previous
step is analysed by performing various tests. Different measures are used
for evaluating the performance of predictive models.
For regression models: Mean Squared Error (MSE), Root Mean Squared
Error (RMSE), R Squared (R2 Score).
For classification models: F1 Score, Confusion Matrix, Precision, Recall,
AUC-ROC, Percent Correct Classification.
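For the regression metrics listed above, a minimal computation with scikit-learn might look as follows; the actual and predicted values are hypothetical.

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = [3.0, 5.0, 2.5, 7.0, 4.5]     # actual values (hypothetical)
y_pred = [2.8, 5.4, 2.9, 6.4, 4.7]     # model predictions (hypothetical)

mse  = mean_squared_error(y_true, y_pred)   # Mean Squared Error
rmse = np.sqrt(mse)                         # Root Mean Squared Error
r2   = r2_score(y_true, y_pred)             # R-squared

print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}, R2 = {r2:.3f}")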
Model evaluation assesses whether a trained model can generalize well to new, unseen data. Common evaluation metrics include
accuracy, precision, recall, F1 score, and area under the Receiver Operating
Characteristic (ROC) curve, among others. These metrics offer insights
into the model’s performance in terms of correctly classified instances,
false positives, false negatives, and the trade-offs between precision and
recall. Cross-validation techniques, such as k-fold cross-validation, aid
in robustly assessing a model’s performance by mitigating the impact
of data variability. The ultimate goal of model evaluation is to provide
stakeholders with a clear understanding of the model’s strengths and
limitations, enabling informed decision-making regarding its deployment,
optimization, or potential adjustments to meet specific business or
application requirements.
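The sketch below illustrates k-fold cross-validation with scikit-learn; the choice of dataset (the bundled breast-cancer data) and model (logistic regression) are assumptions made for demonstration.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)       # toy binary-classification data
model = LogisticRegression(max_iter=5000)

# 5-fold cross-validation: each fold is held out once as the test set
scores = cross_val_score(model, X, y, cv=5, scoring="f1")
print("F1 per fold :", scores.round(3))
print("Mean F1     :", scores.mean().round(3))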
(6) DEPLOY - MODEL DEPLOYMENT - BRIDGING INSIGHT TO
ACTION
During the sixth step of predictive analysis, the evaluated model is deployed
in a real-world environment to support day-to-day decision-making processes.
For example, a recommendation engine may be integrated into an e-commerce
platform to suggest highly rated products to customers.
Model deployment marks the culmination of the predictive learning
journey, transitioning from insightful analyses to practical applications.
Once a machine learning model has been trained, validated, and fine-
tuned, deployment involves integrating it into real-world systems to make
predictions on new, unseen data. The deployment process encompasses
considerations of scalability, efficiency, and integration with existing
infrastructure. Cloud-based solutions, containerization, and Application
Programming Interfaces (APIs) play pivotal roles in streamlining
this transition. Continuous monitoring and feedback mechanisms are
implemented to ensure the model’s ongoing relevance and accuracy in
dynamic environments. Model deployment is a crucial bridge connecting
predictive insights to actionable outcomes, enabling organizations to
leverage the power of machine learning for informed decision-making and
improved business processes. Successful deployment hinges not only on
the technical prowess of the model but also on the seamless integration
of predictive analytics into the operational fabric of the organization.
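As a rough sketch of what such a deployment can look like, the code below wraps a previously saved scikit-learn model in a small FastAPI service; the file name, endpoint, and feature names are all hypothetical.

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")      # hypothetical trained and validated model

class Features(BaseModel):
    # hypothetical input features expected by the model
    age: float
    income: float
    monthly_visits: float

@app.post("/predict")
def predict(f: Features):
    row = [[f.age, f.income, f.monthly_visits]]
    return {"prediction": int(model.predict(row)[0])}

# Run with:  uvicorn main:app --reload   and POST JSON to /predict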
IN-TEXT QUESTIONS
1. How many steps are there in predictive modelling?
(a) Five
(b) Six
(c) Seven
(d) Eight
2. The process termed ‘Depurate’ means:
(a) Declare
(b) Deploy
(c) Derive
(d) Refine
Benefits: Sales teams can prioritize leads with higher scores,
focusing efforts on prospects more likely to convert, resulting
in improved conversion rates and a more efficient sales process.
(F) Cross-Sell and Upsell:
Applications: Predictive analytics identifies opportunities for
cross-selling or upselling by analyzing customer purchase
history and behaviour. The model predicts which additional
products or services a customer is likely to be interested in.
Benefits: Increased revenue as marketers can strategically
promote complementary products or premium upgrades to
existing customers, maximizing the lifetime value of each
customer.
Common Benefits in Marketing:
Increased ROI: Predictive analytics helps marketers allocate
resources more efficiently, leading to improved campaign
performance and a higher return on investment.
Enhanced Customer Experience: Personalized marketing efforts
based on predictive insights result in a more tailored and
relevant experience for customers, fostering brand loyalty.
Strategic Decision-Making: Marketers can make data-driven
decisions, optimizing strategies and tactics based on insights
derived from predictive analytics.
Competitive Edge: Organizations leveraging predictive analytics
gain a competitive advantage by staying ahead in the rapidly
evolving landscape of marketing strategies.
In the marketing arena, predictive analytics serves as a powerful tool for
maximizing the impact of campaigns, improving customer engagement,
and driving business growth through strategic decision-making.
2. HEALTHCARE:
Applications: Predictive analytics plays a crucial role in healthcare for
patient risk stratification, disease prediction, and resource allocation.
It aids in identifying high-risk patients, optimizing treatment plans,
and reducing readmission rates.
For the above three reviews, the Bag-of-Words vector will be formed
as below:
Review | the | night | Gel | is | Sticky | oily | cream | very | light | weight | skin | makes | Total
1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 6
2 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 2 | 1 | 1 | 0 | 0 | 8
3 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 7
Drawbacks of Bag-of-Words
New words increase the size and length of the vectors.
The vectors are sparse, containing a large number of zeroes.
No information about word order is maintained.
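The count matrix above can be reproduced programmatically; the sketch below uses scikit-learn's CountVectorizer, with review texts that are illustrative stand-ins rather than the original reviews referenced earlier.

from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "the night gel is sticky oily",                       # stand-in for review 1
    "the gel is very very light weight cream skin",       # stand-in for review 2
    "the gel oily cream very skin makes light",           # stand-in for review 3
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(reviews)       # document-term count matrix

print(vectorizer.get_feature_names_out())     # the learned vocabulary
print(bow.toarray())                          # one count vector per review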
Term Frequency (TF)
Term frequency is a numerical statistic that reflects how frequently a word occurs in a document.
TF(t, d) = (number of occurrences of term t in document d) / (total number of terms in document d)
TF of each word
Review | the | night | Gel | is | Sticky | oily | cream | very | light | weight | skin | makes | Total terms
1 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 0 | 0 | 0 | 0 | 0 | 0 | 6
2 | 1/8 | 0 | 1/8 | 1/8 | 0 | 0 | 1/8 | 2/8 | 1/8 | 1/8 | 0 | 0 | 8
3 | 1/7 | 0 | 1/7 | 0 | 0 | 1/7 | 1/7 | 1/7 | 0 | 0 | 1/7 | 1/7 | 7
Inverse Document Frequency (IDF)
The sparseness (low frequency of occurrence) of a term t across the document collection is commonly measured by the inverse document frequency:
IDF(t) = 1 + log(total number of documents / number of documents containing term t)
Term Frequency-Inverse Document Frequency (TFIDF)
The product of Term Frequency (TF) and Inverse Document Frequency (IDF) is commonly referred to as TFIDF. The TFIDF value of a term t in a given document d is thus:
TFIDF(t, d) = TF(t, d) × IDF(t)
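A small sketch that computes TF, IDF, and TFIDF exactly as defined by the formulas above is given below; the tokenised documents are illustrative stand-ins.

import math

docs = [
    "the night gel is sticky oily".split(),
    "the gel is very very light weight cream".split(),
    "the gel oily cream skin makes light".split(),
]
N = len(docs)
vocab = sorted({word for doc in docs for word in doc})

def tf(term, doc):
    return doc.count(term) / len(doc)                 # TF(t, d)

def idf(term):
    df = sum(1 for doc in docs if term in doc)        # documents containing term t
    return 1 + math.log(N / df)                       # IDF(t) = 1 + log(N / df)

for i, doc in enumerate(docs, start=1):
    weights = {t: round(tf(t, doc) * idf(t), 3) for t in vocab if t in doc}
    print(f"Review {i}:", weights)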
N-grams sequence
A sequence of n consecutive words is called an n-gram. A sequence of two words in a document is called a bi-gram, and a sequence of three words is called a tri-gram.
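For instance, the bi-grams of a sentence can be extracted with CountVectorizer's ngram_range parameter; the sentence below is illustrative.

from sklearn.feature_extraction.text import CountVectorizer

sentence = ["the gel makes skin very light"]
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(sentence)
print(bigrams.get_feature_names_out())
# ['gel makes' 'makes skin' 'skin very' 'the gel' 'very light']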
7.8 Summary
In this lesson, we covered the basic concepts of predictive analytics, in which
a model is built that can provide an estimated value of a target variable for a
new, unseen example. In the process, the 7Ds of predictive modelling were
introduced: Defining requirements, Data collection, Data analysis, Developing
the model, Deciding on the model based on various performance metrics, Deploying
the model for use in business processes, and Depurating (monitoring and refining)
the deployed model so that it estimates the target variable better and keeps
providing solutions to real-world business problems. For example, historical
data about employee retention and movement to competing companies can help an
organisation frame employee retention policies and detect fraud, which will, in
turn, optimise the recruitment process in the long run. The lesson then outlined
the benefits and applications of predictive modelling across organisations:
marketing, sales, healthcare, e-commerce platforms, operations, finance, fraud
detection, risk reduction, education, and data security. The lesson concluded
with text analytics and various parameters for its evaluation, introducing a
common way to turn text into a feature vector: break each document into
individual words (its “bag of words” representation)
and assign values to each term using the TFIDF formula. The approach
is relatively simple, inexpensive, versatile, and requires little domain
knowledge.
7.9 Answers to In-Text Questions
1. (c) Seven
2. (d) Refine
3. (b) Regression model
7.11 References
What is predictive analytics? | IBM (October 2023). https://www.ibm.com/topics/predictive-analytics
Provost, F., & Fawcett, T. (2013). Data Science for Business: What
You Need to Know about Data Mining and Data-Analytic Thinking.
O’Reilly Media, Inc.
Miller, T.W. (2014). Modeling Techniques in Predictive Analytics:
Business Problems and Solutions with R. Pearson FT Press.
Ittoo, A., & van den Bosch, A. (2016). Text analytics in industry:
Challenges, desiderata and trends. Computers in Industry, 78, 96-
107.
Sarkar, D. (2016). Text analytics with Python (Vol. 2). New York,
NY, USA: Apress.
L E S S O N
8
Technology (Analytics)
Solutions and Management
of their Implementation in
Organizations
Dr. Charu Gupta
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi
Email-Id: charugupta.sol.2023@gmail.com; charu.gupta@sol-du.ac.in
STRUCTURE
8.1 Learning Objectives
8.2 Introduction
8.3 Management of Analytics Technology Solution in Predictive Modelling
8.4 Predictive Modelling Technology Solutions
8.5 Summary
8.6 Answers to In-Text Questions
8.7 Self-Assessment Questions
8.8 References
8.9 Suggested Readings
8.5 Summary
Effective predictive modelling technology management is crucial for
organizations aiming to extract meaningful insights from vast datasets. This
involves strategic planning, aligning technology initiatives with overarching
business goals, and ensuring that the chosen technology solutions integrate
seamlessly into existing infrastructures. Key aspects of management include
overseeing data acquisition and preparation processes, selecting appropriate
modelling techniques, and addressing scalability concerns. Interpretability
and explainability of models, along with continuous monitoring and
maintenance, are also paramount. Challenges such as talent acquisition,
data security, and dynamic data nature must be met with comprehensive
strategies. The successful management of predictive modelling technology
empowers organizations to make informed decisions, stay competitive,
and navigate the evolving landscape of data analytics.
8.8 References
Gordon, J., Spillecke, D., et al. (2015). Marketing & Sales: Big Data,
Analytics, and the Future of Marketing & Sales. McKinsey & Company.
Miller, T.W. (2014). Modelling Techniques in Predictive Analytics:
Business Problems and Solutions with R. Pearson FT Press.
Attaran, M., & Attaran, S. (2019). Opportunities and challenges of
implementing predictive analytics for competitive advantage. Applying
business intelligence initiatives in healthcare and organizational
settings, 64-90.
Glossary
Area Under the Curve (AUC): A numerical measure quantifying the area under the ROC
curve, indicating the classifier’s ability to distinguish between classes.
Association Rules: Data mining technique used to identify interesting patterns and
relationships between items or events in a dataset, often employed in market basket analysis
and recommendation systems.
AutoML (Automated Machine Learning): The use of automated tools and processes to
streamline and accelerate the end-to-end process of developing machine learning models,
from data preparation to model deployment.
Bernoulli Naive Bayes: A type of the Naive Bayes algorithm for binary or Boolean
features, typically used in text classification tasks like spam detection.
Bias: The error introduced in a predictive model due to oversimplified data or model
structure assumptions.
Big Data: Large amounts of data consisting of the four V’s of Volume, Velocity, Variety,
and Veracity.
Classification: A supervised machine learning technique that assigns pre-defined categories
or labels to data points based on their features.
Cloud-Based Solutions: Technology solutions hosted on cloud platforms, offering scalability,
flexibility, and accessibility for predictive modelling tasks without the need for extensive
on-premises infrastructure.
Clustering: A data analysis technique that involves grouping similar data points together
based on specific criteria or features, revealing inherent patterns or structures.
Confidence: It measures the robustness of the relationship between items A and B. The
confidence for A → B represents the probability of B occurring when A is already present,
expressed as P(B|A).
Confusion Matrix: A table summarising a classifier’s predictions and actual class labels,
providing detailed information about true positives, true negatives, false positives, and
false negatives.
Continuous Monitoring: The ongoing surveillance of predictive models to detect changes
in data patterns, model performance, and potential deviations, requiring timely intervention
for recalibration.
F-beta Score: A generalisation of the F1-score that allows adjusting the
balance between precision and recall using a parameter (beta).
Feature Engineering: The process of selecting, transforming, or creating
new features (variables) from raw data to enhance the performance of
predictive models.
Gaussian Naive Bayes: A type of the Naive Bayes algorithm suitable for
continuous data, assuming a Gaussian (normal) distribution of features.
Google Cloud AI Platform: Google Cloud AI Platform offers a suite
of tools for building, deploying, and managing machine learning models
on Google Cloud. It supports popular machine learning frameworks like
TensorFlow and scikit-learn.
Histogram: A graphical representation of the distribution of a single
variable, showing the frequency of data values within specific intervals.
Imbalanced Data: A situation in which one class of a classification
problem is significantly more prevalent than the other class, potentially
leading to biased models.
Imputation: This involves estimating values based on techniques such
as mean imputation or regression imputation to replace missing values.
Interpretability: The degree to which a predictive model’s results and
decisions can be understood and explained.
Label: Text added to a visualisation to provide context and identify data
points or categories.
Legend: A key that explains the meaning of colours, symbols, or other
visual elements used in a visualisation.
Line Chart: A type of data visualisation that displays trends and
relationships between two numerical variables over time.
Linear Regression: A statistical method used to model the relationship
between a dependent variable and one or more independent variables.
Logistic Regression: A statistical method used for binary classification,
modelling the relationship between predictor variables and the probability
of an event occurring.
Precision: A classification metric that quantifies the ratio of true positive
predictions to all positive predictions made by the model, with a focus
on minimising false positives.
Predictive Analytics: It is a branch of advanced analytics that makes
predictions about future outcomes by unravelling hidden patterns from
historical data combined with statistical modelling, data mining techniques
and machine learning.
Predictive Modelling: the process of employing data and statistical
algorithms to predict or categorise future events or results by drawing
insights from past data.
Predictors: The independent attributes in the dataset that are used to
predict the value of the target/ dependent variable.
Pruning: Removing branches or nodes from a decision tree to prevent
overfitting and improve model performance.
Recall (Sensitivity): A classification metric that calculates the ratio of
true positive predictions to all actual positive instances, emphasising
detecting all positives.
Receiver Operating Characteristic (ROC) Curve: A visual representation
illustrating a classifier’s performance, depicting the True Positive Rate
(Recall) against the False Positive Rate (1 - Specificity) at multiple
thresholds.
Regression: A predictive modelling technique used to model the relationship
between one or more independent variables and a continuous dependent
variable.
ROC Curve: Receiver Operating Characteristic Curve, a graphical
representation of a model’s ability to distinguish between true and false
positives.
Scalability: The ability of technology solutions to handle growing
volumes of data and increased computational demands without sacrificing
performance.
Scatter Plot: A visualisation that displays the relationship between two
numerical variables through points on a graph.
Semi-Structured Data: Data that does not conform to the structure of
traditional structured data but has some level of organization, often in a
hierarchical or nested format, e.g., XML or JSON.
Specificity (True Negative Rate): A classification metric that evaluates
the ratio of true negative predictions to all actual negative instances,
particularly crucial in mitigating false alarms.
Splitting Criteria: Criteria used to decide how to split nodes in a decision
tree, including Gini impurity and information gain.
Strategic Planning: The process of defining the long-term objectives
of predictive modelling initiatives and aligning them with the overall
business strategy.
Structured Data: Data organized into a fixed format with defined fields
and an explicit schema, typically stored in relational databases.
Support: It quantifies the frequency of the pattern. The support for
association rule A → B corresponds to the likelihood of both items A and
B co-occurring, which can be denoted by P(A ∪ B), i.e., the probability that a
transaction contains both A and B.
Talent Development: The cultivation of skilled professionals capable of
effectively managing and implementing predictive modelling technology,
often through training programs and skill enhancement initiatives.
Target: The dependent variable whose values are to be predicted using
a prediction model.
Text Analytics: It is the process of transforming unstructured text
documents into usable, structured data.
Tokenisation: It is the process of breaking apart a sentence or phrase
into its component pieces. Tokens are usually words or numbers.
Underfitting: A modelling problem where a predictive model is too
simple to capture the underlying patterns in the data.
Unstructured Data: Data that lacks a predefined structure and is typically
in the form of text, images, audio, or video.
Variety: The diversity of data types and sources, including structured,
semi-structured, and unstructured data.
Velocity: The speed at which data is generated, collected, and processed
in real-time.