BA 2025
BUSINESS ANALYTICS FOR DECISION MAKING
Dr. Samarendra Kumar Mohanty
COURSE OBJECTIVES
Course Outcomes
Unit-II
• Introduction to Python – Introduction to Jupyter
Notebooks; Basic Programming Concepts and
Syntax; Core Libraries in Python; Map and Filter;
Processing, Wrangling, and Visualizing Data – Data
Collection, Data Description, Data Wrangling, Data
Visualization; Feature Engineering and Selection –
Feature Extraction and Engineering, Feature
Engineering on Numeric Data, Categorical Data,
Text Data, & Image Data, Feature Scaling, Feature
Selection.
Unit-III
• Building, Tuning, and Deploying Models – Building Models,
Model Evaluation, Tuning, Interpretation, & Deployment;
Exploratory Data Analysis; Diagnostic Analysis; Exploration
of Data using Visualization; Steps in Building a Regression
Model; Building Simple and Multiple Regression Models -
Model Diagnostics; Binary Logistic Regression – Model
Diagnostics, ROC and AUC, Finding Optimal Classification
Cut-Off; Gain and Lift Chart; Regularization – L1 and L2.
Suggested Readings
Journals
• Analytics Magazine.
• Data Science and Business Analytics.
• Harvard Business Review (HBR).
• Information Systems Research.
• Journal of Business Analytics.
• Journal of Business Intelligence Research.
• MIT Sloan Management Review.
Key Trends 2025 and Beyond
• AI-powered analytics
• AI technologies like machine learning (ML) and natural language
processing (NLP) will be used to automate tasks, identify patterns,
and provide more accurate predictions.
• Data democratization
• Low-code and no-code platforms will make advanced data
analysis accessible to a broader audience.
• Edge computing
• Edge computing will unlock innovation and improve efficiency.
Key Trends 2025 and Beyond
• Natural language processing (NLP)
• NLP will make insights accessible to everyone, regardless of
technical expertise.
• Cloud-based analytics
• Cloud-based analytics will transform how businesses navigate
data management.
Benefits
• Improved accuracy: AI technologies can analyze vast amounts of
data with high precision.
• Faster, more informed decisions: Businesses will be able to
make quicker decisions based on data.
• Competitive edge: Businesses that embrace these trends will
gain a significant competitive edge.
• Improved operational efficiency: Businesses will be able to
improve operational efficiency by leveraging data.
Edge Computing
• Edge computing is a distributed computing model that brings computation and data storage closer to the sources of data. More broadly, it refers to any design that pushes computation physically closer to a user, so as to reduce the latency compared to when an application runs on a centralized data centre.
Natural language processing (NLP)
• NLP research has helped enable the era of generative AI, from the
communication skills of large language models (LLMs) to the
ability of image generation models to understand requests.
• NLP is already part of everyday life for many, powering search
engines, prompting chatbots for customer service with spoken
commands, voice-operated GPS systems and question-answering
digital assistants on smartphones such as Amazon’s Alexa,
Apple’s Siri and Microsoft’s Cortana.
Benefits of NLP
• Automation of repetitive tasks
• Improved data analysis and insights
• Enhanced search
• Content generation
Cloud Analytics
• Data comes from both cloud and on-premises sources and applications. The best
cloud analytics platforms can manage hybrid data delivery and application
automation. Examples of data sources include transactional, website usage, social
media, and CRM data.
• Data is stored in a cloud data warehouse from a vendor such as Amazon Redshift,
Google BigQuery, Microsoft Azure, or Snowflake.
• The cloud analytics tool uses this data to let you perform a variety of analytics use
cases such as creating visualizations, dashboards and cloud reporting. The best
tools go further by enabling you to perform augmented analytics and predictive
analytics, machine learning or AutoML (automated machine learning), embed
analytics into other applications, and trigger alerts and actions in other systems.
• This array of analytics capabilities helps you to identify patterns and develop
insights that lead to actions which can increase efficiency, revenue and profits. Top
tools can also integrate with other applications to trigger automated, data-driven
events.
Introduction to Business Analytics
• "In God we trust; all others must bring data." (W. Edwards Deming)
Analytics
• Analytics is a body of knowledge consisting of statistical, mathematical, and operations research techniques; artificial intelligence techniques such as machine learning and deep learning algorithms; data collection and data storage; data management processes such as data extraction, transformation, and loading (ETL); and computing and big data technologies such as Hadoop, Spark, and Hive that create value by developing actionable items from data.
Theory of Bounded Rationality
• A key reason for the rise in the use of analytics is the Theory of Bounded Rationality proposed by Herbert Simon (1972).
• The increasing complexity of business problems, the existence of several alternative solutions, and the limited time available for decision making demand a highly structured decision-making process using past data for the effective management of organizations (Herbert Simon, 1972).
Business analytics: data-driven decision making flow diagram
• Stage 1: Identify problems/improvement opportunities.
• Stage 2: Identify the source of data required for the problem identified in Stage 1.
• Stage 3: Preprocess the data for missing/incorrect data and generate new variables if necessary.
Pyramid of Analytics
Why Analytics
• Decisions are usually made using the HiPPO algorithm (the "highest paid person's opinion" algorithm).
• There is a significant shift toward data-driven decision making among several companies.
• According to the theory of the firm (Coase, 1937; Fama, 1980), firms exist to minimize transaction cost. Transactions take place when goods and services are transferred from the supplier to the customer. The cost of decision making is an important element of the transaction cost.
• Michalos (1970) groups the cost of decision making into three categories:
• 1. Cost of reaching a decision with the help of a decision maker or procedure; this is also known as production cost, that is, the cost of producing a decision.
• 2. Cost of actions based on decisions produced; also known as implementation cost.
Business Analytics
• Business Analytics is a set of statistical and operation research
techniques, artificial intelligence, information technology, and
management strategies used for framing a business problem,
collecting data, and analysing the data to create value for the
organizations.
Components of Business Analytics
• Business context
• Data science
• Technology
Business Context
• Business analytics projects start with the business context and the ability of the organization to ask the right questions.
• Who are the prospective customers?
• At what time are customers likely to make the maximum purchases?
• For many customers shopping is a habit, and they do not respond to promotions since shopping is a routine for them. Shopping behavior changes during marriage or pregnancy, and it becomes easy to target customers during these special events.
• Target Stores developed a pregnancy score for each female customer which could be used for target marketing (Duhigg, 2012).
• Pregnant women are likely to be price insensitive, so they become the Holy Grail for retailers like Target. Expectant mothers are willing to spend more for their comfort as well as their babies (Duhigg, 2012).
Business Context
• The Pink Tax
• A 2015 study from the New York City Department of Consumer Affairs –
now called the New York City Department of Consumer and Worker
Protection – found that, on average, women’s products cost 7% more
than similar ones for men.
• The biggest difference was in personal care products, which cost
women 13% more. Women’s shampoo, razors and cartridges, lotion
and deodorant all cost more than similar items marketed toward men.
The study also found examples of pink bikes, scooters, bike helmets
and girls’ toys costing more than similar items for boys.
• Since then, California and New York have passed laws to prevent some
gender-based price discrimination, and other states have been
studying the issue.
• Research published in 2021 by the Kellogg School of Management
at Northwestern University indicates that the tide may be turning.
• In India, the pink tax denotes the additional charges imposed on women for products targeted at them, rendering these items pricier compared to similar ones for men.
• Women in India face this economic burden especially acutely, considering they earn approximately 35% less than men. India has no specific law against the pink tax, so price differences between products for women and men persist, driven by market dynamics.
Products and Services that Come under the Pink Tax in India
• Personal Care Products:
• Razors: Women’s razors often carry a higher price tag than men’s
razors, despite having similar blade quality and features.
• Deodorants and Body Washes: Products marketed towards women are
frequently priced higher than their male counterparts, even when the
ingredients and functionality are comparable.
• Clothing:
• T-shirts and Jeans: Studies have shown that the base price for women’s
t-shirts and jeans can be significantly higher than men’s clothing of
similar quality and fabric.
• This price difference can also extend to other clothing items like jackets
and sweaters.
Products and Services that Come under the Pink Tax in India
• Salon Services:
• Haircuts and Hair Colouring: Salons often charge a premium for haircuts and
hair colouring services offered to women compared to similar services
provided to men.
• Other Products:
• Toys: While not as prevalent, instances exist where toys marketed towards
girls, such as dolls and toy kitchens, are priced higher than toys targeted
towards boys.
• Feminine Hygiene Products:
• Previously, sanitary napkins and tampons were subjected to a 12% Goods
and Services Tax (GST) in India, while condoms faced no such tax.
• This distinction, rectified in 2018, serves as an example of how societal
perception can influence product categorization and pricing.
Mom and baby product stores
• Horlicks Women's Plus Chocolate Nutrition Drink, 400 g refill pack, nutrition for strong bones with 100% daily calcium & vitamin D, no added sugar: ₹288 (₹72 per 100 g).
Technology
• Information technology (IT) is used for data capture, data storage, data preparation, data analysis, and data sharing.
• Today most data are unstructured (not arranged in matrix form with rows and columns). Unstructured data includes images, text, voice, video, clickstream data, etc.
• Software such as R, Python, SAS, SPSS, and Tableau is used for analysis.
Data Science
• It is the most important component of analytics.
• The objective of the data science component of analytics is to identify the most appropriate statistical model or machine learning algorithm that can be used.
• Business analytics can be grouped into descriptive analytics, predictive analytics, and prescriptive analytics.
Framework for data-driven decision making flow diagram
• Stage 1: Identify problems/improvement opportunities.
• Stage 2: Identify the source of data required for the problem identified in Stage 1.
• Stage 3: Preprocess the data for missing/incorrect data and generate new variables if necessary.
House of Analytics Excellence
• Top management support
• Analytics talent
• Information technology (IT)
• Innovation
Analytics Capability Building
1. Top management support: Data-driven decision making requires a change in organizational culture, which requires support from top management.
2. Analytics talent: Identify the right talent and nurture them from within the organization.
3. Information technology (IT): Proper data architecture supported by other IT infrastructure.
4. Innovation
5. All the above pillars need to be integrated with the domain knowledge of the business; otherwise analytics might end up solving non-value-adding problems.
Roadmap for Analytics Capability Building
• Initial: Low-hanging fruit to be targeted with simple analytical tools: descriptive statistics, data visualization, pivot tables, correlation analysis, basic quality tools, Lean, Six Sigma.
• Pivot table using MS Excel (a pandas analogue is sketched below).
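• A minimal pandas equivalent of an Excel pivot table; the sales data below is hypothetical, made up for illustration:

```python
# A pandas analogue of an Excel pivot table on hypothetical sales data.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "product": ["A", "B", "A", "B", "A", "B"],
    "revenue": [100, 150, 120, 130, 90, 160],
})

# Rows = region, columns = product, cells = total revenue
pivot = pd.pivot_table(sales, index="region", columns="product",
                       values="revenue", aggfunc="sum")
print(pivot)
```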
Roadmap for Analytics Capability Building
1. Define Analytics strategy
2. Build talent
3. Build infrastructure
4. Identify sources of data and develop data collection plan
5. Analytics Implementation
1. Define Analytics strategy
2. Build talent
3. Build infrastructure
4. Identify sources of data and develop data
collection plan
• Analytics starts with the data.
• Organizations should identify all relevant data and automate the
data collection process.
5. Analytics Implementation
Roadmap for Analytics capability building
• This roadmap emphasizes a structured approach to integrating analytics into business
processes to enhance data-driven decision-making. The key steps include:
1. Define Objectives and Goals:
o Clearly articulate the organization's strategic objectives and how analytics can
support these goals.
o Identify specific areas where analytics can provide value, such as improving
customer insights, optimizing operations, or enhancing product development.
2. Assess Current Capabilities:
o Evaluate the existing analytics infrastructure, including technology, data quality, and
human resources.
o Determine the organization's maturity level in terms of data management and
analytical skills.
3. Develop a Strategic Analytics Plan:
o Create a comprehensive plan that outlines the steps needed to build or enhance
analytics capabilities.
o Set measurable targets and timelines for achieving analytics objectives.
Roadmap for Analytics capability building
4. Invest in Technology and Tools:
• Acquire the necessary analytics tools and platforms that align with the
organization's needs.
• Ensure scalability and integration capabilities with existing systems.
5. Build a Skilled Analytics Team:
• Recruit and train personnel with expertise in data analysis, statistics,
and domain-specific knowledge.
• Foster a culture of continuous learning and development in analytics.
6. Establish Data Governance and Management:
• Implement policies and procedures to ensure data quality, security, and
privacy.
• Define data ownership and stewardship roles within the organization.
Roadmap for Analytics capability building
• 7. Promote a Data-Driven Culture:
• Encourage decision-making based on data insights across all levels of the organization.
• Provide training and resources to help employees understand and utilize analytics in their roles.
• 8. Implement Analytics Solutions:
• Deploy analytics projects that address identified business needs.
• Use pilot projects to demonstrate value and refine approaches before broader implementation.
• 9. Monitor and Evaluate Performance:
• Regularly assess the effectiveness of analytics initiatives against predefined metrics.
• Gather feedback to identify areas for improvement and to inform future analytics strategies.
• 10. Scale and Innovate:
• Expand successful analytics initiatives to other areas of the organization.
• Stay abreast of emerging analytics trends and technologies to maintain a competitive edge.
Challenges in Data-Driven Decision-making
and Future
• 1. Data Quality and Integration:
• Ensuring the accuracy, completeness, and consistency of data from diverse sources is
crucial.
• Integrating data from various departments and systems can be complex and time-
consuming.
• 2. Lack of Skilled Personnel:
• There's a shortage of professionals proficient in analytics, data science, and related fields.
• Organizations often struggle to build teams with the necessary expertise to analyze data
effectively.
• 3. Cultural Resistance:
• Employees and management may resist adopting data-driven approaches due to a
preference for traditional decision-making methods.
• Overcoming skepticism and fostering a culture that values data is essential.
Challenges in Data-Driven Decision-making
and Future
• 4. Data Privacy and Security Concerns:
• Handling sensitive information requires strict adherence to
privacy laws and regulations.
• Protecting data from breaches and unauthorized access is a
continuous challenge.
• 5. Rapid Technological Changes:
• Keeping up with the fast-paced advancements in analytics tools
and technologies can be daunting.
• Continuous learning and adaptation are necessary to stay
competitive.
Future Trends in Business Analytics
1. Integration of Artificial Intelligence (AI):
o AI is increasingly being used to enhance decision-making processes.
o For instance, AI can assist in optimizing operations, as discussed in the article
"The case for appointing AI as your next COO."
2. Advanced Predictive and Prescriptive Analytics:
o Organizations are moving beyond descriptive analytics to predictive and
prescriptive models.
o These models help forecast future trends and recommend actionable
strategies.
3. Real-Time Data Processing:
o The demand for real-time analytics is growing, enabling organizations to
make immediate, informed decisions.
o This is particularly important in dynamic industries where timely insights are critical.
Future Trends in Business Analytics
4. Enhanced Data Visualization:
o Improved visualization tools are making it easier to interpret complex data sets.
o Effective visualizations aid in communicating insights clearly to stakeholders.
5. Ethical and Responsible AI Use:
o As AI becomes more prevalent, there's a focus on ensuring its ethical application.
o Discussions around responsible AI use are highlighted in articles like "How we can use AI to create a better society."
Foundations of Data Science
Types of Data: Categorization of data based on structure, source, and use case.
1. Based on data format
3. Based on nature
Type | Definition | Examples
Qualitative Data | Descriptive and non-numerical | Customer reviews, interview transcripts
4. Based on Processing State
5. Based on Use
Type | Definition | Examples
Operational Data | Used in daily business operations | Transaction records, inventory levels
Analytical Data | Utilized for insights and decision-making | Historical sales trends, predictive analytics data
6. Based on Content
7. Specialized Types of Data
Scales of Variable Measurement
• 3. Interval Scale:
• Description: Measures data with equal intervals between values, but lacks a true zero point, meaning ratios are not meaningful.
• Examples: Temperature in Celsius or Fahrenheit, where the difference between degrees is consistent, but zero does not indicate the absence of temperature.
• 4. Ratio Scale:
• Description: Similar to the interval scale, but includes a true zero point, allowing for meaningful ratios between measurements.
• Examples: Height, weight, age, or income, where zero signifies the absence of the measured attribute, and comparisons like "twice as much" are meaningful.
• Foundations of Data Science – Data Types and Scales of Variable
Measurement, Feature Engineering; Functional Applications of
Business Analytics in Management; Widely Used Analytical Tools;
Ethics in Business Analytics.
Feature Engineering
• Feature engineering involves creating new variables or modifying
existing ones to enhance the performance of predictive models.
• This process is crucial for improving model accuracy and
uncovering hidden patterns within the data.
Key Aspects of Feature Engineering
1. Data Transformation:
o Applying mathematical functions to variables, such as logarithmic or square
root transformations, to stabilize variance or normalize distributions.
2. Interaction Features:
o Creating new features by combining two or more variables to capture
interactions that may influence the target outcome.
3. Binning:
o Grouping continuous variables into discrete bins or categories to reduce noise
and handle non-linear relationships.
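• A minimal pandas sketch of the three aspects above (transformation, interaction, binning), using hypothetical data:

```python
# Hedged sketch of aspects 1-3 above on a small, hypothetical data set.
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [20000, 45000, 80000, 150000],
                   "age": [22, 35, 48, 60]})

# 1. Data transformation: log compresses a right-skewed variable
df["log_income"] = np.log(df["income"])

# 2. Interaction feature: combine two variables into one
df["age_x_income"] = df["age"] * df["income"]

# 3. Binning: group a continuous variable into discrete categories
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                        labels=["young", "middle", "senior"])
print(df)
```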
Key Aspects of Feature Engineering
4. Encoding Categorical Variables:
o Converting categorical data into numerical format using techniques like one-hot
encoding or label encoding to make them suitable for machine learning algorithms.
5. Handling Missing Values:
o Imputing missing data with appropriate values or creating indicator variables to flag
missing entries.
6. Scaling and Normalization:
o Adjusting the range of variables to ensure they contribute equally to the analysis,
especially important for distance-based algorithms.
7. Date and Time Feature Extraction:
o Deriving new features from date and time variables, such as day of the week, month,
or time of day, to capture temporal patterns.
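• A minimal sketch of aspects 4-7 above with pandas and scikit-learn, on hypothetical data:

```python
# Hedged sketch of aspects 4-7 above; the data frame is made up.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", None],
    "spend": [250.0, None, 400.0],
    "order_date": pd.to_datetime(["2025-01-05", "2025-02-14", "2025-03-21"]),
})

# 4. One-hot encode a categorical column
df = pd.concat([df, pd.get_dummies(df["city"], prefix="city")], axis=1)

# 5. Handle missing values: flag them, then impute with the median
df["spend_missing"] = df["spend"].isna().astype(int)
df["spend"] = df["spend"].fillna(df["spend"].median())

# 6. Scale a numeric column to the [0, 1] range
df["spend_scaled"] = MinMaxScaler().fit_transform(df[["spend"]]).ravel()

# 7. Extract date/time features to capture temporal patterns
df["order_month"] = df["order_date"].dt.month
df["order_day_of_week"] = df["order_date"].dt.dayofweek
print(df)
```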
Functional Applications of Business Analytics
in Management
• The functional applications of business analytics in management
span across various departments and domains within an
organization.
• These applications enable better decision-making, improve
efficiency, and drive strategic objectives.
1. Marketing Analytics
• Customer Segmentation: Identifying and grouping customers based
on behavior, preferences, and demographics.
• Campaign Performance: Measuring the effectiveness of marketing
campaigns through KPIs like ROI and conversion rates.
• Personalization: Leveraging data to tailor marketing messages and
product recommendations.
• Churn Analysis: Predicting customer attrition and implementing
retention strategies.
2. Financial Analytics
• Budgeting and Forecasting: Using historical data and predictive
models to create accurate financial projections.
• Risk Management: Identifying and mitigating financial risks through
stress testing and scenario analysis.
• Profitability Analysis: Evaluating profitability at product, customer,
or segment levels.
• Fraud Detection: Employing machine learning algorithms to detect
anomalies and fraudulent activities.
3. Supply Chain and Operations Analytics
• Inventory Optimization: Ensuring optimal stock levels using
demand forecasting and reorder point analysis.
• Logistics and Transportation: Enhancing route planning, delivery
times, and cost efficiency.
• Process Improvement: Identifying bottlenecks and inefficiencies in
operations to enhance productivity.
• Demand Planning: Using predictive analytics to match supply with
customer demand.
4. Human Resource Analytics
• Talent Acquisition: Analyzing recruitment data to improve hiring
strategies and reduce time-to-hire.
• Employee Retention: Identifying factors influencing turnover and
designing retention programs.
• Performance Management: Tracking employee performance
metrics and aligning them with organizational goals.
• Workforce Planning: Forecasting future staffing needs based on
business growth and market trends.
5. Strategic Management
• Market Trends Analysis: Monitoring market dynamics and
competitive landscapes for informed strategy formulation.
• Scenario Planning: Evaluating potential outcomes of strategic
decisions through simulation models.
• Mergers and Acquisitions: Conducting due diligence and valuing
target companies based on financial and operational data.
• KPI Monitoring: Developing dashboards to track organizational
performance against strategic objectives.
6. Customer Relationship Management (CRM)
• Lifetime Value Prediction: Estimating the long-term value of
customers to prioritize resources.
• Customer Feedback Analysis: Extracting insights from surveys,
reviews, and social media for service improvement.
• Loyalty Programs: Designing and optimizing loyalty initiatives to
enhance customer retention.
7. Product and Service Development
• Innovation Analytics: Using customer insights and market trends to
guide product innovation.
• Quality Assurance: Analyzing production data to maintain high
product quality.
• Pricing Optimization: Determining optimal price points based on
market conditions and consumer behavior.
8. Risk and Compliance Analytics
• Regulatory Compliance: Ensuring adherence to industry regulations
using monitoring tools.
• Operational Risk: Identifying vulnerabilities in processes and
implementing safeguards.
• Crisis Management: Leveraging analytics to predict and prepare for
potential crises.
9. IT and Cybersecurity Analytics
• Threat Detection: Identifying and mitigating cyber threats using
pattern recognition algorithms.
• System Performance: Monitoring IT systems for performance
optimization and downtime reduction.
• Data Management: Enhancing data governance and ensuring data
quality for better decision-making.
Ethics in Business Analytics
• Ethics in Business Analytics is a crucial aspect of using data to
drive decisions in any organization.
• As the reliance on data-driven insights and algorithms increases,
ensuring ethical practices in business analytics becomes vital to
avoid harmful consequences.
• Ethical concerns can arise at various stages of data analysis, including data collection, analysis, and decision-making. Here are the key areas where ethics play a significant role in business analytics:
1. Data Privacy and Protection
• Informed Consent: Businesses must obtain explicit consent from individuals before
collecting or using their data, particularly in sensitive areas like healthcare or personal
information.
• Data Minimization: Collect only the data that is necessary for the intended purpose to
reduce exposure and minimize risks.
• Compliance with Regulations: Adhering to laws and regulations such as GDPR
(General Data Protection Regulation), HIPAA (Health Insurance Portability and
Accountability Act), and CCPA (California Consumer Privacy Act) is crucial to protect
consumer privacy.
• Data Anonymization: Personal data should be anonymized or de-identified to reduce
the risk of misuse.
2. Transparency in Algorithms
• Explainability: Algorithms and models used in decision-making should be
transparent and interpretable. Business stakeholders should be able to understand
how decisions are made by the system.
• Bias Detection: Regularly audit algorithms for biases (e.g., racial, gender, or
socio-economic biases) that may distort outcomes. Model developers should
ensure that their models do not perpetuate societal inequalities.
Algorithms are no silver bullet
• 'Lazy and Mediocre' HR Team Fired After Manager's Own CV Gets Auto-Rejected in Seconds, Exposing System Failure
• https://www.ibtimes.co.uk/lazy-mediocre-hr-team-fired-after-managers-own-cv-gets-auto-rejected-seconds-exposing-system-1727202
• A Distressing Discovery
• The manager, who shared his experience on Reddit, was growing increasingly frustrated with the HR department's inability to find qualified candidates over a three-month period. He had been monitoring the recruitment process closely, but when he inquired about candidate progress, he was repeatedly told there were potential hires who had not passed the initial screening.
• "The truly infuriating part was that I consistently talked to them asking for progress, and they always told me that they had some candidates that didn't pass the first screening processes, which was false," he explained.
• To investigate further, the manager created a fake email and submitted a modified version of his CV under a different name. Alarmingly, he too received an auto-rejection email, reinforcing his concerns about the hiring process. "HR didn't even look at my CV," he lamented.
3. Accountability and Responsibility
• Human Oversight: While data analytics can inform decision-making, final
decisions, particularly those impacting individuals or communities, should be
made by humans, ensuring accountability for any negative consequences.
4. Ensuring the Integrity of Data
• Data Accuracy: Ensuring the quality and accuracy of data used in analytics is
essential. Incorrect data can lead to false conclusions, damaging reputations or
causing harm.
5. Ethical Use of Predictive Models
• Predictions and Privacy: Predictive models, such as those for customer behavior
or credit scoring, should not infringe on an individual’s privacy or autonomy.
Businesses should avoid using models that predict sensitive characteristics
without clear consent.
• Transparency in Predictive Decisions: Customers and employees should have
visibility into how their data is being used to predict outcomes (e.g.,
creditworthiness, hiring decisions, or insurance pricing).
• Impact on Vulnerable Groups: Analytics should avoid practices that
disproportionately harm vulnerable groups, such as targeted marketing for
exploitative products or discriminatory lending practices.
6. Ethical Implications in Marketing Analytics
• Targeted Advertising: While targeted advertising can be effective, it should not
exploit consumers’ vulnerabilities (e.g., advertising products like payday loans to
financially vulnerable individuals).
Targeting the vulnerable
• Loan app agents enticing vulnerable people with easy money, warn police.
• A woman organiser from Vijayawada who was recently arrested was found to
have links to Pakistan
• https://www.thehindu.com/news/national/andhra-pradesh/loan-app-agents-
enticing-vulnerable-people-with-easy-money-warn-police/article66968086.ece
• Explained: Why you should stay away from small-time loan apps
• Small-time loan apps often operate illegally, employing aggressive measures
for loan recovery, including harassment, intimidation, and even blackmail.
• https://www.indiatoday.in/business/story/small-time-loan-apps-illegal-predatory-
practices-harrasment-debt-trap-financial-risk-2405971-2023-07-13
• Are FinTech lending apps harmful? Evidence from user experience in the Indian
market
• https://www.sciencedirect.com/science/article/pii/S0890838923001269
7. AI and Automation Ethics
• Automation and Job Displacement: The increasing automation of tasks using
AI and analytics tools should consider the social impact, such as job
displacement. Ethical AI development includes creating systems that complement
human workers rather than replace them entirely.
• Bias in AI Models: AI models used in business analytics should be monitored
and adjusted regularly to avoid any unintentional reinforcement of historical
biases.
• Fair Access to AI: Businesses should ensure that AI technologies are accessible
to all stakeholders, particularly marginalized groups that might otherwise be
excluded from the benefits of innovation.
8. Social and Environmental Responsibility
• Sustainability: Business analytics should support sustainable practices, helping
businesses reduce waste, conserve resources, and improve environmental impact.
• Social Impact: Ethical business analytics should prioritize projects and initiatives
that benefit society, such as healthcare improvements, education, and equitable
access to resources.
9. Ethical Decision-Making Framework
• Ethical Review Boards: Organizations can establish internal ethics boards or
committees to review significant data analytics projects to ensure that ethical
standards are met.
Key Principles of Ethics in Business Analytics
• Respect for privacy
• Fairness and impartiality
• Transparency and explainability
• Accountability and responsibility
• Sustainability and social responsibility
IT Act, 2000
• The Information Technology Act, 2000 is the primary legislation in India dealing with cybercrime and electronic commerce. It was formulated to ensure the lawful conduct of digital transactions and the reduction of cybercrimes, and is based on the United Nations Model Law on Electronic Commerce 1996 (UNCITRAL Model Law). This legal framework, also known as the IT Act 2000, comprises 94 sections, divided into 13 chapters and 2 schedules.
Digital Personal Data Protection Act, 2023
• https://www.meity.gov.in/writereaddata/files/Digital%20Personal%20Data%20Protection%20Act%202023.pdf
• The Bill will apply to the processing of digital personal data within India where such data
is collected online, or collected offline and is digitised. It will also apply to such
processing outside India, if it is for offering goods or services in India.
• Personal data may be processed only for a lawful purpose upon consent of an
individual. Consent may not be required for specified legitimate uses such as voluntary
sharing of data by the individual or processing by the State for permits, licenses, benefits,
and services.
• Data fiduciaries will be obligated to maintain the accuracy of data, keep data secure, and
delete data once its purpose has been met.
• The Bill grants certain rights to individuals including the right to obtain information, seek
correction and erasure, and grievance redressal.
• The central government may exempt government agencies from the application
of provisions of the Bill in the interest of specified grounds such as security of the state,
public order, and prevention of offences.
• The central government will establish the Data Protection Board of India to adjudicate on
non-compliance with the provisions of the Bill.
• Digital Personal Data Protection Rules, 2025
• The Ministry of Electronics and Information Technology (MeitY) invites feedback/comments on the draft 'Digital Personal Data Protection Rules, 2025'.
Introduction to Python
• Python was developed by Guido van Rossum and first released in 1991.
• Python is a high-level programming language that combines features of procedural languages like C and object-oriented languages like Java.
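• A minimal sketch of basic Python syntax, including the map and filter functions listed in the Unit-II outline:

```python
# Basic Python: lists, map, filter, and list comprehensions.
prices = [120, 250, 75, 400]

# map: apply a function to every element
with_tax = list(map(lambda p: p * 1.18, prices))

# filter: keep only the elements that satisfy a condition
premium = list(filter(lambda p: p > 100, prices))

# The same logic as list comprehensions, often the more idiomatic style
with_tax_lc = [p * 1.18 for p in prices]
premium_lc = [p for p in prices if p > 100]

print(with_tax, premium)
```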
Core Libraries in Python
• Python's huge library contains many small, ready-made packages that are immediately available to programmers. This is often described as Python's 'batteries included' philosophy.
• argparse is a package that provides a command-line parsing library.
Core Libraries in Python
• jellyfish is a library for doing approximate and phonetic matching of strings.
• pandas is a package for powerful data structures for data analysis, time series and
statistics.
Core Libraries in Python
• pyquery represents jquery-like library for Python.
• whoosh contains fast and pure Python full text indexing, search and spell
checking library.
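• As a quick illustration of the pandas description above, a minimal sketch on a hypothetical daily time series:

```python
# Hedged sketch: pandas data structures for time series and statistics.
import pandas as pd

ts = pd.Series([10, 12, 9, 15],
               index=pd.date_range("2025-01-01", periods=4, freq="D"))
print(ts.mean())             # basic descriptive statistic
print(ts.rolling(2).mean())  # simple two-day moving average
```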
Unit-III
• Building, Tuning, and Deploying Models – Building Models, Model
Evaluation, Tuning, Interpretation, & Deployment; Exploratory
Data Analysis; Diagnostic Analysis; Exploration of Data using
Visualization; Steps in Building a Regression Model; Building
Simple and Multiple Regression Models - Model Diagnostics;
Binary Logistic Regression – Model Diagnostics, ROC and AUC,
Finding Optimal Classification Cut-Off; Gain and Lift Chart;
Regularization – L1 and L2.
Regression Analysis
• Regression analysis is a statistical modeling technique used by statisticians and data scientists alike. It is the process of investigating relationships between dependent and independent variables.
• Regression itself includes a variety of techniques for modeling and analyzing relationships between variables.
• It is widely used for predictive analysis, forecasting, and time series analysis.
• The dependent or target variable is estimated as a function of independent or predictor variables. The estimation function is called the regression function.
Regression Analysis
• In a very abstract sense, regression is referred to as the estimation
of continuous response/target variables as opposed to
classification, which estimates discrete targets.
• Linear regression is a foundational statistical tool for modeling the
relationship between a dependent variable and one or more
independent variables.
• The dependent features are called the dependent variables,
outputs, or responses. The independent features are called the
independent variables, inputs, regressors, or predictors.
Regression Analysis
• Regression problems usually have one continuous and
unbounded dependent variable.
• The inputs, however, can be continuous, discrete, or even
categorical data such as gender, nationality, or brand.
Types of Linear Regression
• Simple linear regression: This involves predicting a dependent variable based on a single independent variable.
• Multiple linear regression: This involves predicting a dependent variable based on multiple independent variables.
• Polynomial linear regression: This involves predicting a dependent variable based on a polynomial relationship between independent and dependent variables.
• Logistic regression: This involves predicting a categorical, typically binary, dependent variable; it is covered later in this unit.
Problem Formulation
• When implementing linear regression of some dependent variable
𝑦 on the set of independent variables 𝐱 = (𝑥₁, …, 𝑥ᵣ), where 𝑟 is the
number of predictors, you assume a linear relationship between 𝑦
and 𝐱: 𝑦 = 𝛽₀ + 𝛽₁𝑥₁ + ⋯ + 𝛽ᵣ𝑥ᵣ + 𝜀. This equation is the regression
equation. 𝛽₀, 𝛽₁, …, 𝛽ᵣ are the regression coefficients, and 𝜀 is
the random error.
• Linear regression calculates the estimators of the regression
coefficients or simply the predicted weights, denoted with 𝑏₀, 𝑏₁,
…, 𝑏ᵣ. These estimators define the estimated regression function
𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ. This function should capture the
dependencies between the inputs and output sufficiently well.
Problem Formulation…
The estimated or predicted response, 𝑓(𝐱ᵢ), for each observation 𝑖
= 1, …, 𝑛, should be as close as possible to the corresponding
actual response 𝑦ᵢ. The differences 𝑦ᵢ - 𝑓(𝐱ᵢ) for all observations 𝑖
= 1, …, 𝑛, are called the residuals. Regression is about
determining the best predicted weights—that is, the weights
corresponding to the smallest residuals.
• To get the best weights, you usually minimize the sum of squared
residuals (SSR) for all observations 𝑖 = 1, …, 𝑛: SSR = Σᵢ(𝑦ᵢ - 𝑓(𝐱ᵢ))².
This approach is called the method of ordinary least squares.
Regression Performance
The variation of actual responses 𝑦ᵢ, 𝑖 = 1, …, 𝑛, occurs partly due
to the dependence on the predictors 𝐱ᵢ. However, there’s also an
additional inherent variance of the output.
The coefficient of determination, denoted as 𝑅², tells you how much of the variation in 𝑦 can be explained by the dependence on 𝐱 using the particular regression model. A larger 𝑅² indicates a better fit and means that the model can better explain the variation of the output with different inputs.
• The value 𝑅² = 1 corresponds to SSR = 0. That’s the perfect fit,
since the values of predicted and actual responses fit completely
to each other.
Simple Linear Regression
Predicting a response using a single feature
• The residual sum of squares (RSS), also known as the sum of squared residuals or the sum of squared estimate of errors, is the sum of the squares of the residuals. It is a measure of the discrepancy between the data and an estimation model, such as a linear regression. A small RSS indicates a tight fit of the model to the data.
h(xᵢ) = b₀ + b₁xᵢ
Here,
• h(xᵢ) represents the predicted response value for the ith observation.
• b₀ and b₁ are regression coefficients and represent the y-intercept and slope of the regression line, respectively.
Simple Linear Regression (SLR)
• Simple or single-variate linear regression is the simplest case of
linear regression, as it has a single independent variable, 𝐱 = 𝑥.
When implementing simple linear regression, you typically start
with a given set of input-output (𝑥-𝑦) pairs. These pairs are your
observations, shown as green circles in the figure. For example,
the leftmost observation has the input 𝑥 = 5 and the actual output,
or response, 𝑦 = 5. The next one has 𝑥 = 15 and 𝑦 = 20, and so on.
The estimated regression function, represented by the black line,
has the equation 𝑓(𝑥) = 𝑏₀ + 𝑏₁𝑥. Your goal is to calculate the
optimal values of the predicted weights 𝑏₀ and 𝑏₁ that minimize
SSR and determine the estimated regression function.
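• A minimal NumPy sketch of this worked example. Only x = 5 → y = 5 and x = 15 → y = 20 are quoted in the text; the remaining observations are assumed for illustration, and with this assumed data the fitted line gives f(5) ≈ 8.33, matching the value quoted above:

```python
# Closed-form OLS for one predictor: b0, b1 minimize SSR = sum((y - b0 - b1*x)**2).
import numpy as np

x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()          # least-squares estimates

ssr = np.sum((y - (b0 + b1 * x)) ** 2)         # sum of squared residuals
r2 = 1 - ssr / np.sum((y - y.mean()) ** 2)     # coefficient of determination

print(b0, b1)          # ~5.63 and 0.54
print(b0 + b1 * 5)     # f(5) ~ 8.33, the leftmost red square in the figure
print(ssr, r2)
```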
SLR
The value of 𝑏₀, also called the intercept, shows the point where the
estimated regression line crosses the 𝑦 axis. It’s the value of the
estimated response 𝑓(𝑥) for 𝑥 = 0. The value of 𝑏₁ determines the slope
of the estimated regression line.
• The predicted responses, shown as red squares, are the points on the
regression line that correspond to the input values. For example, for the
input 𝑥 = 5, the predicted response is 𝑓(5) = 8.33, which the leftmost
red square represents.
• The vertical dashed gray lines represent the residuals, which can be
calculated as 𝑦ᵢ - 𝑓(𝐱ᵢ) = 𝑦ᵢ - 𝑏₀ - 𝑏₁𝑥ᵢ for 𝑖 = 1, …, 𝑛. They’re the distances
between the green circles and red squares. When you implement linear
regression, you’re actually trying to minimize these distances and
make the red squares as close to the predefined green circles as
possible.
Multiple Linear Regression
Multiple or multivariate linear regression is a case of linear
regression with two or more independent variables.
If there are just two independent variables, then the estimated
regression function is 𝑓(𝑥₁, 𝑥₂) = 𝑏₀ + 𝑏₁𝑥₁ + 𝑏₂𝑥₂. It represents a
regression plane in a three-dimensional space. The goal of
regression is to determine the values of the weights 𝑏₀, 𝑏₁, and 𝑏₂
such that this plane is as close as possible to the actual
responses, while yielding the minimal SSR.
• The case of more than two independent variables is similar, but
more general. The estimated regression function is 𝑓(𝑥₁, …, 𝑥ᵣ) = 𝑏₀
+ 𝑏₁𝑥₁ + ⋯ +𝑏ᵣ𝑥ᵣ, and there are 𝑟 + 1 weights to be determined
when the number of inputs is 𝑟.
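• A minimal scikit-learn sketch of multiple linear regression with two predictors, on hypothetical data:

```python
# Hedged sketch: fit f(x1, x2) = b0 + b1*x1 + b2*x2 by least squares.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 4], [2, 3], [3, 7], [4, 6], [5, 9]], dtype=float)
y = np.array([8, 9, 15, 16, 22], dtype=float)

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # b0 and (b1, b2)
print(model.predict([[3.0, 5.0]]))     # estimated f(x1=3, x2=5)
print(model.score(X, y))               # R-squared on the training data
```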
Polynomial Regression
• Polynomial regression can be seen as a generalized case of linear regression. You assume a polynomial dependence between the output and inputs and, consequently, a polynomial estimated regression function.
• The simplest example of polynomial regression has a single independent variable, and the estimated regression function is a polynomial of degree two: 𝑓(𝑥) = 𝑏₀ + 𝑏₁𝑥 + 𝑏₂𝑥².
• In the case of two variables and the polynomial of degree two, the regression function has this form: 𝑓(𝑥₁, 𝑥₂) = 𝑏₀ + 𝑏₁𝑥₁ + 𝑏₂𝑥₂ + 𝑏₃𝑥₁² + 𝑏₄𝑥₁𝑥₂ + 𝑏₅𝑥₂².
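• A minimal scikit-learn sketch: a degree-two polynomial regression fitted as linear regression on transformed inputs, with hypothetical data:

```python
# Hedged sketch: fit f(x) = b0 + b1*x + b2*x^2 as a linear model on [x, x^2].
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 2.2, 5.1, 9.8, 17.2])        # roughly quadratic in x

X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_poly, y)        # estimates b0, b1, b2
print(model.intercept_, model.coef_)
```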
Logistic regression
Logistic regression analysis is used to examine the association of
(categorical or continuous) independent variable(s) with one
dichotomous dependent variable. This is in contrast to linear
regression analysis in which the dependent variable is a continuous
variable.
Logistic regression is named for the function used at the core of the
method, the logistic function.
The logistic function, also called the sigmoid function was developed by
statisticians to describe properties of population growth in ecology,
rising quickly and maxing out at the carrying capacity of the
environment. It’s an S-shaped curve that can take any real-valued
number and map it into a value between 0 and 1, but never exactly at
those limits.
Logistic regression
• The logistic function is 1 / (1 + e^(-value)), where e is the base of the natural logarithms (Euler's number, or the EXP() function in your spreadsheet) and value is the actual numerical value that you want to transform. Below, the numbers between -5 and 5 are transformed into the range 0 to 1 using the logistic function.
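• A minimal NumPy sketch of this transformation:

```python
# Minimal sketch of the logistic (sigmoid) function described above.
import numpy as np

def sigmoid(value):
    """Map any real value into (0, 1): 1 / (1 + e^-value)."""
    return 1.0 / (1.0 + np.exp(-value))

values = np.linspace(-5, 5, 11)
print(np.round(sigmoid(values), 3))   # rises from ~0.007 at -5 to ~0.993 at 5
```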
Logistic Function
Types of logistic regression
• Binary logistic regression: In this approach, the response or
dependent variable is dichotomous in nature—that is, it has
only two possible outcomes (for example 0 or 1).
• Some popular examples of its use include predicting if an email
is spam or not spam or if a tumor is malignant or not malignant.
Within logistic regression, this is the most commonly used
approach, and more generally, it is one of the most common
classifiers for binary classification.
• Ordinal logistic regression: This type of logistic regression model is leveraged when the response variable has three or more possible outcomes and, in this case, these values do have a defined order.
• Examples of ordinal responses include grading scales from A to
F or rating scales from 1 to 5.
Types of logistic regression
• Multinomial logistic regression: In this type of logistic
regression model, the dependent variable has three or more
possible outcomes; however, these values have no
specified order.
• For example, movie studios want to predict what genre of
film a moviegoer is likely to see to market films more
effectively. A multinomial logistic regression model can
help the studio to determine the strength of influence a
person's age, gender and dating status may have on the
type of film that they prefer. The studio can then orient an
advertising campaign of a specific movie toward a group of
people likely to go see it.
Use cases of logistic regression
• Fraud detection: Logistic regression models can help teams
identify data anomalies, which are predictive of fraud. Certain
behaviors or characteristics may have a higher association with
fraudulent activities, which is particularly helpful to banking and
other financial institutions in protecting their clients.
• SaaS-based companies have also started to adopt these
practices to eliminate fake user accounts from their datasets
when conducting data analysis around business performance.
Use cases of logistic regression
• Disease prediction: In medicine, this analytics approach can be used to
predict the likelihood of disease or illness for a given population.
• Healthcare organizations can set up preventative care for individuals that
show higher propensity for specific illnesses.
• Churn prediction: Specific behaviors may be indicative of churn in different
functions of an organization.
• For example, human resources and management teams may want to know if
there are high performers within the company who are at risk of leaving the
organization; this type of insight can prompt conversations to understand
problem areas within the company, such as culture or compensation.
Alternatively, the sales organization may want to learn which of their clients
are at risk of taking their business elsewhere.
• This can prompt teams to set up a retention strategy to avoid lost revenue.
Linear regression vs logistic regression
• Linear regression models are used to identify the relationship between a
continuous dependent variable and one or more independent variables.
When there is only one independent variable and one dependent variable, it
is known as simple linear regression, but as the number of independent
variables increases, it is referred to as multiple linear regression. Each type of linear regression seeks to plot a line of best fit through a set of data points, which is typically calculated using the least squares method.
AUC-ROC Curve in Machine Learning
• In machine learning, model evaluation is crucial to ensure that the model performs well. Common evaluation metrics for classification tasks include accuracy and precision.
• The AUC-ROC curve is an essential tool for evaluating the performance of binary classification models. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different thresholds, showing how well a model can distinguish between two classes, such as positive and negative outcomes.
How is the AUC-ROC curve used?
Key Terms in AUC-ROC:
• TPR (True Positive Rate): The proportion of actual positives correctly predicted as positive, TP / (TP + FN).
• FPR (False Positive Rate): The proportion of actual negatives incorrectly predicted as positive, FP / (FP + TN).
• Specificity: The proportion of actual negatives correctly identified by the model, TN / (TN + FP); equal to 1 - FPR.
• Sensitivity/Recall: The proportion of actual positives correctly identified by the model (same as TPR).
Confusion Matrix
• ROC Curve: ROC Curve plots TPR vs. FPR at different thresholds.
It represents the trade-off between the sensitivity and specificity
of a classifier.
• AUC (Area Under the Curve): AUC measures the area under the
ROC curve. A higher AUC value indicates better model
performance as it suggests a greater ability to distinguish between
classes. An AUC value of 1.0 indicates perfect performance while
0.5 suggests it is random guessing.
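• A minimal scikit-learn sketch of the ROC curve and AUC, using hypothetical labels and predicted probabilities:

```python
# Hedged sketch: ROC points and AUC on made-up labels and scores.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.10, 0.40, 0.35, 0.80, 0.65, 0.90]   # model's P(class 1)

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print(list(zip(fpr, tpr)))             # (FPR, TPR) points on the ROC curve
print(roc_auc_score(y_true, y_prob))   # AUC; 1.0 here, a perfect ranking
```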
How AUC-ROC Works
AUC-ROC curve helps us understand how well a classification model
distinguishes between the two classes (positive and negative).
Imagine we have 6 data points and out of these:
• 3 belong to the positive class: Class 1 for people who have a disease.
• 3 belong to the negative class: Class 0 for people who don't have the disease.
• Now the model will give each data point a predicted probability of
belonging to Class 1 (the positive class). The AUC measures the
model’s ability to assign higher predicted probabilities to the positive
class than to the negative class.
How AUC-ROC Works
Here's how it works:
1. Randomly choose a pair: Pick one data point from the positive class (Class 1) and one from the negative class (Class 0).
2. Check if the positive point has a higher predicted probability: The pair is ranked correctly if the model assigns a higher probability to the positive data point than to the negative one.
3. Repeat for all pairs: Do this for all possible pairs of positive and negative examples; the AUC is the fraction of pairs ranked correctly, as sketched below.
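• A minimal sketch of this pairwise procedure, with hypothetical scores for three positive and three negative examples:

```python
# AUC as the fraction of (positive, negative) pairs ranked correctly.
import itertools

pos_scores = [0.9, 0.7, 0.4]   # predicted P(class 1) for 3 positive cases
neg_scores = [0.5, 0.3, 0.2]   # predicted P(class 1) for 3 negative cases

pairs = list(itertools.product(pos_scores, neg_scores))   # all 9 pairs
correct = sum(p > n for p, n in pairs) + 0.5 * sum(p == n for p, n in pairs)
print(correct / len(pairs))    # 8 of 9 pairs ranked correctly -> AUC ~ 0.89
```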
Model Performance with AUC-ROC
• High AUC (close to 1): The model effectively distinguishes
between positive and negative instances.
• Low AUC (close to 0): The model struggles to differentiate
between the two classes.
• AUC around 0.5: The model doesn’t learn any meaningful
patterns i.e it is doing random guessing.
• In short, the AUC gives you an overall idea of how well your model
is doing at sorting positives and negatives, without being affected
by the threshold you set for classification. A higher AUC means
your model is doing good.
Cumulative Gains and Lift Charts
• Lift is a measure of the effectiveness of a predictive model
calculated as the ratio between the results obtained with and
without the predictive model.
• Cumulative gains and lift charts are visual aids for measuring
model performance.
• Both charts consist of a lift curve and a baseline.
• The greater the area between the lift curve and the baseline, the better the model.
• Gain
• Gain at a given decile level is the ratio of cumulative number of
targets (events) up to that decile to the total number of targets
(events) in the entire data set.
• Lift
• It measures how much better one can expect to do with the predictive model compared to without a model. It is the ratio of the gain % to the random expectation % at a given decile level. The random expectation at the xth decile is x%.
Example
• Prediction of Response Model: A response model predicts who
will respond to a marketing campaign. If we have a response
model, we can make more detailed predictions. For example, we
use the response model to assign a score to all 100,000
customers and predict the results of contacting only the top
10,000 customers, the top 20,000 customers, etc.
• Overall Response Rate: If we assume we have no model other
than the prediction of the overall response rate, then we can
predict the number of positive responses as a fraction of the total
customers contacted. Suppose the response rate is 20%. If all
100,000 customers are contacted we will receive around 20,000
positive responses.
Process
1. Randomly split the data into two samples: 70% = training sample, 30% = validation sample.
2. Score (predicted probability) the validation sample using the response model under consideration.
3. Rank the scored file in descending order by estimated probability.
4. Split the ranked file into 10 sections (deciles).
5. Count the number of observations in each decile.
6. Count the number of actual events in each decile.
7. Compute the cumulative number of actual events in each decile.
8. Compute the percentage of cumulative actual events in each decile; this is called the gain score.
9. Divide the gain score by the percentage of data used up to that decile. For example, in the second decile, divide the gain score by 20.
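• A minimal pandas sketch of steps 3-9, using hypothetical scores and responses for a validation sample of 1,000 customers:

```python
# Hedged sketch of the decile-based gain and lift computation above.
# Scores and responses are simulated, loosely correlated for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
score = rng.random(1000)                               # predicted probabilities
event = (rng.random(1000) < 0.4 * score).astype(int)   # actual responses

df = pd.DataFrame({"score": score, "event": event})
df = df.sort_values("score", ascending=False)          # step 3: rank descending
df["decile"] = np.repeat(np.arange(1, 11), 100)        # step 4: 10 equal bins

table = df.groupby("decile")["event"].agg(n="size", events="sum")      # steps 5-6
table["cum_events"] = table["events"].cumsum()                         # step 7
table["gain_pct"] = 100 * table["cum_events"] / table["events"].sum()  # step 8
table["lift"] = table["gain_pct"] / (10 * table.index)                 # step 9
print(table)
```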