Eval Plus Notes
EVALUATION
DATASET: INVESTMENT TRENDS IN INDIAN STARTUPS: 2000-2023
https://www.kaggle.com/code/tanveersinghbedi/investment-trends-in-indian-startups-2000-2023
Steps involved
1. Importing the required libraries: numpy, pandas, matplotlib.pyplot, seaborn, sklearn
2. Understanding the Data Structure
3. List of columns
4. Checking for missing values
5. Fill the missing values in the dataset as NA since the missing value percentage is > 50%
6. Resolving the mismatch between the State and City columns
7. Remove 'sub_' prefix and keep only numerical values in 'Sub-Sector' column
8. Convert Funding_Date to datetime format
9. Standardize Profitability (convert "Yes/No" to binary values)
10. Replace missing or erroneous values in columns like Co-Investors, Lead_Investor
11. Verifying the missing values imputation
12. Checking for duplicate rows
13. Displaying the data types of the columns
14. Identifying the unique values in the dataset
15. Dropping columns that are not useful for EDA: Acquisition_Details, Startup_ID, Pitch_Deck_Link
16. Saving the cleaned data as a new CSV file
17. Standardizing the 'Profitability' column to binary values
18. Create a new dataframe with general startup information: ['Name', 'Sector', 'Sub-Sector', 'City', 'State',
'Founded_Year', 'Founder_Name', 'Funding_Stage', 'Investment_Type', 'Amount_Raised',
'Investors_Count', 'Lead_Investor', 'Co-Investors', 'Valuation_Post_Funding', 'Revenue', 'Profitability',
'Number_of_Employees']
19. Creating a competition and technical edge dataframe: ['Name', 'Sector', 'Sub-Sector', 'Tech_Stack',
'Primary_Product', 'Competitors', 'Patents', 'ESG_Score', 'Diversity_Index', 'Net_Impact_Score',
'Funding_Date', 'Social_Media_Followers', 'Profitability']
20. Saving all dataframes as CSV files
21. Data transformations
22. Count the number of startups in each sector
23. Group by Funding Stage and calculate the average amount raised
24. Group by Lead_Investor and calculate the average Valuation_Post_Funding
25. Calculate the correlation between Growth_Rate and Profitability
26. Calculate the correlation matrix and Plot the heatmap
27. Find the pairs of variables with the highest correlation coefficients and get the top 10 correlated pairs
28. Sort the dataframe by 'Revenue' in descending order and display the top 5 startups by revenue
29. Get the summary statistics of the 'Revenue' column
30. Define bins and labels, create a new 'Revenue_Bracket' column with the categorized revenue, and display the updated dataframe
31. Select features for clustering and apply K-Means clustering
32. Count the number of startups in each state along with the number of sectors, and visualize.
33. Calculate average funding amount per sector
34. Count the most used tech stacks and visualize
35. Identify top 10 profitable startups and high growth startups
36. Correlate social media presence with ESG scores
37. Group by year and count the number of funding events and visualize the funding trends over time
38. Plot distribution of funding amounts.
39. Create a bar plot for the average funding amount by year and sector
40. Create a box plot for the funding amounts by year and sector
41. Create a FacetGrid for the box plot with four parameters: Funding_Stage, Funding_Year, Amount_Raised and Sector
42. Calculate the correlation matrix using the parameters: Amount_Raised, Growth_Rate, Revenue,
Customer_Base_Size and create a heatmap
43. Create a swarm plot for the growth rate by sector: Sector, Growth_Rate
44. Create a count plot for the number of startups in each funding stage
45. Create a line plot for the average funding amount by year and sector: Funding_Year, Amount_Raised
and Sector
46. Create a pair plot: Amount_Raised, Growth_Rate, Revenue, Customer_Base_Size, Sector
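As a rough illustration, the sketch below shows a handful of the steps above in pandas, assuming a hypothetical CSV export of the Kaggle dataset with the column names used in this notebook (the file name, bin thresholds, and column values are assumptions):

```python
import pandas as pd

# Hypothetical export of the Kaggle dataset with the columns named above.
df = pd.read_csv("indian_startups.csv")

# Steps 8-9: parse dates and map "Yes"/"No" profitability to 1/0.
df["Funding_Date"] = pd.to_datetime(df["Funding_Date"], errors="coerce")
df["Profitability"] = df["Profitability"].map({"Yes": 1, "No": 0})

# Steps 22-23: startups per sector and average amount raised per funding stage.
startups_per_sector = df["Sector"].value_counts()
avg_raised_by_stage = df.groupby("Funding_Stage")["Amount_Raised"].mean()

# Step 26: correlation matrix (the input for a seaborn heatmap).
corr = df[["Amount_Raised", "Growth_Rate", "Revenue", "Customer_Base_Size"]].corr()

# Step 30: bucket revenue into brackets with pd.cut (thresholds are illustrative).
bins = [0, 1e6, 1e7, 1e8, float("inf")]
labels = ["<1M", "1M-10M", "10M-100M", ">100M"]
df["Revenue_Bracket"] = pd.cut(df["Revenue"], bins=bins, labels=labels)
```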
Based on your EDA transformations, the following business questions are directly relevant to your analysis:
✅ Which cities have attracted the highest funding in the last five years?
✅ Are startups with a higher number of investors more likely to raise larger funding rounds?
Relevant Transformation: Investment Trends Over Time (if co-investor data is available).
Justification: Can be inferred if investors who syndicate together fund larger rounds.
✅ What percentage of startups have successfully exited, and what factors contribute to successful exits?
✅ What strategies do successful startups use to outperform their competitors in terms of funding and
growth?
Data governance is a structured framework of policies, processes, and standards that ensures an organization's
data is accurate, secure, compliant, and effectively managed. It helps businesses maximize data value while
maintaining regulatory compliance and mitigating risks.
A multinational FMCG (Fast-Moving Consumer Goods) company deals with massive volumes of data from
diverse sources such as sales transactions, supply chain logistics, customer feedback, and market trends. To
ensure effective data governance, the company should implement the following strategy:
Define roles and responsibilities (Data Owners, Data Stewards, Data Analysts).
Set up a Data Governance Council involving IT, compliance, and business teams.
Implement data governance policies for data collection, processing, and access control.
Adhere to global data regulations like GDPR, CCPA, and India’s DPDP Act.
Encrypt sensitive customer and transaction data to prevent unauthorized access.
Implement role-based access controls (RBAC) for secure data handling.
Maintain a single source of truth for key business data (customers, products, suppliers).
Integrate multiple data sources (POS, e-commerce, third-party market reports) for a unified view.
Use AI-based data deduplication to ensure accurate records.
Scenario: A global FMCG brand like Unilever or Nestlé wants to improve its customer engagement strategies
by offering hyper-personalized promotions based on purchasing behavior.
Challenges:
Solution:
Data Unification: Implemented a Master Data Management (MDM) system to integrate customer data
across regions.
Quality Enhancement: Used AI-based data cleaning tools to remove duplicate customer profiles.
Security & Compliance: Implemented GDPR-compliant role-based access controls (RBAC).
AI-driven Insights: Used machine learning to analyze purchase patterns and deliver personalized
product recommendations via digital ads.
Results:
✅ 25% increase in campaign effectiveness due to accurate targeting.
✅ 40% reduction in duplicate customer records.
✅ Improved compliance with global data protection regulations.
Q.2 Explain with examples quantitative and qualitative data a bank should be collecting for
personalized offerings. (CO2)
A bank collects data to enhance customer experience, improve services, and offer personalized financial
products. The two key types of data collected are:
Quantitative Data: Numerical data that can be measured and analyzed statistically.
Qualitative Data: Descriptive data that provides insights into customer behavior, preferences, and
motivations.
📌 Example: A bank notices that a customer frequently uses their credit card for travel bookings. Based on this
quantitative data, the bank offers a travel rewards credit card to enhance customer engagement.
📌 Example: A bank gathers qualitative feedback from customers who find their mobile banking app difficult
to navigate. The bank then redesigns the app interface for better usability.
Scenario: A leading bank wants to improve personalized financial recommendations for customers.
Challenges:
Solution:
Data Integration: The bank combined transactional (quantitative) and behavioral (qualitative) data
for deeper insights.
AI-Powered Analysis: Machine learning models predicted customer needs based on spending patterns.
Personalized Offerings:
o Frequent travelers were offered travel insurance & forex benefits.
o High-spending customers received exclusive credit card offers.
o Customers saving for a house were recommended home loan pre-approvals.
Results:
✅ 30% increase in customer engagement with personalized banking offers.
✅ 20% reduction in customer churn due to relevant financial products.
✅ Improved brand loyalty & customer satisfaction.
By leveraging quantitative & qualitative data, banks can provide tailored financial solutions, improving
both customer satisfaction and business performance.
Q.3 Explain with examples the 4 most important data clean-up tasks when working with raw
data. (CO3)
Raw data is often incomplete, inconsistent, or inaccurate, leading to poor analysis and faulty business decisions.
Data cleaning ensures high-quality, reliable, and usable data for analysis and decision-making.
1. Handling Missing Values
Problem: Missing values in datasets can lead to inaccurate predictions and incorrect insights.
Solution:
✅ Remove records with excessive missing values if they are not useful.
✅ Use imputation techniques (mean, median, mode, or predictive modeling) to fill in missing values.
✅ For categorical data, use the most frequent category or "Unknown" label.
📌 Example: In a customer database, missing age values can be replaced with the average age of similar
customers.
2. Removing Duplicates
Problem: Duplicate records can distort statistical analysis and inflate customer counts.
Solution:
✅ Identify duplicate records using customer ID, email, or transaction timestamps.
✅ Remove duplicates based on rules (latest entry, first occurrence, highest value).
✅ Merge records when partial duplicates exist (e.g., same customer with different phone numbers).
📌 Example: An e-commerce retailer finds multiple records of the same customer due to typo errors in email
addresses. Removing duplicates ensures accurate sales tracking.
3. Standardizing Data Formats
📌 Example: A bank finds that transaction dates are recorded in both "MM/DD/YYYY" and "YYYY-MM-DD" formats, causing issues in reports. Standardizing the format prevents errors.
4. Removing Outliers
📌 Example: A financial institution detects a $1 million transaction in a student's bank account, which turns
out to be a data entry error. Cleaning such anomalies prevents fraud detection failures.
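A minimal pandas sketch of the four clean-up tasks described above, run on a hypothetical customer/transaction table (the file name and column names are assumptions):

```python
import pandas as pd

df = pd.read_csv("raw_customers.csv")  # hypothetical raw extract

# 1. Handle missing values: median imputation for numbers, a label for categories.
df["age"] = df["age"].fillna(df["age"].median())
df["segment"] = df["segment"].fillna("Unknown")

# 2. Remove duplicates, keeping the most recent record per customer.
df = df.sort_values("last_updated").drop_duplicates("customer_id", keep="last")

# 3. Standardize formats: parse mixed date strings into a single datetime type.
df["transaction_date"] = pd.to_datetime(df["transaction_date"], errors="coerce")

# 4. Remove outliers with a simple IQR rule on the transaction amount.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```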
Scenario: A global retail chain wants to analyze customer purchases to improve marketing campaigns.
Challenges:
Solution:
Results:
✅ 95% accuracy improvement in customer segmentation.
✅ 20% increase in marketing campaign ROI due to better targeting.
✅ Reliable sales forecasts with clean & consistent data.
Effective data cleaning leads to better insights, accurate predictions, and improved decision-making.
Q.4 What is the difference between external data sources and internal data sources? Explain
this for a retail giant such as Star Bazaar. (CO3)
Internal Data Sources → Data generated within the organization, offering direct insights into
operations.
External Data Sources → Data obtained from outside sources, helping businesses understand market
trends and customer behavior.
📌 Example: Star Bazaar analyzes internal sales data to identify that dairy products sell more in the evenings,
helping optimize store stocking schedules.
📌 Example: Star Bazaar integrates government inflation reports and competitor pricing data to adjust
pricing strategies during economic downturns.
Scenario: Star Bazaar wants to optimize product pricing and promotional campaigns based on real-time market
trends.
Challenges:
Internal sales data provides limited insights into competitor pricing trends.
Customer preferences change due to external economic conditions.
Promotions are ineffective because they are not aligned with market demand.
Solution:
By combining internal and external data sources, Star Bazaar can stay competitive, optimize inventory,
and enhance customer satisfaction.
Q.5 Explain with examples the difference between descriptive, predictive, and prescriptive
analytics for a logistics company. (CO4)
In the logistics industry, data analytics plays a crucial role in optimizing supply chain operations, reducing
costs, and improving delivery efficiency. The three key types of analytics are:
Function: Uses historical data to identify trends and patterns in logistics operations.
Data Used: Delivery times, fuel consumption, route efficiency, past shipment records.
📌 Example: A logistics company analyzes monthly delivery performance to determine that late deliveries
increased by 10% during the holiday season due to high traffic.
Tools: Dashboards, business intelligence reports, data visualization (e.g., Power BI, Tableau).
Function: Uses machine learning and statistical models to predict potential logistics issues.
Data Used: Weather conditions, vehicle maintenance logs, customer demand forecasts.
📌 Example: Based on historical traffic and weather data, the company predicts that deliveries to city
centers will be delayed by 20% during monsoons, allowing them to plan alternative routes.
📌 Example: The company uses AI-driven route optimization to suggest the fastest and most fuel-efficient
routes for delivery trucks, reducing fuel costs by 15%.
Scenario: A logistics company handling deliveries for an e-commerce giant wants to reduce delivery delays
and operational costs.
Challenges:
Solution:
Descriptive Analytics → Analyzed past delivery logs to identify areas with frequent delays.
Predictive Analytics → Used machine learning models to predict future traffic congestion.
Prescriptive Analytics → Implemented AI-based route optimization to suggest alternative paths in
real-time.
Results:
✅ 25% reduction in average delivery time.
✅ 20% savings in fuel costs with optimized routes.
✅ Increased on-time delivery rate, leading to higher customer satisfaction.
By leveraging descriptive, predictive, and prescriptive analytics, logistics companies can improve
efficiency, reduce costs, and enhance customer service.
1. Database
Definition
A database is an organized collection of structured data that allows users to efficiently store, retrieve, manage,
and manipulate information. It serves as the backbone for data-driven applications, enabling businesses to
process and analyze data effectively. Databases can be relational (SQL-based) or non-relational (NoSQL-
based).
Importance in Data Analytics
Customer Relationship Management (CRM): Stores customer data, transaction history, and
preferences.
Enterprise Resource Planning (ERP): Manages supply chain, HR, and financial data.
E-commerce Platforms: Handles product catalogs, customer orders, and payment information.
Financial Services: Stores account details, transactions, and fraud detection data.
Types of Databases: Relational (MySQL, PostgreSQL, SQL Server) vs. Non-Relational (MongoDB,
Cassandra).
SQL (Structured Query Language): Writing queries for data extraction and analysis.
Normalization & Indexing: Optimizing database performance and reducing redundancy.
Data Security & Compliance: Understanding GDPR, HIPAA, and other data protection regulations.
Scenario: An enterprise-level e-commerce company, ShopEase, needs a robust database to manage its growing
number of customers, orders, and inventory.
Challenges:
Solution:
Implemented a relational database (MySQL) for structured customer and order data.
Used NoSQL (MongoDB) for handling dynamic product reviews and user-generated content.
Integrated data replication & indexing to ensure quick retrieval of information.
Results:
✅ 30% faster query execution, improving customer experience.
✅ Accurate stock tracking, reducing out-of-stock complaints.
✅ Better fraud detection, minimizing payment risks.
A well-structured database is fundamental for enterprises to manage, analyze, and secure their data
efficiently.
2. Data Lake
Definition
A Data Lake is a centralized repository that stores vast amounts of structured, semi-structured, and
unstructured data in its raw form. Unlike traditional databases, a data lake allows businesses to store data
without needing to define its structure beforehand, making it highly flexible for analytics and machine learning.
Allows storage of large volumes of diverse data types (text, images, videos, IoT sensor data).
Enables scalability for enterprises dealing with massive data inflows.
Supports real-time and batch processing, making it suitable for big data analytics.
Facilitates advanced analytics and AI/ML applications without the need for extensive preprocessing.
Retail & E-commerce: Stores customer purchase behavior, reviews, and website interactions for
personalization.
Financial Services: Collects transaction logs, risk assessment data, and fraud detection information.
Healthcare: Stores patient records, medical imaging, and real-time monitoring data.
Manufacturing: Aggregates IoT sensor data for predictive maintenance and quality control.
Difference Between Data Lake & Data Warehouse: Data lakes store raw data, while warehouses store
processed, structured data.
Storage Technologies: AWS S3, Azure Data Lake, Google Cloud Storage.
Processing Frameworks: Apache Spark, Hadoop, and Databricks for data transformation.
Schema-on-Read Concept: Structure is applied at the time of querying rather than during data
ingestion.
Security & Governance: Implementing role-based access control and encryption to protect sensitive
data.
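To make the schema-on-read idea above concrete, here is a small sketch: raw JSON events are kept exactly as they arrive in the lake, and types are applied only when the data is read for analysis (the event fields and values are illustrative, not from the source):

```python
import json
import pandas as pd

# Raw events stored as-is; note the second record has an extra field.
raw_events = [
    '{"user": "u1", "amount": "120.5", "ts": "2023-01-05"}',
    '{"user": "u2", "amount": "80", "ts": "2023-01-06", "channel": "app"}',
]

# Schema-on-read: parse and cast types only at query time.
df = pd.DataFrame(json.loads(e) for e in raw_events)
df["amount"] = pd.to_numeric(df["amount"])
df["ts"] = pd.to_datetime(df["ts"])
print(df.dtypes)
```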
Scenario: A global bank, FinTrust, wants to leverage AI for fraud detection and customer insights. Traditional
databases cannot handle the huge influx of structured and unstructured data from multiple sources.
Challenges:
Solution:
Implemented a Data Lake using AWS S3 to store raw transaction logs, chat transcripts, and biometric
data.
Used Apache Spark to process real-time data for fraud detection.
Integrated machine learning models to detect suspicious transaction patterns.
Results:
✅ 50% reduction in fraud detection time through real-time analysis.
✅ Enhanced customer experience by personalizing loan and credit card recommendations.
✅ Cost savings by eliminating the need for multiple expensive databases.
A Data Lake provides enterprises with flexibility, scalability, and analytical power, making it a crucial
component for modern data-driven decision-making.
3. Data Governance
Definition
Data Governance is a set of policies, processes, and standards that ensure an organization's data is accurate,
secure, and used effectively. It involves defining roles, responsibilities, and frameworks to manage data assets
throughout their lifecycle.
Retail & E-commerce: Ensures customer data privacy while personalizing recommendations.
Healthcare: Maintains data integrity in electronic health records (EHR) for patient safety.
Banking & Finance: Ensures compliance with anti-money laundering (AML) regulations.
Manufacturing: Standardizes IoT sensor data for predictive maintenance analytics.
Key Data Governance Frameworks: DAMA-DMBOK, CMMI Data Management Maturity Model.
Data Ownership & Stewardship: Defining roles such as Data Owners, Data Stewards, and Data
Custodians.
Master Data Management (MDM): Ensuring a single source of truth for business-critical data.
Data Quality Metrics: Accuracy, completeness, consistency, timeliness, and validity.
Data Security & Compliance: Implementing encryption, access control, and audit trails.
Scenario: A multinational hospital chain, MediCarePlus, needs to improve patient data management across
multiple locations while complying with HIPAA regulations.
Challenges:
Results:
✅ Improved patient safety by ensuring accurate and complete medical records.
✅ 40% reduction in compliance violations through standardized data policies.
✅ Faster decision-making by providing healthcare professionals with reliable data.
A well-implemented Data Governance strategy helps enterprises ensure data accuracy, security,
compliance, and operational efficiency.
4. Data Visualization
Definition
Data Visualization is the process of representing data through graphical formats like charts, graphs, and
dashboards to help businesses analyze trends, patterns, and insights effectively. It transforms raw data into a
visual context, making it easier to interpret and communicate findings.
Retail & E-commerce: Sales performance dashboards to track revenue and customer behavior.
Healthcare: Real-time patient monitoring and disease trend analysis.
Banking & Finance: Fraud detection dashboards displaying suspicious transaction patterns.
Supply Chain & Logistics: Shipment tracking and route optimization visualizations.
Types of Data Visualizations: Line charts, bar graphs, scatter plots, heatmaps, and geospatial maps.
Best Practices for Effective Visualization: Choosing the right chart type, maintaining clarity, and
avoiding misleading visuals.
Popular Data Visualization Tools: Tableau, Power BI, Google Data Studio, Matplotlib, Seaborn
(Python).
Storytelling with Data: Presenting insights in a compelling, actionable format.
Dashboards vs. Reports: Dashboards provide real-time, interactive data views, while reports are static
summaries.
Use Case: Data Visualization in an E-commerce Enterprise
Scenario: An online retail giant, ShopEase, wants to optimize its customer engagement strategy by analyzing
shopping patterns and sales performance.
Challenges:
Solution:
Results:
✅ 20% increase in revenue by identifying and promoting high-demand products.
✅ Improved decision-making with real-time sales tracking.
✅ Optimized inventory management, reducing stockouts and overstock issues.
Data Visualization empowers businesses by converting raw data into actionable insights, enabling faster and
smarter decision-making.
5. Algorithm
Definition
An algorithm is a step-by-step procedure or set of rules used to solve a problem or perform a specific task. In
data analytics, algorithms are used to process, analyze, and interpret data to generate meaningful insights. They
form the foundation of data processing, machine learning, and AI-driven decision-making.
Retail & E-commerce: Recommender algorithms suggest products based on past purchases (e.g.,
Amazon).
Banking & Finance: Fraud detection algorithms analyze transaction patterns for anomalies.
Healthcare: AI-driven diagnostic algorithms assist in detecting diseases from medical imaging.
Manufacturing: Quality control algorithms detect defects in products using image recognition.
What a Data Analyst Should Know
Types of Algorithms:
✅ Sorting & Searching Algorithms – Used for organizing data (e.g., QuickSort, Binary Search).
✅ Machine Learning Algorithms – Used for predictive modeling (e.g., Linear Regression, Decision
Trees).
✅ Optimization Algorithms – Used in business process improvements (e.g., Genetic Algorithms, A*
Search).
✅ Clustering Algorithms – Used for customer segmentation (e.g., K-Means, DBSCAN).
Understanding Algorithm Efficiency: Time complexity (Big O notation) and space optimization.
Real-world Implementation: Python libraries (NumPy, Pandas, Scikit-learn) for algorithm execution.
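As a small, self-contained example of the "sorting & searching" category and the Big O point above, here is an iterative binary search, which runs in O(log n) time on a sorted list:

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2          # halve the search range each step -> O(log n)
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(binary_search([3, 8, 15, 23, 42, 57], 23))  # prints 3
```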
Scenario: A leading telecom company, ConnectTel, wants to reduce customer churn by identifying users
likely to leave their service.
Challenges:
Solution:
Used machine learning algorithms (Logistic Regression, Random Forest) to analyze customer
behavior.
Processed call logs, billing history, and customer complaints to identify churn signals.
Built an automated predictive model that assigns a churn probability score to each customer.
Sent targeted retention offers (discounts, better plans) to high-risk customers.
Results:
✅ 30% reduction in customer churn by proactive engagement.
✅ Increased revenue by retaining valuable customers.
✅ Automated churn detection, reducing manual effort.
Algorithms power automation, efficiency, and predictive insights, making them a core component of data-
driven enterprises.
6. Raw Data
Definition
Raw Data refers to unprocessed, unstructured, and unrefined data collected from various sources. It has not
undergone any transformation, cleaning, or structuring and requires processing before analysis.
Acts as the primary input for all data processing and analysis.
Enables data-driven decision-making once cleaned and structured.
Provides a complete and unbiased dataset for analysis.
Helps in pattern recognition and trend discovery in its natural state.
Retail & E-commerce: Customer purchase logs and browsing behavior before categorization.
Finance & Banking: Unprocessed transaction logs before fraud detection.
Healthcare: Raw patient data from medical devices and lab tests before diagnosis.
Manufacturing: IoT sensor readings before predictive maintenance analysis.
Use Case: Raw Data Processing for Sentiment Analysis in a Social Media Enterprise
Scenario: A global social media platform, TrendTalk, wants to analyze user sentiment about trending topics.
Challenges:
Solution:
Gathered raw tweets, comments, and reviews from social media feeds.
Cleaned data by removing stopwords, special characters, and duplicates.
Applied Natural Language Processing (NLP) models to classify sentiments (Positive, Neutral,
Negative).
Built a dashboard for real-time sentiment tracking of trending topics.
Results:
✅ 70% faster sentiment analysis, improving market response time.
✅ Real-time brand monitoring, helping brands manage crises.
✅ Data-driven content strategy, increasing audience engagement.
Raw data is the foundation of all analytics, but it must be cleaned, processed, and structured for effective
decision-making.
7. Data Transformation
Definition
Data Transformation is the process of converting raw data into a structured, clean, and usable format. It
includes tasks like data cleaning, filtering, aggregation, normalization, and formatting to make data suitable for
analysis and business intelligence.
Converts inconsistent raw data into a structured format for meaningful insights.
Enhances data quality by removing errors, missing values, and duplicates.
Standardizes data formats for seamless integration across multiple systems.
Improves processing efficiency, enabling faster analytics and reporting.
Retail & E-commerce: Standardizing product categories from different suppliers for inventory
management.
Finance & Banking: Converting transaction data into a consistent format for fraud detection.
Healthcare: Transforming patient records from multiple hospitals into a common structure.
Manufacturing: Aggregating IoT sensor data from various machine models for performance tracking.
Scenario: A global investment firm, WealthTrust, collects financial data from multiple stock exchanges
worldwide. However, the data formats are inconsistent, making real-time portfolio tracking difficult.
Challenges:
Different stock exchanges use various formats for price, currency, and time zones.
Inconsistent naming conventions and missing stock symbols.
Large datasets cause slow processing in reporting systems.
Solution:
Extracted data from multiple sources (NYSE, NASDAQ, London Stock Exchange).
Applied data transformation techniques:
o Currency conversion for global standardization.
o Time zone alignment for real-time tracking.
o Data deduplication & missing value imputation.
Loaded the transformed data into a cloud-based analytics platform for visualization.
Results:
✅ 50% faster report generation, improving investment decisions.
✅ Accurate portfolio insights, reducing financial risks.
✅ Standardized data pipelines, enabling smooth integration with AI-driven forecasting models.
Data transformation is crucial for data-driven enterprises to ensure accuracy, efficiency, and seamless
analytics.
8. Data Modelling
Definition
Data Modelling is the process of designing a structured representation of data relationships within a database or
system. It defines how data is stored, organized, and accessed to ensure consistency, scalability, and efficiency
in data management.
Scenario: A global e-commerce company, ShopEase, wants to enhance customer experience by improving
its recommendation engine.
Challenges:
Solution:
Designed a relational data model linking Customers, Orders, Products, and Reviews.
Applied Normalization to eliminate redundant customer records.
Created Indexes & Optimized Queries for faster data retrieval.
Integrated a graph database (Neo4j) to enhance product recommendations.
Results:
✅ 40% faster product recommendations, improving sales.
✅ Better customer segmentation, enabling targeted marketing.
✅ Streamlined data queries, reducing server load.
Data Modelling is a foundational step for enterprises to ensure data efficiency, accuracy, and scalability for
analytics and decision-making.
9. ETL (Extract, Transform, Load)
Definition
ETL (Extract, Transform, Load) is the process of extracting data from multiple sources, transforming it into a clean, consistent format, and loading it into a target database or data warehouse. It is widely used in business intelligence and data warehousing to consolidate and prepare data for analysis.
Retail & E-commerce: Aggregates sales, inventory, and customer data from online and offline stores.
Banking & Finance: Extracts transactional data from multiple banking systems for fraud analysis.
Healthcare: Integrates patient records from hospitals, pharmacies, and insurance providers.
Manufacturing: Collects IoT sensor data from machines for predictive maintenance.
What a Data Analyst Should Know
ETL Process:
✅ Extract – Pulling raw data from sources (APIs, databases, cloud storage).
✅ Transform – Cleaning, normalizing, aggregating, and formatting data.
✅ Load – Storing transformed data in a database or data warehouse.
ETL vs. ELT (Extract, Load, Transform): ELT is used for big data, allowing transformation inside
the data warehouse.
Common ETL Tools: Apache Nifi, Talend, Informatica, Microsoft SSIS, AWS Glue, Apache
Spark.
Challenges in ETL: Handling data duplication, real-time processing, and error handling.
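A minimal pandas + SQLite sketch of the Extract-Transform-Load flow described above (the file names, columns, and target table are assumptions, not tied to any specific ETL tool listed here):

```python
import sqlite3
import pandas as pd

# Extract: pull raw sales data from two hypothetical source files.
online = pd.read_csv("online_sales.csv")   # assumed columns: store, date, amount
stores = pd.read_csv("store_sales.csv")

# Transform: combine, standardize dates, drop duplicates, aggregate daily totals.
combined = pd.concat([online, stores], ignore_index=True)
combined["date"] = pd.to_datetime(combined["date"], errors="coerce")
combined = combined.drop_duplicates()
daily = combined.groupby(["store", "date"], as_index=False)["amount"].sum()

# Load: write the cleaned table into a reporting database.
with sqlite3.connect("sales_warehouse.db") as conn:
    daily.to_sql("daily_sales", conn, if_exists="replace", index=False)
```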
Scenario: A multinational retail company, MegaMart, wants a centralized dashboard to track sales
performance across its stores and e-commerce platforms.
Challenges:
Solution:
Results:
✅ 90% reduction in data processing time, enabling real-time analytics.
✅ Unified sales data, improving demand forecasting and stock replenishment.
✅ Automated reporting, reducing manual effort by business teams.
ETL is critical for enterprises to ensure data accuracy, integration, and accessibility for analytics and
decision-making.
10. Big Data
Definition
Big Data refers to extremely large and complex datasets that traditional databases cannot efficiently store, process, or analyze. It is characterized by the 3Vs: Volume, Velocity, and Variety.
Scenario: A global telecom company, ConnectTel, wants to optimize network performance and reduce
customer churn by analyzing massive amounts of customer and network data.
Challenges:
Billions of call logs and internet usage records are generated daily.
Need to detect network congestion and customer dissatisfaction in real time.
Traditional data processing systems cannot handle such high data volumes.
Solution:
Results:
✅ 30% reduction in network downtime, improving customer satisfaction.
✅ Improved customer retention by offering targeted service improvements.
✅ Faster issue resolution, reducing the number of customer complaints.
Big Data is a game-changer for enterprises, enabling scalable analytics, automation, and AI-driven
decision-making.
11. Data Mining
Definition
Data Mining is the process of discovering patterns, trends, and valuable insights from large datasets using
statistical techniques, machine learning, and database systems. It helps businesses uncover hidden relationships
in data to improve decision-making.
Retail & E-commerce: Identifying customer purchase behavior for targeted marketing.
Banking & Finance: Detecting fraudulent transactions by analyzing spending patterns.
Healthcare: Predicting disease outbreaks by analyzing medical records.
Manufacturing: Optimizing production schedules based on historical demand patterns.
Scenario: A global streaming service, StreamFlix, wants to reduce customer churn by understanding why
users cancel subscriptions.
Challenges:
Solution:
Results:
✅ 25% reduction in churn by proactively engaging at-risk users.
✅ Increased customer loyalty, leading to higher lifetime value (LTV).
✅ Improved marketing efficiency, reducing unnecessary promotional spending.
Data Mining empowers enterprises by uncovering valuable insights from raw data, driving informed and
strategic decision-making.
12. Charts
Definition
A chart is a graphical representation of data that helps businesses visualize trends, comparisons, and
relationships. Charts make complex data more understandable by transforming numbers into visual elements
like bars, lines, and pie slices.
Retail & E-commerce: Sales trends over time (line charts), customer segmentation (pie charts).
Finance & Banking: Stock market performance (candlestick charts), fraud detection (scatter plots).
Healthcare: Disease progression tracking (line charts), patient demographics (bar charts).
Supply Chain & Logistics: Inventory levels (area charts), delivery times (histograms).
What a Data Analyst Should Know
Scenario: A global online retailer, ShopEase, wants to analyze sales trends and customer behavior to
optimize inventory and marketing strategies.
Challenges:
Solution:
Results:
✅ 20% increase in sales by adjusting ad campaigns based on insights.
✅ Improved inventory planning, reducing stock shortages.
✅ Faster executive decision-making with data-driven visuals.
Charts are essential for businesses to simplify, analyze, and present data effectively, leading to better
strategic decisions.
13. Structured Data
Definition
Structured Data refers to highly organized data stored in a predefined format, typically within relational
databases. It follows a clear schema with rows and columns, making it easy to search, filter, and analyze using
SQL queries.
Importance in Data Analytics
Retail & E-commerce: Customer purchase history, product catalogs, and inventory records.
Banking & Finance: Account details, transaction logs, and credit history in relational databases.
Healthcare: Electronic Health Records (EHR) with patient demographics, diagnoses, and treatments.
Manufacturing: Supply chain databases tracking raw materials, production schedules, and shipments.
Scenario: A global bank, FinTrust, wants to analyze customer transaction patterns to detect potential fraud
and improve personalized offerings.
Challenges:
Solution:
Results:
✅ 30% improvement in fraud detection accuracy through pattern analysis.
✅ Higher customer retention by offering personalized financial products.
✅ Faster data retrieval, reducing report generation time by 50%.
Structured data enables enterprises to store, retrieve, and analyze information efficiently, driving better
business decisions and operational improvements.
14. Unstructured Data
Definition
Unstructured Data refers to information that does not follow a predefined format or schema. Unlike structured
data stored in relational databases, unstructured data includes text, images, videos, emails, social media posts,
and sensor data, which require advanced processing techniques for analysis.
Retail & E-commerce: Analyzing customer reviews, social media sentiment, and chatbot interactions.
Banking & Finance: Detecting fraud through email conversations and transaction logs.
Healthcare: Processing medical images, doctor’s notes, and patient records.
Manufacturing & IoT: Monitoring machine performance using sensor-generated unstructured logs.
Scenario: A telecom giant, ConnectTel, wants to analyze customer sentiment across emails, social media,
and call center recordings to improve service quality.
Challenges:
Solution:
Results:
✅ Faster customer issue resolution, reducing complaint handling time by 40%.
✅ Improved brand perception by addressing negative feedback proactively.
✅ Data-driven service improvements, increasing customer satisfaction rates.
Unstructured data unlocks hidden business insights and is essential for enterprises leveraging AI, automation,
and real-time analytics.
15. Machine Learning (ML)
Definition
Machine Learning (ML) is a subset of artificial intelligence (AI) that enables systems to learn from data and
improve performance without being explicitly programmed. ML algorithms analyze patterns, make predictions,
and automate decision-making.
Scenario: A global bank, FinTrust, wants to detect fraudulent transactions in real-time to prevent financial
losses.
Challenges:
Solution:
Results:
✅ 85% increase in fraud detection accuracy, reducing financial losses.
✅ Real-time transaction monitoring, improving security.
✅ Lower false positives, minimizing inconvenience for legitimate customers.
Machine learning empowers enterprises to automate complex decision-making, enhance efficiency, and
improve predictive analytics.
16. Artificial Intelligence (AI)
Definition
Artificial Intelligence (AI) is a branch of computer science that enables machines to simulate human
intelligence, including learning, reasoning, problem-solving, perception, and language understanding. AI
systems use algorithms and data to automate tasks and improve decision-making.
Key AI Technologies:
✅ Machine Learning (ML): AI systems learning from data.
✅ Natural Language Processing (NLP): AI understanding human language (e.g., chatbots, sentiment
analysis).
✅ Computer Vision: AI analyzing images and videos (e.g., facial recognition, defect detection).
✅ Deep Learning: Advanced neural networks for speech, image, and pattern recognition.
AI Frameworks & Tools: TensorFlow, PyTorch, OpenAI, IBM Watson, Google AI Platform.
Challenges in AI:
✅ Data Privacy & Ethics: AI systems must follow regulations (e.g., GDPR).
✅ Bias in AI Models: Poor training data can lead to biased decisions.
✅ High Computational Power: AI models require GPUs and cloud resources for training.
Scenario: A global telecom company, ConnectTel, wants to automate customer support to handle increasing
customer queries efficiently.
Challenges:
Solution:
Results:
✅ 50% reduction in customer support costs.
✅ 70% faster response time, improving customer satisfaction.
✅ AI-powered self-service options, reducing human workload.
17. Data Storytelling
Definition
Data storytelling is the practice of using data visualizations, narratives, and analysis to communicate insights
effectively. It combines analytical thinking with storytelling techniques to make data-driven insights compelling
and actionable.
Scenario: A global retailer, ShopEase, wants to present Black Friday sales performance to stakeholders.
Challenges:
Raw sales data is overwhelming, making insights hard to grasp.
Executives need a clear narrative rather than complex reports.
Marketing and inventory teams need actionable insights.
Solution:
Results:
✅ Improved executive decision-making, leading to better stock planning.
✅ Optimized marketing campaigns based on purchase trends.
✅ Higher team engagement, as insights were easy to understand and apply.
Data storytelling transforms numbers into actionable business decisions, making analytics more effective and
persuasive.
18. Dashboard
Definition
A dashboard is an interactive visual interface that provides a real-time summary of key performance indicators
(KPIs), metrics, and insights. It consolidates data from multiple sources into a single view, enabling quick and
informed decision-making.
Retail & E-commerce: Tracks sales, customer behavior, and stock levels.
Finance & Banking: Monitors fraud detection, account balances, and transactions.
Healthcare: Displays patient monitoring data and hospital occupancy rates.
Manufacturing & Logistics: Tracks supply chain efficiency, production, and delivery timelines.
Types of Dashboards:
✅ Operational Dashboards – Monitor real-time processes (e.g., website traffic).
✅ Strategic Dashboards – Provide high-level KPIs for executives (e.g., company revenue).
✅ Analytical Dashboards – Support deep data exploration and trends (e.g., customer segmentation).
Key Features of a Good Dashboard:
✅ User-friendly design with clear visual hierarchy.
✅ Relevant KPIs and metrics aligned with business goals.
✅ Interactive filters and drill-down capabilities for deeper insights.
Popular Dashboard Tools: Tableau, Power BI, Google Data Studio, Looker, Qlik Sense.
Scenario: A global manufacturing firm, AutoParts Inc., needs a real-time dashboard to track its supply chain
efficiency.
Challenges:
Solution:
Built an interactive Power BI dashboard integrating data from warehouses and suppliers.
Included KPIs for inventory levels, delivery times, and supplier performance.
Added real-time alerts for stock shortages and shipment delays.
Enabled drill-down features to analyze performance at regional and product levels.
Results:
✅ 20% reduction in supply chain delays through faster response times.
✅ Improved inventory management, reducing overstock and shortages.
✅ Faster decision-making with real-time data visualization.
Dashboards empower businesses with real-time insights, enabling quick and informed decision-making.
19. Data Visualization
Definition
Data Visualization is the graphical representation of information and data using charts, graphs, maps, and
dashboards. It helps businesses identify patterns, trends, and insights quickly, making complex data easier to
understand.
Retail & E-commerce: Visualizing sales trends and customer purchase behavior.
Banking & Finance: Fraud detection through anomaly visualization.
Healthcare: Disease outbreak tracking and patient health trends.
Marketing & Advertising: Campaign performance analytics and customer segmentation.
Scenario: A fashion retail company, StyleTrend, wants to track sales performance and optimize inventory
across its stores.
Challenges:
Solution:
Results:
✅ 30% improvement in inventory management by predicting demand accurately.
✅ Higher revenue by adjusting marketing strategies based on sales trends.
✅ Faster decision-making, reducing manual reporting efforts.
Data visualization transforms raw numbers into actionable insights, making it a crucial skill for data
analysts.
20. Reports
Definition
A report is a structured document that presents data, insights, and analysis to help businesses monitor
performance, make decisions, and track progress over time. Reports can be generated manually or automatically
and often include charts, tables, and key metrics.
Retail & E-commerce: Monthly sales reports to evaluate store and online performance.
Banking & Finance: Risk assessment reports for credit approvals and fraud detection.
Healthcare: Patient health reports and hospital efficiency analysis.
Human Resources: Employee performance reports for appraisals and workforce planning.
Types of Reports:
✅ Operational Reports – Track daily business operations (e.g., inventory reports).
✅ Financial Reports – Analyze revenue, expenses, and profitability (e.g., balance sheets).
✅ Performance Reports – Measure KPIs and efficiency (e.g., sales performance reports).
✅ Predictive Reports – Use historical data to forecast trends (e.g., demand forecasting reports).
Best Practices for Creating Reports:
✅ Use clear and concise language to communicate findings.
✅ Include visual elements (charts, tables, KPIs) for better understanding.
✅ Ensure data accuracy and consistency across reports.
✅ Automate report generation to reduce manual effort.
Popular Reporting Tools: Power BI, Tableau, Google Data Studio, Excel, SQL Reporting Services
(SSRS).
Scenario: A multinational e-commerce company, ShopEase, needs monthly sales reports to evaluate business
performance across regions.
Challenges:
Solution:
Results:
✅ 50% faster reporting time, reducing manual effort.
✅ Improved revenue forecasting, leading to better stock management.
✅ More data-driven decisions, increasing marketing efficiency.
Reports help businesses stay informed, track progress, and optimize strategies, making them a key
component of data analytics.
21. Analytics
Definition
Analytics is the process of examining data to extract meaningful insights, identify trends, and support decision-
making. It involves techniques like statistical analysis, machine learning, and data visualization to interpret
business data effectively.
Types of Analytics:
✅ Descriptive Analytics – What happened? (e.g., sales performance reports).
✅ Diagnostic Analytics – Why did it happen? (e.g., identifying reasons for revenue drop).
✅ Predictive Analytics – What will happen? (e.g., forecasting demand for products).
✅ Prescriptive Analytics – What should we do? (e.g., suggesting best marketing strategies).
Key Techniques in Analytics:
✅ Statistical Analysis – Hypothesis testing, correlation analysis.
✅ Machine Learning – AI-driven insights and pattern recognition.
✅ Data Visualization – Graphs, charts, dashboards for better interpretation.
✅ Big Data Processing – Handling large datasets using cloud and distributed computing.
Popular Analytics Tools: Power BI, Tableau, Python (Pandas, Scikit-learn), Google Analytics,
SQL.
Scenario: A global video streaming service, StreamFlix, wants to reduce customer churn and increase
engagement.
Challenges:
Solution:
Used predictive analytics (Machine Learning models) to identify users likely to churn.
Applied customer segmentation analysis to understand preferences.
Launched personalized email campaigns with offers and content recommendations.
Created real-time dashboards to monitor customer retention rates.
Results:
✅ 20% reduction in churn by targeting at-risk customers with offers.
✅ Improved content engagement, increasing user watch time.
✅ More effective marketing campaigns, boosting customer satisfaction.
Analytics is at the core of data-driven businesses, helping enterprises optimize operations, predict trends,
and enhance customer experience.
22. Data Transformation
Definition
Data Transformation is the process of converting raw data into a clean, structured, and usable format for
analysis. It includes tasks like data cleaning, standardization, aggregation, normalization, and enrichment to
prepare data for business intelligence and decision-making.
Converts inconsistent raw data into a structured format for better insights.
Improves data quality, accuracy, and reliability for analytics.
Standardizes data formats for seamless integration across multiple sources.
Enhances processing efficiency, enabling faster data queries and reporting.
How It’s Used in a Business Context
Retail & E-commerce: Standardizing customer purchase history across multiple sales channels.
Finance & Banking: Converting transaction records into a uniform format for fraud detection.
Healthcare: Aggregating patient records from different hospitals into a standardized structure.
Manufacturing: Normalizing IoT sensor data from different machines for predictive maintenance.
Use Case: Data Transformation for Customer Insights in a Global Retail Enterprise
Scenario: A multinational retail company, MegaMart, wants to create a centralized customer database by
merging purchase data from online stores, in-store sales, and mobile apps.
Challenges:
Customer data exists in multiple formats (CSV files, SQL databases, NoSQL platforms).
Inconsistent naming conventions make matching customers difficult.
Data duplication and missing values lead to reporting errors.
Solution:
Extracted data from multiple sources (POS systems, website, loyalty programs).
Cleaned data by removing duplicates and filling in missing values.
Standardized customer IDs and purchase history using data normalization.
Loaded transformed data into a cloud data warehouse for real-time analytics.
Results:
✅ Unified customer database, enabling better segmentation.
✅ More accurate sales reports, reducing data inconsistencies.
✅ Improved personalized marketing, increasing customer engagement.
Data Transformation is essential for accurate and efficient data analytics, ensuring that businesses can derive
actionable insights from their data.
Storytelling vs Data Visualization
Both storytelling and data visualization are essential for effectively communicating data insights, but they
serve different purposes. Here’s a detailed comparison:
📌 Data Visualization Alone: A marketing team sees a bar chart showing that sales dropped by 20% last
quarter.
📌 Storytelling with Data: An analyst presents a report explaining that the drop was due to a competitor’s
discount campaign and suggests launching a loyalty program.
Final Thought
A global e-commerce company, ShopEase, experiences a 15% decline in online sales during the last quarter.
The leadership team wants to understand the reasons and take action.
Approach 1: Using Only Data Visualization
✅ The visuals show the "what" (sales decline, mobile user drop, affected regions).
❌ But they don’t explain the "why" or recommend solutions.
Approach 2: Data Visualization Combined with Storytelling
A customer journey flowchart shows that cart abandonment rates increased by 25%.
Customer feedback analysis (NLP insights) reveals that users complain about a slow mobile
checkout experience.
A competitor analysis chart shows a rival launched a one-click checkout system last quarter.
Final Outcome
✅ Executives clearly understand the issue (not just data, but the reason behind it).
✅ Data-backed decision-making leads to a new checkout system rollout.
✅ Sales rebound within 3 months due to lower cart abandonment.
Key Takeaway
Data Modelling is the process of designing a structured representation of data relationships within a database or
system. It defines how data is stored, organized, and accessed to ensure consistency, scalability, and efficiency
in data management.
It involves creating visual representations (diagrams, schemas) that help businesses understand how different
data points interact.
✅ 1. Organizes Complex Data – Helps businesses structure raw data efficiently, making it easier to analyze
and use for decision-making.
✅ 2. Improves Data Quality & Consistency – Reduces redundancy, prevents errors, and ensures data integrity
across systems.
✅ 4. Supports System Integration – Helps different applications and departments work with the same
consistent data framework (e.g., CRM, ERP, and BI tools).
✅ 5. Optimizes Performance & Scalability – Allows businesses to scale databases and analytics processes
efficiently as they grow.
✅ 6. Enables Predictive & Prescriptive Analytics – Well-structured data allows AI and machine learning
models to make accurate predictions.
Challenges:
Results:
Key Takeaway
🔹 Final Outcome:
✅ Clean, structured data ready for customer segmentation analysis.
✅ Improved sales tracking, enabling better personalized marketing.
Key Takeaways
First-party data is data that a company collects directly from its own sources. This includes customer
interactions, transactions, and behavioral data from websites, apps, CRM systems, and loyalty programs.
Scenario: An online retailer, ShopEase, uses customer purchase history and browsing behavior to
recommend products.
Outcome:
✅ Increased conversion rates through personalized product recommendations.
✅ Improved customer retention with loyalty-based promotions.
Second-party data is someone else’s first-party data that is shared with a trusted partner. It is usually
exchanged between two companies with a mutual agreement.
Airline & Hotel Partnership: A hotel chain accesses airline booking data to offer exclusive stay
discounts.
Retail & Payment Provider: A retailer partners with a payment gateway to understand customer
spending habits.
Automobile & Insurance: A car manufacturer shares vehicle usage data with an insurance company to
offer personalized policies.
Scenario: A global airline, FlyHigh Airways, partners with a hotel chain, StayComfort, to offer personalized
hotel deals to passengers based on travel history.
Outcome:
✅ More targeted marketing, leading to higher bookings.
✅ Increased revenue for both partners through data-driven promotions.
Third-party data is data collected and sold by an external company that does not have a direct relationship
with the end consumer. It is aggregated from multiple sources and is often purchased.
Marketing & Advertising: Companies buy audience demographics from data brokers (e.g., Nielsen,
Experian).
Financial Services: Banks purchase credit risk reports from third-party agencies.
Retail: Market research reports to understand industry trends.
Scenario: A retail giant, MegaMart, buys third-party audience data from a data aggregator to target new
customers through online ads.
Outcome:
✅ Expanded reach to new potential buyers.
✅ Improved ad performance with demographic and behavioral targeting.
✔ First-party data is the most valuable and reliable for personalization and direct customer insights.
✔ Second-party data is useful for strategic partnerships and expanding audience reach.
✔ Third-party data is helpful for market research and advertising but comes with privacy risks.
A classification problem is a type of supervised machine learning task where the goal is to categorize data into
predefined groups or labels. The model learns patterns from labeled training data and then predicts the category
of new, unseen data.
✅ Discrete Output – The model assigns an input to one of several predefined categories (e.g., spam vs. not
spam).
✅ Labeled Data – Requires historical data where the correct class labels are already known.
✅ Decision Boundaries – The model learns to differentiate between different classes based on patterns in the
data.
Types of Classification Problems
A telecom provider, ConnectTel, wants to predict which customers are likely to cancel their service so they
can take preventive action.
Solution:
Input Data: Customer subscription history, call duration, complaints, billing details.
Labels: Churn (Yes/No) – If a customer cancels their service.
Model Used: Logistic Regression / Random Forest / XGBoost.
Outcome: The model assigns a probability score to each customer, predicting their likelihood to churn.
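A minimal scikit-learn sketch of the churn scoring described above, using a tiny synthetic table (all field names and values are invented for illustration):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "call_minutes": [120, 300, 80, 450, 60, 500, 200, 30],
    "complaints":   [0, 2, 1, 5, 0, 4, 1, 3],
    "monthly_bill": [20, 45, 15, 60, 18, 70, 35, 25],
    "churn":        [0, 1, 0, 1, 0, 1, 0, 1],   # label: 1 = cancelled service
})

X, y = df.drop(columns="churn"), df["churn"]
model = LogisticRegression().fit(X, y)           # fit on all rows for brevity

# Probability of churn per customer, used to rank who gets a retention offer.
df["churn_probability"] = model.predict_proba(X)[:, 1]
print(df.sort_values("churn_probability", ascending=False).head())
```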
Business Impact:
Key Takeaway
An association problem in machine learning refers to the task of identifying relationships between variables in
large datasets. It helps businesses discover patterns, correlations, and associations between different items,
events, or behaviors.
These problems are often solved using association rule mining, where the goal is to find rules like "If X
happens, Y is likely to happen" based on historical data.
✅ Unsupervised Learning – There are no predefined labels; patterns are discovered automatically.
✅ Finds Relationships Between Items – Instead of predicting a single outcome, the model finds frequent
patterns in data.
✅ Uses Support, Confidence, and Lift Metrics – Measures how strong and meaningful an association rule is.
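The sketch below computes support, confidence, and lift for one hypothetical rule ("if bread, then butter") on a toy basket table, to show what those metrics mean in practice:

```python
import pandas as pd

# One row per transaction; True means the item was in the basket (toy data).
baskets = pd.DataFrame({
    "bread":  [1, 1, 0, 1, 1],
    "butter": [1, 1, 0, 0, 1],
    "milk":   [0, 1, 1, 1, 0],
}).astype(bool)

support_bread  = baskets["bread"].mean()                        # P(bread)
support_butter = baskets["butter"].mean()                       # P(butter)
support_both   = (baskets["bread"] & baskets["butter"]).mean()  # P(bread AND butter)

confidence = support_both / support_bread    # P(butter | bread)
lift = confidence / support_butter           # > 1 means a positive association

print(f"support={support_both:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```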
How Are Association Problems Used in Business?
A supermarket chain, MegaMart, wants to increase sales by understanding which products are frequently
bought together so they can create better promotions.
Solution:
Key Takeaways
Recommendation Engines
Definition
A Recommendation Engine (or Recommender System) is an AI-driven system that suggests products,
services, or content to users based on their preferences, behavior, and historical interactions. It is widely used
in e-commerce, streaming platforms, and online services to personalize user experiences.
1. Content-Based Filtering
📌 How it works: Recommends items similar to what a user has interacted with in the past.
📌 Example:
Netflix recommends movies with similar genres and actors based on your watch history.
Spotify suggests songs with similar beats & artists based on past listening habits.
🎯 Business Use Case: Ideal for personalized recommendations when user data is available.
2. Collaborative Filtering
📌 How it works: Suggests items based on what similar users have liked or purchased.
📌 Example:
Amazon recommends products based on "Customers who bought X also bought Y."
YouTube suggests videos based on what similar users watched.
🎯 Business Use Case: Effective for large-scale recommendation systems (e-commerce, social media).
✅ User-Based – Finds users with similar preferences and suggests what they liked.
✅ Item-Based – Finds similar items and suggests them to users with matching preferences.
3. Hybrid Recommendation Systems
📌 How it works: Uses a mix of content-based + collaborative filtering for better recommendations.
📌 Example:
Netflix uses both your watch history + what similar users liked to recommend shows.
Amazon Prime combines past purchases + trending items for recommendations.
🎯 Business Use Case: Provides higher accuracy and personalization.
An online retail giant, ShopEase, wants to increase sales and engagement by showing personalized product
recommendations.
Solution:
Business Impact:
Key Takeaways
✔ Recommendation engines personalize user experiences, increasing engagement and revenue.
✔ Content-based, collaborative, and hybrid models improve recommendation accuracy.
✔ Used in e-commerce, streaming, finance, healthcare, and online services.
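For illustration, here is a minimal item-based collaborative filtering sketch on a hypothetical user-item rating matrix (users, items, and ratings are invented; production systems operate on far larger, sparser data):

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = items, values = ratings (0 = not rated).
ratings = pd.DataFrame(
    [[5, 4, 0, 1],
     [4, 5, 1, 0],
     [1, 0, 5, 4],
     [0, 1, 4, 5]],
    index=["u1", "u2", "u3", "u4"],
    columns=["item_a", "item_b", "item_c", "item_d"],
)

# Item-item similarity: compare items by the ratings they received.
item_sim = pd.DataFrame(cosine_similarity(ratings.T),
                        index=ratings.columns, columns=ratings.columns)

def recommend(user, top_n=2):
    """Score unrated items by similarity-weighted ratings of the user's rated items."""
    user_ratings = ratings.loc[user]
    rated = user_ratings[user_ratings > 0].index
    unrated = user_ratings[user_ratings == 0].index
    scores = {item: (item_sim.loc[item, rated] * user_ratings[rated]).sum()
              for item in unrated}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("u1"))  # items u1 has not rated yet, ranked by predicted interest
```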
A Regression Problem is a type of supervised machine learning task where the goal is to predict a continuous
numerical value based on input data. Unlike classification problems (which predict categories), regression
problems estimate quantitative outcomes such as prices, sales, temperatures, or demand.
Linear Regression
Models a straight-line relationship between input features and the target variable.
📌 Example: Predicting employee salary based on years of experience.
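A minimal scikit-learn sketch of that salary-vs-experience example, with invented numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

years  = np.array([[1], [2], [3], [5], [7], [10]])   # feature: years of experience
salary = np.array([35, 42, 50, 65, 80, 105])         # target: salary in thousands (toy values)

model = LinearRegression().fit(years, salary)
print(model.predict([[4]]))   # predicted salary for 4 years of experience
```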
An online retailer, ShopEase, wants to predict next month's revenue based on historical data and marketing
spend.
Solution:
Business Impact:
🔹 Key Takeaways
A clustering problem is a type of unsupervised machine learning task where the goal is to group similar data
points together based on their characteristics, without predefined labels. It helps in discovering hidden patterns
and relationships in data.
K-Means Clustering – Partitions data into K clusters, assigning each point to the nearest cluster center.
📌 Example:
E-commerce: Groups customers into segments (high spenders, bargain hunters, occasional buyers).
Hierarchical Clustering – Creates a tree-like structure where clusters are merged or split at different levels.
📌 Example:
Healthcare: Groups diseases based on symptoms and medical reports.
Density-Based Clustering (DBSCAN) – Groups dense areas of data while marking sparse areas as noise (useful for anomaly detection).
📌 Example:
Fraud Detection: Identifies unusual banking transactions as outliers.
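A minimal K-Means sketch for the customer-segmentation case with scikit-learn (the spend and frequency values are hypothetical):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical customer features: [annual_spend, purchase_frequency]
customers = np.array([
    [200, 2], [250, 3], [5000, 40], [5200, 38], [1200, 12], [1100, 10],
])

# Scale features so spend does not dominate the distance calculation
scaled = StandardScaler().fit_transform(customers)

# Partition customers into 3 segments
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(scaled)
print("Cluster assignments:", labels)
print("Cluster centers (scaled):", kmeans.cluster_centers_)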
A retail company, ShopEase, wants to identify customer groups to personalize marketing campaigns.
Solution:
Business Impact:
🔹 Key Takeaways
Data Dictionary
A Data Dictionary is a structured document that defines and describes the fields, attributes, and metadata of
a dataset or database. It provides detailed information about each data element, ensuring clarity, consistency,
and standardization across an organization.
🔹 Why is a Data Dictionary Important?
✅ Ensures Data Consistency – Standardized definitions prevent misinterpretation.
✅ Improves Data Quality – Helps in detecting errors and inconsistencies.
✅ Enhances Collaboration – Acts as a common reference for data teams, analysts, and developers.
✅ Facilitates Data Governance & Compliance – Supports regulatory requirements like GDPR, HIPAA.
✅ Speeds Up Data Analysis & ETL Processes – Clearly defined data simplifies data transformations.
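As a small illustration, a data dictionary can be maintained as simple structured records (the field names, types, and constraints below are hypothetical):

# Hypothetical data dictionary entries for a customer table
data_dictionary = [
    {"field": "customer_id", "type": "integer",
     "description": "Unique customer identifier", "constraints": "primary key, not null"},
    {"field": "signup_date", "type": "date (YYYY-MM-DD)",
     "description": "Date the account was created", "constraints": "not null"},
    {"field": "email", "type": "string",
     "description": "Customer contact email (PII)", "constraints": "unique, masked in reports"},
]

for entry in data_dictionary:
    print(f"{entry['field']:<12} | {entry['type']:<17} | {entry['description']}")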
📌 Business Impact:
✅ Faster onboarding for new analysts by providing a structured data reference.
✅ Prevents errors in ETL pipelines by defining data types & constraints.
✅ Ensures compliance with GDPR by documenting personally identifiable information (PII).
🔹 Key Takeaways
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
Structured Data: Organized data stored in relational databases (e.g., SQL tables).
Semi-Structured Data: JSON, XML, or log files with some structure.
Unstructured Data: Emails, social media posts, videos, images, and IoT logs.
Example: A bank uses descriptive analytics to analyze transactions, predictive analytics to detect
fraud patterns, and prescriptive analytics to suggest preventive actions.
Example: Amazon stores structured customer orders in a data warehouse, while unstructured web
logs go to a data lake for machine learning models.
Example: A telecom company removes duplicate customer records and fills in missing addresses before
launching a targeted campaign.
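A minimal pandas sketch of that kind of cleaning step (the column names and placeholder value are hypothetical):

import pandas as pd

# Hypothetical customer records with a duplicate row and a missing address
customers = pd.DataFrame({
    "customer_id": [101, 101, 102, 103],
    "address": ["12 Park St", "12 Park St", None, "45 Lake Rd"],
})

cleaned = (
    customers
    .drop_duplicates(subset="customer_id")   # remove duplicate customer records
    .fillna({"address": "UNKNOWN"})          # flag missing addresses for follow-up
)
print(cleaned)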
5. Data Mining & Machine Learning Techniques
Supervised Learning (Labeled Data)
6. Analytics Methodology
Step 1: Define Business Goals
✔ Identify the key problem or objective.
Example: Analyze customer complaints, transaction records, and competitor pricing trends.
Example: Cleaning a database with inconsistent customer addresses before targeted advertising.
Example: A retail company identifies peak shopping hours by analyzing transaction timestamps.
Example: A bank presents a fraud detection report to executives with actionable recommendations.
Final Outcome: Students present findings and insights, applying real-world analytics concepts to business
strategy.
Final Takeaways
✔ Enterprise data is the foundation of modern analytics.
✔ Data must be cleaned, stored, and analyzed effectively for business insights.
✔ AI & Machine Learning play a critical role in predictive decision-making.
✔ Analytics methodology follows a structured process from data collection to final decision-making.
Summary of the Case Study: "Applying Data Science and Analytics at P&G"
📌 Overview
The case study explores how Procter & Gamble (P&G), one of the world’s largest consumer goods
companies, leverages data science and analytics (DA) to enhance decision-making, marketing, supply chain
efficiency, and product innovation.
P&G has long been a data-driven company, but advancements in AI, machine learning, and big data have
significantly transformed how it collects, processes, and utilizes information for competitive advantage.
✔ P&G has integrated data science into all aspects of business operations, from supply chain management
to marketing strategy.
✔ Uses real-time data analytics to enhance business agility and improve responses to market demand
fluctuations.
✔ Leadership teams rely on predictive analytics for strategic decision-making rather than intuition-based
approaches.
💡 Example: P&G’s executives and managers use digital dashboards with real-time analytics to monitor key
performance indicators (KPIs).
P&G utilizes advanced forecasting models to optimize inventory management, reduce waste, and
prevent stockouts.
Uses real-time shipment tracking and demand prediction algorithms to enhance logistics efficiency.
P&G leverages customer sentiment analysis, social media trends, and historical data to create
personalized marketing campaigns.
Uses AI-powered consumer behavior modeling to optimize advertising budgets and product
placement strategies.
💡 Business Impact: Increased marketing ROI and higher customer engagement through targeted
promotions.
✅ 3. Product Innovation & R&D (Data-Driven Product Design)
P&G employs machine learning and AI to analyze consumer preferences and competitor trends,
helping in new product development.
Uses A/B testing and data analytics to refine packaging, pricing, and formulations before large-
scale production.
💡 Business Impact: Faster product development cycles and higher customer satisfaction.
P&G applies predictive analytics models to forecast sales trends and optimize production schedules.
Uses big data analytics to analyze seasonal demand variations, helping distributors maintain optimal
inventory levels.
💡 Business Impact: Increased forecasting accuracy, reducing overproduction and supply chain inefficiencies.
P&G has invested in AI-driven automation for data cleaning, trend analysis, and market
intelligence.
Uses cloud-based analytics platforms to centralize global data, improving collaboration across regions.
💡 Business Impact: Streamlined operations, reducing manual errors and decision-making delays.
💡 Final Thought: P&G is a leader in data-driven business strategy, leveraging big data, AI, and predictive
analytics to stay ahead in the competitive FMCG (Fast-Moving Consumer Goods) market.
The case study highlights several challenges P&G faced in adopting and scaling data science and analytics
(DA) across its global operations. Below are the key challenges and how the company addressed them.
P&G operates in over 180 countries, leading to fragmented data sources across different regions and
departments.
Disconnected data storage made cross-functional decision-making difficult.
✅ Solution:
Implemented a unified, cloud-based data platform to centralize data from multiple sources.
Integrated AI-powered dashboards that provided real-time access to enterprise-wide data.
Ensured seamless data sharing between marketing, supply chain, and R&D teams for better
collaboration.
💡 Business Impact: Improved decision-making speed, reducing time spent on manual data gathering by 40%.
P&G collects massive amounts of data from social media, customer feedback, sales records, and
supply chains.
Duplicate records, missing data, and format inconsistencies affected analytics accuracy.
✅ Solution:
Developed automated data cleaning and validation processes to standardize data formats.
Used AI-powered data governance frameworks to detect and correct inconsistencies in real time.
Trained employees on best practices for data entry and handling.
💡 Business Impact: Data accuracy improved by 30%, leading to more reliable AI-driven insights.
Many executives and employees relied on intuition-based decision-making rather than data.
Adoption of AI and advanced analytics faced resistance from traditional business units.
✅ Solution:
P&G introduced data literacy training programs to educate employees on how data-driven decisions
improve efficiency.
Implemented user-friendly dashboards and AI-assisted insights, making analytics accessible to non-
technical employees.
Encouraged a "test-and-learn" culture, where managers experimented with data-driven strategies
before large-scale adoption.
💡 Business Impact: Increased adoption of analytics tools across leadership teams, leading to faster, evidence-
based decision-making.
P&G's vast global operations generate petabytes of structured and unstructured data.
Traditional data processing tools struggled with real-time data analysis and storage scalability.
✅ Solution:
Shifted to Big Data technologies like Hadoop, Apache Spark, and cloud computing to handle high-
volume data processing.
Implemented AI-driven predictive analytics to process large datasets efficiently.
Used automated ETL (Extract, Transform, Load) pipelines to streamline data ingestion.
💡 Business Impact: Enabled real-time analytics, reducing data processing time by 50%.
✅ Solution:
AI-driven demand forecasting models helped P&G predict sales trends with high accuracy.
IoT-based real-time tracking provided visibility into logistics and warehouse operations.
Used prescriptive analytics to adjust production and inventory levels based on market conditions.
💡 Business Impact: Reduced stock shortages by 25% and optimized logistics costs.
📌 6. Ensuring Data Security & Regulatory Compliance
⚠ Challenge:
Handling customer and operational data across multiple countries introduced GDPR, CCPA, and
other regulatory challenges.
Cybersecurity risks increased with cloud migration and AI-based automation.
✅ Solution:
Adopted AI-powered security systems to detect data breaches and cyber threats.
Implemented automated compliance frameworks to meet global data privacy regulations.
Used role-based access control (RBAC) to limit data exposure to only authorized personnel.
💡 Business Impact: Improved data security & regulatory compliance, reducing legal risks.
💡 Final Thought: P&G’s success in data science came from strategic investment in AI, employee training,
and cloud analytics, making it a global leader in data-driven decision-making.