BUSINESS INTELLIGENCE
UNIT – 1
Unit I: Introduction to Business Intelligence. Business Intelligence (BI), Scope of BI solutions and their fitting into
existing infrastructure, BI Components, Future of Business Intelligence, Functional areas and description of BI
tools, Data mining & warehouse, OLAP, Drawing insights from data: DIKW pyramid, Business Analytics project
methodology - detailed description of each phase.
BI COMPONENTS
Business Intelligence (BI) is made up of several key components that work together to help organizations
collect, analyze, and present data for decision-making. These components are essential for turning raw
data into actionable insights. Below are the primary components of a BI system:
1. Data Sources
• Data Sources are the origins from which data is collected for analysis. These can include:
o Internal Data: Information from enterprise systems like Customer Relationship
Management (CRM) systems, Enterprise Resource Planning (ERP) systems,
financial systems, sales and marketing platforms, etc.
o External Data: Data from external sources such as social media, market research,
external APIs, and third-party data providers.
2. Data Integration (ETL Process)
• ETL stands for Extract, Transform, Load and is a critical process in BI:
o Extract: Data is extracted from various source systems (databases, spreadsheets, APIs,
etc.).
o Transform: Data is cleaned, transformed, and formatted to ensure consistency and
accuracy. This may involve filtering, aggregation, and applying business rules.
o Load: The transformed data is loaded into a centralized storage location, typically a data
warehouse or data lake, where it can be accessed for reporting and analysis.
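As a rough illustration, the three ETL steps above might look like the following pandas sketch; the file names, column names, and the SQLite target are assumptions made for this example, not part of any specific BI product:
import sqlite3
import pandas as pd

# Extract: read raw data from illustrative source exports (assumed to exist)
sales = pd.read_csv("sales_export.csv")        # e.g. transactional system export
customers = pd.read_csv("crm_customers.csv")   # e.g. CRM export

# Transform: clean, standardize, and apply a simple business rule
sales["order_date"] = pd.to_datetime(sales["order_date"])
sales = sales.dropna(subset=["customer_id", "amount"])   # drop incomplete rows
sales = sales[sales["amount"] > 0]                       # keep only valid orders
merged = sales.merge(customers, on="customer_id", how="left")

# Load: write the transformed data into a central store (here a local SQLite database)
with sqlite3.connect("warehouse.db") as conn:
    merged.to_sql("fact_sales", conn, if_exists="replace", index=False)
In practice, dedicated data integration tools such as Informatica, Talend, or SSIS (mentioned later in this unit) replace such hand-written scripts, but the extract-transform-load sequence is the same.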
3. Data Warehousing
• Data Warehouse: A centralized repository that stores structured data from various sources. It is
designed to support decision-making processes by providing a consistent, historical view of data.
o Data Marts: Smaller, department-specific data warehouses that focus on a particular
business area (e.g., marketing, finance).
• Data Lake: A storage system that can handle large volumes of raw, unstructured, or semi-
structured data, such as log files or social media feeds. It is often used alongside a data warehouse
to store big data.
4. Data Analytics
• Data Analytics refers to the techniques used to process and analyze data in order to derive
insights. This includes:
o Descriptive Analytics: Analyzes historical data to understand what has happened (e.g.,
reports, dashboards, KPIs).
o Diagnostic Analytics: Investigates data to determine why something happened (e.g., root
cause analysis).
o Predictive Analytics: Uses statistical models and machine learning algorithms to forecast
future trends or behaviors (e.g., sales forecasting, customer churn prediction).
o Prescriptive Analytics: Recommends actions or decisions based on the data analysis
(e.g., inventory management, pricing optimization).
5. BI Tools
• BI Tools are software platforms used to analyze, visualize, and report data. They help business
users create dashboards, run queries, and generate reports. Key types of BI tools include:
o Data Visualization Tools: Tools like Tableau, Power BI, and Qlik allow users to
visualize data through charts, graphs, and interactive dashboards.
o Reporting Tools: Tools like SAP BusinessObjects, Oracle BI Publisher, or IBM
Cognos are used to generate structured, formatted reports.
o Self-service BI: Tools that enable non-technical users to create their own reports and
dashboards without IT support (e.g., Microsoft Power BI, Tableau).
o Ad-hoc Query Tools: Allow users to create customized queries on the fly (e.g., SQL-
based tools or report generators).
6. Data Visualization
• Data Visualization is the graphical representation of data to help business users understand
trends, patterns, and insights.
o It includes charts (bar, pie, line), graphs, heatmaps, and interactive dashboards that make
it easier for users to interpret and act on data.
o Visualization helps highlight key metrics, making it easier to identify areas for
improvement and opportunities.
7. Reporting
• Reporting is the generation of structured, detailed, and sometimes periodic documents that
summarize data, trends, and business performance.
o Operational Reports: Detailed reports that reflect day-to-day operations (e.g., daily
sales reports).
o Strategic Reports: High-level reports used for decision-making at the strategic level
(e.g., quarterly business review reports, financial reports).
8. Dashboards
• Dashboards are interactive, real-time tools that provide a quick overview of key metrics and
KPIs (Key Performance Indicators).
o Dashboards are typically customizable and designed for specific user roles (e.g.,
executive dashboard, sales team dashboard).
o They present a combination of visualizations and reports that can track performance
against business goals.
9. Data Mining
• Data Mining is the process of discovering patterns, correlations, and insights in large datasets
using machine learning, statistical models, and algorithms.
o It includes clustering, classification, regression, association rule mining, and anomaly
detection.
o Data mining helps uncover hidden trends and make predictions based on historical data.
10. Advanced Analytics and AI
• Advanced Analytics: Uses sophisticated techniques, such as ML, NLP, and AI, to extract deeper
insights from data.
o Machine Learning: Algorithms that improve as they are exposed to more data and can
be used for predictive analytics (e.g., predicting customer behavior).
o Natural Language Processing (NLP): Allows systems to interpret and analyze human
language, making it easier for users to interact with BI tools (e.g., querying data using
voice or text).
o AI-powered Insights: Tools that automatically identify patterns and trends, provide
recommendations, and make business decisions without human intervention.
11. Collaboration and Sharing
• Collaboration Tools enable team members to share insights and work together based on the data.
This can include:
o Shared dashboards and reports.
o Commenting and annotations on reports.
o Alerts and notifications based on specific data conditions (e.g., sales drop alerts).
12. Data Governance and Security
• Data Governance ensures the quality, integrity, and compliance of data within BI systems. It
includes:
o Data Quality Management: Ensuring data is accurate, complete, and up to date.
o Data Security: Protecting sensitive information through encryption, access controls, and
compliance with regulations like GDPR.
o Data Lineage: Tracking the flow of data through the system to understand where data
comes from and how it has been processed.
13. Users and User Roles
• BI systems have various users, each with different needs and responsibilities:
o Data Analysts: Use BI tools to perform in-depth analysis and create reports.
o Business Executives: Use dashboards and reports to make strategic decisions.
o Operational Staff: Use BI for day-to-day decision-making, often using more granular
reports.
o IT and Data Engineers: Responsible for maintaining the BI infrastructure, integrating
data sources, and ensuring data quality.
14. Metadata
• Metadata is data about data. It describes the structure, content, and meaning of data, making it
easier for users to understand and navigate BI systems.
o Metadata helps ensure consistency and clarity in reporting, data retrieval, and analysis.
DATA WAREHOUSING
Data Warehousing involves collecting, storing, and managing large volumes of structured data from
multiple sources in a central repository known as a data warehouse. The purpose of a data warehouse is
to provide a consolidated view of business data from across an organization, enabling efficient querying
and analysis for decision-making.
Key Concepts in Data Warehousing:
• Data Warehouse: A centralized repository that integrates data from various sources (e.g.,
transactional databases, external data, flat files). Data is typically stored in a dimensional format,
optimized for analytical querying rather than operational tasks.
• ETL Process: Extract, Transform, Load (ETL) is the process used to gather data from various
sources, transform it into a consistent format, and load it into the data warehouse.
o Extract: Extracting data from operational systems (e.g., sales, inventory, customer
databases).
o Transform: Converting the extracted data into a format suitable for analysis, which
might include cleaning, aggregating, or joining data from different sources.
o Load: Inserting the transformed data into the data warehouse, often in a schema
optimized for reporting and querying.
• OLAP (Online Analytical Processing): A set of tools and technologies that allow users to
analyze data in a multidimensional format. OLAP enables fast querying and reporting by
organizing data into cubes that are optimized for high-performance analysis.
• Star Schema: A type of data schema used in data warehouses where a central fact table (e.g.,
sales data) is connected to multiple dimension tables (e.g., customers, time, products). The
schema is designed for simplicity and efficient querying.
• Data Marts: Subsets of a data warehouse that focus on a specific business function or
department, such as finance or marketing.
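As a rough sketch of how the star schema described above is queried, the following pandas example joins a toy fact table to two dimension tables; the table and column names are invented for illustration:
import pandas as pd

# Dimension tables
dim_product = pd.DataFrame({"product_id": [1, 2], "category": ["Laptop", "Phone"]})
dim_date = pd.DataFrame({"date_id": [20240101, 20240102], "month": ["Jan", "Jan"]})

# Central fact table referencing the dimensions by key
fact_sales = pd.DataFrame({
    "date_id":    [20240101, 20240101, 20240102],
    "product_id": [1, 2, 1],
    "revenue":    [1200.0, 650.0, 1150.0],
})

# A typical star-schema query: join facts to dimensions, then aggregate
report = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_date, on="date_id")
          .groupby(["month", "category"], as_index=False)["revenue"].sum())
print(report)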
Benefits of Data Warehousing:
• Centralized Data: All business data is stored in one location, allowing for better decision-making
and reporting.
• Improved Data Quality: Data is cleaned and transformed, making it more consistent and
accurate for analysis.
• Historical Data Analysis: Data warehouses typically store historical data, enabling long-term
trend analysis and forecasting.
• Optimized for Querying: Data warehouses are optimized for read-heavy operations, allowing
for fast queries and complex reporting.
Data Warehousing Architecture:
• Staging Area: Temporary storage used during the ETL process where data is cleaned and
transformed.
• Data Warehouse: The central repository where transformed data is stored.
• Data Mart: A smaller, focused database that holds data for a particular department or business
function.
• OLAP Cubes: A multidimensional structure that allows for fast and flexible analysis by
providing various perspectives on the data.
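OLAP-style roll-ups can be approximated on a small dataset with a pandas pivot table, as in the sketch below; the figures and dimensions are invented:
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "region":  ["North", "South", "North", "South"],
    "revenue": [100, 150, 120, 180],
})

# Aggregate revenue along two dimensions (geography and time), as an OLAP cube would
cube = pd.pivot_table(sales, values="revenue", index="region",
                      columns="year", aggfunc="sum", margins=True)
print(cube)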
Applications of OLAP:
1. Financial Analysis:
o OLAP is commonly used for financial reporting and analysis, including budget planning,
variance analysis, profit and loss reports, and cash flow projections.
2. Sales and Marketing:
o Sales and marketing teams use OLAP to analyze sales trends, customer behavior, product
performance, and marketing campaign results. This allows for targeted marketing
strategies and forecasting future sales.
3. Supply Chain Management:
o OLAP tools are used to optimize supply chain operations by analyzing inventory levels,
vendor performance, demand forecasts, and logistics.
4. Healthcare:
o Healthcare organizations use OLAP for analyzing patient data, treatment outcomes, and
operational efficiency to improve decision-making and patient care.
5. Retail:
o Retailers use OLAP to analyze purchasing trends, customer segmentation, inventory
management, and sales performance.
6. Executive Dashboards:
o OLAP is often used to create executive dashboards that provide real-time data on KPIs,
financial metrics, and other performance indicators across the organization.
Popular OLAP Tools:
• Microsoft SQL Server Analysis Services (SSAS): Provides MOLAP, ROLAP, and HOLAP
functionality for building multidimensional models and cubes.
• IBM Cognos Analytics: A powerful BI suite that offers OLAP capabilities for in-depth data
analysis and reporting.
• SAP BusinessObjects: An integrated BI platform that includes OLAP tools for multidimensional
analysis.
• Oracle OLAP: Offers OLAP tools as part of Oracle's suite of business intelligence solutions.
• QlikView and Qlik Sense: Qlik’s associative model provides a form of OLAP analysis, allowing
users to interactively explore and visualize data.
DIKW PYRAMID
The DIKW Pyramid (Data, Information, Knowledge, Wisdom) is a framework that illustrates the
hierarchy of how raw data is transformed into actionable insights. It represents the process of deriving
meaning from data through increasing levels of refinement, with each level contributing to decision-
making and problem-solving.
The DIKW Pyramid Breakdown:
1. Data (Base of the Pyramid):
• Definition: Data represents raw facts and figures without context or meaning. Data alone has
little to no inherent value until it is processed and interpreted.
• Characteristics:
o Unprocessed and unorganized.
o Can be quantitative (e.g., numbers, dates) or qualitative (e.g., words, observations).
o Examples: Individual sales transactions, sensor readings, customer contact details, etc.
• Purpose: Raw data is the foundation of the DIKW pyramid. It forms the basis upon which all
further analysis and interpretation are built.
• Tools: Data collection tools, databases, spreadsheets.
2. Information:
• Definition: Information is processed data that has been organized or structured in a way that it
can be understood and used. At this stage, data is contextualized, meaning it is presented with
relevance and purpose.
• Characteristics:
o Data is organized and presented in context.
o Information can be used to identify patterns, trends, or relationships.
o Examples: A report that shows monthly sales numbers across different regions, customer
demographics, etc.
• Purpose: Information is data that has been processed to answer questions like who, what, where,
and when.
• Tools: Data visualization tools (e.g., charts, graphs), reporting systems, dashboards.
3. Knowledge:
• Definition: Knowledge is information that has been further processed and understood by
applying experience, expertise, and context. It is the understanding of patterns and relationships
in the data that help in decision-making.
• Characteristics:
o Information that is interpreted and understood by individuals.
o Contextualized and combined with experience to make sense of information.
o Examples: Analyzing sales trends to identify that specific products sell better in certain
regions, or understanding customer preferences and behavior.
• Purpose: Knowledge is used to answer how and why questions. It builds on information to create
actionable insights.
• Tools: Decision support systems (DSS), analytical models, machine learning algorithms.
4. Wisdom (Top of the Pyramid):
• Definition: Wisdom is the ability to make sound decisions based on knowledge, experience, and a
deep understanding of the context. It involves applying knowledge to practical, real-world
situations to achieve optimal outcomes.
• Characteristics:
o Involves ethical judgment, foresight, and the ability to make decisions based on not just
data but human judgment and experience.
o Example: Deciding on strategic business directions based on a combination of market
trends, customer behavior, and organizational goals.
• Purpose: Wisdom is the ability to use knowledge effectively and make the best possible decisions
in any given situation.
• Tools: Leadership, intuition, strategic frameworks, ethical decision-making models.
Example:
• Data: A customer purchases a product at 3:00 PM.
• Information: The customer purchased a specific product (e.g., "Laptop Model A") at 3:00 PM.
• Knowledge: This product is frequently purchased by customers aged 30–40 in urban areas and is
often bought during holiday seasons.
• Wisdom: To increase sales, the company should target the 30-40 age demographic in urban
locations with marketing campaigns around upcoming holidays, emphasizing the laptop's features
that appeal to this group.
KEY DRIVERS
The successful implementation of Business Intelligence (BI) involves a variety of factors that drive its
adoption and effectiveness. These key drivers ensure that BI systems deliver actionable insights, foster
better decision-making, and align with organizational goals. Here are the primary key drivers for BI
implementation:
1. Data Quality and Integration
• Reliable and Clean Data: The foundation of any BI system is high-quality, accurate, and
consistent data. Poor data quality can undermine the reliability of insights generated by BI tools.
Establishing data governance practices ensures that data is standardized, cleaned, and validated
before being analyzed.
• Data Integration: Integrating data from disparate sources (internal and external) ensures that the
BI system offers a unified view. Seamless integration of structured and unstructured data, cloud
and on-premises data, and various business applications (CRM, ERP, etc.) enhances decision-
making.
2. Leadership Support and Organizational Culture
• Executive Sponsorship: BI adoption is more likely to succeed when it has strong support from
top leadership. Executives play a key role in advocating for BI initiatives and allocating the
necessary resources.
• Culture of Data-Driven Decision Making: The organization must cultivate a culture that
embraces data-driven decision-making. Employees at all levels should understand the value of BI
and be motivated to use insights to improve performance.
3. Clear Business Objectives and Alignment
• Strategic Alignment: BI initiatives should align with the organization’s strategic goals and
objectives. Without clear business goals, BI may fail to provide the insights needed to solve key
problems.
• Business Requirement Definition: Establishing clear requirements from key business
stakeholders ensures that BI tools meet the specific needs of various departments (sales,
marketing, operations, finance, etc.).
4. Technology and Tools
• User-Friendly Tools: The BI tools selected should be intuitive and accessible to non-technical
users. Ease of use promotes widespread adoption across the organization, including executives,
managers, and front-line employees.
• Advanced Analytics Capabilities: BI tools should provide more than basic reporting. They
should support advanced analytics, including predictive analytics, machine learning, and data
visualization, to uncover trends, forecast outcomes, and generate insights.
• Scalability and Flexibility: The BI system should be scalable to accommodate the growing data
needs of the business and flexible enough to adapt to future technological advancements.
5. Skills and Training
• Data Literacy: Employees should be trained in understanding data, interpreting reports, and
leveraging BI tools. Regular training helps improve overall data literacy and ensures that users
can make informed decisions using the data provided by the BI system.
• BI Expertise: Having skilled professionals, including data analysts, data scientists, and BI
specialists, is essential for implementing and maintaining BI solutions. These experts ensure that
the BI platform is optimally configured, reports are relevant, and users get the most out of the
system.
6. Change Management and User Adoption
• Effective Change Management: BI implementation often requires a cultural shift, particularly if
employees are used to traditional decision-making processes. Clear communication, training
programs, and support during the transition help users embrace new technologies.
• User-Centric Design: BI solutions should be designed with the end user in mind. This includes
considering the needs and skills of different user groups to ensure that the system is useful and
that users adopt it.
7. Cost and ROI Considerations
• Cost-Effectiveness: The cost of implementing BI systems should be justified by the value they
provide. A well-executed BI system can deliver significant returns by improving decision-
making, efficiency, and profitability.
• ROI Measurement: Establishing clear metrics and KPIs for measuring the success of BI
implementation helps track its effectiveness and justifies continued investment.
8. Security and Compliance
• Data Security: BI systems often handle sensitive business data, so security is a top priority.
Proper access controls, data encryption, and compliance with relevant regulations (GDPR,
HIPAA, etc.) are necessary to prevent breaches and ensure privacy.
• Regulatory Compliance: BI solutions must comply with industry-specific regulations to ensure
that the organization adheres to standards related to data storage, access, and reporting.
9. Continuous Improvement and Iteration
• Feedback Loops: BI systems should be iteratively improved based on user feedback, evolving
business requirements, and changing market conditions. Regular updates and maintenance help
the system remain relevant and effective.
• Performance Monitoring: Monitoring BI system performance ensures that it continues to meet
business needs. If performance declines or new requirements arise, adjustments should be made.
10. Collaboration and Communication
• Cross-Functional Collaboration: BI should promote collaboration between departments (sales,
finance, HR, operations, etc.). Having a single source of truth that departments can refer to
enhances alignment and cross-functional decision-making.
• Effective Communication of Insights: Data visualization and clear reporting are key to ensuring
that insights from BI tools are understandable and actionable. The BI system should enable the
sharing of insights through dashboards, reports, and alerts to stakeholders.
PERFORMANCE METRICS
Performance metrics are key indicators used to evaluate and measure the effectiveness of various
processes, operations, and activities within an organization. They are essential for monitoring progress,
identifying areas for improvement, and ensuring that business objectives are being met. Performance
metrics can be used across all business functions, from finance to marketing, operations to human
resources, and more.
Here’s a breakdown of the most common types of performance metrics used across various business
areas:
1. Financial Metrics
These metrics track the financial health and profitability of an organization.
• Revenue Growth: Measures the increase or decrease in a company’s revenue over a specific
period.
• Gross Profit Margin: Represents the percentage of revenue that exceeds the cost of goods sold
(COGS). Gross Profit Margin = ((Revenue − COGS) / Revenue) × 100
• Net Profit Margin: Measures profitability after all expenses, taxes, and costs are subtracted from
total revenue. Net Profit Margin = (Net Income / Revenue) × 100
• Return on Assets (ROA): Measures how efficiently a company uses its assets to generate profit.
ROA = Net Income / Total Assets
• Return on Investment (ROI): Assesses the profitability of an investment relative to its cost.
ROI = (Net Profit / Cost of Investment) × 100
• Operating Cash Flow: Indicates the amount of cash a company generates from its core business
operations.
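As a worked example of these ratios, here is a short Python sketch with invented figures (all numbers are purely illustrative):
# Hypothetical figures for one reporting period
revenue, cogs = 500_000.0, 320_000.0
net_income, total_assets = 60_000.0, 750_000.0
investment_profit, investment_cost = 10_000.0, 40_000.0

gross_profit_margin = (revenue - cogs) / revenue * 100   # 36.0 %
net_profit_margin = net_income / revenue * 100           # 12.0 %
roa = net_income / total_assets                          # 0.08, i.e. 8 %
roi = investment_profit / investment_cost * 100          # 25.0 %

print(gross_profit_margin, net_profit_margin, roa, roi)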
2. Customer Metrics
These metrics help evaluate customer satisfaction, loyalty, and overall experience.
• Customer Satisfaction (CSAT): Measures how satisfied customers are with a company’s
products or services, often through surveys.
• Net Promoter Score (NPS): Measures customer loyalty by asking how likely customers are to
recommend a product or service to others.
• Customer Retention Rate: Measures the percentage of customers who continue to buy or
engage with a brand over time.
Customer Retention Rate = ((Customers at End of Period − New Customers Acquired) / Customers at Start of Period) × 100
• Customer Lifetime Value (CLV): Estimates the total revenue a company expects to earn from a
customer over the entire relationship.
• Customer Acquisition Cost (CAC): Measures the cost of acquiring a new customer, including
marketing and sales expenses.
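A worked example of the retention-rate formula and CAC, again with invented quarterly numbers:
# Hypothetical values for one quarter
customers_start, customers_end, new_customers = 1_000, 1_050, 150
marketing_and_sales_spend = 30_000.0

retention_rate = (customers_end - new_customers) / customers_start * 100   # 90.0 %
cac = marketing_and_sales_spend / new_customers                            # 200.0 per new customer

print(retention_rate, cac)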
3. Operational Metrics
These metrics focus on the efficiency and effectiveness of internal processes and operations.
• Cycle Time: The total time taken to complete a task, project, or manufacturing process from start
to finish.
• Inventory Turnover: Indicates how often a company sells and replaces its inventory in a given
period. Inventory Turnover = COGS / Average Inventory
• Order Fulfillment Cycle Time: Measures the average time taken to process and deliver an order
from the moment it’s placed.
• Production Efficiency: The ratio of actual output to expected output, showing how efficiently
production resources are used.
• Downtime: The amount of time a system or process is unavailable, typically due to maintenance
or failure.
4. Marketing Metrics
These metrics help assess the effectiveness of marketing strategies and campaigns.
• Conversion Rate: The percentage of visitors or leads that take a desired action, such as
completing a purchase or filling out a form.
Conversion Rate = (Number of Conversions / Total Visitors or Leads) × 100
• Cost Per Acquisition (CPA): Measures the cost of acquiring a customer through a marketing
campaign.
• Lead-to-Customer Conversion Rate: The percentage of leads that become paying customers.
• Return on Marketing Investment (ROMI): Measures the revenue generated from marketing
efforts compared to the cost of those efforts.
ROMI = (Revenue from Marketing / Marketing Spend) × 100
• Click-Through Rate (CTR): Measures how often people click on a link in digital ads or emails.
CTR = (Number of Clicks / Number of Impressions) × 100
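The marketing formulas above, computed for an invented campaign:
# Hypothetical campaign figures
visitors, conversions = 20_000, 500
clicks, impressions = 1_200, 60_000
revenue_from_marketing, marketing_spend = 45_000.0, 15_000.0

conversion_rate = conversions / visitors * 100          # 2.5 %
ctr = clicks / impressions * 100                        # 2.0 %
romi = revenue_from_marketing / marketing_spend * 100   # 300.0 %

print(conversion_rate, ctr, romi)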
5. Sales Metrics
These metrics measure the performance of the sales function, including revenue generation and sales team
effectiveness.
• Sales Growth: The percentage increase or decrease in sales over a given period.
• Sales Conversion Rate: The percentage of leads that turn into actual sales.
Sales Conversion Rate = (Number of Sales / Number of Leads) × 100
• Average Deal Size: Measures the average revenue generated per closed sale.
• Sales Pipeline Value: The total potential value of deals in the sales pipeline.
• Quota Achievement: The percentage of sales targets or quotas achieved by the sales team.
6. IT Metrics
These metrics are used to track the performance of IT systems, including infrastructure, software, and
technical support.
• System Downtime: The amount of time that IT systems or applications are unavailable.
• Incident Response Time: The average time taken to respond to and resolve IT incidents.
• Network Latency: The time delay in the transmission of data over a network, which can impact
system performance.
• IT Support Ticket Volume: The number of IT support requests raised over a given period.
• System Utilization: The extent to which IT resources (e.g., servers, storage) are being used.
BI ARCHITECTURE/FRAMEWORK
A business intelligence architecture is a framework for the various technologies an organization deploys
to run business intelligence and analytics applications. It includes the IT systems and software tools that
are used to collect, integrate, store and analyze BI data and then present information on business
operations and trends to corporate executives and other business users.
The underlying BI architecture is a key element in the execution of a successful business intelligence
program that uses data analysis and reporting to help an organization track business performance,
optimize business processes, identify new revenue opportunities, improve strategic planning and make
more informed business decisions.
Benefits of BI architecture
• Technology benchmarks. A BI architecture articulates the technology standards and data
management and business analytics practices that support an organization's BI efforts, as well as
the specific platforms and tools deployed.
• Improved decision-making. Enterprises benefit from an effective BI architecture by using the
insights generated by business intelligence tools to make data-driven decisions that help increase
revenue and profits.
• Technology blueprint. A BI framework serves as a technology blueprint for collecting,
organizing and managing BI data and then making the data available for analysis, data
visualization and reporting. A strong BI architecture automates reporting and incorporates policies
to govern the use of the technology components.
• Enhanced coordination. Putting such a framework in place enables a BI team to work in a
coordinated and disciplined way to build an enterprise BI program that meets the organization's
data analytics needs. The BI architecture also helps BI and data managers create an efficient
process for handling and managing the business data that's pulled into the environment.
• Time savings. By automating the process of collecting and analyzing data, BI helps organizations
save time on manual and repetitive tasks, freeing up their teams to focus on more high-value
projects.
• Scalability. An effective BI infrastructure is easily scalable, enabling businesses to change and
expand as necessary.
• Improved customer service. Business intelligence enhances customer understanding and service
delivery by helping track customer satisfaction and facilitate timely improvements. For example,
an e-commerce store can use BI to track order delivery times and optimize shipping for better
customer satisfaction.
Business intelligence architecture components
A BI architecture can be deployed in an on-premises data center or in the cloud. In either case, it contains
a set of core components that collectively support the different stages of the BI process from data
collection, integration, data storage and analysis to data visualization, information delivery and the use of
BI data in business decision-making.
Key Components of BI Architecture
1. Data Sources
o External Data: Data acquired from external sources such as market research, social
media, third-party APIs, or public datasets.
o Internal Data: Data generated within the organization from operational systems like
Enterprise Resource Planning (ERP), Customer Relationship Management (CRM),
transaction databases, and other enterprise applications.
o Unstructured Data: Text, emails, social media data, and other forms of data that do not
have a predefined model.
2. Data Integration Layer
o Extract, Transform, Load (ETL): This is the process that extracts data from various
source systems, transforms it into a usable format (such as cleaning or enriching the
data), and loads it into a centralized data repository. ETL tools can be used to automate
this process.
o Data Extraction: The process of gathering raw data from multiple sources.
o Data Transformation: Ensuring that data from different sources is cleaned, formatted,
and structured properly for analysis.
o Data Loading: Storing the transformed data in a data warehouse or a data lake for further
processing and analysis.
o Data Integration Tools: Software like Informatica, Talend, or Microsoft SSIS (SQL
Server Integration Services) that facilitate data integration tasks.
3. Data Storage Layer
o Data Warehouse: A large, centralized repository that stores structured data, typically in
relational databases. It is used for storing historical data from different systems, cleaned
and ready for reporting and analysis. Data warehouses are optimized for read-heavy
operations, like queries and reports.
o Data Lake: A storage repository that holds raw, unprocessed data in its native format
(structured, semi-structured, and unstructured data). It allows flexibility in storing vast
amounts of varied data types, including logs, sensor data, and social media feeds.
o Data Marts: Smaller, subject-specific data repositories (such as for sales, finance, or
marketing) that allow for fast, targeted analysis without querying the entire data
warehouse.
o OLAP (Online Analytical Processing) Cubes: A data structure used for
multidimensional analysis, which allows fast querying and reporting of business data
along multiple dimensions (like time, geography, product category, etc.).
4. Data Processing Layer
o Data Cleansing: The process of correcting or removing incorrect, corrupted, or irrelevant
data from the data warehouse or data lake.
o Data Transformation: Further refinement of data to ensure it is in the correct format and
structure for analysis. This may include aggregation, normalization, or creating calculated
fields.
o Data Modeling: Designing schemas (like star schema or snowflake schema) that define
how the data is structured and related to enable efficient querying and reporting.
5. Business Analytics Layer
o Data Mining: The process of discovering patterns, correlations, and trends in large
datasets using statistical and machine learning techniques. Tools like SAS, IBM SPSS, or
R are often used here.
o Predictive Analytics: Uses historical data and algorithms to predict future trends and
behaviors. It involves techniques like regression analysis, forecasting, and machine
learning models.
o Reporting & Dashboards: Tools like Tableau, Power BI, or Qlik Sense provide
graphical representations of data insights (such as charts, graphs, and tables) to make
complex data easier to interpret and actionable. Reports and dashboards can be
customized to display KPIs, trends, and metrics that align with business objectives.
o Ad-hoc Querying: Allowing business users to explore data and create custom queries
without the need for IT involvement, often through self-service BI tools.
6. Presentation Layer
o BI Tools: Business Intelligence tools (such as Power BI, Tableau, or Looker) are used to
present insights visually through dashboards, reports, charts, and graphs. These tools
make it easier for decision-makers to interpret data and make informed decisions.
o Self-Service BI: Allows end-users to create reports and analyze data independently
without heavy reliance on IT or data analysts.
o Mobile BI: Access to BI reports and dashboards via mobile devices, providing decision-
makers with real-time access to insights wherever they are.
7. Users and Decision-Makers
o Executives and Managers: These are the primary consumers of high-level BI reports,
dashboards, and performance metrics.
o Business Analysts: Use BI tools to extract, analyze, and interpret data to generate
actionable insights for the business.
o Operational Users: May use BI tools for more tactical, day-to-day decision-making
based on real-time data.
o IT Support: Provides the technical infrastructure and support to ensure the BI
environment operates smoothly and securely.
8. Security and Governance Layer
o Data Security: Protecting sensitive business data through encryption, access controls,
and other security protocols.
o Data Governance: Ensures that data management practices follow the rules, regulations,
and organizational policies. This includes establishing data standards, roles, and
responsibilities for data management.
o User Access Management: Role-based access control (RBAC) to ensure that only
authorized individuals have access to certain levels of data and reports.
o Compliance: Ensuring that data handling complies with industry standards and
regulations like GDPR, HIPAA, etc.
BI Architecture Overview
A typical BI architecture framework can be summarized as follows:
1. Data Sources → 2. ETL/Integration Layer → 3. Data Storage (Warehouse, Data Lake) → 4.
Data Processing (Cleaning, Modeling) → 5. Analytics & Reporting → 6. Presentation
(Dashboards, Reports) → 7. Users (Decision-Makers) → 8. Security & Governance
BEST PRACTICES
Best Practices for Business Intelligence (BI) Implementation help ensure the success of BI initiatives
by maximizing the value derived from data, improving decision-making, and ensuring that BI systems
align with organizational goals. Here are key best practices for implementing and managing a BI
architecture:
1. Clear Business Objectives and Strategy
• Align BI with Business Goals: Define the specific business objectives that the BI system will
support. Understand the needs of different departments (sales, marketing, finance, etc.) to ensure
that the BI system addresses their specific goals.
• Develop a BI Roadmap: Plan and prioritize BI initiatives based on business needs, available
resources, and potential impact. A clear roadmap helps ensure that the BI system evolves
strategically and can adapt to future requirements.
• Involve Key Stakeholders: Engage business leaders, department heads, and end users early in
the process to understand their pain points, requirements, and expectations from the BI system.
2. Data Governance and Quality
• Establish Data Governance Policies: Create a governance framework to ensure consistency,
accuracy, and security of data. This includes defining roles and responsibilities, setting data
standards, and implementing data stewardship.
• Ensure Data Quality: BI success relies on high-quality, accurate, and clean data. Implement
processes to regularly monitor and improve data quality, such as data profiling, cleansing, and
validation techniques.
• Data Security and Privacy: Ensure that sensitive data is protected by implementing access
control, encryption, and compliance with regulations (e.g., GDPR, HIPAA).
3. Centralized Data Repository
• Data Warehouse and Data Marts: Store integrated, cleaned, and structured data in a centralized
data warehouse or separate data marts for departmental use. A well-structured repository enables
efficient querying and reporting.
• Use a Data Lake for Raw Data: Consider using a data lake for storing raw, unstructured, or
semi-structured data. This allows for flexibility in analysis and processing, especially with big
data.
• Data Integration: Utilize ETL (Extract, Transform, Load) processes, and possibly ELT, to
efficiently integrate and clean data from multiple sources.
4. User-Centric Design
• Design for the End-User: BI systems should be intuitive, user-friendly, and tailored to the needs
of business users, not just data analysts or IT staff. Consider self-service BI tools that allow
business users to explore data and generate their own reports without relying on IT.
• Role-Based Dashboards: Build customizable, role-specific dashboards and reports that present
relevant, actionable insights to different users (e.g., executives, managers, operational staff).
• Training and Support: Provide continuous training and support to users, helping them
understand the BI tools and interpret the data for better decision-making.
5. Adopt Self-Service BI
• Empower Users: Allow business analysts and managers to create their own reports and
dashboards with self-service BI tools. This reduces dependency on IT and enables faster decision-
making.
• Governance with Flexibility: While empowering users with self-service capabilities, maintain
governance to ensure that users are accessing the correct data and insights.
6. Real-Time Analytics and Reporting
• Enable Real-Time Data Access: Integrate real-time data sources and implement tools that allow
for live dashboards and real-time reporting. This is crucial for industries that require up-to-the-
minute data (e.g., finance, operations).
• Streaming and Event-Driven Analytics: Implement real-time analytics capabilities to monitor
and react to data as it comes in, particularly for operations or customer service-related processes.
7. Scalability and Flexibility
• Scalable Architecture: Design a BI architecture that can scale with the organization's growth and
increasing data volumes. This can include leveraging cloud-based platforms for flexible storage
and processing.
• Modular Design: Build a modular system where new BI components (tools, data sources,
processes) can be added as the business needs evolve.
8. Advanced Analytics and Machine Learning
• Integrate Advanced Analytics: Incorporate advanced analytics (like predictive analytics, data
mining, and machine learning) into the BI system to uncover hidden patterns and predict future
trends, not just past performance.
• Automate Insights: Leverage AI-powered tools that can provide automated insights and
recommendations, reducing the time spent analyzing data manually.
9. Effective Data Visualization
• Visualize Data Effectively: Use data visualization tools to present insights in an easily digestible
format. Proper visualizations (charts, graphs, heat maps, etc.) can quickly highlight trends,
anomalies, and key metrics.
• Avoid Information Overload: Ensure that dashboards and reports are not cluttered with too
much information. Focus on key metrics and KPIs that matter most for decision-making.
10. Collaborative BI Culture
• Promote Collaboration: Encourage collaboration between business units and IT teams. A
successful BI implementation requires cross-departmental cooperation to ensure data needs are
properly understood and met.
• Share Insights Across the Organization: Ensure that insights derived from BI are shared across
departments, fostering a data-driven culture where decisions are based on facts rather than
intuition.
11. Continuous Monitoring and Improvement
• Monitor System Performance: Regularly monitor the performance of the BI system (e.g., speed,
data refresh rates, system downtime). Make sure it is optimized to deliver timely and accurate
insights.
• Iterative Improvement: BI systems should be iteratively improved. Gather feedback from users,
track performance, and refine processes, tools, and reports as new business needs arise.
• Evaluate ROI: Continuously evaluate the return on investment (ROI) of BI initiatives. Assess
whether the BI system is delivering the expected value to the business, both in terms of improved
decision-making and efficiency.
12. Cloud Adoption and Integration
• Cloud-Based BI Solutions: Cloud platforms offer flexibility, scalability, and cost-effectiveness.
Consider using cloud-based BI tools and services to reduce infrastructure costs and allow easy
access to data and reports from anywhere.
• Hybrid Models: A hybrid approach, combining on-premise and cloud solutions, can balance
security concerns and the need for flexibility, depending on the type of data and compliance
requirements.
13. Performance Management and KPI Alignment
• Focus on KPIs: Define and track Key Performance Indicators (KPIs) aligned with business goals
to evaluate the success of BI initiatives. This helps ensure that BI efforts are directed towards
achieving measurable business outcomes.
• Measure Effectiveness: Use performance metrics to assess how effectively the BI system
supports decision-making and whether it delivers tangible improvements to business
performance.
TYPES OF ALERTS IN BUSINESS INTELLIGENCE
1. Threshold-Based Alerts
• Definition: Alerts triggered when a metric exceeds or falls below a predefined threshold. This is
the most common form of alert and is used to monitor KPIs or performance indicators.
• Example: An alert is triggered when monthly sales drop below a target of $100,000, or when
inventory levels fall below a minimum threshold.
• Use Case: Financial performance monitoring, customer service SLA compliance, operational cost
tracking.
• Benefits: Simple to set up, easy to understand, and actionable in real-time.
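A minimal Python sketch of such a threshold check, assuming the KPI values are already available as variables; in practice these rules are configured through the BI platform's alerting interface rather than hand-coded:
# Hypothetical KPI values and thresholds
monthly_sales, sales_target = 92_000.0, 100_000.0
inventory_level, inventory_minimum = 40, 50

alerts = []
if monthly_sales < sales_target:
    alerts.append(f"Sales alert: {monthly_sales:,.0f} is below the target of {sales_target:,.0f}")
if inventory_level < inventory_minimum:
    alerts.append(f"Inventory alert: {inventory_level} units is below the minimum of {inventory_minimum}")

for message in alerts:
    print(message)  # a real BI tool would send an email or dashboard notification instead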
2. Trend-Based Alerts
• Definition: These alerts are based on changes or patterns over time rather than just a single data
point. They track how metrics evolve in a particular direction, such as a rising trend or significant
deviations from historical patterns.
• Example: An alert is triggered if sales are increasing by more than 10% week-over-week, or if
website traffic has dropped significantly compared to the previous month.
• Use Case: Monitoring market trends, sales performance, or customer engagement patterns.
• Benefits: Helps identify emerging issues or opportunities early and supports strategic planning.
3. Anomaly-Based Alerts
• Definition: Alerts triggered when data significantly deviates from expected patterns or norms.
This is often powered by machine learning or advanced analytics models.
• Example: An alert is triggered if customer orders spike unusually or if website traffic deviates
from the expected range based on historical data and seasonality.
• Use Case: Fraud detection, operational efficiency monitoring, performance issues.
• Benefits: Detects outliers or unexpected events that may not be captured by fixed thresholds.
4. Event-Driven Alerts
• Definition: These alerts are triggered by specific events or actions within the business process or
external systems. They respond to actions, rather than changes in metrics or data.
• Example: A new customer sign-up triggers a welcome email, or a stockout in the inventory
management system prompts an immediate reorder.
• Use Case: Customer relationship management (CRM), supply chain management, marketing
automation.
• Benefits: Facilitates automated actions and responses in real-time, streamlining business
operations.
5. Scheduled Alerts
• Definition: Alerts that are triggered at regular, predefined times or intervals. These can be time-
based notifications or reminders that ensure business teams stay on track with ongoing tasks.
• Example: A weekly report alert that summarizes financial performance for the past week, or an
end-of-month reminder to review budget performance.
• Use Case: Regular reporting, performance tracking, project management.
• Benefits: Ensures timely review and analysis of data on a recurring schedule.
6. User-Defined Alerts
• Definition: Alerts that allow users to set their own criteria based on personal preferences or
specific departmental needs. Users can configure the parameters and conditions that trigger alerts.
• Example: A sales manager may set an alert to notify them whenever a particular product
category's sales exceed $50,000.
• Use Case: Personalized notifications for sales targets, operational performance, and customer
satisfaction.
• Benefits: Highly flexible, allowing users to customize alerts based on their own priorities.
7. Cross-System Alerts
• Definition: Alerts that are triggered when data from multiple systems or sources meet a specific
condition, often requiring integration across platforms.
• Example: An alert when both CRM and ERP systems report a mismatch in inventory data or
when a marketing campaign's performance exceeds expectations across multiple channels.
• Use Case: Multi-departmental coordination, supply chain synchronization, marketing campaign
monitoring.
• Benefits: Helps in integrating insights across different parts of the business for more holistic
decision-making.
8. Geolocation-Based Alerts
• Definition: Alerts triggered based on the physical location of assets, devices, or individuals.
These are common in industries like retail, logistics, and transportation.
• Example: An alert triggered when a delivery truck deviates from its planned route or if a store’s
inventory level falls below a set amount in a particular location.
• Use Case: Fleet management, inventory management, field operations.
• Benefits: Adds a location-based dimension to business monitoring, allowing for more responsive
operational management.
9. Predictive Alerts
• Definition: Alerts that anticipate future conditions based on predictive analytics and machine
learning models, notifying users of potential issues or opportunities before they arise.
• Example: An alert forecasting that customer churn will rise by 10% in the next month, based on
predictive models analyzing past behavior patterns.
• Use Case: Predictive maintenance, customer retention, sales forecasting.
• Benefits: Proactive decision-making, enabling businesses to act before an issue fully materializes.
5. Decision-Making
• Description: The intelligence created from data analysis is used to support business decisions.
This phase is where the insights inform specific actions, strategies, or changes in direction.
• Examples:
o Strategic decisions like entering a new market based on predictive analytics.
o Tactical decisions like adjusting marketing spend or production schedules based on sales
forecasts.
o Operational decisions such as optimizing inventory management or staffing levels based
on performance metrics.
• Purpose: To make informed decisions that are based on data-driven insights rather than intuition
or guesswork.
• Outcome: Business actions or strategies are implemented based on the insights generated from
the data.
DECISION TAXONOMY
Decision taxonomy refers to the classification or categorization of decisions based on various
characteristics such as their nature, complexity, time horizon, and level of impact on the
organization. By understanding different types of decisions and how they are made,
organizations can better design decision support systems, optimize decision-making processes,
and apply the appropriate tools and methods for each type of decision.
Decision taxonomy helps in structuring decision-making processes and categorizing decisions to
make it easier to understand how decisions can be handled, automated, or supported in an
organization.
7. Real-time Decision-Making
• Principle: In today’s fast-paced business environment, decisions often need to be made
in real-time or near-real-time. A DMS should be capable of processing data and providing
insights quickly to enable timely decisions.
• Real-time data processing and streaming analytics are key components of systems that
support immediate decision-making.
o Example: Real-time fraud detection in banking systems or dynamic pricing in e-
commerce based on current market conditions.
9. Context-Aware Decision-Making
• Principle: Decisions should be made with awareness of the context in which they occur.
Contextual information such as current conditions, environmental factors, and
organizational priorities should influence the decision-making process.
• Context-aware computing and situational awareness tools enable the system to adjust
its recommendations or decisions based on the situation at hand.
o Example: Adaptive decision-making systems in emergency management or
supply chain optimization that adjust decisions based on real-time data (e.g.,
weather conditions or supply disruptions).
UNIT – 4
Unit IV: Analysis & Visualization. Definition and applications of data mining, data mining process, analysis
methodologies, typical pre-processing operations: combining values into one, handling incomplete or incorrect data,
handling missing values, recoding values, subsetting, sorting, transforming scale, determining percentiles, data
manipulation, removing noise, removing inconsistencies, transformations, standardizing, normalizing, min-max
normalization, z-score standardization, rules of standardizing data. Role of visualization in analytics, different
techniques for visualizing data.
ANALYSIS METHODOLOGIES
In the field of data mining and analytics, there are several analysis methodologies that help
transform raw data into actionable insights. These methodologies utilize different techniques and
models to uncover patterns, relationships, and trends within the data. Below are some key
analysis methodologies used in data mining:
1. Descriptive Analysis
Descriptive analysis aims to summarize and describe the main features of a dataset. It focuses on
understanding the past by summarizing historical data into useful statistics and visualizations.
• Key Techniques:
o Summary Statistics: Mean, median, mode, standard deviation, and variance are
used to summarize the data.
o Data Visualization: Techniques like bar charts, histograms, box plots, scatter
plots, and heatmaps are used to visually represent the data, revealing patterns and
relationships.
o Clustering: Grouping similar data points based on shared characteristics (e.g., k-
means clustering).
• Applications:
o Market research (e.g., sales analysis)
o Customer segmentation
o Website traffic analysis
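A small pandas sketch of descriptive analysis on an invented sales table (the column names and values are assumptions for illustration):
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "amount": [120, 90, 200, 150, 175],
})

# Summary statistics (count, mean, std, quartiles) for the numeric column
print(sales["amount"].describe())

# Simple descriptive breakdown by group
print(sales.groupby("region")["amount"].agg(["mean", "sum", "count"]))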
2. Predictive Analysis
Predictive analysis is used to forecast future outcomes based on historical data. It uses statistical
models and machine learning algorithms to make predictions about future events.
• Key Techniques:
o Regression Analysis: Used to predict continuous outcomes, such as forecasting
sales or predicting prices (e.g., linear regression, logistic regression).
o Classification: Assigning items into categories or classes based on input data
(e.g., decision trees, support vector machines, random forests).
o Time Series Analysis: Analyzing data points indexed in time order to predict
future values (e.g., ARIMA, exponential smoothing).
• Applications:
o Financial forecasting (e.g., stock price prediction)
o Risk management (e.g., predicting loan default)
o Demand forecasting in retail
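A minimal predictive sketch using scikit-learn's linear regression; the advertising-spend and sales figures are invented:
import numpy as np
from sklearn.linear_model import LinearRegression

# Historical observations: advertising spend (feature) vs. sales (target)
ad_spend = np.array([[10], [20], [30], [40], [50]], dtype=float)
sales = np.array([120, 190, 310, 400, 480], dtype=float)

model = LinearRegression().fit(ad_spend, sales)

# Forecast sales for a planned spend of 60 (same units as the training data)
print(model.predict(np.array([[60.0]]))[0])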
3. Diagnostic Analysis
Diagnostic analysis seeks to determine the cause of an event or outcome. It goes beyond simple
descriptive analysis and tries to understand the why behind certain patterns or trends.
• Key Techniques:
o Correlation Analysis: Understanding the relationships between variables (e.g.,
Pearson’s correlation, Spearman’s rank correlation).
o Causal Inference: Identifying causal relationships between variables (e.g.,
Granger causality tests).
o Root Cause Analysis: Identifying the underlying factors that contribute to a
problem or event.
• Applications:
o Identifying why a product launch failed
o Determining the cause of a business downturn
o Analyzing website abandonment or churn rates
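A short correlation-analysis sketch with pandas; the data is invented to show one strong relationship and one weak one:
import pandas as pd

data = pd.DataFrame({
    "discount_pct":  [0, 5, 10, 15, 20, 25],
    "units_sold":    [100, 110, 130, 160, 200, 240],
    "support_calls": [30, 28, 31, 29, 30, 27],
})

# Pearson correlation matrix: which variables move together?
print(data.corr(method="pearson"))

# Spearman rank correlation is more robust to monotonic but non-linear relationships
print(data["discount_pct"].corr(data["units_sold"], method="spearman"))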
4. Prescriptive Analysis
Prescriptive analysis helps organizations determine the best course of action to take. It uses
optimization and simulation techniques to suggest decision options that lead to desired outcomes.
• Key Techniques:
o Optimization Algorithms: Linear programming, integer programming, and other
optimization methods to identify the best decision under given constraints.
o Decision Trees: Modeling decisions as a tree to evaluate the potential outcomes
of different actions.
o Simulation: Monte Carlo simulations or scenario analysis to model complex
systems and evaluate how various decisions affect outcomes.
• Applications:
o Supply chain optimization (e.g., determining optimal inventory levels)
o Resource allocation in manufacturing or project management
o Dynamic pricing strategies in e-commerce
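A minimal prescriptive sketch using linear programming via scipy.optimize.linprog; the product-mix profits and resource constraints are assumptions for illustration:
from scipy.optimize import linprog

# Maximize profit 40x + 30y subject to:
#   2x + y <= 100   (machine hours)
#    x + y <=  80   (labour hours)
# linprog minimizes, so the objective is negated.
c = [-40, -30]
A_ub = [[2, 1], [1, 1]]
b_ub = [100, 80]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(result.x, -result.fun)  # optimal production mix (x=20, y=60) and profit (2600)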
5. Text Mining and Sentiment Analysis
Text mining involves extracting useful information from unstructured text data, such as social
media posts, customer reviews, emails, or documents. Sentiment analysis is a subfield of text
mining that determines the sentiment or opinion expressed in text.
• Key Techniques:
o Natural Language Processing (NLP): Techniques such as tokenization, named
entity recognition (NER), and part-of-speech tagging are used to process and
analyze text data.
o Sentiment Analysis: Classifying text into positive, negative, or neutral
sentiments using algorithms like Naive Bayes, Support Vector Machines, or deep
learning-based models.
o Topic Modeling: Using methods like Latent Dirichlet Allocation (LDA) to
discover the hidden themes or topics in text data.
• Applications:
o Customer feedback analysis (e.g., reviews, surveys)
o Social media monitoring (e.g., sentiment of brand mentions)
o News article analysis (e.g., identifying emerging trends)
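A toy sentiment-classification sketch using a bag-of-words model and Naive Bayes from scikit-learn; the four labelled reviews are invented and far too few for a real model, but they show the workflow:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

reviews = ["great product, love it",
           "terrible quality, very disappointed",
           "excellent support and fast delivery",
           "awful experience, would not buy again"]
labels = ["positive", "negative", "positive", "negative"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)         # bag-of-words features
classifier = MultinomialNB().fit(X, labels)

# Classify a new piece of text
print(classifier.predict(vectorizer.transform(["love the fast delivery"])))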
6. Anomaly Detection
Anomaly detection identifies unusual patterns that deviate from the normal behavior of a dataset.
It is used to detect outliers or rare events that may be important for further investigation.
• Key Techniques:
o Statistical Methods: Techniques such as z-scores, modified z-scores, or Grubbs'
test to identify outliers.
o Machine Learning: Using models like k-nearest neighbors (KNN), isolation
forests, or autoencoders to detect anomalies.
o Clustering Algorithms: Identifying anomalies as data points that do not fit well
into any cluster (e.g., DBSCAN).
• Applications:
o Fraud detection (e.g., in banking or insurance)
o Intrusion detection in cybersecurity
o Network traffic monitoring
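A minimal z-score check on an invented series of transaction amounts; a cutoff of 2 standard deviations is used here because the sample is tiny (2 or 3 are common choices):
import numpy as np

# Daily transaction amounts with one unusually large value
amounts = np.array([100, 105, 98, 102, 110, 95, 104, 980], dtype=float)

# Flag points whose z-score exceeds the cutoff
z_scores = (amounts - amounts.mean()) / amounts.std()
print(amounts[np.abs(z_scores) > 2])   # the 980 transaction is flagged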
7. Association Rule Mining
Association rule mining is used to discover interesting relationships (associations) between
variables in large datasets, typically used in market basket analysis.
• Key Techniques:
o Apriori Algorithm: Identifying frequent itemsets in transaction data and
generating association rules.
o FP-Growth Algorithm: A more efficient alternative to Apriori for finding
frequent itemsets.
o Lift, Confidence, and Support: Metrics used to evaluate the strength and
relevance of association rules.
• Applications:
o Market basket analysis (e.g., discovering that people who buy bread also tend to
buy butter)
o Recommendation systems (e.g., suggesting related products in e-commerce)
o Cross-selling strategies in retail
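A worked example of support, confidence, and lift for a single rule, computed directly over five invented transactions (libraries such as mlxtend automate this for large datasets):
# Five illustrative market-basket transactions
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"milk", "jam"},
    {"bread"},
]
n = len(transactions)

# Evaluate the rule {bread} -> {butter}
support_bread = sum("bread" in t for t in transactions) / n              # 0.8
support_butter = sum("butter" in t for t in transactions) / n            # 0.6
support_both = sum({"bread", "butter"} <= t for t in transactions) / n   # 0.6

confidence = support_both / support_bread   # 0.75
lift = confidence / support_butter          # 1.25 (> 1: buying bread makes butter more likely)
print(support_both, confidence, lift)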
8. Cluster Analysis
Cluster analysis is a type of unsupervised learning that groups data points based on similarity. It
is used to find natural groupings or structures within data.
• Key Techniques:
o K-means Clustering: Partitioning data into k clusters based on feature similarity.
o Hierarchical Clustering: Building a tree structure to represent nested clusters.
o DBSCAN: A density-based clustering algorithm that can find arbitrarily shaped
clusters and handle noise.
• Applications:
o Customer segmentation (e.g., grouping customers based on buying behavior)
o Image recognition (e.g., clustering similar images)
o Document clustering (e.g., organizing articles or news into topics)
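A minimal k-means sketch with scikit-learn on invented customer features (annual spend and number of orders):
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([
    [200, 2], [250, 3], [300, 4],        # low-spend customers
    [2000, 25], [2200, 30], [2100, 28],  # high-spend customers
], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # centroid of each segment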
2. Imputation Methods
A. Mean, Median, or Mode Imputation
• When to Use: This is a common method for numerical data, where missing values are
replaced by the mean (for symmetric distributions), median (for skewed distributions), or
mode (for categorical data).
• Pros: Simple and fast.
• Cons: Can introduce bias, especially when the data is not missing completely at random.
• Example:
python
# Mean imputation for a numerical column
df['Column'] = df['Column'].fillna(df['Column'].mean())
B. Forward/Backward Fill
• When to Use: For ordered data such as time series, a missing value can be filled with the
previous valid value (forward fill) or the next valid value (backward fill).
• Example:
python
# Backward fill: use the next valid observation to fill the gap
df['Column'] = df['Column'].fillna(method='bfill')
C. K-Nearest Neighbors (KNN) Imputation
• When to Use: When missing values are correlated with other variables, KNN can be
used to predict missing values based on the values of the nearest neighbors.
• Pros: More accurate than mean/median imputation as it accounts for relationships
between variables.
• Cons: Computationally expensive, especially for large datasets.
• Example (using KNNImputer from sklearn):
python
from sklearn.impute import KNNImputer
# Each missing value is imputed from the 5 nearest rows (all columns must be numeric)
imputer = KNNImputer(n_neighbors=5)
df_imputed = imputer.fit_transform(df)
D. Regression Imputation
• When to Use: You can predict the missing values using regression models based on other
related variables.
• Pros: More accurate than mean/median imputation when relationships exist between
features.
• Cons: Requires building a regression model, which can be computationally expensive.
• Example:
python
from sklearn.linear_model import LinearRegression
# X_train/y_train: rows where the target column is observed;
# X_test: the same predictor columns for rows where the target is missing.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)  # predicted values used to fill the missing entries
1. Recoding Values
Recoding refers to the process of transforming or modifying existing values in a dataset, often to
make them more consistent, interpretable, or suitable for analysis. This can include changing the
scale, converting categorical variables into numeric codes, combining categories, or mapping
values to new ones.
A. Recoding Categorical Variables
Categorical variables may need to be recoded into numeric values, especially when preparing
data for machine learning models that require numerical input.
• Example: Recoding a "Gender" variable with values "Male" and "Female" into 0 and 1,
respectively.
python
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})
B. Recoding Numeric Values into Categories
Sometimes, numeric data needs to be recoded into categories (e.g., age groups). This is often
done by binning continuous data into discrete ranges.
• Example: Recoding age into categories: "Young", "Middle-aged", and "Old".
python
bins = [0, 18, 40, 100]
labels = ['Young', 'Middle-aged', 'Old']
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels)
C. Recoding Based on Conditions
You can also recode values based on specific conditions. For instance, a numeric value can be
recoded into a different value depending on a threshold or condition.
• Example: Recoding "Income" values into "Low", "Medium", "High".
python
df['Income_Category'] = pd.cut(df['Income'], bins=[0, 25000, 50000, 100000], labels=['Low',
'Medium', 'High'])
2. Subsetting Data
Subsetting refers to selecting a specific subset of rows or columns from a dataset based on
conditions or criteria. It is an important step when you want to focus on particular sections of
your data or filter out irrelevant information.
A. Subsetting by Columns
You can select specific columns from a dataset to work with. This is useful when you only need
certain features for analysis or modeling.
• Example: Selecting a subset of columns from a DataFrame.
python
df_subset = df[['Name', 'Age', 'Income']]
B. Subsetting by Rows (Filtering)
You can filter rows based on specific conditions or criteria. For example, selecting rows where
the "Age" is greater than 30.
• Example: Filtering rows where age is greater than 30.
python
df_filtered = df[df['Age'] > 30]
C. Combining Row and Column Subsetting
You can combine row and column subsetting to extract specific data points based on both
conditions.
• Example: Subsetting rows where "Age" is greater than 30 and selecting specific
columns.
python
df_filtered = df[df['Age'] > 30][['Name', 'Income']]
D. Subsetting Based on Multiple Conditions
You can filter data using multiple conditions by combining them with logical operators (e.g., &, |
for AND, OR).
• Example: Filtering rows where "Age" is greater than 30 and "Income" is less than
50,000.
python
df_filtered = df[(df['Age'] > 30) & (df['Income'] < 50000)]
3. Sorting Data
Sorting refers to arranging the data in a specific order, either in ascending or descending order,
based on one or more columns. Sorting can help identify trends, outliers, and patterns in the data.
A. Sorting by One Column
You can sort data by a single column, either in ascending or descending order.
• Example: Sorting by "Age" in ascending order.
python
df_sorted = df.sort_values(by='Age', ascending=True)
B. Sorting by Multiple Columns
You can also sort the dataset based on multiple columns. If the first column has duplicate values,
it will then sort by the second column, and so on.
• Example: Sorting first by "Age" in ascending order, then by "Income" in descending
order.
python
df_sorted = df.sort_values(by=['Age', 'Income'], ascending=[True, False])
C. Sorting by Index
You can sort the data by its index (row labels), which can be useful when dealing with time-
series data or hierarchical indices.
• Example: Sorting by the index in ascending order.
python
df_sorted = df.sort_index(ascending=True)
Practical Examples:
Example 1: Recoding Gender and Age Group
python
import pandas as pd
# Example DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Gender': ['Female', 'Male', 'Male', 'Female'],
'Age': [25, 40, 35, 60]}
df = pd.DataFrame(data)
# Recoding Gender
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})
# Recoding Age into groups: under 30 = Young, 30-39 = Middle-aged, 40 and above = Old
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 30, 40, 100],
labels=['Young', 'Middle-aged', 'Old'], right=False)
print(df)
Output:
Name Gender Age Age_Group
0 Alice 1 25 Young
1 Bob 0 40 Old
2 Charlie 0 35 Middle-aged
3 David 1 60 Old
Example 2: Subsetting Data by Condition
python
# Subsetting rows where Age is greater than 30
df_filtered = df[df['Age'] > 30]
print(df_filtered)
Output:
Name Gender Age Age_Group
1 Bob 0 40 Old
2 Charlie 0 35 Middle-aged
3 David 1 60 Old
Example 3: Sorting Data by Multiple Columns
python
# Sorting by Age in ascending order, then by Gender in descending order
df_sorted = df.sort_values(by=['Age', 'Gender'], ascending=[True, False])
print(df_sorted)
Output:
Name Gender Age Age_Group
0 Alice 1 25 Young
2 Charlie 0 35 Middle-aged
1 Bob 0 40 Old
3 David 1 60 Old
Summary
• Recoding Values: This involves changing the values of a variable to make them more
useful or consistent. It can include transforming categorical data into numerical codes or
binning continuous data into categories.
• Subsetting: This refers to filtering rows or selecting specific columns from a dataset
based on conditions or criteria. You can subset the data to focus on relevant sections for
analysis.
• Sorting: Sorting arranges the data in ascending or descending order based on one or
more columns. Sorting can help identify trends, prioritize records, or prepare the data for
analysis.
TRANSFORMING SCALE
Transforming scale refers to the process of changing the scale or range of data values,
particularly for numerical variables. This is done to ensure that all features are on a comparable
scale, which can improve the performance of many machine learning algorithms that are
sensitive to the scale of the input data (e.g., linear regression, k-nearest neighbors, support vector
machines).
There are several methods for transforming the scale of the data, including normalization,
standardization, and other techniques. Here's a detailed breakdown of the most common
methods:
1. Normalization (Min-Max Scaling)
Min-max scaling rescales each feature to a fixed range, typically [0, 1], using
X_norm = (X − X_min) / (X_max − X_min).
Example:
python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Sample data
data = {'Age': [25, 30, 35, 40, 45], 'Income': [50000, 60000, 55000, 70000, 80000]}
df = pd.DataFrame(data)
# Rescale each column to the range [0, 1]
scaler = MinMaxScaler()
df_normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_normalized)
Output:
    Age    Income
0  0.00  0.000000
1  0.25  0.333333
2  0.50  0.166667
3  0.75  0.666667
4  1.00  1.000000
2. Standardization (Z-Score Normalization)
Standardization rescales each feature to have a mean of 0 and a standard deviation of 1, using
Z = (X − μ) / σ.
Example:
python
from sklearn.preprocessing import StandardScaler
# Sample data
data = {'Age': [25, 30, 35, 40, 45], 'Income': [50000, 60000, 55000, 70000, 80000]}
df = pd.DataFrame(data)
# Center each column at 0 with unit standard deviation
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_standardized)
Output:
        Age    Income
0 -1.414214 -1.207020
1 -0.707107 -0.278543
2  0.000000 -0.742781
3  0.707107  0.649934
4  1.414214  1.578410
3. Robust Scaling
Robust scaling is a technique that scales data based on the median and interquartile range
(IQR) rather than the mean and standard deviation. This method is more robust to outliers, as the
median and IQR are less sensitive to extreme values.
Formula for Robust Scaling:
X_robust = (X − median(X)) / IQR(X)
Where:
• median(X) is the median of the feature,
• IQR(X) is the interquartile range (the difference between the 75th and 25th percentiles).
When to Use:
• When the data has significant outliers, and you don't want them to influence the scaling
process.
• When working with datasets that are not normally distributed.
Example:
python
from sklearn.preprocessing import RobustScaler
# Robust scaling
scaler = RobustScaler()
df_robust = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_robust)
Output:
   Age    Income
0 -1.0 -0.666667
1 -0.5  0.000000
2  0.0 -0.333333
3  0.5  0.666667
4  1.0  1.333333
4. Log Transformation
A log transformation is useful when the data has a skewed distribution, especially when it
follows an exponential growth pattern. The transformation helps stabilize the variance and makes
the distribution more Gaussian (normal).
When to Use:
• When your data has a highly skewed distribution (e.g., income data or population sizes).
• When the data follows an exponential or power-law distribution.
Example:
python
import numpy as np
import pandas as pd
# Sample skewed data
data = {'Income': [100, 200, 300, 400, 500]}
df = pd.DataFrame(data)
# Apply the natural log to reduce skewness
df['Log_Income'] = np.log(df['Income'])
print(df)
Output:
Income Log_Income
0 100 4.605170
1 200 5.298317
2 300 5.703782
3 400 5.991465
4 500 6.214608
5. Power Transformation
Power transformation includes methods such as Box-Cox and Yeo-Johnson. These
transformations are used to stabilize variance, make the data more normally distributed, and
improve the model's performance.
• Box-Cox Transformation: Suitable for positive data.
• Yeo-Johnson Transformation: Works with both positive and negative data.
When to Use:
• When you need to transform skewed data into a more symmetric distribution.
• When your data is heteroscedastic (variance is not constant).
Example (Box-Cox):
python
from sklearn.preprocessing import PowerTransformer
# Sample data
data = {'Income': [100, 200, 300, 400, 500]}
df = pd.DataFrame(data)
# Power transformation using Box-Cox (requires strictly positive input values)
scaler = PowerTransformer(method='box-cox')
df_transformed = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_transformed)
Output:
The exact transformed values depend on the Box-Cox λ estimated from the data; with the
default standardize=True, the transformed column is also rescaled to mean 0 and unit variance.
2. Calculating Percentiles
The formula for calculating the nth percentile (PnP_nPn) of a dataset depends on the specific
percentile rank and the sorted data.
Steps to determine percentiles:
1. Sort the Data: Arrange the data in ascending order.
2. Calculate the Rank:
o For the nth percentile, calculate the rank R_n as:
R_n = (n / 100) × (N + 1)
Where:
▪ n is the percentile rank (e.g., 25 for the 25th percentile),
▪ N is the total number of data points.
3. Interpret the Rank:
o If the rank RnR_nRn is an integer, the percentile is the value at that position in the
sorted data.
o If the rank is not an integer, interpolate between the two closest data points to find
the percentile.
3. Common Percentiles
• 25th Percentile (Q1): The value below which 25% of the data lie.
• 50th Percentile (Median or Q2): The middle value in the dataset.
• 75th Percentile (Q3): The value below which 75% of the data lie.
The interquartile range (IQR) is defined as:
IQR = Q3 − Q1
This gives an indication of how spread out the middle 50% of the data is.
Example (using pandas):
python
import pandas as pd
# Sample data
data = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
df = pd.Series(data)
# 25th, 50th and 75th percentiles (pandas interpolates linearly between neighbouring
# values, so the results differ slightly from the (N + 1) rank formula above)
percentiles = df.quantile([0.25, 0.50, 0.75])
print(percentiles)
Output:
0.25    16.25
0.50    27.50
0.75    38.75
dtype: float64
DATA MANIPULATION
Data manipulation refers to the process of adjusting, organizing, or modifying data to make it
more useful or appropriate for analysis, presentation, or decision-making. It involves cleaning,
transforming, and restructuring data in a way that makes it easier to analyze, interpret, and utilize
for different purposes, such as building machine learning models or reporting insights.
Data manipulation is a critical step in the data analysis pipeline, and it can be done using a
variety of techniques, depending on the type of data and the tools being used. Here, we will
explore common techniques and operations involved in data manipulation.
Common data cleaning and consistency operations include the following:
a. Standardizing Data Formats
Values such as dates are often stored as text in inconsistent ways; converting them to a proper
datetime type makes them directly comparable and sortable.
Python Example (Standardizing Dates)
python
import pandas as pd
# Sample DataFrame with order dates stored as text
data = {'OrderDate': ['11/01/2024', '10/15/2024', '09/30/2024', '11/02/2024']}
df = pd.DataFrame(data)
# Convert the text column to a standard datetime type
df['OrderDate'] = pd.to_datetime(df['OrderDate'])
print(df)
Output:
   OrderDate
0 2024-11-01
1 2024-10-15
2 2024-09-30
3 2024-11-02
b. Handling Duplicate Data
Identify and remove duplicate rows or records. You can define criteria (such as a unique
identifier) to determine which records are duplicates.
• Example: If a dataset has multiple entries for the same customer, you may need to keep
only one unique record for each customer.
Python Example (Removing Duplicates)
python
# Sample DataFrame with duplicates
data = {'CustomerID': [101, 102, 101, 104],
'Name': ['Alice', 'Bob', 'Alice', 'David'],
'Age': [25, 30, 25, 40]}
df = pd.DataFrame(data)
# Keep only the first occurrence of each fully duplicated row
df_unique = df.drop_duplicates()
print(df_unique)
Output:
CustomerID Name Age
0 101 Alice 25
1 102 Bob 30
3 104 David 40
c. Resolving Conflicting Values
When there are conflicting or contradictory values in the dataset, you may need to choose one
value based on business rules or preferences.
• Example: If a product has multiple prices, you might choose the highest price, the most
recent price, or the average price.
• Example: If an employee has multiple job titles listed, you may need to standardize them
to one title, based on the most authoritative source.
Python Example (Resolving Conflicts with Aggregation)
python
# Sample DataFrame with conflicting values
data = {'ProductID': [1, 1, 2, 2],
'Price': [100, 120, 200, 180],
'Date': ['2024-10-01', '2024-11-01', '2024-10-01', '2024-11-01']}
df = pd.DataFrame(data)
# Resolving conflict by taking the most recent price for each product
df['Date'] = pd.to_datetime(df['Date'])
df_resolved = df.sort_values('Date').drop_duplicates('ProductID', keep='last')
print(df_resolved)
Output:
ProductID Price Date
1 1 120 2024-11-01
3 2 180 2024-11-01
d. Handling Missing Data
Missing data can be addressed in various ways, including:
• Imputation: Filling missing values with statistical measures such as the mean, median,
or mode.
• Forward/Backward Fill: Filling missing data with the previous or next valid value.
• Removal: Dropping rows or columns that have missing values, if they are not crucial.
Python Example (Filling Missing Data)
python
# Sample DataFrame with missing values
data = {'CustomerID': [101, 102, 103, 104],
'Age': [25, 30, None, 40]}
df = pd.DataFrame(data)
# Fill the missing age with the mean of the observed ages
df['Age'] = df['Age'].fillna(df['Age'].mean()).round(2)
print(df)
Output:
CustomerID Age
0 101 25
1 102 30
2 103 31.67
3 104 40
e. Resolving Referential Inconsistencies
Referential inconsistencies arise when there is a mismatch between related datasets or records,
such as an invalid reference to a non-existent entity.
• Example: If an order refers to a customer that doesn’t exist, you may need to either
correct the customer ID or remove the order record.
Python Example (Resolving Referential Inconsistencies)
python
# Sample DataFrames: Orders and Customers
orders = {'OrderID': [1, 2, 3], 'CustomerID': [101, 102, 105]}
customers = {'CustomerID': [101, 102, 103], 'Name': ['Alice', 'Bob', 'Charlie']}
df_orders = pd.DataFrame(orders)
df_customers = pd.DataFrame(customers)
# Remove orders with invalid CustomerID (CustomerID = 105 does not exist)
df_valid_orders = df_orders[df_orders['CustomerID'].isin(df_customers['CustomerID'])]
print(df_valid_orders)
Output:
OrderID CustomerID
0 1 101
1 2 102
f. Using Data Validation Rules
Data validation rules enforce consistency in data entry by ensuring that values conform to
specified criteria (e.g., allowed value ranges, required fields, etc.); a small sketch follows the examples below.
• Example: Ensure that the "Age" column contains only values between 0 and 120.
• Example: Enforce that the "Email" field contains valid email addresses.
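A small pandas sketch of rule-based validation along the lines of the examples above; the column names and the simplified email pattern are assumptions for illustration, not a production-grade validator.
python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 130, 40, -3],
    'Email': ['alice@example.com', 'bob@example', 'carol@example.org', 'dave@example.com'],
})

# Rule 1: Age must lie between 0 and 120
valid_age = df['Age'].between(0, 120)

# Rule 2: Email must loosely match a user@domain.tld pattern (simplified check)
valid_email = df['Email'].str.match(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')

df['IsValid'] = valid_age & valid_email
print(df)   # rows violating either rule are marked False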
DATA TRANSFORMATIONS
Data transformation refers to the process of converting data from its original form into a format
that is more appropriate and useful for analysis, modeling, or other purposes. It is an essential
step in the data preprocessing pipeline, helping to make data cleaner, more consistent, and more
accessible for analysis or machine learning models. Transformations can involve various
operations, including normalization, scaling, encoding, aggregation, and others.
Below, we explore common types of data transformations and their applications.
a. Normalization (Min-Max Scaling)
Normalization rescales numeric features to a common range, typically [0, 1], so that no single
feature dominates because of its units.
Python Example (Min-Max Scaling):
python
from sklearn.preprocessing import MinMaxScaler
import numpy as np
# Sample data: a single numeric feature
data = np.array([[10], [20], [30], [40], [50]])
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)
print(normalized_data)  # values rescaled into [0, 1]
b. Log Transformation
Log transformation is used to reduce the effect of extreme values or outliers in highly skewed
data, especially when data spans several orders of magnitude.
• Log Transformation: The logarithmic function is applied to the data to make its
distribution more normal (less skewed).
o Formula: X_log = log(X), where X is the data point.
Python Example (Log Transformation):
python
import numpy as np
# Sample skewed data spanning several orders of magnitude
data = np.array([1, 10, 100, 1000, 10000])
log_transformed_data = np.log(data)
print(log_transformed_data)
c. Binning
Binning involves grouping continuous data into discrete bins or intervals. This technique is
useful when you want to reduce noise or make patterns in data more apparent.
• Equal Width Binning: The range of data is divided into equal-sized intervals.
• Equal Frequency Binning: The data is divided into bins that each contain an equal
number of data points.
Python Example (Equal Width Binning):
python
import pandas as pd
# Continuous ages binned into three equal-width intervals
ages = pd.Series([22, 35, 47, 51, 64, 78])
age_bins = pd.cut(ages, bins=3)
print(age_bins)
d. Encoding Categorical Variables (One-Hot Encoding)
Categorical values are converted into binary indicator columns so that models requiring
numeric input can use them.
Python Example (One-Hot Encoding):
python
# Sample categorical column
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue']})
one_hot_encoded = pd.get_dummies(df['Color'])
print(one_hot_encoded)
e. Aggregation
Aggregation involves summarizing data, such as calculating the sum, average, count, or other
statistics for groups of data. This is particularly useful for summarizing large datasets.
• Example: Aggregating sales data by summing up sales by region or computing the
average score by student.
Python Example (Aggregation using GroupBy):
python
import pandas as pd
# Sample sales records
sales = pd.DataFrame({'Region': ['North', 'South', 'North', 'South'],
                      'Sales': [100, 150, 200, 250]})
# Total sales per region
region_totals = sales.groupby('Region')['Sales'].sum()
print(region_totals)
Normalization Techniques
Normalization rescales numeric features so they are comparable; the most common approaches
are min-max scaling, z-score standardization, and robust scaling.
a. Min-Max Scaling
Min-max scaling maps each value into the range [0, 1] using X_norm = (X − X_min) / (X_max − X_min).
Python Example (Min-Max Scaling):
python
from sklearn.preprocessing import MinMaxScaler
import numpy as np
# Sample data (single feature)
data = np.array([[20], [25], [30], [35], [40], [50], [60]])
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)
print(normalized_data)
Output:
[[0.   ]
 [0.125]
 [0.25 ]
 [0.375]
 [0.5  ]
 [0.75 ]
 [1.   ]]
In this case, the data is scaled to the range [0, 1].
b. Z-Score Normalization (Standardization)
Z-score normalization (also called standardization) transforms the data to have a mean of 0 and
a standard deviation of 1. Unlike min-max scaling, it doesn't scale data to a fixed range, but
rather it centers the data.
• Formula:
Z = (X − μ) / σ
Where:
o Z is the normalized value (z-score).
o X is the original value.
o μ is the mean of the feature.
o σ is the standard deviation of the feature.
• Example: If the feature "Age" has a mean of 40 and a standard deviation of 10, a value of
30 would become:
Z = (30 − 40) / 10 = −1
This means that 30 is one standard deviation below the mean.
Python Example (Z-Score Normalization):
python
from sklearn.preprocessing import StandardScaler
import numpy as np
# Sample data: feature 'Age' with mean 40 and standard deviation 10
data = np.array([[25], [35], [40], [45], [55]])
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
print(standardized_data)
Output:
[[-1.5]
 [-0.5]
 [ 0. ]
 [ 0.5]
 [ 1.5]]
In this case, the data has been transformed to have a mean of 0 and a standard deviation of 1.
c. Robust Scaling
Robust scaling is another normalization technique that is robust to outliers. It scales the data
based on the median and the interquartile range (IQR) rather than the mean and standard
deviation. This makes it less sensitive to extreme outliers.
• Formula:
X_norm = (X − median) / IQR
Where:
o X is the original value.
o median is the median of the feature.
o IQR is the interquartile range (75th percentile − 25th percentile).
• Use case: This is particularly useful when the data contains significant outliers that could
skew the results of standard normalization techniques.
Python Example (Robust Scaling):
python
from sklearn.preprocessing import RobustScaler
import numpy as np
# Sample data containing one extreme outlier (1000)
data = np.array([[20], [25], [30], [35], [40], [1000]])
scaler = RobustScaler()
robust_scaled_data = scaler.fit_transform(data)
print(robust_scaled_data)
Output:
[[-1. ]
 [-0.6]
 [-0.2]
 [ 0.2]
 [ 0.6]
 [77.4]]
In this case, the outlier (1000) does not distort the scaling of the other values, because the
centering and spread are based on the median and IQR rather than the mean and standard deviation.
4. Standardization Example
Example 1: Standardizing a Single Feature
Let’s consider a feature "Age" with the following values: [25, 30, 35, 40, 45].
• Mean (μ) = (25 + 30 + 35 + 40 + 45) / 5 = 35
• Sample Standard Deviation (σ) = √[((25−35)² + (30−35)² + (35−35)² + (40−35)² + (45−35)²) / (5 − 1)]
≈ 7.91
Now, let's standardize the value 30:
Z = (30 − 35) / 7.91 ≈ −0.63
So, the value 30 becomes approximately −0.63 after standardization.
Example 2: Standardizing Multiple Features (Python Example)
python
from sklearn.preprocessing import StandardScaler
import numpy as np
# Two sample features: Age and Income
data = np.array([[25, 50000], [30, 60000], [35, 55000], [40, 70000], [45, 80000]])
# Initialize StandardScaler and standardize both columns at once
scaler = StandardScaler()
standardized = scaler.fit_transform(data)
print(standardized)  # each column now has mean 0 and standard deviation 1
Comparison: Normalization (Min-Max Scaling) vs. Standardization (Z-Score Normalization)
• Formula: Min-Max uses X_norm = (X − X_min) / (X_max − X_min); Z-score uses Z = (X − μ) / σ.
• Output Range: Min-Max typically scales values to the range [0, 1]; Z-score produces values
with Mean = 0 and Std. Dev. = 1.
• Effect of Outliers: Min-Max is less sensitive to outliers, but extreme values will be
compressed into the range; Z-score is sensitive to outliers, since outliers influence the mean
and standard deviation.
1. Bar Chart
• Use: To compare the quantities of different categories.
• Best For: Comparing discrete categories or groups.
• Types:
o Vertical Bar Chart: Used to show comparisons across different categories.
o Horizontal Bar Chart: Often used when category names are long or when there
is a need to emphasize the differences between categories.
Example: Comparing sales revenue across different products.
2. Line Chart
• Use: To display data trends over time.
• Best For: Showing the evolution or changes of data points over continuous intervals.
• Key Feature: Ideal for time series data, especially for tracking data over months, years,
or even days.
Example: Tracking website traffic over a period of months.
3. Pie Chart
• Use: To show the relative proportions or percentages of a whole.
• Best For: Illustrating parts of a whole or categorical data where the categories are few
and represent a significant proportion of the total.
• Key Feature: Not ideal when there are too many categories, as it becomes difficult to
distinguish slices.
Example: Market share distribution of different companies in an industry.
4. Scatter Plot
• Use: To show the relationship between two continuous variables.
• Best For: Identifying correlations, patterns, or trends between two variables.
• Key Feature: Each point represents a pair of values, allowing you to spot clusters, trends,
and outliers.
Example: Examining the relationship between advertising spend and sales performance.
5. Histogram
• Use: To represent the distribution of a single variable.
• Best For: Showing frequency distributions of numerical data.
• Key Feature: It is similar to a bar chart but represents continuous data divided into bins
(intervals).
Example: Showing the distribution of exam scores of a class (a minimal matplotlib sketch follows).
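As a quick illustration, the matplotlib sketch below plots a histogram of invented exam scores; the generated scores and the choice of 10 bins are assumptions made only for the example.
python
import numpy as np
import matplotlib.pyplot as plt

# Invented exam scores for a class of 100 students
rng = np.random.default_rng(0)
scores = rng.normal(loc=70, scale=10, size=100).clip(0, 100)

plt.hist(scores, bins=10, edgecolor='black')
plt.xlabel('Exam score')
plt.ylabel('Number of students')
plt.title('Distribution of exam scores')
plt.show()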
6. Heatmap
• Use: To visualize data in matrix form where the values are represented by varying colors.
• Best For: Showing patterns, correlations, or intensity in complex datasets.
• Key Feature: Color gradients are used to represent values; the warmer the color, the
higher the value.
Example: A heatmap showing the correlation between different products and customer
demographics (see the seaborn sketch below).
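A minimal correlation-heatmap sketch using seaborn; the three numeric columns are invented and serve only to show the color-encoded pattern described above.
python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Invented numeric data for three related business metrics
df = pd.DataFrame({
    'AdSpend': [10, 20, 30, 40, 50],
    'Sales':   [15, 25, 33, 48, 55],
    'Returns': [5, 4, 4, 3, 2],
})

# Color-encode the pairwise correlations; annot prints the value in each cell
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation heatmap')
plt.show()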
8. Area Chart
• Use: To show the cumulative totals over time.
• Best For: Visualizing how quantities accumulate over time or how they change relative
to other groups.
• Key Feature: It is a line chart with the area beneath the line filled with color.
Example: Showing the total revenue and costs over time.
9. Treemap
• Use: To display hierarchical (tree-structured) data using nested rectangles.
• Best For: Visualizing proportions within categories in a compact space.
• Key Feature: The area of each rectangle is proportional to the data value it represents,
which is useful for showing large, multi-level datasets.
Example: Displaying sales data by region, product, and subcategory.
UNIT – 5
UnitV: Business Intelligence Applications Marketing models: Relational marketing, Salesforce
management, Business case studies, supply chain optimization, optimization models for logistics
planning, revenue management system.
MARKETING MODELS
Marketing models are frameworks used to analyze, predict, and optimize marketing strategies
and activities. These models help businesses understand customer behavior, market dynamics,
and the effectiveness of marketing campaigns. By applying marketing models, companies can
make data-driven decisions, allocate resources efficiently, and ultimately improve their
marketing outcomes.
Here are several well-known marketing models:
2. AIDA Model
• Use: A framework to understand and optimize the stages customers go through before
making a purchase decision.
• Components:
o Attention: Attract the consumer’s attention.
o Interest: Raise interest by highlighting features and benefits.
o Desire: Create a desire for the product by focusing on its appeal.
o Action: Encourage the customer to take action, such as making a purchase.
Purpose: The AIDA model helps marketers craft effective advertising campaigns that guide
potential customers through these stages.
4. SWOT Analysis
• Use: A strategic planning tool used to identify the Strengths, Weaknesses,
Opportunities, and Threats related to a business or a specific marketing campaign.
• Components:
o Strengths: What does the company do well? (e.g., strong brand, excellent
customer service).
o Weaknesses: Where does the company fall short? (e.g., limited market presence,
poor online reviews).
o Opportunities: What external opportunities can be leveraged? (e.g., emerging
markets, technological advancements).
o Threats: What external factors could negatively affect the company? (e.g.,
competition, changing regulations).
Purpose: SWOT helps businesses identify their current position, evaluate external factors, and
devise strategies to capitalize on opportunities and mitigate threats.
6. The 7 Ps of Marketing
• Use: An extension of the 4 Ps that includes three additional elements for service-based
industries.
• Components:
o Product: What the business is offering to the market.
o Price: The pricing strategy used.
o Place: Distribution channels and access to customers.
o Promotion: Communication strategies to inform and persuade customers.
o People: All individuals who interact with customers (sales staff, customer
service).
o Process: The systems and processes involved in delivering the service.
o Physical Evidence: Tangible elements that help customers evaluate the service
(e.g., office location, website).
Purpose: This model is particularly useful for service industries, as it expands the marketing mix
to include elements that influence customer experience.
9. RACE Framework
• Use: A marketing model used to guide the digital marketing process across four key
stages.
• Components:
o Reach: Building awareness and attracting visitors.
o Act: Encouraging engagement and interaction (e.g., through content).
o Convert: Turning interactions into conversions (sales, sign-ups).
o Engage: Fostering customer loyalty and advocacy.
Purpose: RACE helps marketers plan, manage, and optimize their digital marketing strategies
through a structured, result-oriented approach.
RELATIONAL MARKETING
Relational marketing, also known as relationship marketing, focuses on building long-term,
mutually beneficial relationships with customers, rather than just focusing on short-term sales or
transactions. The goal of relational marketing is to foster loyalty, trust, and a deeper connection
between the business and its customers. By maintaining ongoing interactions and consistently
meeting or exceeding customer expectations, companies can enhance customer retention, which
is often more cost-effective than constantly acquiring new customers.
SALESFORCE MANAGEMENT
Salesforce management refers to the strategic approach and processes a company uses to
manage its sales team and customer relationships, often with the aid of technology (such as
Salesforce CRM). It involves overseeing and guiding the activities of sales personnel, optimizing
sales processes, tracking performance, and ensuring that the team meets or exceeds sales goals.
The ultimate aim is to drive sales efficiency, improve customer relationships, and boost revenue.
Salesforce management typically encompasses various aspects, including recruitment and
training of sales teams, defining sales goals and performance metrics, monitoring progress,
managing customer data, and using tools (like CRM systems) to streamline these processes.