BI 1-5
Business Intelligence Introduction - Effective and timely decisions – Data, information and
knowledge – Role of mathematical models – Business intelligence architectures: Cycle of a
business intelligence analysis – Enabling factors in business intelligence projects - Development
of a business intelligence system – Ethics and business intelligence, Types of Data, The measure
of Central Tendency, Measure of Spread, Standard Normal Distribution, Skewness, Measures of
relationship, Central Limit Theorem
Conclusion
Business Intelligence (BI) is a vital component of modern business strategies, enabling
organizations to leverage data for informed decision-making and competitive advantage. By
integrating data from multiple sources, cleansing and transforming it, and providing powerful
analytical and visualization tools, BI systems empower businesses to gain deep insights and
drive operational efficiencies. Despite challenges such as data quality, integration complexity,
and user adoption, the benefits of BI make it a crucial investment for organizations aiming to
thrive in a data-driven world.
Example: Effective and Timely Decisions in a Retail Business
Data Sources
• Point of Sale (POS) Systems: Transaction data, sales volume, product returns.
• Inventory Management Systems: Stock levels, reorder points, warehouse data.
• Customer Relationship Management (CRM): Customer preferences, purchase history.
• External Data Sources: Market trends, competitor pricing, seasonal factors.
ETL Process
1. Extract: Data is extracted from POS systems, inventory databases, CRM, and external
sources.
2. Transform: Data is cleaned (e.g., removing duplicates, correcting errors), standardized
(e.g., consistent date formats), and aggregated (e.g., total sales per product).
3. Load: Transformed data is loaded into the central data warehouse.
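As an illustration, the three ETL steps above can be sketched in a few lines of pandas; the file names, column names, and warehouse table below are hypothetical placeholders, not part of the original example.
import sqlite3
import pandas as pd

# Extract: read raw transaction data from a hypothetical POS export
pos = pd.read_csv("pos_transactions.csv")        # assumed columns: date, product, quantity, price

# Transform: clean, standardize, and aggregate
pos = pos.drop_duplicates()                      # remove duplicate transactions
pos["date"] = pd.to_datetime(pos["date"])        # consistent date format
sales_by_product = (pos.assign(revenue=pos["quantity"] * pos["price"])
                       .groupby("product", as_index=False)["revenue"].sum())

# Load: write the aggregated result into a warehouse table (SQLite as a stand-in)
with sqlite3.connect("warehouse.db") as conn:
    sales_by_product.to_sql("sales_summary", conn, if_exists="replace", index=False)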
Making Decisions
1. Inventory Management:
– Decision: Increase stock of fast-moving items before peak sales periods to
prevent stockouts.
– Timeliness: Adjust inventory orders in real-time based on sales data.
2. Marketing Campaigns:
– Decision: Launch targeted marketing campaigns for high-value customer
segments.
– Timeliness: Initiate campaigns promptly based on recent purchase trends and
customer behavior.
3. Pricing Strategy:
– Decision: Adjust prices dynamically in response to competitor pricing and market
demand.
– Timeliness: Implement price changes swiftly to capitalize on market
opportunities.
4. Operational Efficiency:
– Decision: Reallocate resources to high-performing stores and streamline
operations in underperforming locations.
– Timeliness: React to operational inefficiencies as they arise.
Conclusion
Effective and timely decisions are the backbone of a successful business strategy. By leveraging
Business Intelligence, organizations can ensure they make data-driven decisions that are
accurate, timely, comprehensive, relevant, and actionable. The example of a retail business
illustrates how BI can transform raw data into insights that drive inventory management,
marketing campaigns, pricing strategies, and operational efficiency, ultimately leading to better
business outcomes.
Information
Information is data that has been processed and organized to provide meaning. It is derived
from raw data and is used to answer specific questions or inform decisions.
Characteristics of Information
1. Processed Data:
– Information is obtained by processing raw data, which involves organizing,
structuring, and interpreting the data to give it meaning.
2. Contextual:
– Information is context-specific. The same data can provide different information
depending on the context in which it is used.
3. Useful:
– Information is actionable and useful for decision-making. It provides insights that
help in understanding situations or making decisions.
4. Timely:
– For information to be effective, it must be available at the right time. Timeliness
is a crucial attribute of valuable information.
5. Accurate:
– Accuracy is essential for information to be reliable. Inaccurate information can
lead to poor decisions.
6. Relevant:
– Information must be relevant to the specific needs of the user. Irrelevant
information, even if accurate, does not add value.
Example of Information
Consider a sales report generated from transaction data. The raw data might include individual
sales records with details such as date, product, quantity, and price. Processing this data into a
monthly sales summary by product category transforms it into useful information.
This sales summary is information that can inform decisions regarding inventory management,
marketing, and pricing strategies.
Knowledge
Knowledge is the understanding and awareness of information. It is created through the
interpretation and assimilation of information. Knowledge enables the application of
information to make informed decisions and take actions.
Characteristics of Knowledge
1. Understanding:
– Knowledge involves comprehending the meaning and implications of
information.
2. Experience-Based:
– Knowledge is often built on experience and expertise. It includes insights gained
from practical application and past experiences.
3. Contextual and Situational:
– Knowledge is deeply tied to specific contexts and situations. It is not just about
knowing facts but also understanding how to apply them.
4. Dynamic:
– Knowledge evolves over time as new information is acquired and new
experiences are gained.
5. Actionable:
– Knowledge is used to make decisions and take actions. It provides the foundation
for solving problems and innovating.
Types of Knowledge
1. Explicit Knowledge:
– Knowledge that can be easily articulated, documented, and shared. Examples
include manuals, documents, procedures, and reports.
2. Tacit Knowledge:
– Knowledge that is personal and context-specific, often difficult to formalize and
communicate. Examples include personal insights, intuitions, and experiences.
Example of Knowledge
Continuing with the sales report example, knowledge would be the understanding and insights
derived from the information. For instance, a manager might know from experience that a spike
in sales of "Widget A" typically occurs before a holiday season. This knowledge enables the
manager to increase inventory ahead of time to meet anticipated demand.
This insight is based on the manager's knowledge of sales patterns and their experience with
past sales cycles.
Conclusion
Understanding the distinction and relationship between information and knowledge is essential
for leveraging data effectively in any organization. Information provides the foundation for
knowledge, which in turn supports informed decision-making and strategic action. By
processing data into meaningful information and then interpreting that information to create
knowledge, businesses can enhance their operational efficiency, improve decision-making, and
gain a competitive edge.
Role of Mathematical Models
Mathematical models are essential tools in various fields, including science, engineering,
economics, and business, for understanding complex systems, predicting future behavior, and
optimizing processes. They provide a formal framework for describing relationships between
variables and can be used to simulate scenarios, analyze data, and support decision-making.
A retail business uses a demand forecasting model to predict future sales based on historical
sales data and other factors like seasonality and promotions.
Fitting a simple linear trend to the historical monthly sales, the model predicts increasing sales for the next six months, helping the business plan inventory and marketing strategies.
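A minimal sketch of such a forecasting model, assuming a small made-up monthly sales series and a straight-line trend fitted with scikit-learn:
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly sales for the past 12 months
months = np.arange(1, 13).reshape(-1, 1)
sales = np.array([200, 210, 205, 220, 230, 228, 240, 250, 255, 260, 270, 275])

model = LinearRegression().fit(months, sales)

# Forecast the next six months from the fitted linear trend
future_months = np.arange(13, 19).reshape(-1, 1)
print(np.round(model.predict(future_months), 1))   # rising trend -> plan inventory and promotions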
Conclusion
Mathematical models are powerful tools that play a critical role in various domains by enabling
prediction, optimization, and understanding of complex systems. They support decision-making
processes by providing a structured way to analyze data and simulate scenarios. Despite
challenges such as model complexity and data quality, the benefits of using mathematical
models in business, science, and engineering make them indispensable for informed and
effective decision-making.
2. Predictive Analytics
Predictive analytics uses mathematical models to forecast future events based on historical
data, enabling businesses to anticipate changes and plan accordingly.
3. Optimization and Decision Making
Mathematical models also support optimization and decision-making under uncertainty:
• Linear Programming: This is used for optimizing resource allocation, minimizing costs,
or maximizing profits in operations management.
• Simulation Models: These evaluate different scenarios to understand potential
outcomes, helping in strategic planning and risk management.
• Decision Analysis Models: Techniques such as decision trees and game theory help
make decisions under uncertainty by evaluating the outcomes of different choices.
4. Risk Management
Mathematical models are crucial in identifying, assessing, and mitigating risks. They help
quantify risks and predict their potential impacts on business operations.
• Value at Risk (VaR): A statistical technique used to measure the risk of loss on a portfolio
of assets.
• Monte Carlo Simulations: These simulations run multiple scenarios to evaluate the
probability of different outcomes, which is useful in financial risk assessment and project
management.
Conclusion
Mathematical models are integral to Business Intelligence as they enable organizations to
transform raw data into actionable insights. By leveraging these models, businesses can
enhance their decision-making processes, forecast future trends, optimize operations, and
effectively manage risks. The use of mathematical models thus leads to more informed, timely,
and strategic business decisions, ultimately contributing to improved business performance and
competitive advantage.
The diagram illustrates a typical Business Intelligence (BI) architecture, showcasing the flow of
data from various sources through ETL processes into a data warehouse, and subsequently to
different business functions for analysis and decision-making. Let's break down the components
and their interactions in detail:
1. Operational Systems (Internal Data):
• Day-to-day transactional systems such as POS, CRM, ERP, and inventory systems that generate the organization's internal data.
• Function: Provide the primary raw data that feeds the BI pipeline.
2. External Data:
• Data that comes from outside the organization. This can include social media data,
market research data, competitive analysis data, and other third-party data sources.
• Function: Enrich internal data with external insights, providing a more comprehensive
view of the business environment.
3. ETL Tools:
• ETL stands for Extract, Transform, Load. These tools are responsible for extracting data
from operational systems and external sources, transforming it into a suitable format,
and loading it into the data warehouse.
• Function: Ensure data quality, consistency, and integration from multiple sources.
Common ETL tools include Informatica, Talend, and Apache Nifi.
4. Data Warehouse:
• A centralized repository where integrated data from multiple sources is stored. The data
warehouse is optimized for query and analysis rather than transaction processing.
• Function: Store large volumes of historical data, enabling complex queries and data
analysis.
Business Intelligence (BI) Analytics
1. Definition
Analytics in the BI context refers to the methods and technologies used to analyze data and gain
insights. This includes statistical analysis, predictive modeling, and data mining.
2. Components of BI Analytics
• Data Warehousing: A centralized repository where data is stored, managed, and
retrieved for analysis.
• ETL (Extract, Transform, Load): Processes to extract data from various sources,
transform it into a suitable format, and load it into a data warehouse.
• Data Mining: Techniques to discover patterns and relationships in large datasets.
• Reporting: Tools to create structured reports and dashboards for data presentation.
• OLAP (Online Analytical Processing): Techniques to analyze multidimensional data from
multiple perspectives.
• Predictive Analytics: Using statistical algorithms and machine learning techniques to
predict future outcomes based on historical data.
4. BI Analytics Process
1. Data Collection: Gathering data from internal and external sources such as databases,
social media, CRM systems, and other data repositories.
2. Data Integration: Consolidating data from different sources to create a unified view.
3. Data Cleaning: Ensuring data quality by removing duplicates, handling missing values,
and correcting errors.
4. Data Analysis: Applying statistical and analytical methods to identify trends, patterns,
and insights.
5. Data Visualization: Presenting data in a graphical format using charts, graphs, and
dashboards to facilitate easy understanding.
6. Reporting: Generating reports to disseminate the insights to stakeholders for decision-
making.
5. Applications of BI Analytics
• Financial Analysis: Tracking financial performance, budgeting, and forecasting.
• Marketing Analysis: Analyzing customer data to identify trends, segment markets, and
measure campaign effectiveness.
• Sales Analysis: Monitoring sales performance, pipeline analysis, and sales forecasting.
• Operational Efficiency: Analyzing operational data to improve processes and reduce
costs.
• Customer Insights: Understanding customer behavior and preferences to enhance
customer satisfaction and loyalty.
6. Benefits of BI Analytics
• Improved Decision Making: Providing accurate and timely information for better
business decisions.
• Increased Efficiency: Streamlining operations and reducing costs through data-driven
insights.
• Competitive Advantage: Identifying market trends and opportunities to stay ahead of
competitors.
• Enhanced Customer Satisfaction: Personalizing customer experiences and improving
service quality.
• Risk Management: Identifying and mitigating risks through predictive analytics.
7. Challenges in BI Analytics
• Data Quality: Ensuring the accuracy and reliability of data.
• Data Integration: Consolidating data from disparate sources.
• Scalability: Managing large volumes of data efficiently.
• Security: Protecting sensitive data from unauthorized access and breaches.
• User Adoption: Encouraging stakeholders to embrace BI tools and processes.
Conclusion
Business Intelligence Analytics plays a critical role in modern enterprises by transforming raw
data into meaningful insights that drive strategic and operational decisions. By leveraging
advanced tools and techniques, organizations can gain a competitive edge, improve efficiency,
and enhance customer satisfaction. As technology evolves, the integration of AI and real-time
analytics will further revolutionize the field, making BI analytics an indispensable asset for
businesses.
The Business Intelligence (BI) Life Cycle
The cycle of a business intelligence analysis typically moves through four phases: analysis (framing the problem), insight (exploring data and models to understand it), decision (turning insight into concrete actions), and evaluation (measuring outcomes and feeding the results back into the next cycle).
This cyclical process ensures continuous improvement and adaptation of the BI system to meet
changing business needs. By following these steps, organizations can effectively leverage their
data to make informed decisions and drive business success.
Enabling Factors in Business Intelligence Projects
The main enabling factors of a BI project are generally grouped into technologies, analytics, and human resources. By focusing on these enabling factors, organizations can enhance the effectiveness and impact
of their BI projects, ensuring they deliver meaningful insights that drive business performance
and support strategic decision-making.
Ethics and Business Intelligence
Ethics in business intelligence (BI) involves applying ethical principles to the collection, analysis, and use of data to ensure that BI practices are responsible, fair, and transparent. Ethical considerations are crucial in BI to maintain trust, comply with regulations, and avoid harm to individuals and organizations. Key concerns include data privacy, informed consent, data security, transparency about how data and algorithms are used, and avoiding biased or discriminatory outcomes.
Standard Normal Distribution
The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1.
Key Characteristics
1. Mean: The mean (average) of the standard normal distribution is 0.
2. Standard Deviation: The standard deviation, which measures the spread of the data, is 1.
3. Symmetry: The distribution is perfectly symmetric around the mean.
4. Bell-Shaped Curve: The distribution has the characteristic bell-shaped curve of a normal
distribution.
5. Total Area Under the Curve: The total area under the curve is 1, which represents the
probability of all possible outcomes.
The Z-Score
• Definition: A Z-score represents the number of standard deviations a data point is from
the mean.
• Formula: ( Z = \frac{X - \mu}{\sigma} )
– (X) is the value in the dataset.
– (\mu) is the mean of the dataset.
– (\sigma) is the standard deviation of the dataset.
A Z-score indicates how many standard deviations an element is from the mean. For example, a
Z-score of 2 means the data point is 2 standard deviations above the mean.
Example Calculations
1. Finding Probabilities:
– Example: What is the probability that a Z-score is less than 1.5?
• Look up 1.5 in the Z-table. The corresponding cumulative probability is
approximately 0.9332.
• Therefore, (P(Z < 1.5) = 0.9332).
2. Using Z-scores for Percentiles:
– Example: What Z-score corresponds to the 90th percentile?
• Find the cumulative probability of 0.90 in the Z-table. The corresponding
Z-score is approximately 1.28.
• Therefore, the 90th percentile corresponds to a Z-score of 1.28.
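The same two lookups can be reproduced without a printed Z-table; a small sketch using scipy's standard normal distribution:
from scipy.stats import norm

# P(Z < 1.5): cumulative probability of the standard normal distribution
print(round(norm.cdf(1.5), 4))    # ~0.9332

# Z-score of the 90th percentile: inverse of the cumulative distribution
print(round(norm.ppf(0.90), 2))   # ~1.28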
Visualization
A standard normal distribution graph can help visualize these concepts. The mean (0) is at the
center of the bell curve, and the standard deviations (±1, ±2, ±3) mark the points along the
horizontal axis. The area under the curve between these points represents the probabilities
mentioned in the empirical rule.
By understanding and utilizing the standard normal distribution, statisticians and analysts can
make more informed decisions based on data, conduct meaningful comparisons, and draw
accurate inferences about populations from sample data.
Skewness
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random
variable about its mean. It indicates whether the data is skewed to the left (negative skewness),
to the right (positive skewness), or symmetrically distributed (zero skewness).
Types of Skewness
1. Negative Skewness (Left-Skewed)
– Description: The left tail is longer or fatter than the right tail.
– Characteristics: The majority of the data values lie to the right of the mean.
– Example: Income distribution in a high-income area where most people have
high incomes but a few have much lower incomes.
2. Positive Skewness (Right-Skewed)
– Description: The right tail is longer or fatter than the left tail.
– Characteristics: The majority of the data values lie to the left of the mean.
– Example: Age at retirement where most people retire at a similar age, but a few
retire much later.
3. Zero Skewness (Symmetrical)
– Description: The data is perfectly symmetrical around the mean.
– Characteristics: The mean, median, and mode are all equal.
– Example: Heights of adult men in a population where the distribution forms a bell
curve.
Measuring Skewness
The formula for (sample) skewness is:
[ \text{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3 ]
Where:
• ( n ) = number of observations
• ( x_i ) = each individual observation
• ( \bar{x} ) = mean of the observations
• ( s ) = standard deviation of the observations
Alternatively, skewness can also be measured using software tools and statistical packages
which provide skewness values directly.
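As noted above, statistical packages report skewness directly; a short sketch with scipy on an illustrative right-skewed sample:
import numpy as np
from scipy.stats import skew

# Most values are small, a few are much larger -> right (positive) skew
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 10, 15])

print(round(skew(data), 3))              # positive value indicates positive skewness
print(round(skew(data, bias=False), 3))  # bias-corrected sample skewness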
Measures of Relationship
Measures of relationship quantify the strength and direction of the association between two or
more variables. Key measures include covariance, correlation coefficients, and regression
analysis.
Covariance
• Description: Measures the directional relationship between two variables. It indicates
whether an increase in one variable corresponds to an increase (positive covariance) or
decrease (negative covariance) in another variable.
• Formula: [ \text{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) ]
Where ( X ) and ( Y ) are the two variables, ( \bar{X} ) and ( \bar{Y} ) are their means, and
( n ) is the number of data points.
• Interpretation:
– Positive covariance: Both variables tend to increase or decrease together.
– Negative covariance: One variable tends to increase when the other decreases.
– Zero covariance: No linear relationship between the variables.
Correlation Coefficient
• Description: Standardizes the measure of covariance to provide a dimensionless value
that indicates the strength and direction of the linear relationship between two variables.
• Formula: The Pearson correlation coefficient ( r ) is given by: [ r = \frac{\text{Cov}(X, Y)}{s_X s_Y} ]
Where ( s_X ) and ( s_Y ) are the standard deviations of ( X ) and ( Y ).
• Range: -1 to 1
– ( r = 1 ): Perfect positive linear relationship.
– ( r = -1 ): Perfect negative linear relationship.
– ( r = 0 ): No linear relationship.
• Interpretation:
– 0 < |r| < 0.3: Weak correlation.
– 0.3 < |r| < 0.7: Moderate correlation.
– 0.7 < |r| ≤ 1: Strong correlation.
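A brief sketch computing covariance and the Pearson correlation coefficient with numpy, using two illustrative variables (e.g., advertising spend and sales):
import numpy as np

x = np.array([10, 12, 15, 17, 20, 22])   # e.g., advertising spend
y = np.array([40, 44, 52, 58, 66, 70])   # e.g., sales

cov_xy = np.cov(x, y, ddof=1)[0, 1]      # sample covariance
r = np.corrcoef(x, y)[0, 1]              # Pearson correlation coefficient

print(round(cov_xy, 2), round(r, 3))     # positive covariance, |r| near 1 -> strong positive relationship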
Regression Analysis
• Description: Explores the relationship between a dependent variable and one or more
independent variables. It predicts the value of the dependent variable based on the
values of the independent variables.
• Types:
– Simple Linear Regression: Examines the relationship between two variables.
– Multiple Linear Regression: Examines the relationship between one dependent
variable and multiple independent variables.
• Model: For simple linear regression, the model is: [ Y = \beta_0 + \beta_1X + \epsilon ]
Where ( Y ) is the dependent variable, ( X ) is the independent variable, ( \beta_0 ) is the
intercept, ( \beta_1 ) is the slope, and ( \epsilon ) is the error term.
• Interpretation:
– ( \beta_1 ) indicates the change in ( Y ) for a one-unit change in ( X ).
– The coefficient of determination (( R^2 )) indicates the proportion of the variance
in the dependent variable that is predictable from the independent variable(s).
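Continuing the same illustrative data, a minimal simple linear regression sketch that reports the intercept ( \beta_0 ), the slope ( \beta_1 ), and ( R^2 ):
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([10, 12, 15, 17, 20, 22]).reshape(-1, 1)   # independent variable
y = np.array([40, 44, 52, 58, 66, 70])                  # dependent variable

model = LinearRegression().fit(x, y)

print(round(model.intercept_, 2))   # beta_0: predicted y when x = 0
print(round(model.coef_[0], 2))     # beta_1: change in y per one-unit change in x
print(round(model.score(x, y), 3))  # R^2: proportion of variance in y explained by x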
Understanding skewness and measures of relationship is crucial in data analysis as they provide
insights into the distribution and interdependencies of data, guiding more accurate and
meaningful interpretations and predictions.
Formulas
• Population Mean ((\mu)): The average of all the values in the population.
• Population Standard Deviation ((\sigma)): The measure of the spread of the population
values.
• Sample Mean ((\bar{X})): The average of the sample values.
• Standard Error ((\sigma_{\bar{X}})): The standard deviation of the sampling distribution
of the sample mean. [ \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} ]
Example
Imagine we have a population of test scores that is not normally distributed, with a mean score
of 70 and a standard deviation of 10. We take a sample of 50 students and calculate the sample
mean.
1. Sampling Distribution:
– According to the CLT, the distribution of the sample mean for these 50 students
will be approximately normal.
– The mean of the sampling distribution will be equal to the population mean, (\mu
= 70).
– The standard error will be: [ \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{50}} \approx 1.41 ]
2. Probability Calculation:
– We can now use the standard normal distribution to calculate probabilities. For
example, the probability that the sample mean is greater than 72:
• Convert to Z-score: [ Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{72 - 70}{1.41} \approx 1.42 ]
• Look up the Z-score in the standard normal table to find the probability.
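The probability in step 2 can be computed directly; a small sketch using scipy with the stated values (μ = 70, σ = 10, n = 50):
import numpy as np
from scipy.stats import norm

mu, sigma, n = 70, 10, 50
standard_error = sigma / np.sqrt(n)        # ~1.41

z = (72 - mu) / standard_error             # ~1.41 (the text's 1.42 comes from rounding the standard error)
p_greater = 1 - norm.cdf(z)                # P(sample mean > 72)

print(round(standard_error, 2), round(z, 2), round(p_greater, 4))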
Summary
The Central Limit Theorem is a powerful tool in statistics that allows us to make inferences
about population parameters using sample statistics, even when the population distribution is
not normal. By understanding and applying the CLT, we can perform a wide range of statistical
analyses, including confidence interval estimation and hypothesis testing, with greater accuracy
and confidence.
UNIT 2
This unit covers the basics of probability and related concepts, with detailed explanations and examples.
1. Definition of Probability
Probability is a measure of the likelihood that an event will occur. It is quantified as a number
between 0 and 1, where 0 indicates the impossibility of the event and 1 indicates certainty.
Example:
• Tossing a fair coin: The probability of getting heads (P(H)) is 0.5, and the probability of
getting tails (P(T)) is also 0.5.
2. Conditional Probability
Conditional Probability is the probability of an event occurring given that another event has
already occurred. It is denoted as P(A|B), which means the probability of event A occurring given
that B has occurred.
Example:
• Drawing two cards from a deck without replacement: Find the probability that the second card is a heart given that the first card was a heart. [ P(\text{Second card is heart}|\text{First card is heart}) = \frac{12}{51} = \frac{4}{17} ]
3. Independent Events
Independent Events are events where the occurrence of one event does not affect the
probability of the other. If A and B are independent, then: [ P(A \cap B) = P(A) \cdot P(B) ]
Example:
• Tossing two coins: The probability of getting heads on both coins (H1 and H2) is: [ P(H1 \cap H2) = P(H1) \cdot P(H2) = 0.5 \cdot 0.5 = 0.25 ]
4. Bayes' Rule
Bayes' Rule is used to find the probability of an event given prior knowledge of conditions that
might be related to the event. It is expressed as: [ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} ]
Example:
• Medical testing: Suppose 1% of people have a disease and the test is 99% accurate (99% sensitivity and 99% specificity). Find the probability of having the disease given a positive test result.
[ P(\text{Disease}|\text{Positive}) = \frac{P(\text{Positive}|\text{Disease}) \cdot P(\text{Disease})}{P(\text{Positive})} ]
[ P(\text{Positive}) = P(\text{Positive}|\text{Disease}) \cdot P(\text{Disease}) + P(\text{Positive}|\text{No Disease}) \cdot P(\text{No Disease}) ]
[ P(\text{Positive}) = 0.99 \cdot 0.01 + 0.01 \cdot 0.99 = 0.0198 ]
[ P(\text{Disease}|\text{Positive}) = \frac{0.99 \cdot 0.01}{0.0198} = 0.5 ]
5. Bernoulli Trials
A Bernoulli Trial is a random experiment where there are only two possible outcomes,
"success" and "failure". The probability of success is ( p ) and failure is ( 1-p ).
Example:
• Tossing a fair coin once: The probability of success (getting heads) is 0.5 and failure
(getting tails) is 0.5.
6. Random Variables
A Random Variable is a variable whose possible values are numerical outcomes of a random
phenomenon.
Example:
• Rolling a six-sided die: The random variable ( X ) can take values ( {1, 2, 3, 4, 5, 6} ).
7. Probability Mass Function (PMF)
A PMF gives the probability that a discrete random variable takes each particular value.
Example:
• For a fair die, the PMF ( P(X=x) ) is: [ P(X=x) = \frac{1}{6} \; \text{for} \; x \in \{1, 2, 3, 4, 5, 6\} ]
8. Probability Density Function (PDF)
A PDF describes the relative likelihood of a continuous random variable taking a given value.
Example:
• For a normal distribution with mean ( \mu ) and standard deviation ( \sigma ), the PDF is: [ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} ]
9. Cumulative Distribution Function (CDF)
A CDF gives the probability that a random variable is less than or equal to a given value.
Example:
• For a random variable ( X ) with a CDF ( F(x) ), it is: [ F(x) = P(X \leq x) ]
Additional Examples
• What is the probability of rolling a 4 on a fair six-sided die? [ P(X=4) = \frac{1}{6} \approx 0.167 ]
• If a card is drawn from a deck and it is known to be red, what is the probability that it is a heart? [ P(\text{Heart}|\text{Red}) = \frac{P(\text{Heart} \cap \text{Red})}{P(\text{Red})} = \frac{\frac{1}{4}}{\frac{1}{2}} = \frac{1}{2} ]
• What is the probability of rolling a 4 on one die and a 3 on another? [ P(4 \cap 3) = P(4) \cdot P(3) = \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{36} \approx 0.028 ]
• If 5% of people have a disease and the test is 98% accurate (98% sensitivity and 98% specificity), what is the probability a person has the disease given a positive test result?
[ P(\text{Disease}|\text{Positive}) = \frac{P(\text{Positive}|\text{Disease}) \cdot P(\text{Disease})}{P(\text{Positive})} ]
[ P(\text{Positive}) = P(\text{Positive}|\text{Disease}) \cdot P(\text{Disease}) + P(\text{Positive}|\text{No Disease}) \cdot P(\text{No Disease}) ]
[ P(\text{Positive}) = 0.98 \cdot 0.05 + 0.02 \cdot 0.95 = 0.068 ]
[ P(\text{Disease}|\text{Positive}) = \frac{0.98 \cdot 0.05}{0.068} \approx 0.721 ]
UNIT 3
Bayesian Analysis: Bayes Theorem and Its Applications
Bayesian Analysis is a statistical method that applies Bayes' Theorem to update the probability
of a hypothesis as more evidence or information becomes available. It is a fundamental
approach in statistics, machine learning, and data science.
Bayes' Theorem
Bayes' Theorem provides a mathematical formula for updating probabilities based on new
evidence. It is expressed as:
[ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} ]
where:
• ( P(A|B) ) is the posterior probability: the probability of hypothesis ( A ) given the evidence
( B ).
• ( P(B|A) ) is the likelihood: the probability of evidence ( B ) given that hypothesis ( A ) is
true.
• ( P(A) ) is the prior probability: the initial probability of hypothesis ( A ) before seeing the
evidence.
• ( P(B) ) is the marginal likelihood: the total probability of the evidence under all possible
hypotheses.
Components of Bayes' Theorem
• Prior Probability ((P(A))): Represents what is known about the hypothesis before
observing the new data.
• Likelihood ((P(B|A))): Measures how likely it is to observe the data given the hypothesis.
• Marginal Likelihood ((P(B))): Normalizes the result, ensuring that the probabilities sum
to one across all hypotheses.
Example: Suppose a diagnostic test for a disease has the following characteristics:
• ( P(D) = 0.01 ): The prior probability that a randomly chosen patient has the disease is 1%.
• ( P(T|D) = 0.9 ): The probability of the test being positive if the patient has the disease is
90%.
• ( P(T|\neg D) = 0.05 ): The probability of the test being positive if the patient does not
have the disease is 5%.
• ( P(\neg D) = 0.99 ): The probability that a randomly chosen patient does not have the
disease is 99%.
To find the posterior probability ( P(D|T) ), we need the marginal likelihood ( P(T) ):
[ P(T) = P(T|D) \cdot P(D) + P(T|\neg D) \cdot P(\neg D) ]
[ P(T) = (0.9 \cdot 0.01) + (0.05 \cdot 0.99) = 0.009 + 0.0495 = 0.0585 ]
Now we can apply Bayes' Theorem:
[ P(D|T) = \frac{P(T|D) \cdot P(D)}{P(T)} = \frac{0.9 \cdot 0.01}{0.0585} = \frac{0.009}{0.0585} \approx 0.1538 ]
So, given a positive test result, the probability that the patient has the disease is approximately
15.38%.
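A short sketch reproducing this calculation in Python with the probabilities given above:
p_d = 0.01      # prior P(D): patient has the disease
p_t_d = 0.90    # likelihood P(T|D): positive test given disease
p_t_nd = 0.05   # P(T|not D): positive test given no disease

# Marginal likelihood P(T) by the law of total probability
p_t = p_t_d * p_d + p_t_nd * (1 - p_d)

# Posterior P(D|T) by Bayes' Theorem
p_d_t = p_t_d * p_d / p_t
print(round(p_t, 4), round(p_d_t, 4))   # 0.0585 and ~0.1538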
Applications of Bayesian Analysis
1. Medical Diagnosis
• Example: Determining the probability of a disease given test results. Doctors use Bayes'
Theorem to update the likelihood of a disease based on symptoms and test outcomes.
2. Spam Filtering
• Example: Email classifiers use Bayesian analysis to determine the probability that an
email is spam based on the presence of certain words and phrases.
3. Machine Learning
• Example: Naive Bayes classifiers apply Bayes' Theorem to classify data points based on
the likelihood of feature occurrences. It is commonly used in text classification and
sentiment analysis.
4. Risk Assessment
• Example: Insurance companies use Bayesian models to update the probability of an
event (like an accident) based on new data (such as driving history).
5. Forecasting
• Example: Bayesian methods are used in weather forecasting to update predictions as
new weather data becomes available.
6. Decision Making
• Example: Businesses use Bayesian decision theory to update the probabilities of various
outcomes and make informed decisions under uncertainty.
7. Genetics
• Example: In genetic studies, Bayes' Theorem helps update the probability of an
individual carrying a genetic mutation based on family history and genetic testing results.
Conclusion
Bayesian Analysis, driven by Bayes' Theorem, is a powerful tool for updating probabilities based
on new evidence. Its applications span across diverse fields, offering a flexible and intuitive
approach to decision-making under uncertainty. Understanding and applying Bayes' Theorem
can significantly enhance predictive modeling, data analysis, and various practical applications.
Decision Theoretic Framework of Bayesian Analysis
1. Likelihood
Definition: Likelihood is a function that measures the probability of observing the given data
under various parameter values of a statistical model.
Example: In a coin toss, if we want to estimate the probability of heads ((\theta)), the likelihood
given 10 heads in 15 tosses would be computed using the binomial distribution.
2. Prior
Definition: The prior is the probability distribution representing our beliefs about the
parameters before observing any data.
Example: For the coin toss, if we believe the coin is fair, we might use a uniform prior
distribution for (\theta), meaning every value of (\theta) from 0 to 1 is equally likely.
3. Posterior
Definition: The posterior is the updated probability distribution of the parameters after
observing the data.
Example: Continuing with the coin toss, after observing the 10 heads in 15 tosses, the posterior
distribution of (\theta) would combine the prior distribution and the likelihood of the observed
data.
4. Loss Function
Definition: A loss function quantifies the cost associated with making errors in predictions or
decisions.
Types:
• 0-1 Loss: Assigns a loss of 1 for an incorrect decision and 0 for a correct one.
• Squared Error Loss: The loss is proportional to the square of the difference between the
estimated and actual values.
Example: In a medical diagnosis, the loss function might assign a higher cost to false negatives
(missed disease diagnosis) than to false positives.
5. Bayes Rule
Definition: Bayes Rule provides a way to update the probability estimate for a hypothesis as
more evidence or information becomes available.
Steps:
1. Start with a prior probability for the hypothesis.
2. Compute the likelihood of the new evidence under the hypothesis.
3. Apply Bayes Rule to combine the prior and the likelihood into the posterior probability.
Example: In a courtroom, jurors might use Bayes Rule implicitly to update their belief about a
defendant’s guilt as new evidence is presented.
6. One-Parameter Bayesian Models
Definition: One-parameter models involve a single unknown quantity, such as a proportion or a rate, estimated with a conjugate prior.
Common Models: Beta-Binomial (for proportions), Gamma-Poisson (for count rates), and Normal with known variance (for means).
Example: Estimating the probability of a coin landing heads ((\theta)) using a Beta distribution as
the prior.
Steps:
1. Specify Prior: Choose a prior distribution for the parameter (e.g., (\theta \sim Beta(\
alpha, \beta))).
2. Specify Likelihood: Define the likelihood based on observed data (e.g., (X \sim
Binomial(n, \theta))).
3. Compute Posterior: Use Bayes Rule to update the prior with the observed data to get the
posterior distribution.
Mathematical Example: With a uniform prior ( \theta \sim Beta(1, 1) ) and 10 heads observed in 15 tosses, the posterior is ( \theta \sim Beta(1 + 10, 1 + 5) = Beta(11, 6) ), with posterior mean ( \frac{11}{17} \approx 0.65 ).
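A minimal sketch of this conjugate update, assuming the uniform Beta(1, 1) prior and the 10-heads-in-15-tosses data from the earlier coin example:
from scipy.stats import beta

alpha_prior, beta_prior = 1, 1       # uniform prior over theta
heads, n = 10, 15                    # observed data

# Conjugate update: Beta prior + Binomial likelihood -> Beta posterior
posterior = beta(alpha_prior + heads, beta_prior + (n - heads))

print(round(posterior.mean(), 3))                        # ~0.647, posterior mean of theta
print([round(q, 3) for q in posterior.interval(0.95)])   # 95% credible interval for theta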
Summary
The Decision Theoretic framework of Bayesian Analysis provides a robust approach to updating
probabilities and making informed decisions based on new data. By leveraging concepts such as
likelihood, prior, posterior, loss function, and Bayes Rule, Bayesian methods allow for
continuous learning and adaptation. One-parameter Bayesian models offer a straightforward
yet powerful means of applying Bayesian reasoning to real-world problems. Understanding
these concepts is essential for effectively using Bayesian Analysis in various fields, from machine
learning and data science to medicine and finance.
Hierarchical Bayesian Models
Hierarchical (multi-level) Bayesian models place probability distributions on parameters at several levels, allowing information to be shared across groups.
Structure
1. Observed Data Level: The lowest level consists of the observed data and their likelihood
given the parameters.
2. Parameter Level: Parameters at this level are modeled as random variables with their
own distributions.
3. Hyperparameter Level: The parameters' distributions are governed by hyperparameters
at a higher level, which themselves can have prior distributions.
Advantages
• Borrowing Strength: Information is shared across groups, allowing for more robust
parameter estimation, especially with small sample sizes within groups.
• Flexibility: Can model complex dependencies and variations in multi-level data.
• Uncertainty Quantification: Provides a full probabilistic model that quantifies
uncertainty at all levels.
Bayesian Ridge Regression
Bayesian ridge regression places a Gaussian (normal) prior on the regression coefficients, which shrinks them toward zero in the same way as the classical ridge (L2) penalty.
Model Specification
1. Likelihood: Assume the data ( y ) follows a normal distribution given the predictors
( X ) and coefficients ( \beta ): [ y \sim \mathcal{N}(X\beta, \sigma^2) ]
2. Prior: Place a zero-mean Gaussian (ridge) prior on the coefficients: [ \beta \sim \mathcal{N}(0, \tau^2 I) ]
Using Bayes' Theorem, the posterior distribution of ( \beta ) is: [ P(\beta | y, X) \propto P(y | X, \beta) \, P(\beta) ]
This posterior combines the information from the data (likelihood) and the prior belief about the coefficients.
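A short sketch of this setup using scikit-learn's BayesianRidge, which implements a Gaussian (ridge-like) prior on the coefficients; the simulated data below is illustrative:
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_beta = np.array([1.5, -2.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.5, size=100)

model = BayesianRidge().fit(X, y)
print(np.round(model.coef_, 2))   # posterior mean of the coefficients, close to true_beta
print(round(model.alpha_, 2))     # estimated noise precision
print(round(model.lambda_, 2))    # estimated precision of the Gaussian prior on the coefficients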
Conclusion
Hierarchical Bayesian Models and Regression with a Ridge Prior are powerful techniques in
Bayesian Machine Learning. Hierarchical models allow for the modeling of complex, multi-level
data structures, while Bayesian Ridge Regression provides a robust method for dealing with
multicollinearity and overfitting in regression analysis. Both approaches leverage the principles
of Bayesian inference to enhance the flexibility, robustness, and interpretability of statistical
models.
Bayesian Logistic Regression
Bayesian logistic regression models a binary outcome with the logistic (sigmoid) link and places a prior distribution on the regression coefficients.
1. Prior Distribution
The choice of prior can reflect prior knowledge or beliefs about the parameters. Common choices include a Gaussian (normal) prior, which shrinks coefficients as in ridge regression, and a Laplace prior, which encourages sparsity as in the lasso.
2. Likelihood
The likelihood function represents the probability of the observed data given the parameters.
For logistic regression, the likelihood is:
[ P(\mathbf{y} | \mathbf{X}, \beta) = \prod_{i=1}^{n} \sigma(\mathbf{x}_i^\top \beta)^{y_i} (1 - \sigma(\mathbf{x}_i^\top \beta))^{1 - y_i} ]
where ( \mathbf{y} ) is the vector of observed binary outcomes, ( \mathbf{X} ) is the matrix of predictor variables, and ( \sigma(\cdot) ) is the logistic (sigmoid) function.
3. Posterior Distribution
The posterior distribution combines the prior distribution and the likelihood: [ P(\beta | \mathbf{y}, \mathbf{X}) \propto P(\mathbf{y} | \mathbf{X}, \beta) \, P(\beta) ]
Since the logistic function is not conjugate to the Gaussian or Laplace prior, the posterior
distribution does not have an analytical solution. Instead, we use approximation methods such
as:
• Markov Chain Monte Carlo (MCMC): A class of algorithms for sampling from the
posterior distribution.
• Variational Inference (VI): An approach that approximates the posterior distribution with
a simpler distribution by optimizing a lower bound on the marginal likelihood.
2. Likelihood Function: The likelihood of the observed binary outcomes follows the logistic model described above.
3. Posterior Distribution: The posterior distribution is proportional to the product of the likelihood and the prior.
4. Approximation Methods: Given the complexity of the posterior distribution, approximation methods such as MCMC or variational inference are used; the simulated data below illustrates the setup.
import numpy as np

# Simulated data
np.random.seed(42)
n_samples = 100
n_features = 2
X = np.random.randn(n_samples, n_features)
true_beta = np.array([1.0, -1.0])
logits = X @ true_beta
y = np.random.binomial(1, 1 / (1 + np.exp(-logits)))
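One way to approximate the posterior for the simulated data above is a simple random-walk Metropolis sampler; this is a minimal MCMC sketch assuming a standard normal prior on each coefficient (a production analysis would normally use a dedicated library such as PyMC).
def log_posterior(beta, X, y):
    # Log prior: independent standard normal on each coefficient
    log_prior = -0.5 * np.sum(beta ** 2)
    # Log likelihood of Bernoulli outcomes under the logistic model
    logits = X @ beta
    log_lik = np.sum(y * logits - np.log1p(np.exp(logits)))
    return log_prior + log_lik

def metropolis(X, y, n_iter=5000, step=0.1):
    beta = np.zeros(X.shape[1])
    current_lp = log_posterior(beta, X, y)
    samples = []
    for _ in range(n_iter):
        proposal = beta + step * np.random.randn(X.shape[1])
        proposal_lp = log_posterior(proposal, X, y)
        # Accept the proposal with probability min(1, posterior ratio)
        if np.log(np.random.rand()) < proposal_lp - current_lp:
            beta, current_lp = proposal, proposal_lp
        samples.append(beta.copy())
    return np.array(samples)

samples = metropolis(X, y)
print(samples[1000:].mean(axis=0))   # posterior mean after burn-in, close to the true [1.0, -1.0]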
Conclusion
Bayesian Logistic Regression offers a powerful framework for binary classification, combining
the strengths of logistic regression with the flexibility and robustness of Bayesian inference. It
allows for the incorporation of prior knowledge, regularization, and provides a principled way to
quantify uncertainty in the model parameters. Using approximation methods like MCMC and
variational inference makes Bayesian logistic regression practical for real-world applications.
UNIT 4
Data Warehousing (DW)‐ Introduction & Overview; Data Marts, DW architecture ‐ DW
components, Implementation options; Meta Data, Information delivery. ETL ‐ Data Extraction,
Data Transformation ‐ Conditioning, Scrubbing, Merging, etc., Data Loading, Data Staging, Data
Quality.
Key Components of a Data Warehouse
1. Data Sources: Operational systems (e.g., POS, CRM, ERP, inventory) and external sources that supply the raw data to be warehoused.
2. ETL (Extract, Transform, Load) Process: This is the process that moves data from source systems to the data warehouse. It involves:
– Extraction: Retrieving data from various source systems.
– Transformation: Cleaning, filtering, and converting the data into a suitable
format for analysis.
– Loading: Storing the transformed data into the data warehouse.
3. Data Warehouse Database: The central repository where the processed data is
stored. It is designed for query and analysis rather than transaction processing.
Common types of databases used for data warehousing include relational databases
and columnar databases.
4. Metadata: Data about the data stored in the warehouse. It helps in understanding,
managing, and using the data. Metadata includes definitions, mappings,
transformations, and lineage.
5. Data Marts: Subsets of data warehouses designed for specific business lines or
departments. Data marts can be dependent (sourced from the central data
warehouse) or independent (sourced directly from operational systems).
6. OLAP (Online Analytical Processing): Tools and technologies that enable users to
perform complex queries and analyses on the data stored in the warehouse. OLAP
systems support multidimensional analysis, allowing users to view data from
different perspectives.
Data Warehouse Architecture Layers
1. Data Source Layer: Includes all operational and external systems that provide raw data.
2. Data Staging Layer: A temporary area where data is extracted, transformed, and loaded.
This layer handles data cleaning, integration, and transformation.
3. Data Storage Layer: The central repository (data warehouse) where transformed data is
stored.
4. Data Presentation Layer: Includes data marts, OLAP cubes, and other structures that
organize data for end-user access.
5. Data Access Layer: Tools and applications (BI tools, reporting tools) that allow users to
access, analyze, and visualize data.
Conclusion
Data warehousing plays a critical role in modern data management and business intelligence. It
enables organizations to consolidate data from various sources, ensuring high-quality data is
available for decision-making. While it comes with challenges, the benefits of improved data
management, faster query performance, and enhanced analytical capabilities make it a valuable
asset for any data-driven organization.
Data Marts and Data Warehousing (DW) Architecture
Data Marts
Data Marts are specialized subsets of data warehouses designed to serve the specific needs of a
particular business line or department. They provide focused and optimized access to data
relevant to the users in that domain. Data marts can be dependent or independent:
1. Dependent Data Marts: Sourced from an existing data warehouse. They draw data from
the central repository and provide a departmental view.
2. Independent Data Marts: Created directly from source systems without relying on a
centralized data warehouse. They are often simpler but can lead to data silos.
Implementation Options
1. On-Premises Data Warehousing:
– Hardware and Infrastructure: Organizations maintain their own servers and
storage.
– Software: On-premises solutions like Oracle, Microsoft SQL Server, or IBM Db2.
– Customization and Control: High level of control over security, compliance, and
customization.
2. Cloud-Based Data Warehousing:
– Infrastructure as a Service (IaaS): Cloud providers offer virtual machines and
storage (e.g., AWS EC2).
– Platform as a Service (PaaS): Managed data warehousing services (e.g., Amazon
Redshift, Google BigQuery, Microsoft Azure Synapse).
– Scalability and Cost Efficiency: Pay-as-you-go model, easy scaling, and reduced
maintenance overhead.
3. Hybrid Data Warehousing:
– Combines on-premises and cloud-based solutions to leverage the benefits of
both environments.
– Enables gradual migration to the cloud and flexibility in data management.
Meta Data
Metadata in data warehousing is data about data. It includes:
1. Business Metadata:
– Definitions and descriptions of data elements.
– Business rules and data policies.
2. Technical Metadata:
– Data structure details (e.g., schemas, tables, columns).
– Data lineage and data flow mappings.
– Transformation logic and data quality rules.
3. Operational Metadata:
– ETL process details (e.g., job schedules, logs).
– System performance and usage metrics.
Metadata helps users understand, manage, and utilize the data effectively, ensuring data
governance and compliance.
Information Delivery
Information delivery involves presenting data to end-users in a way that supports decision-making. Key aspects include reports and dashboards, ad hoc queries, OLAP analysis, scheduled distribution of reports, and alerts or notifications for exceptional conditions.
Conclusion
Data warehousing provides a structured and efficient way to manage and analyze large volumes
of data from various sources. Understanding the architecture, components, and implementation
options is crucial for designing and maintaining a robust data warehousing solution. Metadata
and effective information delivery mechanisms further enhance the usability and value of the
data warehouse, enabling informed decision-making across the organization.
Data Transformation
Data Transformation is the second step in the ETL (Extract, Transform, Load) process. It
involves converting raw data into a format suitable for analysis by applying various operations
such as data conditioning, scrubbing, merging, and more. This step ensures that the data loaded
into the data warehouse is clean, consistent, and usable.
Data Loading
Data Loading is the process of transferring transformed data into the target data warehouse or
data mart. This step ensures that the data warehouse is updated with the latest information for
analysis.
Data Staging
Data Staging refers to the intermediate storage area where data is held temporarily during the
ETL process. This area is used for data extraction and transformation before the final loading
into the data warehouse.
Data Quality
Data Quality refers to the condition of the data based on factors such as accuracy,
completeness, reliability, and relevance. Ensuring high data quality is critical for effective
analysis and decision-making.
Conclusion
Data transformation is a critical phase in the ETL process, ensuring that data is clean, consistent,
and ready for analysis. While it brings significant benefits in terms of data quality and
integration, it also presents challenges that require careful planning and execution. Data loading
and staging further support the ETL process by efficiently transferring and temporarily storing
data. Ensuring high data quality is essential for reliable and accurate business intelligence and
decision-making. By employing best practices and robust tools, organizations can effectively
manage and transform their data to derive valuable insights.
Data Conditioning
Data conditioning prepares raw data for further processing through tasks such as parsing, handling missing values, standardizing formats, and normalization.
• Normalization:
– Example: Normalizing values between 0 and 1.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
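A minimal continuation of the imports above, normalizing an illustrative numeric column to the [0, 1] range:
# Illustrative data on an arbitrary scale
df = pd.DataFrame({'sales': [120, 450, 300, 90]})

scaler = MinMaxScaler()
df['sales_scaled'] = scaler.fit_transform(df[['sales']]).ravel()
print(df)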
Conclusion
Data conditioning is an essential step in the ETL process that ensures raw data is clean,
consistent, and in a format suitable for further processing and analysis. By performing tasks like
parsing, handling missing values, and standardizing data, organizations can significantly
improve the quality and usability of their data. While it poses certain challenges, effective data
conditioning is critical for successful data integration, analysis, and decision-making.
Data Scrubbing
Data Scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate
records from a dataset. This involves identifying incomplete, incorrect, inaccurate, or irrelevant
parts of the data and then replacing, modifying, or deleting this dirty data.
Steps in Data Scrubbing
1. Identifying Errors:
– Inconsistencies: Checking for discrepancies in data format or content.
– Missing Values: Detecting absent or null values in the dataset.
– Duplicate Records: Identifying and removing duplicate entries.
– Invalid Data: Recognizing out-of-range or illogical values.
2. Correcting Errors:
– Standardization: Converting data into a standard format (e.g., date formats,
measurement units).
– Normalization: Ensuring data is consistent across the dataset (e.g., all text in
lowercase).
– Imputation: Filling in missing values using techniques like mean, median, or
mode imputation.
– Validation: Applying rules to ensure data adheres to defined constraints (e.g.,
email format validation).
import numpy as np
import pandas as pd

# Sample dataset
data = {
'name': ['Alice', 'Bob', 'Charlie', None, 'Eve', 'Frank',
'Alice'],
'age': [25, np.nan, 30, 35, 40, -1, 25],
'email': ['alice@example.com', 'bob@example', 'charlie@abc.com',
'eve@example.com', None, 'frank@example.com', 'alice@example.com']
}
df = pd.DataFrame(data)
# Identify and handle missing values
df['name'].fillna('Unknown', inplace=True)
df['age'].replace([-1, np.nan], df['age'].median(), inplace=True)
df['email'].fillna('unknown@example.com', inplace=True)
# Remove duplicates
df.drop_duplicates(inplace=True)
print(df)
Data Merging
Data Merging involves combining data from multiple sources into a single, unified dataset. This
process is essential for creating a comprehensive view of information that supports analysis and
reporting.
import pandas as pd

# Sample datasets
data1 = {
'id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'David']
}
data2 = {
'id': [3, 4, 5, 6],
'age': [30, 35, 40, 45]
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Inner Join
merged_inner = pd.merge(df1, df2, on='id', how='inner')
print("Inner Join:\n", merged_inner)
Data Loading
Data Loading is the final step in the ETL process, where transformed and cleaned data is loaded
into the target data warehouse or data mart. This ensures the data is available for querying and
analysis.
Data Quality
Data Quality refers to the accuracy, completeness, reliability, and relevance of data. Ensuring
high data quality is essential for effective analysis and decision-making.
Conclusion
Data scrubbing and merging are vital components of the data transformation process within the
ETL framework. Scrubbing ensures that data is clean, accurate, and reliable, while merging
integrates data from various sources to provide a comprehensive dataset for analysis.
Understanding and effectively implementing these processes are crucial for maintaining high
data quality and enabling meaningful insights. Data loading, staging, and quality assurance
further support the ETL process by ensuring that the data warehouse contains accurate, timely,
and relevant information for analysis and reporting.
Program : B.E
Subject Name: Data Mining
Subject Code: CS-8003
Semester: 8th
CS-8003 Elective-V (2) Data Mining
-------------------
Unit-I
Introduction to Data Warehousing, need for developing a Data Warehouse, Data Warehouse systems and their components, Design of a Data Warehouse, Dimensions and Measures, Data Marts: Dependent Data Marts, Independent Data Marts & Distributed Data Marts, Conceptual Modeling of Data Warehouses: Star Schema, Snowflake Schema, Fact Constellations. Multidimensional Data Model & Aggregates.
Data Mining:
Introduction: Data mining is the process of analyzing large data sets to identify patterns and establish relationships that help solve problems through data analysis.
Data mining techniques are used in many research areas and major industries such as banking, retail, medicine, cybernetics, genetics, and marketing. When used correctly, data mining is a means to drive efficiencies and predict customer behavior, allowing a business to set itself apart from its competition through predictive analysis.
Data mining can be applied to any kind of stored data or information. Data mining is also known as data discovery and knowledge discovery.
Data Warehousing
A data warehouse is a relational database that is designed for query and analysis rather than for
transaction processing. A data warehouse combines data from multiple and usually varied sources into one comprehensive and easily manipulated database. It usually contains historical data derived from
transaction data, but it can include data from other sources. It separates analysis workload from
transaction workload and enables an organization to consolidate data from several sources. DW
is commonly used by companies to analyze trends over time.
A data warehouse environment typically includes an ETL solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
Data warehousing is typically used by larger companies analyzing large sets of data for enterprise purposes. The data warehouse architecture is based on a relational database system server that functions as the central store for informational data. Operational data and processing are kept completely separate from data warehouse processing. This central information store is surrounded by key components designed to make the entire environment functional and manageable for both the operational systems that feed it and the analysis tools that query it. It is mainly created to support the different analyses and queries that require extensive searching on a large scale.
Data Staging Area
As soon as data arrives in the data staging area, a set of ETL processes extracts it from the source systems, cleans it, and converts it into an integrated structure and format before it is loaded into the data warehouse. Transformation functions applied here include establishing defaults for missing data and accommodating changes in source data definitions.
Data Presentation Area
Data presentation area are the target physical machines on which the data warehouse data is
organized and stored for direct querying by end users, report writers and other applications. It’s
the place where cleaned, transformed data is stored in a dimensionally structured warehouse and
made available for analysis purpose.
Design Methods
Bottom-up design
In the bottom-up approach, design starts with the extraction of data from operational databases into the staging area, where it is processed and consolidated for specific business processes and loaded into data marts. The bottom-up approach reverses the positions of the data warehouse and the data marts: the data marts are built first and can then be integrated to create a comprehensive data warehouse, which makes the warehouse more of a virtual construct than a single physical store.
Top-down design
The data flow in the top-down OLAP environment begins with data extraction from the operational data sources. The top-down approach is designed using a normalized enterprise data model. Results can be obtained more quickly if it is implemented in iterations, but overall it is a time-consuming process and the risk of failure is relatively high.
Hybrid design
The hybrid approach aims to combine the speed and user orientation of the bottom-up approach with the enterprise-wide integration of the top-down approach. To consolidate these various data models, and
facilitate the extract transform load process, data warehouses often make use of an operational
data store, the information from which is parsed into the actual DW. The hybrid approach begins
with an ER diagram of the data mart and a gradual extension of the data marts to extend the
enterprise model in a consistent linear fashion. It will provide rapid development within an
enterprise architecture framework.
Dimensions and Measures
A data warehouse consists of dimensions and measures. Dimensional modeling is a logical design technique used for data warehouses, and dimensional models underpin most of the commercial OLAP products available today. For example, a time dimension could show the breakdown of sales by year, quarter, month, day, and hour.
Measures are numeric representations of a set of facts that have occurred. Examples of measures include amount of sales, number of credit hours, store profit percentage, sum of operating expenses, and number of past-due accounts. (In statistics, the most common measures of data dispersion are the range, the five-number summary based on quartiles, the inter-quartile range, and the standard deviation.)
Types of Dimensions
Conformed dimension
Junk dimension
Degenerate dimension
Role-playing dimension
Data Marts
A data mart is a specialized system that brings together the data needed for a department or
related applications. A data mart is a simple form of a data warehouse that is focused on a single
subject (or functional area), such as educational, sales, operations, collections, finance or
marketing data. The sources may contain internal operational systems, central data warehouse, or
external data. It is a small warehouse which is designed for the department level.
Three basic types of data marts are dependent, independent or stand-alone, and hybrid. The
categorization is based primarily on the data source that feeds the data mart.
Dependent data marts: Data comes from the central data warehouse, and the mart is created as a separate physical data store.
Independent data marts: Standalone systems built by drawing data directly from operational sources, external sources, or both. An independent data mart focuses exclusively on one subject area and has its own separate physical data store.
Hybrid data marts: Can draw data from both operational systems and data warehouses.
Conceptual Modeling of Data Warehouses
A conceptual data model identifies the highest-level relationships between the different entities. A concept hierarchy may also be defined by discretizing or grouping values for a given dimension or attribute, resulting in a set-grouping hierarchy. The most common schemas for conceptual modeling of data warehouses are:
➢ Star Schema
➢ Snowflake Schema
➢ Fact Constellation
Star Schema
• Consists of set of relations known as Dimension Table (DT) and Fact Table (FT)
• A single large central fact table and one table for each dimension.
• A fact table primary key is composition of set of foreign keys referencing dimension
tables.
• Every dimension table is related to one or more fact tables.
• Every fact points to one tuple in each of the dimensions and has additional attributes
• Does not capture hierarchies directly.
Snowflake Schema
• Dimension tables are normalized: dimension table data is split into additional tables. This can affect performance, because additional joins need to be performed.
• Query performance may be degraded because of the additional joins (delay in processing).
Fact Constellation:
• As its name implies, it is shaped like a constellation of stars (i.e. star schemas).
• Allow to share multiple fact tables with dimension tables.
• This schema is viewed as collection of stars hence called galaxy schema or fact
constellation.
• Solution is very flexible, however it may be hard to manage and support.
• Sophisticated application requires such schema.
Data warehouses are generally based on a multidimensional data model. The multidimensional data model provides a framework that is intuitive and efficient, allowing data to be viewed and analyzed at the desired level of detail with good performance. The model starts from an examination of the factors affecting decision-making processes; these are generally organization-specific facts such as sales, shipments, hospital admissions, and surgeries. One instance of a fact corresponds to an event that occurred: for example, every single sale or shipment carried out is an event. Each fact is described by the values of a set of relevant measures that provide a quantitative description of the event; for example, sales receipts, shipment amounts, and product cost are measures.
The multidimensional data model is an integral part of On-Line Analytical Processing, or OLAP.
Because OLAP is on-line, it must provide answers quickly; analysts pose iterative queries during
interactive sessions, not in batch jobs that run overnight. And because OLAP is also analytic, the
queries are complex. Dimension tables support changing the attributes of the dimension without
changing the underlying fact table. The multidimensional data model is designed to solve
complex queries in real time. The multidimensional data model is important because it enforces
simplicity.
Aggregates
In a data warehouse a huge amount of data is stored, which makes analysis difficult. This
is the basic reason why selection and aggregation are required to examine a specific part of the
data. Aggregations are the way information can be divided so that queries run against the
aggregated part rather than the whole set of data. They are pre-calculated summaries derived from
the most granular fact table. Aggregation is a process in which information is gathered and
expressed in a summary form, for purposes such as statistical analysis. A common aggregation
purpose is to get more information about particular groups based on specific variables such as
age, profession, or income. The information about such groups can then be used for web site
personalization. Tables are always changing along with the needs of the users, so it is important
to define the aggregations according to what summary tables might be of use.
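As a rough sketch of the idea (the column names and figures are made up for illustration), the snippet below pre-computes a summary table from a granular fact table so that later queries can hit the aggregate instead of the full detail.

```python
import pandas as pd

# Hypothetical granular fact data: one row per individual sale.
sales = pd.DataFrame({
    "product": ["A", "A", "B", "B", "A"],
    "month":   ["Jan", "Jan", "Jan", "Feb", "Feb"],
    "amount":  [100.0, 150.0, 80.0, 120.0, 90.0],
})

# Pre-calculated aggregate: total and average sales per product per month.
# Queries against this summary avoid scanning the whole fact table.
monthly_summary = (
    sales.groupby(["product", "month"], as_index=False)
         .agg(total_amount=("amount", "sum"), avg_amount=("amount", "mean"))
)
print(monthly_summary)
```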
*****
OLAP:
OLAP (Online Analytical Processing) is the technology that supports the multidimensional view of
data for many Business Intelligence (BI) applications. OLAP provides fast, consistent, and
efficient access and is a powerful technology for data discovery, including capabilities to handle
complex queries, analytical calculations, and predictive "what if" scenario planning.
OLAP is a category of software technology that enables analysts, managers, and executives to gain
insight into data through fast, consistent, interactive access to a wide variety of possible views of
information that has been transformed from raw data to reflect the real dimensionality of the
enterprise as understood by the user. OLAP enables end users to perform ad hoc analysis of data in
multiple dimensions, thereby providing the insight and understanding they need for better decision
making.
The need for more intensive decision support prompted the introduction of a new generation of tools,
generally used to analyze information where a huge amount of historical data is stored. Those new
tools, called online analytical processing (OLAP) tools, create an advanced data analysis environment
that supports decision making, business modeling, and operations research.
Multidimensional analysis is inherently representative of an actual business model. The most
distinctive characteristic of modern OLAP tools is their capacity for multidimensional analysis (for
example, actual vs. budget). In multidimensional analysis, data are processed and viewed as part of a
multidimensional structure.
For efficient decision support, OLAP tools must have advanced data access features:
• Access to many different kinds of DBMSs, flat files, and internal and external data sources.
• Access to aggregated data warehouse data as well as to the detail data found in operational
databases.
• Advanced data navigation features such as drill-down and roll-up.
• Rapid and consistent query response times.
• The ability to map end-user requests, expressed in either business or model terms, to the
appropriate data source and then to the proper data access language (usually SQL).
• Support for very large databases. As already explained, the data warehouse can easily and
quickly grow to multiple gigabytes and even terabytes.
Advanced OLAP features become more useful when access to them is kept simple. OLAP tools have
equipped their sophisticated data extraction and analysis tools with easy-to-use graphical interfaces.
Many of the interface features are “borrowed” from previous generations of data analysis tools that
are already familiar to end users. This familiarity makes OLAP easily accepted and readily used.
4. Client/Server Architecture:
The system conforms to the principles of client/server architecture, providing a framework within
which new systems can be designed, developed, and implemented. The client/server environment
enables an OLAP system to be divided into several components that define its architecture. Those
components can then be placed on the same computer, or they can be distributed among several
computers. Thus, OLAP is designed to meet ease-of-use requirements while keeping the system
flexible.
I). Understanding and improving sales: For an enterprise that has many products and uses a number
of channels for selling the products, OLAP can assist in finding the most popular products and the
most popular channels. In some cases it may be possible to find the most profitable customers.
Multidimensional Views
The ability to quickly switch between one slice of data and another allows users to analyze their
information in small, palatable chunks instead of one giant, confusing report.
Multidimensional views mean looking at data in several dimensions; for example, sales by region,
sales by sales rep, sales by product category, sales by month, and so on. Such capability is provided
in numerous decision support applications under various function names. The multidimensional
approach recognizes that time is an important dimension and that time can have many different
attributes. For example, in a spreadsheet or database, a pivot table provides these views and
enables quick switching between them.
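As a rough illustration of switching between views (data and names are hypothetical), a pandas pivot table can present the same sales data by region and month, then by sales rep, without touching the underlying records.

```python
import pandas as pd

# Hypothetical sales records carrying several dimensions.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "rep":    ["Ann", "Bob", "Ann", "Bob"],
    "month":  ["Jan", "Feb", "Jan", "Feb"],
    "amount": [200.0, 150.0, 300.0, 250.0],
})

# One view: sales by region and month.
by_region_month = sales.pivot_table(index="region", columns="month",
                                    values="amount", aggfunc="sum")

# Switch views: the same data summarized by sales rep instead.
by_rep = sales.pivot_table(index="rep", values="amount", aggfunc="sum")
```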
Data Cube:
Users of decision support systems often see data in the form of data cubes. The cube is used to
represent data along some measure of interest. Although called a "cube", it can be 2-dimensional, 3-
dimensional, or higher-dimensional. Each dimension represents some attribute in the database and the
cells in the data cube represent the measure of interest. A data cube allows data to be modeled and
viewed in multiple dimensions; it is defined by dimensions and facts. For example, the cells could
contain a count of the number of times that attribute combination occurs in the database, or the
minimum, maximum, sum, or average value of some attribute. Queries are performed on the cube to
retrieve decision support information.
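As a minimal sketch (the dimensions and figures are invented), a data cube can be pictured as a multidimensional array whose axes are the dimensions and whose cells hold the measure; summing along an axis collapses a dimension.

```python
import numpy as np

# Toy data cube: axes are (product, store, month) and each cell holds the
# measure of interest, e.g. units sold. Values are made up for illustration.
cube = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # 2 products x 3 stores x 4 months

# Total units per product, summed over stores and months (collapse axes 1 and 2).
per_product = cube.sum(axis=(1, 2))

# Units per store and month, summed over products (collapse axis 0).
per_store_month = cube.sum(axis=0)
```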
Multidimensional Data Cube: Most OLAP products are developed based on a structure in which the
cube is patterned as a multidimensional array. These multidimensional OLAP (MOLAP) products
usually offer improved performance compared with other approaches, mainly because the structure of
the data cube can be indexed directly to gather subsets of data.
Relational OLAP: Relational OLAP stores no pre-computed result sets; it makes use of the
relational database model. The ROLAP data cube is implemented as a collection of relational tables
(approximately twice as many as the number of dimensions) rather than as a multidimensional array.
ROLAP supports OLAP analyses against large volumes of input data. Each of these tables,
known as a cuboid, represents a particular view.
Roll up
The roll-up operation (also called drill-up or the aggregation operation) performs aggregation on a
data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction
(removing one or more dimensions). Roll-up is best explained with an example:
Consider a cube illustrating the temperature of certain days, recorded weekly. Assume we want to
set up levels (hot (80-85), mild (70-75), cold (64-69)) for temperature in this cube. To do this we
group the columns and add up the values according to the concept hierarchy; this operation is
called roll-up, and it yields a coarser cube. The concept hierarchy climbs from individual
temperature readings to the levels cold, mild, and hot, so the roll-up operation groups the data
by levels of temperature.
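A rough sketch of this roll-up, using made-up daily readings: the raw temperatures are binned into the coarser levels cold / mild / hot and then aggregated per week.

```python
import pandas as pd

# Hypothetical daily temperature readings, tagged with the week they fall in.
temps = pd.DataFrame({
    "week":        [1, 1, 1, 2, 2, 2],
    "temperature": [65, 72, 83, 68, 74, 81],
})

# Climb the concept hierarchy: map raw readings to the levels cold/mild/hot.
temps["level"] = pd.cut(temps["temperature"],
                        bins=[63, 69, 75, 85],
                        labels=["cold", "mild", "hot"])

# Roll up: count the days per week at each (coarser) temperature level.
rolled_up = temps.groupby(["week", "level"], observed=True).size()
```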
Roll Down
The roll-down operation (also called drill-down) is the reverse of roll-up. It navigates from less
detailed data to more detailed data. It can be realized either by stepping down a concept hierarchy
for a dimension or by introducing additional dimensions. Drill-down adds more detail to the given
data. The result of a drill-down operation performed on the central cube by stepping down the
concept hierarchy for time (week to day) is a more detailed cube: drill-down occurs by descending
the time hierarchy from the level of week to the more detailed level of day. New dimensions can
also be added to the cube, because drill-down adds more detail to the given data.
Slicing
A slice is a subset of a multidimensional array corresponding to a single value for one or more
members of the dimensions. Slice performs a selection on one dimension of the given cube, thus
resulting in a subcube. For example, in the cube example above, making the selection
temperature = cool yields a subcube containing only the cool readings.
Dicing
A related operation to slicing is dicing. The dice operation defines a subcube by performing a
selection on two or more dimensions. For example, applying the selection (time = day 3 OR time =
day 4) AND (temperature = cool OR temperature = hot) to the original cube produces a smaller
subcube (still two-dimensional). Dicing provides the smallest available slice of the cube for the
chosen member values.
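A small sketch of both operations on a flattened, made-up cube: slicing filters on a single dimension, dicing on two or more.

```python
import pandas as pd

# Hypothetical cube flattened into a table: dimensions (day, level),
# measure (reading count).
cube = pd.DataFrame({
    "day":   [1, 2, 3, 4, 3, 4],
    "level": ["cool", "cool", "cool", "cool", "hot", "hot"],
    "count": [2, 1, 3, 2, 1, 4],
})

# Slice: a selection on one dimension (temperature level == "cool") -> subcube.
slice_cool = cube[cube["level"] == "cool"]

# Dice: a selection on two or more dimensions, as in the example above.
diced = cube[cube["day"].isin([3, 4]) & cube["level"].isin(["cool", "hot"])]
```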
Pivot/Rotate
Pivot or rotate is a visualization operation that rotates the data axes in the view in order to provide
an alternative presentation of the data. Rotating changes the dimensional orientation of the cube,
i.e. it rotates the data axes so the data can be viewed from different perspectives. Pivot regroups
the data along different dimensions. A 2-D representation of pivot is sketched below.
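The sketch below (locations and figures are invented) rotates a small 2-D view by swapping its row and column axes.

```python
import pandas as pd

# A 2-D view of a cube: rows are locations, columns are quarters (made-up data).
view = pd.DataFrame({"Q1": [10, 20], "Q2": [15, 25]},
                    index=["Location_A", "Location_B"])

# Pivot / rotate: swap the row and column axes so the same data is presented
# from a different orientation (rows become quarters, columns become locations).
rotated = view.T
```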
SCOPING: Restricting the view of database objects to a specified subset is called scoping. Scoping
allows users to receive and update only those data values they wish to receive and update.
SCREENING: Screening is performed against the data or the members of a dimension in order to
restrict the set of data retrieved.
DRILL ACROSS: Accesses more than one fact table that is linked by common dimensions. It
combines cubes that share one or more dimensions (a minimal sketch follows these definitions).
DRILL THROUGH: Drills down from the bottom level of a data cube through to its back-end
relational tables.
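A minimal sketch of drill-across (the fact tables and columns are hypothetical): two fact tables that share conformed dimensions are joined on those dimensions so their measures can be analyzed together.

```python
import pandas as pd

# Two hypothetical fact tables sharing the product and month dimensions.
sales = pd.DataFrame({"product": ["A", "B"], "month": ["Jan", "Jan"],
                      "units_sold": [100, 60]})
inventory = pd.DataFrame({"product": ["A", "B"], "month": ["Jan", "Jan"],
                          "units_on_hand": [40, 90]})

# Drill across: join on the common dimensions to combine measures
# from both fact tables in one result.
combined = sales.merge(inventory, on=["product", "month"])
```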
Following are a number of guidelines for the successful implementation of OLAP. The guidelines are
somewhat similar to those presented for data warehouse implementation.
1. Vision: The OLAP team must, in consultation with the users, develop a clear vision for the OLAP
system. This vision, including the business objectives, should be clearly defined, understood, and
shared by the stakeholders.
3. Selecting an OLAP tool: The OLAP team should familiarize themselves with the ROLAP and
MOLAP tools available in the market. Since tools are quite different, careful planning may be
required in selecting a tool that is appropriate for the enterprise. In some situations, a combination of
ROLAP and MOLAP may be most effective.
4. Corporate strategy: The OLAP strategy should fit in with the enterprise strategy and business
objectives. A good fit will result in the OLAP tools being used more widely.
5. Focus on the users: The OLAP project should be focused on the users. Users should, in
consultation with the technical professional, decide what tasks will be done first and what will be
done later. Attempts should be made to provide each user with a tool suitable for that person’s skill
level and information needs. A good GUI user interface should be provided to non-technical users.
The project can only be successful with the full support of the users.
6. Joint management: The OLAP project must be managed by both the IT and business professionals.
Many other people should be involved in supplying ideas. An appropriate committee structure may be
necessary to channel these ideas.
7. Review and adapt: As noted in the last chapter, organizations evolve and so must their OLAP systems.
Regular reviews of the project may be required to ensure that the project is meeting the current needs
of the enterprise.
OLTP (Online Transaction Processing) handles high transaction volumes, a few records at a time,
and highly volatile data. It is characterized by a large number of short online transactions
(INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is very fast query processing,
maintaining data integrity in multi-access environments, and effectiveness measured by the number
of transactions per second. An OLTP database holds detailed, current data, and the schema used to
store transactional data is the entity model (usually 3NF). It uses complex database designs
maintained by IT personnel.
OLAP, in contrast, has low transaction volumes that touch many records at a time. It is
characterized by a relatively low volume of transactions, and queries are often very complex and
involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP
applications are widely used together with data mining techniques. An OLAP database holds
aggregated, historical data, stored in multidimensional schemas (usually the star schema).
The following summarizes the major differences between OLTP and OLAP system design.
Source of data: OLTP uses operational data (OLTP systems are the original source of the data);
OLAP uses consolidated data (OLAP data comes from the various OLTP databases).
Purpose of data: OLTP is used to control and run fundamental business tasks; OLAP helps with
planning, problem solving, and decision support.
Inserts and updates: OLTP has short and fast inserts and updates initiated by end users; in OLAP,
periodic long-running batch jobs refresh the data.
OLAP Servers
An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It
allows managers and analysts to gain insight into the information through fast, consistent, and
interactive access to it.
Relational OLAP
ROLAP servers are placed between relational back-end server and client front-end tools. To store and
manage warehouse data, ROLAP uses relational or extended-relational DBMS.
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views of data.
Hybrid OLAP
Hybrid OLAP technologies attempt to combine the advantages of MOLAP and ROLAP, offering the
higher scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow large
volumes of detailed information to be stored, while the aggregations are stored separately in a
MOLAP store.
Specialized SQL servers provide advanced query language and query processing support for SQL
queries over star and snowflake schemas in a read-only environment.