Unit II
Introduction to Probability
• One of the primary objectives in analytics is to measure the
uncertainty associated with an event or key performance indicator.
• Probability theory is the foundation on which descriptive and
predictive analytics models are built.
Probability Theory Terminologies
Random Experiment
• Random experiment is an experiment in which the outcome is not
known with certainty.
• That is, the output of a random experiment cannot be predicted with
certainty.
• Predictive analytics mainly deals with random experiments such as
predicting the quarterly revenue of an organization, customer churn
(whether a customer is likely to churn or how many customers are
likely to churn before next quarter), demand for a product at a future
time period, the number of views for a YouTube video, the outcome of a
football match (win, draw or lose), etc.
Sample Space
• Sample space is the universal set that consists of all possible
outcomes of an experiment.
• Sample space is usually represented using the letter ‘S’ and individual
outcomes are called the elementary events.
• The sample space can be finite or infinite.
Event
• Event (E) is a subset of a sample space and probability is usually
calculated with respect to an event.
• An event can be represented using a Venn diagram.
Probability Estimation using Relative
Frequency
Algebra of Events
• Assume that X, Y and Z are three events of a sample space. Then the
following algebraic relationships are valid and are useful while
deriving probabilities of events:
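The relationships themselves are not reproduced in this text. As a hedged illustration, the sketch below checks a few standard set identities (commutative, associative, distributive and De Morgan's laws) on a hypothetical sample space. The sample space S (faces of a die) and the events X, Y and Z are assumptions chosen only for the example.

```python
# Hypothetical sample space and events (assumptions for illustration only).
S = set(range(1, 7))          # faces of a six-sided die
X = {1, 2, 3}
Y = {2, 4, 6}
Z = {3, 4, 5}

def complement(event, space=S):
    """Complement of an event relative to the sample space."""
    return space - event

identities = {
    "commutative": (X | Y) == (Y | X) and (X & Y) == (Y & X),
    "associative": ((X | Y) | Z) == (X | (Y | Z)),
    "distributive": (X | (Y & Z)) == ((X | Y) & (X | Z)),
    "de_morgan_union": complement(X | Y) == (complement(X) & complement(Y)),
    "de_morgan_intersection": complement(X & Y) == (complement(X) | complement(Y)),
}

for name, holds in identities.items():
    print(name, holds)
```

Checking identities on one example does not prove them, but it is a quick way to catch a misremembered rule before using it in a derivation.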
FUNDAMENTAL CONCEPTS IN PROBABILITY –
AXIOMS OF PROBABILITY
• In 1933, Andrey Kolmogorov, a Russian mathematician, laid the
foundation of the axiomatic theory of probability (Kolmogorov, 1956).
• According to axiomatic theory of probability, the probability of an
event E satisfies the following axioms:
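The axioms are not reproduced in this text. In their standard form (stated here from general probability theory, not recovered from the slide), for a sample space S they read:

```latex
\begin{align*}
&\text{1. } P(E) \ge 0 \text{ for every event } E \\
&\text{2. } P(S) = 1 \\
&\text{3. } P(E_1 \cup E_2 \cup \cdots) = P(E_1) + P(E_2) + \cdots
  \quad \text{for mutually exclusive events } E_1, E_2, \ldots
\end{align*}
```

Taken together, the three axioms imply that every probability lies between 0 and 1.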
• Using the aforementioned axioms of probability, one can derive
several mathematical relationships on probability of events using set
theory logic.
• The following elementary rules of probability are directly deduced
from the original three axioms of probability, using the set theory
relationships
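The rules themselves are not reproduced in this text. As a sketch, two rules that follow from the axioms, the complement rule and the addition rule, can be checked on a small example. The fair die and the events A and B below are hypothetical choices, not taken from the slides.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # fair die, all outcomes equally likely (assumption)

def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event & S), len(S))

A = {2, 4, 6}            # even face
B = {4, 5, 6}            # face greater than 3

# Complement rule: P(A') = 1 - P(A)
complement_ok = prob(S - A) == 1 - prob(A)

# Addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
addition_ok = prob(A | B) == prob(A) + prob(B) - prob(A & B)

print(prob(A | B), complement_ok, addition_ok)
```

Using `Fraction` keeps the probabilities exact, so the equality checks are not affected by floating-point rounding.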
Joint Probability
• Let A and B be two events in a sample space. Then the joint
probability of the two events, written as P(A ∩ B), is given by
Marginal Probability
• Marginal probability is simply a probability of an event X, denoted by
P(X), without any conditions.
Independent Events
• Two events A and B are said to be independent when occurrence of
one event (say event A) does not affect the probability of occurrence
of the other event (event B).
• Mathematically, two events A and B are independent when
P(A ∩ B) = P(A) × P(B).
Conditional Probability
• If A and B are events in a sample space, then the conditional
probability of the event B given that the event A has already
occurred, denoted by P(B|A), is defined as
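The standard definition is P(B|A) = P(A ∩ B) / P(A), provided P(A) > 0. The sketch below estimates the marginal, joint and conditional probabilities from a small set of records and then applies the independence check P(A ∩ B) = P(A) × P(B). The ten customer records are hypothetical data invented for the illustration.

```python
from fractions import Fraction

# Hypothetical records: (bought_A, bought_B) for ten customers.
records = [
    (True, True), (True, False), (True, True), (False, True), (False, False),
    (True, True), (False, False), (True, False), (False, True), (False, False),
]
n = len(records)

p_a = Fraction(sum(a for a, _ in records), n)              # marginal P(A)
p_b = Fraction(sum(b for _, b in records), n)              # marginal P(B)
p_a_and_b = Fraction(sum(a and b for a, b in records), n)  # joint P(A ∩ B)

# Conditional probability, defined only when P(A) > 0.
p_b_given_a = p_a_and_b / p_a

# Independence check: does P(A ∩ B) equal P(A) × P(B)?
independent = p_a_and_b == p_a * p_b

print(p_a, p_b, p_a_and_b, p_b_given_a, independent)
```

In this made-up data P(B|A) = 3/5 differs from P(B) = 1/2, so the two events are not independent.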
APPLICATION OF SIMPLE PROBABILITY RULES
– ASSOCIATION RULE LEARNING
• In general, association rule learning (also known as association rule
mining) is a method of finding association between different entities
in a database.
• The strength of association between two mutually exclusive subsets
can be measured using ‘support’, ‘confidence’, and ‘lift’.
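The definitions of the three measures are not reproduced in this text. A minimal sketch under the commonly used definitions: support of X → Y is the share of transactions containing both X and Y, confidence is the estimate of P(Y|X), and lift is confidence divided by the support of Y. The transactions and item names are hypothetical.

```python
from fractions import Fraction

# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(x, y):
    """Estimated P(Y | X): support of X and Y together divided by support of X."""
    return support(x | y) / support(x)

def lift(x, y):
    """Confidence of X -> Y relative to the baseline support of Y."""
    return confidence(x, y) / support(y)

x, y = {"bread"}, {"butter"}
print(support(x | y), confidence(x, y), lift(x, y))
```

A lift above 1 (here 5/4) suggests that the two itemsets occur together more often than they would if they were independent.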
• Posterior probability: the statistical probability that a hypothesis is true, calculated in the light of
relevant observations.
Solving the Monty Hall Problem Using Bayes’
Theorem
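The worked solution is not reproduced in this text. The sketch below applies Bayes' theorem to one common statement of the problem, which is an assumption here: the contestant picks door 1, and the host, who knows where the car is and never reveals it, opens door 3.

```python
from fractions import Fraction

doors = [1, 2, 3]
prior = {d: Fraction(1, 3) for d in doors}   # car equally likely behind each door

# Contestant picks door 1; host, who knows where the car is, opens door 3.
# Likelihood P(host opens door 3 | car behind door d):
likelihood = {
    1: Fraction(1, 2),  # host chooses at random between doors 2 and 3
    2: Fraction(1),     # host must open door 3
    3: Fraction(0),     # host never reveals the car
}

# Bayes' theorem: posterior = prior * likelihood / total probability of the evidence.
evidence = sum(prior[d] * likelihood[d] for d in doors)
posterior = {d: prior[d] * likelihood[d] / evidence for d in doors}
print(posterior)
```

Under these assumptions the posterior probability is 1/3 for the originally chosen door and 2/3 for the other unopened door, which is why switching is the better strategy.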
Generalization of Bayes’ Theorem
Random Variable
Discrete Random Variables
• If the random variable X can assume only a finite or countably infinite
set of values, then it is called a discrete random variable.
• There are many situations where the random variable X can
assume only a finite or countably infinite set of values.
Discrete Random Variables Examples
Continuous Random Variables
• A random variable X which can take any value in a continuous range
(an uncountably infinite set of values) is called a continuous random variable.
• Examples of continuous random variables are listed below
Probability Mass Function and Cumulative
Distribution Function of a Discrete
Random Variable
• For a discrete random variable, the probability that the random variable
X takes a specific value xi, P(X = xi), is called the probability mass
function P(xi).
• That is, a probability mass function is a function that maps each
outcome of a random experiment to a probability.
Representation
Expected Value, Variance, and Standard
Deviation of a Discrete Random Variable
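The formulas are not reproduced in this text. A minimal sketch using the standard definitions E(X) = Σ x·P(x) and Var(X) = Σ (x − E(X))²·P(x), with the standard deviation as the square root of the variance. The fair die is an assumed example. The same block also builds the CDF as the running sum of the PMF.

```python
from fractions import Fraction
from itertools import accumulate
from math import sqrt

# PMF of a fair six-sided die (an illustrative assumption).
values = [1, 2, 3, 4, 5, 6]
pmf = {x: Fraction(1, 6) for x in values}

# CDF: F(x) = P(X <= x), the running sum of the PMF.
cdf = dict(zip(values, accumulate(pmf[x] for x in values)))

mean = sum(x * p for x, p in pmf.items())                    # E(X)
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())  # Var(X)
std_dev = sqrt(variance)                                     # standard deviation

print(cdf[3], mean, variance, round(std_dev, 3))
```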
PROBABILITY DENSITY FUNCTION (PDF) AND
CUMULATIVE DISTRIBUTION
FUNCTION (CDF) OF A CONTINUOUS RANDOM
VARIABLE
• The probability density function reflects how dense the likelihood is of a
continuous random variable X taking a value in an infinitesimally small
interval around the value x.
BINOMIAL DISTRIBUTION
• Binomial distribution is one of the most important discrete probability
distributions due to its applications in several contexts.
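• A random variable X is said to follow a Binomial distribution when:
• Each trial has only two possible outcomes (success and failure).
• The number of trials n is fixed and the trials are independent.
• The probability of success p is the same in every trial (so the probability of failure is 1 − p).
• X counts the number of successes in the n trials.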
Probability Mass Function (PMF) of Binomial
Distribution
• The PMF of the Binomial distribution (probability that the number of
successes will be exactly x out of n trials) is given by
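The formula is not reproduced in this text. The standard binomial PMF is P(X = x) = C(n, x) p^x (1 − p)^(n − x), which can be sketched as follows; the function name and the example values are illustrative assumptions.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p**x * (1 - p)**(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Hypothetical example: exactly 2 successes in 5 trials with p = 0.5.
print(binomial_pmf(2, 5, 0.5))
```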
Cumulative Distribution Function (CDF) of
Binomial Distribution
• CDF of a binomial distribution function, F(a), representing the
probability that the random variable X takes value less than or equal
to a, is given by
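The formula is not reproduced in this text. By definition, F(a) is the sum of the binomial PMF over x = 0, 1, ..., a. A minimal sketch, with an illustrative function name and example values:

```python
from math import comb

def binomial_cdf(a, n, p):
    """F(a) = P(X <= a): sum of the binomial PMF for x = 0, 1, ..., a."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(a + 1))

# Hypothetical example: at most 2 successes in 5 trials with p = 0.5.
print(binomial_cdf(2, 5, 0.5))
```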
Mean and Variance of Binomial Distribution
Approximation of Binomial Distribution using
Normal Distribution
• If the number of trials (n) in a binomial distribution is large, then it
can be approximated by normal distribution with mean np and
variance npq, where q = 1 - p.
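As a rough check of this approximation, the sketch below compares the exact binomial CDF with the normal CDF that has mean np and variance npq. The parameters n = 100 and p = 0.5 are hypothetical, and the continuity correction of 0.5 is an added refinement that the text does not mention.

```python
from math import comb, erf, sqrt

def binomial_cdf(a, n, p):
    """Exact P(X <= a) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(a + 1))

def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal variable, via the error function."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

n, p = 100, 0.5                    # hypothetical parameters
q = 1 - p
mean, variance = n * p, n * p * q  # np and npq

exact = binomial_cdf(55, n, p)
# The +0.5 continuity correction is an added refinement, not part of the text.
approx = normal_cdf(55 + 0.5, mean, sqrt(variance))
print(round(exact, 4), round(approx, 4))
```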
POISSON DISTRIBUTION
• This is often known as the distribution of rare events.
• A Poisson process is one in which discrete events occur in a
continuous but finite interval of time or space.
• The following conditions must apply:
• For a small interval the probability of the event occurring is proportional to the size
of the interval.
• The probability of more than one occurrence in the small interval is negligible (i.e.
they are rare events). Events must not occur simultaneously.
• Each occurrence must be independent of others and must be at random.
• The events are often defects, accidents or unusual natural happenings, such as
earthquakes, where in theory there is no upper limit on the number of events.
• The interval is on some continuous measurement such as time, length or area.
• In many situations, we may be interested in calculating the number of
events that may occur over a period of time (or corresponding unit of
measurement).
• For example, the number of order cancellations by customers at an
e-commerce portal, the number of customer complaints, the number of cash
withdrawals at an ATM, the number of typographical errors in a book,
the number of potholes on Bangalore roads, etc.
• When we have to find the probability of a given number of events, we use
the Poisson distribution.
• The probability mass function of a Poisson distribution is given by
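The formula is not reproduced in this text. The standard Poisson PMF is P(X = k) = e^(−λ) λ^k / k!, where λ is the average number of events per interval. A minimal sketch, with an assumed rate of 2 complaints per day:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) = exp(-lam) * lam**k / k! for a Poisson rate lam."""
    return exp(-lam) * lam**k / factorial(k)

# Hypothetical example: on average 2 complaints per day; P(no complaints).
print(round(poisson_pmf(0, 2.0), 4))
```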
NORMAL DISTRIBUTION
• Normal distribution, also known as Gaussian distribution, is one of
the most popular continuous distributions in the field of analytics,
especially due to its use in multiple contexts.
• Normal distribution is observed across many naturally occurring
measures such as birth weight, height, intelligence, etc.
• The probability density function and the cumulative distribution
function are given by
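The formulas are not reproduced in this text. The standard normal density is f(x) = exp(−(x − μ)² / (2σ²)) / (σ√(2π)), and the CDF has no closed form but can be written with the error function. A minimal sketch, where the names `mu` and `sigma` for the mean and standard deviation are assumptions:

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density f(x) of a normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))

def normal_cdf(x, mu, sigma):
    """F(x) = P(X <= x), expressed through the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

print(round(normal_pdf(0, 0, 1), 4), round(normal_cdf(1.96, 0, 1), 4))
```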
Properties of Normal Distribution
Standard Normal Variable
• A normal random variable with mean μ = 0 and standard deviation σ = 1 is called the
standard normal variable and is usually represented by Z.
• The probability density function and cumulative distribution function
of a standard normal variable are given by
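The formulas are not reproduced in this text. Setting the mean to 0 and the standard deviation to 1 in the general normal density gives the standard forms:

```latex
\begin{align*}
f(z) &= \frac{1}{\sqrt{2\pi}} \, e^{-z^2/2} \\
F(z) &= \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \, e^{-t^2/2} \, dt
\end{align*}
```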
Hypothesis Testing
Step 1: Formulate H0 and H1
H0: π ≤ 0.40
H1: π > 0.40
(π denotes the population proportion.)
Step 2: Select an Appropriate Test
• The test statistic measures how close the sample has come to the null
hypothesis.
• The test statistic often follows a well-known distribution (e.g., normal,
t, or chi-square).
• In our example, the z statistic, which follows the standard normal
distribution, would be appropriate.
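$z = \dfrac{p - \pi}{\sigma_p}$, where p is the sample proportion, π is the hypothesized population proportion, and σp is the standard deviation of p.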
Step 3: Choose Level of Significance
Type I Error
• Occurs if the null hypothesis is rejected when it is in fact true.
• The probability of type I error (α) is also called the level of significance.
Type II Error
• Occurs if the null hypothesis is not rejected when it is in fact false.
• The probability of type II error is denoted by β.
• Unlike α, which is specified by the researcher, the magnitude of β
depends on the actual value of the population parameter (proportion).
[Figure: standard normal curve with zCAL = 1.88 marked; shaded area to the left of zCAL = 0.9699, unshaded area to the right = 0.0301]
Steps 6 & 7: Compare the Probability and Make the
Decision
• If the probability associated with the calculated value of the test statistic
(zCAL) is less than the level of significance (α), the null hypothesis is
rejected.
• Alternatively, if the calculated value of the test statistic is greater than
the critical value of the test statistic (zα), the null hypothesis is
rejected.
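The decision rule can be sketched with the calculated value zCAL = 1.88 from the example. The level of significance α = 0.05 is an assumption, since the text does not state which level was chosen.

```python
from math import erf, sqrt

def standard_normal_cdf(z):
    """P(Z <= z) for the standard normal variable."""
    return 0.5 * (1 + erf(z / sqrt(2)))

z_cal = 1.88     # calculated test statistic from the example
alpha = 0.05     # assumed level of significance (not stated in the text)

p_value = 1 - standard_normal_cdf(z_cal)   # one-tailed (upper-tail) probability
reject_h0 = p_value < alpha

print(round(p_value, 4), reject_h0)
```

The upper-tail probability matches the unshaded area of 0.0301 quoted in the example, and it is smaller than the assumed α, so the null hypothesis would be rejected at that level.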
Broad Classification of Hypothesis Tests
Hypothesis Tests
• Tests of Association
• Tests of Differences
Main Question: Do the (means of) the quantitative variables depend on which group
(given by categorical variable) the individual is in?
[Figure: plot of days (vertical axis, ticks 10 to 13) by treatment group (A, B, P)]
What does ANOVA do?
At its simplest (there are extensions) ANOVA tests the
following hypotheses:
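The hypotheses are not reproduced in this text. In the standard one-way setting, H0 states that all group means are equal and the alternative states that not all of them are equal. The sketch below computes the F statistic, the ratio of the between-group to the within-group mean square, for three treatment groups. The group labels follow the figure, but the measurements are hypothetical.

```python
# Hypothetical "days" measurements for three treatment groups.
groups = {
    "A": [10, 11, 12],
    "B": [11, 12, 13],
    "P": [12, 13, 14],
}

all_values = [v for vals in groups.values() for v in vals]
grand_mean = sum(all_values) / len(all_values)

# Between-group and within-group sums of squares.
ss_between = sum(
    len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2 for vals in groups.values()
)
ss_within = sum(
    (v - sum(vals) / len(vals)) ** 2 for vals in groups.values() for v in vals
)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F statistic: ratio of the between-group to the within-group mean square.
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f_stat)
```

A large F value relative to the F distribution with (df_between, df_within) degrees of freedom is evidence against H0; looking up that critical value is left out of this sketch.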
With such small data sets, there isn’t a really good way to check normality
from the data, but we make the common assumption that physical measurements of
people tend to be normally distributed.
Chi-square test
• A chi-square test is a statistical test used to compare observed results
with expected results. The purpose of this test is to determine if a
difference between observed data and expected data is due to
chance, or if it is due to a relationship between the variables you are
studying.
Data
• Qualitative: chi-square test
• Quantitative: t-test
• The most obvious difference between the chi-square
tests and the other hypothesis tests we have
considered (t-test) is the nature of the data.
• For chi-square, the data are frequencies rather than
numerical scores.
Chi-squared Tests
Steps of Chi-square Hypothesis Testing
• 1. Data: counts or proportions.
• 2. Assumption: a random sample selected from a population.
• 3. H0: no significant difference in proportions (no significant association).
• HA: a significant difference in proportions (a significant association).
• 4. Level of significance.
• df for the 1st application = k − 1 (k is the number of groups).
• df for the 2nd and 3rd applications = (columns − 1)(rows − 1).
• In the 2nd application (contingency table):
• with df = 1, the tabulated chi-square is always 3.841 (at the 0.05 level of significance).
• The graph is one-sided (positive values only).
• 5. Apply the appropriate test of significance.
• 6. Statistical decision and 7. Conclusion
• If calculated chi-square < tabulated chi-square, then P > 0.05: do not reject H0 (it may be true).
• If calculated chi-square > tabulated chi-square, then P < 0.05: reject H0 and accept HA.
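The test formula is not reproduced in this text. A minimal sketch of the contingency-table case, using the standard Pearson statistic Σ (observed − expected)² / expected and the tabulated value 3.841 for df = 1. The 2 × 2 table of counts is hypothetical.

```python
# Hypothetical 2 x 2 contingency table of observed counts (illustrative only).
observed = [
    [10, 20],
    [20, 10],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count for each cell: (row total * column total) / grand total.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi_square += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
critical = 3.841                      # tabulated value for df = 1 at the 0.05 level
reject_h0 = chi_square > critical
print(round(chi_square, 3), df, reject_h0)
```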
Correlation Analysis
• Correlation analysis is a powerful tool for understanding how
one variable relates to another.
• Take customer experience, for example. Let's say you have the overall
customer satisfaction score you need.
• But then, you want to know how that correlates with other aspects of
the customer experience such as product price, shipping time, or
quality.
• Because it provides a distinct perspective on which factors impact your
business the most, you can feel more confident in the actions you
take after the report.
• Definition
• Correlation analysis in market research is a statistical method that
identifies the strength of a relationship between two or more
variables. In a nutshell, the process reveals patterns within a
dataset’s many variables.
• It's all about identifying relationships between variables–specifically
in research.
• Using one of the several formulas, the end result will be a numerical
output between -1 and +1.
• This includes anything that yields data that is both numerical and ordinal.
$r_s = 1 - \dfrac{6 \sum d_i^2}{n(n^2 - 1)}$
where,
• rs = Spearman correlation coefficient
• di = the difference in the ranks given to the two variables' values for each item
of the data
• n = total number of observations
Steps and Example
• Step 1: Finding Rank
• Step 2: Calculate d²
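The two steps can be sketched as follows, using the usual Spearman formula rs = 1 − 6 Σ di² / (n(n² − 1)), which applies when there are no tied ranks. The paired observations are hypothetical, and the simple ranking helper assumes no ties.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman(x, y):
    """r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), valid when ranks have no ties."""
    n = len(x)
    d_squared = [(rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y))]
    return 1 - 6 * sum(d_squared) / (n * (n**2 - 1))

# Hypothetical paired observations.
x = [10, 20, 30, 40, 50]
y = [12, 25, 22, 48, 41]
print(spearman(x, y))
```

Here the ranks of y are 1, 3, 2, 5, 4, so Σ di² = 4 and rs = 1 − 24/120 = 0.8, a strong positive rank correlation.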
• Once all this information has been defined, you will be ready for the
next step, collecting your data.
2) Collect and assimilate your data
• Once your data is collected, you need to carefully assess it to
understand if the quality is appropriate to be used during a study.
• This means asking:
• Is the sample size big enough?
• Were the procedures used to collect the data implemented correctly?
• Is the date range of the data correct?
• If coming from an external source, is it a trusted and objective one?
• With all the needed information in hand, you are ready to start the
interpretation process, but first, you need to visualize your data.
3) Use the right data visualization type
• Bar chart
• Line chart
• Pie chart
• Tables
4) Start interpreting
• The way you decide to interpret the data will depend solely on the
methods you initially decided to use.
• If you had initial research questions or hypotheses then you should
look for ways to prove their validity.
• If you are going into the data with no defined hypothesis, then start
looking for relationships and patterns that will allow you to extract
valuable conclusions from the information.
5) Keep your interpretation objective
• Being the person closest to the investigation, it is easy to become
subjective when looking for answers in the data. A good way to stay
objective is to show the information to other people related to the
study, for example, research partners or even the people that will use
your findings once they are done.
• Using a visualization tool such as a modern dashboard makes the
interpretation process much easier and more efficient, as the data can
be navigated and manipulated in an easy and organized way.
6) Mark your findings and draw conclusions
Data Interpretation Characteristics
• Data analysis and interpretation, regardless of the method and
qualitative/quantitative status, may include the following
characteristics:
• Data identification and explanation
• Comparing and contrasting data
• Identification of data outliers
• Future predictions
Importance of Data Interpretation
• Informed decision-making
• Anticipating needs with trends identification
• Cost efficiency
• Clear foresight
Scatter plot
• A scatter plot (aka scatter chart, scatter graph) uses dots to represent
values for two different numeric variables. The position of each dot
on the horizontal and vertical axis indicates values for an individual
data point. Scatter plots are used to observe relationships between
variables.
When you should use a scatter plot
• Scatter plots’ primary uses are to observe and show relationships
between two numeric variables. The dots in a scatter plot not only
report the values of individual data points, but also patterns when the
data are taken as a whole.
• Identifying correlational relationships is a common use of scatter
plots. In these cases, we want to know, if we were given a particular
horizontal value, what a good prediction would be for the vertical
value.
• You will often see the variable on the horizontal axis called the
independent variable, and the variable on the vertical axis the
dependent variable.
• Relationships between
variables can be described in
many ways: positive or
negative, strong or weak,
linear or nonlinear.
• A scatter plot can also be useful for identifying other patterns in data.
We can divide data points into groups based on how closely sets of
points cluster together.
• Scatter plots can also show if there are any unexpected gaps in the
data and if there are any outlier points.
Common issues when using scatter plots
• Overplotting
• When we have lots of data points to plot, we can run into the issue of overplotting.
Overplotting is the case where data points overlap to a degree where we have
difficulty seeing relationships between points and variables.
• Solution
• There are a few common ways to alleviate this issue. One alternative is to sample
only a subset of data points: a random selection of points should still give the general
idea of the patterns in the full data.
• We can also change the form of the dots, adding transparency to allow for overlaps
to be visible, or reducing point size so that fewer overlaps occur.
• As a third option, we might even choose a different chart type like the heatmap,
where color indicates the number of points in each bin. Heatmaps in this use case
are also known as 2-d histograms.
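The binning behind such a heatmap can be sketched without a plotting library: each point is assigned to a square bin and the per-bin counts are what the color would encode. The points, the bin size and the function name are hypothetical.

```python
from collections import Counter

def bin_points(points, bin_size):
    """Count points per square bin; the counts would drive the heatmap color."""
    return Counter(
        (int(x // bin_size), int(y // bin_size)) for x, y in points
    )

# Hypothetical data points.
points = [(0.1, 0.2), (0.4, 0.4), (1.2, 0.3), (1.7, 1.9), (0.3, 0.1)]
counts = bin_points(points, bin_size=1.0)
print(counts)
```

Choosing a larger bin size smooths the picture at the cost of detail, so it is worth trying a few values on real data.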
Interpreting correlation as causation
• Causation indicates that one event is the result of the occurrence of
the other event; i.e. there is a causal relationship between the two
events.
Common scatter plot options
• Add a trend line
• When a scatter plot is used to
look at a predictive or
correlational relationship
between variables, it is
common to add a trend line to
the plot showing the
mathematically best fit to the
data.
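One common choice of "best fit" is the ordinary least-squares line, which is an assumption here since the text does not name a fitting method. A minimal sketch with hypothetical data:

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept = fit_line(xs, ys)
print(round(slope, 2), round(intercept, 2))
```

The fitted line always passes through the point of means, which gives a quick sanity check when the line is drawn over the scatter plot.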
Categorical third variable
• A common modification of the
basic scatter plot is the addition
of a third variable.
• Values of the third variable can
be encoded by modifying how
the points are plotted.
• For a third variable that indicates
categorical values (like
geographical region or gender),
the points can be distinguished
by color or shape.
Usage of Shapes
Numeric third variable
• For third variables that have numeric values, a common encoding
comes from changing the point size.
• A scatter plot with point size based on a third variable actually goes
by a distinct name, the bubble chart. Larger points indicate higher
values.
Highlight using annotations and color
Related plots
• Scatter map
• When the two variables in a scatter plot are geographical coordinates –
latitude and longitude – we can overlay the points on a map to get a scatter
map (aka dot map). This can be convenient when the geographic context is
useful for drawing particular insights and can be combined with other third-
variable encodings like point size and color.
Heatmap
• As noted above, a heatmap can
be a good alternative to the
scatter plot when there are a lot
of data points that need to be
plotted and their density causes
overplotting issues. However, the
heatmap can also be used in a
similar fashion to show
relationships between variables
when one or both variables are
not continuous and numeric.
Connected scatter plot
• If the third variable we want to
add to a scatter plot indicates
timestamps, then one chart type
we could choose is the
connected scatter plot. Rather
than modify the form of the
points to indicate date, we use
line segments to connect
observations in order.