
OVERVIEW OF PROBABILITY
CONTENTS
 Importance of Statistical tools in Machine Learning
 Concepts of probability
 Random variables
 Discrete distributions
 Continuous distributions
 Multiple random variables
 Central limit theorem
 Sampling distributions
 Hypothesis testing
 Monte Carlo Approximation
INTRODUCTION
 Machine learning provides a set of methods that can automatically detect patterns in data; these patterns can then be used to predict future data or to perform other kinds of decision making under uncertainty.
 The best way to perform such activities on top of a huge data set (big data) is to use the tools of probability theory, because probability theory can be applied to any situation involving uncertainty.
 We will discuss the tools, equations, and models of probability that are useful for the machine learning domain.
IMPORTANCE OF STATISTICAL TOOLS IN MACHINE LEARNING
 In machine learning, we train the system using a limited data set called 'training data', and based on the confidence we have in that training data, we expect the machine learning algorithm to reflect the behaviour of the larger set of actual data.
 If we have observations on only a subset of events, called a 'sample', then there will be some uncertainty in attributing the sample results to the whole set, or population.
 So the question is how limited knowledge of a sample set can be used to predict the behaviour of the full set.
 Mathematicians realized that even if some knowledge is based only on a sample, it can still be used in an optimal way, provided we know the amount of uncertainty attached to it.
 Probability theory provides a mathematical foundation for quantifying this uncertainty in our knowledge.
CONCEPTS OF PROBABILITY
 In our day-to-day life, we use the concept of probability in many places.
 Ex. when a coin is flipped, we say the probability of heads is 1/2, meaning that over many flips the coin will land heads about half the time. This is the frequentist interpretation of probability: probability as the long-run frequency of an event.
 Another important interpretation of probability tries to quantify the uncertainty of some event, and thus focuses on information rather than repeated trials. This is called the Bayesian interpretation of probability.
 Ex. computing the probability of India winning the 2022 cricket world cup final: the event cannot be repeated many times, so only the Bayesian interpretation applies.

 Basic rules
 p(A): the probability that the event A is true.
 Ex. India winning the 2022 cricket world cup final.
 0 ≤ p(A) ≤ 1: the probability of the event happening lies between 0 and 1.
 p(A) = 0: the event will definitely not happen.
 p(A) = 1: the event will definitely happen.
 p(A̅): the probability of the event "not A".
 p(A̅) = 1 − p(A).
 The probability of selecting an event A from a sample of size X is defined as:
p(A) = n/X
 where n is the number of times an instance of event A is present in the sample of size X.
 Probability of a union of two events
 For any two events A and B: p(A ∪ B) = p(A) + p(B) − p(A ∩ B).
 Two events A and B are called mutually exclusive if they can't happen together; in that case p(A ∩ B) = 0, so p(A ∪ B) = p(A) + p(B).
 For example, India winning the cricket World Cup 2022 and England winning the cricket World Cup 2022 are two mutually exclusive events.
 Conditional probability
 The conditional probability of A given that B has occurred is p(A | B) = p(A ∩ B) / p(B), for p(B) > 0.
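 Both rules can be checked numerically. A minimal Python sketch (a hypothetical two-dice example, not from the slides):

```python
from itertools import product

# Sample space: all 36 ordered outcomes of rolling two fair dice
omega = list(product(range(1, 7), repeat=2))

A = {o for o in omega if o[0] + o[1] >= 10}  # event A: sum is at least 10
B = {o for o in omega if o[0] == 6}          # event B: first die shows 6

p = lambda e: len(e) / len(omega)            # p(E) = n/X for equally likely outcomes

print(p(A | B), p(A) + p(B) - p(A & B))      # union rule: both sides give 0.25
print(p(A & B) / p(B))                       # conditional probability p(A | B) = 0.5
```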
RANDOM VARIABLES
 In probability and statistics, a random variable, random quantity, or stochastic variable is a variable whose possible values are the outcomes of a random phenomenon.
 It is a function which maps the outcomes of a random process to numeric values.
 Random variables make it much easier to quantify the results of any random process, apply mathematics, and perform further computation.
 Random variables allow us to ask questions in a mathematical way.
 If we flip 5 coins and want to answer questions like:
 1. What is the probability of getting exactly 3 heads?
 2. What is the probability of getting less than 4 heads?
 Then our general way of writing would be:
 · P(getting exactly 3 heads when we flip a coin 5 times)
 · P(getting less than 4 heads when we flip a coin 5 times)
 But if we use a random variable X to represent the number of heads, then we would simply write:
 1. P(X = 3)
 2. P(X < 4)
 A quick check of both values appears in the sketch below.
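 A small sketch (using scipy.stats, an assumed library choice) evaluating both probabilities for 5 fair coin flips:

```python
from scipy.stats import binom

# X = number of heads in 5 fair flips: X ~ Binomial(n=5, p=0.5)
print(binom.pmf(3, n=5, p=0.5))  # P(X = 3) = 10/32 = 0.3125
print(binom.cdf(3, n=5, p=0.5))  # P(X < 4) = P(X <= 3) = 26/32 = 0.8125
```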
 Suppose we have a random process/experiment of flipping a coin. The two possible outcomes are a head or a tail. Here we use X to denote the random variable which represents the outcome of this random process.
 Therefore we can write:
 X = 1, if the outcome is head
 X = 0, if the outcome is tail
 Here the random variable X is mapping the outcomes of the random process (flipping a coin) to the numerical values (1 and 0).
 Summarizing in three points:
 We have an experiment (tossing a coin).
 We give a value to each outcome.
 This set of values is a random variable.
 If the random variable X = {0, 1, 2, 3}, then X could be 0, 1, 2 or 3 randomly, where each value might have a different probability.
DISCRETE RANDOM VARIABLES
 If a random variable X changes values only in jumps (a countable number of them) and remains constant between the jumps, it is called a discrete random variable.
 A discrete random variable has a countable number of possible values.
 The probability of each value of a discrete random variable is between 0 and 1, and the sum of all the probabilities is equal to 1.
 The probability mass function (PMF) is a function that relates discrete events to the probabilities associated with those events occurring.
 The PMF is a probability measure that gives us the probabilities of the possible values of a random variable.
 Let X be a discrete random variable with
range RX={x1,x2,x3,...} .
 The function
PX(xk) = P(X=xk), for k=1,2,3,...,
is called the probability mass function
(PMF) of X.
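 As an illustration (a hypothetical example, not from the slides), the PMF of a fair six-sided die in plain Python:

```python
# PMF of a fair die: range R_X = {1, 2, ..., 6}, each value equally likely
pmf = {x: 1 / 6 for x in range(1, 7)}

print(pmf[3])             # P(X = 3) = 1/6
print(sum(pmf.values()))  # the probabilities sum to 1.0
```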
CONTINUOUS RANDOM VARIABLES
 A continuous random variable is a random variable whose data can take infinitely many values.
 Most real-life events are continuous in nature. For example, if we measure the actual time taken to finish an activity, there are infinitely many possible values.
 Thus the measurement is continuous, not discrete; it is unlike the discrete events of rolling a die or flipping a coin.
 The probability density function (PDF) is used to specify the probability of the random variable falling within a particular range of values, as opposed to taking on any one value.
 The cumulative distribution function (CDF) of a random variable is another method to describe the distribution of a random variable.
 The advantage of the CDF is that it can be defined for any kind of random variable (discrete or continuous).
 The CDF of a real-valued random variable X, also called the distribution function of X, evaluated at x, is the probability that X will take a value less than or equal to x.
 The CDF of a random variable X is defined as:
FX(x) = P(X ≤ x)
 Mean and variance
 The mean, in statistical terms, is the weighted average of all the possible values of a random variable X, where each value is weighted by its probability: E(X) = Σ x · p(x).
 It is denoted by µ or E(X).
 The variance of a random variable X measures the spread, or dispersion, of X: Var(X) = E[(X − µ)²].
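 A quick numeric check (hypothetical example) for a fair die, using the definitions above:

```python
# E(X) and Var(X) for a fair six-sided die
xs = range(1, 7)
p = 1 / 6

mean = sum(x * p for x in xs)               # E(X) = 3.5
var = sum((x - mean) ** 2 * p for x in xs)  # Var(X) = 35/12 ~ 2.917
print(mean, var)
```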
DISCRETE DISTRIBUTIONS
BERNOULLI DISTRIBUTION
 The Bernoulli distribution is the discrete probability distribution of a random variable which takes a binary, boolean output: 1 with probability p, and 0 with probability (1 − p).
 Whenever you run an experiment which might lead either to a success or to a failure, you can associate your success (labeled 1) with a probability p, while your failure (labeled 0) will have probability (1 − p).
 Imagine your experiment consists of flipping a coin, and you win if the output is tail. Since the coin is fair, the probability of tail is p = 1/2. Hence, once you set tail = 1 and head = 0, you can compute the probability of success as follows:
 P(X = 1) = f(1) = p = 1/2
 Again, imagine you are about to roll a die and you bet your money on the number 1: number 1 will be your success (labeled 1), while any other number will be a failure (labeled 0). The probability of success is 1/6. The probability of failure is then:
 P(X = 0) = f(0) = 1 − p = 5/6
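 The same two calculations via scipy.stats (an assumed library choice, shown for illustration):

```python
from scipy.stats import bernoulli

# Coin example: success (tail) with p = 1/2
print(bernoulli.pmf(1, p=0.5))   # P(X = 1) = 0.5

# Die bet example: success (rolling a 1) with p = 1/6
print(bernoulli.pmf(0, p=1/6))   # P(X = 0) = 1 - 1/6 ~ 0.833
```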
BINOMIAL DISTRIBUTION
 A binomial distribution can be thought of as
simply the probability of a SUCCESS or FAILURE
outcome in an experiment or survey that is
repeated multiple times.
 The binomial is a type of distribution that has two possible outcomes.
 For example,
 A coin toss has only two possible outcomes: heads
or tails
 Taking a test could have two possible outcomes:
pass or fail.
 If n independent Bernoulli trials are performed and x represents the number of successes in those n trials, then x is called a binomial random variable with parameters (n, p).
 A Bernoulli random variable is a special case of a binomial random variable with parameters (1, p).
 The first variable in the binomial formula, n, stands for the number of times the experiment runs.
 The second variable, p, represents the probability of one specific outcome.
 For example, suppose you want to know the probability of getting a 1 on a die roll. If you roll a die 20 times, the probability of rolling a one on any throw is 1/6, and the number of ones seen follows a binomial distribution with (n = 20, p = 1/6).
The binomial distribution formula is:
b(x; n, p) = nCx · p^x · (1 − p)^(n − x)
Where:
b = binomial probability
x = total number of "successes" (pass or fail, heads or tails, etc.)
p = probability of a success on an individual trial
n = number of trials
nCx = n! / (x!(n − x)!)
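 A direct translation of the formula into Python (a hypothetical helper, checked against the die example above):

```python
from math import comb

def binom_pmf(x, n, p):
    """b(x; n, p) = nCx * p**x * (1 - p)**(n - x)"""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# n = 20 die rolls, p = 1/6: probability of exactly three ones
print(binom_pmf(3, 20, 1 / 6))  # ~0.238
```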
THE MULTINOMIAL AND MULTINOULLI
DISTRIBUTIONS
 The Multinoulli distribution, also called the
categorical distribution, covers the case where
an event will have one of K possible outcomes.
x in {1, 2, 3, …, K}
 It is a generalization of the Bernoulli distribution from a binary variable to a categorical variable; the Bernoulli distribution is the special case with K = 2.
 A common example that follows a Multinoulli distribution is a single roll of a die, which has an outcome in {1, 2, 3, 4, 5, 6}, i.e. K = 6.
The probability (pmf) of a particular vector of outcome counts under the multinomial distribution is:
p(x1, x2, ..., xk) = (n! / (x1! · x2! · ... · xk!)) · p1^x1 · p2^x2 · ... · pk^xk
Where
n is the number of trials,
xi is the number of times event i occurs, and
pi is the probability of event i at each independent trial.
As an example, consider a problem which can take 3 outcomes at each trial. The probability of obtaining one specific combination of counts can be written as:
p(x1, x2, x3) = (n! / (x1! · x2! · x3!)) · p1^x1 · p2^x2 · p3^x3
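A quick check with scipy.stats.multinomial (an assumed library; the counts and probabilities below are made up for illustration):

```python
from scipy.stats import multinomial

# 10 trials of a 3-outcome experiment with these outcome probabilities
n, p = 10, [0.2, 0.3, 0.5]

# Probability of observing the outcomes 2, 3 and 5 times respectively
print(multinomial.pmf([2, 3, 5], n=n, p=p))  # ~0.085
```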
 The repetition of multiple independent Multinoulli trials follows a multinomial distribution.
 The multinomial distribution is a generalization of the binomial distribution for a discrete variable with K outcomes.
 An example of a multinomial process is a sequence of independent dice rolls.
 A common example of the multinomial distribution is the occurrence counts of words in a text document, from the field of natural language processing.
 A multinomial distribution is summarized by a discrete random variable with K outcomes, a probability for each outcome from p1 to pK, and n successive trials.
 The multinomial distribution applies to experiments in which the following conditions are true:
 The experiment consists of repeated trials, such as rolling a die five times instead of just once.
 Each trial must be independent of the others. For example, if you roll two dice, the outcome of one die does not impact the outcome of the other.
 The probability of each outcome must be the same across each instance of the experiment. For example, if a die has six sides, then there must be a one-in-six chance of each number coming up on each roll.
 Each trial must produce a specific outcome, such as a number between two and 12 when rolling two six-sided dice.
POISSON DISTRIBUTION
 A Poisson distribution is a tool that helps to predict the probability of a certain number of events happening when you know how often the event occurs on average. It gives us the probability of a given number of events happening in a fixed interval of time.
 So, a Poisson distribution can be used to measure how many times an event is likely to occur within a given period of time.
 A textbook store rents an average of 200 books every Saturday night. Using this data, you can predict the probability that more books will be rented (perhaps 300 or 400) on the following Saturday nights.
 Another example is the number of diners in a certain restaurant every day. If the average number of diners over seven days is 500, you can predict the probability of a certain day having more customers.
The Poisson distribution pmf is:
P(x; μ) = (e^(−μ) · μ^x) / x!
where μ (the expected number of occurrences) is sometimes written as λ and is called the event rate or rate parameter.

 The average number of major storms in your city is 2 per year. What is the probability that exactly 3 storms will hit your city next year?
 Step 1: Figure out the components you need to put into the equation.
 μ = 2 (average number of storms per year, historically)
 x = 3 (the number of storms we think might hit next year)
 e = 2.71828 (Euler's number, a constant)
 Step 2: Plug the values from Step 1 into the Poisson distribution formula:
 P(x; μ) = (e^(−μ))(μ^x) / x!
 = (2.71828^(−2))(2^3) / 3!
 = (0.13534)(8) / 6
 = 0.180
CONTINUOUS DISTRIBUTIONS
UNIFORM DISTRIBUTION
 Uniform distribution refers to a type
of probability distribution in which all
outcomes are equally likely.
 In a continuous uniform distribution,
outcomes are continuous and infinite.
 An idealized random number generator
would be considered a continuous uniform
distribution. With this type of distribution,
every point in the continuous range between
0.0 and 1.0 has an equal opportunity of
appearing, yet there is an infinite number of
points between 0.0 and 1.0.
 Example:
 You arrive at a building and are about to take an elevator to your floor. Once you call the elevator, it will take between 0 and 40 seconds to arrive. In this case a = 0 and b = 40, and the pdf is f(x) = 1/(b − a) = 1/40 for x in [0, 40].
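 A small check with scipy.stats (an assumed library; the 15-second query value is a made-up illustration):

```python
from scipy.stats import uniform

# Elevator wait ~ Uniform(a=0, b=40); scipy uses loc=a, scale=b-a
print(uniform.cdf(15, loc=0, scale=40))  # P(wait <= 15 s) = 15/40 = 0.375
```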
GAUSSIAN (NORMAL) DISTRIBUTION
 The most widely used distribution in statistics and machine learning is the Gaussian, or normal, distribution.
 The Gaussian distribution is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. In graph form, the normal distribution appears as a bell curve.
 Its pdf is f(x) = (1 / (σ√(2π))) · e^(−(x − µ)² / (2σ²)), where µ is the mean and σ the standard deviation.
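 A brief look at the standard normal with scipy.stats (an assumed library choice):

```python
from scipy.stats import norm

# Standard normal: mean 0, standard deviation 1
print(norm.pdf(0))                 # density at the mean, ~0.3989
print(norm.cdf(1) - norm.cdf(-1))  # P(-1 <= X <= 1) ~ 0.6827, the "68% rule"
```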
THE LAPLACE DISTRIBUTION
 Like the normal distribution, the Laplace distribution is unimodal (one peak) and symmetric. However, it has a sharper peak than the normal distribution.
 The Laplace distribution is the distribution of the difference of two independent random variables with identical exponential distributions. It is often used to model phenomena whose data have a higher peak than the normal distribution.
 This distribution is the result of two exponential distributions, one positive and one negative; it is sometimes called the double exponential distribution, because it looks like two exponential distributions spliced together back-to-back.
MULTIPLE RANDOM VARIABLES
BIVARIATE RANDOM VARIABLES
 Let us consider two random variables X and Y on the sample space S of a random experiment. The pair (X, Y) is called a bivariate random variable, or two-dimensional random vector, where each of X and Y associates a real number with every element of S.
 (X, Y) is called a discrete bivariate random variable if the random variables X and Y are both discrete.
 (X, Y) is called a continuous bivariate random variable if the random variables X and Y are both continuous.
 (X, Y) is called a mixed bivariate random variable if one of X and Y is discrete and the other is continuous.
CONT..
 Joint distribution functions
 The joint cumulative distribution function (joint cdf) of X and Y is defined as:
FXY(x, y) = P(X ≤ x, Y ≤ y)
 If continuous random variables X and Y are defined on the same sample space S, then their joint probability density function (joint pdf) is denoted f(x, y) and satisfies:
P((X, Y) ∈ A) = ∬_A f(x, y) dx dy
COVARIANCE AND CORRELATION
 The covariance between two random variables X and Y measures the degree to which X and Y are (linearly) related, i.e. how X varies with Y and vice versa:
Cov(X, Y) = E(XY) − E(X) E(Y)
 If variance is the measure of how a random variable varies with itself, then covariance is the measure of how two random variables vary with each other.
 If Cov(X, Y) = 0, then we say X and Y are uncorrelated.
 Covariance can take any value between −∞ and +∞, so it is often more convenient to work with a normalized measure: the correlation coefficient ρ = Cov(X, Y) / (σX σY), which always lies between −1 and 1.
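 A numeric illustration with NumPy (an assumed library choice; synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)  # y varies linearly with x, plus noise

print(np.cov(x, y)[0, 1])       # sample covariance: positive, scale-dependent
print(np.corrcoef(x, y)[0, 1])  # correlation: normalized, ~0.89 here
```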
CENTRAL LIMIT THEOREM
 In probability theory, the central limit theorem (CLT) states that the distribution of the sample mean approximates a normal distribution (i.e., a "bell curve") as the sample size becomes larger.
 Given a sufficiently large sample size drawn from a population with a finite level of variance, the mean of all sampled values from that population will be approximately equal to the mean of the population.
 A key aspect of the CLT is that the average of the sample means will equal the population mean, while the standard deviation of the sample means (the standard error) equals the population standard deviation divided by √n.
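 A quick simulation with NumPy (an assumed library choice) illustrating the CLT on a clearly non-normal population:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: exponential with mean 2.0 (skewed, not normal).
# Draw 10,000 samples of size 50 and take each sample's mean.
sample_means = rng.exponential(scale=2.0, size=(10_000, 50)).mean(axis=1)

print(sample_means.mean())  # ~2.0, the population mean
print(sample_means.std())   # ~2.0 / sqrt(50) ~ 0.283, the standard error
```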
SAMPLING DISTRIBUTIONS
 A key component of machine learning is the use of sample-based training data to represent a larger set of actual data. It is important to estimate how confidently an outcome observed on the training data can be attributed to the actual data, so that decisions on the actual data can be made.
 Population: a finite set of objects being investigated.
 Random sample: a sample of objects drawn from a population in such a way that every member of the population has the same chance of being chosen.
 Sampling distribution: the probability distribution of a random variable defined on a space of random samples.
SAMPLING WITH REPLACEMENT
 If, while choosing samples from the population, each object chosen is returned to the population before the next object is chosen, it is called sampling with replacement.
 In this case, repetitions are allowed.
 That means, if a sample of size n is chosen from a population of size N, then the number of such (ordered) samples is N^n, because each object can be repeated.
 Also, the probability of each sample being chosen is the same: 1/N^n.
 For example, let's choose a random sample of 2 patients from a population of 3 patients {A, B, C}, with replacement allowed. There are 9 such ordered pairs:
(A, A), (A, B), (A, C), (B, A), (B, B), (B, C), (C, A), (C, B), (C, C)
 That means the number of random samples of size 2 from the population of 3 is N^n = 3^2 = 9, and each random sample has probability 1/9 of being chosen.
SAMPLING WITHOUT REPLACEMENT
 If we do not return the chosen object to the population before choosing the next object, then a random sample of size n is defined as an unordered subset of n objects from the population, and this is called sampling without replacement.
 The number of such samples that can be drawn from a population of size N is the binomial coefficient:
C(N, n) = N! / (n!(N − n)!)
 In our previous example, the unordered samples of 2 that can be created from the population of 3 patients when replacement is not allowed are:
(A, B), (A, C), (B, C)
 That is C(3, 2) = 3 samples.
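 Both counts can be verified with Python's itertools (a small illustrative sketch):

```python
from itertools import combinations, product
from math import comb

population = ["A", "B", "C"]

with_repl = list(product(population, repeat=2))    # ordered, with replacement
without_repl = list(combinations(population, 2))   # unordered, without replacement

print(len(with_repl), 3 ** 2)          # 9 == N**n
print(len(without_repl), comb(3, 2))   # 3 == C(N, n)
```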
HYPOTHESIS TESTING
 When dealing with random variables, a common situation is that we have to make certain decisions or choices based on observations or data which are random in nature.
 The framework for dealing with these situations is called decision theory or hypothesis testing.
 In terms of statistics, a hypothesis is an assumption about the probability law of the random variables.
 Hypothesis testing is basically about an assumption that we make about a population parameter. Ex.: the average age of students in a class is 40, or boys are taller than girls.
 A hypothesis test evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data.
 Null hypothesis: in inferential statistics, the null hypothesis is a general statement or default position that there is no relationship between two measured phenomena, or no association among groups.
 The alternative hypothesis is the hypothesis used in hypothesis testing that is contrary to the null hypothesis.
 Example: you have a coin and you don't know whether it is fair or tricky, so let's define the null and alternative hypotheses:
 H0: the coin is a fair coin.
 H1: the coin is a tricky coin.
 Level of significance: refers to the degree of significance at which we accept or reject the hypothesis.
 It is normally denoted by alpha (α) and is generally 0.05, or 5%, which means your result should be 95% likely to hold in each new sample.
 There are 4 possible decisions:
1. H0 is true; accept H0 ➔ a correct decision
2. H0 is true; reject H0 (i.e. accept H1) ➔ an incorrect decision
3. H1 is true; accept H1 ➔ a correct decision
4. H1 is true; reject H1 (i.e. accept H0) ➔ an incorrect decision
 So, we can see there are 2 correct and 2 incorrect possible decisions and corresponding actions. The erroneous decisions are termed as follows (a worked coin example follows this list):
 Type I error: reject H0 (or accept H1) when H0 is true. This is also called an alpha error, where good is interpreted as bad.
 Type II error: reject H1 (or accept H0) when H1 is true. This is also called a beta error, where bad is interpreted as good, and it can have a more devastating impact.
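 As a worked example (hypothetical data; scipy.stats is an assumed library choice), testing the fair-coin H0 from above:

```python
from scipy.stats import binomtest

# H0: the coin is fair (p = 0.5); H1: it is not.
# Suppose we observed 41 heads in 50 flips (made-up data).
result = binomtest(k=41, n=50, p=0.5)

print(result.pvalue)  # very small (<< 0.05), so reject H0 at alpha = 0.05
```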
