
CS464

Introduction to Machine Learning

Estimation

(slides based on the slides provided by Öznur Taştan and Mehmet Koyutürk)
Motivation
• In machine learning, we are trying to figure out the relationship between
  variables (features and outcomes)
  – For this purpose, we use a model (an assumption on the structure of this
    relationship)
  – A probability distribution usually serves as a good model
  – How do we use observations to learn this distribution?

• Density Estimation
  – Maximum Likelihood Estimator (MLE)
  – Maximum A Posteriori (MAP) Estimate

• Where do we get these probability estimates?
Density Estimation
• We assume that the variable of interest is sampled from a distribution

• We have some observations on the variable

• How do we use observations to learn the distribution?
Density Estimation
• A billionaire asks you a question:

• He says: I have a thumbtack; if I flip it, what's the probability it will
  fall with the nail up (heads)?

• You say: Please flip it a few times…

Data
• The billionaire flips the thumbtack 5 times: 3 times it lands with the nail
  up (heads), 2 times tails

• You say the probability that it falls with the nail up is 3/5

• Why frequency of heads?

• How good is this estimate?
• Why is this a machine learning problem?
Why frequency of heads?
• Frequency of heads is exactly the maximum likelihood estimator for this problem
Thumbtack - Bernoulli Trial

D = {x_1, x_2, …, x_N}, where each outcome x_i is heads or tails

• Flips produce a data set D

• Flips are independent, identically distributed, and each is a Bernoulli
  trial with P(heads) = θ and P(tails) = 1 − θ

• Maximum Likelihood Estimator (MLE):
  Choose the θ that maximizes the probability of the observed data
Estimation vs. Learning

• Density estimation is a learning problem too:
  – Data: Observed set of flips with α_H heads and α_T tails
  – Model: Bernoulli distribution
  – Learning: Finding θ, which is an optimization problem

• Once we estimate θ, we can predict the probability of the next flip being
  a head
  – We can do more than that too: for example, predict the number of heads in
    the next 100 flips (a small sketch follows below)
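
A minimal sketch of these two predictions in Python, assuming the counts of
the running example (3 heads, 2 tails) and the MLE derived on the next slides:

```python
import numpy as np

# Observed counts (the running thumbtack example: 3 heads, 2 tails)
alpha_H, alpha_T = 3, 2
theta_hat = alpha_H / (alpha_H + alpha_T)  # MLE, derived on the next slides

# Prediction 1: probability that the next flip lands heads (nail up)
print(f"P(next flip is heads) = {theta_hat:.2f}")

# Prediction 2: heads in the next 100 flips. Under the model the count is
# Binomial(100, theta_hat), so its expected value is 100 * theta_hat.
print(f"Expected heads in next 100 flips: {100 * theta_hat:.0f}")
print(f"One simulated run: {np.random.binomial(n=100, p=theta_hat)}")
```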
Maximum Likelihood Estimation

MLE: Choose the θ that maximizes the probability of the observed data
(the likelihood of the data)

The likelihood of observing this data is the joint probability (by the
independence assumption):

P(D | θ) = θ^(α_H) · (1 − θ)^(α_T)

Maximum likelihood estimate of θ:

θ̂_MLE = argmax_θ P(D | θ)
Your First Parameter Learning Algorithm

• Why do we take the log?
  – Joint probabilities are often in the form of multiplications and
    exponents (this comes from the independence assumption)
  – Log transforms multiplication into addition
  – The resulting equations are easier to manage

The log-likelihood:

log P(D | θ) = α_H log θ + α_T log(1 − θ)

• Take the derivative and set it to 0:

d/dθ [α_H log θ + α_T log(1 − θ)] = α_H/θ − α_T/(1 − θ) = 0

Solving for θ gives the frequency of heads:

θ̂_MLE = α_H / (α_H + α_T)
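
A small sketch that checks this closed form against a brute-force grid search
over the log-likelihood, using the running example's counts (only NumPy is
assumed):

```python
import numpy as np

alpha_H, alpha_T = 3, 2  # heads and tails counts from the running example

# Closed-form MLE derived above: frequency of heads
theta_mle = alpha_H / (alpha_H + alpha_T)

# Brute-force check: evaluate the log-likelihood on a dense grid of theta
thetas = np.linspace(1e-6, 1 - 1e-6, 100_000)
log_lik = alpha_H * np.log(thetas) + alpha_T * np.log(1 - thetas)
theta_grid = thetas[np.argmax(log_lik)]

print(f"Closed-form MLE:    {theta_mle:.4f}")   # 0.6000
print(f"Grid-search argmax: {theta_grid:.4f}")  # ~0.6000
```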
How Many Flips Do I Need?

• Your answer to the billionaire: 3/5

• He says: "While you have been calculating, I flipped 50 times, and 30 times
  it was heads." He asks: what is your answer now?

• You say: 30 / 50 = 3/5

• He says: Did I waste my time flipping more?

• You say: No! On the contrary, the more data the merrier

• This is why…
A Bound (from Hoeffding's Inequality)

For N i.i.d. flips, Hoeffding's inequality bounds how far the MLE can be from
the true parameter θ*:

P(|θ̂_MLE − θ*| ≥ ε) ≤ 2e^(−2Nε²)

Probably Approximately Correct (PAC)

To be within ε of θ* with probability at least 1 − δ, it suffices that
2e^(−2Nε²) ≤ δ, i.e.,

N ≥ ln(2/δ) / (2ε²)
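
A quick sketch of the sample-size rule this bound implies; the tolerance ε
and confidence level below are illustrative choices:

```python
import math

def flips_needed(eps: float, delta: float) -> int:
    """Smallest N such that 2 * exp(-2 * N * eps**2) <= delta,
    i.e. P(|theta_hat - theta*| >= eps) <= delta by Hoeffding."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

# Example: be within 0.05 of the true theta with probability at least 0.95
print(flips_needed(eps=0.05, delta=0.05))  # 738 flips
```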
What if we have a continuous variable?

• What if we are measuring a continuous variable?

[Figure: a continuous probability density p(x)]
Learning Parameters for a Gaussian

• Assume we have i.i.d. data

    i   Exam score x_i
    0   80
    1   70
    2   12
    3   99

• Learn the parameters
  – The mean, µ
  – The standard deviation, σ
Learning a Gaussian Distribution

The likelihood of the i.i.d. data under a Gaussian N(µ, σ²):

P(D | µ, σ) = ∏_{i=1}^{N} (1 / (σ√(2π))) · e^(−(x_i − µ)² / (2σ²))

MLE for the Mean

Taking the derivative of the log-likelihood with respect to µ and setting it
to 0 gives the sample mean:

µ̂_MLE = (1/N) ∑_{i=1}^{N} x_i

MLE for the Variance

Taking the derivative with respect to σ and setting it to 0 gives:

σ̂²_MLE = (1/N) ∑_{i=1}^{N} (x_i − µ̂_MLE)²
MLE of Gaussian Parameters

The MLE for the variance of a Gaussian is biased: the expected value of the
estimator is not equal to the true parameter. An unbiased variance estimator
divides by N − 1 instead of N:

σ̂²_unbiased = (1/(N − 1)) ∑_{i=1}^{N} (x_i − µ̂)²
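
A minimal sketch on the exam-score data above. NumPy's ddof argument switches
between the biased MLE (divide by N) and the unbiased estimator (divide by
N − 1):

```python
import numpy as np

scores = np.array([80, 70, 12, 99])  # exam scores from the table above

mu_mle = scores.mean()               # MLE of the mean: the sample mean
var_mle = scores.var(ddof=0)         # MLE of the variance: divides by N
var_unbiased = scores.var(ddof=1)    # unbiased estimator: divides by N - 1

print(f"mu_MLE       = {mu_mle:.2f}")        # 65.25
print(f"var_MLE      = {var_mle:.2f}")       # biased (too small on average)
print(f"var_unbiased = {var_unbiased:.2f}")  # larger by a factor N/(N-1)
```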
What if we have prior beliefs?

• Billionaire says: wait, I think the thumbtack is close to 50-50. How can
  you use this information?

• You say: I can learn it the Bayesian way.

Bayes Rule

• What if we have prior beliefs? We can utilize prior information through
  Bayes rule:

P(θ | D) = P(D | θ) · P(θ) / P(D)

– P(θ | D): posterior
– P(D | θ): likelihood
– P(θ): prior
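
A tiny numerical sketch of Bayes rule for the thumbtack: the posterior over a
grid of θ values is the likelihood times the prior, renormalized (the
normalization plays the role of P(D)). The counts and the uniform prior are
illustrative:

```python
import numpy as np

thetas = np.linspace(0.001, 0.999, 999)        # grid over the parameter theta
prior = np.full_like(thetas, 1 / len(thetas))  # uniform (uninformative) prior

k, n = 3, 5                                    # illustrative: 3 heads in 5 flips
likelihood = thetas ** k * (1 - thetas) ** (n - k)  # P(D | theta)

posterior = likelihood * prior
posterior /= posterior.sum()                   # dividing by (a proxy for) P(D)

print(f"Posterior mode: {thetas[np.argmax(posterior)]:.3f}")  # ~0.6 = k/n
```

With a uniform prior the posterior mode coincides with the MLE, which is why
the prior only matters when it is informative or the data is scarce.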
Maximum A Posteriori (MAP) Estimation

MLE: θ̂_MLE = argmax_θ P(D | θ)

MAP: θ̂_MAP = argmax_θ P(θ | D) = argmax_θ P(D | θ) · P(θ)

(P(D) does not depend on θ, so it can be dropped from the maximization.)
MAP Estimation

• Our prior could be in the form of a probability distribution

• Priors can have different forms
  – Uninformative prior: uniform distribution
  – Conjugate prior: the prior and the posterior have the same form
Posterior Distribution

Beta Distribution:

p(θ) = (1 / B(α, β)) · θ^(α−1) · (1 − θ)^(β−1),    0 ≤ θ ≤ 1,  α, β > 0
Posterior Distribution

Flip it N times, and k times it was heads. With the Beta prior

p(θ) = (1 / B(α, β)) · θ^(α−1) · (1 − θ)^(β−1)

the two estimates are:

θ̂_MLE = k / N

θ̂_MAP = (k + α − 1) / (N + α + β − 2)

Example: N = 3, k = 1, α = β = 2

θ̂_MLE = k / N = 1/3

θ̂_MAP = (k + α − 1) / (N + α + β − 2) = 2/5
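
A small sketch reproducing the slide's numbers; the MAP formula is the mode
of the Beta(α + k, β + N − k) posterior:

```python
def theta_mle(k: int, n: int) -> float:
    """MLE: the observed frequency of heads."""
    return k / n

def theta_map(k: int, n: int, alpha: float, beta: float) -> float:
    """MAP under a Beta(alpha, beta) prior: the mode of the posterior
    Beta(alpha + k, beta + n - k)."""
    return (k + alpha - 1) / (n + alpha + beta - 2)

# The slide's example: N = 3 flips, k = 1 head, Beta(2, 2) prior
print(theta_mle(1, 3))        # 0.333... (= 1/3)
print(theta_map(1, 3, 2, 2))  # 0.4      (= 2/5, pulled toward the 50-50 prior)
```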
Bayesian Estimation

• For the parameters to estimate, we assign an a priori distribution, which
  is used to capture our prior belief about the parameter

• When the data is sparse, this allows us to fall back on the prior and avoid
  the issues faced by Maximum Likelihood Estimation (example: univariate
  Gaussian)

• When the data is abundant, the likelihood will dominate the prior, and the
  prior will not have much of an effect on the posterior distribution (a
  small sketch follows below)
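
A minimal sketch of this effect for the thumbtack model, assuming a
Beta(2, 2) prior and an (illustrative) true heads probability of 0.3: as N
grows, the MAP estimate converges to the MLE and the prior washes out:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
alpha, beta, true_theta = 2.0, 2.0, 0.3  # illustrative prior and ground truth

for n in (3, 30, 300, 3000):
    k = rng.binomial(n, true_theta)                   # heads in n flips
    mle = k / n
    map_est = (k + alpha - 1) / (n + alpha + beta - 2)
    print(f"N = {n:4d}   MLE = {mle:.3f}   MAP = {map_est:.3f}")
# Small N: the prior pulls MAP toward 0.5. Large N: MAP ~ MLE.
```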
Estimating Parameters
