
Discovering Models / Theories

cs365 2015 mukerjee


Domain Theories
 Agent:
given percept history p ∈ P,
select a decision from the set of choices a ∈ A
so as to meet a goal g (performance) –
maximize a utility function U()

 Requires knowledge of how actions under different percepts
affect the goal
 Model or Theory

 Task domains: a) 8-puzzle [deterministic], b) Soccer [stochastic]


8-puzzle

• Percept = state

• Actions = move

• Goal: T/F (goal test: solved or not)

• Utility: number of moves


8-puzzle

• State = [7,2,4,5,B,6,8,3,1]

• Actions = L, R, U, D


State + Action → new State

• Decision: based on Search


• [Informed / Uninformed]
Breadth-first search
• Expand shallowest unexpanded node

• Fringe: FIFO queue; new successors go at the end

O(b^{1+d})



Properties of breadth-first search
• Complete? Yes (if b is finite)

• Time? 1 + b + b^2 + b^3 + … + b^d + b(b^d − 1) = O(b^{d+1})

• Space? O(b^{d+1}) (keeps every node in memory)

• Optimal? Yes (if cost = 1 per step)
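A minimal sketch of breadth-first search in code; the toy graph, node names, and helper signatures below are illustrative assumptions, not from the notes:

```python
from collections import deque

def breadth_first_search(start, is_goal, successors):
    """Expand the shallowest unexpanded node; the fringe is a FIFO queue."""
    fringe = deque([(start, [start])])   # (node, path from start)
    visited = {start}
    while fringe:
        node, path = fringe.popleft()    # shallowest node comes off the front
        if is_goal(node):
            return path
        for nxt in successors(node):     # new successors go at the end
            if nxt not in visited:
                visited.add(nxt)
                fringe.append((nxt, path + [nxt]))
    return None                          # no solution found

# Toy usage on a small explicit graph (hypothetical example):
graph = {'S': ['A', 'B'], 'A': ['G'], 'B': ['A'], 'G': []}
print(breadth_first_search('S', lambda n: n == 'G', lambda n: graph[n]))
# ['S', 'A', 'G']
```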


Iterative-Deepening search



Cost-based search (uniform-cost search)

• edges don't have equal cost

• generalizes breadth-first: expand the
node with the lowest path cost g(n)
from START

• Fringe: priority queue ordered by path cost

O(b^{1 + ⌊C*/ε⌋}), where C* is the cost of the optimal
solution and every step costs at least ε
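A uniform-cost search sketch under the same illustrative assumptions; here `successors` yields (state, step-cost) pairs:

```python
import heapq

def uniform_cost_search(start, is_goal, successors):
    """Expand the node with the lowest path cost g(n); fringe is a priority queue."""
    fringe = [(0, start, [start])]               # (g, node, path)
    best_g = {start: 0}
    while fringe:
        g, node, path = heapq.heappop(fringe)    # cheapest node first
        if is_goal(node):
            return g, path
        for nxt, step_cost in successors(node):
            g2 = g + step_cost
            if g2 < best_g.get(nxt, float('inf')):
                best_g[nxt] = g2
                heapq.heappush(fringe, (g2, nxt, path + [nxt]))
    return None

# Toy weighted graph (hypothetical): the cheapest S->G path is S-B-G, cost 3.
graph = {'S': [('A', 5), ('B', 1)], 'A': [('G', 1)], 'B': [('G', 2)], 'G': []}
print(uniform_cost_search('S', lambda n: n == 'G', lambda n: graph[n]))
# (3, ['S', 'B', 'G'])
```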

Soccer

• Percept = goalie, self, ball
+ wind, opponents, teammates…

• Actions = kick (angle θ, speed, swing)

• Utility: goal probability


Discrete-Deterministic Spaces:

Search
Uninformed search strategies
• Uninformed search strategies use only the
information available in the problem definition
• Breadth-first search
• Uniform-cost search
• Depth-first search
• Depth-limited search
• Iterative deepening search


Representing the state space

1. States: configurations of the 8 tiles + blank

2. Actions: move the blank L, R, U, D

3. Goal test: does the state match the goal configuration?

4. Cost: 1 per move
8-puzzle heuristics
Admissible heuristics:

• h1: number of misplaced tiles
(in the figure's example, h1 = 6)

• h2: sum of the Manhattan distances
of the tiles from their goal positions
(in the example: 0+0+1+1+2+3+1+3 = 11)
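A sketch of both heuristics in code; the goal layout and the 0-for-blank encoding below are assumptions for illustration, so the printed values need not match the figure's example:

```python
def h1(state, goal):
    """h1: number of misplaced tiles (blank, coded 0, not counted)."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def h2(state, goal):
    """h2: sum of Manhattan distances of tiles from goal positions (3x3 board)."""
    dist = 0
    for i, tile in enumerate(state):
        if tile == 0:
            continue
        j = goal.index(tile)
        dist += abs(i // 3 - j // 3) + abs(i % 3 - j % 3)
    return dist

# Example state from the notes (0 = blank); goal layout assumed, not from the slide.
state = [7, 2, 4, 5, 0, 6, 8, 3, 1]
goal  = [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(h1(state, goal), h2(state, goal))   # 8 18 for this assumed goal
```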
8-puzzle heuristics
Nilsson's Sequence Score
Score(n) = P(n) + 3 S(n)

P(n): sum of the Manhattan distances of each tile from
its proper position

S(n), the sequence score: check around the non-central
squares:
+2 for every tile not followed by its proper successor,
0 for every other tile;
+1 if there is a piece in the centre
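A sketch of Nilsson's score; it assumes the classic goal layout with the blank in the centre (1 2 3 / 8 _ 4 / 7 6 5), which the slide does not state:

```python
# Board is a list of 9 ints in row-major order, 0 = blank.
# With the assumed goal below, the clockwise successor of tile t is t % 8 + 1.
GOAL = [1, 2, 3, 8, 0, 4, 7, 6, 5]
CLOCKWISE = [0, 1, 2, 5, 8, 7, 6, 3]   # non-central squares, in clockwise order

def manhattan(state, goal=GOAL):
    """P(n): sum of Manhattan distances of each tile from its proper position."""
    total = 0
    for i, tile in enumerate(state):
        if tile == 0:
            continue
        j = goal.index(tile)
        total += abs(i // 3 - j // 3) + abs(i % 3 - j % 3)
    return total

def sequence_score(state):
    """S(n): +2 per non-central tile not followed clockwise by its successor,
    +1 if a tile occupies the centre."""
    s = 1 if state[4] != 0 else 0
    for k, idx in enumerate(CLOCKWISE):
        tile = state[idx]
        if tile == 0:
            continue
        nxt = state[CLOCKWISE[(k + 1) % 8]]
        if nxt != tile % 8 + 1:
            s += 2
    return s

def nilsson(state):
    return manhattan(state) + 3 * sequence_score(state)

print(nilsson(GOAL))   # 0 for the goal state
```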
Stochastic Spaces
Soccer

Soccer: Shooting at goal (kick angle θ)

[acharya mukerjee 01]


Soccer : Shoot, Pass, dribble, or … ?
Handwritten digits - MNIST
Confusion matrix
Discovering theories
Continuous Data
Discrete Attribute data
• Examples described by attribute values (Boolean, discrete, continuous)
• E.g., situations where I will/won't wait at a restaurant:

• Classification of examples is positive (T) or negative (F)


Discrete Features
• Parse the sentence: “Time flies like an arrow”

May have many parses.


How to rank the choices?
Regression
Modelling as Regression
Given a set of decisions y_i based on observations x_i,
- derived from an unknown function y = f(x)
- with noise

Try to find a model or theory:


y = h(x) ≈ f(x)

where h() is drawn from the hypothesis space – e.g. the space of
radial basis functions, or polynomials, etc.
Polynomial Curve Fitting

[Bishop 06] ch.1


Linear Regression
y = f(x) = Σ_i w_i φ_i(x)

φ_i(x): basis function

w_i: weights

Linear: the function is linear in the weights

Quadratic error function → its derivative is linear in w
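A minimal sketch of fitting such a model by least squares with numpy; the polynomial basis and the synthetic data are illustrative assumptions. Because the error is quadratic in w, the fit reduces to solving a linear system:

```python
import numpy as np

def design_matrix(x, degree):
    """Phi[n, i] = phi_i(x_n) = x_n**i  (polynomial basis functions)."""
    return np.vander(x, degree + 1, increasing=True)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)  # noisy targets

Phi = design_matrix(x, degree=3)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # minimizes sum-of-squares error
print(w)                                       # fitted weights
```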
Sum-of-Squares Error Function
0th, 1st, 3rd and 9th order polynomial fits (figure slides)
Over-fitting

Root-Mean-Square (RMS) Error: E_RMS = √(2 E(w*) / N)


Polynomial coefficients for the 9th-order fit;
effect of increasing the data set size on the 9th-order fit (figure slides)
Regularization

Penalize large coefficient values:
Ẽ(w) = ½ Σ_n {y(x_n, w) − t_n}² + (λ/2) ||w||²

Regularization with varying λ, and the resulting polynomial coefficients (figure slides)
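A sketch of the regularized fit (ridge regression); λ and the synthetic data are illustrative assumptions:

```python
import numpy as np

# Adding (lam/2)*||w||^2 to the sum-of-squares error keeps the error quadratic,
# so the minimizer solves (Phi^T Phi + lam*I) w = Phi^T t.
def ridge_fit(Phi, t, lam):
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)
Phi = np.vander(x, 10, increasing=True)        # 9th-order polynomial basis

for lam in (0.0, 1e-3):
    w = ridge_fit(Phi, t, lam)
    print(lam, np.abs(w).max())   # lam > 0 shrinks the largest coefficient
```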
Probability Theory
Learning = discovering regularities
- Regularity: repeated experiments whose
outcome is not fully predictable

outcome = “possible world”


set of all possible worlds = Ω
Probability Theory
Apples and Oranges
Sample Space
Sample ω = Pick two fruits,
e.g. Apple, then Orange
Sample Space Ω = {(A,A), (A,O),
(O,A),(O,O)}
= all possible worlds

Event e = set of possible worlds, e ⊆ Ω


• e.g. second one picked is an apple
Learning = discovering regularities
- Regularity: repeated experiments whose
outcome is not fully predictable

- Probability p(e): "the fraction of possible worlds in
which e is true", i.e. the outcome is in event e

- Frequentist view: p(e) = lim_{N→∞} (number of trials in which e occurs) / N

- Belief view: in a wager, one accepts equivalent odds
(1−p) : p that the outcome is in e, or vice versa
Axioms of Probability
- non-negative : p(e) ≥ 0

- unit sum p(Ω) = 1


i.e. no outcomes outside sample space

- additive : if e1, e2 are disjoint events (no common


outcome):
p(e1) + p(e2) = p(e1 ∪ e2)
ALT:
p(e1 ∨ e2) = p(e1) + p(e2) - p(e1 ∧ e2)
Why probability theory?
different methodologies attempted for uncertainty:
– Fuzzy logic
– Multi-valued logic
– Non-monotonic reasoning
But unique property of probability theory:
If you gamble using probabilities you have the best
chance in a wager. [de Finetti 1931]
⇒ if the opponent uses some other system, they are
more likely to lose
Ramsey–de Finetti theorem (1931)
If agent X's degrees of belief are rational, then X's
degrees-of-belief function defined by fair betting
rates is (formally) a probability function
Fair betting rates: the opponent decides which side one
bets on
Proof: fair odds result in a function pr() that satisfies
the Kolmogorov axioms:
Normality: pr(S) ≥ 0
Certainty: pr(T) = 1
Additivity: pr(S1 ∨ S2 ∨ …) = Σ_i pr(S_i)
Joint vs. conditional probability

marginal probability p(X); joint probability p(X, Y);
conditional probability p(Y|X)

Rules of Probability

Sum Rule: p(X) = Σ_Y p(X, Y)

Product Rule: p(X, Y) = p(Y|X) p(X)
Example
A disease d occurs in 0.05% of population. A test is
99% effective in detecting the disease, but 5% of
the cases test positive in absence of d.
10000 people are tested. How many are expected to
test positive?
p(d) = 0.0005; p(t|d) = 0.99; p(t|~d) = 0.05
p(t) = p(t,d) + p(t,~d) [Sum Rule]
= p(t|d) p(d) + p(t|~d) p(~d) [Product Rule]
= 0.99 × 0.0005 + 0.05 × 0.9995 ≈ 0.0505 → about 505 test +ve
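The same calculation as a sketch in code (numbers taken from the example above):

```python
p_d = 0.0005            # prior: disease prevalence
p_t_given_d = 0.99      # test sensitivity
p_t_given_nd = 0.05     # false-positive rate

# Sum rule over the two ways to test positive; product rule for each term:
p_t = p_t_given_d * p_d + p_t_given_nd * (1 - p_d)
print(p_t, 10000 * p_t)   # ~0.0505 -> about 505 positives in 10000 tests
```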
Bayes’ Theorem

posterior ∝ likelihood × prior


Bayes’ Theorem
Thomas Bayes (c. 1750):
how can we infer causes from effects?
How can one learn the probability of a future event if one knew
only how many times it had (or had not) occurred in the past?

as new evidence comes in → probability estimates improve.

e.g. throw a die: a guess at its value is poor (1/6).
throw the die again: is it > or < the previous one? The guess improves.
throw the die repeatedly: the guess can improve quite a lot.

Hence: initial estimate (prior belief P(h), not well formulated)
+ new evidence (support): compute the likelihood P(data|h)
→ improved estimate (the posterior P(h|data))
Example
A disease d occurs in 0.05% of population. A test is
99% effective in detecting the disease, but 5% of
the cases test positive in absence of d.
If you are tested +ve, what is the probability you have
the disease?
p(d|t) = p(d) · p(t|d) / p(t); p(t) = 0.0505
p(d|t) = 0.0005 × 0.99 / 0.0505 ≈ 0.0098 (about 1%)
if 10K people take the test, expected cases E(d) = 5
FPs ≈ 0.05 × 9995 ≈ 500
TPs ≈ 0.99 × 5 ≈ 5 → only about 5 of the 505 positives have d
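And the posterior, as a sketch in code:

```python
# Bayes' theorem for the same example.
p_d, p_t_given_d, p_t_given_nd = 0.0005, 0.99, 0.05
p_t = p_t_given_d * p_d + p_t_given_nd * (1 - p_d)

p_d_given_t = p_t_given_d * p_d / p_t
print(round(p_d_given_t, 4))          # ~0.0098: a positive test still means ~1%

# Of 10,000 people tested: ~5 true positives vs ~500 false positives.
print(0.99 * 0.0005 * 10000, 0.05 * 0.9995 * 10000)
```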
Bayesian Inference
Testing for hypothesis H given evidence E
- Evidence: based on a new observation E
- Prior: earlier evaluation of the probability of H
- Likelihood: probability of the evidence given the hypothesis,
P(E|H)

Bayesian inference:
P(H|E) = P(E|H) P(H) / P(E)
posterior = likelihood × prior / normalization (the marginal likelihood P(E))
Bayesian Inference
The fruit picked is an orange (o).
What is the probability that it came
from the blue box (B)?

P(B|o) = P(o|B) p(B) / P(o)

Given: the red box is picked 40% of the time → p(B) = 0.6

P(o) = ¾ × 0.6 + ¼ × 0.4 = 11/20

P(B|o) = ¾ × 0.6 × 20/11 = 9/11

Continuous variables:
Probability Densities

p(x ∈ (a, b)) = ∫_a^b p(x) dx; cumulative: P(z) = ∫_{−∞}^z p(x) dx

Expectations

discrete x: E[f] = Σ_x p(x) f(x);  continuous x: E[f] = ∫ p(x) f(x) dx

Frequentist approximation with an unbiased sample
(both discrete / continuous):
E[f] ≈ (1/N) Σ_{n=1}^N f(x_n)
The Gaussian Distribution
N(x | μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))

Gaussian Mean and Variance
E[x] = μ, var[x] = σ²
Central Limit Theorem
Distribution of sum of N i.i.d. random variables
becomes increasingly Gaussian for larger N.

Example: N uniform [0,1] random variables.


Gaussian Parameter Estimation

Observations assumed to be
independently drawn from the same
distribution (i.i.d.)

Likelihood function:
p(x | μ, σ²) = Π_{n=1}^N N(x_n | μ, σ²)

Maximum (Log) Likelihood:
μ_ML = (1/N) Σ_n x_n,  σ²_ML = (1/N) Σ_n (x_n − μ_ML)²
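A sketch of these estimators on synthetic data (the true parameters below are arbitrary):

```python
import numpy as np

# Maximizing the log likelihood of an i.i.d. Gaussian sample gives the
# sample mean and the (biased, 1/N) sample variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=1000)

mu_ml = x.mean()                      # mu_ML = (1/N) sum_n x_n
var_ml = ((x - mu_ml) ** 2).mean()    # sigma^2_ML = (1/N) sum_n (x_n - mu_ML)^2
print(mu_ml, var_ml)                  # approx 2.0 and 0.25
```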
Distributions over
Multi-dimensional spaces
The Multivariate Gaussian

lines of equal
probability densities
Multivariate distribution

joint distribution P(x,y) can vary considerably

even though the marginals P(x), P(y) are identical

estimating the joint distribution requires a
much larger sample: O(n^k) joint cells for k variables
with n values each, vs. n·k for the marginals
(e.g. k = 10 binary variables: 2^10 = 1024 cells vs. 20)
Marginals and Conditionals

marginals P(x), P(y) are Gaussian

conditional P(x|y) is also Gaussian
Non-intuitive in high dimensions

As dimensionality
increases, bulk of
data moves away
from center

Gaussian in polar coordinates;


p(r)δr : prob. mass inside annulus δr at r.
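A simulation sketch of this concentration effect; the dimensions and sample size are arbitrary:

```python
import numpy as np

# For a standard Gaussian in D dimensions, the mass concentrates in a thin
# shell whose radius grows roughly like sqrt(D), away from the centre.
rng = np.random.default_rng(0)
for D in (1, 10, 100):
    r = np.linalg.norm(rng.standard_normal((100_000, D)), axis=1)
    print(D, round(r.mean(), 2), round(D ** 0.5, 2))   # mean radius vs sqrt(D)
```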
Change of variable x = g(y):
p_y(y) = p_x(g(y)) |g′(y)|
Bernoulli Process

Successive Trials – e.g. Toss a coin three times:


HHH, HHT, HTH, HTT, THH, THT, TTH, TTT

Probability of k Heads:

k      0    1    2    3
P(k)  1/8  3/8  3/8  1/8

With probability of success p and failure q = 1 − p:
P(k successes in n trials) = C(n, k) p^k q^(n−k)
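A sketch that reproduces the table above from the binomial formula:

```python
from math import comb

# P(k successes in n Bernoulli trials) = C(n, k) * p**k * q**(n-k)
def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Three fair coin tosses reproduce the table: 1/8, 3/8, 3/8, 1/8.
print([binom_pmf(k, 3, 0.5) for k in range(4)])
```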
Model Selection

Cross-Validation

Quantized-Cell Classification

flow data:
red: 'homogeneous',
green: 'annular',
blue: 'laminar'
Curse of Dimensionality

general cubic polynomial in D dimensions: O(D³) parameters


Curse of Dimensionality
The unit hyper cube and unit sphere in high dimensions

At higher dimensions, vol(sphere) / vol(hypercube) → 0
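A numeric sketch of this ratio, using the standard volume formula for the unit D-ball, π^(D/2) / Γ(D/2 + 1), against the circumscribing cube [−1, 1]^D:

```python
from math import pi, gamma

# ball volume = pi**(D/2) / Gamma(D/2 + 1);  cube volume = 2**D
for D in (1, 2, 5, 10, 20):
    ratio = pi ** (D / 2) / gamma(D / 2 + 1) / 2 ** D
    print(D, ratio)
# The ratio falls towards 0: almost all the cube's volume sits in its corners.
```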


Curse of Dimensionality
Polynomial curve fitting, M = 3

Gaussian Densities in
higher dimensions
Regression with Polynomials
Curve Fitting Re-visited
Bayesian Inference
Testing for hypothesis H given evidence E

Bayesian inference:
P(H|E) = P(E|H) P(H) / P(E)
posterior ∝ likelihood × prior
Maximum Likelihood
Evidence = t; Hypothesis = poly(x, w)

Determine w_ML by minimizing the sum-of-squares error:
E(w) = ½ Σ_n {y(x_n, w) − t_n}²
Predictive Distribution
MAP: A Step towards Bayes

Determine w_MAP by minimizing the regularized
sum-of-squares error:
Ẽ(w) = ½ Σ_n {y(x_n, w) − t_n}² + (λ/2) wᵀw

MAP = maximum a posteriori


Bayesian Curve Fitting
Bayesian Predictive Distribution
Information Theory
Twenty Questions
Knower: thinks of an object (a point in a probability space)
Guesser: asks the knower to evaluate random variables

Stupid approach:

Guesser: Is it my left big toe?


Knower: No.

Guesser: Is it Valmiki?
Knower: No.

Guesser: Is it Aunt Lakshmi?


...
Expectations & Surprisal
Turn the key: expectation: the lock will open

Exam paper being shown: could be 100, could be zero.

random variable: a function from the set of marks,
with probabilities in the real interval [0, 1]

Interestingness ∝ unpredictability

surprisal(x) = −log₂ p(x)

= 0 when p(x) = 1
= 1 when p(x) = ½
→ ∞ as p(x) → 0
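A tiny sketch of the surprisal values above:

```python
from math import log2

def surprisal(p):
    """Surprisal of an outcome with probability p, in bits."""
    return log2(1 / p)

print(surprisal(1.0))   # 0.0 -- a certain outcome carries no surprise
print(surprisal(0.5))   # 1.0 -- one bit of surprise
# surprisal(p) diverges to infinity as p -> 0 (log of 0 is undefined).
```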
Expectations in data

A: 00010001000100010001. . . 0001000100010001000100010001

B: 01110100110100100110. . . 1010111010111011000101100010

C: 00011000001010100000. . . 0010001000010000001000110000

Structure in data → easy to remember


Entropy

Used in
• coding theory
• statistical physics
• machine learning
Entropy
H[x] = −Σ_x p(x) log₂ p(x)

In how many ways can N identical objects be
allocated to M bins?

Entropy is maximized when the distribution over
bins is uniform: p_i = 1/M
Entropy in Coding theory
x is discrete with 8 possible states; how many bits to
transmit the state of x?

All states equally likely: log₂ 8 = 3 bits

Coding theory: with a nonuniform distribution, shorter
codes for more probable states bring the average code
length down toward the entropy
Entropy in Twenty Questions
Intuitively: try to ask a question whose answer is 50-50

Is the first letter between A and M?

question entropy = −p(Y) log₂ p(Y) − p(N) log₂ p(N)

For both answers equiprobable:
entropy = −½ log₂(½) − ½ log₂(½) = 1.0

For P(Y) = 1/1024 ≈ 2⁻¹⁰:
entropy ≈ −(1/1024) × (−10) + ε ≈ 0.01
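A sketch verifying these two entropies (assuming 1/1024, i.e. 2⁻¹⁰, was intended):

```python
from math import log2

def binary_entropy(p):
    """Entropy of a yes/no question answered Yes with probability p, in bits."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(binary_entropy(0.5))       # 1.0 bit: the most informative question
print(binary_entropy(1 / 1024))  # ~0.011 bits: a nearly useless question
```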
