NB 13

1) The document discusses classification techniques, including the Heaviside and sigmoid (logistic) functions for generating labels from data points. 2) The Heaviside function produces hard labels while the sigmoid function produces soft labels, which can be helpful when the data is not cleanly separable into classes. 3) Maximum likelihood estimation is introduced as a technique for choosing parameters θ for a model, with the goal of maximizing the log likelihood through gradient ascent.

Notebook contains: Classification

• Heaviside function and labeling


• Sigmoid (logistic) function: useful for generating labels that have a probabilistic
interpretation
• Maximum likelihood estimation, log likelihood
• Gradient ascent
Setting: the two classes are not cleanly separable. We want to find the boundary, or discriminant, between the two classes. A linear discriminant is a line $\theta^T x = 0$, described by a slope and an intercept (the coefficient vector $\theta$).

Labels are generated by applying a function to the discriminant's score:

• Heaviside function $H(v)$: maps negative values to 0 and positive values to 1, i.e., it assigns each point to one side of the boundary. Issue: it produces a hard label assignment, which can be problematic when the data points aren't cleanly separable.

• Solution: soft assignment. Instead of assigning a label outright, assign a probability score, $\Pr[\mbox{label} \mid x] = g(\theta^T x)$, using the sigmoid (logistic) function $g(y)$, which maps any value into the range $(0, 1)$ and can therefore be interpreted as a probability.

Example: suppose the points are drawn from a combination of two normal (Gaussian) distributions with the same variance, a.k.a. a mixture of Gaussians. Then $\Pr[\mbox{label} = 1 \mid x]$, the probability that a point drawn from this mixture has label 1 given $x$, has exactly the form of a logistic function of $x$. So we can generate a soft label for each point $x$ by applying the logistic function instead of the Heaviside; "soft" means continuous rather than discrete. If we need a hard label, we can round.
TL;DR: For the classification task, we are using a linear discriminant function to assign
points a score. Then given the score, we have various options of how to generate labels, e.g.
heaviside and logistic function. Heaviside produces hard labels, while sigmoid produces soft
labels, which can be helpful when the classes aren’t clearly separable.
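To make the TL;DR concrete, here is a minimal sketch (my own illustration, not part of the original notes): it scores a few made-up augmented 2-D points with a hypothetical coefficient vector theta, then generates both hard (Heaviside) and soft (logistic) labels from the same scores.

import numpy as np

# Hypothetical augmented points (x_0, x_1, 1) and a made-up coefficient vector theta.
X = np.array([[ 0.5, -1.0, 1.0],
              [-0.2,  0.8, 1.0],
              [ 1.5,  0.1, 1.0]])
theta = np.array([[-2.0], [-0.5], [-0.55]])

score = X.dot(theta)                           # linear discriminant score for each point
hard_labels = (score > 0).astype(float)        # Heaviside: hard 0/1 assignment
soft_labels = 1.0 / (1.0 + np.exp(-score))     # logistic: probability-like soft assignment

print(np.hstack([score, hard_labels, soft_labels]))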

Maximum Likelihood Estimation

MLE is a technique to choose the parameters $\theta$ for a model $f_\theta(x)$.

Ingredients:

• $\theta$: vector of model parameters, one per predictor plus an additional constant (intercept) term;
• $X$: the $m \times (d+1)$ data matrix of $d$ predictors, with a column of all ones for the constant;
• $y$: the vector of $m$ labeled responses.

When we are given a model, we ask: what are the chances that we would have seen these particular labels under this model?

Likelihood: $L(\theta) \equiv \Pr[y \mid X, \theta]$, the joint probability of the labels $y$ conditioned on the coordinates $X$ and the model parameters $\theta$:

$$L(\theta) = \Pr[y \mid X, \theta] = \Pr[y_0, \ldots, y_{m-1} \mid \hat{x}_0, \ldots, \hat{x}_{m-1}, \theta],$$

which we would like to express in terms of individual observations. Assume the observations are independent:

$$L(\theta) = \prod_{i=0}^{m-1} \Pr[y_i \mid \hat{x}_i, \theta].$$

Directly maximizing this product is mathematically expensive (differentiating a product is messy), and numerically a product of many small numbers is a very small number.

Solution: log transform. Optimality is preserved when we take the log of the likelihood because the log is monotonic, so it cannot change where the maximum occurs:

$$\arg\max_\theta L(\theta) = \arg\max_\theta \log L(\theta), \qquad \log(ab) = \log a + \log b.$$

Therefore, instead of maximizing the likelihood we maximize the log-likelihood,

$$\ell(\theta) \equiv \log L(\theta) = \sum_{i=0}^{m-1} \log \Pr[y_i \mid \hat{x}_i, \theta].$$

For logistic regression, each term is

$$\log \Pr[y_i \mid \hat{x}_i, \theta] = y_i \log g(\hat{x}_i^T \theta) + (1 - y_i) \log\left(1 - g(\hat{x}_i^T \theta)\right),$$

which is an "if" statement written as a mathematical expression, since $y_i$ is either 0 or 1.

The gradient of the log-likelihood, $\nabla_\theta \ell(\theta)$, is needed to find the maximum.
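Here is a tiny numerical illustration (my own sketch, not from the notes) of why we work with the log-likelihood: a product of many small probabilities underflows to zero in double precision, while the sum of their logs remains a perfectly representable number.

import numpy as np

np.random.seed(0)
p = np.random.uniform(0.01, 0.5, size=2000)   # 2000 hypothetical per-observation probabilities

print(np.prod(p))          # the raw likelihood underflows to 0.0 in double precision
print(np.sum(np.log(p)))   # the log-likelihood is a large negative, but finite, number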
Gradient Ascent

Generic optimization problem: maximize some function $F(x)$, i.e., find the location $x$ of the maximum value of $F$. The derivative is zero at the maximum for scalar functions; the gradient is zero at the maximum for functions of a vector.

Linear regression example: we want to minimize the residual error $\|X\theta - y\|$, i.e., maximize the negative of the residual error. If we compute the gradient and set it to zero, we find that the optimal parameters $\theta^*$ occur at the solution of the normal equations, $X^T X \theta^* = X^T y$, which we can solve directly.

In general, the gradient points in the direction of steepest ascent. So, starting from an initial guess $x^{(0)}$ for the location of the maximum of $F(x)$, take a step $s$ in the direction of increasing slope; in one dimension, that direction is just the sign of the derivative.

How large should the step be? Take $s = \alpha \dfrac{dF}{dx}$, where $\alpha > 0$ is a user-specified parameter (it needs tuning).

When we move from $x$ to $x + s$, a Taylor expansion of the function gives

$$F(x + s) = F(x) + s \dfrac{dF}{dx} + \mathcal{O}(s^2).$$

If $s$ is small, then $s^2$ is very small, so we can ignore it along with all other higher-order terms:

$$F(x + s) \approx \underbrace{F(x)}_{\mbox{current value}} + \underbrace{s}_{\mbox{step}} \cdot \underbrace{\dfrac{dF}{dx}}_{\mbox{slope}}.$$

For higher dimensions, take the step in the direction of the (normalized) gradient:

$$s = \alpha \dfrac{\nabla_x F(x)}{\|\nabla_x F(x)\|_2}, \qquad F(x + s) \approx F(x) + s^T \nabla_x F(x).$$

TL;DR: Gradient ascent repeatedly takes small steps in the direction of the gradient to climb toward a maximum.
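As a sanity check on these notes, here is a minimal 1-D gradient ascent sketch (my own illustration): it maximizes the made-up function F(x) = -(x - 3)^2, whose maximum is at x = 3, using a fixed step size alpha.

# Maximize F(x) = -(x - 3)**2, whose derivative is dF/dx = -2*(x - 3); the maximum is at x = 3.
alpha = 0.1    # user-specified step size (needs tuning)
x = 0.0        # initial guess
for _ in range(100):
    dFdx = -2.0 * (x - 3.0)
    x = x + alpha * dFdx     # take a small step in the direction of the derivative
print(x)                     # approaches 3.0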
Logistic regression
Beyond regression, another important data analysis task is classification, in which you are given a set of labeled data points and you wish to learn a model that predicts the labels. The canonical example of a classification algorithm is logistic regression, the topic of this notebook.

Although it's called "regression" it is really a model for classification.

Here, you'll consider binary classification. Each data point belongs to one of two possible classes. By convention, we will denote these class labels by "0"
and "1." However, the ideas can be generalized to the multiclass case, i.e., more than two classes, each with its own label.

You'll also want to review from earlier notebooks the concept of gradient ascent/descent (or "steepest ascent/descent"), when optimizing a scalar function of
a vector variable.

Part 0: Introduction
This part of the notebook introduces you to the classification problem through a "geometric interpretation."

Setup
In [1]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import display, Math

%matplotlib inline

import matplotlib as mpl


mpl.rc("savefig", dpi=100) # Adjust for higher-resolution figures

A note about slicing columns from a Numpy matrix. If you want to extract a column i from a Numpy matrix A and keep it as a column vector, you need to
use the slicing notation, A[:, i:i+1]. Not doing so can lead to subtle bugs. To see why, compare the following slices.

In [2]: A = np.array ([[1, 2, 3],


[4, 5, 6],
[7, 8, 9]
], dtype=float)

print ("A[:, :] ==\n", A)


print ("\na0 := A[:, 0] ==\n", A[:, 0])
print ("\na1 := A[:, 2:3] == \n", A[:, 2:3])

print ("\nAdd columns 0 and 2?")


a0 = A[:, 0]
a1 = A[:, 2:3]
print (a0 + a1)

A[:, :] ==
[[1. 2. 3.]
[4. 5. 6.]
[7. 8. 9.]]

a0 := A[:, 0] ==
[1. 4. 7.]

a1 := A[:, 2:3] ==
[[3.]
[6.]
[9.]]

Add columns 0 and 2?


[[ 4. 7. 10.]
[ 7. 10. 13.]
[10. 13. 16.]]

Aside: Broadcasting in Numpy. What is happening in the operation, a0 + a1, shown above? When the shapes of two objects do not match, Numpy tries
to figure out if there is a natural way to make them compatible. See this supplemental notebook (./mo_numpy_mo_problems.ipynb) for information about
Numpy's "broadcasting rule," along with other Numpy tips.
Example data: Rock lobsters!
As a concrete example of a classification task, consider the results of the following experiment (http://www.stat.ufl.edu/~winner/data/lobster_survive.dat).

Some marine biologists started with a bunch of lobsters of varying sizes (size being a proxy for the stage of a lobster's development). They then
exposed these lobsters to a variety of predators. Finally, the outcome that they measured is whether the lobsters survived or not.

The data is a set of points, one point per lobster, where there is a single predictor (the lobster's size) and the response is whether the lobster survived (label
"1") or died (label "0").

For the original paper, see this link (http://downeastinstitute.org/assets/files/Published%20papers/Wilkinson%20et%20al%202015-1.p


For what we can only guess is what marine biologists do in their labs, see this image (http://i.imgur.com/dQDKgys.jpg) (or this possibly
safe-for-work alternative (http://web.archive.org/web/20120628012654/http://www.traemcneely.com/wp-content/uploads/2012/04/wpi
Lobster-Fights-e1335308484734.jpeg)).

Start by downloading this data.

In [3]: import requests


import os
import hashlib
import io

def on_vocareum():
return os.path.exists('.voc')

def download(file, local_dir="", url_base=None, checksum=None):


local_file = "{}{}".format(local_dir, file)
if not os.path.exists(local_file):
if url_base is None:
url_base = "https://cse6040.gatech.edu/datasets/"
url = "{}{}".format(url_base, file)
print("Downloading: {} ...".format(url))
r = requests.get(url)
with open(local_file, 'wb') as f:
f.write(r.content)

if checksum is not None:


with io.open(local_file, 'rb') as f:
body = f.read()
body_checksum = hashlib.md5(body).hexdigest()
assert body_checksum == checksum, \
"Downloaded file '{}' has incorrect checksum: '{}' instead of '{}'".format(local
body_
check
print("'{}' is ready!".format(file))

if on_vocareum():
URL_BASE = "https://cse6040.gatech.edu/datasets/rock-lobster/"
DATA_PATH = "../resource/asnlib/publicdata/"
else:
URL_BASE = "https://github.com/cse6040/labs-fa17/raw/master/datasets/rock-lobster/"
DATA_PATH = ""

datasets = {'lobster_survive.dat.txt': '12fc1c22ed9b4d7bf04bf7e0fec996b7',


'logreg_points_train.csv': '25bbca6105bae047ac4d62ee8b76c841',
'log_likelihood_soln.npz': '5a9e17d56937855727afa6db1cd83306',
'grad_log_likelihood_soln.npz': 'a67c00bfa95929e12d423105d8412026',
'hess_log_likelihood_soln.npz': 'b46443fbf0577423b084122503125887'}

for filename, checksum in datasets.items():


download(filename, local_dir=DATA_PATH, url_base=URL_BASE, checksum=checksum)

print("\n(All data appears to be ready.)")

'grad_log_likelihood_soln.npz' is ready!
'hess_log_likelihood_soln.npz' is ready!
'log_likelihood_soln.npz' is ready!
'lobster_survive.dat.txt' is ready!
'logreg_points_train.csv' is ready!

(All data appears to be ready.)

Here is a plot of the raw data, which was taken from this source (http://www.stat.ufl.edu/~winner/data/lobster_survive.dat).

In [4]: df_lobsters = pd.read_table('{}lobster_survive.dat.txt'.format(DATA_PATH),


sep=r'\s+', names=['CarapaceLen', 'Survived'])
display(df_lobsters.head())
print("...")
display(df_lobsters.tail())
CarapaceLen Survived

0 27 0

1 27 0

2 27 0

3 27 0

4 27 0

...

CarapaceLen Survived

154 54 1

155 54 1

156 54 1

157 54 1

158 57 1

In [5]: ax = sns.violinplot(x="Survived", y="CarapaceLen",


data=df_lobsters, inner="quart")
ax.set(xlabel="Survived? (0=no, 1=yes)",
ylabel="",
title="Body length (carpace, in mm) vs. survival");

Although the classes are distinct in the aggregate, where the median carapace (outer shell) length is around 36 mm for the lobsters that died and larger for
those that survived, they are not cleanly separable.

Notation
To develop some intuition and a classification algorithm, let's formulate the general problem and apply it to synthetic data sets.

Let the data consist of $m$ observations of $d$ continuously-valued predictors. In addition, for each data observation we observe a binary label whose value is
either 0 or 1.

Just like our convention in the linear regression case, represent each observation, or data point, by an augmented vector, $\hat{x}_i$.

That is, the point is its $d$ coordinates augmented by a dummy coordinate whose value is 1. This convention is similar to what we did in linear
regression.

We can also stack these points as rows of an $m \times (d+1)$ matrix, $X$, again, just as we did in regression.

We will take the labels to be a binary vector, $y \in \{0, 1\}^m$.
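As a concrete (made-up) illustration of this convention, the sketch below augments two hypothetical 2-D observations with a dummy coordinate of 1 and stacks their labels as a column vector, matching how the synthetic data is prepared later in this notebook.

import numpy as np

coords = np.array([[-0.23, -1.08],
                   [ 0.73, -0.92]])    # two hypothetical 2-D observations (d = 2)
y = np.array([[1],
              [0]])                    # their binary labels, as a column vector

X = np.insert(coords, 2, 1.0, axis=1)  # append the dummy coordinate 1 to each point
print(X)   # each row is an augmented point; the shape is (m, d+1) = (2, 3)
print(y)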

Example: A synthetic training set. We've pre-generated a synthetic data set consisting of labeled data points. Let's download and inspect it, first as a
table and then visually.

In [6]: df = pd.read_csv('{}logreg_points_train.csv'.format(DATA_PATH))

display(df.head())
print(" ")
print( ... )
display(df.tail())

x_0 x_1 label

0 -0.234443 -1.075960 1

1 0.730359 -0.918093 0

2 1.432270 -0.439449 0

3 0.026733 1.050300 0

4 1.879650 0.207743 0

...

x_0 x_1 label

370 1.314300 0.746001 0

371 -0.759737 -0.042944 1

372 0.683560 -0.047791 0

373 0.774747 0.743837 0

374 0.899119 1.576390 0

In [7]: def make_scatter_plot(df, x="x_0", y="x_1", hue="label",


palette={0: "red", 1: "olive"},
size=5):
sns.lmplot(x=x, y=y, hue=hue, data=df, palette=palette,
fit_reg=False)

mpl.rc("savefig", dpi=120) # Adjust for higher-resolution figures


make_scatter_plot(df)

Next, let's extract the coordinates as a Numpy matrix, points, and the labels as a Numpy column vector, labels. Mathematically, the points matrix
corresponds to $X$ and the labels vector corresponds to $y$.

In [8]: points = np.insert(df.as_matrix (['x_0', 'x_1']), 2, 1.0, axis=1)


labels = df.as_matrix(['label'])

print ("First and last 5 points:\n", '='*23, '\n', points[:5], '\n...\n', points[-5:], '\n')
print ("First and last 5 labels:\n", '='*23, '\n', labels[:5], '\n...\n', labels[-5:], '\n')

First and last 5 points:


=======================
[[-0.234443 -1.07596 1. ]
[ 0.730359 -0.918093 1. ]
[ 1.43227 -0.439449 1. ]
[ 0.0267327 1.0503 1. ]
[ 1.87965 0.207743 1. ]]
...
[[ 1.3143 0.746001 1. ]
[-0.759737 -0.0429435 1. ]
[ 0.68356 -0.0477909 1. ]
[ 0.774747 0.743837 1. ]
[ 0.899119 1.57639 1. ]]

First and last 5 labels:


=======================
[[1]
[0]
[0]
[0]
[0]
[0]]
...
[[0]
[1]
[0]
[0]
[0]]

Linear discriminants and the heaviside function


Suppose you think that the boundary between the two clusters may be represented by a line. For the synthetic data example above, I hope you agree that
such a model is not a terrible one.

A linear boundary is also known as a linear discriminant. Any point $\hat{x}$ on this line may be described by $\theta^T \hat{x} = 0$, where $\theta$ is a vector of coefficients.

For example, suppose our observations have two predictors each ($d = 2$). Let the corresponding augmented data point be $\hat{x}^T = (x_0, x_1, 1)$. Then, $\theta^T \hat{x} = 0$
means that $\theta_0 x_0 + \theta_1 x_1 + \theta_2 = 0$.

So that describes points on the line. However, given any point in the $d$-dimensional space that is not on the line, $\theta^T \hat{x}$ still produces a value: that value will
be positive on one side of the line ($\theta^T \hat{x} > 0$) or negative on the other ($\theta^T \hat{x} < 0$).

In other words, you can use the linear discriminant function, $\theta^T \hat{x}$, to generate a label for each point $\hat{x}$: just reinterpret its sign!

If you want "0" and "1" labels, the heaviside function, $H(y)$, will convert a positive value of $\theta^T \hat{x}$ to the label "1" and all other values to "0."

Exercise 0 (2 points). Given an $m \times (d+1)$ matrix of augmented points (i.e., the $X$ matrix) and a column vector $\theta$ of length $d+1$, implement a function to
compute the value of the linear discriminant at each point. That is, the function should return a (column) vector whose $i$-th entry is $\theta^T \hat{x}_i$.

In [9]: def lin_discr (X, theta):


### BEGIN SOLUTION
return X.dot(theta)
### END SOLUTION

In [10]: # Test cell: `lin_discr__check`


import random
theta_test = [random.random() for _ in range (3)]
x0_test = [random.random() for _ in range (2)]
x1_test = [(-theta_test[2] - theta_test[0]*x0) / theta_test[1] for x0 in x0_test]
X_test = np.array ([[x0*2 for x0 in x0_test] + [x0*0.5 for x0 in x0_test],
x1_test + x1_test,
[1.0, 1.0, 1.0, 1.0],]).T
print(X_test, "\n")
LD_test = lin_discr(X_test, np.array([theta_test]).T)
print (LD_test)
assert (LD_test[:2] > 0).all ()
assert (LD_test[2:] < 0).all ()
print("\n(Passed.)")

[[ 1.05035323 -0.52024135 1. ]
[ 1.49196858 -0.59315241 1. ]
[ 0.26258831 -0.52024135 1. ]
[ 0.37299215 -0.59315241 1. ]]

[[ 0.13320893]
[ 0.1892159 ]
[-0.06660446]
[-0.09460795]]

(Passed.)

Exercise 1 (2 points). Implement the heaviside function, $H(y)$. Your function should allow for an arbitrary matrix of input values and should apply the
heaviside function to each element. In the returned matrix, the elements should have a floating-point type.

Example, the code snippet

A = np.array([[-0.5, 0.2, 0.0],


[4.2, 3.14, -2.7]])
print(heaviside(A))
should display

[[ 0. 1. 0.]
[ 1. 1. 0.]]

There are several possible approaches that lead to one-line solutions. One uses only logical and arithmetic operators, which you will recall are
implemented as elementwise operations for Numpy arrays. Another uses Numpy's sign()
(http://docs.scipy.org/doc/numpy/reference/generated/numpy.sign.html) function.

In [11]: def heaviside(Y):


### BEGIN SOLUTION
return 1.0*(Y > 0.0)

# Alternative solution:
#return (np.sign(Y) > 0) * 1.0
### END SOLUTION

In [12]: # Test cell: `heaviside__check`

Y_test = np.array([[-2.3, 1.2, 7.],


[0.0, -np.inf, np.inf]])
H_Y_test = heaviside(Y_test)

print("Y:\n", Y_test)
print("\nH(Y):\n", H_Y_test)

assert (H_Y_test.astype(int) == np.array([[0, 1, 1], [0, 0, 1]])).all ()

print ("\n(Passed.)")

Y:
[[-2.3 1.2 7. ]
[ 0. -inf inf]]

H(Y):
[[0. 1. 1.]
[0. 0. 1.]]

(Passed.)

For the next exercise, we'll need the following functions.

In [13]: def heaviside_int(Y):


"""Evaluates the heaviside function, but returns integer values."""
return heaviside(Y).astype(dtype=int)

def gen_lin_discr_labels(points, theta, fun=heaviside_int):


"""
Given a set of points and the coefficients of a linear
discriminant, this function returns a set of labels for
the points with respect to this discriminant.
"""
score = lin_discr(points, theta)
labels = fun(score)
return labels

def plot_lin_discr(theta, df, x="x_0", y="x_1", hue="label",


palette={0: "red", 1: "olive"}, size=5,
linewidth=2):
lm = sns.lmplot(x=x, y=y, hue=hue, data=df, palette=palette,
size=size, fit_reg=False)

x_min, x_max = df[x].min(), df[x].max()


y_min, y_max = df[y].min(), df[y].max()

x1_min = (-theta[2][0] - theta[0][0]*x_min) / theta[1][0]


x1_max = (-theta[2][0] - theta[0][0]*x_max) / theta[1][0]
plt.plot([x_min, x_max], [x1_min, x1_max], linewidth=linewidth)

def expand_interval(x_limits, percent=10.0):


x_min, x_max = x_limits[0], x_limits[1]
if x_min < 0:
x_min *= 1.0 + 1e-2*percent
else:
x_min *= 1.0 - 1e-2*percent
if x_max > 0:
x_max *= 1.0 + 1e-2*percent
else:
x_max *= 1.0 - 1e-2*percent
return (x_min, x_max)
x_view = expand_interval((x_min, x_max))
y_view = expand_interval((y_min, y_max))
lm.axes[0,0].set_xlim(x_view[0], x_view[1])
lm.axes[0,0].set_ylim(y_view[0], y_view[1])

def mark_matches(a, b, exact=False):


"""
Given two Numpy arrays of {0, 1} labels, returns a new boolean
array indicating at which locations the input arrays have the
same label (i.e., the corresponding entry is True).

This function can consider "inexact" matches. That is, if `exact`


is False, then the function will assume the {0, 1} labels may be
regarded as the same up to a swapping of the labels. This feature
allows

a == [0, 0, 1, 1, 0, 1, 1]
b == [1, 1, 0, 0, 1, 0, 0]

to be regarded as equal. (That is, use `exact=False` when you


only care about "relative" labeling.)
"""
assert a.shape == b.shape
a_int = a.astype(dtype=int)
b_int = b.astype(dtype=int)
all_axes = tuple(range(len(a.shape)))
assert ((a_int == 0) | (a_int == 1)).all()
assert ((b_int == 0) | (b_int == 1)).all()

exact_matches = (a_int == b_int)


if exact:
return exact_matches

assert exact == False


num_exact_matches = np.sum(exact_matches)
if (2*num_exact_matches) >= np.prod(a.shape):
return exact_matches
return exact_matches == False # Invert

def count_matches(a, b, exact=False):


"""
Given two sets of {0, 1} labels, returns the number of matches.

This function can consider "inexact" matches. That is, if `exact`


is False, then the function will assume the {0, 1} labels may be
regarded as similar up to a swapping of the labels. This feature
allows

a == [0, 0, 1, 1, 0, 1, 1]
b == [1, 1, 0, 0, 1, 0, 0]

to be regarded as equal. (That is, use `exact=False` when you


only care about "relative" labeling.)
"""
matches = mark_matches(a, b, exact=exact)
return int(matches.sum())

Exercise 2 (2 points). For the synthetic data you loaded above, try by hand to find a value for $\theta$ such that the corresponding discriminant "best" separates the two clusters. Store
this in a variable named my_theta, which should be a Numpy column vector. That is, define my_theta here using a line like:

my_theta = np_col_vec([3., 0., -1.])

where np_col_vec is defined below and the list of values are your best guesses at discriminating coefficients. The test code will check that your choice
makes no more than ten misclassifications.

Hint: We found a set of coefficients that commits just 5 errors for the 375 input points.

In [14]: def np_col_vec (list_values):


"""Returns a Numpy column vector for the given list of scalar values."""
return np.array ([list_values]).T

# Redefine `my_theta` as instructed above to reduce the number of mismatches:


my_theta = np_col_vec([-1., 3., 0.]) # 123 mismatches
### BEGIN SOLUTION
my_theta = np_col_vec([-6.5, -1., -1.35]) # 5 mismatches
my_theta = np_col_vec([-2., -0.5, -0.55]) # 5 mismatches
### END SOLUTION

In [15]: # Here are the labels generated by your discriminant:


my_labels = gen_lin_discr_labels(points, my_theta)

# Here is a visual check:


num_mismatches = len(labels) - count_matches(labels, my_labels)
print ("Detected", num_mismatches, "out of", len(labels), "mismatches.")

df_matches = df.copy ()
df_matches['label'] = mark_matches (my_labels, labels).astype (dtype=int)
mpl.rc("savefig", dpi=100) # Adjust for higher-resolution figures
plot_lin_discr (my_theta, df_matches)

assert num_mismatches <= 10

Detected 5 out of 375 mismatches.

How the heaviside divides the space. The heaviside function, $H(\theta^T \hat{x})$, enforces a sharp boundary between classes around the line $\theta^T \hat{x} = 0$. The following
code produces a contour plot (https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.contourf.html) to show this effect: there will be a sharp demarcation
between 0 and 1 values, with one set of values shown as a solid dark area and the remaining as a solid light-colored area.

In [16]: x0 = np.linspace(-2., +2., 100)


x1 = np.linspace(-2., +2., 100)
x0_grid, x1_grid = np.meshgrid(x0, x1)
h_grid = heaviside(my_theta[2] + my_theta[0]*x0_grid + my_theta[1]*x1_grid)
plt.contourf(x0, x1, h_grid)

Out[16]: <matplotlib.contour.QuadContourSet at 0x7f0ec25ad978>

Part 1: The logistic (or sigmoid) function as an alternative discriminant


As the lobsters example suggests, real data are not likely to be cleanly separable, especially when the number of features we have at our disposal is
relatively small.

Since the labels are 0 or 1, you could look for a way to interpret labels as probabilities rather than as hard (0 or 1) labels. One such function is the logistic
function, also referred to as the logit or sigmoid (https://en.wikipedia.org/wiki/Sigmoid_function) function.

The logistic function $g(y)$ takes any value $y$ in the range $(-\infty, +\infty)$ and produces a value in the range $(0, 1)$. Thus, given a point $x$, we can interpret $g(\theta^T x)$ as the
conditional probability that the label is 1 given $x$, i.e., $\Pr[\mbox{label} = 1 \mid x]$.

Exercise 3 (2 points). Implement the logistic function. Inspect the resulting plot of $g(y)$ in 1-D and then the contour plot of $g(\theta^T x)$. Your function should
accept a Numpy matrix of values, Y, and apply the sigmoid elementwise.

In [17]: def logistic(Y):


### BEGIN SOLUTION
return 1.0 / (1.0 + np.exp (-Y))
### END SOLUTION

# Plot your function for a 1-D input.


y_values = np.linspace(-10, 10, 100)

mpl.rc("savefig", dpi=120) # Adjust for higher-resolution figures


sns set style("darkgrid")
sns.set_style( darkgrid )
y_pos = y_values[y_values > 0]
y_rem = y_values[y_values <= 0]
plt.plot(y_rem, heaviside (y_rem), 'b')
plt.plot(y_pos, heaviside (y_pos), 'b')
plt.plot(y_values, logistic (y_values), 'r--')
#sns.regplot (y_values, heaviside (y_values), fit_reg=False)
#sns.regplot (y_values, logistic (y_values), fit_reg=False)

Out[17]: [<matplotlib.lines.Line2D at 0x7f0ec257c2b0>]

In [18]: # Test cell: `logistic__check`

assert logistic(np.log(3)) == 0.75


assert logistic(-np.log(3)) == 0.25

g_grid = logistic(my_theta[2] + my_theta[0]*x0_grid + my_theta[1]*x1_grid)


plt.contourf (x0, x1, g_grid)
assert ((np.round(g_grid) - h_grid).astype(int) == 0).all()

print ("\n(Passed.)")

(Passed.)

Exercise 4 (optional; ungraded). Consider a set of 1-D points generated by a mixture of Gaussians. That is, suppose that there are two Gaussian distributions
over the 1-dimensional variable, $x$, that have the same variance ($\sigma^2$) but different means ($\mu_0$ and $\mu_1$). Show that the conditional probability of
observing a point labeled "1" given $x$ may be written as,

$$\Pr[\mbox{label} = 1 \mid x] = g(\theta_0 + \theta_1 x),$$

for a suitable definition of $\theta_0$ and $\theta_1$.

Hints. Since the points come from Gaussian distributions,

$$\Pr[x \mid \mbox{label} = k] = \dfrac{1}{\sigma \sqrt{2 \pi}} \exp\left(-\dfrac{(x - \mu_k)^2}{2 \sigma^2}\right), \qquad k \in \{0, 1\}.$$

To rewrite $\Pr[\mbox{label} = 1 \mid x]$ in terms of $\Pr[x \mid \mbox{label} = 1]$, recall Bayes's rule (also: Bayes's theorem (https://en.wikipedia.org/wiki/Bayes%27_theorem)):

$$\Pr[\mbox{label} = 1 \mid x] = \dfrac{\Pr[x \mid \mbox{label} = 1] \, \Pr[\mbox{label} = 1]}{\Pr[x]},$$

where the denominator can be expanded as

$$\Pr[x] = \Pr[x \mid \mbox{label} = 0] \, \Pr[\mbox{label} = 0] + \Pr[x \mid \mbox{label} = 1] \, \Pr[\mbox{label} = 1].$$

You may assume the prior probabilities of observing a 0 or 1 are given by $\Pr[\mbox{label} = 0] \equiv p_0$ and $\Pr[\mbox{label} = 1] \equiv p_1$.

The point of this derivation is to show you that the definition of the logistic function does not just arise out of thin air. It also hints that you
might expect a final algorithm for logistic regression based on using $g(\theta^T x)$ as the discriminant to work well when the classes are best
explained as a mixture of Gaussians.
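The derivation itself is left to you, but here is a small numerical check of the claim (my own sketch; the parameter values are made up, and the formulas for theta_0 and theta_1 are the ones the hinted Bayes-rule calculation yields, stated here as assumptions): the exact posterior for a two-Gaussian mixture coincides with a logistic function of a linear score.

import numpy as np

mu0, mu1, sigma = -1.0, 2.0, 1.5    # hypothetical means and shared standard deviation
p0, p1 = 0.4, 0.6                   # hypothetical prior class probabilities

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-6, 8, 200)

# Exact posterior via Bayes's rule.
posterior = p1 * gauss_pdf(x, mu1, sigma) / (p0 * gauss_pdf(x, mu0, sigma) + p1 * gauss_pdf(x, mu1, sigma))

# Logistic form, with theta_0 and theta_1 as the derivation suggests (assumed here, not proven).
theta1 = (mu1 - mu0) / sigma**2
theta0 = (mu0**2 - mu1**2) / (2 * sigma**2) + np.log(p1 / p0)
logistic_form = 1.0 / (1.0 + np.exp(-(theta0 + theta1 * x)))

print(np.allclose(posterior, logistic_form))   # True: the two curves coincide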

Generalizing to $d$ dimensions. The preceding exercise can be generalized to $d$ dimensions. Let $\mu_0$ and $\mu_1$ be $d$-dimensional points. Then a similar derivation shows that $\Pr[\mbox{label} = 1 \mid x]$ is a logistic function of a linear function of $x$.

Exercise 5 (optional; ungraded). Verify the following properties of the logistic function, $g(y)$.

Answers. In all of the derivations below, we use the fact that .

(P1). Multiply the numerator and denominator by .

(P2). Start with the right-hand side, , apply some algebra, and then apply (P1).

(P3). By direct calculation and application of (P1):

(P4). By the chain rule and application of (P3):

(P5). By combining (P2), variable substitution and the chain rule, and (P4),

Part 2: Determining the discriminant via maximum likelihood estimation


Previously, you determined $\theta$ for our synthetic dataset by hand. Can you compute a good $\theta$ automatically? One of the standard techniques in statistics is to
perform a maximum likelihood estimation (MLE) of a model's parameters, $\theta$. Indeed, you may have seen or used MLE to derive the normal equations for linear
regression in a more "statistically principled" way.

"Likelihood" as an objective function. MLE derives from the following idea. Consider the joint probability of observing all of the labels, given t
the parameters, :

Suppose these observations are independent and identically distributed (i.i.d.). Then the joint probability can be factored as the product of indiv
probabilities,

The maximum likelihood principle says that you should choose $\theta$ to maximize the chances (or "likelihood") of seeing these particular observations. The likelihood
is now an objective function to maximize.

For both mathematical and numerical reasons, we will use the logarithm of the likelihood, or log-likelihood, as the objective function instead. Let

$$\ell(\theta; y, X) \equiv \log L(\theta; y, X).$$

We are using the symbol $\log$, which could be taken in any convenient base, such as the natural logarithm ($\ln x$) or the information-theoretic
base-two logarithm ($\log_2 x$).

The MLE fitting procedure then consists of two steps:

For the problem at hand, decide on a model of $\Pr[y_i \mid \hat{x}_i, \theta]$.

Run any optimization procedure to find the $\theta$ that maximizes $\ell(\theta; y, X)$.

Part 3: MLE for logistic regression


Let's say you have decided that the logistic function, $g(\theta^T \hat{x}_i)$, is a good model of the probability of producing a label $y_i$ given the observation $\hat{x}_i$.
Under the i.i.d. assumption, you can interpret each label as the result of flipping a coin, or a Bernoulli trial (https://en.wikipedia.org/wiki/Bernou), where
the probability of success ($y_i = 1$) is defined as $g_i \equiv g(\theta^T \hat{x}_i)$. Thus,

$$\Pr[y_i \mid \hat{x}_i, \theta] = g_i^{y_i} \left(1 - g_i\right)^{1 - y_i}.$$

The log-likelihood in turn becomes,

$$\ell(\theta; y, X) = \sum_{i=0}^{m-1} \left[ y_i \log g_i + (1 - y_i) \log (1 - g_i) \right].$$

You can write the log-likelihood more compactly in the language of linear algebra.

Convention 1. Let $u$ be a column vector of all ones, with its length inferred from context. Let $A$ be an $m \times n$ matrix
whose columns are denoted $a_0, a_1, \ldots, a_{n-1}$. Then, the sum of the columns is

$$\sum_{j=0}^{n-1} a_j = A u.$$

Convention 2. Let $A$ be any matrix and let $f(y)$ be any function that we have defined by default to accept a scalar argument and produce a scalar
result. For instance, $f(y) = \log y$ or $f(y) = g(y)$. Then, assume that $f(A)$ applies $f$ elementwise to $A$, returning a matrix whose elements are
$f(a_{ij})$.

With these notational conventions, convince yourself that these are two different ways to write the log-likelihood for logistic regression:

$$\ell(\theta; y, X) = y^T X \theta + u^T \log g(-X\theta) = y^T \log g(X\theta) + (u - y)^T \log g(-X\theta).$$
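As a quick sanity check on the two algebraic forms above, here is a sketch (my own, using randomly generated data rather than the notebook's dataset) showing that both expressions evaluate to the same number.

import numpy as np

np.random.seed(0)
m, d = 50, 3
X_chk = np.random.randn(m, d + 1); X_chk[:, -1] = 1.0        # random augmented points
y_chk = np.random.randint(0, 2, size=(m, 1)).astype(float)   # random 0/1 labels
theta_chk = np.random.randn(d + 1, 1)

z = X_chk.dot(theta_chk)
g = 1.0 / (1.0 + np.exp(-z))
u = np.ones((m, 1))

form1 = y_chk.T.dot(z) + u.T.dot(np.log(1.0 - g))              # y^T X theta + u^T log g(-X theta)
form2 = y_chk.T.dot(np.log(g)) + (u - y_chk).T.dot(np.log(1.0 - g))
print(np.allclose(form1, form2))                               # True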

Exercise 6 (2 points). Implement the log-likelihood function in Python by defining a function with the following signature:

def log_likelihood (theta, y, X):


...

To compute the elementwise logarithm of a matrix or vector, use Numpy's log


(https://docs.scipy.org/doc/numpy/reference/generated/numpy.log.html) function.

In [19]: def log_likelihood(theta, y, X):


### BEGIN SOLUTION
u = np.ones((len (X), 1)) # column of all ones
z = X.dot(theta)
return y.T.dot(z) + u.T.dot(np.log(logistic(-z)))

def log_likelihood_alt(theta, y, X):


z = X.dot(theta)
g = logistic(z)
return y.T.dot(np.log(g)) + (1.0-y).T.dot(np.log(1.0-g))
### END SOLUTION

In [20]: # Test cell: `log_likelihood__check`

if False:
d_soln = 10
m_soln = 1000
theta_soln = np.random.random ((d_soln+1, 1)) * 2.0 - 1.0
y_soln = np.random.randint (low=0, high=2, size=(m_soln, 1))
X_soln = np.random.random ((m_soln, d_soln+1)) * 2.0 - 1.0
X_soln[:, 0] = 1.0
L_soln = log_likelihood (theta_soln, y_soln, X_soln)
np.savez_compressed('log_likelihood_soln',
d_soln, m_soln, theta_soln, y_soln, X_soln, L_soln)

npzfile_soln = np.load('{}log_likelihood_soln.npz'.format(DATA_PATH))
d_soln = npzfile_soln['arr_0']
m_soln = npzfile_soln['arr_1']
theta_soln = npzfile_soln['arr_2']
y_soln = npzfile_soln['arr_3']
X_soln = npzfile_soln['arr_4']
L_soln = npzfile_soln['arr_5']

L_you = log_likelihood(theta_soln, y_soln, X_soln)


your_err = np.max(np.abs(L_you/L_soln - 1.0))
display(Math(r'\left\|\dfrac{\mathcal{L}_{\tiny \mbox{yours}} - \mathcal{L}_{\tiny \mbox{solution}}}{\mathcal{L}_{\tiny \mbox{solution}}}\right\|_\infty \approx %g' % your_err))
assert your_err <= 1e-12

print ("\n(Passed.)")

(Passed.)

Part 4: Computing the MLE solution via gradient ascent: theory


To optimize the log-likelihood with respect to the parameters, $\theta$, you want to "set the derivative to zero" and solve for $\theta$.

For example, recall that in the case of linear regression via least squares minimization, carrying out this process produced an analytic solution for the
parameters, which was to solve the normal equations.

Unfortunately, for logistic regression---or for most log-likelihoods you are likely to ever write down---you cannot usually derive an analytic solution. Instead,
you will need to resort to numerical optimization procedures.

Gradient ascent, in 1-D. A simple numerical algorithm to maximize a function is gradient ascent (or steepest ascent). If instead you are minimizing the
function, then the equivalent procedure is gradient (or steepest) descent. Here is the basic idea in 1-D.

Suppose we wish to find the maximum of a scalar function $f(x)$ in one dimension. At the maximum, $\dfrac{df(x)}{dx} = 0$.

Suppose instead that $\dfrac{df(x)}{dx} \neq 0$, and consider the value of $f$ at a nearby point, $x + s$, as given approximately by a truncated Taylor series:

$$f(x + s) \approx f(x) + s \dfrac{df(x)}{dx} + \mathcal{O}(s^2).$$

To make progress toward maximizing $f(x)$, you'd like to choose $s$ so that $f(x + s) > f(x)$. One way is to choose $s \equiv \alpha \dfrac{df(x)}{dx}$, where $\alpha > 0$ is

"small:"

$$f(x + s) \approx f(x) + \alpha \left(\dfrac{df(x)}{dx}\right)^2 + \mathcal{O}(\alpha^2).$$

If $\alpha$ is small enough, then you can neglect the $\mathcal{O}(\alpha^2)$ term and $f(x + s)$ will be larger than $f(x)$, thus making progress toward finding a maximum.

This scheme is the basic idea: starting from some initial guess $x^{(0)}$, refine the guess by taking a small step in the direction of the derivative, i.e.,

$$x^{(t+1)} \leftarrow x^{(t)} + \alpha \left. \dfrac{df(x)}{dx} \right|_{x = x^{(t)}}.$$

Gradient ascent in higher dimensions. Now suppose $x$ is a vector rather than a scalar. Then the value of $f$ at a nearby point $x + s$, where $s$ is a small displacement,
becomes

$$f(x + s) \approx f(x) + s^T \nabla_x f(x) + \mathcal{O}(\|s\|^2),$$

where $\nabla_x f(x)$ is the gradient of $f$ with respect to $x$. As in the 1-D case, you want a step $s$ such that $f(x + s) > f(x)$. To make as much progress as possible,
let's choose $s$ to be parallel to $\nabla_x f(x)$, that is, proportional to the gradient:

$$s \equiv \alpha \dfrac{\nabla_x f(x)}{\|\nabla_x f(x)\|}.$$

Again, $\alpha$ is a fudge (or "gentle nudge?") factor. You need to choose it to be small enough that the high-order terms of the Taylor approximation become
negligible, yet large enough that you can make reasonable progress.

The gradient ascent procedure applied to MLE. Applying gradient ascent to the problem of maximizing the log-likelihood leads to the following procedure:

Start with some initial guess, $\theta^{(0)}$.

At each iteration $t$ of the procedure, let $\theta^{(t)}$ be the current guess.
Compute the direction of steepest ascent by evaluating the gradient, $\Delta_t \equiv \nabla_{\theta} \, \ell\!\left(\theta^{(t)}; y, X\right)$.

Define the step to be $s_t \equiv \alpha \Delta_t$, where $\alpha$ is a suitably chosen fudge factor.

Take a step in the direction of the gradient, $\theta^{(t+1)} \leftarrow \theta^{(t)} + s_t$.

Stop when the parameters don't change much or after some maximum number of steps.

This procedure should remind you of one you saw in a prior notebook (the least mean square algorithm for online regression!). As was true at that time, the
tricky bit is how to choose $\alpha$.

There is at least one difference between this procedure and the online regression procedure you learned earlier. Here, we are optimizing using
the full dataset rather than processing data points one at a time. (That is, the step iteration variable $t$ used above is not used in exactly the
same way as the step iteration in LMS.)

Another question is, how do we know this procedure will converge to the global maximum, rather than, say, a local maximum? For that, you'd
need a deeper analysis of a specific $\ell(\theta; y, X)$, to show, for instance, that it is concave in $\theta$.
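A full concavity proof is beyond this notebook, but here is a quick numerical spot-check (my own sketch; it assumes the log_likelihood function and the synthetic points and labels defined earlier are in scope): it evaluates the log-likelihood along one arbitrary line through parameter space and confirms the discrete second differences are non-positive. This is evidence along that line, not a proof.

import numpy as np

v = np.array([[-1.0], [-0.3], [-0.4]])     # an arbitrary direction in parameter space
ts = np.linspace(-3, 3, 61)
ll_slice = np.array([float(log_likelihood(t * v, labels.astype(float), points)) for t in ts])

second_diffs = np.diff(ll_slice, n=2)      # discrete analogue of the second derivative
print((second_diffs <= 1e-8).all())        # True along this particular line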
Implementing logistic regression using MLE by gradient ascent
Let's apply the gradient ascent procedure to the logistic regression problem, in order to determine a good $\theta$.

Exercise 7 (optional; ungraded). Show that $\nabla_\theta \, \ell(\theta; y, X) = X^T \left( y - g(X\theta) \right)$.

Answer. From (V2),

Thus,

Let's consider each term in turn.

For the first term, apply the gradient identities to obtain

For the second term, recall the scalar interpretation of .

The -th component of the gradient is

Let's evaluate the summand:

Thus, the -th component of the gradient becomes

In other words, the full gradient vector is

Putting the two components together,

Exercise 8 (2 points). Implement a function to compute the gradient of the log-likelihood. Your function should have the signature,

def grad_log_likelihood (theta, y, X):


...

In [21]: def grad_log_likelihood(theta, y, X):


"""Returns the gradient of the log-likelihood."""
### BEGIN SOLUTION
return X.T.dot(y - logistic(X.dot(theta)))
### END SOLUTION

In [22]: # Test cell: `grad_log_likelihood_code__check`

if False:
d_grad_soln = 6
m_grad_soln = 399
theta_grad_soln = np.random.random((d_grad_soln+1, 1)) * 2.0 - 1.0
y_grad_soln = np.random.randint(low=0, high=2, size=(m_grad_soln, 1))
X_grad_soln = np.random.random((m_grad_soln, d_grad_soln+1)) * 2.0 - 1.0
X_grad_soln[:, 0] = 1.0
L_grad_soln = grad_log_likelihood(theta_grad_soln, y_grad_soln, X_grad_soln)
np.savez_compressed('grad_log_likelihood_soln',
d_grad_soln, m_grad_soln, theta_grad_soln, y_grad_soln, X_grad_soln, L_grad_soln)

npzfile_grad_soln = np.load ('{}grad_log_likelihood_soln.npz'.format(DATA_PATH))


d_grad_soln = npzfile_grad_soln['arr_0']
m_grad_soln = npzfile_grad_soln['arr_1']
theta_grad_soln = npzfile_grad_soln['arr_2']
y_grad_soln = npzfile_grad_soln['arr_3']
X_grad_soln = npzfile_grad_soln['arr_4']
L_grad_soln = npzfile_grad_soln['arr_5']

L_grad_you = grad_log_likelihood (theta_grad_soln, y_grad_soln, X_grad_soln)


your_grad_err = np.max (np.abs (L_grad_you/L_grad_soln - 1.0))
display (Math (r'\left\|\dfrac{\nabla\, \mathcal{L}_{\tiny \mbox{yours}} - \nabla\,\mathcal{L}_{\tiny \mbox{solution}}}{\nabla\, \mathcal{L}_{\tiny \mbox{solution}}}\right\|_\infty \approx %g' % your_grad_err))
assert your_grad_err <= 1e-12

print ("\n(Passed.)")

(Passed.)

Exercise 9 (4 points). Implement the gradient ascent procedure to determine $\theta$, and try it out on the sample data.

Recall the procedure (repeated from above):

Start with some initial guess, $\theta^{(0)}$.

At each iteration $t$ of the procedure, let $\theta^{(t)}$ be the current guess.
Compute the direction of steepest ascent by evaluating the gradient, $\Delta_t \equiv \nabla_{\theta} \, \ell\!\left(\theta^{(t)}; y, X\right)$.

Define the step to be $s_t \equiv \alpha \Delta_t$, where $\alpha$ is a suitably chosen fudge factor.

Take a step in the direction of the gradient, $\theta^{(t+1)} \leftarrow \theta^{(t)} + s_t$.

Stop when the parameters don't change much or after some maximum number of steps.

In the code skeleton below, we've set up a loop to run a fixed number, MAX_STEP, of gradient ascent steps. Also, when normalizing the step, use the two-norm.

In your solution, we'd like you to store all guesses in the matrix thetas, so that you can later see how the values evolve. To extract a
particular column t, use the notation, thetas[:, t:t+1]. This notation is necessary to preserve the "shape" of the column as a column
vector.

In [23]: ALPHA = 0.1


MAX_STEP = 250

# Get the data coordinate matrix, X, and labels vector, y


X = points
y = labels.astype(dtype=float)

# Store *all* guesses, for subsequent analysis


thetas = np.zeros((3, MAX_STEP+1))

for t in range(MAX_STEP):
# Fill in the code to compute thetas[:, t+1:t+2]
### BEGIN SOLUTION
theta_t = thetas[:, t:t+1]
delta_t = grad_log_likelihood(theta_t, y, X)
delta_t = delta_t / np.linalg.norm(delta_t, ord=2)
thetas[:, t+1:t+2] = theta_t + ALPHA*delta_t
### END SOLUTION

theta_ga = thetas[:, MAX_STEP:]


print("Your (hand) solution:", my_theta.T.flatten())
print("Computed solution:", theta_ga.T.flatten())

print("\n=== Comparisons ===")


display(Math (r'\dfrac{\theta_0}{\theta_2}:'))
print("Your manual (hand-picked) solution is", my_theta[0]/my_theta[2], \
", vs. MLE (via gradient ascent), which is", theta_ga[0]/theta_ga[2])
display(Math (r'\dfrac{\theta_1}{\theta_2}:'))
print("Your manual (hand-picked) solution is", my_theta[1]/my_theta[2], \
", vs. MLE (via gradient ascent), which is", theta_ga[1]/theta_ga[2])

print("\n=== The MLE solution, visualized ===")


ga_labels = gen_lin_discr_labels(points, theta_ga)
df_ga = df.copy()
df_ga['label'] = mark_matches(ga_labels, labels).astype (dtype=int)
plot_lin_discr(theta_ga, df_ga)

Your (hand) solution: [-2. -0.5 -0.55]


Computed solution: [-15.57666992 -3.03431905 -3.79328353]

=== Comparisons ===

Your manual (hand-picked) solution is [3.63636364] , vs. MLE (via gradient ascent), which is [4.

Your manual (hand-picked) solution is [0.90909091] , vs. MLE (via gradient ascent), which is [0.

=== The MLE solution, visualized ===


In [24]: print ("\n=== Mismatch counts ===")

my_labels = gen_lin_discr_labels (points, my_theta)


my_mismatches = len (labels) - count_matches (labels, my_labels)
print ("Your manual (hand-picked) solution has", num_mismatches, "mismatches.")

ga_labels = gen_lin_discr_labels (points, theta_ga)


ga_mismatches = len (labels) - count_matches (labels, ga_labels)
print ("The MLE method produces", ga_mismatches, "mismatches.")

assert ga_mismatches <= 8


print ("\n(Passed.)")

=== Mismatch counts ===


Your manual (hand-picked) solution has 5 mismatches.
The MLE method produces 6 mismatches.

(Passed.)

The gradient ascent trajectory. Let's take a look at how gradient ascent progresses. (You might try changing the ALPHA parameter and see how it affects the
results.)

In [25]: n_ll_grid = 100


x1 = np.linspace(-8., 0., n_ll_grid)

x2 = np.linspace(-8., 0., n_ll_grid)


x1_grid, x2_grid = np.meshgrid(x1, x2)

ll_grid = np.zeros((n_ll_grid, n_ll_grid))


for i1 in range(n_ll_grid):
for i2 in range(n_ll_grid):
theta_i1_i2 = np.array([[thetas[0, MAX_STEP]],
[x1_grid[i1][i2]],
[x2_grid[i1][i2]]])
ll_grid[i1][i2] = log_likelihood(theta_i1_i2, y, X)

# Determine a color scale


def v(x):
return -np.log(np.abs(x))
return x

def v_inv(v):
return -np.exp(np.abs(v))
return v

v_min, v_max = v(ll_grid.min()), v(ll_grid.max())


v_range = v_max - v_min
v_breaks = v_inv(np.linspace(v_min, v_max, 20))

p = plt.contourf(x1, x2, ll_grid, v_breaks, cmap=plt.cm.get_cmap("winter"))


plt.xlabel('theta_0')
plt.ylabel('theta_1')
plt.title('log-likelihood')
plt.colorbar()
plt.plot(thetas[1, :], thetas[2, :], 'k*-')

Out[25]: [<matplotlib.lines.Line2D at 0x7f0ec2484dd8>]


Part 5 (optional): Numerical optimization via Newton's method
The fudge factor, $\alpha$, in gradient ascent should give you pause. Can you choose the step size or direction in a better or more principled way?

One idea is Newton's method (http://www.math.uiuc.edu/documenta/vol-ismp/13_deuflhard-peter.pdf), summarized below.

This part of the notebook has additional exercises, but they are all worth 0 points. (So if you submit something that is incomplete or fails the
test cells, you won't lose any points.)

The basic idea, in 1-D. Suppose you start at a point $x$ and, assuming you are not yet at the optimum, you have decided to take a step of size $s$, landing
you at $x + s$.

How do you choose $s$? In gradient ascent, you do so by following the gradient, which points in an "upward" direction.

In Newton's method, you will pick $s$ in a different way: choose $s$ to maximize $f(x + s)$.

That should strike you as circular; the whole problem from the beginning was to maximize $f$. The trick, in this case, is not to maximize $f(x + s)$ directly;
rather, let's replace it with some approximation, $\hat{f}(s)$, and maximize $\hat{f}(s)$ instead.

A simple choice for $\hat{f}(s)$ is a quadratic function in $s$. This choice is motivated by two factors: (a) since it's quadratic, it should have some sort of extreme
point (and hopefully an actual maximum), and (b) it is a higher-order approximation than a linear one, and so hopefully more accurate than a linear one as
well.

A natural quadratic choice is the truncated Taylor series, $\hat{f}(s) \equiv f(x) + s \dfrac{df(x)}{dx} + \dfrac{s^2}{2} \dfrac{d^2 f(x)}{dx^2}$. To maximize $\hat{f}(s)$, take its derivative with respect to $s$ and then solve for the $s$ such that $\dfrac{d\hat{f}(s)}{ds} = 0$:

$$\dfrac{d\hat{f}(s)}{ds} = \dfrac{df(x)}{dx} + s \dfrac{d^2 f(x)}{dx^2} = 0 \quad \Longrightarrow \quad s = -\dfrac{df(x)/dx}{d^2 f(x)/dx^2}.$$

That is, the optimal step is the negative of the first derivative of $f$ divided by its second derivative.
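Here is a tiny 1-D illustration of that step rule (my own sketch, with a made-up objective): it maximizes f(x) = x - exp(x), whose maximum is at x = 0, by repeatedly applying s = -f'(x)/f''(x).

import numpy as np

# Maximize f(x) = x - exp(x): f'(x) = 1 - exp(x), f''(x) = -exp(x); the maximum is at x = 0.
x = 1.0                       # initial guess
for t in range(6):
    fp  = 1.0 - np.exp(x)     # first derivative at the current guess
    fpp = -np.exp(x)          # second derivative at the current guess
    x += -fp / fpp            # Newton step: s = -f'(x) / f''(x)
    print(t, x)               # converges toward 0 very quickly (quadratically)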

Generalizing to higher dimensions. To see how this procedure works in higher dimensions, you will need not only the gradient of $f$, but also its Hessian,
which is the moral equivalent of a second derivative.

Definition: the Hessian. Let $f(v)$ be a function that takes a vector $v$ of length $n$ as input and returns a scalar. The Hessian of $f$ with respect to $v$ is an $n \times n$ matrix, $H(v)$,
whose entries are all $n^2$ possible second-order partial derivatives with respect to the components of $v$. That is, let $h_{ij}$ be the $(i, j)$ element of $H(v)$; then we
define

$$h_{ij} \equiv \dfrac{\partial^2 f(v)}{\partial v_i \, \partial v_j}.$$

Armed with a Hessian, the Newton step is defined as follows, by direct analogy to the 1-D case. First, the Taylor series approximation of $f(x + s)$ for
multidimensional variables is, as it happens,

$$\hat{f}(s) \equiv f(x) + s^T \nabla_x f(x) + \dfrac{1}{2} s^T H(x) \, s.$$

As in the 1-D case, we want to find an extreme point of $\hat{f}(s)$. Taking its "derivative" (gradient), $\nabla_s \hat{f}(s)$, and setting it to 0 yields,

$$\nabla_s \hat{f}(s) = \nabla_x f(x) + H(x) \, s = 0 \quad \Longrightarrow \quad H(x) \, s = -\nabla_x f(x).$$

In other words, to choose the next step $s$, Newton's method suggests that you must solve a system of linear equations, where the matrix is the Hessian
and the right-hand side is the negative gradient of $f$.

Summary: Newton's method. Summarizing the main ideas from above, Newton's method to maximize the scalar objective function $f(x)$, where $x$ is a vector,
consists of the following steps:

Start with some initial guess $x^{(0)}$.

At step $t$, compute the search direction $s^{(t)}$ by solving $H\!\left(x^{(t)}\right) s^{(t)} = -\nabla_x f\!\left(x^{(t)}\right)$.
Compute a new (and hopefully improved) guess by the update, $x^{(t+1)} \leftarrow x^{(t)} + s^{(t)}$.

Implementing logistic regression via a Newton-based MLE


To perform MLE for the logistic regression model using Newton's method, you need both the gradient of the log-likelihood as well as its Hessian. You
already know how to compute the gradient from the preceding exercises; so what about the Hessian?

Notationally, that calculation will be a little bit easier to write down and program with the following definition.

Definition: Elementwise product. Let $A$ and $B$ be $m \times n$ matrices. Denote the elementwise product of $A$ and $B$ by $C = A \odot B$. That is,
if $C = A \odot B$, then element $c_{ij} = a_{ij} \cdot b_{ij}$.

If $A$ is $m \times n$ but $B$ is instead just $m \times 1$, then we will "auto-extend" $B$. Put differently, if $B$ has the same number of rows as $A$ but only 1 column, then we
take $C = A \odot B$ to have elements $c_{ij} = a_{ij} \cdot b_{i}$.

In Python, you can use np.multiply() (http://docs.scipy.org/doc/numpy/reference/generated/numpy.multiply.html) for elementwise multiplication of
Numpy arrays.

In [26]: A = np.array([[1, 2, 3],


[4, 5, 6]])
B = np.array([[-1, 2, -3],
[4, -5, 6]])

print(np.multiply(A, B)) # elementwise product


print()
print(np.multiply(A, B[:, 0:1])) # "auto-extend" version

[[ -1 4 -9]
[ 16 -25 36]]

[[-1 -2 -3]
[16 20 24]]

Exercise 10 (optional; ungraded). Show that the Hessian of the log-likelihood for logistic regression is

$$H_\theta\!\left(\ell(\theta; y, X)\right) = -\left(X \odot g(X\theta)\right)^T \left(X \odot g(-X\theta)\right).$$

Exercise 11 (0 points). Implement a function to compute the Hessian of the log-likelihood. The signature of your function should be,

def hess_log_likelihood (theta, y, X):


...

In [27]: ### BEGIN SOLUTION


def hess_log_likelihood(theta, y, X):
"""Returns the Hessian of the log-likelihood."""
z = X.dot(theta)
A = np.multiply(X, logistic(z))
B = np.multiply(X, logistic(-z))
return -A.T.dot(B)
### END SOLUTION

In [28]: # Test cell: `hess_log_likelihood__check`

if False:
d_hess_soln = 20
m_hess_soln = 501
theta_hess_soln = np.random.random ((d_hess_soln+1, 1)) * 2.0 - 1.0
y_hess_soln = np.random.randint (low=0, high=2, size=(m_hess_soln, 1))
X_hess_soln = np.random.random ((m_hess_soln, d_hess_soln+1)) * 2.0 - 1.0
X_hess_soln[:, 0] = 1.0
L_hess_soln = hess_log_likelihood (theta_hess_soln, y_hess_soln, X_hess_soln)
np.savez_compressed ('hess_log_likelihood_soln',
d_hess_soln, m_hess_soln, theta_hess_soln, y_hess_soln, X_hess_soln, L_hess_soln)

npzfile_hess_soln = np.load ('{}hess_log_likelihood_soln.npz'.format(DATA_PATH))


d_hess_soln = npzfile_hess_soln['arr_0']
m_hess_soln = npzfile_hess_soln['arr_1']
theta_hess_soln = npzfile_hess_soln['arr_2']
y_hess_soln = npzfile_hess_soln['arr_3']
X_hess_soln = npzfile_hess_soln['arr_4']
L_hess_soln = npzfile_hess_soln['arr_5']

L_hess_you = hess_log_likelihood(theta_hess_soln, y_hess_soln, X_hess_soln)


your_hess_err = np.max(np.abs(L_hess_you/L_hess_soln - 1.0))
display(Math(r'\left\|\dfrac{H_{\tiny \mbox{yours}} - H_{\tiny \mbox{solution}}}{H_{\tiny \mbox{solution}}}\right\|_\infty \approx %g' % your_hess_err))
assert your_hess_err <= 1e-12

print ("\n(Passed.)")
(Passed.)

Exercise 12 (0 points). Finish the implementation of a Newton-based MLE procedure for the logistic regression problem.

In [29]: MAX_STEP = 10

# Get the data coordinate matrix, X, and labels vector, y


X = points
y = labels.astype(dtype=float)

# Store *all* guesses, for subsequent analysis


thetas_newt = np.zeros((3, MAX_STEP+1))

for t in range(MAX_STEP):
### BEGIN SOLUTION
theta_t = thetas_newt[:, t:t+1]
g_t = grad_log_likelihood(theta_t, y, X)
H_t = hess_log_likelihood(theta_t, y, X)
s_t = np.linalg.solve(H_t, -g_t)
thetas_newt[:, t+1:t+2] = theta_t + s_t
### END SOLUTION

theta_newt = thetas_newt[:, MAX_STEP:]


print ("Your (hand) solution:", my_theta.T.flatten())
print ("Computed solution:", theta_newt.T.flatten())

print ("\n=== Comparisons ===")


display (Math (r'\dfrac{\theta_0}{\theta_2}:'))
print ("Your manual (hand-picked) solution is", my_theta[0]/my_theta[2], \
", vs. MLE (via Newton's method), which is", theta_newt[0]/theta_newt[2])
display (Math (r'\dfrac{\theta_1}{\theta_2}:'))
print ("Your manual (hand-picked) solution is", my_theta[1]/my_theta[2], \
", vs. MLE (via Newton's method), which is", theta_newt[1]/theta_newt[2])

print ("\n=== The MLE solution, visualized ===")


newt_labels = gen_lin_discr_labels(points, theta_newt)
df_newt = df.copy()
df_newt['label'] = mark_matches(newt_labels, labels).astype (dtype=int)
plot_lin_discr(theta_newt, df_newt)

Your (hand) solution: [-2. -0.5 -0.55]


Computed solution: [-15.63082207 -3.04255951 -3.76500606]

=== Comparisons ===

Your manual (hand-picked) solution is [3.63636364] , vs. MLE (via Newton's method), which is [4.

Your manual (hand-picked) solution is [0.90909091] , vs. MLE (via Newton's method), which is [0.

=== The MLE solution, visualized ===

In [30]: # Test cell: `logreg_mle_newt__check`

print ("\n=== Mismatch counts ===")

my_labels = gen_lin_discr_labels (points, my_theta)


my_mismatches = len (labels) - count_matches (labels, my_labels)
i t ("Y l (h d i k d) l ti h " i t h " i t h ")
print ("Your manual (hand-picked) solution has", num_mismatches, "mismatches.")

newt_labels = gen_lin_discr_labels (points, theta_newt)


newt_mismatches = len (labels) - count_matches (labels, newt_labels)
print ("The MLE+Newton method produces", newt_mismatches, "mismatches.")

assert newt_mismatches <= ga_mismatches


print ("\n(Passed.)")

=== Mismatch counts ===


Your manual (hand-picked) solution has 5 mismatches.
The MLE+Newton method produces 6 mismatches.

(Passed.)

The following cell creates a contour plot of the log-likelihood, as done previously in this notebook. Add code to display the trajectory taken by Newton's
method.

In [31]: p = plt.contourf(x1, x2, ll_grid, cmap=plt.cm.get_cmap("winter"))


plt.xlabel('theta_0')
plt.ylabel('theta_1')
plt.title('Trajectory taken by Newton\'s method')
plt.colorbar()
plt.plot(thetas_newt[1, :], thetas_newt[2, :], 'k*-')

Out[31]: [<matplotlib.lines.Line2D at 0x7f0ec2609b38>]

How many steps does this optimization procedure take compared to gradient ascent? What is the tradeoff?
