0% found this document useful (0 votes)

390 views27 pages

NB 13

1) The document discusses classification techniques, including the Heaviside and sigmoid (logistic) functions for generating labels from data points. 2) The Heaviside function produces hard labels while the sigmoid function produces soft labels, which can be helpful when the data is not cleanly separable into classes. 3) Maximum likelihood estimation is introduced as a technique for choosing parameters θ for a model, with the goal of maximizing the log likelihood through gradient ascent.

Uploaded by

Alexandra Prokhorova

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

390 views27 pages

NB 13

Uploaded by

Alexandra Prokhorova

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 27

Notebook contains: Classi cation

• Heaviside function and labeling

• Sigmoid (logistic) function: useful for generating labels that have a probabilistic
interpretation
• Maximum likelihood estimation, log likelihood
• Gradient ascent
notcleanly
separable

it
pt

IIIIate
i
t

fi
aces

Awe wanttofindtheboundary
discriminantbetweenthe
twoclasses
oxo Ex a 0
x 1 8 att
hope intercept

byapplying
discriminant boundary

Hearts.IE
Issue it
Etffi
produces hardlabelassignmentand this can be
problematic when datapoint's aren't cleanly separable
Solution soft assignment Instead of assigning a label we assign
a probability score
Pr label x L x

avisidfundy
He V5

g y
O
maps the values into the range 0,1
a
maps negative values 967 12
to zero and positive
values to 1 o sigmoid can be interpreted as a probability

of a distribution
Probability density combination of two normal
IL Gaussian distributions

1
aka Mixture ofGaussian

9 points distributions
Suppose the

the same variance

Pr label07 1 03 9 FCA É
fromthedistributionwillhavelabel l given

we generate asoftlabel for

each point xbyapplying the
logisticfunction instead of the
heaviside
If we need a hardlabel wecanround
t
means continuous rather
than discreet

GEN
Eet GE al
TL;DR: For the classi cation task, we are using a linear discriminant function to assign
points a score. Then given the score, we have various options of how to generate labels, e.g.
heaviside and logistic function. Heaviside produces hard labels, while sigmoid produces soft
labels, which can be helpful when the classes aren’t clearly separable.

Maximum Likelihood Estimation

a technique
to choose 0 for a fx

at
Y oÉ ÉdÉts
Yaidmeters Fat T
vector of my datamatrix d dpredictors a
entriesadditional
constan
responseslabeled
E OI
lastcolumn of1thedata
matrixtall values

model we are a 0
When we are
given
a
that we
given
would have seen these particular
What are the chances
labels under this model
Likelihood Pr XO joint probability of labels
conditional

Ly given the coordinatesXandthemodel

parameters 0
y
L Q Pr XO Pr Cyo to Imy O
y sym i
T likelihood in terms of individual
observations
Assume observations are independent

IIe.ly IIer10s
Directly
19
maximizing the product is
product derivatives issue
mathematically
expensive

of smallnumbers is a verysmallnumber
product
Solution Log transform
optimatoisprserved when wetake log of of likelihood
because logs are monotonic so it can't changewhere
the max value occurs
argzaxtlOI argzaxl.gl

Dloglogatlogt
a.b
likelihood we are
Therefore instead of maximizing the
Prof maximizing log likelihood
ICO I log L Q
argzaxffeoger.cl
QE argmgxL.CO argqaxlogIIECOj

ÉÉI
goif implementation of amathematica
statementsince yeis
either

loger Q yrlogget l yr loge ga

gradient of the log
likelihood is needed
to
Tologer o find the maximum
Gradient Ascent
Generic optimization problem
some function Fx

y
maximum
location of the
value of F derivative is zero at the
maximum for scalarfunctions

y
gradient is zero
Tat vectorme aximum for
functions

Linear regression example

XOlli here we want to minimize the residual error
f O
Ily i e maximize the negative of
the residual error
If compute the gradientand
we set it to zero we willfind that
the optimal parameters D occur at the solution of the normal equation
XX QE
Tty
Fyfe is a simple function to invert
In the higher dimensions
gradient points in the direction
of steepestascent
Pf dirofstee.petasant

ay
slope

j
initialguess location of the maximum
value of Fx
directionthatfaces
pg
S is the direction of increasing slope
dimension the direction is
I in one just
sign of the
derivative
s I y a
How large should the step be
S I Ex
user specified

L
parameter
tuning
Eating
2 0
When we move from X to xts
foxes Fx Ey T T S t O S2 Taylor expansion of thefunction
s is smallthey
if we me s is smallthey
5 will also bevery 2 also has to be small
small so we canignore
it as well as any other that
herorderterms as

fixes s stepsize
off
in
ppg currentvalue
slope

fees
fetÉ g
EE Eye Efd
Mitase

For higherdimensions
0 54 d
5 Il TL DR
TxFH 112
f res F FR Stix FR
Logistic regression
Beyond regression, another important data analysis task is classiﬁcation, in which you are given a set of labeled data points and you wish to lea
the labels. The canonical example of a classiﬁcation algorithm is logistic regression, the topic of this notebook.

Although it's called "regression" it is really a model for classiﬁcation.

Here, you'll consider binary classiﬁcation. Each data point belongs to one of possible classes. By convention, we will denote these class
and "1." However, the ideas can be generalized to the multiclass case, i.e., , with labels .

You'll also want to review from earlier notebooks the concept of gradient ascent/descent (or "steepest ascent/descent"), when optimizing a sca
a vector variable.

Part 0: Introduction
This part of the notebook introduces you to the classiﬁcation problem through a "geometric interpretation."

Setup
In [1]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import display, Math

%matplotlib inline

import matplotlib as mpl

mpl.rc("savefig", dpi=100) # Adjust for higher-resolution figures

A note about slicing columns from a Numpy matrix. If you want to extract a column i from a Numpy matrix A and keep it as a column vecto
use the slicing notation, A[:, i:i+1]. Not doing so can lead to subtle bugs. To see why, compare the following slices.

In [2]: A = np.array ([[1, 2, 3],

[4, 5, 6],
[7, 8, 9]
], dtype=float)

print ("A[:, :] ==\n", A)

print ("\na0 := A[:, 0] ==\n", A[:, 0])
print ("\na1 := A[:, 2:3] == \n", A[:, 2:3])

print ("\nAdd columns 0 and 2?")

a0 = A[:, 0]
a1 = A[:, 2:3]
print (a0 + a1)

A[:, :] ==
[[1. 2. 3.]
[4. 5. 6.]
[7. 8. 9.]]

a0 := A[:, 0] ==
[1. 4. 7.]

a1 := A[:, 2:3] ==
[[3.]
[6.]
[9.]]

Add columns 0 and 2?

[[ 4. 7. 10.]
[ 7. 10. 13.]
[10. 13. 16.]]

Aside: Broadcasting in Numpy. What is happening in the operation, a0 + a1, shown above? When the shapes of two objects do not match,
to figure out if there is a natural way to make them compatible. See this supplemental notebook (./mo_numpy_mo_problems.ipynb) for informat
Numpy's "broadcasting rule," along with other Numpy tips.
Example data: Rock lobsters!
As a concrete example of a classification task, consider the results of the following experiment (http://www.stat.ufl.edu/~winner/data/lobster_su

Some marine biologists started with a bunch of lobsters of varying sizes (size being a proxy for the stage of a lobster's development). They then
exposed these lobsters to a variety of predators. Finally, the outcome that they measured is whether the lobsters survived or not.

The data is a set of points, one point per lobster, where there is a single predictor (the lobster's size) and the response is whether the lobsters s
"1") or died (label "0").

For the original paper, see this link (http://downeastinstitute.org/assets/ﬁles/Published%20papers/Wilkinson%20et%20al%202015-1.p

For what we can only guess is what marine biologists do in their labs, see this image (http://i.imgur.com/dQDKgys.jpg) (or this possibly
safe-for-work alternative (http://web.archive.org/web/20120628012654/http://www.traemcneely.com/wp-content/uploads/2012/04/wpi
Lobster-Fights-e1335308484734.jpeg)).

Start by downloading this data.

In [3]: import requests

import os
import hashlib
import io

def on_vocareum():
return os.path.exists('.voc')

def download(file, local_dir="", url_base=None, checksum=None):

local_file = "{}{}".format(local_dir, file)
if not os.path.exists(local_file):
if url_base is None:
url_base = "https://cse6040.gatech.edu/datasets/"
url = "{}{}".format(url_base, file)
print("Downloading: {} ...".format(url))
r = requests.get(url)
with open(local_file, 'wb') as f:
f.write(r.content)

if checksum is not None:

with io.open(local_file, 'rb') as f:
body = f.read()
body_checksum = hashlib.md5(body).hexdigest()
assert body_checksum == checksum, \
"Downloaded file '{}' has incorrect checksum: '{}' instead of '{}'".format(local
body_
check
print("'{}' is ready!".format(file))

if on_vocareum():
URL_BASE = "https://cse6040.gatech.edu/datasets/rock-lobster/"
DATA_PATH = "../resource/asnlib/publicdata/"
else:
URL_BASE = "https://github.com/cse6040/labs-fa17/raw/master/datasets/rock-lobster/"
DATA_PATH = ""

datasets = {'lobster_survive.dat.txt': '12fc1c22ed9b4d7bf04bf7e0fec996b7',

'logreg_points_train.csv': '25bbca6105bae047ac4d62ee8b76c841',
'log_likelihood_soln.npz': '5a9e17d56937855727afa6db1cd83306',
'grad_log_likelihood_soln.npz': 'a67c00bfa95929e12d423105d8412026',
'hess_log_likelihood_soln.npz': 'b46443fbf0577423b084122503125887'}

for filename, checksum in datasets.items():

download(filename, local_dir=DATA_PATH, url_base=URL_BASE, checksum=checksum)

print("\n(All data appears to be ready.)")

'grad_log_likelihood_soln.npz' is ready!
'hess_log_likelihood_soln.npz' is ready!
'log_likelihood_soln.npz' is ready!
'lobster_survive.dat.txt' is ready!
'logreg_points_train.csv' is ready!

(All data appears to be ready.)

Here is a plot of the raw data, which was taken from this source (http://www.stat.uﬂ.edu/~winner/data/lobster_survive.dat).

In [4]: df_lobsters = pd.read_table('{}lobster_survive.dat.txt'.format(DATA_PATH),

sep=r'\s+', names=['CarapaceLen', 'Survived'])
display(df_lobsters.head())
print("...")
display(df_lobsters.tail())
CarapaceLen Survived

0 27 0

1 27 0

2 27 0

3 27 0

4 27 0

...

CarapaceLen Survived

154 54 1

155 54 1

156 54 1

157 54 1

158 57 1

In [5]: ax = sns.violinplot(x="Survived", y="CarapaceLen",

data=df_lobsters, inner="quart")
ax.set(xlabel="Survived? (0=no, 1=yes)",
ylabel="",
title="Body length (carpace, in mm) vs. survival");

Although the classes are distinct in the aggregate, where the median carapace (outer shell) length is around 36 mm for the lobsters that died an
those that survived, they are not cleanly separable.

Notation
To develop some intuition and a classiﬁcation algorithm, let's formulate the general problem and apply it to synthetic data sets.

Let the data consist of observations of continuously-valued predictors. In addition, for each data observation we observe a binary label wh
either 0 or 1.

Just like our convention in the linear regression case, represent each observation, or data point, by an augumented vector, ,

That is, the point is the coordinates augmented by an initial dummy coordinate whose value is 1. This convention is similar to what we did in
regression.

We can also stack these points as rows of a matrix, , again, just as we did in regression:

We will take the labels to be a binary vector, .

Example: A synthetic training set. We've pre-generated a synethetic data set consisting of labeled data points. Let's download and inspect it
table and then visually.

In [6]: df = pd.read_csv('{}logreg_points_train.csv'.format(DATA_PATH))

display(df.head())
print(" ")
print( ... )
display(df.tail())

x_0 x_1 label

0 -0.234443 -1.075960 1

1 0.730359 -0.918093 0

2 1.432270 -0.439449 0

3 0.026733 1.050300 0

4 1.879650 0.207743 0

...

x_0 x_1 label

370 1.314300 0.746001 0

371 -0.759737 -0.042944 1

372 0.683560 -0.047791 0

373 0.774747 0.743837 0

374 0.899119 1.576390 0

In [7]: def make_scatter_plot(df, x="x_0", y="x_1", hue="label",

palette={0: "red", 1: "olive"},
size=5):
sns.lmplot(x=x, y=y, hue=hue, data=df, palette=palette,
fit_reg=False)

mpl.rc("savefig", dpi=120) # Adjust for higher-resolution figures

make_scatter_plot(df)

Next, let's extract the coordinates as a Numpy matrix of points and the labels as a Numpy column vector labels. Mathematically, the point
corresponds to and the labels vector corresponds to .

In [8]: points = np.insert(df.as_matrix (['x_0', 'x_1']), 2, 1.0, axis=1)

labels = df.as_matrix(['label'])

print ("First and last 5 points:\n", '='*23, '\n', points[:5], '\n...\n', points[-5:], '\n')
print ("First and last 5 labels:\n", '='*23, '\n', labels[:5], '\n...\n', labels[-5:], '\n')

First and last 5 points:

=======================
[[-0.234443 -1.07596 1. ]
[ 0.730359 -0.918093 1. ]
[ 1.43227 -0.439449 1. ]
[ 0.0267327 1.0503 1. ]
[ 1.87965 0.207743 1. ]]
...
[[ 1.3143 0.746001 1. ]
[-0.759737 -0.0429435 1. ]
[ 0.68356 -0.0477909 1. ]
[ 0.774747 0.743837 1. ]
[ 0.899119 1.57639 1. ]]

First and last 5 labels:

=======================
[[1]
[0]
[0]
[0]
[0]
[0]]
...
[[0]
[1]
[0]
[0]
[0]]

Linear discriminants and the heaviside function

Suppose you think that the boundary between the two clusters may be represented by a line. For the synthetic data example above, I hope you
such a model is not a terrible one.

A linear boundary is also known as a linear discriminant. Any point on this line may be described by , where is a vector of coeﬃcients:

For example, suppose our observations have two predictors each ( ). Let the corresponding data point be . Then,
means that

So that describes points on the line. However, given any point in the -dimensional space that is not on the line, still produces a value: th
be positive on one side of the line ( ) or negative on the other ( ).

In other words, you can use the linear discriminant function, , to generate a label for each point : just reinterpret its sign!

If you want "0" and "1" labels, the heaviside function, , will convert a positive to the label "1" and all other values to "0."

Exercise 0 (2 points). Given the a matrix of augmented points (i.e., the matrix) and a column vector of length , implemen
compute the value of the linear discriminant at each point. That is, the function should return a (column) vector where the .

In [9]: def lin_discr (X, theta):

### BEGIN SOLUTION
return X.dot(theta)
### END SOLUTION

In [10]: # Test cell: `lin_discr__check`

import random
theta_test = [random.random() for _ in range (3)]
x0_test = [random.random() for _ in range (2)]
x1_test = [(-theta_test[2] - theta_test[0]*x0) / theta_test[1] for x0 in x0_test]
X_test = np.array ([[x0*2 for x0 in x0_test] + [x0*0.5 for x0 in x0_test],
x1_test + x1_test,
[1.0, 1.0, 1.0, 1.0],]).T
print(X_test, "\n")
LD_test = lin_discr(X_test, np.array([theta_test]).T)
print (LD_test)
assert (LD_test[:2] > 0).all ()
assert (LD_test[2:] < 0).all ()
print("\n(Passed.)")

[[ 1.05035323 -0.52024135 1. ]
[ 1.49196858 -0.59315241 1. ]
[ 0.26258831 -0.52024135 1. ]
[ 0.37299215 -0.59315241 1. ]]

[[ 0.13320893]
[ 0.1892159 ]
[-0.06660446]
[-0.09460795]]

(Passed.)

Exercise 1 (2 points). Implement the heaviside function, . Your function should allow for an arbitrary matrix of input values and should app
heaviside function to each element. In the returned matrix, the elements should have a ﬂoating-point type.

Example, the code snippet

A = np.array([[-0.5, 0.2, 0.0],

[4.2, 3.14, -2.7]])
print(heaviside(A))
should display

[[ 0. 1. 0.]
[ 1. 1. 0.]]

There are several possible approaches that lead to one-line solutions. One uses only logical and arithmetic operators, which you will re
implemented as elementwise operations for Numpy arrays. Another uses Numpy's sign()
(http://docs.scipy.org/doc/numpy/reference/generated/numpy.sign.html) function.

In [11]: def heaviside(Y):

### BEGIN SOLUTION
return 1.0*(Y > 0.0)

# Alternative solution:
#return (np.sign(Y) > 0) * 1.0
### END SOLUTION

In [12]: # Test cell: `heaviside__check`

Y_test = np.array([[-2.3, 1.2, 7.],

[0.0, -np.inf, np.inf]])
H_Y_test = heaviside(Y_test)

print("Y:\n", Y_test)
print("\nH(Y):\n", H_Y_test)

assert (H_Y_test.astype(int) == np.array([[0, 1, 1], [0, 0, 1]])).all ()

print ("\n(Passed.)")

Y:
[[-2.3 1.2 7. ]
[ 0. -inf inf]]

H(Y):
[[0. 1. 1.]
[0. 0. 1.]]

(Passed.)

For the next exercise, we'll need the following functions.

In [13]: def heaviside_int(Y):

"""Evaluates the heaviside function, but returns integer values."""
return heaviside(Y).astype(dtype=int)

def gen_lin_discr_labels(points, theta, fun=heaviside_int):

"""
Given a set of points and the coefficients of a linear
discriminant, this function returns a set of labels for
the points with respect to this discriminant.
"""
score = lin_discr(points, theta)
labels = fun(score)
return labels

def plot_lin_discr(theta, df, x="x_0", y="x_1", hue="label",

palette={0: "red", 1: "olive"}, size=5,
linewidth=2):
lm = sns.lmplot(x=x, y=y, hue=hue, data=df, palette=palette,
size=size, fit_reg=False)

x_min, x_max = df[x].min(), df[x].max()

y_min, y_max = df[y].min(), df[y].max()

x1_min = (-theta[2][0] - theta[0][0]*x_min) / theta[1][0]

x1_max = (-theta[2][0] - theta[0][0]*x_max) / theta[1][0]
plt.plot([x_min, x_max], [x1_min, x1_max], linewidth=linewidth)

def expand_interval(x_limits, percent=10.0):

x_min, x_max = x_limits[0], x_limits[1]
if x_min < 0:
x_min *= 1.0 + 1e-2*percent
else:
x_min *= 1.0 - 1e-2*percent
if x_max > 0:
x_max *= 1.0 + 1e-2*percent
else:
x_max *= 1.0 + 1e-2*percent
return (x_min, x_max)
x_view = expand_interval((x_min, x_max))
y_view = expand_interval((y_min, y_max))
lm.axes[0,0].set_xlim(x_view[0], x_view[1])
lm.axes[0,0].set_ylim(y_view[0], y_view[1])

def mark_matches(a, b, exact=False):

"""
Given two Numpy arrays of {0, 1} labels, returns a new boolean
array indicating at which locations the input arrays have the
same label (i.e., the corresponding entry is True).

This function can consider "inexact" matches. That is, if `exact`

is False, then the function will assume the {0, 1} labels may be
regarded as the same up to a swapping of the labels. This feature
allows

a == [0, 0, 1, 1, 0, 1, 1]
b == [1, 1, 0, 0, 1, 0, 0]

to be regarded as equal. (That is, use `exact=False` when you

only care about "relative" labeling.)
"""
assert a.shape == b.shape
a_int = a.astype(dtype=int)
b_int = b.astype(dtype=int)
all_axes = tuple(range(len(a.shape)))
assert ((a_int == 0) | (a_int == 1)).all()
assert ((b_int == 0) | (b_int == 1)).all()

exact_matches = (a_int == b_int)

if exact:
return exact_matches

assert exact == False

num_exact_matches = np.sum(exact_matches)
if (2*num_exact_matches) >= np.prod(a.shape):
return exact_matches
return exact_matches == False # Invert

def count_matches(a, b, exact=False):

"""
Given two sets of {0, 1} labels, returns the number of mismatches.

This function can consider "inexact" matches. That is, if `exact`

is False, then the function will assume the {0, 1} labels may be
regarded as similar up to a swapping of the labels. This feature
allows

a == [0, 0, 1, 1, 0, 1, 1]
b == [1, 1, 0, 0, 1, 0, 0]

to be regarded as equal. (That is, use `exact=False` when you

only care about "relative" labeling.)
"""
matches = mark_matches(a, b, exact=exact)
return int(matches.sum())

Exercise 2 (2 points). For the synthetic data you loaded above, try by hand to ﬁnd a value for such that "best" separates the two clu
this in a variable named my_theta, which should be a Numpy column vector. That is, deﬁne my_theta here using a line like:

my_theta = np_col_vec([3., 0., -1.])

where np_col_vec is defined below and the list of values are your best guesses at discriminating coefficients. The test code will check that yo
makes no more than ten misclassifications.

Hint: We found a set of coeﬃcients that commits just 5 errors for the 375 input points.

In [14]: def np_col_vec (list_values):

"""Returns a Numpy column vector for the given list of scalar values."""
return np.array ([list_values]).T

# Redefine `my_theta` as instructed above to reduce the number of mismatches:

my_theta = np_col_vec([-1., 3., 0.]) # 123 mismatches
### BEGIN SOLUTION
my_theta = np_col_vec([-6.5, -1., -1.35]) # 5 mismatches
my_theta = np_col_vec([-2., -0.5, -0.55]) # 5 mismatches
### END SOLUTION

In [15]: # Here are the labels generated by your discriminant:

my_labels = gen_lin_discr_labels(points, my_theta)

# Here is a visual check:

num_mismatches = len(labels) - count_matches(labels, my_labels)
print ("Detected", num_mismatches, "out of", len(labels), "mismatches.")

df_matches = df.copy ()
df_matches['label'] = mark_matches (my_labels, labels).astype (dtype=int)
mpl.rc("savefig", dpi=100) # Adjust for higher-resolution figures
plot_lin_discr (my_theta, df_matches)

assert num_mismatches <= 10

Detected 5 out of 375 mismatches.

How the heaviside divides the space. The heaviside function, , enforces a sharp boundary between classes around the line
code produces a contour plot (https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.contourf.html) to show this eﬀect: there will be a sharp d
between 0 and 1 values, with one set of values shown as a solid dark area and the remaining as a solid light-colored area.

In [16]: x0 = np.linspace(-2., +2., 100)

x1 = np.linspace(-2., +2., 100)
x0_grid, x1_grid = np.meshgrid(x0, x1)
h_grid = heaviside(my_theta[2] + my_theta[0]*x0_grid + my_theta[1]*x1_grid)
plt.contourf(x0, x1, h_grid)

Out[16]: <matplotlib.contour.QuadContourSet at 0x7f0ec25ad978>

Part 1: The logistic (or sigmoid) function as an alternative discriminant

As the lobsters example suggests, real data are not likely to be cleanly separable, especially when the number of features we have at our dispo
small.

Since the labels are 0 or 1, you could look for a way to interpret labels as probabilities rather than as hard (0 or 1) labels. One such function is th
function, also referred to as the logit or sigmoid (https://en.wikipedia.org/wiki/Sigmoid_function) function.

The logistic function takes any value in the range and produces a value in the range . Thus, given a value , we can interpret
conditional probability that the label is 1 given , i.e., .

Exercise 3 (2 points). Implement the logistic function. Inspect the resulting plot of in 1-D and then the contour plot of . Your functi
accept a Numpy matrix of values, Y, and apply the sigmoid elementwise.

In [17]: def logistic(Y):

### BEGIN SOLUTION
return 1.0 / (1.0 + np.exp (-Y))
### END SOLUTION

# Plot your function for a 1-D input.

y_values = np.linspace(-10, 10, 100)

mpl.rc("savefig", dpi=120) # Adjust for higher-resolution figures

sns set style("darkgrid")
sns.set_style( darkgrid )
y_pos = y_values[y_values > 0]
y_rem = y_values[y_values <= 0]
plt.plot(y_rem, heaviside (y_rem), 'b')
plt.plot(y_pos, heaviside (y_pos), 'b')
plt.plot(y_values, logistic (y_values), 'r--')
#sns.regplot (y_values, heaviside (y_values), fit_reg=False)
#sns.regplot (y_values, logistic (y_values), fit_reg=False)

Out[17]: [<matplotlib.lines.Line2D at 0x7f0ec257c2b0>]

In [18]: # Test cell: `logistic__check`

assert logistic(np.log(3)) == 0.75

assert logistic(-np.log(3)) == 0.25

g_grid = logistic(my_theta[2] + my_theta[0]x0_grid + my_theta[1]x1_grid)

plt.contourf (x0, x1, g_grid)
assert ((np.round(g_grid) - h_grid).astype(int) == 0).all()

print ("\n(Passed.)")

(Passed.)

Exercise 4 (optional; ungraded). Consider a set of 1-D points generated by a mixture of Gaussians. That is, suppose that there are two Gaussia
over the 1-dimensional variable, , that have the same variance ( ) but diﬀerent means ( and ). Show that the conditional
observing a point labeled "1" given may be written as,

for a suitable deﬁnition of and .

Hints. Since the points come from Gaussian distributions,

To rewrite in terms of , recall Bayes's rule (also: Bayes's theorem (https://en.wikipedia.org/wiki/Bayes%27_theorem)):

where the denominator can be expanded as

You may assume the prior probabilities of observing a 0 or 1 are given by and .

The point of this derivation is to show you that the deﬁnition of the logistic function does not just arise out of thin air. It also hints that yo
might expect a ﬁnal algorithm for logistic regression based on using as the discriminant will work well when the classes are best
explained as a mixture of Gaussians.

Generalizing to -dimensions. The preceding exercise can be generalized to -dimensions. Let and be -dimensional points. Then
Exercise 5 (optional; ungraded). Verify the following properties of the logistic function, .

Answers. In all of the derivations below, we use the fact that .

(P1). Multiply the numerator and denominator by .

(P2). Start with the right-hand side, , apply some algebra, and then apply (P1).

(P3). By direct calculation and application of (P1):

(P4). By the chain rule and application of (P3):

(P5). By combining (P2), variable substitution and the chain rule, and (P4),

Part 2: Determining the discriminant via maximum likelihood estimation

Previously, you determined for our synthetic dataset by hand. Can you compute a good automatically? One of the standard techniques in s
perform a maximum likelihood estimation (MLE) of a model's parameters, . Indeed, you may have seen or used MLE to derive the normal equa
regression in a more "statistically principled" way.

"Likelihood" as an objective function. MLE derives from the following idea. Consider the joint probability of observing all of the labels, given t
the parameters, :

Suppose these observations are independent and identically distributed (i.i.d.). Then the joint probability can be factored as the product of indiv
probabilities,

The maximum likelihood principle says that you should choose to maximize the chances (or "likelihood") of seeing these particular observatio
is now an objective function to maximize.

For both mathematical and numerical reasons, we will use the logarithm of the likelihood, or log-likelihood, as the objective function instead. Le

We are using the symbol , which could be taken in any convenient base, such as the natural logarithm ( ) or the information theo
base-two logarithm ( ).

The MLE ﬁtting procedure then consists of two steps:

For the problem at hand, decide on a model of .

Run any optimization procedure to ﬁnd the that maximizes .

Part 3: MLE for logistic regression

g g
Let's say you have decided that the logistic function, , is a good model of the probability of producing a label given the o
Under the i.i.d. assumption, you can interpret the label as the result of ﬂipping a coin, or a Bernoulli trial (https://en.wikipedia.org/wiki/Bernou
the probability of success ( ) is deﬁned as . Thus,

The log-likelihood in turn becomes,

You can write the log-likelihood more compactly in the language of linear algebra.

Convention 1. Let be a column vector of all ones, with its length inferred from context. Let be a
where denote its columns. Then, the sum of the columns is

Convention 2. Let be any matrix and let be any function that we have deﬁned by default to accept a scalar argument and pro
result. For instance, or . Then, assume that applies elementwise to , returning a matrix whose eleme
.

With these notational conventions, convince yourself that these are two diﬀerent ways to write the log-likelihood for logistic regression.

Exercise 6 (2 points). Implement the log-likelihood function in Python by deﬁning a function with the following signature:

def log_likelihood (theta, y, X):

...

To compute the elementwise logarithm of a matrix or vector, use Numpy's log

(https://docs.scipy.org/doc/numpy/reference/generated/numpy.log.html) function.

In [19]: def log_likelihood(theta, y, X):

### BEGIN SOLUTION
u = np.ones((len (X), 1)) # column of all ones
z = X.dot(theta)
return y.T.dot(z) + u.T.dot(np.log(logistic(-z)))

def log_likelihood_alt(theta, y, X):

z = X.dot(theta)
g = logistic(z)
return y.T.dot(np.log(g)) + (1.0-y).T.dot(np.log(1.0-g))
### END SOLUTION

In [20]: # Test cell: `log_likelihood__check`

if False:
d_soln = 10
m_soln = 1000
theta_soln = np.random.random ((d_soln+1, 1)) * 2.0 - 1.0
y_soln = np.random.randint (low=0, high=2, size=(m_soln, 1))
X_soln = np.random.random ((m_soln, d_soln+1)) * 2.0 - 1.0
X_soln[:, 0] = 1.0
L_soln = log_likelihood (theta_soln, y_soln, X_soln)
np.savez_compressed('log_likelihood_soln',
d_soln, m_soln, theta_soln, y_soln, X_soln, L_soln)

npzfile_soln = np.load('{}log_likelihood_soln.npz'.format(DATA_PATH))
d_soln = npzfile_soln['arr_0']
m_soln = npzfile_soln['arr_1']
theta_soln = npzfile_soln['arr_2']
y_soln = npzfile_soln['arr_3']
X_soln = npzfile_soln['arr_4']
L_soln = npzfile_soln['arr_5']

L_you = log_likelihood(theta_soln, y_soln, X_soln)

your err = np max(np abs(L you/L soln 1 0))
your_err = np.max(np.abs(L_you/L_soln - 1.0))
display(Math(r'\left\|\dfrac{\mathcal{L}_{\tiny \mbox{yours}} - \mathcal{L}_{\tiny \mbox{solutio
cal{L}_{\tiny \mbox{solution}}}\right\|_\infty \approx %g' % your_err))
assert your_err <= 1e-12

print ("\n(Passed.)")

(Passed.)

Part 4: Computing the MLE solution via gradient ascent: theory

To optimize the log-likelihood with respect to the parameters, , you want to "set the derivative to zero" and solve for .

For example, recall that in the case of linear regression via least squares minimization, carrying out this process produced an analytic solution fo
parameters, which was to solve the normal equations.

Unfortunately, for logistic regression---or for most log-likelihoods you are likely to ever write down---you cannot usually derive an analytic solut
you will need to resort to numerical optimization procedures.

Gradient ascent, in 1-D. A simple numerical algorithm to maximize a function is gradient ascent (or steepest ascent). If instead you are minimiz
function, then the equivalent procedure is gradient (or steepest) descent. Here is the basic idea in 1-D.

Suppose we wish to ﬁnd the maximum of a scalar function in one dimension. At the maximum, .

Suppose instead that and consider the value of at a nearby point, , as given approximately by a truncated Taylor series:

To make progress toward maximizing , you'd like to choose so that . One way is to choose , where

"small:"

If is small enough, then you can neglect the term and will be larger than , thus making progress toward ﬁnding a maximum

This scheme is the basic idea: starting from some initial guess , reﬁne the guess by taking a small step in the direction of the derivative, i.e.,

Gradient ascent in higher dimensions. Now suppose is a vector rather than a scalar. Then the value of at a nearby point , where
becomes

where is the gradient of with respect to . As in the 1-D case, you want a step such that . To make as much progres
let's choose to be parallel to , that is, proportional to the gradient:

Again, is a fudge (or "gentle nudge?") factor. You need to choose it to be small enough that the high-order terms of the Taylor approximation
negligible, yet large enough that you can make reasonable progress.

The gradient ascent procedure applied to MLE. Applying gradient ascent to the problem of maximizing the log-likelihood leads to the followi

Start with some initial guess, .

At each iteration of the procedure, let be the current guess.
Compute the direction of steepest ascent by evaluating the gradient, .

Deﬁne the step to be , where is a suitably chosen fudge factor.

Take a step in the direction of the gradient, .

Stop when the parameters don't change much or after some maximum number of steps.

This procedure should remind you of one you saw in a prior notebook (the least mean square algorithm for online regression!). As was true at th
tricky bit is how to choose .

There is at least one diﬀerence between this procedure and the online regression procedure you learned earlier. Here, we are optimizing
the full dataset rather than processing data points one at a time. (That is, the step iteration variable used above is not used in exactly
same way as the step iteration in LMS.)

Another question is, how do we know this procedure will converge to the global maximum, rather than, say, a local maximum? For that
need a deeper analysis of a speciﬁc , to show, for instance, that it is convex in .
Implementing logistic regression using MLE by gradient ascent
Let's apply the gradient ascent procedure to the logistic regression problem, in order to determine a good .

Exercise 7 (optional; ungraded). Show the following.

Answer. From (V2),

Thus,

Let's consider each term in turn.

For the ﬁrst term, apply the gradient identities to obtain

For the second term, recall the scalar interpretation of .

The -th component of the gradient is

Let's evaluate the summand:

Thus, the -th component of the gradient becomes

In other words, the full gradient vector is

Putting the two components together,

Exercise 8 (2 points). Implement a function to compute the gradient of the log-likelihood. Your function should have the signature,

def grad_log_likelihood (theta, y, X):

...

In [21]: def grad_log_likelihood(theta, y, X):

"""Returns the gradient of the log-likelihood."""
### BEGIN SOLUTION
return X.T.dot(y - logistic(X.dot(theta)))
### END SOLUTION

In [22]: # Test cell: `grad_log_likelihood_code__check`

if False:
d_grad_soln = 6
m_grad_soln = 399
theta_grad_soln = np.random.random((d_grad_soln+1, 1)) * 2.0 - 1.0
y_grad_soln = np.random.randint(low=0, high=2, size=(m_grad_soln, 1))
X_grad_soln = np.random.random((m_grad_soln, d_grad_soln+1)) * 2.0 - 1.0
X_grad_soln[:, 0] = 1.0
L_grad_soln = grad_log_likelihood(theta_grad_soln, y_grad_soln, X_grad_soln)
np.savez_compressed('grad_log_likelihood_soln',
d_grad_soln, m_grad_soln, theta_grad_soln, y_grad_soln, X_grad_soln, L_g

npzfile_grad_soln = np.load ('{}grad_log_likelihood_soln.npz'.format(DATA_PATH))

d_grad_soln = npzfile_grad_soln['arr_0']
m_grad_soln = npzfile_grad_soln['arr_1']
theta_grad_soln = npzfile_grad_soln['arr_2']
y_grad_soln = npzfile_grad_soln['arr_3']
X_grad_soln = npzfile_grad_soln['arr_4']
L_grad_soln = npzfile_grad_soln['arr_5']

L_grad_you = grad_log_likelihood (theta_grad_soln, y_grad_soln, X_grad_soln)

your_grad_err = np.max (np.abs (L_grad_you/L_grad_soln - 1.0))
display (Math (r'\left\|\dfrac{\nabla\, \mathcal{L}_{\tiny \mbox{yours}} - \nabla\,\mathcal{L}_{
x{solution}}}{\nabla\, \mathcal{L}_{\tiny \mbox{solution}}}\right\|_\infty \approx %g' % your_gr
assert your_grad_err <= 1e-12

print ("\n(Passed.)")

(Passed.)

Exercise 9 (4 points). Implement the gradient ascent procedure to determine , and try it out on the sample data.

Recall the procedure (repeated from above):

Start with some initial guess, .

At each iteration of the procedure, let be the current guess.
Compute the direction of steepest ascent by evaluating the gradient, .

Deﬁne the step to be , where is a suitably chosen fudge factor.

Take a step in the direction of the gradient, .

Stop when the parameters don't change much or after some maximum number of steps.

In the code skeleton below, we've set up a loop to run a ﬁxed number, MAX_STEP, of gradient ascent steps. Also, when normalizing the step
norm.

In your solution, we'd like you to store all guesses in the matrix thetas, so that you can later see how the values evolve. To extrac
particular column t, use the notation, theta[:, t:t+1]. This notation is necessary to preserve the "shape" of the column as a colum
vector.

In [23]: ALPHA = 0.1

MAX_STEP = 250

# Get the data coordinate matrix, X, and labels vector, y

X = points
y = labels.astype(dtype=float)

# Store all guesses, for subsequent analysis

thetas = np.zeros((3, MAX_STEP+1))

for t in range(MAX_STEP):
# Fill in the code to compute thetas[:, t+1:t+2]
### BEGIN SOLUTION
theta_t = thetas[:, t:t+1]
delta_t = grad_log_likelihood(theta_t, y, X)
delta_t = delta_t / np.linalg.norm(delta_t, ord=2)
thetas[:, t+1:t+2] = theta_t + ALPHA*delta_t
### END SOLUTION

theta_ga = thetas[:, MAX_STEP:]

print("Your (hand) solution:", my_theta.T.flatten())
print("Computed solution:", theta_ga.T.flatten())

print("\n=== Comparisons ===")

display(Math (r'\dfrac{\theta_0}{\theta_2}:'))
print("Your manual (hand-picked) solution is", my_theta[0]/my_theta[2], \
", vs. MLE (via gradient ascent), which is", theta_ga[0]/theta_ga[2])
display(Math (r'\dfrac{\theta_1}{\theta_2}:'))
print("Your manual (hand-picked) solution is", my_theta[1]/my_theta[2], \
", vs. MLE (via gradient ascent), which is", theta_ga[1]/theta_ga[2])

print("\n=== The MLE solution, visualized ===")

ga_labels = gen_lin_discr_labels(points, theta_ga)
df_ga = df.copy()
df_ga['label'] = mark_matches(ga_labels, labels).astype (dtype=int)
plot_lin_discr(theta_ga, df_ga)

Your (hand) solution: [-2. -0.5 -0.55]

Computed solution: [-15.57666992 -3.03431905 -3.79328353]

=== Comparisons ===

Your manual (hand-picked) solution is [3.63636364] , vs. MLE (via gradient ascent), which is [4.

Your manual (hand-picked) solution is [0.90909091] , vs. MLE (via gradient ascent), which is [0.

=== The MLE solution, visualized ===

In [24]: print ("\n=== Mismatch counts ===")

my_labels = gen_lin_discr_labels (points, my_theta)

my_mismatches = len (labels) - count_matches (labels, my_labels)
print ("Your manual (hand-picked) solution has", num_mismatches, "mismatches.")

ga_labels = gen_lin_discr_labels (points, theta_ga)

ga_mismatches = len (labels) - count_matches (labels, ga_labels)
print ("The MLE method produces", ga_mismatches, "mismatches.")

assert ga_mismatches <= 8

print ("\n(Passed.)")

=== Mismatch counts ===

Your manual (hand-picked) solution has 5 mismatches.
The MLE method produces 6 mismatches.

(Passed.)

The gradient ascent trajectory. Let's take a look at how gradient ascent progresses. (You might try changing the parameter and see how it
results.)

In [25]: n_ll_grid = 100

x1 = np.linspace(-8., 0., n_ll_grid)

x2 = np.linspace(-8., 0., n_ll_grid)

x1_grid, x2_grid = np.meshgrid(x1, x2)

ll_grid = np.zeros((n_ll_grid, n_ll_grid))

for i1 in range(n_ll_grid):
for i2 in range(n_ll_grid):
theta_i1_i2 = np.array([[thetas[0, MAX_STEP]],
[x1_grid[i1][i2]],
[x2_grid[i1][i2]]])
ll_grid[i1][i2] = log_likelihood(theta_i1_i2, y, X)

# Determine a color scale

def v(x):
return -np.log(np.abs(x))
return x

def v_inv(v):
return -np.exp(np.abs(v))
return v

v_min, v_max = v(ll_grid.min()), v(ll_grid.max())

v_range = v_max - v_min
v_breaks = v_inv(np.linspace(v_min, v_max, 20))

p = plt.contourf(x1, x2, ll_grid, v_breaks, cmap=plt.cm.get_cmap("winter"))

plt.xlabel('theta_0')
plt.ylabel('theta_1')
plt.title('log-likelihood')
plt.colorbar()
plt.plot(thetas[1, :], thetas[2, :], 'k*-')

Out[25]: [<matplotlib.lines.Line2D at 0x7f0ec2484dd8>]

Part 5 (optional): Numerical optimization via Newton's method
The fudge factor, , in gradient ascent should give you pause. Can you choose the step size or direction in a better or more principled way?

One idea is Newton's method (http://www.math.uiuc.edu/documenta/vol-ismp/13_deuﬂhard-peter.pdf), summarized below.

This part of the notebook has additional exercises, but they are all worth 0 points. (So if you submit something that is incomplete or fail
test cells, you won't lose any points.)

The basic idea, in 1-D. Suppose you start at a point and, assuming you are not yet at the optimum, you have decided to take a step of size
you at .

How do you choose ? In gradient ascent, you do so by following the gradient, which points in an "upward" direction.

In Newton's method, you will pick in a diﬀerent way: choose to maximize .

That should strike you as circular; the whole problem from the beginning was to maximize . The trick, in this case, is not to maximize
rather, let's replace it with some approximation, , and maximize instead.

A simple choice for is a quadratic function in . This choice is motivated by two factors: (a) since it's quadratic, it should have some sort of
point (and hopefully an actual maximum), and (b) it is a higher-order approximation than a linear one, and so hopefully more accurate than a line
well.

To maximize , take its derivative and then solve for the such that :

That is, the optimal step is the negative of the ﬁrst derivative of divided by its second derivative.

Generalizing to higher dimensions. To see how this procedure works in higher dimensions, you will need not only the gradient of , but als
which is the moral equivalent of a second derivative.

Deﬁnition: the Hessian. Let be a function that takes a vector of length as input and returns a scalar. The Hessian of is an ma
whose entries are all possible second-order partial derivatives with respect to the components of . That is, let be the element of
deﬁne

Armed with a Hessian, the Newton step is deﬁned as follows, by direct analogy to the 1-D case. First, the Taylor series approximation of
multidimensional variables is, as it happens,

As in the 1-D case, we want to ﬁnd an extreme point of . Taking its "derivative" (gradient), , and setting it to 0 yields,

In other words, to choose the next step , Newton's method suggests that you must solve a system of linear equations, where the matrix is the
and the right-hand side is the negative gradient of .

Summary: Newton's method. Summarizing the main ideas from above, Newton's method to maximize the scalar objective function where
consists of the following steps:

Start with some initial guess .

At step , compute the search direction by solving .
p , p y g
Compute a new (and hopefully improved) guess by the update, .

Implementing logistic regression via a Newton-based MLE

To perform MLE for the logistic regression model using Newton's method, you need both the gradient of the log-likelihood as well as the Hessia
already know how to compute the gradient from the preceding exercises; so what about the Hessian?

Notationally, that calculation will be a little bit easier to write down and program with the following deﬁnition.

Deﬁnition: Elementwise product. Let and be matrices. Denote the elementwise product of and by . That
, then element .

If is but is instead just , then we will "auto-extend" . Put diﬀerently, if has the same number of rows as but only 1 column
take to have elements .

In Python, you can use np.multiply() (http://docs.scipy.org/doc/numpy/reference/generated/numpy.multiply.html) for elementwise multiplic

Numpy arrays.

In [26]: A = np.array([[1, 2, 3],

[4, 5, 6]])
B = np.array([[-1, 2, -3],
[4, -5, 6]])

print(np.multiply(A, B)) # elementwise product

print()
print(np.multiply(A, B[:, 0:1])) # "auto-extend" version

[[ -1 4 -9]
[ 16 -25 36]]

[[-1 -2 -3]
[16 20 24]]

Exercise 10 (optional; ungraded). Show that the Hessian of the log-likelihood for logistic regression is

Exercise 11 (0 points). Implement a function to compute the Hessian of the log-likelihood. The signature of your function should be,

def hess_log_likelihood (theta, y, X):

...

In [27]: ### BEGIN SOLUTION

def hess_log_likelihood(theta, y, X):
"""Returns the Hessian of the log-likelihood."""
z = X.dot(theta)
A = np.multiply(X, logistic(z))
B = np.multiply(X, logistic(-z))
return -A.T.dot(B)
### END SOLUTION

In [28]: # Test cell: `hess_log_likelihood__check`

if False:
d_hess_soln = 20
m_hess_soln = 501
theta_hess_soln = np.random.random ((d_hess_soln+1, 1)) * 2.0 - 1.0
y_hess_soln = np.random.randint (low=0, high=2, size=(m_hess_soln, 1))
X_hess_soln = np.random.random ((m_hess_soln, d_hess_soln+1)) * 2.0 - 1.0
X_hess_soln[:, 0] = 1.0
L_hess_soln = hess_log_likelihood (theta_hess_soln, y_hess_soln, X_hess_soln)
np.savez_compressed ('hess_log_likelihood_soln',
d_hess_soln, m_hess_soln, theta_hess_soln, y_hess_soln, X_hess_soln, L_

npzfile_hess_soln = np.load ('{}hess_log_likelihood_soln.npz'.format(DATA_PATH))

d_hess_soln = npzfile_hess_soln['arr_0']
m_hess_soln = npzfile_hess_soln['arr_1']
theta_hess_soln = npzfile_hess_soln['arr_2']
y_hess_soln = npzfile_hess_soln['arr_3']
X_hess_soln = npzfile_hess_soln['arr_4']
L_hess_soln = npzfile_hess_soln['arr_5']

L_hess_you = hess_log_likelihood(theta_hess_soln, y_hess_soln, X_hess_soln)

your_hess_err = np.max(np.abs(L_hess_you/L_hess_soln - 1.0))
display(Math(r'\left\|\dfrac{H_{\tiny \mbox{yours}} - H_{\tiny \mbox{solution}}}{H_{\tiny \mbox{
}}\right\|_\infty \approx %g' % your_hess_err))
assert your_hess_err <= 1e-12

print ("\n(Passed.)")
(Passed.)

Exercise 12 (0 points). Finish the implementation of a Newton-based MLE procedure for the logistic regression problem.

In [29]: MAX_STEP = 10

# Get the data coordinate matrix, X, and labels vector, l

X = points
y = labels.astype(dtype=float)

# Store all guesses, for subsequent analysis

thetas_newt = np.zeros((3, MAX_STEP+1))

for t in range(MAX_STEP):
### BEGIN SOLUTION
theta_t = thetas_newt[:, t:t+1]
g_t = grad_log_likelihood(theta_t, y, X)
H_t = hess_log_likelihood(theta_t, y, X)
s_t = np.linalg.solve(H_t, -g_t)
thetas_newt[:, t+1:t+2] = theta_t + s_t
### END SOLUTION

theta_newt = thetas_newt[:, MAX_STEP:]

print ("Your (hand) solution:", my_theta.T.flatten())
print ("Computed solution:", theta_newt.T.flatten())

print ("\n=== Comparisons ===")

display (Math (r'\dfrac{\theta_0}{\theta_2}:'))
print ("Your manual (hand-picked) solution is", my_theta[0]/my_theta[2], \
", vs. MLE (via Newton's method), which is", theta_newt[0]/theta_newt[2])
display (Math (r'\dfrac{\theta_1}{\theta_2}:'))
print ("Your manual (hand-picked) solution is", my_theta[1]/my_theta[2], \
", vs. MLE (via Newton's method), which is", theta_newt[1]/theta_newt[2])

print ("\n=== The MLE solution, visualized ===")

newt_labels = gen_lin_discr_labels(points, theta_newt)
df_newt = df.copy()
df_newt['label'] = mark_matches(newt_labels, labels).astype (dtype=int)
plot_lin_discr(theta_newt, df_newt)

Your (hand) solution: [-2. -0.5 -0.55]

Computed solution: [-15.63082207 -3.04255951 -3.76500606]

=== Comparisons ===

Your manual (hand-picked) solution is [3.63636364] , vs. MLE (via Newton's method), which is [4.

Your manual (hand-picked) solution is [0.90909091] , vs. MLE (via Newton's method), which is [0.

=== The MLE solution, visualized ===

In [30]: # Test cell: `logreg_mle_newt__check`

print ("\n=== Mismatch counts ===")

my_labels = gen_lin_discr_labels (points, my_theta)

my_mismatches = len (labels) - count_matches (labels, my_labels)
i t ("Y l (h d i k d) l ti h " i t h " i t h ")
print ("Your manual (hand-picked) solution has", num_mismatches, "mismatches.")

newt_labels = gen_lin_discr_labels (points, theta_newt)

newt_mismatches = len (labels) - count_matches (labels, newt_labels)
print ("The MLE+Newton method produces", newt_mismatches, "mismatches.")

assert newt_mismatches <= ga_mismatches

print ("\n(Passed.)")

=== Mismatch counts ===

Your manual (hand-picked) solution has 5 mismatches.
The MLE+Newton method produces 6 mismatches.

(Passed.)

The following cell creates a contour plot of the log-likelihood, as done previously in this notebook. Add code to display the trajectory taken by N
method.

In [31]: p = plt.contourf(x1, x2, ll_grid, cmap=plt.cm.get_cmap("winter"))

plt.xlabel('theta_0')
plt.ylabel('theta_1')
plt.title('Trajectory taken by Newton\'s method')
plt.colorbar()
plt.plot(thetas_newt[1, :], thetas_newt[2, :], 'k*-')

Out[31]: [<matplotlib.lines.Line2D at 0x7f0ec2609b38>]

How many steps does this optimization procedure take compared to gradient ascent? What is the tradeoﬀ?

World Political Map Blank - Google Search
No ratings yet
World Political Map Blank - Google Search
1 page
Week-7 GA Solution 1
100% (1)
Week-7 GA Solution 1
11 pages
RCC 2
No ratings yet
RCC 2
22 pages
Graded Solution Maths
100% (1)
Graded Solution Maths
33 pages
Pandas Assignment
100% (2)
Pandas Assignment
11 pages
NB 12
No ratings yet
NB 12
34 pages
Nb7+ (Notes On Pandas)
No ratings yet
Nb7+ (Notes On Pandas)
34 pages
NB 7
No ratings yet
NB 7
25 pages
Draughtsman Mechanical 1st Year (Volume II of II) TT
No ratings yet
Draughtsman Mechanical 1st Year (Volume II of II) TT
211 pages
Supermart Grocery Sales - Retail Analytics Dataset - (Data Analyst)
No ratings yet
Supermart Grocery Sales - Retail Analytics Dataset - (Data Analyst)
17 pages
TOPSOE Seminar - Catalysts and Reactions PDF
100% (4)
TOPSOE Seminar - Catalysts and Reactions PDF
132 pages
NB 9
No ratings yet
NB 9
29 pages
Math4ml PDF
No ratings yet
Math4ml PDF
21 pages
Linear - Regression
100% (1)
Linear - Regression
39 pages
Poster ASME 2022 EN 013
No ratings yet
Poster ASME 2022 EN 013
1 page
NB 14
No ratings yet
NB 14
15 pages
Charmi Shah 20bcp299 Lab2
100% (1)
Charmi Shah 20bcp299 Lab2
7 pages
NB 15
No ratings yet
NB 15
20 pages
NB 10
0% (1)
NB 10
24 pages
Univariate and Bivariate Data Analysis + Probability
100% (1)
Univariate and Bivariate Data Analysis + Probability
5 pages
Regressao Linear Simples - Ipynb - Colaboratory
100% (1)
Regressao Linear Simples - Ipynb - Colaboratory
2 pages
Software Engineering
No ratings yet
Software Engineering
5 pages
Elements of Pure and Applied Mathematics
No ratings yet
Elements of Pure and Applied Mathematics
485 pages
Synchronous Rectifier MOSFET Driver Substantially Reduces Power Adapter
No ratings yet
Synchronous Rectifier MOSFET Driver Substantially Reduces Power Adapter
6 pages
Log Reg Skimed - Ipynb - Colab
No ratings yet
Log Reg Skimed - Ipynb - Colab
10 pages
Chapter-3-Linear Models For Regression
100% (1)
Chapter-3-Linear Models For Regression
61 pages
3-Bearing Pressure and Bearing Capacity
100% (1)
3-Bearing Pressure and Bearing Capacity
62 pages
English 3 Program
No ratings yet
English 3 Program
8 pages
Importing Libraries: Import As Import As Import As From Import As From Import From Import Import
100% (1)
Importing Libraries: Import As Import As Import As From Import As From Import From Import Import
11 pages
Slides On DataI
No ratings yet
Slides On DataI
33 pages
Santos Training2
No ratings yet
Santos Training2
27 pages
Continuity Equation
No ratings yet
Continuity Equation
11 pages
Data Vizualization - Jupyter Notebook
No ratings yet
Data Vizualization - Jupyter Notebook
20 pages
6 Mips Datapath
No ratings yet
6 Mips Datapath
55 pages
Prysmian MV 1CALX33HD Datasheet 2015-04
No ratings yet
Prysmian MV 1CALX33HD Datasheet 2015-04
2 pages
Week 4 Homework: This Is A Preview of The Published Version of The Quiz
No ratings yet
Week 4 Homework: This Is A Preview of The Published Version of The Quiz
7 pages
Sharp Photodevices Application Cirquits
No ratings yet
Sharp Photodevices Application Cirquits
7 pages
Unit II Visualizing Using Matplotlib
No ratings yet
Unit II Visualizing Using Matplotlib
24 pages
SVM (Support Vector Machine) For Classification - by Aditya Kumar - Towards Data Science
100% (1)
SVM (Support Vector Machine) For Classification - by Aditya Kumar - Towards Data Science
28 pages
Good Is The Activity of The Soul in Accordance With Virtue
No ratings yet
Good Is The Activity of The Soul in Accordance With Virtue
6 pages
ISYE 8803 - Kamran - M1 - Intro To HD and Functional Data - Updated
No ratings yet
ISYE 8803 - Kamran - M1 - Intro To HD and Functional Data - Updated
87 pages
Solid State Zelio Relay
No ratings yet
Solid State Zelio Relay
76 pages
Classification Problems
100% (1)
Classification Problems
25 pages
PCA Quiz
No ratings yet
PCA Quiz
8 pages
4G CN PDF
No ratings yet
4G CN PDF
2 pages
Regression Anallysis Hands0n 1
100% (1)
Regression Anallysis Hands0n 1
3 pages
Moisture in The Atmosphere
No ratings yet
Moisture in The Atmosphere
43 pages
ML Lab6.Ipynb - Colaboratory
100% (1)
ML Lab6.Ipynb - Colaboratory
5 pages
CS229 Lecture 3 PDF
100% (1)
CS229 Lecture 3 PDF
35 pages
Screenshot 2022-06-06 at 11.36.22 AM
No ratings yet
Screenshot 2022-06-06 at 11.36.22 AM
17 pages
Circuit Meets Challenges of Fast, High-Current NiCd Charging
No ratings yet
Circuit Meets Challenges of Fast, High-Current NiCd Charging
5 pages
0.1 Stock Data
100% (1)
0.1 Stock Data
4 pages
Narayana 14-06-2022 Outgoing SR Jee Main Model GTM 9 QP Final
No ratings yet
Narayana 14-06-2022 Outgoing SR Jee Main Model GTM 9 QP Final
19 pages
FT-891 Quick Manual: (PWR/LOCK) Key RF/SQL Knob
No ratings yet
FT-891 Quick Manual: (PWR/LOCK) Key RF/SQL Knob
2 pages
Pokemon HP Predictions
No ratings yet
Pokemon HP Predictions
24 pages
C2M2 - Assignment: 1 Risk Models Using Tree-Based Models
100% (1)
C2M2 - Assignment: 1 Risk Models Using Tree-Based Models
38 pages
A9F74220
No ratings yet
A9F74220
3 pages
Physical Characterization of Activated Carbon Derived From Mangosteen Peel PDF
No ratings yet
Physical Characterization of Activated Carbon Derived From Mangosteen Peel PDF
5 pages
Lecture Week 2 KNN and Model Evaluation PDF
100% (1)
Lecture Week 2 KNN and Model Evaluation PDF
53 pages
Sajjad DS
100% (2)
Sajjad DS
97 pages
Salary Prediction LinearRegression
100% (1)
Salary Prediction LinearRegression
7 pages
Noc20-Cs28 Week 07 Assignment 02
No ratings yet
Noc20-Cs28 Week 07 Assignment 02
6 pages
Unit V Data Visualization
No ratings yet
Unit V Data Visualization
49 pages
The Problem of Overfitting: Overfitting With Linear Regression
No ratings yet
The Problem of Overfitting: Overfitting With Linear Regression
32 pages
AS Notebook - PCA - Wine Data-4
100% (1)
AS Notebook - PCA - Wine Data-4
1 page
Random Forest: Implementaciones de Scikit-Learn Sobre QSAR
100% (1)
Random Forest: Implementaciones de Scikit-Learn Sobre QSAR
11 pages
Python Assignment
No ratings yet
Python Assignment
2 pages
ML Project Report: (Text Learning Case Study)
No ratings yet
ML Project Report: (Text Learning Case Study)
9 pages
Slicing. Both, Numpy Array Indexing and Slicing Will Be Discussed in The Remainder
No ratings yet
Slicing. Both, Numpy Array Indexing and Slicing Will Be Discussed in The Remainder
50 pages
Servicing
No ratings yet
Servicing
19 pages
Python Numpy (1) : Intro To Multi-Dimensional Array & Numerical Linear Algebra
100% (1)
Python Numpy (1) : Intro To Multi-Dimensional Array & Numerical Linear Algebra
27 pages
Matplotlib PDF
No ratings yet
Matplotlib PDF
16 pages
Polyphase Rectifier
No ratings yet
Polyphase Rectifier
72 pages
Assignment-4 Noc18 cs52 87
No ratings yet
Assignment-4 Noc18 cs52 87
9 pages
Logistic Regression
100% (1)
Logistic Regression
29 pages
Linear Regression
100% (1)
Linear Regression
51 pages
Human Skin Grade 6
No ratings yet
Human Skin Grade 6
15 pages
Logistic Regression in R
No ratings yet
Logistic Regression in R
19 pages
Columbia Seaborn Tutorial
No ratings yet
Columbia Seaborn Tutorial
12 pages
Problems Chapter 1 Sec B
No ratings yet
Problems Chapter 1 Sec B
7 pages
Customer Segmentation Clustering
No ratings yet
Customer Segmentation Clustering
35 pages
Nueral Network Mcqs
No ratings yet
Nueral Network Mcqs
6 pages
Chandigarh Group of Colleges College of Engineering Landran, Mohali
No ratings yet
Chandigarh Group of Colleges College of Engineering Landran, Mohali
47 pages
Python Assignment
No ratings yet
Python Assignment
7 pages
ML Lab
No ratings yet
ML Lab
21 pages
Loss Functions
No ratings yet
Loss Functions
37 pages
Outliers, Hypothesis and Natural Language Processing
100% (1)
Outliers, Hypothesis and Natural Language Processing
7 pages
Project
No ratings yet
Project
18 pages
Smart Soot Blower System
No ratings yet
Smart Soot Blower System
8 pages
Gradient Descent
No ratings yet
Gradient Descent
15 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.