NB 13
NB 13
it
pt
IIIIate
i
t
fi
aces
Awe wanttofindtheboundary
discriminantbetweenthe
twoclasses
oxo Ex a 0
x 1 8 att
hope intercept
byapplying
discriminant boundary
Hearts.IE
Issue it
Etffi
produces hardlabelassignmentand this can be
problematic when datapoint's aren't cleanly separable
Solution soft assignment Instead of assigning a label we assign
a probability score
Pr label x L x
avisidfundy
He V5
g y
O
maps the values into the range 0,1
a
maps negative values 967 12
to zero and positive
values to 1 o sigmoid can be interpreted as a probability
of a distribution
Probability density combination of two normal
IL Gaussian distributions
1
aka Mixture ofGaussian
9 points distributions
Suppose the
Pr label07 1 03 9 FCA É
fromthedistributionwillhavelabel l given
GEN
Eet GE al
TL;DR: For the classi cation task, we are using a linear discriminant function to assign
points a score. Then given the score, we have various options of how to generate labels, e.g.
heaviside and logistic function. Heaviside produces hard labels, while sigmoid produces soft
labels, which can be helpful when the classes aren’t clearly separable.
at
Y oÉ ÉdÉts
Yaidmeters Fat T
vector of my datamatrix d dpredictors a
entriesadditional
constan
responseslabeled
E OI
lastcolumn of1thedata
matrixtall values
model we are a 0
When we are
given
a
that we
given
would have seen these particular
What are the chances
labels under this model
Likelihood Pr XO joint probability of labels
conditional
IIe.ly IIer10s
Directly
19
maximizing the product is
product derivatives issue
mathematically
expensive
of smallnumbers is a verysmallnumber
product
Solution Log transform
optimatoisprserved when wetake log of of likelihood
because logs are monotonic so it can't changewhere
the max value occurs
argzaxtlOI argzaxl.gl
Dloglogatlogt
a.b
likelihood we are
Therefore instead of maximizing the
Prof maximizing log likelihood
ICO I log L Q
argzaxffeoger.cl
QE argmgxL.CO argqaxlogIIECOj
ÉÉI
goif implementation of amathematica
statementsince yeis
either
y
maximum
location of the
value of F derivative is zero at the
maximum for scalarfunctions
y
gradient is zero
Tat vectorme aximum for
functions
ay
slope
j
initialguess location of the maximum
value of Fx
directionthatfaces
pg
S is the direction of increasing slope
dimension the direction is
I in one just
sign of the
derivative
s I y a
How large should the step be
S I Ex
user specified
L
parameter
tuning
Eating
2 0
When we move from X to xts
foxes Fx Ey T T S t O S2 Taylor expansion of thefunction
s is smallthey
if we me s is smallthey
5 will also bevery 2 also has to be small
small so we canignore
it as well as any other that
herorderterms as
fixes s stepsize
off
in
ppg currentvalue
slope
fees
fetÉ g
EE Eye Efd
Mitase
For higherdimensions
0 54 d
5 Il TL DR
TxFH 112
f res F FR Stix FR
Logistic regression
Beyond regression, another important data analysis task is classification, in which you are given a set of labeled data points and you wish to lea
the labels. The canonical example of a classification algorithm is logistic regression, the topic of this notebook.
Here, you'll consider binary classification. Each data point belongs to one of possible classes. By convention, we will denote these class
and "1." However, the ideas can be generalized to the multiclass case, i.e., , with labels .
You'll also want to review from earlier notebooks the concept of gradient ascent/descent (or "steepest ascent/descent"), when optimizing a sca
a vector variable.
Part 0: Introduction
This part of the notebook introduces you to the classification problem through a "geometric interpretation."
Setup
In [1]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import display, Math
%matplotlib inline
A note about slicing columns from a Numpy matrix. If you want to extract a column i from a Numpy matrix A and keep it as a column vecto
use the slicing notation, A[:, i:i+1]. Not doing so can lead to subtle bugs. To see why, compare the following slices.
A[:, :] ==
[[1. 2. 3.]
[4. 5. 6.]
[7. 8. 9.]]
a0 := A[:, 0] ==
[1. 4. 7.]
a1 := A[:, 2:3] ==
[[3.]
[6.]
[9.]]
Aside: Broadcasting in Numpy. What is happening in the operation, a0 + a1, shown above? When the shapes of two objects do not match,
to figure out if there is a natural way to make them compatible. See this supplemental notebook (./mo_numpy_mo_problems.ipynb) for informat
Numpy's "broadcasting rule," along with other Numpy tips.
Example data: Rock lobsters!
As a concrete example of a classification task, consider the results of the following experiment (http://www.stat.ufl.edu/~winner/data/lobster_su
Some marine biologists started with a bunch of lobsters of varying sizes (size being a proxy for the stage of a lobster's development). They then
exposed these lobsters to a variety of predators. Finally, the outcome that they measured is whether the lobsters survived or not.
The data is a set of points, one point per lobster, where there is a single predictor (the lobster's size) and the response is whether the lobsters s
"1") or died (label "0").
def on_vocareum():
return os.path.exists('.voc')
if on_vocareum():
URL_BASE = "https://cse6040.gatech.edu/datasets/rock-lobster/"
DATA_PATH = "../resource/asnlib/publicdata/"
else:
URL_BASE = "https://github.com/cse6040/labs-fa17/raw/master/datasets/rock-lobster/"
DATA_PATH = ""
'grad_log_likelihood_soln.npz' is ready!
'hess_log_likelihood_soln.npz' is ready!
'log_likelihood_soln.npz' is ready!
'lobster_survive.dat.txt' is ready!
'logreg_points_train.csv' is ready!
Here is a plot of the raw data, which was taken from this source (http://www.stat.ufl.edu/~winner/data/lobster_survive.dat).
0 27 0
1 27 0
2 27 0
3 27 0
4 27 0
...
CarapaceLen Survived
154 54 1
155 54 1
156 54 1
157 54 1
158 57 1
Although the classes are distinct in the aggregate, where the median carapace (outer shell) length is around 36 mm for the lobsters that died an
those that survived, they are not cleanly separable.
Notation
To develop some intuition and a classification algorithm, let's formulate the general problem and apply it to synthetic data sets.
Let the data consist of observations of continuously-valued predictors. In addition, for each data observation we observe a binary label wh
either 0 or 1.
Just like our convention in the linear regression case, represent each observation, or data point, by an augumented vector, ,
That is, the point is the coordinates augmented by an initial dummy coordinate whose value is 1. This convention is similar to what we did in
regression.
We can also stack these points as rows of a matrix, , again, just as we did in regression:
Example: A synthetic training set. We've pre-generated a synethetic data set consisting of labeled data points. Let's download and inspect it
table and then visually.
In [6]: df = pd.read_csv('{}logreg_points_train.csv'.format(DATA_PATH))
display(df.head())
print(" ")
print( ... )
display(df.tail())
0 -0.234443 -1.075960 1
1 0.730359 -0.918093 0
2 1.432270 -0.439449 0
3 0.026733 1.050300 0
4 1.879650 0.207743 0
...
Next, let's extract the coordinates as a Numpy matrix of points and the labels as a Numpy column vector labels. Mathematically, the point
corresponds to and the labels vector corresponds to .
print ("First and last 5 points:\n", '='*23, '\n', points[:5], '\n...\n', points[-5:], '\n')
print ("First and last 5 labels:\n", '='*23, '\n', labels[:5], '\n...\n', labels[-5:], '\n')
A linear boundary is also known as a linear discriminant. Any point on this line may be described by , where is a vector of coefficients:
For example, suppose our observations have two predictors each ( ). Let the corresponding data point be . Then,
means that
So that describes points on the line. However, given any point in the -dimensional space that is not on the line, still produces a value: th
be positive on one side of the line ( ) or negative on the other ( ).
In other words, you can use the linear discriminant function, , to generate a label for each point : just reinterpret its sign!
If you want "0" and "1" labels, the heaviside function, , will convert a positive to the label "1" and all other values to "0."
Exercise 0 (2 points). Given the a matrix of augmented points (i.e., the matrix) and a column vector of length , implemen
compute the value of the linear discriminant at each point. That is, the function should return a (column) vector where the .
[[ 1.05035323 -0.52024135 1. ]
[ 1.49196858 -0.59315241 1. ]
[ 0.26258831 -0.52024135 1. ]
[ 0.37299215 -0.59315241 1. ]]
[[ 0.13320893]
[ 0.1892159 ]
[-0.06660446]
[-0.09460795]]
(Passed.)
Exercise 1 (2 points). Implement the heaviside function, . Your function should allow for an arbitrary matrix of input values and should app
heaviside function to each element. In the returned matrix, the elements should have a floating-point type.
[[ 0. 1. 0.]
[ 1. 1. 0.]]
There are several possible approaches that lead to one-line solutions. One uses only logical and arithmetic operators, which you will re
implemented as elementwise operations for Numpy arrays. Another uses Numpy's sign()
(http://docs.scipy.org/doc/numpy/reference/generated/numpy.sign.html) function.
# Alternative solution:
#return (np.sign(Y) > 0) * 1.0
### END SOLUTION
print("Y:\n", Y_test)
print("\nH(Y):\n", H_Y_test)
print ("\n(Passed.)")
Y:
[[-2.3 1.2 7. ]
[ 0. -inf inf]]
H(Y):
[[0. 1. 1.]
[0. 0. 1.]]
(Passed.)
a == [0, 0, 1, 1, 0, 1, 1]
b == [1, 1, 0, 0, 1, 0, 0]
a == [0, 0, 1, 1, 0, 1, 1]
b == [1, 1, 0, 0, 1, 0, 0]
Exercise 2 (2 points). For the synthetic data you loaded above, try by hand to find a value for such that "best" separates the two clu
this in a variable named my_theta, which should be a Numpy column vector. That is, define my_theta here using a line like:
where np_col_vec is defined below and the list of values are your best guesses at discriminating coefficients. The test code will check that yo
makes no more than ten misclassifications.
Hint: We found a set of coefficients that commits just 5 errors for the 375 input points.
df_matches = df.copy ()
df_matches['label'] = mark_matches (my_labels, labels).astype (dtype=int)
mpl.rc("savefig", dpi=100) # Adjust for higher-resolution figures
plot_lin_discr (my_theta, df_matches)
How the heaviside divides the space. The heaviside function, , enforces a sharp boundary between classes around the line
code produces a contour plot (https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.contourf.html) to show this effect: there will be a sharp d
between 0 and 1 values, with one set of values shown as a solid dark area and the remaining as a solid light-colored area.
Since the labels are 0 or 1, you could look for a way to interpret labels as probabilities rather than as hard (0 or 1) labels. One such function is th
function, also referred to as the logit or sigmoid (https://en.wikipedia.org/wiki/Sigmoid_function) function.
The logistic function takes any value in the range and produces a value in the range . Thus, given a value , we can interpret
conditional probability that the label is 1 given , i.e., .
Exercise 3 (2 points). Implement the logistic function. Inspect the resulting plot of in 1-D and then the contour plot of . Your functi
accept a Numpy matrix of values, Y, and apply the sigmoid elementwise.
print ("\n(Passed.)")
(Passed.)
Exercise 4 (optional; ungraded). Consider a set of 1-D points generated by a mixture of Gaussians. That is, suppose that there are two Gaussia
over the 1-dimensional variable, , that have the same variance ( ) but different means ( and ). Show that the conditional
observing a point labeled "1" given may be written as,
You may assume the prior probabilities of observing a 0 or 1 are given by and .
The point of this derivation is to show you that the definition of the logistic function does not just arise out of thin air. It also hints that yo
might expect a final algorithm for logistic regression based on using as the discriminant will work well when the classes are best
explained as a mixture of Gaussians.
Generalizing to -dimensions. The preceding exercise can be generalized to -dimensions. Let and be -dimensional points. Then
Exercise 5 (optional; ungraded). Verify the following properties of the logistic function, .
(P2). Start with the right-hand side, , apply some algebra, and then apply (P1).
(P5). By combining (P2), variable substitution and the chain rule, and (P4),
"Likelihood" as an objective function. MLE derives from the following idea. Consider the joint probability of observing all of the labels, given t
the parameters, :
Suppose these observations are independent and identically distributed (i.i.d.). Then the joint probability can be factored as the product of indiv
probabilities,
The maximum likelihood principle says that you should choose to maximize the chances (or "likelihood") of seeing these particular observatio
is now an objective function to maximize.
For both mathematical and numerical reasons, we will use the logarithm of the likelihood, or log-likelihood, as the objective function instead. Le
We are using the symbol , which could be taken in any convenient base, such as the natural logarithm ( ) or the information theo
base-two logarithm ( ).
You can write the log-likelihood more compactly in the language of linear algebra.
Convention 1. Let be a column vector of all ones, with its length inferred from context. Let be a
where denote its columns. Then, the sum of the columns is
Convention 2. Let be any matrix and let be any function that we have defined by default to accept a scalar argument and pro
result. For instance, or . Then, assume that applies elementwise to , returning a matrix whose eleme
.
With these notational conventions, convince yourself that these are two different ways to write the log-likelihood for logistic regression.
Exercise 6 (2 points). Implement the log-likelihood function in Python by defining a function with the following signature:
if False:
d_soln = 10
m_soln = 1000
theta_soln = np.random.random ((d_soln+1, 1)) * 2.0 - 1.0
y_soln = np.random.randint (low=0, high=2, size=(m_soln, 1))
X_soln = np.random.random ((m_soln, d_soln+1)) * 2.0 - 1.0
X_soln[:, 0] = 1.0
L_soln = log_likelihood (theta_soln, y_soln, X_soln)
np.savez_compressed('log_likelihood_soln',
d_soln, m_soln, theta_soln, y_soln, X_soln, L_soln)
npzfile_soln = np.load('{}log_likelihood_soln.npz'.format(DATA_PATH))
d_soln = npzfile_soln['arr_0']
m_soln = npzfile_soln['arr_1']
theta_soln = npzfile_soln['arr_2']
y_soln = npzfile_soln['arr_3']
X_soln = npzfile_soln['arr_4']
L_soln = npzfile_soln['arr_5']
print ("\n(Passed.)")
(Passed.)
For example, recall that in the case of linear regression via least squares minimization, carrying out this process produced an analytic solution fo
parameters, which was to solve the normal equations.
Unfortunately, for logistic regression---or for most log-likelihoods you are likely to ever write down---you cannot usually derive an analytic solut
you will need to resort to numerical optimization procedures.
Gradient ascent, in 1-D. A simple numerical algorithm to maximize a function is gradient ascent (or steepest ascent). If instead you are minimiz
function, then the equivalent procedure is gradient (or steepest) descent. Here is the basic idea in 1-D.
Suppose we wish to find the maximum of a scalar function in one dimension. At the maximum, .
Suppose instead that and consider the value of at a nearby point, , as given approximately by a truncated Taylor series:
To make progress toward maximizing , you'd like to choose so that . One way is to choose , where
"small:"
If is small enough, then you can neglect the term and will be larger than , thus making progress toward finding a maximum
This scheme is the basic idea: starting from some initial guess , refine the guess by taking a small step in the direction of the derivative, i.e.,
Gradient ascent in higher dimensions. Now suppose is a vector rather than a scalar. Then the value of at a nearby point , where
becomes
where is the gradient of with respect to . As in the 1-D case, you want a step such that . To make as much progres
let's choose to be parallel to , that is, proportional to the gradient:
Again, is a fudge (or "gentle nudge?") factor. You need to choose it to be small enough that the high-order terms of the Taylor approximation
negligible, yet large enough that you can make reasonable progress.
The gradient ascent procedure applied to MLE. Applying gradient ascent to the problem of maximizing the log-likelihood leads to the followi
This procedure should remind you of one you saw in a prior notebook (the least mean square algorithm for online regression!). As was true at th
tricky bit is how to choose .
There is at least one difference between this procedure and the online regression procedure you learned earlier. Here, we are optimizing
the full dataset rather than processing data points one at a time. (That is, the step iteration variable used above is not used in exactly
same way as the step iteration in LMS.)
Another question is, how do we know this procedure will converge to the global maximum, rather than, say, a local maximum? For that
need a deeper analysis of a specific , to show, for instance, that it is convex in .
Implementing logistic regression using MLE by gradient ascent
Let's apply the gradient ascent procedure to the logistic regression problem, in order to determine a good .
Thus,
Exercise 8 (2 points). Implement a function to compute the gradient of the log-likelihood. Your function should have the signature,
if False:
d_grad_soln = 6
m_grad_soln = 399
theta_grad_soln = np.random.random((d_grad_soln+1, 1)) * 2.0 - 1.0
y_grad_soln = np.random.randint(low=0, high=2, size=(m_grad_soln, 1))
X_grad_soln = np.random.random((m_grad_soln, d_grad_soln+1)) * 2.0 - 1.0
X_grad_soln[:, 0] = 1.0
L_grad_soln = grad_log_likelihood(theta_grad_soln, y_grad_soln, X_grad_soln)
np.savez_compressed('grad_log_likelihood_soln',
d_grad_soln, m_grad_soln, theta_grad_soln, y_grad_soln, X_grad_soln, L_g
print ("\n(Passed.)")
(Passed.)
Exercise 9 (4 points). Implement the gradient ascent procedure to determine , and try it out on the sample data.
In the code skeleton below, we've set up a loop to run a fixed number, MAX_STEP, of gradient ascent steps. Also, when normalizing the step
norm.
In your solution, we'd like you to store all guesses in the matrix thetas, so that you can later see how the values evolve. To extrac
particular column t, use the notation, theta[:, t:t+1]. This notation is necessary to preserve the "shape" of the column as a colum
vector.
for t in range(MAX_STEP):
# Fill in the code to compute thetas[:, t+1:t+2]
### BEGIN SOLUTION
theta_t = thetas[:, t:t+1]
delta_t = grad_log_likelihood(theta_t, y, X)
delta_t = delta_t / np.linalg.norm(delta_t, ord=2)
thetas[:, t+1:t+2] = theta_t + ALPHA*delta_t
### END SOLUTION
Your manual (hand-picked) solution is [3.63636364] , vs. MLE (via gradient ascent), which is [4.
Your manual (hand-picked) solution is [0.90909091] , vs. MLE (via gradient ascent), which is [0.
(Passed.)
The gradient ascent trajectory. Let's take a look at how gradient ascent progresses. (You might try changing the parameter and see how it
results.)
def v_inv(v):
return -np.exp(np.abs(v))
return v
This part of the notebook has additional exercises, but they are all worth 0 points. (So if you submit something that is incomplete or fail
test cells, you won't lose any points.)
The basic idea, in 1-D. Suppose you start at a point and, assuming you are not yet at the optimum, you have decided to take a step of size
you at .
How do you choose ? In gradient ascent, you do so by following the gradient, which points in an "upward" direction.
That should strike you as circular; the whole problem from the beginning was to maximize . The trick, in this case, is not to maximize
rather, let's replace it with some approximation, , and maximize instead.
A simple choice for is a quadratic function in . This choice is motivated by two factors: (a) since it's quadratic, it should have some sort of
point (and hopefully an actual maximum), and (b) it is a higher-order approximation than a linear one, and so hopefully more accurate than a line
well.
To maximize , take its derivative and then solve for the such that :
That is, the optimal step is the negative of the first derivative of divided by its second derivative.
Generalizing to higher dimensions. To see how this procedure works in higher dimensions, you will need not only the gradient of , but als
which is the moral equivalent of a second derivative.
Definition: the Hessian. Let be a function that takes a vector of length as input and returns a scalar. The Hessian of is an ma
whose entries are all possible second-order partial derivatives with respect to the components of . That is, let be the element of
define
Armed with a Hessian, the Newton step is defined as follows, by direct analogy to the 1-D case. First, the Taylor series approximation of
multidimensional variables is, as it happens,
As in the 1-D case, we want to find an extreme point of . Taking its "derivative" (gradient), , and setting it to 0 yields,
In other words, to choose the next step , Newton's method suggests that you must solve a system of linear equations, where the matrix is the
and the right-hand side is the negative gradient of .
Summary: Newton's method. Summarizing the main ideas from above, Newton's method to maximize the scalar objective function where
consists of the following steps:
Notationally, that calculation will be a little bit easier to write down and program with the following definition.
Definition: Elementwise product. Let and be matrices. Denote the elementwise product of and by . That
, then element .
If is but is instead just , then we will "auto-extend" . Put differently, if has the same number of rows as but only 1 column
take to have elements .
[[ -1 4 -9]
[ 16 -25 36]]
[[-1 -2 -3]
[16 20 24]]
Exercise 10 (optional; ungraded). Show that the Hessian of the log-likelihood for logistic regression is
Exercise 11 (0 points). Implement a function to compute the Hessian of the log-likelihood. The signature of your function should be,
if False:
d_hess_soln = 20
m_hess_soln = 501
theta_hess_soln = np.random.random ((d_hess_soln+1, 1)) * 2.0 - 1.0
y_hess_soln = np.random.randint (low=0, high=2, size=(m_hess_soln, 1))
X_hess_soln = np.random.random ((m_hess_soln, d_hess_soln+1)) * 2.0 - 1.0
X_hess_soln[:, 0] = 1.0
L_hess_soln = hess_log_likelihood (theta_hess_soln, y_hess_soln, X_hess_soln)
np.savez_compressed ('hess_log_likelihood_soln',
d_hess_soln, m_hess_soln, theta_hess_soln, y_hess_soln, X_hess_soln, L_
print ("\n(Passed.)")
(Passed.)
Exercise 12 (0 points). Finish the implementation of a Newton-based MLE procedure for the logistic regression problem.
In [29]: MAX_STEP = 10
for t in range(MAX_STEP):
### BEGIN SOLUTION
theta_t = thetas_newt[:, t:t+1]
g_t = grad_log_likelihood(theta_t, y, X)
H_t = hess_log_likelihood(theta_t, y, X)
s_t = np.linalg.solve(H_t, -g_t)
thetas_newt[:, t+1:t+2] = theta_t + s_t
### END SOLUTION
Your manual (hand-picked) solution is [3.63636364] , vs. MLE (via Newton's method), which is [4.
Your manual (hand-picked) solution is [0.90909091] , vs. MLE (via Newton's method), which is [0.
(Passed.)
The following cell creates a contour plot of the log-likelihood, as done previously in this notebook. Add code to display the trajectory taken by N
method.
How many steps does this optimization procedure take compared to gradient ascent? What is the tradeoff?