C1 W1 Lab04 Gradient Descent Soln
Gradient Descent for Linear Regression
Goals
In this lab, you will:
• automate the process of optimizing the parameters w and b using gradient descent
Tools
In this lab, we will make use of:
• NumPy, a popular library for scientific computing
• Matplotlib, a popular library for plotting data
Problem Statement
Let's use the same two data points as before - a house with 1000 square feet sold for $300,000
and a house with 2000 square feet sold for $500,000.
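The code later in the lab assumes this data is already loaded into arrays named x_train and y_train. A minimal sketch of that setup, assuming the same scaling as the earlier labs (sizes in 1000s of square feet, prices in 1000s of dollars):

import numpy as np
import matplotlib.pyplot as plt   # used later for the optional plots

# training data: 1000 sqft -> $300,000 and 2000 sqft -> $500,000
x_train = np.array([1.0, 2.0])       # size in 1000s of square feet
y_train = np.array([300.0, 500.0])   # price in 1000s of dollars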
Compute_Cost
This was developed in the last lab. We'll need it again here.
#Function to calculate the cost
def compute_cost(x, y, w, b):
    m = x.shape[0]
    cost = 0
    for i in range(m):
        f_wb = w * x[i] + b
        cost = cost + (f_wb - y[i])**2
    total_cost = 1 / (2 * m) * cost
    return total_cost
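As a quick sanity check, sketched here under the assumption that x_train and y_train are the arrays set up above: with w = 200 and b = 100 the model predicts 300 and 500 exactly, so the cost should be zero.

# w=200, b=100 fits both training points exactly, so the squared-error cost is 0
print(compute_cost(x_train, y_train, w=200, b=100))   # expect 0.0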
In linear regression, you utilize input training data to fit the parameters $w, b$ by minimizing a measure of the error between our predictions $f_{w,b}(x^{(i)})$ and the actual data $y^{(i)}$. The measure is called the cost, $J(w,b)$. The model is

$$f_{w,b}(x^{(i)}) = w x^{(i)} + b \tag{1}$$

and in training you measure the cost over all of our training samples $x^{(i)}, y^{(i)}$:

$$J(w,b) = \frac{1}{2m} \sum_{i=0}^{m-1} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2 \tag{2}$$

In lecture, gradient descent was described as repeating, until convergence, the simultaneous updates

$$w = w - \alpha \frac{\partial J(w,b)}{\partial w}, \qquad b = b - \alpha \frac{\partial J(w,b)}{\partial b} \tag{3}$$

where the gradient is defined as

$$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum_{i=0}^{m-1} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x^{(i)} \tag{4}$$

$$\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum_{i=0}^{m-1} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) \tag{5}$$
Conventions:
• The naming of python variables containing partial derivatives follows this pattern: $\frac{\partial J(w,b)}{\partial b}$ will be dj_db.
• w.r.t is With Respect To, as in the partial derivative of $J(w,b)$ With Respect To $b$.
compute_gradient
compute_gradient implements (4) and (5) above and returns $\frac{\partial J(w,b)}{\partial w}$, $\frac{\partial J(w,b)}{\partial b}$. The embedded comments describe the operations.
"""
for i in range(m):
f_wb = w * x[i] + b
dj_dw_i = (f_wb - y[i]) * x[i]
dj_db_i = f_wb - y[i]
dj_db += dj_db_i
dj_dw += dj_dw_i
dj_dw = dj_dw / m
dj_db = dj_db / m
return dj_dw, dj_db
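A quick hand-check of the implementation, assuming the x_train/y_train scaling sketched earlier: at w = 0, b = 0 the prediction errors are -300 and -500, so equation (4) gives ((-300)(1) + (-500)(2))/2 = -650 and equation (5) gives (-300 - 500)/2 = -400.

# gradient at (w=0, b=0) for the two-point training set above
tmp_dj_dw, tmp_dj_db = compute_gradient(x_train, y_train, 0, 0)
print(tmp_dj_dw, tmp_dj_db)   # expect -650.0 -400.0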
The lectures described how gradient descent utilizes the partial derivative of the cost with
respect to a parameter at a point to update that parameter.
Let's use our compute_gradient function to find and plot some partial derivatives of our cost
function relative to one of the parameters, w.
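The lab originally used a prepared plotting helper for this figure. A minimal matplotlib sketch of the idea, assuming the x_train/y_train arrays above and a fixed b = 100, might look like:

import numpy as np
import matplotlib.pyplot as plt

b_fixed = 100
w_range = np.linspace(-100, 500, 50)
cost = [compute_cost(x_train, y_train, w, b_fixed) for w in w_range]
slope = [compute_gradient(x_train, y_train, w, b_fixed)[0] for w in w_range]  # dj_dw only

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].plot(w_range, cost)          # the cost curve J(w, b=100)
ax[0].set_xlabel('w'); ax[0].set_ylabel('cost')
ax[1].plot(w_range, slope)         # its slope dJ/dw at each w
ax[1].set_xlabel('w'); ax[1].set_ylabel('dj_dw')
plt.show()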
Above, the left plot shows $\frac{\partial J(w,b)}{\partial w}$, or the slope of the cost curve relative to w, at three points.
On the right side of the plot, the derivative is positive, while on the left it is negative. Due to the
'bowl shape', the derivatives will always lead gradient descent toward the bottom where the
gradient is zero.
The left plot has fixed b = 100. Gradient descent will utilize both $\frac{\partial J(w,b)}{\partial w}$ and $\frac{\partial J(w,b)}{\partial b}$ to
update parameters. The 'quiver plot' on the right provides a means of viewing the gradient of
both parameters. The arrow sizes reflect the magnitude of the gradient at that point. The
direction and slope of each arrow reflects the ratio of $\frac{\partial J(w,b)}{\partial w}$ and $\frac{\partial J(w,b)}{\partial b}$ at that point. Note
that the gradient points away from the minimum. Review equation (3) above. The scaled
gradient is subtracted from the current value of w or b . This moves the parameter in a direction
that will reduce cost.
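The original figure came from a prepared plotting helper; a minimal sketch of how such a quiver plot could be produced with matplotlib, assuming the x_train/y_train arrays above:

import numpy as np
import matplotlib.pyplot as plt

# evaluate the gradient on a coarse grid of (w, b) values
w_vals = np.linspace(-100, 400, 10)
b_vals = np.linspace(-200, 300, 10)
W, B = np.meshgrid(w_vals, b_vals)
U = np.zeros_like(W)   # dj_dw at each grid point
V = np.zeros_like(B)   # dj_db at each grid point
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        U[i, j], V[i, j] = compute_gradient(x_train, y_train, W[i, j], B[i, j])

# arrow direction and length reflect the gradient at each (w, b)
plt.quiver(W, B, U, V)
plt.xlabel('w'); plt.ylabel('b'); plt.title('Gradient of cost(w,b)')
plt.show()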
Gradient Descent
Now that gradients can be computed, gradient descent, described in equation (3) above, can be
implemented below in gradient_descent. The details of the implementation are described in
the comments. Below, you will utilize this function to find optimal values of w and b on the
training data.
def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function):
    """
    Performs gradient descent to fit w,b. Updates w,b by taking
    num_iters gradient steps with learning rate alpha.
    Args:
      x (ndarray (m,)) : Data, m examples
      y (ndarray (m,)) : target values
      w_in,b_in (scalar): initial values of model parameters
      alpha (float): Learning rate
      num_iters (int): number of iterations to run gradient descent
      cost_function: function to call to produce cost
      gradient_function: function to call to produce gradient
    Returns:
      w (scalar): Updated value of parameter after running gradient descent
      b (scalar): Updated value of parameter after running gradient descent
      J_history (list): History of cost values
      p_history (list): History of parameters [w,b]
    """
    # store cost J and parameters at each iteration, primarily for graphing later
    J_history = []
    p_history = []
    w = w_in
    b = b_in

    for i in range(num_iters):
        # Calculate the gradient and update the parameters using gradient_function
        dj_dw, dj_db = gradient_function(x, y, w, b)

        # Update parameters using equation (3) above
        w = w - alpha * dj_dw
        b = b - alpha * dj_db

        # Save cost J and parameters; print progress at intervals
        J_history.append(cost_function(x, y, w, b))
        p_history.append([w, b])
        if i % max(1, num_iters // 10) == 0:
            print(f"Iteration {i:5}: Cost {J_history[-1]:0.2e}  "
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e}  w: {w: 0.3e}, b: {b: 0.5e}")
    return w, b, J_history, p_history
# initialize parameters
w_init = 0
b_init = 0
# some gradient descent settings
iterations = 10000
tmp_alpha = 1.0e-2
# run gradient descent
w_final, b_final, J_hist, p_hist = gradient_descent(x_train, y_train, w_init, b_init,
                                                    tmp_alpha, iterations,
                                                    compute_cost, compute_gradient)
print(f"(w,b) found by gradient descent: ({w_final:8.4f},{b_final:8.4f})")
Take a moment and note some characteristics of the gradient descent process printed above.
• The cost starts large and rapidly declines as described in the slide from the lecture.
• The partial derivatives, dj_dw and dj_db, also get smaller, rapidly at first and then more
slowly. As shown in the diagram from the lecture, as the process nears the 'bottom of the
bowl' progress is slower due to the smaller value of the derivative at that point.
• Progress slows even though the learning rate, alpha, remains fixed (see the cost-versus-iteration sketch after this list).
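A minimal sketch of plotting that behavior with matplotlib, assuming the J_hist list returned by the run above:

import matplotlib.pyplot as plt

# cost declines rapidly at first, then more slowly as the gradient shrinks
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].plot(J_hist[:100])                              # the first 100 iterations
ax[1].plot(range(1000, len(J_hist)), J_hist[1000:])   # the long, slow tail
ax[0].set_title('Cost vs. iteration (start)'); ax[1].set_title('Cost vs. iteration (end)')
for a in ax:
    a.set_xlabel('iteration step'); a.set_ylabel('cost')
plt.show()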
Predictions
Now that you have discovered the optimal values for the parameters w and b, you can use the
model to predict housing values based on the learned parameters. As expected, the predicted
values are nearly identical to the training values for the same house sizes. Further, a size that
was not in the training data yields a prediction in line with the expected value.
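A minimal sketch of those predictions, assuming w_final and b_final from the run above and the 1000s-of-square-feet scaling used for x_train:

print(f"1000 sqft house prediction {w_final*1.0 + b_final:0.1f} Thousand dollars")
print(f"1200 sqft house prediction {w_final*1.2 + b_final:0.1f} Thousand dollars")
print(f"2000 sqft house prediction {w_final*2.0 + b_final:0.1f} Thousand dollars")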
Plotting
You can show the progress of gradient descent during its execution by plotting the cost over
iterations on a contour plot of the cost(w,b).
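The lab used a prepared helper for this figure; a minimal matplotlib sketch, assuming x_train, y_train, and the p_hist list returned by gradient_descent above, might look like:

import numpy as np
import matplotlib.pyplot as plt

# evaluate cost(w,b) on a grid for the contour background
w_vals = np.linspace(0, 400, 100)
b_vals = np.linspace(0, 400, 100)
W, B = np.meshgrid(w_vals, b_vals)
Z = np.array([[compute_cost(x_train, y_train, w, b) for w in w_vals] for b in b_vals])

plt.contour(W, B, Z, levels=20)
p = np.array(p_hist)
plt.plot(p[:, 0], p[:, 1], 'r-x')   # path of (w,b) over the iterations
plt.xlabel('w'); plt.ylabel('b'); plt.title('Contours of cost(w,b) with gradient descent path')
plt.show()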
Above, the contour plot shows the cost(w,b) over a range of w and b. Cost levels are
represented by the rings. Overlaid, using red arrows, is the path of gradient descent. Here are
some things to note:
• The path makes steady, monotonic progress toward its goal.
• The initial steps are much larger than the steps near the goal.
Zooming in, we can see the final steps of gradient descent. Note the distance between steps
shrinks as the gradient approaches zero.
Increased Learning Rate
The lectures also discussed the proper value of the learning rate, α, in equation (3). A larger α makes gradient descent converge faster, but if it is too large, gradient descent will diverge. Let's increase α and see what happens.
# initialize parameters
w_init = 0
b_init = 0
# set alpha to a large value
iterations = 10
tmp_alpha = 8.0e-1
# run gradient descent
w_final, b_final, J_hist, p_hist = gradient_descent(x_train, y_train, w_init, b_init,
                                                    tmp_alpha, iterations,
                                                    compute_cost, compute_gradient)
Above, w and b are bouncing back and forth between positive and negative with the absolute
value increasing with each iteration. Further, $\frac{\partial J(w,b)}{\partial w}$ changes sign with each iteration and the cost is
increasing rather than decreasing. This is a clear sign that the learning rate is too large and the
solution is diverging. Let's visualize this with a plot.
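The original lab drew this figure with a prepared helper; a simplified 2-D sketch of the same idea, assuming the short diverging run above produced J_hist and p_hist:

import numpy as np
import matplotlib.pyplot as plt

p = np.array(p_hist)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].plot(p[:, 0], J_hist, 'r-x')       # cost at each visited w: oscillating and growing
ax[0].set_xlabel('w'); ax[0].set_ylabel('cost')
ax[1].plot(range(len(J_hist)), J_hist)   # cost vs iteration: increasing, not decreasing
ax[1].set_xlabel('iteration step'); ax[1].set_ylabel('cost')
plt.show()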
Above, the left graph shows w's progression over the first few steps of gradient descent. w
oscillates from positive to negative and the cost grows rapidly. Gradient descent is operating on
both w and b simultaneously, so one needs the 3-D plot on the right for the complete picture.
Congratulations!
In this lab you:
• delved into the details of gradient descent for a single variable
• developed a routine to compute the gradient
• completed a gradient descent routine and used it to find optimal values of w and b
• examined the impact of the learning rate on convergence