The gradient descent is an algorithm that minimizes functions. Given a function defined by a parameter set, the gradient slope starts with an initial set of parameter values and iteratively moves towards a set of parameter values that minimize the function. We start at a random point on the function and move towards the negative gradient of the function to reach the local / global minima.
The gradient descent requires a cost function (the most common is the mean square error cost function) because we want to minimize it. This means that the choice of the cost function will affect the calculation of the gradient of each weight. Minimizing any function means finding the deepest valley (minimum) in that function. Note that the cost function is used to monitor the error in the model predictions. Minimizing this basically means achieving the lowest possible error value or increasing the accuracy of the model.
from IPython.display import Image Image(filename="img/grad.png")
We now have all the tools we need to run gradient slope. We can initiate our search to start at any value pair (m and b) and let the algorithm go down our error function towards the best line. Each iteration will update m and b to a line that gives a slightly less error than the previous iteration. Basically, the cost function only serves to monitor the error with each training example, while the derivative of the cost function with respect to one weight is where we need to shift that weight to minimize the error for this training example. There are two parameters (coefficients) in our cost function that we can control: weight m and bias b. Since we have to consider the effect of each of them on the final prediction, we use partial derivatives.
Wnew = W – learning rate * dW (weights)
Bnew = b – learning rate * db (bias)
Loss error = ½ (predicted value – actual value) 2
It’s like looking for the bottom of the valley when we stand on the slope and see nothing. We go down intuitively, but we don’t know exactly what steps we need to take in order not to miss the bottom with shortest time possible.
An example of how the descent gradient works, our algorithm starts with 2 (current x), precision – this is the accuracy we want to achieve at the minimum – i.e. when we stop the algorithm and our function is: y = x + 2
cur_x = 2 rate = 0.01 precision = 0.000001 previous_step = 1 max_iters = 10000 iters = 0 df = lambda x: 2*(x+2)
while previous_step > precision and iters < max_iters: prev_x = cur_x cur_x = cur_x - rate * df(prev_x) previous_step = abs(cur_x - prev_x) iters = iters+1 print('Local minimum', cur_x)
Local minimum -1.9999511727371766
Our local minimum is around -2 and that’s correct, because when we put y = -2 + 2 into our function, we get y = 0
The learning rate variable controls how much step we will use to go down during each iteration. If step is too big, we can go beyond the minimum. However, if we take small steps, it will require multiple iterations to get to the minimum.
We determine these values mainly by trial and error.
from IPython.display import Image Image(filename="img/rate.png")
import pandas as pd from scipy import linspace, randn n = 50 X = np.linspace(-5,5,n) y= np.random.randn(n)
from sklearn.linear_model import LinearRegression import matplotlib.pyplot as plt %matplotlib inline lin_reg = LinearRegression() lin_reg.fit(X.reshape(-1,1), y) plt.scatter(X.reshape(-1,1), y, color = 'g') plt.plot(X.reshape(-1,1), lin_reg.predict(X.reshape(-1,1)), color = 'r') plt.title('Linear Regression') plt.show()
We will use the stats.linregress function to get the slope and intercept from the model:
from scipy import stats slope, intercept, r_value, p_value, std_err = stats.linregress(X,y) print('slope:', slope, 'intercept:', intercept)
slope: 0.028120296227129686 intercept: 0.30492277724022104
Following the logic of the gradient descent, we can calculate the optimal values of W (weights) and b (bias):
import numpy as np costs =  W = 0.5 b = 1 for i in range(1, 50): Y_pred = np.multiply(W, X) + b Loss_error = 0.5 * (Y_pred - y)**2 cost = np.sum(Loss_error)/10 db = np.sum((Y_pred - y)) dw = np.dot((Y_pred - y), X) costs.append(cost) W = W - 0.01*dw b = b - 0.01*db print("W = ", W, "b = ",b) plt.plot(costs) plt.ylabel('cost') plt.xlabel('iterations') plt.show()
W = -2.0729721598684034e+25 b = -535835011.2464077
It was a cost function plotted for one weight only. In a real model, we do all of the above, for all weights, while iterating over all the training examples. Even with a relatively small ML model, you will have more than 1 or 2 weights.
Prediction = Weight * (independent variable) + Bias
There are several types of Gradient Descent:
- Mini – Batch – GD: Here, instead of repeating all training examples and with each iteration, only one training example is calculated, we process n training examples at a time.
- Stochastic – GD: Instead of using and looping each training example, we use just one.
- Basic – GD: Looping over each training example.