Gradient Descent
Introduction to gradient descent
#Introduction
Recall the equation for the loss function given in the linear regression blog.
To draw the best-fitting line, that loss function must be minimized. Gradient descent is the algorithm that performs this minimization.
#Intuition
Imagine a 3D graph with several local minima and maxima. You are dropped onto a random point on this graph, and your goal is to reach the lowest point you possibly can. You are given one device to assist you: a compass that points in the direction of the steepest descent.
To get to the lowest point, the best strategy is to follow the compass until it no longer points in any direction. When that happens, there is no downhill direction left, so you are at the lowest point you can reach from where you started (at least locally).
#Math
The "compass" in the previous example is known as the gradient. The gradient is a vector-valued function whose components are all the partial derivatives of the loss function: $\nabla J(\theta) = \left( \frac{\partial J}{\partial \theta_0}, \frac{\partial J}{\partial \theta_1}, \dots, \frac{\partial J}{\partial \theta_n} \right)$.
The gradient doesn't actually point in the direction of steepest descent: it points in the direction of steepest ascent! In the actual implementation of gradient descent, we move in the opposite direction of the gradient.
Each parameter is updated using the gradient. After the gradient is calculated, each parameter is updated as $\theta_j := \theta_j - \alpha \frac{\partial J}{\partial \theta_j}$, where $\alpha$ is the learning rate. The learning rate controls how large each step is: a high learning rate means the parameters move quickly, while a low learning rate means they move slowly. In practice, the learning rate starts at a small value and is gradually adjusted.
For the mean squared error loss used in linear regression, the partial derivative is evaluated as $\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$, where $m$ is the number of data points and $h_\theta(x^{(i)})$ is the model's prediction.
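To make one update concrete, here is a minimal sketch (toy data and values chosen purely for illustration) that takes a single batch gradient step for linear regression:

```python
import numpy as np

# Toy data: two points on the line y = 2x, with a bias column of ones.
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])  # columns: bias, feature
y = np.array([[2.0],
              [4.0]])

theta = np.zeros((2, 1))  # start at theta = (0, 0)
alpha = 0.1               # learning rate

# Gradient of the MSE loss in matrix form: (1/m) * X^T (X theta - y)
m = X.shape[0]
grad = (1.0 / m) * X.T @ (X @ theta - y)

# One update step: move opposite the gradient.
theta = theta - alpha * grad
print(theta.ravel())  # → [0.3 0.5]
```

Note that the step moves both parameters toward the true values (intercept 0, slope 2), but a single step does not reach them; repeating the update is what drives the loss down.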
#Challenges
Once gradient descent reaches a local minimum, it will stay there, because the gradient is zero. However, a local minimum is not guaranteed to be the global minimum. To combat this issue, there are two common approaches.
- Start with a larger learning rate and slowly decrease it. The algorithm first takes large steps toward the global minimum and, as it approaches, takes smaller and smaller ones. The large early steps let it skip past shallow local minima, while the shrinking learning rate keeps it from overshooting the global minimum.
- Randomize the starting point. With enough random restarts, at least one run is likely to begin close enough to the global minimum to descend into it.
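The first approach can be sketched in a few lines. The decay schedule below ($\alpha_t = \alpha_0 / (1 + kt)$) is one common choice, an assumption here rather than the only option:

```python
import numpy as np

def gradient_descent_with_decay(X, y, lr0=0.5, decay=0.01, epochs=1000):
    """Gradient descent whose learning rate shrinks each epoch."""
    m = X.shape[0]
    theta = np.zeros((X.shape[1], y.shape[1]))
    for epoch in range(epochs):
        lr = lr0 / (1.0 + decay * epoch)          # decaying step size
        grad = (1.0 / m) * X.T @ (X @ theta - y)  # batch gradient
        theta -= lr * grad
    return theta

# Fit the toy line y = 2x (bias column included in X).
X = np.array([[1.0, 1.0], [1.0, 2.0]])
y = np.array([[2.0], [4.0]])
theta = gradient_descent_with_decay(X, y)
```

Here `lr0` and `decay` are hypothetical hyperparameters; in practice they are tuned per problem.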
However, in linear regression this issue does not occur: the loss function is convex, so its only local minimum is the global minimum.
#Implementation

```python
import numpy as np

LEARNING_RATE = 0.01

def gradient_descent(X, y):
    # Ensure y is a 2D column matrix so the code handles one or many outputs.
    if len(y.shape) == 1:
        outputs = 1
        y = y.reshape(-1, 1)
    else:
        outputs = y.shape[1]
    theta = np.zeros((X.shape[1], outputs), dtype=float)
    for _ in range(1000):  # runs through 1000 epochs
        # Batch update: theta := theta - (alpha / m) * X^T (X theta - y)
        theta = theta - (LEARNING_RATE / X.shape[0]) * np.dot(np.transpose(X), (np.dot(X, theta) - y))
    return theta
```

#Faster Gradient Descent
Rather than going through all the data points in one step, we can pick one random data point, calculate the gradient at that single point, and use it to update the parameters.
This speeds up each update because it depends on a single data point rather than the entire dataset. This is known as stochastic gradient descent, and it helps with large datasets.
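A minimal sketch of this idea, reusing the same matrix form as the batch implementation above (the epoch count and learning rate are illustrative assumptions):

```python
import numpy as np

def stochastic_gradient_descent(X, y, lr=0.05, epochs=1000, seed=0):
    """SGD sketch: each update uses one randomly chosen data point."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((X.shape[1], y.shape[1]))
    for _ in range(epochs):
        for i in rng.permutation(X.shape[0]):   # visit points in random order
            xi = X[i:i + 1]                     # shape (1, n_features)
            yi = y[i:i + 1]
            grad = xi.T @ (xi @ theta - yi)     # gradient at a single point
            theta -= lr * grad
    return theta

# Fit the toy line y = 2x (bias column included in X).
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([[2.0], [4.0], [6.0]])
theta = stochastic_gradient_descent(X, y)
```

Each inner-loop update touches only one row of `X`, which is what makes the cost per step independent of the dataset size; shuffling the visit order each epoch is a common refinement over picking indices uniformly at random.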