Linear Regression

Introduction to linear regression and the normal equation


It's best to know some calculus and linear algebra.

this very well may be riddled with tiny mistakes

Introduction

Linear regression is a simple machine learning algorithm in which the machine learns a function that maps an input to an output. This is especially useful for predicting numerical outcomes.

Thus, the goal of linear regression is to find a function (we call it a hypothesis, $h$) such that, given inputs (or features) $x$ and outputs $y$, $h(x) \approx y$.

To find an equation for $h(x)$, recall that the standard equation of a line is defined as

$$y = mx + b$$

where $m$ is the slope of the line and $b$ is the intercept. However, datasets in ML tend not to be a single mapping $x \mapsto y$, but rather map many features to an output

$$\begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix} \mapsto y$$

where there are $n$ features. Additionally, rather than a single slope $m$, we have a parameter $\theta_i$ for each feature.

Thus, in linear regression, we get the following equation for our hypothesis $h$:

$$h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \sum_{i=1}^{n} \theta_i x_i + \theta_0$$
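A common convention (which we use later in the implementation) is to define a dummy feature $x_0 = 1$, so the intercept $\theta_0$ folds into the sum and the hypothesis becomes a single dot product:

$$h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^\top x$$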

The learning algorithm tries to minimize the loss over the $m$ training examples, which we define as

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta\left(x^{(i)}\right) - y^{(i)}\right)^2$$

To minimize it, we can use gradient descent (I'll cover this in a separate entry) or the normal equation, which I will cover here.

Normal Equation

We define the design matrix, which stacks the $m$ training examples as rows, as

$$\boldsymbol{X} = \begin{bmatrix} \text{---} & \left(x^{(1)}\right)^\top & \text{---} \\ & \vdots & \\ \text{---} & \left(x^{(m)}\right)^\top & \text{---} \end{bmatrix}$$

Notice how the vector of predictions $h_\theta(x^{(i)})$, stacked over all training examples, is just the design matrix $\boldsymbol{X}$ multiplied by the parameter vector $\theta$.
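Written out, this product stacks one prediction per training example:

$$\boldsymbol{X}\theta = \begin{bmatrix} \left(x^{(1)}\right)^\top \theta \\ \vdots \\ \left(x^{(m)}\right)^\top \theta \end{bmatrix} = \begin{bmatrix} h\left(x^{(1)}\right) \\ \vdots \\ h\left(x^{(m)}\right) \end{bmatrix}$$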

Because $v^\top v = \sum_{i} v_i^2$ for any vector $v$ (this is just the dot product of a vector with itself), we obtain

$$J(\theta) = \frac{1}{2}(\boldsymbol{X}\theta - y)^\top (\boldsymbol{X}\theta - y)$$
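To see the equivalence concretely, here is a minimal NumPy sketch (with made-up toy numbers) showing that the summed loss and the vectorized loss agree:

import numpy as np

# toy data: 4 examples, 2 features, plus the intercept column of 1s
X = np.c_[np.ones(4), np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 2.5]])]
y = np.array([3.0, 4.0, 6.0, 9.0])
theta = np.array([0.5, 1.5, 0.2])  # arbitrary parameters

# summed form: J(theta) = 1/2 * sum_i (h(x_i) - y_i)^2
J_sum = 0.5 * sum((np.dot(X[i], theta) - y[i]) ** 2 for i in range(len(y)))

# vectorized form: J(theta) = 1/2 * (X theta - y)^T (X theta - y)
residual = np.dot(X, theta) - y
J_vec = 0.5 * np.dot(residual, residual)

print(J_sum, J_vec)  # both print the same value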

In order to minimize $J(\theta)$, we must have $\nabla_\theta J(\theta) = \vec{0}$. This is the gradient of $J(\theta)$ with respect to $\theta$; at a minimum, the function is no longer decreasing in any direction, so the gradient is zero.

Given the definition of $J(\theta)$, we compute the gradient as follows:

$$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta\, \frac{1}{2}(\boldsymbol{X}\theta - y)^\top(\boldsymbol{X}\theta - y) \\
&= \frac{1}{2}\nabla_\theta\, (\theta^\top\boldsymbol{X}^\top - y^\top)(\boldsymbol{X}\theta - y) \\
&= \frac{1}{2}\nabla_\theta\, \left(\theta^\top\boldsymbol{X}^\top\boldsymbol{X}\theta - \theta^\top\boldsymbol{X}^\top y - y^\top\boldsymbol{X}\theta + y^\top y\right) \\
&= \frac{1}{2}\left(\boldsymbol{X}^\top\boldsymbol{X}\theta + \boldsymbol{X}^\top\boldsymbol{X}\theta - \boldsymbol{X}^\top y - \boldsymbol{X}^\top y\right) \\
&= \boldsymbol{X}^\top\boldsymbol{X}\theta - \boldsymbol{X}^\top y
\end{aligned}$$
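The step from the third to the fourth line uses two standard matrix-calculus identities, for a symmetric matrix $A$ and a constant vector $b$:

$$\nabla_\theta\, \theta^\top A \theta = 2A\theta, \qquad \nabla_\theta\, b^\top \theta = b$$

Here $A = \boldsymbol{X}^\top\boldsymbol{X}$ (which is symmetric) and $b = \boldsymbol{X}^\top y$; note that $y^\top\boldsymbol{X}\theta$ is a scalar, so it equals its own transpose $\theta^\top\boldsymbol{X}^\top y$.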

And thus, setting the gradient to zero, we arrive at the normal equation:

$$\boldsymbol{X}^\top \boldsymbol{X}\theta = \boldsymbol{X}^\top y \qquad\therefore\qquad \theta = \left(\boldsymbol{X}^\top \boldsymbol{X}\right)^{-1}\boldsymbol{X}^\top y$$

(This assumes $\boldsymbol{X}^\top \boldsymbol{X}$ is invertible; the implementation below uses the pseudoinverse, which also covers the singular case.)

Implementation

Since the normal equation is so simple, the implementation is too.

One thing to note is that we concatenate a column of $1$s to the training data; this column corresponds to the intercept term $\theta_0$.

Remember that

$$h(x) = \sum_{i=1}^{n} \theta_i x_i + \theta_0$$

The column of $1$s is the coefficient on the $0^\text{th}$ parameter, $\theta_0$.
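For illustration, with two training examples and two features, the augmented data looks like this (the leading $1$ in each row multiplies $\theta_0$):

$$\begin{bmatrix} 1 & x^{(1)}_1 & x^{(1)}_2 \\ 1 & x^{(2)}_1 & x^{(2)}_2 \end{bmatrix}$$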

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import time

def normal_equation(X, y):
    # theta = (X^T X)^+ X^T y; using the pseudoinverse (pinv) also
    # handles the case where X^T X is singular
    X_t = np.transpose(X)

    return np.dot(np.linalg.pinv(np.dot(X_t, X)), np.dot(X_t, y))

X, y = make_regression(n_samples=10000, n_features=100, noise=1.0, random_state=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

np.save("X_test.npy", X_test)
np.save("y_test.npy", y_test)

X_train_np = np.c_[np.ones((X_train.shape[0], 1)), X_train]  # prepend the intercept column of 1s

start = time.time()
theta = normal_equation(X_train_np, y_train)
np.save("theta_normal.npy", theta)
print(f"Normal equation complete in {time.time() - start}s")

We store the testing data to use in the testing code.
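As a side note, explicitly forming the (pseudo)inverse of $\boldsymbol{X}^\top\boldsymbol{X}$ is not the most numerically stable way to solve the normal equation; a least-squares solver can handle the same problem directly. Here is a minimal sketch of that alternative (normal_equation_lstsq is a hypothetical drop-in replacement for normal_equation above, taking the same augmented X and y):

def normal_equation_lstsq(X, y):
    # solve the least-squares problem min ||X theta - y||^2 directly;
    # its solution satisfies the same normal equation X^T X theta = X^T y
    theta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
    return theta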

Testing

We test our model with the previously stored testing data.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

X_test = np.load("X_test.npy")
y_test = np.load("y_test.npy")

X_test_np = np.c_[np.ones((X_test.shape[0], 1)), X_test]  # same intercept column as in training

theta = np.load("theta_normal.npy")

# prediction
pred = np.dot(X_test_np, theta)

# check accuracy
mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)

print("MAE: ", mae)
print("MSE: ", mse)
print("R^2: ", r2)

Conclusion

Linear regression is a rather straightforward machine learning algorithm, and can be easily implemented with the normal equation.

The normal equation takes the inputs and outputs and finds the parameters with some simple linear algebra, demonstrating linear algebra's vast use in the field of machine learning.