Linear Regression

Introduction to linear regression and the normal equation


It's best to know some calculus and linear algebra.

this very well may be riddled with tiny mistakes

Introduction

Linear regression is a simple machine learning algorithm in which the machine learns a function that maps an input to an output. This is especially useful for predicting numerical outcomes.

Thus, the goal of linear regression is to find a function (we call it a hypothesis, $h$) such that, given inputs (or features) $x$ and outputs $y$, $h(x) \approx y$.

To find an equation for $h(x)$, recall that the standard equation of a line is defined as

$$y = mx + b$$

where $m$ is the slope of the line and $b$ is the intercept. However, datasets in ML tend not to be a single mapping $x \mapsto y$, but rather map many features to an output

$$\begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix} \mapsto y$$

where there are $n$ features. Additionally, rather than a single slope $m$, we have a parameter $\theta_i$ for each feature.

Thus, in linear regression, we get the following equation for our hypothesis $h$:

$$h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \sum_{i=1}^{n} \theta_i x_i + \theta_0$$
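A common convention (which we use later in the implementation) is to define a dummy feature $x_0 = 1$, so the intercept $\theta_0$ folds into the sum and the hypothesis becomes a single dot product:

$$h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^\top x$$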

The learning algorithm tries to minimize the loss over the $m$ training examples, which we define as

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta\left(x^{(i)}\right) - y^{(i)}\right)^2$$

To minimize it, we can use gradient descent (I'll cover this in a separate entry) or the normal equation, which I will cover here.

Normal Equation

We define the design matrix, which stacks the $m$ training examples as rows, as

$$\boldsymbol{X} = \begin{bmatrix} \text{---} & \left(x^{(1)}\right)^\top & \text{---} \\ & \vdots & \\ \text{---} & \left(x^{(m)}\right)^\top & \text{---} \end{bmatrix}$$

Notice how the vector of predictions $h_\theta(x^{(i)})$, stacked over all training examples, is just the design matrix $\boldsymbol{X}$ multiplied by the parameter vector $\theta$.
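Written out, this product stacks one prediction per training example:

$$\boldsymbol{X}\theta = \begin{bmatrix} \left(x^{(1)}\right)^\top \theta \\ \vdots \\ \left(x^{(m)}\right)^\top \theta \end{bmatrix} = \begin{bmatrix} h\left(x^{(1)}\right) \\ \vdots \\ h\left(x^{(m)}\right) \end{bmatrix}$$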

Because $v^\top v = \sum_{i} v_i^2$ for any vector $v$ (this is just the dot product of a vector with itself), we obtain

$$J(\theta) = \frac{1}{2}(\boldsymbol{X}\theta - y)^\top (\boldsymbol{X}\theta - y)$$
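To see the equivalence concretely, here is a minimal NumPy sketch (with made-up toy numbers) showing that the summed loss and the vectorized loss agree:

import numpy as np

# toy data: 4 examples, 2 features, plus the intercept column of 1s
X = np.c_[np.ones(4), np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 2.5]])]
y = np.array([3.0, 4.0, 6.0, 9.0])
theta = np.array([0.5, 1.5, 0.2])  # arbitrary parameters

# summed form: J(theta) = 1/2 * sum_i (h(x_i) - y_i)^2
J_sum = 0.5 * sum((np.dot(X[i], theta) - y[i]) ** 2 for i in range(len(y)))

# vectorized form: J(theta) = 1/2 * (X theta - y)^T (X theta - y)
residual = np.dot(X, theta) - y
J_vec = 0.5 * np.dot(residual, residual)

print(J_sum, J_vec)  # both print the same value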

In order to minimize $J(\theta)$, we must have $\nabla_\theta J(\theta) = \vec{0}$. This is the gradient of $J(\theta)$ with respect to $\theta$; at a minimum, the function is no longer decreasing in any direction, so the gradient is zero.

Given the definition of $J(\theta)$, we compute the gradient as follows:

$$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta\, \frac{1}{2}(\boldsymbol{X}\theta - y)^\top(\boldsymbol{X}\theta - y) \\
&= \frac{1}{2}\nabla_\theta\, (\theta^\top\boldsymbol{X}^\top - y^\top)(\boldsymbol{X}\theta - y) \\
&= \frac{1}{2}\nabla_\theta\, \left(\theta^\top\boldsymbol{X}^\top\boldsymbol{X}\theta - \theta^\top\boldsymbol{X}^\top y - y^\top\boldsymbol{X}\theta + y^\top y\right) \\
&= \frac{1}{2}\left(\boldsymbol{X}^\top\boldsymbol{X}\theta + \boldsymbol{X}^\top\boldsymbol{X}\theta - \boldsymbol{X}^\top y - \boldsymbol{X}^\top y\right) \\
&= \boldsymbol{X}^\top\boldsymbol{X}\theta - \boldsymbol{X}^\top y
\end{aligned}$$
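The step from the third to the fourth line uses two standard matrix-calculus identities, for a symmetric matrix $A$ and a constant vector $b$:

$$\nabla_\theta\, \theta^\top A \theta = 2A\theta, \qquad \nabla_\theta\, b^\top \theta = b$$

Here $A = \boldsymbol{X}^\top\boldsymbol{X}$ (which is symmetric) and $b = \boldsymbol{X}^\top y$; note that $y^\top\boldsymbol{X}\theta$ is a scalar, so it equals its own transpose $\theta^\top\boldsymbol{X}^\top y$.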

And thus, setting the gradient to zero, we arrive at the normal equation:

$$\boldsymbol{X}^\top \boldsymbol{X}\theta = \boldsymbol{X}^\top y \qquad\therefore\qquad \theta = \left(\boldsymbol{X}^\top \boldsymbol{X}\right)^{-1}\boldsymbol{X}^\top y$$

(This assumes $\boldsymbol{X}^\top \boldsymbol{X}$ is invertible; the implementation below uses the pseudoinverse, which also covers the singular case.)

Implementation

Since the normal equation is so simple, the implementation is too.

One thing to note is that we concatenate a column of $1$s to the training data; this column corresponds to the intercept term $\theta_0$.

Remember that

$$h(x) = \sum_{i=1}^{n} \theta_i x_i + \theta_0$$

The column of $1$s is the coefficient on the $0^\text{th}$ parameter, $\theta_0$.
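For illustration, with two training examples and two features, the augmented data looks like this (the leading $1$ in each row multiplies $\theta_0$):

$$\begin{bmatrix} 1 & x^{(1)}_1 & x^{(1)}_2 \\ 1 & x^{(2)}_1 & x^{(2)}_2 \end{bmatrix}$$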

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import time

def normal_equation(X, y):
    # theta = (X^T X)^+ X^T y; using the pseudoinverse (pinv) also
    # handles the case where X^T X is singular
    X_t = np.transpose(X)

    return np.dot(np.linalg.pinv(np.dot(X_t, X)), np.dot(X_t, y))

X, y = make_regression(n_samples=10000, n_features=100, noise=1.0, random_state=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

np.save("X_test.npy", X_test)
np.save("y_test.npy", y_test)

X_train_np = np.c_[np.ones((X_train.shape[0], 1)), X_train]  # prepend the intercept column of 1s

start = time.time()
theta = normal_equation(X_train_np, y_train)
np.save("theta_normal.npy", theta)
print(f"Normal equation complete in {time.time() - start}s")

We store the testing data to use in the testing code.
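As a side note, explicitly forming the (pseudo)inverse of $\boldsymbol{X}^\top\boldsymbol{X}$ is not the most numerically stable way to solve the normal equation; a least-squares solver can handle the same problem directly. Here is a minimal sketch of that alternative (normal_equation_lstsq is a hypothetical drop-in replacement for normal_equation above, taking the same augmented X and y):

def normal_equation_lstsq(X, y):
    # solve the least-squares problem min ||X theta - y||^2 directly;
    # its solution satisfies the same normal equation X^T X theta = X^T y
    theta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
    return theta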

Testing

We test our model with the previously stored testing data.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

X_test = np.load("X_test.npy")
y_test = np.load("y_test.npy")

X_test_np = np.c_[np.ones((X_test.shape[0], 1)), X_test]  # same intercept column as in training

theta = np.load("theta_normal.npy")

# prediction
pred = np.dot(X_test_np, theta)

# check accuracy
mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)

print("MAE: ", mae)
print("MSE: ", mse)
print("R^2: ", r2)

Conclusion

Linear regression is a rather straightforward machine learning algorithm, and can be easily implemented with the normal equation.

The normal equation takes the inputs and outputs and finds the parameters with some simple linear algebra, demonstrating linear algebra's vast use in the field of machine learning.