
Weighted Linear Regression

Linear Regression

If you are here, chances are you already know how simple linear regression works: it is the first and simplest algorithm you meet in your machine learning journey. Let’s recap it anyway, since it will be useful later to introduce its weighted form.

Let’s say that you have a set of values $X$ and for each of them a target value $Y$. If you plot them, it is easy to see that they could be approximated by a simple straight line:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 1000)
y = 3 * x + 3 + 2 * np.random.randn(1000)

plt.plot(x, y, "b.", alpha=0.4)
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

/posts/2022/05/weighted-linear-regression/index_files/index_3_0.png

Your goal is to find the straight line that best approximates the data behaviour; this means that you want to find $\alpha$ and $\beta$ such that:

$$\operatorname{argmin}_{\alpha, \beta} \sum_i (y_i - (\alpha + \beta x_i))^2$$

Once you have them you can predict the output for every $X$ with:

$$y_{pred} = \alpha + \beta x$$

The really interesting thing about linear regression is that the best fit can be obtained in closed form, i.e. by solving a simple formula:

$$\beta = \frac{\sum_i(x_i - \hat x)(y_i - \hat y)}{\sum_i(x_i - \hat x)^2}$$ $$\alpha = \hat y - \beta \hat x$$

Where $\hat x$ and $\hat y$ are respectively the mean of all the $X$ and $Y$ values. The same result could also be expressed by means of matrix multiplications, but here we’ll stick with this form since it can be easily implemented through broadcasting in numpy.

def linear_regression(x: np.ndarray, y: np.ndarray):
    # Closed-form ordinary least squares fit: the slope is the ratio of the
    # covariance of x and y to the variance of x, the intercept follows from the means.
    xm, ym = np.mean(x), np.mean(y)
    b = np.sum((x - xm) * (y - ym)) / np.sum(np.square(x - xm))
    a = ym - b * xm
    return a, b

a, b = linear_regression(x, y)
plt.plot(x, y, "b.", alpha=0.4)
plt.plot(x, a + b * x, "r--", linewidth=3)
plt.show()

/posts/2022/05/weighted-linear-regression/index_files/index_5_0.png
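As mentioned above, the same fit can also be obtained through the matrix formulation. A minimal sketch using np.linalg.lstsq on the same x and y as above (the design matrix construction here is illustrative, not part of the original code):

# Build the design matrix [1, x] and solve the least squares problem;
# this recovers (approximately) the same intercept and slope as the closed form.
X_design = np.column_stack([np.ones_like(x), x])
(a_mat, b_mat), *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(a_mat, b_mat)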

Linear Regression Properties

Now, it’s important to observe that linear regression makes some assumptions about the data it is applied to; if these assumptions do not hold, the closed form will lead to suboptimal results.

Linearity

The dependent variable ($Y$) has a linear relationship with the independent one ($X$). This is pretty obvious: for instance, the following function can only be poorly represented by a linear regression:

x = np.linspace(-10, 10, 1000)
y2 = 5 + 2 * x + 4 * x**2
plt.xlabel("X")
plt.ylabel("Y")
plt.plot(x, y2, "b-")
plt.show()

/posts/2022/05/weighted-linear-regression/index_files/index_8_0.png

Normality

The observation errors are normally distributed.
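A quick way to check this assumption is to fit the model and look at the distribution of the residuals, which should look roughly bell-shaped. A minimal sketch on freshly generated data (the variable names and the 30-bin histogram below are illustrative choices, not from the original post):

# Fit on synthetic data with gaussian noise and inspect the residuals.
x_chk = np.linspace(0, 5, 1000)
y_chk = 3 * x_chk + 3 + 2 * np.random.randn(1000)
a_chk, b_chk = linear_regression(x_chk, y_chk)
residuals = y_chk - (a_chk + b_chk * x_chk)

plt.hist(residuals, bins=30)
plt.xlabel("residual")
plt.show()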

Independency

The error of each observation is independent of the others; this means that the error you see in one observation cannot affect the error of other observations.
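A rough check is the correlation between consecutive residuals, which should be close to zero when the errors are independent. A minimal sketch reusing the residuals array computed in the previous snippet (an assumption of this sketch):

# Lag-1 autocorrelation of the residuals: a value far from zero would
# suggest that one observation's error carries over into the next one.
lag1_corr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(lag1_corr)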

Low Multi-Collinearity

There isn’t high correlation among the independent variables.
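This assumption only matters once you have more than one independent variable. A minimal sketch of how you might spot the problem with a correlation matrix (the two features x1 and x2 below are made up for illustration):

# x2 is almost a linear function of x1, so their correlation is close to 1
# and a multivariate fit using both would be unstable.
x1 = np.random.randn(1000)
x2 = 2 * x1 + 0.1 * np.random.randn(1000)
print(np.corrcoef(x1, x2))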

Homoscedasticity

All the observation errors have the same finite variance. Basically, this means that you assume the noise present in your observations always varies inside the same range, following a Gaussian distribution with the same variance. When this assumption does not hold you are dealing with a case of heteroscedasticity. Below is an example of heteroscedasticity, in which the linear regression fails due to the large number of outliers.

# The noise amplitude depends on x, so the constant-variance assumption is violated.
y_hetero = 3 * x + 3 + np.abs(5 * x * np.sin(x) * np.random.randn(1000))
a, b = linear_regression(x, y_hetero)
plt.figure(figsize=(15, 5))
plt.xlabel("X"); plt.ylabel("Y")
plt.plot(x, y_hetero, "bo", alpha=0.3)
plt.plot(x, a + x * b, "r--", linewidth=3.0)
plt.show()

/posts/2022/05/weighted-linear-regression/index_files/index_10_0.png

Weighted Linear Regression

We are particularly interested in the homoscedasticity property, since relaxing it gives linear regression the capability to handle outliers. We can model each observation as a normal random variable with its own variance $\sigma_i^2$ and weight each observation by the inverse of that variance, $w_i = 1/\sigma_i^2$. The problem to be solved thus becomes:

$$\operatorname{argmin}_{\alpha, \beta} \sum_i w_i (y_i - (\alpha + \beta x_i))^2$$

The closed form to find $\alpha$ and $\beta$ in the weighted case is:

$$\beta = \frac{\sum_i w_i(x_i - \hat x)(y_i - \hat y)}{\sum_i w_i (x_i - \hat x)^2}$$ $$\alpha = \hat y - \beta \hat x$$ $$\hat x = \frac{\sum_i w_ix_i}{\sum_i w_i}$$ $$\hat y = \frac{\sum_i w_iy_i}{\sum_i w_i}$$

Below is an example where a portion of the data is not representative and is ignored by assigning it zero weight.

def weighted_linear_regression(x: np.ndarray, y: np.ndarray, w: np.ndarray):
    # Same closed form as before, but the means and every sum are weighted by w,
    # so observations with a small weight barely contribute to the fit.
    xm = np.sum(w * x) / np.sum(w)
    ym = np.sum(w * y) / np.sum(w)
    b = np.sum(w * (x - xm) * (y - ym)) / np.sum(w * np.square(x - xm))
    a = ym - b * xm
    return a, b

# The last 200 points are shifted upwards and act as non-representative outliers.
y_outliers = 3 * x + 3 * 10 * np.random.randn(1000)
y_outliers[-200:] = y_outliers[-200:] + 80

plt.plot(x, y_outliers, "b.", alpha=0.4)
for i in [0.0, 1.0]:
    weights = np.ones_like(y_outliers)
    weights[-200:] = i
    a, b = weighted_linear_regression(x, y_outliers, weights)
    plt.plot(x, a + b * x, label=f"weight: {i:.2f}", linewidth=2)

plt.legend()
plt.show()

/posts/2022/05/weighted-linear-regression/index_files/index_12_0.png
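Binary weights are just the extreme case: when you have an estimate of each observation’s noise level you can plug $w_i = 1/\sigma_i^2$ directly into the same function. A minimal sketch on synthetic heteroscedastic data where the per-point noise scale is known by construction (all names below are illustrative):

# Noise that grows with x: observations on the right are less reliable.
x_w = np.linspace(0, 5, 1000)
sigma = 0.5 + x_w
y_w = 3 * x_w + 3 + sigma * np.random.randn(1000)

# Weight each point by the inverse of its variance.
a_w, b_w = weighted_linear_regression(x_w, y_w, 1 / sigma**2)
print(a_w, b_w)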