Intro to Linear Regression

One of the first concepts people encounter in artificial intelligence (AI) and machine learning (ML) is linear regression, largely because many AI and ML algorithms optimize for a given solution using linear or logistic regression. So, let’s begin there.

To better understand linear and logistic regressions, let’s imagine we own a company that makes a Widget. To see how we’re doing, we ask our salesperson to keep track of how many sales they make over nine months. If our company is successful, we would expect the total number of sales to increase as the year progresses. But here’s a more interesting question: is there a relationship between the month and the total sales we’ve made? Let’s say our salesperson collects the following data:

Month    1    2    3    4    5    6    7    8    9
Sales    5   16   31   59   62   78   90   98  101

We can plot the number of sales as a function of time, and that might begin to get us the answer we’re looking for. But we’d really like to fit some sort of line to the data to see if we can approximately describe the relationship with an equation. But what kind of line? This is the difference between a linear regression (“linear” implying we fit a straight line) and a logistic regression (“logistic” implying we fit an S-shaped logistic curve rather than a straight line).
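For reference, and only as a sketch of the two model forms (the symbols m, b, L, k, and x_0 below are generic parameters, not quantities defined elsewhere in this post), the two kinds of fit can be written as

[math] y = mx + b \quad \text{(linear)}, \qquad y = \frac{L}{1 + e^{-k(x - x_0)}} \quad \text{(logistic)} [/math]

For our sales data, the straight-line model is the one we’ll fit below.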

Implementation + Error analysis

Numerically, each vector of values can be stored in Python as an array. We can use the efficiency of Python’s NumPy package to fit a polynomial to our data and plot both the data and the resulting fit:

import numpy as np
import matplotlib.pyplot as plt

month = np.array([1,2,3,4,5,6,7,8,9])
total_sales = np.array([5,16,31,59,62,78,90,98,101])

# Determine linear fit
coef = np.polyfit(month, total_sales, 1)
poly1d_fn = np.poly1d(coef) 

# Plot points and linear fit
plt.plot(month, total_sales, 'yo', month, poly1d_fn(month), '--k')
plt.xlabel('Time [months]')
plt.ylabel('Sales')
plt.show()
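Before moving on, it’s worth noting what np.polyfit actually returns: for a degree-1 fit, coef contains the slope followed by the intercept. As a quick, optional check (using the same variables defined above), we could print them:

# Unpack slope and intercept from the degree-1 fit (highest power first)
slope, intercept = coef
print(f"slope = {slope:.2f} sales/month, intercept = {intercept:.2f}")
# For the data above this works out to roughly 12.8 and -3.9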

The linear fit produced by NumPy seems to describe our dataset reasonably well, but we can still apply a numerical metric to tell us exactly how good the fit is. The method we’ll demonstrate here uses the “R-squared” value, which produces a number between 0 (a poor fit) and 1 (a perfect fit). This will also allow us to compare the accuracy of our optimized function to our initial matrix division technique. We’ll start by defining the R-squared measurement:

[math] R^2 = 1 - \frac{SSR}{SST} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} [/math]

where [math] y_i [/math] are the realized data (sales), [math] \hat{y}_i [/math] are the predicted (fit) values, and [math] \bar{y} [/math] is the mean of the real data points. Similar to our linear regression models, we can write this up as a compact function (not optimized for speed). Below, our function r_sqr ingests our realized data values (y) and the data predicted by our fit (fitted_data) at the same x-values (months). Combined with the previous code, the following snippet returns a value of [math] R^2 = 0.965 [/math].

def r_sqr(y, fitted_data):
    # Residual sum of squares (SSR) and total sum of squares (SST)
    ssr = sum((y - fitted_data)**2)
    sst = sum((y - np.mean(y))**2)
    return 1 - (ssr / sst)

fitted_data = poly1d_fn(month)
r2 = r_sqr(total_sales,fitted_data)
print(r2)
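
As an optional sanity check (this assumes SciPy is available; it isn’t used anywhere else in this post), scipy.stats.linregress fits the same straight line and reports the correlation coefficient r, whose square should agree with our r_sqr result:

from scipy import stats

# linregress returns slope, intercept, rvalue, pvalue, and stderr
result = stats.linregress(month, total_sales)
print(result.rvalue**2)  # should also print approximately 0.965
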
2D particle paths

(in progress)