In-depth Explained Simple Linear Regression from Scratch - Part 1

In my opinion, most Machine Learning tutorials aren't beginner-friendly enough: they are either very math-heavy or they skip over the algorithm that actually does the work.

In this post, we are going to build simple Linear Regression from scratch. We will see the mathematical intuition behind it, write the code from scratch, and test it, and I'm super excited to get started!!

Ready? Let’s jump in.

Let's get the intro done!
The simplest form of the linear regression model is also a linear function of the input variables. However, we can obtain a much more useful class of functions by taking linear combinations of a fixed set of nonlinear functions of the input variables, known as basis functions. Such models are linear functions of the parameters, which gives them simple analytical properties, and yet they can be nonlinear with respect to the input variables.
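
To make that last point concrete, here is a minimal sketch (a toy illustration of my own, with a made-up helper called polynomial_features) of a polynomial basis: the features are nonlinear in x, but the prediction is still just a linear combination of the weights.

import numpy as np

def polynomial_features(x, degree):
  # basis functions: phi_j(x) = x**j for j = 0, ..., degree
  return np.array([x**j for j in range(degree + 1)])

w = np.array([1.0, 0.5, -0.2])                 # weights w0, w1, w2
print(np.dot(w, polynomial_features(3.0, 2)))  # 1 + 0.5*3 - 0.2*9 = 0.7 (up to floating point)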

The Motivation
For most applications, knowing that such a linear relationship exists isn’t enough. We’ll want to understand the nature of the relationship. This is where we’ll use simple linear regression.

When to use it?
The relationship between the variables is linear.
The variance of the residuals (the differences between the real and predicted values) is more or less constant across the data.
The residuals are independent, meaning they are distributed randomly and are not influenced by the residuals of previous observations (if the residuals are not independent of each other, they are said to be autocorrelated).

What is Linear Regression?

The simplest linear model for regression is one that involves a linear combination of the input variables
                        y(x, w) = w0 + w1*x1 + ... + wD*xD,     where x = (x1, ..., xD)^T
In simple terms, for a single input variable, we can also write this as
                              yi = β*xi + α + εi,     where ε is a (hopefully small) error term

The parameter w0 allows for any fixed offset in the data and is sometimes called a bias parameter (not to be confused with ‘bias’ in a statistical sense).
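
To see what that means in code, here is a tiny sketch (illustration only, with a hypothetical helper called linear_model): the prediction is a dot product of the weights with the inputs, plus the fixed offset w0.

import numpy as np

def linear_model(w0, w, x):
  # y(x, w) = w0 + w1*x1 + ... + wD*xD
  return w0 + np.dot(w, x)

print(linear_model(32.0, np.array([1.8]), np.array([100.0])))  # 212.0 for the one-dimensional case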

The Goal

The bias coefficient gives the model an extra degree of freedom. The goal is to draw the line of best fit between X and Y, i.e. the line that best estimates the relationship between X and Y.

But how do we find these coefficients (w0, w1)?

There are different approaches to finding these coefficients. One is the Ordinary Least Squares (OLS) method (which we'll see in part 1) and the other is the Gradient Descent approach (which we'll see in part 2).

We will first look at Ordinary Least Squares.

Consider a graphical representation (a scatter plot) of some random sample points.
Looking at such a plot, we can see that there could be multiple plausible lines of best fit.

But how can we decide which is the best one? If only we had some way to measure how well each line fits each data point...

Residuals (or Errors) help here
First, we will see what a residual is and how it solves our problem.
A residual is a measure of how well a line fits an individual data point. Consider this simple data set with a line of fit drawn through it:

And the total error of the linear model is the sum of the squared error of each point, i.e.,

                              E = Σ ri^2     (summed over i = 1, ..., n)

ri = distance between the line and the ith point.
n = total number of points.
So, in the above example, the total squared error is 2^2 + (-4)^2 + (-1)^2 + 1^2 + (-2)^2 = 4 + 16 + 1 + 1 + 4 = 26.
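
To make the arithmetic concrete, here is the same calculation in a couple of lines of Python (using the residuals 2, -4, -1, 1, -2 read off the example plot):

residuals = [2, -4, -1, 1, -2]
total_error = sum(r**2 for r in residuals)  # square each residual, then add them all up
print(total_error)  # 26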

We are squaring each of the distances because some points would be above the line and some below, and squaring keeps them from cancelling out. We can fit the linear model by minimizing this sum of squared errors; setting the derivatives with respect to the slope m and intercept b to zero gives the closed-form solution

                    m = Σ (xi - x̄)(yi - ȳ) / Σ (xi - x̄)^2
                    b = ȳ - m * x̄

where x̄ is the mean of the input variable X and ȳ is the mean of the output variable Y.

#Toy Dataset to play with
X = [0.0, 8.0, 15.0, 22.0, 35]
Y = [32, 46.4, 59, 71.6, 95]


If anything is unclear so far, don't worry: you will get a clearer picture during the implementation.

Let's see some code to understand it.

Dataset:
We will build a toy dataset to test our code, based on Celsius -> Fahrenheit temperature conversion.
X is our input variable (temperature in Celsius) and Y is the output variable (temperature in Fahrenheit).

What we are going to predict is the Fahrenheit value for a given temperature in Celsius.

Implementation

Step 1: Build the predict Function

Assuming we’ve determined such a slope m and intercept b (our w1 and w0), we make predictions simply with:

def predict(m,b,X):
  return m*X+b
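
As a quick sanity check, plugging in the true Celsius-to-Fahrenheit coefficients (m = 1.8, b = 32) reproduces the textbook conversion:

print(predict(1.8, 32, 100))  # 212.0 -> 100 degrees Celsius is 212 degrees Fahrenheit
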
Step 2: Compute the Error function.

How do we choose m and b? Well, any choice of m and b gives us a predicted output for each input X[i]. Since we know the actual output Y[i], we can compute the error for each pair:

def error(m,b,x,y):
  return predict(m,b,x) - y
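
For a point that lies exactly on the line the error is zero; otherwise it is the signed vertical distance between the prediction and the actual value:

print(error(1.8, 32, 25, 77))  # 0.0  -> 25 C is exactly 77 F
print(error(1.8, 32, 25, 80))  # -3.0 -> the line under-predicts this point by 3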

Step 3: Sum of Squared Errors (Validate Accuracy)
We need to be able to measure how good our model is (accuracy). There are many ways to do this; here we will implement the sum of squared errors (the coefficient of determination, or R² score, builds on the same idea).
The sum of squared errors is simply each point's error squared, added up over all the points:

def sum_square_error(m,b,X,Y):
  errored = 0
  no_of_element = len(X)
  for i in range(no_of_element):
    # square each point's error so positive and negative errors don't cancel out
    errored += error(m, b, X[i], Y[i])**2
  return errored
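
A quick check with made-up numbers: if every point lies exactly on the line the sum of squared errors is 0, and it grows as the fit gets worse:

print(sum_square_error(1.8, 32, [0.0, 100.0], [32.0, 212.0]))  # 0.0
print(sum_square_error(1.8, 30, [0.0, 100.0], [32.0, 212.0]))  # (-2)**2 + (-2)**2 = 8.0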


Step 4: The OLS Fit Method

The least-squares solution is to choose the m and b that make sum_square_error as small as possible.


import numpy as np
def least_square_mean_fit(X,Y):
  x_mean = np.mean(X)
  y_mean = np.mean(Y)
  ## Total number of values
  total_no_values = len(X)
  corr = 0   # numerator:   sum of (x - x_mean) * (y - y_mean)
  std = 0    # denominator: sum of (x - x_mean)**2
  for i in range(total_no_values):
    corr += ((X[i]-x_mean) * (Y[i]-y_mean))
    std += (X[i] - x_mean)**2
  m = corr/std              # slope
  b = y_mean - (m*x_mean)   # intercept
  return m,b
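
Running the fit on the clean Celsius -> Fahrenheit toy data from earlier recovers the true conversion coefficients:

m, b = least_square_mean_fit([0.0, 8.0, 15.0, 22.0, 35.0], [32, 46.4, 59, 71.6, 95])
print(m, b)  # approximately 1.8 and 32.0, the exact C -> F formula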

The Full code with Example:



#Prediction
def predict(m,b,X):
  return m*X+b

#Error
def error(m,b,x,y):
  return predict(m,b,x) - y

#Sum of squared error
def sum_square_error(m,b,X,Y):
  errored = 0
  no_of_element = len(X)
  for i in range(no_of_element):
    errored += error(m, b, X[i], Y[i])**2
  return errored

#The Fit Method
import numpy as np
def least_square_mean_fit(X,Y):
  x_mean = np.mean(X)
  y_mean = np.mean(Y)
  ## Total number of values
  total_no_values=len(X)
  corr=0
  std=0
  for i in range(total_no_values):
    corr+=((X[i]-x_mean) * (Y[i]-y_mean))
    std+=(X[i] - x_mean)**2
  m=corr/std
  b=y_mean - (m*x_mean)
  return m,b

##Toy Numerical Dataset
X = [0.0, 8.0, 10.0,15.0, 22.0, 35,100,120]
Y = [32, 46.4, 41,59, 71.6, 95,200,245]
m,b = least_square_mean_fit(X,Y)

print(m,",",b)
----------------
1.7502951227089159 , 30.92606399502951

sum_square_error(m,b,X,Y)
-------------------------------------
126.21 (approximately)

predict(m,b,100)
-------------------------
205.9555762659211
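
As an extra sanity check (this uses NumPy's built-in least-squares polynomial fit, not our from-scratch code), np.polyfit with degree 1 should give back essentially the same slope and intercept:

slope, intercept = np.polyfit(X, Y, 1)
print(slope, ",", intercept)  # should match m and b above, up to floating-point noise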


import matplotlib.pyplot as plt
x_max = np.max(X) 
x_min = np.min(X) 

x = np.linspace(x_min, x_max, 1000)
y = b + m * x
#plotting line 
plt.plot(x, y, color='#00ff00', label='Linear Regression')
#plot the data point
plt.scatter(X, Y, color='#ff0000', label='Data Point')
# x-axis label
plt.xlabel('Celsius')
#y-axis label
plt.ylabel('Fahrenheit')
plt.legend()
plt.show()


Conclusion:
This is a fairly simple example, so I would suggest trying it out on a real dataset and letting me know how it goes.
The difference between the actual and predicted values is small, since the underlying relationship is very close to linear. I will continue with Gradient Descent in the second part.

Hey, I'm Venkat
Developer, Blogger, Thinker and Data Scientist. nintyzeros [at] gmail.com. I love data and problems. An Indian living in the US. If you have any questions, do reach out to me via the social media links below.