A Simple Explanation of Regularization in Machine Learning

In this post, we are going to look into regularization and then implement it from scratch in Python in Part 2. We will use examples and some nice visuals to understand it better. We already know about linear regression, which is where this technique is commonly used.



Let's get started!!

The first question that always comes to my mind after hearing this term is:

What is regularization?
It is a technique to reduce the complexity of the model (we will see what that means) by penalizing the loss function, in order to address overfitting.

The above definition gives us three things to look at in detail:
  • Minimize the complexity
  • Penalize the loss function
  • Solve overfitting (generalization)

1) Minimizing complexity. What do we mean by that?
Consider a simple example: you are trying to predict students' exam scores, and the only feature you use is the number of books each student has read.

Such a model can pick up a few rough patterns, but not enough to predict the score well. This is called underfitting: the model was not given enough information to capture the pattern. It is too simple and has high bias.

The visual below should make this easy to see: the green line is not a good fit.
Now suppose you realize that adding more features could improve the prediction, so you add 20 more features such as sleep time, number of books referred, number of labs attended, number of hours of study, etc. With so many input features, the model starts to pick up patterns and becomes too complex.

Now our model has also learned data patterns along with the noise in the training data. This is called overfitting.

When a model tries to fit the noise as well as the data pattern, it has high variance; that is overfitting.
The above visuals show that the model captures the patterns in the training data well, but it will do poorly on the test set, i.e. it does not generalize.
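To make the idea concrete, here is a minimal sketch of how you could generate visuals like these yourself. The data (books read vs. exam score) is entirely made up, and the degrees 0, 1 and 12 are just illustrative choices for an underfit, a reasonable fit and an overfit model.

```python
# Fit polynomials of increasing degree to noisy, made-up exam-score data.
import numpy as np
import matplotlib.pyplot as plt
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
books = np.linspace(0, 10, 15)                          # feature: books read
score = 3 * books + 20 + rng.normal(0, 8, books.size)   # noisy "exam score"

grid = np.linspace(0, 10, 200)
plt.scatter(books, score, label="training data")
for degree in (0, 1, 12):                # too simple, reasonable, too complex
    fit = Polynomial.fit(books, score, degree)          # least-squares fit
    plt.plot(grid, fit(grid), label=f"degree {degree}")
plt.xlabel("books read"); plt.ylabel("exam score"); plt.legend()
plt.show()
```

The degree-0 line (a constant) underfits, while the degree-12 curve bends to chase the noise in the training points.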

Why do we need regularization?

The goal of our machine learning algorithm is to learn the data patterns while ignoring the noise in the data set, and regularization is one way to handle such cases.
Very good!! We have understood what happens if the model is too simple or too complex.

In the next section, we will see how to solve it. 

2) To solve this, we penalize the loss function

Let's see how this helps in reducing the model's complexity.

Consider the loss function (the sum of the squared differences between the actual and the predicted values):

L(X, y) = \sum_{i=1}^{n} \left( y_i - (w_0 + w^T f(x_i)) \right)^2
where f(x) = b_0 + b_1 x + b_2 x^2 + b_3 x^3 (a polynomial function)

As the degree of the polynomial increases, the model becomes more complex and tries to fit every data point.
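To see what this loss looks like in code, here is a small sketch for the cubic f(x) above. The numbers and the function name squared_error_loss are my own, just for illustration.

```python
import numpy as np

def squared_error_loss(b, x, y):
    """Sum of squared differences between y and a cubic polynomial in x."""
    y_pred = b[0] + b[1] * x + b[2] * x**2 + b[3] * x**3
    return np.sum((y - y_pred) ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
b = np.array([0.0, 2.0, 0.0, 0.0])     # weights b0, b1, b2, b3
print(squared_error_loss(b, x, y))
```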

To avoid this, we penalize the weights by moving their values closer to zero, and our new f(x) becomes:

f(x) = b_0 + b_1 x + b_2 x^2 + b_3 x^3, with b_2 \approx 0 and b_3 \approx 0

Driving those higher-order weights to (or very close to) zero simplifies the model; this simplifying assumption is what regularization makes.

How can we make sure that we don't miss out on the important variables?
To make sure every input variable is still taken into account, we don't drop any of them; instead we penalize all the weights by making them small:

\min_{f} \sum_{i=1}^{n} V(f(x_i), y_i) + \lambda R(f)

λ is the penalty term or regularization parameter, which determines how strongly the weights are penalized.
When λ is zero, the regularization term vanishes and we are back to the original loss function.
So what value should λ take?
It can range from zero up to very large values, and it is usually tuned by trying several values on a held-out sample (or sub-sample) and comparing the resulting loss.
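A common way to do this in practice is to try a handful of λ values on a held-out sample and keep the one with the lowest validation error. Here is a rough sketch using scikit-learn's Ridge (where the regularization parameter is called alpha); the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=200)        # synthetic data

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:                 # 0.0 = plain least squares
    model = Ridge(alpha=lam).fit(X_train, y_train)
    err = mean_squared_error(y_val, model.predict(X_val))
    print(f"lambda={lam:>6}: validation MSE = {err:.3f}")
```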


L1 Regularization/Lasso/L1 norm

With the L1 norm (Lasso) we try to obtain a sparse solution: feature selection happens implicitly, since useful features get non-zero weights and insignificant ones get zero weights. In a sparse solution, the majority of the input features have zero weights and only a few have non-zero weights.

J(w) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (w_0 + w^T x_i) \right)^2 + \lambda \lVert w \rVert_1

In L1 regularization we penalize the absolute values of the weights; the L1 regularization term is the λ||w||₁ term in the equation above.
If we have two parameters β1 and β2, the lasso coefficients are the ones with the smallest RSS (the loss function) among all points that lie within the region given by |β1| + |β2| ≤ s.
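Here is a quick sketch of that sparsity using scikit-learn's Lasso. The data is synthetic, with only the first two of twenty features actually carrying signal, so most coefficients should come out exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)                                       # mostly exact zeros
print("non-zero weights:", np.sum(lasso.coef_ != 0))
```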


L2 Regularization or Ridge Regularization

J(w) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (w_0 + w^T x_i) \right)^2 + \lambda \lVert w \rVert_2^2

In L2 regularization, the regularization term is λ times the sum of the squares of all feature weights, as shown in the equation above.
L2 regularization forces the weights to be small but does not make them exactly zero, so it gives a non-sparse solution.
L2 is not robust to outliers, since the squared error terms blow up for outliers, and the regularization term then tries to compensate by penalizing the weights.
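For contrast, here is the same synthetic data fitted with scikit-learn's Ridge: the weights shrink toward zero but, unlike Lasso, are generally not exactly zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)                          # small values, few (if any) exact zeros
print("non-zero weights:", np.sum(ridge.coef_ != 0))
```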
Wait... but what if we combine L1 & L2? Sounds like a new idea :-P

Elastic Net (Already Taken!)

Elastic Net was created to improve on the lasso, whose variable selection can be too dependent on the data and thus unstable. The solution is to combine the penalties of ridge regression and lasso to get the best of both worlds. Elastic Net aims to minimize the following loss function:

J(w, \lambda_1, \lambda_2) = \lVert y - Xw \rVert^2 + \lambda_2 \lVert w \rVert_2^2 + \lambda_1 \lVert w \rVert_1

where the mixing parameter α = λ1 / (λ1 + λ2) interpolates between ridge (α = 0) and lasso (α = 1).
Notice that this penalty function is strictly convex (assuming λ2 > 0) so there is a unique global minimum, even if X is not full rank.
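In scikit-learn this is available as ElasticNet, where alpha sets the overall penalty strength and l1_ratio is the mixing parameter (0 = pure ridge, 1 = pure lasso). A minimal sketch on the same kind of synthetic data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_)                  # a compromise: some zeros, the rest shrunk
```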

Conclusion
This is all the basics you need to get started with regularization. It is a useful technique that can help improve the accuracy of your regression models. A popular library for implementing these algorithms is Scikit-Learn; it has a wonderful API that can get your model up and running with just a few lines of Python code.

Hey, I'm Venkat
Developer, blogger, thinker and data scientist. nintyzeros [at] gmail.com. I love data and problem solving - an Indian living in the US. If you have any questions, do reach out to me via the social media links below.