 # How To Use Regularization in Machine Learning?

Published on Aug 07, 2019 | Blog from Supervised Learning

Have you ever encountered a situation where your machine learning model fits the training data exceptionally well but fails to perform well on the testing data, i.e. it cannot accurately predict unseen data? This situation can be dealt with using regularization in machine learning.

Overfitting happens when a model learns the specific patterns and noise in the training data to such an extent that it negatively impacts the model’s ability to generalize from the training data to new (“unseen”) data. By noise, we mean the irrelevant information or randomness in a dataset.

Preventing overfitting is essential to improving the performance of a machine learning model.

The following pointers will be covered in this article and will eventually help us solve this problem.

## What is Regularization?

In general, regularization means to make things regular or acceptable. This is exactly why we use it in applied machine learning. In the context of machine learning, regularization is the process of shrinking the estimated coefficients towards zero. In simple words, regularization discourages learning a more complex or flexible model, to prevent overfitting.

## How Does Regularization Work?

The basic idea is to penalize the complex models i.e. adding a complexity term that would give a bigger loss for complex models. To understand it, let’s consider a simple relation for linear regression. Mathematically, it is stated as below:
Y ≈ W_0 + W_1 X_1 + W_2 X_2 + ⋯ + W_P X_P

Where Y is the value to be predicted, i.e. the learned relation.
X_1, X_2, …, X_P are the features deciding the value of Y.
W_1, W_2, …, W_P are the weights attached to the features X_1, X_2, …, X_P respectively.
W_0 represents the bias.

Now, in order to fit a model that accurately predicts the value of Y, we require a loss function and optimized parameters i.e. bias and weights.

The loss function generally used for linear regression is called the residual sum of squares (RSS). According to the above stated linear regression relation, it can be given as:
RSS = Σ_{j=1}^{m} (Y_j - W_0 - Σ_{i=1}^{n} W_i X_{ji})^2

We can also call RSS the linear regression objective without regularization.
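The RSS objective above can be computed directly. Below is a minimal numpy sketch; the data, weights, and bias are made-up toy values for illustration only:

```python
import numpy as np

# Toy data: m = 5 samples, n = 2 features (values are made up for illustration)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
Y = np.array([3.1, 2.9, 7.2, 6.8, 10.1])

W0 = 0.0                   # bias W_0
W = np.array([1.0, 1.0])   # weights W_1, W_2

# RSS = sum over samples j of (Y_j - W_0 - sum over features i of W_i * X_ji)^2
residuals = Y - W0 - X @ W
rss = np.sum(residuals ** 2)
print(rss)  # ≈ 0.11 for these toy values
```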

Now, the model will learn by means of this loss function. Based on our training data, it will adjust the weights (coefficients). If our dataset is noisy, the model will overfit, and the estimated coefficients won’t generalize to unseen data.

This is where regularization comes into action. It regularizes these learned estimates towards zero by penalizing the magnitude of coefficients.

But how does it assign a penalty to the coefficients? Let’s explore.

## Regularization Techniques

There are two main regularization techniques, namely ridge regression and lasso regression. They differ in the way they assign a penalty to the coefficients.

### Ridge Regression (L2 Regularization)

This regularization technique performs L2 regularization. It modifies the RSS by adding the penalty (shrinkage quantity) equivalent to the square of the magnitude of coefficients.
Σ_{j=1}^{m} (Y_j - W_0 - Σ_{i=1}^{n} W_i X_{ji})^2 + α Σ_{i=1}^{n} W_i^2 = RSS + α Σ_{i=1}^{n} W_i^2

Now, the coefficients are estimated using this modified loss function.

In the above equation, you may have noticed the parameter α (alpha) multiplying the shrinkage quantity. This is called the tuning parameter; it decides how much we want to penalize our model. In other words, the tuning parameter balances the emphasis given to minimizing RSS against minimizing the sum of squares of the coefficients.

Let’s see how the value of α (alpha) affects the estimates produced by ridge regression.

When α = 0, the penalty term has no effect: the loss function reduces to the residual sum of squares we chose initially, so we get the same coefficients as simple linear regression.

As α → ∞, the ridge regression coefficients approach zero, because the modified loss function is dominated by the penalty term, and minimizing the sum of squared coefficients drives the parameters toward 0.

When 0 < α < ∞, the ridge regression coefficients will lie somewhere between 0 and the simple linear regression estimates.

That is why selecting a good value of α (alpha) is critical. The penalty used by the ridge regression regularization technique is also known as the L2 norm.
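To make the effect of α concrete, here is a small numpy sketch using ridge regression’s closed-form solution, W = (XᵀX + αI)⁻¹ XᵀY. The data is synthetic and the bias is omitted for brevity; all values are illustrative assumptions:

```python
import numpy as np

def ridge_weights(X, Y, alpha):
    """Closed-form ridge solution: W = (X^T X + alpha * I)^(-1) X^T Y
    (bias term omitted for brevity)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

# Synthetic data: Y depends on two features plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
Y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.normal(size=50)

w_ols   = ridge_weights(X, Y, alpha=0.0)   # alpha = 0: plain least squares
w_mild  = ridge_weights(X, Y, alpha=10.0)  # moderate alpha: shrunken coefficients
w_heavy = ridge_weights(X, Y, alpha=1e6)   # huge alpha: coefficients near 0

print(w_ols, w_mild, w_heavy)
```

As α grows, the norm of the coefficient vector shrinks monotonically toward zero, but for any finite α it never reaches exactly zero.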

### Lasso Regression (L1 Regularization)

This regularization technique performs L1 regularization. It modifies the RSS by adding the penalty (shrinkage quantity) equivalent to the sum of the absolute value of coefficients.
Σ_{j=1}^{m} (Y_j - W_0 - Σ_{i=1}^{n} W_i X_{ji})^2 + α Σ_{i=1}^{n} |W_i| = RSS + α Σ_{i=1}^{n} |W_i|
Now, the coefficients are estimated using this modified loss function.

Lasso regression differs from ridge regression in that its penalty uses the absolute values of the coefficients.

As the penalty only considers the absolute values of the coefficients (weights), the optimization algorithm penalizes large coefficients. This penalty is known as the L1 norm.

Here, α (alpha) is again a tuning parameter. It works like that of ridge regression and provides a tradeoff between minimizing RSS and minimizing the magnitude of the coefficients.

Like ridge regression, α (alpha) in lasso regression can take various values as follows:

When α=0, we will get the same coefficients as simple linear regression.

When α=∞, the lasso regression coefficient will be zero.

When 0 < α < ∞, the lasso regression coefficients will lie somewhere between 0 and the simple linear regression estimates.
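Lasso has no closed-form solution; a standard way to fit it is cyclic coordinate descent with soft-thresholding. The sketch below is a bare-bones numpy illustration on made-up data (bias omitted), showing the absolute-value penalty driving an irrelevant feature’s weight to exactly zero:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: shrinks z toward 0, returning exactly 0 when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_weights(X, Y, alpha, n_sweeps=200):
    """Minimize sum_j (Y_j - sum_i W_i X_ji)^2 + alpha * sum_i |W_i|
    by cyclic coordinate descent (bias term omitted for brevity)."""
    m, n = X.shape
    W = np.zeros(n)
    for _ in range(n_sweeps):
        for i in range(n):
            # partial residual with feature i's contribution removed
            r = Y - X @ W + X[:, i] * W[i]
            z = X[:, i] @ r
            W[i] = soft_threshold(z, alpha / 2.0) / (X[:, i] @ X[:, i])
    return W

# Synthetic data: feature 1 contributes nothing to Y
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
Y = 4.0 * X[:, 0] + 0.5 * X[:, 2]

w = lasso_weights(X, Y, alpha=20.0)
print(w)  # the weight for the irrelevant feature is driven to exactly 0
```

With α = 0 this reduces to least squares, and with a very large α all weights collapse to zero, matching the cases above.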

This appears very similar to ridge regression, but let’s look at both techniques from a different perspective.

Think of ridge regression as solving an equation where the sum of squares of the weights (coefficients) is less than or equal to s. On this view, considering there are 2 parameters in a given problem, ridge regression is expressed by
W_1^2 + W_2^2 ≤ s

It implies that the ridge regression coefficients have the smallest loss function for all points that lie within the circle given by the above equation.

Similarly, think of lasso regression as solving an equation where the sum of the absolute values of the weights (coefficients) is less than or equal to s. On this view, considering there are 2 parameters in a given problem, lasso regression is expressed by
|W_1| + |W_2| ≤ s

It implies that the lasso regression coefficients have the smallest loss function for all points that lie within the diamond given by the above equation.
The following image describes the above equations:

In this image we can see the constraint regions (blue areas), the left one for lasso and the right one for ridge, along with the contours (green ellipses) of the loss function, i.e. RSS.

In the above case, for both regression techniques, the coefficient estimates are given by the first point at which a contour (an ellipse) touches the constraint region (circle or diamond).

The ridge regression coefficient estimates will typically all be non-zero. Why? Because the ridge constraint is circular, with no sharp points, so the ellipse will generally not intersect the constraint region on an axis.

On the other hand, the lasso constraint, because of its diamond shape, has corners on each axis, so the ellipse will often intersect the constraint region at an axis. When that happens, at least one of the coefficients equals exactly zero.

The above scenario shows that ridge regression will shrink the coefficients very close to 0 but never make them exactly 0, which means the final model will include all predictors. This is a disadvantage of ridge regression for model interpretability.

However, lasso regression, when α is sufficiently large, will shrink some of the coefficient estimates to exactly 0. That is why lasso provides sparse solutions.
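This contrast can be checked numerically. Assuming made-up data with one irrelevant feature, a quick sketch with ridge’s closed-form solution shows that even a large α leaves every coefficient nonzero:

```python
import numpy as np

# Synthetic data: feature 1 contributes nothing to Y
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
Y = 4.0 * X[:, 0] + 0.5 * X[:, 2]

# Closed-form ridge with a fairly large alpha (bias omitted):
# coefficients shrink toward 0 but never land exactly on 0
alpha = 100.0
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ Y)
print(np.count_nonzero(w_ridge))  # all 3 predictors stay in the model
```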

That was regularization and its techniques, and I hope you can now understand them better. You can use this to improve the performance of your machine learning models.

Now, with this, we come to the end of this ‘Regularization in Machine Learning’ article. Hope this article was insightful!


Got a question for us? Please mention it in the comments section and we will get back to you.