In the field of machine learning, regression analysis is one of the most common and popular techniques to understand relationships between variables. Linear regressions are typically used in these scenarios. However, when you see high multicollinearity in your linear regression — which is common when there are a large number of parameters — you need a way to optimise bias before you can build models. It is here that ridge regression with SGD comes handy. 

Ridge Regression with Stochastic Gradient Descent Using Python

Let’s first understand ridge regression and stochastic gradient descent algorithm individually.

What is Ridge Regression?

Ridge Regression is a commonly used method of processing regression data with multicollinearity. When independent variables in a multiple regression model are correlated, we call it multicollinearity. This might cause coefficient estimates to change erratically, as you make changes to the independent variables in your model. Ridge regression reduces standard errors by adding a degree of bias to the regression estimates.

What is Stochastic Gradient Descent?

In plain English, gradient descent means sliding down a slope. In statistics, this might be a descent down the slope of the error surface to reach its lowest point. However, it doesn’t fall straight down the slope. A gradient descent algorithm follows an iterative process that travels down the slope in steps, using local information, until reaching the lowest point.

Stochastic means random. Stochastic gradient descent introduces randomness by selecting data points at random during each iteration. What this does is help significantly reduce the burden of computation in the process of minimising the sum of squared residuals.

Reducing Loss with Gradient Descent

You can calculate the gradient using basic calculus — differentiation — or use existing functions in Python. Then, you move to the next point, calculate the gradient and so on. The jump you make until the next point is the learning rate. It is important to choose the right learning rate. 

Before fitting a model or building an algorithm, divide the dataset into three sets — training, validation, and test sets. Then, you must normalize the features. Post this, you can calculate the loss. Do a gradient check. And use gradient descent algorithms to minimise the loss.

Preparing to Perform Ridge Regression

Before we get to how to use ridge regression, there is one thing that’ll be very helpful to know: How to remove for-loop from code and make the code much faster using vectorization and broadcasting. We primarily use two libraries — Pandas and Numpy. Pandas is for relational databases, time series data, etc.; NumPy for mathematical computation; we can also use PyPlot for plotting.

Let’s take an example of matrix multiplication.

While performing multiplication using for-loop for this small matrix, we notice that the computation takes about 100 milliseconds. However, in the real world, your matrices won’t be this small. You will be working with much larger data, which will take significant time and resources for computation. First work to reduce that. You can do this with element-wise operations, which reduced the same computation to under 30ms. Or use broadcasting, which also works for 2D arrays — this can expedite computation, while keeping the code simple and elegant.

To learn how to do each of these steps, watch Springboard’s data science mentor Neeraj Agarwal demonstrate with examples in this Youtube tutorial. He also shows how each of the steps is impacting the models and teaches how to improve them. You can get the data and notebook he’s using by leaving your details in this form

To learn data science and machine learning from some of the world’s best data scientists, consider Springboard’s online learning problem. In addition to a world-class curriculum, it also offers 1:1 mentorship from industry experts, career coaching and guidance, as well as a job guarantee. Apply Now!