In this blog, you will learn how to implement a simple linear regression model in Python without using any pre-built models, and how to make predictions with it on publicly available data for calories and obesity. You will also learn to measure the accuracy of the model using the R2 score (one of several metrics for measuring model accuracy). Pre-built modules such as scikit-learn and statsmodels provide out-of-the-box implementations of the linear regression model. The purpose of this blog is to show you the basic building blocks for predicting labels using a simple linear regression model.
The notebook for this blog is available here.
Types of Linear Regression Models
The Linear Regression model is one of the simplest supervised machine learning models, yet it has been widely used for a large variety of problems. There are two main types of Linear Regression models:
1. Simple Linear regression
Simple linear regression uses the traditional slope-intercept form, y = mx + b, where m and b are the coefficient and intercept respectively. x represents our input data (independent variable) and y represents our prediction (dependent variable).
2. Multivariable regression
Simple linear regression has only one independent variable. In contrast, multivariable regression has more than one independent variable affecting the outcome of the dependent variable. As the number of variables increases, the dimension of the problem also increases, resulting in longer computation times when the number of independent variables is very high. With independent variables x, y and z, the dependent variable is given by a function f(x, y, z), computed as the intercept plus the sum of the products of each coefficient with its independent variable.
In this blog we will create a model for simple Linear regression. The procedure for simple linear regression is simpler than that for multiple linear regression.
Calculating Linear Regression Model Intercept and Coefficient
Deriving Coefficient and Intercept
Most machine learning algorithms estimate a value for y and then repeatedly update the weights (coefficient) and bias (intercept) to reduce the error, so that the algorithm “learns” to produce more accurate predictions. With simple linear regression, however, the coefficient and intercept can be computed directly using the following formulae.
Coefficient m can be calculated using:
$$m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Intercept b can be calculated using:
$$b = \bar{y} - m\,\bar{x}$$
Where $\bar{x}$ is the mean of the independent variable x, $\bar{y}$ is the mean of the dependent variable y, and $x_i$ and $y_i$ are individual observations of x and y.
Based on the above formulae, Python functions are written below using NumPy broadcasting.
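A minimal sketch of what such functions could look like is given here. The function names (get_coefficient, get_intercept, predict) and the toy arrays are illustrative assumptions, not necessarily the ones used in the notebook. Subtracting a scalar mean from a whole array is where NumPy broadcasting comes in.

```python
import numpy as np


def get_coefficient(x, y):
    """Slope m = sum((xi - x_mean) * (yi - y_mean)) / sum((xi - x_mean)^2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # The scalar means are broadcast across every element of the arrays.
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    return np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)


def get_intercept(x, y, m):
    """Intercept b = y_mean - m * x_mean."""
    return np.mean(y) - m * np.mean(x)


def predict(x, m, b):
    """Apply y = m * x + b element-wise."""
    return m * np.asarray(x, dtype=float) + b


# Toy data (assumed for illustration) that lies exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

m = get_coefficient(x, y)
b = get_intercept(x, y, m)
print(m, b, predict([5.0], m, b))
```

Because the toy points sit exactly on a line, the recovered slope and intercept match the line that generated them, which is a quick sanity check before moving to real data.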
Calories vs. Obesity (Our Dataset)
We will be using the Calories vs. Obesity data that is available from the Our World in Data website. This data is a yearly average of obesity per country. Since we are building a simple linear regression model, we have only one independent variable: we will try to predict the obesity percentage given the average calorie intake as the independent variable.
Import necessary libraries. In my previous blog, we used an IPython widget to get a slider in the Jupyter notebook to view time-series data. Here we are going to use Plotly, which provides animations and sliders that can be used to generate great visuals with a few lines of code. We will use Plotly to generate a time-series graph with a time slider (explained in the next section).
Importing the Data and EDA
The file available in the download section of the above website is loaded into a pandas DataFrame.
Let’s look at our data
Looking at the DataFrame head, the column names are either not useful for our purpose or too long. We also see that a lot of the data is filled with NaN. The following data wrangling steps are performed to give us a good starting point for making predictions:
- Rename the columns.
- Filter only the required data. The website describes the data, and it is very important to understand the data before we start drawing conclusions from it. The website says that the calorie and obesity data is available from 1975 to 2013, so we are going to keep only these years in our DataFrame.
- Impute missing data. After filtering the data for the required years there are still some missing values. For such cases we take the mean for that country.
- Drop missing data. If data is still missing after steps 2 and 3, it means that there were no values to impute from in the first place. It does not make sense to fill in dummy values, so we are going to drop the data that is still missing after steps 2 and 3.
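The four steps above can be sketched with pandas as follows. The raw column names and the tiny DataFrame are made up for illustration (the real download has its own headers). DailyCalorieSupply and Obese(%) are borrowed from the axis labels used later in this blog.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the downloaded file; names and values are assumptions.
raw = pd.DataFrame({
    "Entity": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Year": [1974, 1975, 1976, 1977, 1974, 1975, 1976, 1977],
    "Daily caloric supply (kcal/person/day)":
        [2000.0, 2100.0, np.nan, 2300.0, 1800.0, np.nan, np.nan, np.nan],
    "Prevalence of obesity (%)":
        [5.0, 5.5, 6.0, np.nan, 2.0, 2.1, 2.2, 2.3],
})

# Step 1: rename the columns to short, usable names.
df = raw.rename(columns={
    "Entity": "Country",
    "Daily caloric supply (kcal/person/day)": "DailyCalorieSupply",
    "Prevalence of obesity (%)": "Obese(%)",
})

# Step 2: keep only the years for which both series are published.
df = df[df["Year"].between(1975, 2013)].copy()

# Step 3: fill remaining gaps with that country's own mean.
value_cols = ["DailyCalorieSupply", "Obese(%)"]
df[value_cols] = (
    df.groupby("Country")[value_cols]
      .transform(lambda s: s.fillna(s.mean()))
)

# Step 4: a country with no values at all has a NaN mean, so drop what is left.
df = df.dropna(subset=value_cols).reset_index(drop=True)
print(df)
```

In this toy frame, country B has no calorie values inside the year range, so there is nothing to impute from and its rows are dropped in step 4.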
Now the data looks to be in good shape. One other thing: the continent value is not populated. This value is not necessary for our regression problem, but it is a useful tool for visualisation (which I explain in the next section). So we are going to use a similar dataset that is readily available to get the continent value directly from the country name. The Calorie vs Obesity dataset was originally obtained from the Gapminder website, and the same data is readily available in the Plotly datasets. I merge both DataFrames to get the final merged DataFrame.
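One way such a merge might look is sketched here. The two small DataFrames are stand-ins: in practice the lookup could come from plotly.express.data.gapminder(), which has country, continent and year columns. The join key and the inner join are assumptions rather than the exact choices made in the notebook.

```python
import pandas as pd

# Stand-in for the cleaned obesity data (values are made up).
obesity = pd.DataFrame({
    "Country": ["Myanmar", "France", "Atlantis"],
    "Year": [2000, 2000, 2000],
    "Obese(%)": [3.0, 15.0, 1.0],
})

# Stand-in for the Gapminder sample, which repeats each country per year.
gapminder = pd.DataFrame({
    "country": ["Myanmar", "France", "Myanmar"],
    "continent": ["Asia", "Europe", "Asia"],
    "year": [2002, 2002, 2007],
})

# Reduce to one row per country so the merge does not duplicate rows.
lookup = (
    gapminder[["country", "continent"]]
    .drop_duplicates()
    .rename(columns={"country": "Country", "continent": "Continent"})
)

# Inner join: countries without a continent match are left out.
merged = obesity.merge(lookup, on="Country", how="inner")
print(merged)
```

Deduplicating the lookup first matters because the Gapminder sample has one row per country per year, and merging on country alone would otherwise multiply the rows.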
The dataset has a lot of dimensions which can be used as independent variables if we were using a multivariable linear regression model. The below graph helps in visualising multiple dimensions in a simple scatter plot.
Generated by author via Plotly
Five dimensions are represented using different attributes of the scatter plot. Below are the details:
- X-axis – Obese(%)
- Y-axis – DailyCalorieSupply
- Size of the bubbles – Population
- Color of the bubbles – Continent
- Slider – Year
This is a great way to visualise data, and a lot of inferences can be drawn from this one visual. For example:
- All of the countries appear to be moving towards higher obesity over time
- African countries are the least obese, while Europe and the Americas have higher obesity rates
- Generally, calorie supply and obesity seem to be positively correlated
This kind of graph will be very useful when we are doing multivariable analysis. We can see that calorie supply and obesity rise together, so we are heading in the right direction for our simple linear regression model. For some countries the correlation between calories and obesity seems to be stronger than for others. Since we can use only one independent variable, let us select the country with the highest correlation.
Myanmar has the highest correlation between calorie supply and obesity rate. We’ll use Myanmar for prediction. I’ve created a separate DataFrame for Myanmar.
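A per-country correlation like this could be computed along the following lines. The toy values and the variable names are assumptions. Pearson correlation (the pandas default for Series.corr) is also an assumption, since the blog does not say which measure was used.

```python
import pandas as pd

# Two made-up countries: A is perfectly linear, B is noisy.
df = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 4,
    "DailyCalorieSupply": [2000, 2100, 2200, 2300, 2500, 2400, 2600, 2450],
    "Obese(%)": [2.0, 2.5, 3.0, 3.5, 10.0, 10.5, 10.2, 10.9],
})

# Pearson correlation between the two columns, computed per country.
corr_by_country = (
    df.groupby("Country")[["DailyCalorieSupply", "Obese(%)"]]
      .apply(lambda g: g["DailyCalorieSupply"].corr(g["Obese(%)"]))
      .sort_values(ascending=False)
)

# Take the top-ranked country and keep its rows in a separate DataFrame.
best_country = corr_by_country.index[0]
country_df = df[df["Country"] == best_country].copy()
print(corr_by_country)
print(best_country)
```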
Now, we plot a line graph to see the relationship between our independent variable and dependent variable.
Fit, Predict and Evaluate
Train Test split
Since the data is a time series, we cannot randomly select a certain percentage of it as test and train data. Instead, we split the data based on time (years). I’ve decided to take the last 3 years of data as test data, i.e. from 2011 to 2013.
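A time-based split of this kind can be sketched as below. The DataFrame is synthetic and the variable names are assumptions. Only the idea of holding out the last three years comes from the text.

```python
import pandas as pd

# Synthetic single-country frame covering 2005 to 2013 (values are made up).
years = list(range(2005, 2014))
country_df = pd.DataFrame({
    "Year": years,
    "DailyCalorieSupply": [2000 + 25 * i for i in range(len(years))],
    "Obese(%)": [2.0 + 0.3 * i for i in range(len(years))],
})

# Hold out the most recent years instead of sampling rows at random,
# so the model is never trained on data from after the test period.
test_years = 3
cutoff = country_df["Year"].max() - test_years  # 2010 for this frame

train = country_df[country_df["Year"] <= cutoff]
test = country_df[country_df["Year"] > cutoff]
print(train["Year"].tolist(), test["Year"].tolist())
```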
Fit and Predict
Using the coefficient, intercept and predict functions defined earlier, we derive the coefficient and intercept values on the training data and then predict the obesity rate on the test data. I have appended the predicted values to the original Myanmar DataFrame (which will be used in the visualisation next).
Sklearn Fit and Predict
As a benchmark, I will also derive the predicted values using the LinearRegression class from scikit-learn.
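A small sketch of such a benchmark is shown here, using made-up numbers. LinearRegression expects a 2-D feature array, hence the reshape. On identical training data, ordinary least squares in scikit-learn and the closed-form formulae should agree up to floating-point error, so any gap between the two in practice would come from differences in the data or preprocessing fed to each.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training and test values for illustration only.
x_train = np.array([2000.0, 2100.0, 2200.0, 2300.0])
y_train = np.array([2.0, 2.6, 3.1, 3.7])
x_test = np.array([2400.0, 2500.0])

# scikit-learn wants features shaped (n_samples, n_features).
model = LinearRegression()
model.fit(x_train.reshape(-1, 1), y_train)
sk_pred = model.predict(x_test.reshape(-1, 1))

# Closed-form slope and intercept on the same training data.
x_dev = x_train - x_train.mean()
m = np.sum(x_dev * (y_train - y_train.mean())) / np.sum(x_dev ** 2)
b = y_train.mean() - m * x_train.mean()
own_pred = m * x_test + b

print(model.coef_[0], model.intercept_)
print(m, b)
```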
A line plot is shown below to compare the original data, the sklearn predicted values and our simple linear regression model's predictions. The two sets of predictions seem to be close.
When we compare the r2_score for both sets of predictions, our model seems to fare a little better.
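For reference, the R2 score can also be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, which is the same definition that sklearn.metrics.r2_score uses for a single target. The helper name r2 and the toy values are assumptions for illustration.

```python
import numpy as np


def r2(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]
perfect = r2(y_true, [1.0, 2.0, 3.0])    # predictions match exactly
mean_only = r2(y_true, [2.0, 2.0, 2.0])  # always predicting the mean
reversed_ = r2(y_true, [3.0, 2.0, 1.0])  # worse than predicting the mean
print(perfect, mean_only, reversed_)
```

A perfect fit scores 1, always predicting the mean scores 0, and a model worse than the mean goes negative, so the score is not strictly a percentage.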
In this blog, we have seen the implementation of simple linear regression using Python with NumPy broadcasting. We were able to achieve a 96% R2 score on the Myanmar obesity rate prediction. In practice, almost all problems have multiple independent variables, and multivariable linear regression can be used for them (for example, with the 5 variables defined in the visualisation section for this dataset). Most other models are much more complicated and require the use of a cost function, a learning rate and gradient descent (or another optimisation algorithm) to minimise some function and arrive at optimal weights and biases.