Data analytics professionals form the backbone of decision making at every organisation. With their ability to process numbers and turn them into insights, data professionals (data analysts, data scientists, AI/machine learning experts) are in high demand. One of the most common problems any data professional solves in their career is determining the probability of an event occurring. ‘Will a customer buy this from me?’, ‘Should I spend money marketing to this kind of user?’, ‘Is this customer eligible for a credit card?’, ‘Does this patient have the disease?’ are some of the common questions that an analyst has to answer with the help of data. With the information available, analysts often use logistic regression to determine the most likely outcome. With the logistic regression technique explained in this blog post, you will be able to answer many classification questions and help companies with data-driven decision making.

Logistic Regression Explained Conceptually

We have discussed the ‘what & why of logistic regression’ in a previous article, and we understand that it can be applied only when the dependent variable is categorical in nature. In this article, we will be taking a real-life example of a situation that automobile companies like Hero Moto Corp, Bajaj Auto, Royal Enfield etc. would be facing: given a dataset with some demographic information about a user, the company wants to understand whether that user would end up purchasing a bike or not. For ease of understanding, we are using a simple dataset with User ID, Gender, Age & a few more details.

If we apply logistic regression to look for a clear answer (Yes or No), our hypothesis would look something like:

K = W * X + B

hΘ(x) = sigmoid(K)

where the sigmoid function is:

sigmoid(K) = 1 / (1 + e^(-K))

Graphically, the sigmoid is an S-shaped curve: it maps any real-valued K to a number between 0 and 1, which we can read as the probability of the positive outcome.
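To make this concrete, here is a minimal NumPy sketch of the hypothesis (the weight w and bias b are made-up values for illustration, not fitted parameters):

import numpy as np

def sigmoid(k):
    # squashes any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-k))

w, b = 0.8, -0.5                  # illustrative weight and bias
x = np.array([-2.0, 0.0, 3.0])
k = w * x + b                     # K = W * X + B
print(sigmoid(k))                 # roughly [0.11 0.38 0.87] -- all between 0 and 1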

There are 3 common variations of Logistic Regression:

  • Binary Logistic Regression – only two possible outcome categories.
  • Multinomial Logistic Regression – more than two categories, with no natural ordering.
  • Ordinal Logistic Regression – more than two categories, with a natural ordering.
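For reference, scikit-learn's LogisticRegression handles both the binary and the multinomial case with the same estimator (ordinal logistic regression needs a dedicated package such as mord); here is a toy sketch with three unordered classes:

import numpy as np
from sklearn.linear_model import LogisticRegression

X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 1, 1, 2, 2])       # three categories, no ordering implied

clf = LogisticRegression()
clf.fit(X_toy, y_toy)
print(clf.predict([[0.5], [2.5], [4.5]]))  # should recover [0 1 2]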

In this example, we will be using the Binary Logistic Regression technique. As you can see in the dataset, we have 5 variables to work with: User ID, Gender, Age, EstimatedSalary and Purchased.


Solving With Logistic Regression In Python

As with any data analytics/science problem in Python, we have a standard set of steps to follow. We start by cleaning the dataset to ensure there are no null or unnecessary values, which helps ensure that our outcome is not biased.

1. After a thorough round of cleaning, we import the libraries & the dataset.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Social_Network_Ads.csv')
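Before moving on, it is worth confirming that the cleaning round actually held; a quick sanity check (the column names below are those of the Social_Network_Ads.csv file):

print(dataset.head())          # first rows: User ID, Gender, Age, EstimatedSalary, Purchased
print(dataset.isnull().sum())  # every column should show 0 nulls after cleaning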

2. Next, we split the dataset into Dependent and Independent variables.

If you look at the information in our dataset, you will be able to identify that Age and EstimatedSalary are the independent variables, whereas Purchased is the dependent variable.

X = dataset.iloc[:, [2,3]].values  # columns 2 and 3: Age, EstimatedSalary
y = dataset.iloc[:, 4].values      # column 4: Purchased

** X holds the independent variables and y the dependent variable.


3. Splitting the Data set into the Training Set and Test Set

As always, our training data will be used to train our logistic model and the test data will be used to validate it. We use sklearn to split our data, importing train_test_split from sklearn.model_selection.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
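A quick shape check confirms the 75/25 split (assuming the standard 400-row Social_Network_Ads.csv, this prints (300, 2) and (100, 2)):

print(X_train.shape, X_test.shape)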

4. Feature Scaling

Feature scaling puts the variables on a comparable scale. It is needed here because there is a huge difference in magnitude between the 2 variables, Age and EstimatedSalary, and unscaled features can distort the model. StandardScaler standardises each feature to mean 0 and standard deviation 1, which generally helps the model converge and can improve accuracy. We do this in 3 steps:

  • Import StandardScaler from sklearn.preprocessing
  • Then make an instance sc_X of the object StandardScaler
  • Then fit and transform X_train and transform X_test
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)  # fit on the training data, then transform it
X_test = sc_X.transform(X_test)        # reuse the training fit so the test set is scaled identically
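If the scaling worked, each column of X_train should now have mean ≈ 0 and standard deviation ≈ 1, which is exactly what StandardScaler standardises to:

print(X_train.mean(axis=0))  # approximately [0. 0.]
print(X_train.std(axis=0))   # approximately [1. 1.]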

5. Fitting Logistic Regression to the Training Set

To fit a logistic regression to the training set, we build our classifier (Logistic) model using these 3 steps:

  • Import LogisticRegression from sklearn.linear_model
  • Make an instance classifier of the object LogisticRegression with random_state = 0 (this will give the same result every time)
  • Use this classifier to fit X_train and y_train
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
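The fitted model is exactly the K = W * X + B hypothesis from earlier, so we can read the learned parameters straight off the estimator:

print(classifier.coef_)       # W: one weight each for scaled Age and scaled EstimatedSalary
print(classifier.intercept_)  # B: the bias term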

We now have a classifier that can predict whether a person will buy a bike or not. We check its predictions on the test dataset and measure its accuracy using a confusion matrix.

6. Predicting the Test set results

y_pred = classifier.predict(X_test)

This produces an array of 0s and 1s, one prediction per user in the test set.

We now use y_test (actual results) and y_pred (predicted results) to get the accuracy of our model.
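If you want the underlying probabilities rather than hard 0/1 labels (useful when you would rather choose your own decision threshold), predict_proba exposes them:

probs = classifier.predict_proba(X_test)
print(probs[:5])  # column 0: P(will not buy), column 1: P(will buy); each row sums to 1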

7. The Confusion Matrix

Using a confusion matrix, we can understand the accuracy of our model.

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

You will get a 2×2 matrix of counts: correct predictions on the diagonal (cm[0][0] and cm[1][1]) and misclassifications off it. This is what we use to calculate accuracy.

The formula used here is:
Accuracy = ( cm[0][0] + cm[1][1] ) / ( Total test data points )

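In code, this is a one-liner, and accuracy_score from sklearn.metrics computes the same ratio directly:

accuracy = (cm[0][0] + cm[1][1]) / len(y_test)   # correct predictions / total test points
print(accuracy)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))            # same number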

As you can see, we are getting an accuracy of 89% on this dataset, which is fairly good.

Let’s Visualise What Regression Looks Like in Python

We visualise our training set result and test set result using the matplotlib library. For a quick revision: Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy, and it provides an object-oriented API for embedding plots into applications. We start by visualising the training set before we proceed to the test set.

1. Visualising the training set result

from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
# build a fine grid covering the feature space
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
# colour each grid point by the class the model predicts for it
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# overlay the actual training points, coloured by their true class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

You should get a plot with two coloured prediction regions, red for ‘will not buy’ (0) and green for ‘will buy’ (1), separated by a straight decision boundary, with the training points scattered on top.

2. Visualising the test set result

from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
The test-set plot shows the same linear decision boundary, now overlaid with the held-out test points.

There you go! We have successfully built a logistic regression model and visualised its outcomes. We have created a confusion matrix and measured the accuracy of our model, which has given us insights into user behaviour and the probability of a bike purchase. You may still be wondering whether we could have tried another technique to tackle this problem…

Could We Have Used Linear Regression Instead?

Linear regression identifies the relationship between variables, and it can be pressed into service for classification as well. When solving a classification problem using linear regression, it is essential to specify a threshold on which classification can be done. Let’s say the actual class is that the person will buy the bike, and the predicted continuous value is 0.47. Assuming our threshold is 0.5, this data point will be classified as the person not buying the bike. A prediction like this may or may not be right, since the threshold value has been assumed rather than mathematically derived. Moreover, linear regression output is unbounded, so it can fall below 0 or above 1 and cannot be read as a probability, whereas the sigmoid in logistic regression keeps every prediction between 0 and 1. This is why such classification problems are better solved with logistic regression.
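To see the difference for yourself, here is a small sketch (reusing the scaled X_train, y_train and X_test from earlier) that classifies by thresholding a plain linear regression; note how the raw predictions escape the [0, 1] range:

from sklearn.linear_model import LinearRegression

lin = LinearRegression()
lin.fit(X_train, y_train)              # treats the 0/1 labels as continuous targets
raw = lin.predict(X_test)              # unbounded: expect values below 0 and above 1
print(raw.min(), raw.max())
y_pred_lin = (raw >= 0.5).astype(int)  # classification only via an assumed 0.5 threshold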

By now, we are sure you are familiar with regular linear regression, logistic regression and decision tree algorithms, and some of their practical implementations. Knowing these techniques can help you build an impressive analytics portfolio. You can learn all these techniques and many more in Springboard’s Data Analytics Career Track Program, where we help you transform your career into a data-driven role.