Logistic Regression is another technique borrowed from statistics by machine learning for classification problems.  Approximately 70% of the data science problems are classification problems making them an integral part of machine learning and data science. Classification is a technique where the data is categorized into a specific number of classes. Let’s understand the concept of classification with an example. Suppose that you want to classify gender (target class- male or female) using hair length as a feature. You can come up with certain boundary conditions like if the hair length value is less than 15 cm then gender is male or else female. Using hair length as a training feature, you can build and train a classification model based on the boundary conditions. Classification basically predicts the category to which the data belongs to.

Another example could be classifying fruits from a bunch of fruits based on features like weight, color, shape, size, and taste. In this case, the target classes can be -Apple, Banana, Orange, Cherry, etc. This is a multinomial classification where the target class can be of 3 or more types.

What is Logistic Regression?

Say you want to understand whether the exam performance of a student can be predicted based on his/her lecture attendance, revision time, and test anxiety. This prediction can be done using binomial logistic regression where the dependent variable “exam performance” is to be measured on a dichotomous scale – “pass” or “fail”. There are three independent variables namely -lecture attendance, revision time, and test anxiety.

Logistic Regression, despite its name, is not an algorithm for regression problems but one of the widely used machine learning algorithms for binary or multinomial classification problems. It is the classification counterpart of linear regression. Logistic Regression is a predictive modeling algorithm for modeling binary categorical variables. Logistic Regression falls under the category of supervised machine learning algorithms.  The main goal of logistic regression to find conditional probability, of output is y=1 given that the input is X.  Logistic regression is represented as: 

P(y=1 | X) where y is output and X is input.

https://lh6.googleusercontent.com/lLUXS9RJ8diOaa0l6Oqsb9tH3Vj7amMYqJbYdtQEo7b98VVIX-LhofeojrZF5DwHOPugn22ghr95JgU4z9s4hNB5jRYEorHDOC7GLnlzXBhmc_l0Thg52QbtO9KvdA

Image Credit – DataAspirant.com

Given a set of independent variables, logistic regression predicts a binary outcome  (True/False, 0/1, Yes/No, Happy/Sad, Approved/Not Approved). If the target variable or response variable has more than two types, then it is termed as multinomial logistic regression. Logistic Regression is used when there is a –

  • Dichotomous or binary variable Y.
  • Explanatory X-variables that are related to the variable Y
  • The value of Y-variables depends on the explanatory variables. 

The output of a logistic regression algorithm is not a discrete or outright class and is more informative compared to other classification algorithms.  Logistic regression gets the probabilities associated with every observation which can be applied to multiple standard and custom performance metrics to get a cut-off probability score and classify the output to best fit the business problem. 

Logistic Regression uses a complex cost function defined as the Sigmoid function to measure the relationship between the independent variables and the categorical dependent variables by estimating the probabilities. However, there are some assumptions that should hold good for logistic regression –

  • The dependent variable has to be categorical.
  • Independent variables should not be highly correlated with each other.
  • It requires relatively large samples.

What logistic regression seeks to do?

Imagine that you are buying a house for the first time and organizing all your financial records to apply for a home mortgage. In the process of organizing your financial records, you ask your bank to provide you with a credit report so that you can have a check on your credit score which can range from 300 to 900 before applying for a home mortgage.  Turns out that your credit score is 770. This credit score plays a vital role in determining whether your request will be approved for a home mortgage, along with other factors like income, job status, expenditures, and more. 

In this process of organizing your financial records and doing research on whether your home mortgage is likely to be approved or not, you come across raw data of 1000 applicant credit scores along with the details on whether the mortgage was approved or not. Using this data, you can –

  • Examine the relationship between your credit score and the mortgage being approved. More specifically, you want to build a model to determine the probability and the odds of your home mortgage being approved for a given credit score.
  • Find out approximately what credit score is associated with a probability of 50% (the odds are even) for being approved.
  • Determine the probability of your mortgage being approved by inputting the credit score to the model.
  •  Analyze the effect of improving your credit score from 770 to 800 on your probability of being approved the home mortgage.

This is what exactly logistic regression does. Logistic Regression seeks to –

  • Model the probability of an event occurring based on the values of independent variables, be it numerical or categorical. 
  • Estimate the probability of the occurrence of an event for any randomly selected observations vs. the probability that the event does not occur.
  • Classify the observations by estimating the probability that an observation belongs to a specific category. (such as home mortgage approved or not approved in our example)
  • Predict the effect of a series of variables on a given binary response variable.

Advantages of Using Logistic Regression

  • It is incredibly easy to implement, very efficient to train, requires minimal computation resources and tuning, and is highly interpretable. 
  • Considering its simplicity and ease of implementation, the logistic regression algorithm can be used as a baseline to measure the performance of other complex machine learning algorithms.
  • It is highly effective for problems where the data is linearly separable. Two sets of points are considered to be linearly separable if the points can be separated by a single line. The data is said to be linearly separable if there is a hyperplane that splits the input space into two half-spaces so that all points of the first class are present in one-half space and that of the second class are present in the other half-space. You should be able to draw a straight line to separate the points of each class as shown below –
  • Logistic regression does not just do the classification but also gives probabilities. This is an added advantage over machine learning models which provide only final classification result. It makes a huge difference to know that an instance has a 99% probability for a class compared to 50%.

Disadvantages of Using Logistic Regression

  • It is not among the most powerful machine learning algorithms and can easily be outperformed by several other machine learning algorithms.
  • It is highly reliant on the proper presentation of data. Logistic Regression predicts outcomes based on a set of independent variables so if a machine learning engineer includes wrong independent variables then there are high chances that the model will have no or very little predictive value. This implies that logistic regression works well only if all the relevant independent variables have been identified perfectly.
  • If there are non-linear or multiple decision boundaries, then the algorithm might underperform.
  • Logistic regression models are prone to overfitting meaning the logistic model can appear to have more predictive power than it actually has as a result of sampling bias.
  • Logistic regression begins to falter if there is too much of missing data and a large number of features.

Applications of Logistic Regression

  • Trauma and Injury Severity Score (TRISS) that is used to predict mortality in injured patients is developed using logistic regression machine learning model.
  • Logistic regression is used extensively in the healthcare industry – to predict if a given mass of tissue is malignant or benign.  
  • It is also used to reduce patient readmissions in hospitals by identifying patients with high probabilities of being readmitted and ensuring proper precautions.
  • To classify whether a person will get cancer because of the influencing environmental variables like smoking habits, staying in highway surroundings, etc.
  • Banks use it to predict if a customer will default a loan or not.
  • Identifying fraudulent credit card transactions.
  • To predict if an email is a spam or not.
  • Logistic regression finds applications in marketing applications as well to predict if a customer will purchase a product, if they will halt a subscription, to predict if an online shopper will respond to a promotional offer email or not, etc. 

So, by now you would have gained some basic understanding of how logistic regression algorithm works. There is lots to learn if you want to become a data scientist, and machine learning algorithms are an important part of it. Rome was not built in a day and neither can your machine learning skills be. Mastering the art of doing machine learning requires a lot of time, effort and practice. If you have chosen to seriously learn machine learning, the congratulations! We at Springboard promise to make it a fun and rewarding journey ahead for you with fun real-world projects that interest you.