Naive Bayes is a probabilistic algorithm based on Bayes’ Theorem, used for classification in data analytics. Yes, data analytics is a lot of prediction and classification! And one algorithm that analysts reach for again and again is Naive Bayes. Mostly used for constructing classifiers, the Naive Bayes technique assumes that the value of a particular feature is independent of the value of any other feature. It derives from Bayes’ Theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event. A Naive Bayes classifier could simply say that a fruit may be considered an apple if it is red, round, and about 10 cm in diameter. The classifier treats each of these features as contributing independently to the probability that the fruit is an apple, regardless of any possible correlations between the colour, roundness, and diameter features.

Bayes Theorem Example

The Naive Bayes classifier is one of the simplest and most powerful algorithms in Data Analytics. It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. Given a hypothesis A and evidence B, Bayes’ Theorem states the relationship between the probability of the hypothesis before seeing the evidence, P(A), and the probability of the hypothesis after seeing the evidence, P(A | B):

P(A | B) = P(B | A) * P(A) / P(B)

This relates the probability of the hypothesis before getting the evidence, P(A), to the probability of the hypothesis after getting the evidence, P(A | B). For this reason,

  • P(A) is called the prior probability
  • P(A | B) is called the posterior probability
  • P(B | A) / P(B) is called the likelihood ratio, which relates the two probabilities

Bayes’ Theorem is often summarised as: “the posterior probability equals the prior probability times the likelihood ratio.”


Bayes Theorem – Formula & Calculator

While we have covered the Bayes formula in theory, let’s make it clearer with a practical example. Here is a quick revision of Bayes’ Theorem before moving on to the Naive Bayes classifier.

Question:
You have a deck of cards and need to find the probability that a card picked at random is a Queen, given that it is a Face card.

Solution:

P(Queen) = 4/52 (there are 4 Queens in a deck of 52 cards)
P(Face | Queen) = 1 (all Queens are Face cards)
P(Face) = 12/52 (3 Face cards per suit: King, Queen, Jack, across 4 suits)
P(Queen | Face) = to be calculated

P(Queen | Face) = P(Face | Queen) * P(Queen) / P(Face) = (1 * 4/52) / (12/52) = 1/3

So, using Bayes’ Theorem, the answer is 1/3.
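
If you want to verify the arithmetic, here is a minimal Python sketch of the same calculation; the variable names are just for illustration:

# a quick check of the card example
p_queen = 4 / 52          # 4 Queens in a deck of 52 cards
p_face_given_queen = 1.0  # every Queen is a Face card
p_face = 12 / 52          # King, Queen, Jack in each of 4 suits

# Bayes' Theorem: P(Queen | Face) = P(Face | Queen) * P(Queen) / P(Face)
p_queen_given_face = p_face_given_queen * p_queen / p_face
print(p_queen_given_face)  # 0.3333..., i.e. 1/3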


Bayes Theorem: The Naive Bayes Classifier

The Bayes Rule provides the formula for the probability of A given B. But, in actual problems, there are multiple B variables. When the features are independent, we can extend the Bayes Rule to what is called Naive Bayes. It is called ‘Naive’ because of the naive assumption that the B’s are independent of each other. Regardless of its name, it’s a powerful formula.

Bayes Rule:   P(A = x | B) = [P(B | A = x) * P(A = x)] / P(B)

*here x is a class of A

When there are multiple B variables, we simplify it by assuming that B’s are independent. So the Bayes Rule becomes Naive Bayes Rule:

P(A = x | B1, B2, B3, ..., Bn) = [P(B1 | A = x) * P(B2 | A = x) * P(B3 | A = x) * ... * P(Bn | A = x) * P(A = x)] / [P(B1) * P(B2) * P(B3) * ... * P(Bn)]

OR

P(Outcome | Evidence) = (Likelihood of evidence * Prior) / Probability of evidence

  • The probability of evidence is the same for all classes of A, so it can be ignored when comparing classes
  • P(Outcome | Evidence) is called the posterior probability
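
To make the formula concrete before the full walkthrough, here is a small sketch that scores two classes for an observation with three features; all the numbers and class names are invented purely for illustration:

# hypothetical conditional probabilities P(Bi | A = x) for three features
likelihoods = {
    'class_a': [0.8, 0.6, 0.7],   # P(B1|a), P(B2|a), P(B3|a)
    'class_b': [0.1, 0.3, 0.2],
}
priors = {'class_a': 0.4, 'class_b': 0.6}   # P(A = x)

# numerator of the Naive Bayes rule for each class; the denominator
# P(B1)*P(B2)*P(B3) is the same for every class, so it can be dropped
# when we only want to compare classes
scores = {}
for label in priors:
    score = priors[label]
    for p in likelihoods[label]:
        score *= p
    scores[label] = score

print(scores)                       # unnormalised posterior scores
print(max(scores, key=scores.get))  # class with the highest score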

A Working Example in Python

Let’s generate a small binary (two-class) classification problem using the make_blobs() function from the Scikit-learn API. The code below generates 1,000 examples with two numerical input variables, each example assigned to one of two classes.

# example of generating a small classification dataset
from sklearn.datasets import make_blobs
# generate 2d classification dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=1)
# summarise
print(X.shape, y.shape)
print(X[:5])
print(y[:5])

Running this code generates a dataset and summarises the size, confirming that the code has run successfully. Here, the “random_state” argument is set to 1. This ensures that the same random sample of observations is generated each time the code is run. The input and output elements of the first five examples can be seen, showing that the two input variables are numeric and the class labels are either 0 or 1 for each example.

(1000, 2) (1000,)
[[-10.6105446    4.11045368]
[9.05798365   0.99701708]
[8.705727     1.36332954]
[-8.29324753   2.35371596]
[6.5954554    2.4247682 ]]
[0 1 1 0 1]

Before we run the Naive Bayes algorithm, we will model each input variable with a Gaussian probability distribution. We can do this using the norm() API from SciPy.

# fit a probability distribution to a univariate data sample
def fit_distribution(data):
    # estimate parameters
    mu = mean(data)
    sigma = std(data)
    print(mu, sigma)
    # fit distribution
    dist = norm(mu, sigma)
    return dist

Here, we want to understand the conditional probability of each input variable. This means we need one distribution for each of the input variables, and one set of distributions for each of the class labels, or four distributions in total. 

We now split the data into groups of samples for each of the class labels.

# sort data into classes
Xy0 = X[y == 0]
Xy1 = X[y == 1]
print(Xy0.shape, Xy1.shape)

We use these groups to calculate the prior probability of a data sample belonging to each group. This will be exactly 50%, given that we have created the same number of examples in each of the two classes, but in the general case these priors must be calculated as part of the procedure.

# calculate priors
priory0 = len(Xy0) / len(X)
priory1 = len(Xy1) / len(X)
print(priory0, priory1)

Now, we call the fit_distribution() function that we defined to prepare a probability distribution for each variable, for each class label.

# create PDFs for y==0
X1y0 = fit_distribution(Xy0[:, 0])
X2y0 = fit_distribution(Xy0[:, 1])
# create PDFs for y==1
X1y1 = fit_distribution(Xy1[:, 0])
X2y1 = fit_distribution(Xy1[:, 1])

Putting it together, the complete probabilistic model of the dataset is listed below.

# summarise probability distributions of the dataset
from sklearn.datasets import make_blobs
from scipy.stats import norm
from numpy import mean
from numpy import std

# fit a probability distribution to a univariate data sample
def fit_distribution(data):
    # estimate parameters
    mu = mean(data)
    sigma = std(data)
    print(mu, sigma)
    # fit distribution
    dist = norm(mu, sigma)
    return dist

# generate 2d classification dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=1)
# sort data into classes
Xy0 = X[y == 0]
Xy1 = X[y == 1]
print(Xy0.shape, Xy1.shape)
# calculate priors
priory0 = len(Xy0) / len(X)
priory1 = len(Xy1) / len(X)
print(priory0, priory1)
# create PDFs for y==0
X1y0 = fit_distribution(Xy0[:, 0])
X2y0 = fit_distribution(Xy0[:, 1])
# create PDFs for y==1
X1y1 = fit_distribution(Xy1[:, 0])
X2y1 = fit_distribution(Xy1[:, 1])

When we run this code, the dataset is first split into two groups for the two class labels, confirming that the groups are the same size and that the priors are both 50%.

Probability distributions are then prepared for each variable for each class label and the mean and standard deviation parameters of each distribution are reported, confirming that the distributions differ.

(500, 2) (500, 2)
0.5 0.5
-1.5632888906409914 0.787444265443213
4.426680361487157 0.958296071258367
-9.681177100524485 0.8943078901048118
-3.9713794295185845 0.9308177595208521

Let’s use the prepared probabilistic model to make a prediction. 

# calculate the independent conditional probability
def probability(X, prior, dist1, dist2):
    return prior * dist1.pdf(X[0]) * dist2.pdf(X[1])

This function is now used to calculate the probability for an example belonging to each class. First, we select an example to be classified. Let’s take the first example in the dataset.

# classify one example
Xsample, ysample = X[0], y[0]

Next, we calculate the score of the example belonging to the first class, then the second class.

py0 = probability(Xsample, priory0, X1y0, X2y0)
py1 = probability(Xsample, priory1, X1y1, X2y1)
print('P(y=0 | %s) = %.3f' % (Xsample, py0*1000))
print('P(y=1 | %s) = %.3f' % (Xsample, py1*1000))

The class with the largest score will be the resulting classification.
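
In code, that decision is a simple comparison of the two scores, for example:

# pick the class with the larger unnormalised score
yhat = 0 if py0 > py1 else 1
print('Predicted: y=%d' % yhat)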

The printed score of the example belonging to y=0 is about 0.348 (an unnormalised probability, scaled by 1,000 in the print statements for readability), whereas the score of the example belonging to y=1 is 0.000. Hence, we classify the example as belonging to y=0. In this case the true outcome is known, y=0, which matches the prediction of our Naive Bayes model.

P(y=0 | [-0.79415228  2.10495117]) = 0.348
P(y=1 | [-0.79415228  2.10495117]) = 0.000
Truth: y=0
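
As a sanity check, the same kind of model can be fitted with scikit-learn’s built-in GaussianNB class, which implements Gaussian Naive Bayes directly; it should predict the same class for this example:

# cross-check with scikit-learn's Gaussian Naive Bayes
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=1)
model = GaussianNB()
model.fit(X, y)
print(model.predict([X[0]]))        # predicted class label for the first example
print(model.predict_proba([X[0]]))  # normalised posterior probabilities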

Practical Implementations of Naive Bayes 

Naive Bayes is more commonly used than you may realise. Some interesting use cases are news categorisation, text classification, spam filtering in emails, and sentiment analysis of comments on social media. The algorithm sees the most use in text classification and in problems with multiple classes. Let’s look at the kind of impact this algorithm is making in the Data Analytics world.

  1. News Categorisation: With news on the web growing rapidly, each website has its own way of grouping similar news articles. Organisations use web crawlers to extract the useful text from the HTML pages of news articles to construct a Full-Text-RSS feed, and a Naive Bayes classifier is then trained on this text to assign each article to a news category.
  2. Spam Filtering: Naive Bayes classifiers use the words in a message to identify spam email. As in other text classification problems, the algorithm correlates words (and sometimes other features) with spam and non-spam messages and then uses Bayes’ Theorem to calculate the probability that an email is or is not spam. Interesting, huh!?

    For instance, words like “Lottery” and “Lucky Draw” appear frequently in spam but are seldom seen in other emails. Each word in the email contributes to the email’s spam probability; this contribution is the posterior probability, computed using Bayes’ Theorem. These contributions are combined over all the words in the email, and if the resulting spam probability exceeds a certain threshold (say 90%), the filter marks the email as spam. A minimal sketch of this idea follows after this list.
  3. Medical Diagnosis: Hospitals are getting smarter, with algorithms like Naive Bayes helping doctors make decisions. The algorithm takes into account evidence from many data points to make its final prediction, and it can provide a detailed explanation of its decision, which makes it one of the more useful classifiers for supporting what doctors say.
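
To make the spam-filtering idea concrete, here is a minimal sketch using scikit-learn’s CountVectorizer and MultinomialNB; the training messages and labels below are made up purely for illustration:

# a toy spam filter: word counts fed into a Multinomial Naive Bayes model
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# invented training data for illustration only
emails = [
    'you won the lottery claim your lucky draw now',
    'lucky draw winner claim the prize today',
    'meeting agenda for monday attached',
    'please review the quarterly report',
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(emails)

model = MultinomialNB()
model.fit(counts, labels)

test = vectorizer.transform(['claim your lottery prize'])
print(model.predict(test))        # predicted label
print(model.predict_proba(test))  # probabilities to compare against a threshold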

With use cases like these, Naive Bayes has become a go-to algorithm for many data problems involving text classification, and several variations of the algorithm have stood the test of time. Can you think of any more cases where it could be applied?

Along with the usual linear regression, logistic regression, and decision tree algorithms, Naive Bayes can give you a step up in your career and help you build an impressive analytics portfolio. You can learn this algorithm and many more in Springboard’s Data Analytics Career Track program, which is 1:1 mentoring-led, project-driven and comes with a job guarantee.