Data classification is one of the most common problems in data analytics. Platforms like R and Python make the process simpler, but it is still essential to understand which technique to use. In this blog post, we will talk about one of the most powerful and easy-to-train classifiers, Naive Bayes classification. It is a technique that uses Bayes' theorem to determine the probability of an outcome given a set of conditions. We have studied its possible applications and even tried our hand at an email spam filtering dataset in Python. Scikit-learn, one of the most important libraries we use in Python, provides several Naive Bayes implementations, including Bernoulli, multinomial, and Gaussian. This blog is the third in a series on the Naive Bayes algorithm. You can read part 1 and part 2 in the blogs on the introduction to Bayes' theorem and the Naive Bayes algorithm, and on email spam filtering using the Naive Bayes classifier.

Before we dig deeper into each of these Naive Bayes variations, let us describe them briefly:

• Bernoulli Naive Bayes is a binary model, particularly useful when a feature can be either present or absent.
• Multinomial Naive Bayes assumes a feature vector where each element represents the number of times a term appears (or, very often, its frequency).
• Gaussian Naive Bayes, instead, is based on a continuous distribution characterised by its mean and variance. It is suitable for more generic classification tasks.

Let's dig into each of these techniques and see how best to use them in our data analytics problems.

## Naive Bayes Classification Using Bernoulli

If A is a random variable that follows a Bernoulli distribution, it can assume only two values (for simplicity, let's call them 0 and 1). Their probabilities are:

$$
P(A) =
\begin{cases}
p & \text{if } A = 1 \\
q & \text{if } A = 0
\end{cases}
\qquad \text{where } q = 1 - p \text{ and } 0 < p < 1
$$
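As a minimal sketch of the definition above, the probability can be written as a small function. The name `bernoulli_pmf` is hypothetical and used only for illustration.

```python
def bernoulli_pmf(a, p):
    """Return P(A = a) for a Bernoulli variable with success probability p."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    if a == 1:
        return p          # P(A = 1) = p
    if a == 0:
        return 1 - p      # P(A = 0) = q = 1 - p
    raise ValueError("a Bernoulli variable can only be 0 or 1")


# The two probabilities always add up to 1.
prob_one = bernoulli_pmf(1, 0.3)
prob_zero = bernoulli_pmf(0, 0.3)
print(prob_one, prob_zero)
```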

Let's try this algorithm on a dummy dataset that we create, using the scikit-learn library to implement Bernoulli Naive Bayes. Remember that Bernoulli Naive Bayes expects binary feature vectors; however, the BernoulliNB class has a binarize parameter that specifies a threshold used internally to transform the features. First, we generate the dataset:

```python
>>> from sklearn.datasets import make_classification
>>> nb_samples = 300
>>> X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0)
```

This generates a two-dimensional dataset.

We have decided to use 0.0 as a binary threshold. This way, each point can be characterised by the quadrant where it’s located.

```python
>>> from sklearn.naive_bayes import BernoulliNB
>>> from sklearn.model_selection import train_test_split
>>> # Hold out 25% of the samples for testing
>>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)
>>> # Features above 0.0 become 1, all others become 0
>>> bnb = BernoulliNB(binarize=0.0)
>>> bnb.fit(X_train, Y_train)
>>> bnb.score(X_test, Y_test)
0.85333333333333339
```

This score is rather good! To understand how the binary classifier worked, it’s useful to see how the data have been internally binarized:
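The thresholding itself can be sketched in plain Python. This is an illustration rather than scikit-learn's internal code: the helper `binarize` and the sample points are hypothetical, and the rule assumed is the documented one, where values strictly above the threshold map to 1 and all others map to 0.

```python
def binarize(rows, threshold=0.0):
    """Map each value to 1 if it is strictly above the threshold, else 0."""
    return [[1 if value > threshold else 0 for value in row] for row in rows]


# Hypothetical 2D points, one per quadrant, plus the origin.
points = [[-1.2, 0.5], [0.7, -0.3], [2.0, 1.5], [-0.4, -0.9], [0.0, 0.0]]
binary_points = binarize(points)
print(binary_points)
```

With a threshold of 0.0, every point collapses onto one of four binary vectors, one per quadrant, which is all the Bernoulli classifier gets to see.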

Let’s check the naive Bayes predictions we obtain:

```python
>>> import numpy as np
>>> data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
>>> bnb.predict(data)
array([0, 0, 1, 1])
```

This is the output we expected from Bernoulli Naive Bayes!

## Data Classification Using Multinomial Naive Bayes Algorithm

Multinomial Naive Bayes is useful for modelling feature vectors where each value represents the number of occurrences of a term or its relative frequency. For example, if a feature vector has n elements and each of them can assume k different values with probability p_k, then:
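Assuming the standard multinomial distribution is what is meant here, the probability of observing the counts $x_1, \dots, x_k$ (with $x_1 + \dots + x_k = n$) for outcomes that have probabilities $p_1, \dots, p_k$ is:

$$
P(X_1 = x_1, \dots, X_k = x_k) = \frac{n!}{\prod_{i=1}^{k} x_i!} \prod_{i=1}^{k} p_i^{x_i}
$$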

The conditional probabilities P(x_i | y) are computed with a frequency count, which corresponds to applying a maximum likelihood approach. When using Multinomial Naive Bayes, keep the Laplace smoothing factor (the alpha parameter) in mind. Its default value is 1.0, and it prevents the model from assigning null probabilities when the frequency is zero.
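To see what the smoothing factor does, here is a minimal sketch of additive smoothing, assuming the usual estimate (count + alpha) / (total + alpha * n), where n is the number of features. The function name and the counts are hypothetical.

```python
def smoothed_probabilities(counts, alpha=1.0):
    """Estimate per-term probabilities for one class with additive smoothing."""
    n_features = len(counts)
    total = sum(counts)
    return [(count + alpha) / (total + alpha * n_features) for count in counts]


# Hypothetical term counts for one class; the third term never occurs.
counts = [100, 50, 0, 25]

unsmoothed = smoothed_probabilities(counts, alpha=0.0)  # plain frequency count
smoothed = smoothed_probabilities(counts, alpha=1.0)    # Laplace smoothing

print(unsmoothed[2])  # zero probability without smoothing
print(smoothed[2])    # small but non-zero probability with alpha = 1.0
```

Without smoothing, a single unseen term would drive the whole product of conditional probabilities to zero for that class.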

Let's understand this with an example, using the DictVectorizer. We consider only two records: the first represents a city, while the second represents the countryside.

```python
>>> import numpy as np
>>> from sklearn.feature_extraction import DictVectorizer
>>> data = [
...     {'house': 100, 'street': 50, 'shop': 25, 'car': 100, 'tree': 20},
...     {'house': 5, 'street': 5, 'shop': 0, 'car': 10, 'tree': 500, 'river': 1}
... ]
>>> # sparse=False returns a dense NumPy array instead of a sparse matrix
>>> dv = DictVectorizer(sparse=False)
>>> X = dv.fit_transform(data)
>>> Y = np.array([1, 0])
>>> X
array([[ 100.,  100.,    0.,   25.,   50.,   20.],
       [  10.,    5.,    1.,    0.,    5.,  500.]])
```
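The columns of this array are not in the order in which the keys were typed. As a small sketch, the fitted vectorizer's `feature_names_` attribute shows which term each column holds; the variable name `records` is only used here to keep the example self-contained.

```python
from sklearn.feature_extraction import DictVectorizer

records = [
    {'house': 100, 'street': 50, 'shop': 25, 'car': 100, 'tree': 20},
    {'house': 5, 'street': 5, 'shop': 0, 'car': 10, 'tree': 500, 'river': 1},
]

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(records)

# By default the feature names are sorted, so columns follow alphabetical order.
feature_names = dv.feature_names_
for name, column in zip(feature_names, X.T):
    print(name, column)
```

The 'river' column is 0 for the first record because that key is absent there.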

Note that the term 'river' is missing from the first record, so it's useful to keep alpha equal to 1.0 to give it a small probability. The output classes are 1 for the city and 0 for the countryside. Now we can train a Multinomial Naive Bayes instance:

```python
>>> from sklearn.naive_bayes import MultinomialNB
>>> mnb = MultinomialNB()
>>> mnb.fit(X, Y)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
```

To test the model, we create a dummy city with a river and a dummy countryside location without any river.

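```python
>>> test_data = [
...     {'house': 80, 'street': 20, 'shop': 15, 'car': 70, 'tree': 10, 'river': 1},
...     {'house': 10, 'street': 5, 'shop': 1, 'car': 8, 'tree': 300, 'river': 0}
... ]
>>> # Reuse the fitted vectorizer (transform, not fit_transform) so the columns match the training data
>>> mnb.predict(dv.transform(test_data))
array([1, 0])
```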

As we can see, this prediction is correct!

## Naive Bayes Classification Using Gaussian

Gaussian Naive Bayes is useful when working with continuous values where probabilities can be modelled using a Gaussian distribution:
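Assuming the usual univariate Gaussian with mean $\mu$ and variance $\sigma^2$ is intended, the density is:

$$
P(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
$$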

The conditional probabilities P(x_i | y) are also Gaussian distributed, so it's necessary to estimate the mean and variance of each of them using the maximum likelihood approach. Considering the properties of a Gaussian, we get:
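Under the standard derivation (an assumption about what is intended here), the quantity to maximise is the log-likelihood of feature $x_i$ for class $y$, which turns the product over samples into a sum:

$$
L(\mu_i, \sigma_i^2;\, x_i \mid y) = \log \prod_k P\left(x_i^{(k)} \mid y\right) = \sum_k \log P\left(x_i^{(k)} \mid y\right)
$$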

• The index k refers to the samples in our dataset
• P(x_i | y) is a Gaussian itself

From this, we get the mean and variance of each Gaussian associated with P(x_i | y), and the model is thereby trained.
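The training step can be sketched in plain Python for a single feature. The samples and the helper name below are hypothetical; the estimates are the maximum likelihood ones, that is, the per-class sample mean and the population variance (dividing by the number of samples). Note that scikit-learn's GaussianNB additionally adds a small smoothing term to the variances, which this sketch leaves out.

```python
from statistics import mean, pvariance

# Hypothetical one-feature training set: (feature value, class label).
samples = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (8.0, 1), (10.0, 1)]


def fit_gaussian_per_class(samples):
    """Return {label: (mean, variance)} using maximum likelihood estimates."""
    values_by_class = {}
    for value, label in samples:
        values_by_class.setdefault(label, []).append(value)
    return {
        label: (mean(values), pvariance(values))
        for label, values in values_by_class.items()
    }


params = fit_gaussian_per_class(samples)
for label, (mu, var) in sorted(params.items()):
    print(f"class {label}: mean={mu:.3f}, variance={var:.3f}")
```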

Let’s compare Gaussian Naive Bayes with logistic regression using the ROC curves as an example.

```python
>>> from sklearn.datasets import make_classification
>>> nb_samples = 300
>>> X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0)
```

Here is the dataset that you may obtain:

Let’s train both models and generate the ROC curves:

```python
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.metrics import roc_curve, auc
>>> from sklearn.model_selection import train_test_split

>>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)
>>> gnb = GaussianNB()
>>> gnb.fit(X_train, Y_train)
>>> # predict_proba returns one column per class; column 1 is the positive class
>>> Y_gnb_score = gnb.predict_proba(X_test)
>>> lr = LogisticRegression()
>>> lr.fit(X_train, Y_train)
>>> # decision_function returns signed distances, which also work as ranking scores
>>> Y_lr_score = lr.decision_function(X_test)
>>> fpr_gnb, tpr_gnb, thresholds_gnb = roc_curve(Y_test, Y_gnb_score[:, 1])
>>> fpr_lr, tpr_lr, thresholds_lr = roc_curve(Y_test, Y_lr_score)
```

The resulting ROC curves would be like this:
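Two curves are easier to compare with a single number each. As a hedged, self-contained sketch, the area under each curve can be computed with the `auc` function imported earlier. The `random_state` values are an assumption added here only to make the run repeatable; they are not part of the original example, and the exact areas will change with the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# random_state is an assumption for reproducibility, not part of the original.
X, Y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0)

gnb = GaussianNB().fit(X_train, Y_train)
lr = LogisticRegression().fit(X_train, Y_train)

# Scores for the positive class are used to rank the test samples.
fpr_gnb, tpr_gnb, _ = roc_curve(Y_test, gnb.predict_proba(X_test)[:, 1])
fpr_lr, tpr_lr, _ = roc_curve(Y_test, lr.decision_function(X_test))

# Area under each ROC curve: 0.5 is chance level, 1.0 is a perfect ranking.
auc_gnb = auc(fpr_gnb, tpr_gnb)
auc_lr = auc(fpr_lr, tpr_lr)
print(f"Gaussian Naive Bayes AUC: {auc_gnb:.3f}")
print(f"Logistic regression AUC:  {auc_lr:.3f}")
```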

Finally, we can compare Gaussian and Multinomial Naive Bayes on the scikit-learn digits dataset, using the mean accuracy over a 10-fold cross-validation:

```python
>>> from sklearn.datasets import load_digits
>>> from sklearn.model_selection import cross_val_score
>>> digits = load_digits()
>>> gnb = GaussianNB()
>>> mnb = MultinomialNB()
>>> cross_val_score(gnb, digits.data, digits.target, scoring='accuracy', cv=10).mean()
0.81035375835678214
>>> cross_val_score(mnb, digits.data, digits.target, scoring='accuracy', cv=10).mean()
0.88193962163008377
```