In recent years, many companies have adopted artificial intelligence (AI) by applying machine learning, which gives systems the ability to learn automatically and improve from past experience without being explicitly programmed. One type of machine learning is supervised machine learning, which requires a data scientist to provide an input data set along with the desired output. The data scientist plays a crucial role in improving prediction accuracy by determining which variables, or features, the model should analyze and use to develop predictions. Once the algorithm has been trained on this sample data, it is ready to make predictions on new data. Unsupervised machine learning, by contrast, does not need any desired output data; the algorithm looks for structure, such as groups of similar samples, in unlabeled data on its own. We will learn more about unsupervised machine learning in the coming articles. In this blog, we will be discussing the K-Nearest Neighbors (KNN) classification machine learning algorithm.

Classification Machine Learning

Classification is a supervised learning approach, which means we need a training data set with the desired output to teach our algorithm. In classification, the output is typically a class or category; the algorithm learns from the labeled examples and then uses this learning to classify new data. A classification problem can be one of two types:

a) Binary classification (bi-class): the data has exactly two classes, for example identifying whether the person in a picture is male or female, whether a statement is true or false, or whether an email is spam or not spam.

b) Multinomial classification (multi-class): the data has more than two classes. Examples include the classification of genes in the yeast data set, or identifying fruits from images, which may be mango, orange, or apple.

Classification problems are those where we already have the desired output for some examples and, based on that, we classify or categorize new data. Here are some examples of classification problems:

a) Text categorization (e.g., identifying whether an email is spam or not)

b) Fraud detection (e.g., assessing a bank customer's loan payments based on the customer's history)

c) Language processing (e.g., understanding the spoken language in an audio clip)

d) Sentiment analysis (e.g., understanding a customer’s feedback)

e) Bioinformatics (e.g., identifying cancer tumor cells)

Classification can be applied to both structured and unstructured data sets. Here are the types of classification machine learning algorithms:

a) Linear Classification – This includes Logistic Regression, Naïve Bayes classification, and Fisher’s linear discriminant

b) Kernel Estimation – K-Nearest Neighbor (KNN), which we will discuss in the later part of the article.

c) Decision Trees

d) Support Vector Machines

e) Neural Networks

In this article, we are going to discuss the K-Nearest Neighbors machine learning algorithm, its use cases, and implementation.

K-Nearest Neighbors (KNN) Machine Learning Algorithm

The K-Nearest Neighbors (KNN) algorithm is a supervised machine learning algorithm. KNN needs the entire training data set at prediction time, because its training phase simply stores the data. Given a value of K and a new data point, the algorithm searches the entire data set for the K most similar samples. Here K is the number of nearest neighbors. These K nearest neighbors are identified using a distance measure such as Euclidean distance, and the prediction is derived from them, for example by a majority vote over their classes. One use case of KNN is product recommendation: when we search for a product on an e-commerce website, the site suggests a few related products. You can read more about how the KNN algorithm works and its use cases here.
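
To make the neighbor search concrete, here is a minimal sketch of KNN classification in plain Python. It is an illustration only, not the implementation used later in this article: the function names `euclidean` and `knn_classify`, the toy points, and the labels are all made up for this example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label.
    distances = [(euclidean(p, query), label)
                 for p, label in zip(train_points, train_labels)]
    # Keep the k entries with the smallest distances.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
labels = ["small", "small", "small", "large", "large", "large"]

print(knn_classify(points, labels, (2.0, 2.0), k=3))  # small
print(knn_classify(points, labels, (8.0, 9.0), k=3))  # large
```

Note that there is no fitting step: every call scans all stored points, which is why the cost of a prediction grows with the size of the data set.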

Let’s move forward and see how to implement the KNN algorithm.

Here we take one scenario: predicting housing values. We fetch the data set from https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data and, on top of this, apply the KNN algorithm. Because a housing value is a continuous number rather than a category, this is KNN used for regression.

In this article, we are using Jupyter Notebook together with Python (version 3) for data exploration and analysis. Our data set contains US census data concerning houses in various areas around the city of Boston. Each sample corresponds to a unique area and has about a dozen measures, which we use to predict the housing value. Let’s see the steps to implement the KNN algorithm on this data set.

  • First, we load the required libraries. We use NumPy for scientific computing with Python; it is a widely used open-source library in data science. We also load Matplotlib for multiplatform data visualization. Another library we use is TensorFlow, for deep learning and numerical operations. TensorFlow was developed by Google and released in 2015.
  • Now, using the requests module loaded in the first step, we load the data set from the UCI repository. We also define the column headers.
  • In the next step, we clean the data set by separating it into dependent and independent features. MEDV, the last variable, is the value we want to predict. We ignore ZN, RAD and CHAS because these columns do not carry informative data.
  • Next, we divide the data set into two parts, X and Y. X contains a random 80% of the rows, which we call the “train set”, and Y contains the remaining 20% of the rows, the “test set”. We do this because in supervised machine learning we need some training data to teach our algorithm and separate data to test it.
  • Next, we declare the value of K and the batch size.
  • We need to create the placeholders for the TensorFlow tensors.
  • In the next step, we define the distance metric.
  • Now it’s time to implement KNN on the refined data set. Using the distances calculated in the last step, this step finds the K nearest neighbors. The top_k() function returns the K largest values in a tensor. Since we want the smallest distances, we negate the distances and take the K largest of these negative values.
  • Let’s test the code
  • In this step, we plot a histogram of the actual target values versus the predicted values.
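
The steps above can be sketched end to end as follows. This is a hedged illustration in plain Python, not the TensorFlow code the steps describe: it uses made-up stand-in rows instead of downloading housing.data, the value K = 4 is an assumption, and `heapq.nlargest` stands in for the top-k operation to show the negated-distance trick. Batching and the final plot are left out.

```python
import heapq
import math
import random

random.seed(0)  # fixed seed so the random split is repeatable

# Hypothetical stand-in data: two features plus a target equal to their sum.
# The real housing.data file is NOT downloaded in this sketch.
rows = [(float(a), float(b), float(a + b))
        for a in range(10) for b in range(10)]

# Separate the independent features from the dependent target (last column).
features = [r[:-1] for r in rows]
targets = [r[-1] for r in rows]

# Random 80/20 split into a train set and a test set.
indices = list(range(len(rows)))
random.shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]

K = 4  # number of nearest neighbors (assumed value)


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict(query):
    """Average the targets of the K training rows nearest to `query`."""
    # Negate the distances so that taking the K largest values
    # selects the K smallest distances.
    negated = [(-distance(features[i], query), i) for i in train_idx]
    nearest = heapq.nlargest(K, negated)
    return sum(targets[i] for _, i in nearest) / K


# Test the code: predict every held-out row and measure the error.
predictions = [predict(features[i]) for i in test_idx]
actual = [targets[i] for i in test_idx]
mse = sum((p - a) ** 2 for p, a in zip(predictions, actual)) / len(actual)
print(f"test rows: {len(actual)}, mean squared error: {mse:.3f}")
```

Averaging the neighbors' targets is what turns KNN into a regressor; swapping the average for a majority vote would give a classifier on the same neighbor search.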

KNN is a versatile supervised machine learning algorithm that is easy to implement and can be used for both regression and classification. Its main drawback is that on a large data set it becomes expensive to calculate the K nearest values, because every prediction compares the new point against all stored samples. Even so, its wide range of use cases makes KNN a ‘go-to algorithm’ for data scientists.

Because KNN can be applied to both structured and unstructured data sets, many companies use it for tasks such as failure prediction. This also makes KNN an important topic in interview rounds. Springboard India’s 1:1 mentor-led, project-based data science, data analytics and AI/ML career tracks are industry-focussed online learning programs that come with a job guarantee, designed to prepare you for a meaningful and successful career in future technologies.

In the next blog, we will look at the K-means clustering machine learning algorithm.