In today’s data-driven world, it’s important to extract insights from the data we gather, and data mining techniques give programmers a way to uncover them. Python is one of the most popular programming languages for this work, offering the flexibility and power that programmers and data scientists need to perform data analysis and apply machine learning algorithms. In recent years, Python has become even more popular for data mining thanks to the growing number of data analysis libraries. This article will showcase how different data mining techniques work using Python, drawing on the most commonly used libraries for data analysis, such as Scikit-learn, Matplotlib, and NumPy, for our examples.

How do Data Mining Techniques Work Using Python?

Here’s how data mining techniques work:

1. Data Mining Techniques: Classification

Classification (a type of supervised learning) identifies which of a set of categories an observation belongs to, based on a training data set of observations whose categories are already known. The most common Python library used for classification is Scikit-learn.

Let’s take an example dataset to identify fruits. The “size”, “color” and “shape” will be the features of each fruit, and the class labels will be “apple”, “orange” and “watermelon”. For this article, we will use two classification methods: the decision tree classifier and the KNN (k-nearest neighbours) classifier.
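To make the fruit example concrete, here is a minimal sketch on a hypothetical, hand-made dataset. The feature encodings (size in cm, numeric codes for color and shape) are assumptions for illustration only; real data would need proper preprocessing.

```python
from sklearn.tree import DecisionTreeClassifier

# hypothetical fruit data: [size_cm, color_code, shape_code]
# assumed encodings: color 0=red, 1=orange, 2=green; shape 0=round, 1=oblong
features = [
    [7, 0, 0],   # apple: small, red, round
    [8, 1, 0],   # orange: small, orange, round
    [25, 2, 0],  # watermelon: large, green, round
    [7, 2, 0],   # apple: small, green, round
    [9, 1, 0],   # orange: small, orange, round
    [30, 2, 1],  # watermelon: large, green, oblong
]
labels = ["apple", "orange", "watermelon", "apple", "orange", "watermelon"]

# train a decision tree on the labelled fruit observations
clf = DecisionTreeClassifier(random_state=0).fit(features, labels)

# classify a new, unseen fruit: large, green and round
print(clf.predict([[26, 2, 0]]))  # -> ['watermelon']
```

With so few samples the tree simply learns that large size separates watermelons; the iris example below shows the same workflow on a real dataset.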

Decision Tree Classifier

The simplest way to visualize the decision tree classifier is as a binary tree. At the root and at every internal node, a question is asked, and the data at that node is split into subsets based on its features. Let’s take an example of training a classifier in Scikit-learn.

How does it work?

  1. Load the dataset and split the dataset into training data and test data
  2. Train the decision tree (using the classification methods) on the training data
  3. Use the classifiers to predict the class label for the test data
  4. Calculate the accuracy of prediction

from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. load the dataset and split it into training and test data
iris = datasets.load_iris()
a = iris.data
b = iris.target
a_train, a_test, b_train, b_test = train_test_split(a, b, random_state=0)

# 2. train the decision tree on the training data
dtree_model = DecisionTreeClassifier(max_depth=2).fit(a_train, b_train)

# 3. predict the class labels for the test data
dtree_predictions = dtree_model.predict(a_test)

# 4. calculate the accuracy and the confusion matrix of the predictions
accuracy = dtree_model.score(a_test, b_test)
c = confusion_matrix(b_test, dtree_predictions)
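To see the binary-tree structure described above, Scikit-learn’s export_text helper prints the question asked at each node and the class reached at each leaf:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = datasets.load_iris()
a_train, a_test, b_train, b_test = train_test_split(
    iris.data, iris.target, random_state=0
)
dtree_model = DecisionTreeClassifier(max_depth=2).fit(a_train, b_train)

# print the learned splits as an indented text tree
rules = export_text(dtree_model, feature_names=list(iris.feature_names))
print(rules)
```

Each `|---` line is one split on a feature threshold, so you can read off exactly which questions the tree asks before assigning a class.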

KNN Classifier

The KNN classifier is one of the simplest classification algorithms. When a new observation arrives, the k nearest neighbours in the training dataset are examined, and the most common class label among them is assigned to the new observation.

from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
a = iris.data
b = iris.target
a_train, a_test, b_train, b_test = train_test_split(a, b, random_state=0)

# train a classifier that looks at the 7 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=7).fit(a_train, b_train)

acc = knn.score(a_test, b_test)
print(acc)

knn_predictions = knn.predict(a_test)
c = confusion_matrix(b_test, knn_predictions)
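The accuracy of KNN depends on the choice of k, so it is worth comparing a few candidate values before settling on one. A quick sketch of that comparison on the same iris split:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
a_train, a_test, b_train, b_test = train_test_split(
    iris.data, iris.target, random_state=0
)

# compare test accuracy for several values of k
scores = {}
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k).fit(a_train, b_train)
    scores[k] = knn.score(a_test, b_test)
    print(k, scores[k])
```

Small k values fit the training data more closely; larger values smooth the decision boundary. In practice this comparison is done with cross-validation rather than a single test split.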

2. Data Mining Techniques: Clustering

Clustering means grouping a set of objects so that the objects in one cluster are more similar to each other than to those in other clusters. Unlike classification, this type of data analysis is unsupervised learning: no class labels are given in advance. One of the most popular clustering techniques is K-Means. Let’s consider an example and visualize the clustering using Python code. For this example, we will generate a dataset of 50 random points grouped into two regions, and use the K-Means algorithm to group the samples based on their features.

import matplotlib.pyplot as plot
from sklearn.datasets import make_blobs

# generate 50 sample points with 2 features, grouped around 2 centers
a, b = make_blobs(
    n_samples=50, n_features=2,
    centers=2
)

# plot the raw samples
plot.scatter(
    a[:, 0], a[:, 1],
    c='white', marker='*',
    edgecolor='black', s=50
)
plot.show()


How does the algorithm work? 

  1. Select K centroids from the sample points as the cluster centers
  2. Assign the samples to the closest centroid
  3. Move the centroid towards the center of the samples that are assigned to the centroid
  4. Repeat the above steps until the centroid positions no longer change
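Before turning to the library implementation, the four steps above can be sketched directly in NumPy. This is a minimal illustration on hypothetical blob data, not a replacement for Scikit-learn’s optimized implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
# two well-separated blobs of 25 sample points each (illustrative data)
a = np.vstack([rng.normal(0, 0.5, (25, 2)), rng.normal(4, 0.5, (25, 2))])

k = 2
# step 1: select K sample points as the initial centroids
# (one point from each blob here, so the demo converges reliably)
centroids = np.array([a[0], a[-1]])

for _ in range(50):
    # step 2: assign every sample to its closest centroid
    dists = np.linalg.norm(a[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # step 3: move each centroid to the mean of its assigned samples
    new_centroids = np.array([a[labels == j].mean(axis=0) for j in range(k)])
    # step 4: stop once the centroid positions no longer change
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)  # one centroid near each blob center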

We will use the KMeans class from Scikit-learn’s cluster module and apply it to the sample dataset.

from sklearn.cluster import KMeans

km = KMeans(
    n_clusters=2, init='random',
    n_init=5, max_iter=50,
    tol=1e-04, random_state=0
)
b_km = km.fit_predict(a)

This code sets the number of clusters to 2 and runs the algorithm 5 times with different random centroid initializations, allowing a maximum of 50 iterations in each run; the best of the 5 runs is kept. The fit_predict call assigns each sample to a cluster. As the last step, we will visualize the identified clusters along with their centroids.

plot.scatter(
    a[b_km == 0, 0], a[b_km == 0, 1],
    s=50, c='pink',
    marker='s', edgecolor='black',
    label='cluster 1'
)

plot.scatter(
    a[b_km == 1, 0], a[b_km == 1, 1],
    s=50, c='violet',
    marker='o', edgecolor='black',
    label='cluster 2'
)

# plot the centroids
plot.scatter(
    km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
    s=250, marker='*',
    c='red', edgecolor='black',
    label='centroids'
)

plot.legend(scatterpoints=1)
plot.grid()
plot.show()

This will place a centroid at the center of each of the two blobs in the dataset.


3. Data Mining Techniques: Linear Regression

Regression is a type of supervised learning algorithm that predicts the value of a dependent variable (b) based on an independent variable (a). The algorithm fits the linear relationship between the input and output variables, which can be drawn as a straight line on a graph. Let’s see how we can use the Python library NumPy together with Scikit-learn to explain linear regression with an example.

import matplotlib.pyplot as plot
from numpy import linspace
from numpy.random import rand
from sklearn.linear_model import LinearRegression

a = rand(10, 1)              # explanatory (independent) variable
b = a * a + rand(10, 1) / 5  # dependent variable

# fit a straight line to the data
linreg = LinearRegression()
linreg.fit(a, b)

# plot the samples and the fitted regression line
c = linspace(0, 1, 10)
plot.plot(a, b, 'o', c, linreg.predict(c.reshape(-1, 1)), '--r')
plot.show()

When you run the code, you will see a scatter plot of the samples with the fitted regression line drawn through them.
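Beyond the plot, the fitted model exposes the slope and intercept of that line, which is a quick way to inspect the learned relationship. A short sketch, regenerating similar data with a seeded generator so it is self-contained:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
a = rng.random((10, 1))              # explanatory variable
b = a * a + rng.random((10, 1)) / 5  # dependent variable

linreg = LinearRegression().fit(a, b)
print(linreg.coef_, linreg.intercept_)  # slope and intercept of the fitted line
print(linreg.score(a, b))               # R^2: how well the line explains the data
```

Because the underlying relationship here is quadratic, the R-squared score also hints at how much of the pattern a straight line can and cannot capture.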

Springboard’s courses on Data Science provide excellent learning opportunities in NLP, Deep Learning and ML, with a 1:1 mentoring-led, project-driven approach and a job guarantee. Apply now!