Andrew Ng, a former data scientist at Google and Baidu, currently an adjunct professor at Stanford University and a pioneer of e-learning in data science, defines machine learning very simply: the science of getting computers to act without being explicitly programmed. By extension, machine learning algorithms need to be able to make decisions the way humans do, considering multiple factors and the relationships between them. In this blog post, Springboard mentor Arihant Jain explains two such methods: the decision tree algorithm and the random forest algorithm. He also demonstrates how to use them. Arihant is a data scientist at ZestMoney; before that, he worked in similar capacities at Vodafone, RBL Bank, Genpact and others. He has experience building machine learning and deep learning models in retail, credit risk, marketing, customer service and banking.
Understanding Machine Learning Algorithms
Before we get into the decision tree and random forest algorithms, let’s first look at the foundations of a machine learning model. As the diagram below shows, a machine learning flow involves engineering features from historical data and splitting the data into training, validation and testing datasets, which are used to build and evaluate models. These models are then used to make predictions.
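To make the splitting step concrete, here is a minimal sketch using only the Python standard library. The function name `split_dataset`, the 60/20/20 proportions and the stand-in data are assumptions chosen for illustration, not part of Arihant's notebook.

```python
import random

def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle rows and split them into training, validation and test lists."""
    if not 0 < train_frac + val_frac < 1:
        raise ValueError("train_frac + val_frac must be between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever is left over
    return train, val, test

rows = list(range(100))  # stand-in for 100 feature-engineered records
train, val, test = split_dataset(rows)
print(len(train), len(val), len(test))
```

The model is fitted on the training set, tuned against the validation set, and scored once on the test set, which it has never seen.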
Kinds of Machine Learning Algorithms
There are various kinds of machine learning algorithms that you can use depending on your needs. Each of them has its own pros and cons. The two main criteria by which we measure machine learning algorithms are accuracy and interpretability, i.e. how accurate the predictions are and how easy the model is to understand and explain.
In the above graph, you’ll see that algorithms differ on these two measures. As a general rule, the more accurate an algorithm becomes, the more difficult it is to interpret. A data scientist’s job is to identify the right balance for their needs. In this blog post, let’s look at the decision tree algorithm and the random forest algorithm.
What is a Decision Tree Algorithm?
A decision tree is a supervised machine learning algorithm (one that learns from data with a predefined target variable) that is most commonly used in classification problems, although it can also handle regression. It is a flowchart-like decision-support tool, which explores a series of smaller decisions and their consequences to arrive at a final decision.
Let’s look at it through an example. Imagine driving to work: you might have three different routes you can take. The route you choose might depend on various factors such as time of day, traffic, road conditions, etc. You will consider each of these variables and make your decision accordingly. In essence, you are following a mental decision tree.
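Written out as code, that mental decision tree is just a set of nested conditions. The sketch below is purely illustrative: the function name, the route names and the rules themselves are made up for this example.

```python
def choose_route(hour, heavy_traffic, road_works):
    """Hand-written decision rules for a hypothetical commute."""
    if heavy_traffic:                 # first question: is traffic heavy?
        if road_works:                # second question: are there road works?
            return "side streets"
        return "ring road"
    if 7 <= hour <= 9:                # light traffic, but it is rush hour
        return "ring road"
    return "main road"                # light traffic outside rush hour

print(choose_route(hour=8, heavy_traffic=True, road_works=True))
print(choose_route(hour=14, heavy_traffic=False, road_works=False))
```

The difference with a machine learning decision tree is that nobody writes these rules by hand: the algorithm learns which questions to ask, and in which order, from the data.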
A decision tree algorithm does something similar. In the diagram above, you’ll notice that the root node is split into multiple decision nodes, which are then split further until we arrive at the terminal nodes. At each step, a decision tree identifies the most significant variable, and the value of that variable, that splits the data into the most homogeneous sets. To do this, you can use various measures: entropy, information gain, Gini index, etc. In this case, we are using the Gini index and picking, as the root node, the variable whose split produces the purest groups (the lowest Gini impurity or, equivalently, the highest Gini score when purity is measured as the sum of squared class proportions).
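To see how a split is scored, here is a small worked sketch. It uses the common definition of Gini impurity, one minus the sum of squared class proportions, for which lower means purer groups and the best split has the lowest weighted impurity. (Some tutorials score splits with the sum of squared proportions itself, in which case higher is better; the chosen variable is the same.) The toy records and column names are invented for illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """1 - sum(p_k^2) over the classes in labels; 0.0 means perfectly pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_gini(groups):
    """Average impurity of the groups, weighted by group size."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini_impurity(g) for g in groups)

def split_score(records, column):
    """Split records on a categorical column and score the resulting groups."""
    groups = {}
    for row in records:
        groups.setdefault(row[column], []).append(row[-1])  # last field is the target
    return weighted_gini(list(groups.values()))

# Hypothetical toy data: (overtime, gender, left_company)
records = [
    ("yes", "F", 1), ("yes", "M", 1), ("yes", "F", 1), ("yes", "M", 0),
    ("no", "F", 0), ("no", "M", 0), ("no", "F", 0), ("no", "M", 1),
]

scores = {"overtime": split_score(records, 0), "gender": split_score(records, 1)}
best = min(scores, key=scores.get)  # lowest weighted impurity wins
print(scores, best)
```

Here splitting on overtime gives a weighted impurity of 0.375, while splitting on gender gives 0.5 (no better than not splitting at all), so overtime would become the root node.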
What is a Random Forest Algorithm?
The random forest algorithm simply stretches the decision tree metaphor. It combines multiple decision trees into an ensemble algorithm. Just as you would take inputs from multiple sources (internet research, parents, friends, mentors, etc.) before making a decision, the random forest algorithm takes multiple, largely uncorrelated decision trees into account when making predictions.
Machine Learning Algorithms: How does a Random Forest Algorithm Work?
A random forest algorithm follows three basic steps:
- Pick random subsets of data from the dataset
- Build a separate decision tree on each subset
- Use majority voting across the trees to make a final prediction
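The three steps above can be sketched in a few lines of plain Python. To keep the example short, each "tree" here is a one-split stump on a single numeric feature rather than a full decision tree, and the data is hypothetical. A real random forest grows deeper trees and typically also considers a random subset of features at each split.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Step 1: draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

def train_stump(rows):
    """Step 2 (simplified): learn a single threshold on one numeric feature.

    Tries each observed value as a threshold and keeps the one with the
    fewest training errors, predicting 1 above the threshold and 0 otherwise.
    """
    best_threshold, best_errors = None, None
    for threshold, _ in rows:
        errors = sum((x > threshold) != bool(y) for x, y in rows)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def predict_forest(thresholds, x):
    """Step 3: every stump votes, and the most common class wins."""
    votes = [int(x > t) for t in thresholds]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical data: (years_at_company, stayed) pairs
rows = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]

rng = random.Random(0)
forest = [train_stump(bootstrap_sample(rows, rng)) for _ in range(25)]
print(predict_forest(forest, 2), predict_forest(forest, 8))
```

Using an odd number of stumps avoids tied votes in a two-class problem. Because each stump sees a different random sample, their individual mistakes tend to differ, and the vote smooths them out.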
Machine Learning Algorithms: How to Build a Decision Tree Algorithm & a Random Forest Algorithm?
The fundamental process is very simple. Let’s take an open-source dataset on employee attrition and see how it works. You can access the dataset by signing up here.
Step 1: Identify your data and perform EDA
Exploratory data analysis (EDA) is the process of maximising insight into your data using statistical methods. This dataset has multiple variables, both categorical and continuous. It includes variables such as gender, job role, salary, number of years worked, etc. Each of them may be related to attrition, to a greater or lesser degree.
Step 2: Understand patterns
While you are performing EDA, you will begin to observe patterns in your data: the kinds of variables you have, the correlations between them, the strength of those correlations, etc. Plotting a correlation matrix will give you a clear view of the nature of your data. Analysts typically use Python and its libraries for this.
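As a rough sketch of what this looks like in pandas, the snippet below builds a tiny made-up table and computes its correlation matrix. The column names and values are assumptions standing in for the real attrition dataset.

```python
import pandas as pd

# Hypothetical stand-in for the attrition dataset; column names are assumptions.
df = pd.DataFrame({
    "years_at_company": [1, 2, 3, 5, 8, 10, 12, 15],
    "monthly_income":   [2000, 2200, 2500, 3100, 4000, 4800, 5200, 6000],
    "attrition":        [1, 1, 1, 0, 1, 0, 0, 0],   # 1 = left the company
})

print(df.describe())   # summary statistics for each numeric column
corr = df.corr()       # pairwise Pearson correlations
print(corr.round(2))
```

To plot the matrix rather than print it, you could pass `corr` to Seaborn's `heatmap` function, for example `seaborn.heatmap(corr, annot=True)`.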
Step 3: Build a decision tree and random forest
Here, we are using scikit-learn for modelling, pandas and NumPy for data manipulation, and Matplotlib and Seaborn for plotting. Let’s get into the process of building a decision tree.
- Convert categorical data into numerical values.
- Split the data into training and validation datasets. We are using a function from scikit-learn to do this.
- Import a decision tree classifier from scikit-learn and fit it on the training data.
- Test for accuracy.
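The four steps above might look like the following sketch. The toy data, column names and hyperparameters (`max_depth=3`, a 75/25 split) are assumptions for illustration and are not taken from Arihant's notebook.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 40 employees with made-up columns and a made-up rule.
n = 40
df = pd.DataFrame({
    "overtime": ["yes", "no"] * (n // 2),
    "job_role": ["sales", "engineering", "support", "sales"] * (n // 4),
    "years_at_company": list(range(1, n + 1)),
})
df["attrition"] = [
    1 if (ot == "yes" and yrs < 20) else 0
    for ot, yrs in zip(df["overtime"], df["years_at_company"])
]

# 1. Convert categorical data into numerical values (one-hot encoding).
X = pd.get_dummies(df.drop(columns="attrition"), dtype=int)
y = df["attrition"]

# 2. Split the data into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# 3. Import a decision tree classifier and fit it on the training data.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# 4. Test for accuracy on data the tree has not seen.
val_accuracy = accuracy_score(y_val, tree.predict(X_val))
print(f"Validation accuracy: {val_accuracy:.2f}")
```

Limiting `max_depth` keeps the tree small enough to read and reduces the risk of it memorising the training data.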
Once you’re happy with the model, let’s move to the random forest algorithm. Remember that a random forest does not train every tree on the entire dataset; each tree is built from a random sample of it.
- Import a random forest classifier from scikit-learn’s ensemble module.
- Specify how many trees you want to build (the n_estimators parameter).
- Fit it on the training data.
- Test for accuracy.
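A minimal sketch of these steps follows. To stay self-contained it uses synthetic data from scikit-learn's `make_classification` instead of the attrition dataset, and the choice of 100 trees is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; swap in your own encoded features).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# n_estimators is the number of trees in the forest.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

val_accuracy = accuracy_score(y_val, forest.predict(X_val))
print(f"Validation accuracy: {val_accuracy:.2f}")

# Which features the forest relied on most (importances sum to 1).
print(forest.feature_importances_.round(3))
```

Comparing this validation accuracy with the single decision tree's, on the same split of the same data, is a quick way to check whether the ensemble is actually helping.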
To see Arihant walk you through these algorithms step by step, watch his YouTube session here. Get access to the notebook and resources by signing up here. For a structured online learning program in AI/machine learning, check out Springboard. It offers 1:1 mentorship, career coaching and a job guarantee.