In data science, exploratory data analysis (EDA) is an important function. It is the process of analysing datasets to identify their main characteristics so that further insights can be gleaned from them. It helps data scientists understand the data, not just statistically but also philosophically. In this blog post, Flipkart’s ML Decision Scientist, Abhishek Periwal, demonstrates exploratory data analysis using Python. He uses the house prices dataset from Kaggle, which will be used for making predictions about future sale prices. He uses libraries such as Pandas and NumPy for data wrangling, Matplotlib and Seaborn for plotting, SciPy for basic statistical analysis and Scikit-Learn for data processing.
How to Perform Exploratory Data Analysis Using Python
Before we get hands-on with Python, let us first understand what EDA is. Exploratory data analysis is the process of getting to know the data. To do that, we need to:
- Maximise insight into data
- Uncover underlying structure
- Extract important variables
- Detect anomalies/outliers
- Test underlying assumptions
We will do all this and more step by step. To learn how to do EDA with the example of the house prices dataset, watch Abhishek’s video –
Step 1 – Exploratory Data Analysis Using Python: Understanding the problem
Before we get into the statistical analysis of the data, we need to understand the meaning and importance of each variable in the dataset. First, load the data and understand its dimensions. For instance, in this dataset, the sale price is the target variable. All others are features. Of the 79 variables, 36 are numerical and 43 are categorical. As you work, go into each variable and understand it clearly. Link feature variables to the target variable philosophically, not just quantitatively.
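As a minimal sketch of this first look, the snippet below checks the dimensions of a dataset and separates numerical from categorical features. The small frame is a made-up stand-in for the Kaggle file, and the column names and the train.csv file name are assumptions, not taken from Abhishek's walkthrough.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Kaggle file. With the real data you would
# load it with something like: df = pd.read_csv("train.csv")
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "GrLivArea": rng.integers(500, 4000, size=200),               # numerical
    "OverallQual": rng.integers(1, 11, size=200),                 # numerical
    "RoofStyle": rng.choice(["Gable", "Hip", "Flat"], size=200),  # categorical
    "GarageType": rng.choice(["Attchd", "Detchd"], size=200),     # categorical
})
# Made-up target so the example is self-contained
df["SalePrice"] = 50_000 + 100 * df["GrLivArea"] + 10_000 * df["OverallQual"]

target = "SalePrice"
features = df.drop(columns=[target])

# Split the feature columns by dtype
numerical = features.select_dtypes(include="number").columns.tolist()
categorical = features.select_dtypes(exclude="number").columns.tolist()

print("rows, columns:", df.shape)
print(len(numerical), "numerical;", len(categorical), "categorical")
```

On the real dataset, the two lists give you a checklist of variables to read up on one by one.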
Let’s take variables such as basement, roof style, garage, etc. Before you analyse the numbers, think about how each of these factors might impact the sale price of the house in the real world. This will give more context to the numbers you are crunching.
Step 2 – Exploratory Data Analysis Using Python: Univariate analysis
Now, let us look at the numbers statistically –
- How many zeros?
- How many negative values?
- What is to be considered as an outlier?
- Which ones need to be converted to categorical variables?
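These questions can be answered for every numerical column in one pass. The sketch below is only an illustration on a tiny made-up frame; the column names are hypothetical, and a low number of unique values is used here as a rough hint that a numeric column may really be categorical.

```python
import pandas as pd

# Tiny made-up frame; column names are illustrative only
df = pd.DataFrame({
    "PoolArea": [0, 0, 0, 512, 0, 0],
    "Adjustment": [-1, 2, 0, 3, -4, 1],
    "MSSubClass": [20, 60, 20, 20, 60, 120],  # numeric codes, few distinct values
})

numeric = df.select_dtypes(include="number")

summary = pd.DataFrame({
    "zeros": (numeric == 0).sum(),       # how many zeros per column
    "negatives": (numeric < 0).sum(),    # how many negative values per column
    "unique": numeric.nunique(),         # few unique values may mean "categorical"
})
print(summary)
```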
Start with the target variable. Identify its min value, max value, mean, median, standard deviation, etc. You can also do it in a table but plotting helps immensely in showing you the skewness of the dataset.
If the data is skewed, you need to treat that. In this case, the sale price is not normally distributed. So, before using it to train a machine learning model, you need to transform it so that its distribution is closer to normal. Perform these tasks for all variables.
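Here is a rough sketch of how the summary statistics and the skewness check might look. The prices are simulated from a log-normal distribution purely for illustration, and the log transform used to reduce the skew is one common choice, assumed here rather than prescribed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Simulated, right-skewed prices (not the real Kaggle values)
sale_price = pd.Series(
    rng.lognormal(mean=12.0, sigma=0.4, size=1000), name="SalePrice"
)

# Min, max, mean, median (50%), standard deviation and quartiles in one table
print(sale_price.describe())

# Skewness near 0 suggests a symmetric distribution; positive means a long right tail
skew_before = sale_price.skew()

# log1p computes log(1 + x), which also copes with zeros
log_price = np.log1p(sale_price)
skew_after = log_price.skew()

print("skew before:", round(skew_before, 2), "after:", round(skew_after, 2))
```

To see the same thing visually, a histogram of the series before and after the transform (for example with sale_price.hist()) makes the long tail obvious.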
Step 3 – Exploratory Data Analysis Using Python: Bivariate analysis
Once you have understood each individual variable, it is time to look into the correlation between each variable and the target variable. For example, what is the relationship between basement and sale price, roof style and sale price, garage and sale price, etc?
Scatter plots are a great way to visualise this. You will be able to instantly notice whether the value of the target variable increases or decreases as each variable goes up or down. Run this on all variables. Through this analysis, you can identify the variables that are important, based on how strongly they relate to the target variable.
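Alongside the scatter plots, a quick numerical ranking can help you decide which variables to look at first. The sketch below uses simulated data with hypothetical column names and ranks features by their Pearson correlation with the target; it assumes all columns are numeric.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300

# Simulated features (hypothetical names), not the real dataset
area = rng.normal(1500, 400, n)
age = rng.normal(30, 10, n)
unrelated = rng.normal(0, 1, n)

# Made-up target: grows with area, falls with age
price = 100 * area - 1500 * age + rng.normal(0, 10_000, n)

df = pd.DataFrame({
    "GrLivArea": area,
    "HouseAge": age,
    "Unrelated": unrelated,
    "SalePrice": price,
})

# Correlation of every feature with the target, strongest positive first
corr_with_target = (
    df.corr()["SalePrice"].drop("SalePrice").sort_values(ascending=False)
)
print(corr_with_target)
```

For the visual check, df.plot.scatter(x="GrLivArea", y="SalePrice") draws one feature against the target.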
You can also build a correlation matrix heatmap for this purpose, like the one above. In this case, as there are too many variables, it is prudent to create a zoomed heatmap with variables having the top 10 correlation coefficients.
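One way to build the zoomed matrix is to keep only the columns most strongly correlated with the target. In this sketch, top_k_corr is a hypothetical helper, the data is simulated, and ranking by absolute correlation is an assumption (you could rank by the raw coefficient instead).

```python
import numpy as np
import pandas as pd

def top_k_corr(df, target, k=10):
    """Return the correlation sub-matrix of the k columns (target included)
    with the largest absolute correlation to the target."""
    corr = df.select_dtypes(include="number").corr()
    cols = corr[target].abs().nlargest(k).index
    return corr.loc[cols, cols]

rng = np.random.default_rng(7)
n = 500

# Simulated features whose influence on the target halves each time
X = pd.DataFrame(
    rng.normal(size=(n, 12)), columns=[f"feat_{i}" for i in range(12)]
)
weights = 2.0 ** -np.arange(12)
X["SalePrice"] = X.to_numpy() @ weights + rng.normal(scale=0.5, size=n)

zoomed = top_k_corr(X, "SalePrice", k=5)
print(zoomed.round(2))
```

Passing the result to seaborn's heatmap function, for example sns.heatmap(zoomed, annot=True), would draw the zoomed heatmap.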
Step 4 – Basic cleaning
This is the step where we begin to prepare the data for machine learning purposes. Two important aspects that we need to consider here are missing values and outliers.
Missing values are basically records that don’t have a value associated with them. There are two common ways to treat them: remove the records with missing values, or make smart estimates for them. By understanding the data and the business context, you can make a decision about this.
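Both options can be sketched in a few lines. The frame below is made up for illustration, and the imputation choices (median for a numerical column, a "None" label for a categorical one) are assumptions that you would need to check against the business context.

```python
import numpy as np
import pandas as pd

# Made-up records with gaps; values are illustrative only
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0, np.nan],
    "GarageType": ["Attchd", None, "Detchd", "Attchd", "Attchd"],
    "SalePrice": [200_000, 180_000, 220_000, 140_000, 250_000],
})

# How much is missing, per column, as a count and as a percentage
missing = df.isna().sum()
missing_pct = df.isna().mean().mul(100).round(1)
print(pd.DataFrame({"missing": missing, "percent": missing_pct}))

# Option 1: remove every record that has any missing value
dropped = df.dropna()

# Option 2: estimate the missing values instead
imputed = df.copy()
imputed["LotFrontage"] = imputed["LotFrontage"].fillna(
    imputed["LotFrontage"].median()
)
# Assumes a missing garage type means the house has no garage
imputed["GarageType"] = imputed["GarageType"].fillna("None")
```

Dropping keeps only complete records, at the cost of losing rows; imputing keeps every row but bakes your estimate into the data.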
An outlier is a value that is abnormally distant from the other values in a dataset. Outliers can have a significant adverse effect on the analysis or predictions. To find them, standardise the data, look at the values at the low and high ends of the range, and identify those that lie far from the mean. But before removing the outliers, check whether they are meaningful. Sometimes an extreme value is still consistent with the relationship between a feature and the target variable. In that case, removing it can be harmful to your predictions.
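A minimal sketch of the standardise-and-inspect approach follows. The prices are simulated with one planted extreme value, and the cut-off of 3 standard deviations is an assumed rule of thumb, not a threshold from the original analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Simulated prices with one planted extreme value (illustrative only)
prices = pd.Series(rng.normal(180_000, 40_000, 500), name="SalePrice")
prices.iloc[0] = 755_000

# Standardise: zero mean, unit variance. ddof=0 uses the population
# standard deviation, which is what scikit-learn's StandardScaler uses too.
z = (prices - prices.mean()) / prices.std(ddof=0)

# Inspect both ends of the standardised range
print("low end:\n", z.sort_values().head(5))
print("high end:\n", z.sort_values().tail(5))

threshold = 3.0  # assumed rule of thumb
is_outlier = z.abs() > threshold
print("flagged:", int(is_outlier.sum()))

# Flag first, decide later: only drop after checking the rows are not meaningful
cleaned = prices[~is_outlier]
```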
Step 5 – Testing assumptions
In data science, according to Hair et al., there are four basic assumptions that we must test.
Normality: Normal distribution is when the data is symmetric about the mean, forming a bell curve. In this dataset, you’ll see that the sale price is not normally distributed. You can apply a log transformation to treat this problem.
Homoscedasticity: Also called homogeneity of variance, this is the assumption that the dependent variable exhibits equal variance across the range of the predictor variables. You can run several tests for this assumption.
Linearity: If a relationship can be represented graphically as a straight line, it is said to be linear. The best way to test linearity is to build scatter plots and search for linear patterns. If the patterns are not linear, you’ll need to transform the data.
Absence of correlated errors: When one error is correlated with another (say, a positive error is systematically followed by a negative error), you have correlated errors. If you find correlated errors, the most common solution is to add a variable that can explain the effect you’re seeing.
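As one possible check, the sketch below runs SciPy's Shapiro-Wilk test on simulated right-skewed prices before and after taking logs. The data is made up, and Shapiro-Wilk is only one of several normality tests you could use here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated, right-skewed prices (not the real dataset)
sale_price = rng.lognormal(mean=12.0, sigma=0.5, size=500)

# Shapiro-Wilk: W close to 1 and a large p-value are consistent with normality
w_raw, p_raw = stats.shapiro(sale_price)
w_log, p_log = stats.shapiro(np.log(sale_price))

print(f"raw: W={w_raw:.3f}, p={p_raw:.3g}")
print(f"log: W={w_log:.3f}, p={p_log:.3g}")
```

A normal probability plot (scipy.stats.probplot) gives the visual version of the same comparison.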
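One such test is Levene's test, sketched here under some assumptions: the data is simulated so that the spread of the target grows with the predictor, levene_on_residuals is a hypothetical helper, and splitting the residuals at the predictor's median is just a simple way to form two groups to compare.

```python
import numpy as np
from scipy import stats

def levene_on_residuals(x, y):
    """Fit a straight line, then compare residual spread for low vs high x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    low = residuals[x <= np.median(x)]
    high = residuals[x > np.median(x)]
    return stats.levene(low, high)

rng = np.random.default_rng(5)

# Simulated data with multiplicative noise: spread of y grows with x
x = rng.uniform(1.0, 10.0, size=1000)
y = 10.0 * x * np.exp(rng.normal(0.0, 0.3, size=1000))

stat_raw, p_raw = levene_on_residuals(x, y)
# Taking logs of both variables turns multiplicative noise into additive noise
stat_log, p_log = levene_on_residuals(np.log(x), np.log(y))

print(f"raw: statistic={stat_raw:.1f}, p={p_raw:.3g}")
print(f"log: statistic={stat_log:.1f}, p={p_log:.3g}")
```

A small p-value suggests the two groups have different variances, which is evidence against homoscedasticity.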
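To complement the scatter plots, you can compare how well a straight line describes the relationship before and after a transformation. The sketch below uses simulated data with an exponential relationship; the log transform of the target is an assumed example of a transformation, not the only option.

```python
import numpy as np

rng = np.random.default_rng(9)

# Simulated curved relationship: y grows exponentially with x
x = rng.uniform(1.0, 10.0, size=400)
y = np.exp(0.5 * x + rng.normal(0.0, 0.1, size=400))

# Pearson r measures the strength of a linear relationship
r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(x, np.log(y))[0, 1]

print(f"r with raw y: {r_raw:.3f}")
print(f"r with log y: {r_log:.3f}")
```

A clear jump in r after the transform suggests the transformed relationship is closer to a straight line, which you can then confirm on a scatter plot.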
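If the observations have a meaningful order (for example, by sale date), one rough check for correlated errors is the Durbin-Watson statistic computed on the model residuals. The sketch below implements the standard formula with NumPy on simulated residuals; the ordering assumption and the simulated series are illustrative only.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: about 2 means no first-order correlation,
    values toward 0 mean positive correlation, toward 4 negative."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(11)

# Residuals with no relationship between consecutive errors
independent = rng.normal(size=1000)

# Residuals where each error carries over 80% of the previous one
shocks = rng.normal(size=1000)
correlated = np.zeros(1000)
for t in range(1, 1000):
    correlated[t] = 0.8 * correlated[t - 1] + shocks[t]

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)

print(f"independent errors: {dw_independent:.2f}")
print(f"correlated errors:  {dw_correlated:.2f}")
```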
You can watch this data science tutorial on exploratory data analysis using Python on YouTube now. If you’d like to get the stepwise analysis that Abhishek is using, fill in this form and we will share the files with you within 48 hours. If you’re looking to transition to a data science career, begin your learning at Springboard. The data science career track offered by Springboard is an online program that offers a best-in-class curriculum, 1:1 mentorship and career coaching, along with a job guarantee!