Data cleaning may seem like a boring, mundane job, but it is a fundamental building block of any data analytics problem. What is data cleaning, you ask? Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, or duplicated. Bad data quality can be disastrous for your processes and analysis: poor data quality leads to poorer results, which is why it is important to understand what data cleaning involves. On the other hand, high-quality data can produce excellent results even with a simple algorithm. As an analyst, it is important to recognise that not all data is useful and to determine how to treat the rest. So before bad data impacts your output quality, let’s look at techniques to clean it. In this blog post, we will cover what data cleaning is, along with its techniques and steps.

What is Data Cleaning?

Data cleaning is the process of preparing data for analysis so that it produces accurate results. There are several methods for cleaning data, depending on how the data is stored and the analysis being performed. It is not simply about erasing information to make space for new data, but rather about maximising a data set’s accuracy without necessarily eliminating information. It can involve small, simple steps such as fixing spelling and syntax errors, standardising data sets, correcting mistakes such as empty fields and missing codes, and identifying duplicate data points.

The outcome of data cleaning is to create data sets that are standardised and uniform. This allows business intelligence and data analytics tools to easily access and find the right data for each query. The process helps improve efficiency, reduce the margin of error, increase accuracy, and create consistency in the data. It is also important to be thorough with this technique when you are giving data analytics interviews for your career transition.

Data Cleaning Techniques

While there is an outline methodology for cleaning data, the steps can vary depending on the kind of data you are dealing with and the kind of outcome you would like to achieve. Even smaller nuances, such as whether values are numeric or strings, have an impact on the process you should follow. For instance, if you have too many discrepancies, a single data cleaning technique may not be enough; you might need to combine several for a better result. The more data types you have to handle, the more data cleaning techniques you’ll have to use. It’s good to be familiar with all of these methods and to choose the ones that are necessary, so let’s get rid of that useless data!

  • Eliminate Irrelevant Values

There is always irrelevant data in every dataset, and the first step should be to remove it. Irrelevant data is data that does not fit the context of the problem. For example, if you have to find the average age of students in a class, you don’t need the names and genders of the students. Eliminating unnecessary columns reduces the system’s processing time and gives you faster output. Make sure you double-check before you eliminate any values, or you may have to reload the data and start again.
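The student-age example above can be sketched in pandas. This is a minimal illustration; the frame and column names are hypothetical:

```python
import pandas as pd

# Hypothetical class roster; only 'age' matters for the average-age question
students = pd.DataFrame({
    "name": ["Asha", "Ben", "Chloe"],
    "gender": ["F", "M", "F"],
    "age": [21, 23, 22],
})

# Drop the columns that do not fit the context of the problem
ages = students.drop(columns=["name", "gender"])
print(ages["age"].mean())
```

Dropping `name` and `gender` up front means every later step operates on a smaller frame.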

  • Remove Duplicates

Duplicate values are similar to useless values: they have no place in the dataset. They only increase the volume of data, which in turn increases the system’s processing time. In a small dataset, a simple search can reveal these values, but in larger datasets they can be tough to find. You might wonder why duplicates would occur in a dataset at all. They often appear when you combine or join multiple datasets.
The pandas method duplicated() reports whether a given row is a duplicate. Each row is marked either False, to indicate that it is not a duplicate, or True, to indicate that it is.
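Here is a small sketch of duplicated() in action, on a made-up frame with one repeated row:

```python
import pandas as pd

# Illustrative frame: the second and third rows are identical
df = pd.DataFrame({"id": [1, 2, 2, 3], "score": [10, 20, 20, 30]})

# duplicated() returns a boolean Series: True marks a repeat of an earlier row
print(df.duplicated())

# drop_duplicates() keeps the first occurrence and removes the rest
deduped = df.drop_duplicates()
print(len(deduped))
```

Once duplicates are flagged, drop_duplicates() is the usual one-liner to remove them.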

  • Convert Data Types

Having the same data type across a column is a blessing, because handling each unit or value differently is a tedious task. Imagine having to filter on a variable that holds both string and boolean values. Difficult, right? As a base rule of data cleaning, we keep numeric values numeric: converting them to strings poses a bigger challenge than leaving them as they are. If you are not able to convert a specific data value, you should enter ‘NA’ or a warning marker to show that this particular value may be wrong.
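A common way to apply this rule in pandas is pd.to_numeric with errors="coerce", which converts what it can and turns unconvertible values into NaN rather than raising an error. A minimal sketch, with hypothetical data:

```python
import pandas as pd

# A column that should be numeric but contains a stray string
raw = pd.Series(["10", "20", "oops", "40"])

# errors="coerce" turns unconvertible values into NaN instead of raising
numbers = pd.to_numeric(raw, errors="coerce")
print(numbers)
```

The resulting NaN acts as exactly the kind of warning marker described above: it flags the bad value without blocking the rest of the conversion.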

Source: AnalyticsIndiaMagazine
  • Don’t Miss the Missing Values

There are always missing values. I’ll say it again: there are always missing values in a dataset. This is as inevitable as getting tanned on a beach. The only thing you can do is handle them well and keep your analysis free of errors. If a particular column in the dataset has too many missing values, it can be sensible to drop the entire column, because there isn’t enough data to work with. That in no way means you should ignore the missing values elsewhere. So how do you deal with missing values, then?

Data Cleaning Steps To Deal With Missing Values 

Ignoring missing values can be a significant mistake because they can contaminate your data, and you may not get accurate results. So how do we deal with them? 

We can detect the missing values using the Pandas library. 

# Looking at the column
print(df['column name'])
print(df['column name'].isnull())

If you look at the output, you will see that pandas fills the blank space with NaN. Using the isnull() method, we can confirm that both the empty field and the explicit “NA” were recognised as missing values: both boolean responses are True.

But what if the missing values are not standard in nature? What if pandas is not able to detect all the NAs? What if users have entered missing values manually as “n/a” or “--”? An easy way to handle these various formats is to put them in a list; then, when we import the data, pandas will recognise them right away. Here’s an example of how to do that.

# Making a list of missing value types
missing_values = ["n/a", "na", "--"]
df = pd.read_csv("data.csv", na_values=missing_values)

Now, all the different formats are recognised as missing values. Depending on the kind of output you are looking for, you can summarise the null values and replace them with an appropriate value or with zero (based on the data type in that column).
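The replacement step above can be sketched with fillna(). Here the frame is built directly for illustration (in practice it would come from read_csv as shown earlier), and the choice of filler, zero or the column mean, is up to you:

```python
import pandas as pd
import numpy as np

# Illustrative column with one missing value
df = pd.DataFrame({"score": [10.0, np.nan, 30.0]})

# Replace nulls with zero, or with a summary statistic such as the column mean
zero_filled = df["score"].fillna(0)
mean_filled = df["score"].fillna(df["score"].mean())
print(zero_filled.tolist(), mean_filled.tolist())
```

Filling with the mean preserves the column’s average, while filling with zero is safer when the value genuinely represents “nothing recorded”.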

Dealing with messy data is inevitable; data cleaning is simply part of the process in any data project. In this blog post on what data cleaning is, we’ve gone over a few simple ways to detect and replace missing values, along with other data cleaning techniques. You may have noticed that “what is data cleaning” is one of the most common questions in data analytics interviews. So if you are looking for hands-on implementation of these techniques on real-world data, joining Springboard’s data analytics or data science career track program may be beneficial. 1:1 mentor-led sessions, a detailed curriculum, and two capstone projects will give you the expertise to tackle any real-world data challenge. The career track programs also offer career coaching and come with a job guarantee.