It has been over a month since the 2019 elections were held. All of us were eagerly waiting to learn who would next occupy the Prime Minister’s chair. In the meantime, Pratik Singh, a learner in Springboard’s Data Science Career Track, decided to do a project analysing Indian election data.
Keep reading to learn about his experience as he describes it.
There is a famous quote by Aristotle which goes:
“For the things we have to learn before we can do them, we learn by doing them.”
In this blog post, I will describe my approach to analyzing Indian election data and put into practice what I learned about data wrangling in pandas. This is not a full-fledged election prediction exercise but a data wrangling exercise with minor elements of prediction. I have not incorporated tweets, news, speeches, etc. in my analysis. I used a few general facts and an exponential decay weighting, focusing mainly on data wrangling with pandas.
It was May 2019, and election talk was everywhere in India. Around the same time, I got interested in election data and revisited my pandas notebooks. After discussing it with my Springboard mentor, I decided to take up a data wrangling exercise using historical election data to estimate how many seats each party would win.
The Election Commission of India website was the obvious starting point for my data search. Data for every parliamentary election from 1998 onwards is neatly presented in xls format. The data even has a ‘POSITION’ column giving each candidate’s rank once the results were declared. Though I was mainly concerned with the winners (POSITION = 1), one can extend the analysis to take the runners-up into account as well.
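To make the winner filter concrete, here is a minimal sketch in pandas. Only the POSITION column comes from the description above; the other column names (CONSTITUENCY, CANDIDATE, PARTY) and the sample rows are hypothetical stand-ins, and the real spreadsheets may be laid out differently.

```python
import pandas as pd

# Hypothetical sample rows; the real data would be loaded from the
# Election Commission spreadsheets (for example with pd.read_excel).
results = pd.DataFrame(
    {
        "CONSTITUENCY": ["A", "A", "B", "B"],
        "CANDIDATE": ["Cand 1", "Cand 2", "Cand 3", "Cand 4"],
        "PARTY": ["P1", "P2", "P2", "P3"],
        "POSITION": [1, 2, 1, 2],
    }
)

# Keep only the winning candidate of each constituency.
winners = results[results["POSITION"] == 1]

# Count the seats won by each party in this election.
seats_by_party = winners["PARTY"].value_counts()
print(seats_by_party)
```

Changing the condition to `results["POSITION"] <= 2` would keep the runners-up as well, which is one way to extend the analysis.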
Approach and Pandas usage:
Candidate category is a driving factor in local elections, but in a national setting its effect is overshadowed by the party name and sometimes by the candidate’s public profile, so I omitted it from the analysis.
Exponential decay refers to how past wins are weighted: a party’s win in the most recent Lok Sabha election gets the highest weight, and the further back in time we go, the more the weight decreases, exponentially (hence the term exponential decay).
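The weighting idea can be sketched in a few lines of plain Python. This is an illustration under my own assumptions, not the exact code from the repo: the decay factor of 0.5, the per-constituency scoring, and the party names and years are all hypothetical. A win k elections back gets weight decay ** k, so the most recent win counts as 1.0.

```python
# Hypothetical winners of one constituency in past elections
# (year -> winning party); not real data.
past_winners = {2014: "P1", 2009: "P2", 2004: "P1", 1999: "P2"}

DECAY = 0.5  # assumed decay factor: each step back in time halves the weight


def decayed_scores(winners_by_year, decay=DECAY):
    """Score each party by its past wins, weighting recent wins more."""
    scores = {}
    # Most recent election first, so it gets exponent 0 (weight 1.0).
    for steps_back, year in enumerate(sorted(winners_by_year, reverse=True)):
        party = winners_by_year[year]
        scores[party] = scores.get(party, 0.0) + decay ** steps_back
    return scores


scores = decayed_scores(past_winners)
# The party with the highest decayed score is the predicted winner.
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

A smaller decay factor makes the prediction lean almost entirely on the latest election, while a factor close to 1 treats all past elections nearly equally.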
For all other facts and insights, you can visit the GitHub Repo.
I started by learning Scrapy (which I will cover in another post), but that turned out to be unnecessary, as the data was readily available. Still, having Scrapy as a skill helped me in my other projects.
Some other pandas concepts I used were concatenation, working with MultiIndex, string manipulation, handling duplicates, pivoting, merging, etc.
I initially planned to display the results on maps but could not, as I was too keen to publish the results on LinkedIn before the actual results were out, and map display requires some knowledge of GeoPandas, which I didn’t have back then. And yes, I was able to push the results at midnight on 23rd May 2019.
The final result can be viewed on the GitHub Repo; I have listed only the BJP and some of its allies here:
BJP: 244, ADMK: 28, BJD: 17, SHS: 15, JD(U): 5, SAD: 5
In an Applied Machine Learning lecture at Columbia University, Andreas Mueller (one of the core developers of scikit-learn) explains how a seemingly simple task such as missing-value imputation can leave you perplexed in a real-world scenario.
Real-world data can be very intimidating. And that’s the reason behind my expedition to DO what I LEARN with real-world data.
So my advice to fellow learners: whatever it is that you are learning, just be a Lannister and pay your debt to the data world by applying that learning and letting others learn from it. However small the project may be, just remember there is always someone who can benefit from your work.