Nearly all predictions say that machine learning and artificial intelligence will transform the future: change the way we work, do business, perhaps even live. If you’re an aspiring data scientist looking to supplement your learning and practice your skills, the best way to start is with your own personal projects built on data science datasets. Such project-based learning gives you not only the hands-on experience to ace your next interview, but also a portfolio to show off.

But machine learning and artificial intelligence are utterly dependent on one thing: data. Without large enough volumes of data, no model can be trained, let alone be accurate and usable. Where will an aspiring data scientist go for machine learning datasets and data analytics datasets of that kind? To the Internet, of course.

Several open data initiatives, from both government and private organizations around the world, have made data available for public use. Take the awesome public datasets list or the NLP datasets repo on GitHub, or even Reddit and Kaggle, for instance. But the sheer volume and variety of data can be overwhelming: “where do I start?” can be a crippling dilemma.

In this blog post, we’ll identify top data science datasets: some of the most fascinating datasets from India that you could use to conduct your own experiments, write your own machine learning algorithms, and perform data analytics and visualization.

Top Indian data science datasets

Data Science Datasets in IPL: A data-driven approach to India’s favorite pastime

Across its 12-year run, the Indian Premier League has been one of the most digitally savvy sports tournaments in the country. Not only is team-by-team, match-by-match, ball-by-ball information captured, it’s also publicly available for anyone to use.

Here are two complementary data science datasets you can bring together to glean insights.

Once you’re comfortable identifying the teams that won the most matches, or the grounds with the most boundaries, set yourself a tougher challenge: say, compare weather data with rained-out matches using frequentist inference, and predict which future matches will get rained out!
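Both warm-up ideas above can be sketched in a few lines of standard-library Python. The rows and column names below (`winner`, `rain_mm`, `abandoned`) are made up for illustration and will not match any real IPL file, so treat this as a starting point rather than a finished analysis. The frequentist step here is a Wilson score interval for the share of wet matches that were abandoned, which is one reasonable choice among several.

```python
import csv
import io
import math
from collections import Counter

# Invented sample rows; a real matches file will have different columns.
MATCHES_CSV = """match_id,venue,winner,rain_mm,abandoned
1,Mumbai,Mumbai Indians,0.0,no
2,Chennai,Chennai Super Kings,1.2,no
3,Bengaluru,,18.5,yes
4,Mumbai,Mumbai Indians,0.0,no
5,Kolkata,,22.0,yes
6,Chennai,Mumbai Indians,12.4,no
7,Bengaluru,Chennai Super Kings,0.3,no
8,Kolkata,,15.1,yes
"""


def load_matches(text):
    """Parse CSV text into a list of dictionaries, one per match."""
    return list(csv.DictReader(io.StringIO(text)))


def wins_by_team(rows):
    """Count wins per team, skipping matches with no recorded winner."""
    return Counter(r["winner"] for r in rows if r["winner"])


def rainout_counts(rows, min_rain_mm):
    """Return (abandoned, total) among matches with at least min_rain_mm of rain."""
    wet = [r for r in rows if float(r["rain_mm"]) >= min_rain_mm]
    abandoned = sum(r["abandoned"] == "yes" for r in wet)
    return abandoned, len(wet)


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials == 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half


rows = load_matches(MATCHES_CSV)
wins = wins_by_team(rows)
abandoned, wet = rainout_counts(rows, min_rain_mm=10.0)
low, high = wilson_interval(abandoned, wet)

print("Most wins:", wins.most_common(1))
print(f"{abandoned}/{wet} wet matches abandoned, interval {low:.2f} to {high:.2f}")
```

With only a handful of matches the interval will be very wide, which is the point: it shows how much (or how little) the data can support a prediction before you reach for anything more elaborate.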

Crime in India: Understanding human behavior through data

The National Crime Records Bureau (NCRB) was set up primarily with the mission of collecting information and making it available to law enforcement agencies countrywide to help solve individual crimes. But data analytics on these datasets can do much more than that: it can help us understand trends, identify anomalies, even inform law-making.

There is so much you can learn from the data, but here are some ideas to start with: Compare cases with arrests and convictions to understand the delivery of justice, using issue trees and hypothesis trees. Compare personnel strength, budgets and infrastructure with convictions.
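As a rough illustration of the comparisons suggested above, here is a minimal sketch that turns raw counts into ratios you can rank. The `CrimeRecord` fields, the region names and every number are hypothetical placeholders, not NCRB figures; the real tables are laid out differently and will need their own loading step.

```python
from dataclasses import dataclass


@dataclass
class CrimeRecord:
    # Hypothetical fields; real NCRB tables use different names and layouts.
    region: str
    cases: int
    arrests: int
    convictions: int
    police_strength: int


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def summarize(record):
    """Turn raw counts into ratios that can be compared across regions."""
    return {
        "region": record.region,
        "arrests_per_case": safe_ratio(record.arrests, record.cases),
        "convictions_per_arrest": safe_ratio(record.convictions, record.arrests),
        "convictions_per_1000_police": safe_ratio(
            record.convictions * 1000, record.police_strength
        ),
    }


# Made-up numbers, used only to show the calculation.
RECORDS = [
    CrimeRecord("Region A", cases=1200, arrests=900, convictions=270, police_strength=5000),
    CrimeRecord("Region B", cases=800, arrests=400, convictions=200, police_strength=2500),
]

summaries = [summarize(r) for r in RECORDS]

# Rank regions by convictions per arrest; a missing ratio sorts as 0.0.
ranked = sorted(
    summaries,
    key=lambda s: s["convictions_per_arrest"] or 0.0,
    reverse=True,
)

for row in ranked:
    print(row)
```

One caveat worth keeping in mind: cases registered in a year and convictions recorded in the same year usually involve different cases, so ratios like these describe throughput rather than the fate of any single case.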

Politics: Exploring governance with data

For the average citizen, the functioning of democracy might seem a bit distant: who watches Lok Sabha TV these days? But politics as mediated by TV or online journalism is also slowly becoming unreliable. Perhaps data science and machine learning can help quantify and clarify democracy.

Here are a couple of places you can start.

  • Mann Ki Baat: This is a dataset containing all of Indian Prime Minister Narendra Modi’s radio speeches from October 2014 to September 2017. 
  • Rajya Sabha Q&A: This is a dataset containing questions and answers exchanged in the Rajya Sabha from 2009 to September 2017.

After cleaning the text in the dataset, perform in-depth sentiment analysis using natural language processing (NLP) techniques for a good data science challenge. Also, look at how themes correlate with sentiment. Transcripts of public speeches can be a meaningful window into what bothered or inspired the country at the time.
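To get a feel for the cleaning and scoring steps before reaching for a full NLP library, here is a deliberately simple lexicon-based sketch. The word lists are a tiny hand-picked assumption, the sample sentences are invented (they are not quotes from either dataset), and the approach assumes English text. A real project would swap in a proper sentiment model.

```python
import re
from collections import Counter

# Tiny illustrative lexicons; a real project would use a much larger resource.
POSITIVE = {"progress", "proud", "hope", "clean", "success"}
NEGATIVE = {"drought", "corruption", "poverty", "crisis", "pollution"}


def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Return (positive hits - negative hits) / token count, or 0.0 if empty."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    positive_hits = sum(counts[word] for word in POSITIVE)
    negative_hits = sum(counts[word] for word in NEGATIVE)
    return (positive_hits - negative_hits) / len(tokens)


# Invented sentences for illustration only.
SAMPLES = {
    "upbeat": "We are proud of the progress and full of hope.",
    "worried": "The drought has deepened the crisis and poverty.",
}

scores = {name: sentiment_score(text) for name, text in SAMPLES.items()}

for name, score in scores.items():
    print(f"{name}: {score:+.3f}")
```

Grouping these scores by month, or by a theme keyword, is one simple way to start looking at how themes and sentiment move together.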

Environment: Predicting the future of the planet

Research organizations globally have long collected in-depth data about various environmental factors — from temperature to tiger population, from rainfall to deforestation. 

Here are some large datasets you can use to identify key factors that are changing the planet we’re living in.

Once you’ve identified trends across time and location, compare them with significant policy changes during the same period. Then, correlate that with other data, such as a dataset on the change in forest cover from 2005 to 2009. You might want to explore whether the change in forest cover affected air pollution in the surrounding area.
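The forest cover question above boils down to a correlation between two columns. Here is a minimal sketch of a Pearson correlation written out by hand. The two lists are made-up placeholder values for a handful of imaginary districts, not real measurements, and the variable names are assumptions for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with 2+ values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Placeholder values for five imaginary districts.
forest_cover_change_pct = [-4.0, -1.5, 0.0, 2.0, 3.5]
pollution_change_pct = [6.0, 3.0, 1.0, -2.0, -1.0]

r = pearson(forest_cover_change_pct, pollution_change_pct)
print(f"Pearson r = {r:.2f}")
```

A strong correlation would still not show that forest cover drives air quality; factors such as industrial activity or traffic may move both, so it is worth checking them before drawing conclusions.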

People: 1.3 billion data points

In a country like India, the biggest wealth of data is about people. Understanding the intricacies and complexities of the Indian people might now be possible with machine learning and artificial intelligence.

  • Begin with the census data: Covering the length and breadth of the country, and sliced and diced across various parameters, the Census of India website has more data about the Indian people than any other source.

For more granular data about cities and infrastructure, try Open City. If you’re looking for a bigger challenge, you might find it in a dataset for handwriting recognition. Or try the YouTube-8M dataset.

If you’re looking to practice what you’ve learned in data science, the datasets above are a great place to start. If you’re looking for a structured, project-based, 1:1 mentoring-led program in data science, machine learning or artificial intelligence, look no further: Try Springboard.