If there’s one thing common between you, the machine learning (ML) professional, and the machines you’re teaching, it’s that you both learn by practice! Just as ML models get more accurate with training, you become a better algorithm builder with practice. But where would an aspiring data scientist go for quality machine learning datasets? The internet is a treasure trove. Universities, enterprises, and research organizations alike open their data up for anyone to work with. Google’s new Dataset Search utility alone indexes over 25 million public datasets you can access for free. But more is not always better. Finding the right dataset for the right purpose in the right form can be overwhelming, and it can eat up your precious time. To save you from falling down the dataset rabbit hole, we’ve put together the top 10 datasets you can practice your skills on right away!
Top Machine Learning Datasets
We’ve handpicked a wide range of machine learning datasets for you: from easy to complex, from text to images and video, from financial transactions to Netflix usage statistics, for applications ranging from simple correlation to complex computer vision. Depending on your skill level and area of interest, you can pick any of these and work your way through it.
Starting with a complex dataset might scare you off data science and machine learning itself. And no one wants that! If you’re a beginner in ML, start with the following simple datasets.
- Height-weight dataset: This dataset is a collection of 25,000 height and weight records, synthesized from a growth survey of children from birth to 18 years of age in Hong Kong. Given that it’s a simple dataset of just two columns, you can practice building a linear regression model to predict weight for a given height or vice versa. If you’re a sports enthusiast, you might find this dataset of Major League Baseball players more interesting. You can build similar regression models on that dataset too.
- Car evaluation dataset: This multivariate dataset from the University of California, Irvine’s machine learning repository describes cars across six attributes, such as the cost of maintenance, luggage space, safety, and seating. You can practice finding correlations between these attributes to help choose the car that fits your needs, and even to judge whether it is priced right. Or use decision tree algorithms to evaluate whether a particular car is worth buying, based on its score.
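To make the height-weight exercise concrete, here is a minimal sketch of fitting a straight line by ordinary least squares in plain Python. The data below is a synthetic stand-in generated from an assumed trend, not the real 25,000 records, and the units (inches and pounds), the helper name `fit_line`, and the trend coefficients are assumptions for illustration only.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic stand-in data (NOT the real dataset): heights in inches and
# weights in pounds, drawn from an assumed linear trend plus random noise.
rng = random.Random(0)
heights = [rng.uniform(60, 75) for _ in range(200)]
weights = [3.0 * h - 80.0 + rng.gauss(0, 5) for h in heights]

slope, intercept = fit_line(heights, weights)

# Predict the weight for a height of 68 inches using the fitted line.
predicted = slope * 68 + intercept
print(f"weight ~ {slope:.2f} * height + {intercept:.2f}")
print(f"predicted weight at 68 in: {predicted:.1f} lb")
```

Once this works on toy numbers, you could swap in the two columns from the real file and compare the fitted slope with what a scatter plot suggests.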
Businesses all over the globe are leveraging their data to reach wider markets, increase profits, or reduce costs. Practicing on enterprise-grade data will help you apply your tech skills to real business problems.
- Credit card fraud dataset: Machine learning is a leading choice for anti-money laundering and fraud detection these days. This credit card fraud dataset has over 284,000 transactions by European cardholders, anonymized and scrubbed to remove any personal information. You can use Naïve Bayes (NB), Support Vector Machines (SVM), or K-Nearest Neighbors (KNN) to identify the fraudulent transactions. Don’t stop there: go a step further and evaluate which of them works best for the dataset.
- Netflix dataset: To improve its recommendation algorithm and provide better movie suggestions to its subscribers, Netflix created the Netflix Prize. This is the dataset from that competition. It contains data on customers, movies, and their ratings, and the winning entry beat Netflix’s in-house prediction algorithm by more than 10%! See how you fare: use the data to work on a range of tasks, from data loading and cleansing to building models. This blog post about the Netflix recommendation engine might help you.
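When you compare classifiers on fraud data, keep in mind that fraudulent transactions are rare, so plain accuracy can flatter a model that never flags anything. The sketch below illustrates that point with made-up labels and two hypothetical models (`always_legit` and `flagging` are invented for this example, not results from the real dataset), scoring each on precision, recall, and F1 for the fraud class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive (fraud = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels: 1,000 transactions, of which 5 are fraudulent.
y_true = [1] * 5 + [0] * 995

# Hypothetical model A: never flags anything as fraud.
always_legit = [0] * 1000

# Hypothetical model B: catches 4 of the 5 frauds, raises 6 false alarms.
flagging = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 989

for name, pred in [("always_legit", always_legit), ("flagging", flagging)]:
    acc = accuracy(y_true, pred)
    precision, recall, f1 = precision_recall_f1(y_true, pred)
    print(f"{name}: accuracy={acc:.3f} precision={precision:.2f} "
          f"recall={recall:.2f} f1={f1:.2f}")
```

In this toy setup, the model that never flags fraud has the higher accuracy yet catches nothing, which is why recall and precision are the more useful yardsticks when you rank NB, SVM, and KNN on this kind of data.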
Online communication has grown significantly more complex. While texting, people use many natural language devices, including sarcasm and rhetoric. In addition to text, they send emojis, GIFs, voice notes, selfies, and so much more. A data scientist of tomorrow will need to be able to derive insights from across these forms. Here are some datasets you can begin your learning with.
- Yelp dataset: A top review site in the US, Yelp calls this the “all-purpose database for learning”. It includes data on over 6 million reviews; information about the various businesses such as working hours, ambiance, and parking; and user-posted photos. This data is an ML goldmine. Practice logistic regression and linear discriminant analysis algorithms to identify patterns and analyze reviewer characteristics. Perhaps even predict whether a review is fake based on those patterns.
- WikiQA dataset: This question-and-answer dataset is collected and annotated for you to practice on. You can use it to train a model for chatbot applications. Build models that understand the intent and entities in a query, and respond to it.
- Fake news dataset: This dataset contains text from over 20,000 news articles, with the name of the author, the title of the article, and the full body text. It also already contains a label for potentially unreliable articles. You can run algorithms to identify which ones are fake and which ones are not. Practice with convolutional neural networks (CNN) or recurrent neural networks (RNN).
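Before reaching for neural networks, it can help to see the simplest text-classification pipeline end to end. The following is a rough sketch, assuming whitespace tokenization and a bag-of-words representation, of logistic regression trained with stochastic gradient descent in plain Python. The four reviews and their labels are invented for illustration, and the function names, learning rate, and epoch count are arbitrary choices rather than recommended settings.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_vocab(texts):
    """Map every distinct token in the corpus to a column index."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(text, vocab):
    """Bag-of-words counts; tokens outside the vocabulary are ignored."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = float(count)
    return vec


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=200):
    """Logistic regression fitted with per-example gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def predict_proba(text, vocab, w, b):
    """Probability that the text belongs to the positive class."""
    x = vectorize(text, vocab)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Invented toy reviews; 1 = positive, 0 = negative.
texts = [
    "great food friendly staff",
    "terrible service cold food",
    "friendly staff great coffee",
    "cold coffee terrible staff",
]
labels = [1, 0, 1, 0]

vocab = build_vocab(texts)
X = [vectorize(t, vocab) for t in texts]
w, b = train_logreg(X, labels)

for sample in ["great friendly place", "terrible cold place"]:
    print(sample, "->", round(predict_proba(sample, vocab, w, b), 3))
```

The same three steps (tokenize, vectorize, fit) carry over to the real review or news text; only the tokenizer, the feature representation, and the model get more sophisticated.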
Machine Learning Datasets: Computer vision datasets
As video becomes a preferred form of content, experiences grow visual and augmented reality becomes commonplace, computer vision will become a sought-after part of the machine learning future. Here are some datasets you can use to prepare for that.
- Cityscapes dataset: This dataset contains video sequences from 50 cities, all street scenes, with buildings, vehicles, humans, etc. It also includes high-quality pixel-level annotations of thousands of frames. You can use it to run and understand computer vision algorithms (R-CNN and Fast R-CNN for object detection, CNN for object classification) and test their efficacy. The Cityscapes dataset is also good for training deep neural networks.
- YouTube 8M dataset: This is a collection of data about millions of YouTube videos, including their audio. It has been available since 2016 and is updated regularly. It also includes human-verified segment annotations and labels for the videos. Practice classifying the videos into sports, art, entertainment, news, trending, etc. using CNN and SVM algorithms.
- Indian Movie Face database (IMFDB) dataset: This dataset has nearly 35,000 images of one hundred Indian actors, collected from videos. It has detailed annotations about age, gender, expression, etc., despite high variability in image size, shape, and quality. Practice CNN and decision tree algorithms to do facial detection, recognition, and matching.
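The CNNs mentioned in these exercises are built around one core operation: sliding a small filter over an image and summing the element-wise products. As a minimal sketch of that operation only (not a full network, and not tied to any of the datasets above), the code below applies a hand-picked edge filter to a toy 4x4 image. The image, the filter values, and the function name `conv2d` are assumptions chosen for illustration.

```python
def conv2d(image, kernel):
    """Valid (no padding, stride 1) 2D cross-correlation.

    This is the sliding-window operation CNN layers use; deep learning
    libraries usually call it "convolution".
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy grayscale image: dark (0) left half, bright (1) right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked filter that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)
```

The resulting feature map is non-zero only in the middle column, where the window straddles the edge. In a trained CNN the filter values are learned from data rather than chosen by hand.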
While we’d encourage you to ‘practice’ machine learning, our graduates tend to use the word ‘play’ when they talk about working on datasets. So, don’t let the big terms scare you. Pick a machine learning dataset now and start right away. You might even come to enjoy it! If ever you need a more guided approach to your machine learning future, do consider Springboard’s 1:1 mentoring-led, project-based online learning programs that come with a job guarantee.