The key to getting better at data science as an aspiring data scientist is practice, practice, and more practice. Once you have learned the basic skills, the typical advice from mentors at Springboard is to practice on a variety of data science projects, from image processing to speech recognition. Working on real-world projects is one of the most important ways to develop your skills and improve your employability as a data scientist. The first step is to find an interesting dataset to work with.
Mentors at Springboard are often asked by aspiring data scientists: “How do you find datasets for data science projects to practice on?”
Aspiring data scientists want to work on data science projects but struggle to find an interesting dataset. What matters most as a learner is to find a dataset that interests and motivates you. When choosing a dataset for your project, it’s up to you to decide the size and complexity of the data. If you are a beginner, we suggest you start with clean data. Data cleaning is an integral part of the data science workflow, but as a beginner you will want to spend more time on analysis than on cleaning.
There are thousands of publicly available datasets on topics ranging from biology to particle physics. To spend less time searching for the right dataset, you should know where to look. Here we’ve listed some of the best sources of publicly available datasets for your next project.
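One quick way to judge whether a dataset is clean enough for a first project is to count how many values are missing in each column. The following is a minimal sketch using only the Python standard library. The function name, the set of tokens treated as missing, and the sample data are all assumptions made for illustration, so adjust them to the file you actually download.

```python
import csv
import io

# Tokens treated as "missing". This set is an assumption; adjust it per dataset.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none", "?"}


def missing_value_report(csv_file):
    """Return {column: fraction of rows whose value looks missing}."""
    reader = csv.DictReader(csv_file)
    counts = {name: 0 for name in (reader.fieldnames or [])}
    total = 0
    for row in reader:
        total += 1
        for name in counts:
            value = row.get(name)
            # A short row yields None for the columns it lacks.
            if value is None or value.strip().lower() in MISSING_TOKENS:
                counts[name] += 1
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: counts[name] / total for name in counts}


# Tiny made-up sample standing in for a downloaded CSV file.
sample = io.StringIO(
    "age,city,income\n"
    "34,Pune,52000\n"
    ",Delhi,NA\n"
    "29,,61000\n"
    "41,Austin,48000\n"
)
report = missing_value_report(sample)
for column, fraction in report.items():
    print(f"{column}: {fraction:.0%} missing")
```

For a real file, pass an open file object instead of the in-memory sample, for example `open("my_dataset.csv", newline="")`, where the filename is a placeholder. Columns with a high missing fraction are a sign that the dataset will need cleaning before analysis.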
Top 10 Dataset Sources
Knoema – Billed as the most comprehensive and integrated dataset repository in the world, Knoema can be thought of as the “Atlas of World Data”. It is a free-to-use, open data platform for anyone interested in data analysis, machine learning, statistics, and visual storytelling. Knoema hosts more than 2.8 billion time series on 1000+ topics, from Agriculture to Transportation, drawn from 1200 different sources including Amazon, Google, Facebook, WHO, UNICEF, ILO, and more. Data scientists can work with this data online in the form of charts or tables. However, exporting a dataset from the repository requires a premium account, Knoema Professional. A premium account entitles a user to unlimited access to data and statistics, along with several easy-to-use tools for data analysis, data visualization, and presentation. If you are unsure how to search or browse for a particular dataset on Knoema, here’s a quick tutorial on how to browse datasets.
Kaggle – The words “data scientist” and “Kaggle” are inextricably linked, and everyone in the data science community is familiar with the platform. Kaggle is a fantastic resource for data scientists and machine learning engineers looking for datasets that have already had some pre-processing done. It is a great place to find datasets on everything under the sun, because the platform is popular for hosting data science and machine learning challenges based on real problems that organizations are trying to solve. You will find datasets of all sizes, up to 2 TB and more than 50 million records. Users can choose from over 18,000 datasets in the Kaggle Dataset repository. You can easily find the dataset you need using the search box, with filters such as dataset size, file type, and tags. You can also preview and bookmark the datasets you like.
Google Custom Dataset Search – The custom Google Dataset search engine was launched in September 2018. You can specify whether you are searching for data on plants, animals, diseases, UFO sightings, movies, calamities, and more. Ideally, you should be able to find the desired dataset within seconds. You can search using keywords, the name of the dataset, creator information, format (JSON, CSV, etc.), and description. The engine relies on the markup that publishers use to describe their datasets, so it can find datasets wherever they are hosted: an author’s personal page, a publisher’s website, or any digital library. The objective of the engine is to unify thousands of separate dataset repositories and make that data easily available.
UCI Machine Learning Repository – If you are looking for a repository that lets you find a dataset by the type of machine learning problem, the UCI Machine Learning Repository is the go-to place. Its datasets are classified by characteristics such as univariate, multivariate, time series, and sequential, and by associated tasks such as classification, regression, and clustering. Most datasets on UCI are fairly clean, though to varying degrees, because the researchers who prepared them have already done some pre-processing, such as selecting instances and attributes. You can browse the 475 datasets using filters such as the number of attributes, number of instances, data type, associated task, attribute type, and subject matter. These filters can be of great help in finding the right dataset. However, if you are interested in investigating large-scale problems and techniques, this repository might not be helpful, as it mostly houses small datasets.
Data.gov – Offering more than 248,783 datasets (at the time of publishing), the US Government’s data portal hosts all sorts of amazing datasets, from climate to crime. Data.gov is an aggregator of free, publicly available data from various US government agencies. Anyone can easily download the data, but much of it takes some research to understand, which can make it difficult to figure out which version of a dataset is the right one. Similarly, Data.gov.in is the home of the Indian government’s open data on sectors such as education, climate, finance, energy, economics, and more.
VisualData – Efficiently sourcing visual data, in particular images with high-quality annotations, has always been a challenge for many aspiring data professionals. VisualData is a fantastic search engine for over 334 image datasets contributed by businesses, researchers, and hobbyists. It lets you search with keywords to find images relevant to your requirements.
VisualData has it all, from highly specialized datasets of 3D reconstructions and faces to robots, fashion, animals, birds, and more.
Google Cloud Public Datasets – You can explore large datasets hosted on Google Cloud using a tool called BigQuery. To view all the datasets, however, you need to sign up for a Google Cloud account and create a project. Google stores the data and provides access to it, but you pay for the queries you run against it. The first 1 TB of queries is free every month, so if you use it carefully, you might not have to pay anything.
GitHub – Awesome Public Datasets – This community-maintained GitHub page lists datasets on over 30 diverse topics, from Agriculture to Transportation. You can jump straight to the domain you wish to explore and choose a dataset from it.
KDNuggets – This famous data science website has a collection of data from various international government agencies, research centers, exchanges, and data published by other data enthusiasts with a brief description of each.
Reddit – The Datasets subreddit is a discussion board for specific data requests, recommendations of quality data sources, and data collected and published by other like-minded people. You never know when you might come across a treasure trove of data on a particular subject.
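For the Google Cloud Public Datasets entry above, it helps to keep a running total of how much data your queries have scanned so you stay inside the free 1 TB per month. The following is a rough sketch of that arithmetic. It treats 1 TB as 10**12 bytes, which is an assumption on the conservative side, and the function names and query sizes are made up for illustration. Check Google’s current pricing page for the exact definition and limits.

```python
# Free monthly query allowance, treated here as 10**12 bytes.
# This is an assumption for illustration; confirm the exact figure with Google.
FREE_BYTES_PER_MONTH = 10**12
GB = 10**9


def remaining_free_bytes(scanned_so_far):
    """Bytes of free quota left after the queries already run this month."""
    used = sum(scanned_so_far)
    return max(FREE_BYTES_PER_MONTH - used, 0)


def fits_in_free_tier(scanned_so_far, next_query_bytes):
    """True if the next query would still fit in this month's free quota."""
    return next_query_bytes <= remaining_free_bytes(scanned_so_far)


# Hypothetical month: three queries that scanned 250 GB, 400 GB, and 300 GB.
this_month = [250 * GB, 400 * GB, 300 * GB]
left = remaining_free_bytes(this_month)
print(f"Free quota left: {left / GB:.0f} GB")
print("An 80 GB query fits:", fits_in_free_tier(this_month, 80 * GB))
print("A 40 GB query fits:", fits_in_free_tier(this_month, 40 * GB))
```

In this made-up month, 950 GB has been used, so a 40 GB query still fits but an 80 GB query would go past the free allowance. Selecting only the columns you need is one way to reduce how much a query scans.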
So now you can sharpen your data science skills by choosing whatever dataset interests or amuses you from the goldmine of freely available public datasets. We are always adding more and more sources for datasets, so bookmark this page to stay updated. We also have a small favor to ask; if you have any other interesting source for publicly available datasets, do share it with the learning community. Send us the details with a brief overview of the dataset to firstname.lastname@example.org, and we will publish it with your name.