When working with comprehensive datasets, every data scientist seems to have a favorite go-to. For free resources, Mansi Singhal, CEO of qplum, pointed to data.gov, Socrata, Amazon OpenData, Google public data, Kaggle, and UCI Machine Learning Archives as a few examples. “In financial services industry, we find FRED database incredibly useful,” she said.
Then there are those datasets that are proprietary. “A good example is the stock price data for which you might need to work with an exchange or one of the 3rd party providers,” she said. “For business purposes, there is not much work around and these days exchanges make more money from selling data instead of trading fees.”
But for machine learning, the bigger the better, and that usually translates into large public domain datasets. Either way, Singhal said, the key “is to build the in-house tech infrastructure so that you have a pipeline to pull the data, clean and process it and repeat this process periodically without too much overhead.”
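As a rough illustration of the pull, clean and process pipeline Singhal describes, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names (`pull`, `clean`, `process`, `run_pipeline`), the CSV layout with `date` and `value` columns, and the `fake_fetch` stand-in for a real data source are hypothetical and do not come from qplum or any of the providers mentioned.

```python
import csv
import io
from typing import Callable, Dict, List


def pull(fetch: Callable[[], str]) -> List[Dict[str, str]]:
    """Pull raw CSV text from a source and parse it into rows."""
    return list(csv.DictReader(io.StringIO(fetch())))


def clean(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Strip whitespace and drop rows that have any empty field."""
    cleaned = []
    for row in rows:
        stripped = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key is not None  # ignore overflow columns from malformed rows
        }
        if all(stripped.values()):
            cleaned.append(stripped)
    return cleaned


def process(rows: List[Dict[str, str]], value_field: str = "value") -> List[dict]:
    """Convert the value column to float, skipping non-numeric entries."""
    processed = []
    for row in rows:
        try:
            processed.append({**row, value_field: float(row[value_field])})
        except ValueError:
            continue
    return processed


def run_pipeline(fetch: Callable[[], str], value_field: str = "value") -> List[dict]:
    """Run one pull -> clean -> process pass."""
    return process(clean(pull(fetch)), value_field)


def fake_fetch() -> str:
    """Hypothetical stand-in for a download from a real data provider."""
    return (
        "date,value\n"
        "2020-01-01, 1.5\n"
        "2020-01-02,\n"
        "2020-01-03,n/a\n"
        "2020-01-04,2.0\n"
    )


if __name__ == "__main__":
    for row in run_pipeline(fake_fetch):
        print(row)
```

Passing the fetch step in as a function keeps the cleaning and processing logic independent of any one provider. The "repeat periodically" part is left out here; in practice a scheduler such as cron could call `run_pipeline` at whatever interval suits the data.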
With machine learning on the uptick, we've done the legwork for you and assembled a list of top public domain datasets as ranked by GitHub. The full list, along with several other lists of datasets, can be found here.
- Amazon - This registry is meant to help people discover and share datasets that are available via AWS resources.
- Archive.org Datasets - The dataset collection consists of large data archives from both sites and individuals.
- Archive-it from Internet Archive - A web archiving service for cultural heritage on the web.
- CMU JASA data archive - An archive of contributed datasets from articles published in the Journal of the American Statistical Association.
- CMU StatLab collections - These are interesting datasets, or collections of data from books.
- Data.World - A public benefit corporation that is focused on creating collaborative open datasets.
- Enigma Public - Billed as “the world’s broadest collection of public data,” Enigma is a data management and intelligence company that provides a repository of public data.
- Google - Google’s collection of public datasets.
- KDNuggets Data Collections - Provides datasets for data mining and data science.
- Microsoft Data Science for Research - Microsoft Research’s collection of free datasets, tools and resources.