Earlier this month Google launched another digital product in beta, Google Dataset Search. It is, as the name suggests, a search tool for datasets. In its blog post announcing the launch, Google likened Dataset Search to Google Scholar, and for good reason: both sites dig up resources that are not easily found by traditional search tools. Google alluded to that when it wrote, “Dataset Search lets you find datasets wherever they’re hosted, whether it’s a publisher's site, a digital library, or an author's personal web page.”
The search engine will link to most datasets in environmental and social sciences, as well as data from other disciplines including government data and data provided by news organizations, such as ProPublica, according to Google.
As with Google's other search tools, coverage depends on adoption. As more data repositories use the schema.org standard that Google recommends in its guidelines to describe their datasets, the variety and number of datasets found by Dataset Search will grow.
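For readers curious what describing a dataset with schema.org can look like, here is a minimal sketch in Python that builds a JSON-LD description. The `Dataset` and `DataDownload` types and the property names come from the schema.org vocabulary; the helper function, the dataset name, and the example.org URLs are hypothetical and used only for illustration. Which properties Google requires is set out in its own guidelines, so treat this as a starting point rather than a complete description.

```python
import json


def build_dataset_markup(name, description, url, download_url,
                         file_format="text/csv"):
    """Return a schema.org Dataset description as a JSON-LD string.

    Hypothetical helper for illustration; the property names are
    taken from the schema.org Dataset and DataDownload types.
    """
    markup = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        # Each distribution entry points at a downloadable file.
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": file_format,
                "contentUrl": download_url,
            }
        ],
    }
    return json.dumps(markup, indent=2)


# Placeholder values, not a real dataset.
example = build_dataset_markup(
    name="Example ocean temperature readings",
    description="Illustrative monthly sea surface temperature readings.",
    url="https://example.org/datasets/ocean-temps",
    download_url="https://example.org/datasets/ocean-temps.csv",
)
print(example)
```

A repository would typically embed the resulting JSON in the dataset's landing page inside a script tag of type application/ld+json so that crawlers can read it.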
A Struggle to Find Datasets
Given that Google’s mission is to organize the world’s information, it is little surprise that it has turned its attention to datasets. In this case, though, the search tool is more than welcome in the user community. Datasets, according to the people we spoke with, can be very hard to find. Even people with “strong Google-fu using advanced operators to refine their query will struggle to find datasets,” said Mark Cook, digital marketing director with Candour.
Derek Gleason, content lead at ConversionXL, echoed that sentiment. “From my years as a reference publisher, I can confirm that some of this data is really, really hard to find. Government sources, in particular, have massive data troves but it’s rarely formatted in a way that traditional search engine retrieval can surface.”
One of the main problems with searching for datasets has been filtering out the noise, or junk, said Jason Eland, principal of Eland Consulting, LLC. “In the past you had to search through results and then download a CSV file in the hopes that it contained the data you wanted to find.” Another option, he continued, was to use internal or paid scholarly databases, whose relevant datasets were often in PDF form.
Another reason datasets were difficult to find before Google’s Dataset Search is that dataset publication has been extremely fragmented, said Tim Absalikov, co-founder and CEO of Lasting Trend. “It was especially hard if you needed data outside of your own community,” he said, citing the example of a geologist looking for a specific dataset on ocean temperatures.
How It Can Be Used
The datasets found in Google’s search product can be put to myriad uses. Relevant datasets can be used to train machine learning algorithms, for starters. Also, combining certain smaller datasets could allow larger trends to emerge, Eland said. “Two or more different studies completed years apart could have largely overlapping data making the total sample size much larger.” The effect of outliers or bad data could be mitigated, even negated, with these larger sample sizes, he added.
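Eland's point about sample size can be illustrated with a small sketch. The numbers below are invented for illustration and do not come from any real study; the function name is likewise hypothetical. The sketch measures how far one bad reading moves the average of a small sample versus a pooled one.

```python
from statistics import mean


def outlier_shift(sample, outlier):
    """Return how far a single outlier moves the mean of a sample."""
    return abs(mean(sample + [outlier]) - mean(sample))


# Invented readings from two hypothetical studies of the same quantity.
study_a = [10.0, 11.0, 9.0, 10.0]
study_b = [10.0, 9.0, 11.0, 10.0, 10.0, 10.0, 9.0, 11.0]
pooled = study_a + study_b

# The same bad reading is added to the small and the pooled sample.
shift_small = outlier_shift(study_a, 50.0)
shift_pooled = outlier_shift(pooled, 50.0)

print(f"Shift in the small sample:  {shift_small:.2f}")
print(f"Shift in the pooled sample: {shift_pooled:.2f}")
```

The pooled sample absorbs the bad reading better because it is averaged over more observations. This only holds if the studies measured the same thing in comparable ways, which is an assumption a real analysis would need to check.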
Still In Beta
It is important to note, again, that this search tool is still in beta. Google has been known to leave products in beta for months if not years, Gleason said. It has, of course, also been known to kill off products, even those that are popular among its users. Gleason leans toward a more positive interpretation. “I am sure their goal is to provide a seamless experience, such as simply returning a link to the Excel file for a query like crime stats Illinois, but that may take them a while to solve algorithmically.”
In the meantime, users can revel in what Google is offering right now. “The ability to be able to specifically search datasets, like how Google gave us the ability to search papers with Scholar, I believe is going to be invaluable over many industries,” Cook said.