Never underestimate the power of the right dataset to answer a question you might have about a market or product. The problem for many companies is that finding that “right” dataset can be difficult. It was only recently that Google launched a search tool for datasets, a development that may bring some order to what is right now a hodgepodge of resources scattered haphazardly across the internet.

Some datasets are well known, easy to find and large enough to be very useful for certain tasks, such as training machine learning applications. Subject-specific datasets, though, can be harder to identify. But they are out there. Here are six datasets we found (plus one bonus statistical study on email marketing) that marketers may find useful. Some of these datasets are big and well-suited for machine learning; some are small, some proprietary and some open. Any one of them just might be able to deliver that “right” answer for your needs.

1. Largest Food & Packaged Goods Companies 2016

This dataset was extracted from the 2017 Fortune 500 list, which was based on 2016 results. It lists the largest publicly held US food, beverage, personal care, pharmaceutical and tobacco companies for that year. One interesting use of this dataset is the Texas Beer Project. Volunteers reportedly gathered to collaborate on this project at the 2018 Houston Hackathon. Its goal is to delve into political contributions from beer PACs on behalf of the Texas Craft Brewers Guild. The group wants insight into contributions made by beer wholesalers and big distributors, as these have made a big impact in past legislative sessions in Texas and other states.

2. Women's Shoe Prices

This is a list of 10,000 women's shoes with product information provided by Datafiniti's Product Database, a site that sells product datasets drawn from a large catalog of ecommerce listings from hundreds of retail websites. This particular dataset includes shoe name, brand, price and other data. Note that this is a sample of a larger dataset; the full dataset is available through Datafiniti.

Researchers can use this data to determine brand markups, pricing strategies and trends for luxury shoes. Some questions this dataset can answer include:

  • What is the average price of each distinct brand listed?
  • Which brands have the highest prices?
  • Which ones have the widest distribution of prices?
  • Is there a typical price distribution (e.g., normal) across brands or within specific brands?

This data will also let you:

  • Correlate specific product features with changes in price.
  • Cross-reference the data with a sample of the Men's Shoe Prices dataset to see if there are any differences between women's brands and men's brands.
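
The brand-level questions above can be explored with a few lines of code. The following is only a minimal sketch in Python using the standard library: the rows are made-up placeholder values, not records from the actual dataset, and the brand names and the `summarize_by_brand` helper are hypothetical. It uses the population standard deviation as one simple way to measure how widely a brand's prices are spread.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical sample rows: (brand, price). These are placeholder values,
# NOT records taken from the real Women's Shoe Prices dataset.
rows = [
    ("BrandA", 50.0),
    ("BrandA", 70.0),
    ("BrandB", 120.0),
    ("BrandB", 280.0),
    ("BrandC", 35.0),
]

def summarize_by_brand(rows):
    """Group (brand, price) rows and return per-brand count, mean and spread."""
    prices = defaultdict(list)
    for brand, price in rows:
        prices[brand].append(price)
    return {
        brand: {
            "count": len(values),
            "mean": mean(values),
            # Population standard deviation as a simple measure of spread;
            # it is 0.0 for a brand with a single listing.
            "spread": pstdev(values),
        }
        for brand, values in prices.items()
    }

summary = summarize_by_brand(rows)

# Brand with the highest average price, and brand with the widest price spread.
highest_avg = max(summary, key=lambda b: summary[b]["mean"])
widest = max(summary, key=lambda b: summary[b]["spread"])

for brand, stats in sorted(summary.items()):
    print(brand, stats)
print("Highest average price:", highest_avg)
print("Widest price spread:", widest)
```

With real data, the same grouping could be applied after loading the file with a CSV reader. Brands with only a handful of listings will give unreliable averages, so it may be worth filtering on the `count` field first.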

3. Yelp Open Dataset

The Yelp dataset is a subset of the company’s businesses, reviews and user data. It is available as JSON files and is meant to be used for teaching students about databases, learning natural language processing, or as sample production data while learning how to make mobile apps. It is based on 5,996,996 reviews of 188,593 businesses and contains 280,992 pictures. It also has more than 1.4 million business attributes, such as hours, parking, availability and ambiance, plus aggregated check-ins over time for each of the 188,593 businesses.
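
As a rough illustration of working with JSON review data like this, the sketch below computes an average star rating per business in Python. It rests on assumptions that should be checked against the dataset's own documentation: that each line of a review file holds one JSON object, and that each object carries `business_id` and `stars` fields. The three records and the `average_stars` helper are invented for the example.

```python
import io
import json
from collections import defaultdict

# Made-up records standing in for lines of a review file (one JSON object
# per line is an assumption about the file layout, as are the field names).
sample = io.StringIO(
    '{"business_id": "b1", "stars": 5, "text": "Great tacos"}\n'
    '{"business_id": "b1", "stars": 3, "text": "Slow service"}\n'
    '{"business_id": "b2", "stars": 4, "text": "Nice patio"}\n'
)

def average_stars(lines):
    """Return {business_id: average star rating} from JSON-lines input."""
    totals = defaultdict(lambda: [0, 0])  # business_id -> [star sum, review count]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        entry = totals[record["business_id"]]
        entry[0] += record["stars"]
        entry[1] += 1
    return {bid: star_sum / count for bid, (star_sum, count) in totals.items()}

ratings = average_stars(sample)
print(ratings)
```

Reading line by line keeps memory use low, which matters for a file with millions of reviews. To run this against a real file, an open file handle could be passed in place of `sample`.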

4. Survey Data on Impulsive Buying and Its Impact

The researcher Narayanamurthy T asked shoppers at multiple locations the same set of questions in order to understand their tendency toward impulse shopping, as well as the store environment, products and promotions that influence them. Data was also collected on the shoppers’ age, source of income and how many days per month they shopped.

5. Fashion-MNIST

Found on GitHub, this dataset consists of 60,000 training images and 10,000 test images. It was released about a year ago by Han Xiao, Senior Scientist III at Tencent AI Lab, according to his post. He writes that Fashion-MNIST was intended to serve as a drop-in replacement for the original (and very popular) MNIST dataset, helping people to benchmark and understand machine learning algorithms. Although primarily viewed as a machine learning dataset, it nonetheless has uses for marketers.

6. Labeled Faces in the Wild

This database contains 13,000 face photographs collected from the Web and labeled with the name of the person pictured. The dataset was designed for studying the problem of unconstrained face recognition. Like the preceding database, this one is often tapped for machine learning, but also can be useful to marketers.

* Bayesian Inference for Assessing Effects of Email Marketing Campaigns (A Bonus Listing) 

We’re also including a bonus: a statistical study on assessing the effects of email marketing campaigns. It was published this year in the Journal of Business & Economic Statistics by academics Jiexing Wu, Kate J. Li and Jun S. Liu. The study sought to quantify the effectiveness of email marketing campaigns in conjunction with customer characteristics. The researchers analyzed a large email marketing dataset from an online ticket marketplace to evaluate the short- and long-term effectiveness of its email campaigns. They found that email offers can increase customers’ purchase rate both immediately and over the longer term.

“Customers’ characteristics such as length of shopping history, purchase recency, average ticket price, average ticket count, and number of genres purchased also affect customers’ purchase rate,” according to the study. It also found that customers who have been inactive recently are more likely to take advantage of promotional offers.