As someone whose career in the 21st century has focused mainly on user contribution systems and user-created content, I leverage several crowdsourcing sites on the Web. One of my favorites is San Francisco-based Kaggle, which, according to its CEO, Anthony Goldbloom, enables people to outsource big data questions.

Every predictive modeling problem is framed as a competition in which the person who builds the most accurate model gives that model to the company, and in exchange the company gives them a prize. Kaggle is a powerful way to build predictive modeling algorithms.


What They Do

During our discussion, Goldbloom mentioned two competitions:

1. The William and Flora Hewlett Foundation (Hewlett) reached out to Kaggle's data scientists and machine learning specialists to develop an affordable solution for automated grading of student-written essays. (Not sure my wife, who is a high school teacher, will like this.)

The Hewlett Foundation ended up collecting 24,000 graded essays written by high school students. In the end, a British hedge fund trader (trained as a physicist), a software developer at the National Weather Service and a German grad student created the winning solution, which can help schools assess students’ writing.

The Foundation sponsored the contest and awarded US$ 100,000 to the top three research teams. In the end, 250 teams participated and there were 2,500 submissions. (Note: None of the winners had a data science background).

2. The Wikipedia Challenge focused on getting data-mining experts to build a model that predicts the number of edits an editor would make. Wikipedia wanted to understand what factors determine editing behavior. Contestants were expected to build a predictive model that the Wikimedia Foundation could reuse to forecast long-term trends in the number of edits it can expect. There were 94 teams with 115 players and 1,024 entries. Here's a page describing the challenge:


Kaggle combines many of the popular current trends in the industry: gamification, crowdsourcing, virtual workforce and, of course, Big Data. (Venture capitalists must love this company.)

Kaggle’s crowdsourcing approach relies on the fact that there are countless strategies that can be applied to any predictive modeling task, and it is impossible to know at the outset which technique or analyst will be most effective.

Kaggle's secret sauce is that there's lots and lots of data out there, and a strong desire to play with this data.

Kaggle is gaining the most traction in financial services, in the technology sector, and in life sciences. Competitions filter talent and also let the best data solutions float to the top of the pack while people are giving objective feedback along the way.

Most of the 45,000 members on Kaggle call themselves data scientists, which is one of the hottest professions in Silicon Valley. Most of them, however, have engineering or computer science degrees. Here's a breakdown of their professions:


Kaggle has several public offerings:

Kaggle Prospect (now in beta) lets a company open up its data to determine what types of problems could be solved. Practice Fusion (another favorite company of mine), a vendor of electronic records, used it this way to explore problems such as predicting who will develop diabetes.

Kaggle In-Class is another product. Whether the task is predicting the past or the future, students are required to build models that are evaluated against past outcomes.

The Best to Rise to the Top

Kaggle has a great business model, one that should be considered by other crowdsourcing companies. As Goldbloom explains, “Competitions are open to everybody. The sole purpose of these competitions is to qualify talent. So if you finish in the top ten percent of two public competitions, we’ll label you as qualified talent.” Most of Kaggle’s commercial work, such as banks trying to predict who’s going to default on a loan, is conducted via a private competition.

“For private competitions we basically invite 15 of our strongest members. Each of them compete behind the scenes and the prize money is consistently -- it’s a six figure sum and we also take a large fee on those private competitions.”
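The qualification rule Goldbloom describes can be sketched in a few lines of Python. This is only an illustration of the rule as quoted, not Kaggle's actual implementation: the function names, the (rank, entrants) data layout, and the reading of "two" as "at least two" are all assumptions.

```python
def in_top_ten_percent(rank, entrants):
    """True if a finishing rank falls in the top 10% of a competition's field."""
    # Integer comparison avoids floating-point rounding at the boundary.
    return rank * 10 <= entrants


def is_qualified(public_results, required=2):
    """Apply the quoted rule: top 10% in (at least) two public competitions.

    public_results is a hypothetical list of (rank, entrants) pairs,
    one pair per public competition a member has entered.
    """
    top_finishes = sum(
        1 for rank, entrants in public_results
        if in_top_ten_percent(rank, entrants)
    )
    return top_finishes >= required


# Hypothetical member: 5th of 100 and 20th of 250 are both top-10% finishes.
print(is_qualified([(5, 100), (20, 250)]))
```

A member who qualifies this way would then be eligible for an invitation to the private competitions described above.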

Here are some examples of potential ROI vs. realized ROI.

1. Transactional Fraud: A large credit card issuer.

Assume the issuer has 50M credit cards, with customers spending on average US$ 500 per month on each. Based on current industry estimates, let’s assume the issuer experiences 10 basis points (1 basis point is 1/100th of 1%) in fraud losses. That puts total fraud losses in the neighborhood of US$ 300M / year (50M * 500 * 12 * 10 basis points). A mere 5% reduction in fraud losses with a better model will generate an incremental return of US$ 15M / year. This can easily put the ROI in the double digits, especially when you think about how much time and how many people you would otherwise need to resolve these issues.
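For readers who want to rerun the fraud numbers with their own inputs, here is a minimal Python sketch of the arithmetic above. The function and parameter names are made up for illustration; the figures are the ones assumed in this example.

```python
def annual_fraud_losses(cards, monthly_spend_usd, fraud_bps):
    """Yearly fraud losses in US$, given a fraud rate in basis points."""
    annual_spend = cards * monthly_spend_usd * 12
    # 1 basis point is 1/100th of 1%, i.e. 1/10,000 of the amount.
    return annual_spend * fraud_bps / 10_000


def annual_savings(losses_usd, reduction_pct):
    """Yearly savings in US$ from cutting losses by reduction_pct percent."""
    return losses_usd * reduction_pct / 100


# Inputs assumed in the example: 50M cards, US$ 500 / month, 10 basis points.
losses = annual_fraud_losses(cards=50_000_000, monthly_spend_usd=500, fraud_bps=10)
savings = annual_savings(losses, reduction_pct=5)

print(f"Fraud losses: US$ {losses:,.0f} / year")
print(f"Savings from a 5% reduction: US$ {savings:,.0f} / year")
```

Changing `reduction_pct` shows how sensitive the return is to even small improvements in the model.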

2. Retail consumer marketing: A large retailer

A big box retailer, with over 20M customers, sends product promotions to its customers on a monthly basis. Typically, fewer than 1% of customers respond to these offers. Assuming each responding customer spends US$ 200 on average because of the marketing offer, the retailer probably sees US$ 40M (20M * 1% * 200) in incremental sales. A better predictive model through Kaggle can easily double or triple the response rates to these marketing offers, thereby leading to US$ 80M to US$ 120M in incremental sales!
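The retail estimate follows the same pattern. The sketch below, again with hypothetical names and only the numbers assumed in this example, shows how sales from the offers scale as the response rate doubles or triples.

```python
def offer_sales(customers, response_rate_pct, spend_per_responder_usd):
    """Sales in US$ driven by a promotion, given a response rate in percent."""
    return customers * response_rate_pct * spend_per_responder_usd / 100


CUSTOMERS = 20_000_000        # customers receiving the offer
BASE_RESPONSE_RATE_PCT = 1    # baseline response rate assumed in the example
SPEND_USD = 200               # average spend per responding customer

# Sales at the baseline rate, then at double and triple that rate.
sales_by_multiplier = {
    m: offer_sales(CUSTOMERS, BASE_RESPONSE_RATE_PCT * m, SPEND_USD)
    for m in (1, 2, 3)
}

for multiplier, sales in sales_by_multiplier.items():
    print(f"{multiplier}x response rate: US$ {sales:,.0f}")
```

Note that the gain over the US$ 40M baseline is the difference between each figure and the first one.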

The grand vision of Goldbloom’s team is to create a meritocracy, a labor market where the best people rise to the top, both in terms of skill and value.

I highly recommend that you check out Kaggle!

ROI (Real Overall Impact!)

  1. Use a public area to identify potential leaders to participate in a private area.
  2. Leverage a real-time leaderboard, which motivates people.
  3. Enable the community to determine the content -- what problem will be resolved.
  4. Check out Hacker News for a good implementation of the Thumbs up / Thumbs down process.
  5. The platform for uniting free agents is important.
  6. People learn more by doing vs. sitting in a class or reading a user manual.

What is Anthony reading?

A podcast of the article is available here.

Editor's Note: This is the first in the series for Scott. To read some of his thoughts on the Social Enterprise, see:

-- Socialized Business = Humanized Business