You Too Can Be A Data Scientist Courtesy of Airbnb

No, Airbnb won’t be providing you with a bunch of clean data sets and algorithms when you arrive at your destination. But the rental lodging site is doing something that will delight analysts a whole lot more.

Today it announced that it is open sourcing Airpal, an internally developed web-based query execution tool that leverages Facebook's PrestoDB to facilitate data analysis.

It puts data scientist-like powers into more hands, and that’s not just talk. One-third of Airbnb employees have issued a query using the tool.

Keep It Simple

Using SQL for analytics isn’t a cakewalk, even for experienced data workers, and the workflow rarely resembles a stroll down the red carpet.

Challenges like recalling precisely how a query was written, copying and pasting from the command line, and running multiple terminal windows can slow down analysis and be frustrating.

Task a beginner with these sorts of things and he may become flabbergasted. Ask him to partner with others and you may get a frustrated team.

The engineers at Airbnb realized that there was a way to ease the problem, so they built Airpal for their employees. It provides UI tools that they say have helped drive adoption and promote knowledge sharing.

How It Works

Airbnb told us it holds 1.5 petabytes of data as Hive managed tables in HDFS, and that the size of its important "core_data" tables allows it to leverage Presto as the default query engine for analysis.

When it comes to running ad hoc queries and iterating on the steps of an analysis, it found Presto to be “snappier” and more responsive than traditional MapReduce jobs.

In an email to CMSWire a company representative wrote, “The biggest benefit to adding Presto to our infrastructure stack, though, is that we don't have to add additional complexity to allow interactive querying.

“Because we are querying against our one, central Hive warehouse, we can keep a single source of truth with no large scale copies to a separate storage/query layer. Additionally, the fact that we don't need [to] change data storage type from RC format to see the speed improvements, makes Presto a great choice for our infrastructure.”

The key features of Airpal are:

  • Optional access controls for users
  • Ability to search and find tables
  • See metadata, partitions, schemas and sample rows
  • Write queries in an easy-to-read editor 
  • Submit queries through a web interface
  • Track query progress
  • Get the results back through the browser as a CSV 
  • Create a new Hive table based on the results of a query
  • Save queries once written
  • Searchable history of all queries run within the tool 
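
For readers curious about what might come next once results arrive as a CSV, here is a minimal Python sketch of summarizing such a file. It is only an illustration: the sample data, the column names `city` and `bookings`, and the function `total_by_column` are hypothetical and are not taken from Airpal or from Airbnb's data.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV content standing in for a downloaded query result.
SAMPLE_CSV = """city,bookings
Paris,120
Tokyo,95
Paris,30
"""


def total_by_column(csv_text, key_column, value_column):
    """Sum the integer values in value_column for each distinct key_column."""
    totals = defaultdict(int)
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        totals[row[key_column]] += int(row[value_column])
    return dict(totals)


if __name__ == "__main__":
    # Print the per-city totals for the sample data.
    print(total_by_column(SAMPLE_CSV, "city", "bookings"))
```

In practice the text would be read from the downloaded file rather than from an inline string.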

If you want to check Airpal out, you can test it without any overhead or cost. For more detailed information, visit the GitHub page.

For as much back and forth as there is between commercial vendors in the open source community, there is also real beauty when it works well.

This is an example. Facebook open sourced Presto less than two years ago. Airbnb picked it up and built Airpal, which it is now sharing with the community.

This is open source at its best. Developers get paid to build the tools that help their employers’ businesses thrive and, when those tools aren’t trade secrets, they get to share them with a community of their peers.

Title image by Airbnb.