This year at DrupalCon, the organizers felt a little cross-pollination was in order: several of the talks don't feature Drupal at all. Instead, the topics revolve around things the Drupal community can learn from.

One of those talks was offered by Ben Sandofsky (@sandofsky) of Twitter.

Enter the Twitterverse

The story of Twitter is a matter of scale. There are 107 million Twitter accounts and climbing. Those accounts generate 50 million tweets per day and, through interaction with the site and clients, 3 billion API requests a day. That's not a typo: three billion, or 75% of Twitter's traffic, meaning API calls outnumber ordinary web requests roughly three to one.

As anyone familiar with the infamous Fail Whale knows, the folks at Twitter have learned a lot of rough lessons along the way as their real-time messaging service grew in popularity. The solutions all revolve around one central concept: removing bottlenecks.

Hardware Bottlenecks

At one point, Twitter was a cloud-based service. However, the cloud simply couldn't offer the level of performance that they needed as they grew, so Twitter's servers all now run out of a single data center.

Human Capacity Bottlenecks

Twitter has 175 employees. Among those, the devs follow an agile model with one-week sprints and code in pairs. Pair programming has cut down on the time lost to relatively simple bugs like typos, and it also helps them avoid issues like memory leaks, since there are always two sets of eyes on the code.

From there, the pairs are integrated into teams, which work on features. Since the teams are distributed, they use a web chat product called Campfire to coordinate, chosen for its logging, its file uploads, and the fact that being browser-based means people can jump in when necessary without having to go and get an IRC client. As much as possible, they're trying to remove situations where individuals or teams become bottlenecks.

Process Bottlenecks

The Twitter code repository is managed through Git. There are dozens of active branches, with the outermost branches typically relating to features, which feed into team branches, which feed into the master branch, and so on. The smarter the logic for controlling branch syncs and merges, the fewer bottlenecks exist.

For issue tracking they use Pivotal Tracker. Sandofsky says he finds that breaking an issue into points rather than units of time generates more accurate estimates of how long it will take to get something done.

Software Bottlenecks

While Sandofsky went deep into how they handle incoming API calls and tweets, I'm going to break this down to the basics. Anyone familiar with the Fail Whale knows Twitter has had growing pains trying to handle the massive traffic the network can generate. Navigating those rough waters has led the company down a heavily distributed path that involves:

  1. Pre-processing incoming messages through business logic that checks that the tweet is within the 140-character limit, that the user is authenticated, and that the account isn't offline for spam or other behavioral issues
  2. Handing off the message to a brand new instance of a queue
  3. Handing the message off to a brand new instance of a worker process to finish the job

This distributed workflow is designed to move things through as quickly as possible. Sandofsky says there are trade-offs for that speed, such as timelines that can show tweets in a different order than they actually arrived, but given the performance benefits, they feel a few out-of-order tweets are worth it.
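
To make the shape of that pipeline concrete, here's a minimal Scala sketch (Scala being one of the languages Twitter mentions below). The Tweet type, the exact validation rules, and the in-memory queue are my own illustrative assumptions rather than Twitter's code; in production the queue would be an external service like Kestrel and the workers would be separate processes.

```scala
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical message type; Twitter's real internal representation is not public.
case class Tweet(userId: Long, text: String, authenticated: Boolean, suspended: Boolean)

object IngestPipeline {
  // Stage 1: business-logic pre-processing (length, auth, account status).
  def validate(t: Tweet): Either[String, Tweet] =
    if (t.text.length > 140)   Left("over the 140-character limit")
    else if (!t.authenticated) Left("user is not authenticated")
    else if (t.suspended)      Left("account is offline for spam or other issues")
    else                       Right(t)

  // Stage 2: hand the message off to a queue and return immediately.
  val queue = new LinkedBlockingQueue[Tweet]()
  def enqueue(t: Tweet): Unit = queue.put(t)

  // Stage 3: a worker drains the queue and finishes the job
  // (fanning the tweet out to follower timelines, indexing, and so on).
  def worker(): Unit =
    while (true) {
      val t = queue.take()   // blocks until a message arrives
      println(s"delivering tweet from user ${t.userId}: ${t.text}")
    }

  def main(args: Array[String]): Unit = {
    val w = new Thread(() => worker())
    w.setDaemon(true)        // let the JVM exit when main is done
    w.start()
    validate(Tweet(42L, "Hello from DrupalCon", authenticated = true, suspended = false))
      .foreach(enqueue)      // only valid tweets reach the queue
    Thread.sleep(200)        // give the worker a moment before exiting
  }
}
```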

Even at the database level, they're hunting down and destroying bottlenecks. Twitter currently uses MySQL, but they're in danger of hitting a ceiling on what they can store in it, so they're moving to Cassandra, a distributed database open sourced by Facebook and recently accepted as an Apache Software Foundation project.
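
Twitter's actual schema wasn't part of the talk, but the basic win of a store like Cassandra is that writes are partitioned across many machines instead of piling up on one. Here's a toy Scala sketch of that idea, spreading tweets by user id across a made-up set of nodes; real Cassandra uses consistent hashing on a token ring plus replication, and you'd talk to it through a driver rather than an in-memory map.

```scala
import scala.collection.mutable

// Toy illustration of partitioning writes across nodes by key, the core idea
// behind moving from a single MySQL box toward a distributed store like Cassandra.
// Node names and the partitioning scheme are illustrative assumptions only.
object PartitionedTweetStore {
  val nodes = Vector("cass-node-1", "cass-node-2", "cass-node-3")

  // Each "node" is just an in-memory buffer here; in reality it would be a server.
  private val storage: Map[String, mutable.Buffer[(Long, String)]] =
    nodes.map(n => n -> mutable.Buffer.empty[(Long, String)]).toMap

  // Pick a node from the user id. Real Cassandra hashes keys onto a token ring
  // and replicates each row so data survives node failures.
  def nodeFor(userId: Long): String = nodes((userId % nodes.size).toInt)

  def write(userId: Long, tweetId: Long, text: String): Unit =
    storage(nodeFor(userId)) += (tweetId -> text)

  def readAll(userId: Long): Seq[(Long, String)] =
    storage(nodeFor(userId)).toSeq

  def main(args: Array[String]): Unit = {
    write(7L, 1001L, "first tweet")
    write(7L, 1002L, "second tweet")
    println(s"user 7 lives on ${nodeFor(7L)}: ${readAll(7L)}")
  }
}
```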

Other technologies involved in their software refinements include a heavily modified Ruby on Rails stack to reduce bottlenecks introduced during business-rule processing, Apache proxying to Unicorn, the Kestrel distributed queue, the Java Virtual Machine, and the Scala programming language.
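
Of those, Kestrel is the easiest to peek at: it speaks the memcached text protocol, with queue names standing in for cache keys, so enqueueing a message is just a memcached "set". The sketch below opens a raw socket to do exactly that; the host, port, queue name, and payload are placeholders of mine, and real client code would pool connections and handle errors.

```scala
import java.io.{BufferedReader, InputStreamReader, OutputStreamWriter}
import java.net.Socket

// Minimal sketch of enqueueing to a Kestrel-style queue over the memcached
// text protocol ("set" enqueues, "get" dequeues). Host, port, and queue name
// are placeholders; production code would reuse connections and handle errors.
object KestrelSketch {
  def enqueue(host: String, port: Int, queue: String, payload: String): String = {
    val socket = new Socket(host, port)
    try {
      val out   = new OutputStreamWriter(socket.getOutputStream, "UTF-8")
      val in    = new BufferedReader(new InputStreamReader(socket.getInputStream, "UTF-8"))
      val bytes = payload.getBytes("UTF-8")
      // memcached "set": set <key> <flags> <expiry> <bytes>\r\n<data>\r\n
      out.write(s"set $queue 0 0 ${bytes.length}\r\n$payload\r\n")
      out.flush()
      in.readLine()          // expect "STORED" on success
    } finally socket.close()
  }

  def main(args: Array[String]): Unit =
    println(enqueue("localhost", 22133, "tweets", """{"user":42,"text":"hello"}"""))
}
```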

And Much More

There's a lot more to Twitter's efforts to keep up with demand. They use Review Board to decide, through peer review, whether something is ready to ship. Rather than having a separate QA team, all engineers do QA by writing unit, functional, and integration tests, with Webrat and Selenium helping out for things like exercising the code in a browser. Sandofsky boasts a 1-to-1 code-to-test ratio for the mobile Twitter code.
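
To give a taste of what a 1-to-1 code-to-test ratio looks like in practice, here's a tiny, self-contained Scala test in the same spirit: it exercises a hypothetical 140-character length check like the one in the ingest sketch above, using plain assertions rather than Twitter's actual test harness.

```scala
// Self-contained unit-test sketch; no framework required, just Predef.assert.
object TweetValidationSpec {
  // Hypothetical rule under test: tweets must fit in 140 characters.
  def withinLimit(text: String): Boolean = text.length <= 140

  def main(args: Array[String]): Unit = {
    assert(withinLimit("short and sweet"), "a short tweet should pass")
    assert(withinLimit("x" * 140),         "exactly 140 characters should pass")
    assert(!withinLimit("x" * 141),        "141 characters should fail")
    println("all tweet-length assertions passed")
  }
}
```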

If there's a single lesson to learn, it's that when bottlenecks start to get in the way, you have to be willing to sit down, identify where the real issue is, and then act aggressively to deal with it.

Twitter has gone from the cloud to the data center, ported and reworked its distributed queue, put its programmers into pairs, modified an entire development framework, and more, all to keep people from meeting the Fail Whale too often.

What will you do to streamline your sites? (Really, tell us in the comments.)