If you turned in your March Madness brackets earlier this week, chances are that's how you feel.
Fear not, you're not alone. ESPN says 99 percent of the brackets submitted to its website were busted by yesterday's results.
Who would have thought the University of Alabama at Birmingham (UAB) Blazers would beat the third-seeded Iowa State Cyclones? Many expected Iowa State to reach the Final Four.
Or what were the chances of UCLA knocking off SMU? Ditto for the Georgia State–Baylor result.
A blindfolded monkey throwing darts at the brackets might have better success than I did. 😝😩😡 Just sayin... #Marchmadness— Julie Goolsby (@crazymrsg) March 20, 2015
Filling out ur brackets u think ur genius. End of 1st day u realize ur an imbecile #MarchMadness— Billy Black Chip (@BillyBlackChip) March 20, 2015
Not only do search giants Google and Bing have some of the best-trained algorithms and data scientists working on their behalf, but they also have a great deal at stake.
After all, if they can predict the winner of most of these games correctly, imagine what kind of results they can offer an advertiser.
The two search giants used very different approaches in completing their brackets.
Bing leveraged Bing Predicts, which went 15 for 15 in the World Cup and called the Super Bowl right. For this year’s NCAA tournament, it analyzed 9.2 quintillion potential brackets. (That’s 9.2 × 10^18, a 19-digit number.)
Google, on the other hand, not only used an algorithm based on Google search but also looked at fan activity across multiple platforms, factoring the wisdom of the crowd into its equation.
Be Your Own Data Scientist
Joseph Blue, a data scientist at MapR, told us that what the search engine giants did was quite impressive, but that simply looking at the performance of each team over the season might yield comparable results.
Not only that, but he also proposed that most folks with a computer and a college degree (we’re thinking it would need to be in IT, math, statistics or some kind of science, although he didn’t say so) could fill out their brackets with some level of confidence using the PageRank algorithm Google used in its early days.
Here are the instructions he shared:
PageRank was derived in the early days of Google to determine the relevance of web pages. The philosophy of the algorithm is this: the most important page is the one that the most important pages link to. The algorithm assumes that if page A links to B, then B is more “important” than A. But there are many pages and many links.
Two simple formulae explain the basics of PageRank:
G = αS + (1 − α)E
πᵀ = πᵀG
G refers to the square “Google” matrix. It is composed of a matrix (S), where each row contains the signals one particular page sends out to other pages, plus a “teleportation” matrix (E) that lets a user jump randomly from any page to any other, even when there’s no link. The weights given to S and E are controlled by a constant α, which is usually set around 0.9.
The value πᵀ is the stationary row vector of G (over an infinite number of browsing steps, these are the probabilities that you end up at each page). Start with all pages assigned the same weight, then iterate through a process known as the power method to derive πᵀ. Sort this vector and you have your rankings. This is an extension of Markov theory, a good topic for further study.
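For the numerically inclined, the two formulae above can be sketched in a few lines of NumPy. The three-page link matrix here is invented purely for illustration:

```python
import numpy as np

# Toy web of 3 pages: page 0 links to pages 1 and 2,
# page 1 links to page 2, and page 2 links back to page 0.
# Row i of S holds the normalized signals page i sends out.
S = np.array([
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
])

alpha = 0.9                        # weight between real links and teleportation
n = S.shape[0]
E = np.full((n, n), 1.0 / n)       # teleportation: uniform random jump
G = alpha * S + (1 - alpha) * E    # G = alpha*S + (1 - alpha)*E

# Power method: start uniform and repeatedly multiply by G
# until the vector stops changing.
pi = np.full(n, 1.0 / n)
for _ in range(500):
    nxt = pi @ G
    if np.abs(nxt - pi).sum() < 1e-9:
        break
    pi = nxt

print(pi)  # probabilities summing to 1; the largest entry is the most "important" page
```

In this toy web, page 2 edges out the others because it collects links from both of the other pages; sorting πᵀ in descending order gives the ranking.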
Google has much more complex approaches now, but this relatively simple method can be applied to situations in which we want to rank entities based on a signal of “importance” they send when interacting, just like college basketball.
Application to College Basketball
What happens when team A loses to B? If we interpret team A’s loss as a signal indicating that B is superior, then the PageRank algorithm should give us the teams in order of the strength of the “signals” they get from the other teams. It certainly rewards teams for winning games, but also considers the quality of the teams that beat them.
Since Google already did the hard work, all we have to do is get the game results, assign each team an equal rank, and then build the matrix of teams sending “signals” to the teams that defeated them. We then iterate for the stationary vector (πᵀ), which gives us our rankings.
The code and game data to reproduce or experiment with these results can be downloaded from the following repository: https://github.com/joebluems/MarchMadness.git
Since this algorithm is data-driven, we’re going to need data. It’s available from a variety of internet sources: use the csv file in the repository or pull the scores yourself with a series of curl commands against your favorite sports site. The game data comprises 5,778 games played through March 15, 2015, involving 631 different colleges.
Division I schools can play non-Division I schools, which generates activity for teams with very few games (sorry, Daniel Webster and Sarah Lawrence College). The algorithm automatically weights these teams low due to a lack of data, so no adjustments to the raw results are necessary.
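As a sketch, here’s one way to pull game results into Python. The winner/loser column layout below is an assumption for illustration; check the csv file in the repository for its actual format before adapting this:

```python
import csv
from io import StringIO

# Inline sample standing in for the real results file;
# the column names "winner" and "loser" are assumed, not the repo's schema.
sample = """winner,loser
Kentucky,Arkansas
Villanova,Xavier
Kentucky,Florida
"""

games = []   # list of (winner, loser) pairs
teams = {}   # team name -> unique integer id, assigned on first sight
for row in csv.DictReader(StringIO(sample)):
    w, l = row["winner"], row["loser"]
    for t in (w, l):
        teams.setdefault(t, len(teams))
    games.append((w, l))

print(len(teams), len(games))  # 5 teams, 3 games
```

The integer ids assigned here are what let us index rows and columns of the matrix in the next step.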
Running the Algorithm
Applying the algorithm to generate PageRanks requires two steps:
Step 1 (create the matrix) – read the game data and assign a unique integer to each school. This information is helpful when displaying the results in the next step. Once the data is read, keep track of which team loses to whom and assign a “weight” to each loss. This information comprises the H matrix (a more pliable form of the S matrix), which is key to the PageRank algorithm.
Step 2 (find the ranks) – execute the power method to solve for the stationary row vector (πᵀ) of the G matrix. Sorting this vector in descending order provides the rankings. The alpha parameter was set to 0.90, and the power method continues until the convergence criterion is satisfied (residual <= 0.001). The algorithm was adjusted to base the outgoing signal (loss) on a minimum denominator of 25 games (to account for non-Division I teams with incomplete schedules).
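The two steps can be sketched together in Python. This is an illustration of the approach rather than the repository’s code: the helper name pagerank_teams and the three-team schedule are invented, and min_games mirrors the 25-game minimum denominator described above:

```python
import numpy as np

def pagerank_teams(games, teams, alpha=0.9, min_games=25, tol=1e-3):
    """Rank teams from (winner, loser) pairs with a PageRank-style power method.

    Each loss sends a "signal" from the loser's row to the winner's column.
    Outgoing weights are divided by max(games played, min_games), so teams
    with very short schedules are automatically down-weighted.
    """
    n = len(teams)
    H = np.zeros((n, n))
    played = np.zeros(n)
    for w, l in games:
        wi, li = teams[w], teams[l]
        H[li, wi] += 1.0            # loser li sends a signal to winner wi
        played[wi] += 1
        played[li] += 1
    H /= np.maximum(played, min_games)[:, None]   # minimum denominator

    # Fold in teleportation; rows of H can sum to less than 1 (undefeated
    # teams send no signal), so spread the leftover mass uniformly, as
    # PageRank does for dangling nodes.
    G = alpha * H + (1 - alpha) / n
    slack = 1.0 - G.sum(axis=1)
    G += slack[:, None] / n

    # Power method: iterate pi = pi @ G until the residual is small enough.
    pi = np.full(n, 1.0 / n)
    while True:
        nxt = pi @ G
        if np.abs(nxt - pi).sum() <= tol:
            return nxt
        pi = nxt

# Toy schedule: Kentucky beats everyone, Arkansas beats Florida.
teams = {"Kentucky": 0, "Arkansas": 1, "Florida": 2}
games = [("Kentucky", "Arkansas"), ("Kentucky", "Florida"), ("Arkansas", "Florida")]
ranks = pagerank_teams(games, teams, min_games=2)
order = sorted(teams, key=lambda t: ranks[teams[t]], reverse=True)
print(order)  # Kentucky first, Florida last
```

On the full 5,778-game dataset the idea is the same, just with 631 rows and columns instead of three.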
The following table shows the PageRank outcome and the AP ranks for comparison (the Associated Press ranks are compiled from the votes of 64 sportswriters and broadcasters, which may average out individual bias or simply compile it):
Rounding out the top 25 by PageRank are the following: Baylor, Wichita State, North Carolina, Oregon, San Diego State, Louisville, Colorado State, Dayton, Michigan State and Wofford. Every team in the top 25 made the tournament, though Dayton was assigned a play-in game to join the final 64-team bracket.
The top 10 for both lists are slightly shuffled. It’s important to keep in mind that the PageRanks are based only on win-loss results. As the sole undefeated team, Kentucky is a no-brainer, consensus #1. Villanova’s record earns it the second spot from PageRank, but the AP ranks it 4th. Conversely, the algorithm penalizes Virginia down to 6th from 3rd in the AP poll. Based on the strength of their wins (and losses), Arkansas improves from 21st on AP to #15 on PageRank.
PageRank also made a few pulls from outside the AP top-25, like Virginia Commonwealth, Wofford and Colorado State. One of the differences of this approach is that it treats all games equally - the polls will give greater weight to more recent games (which may not be a bad idea). Understanding a method’s limitations will give you the clues to improve it to fit your situation.
So if you’re a geek who’s into the tourney, why not give it a whirl? Just make sure to update your stats before you start. Let us know how you did.