Well, it is March Madness, and like millions of other college basketball fans it is time for me to fill out my bracket.  As I went through the process of filling everything out on Yahoo, I noticed a small button at the top that said "Fill Out with Most Popular" and I clicked it.  Basically this is a very simple form of crowdsourcing a bracket based on what myriad other people are picking (a better form would be to use an information market, like one of my clients, Daggre).  This got me thinking: there are a lot of teams in the NCAA tournament (68) and the tournament has been running for a long time (at least 40 years), so some serious data has been produced over that period.  Where there is data, there is data science, and I was off and running.

Data

I quickly found several great sources of data:

  • Washington Post tournament game results (2004–2011)
  • kenpom.com team statistics
  • a file with each team's RPI

While I would be interested in more data, the problem with the NCAA tournament is that there are only a few days between the release of the seedings and the start of the games, so I needed to be quick about the analysis.

Processing

Despite my most resourceful (ahem, still lazy) efforts, I did have to do some work to shape the various datasets into something useful.  I took the Washington Post data from 2004 to 2011 and needed to "join" it to the kenpom.com data from the same period.  The way I wanted to do this was to have two teams, Team1 and Team2 (the team with the lower seed number always first), and build a table with all of the kenpom.com data for each team plus a binary variable for which team lost (well, I set it up to be 1 or 2, for Team1 or Team2).

This "joining" was not difficult besides for accounting for the team name differences between the datasets (the biggest difference being State -> St., i.e. Michigan State -> Michigan St.).  I simply built a mapping of the names from one dataset to another because there weren't enough mismatches to warrant anything more complicated (like edit distance).  Finally because the already formed data from kenpom only went to 2011, I had to join together a file with the RPI of each team to the kenpom data for 2012 and append that file to the already formed data.  All of this work was done with the wonderful pandas framework in Python.  The final "raw" file (not converted to features) can be found here.

Conversion to the final features was accomplished by taking all the categorical columns and assigning integers to them.  The final feature output is available here.  Note that I deleted the SCORE1 and SCORE2 columns before any learning since, of course, those columns are not available to predict with for this year's tournament.
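Here is roughly what that conversion can look like, again with assumed file and column names; pandas' factorize is one simple way to assign integer codes to string columns, though I'm not claiming it is exactly what my script did:

    import pandas as pd

    raw = pd.read_csv("ncaa_raw.csv")             # the "raw" joined file described above
    raw = raw.drop(["SCORE1", "SCORE2"], axis=1)  # final scores are not known before the games

    # Replace every string-valued column with integer codes
    for col in raw.select_dtypes(include=["object"]).columns:
        raw[col], _ = pd.factorize(raw[col])

    raw.to_csv("ncaa_features.csv", index=False)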

Learning

To learn, I used scikit-learn's amazing Random Forest implementation.  I started off doing 5-fold cross-validation and got an accuracy of around 80% for predicting winners.  Not too shabby for the little work I did gathering the data.  I also had the Random Forest output the most important features for the classifier:

    1. TEAM2_w (higher-seeded team's number of wins) - 0.164
    2. TEAM1_w (lower-seeded team's number of wins) - 0.109
    3. TEAM2_pyth (higher-seeded team's Pythagorean expected win %) - 0.056
    4. TEAM2_adjd_rnk (higher-seeded team's kenpom.com adjusted rank) - 0.040
    5. TEAM1_pyth (lower-seeded team's Pythagorean expected win %) - 0.036
    6. TEAM1_ncopp_pyth_rnk (lower-seeded team's opposition rank) - 0.035
    7. TEAM1_rpi (lower-seeded team's RPI) - 0.031
    8. TEAM1_oppd_rnk (lower-seeded team's opposition defensive rank) - 0.029
    9. TEAM2_ncopp_pyth (higher-seeded team's opposition rank) - 0.027
    10. TEAM1_adjo (lower-seeded team's opponents' average adjusted offensive efficiency) - 0.026

Details for the metrics can be found at kenpom.com.  The numbers reported are the Random Forest importance metrics.
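For reference, here is a minimal sketch of the training, cross-validation, and importance-ranking steps above, written against the current scikit-learn module layout.  The WINNER column name (1 = Team1, 2 = Team2) and the number of trees are my assumptions:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    data = pd.read_csv("ncaa_features.csv")
    X = data.drop("WINNER", axis=1)   # everything except the label
    y = data["WINNER"]                # 1 or 2: which of Team1/Team2 won

    forest = RandomForestClassifier(n_estimators=200, random_state=0)

    # 5-fold cross-validated accuracy
    print("mean accuracy:", cross_val_score(forest, X, y, cv=5).mean())

    # Fit on all of the data and rank the features by importance
    forest.fit(X, y)
    importances = pd.Series(forest.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False).head(10))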

Everything here looks reasonable to me; I would expect most of these columns to be important.  The complete absence of the "TEAM" column from this list means that the Random Forest is not massively biasing towards individual teams (e.g. Duke, UNC, and Kansas).

Prediction

Okay folks, here is where the rubber meets the proverbial road.  I set up an input file with the 2013 features formatted exactly like the previous years' features and asked the Random Forest to tell me whether Team1 or Team2 would be the winner.  So far I have only processed the first round; obviously each round builds on the previous one, so errors will multiply.  Without further ado (a short code sketch of this prediction step appears after the list):

North Carolina A&T (16) vs Liberty (16) winner North Carolina A&T with 78% certainty

Colorado St. (8) vs Missouri (9) winner Colorado St. with 73% certainty

Oklahoma St. (5) vs Oregon (12) winner Oklahoma St. with 59% certainty

St. Louis (4) vs New Mexico St. (13) winner St. Louis with 52% certainty

Middle Tennessee (11) vs St. Mary's (11) winner Middle Tennessee with 56% certainty

Michigan St. (3) vs Valparaiso (14) winner Michigan St. with 84% certainty

Kansas (1) vs Western Kentucky (16) winner Kansas with 96% certainty

Creighton (7) vs Cincinnati (10) winner Creighton with 90% certainty

Duke (2) vs Albany (15) winner Duke with 97% certainty

North Carolina (8) vs Villanova (9) winner North Carolina with 52% certainty

Virginia Commonwealth (5) vs Akron (12) winner Virginia Commonwealth with 74% certainty

Michigan (4) vs South Dakota St. (13) winner Michigan with 60% certainty

UCLA (6) vs Minnesota (11) winner UCLA with 56% certainty

Florida (3) vs Northwestern St. (14) winner Florida with 68% certainty

San Diego St. (7) vs Oklahoma (10) winner San Diego St. with 62% certainty

Georgetown (2) vs Florida Gulf Coast (15) winner Georgetown with 51% certainty

Long Island (16) vs James Madison (16) winner Long Island with 65% certainty

Gonzaga (1) vs Southern (16) winner Gonzaga with 94% certainty

Pittsburgh (8) vs Wichita St. (9) winner Pittsburgh with 55% certainty

North Carolina St. (8) vs Temple (9) winner North Carolina St. with 67% certainty

Wisconsin (5) vs Mississippi (12) winner Wisconsin with 76% certainty

Nevada Las Vegas (5) vs California (12) winner Nevada Las Vegas with 50% certainty

Syracuse (4) vs Montana (13) winner Syracuse with 53% certainty

Butler (6) vs Bucknell (11) winner Butler with 56% certainty

Boise St. (13) vs La Salle (13) winner Boise St. with 60% certainty

Marquette (3) vs Davidson (14) winner Marquette with 51% certainty

Illinois (7) vs Colorado (10) winner Illinois with 65% certainty

Arizona (6) vs Belmont (11) winner Arizona with 58% certainty

Miami FL (2) vs Pacific (15) winner Miami FL with 95% certainty

New Mexico (3) vs Harvard (14) winner New Mexico with 92% certainty

Notre Dame (7) vs Iowa St. (10) winner Notre Dame with 58% certainty

Ohio St. (2) vs Iona (15) winner Ohio St. with 61% certainty
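Before moving on to the analysis, here is a rough sketch of how picks and certainty numbers like the ones above can be produced.  The 2013 file names are assumptions, the team names are assumed to be kept in a parallel, un-encoded file for printing, and predict_proba is simply one way to get the class probabilities that the certainties are based on:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    train = pd.read_csv("ncaa_features.csv")
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(train.drop("WINNER", axis=1), train["WINNER"])

    games_2013 = pd.read_csv("ncaa_2013_features.csv")  # same feature columns as the training data
    names_2013 = pd.read_csv("ncaa_2013_names.csv")     # TEAM1/TEAM2 names kept un-encoded for printing

    proba = forest.predict_proba(games_2013)  # column order follows forest.classes_, i.e. [1, 2]
    for i, p in enumerate(proba):
        pick = 0 if p[0] >= p[1] else 1       # 0 -> Team1 wins, 1 -> Team2 wins
        team = names_2013.iloc[i]["TEAM1" if pick == 0 else "TEAM2"]
        print("winner {} with {:.0%} certainty".format(team, p[pick]))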

Analysis

The first thing I noticed was that the lower seed (i.e. the favorite) was always picked to win, which on the face of it was disappointing because I was hoping to be able to predict upsets (see Future Work below).  However, after I looked at the numbers I started to see some interesting things, namely that the certainty estimates should be interpreted as classification probabilities, and therefore anything near 50% means that the algorithm was not very sure of its pick.  When you look at all the games at or below 55% certainty, some interesting trends emerge.  Here are those predictions, culled from the list above:

St. Louis (4) vs New Mexico St. (13) winner St. Louis with 52% certainty

North Carolina (8) vs Villanova (9) winner North Carolina with 52% certainty

Georgetown (2) vs Florida Gulf Coast (15) winner Georgetown with 51% certainty

Pittsburgh (8) vs Wichita St. (9) winner Pittsburgh with 55% certainty

Nevada Las Vegas (5) vs California (12) winner Nevada Las Vegas with 50% certainty

Syracuse (4) vs Montana (13) winner Syracuse with 53% certainty

Marquette (3) vs Davidson (14) winner Marquette with 51% certainty

Some things pop right out when looking at this sample:

  • The UNLV vs Cal game is an absolute toss-up according to the algorithm, which is a bit surprising for a 5 vs 12 matchup.
  • The Big East is not looking strong in the eyes of this algorithm; Georgetown, Syracuse, and Marquette are all barely favored to win, and those are a 2, a 3, and a 4 seed!
  • The presence of 8 vs 9 games should surprise no one, as those are supposed to be toss-ups; actually, the fact that only two of the four 8 vs 9 games make it into the toss-up column is kind of interesting and may suggest that the seedings are only right about half the time.

Future Work

Here are some directions for expanding the current algorithm and data, along with other interesting analyses that could be done on this dataset.

  • Add in individual regular season game data instead of using the kenpom.com rolled up data
  • Add in individual player data and injury data
  • Gather data further back in time (before 2004) to expand the dataset
  • Do an upset analysis, training the Random Forest with an equal mix of upsets (+1) and non-upsets (-1), and use that model to look specifically for upsets (a rough sketch of this follows the list)
  • Time slice the data set to see if there are temporal effects from the game changing over time (rules changes, more players going to the NBA early, etc.)
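On that upset idea, here is one way the class rebalancing might look.  The UPSET definition (the nominally worse-seeded Team2 winning, which ignores the play-in games where the seeds are equal), the 0/1 labels in place of the ±1 above, and the column names are all my assumptions:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    data = pd.read_csv("ncaa_features.csv")

    # Label: 1 if the nominal underdog (Team2, the worse seed) won, else 0
    data["UPSET"] = (data["WINNER"] == 2).astype(int)

    # Downsample the non-upsets so the classes are an equal mix
    upsets = data[data["UPSET"] == 1]
    non_upsets = data[data["UPSET"] == 0].sample(n=len(upsets), random_state=0)
    balanced = pd.concat([upsets, non_upsets])

    X = balanced.drop(["WINNER", "UPSET"], axis=1)
    y = balanced["UPSET"]
    upset_forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)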

Stay Tuned!

I am going to run the full bracket simulation this week as my time allows, and I'll try to rerun it as the results come in.  In the meantime, happy bracket picking and tournament watching!

Updated Full Bracket

[Image: Round 2 editable & printable Excel bracket]

Update 2, Round 3:

Louisville (1) vs Colorado St. (8) winner Louisville with 95% certainty

St. Louis (4) vs Oregon (12) winner St. Louis with 60% certainty

Michigan St. (3) vs Memphis (6) winner Michigan St. with 51% certainty

Duke (2) vs Creighton (7) winner Duke with 90% certainty

Gonzaga (1) vs Wichita St. (9) winner Gonzaga with 90% certainty

Mississippi (12) vs La Salle (13) winner Mississippi with 58% certainty

Arizona (6) vs Harvard (14) winner Arizona with 75% certainty

Ohio St. (2) vs Iowa St. (10) winner Ohio St. with 65% certainty

Indiana (1) vs Temple (9) winner Indiana with 55% certainty

Syracuse (4) vs California (12) winner Syracuse with 58% certainty

Marquette (3) vs Butler (6) winner Marquette with 54% certainty

Miami FL (2) vs Illinois (7) winner Miami FL with 93% certainty

Kansas (1) vs North Carolina (8) winner Kansas with 92% certainty

Michigan (4) vs Virginia Commonwealth (5) winner Michigan with 65% certainty

Florida (3) vs Minnesota (11) winner Florida with 63% certainty

San Diego St. (7) vs Florida Gulf Coast (15) winner San Diego St. with 64% certainty