I've updated my algorithm with the 2013 data and rerun the learning and prediction. Here are the predictions for the 1st round; each percentage is the predicted chance that the first-listed team wins:
- Florida (1) vs Albany (16) 81.8%
- Colorado (8) vs Pittsburgh (9) 42.3%
- VCU (5) vs SF Austin (12) 51.6%
- UCLA (4) vs Tulsa (13) 62.7%
- X Ohio St (6) vs Dayton (11) 56.1%
- Syracuse (3) vs W. Michigan (14) 90.4%
- New Mexico (7) vs Stanford (10) 75.9%
- Kansas (2) vs E Kentucky (15) 73.8%
- Virginia (1) vs Coastal Car (16) 83.9%
- Memphis (8) vs GW (9) 55.0%
- X Cincinnati (5) vs Harvard (12) 60.4%
- Michigan St (4) vs Delaware (13) 68.9%
- N Carolina (6) vs Providence (11) 53.0%
- Iowa St (3) vs NC Central (14) 78.4%
- Connecticut (7) vs St Joseph's (10) 69.4%
- Villanova (2) vs Milwaukee (15) 93.7%
- Arizona (1) vs Weber St (16) 96.1%
- Gonzaga (8) vs Oklahoma St (9) 72.8%
- Oklahoma (5) vs N Dakota St (12) 46.4%
- San Diego St (4) vs New Mexico St (13) 59.6%
- Baylor (6) vs Nebraska (11) 68.9%
- Creighton (3) vs ULL (14) 83.8%
- X Oregon (7) vs BYU (10) 46.1%
- Wisconsin (2) vs American (15) 70.9%
- Wichita St (1) vs Texas Southern (16) 68.5%
- Wichita St (1) vs Cal Poly (16) 67.6%
- Kentucky (8) vs Kansas St (9) 68.5%
- Saint Louis (5) vs NC State (12) 58.4%
- Louisville (4) vs Manhattan (13) 73.7%
- Massachusetts (6) vs Iowa (11) 62.5%
- Massachusetts (6) vs Tennessee (11) 46.2%
- Duke (3) vs Mercer (14) 84.5%
- Texas (7) vs Arizona St (10) 66.2%
- Michigan (2) vs Wofford (15) 86.9%
In terms of upsets:
- Colorado vs Pittsburgh is being picked as an upset; Pittsburgh is favored by the algorithm by almost 8 percentage points.
- Oregon vs BYU is also being picked as an upset, though only by 4 percentage points.
- If Tennessee wins the play-in game, they are picked to upset UMass by 4 percentage points.
Other close games to watch:
- VCU vs Stephen F Austin is a toss-up; VCU is favored by less than 2 percentage points.
- Memphis vs George Washington is a marginal game, with Memphis favored by 5 percentage points.
- North Carolina vs Providence is a toss-up; UNC is favored by only 3 percentage points.
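The margins quoted in the two lists above are distances from 50%, e.g. Colorado at 42.3% puts Pittsburgh ahead by 7.7 points. Here is a minimal sketch of that bookkeeping; the `classify` helper, its labels, and the 5-point cutoff for a "close" game are my own assumptions for illustration, not part of the actual algorithm.

```python
# Predicted chance (in percent) that the first-listed team wins,
# copied from a few rows of the prediction list above.
predictions = [
    ("Colorado", "Pittsburgh", 42.3),
    ("VCU", "SF Austin", 51.6),
    ("Memphis", "GW", 55.0),
    ("N Carolina", "Providence", 53.0),
    ("Oregon", "BYU", 46.1),
    ("Florida", "Albany", 81.8),
]

# Assumed cutoff: games within this many percentage points of 50% count as close.
CLOSE_MARGIN = 5.0


def classify(first_team_pct, close_margin=CLOSE_MARGIN):
    """Return (label, margin) where margin is the distance from 50% in points."""
    margin = round(abs(first_team_pct - 50.0), 1)
    if first_team_pct < 50.0:
        label = "upset"      # the second-listed team is favored
    elif margin <= close_margin:
        label = "close"      # first-listed team favored, but only barely
    else:
        label = "favorite"   # first-listed team clearly favored
    return label, margin


for first, second, pct in predictions:
    label, margin = classify(pct)
    print(f"{first} vs {second}: {label}, margin {margin} points")
```

Changing `CLOSE_MARGIN` moves games between "close" and "favorite" without affecting which games count as upsets.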
I'll add more as I get my analysis going...
Here is a plot to show I seem to be on the right track:
This plot shows the win probability of the higher-numbered seed (e.g. 16) versus the win total of the lower-numbered seed (e.g. 1). As the higher-numbered seed's win total rises, the probability that the lower-numbered seed will win decreases. The shading is 1 sigma within each bin; there are 7 bins in win-probability space.
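For anyone who wants to build a similar band plot, the binning can be sketched roughly as follows. This assumes equal-width bins over [0, 1] and a population standard deviation for the 1-sigma band; `bin_summary` is a hypothetical helper, and the exact binning behind the plot may differ.

```python
from statistics import mean, pstdev


def bin_summary(win_probs, win_totals, n_bins=7):
    """Group win totals into equal-width win-probability bins.

    Returns one (bin_center, mean_wins, sigma_wins) tuple per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for prob, wins in zip(win_probs, win_totals):
        # A probability of exactly 1.0 is placed in the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append(wins)

    summary = []
    for i, values in enumerate(bins):
        if values:
            center = (i + 0.5) / n_bins
            summary.append((center, mean(values), pstdev(values)))
    return summary
```

Plotting `mean_wins` against `bin_center` and shading between `mean_wins - sigma_wins` and `mean_wins + sigma_wins` gives the kind of band described above.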
I got a great comment from my old friend Robert asking me to compare to some other models. He is absolutely right; my only excuse is that this is just a quick analytic side project for me, so I just didn't get around to it till he asked.
Anyway, to compare to other models, we need other models. The simplest naive model is randomly guessing who will win, which is right about 50% of the time, so a simple percent correct (accuracy) already is a comparison against it. In this case we are predicting roughly 20 to 25 percentage points better than such a dart-throwing model (75% is 25 points better than 50%).
Now, this random model is overly naive. We do have more information, namely the seeding in the tournament, so another good comparison is the "higher seed wins" model, where we always choose the better-seeded team (the one with the lower seed number) to win. I crunched the numbers on that, and it turns out that it predicts at an accuracy of 68.5% (since 2004). So we are doing considerably better than that model. I say considerably because it is worth noting that a LOT of data goes into the seeding (including crowd sourcing, polls, records, etc.), so the seeding SHOULD be good (or the selection committee is not doing its job), and every percentage point above the seeding is hard fought. I will also note that seeding is an input to this model, so we SHOULD do at least as well as the higher seed model.
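The "higher seed wins" baseline is simple enough to sketch in a few lines. The game tuples and team names below are made-up placeholders just to show the mechanics, not the historical data behind the 68.5% figure, and `accuracy` and `better_seed_picks` are hypothetical helper names.

```python
def accuracy(picks, winners):
    """Fraction of games where the pick matches the actual winner."""
    correct = sum(1 for pick, winner in zip(picks, winners) if pick == winner)
    return correct / len(winners)


def better_seed_picks(games):
    """Always pick the team with the smaller seed number.

    Each game is (team_a, seed_a, team_b, seed_b). If the seeds are equal
    (which can happen in a play-in game), this sketch arbitrarily picks team_b.
    """
    return [a if seed_a < seed_b else b for (a, seed_a, b, seed_b) in games]


# Placeholder games and results, for illustration only.
games = [
    ("A", 1, "B", 16),
    ("C", 8, "D", 9),
    ("E", 5, "F", 12),
    ("G", 6, "H", 11),
]
winners = ["A", "D", "E", "H"]

baseline = accuracy(better_seed_picks(games), winners)
print(f"higher seed wins baseline: {baseline:.1%}")
```

Running the same `accuracy` function on the algorithm's own picks gives a number that can be compared directly against this baseline and against the 50% expected from random guessing.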