May 15, 2008

Elo Ratings

The Elo rating system is a method of ranking players or teams in sports and games. It only considers wins and losses, and it ignores margin of victory. The system was originally created to rate international chess players by Arpad Elo, a physics professor who was himself a master chess player.

In a nutshell, the system estimates the probability that one opponent will beat another. If a player wins more often than expected, his rating improves, and vice versa. The algorithm needs to start with a prior expectation of how good each player (or team) is. Then, as the players complete matches, their ratings are adjusted upwards or downwards based on who won. The size of each adjustment is based on how surprising the result was. For example, if a grandmaster beats a novice, his rating would hardly budge, but if the novice beat the grandmaster, both ratings would move significantly.

The actual algorithm is based on the function below. EA is the expected win probability of player A. RA is player A's rating, and RB is player B's rating.

EA = 1 / (1 + 10^((RB - RA) / 400))

After a game between opponents A and B, player A's new rating (R'A) is revised as:

R'A = RA + K * (SA - EA)

where K is the maximum size of adjustment, and SA is the actual result of the match (1 for a win, 0 for a loss). The K value has traditionally been 32 for chess, but it can be adjusted to tailor the system to various other games and sports. Ratings are typically set to have an average of 1500, but this is arbitrary and can also be adjusted.

For example, if player A's rating is 1655 and player B's rating is 1500, then according to Elo's function the probability A would beat B is about 0.71. If player A defeats player B, then the actual outcome is 1.00. Player A's new rating would be:

R'A = 1655 + 32 * (1.00 - 0.71) = 1664
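
For anyone who would rather check the arithmetic in code, here is a minimal Python sketch of the expected-probability function and the rating update. It assumes the standard Elo curve (base 10, scale of 400) and K = 32. The function names are mine and purely illustrative.

```python
def expected_score(rating_a, rating_b):
    """Expected win probability of A against B (standard Elo curve, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def updated_rating(rating_a, rating_b, result_a, k=32):
    """New rating for A; result_a is 1.0 for a win and 0.0 for a loss."""
    return rating_a + k * (result_a - expected_score(rating_a, rating_b))


# The example from the text: A is rated 1655, B is rated 1500, and A wins.
e_a = expected_score(1655, 1500)
new_a = updated_rating(1655, 1500, 1.0)
print(round(e_a, 2), round(new_a))
```

Calling updated_rating with result_a = 0.0 shows how much more the favorite would lose after an upset than it gains from an expected win.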

One interesting way to look at the ratings is to create a generic win probability. By using the Elo algorithm to compute the expected win probability against a notional average rating, we can get a sense of each team's expected winning percentage.
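
As a rough sketch of that idea, the Python below computes a generic win probability against a notional average team. It assumes the average rating is 1500 and uses the standard Elo curve. The three ratings are taken from the 2007 table later in this post, and the function name is just for illustration.

```python
AVERAGE_RATING = 1500  # assumed notional league-average rating


def generic_win_probability(rating, average=AVERAGE_RATING):
    """Probability of beating an average-rated opponent (standard Elo curve)."""
    return 1.0 / (1.0 + 10 ** ((average - rating) / 400.0))


# Elo ratings for three teams from the 2007 table in this post.
for team, rating in [("NE", 2315), ("WAS", 1656), ("MIA", 946)]:
    print(team, round(generic_win_probability(rating), 2))
```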

Sagarin's Application of Elo

Jeff Sagarin uses a version of the Elo system to create NFL team ratings. He transforms them to produce ratings that are predictive of a game's point spread. So the difference between two opponents' ratings, plus an adjustment for home field advantage, predicts the margin of victory. Sagarin's adjustment is a straightforward linear transformation of the original Elo system, as you can tell from the graph below. (I suspect Sagarin may over-weight recent games, however.)



Elo Mimicked

Using the same method as I described in my last post, we can mimic Elo ratings. That method computed team ratings based on margin of victory from each game. Instead of using margin of victory we can simply replace the score of each game with a 1 or 0 based on who won. Then we can solve for the ratings that best estimate the game outcomes. Because the ratings are linear we can transform them into individual game probabilities or generic win probabilities using a logistic transformation:


These rating systems can be adapted for any type of game or sport. Recently, on-line games have been using similar algorithms to rank players. The primary advantage to this type of system is that it discounts victories over very weak opponents. Often players will set up phony opponents to beat in order to inflate their own scores.

To get a sense of what these rankings would look like for the most recent (2007) NFL season, the table below lists several ratings for each team. The Elo column lists the ratings I derived from the actual Elo algorithm. The Sagarin column lists Jeff Sagarin's version of Elo--his final 2007 season ratings. Lastly, based on the Elo algorithm, the win probability column lists the probability each team would beat a league-average team on a neutral site. All ratings include results from the playoffs and Super Bowl.

Team   Elo    Sagarin   Win Prob
NE     2315   36.2      0.99
DAL    1913   29.89     0.92
NYG    1877   32.35     0.90
GB     1865   28.97     0.89
SD     1836   28.46     0.87
IND    1807   27.23     0.85
JAX    1692   25.62     0.75
WAS    1656   23.44     0.71
PHI    1624   23.63     0.67
TEN    1573   22.63     0.60
DET    1553   21.47     0.58
MIN    1523   22.22     0.53
HOU    1523   20.25     0.53
TB     1512   19.82     0.52
DEN    1490   19.82     0.49
CHI    1480   21.61     0.47
CAR    1453   17.93     0.43
SEA    1440   20.43     0.41
PIT    1438   18.64     0.41
NO     1429   17.42     0.40
CLE    1413   18.71     0.38
BUF    1384   18.24     0.34
ARI    1374   16.45     0.33
OAK    1309   14.34     0.25
CIN    1284   14.83     0.22
ATL    1262   13.34     0.20
KC     1254   14.71     0.20
BAL    1240   12.56     0.18
SF     1231   12.49     0.18
NYJ    1198   11.98     0.15
STL    1103   9.54      0.09
MIA    946    4.8       0.04

May 9, 2008

Homemade Sagarin Ratings

Since the early 1970s, Jeff Sagarin has been publishing sports team ratings. For most sports, including the NFL, his ratings are calculated so that the difference between two opponents' ratings, plus a home field adjustment, forecasts a game's point spread. His ratings are widely recognized as some of the best around. They can be found every week of the NFL season on the USA Today site. Sagarin has never published his exact algorithms, but we can easily build a very good facsimile.

Excel has a powerful tool called "Solver." It's one of those thousand or so features that Microsoft packs into its Office products that no one ever knows about. In fact, you don't even see it on the Tools menu until you enable it from the "Tools|Add Ins..." command.

If you go to Microsoft's on-line help site for Solver, the example problem provided is an exercise estimating point spreads for NFL games. The sample spreadsheet is for all the game scores from the 2002 season.

Basically all you do is create a table of ratings for each team. The ratings don't have to mean anything yet. For now they can be your best guess, or all ones, or anything. Solver will calculate them later. Then for each game, you calculate what the ratings suggest the point spread should be. The ratings are intended to work just like Jeff Sagarin's ratings. If team A's rating is 5 and team B's rating is 8, then when team A plays team B the point spread should be 3 in favor of team B. Factor in a league-wide value for home field advantage, say 3, and the spread becomes 6 if team B is at home.

Next, using the LOOKUP function to grab the ratings from the table, you calculate the error between the expected spread and the actual result for each game. Square the error (as every good statistician would). In a cell, sum all the squared errors for all the games in the season. In another cell, enter a point value for home field advantage--3 points is a good initial guess. Solver takes over from here.


In the Solver dialog box, you tell it to minimize the value in the cell for the sum of squared errors. Then you tell it to do so by varying the values in the table for the team ratings and the cell for the home field advantage. (You can also add in a constraint that says the average for all the teams' ratings should be zero, so that good teams will have positive ratings and poor teams will have negative ratings.)

Solver will compute the team ratings necessary to best fit the actual point spreads. And now you have your very own homemade Sagarin ratings.
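
The same fit can be sketched outside of Excel. The Python below is only an illustration under stated assumptions: the four games are made-up scores, fit_ratings is a hypothetical helper, and plain gradient descent stands in for whatever iteration Solver actually performs. It minimizes the same sum of squared spread errors by varying the team ratings and the home field value, then centers the ratings on zero.

```python
def fit_ratings(games, teams, steps=2000, lr=0.1):
    """Fit ratings and a home field value by minimizing squared spread error.

    games: list of (home, away, home_points, away_points) tuples.
    steps and lr are tuning assumptions; real schedules may need other values.
    """
    ratings = {t: 0.0 for t in teams}  # arbitrary starting guess
    hfa = 3.0                          # initial guess for home field advantage
    n = len(games)
    for _ in range(steps):
        grad = {t: 0.0 for t in teams}
        grad_hfa = 0.0
        for home, away, home_pts, away_pts in games:
            predicted = ratings[home] - ratings[away] + hfa
            err = predicted - (home_pts - away_pts)
            # Gradient of the mean squared error for this game.
            grad[home] += 2 * err / n
            grad[away] -= 2 * err / n
            grad_hfa += 2 * err / n
        for t in teams:
            ratings[t] -= lr * grad[t]
        hfa -= lr * grad_hfa
    # Center the ratings so the league average is zero.
    mean = sum(ratings.values()) / len(ratings)
    return {t: r - mean for t, r in ratings.items()}, hfa


# Made-up results: (home team, away team, home points, away points).
TEAMS = ["A", "B", "C"]
GAMES = [("A", "B", 28, 20), ("B", "C", 24, 16),
         ("C", "A", 17, 24), ("A", "C", 30, 17)]

ratings, hfa = fit_ratings(GAMES, TEAMS)
print({t: round(r, 2) for t, r in ratings.items()}, round(hfa, 2))
```

Because only rating differences enter the predicted spread, adding the same constant to every rating leaves the fit unchanged, so the league average can be moved to any value you like.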

I noticed that Sagarin's average rating is 20 instead of 0, which makes sense because the average NFL score is about 20 points. So I altered the Solver constraint accordingly. For the 2007 season, including the playoffs, the homemade ratings were nearly identical to Sagarin's.


Comparison of Sagarin and MS Solver Team Ratings for 2007

Team   Sagarin   MS Solver
NE     36.4      37.7
IND    30.2      30.6
SD     30.0      28.8
GB     29.4      28.9
NYG    28.3      25.2
DAL    28.0      28.2
JAX    27.9      26.1
PHI    24.9      24.8
PIT    23.6      24.6
MIN    22.8      23.6
WAS    22.2      22.7
SEA    22.0      22.0
TEN    21.8      20.2
CHI    20.5      21.2
HOU    20.3      19.7
TB     19.8      20.5
CLE    19.4      18.8
DEN    17.5      15.8
DET    17.2      16.7
BUF    16.4      20.0
CIN    16.2      17.6
NO     16.1      17.4
ARI    16.0      16.2
OAK    15.7      14.3
NYJ    15.5      16.1
KC     15.2      15.1
CAR    15.0      14.3
MIA    13.2      11.7
BAL    13.0      14.6
ATL    10.2      9.9
SF     8.3       8.9
STL    7.1       7.8





The differences between Sagarin's and our homemade ratings may come from the method of solving. Solver uses a "brute force" numerical iteration method, and Sagarin's method is unknown. Sagarin may also weight recent games more heavily. Notice how the difference in the Giants' rating is one of the more significant. The Giants finished the 2007 season on quite a win streak.

Doug Drinen of Pro-Football-Reference.com discusses a very similar method for ranking teams based on margin of victory which he calls the Simple Rating System (SRS). His post includes a good discussion on the advantages and disadvantages of a pure margin of victory ranking system.

Sagarin actually uses two different systems. One is called Pure Points, which is based solely on point differential. This is the system which the method discussed above mimics. His other method is called Elo Chess, which considers only wins and losses, and ignores points. This system is based on a method devised to rate chess players by Arpad Elo, a physics professor and master chess player. In the next post I'll demonstrate how to mimic the Elo ratings.

May 5, 2008

The Ellsberg Paradox and 4th Down

The Romer paper and other research provide fairly conclusive evidence that NFL coaches should go for it on 4th down more often than they currently do. The Ellsberg Paradox might help explain why.

Say there are two jars of 100 balls of which some are red and some are blue. Jar A has 50 red balls and 50 blue balls. Jar B has a random unknown mix of red and blue balls. You'll be given $100 if you pick a red ball from a jar. Which jar would you choose to pick from?

In clinical experiments, people almost universally choose jar A. This is the Ellsberg Paradox, a violation of expected utility theory in economics. The expected value of each choice is equal. There is a 50/50 chance of winning $100 from either jar, so we wouldn't expect one option to be significantly preferable to the other.

The Ellsberg Paradox demonstrates the difference between risk and uncertainty. Risk is measurable but uncertainty is not. People almost always prefer a known risk to an unknown uncertainty, even if the expected results are equal.

People prefer Jar A according to the equation above. U() is the utility function.

Punting seems a lot like Jar A, for which the risks and potential outcomes are known. Going for the first down seems more like Jar B, for which the potential outcomes are vague and hard to measure. So at the equilibrium point between going for it and punting, where each decision provides equal chances of ultimately winning, coaches would be heavily biased toward punting. Even beyond the equilibrium point, where going for it would be favorable, coaches would still be biased toward the relatively certain (but less favorable) outcome of the standard 40-net-yard punt.

In a strict analogy, the $100 would be a win, and the red balls would represent the probability of winning the game. There would actually be some uncertainty in each strategy, but far more uncertainty in the go-for-it strategy--perhaps something like 40 to 60 red balls in the punt jar and 20 to 80 red balls in the go-for-it jar. The Ellsberg Paradox suggests coaches would naturally prefer punting, the less uncertain option. Only when the advantage of going for it is beyond obvious would a coach choose to go for the 1st down--say 10 to 20 red balls for punting and 15 to 60 red balls for going for it.

I think NFL coaches typically employ the maximin strategy. In game theory the maximin strategy is one that selects the alternative with the best worst-case-scenario. It maximizes the minimum possible payoff. This is a conservative strategy in comparison to the maximax strategy, which selects the alternative with the greatest maximum payoff.

Continuing the jar and red ball analogy, compare jar X with 10 to 90 red balls and jar Y with 30 to 40 red balls. Utility theory would suggest the rational option is jar X, with its higher overall chance of success. The maximin choice, however, would be jar Y because it has a higher minimum chance of success.
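
The jar X versus jar Y comparison can be spelled out in a few lines of Python. This sketch assumes every red-ball count within a jar's range is equally likely, which the example above does not actually specify. The dictionary and variable names are illustrative only.

```python
# (minimum, maximum) number of red balls out of 100 in each jar.
jars = {"X": (10, 90), "Y": (30, 40)}


def expected_chance(low, high):
    """Average chance of drawing red, assuming a uniform spread over the range."""
    return (low + high) / 2 / 100


# Expected-value rule: pick the jar with the best average chance.
expected_choice = max(jars, key=lambda j: expected_chance(*jars[j]))
# Maximin rule: pick the jar with the best worst case.
maximin_choice = max(jars, key=lambda j: jars[j][0])
# Maximax rule: pick the jar with the best best case.
maximax_choice = max(jars, key=lambda j: jars[j][1])

print(expected_choice, maximin_choice, maximax_choice)
```

Under these assumptions the expected-value and maximax rules both land on jar X, while maximin alone prefers jar Y.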

Calculating the probability distributions of a football game's outcome given the combinations of score, time remaining, field position, etc. is far more complex than being told how many red balls are in a jar. It would be overwhelming for a human brain even to attempt it. In such a situation, coaches, like everyone else, use heuristic shortcuts such as the maximin strategy. Punting on every 4th down is a known risk, especially because coaches can count on opposing coaches to follow the same strategy (which suggests that always punting is a Nash Equilibrium). Punting usually presents the best worst-case-scenario despite being a sub-optimum decision.

Apr 26, 2008

How Draft Experts Are Graded

It seems that everyone has a mock draft board these days. The object of this parlor game seems to be to predict which players will be picked by each team. So draft gurus tend to be judged by how many correct predictions they make. When you think about it, it's pretty ridiculous. A draft expert has to get four things right. He needs to not only evaluate players and team needs, but also evaluate all 32 teams' own perceptions of each player and of their own needs. And if one prediction at the top of the first round is off the mark, the house of cards collapses.

I think a much better way to evaluate draft experts is to wait several years, then see which players actually turned out to be more productive. Look at how they rated each player, not how well they read the minds of the league's GMs.

Besides, in my eyes the real value of a draft expert is being able to tell me in 30 seconds everything there is to know about that strong safety from Alcorn State my favorite team just picked up midway through the 4th round...without any notes or any hesitation...and having a full head of hair. And for that, there is only one man.