Jul 13, 2007

What Makes Teams Win? 3

This is the third part of a four-part article discussing the relative importance of factors in winning NFL games. Part 1 is here and Part 2 is here.

CORRELATION SUMMARY

So far we’ve analyzed each phase of the game and its statistical connection with regular season team wins. Below is a table that lists the relevant statistics and their correlations. The table is sorted in order of absolute strength of correlation.













Stat                 Win Correlation
Off Pass Yds/Att      0.61
Def Pass Yds/Att     -0.47
Off Fumble Rate      -0.46
Off Int Rate         -0.45
Def FFumble Rate     -0.41
Def Int Rate          0.39
Off Pen Rate         -0.37
Off Run Yds/Att       0.18
Def Run Yds/Att      -0.04


The table is presented graphically below. Negative coefficients, such as defense pass efficiency, are shown as positive values to make it easier to compare each variable's relative importance.
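As a minimal sketch of how that ordering works, the snippet below ranks the correlations from the table by absolute value, so that negative and positive coefficients can be compared on the same scale. The variable names are illustrative only.

```python
# Win correlations copied from the table above.
correlations = {
    "Off Pass Yds/Att": 0.61,
    "Def Pass Yds/Att": -0.47,
    "Off Fumble Rate": -0.46,
    "Off Int Rate": -0.45,
    "Def FFumble Rate": -0.41,
    "Def Int Rate": 0.39,
    "Off Pen Rate": -0.37,
    "Off Run Yds/Att": 0.18,
    "Def Run Yds/Att": -0.04,
}

# Rank by magnitude, ignoring sign, the same way the chart displays them.
ranked = sorted(correlations.items(), key=lambda kv: abs(kv[1]), reverse=True)

for stat, r in ranked:
    print(f"{stat:<18} {abs(r):.2f}")
```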

The relative importance of each aspect of the game begins to come into focus. Passing is most important, followed by turnovers, then penalties and running. For every aspect, the correlation on the offensive side of the ball is stronger than on the defensive side.

But this isn’t the final word. Each correlation coefficient is calculated for one stat in isolation, so it ignores the effect of all the other stats.

REGRESSION

To take all facets of the game into account simultaneously and produce a valid model of winning NFL games, we can use linear regression to estimate coefficients for each stat. The relative value of the coefficients will reveal the relative importance of each phase of the game, holding all other variables constant. This will yield estimates that are purer and more accurate than simple correlations.

The dependent variable of the regression model is regular season wins. The independent variables are the efficiency stats I’ve previously outlined. The data set continues to be all 32 teams over the past 5 seasons for a total of 160 observations. The results of the regression are detailed below.
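For readers who want to see the mechanics, here is a rough sketch of an ordinary least-squares fit in plain Python. It does not reproduce the model in this article: the data is synthetic, generated from made-up coefficients (5.0, 1.5, -1.0) for two stats only, and the function name `ols` is hypothetical. It only shows that regression recovers the weights that best explain wins.

```python
import random

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y.

    X is a list of rows; a leading 1.0 in each row yields the constant term.
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(A[idx][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for other in range(k):
            if other != col:
                factor = A[other][col] / A[col][col]
                A[other] = [a - factor * b for a, b in zip(A[other], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Synthetic data (NOT real NFL data): 160 made-up team-seasons whose wins
# follow an assumed rule exactly, so the fit should recover that rule.
random.seed(0)
rows, wins = [], []
for _ in range(160):
    off_pass = random.uniform(5.0, 8.0)   # hypothetical yds/att on offense
    def_pass = random.uniform(5.0, 8.0)   # hypothetical yds/att allowed
    rows.append([1.0, off_pass, def_pass])
    wins.append(5.0 + 1.5 * off_pass - 1.0 * def_pass)

const, b_off, b_def = ols(rows, wins)
print(f"const={const:.2f}  off={b_off:.2f}  def={b_def:.2f}")
```

Real data is noisy, so the recovered coefficients would be estimates rather than exact values, which is why significance and R-squared are reported alongside them.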
















VARIABLE       COEFFICIENT
const            5.31
O Pass Eff       1.43
D Pass Eff      -1.65
O Int Rate     -53.50
D Int Rate      81.70
O Fum Rate     -49.10
D FF Rate       70.90
O Run Eff        1.00
D Run Eff       -0.55
Pen Rate        -2.73

R-squared        0.802


Each of the independent variables is statistically significant at the 0.05 level or better, except defensive run efficiency, which is significant at 0.06. The R-squared value indicates an extremely good overall fit for the model: about 80% of the variance in team wins can be explained by the included variables. The remaining 20% could be due to any number of factors, but we have to accept that outcomes in any sport are partly due to luck.

Using the regression results we can estimate a team’s expected wins using a linear equation. Here is what the equation would look like:

Wins = 5.31 + (1.43 * O Pass Eff) + (-1.65 * D Pass Eff) + …

The regression coefficients are stated in terms of wins per unit of the variable. For example, the coefficient for offensive pass efficiency (yds/att) is 1.43. So for every 1 yard per attempt of improvement in pass efficiency, a team can expect 1.43 additional wins. When coefficients are stated this way, it is very easy to estimate the effect on the dependent variable (wins) of a change in one of the independent variables. But it is very difficult to get a sense of the relative importance of each variable. Defensive forced fumble rates are certainly not 70 times more important than offensive run efficiency.
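The sketch below illustrates that interpretation using the coefficients from the regression table. The team's stat values are hypothetical placeholders, not real data, and the units of the rate stats are an assumption, so only the difference between the two predictions is meaningful here.

```python
# Unstandardized coefficients copied from the regression table.
CONST = 5.31
COEF = {
    "O Pass Eff": 1.43, "D Pass Eff": -1.65,
    "O Int Rate": -53.50, "D Int Rate": 81.70,
    "O Fum Rate": -49.10, "D FF Rate": 70.90,
    "O Run Eff": 1.00, "D Run Eff": -0.55,
    "Pen Rate": -2.73,
}

def expected_wins(stats):
    """Linear equation: constant plus coefficient times value for each stat."""
    return CONST + sum(COEF[name] * stats[name] for name in COEF)

# Hypothetical team (made-up values; rate units are assumed, not sourced).
team = {
    "O Pass Eff": 6.0, "D Pass Eff": 6.0,
    "O Int Rate": 0.03, "D Int Rate": 0.03,
    "O Fum Rate": 0.02, "D FF Rate": 0.02,
    "O Run Eff": 4.0, "D Run Eff": 4.0,
    "Pen Rate": 0.4,
}

# Same team with offensive passing improved by 1 yard per attempt.
improved = dict(team, **{"O Pass Eff": team["O Pass Eff"] + 1.0})

gain = expected_wins(improved) - expected_wins(team)
print(f"Extra wins from +1 yd/att passing: {gain:.2f}")
```

The gain equals the coefficient regardless of the baseline values chosen, which is what "wins per unit of the variable" means.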

To reveal the true relative importance of each factor, we need to standardize each variable by expressing it as the number of standard deviations from its average value. In statistics, these are known as "normalized" or "standardized" variables, denoted by the prefix "z."
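A minimal sketch of that standardization, assuming the population standard deviation is used (the article does not say which form was applied). The yards-per-attempt figures are invented for illustration.

```python
from statistics import mean, pstdev

def standardize(values):
    """Convert raw values to z-scores: standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return [(v - mu) / sigma for v in values]

# Hypothetical pass efficiency (yds/att) for five teams; not real data.
pass_eff = [5.2, 5.9, 6.1, 6.4, 7.4]
z_pass_eff = standardize(pass_eff)

for raw, z in zip(pass_eff, z_pass_eff):
    print(f"{raw:.1f} yds/att -> z = {z:+.2f}")
```

After this step every stat has mean 0 and standard deviation 1, so coefficients on different stats share the same scale.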

Here are the regression results again, this time calculated with standardized variables. The significance of each variable and the overall fit of the model remain the same, since only the units of the variables have changed.













VARIABLE        COEFFICIENT
constant          8.06
Z O Pass Eff      1.14
Z D Pass Eff     -0.92
Z O Int Rate     -0.45
Z D Int Rate      0.76
Z O Fum Rate     -0.33
Z D FF Rate       0.42
Z O Run Eff       0.46
Z D Run Eff      -0.24
Z Pen Rate       -0.39


It may seem like we’ve gone through a tortured process to arrive at these coefficients. But they are merely the mathematical weight we would need to give each factor to have the best estimate of actual team wins. These are based on real-world data from every team’s season between 2002 and 2006.
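One way to see how the two tables relate: if only the independent variables were standardized (an assumption, though the constant of about 8 wins is consistent with it), each standardized coefficient is the raw coefficient multiplied by that stat's standard deviation. Dividing one by the other gives the standard deviation each stat would need to have. The results are approximate because the published coefficients are rounded.

```python
# Coefficients copied from the two regression tables in this article.
raw = {
    "O Pass Eff": 1.43, "D Pass Eff": -1.65,
    "O Int Rate": -53.50, "D Int Rate": 81.70,
    "O Fum Rate": -49.10, "D FF Rate": 70.90,
    "O Run Eff": 1.00, "D Run Eff": -0.55,
    "Pen Rate": -2.73,
}
standardized = {
    "O Pass Eff": 1.14, "D Pass Eff": -0.92,
    "O Int Rate": -0.45, "D Int Rate": 0.76,
    "O Fum Rate": -0.33, "D FF Rate": 0.42,
    "O Run Eff": 0.46, "D Run Eff": -0.24,
    "Pen Rate": -0.39,
}

# standardized = raw * sd, so the implied sd is standardized / raw.
implied_sd = {name: standardized[name] / raw[name] for name in raw}

for name, sd in implied_sd.items():
    print(f"{name:<11} implied std dev ~ {sd:.4f}")
```

This is why the forced fumble coefficient of 70.90 shrinks so much: forced fumble rates vary over a far narrower range than yards per carry.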

Here is a graph representing each variable’s relative weight. Negative coefficients are shown as positive values.


Probably the simplest way to interpret the chart is this. If my team is average in absolutely everything, I'd expect to win 8 games. But if my team is average in everything except offensive pass efficiency, in which we're one standard deviation above average, I'd expect to win 9.14 games (8 + 1*1.14).

So if my team were the league's best at running the ball, say 2.5 standard deviations above average, but average at everything else, we'd expect to win 9.15 games (8 + 2.5*0.46). Compare that to passing: if my team were average at everything but best in the league in passing, again 2.5 standard deviations above average, we'd expect to win 10.85 games (8 + 2.5*1.14).
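These what-if calculations can be sketched as a small function over the standardized coefficients. It uses a baseline of 8 wins, the rounded constant used in the examples here, and treats any stat not supplied as exactly average (z = 0). The function name is hypothetical.

```python
# Standardized coefficients copied from the second regression table.
Z_COEF = {
    "O Pass Eff": 1.14, "D Pass Eff": -0.92,
    "O Int Rate": -0.45, "D Int Rate": 0.76,
    "O Fum Rate": -0.33, "D FF Rate": 0.42,
    "O Run Eff": 0.46, "D Run Eff": -0.24,
    "Pen Rate": -0.39,
}
BASELINE_WINS = 8.0  # the table's constant is 8.06; the text rounds it to 8

def expected_wins_z(z_scores):
    """Expected wins given z-scores; stats left out are assumed average."""
    return BASELINE_WINS + sum(Z_COEF[name] * z for name, z in z_scores.items())

print(f"{expected_wins_z({}):.2f}")                    # 8.00
print(f"{expected_wins_z({'O Pass Eff': 1.0}):.2f}")   # 9.14
print(f"{expected_wins_z({'O Run Eff': 2.5}):.2f}")    # 9.15
print(f"{expected_wins_z({'O Pass Eff': 2.5}):.2f}")   # 10.85
```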

Continue reading the fourth and final part of this article.

2 comments:

Max Power said...

First off, this is an outstanding blog that I enjoy and read regularly.

Second, what is the adjusted R-squared of your season win model? The R-squared you reported was 0.802, but I wonder what effect keeping all those regressors (especially defensive run efficiency) has on the fit.

Brian Burke said...

Max-Thanks. Didn't Homer Simpson go by that name in an episode? Classic.

The adjusted R-squared for the model cited in this post is 0.791.

I think one reason the adjusted value is so high is that the model isn't a "kitchen-sink" model. A lot of people are tempted to throw in first downs, touchdowns, field goals, etc. which are really just intermediate results between yards gained in individual plays and team wins.