Sunday, January 17, 2016

Sports statistics are more random than you think

In a previous post I modeled the home-field advantage (the difference between a team's winning percentage at home and on the road) in the NFL using 10 years of won-lost data.  The NFL average advantage is 15%, but individual teams range from zero to 35%.  It is natural to look for reasons why one team is so good at home while another is relatively good on the road, but I took a more skeptical approach.  I showed that this 0--35% spread in the advantage as "measured" over 160 games (10 years) is exactly what is expected if each and every NFL team has the same "true" (15%) home advantage.   The thinking tool behind this---that the process of measuring increases the apparent scatter---is so important that it's worth more explanation.

Let's recap the reasoning, this time with the simpler example of won-lost record.  Imagine 32 people (each representing one NFL team) sitting at a table, tossing coins.  In the long run you know each person should average 50% heads. But after 16 tosses (representing one NFL season), we know that some of them will exhibit an average above 50% and others will be below 50%.  We know this is due to random events rather than some coins being "more effective" than others at producing heads.  We can actually calculate (or simulate) the expected spread in the 32 measured heads percentages, and compare it with the actual spread.  If a coin produces a heads percentage well outside this spread, we might suspect that it's a trick coin.  We can use exactly the same math to calculate the expected spread in seasonal won-lost records if each game is essentially a coin toss; performances outside this spread can be attributed to a team consistently doing something better (or worse) than their competitors.
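That expected spread can be calculated directly. Here is a minimal sketch, assuming nothing beyond the fair coin and the 16 tosses described above:

```python
import math

def p_heads(k, n=16):
    # probability that a fair coin shows exactly k heads in n tosses
    return math.comb(n, k) * 0.5 ** n

# standard deviation of the heads *fraction* over 16 tosses:
# sqrt(p * (1 - p) / n) with p = 0.5
spread = math.sqrt(0.5 * 0.5 / 16)
print(f"std dev of heads fraction over 16 tosses: {spread:.3f}")
```

The standard deviation works out to 0.125, so a typical tosser lands anywhere between about 37.5% and 62.5% heads purely by chance.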

Fans may be aghast at the suggestion that each game is essentially a coin toss.  How can I suggest this when players give it their all, shedding blood, sweat and tears?  The key is that the players (and coaches) are quite evenly matched, with equal amounts of blood, sweat and tears on both sides.  Each team still has a tremendous amount of skill and could beat any non-NFL team, but in any league worth watching the teams are sufficiently evenly matched so that random events are important---otherwise we would not bother to play or watch the games!  I also hasten to add that what I have outlined so far is not a conclusion---it is a framework for drawing conclusions about the importance of true skill differences relative to random events.  Presumably there are real differences in skill, both for individuals and teams.  This framework merely helps us counteract the human tendency to attribute every observed outcome to a difference in skill.

Before I present the data and the outcome of the coin-toss model, I ask you to make a prediction.  Surely a team can come by a 9-7 or 7-9 record if each game is a coin toss, but how likely do you think a 10-6 or 6-10 record is?  What about 5-11 or 11-5? A 5-11 team is usually described as "hapless", and an 11-5 team is to be feared, so I'm guessing most fans would think 11 heads in 16 tosses is extremely unlikely.

And the answer is....



The bars in the figure show a simulation of one NFL season, and the curve shows the average of 1,000 simulated seasons.  In the particular season shown, one team compiled a 13-3 record despite having only a 50% chance of winning each game! The curve shows that the average number of such teams per season is about 0.27; in other words, this model predicts a 13-3 team about once every four years.  Keep in mind that the bars can fluctuate a lot from season to season: often there is a big spike or valley somewhere, other times an outlier like 2-14, and so on.  The particular season shown is just one possible realization of the way a season can depart from the average season; a typical simulated season actually shows greater departures from the curve.

The curve shows that in this model (where all teams are equal), we expect one 4-12 (and one 12-4) team, two 5-11 (and two 11-5) teams, four 6-10 (and four 10-6) teams, five or six 7-9 (and five or six 9-7) teams, and six or seven 8-8 teams. Teams at 3-13 should appear about every 4 years, as should teams at 13-3.  We would have to wait about 17 years for a 2-14 team, and somewhere in those 17 years we should also see a 14-2 team (you can read this from the blue curve by noting that in any given season about 1/17 of a team is expected to compile a 14-2 record). A 1-15 team and a 15-1 team are expected every 128 years, while 0-16 and 16-0 are expected every 2048 years.
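These numbers come straight from the binomial distribution: the expected number of teams finishing with exactly \(k\) wins is \(32 \times \binom{16}{k} / 2^{16}\). A quick sketch:

```python
import math

n_teams, n_games = 32, 16

def expected_teams(k):
    # expected number of teams with exactly k wins per season,
    # if every game is a fair coin toss
    return n_teams * math.comb(n_games, k) / 2 ** n_games

for wins in range(9):
    print(f"{wins:2d}-{n_games - wins:2d}: {expected_teams(wins):6.3f} teams per season")
```

For example, `expected_teams(3)` is about 0.27 (a 3-13 team roughly every four years) and `expected_teams(1)` is 1/128, matching the figures quoted above.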

Now let's look at the actual 2015 NFL regular season with the same blue curve:

The Panthers' 15-1 record is clearly unexpected in our coin-toss model; in other words, the Panthers really were good in 2015.  The Cardinals' 13-3 would appear every 4 years or so in the coin-toss model, so a hard-core skeptic could say this could be random.  However, unlike in the coin-toss model, there is typically at least one 13-3 team per NFL season, so most actual 13-3 teams cannot be average teams that got lucky. At the other end of the spectrum, we have two 3-13 teams (Browns and Titans) and for the same reason we should not suppose that these are average teams that got unlucky.

Next, there are three teams (Broncos, Patriots, and Bengals) that compiled 12-4 records, while we expected only one team to do so in the coin-toss model.  Give credit to those teams. But teams with 11-5 and below could well be average teams with some good luck.

Even as we have to admit that an 11-5 team might be an average team that got some lucky breaks, we have to admit that the handful of really good and really bad teams cast serious doubt on the coin-toss model as a complete explanation.  To further test the model we can look at additional seasons, and we quickly see that 2015 was not a fluke; for example the 2014 season had five teams at 12-4, versus only one predicted by the model.  As any NFL fan knows, a simple model in which any team has a 50% chance of winning any game is wrong.* However, the fact that only a handful of teams beats the coin-toss model in any given year illustrates an important point: random events can cause a great deal of spread in won-lost records.  Not every outcome should be attributed to differences in skill.

Given that differences in skill and random events are both important, can we construct a model that incorporates both?  Yes, but that's too much for one post.** What I want to emphasize here is that random events nearly always cause the spread in outcomes to be larger than the spread in skill.  Consider a simple model with some bad teams, many middling teams, and some good teams.  Some of the bad teams will be luckier than others, so the bad teams will compile records that range from terrible to nearly middling.  Some of  the good teams will be luckier than others, so the good teams will compile records that range from excellent to just above middling.  And as we have seen, the middling teams will spread out.  So there is more spread in win-loss records than there is in skill levels.

The fact that data scatter more than the intrinsic distribution is nearly universal. Accounting for this is a key part of just about any scientific data analysis.  The general public relates more easily to examples from sports, though, so some of the most accessible explanations are based on sports-related examples. If you want to go a bit further, try the Scientific American article "Stein's Paradox in Statistics."  If you prefer to keep it simple, just remember this: attributing some of the observed variance to random events requires attributing less of it to intrinsic factors such as skill differences between teams.  It is very easy to forget this and attribute too much to intrinsic variation.  Of course, variations in skill are much more important than random events in many sports, like the 100-m dash. But most team-sport leagues have feedback mechanisms to maintain some level of parity between teams (in the US, at season end the worst teams get the first draft picks; in European football the worst teams get demoted to a lesser league).  This opens a wider door for random events.

Another piece of data supporting the relatively large role of randomness, especially in the NFL, is the fact that experts generally predict winners and losers with only 60% accuracy.  This is astonishingly low considering that if you simply pick the home team every time, you will already have 57.5% accuracy!  These experts aren't idiots---random events are just really important.  A future post will assess how much of the apparent variation in expert performance must itself be random.


*Testing additional seasons was a natural way to further test this particular model, but in other cases it is not so easy.  For example, if a team has a good record but is on the edge of what could be compiled randomly, we might look at another year of data to see if the team continues to have a good record.  But the team changes from year to year!  Even if it's mostly the same team, the use of longer-term data is not entirely straightforward.  A similar statement goes for individual performances from year to year.  This is one of the things that makes sports statistics so interesting!

**I might address this in a future post.  A quick preview is that the coin-toss model can easily be extended to biased coins.  For example, the Patriots have won about 75% of their games over the past 10 years, so we could represent the Patriots using a coin that comes up heads 75% of the time.  (In practice we would use a computer's random number generator.)
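As a minimal sketch of that extension (the 75% figure is the one quoted above; nothing here is specific to any real schedule):

```python
import numpy.random as npr

# one simulated 16-game season for a team modeled as a biased coin
# that comes up heads (a win) 75% of the time
wins = int((npr.random(16) < 0.75).sum())
print(f"simulated record: {wins}-{16 - wins}")
```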

Sunday, January 10, 2016

Logarithms and units

One of the things that every intro calculus student learns is: $${d\ln x\over dx} = {1\over x}$$ This property of the logarithm leads to something else, which turns out to be useful to physicists and astronomers, but is never explicitly taught. If we rearrange this equation to read $${d\ln x} = {dx\over x}$$ we see that a given change in the logarithm (\(d\ln x\)) corresponds to a given fractional change in x. This equation also implies that the logarithm of anything is unitless, as follows:

  • the right side of this equation, \({dx\over x}\), is unitless regardless of the units of x;
  • therefore the left side, \(d\ln x\), must also be unitless; 
  • \(d\ln x\) must have the same units as \(\ln x\);
  • therefore \(\ln x\) must also be unitless, regardless of the units of x.
Physics students keeping track of their units can be stumped: what units does the log of a current or a voltage have? This tiny bit of math helps us see that the answer is "none."

The fact that \(d \ln x\) specifies a fractional change in x has further repercussions in astronomy, because it is traditional to quote the measurement of a flux \(f\) in the magnitude system: $$m = -2.5 \log_{10} {f\over f_0}$$ where \(f_0\) is some reference flux. This means that a quoted uncertainty in the magnitude of a star or galaxy, \(dm\), specifies a fractional uncertainty in the flux. Let's work out the details: \(\log_{10} x\) is the same as \({\ln x \over \ln 10}\) so $$dm = -{2.5\over \ln 10}  d\ln{f\over f_0} $$ $$dm = -{2.5\over \ln 10} {df\over f} $$ Because \(\ln 10\approx 2.30\), we get \(dm \approx -1.086 {df\over f}\).  For quick estimation purposes, the magnitude uncertainty is about the same as the fractional uncertainty in flux.

This explains why a 0.1 mag uncertainty is about a 10% flux uncertainty, regardless of the magnitude. One should not say that a 0.1 mag uncertainty is a 1% uncertainty in an \(m=10\) star, nor a 0.5% uncertainty in an \(m=20\) galaxy.  For the quantity that matters---the flux of the object---a 0.1 mag uncertainty implies about a 10% uncertainty regardless of the flux.
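A quick numerical check of this rule of thumb (the helper `mag` and the reference flux \(f_0 = 1\) are just illustrative choices):

```python
import math

def mag(f, f0=1.0):
    # magnitude of a flux f relative to a reference flux f0
    return -2.5 * math.log10(f / f0)

# a 10% flux increase changes the magnitude by about -0.1
# (brighter objects have smaller magnitudes), whatever the flux
for f in (1e-3, 1.0, 1e3):
    dm = mag(1.1 * f) - mag(f)
    print(f"f = {f:g}: dm = {dm:.4f}")
```

The loop prints the same `dm` of about -0.103 for every flux, bright or faint.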

Monday, January 4, 2016

Seeing Patterns That Don't Exist: Sports Edition

I found a good example of how not to think about data in Time magazine's 2015 "Answers Issue." Among the many examples of analysis that could have been deeper, one stood out:


"Which team has the best home-field advantage?" is essentially one big graphic illustrating the home-field advantage (the difference between a team's winning percentages at home and away) for every major American sports team. On top of this graphic, they have placed some random observations.  I cannot resist critiquing a few of these before I get to my main point:

  • "Stadiums don't generally have a great influence on win percentage except in baseball, where each stadium is unique."   If they mean only that playing-field peculiarities play no role in sports where all playing fields are identical, then---duh!  If they are saying that peculiarities of the playing field do have a great influence in baseball, then---whoa!   These peculiarities could play a role, but Time hasn't shown any data, or even a quote from a player, to support this.
  • "The Ravens [the team with the best overall home-field advantage, with a 35% difference: 78% at home vs 43% away] play far better when in Baltimore. They lost every 2005 road game but were undefeated at home in 2011."  Why would they compare the road record in 2005 to the home record six years later?  This is a clue that they are "cherry-picking": looking for specifics that support their conclusion rather than looking for the fairest comparison.  I don't follow sports much but I know six years is enough time to turn over nearly the entire team, thus making this a comparison between the home and road records of essentially different teams (with different coaches).  This is easy enough to look up: the 2005 Ravens were 0-8 on the road and 6-2 at home (a 75% difference with a 6-10 overall record), while the 2011 Ravens were 4-4 on the road and 8-0 at home (a 50% difference with a 12-4 overall record). This suggests the Ravens maintain a substantial home advantage, not only when they are a strong team overall but also when they are a weak team.  Rather than make this "substantial and consistent" point Time's factoid misleads us into thinking that a single team has an overwhelming home advantage.
  • "Grueling travel---especially in the NHL and NBA, where many road games are back-to-back---can take a toll on visitors."  This may explain why the NBA overall has a 19% home advantage---but why then does the NHL have only a 10% home advantage, nearly the lowest of the four major sports? It seems as if Time's "data-driven journalism" is limited to "explaining" selected facts without a serious attempt to investigate patterns.
Now to the main point.  A skeptical, data-driven person must ask: couldn't many of these numbers have arisen randomly?  The overall home advantage in the NFL is 15%: a 57.5% winning percentage at home, vs. 42.5% on the road. Imagine that each of the 32 teams has a real 15% home advantage.  They play only 8 home and 8 away games each season, so a typical team expects something like a 5-3 record at home and 3-5 on the road. If random events cause them to win just one more home game and lose just one more road game, they now have an apparent 50% home advantage (6-2 or 75% at home, vs 2-6 or 25% on the road).  They could also randomly win one less at home and one more on the road, for an apparent 0% home advantage.  This is roughly equal to Time's "worst" team, the Cowboys (to whom we will return later).  So the observed spread in home-field advantage is plausibly due to randomness, without requiring us to believe that the Cowboys really have no home advantage and that the Ravens really have a huge home advantage.

In science we have something called Occam's razor: we prefer the simplest model that matches the data.  A complicated model of the NFL is one in which we assign a unique home-field advantage to each team.  A simpler model is that each team has a true 15% home advantage, and that the spread is only in the apparent advantage as measured by the actual won-lost record.  The previous paragraph shows that the simpler model is plausible, at least for a single year.  How do we make this more quantitative and compare to Time's 10 years of data?  Let's flip a coin for the outcome of each game.  This has to be a biased coin, with a 57.5% chance of yielding a win for the home team and 42.5% for the visitors.  We don't need a physical coin; it's easier to use a computer's random number generator.  For each of 32 NFL teams, we flip this "coin" 160 times (for the ten years of games examined by Time) and simply see what the minimum and maximum home-vs.-away differences turn out to be.  This takes surprisingly few lines of code in Python:

import numpy
import numpy.random as npr
nteams = 32
ngames = 80  # ten years of home (or away) games in the NFL
# each uniform draw in [0, 1) becomes a home win with probability 57.5%
homegames = npr.random(size=(nteams, ngames)) >= 0.425
homepct = homegames.sum(axis=1) / ngames
# ...and an away win with probability 42.5%
awaygames = npr.random(size=(nteams, ngames)) >= 0.575
awaypct = awaygames.sum(axis=1) / ngames
print(numpy.sort(homepct - awaypct))


This prints out a set of numbers reflecting the apparent 10-year home advantage for each of 32 simulated teams, for example:

[-0.0375 -0.0125  0.      0.0125  0.0625  0.075   0.1     0.1125  0.1125
  0.1125  0.125   0.125   0.125   0.1375  0.1375  0.1375  0.15    0.15
  0.1625  0.175   0.175   0.175   0.175   0.175   0.2     0.2125  0.225
  0.2375  0.25    0.275   0.275   0.35  ]

As you can see, the largest apparent home advantage is 35%, exactly matching the Ravens, and the smallest apparent home advantage is -3.75%, about the same as the Cowboys' -2%.  Time's entire premise is consistent with being a mirage!

This modeling approach is at the heart of science, and is really fun. There are several directions we could take this if we had more time, and they are illustrative of the process of science:

  • making my statement "consistent with a mirage" more precise. I did this by running many simulations like the one above, and I found that a number as large as 35% comes up 17% of the time (meaning in 17% of simulated 10-year periods of football). Thus there is no evidence that the Ravens have a greater than 15% home advantage.*  And even if they do, the fact that the largest apparent advantage in an average simulation (31%) comes so close to their 35% means that most of their apparent advantage is likely to be random. The burden of proof is on those who think the effect is real, to tease out what the effect is and show that it can't be random.  If you find something that really doesn't fit the simple model, congratulations---you have made a discovery!  For example, it is plausible that (as Time suggests) the Cowboys do well on the road because they are "America's team."  With 10 years of data, their home vs. road record is still consistent with the NFL average, but if you like the "America's team" hypothesis you may be able to prove it by looking at 30 or more years of data, where random fluctuations will be smaller.
  • making a more sophisticated model.  I have to stress how brain-dead my model is. For example, each simulated team has a 50% winning record overall.  This is a really simple model that would be inadequate for predicting, for example, the lengths of winning streaks.  We could make the model more sophisticated by programming in the overall winning percentage of each team. I'm fairly confident this won't affect the home advantage, because most teams have a 10-year winning percentage not too far from 50% (in the 40-60% range, with the Ravens at 60.5%), and the exceptions (the Lions with 30% and Patriots with 77% overall winning percentage) still have home advantages consistent with the typical 15%.  But if you were determined to test the simple home-advantage model, you would want to write the extra code to make sure.  (Note that calling for a more sophisticated model here does not violate Occam's razor.  We know that some teams truly are good and some truly are bad, so we should include this in our model if we want to model the data thoroughly.  It just so happens that overall winning percentage is probably not important in modeling home-field advantage.)
  • modeling additional features of the data.  Upgrading the model as described in the previous paragraph would allow you to have even more fun, because this model would allow you to predict other things like the lengths of winning streaks.  It is truly satisfying to have a relatively simple model that explains a wide variety of data.
  • making your model more universal (in this case, extending it to additional sports). This is actually pretty easy; even Time may be capable of this.  Modifying my Python script to do basketball is trivial: just change the home/road winning percentages to 59.5%/40.5% and the number of games at each venue to 41 per year, or 410 in ten years. Before we do that, let's predict what will happen: random fluctuations will play a smaller role in an 82-game season.  The "best" and "worst" teams in the NBA will therefore show smaller deviations from the NBA average (19%) than we saw in football.  In fact, the Jazz lead the NBA with an apparent 27% advantage and the Nets trail with 12%---both consistent with my simulations. I encourage interested readers to do hockey and baseball for themselves.  
I can imagine two types of results from modeling a wide variety of sports, each of which would be rewarding.  First, it could be that randomness explains the variations in all sports.  This would be an impressive achievement for such a simple model.  Second, it could be that randomness explains the variations in most sports, but that there is some interesting exception.  If baseball is an exception then perhaps baseball stadiums do matter.  If Denver is an exception, then perhaps altitude matters.**  
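Here is what that NBA modification looks like, as a sketch (the 59.5%/40.5% split and 410 games per venue are the figures quoted above; the 30-team league is otherwise idealized):

```python
import numpy
import numpy.random as npr

nteams = 30     # NBA teams
ngames = 410    # ten years of home (or away) games
p_home = 0.595  # league-wide home winning percentage

# simulate ten years of home and away games for each team
homepct = (npr.random(size=(nteams, ngames)) < p_home).sum(axis=1) / ngames
awaypct = (npr.random(size=(nteams, ngames)) < 1 - p_home).sum(axis=1) / ngames
adv = numpy.sort(homepct - awaypct)
print(adv.min(), adv.max())
```

With 410 games at each venue instead of 80, the simulated advantages cluster much more tightly around the league average of 19%, which is why the Jazz's 27% and the Nets' 12% are unremarkable.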

The same thinking tool can be used in many other contexts. The New York Times set a great example with How Not To Be Misled By The Jobs Report.  They showed how uncertainties in the process of counting jobs could lead from an actual job gain of 150,000 to a wide range of apparent job gains, and thus to misleading conclusions about the economy if people take any one jobs report too seriously.

Summary: whether in science, in data-driven journalism, or just as part of being a thinking person, you should have a model in mind when you look at data or make observations.  This will prevent you from over-interpreting apparent features and help you make true discoveries.

*If you think the 17% indicates something unlikely, consider that it is not much less than the chance of getting two heads in two coin tosses, and no one would suggest that there must be something special about a coin that yields two heads in two tosses.  To even think about investigating something further, you should demand that what you observe would have arisen randomly in less than 5% of simulations.

**Spoiler alert: it turns out that neither baseball nor Denver is an exception.