The previous posts explain how to do this. First, calculate how much variation must randomly creep into the results even if there is no true variation in skill. Then, compare this to the documented variation in results. If the model and data exhibit about the same amount of variation, you can claim that there is no evidence for true variation in skill. True variation in skill will result in performances outside the envelope of simulated performances.
I found a nice list of the records of 133 expert pickers for the 2015 regular season (256 games in all) at nflpickwatch.com. Here is the histogram of their correctness percentages (blue):
The green histogram is a simulation of 100 times as many experts, assuming they all have a "true" accuracy rate of 60.5%. That is, each pick has a 60.5% chance of being correct, and the actual numbers for each expert reflect good or bad luck in having more or fewer of these "coin tosses" land their way. I simulated so many experts (and then divided the counts by 100) so that we could see a smooth distribution. A few things stand out in this plot:
- No one did better than the simulation. In other words, there is not strong evidence that anyone is more skilled than a 60.5% long-term average, even if they happened to hit nearly 67% this season. Of course, we can't prove this; it remains possible that a few experts are slightly more skilled than the average expert. The best we can do is extend the analysis to multiple seasons so that the actual data more closely approach the long-term average. In other words, the green histogram would get narrower if we included 512 or 1024 games, and if some of the spread in the blue histogram is really due to skill, the blue histogram will not narrow as much.
- A few people did far worse than the simulation. In other words, while the experts who outdid everyone else probably did so on the basis of luck, the ones who did poorly cannot rely on bad luck as the explanation. They really are bad. How could anyone do as poorly as 40% when you could get 50% by flipping a coin and 57.5% by always picking the home team?
- Because the home team wins 57.5% of the time, the experts are adding some knowledge beyond home-team advantage, but not much. Or rather, they may have a lot of knowledge, but that knowledge relates only weakly to which team wins. This suggests that random events are very important. Let's contrast this with European soccer; I found a website that claims to correctly predict the winner of a soccer match 88% of the time. European soccer has few or none of the features that keep American teams roughly in parity with each other: better draft picks for losing teams, salary cap, revenue sharing, etc. It's much more of a winner-take-all business, which makes the outcomes of most matches rather predictable. In leagues with more parity, random events have more influence on outcomes.
- If you remove the really bad experts (below 53%, say) the distribution of the remaining competent experts is tighter than the distribution of simulations. How can the actual spread be less than the spread in a simulation where all experts are identically skilled? It must be that experts follow the herd: if some team appears hot, or another team lost a player to injury, most experts will make the same pick on a given game. This is not in my simulation, but it surely happens in real life, and it would indeed make experts perform more similarly than in the simulation.
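For readers who want to reproduce something like the green histogram, here is a minimal sketch in Python. It is not the code used for the plot above: the function names, the random seed, and the use of the standard library are assumptions made for illustration. The 60.5% accuracy, the 256 games, and the 100 × 133 simulated experts come from the text. The sketch also prints the theoretical binomial spread for 256, 512, and 1024 games, to show how the simulated distribution narrows as more games are included.

```python
import math
import random
import statistics


def simulate_accuracies(n_experts, n_games, p_correct, rng):
    """Return each simulated expert's fraction of correct picks.

    Every pick is an independent "coin toss" that lands correct with
    probability p_correct, so all experts are identically skilled.
    """
    return [
        sum(rng.random() < p_correct for _ in range(n_games)) / n_games
        for _ in range(n_experts)
    ]


def binomial_sd(n_games, p_correct):
    """Theoretical standard deviation of the accuracy fraction."""
    return math.sqrt(p_correct * (1 - p_correct) / n_games)


rng = random.Random(2015)  # arbitrary seed (an assumption), for repeatability

# 100 times as many experts as the 133 real ones, 256 games, 60.5% accuracy
accuracies = simulate_accuracies(100 * 133, 256, 0.605, rng)

print("simulated mean accuracy:", round(statistics.mean(accuracies), 4))
print("simulated spread (SD):  ", round(statistics.pstdev(accuracies), 4))
print("best simulated expert:  ", round(max(accuracies), 4))

# The spread expected from luck alone shrinks as the number of games grows
for n_games in (256, 512, 1024):
    print(n_games, "games -> theoretical SD", round(binomial_sd(n_games, 0.605), 4))
```

For 256 games the theoretical spread is about 3 percentage points, so a season of nearly 67% sits roughly two standard deviations above 60.5%. Among 133 experts, a result like that is not surprising from luck alone.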
My simulation neglects a lot of things that could happen in real life. For example, feedback loops: let's say the more highly skilled team falls behind in a game because of random events. You might think that this would energize them. Pumped up at the thought that they are losing to a worse team, they try that much harder, and come from behind to win the game. Nice idea, but if it were true then the final outcomes of games would be incredibly predictable. The fact that they are so unpredictable indicates that this kind of feedback loop is not important in real life. The same idea applies to an entire season: if a highly skilled team finds itself losing a few games due to bad luck, we might find them trying even harder to reach their true potential. But the fact that most NFL teams have records indistinguishable from a coin toss again argues that this kind of feedback loop does not play a large role. Of course, teams do feel extra motivation at certain times, but the extra motivation must have a minimal impact on the chance of winning. For every time a team credits extra motivation for a win, there is another time when they have to admit that it just wasn't their day despite the extra motivation.