Friday, December 20, 2013

Shedding light on missing mass

An important part of science---of life, I would argue---is making inferences from data. (Contrary to some popular perception, it is not all that we do, but it is a large part.)  This process is a lot more interesting than many people think, and this is best conveyed by stories where it went horribly wrong.  I just published a paper rebutting another paper in which it went wrong (not horribly so), but to make this post understandable I will weave a few of these horrible stories into my tale.

The paper I rebutted claimed that one method (called weak gravitational lensing) of measuring the mass of a certain galaxy cluster gave an answer that was too low compared to the answers obtained through two other methods, and that therefore the lensing method itself was suspect.  The context is that astronomers find it very difficult to measure the mass of anything, since we are so far away.  If the cluster is not changing over time, we can relate the velocities of the galaxies in the cluster to its mass (called the dynamical method), and we can also relate the cluster's X-ray emission to its mass.  But that's a big if, and we would like a method which does not depend on this assumption.  Lensing is such a method; it has weaknesses too, but I don't want to go too deeply into them here.  The central question here is really simple and applies to many situations: when numbers seemingly disagree, how do we characterize the strength of disagreement, given that there is some uncertainty associated with each number?

The original paper made a model of the cluster using the X-ray method, and simulated weak lensing measurements of this model to see how often the simulated measurements gave answers as low as the actual weak lensing measurements.  This is a great technique; it gives us what's called a p-value.  By tentatively assuming that weak lensing is as effective as the X-ray method---the "null hypothesis"---we will see how often the inherent uncertainties in weak lensing would just randomly give us an answer as low as we got in real life.  If the answer is "never" then we can state that our null hypothesis is wrong and weak lensing is not as effective as the X-ray method.  More quantitatively, if the answer is "in 1 out of every 100 experiments" we would say p=0.01, which has the naive interpretation of "99% confidence that the null hypothesis is rejected."  (One of the reasons it's naive is that if you tested, say, 100 different true null hypotheses, you would still expect one of them to randomly come out with p=0.01 or lower.  So the true interpretation is more nuanced.  I will develop this further below.)
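
To make this concrete, here is a minimal sketch of the procedure in Python.  Every number in it (the model mass, the lensing scatter, the "observed" value) is made up purely for illustration; this is the shape of the calculation, not the actual numbers from either paper.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical numbers, purely for illustration (not from either paper):
# the X-ray model says the cluster mass is 10 (arbitrary units), and a
# single weak lensing measurement of such a cluster scatters with a
# standard deviation of 2 around the true value.
xray_mass = 10.0
lensing_scatter = 2.0
observed_lensing_mass = 7.0   # the (hypothetical) real-life lensing answer

# Simulate many weak lensing "measurements" of the X-ray model...
n_sims = 100_000
simulated = rng.normal(xray_mass, lensing_scatter, n_sims)

# ...and ask how often random scatter alone gives an answer at least
# as low as the one actually observed.  That fraction is the p-value.
p_value = np.mean(simulated <= observed_lensing_mass)
print(f"p-value: {p_value:.3f}")   # about 0.07 for these made-up numbers
```

The real analysis uses a full model of the cluster rather than a single Gaussian scatter, but the logic is the same.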

Now, what if this method gives you p=0.1 or so?  You can't really reject the null hypothesis unless you have stronger proof than that, so you may go out and take more data, do more experiments, etc., to get the stronger proof.  If you do so, make sure that the new experiments are independent of the original one.  For example, if you want to prove that tall people are better basketball players than short people, the null hypothesis would be that they are the same, and you might record the score from a scrimmage in which a tall person plays against a short one.  If the tall person comes out slightly ahead, you will not have strong proof that the tall person is better, so you might replay the scrimmage.  But if you play the same two people against each other, you can never prove that tall is better; the most you might prove is that player A is better than player B.  To make the trials independent, you have to play a different tall person against a different short person.  In more general terms, if you're trying to get an idea of the natural variation or "noise" in your measurement, you have to repeat the measurement in a way that actually incorporates those variations.  What this paper did was the equivalent of playing an X-ray player against each of three identical-triplet weak lensing players, failing to recognize that the three scrimmages were not independent, and thereby drawing a mistakenly strong conclusion about X-ray versus weak lensing.
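
Here is a toy simulation of the triplet problem, with a made-up amount of shared noise: when three measurements mostly share the same random fluctuation, the chance that all three come out low is far larger than what you would compute by pretending they were independent.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims = 200_000

# Suppose each of three "independent-looking" measurements is really one
# shared fluctuation plus a little individual noise (identical triplets
# playing three scrimmages).  The 0.9/0.1 split is made up.
shared = rng.normal(0.0, 1.0, n_sims)
trials = 0.9 * shared[:, None] + rng.normal(0.0, 0.1, (n_sims, 3))

threshold = -1.5   # "came out low" means below this (arbitrary) level

# Chance that all three correlated trials come out low:
p_all_low = np.mean(np.all(trials < threshold, axis=1))

# What you'd (wrongly) compute by treating them as independent trials:
p_one_low = np.mean(trials[:, 0] < threshold)
print(f"actual chance that all three are low: {p_all_low:.4f}")
print(f"naive 'independent' calculation:      {p_one_low**3:.6f}")
```

In this sketch the naive calculation understates the chance of "all three low" by a factor of a few hundred; the exact factor is not the point, the nonindependence is.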

This idea of independence---and recognizing nonindependence even when it's subtle---is really important.  Ben Goldacre in his book Bad Science relates the story of a woman suspected of murder because two of her kids died of sudden infant death syndrome.  The chance of one baby dying of SIDS was stated as 1 in 8543.  Prosecutors assumed that the chance of a second child dying of SIDS (over the course of years, not in the same incident) was independent of the chance of the first child dying of SIDS, so we can multiply probabilities and come up with a 1 in 73,000,000 chance of two babies dying of SIDS; so unlikely that we might suspect murder.  But they're not independent. If SIDS has anything to do with genes or environment then they can't be independent, because the babies have the same parents and the same house.  Given the shared genes and environment, the second baby's chance of SIDS may actually be quite high.  In that case, we have no reason to suspect murder.  The prosecutors vastly overstated the statistical case for murder by failing to recognize the non-independence.  (That's not the only mistake the prosecutors made.  I highly recommend Goldacre's book.)
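
To see how much the answer moves once the independence assumption is dropped, here is the arithmetic with a conditional probability I have invented out of thin air (the 1-in-100 figure below is a placeholder, not a number from the case):

```python
# The 1-in-8543 figure is the one quoted in the story; the conditional
# probability below is entirely made up, just to show the arithmetic.
p_first = 1 / 8543

# Wrong: treat the two deaths as independent and multiply.
p_both_independent = p_first * p_first               # about 1 in 73,000,000

# If shared genes and environment mean that, say, 1 family in 100 with
# one SIDS death suffers a second (a made-up number), the joint
# probability is p(first) * p(second | first), not p(first) squared.
p_second_given_first = 1 / 100
p_both_dependent = p_first * p_second_given_first    # about 1 in 854,000

print(f"assuming independence: 1 in {1 / p_both_independent:,.0f}")
print(f"with the made-up conditional probability: 1 in {1 / p_both_dependent:,.0f}")
```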

A second mistake the authors of the weak lensing paper made was multiplying the p-values from the three experiments to obtain an overall p-value.   Many people, even scientists, fall into the following trap: Say Experiment A gives p=0.10 and you interpret that as only a 10% chance that the null hypothesis is correct. Now independent Experiment B gives p=0.08, which you interpret as only an 8% chance that the null hypothesis is correct. It is natural to think that the experiments together imply only 8% of a 10% chance of the null hypothesis being correct, or p=0.008.  But it's wrong! You have vastly underestimated the chance of the null hypothesis being correct, just as the paper we rebutted vastly underestimated the chance that the weak lensing measurements were actually consistent with the dynamical and X-ray measurements.  Even if the experiments are independent, you should not multiply the p-values.
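
You can convince yourself of this with a short simulation.  When the null hypothesis is true, a properly constructed p-value is uniformly distributed between 0 and 1, so we can draw pairs of uniform random numbers and see how often their product dips below 0.008:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims = 1_000_000

# When the null hypothesis is true, a well-constructed p-value is
# uniformly distributed between 0 and 1.  Draw pairs of such p-values:
p_a = rng.uniform(0, 1, n_sims)
p_b = rng.uniform(0, 1, n_sims)

# If the product of two p-values were itself a p-value, it would fall
# below 0.008 only 0.8% of the time when the null is true.  It doesn't:
fraction = np.mean(p_a * p_b < 0.008)
print(f"product < 0.008 in {100 * fraction:.1f}% of true-null cases")
# prints roughly 4.7%, i.e. the product badly overstates the evidence
```

Roughly 4.7% of true-null pairs give a product below 0.008, so reading the product as "p=0.008" overstates the evidence by about a factor of six.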

Here's an easy way to confirm that the above procedure is wrong: following an equivalent procedure, you could also interpret p=0.10 as a 90% chance that the null hypothesis is incorrect and p=0.08 as a 92% chance that the null hypothesis is incorrect.  Multiplying them, we would get an 82.8% chance that the null hypothesis is incorrect.  But the same process in the previous paragraph yielded a 0.8% chance that it's correct, which would imply a 99.2% chance that it's incorrect, not the 82.8% we just computed.  So something's wrong with the process!  And it gets more wrong as you do more experiments and more multiplications.  If we do 100 independent experiments and follow this line of reasoning, we will come up with a vanishingly small chance of the null hypothesis being correct, and a vanishingly small chance of the null hypothesis being incorrect, regardless of the specific p-values, because they are always less than one.  You will rule out things which do not deserve to be ruled out.  Goldacre gives the horrifying example of a nurse who was suspected of murdering patients and convicted largely on the basis of faulty statistics, but was eventually freed.  I'm going to make up the following numbers for simplicity.  Let's say that some number of patients died while she was working, such that there was only a 10% chance that that would have happened randomly.  So you start poking around, and find that at the previous hospital where she worked, there was only a 50% chance of that large a number of patients dying, and at the hospital before that only a 70% chance, and at the hospital before that only a 30% chance, etc.  Multiplying all these together gives a really small chance that all these things occurred randomly.  But you know by now that multiplying these is wrong.
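
Here is a quick numerical version of that argument, using 100 experiments in which the null hypothesis is true by construction:

```python
import numpy as np

rng = np.random.default_rng(2)

# 100 experiments in which the null hypothesis is actually true,
# so each p-value is just a uniform random number between 0 and 1.
p_values = rng.uniform(0, 1, 100)

print(f"product of p-values:       {np.prod(p_values):.3e}")
print(f"product of (1 - p-values): {np.prod(1 - p_values):.3e}")
# Both products are vanishingly small: every factor is less than one,
# so they shrink no matter what, telling us nothing about the null.
```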

Why is it wrong?  Probabilities can be multiplied (1/2 chance of heads in each coin toss means 1/4 chance of two heads in two coin tosses), but despite its name, the p-value is not a simple probability that the null hypothesis is true.  It's a measure of consistency, constructed so that for any one experiment we interpret all but the lowest p-values as being consistent with the null hypothesis.  Therefore p=0.5, say, is perfectly consistent with the null hypothesis; it does not mean a 50% chance of it being true or false.  In the hypothetical example above, p=0.50 actually means that an average number of patients died on her shifts (deaths on random shifts rose to [at least] that level 50% of the time) and p=0.70 actually means that fewer patients than average died on her shifts (deaths on random shifts rose to [at least] that level 70% of the time).  A correct way to combine p-values from independent experiments is Fisher's method.  Had the authors of the paper we rebutted used that method, they would have seen that the dynamical and weak lensing measurements were entirely consistent, even without correcting the error regarding nonindependent trials.  Correcting both errors makes it even clearer.
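
For the curious, here is what Fisher's method looks like, applied to the made-up nurse numbers from above: sum -2 ln p over the k experiments and compare to a chi-squared distribution with 2k degrees of freedom.  (This is the general recipe, not the specific calculation in our rebuttal.)

```python
import numpy as np
from scipy import stats

# The made-up p-values from the nurse example above.
p_values = np.array([0.10, 0.50, 0.70, 0.30])

# Wrong: multiply them.
print(f"naive product:     {np.prod(p_values):.4f}")   # about 0.01

# Fisher's method: under the null, -2 * sum(ln p) follows a chi-squared
# distribution with 2k degrees of freedom (k = number of p-values).
statistic = -2.0 * np.sum(np.log(p_values))
combined_p = stats.chi2.sf(statistic, df=2 * len(p_values))
print(f"Fisher combined p: {combined_p:.2f}")          # about 0.33

# scipy has this built in:
print(stats.combine_pvalues(p_values, method="fisher"))
```

The combined p-value comes out above 0.3, nowhere near the naive product of about 0.01: no hint, in those made-up numbers, of anything beyond chance.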

This stuff is complicated and it's easy to go wrong.  Happily, many incorrect inferences in science are caught relatively quickly because so many scientists have so much practice in this kind of analysis.  But  Goldacre's book is an eye-opener.  Things don't always work out so well so quickly.