
Friday, December 20, 2013

Shedding light on missing mass

An important part of science---of life, I would argue---is making inferences from data. (Contrary to some popular perception, it is not all that we do, but it is a large part.)  This process is a lot more interesting than many people think, and this is best conveyed by stories where it went horribly wrong.  I just published a paper rebutting a paper in which it went wrong; not horribly so, but to make this post understandable I will weave a few of these horrible stories into my tale.  

The paper I rebutted claimed that one method (called weak gravitational lensing) of measuring the mass of a certain galaxy cluster gave an answer too low compared to the answers obtained through two other methods, and therefore the lensing method itself was suspect.  The context is that astronomers find it very difficult to measure the mass of anything, since we are so far away.  If the cluster is not changing over time, we can relate the velocities of the galaxies in the cluster to its mass (called the dynamical method) and we can also relate the cluster's X-ray emission to its mass.  But that's a big if, and we would like a method which does not depend on this assumption. Lensing is such a method; it has weaknesses too, but I don't want to get too deeply into that here.  The central question in this paper is really simple and applies to many situations: when numbers seemingly disagree, how do we characterize the strength of disagreement given that there is some uncertainty associated with each number?

The original paper made a model of the cluster using the X-ray method, and simulated weak lensing measurements of this model to see how often the simulated measurements gave answers as low as the actual weak lensing measurements.  This is a great technique; it gives us what's called a p-value.  By tentatively assuming that weak lensing is as effective as the X-ray method---the "null hypothesis"---we will see how often the inherent uncertainties in weak lensing would just randomly give us an answer as low as we got in real life. If the answer is "never" then we can state that our null hypothesis is wrong and weak lensing is not as effective as the X-ray method.  More quantitatively, if the answer is "in 1 out of every 100 experiments" we would say p=0.01, which has the naive interpretation  of "99% confidence that the null hypothesis is rejected."  (One of the reasons it's naive is that if you tested, say, 100 different true hypotheses, you would still expect one to randomly come out with p=0.01.  So the true interpretation is more nuanced.  I will develop this further below.)
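
To make that logic concrete, here is a minimal sketch of this kind of simulation test in Python.  The numbers below (the model mass, the lensing scatter, and the "observed" lensing mass) are invented for illustration, and the whole simulation is collapsed into a single Gaussian scatter, which is a big simplification of what the actual papers do.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative numbers only (not the values from either paper):
model_mass = 1.0e15        # cluster mass implied by the X-ray model (solar masses)
lensing_scatter = 2.0e14   # assumed scatter of a weak lensing measurement of this cluster
observed_lensing = 6.5e14  # a hypothetical weak lensing measurement

# Simulate many weak lensing "measurements" of the model cluster under the
# null hypothesis that lensing is really measuring the model mass.
n_sim = 100_000
simulated = rng.normal(model_mass, lensing_scatter, size=n_sim)

# p-value: the fraction of simulated measurements at least as low as the real one.
p_value = np.mean(simulated <= observed_lensing)
print(f"p = {p_value:.3f}")
```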

Now, what if this method gives you p=0.1 or so? You can't really reject the null hypothesis unless you have stronger proof than that, so you may go out and take more data, do more experiments, etc, to get the stronger proof. If you do so, make sure that the new experiments are independent of the original one. For example, if you want to prove that tall people are better basketball players than short people, the null hypothesis would be that they are the same and you might record the score from a scrimmage in which a tall person plays against a short one. If the tall person comes out slightly ahead, you will not have strong proof that the tall person is better, so you might replay the scrimmage.  But if you play the same two people against each other, you can never prove that tall is better; the most you might prove is that player A is better than player B.  To make the trials independent, you have to play a different tall person against a different short person.  In more general terms, if you're trying to get an idea of the natural variation or "noise" in your measurement, you have to repeat the measurement in a way that actually incorporates those variations.  What this paper did was equivalent to failing to recognize the nonindependence of identical triplet weak lensing players.  They ran three scrimmages between an X-ray player and each of these three weak lensing players, mistakenly yielding a strong conclusion about X-ray vs weak lensing.

This idea of independence---and recognizing nonindependence even when it's subtle---is really important.  Ben Goldacre in his book Bad Science relates the story of a woman suspected of murder because two of her kids died of sudden infant death syndrome.  The chance of one baby dying of SIDS was stated as 1 in 8543.  Prosecutors assumed that the chance of a second child dying of SIDS (over the course of years, not in the same incident) was independent of the chance of the first child dying of SIDS, so we can multiply probabilities and come up with a 1 in 73,000,000 chance of two babies dying of SIDS; so unlikely that we might suspect murder.  But they're not independent. If SIDS has anything to do with genes or environment then they can't be independent, because the babies have the same parents and the same house.  Given the shared genes and environment, the second baby's chance of SIDS may actually be quite high.  In that case, we have no reason to suspect murder.  The prosecutors vastly overstated the statistical case for murder by failing to recognize the non-independence.  (That's not the only mistake the prosecutors made.  I highly recommend Goldacre's book.)
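
The arithmetic is easy to check, and so is how much the answer changes once you allow for non-independence.  In the sketch below, the 1-in-100 conditional probability is a number I made up purely to illustrate the point; the real conditional risk is a question for medical research, not for me.

```python
# The prosecution's calculation: treat the two deaths as independent events.
p_single = 1 / 8543
p_two_independent = p_single ** 2
print(f"assuming independence: 1 in {1 / p_two_independent:,.0f}")  # about 1 in 73,000,000

# If SIDS has genetic or environmental causes, what matters is the conditional
# probability of a second death given a first.  The 1-in-100 figure below is
# purely hypothetical, chosen only to show how much the answer can change.
p_second_given_first = 1 / 100
p_two_correlated = p_single * p_second_given_first
print(f"with shared risk factors: 1 in {1 / p_two_correlated:,.0f}")  # about 1 in 854,000
```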

A second mistake the authors of the weak lensing paper made was multiplying the p-values from the three experiments to obtain an overall p-value.   Many people, even scientists, fall into the following trap: Say Experiment A gives p=0.10 and you interpret that as only a 10% chance that the null hypothesis is correct. Now independent Experiment B gives p=0.08, which you interpret as only an 8% chance that the null hypothesis is correct. It is natural to think that the experiments together imply only 8% of a 10% chance of the null hypothesis being correct, or p=0.008.  But it's wrong! You have vastly underestimated the chance of the null hypothesis being correct, just as the paper we rebutted vastly underestimated the chance that the weak lensing measurements were actually consistent with the dynamical and X-ray measurements.  Even if the experiments are independent, you should not multiply the p-values.

Here's an easy way to confirm that the above procedure is wrong: following an equivalent procedure you could also interpret p=0.10 as a 90% chance that the null hypothesis is incorrect and p=0.08 as a 92% chance that the null hypothesis is incorrect.  Multiplying them, we would get an 82.8% chance that the null hypothesis is incorrect. But the same process in the previous paragraph yielded an 0.8% chance that it's correct, which doesn't match the 82.8% chance that it's incorrect. So something's wrong with the process!  And it gets more wrong as you do more experiments and more multiplications.  If we do 100 independent experiments and follow this line of reasoning, we will come up with a vanishingly small chance of the null hypothesis being correct, and a vanishingly small chance of the null hypothesis being incorrect, regardless of the specific p-values, because they are always less than one.  You will rule out things which do not deserve to be ruled out.  Goldacre gives the horrifying example of a nurse who was suspected of murdering patients and convicted largely on the basis of faulty statistics but was eventually freed. I'm going to make up the following numbers for simplicity.  Let's say that some number of patients died while she was working, such that there was only a 10% chance that that would have happened randomly.  So you start poking around, and find that at the previous hospital where she worked, there was only a 50% chance of that large a number of patients dying, and at the hospital before that only a 70% chance, and at the hospital before that only a 30% chance, etc. Multiplying all these together gives a really small chance that all these things occurred randomly.  But you know by now that multiplying these is wrong.
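
You can see the problem with a few lines of simulation (this is an illustrative sketch, not a calculation from any of the papers or cases above): when the null hypothesis is true, p-values are spread uniformly between 0 and 1, so the product of many of them is guaranteed to become tiny even though nothing unusual ever happened.

```python
import numpy as np

rng = np.random.default_rng(0)

# When the null hypothesis is true, each experiment's p-value is uniformly
# distributed between 0 and 1.  Watch what happens to the product anyway.
for n_experiments in (1, 5, 10, 100):
    p_values = rng.uniform(0, 1, size=n_experiments)
    print(n_experiments, "experiments -> product of p-values:", np.prod(p_values))
```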

Why is it wrong? Probabilities can be multiplied (1/2 chance of heads in each coin toss means 1/4 chance of two heads in two coin tosses), but despite its name, the p-value is not a simple probability that the null hypothesis is true.  It's a measure of consistency which is constructed so that for any one experiment we interpret all but the lowest p-values as being consistent with the null hypothesis. Therefore p=0.5, say, is perfectly consistent with the null hypothesis; it does not mean a 50% chance of it being true or false.  In the hypothetical example above, p=0.50 actually means that an average number of patients died on shift (deaths on random shifts rose to [at least] that level 50% of the time) and p=0.70 actually means that fewer than average patients died on shift (deaths on random shifts rose to [at least] that level 70% of the time).  A correct way to combine p-values for independent experiments is Fisher's method.  Had the paper we rebutted used that method, they would have seen that the dynamical and weak lensing measurements were entirely consistent, even without correcting the error regarding nonindependent trials.  Correcting both errors makes it even more clear.
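
Fisher's method combines k independent p-values through a single statistic, -2 times the sum of their logarithms, which follows a chi-squared distribution with 2k degrees of freedom when the null hypothesis is true.  Here is a minimal sketch using the made-up hospital numbers from above:

```python
import numpy as np
from scipy import stats

# The made-up p-values from the hospital story above.
p_values = [0.10, 0.50, 0.70, 0.30]

# The naive (wrong) approach: multiply them.
print("product of p-values:", np.prod(p_values))  # about 0.01, which looks damning

# Fisher's method: X = -2 * sum(ln p) follows a chi-squared distribution
# with 2k degrees of freedom under the null hypothesis.
X = -2.0 * np.sum(np.log(p_values))
print("Fisher combined p:", stats.chi2.sf(X, df=2 * len(p_values)))  # about 0.3, unremarkable

# SciPy implements the same calculation directly:
print(stats.combine_pvalues(p_values, method="fisher"))
```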

This stuff is complicated and it's easy to go wrong.  Happily, many incorrect inferences in science are caught relatively quickly because so many scientists have so much practice in this kind of analysis.  But  Goldacre's book is an eye-opener.  Things don't always work out so well so quickly.






Friday, January 11, 2013

Modeling Colliding Galaxy Clusters and Dark Matter

Science is about making models.  We make a conceptual model of
something (in other words, we guess how it works), then we figure out
what that model would predict, we compare those predictions to
observations, and then we either discard, modify, or (for the time
being) accept that the model could be correct. (Recently I stumbled
across an entertaining video of Richard Feynman explaining this
method.)  Ideally, the model is conceptually simple, with as few
moving parts as possible, yet is able to match a rich variety of data.
Sometimes this modeling process is about something really fundamental,
such as how gravity works; other times the modeling process is just
about estimating something we can't measure directly.  For example, a
scale which doesn't read the correct weight can still be used to infer
your weight, as long as you have a workable model such as "reads about
ten pounds too heavy" or "reads about half the true weight."

My student Will Dawson recently finished some work which is a really nice example of the latter kind of modeling.  We have been studying colliding clusters of galaxies; our lives are too short to see the collision play out, so we have to make a model which allows us to extrapolate from the current (just post-collision) state back to the time of maximum impact.

A good analogy: given a photograph and a speed gun reading taken a
split-second after an automobile collision, reason backward from that
to infer the velocity of collision and the time since collision.  The
main difference is that the galaxy clusters mostly go right through
each other because the space between the galaxies is so big.  The hot
gas clouds filling those spaces do collide, leaving a pancake of hot
gas in the middle while the galaxies continue on (eventually to slow
down and turn around due to gravity).  We are interested in what
happens to the dark matter: the Bullet Cluster and the Musketball
Cluster show that it is mostly collisionless, but "mostly" is
something we really want to quantify, for reasons I'll explain in a
future post.

But first, we have to make a model of the collision.  Speed guns and
spectrographs can only tell how fast the galaxies are moving toward us
or away from us; they say nothing about how fast they are moving in
the transverse direction (in the plane of the sky).  To study dark
matter we need to know the full three-dimensional velocity, and we
want to know what it was at the time of collision, rather than what it
is right now, after the collision.  This is closely related to how
much time has passed since the collision (by collision I really mean
the maximum overlap of the two clusters), because the observed
separation since the collision could have been achieved by moving at a
high velocity for a short time, or moving at a low velocity for a long
time.  Making things more complicated, the separation we observe is
only the transverse part of the separation.  So a collision which
occurs along the line of sight will give us a large velocity on our
speed gun but a small apparent separation, while the same collision
viewed transversely will exhibit a small part of the velocity and a
large part of the separation.  We don't know what angle we are viewing
the system at, so the true velocity could be just about anything.

Like a rock climber inching up a chimney, the way out of this is to
push against two extremes.  If we observe any line-of-sight velocity,
the motion can't be completely transverse; in fact we can rule out a
range of near-transverse geometries because they would require an
absurdly large three-dimensional velocity.  (Absurdly large is defined
here as larger than the free-fall velocity from infinity.)  Similarly,
if we observe any transverse separation we can rule out a range of
nearly line-of-sight geometries because they would require absurdly
large three-dimensional separation.  (Absurdly large is defined here
as requiring longer than the age of the universe to reach that
separation.)
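
Here is a toy version of that chimney-climbing argument.  The observed velocity, the projected separation, and the two "absurdity" limits below are invented for illustration (and I ignore gravity's deceleration entirely); the point is just the logic of excluding both extremes.

```python
import numpy as np

# Illustrative observed quantities (not the Musketball values):
v_los = 600.0        # line-of-sight relative velocity, km/s
d_proj = 1.0         # projected (sky-plane) separation, Mpc
v_freefall = 3000.0  # assumed free-fall velocity from infinity, km/s
age_universe_gyr = 13.8

KM_PER_MPC = 3.086e19
SEC_PER_GYR = 3.156e16

# theta = angle between the collision axis and the line of sight.
theta = np.linspace(1.0, 89.0, 500)        # degrees, avoiding the exact extremes
v_3d = v_los / np.cos(np.radians(theta))   # implied 3D velocity, km/s
d_3d = d_proj / np.sin(np.radians(theta))  # implied 3D separation, Mpc

# Crude time since collision: separation divided by speed, ignoring deceleration.
t_gyr = d_3d * KM_PER_MPC / v_3d / SEC_PER_GYR

allowed = (v_3d < v_freefall) & (t_gyr < age_universe_gyr)
print(f"allowed viewing angles: {theta[allowed].min():.0f} to {theta[allowed].max():.0f} degrees")
```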

Still, we are left with a range of geometries in the middle; at the
extremes of that range the geometries aren't completely ruled out, but
look pretty unlikely.  Here Will applied another important concept:
marginalizing over all remaining possibilities.  His code ranges over
all possible geometries, tabulating how well they match the data, and
thus produces a probability distribution.  So we don't know exactly
how fast the collision was, but we can be 99% confident it is within
a certain broad range, 90% confident it is within a certain
smaller range, etc.
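
A toy Monte Carlo gives the flavor of the marginalization.  This is not Will's actual code, and every number in it is made up: draw the unknown viewing angle from an isotropic prior, draw the observed line-of-sight velocity within an assumed measurement error, discard geometries ruled out by the arguments above, and read the confidence ranges off the resulting distribution of three-dimensional collision velocities.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 200_000
v_los = rng.normal(600.0, 100.0, n)    # km/s; made-up measurement and error
# Isotropic orientations: cos(theta) is uniform, where theta is the angle
# between the collision axis and the line of sight.
cos_theta = rng.uniform(1e-3, 1.0, n)  # small floor avoids dividing by zero

v_3d = np.abs(v_los) / cos_theta       # implied 3D collision velocity

# Drop geometries ruled out by the (assumed) free-fall limit.
v_3d = v_3d[v_3d < 3000.0]

lo90, hi90 = np.percentile(v_3d, [5, 95])
lo99, hi99 = np.percentile(v_3d, [0.5, 99.5])
print(f"90% range: {lo90:.0f} to {hi90:.0f} km/s")
print(f"99% range: {lo99:.0f} to {hi99:.0f} km/s")
```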

This was a pretty big advance in sophistication compared to the way
previous studies had estimated the collision velocity and time since
collision.  Using this technique, Will demonstrated that what
astronomers don't know can hurt them---not knowing the angle from
which we are viewing a collision results in substantial uncertainty in
the quantities we need to know to study the behavior of dark matter.
To narrow things down to a useful range, we need additional
information about a collision.  (In the case of the Bullet Cluster,
the shock gives this information, but most collisions have not shown
an observable shock.)

Will also used his new technique to estimate that it has been about
700 million years since the Musketball Cluster collided, compared to
"only" 200 million years for the Bullet Cluster.  This is good for the
study of dark matter because people would be justifiably skeptical of
conclusions about the nature of dark matter based on just one kind of
collision (a recent, high-speed collision of massive clusters such as
the Bullet).  Studying a range of collision conditions--including less
recent, lower-speed collisions of less-massive clusters such as the
Musketball--gives us a much better chance of identifying universal
properties of dark matter with high confidence.


Thursday, September 20, 2012

Desperately Seeking Distances


One of the most shocking things about astronomy is that when we take a
picture of celestial objects in the night sky, we have very little
idea how far away they are.  This is utterly different from everyday
life, where our brain automatically processes distance-related clues
and instantly supplies us with correct judgements.  The brain knows
the true sizes of everyday objects, so it can use the apparent size
of, say, a car to infer its distance: the smaller the car appears, the
further away it must be.  The same can be done with the apparent
brightness of lights: if we see headlights but they appear faint, we
know the car is still far away.

But in astronomy, we can only figure out the true sizes and distances
of things with a lot of effort.  One difficulty is simply that
everything is so far away: apart from the Sun, no star is close enough
to ever appear larger than a point, so we can't judge their distances
by their apparent sizes.  Another difficulty is that the universe is
far less standardized than our man-made world: most cars have more or
less the same true size, but stars and galaxies come in a vast range
of intrinsic sizes, preventing us from forming a rule of thumb about
"if it appears this big, it must be about that far away."  Imagine if
some trickster built a 50-foot iPad, faithful in every detail.  If you
mistook it for a real iPad, you would guess that it's much closer to
you than it really is.  The universe is full of the equivalents of
50-foot iPads---stars 100 to 1000 times bigger in diameter and
millions of times bigger in volume than our Sun---as well as
50-millimeter iPads---dwarf galaxies containing thousands of times
fewer stars than does our own Milky Way galaxy.

Astronomers have painstakingly built up a vast store of knowledge
regarding the sizes and distances of things, which I won't attempt to
describe here (but at the end of this post I provide a few links to
sites which help you visualize these things).  The point is that when
a new technique to estimate distances comes along, it's a potentially
powerful tool for astronomers.  Today's episode describing a recent
paper of mine shows how I explored a new idea for determining
distances and showed that it was interesting, but ultimately less
powerful than other ideas that have already been developed.

The new idea is actually an old problem turned on its head, which is
often a useful way to make progress in science.  Imagine that you're
the assistant to a seventeenth-century scientist, put in charge of
monitoring his inventory of chemicals.  You get really frustrated
because you can't tell how much alcohol is in the narrow-necked
bottle---it keeps expanding during the day and contracting at night.
You could continue to view this as a problem, or you could turn the
problem on its head and invent the thermometer.  In science, we often
approach relationships between two or more variables (in this case,
temperature and volume) with a predetermined notion of which variable
is important or worth measuring.  But when measuring that variable
gets frustrating, brainstorming a new goal often results in a valuable
new tool.  That's easy to point out in retrospect but difficult to
apply in practice because on an everyday basis we are often too caught
up in reaching our immediate goals.

In this case, the original "problem" arises from using an effect
called gravitational lensing in which light from background
galaxies is bent by the gravity of an intervening mass concentration
such as a cluster of galaxies.  We can use this effect to determine
the mass of the cluster, if we know the distance to the background
galaxies.  In certain contexts, it's very difficult to know the
distance to the background galaxies accurately enough, and overcoming
this difficulty is an ongoing area of research for major gravitational
lensing projects now in the planning phase. 

At some point my colleague Tony Tyson suggested to my graduate student
Will Dawson that he look into how well the distances to background
galaxies could be pinned down by studying the lensing effect around a
few well-studied mass concentrations.  At the least, it might be
possible to distinguish between sets of galaxies which are more or
less in the foreground (and thus are not lensed) and sets of galaxies
which are more or less in the background (and thus are lensed).  With
different lenses at different distances, it might be possible to infer
something more specific about how galaxies are distributed in terms of
distance from us.

We tried different ways of pulling this information out of the data,
but none of them worked very well.  So I suggested something nearly as
good, at least as a first step: assuming that some solution exists,
let us compute how precise the solution could be in a best-case
scenario.  This would tell us whether continued searching for the
solution would even be worth it.  Now, the ability to compute the
precision of an experiment which has not even been performed yet seems
like magic, but in my previous post I explained how it works.
For me, the best thing about this whole project was that I did a
calculation like this for the first time (they don't teach you this
stuff in school) and therefore really understood it for the first
time.  It's really a pleasure to come to understand something which
previously seemed like a bit of a black box.

The result: lensing can be used to infer how galaxies are distributed
in terms of distance from us, but only roughly.  The precision gets
better and better as you add more data, but to do as well as other
methods which have already been developed requires a very large amount
of data indeed.  For a given amount of telescope time, the other
methods are more precise.  That doesn't mean this method will never be
used: because it piggybacks on a lot of data which will be taken
anyway for other purposes, it may someday be used to double-check that
the other methods are not way off due to faulty assumptions or other
"systematic errors."  It's always good to have multiple different ways
to check something as important as the distances of galaxies.  It may
be somewhat disappointing that this method won't be the primary method
people use, but we can take some satisfaction in definitively
answering the question "how good will this method ever be?" rather
than getting bogged down searching for marginal improvements. 

A few resources about the sizes of things in the universe:

  • Scale of the Universe is a neat visualization which lets you zoom smoothly from very small things like atoms all the way to the size of the observable universe, and has nice accompanying music.  But it doesn't show you the distances between celestial objects.  Most tools don't, because the distances are so large that 99% of your screen would be empty space!  Scale of the Universe 2 is by the same people and honestly I can't see much difference between the two. 
  • Nikon's Universcale is a similar approach, but with more accompanying text information so you can learn more.  The presentation is a little weak on the astronomical end of the scale, but strong on the micro end of the scale.
  • Powers of 10 is a classic documentary which does the same zoom trick and does show you the distances between things.  A much more slick attempt at the same thing called Cosmic Voyage was made decades later, but I still prefer the classic.

This work was supported by the University of California (and therefore to some extent by the State of California) through my salary.  I thank California for investing in research.  It ultimately pays off because research apprenticeships are how we train the next generation to become independent thinkers.

Monday, September 17, 2012

The Phisher Matrix

This is the post I've been dreading.

As regular readers know, I'm writing a blog post for each paper I
publish, in an effort to help the public understand the scientific
research that they pay for.  That research is often communicated only
to other scientists in papers which are impossible to decipher unless
the reader is already an expert on the subject, so a gentle intro to
the topic is the least I can do to give something back to the citizens
who help fund my research.

It's nearly a year since I decided to do this, but at that time I was
working on a paper based on the Fisher matrix, and I was very
reluctant to try explaining this to novices.  At one point, I was
reading the Dark Energy Task Force report to review how they used the
Fisher matrix, and I came across this sentence:

[image of the quoted sentence, not reproduced here]

My daughter looked over my shoulder and said, "Really, Dad? The Fisher
matrix is simply ....?"  So I've been procrastinating this one. 

Instead of focusing on the mathematical manipulations, let's focus on
what purpose they serve.  Imagine you work in a mail room, and your
boss gives you two boxes to weigh, and two chances to use the scale.
Naturally you will weigh each box once.  But suppose that your boss
intends to glue the boxes together and ship them as one item, and
furthermore that you need to know the total weight as precisely as
possible and the scale has a random uncertainty of +/- 0.5 pounds.
Should you weigh the boxes separately and then add the numbers, or
weigh them together, or does it not matter?  Assume the glue will add
no weight, and remember that you have two chances to use the scale to
attain the best accuracy.

If you weigh the boxes separately, you have 0.5 pound uncertainty on
the weight of the first box and 0.5 pound uncertainty on the weight of
the second box.  The uncertainty on the sum of the weights is not 1.0
pound as you might expect at first.  It's less, because if errors are
random they will not be the same every time.  For example, the scale
could read high on one box and low on the other box, so that the error
on the sum is very small.  However, we can't assume that errors will
nicely cancel every time either.  A real mathematical treatment shows
that the uncertainty on the sum is about 0.7 pounds.  (Note that we
are not considering the possibility that the scale reads high every
time.  That's a systematic error, not a random error, and we can deal
with it simply by regularly putting a known weight on the scale and
calibrating it. Scientists have to calibrate their experiments all the
time, but for this paper I am mainly thinking of random errors.)

If you weigh the boxes together, you have a 0.5 pound uncertainty on
the sum, and furthermore you can use your second chance on the scale
to weigh them together again and take the average of the two
measurements, yielding a final uncertainty of about 0.35 pounds (0.7
divided by 2, because you divide by two when you take the average of
the two measurements).  So you are twice as precise if you weigh them
together!  This may not seem like a big deal, but it can be if
procedures like this save the mail room money by not having to buy a
new high-precision scale.  Similarly, scientists think through every
detail of their experiments to squeeze out every last drop of
precision so that they can get the most bang for the buck.
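
If you don't believe the 0.7 and the 0.35, a quick simulation (with made-up box weights) will confirm them:

```python
import numpy as np

rng = np.random.default_rng(7)

true_box1, true_box2 = 12.0, 8.0  # made-up true weights, in pounds
sigma = 0.5                       # the scale's random error per weighing
n = 1_000_000

# Strategy 1: weigh each box once and add the readings.
sum1 = (true_box1 + rng.normal(0, sigma, n)) + (true_box2 + rng.normal(0, sigma, n))

# Strategy 2: weigh both boxes together twice and average the readings.
together = true_box1 + true_box2
sum2 = 0.5 * ((together + rng.normal(0, sigma, n)) + (together + rng.normal(0, sigma, n)))

print(np.std(sum1))  # about 0.71 pounds
print(np.std(sum2))  # about 0.35 pounds
```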

Now bear with me as we examine one more twist on this scenario, to
illustrate this point in more detail.  Suppose your boss changes her mind
and decides to ship the boxes separately after all.  If you were smart
enough to follow the procedure which yielded the most precise total
weight, you would now be at a complete loss, because you have no
information on the weights of the individual packages.  If you know your
boss is indecisive, you might want to devise a procedure which is nearly
optimal for the total weight, but still gives some information about the
individual weights.  For example, you could use your first chance on the
scale to weigh the boxes together, which would yield a 0.5-pound uncertainty
on the total (better than the 0.7 pounds provided by the naive procedure of
weighing the boxes separately and then summing), and use your second
chance on the scale to weigh one box alone (yielding an uncertainty of
0.5 pound on that box, the same as if you had performed the naive
procedure).  You can always obtain the weight of the second box if
necessary by subtracting the weight of the first box from the total!
We had to give up something though: the weight of the second box is
now more uncertain (0.7 pounds) because it comes from combining two
measurements which were each uncertain by 0.5 pounds.
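
That 0.7-pound figure is just propagation of errors again: the second box's weight is the difference of two independent readings, so their uncertainties add in quadrature.

```python
import numpy as np

sigma = 0.5  # pounds of random error per reading

sigma_total = sigma  # one reading of both boxes together
sigma_box1 = sigma   # one reading of box 1 alone
# Box 2 is inferred as (total) - (box 1), so the two errors add in quadrature:
sigma_box2 = np.sqrt(sigma_total**2 + sigma_box1**2)
print(sigma_total, sigma_box1, round(sigma_box2, 2))  # 0.5 0.5 0.71
```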

You probably hadn't suspected that an experiment as simple as weighing
a few boxes could become so complicated! But it's a useful exercise
because it forces us to think about what we really want to get out of
the experiment: the total weight, the weight of each box, or something
else?  Similarly, a press release about an experiment might express
its goals generically ("learn more about dark energy"), but you can
bet that the scientists behind it have thought very carefully about
defining the goals very, very specifically ("minimize the uncertainty
in dark energy equation of state parameter times the uncertainty in
its derivative").  This is particularly true of experiments which
require expensive new equipment to be built, because (1) we want to
squeeze as much precision as we can out of our experiment given its
budget, and to start doing that we must first define the goal very
specifically; and (2) if we want to have any chance of getting funded
in a competitive grant review process, we have to back up our claims
that our experiment will do such-and-such once built.

If you made it this far, congratulations!  It gets easier.  There's only one
more commonsense point to make before defining the Fisher matrix,
and that is that we don't always measure directly the things we
are most interested in.  Let's say we are most interested in the total
weight of the packages, but together they exceed the capacity of the
scale.  In that case, we must weigh them separately and infer the
total weight from the individual measurements.  We call the individual
weights the "observables" and we call the total weight a "model
parameter." This is a really important distinction in science, because
usually the observables (such as the orbits of stars in a galaxy) are
several steps removed from the model parameters (such as the density
of dark matter in that galaxy) in a logical chain of reasoning.  So to
say that we "measure" some aspect of a model (such as the density of
dark matter) is imprecise.  We measure the observables, and we infer
some parameters of the model.

Now we can finally approach the main point head-on.  The Fisher matrix is a way of predicting how precisely we can infer the parameters of the model, given that we can only observe our observables with a certain precision.  It helps us estimate the precision of an experiment before we even build it, often before we even design it in any detail!  For example, to estimate the precision of the total weight of a bunch of packages which would overload the scale if weighed together, we just need to know (1) that the precision of each weighing is +-0.5 pounds, and (2) the number of weighings we need to do.  We don't actually have to weigh anything to find out if we need to build a more precise scale!

The Fisher matrix also forecasts the relationships between different things you could infer from the experiment.  Take the experiment in which you first weigh the two boxes together, then weigh one individually and infer the weight of the second box by subtracting the weight of the first box from the weight of both boxes together.  If the scale randomly read a bit high on the first box alone, then you not only overestimate the weight of the first box, but you will underestimate the weight of the second box because of the subtraction procedure used to infer its weight. The uncertainties in the two weights are coupled together.  Those of you who did physics labs in college may recognize all this as "propagation of errors."  The Fisher matrix is a nice mathematical device for summarizing all these uncertainties and relationships when you have many observables (such as the motions of many different stars in different parts of the galaxy) and many model parameters (such as the density of dark matter in different parts of the galaxy), such that manual "propagation of errors" would be extremely unwieldy.
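
For the weighing example this machinery is almost embarrassingly small.  The sketch below is mine, not anything from the paper, and it assumes Gaussian errors and a linear model; it builds the Fisher matrix for the "weigh both boxes together, then weigh box 1 alone" experiment and reads off the forecast uncertainties and their coupling.

```python
import numpy as np

# Model parameters: the two box weights (b1, b2).
# Observables: reading 1 = b1 + b2 (both boxes on the scale), reading 2 = b1 alone.
A = np.array([[1.0, 1.0],    # how reading 1 depends on (b1, b2)
              [1.0, 0.0]])   # how reading 2 depends on (b1, b2)
sigma = 0.5                  # pounds of random error per reading
N_inv = np.eye(2) / sigma**2 # inverse covariance of the readings

F = A.T @ N_inv @ A          # the Fisher matrix
param_cov = np.linalg.inv(F) # forecast covariance of the inferred weights

print(np.sqrt(np.diag(param_cov)))  # about [0.5, 0.71]: uncertainties on b1 and b2
print(param_cov[0, 1])              # negative: the two inferred weights are coupled
```

The diagonal reproduces the 0.5 and 0.7 pounds from the box-weighing discussion, and the negative off-diagonal term is exactly the coupling described above, all without ever putting a box on the scale.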

The great thing about the Fisher matrix approach is that it gives you a best-case estimate of how precise an experiment will be, before you ever build the experiment ("best-case" being a necessary qualifier here because you can always screw up the experiment after designing it, or screw up the data analysis after recording the data). Thus, it can tell you whether an experiment is worth doing and potentially save you a lot of money and trouble. You can imagine many different experiments and do a quick Fisher matrix test on each one to see which one will yield the most precise results. Or you can imagine an experiment of a type no one thought of before, and quickly show whether it is competitive with current experimental approaches in constraining whatever model parameters you want to constrain. It's a way of "phishing" for those experiments which will surrender the most information.

That's the Fisher matrix, but what did I do with it in my paper? Well, this has been a pretty long post already, so I'll deal with that in my next post.  Meanwhile, if you want to follow up some of the ideas here, try these links:

  • The report of the Dark Energy Task Force contains a solid review of
    the Fisher matrix for professional physicists
  • The Wikipedia article on Design of experiments goes through an
    example of weighing things in different combinations, as well as
    clarification of statistical vs systematic errors and lots of other
    terms.
  • A very informal guide I wrote to introduce the Fisher matrix to, say,
    undergraduate physics majors.

Saturday, August 25, 2012

Cosmic Magnification

In the previous post I noted how wide-area surveys of the sky like the
Deep Lens Survey serve the dual purposes of finding rare objects and
surveying a representative sample of the universe (to determine its
average density, for example), and I described one of our successes in
the former category while promising to post an example in the second
category.  Here it is!

First, we have to understand that the path of light is bent by gravity
and therefore, if we can observe some consequence of this bending, we
can learn about how much mass is between us and the source of light.
I'm not going to explain this in any detail here, but if you wish you
can watch my YouTube video on the subject, or just skip to the part
where I do a demo showing that this bending can lead to magnification.
In that demo I don't specifically point out the magnification, but at
one point you can clearly see that the blue ring on the whiteboard has
been magnified.

If we observe this magnification while looking in one very specific
direction as in the video, we can find how much mass is lurking in the
object which provides the magnification (usually a specific galaxy or
cluster of galaxies).  A few galaxies happen to have background
sources of light lined up just right so that we can see the
magnification easily, so we can learn about those specific galaxies.
But are they representative of galaxies in general?  Probably not,
because the most massive galaxies provide the most magnification and
are more likely to get noticed this way.  Also, having the mass more
concentrated toward the center of the galaxy helps, so if we just
study these galaxies, we will be looking only at the more massive,
concentrated galaxies.

In our wide-field survey, a team led by graduate student Chris
Morrison measured the very small amount of magnification around the
locations of hundreds of thousands of typical galaxies.  Their
statistical analysis doesn't measure the magnification caused by each
galaxy (which would be too small to measure), but it measures the
typical magnification caused by the galaxies in aggregate.  For this
reason, this type of analysis is called "cosmic magnification" which
sounds mysterious but can be thought of as "magnification caused by
the general distribution of mass in the cosmos rather than by a
specific identifiable lump of mass."

The amount of cosmic magnification tells us not only about the
distribution of mass in the universe, but also about the distances
between us, the magnifying masses, and the sources of light.  (Imagine
watching the wineglass demo in my video, but having me move the
wineglass much closer to the whiteboard...you can probably predict
that the magnification will be less.)  These are two very fundamental
things about the universe which astronomers are trying to measure,
because they are both affected by the expansion rate of the universe,
and the expansion rate is unexpectedly accelerating.  Three
astronomers won the 2011 Nobel Prize in Physics for their role in
discovering this acceleration, and ever since they discovered it
(1998), many astronomers and physicists have focused on figuring out
why.  Some attack this question from a theoretical point of view (a
theorist coined the term "dark energy" which has become the popular
term, but be warned that it may not be caused by a new form of energy
at all), and others attack it from an observational point of view: if
we can get better and better measurements of how the expansion is
actually behaving, we can rule out some of the theories which have
been proposed to explain it.  Cosmic magnification has a real role to
play in that process, and Morrison's paper is the first one to
measure, even in a crude way, how cosmic magnification increases as we
increase the distance between us and the masses causing the
magnification.

Tuesday, August 14, 2012

Colliding clusters of galaxies

One of the questions generated by my previous post describing the
Deep Lens Survey is: Why do such a large survey of the sky?  What
do you hope to accomplish that the Hubble Space Telescope can't?

HST is great at some things but not others.  Expecting HST to be great
at everything in astronomy is like expecting a great criminal-defense
attorney to also be great for cases involving bankruptcy law, probate
law, torts, and tax law.  Novices would put all of these things under
the single category of "law," but people closer to the legal system
recognize that these are very different specialties.  Similarly, if
you look closer at "astronomy" or "telescopes" you realize that
there's such a wide variety that no one telescope can do it all.  And
whether it's attorneys or astronomy, the few performers which become
known outside the field are those with some combination of high
performance in the field and a good public relations machine.

So what is HST great at?  It was launched into space primarily because
turbulence in Earth's atmosphere makes images blurry.  Above the
atmosphere, HST can take really sharp images.  The flip side of
capturing these really fine details is that it can't capture a very
wide panorama.  So we need very big, wide surveys from other
telescopes to find things which are interesting enough to follow up with
HST and other specialized telescopes such as X-ray telescopes (which,
like HST, need to be above the atmosphere and are therefore similarly
expensive and rare).

But wide surveys are more than just rare-object finders for HST and
other specialized telescopes.  Equally important, they give us a
representative sample of the universe.  Just as an anthropologist
could not fully understand how humans live by studying only the "most
interesting" countries (the ones with revolutions underway, for
example), astronomers could not understand the universe in general
just by studying the most interesting objects.  I'll give an example
of the rare-object-finding capability of the Deep Lens Survey in this
post, and an example of the understanding-the-universe-in-general
capability in my next post.

Rare objects are scientifically interesting for many reasons. Some of
them tell us about extremes: knowing the mass of the most massive star
or the luminosity of the most luminous star tells us something about
how stars work.  In other cases, what is rare and interesting is not
so much the object itself as the stage it happens to be in right now.
Because the lifetimes of celestial objects are millions or billions of
years, we can't follow a single star, say, over its lifetime to
determine its life stages.  Instead we have to piece together their
life cycles from different stars seen in different stages.  Imagine an
alien anthropologist who pieces together the human life cycle from one
day's visit to Earth: because a small fraction of humans are babies
right now, that must mean that people spend a small fraction of their
lives as babies.  In the same way, a certain star or galaxy may not be
intrinsically special, but if we happen to be seeing it at a special point in
its life cycle, that helps us understand all objects of its type.
Finally, in some cases objects are particularly interesting because we
have a particularly clear view of them.  Just as an overhead camera's picture
of the top of a person's head is less informative than a picture of their face,
Earth's view of many celestial objects is not fully informative. Objects
which happen to expose their "faces" to us give us more insight, which
can then be applied even to those objects which do not face us.

Today I want to highlight a collision between two galaxy clusters
which was discovered in the Deep Lens Survey.  Imagine observing a
head-on collision between two large trucks.  You will observe a lot
more of the details if you are standing by the side of the road (a
"transverse" view) than if you are driving behind one of the trucks.
My student Will Dawson was the first to realize that we have a
transverse view of this collision. This immediately makes it
interesting because if the component parts of the clusters (galaxies,
hot gas, dark matter) become separated, a transverse view gives us the
best chance of seeing that separation and therefore learning more
about those components.

In particular, separation between the dark matter (which carries most
of the mass) and the hot gas (which is the second-most-massive
component) is important because dark matter has never been observed
very directly.  Astronomers infer the existence of dark matter when
orbits (of galaxies in a cluster, for example) are too fast to be
explained by the gravity of all the visible mass (stars and gas).
Therefore, the conclusion goes, there must be some invisible component
with substantial mass: dark matter.  Observing a clear separation
between dark matter (ie, the bulk of the mass) and normal matter would
boost our confidence in this conclusion, and help refute competing
hypotheses (for example, that what we understand about gravity and
orbits from studying the solar system may not fully apply to these
other systems).  This "direct empirical proof of the existence of dark
matter" was first done for a transversely colliding galaxy cluster
called the Bullet Cluster, which you should definitely read about if
you are interested in this topic.  A good place to start for beginners might
be this Nova Science Now video.

Finally, we get to the research paper I wanted to highlight.  It's an
examination of the evidence we collected regarding the aforementioned
collision in the Deep Lens Survey, including not only the original DLS
images but also data from HST, the Chandra X-ray telescope, the
10-meter Keck telescopes and other telescopes.  The conclusion:
it is indeed a transversely viewed collision of galaxy clusters
with a substantial separation between the dark matter and the hot gas.
My student Will Dawson is the principal author who assembled all these
pieces, with substantial help from many co-authors.  This is something
of a teaser paper: it's not an exhaustive analysis, but it's enough of
an analysis to establish that it's an important system worthy of
further study.  All further studies of the system (including proposals for more
telescope time) will cite this paper because it lays out the basic facts.

Closeup of our colliding clusters, with the location of the mass (mostly dark matter) painted on in blue and the location of the hot gas painted on in red. If you look closely you can see that there are many galaxies colocated with the mass, but not with the hot gas.  This (temporary) ejection of the hot gas allows us to study dark matter more clearly.


I've tried to keep this post short and relatively free of technical details, so
some readers may want more.  A good place to start is Will's research page.
And  feel free to ask questions in the comments below!  I may give a quick
response in the comments, or I may use them to motivate a future post.

Finally, to bring this back to the question which initially stimulated
this post: this is a special system which we did study with HST, but
we never could have found it without a wide survey like the Deep Lens
Survey.

Tuesday, July 17, 2012

What does figure skating have to do with astronomy?

With this post I add a new ingredient to this blog: explaining my
astrophysical research.  I do feel that scientists have an obligation
to help the public understand their research, most of which is funded
by the public, and I'm afraid that science journalism is generally not
up to this important task.  Science journalism does well when a story
easily fits into the "gee whiz, isn't this a cool result" format, but
tends to be too focused on "breakthroughs" rather than the
investigative process which really makes science what it is (Radiolab
excepted).  Most "breakthroughs" are overstated because scientists
have an incentive to overstate the importance of their specific
contribution to the field, and the media generally give science
journalists too little space to explore the process of science more
deeply.  Maybe scientists blogging about their work can help fill a
gap.

That's easy to say, but harder to deliver.  I don't pretend that I
will have a lot of spare time to write thorough, clear explanations of
many aspects of my research.  But I do hope to convey, for each paper
I author or co-author, why that paper is important or interesting, how
that fits into the bigger picture of astrophysics research, and maybe
something interesting about how the research was done.  I'll try to do
a paper every week or two over the summer, eventually describing all
the papers dating back to last fall when I first wanted to do this.

The first paper I'd like to describe is (like all my papers) freely
available on the arXiv. It solves a problem which
appears not only in astrophysics, but also in figure skating and many
other contexts.  (I'm not claiming that I came up with the original
solution to such a far-reaching problem, just that this paper
describes how to apply it in a certain astrophysical niche.)  So let's
start with the figure-skating version of the problem.  Say you have a
figure-skating competition with hundreds of skaters.  Evaluating each
skater with the same panel of judges would be too time-consuming; the
competition would have to be stretched over a period of weeks and the
judges would have judging fatigue by the end.  You need to have
several panels of judges working simultaneously in several rinks, at
least to narrow it down enough so that the top few skaters can have a
head-to-head competition for the championship at the end.

Given that, how do we make it as fair as possible?  One panel of
judges may be stricter than another, so that a skater's score depends
as much on a random factor (which panel judged her) as on her
performance.  Remixing the judging panels from time to time, by
itself, doesn't help.  That does prevent situations such as "the most
lenient panel sits all week in Rink 1," but it generates situations
such as "a lenient panel sat in Rink 1 Monday morning, a slightly
different lenient panel sat in Rink 2 Monday afternoon, yet another
lenient panel sat in Rink 3 Tuesday afternoon," etc.  But remixing the
panels does enable the solution: cross-calibration of judges.

Let's say judges Alice, Bob, and Carol sit on Panel 1 Monday morning
and judge ten skaters.  We can quantify how lenient Alice is relative
to Bob and Carol, just by comparing the scores they give for the same
ten skaters.  The mathematical description of leniency could be simple
("Alice gives 0.5 points more than Bob and 0.25 points more than
Carol, on average") or complicated ("Alice scores the top skaters just
as harshly as Bob and Carol but is progressively less harsh on
lower-scoring skaters as captured by this mathematical formula")
without changing the basic nature of the cross-calibration process
described here.  At the same time, judges David, Ethan, and Frank sit
on Panel 2 Monday morning and judge ten other skaters.  We can
quantify how lenient David is relative to Ethan and Frank by comparing
the average scores they give for their ten skaters.

But we still don't know how lenient Alice, Bob, and Carol are as a
group, compared to David, Ethan, and Frank; if Panel 1's scores were
higher than Panel 2's on average, we can't tell if that's because
Panel 1 is more lenient or because the skaters assigned to Panel 1
happened to be better on average than the skaters assigned to Panel 2.
So in the afternoon session we switch Alice and David.  Now that we
can measure how lenient Alice is relative to Ethan and Frank, and how
lenient David is relative to Bob and Carol, we know the relative
leniency of all six judges, and we can go back and adjust each
skater's score for the leniency of her judges.
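
Here is a toy version of that cross-calibration (with invented scores, not real judging data): write every score as skater quality plus judge leniency, and solve the whole tangle at once by least squares, pinning down the overall zero point by requiring the leniencies to average to zero.

```python
import numpy as np

rng = np.random.default_rng(3)

n_judges, n_skaters = 6, 20
true_quality = rng.normal(7.0, 1.0, n_skaters)
true_leniency = rng.normal(0.0, 0.5, n_judges)

# Each skater is scored by a random panel of 3 judges; because the panels get
# remixed, every judge eventually overlaps with every other judge.
rows, scores = [], []
for s in range(n_skaters):
    for j in rng.choice(n_judges, size=3, replace=False):
        row = np.zeros(n_judges + n_skaters)
        row[j] = 1.0             # this judge's leniency contributes to the score
        row[n_judges + s] = 1.0  # so does this skater's quality
        rows.append(row)
        scores.append(true_quality[s] + true_leniency[j] + rng.normal(0, 0.2))

# Fix the overall zero point: require the leniencies to average to zero.
constraint = np.zeros(n_judges + n_skaters)
constraint[:n_judges] = 1.0

A = np.vstack(rows + [constraint])
b = np.array(scores + [0.0])
solution, *_ = np.linalg.lstsq(A, b, rcond=None)

print("estimated leniencies:", np.round(solution[:n_judges], 2))
print("true leniencies:     ", np.round(true_leniency - true_leniency.mean(), 2))
```

The same kind of linear algebra, with camera regions in place of judges and stars or galaxies in place of skaters, is the basic idea behind the calibration described later in this post.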

This system isn't perfect.  However we choose to describe leniency,
that description is probably too simple to capture all the real-life
effects.  For example, perhaps Alice isn't that much more lenient than
Bob and Carol overall, but she is very lenient on the triple axel,
which most or all of Monday morning skaters happened to do.  We would
then be making a mistake by looking at Alice's Monday morning scores
and concluding that she is overall more lenient than Bob and Carol.
But this system is surely better than no system at all.  Scientists
make models of reality, and we know those models never capture all the
complexity, but we are satisfied if they capture the most important
features.  The purpose of a model is to simplify reality enough to
understand its most important features.  Over time, if we are not
satisfied with a model, we can increase its complexity to capture more
and more features of reality.  As Einstein said, "make things as
simple as possible, but not simpler."  In this example, we could
improve the model, at the cost of additional complexity, if the
judges wrote down their scores for each performance element.  We could
then compute the leniency of each judge for each element and apply these
more finely tailored correction factors.  But in practice, a simple correction
for an overall leniency factor probably gets rid of most of the unfairness
in the system.

In astronomy, assessing the brightnesses of stars and galaxies is like
assessing the performance of figure skaters in this giant
competition.  We have too many stars and galaxies to assess to do it
one at a time, so we build a camera with a wide field of view, which
can look at many thousands of stars and galaxies simultaneously.  But
different parts of the camera's field of view may not be equally
sensitive.  By taking one picture and then repointing the camera a
little bit so that each star or galaxy is assessed first by one area
of the camera and then by another, we can compute the relative
leniencies of the different areas of the camera.  Then we can infer
the true brightness of any star or galaxy by correcting for the
"leniency" of the area(s) upon which its light fell.

The social sciences are actually years ahead of the physical sciences
in applying these kinds of models.  The reason is that social sciences
very often face the problem of calibrating the relative difficulty of
things for which there is no absolute standard.  For example, on an
exam or evaluation how much more difficult is Question 10 compared to
Question 1?  There is no absolute standard for difficulty, so social
scientists have developed methods, as described above for the figure
skaters, for calibrating relative difficulty from the test responses
("the data") themselves.  This is quite different from a typical
physics experiment in which we directly compare the thing to be
measured with some kind of objective reference; for example, in
astronomy we could more or less directly compare the measured
brightness of a star or galaxy ("the science data") with a calibrated
light source ("the calibration data").  So astronomers typically had
no reason to try to use the star/galaxy data alone for the kind of
self-calibration that social scientists do.

But this is changing for several reasons.  First, sky surveys are
getting more massively parallel.  We now have cameras which take
images of millions of galaxies in a single shot, spread over a billion
pixels.  We can no longer do the most direct calibration---shining a
calibrated lamp on the very same pixel---for each star or
galaxy.  Second, we never really did such a direct comparison all the
time.  We often fooled ourselves into thinking that an occasional
direct comparison was good enough to keep things on track, but we are
now demanding more precision out of our sky surveys.  Third, precisely
because the science data have so much information, we should use that
information for self-calibration as much as possible, rather than rely
on a different and smaller set of data deemed to be "calibration data."
This point was established by Padmanabhan et al (2008) who christened
this approach (using the main data set to do as much internal calibration as
possible) "ubercal" and applied it to the Sloan Digital Sky Survey (SDSS).

My paper adopts this general approach for calibrating a sky survey
called the Deep Lens Survey (DLS), but because DLS is implemented very
differently from SDSS, the choices made in implementing ubercal for
DLS were very different.  One goal of the paper was of course to
document what was done for DLS, but this was a goal which could have
been accomplished in other ways (a section of a larger paper on DLS,
for example).  The reasons for making it a standalone paper were (1)
because most surveys are more like the DLS than like the SDSS, provide
a detailed worked example of applying the ubercal idea to this type of
survey; and (2) raise awareness that ubercal MUST be done to get the
most out of our surveys, because the errors left by the previously
standard calibration techniques are surprisingly large.  Only by
applying ubercal were we able to quantify just how large they were.

If you want to learn more about how social scientists calibrate tests
and evaluations, look up "psychometrics."  This is a pretty broad
area, so you may find it easier to focus in on one specific technique
called Rasch modeling.  I learned from a book on Rasch modeling that
to become a medical examiner in Illinois, doctors were randomly
assigned a case study from a pool of cases, and randomly assigned an
evaluator from a pool of evaluators.  But it turned out that the main
factors influencing whether a candidate passed were (1) which case
study was assigned, as some were easier than others; and (2) which
evaluator was assigned, as some were easier than others. This was
discovered by doing Rasch modeling to determine the relative leniency
of each evaluator and the relative difficulty of each case.  After
correcting a candidate's score for these factors to obtain a "true"
score indicating candidate quality, it was apparent that candidate
quality was not a very important factor in determining who passed and
who failed!  (Aficionados should be aware that candidate quality is
really a model parameter rather than an afterthought as this
description may imply, but novices need not care about this
distinction.)
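
For the curious, the core of the Rasch model is one formula: the probability of passing depends only on the difference between the candidate's ability and the item's difficulty.  A minimal sketch with made-up numbers:

```python
import numpy as np

def rasch_pass_probability(ability, difficulty):
    """Rasch model: the probability that a candidate of a given ability
    passes an item (a case study, an evaluator...) of a given difficulty."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

# Made-up numbers: the same candidate facing an easy and a hard case study.
print(rasch_pass_probability(ability=0.5, difficulty=-1.0))  # about 0.82
print(rasch_pass_probability(ability=0.5, difficulty=+2.0))  # about 0.18
```

Fitting the abilities and difficulties to real pass/fail data takes more work, of course, but the model itself really is that simple.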

Rasch modeling can be used not just to compute the overall leniency of
each judge, but also to help flag scores which are unexpected given
that judge's leniency.  For example, perhaps a Russian figure skating
judge at the Olympics is consistently a bit tougher than the other
judges, but then is no tougher than the other judges when a Russian
skates.  Without statistical analysis such as Rasch modeling, we
wouldn't know exactly how much tougher the Russian judge is normally,
and therefore we wouldn't know how much of a break the Russian judge
gave the Russian skater by giving a score in line with the other
judges.  The Russian judge could argue that she showed no favoritism
because her score for the Russian skater was no higher than the other
judges' scores.  But a statistical analysis such as Rasch modeling, by
quantifying the Russian judge's strictness for most skaters, could
provide strong evidence that she did do a favor for the Russian
skater, and quantify the size of that favor.  There are analogies for
this too in astronomy, where ubercal helps flag bad data.  (Maven
alert: ubercal is not Rasch modeling because the underlying
mathematical model is linear in ubercal, but the concept is the same.)
If you want to read about the application of Rasch modeling to some
figure skating controversies, start here.

Work on the Deep Lens Survey has been funded by Bell Laboratories/Lucent Technologies, the National Science Foundation, NASA, and the State of California (indirectly, but importantly, through University of California faculty salaries and startup funds).  Thank you!  Our data were obtained at Kitt Peak National Observatory in Arizona and Cerro Tololo Inter-American Observatory in Chile, which are wonderful scientific resources funded by the National Science Foundation.