Tuesday, July 17, 2012

What does figure skating have to do with astronomy?

With this post I add a new ingredient to this blog: explaining my
astrophysical research.  I do feel that scientists have an obligation
to help the public understand their research, most of which is funded
by the public, and I'm afraid that science journalism is generally not
up to this important task.  Science journalism does well when a story
easily fits into the "gee whiz, isn't this a cool result" format, but
tends to be too focused on "breakthroughs" rather than the
investigative process which really makes science what it is (Radiolab
excepted).  Most "breakthroughs" are overstated because scientists
have an incentive to overstate the importance of their specific
contribution to the field, and the media generally give science
journalists too little space to explore the process of science more
deeply.  Maybe scientists blogging about their work can help fill a
gap.

That's easy to say, but harder to deliver.  I don't pretend that I
will have a lot of spare time to write thorough, clear explanations of
many aspects of my research.  But I do hope to convey, for each paper
I author or co-author, why that paper is important or interesting, how
that fits into the bigger picture of astrophysics research, and maybe
something interesting about how the research was done.  I'll try to do
a paper every week or two over the summer, eventually describing all
the papers dating back to last fall when I first wanted to do this.

The first paper I'd like to describe is (like all my papers) freely
available on the arXiv. It solves a problem which
appears not only in astrophysics, but also in figure skating and many
other contexts.  (I'm not claiming that I came up with the original
solution to such a far-reaching problem, just that this paper
describes how to apply it in a certain astrophysical niche.)  So let's
start with the figure-skating version of the problem.  Say you have a
figure-skating competition with hundreds of skaters.  Evaluating each
skater with the same panel of judges would be too time-consuming; the
competition would have to be stretched over a period of weeks and the
judges would have judging fatigue by the end.  You need to have
several panels of judges working simultaneously in several rinks, at
least to narrow it down enough so that the top few skaters can have a
head-to-head competition for the championship at the end.

Given that, how do we make it as fair as possible?  One panel of
judges may be stricter than another, so that a skater's score depends
as much on a random factor (which panel judged her) as on her
performance.  Remixing the judging panels from time to time, by
itself, doesn't help.  That does prevent situations such as "the most
lenient panel sits all week in Rink 1," but it generates situations
such as "a lenient panel sat in Rink 1 Monday morning, a slightly
different lenient panel sat in Rink 2 Monday afternoon, yet another
lenient panel sat in Rink 3 Tuesday afternoon," etc.  But remixing the
panels does enable the solution: cross-calibration of judges.

Let's say judges Alice, Bob, and Carol sit on Panel 1 Monday morning
and judge ten skaters.  We can quantify how lenient Alice is relative
to Bob and Carol, just by comparing the scores they give for the same
ten skaters.  The mathematical description of leniency could be simple
("Alice gives 0.5 points more than Bob and 0.25 points more than
Carol, on average") or complicated ("Alice scores the top skaters just
as harshly as Bob and Carol but is progressively less harsh on
lower-scoring skaters as captured by this mathematical formula")
without changing the basic nature of the cross-calibration process
described here.  At the same time, judges David, Ethan, and Frank sit
on Panel 2 Monday morning and judge ten other skaters.  We can
quantify how lenient David is relative to Ethan and Frank by comparing
the average scores they give for their ten skaters.

But we still don't know how lenient Alice, Bob, and Carol are as a
group, compared to David, Ethan, and Frank; if Panel 1's scores were
higher than Panel 2's on average, we can't tell if that's because
Panel 1 is more lenient or because the skaters assigned to Panel 1
happened to be better on average than the skaters assigned to Panel 2.
So in the afternoon session we switch Alice and David.  Now that we
can measure how lenient Alice is relative to Ethan and Frank, and how
lenient David is relative to Bob and Carol, we know the relative
leniency of all six judges, and we can go back and adjust each
skater's score for the leniency of her judges.
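
To make the bookkeeping concrete, here is a minimal sketch in Python of how this cross-calibration can be posed as a single least-squares fit.  Everything in it (panel assignments, scores, the choice of a purely additive leniency) is invented for illustration and is not from the paper: each score is modeled as skater quality plus judge leniency, and swapping judges between panels is what ties all six judges together.

```python
import numpy as np

# (skater, judge, score).  Judges 0-2 are Panel 1 and judges 3-5 are Panel 2
# on Monday morning; in the afternoon judges 0 and 3 swap panels, which is
# what connects the two groups of judges.
scores = [
    (0, 0, 8.7), (0, 1, 8.2), (0, 2, 8.4),
    (1, 0, 7.9), (1, 1, 7.5), (1, 2, 7.6),
    (2, 3, 8.8), (2, 4, 8.9), (2, 5, 8.6),
    (3, 3, 7.2), (3, 4, 7.4), (3, 5, 7.1),
    (4, 3, 8.1), (4, 1, 7.6), (4, 2, 7.8),   # afternoon: judge 3 with 1 and 2
    (5, 0, 8.5), (5, 4, 8.3), (5, 5, 8.0),   # afternoon: judge 0 with 4 and 5
]
n_skaters, n_judges = 6, 6

# Model each score as (skater quality) + (judge leniency).
A = np.zeros((len(scores) + 1, n_skaters + n_judges))
y = np.zeros(len(scores) + 1)
for row, (s, j, val) in enumerate(scores):
    A[row, s] = 1.0
    A[row, n_skaters + j] = 1.0
    y[row] = val

# The data only pin down qualities and leniencies up to a constant that can
# slosh between them, so fix the scale: leniencies sum to zero.
A[-1, n_skaters:] = 1.0

params, *_ = np.linalg.lstsq(A, y, rcond=None)
quality, leniency = params[:n_skaters], params[n_skaters:]
print("judge leniencies:         ", np.round(leniency, 2))
print("leniency-corrected scores:", np.round(quality, 2))
```

The "corrected scores" are just the fitted skater qualities: what each skater would have scored under an average-leniency judge.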

This system isn't perfect.  However we choose to describe leniency,
that description is probably too simple to capture all the real-life
effects.  For example, perhaps Alice isn't that much more lenient than
Bob and Carol overall, but she is very lenient on the triple axel,
which most or all of the Monday morning skaters happened to do.  We
would then be making a mistake by looking at Alice's Monday morning
scores and concluding that she is overall more lenient than Bob and Carol.
But this system is surely better than no system at all.  Scientists
make models of reality, and we know those models never capture all the
complexity, but we are satisfied if they capture the most important
features.  The purpose of a model is to simplify reality enough to
understand its most important features.  Over time, if we are not
satisfied with a model, we can increase its complexity to capture more
and more features of reality.  As Einstein reportedly said, "make things as
simple as possible, but not simpler."  In this example, we could
improve the model, at the cost of additional complexity, if the
judges wrote down their scores for each performance element.  We could
then compute the leniency of each judge for each element and apply these
more finely tailored correction factors.  But in practice, a simple correction
for an overall leniency factor probably gets rid of most of the unfairness
in the system.

In astronomy, assessing the brightnesses of stars and galaxies is like
assessing the performance of figure skaters in this giant
competition.  We have too many stars and galaxies to assess them
one at a time, so we build a camera with a wide field of view, which
can look at many thousands of stars and galaxies simultaneously.  But
different parts of the camera's field of view may not be equally
sensitive.  By taking one picture and then repointing the camera a
little bit so that each star or galaxy is assessed first by one area
of the camera and then by another, we can compute the relative
leniencies of the different areas of the camera.  Then we can infer
the true brightness of any star or galaxy by correcting for the
"leniency" of the area(s) upon which its light fell.

The social sciences are actually years ahead of the physical sciences
in applying these kinds of models.  The reason is that social sciences
very often face the problem of calibrating the relative difficulty of
things for which there is no absolute standard.  For example, on an
exam or evaluation how much more difficult is Question 10 compared to
Question 1?  There is no absolute standard for difficulty, so social
scientists have developed methods, as described above for the figure
skaters, for calibrating relative difficulty from the test responses
("the data") themselves.  This is quite different from a typical
physics experiment in which we directly compare the thing to be
measured with some kind of objective reference; for example, in
astronomy we could more or less directly compare the measured
brightness of a star or galaxy ("the science data") with a calibrated
light source ("the calibration data").  So astronomers typically had
no reason to try to use the star/galaxy data alone for the kind of
self-calibration that social scientists do.

But this is changing for several reasons.  First, sky surveys are
getting more massively parallel.  We now have cameras which take
images of millions of galaxies in a single shot, spread over a billion
pixels.  We can no longer do the most direct calibration---shining a
calibrated lamp on the very same pixel---for each star or
galaxy.  Second, we never really did such a direct comparison all the
time.  We often fooled ourselves into thinking that an occasional
direct comparison was good enough to keep things on track, but we are
now demanding more precision out of our sky surveys.  Third, precisely
because the science data have so much information, we should use that
information for self-calibration as much as possible, rather than rely
on a different and smaller set of data deemed to be "calibration data."
This point was established by Padmanabhan et al. (2008), who christened
this approach (using the main data set to do as much internal calibration as
possible) "ubercal" and applied it to the Sloan Digital Sky Survey (SDSS).

My paper adopts this general approach for calibrating a sky survey
called the Deep Lens Survey (DLS), but because DLS is implemented very
differently from SDSS, the choices made in implementing ubercal for
DLS were very different.  One goal of the paper was of course to
document what was done for DLS, but this was a goal which could have
been accomplished in other ways (a section of a larger paper on DLS,
for example).  The reasons for making it a standalone paper were (1) to
provide a detailed worked example of applying the ubercal idea to this
type of survey, because most surveys are more like the DLS than like the
SDSS; and (2) to raise awareness that ubercal MUST be done to get the
most out of our surveys, because the errors left by the previously
standard calibration techniques are surprisingly large.  Only by
applying ubercal were we able to quantify just how large they were.

If you want to learn more about how social scientists calibrate tests
and evaluations, look up "psychometrics."  This is a pretty broad
area, so you may find it easier to focus in on one specific technique
called Rasch modeling.  I learned from a book on Rasch modeling that
to become a medical examiner in Illinois, doctors were randomly
assigned a case study from a pool of cases, and randomly assigned an
evaluator from a pool of evaluators.  But it turned out that the main
factors influencing whether a candidate passed were (1) which case
study was assigned, as some were easier than others; and (2) which
evaluator was assigned, as some were easier than others. This was
discovered by doing Rasch modeling to determine the relative leniency
of each evaluator and the relative difficulty of each case.  After
correcting a candidate's score for these factors to obtain a "true"
score indicating candidate quality, it was apparent that candidate
quality was not a very important factor in determining who passed and
who failed!  (Aficionados should be aware that candidate quality is
really a model parameter rather than an afterthought as this
description may imply, but novices need not care about this
distinction.)

Rasch modeling can be used not just to compute the overall leniency of
each judge, but also to help flag scores which are unexpected given
that judge's leniency.  For example, perhaps a Russian figure skating
judge at the Olympics is consistently a bit tougher than the other
judges, but then is no tougher than the other judges when a Russian
skates.  Without statistical analysis such as Rasch modeling, we
wouldn't know exactly how much tougher the Russian judge is normally,
and therefore we wouldn't know how much of a break the Russian judge
gave the Russian skater by giving a score in line with the other
judges.  The Russian judge could argue that she showed no favoritism
because her score for the Russian skater was no higher than the other
judges' scores.  But a statistical analysis such as Rasch modeling, by
quantifying the Russian judge's strictness for most skaters, could
provide strong evidence that she did do a favor for the Russian
skater, and quantify the size of that favor.  There are analogies for
this too in astronomy, where ubercal helps flag bad data.  (Maven
alert: ubercal is not Rasch modeling because the underlying
mathematical model is linear in ubercal, but the concept is the same.)
If you want to read about the application of Rasch modeling to some
figure skating controversies, start here.
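
To show what "flagging an unexpected score" might look like in practice, here is a small sketch with invented fitted values.  It uses the simple additive (linear) model from the judging example rather than a true logistic Rasch fit, so it only illustrates the flagging concept: once a fit has produced each judge's leniency and each skater's quality, any single score can be compared against its expectation and flagged if the gap is large.

```python
import numpy as np

def flag_unexpected(scores, quality, leniency, threshold=0.4):
    """Return (skater, judge, observed, expected) for scores that deviate
    from the fitted skater-plus-judge model by more than `threshold`."""
    flags = []
    for s, j, observed in scores:
        expected = quality[s] + leniency[j]
        if abs(observed - expected) > threshold:
            flags.append((s, j, observed, round(expected, 2)))
    return flags

# Invented fitted values: judge 2 normally scores 0.4 below the average judge,
# so a score merely "in line with the others" for skater 1 shows up as a
# half-point favor.
quality = np.array([8.3, 8.6])         # fitted skater qualities
leniency = np.array([0.1, 0.3, -0.4])  # fitted judge leniencies
scores = [(0, 0, 8.4), (0, 2, 7.9), (1, 1, 8.9), (1, 2, 8.7)]
print(flag_unexpected(scores, quality, leniency))
```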

Work on the Deep Lens Survey has been funded by Bell Laboratories/Lucent Technologies, the National Science Foundation, NASA, and the State of California (indirectly, but importantly, through University of California faculty salaries and startup funds).  Thank you!  Our data were obtained at Kitt Peak National Observatory in Arizona and Cerro Tololo Inter-American Observatory in Chile, which are wonderful scientific resources funded by the National Science Foundation.