
Sunday, February 7, 2016

Sports: more random than you think (part 2)

In two recent posts I debunked the idea that some NFL teams vary greatly in their home-field advantage, and I showed how the same thinking tool can be used to confirm that NFL teams do vary in their skill.  In this post I ask the same question of NFL experts: does the variation in their game-picking accuracy reflect a true variation in skill or expertise?

The previous posts explain how to do this.  First, calculate how much variation must randomly creep in to the results even if there is no true variation in skill.  Then, compare this to the documented variation in results. If the model and data exhibit about the same amount of variation, you can claim that there is no evidence for true variation in skill.  True variation in skill will result in performances outside the envelope of simulated performances.

I found a nice list of the records of 133 expert pickers for the 2015 regular season (256 games in all) at nflpickwatch.com. Here is the histogram of their correctness percentages (blue):


The green histogram is a simulation of 100 times as many experts, assuming they all have a "true" accuracy rate of 60.5%.  That is, each pick has a 60.5% chance of being correct, and the actual numbers for each expert reflect good or bad luck in having more or fewer of these "coin tosses" land their way.  I simulated so many experts (and then divided by 100) so that we could see a smooth distribution.  A few things stand out in this plot:

  • No one did better than the simulation.  In other words, there is not strong evidence that anyone is more skilled than a 60.5% long-term average, even if they happened to hit nearly 67% this season.  Of course, we can't prove this; it remains possible that a few experts are slightly more skilled than the average expert.  The best we can do is extend the analysis to multiple seasons so that the actual data more closely approach the long-term average.  In other words, the green histogram would get narrower if we included 512 or 1024 games, and if some of the spread in the blue histogram is really due to skill, the blue histogram will not narrow as much.
  • A few people did far worse than the simulation.  In other words, while the experts who outdid everyone else probably did so on the basis of luck, the ones who did poorly cannot rely on bad luck as the explanation.  They really are bad.  How could anyone do as poorly as 40% when you could get 50% by flipping a coin and 57.5% by always picking the home team?
  • Because the home team wins 57.5% of the time, the experts are adding some knowledge beyond home-team advantage---but not much.  Or rather, they may have a lot of knowledge but that knowledge relates only weakly to which team wins.  This suggests that random events are very important.  Let's contrast this with European soccer; I found a website that claims they correctly predict the winner of a soccer match 88% of the time.  European soccer has few or none of the features that keep American teams roughly in parity with each other: better draft picks for losing teams, salary cap, revenue sharing, etc.  It's much more of a winner-take-all business, which makes the outcomes of most matches rather predictable.  In leagues with more parity, random events have more influence on outcomes.
  • If you remove the really bad experts (below 53%, say) the distribution of the remaining competent experts is tighter than the distribution of simulations.  How can the actual spread be less than the spread in a simulation where all experts are identically skilled?  It must be that experts follow the herd: if some team appears hot, or another team lost a player to injury, most experts will make the same pick on a given game.  This is not in my simulation, but it surely happens in real life, and it would indeed make experts perform more similarly than in the simulation.
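For readers who want to reproduce the green histogram, here is a minimal sketch of the simulation. The 60.5% accuracy rate and 256-game season come from the post; the seed and the factor-of-100 expert count are just choices for getting a smooth, repeatable distribution.

```python
import numpy as np

rng = np.random.default_rng(42)
n_games = 256        # 2015 NFL regular season
p_true = 0.605       # assumed "true" accuracy of every expert
n_experts = 13300    # 100x the 133 real experts, for a smooth histogram

# Each simulated expert's season is just 256 coin tosses that land
# "correct" with probability 0.605.
correct = rng.binomial(n_games, p_true, size=n_experts)
accuracy = correct / n_games

print("mean accuracy: %.3f" % accuracy.mean())
print("spread (standard deviation): %.3f" % accuracy.std())
```

The spread comes out near sqrt(p(1-p)/256), about 3 percentage points, so a simulated "expert" can easily land anywhere from the mid-50s to the high 60s on luck alone.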

My simulation neglects a lot of things that could happen in real life.  For example, feedback loops: let's say the more highly skilled team gets behind in a game because of random events.  You might think that this would energize them.  Pumped up at the thought that they are losing to a worse team, they try that much harder, and come from behind to win the game.  Nice idea, but if it were true then the final outcomes of games would be incredibly predictable.  The fact that they are so unpredictable indicates that this kind of feedback loop is not important in real life. The same idea applies regarding an entire season: if a highly skilled team finds itself losing a few games due to bad luck, we might find them trying even harder to reach their true potential.  But the fact that most NFL teams have records indistinguishable from a coin toss again argues that this kind of feedback loop does not play a large role. Of course, teams do feel extra motivation at certain times, but the extra motivation must have a minimal impact on the chance of winning.  For every time a team credits extra motivation for a win, there is another time where they have to admit that it just wasn't their day despite the extra motivation.

Monday, February 1, 2016

The Cumulative Distribution Function

The cumulative distribution function (CDF) is a long name for a simple concept---a concept you should become familiar with if you like to think about data.

One of the most basic visualizations of a set of numbers is the histogram: a plot of how frequently various values appear.  For example, measuring the heights of 100 people might yield a histogram like this:



This technique is taught to kids even in preschool, where teachers often record the weather (cloudy, rainy, sunny, etc.) on a chart each day.  Over several weeks, a picture of the frequency of sunny days, rainy days, etc, naturally emerges.  (Sometimes it seems as if the histogram is the only data visualization kids learn in school.)

The CDF is a different way to visualize the same data.  Instead of recording how often a particular value occurs, we record how often we see that value or less.  We can turn a histogram into a CDF quite simply.  Start at the left side of the height histogram: four people have a height in the 1-1.1 m range so clearly, four people have a height of 1.1 m or less. Now, we move up to the next bin: five people are in the 1.1-1.2 m range so including the four shorter people we have nine with height 1.2 m or less.  We then add these nine (the "or less" part) to the number in the next bin to obtain the number with height 1.3 m or less.  This total then becomes the number of "or less" people to add to the number of people at 1.4 m, and so on.  (This procedure is similar to integration in calculus.)  The final result is:


(Notice that this graph shows smaller details than the histogram; I'll explain that at the end.) What is this graph useful for?  If we want to know the percentage of people over 6 feet (1.8 m), we can now read it straight off the CDF graph!  Just go to 1.8 m, look up until you hit the curve, and then look horizontally to see where you hit the vertical axis.  In our example here, that is about 95%:



This means 95% of people are 6 feet or shorter; in other words 5% are taller than 6 feet.  Compared to the histogram, the CDF makes it blazingly fast to look up the percentage taller than 6 feet, shorter than 5 feet (1.5 m), or anything of that nature.  (Beware: I made up these data as a hypothetical example, so don't take this as an actual comment on human height.)
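The bin-by-bin accumulation described above is just a running sum. A sketch with made-up bin counts, chosen to match the worked example (4 people in the 1.0-1.1 m bin, 5 in the next):

```python
import numpy as np

# Made-up counts for ten 10-cm height bins from 1.0 m to 2.0 m,
# totalling 100 people (hypothetical data, as in the post).
counts = np.array([4, 5, 9, 15, 20, 18, 14, 9, 4, 2])

or_less = np.cumsum(counts)            # the "or less" running totals
cdf = 100.0 * or_less / counts.sum()   # as percentages

print(or_less[:2])   # the 4 and 9 from the worked example
print(cdf[-1])       # a CDF always ends at 100%
```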

Plotting two CDFs against each other is a great way to visualize nonuniformity or inequality.  We often hear that around 20% of the income in the US goes to the top 1% of earners.  A properly constructed graph can tell us not only the percentage that goes to the top 1%, but also the percentage that goes to the top 2%, the top 5%, the bottom 5%, etc---all in a single glance.  Here's how we do it. Get income data from the IRS here: I chose the 2013 link in the first set of tables. Here's a screenshot:


I won't even attempt to turn this into a histogram because if I use a reasonable portion of the screen to represent most people ($0 to $200,000, say), the richest people will have to be very far off the right-hand edge of the screen. But if I squeeze the richest people onto the screen, details about most people will be squeezed into a tiny space. Turning the income axis into a CDF actually solves this problem, because the CDF will allocate screen space according to the share of income. We will be able to simultaneously see the contribution of many low-income people and that of a few high-income people. (I'm going to use "people", "returns" and "families" interchangeably rather than try to break things down to individuals vs. families.)

OK, let's do it.   In the first bin we have 2.1 million returns with no income.  So the first point on the people CDF will be 2.1 million, and the first point on the income CDF will be $0. Next, we have 10.6 million people (for 12.7 million total on the people CDF) making in the $1 to $5000 range, say $2500 on average.  So these 10.6 million people collectively make $26.5 billion.  The second point on our income CDF is therefore $0+$26.5 billion = $26.5 billion. We carry the 12.7 million total returns and $26.5 billion total income over to the next bin, and so on.  At the end of the last bin, we find 147 million returns and $9.9 trillion in total income.  Dividing each CDF by its maximum amount (and multiplying by 100 to show percentage) we get this blue curve:


We can now instantly read off the graph that the top 1% of returns have 15% of the income, the top 5% have 35%, the bottom 20% have 2%, and so on. In a perfectly equal-income society, the bottom 5% would take 5% of the income, the bottom 10% would take 10%, etc---in other words, the curve would follow a diagonal line on this graph.  The more the curve departs from the diagonal line, the more unequal the incomes.  We can measure how far the curve departs from the line and use that as a quick summary of the country's inequality---this is called the Gini coefficient.  (The Wikipedia article linked to here has a nice summary of Gini coefficients measured in different countries and different years, but you have to scroll down quite a bit.)
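The same running-sum idea, applied to both axes at once, produces the curve, and the Gini coefficient falls out as twice the area between the curve and the diagonal. Here is a sketch with invented bins; these are NOT the IRS numbers, just stand-ins to show the mechanics:

```python
import numpy as np

# Hypothetical income bins: average income per return in each bin, and
# number of returns in each bin (in millions), lowest income first.
avg_income = np.array([0, 2500, 10000, 30000, 75000, 200000, 2e6])
n_returns  = np.array([2.1, 10.6, 30.0, 45.0, 40.0, 18.0, 1.3])

people = np.cumsum(n_returns) / n_returns.sum()
income = np.cumsum(avg_income * n_returns) / (avg_income * n_returns).sum()

# Prepend the (0%, 0%) point, then take twice the area between the
# diagonal and the curve (trapezoid rule) to get the Gini coefficient.
x = np.concatenate(([0.0], people))
y = np.concatenate(([0.0], income))
area_under_curve = np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))
gini = 1 - 2 * area_under_curve
print("Gini coefficient (toy data): %.2f" % gini)
```

Because the bins are sorted from lowest to highest income, the income curve always sags below the diagonal, and the sag is what the Gini coefficient measures.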

A few remarks for people who want to go deeper:

  • the plotting of two CDFs against each other, as in the last plot shown here, is referred to as a P-P plot.  A closely related concept is the Q-Q plot.
  • I emphasize again that the CDF and the histogram present the same information, just in a different way.  However, there is one advantage to the CDF: the data need not be binned. When making a histogram, we have to choose a bin size, and if we have few data points we need to make these bins rather wide to prevent the histogram from being merely a series of spikes. For the height histogram, for example, I generated 100 random heights and used bins 10 cm (about 4 inches) wide.  Maybe 100 data points would be better shown as a series of spikes than a histogram---but then the spikes in the middle might overlap confusingly.  The CDF solves this problem by presenting the data as a series of steps so we can see the contribution of each point without overlap.  If a CDF has very many data points you can no longer pick out individual steps but the slope of the CDF anywhere still equals the density of data points there.
  • my income numbers won't match a more complete analysis, for at least three reasons.  First, Americans need to file tax returns only if they exceed a certain income, so some low-income families may be missed in these numbers.  Second, the IRS numbers here contain only a "greater than $10 million" final bin.  I assumed an average income of $20 million in this bin, which is a very rough guess. To do a better job, economists studying inequality supplement the IRS data I downloaded with additional data on the very rich; they find that the top 1% make more like 20% of the total, so my guess was on the low side.  Finally, I made no attempt to disentangle individual income from family income as a better analysis would.
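To see the no-binning point concretely, here is how an empirical CDF is built directly from raw values; the heights are randomly generated stand-ins, not real measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(1.6, 0.12, size=100)   # 100 invented heights (meters)

# No bins needed: sort the values; the CDF steps up by 1/N at each one.
x = np.sort(heights)
y = np.arange(1, len(x) + 1) / len(x)

# Fraction of people 1.8 m or shorter:
frac = np.searchsorted(x, 1.8, side='right') / len(x)
print("fraction at or below 1.8 m: %.2f" % frac)
```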



Friday, May 24, 2013

Our Solar System, Graphs, and Classification Schemes

Following the previous week's intro to the solar system, on Friday May 17 I visited the 3-4 grade room and used the solar system as a context for practice with graphs.  We used the graphs in turn as a tool for helping us think about how to classify solar system objects.  By establishing several clearly different classes of solar system objects, we raised questions about how the solar system might have formed these different classes, and we even began to answer those questions.  I think this worked quite well as a coherent activity while asking the students to practice a variety of skills.

The centerpiece was a graph (technically a scatterplot*) of size vs distance from Sun for various solar system objects.  My first idea was to help the kids make their own graphs from a table of data, but I discarded that idea as requiring too much time before we got to any science.  So I made this graph and handed out a copy to each student:



I still wanted students to graph some data, so I planned to make them analyze and understand this graph as a gateway to getting them to add more points and do more analysis.  I think this plan went well.  I started with the question: can you identify any of the points?  This required them to think about the meaning of the axes, and once they understood, they started saying things like "the top one must be Jupiter, because it's the biggest planet" and "the one most to the right must be Neptune because it's most distant from the Sun."  Once they grasped that, they were able to label more and more points until we eventually got them all. (The word "eventually" hides a lot of time spent one-on-one with kids, helping them with the reasoning.  Eg, Earth and Venus are almost exactly the same size, but Earth is a bit bigger, so which point is Earth?  Double-check your conclusion by looking at distance from the Sun.  Does it make sense? Etc.)

This was an excellent activity to make them think about the meaning of the graph rather than getting caught up in big numbers which wouldn't mean much to them anyway.  (Jupiter is 90,000 miles across?  How big is that?)  But now let's think about the numbers.  The graph says Earth's distance from the Sun is 1.  What is that? One foot?  One billion miles?   The only unit that makes sense is units of "Earth-Sun distance."  In other words, the graph makes it easy to read off the relative distances of the planets.  It's a scale model. Again, this makes it easy to think about what the solar system is without getting caught up in a bunch of meaningless numbers.  We repeated that exercise with the vertical axis.

Then we looked at whether the planets form any distinct groups.  The graph makes it clear that there are two groups: small and close to the Sun, vs large and far from the Sun. What other differences might these groups have? It turns out that the large ones are made of different stuff (mostly gas vs rock), so maybe we should really think of two types of planets (gas giants and rocky planets) rather than thinking that all things called "planet" are similar things.

Next, I took them back to the year 1801 when a new planet was discovered: Ceres.  I gave them the Ceres-Sun distance in units of the Earth-Sun distance (2.77) and Ceres' size in Earth-size units (0.07) and asked them to put Ceres on the graph.  For the faster students, I gave them three more planets which were discovered soon after Ceres (Pallas**, Juno, and Vesta, which have similar distances and sizes) while the teachers helped the slower students with the graphing.  After graphing these, it's clear that they form a distinct group: a group of very small things between Mars and Jupiter.  Today we call these things main-belt asteroids, but when they were discovered they were simply called new planets.  It was only after discovering many of them that people began to think that maybe we shouldn't call all new discoveries planets, and especially not these new discoveries which clearly form a separate group.  The way we think about things is highly dependent on how much information we have.

This took until the break.  After the break, we added Pluto to the graph.  When Pluto was discovered, it was immediately called a planet because it was much larger than any asteroid, and there was no other category it could have been assigned to.  But it does seem a bit out of place on the graph, being substantially smaller than any of the eight planets we started with, and also breaking the pattern of the larger planets being farther from the Sun. Well, it took 60 years, but eventually astronomers started discovering lots of other things roughly as far from the Sun and roughly the same size. I gave the kids data for these new objects: Eris, Sedna, Quaoar, and Orcus to start with.

Just as with the asteroids, it became clear that things like Pluto form a new category: the Kuiper Belt.  This is even more clear when we realize that all these things are made of ices***, which is not like the inner planets or the outer planets. Once this new category was recognized, it became silly to continue calling Pluto a planet, just as in the 1800's it became silly to continue calling Ceres, Pallas, Juno, and Vesta planets.  Perhaps Pluto should have been in a category of its own from the start, but there was no available category other than "planet," and why create a new category just for one object?  Another illustration that the way we think about things depends on how much information we have.

[A side note: astronomers created the additional category "dwarf planet" to describe a body which, regardless of its location, is large enough that its gravity pulls it into a round shape (but smaller than the eight planets).  Thus Pluto is both a Kuiper Belt object and a dwarf planet just as I am both a teacher and a father---they are not exclusive categories.  But  "Kuiper Belt object" is a much more descriptive term because it implies being made of ice, being a certain distance from the Sun,  etc, whereas  "dwarf planet" implies only that the size is neither very large nor very small.]

Next, we talked about how the solar system might have formed in order to form these different classes of objects. I showed clips from the Birth of the Earth episode of the series How the Earth Was Made.  It has some really nice visualizations, and it is constructed around evidence, which is a key feature missing from most science documentaries.  It tells science like the detective story it is.  We spent probably half an hour on this, but I won't write much here because it's already a long blog post.

To cap off this intense morning, I brought some liquid nitrogen to demonstrate how cold the outer planets are. I froze a racquetball and shattered it just by trying to bounce it off the floor; I froze a banana and showed how it can be used as a hammer (until it shattered), and I made a balloon shrink and then expand again as I warmed it up.  LN2 is always a great hit with the kids.  On Pluto summers can be just warm enough to vaporize some nitrogen, but right about now Pluto is in early fall, and it will get so cold that nitrogen will not only liquify, it will freeze.

Notes

*Notice that this graph is not a histogram, which seems to be the only type of graph elementary teachers ever work with.  I see that kids start working with graphs around second grade if not earlier, so by the time they get to college, they should be highly proficient.  But in my college classes, students are typically far from proficient.  My guess is that much of the time spent on graphs in school is wasted because students are never introduced to the idea of graphing the relationship between two different abstract quantities, which is absolutely key to data analysis and science.

**I got the idea for some of this activity when I saw that the element palladium was so named because for a long time it was fashionable to name newly discovered elements after recently discovered planets. I was long aware of uranium, neptunium, and plutonium being named this way, but I had never made the connection to cerium and palladium.  People really thought that asteroids were planets until enough asteroids were discovered.

***Ices includes ice made of materials other than water, such as methane, ammonia, etc.

Saturday, February 9, 2013

Toy DNA Analysis Part III: Eve of Reconstruction

After the first and second activities of the morning, there was not
much more than 30 minutes left.  I wanted to do an activity with
mitochondrial DNA, so I went over the background first. (They had seen
much of the following earlier in the year, but the review turned out
to be necessary.)

Each cell has a nucleus which contains DNA, surrounded by the bulk of
the cell ("cytoplasm") which has various structures ("organelles") for
performing various functions.  One type of organelle is the
mitochondrion, which helps turn oxygen into energy.  Each cell has many
mitochondria, and here is the amazing thing: they have their own DNA!
They are not built according to instructions recorded in the DNA of
the nucleus; they simply reproduce by dividing asexually, as if they
were self-contained cells within the cell.  When the cell itself
divides, each daughter gets half the cytoplasm and therefore half the
mitochondria.  It is thought that mitochondria were once independent
bacteria, which learned to cooperate so well with other cells that
they took up residence.  That's pretty amazing!  Another amazing fact
is that all creatures on Earth share the same DNA code.  We are all
related, even humans and yeast.  (Example: if you put the DNA letters
for human insulin in yeast, the yeast understands those instructions
perfectly and makes human insulin.)

When a human egg cell is fertilized, the sperm carries in half the
nuclear DNA to complement the mother's half of the nuclear DNA.  But
the egg has an enormous amount of cytoplasm and the sperm contributes
none.  So your mitochondrial DNA is an exact replica of your mother's,
and of her mother's, and of HER mother's....there is no shuffling with
each generation as we have with nuclear DNA.  Thus, mitochondrial DNA
makes it much easier to test whether you are a direct descendant
(through an all-female line) of, say, Cleopatra.  (A similar thing can
be done with Y chromosomes and all-male lines of inheritance.)
Furthermore, by mapping the geographical distributions of
mitochondrial DNA, we can trace out migrations of women over time.
(Ditto for Y chromosomes and men.)

It's good to ask the kids a few questions to see how well they
understand.  In this case, a girl said she was sorry for boys because
they had no mitochondria.  So we discussed that issue again: everyone
has mitochondria (that's how they turn oxygen into energy) but boys
won't pass theirs on to their kids.  Moms really do contribute more
than half, as immortalized by this song.

But there can be mutations. It turns out they're fairly rare in
mitochondria, probably because most mutations would be fatal very
early on.  But they do happen.  So if we gather mitochondrial DNA from
a large sample of people, we will find sequences that differ by a
little bit.  We should be able to trace the mutations backward and
reconstruct ancestor DNA.  For example, if we saw sequences GATTACA,
GATTACT, and AATTACA, we might guess that the ur-grandmother, many
generations back, of all three people had the sequence GATTACA.  One
mutation somewhere along the line would explain the people with
GATTACT, a different mutation somewhere else along the line would
explain the people with AATTACA, and the people who had never
experienced a mutation along their line would still have GATTACA.
The hypothesis of, say, GATTACT being the ancestor is much less likely
because it requires that there was one mutation to make it GATTACA and
then, in the line with this mutation already present, there was
a second mutation making AATTACA.

So here's the problem I posed to the kids: reconstruct the ancestor of
these sequences:

CATTACGACT     
GAATACGACA      
GATTACAACT      
GATTACGACA      
GATTACGACT      
GATTATAACT
GATTCCAACT      
GTTTCCAACT      

Go ahead: print these out and cut them into strips, try to arrange
them as leaves of a tree, and guess what the branches and trunk have
to look like.
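If you'd rather let the computer take a first crack at it, a per-position majority vote gives a decent starting guess for the ancestor. This is only a rough heuristic, not the full tree-building exercise:

```python
from collections import Counter

sequences = [
    "CATTACGACT", "GAATACGACA", "GATTACAACT", "GATTACGACA",
    "GATTACGACT", "GATTATAACT", "GATTCCAACT", "GTTTCCAACT",
]

# At each position, guess the letter most lineages still carry; rare
# letters are more likely to be recent mutations.
guess = "".join(
    Counter(column).most_common(1)[0][0] for column in zip(*sequences)
)
print("majority-vote guess for the ancestor:", guess)

# How many mutations would that guess require along each line?
for seq in sequences:
    diffs = sum(a != b for a, b in zip(guess, seq))
    print(seq, "->", diffs, "difference(s)")
```

Note that the seventh position is actually a tie between G and A, so the vote alone can't settle everything; that's where arranging the strips into a tree earns its keep.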

Some groups were lost, and so I tried to work it out with them on the
board, starting by making a guess about the immediate ancestor of one
very similar pair.  It turned out this was probably a bad guess,
because once we had worked out two hypothetical ancestors of two
different pairs, those two hypothetical ancestors seemed to have very
little in common, whereas we would have expected them to look similar
enough that we could guess a hypothetical original ancestor which
spawned them both.  Just as I was realizing that we were almost out of
time, another group handed me a sheet of paper in which they had
worked it all out.  The lesson I drew for everyone: don't be afraid to
take a guess, work out the consequences of that guess, and if it
doesn't work, scrap that guess and start over.  That's what science is
all about! (See the first minute or so of this video.)  Just because
lunchtime was coming up fast does not mean that we had done anything
wrong.  The wrong thing would be to continue pushing a guess which
doesn't explain all the evidence.

If I do this activity again, I would print out very large copies of
the sequence so I could rearrange them easily on the board (writing
with chalk does not lend itself to rearrangement).  Or I would print
it at regular size and use a document camera.  I would probably also
walk them through a simplified example first as I did in writing this
post.  Another idea I just had is to try representing the information
differently.  Perhaps a color code instead of letters would make
things just jump out. 

I saved a few minutes for the coolest part of this: because we know
that mutations happen about once every 10,000 years, we can use this
as a clock.  In my simplified example, you have to reverse-engineer
three mutations to get back to a common ancestor which explains all
the data.  That makes 30,000 years.  In real life with real data, you
have to go back 200,000 years, but you can do it.  That means that
there was one female about 200,000 years ago from whom every human
alive today has inherited their mitochondrial DNA; she is called
mitochondrial Eve.  This doesn't mean that other females living at
that time didn't contribute to people alive today; they surely did,
through their nuclear DNA.  But mitochondrial Eve is the only one who
has an unbroken female line to anyone alive today.  And a similar
argument identifies "Y-chromosomal Adam" who lived around 100,000
years ago.  We are all intimately related!

Toy DNA Analysis Part II

After the first activity and recess break, I posed the challenge of
finding out which animal is most closely related to the hippopotamus.
I gave them short (400-base) DNA sequences of various mammals and
asked them to figure out which was most similar to the hippo's.
Again, I just made up these short sequences to keep it simple.  In
real life, there are a lot more complications, such as varying numbers
of chromosomes, billions of base pairs, etc.

The bottom line is that they had to take an animal, look at each
position in the sequence, ask whether that animal's DNA at that
position was different from the hippo's, and total up the differences.
After repeating this for several animals, they would see which one had
the fewest differences.  After the near-chaos of the morning's first
activity, I thought it would be a good idea to go over the big picture
and assemble some pseudocode on the board based on their ideas:

get hippo dna
for each animal other than hippo:
    get this animal's dna
    compare this animal's to hippodna
    print number of differences

I emphasized basic ideas like what do we want to put inside the loop and what do we want to put outside the loop. A program like this does it:
 
hippodna = open('hippo.txt').read()
nletters = len(hippodna)

for filename in ('cat.txt','dog.txt','rhino.txt','bluewhale.txt','rockhyrax.txt','rhesusmonkey.txt'):
    dna = open(filename).read()
    differences = 0
    for position in range(nletters):
        if hippodna[position] != dna[position]:
            differences += 1
    print filename,'has',differences,'differences with hippo dna' 

Yes, Python experts, I know there are more efficient ways to do it but this seems most straightforward for a kid. I won't go into the detail I went into for the morning's first activity, but it was a similarly intense back-and-forth with students running into an obstacle every 30 seconds, which I tried to turn into a learning experience. They again needed emphasis on proper indenting, punctuation, remembering that if they named a variable 'difference' early on then they couldn't refer to it as 'differences' later on, etc. They again wanted to do one animal at a time rather than write a loop. But it was a really good learning experience.

This took close to an hour. At the end, we discussed how they might modify the program to see what's most closely related to some other mammal, and then a third mammal, etc, and build up a picture of how everything is related to everything else: a family tree. That was the emphasis of the morning's third activity.

Toy DNA Analysis Part I

Last time I visited the 5-7 graders at Peregrine School, I introduced
the idea of computer programming languages (specifically, Python) as a
way of automating repetitive tasks.  That sounds boring, but I made it
interesting by building the activity around an inherently interesting
challenge---cracking a code---which happens to have a repetitive
aspect---translating each letter of a long message after you've
figured out the substitution pattern.  At the end, we had a little
time to discuss how DNA is a code, which is the connection I really
wanted to make to what they had already learned in biology.  This
week, we were poised to take it much further with three data analysis
challenges, two of them requiring Python programming.  As a result,
I'm going to split the morning's activities into three separate blog
entries.

The first challenge was to rescue a baby.  Some babies are born with a
genetic defect which does not allow their body to process the amino
acid phenylalanine.  If they don't follow a strict low-phenylalanine
diet, it builds up in their body and causes brain damage (mental
retardation) within the first year of life.  So there is a very strong
motivation to find out if the baby has this condition very soon after
it is born!  In practice, there is a blood test which does not involve
analyzing the DNA itself, but I framed the problem as one of DNA
analysis.  I made up long DNA sequences for about 15 babies, and asked
them to write a program which would look at a certain location within
each sequence, and print an alert if the sequence was not normal.  To
keep things simple, I just made up a simple condition for "normal":
the three letters starting at position 399 should be "GAT".

They had forgotten a lot in the two weeks since my first visit.  I
would normally recommend much more frequent visits for teaching
programming, but I was sick last week.  Still, they were very into it.
They wracked their brains trying to remember things, but I had to give
a refresher tutorial on for loops and if statements.  It took almost
an hour for everyone to finish completely, and there was again a
friendly competition to see who could finish (identify the sick baby)
first. 

The evolution of their programs was fascinating.  One group thought
they were done when they had something like:

dna = open('jane.txt').read()
print dna[398:401]

Even this simple program has a lot of ideas in it.  For example, doing
the subscripting correctly requires thinking about: (1) position 398
is actually the 399th position because Python starts counting at
position 0 rather than position 1; (2) 398:401 means up to but not
including 401; (3) you can't write dna[398-401] because the arithmetic
operation 398-401 yields -3, and Python will interpret this as 3
positions before the end of the sequence.  There was a lot of
interaction between me and the kids on each of these points.
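These three points are easy to demonstrate interactively.  Here is a
quick sketch with a made-up sequence (written in modern Python 3,
where print is a function; the repeating sequence is invented purely
for illustration):

```python
dna = "AAGATC" * 100  # hypothetical 600-letter sequence

# (1) Python counts from 0, so index 398 is the 399th position.
print(dna[0])        # → A  (the first letter)

# (2) A slice runs up to but NOT including the end index.
print(dna[398:401])  # → GAT  (three letters: positions 399-401 in human counting)

# (3) A negative index counts backward from the END of the sequence,
#     so dna[398-401] would mean dna[-3], not a range.
print(dna[-3:])      # → ATC  (the last three letters)
```

Playing with the slice bounds at the interactive prompt is a good way
for kids to internalize the up-to-but-not-including rule.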

In any case, when the first group had gotten this far, they thought
they were done.  They would just change jane.txt to john.txt, rerun it
to see if John was sick, repeat for Joan, etc, and just read off the
screen whether it said 'GAT' for that baby.  This would work ok for
the 15 examples I brought, but PKU (phenylketonuria, the
aforementioned disease) is present in only about 1 in 15,000 babies,
so they really needed to
automate a loop over all the babies.  They initially thought I was
moving the goalposts, but the other groups agreed that this first
group was not really done.  So they started working on a loop:
names = ('jane.txt','john.txt','joe.txt')
for name in names:
    dna = open(name).read()
    print dna[398:401]
I had to repeat many times the importance of proper indentation so that it's clear
(to the computer and to a human reading the code) which actions get
repeated for each name, and which actions only get done once.  A
common mistake is
names = ('jane.txt','john.txt','joe.txt')
for name in names:
    dna = open(name).read()
print dna[398:401]
which prints only the last one, after cycling through all the names.

Notice that I typed in only three names to make sure my loop would
work, before bothering to type in all 15 names.  In fact, I taught
them how to avoid typing in any names by grabbing all names that fit
a certain pattern:
import glob
names = glob.glob('j*.txt')
for name in names:
    dna = open(name).read()
    print dna[398:401]
I briefly discussed how it can be incredibly time-saving to import
some functionality that someone else has already written.

But back to the main point: the fastest group now thought that they
were really done.  This program would print "GAT" for each healthy
person, and they just had to scan the output for something other than
GAT.  But, I asked them, how would they know which person had that
abnormal sequence?  They tried adding to the end of their program:
import glob
names = glob.glob('j*.txt')
for name in names:
    dna = open(name).read()
    print dna[398:401]
    print name
But this gives output like:
GAT
jane.txt
GAT
john.txt
GCT
joe.txt
GAT
jim.txt
GAT
...
and they incorrectly interpreted this as John having GCT, because
they hadn't paid attention to the fact that they asked the computer
to print the DNA first, and then the name.

An important rule of programming is to make your output clear: things
which belong together should be printed together.  A much better
print statement is:
print dna[398:401], name
which keeps it to one line per baby:
GAT jane.txt
GAT john.txt
GCT joe.txt
GAT jim.txt
...
Now they really thought they were done, but I pointed out that they
were looking for 1 in 15,000 babies, and they couldn't count on a
human to scan a list of 15,000 lines which say 'GAT' and not miss an
abnormal one.  They really need to print out just the sick ones, with
a very clear warning:
import glob
names = glob.glob('j*.txt')
for name in names:
    dna = open(name).read()
    if dna[398:401] != 'GAT':
       print name, "is sick"
Now they were done: the output is simply
joe.txt is sick
I took a few minutes to talk about the subtleties of combining if's, for example the difference between
if dna[398] != 'G' and dna[399] != 'A' and dna[400] != 'T':
vs
if dna[398] != 'G' or dna[399] != 'A' or dna[400] != 'T':
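To make the difference concrete: the 'and' version fires only when
all three letters are wrong at once, while the 'or' version fires
when any one letter is wrong, which is what we want (it is exactly
equivalent to the slice test dna[398:401] != 'GAT').  A quick sketch
with two hypothetical sequences, in Python 3 syntax:

```python
# Two hypothetical sequences, padded so the test site starts at index 398.
healthy = "x" * 398 + "GAT"
sick    = "x" * 398 + "GCT"   # only the middle letter differs

for dna in (healthy, sick):
    with_and = dna[398] != 'G' and dna[399] != 'A' and dna[400] != 'T'
    with_or  = dna[398] != 'G' or dna[399] != 'A' or dna[400] != 'T'
    print(with_and, with_or, dna[398:401] != 'GAT')
# → False False False   (healthy baby: no test fires)
# → False True True     (sick baby: 'and' misses it, 'or' catches it)
```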
Surprisingly (to me at least), they found this last point very easy.

As I said, this whole activity took very nearly an hour.  (I had
saved a bit of time by preloading their computers with the data
files.)  It was highly interactive, and with many more than three
groups, having another knowledgeable adult in the room probably would
have been a good idea.

I'm not going to pretend that as a result of this activity the kids
have actually mastered even the very basics of programming, but it
was good practice in logical thinking, and it's clear that with
continued practice I will eventually get them doing some real data
analysis.  One good sign: one of the girls continued programming
throughout the recess break following this activity!

Friday, January 25, 2013

DNA is Code


Today I started Friday morning sessions with the upper (5-7) graders
at Peregrine School.  I'll have about ten of these weeks before I
switch to astronomy with the 3-4 graders.  My main job with the upper
graders is to cover the geology standards which didn't get covered in
the lead-up to the Yosemite trip, but I'm going to take a few weeks
first to build on some of the ideas they learned before Christmas
break regarding genetics and DNA.  In the process, they will learn how
to analyze data and program computers.  It's ambitious, but with these
kids I have to be ambitious.

DNA is a code.  The letters themselves have no intrinsic meaning, but
they represent different specific amino acids which, when assembled in
the sequence specified by the code, fold up into a 3-d shape which is
useful for performing some function in the body.  So to introduce the
idea of "breaking the code," I started with a non-DNA message in code:
Uif Ivohfs Hbnft jt b 2008 zpvoh bevmu opwfm cz Bnfsjdbo xsjufs Tvaboof Dpmmjot. Ju jt xsjuufo jo uif wpjdf pg 16-zfbs-pme Lbuojtt Fwfseffo, xip mjwft jo uif qptu-bqpdbmzqujd obujpo pg Qbofn, xifsf uif dpvousjft pg Opsui Bnfsjdb podf fyjtufe. Uif Dbqjupm, b ijhimz bewbodfe nfuspqpmjt, fyfsdjtft qpmjujdbm dpouspm pwfs uif sftu pg uif obujpo. Uif Ivohfs Hbnft bsf bo boovbm fwfou jo xijdi pof cpz boe pof hjsm bhfe 12–18 gspn fbdi pg uif uxfmwf ejtusjdut tvsspvoejoh uif Dbqjupm bsf tfmfdufe cz mpuufsz up dpnqfuf jo b ufmfwjtfe cbuumf up uif efbui.

Uif cppl sfdfjwfe nptumz qptjujwf gffecbdl gspn nbkps sfwjfxfst boe bvuipst, jodmvejoh bvuipst Tufqifo Ljoh boe Tufqifojf Nfzfs. Ju xbt qsbjtfe gps jut tupszmjof boe dibsbdufs efwfmpqnfou, uipvhi tpnf sfwjfxfst ibwf opufe tjnjmbsjujft cfuxffo Dpmmjot' cppl boe Lpvtivo Ublbnj't Cbuumf Spzbmf (1999), cpui pg xijdi efmjwfs ubmft frvbmmz hsjqqjoh, zfu mftt ebsl uibo uif hfosf't psjhjobm nbtufs xpsl: Ljoh't Uif Mp oh Xbml (1979). Jo xsjujoh Uif Ivohfs Hbnft, Dpmmjot esfx vqpo Hsffl nzuipmphz, Spnbo hmbejbupsjbm hbnft boe dpoufnqpsbsz sfbmjuz ufmfwjtjpo gps uifnbujd dpoufou. Uif opwfm xpo nboz bxbset, jodmvejoh uif Dbmjgpsojb Zpvoh Sfbefs Nfebm, boe xbt obnfe pof pg Qvcmjtifst Xfflmz't "Cftu Cpplt pg uif Zfbs" jo 2008.

Uif Ivohfs Hbnft xbt gjstu qvcmjtife jo ibsedpwfs po Tfqufncfs 14, 2008 cz Tdipmbtujd, gfbuvsjoh b dpwfs eftjhofe cz Ujn P'Csjfo. Ju ibt tjodf cffo sfmfbtfe jo qbqfscbdl boe bmtp bt bo bvejpcppl boe fcppl. Bgufs bo jojujbm qsjou pg 200,000, uif cppl ibe tpme 800,000 dpqjft cz Gfcsvbsz 2010. Tjodf jut sfmfbtf, Uif Ivohfs Hbnft ibt cffo usbotmbufe joup 26 mbohvbhft, boe qvcmjtijoh sjhiut ibwf cffo tpme jo 38 uf ssjupsjft. Uif opwfm jt uif gjstu jo Uif Ivohfs Hbnft usjmphz, gpmmpxfe cz Dbudijoh Gjsf (2009) boe Npdljohkbz (2010). B gjmn bebqubujpo, ejsfdufe cz Hbsz Sptt boe dp-xsjuufo boe dp-qspevdfe cz Dpmmjot ifstfmg, xbt sfmfbtfe jo 2012.

This was a really good idea for motivating the students; they really
wanted to break the code once they realized it was doable! I elicited
from them some observations which might be useful:
  • Each paragraph begins with "Uif".  They guessed pretty quickly that "Uif" might stand for "The".  This was a springboard for making explicit what they already implicitly knew about word and letter frequencies in English.
  • Numbers seem to stand for themselves: "Tfqufncfs 14, 2008" looks like a date.
  • Ditto for punctuation: the periods and commas look natural, not like they stand in for something else.
  • "b" is a word by itself.  Presumably it stands for either "a" or "I".
One student guessed the answer even as I was writing down the fourth
observation: simply replace each letter here with the letter which
precedes it in the alphabet.  This "b" becomes "a" and "Uif" becomes
"The".  I asked them to make some predictions to double-check this
hypothesis.  For example, this hypothesis predicts that "a" and "r"
will be uncommon because they stand for the uncommon letters "z" and
"q".  That checks out, so let's proceed.

Here's where I introduce computers.  It would be extremely tedious to
manually translate all these letters to decode the message.  But if we
write a computer program to do it, in a short time we will have a tool
that automatically translates this and all future such messages.
Computers are great for automating repetitive tasks!

Before we actually open our laptops, let's write some pseudocode.
"Code" has a new meaning here: "a set of instructions to the
computer", and pseudocode is when we write those instructions a bit
more like everyday language to help us organize and communicate our
thoughts to other humans.  To decode our message, we want to tell the
computer to do this:

for each character in the message:
    if character is a letter:
       print the previous letter
    else:
       print the same character # (because it's punctuation or a number)

(The # symbol and everything after it represent a comment about the instruction, rather than an actual instruction.)  This pseudocode illustrates two of the most important ideas in programming: looping, in which we repeat some set of instructions over and over, and if-then-else.
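For readers who want to see this pseudocode as actual Python, here is
one compact translation (modern Python 3, using chr/ord arithmetic
rather than the dictionary approach we built in class; the toy
message here is my own, encoded with the same shift-by-one scheme):

```python
message = "Ifmmp, xpsme!"  # hypothetical toy message, shifted by +1

decoded = ""
for character in message:
    if character.isalpha():
        # ord() gives a letter's numeric code and chr() converts back,
        # so subtracting 1 yields the previous letter in the alphabet.
        # (A full decoder would also wrap 'a' around to 'z'.)
        decoded += chr(ord(character) - 1)
    else:
        decoded += character  # punctuation, digits, and spaces pass through
print(decoded)  # → Hello, world!
```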

A somewhat old-fashioned way of constructing a loop which does something 100 times is:

tasksCompleted = 0
while tasksCompleted < 100:
      doSomething
      tasksCompleted = tasksCompleted + 1

Although this is a bit old-fashioned, I thought it was worth making
the loop logic very explicit for beginners.

Next, we practiced some simple loops in an actual language, Python.
I gave them a file containing the coded message, named 'coded1.txt'.
I showed them how to read it in:

letters = open('coded1.txt').read()

Now letters is the name of a location in memory that contains the message.  A very simple loop is:

for character in letters:
     print character,

With this more modern type of loop, Python automatically keeps track of how many times we need to repeat.  This is easier to write, easier for others to read, and less error-prone.

So far we have just repeated the very simple action of printing each
character.  By the time each group had finished this, the 45 minutes
was up and we took a break.

After the break, we set about putting more complicated instructions
inside the loop.  We started writing something like:

for character in letters:
    if character == 'b':
       print 'a',   
    elif character == 'c':
       print 'b',

(Note that == is a test of equality, whereas = would have been an assignment, forcing them to be equal; and elif is shorthand for else if.)  We can see that this will become tedious if we have to type in an if-else block for each substitution rule, so let's find a better way.  Let's set up a dictionary named sub (short for substitution):

sub = {'b':'a',
       'c':'b',
       'd':'c',
       ...}

so now we can just say

print sub[character]

and if the character is 'b' the computer will look up the
"definition" (the value, really) stored at the dictionary location of
'b', which is 'a', and print that.  (Note that the ... above is not
real code, but you can guess what it stands for.)

We spent a while talking about the software concept of dictionaries
and playing with them before returning to the decoding task, which
now looks like:

letters = open('coded1.txt').read()
sub = {'a':'z',
       'b':'a',
       'c':'b',
       'd':'c',
       ...}

for character in letters:
    if character in sub:
        print sub[character]
    else:
        print character

As a test, we did this after typing in just some of the substitution
rules.  We could see that it was going to work, but it didn't handle
capitals: "Uif" became "Uhe".  Instead of typing in another 26
upper-case substitution rules, we did this before entering our main
loop:

for key in sub.keys():
    upperCaseKey = str.upper(key)
    upperCaseSub = str.upper(sub[key])
    sub[upperCaseKey] = upperCaseSub

str.upper is a function which makes an upper-case version of the
letter which is given to it inside the parentheses.  So this block of
code creates a new upper-case dictionary entry and value for each
pre-existing one, so we don't have to type them in manually.  (Yes, I
know there are more efficient ways to do this, but I wanted to make
the logic very explicit for the learners.)

With that done, the race was on as each group rushed to be the first
to type in all the substitution rules and read the message.  That was
really fun.  If you want to see the message, you'll just have to
assemble your own Python script out of the parts I've illustrated
here!
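As an aside, one of those more efficient ways is a dictionary
comprehension, which builds all the upper-case entries in a single
expression (Python 3 syntax; only a few sample rules shown):

```python
sub = {'b': 'a', 'c': 'b', 'd': 'c'}  # a few sample substitution rules

# One expression that mirrors the four-line loop above:
# for each existing (key, value) pair, add an upper-case twin.
sub.update({key.upper(): value.upper() for key, value in sub.items()})

print(sub['B'])  # → A
```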

Once all the groups had decoded the message, we had used about 45
minutes since the break.  A word of advice to teachers: it takes a
while.  Although I explained several times how one typo or
indentation mistake will torpedo the script, kids are not used to
this, and they need a lot of help finding their mistakes even though
I had my own example code projected up on the screen.

But it was really worth spending the time.  They started to get the
idea of debugging; although I often had to find the bugs for them,
they started to understand the process.  I had the luxury of doing it
with six kids in three groups; if you have a larger group I would try
to arrange for assistants and give them some advance training.

That gave us about 12 minutes to discuss how DNA is a code.  We
reviewed a bit about how letters code for amino acids, which build
proteins.  Then we looked at the DNA code table and discussed why
some amino acids have many codons (groups of three DNA letters) which
code for them, while others have just one.

The basic idea is that some "typos" will not actually result in a
mistake in the amino acid, which is a good thing; it adds robustness
to the code.  And the code is even more robust if the most common
amino acids are the ones with the most robustness against typos, in
other words the ones with the most different codons.  That led to an
interesting discussion of where DNA "typos" might come from,
including cell reproduction and radiation.

Finally, I asked why we need three letters to represent an amino acid
if we have all these leftover three-letter combinations.  If we have
four options (G, C, T, or A) at each DNA location and we have three
consecutive locations, how many total options are there?

The best way to think about this, if you are not used to mathematical
manipulations, is to draw a tree with four big branches representing
the four options at the first decision point.  Each of those big
branches then has four medium-size branches representing the four
options at the second decision point.  And each of those 16
medium-size branches then has four little branches representing the
four options at the third decision point, which makes 64 little
branches, representing 64 possible combinations.  There are only 20
amino acids, so with 64 code words we can have a lot of redundancy,
leading to a lot of robustness against damage.

The last question we discussed was: since we have so many extra code
words by using three-letter combinations, could we (robustness aside)
get away with using just two DNA letters to code for an amino acid?
I'll let you figure that one out for yourself.
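The tree-counting argument is just repeated multiplication, and a few
lines of Python can confirm it by listing every codon explicitly
(Python 3; the letter order 'GCTA' is arbitrary):

```python
from itertools import product

# Every way to choose one of four letters at each of three positions.
codons = [''.join(combo) for combo in product('GCTA', repeat=3)]
print(len(codons))   # → 64
print(codons[:3])    # → ['GGG', 'GGC', 'GGT']
print(4 ** 3)        # → 64, the same count via straight multiplication
```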