Friday, January 25, 2013

DNA is Code


Today I started Friday morning sessions with the upper (5-7) graders
at Peregrine School.  I'll have about ten of these weeks before I
switch to astronomy with the 3-4 graders.  My main job with the upper
graders is to cover the geology standards which didn't get covered in
the lead-up to the Yosemite trip, but I'm going to take a few weeks
first to build on some of the ideas they learned before Christmas
break regarding genetics and DNA.  In the process, they will learn how
to analyze data and program computers.  It's ambitious, but with these
kids I have to be ambitious.

DNA is a code.  The letters themselves have no intrinsic meaning, but
groups of them specify particular amino acids which, when assembled in
the sequence given by the code, fold up into a 3-d shape which is
useful for performing some function in the body.  So to introduce the
idea of "breaking the code," I started with a non-DNA message in code:
Uif Ivohfs Hbnft jt b 2008 zpvoh bevmu opwfm cz Bnfsjdbo xsjufs Tvaboof Dpmmjot. Ju jt xsjuufo jo uif wpjdf pg 16-zfbs-pme Lbuojtt Fwfseffo, xip mjwft jo uif qptu-bqpdbmzqujd obujpo pg Qbofn, xifsf uif dpvousjft pg Opsui Bnfsjdb podf fyjtufe. Uif Dbqjupm, b ijhimz bewbodfe nfuspqpmjt, fyfsdjtft qpmjujdbm dpouspm pwfs uif sftu pg uif obujpo. Uif Ivohfs Hbnft bsf bo boovbm fwfou jo xijdi pof cpz boe pof hjsm bhfe 12–18 gspn fbdi pg uif uxfmwf ejtusjdut tvsspvoejoh uif Dbqjupm bsf tfmfdufe cz mpuufsz up dpnqfuf jo b ufmfwjtfe cbuumf up uif efbui.

Uif cppl sfdfjwfe nptumz qptjujwf gffecbdl gspn nbkps sfwjfxfst boe bvuipst, jodmvejoh bvuipst Tufqifo Ljoh boe Tufqifojf Nfzfs. Ju xbt qsbjtfe gps jut tupszmjof boe dibsbdufs efwfmpqnfou, uipvhi tpnf sfwjfxfst ibwf opufe tjnjmbsjujft cfuxffo Dpmmjot' cppl boe Lpvtivo Ublbnj't Cbuumf Spzbmf (1999), cpui pg xijdi efmjwfs ubmft frvbmmz hsjqqjoh, zfu mftt ebsl uibo uif hfosf't psjhjobm nbtufs xpsl: Ljoh't Uif Mpoh Xbml (1979). Jo xsjujoh Uif Ivohfs Hbnft, Dpmmjot esfx vqpo Hsffl nzuipmphz, Spnbo hmbejbupsjbm hbnft boe dpoufnqpsbsz sfbmjuz ufmfwjtjpo gps uifnbujd dpoufou. Uif opwfm xpo nboz bxbset, jodmvejoh uif Dbmjgpsojb Zpvoh Sfbefs Nfebm, boe xbt obnfe pof pg Qvcmjtifst Xfflmz't "Cftu Cpplt pg uif Zfbs" jo 2008.

Uif Ivohfs Hbnft xbt gjstu qvcmjtife jo ibsedpwfs po Tfqufncfs 14, 2008 cz Tdipmbtujd, gfbuvsjoh b dpwfs eftjhofe cz Ujn P'Csjfo. Ju ibt tjodf cffo sfmfbtfe jo qbqfscbdl boe bmtp bt bo bvejpcppl boe fcppl. Bgufs bo jojujbm qsjou pg 200,000, uif cppl ibe tpme 800,000 dpqjft cz Gfcsvbsz 2010. Tjodf jut sfmfbtf, Uif Ivohfs Hbnft ibt cffo usbotmbufe joup 26 mbohvbhft, boe qvcmjtijoh sjhiut ibwf cffo tpme jo 38 ufssjupsjft. Uif opwfm jt uif gjstu jo Uif Ivohfs Hbnft usjmphz, gpmmpxfe cz Dbudijoh Gjsf (2009) boe Npdljohkbz (2010). B gjmn bebqubujpo, ejsfdufe cz Hbsz Sptt boe dp-xsjuufo boe dp-qspevdfe cz Dpmmjot ifstfmg, xbt sfmfbtfe jo 2012.

This was a really good idea for motivating the students; they really
wanted to break the code once they realized it was doable! I elicited
from them some observations which might be useful:
  • Each paragraph begins with "Uif".  They guessed pretty quickly that "Uif" might stand for "The".  This was a springboard for making explicit what they already implicitly knew about word and letter frequencies in English.
  • Numbers seem to stand for themselves: "Tfqufncfs 14, 2008" looks like a date.
  • Ditto for punctuation: the periods and commas look natural, not like they stand in for something else.
  • "b" is a word by itself.  Presumably it stands for either "a" or "I".
One student guessed the answer even as I was writing down the fourth
observation: simply replace each letter here with the letter which
precedes it in the alphabet.  Thus "b" becomes "a" and "Uif" becomes
"The".  I asked them to make some predictions to double-check this
hypothesis.  For example, this hypothesis predicts that "a" and "r"
will be uncommon because they stand for the uncommon letters "z" and
"q".  That checks out, so let's proceed.

Here's where I introduce computers.  It would be extremely tedious to
manually translate all these letters to decode the message.  But if we
write a computer program to do it, in a short time we will have a tool
that automatically translates this and all future such messages.
Computers are great for automating repetitive tasks!

Before we actually open our laptops, let's write some pseudocode.
"Code" has a new meaning here: "a set of instructions to the
computer", and pseudocode is when we write those instructions a bit
more like everyday language to help us organize and communicate our
thoughts to other humans.  To decode our message, we want to tell the
computer to do this:

for each character in the message:
    if character is a letter:
       print the previous letter
    else:
       print the same character # (because it's punctuation or a number)

(The # symbol and everything after it represent a comment about the instruction, rather than an actual instruction.)  This pseudocode illustrates two of the most important ideas in programming: looping, in which we repeat some set of instructions over and over, and if-then-else.

A somewhat old-fashioned way of constructing a loop which does something 100 times is:

tasksCompleted = 0
while tasksCompleted < 100:
      doSomething
      tasksCompleted = tasksCompleted + 1

Although this is a bit old-fashioned, I thought it was worth making the loop logic very explicit for beginners. Next, we practiced some simple loops in an actual language, Python.  I gave them a file containing the coded message, named 'coded1.txt'.  I showed them how to read it in:

letters = open('coded1.txt').read()

Now letters is the name of a location in memory that contains the message.  A very simple loop is:

for character in letters:
     print character,

With this more modern type of loop, Python automatically keeps track of how many times we need to repeat.  This is easier to write, easier for others to read, and less error-prone.

So far we have just repeated the very simple action of printing that character.  By the time each group had finished this, the 45 minutes was up and we took a break. After the break, we set about putting more complicated instructions inside the loop.  We started writing something like:

for character in letters:
    if character == 'b':
       print 'a',   
    elif character == 'c':
       print 'b',

(Note that == is a test of equality, whereas = would have been an assignment, forcing them to be equal; and elif is shorthand for else if.)  We can see that this will become tedious if we have to type in an if-else block for each substitution rule, so let's find a better way.  Let's set up a dictionary named sub (short for substitution):

sub = {'b':'a',
       'c':'b',
       'd':'c',
       ...}

so now we can just say

print sub[character]

and if the character is 'b' the computer will look up the "definition" (the value, really) stored at the dictionary location of 'b', which is 'a', and print that.  (Note that the ... above is not real code, but you can guess what it stands for.)  We spent a while talking about the software concept of dictionaries and playing with them before returning to the decoding task, which now looks like:

letters = open('coded1.txt').read()
sub = {'a':'z',
       'b':'a',
       'c':'b',
       'd':'c',
       ...}

for character in letters:
    if character in sub:
        print sub[character],
    else:
        print character,

As a test, we did this after typing in just some of the substitution rules.  We could see that it was going to work, but it didn't handle capitals: "Uif" became "Uhe". Instead of typing in another 26 upper-case substitution rules, we did this before entering our main loop:

for key in sub.keys():
    upperCaseKey = str.upper(key)
    upperCaseSub = str.upper(sub[key])
    sub[upperCaseKey] = upperCaseSub

str.upper is a function which makes an upper-case version of the letter given to it inside the parentheses.  This block of code creates a new upper-case dictionary entry and value for each pre-existing one, so we don't have to type them in manually.  (Yes, I know there are more efficient ways to do this, but I wanted to make the logic very explicit for the learners.) With that done, the race was on as each group rushed to be the first to type in all the substitution rules and read the message.  That was really fun.  If you want to see the message, you'll just have to assemble your own Python script out of the parts I've illustrated here!
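For the curious, here is a sketch of one of those more efficient ways: a dictionary comprehension builds all the upper-case rules at once.  (The three-entry sub below is just for illustration; the real one has all 26 rules.)

```python
sub = {'b': 'a', 'c': 'b', 'd': 'c'}   # a few rules, for illustration

# Build an upper-case rule from each lower-case one, then merge
# the new rules into the dictionary.
sub.update({key.upper(): value.upper() for key, value in sub.items()})

print(sub['D'])   # 'C'
```

This does exactly what the explicit loop does, in one line, but the loop makes the logic easier to see for beginners.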

Once all the groups had decoded the message, we had used about 45 minutes since the break.  A word of advice to teachers: it takes a while.  Although I explained several times how one typo or indentation mistake will torpedo the script, kids are not used to this, and they need a lot of help finding their mistakes even though I had my own example code projected up on the screen. But it was really worth spending the time.  They started to get the idea of debugging; although I often had to find the bugs for them, they started to understand the process.  I had the luxury of doing it with six kids in three groups; if you have a larger group I would try to arrange for assistants and give them some advance training.

That gave us about 12 minutes to discuss how DNA is a code.  We reviewed a bit about how letters code for amino acids, which build proteins.  Then we looked at the DNA code table and discussed why some amino acids have many codons (groups of three DNA letters) which code for them, while others have just one.  The basic idea is that some "typos" will not actually result in a mistake in the amino acid, which is a good thing; it adds robustness to the code.  And it's even more robust if the most common amino acids are the ones with the most robustness against typos, in other words, the most different codons.  That led to an interesting discussion of where DNA "typos" might come from, including cell reproduction and radiation.
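To make the redundancy concrete, here is a tiny slice of the standard codon table written as a Python dictionary (just three amino acids, using the DNA letters of the coding strand):

```python
# Leucine gets six codons; methionine and tryptophan get only one each.
codon_table = {
    'TTA': 'Leu', 'TTG': 'Leu', 'CTT': 'Leu',
    'CTC': 'Leu', 'CTA': 'Leu', 'CTG': 'Leu',
    'ATG': 'Met',
    'TGG': 'Trp',
}

# A "typo" in the third letter of a leucine codon often changes nothing:
print(codon_table['CTA'] == codon_table['CTG'])

# ...but methionine has no backup codon, so any typo in ATG
# changes the amino acid.
```

The same dictionary idea the kids used for decoding is exactly how a programmer would represent the genetic code.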

Finally, I asked why we need three letters to represent an amino acid if we have all these leftover three-letter combinations.  If we have four options (G,C,T, or A) at each DNA location and we have three consecutive locations, how many total options are there? The best way to think about this, if you are not used to mathematical manipulations, is to draw a tree with four big branches representing the four options at the first decision point.  Each of those big branches then has four medium-size branches representing the four options at the second decision point.  And each of those 16 medium-size branches then has four little branches representing the four options at the third decision point, which makes 64 little branches, representing 64 possible combinations.  There are only 20 amino acids, so with 64 code words we can have a lot of redundancy, leading to a lot of robustness against damage.  The last question we discussed was: since we have so many extra code words by using three-letter combinations, could we (robustness aside) get away with using just two DNA letters to code for an amino acid?  I'll let you figure that one out for yourself.
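The tree count is easy to double-check in Python (a sketch for readers; in class we only drew the tree):

```python
from itertools import product

bases = 'GCTA'

# Every ordered triple of bases is one little branch at the bottom
# of the tree, i.e. one possible codon.
codons = [''.join(triple) for triple in product(bases, repeat=3)]

print(len(codons))   # 4 * 4 * 4 branches
```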


Friday, January 11, 2013

Modeling Colliding Galaxy Clusters and Dark Matter

Science is about making models.  We make a conceptual model of
something (in other words, we guess how it works), then we figure out
what that model would predict, we compare those predictions to
observations, and then we either discard, modify, or (for the time
being) accept that the model could be correct. (Recently I stumbled
across an entertaining video of Richard Feynman explaining this
method.)  Ideally, the model is conceptually simple, with as few
moving parts as possible, yet is able to match a rich variety of data.
Sometimes this modeling process is about something really fundamental,
such as how gravity works; other times the modeling process is just
about estimating something we can't measure directly.  For example, a
scale which doesn't read the correct weight can still be used to infer
your weight, as long as you have a workable model such as "reads about
ten pounds too heavy" or "reads about half the true weight."

My student Will Dawson recently finished some work which is a really nice example of the latter kind of modeling.  We have been studying colliding clusters of galaxies; our lives are too short to see the collision play out, so we have to make a model which allows us to extrapolate from the current (just post-collision) state back to the time of maximum impact.

A good analogy: given a photograph and a speed gun reading taken a
split-second after an automobile collision, reason backward from that
to infer the velocity of collision and the time since collision.  The
main difference is that the galaxy clusters mostly go right through
each other because the space between the galaxies is so big.  The hot
gas clouds filling those spaces do collide, leaving a pancake of hot
gas in the middle while the galaxies continue on (eventually to slow
down and turn around due to gravity).  We are interested in what
happens to the dark matter: the Bullet Cluster and the Musketball
Cluster show that it is mostly collisionless, but "mostly" is
something we really want to quantify, for reasons I'll explain in a
future post.

But first, we have to make a model of the collision.  Speed guns and
spectrographs can only tell how fast the galaxies are moving toward us
or away from us; they say nothing about how fast they are moving in
the transverse direction (in the plane of the sky).  To study dark
matter we need to know the full three-dimensional velocity, and we
want to know what it was at the time of collision, rather than what it
is right now, after the collision.  This is closely related to how
much time has passed since the collision (by collision I really mean
the maximum overlap of the two clusters), because the observed
separation since the collision could have been achieved by moving at a
high velocity for a short time, or moving at a low velocity for a long
time.  Making things more complicated, the separation we observe is
only the transverse part of the separation.  So a collision which
occurs along the line of sight will give us a large velocity on our
speed gun but a small apparent separation, while the same collision
viewed transversely will exhibit a small part of the velocity and a
large part of the separation.  We don't know what angle we are viewing
the system at, so the true velocity could be just about anything.
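A toy illustration of this projection effect (the function and the numbers here are invented for this post, not taken from Will's code): if the collision axis makes an angle theta with the line of sight, the speed gun sees only the cos(theta) part of the velocity, while the sky shows only the sin(theta) part of the separation.

```python
import math

def observed(velocity_3d, separation_3d, theta):
    """Project the true 3-d velocity and separation onto what we measure.

    theta is the angle between the collision axis and the line of
    sight, in radians.  (Purely geometric toy model.)
    """
    v_line_of_sight = velocity_3d * math.cos(theta)   # speed-gun part
    d_on_sky = separation_3d * math.sin(theta)        # transverse part
    return v_line_of_sight, d_on_sky

# The same collision viewed nearly along the line of sight...
v1, d1 = observed(3000.0, 1.0, math.radians(5))
# ...versus nearly in the plane of the sky.
v2, d2 = observed(3000.0, 1.0, math.radians(85))
```

Both viewings are the same physical event; without knowing theta, the measured pair of numbers alone cannot pin down the true values.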

Like a rock climber inching up a chimney, the way out of this is to
push against two extremes.  If we observe any line-of-sight velocity,
the motion can't be completely transverse; in fact we can rule out a
range of near-transverse geometries because they would require an
absurdly large three-dimensional velocity.  (Absurdly large is defined
here as larger than the free-fall velocity from infinity.)  Similarly,
if we observe any transverse separation we can rule out a range of
nearly line-of-sight geometries because they would require absurdly
large three-dimensional separation.  (Absurdly large is defined here
as requiring longer than the age of the universe to reach that
separation.)

Still, we are left with a range of geometries in the middle; at the
extremes of that range the geometries aren't completely ruled out, but
look pretty unlikely.  Here Will applied another important concept:
marginalizing over all remaining possibilities.  His code ranges over
all possible geometries, tabulating how well they match the data, and
thus produces a probability distribution.  So we don't know exactly
how fast the collision was, but we can be 99% confident it is within
a certain broad range, 90% confident it is within a certain
smaller range, etc.
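A cartoon version of the marginalization step (the numbers, the one-parameter setup, and the uniform stepping are all invented for illustration; Will's actual analysis is far more complete): step over the unknown viewing angle, discard geometries already ruled out, weight the rest by how probable that orientation is for a randomly oriented collision, and tabulate the implied 3-d velocity.

```python
import math

v_obs = 600.0     # pretend line-of-sight velocity (km/s)
v_max = 3000.0    # pretend free-fall ceiling: anything faster is "absurd"

velocities = []   # implied 3-d velocity for each surviving geometry
weights = []      # orientation prior: random axes favor large angles

for deg in range(90):
    theta = math.radians(deg)           # angle from the line of sight
    v_3d = v_obs / math.cos(theta)      # geometry implies the 3-d velocity
    if v_3d > v_max:
        continue                        # ruled out: faster than free-fall
    velocities.append(v_3d)
    weights.append(math.sin(theta))

# A weighted mean as a crude stand-in for quoting the full distribution:
mean_v = sum(v * w for v, w in zip(velocities, weights)) / sum(weights)
```

The real analysis tabulates a full probability distribution over all the unknowns at once, rather than collapsing it to a single number.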

This was a pretty big advance in sophistication compared to the way
previous studies had estimated the collision velocity and time since
collision.  Using this technique, Will demonstrated that what
astronomers don't know can hurt them---not knowing the angle from
which we are viewing a collision results in substantial uncertainty in
the quantities we need to know to study the behavior of dark matter.
To narrow things down to a useful range, we need additional
information about a collision.  (In the case of the Bullet Cluster,
the shock gives this information, but most collisions have not shown
an observable shock.)

Will also used his new technique to estimate that it has been about
700 million years since the Musketball Cluster collided, compared to
"only" 200 million years for the Bullet Cluster.  This is good for the
study of dark matter because people would be justifiably skeptical of
conclusions about the nature of dark matter based on just one kind of
collision (a recent, high-speed collision of massive clusters such as
the Bullet).  Studying a range of collision conditions--including less
recent, lower-speed collisions of less-massive clusters such as the
Musketball--gives us a much better chance of identifying universal
properties of dark matter with high confidence.