Friday, January 25, 2013

DNA is Code


Today I started Friday morning sessions with the upper (5-7) graders
at Peregrine School.  I'll have about ten of these weeks before I
switch to astronomy with the 3-4 graders.  My main job with the upper
graders is to cover the geology standards which didn't get covered in
the lead-up to the Yosemite trip, but I'm going to take a few weeks
first to build on some of the ideas they learned before Christmas
break regarding genetics and DNA.  In the process, they will learn how
to analyze data and program computers.  It's ambitious, but with these
kids I have to be ambitious.

DNA is a code.  The letters themselves have no intrinsic meaning, but
they represent different specific amino acids which, when assembled in
the sequence specified by the code, fold up into a 3-d shape which is
useful for performing some function in the body.  So to introduce the
idea of "breaking the code," I started with a non-DNA message in code:
Uif Ivohfs Hbnft jt b 2008 zpvoh bevmu opwfm cz Bnfsjdbo xsjufs Tvaboof Dpmmjot. Ju jt xsjuufo jo uif wpjdf pg 16-zfbs-pme Lbuojtt Fwfseffo, xip mjwft jo uif qptu-bqpdbmzqujd obujpo pg Qbofn, xifsf uif dpvousjft pg Opsui Bnfsjdb podf fyjtufe. Uif Dbqjupm, b ijhimz bewbodfe nfuspqpmjt, fyfsdjtft qpmjujdbm dpouspm pwfs uif sftu pg uif obujpo. Uif Ivohfs Hbnft bsf bo boovbm fwfou jo xijdi pof cpz boe pof hjsm bhfe 12–18 gspn fbdi pg uif uxfmwf ejtusjdut tvsspvoejoh uif Dbqjupm bsf tfmfdufe cz mpuufsz up dpnqfuf jo b ufmfwjtfe cbuumf up uif efbui.

Uif cppl sfdfjwfe nptumz qptjujwf gffecbdl gspn nbkps sfwjfxfst boe bvuipst, jodmvejoh bvuipst Tufqifo Ljoh boe Tufqifojf Nfzfs. Ju xbt qsbjtfe gps jut tupszmjof boe dibsbdufs efwfmpqnfou, uipvhi tpnf sfwjfxfst ibwf opufe tjnjmbsjujft cfuxffo Dpmmjot' cppl boe Lpvtivo Ublbnj't Cbuumf Spzbmf (1999), cpui pg xijdi efmjwfs ubmft frvbmmz hsjqqjoh, zfu mftt ebsl uibo uif hfosf't psjhjobm nbtufs xpsl: Ljoh't Uif Mp oh Xbml (1979). Jo xsjujoh Uif Ivohfs Hbnft, Dpmmjot esfx vqpo Hsffl nzuipmphz, Spnbo hmbejbupsjbm hbnft boe dpoufnqpsbsz sfbmjuz ufmfwjtjpo gps uifnbujd dpoufou. Uif opwfm xpo nboz bxbset, jodmvejoh uif Dbmjgpsojb Zpvoh Sfbefs Nfebm, boe xbt obnfe pof pg Qvcmjtifst Xfflmz't "Cftu Cpplt pg uif Zfbs" jo 2008.

Uif Ivohfs Hbnft xbt gjstu qvcmjtife jo ibsedpwfs po Tfqufncfs 14, 2008 cz Tdipmbtujd, gfbuvsjoh b dpwfs eftjhofe cz Ujn P'Csjfo. Ju ibt tjodf cffo sfmfbtfe jo qbqfscbdl boe bmtp bt bo bvejpcppl boe fcppl. Bgufs bo jojujbm qsjou pg 200,000, uif cppl ibe tpme 800,000 dpqjft cz Gfcsvbsz 2010. Tjodf jut sfmfbtf, Uif Ivohfs Hbnft ibt cffo usbotmbufe joup 26 mbohvbhft, boe qvcmjtijoh sjhiut ibwf cffo tpme jo 38 uf ssjupsjft. Uif opwfm jt uif gjstu jo Uif Ivohfs Hbnft usjmphz, gpmmpxfe cz Dbudijoh Gjsf (2009) boe Npdljohkbz (2010). B gjmn bebqubujpo, ejsfdufe cz Hbsz Sptt boe dp-xsjuufo boe dp-qspevdfe cz Dpmmjot ifstfmg, xbt sfmfbtfe jo 2012.

This was a really good idea for motivating the students; they really
wanted to break the code once they realized it was doable! I elicited
from them some observations which might be useful:
  • Each paragraph begins with "Uif".  They guessed pretty quickly that "Uif" might stand for "The".  This was a springboard for making  explicit what they already implicitly knew about word and letter frequencies in English.
  • Numbers seem to stand for themselves: "Tfqufncfs 14, 2008" looks like a date.
  • Ditto for punctuation: the periods and commas look natural, not like they stand in for something else.
  • "b" is a word by itself.  Presumably it stands for either "a" or "I".
One student guessed the answer even as I was writing down the fourth
observation: simply replace each letter here with the letter which
precedes it in the alphabet.  This "b" becomes "a" and "Uif" becomes
"The".  I asked them to make some predictions to double-check this
hypothesis.  For example, this hypothesis predicts that "a" and "r"
will be uncommon because they stand for the uncommon letters "z" and
"q".  That checks out, so let's proceed.

Here's where I introduce computers.  It would be extremely tedious to
manually translate all these letters to decode the message.  But if we
write a computer program to do it, in a short time we will have a tool
that automatically translates this and all future such messages.
Computers are great for automating repetitive tasks!

Before we actually open our laptops, let's write some pseudocode.
"Code" has a new meaning here: "a set of instructions to the
computer", and pseudocode is when we write those instructions a bit
more like everyday language to help us organize and communicate our
thoughts to other humans.  To decode our message, we want to tell the
computer to do this:

for each character in the message:
    if character is a letter:
       print the previous letter
    else:
       print the same character # (because it's punctuation or a number)

(The # symbol and everything after it represent a comment about the instruction, rather than an actual instruction.)  This pseudocode illustrates two of the most important ideas in programming: looping, in which we repeat some set of instructions over and over, and if-then-else.

A somewhat old-fashoned way of constructing a loop which does something 100 times is:

tasksCompleted = 0
while tasksCompleted < 100:
      doSomething
      tasksCompleted = tasksCompleted + 1

Although this is a bit old-fashioned, I thought it was worth making the loop logic very explicit for beginners. Next, we practiced some simple loops in an actual language, Python.  I gave them a file containing the coded message, named 'coded1.txt'.  I showed them how to read it in:

letters = open('coded1.txt').read()

Now letters is the name of a location in memory that contains the message.  A very simple loop is:

for character in letters:
     print character,

With this more modern type of loop, Python automatically keeps track of how many times we need to repeat.  This is easier to write, easier for others to read, and less error-prone.

So far we have just repeated the very simple action of printing that character.  By the time each group had finished this, the 45 minutes was up and we took a break. After the break, we set about putting more complicated instructions inside the loop.  We started writing something like:

for character in letters:
    if character == 'b':
       print 'a',   
    elif character == 'c':
       print 'b',

(Note that == is a test of equality, whereas = would have been an assignment, forcing them to be equal; and elif is shorthand for else if.)  We can see that this will become tedious if we have to type in an if-else block for each substitution rule, so let's find a better way.  Let's set up a dictionary named sub (short for substitution):

sub = {'b':'a',
       'c':'b',
       'd':'c',
       ...}

so now we can just say

print sub[character]

and if the character is 'b' the computer will look up the "definition" (the value, really) stored at the dictionary location of 'b', which is 'a', and print that.  (Note that the ... above is not real code, but you can guess what it stands for.)  We spent a while talking about the software concept of dictionaries and playing with them for a while before returning to the decoding task, which now looks like:

letters = open('coded1').read()
sub = {'a':'z',
       'b':'a',
       'c':'b',
       'd':'c',
       ...}

for character in letters:
    if character in sub:
        print sub[character]
    else:
        print character

As a test, we did this after typing in just some of the substitution rules.  We could see that it was going to work, but it didn't handle capitals: "Uif" became "Uhe". Instead of typing in another 26 upper-case substitution rules, we did this before entering our main loop:

for key in sub.keys():
    upperCaseKey = str.upper(key)
    upperCaseSub = str.upper(sub[key])
    sub[upperCaseKey] = upperCaseSub

str.upper is a function which makes an upper-case version of the letter which is given to it inside the parentheses.  So this block of code creates a new upper case dictionary entry and value for each pre-existing one, so we don't have to type them in manually.  (Yes, I know there are more efficient ways to do this, but I wanted to make the logic very explicit for the learners.) With that done, the race was on as each group rushed to be the first to type in all the substitution rules and read the message.  That was really fun.  If you want to see the message, you'll just have to assemble your own Python script out of the parts I've illustrated here!

Once all the groups had decoded the message, we had used about 45 minutes since the break.  A word of advice to teachers: it takes a while.  Although I explained several times how one typo or indentation mistake will torpedo the script, kids are not used to this, and they need a lot of help finding their mistakes even though I had my own example code projected up on the screen. But it was really worth spending the time.  They started to get the idea of debugging; although I often had to find the bugs for them, they started to understand the process.  I had the luxury of doing it with six kids in three groups; if you have a larger group I would try to arrange for assistants and give them some advance training.

That gave us about 12 minutes to discuss how DNA is a code.  We reviewed a bit about how letters code for amino acids, which build proteins.  Then we looked at the DNA code table and discussed by some amino acids have many codons (groups of three DNA letters) which code for them, while others have just one.  The basic idea is that some "typos" will not actually result in a mistake in the amino acid, which is a good thing; it adds robustness to the code.  And it's even more robust if the most common amino acids are the ones with the most robustness against typos, in other words have the most different codons.  That led to an interesting discussion of where DNA "typos" might come from, including cell reproduction and radiation.

Finally, I asked why we need three letters to represent an amino acid if we have all these leftover three-letter combinations.  If we have four options (G,C,T, or A) at each DNA location and we have three consecutive locations, how many total options are there? The best way to think about this, if you are not used to mathematical manipulations, is to draw a tree with four big branches representing the four options at the first decision point.  Each of those big branches then has four medium-size branches representing the four options at the second decision point.  And each of those 16 medium-size branches then then has four little branches representing the four options at the third decision point, which makes 64 little branches, representing 64 possible combinations.  There are only 20 amino acids, so with 64 code words we can have a lot of redundancy, leading to a lot of robustness against damage.  The last question we discussed was: since we have so many extra code words by using three-letter combinations, could we (robustness aside) get away with using just two DNA letters to code for an amino acid?  I'll let you figure that one out for yourself.


No comments:

Post a Comment