The World in a Grain of Sand: programming

Showing posts with label programming. Show all posts

Saturday, February 9, 2013

Toy DNA Analysis Part II

After the first activity and recess break, I posed the challenge of
finding out which animal is most closely related to the hippopotamus.
I gave them short (400-base) DNA sequences of various mammals and
asked them to figure out which was most similar to the hippo's.
Again, I just made up these short sequences to keep it simple. In
real life, there are a lot more complications just as varying number
of chromosomes, billions of base pairs, etc.

The bottom line is that they had to take an animal, look at each
position in the sequence, ask whether that that animal's DNA at that
position was different from the hippo's, and total up the differences.
After repeating this for several animals, they would see which one had
the least differences. After the near-chaos of the morning's first
activity, I thought it would be a good idea to go over the big picture
and assemble some pseudocode on the board based on their ideas:

get hippo dna
for each animal other than hippo:
    get this animal's dna
    compare this animal's to hippodna
    print number of differences

I emphasized basic ideas like what do we want to put inside the loop and what do we want to put outside the loop. A program like this does it:

hippodna = open('hippo.txt').read()
nletters = len(hippodna)

for filename in ('cat.txt','dog.txt','rhino.txt','bluewhale.txt','rockhyrax.txt','rhesusmonkey.txt'):
    dna = open(filename).read()
    differences = 0
    position = 0
    for letters in hippodna:
        if hippodna[position] != dna[position]:
            differences += 1
 position += 1
    print filename,'has',differences,'differences with hippo dna'

Yes, Python experts, I know there are more efficient ways to do it but this seems most straightforward for a kid. I won't go into the detail I went into for the morning's first activity, but it was a similarly intense back-and-forth with students running into an obstacle every 30 seconds, which I tried to turn into a learning experience. They again needed emphasis on proper indenting, punctuation, remembering that if they named a variable 'difference' early on then they couldn't refer to it as 'differences' later on, etc. They again wanted to do one animal at a time rather than write a loop. But it was a really good learning experience.

This took close to an hour. At the end, we discussed how they might modify the program to see what's most closely related to some other mammal, and then a third mammal, etc, and build up a picture of how everything is related to everything else: a family tree. That was the emphasis of the morning's third activity.

Toy DNA Analysis Part I

Last time I visited the 5-7 graders at Peregrine School, I introduced
the idea of computer programming languages (specifically, Python) as a
way of automatic repetitive tasks. That sounds boring, but I made it
interesting by building the activity around an inherently interesting
challenge---cracking a code---which happens to have a repetitive
aspect---translating each letter of a long message after you've
figured out the substitution pattern. At the end, we had a little
time to discuss how DNA is a code, which is the connection I really
wanted to make to what they had already learned in biology. This
week, we were poised to take it much further with three data analysis
challenges, two of them requiring Python programming. As a result,
I'm going to split the morning's activities into three separate blog
entries.

The first challenge was to rescue a baby. Some babies are born with a
genetic defect which does not allow their body to process the amino
acid phenylalanine. If they don't follow a strict low-phenylalanine
diet, it builds up in their body and causes brain damage (mental
retardation) within the first year of life. So there is a very strong
motivation to find out if the baby has this condition very soon after
it is born! In practice, there is a blood test which does not involve
analyzing the DNA itself, but I framed the problem as one of DNA
analysis. I made up long DNA sequences for about 15 babies, and asked
them to write a program which would look at a certain location within
each sequence, and print an alert if the sequence was not normal. To
keep things simple, I just made up a simple condition for "normal":
the three letters starting at position 399 should be "GAT".

They had forgotten a lot in the two weeks since my first visit. I
would normally recommend much more frequent visits for teaching
programming, but I was sick last week. Still, they were very into it.
They wracked their brains trying to remember things, but I had to give
a refresher tutorial on for loops and if statements. It took almost
an hour for everyone to finish completely, and there was again a
friendly competition to see who could finish (identify the sick baby)
first.

The evolution of their programs was fascinating. One group thought
they were done when they had something like:

dna = open('jane.txt').read()
print dna[398:401]

Even this simple program has a lot of ideas in it. For example, doing
the subscripting correctly requires thinking about: (1) position 398
is actually the 399th position because Python starts counting at
position 0 rather than position 1; (2) 398:401 means up to but not
including 401; (3) you can't write dna[398-401] because the arithmetic
operation 398-401 yields -3, and Python will interpret this as 3
positions before the end of the sequence. There was a lot of
interaction between me and the kids on each of these points.

In any case, when the first group had gotten this far, they thought
they were done. They would just change jane.txt to john.txt, rerun it
to see if John was sick, repeat for Joan, etc, and just read off the
screen whether it said 'GAT' for that baby. This would work ok for
the 15 examples I brought, but PKU (the aforementioned disease) is
present in only about 1 in 15,000 babies, so they really needed to
automate a loop over all the babies. They initially thought I was
moving the goalposts, but the other groups agreed that this first
group was not really done. So they started working on a loop:

names = ('jane.txt','john.txt','joe.txt')
for name in names:
    dna = open(name).read()
    print dna[398:401]

I had to repeat many times the importance of proper indentation so that it's clear
(to the computer and to a human reading the code) which actions get
repeated for each name, and which actions only get done once. A
common mistake is

names = ('jane.txt','john.txt','joe.txt')
for name in names:
    dna = open(name).read()
print dna[398:401]

which prints only the last one, after cycling through all the names. Notice that I typed in only three names to make sure my loop would work, before bothering to type in all 15 names. In fact, I taught them how to avoid typing in any names by grabbing all names that fit a certain pattern:

import glob
names = glob.glob('j*.txt')
for name in names:
    dna = open(name).read()
    print dna[398:401]

I briefly discussed how it can be incredibly time-saving to import some functionality that someone else has already written. But back to the main point, the fastest group now thought that they were really done: this program would print "GAT" for each healthy person, and they just had to scan the output for something other than GAT. But, I asked them, how would they know which person had that abnormal sequence? They tried adding to the end of their program:

import glob
names = glob.glob('j*.txt')
for name in names:
    dna = open(name).read()
    print dna[398:401]
    print name

But this gives output like:

GAT
jane.txt
GAT
john.txt
GCT
joe.txt
GAT
jim.txt
GAT
...

and they incorrectly interpreted this as John having GCT, because they hadn't paid attention to the fact that they asked the computer to print the DNA first, and then the name. An important rule of programming is to make your output clear. Things which belong together should be printed together. A much better print statement is:

print dna[398:401], name

which keeps it to one line per baby:

GAT jane.txt
GAT john.txt
GCT joe.txt
GAT jim.txt
...

Now they really thought they were done, but I pointed out that they were looking for 1 in 15,000 babies, and they couldn't count on a human to scan a list of 15,000 lines which say 'GAT' and not miss an abnormal one. They really need to print out just the sick ones with a very clear warning:

import glob
names = glob.glob('j*.txt')
for name in names:
    dna = open(name).read()
    if dna[398:401] != 'GAT':
       print name, "is sick"

Now they were done: the output is simply

joe.txt is sick

I took a few minutes to talk about the subtleties of combining if's, for example the difference between

if dna[398] != 'G' and dna[399] != 'A' and dna[400] != 'T'

if dna[398] != 'G' or dna[399] != 'A' or dna[400] != 'T'

Surprisingly (to me at least), they found this last point very easy. As I said, this whole activity took very nearly an hour. (I had saved a bit of time by preloading their computers with the data files.) It was highly interactive, and if there had been much more than three groups having another knowledgeable adult in the room probably would have been a good idea. I'm not going to pretend that as a result of this activity the kids have actually mastered even the very basics of programming, but it was good practice in logical thinking, and it's clear that with continued practice I will eventually get them doing some real data analysis. One good sign: one of the girls continued programming throughout the recess break following this activity!

Friday, January 25, 2013

DNA is Code

Today I started Friday morning sessions with the upper (5-7) graders
at Peregrine School. I'll have about ten of these weeks before I
switch to astronomy with the 3-4 graders. My main job with the upper
graders is to cover the geology standards which didn't get covered in
the lead-up to the Yosemite trip, but I'm going to take a few weeks
first to build on some of the ideas they learned before Christmas
break regarding genetics and DNA. In the process, they will learn how
to analyze data and program computers. It's ambitious, but with these
kids I have to be ambitious.

DNA is a code. The letters themselves have no intrinsic meaning, but

they represent different specific amino acids which, when assembled in

the sequence specified by the code, fold up into a 3-d shape which is

useful for performing some function in the body. So to introduce the

idea of "breaking the code," I started with a non-DNA message in code:

Uif Ivohfs Hbnft jt b 2008 zpvoh bevmu opwfm cz Bnfsjdbo xsjufs Tvaboof Dpmmjot. Ju jt xsjuufo jo uif wpjdf pg 16-zfbs-pme Lbuojtt Fwfseffo, xip mjwft jo uif qptu-bqpdbmzqujd obujpo pg Qbofn, xifsf uif dpvousjft pg Opsui Bnfsjdb podf fyjtufe. Uif Dbqjupm, b ijhimz bewbodfe nfuspqpmjt, fyfsdjtft qpmjujdbm dpouspm pwfs uif sftu pg uif obujpo. Uif Ivohfs Hbnft bsf bo boovbm fwfou jo xijdi pof cpz boe pof hjsm bhfe 12–18 gspn fbdi pg uif uxfmwf ejtusjdut tvsspvoejoh uif Dbqjupm bsf tfmfdufe cz mpuufsz up dpnqfuf jo b ufmfwjtfe cbuumf up uif efbui.

Uif cppl sfdfjwfe nptumz qptjujwf gffecbdl gspn nbkps sfwjfxfst boe bvuipst, jodmvejoh bvuipst Tufqifo Ljoh boe Tufqifojf Nfzfs. Ju xbt qsbjtfe gps jut tupszmjof boe dibsbdufs efwfmpqnfou, uipvhi tpnf sfwjfxfst ibwf opufe tjnjmbsjujft cfuxffo Dpmmjot' cppl boe Lpvtivo Ublbnj't Cbuumf Spzbmf (1999), cpui pg xijdi efmjwfs ubmft frvbmmz hsjqqjoh, zfu mftt ebsl uibo uif hfosf't psjhjobm nbtufs xpsl: Ljoh't Uif Mp oh Xbml (1979). Jo xsjujoh Uif Ivohfs Hbnft, Dpmmjot esfx vqpo Hsffl nzuipmphz, Spnbo hmbejbupsjbm hbnft boe dpoufnqpsbsz sfbmjuz ufmfwjtjpo gps uifnbujd dpoufou. Uif opwfm xpo nboz bxbset, jodmvejoh uif Dbmjgpsojb Zpvoh Sfbefs Nfebm, boe xbt obnfe pof pg Qvcmjtifst Xfflmz't "Cftu Cpplt pg uif Zfbs" jo 2008.

Uif Ivohfs Hbnft xbt gjstu qvcmjtife jo ibsedpwfs po Tfqufncfs 14, 2008 cz Tdipmbtujd, gfbuvsjoh b dpwfs eftjhofe cz Ujn P'Csjfo. Ju ibt tjodf cffo sfmfbtfe jo qbqfscbdl boe bmtp bt bo bvejpcppl boe fcppl. Bgufs bo jojujbm qsjou pg 200,000, uif cppl ibe tpme 800,000 dpqjft cz Gfcsvbsz 2010. Tjodf jut sfmfbtf, Uif Ivohfs Hbnft ibt cffo usbotmbufe joup 26 mbohvbhft, boe qvcmjtijoh sjhiut ibwf cffo tpme jo 38 uf ssjupsjft. Uif opwfm jt uif gjstu jo Uif Ivohfs Hbnft usjmphz, gpmmpxfe cz Dbudijoh Gjsf (2009) boe Npdljohkbz (2010). B gjmn bebqubujpo, ejsfdufe cz Hbsz Sptt boe dp-xsjuufo boe dp-qspevdfe cz Dpmmjot ifstfmg, xbt sfmfbtfe jo 2012.

This was a really good idea for motivating the students; they really
wanted to break the code once they realized it was doable! I elicited
from them some observations which might be useful:

Each paragraph begins with "Uif". They guessed pretty quickly that "Uif" might stand for "The". This was a springboard for making explicit what they already implicitly knew about word and letter frequencies in English.
Numbers seem to stand for themselves: "Tfqufncfs 14, 2008" looks like a date.
Ditto for punctuation: the periods and commas look natural, not like they stand in for something else.
"b" is a word by itself. Presumably it stands for either "a" or "I".

One student guessed the answer even as I was writing down the fourth

observation: simply replace each letter here with the letter which

precedes it in the alphabet. This "b" becomes "a" and "Uif" becomes

"The". I asked them to make some predictions to double-check this

hypothesis. For example, this hypothesis predicts that "a" and "r"

will be uncommon because they stand for the uncommon letters "z" and

"q". That checks out, so let's proceed.

Here's where I introduce computers. It would be extremely tedious to

manually translate all these letters to decode the message. But if we

write a computer program to do it, in a short time we will have a tool

that automatically translates this and all future such messages.

Computers are great for automating repetitive tasks!

Before we actually open our laptops, let's write some pseudocode.

"Code" has a new meaning here: "a set of instructions to the

computer", and pseudocode is when we write those instructions a bit

more like everyday language to help us organize and communicate our

thoughts to other humans. To decode our message, we want to tell the

computer to do this:

for each character in the message:
    if character is a letter:
       print the previous letter
    else:
       print the same character # (because it's punctuation or a number)

(The # symbol and everything after it represent a comment about the instruction, rather than an actual instruction.) This pseudocode illustrates two of the most important ideas in programming: looping, in which we repeat some set of instructions over and over, and if-then-else.

A somewhat old-fashoned way of constructing a loop which does something 100 times is:

tasksCompleted = 0
while tasksCompleted < 100:
      doSomething
      tasksCompleted = tasksCompleted + 1

Although this is a bit old-fashioned, I thought it was worth making the loop logic very explicit for beginners. Next, we practiced some simple loops in an actual language, Python. I gave them a file containing the coded message, named 'coded1.txt'. I showed them how to read it in:

letters = open('coded1.txt').read()

Now letters is the name of a location in memory that contains the message. A very simple loop is:

for character in letters:
     print character,

With this more modern type of loop, Python automatically keeps track of how many times we need to repeat. This is easier to write, easier for others to read, and less error-prone.

So far we have just repeated the very simple action of printing that character. By the time each group had finished this, the 45 minutes was up and we took a break. After the break, we set about putting more complicated instructions inside the loop. We started writing something like:

for character in letters:
    if character == 'b':
       print 'a',   
    elif character == 'c':
       print 'b',

(Note that == is a test of equality, whereas = would have been an assignment, forcing them to be equal; and elif is shorthand for else if.) We can see that this will become tedious if we have to type in an if-else block for each substitution rule, so let's find a better way. Let's set up a dictionary named sub (short for substitution):

sub = {'b':'a',
       'c':'b',
       'd':'c',
       ...}

so now we can just say

print sub[character]

and if the character is 'b' the computer will look up the "definition" (the value, really) stored at the dictionary location of 'b', which is 'a', and print that. (Note that the ... above is not real code, but you can guess what it stands for.) We spent a while talking about the software concept of dictionaries and playing with them for a while before returning to the decoding task, which now looks like:

letters = open('coded1').read()
sub = {'a':'z',
       'b':'a',
       'c':'b',
       'd':'c',
       ...}

for character in letters:
    if character in sub:
        print sub[character]
    else:
        print character

As a test, we did this after typing in just some of the substitution rules. We could see that it was going to work, but it didn't handle capitals: "Uif" became "Uhe". Instead of typing in another 26 upper-case substitution rules, we did this before entering our main loop:

for key in sub.keys():
    upperCaseKey = str.upper(key)
    upperCaseSub = str.upper(sub[key])
    sub[upperCaseKey] = upperCaseSub

str.upper is a function which makes an upper-case version of the letter which is given to it inside the parentheses. So this block of code creates a new upper case dictionary entry and value for each pre-existing one, so we don't have to type them in manually. (Yes, I know there are more efficient ways to do this, but I wanted to make the logic very explicit for the learners.) With that done, the race was on as each group rushed to be the first to type in all the substitution rules and read the message. That was really fun. If you want to see the message, you'll just have to assemble your own Python script out of the parts I've illustrated here!

Once all the groups had decoded the message, we had used about 45 minutes since the break. A word of advice to teachers: it takes a while. Although I explained several times how one typo or indentation mistake will torpedo the script, kids are not used to this, and they need a lot of help finding their mistakes even though I had my own example code projected up on the screen. But it was really worth spending the time. They started to get the idea of debugging; although I often had to find the bugs for them, they started to understand the process. I had the luxury of doing it with six kids in three groups; if you have a larger group I would try to arrange for assistants and give them some advance training.

That gave us about 12 minutes to discuss how DNA is a code. We reviewed a bit about how letters code for amino acids, which build proteins. Then we looked at the DNA code table and discussed by some amino acids have many codons (groups of three DNA letters) which code for them, while others have just one. The basic idea is that some "typos" will not actually result in a mistake in the amino acid, which is a good thing; it adds robustness to the code. And it's even more robust if the most common amino acids are the ones with the most robustness against typos, in other words have the most different codons. That led to an interesting discussion of where DNA "typos" might come from, including cell reproduction and radiation.

Finally, I asked why we need three letters to represent an amino acid if we have all these leftover three-letter combinations. If we have four options (G,C,T, or A) at each DNA location and we have three consecutive locations, how many total options are there? The best way to think about this, if you are not used to mathematical manipulations, is to draw a tree with four big branches representing the four options at the first decision point. Each of those big branches then has four medium-size branches representing the four options at the second decision point. And each of those 16 medium-size branches then then has four little branches representing the four options at the third decision point, which makes 64 little branches, representing 64 possible combinations. There are only 20 amino acids, so with 64 code words we can have a lot of redundancy, leading to a lot of robustness against damage. The last question we discussed was: since we have so many extra code words by using three-letter combinations, could we (robustness aside) get away with using just two DNA letters to code for an amino acid? I'll let you figure that one out for yourself.