Saturday, February 9, 2013

Toy DNA Analysis Part I

Last time I visited the 5-7 graders at Peregrine School, I introduced
the idea of computer programming languages (specifically, Python) as a
way of automatic repetitive tasks.  That sounds boring, but I made it
interesting by building the activity around an inherently interesting
challenge---cracking a code---which happens to have a repetitive
aspect---translating each letter of a long message after you've
figured out the substitution pattern.  At the end, we had a little
time to discuss how DNA is a code, which is the connection I really
wanted to make to what they had already learned in biology.  This
week, we were poised to take it much further with three data analysis
challenges, two of them requiring Python programming.  As a result,
I'm going to split the morning's activities into three separate blog
entries.

The first challenge was to rescue a baby.  Some babies are born with a
genetic defect which does not allow their body to process the amino
acid phenylalanine.  If they don't follow a strict low-phenylalanine
diet, it builds up in their body and causes brain damage (mental
retardation) within the first year of life.  So there is a very strong
motivation to find out if the baby has this condition very soon after
it is born!  In practice, there is a blood test which does not involve
analyzing the DNA itself, but I framed the problem as one of DNA
analysis.  I made up long DNA sequences for about 15 babies, and asked
them to write a program which would look at a certain location within
each sequence, and print an alert if the sequence was not normal.  To
keep things simple, I just made up a simple condition for "normal":
the three letters starting at position 399 should be "GAT".

They had forgotten a lot in the two weeks since my first visit.  I
would normally recommend much more frequent visits for teaching
programming, but I was sick last week.  Still, they were very into it.
They wracked their brains trying to remember things, but I had to give
a refresher tutorial on for loops and if statements.  It took almost
an hour for everyone to finish completely, and there was again a
friendly competition to see who could finish (identify the sick baby)
first. 

The evolution of their programs was fascinating.  One group thought
they were done when they had something like:

dna = open('jane.txt').read()
print dna[398:401]

Even this simple program has a lot of ideas in it.  For example, doing
the subscripting correctly requires thinking about: (1) position 398
is actually the 399th position because Python starts counting at
position 0 rather than position 1; (2) 398:401 means up to but not
including 401; (3) you can't write dna[398-401] because the arithmetic
operation 398-401 yields -3, and Python will interpret this as 3
positions before the end of the sequence.  There was a lot of
interaction between me and the kids on each of these points.

In any case, when the first group had gotten this far, they thought
they were done.  They would just change jane.txt to john.txt, rerun it
to see if John was sick, repeat for Joan, etc, and just read off the
screen whether it said 'GAT' for that baby.  This would work ok for
the 15 examples I brought, but PKU (the aforementioned disease) is
present in only about 1 in 15,000 babies, so they really needed to
automate a loop over all the babies.  They initially thought I was
moving the goalposts, but the other groups agreed that this first
group was not really done.  So they started working on a loop:
names = ('jane.txt','john.txt','joe.txt')
for name in names:
    dna = open(name).read()
    print dna[398:401]
I had to repeat many times the importance of proper indentation so that it's clear
(to the computer and to a human reading the code) which actions get
repeated for each name, and which actions only get done once.  A
common mistake is
names = ('jane.txt','john.txt','joe.txt')
for name in names:
    dna = open(name).read()
print dna[398:401]
which prints only the last one, after cycling through all the names. Notice that I typed in only three names to make sure my loop would work, before bothering to type in all 15 names. In fact, I taught them how to avoid typing in any names by grabbing all names that fit a certain pattern:
import glob
names = glob.glob('j*.txt')
for name in names:
    dna = open(name).read()
    print dna[398:401]
I briefly discussed how it can be incredibly time-saving to import some functionality that someone else has already written. But back to the main point, the fastest group now thought that they were really done: this program would print "GAT" for each healthy person, and they just had to scan the output for something other than GAT. But, I asked them, how would they know which person had that abnormal sequence? They tried adding to the end of their program:
import glob
names = glob.glob('j*.txt')
for name in names:
    dna = open(name).read()
    print dna[398:401]
    print name
But this gives output like:
GAT
jane.txt
GAT
john.txt
GCT
joe.txt
GAT
jim.txt
GAT
...
and they incorrectly interpreted this as John having GCT, because they hadn't paid attention to the fact that they asked the computer to print the DNA first, and then the name. An important rule of programming is to make your output clear. Things which belong together should be printed together. A much better print statement is:
print dna[398:401], name
which keeps it to one line per baby:
GAT jane.txt
GAT john.txt
GCT joe.txt
GAT jim.txt
...
Now they really thought they were done, but I pointed out that they were looking for 1 in 15,000 babies, and they couldn't count on a human to scan a list of 15,000 lines which say 'GAT' and not miss an abnormal one. They really need to print out just the sick ones with a very clear warning:
import glob
names = glob.glob('j*.txt')
for name in names:
    dna = open(name).read()
    if dna[398:401] != 'GAT':
       print name, "is sick"
Now they were done: the output is simply
joe.txt is sick
I took a few minutes to talk about the subtleties of combining if's, for example the difference between
if dna[398] != 'G' and dna[399] != 'A' and dna[400] != 'T'
vs
if dna[398] != 'G' or dna[399] != 'A' or dna[400] != 'T'
Surprisingly (to me at least), they found this last point very easy. As I said, this whole activity took very nearly an hour. (I had saved a bit of time by preloading their computers with the data files.) It was highly interactive, and if there had been much more than three groups having another knowledgeable adult in the room probably would have been a good idea. I'm not going to pretend that as a result of this activity the kids have actually mastered even the very basics of programming, but it was good practice in logical thinking, and it's clear that with continued practice I will eventually get them doing some real data analysis. One good sign: one of the girls continued programming throughout the recess break following this activity!

2 comments:

  1. This is really cool! I had no idea you could teach programming to students this young, let alone DNA sequence analysis :) Would love to hear more about it.

    ReplyDelete
    Replies
    1. Surprisingly (to me), the students thought programming was really cool. One of them is now, on his own, writing a program to generate prime numbers.

      Delete