Last time I visited the 5-7 graders at
Peregrine School, I introduced
the idea of computer programming languages (specifically, Python) as a
way of automatic repetitive tasks. That sounds boring, but I made it
interesting by
building the activity around an inherently interesting
challenge---cracking a code---which happens to have a repetitive
aspect---translating each letter of a long message after you've
figured out the substitution pattern. At the end, we had a little
time to discuss how DNA is a code, which is the connection I really
wanted to make to what they had already learned in biology. This
week, we were poised to take it much further with three data analysis
challenges, two of them requiring Python programming. As a result,
I'm going to split the morning's activities into three separate blog
entries.
The first challenge was to rescue a baby. Some babies are born with a
genetic defect which does not allow their body to process the amino
acid phenylalanine. If they don't follow a strict low-phenylalanine
diet, it builds up in their body and causes brain damage (mental
retardation) within the first year of life. So there is a very strong
motivation to find out if the baby has this condition very soon after
it is born! In practice, there is a blood test which does not involve
analyzing the DNA itself, but I framed the problem as one of DNA
analysis. I made up long DNA sequences for about 15 babies, and asked
them to write a program which would look at a certain location within
each sequence, and print an alert if the sequence was not normal. To
keep things simple, I just made up a simple condition for "normal":
the three letters starting at position 399 should be "GAT".
They had forgotten a lot in the two weeks since my first visit. I
would normally recommend much more frequent visits for teaching
programming, but I was sick last week. Still, they were very into it.
They wracked their brains trying to remember things, but I had to give
a refresher tutorial on for loops and if statements. It took almost
an hour for everyone to finish completely, and there was again a
friendly competition to see who could finish (identify the sick baby)
first.
The evolution of their programs was fascinating. One group thought
they were done when they had something like:
dna = open('jane.txt').read()
print dna[398:401]
Even this simple program has a
lot of ideas in it. For example, doing
the subscripting correctly requires thinking about: (1) position 398
is actually the 399th position because Python starts counting at
position 0 rather than position 1; (2) 398:401 means up to but not
including 401; (3) you can't write dna[398-401] because the arithmetic
operation 398-401 yields -3, and Python will interpret this as 3
positions before the
end of the sequence. There was a lot of
interaction between me and the kids on each of these points.
In any case, when the first group had gotten this far, they thought
they were done. They would just change jane.txt to john.txt, rerun it
to see if John was sick, repeat for Joan, etc, and just read off the
screen whether it said 'GAT' for that baby. This would work ok for
the 15 examples I brought, but PKU (the aforementioned disease) is
present in only about 1 in 15,000 babies, so they really needed to
automate a loop over all the babies. They initially thought I was
moving the goalposts, but the other groups agreed that this first
group was not really done. So they started working on a loop:
names = ('jane.txt','john.txt','joe.txt')
for name in names:
dna = open(name).read()
print dna[398:401]
I had to repeat many times the importance of proper indentation so that it's clear
(to the computer and to a human reading the code) which actions get
repeated for each name, and which actions only get done once. A
common mistake is
names = ('jane.txt','john.txt','joe.txt')
for name in names:
dna = open(name).read()
print dna[398:401]
which prints only the last one, after cycling through all the names.
Notice that I typed in only three names to make sure my loop would
work, before bothering to type in all 15 names. In fact, I taught
them how to avoid typing in any names by grabbing all names that fit a
certain pattern:
import glob
names = glob.glob('j*.txt')
for name in names:
dna = open(name).read()
print dna[398:401]
I briefly discussed how it can be incredibly time-saving to import
some functionality that someone else has already written. But back to
the main point, the fastest group now thought that they were really
done: this program would print "GAT" for each healthy person, and they
just had to scan the output for something other than GAT. But, I
asked them, how would they know which person had that abnormal
sequence? They tried adding to the end of their program:
import glob
names = glob.glob('j*.txt')
for name in names:
dna = open(name).read()
print dna[398:401]
print name
But this gives output like:
GAT
jane.txt
GAT
john.txt
GCT
joe.txt
GAT
jim.txt
GAT
...
and they incorrectly interpreted this as John having GCT, because they
hadn't paid attention to the fact that they asked the computer to
print the DNA first, and
then the name. An important rule of
programming is to make your output clear. Things which belong
together should be printed together. A much better print statement is:
print dna[398:401], name
which keeps it to one line per baby:
GAT jane.txt
GAT john.txt
GCT joe.txt
GAT jim.txt
...
Now they
really thought they were done, but I pointed out that they
were looking for 1 in 15,000 babies, and they couldn't count on a
human to scan a list of 15,000 lines which say 'GAT' and not miss an
abnormal one. They really need to print out just the sick ones with a
very clear warning:
import glob
names = glob.glob('j*.txt')
for name in names:
dna = open(name).read()
if dna[398:401] != 'GAT':
print name, "is sick"
Now they were done: the output is simply
joe.txt is sick
I took a few minutes to talk about the subtleties of combining if's,
for example the difference between
if dna[398] != 'G' and dna[399] != 'A' and dna[400] != 'T'
vs
if dna[398] != 'G' or dna[399] != 'A' or dna[400] != 'T'
Surprisingly (to me at least), they found this last point very easy.
As I said, this whole activity took very nearly an hour. (I had saved
a bit of time by preloading their computers with the data files.) It
was highly interactive, and if there had been much more than three
groups having another knowledgeable adult in the room probably would
have been a good idea. I'm not going to pretend that as a result of
this activity the kids have actually mastered even the very basics of
programming, but it was good practice in logical thinking, and it's
clear that with continued practice I will eventually get them doing
some real data analysis. One good sign: one of the girls continued
programming throughout the recess break following this activity!