The World in a Grain of Sand: DNA

Showing posts with label DNA. Show all posts

Saturday, February 9, 2013

Toy DNA Analysis Part III: Eve of Reconstruction

After the first and second activities of the morning, there was not
much more than 30 minutes left. I wanted to do an activity with
mitochondrial DNA, so I went over the background first. (They had seen
much of the following earlier in the year, but the review turned out
to be necessary.)

Each cell has a nucleus which contains DNA, surrounded by the bulk of
the cell ("cytoplasm") which has various structures ("organelles") for
performing various functions. One type of organelle is mitochondria,
when help you turn oxygen into energy. Each cell has many
mitochondria, and here is the amazing thing: they have their own DNA!
They are not built according to instructions recorded in the DNA of
the nucleus; they simply reproduce by dividing asexually, as if they
were self-contained cells within the cell. When the cell itself
divides, each daughter gets half the cytoplasm and therefore half the
mitochondria. It is thought that mitochondria were once independent
bacteria, which learned to cooperate so well with other cells that
they took up residence. That's pretty amazing! Another amazing fact
is that all creatures on Earth share the same DNA code. We are all
related, even humans and yeast. (Example: if you put the DNA letters
for human insulin in yeast, the yeast understand those instructions
perfectly and makes human insulin.)

When a human egg cell is fertilized, the sperm carries in half the
nuclear DNA to complement the mother's half of the nuclear DNA. But
the egg has an enormous amount of ctyoplasm and the sperm contributes
none. So your mitochondrial DNA is an exact replica of your mother's,
and of her mother's, and of HER mother's....there is no shuffling with
each generation as we have with nuclear DNA. Thus, mitochondrial DNA
makes it much easier to test whether you are a direct descendant
(through an all-female line) of, say, Cleopatra. (A similar thing can
be done with Y chromosomes and all-male lines of inheritance.)
Furthermore, by mapping the geographical distributions of
mitochondrial DNA, we can trace out migrations of women over time.
(Ditto for Y chromosomes and men.)

It's good to ask the kids a few questions to see how well they
understand. In this case, a girl said she was sorry for boys because
they had no mitochondria. So we discussed that issue again: everyone
has mitochondria (that's how they turn oxygen into energy) but boys
won't pass theirs on to their kids. Moms really do contribute more
than half, as immortalized by this song.

But there can be mutations. It turns out they're fairly rare in
mitochondria, probably because most mutations would be fatal very
early on. But they do happen. So if we gather mitochondrial DNA from
a large sample of people, we will find sequences that differ by a
little bit. We should be able to trace the mutations backward and
reconstruct ancestor DNA. For example, if we saw sequences GATTACA,
GATTACT, and AATTACA, we might guess that the ur-grandmother, many
generations back, of all three people had the sequence GATTACA. One
mutation somewhere along the line would explain the people with
GATTACT, a different mutation somewhere else along the line would
explain the people with AATTACA, and the people who had never
experienced a mutation along their line would still have GATTACA.
They hypothesis of, say GATTACT being the ancestor is much less likely
because it requires that there was one mutation to make it GATTACA and
then, in the line with this mutation already present, there was
a second mutation making AATTACA.

So here's the problem I posed to the kids: reconstruct the ancestor of
these sequences:

CATTACGACT
GAATACGACA
GATTACAACT
GATTACGACA
GATTACGACT
GATTATAACT
GATTCCAACT
GTTTCCAACT

Go ahead: print these out and cut them into strips, try to arrange
them as leaves of a tree, and guess what the branches and trunk have
to look like.

Some groups were lost, and so I tried to work it out with them on the
board, starting by making a guess about the immediate ancestor of one
very similar pair. It turned out this was probably a bad guess,
because once we had worked out two hypothetical ancestors of two
different pairs, those two hypothetical ancestors seemed to have very
little in common, whereas we would have expected them to look similar
enough that we could guess a hypothetical original ancestor which
spawned them both. Just as I was realizing that we were almost out of
time, another group handed me a sheet of paper in which they had
worked it all out. The lesson I drew for everyone: don't be afraid to
take a guess, work out the consquences of that guess, and if it
doesn't work, scrap that guess and start over. That's what science is
all about! (See the first minute or so of this video.) Just because
lunchtime was coming up fast does not mean that we had done anything
wrong. The wrong thing would be to continue pushing a guess which
doesn't explain all the evidence.

If I do this activity again, I would print out very large copies of
the sequence so I could rearrange them easily on the board (writing
with chalk does not lend itself to rearrangement). Or I would print
it at regular size and use a document camera. I would probably also
walk them through a simplified example first as I did in writing this
post. Another idea I just had is to try representing the information
differently. Perhaps a color code instead of letters would make
things just jump out.

I saved a few minutes for the coolest part of this: because we know
that mutations happen about once every 10,000 years, we can use this
as a clock. In my simplified example, you have to reverse-engineer
three mutations to get back to a common ancestor which explains all
the data. That makes 30,000 years. In real life with real data, you
have to go back 200,000 years, but you can do it. That means that
there was one female about 200,000 years ago from whom every human
alive today has inherited their mitochondrial DNA; she is called
mitochondrial Eve. This doesn't mean that other females living at
that time didn't contribute to people alive today; they surely did,
through their nuclear DNA. But mitochondrial Eve is the only one who
has an unbroken female line to anyone alive today. And a similar
argument identifies "Y-chromosomal Adam" who lived around 100,000
years ago. We are all intimately related!

Toy DNA Analysis Part II

After the first activity and recess break, I posed the challenge of
finding out which animal is most closely related to the hippopotamus.
I gave them short (400-base) DNA sequences of various mammals and
asked them to figure out which was most similar to the hippo's.
Again, I just made up these short sequences to keep it simple. In
real life, there are a lot more complications just as varying number
of chromosomes, billions of base pairs, etc.

The bottom line is that they had to take an animal, look at each
position in the sequence, ask whether that that animal's DNA at that
position was different from the hippo's, and total up the differences.
After repeating this for several animals, they would see which one had
the least differences. After the near-chaos of the morning's first
activity, I thought it would be a good idea to go over the big picture
and assemble some pseudocode on the board based on their ideas:

get hippo dna
for each animal other than hippo:
    get this animal's dna
    compare this animal's to hippodna
    print number of differences

I emphasized basic ideas like what do we want to put inside the loop and what do we want to put outside the loop. A program like this does it:

hippodna = open('hippo.txt').read()
nletters = len(hippodna)

for filename in ('cat.txt','dog.txt','rhino.txt','bluewhale.txt','rockhyrax.txt','rhesusmonkey.txt'):
    dna = open(filename).read()
    differences = 0
    position = 0
    for letters in hippodna:
        if hippodna[position] != dna[position]:
            differences += 1
 position += 1
    print filename,'has',differences,'differences with hippo dna'

Yes, Python experts, I know there are more efficient ways to do it but this seems most straightforward for a kid. I won't go into the detail I went into for the morning's first activity, but it was a similarly intense back-and-forth with students running into an obstacle every 30 seconds, which I tried to turn into a learning experience. They again needed emphasis on proper indenting, punctuation, remembering that if they named a variable 'difference' early on then they couldn't refer to it as 'differences' later on, etc. They again wanted to do one animal at a time rather than write a loop. But it was a really good learning experience.

This took close to an hour. At the end, we discussed how they might modify the program to see what's most closely related to some other mammal, and then a third mammal, etc, and build up a picture of how everything is related to everything else: a family tree. That was the emphasis of the morning's third activity.

Toy DNA Analysis Part I

Last time I visited the 5-7 graders at Peregrine School, I introduced
the idea of computer programming languages (specifically, Python) as a
way of automatic repetitive tasks. That sounds boring, but I made it
interesting by building the activity around an inherently interesting
challenge---cracking a code---which happens to have a repetitive
aspect---translating each letter of a long message after you've
figured out the substitution pattern. At the end, we had a little
time to discuss how DNA is a code, which is the connection I really
wanted to make to what they had already learned in biology. This
week, we were poised to take it much further with three data analysis
challenges, two of them requiring Python programming. As a result,
I'm going to split the morning's activities into three separate blog
entries.

The first challenge was to rescue a baby. Some babies are born with a
genetic defect which does not allow their body to process the amino
acid phenylalanine. If they don't follow a strict low-phenylalanine
diet, it builds up in their body and causes brain damage (mental
retardation) within the first year of life. So there is a very strong
motivation to find out if the baby has this condition very soon after
it is born! In practice, there is a blood test which does not involve
analyzing the DNA itself, but I framed the problem as one of DNA
analysis. I made up long DNA sequences for about 15 babies, and asked
them to write a program which would look at a certain location within
each sequence, and print an alert if the sequence was not normal. To
keep things simple, I just made up a simple condition for "normal":
the three letters starting at position 399 should be "GAT".

They had forgotten a lot in the two weeks since my first visit. I
would normally recommend much more frequent visits for teaching
programming, but I was sick last week. Still, they were very into it.
They wracked their brains trying to remember things, but I had to give
a refresher tutorial on for loops and if statements. It took almost
an hour for everyone to finish completely, and there was again a
friendly competition to see who could finish (identify the sick baby)
first.

The evolution of their programs was fascinating. One group thought
they were done when they had something like:

dna = open('jane.txt').read()
print dna[398:401]

Even this simple program has a lot of ideas in it. For example, doing
the subscripting correctly requires thinking about: (1) position 398
is actually the 399th position because Python starts counting at
position 0 rather than position 1; (2) 398:401 means up to but not
including 401; (3) you can't write dna[398-401] because the arithmetic
operation 398-401 yields -3, and Python will interpret this as 3
positions before the end of the sequence. There was a lot of
interaction between me and the kids on each of these points.

In any case, when the first group had gotten this far, they thought
they were done. They would just change jane.txt to john.txt, rerun it
to see if John was sick, repeat for Joan, etc, and just read off the
screen whether it said 'GAT' for that baby. This would work ok for
the 15 examples I brought, but PKU (the aforementioned disease) is
present in only about 1 in 15,000 babies, so they really needed to
automate a loop over all the babies. They initially thought I was
moving the goalposts, but the other groups agreed that this first
group was not really done. So they started working on a loop:

names = ('jane.txt','john.txt','joe.txt')
for name in names:
    dna = open(name).read()
    print dna[398:401]

I had to repeat many times the importance of proper indentation so that it's clear
(to the computer and to a human reading the code) which actions get
repeated for each name, and which actions only get done once. A
common mistake is

names = ('jane.txt','john.txt','joe.txt')
for name in names:
    dna = open(name).read()
print dna[398:401]

which prints only the last one, after cycling through all the names. Notice that I typed in only three names to make sure my loop would work, before bothering to type in all 15 names. In fact, I taught them how to avoid typing in any names by grabbing all names that fit a certain pattern:

import glob
names = glob.glob('j*.txt')
for name in names:
    dna = open(name).read()
    print dna[398:401]

I briefly discussed how it can be incredibly time-saving to import some functionality that someone else has already written. But back to the main point, the fastest group now thought that they were really done: this program would print "GAT" for each healthy person, and they just had to scan the output for something other than GAT. But, I asked them, how would they know which person had that abnormal sequence? They tried adding to the end of their program:

import glob
names = glob.glob('j*.txt')
for name in names:
    dna = open(name).read()
    print dna[398:401]
    print name

But this gives output like:

GAT
jane.txt
GAT
john.txt
GCT
joe.txt
GAT
jim.txt
GAT
...

and they incorrectly interpreted this as John having GCT, because they hadn't paid attention to the fact that they asked the computer to print the DNA first, and then the name. An important rule of programming is to make your output clear. Things which belong together should be printed together. A much better print statement is:

print dna[398:401], name

which keeps it to one line per baby:

GAT jane.txt
GAT john.txt
GCT joe.txt
GAT jim.txt
...

Now they really thought they were done, but I pointed out that they were looking for 1 in 15,000 babies, and they couldn't count on a human to scan a list of 15,000 lines which say 'GAT' and not miss an abnormal one. They really need to print out just the sick ones with a very clear warning:

import glob
names = glob.glob('j*.txt')
for name in names:
    dna = open(name).read()
    if dna[398:401] != 'GAT':
       print name, "is sick"

Now they were done: the output is simply

joe.txt is sick

I took a few minutes to talk about the subtleties of combining if's, for example the difference between

if dna[398] != 'G' and dna[399] != 'A' and dna[400] != 'T'

if dna[398] != 'G' or dna[399] != 'A' or dna[400] != 'T'

Surprisingly (to me at least), they found this last point very easy. As I said, this whole activity took very nearly an hour. (I had saved a bit of time by preloading their computers with the data files.) It was highly interactive, and if there had been much more than three groups having another knowledgeable adult in the room probably would have been a good idea. I'm not going to pretend that as a result of this activity the kids have actually mastered even the very basics of programming, but it was good practice in logical thinking, and it's clear that with continued practice I will eventually get them doing some real data analysis. One good sign: one of the girls continued programming throughout the recess break following this activity!

Friday, January 25, 2013

DNA is Code

Today I started Friday morning sessions with the upper (5-7) graders
at Peregrine School. I'll have about ten of these weeks before I
switch to astronomy with the 3-4 graders. My main job with the upper
graders is to cover the geology standards which didn't get covered in
the lead-up to the Yosemite trip, but I'm going to take a few weeks
first to build on some of the ideas they learned before Christmas
break regarding genetics and DNA. In the process, they will learn how
to analyze data and program computers. It's ambitious, but with these
kids I have to be ambitious.

DNA is a code. The letters themselves have no intrinsic meaning, but

they represent different specific amino acids which, when assembled in

the sequence specified by the code, fold up into a 3-d shape which is

useful for performing some function in the body. So to introduce the

idea of "breaking the code," I started with a non-DNA message in code:

Uif Ivohfs Hbnft jt b 2008 zpvoh bevmu opwfm cz Bnfsjdbo xsjufs Tvaboof Dpmmjot. Ju jt xsjuufo jo uif wpjdf pg 16-zfbs-pme Lbuojtt Fwfseffo, xip mjwft jo uif qptu-bqpdbmzqujd obujpo pg Qbofn, xifsf uif dpvousjft pg Opsui Bnfsjdb podf fyjtufe. Uif Dbqjupm, b ijhimz bewbodfe nfuspqpmjt, fyfsdjtft qpmjujdbm dpouspm pwfs uif sftu pg uif obujpo. Uif Ivohfs Hbnft bsf bo boovbm fwfou jo xijdi pof cpz boe pof hjsm bhfe 12–18 gspn fbdi pg uif uxfmwf ejtusjdut tvsspvoejoh uif Dbqjupm bsf tfmfdufe cz mpuufsz up dpnqfuf jo b ufmfwjtfe cbuumf up uif efbui.

Uif cppl sfdfjwfe nptumz qptjujwf gffecbdl gspn nbkps sfwjfxfst boe bvuipst, jodmvejoh bvuipst Tufqifo Ljoh boe Tufqifojf Nfzfs. Ju xbt qsbjtfe gps jut tupszmjof boe dibsbdufs efwfmpqnfou, uipvhi tpnf sfwjfxfst ibwf opufe tjnjmbsjujft cfuxffo Dpmmjot' cppl boe Lpvtivo Ublbnj't Cbuumf Spzbmf (1999), cpui pg xijdi efmjwfs ubmft frvbmmz hsjqqjoh, zfu mftt ebsl uibo uif hfosf't psjhjobm nbtufs xpsl: Ljoh't Uif Mp oh Xbml (1979). Jo xsjujoh Uif Ivohfs Hbnft, Dpmmjot esfx vqpo Hsffl nzuipmphz, Spnbo hmbejbupsjbm hbnft boe dpoufnqpsbsz sfbmjuz ufmfwjtjpo gps uifnbujd dpoufou. Uif opwfm xpo nboz bxbset, jodmvejoh uif Dbmjgpsojb Zpvoh Sfbefs Nfebm, boe xbt obnfe pof pg Qvcmjtifst Xfflmz't "Cftu Cpplt pg uif Zfbs" jo 2008.

Uif Ivohfs Hbnft xbt gjstu qvcmjtife jo ibsedpwfs po Tfqufncfs 14, 2008 cz Tdipmbtujd, gfbuvsjoh b dpwfs eftjhofe cz Ujn P'Csjfo. Ju ibt tjodf cffo sfmfbtfe jo qbqfscbdl boe bmtp bt bo bvejpcppl boe fcppl. Bgufs bo jojujbm qsjou pg 200,000, uif cppl ibe tpme 800,000 dpqjft cz Gfcsvbsz 2010. Tjodf jut sfmfbtf, Uif Ivohfs Hbnft ibt cffo usbotmbufe joup 26 mbohvbhft, boe qvcmjtijoh sjhiut ibwf cffo tpme jo 38 uf ssjupsjft. Uif opwfm jt uif gjstu jo Uif Ivohfs Hbnft usjmphz, gpmmpxfe cz Dbudijoh Gjsf (2009) boe Npdljohkbz (2010). B gjmn bebqubujpo, ejsfdufe cz Hbsz Sptt boe dp-xsjuufo boe dp-qspevdfe cz Dpmmjot ifstfmg, xbt sfmfbtfe jo 2012.

This was a really good idea for motivating the students; they really
wanted to break the code once they realized it was doable! I elicited
from them some observations which might be useful:

Each paragraph begins with "Uif". They guessed pretty quickly that "Uif" might stand for "The". This was a springboard for making explicit what they already implicitly knew about word and letter frequencies in English.
Numbers seem to stand for themselves: "Tfqufncfs 14, 2008" looks like a date.
Ditto for punctuation: the periods and commas look natural, not like they stand in for something else.
"b" is a word by itself. Presumably it stands for either "a" or "I".

One student guessed the answer even as I was writing down the fourth

observation: simply replace each letter here with the letter which

precedes it in the alphabet. This "b" becomes "a" and "Uif" becomes

"The". I asked them to make some predictions to double-check this

hypothesis. For example, this hypothesis predicts that "a" and "r"

will be uncommon because they stand for the uncommon letters "z" and

"q". That checks out, so let's proceed.

Here's where I introduce computers. It would be extremely tedious to

manually translate all these letters to decode the message. But if we

write a computer program to do it, in a short time we will have a tool

that automatically translates this and all future such messages.

Computers are great for automating repetitive tasks!

Before we actually open our laptops, let's write some pseudocode.

"Code" has a new meaning here: "a set of instructions to the

computer", and pseudocode is when we write those instructions a bit

more like everyday language to help us organize and communicate our

thoughts to other humans. To decode our message, we want to tell the

computer to do this:

for each character in the message:
    if character is a letter:
       print the previous letter
    else:
       print the same character # (because it's punctuation or a number)

(The # symbol and everything after it represent a comment about the instruction, rather than an actual instruction.) This pseudocode illustrates two of the most important ideas in programming: looping, in which we repeat some set of instructions over and over, and if-then-else.

A somewhat old-fashoned way of constructing a loop which does something 100 times is:

tasksCompleted = 0
while tasksCompleted < 100:
      doSomething
      tasksCompleted = tasksCompleted + 1

Although this is a bit old-fashioned, I thought it was worth making the loop logic very explicit for beginners. Next, we practiced some simple loops in an actual language, Python. I gave them a file containing the coded message, named 'coded1.txt'. I showed them how to read it in:

letters = open('coded1.txt').read()

Now letters is the name of a location in memory that contains the message. A very simple loop is:

for character in letters:
     print character,

With this more modern type of loop, Python automatically keeps track of how many times we need to repeat. This is easier to write, easier for others to read, and less error-prone.

So far we have just repeated the very simple action of printing that character. By the time each group had finished this, the 45 minutes was up and we took a break. After the break, we set about putting more complicated instructions inside the loop. We started writing something like:

for character in letters:
    if character == 'b':
       print 'a',   
    elif character == 'c':
       print 'b',

(Note that == is a test of equality, whereas = would have been an assignment, forcing them to be equal; and elif is shorthand for else if.) We can see that this will become tedious if we have to type in an if-else block for each substitution rule, so let's find a better way. Let's set up a dictionary named sub (short for substitution):

sub = {'b':'a',
       'c':'b',
       'd':'c',
       ...}

so now we can just say

print sub[character]

and if the character is 'b' the computer will look up the "definition" (the value, really) stored at the dictionary location of 'b', which is 'a', and print that. (Note that the ... above is not real code, but you can guess what it stands for.) We spent a while talking about the software concept of dictionaries and playing with them for a while before returning to the decoding task, which now looks like:

letters = open('coded1').read()
sub = {'a':'z',
       'b':'a',
       'c':'b',
       'd':'c',
       ...}

for character in letters:
    if character in sub:
        print sub[character]
    else:
        print character

As a test, we did this after typing in just some of the substitution rules. We could see that it was going to work, but it didn't handle capitals: "Uif" became "Uhe". Instead of typing in another 26 upper-case substitution rules, we did this before entering our main loop:

for key in sub.keys():
    upperCaseKey = str.upper(key)
    upperCaseSub = str.upper(sub[key])
    sub[upperCaseKey] = upperCaseSub

str.upper is a function which makes an upper-case version of the letter which is given to it inside the parentheses. So this block of code creates a new upper case dictionary entry and value for each pre-existing one, so we don't have to type them in manually. (Yes, I know there are more efficient ways to do this, but I wanted to make the logic very explicit for the learners.) With that done, the race was on as each group rushed to be the first to type in all the substitution rules and read the message. That was really fun. If you want to see the message, you'll just have to assemble your own Python script out of the parts I've illustrated here!

Once all the groups had decoded the message, we had used about 45 minutes since the break. A word of advice to teachers: it takes a while. Although I explained several times how one typo or indentation mistake will torpedo the script, kids are not used to this, and they need a lot of help finding their mistakes even though I had my own example code projected up on the screen. But it was really worth spending the time. They started to get the idea of debugging; although I often had to find the bugs for them, they started to understand the process. I had the luxury of doing it with six kids in three groups; if you have a larger group I would try to arrange for assistants and give them some advance training.

That gave us about 12 minutes to discuss how DNA is a code. We reviewed a bit about how letters code for amino acids, which build proteins. Then we looked at the DNA code table and discussed by some amino acids have many codons (groups of three DNA letters) which code for them, while others have just one. The basic idea is that some "typos" will not actually result in a mistake in the amino acid, which is a good thing; it adds robustness to the code. And it's even more robust if the most common amino acids are the ones with the most robustness against typos, in other words have the most different codons. That led to an interesting discussion of where DNA "typos" might come from, including cell reproduction and radiation.

Finally, I asked why we need three letters to represent an amino acid if we have all these leftover three-letter combinations. If we have four options (G,C,T, or A) at each DNA location and we have three consecutive locations, how many total options are there? The best way to think about this, if you are not used to mathematical manipulations, is to draw a tree with four big branches representing the four options at the first decision point. Each of those big branches then has four medium-size branches representing the four options at the second decision point. And each of those 16 medium-size branches then then has four little branches representing the four options at the third decision point, which makes 64 little branches, representing 64 possible combinations. There are only 20 amino acids, so with 64 code words we can have a lot of redundancy, leading to a lot of robustness against damage. The last question we discussed was: since we have so many extra code words by using three-letter combinations, could we (robustness aside) get away with using just two DNA letters to code for an amino acid? I'll let you figure that one out for yourself.