Martin Tompa - What can we do with lots of genome sequences?
I went to two seminars at the UW today—very considerate of them to start clumping the talks I'm most interested in together—and I'll write about one today and one later.
Martin Tompa, of the UW Genome Sciences and Computer Science & Engineering departments, gave a talk about computational genomics, focussing mainly on how comparing data across many species is useful.
Part of the talk was devoted to conveying the sheer volume of data involved, and demoing the UC Santa Cruz (UCSC) Genome Browser. The data is large in two dimensions. In one dimension, 17 entire vertebrate genomes and >400 bacterial genomes have been sequenced [Wikipedia has a nice List of sequenced eukaryotic genomes]. In the other, a typical genome is itself a vast sequence. Tompa didn't give whole-genome numbers, but focussed on human chromosome 1 (one of 23 pairs), which has 247,000,000 base pairs, a point he illustrated by zooming out and out and out on its representation in the genome browser. Making sense of genome data involves sifting through vast quantities of it, of which apparently around 95% is 'junk' DNA with no clear function.
The vast volume of data presents at least two significant computational challenges: verifying the correctness of it all, and identifying which sections are important. Both of these goals are served by finding long sequences that correspond between species. Functional sequences tend to be conserved (most mutations in a functional sequence are detrimental and get selected against, whereas mutations in a sequence with no function are largely neutral and accumulate freely), and errors are unlikely to cause chance correspondences between sequences of more than a few bases. Tompa introduced the Karlin-Altschul measure of the statistical significance of a correspondence between two sequences, and explained how poorly this scales for n-way comparisons. He described a heuristic he's been using to sidestep the scaling problem, by assuming that within any group of n species' genomes, the largest differences will be found between the least closely related species. This might get risky with some choices of comparison species, but in many cases it's fairly obvious: for instance, it seems safe to assume that more sequences will be conserved between humans and rhesus macaques than between either of those species and zebrafish.
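For anyone who wants to see the shape of the Karlin-Altschul measure, here is a minimal sketch of the standard pairwise form, E = K·m·n·exp(−λS), where S is the alignment score and m and n are the two sequence lengths. This is my own illustration rather than anything shown in the talk: the function names are hypothetical, and the values of λ and K below are placeholders I made up (the real ones depend on the scoring scheme and base composition).

```python
import math


def karlin_altschul_evalue(score, m, n, lam, k):
    """Expected number of chance local alignments scoring at least `score`
    between two random sequences of lengths m and n: E = K*m*n*exp(-lam*S)."""
    return k * m * n * math.exp(-lam * score)


def pvalue_from_evalue(evalue):
    """Probability of at least one chance alignment that good: 1 - exp(-E)."""
    return -math.expm1(-evalue)


if __name__ == "__main__":
    # ASSUMPTION: placeholder statistical parameters, not fitted to any
    # real scoring scheme.
    LAM, K = 1.0, 0.5
    # 247,000,000 is the chromosome 1 length quoted in the post; comparing
    # two sequences of that size is purely for illustration.
    length = 247_000_000
    for score in (20, 40, 60):
        e = karlin_altschul_evalue(score, length, length, LAM, K)
        print(score, e, pvalue_from_evalue(e))
```

The point to take away is that E grows with the product of the sequence lengths and shrinks exponentially with the score, so at chromosome scale a match needs a fairly high score before it stops looking like chance.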
The original work presented in the talk builds on this, to look at what are considered 'consensus sequences' (i.e. long sections of DNA that are thought to match between many species) and flag suspicious sequences that seem to be false positives found by matching algorithms. To illustrate the point, Tompa showed a set of sequences which were very close to identical (eyeballing it, I would guess <5% of bases were different), except for the aforementioned zebrafish, which had far more discrepancies. It was easy to see how the zebrafish sequence would get picked out by an algorithm, because it still had more than 50% correspondence with the others (which does at least look better than chance), but it's inevitable that there will be some false positives. All that statistical measures tell us is a notional probability of seeing a match at least this good if the null hypothesis [in this case, the assertion that two sequences are not related] were true, and in using them we always have to set a somewhat arbitrary threshold for rejecting it. Considering the sheer volume of data involved with genome sequence comparisons, it's impossible to set the threshold such that there will be no false positives without causing a large number of false negatives, so secondary analysis techniques that allow suspect false positives to be identified are very important.
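To make the zebrafish example concrete, here is a toy sketch of the general idea of flagging a suspect row in an alignment. It is emphatically not Tompa's method, which the talk didn't spell out in enough detail for me to reproduce. It just builds a majority consensus and flags any species whose identity to it falls below a cutoff. The sequences are invented, and the function names and the 0.8 cutoff are my own assumptions.

```python
from collections import Counter


def consensus(rows):
    """Most common character in each column of equal-length aligned rows."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def identity(a, b):
    """Fraction of aligned positions at which two equal-length rows agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def flag_suspects(alignment, min_identity=0.8):
    """Return {name: identity} for rows that match the consensus poorly.

    ASSUMPTION: a fixed identity cutoff is a crude stand-in for a proper
    statistical test; it only illustrates the idea of a secondary check.
    """
    cons = consensus(list(alignment.values()))
    scores = {name: identity(seq, cons) for name, seq in alignment.items()}
    return {name: s for name, s in scores.items() if s < min_identity}


# Invented toy data: three near-identical rows and one divergent row.
TOY_ALIGNMENT = {
    "human":     "ACGTACGTACGTACGTACGT",
    "macaque":   "ACGTACGTACGTACGAACGT",
    "mouse":     "ACGTACGTACCTACGTACGT",
    "zebrafish": "ACTTAGGTTCGTACCTAGGA",
}

if __name__ == "__main__":
    print(flag_suspects(TOY_ALIGNMENT))
```

Moving `min_identity` up or down shows the trade-off described above in miniature: a stricter cutoff starts flagging the genuinely related rows too, while a looser one lets the divergent row through.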
The talk didn't go into much detail about what actually happens to these suspect sequences, but simply identifying them is a useful step.
