search

David Baker: Computing structural biology

Last week I went to a talk by David Baker, in which he presented a major project of his lab: the Rosetta system for modelling and predicting protein folding.

I'll try to recount Baker's potted description of the process that Rosetta models, though the web page probably does a better job than I will.

Each protein is a chain of amino acids, each of which has different chemical properties. These chains fold in on themselves in ways that are in principle predictable, but difficult to predict, and the way they fold is what determines how the protein as a whole will behave. For a given protein (i.e. a given sequence of amino acids), it is possible to determine analytically what degrees of freedom exist, and then every possible folding could be predicted. Of the possible foldings, proteins typically fold to the minimum free energy conformation, just like crystal formation.

The problem with exhaustively searching the space of possible folds is that it is so huge. Establishing a protein's fold in vitro takes 1-8 years, and while I forgot to note down the numbers for an exhaustive search, I remember the number of permutations being pretty daunting. Because of this, quite a lot of the talk was taken up with issues of how Rosetta tackles the problem of search space size.

Part of the solution is to increase available computing power—there's a rosetta@home project and a rather large cluster—but as is true of many search projects, this is not enough on its own. The demand for more computing power always outstrips supply, if thought is not put in to algorithms.

The primary strategy used by Rosetta is to search in two passes. First there is a coarse-grained monte carlo search, in which the whole space of possible foldings is randomly sampled and the free energy of each sample is calculated. Then a finer-grained secondary search is conducted around whichever of the initial samples had the lowest free energy.

The problem with this is that how well it works will depend on the nature of the search space. For a theoretical search space that is perfectly smooth, this technique will reliably find global optima (but there are more efficient gradient-descent techniques which could be used instead). For a maximally noisy search space (one in which neighbouring points are not at all correlated with each other) nothing short of an exhaustive search will work properly. For anything in between, a sampling-and-resampling technique like this one will work better if there is a reasonably strong correlation between a point and others near it, and less well if there are very small pockets of optimality.

Unfortunately, at least some proteins do have these small pockets of optimality in their space of possible foldings. Baker showed a few graphs of the conformations tested by the initial search, with the best known conformation added, and the system did not always touch on the area around the best known in a given search. Rosetta has had some pretty impressive results, as measured by testing the system against empirically-determined protein folding patterns, but it seems to me that it always runs the risk of missing small pockets of low energy.

At the same time, I found myself thinking that these search spaces are probably quite well suited to the application of a genetic algorithm. It's a space in which any given point is typically fairly well correlated with its immediate neighbours (the equivalent of one-point mutations in a GA, and a common feature of biological search spaces), but there is not a straightforward smooth gradient to descend. To my mind, GAs have the advantage of combining some of the random sampling of a monte carlo search (particularly in the generation of the initial population), with a way of proceeding from that initial sample in a structured way (selection + descent-with-modifications).

The presentation left me both pretty impressed with Rosetta as it is, and curious about why the Baker lab isn't using a GA for this space. There probably is a good reason, and I wonder if they tried it at some point, but there isn't a reason that's obvious from the limited amount I know about the project.

Trackbacks

Trackback URL for this entry is: http://blog.case.edu/exg39/mt-tb.cgi/5379 Seminars I've been to this semester
Excerpt: This is just a list of talks because I need to keep the university posted to meet some course requirements. I'll describe one or two in more detail, but having had such a long blog hiatus I won't attempt to catch up on the whole lot...
Weblog: Eldan Goldenberg's lab notebook
Tracked: May 23, 2006 04:09 PM

Comments

Post a comment