Knowing what to leave out
[this is really a continuation of last Thursday's post]
I remember going through a phase during my MSc of wanting to build neuron models that included every known detail. It ought in principle to be possible to model individual molecular interactions, in a non-deterministic way, and thereby effectively run a neuron in silico. My big idea was to test the theoretical understanding of how neurons work, by seeing if the model matched experimental observations, but I eventually came to realise some serious issues with this idea.
In practical terms, I realised that there's no way to properly do the comparison I had in mind, because no two real neurons behave identically. Could I build a detailed model whose behaviour fell within the range observed in real neurons? Probably, but that wouldn't tell us anything interesting, because it has already been done with less detailed models. Besides, the data we have from individual real neurons are all recorded under highly artificial experimental conditions such as voltage clamps. Getting molecule-by-molecule data from real neurons runs into Heisenbergian problems, since measuring at that scale disturbs the thing being measured, and having such data from a model would never allow specific detailed predictions to be made, because each interaction is probabilistic. All the really useful things we can learn from a neuron model are at a somewhat higher level - abstracting away from the inherently unpredictable individual spikes (never mind individual ions crossing the cell membrane) to the aggregate behaviour of groups of neurons. At the opposite extreme, though, it is easy to leave out too much by using over-idealised neuron models; much of the behaviour of canonical artificial neural networks is an artefact of the model rather than anything that happens in biological neurons.
I can see a very similar tension in the modelling of genetic regulatory networks (GRNs). Many of the most useful and impressive papers I've been reading use the simplest model type I know of, and it's clear that the simplification was essential to making the model analytically tractable. Yet at the same time there's a nagging doubt that Boolean network models, which ignore any gradations of gene expression or variability of timescales, may be leaving out too much to be generalisable.
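To make concrete just how much a Boolean network leaves out, here is a minimal sketch in Python. The three genes and their rules are invented purely for illustration and are not taken from any of the papers discussed; the function names (step, find_attractor, attractors) are likewise my own. Every gene is either on or off, and all genes update together in a single discrete tick, so there is no notion of expression level or of one reaction being faster than another.

```python
from itertools import product

# Hypothetical three-gene network, invented purely for illustration.
# Each rule maps the current state (a, b, c) to that gene's next value.
RULES = (
    lambda a, b, c: not c,      # gene A is repressed by C
    lambda a, b, c: a,          # gene B is activated by A
    lambda a, b, c: a and b,    # gene C needs both A and B
)

def step(state):
    """Synchronously update every gene: one tick of discrete time."""
    return tuple(bool(rule(*state)) for rule in RULES)

def find_attractor(state):
    """Iterate until a state repeats; return the cycle that was reached."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return seen[seen.index(state):]

def attractors():
    """Collect the distinct attractors over all 2**n initial states."""
    found = set()
    for init in product((False, True), repeat=len(RULES)):
        cycle = find_attractor(init)
        # Rotate the cycle to start at its smallest state, so the same
        # cycle entered at different points is only counted once.
        k = cycle.index(min(cycle))
        found.add(tuple(cycle[k:] + cycle[:k]))
    return found

if __name__ == "__main__":
    for cycle in attractors():
        print(len(cycle), cycle)
```

With these toy rules every one of the eight initial states ends up in the same five-state cycle. Being able to enumerate the whole state space like this is what the simplification buys; the price is that anything depending on intermediate expression levels or differing timescales simply cannot appear in the model.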
I've been brought back to this topic by reading two somewhat contradictory papers back-to-back. "A Nonlinear Continuous Stochastic Model for Genetic Regulatory Networks. Caveats for Microarray Data Analysis" demonstrates a very detailed differential equation model of GRNs, which uses continuous values and operates in continuous time. It then uses the data from this model to argue that many more genes may be relevant to a system's behaviour than are typically flagged as significant in microarray experiments, and also that much faster dynamics than can be measured with microarrays are relevant. Quoting from the paper:
"...it is quite possible that rapidly fluctuating components of the regulatory network are the integral parts of the process as a whole, and their high-frequency variations manifest the preparatory work of supplying the mRNAs for slower processes with bigger amplitudes"
In the opposite corner, "Less Is More in Modeling Large Genetic Networks" presents a succinct argument for leaving much of the workings out of GRN models. As experimental data, it presents the findings of a few other papers, in each of which a highly idealised Boolean network model produced high-level dynamics that match the biological data. The argument here is that the model will necessarily use different mechanisms than the biological system, but if the high-level behaviour matches up then one can be used for predictions about the other.
I'm still working out exactly where I stand on this issue, which will be quite an important determinant of exactly what experimental work I do, but I'm leaning towards the higher-level models. This is partly from an awareness of how my thoughts about appropriate levels for neural network models have changed, but also based on a dose of realism: the best data we have on differential gene expression seem to come from microarray experiments, which are necessarily coarse-grained discrete-time snapshots. There's not yet any way to get continuous-time data on gene expression, never mind going into great detail on the molecular interactions involved.
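The point about coarse-grained snapshots can be illustrated with a toy signal. The sketch below is not a model of any real gene: the function expression, its slow drift, and its fast oscillation are all assumptions I've made up, and the sampling interval is deliberately chosen to match the period of the fast component, which is the worst case. It shows how a fast fluctuation can be entirely present in the underlying process and entirely absent from the snapshots.

```python
import math

def slow_only(t):
    """Hypothetical slow trend in expression level."""
    return 1.0 - math.exp(-t / 10.0)

def expression(t):
    """Toy expression level: the slow trend plus a fast oscillation
    with period 1 (arbitrary time units)."""
    return slow_only(t) + 0.2 * math.sin(2 * math.pi * t)

def sample(f, interval, n):
    """Take n snapshots of f, spaced `interval` apart, starting at t = 0."""
    return [f(k * interval) for k in range(n)]

# Coarse snapshots, one per time unit, as a stand-in for microarray timepoints.
coarse = sample(expression, 1.0, 20)

# Fine snapshots of the same process, 20 per time unit.
fine = sample(expression, 0.05, 400)

# Largest deviation from the slow trend visible at each sampling rate.
coarse_dev = max(abs(x - slow_only(k * 1.0)) for k, x in enumerate(coarse))
fine_dev = max(abs(x - slow_only(k * 0.05)) for k, x in enumerate(fine))

if __name__ == "__main__":
    print("coarse deviation from slow trend:", coarse_dev)
    print("fine deviation from slow trend:  ", fine_dev)
```

The coarse samples are indistinguishable from the slow trend alone, while the fine samples show the full oscillation. Real sampling would rarely line up this neatly, but data like this can't tell us whether fast dynamics are there or not, which is part of why I'd rather build models at the level the data can actually constrain.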
