
I was wrong about trial selection

After a few hiccups due to my own errors and some hardware trouble, I now have all the results I was waiting for from the temporal-correlation experiments with 3 interneurons. They're not very impressive, but I have learned some things from them.

There's only one really surprising result, which is that my preliminary report that the 'comprehensive trials' experiments were doing better than ones with randomly generated trials was in fact wrong. It is true that the fitness scores reported during the search are much closer to the results from a more thorough re-testing if the search used the comprehensive trial generation system, but that's the only advantage. More detail, and some hand-waving about why, behind the cut.

To compare the two systems, I looked at pairs of otherwise identical runs (same parameters, same random seed and consequently identical starting populations), in which one run used randomly generated trials and the other used a comprehensive batch of trials in which every allowed oscillation period was represented once. Then, to get away from problems of evaluation noise, I re-tested the agents produced by these searches on a standard batch of trials, in which every allowed oscillation period was represented 5 times, and normalised the results by dividing them by the best fitness attainable in each set of conditions. Whether I compare the best or the final agents from these runs, the 'comprehensive trials' ones perform slightly, though not statistically significantly, worse on average.

Conditions                                        Agent   Mean difference   Standard deviation
All conditions                                    best    -0.0479471        0.0892670
All conditions                                    last    -0.0259938        0.1317803
Comprehensive scheme generates fewer trials       best    -0.0118783        0.0839294
Comprehensive scheme generates fewer trials       last     0.0190831        0.1360572
Comprehensive scheme generates more trials        best    -0.0524626        0.0828492
Comprehensive scheme generates more trials        last    -0.035347         0.1289347
Both schemes generate the same number of trials   best    -0.1583840        0.0493777
Both schemes generate the same number of trials   last    -0.1510275        0.0597813

[aside: I had forgotten how much of a pain editing a table in HTML is. Next time I think I'll just export from Excel, ugly as that is. Also, I know the table is clumsily shrunk to fit; I'll try and fix that tomorrow because I've already spent longer writing this than I had intended to]
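For anyone wanting to reproduce this kind of comparison, here is a minimal Python sketch of the paired calculation. It assumes the differences are taken as comprehensive minus random (which the signs in the table suggest, but which isn't stated explicitly) and that the spread is a sample standard deviation. The function names, the dictionary layout and the numbers are all made up for illustration; they are not the actual analysis code or results.

```python
from statistics import mean, stdev

def normalised(fitness, best_attainable):
    """Scale a re-test fitness by the best score attainable in that condition."""
    return fitness / best_attainable

def paired_differences(runs):
    """Comprehensive minus random, one value per matched pair of runs."""
    return [
        normalised(r["comprehensive"], r["best_attainable"])
        - normalised(r["random"], r["best_attainable"])
        for r in runs
    ]

# Hypothetical re-test scores for three matched pairs (same parameters and seed).
runs = [
    {"comprehensive": 0.60, "random": 0.66, "best_attainable": 1.0},
    {"comprehensive": 0.45, "random": 0.40, "best_attainable": 0.5},
    {"comprehensive": 0.70, "random": 0.80, "best_attainable": 1.0},
]

diffs = paired_differences(runs)
mean_difference = mean(diffs)       # negative means comprehensive did worse
standard_deviation = stdev(diffs)   # sample standard deviation (an assumption)
print(mean_difference, standard_deviation)
```

A mean difference that is small compared with the standard deviation, as in every column of the table, is what you'd expect when the two schemes aren't reliably different.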

The conditions are broken down this way because the random trial generation always used 20 trialsets, whereas the number of sets generated by the comprehensive scheme varied from 10 to 50, depending on the particular conditions. All that the breakdown really does is tell me that the effect is similar across conditions; there's only one case in which the comprehensive-trials experiments did better than the random-trials ones. None of the differences is statistically significant, though, so really all that this data tells me is that there isn't a significant improvement from using the comprehensive trial generation scheme.
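To make the two schemes and the breakdown concrete, here is a rough Python sketch. It treats one trial set as one oscillation period, assumes the random scheme draws periods uniformly and independently, and uses hypothetical function names throughout; the real generators may well differ in all of these respects.

```python
import random

RANDOM_SCHEME_TRIALS = 20  # the random scheme always used 20 trial sets

def random_trials(allowed_periods, rng, n_trials=RANDOM_SCHEME_TRIALS):
    """Draw each trial's period independently (uniform draw is an assumption)."""
    return [rng.choice(allowed_periods) for _ in range(n_trials)]

def comprehensive_trials(allowed_periods):
    """Every allowed period appears exactly once."""
    return list(allowed_periods)

def breakdown_group(allowed_periods):
    """Which part of the breakdown a condition falls into."""
    n = len(comprehensive_trials(allowed_periods))
    if n < RANDOM_SCHEME_TRIALS:
        return "fewer"
    if n > RANDOM_SCHEME_TRIALS:
        return "more"
    return "same"

rng = random.Random(0)  # fixed seed, so paired runs could share it
periods_1_to_10 = list(range(1, 11))
print(len(random_trials(periods_1_to_10, rng)))  # 20
print(breakdown_group(periods_1_to_10))          # fewer
print(breakdown_group(list(range(1, 51))))       # more (hypothetical 50-period condition)
```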

I find this interesting because I was sure there would be. I don't know why there isn't, but the one hint I've seen comes from one of the few agents that apparently doesn't use some variant of the stupid threshold strategy. It was evolved using comprehensively generated trials of periods 1-10, and scored a pretty poor overall fitness (0.64111), but on inspection it turned out to be hyper-specialised for certain periods. It scored the maximum possible for a period of 1, and very close to it for 2 and 6, but abysmally (<0.45) on everything else.

Looking closer, its strategy turned out to be very simple, and once again based on thresholds. It's just that the thresholds in question matched up very well for three particular periods, and not at all for any others. In theory, the comprehensive trial generation scheme was supposed to avoid this kind of over-fitting, but it looks like in this instance there was still a local optimum, simply because this relatively poor strategy was still better than anything a small mutation away from it. The random-trials scheme might have managed to break away from this, if at some point the trials under-represented those periods, but the comprehensive-trials scheme wasn't able to do that.
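As a back-of-envelope check on how often the random scheme would under-represent a period, here is a small Python sketch. It assumes each of the 20 random trials picks one of 10 periods uniformly and independently, so the number of times a given period appears is binomial; that sampling model is my assumption for illustration, not a description of the actual generator, and the function name is hypothetical.

```python
from math import comb

def prob_period_count_at_most(k, n_trials=20, n_periods=10):
    """P(a given period appears at most k times in one random batch)."""
    p = 1 / n_periods  # chance that any single trial uses the given period
    return sum(
        comb(n_trials, i) * p**i * (1 - p) ** (n_trials - i)
        for i in range(k + 1)
    )

p_absent = prob_period_count_at_most(0)   # period missing from the batch entirely
p_at_most_once = prob_period_count_at_most(1)
print(round(p_absent, 4), round(p_at_most_once, 4))
```

Under these assumptions a given period is missing from roughly 12% of random batches, so over many generations there would be plenty of evaluations in which a period the agent relies on contributes nothing. The comprehensive scheme, by construction, never produces such a batch.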

Of course, the random-trials scheme has its own drawbacks, which I've detailed in the past couple of weeks, but what this data tells me is that they are not significantly more problematic—and may even be marginally less bad—than the drawbacks of the comprehensive-trials scheme.

The trouble with all this is that, as Inman Harvey is fond of saying, GAs are something of a black art. Just because I've found that the trial generation scheme doesn't matter for this set of experiments doesn't mean it won't matter for other experiments, or even after simple changes like a different number of interneurons. I really only have two options: pick one system arbitrarily and use it consistently, or try both for all my experiments so I can say whether or not it matters.

I'm inclined to pick one and run with it, because this is all orthogonal to the hypothesis I'm investigating, but my fear is that doing so would weaken the negative results, because there's always the doubt that perhaps I would have had more luck using the other scheme.

