Disambiguating performance
Today's update is going to be about two things: how important that bug I found on Thursday actually was, and some consideration of the appropriate way to compare fitness values.
Once I fixed Thursday's bug, I set off some scripts to re-run 42 experiments—everything I commented on last week—because it had made all of their results suspect. At the time of writing, I have the results of 25 of them. Unfortunately most of the remaining ones are those that I expected to be more time-consuming, so it will probably be a couple more days till I have all the data.
What I have so far is enough to tell me that this bug was critical. Previous positive results, inasmuch as I have any, are still valid, because the old searches were producing valid agents. However, all negative results (which are not only the majority, but also very important to my motivating question) are invalid, because a failure to evolve something with a faulty search proves nothing about the difficulty of evolving it with a search that does what I said it does. This is obviously a bit of a blow, but it's also in the nature of doing research (as one of my lab-mates said, it's called research because we often have to re-search), and on the bright side at least I found this now and not just after submitting a paper.
The good news is that slightly more of my searches have found good agents. Or at least, agents that score highly on the fitness test; I haven't yet analysed individual agents' behaviour to see if they're doing anything interesting. So that's at least something for me to look forward to doing this afternoon, and I'm leaving "what do I do next?" decisions for after I've looked at a sample of these agents.
The bad news is that my initial tabulation of results gave me the impression that there were many more successful searches than a detailed second look supports. The reasons for this are structural, and it's another thing that the bug had distorted in the previous weeks' experiments: in this case, it had dramatically reduced the effect.
The problem in short is one of normalisation. Expose the same agent to two trial sets, and I won't necessarily get the same fitness score. In fact, I'm quite unlikely to get the same fitness score, unless it's an agent that does nothing whatsoever. There are two reasons for this, and I'll deal with the simpler one first. The simpler reason is just that different sets of conditions have different limits on the highest fitness possible. This is because fitness is the agent's energy level averaged over all the trials it sees, so the frequency with which good food is available, and the length of the lean periods, both affect how high a notional perfect agent would score. For every time step in which good food is not present, there's nothing the agent can do to avoid losing some energy.
Thinking about this led me to realise that I had to work out what the highest fitness achievable would be in each set of conditions, in order to establish a proper baseline for fitness comparisons. I tried to come up with a formula for this, but found myself stumped and realised that it would be easier to do empirically. Programming an omniscient agent was trivial (I just had to give it access to information normally concealed from agents and set up extremely simple logic to control it), and the results were illuminating.
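To make the idea of an empirical ceiling concrete, here is a minimal toy sketch, not my actual simulation. It assumes a made-up world in which good food is present for `rich_steps` time steps and absent for `lean_steps`, energy drops by `decay` on every step without good food, and eating restores `gain` up to a cap of 1.0. All of those names and numbers are hypothetical; the only thing taken from the real setup is that fitness is energy averaged over time and that an omniscient agent always eats good food when it is there.

```python
def ceiling_fitness(rich_steps, lean_steps, decay, gain=0.05, cycles=20):
    """Mean energy of a toy omniscient agent (hypothetical world model)."""
    energy = 1.0
    history = []
    for _ in range(cycles):
        for step in range(rich_steps + lean_steps):
            if step < rich_steps:
                # Good food is present: the omniscient agent always eats it.
                energy = min(1.0, energy + gain)
            else:
                # No good food: nothing the agent can do to avoid this loss.
                energy = max(0.0, energy - decay)
            history.append(energy)
    return sum(history) / len(history)


baseline = ceiling_fitness(rich_steps=5, lean_steps=5, decay=0.01)
harsher_penalty = ceiling_fitness(rich_steps=5, lean_steps=5, decay=0.02)
longer_lean = ceiling_fitness(rich_steps=5, lean_steps=10, decay=0.01)
print(baseline, harsher_penalty, longer_lean)
```

Even in this toy version, raising the penalty for not eating or lengthening the lean periods lowers the best score any agent could reach.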
The pattern of highest-achievable-fitness was as I expected when comparing sets of conditions (it didn't take genius to figure out that increasing the penalty for not eating or lengthening the periods without good food would both decrease the maximum available energy), but the range was wider than I anticipated. In the set of conditions I'm using for the current runs, it ranges from 0.932 to 0.993. Obviously this makes quite a substantial difference to how good a score of, say, 0.925 really is. This still doesn't quite tell the whole story, because a non-omniscient agent must at some point sample bad food in order to learn how to discriminate it from good food, but I think it's reasonable to assume that the cost of this varies more between agents than it does between conditions. So I'm now normalising fitness scores by dividing them by the maximum attainable under the conditions in which an agent was tested. At least this should mean that between-conditions comparisons actually compare like with like.
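As a worked example of the normalisation, here is a short sketch using the numbers above. The function name `normalise_fitness` is just an illustrative label, not something from my actual code.

```python
def normalise_fitness(raw_fitness, max_attainable):
    """Divide a raw fitness score by the ceiling for its test conditions."""
    if max_attainable <= 0:
        raise ValueError("max_attainable must be positive")
    return raw_fitness / max_attainable


# The same raw score of 0.925 under the lowest and highest ceilings quoted above.
for ceiling in (0.932, 0.993):
    normalised = normalise_fitness(0.925, ceiling)
    print(f"ceiling {ceiling}: normalised fitness {normalised:.3f}")
```

The same raw 0.925 comes out at roughly 0.99 of the ceiling in the harshest conditions but only about 0.93 in the most generous ones.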
The more complicated issue is that of the random factors involved in trial generation. Every trial has some random noise in it, and while a good agent should have evolved to ignore that noise, this doesn't reliably happen, and every search begins with poor-quality agents. Then there's the issue of oscillation periods: in half of these experiments (which I'll refer to as "random trials" experiments) the oscillation period is randomised from one trial to the next, creating a lot more noise.
To compensate for this, I've set up a standardised fitness evaluation for comparing data. I won't use it in the actual searches, because it's too computationally expensive, but I'm analysing the products of the searches with it. In the standardised evaluation, an agent is presented with 5 sets of trials at each oscillation period possible under its particular parameters. The methodical selection of oscillation periods should deal with the noise introduced by the period, and using each one 5 times should deal with the other sources of noise.
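The shape of that standardised evaluation can be sketched roughly as follows. This is an outline under assumptions rather than my real code: `evaluate_trial_set` stands in for whatever runs an agent on one set of trials at a given period, and `toy_evaluate`, the list of periods and the seed are all invented for illustration.

```python
import random
from statistics import mean


def standardised_fitness(evaluate_trial_set, periods, repeats=5, seed=0):
    """Average fitness over `repeats` trial sets at every oscillation period."""
    rng = random.Random(seed)  # fixed seed so the comparison is repeatable
    scores = [
        evaluate_trial_set(period, rng)
        for period in periods          # every period, chosen methodically
        for _ in range(repeats)        # repeated to average out other noise
    ]
    return mean(scores)


def toy_evaluate(period, rng):
    # Hypothetical stand-in for running a real agent on one set of trials.
    return 0.9 - 0.01 * period + rng.uniform(-0.02, 0.02)


print(standardised_fitness(toy_evaluate, periods=[2, 4, 6, 8]))
```

Covering every period removes the luck of which periods happened to be drawn, and the repeats average over whatever noise is left inside the trials themselves.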
Comparing fitnesses as tested this way with those reported by the actual search winnowed out several agents that had initially looked good. The biggest difference I've seen so far has been an agent that scored 0.949 during the search, but only 0.764 in the standardised evaluation (both numbers normalised as described above). The smallest difference among the "random trials" experiments was 0.977 vs 0.963.
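For concreteness, the comparison amounts to nothing more than the following, shown here on the two pairs of numbers quoted above. The dictionary labels are only for illustration.

```python
def evaluation_gap(search_fitness, standardised):
    """How much the search over-reported relative to the standardised test."""
    return search_fitness - standardised


# (score during search, score in standardised evaluation), both normalised.
reported = {
    "largest gap": (0.949, 0.764),
    "smallest gap": (0.977, 0.963),
}

gaps = {label: evaluation_gap(*pair) for label, pair in reported.items()}
for label, gap in gaps.items():
    print(f"{label}: search over-reported by {gap:.3f}")
```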
Meanwhile the other batch of experiments—which I will refer to as "comprehensive trials" because in them each agent was presented with 1 trial at each period—show either very small differences between the search result and the standardised evaluation, or no difference at all. In general, it's the best performing agents that don't show any difference, presumably because evolving to ignore the noise improves overall performance.
This has some important implications for what I do next, because it reveals a couple of serious shortfalls in the "random trials" experiments. For one thing, it means that fitness scores reported by a search may be wildly inaccurate under those conditions, which is not the case in the "comprehensive trials" experiments. This in turn means that what the search flagged as the "best agent seen in the search" is not necessarily the real best agent. Because of this, I've started also analysing the best agent from the final generation of each search, which occasionally performs a little better, but often much worse. And that leads to the last and most serious problem: all this evaluation noise can lead a search backwards, making these experiments much less reliable at producing anything of interest at all.
Because of this, I'm leaning towards only using the "comprehensive trials" style of experiments in future.
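The way evaluation noise can mislead selection is easy to demonstrate with a toy simulation. This is only an illustration under assumptions: the population, the Gaussian noise model and the noise levels are all made up, and nothing here reproduces my actual search. It simply asks how often the agent with the highest noisy score is the one with the highest true fitness.

```python
import random


def pick_rate(true_fitness, noise_sd, repeats=2000, seed=1):
    """Fraction of noisy evaluations in which the truly best agent scores highest."""
    rng = random.Random(seed)
    best = max(range(len(true_fitness)), key=true_fitness.__getitem__)
    hits = 0
    for _ in range(repeats):
        # Each agent's observed score is its true fitness plus evaluation noise.
        observed = [f + rng.gauss(0.0, noise_sd) for f in true_fitness]
        chosen = max(range(len(observed)), key=observed.__getitem__)
        hits += chosen == best
    return hits / repeats


# Hypothetical population: ten agents whose true fitness differs by 0.01 each.
population = [0.90 + 0.01 * i for i in range(10)]
for noise_sd in (0.0, 0.005, 0.05):
    print(noise_sd, pick_rate(population, noise_sd))
```

When the noise is comparable to or larger than the real differences between agents, selection regularly favours the wrong agent, which is how a search can drift backwards.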
