search

Death and test data

For the past couple of days, I've been running and analysing the output from the fixed experiments in which agents are correctly saved every 100 generations, and agents die as intended, as I described on Wednesday. I'm nowhere near finished with this batch of data analysis, but while I wait for some tests to finish, and before I finish work for the week, I wanted to mention a few notable things I seem to be seeing:

CAVEAT: this is all preliminary; it describes incomplete data on which I have not yet done any significance tests.

  1. Agents score higher on the harder task
  2. Agents do better if trial sets are longer, but the search is slower
  3. Random trials marginally understimate performance relative to systematic trials
  4. The stupid threshold strategy is very common
  5. The population continues to improve after the supposed best agent has been found

1: For a given agent, the version of the simulator that incorporates death will produce a lower fitness score than the version that does not. This is consistent [among all of my evolved agents - it is certainly possible to design an agent for which it's not true], and easy to explain. In a trial set in which the agent's energy level reaches 0, the correct version of the simulator gives the agent a score of 0 for every trial after that point, whereas in the older, broken version of the simulator the agent would continue to exist after that, but its energy level would never drop below 0 so it could always score some additional points by any time it happened to bite good food. This is why I consider the correct simulator to be implementing a harder task than the buggy version was.

What I'm finding now that I have a body of data from the corrected version of the simulator is that agents evolved on this harder task often score higher on it than agents evolved for the easier task score on the easier task. In other words, the highest fitnesses observed are higher for searches in the more difficult environment. The trend is more pronounced for the experiments with longer trial sets.

This is not altogether unexpected, and I do have a theory to explain it. The difficulty of a task is not the only variable determining the success of a GA search; some search spaces are just better suited to this particular technique than others. In this instance, I think the relevant property is the structure of the space, and specifically how many opportunities there are for the search to get stuck in a local optimum. Agent death penalises some suboptimal strategies more than others, and it has no impact on the fitness of a theoretical perfect agent (one that never makes a false move). I suspect that by making some of the stupider agents score lower, the introduction of agent death has removed some local optima and made the path to the globally optimal agent smoother, in turn allowing the searches to get somewhat closer to that optimum (which is equally high for both versions).

Going from speculation to a definitive answer to questions about the detailed nature of the searchspace would take a huge amount of data gathering, and probably be of no general interest whatsoever, so I think I'll be leaving this as speculation, but it is important to have some sort of feel for the search space.


2: The effect of trialset length is interesting, and was not what I expected to see. I've been manipulating two independent variables with these experiments: the range of oscillation periods (I run experiments at 7 different ranges which I won't go into now), and the number of trials [defined as simulation cycles in which the agent may bite or pass] in a trial set [defined as a set of trials in which the agent's state is preserved from one trial to the next]. I've been running experiments with sets of 250, 500 or 1000 trials.

I'm waiting on the results of the 1000-trial sets, but it looks like there is a systematic difference between 250 and 500. In 6 of the 7 sets, the 500-trial experiments have yielded higher scoring agents, and in 5 of the 7 the highest scoring agent emerged considerably later in the run. The difference in fitness may be too small to be significant, but the differences in emergence time are large, and suggest that the 5000 generations I gave these runs was too short, because in 4 runs the best agent emerged after generation #4000 (which was extremely rare in previous experiments, leading me to settle on the 5000-generation limit).

The difference between the correct simulation's results and those from the buggy version in which agents did not die is also much more pronounced at 500 trials in a set than at 250.

I don't know what to make of this, though it definitely suggests I should run some longer searches to see if anything happens after generation 5000. The only thing I can think of is that agent death has a lower cost in a shorter trial set, because the agent is missing fewer opportunities to eat good food and improve its fitness. This does nothing to explain the search speed though.


3: In the experiments, fitness is assessed using a rotating batch of 20 randomly generated trial sets. Each agent in one generation gets the same trials as its peers, and each generation one set is replaced with a new random one, in an effort to balance low evaluation noise with a wide sampling of all possible trial combinations. Because there are definite noise issues with this kind of evaluation, I've been subjecting agents to a more comprehensive fitness evaluation before quoting their fitness for comparison purposes: the agent sees 1000 sets of trials, randomly generated with the same random seed so that any two agents from the same experimental condition will get the same trials.

Fitnesses according to the more comprehensive evaluation are always lower than those found during the search. That much is to be expected—the search uses a quick and dirty technique because it would take too long otherwise—but I've been finding the size of the discrepancy worrying. Because of this, I've set up a new version of the simulator (with experiments running over the coming weekend) in which the batch of trials is fixed to include n trials at each integer value of the time period range. There's still one set replaced each generation, because there are still other random values in the trials, but each time a set is replaced, its replacement has the same period.

Meanwhile, it occurred to me that I could probably use the same technique to evaluate the agents I'm analysing, and save time. For each agent I've looked at today, I compared 1000 random trialsets with 5 trialsets at each period (50-250 trialsets, depending on the experimental condition). The results are very close, but in all 14 comparisons the systematic evaluation gives a higher score than the random one. The differences range from <0.001 to 0.012, but because there are no exceptions this may still be significant.

I have no theory to explain this at present. On Monday I'll figure out the appropriate significance test to use, and I'll be relieved if the difference is not significant. If the difference is insignificant then I can use the systematic trials and save myself some CPU time, but a signficiant difference would really demand explanation.


4: The dumb strategy I described earlier (start eating when energy drops below threshold A, stop if/when it rises above threshold B) is common. All 14 agents I've looked at today fall into one of two categories: obviously using this strategy, or it wasn't obvious what strategy they were implementing. This is not good.

Previously I expressed the fear that this strategy—I'll call it the threshold strategy—is a local optimum that is too easy for these searches to fall into, and evidence is building up that I was right. So far, I have only found one agent that implements a better strategy, and it is still fairly similar. I am becoming increasingly convinced that I need to try some different experiments; either co-evolving trials with agents or using a different signal to convey information to the agent.


5: Now that I have my simulator saving agents from generations after the highest observed fitness, I've been able to confirm my intuition that the population does continue to improve. I'm fairly sure that this is because there's an interaction between the random values in a batch of trials and the fitness score it will give a particular agent, so the apparent peak has as much to do with an agent getting just the right batch of trials as being the best agent ever (which is also why the better fitness evaluations I use in analysis return lower scores than the same agent got during evolution). I am curious about whether this trend will continue (and be as clear) in the systematic-trials experiments I'm now running.

I've also been able to confirm that new strategies don't emerge as the search is in this final phase. It seems to just be a matter of fine-tuning parameters while keeping the general structure of the controller intact.

Trackbacks

Trackback URL for this entry is: http://blog.case.edu/exg39/mt-tb.cgi/5329

Comments

Post a comment