How to deal with evaluation noise
As I've mentioned before, I'm running into problems of evaluation noise with my runs, and the most obvious solution (give each agent more trials) is too costly to be practical. I do have a few ideas of what to do about this, which I'll be experimenting with over the next few weeks.
First, it's worth stating exactly what I mean by evaluation noise, and distinguishing between different kinds. Any evolutionary algorithm has to select between individuals in some way, in order to decide which will be reproduced. If the function for which the algorithm is being used is sufficiently trivial—matching a target string, for instance—then it's simple enough to score each individual precisely on its closeness to the goal. In practice, however, this doesn't work for any interesting problem because comprehensively evaluating each individual would simply take too long. Instead, it's necessary to find some way of approximating the real quality of each individual. In my experiments, I do this by presenting the agent with a sample of possible trials (generated randomly, but each agent in a given generation gets the same set). For many real world applications there's an additional problem—agents are tested on a simulation of the application they'll ultimately be used for—but I don't need to worry about that because my agents only exist in simulation.
Evaluation noise is the disconnect between the scores individuals are given during a GA run, and their real overall performance. I can think of at least three types:
- Systematic bias. This can be from bugs in the simulator (experience suggests that if there is an exploitable bug that incorrectly assigns high fitness, a GA will find the exploit rather than the desired behaviour) or biases in the way the samples of trials are generated. I've encountered both of these problems in the past, but right now there's nothing leading me to believe I'm in this situation with the current experiments.
- Random noise. Even if there is no systematic bias in the generation of trial sets, any given snapshot is unlikely to sample the whole space of possible trials evenly (and making it do so is a great way to create systematic bias). This opens up the possibility of selection being effectively random, because the individuals that happen to perform well on one particular set of trials are rubbish at all others, and the evaluation trials change fast.
- Overfitting. If the evaluation trials don't change fast enough to cause the random noise problem, agents can instead overspecialise, by evolving to exploit some regularity of the evaluation set that does not correspond to a regularity of the set of all possible trials.
I've already established that while pushing up the number of trials in a generation can work, it makes the runs unworkably slow. There is one caveat to that though, which is that I've been having some success with dramatically smaller population sizes than usual. By default, I run experiments with populations of 100, just because that's never too small and writing up and discussing work is made easier by having a reasonably standard method. However, I've been running some time-learning experiments with population sizes between 10 and 90, and the results suggest that I should use populations as small as 20. Every population size that has finished running so far has reached a similarly high performance level, and while the smaller populations have been requiring more generations to reach that level, the total number of evaluations required (defined as population size * generations elapsed * size of evaluation set) is lowest for populations of 20-30. So far [incomplete data] it looks like using a population size of 25 would reduce the total number of evaluations (and therefore running time) by about two thirds.
I am, however, concerned that this might only work because the particular experiments I've tried with smaller populations are very easy, so I'm not convinced this will solve all my problems. I will default to populations of 25 from now on, but I also won't be surprised if I find it necessary to increase either the population size or the number of generations I let a run go for, once I get back to the more difficult problems.
So this leaves me still in need of cleverer ways to deal with evaluation noise, and here are a few ideas I have in mind:
Dynamically adjusting the number of trials
Rather than having a fixed-size evaluation set, it might be useful to vary the size relative to how the population is shaping up. The problem here would be that I don't know how to measure either evaluation noise or overfitting without knowing what the 'true' fitness of an individual is. In other words, I'd have to go through all the compute time involved in comprehensively testing agents, just to determine what size evaluation set to use. I suppose this wouldn't have to be done every generation, but I still think there's a considerable risk of adding more overhead with this technique than is removed by it.
Cumulative statistics
This idea is from Simon McGregor: start by giving an agent a small number of evaluations, and then do some sort of statistical test—either working out confidence intervals or a significance test on the fitness difference between this individual and another—and use this as the criterion for whether to run more trials on that individual. I find this idea quite appealing, but it's worth bearing in mind that Simon has been playing with it to no avail so far, which makes me a little less eager to dive in.
Significance-guided selection
Rather than using the significance of differences between agents to determine whether or not to do more testing, I could use them to add a probabilistic element to selection. I could see this helping to prevent good agents from becoming extinct because they happen to encounter a batch of trials that 'defeats' them, but I'm not convinced that adding more randomness to a system that's already suffering from noise is really going to help.
Coevolution
One thing I certainly need to implement is the simple coevolutionary scheme that I've used in previous work, and which I took from Doug Downey. This involves scoring the trials as well as the agents, with a trial's score being the inverse of the agents' average performance on it. Then once an overall performance criterion is reached, the weakest trial (i.e. the one on which agents are performing best) is removed and replaced with a fresh one. This should keep pushing agents to perform better as they are continuously presented with more challenging trials, and has two important advantages: it's simple, and I've had good results with it before.
There is a potential drawback, which is that the coevolving populations of agents and trials can get stuck in a game of rock-paper scissors (or RPS-25 if you prefer) whereby they keep cycling through states of one dominating the other, without improving overall. Various conversations at GECCO have led me to realise that if I do run into this problem, the solution is probably to imitate one of several prior pareto coevolutionary methods. I need to read more about exactly how this is done, but the basic idea is to look for individuals that consistently defeat their counterparts (very good agents or very hard trials) and keep them in a 'pareto-set' that is used to keep testing the coevolving population against.
For now, I think I should implement the simple coevolutionary scheme and see what happens. If I recognise a dominance cycle emerging—which should be fairly obvious—then the received wisdom is that pareto coevolution is the way to solve this problem, so for now I need to read more about that idea.

Comments