It's called research because you have to RE search
The past week in brief:
1-4 my to-do list from last week, of which I got as far as #3
5 Experiments I re-ran with fixed code, because the bug I mentioned last week turned out to have a large effect on results (everything from the last couple of weeks; not all have finished yet).
6 Experiments running right now (testing out a simple shaping algorithm; trying larger numbers of interneurons)
7 Experiments I want to run soon (intermediate numbers of interneurons)
8 Programming to do (sensory flags; co-evolution?)
I'll start by revisiting my to-do list from last week.
1. Fix a niggling bug in the 'comprehensive' trial generation scheme.
I got this fixed; it's the one I talked about last Thursday. In short: ouch. I'll come back to this in #5.
2. Set up the simple shaping system.
This is now done. The search starts with a particular range of periods, and when fitness reaches 0.8 it expands the range by 1. Results I've seen this week (from experiments that don't use shaping) suggest that it will produce better quality agents, though I'm sceptical about it producing qualitatively different ones.
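For illustration, the expansion rule described above could be sketched roughly as follows. This is not the simulator's actual code: the class name `ShapingSchedule`, the `max_high` cap, and the choice to widen the range at its upper end are all assumptions made for the sketch; only the 0.8 threshold and the step of 1 come from the description.

```python
from dataclasses import dataclass


@dataclass
class ShapingSchedule:
    """Hypothetical helper: widens the range of trial periods as fitness improves."""
    low: int                # shortest period in the current range
    high: int               # longest period in the current range
    max_high: int           # assumed cap: never expand beyond this period
    threshold: float = 0.8  # fitness level that triggers an expansion
    step: int = 1           # how far the range grows each time

    def update(self, best_fitness: float) -> bool:
        """Grow the range by `step` if the threshold is met; return True if it grew."""
        if best_fitness >= self.threshold and self.high < self.max_high:
            # Assumption: the range grows at its upper end only.
            self.high = min(self.high + self.step, self.max_high)
            return True
        return False

    def periods(self) -> range:
        """Periods that trials may currently be drawn from."""
        return range(self.low, self.high + 1)
```

Calling `update` once per generation with the current best fitness keeps the task at its present difficulty until the population has mastered it, then adds one more period.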
3. Set up an adaptive stopping criterion that will declare a search to have finished if n generations have passed without any improvement in fitness
This is now done. Based on the results I've been looking at this week, I currently have the simulator stopping either after 10,000 generations total (just to make sure runs don't last forever), or once 1,000 generations have passed with no improvement in fitness. In general the searches that get really poor results tend to show no improvement after generation #1000, while the ones that do well tend to still show improvements after #4000, so hopefully this will extend the fruitful searches while cutting down the number of CPU cycles wasted on dead-end ones.
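A minimal sketch of that stopping rule, assuming the search exposes one best-fitness value per generation. The function names `should_stop` and `run_search` are hypothetical; the 10,000 and 1,000 defaults are the values quoted above.

```python
def should_stop(generation, last_improvement,
                max_generations=10_000, patience=1_000):
    """True once the hard cap is hit or `patience` generations pass without a gain."""
    if generation >= max_generations:
        return True
    return generation - last_improvement >= patience


def run_search(fitness_per_generation, max_generations=10_000, patience=1_000):
    """Consume best-fitness values one generation at a time.

    Returns (generations_run, best_fitness). `fitness_per_generation` stands in
    for the real evolutionary loop and is an assumption of this sketch.
    """
    best = float("-inf")
    last_improvement = 0
    generation = 0
    for generation, fitness in enumerate(fitness_per_generation, start=1):
        if fitness > best:
            best, last_improvement = fitness, generation
        if should_stop(generation, last_improvement, max_generations, patience):
            break
    return generation, best
```

Because the patience counter resets on every improvement, a search that keeps finding gains is only ever cut off by the hard cap.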
4. Implement the version of the simulator that uses sensory flags in various ways.
I didn't get to this: #1 turned out to matter so much that I've spent the week analysing data from re-run experiments, and I'm still waiting for some to finish. I really should get to it this week, though.
5 Experiments I re-ran with fixed code
I ran 21 searches with randomly generated trials, and 21 with trialsets at every allowed time period. The most important thing I've found is that the fix does make a big difference to some outcomes, but it doesn't seem to make them better or worse overall. I'll deal with the two sets of experiments separately, because the effect was different.
For the randomly generated trials, the first thing I noticed was the appearance that searches were doing a whole lot better. Unfortunately, this turned out to be an artefact. I re-tested each best agent on a larger sample of trialsets, in which each allowed period was represented 5 times, to give a benchmark fitness, and found that quite a few of these agents performed very poorly on the benchmark. Then I looked at a sample qualitatively, and their behaviour turned out to only be sensible for a subset of periods.
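The benchmark re-test can be sketched like this, assuming an `evaluate(t)` callable that runs one trialset at period `t` and returns a fitness score. That callable, and the names `benchmark_fitness` and `overfit_gap`, are hypothetical stand-ins rather than the simulator's real interface; the 5 repeats per period is the figure given above.

```python
from statistics import mean
from typing import Callable, Iterable


def benchmark_fitness(evaluate: Callable[[int], float],
                      periods: Iterable[int],
                      repeats: int = 5) -> float:
    """Mean fitness over `repeats` trialsets at every allowed period."""
    scores = [evaluate(t) for t in periods for _ in range(repeats)]
    return mean(scores)


def overfit_gap(reported: float,
                evaluate: Callable[[int], float],
                periods: Iterable[int],
                repeats: int = 5) -> float:
    """How much the fitness reported by the run overstates the benchmark."""
    return reported - benchmark_fitness(evaluate, periods, repeats)
```

A large positive gap flags an agent whose reported score came from a favourable draw of trials rather than from good behaviour across all periods.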
I think what's been happening is that with the randomly generated trials, the highest fitness scores are produced not by agents that are good across the range of conditions they're supposed to be evolved for, but by very over-specialised agents that happen to get just the right combination of trials in a particular generation. I could probably fix this by dramatically increasing the number of trialsets they see, but this would have a proportional effect on running times, so it's not ideal.
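A toy simulation can make that suspicion concrete. Everything in it is invented for illustration: the two agents, the period range of 1 to 20, and the 5 random trials per generation are assumptions, not values from the real experiments.

```python
import random
from statistics import mean

PERIODS = range(1, 21)      # assumed allowed periods
TRIALS_PER_GENERATION = 5   # assumed small random sample per generation
GENERATIONS = 200


def generalist(t):
    """Toy agent that is moderately good at every period."""
    return 0.7


def specialist(t):
    """Toy agent that is perfect on half the periods and useless on the rest."""
    return 1.0 if t <= 10 else 0.0


def best_reported(agent, rng):
    """Highest per-generation score when each generation draws random periods."""
    best = 0.0
    for _ in range(GENERATIONS):
        sample = [rng.choice(PERIODS) for _ in range(TRIALS_PER_GENERATION)]
        best = max(best, mean(agent(t) for t in sample))
    return best


def true_fitness(agent):
    """Fitness averaged over every allowed period."""
    return mean(agent(t) for t in PERIODS)
```

Here the specialist's true fitness is 0.5 against the generalist's 0.7, yet over 200 generations its best reported score is all but certain to beat the generalist's, simply because some generation draws mostly favourable periods. Raising `TRIALS_PER_GENERATION` shrinks that effect at a proportional cost in running time.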
There are still a few that do well on the benchmark tests. So far, each one that I've looked at qualitatively has been doing some variation of the reactive "threshold" strategy. I decided to test them on a wider range of periods, and the results have been quite interesting. Here are two that gave me a headache today:
These graphs are from subjecting the best agent from a given run to 10 trials at each period (t) from 1 to 250.
The feature at the right of both shows up all the time, and it's easy to explain: these runs had sets consisting of 250 trials each, and every set starts with good food, so as t increases the agent gets more and more of a free ride.
The things I found more interesting were the noise, and the effect of which range the agent was evolved with. A was evolved with t=1-20, and B with t=20-50. So, glossing over the difficulty with very low t values (in general it's hard for agents to do anything sensible with very short periods), both agents have evolved to do best within the range they were evolved with (no surprise), but there isn't a sharp fall-off immediately outside it (which is typical of all the agents I've looked at). Both are also badly affected by input noise outside that range, but ignore it within it. I think that these two factors suggest that their strategy could be tuned for a wider range of t values, so I'm going to do some shaping experiments with similar parameters.
For the comprehensive trials, I haven't yet got to qualitative evaluations, but numerically what I'm seeing is that relatively few agents appear to do really well, though there's a much better match between the fitness reported by the run and the fitness on the benchmark trials. I'm wondering whether this means I should drop either the randomised searches (because they turn up so many false leads) or the comprehensive ones (because they just don't seem to be fruitful all that often). I am still waiting for quite a few of the comprehensive-trials searches to finish, though, so this is all a bit tentative.
6 Experiments running right now
Some of the experiments I queued up last Friday are still going.
Apart from that, I have one set of shaping trials queued, and one set of non-shaping ones with 10 interneurons instead of 3, using a subset of the parameters I've run before.
7 Experiments I want to run soon
I think that whatever happens with the 10-interneuron searches I need to also sample the space between 3 and 10, based on what Jacob & I learned from the catcher experiments. Having too many interneurons can make searches less likely to come up with anything good, so I don't think it will be a complete exploration without doing those.
Depending on the results, I may also want to try more than 10 interneurons, but I'll make that call later.
8 Programming to do
I still need to implement the sensory data condition. It doesn't look like the temporal-correlation experiments are going to give me anything that I would be comfortable labelling a learning agent, but the reason I've let myself be sidetracked by it is that I need the null result to be a real, reproducible null result and not an artefact of stupid programming errors. If I'm going to claim that XYZ is crucial to the emergence of learning agents, I'd better be sure that they don't show up without it....
I'm also interested in trying some more complex co-evolutionary strategies for trial generation, but I'm not sure that needs to be a priority right now.


