Goodhart's Law and a Minimum Viable Sugarscape: Karpathy Pattern ABM Autoresearch

Raven Of Empire

Intro

Agent-based models (ABMs) are notorious for being over-parameterized. Researchers can get caught in loops of what is effectively manual p-hacking, adding more and more rules and variables to their funny world of little guys in a grid. My problem is that ABMs can be far too complex. To truly understand the emergent phenomena, we must (in my opinion) find the simplest structure that results in their emergence.

With this in mind, I vibe-coded and iterated on a small experiment: a harness that allows for a Karpathy-style ABM ablation research loop. It's made up of the following pieces:

prepare.py + other stuff: This script retrieves baseline benchmark data from the canonical Mesa Sugarscape implementation to establish ground truth. It also serves as the test harness that evaluates every proposed iteration, checking that emergent behaviors remain within the proposed error bounds.
strategy.py: An agent-controlled git repo containing a single file. It starts as a pre-flattened and shrunken version of the Mesa library's Sugarscape implementation, so that the model can see the whole implementation at once, as Karpathy did with his train.py.
program.md: The human research agenda, which steers the LLM (I used Gemma) toward the overarching goals of the experiment.
sugarscape_autoresearch.py: The main orchestrator, which queries the LLM to propose structural simplifications to the strategy file. The LLM writes a new file based on the old one. If the variant passes the tests and has a smaller complexity measure, the change is committed to the git repo.
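
The loop formed by these pieces can be sketched roughly as follows. This is a minimal illustration, not the code from the repo: `ratchet`, `propose`, `passes_harness`, and `complexity` are hypothetical names, and the toy callables at the bottom stand in for the LLM call, the Mesa-based test harness, and the complexity measure.

```python
from typing import Callable, List, Tuple


def ratchet(
    source: str,
    propose: Callable[[str], str],          # hypothetical: asks the LLM for a rewrite
    passes_harness: Callable[[str], bool],  # hypothetical: runs the emergence tests
    complexity: Callable[[str], float],     # hypothetical: any complexity measure
    generations: int,
) -> Tuple[str, List[float]]:
    """Keep a candidate only if it passes the tests AND is strictly simpler."""
    best = source
    best_score = complexity(best)
    history = [best_score]
    for _ in range(generations):
        candidate = propose(best)
        if passes_harness(candidate):
            score = complexity(candidate)
            if score < best_score:
                # This is the point where the real loop would commit to git.
                best, best_score = candidate, score
        history.append(best_score)
    return best, history


# Toy stand-ins so the sketch runs without an LLM or a simulation.
def drop_last_line(src: str) -> str:
    return "\n".join(src.splitlines()[:-1])


def still_valid(src: str) -> bool:
    return "step()" in src


def count_lines(src: str) -> int:
    return len(src.splitlines())


start = "step()\nextra_a\nextra_b"
best, history = ratchet(start, drop_last_line, still_valid, count_lines, generations=4)
print(best, history)
```

The score can only go down or stay flat, which is why the complexity graphs look like staircases: a failed or non-improving proposal leaves the committed file untouched.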

How did I determine that a strategy kept the emergent features of the original Mesa implementation? I used some key characteristics of Sugarscape as described in its source, the book Growing Artificial Societies by Epstein and Axtell, along with my best attempt at faithful error bounds on the measures used. Arguably my error bounds are tighter than they should be, since they are smaller than the noise of the ABM itself, but this at least helps make sure that edge cases of a proposed strategy stay in bounds as well. Finally, the harness runs a parameter sweep over the initial conditions relevant to the desired emergent features and checks that the measured features stay within the error bounds of the Mesa implementation.
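
A check of this kind could look something like the sketch below. Everything here is an assumption made for illustration: the function name `within_bounds`, the metric names (`gini`, `population`), the sweep keys, the tolerances, and the numbers are placeholders, not the measures or values used in the actual harness.

```python
def within_bounds(baseline: dict, candidate: dict, rel_tol: dict) -> bool:
    """True if every metric at every sweep point is within its relative tolerance."""
    for params, base_metrics in baseline.items():
        cand_metrics = candidate.get(params)
        if cand_metrics is None:
            return False  # the candidate did not produce this sweep point
        for name, base_value in base_metrics.items():
            allowed = rel_tol[name] * abs(base_value)
            if abs(cand_metrics[name] - base_value) > allowed:
                return False
    return True


# Placeholder numbers, keyed by a hypothetical sweep parameter.
baseline = {
    ("initial_population", 100): {"gini": 0.50, "population": 200.0},
    ("initial_population", 200): {"gini": 0.40, "population": 250.0},
}
rel_tol = {"gini": 0.10, "population": 0.05}

good = {
    ("initial_population", 100): {"gini": 0.52, "population": 205.0},
    ("initial_population", 200): {"gini": 0.41, "population": 245.0},
}
bad = {
    ("initial_population", 100): {"gini": 0.60, "population": 205.0},
    ("initial_population", 200): {"gini": 0.41, "population": 245.0},
}

print(within_bounds(baseline, good, rel_tol))
print(within_bounds(baseline, bad, rel_tol))
```

Because the model is stochastic, a real check would presumably compare averages over several seeds per sweep point rather than single runs.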

Results

Version 1: Line count complexity

My first attempt at a complexity measure was a lazy and simple line count. As you can see in this graph we quickly plateau at around 40 LOC.

This run took ~12 hours and tried 213 strategies, with ~13 successful ratchets. This experiment's generation count is just a count of the times the LLM was prompted. This version attempted 3 strategies per LLM output, which probably affected the strategies' viability, but you can still see a clear convergence in the results.

Unfortunately, and predictably given my choice of complexity measurement, Goodhart's law clearly comes into play. If you look at the strategies that were successful in this situation, the LLM created a mass of unreadable, compressed code with many walrus operators, single-pass procedural loops, and aggressive tuple unpacking.
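
To make the failure mode concrete, here is a small illustration of how a line count rewards packing rather than simplifying. The two `total_wealth` snippets are invented for this example and are not taken from the run; `line_count` is an assumed stand-in for the Version 1 measure.

```python
def line_count(source: str) -> int:
    """Count non-blank lines, the kind of measure Version 1 optimized."""
    return sum(1 for line in source.splitlines() if line.strip())


READABLE = """
def total_wealth(agents):
    total = 0
    for sugar, spice in agents:
        wealth = sugar + spice
        total += wealth
    return total
"""

# Same behavior, squeezed onto one line with semicolons and a walrus.
COMPRESSED = """
def total_wealth(a): t = 0; [(t := t + s + p) for s, p in a]; return t
"""


def run(source: str, agents):
    """Execute a snippet and call the total_wealth function it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace["total_wealth"](agents)


agents = [(3, 4), (10, 1), (0, 7)]
scores = (line_count(READABLE), line_count(COMPRESSED))
results = (run(READABLE, agents), run(COMPRESSED, agents))
print(scores, results)
```

Both snippets do the same work, but the second scores six times better on the metric while being harder to read, which is the Goodhart effect in miniature.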

Version 2: Combined AST Node Complexity and Cyclomatic Complexity

This version used a combined complexity metric of AST node counts and decision point counts (weighted 40% and 60% respectively). I ended up running this for 36 hours or so, and it got 13 total successful ratchets/strategy attempts (compared to the ~13 for the LOC complexity over 12 hours). This graph's generation count counts each successful strategy attempt. One thing to keep in mind is that this version's starting strategy file was actually the ending strategy file of the last experiment, and was therefore already quite simplified from a line count perspective.
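
A metric along these lines can be sketched with Python's standard `ast` module. This is an assumption-heavy illustration: which node types count as decision points, whether the two counts are normalized before weighting, and the name `combined_complexity` are all guesses, and only the 40/60 weighting comes from the description above.

```python
import ast

# Node types treated as decision points. This set is an assumption.
DECISION_NODES = (
    ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler, ast.comprehension,
)


def combined_complexity(source: str, w_nodes: float = 0.4, w_decisions: float = 0.6) -> float:
    """Weighted sum of total AST nodes and decision points (unnormalized)."""
    nodes = list(ast.walk(ast.parse(source)))
    decisions = sum(isinstance(node, DECISION_NODES) for node in nodes)
    # Each extra operand of an and/or chain adds one more branch.
    decisions += sum(len(n.values) - 1 for n in nodes if isinstance(n, ast.BoolOp))
    return w_nodes * len(nodes) + w_decisions * decisions


flat = "def f(x):\n    return x + 1\n"
branchy = (
    "def f(x):\n"
    "    if x > 0 and x < 10:\n"
    "        return x + 1\n"
    "    return x\n"
)

print(combined_complexity(flat), combined_complexity(branchy))
# Packing statements onto one line does not change the AST, so it no longer helps.
print(combined_complexity("a = 1\nb = 2") == combined_complexity("a = 1; b = 2"))
```

Since the score depends only on the parsed tree, whitespace tricks and line packing are neutral, and the only way to lower it is to remove nodes or branches.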

As you can see in the graph, it was much harder for the LLM to make progress with this metric. Although I stopped, I'm sure you could run this for much longer and see it possibly converge. I'm not sure whether making it propose only one strategy per LLM call slowed the research down. It could have, but that feels unintuitive to me: I assumed that letting more attention go to solving a more complex problem would make it faster. Progress was likely slow because of how uninterpretable the file was and how compressed it already was from the original experiment, but even so the LLM was clearly able to make progress.

The actual optimizations the LLM made were interesting. Ironically, it was able to shorten the line count too. Here are a few examples:

It was able to realize that raising utility to the power of $m_T$, the sum of the metabolisms $m_1 + m_2$, is a monotonic transformation, allowing it to simplify the equation from $w_1^{m_1/m_T} w_2^{m_2/m_T}$ to $w_1^{m_1} w_2^{m_2}$.
The original calculated the marginal rate of substitution (MRS) using nested divisions, requiring 6 divisions and multiple nested brackets. It was able to use the algebraic identity $\frac{a/b}{c/d} = \frac{ad}{bc}$ to consolidate the calculation.
The strategy used to calculate float utilities with exponents divided by the sum of metabolisms. By raising both sides of the inequality to the power of that sum, the LLM eliminated the divisions entirely.
It eliminated the trader class and simply represents traders in a list, which gives much lower complexity.
Previously, during agent movement, the strategy generated coordinate ranges and Manhattan distances on the fly. Now it pre-calculates all possible spatial offsets for the maximum possible vision range and then filters this list, replacing dynamic, nested math with a lookup loop.
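
Two of the optimizations in the list above can be sketched as follows. This is a reconstruction under stated assumptions, not the LLM's actual output: the welfare function is assumed to be the Cobb-Douglas form from Epstein and Axtell, the grid is assumed to be non-wrapping, and `MAX_VISION`, `OFFSETS`, `welfare_power`, and `visible_cells` are hypothetical names.

```python
def welfare(sugar, spice, m_sugar, m_spice):
    """Assumed original form: exponents divided by the sum of metabolisms."""
    m_total = m_sugar + m_spice
    return sugar ** (m_sugar / m_total) * spice ** (m_spice / m_total)


def welfare_power(sugar, spice, m_sugar, m_spice):
    """welfare ** m_total. Same ordering for one agent, with no divisions."""
    return sugar ** m_sugar * spice ** m_spice


# Ranking candidate holdings for one agent gives the same order either way,
# because x -> x ** m_total is strictly increasing for x >= 0 and m_total > 0.
bundles = [(5, 7), (6, 6), (3, 10)]
by_welfare = sorted(bundles, key=lambda b: welfare(b[0], b[1], 2, 3))
by_power = sorted(bundles, key=lambda b: welfare_power(b[0], b[1], 2, 3))

MAX_VISION = 5  # hypothetical upper bound on any agent's vision

# Every (distance, dx, dy) within Manhattan distance MAX_VISION, computed once.
OFFSETS = sorted(
    (abs(dx) + abs(dy), dx, dy)
    for dx in range(-MAX_VISION, MAX_VISION + 1)
    for dy in range(-MAX_VISION, MAX_VISION + 1)
    if 0 < abs(dx) + abs(dy) <= MAX_VISION
)


def visible_cells(x, y, vision, width, height):
    """Filter the precomputed offsets instead of rebuilding ranges per move."""
    return [
        (x + dx, y + dy)
        for dist, dx, dy in OFFSETS
        if dist <= vision and 0 <= x + dx < width and 0 <= y + dy < height
    ]


print(by_welfare == by_power)
print(visible_cells(0, 0, 1, 10, 10))
```

Note that the power trick only preserves comparisons made for a single agent, since two agents with different metabolisms would be raised to different powers.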

Conclusion

Unfortunately, I am a big dummy and did not commit/save the original flattened Sugarscape implementation, so we can't do a good comparison to the original state of the strategy. I will say that the harness, if inspected, still meets justifiable standards for the Sugarscape model. Keep in mind that the starting point was originally ~140 lines and must have passed the harness, so it ran mostly like the original Mesa implementation.

The lessons I want to take away are these:

Goodhart's law clearly dictates the "flavour" of the code. Optimizing for text length just creates obfuscation. Optimizing for structural logic translates more cleanly into revelations about the system and how it can be optimized, although it does not exactly solve the obfuscation issue. The LLM isn't just reformatting the code: it can clearly identify new algebraic expressions that simplify loops and such, and it is able to use itself as an algorithmic razor.

ABMs in general suffer from a lot of human bias. Humans could design algorithms with their self-discovery in mind, not just engineering in human concepts but leaving room for the algorithm to discover and demystify its own concepts. I think we can start designing experiments so that they can reformat and optimize a cruder human guess/attempt at a thesis.

Using this idea, we can replace the manual tweaking of code structure and simulation rules with automated, metric-bound iteration. In my case the showcase was finding the minimum viable structure of a model, but in other cases it may be approaching a new and complex emergent phenomenon in itself. To restate: we can be honest about the overfitting of ABMs that humans do and just let the robot do it.

The repo is here (be warned it's gross and not modular):
https://github.com/NobodyKnowNothing/Sugarscape-Auto-Ablation

It contains logs of all this and the graphs I have posted. If someone can generalize this pipeline and throw it at more Mesa models, that would be amazing. I may do it myself if I can get a grant or if I feel like it. This technique is probably applicable to cellular automata and many other kinds of rule-based, emergence-seeking simulations.