They have a doc for the harness changes from model-to-model for this series of runs (claudeplayspokemon). Excerpt on Opus 4.5 changes:
ClaudePlaysPokemon Opus 4.5 Harness Changes
Navigator
Added support for Surf 😉
Marked Spin tiles and Teleport tiles as not navigable so navigation sequences wouldn’t accidentally hit these which was impossible for the model to realize
Made it so when the model hits a spin tile we wait for the player to stop moving before giving the model the next screenshot
Added support for side entrances to gates (previously were marked as non navigable)
Opus called out that it was confusing that tiles that were out of reach but walkable were marked as red, so we updated it to mark those as cyan instead. Possible a better prompt could have fixed this, but it was an easy change
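(Not from the doc: to make the spin/teleport item concrete, here is a rough sketch of what "not navigable" could mean for a grid pathfinder. The tile symbols, the `find_path` name, and the use of breadth-first search are all my assumptions for illustration; the doc doesn't say how the actual navigator is implemented.)

```python
from collections import deque

# Hypothetical tile symbols, not taken from the real harness:
# "." floor, "#" wall, "S" spin tile, "T" teleport tile.
WALKABLE = {"."}  # spin and teleport tiles are deliberately left out


def find_path(grid, start, goal):
    """Breadth-first search over 4-connected tiles.

    Returns a list of (row, col) positions from start to goal,
    or None if the goal can't be reached without stepping on a
    non-walkable tile.
    """
    rows, cols = len(grid), len(grid[0])
    prev = {start: None}  # visited set plus back-pointers
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            # Walk the back-pointers to rebuild the route.
            path = []
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        r, c = cur
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in prev and grid[nr][nc] in WALKABLE:
                prev[(nr, nc)] = cur
                queue.append((nr, nc))
    return None
```

With spin tiles excluded from `WALKABLE`, a route that would otherwise cut straight across them gets pushed around instead, so a multi-step navigation command can't slide the player somewhere the model didn't expect.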
Memory
We’re back to multi-file memory now – Claude is responsible and is able to manage multiple files without losing the plot.
Misc.
I removed a bunch of tooling that told Claude when things were going wrong (e.g. informing Claude it was stuck). Claude doesn’t need this anymore.
Also I let claude enter names faster now 🚀
Hints
I removed all of the hints I used to give models (Claude is pretty good these days). I do have a few examples of mistakes that Claude makes visually that you could interpret as hints, ymmv:
So, yes, it does seem hard to draw many conclusions from the performance differences, since the runs are far from apples-to-apples. But at least we can see that the harness is not only accommodating the models' deficiencies but also, over time, removing assists as new strengths emerge.