I see someone's been writing long-horizon solvers.
Good engineering instincts.
They do behave like this in practice, if you have the patience to manage the testing surface and run the evals. In my experience, this kind of thing is fairly straightforward to do on a single model but truly painful to generalize across models. That's been getting easier, though, as the models become richer reasoners and the tooling becomes more uniform.
The framing as fiction makes it more engaging, I think.
It makes me think about the layers you would need to make some...