SWE-bench Verified - Anthropic claims 80.2% with parallel test-time compute; I think we should count it?
I would like to register that I think this is very vibe-codeable given the right tools, and if there's an interested game designer who doesn't have full-stack experience, I'd be willing to help them get up to speed on vibe-coding best practices for this project.
I appreciate the praise and insights!
I hadn't thought of the sandbagging version of adversarial evals and it sounds interesting, although I'm a bit confused about the specifics of the reward function. It sounds to me like in order to catch sandbagging you need an example of the same base model performing better on the task?
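To make my confusion concrete, here's a rough sketch of the comparison I think the reward would have to be built on (all names hypothetical, not something from your post): a best-effort run of the same base model as a baseline, with the penalty driven by the gap.

```python
# Hypothetical sketch: to penalize sandbagging, you seemingly need a
# best-effort baseline run of the same base model on the same tasks,
# and the penalty is driven by how far the evaluated run falls below it.

def sandbagging_penalty(eval_scores: list[float],
                        baseline_scores: list[float],
                        noise_margin: float = 0.05) -> float:
    """Penalty proportional to how far the evaluated run underperforms
    a best-effort baseline run of the same base model."""
    eval_mean = sum(eval_scores) / len(eval_scores)
    baseline_mean = sum(baseline_scores) / len(baseline_scores)
    gap = baseline_mean - eval_mean
    # Only penalize gaps bigger than what we'd expect from noise.
    return max(0.0, gap - noise_margin)

# Example: baseline run solves 80% of tasks, evaluated run only 60%.
print(sandbagging_penalty([0.6] * 10, [0.8] * 10))  # ~0.15
```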
Asymmetries in the reward structure - If I understood correctly, I think this is covered by the biased overseer? A critique is too good if the overseer overestimates how similar a bad word is to the clue.
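As a toy sketch of what I mean (function names are hypothetical): if the critique's reward is just the overseer's similarity estimate between the flagged bad word and the clue, then an overseer with a positive bias over-rewards critiques by construction.

```python
# Toy sketch with hypothetical names: the critique's reward is the overseer's
# similarity estimate, so an overseer that overestimates similarity makes
# critiques look "too good" relative to the true similarity.

def true_similarity(word: str, clue: str) -> float:
    # Stand-in for whatever ground-truth similarity the eval uses.
    return 0.3

def biased_overseer_similarity(word: str, clue: str, bias: float = 0.2) -> float:
    # The biased overseer systematically overestimates similarity.
    return min(1.0, true_similarity(word, clue) + bias)

def critique_reward(bad_word: str, clue: str) -> float:
    # A critique is "too good" exactly when this exceeds the true similarity.
    return biased_overseer_similarity(bad_word, clue)

print(critique_reward("ocean", "wave"))  # 0.5, vs. a true similarity of 0.3
```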
Open to hearing more ideas; I agree there's more that can be done with this setup.
I had the impression that collective punishment was disallowed in the IDF, but as far as I can tell from googling, the ban only applies to collectively keeping soldiers from their leave (potentially including a weekend). I couldn't find anything about the origin, but I bet collectively keeping a unit from going home was pretty common before it was disallowed in 2015, and I think it still happens sometimes today even though it's disallowed.
source: https://www.idf.il/%D7%90%D7%AA%D7%A8%D7%99-%D7%99%D7%97%D7%99%D7%93%D7%95%D7%AA/%D7%90%D7%AA%D7%A8-%D7%94%D7%A4%D7%A7%D7%95%D7%93%D7%95%D7%AA/%D7%A4%D7%A7%D7%95%D7%93%D7%95%D7%AA-%D7%9E%D7%98%D7%9B-%D7%9C/%D7%9E%D7%A9%D7%98%D7%A8-%D7%95%D7%9E%D7%A9%D7%9E%D7%A2%D7%AA-33/%D7%A9%D7%99%D7%A4%D7%95%D7%98-03/%D7%9E%D7%A0%D7%99%D7%A2%D7%AA-%D7%97%D7%95%D7%A4%D7%A9%D7%94-33-0352/
I disagree with almost everything you wrote; here are some counter-arguments:
That said, I don't claim that everything is perfect and that we're all definitely going to be fine. In particular, I agree that it will be hard or impossible to get everyone to follow this methodology, and I don't yet see a good plan for enforcing compliance. I'm also afraid of what happens if we get stuck, unable to confidently align a system that we've identified as dangerous (in that case it becomes increasingly likely that the model gets deployed anyway, or that other, less compliant actors build a dangerous model).
Finally - I get the feeling that your writing is motivated by your negative outlook, and not by trying to provide good analysis, concrete feedback, or an alternative plan. I find it unhelpful.
Oh god... I hope you're wrong about this providing a significant efficiency boost. If working on the same server is viable, then presumably there's a pretty homogeneous software stack, which would also mean Docker images can often be reused?
This is basically in our SL5 recommendations, with the assumption that the model you'd use for this didn't merit being trained in an SL5 data center and so is not a catastrophic risk. But maybe it could already be misaligned and sneaky enough to introduce a backdoor even if it's not smart enough to be a catastrophic risk? Do we have a prediction for the number of model generations between models still being dumb enough to trust and models being smart enough to be a catastrophic risk (with the best case being 1)? If we buy the AI-2027 distinction between superhuman coder and superhuman AI researcher, it could even be the case that the big scary model doesn't write any code, and only comes up with ideas which are then implemented by the superhuman coder, or by whatever coder is the smartest one you trust.
At some point, I expect the AI company will want to use its smartest models to build the infrastructure needed to serve them over the internet while remaining SL5. Hopefully this only happens after they're justifiably confident that those models are aligned.
Another idea is to require that any infra created by an untrusted model be formally verified against a specification written by a human or a trusted model, though I don't know enough about formal verification to say whether that's good enough.
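To gesture at what I mean (a toy, hypothetical example in Lean, nothing like real infra): the untrusted model writes the implementation, the human or trusted model writes the spec as a theorem statement, and the proof is checked by the kernel rather than by human review of the model's code.

```lean
-- Toy, hypothetical example: `cappedAllocation` stands in for code written by
-- the untrusted model; the theorem statement is the human-written spec; the
-- proof is checked mechanically by Lean's kernel.
def cappedAllocation (requested cap : Nat) : Nat :=
  if requested ≤ cap then requested else cap

-- Spec: the granted allocation never exceeds the cap.
theorem cappedAllocation_le_cap (requested cap : Nat) :
    cappedAllocation requested cap ≤ cap := by
  unfold cappedAllocation
  split
  · assumption
  · exact Nat.le_refl cap
```

The obvious catch is that the guarantee is only as good as the spec, and writing a complete spec for real infra is presumably much harder than this.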