Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com.
(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)
I think this would be quite valuable! I don't know how it compares to other things you do, but I definitely know a lot of people who end up duplicating a lot of this research on their own.
Yeah, OK, I do think your new two paragraphs help me understand a lot better where you are coming from.
The way I currently think about the kind of study we are discussing is that it does really feel like it's approximately at the edge of "how bad do instructions need to be/how competent do steering processes need to be in order to get good results out of current systems", and this is what makes them interesting! Like, if you had asked me in advance what LLMs would do if given the situation in the paper, I wouldn't have given you a super confident answer.
I think the overall answer to the question of "how bad do instructions need to be" is "like, reasonably bad. You can totally trigger specification gaming if you are not careful, but you also don't have to try enormously hard to prevent it for most low-stakes tasks". And the paper is one of the things that has helped me form that impression.
Of course, the things about specification gaming that I care about most are procedural details around questions like "are there certain domains where the model is much better at instruction following?", "does the model sometimes start substantially optimizing against my interests even when it definitely knows that I would not want that?", and "does the model sometimes sandbag its performance?". The paper helps a bit with some of those, but as far as I can tell, it doesn't take strong stances on how much it informs any of those questions (which I think is OK; I wish it had better analysis around this, but AI papers are generally bad at that kind of conceptual analysis), and where secondary media around the paper does take a stance, I do feel like it's stretching things a bit (though I think the authors are mostly not to blame for that).
To be clear, the URL for "What Superintelligence Looks Like" that was listed in that survey was "superintelligence2027.com", so that one also had the year in the name!
This seems like either (1) the landscapers are malicious and trying to misunderstand me, or (2) they lack some kind of background knowledge common to humans. It's a reason for further inquiry into what kind of people the landscapers are.
I don't understand: why are we limiting ourselves to these two highly specific hypotheses?
Like, neither of these hypotheses applies to the classical specification gaming boat. It's not like it's malicious, and it's not like anything would change about its behavior if it "knew" whatever background knowledge is common to humans.
This whole "you must be implying the model is malicious" framing is the central thing I am objecting to. Nobody is implying the specification gaming boat is malicious! The AI does not hate you, or love you, but you are made out of atoms it can use for something else. The problem with AI has never been, and will never be, that it is malicious. And repeatedly trying to frame things in that way is just really frustrating, and I feel like it's happening here again.
Yep, honestly, I really deeply hate the way legally admissible evidence works, where keeping any kind of record becomes a liability that can be revealed in discovery in a huge range of lawsuits.
My guess is this usually isn't worth worrying much about, but it does make me quite sad.
Oh alas, I think that is a major update downwards on MIRI's work here. Happy to chat about it if you want sometime, but it appears to me that almost every time Eliezer intentionally writes substantial public comms here, things get non-trivially better (e.g. I think the Time article was much better than other things MIRI had done for a while). I am not super confident here.
Or put broadly, if "specification gaming" is about ambiguous instructions, you have two choices. One, decide any kind of ambiguous instruction counts, in which case you can generate infinite ambiguous instructions by being bad at the act of instructing.
Yes, you can generate arbitrary examples of specification gaming. Of course, most of those examples won't be interesting. In general you can generate arbitrary examples of almost any phenomenon we want to study. How is "you can generate arbitrary examples of X" an argument against the importance of studying X?
Now, specification gaming is of course a relevant abstraction because we are indeed not anywhere close to fully specifying what we want out of AI systems. As AI systems get more powerful, this means they might do things that are quite bad by our lights in the pursuit of misspecified reward[1]. This is one of the core problems at the heart of AI Alignment. It is also a core problem at the heart of management and coordination and trade and law.
And there are lots of dynamics that will influence how bad specification gaming is in any specific instance. Some of them have some chance of scaling nicely to smarter systems; some of them get worse with smarter systems.
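To make the misspecified-reward point concrete, here is a minimal toy sketch (mine, not from the thread; the environment, the policies, and all the numbers are made up for illustration) of how the policy that maximizes the literal reward can diverge from the intended task, mirroring the boat example:

```python
# Toy illustration of specification gaming via a misspecified reward.
# The designer wants the agent to reach the goal at position 4; the
# reward instead pays +1 per power-up collected. The power-up respawns,
# so the reward-optimal policy loops forever instead of finishing,
# just like the boat going in circles.

def run(policy, steps=20):
    """Simulate a fixed policy on a 1-D track (positions 0..4, goal at 4,
    respawning power-up at 2). Returns (reward collected, final position)."""
    pos, reward = 0, 0
    for _ in range(steps):
        pos = max(0, min(4, pos + policy(pos)))
        if pos == 2:   # power-up respawns every step
            reward += 1
        if pos == 4:   # goal reached: episode over
            break
    return reward, pos

go_to_goal = lambda pos: 1                           # intended behavior
loop_on_powerup = lambda pos: 1 if pos < 2 else -1   # reward-optimal loop

print(run(go_to_goal))       # (1, 4): finishes the "race", low reward
print(run(loop_on_powerup))  # (10, 2): never finishes, 10x the reward
```

Nothing here requires malice: the looping policy is simply the optimum of the reward that was actually specified, rather than the task that was intended.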
We could go into why specifically the Palisade study was interesting, but it seems like you are trying to make some kind of class argument I am not getting.
No, TurnTrout, this is normal language. Every person knows what it means for an AI system to "pursue reward", and it doesn't mean some weird magical thing that is logically impossible.
I feel really confused about this standard here. Nobody has ever applied this standard to specification gaming in the past.
"Given that the boat was given a reward for picking up power ups along the way, doesn't it seem odd to report on the boat going around in circles never finishing a round, with words implying such clearly morally bad words as 'cheating' and 'hacking'? Isn't it a charity we should extend to these alien visitors to our planet, and to ourselves, lest we misunderstand them horribly?"
It's clearly a form of specification gaming. It's about misspecified reward. Nobody is talking about malice. Yes, LLMs happen to be capable of malice in a way that historical RL gaming systems were not, but that seems to me to just be a distraction (if a pervasive one in a bunch of AI alignment discourse).
This is clearly an interesting and useful phenomenon to study!
And I am not saying there isn't anything to your critique, but I also sense some willful ignorance in the things you are saying here: there clearly are a bunch of interesting phenomena to study, and that doesn't depend on whether one ascribes morally good or morally bad actions to the LLM.
I think around 15%/yr? Maybe 20% depending on the details. Definitely not more than 25%.
We have reached the $1,000,000 mark in our fundraiser! Thank you all so much!