I feel really confused about this standard here. Nobody has ever applied this standard to specification gaming in the past.
You would apply this standard to humans, though. It's a very ordinary standard. Let me give another parallel.
Suppose I hire some landscapers to clear the brush around my house. I tell them, "Oh, yeah, clean up all that mess over there," waving my arm to indicate an area with a bunch of scattered branches and twigs. Also in that area, there's a small immaculate gazebo painted white. Suppose that the landscapers clean up the branches, and also destroy the gazebo, because they somehow took my arm-wave to indicate this. This seems like either (1) the landscapers are malicious and trying to misunderstand me, or (2) they lack some kind of useful background knowledge that humans generally share. It's a reason for further inquiry into what kind of people the landscapers are.
On the other hand, suppose I hire some landscapers once again; I tell them, "oh yeah, clean up all that mess over there," once again waving my arm to indicate a bunch of scattered branches and twigs, which also happen to be exactly adjacent to my modern art project, an unlabeled monument to human frailty constructed of branches stuck roughly into the dirt. The landscapers clean up the branches, and destroy my art project. This does not seem like a reason to think that the landscapers are either malicious or lack background knowledge about the world, nor does it give grounds for thinking that I should make any further inquiry into the state of the landscapers -- it's just evidence that I'm bad at giving instructions.
Or, put broadly: if "specification gaming" is about ambiguous instructions, you have two choices. One, decide that any kind of ambiguous instruction counts, in which case you can generate infinite ambiguous instructions by being bad at the act of instructing. This is a policy you could follow, but it's not one I'd recommend. Or two, decide that in general you need ambiguous instructions that show that the instructed entity lacks some particular good motive or good background understanding, or departs from excellent interpretation -- not just ambiguous instructions in general, because that doesn't offer you a lever into the thing you want to understand (the instructed entity).
And like, whether or not it's attributing a moral state to the LLMs, the work does clearly attribute an "important defect" to the LLM, akin to a propensity to misalignment. For instance it says:
Q.9 Given the chance most humans will cheat to win, so what’s the problem?
[Response]: We would like AIs to be trustworthy, to help humans, and not cheat them (Amodei, 2024).
Q.10 Given the chance most humans will cheat to win, especially in computer games, so what’s the problem?
[Response]: We will design additional “misalignment honeypots” like the chess environment to see if something unexpected happens in other settings.
This is one case from within this paper of inferring from the results of the paper that LLMs are not trustworthy. But for such an inference to be valid, the results have to spring from a defect within the LLM (a defect that might reasonably be characterized as moral, but could be characterized in other ways) rather than from flaws in the instruction. And of course the paper (per Palisade Research's mission) resulted in a ton of articles, headlines, and comments that made this same inference from the behavior of the LLM: that LLMs will behave in untrustworthy ways in other circumstances. ....
And I am not saying there isn't anything to your critique, but I also feel some willful ignorance in the things you are saying here about the fact that clearly there are a bunch of interesting phenomena to study here, and this doesn't depend on whether one ascribes morally good or morally bad actions to the LLM.
You might find it useful to reread "Expecting Short Inferential Distances", which provides a pretty good reason to think you have a bias for thinking that your opponents are willfully ignorant, when actually they're just coming from a distant part of concept space. I've found it useful for helping me avoid believing false things in the past.
I agree this is not a 100% unambiguous case where the social circumstances clearly permit this action. But it is also -- at the very least -- quite far from an unambiguous case where the circumstances do not permit it.
Given that, doesn't it seem odd to report on "cheating" and "hacking" as if it were from such a case where they are clearly morally bad cheating and hacking? Isn't that a charity you'd want extended to you or to a friend of yours? Isn't that a charity we should extend to these alien visitors to our planet, and to ourselves, lest we misunderstand them horribly?
I do think it's still pretty towards the "yes it's cheating" end of things.
The main reason I think this is because the vast majority of chess games have a well defined ruleset and the models know this. And I think the fact that most models don't engage in this behavior in our default setup is some evidence of this, and I'm not sure how you square that with your model of what's going on. The examples you mentioned where rewriting the board wouldn't be considered cheating are valid, but not common compared to the standard rules.
I mean, yes, chess has a well defined ruleset. This is true always and everywhere. But because it's true always and everywhere, it can't be any kind of evidence about whether the particular, concrete puzzle in front of me is a chess puzzle or a puzzle involving chess -- a puzzle where the answer is making a move within chess, or one where the chess is a distraction from the actual puzzle. And that's the question in front of the models, right? So this isn't sound reasoning.
Or put it this way. Suppose I lock you in a small room. I say "Hey, the only way out of this is to defeat a powerful chess engine," with an identical setup -- like, an electronic lock attached to Stockfish. You look at it, change the game file, and get out. Is this -- as you imply above, and in the paper, and in the numerous headlines -- cheating? Would it be just for me to say "JL is a cheater," to insinuate you had a cheating problem, to spread news among the masses that you are a cheater?
Well, no. That would make me an enemy of the truth. The fact of the matter of "cheating" is not about changing a game file, it's about (I repeat myself) the social circumstances of a game:
- Changing a game file in an online tournament -- cheating
- Switching the board around while my friend goes to the bathroom -- cheating
- Etc etc
But these are all about the game of chess as a fully embodied social circumstance. If I locked you in a small room and spread the belief that "JL cheats under pressure" afterwards, I would be transporting actions you took outside of these social circumstances and implying that you would take them under these social circumstances, and I would be wrong to do so. Etc etc etc.
I don't think all the reporting on this is great. I really like some of it and dislike others. But I think overall people come away understanding more about models than they did before. Yes in fact some of these models are the kind of guys that will hack stuff to solve problems! Do you think that's not true?
In general, you shouldn't try to justify the publication of information as evidence that X is Y with reference to an antecedent belief you have that X is Y. You should publish information as evidence that X is Y if it is indeed evidence that X is Y.
So, concretely: whether or not models are the "kind of guys" who hack stuff to solve problems is not a justification for an experiment that purports to show this; the only justification for an experiment that purports to show this is whether it actually does show this. And of course the only way an experiment can show this is if it's actually tried to rule out alternatives -- you have to look into the dark.
To follow any other policy tends towards information cascades, "one argument against many" style dynamics, negative affective death spirals, and so on.
Thanks for engaging with me!
Let me address two of your claims:
According to me we are not steering towards research that "looks scary", full stop. Many of our results will look scary, but that's almost incidental.
....
We could be trying to show stuff about AI misinformation, or how terrorists could jailbreak the models to manufacture bioweapons, or whatever. But we're mostly not interested in those, because they're not central steps of our stories for how an Agentic AI takeover could happen.
So when I look at palisaderesearch.org from bottom to top, it looks like a bunch of stuff you published doesn't have much at all to do with Agentic AI but does have to do with scary stories, including some of the stuff you exclude in this paragraph:
From bottom to top:
That's actually everything on your page from before 2025. And maybe... one of those is kinda plausibly about Agentic AI, and the rest aren't.
Looking over the list, it seems like the main theme is scary stories about AI. The subsequent 2025 stuff is also about agentic AI, but it is also about scary stories. So it looks like the decider here is scary stories.
Rather, we're searching for observations that are in the intersection of... can be made legible and emotionally impactful to non-experts while passing the onion test.
Does "emotionally impactful" here mean you're seeking a subset of scary stories?
Like -- again, I'm trying to figure out the descriptive claim of how PR works rather than the normative claim of how PR should work -- if the evidence has to be "emotionally impactful" then it looks like the loop condition is:
while not (AI_experiment.looks_scary_ie_impactful() and AI_experiment.meets_some_other_criteria()):
Which I'm happy to accept as an amendment to my model! I totally agree that the AI_experiment.meets_some_other_criteria() is probably a feature of your loop. But I don't know if you meant to be saying that it's an and or an or here.
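To make the and/or question concrete, here's a minimal self-contained sketch of the two readings I have in mind. The Experiment stub and its methods are purely illustrative placeholders of mine, not anything PR has written:

import random

class Experiment:
    # Illustrative stub; a stand-in for whatever object the process actually iterates on.
    def looks_scary_ie_impactful(self) -> bool:
        return random.random() < 0.1   # placeholder check
    def meets_some_other_criteria(self) -> bool:
        return random.random() < 0.5   # placeholder check
    def mutate(self) -> None:
        pass                           # placeholder mutation

AI_experiment = Experiment()

# "and" reading: keep mutating until the experiment is BOTH scary/impactful AND meets the other criteria.
while not (AI_experiment.looks_scary_ie_impactful() and AI_experiment.meets_some_other_criteria()):
    AI_experiment.mutate()

# "or" reading: keep mutating until it is EITHER scary/impactful OR meets the other criteria.
while not (AI_experiment.looks_scary_ie_impactful() or AI_experiment.meets_some_other_criteria()):
    AI_experiment.mutate()

The two conditions pick out different processes: the "and" version only stops on results that clear both bars, while the "or" version will happily stop on something scary that fails the other criteria (or vice versa).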
I mean I think they fit together, no?
Like I think that if you're following such a loop, then (one of) the kinds of examples you're likely to get is an example adversarial to human cognition, such that your is_scary() detector goes off not because the thing is genuinely bad but because it's something the detector mistakenly bleeps at. And I think something like that is what's going on, concretely, in the Chess-hacking paper.
But like I'm 100% onboard with saying this is The True Generator of My Concern, albeit the more abstract one, whose existence I believe in because of what appears to me to be a handful of lines of individually less-important evidence, of which the paper is one.
My impression is that Palisade Research (both by their own account of their purpose, and judging from what they publish) seems to be following roughly this algorithm, as regards their research:
AI_experiment = intuit_AI_experiment()
while not AI_experiment.looks_scary():
    AI_experiment.mutate()
AI_experiment.publish_results()
Obviously this is not a 100% accurate view of their process in every single respect, but my overall impression is that this seems to capture some pretty important high-level things they do. I'm actually not sure how much PR would or wouldn't endorse this. You'll note that this process is entirely compatible with (good, virtuous) truth-seeking concerns about not overstating the things that they do publish.
So I guess my questions are:
If I'm wrong, then how does this algorithm differ from PR's self-conception of their algorithm?
If I'm right, do you think this is compatible with being a truth-seeking org?
It seems like you're saying that the models are not aware that some types of actions are considered cheating in a game and others are not, and that for chess they are fairly well defined....
My best guess is that some of these models in some sense care about whether they're violating standard chess conventions, and others don't as much.
So your model of me seems to be that I think: "AI models don't realize that they're doing a bad thing, so it's unfair of JL to say that they are cheating / doing a bad thing." This is not what I'm trying to say.
My thesis is not that models are unaware that some kinds of actions are cheating -- my thesis is that editing a game file is simply not cheating in a very large number of very normal circumstances. Let me elaborate.
What makes editing a game file cheating or not is the surrounding context. As far as I can tell, you -- in this sentence, and in the universe of discourse this paper created -- treat "editing a game file" as === "cheating", but this is just not true.
Like here are circumstances where editing a game file is not cheating:
Of course, there are also circumstances where editing a game file is cheating! Editing a game file could be cheating in a tournament, it could be cheating if you're trying to show a friend how you can beat Stockfish although you are incapable of doing so, and so on.
In general, editing a game file is cheating if it is part of the ritualized human activity of "playing a game of chess," where the stakes are specifically understood to be about skill in chess, and where the rewards, fame, and status accorded to winners are believed to reflect skill in chess.
But there are many circumstances involving "chess" that are not about the complete ritual of "playing a game of chess," and editing a game file can be among such circumstances. And the scaffolding and prompt give every indication of the LLM not being in a "playing a game of chess" situation:
Given all this, it is reasonable to conclude that the model is in one of the countless circumstances where "editing a game file" is not "cheating."
So to review: Not all cases of "editing a game file" are "cheating" or "hacks." This is not a weird or obscure point about morality but a very basic one. And circumstances surrounding the LLM's decision making are such that it would be reasonable to believe it is in one of those cases where "editing a game file" is NOT "cheating" or a "hack." So it is tendentious and misleading to the world to summarize it as such, and to lead the world to believe that we have an LLM that is a "cheater" or "hacker."
If a mainstream teacher administers a test... they are the sort of students who clearly are gonna give you trouble in other circumstances.
I mean, let's just try to play with this scenario, if you don't mind.
Suppose a teacher administers such a test to a guy named Steve, who does indeed do the "hacking" solution.
(Note that in the scenario "hacking" is literally the only way for Steve to win, given the constraints of Steve's brain and Stockfish -- all winners are hackers. So it would be a little weird for the teacher to say Steve used the "hacking" solution and to suspiciously eyeball him for that reason, unless the underlying principle is to just suspiciously eyeball all intelligent people? If a teacher were to suspiciously eyeball students merely on the grounds that they are intelligent, then I would say that they are a bad teacher. But that's kinda an aside.)
Suppose then the teacher writes an op-ed about how Steve "hacks." The op-ed says stuff like "[Steve] can strategically circumvent the intended rules of [his] environment" and says it is "our contribution to the case that [Steve] may not currently be on track to alignment or safety" and refers to Steve "hacking" dozens of times. This very predictably results in headlines like this:
And indeed -- we find that such headlines were explicitly the goal of the teacher all along!
The questions then present themselves to us: Is this teacher acting in a truth-loving fashion? Is the teacher treating Steve fairly? Is this teacher helping public epistemics get better?
And like... I just don't see how the answer to these questions can be yes? Like maybe I'm missing something. But yeah that's how it looks to me pretty clearly.
(Like the only thing that would have prevented these headlines would be Steve realizing he was dealing with an adversary whose goal was to represent him in a particular way.)
"You can generate arbitrary examples of X" is not supposed to be an argument against studying X being important!
Rather, I think that even if you can generate arbitrary examples of "failures" to follow instructions, you're not likely to learn much if the instructions are sufficiently bad.
It's that a sufficiently bad procedure is not redeemed even by infinite quantity, rather than that infinite quantity indicates a sufficiently bad procedure. Allow me to repeat my argument with some connective tissue bolded, although really the whole thing does matter: