If you have feedback for me, you can fill out the form at https://w-r.me/feedback .
Or you can email me, at [the second letter of the alphabet]@[my username].net
This take is a bit frustrating to me, because the preprint does discuss Rajamanoharan & Nanda's result, and in particular when we tried Rajamanoharan & Nanda's strongest prompt clarification on other models in our initial set, it didn't in fact bring the rate to zero. Which is not to say that it would be impossible to find a prompt that brings the rate low enough to be entirely undetectable for all models - of course you could find such a prompt if you knew that you needed to look for one.
I'm genuinely sorry to ask this (I think a better version of me wouldn't need to ask), but would you be willing to be more specific about your critique here? I wasn't involved with this work but I'm sure I'm at least a bit mindkilled / ideas-as-soldiers-y about it since I work at Palisade, and I think that's making it hard for me to find, e.g., the thing we're saying that's unfair or painting the behavior in an unfair light.
I made a thing that generates strong passwords (>92 bits) that are also easy to remember because they're rhyming nonsense couplets (that also scan pretty well): https://www.benwr.net/2025/07/16/opensesame.html
Ernyyl ernyyl cbjreshy synfu yvtug. Cbvag vg ng fbzr irel sne njnl zngrevny jubfr genwrpgbel lbh pna cerqvpg. Fbzrobql jnagf gb renfr lbhe qngn? Gbb onq; gur orfg gurl pna qb vf oybpx crbcyr sebz ernqvat vg jura vg'f ersyrpgrq. Nyfb arng cebcregl bs guvf vf gung, vs lbh'er fhssvpvragyl pnershy, vg'f "ernq-bapr".
Note that we made all our code available (https://github.com/PalisadeResearch/shutdown_avoidance/) and it's pretty easy to run exactly our code but with your prompt, if you'd like to avoid writing your own script. You need a docker runtime and a python environment with the dependencies installed (and it's easier if you use nix though that's not required), but if you have those things you just have to modify the conf.py file with your prompt and then do something like run --model openai/o3.
There are lots of reasons that a "survival drive" is something we were interested in testing; one reason is that self-preservation has been suggested as a "convergent instrumental goal"; see https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf for a (fairly dated, but seminal) discussion of this.
There are lots of ways for systems to malfunction, but only a small number of those involve actively / surprisingly disabling a shutdown mechanism without any instruction to do so. If the issue was that it got the math problems wrong, that seems like a run-of-the-mill malfunction; if it rewrites the shutdown program, that seems like a much more interesting and specific failure mode, and the word "sabotage" seems appropriate.
I weak-downvoted this comment; I did so for the following reasons:
(a) "Reasoning models as trained do not follow a typical instruction hierarchy of a chat template." doesn't seem true if you mean that they're not intended to follow such a hierarchy (see the link to the system card in the OP), and if you mean that they don't follow it despite being meant/designed to, I don't really see much evidence presented (and don't have any background knowledge that would suggest) that the "Goal, Outcome, Warning, Context" sequence is "bound to" give consistent results.
(b) The comment seems pretty unclear about what you tested and why; e.g. the post you link as an explanation seems to be mostly about irrelevant things, and I don't really want to read the whole thing to figure out what you mean to reference in it. I also don't understand the setting in which you ran this experiment: Were you using the chat interface? If so, there's an existing hidden system prompt that's going to change the results. Were you using our code from github but with your prompt substituted? If so, why only run it 15 times for one model? Your own code with the OpenAI API? Same question, and also I'd like more details in order to understand specifically what you tested.
(c) We explain that a "survival drive" is merely one hypothesis, and mostly not one that we feel is a good explanation for this behavior; far from jumping to anthropomorphic conclusions, we came away feeling that this isn't a good explanation for what we're seeing. We don't mention consciousness at all, as we don't think it's very relevant here.
I wouldn't have downvoted if I hadn't felt that the comment was mostly based around a strawman / semi-hallucinated impression of the OP, or if it had been clearer about what exactly the takeaway is meant to be, or what exactly you tested. I didn't strong-downvote because I appreciated that you shared the prompt you ran, and in general I appreciate that you put in some effort to be empirical about it.
Right, I think it just seems like doing emotional preparation that matches this description is a kind of earthbender-friendly / earthbender-assuming move, while an airbender-friendly move would be more like "notice and accept that you'd have more fun doing it a different way or doing a different thing; that flailing isn't actually fun". The effect is kind of similar, i.e. both earthbenders and airbenders should come away less-clinging-to-something, but the earthbender comes away less-clinging-to "the locally easy and straightforward things will work if I do them enough" while the airbender is less-clinging-to something more like "This is what I'd choose".
Re the last-resort framing, I'm not sure why I said that exactly; I think it's related to the vibe I got from the OP: Like, "if you notice that you're not making progress, what do you do? Well, you could keep flailing or avoidantly doomscrolling, or you could [do the thing I'm suggesting], or you could give up in despair"; I think it feels like a "last resort" because the other realistic options presented are kind of like different kinds of death?
Yeah I can try to say some of them, though my sense of the crucialness here does shift on learning that you mean for this to be about hour-to-week levels of effort. I guess I may as well try to come up with element-bending-flavored ones since I'm in pretty deep on that metaphor here.
The biggest differences in which one I'd recommend as a "default last resort" depend on the person and their strengths rather than the situation.
In a version of the shutdown resistance paper that's currently being reviewed (not included in the preprint yet) the following details are included:
> We began our examination of this topic because we had an intuitive expectation that current LLMs might resist shutdown in settings like this one; we did not discover it by sampling uniformly from the space of all possible or realistic tasks. Specifically, we began our exploration by considering several ways to check for the presence of ``instrumentally convergent'' behavior from current LLMs. In addition to shutdown resistance, we considered ways to elicit self-replication or resource acquisition. We then did some exploratory work in each area, and found that shutdown resistance was very easy to elicit in simple settings like the one presented in this paper: The environment we present here has not been substantially modified from our initial working implementation, and the initial prompt we present (Prompt A) differs mainly in that we corrected some typographical errors. All our experiments exploring shutdown resistance, including very nearly all exploratory work and experiments performed while developing the technical environment, are available such that a reader can examine the sequence of experiments that produced the specific prompts presented here.
...
> All of our experimental results and inspect traces are also available, collected in a table at https://shutres.fyi/all-experiments, including nearly all exploratory work, experiments performed while testing our technical environment, and smaller runs of the final prompts. The only results not included in this table were (a) fewer than ten experiments performed during the initial development of the technical environment (with N <= 10 each), or (b) due to accidental misconfigurations in logging the results, which occurred less than five times during development and were followed by running the same experiments again with the configurations corrected.
(tbc this obviously isn't the same as a commitment to show all negative results, it just seemed moderately relevant and potentially useful)