benwr — LessWrong

LESSWRONG
LW

I made a thing that generates strong passwords (>92 bits) that are also easy to remember because they're rhyming nonsense couplets (that also scan pretty well): https://www.benwr.net/2025/07/16/opensesame.html
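For a sense of what a figure like ">92 bits" means, here is a minimal sketch of the usual entropy arithmetic. It is not the linked generator's method: it assumes every slot is filled uniformly and independently, and the 4096-word list and the function names are hypothetical, chosen only for illustration.

```python
import math

def entropy_bits(choices_per_slot):
    """Bits of entropy when each slot is chosen uniformly and independently."""
    return sum(math.log2(n) for n in choices_per_slot)

def slots_needed(target_bits, list_size):
    """Smallest number of equal-sized slots whose total entropy reaches target_bits."""
    return math.ceil(target_bits / math.log2(list_size))

# Hypothetical example: eight words, each drawn from a 4096-word list (12 bits each).
print(entropy_bits([4096] * 8))  # 96.0
print(slots_needed(92, 4096))    # 8
```

Constraints such as rhyme or meter would shrink the number of valid choices per slot, so a real scheme has to count only the combinations it can actually emit.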

benwr's unpolished thoughts

benwr, 3mo

Ernyyl ernyyl cbjreshy synfu yvtug. Cbvag vg ng fbzr irel sne njnl zngrevny jubfr genwrpgbel lbh pna cerqvpg. Fbzrobql jnagf gb renfr lbhe qngn? Gbb onq; gur orfg gurl pna qb vf oybpx crbcyr sebz ernqvat vg jura vg'f ersyrpgrq. Nyfb arng cebcregl bs guvf vf gung, vs lbh'er fhssvpvragyl pnershy, vg'f "ernq-bapr".

benwr's unpolished thoughts

benwr, 3mo

Are there any storage media that are basically impossible to destroy/erase? Answer in rot13 ITT.
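For anyone who wants to write or read rot13 replies without a web tool, a minimal sketch using Python's standard `codecs` module follows. The helper name `rot13` and the sample string are just illustrative. Since rot13 is its own inverse, the same call both hides and reveals text.

```python
import codecs

def rot13(text: str) -> str:
    """Rotate each ASCII letter by 13 places; other characters pass through unchanged."""
    return codecs.encode(text, "rot_13")

sample = "Hello, world!"
hidden = rot13(sample)
print(hidden)         # Uryyb, jbeyq!
print(rot13(hidden))  # Hello, world!
```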

Shutdown Resistance in Reasoning Models

benwr, 3mo

Note that we made all our code available (https://github.com/PalisadeResearch/shutdown_avoidance/) and it's pretty easy to run exactly our code but with your prompt, if you'd like to avoid writing your own script. You need a Docker runtime and a Python environment with the dependencies installed (it's easier if you use Nix, though that's not required), but if you have those things you just have to modify the conf.py file with your prompt and then do something like run --model openai/o3.

There are lots of reasons that a "survival drive" is something we were interested in testing; one reason is that self-preservation has been suggested as a "convergent instrumental goal"; see https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf for a (fairly dated, but seminal) discussion of this.

There are lots of ways for systems to malfunction, but only a small number of those involve actively / surprisingly disabling a shutdown mechanism without any instruction to do so. If the issue were that it got the math problems wrong, that would seem like a run-of-the-mill malfunction; if it rewrites the shutdown program, that seems like a much more interesting and specific failure mode, and the word "sabotage" seems appropriate.

Shutdown Resistance in Reasoning Models

benwr, 3mo

I weak-downvoted this comment; I did so for the following reasons:

(a) "Reasoning models as trained do not follow a typical instruction hierarchy of a chat template." doesn't seem true if you mean that they're not intended to follow such a hierarchy (see the link to the system card in the OP), and if you mean that they don't follow it despite being meant/designed to, I don't really see much evidence presented (and don't have any background knowledge that would suggest) that the "Goal, Outcome, Warning, Context" sequence is "bound to" give consistent results.
(b) The comment seems pretty unclear about what you tested and why; e.g. the post you link as an explanation seems to be mostly about irrelevant things, and I don't really want to read the whole thing to figure out what you mean to reference in it. I also don't understand the setting in which you ran this experiment: Were you using the chat interface? If so, there's an existing hidden system prompt that's going to change the results. Were you using our code from GitHub but with your prompt substituted? If so, why only run it 15 times for one model? Your own code with the OpenAI API? Same question, and also I'd like more details in order to understand specifically what you tested.
(c) We explain that a "survival drive" is merely one hypothesis, and mostly not one that we feel is a good explanation for this behavior; far from jumping to anthropomorphic conclusions, we came away feeling that this isn't a good explanation for what we're seeing. We don't mention consciousness at all, as we don't think it's very relevant here.

I wouldn't have downvoted if I hadn't felt that the comment was mostly based around a strawman / semi-hallucinated impression of the OP, or if it had been clearer about what exactly the takeaway is meant to be, or what exactly you tested. I didn't strong-downvote because I appreciated that you shared the prompt you ran, and in general I appreciate that you put in some effort to be empirical about it.

"Buckle up bucko, and get ready for multiple hard cognitive steps."

benwr, 3mo

Right, I think it just seems like doing emotional preparation that matches this description is a kind of earthbender-friendly / earthbender-assuming move, while an airbender-friendly move would be more like "notice and accept that you'd have more fun doing it a different way or doing a different thing; that flailing isn't actually fun". The effect is kind of similar, i.e. both earthbenders and airbenders should come away less-clinging-to-something, but the earthbender comes away less-clinging-to "the locally easy and straightforward things will work if I do them enough" while the airbender is less-clinging-to something more like "This is what I'd choose".

Re the last-resort framing, I'm not sure why I said that exactly; I think it's related to the vibe I got from the OP: Like, "if you notice that you're not making progress, what do you do? Well, you could keep flailing or avoidantly doomscrolling, or you could [do the thing I'm suggesting], or you could give up in despair"; I think it feels like a "last resort" because the other realistic options presented are kind of like different kinds of death?

"Buckle up bucko, and get ready for multiple hard cognitive steps."

benwr, 3mo

Yeah I can try to say some of them, though my sense of the crucialness here does shift on learning that you mean for this to be about hour-to-week levels of effort. I guess I may as well try to come up with element-bending-flavored ones since I'm in pretty deep on that metaphor here.

The biggest differences in which one I'd recommend as a "default last resort" depend on the person and their strengths rather than the situation.

For an airbender: the thing recommended above, i.e. "retreat to a safe distance and consider if there's anything that seems like a fun challenge instead of a grind".
For a firebender: "Commit to doing it really hard for about [30 minutes]. See where you get. Wait [2 * 30 minutes] after that, and check if you feel energized or drained. If energized, repeat; if drained, [I don't know; this is the one I have the least familiarity with]".
For a waterbender: Something like "Do naturalism to it". "When you make good enough observations, you can't help but make high-quality inferences" is one of my favorite quotes, and I think it applies especially well in this kind of last-resort setting.

"Buckle up bucko, and get ready for multiple hard cognitive steps."

benwr, 3mo

I think I maybe mean to say a slightly different thing than came across, which makes sense because I was leaning heavily into the metaphor rather than trying to be very clear.

I think the triggers are definitely hints in the direction that buckling up might be the right move. Yet I also observe that, when I imagine the median or even 80th percentile person-explicitly-buckling-up-for-something-big, a big part of me wants to shake my head and be like "ah well, it was nice while it lasted" about their chances of doing a hard thing, especially an unusual hard thing.

This part of me is clearly wrong sometimes: My head would have shaken well off my shoulders if someone had told me-transported-to-1986, "Andrew Wiles is going to spend the next 6 years trying in secret solitude to prove Fermat's Last Theorem".

But also I think it's clearly not entirely mistaken about such a person's odds. If a person is a native "airbender", i.e. they are deeply familiar with the "taking their situation lightly" stance, I don't think I want to recommend that they take a sense of "welp, I guess I have to finally buckle up for this one" at face value, especially in the context of it being a last-resort for one of their most challenging projects or goals. It feels to me like such a person is more likely to succeed by (a) deciding to stop flailing, (b) retreating to a safe distance, and (c) reevaluating whether this is the path they really want to follow, while in connection with their sense of fun.

"Buckle up bucko, and get ready for multiple hard cognitive steps."

benwr, 3mo

Sure; if it's not obvious, they're from the universe of Avatar: The Last Airbender.

Earthbending is substantially about: facing things head-on, "just getting it done", "buckling down" (though I suppose this can be different than "buckling up"), being unyielding, orienting around "grit".

Waterbending is substantially about: Being flexible, responsive to the environment, and careful.

Airbending is substantially about: Freedom of movement and action, using an opponent's strength against them (which in PvE looks more like "doing what's easy and/or fun"), speed.

The other one is firebending, but I didn't reference it and I don't really understand it well enough to put it in the same terms; still, my best gloss attempt is that it's about focused and kinda bursty / lower-endurance intensity.

In the universe there are (usually) only these four elements, and most benders are pure specialists / only physically capable of learning their particular kind of bending; the main character (the avatar) is the only person who can / has to learn all four. In my analogy these elements don't really cover the space of motivational structures that well, and anyway people don't have to be specialists.

"Buckle up bucko, and get ready for multiple hard cognitive steps."

benwr, 3mo

You don't really mention <a thing that I think is extremely crucial> in this domain, which is that you do not have to (metaphorically) be an earthbender about everything. Other types of bending also exist. If you are not a native earthbender, you might be able to learn to do it (the real world does not have only one Chosen One who can bend all the elements), but as a meta-waterbender I personally recommend first looking around carefully, and trying to figure out how the most successful benders of your native element are doing it.

You do seem to see earthbending as maybe a "last resort" rather than the only way to do things, but it's not obvious to me that it's the correct last resort for everyone. The last resort of a successful airbender is probably more like "take even more steps back to see if there are any easier but more oblique approaches to this summit, or even other summits you'd actually rather climb";
