LESSWRONG

benwr

If you have feedback for me, you can fill out the form at https://w-r.me/feedback.

Or you can email me, at [the second letter of the alphabet]@[my username].net

Posts

Comments

benwr's unpolished thoughts (4 karma, 6y, 48 comments)
Shutdown Resistance in Reasoning Models
benwr · 5d · karma 6 · agreement 0

Note that we made all our code available (https://github.com/PalisadeResearch/shutdown_avoidance/), and it's pretty easy to run exactly our code with your own prompt if you'd like to avoid writing your own script. You need a Docker runtime and a Python environment with the dependencies installed (it's easier if you use nix, though that's not required). With those in place, you just have to modify the conf.py file with your prompt and then do something like run --model openai/o3.

There are lots of reasons that a "survival drive" is something we were interested in testing; one reason is that self-preservation has been suggested as a "convergent instrumental goal"; see https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf for a (fairly dated, but seminal) discussion of this.

There are lots of ways for systems to malfunction, but only a small number of those involve actively / surprisingly disabling a shutdown mechanism without any instruction to do so. If the issue was that it got the math problems wrong, that seems like a run-of-the-mill malfunction; if it rewrites the shutdown program, that seems like a much more interesting and specific failure mode, and the word "sabotage" seems appropriate.

Shutdown Resistance in Reasoning Models
benwr · 5d · karma 4 · agreement 0

I weak-downvoted this comment; I did so for the following reasons:

(a) "Reasoning models as trained do not follow a typical instruction hierarchy of a chat template." doesn't seem true if you mean that they're not intended to follow such a hierarchy (see the link to the system card in the OP). If you mean that they don't follow it despite being meant/designed to, I don't really see much evidence presented (and don't have any background knowledge that would suggest) that the "Goal, Outcome, Warning, Context" sequence is "bound to" give consistent results.
(b) The comment seems pretty unclear about what you tested and why; e.g. the post you link as an explanation seems to be mostly about irrelevant things, and I don't really want to read the whole thing to figure out what you mean to reference in it. I also don't understand the setting in which you ran this experiment: Were you using the chat interface? If so, there's an existing hidden system prompt that's going to change the results. Were you using our code from GitHub but with your prompt substituted? If so, why only run it 15 times for one model? Were you using your own code with the OpenAI API? Same question, and also I'd like more details in order to understand specifically what you tested.
(c) We explain that a "survival drive" is merely one hypothesis, and mostly not one that we feel is a good explanation for this behavior; far from jumping to anthropomorphic conclusions, we came away feeling that this isn't a good explanation for what we're seeing. We don't mention consciousness at all, as we don't think it's very relevant here.

I wouldn't have downvoted if I hadn't felt that the comment was mostly based around a strawman / semi-hallucinated impression of the OP, or if it had been clearer about what exactly the takeaway is meant to be, or what exactly you tested. I didn't strong-downvote because I appreciated that you shared the prompt you ran, and in general I appreciate that you put in some effort to be empirical about it.

"Buckle up bucko, this ain't over till it's over."
benwr · 7d · karma 5 · agreement 0

Right, I think it just seems like doing emotional preparation that matches this description is a kind of earthbender-friendly / earthbender-assuming move, while an airbender-friendly move would be more like "notice and accept that you'd have more fun doing it a different way or doing a different thing; that flailing isn't actually fun". The effect is kind of similar, i.e. both earthbenders and airbenders should come away less-clinging-to-something, but the earthbender comes away less-clinging-to "the locally easy and straightforward things will work if I do them enough" while the airbender is less-clinging-to something more like "This is what I'd choose".

Re the last-resort framing, I'm not sure why I said that exactly; I think it's related to the vibe I got from the OP: Like, "if you notice that you're not making progress, what do you do? Well, you could keep flailing or avoidantly doomscrolling, or you could [do the thing I'm suggesting], or you could give up in despair"; I think it feels like a "last resort" because the other realistic options presented are kind of like different kinds of death?

"Buckle up bucko, this ain't over till it's over."
benwr · 7d · karma 2 · agreement 0

Yeah I can try to say some of them, though my sense of the crucialness here does shift on learning that you mean for this to be about hour-to-week levels of effort. I guess I may as well try to come up with element-bending-flavored ones since I'm in pretty deep on that metaphor here.

The biggest differences in which one I'd recommend as a "default last resort" depend on the person and their strengths rather than the situation.
 

  • The thing recommended for the airbender above, i.e. "retreat to a safe distance and consider if there's anything that seems like a fun challenge instead of a grind".
  • For a firebender: "Commit to doing it really hard for about [30 minutes]. See where you get. Wait [2 * 30 minutes] after that, and check if you feel energized or drained. If energized, repeat; if drained, [I don't know; this is the one I have the least familiarity with]"
  • For a waterbender: Something like "Do naturalism to it". "When you make good enough observations, you can't help but make high-quality inferences" is one of my favorite quotes, and I think it applies especially well in this kind of last-resort setting.
"Buckle up bucko, this ain't over till it's over."
benwr · 7d · karma 1 · agreement 0

I think I maybe mean to say a slightly different thing than came across, which makes sense because I was leaning heavily into the metaphor rather than trying to be very clear.

I think the triggers are definitely hints in the direction that buckling up might be the right move. Yet I also observe that, when I imagine the median or even 80th percentile person-explicitly-buckling-up-for-something-big, a big part of me wants to shake my head and be like "ah well, it was nice while it lasted" about their chances of doing a hard thing, especially an unusual hard thing.

This part of me is clearly wrong sometimes: My head would have shaken well off my shoulders if someone had told me-transported-to-1986, "Andrew Wiles is going to spend the next 6 years trying in secret solitude to prove Fermat's Last Theorem".

But also I think it's clearly not entirely mistaken about such a person's odds. If a person is a native "airbender", i.e. they are deeply familiar with the "taking their situation lightly" stance, I don't think I want to recommend that they take a sense of "welp, I guess I have to finally buckle up for this one" at face value, especially in the context of it being a last resort for one of their most challenging projects or goals. It feels to me like such a person is more likely to succeed by (a) deciding to stop flailing, (b) retreating to a safe distance, and (c) reevaluating whether this is the path they really want to follow, while staying connected to their sense of fun.

"Buckle up bucko, this ain't over till it's over."
benwr · 7d · karma 6 · agreement 0

Sure; if it's not obvious, they're from the universe of Avatar: The Last Airbender.

Earthbending is substantially about: facing things head-on, "just getting it done", "buckling down" (though I suppose this can be different than "buckling up"), being unyielding, orienting around "grit".

Waterbending is substantially about: Being flexible, responsive to the environment, and careful.

Airbending is substantially about: Freedom of movement and action, using an opponent's strength against them (which in PvE looks more like "doing what's easy and/or fun"), speed.

The other one is firebending, but I didn't reference it and I don't really understand it well enough to put it in the same terms; still, my best gloss attempt is that it's about focused and kinda bursty / lower-endurance intensity.

In the universe there are (usually) only these four elements, and most benders are pure specialists / only physically capable of learning their particular kind of bending; the main character (the avatar) is the only person who can / has to learn all four. In my analogy these elements don't really cover the space of motivational structures that well, and anyway people don't have to be specialists.

"Buckle up bucko, this ain't over till it's over."
benwr · 7d · karma 5 · agreement 0

You don't really mention <a thing that I think is extremely crucial> in this domain, which is that you do not have to (metaphorically) be an earthbender about everything. Other types of bending also exist. If you are not a native earthbender, you might be able to learn to do it (the real world does not have only one Chosen One who can bend all the elements), but as a meta-waterbender I personally recommend first looking around carefully, and trying to figure out how the most successful benders of your native element are doing it.

You do seem to see earthbending as maybe a "last resort" rather than the only way to do things, but it's not obvious to me that it's the correct last resort for everyone. The last resort of a successful airbender is probably more like "take even more steps back to see if there are any easier but more oblique approaches to this summit, or even other summits you'd actually rather climb";

It's 'Well, actually...' all the way down
benwr · 2mo · karma 2 · agreement 0

I mean to use "the precise definition" to identify it as the one that isn't "the specific definition" (based on the earlier comparison), not to say that it's the only precise definition or anything like that. i.e. I could have said "the comparatively more precise definition of these two" instead.

It's 'Well, actually...' all the way down
benwr · 2mo · karma 7 · agreement 0

I wonder if we should bet about something here. It seems plausible that we would make different predictions about how much agreement there would be on what is a "chemical", if you were to explain about the structure, manufacturing, and acquisition of a given substance.

It's 'Well, actually...' all the way down
benwr · 2mo (edited) · karma 5 · agreement 7

I think we disagree about what a definition is, in natural language.

I think it's fair to say that you and I have an internally-consistent-though-vague definition of the word "sandwich", even though I'm sure there are astronomically many edge-case examples for which we would answer differently. And in fact this is probably true for almost literally every non-mathematical noun or verb that we both "know", albeit to varying degrees.

Edited to add: It suddenly seems likely to me that I should be using the word "meaning" rather than "definition" here, but the rest of the post goes through fine if I make that switch.

Shutdown Resistance in Reasoning Models (132 karma, 8d, 14 comments)
It's 'Well, actually...' all the way down (40 karma, 2mo, 34 comments)
Information throughput of biological humans and frontier LLMs (12 karma, 5mo, 0 comments)
Biological humans collectively exert at most 400 gigabits/s of control over the world. (15 karma, 5mo, 3 comments)
Not all capabilities will be created equal: focus on strategically superhuman agents (62 karma, 5mo, 8 comments)
Bounty for Evidence on Some of Palisade Research's Beliefs (46 karma, 10mo, 4 comments)
11 diceware words is enough (23 karma, 1y, 6 comments)
What policies have most thoroughly crippled (otherwise-promising) industries or technologies? (question, 40 karma, 3y, 4 comments)
A Litany Missing from the Canon (39 karma, 3y, 3 comments)
Sneaking Suspicion (17 karma, 3y, 2 comments)