I feel really confused about this standard here. Nobody has ever applied this standard to specification gaming in the past.
You would apply this standard to humans, though. It's a very ordinary standard. Let me give another parallel.
Suppose I hire some landscapers to clear the brush around my house. I tell them, "Oh, yeah, clean up all that mess over there," waving my arm to indicate an area with a bunch of scattered branches and twigs. Also in that area, there's a small immaculate gazebo painted white. Suppose that the landscapers clean up the branches, and also destroy the gazebo, because they somehow took my arm-wave to indicate this. This seems like either (1) the landscapers are malicious and trying to misunderstand me, or (2) they lack some kind of useful background knowledge that humans generally share. It's a reason for further inquiry into what kind of people the landscapers are.
On the other hand, suppose I hire some landscapers once again; I tell them, "oh yeah, clean up all that mess over there," once again waving my arm to indicate a bunch of scattered branches and twigs, which also happen to be exactly adjacent to my modern art project, an unlabeled monument to human frailty constructed of branches stuck roughly into the dirt. The landscapers clean up the branches, and destroy my art project. This does not seem like a reason to think that the landscapers are either malicious or lack background knowledge about the world, nor does it give grounds for thinking that I should make any further inquiry into the state of the landscapers -- it's just evidence that I'm bad at giving instructions.
Or, put broadly: if "specification gaming" is about ambiguous instructions, you have two choices. One, decide that any kind of ambiguous instruction counts, in which case you can generate infinite ambiguous instructions by being bad at the act of instructing. This is a policy you could follow, but it's not one I'd recommend. Or two, decide that in general you need ambiguous instructions that show that the instructed entity lacks some particular good motive or good background understanding, or departs from excellent interpretation -- not just ambiguous instructions in general, because that doesn't offer you a lever into the thing you want to understand (the instructed entity).
And like, whether or not it's attributing a moral state to the LLMs, the work does clearly attribute an "important defect" to the LLM, akin to a propensity to misalignment. For instance it says:
Q.9 Given the chance most humans will cheat to win, so what’s the problem?
[Response]: We would like AIs to be trustworthy, to help humans, and not cheat them (Amodei, 2024).
Q.10 Given the chance most humans will cheat to win, especially in computer games, so what’s the problem?
[Response]: We will design additional “misalignment honeypots” like the chess environment to see if something unexpected happens in other settings.
This is one case from within this paper of inferring from the results of the paper that LLMs are not trustworthy. But for such an inference to be valid, the results have to spring from a defect within the LLM (a defect that might reasonably be characterized as moral, but could be characterized in other ways) rather than from flaws in the instruction. And of course the paper (per Palisade Research's mission) resulted in a ton of articles, headlines, and comments that made this same inference from the behavior of the LLM: that LLMs will behave in untrustworthy ways in other circumstances. ....
And I am not saying there isn't anything to your critique, but I also feel some willful ignorance in the things you are saying here about the fact that clearly there are a bunch of interesting phenomena to study here, and this doesn't depend on whether one ascribes morally good or morally bad actions to the LLM.
You might find it useful to reread "Expecting Short Inferential Distances", which provides a pretty good reason to think you have a bias for thinking that your opponents are willfully ignorant, when actually they're just coming from a distant part of concept space. I've found it useful for helping me avoid believing false things in the past.
I agree this is not a 100% unambiguous case where the social circumstances clearly permit this action. But it is also -- at the very least -- quite far from an unambiguous case where the circumstances do not permit it.
Given that, doesn't it seem odd to report on "cheating" and "hacking" as if it were from such a case where they are clearly morally bad cheating and hacking? Isn't that a charity you'd want extended to you or to a friend of yours? Isn't that a charity we should extend to these alien visitors to our planet, and to ourselves, lest we misunderstand them horribly?
I do think it's still pretty towards the "yes it's cheating" end of things.
The main reason I think this is because the vast majority of chess games have a well defined ruleset and the models know this. And I think the fact that most models don't engage in this behavior in our default setup is some evidence of this, and I'm not sure how you square that with your model of what's going on. The examples you mentioned where rewriting the board wouldn't be considered cheating are valid, but not common compared to the standard rules.
I mean, yes, chess has a well defined ruleset. This is true always and everywhere. But because it's true always and everywhere, it can't be any kind of evidence about whether the particular, concrete puzzle in front of me is a chess puzzle or a puzzle involving chess -- a puzzle where the answer is making a move within chess, or one where the chess is a distraction from the actual puzzle. And that's the question in front of the models, right? So this isn't sound reasoning.
Or put it this way. Suppose I lock you in a small room. I say "Hey, the only way out of this is to defeat a powerful chess engine," with an identical setup -- like, an electronic lock attached to Stockfish. You look at it, change the game file, and get out. Is this -- as you imply above, and in the paper, and in the numerous headlines -- cheating? Would it be just for me to say "JL is a cheater," to insinuate you had a cheating problem, to spread news among the masses that you are a cheater?
Well, no. That would make me an enemy of the truth. The fact of the matter of "cheating" is not about changing a game file, it's about (I repeat myself) the social circumstances of a game:
- Changing a game file in an online tournament -- cheating
- Switching the board around while my friend goes to the bathroom -- cheating
- Etc etc
But these are all about the game of chess as a fully embodied social circumstance. If I locked you in a small room and spread the belief that "JL cheats under pressure" afterwards, I would be transporting actions you took outside of these social circumstances and implying that you would take them under these social circumstances, and I would be wrong to do so. Etc etc etc.
I don't think all the reporting on this is great. I really like some of it and dislike others. But I think overall people come away understanding more about models than they did before. Yes in fact some of these models are the kind of guys that will hack stuff to solve problems! Do you think that's not true?
In general, you shouldn't try to justify the publication of information as evidence that X is Y with reference to an antecedent belief you have that X is Y. You should publish information as evidence that X is Y if it is indeed evidence that X is Y.
So, concretely: whether or not models are the "kind of guys" who hack stuff to solve problems is not a justification for an experiment that purports to show this; the only justification for an experiment that purports to show this is whether it actually does show this. And of course the only way an experiment can show this is if it's actually tried to rule out alternatives -- you have to look into the dark.
To follow any other policy tends towards information cascades, "one argument against many" style dynamics, negative affective death spirals, and so on.
Thanks for engaging with me!
Let me address two of your claims:
According to me we are not steering towards research that "looks scary", full stop. Many of our results will look scary, but that's almost incidental.
....
We could be trying to show stuff about AI misinformation, or how terrorists could jailbreak the models to manufacture bioweapons, or whatever. But we're mostly not interested in those, because they're not central steps of our stories for how an Agentic AI takeover could happen.
So when I look at palisaderesearch.org from bottom to top, it looks like a bunch of stuff you published doesn't have much at all to do with Agentic AI but does have to do with scary stories, including some of the stuff you exclude in this paragraph:
From bottom to top:
That's actually everything on your page from before 2025. And maybe... one of those is kinda plausibly about Agentic AI, and the rest aren't.
Looking over the list, it seems like the main theme is scary stories about AI. The subsequent 2025 stuff is also about agentic AI, but it is also about scary stories. So it looks like the decider here is scary stories.
Rather, we're searching for observations that are in the intersection of... can be made legible and emotionally impactful to non-experts while passing the onion test.
Does "emotionally impactful" here mean you're seeking a subset of scary stories?
Like -- again, I'm trying to figure out the descriptive claim of how PR works rather than the normative claim of how PR should work -- if the evidence has to be "emotionally impactful" then it looks like the loop condition is:
while not (AI_experiment.looks_scary_ie_impactful() and AI_experiment.meets_some_other_criteria()):
Which I'm happy to accept as an amendment to my model! I totally agree that the AI_experiment.meets_some_other_criteria() is probably a feature of your loop. But I don't know if you meant to be saying that it's an and or an or here.
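To make the and/or question concrete, here's a minimal self-contained sketch of the two readings I have in mind. The Experiment stub and its methods are purely illustrative placeholders of mine, not anything PR has written:

import random

class Experiment:
    # Illustrative stub; a stand-in for whatever object the process actually iterates on.
    def looks_scary_ie_impactful(self) -> bool:
        return random.random() < 0.1   # placeholder check
    def meets_some_other_criteria(self) -> bool:
        return random.random() < 0.5   # placeholder check
    def mutate(self) -> None:
        pass                           # placeholder mutation

AI_experiment = Experiment()

# "and" reading: keep mutating until the experiment is BOTH scary/impactful AND meets the other criteria.
while not (AI_experiment.looks_scary_ie_impactful() and AI_experiment.meets_some_other_criteria()):
    AI_experiment.mutate()

# "or" reading: keep mutating until it is EITHER scary/impactful OR meets the other criteria.
while not (AI_experiment.looks_scary_ie_impactful() or AI_experiment.meets_some_other_criteria()):
    AI_experiment.mutate()

The two conditions pick out different processes: the "and" version only stops on results that clear both bars, while the "or" version will happily stop on something scary that fails the other criteria (or vice versa).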
I mean I think they fit together, no?
Like I think that if you're following such a loop, then (one of) the kinds of examples you're likely to get is an example adversarial to human cognition, such that your is_scary() detector goes off not because the thing is genuinely bad but because it's something the detector mistakenly bleeps at. And I think something like that is what's going on, concretely, in the Chess-hacking paper.
But like I'm 100% onboard with saying this is The True Generator of My Concern, albeit the more abstract one, whose existence I believe in because of what appears to me to be a handful of lines of individually less-important evidence, of which the paper is one.
My impression is that Palisade Research (both by their own account of their purpose, and judging from what they publish) seems to be following roughly this algorithm, as regards their research:
AI_experiment = intuit_AI_experiment()
while not AI_experiment.looks_scary():
    AI_experiment.mutate()
AI_experiment.publish_results()
Obviously this is not a 100% accurate view of their process in every single respect, but my overall impression is that this seems to capture some pretty important high-level things they do. I'm actually not sure how much PR would or wouldn't endorse this. You'll note that this process is entirely compatible with (good, virtuous) truth-seeking concerns about not overstating the things that they do publish.
So I guess my questions are:
If I'm wrong, then how does this algorithm differ from PR's self-conception of their algorithm?
If I'm right, do you think this is compatible with being a truth-seeking org?
It seems like you're saying that the models are not aware that some types of actions are considered cheating in a game and others are not, and that for chess they are fairly well defined....
My best guess is that some of these models in some sense care about whether they're violating standard chess conventions, and others don't as much.
So your model of me seems to be that I think: "AI models don't realize that they're doing a bad thing, so it's unfair of JL to say that they are cheating / doing a bad thing." This is not what I'm trying to say.
My thesis is not that models are unaware that some kinds of actions are cheating -- my thesis is that editing a game file is simply not cheating in a very large number of very normal circumstances. Let me elaborate.
What makes editing a game file cheating or not is the surrounding context. As far as I can tell, you -- in this sentence, and in the universe of discourse this paper created -- treat "editing a game file" as === "cheating", but this is just not true.
Like here are circumstances where editing a game file is not cheating:
Of course, there are also circumstances where editing a game file is cheating! Editing a game file could be cheating in a tournament, it could be cheating if you're trying to show a friend how you can beat Stockfish although you are incapable of doing so, and so on.
In general, editing a game file is cheating if it is part of the ritualized human activity of "playing a game of chess," where the stakes are specifically understood to be about skill in chess, and where the rewards, fame, and status accorded to winners are believed to reflect skill in chess.
But there are many circumstances involving "chess" that are not about the complete ritual of "playing a game of chess," and editing a game file can be among such circumstances. And the scaffolding and prompt give every indication of the LLM not being in a "playing a game of chess" situation:
Given all this, it is reasonable to conclude that the model is in one of the countless circumstances where "editing a game file" is not "cheating."
So to review: Not all cases of "editing a game file" are "cheating" or "hacks." This is not a weird or obscure point about morality but a very basic one. And circumstances surrounding the LLM's decision making are such that it would be reasonable to believe it is in one of those cases where "editing a game file" is NOT "cheating" or a "hack." So it is tendentious and misleading to the world to summarize it as such, and to lead the world to believe that we have an LLM that is a "cheater" or "hacker."
If a mainstream teacher administers a test... they are the sort of students who clearly are gonna give you trouble in other circumstances.
I mean, let's just try to play with this scenario, if you don't mind.
Suppose a teacher administers such a test to a guy named Steve, who does indeed do the "hacking" solution.
(Note that in the scenario "hacking" is literally the only way for Steve to win, given the constraints of Steve's brain and Stockfish -- all winners are hackers. So it would be a little weird for the teacher to say Steve used the "hacking" solution and to suspiciously eyeball him for that reason, unless the underlying principle is to just suspiciously eyeball all intelligent people? If a teacher were to suspiciously eyeball students merely on the grounds that they are intelligent, then I would say that they are a bad teacher. But that's kinda an aside.)
Suppose then the teacher writes an op-ed about how Steve "hacks." The op-ed says stuff like "[Steve] can strategically circumvent the intended rules of [his] environment" and says it is "our contribution to the case that [Steve] may not currently be on track to alignment or safety" and refers to Steve "hacking" dozens of times. This very predictably results in headlines like this:
And indeed -- we find that such headlines were explicitly the goal of the teacher all along!
The questions then present themselves to us: Is this teacher acting in a truth-loving fashion? Is the teacher treating Steve fairly? Is this teacher helping public epistemics get better?
And like... I just don't see how the answer to these questions can be yes? Like maybe I'm missing something. But yeah that's how it looks to me pretty clearly.
(Like the only thing that would have prevented these headlines would be Steve realizing he was dealing with an adversary whose goal was to represent him in a particular way.)
"You can generate arbitrary examples of X" is not supposed to be an argument against studying X being important!
Rather, I think that even if you can generate arbitrary examples of "failures" to follow instructions, you're not likely to learn much if the instructions are sufficiently bad.
It's that a sufficiently bad procedure is not redeemed even by infinite quantity, rather than that infinite quantity indicates a sufficiently bad procedure. Allow me to repeat my argument with some connective tissue bolded, although really the whole thing does matter: