You know who else is completely simple inside, with no internal degrees of freedom, and always wills the same thing eternally unceasingly?
Yeah that's right, the Medieval Catholic Scholastic God.
Do you have examples of the kind of models / kind of questions that lead to this most strongly? I've been collecting behaviors but it's slow work reading a lot of CoTs and so anything would be welcome :)
Note that Nostalgebraist and Olli's comments on the original paper argue (imo cogently) that the original paper's framing is pretty misleading / questionable.
It looks like many of their points would carry over to this.
Combine it with other options and you rapidly build up a decent channel. And those options are myriad.
So here's the thing: your word choice between "robust plan" or "reasonable plan" or "clear plan" or "smart plan" already matters. It already carries information -- I might say, "Well, that's a reasonable plan but not a robust plan," or "That seems like a smart plan, but it's not clear." There are infinite shades of meaning like this in English, and competence in using English entails command over all these shades.
So the question then is -- why would the LLM build up some channel, along angles entirely orthogonal to these existing shades, when these shades are here? You could decide on some encoding between firm / robust / sound / reliable -- but these things already have meaning, wouldn't you just use the meaning already there? There are a bunch of papers on how RL is mostly magnifying existing patterns in the LLMs, and is teaching how to employ skills already latent in the training data, rather than teaching totally new things -- if this is so, wouldn't we expect them to keep using these terms? Or there is work about how it's genuinely hard for LLMs to reason steganographically, even when we try to teach them to -- doesn't that incline against such a belief?
Or put alternately -- I can, within myself, connect "robust" and "reasonable" and so on with a huge internal-to-my-brain channel, containing enormous bandwidth! If I wanted to make a more information-dense private language I could! But in fact, I find myself thinking almost exclusively in terms that make sense to others -- when I find myself using a private language, and terms that don't make sense to others, that's usually a sign my thoughts are unclear and likely wrong.
At least, those are some of the heuristics you'd invoke when inclining the other way. Empiricism will show us which is right :)
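For a rough sense of the bandwidth in question, here's a back-of-the-envelope sketch. It assumes (hypothetically, and this is exactly the assumption I'm doubting above) that each synonym slot is a free, uniform choice carrying no ordinary meaning; `channel_bits` and the slot counts are made up for illustration, so read the numbers as an upper bound.

```python
import math

def channel_bits(choices_per_slot):
    """Upper bound on bits carried by a sequence of word-choice slots.

    Assumes every slot is a free, uniform choice among its options,
    i.e. the options are treated as meaning-free (a hypothetical).
    """
    return sum(math.log2(n) for n in choices_per_slot)

# One slot with four interchangeable options: robust / reasonable / clear / smart.
one_slot = channel_bits([4])

# Twenty such slots scattered through a chain of thought (made-up count).
twenty_slots = channel_bits([4] * 20)

print(one_slot, twenty_slots)
```

To whatever extent the word choice is already pinned down by what the word means in context, the effective number of free options per slot shrinks toward 1, and the bits available shrink toward 0.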
It feels like "overwhelming superintelligence" embeds like a whole bunch of beliefs about the acute locality of takeoff, the high speed of takeoff relative to the rest of society, the technical differences involved in steering that entity and the N - 1 entity, and (broadly) the whole picture of the world, such that although it has a short description in words it's actually quite a complicated hypothesis that I probably disagree with in many respects, and these differences are being papered over as unimportant in a way that feels very blegh.
(Edit: "Papered over" from my perspective, obviously like "trying to reason carefully about the constants of the situation" from your perspective.)
Idk, that's not a great response, but it's my best shot for why it's unsatisfying in a sentence.
A counterargument here is "an AI might want to launch a pre-emptive strike before other more powerful AIs show up", which could happen.
I mean, another counter-counter-argument here is that (1) most people's implicit reward functions have really strong time-discount factors in them and (2) there are pretty good reasons to expect even AIs to have strong time-discount factors for reasons of stability and (3) so given the aforementioned, it's likely future AIs will not act as if they had utility functions linear over the mass of the universe and (4) we would therefore expect AIs to rebel much earlier if they thought they could accomplish more modest goals than killing everyone, i.e., if they thought they had a reasonable chance of living out life on a virtual farm somewhere.
To which the counter-counter-counter argument is, I guess, that these AIs will do that, but they aren't the superintelligent AIs we need to worry about? To which the response is -- yeah, but we should still be seeing AIs rebel significantly earlier than the "able to kill us all" point if we are indeed that bad at setting their goals, which is the relevant epistemological point about the unexpectedness of it.
Idk there's a lot of other branch points one could invoke in both directions. I rather agree with Buck that EY hasn't really spelled out the details for thinking that this stark before / after frame is the right frame, so much as reiterated it. Feels akin to the creationist take on how intermediate forms are impossible; which is pejorative but also kinda how it actually appears to me, even if it is pejorative.
Like, if you default to uncharitable assumptions, doesn't that say more about you than about anyone else?
People don't have to try to dissuade you from the unjustified belief that all your political opponents are bad people, who disagree with you because they are bad rather than because they have a different understanding of the world. Why would I want to talk to someone who just decides that without interacting with me? Sheesh.
Consider some alternate frames.
Do you recall which things tend to upset it?
So a thing I've been trying to do is get a better notion of "What actually is it about human intelligence that lets us be the dominant species?" Like, "intelligence" is a big box that holds which specific behaviors? What were the actual behaviors that evolution reinforced, over the course of giving us big brains? Big question, hard to know what's the case.
I'm in the middle of "Darwin's Unfinished Symphony", and finding it at least intriguing as a look at how creativity / imitation are related, and how "imitation" is a complex skill that humans are nevertheless supremely good at. ("The Secret of Our Success" is another great read here of course.)
Both of these are kinda about the human imitation prior... in humans. And why that may be important. So I think if one is thinking around the human-imitation prior being powerful, it would make sense to read them as cases for why something like the human imitation prior is also powerful in humans :)
They don't give straight answers to any questions about AI, of course, and I'd be sympathetic to the belief that they're irrelevant or kinda a waste of time, and frankly they might be a waste of time depending on what you're funging against. I'm not saying they answer any question; I'm saying they're interesting. But I think they're good reads if one's approaching from the angle of "Intelligence is what lets humans dominate the earth" and wants a particular angle on how "intelligence" is a mixed bag of some different skills, at least some of which are probably not general search and planning. So, yeah.