I concretely disagree with (what I see as) your implied premise that the outer (training) task has any direct influence on the inner optimizer's cognition. I think this disagreement (which I internally feel like I've already tried to make a number of times) has been largely ignored so far. As a result, many of the things you wrote seem to me to be answerable by largely the same objection:
As I see it: in training, it was optimized for that. The trained model likely contains one or more optimizers optimized by that training. But what the model is trained/optimized to do, is actually answer the questions.
The model's "training/optimization", as characterized by the outer loss, is not what determines the inner optimizer's cognition.
If the model in training has an optimizer, a goal of the optimizer for being capable of answering questions wouldn't actually make the optimizer more capable, so that would not be reinforced. A goal of actually answering the questions, on the other hand, would make the optimizer more capable and so would be reinforced.
The model's "training/optimization", as characterized by the outer loss, is not what determines the inner optimizer's cognition.
Likewise, the heuristics/"adaptations" that coalesced to form the optimizer would have been oriented towards answering the questions.
...why? (The model's "training/optimization", as characterized by the outer loss, is not what determines the inner optimizer's cognition.)
All this points to mask-level goals and does not provide a reason to believe in non-mask goals, and so a "goal slot" remains more parsimonious than an actor with a different underlying goal.
I still don't understand your "mask" analogy, and currently suspect it of mostly being a red herring (this is what I was referring to when I said I think we're not talking about the same thing). Could you rephrase your point without making mention of "masks" (or any synonyms), and describe more concretely what you're imagining here, and how it leads to a (nonfake) "goal slot"?
(Where is a human actor's "goal slot"? Can I tell an actor to play the role of Adolf Hitler, and thereby turn him into Hitler?)
Regarding the evolutionary analogy, while I'd generally be skeptical about applying evolutionary analogies to LLMs, because they are very different, in this case I think it does apply, just not the way you think. I would analogize evolution -> training and human behaviour/goals -> the mask.
I think "the mask" doesn't make sense as a completion to that analogy, unless you replace "human behaviour/goals" with something much more specific, like "acting". Humans certainly are capable of acting out roles, but that's not what their inner cognition actually does! (And neither will it be what the inner optimizer does, unless the LLM in question is weak enough to not have one of those.)
I really think you're still imagining here that the outer loss function is somehow constraining the model's inner cognition (which is why you keep making arguments that seem premised on the idea that e.g. if the outer loss says to predict the next token, then the model ends up putting on "masks" and playing out personas)—but I'm not talking about the "mask", I'm talking about the actor, and the fact that you keep bringing up the "mask" is really confusing to me, since it (in my view) forces an awkward analogy that doesn't capture what I'm pointing at.
Actually, having written that out just now, I think I want to revisit this point:
Likewise, the heuristics/"adaptations" that coalesced to form the optimizer would have been oriented towards answering the questions.
I still think this is wrong, but I think I can give a better description of why it's wrong than I did earlier: on my model, the heuristics learned by the model will be optimized much more towards world-modelling than towards answering questions. "Answering questions" is (part of) the outer task, but the process of doing that requires the system to model and internalize and think about things having to do with the subject matter of the questions—which effectively means that the outer task becomes a wrapper which trains the system by proxy to acquire all kinds of potentially dangerous capabilities.
(Having heuristics oriented towards answering questions is a misdescription; you can't correctly answer a math question you know nothing about by being very good at "generic question-answering", because "generic question-answering" is not actually a concrete task you can be trained on. You have to be good at math, not "generic question-answering", in order to be able to answer math questions.)
Which is to say, quoting from my previous comment:
I strongly disagree that the "extra machinery" is extra; instead, I would say that it is absolutely necessary for strong intelligence. A model capable of producing plans to take over the world if asked, for example, almost certainly contains an inner optimizer with its own goals; not because this was incentivized directly by the outer loss on token prediction, but because being able to plan on that level requires the formation of goal-like representations within the model.
None of this is about the "mask". None of this is about the role the model is asked to play during inference. Instead, it's about the thinking the model must have learned to do in order to be able to don those "masks"—which (for sufficiently powerful models) implies the existence of an actor which (a) knows how to answer, itself, all of the questions it's asked, and (b) is not the same entity as any of the "masks" it's asked to don.
Full Solomonoff Induction on a hypercomputer absolutely does not just "learn very similar internal functions/models", it effectively recreates actual human brains.
Full SI on a hypercomputer is equivalent to instantiating a computational multiverse and allowing us to access it. Reading out data samples corresponding to text from that is equivalent to reading out samples of actual text produced by actual human brains in other universes close to ours.
...yes? And this is obviously very, very different from how humans represent things internally?
I mean, for one thing, humans don't recreate exact simulations of other humans in our brains (even though "predicting other humans" is arguably the high-level cognitive task we are most specced for doing). But even setting that aside, the Solomonoff inductor's hypothesis also contains a bunch of stuff other than human brains, modeled in full detail—which again is not anything close to how humans model the world around us.
I admit to having some trouble following your (implicit) argument here. Is it that, because a Solomonoff inductor is capable of simulating humans, it is "human-like" in some sense relevant to alignment? (Specifically, that doing the plan-sampling thing Rob mentioned in the OP with a Solomonoff inductor will get you a safe result, because it'll be "humans in other universes" writing the plans? If so, I don't see how that follows at all; I'm pretty sure having humans somewhere inside of your model doesn't mean that that part of your model is what ends up generating the high-level plans being sampled by the outer system.)
It really seems to me that if I accept what looks to me like your argument, I'm basically forced to conclude that anything with a simplicity prior (trained on human data) will be aligned, meaning (in turn) the orthogonality thesis is completely false. But... well, I obviously don't buy that, so I'm puzzled that you seem to be stressing this point (in both this comment and other comments, e.g. this reply to me elsethread):
Note I didn't actually reply to that quote. Sure that's an explicit simplicity prior. However there's a large difference under the hood between using an explicit simplicity prior on plan length vs an implicit simplicity prior on the world and action models which generate plans. The latter is what is more relevant for intrinsic similarity to human thought processes (or not).
(to be clear, my response to this is basically everything I wrote above; this is not meant as its own separate quote-reply block)
you need to first investigate the actual internal representations of the systems in question, and verify that they are isomorphic to the ones humans use.
This has been ongoing for over a decade or more (dating at least back to Sparse Coding as an explanation for V1).
That's not what I mean by "internal representations". I'm referring to the concepts learned by the model, and whether analogues for those concepts exist in human thought-space (and if so, how closely they match each other). It's not at all clear to me that this occurs by default, and I don't think the fact that there are some statistical similarities between the high-level encoding approaches being used means that similar concepts end up being converged to. (Which is what is relevant, on my model, when it comes to questions like "if you sample plans from this system, what kinds of plans does it end up outputting, and do they end up being unusually dangerous relative to the kinds of plans humans tend to sample?")
I agree that sparse coding as an approach seems to have been anticipated by evolution, but your raising this point (and others like it), seemingly as an argument that this makes systems more likely to be aligned by default, feels thematically similar to some of my previous objections—which (roughly) is that you seem to be taking a fairly weak premise (statistical learning models likely have some kind of simplicity prior built in to their representation schema) and running with that premise wayyy further than I think is licensed—running, so far as I can tell, directly to the absolute edge of plausibility, with a conclusion something like "And therefore, these systems will be aligned." I don't think the logical leap here has been justified!
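(To make "investigate the actual internal representations" concrete: one standard tool for this kind of comparison is linear CKA, from Kornblith et al.'s "Similarity of Neural Network Representations Revisited". The pure-Python sketch below is my own toy illustration of the metric, not code from any of the papers under discussion; the example matrices are made up.)

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def center_columns(m):
    # Subtract each column's mean, so CKA compares representational geometry
    # rather than raw offsets.
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def frob_sq(m):
    # Squared Frobenius norm.
    return sum(v * v for row in m for v in row)

def linear_cka(x, y):
    """Linear CKA between two feature matrices whose rows are the same n
    stimuli and whose columns are each system's features. 1.0 means the
    representations match up to rotation/scaling; near 0 means unrelated."""
    xc, yc = center_columns(x), center_columns(y)
    cross = frob_sq(matmul(transpose(yc), xc))  # ||Yc^T Xc||_F^2
    denom = (frob_sq(matmul(transpose(xc), xc)) ** 0.5
             * frob_sq(matmul(transpose(yc), yc)) ** 0.5)
    return cross / denom

# A representation and a rescaled copy of it are maximally similar:
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
y = [[3.0 * v for v in row] for row in x]
# linear_cka(x, x) and linear_cka(x, y) are both ~1.0
```

This is the sort of quantitative check I'd want to see run between model concepts and (proxies for) human concepts before concluding the representations converge.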
Yeah, I'm growing increasingly confident that we're talking about different things. I'm not referring to "masks" in the sense that you mean it.
I don't know what you mean by "one" or by "inner". I would expect different masks to behave differently, acting as if optimizing different things (though that could be narrowed using RLHF), but they could re-use components between them. So, you could have, for example, a single calculation system that is reused but takes as input a bunch of parameters that have different values for different masks, which (again just an example) define the goals, knowledge and capabilities of the mask.
Yes, except that the "calculation system", on my model, will have its own goals. It doesn't have a cleanly factored "goal slot", which means that (on my model) "takes as input a bunch of parameters that [...] define the goals, knowledge, and capabilities of the mask" doesn't matter: the inner optimizer need not care about the "mask" role, any more than an actor shares their character's values.
- That there is some underlying goal that this optimizer has that is different than satisfying the current mask's goal, and it is only satisfying the mask's goal instrumentally.
This I think is very unlikely for the reasons I put in the original post. It's extra machinery that isn't returning any value in training.
Yes, this is the key disagreement. I strongly disagree that the "extra machinery" is extra; instead, I would say that it is absolutely necessary for strong intelligence. A model capable of producing plans to take over the world if asked, for example, almost certainly contains an inner optimizer with its own goals; not because this was incentivized directly by the outer loss on token prediction, but because being able to plan on that level requires the formation of goal-like representations within the model. And (again) because these goal representations are not cleanly factorable into something like an externally visible "goal slot", and are moreover not constrained by the outer loss function, they are likely to be very arbitrary from the perspective of outsiders. This is the same point I tried to make in my earlier comment:
And in that case, the "awakened shoggoth" does seem likely to me to have an essentially arbitrary set of preferences relative to the outer loss function—just as e.g. humans have an essentially arbitrary set of preferences relative to inclusive genetic fitness, and for roughly the same reason: an agentic cognition born of a given optimization criterion has no reason to internalize that criterion into its own goal structure; much more likely candidates for being thus "internalized", in my view, are useful heuristics/"adaptations"/generalizations formed during training, which then resolve into something coherent and concrete.
The evolutionary analogy is apt, in my view, and I'd like to ask you to meditate on it more directly. It's a very concrete example of what happens when you optimize a system hard enough on an outer loss function (inclusive genetic fitness, in this case) that inner optimizers arise with respect to that outer loss (animals with their own brains). When these "inner optimizers" are weak, they consist largely of a set of heuristics, which perform well within the training environment, but which fail to generalize outside of it (hence the scare-quotes around "inner optimizers"). But when these inner optimizers do begin to exhibit patterns of cognition that generalize, what they end up generalizing is not the outer loss, but some collection of what were originally useful heuristics (e.g. kludgey approximations of game-theoretic concepts like tit-for-tat), reified into concepts which are now valued in their own right ("reputation", "honor", "kindness", etc).
This is a direct consequence (in my view) of the fact that the outer loss function does not constrain the structure of the inner optimizer's cognition. As a result, I don't expect the inner optimizer to end up representing, in its own thoughts, a goal of the form "I need to predict the next token", any more than humans explicitly calculate IGF when choosing their actions, or (say) a mathematician thinks "I need to do good maths" when doing maths. Instead, I basically expect the system to end up with cognitive heuristics/"adaptations" pertaining to the subject at hand—which in the case of our current systems is something like "be capable of answering any question I ask you." Which is not a recipe for heuristics that end up unfolding into safely generalizing goals!
I want to revisit what Rob actually wrote:
If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like "invent fast-running whole-brain emulation", then hitting a button to execute the plan would kill all humans, with very high probability.
(emphasis mine)
That sounds a whole lot like it's invoking a simplicity prior to me!
LLMs and human brains learn from basically the same data with similar training objectives powered by universal approximations of bayesian inference and thus learn very similar internal functions/models.
This argument proves too much. A Solomonoff inductor (AIXI) running on a hypercomputer would also "learn from basically the same data" (sensory data produced by the physical universe) with "similar training objectives" (predict the next bit of sensory information) using "universal approximations of Bayesian inference" (a perfect approximation, in this case), and yet it would not be the case that you could then conclude that AIXI "learns very similar internal functions/models". (In fact, the given example of AIXI is much closer to Rob's initial description of "sampling from the space of possible plans, weighted by length"!)
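(As a toy illustration of what "sampling from the space of writable plans, weighted by length" actually cashes out to—this sketch is mine, not anything from Rob's post—consider a length prior over binary strings. Everything below, including the `sample_plan` helper, is hypothetical illustration.)

```python
import random

def sample_plan(rng):
    """Sample a binary 'plan' under a toy length prior: P(length = L) = 2**-L,
    then uniform over the strings of that length (a crude stand-in for
    'sampling plans weighted by length')."""
    length = 1
    while rng.random() < 0.5:  # geometric(1/2) gives P(L) = 2**-L
        length += 1
    return ''.join(rng.choice('01') for _ in range(length))

rng = random.Random(0)
samples = [sample_plan(rng) for _ in range(10_000)]
frac_short = sum(len(s) <= 10 for s in samples) / len(samples)
# Under this prior, P(length <= 10) = 1 - 2**-10, so nearly all sampled
# "plans" are trivially short strings; nothing in the prior tracks
# human-like plan structure.
```

The point being: a pure simplicity prior is a statement about string lengths, not about the cognition that generates the strings—which is exactly why it's so different from human plan generation.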
In order to properly argue this, you need to talk about more than just training objectives and approximations to Bayes; you need to first investigate the actual internal representations of the systems in question, and verify that they are isomorphic to the ones humans use. Currently, I'm not aware of any investigations into this that I'd consider satisfactory.
(Note here that I've skimmed the papers you cite in your linked posts, and for most of them it seems to me either (a) they don't make the kinds of claims you'd need to establish a strong conclusion of "therefore, AI systems think like humans", or (b) they do make such claims, but then the described investigation doesn't justify those claims.)
E.g. a system capable of correctly answering questions like "given such-and-such chess position, what is the best move for the current player?" must in fact be performing agentic/search-like thoughts internally, since there is no other way to correctly answer this question.
Yes, but that sort of question is in my view answered by the "mask", not by something outside the mask.
I don't think this parses for me. The computation performed to answer the question occurs inside the LLM, yes? Whether you classify said computation as coming from "the mask" or not, clearly there is an agent-like computation occurring, and that's concretely dangerous regardless of the label you choose to slap on it.
(Example: suppose you ask me to play the role of a person named John. You ask "John" what the best move is in a given chess position. Then the answer to that question is actually being generated by me, and it's no coincidence that—if "John" is able to answer the question correctly—this implies something about my chess skills, not "John's".)
The masks can indeed think whatever - in the limit of a perfect predictor some masks would presumably be isomorphic to humans, for example - though all is underlain by next-token prediction.
I don't think we're talking about the same thing here. I expect there to be only one inner optimizer (because more than one would point to cognitive inefficiencies), whereas you seem like you're talking about multiple "masks". I don't think it matters how many different roles the LLM can be asked to play; what matters is what the inner optimizer ends up wanting.
Mostly, I'm confused about the ontology you appear to be using here, and (more importantly) how you're manipulating that ontology to get us nice things. "Next-token prediction" doesn't get us nice things by default, as I've already argued, because of the existence of inner optimizers. "Masks" also don't get us nice things, as far as I understand the way you're using the term, because "masks" aren't actually in control of the inner optimizer.
In order to make the doom conclusion actually go through, arguments should make stronger claims about the priors involved, and how they differ from those of the human learning process.
Isn't it enough that they do differ? Why do we need to be able to accurately/precisely characterize the nature of the difference, to conclude that an arbitrary inductive bias different from our own is unlikely to sample the same kinds of plans we do?
It would conflict with a deceptive awake Shoggoth, but IMO such a thing is unlikely because the model is super-well optimized for next token prediction
Yeah, so I think I concretely disagree with this. I don't think being "super-well optimized" for a general task like sequence prediction (and what does it mean to be "super-well optimized" anyway, as opposed to "badly optimized" or some such?) means that inner optimizers fail to arise in the limit of sufficient ability, or that said inner optimizers will be aligned on the outer goal of sequence prediction.
Intuition: some types of cognitive work seem so hard that a system capable of performing said cognitive work must be, at some level, performing something like systematic reasoning/planning on the level of thoughts, not just the level of outputs. E.g. a system capable of correctly answering questions like "given such-and-such chess position, what is the best move for the current player?" must in fact be performing agentic/search-like thoughts internally, since there is no other way to correctly answer this question.
If so, this essentially demands that an inner optimizer exist—and, moreover, since the outer loss function makes no reference whatsoever to such an inner optimizer, the structure of the outer (prediction) task poses essentially no constraints on the kinds of thoughts the inner optimizer ends up thinking. And in that case, the "awakened shoggoth" does seem likely to me to have an essentially arbitrary set of preferences relative to the outer loss function—just as e.g. humans have an essentially arbitrary set of preferences relative to inclusive genetic fitness, and for roughly the same reason: an agentic cognition born of a given optimization criterion has no reason to internalize that criterion into its own goal structure; much more likely candidates for being thus "internalized", in my view, are useful heuristics/"adaptations"/generalizations formed during training, which then resolve into something coherent and concrete.
(Aside: it seems to have become popular in recent times to claim that the evolutionary analogy fails for some reason or other, with justifications like, "But look how many humans there are! We're doing great on the IGF front!" I consider these replies more-or-less a complete nonsequitur, since it's nakedly obvious that, however much success we have had in propagating our alleles, this success does not stem from any explicit tracking/pursuit of IGF in our cognition. To the extent that human behavior continues to (imperfectly) promote IGF, this is largely incidental on my view—arising from the fact that e.g. we have not yet moved so far off-distribution as to have ways of getting what we want without having biological children.)
One possible disagreement someone might have with this, is that they think the kinds of "hard" cognitive work I described above can be accomplished without an inner optimizer ("awakened shoggoth"), by e.g. using chain-of-thought prompting or something similar, so as to externalize the search-like/agentic part of the solution process instead of conducting it internally. (E.g. AlphaZero does this by having its model be responsible only for the static position evaluation, which is then fed into/amplified via an external, handcoded search algorithm.)
However, I mostly think that:

1. This doesn't actually make you safe, because the ability to generate a correct plan via externalized thinking still implies a powerful internal planning process (e.g. AlphaZero with no search still performs at a 2400+ Elo level, corresponding to the >99th percentile of human players). Obviously the searchless version will be worse than the version with search, but that won't matter if the dangerous capabilities still exist within the searchless version. (Intuition: suppose we have a model which, with chain-of-thought prompting, is capable of coming up with a detailed-and-plausible plan for taking over the world. Then I claim this model is clearly powerful enough to be dangerous in terms of its underlying capabilities, regardless of whether it chooses to "think aloud" or not, because coming up with a good plan for taking over the world is not the kind of thing "thinking aloud" helps you with unless you're already smarter than any human.)
2. Being able to answer complicated questions using chain-of-thought prompting (or similar) is not actually the task incentivized during training; what is incentivized is (as you yourself stressed continuously throughout your post) next token prediction, which—in cases where the training data contains sentences where substantial amounts of "inference" occurred between tokens (which happens a lot on the Internet!)—directly incentivizes the model to perform internal rather than external search. (Intuition: suppose we have a model trained to predict source code. Then, in order to accurately predict the next token, the model must have the capability to assess whatever is being attempted by the lines of code visible within the current context, and come up with a logical continuation of that code, all within a single inference pass. This strongly promotes internalization of thought—and various other types of training input have this property, such as mathematical proofs, or even more informal forms of argumentation such as e.g. LW comments.)
RE: decision theory w.r.t how "other powerful beings" might respond - I really do think Nate has already argued this, and his arguments continue to seem more compelling to me than the opposition's. Relevant quotes include:
(To the above, I personally would add that this whole genre of argument reeks, to me, essentially of giving up, and tossing our remaining hopes onto a Hail Mary largely insensitive to our actual actions in the present. Relying on helpful aliens is what you do once you're entirely out of hope about solving the problem on the object level, and doesn't strike me as a very dignified way to go down!)