Quintin Pope


Quintin's Alignment Papers Roundup
Shard Theory

Wiki Contributions


Addressing this objection is why I emphasized the relatively low information content that architecture / optimizers provide for minds, as compared to training data. We've gotten very far in instantiating human-like behaviors by training networks on human-like data. I'm saying the primacy of data for determining minds means you can get surprisingly close in mindspace, as compared to if you thought architecture / optimizer / etc were the most important.

Obviously, there are still huge gaps between the sorts of data that an LLM is trained on versus the implicit loss functions human brains actually minimize, so it's kind of surprising we've even gotten this far. The implication I'm pointing to is that it's feasible to get really close to human minds along important dimensions related to values and behaviors, even without replicating all the quirks of human mental architecture.

I believe the human visual cortex is actually the more relevant comparison point for estimating the level of danger we face due to mesaoptimization. Its training process is more similar to the self-supervised / offline way in which we train (base) LLMs. In contrast, 'most abstract / "psychological"' are more entangled in future decision-making. They're more "online", with greater ability to influence their future training data.

I think it's not too controversial that online learning processes can have self-reinforcing loops in them. Crucially however, such loops rely on being able to influence the externally visible data collection process, rather than being invisibly baked into the prior. They are thus much more amenable to being addressed with scalable oversight approaches.

I'm guessing you misunderstand what I meant when I referred to "the human learning process" as the thing that was a ~ 1 billion X stronger optimizer than evolution and responsible for the human SLT. I wasn't referring to human intelligence or what we might call human "in-context learning". I was referring to the human brain's update rules / optimizer: i.e., whatever quasi-Hebbian process the brain uses to minimize sensory prediction error, maximize reward, and whatever else factors into the human "base objective". I was not referring to the intelligences that the human base optimizers build over a lifetime.

If instead of reward circuitry inducing human values, evolution directly selected over policies, I'd expect similar inner alignment failures.

I very strongly disagree with this. "Evolution directly selecting over policies" in an ML context would be equivalent to iterated random search, which is essentially a zeroth-order approximation to gradient descent. Under certain simplifying assumptions, they are actually equivalent. It's the loss landscape an parameter-function map that are responsible for most of a learning process's inductive biases (especially for large amounts of data). See: Loss Landscapes are All You Need: Neural Network Generalization Can Be Explained Without the Implicit Bias of Gradient Descent

Most of the difference in outcomes between human biological evolution and DL comes down to the fact that bio evolution has a wildly different mapping from parameters to functional behaviors, as compared to DL. E.g.,

  1. Bio evolution's parameters are the genome, which mostly configures learning proclivities and reward circuitry of the human within lifetime learning process, as opposed to DL parameters being actual parameters which are much more able to directly specify particular behaviors.
  2. The "functional output" of human bio evolution isn't actually the behaviors of individual humans. Rather, it's the tendency of newborn humans to learn behaviors in a given environment. It's not like in DL, where you can train a model, then test that same model in a new environment. Rather, optimization over the human genome in the ancestral environment produced our genome, and now a fresh batch of humans arise and learn behaviors in the modern environment.

Point 2 is the distinction I was referencing when I said:

"human behavior in the ancestral environment" versus "human behavior in the modern environment" isn't a valid example of behavioral differences between training and deployment environments.

Overall, bio evolution is an incredibly weird optimization process, with specific quirks that predictably cause very different outcomes as compared to either DL or human within lifetime learning. As a result, bio evolution outcomes have very little implication for DL. It's deeply wrong to lump them all under the same "hill climbing paradigm", and assume they'll all have the same dynamics.

It's also not necessary that the inner values of the agent make no mention of human values / objectives, it needs to both a) value them enough to not take over, and b) maintain these values post-reflection. 

This ties into the misunderstanding I think you made. When I said:

  1. Deliberately create a (very obvious[2]) inner optimizer, whose inner loss function includes no mention of human values / objectives.[3]

The "inner loss function" I'm talking about here is not human values, but instead whatever mix of predictive loss, reward maximization, etc that form the effective optimization criterion for the brain's "base" distributed quasi-Hebbian/whatever optimization process. Such an "inner loss function" in the context of contemporary AI systems would not refer to the "inner values" that arise as a consequence of running SGD over a bunch of training data. They'd be something much much weirder and very different from current practice.

E.g., if we had a meta-learning setup where the top-level optimizer automatically searches for a reward function F, which, when used in another AI's training, will lead to high scores on some other criterion C, via the following process:

  1. Randomly initializing a population of models.
  2. Training them with the current reward function F.
  3. Evaluate those models on C.
  4. Update the reward function F to be better at training models to score highly on C.

The "inner loss function" I was talking about in the post would be most closely related to F. And what I mean by "Deliberately create a (very obvious[2]) inner optimizer, whose inner loss function includes no mention of human values / objectives", in the context of the above meta-learning setup, is to point to the relationship between F and C.

Specifically, does F actually reward the AIs for doing well on C? Or, as with humans, does F only reward the AIs for achieving shallow environmental correlates of scoring well on C? If the latter, then you should obviously consider that, if you create a new batch of AIs in a fresh environment, and train them on an unmodified reward function F, that the things F rewards will become decoupled from the AIs eventually doing well on C.

Returning to humans:

Inclusive genetic fitness is incredibly difficult to "directly" train an organism to maximize. Firstly, IGF can't actually be measured in an organism's lifetime, only estimated based on the observable states of the organism's descendants. Secondly, "IGF estimated from observing descendants" makes for a very difficult reward signal to learn on because it's so extremely sparse, and because the within-lifetime actions that lead to having more descendants are often very far in time away from being able to actually observe those descendants. Thus, any scheme like "look at descendants, estimate IGF, apply reward proportional to estimated IGF" would completely fail at steering an organism's within lifetime learning towards IGF-increasing actions. 

Evolution, being faced with standard RL issues of reward sparseness and long time horizons, adopted a standard RL solution to those issues, namely reward shaping. E.g., rather than rewarding organisms for producing offspring, it builds reward circuitry that reward organisms for precursors to having offspring, such as having sex, which allows rewards to be more frequent and closer in time to the behaviors they're supposed to reinforce.

In fact, evolution relies so heavily on reward shaping that I think there's probably nothing in the human reward system that directly rewards increased IGF, at least not in the direct manner an ML researcher could by running a self-replicating model a bunch of times in different environments, measuring the resulting "IGF" of each run, and directly rewarding the model in proportion to its "IGF".

This is the thing I was actually referring to when I mentioned "inner optimizer, whose inner loss function includes no mention of human values / objectives.": the human loss / reward functions not directly including IGF in the human "base objective".

(Note that we won't run into similar issues with AI reward functions vs human values. This is partially because we have much more flexibility in what we include in a reward function as compared to evolution (e.g., we could directly train an AI on estimated IGF). Mostly though, it's because the thing we want to align our models to, human values, have already been selected to be the sorts of things that can be formed via RL on shaped reward functions, because that's how they actually arose at all.)

For (2), it seems like you are conflating 'amount of real world time' with 'amount of consequences-optimization'. SGD is just a much less efficient optimizer than intelligent cognition

Again, the thing I'm pointing to as the source of the human-evolutionary sharp left turn isn't human intelligence. It's a change in the structure of how optimization power (coming from the "base objective" of the human brain's updating process) was able to contribute to capabilities gains over time. If human evolution were an ML experiment, the key change I'm pointing to isn't "the models got smart". It's "the experiment stopped being quite as stupidly wasteful of compute" (which happened because the models got smart enough to exploit a side-channel in the experiment's design that allowed them to pass increasing amounts of information to future generations, rather than constantly being reset to the same level each time). Then, the reason this won't happen in AI development is that there isn't a similarly massive overhang of completely misused optimization power / compute, which could be unleashed via a single small change to the training process.

in-context learning happens much faster than SGD learning.

Is it really? I think they're overall comparable 'within an OOM', just useful for different things. It's just much easier to prompt a model and immediately see how this changes its behavior, but on head-to-head comparisons, it's not at all clear that prompting wins out. E.g., Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning

In particular, I think prompting tends to be more specialized to getting good performance in situations similar to those the model has seen previously, whereas training (with appropriate data) is more general in the directions in which it can move capabilities. Extreme example: pure language models can be few-shot prompted to do image classification, but are very bad at it. However, they can be directly trained into capable multi-modal models.

I think this difference between in-context vs SGD learning makes it unlikely that in-context learning alone will suffice for an explosion in general intelligence. If you're only sampling from the probability distribution created by a training process, then you can't update that distribution, which I expect will greatly limit your ability to robustly generalize to new domains, as compared to a process where you gather new data from those domains and update the underlying distribution with those data. 

For (3), I don't think that the SLT requires the inner optimizer to run freely,  it only requires one of: 

a. the inner optimizer running much faster than the outer optimizer, such that the updates don't occur in time. 

b. the inner optimizer does gradient hacking / exploration hacking, such that the outer loss's updates are ineffective. 

(3) is mostly there to point to the fact that evolution took no corrective action whatsoever in regards to humans. Evolution can't watch humans' within lifetime behavior, see that they're deviating away from the "intended" behavior, and intervene in their within lifetime learning processes to correct such issues.

Human "inner learners" take ~billions of inner steps for each outer evolutionary step. In contrast, we can just assign whatever ratio of supervisory steps to runtime execution steps, and intervene whenever we want.

It doesn't mention the literal string "gradient descent", but it clearly makes reference to the current methodology of training AI systems (which is gradient descent). E.g., here:

The techniques OpenMind used to train it away from the error where it convinces itself that bad situations are unlikely? Those generalize fine. The techniques you used to train it to allow the operators to shut it down? Those fall apart, and the AGI starts wanting to avoid shutdown, including wanting to deceive you if it’s useful to do so.

The implication is that the dangerous behaviors that manifest during the SLT are supposed to have been instilled (at least partially) during the training (gradient descent) process.

However, the above is a nitpick. The real issue I have with your comment is that you seem to be criticizing me for not addressing the "capabilities come from not-SGD" threat scenario, when addressing that threat scenario is what this entire post is about.

Here's how I described SLT (which you literally quoted): "SGD creating some 'inner thing' which is not SGD and which gains capabilities much faster than SGD can insert them into the AI."

This is clearly pointing to a risk scenario where something other than SGD produces the SLT explosion of capabilities.

You say:

For example, Auto-GPT is composed of foundation models which were trained with SGD and RHLF, but there are many ways to enhance the capabilities of Auto-GPT that do not involve further training of the foundation models. The repository currently has hundreds of open issues and pull requests. Perhaps a future instantiation of Auto-GPT will be able to start closing these PRs on its own, but in the meantime there are plenty of humans doing that work.

To which I say, yes, that's an example where SGD creates some 'inner thing' (the ability to contribute to the Auto-GPT repository), which (one might imagine) would let Auto-GPT "gain capabilities much faster than SGD can insert them into the AI." This is exactly the sort of thing that I'm talking about in this post, and am saying it won't lead to an SLT.

(Or at least, that the evolutionary example provides no reason to think that Auto-GPT might undergo an SLT, because the evolutionary SLT relied on the sudden unleash of ~9 OOM extra available optimization power, relative to the previous mechanisms of capabilities accumulation over time.)

Finally, I'd note that a very significant portion of this post is explicitly focused on discussing "non-SGD" mechanisms of capabilities improvement. Everything from "Fast takeoff is still possible" and on is specifically about such scenarios. 

I've recently decided to revisit this post. I'll try to address all un-responded to comments in the next ~2 weeks.

Part of this is just straight disagreement, I think; see So8res's Sharp Left Turn and follow-on discussion.

Evolution provides no evidence for the sharp left turn

But for the rest of it, I don't see this as addressing the case for pessimism, which is not problems from the reference class that contains "the LLM sometimes outputs naughty sentences" but instead problems from the reference class that contains "we don't know how to prevent an ontological collapse, where meaning structures constructed under one world-model compile to something different under a different world model."

I dislike this minimization of contemporary alignment progress. Even just limiting ourselves to RLHF, that method addresses far more problems than "the LLM sometimes outputs naughty sentences". E.g., it also tackles problems such as consistently following user instructions, reducing hallucinations, improving the topicality of LLM suggestions, etc. It allows much more significant interfacing with the cognition and objectives pursued by LLMs than just some profanity filter.

I don't think ontological collapse is a real issue (or at least, not an issue that appropriate training data can't solve in a relatively straightforwards way). I feel similarly about lots of things that are speculated to be convergent problems for ML systems, such as wireheading and mesaoptimization.

Or, like, once LLMs gain the capability to design proteins (because you added in a relevant dataset, say), do you really expect the 'helpful, harmless, honest' alignment techniques that were used to make a chatbot not accidentally offend users to also work for making a biologist-bot not accidentally murder patients?

If you're referring to the technique used on LLMs (RLHF), then the answer seems like an obvious yes. RLHF just refers to using reinforcement learning with supervisory signals from a preference model. It's an incredibly powerful and flexible approach, one that's only marginally less general than reinforcement learning itself (can't use it for things you can't build a preference model of). It seems clear enough to me that you could do RLHF over the biologist-bot's action outputs in the biological domain, and be able to shape its behavior there.

If you're referring to just doing language-only RLHF on the model, then making a bio-model, and seeing if the RLHF influences the bio-model's behaviors, then I think the answer is "variable, and it depends a lot on the specifics of the RLHF and how the cross-modal grounding works". 

People often translate non-lingual modalities into language so LLMs can operate in their "native element" in those other domains. Assuming you don't do that, then yes, I could easily see the language-only RLHF training having little impact on the bio-model's behaviors.

However, if the bio-model were acting multi-modally by e.g., alternating between biological sequence outputs and natural language planning of what to use those outputs for, then I expect the RLHF would constrain the language portions of that dialog. Then, there are two options:

  • Bio-bot's multi-modal outputs don't correctly ground between language and bio-sequences. 
    • In this case, bio-bot's language planning doesn't correctly describe the sequences its outputting, so the RLHF doesn't constrain those sequences.
    • However, if bio-bot doesn't ground cross-modally, than bio-bot also can't benefit from its ability to plan in the language modality to better use its bio modality capabilities (which are presumably much better for planning than its bio-modality). 
  • Bio-bot's multi-modal outputs DO correctly ground between language and bio-sequences. 
    • In that case, the RLHF-constrained language does correctly describe the bio-sequences, and so the language-only RLHF training does also constrain bio-bot's biology-related behavior.

Put another way, I think new capabilities advances reveal new alignment challenges and unless alignment techniques are clearly cutting at the root of the problem, I don't expect that they will easily transfer to those new challenges.

Whereas I see future alignment challenges as intimately tied to those we've had to tackle for previous, less capable models. E.g., your bio-bot example is basically a problem of cross-modality grounding, on which there has been an enormous amount of past work, driven by the fact that cross-modality grounding is a problem for systems across very broad ranges of capabilities.

There was an entire thread about Yudkowsky's past opinions on neural networks, and I agree with Alex Turner's evidence that Yudkowsky was dubious. 

I also think people who used brain analogies as the basis for optimism about neural networks were right to do so.

Roughly, the core distinction between software engineering and computer security is whether the system is thinking back.

Yes, and my point in that section is that the fundamental laws governing how AI training processes work are not "thinking back". They're not adversaries. If you created a misaligned AI, then it would be "thinking back", and you'd be in an adversarial position where security mindset is appropriate.

What's your story for specification gaming?

"Building an AI that doesn't game your specifications" is the actual "alignment question" we should be doing research on. The mathematical principles which determine how much a given AI training process games your specifications are not adversaries. It's also a problem we've made enormous progress on, mostly by using large pretrained models with priors over how to appropriately generalize from limited specification signals. E.g., Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually) shows how the process of pretraining an LM causes it to go from "gaming" a limited set of finetuning data via shortcut learning / memorization, to generalizing with the appropriate linguistic prior knowledge.

It can be induced on MNIST by deliberately choosing worse initializations for the model, as Omnigrok demonstrated.

Re empirical evidence for influence functions:

Didn't the Anthropic influence functions work pick up on LLMs not generalising across lexical ordering? E.g., training on "A is B" doesn't raise the model's credence in "Bs include A"?

Which is apparently true: https://x.com/owainevans_uk/status/1705285631520407821?s=46

Load More