[The basic idea in this post is probably not original to me, since it's somewhat obvious when stated directly. But it seems particularly relevant and worth distilling in light of recent developments with LLM-based systems, and because I keep seeing arguments which seem confused about it.]

Alignment is a property of agents, or even more generally, of systems. An "aligned model" is usually[1] a type error.

Often when a new state-of-the-art ML model is developed, the first thing people ask is what it can do when instantiated in the most obvious way possible: given an input, execute the function represented by the model on that input and return the output. I'll refer to this embodiment as the "trivial system" for a given model.

For an LLM, the trivial system generates a single token for a given prompt. There's an obvious extension to this system, which is to feed the predicted token and the initial prompt back into the model again, and repeat until you hit a stop sequence or a maximum length. This is the system you get when you make a single call to the OpenAI text or chat completion API. I'll name this embodiment the "trivial++ system".
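The trivial++ loop can be sketched in a few lines of Python. Everything here is a stand-in, not any real API: `model` is any callable from a token sequence to a next token, and the toy model below just exists to make the sketch runnable.

```python
from typing import Callable, List

def trivial_plus_plus(
    model: Callable[[List[str]], str],
    prompt: List[str],
    stop: str = "<stop>",
    max_len: int = 16,
) -> List[str]:
    """Repeatedly feed the growing token sequence back into the model
    until a stop token appears or the maximum length is reached."""
    tokens = list(prompt)
    while len(tokens) < max_len:
        next_token = model(tokens)  # the "trivial system": one execution of the model
        if next_token == stop:
            break
        tokens.append(next_token)
    return tokens

# Toy stand-in for a real model: echoes the last token a few times, then stops.
def toy_model(tokens: List[str]) -> str:
    return tokens[-1] if len(tokens) < 4 else "<stop>"

result = trivial_plus_plus(toy_model, ["hello", "world"])
```

Swapping `toy_model` for an actual forward pass of an LLM (plus a sampling rule) gives you exactly the single-API-call behavior described above.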

You can take this much further, by building a chatbot interface around this API, hooking it up to the internet, running it in an agentic loop, or even more exotic arrangements of your own design. These systems have suddenly started working much better with the release of GPT-4, and AI capabilities researchers are just getting started. The capabilities of any particular system will depend on both the underlying model and the way it is embodied: concretely, you can improve Auto-GPT by (a) backing it with a better foundation model, (b) giving it access to more APIs and tools, and (c) improving the code, prompting, and infrastructure that it runs on.
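A minimal agentic loop of the kind gestured at here might look like the following sketch. The `model` callable, the `tools` dictionary, and the "name: argument" action format are all hypothetical simplifications for illustration, not Auto-GPT's actual protocol.

```python
from typing import Callable, Dict

def agent_loop(
    model: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    goal: str,
    max_steps: int = 5,
) -> str:
    """Minimal agentic loop: ask the model for the next action, dispatch
    to a tool, feed the observation back in, and repeat."""
    transcript = f"Goal: {goal}"
    for _ in range(max_steps):
        action = model(transcript)  # e.g. "search: weather" or "finish: done"
        name, _, arg = action.partition(": ")
        if name == "finish":
            return arg
        observation = tools.get(name, lambda a: "unknown tool")(arg)
        transcript += f"\nAction: {action}\nObservation: {observation}"
    return "max steps reached"

# Toy scripted "model" for demonstration: searches once, then finishes.
def scripted_model(transcript: str) -> str:
    return "finish: done" if "Observation" in transcript else "search: weather"

tools = {"search": lambda query: f"results for {query}"}
outcome = agent_loop(scripted_model, tools, "check weather")
```

Note that every capability knob from the paragraph above shows up as a parameter here: a better `model`, a richer `tools` dictionary, or better loop code and prompting.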

Perhaps models to which techniques like RLHF have been applied will make it easy to build aligned systems and hard or impossible to build unaligned systems, but this is far from given.

Things like RLHF and instruction prompt tuning definitely make it easier to build powerful systems out of the foundation models to which they are applied. Does an RLHF'd model make it easier to build Order-GPT and harder to build Chaos-GPT? (I'm not sure; no Auto-GPT applications seem to really be working particularly well yet, but I wouldn't count on that trend continuing.)

So while RLHF is definitely a capabilities technique, it remains to be seen whether it can be called an alignment technique when applied to models embedded in non-trivial systems (though it is quite effective at getting the trivial++ system for GPT-4 not to say impolite things).

Some examples where others seem confused or not careful about this distinction


From Evolution provides no evidence for the sharp left turn:

Let’s consider eight specific alignment techniques:

Most of these seem like useful techniques for building or understanding larger foundation models. None of them self-evidently help with the alignment of nontrivial systems which use executions of those foundation models as a component, though it wouldn't surprise me if at least some of them were relevant.

From a different section of the same post:

In my frame, we've already figured out and applied the sharp left turn to our AI systems, in that we don't waste our compute on massive amounts of incredibly inefficient neural architecture search, hyperparameter tuning, or meta optimization. For a given compute budget, the best (known) way to buy capabilities is to train a single big model in accordance with empirical scaling laws such as those discovered in the Chinchilla paper, not to split the compute budget across millions of different training runs for vastly tinier models with slightly different architectures and training processes. In fact, we can be even more clever and use small models to tune the training process, before scaling up to a single large run, as OpenAI did with GPT-4.

"AI systems" in the first sentence should probably be "AI models". And then, I'd say that the best way to advance capabilities is to train a single big model and then wrap it up in a useful system with a slick UX and lots of access to APIs and other functionality.

From Deconfusing Direct vs Amortised Optimization:

Amortized optimization, on the other hand, is not directly applied to any specific problem or state. Instead, an agent is given a dataset of input data and successful solutions, and then learns a function approximator that maps directly from the input data to the correct solution. Once this function approximator is learnt, solving a novel problem then looks like using the function approximator to generalize across solution space rather than directly solving the problem.

The use of the word "agent" seems like a type error here. A function approximator is trained on a dataset of inputs and outputs. An agent may then use that function approximator as one tool in its toolbox when presented with a problem. If the problem is similar enough to the ones the function approximator was designed to approximate, it may be that the trivial system that executes the function approximator on the test input is sufficient, and you don't need to introduce an agent into the situation at all.

From My Objections to "We’re All Gonna Die with Eliezer Yudkowsky":

(emphasis mine, added to highlight the relevant parts in a longer quote without dropping the context.)

Yudkowsky brings up strawberry alignment:

"I mean, I wouldn't say that it's difficult to align an AI with our basic notions of morality. I'd say that it's difficult to align an AI on a task like 'take this strawberry, and make me another strawberry that's identical to this strawberry down to the cellular level, but not necessarily the atomic level'. So it looks the same under like a standard optical microscope, but maybe not a scanning electron microscope. Do that. Don't destroy the world as a side effect."

 

My first objection is: human value formation doesn't work like this. There's no way to raise a human such that their value system cleanly revolves around the one single goal of duplicating a strawberry, and nothing else. By asking for a method of forming values which would permit such a narrow specification of end goals, you're asking for a value formation process that's fundamentally different from the one humans use. There's no guarantee that such a thing even exists, and implicitly aiming to avoid the one value formation process we know is compatible with our own values seems like a terrible idea.

It also assumes that the orthogonality thesis should hold in respect to alignment techniques - that such techniques should be equally capable of aligning models to any possible objective. 

This seems clearly false in the case of deep learning, where progress on instilling any particular behavioral tendencies in models roughly follows the amount of available data that demonstrate said behavioral tendency. It's thus vastly easier to align models to goals where we have many examples of people executing said goals. As it so happens, we have roughly zero examples of people performing the "duplicate this strawberry" task, but many more examples of e.g., humans acting in accordance with human values, ML / alignment research papers, chatbots acting as helpful, honest and harmless assistants, people providing oversight to AI models, etc. See also: this discussion.

Again, "aligning a model", "behavioral tendencies in models" and "providing oversight to AI models" are type errors in my ontology. This type error is important, because it hides an implicit assumption that either any systems in which "aligned" models are embedded will also be aligned, or that people will build only trivial systems out of these models. Both of these assumptions have already been falsified in rather dramatic fashion.

My mental model is that the authors I've cited understand and mostly agree with the basic distinction I've made in this post. However, I think they were not being very careful about tracking it in some of the arguments and explanations above, and that this is more than just a pedantic point or an argument over definitions.

This distinction is important, because AI system capabilities are currently advancing rapidly, independent of any DL paradigm-based improvements in foundation models. The very first paragraph of one of the posts quoted above summarizes the "sharp left turn" argument as factoring through SGD, but SGD is not the only way of pushing the capabilities frontier, and may not be the main one for much longer, as GOFAI approaches come back into vogue. (A possibility which, I note, the original authors of the sharp left turn argument foresaw.)

  1. ^

    Maybe if the models and training runs get large enough, an inner optimizer develops somewhere inside and even a single output from the trivial system embodied by a model becomes dangerous.

    Or, perhaps the inner optimizer develops and "breaks out" entirely during the training process itself.

    Regardless of how likely those failure modes are, we'll probably encounter less exotic-looking failure modes of systems which are built on top of non-dangerous models first, which is the topic of this post. Mesa-optimizers are just another possible way we might fail later, if we manage to solve some easier problems first.


Interesting point, which I broadly agree with. I do think, however, that this post has in some sense over-updated on recent developments around agentic LLMs and the non-dangers of foundation models. Even 3-6 months ago, it was unclear in the intellectual zeitgeist whether AutoGPT-style agentic LLM wrappers were the main threat, and people were primarily worried about foundation models being directly dangerous. It now seems clearer that, at least at current capability levels, foundation models are not directly goal-seeking, although adding agency is relatively straightforward. This may also change in the future, e.g. if we do direct goal-driven RL training of the base models to create agents that way -- this would make direct alignment and interpretability of base models still necessary for safety.

I agree the zeitgeist has changed, but I think some people (or at least Nate and Eliezer in particular), have always been more concerned about more agent-like systems, along the lines of Mu Zero. For example, in the 2021 MIRI conversations here:

I do not quite think that gradient descent on Stack More Layers alone - as used by OpenAI for GPT-3, say, and as opposed to Deepmind which builds more complex artifacts like Mu Zero or AlphaFold 2 - is liable to be the first path taken to AGI.

and here:

Okay.  I don't think the pure-text versions of GPT-5 are being very good at designing nanosystems while Mu Zero is ending the world.

 and here:

There may be different cognitive technology that could follow a path like that. Gradient descent follows a path a bit relatively more in that direction along that axis - providing that you deal in systems that are giant layer cakes of transformers and that's your whole input-output relationship; matters are different if we're talking about Mu Zero instead of GPT-3.

Deep Deceptiveness is more recent, but it's another example of a carefully non-specific argument that doesn't factor through any current DL-paradigm methods, and is consistent with the kind of thing Nate and Eliezer have always been saying.

I think recent developments with LLMs have caused some other people to update towards LLMs alone being dangerous, which might be true, but if so it doesn't imply that more complex systems are not even more dangerous.

Would you say that a parallel phenomenon that does not involve AI would be "a company may exhibit behavior that is not in the interests of its customers, even if every individual employee of that company is doing their job in a way that they believe is in the interests of the customer"?

My impression is that the original reason to avoid the "superhuman AI :: corporation" analogy was that the corporation was built out of humans, which are at least somewhat aligned with other humans, while the superhuman AI would not be. However, if we stipulate that the foundation model(s) is aligned with humans, but that the structure containing it may not be, the parallels seem much stronger to me.

That's a decent analogy for the point I'm making in this post, but I don't like the corporations metaphor for superintelligence in general, because corporations aren't actually very good at making decisions that optimize for any particular crisp or coherent outcome, regardless of who or what that outcome is aligned with.

Corporations often control lots of resources that give them a wide range of options and opportunities to make decisions and take actions. But decisions about which actions to take are ultimately made by humans, and I believe the prevailing theory of corporate management is that a single CEO empowered to make decisions is better than any kind of decision-by-committee or consensus-based decision-making.

(Super-forecasters or prediction markets might be better, but I don't think any big or successful corporations actually use those, or if they do, they are merely one input to the CEO, who retains final decision-making ability.)

Even if the corporation as a whole behaves as a system that ends up optimizing for some kind of goal like "maximize bureaucracy" or "maximize job security and comfort for executives", I don't think the emergent decision-making process is better at optimizing for that outcome than an evil or deranged CEO who is explicitly trying to optimize for it (instead of optimizing for shareholder profit or short-term growth or whatever an "aligned" CEO is supposed to be optimizing for).

In the language of my steering systems post, corporations are systems with a large initial set of actions, but the ability of the corporation as a whole to steer through a tree of action sequences to any particular outcome is still below what a single human CEO can do by explicitly choosing the key actions.

Corporations often control lots of resources that give them a wide range of options and opportunities to make decisions and take actions. But decisions about which actions to take are ultimately made by humans, and I believe the prevailing theory of corporate management is that a single CEO empowered to make decisions is better than any kind of decision-by-committee or consensus-based decision-making.

My model of how corporations work is very different than this, and I think our different models might be driving the disagreement. Specifically, my model is roughly as follows.

To the extent that big, successful companies have a "goal", that "goal" is usually to make enough money to grow and keep the lights on. A large corporation is much more successful at coming up with and performing actions which cause the corporation to keep existing and growing than any individual human in that corporation would be.

I agree that corporations are worse at explicit consequentialist reasoning than the individual humans inside the corporation. Most corporations do have things like "quarterly roadmaps" and "business plans" and other forms of explicit plans with contingency plans for how exactly they intend to navigate to a future desirable-to-them world-state, but I think the documented plan is in almost all cases worse than the plan that lives in the CEO's head.

I claim that this does not matter, because neither the explicit written plan nor the hidden plan in the CEO's head is the primary driver of what the corporation actually does. Instead, I think that almost all of the optimization pressure in an organization comes from having a large number of people trying things, seeing what works and what doesn't work, and then doing more of the kinds of things that worked and fewer of the things that didn't work.

Let's consider a concrete example. Imagine a gym chain called "SF Fitness." About half of the revenue of "SF Fitness" comes from customers who are paying a monthly membership but have not attended the gym in quite a while. We observe that "SF Fitness" has made the process of cancelling your membership extremely obnoxious - to cancel their membership, a customer must navigate a confusing phone tree.

I expect that the "navigate a confusing phone tree" policy was not put in place by the CEO saying "we could make a lot of money by making it obnoxious to cancel your membership, so I will send a feature request to make a confusing phone tree". Instead, I expect that the origin of the confusing phone tree was something like

  1. The CEO set a KPI of "minutes of customer service time spent per month of membership". This created an incentive for specific employees, whose bonuses or status were tied to that KPI.
  2. One obvious way of reducing CS time was to add a simple menu for callers which allowed them to do certain things without needing to interact with a person (e.g. get gym hours), and which directed calls to the correct person if they did in fact need to go to a real person (e.g. correct language, department).
  3. There was continued testing of which options, and which order of options, worked best to reduce CS time.
  4. The CEO set a KPI of "fraction of customers cancelling their membership each month".
  5. Changes to the phone tree which increased cancellation rates made that metric worse, and so were more likely to be reverted than changes which decreased cancellation rates.
  6. After a bunch of iterations, "SF Fitness" had a deep, confusing, nonsensical phone tree without any single person having made the decision to create it. That phone tree is more effective per unit of spent resources at reducing cancellations than any strategy that the CEO would have come up with themselves.
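The ratchet-like process in steps 1-6 is essentially local search on a metric. A toy sketch of that process, where `propose_change` and `retention_metric` are made-up stand-ins for phone-tree tweaks and A/B measurement:

```python
import random

def iterate_on_metric(initial, propose, metric, steps=200, seed=0):
    """'Try things, keep what works': propose a small change each step and
    revert it unless the metric stays the same or improves."""
    rng = random.Random(seed)
    current, best = initial, metric(initial)
    for _ in range(steps):
        candidate = propose(current, rng)
        score = metric(candidate)
        if score >= best:  # keep changes that help (or at least don't hurt)
            current, best = candidate, score
    return current, best

# Hypothetical stand-ins: the state is phone-tree depth, and retention
# improves with depth but saturates, mimicking the KPI in the story above.
def propose_change(depth, rng):
    return max(1, depth + rng.choice([-1, 1]))

def retention_metric(depth):
    return min(depth, 8)

final_depth, score = iterate_on_metric(1, propose_change, retention_metric)
```

No step in this loop requires anyone to model *why* a deeper tree reduces cancellations; the confusing end state emerges purely from accepting whatever moved the metric.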

As a side note, I don't think that prediction markets would actually improve the operation of most corporations by very much, relative to the current dominant approach of A/B testing. I can expand on why I think that if that's a crux for you.

TL;DR: I think corporations are far better than individual humans at steering the world state into states that look like "the corporation controls lots of resources", but worse at steering the world into arbitrary states.

I'm basically with you until here:

That phone tree is more effective per unit of spent resources at reducing cancellations than any strategy that the CEO would have come up with themselves.
 


Explicitly optimizing the cancellation part of the phone tree for confusion and difficulty definitely seems like a strategy that a not-particularly-ethical CEO or product manager could come up with entirely on their own.

Companies routinely use dark patterns in software, in many cases as an explicit choice. On paper, the use of these patterns might be justified by KPIs or A/B tests or iteration to hide their true purpose. Maybe at some companies, these patterns get used accidentally or "emergently" with no one in the entire organization explicitly optimizing for the thing that the dark pattern is actually accomplishing. My claim is that any such company would be more effective, or at least no worse off, if the decision-makers were explicitly optimizing. At the very least, they could just skip or half-ass a bunch of the A/B testing and iteration when they already know their destination.
 

I claim that a company that uses the strategy of "come up with a target KPI and then implement every possible dark pattern which could plausibly lead to that KPI doing what we want it to do" will be outperformed by a company which uses the strategy of "come up with a bunch of things to do which will plausibly change the value of that KPI, try all of them out on a small scale, and then scale up the ones that work and scale down the ones that don't work".

For context on what's driving my intuitions here: I at one point worked at a startup where one of the cofounders did pretty much operate under the philosophy of "let's look at what other companies vaguely in our space do, paying particular attention to things which look like they're intended to trick customers out of their money, and implement those things in our own product as faithfully as possible". That strategy did in fact sometimes work, but in most cases it significantly hurt retention while being minimally useful for revenue (or, in a couple of cases, hurting both retention and revenue).

In the language of your steering systems post (thank you for writing that BTW), I expect a company where the pruning system is "try things out at small scale in the real world and iterate" will outperform even a human who has a very good world-model.

I actually suspect that this is a more general disagreement -- I think that, in complicated domains, the approach of "figure out what things work locally, do those things, and iterate" outperforms the approach of "look at the problem, work really hard on coming up with an explicit model of the reward landscape, and then do the optimal thing according to your model". Obviously you can outperform either approach in isolation by combining them, but I think that the best performance is generally far to the "try things and iterate" side. If that's still a thing you disagree with, even in that framing, I suspect that's a useful crux for us to explore more.

Edit: To be more explicit, I think that corporations are more powerful at steering the future towards a narrow space than individual humans because they are able to try out more things than any individual human, not because they have a better internal model of the world or better process for deciding which of two atomic, mutually exclusive plans to execute.

I think this claim:

I actually suspect that this is a more general disagreement -- I think that, in complicated domains, the approach of "figure out what things work locally, do those things, and iterate" outperforms the approach of "look at the problem, work really hard on coming up with an explicit model of the reward landscape, and then do the optimal thing according to your model".

is, to first order, probably the most general crux for whether you view LW as a useful thing (perhaps the most important useful thing) or as essentially worthless.

It strikes at the core of the LessWrong worldview, so it's natural that such deep differences result in different predictions.

To be clear, I think you can sensibly disagree with people on LW about AI risk being high or a real thing (as well as about other issues I haven't looked at) even under a worldview which agrees with "look at the problem, work really hard on coming up with an explicit model of the reward landscape, and then do the optimal thing according to your model". But I think a lot of topics on LW make much more sense if you fundamentally buy the worldview under which model-making matters more than iteration.

A few remarks, not necessarily disagreements with anything specific:

  • "hire a bunch of people and tell them to try a bunch of things according to some general guidelines, rather than explicitly micromanaging them" is causally upstream of trying out those things.
  • Given access to the same resources, sufficiently smart humans are usually capable of explicit strategy stealing of any other human or group of humans in full generality and on any level of meta. Though object-level strategy stealing of your competitors might not always be a good strategy, as you point out.
  • "figure out what things work locally, do those things, and iterate." Agreed that this is a good strategy in general. I'm saying that an individual explicitly reflecting on and reasoning about what an organization is trying to do and the strategy they're using to do it, should always help, or at least not hurt, if done correctly and at the right level of generality and meta. We might disagree about how strong those preconditions are, and how likely they are to be met in practice.

All of those remarks look correct to me. Though "at the right level of generality and meta" is doing a lot of the work.

Good post. I very much agree with you on your point that capabilities will advance through script wrappers for recursive prompting, as well as on LLMs directly. I just finished a post elaborating all of the several ways I think these will advance relatively rapidly, and at the least they will improve LLM effective intelligence modestly, making them part of the de facto standard for deployed AI. I'm very curious what you think both about the claims and how I'm trying to explain them.

Capabilities and alignment of LLM cognitive architectures