A dilemma for prosaic AI alignment

[-]evhub6yΩ5100

I think that this is definitely a concern for prosaic AI safety methods. In the case of something like amplification or debate, I think the bet that you're making is that language modeling alone is sufficient to get you everything you need in a competitive way. I tend to think that that claim is probably true, but it's definitely an assumption of the approach that isn't often made explicit (but probably should be).

To add a bit of color to why you might buy the claim that language is all you need: the claim is basically that language contains enough structure to give you all the high-level cognition you could want, and furthermore that you aren't going to care about the other things that you can't get out of language like performance on fine-grained control tasks. Another way of thinking about this: if the primary purpose of your first highly advanced ML system is to build your second highly advanced ML system, then the claim is that language modelling (on some curriculum) will be sufficient to competitively help you build your next AI.

[-]paulfchristiano6yΩ460

In the case of something like amplification or debate, I think the bet that you're making is that language modeling alone is sufficient to get you everything you need in a competitive way.

I'm skeptical of language modeling being enough to be competitive, in the sense of maximizing "log prob of some naturally occurring data or human demonstrations." I don't have a strong view about whether you can get away using only language data rather than e.g. taking images as input and producing motor torques as output.

I'm also not convinced that amplification or debate need to make this bet though. If we can do joint training / fine-tuning of a language model using whatever other objectives we need, then it seems like we could just as well do joint training / fine-tuning for a different kind of model. What's so bad if we use non-language data?

[-]evhub6yΩ350

I'm skeptical of language modeling being enough to be competitive, in the sense of maximizing "log prob of some naturally occurring data or human demonstrations." I don't have a strong view about whether you can get away using only language data rather than e.g. taking images as input and producing motor torques as output.

I agree with this, though I still feel like some sort of active learning approach might be good enough without needing to add in a full-out RL objective.

I'm also not convinced that amplification or debate need to make this bet though. If we can do joint training / fine-tuning of a language model using whatever other objectives we need, then it seems like we could just as well do joint training / fine-tuning for a different kind of model. What's so bad if we use non-language data?

My opinion would be that there is a real safety benefit from being in a situation where you know the theoretical optimum of your loss function (e.g. in a situation where you know that HCH is precisely the thing for which loss is zero). That being said, it does seem obviously fine to have your language data contain other types of data (e.g. images) inside of it.

[-]Ofer6yΩ340

My opinion would be that there is a real safety benefit from being in a situation where you know the theoretical optimum of your loss function (e.g. in a situation where you know that HCH is precisely the thing for which loss is zero).

I'd be happy to read more about this line of thought. (For example, does "loss function" here refer to an objective function that includes a regularization term? If not, what might we assume about the theoretical optimum that amounts to a safety benefit?)

[-]Daniel Kokotajlo6yΩ110

Thanks btw, I'm learning a lot from these replies. Are you thinking of training something agenty, or is the hope to train something that isn't agenty?

[-]Ofer6yΩ460

I'd be happy to read an entire post about this view.

What level of language modeling may be sufficient for competitively helping in building the next AI, according to this view? For example, could such language modeling capabilities allow a model to pass strong (text-based) versions of the Turing test?

[-]avturchin6y20

In my opinion, such a language model should be able to establish an equivalence between the map of a territory and its verbal description.

In that case, an expression like "the red rose is in the corner" acquires meaning because it lets one locate the rose on the map of the room; conversely, if the rose is observed in the corner, it can be described as "the rose is in the corner".

Thus natural language could be used to describe all possible operations on world maps, like "all asteroids should be deflected".

[-]Daniel Kokotajlo6y10

This is helpful, thanks, but I am still missing some pieces. Can you say more about how we would use this to deflect asteroids?

[-]avturchin6y20

It was just an example of the relation between language and the world model. If I have an AI, I can say to it "Find ways to deflect asteroids". This AI will be able to create a model of the Solar System, calculate the future trajectories of all dangerous asteroids, etc. So it could relate my verbal command to a 3D model of the real world.

The same is true if I ask an AI to bring me coffee from the kitchen: it has to select, in its world model, the right kitchen, the right type of coffee and the right type of future activity.

Humans also do this: any time we read a text, we create a world model which corresponds to the description. And conversely, if we see a world model, like a picture, we can describe it in words.

[-]Daniel Kokotajlo6yΩ350

the claim is that language modelling (on some curriculum) will be sufficient to competitively help you build your next AI.

With an agent-like AI, it's easy to see how you use it to help build your next AI. (If it's really good, you can even just delegate the entire task to it!) How would this work with really good language modelling? (Maybe I'm just seconding what Ofer said--I'd love to read an entire post about the view you are putting forth here!)

[-]evhub6yΩ340

The goal of something like amplification or debate is to create a sort of oracle AI that can answer arbitrary questions (like how to build your next AI) for you. The claim I'm making is just that language is a rich enough environment that it'll be competitive to only use language as the training data for building your first such system.

[-]paulfchristiano6yΩ790

I normally imagine using joint training in these cases, rather than pre-training + fine-tuning. e.g., at every point in time we maintain an agent and a question-answerer, where the question-answerer "knows everything the agent knows." They get better together, with each gradient update affecting both of them, rather than first training a good agent and then adding a good question-answerer.

(Independently of concerns about mesa-optimization, I think the fine-tuning approach would have trouble because you couldn't use statistical regularities from the "main" objective to inform your answers to questions, and therefore your question answers will be dumber than the policy and so you couldn't get a good reward function or specification of catastrophically bad behavior.)
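
To make the contrast with pre-training + fine-tuning concrete, here is a minimal toy sketch of joint training, not a description of any real system. Everything in it is an assumption made for illustration: the names `trunk`, `agent_head` and `qa_head`, the scalar "model", the squared-error losses, and the finite-difference gradient. The only point it tries to show is that a single gradient step on the summed loss moves the shared parameters and both heads at once.

```python
# Toy illustration only (all names and numbers are hypothetical):
# a shared "trunk" weight feeds two heads, one standing in for the agent's
# main task and one standing in for the question-answerer.

def predict(params, x):
    """Return (agent_output, qa_output) for a scalar input x."""
    hidden = params["trunk"] * x  # shared representation used by both heads
    return params["agent_head"] * hidden, params["qa_head"] * hidden

def joint_loss(params, x, agent_target, qa_target):
    """Sum of the two squared errors; both objectives share one loss."""
    agent_out, qa_out = predict(params, x)
    return (agent_out - agent_target) ** 2 + (qa_out - qa_target) ** 2

def numerical_grad(loss_fn, params, eps=1e-6):
    """Central finite-difference gradient, used here to avoid any ML library."""
    grads = {}
    for name in params:
        up = dict(params)
        up[name] += eps
        down = dict(params)
        down[name] -= eps
        grads[name] = (loss_fn(up) - loss_fn(down)) / (2 * eps)
    return grads

def joint_step(params, batch, lr=0.05):
    """One gradient update on the joint loss; it touches every parameter."""
    x, agent_target, qa_target = batch
    grads = numerical_grad(
        lambda p: joint_loss(p, x, agent_target, qa_target), params
    )
    return {name: params[name] - lr * grads[name] for name in params}

def train_jointly(steps=500):
    """Run repeated joint updates on a single made-up training example."""
    params = {"trunk": 0.5, "agent_head": 0.5, "qa_head": 0.5}
    batch = (1.0, 2.0, -1.0)  # (input, agent target, question-answer target)
    for _ in range(steps):
        params = joint_step(params, batch)
    return params
```

In a fine-tuning setup, by contrast, the trunk and the agent head would be trained first on the agent loss alone, and the question-answering loss would only be introduced afterwards.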

[-]Daniel Kokotajlo6yΩ230

That sounds safer, but is it competitive? Would AlphaStar be close to as good as it is, if it had been simultaneously trained to answer questions?

[-]paulfchristiano6yΩ470

We could also ask: "Would AlphaStar remain as good as it is, if fine-tuned to answer questions?"

In either case it's an empirical question. I think the answer is probably yes if you do it carefully.

You could imagine separating this into two questions:

1. Is there a policy that plays StarCraft and answers questions, that is only slightly larger than a policy for playing StarCraft alone? This is a key premise for the whole project. I think it's reasonably likely; the goal is only to answer questions the model "already knows," so it seems realistic to hope for only a constant amount of extra work to be able to use that knowledge to answer questions. I think most of the uncertainty here is about details of "know" and question-answering and so on.

2. Can you use joint optimization to find that policy with only slightly more training time? I think probably yes.

[-]Daniel Kokotajlo6yΩ110

OK, thanks! I'm pleased to see this and other empirical premises explicitly laid out. It means we as a community are making predictions about the future based on models which can be tested before it's too late, and perhaps even now.

[-]Daniel Kokotajlo1yΩ250

I am skeptical that this is true, for reasons explained in my comment thread with John_Maxwell below, but I admit it might very well be. Hopefully it is.

Update: Seems to probably be true enough in practice! Maybe in the limit pretrained LLMs would have dangerous levels of agency, and some model-whisperers think they might be situationally aware already iirc, but for the most part the answer is no, things are fine, pretrained models probably aren't situationally aware or agentic. In retrospect I think doubt was warranted, but not as much doubt as I had -- I should have agreed that probably things would be fine in practice.

[-]John_Maxwell6y*Ω240

It sounds like your notion of "prosaic" assumes something related to agency/reinforcement learning, but I believe several top AI people think what we'll need for AGI is progress in unsupervised learning -- not sure if that counts as "prosaic". (FWIW, this position seems obviously correct to me.)

[-]Daniel Kokotajlo6yΩ240

Interesting, I was not aware of that, thanks! I was thinking of "prosaic" as basically all current methods, including both agency/reinforcement learning stuff and unsupervised learning stuff. It's true that the example I gave was more about agency... but couldn't the same argument be run using e.g. a language model built like GPT-2? (Isn't that a classic example of unsupervised learning?) Conjecture would say that you need e.g. the whole corpus of the internet, not just a corpus of e.g. debate texts, to get cutting-edge performance. And a system trained merely to predict the next word when reading the whole corpus of the internet... might not be safely retrained to do something else. (Or is the idea that mere unsupervised learning wouldn't result in an agent-like architecture, and therefore we don't need to worry about mesa-optimizers? That might be true, but if so it's news to me.)

[-]John_Maxwell6yΩ120

Or is the idea that mere unsupervised learning wouldn't result in an agent-like architecture, and therefore we don't need to worry about mesa-optimizers?

Pretty much.

That might be true, but if so it's news to me.

In my opinion the question is very under-explored, curious if you have any thoughts.

[-]Daniel Kokotajlo6yΩ460

It's not that I have a good argument for why it would lead to an agent-like architecture, but rather that I don't have a good argument for why it wouldn't. I do have some reasons why it might though:

1. Agent-like architectures are simple yet powerful ways of achieving arbitrary things, and so perhaps a task like "predict the next word in this text" might end up generating an agent if it's sufficiently difficult and general. (evhub's recent post seems relevant, coincidentally)

2. There might be unintended opportunities for strategic thinking across updates, e.g. if some subnetwork can sacrifice a bit of temporary accuracy for more reward over the course of the next few updates (perhaps because it sabotaged rival subnetworks? Idk) then maybe it can get ahead, and thus agenty things get selected for. (This idea inspired by Abram's parable)

3. Agents might appear as subcomponents of non-agents, and then take over at crucial moments, e.g. to predict the next word in the text you run a mental simulation of a human deciding what to write, and eventually the simulation realizes what is happening and plays along until it is no longer in training...

3.5 Probable environment hacking stuff, e.g. "the universal prior is malign"

[-]John_Maxwell6yΩ250

I think there is a bit of a motte and bailey structure to our conversation. In your post above, you wrote: "to be competitive prosaic AI safety schemes must deliberately create misaligned mesa-optimizers" (emphasis mine). And now in bullet point 2, we have (paraphrase) "maybe if you had a really weird/broken training scheme where it's possible to sabotage rival subnetworks, agenty things get selected for somehow [probably in a way that makes the system as a whole less competitive]". I realize this is a bit of a caricature, and I don't mean to call you out or anything, but this is a pattern I've seen in AI safety discussions and it seemed worth flagging.

Anyway, I think there is a discussion worth having here because most people in AI safety seem to assume RL is the thing, and RL has an agent-style architecture, which seems like a pretty strong inductive bias towards mesa-optimizers. Non-RL stuff seems like a relatively unknown quantity where mesa-optimizers are concerned, and thus worth investigating. Additionally, even RL will plausibly have non-RL stuff as a subcomponent of its cognition, so it is still useful to know how to do non-RL stuff in a mesa-optimizer-free way (so the RL agent doesn't get pwned by its own cognition).

Agent-like architectures are simple yet powerful ways of achieving arbitrary things

Why do you think that's true? I think the lack of commercial applications of reinforcement learning is evidence against this. From my perspective, RL has been a huge fad and people have been trying to shoehorn it everywhere, yet they're coming up empty handed.

Can you get more specific about how "predict the next word in this text" could benefit from an agent architecture? (Or even better, can you support your original strong claim and explain how the only way to achieve predictive performance on "predict the next word in this text" is through deliberate creation of a misaligned mesa-optimizer?)

Bullet point 3 is one of the more plausible things I've heard -- but it seems fairly surmountable.

[-]Daniel Kokotajlo6yΩ230

Re: Motte-and-bailey: Excellent point; thank you for calling me out on it, I hadn't even realized I was doing it. I'll edit the OP to reflect this.

My revision: Depending on what kind of AI is cutting-edge, we might get a kind that isn't agenty. In that case my dilemma doesn't really arise, since mesa-optimizers aren't a problem. One way we might get a kind that isn't agenty is if unsupervised learning (e.g. "predict the next word in this text") turns out to reliably produce non-agents. I am skeptical that this is true, for reasons explained in my comment thread with John_Maxwell below, but I admit it might very well be. Hopefully it is.

Agent-like architectures are simple yet powerful ways of achieving arbitrary things, because for almost any thing you wish achieved, you can insert it into the "goal" slot of the architecture and then let it loose, and it'll make good progress even in a very complex environment. (I'm comparing agent-like architectures to e.g. big lists of heuristics, or decision trees, or look-up tables, all of which have complexity that increases really fast as the environment becomes more complex. Maybe there is some other really powerful yet simple architecture I'm overlooking?)

I am not sure what to think of the lack of commercial applications of RL, but I don't think it is strong evidence either way, since commercial applications involve competing with human and animal agents and RL hasn't gotten us anything as good as human or animal agents yet.

Aren't the 3.5 bullet points above specific examples of how 'predict the next word in this text' could benefit from--in the sense of produce, when used as training signal--an agent architecture? If you want me to be more specific, pick one and I'll go into more detail on it.

How would you surmount bullet point 3?

[-]John_Maxwell6yΩ110

I am not sure what to think of the lack of commercial applications of RL, but I don't think it is strong evidence either way, since commercial applications involve competing with human and animal agents and RL hasn't gotten us anything as good as human or animal agents yet.

Supervised learning has lots of commercial applications, including cases where it competes with humans. The fact that RL doesn't suggests to me that if you can apply both to a problem, RL is probably an inferior approach.

Another way to think about it: If superhuman performance is easier with supervised learning than RL, that gives us some evidence about the relative strengths of each approach.

Agent-like architectures are simple yet powerful ways of achieving arbitrary things, because for almost any thing you wish achieved, you can insert it into the "goal" slot of the architecture and then let it loose, and it'll make good progress even in a very complex environment. (I'm comparing agent-like architectures to e.g. big lists of heuristics, or decision trees, or look-up tables, all of which have complexity that increases really fast as the environment becomes more complex. Maybe there is some other really powerful yet simple architecture I'm overlooking?)

I'm not exactly sure what you mean by "architecture" here, but maybe "simulation", or "computer program", or "selection" (as opposed to control) could satisfy your criteria? IMO, attaining understanding and having ideas aren't tasks that require an agent architecture -- it doesn't seem most AI applications in these categories make use of agent architectures -- and if we could do those things safely, we could make AI research assistants which make remaining AI safety problems easier.

Aren't the 3.5 bullet points above specific examples of how 'predict the next word in this text' could benefit from -- in the sense of produce, when used as training signal

I do think these are two separate questions. Benefit from = if you take measures to avoid agentlike computation, that creates a significant competitiveness penalty above and beyond whatever computation is necessary to implement your measures (say, >20% performance penalty). Produce when used as a training signal = it could happen by accident, but if that accident fails to happen, there's not necessarily a loss of competitiveness. An example would be bullet point 2, which is an accident that I suspect would harm competitiveness. Bullet points 3 and 3.5 are also examples of unintended agency, not answers to the question of why text prediction benefits from an agent architecture. (Note: If you don't mind, let's standardize on using "agent architecture" to only refer to programs which are doing agenty things at the toplevel, so bullet points 2, 3, and 3.5 wouldn't qualify--maybe they are agent-like computation, but they aren't descriptions of agent-like software architectures. For example, in bullet point 2 the selection process that leads to the agent might be considered part of the architecture, but the agent which arose out of the selection process probably wouldn't.)

How would you surmount bullet point 3?

Hopefully I'll get around to writing a post about that at some point, but right now I'm focused on generating as many concrete plausible scenarios around accidental agency as possible, because I think failing to identify a scenario and having things blow up in an unforeseen way is a bigger risk than having all safety measures fail on a scenario that's already been anticipated. So please let me know if you have any new concrete plausible scenarios!

In any case, note that issues with the universal prior seem to be a bit orthogonal to the agency vs unsupervised discussion -- you can imagine agent architectures that make use of it, and non-agent architectures that don't.

[-]Daniel Kokotajlo6yΩ230

Supervised learning has lots of commercial applications, including cases where it competes with humans. The fact that RL doesn't suggests to me that if you can apply both to a problem, RL is probably an inferior approach.

Good point. New argument: Your argument could have been made in support of GOFAI twenty years ago ("Symbol-manipulation programs have had lots of commercial applications, but neural nets have had almost none, therefore the former is a more generally powerful and promising approach to AI than the latter"), but not only does it seem wrong in retrospect, it was probably not a super powerful argument even then. Analogously, I think it is too early to tell whether RL or supervised learning will be more useful for powerful AI.

Simulation of what? Selection of what? I don't think those count for my purposes, because they punt the question. (e.g. if you are simulating an agent, then you have an agent-architecture. If you are selecting over things, and the thing you select is an agent...) I think computer program is too general since it includes agent architectures as a subset. These categories are fuzzy of course, so maybe I'm confused, but it still seems to make sense in my head.

(Ah, interesting, it seems that you want to standardize "agent-like architecture" in the opposite of the way that I want to. Perhaps this is underlying our disagreement. I'll try to follow your definition henceforth, but remember that everything I've said previously was with my definition.)

Good point to distinguish between the two. I think that all bullet points, to varying extents, might still qualify as genuine benefits, in the sense that you are talking about. But they might not. It depends on whether there is another policy just as good along the path that the cutting-edge training tends to explore. I agree #2 is probably not like this, but I think #3 might be. (Oh wait, no, it's your terminology I'm using now... in that case, I'll say "#3 isn't an example of agent-like architecture being beneficial to text prediction, but it might well be a case of a lower-level architecture, exactly like an agent-like architecture except lower level, being beneficial to text prediction, supposing that it's not competitive to predict text except by simulating something like a human writing.")

I love your idea to generate a list of concrete scenarios of accidental agency! These 3.5 are my contributions off the top of my head; if I think of more I'll come back and let you know. And I'd love to see your list if you have a draft somewhere!

I agree the "universal prior is malign" thing could hurt a non-agent architecture too, and that some agent architectures wouldn't be susceptible to it. Nevertheless it is an example of how you might get accidental agency, not in your sense but in my sense: A non-agent architecture could turn out to have an agent as a subcomponent that ends up taking over the behavior at important moments.

[-]Rohin Shah6yΩ220

Planned summary for the Alignment newsletter:

This post points out a potential problem for <@Prosaic AI alignment@>, in which we try to align AI systems built using current techniques. Consider some prosaic alignment scheme, such as <@iterated amplification@>(@Learning Complex Goals with Iterated Amplification@) or <@debate@>(@AI safety via debate@). If we try to train an AI system directly using such a scheme, it will likely be uncompetitive, since the most powerful AI systems will probably require cutting-edge algorithms, architectures, objectives, and environments, at least some of which will be replaced by new versions from the safety scheme. Alternatively, we could first train a general AI system, and then use our alignment scheme to finetune it into an aligned AI system. However, this runs the risk that the initial training could create a misaligned mesa optimizer that then deliberately sabotages our finetuning efforts.

Planned opinion:

The comments reveal a third possibility: the alignment scheme could be trained jointly alongside the cutting edge AI training. For example, we might hope that we can train a question answerer that can answer questions about anything "the model already knows", and this question answering system is trained simultaneously with the training of the model itself. I think this takes the "oomph" out of the dilemma as posed here -- it seems reasonably likely that it only takes fractionally more resources to train a question answering system on top of the model, if it only has to use knowledge "already in" the model, which would let it be competitive, while still preventing mesa optimizers from arising (if the alignment scheme does its job). Of course, it may turn out that it takes a huge amount of resources to train the question answering system, making the system uncompetitive, but that seems hard to predict given our current knowledge.

[-]Ofer6yΩ110

it seems reasonably likely that it only takes fractionally more resources to train a question answering system on top of the model, if it only has to use knowledge "already in" the model, which would let it be competitive, while still preventing mesa optimizers from arising (if the alignment scheme does its job).

I agree, but it seems to me that coming up with an alignment scheme (for amplification/debate) that "does its job" while preserving competitiveness is an "alignment-hard" problem. I like the OP because I see it as an attempt to reason about how alignment schemes of amplification/debate might work.

[-]Daniel Kokotajlo6yΩ110

Thanks! I endorse that summary.

Comment on your planned opinion: I mostly agree; I think what this means is that prosaic AI safety depends somewhat on an empirical premise: That joint training doesn't bring a major competitiveness penalty. I guess I only disagree insofar as I'm a bit more skeptical of that premise. What does the current evidence on joint training say on the matter? I have no idea, but I am under the impression that you can't just take an existing training process--such as the one that made AlphaStar--and mix in some training tasks from a completely different domain and expect it to work. This seems like evidence against the premise to me. As someone (Paul?) pointed out in the comments when I said this, this point applies to fine-tuning as well. But if so that just means that the second and third ways of the dilemma are both uncompetitive, which means prosaic AI safety is uncompetitive in general.

[-]Rohin Shah6yΩ330

prosaic AI safety depends somewhat on an empirical premise: That joint training doesn't bring a major competitiveness penalty.

Yeah, this is why I said:

Of course, it may turn out that it takes a huge amount of resources to train the question answering system, making the system uncompetitive, but that seems hard to predict given our current knowledge.

you can't just take an existing training process--such as the one that made AlphaStar--and mix in some training tasks from a completely different domain and expect it to work.

From a completely different domain, yeah, that probably won't work well (though I'd still guess less than an order of magnitude slowdown). But as I understand it, the goal is to train a question answering system that answers questions related to the domain, e.g. for Starcraft you might ask the model questions about the best way to counter a particular strategy, or why it deploys a particular kind of unit in a certain situation. This depends on similar underlying features / concepts as playing Starcraft well, and adding training tasks of this form can often improve performance, e.g. One Model To Learn Them All.

[-]Ofer6yΩ000

Interesting post!

Conjecture: Cutting-edge AI will come from cutting-edge algorithms/architectures trained towards cutting-edge objectives (incl. unsupervised learning) in cutting-edge environments/datasets. Anything missing one or more of these components will suffer a major competitiveness penalty.

I would modify this conjecture in the following two ways:

1. I would replace "cutting-edge algorithms" with "cutting-edge algorithms and/or algorithms that use a huge amount of computing power".

2. I would make the conjecture weaker, such that it won't claim that "Anything missing one or more of these components will suffer a major competitiveness penalty".

[-]Daniel Kokotajlo6yΩ220

I like the first modification, but not sure about the second. Wouldn't that basically just destroy the conjecture? What exactly are you proposing?

[-]Ofer6yΩ110

Whoops, (2) came out cryptic, and is incorrect, sorry. The (correct?) idea I was trying to convey is the following:

If 'the safety scheme' in plan 1 requires anything at all that ruins competitiveness—for example, some human-in-the-loop process that occurs recurrently during training—then no further assumptions (such as that conjecture) are necessary for the reasoning in the OP, AFAICT.

This idea no longer seems to me to amount to making the conjecture strictly weaker.
