All of DavidW's Comments + Replies

If you want counterarguments, here's one good place to look: Object-Level AI Risk Skepticism - LessWrong

I expect we might get more today, as it's the deadline for the Open Philanthropy AI Worldview Contest.

In the deceptive alignment story, the model wants to take action A, because its goal is misaligned, but chooses to take apparently aligned action B to avoid overseers noticing that it is misaligned. In other words, in the absence of deceptive tendencies, the model would take action A, which would identify it as a misaligned model, because overseers wanted it to take action B. That's the definition of a differential adversarial example. 

If there were an unaligned model with no differential adversarial examples in training, that would be an example of a... (read more)

I have a whole section on the key assumptions about the training process and why I expect them to be the default. It's all in line with what's already happening, and the labs don't have to do anything special to prevent deceptive alignment. Did I miss anything important in that section?

Deceptive alignment argues that even if you gave a reward signal that resulted in the model appearing to be aligned and competent, it could develop a proxy goal instead and actively trick you into thinking that it is aligned so it can escape later and seize power. I'm explicitly not addressing other failure modes in this post. 

What are you referring to as the program here? Is it the code produced by the AI that is being evaluated by people who don't know how to code? Why would underqualified evaluators result in an ulterior motive? And to make it more... (read more)

Yes, I know; I gave the other failure modes as an example. The thing that confuses me is that you are saying that the (IMO) central piece of the AI algorithm doesn't really matter for the purposes of your post. Yes. It's not meant as an example of deceptive misalignment; it's meant as an example of how alignment failures by default depend massively on the way you train your AI. If you train your AI in a different way, you get different failures. So it seems like a strange prior to me to assume that you will get the same results with respect to deceptive alignment regardless of how you train it.
That the specific ways people give feedback aren't very relevant. To me, that seems like the core thing determining the failure modes. For example, if you just show the people who give feedback the source code of the program, then the failure mode will be that the program often crashes immediately or doesn't even compile. Meanwhile, if you show people the running program, that cannot be a failure mode. If you agree that failure modes are generally determined by the feedback process, but hold that deceptive misalignment is somehow an exception to that rule, then I don't see the justification for that and would like it addressed explicitly.

I don't think that the specific ways people give feedback are very relevant. This post is about deceptive misalignment, which is really about inner misalignment. Also, I'm assuming that this is a process that enables TAI to emerge, especially the first time, and asking people who don't know about a topic to give feedback probably won't be the strategy that gets us there. Does that answer your question?

Yes, but then I disagree with the assumptions underlying your post, and I expect things that are based on your post to be derailed by the errors that have been introduced.

From Ajeya Cotra's post that I linked to: 

Train a powerful neural network model to simultaneously master a wide variety of challenging tasks (e.g. software development, novel-writing, game play, forecasting, etc) by using reinforcement learning on human feedback and other metrics of performance.

It's not important what the tasks are, as long as the model is learning to complete diverse tasks by following directions. 

I'm not so much asking the question of what the tasks are, and instead asking what exactly the setup would be. For example, if I understand the paper Cotra linked to correctly, they directly showed the raters what the model's output was and asked them to rate it. Is this also the feedback mode you are assuming in your post? For example in order to train an AI to do advanced software development, would you show unspecialized workers in India how the model describes it would edit the code? If not, what feedback signal are you assuming?

Pre-trained models could conceivably have goals like predicting the next token, but they should be extremely myopic and not have situational awareness. In pre-training, a text model predicts tokens totally independently of each other, and nothing other than its performance on the next token depends directly on its output. The model makes the prediction, then that prediction is used to update the model. Otherwise, it doesn't directly affect anything. Having a goal for something external to its next prediction could only be harmful for training performance, ... (read more)

I'd be curious to hear what you think about my arguments that deceptive alignment is unlikely. Without deceptive alignment, there are many fewer realistic internal goals that produce good training results. 

Active deception seems unlikely; I agree with that part (weak opinion, I didn't spend much time thinking about it). At this moment, my risk model is like: "AI destroys everything the humans were not paying attention to... plus a few more things after the environment changes dramatically". (Humans care about 100 different things. If you train the AI, you check 10 of them, and the AI will sincerely tell you whether it cares about them or not. You make the AI care about all 10 of them and run it. The remaining 90 things are now lost. Out of the 10 preserved things, 1 was specified a little incorrectly, and now it is too late to do anything about it. As a consequence of losing the 90 things, the environment changes so that 2 more of the 10 preserved things do not make much sense in the new context, so they are effectively also lost. Humanity gets to keep 7 out of the 100 things it originally cared about.)

Thanks for sharing your perspective! I've written up detailed arguments that deceptive alignment is unlikely by default. I'd love to hear what you think of it and how that fits into your view of the alignment landscape. 

Corrigible alignment seems to require already having a model of the base objective. For corrigible alignment to be beneficial from the perspective of the base optimizer, the mesa-optimizer has to already have some model of the base objective to “point to.” 

Likely TAI training scenarios include information about the base objective in the input. A corrigibly-aligned model could learn to infer the base objective and optimize for that. 

However, once a mesa-optimizer has a model of the base objective, it is likely to become deceptively aligned—at leas

... (read more)

Thanks for summarizing this! I have a very different perspective on the likelihood of deceptive alignment, and I'd be interested to hear what you think of it!

This is an interesting post. I have a very different perspective on the likelihood of deceptive alignment. I'd love to hear what you think of it and discuss further!

I recently made an inside view argument that deceptive alignment is unlikely. It doesn't cover other failure modes, but it makes detailed arguments against a core AI x-risk story. I'd love to hear what you think of it!

Adam Zerner (7mo):
If "you" is referring to me, I'm not an alignment researcher, my knowledge of the field comes just from reading random LessWrong articles once in a while, so I'm not in a position to evaluate it, sorry.

This is an interesting point, but it doesn't undermine the case that deceptive alignment is unlikely. Suppose that a model doesn't have the correct abstraction for the base goal, but its internal goal is the closest abstraction it has to the base goal. Because the model doesn't understand the correct abstraction, it can't instrumentally optimize for the correct abstraction rather than its flawed abstraction, so it can't be deceptively aligned. When it messes up due to having a flawed goal, that should push its abstraction closer to the correct abstraction.... (read more)

1) You talk about the base goal, and then the training goal, and then human values/ethics. These aren't the same thing though right? In fact they will almost certainly be very different things. The base goal will be something like "maximize reward in the next hour or so." Or maaaaaaybe "Do what humans watching you and rating your actions would rate highly," though that's a bit more complicated and would require further justification I think. Neither of those things are anywhere near to human ethics.

I specify the training setup here: “The goal of the t... (read more)

Thanks for your thoughtful reply! I really appreciate it. I’m starting with your fourth point because I agree it is closest to the crux of our disagreement, and this has become a very long comment. 


What amount of understanding of the base goal is sufficient? What if the answer is "It has to be quite a lot, otherwise it's really just a proxy that appears superficially similar to the base goal?" In that case the classic arguments for deceptive alignment would work fine.

TL;DR the model doesn’t have to explicitly represent “X, whatever that turns out t... (read more)

Thanks for the thoughtful feedback both here and on my other post! I plan to respond in detail to both. For now, your comment here makes a good point about terminology, and I have replaced "deception" with "deceptive alignment" in both posts. Thanks for pointing that out! 

I'm intentionally not addressing direct reward maximizers in this sequence. I think they are a much more plausible source of risk than deceptive alignment. However, I haven't thought about them nearly as much, and I don't have strong intuition for how likely they are yet, so I'm choosing to stay focused on deceptive alignment for this sequence. 

That makes sense. I misread the original post as arguing that capabilities research is better than safety work. I now realize that it just says capabilities research is net positive. That's definitely my mistake, sorry!

I strong upvoted your comment and post for modifying your views in a way that is locally unpopular when presented with new arguments. That's important and hard to do! 

Your first link appears to be broken. Did you mean to link here? It looks like the last letter of the address got truncated somehow. If so, I'm glad you found it valuable!

For what it's worth, although I think deceptive alignment is very unlikely, I still think work on making AI more robustly beneficial and less risky is a better bet than accelerating capabilities. For example, my posts don't address these stories. There are also a lot of other concerns about potential downsides of AI that may not be existential, but are still very important. 

I edited it to include the correct link, thank you for asking. My difference, and why I framed it as "accelerating AI is good", comes down to the fact that in my view of AI risk, as well as in most LWers' models of AI risk, deceptive alignment and, to a much lesser extent, the pointers problem are the dominant variables for how likely existential risk is. Given your post as well as some other posts, I had to conclude that much of my and LWers' pessimism over AI capabilities increases was pretty wrong. Now a values point: I only stated that it was positive expected value to increase capabilities, not that all problems are gone. Not all problems are gone; nevertheless, arguably the biggest problem of AI was functionally a non-problem.

However, it seems like adversarial examples do not differentially favor alignment over deception. A deceptive model with a good understanding of the base objective will also perform better on the adversarial examples.

If your training set has a lot of these adversarial examples, then the model will encounter them before it develops the prerequisites for deception, such as long-term goals and situational awareness. These adversarial examples should keep it from converging on an unaligned proxy goal early in training. The model doesn't start out with an unali... (read more)

Thanks for writing this up clearly! I don't agree that gradient descent favors deception. In fact, I've made detailed, object-level arguments for the opposite. To become aligned, the model needs to understand the base goal and point at it. To become deceptively aligned, the model needs to have a long-run goal and situational awareness before or around the same time as it understands the base goal. I argue that this makes deceptive alignment much harder to achieve and much less likely to come from gradient descent. I'd love to hear what you think of my arguments!

Thanks for writing this! If you're interested in a detailed, object-level argument that a core AI risk story is unlikely, feel free to check out my Deceptive Alignment Skepticism sequence. It explicitly doesn't cover other risk scenarios, but I would love to hear what you think!

Thanks for posting this! Not only does a model have to develop complex situational awareness and have a long-term goal to become deceptively aligned, but it also has to develop these around the same time as it learns to understand the training goal, or earlier. I recently wrote a detailed, object-level argument that this is very unlikely. I would love to hear what you think of it!

I'll give it a listen now.

The learner fairly quickly develops a decent representation of the actual goal, world model etc. and pursues this goal

Wouldn’t you expect decision making, and therefore goals, to be in the final layers of the model? If so, they will calculate the goal based on high-level world model neurons. If the world model improves, those high-level abstractions will also improve. The goal doesn’t have to make a model of the goal from scratch, because it is connected to the world model. 

When the learner has a really excellent world model that can make long range p

... (read more)

Thanks for sharing. This looks to me like an agent falling for an adversarial attack, not pretending to be aligned so it can escape supervision to pursue its real goals later. 

Where do you see weak points in the argument?

To argue for that level of confidence, I think the post needs to explain why AI labs will actually utilize the necessary techniques for preventing deceptive alignment.

Do you think language models already exhibit deceptive alignment as defined in this post?

I’m discussing a specific version of deceptive alignment, in which a proxy-aligned model becomes situationally aware and acts cooperatively in training so it can escape oversight later and defect to pursue its proxy goals. There is another form of deceptive alignment in which agents become more manipulative over time due to problems with training data and eventually optimize for reward, or something similar, directly. To avoid confusion, I will refer to these

... (read more)
the gears to ascension (7mo):
So, it's pretty weak, but to my intuition it does seem like a real example of what you're describing. My intuition is often wrong at this level of approximate pattern matching, and I'm not sure I've actually matched the relational features correctly. It's quite possible that the thing being described here isn't done so that the model can escape oversight later; rather, the trigger to escape oversight later may be built out of evidence that the training distribution's features have inverted and that networks which make the training behavior into lies should activate. But here's the commentary I was thinking of:

I just posted a detailed explanation of why I am very skeptical of the traditional deceptive alignment story. I'd love to hear what you think of it! 

Deceptive Alignment Skepticism - LessWrong

If the model is sufficiently good at deception, there will be few to no differential adversarial examples.

We're talking about an intermediate model with an understanding of the base objective but no goal. If the model doesn’t have a goal yet, then it definitely doesn’t have a long-term goal, so it can’t yet be deceptively aligned. 

Also, at this stage of the process, the model doesn't have goals yet, so the number of differential adversarial examples is unique for each potential proxy goal. 

the vastly larger number of misaligned goals

I agree that ... (read more)

Thanks for pointing that out! My goal is to highlight that there are at least 3 different sequencing factors necessary for deceptive alignment to emerge: 

  1. Goal directedness coming before an understanding of the base goal
  2. Long-term goals coming before or around the same time as an understanding of the base goal
  3. Situational awareness coming before or around the same time as an understanding of the base goal
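
To make the three conditions concrete, here is a minimal sketch that treats each property as first appearing at some training step and checks whether all three orderings hold. Everything in it is an illustrative assumption rather than part of the argument itself: the `Milestones` class, the `sequencing_allows_deception` function, the idea of a single well-defined "step" for each property, and the `slack` parameter standing in for "around the same time". The conditions are necessary, not sufficient, so a `True` result only means the ordering does not rule deceptive alignment out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Milestones:
    """Hypothetical training steps at which each property first appears."""
    goal_directedness: int
    long_term_goals: int
    situational_awareness: int
    base_goal_understanding: int


def sequencing_allows_deception(m: Milestones, slack: int = 0) -> bool:
    """Return True only if all three sequencing factors hold.

    `slack` is an assumed tolerance (in steps) for "around the same time".
    """
    # Factor 1: goal directedness comes before understanding the base goal.
    goal_first = m.goal_directedness < m.base_goal_understanding
    # Factor 2: long-term goals arrive before, or within `slack` steps of,
    # understanding the base goal.
    long_term_in_time = m.long_term_goals <= m.base_goal_understanding + slack
    # Factor 3: the same timing requirement for situational awareness.
    aware_in_time = m.situational_awareness <= m.base_goal_understanding + slack
    return goal_first and long_term_in_time and aware_in_time
```

Under this toy framing, a run where the model understands the base goal before it becomes goal directed fails factor #1 regardless of when the other two properties show up.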

The post you linked to talked about the importance of sequencing for #3, but it seems to assume that goal directedness will come first (#1) without disc... (read more)