Shah (DeepMind) and Leahy (Conjecture) Discuss Alignment Cruxes

Olive Branch; Rohin Shah; Connor Leahy; Andrea_Miotti

Preface

In December 2022, Rohin Shah (DeepMind) and Connor Leahy (Conjecture) discussed why Leahy is pessimistic about AI risk, and Shah is less so. Below is a summary and transcript.

Summary

Leahy expects discontinuities - capabilities rapidly increasing and behavior diverging far from what we aim towards - to be a core alignment difficulty. While this concept is similar to the sharp left turn (SLT), Leahy notes he doesn’t like the term SLT. SLT seems to imply deception or a malicious treacherous turn, which Leahy doesn’t see as an interesting, relevant, or necessary part.

Shah suggests they start by discussing Leahy’s SLT views.

Leahy explains he expects there to be some properties that are robustly useful for achieving many goals. As systems become powerful and applied to an increasing variety of tasks, these properties will be favored, and systems will generalize well to increasingly novel tasks.

However, the goals we intend for our AIs to learn might not generalize. Perhaps it will seem that we’ve gotten them relatively aligned to the goal of helping humans, but once they become smart enough to access the action space of "hook all humans up to super heroin", we’ll realize they learned some superficial distortion of our intentions that allows for bizarre and undesirable actions like that. And an AI might access this intelligence level rapidly, such that it’s not visibly misaligned ahead of time. In this sense, as systems get smarter, they might become harder to understand and manage.

Shah is mostly on board: capabilities will improve, convergent instrumental goals exist, generalization will happen, goals can fail to generalize, etc. (However, he doesn’t think goal misgeneralization will definitely happen.)

Shah asks why Leahy expects AI capabilities to increase rapidly. What makes the sharp left turn sharp?

Leahy answers that if we can’t rule out a sharp left turn, and we expect a sharp left turn would be catastrophic, that’s sufficient cause for concern.

The reason Leahy expects capabilities to increase sharply, though, relates to how humanity’s capabilities developed. He sees humans as having undergone sharp left turns over the past 10,000 years and again over the past 150 years. Our intelligence increased rapidly, such that we went from agriculture to the space age incredibly quickly. These shifts seem to have happened for three successive reasons. First, because of increases in brain size. Second, because of the development of culture, language and tools. And third, because of improvements in epistemology: science, rationality and emotional regulation. All three of these mechanisms will also exist for AIs, and AIs will be less constrained than humans in several respects, such as having no biological limits on their brain size. So, as we scale AI systems, give them access to tools, and guide them towards discovering better ways of solving problems, Leahy expects they’ll become rapidly smarter just like we did.

[Editor’s note expanding the parallel: When humans became rapidly smarter, they became increasingly unaligned with evolution’s goal of inclusive genetic fitness. Humans never cared about inclusive genetic fitness. Evolution just got us to care about proxies like sex. Early on, that wasn’t a big deal; sex usually led to offspring. Once we were smart enough to invent birth control, though, it was a big deal. Nowadays, human behavior is significantly unaligned with evolution’s intention of inclusive genetic fitness. If AIs become rapidly smarter, and they’ve only learned shallow proxies of our goals, a direction that we currently don’t know how to avoid, we might expect their behavior will stray very far from what we value too.]

Shah notes Leahy’s points thus far seem consistent with most people’s views (for instance, the points also seem consistent with soft takeoff), so he doesn’t know what predictions Leahy would make that he would disagree with. Shah suggests Leahy say what exactly an SLT is to surface their cruxes faster. For a prototypical SLT, does the AI start out at roughly human level? Does it end up at roughly human level?

Leahy notes he doesn’t expect one discrete SLT. He expects there are several “intelligence gated” abilities that will rapidly increase intelligence, once an AI is intelligent enough to reach them.

Explaining this by analogy: once humans became smart enough to reflect on our own minds and improve them, we rapidly became even smarter, because we could make ourselves more efficient. The same could happen with AIs. "A" SLT occurring is getting access to one of these "intelligence gated" abilities.

Shah is still broadly on board: as you get more intelligent, you also get better at making yourself more intelligent, and this is a runaway feedback loop. He expects a singularity to occur at some point.

A key question for Shah is: if time T is when humans become unable to manage unaligned AI, to what extent is the model at time T aligned enough that it keeps itself aligned as it becomes more intelligent?

…because if we can manage AI systems until they’re about as coherent as humans, it seems at least non-trivially likely that by supervising AI systems closely, we can get their motivations to be roughly what we want. Given this story (that Shah thinks is plausible, but is not necessarily presenting as the default outcome), it seems we should be about as optimistic about how an AI’s motivations would then generalize post-SLT as we would be about a human generalizing post-SLT.

Leahy responds that he doesn't think humans will be able to manage unaligned AI until it’s as coherent as humans. He expects an SLT before AIs are as coherent as humans, and once an SLT happens, they’ll likely be too smart for us to manage. Relatedly, he expects something as powerful as GPT-4 and as coherent as a human would already be AGI. Since we don’t know how to reliably align AI, this would be too late and kill us.

Leahy also isn’t optimistic humans would remain safe if we had another SLT. When humans first SLTd, we went from mostly outer aligned to very unaligned with evolution’s intention of inclusive genetic fitness. With 1000x more computing power, or memory, or speed, who knows how far we’d stray from intended goals again.

So, while Leahy agrees it’s possible the goals we intend for AIs will generalize, he really doesn’t expect that.

Akash, the moderator, steps in to ask how likely Shah and Leahy each think it is that AI gets out of control and causes extinction or similar disempowerment.

Leahy answers 90%.

Shah answers less likely than not, which is probably enough for a disagreement.

Akash asks Shah to flesh out

what he expects will happen once we get human-level AGI
when he thinks a singularity would occur, and why there’s a >50% chance that it'll be safe

Shah answers

There will be lots of AI tools in the economy that don’t look super coherent but that are automating away stuff that was previously done by humans. E.g. AI takes care of all data entry. AI customer support except in places that sell “the human touch”. AI writes most of the code for websites based on dialogues with humans that clarify what the code should do.
Roughly 2050, it’s safe because people tried to make it safe.

Akash clarifies, is Shah’s position essentially "the singularity goes safely because we were able to align the IQ 150 [or whatever the number should be] system, that system is sufficiently incentivized to make sure future AGIs are aligned, and that system is sufficiently capable to make sure future AGIs are aligned?"

Shah says yes, but notes of course most of the action is in aligning the ~IQ 150 model.

Shah suggests getting concrete with a particular story. Say we continue to scale up neural networks, continue to throw in other ways of amplifying capabilities like chain of thought, and this results in AI systems that are more coherently pursuing some sort of goal. Then, an SLT happens via that AI system figuring out how to create more intelligent AIs that are pursuing its goals (possibly by amplifying its own intelligence)...

[Note: This story isn’t Shah’s mainline view; he’s just presenting a story that doesn't lead to doom to which he assigns non-trivial probability, for the sake of concrete discussion.]

Leahy asks, what if the predecessor accidentally (or without any volition at all, really) creates the unaligned successor? What if GPT-N improves its coherency with chain of thought or whatever, and for some reason, this results in GPT-N+0.1 that suddenly has very different actions/preferences?

Shah notes, if we add in the claim that GPT-N+0.1 also takes over the world, he finds the story pretty implausible. He thinks coherence is necessary for taking over the world. He thinks it’d take at least a year to go from incoherent GPT-N to coherent-enough-to-end-the-world GPT-N+x. And he expects something besides what Leahy mentioned would need to happen for this jump to occur.

Leahy expects less than a year is needed, but notes a year is still quite short. What does Shah think humans could do in that year that makes things go well?

Shah answers that in this world it sounds like coherence was particularly hard to get, so he imagines we have a decent chunk of time prior to the one year. In the prior time, he imagines it’s plausible that:

We’ll have developed better AI tools that can help with oversight
We’ll have trained teams of humans to be able to use these tools to understand what models are doing and why and to use that understanding to find situations in which the AI system does something different from what we’ve wanted.
This is how we’ve already built AI systems that do real-world things (think WebGPT) to the level of robustness that companies are willing to actually release them as products.

So, by the time the one year before coherent-enough-to-end-the-world GPT-N+x rolls around, it’s also plausible that:

We’re applying the above tools and techniques
Our teams of people supervising and evaluating the AI system can see the coherence arising and note it
A lot of extra effort starts going into this particular finetuning run
Any time the AI system does something that wasn’t consistent with what we actually wanted, we successfully notice this
In addition, in some cases, through interpretability tools, we find a situation in which the AI system would have done something different from what we wanted (including potentially cases where it tries to deceive us) and we penalize it for that
- After all this, it could still be that the AI system just learned how to be deceptive while pursuing some different goal. However, in this success story, Shah expects reality turned out to be such that it was just simpler / more convergent for the AI to be motivated the way we wanted it to be motivated.

Leahy notes this success story seems to pass all the difficulty to the future, as if saying “in worlds where we develop the tools to solve alignment, we solve alignment", without explaining how we’ll develop those tools. From his perspective:

We’re still struggling with easier versions of problems like this (e.g. interpreting current models, supervising current models, building theories of intelligence that are robust to SLT etc).
We have theoretical reasons to think these problems could become much harder as systems scale (as discussed earlier).
Only a few hundred people are working on these problems, and progress has been slow.
So, it seems unlikely we’ll be able to “notice every time the AI system does something that wasn’t consistent with what we actually wanted” soon. As such, he doesn’t see how we’ll align an AI enough for it to reliably safely transition to (proto) AGI.

Shah’s still unclear why Leahy thinks we can’t currently supervise AIs.

Leahy explains it’s hard to detect all LLM failure modes. Also, just being able to detect problems (e.g. "my model keeps confusing lilac and violet when describing colors") doesn’t mean we know how to fix them.

Shah suggests the lilac/violet problem could be solvable with fine-tuning.

Leahy says he doesn’t think fine-tuning robustly solves such problems in all cases.

Shah says he mostly doesn’t buy frames that say “we have to be able to do X robustly in all settings now” and is more into things like “we need to be able to fix situations where the model does something bad where it had the knowledge to do something good” for evaluating progress now. This is because the former doesn’t seem like it separates capabilities from alignment when the AI systems are not very coherent / reflective.

Transcript

[Wasil, moderator]

Welcome!

[Shah]

I’d guess the SLT [Sharp Left Turn] stuff would be best to discuss?

[Leahy]

Thanks so much for your time Rohin, I'm really excited to talk! I'm really interested in learning more about the models other people in the field have around alignment, and you are a bit known as someone who is more on the optimistic side (definitely compared to me!), so I'd be very excited to learn more about your views and models and how they differ from mine :)

SLT would be a good topic to start with from my perspective, I expect that to unearth some cruxes

[Wasil, moderator]

+1. Here are two prompts that I think could be interesting to start with:

1. How "hard" do each of you think alignment will be? What does "hard" actually mean to you?
2. What is your definition of the sharp left turn? Why do you (Connor) expect it to happen, and why do you (Rohin) not expect it to happen?

[Shah]

Yeah, I’d love to get more concrete detail on SLTs, I feel like I don’t really understand that perspective yet

[Leahy]

Cool, let's hone in on Akash's prompt 2 then

[Shah]

I think I should let Connor define SLT?

[Leahy]

Sure!

[Shah]

(I can also talk about arguments for why alignment might not be hard, but I kinda expect that it will go “here’s some stuff we can do” and then Connor will say “but SLT” and I’ll be like “please elaborate”)

[Wasil, moderator]

(Yeah I think starting with the SLT seems great, as it seems like a major crux. At some point, I might be like "Rohin, assume that we will get a SLT. How would your worldview change, how would this affect your view on how hard alignment is, and what work do you think we should be prioritizing?")

(Throwing this out there just so you have something to think about as Connor writes his response)

[Shah]

Mostly I’d say “please be more precise about SLT, I can’t compute updates to my worldview otherwise”

[Wasil, moderator]

+1-- I think we should stick to whatever definition that you and Connor end up agreeing on.

[Leahy]

So, as a heads-up: I actually don't particularly like SLT as a term, but it is the closest common term for a concept that I think is very core to the difficulty of alignment. The thing I dislike about SLT as a term is that it seems to imply some kind of deceptive or malicious treacherous turn, but I don't actually think that is the most interesting/relevant concept.

Fundamentally, I expect that there are various "simple"/convergent algorithmic/behavioral things (which we might call "optimization" or "consequentialism" or whatever) that are robustly useful to achieving many kinds of things. In general, we should expect that as general systems become stronger and perform more and more different tasks, these generalized behaviors/skills/whatever will be favored, and you will more likely get systems that generalize well to even newer tasks. As an illustrative figurative example, I expect a powerful AI that can drive many different kinds of red cars to also be able to generalize to blue cars pretty easily.

[Shah]

Cool, I’m mostly on board

(With some niggling worries about what “simple” means but probably I’d agree with you if we fleshed it out)

[Leahy]

So I expect you get stronger and more general optimizers (all things being equal) as you scale up the breadth of application and power of your AI systems. But this does not naturally generalize to the domain of values! As your systems become more powerful, they can more and more finely and powerfully optimize certain things, which makes goal misspecification worse. So if your system at a low power level seemed relatively safe and aligned because it couldn't access the action space of "hook all humans up to super heroin" (or even more bizarre "misinterpretations"), it might very rapidly get access to this state, and this will not be visible ahead of time given our current very rudimentary understanding of intelligence.

So as your system gets better at optimizing, it actually becomes harder to control and more chaotic from the perspective of a user, and I expect this to happen pretty fast for reasons I'm happy to elaborate on

[Shah]

Yup, I’m still on board (less on board if we’re saying “and the goal misspecification will definitely happen”)

Is that the full definition?

I was expecting something about “sharpness”

[Leahy]

Yes, that's the next point, why do I expect it to be fast/sharp

[Shah]

(Btw let me know if you’d prefer I not blurt out my quick takes / reactions before you’ve finished with your full definition)

[Leahy]

Whatever you prefer!

So, there are a few points that go into this, at various levels of meta/concreteness.

This is an annoyingly outside view one, but basically: We have no reason to not expect it to be? We can't rule it out, we think it would be catastrophic, therefore a priori we should be concerned until we have better theory to rule this out.

More concretely, my inside view is that we have already seen very sharp returns to threshold intelligence in humans. Humans go to the moon, chimps don't. This has become a bit of a trite meme at this point, but I think it's worth calling attention to just how extremely weird this actually is! And it's not just humans, but behaviorally modern humans after a very, very short time from agriculture to space age! What the hell! This seems like an extremely important phenomenon to try and understand mechanistically! I'm not saying I have a perfect mechanistic model, but overall it seems to me that a few factors played an important role: Heavy returns on marginal increases to intelligence (see how heavily evolution optimizes for brain size, humans are born extremely premature, birth is extremely dangerous because of the large head, it takes up a massive amount of our caloric budget etc), development of culture/language (comparatively crude and slow interfaces between brains to pool intelligence) and improvements in epistemology (better scientific methods, rationality, introspective control over emotions/beliefs etc). All three of these factors seem to heavily favor AI systems over humans, and the fact humans didn't go foom much faster seems to me to have been limited more by boring biological implementation details than fundamental algorithmic limitations.

[Shah]

Maybe I’d recommend just saying what exactly an SLT is rather than arguing for it? I suspect we’ll get to cruxes faster that way.

E.g. I’m interested in things like — for a prototypical SLT, does the AI start out at roughly human level? Does it end up at roughly human level? Would you say that humans underwent an SLT?

[Wasil, moderator]

Optional prompt for Rohin as Connor types:

Rohin, I understand your position as something like "sure, systems will get more powerful, and it's plausible that as they get more powerful we will have more problems with goal misspecification. However, the capabilities of the systems will grow fairly gradually, and researchers will be able to see the goal misspecification problems in advance. We will have many months/years to gradually detect and improve various subproblems. As a result, a huge crux is the sharpness of the "left turn"-- if it's not sharp, we can apply traditional scientific/engineering approaches to the problem."

How accurate is this, and what parts have I gotten wrong?

(And I'll acknowledge that we haven't gotten a precise definition of SLT yet)

[Shah]

Idk man it’s pretty hard to know if this is my position without definitions of “gradual”, like I do eventually expect something like a singularity

[Wasil, moderator]

What do you think happens when we reach a singularity?

Or a longer question: Can you briefly summarize what you think will happen when we get human-level AGI? How long does it take to get to superhuman levels, and what does that process look like?

(Also feel free to pass on any of these, especially if you're thinking about Connor's claims)

[Leahy]

I slightly disagree with Nate's framing of "the" SLT as a discrete event, I expect there to be several "overhangs" and "intelligence gated self improvement abilities". E.g. I expect AIs currently have an IQ/memory overhang but are bottlenecked on other factors which, if improved, would let them rapidly catch up to the next bottleneck.

A way in which you could talk about "a" SLT occurring is getting access to "intelligence gated" self improvement. I think this is what Nate refers to when he talks about "the" SLT. For example, humans are smart enough to reflect on their own minds and epistemology, and improve it. This doesn't work in chimps, and I think this is strongly intelligence gated. There is a kind of minimum coherency necessary for you to be able to reflect on and reason about your own thinking to improve it, and this can have extremely fast returns as you make your thinking algorithms more elegant and efficient. This I assume is a crux and something we could elaborate on.

I would agree that humans underwent an SLT over the last 10000 years, yes, and I think you could claim the same over the last 150.

Any places where you disagree with this?

[Shah]

I’m still broadly on board, I think

Like, I agree that as you get more intelligent, you also get better at making yourself more intelligent, and this is a runaway feedback loop

Let’s say that humans are unable to keep up with this at time T. Then I think one key question is — to what extent is the model at time T aligned enough that it “wants” to continue keeping itself aligned as it becomes more intelligent

(Really we should be talking about tool-assisted humans but I’m happy to set that aside for the moment)

[Leahy]

Sure, if we have a pre SLT system that we know is strongly aligned enough to be resistant to self modification causing value drift (meaning it has solved its own alignment problem for its future updated versions), then I agree, that sounds like the problem has been solved! But currently, I don't see how we would build, or even check whether we have built, such a system

It seems to me that the "AI+its future post-SLT instantiations and all the complex dynamics that lead there" is a very complex object that is very different from the kinds of AIs we tend to currently study and consider, if that makes sense?

[Shah]

I agree it’s very different from the objects we currently study

Considering the time T above, do you expect the model at that time to be roughly as coherent as a human?

[Leahy]

On our current trajectory? No

I think GPT models for example are very incoherent.

But still very powerful/smart.

[Shah]

Sure, but we can keep up with GPT models?

[Leahy]

Could you define what you mean by "keep up"?

I'm not sure I understand

[Shah]

Yeah that’s fair. I mean that our AIs are making themselves more intelligent, taking actions in the world, giving people advice, etc and we basically can’t tell whether the things they are doing are actually good or just look good to us, we just have to go off of trust

If we were still able to look at an individual AI decision and (potentially with a lot of effort) say “yeah this decision was good / bad, I’m not just trusting the AI, I’ve verified it myself to a high but not 100% degree of confidence” then I’d say we were keeping up

[Leahy]

Ok, I think I understand that. I am not sure though where you are going with this? It seems to me that "keeping up" by this definition gives us no guarantee over future safety, unless we also have some kind of knowledge/guarantees about the trajectory of systems we are supervising (i.e. don't rapidly increase capabilities or become deceptive)

[Shah]

If you think we can “keep up” until systems are about as coherent as humans, it seems at least non-trivially likely to me that by supervising AI systems closely we can get their motivations to be roughly what we want, and then it seems like you should be about as optimistic about how that AI would then generalize post-SLT as you would be about a human generalizing post-SLT

So this seems like one way you can get to “you should not be confident in doom”

I’m not saying that this is what happens, but it’s a non-trivial chunk of my probability distribution

(I guess part of the argument here is also that I’m reasonably optimistic about an intelligent, careful human generalizing post-SLT)

(But I think that one is probably less controversial?)

[Wasil, moderator]

I think this is interesting & I appreciate you both trying to get concrete. My guesses on some possible cruxes:

1. Connor doesn't think we will be able to "keep up" until systems are about as coherent as humans (in fact maybe he thinks we're not even able to "keep up" with GPT-3, if we have a sufficiently strict definition of "keep up" that is something like "understand the system well enough to provide a near-guarantee that it is doing what we actually want it to do")
2. Connor does think we will be able to "keep up" until systems are about as coherent as humans, but he doesn't think that this will be sufficient for good outcomes. In other words, he doesn't expect that an aligned human-level AI will be likely to produce an aligned superhuman AI.

[Leahy]

Thanks! I basically agree with an even more extreme version of 1

Regarding “Can humans keep up with systems past a certain threshold?”, a few points where I am not currently convinced:

1. I do not think it follows obviously at all that we will be able to keep up until systems are as coherent as humans. I think a system as coherent as a human and as powerful as GPT-4 may well already be AGI! (i.e. too late)

2. It's also not clear to me that just because we can supervise their outputs, we can know much about their motivations. I do of course agree that there is a non trivial possibility that things generalize nicely! But again, it seems to me like just because we have a system that on the surface seems to behave in a way we like before it crosses over into a regime we can't check, doesn't really give us much guarantee over what will happen post phase change, unless we have additional theoretical reasons to believe this generalization should happen in this way.

3. Also, can we really control/supervise current systems? It's pretty clear to me that we cannot currently fully control e.g. GPT-3 (it does things people don't want all the time!), and even supervision seems really, really hard, not just because the task is wide and fuzzy, but also because of the sheer volume. Who knows how many terrible, repeatable failure modes exist in the logs of the GPT-3 API that no human has ever had the time to surface?

4. I am not actually confident that a human would remain safe post-SLT, and not just post-epistemological/cultural-SLT (as happened with humans already, and we already generalized well outside of evolution's intended values for us), but post-massive intelligence/memory/speed/architecture improvement-SLT. We have never seen what happens when a human suddenly gets 1000x more computing power, or memory, or speed, or gets to rewrite other fundamental properties of their mind. It might be really, really weird!

I do agree btw that there is a non-trivial chunk of non-doom! I do take the possibility seriously that alignment, in various ways, turns out to be "easy", I don't think this is ruled out by our current state of knowledge at all

[Wasil, moderator]

Connor and Rohin, how likely is it that [advanced AI gets out of control and disempowers humanity]?

[Shah]

Idk, but I’d say it’s less likely than not, which is probably enough for a disagreement

[Leahy]

I’d say 90%. It would be 99% if I was more overconfident :)

[Shah]

So what you’re saying is, you’re only somewhat overconfident?

[Leahy]

Correct :D

[Wasil, moderator]

Yeah wow-- one interesting thing I'll note is that you seem to agree on a lot of the stated points so far, and your models don't appear to be strikingly different, and yet you end up with super different numbers. Not sure what to do with this, just throwing it out there.

[Shah]

I think that’s pretty common fwiw

I don’t feel like I disagree with Eliezer and Nate on very much, other than something like “the nature of intelligence”

[Leahy]

for what it's worth, I am neither Eliezer nor Nate :)

[Side thread →]

[Wasil, moderator]

Rohin, can you flesh out where you think you disagree the most with Eliezer and Nate (and possibly Connor)? For instance, can you tell us a bit more about:

1. What do you expect will happen once we get human-level AGI?
2. You mentioned that you expect a singularity will occur-- roughly when does it occur and why is there a >50% chance that it'll be safe?
3. Do you agree or disagree with Connor's "we can't even keep up now" point?

[Shah]

1. Um, there will be lots of AI tools in the economy that don’t look super coherent but that are automating away stuff that was previously done by humans. E.g. AI takes care of all data entry. AI customer support except in places that sell “the human touch”. AI writes most of the code for websites based on dialogues with humans that clarify what the code should do.
2. Roughly 2050, it’s safe because people tried to make it safe.
3. I don’t disagree with his object-level claim but I do disagree that it implies “we can’t keep up” on the definition I gave.

[Wasil, moderator]

Can you say more about what you expect the singularity and self-improvement to look like? Specifics I'm curious about:

1. Does the singularity happen around the same time as we get human-level AGI, way after, or before?
2. How quickly does the system improve? Does the singularity last days, months, or years?
3. Is your position essentially "the singularity goes safely because we were able to align the IQ 150 [or whatever the number should be] system, that system is sufficiently incentivized to make sure future AGIs are aligned, and that system is sufficiently capable to make sure future AGIs are aligned?"

[Shah]

Yes to (3), though of course a lot of the action is in the first part

[Back to main thread]

[Shah]

1. Not quite sure what you mean by “as coherent as a human and as powerful as GPT-4”. I thought your position was that coherence was what led to power? Maybe you mean as knowledgeable as GPT-4?

2. I agree that this is where a lot of the action is and I can say some more stuff about it, but let me talk about the other parts first.

3. I agree that wide volume is a challenging problem, but it seems like you can address this by having other AI systems rate the probability that something needs our supervision.

4. How pessimistic are you about post-SLT humans? I could imagine this being a crux

(Lmk if I should say more about point 2, or if we should talk about some other point)

[Leahy]

1. Let's just say "a system that is built with as much computing power and training data as GPT-4".

2. Would love to hear more about this!

3. This seems to me like it just pushes the alignment problem somewhere else, you still eventually have to bottom out on something that is powerful enough to supervise a powerful system, and is aligned enough that you trust it to be safe even in weird scenarios that you maybe could not have conceived of or tested ahead of time.

4. I do absolutely think there are ways to uplift a human mind safely to posthuman regimes, but I think if you just e.g. hooked up 1000x more neurons, who knows what would happen!

I think digging into 2 would be worthwhile.

[Shah]

Oh, re: (4), I’m imagining that the human gets to choose whether to undergo the SLT (just as the AI can choose whether to amplify its intelligence). I’d imagine the human just doesn’t do the “hook up 1000x more neurons” plan precisely because who knows what that would do!

[Leahy]

But this is how we currently build ML systems!

haha

[Shah]

Okay actually maybe let’s concretize to a particular story

I’m imagining that we continue to scale up neural networks, continue to throw in other ways of amplifying capabilities like chain of thought, and this results in AI systems that are more coherently pursuing some sort of goal

I’m then imagining that the SLT happens via that AI system figuring out how to create more intelligent AIs that are pursuing its goals (possibly by amplifying its own intelligence)

I’m not imagining a world where AI systems do a task for a while, and then the humans scale up the neural network to be 10x larger, and then get a new AI system which could be pursuing some totally different goal

I agree that such worlds are possible but I’d like to concretize to an example to figure out what’s going on in that example

Happy for you to choose the world

(I’d say different things for the different worlds)

[Leahy]

We can go with this world, I'm not confident we are in such a world (the cynical reply would be to gesture at various AI labs with far fewer concerns about safety and scaling capabilities), but we can go with it

[Shah]

… I’m confused. I don’t feel like I’ve made any claims that rely on AI labs being concerned about safety yet?

[Leahy]

As in, I wouldn't be surprised if some labs do the scale up by 10x thing, sorry, I may have misread your point

[Wasil, moderator]

Let's say Alice is the AI that ends up creating more intelligent AIs that are designed to pursue Alice's goals.

Guesses for possible cruxes between you and Connor:

We won't align Alice. Alice will be created by taking a previous system and making it 10X larger. Alice will essentially be GPT-6. Due to race dynamics and AI hype and lack of safety culture, people won't put a lot of energy into aligning Alice.
We will align Alice, but aligned Alice won't be able to make intelligent systems that are aligned with her (i.e., our) goals. Then, eventually, someone creates Bob, an AI that is not aligned to our values. Bob doesn't have the same restrictions as Alice, so Bob is able to create smarter systems more easily, and Bob's smarter systems end up destroying us.

(+1 to the fact that Rohin didn't make any explicit claims about safety culture and +1 to Connor thinking that labs might do the scale up by 10x thing, which could result in an "unaligned Alice.")

[Shah]

(Tbc it wasn’t even explicit claims, I don’t think I was saying anything that even implicitly depended on safety culture)

[Wasil, moderator]

"I’m not imagining a world where AI systems do a task for a while, and then the humans scale up the neural network to be 10x larger, and then get a new AI system which could be pursuing some totally different goal."

I think this is the part that made me (and maybe Connor) think about safety culture.

My impression is that your model relies on us getting Alice (or GPT-N or whatever we call it) through a process that isn't just "labs keep scaling neural nets".

[Shah]

Imagine that there’s an AI that could end the world. I want to distinguish between two different ways we got this AI:

(1) There was some predecessor AI that had goals and explicitly tried to create the AI that could end the world to pursue its own goals.

(2) Humans kept scaling up neural networks with the aid of AIs that were more tool-like, and eventually made the one that could destroy the world.

I’m just saying that in the world I’m considering, reality turned out to be (1) rather than (2)

(And in that case it could be that the predecessor AI built the new AI by just scaling up, though it seems like it would have to be paired with solid alignment techniques)

[Wasil, moderator]

My guess is that the crux might come earlier-- like, how do we get the predecessor AI? My impression is that Connor believes that we get the predecessor AI through scaling (and other stuff TM), and we don't have good reasons to believe that [scaling + other methods] will produce a sufficiently-aligned predecessor AI. A subcrux might be something like "how do we expect labs to act & how do we expect race dynamics to play out?"

(In other words, if your predecessor AI is able to build a system that can destroy the world, then your predecessor AI is the AI that can destroy the world.)

(But Connor should totally correct me if I'm mischaracterizing)

[Leahy]

Ah, thanks for that clarification! In that case, one question: Do you also consider scenarios where the predecessor accidentally (or without any volition at all, really) creates the unaligned successor? e.g. GPT-N improves its coherency with chain of thought or whatever, and for some reason, this results in GPT-N+0.1 that suddenly has very different actions/preferences, but this was not done with foresight from GPT-N (or anyone else)

[Shah]

If we add in the claim that GPT-N+0.1 also takes over the world, then I find the story pretty implausible

(And we need to add in that claim to make it different from the other two stories)

That’s a good example of an SLT that I would say is “too sharp”

[Leahy]

This seems like a crux! Could you tell me about why you think this is implausible? (note I make no strong statement on how much wall time elapses between GPT-N and GPT-N+0.1)

[Shah]

I think in principle you could imagine that GPT-N does this for some very long period of time and then it becomes coherent / goal-directed when it previously wasn’t, and also takes over the world. There I would just say that something else happens first.

Like, GPT-N does this for some shorter period of time, that leads to a more useful and capable model that is somewhat more coherent, that gets deployed, and then that model is the one that builds the one that can take over the world

[Leahy]

That seems like it would also result in a discontinuity? I am not confident that is what it would look like in practice. I think a crux here is probably messy, uncertain inside views on how fast this kind of improvement could happen, but if your scenario is "company X trains GPT-N, lets it self improve to N+0.1, deploys it, it builds GPT-F which ends the world", that seems plausible enough to me, but I don't see strong reasons to believe GPT-F couldn't emerge directly from GPT-N+x for any x. It's not clear to me how we could know it won't happen, I don't think we have enough theoretical understanding of these kinds of processes to rule anything out at the moment for any x.

[Shah]

My main claim is “coherence comes before taking over the world”

(With some non-trivial lead time)

[Leahy]

Ok, sure, seems pretty reasonable. Not sure about the lead time, could be long, could not in my view.

Also, I think there are many gradations of "coherence". Humans are very coherent compared to GPT, but still very incoherent in absolute terms.

[Shah]

Kind of hard to make quantitative predictions without having some better notion of “coherence” but like, at least a year between systems that look like they are pursuing things as coherently as humans, and the system that takes over the world

[Leahy]

Ok, it seems like a one year takeoff (which most people would consider "fast") would be pretty scary

I'm not as confident as you that there will be such a long gap, but it seems possible to me

[Shah]

Agreed that’s pretty fast! It’s not my expected view, but I am pretty uncertain! I’m mostly saying that even shorter seems pretty surprising

And one year seems like enough where I’m like “nah, we don’t go from incoherent GPT-N to world-ending GPT-F, something else happens first”

[Leahy]

I'm basically trying to be even less certain, in that I would be a little, but not very, surprised by a much shorter turnaround from GPT-N to GPT-F, but ok, let's go with one year. Assuming it is one year, what do we do in that year that makes things go well? If we aren't already very far ahead of where we currently are in alignment, that seems like a pretty bad scenario for us to be in, from my point of view.

GPT-N may not be world ending, but it's not aligned either!

[Wasil, moderator]

(I'd be interested for both of you to describe if/how you think we'll get "Alice"-- the system that isn't itself capable of destroying the world but is able to build more powerful systems that have the capability of destroying the world.

My guess is there are some cruxes relating to the process used to develop such a system, how safety-conscious the actor who develops it is, and how hard that actor tries to align it.)

[Leahy]

It is not my mainline that there will be an Alice (or I think Alice will be SGD+some kind of nifty loss function on GPT-N). I think developing an Alice system is possible, but probably the shortest path from Alice to GPT-F is "do more gradient updates on Alice"

I think the safety consciousness and alignability of actors is a separate, longer discussion

[Shah]

(Going to respond to Connor’s question first)

(Noting that I don’t think I’ll say anything super novel here)

This seems like a particularly hard world for alignment. In this world it sounds like coherence was particularly hard to get and so I’m imagining we have a decent chunk of time before that year. During that time we’ve developed better AI tools that can help with oversight; we’ve trained teams of humans to be able to use these tools to understand what models are doing and why, and to use that understanding to find situations in which the AI system does something different from what we’ve wanted. This is how we’ve already built AI systems that do real-world things (think WebGPT) to the level of robustness that companies are willing to actually release this as a product.

[Leahy]

I am very confused by your point. I don't think "level of robustness that companies are willing to actually release as a product" is an adequate bar for safety to be robust to phase transition in powerful (proto) AGI systems, and I wouldn't know how to get them to a point where they are. Again, we, in my point of view, are heavily, heavily bottlenecked by actual understanding of not just current models, but general dynamics of not just SLT-like scenarios, but also just general intelligence increase/SGD.

At the moment, we don't have tools that seem anywhere close to up to the job of supervising such complex and novel phenomena (again, it looks to me like we have GPT alignment failures all the time in practice! Even post InstructGPT!)

I think it is of course imaginable that in the future people will develop much, much more sophisticated methods for supervision, and control, but this seems to be isomorphic to "actually solving alignment", from my point of view.

[Wasil, moderator]

Also noting that there are ~10 mins left! I'm finding the current thread interesting, but feel free to throw out any other topics you want to discuss.

(Possible examples: To what extent should we consider alignment a problem that can be solved with traditional science/engineering methods? Will systems become more interpretable or less interpretable as they get more powerful?)

[Shah]

We’re now applying these tools and techniques to the new GPT-N. Our teams of people supervising and evaluating the AI system can see the coherence arising and note it, and a lot of extra effort starts going into this particular finetuning run. Any time the AI system does something that wasn’t consistent with what we actually wanted, we successfully notice this. In addition, in some cases through interpretability tools we find a situation in which the AI system would have done something different from what we wanted (including potentially cases where it tries to deceive us) and we penalize it for that

[Leahy]

"any time the AI system does something that wasn’t consistent with what we actually wanted, we successfully notice this"

Do we? This doesn't reflect my experiences with ML systems at all! Haha! This seems like postponing alignment to the future, to me. Maybe I have missed one of your core points, but it is not obvious to me why we should have a strong reason for this future to unfold by default from the present. It is possible, yes, but then it almost seems trivial ("in worlds where we develop the tools to solve alignment, we solve alignment")

[Shah]

And after all of this, it still could be that the AI system just learned how to be deceptive while pursuing some different goal, but (in this story) reality turned out to be such that it was just simpler / more convergent for the AI to be motivated the way we wanted it to be motivated

The strong reason is “people are trying to do it”

[Leahy]

I am very confused. There are many things people try to do, but fail anyways for many reasons. Around 200 people max in the world are currently working on this, doesn't seem like humanity's on track to solving the problem!

[Shah]

And why isn’t humanity on track to solving the problem?

[Leahy]

Because currently, as far as I can tell, we're struggling with many of the easier versions (e.g. interpreting current models, supervision, building theories of intelligence that are robust to SLT etc), and we have many theoretical reasons (as we discussed previously) for the problem to become much harder, potentially very soon

[Shah]

I’m still not clear why you think we can’t currently supervise models.

I don’t think alignment depends on strong interpretability or theories of intelligence, though I agree they help.

[Leahy]

If anything, I expect interpretability and supervision to become much harder as systems become stronger!

[Wasil, moderator]

Here's a relevant statement that Rohin made-- I think it's from a few years ago though so it might be outdated:

I would guess that AI systems will become more interpretable in the future, as they start using the features / concepts / abstractions that humans are using. Eventually, sufficiently intelligent AI systems will probably find even better concepts that are alien to us, but if we only consider AI systems that are (say) 10x more intelligent than us, they will probably still be using human-understandable concepts. This should make alignment and oversight of these systems significantly easier. For significantly stronger systems, we should be delegating the problem to the AI systems that are 10x more intelligent than us.

(This is very similar to the picture painted in Chris Olah’s views on AGI safety (AN #72), but that had not been published and I was not aware of Chris's views at the time of this conversation.)

[Shah]

I do still agree with the point relative to the AI systems of 2018 or whenever I said that, but less so relative to current systems (where the concepts already seem fairly crisp, and the effect where they get more complex probably dominates the one where they get more crisp)

I agree with that but in large part that’s because supervision seems so easy right now. What makes you think that supervision is hard right now?

[Leahy]

"I agree with that but in large part that’s because supervision seems so easy right now. What makes you think that supervision is hard right now?"

In my experience, debugging e.g. LLM failure modes is really, really tricky, and even just determining performance in a novel task ahead of time (even with all the benchmarks we have!) is really hit or miss. And just because I can for example detect "oh my model keeps confusing lilac and violet when describing colors" (or whatever), doesn't mean I can fix it

[Shah]

I don’t feel like I said anything related to any of this? My claim was that we can notice when the AI system does bad things.

For the lilac / violet example, I’d want more details, but sounds like a capability failure. I agree you can’t trivially fix those because it looks really hard to create new capabilities in the model.

In contrast I’d say that if you want a model to e.g. write summaries you can just finetune it on some data and it does then basically write summaries, and so in some sense it is actually easier to change its motivations.

I don’t see the relevance of determining performance in a novel task ahead of time; this just doesn’t seem all that important for whether the model is aligned or deceptive.

(It’s useful for things like forecasting, and figuring out when we should be extra-scared and declaring crunch time, but that’s not the same thing as “is the model you produced misaligned or not”)

[Leahy]

I am confused by how you separate "capabilities" from "motivations". The lilac/violet example doesn't necessarily mean it can't differentiate them, it might just have some preference for saying "lilac" more often, or whatever.

I'm also confused about the summary finetuning example. If I finetune an LLM on a given summarizing dataset, that actually tells me quite little about whether it has some bizarre failure modes that only appear when summarizing 17th century Bulgarian literature (I'm not saying this is an actual example that happened, but more that, if it did, how would we know until our unwitting Bulgarian customer tries it themselves?)

[Shah]

Mostly I’m predicting that if you can’t fix it then it’s probably some sort of capability that it’s lacking. I agree that your description doesn’t imply that

[Leahy]

I don't understand that point, could you elaborate? Why would we think that if it was a different kind of issue, it would be solvable? Where is the mechanistic difference? How could I tell?

[Shah]

I agree that the summarization example doesn’t show “you can get a model that works robustly in all situations”, I’m just saying “it sure seems a lot easier to change what the model appears to be doing than to give it a new capability”

[Leahy]

That's fair enough

I guess I'm still very confused how this gets us back to alignment. It seems to me, that if all we can do is change how it appears to behave, we're still very far away from alignment that will be robust

[Shah]

(Back to lilac/violet) I expect it would be solvable because finetuning works well in practice.

You could look into how the model was handling the sentences and see e.g. to what extent the computation changes for lilac vs. violet. It’s a little hard for me to give a mechanistic story though because I’m still not clear on what the failure mode is

[Leahy]

I don't think anyone could currently robustly solve this problem as you describe, actually! If we could, I would consider that progress towards alignment!

[Shah]

Which problem?

[Leahy]

The lilac problem. The methods you just described, e.g. "You could look into how the model was handling the sentences and see e.g. to what extent the computation changes for lilac vs. violet", afaik do not currently exist

[Shah]

What is the lilac problem and have you tried finetuning

[Leahy]

If they do in a robust fashion, that would be a positive update! (I am aware of e.g. ROME and would not consider it "robust")

I mean, from a far more theoretical than practical standpoint. Imagine I have a language model that, for reasons unknown, calls things that should be called "violet", "lilac" instead every so often. I could finetune on data that uses lilac correctly, but what I would want for progress towards alignment is some kind of technique that ensures it actually will use lilac correctly, even in test cases I don't bother to check. I think being capable of things like this is necessary (but far from sufficient) to achieve alignment

[Shah]

Oh I see. I mostly don’t buy frames that say “we have to be able to do X robustly in all settings now” and am more into things like “we need to be able to fix situations where the model does something bad where it had the knowledge to do something good” for evaluating progress now. (In particular, the former doesn’t seem like it separates capabilities from alignment when the AI systems are not very coherent / reflective.)

(And then I have a separate story for why this turns into properly-motivated AI systems in the future once they can be coherent, which is basically “we notice when they do bad stuff and the easiest way for gradient descent to deal with this is for the AI system to be motivated to do good stuff”.)

[Leahy]

This is very interesting and I would really like to talk more about this frame at another time! Thanks Rohin!

Whenever you have time, would love some writeup of this story

[Shah]

I think I’ve oversold it, it’s really just “we notice when they do bad stuff and the easiest way for gradient descent to deal with this is for the AI system to be motivated to do good stuff”.

[Leahy]

A crux I'd really like to discuss more, but I'll let you go for now. Thanks so much for your time Rohin! Really, it is very valuable to write out and discuss ideas like this, and big respect for taking the time and effort. We're all in this together, and are trying to figure things out, so thanks for your help in this endeavor!

[Nathan Helm-Burger]

It seems to me like one of the cruxes is that there is this rough approximate alignment that we can currently do. It's rough in the sense that it's spotty, not covering all cases. It's approximate in that it's imprecise and doesn't seem to work perfectly even in the cases it covers.

The crux is whether the forecaster expects this rough approximate alignment to get easier and more effective as the model gets more capable, because the model understands what we want better. Or whether it will get harder as the model gets more capable, because the model will cross certain skill thresholds relating to self-awareness and awareness of instrumental goals.

I am in the camp that this will get harder as the model gets more competent. If I were in the 'gets easier' camp, then my views would be substantially closer to Rohin's and Quinton Pope's and Alex Turner's more optimistic views.

I am, however, a bit more optimistic than Connor I think. My optimism hinges on a different crux which has come up multiple times when discussing this with less optimistic people having views more like Connor's or Eliezer's or Nate Soares'.

This crux which gives me an unusual amount of optimism depends on three hopes.

First is that I believe it is possible to safely contain a slightly-superintelligent AGI in a carefully designed censored training simulation on a high security compute cluster.

Second is that I also think that we will get non-extinction level near-misses before we have a successfully deceptive AGI, and that these will convince the leading AI labs to start using more thorough safety precautions. I think there are a lot of smart people currently in the camp of "I'll believe it when I see it" for AGI risk. It is my hope that they will change their minds and behaviors quickly once they do see real world impacts.

Third is that we can do useful alignment experimentation work on the contained slightly-superhuman AGI without either accidentally releasing it or fooling ourselves into thinking we've fixed the danger without truly having fixed it. This gives us the opportunity to safely iterate gradually towards success.

Obviously, all three of these are necessary for my story of an optimistic future to go well. A failure of one renders the other two moot.

Note that I expect an adequate social response would include bureaucratic controls adequate to prevent reckless experimentation on the part of monkeys overly fascinated by the power of the poisoned banana.

[watermark]

I'd be interested in hearing more about what Rohin means when he says:

... it’s really just “we notice when they do bad stuff and the easiest way for gradient descent to deal with this is for the AI system to be motivated to do good stuff”.

This sounds something like gradient descent retargeting the search for you because it's the simplest thing to do when there are already existing abstractions for the "good stuff" (e.g. if there already exists a crisp abstraction for something like 'helpfulness', and we punish unhelpful behaviors, it could potentially be 'really easy' for gradient descent to simply use the existing crisp abstraction of 'helpfulness' to do much better at the tasks we give it).

I think this might be plausible, but a problem I anticipate is that the abstractions for things we "actually want" don't match the learned abstractions that end up being formed in future models and you face what's essentially a classic outer alignment failure (see Leo Gao's 'Alignment Stream of Thought' post on human abstractions). I see this happening for two reasons:

Our understanding of what we actually want is poor, such that we wouldn't want to optimize for how we understand what we want
We poorly express our understanding of what we actually want in the data we train our models with, such that we wouldn't want to optimize for the expression of what we want

[Rohin Shah]

High level response: yes, I agree that "gradient descent retargets the search" is a decent summary; I also agree the thing you outline is a plausible failure mode, but it doesn't justify confidence in doom.

Our understanding of what we actually want is poor, such that we wouldn't want to optimize for how we understand what we want

I'm not very worried about this. We don't need to solve all of philosophy and morality, it would be sufficient to have the AI system leave us in control and respect our preferences where they are clear.

We poorly express our understanding of what we actually want in the data we train our models with, such that we wouldn't want to optimize for the expression of what we want

I agree this is more of an issue, but it's very unclear to me how badly this issue will bite us. Does this lead to AI systems that sometimes say what we want to hear rather than what is actually true, but are otherwise nice? Seems mostly fine. Does this lead to AI systems that tamper with all of our sources of information about things that are happening in the world, to make things simply appear to be good rather than actually being good? Seems pretty bad. Which of the two (or the innumerable other possibilities) happens? Who knows?

[watermark]

We don't need to solve all of philosophy and morality, it would be sufficient to have the AI system leave us in control and respect our preferences where they are clear

I agree that we don't need to solve philosophy/morality if we could at least pin down things like corrigibility, but humans may poorly understand "leaving humans in control" and "respecting human preferences" such that optimizing for human abstractions of these concepts could be unsafe (this belief isn't that strongly held, I'm just considering some exotic scenarios where humans are technically 'in control' according to the specification we thought of, but the consequences are negative nonetheless, normal goodharting failure mode).

Which of the two (or the innumerable other possibilities) happens?

Depending on the work you're asking the AI(s) to do (e.g. automating large parts of open ended software projects, automating large portions of STEM work), I'd say the world takeover/power-seeking/recursive self improvement type of scenarios happen since these tasks incentivize the development of unbounded behaviors (because open-ended, project based work doesn't have clear deadlines, may require multiple retries, and has lots of uncertainty, I can imagine unbounded behaviors like "gain more resources because that's broadly useful under uncertainty" to be strongly selected for).

[Martín Soto]

in the past 1000 and 150 years

Should be 10,000. Which is relevant to:

First, because of increases in brain size.

The human brain didn't expand that recently, and this part of the summary makes it sound like it did. But this is not only a fault of the summary: in the transcript Connor does mention that one of the factors causing sharp turns was marginal increases in intelligence (as demonstrated by brain size), and later mentions 10,000 and 150 years ago as main examples of SLTs (when, of course, the brain size didn't change).

Being more charitable (since the text is messy), maybe he just meant brain size is evolutionary evidence of how valuable marginal increases in intelligence are (the power of intelligence). But then this doesn't seem relevant to discussing the sharpness of SLTs.

[Andrea_Miotti]

The "1000" instead of "10000" was a typo in the summary.

In the transcript Connor states "SLT over the last 10000 years, yes, and I think you could claim the same over the last 150". Fixed now, thanks for flagging!

[Martín Soto]

Yes, that's what I meant, sorry for not making it clearer!

[jacquesthibs]

Just a small tangent with respect to:

Here's a relevant statement that Rohin made-- I think it's from a few years ago though so it might be outdated:

I would guess that AI systems will become more interpretable in the future, as they start using the features / concepts / abstractions that humans are using. Eventually, sufficiently intelligent AI systems will probably find even better concepts that are alien to us, but if we only consider AI systems that are (say) 10x more intelligent than us, they will probably still be using human-understandable concepts. This should make alignment and oversight of these systems significantly easier. For significantly stronger systems, we should be delegating the problem to the AI systems that are 10x more intelligent than us.

(This is very similar to the picture painted in Chris Olah’s views on AGI safety (AN #72), but that had not been published and I was not aware of Chris's views at the time of this conversation.)

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

Chris provided additional clarification to his views on interpretability here. TLDR: the AGI safety post by Evan was just pointing to Chris' intuitions at the time of doing interpretability on CNN vision models and he's not certain his views are true or transfer to the LLM regime.
