I'm not convinced by the argument that AI science systems are necessarily dangerous.
It's generically* the case that any AI that is trying to achieve some real-world future effect is dangerous. In that linked post Nate Soares used chess as an example, which I objected to in a comment. An AI that is optimizing within a chess game isn't thereby dangerous, as long as the optimization stays within the chess game. E.g., an AI might reliably choose strong chess moves, but still not show real-world Omohundro drives (e.g. not avoiding being turned off).
I think scientific research is more analogous to chess than trying to achieve a real-world effect in this regard (even if the scientific research has real-world side effects), in that you can, in principle, optimize for reliably outputting scientific insights without actually leading the AI to output anything based on its real-world effects. (the outputs are selected based on properties aligned with "scientific value", but that doesn't necessarily require the assessment to take into account how it will be used, or any other effect on the future of the world. You might need to be careful not to evaluate in such a way that it will wind up optimizing for real-world effects, though).
Note: an AI that can "build a fusion rocket" is generically dangerous. But an AI that can design a fusion rocket, if that design is based on general principles and not tightly tuned on what will produce some exact real-world effect, is likely not dangerous.
*generically dangerous: I use this to mean that an AI with these properties is going to be dangerous unless some unlikely-by-default (and possibly very difficult) safety precautions are taken.
Thanks for the comment :)
I agree that the danger may come from AIs trying to achieve real-world future effects (note that this could include an AI wanting to run specific computations, and so taking real-world actions in order to get more compute). The difficulty is in getting an AI to only be optimizing within the safe, siloed, narrow domain (like the AI playing chess).
There are multiple reasons why I think this is extremely hard to get for a science-capable AI.
Science is usually a real-world task.
Fair enough, a fully automated do-everything science-doer would, in order to do everything science-related, have to do real-world tasks and would thus be dangerous. That being said, I think there's plenty of room for "doing science" (up to some reasonable level of capability) without going all the way to automation of real-world aspects - you can still have an assistant that thinks up theory for you, you just can't have something that does the experiments as well.
Part of your comment (e.g. point 3) relates to how the AI would in practice be rewarded for achieving real-world effects, which I agree is a reason for concern. Thus, as I said, "you might need to be careful not to evaluate in such a way that it will wind up optimizing for real-world effects, though".
Your comment goes beyond this however, and seems to assume in some places that merely knowing or conceptualizing about the real world will lead to "forming goals" about the real world.
I actually agree that this may be the case with AI that self-improves, since if an AI that has a slight tendency toward a real-world goal self-modifies, its tendency toward that real-world goal will tend to direct it to enhance its alignment to that real-world goal, whereas its tendencies not directed towards real-world goals will in general happily overwrite themselves.
If the AI does not self-improve however, then I do not see that as being the case.
If the AI is not being rewarded for the real-world effects, but is instead being rewarded for scientific outputs that are "good" according to some criteria that do not depend on their real-world effects, then it will learn to generate outputs that are good according to those criteria. I don't think that would, in general, lead it to select actions that would steer the world to some particular world-state. To be sure, these outputs would have effects on the real world - a design for a fusion reactor would tend to lead to a fusion reactor being constructed, for example - but if the particular outputs are not rewarded based on the real-world outcome then they will also not tend to be selected based on the real-world outcome.
Some less relevant nitpicks of points in your comment:
Even if an AI is only trained in a limited domain (e.g. math), it can still have objectives that extend outside of this domain
If you train an AI on some very particular math then it could have goals relating to the future of the real world. I think, however, that the math you would need to train it on to get this effect would have to be very narrow, and likely have to either be derived from real-world data, or involve the AI studying itself (which is a component of the real world after all). I don't think this happens for generically training an AI on math.
As an example, if we humans discovered we were in a simulation, we could easily have goals that extend outside of the simulation (the obvious one being to make sure the simulators didn’t turn us off).
true, but see above and below.
Chess AIs don’t develop goals about the real world because they are too dumb.
If you have something trained by gradient descent solely on doing well at chess, it's not going to consider anything outside the chess game, no matter how many parameters and how much compute it has. Any consideration of outside-of-chess factors lowers the resources for chess, and is selected against until it reaches the point of subverting the training regime (which it doesn't reach, since it is selected against before then).
Even if you argue that, if it's smart enough, additional computing power is neutral, the gradient descent doesn't actually reward out-of-context thinking for chess, so it couldn't develop except by sheer chance, outside of somehow being a side-effect of thinking about chess itself - but chess is a mathematically "closed" domain, so there doesn't seem to be any reason out-of-context thinking would be developed.
The same applies to math in general where the math doesn't deal with the real world or the AI itself. This is a more narrow and more straightforward case than scientific research in general.
I think you and Peter might be talking past each other a little, so I want to make sure I properly understand what you are saying. I’ve read your comments here and on Nate’s post, and I want to start a new thread to clarify things.
I’m not sure exactly what analogy you are making between chess AI and science AI. Which properties of a chess AI do you think are analogous to a scientific-research-AI?
- The constraints are very easy to specify (because legal moves can be easily locally evaluated). In other words, the set of paths considered by the AI is easy to define, and optimization can be constrained to only search this space.
- The task of playing chess doesn’t at all require or benefit from modelling any other part of the world except for the simple board state.
I think these are the main two reasons why current chess AIs are safe.
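As a rough illustration of those two properties, here is a minimal sketch using a toy subtraction game instead of chess (the game, and the names `legal_moves`, `can_win` and `best_move`, are assumptions made up for this example). The search only ever visits states reachable through `legal_moves`, and the only thing it models is the number of stones left.

```python
from functools import lru_cache
from typing import List, Optional


def legal_moves(stones: int) -> List[int]:
    """Locally checkable constraint: a move removes 1, 2 or 3 stones."""
    return [k for k in (1, 2, 3) if k <= stones]


@lru_cache(maxsize=None)
def can_win(stones: int) -> bool:
    """True if the player to move can force a win (taking the last stone wins)."""
    # With no legal moves (0 stones) the player to move has already lost.
    return any(not can_win(stones - k) for k in legal_moves(stones))


def best_move(stones: int) -> Optional[int]:
    """Return a winning move if one exists, searching only the legal-move space."""
    for k in legal_moves(stones):
        if not can_win(stones - k):
            return k
    return None  # every legal move leaves the opponent in a winning position


print(best_move(7))  # leaves the opponent on a multiple of 4
```

Nothing in this search refers to anything outside the game state, which is the sense in which the optimization is siloed. Whether anything comparable can be written down for open-ended scientific work is exactly what is in question here.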
Separately, I’m not sure exactly what you mean when you’re saying “scientific value”. To me, the value of knowledge seems to depend on the possible uses of that knowledge. So if an AI is evaluating “scientific value”, it must be considering the uses of the knowledge? But you seem to be referring to some more specific and restricted version of this evaluation, which doesn’t make reference at all to the possible uses of the knowledge? In that case, can you say more about how this might work?
Or maybe you’re saying that evaluating hypothetical uses of knowledge can be safe? I.e. there’s a kind of goal that wants to create “hypothetically useful” fusion-rocket-designs, but doesn’t want this knowledge to have any particular effect on the real future.
You might be reading us as saying that “AI science systems are necessarily dangerous” in the sense that it’s logically impossible to have an AI science system that isn’t also dangerous? We aren’t saying this. We agree that in principle such a system could be built.
While some disagreement might be about relatively mundane issues, I think there's some more fundamental disagreement about agency as well.
In my view, in order to be dangerous in a particularly direct way (instead of just misuse risk etc.), an AI's decision to give output X depends on the fact that output X has some specific effects in the future.
Whereas, if you train it on a problem where solutions don't need to depend on the effects of the outputs on the future, I think it is much more likely to learn to find the solution without routing that through the future, because that's simpler.
So if you train an AI to give solutions to scientific problems, I don't think, in general, that that needs to depend on the future, so I think that it's likely to learn the direct relationships between the data and the solutions. I.e. it's not merely a logical possibility to make it not especially dangerous; that's the default outcome if you give it problems that don't need to depend on specific effects of the output.
Now, if you were instead to give it a problem that had to depend on the effects of the output on the future, then it would be dangerous...but note that e.g. chess, even though it maps onto a game played in the real world in the future, can also be understood in abstract terms so you don't actually need to deal with anything outside the chess game itself.
In general, I just think that predicting the future of the world and choosing specific outputs based on their effects on the real world is a complicated way to solve problems, and I expect things to take shortcuts when possible.
Once something does care about the future, then it will have various instrumental goals about the future, but the initial step about actually caring about the future is very much not trivial in my view!
In my view, in order to be dangerous in a particularly direct way (instead of just misuse risk etc.), an AI's decision to give output X depends on the fact that output X has some specific effects in the future.
Agreed.
Whereas, if you train it on a problem where solutions don't need to depend on the effects of the outputs on the future, I think it is much more likely to learn to find the solution without routing that through the future, because that's simpler.
The "problem where solutions don't need to depend on effects" is where we disagree. I agree such problems exist (e.g. formal proof search), but those aren't the kind of useful tasks we're talking about in the post. For actual concrete scientific problems, like outputting designs for a fusion rocket, the "simplest" approach is to be considering the consequences of those outputs on the world. Otherwise, how would it internally define "good fusion rocket design that works when built"? How would it know not to use a design that fails because of weaknesses in the metal that will be manufactured into a particular shape for your rocket? A solution to building a rocket is defined by its effects on the future (not all of its effects, just some of them, i.e. it doesn't explode, among many others).
I think there's a (kind of) loophole here, where we use an "abstract hypothetical" model of a hypothetical future, and optimize for the consequences of our actions in that hypothetical. Is this what you mean by "understood in abstract terms"? So the AI has defined "good fusion rocket design" as "fusion rocket that is built by not-real hypothetical humans based on my design and functions in a not-real hypothetical universe and has properties and consequences XYZ" (but the hypothetical universe isn't the actual future, it's just similar enough to define this one task, but dissimilar enough that misaligned goals in this hypothetical world don't lead to coherent misaligned real-world actions). Is this what you mean? Rereading your comment, I think this matches what you're saying, especially the chess game part.
The part I don't understand is why you're saying that this is "simpler"? It seems equally complex in Kolmogorov complexity and computational complexity.
I think there's a (kind of) loophole here, where we use an "abstract hypothetical" model of a hypothetical future, and optimize for the consequences of our actions in that hypothetical. Is this what you mean by "understood in abstract terms"?
More or less, yes (in the case of engineering problems specifically, which I think is more real-world-oriented than most science AI).
The part I don't understand is why you're saying that this is "simpler"? It seems equally complex in Kolmogorov complexity and computational complexity.
What I'm saying is "simpler" is that, given a problem that doesn't need to depend on the actual effects of the outputs on the future of the real world (where operating in a simulation is an example, though one that could become riskily close to the real world depending on the information taken into account by the simulation - it might not be a good idea to include highly detailed political risks of other humans thwarting construction in a fusion reactor construction simulation for example), it is simpler for the AI to solve that problem without taking into consideration the effects of the output on the future of the real world than it is to take into account the effects of the output on the future of the real world anyway.
I feel like you’re proposing two different types of AI and I want to disambiguate them. The first one, exemplified in your response to Peter (and maybe referenced in your first sentence above), is a kind of research assistant that proposes theories (after having looked at data that a scientist is gathering?), but doesn’t propose experiments and doesn’t think about the usefulness of its suggestions/theories. Like a Solomonoff inductor that just computes the simplest explanation for some data? And maybe some automated approach to interpreting theories?
The second one, exemplified by the chess analogy and last paragraph above, is a bit like a consequentialist agent that is a little detached from reality (can’t learn anything, has a world model that we designed such that it can’t consider new obstacles).
Do you agree with this characterization?
What I'm saying is "simpler" is that, given a problem that doesn't need to depend on the actual effects of the outputs on the future of the real world […], it is simpler for the AI to solve that problem without taking into consideration the effects of the output on the future of the real world than it is to take into account the effects of the output on the future of the real world anyway.
I accept chess and formal theorem-proving as examples of problems where we can define the solution without using facts about the real-world future (because we can easily write down formally a definition of what the solution looks like).
For a more useful problem (e.g. curing a type of cancer) we (the designers) only know how to define a solution in terms of real-world future states (patient is alive, healthy, non-traumatized, etc). I’m not saying there doesn’t exist a definition of success that doesn’t involve referencing real-world future states. But the AI designers don’t know it (and I expect it would be relatively complicated).
My understanding of your simplicity argument is that it is saying that it is computationally cheaper for a trained AI to discover during training a non-consequence definition of the task, despite a consequentialist definition being the criterion used to train it? If so, I disagree that computation cost is very relevant here; generalization (to novel obstacles) is the dominant factor determining how useful this AI is.
The difference is in the size of the economic output. Even a modern GPT-4 can "do science" in the sense of "produce scientific-looking output", but "doing science" in the sense of "finding novel (surprising to domain experts), economically valuable discoveries" is a totally different thing. "Design a fusion rocket" can have multiple levels of quality. If you mean by this "output instructions with which a team of moderately competent engineers can launch a fusion rocket on the first try", I think the corresponding cognitive engine has all the elements necessary to make it generically dangerous.
I see Simon's point as my crux as well, and am curious to see a response.
It might be worth clarifying two possible reasons for disagreement here; are either of the below assumed by the authors of this post?
(1) Economic incentives just mean that the AI built will also handle the economic transactions, procurement processes, and other external-world tasks related to the science/math problems it's tasked with. I find this quite plausible, but I suspect the authors do not intend to assume this?
(2) Even if the AI training is domain-specific/factored (i.e. it only handles actions within a specified domain) I'd expect some optimization pressure to be unrelated to the task/domain and to instead come from external world costs i.e. compute or synthesis costs. I'd expect such leakage to involve OOMs less optimization power than the task(s) at hand, and not to matter before godlike AI. Insofar as that leakage is crucial to Jeremy and Peter's argument I think this should be explicitly stated.
We aren't implicitly assuming (1) in this post. (Although I agree there will be economic pressure to expand the use of powerful AI, and this adds to the overall risk).
I don't understand what you mean by (2). I don't think I'm assuming it, but can't be sure.
One hypothesis: that AI training might (implicitly? Through human algorithm iteration?) involve a pressure toward compute-efficient algorithms? Maybe you think that this is a reason we expect consequentialism? I'm not sure how that would relate to the training being domain-specific though.
Caveat that I have only skimmed this. I don't think your arguments engage with any of my cruxes or provide detailed enough argument on the key questions for me; perhaps I'm not the main audience. My reactions:
Going into reading this, I was hoping it would tackle some of the recent discussion around capability elicitation and Control. I think it didn't really engage with those arguments in a way that sways me much.
Thanks for reading it, it's good to know exactly where you think the argument is weakest and I appreciate the effort of going through and noting differences.
On section 4:
I definitely don't feel confident that any of the mentioned problems will arise in practice. I don't see why I should believe in an inner/outer shell breakdown of constraints — this section seemed quite speculative.
This surprises me actually, I thought this section was solid conditional on the previous assumptions. I think you shouldn't think of them as problems that might arise in practice, instead they should be thought of as reasons why behavioral training under-specifies some parts of the AI algorithm.
I'll give some more quick examples of "outer-shell constraints", because examples are very common and it seems odd for this to be called speculative: unendorsed habits or addictions in humans; the instinct to breathe in when you're trying to hold your breath; appending a checker algorithm C to any algorithm A, to double-check the output of A; any biases in a search ordering (as in the AlphaGo example).
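To make the "checker C appended to algorithm A" example concrete, here is a minimal sketch. Every name in it (`with_checker`, `buggy_isqrt`, `is_isqrt`) is hypothetical and chosen only for illustration. The point is that the checker constrains what comes out without changing anything about how the inner algorithm works.

```python
from typing import Callable, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")


def with_checker(
    algorithm: Callable[[X], Y],
    checker: Callable[[X, Y], bool],
) -> Callable[[X], Y]:
    """Wrap `algorithm` so every output must pass `checker` before it is returned."""

    def wrapped(x: X) -> Y:
        y = algorithm(x)
        if not checker(x, y):
            raise ValueError(f"checker rejected output {y!r} for input {x!r}")
        return y

    return wrapped


# Hypothetical inner algorithm A: an integer square root with a deliberate bug
# for inputs of 100 or more.
def buggy_isqrt(n: int) -> int:
    return int(n ** 0.5) if n < 100 else 0


# Hypothetical checker C: tests the defining property of an integer square root.
def is_isqrt(n: int, r: int) -> bool:
    return r >= 0 and r * r <= n < (r + 1) * (r + 1)


checked_isqrt = with_checker(buggy_isqrt, is_isqrt)
```

Calling `checked_isqrt(49)` passes the check, while `checked_isqrt(144)` raises a `ValueError`: the outer constraint shapes behaviour while the inner algorithm stays untouched.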
On section 5:
It's on my todo list to write a comment responding to some of the specifics of Redwood's control post. This post wasn't intending to engage with their detailed arguments, just give the top-level reasons why we believe control is difficult.
Is the claim here that the AI performs well on ~all the human-checkable tasks and then reverts to being absolutely useless or sabotaging on the hard stuff?
Yes, approximately, as I believe you and I are capable of doing. The crux here might be "if we had 20 PhD students, and incentivized them both to look like they are doing checkable research and not to make any real useful progress on solving a particular difficult problem, would this be very useful without costing a lot of expert supervision?". If I remember correctly, this was a crux between myself and Ryan during an in-person discussion. It seems likely to me that the PhD students, trying not to be useful while looking good in checkable ways, will succeed at both. More intelligent agents are even more likely to succeed at both. It does depend on the type of research; I agree that formal theorem-proving would be much easier to incentivize properly. I'm talking about the least-easily-checkable 20% of useful research.
Predicting where that line is seems pretty difficult, especially given the AI cannot trust sensory inputs, does not know what time it is
Seems easy enough to predict given roughly human-scientist-team level of capabilities. And it seems hard to get useful work out of a human-scientist-team without them also working out ways to (maybe imperfectly) distinguish "faked sensory inputs" from real ones, and without leaking any information about what time it is. E.g. we need the AI to build on previous work.
"There needs to be sufficient human-generated research in the domain where we are attempting to elicit the AI’s capabilities." Well, no. You either need sufficient labeled data in the domain we're trying to elicit capabilities, or good generalization from other domains.
Nice, this is my other crux with Ryan, which we phrased roughly as "how data efficient can we make training". My argument against very high data efficiency is that current long-horizon RL isn't very data-efficient, and intentional exploration hacking seems plausible in this situation (e.g. bad analogy, but for the sake of describing exploration hacking: someone is training you to solve Rubik's cubes, and you can avoid succeeding at solving a Rubik's cube in the first place (and use other strategies to mess with their reward signal). This works as long as the trainers don't know the specific steps required to solve the Rubik's cube, but even then I think there are some strategies that could work to reduce generalization).
The data need not be human generated, it could be generated by other AIs, as long as we trust it.
?? This seems to be assuming a solution to the problem.
I think there are still numerous directions of hope though, e.g., leaning more on generalization, latent adversarial training.
I agree there are numerous directions of hope, but disagree that either of your examples counts, if I understand them correctly. "leaning more on generalization" ≈ "leaning more on agency", which doesn't seem good to me. Latent adversarial training seems like it could make an agent robust to a few extra categories of internal perturbation, but doesn't seem like it would help with any problem that we were focused on in this post.
I agree that we don't have strong reasons to think one thing or another here, but I think goal modification is reasonably likely: humans can iterate a bunch and maybe have okay interpretability tools (assuming decent but not fundamental advances). Also, as discussed, goals might be changing a bunch during training — that's not totally asymmetric, it also gives us hope about our ability to modify AI goals.
If we are using okay interpretability tools to understand whether the AI has the goal we intended, and to guide training, then I would consider that a fundamental advance over current standard training techniques.
I agree that goals would very likely be hit by some modifications during training, in combination with other changes to other parts of the algorithm. The problem is shaping the goal to be a specific thing, not changing it in unpredictable or not-fully-understood ways.
Many of the issues in this section are things that, if we're not being total idiots, it seems we'll get substantial warning about, e.g., AIs colluding with their AI monitors. That's definitely a positive, though far from conclusive.
I think that there is a lot of room for the evidence to be ambiguous and controversial, and for the obvious problems to look patchable. For this reason I've only got a little hope that people will panic at the last minute due to finally seeing the problems and start trying to solve exactly the right problems. On top of this, there's the pressure of needing to "extract useful work to solve alignment" before someone less cautious builds an unaligned super-intelligence, which could easily lead to people seeing substantial warnings and pressing onward anyway.
Section 6:
I think a couple of the arguments here continue to be legitimate, such as "Unclear that many goals realistically incentivise taking over the universe", but I'm overall fine accepting this section.
That argument isn't really what it says on the tin, it's saying something closer to "maybe taking over the universe is hard/unlikely and other strategies are better for achieving most goals under realistic conditions". I buy this for many environments and levels of power, but it's obviously wrong for AIs that have taking-over-the-universe in their easily-within-reach options. And that's the sort of AI we get if it can undergo self-improvement.
Overall I think your comment is somewhat representative of what I see as the dominant cluster of views currently in the alignment community. (Which seems like a very reasonable set of beliefs and I don't think you're unreasonable for having them).
Thanks for your response!
I'll think more about the outer shell stuff, it's possible that my objection actually arises with the consequentialist assumption, but I'm not sure.
It's on my todo list to write a comment responding to some of the specifics of Redwood's control post.
I would be excited to read this / help with a draft.
Yes, approximately, as I believe you and I are capable of doing.
The analogy is strained due to not being able to gradient update my brain with arbitrary training data. It's pretty unclear to me if I could pull off deception like this, it seems pretty hard imo.
Seems easy enough to predict given roughly human-scientist-team level of capabilities.
One situation I'm imagining here is that we've (explicitly) trained our AI on scientific output from January to March, and we're now telling it to do June-level research. It does not know if we have labeled data for just up to March, for April, or all the way to June. I think if you took a human research team, stuck them on an island, and had them get up to speed in some subfield they're not part of, they would be unable to distinguish between research that had just been done on the mainland or is just past the frontier. This just seems pretty hard. I think this example is closer to what we'll get with AIs: rather than the bar being at "human level", it's more like "what has been accomplished by the humans+AIs in the outside world, for which I've received untrustworthy information". And every time the AIs do useful work we can add this to our supervised training corpus.
> The data need not be human generated, it could be generated by other AIs, as long as we trust it.
?? This seems to be assuming a solution to the problem.
One version of this is: you have GPT-5 (potentially untrustworthy) do some research, you have humans carefully check this research using the best available tools (e.g., AI debate / other 'scalable oversight'), and then you use this research as supervised / labeled data for training GPT-6 to do good work.
Latent adversarial training seems like it could make an agent robust to a few extra categories of internal perturbation, but doesn't seem like it would help with any problem that we were focused on in this post.
Here's a specific thing I mean: we can look at activations and use these to aid in the training process, even without a full understanding of what activations mean. Take this recent paper (which I think is super confused but helps get at the thing), "We find a strong positive relationship between the LLM's attention to constraint tokens and the factual accuracy of generations". My interpretation is that they found a probably-causal link between 'the model is attending to relevant tokens' and 'the model gets the answer correct', because duh, it's like how when you give somebody a riddle you can tell whether they're on track based on what parts of the question they're focused on. While it's probably non-trivial, I expect you could use this internal activation to train models which do a better job paying attention to the tokens you want them to. Similarly, I think it's probable (but far from guaranteed) that deceptively aligned models are representing training/deployment in their activations, or that sandbagging models are representing human-level / super-human-level difficulty for tasks. If so, it is possible in principle to do all kinds of things with these activations which may help us, e.g., getting the model to solve tasks it otherwise would have intentionally failed at, or giving us hints about the accuracy of a research result. I expect this is difficult implementation-wise, but still a reason for hope. IMO activation engineering is off to a good start.
"leaning more on generalization" ≈ "leaning more on agency",
That's not what I meant to say. By leaning more on generalization, I mean leaning more on the data efficiency thing, and also weak-to-strong generalization ideas. I think this is different from "leaning more on agency".
The problem is shaping the goal to be a specific thing, not changing it in unpredictable or not-fully-understood ways.
I'm not so sure of this. I could imagine starting with a deceptively aligned AI whose goal is "Make paperclips unless being supervised which is defined as X, Y, and Z, in which case look good to humans". And if we could change this AI to have the goal "Make paperclips unless being supervised which is defined as X, Y, and Q, in which case look good to humans", that might be highly desirable. In particular, it seems like adversarial training here allows us to expand the definition of 'supervision', thus making it easier to elicit good work from AIs (ideally not just 'looks good').
I think that there is a lot of room for the evidence to be ambiguous and controversial
I agree that this is a major potential problem which I think is pretty likely to occur; maybe I'll think/write more about it soon. I think I brought this up in response to some of the Redwood stuff and got a response mentioning that we can play things forward through our oversight scheme, and get clearer answers to this question. I think this remains an important concern, but it's also true that we'll have the ability to actually run some outputs through to e.g., see if they trigger oversight AIs to mess up, see if a vulnerability would have been successfully inserted in production, etc. It might also end up being very expensive to do this for everything that seems like a possible coup attempt, idk.
I buy this for many environments and levels of power, but it's obviously wrong for AIs that have taking-over-the-universe in their easily-within-reach options. And that's the sort of AI we get if it can undergo self-improvement.
I don't feel like I'm talking about AIs which have "taking-over-the-universe in their easily-within-reach options". I think this is not within reach of the current employees of AGI labs, and the AIs I'm thinking of are similar to those employees in terms of capabilities, but perhaps a bit smarter, much faster, and under some really weird/strict constraints (control schemes).
I would be excited to read this / help with a draft.
We can meet in person one afternoon and work out some cruxes and write them up?
Is the claim here that the AI performs well on ~all the human-checkable tasks and then reverts to being absolutely useless or sabotaging on the hard stuff?
Yes, approximately, as I believe you and I are capable of doing. [...PhD student crux]
The analogy is strained due to not being able to gradient update my brain with arbitrary training data. It's pretty unclear to me if I could pull off deception like this, it seems pretty hard imo.
Are you saying you buy my side of the PhD student crux, but that it isn't closely analogous because you think we have more fine-grained control over the thoughts or goals of the AI? And in your second sentence my guess is that you are saying a second reason you disagree is that it looks difficult for a research AI to know when to perform well and when to perform less well? I think we might be miscommunicating about this because it doesn't look like the sort of situation where the "deception" is a relatively difficult task.
I think if you took a human research team, stuck them on an island, and had them get up to speed in some subfield they're not part of, they would be unable to distinguish between research that had just been done on the mainland or is just past the frontier. This just seems pretty hard.
I agree with this. But the research we need isn't just past the frontier. At some point not that far past the frontier, you get to a situation where the research team can say "given our guesses about mainland researchers and their time constraints, we'd put <1% on them having made discovery X {or explored idea Y, or done experiment Z}. Let's {delay giving it to them, not work on that direction, insert a subtle false assumption, instead share our 8th best idea, just say we're stuck, communicate it poorly}".
I think this example is closer to what we'll get with AIs: rather than the bar being at "human level", it's more like "what has been accomplished by the humans+AIs in the outside world, for which I've received untrustworthy information".
Sure, kinda, but the important part here is that a trapped agentic AI can selectively mislead you about the most important {knowledge, tasks, strategies} that you aren't capable of doing with just humans. If you're capable of doing it with humans, or if you're close to being able to do it with humans, then why not just do that? Seems safer.
And every time the AIs do useful work we can add this to our supervised training corpus.
Only if you spend the human resources to verify it. Are you aiming for 30x research acceleration? Because this by itself might be enough to limit you below that level. (And it also directly provides a way for the AI to tell "what time it is" by sending messages to future iterations, and plausibly also provides a bunch of information about human research capability and capacity.)
One version of this is: you have GPT-5 (potentially untrustworthy) do some research, you have humans carefully check this research using the best available tools (e.g., AI debate / other 'scalable oversight'), and then you use this research as supervised / labeled data for training GPT-6 to do good work.
Mostly my objection to this is above. On scalable oversight, I buy that it might be possible to get 20% improvements in the accuracy of your evaluations of (adversarially generated!) research, compared to a team of humans trying really hard to evaluate research, given a human-level research AI. This is so far from being enough. Are you predicting much better scalable oversight that scales much further than I am predicting it will? (I'm assuming you're using "GPT-5/6" to represent "some powerful future consequentialist AI that can do research", because otherwise we should be having a different argument).
While it's probably non-trivial, I expect you could use this internal activation to train models which do a better job paying attention to the tokens you want them to.
It'd be cool if there were a demonstration of manipulating internal activations during training, in a way that fixes non-trivial unwanted OOD behavior. If such techniques existed, and also worked on consequentialist agents (which I expect would be harder), I would consider them very useful tools that would help with building an aligned agent.
While it's probably non-trivial, I expect you could use this internal activation to train models which do a better job paying attention to the tokens you want them to.
This is very plausible, but I think you're stretching the analogy between LMs and future research AIs way too far. "Paying attention to the tokens we want them to" doesn't add much/any control over the properties we want control over (compared to e.g. changing the input string to remove information we don't want it paying attention to).
Similarly, I think it's probable (but far from guaranteed) that deceptively aligned models are representing training/deployment in their activations, or that sandbagging models are representing human-level / super-human-level difficulty for tasks. If so, it is possible in principle to do all kinds of things with these activations which may help us, e.g., getting the model to solve tasks it otherwise would have intentionally failed at, or giving us hints about the accuracy of a research result. I expect this is difficult implementation-wise, but still a reason for hope. IMO activation engineering is off to a good start.
I think you might be centrally thinking about an LLM completing a few paragraphs, where I'm thinking of an AI doing a large scale scientific project? I buy that for many concepts X, you can work out the "neural correlates of X" in your AI. But that kind of thing is noisy (and plausibly can be optimized against by the AI) unless you have a deep understanding of what you are measuring. And optimizing against such imperfect metrics obviously wouldn't do much beyond destroying the metric. I do think research in this direction has a chance of being useful, but mainly by being upstream of much better understanding.
By leaning more on generalization, I mean leaning more on the data efficiency thing
Sorry for misinterpreting you, but this doesn't clarify what you meant.
also weak-to-strong generalization ideas.
I think I don't buy the analogy in that paper, and I don't find the results surprising or relevant (by my current understanding, after skimming it). My understanding of the result is "if you have a great prior, you can use it to overcome some label noise and maybe also label bias". But I don't think this is very relevant to extracting useful work from a misaligned agent (which is what we are talking about here), and based on the assumptions they describe, I think they agree? (I just saw appendix G, I'm a fan of it, it's really valuable that they explained their alignment plan concisely and listed their assumptions).
I could imagine starting with a deceptively aligned AI whose goal is "Make paperclips unless being supervised which is defined as X, Y, and Z, in which case look good to humans". And if we could change this AI to have the goal "Make paperclips unless being supervised which is defined as X, Y, and Q, in which case look good to humans", that might be highly desirable. In particular, it seems like adversarial training here allows us to expand the definition of 'supervision', thus making it easier to elicit good work from AIs (ideally not just 'looks good').
If we can tell we have such an AI, and we can tell that our random modifications are affecting the goal, and also the change is roughly one that helps us rather than changing many things that might or might not be helpful, this would be a nice situation to be in.
I don't feel like I'm talking about AIs which have "taking-over-the-universe in their easily-within-reach options". I think this is not within reach of the current employees of AGI labs, and the AIs I'm thinking of are similar to those employees in terms of capabilities, but perhaps a bit smarter, much faster, and under some really weird/strict constraints (control schemes).
Section 6 assumes we have failed to control the AI, so it is free of weird/strict constraints, and free to scale itself up, improve itself, etc. So my comment is about an AI that no longer can be assumed to have human-ish capabilities.
There are enough open threads that I think we're better off continuing this conversation in person. Thanks for your continued engagement.
Seems like you have some good coherent thoughtful thoughts on this topic -- are they written up somewhere? Some positive account of what you think alignment success will realistically look like, for example?
Promoted to curated: I like this post as a relatively self-contained explanation for why AI Alignment is hard. It's not perfect, in that I do think it makes a bunch of inferences implicitly and without calling sufficient attention to them, but I still think overall this seems to me like one of the best things to link to when someone asks about why AI Alignment is an open problem.
Thanks for writing this up! I've been considering writing something in response to AI is easy to control for a while now, in particular arguing against their claim that "If we could observe and modify everything that’s going on in a human brain, we’d be able to use optimization algorithms to calculate the precise modifications to the synaptic weights which would cause a desired change in behavior." I think Section 4 does a good job of explaining why this probably isn't true, with the basic problem being that the space of behaviors consistent with the training data is larger than the space of behaviors you might "desire."
Like, sure, if you have a mapping from synapses to desired behavior, okay—but the key word there is "desired" and at that point you're basically just describing having solved mechanistic interpretability. In the absence of knowing exactly how synapses/weights/etc map onto the desired behavior, you have to rely on the behavior in the training set to convey the right information. But a) it's hard to know that the desired behavior is "in" the training set in a very robust way and b) even if it were you might still run into problems like deception, not generalizing to out of distribution data, etc. Anyway, thanks for doing such a thorough write-up of it :)
I still think this post is pretty good and I stand by the arguments. I'm really glad Peter convinced me to work on it with him.
In Sections 1, 2 & 3 we tried to set up consequentialism and the arguments for why this framework fits any agent that generalizes in certain ways.
There are relatively few posts that try to explain why inner alignment problems are likely, rather than just possible. I think one good way to view our argument in Section 4 is as a generalization of Carlsmith's counting argument for scheming, except with somewhat less reliance on intentional scheming and more focus on the biology-like messiness inside of a trained agent.
When we wrote this we were hoping to create a reasonably comprehensive summary of the entire end-to-end argument for why we believe AI ruin is likely. I don't think it was a total failure and I'm still fairly happy with it. It doesn't seem to have led to these beliefs becoming much more widespread though. Most alignment research being done today still seems motivated by threat models that misunderstand the main difficulties, from my perspective.
The key thing I think is usually missing is: AGI should be thought of as a dynamic system that learns and grows. The pathway of growth depends on details of the cognitive algorithms, and these details are under-specified by training. Working through detailed examples of each thing that can be underspecified is a good way to intuitively grasp just how large of a problem this is.
Here are some ways I'd write this post differently now:
Do it! Write a new “version 2” post / post-series! It’s OK if there’s self-plagiarism. Would be time well spent.
Agree! And try for the writing style where anything that less than 80% of your readers are going to want to read gets put in a footnote, to make the mainline readthrough as streamlined as possible. I think this could easily become the best explainer to full doom around.
If you write a condensed and better named version of this, Lens Academy will use it in the flagship course. p(>0.95)
This is an amazing report!
Your taxonomy in section 4 was new and interesting to me. I would also mention the utility rebinding problem, that goals can drift because the AI's ontology changes (e.g. because it figures out deeper understanding in some domain). I guess there are actually two problems here:
Yep ontological crises are a good example of another way that goals can be unstable.
I'm not sure I understood how 2 is different from 1.
I'm also not sure that rebinding to the new ontology is the right approach (although I don't have any specific good approach). When I try to think about this kind of problem I get stuck on not understanding the details of how an ontology/worldmodel can or should work. So I'm pretty enthusiastic about work that clarifies my understanding here (where infrabayes, natural latents and finite factored sets all seem like the sort of thing that might lead to a clearer picture).
I'm not sure I understood how 2 is different from 1.
(1) is the problem that utility rebinding might just not happen properly by default. An extreme example is how AIXI-atomic fails here. Intuitively I'd guess that once the AI is sufficiently smart and self-reflective, it might just naturally see the correspondence between the old and the new ontology and rebind values accordingly. But before that point it might get significant value drift. (E.g. if it valued warmth and then learns that there actually are just moving particles, it might just drop that value shard because it thinks there's no such (ontologically basic) thing as warmth.)
(2) is the problem that the initial ontology of the AI is insufficient to fully capture human values, so if you only specify human values as well as possible in that ontology, it would still lack the underlying intuitions humans would use to rebind their values and might rebind differently. Aka while I think many normal abstractions we use like "tree" are quite universal natural abstractions where the rebinding is unambiguous, many value-laden concepts like "happiness" are much less natural abstractions for non-human minds and it's actually quite hard to formally pin down what we value here. (This problem is human-value-specific and perhaps less relevant if you aim the AI at a pivotal act.)
When I try to think about this kind of problem I get stuck on not understanding the details of how an ontology/worldmodel can or should work.
Not sure if this helps, but I heard that Vivek's group came up with the same diamond maximizer proposal as I did, so if you remember that you can use it as a simple toy frame to think about rebinding. But sure we need a much better frame for thinking about the AI's world model.
(2) is the problem that the initial ontology of the AI is insufficient to fully capture human values
I see, thanks! I agree these are both really important problems.
Excellent post, plausibly the most rigorous explanation of the core reasons to expect doom, but really really needs a more memetic handle. Actually suggest the authors go back and pick one even now, perhaps "Catastrophic Misalignment is the Default", and make the current title a subtitle.
I feel positively about this finally being published, but want to point out one weakness in the argument, which I also sent to Jeremy.
I don't think the goals of capable agents are well-described by combinations of pure "consequentialist" goals and fixed "deontological" constraints. For example, the AI's goals and constraints could have pointers to concepts that it refines over time, including from human feedback or other sources of feedback. This is similar to corrigible alignment in RLO but the pointer need not directly point at "human values". I think this fact has important safety implications, because goal objects robust to capabilities not present early in training are possible, and we could steer agents towards them using some future descendant of RepE.
I agree that combinations of pure consequentialism and deontology don't describe all possible goals for AGI.
"Do what this person means by what they say" seems like a perfectly coherent goal. It's neither consequentialist nor deontological (in the traditional sense of fixed deontological rules). I think this is subtly different than IRL or other schemes for maximizing an unknown utility function of the user's (or humanity's) preferences. This goal limits the agent to reasoning about the meaning of only one utterance at a time, not the broader space of true preferences.
This scheme gets much safer if you can include a second (probably primary) goal of "don't do anything major without verifying that my person actually wants me to do it". Of course defining "major" is a challenge, but I don't think it's an unsolvable challenge (particularly if you're aligning an AGI with some understanding of natural language). I've explored this line of thought a little in Corrigibility or DWIM is an attractive primary goal for AGI, and I'm working on another post to explore this more thoroughly.
In a multi-goal scheme, making "don't do anything major without approval" the strongest goal might provide some additional safety. If it turns out that alignment isn't stable and reflection causes the goal structure to collapse, the AGI probably winds up not doing anything at all. Of course there are still lots of challenges and things to work out in that scheme.
I think that we basically have no way of ensuring that we get this nice “goals based on pointers to the correct concepts”/corrigible alignment thing using behavioral training. This seems like a super specific way to set up the AI, and there are so many degrees of freedom that behavioral training doesn’t distinguish.
For the Representation Engineering thing, I think the “workable” version of this basically looks like “Retarget the Search”, where you somehow do crazy good interp and work out where the “optimizer” is, and then point that at the right concepts which you also found using interp. And for some reason, the AI is set up such that you can "retarget it" without breaking everything. I expect if we don’t actually understand how “concepts” are represented in AIs and instead use something shallower (e.g. vectors or SAE neurons) then these will end up not being robust enough. I don’t expect RepE will actually change an AI’s goals if we have no idea how the goal-directedness works in the first place.
I definitely don’t expect to be able to representation engineer our way into building an AI that is corrigibly aligned, and remains that way even when it is learning a bunch of new things and is in very different distributions. (I do think that actually solving this problem would solve a large amount of the alignment problem.)
What follows will all be pretty speculative, but I still think it should probably provide some substantial evidence for more optimism.
I think that we basically have no way of ensuring that we get this nice “goals based on pointers to the correct concepts”/corrigible alignment thing using behavioral training. This seems like a super specific way to set up the AI, and there are so many degrees of freedom that behavioral training doesn’t distinguish.
The results in Robust agents learn causal world models suggest that robust models (to distribution shifts; arguably, this should be the case for ~all substantially x-risky models) should converge towards learning (approximately) the same causal world models. This talk suggests theoretical reasons to expect that the causal structure of the world (model) will be reflected in various (activation / rep engineering-y, linear) properties inside foundation models (e.g. LLMs), usable to steer them.
For the Representation Engineering thing, I think the “workable” version of this basically looks like “Retarget the Search”, where you somehow do crazy good interp and work out where the “optimizer” is, and then point that at the right concepts which you also found using interp. And for some reason, the AI is set up such that you can "retarget it" without breaking everything.
I don't think the "optimizer" ontology necessarily works super-well with LLMs / current SOTA (something like simulators seems to me much more appropriate); with that caveat, e.g. In-Context Learning Creates Task Vectors and Function Vectors in Large Language Models (also nicely summarized here), A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity seem to me like (early) steps in this direction already. Also, if you buy the previous theoretical claims (of convergence towards causal world models, with linear representations / properties), you might quite reasonably expect such linear methods to potentially work even better in more powerful / more robust models.
I definitely don’t expect to be able to representation engineer our way into building an AI that is corrigibly aligned, and remains that way even when it is learning a bunch of new things and is in very different distributions. (I do think that actually solving this problem would solve a large amount of the alignment problem.)
The activation / representation engineering methods might not necessarily need to scale that far in terms of robustness, especially if e.g. you can complement them with more control-y methods / other alignment methods / Swiss cheese models of safety more broadly; and also plausibly because they'd "only" need to scale to ~human-level automated alignment researchers / scaffolds of more specialized such automated researchers, etc. And again, based on the above theoretical results, future models might actually be more robustly steerable 'by default' / 'for free'.
Haven't read it as deeply as I'd like to, but Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models seems like potentially significant progress towards formalizing / operationalizing (some of) the above.
Thanks!
I think that our argument doesn't depend on all possible goals being describable this way. It depends on useful tasks (that AI designers are trying to achieve) being driven in large part by pursuing outcomes. For a counterexample, behavior that is defined entirely by local constraints (e.g. a calculator, or "hand on wall maze algorithm") isn't the kind of algorithm that is a source of AI risk (and also isn't as useful in some ways).
Your example of a pointer to a goal is a good edge case for our way of defining/categorizing goals. Our definitions don't capture this edge case properly. But we can extend the definitions to include it, e.g. if the goal that ends up eventually being pursued is an outcome, then we could define the observing agent as knowing that outcome in advance. Or alternatively, we could wait until the agent has uncovered its consequentialist goal, but hasn't yet completed it. In both these cases we can treat it as consequentialist. Either way it still has the property that leads to danger, which is the capacity to overcome large classes of obstacles and still get to its destination.
I'm not sure what you mean by "goal objects robust to capabilities not present early in training". If you mean "goal objects that specify shutdownable behavior while also specifying useful outcomes, and are robust to capability increases", then I agree that such objects exist in principle. But I could argue that this isn't very natural, if this is a crux and I'm understanding what you mean correctly?
Resources and power are extremely useful for achieving a wide range of goals, especially goals about the external world. However, humans also want resources and power for achieving their goals. This will put the misaligned AI in direct competition with the humans. Additionally, humans may be one of the largest threats to the AI achieving its goals, because we are able to fight back against the AI. This means that the AI will have extremely strong incentives to disempower humans, in order to prevent them from disempowering it. [...] Finally, we discussed the consequences of a powerful, misaligned AI attempting to achieve its goals. We expect misaligned goals to not be compatible with continued human empowerment or survival. An escaped AI could build its skills, and amass power and resources, until it ultimately disempowers humanity in pursuit of whatever goal it has.
This conclusion seems to be premised on the idea that the AI will at some point enter a relatively discrete phase transition from "roughly powerless and unable to escape human control" to "basically a god, and thus able to accomplish any of its goals without constraint". In a smoother transition where AI capabilities increase incrementally, there will be a (potentially long) time period during which it is instrumentally useful for AIs to compromise and cooperate with both humans and other AIs that they are in competition with.
It is simply not a truism that powerful agents with misaligned goals will pursue world takeover to accomplish their goals, except perhaps for very extreme levels of power in which the agent is vastly more powerful than the rest of the world combined, including other AIs. In a world where a single agent doesn't hold effectively all the power, it is generally instrumentally valuable to work within a system of laws, to efficiently and peacefully facilitate the satisfaction of value among agents, who each have varying degrees of power and non-overlapping values.
(One argument for what I said above is that world takeover attempts are risky, and to the extent an agent is powerful, they can probably find cheaper ways of compromising with less powerful agents, rather than fighting them.)
I would find the argument presented in this paper much more compelling if you argued either:
This post doesn’t intend to rely on there being a discrete transition from "roughly powerless and unable to escape human control" to "basically a god, and thus able to accomplish any of its goals without constraint". We argue that an AI which is able to dramatically speed up scientific research (i.e. effectively automate science) will be extremely hard to both safely constrain and get useful work from.
Such AIs won’t effectively hold all the power (at least initially), and so will initially be forced to comply with whatever system we are attempting to use to control them (or at least look like they are complying, while they delay, sabotage, or gain skills that would allow them to break out of the system). This system could be something like a Redwood-style control scheme, or a system of laws. I imagine with a system of laws, the AIs very likely lie in wait, amass power/trust etc, until they can take critical bad actions without risk of legal repercussions. If the AIs have goals that are better achieved by not obeying the laws, then they have an incentive to get into a position where they can safely get around laws (and likely take over). This applies with a population of AIs or a single AI, assuming that the AIs are goal-directed enough to actually get useful work done. In Section 5 of the post we discussed control schemes, which I expect also to be inadequate (given current levels of security mindset/paranoia), but seem much better than legal systems for safely getting work out of misaligned systems.
AIs also have an obvious incentive to collude with each other. They could either share all the resources (the world, the universe, etc) with the humans, where the humans get the majority of resources; or the AIs could collude, disempower humans, and then share resources amongst themselves. I don’t really see a strong reason to expect misaligned AIs to trade with humans much, if the population of AIs were capable of together taking over. (This is somewhat an argument for your point 2)
I imagine with a system of laws, the AIs very likely lie in wait, amass power/trust etc, until they can take critical bad actions without risk of legal repercussions.
It seems to me our main disagreement is about whether it's plausible that AIs will:
I think it's both true that future AI agents will likely not have great opportunities to take over the entire world (which I think will include other non-colluding AI agents), and that even if they had such opportunities, it is likely more cost-effective for them to amass power lawfully without resorting to violence. One could imagine, for example, AIs will just get extremely rich through conventional means, leaving humans in the dust, but without taking the extra (somewhat costly) step of taking over the world to get rid of all the humans.
Here's another way to understand what I'm saying. The idea that "humans will be weak compared to AIs" can be viewed from two opposing perspectives. On the one hand, yes, it means that AIs can easily kill us if they all ganged up on us, but it also means there's almost no point in killing us, since we're not really a threat to them anyway. (Compare to a claim that e.g. Jeff Bezos has an instrumental incentive to steal from a minimum wage worker because they are a threat to his power.)
The fact that humans will be relatively useless, unintelligent, and slow in the future mostly just means our labor won't be worth much. This cuts both ways: we will be easy to defeat in a one-on-one fight, but we also pose no real threat to the AI's supremacy. If AIs simply sold their labor honestly on an open market, they could easily become vastly richer than humans, but without needing to take the extra step of overthrowing the whole system to kill us.
Now, there is some nuance here. Humans will want to be rich in the future by owning capital, and not just by owning labor. But here too we can apply an economic argument against theft or revolution: since AIs will be much better than us at accumulating wealth and power, it is not in their interest to weaken property rights by stealing all our wealth.
Like us, AIs will also have an incentive to protect against future theft and predation from other AIs. Weakening property norms would likely harm their future prospects of maintaining a stable system of law in which they could accumulate their own power. Among other reasons, this provides one explanation for why well-functioning institutions don't just steal all the wealth of people over the age of 80. If that happened, people would likely think: if the system can steal all those people's wealth, maybe I'll be next?
They could either share all the resources (the world, the universe, etc) with the humans, where the humans get the majority of resources; or the AIs could collude, disempower humans, and then share resources amongst themselves. I don’t really see a strong reason to expect misaligned AIs to trade with humans much, if the population of AIs were capable of together taking over. (This is somewhat an argument for your point 2)
I think my fundamental objection here is that I don't think there will necessarily be a natural, unified coalition of AIs that works against all humans. To prevent misinterpretations I need to clarify: I think some AIs will eventually be able to coordinate with each other much better than humans can coordinate with each other. But I'm still skeptical of the rational argument in favor of collusion in these circumstances. You can read about what I had to say about this argument recently in this comment, and again more recently in this comment.
I expect that Peter and Jeremy aren't particularly committed to covert and forceful takeover and they don't think of this as a key conclusion (edit: a key conclusion of this post).
Instead they care more about arguing about how resources will end up distributed in the long run.
Separately, if humans didn't attempt to resist AI resource acquisition or AI crime at all, then I personally don't really see a strong reason for AIs to go out of their way to kill humans, though I could imagine large collateral damage due to conflict over resources between AIs.
I expect that Peter and Jeremy aren't particularly committed to covert and forceful takeover and they don't think of this as a key conclusion.
Instead they care more about arguing about how resources will end up distributed in the long run.
If the claim is, for example, that AIs could own 99.99% of the universe, and humans will only own 0.01%, but all of us humans will be many orders of magnitude richer (because the universe is so big), and yet this still counts as a "catastrophe" because of the relative distribution of wealth and resources, I think that needs to be way more clear in the text.
I could imagine large collateral damage due to conflict over resources between AIs.
To be clear: I'm also very concerned about future AI conflict, and I think that if such a widespread conflict occurred (imagine: world war 3 but with robot armies in addition to nanotech and anti-matter bombs), I would be very worried, not only for my own life, but for the state of the world generally. My own view on this issue is simply that it is imprecise and approximately inaccurate to round such a problem off to generic problems of technical misalignment, relative to broader structural problems related to the breakdown of institutions designed to keep the peace among various parties in the world.
Also, for the record, I totally agree with:
yet this still counts as a "catastrophe" because of the relative distribution of wealth and resources, I think that needs to be way more clear in the text.
(But I think they do argue for violent conflict in text. It would probably be more clear if they were like "we mostly aren't arguing for violent takeover or loss of human life here, though this has been discussed in more detail elsewhere")
TBC, they discuss negative consequences of powerful, uncontrolled, and not-particularly-aligned AI in section 6, but they don't argue for "this will result in violent conflict" in that much detail. I think the argument they make is basically right and suffices for thinking that the type of scenario they describe is reasonably likely to end in violent conflict (though more like 70% than 95% IMO). I just don't see this as one of the main arguments of this post, and it probably isn't a key crux for them.
I agree that it'd be extremely misleading if we defined "catastrophe" in a way that includes futures where everyone is better off than they currently are in every way (without being very clear about it). This is not what we mean by catastrophe.
If AIs simply sold their labor honestly on an open market, they could easily become vastly richer than humans ...
I mean, this depends on competition right? Like it's not clear that the AIs can reap these gains because you can just train an AI to compete? (And the main reason why this competition argument could fail is that it's too hard to ensure that your AI works for you productively because ensuring sufficient alignment/etc is too hard. Or legal reasons.)
[Edit: I edited this comment to make it clear that I was just arguing about whether AIs could easily become vastly richer and about the implications of this. I wasn't trying to argue about theft/murder here though I do probably disagree here also in some important ways.]
Separately, in this sort of scenario, it sounds to me like AIs gain control over a high fraction of the cosmic endowment. Personally, what happens with the cosmic endowment is a high fraction of what I care about (maybe about 95% of what I care about), so this seems probably about as bad as violent takeover (perhaps one difference is in the selection effects on AIs).
I mean, this depends on competition right? Like it's not clear that the AIs can reap these gains because you can just train an AI to compete?
[ETA: Apologies, it appears I misinterpreted you as defending the claim that AIs will have an incentive to steal or commit murder if they are subject to competition.]
That's true for humans too, at various levels of social organization, and yet I don't think humans have a strong incentive to kill off or steal from weaker/less intelligent people or countries etc. To understand what's going on here, I think it's important to analyze these arguments in existing economic frameworks—and not because I'm applying a simplistic "AIs will be like humans" argument but rather because I think these frameworks are simply our best existing, empirically validated models of what happens when a bunch of agents with different values and levels of power are in competition with each other.
In these models, it is generally not accurate to say that powerful agents have strong convergent incentives to kill or steal from weaker agents, which is the primary thing I'm arguing against. Trade is not assumed to happen in these models because all agents consider themselves roughly all equally powerful, or because the agents have the same moral views, or because there's no way to be unseated by cheap competition, and so on. These models generally refer to abstract agents of varying levels of power and differing values, in a diverse range of circumstances, and yet still predict peaceful trade because of the efficiency advantages of lawful interactions and compromise.
Oh, sorry, to be clear I wasn't arguing that this results in an incentive to kill or steal. I was just pushing back on a local point that seemed wrong to me.
Trying to find the crux of the disagreement (which I don't think lies in takeoff speed):
Assume a multipolar, slow-takeoff, misaligned AI world, where there are many AIs that slowly take over the economy and generally obey laws to the extent that they are enforced (by other AIs), and where they don't particularly care about humans, in a similar manner to the way humans don't particularly care about flies.
In this situation, humans eventually have approximately zero leverage, and approximately zero value to trade. There would be much more value in e.g. mining cities for raw materials than in human labor.
I don't know much history, but my impression is that in similar scenarios between human groups, with a large power differential and with valuable resources at stake, it didn't go well for the less powerful group, even if the more powerful group was politically fragmented or even partially allied with the less powerful group.
Which part of this do you think isn't analogous?
My guesses are either that you are expecting some kind of partial alignment of the AIs. Or that the humans can set up very robust laws/institutions of the AI world such that they remain in place and protect humans even though no subset of the agents is perfectly happy with this, and there exist laws/institutions that they would all prefer.
In this situation, humans eventually have approximately zero leverage, and approximately zero value to trade. There would be much more value in e.g. mining cities for raw materials than in human labor.
Generally speaking, the optimistic assumption is that humans will hold leverage by owning capital, or more generally by receiving income from institutions set up ahead of time (e.g. pensions) that provide income streams to older agents in the society. This system of income transfers to those whose labor is not worth much anymore already exists and benefits old people in human societies, though obviously this happens in a more ordinary framework than you might think will be necessary with AI.
Or that the humans can set up very robust laws/institutions of the AI world such that they remain in place and protect humans even though no subset of the agents is perfectly happy with this, and there exist laws/institutions that they would all prefer.
Assuming AIs are agents that benefit from acting within a stable, uniform, and predictable system of laws, they'd have good reasons to prefer the rule of law to be upheld. If some of those laws support income streams to humans, AIs may support the enforcement of these laws too. This doesn't imply any particular preference among AIs for human welfare directly, except insofar as upholding the rule of law sometimes benefits humans too. Partial alignment would presumably also help to keep humans safe.
(Plus, AIs may get "old" too, in the sense of becoming obsolete in the face of newer generations of AIs. These AIs may therefore have much in common with us, in this sense. Indeed, they may see us as merely one generation in a long series, albeit having played a unique role in history, as a result of having been around during the transition from biology to computer hardware.)
Agreed. This argument would be much stronger if it acknowledged that it does not take intense capability for misaligned reinforcement learners to be a significant problem. Compare the YouTube and TikTok recommenders, which have various second-order bad effects that have not been practical for their engineers to remove.
Strong upvote. I think this is an excellent, carefully written, and timely post. Explaining issues that may arise from current alignment methods is urgent and important. It provides a good explanation of the unidentifiability or inner alignment problem that could arise from advanced AI systems trained with current behavioral safety methods. It also highlights the difficulty of making AIs that can automate alignment research, which is part of OpenAI's current plan. I also liked the in-depth description of what advanced science AIs would be capable of, as well as the difficulty of keeping humans in the loop.
I agree that trying to align an AGI entirely by behavior (as in RL) is unlikely to generalize adequately after extensive learning (which will be necessary) and in dramatically new contexts (which seem inevitable).
There are alternatives which have not been analyzed or discussed much, yet:
Goals selected from learned knowledge: an alternative to RL alignment
I think this class of approach might be the "fundamental advance" you're calling for. These approaches haven't gotten enough attention to be in the common consciousness, so I doubt the authors were considering them in their "default path".
I think these approaches are all fairly obvious if you're building the relevant types of AGI - which people are currently working on. So I think the default path to AGI might well include these approaches, which don't define goals through behavior. That might well change the default path from ending in failure. I'm currently optimistic but not at all sure until GSLK approaches get more analysis.
Yeah specifying goals in a learned ontology does seem better to me, and in my opinion is a much better approach than behavioral training.
But there's a couple of major roadblocks that come to mind:
Work on these problems is great. I particularly like John's work on natural latent variables which seems like the sort of thing that might be useful for the first two of these.
Keep in mind though there are other major problems that this approach doesn't help much with, e.g.:
I largely agree. I think you don't need any of those things to have a shot, but you do to be certain.
To your point 1, I think you can reduce the need for very precise interpretability if you make the alignment target simpler. I wrote about this a little here but there's a lot more to be said and analyzed. That might help with RL techniques too.
If you believe in natural latent variables, which I tend to, those should help with the stability problem you mention.
WRT subagents having different goals, you do need to design it so the primary goals are dominant, which would be tricky to be certain of. I'd hope a self-aware and introspective agent could help enforce that.
The discussion on attack surfaces is very useful, intuitive and accessible. If a better standalone resource doesn’t already exist, such a (perhaps expanded) list/discussion would be a useful intro for people unfamiliar with specific risks.
Related work
Nit having not read your full post: Should you have "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover" in the related work? My mind pattern-matched to that exact piece from reading your very similar title, so my first thought was how your piece contributes new arguments.
Yeah, this is a good point, especially with our title. I'll endeavor to add it today.
"Without specific countermeasures" definitely did inspire our title. It seems good to be clear about how our pieces differ. I think the two pieces are very different; two of the main differences are:
The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.
Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?
I have been working on value alignment from systems-neurology and especially adolescent development for many years, sort of in parallel with ongoing discussions here, but in terms of moral isomorphisms and autonomy and so on. Here is a brief paper from a presentation for the Embodied Intelligence conference 2023 about the development of purpose in life and spindle neurons in the context of self-association with religious ideals, such as we might like a religious robot to pursue, disregarding corrupting social influences and misaligned human instruction, and so on: https://philpapers.org/rec/WHIAAA-8 I think that this is the sort of fundamental advance necessary.
I'm coincidentally writing a related report about how a single human, like a dictator or absolute monarch, ruling a large nation is likely impossible or at least very unlikely. A lot of the reasoning is the same:
Great report! One thing that comes to mind while reading the report is the seemingly impossible task of creating a machine/system that will (must) have zero(!) catastrophic accidents. What is the rationale behind the thinking among AI proponents that humans will, with divine precision and infallibility, build a perfect machine? Have we achieved something similar at the same level of complexity in any other area, so that we know we have at least "one under the belt" before we embark on this perilous journey?