I consider this system to be superhuman, and the problem of aligning it to be "alignment-complete" in the sense that if you solve any of the problems in this class, you essentially solve alignment down the line and probably avoid x-risk,
I find this line of reasoning (and even mentioning it) not useful. Any alignment solution will be alignment-complete, so it's tautological.
I think you've defined alignment as a hard problem, which no one will disagree with, but you also define any step taken towards solving the alignment problem as alignment-complete, and thus as infeasibly hard as the whole problem. Can there not be an iterative way to solve alignment? I think we can construct some trivial hypotheticals where we iteratively solve it.
For the sake of argument say I created a superhuman math theorem solver, something that can solve IMO problems written in lean with ease. I then use it to solve a lot of important math problems within alignment. This in turn affords us strong guarantees about certain elements of alignment or gradient descent. Can you convince me that the solution to getting a narrow AI useful for alignment is as hard as aligning a generally superhuman AI?
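For concreteness, the kind of formal statement I have in mind looks something like this (a trivial Lean example, nowhere near IMO difficulty, just to show the format):

```lean
-- A toy formal statement and proof; a narrow prover would be given statements
-- like this (but much harder) and asked to produce the proof term.
theorem toy_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```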
What if we reframe it with a real-world example? Imagine the proof of the Riemann hypothesis begins with a handful of difficult but comparatively simple lemmas. Solving those lemmas is not as hard as solving the Riemann hypothesis itself. And we can keep decomposing the proof into parts that are simpler than the whole.
A step in a process being simpler than the end result of the process is not an argument against that step.
what happens if we automatically evaluate plans generated by superhuman AIs using current LLMs and then launch plans that our current LLMs look at and say, "this looks good".
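Concretely, something like this (a minimal sketch; `strong_planner` and `weak_llm_judge` are hypothetical stand-ins, not references to any real API):

```python
def strong_planner(goal: str) -> str:
    """Placeholder: the superhuman system proposes a plan for the goal."""
    raise NotImplementedError

def weak_llm_judge(plan: str) -> bool:
    """Placeholder: ask a current-generation LLM whether the plan 'looks good'."""
    raise NotImplementedError

def filtered_launch(goal: str, max_tries: int = 10) -> str | None:
    # Only launch plans that the weaker evaluator approves of.
    for _ in range(max_tries):
        plan = strong_planner(goal)
        if weak_llm_judge(plan):
            return plan  # launch this plan
    return None          # nothing approved; don't launch anything
```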
The obvious failure mode is that LLM is not powerful enough to predict consequences of the plan. The obvious fix is to include human-relevant description of the consequences. The obvious failure modes: manipulated description of the consequences, optimizing for LLM jail-breaking. The obvious fix: ...
I won't continue, but shallow rebuttals aren't that convincing, and deep ones are close to capability research, so I don't expect to find interesting answers.
I don't know anyone in the community who'd say that a CEV-aligned superintelligence grabbing control is a bad thing that leads to extinction.
Hi there. I am a member of the community, and I expect that any plan that looks like "some people build a system that they believe to be a CEV-aligned superintelligence and tell it to seize control" will end in a way that is worse and different than "utopia". If the "seize control" strategy is aggressive enough, "extinction" is one of the "worse and different" outcomes that is on the table, though I expect that the modal outcome is more like "something dumb doesn't work, and the plan fails before anything of particular note has even happened", and the 99th percentile large effect outcome is something like "your supposed CEV-aligned superintelligence breaks something important enough that people notice you were trying to seize control of the world, and then something dumb doesn't work".
Note that evolution has had "white-box" access to our architecture, optimising us for inclusive genetic fitness, and getting something that optimizes for similar collections of things.
Would you mind elaborating on what exactly you mean by the terms "white-box" and "optimizing for" in the above statement (and, particularly, whether you mean the same thing by your first and second usages of "optimizing")?
I think the argument would be clearer if it distinguished between the following meanings of the term "optimizer":
- Optimizer (selective shaping): a process, like evolution or gradient descent, that shapes a system by repeatedly selecting among variants according to some criterion.
- Optimizer (deliberative maximization): an agent that internally represents a goal and deliberately chooses its actions to pursue it.
So if we use those terms, the traditional IGF argument looks something like this:
Evolution optimized (selective shaping) humans to be reproductively successful, but despite that humans do not optimize (deliberative maximization) for inclusive genetic fitness.
Thanks for the comment!
any plan that looks like "some people build a system that they believe to be a CEV-aligned superintelligence and tell it to seize control"
People shouldn’t be doing anything like that; I’m saying that if there is actually a CEV-aligned superintelligence, then this is a good thing. Would you disagree?
what exactly you mean by the terms "white-box" and "optimizing for"
I agree with “Evolution optimized humans to be reproductively successful, but despite that humans do not optimize for inclusive genetic fitness”, and the point I was making was that the stuff that humans do optimize for is similar to the stuff other humans optimize for. Were you confused by what I said in the post or are you just suggesting a better wording?
People shouldn’t be doing anything like that; I’m saying that if there is actually a CEV-aligned superintelligence, then this is a good thing. Would you disagree?
I think an actual CEV-aligned superintelligence would probably be good, conditional on being possible, but I also expect that anyone who thinks they have a plan to create one is almost certainly wrong about that, so plans of that nature are a bad idea in expectation, and much more so if the plan looks like "do a bunch of stuff that would be obviously terrible if not for the end goal in the name of optimizing the universe".
Were you confused by what I said in the post or are you just suggesting a better wording?
I was specifically unsure which meaning of "optimize for" you were referring to with each usage of the term.
To solve the problem of aligning superhuman systems, you need some amount of complicated human thought/hard high-level work. If a system can output that much hard high-level work in a short amount of time, I consider this system to be superhuman, and the problem of aligning it to be "alignment-complete" in the sense that if you solve any of the problems in this class, you essentially solve alignment down the line and probably avoid x-risk; but solving any of these problems requires a lot of hard human work, and safely automating so much of the hard work is itself an alignment-complete problem.
There needs to be an argument for why one can successfully use a subhuman system to control a complicated superhuman system, as otherwise, having generations of controllable subhuman systems doesn't matter.
Thinking carefully about these things (rather than rehashing MIRI-styled arguments a bit carelessly) is actually important, because it can change the strategic (alignment-relevant) landscape; e.g. from Before smart AI, there will be many mediocre or specialized AIs:
Assuming that much of this happens “behind the scenes”, a human interacting with this system might just perceive it as a single super-smart AI. Nevertheless, I think this means that AI will be more alignable at a fixed level of productivity. (Eventually, we’ll face the full alignment problem — but “more alignable at a fixed level of productivity” helps if we can use that productivity for something useful, such as giving us more time or helping us with alignment research.)
Most obviously, the token-by-token output of a single AI system should be quite easy for humans to supervise and monitor for danger. It will rarely contain any implicit cognitive leaps that a human couldn’t have generated themselves. (C.f. visible thoughts project and translucent thoughts hypothesis.)
A specialised AI can speed up Infra-Bayesianism by the same amount random mathematicians can, by proving theorems and solving some math problems. A specialised AI can't actually understand the goals of the research and contribute to the parts that require the hardest kind of human thinking. Some amount of problem-solving of the kind the hardest human thinking produces has to go into the problem. I claim that if a system can output enough of that kind of thinking to meaningfully contribute, then it's going to be smart enough to be dangerous. I further claim that there's a number of hours of complicated-human-thought such that making a safe system that can output work corresponding to that number in less than, e.g., 20 years, requires at least that number of hours of complicated human thought. Safely getting enough productivity out of these systems for it to matter is impossible IMO. If you think a system can solve specific problems, then please outline these problems (what are the hardest problems you expect to be able to safely solve with your system?) and say how fast the system is going to solve them and how many people will be supervising its "thoughts". Even putting aside object-level problems with these approaches, this seems pretty much hopeless.
'A specialised AI can speed up Infra-Bayesianism by the same amount random mathematicians can, by proving theorems and solving some math problems. A specialised AI can't actually understand the goals of the research and contribute to the parts that require the hardest kind of human thinking.' 'I claim that if a system can output enough of that kind of thinking to meaningfully contribute, then it's going to be smart enough to be dangerous.' -> what about MINERVA, GPT-4, LATS, etc.? Would you say that they're specialized / dangerous / can't 'contribute to the parts that require the hardest kind of human thinking'? If the latter, what is the easiest benchmark/task displaying 'complicated human thought' that a non-dangerous LLM would have to pass/do for you to update?
'I further claim that there’s a number of hours of complicated-human-thought such that making a safe system that can output work corresponding to that number in less than, e.g., 20 years, requires at least that number of hours of complicated human thought.' -> this seems obviously ridiculously overconfident, e.g. there are many tasks for which verification is easier/takes less time than generation; e.g. (peer) reviewing alignment research; I'd encourage you to try to operationalize this for a prediction market.
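(A standard toy example of that asymmetry from outside alignment, sketched below: verifying a claimed factorization is one multiplication, while producing it by brute force takes on the order of a million trial divisions.)

```python
# Toy illustration of the verification/generation gap: checking an answer can
# be much cheaper than finding it.
N = 1000003 * 1000033  # a semiprime built from two primes

def verify(p: int, q: int) -> bool:
    return p * q == N  # one multiplication

def generate() -> tuple[int, int]:
    d = 2
    while N % d:       # brute-force trial division
        d += 1
    return d, N // d

print(verify(1000003, 1000033))  # True, essentially instant
print(generate())                # (1000003, 1000033), after ~10^6 divisions
```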
'Safely getting enough productivity out of these systems for it to matter is impossible IMO.' -> this is a very strong claim backed by no evidence; I'll also note that 'for it to matter' should be a pretty low bar, given the (relatively) low amount of research work that has gone into (especially superintelligence) alignment and the low number of current FTEs.
'If you think a system can solve specific problems, then please outline these problems (what are the hardest problems you expect to be able to safely solve with your system?) and say how fast the system is going to solve them and how many people will be supervising its "thoughts".' -> It seems to me a bit absurd to ask for these kinds of details (years, number of people) years in advance; e.g. should I ask of various threat models how many AI researchers would work for how long on the (first) AI that takes over, before I put any credence on them? But yes, I do expect automated alignment researchers to be able to solve a wide variety of problems very safely (on top of all the automated math research, which is easily verifiable), including scalable oversight (e.g. see RLAIF already) and automated mech interp (e.g. see OpenAI's recent work and automated circuit discovery). More generally, even if you used systems ~human-level on a relatively broad set of tasks (so as to have very high confidence you fully cover everything a median human alignment researcher can do), I expect the takeover risks from only using them internally for automated alignment research (for at least months of calendar time) could relatively easily be driven well below 0.1%, even just through quite obvious/prosaic safety/alignment measures: decent evals and red-teaming, adversarial training, applying safety measures like those from recent work by Redwood (e.g. removing steganography + monitoring forced intermediate text outputs), applying the best prosaic alignment methods at the time (e.g. RLH/AIF variants as of today), unlearning (e.g. of ARA, cyber, bio capabilities), etc.
There are many things I feel like the post authors miss, and I want to share a few thoughts that seem good to communicate.
I'm going to focus on controlling superintelligent AI systems: systems powerful enough to solve alignment (in the CEV sense) completely, or to kill everyone on the planet.
In this post, I'm going to ignore other AI-related sources of x-risk, such as AI-enabled bioterrorism, and I'm not commenting on everything that seems important to comment on.
I'm also not going to point at all the slippery claims that I think can make the reader generalize incorrectly, as it'd be nitpicky and also not worth the time (examples of what I'd skip: I couldn't find evidence that GPT-4 has undergone any supervised fine-tuning; RLHF shapes chatbots' brains into the kind of systems that produce outputs that make human graders click on thumbs-up/"I prefer this text", and smart systems that do that are not themselves necessarily "preferred" by human graders; one footnote[1]).
Intro
This misrepresents the worry. Saying "but even if" makes it look like: people worrying about x-risk place credence on "loss of control leads to x-risk no matter/despite alignment"; these people are wrong, as the post shows "this outcome" to be implausible; and, separately, even if they're right about loss of control, they're wrong about x-risk, as it'll be fine because of alignment.
But mostly, people (including the leading voices) are worried specifically about capable misaligned systems leading to human extinction. I don't know anyone in the community who'd say that a CEV-aligned superintelligence grabbing control is a bad thing that leads to extinction.
I expect it to be easy to reward-shape AIs below a certain level[2] of capability, and I worry about controlling AIs above that level. I believe you need a superhumanly capable system to design and oversee a superhumanly capable system so that it doesn't kill everyone. The current ability of subhuman systems to oversee other subhuman systems such that they don't kill everyone is something I predicted, and it doesn't provide much evidence that subhuman systems will be able to oversee superhuman systems.[3]
To solve the problem of aligning superhuman systems, you need some amount of complicated human thought/hard high-level work. If a system can output that much hard high-level work in a short amount of time, I consider this system to be superhuman, and the problem of aligning it to be "alignment-complete" in the sense that if you solve any of the problems in this class, you essentially solve alignment down the line and probably avoid x-risk; but solving any of these problems requires a lot of hard human work, and safely automating so much of the hard work is itself an alignment-complete problem.
There needs to be an argument for why one can successfully use a subhuman system to control a complicated superhuman system, as otherwise, having generations of controllable subhuman systems doesn't matter.
Optimization
Let's talk about the goals specific neural networks will be pursuing.
Note that evolution has had "white-box" access to our architecture, optimising us for inclusive genetic fitness, and getting something that optimizes for similar collections of things. Consider that humans are so alignable because of that. Children are already wired to easily want chocolate, politics, and cooperation; if instead you get an alien child wired to associate goodness with eating children or sorting pebbles, giving this child rewards can make them learn your language, but won't necessarily make them not want to eat children or sort pebbles.
If you have a child, you don't need to specify, in math, everything that you value: they're probably not going to be super-smart about causing you to give them a reward, and they're already wired to want stuff that's similar to the kinds of things you want.
When you create AI, you do need to have a target of optimisation: what you hope the AI is going to try to do, a utility function safe to optimize for even with superintelligent optimization power. We don't know how to safely specify a target like that.
And then, even if you somehow design a target like that, you need to somehow find an AI that actually tries to achieve that target, and not something else whose pursuit was merely correlated with the target during training.
I'm not sure what their assumptions are around the inner alignment problem. This is false: we expect that a smart AI with a wide range of possible goals can perform well on a wide range of reward functions that could be used, and gradient descent won't optimize the terminal goals the AI is actually trying to pursue.
I fully expect gradient descent to successfully optimize artificial neural networks to achieve low loss; I just don't expect the loss function they can design to represent what we value, and I expect gradient descent to find neural networks that try to achieve something different from what was specified in the reward function.
If gradient descent finds an agent that tries to maximize something completely unrelated to humanity, and understands that for this, it needs to achieve a high score on our function, the agent will successfully achieve a high score. Gradient descent will optimize its ability to achieve a high score on our function (it will optimize the structure that makes up the agent) but won't really care about the goal contents of the current structure. If, after training is finished, this structure optimizes for anything weird about the future of the universe and plans to kill us, this doesn't retroactively make the gradient change it: there is no known way for us to specify a loss function that trains away parameters that in the future might plan to kill us.
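A toy illustration of the narrower point that the training signal sees only the model's outputs, not whichever internal structure produced them (a sketch of that one fact, not of the full argument):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer net: f(x) = W2 @ relu(W1 @ x). Permuting the hidden units gives
# different parameters that implement exactly the same input-output behaviour,
# so any loss computed from the outputs cannot tell the two apart.
def forward(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)

W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(2, 8))
perm = rng.permutation(8)
W1p, W2p = W1[perm], W2[:, perm]   # same function, different internals

x = rng.normal(size=(4, 100))      # a batch of toy inputs
y = rng.normal(size=(2, 100))      # arbitrary targets

loss_a = np.mean((forward(W1, W2, x) - y) ** 2)
loss_b = np.mean((forward(W1p, W2p, x) - y) ** 2)
print(np.isclose(loss_a, loss_b))  # True: identical loss from different internals
```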
Interventions
Being able to conduct experiments doesn't mean we can get demonstrations of all potential problems in advance. If the AI is smart enough and already wants something different enough from what we want, and we don't understand its cognitive architecture, we're not going to be able to trick it into believing its simulated environment is the real world where it can finally take over. Simply having read/write access to the weights and activations doesn't allow us to control what AI thinks about[4]. Techniques to shape the behaviour of subhuman systems aren't going to let us keep control of smarter systems.
Yes, but this is not an argument of the x-risk community.
AFAIK, Nate Soares wouldn't claim that humans are aligned with evolution. Unfortunately, the authors of this or the linked post don't mechanistically understand the dynamics of the sharp left turn.
"AI control research is easier"
(I'm going to assume both control and alignment are meant by "control".) The post lists ways in which it's easier to test AI control techniques than human control techniques. These are valid for subhuman systems but aren't relevant or applicable to superhuman systems, as:
"Values are easy to learn"
True if the AI is smart and coherent enough to be able to do that. But if it's not yet a CEV-aligned superintelligence, having learnt what humans want doesn't incentivise gradient descent to change it in ways that move it towards being a CEV-aligned superintelligence. I expect understanding human values to, indeed, be easy for a smart AI, and to make it easier for it to play along; but that doesn't automatically make human values an optimisation target. Knowing what humans want doesn't make the AI care unless you solve the problem of making it care.
The behaviour of subhuman models that seems "aligned" corresponds to a messy collection of stuff that kind of optimises for what humans give rewards for; but every time gradient descent makes the model grok more general optimisation/agency, the fuzzy thing that the messy collection of stuff had been optimised for is not going to influence the goal content of the new architecture gradient descent installs. There isn't a reason for gradient descent to preserve the goals and values of algorithms implemented by the neural network in the past: new, smarter AI algorithms implemented by the neural network can achieve a high score with a wider range of possible goals and values.
I'd guess the description of human values is probably shorter than a gigabyte of information or something; AI can learn what they are; but they're not simple enough for us to easily specify them as an optimization target (see The Hidden Complexity of Wishes).
They're capable of evaluating the consequences presented to them, but not more capable than humans. That said,
Conclusion
Unfortunately, in this post, I have not seen evidence that superintelligent AIs will be easy to control or align with human values. If a neural network implements a superhuman AI agent that wants something different from what we want, the post has not presented any evidence for thinking we'd be able to keep control over the future despite the impact of what this agent does, or to change the network so that it implements a superhuman AI agent aligned with human values in the CEV sense, or even just to notice that something is wrong with the agent before it's too late.
While we directly optimize the weights of our AI systems to get rewards, and changes in human brains in response to rewards are less clear and transparent, we do not know how to use this direct optimization to make a superintelligent AI want something we'd wish it wanted.
By default, superhuman AI systems that wipe out humanity won't have emotions. They're going to be extremely good optimizers. But it seems important to note that if we succeed at not dying in the next 20 years from extremely good optimizers, I'd want us to build AI systems with emotions only intentionally and after understanding how to design new minds. See Nonsentient Optimizers and Can't Unbirth a Child.
I focus on generally subhuman vs. generally superhuman systems, as this seems like a relevant distinction while being simpler to focus on, even though it loses some nuance. It seems that with inference being cheaper than training, once you've trained a human-level system, you can immediately run many copies of it, which can together make up a superhuman system (smart enough to solve alignment in a relatively short amount of time if it wanted to, and also capable enough to kill everyone). Many copies of subhuman systems, put together, won't be able to solve alignment, or any problems requiring a lot of the best human cognition. So, I imagine a fuzzy threshold around the human level and focus on it in this post.
There's also an open, more general problem, that I don't discuss here, of weaker systems steering stronger systems (not getting gamed and preserving preferences). We don't know how to do that.
And unfortunately, we don't know what each of the weights represents, and we don't have much transparency into the algorithms they implement; we don't understand the thought process, and we wouldn't know how to influence it in a way that'd work despite various internal optimization pressures.