Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is an experiment in having an Open Thread dedicated to AI Alignment discussion, hopefully enabling researchers and upcoming researchers to ask small questions they are confused about, share very early stage ideas and have lower-key discussions.
Has anyone seen this argument for discontinuous takeoff before? I propose that there will be a discontinuity in AI capabilities at the time that the following strategy becomes likely to succeed:
The reason for the discontinuity is that this strategy is an all-or-nothing kind of thing. There is a threshold in the chance of success in taking over other people's hardware, below which you're likely to get caught and punished/destroyed before you take over the world (and therefore almost nobody attempts it, and the few who do just quickly get caught), and above which the above strategy becomes feasible.
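A toy way to see why a threshold in success probability gives all-or-nothing behavior. This is only a sketch under an assumption that is not in the argument above: the actor simply compares expected gain against expected punishment. The payoff numbers and function names are hypothetical.

```python
def break_even_probability(gain, penalty):
    """Success probability at which attempting and not attempting are equally good.

    Assumes (hypothetically) payoff `gain` on success, `-penalty` on getting
    caught, and 0 for not attempting.
    """
    return penalty / (gain + penalty)


def attempt_is_worthwhile(p_success, gain, penalty):
    """True when the expected payoff of attempting beats not attempting."""
    return p_success * gain > (1.0 - p_success) * penalty


# Hypothetical payoffs, chosen only for illustration.
GAIN = 100.0
PENALTY = 50.0

threshold = break_even_probability(GAIN, PENALTY)
print(f"break-even success probability: {threshold:.3f}")

# The decision flips from "never attempt" to "attempt" as p crosses the threshold.
for p in (0.1, 0.3, 0.4, 0.9):
    print(p, attempt_is_worthwhile(p, GAIN, PENALTY))
```

Under this assumption the decision is a step function of the success probability, so a smooth increase in capability can still produce an abrupt change in what actors actually try.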
(Short writeup for the sake of putting the idea out there)
AI x-risk people often compare coordination around AI to coordination around nukes. If we ignore military applications of AI and restrict ourselves to misalignment, this seems like a weird analogy to me:
Perhaps a more appropriate nuclear analogy for AI x-risk would be accidents like Chernobyl.
There is a nuclear analog for accident risk. A quote from Richard Hamming:
If you're arguing against that, I'm still not sure what your counter-argument is. To me, the argument is: the upsides of nukes are the ability to take over the world (militarily) and to defend against such attempts. The downsides include risks of local and global catastrophe. People raced to develop nukes because they judged the upsides to be greater than the downsides, in part because they're not altruists and longtermists. It seems like people will develop potentially unsafe AI for analogous reasons: the upsides include the ability to take over the world (militarily or economically) and to defend against such attempts, and the downsides include risks of local and global catastrophe, and people will likely race to develop AI because they judge the upsides to be greater than the downsides, in part because they're not altruists and longtermists.
Where do you see this analogy breaking down?
Assuming you agree that people are already pushing too hard for progress in AGI capability (relative to what's ideal from a longtermist perspective), I think the current motivations for that are mostly things like money, prestige, scientific curiosity, wanting to make the world a better place (in a misguided/shorttermist way), etc., and not so much wanting to take over the world or to defend against such attempts. This seems likely to persist in the near future, but my concern is that if AGI research gets sufficiently close to fruition, governments will inevitably get involved and start pushing it even harder due to national security considerations. (Recall that the Manhattan Project started about 3 years before the detonation of the first nuke.) Your argument seems more about what's happening now, and does not really address this concern.
It wasn't meant as a reply to a particular thing - mainly I'm flagging this as an AI-risk analogy I like.
On that theme, one thing "we don't know if the nukes will ignite the atmosphere" has in common with AI-risk is that the risk is from reaching new configurations (e.g. temperatures of the sort you get out of a nuclear bomb inside the Earth's atmosphere) that we don't have experience with. Which is an entirely different question than "what happens with the nukes after we don't ignite the atmosphere in a test explosion".
I like thinking about coordination from this viewpoint.
For me it's because:
Also, not a reason, but:
AI seems likely to be weaponized, and warfare (whether conventional or not) seems like one of the areas where we should be most worried about "unbridled competition" creating a race-to-the-bottom on safety.
It seems to me that many people believe something like "We need proof-level guarantees, or something close to it, before we build powerful AI". I could interpret this in two different ways:
I am uncertain on the normative claim (there might be great benefits to building powerful AI sooner, including the reduction of other x-risks), and disagree with the empirical claim.
If I had to argue briefly for the empirical claim, it would go something like this: "Since powerful AI will be world-changing, it will either be really good, or really bad -- neutral impact is too implausible. But due to fragility of value, the really bad outcomes are far more likely. The only way to get enough evidence to rule out all of the bad outcomes is to have a proof that the AI system ...
My thoughts: we can't really expect to prove something like "this AI will be beneficial". However, relying on empiricism to test our algorithms is very likely to fail, because it's very plausible that there's a discontinuity in behavior around the region of human-level generality of intelligence (specifically as we move to the upper end, where the system can understand things like the whole training regime and its goal systems). So I don't know how to make good guesses about the behavior of very capable systems except through mathematical analysis.
There are two overlapping traditions in machine learning. There's a heavy empirical tradition, in which experimental methodology is used to judge the effectiveness of algorithms along various metrics. Then, there's machine learning theory (computational learning theory), in which algorithms are analyzed mathematically and properties are proven. This second tradition seems far more applicable to questions of safety.
(But we should not act as if we only have one historical example of a successful scientific field to try and generalize from. We can also look at how other fields accomplish difficult things, especially in the face of significant risks.)
I don't think you need to posit a discontinuity to expect tests to occasionally fail.
I suspect the crux is more about how bad a single failure of a sufficiently advanced AI is likely to be.
I'll admit I don't feel like I really understand the perspective of people who seem to think we'll be able to learn how to do alignment via trial-and-error (i.e. tolerating multiple failures). Here are some guesses why people might hold that sort of view:
- We'll develop AI in a well-designed box, so we can do a lot of debugging and stress testing.
  - counter-argument: but the concern is about what happens at deployment time
    - We'll deploy AI in a box, too then
      - counter: seems like that entails a massive performance hit (but it's not clear if that's actually the case)
- We'll have other "AI police" to stop any "evil AIs" that "go rogue" (just like we have for people).
  - counter: where did the AI police come from, and why can't they go rogue as well?
    - The "AI police" can just be the rest of the AIs in the world ganging up on anyone who goes rogue.
      - counter: this seems to be assuming the "corrigibility as basin of attraction
Yeah, this is why I think some kind of discontinuity is important to my case. I expect different kinds of problems to arise with very very capable systems. So I don't see why it makes sense to expect smaller problems to arise first which indicate the potential larger problems and allow people to avert them before they occur.
If a case could be made that all potential problems with very very capable systems could be expected to first arise in survivable forms in moderately capable systems, then I would see how the more empirical style of development could give rise to safe systems.
The mesa-optimizer paper, along with some other examples of important intellectual contributions to AI alignment, has two important properties:
[This is an "or" condition -- either one of those two condit...
Not sure who you have in mind as people believing this, but after searching both LW and Arbital, the closest thing I've found to a statement of the empirical claim is from Eliezer's 2012 Reply to Holden on ‘Tool AI’:
Paul Christiano argued against this at length in Stable self-improvement as an AI safety problem, concluding as follows:
Note that the above talked about "stable self-modification" instead of ‘running this AI system will be beneficial’, and the former is a much narrower and easier to formalize concept than the latter. I haven...
MIRIx events are funded by MIRI, but we don't decide the topics or anything. I haven't taken a poll of MIRI researchers to see how enthusiastic different people are about formal verification, but AFAIK Nate and Eliezer don't see it as super relevant. See https://www.lesswrong.com/posts/xCpuSfT5Lt6kkR3po/my-take-on-agent-foundations-formalizing-metaphilosophical#cGuMRFSi224RCNBZi and the idea of a "safety-story" in https://www.lesswrong.com/posts/8gqrbnW758qjHFTrH/security-mindset-and-ordinary-paranoia for better attempts to characterize what MIRI is looking for.
ETA: From the end of the latter dialogue,...
I mentioned that I expect proof-level guarantees will be easy once the conceptual problems are worked out. Strong interpretability is part of that: if we know how to "see whether the AI runs a check for whether it can deceive humans", then I expect systems which provably don't do that won't be much extra work. So we might disagree less on that front than it first seemed.
The question of whether to model the AI as an open-ended optimizer is one I figured would come up. I don't think we need to think of it as truly open-ended in order to use any of the above arguments, especially the wish-granting analogy. The relevant point is that limited optimization implies limited wish-granting ability. In order to grant more "difficult" wishes, the AI needs to steer the universe into a smaller chunk of state-space - in other words, it needs to perform stronger optimization. So AI with limited optimization capability will be safer to exactly the extent that they are unable to grant unsafe wishes - i.e. the chunks of state-space which they can access just don't contain really bad outcomes.
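One rough way to make "smaller chunk of state-space means stronger optimization" concrete is to count bits. This is a sketch only, under the assumption (not stated above) that optimization strength is measured as the negative log2 of the fraction of state-space the target occupies. The function names and numbers are hypothetical.

```python
import math


def bits_required(target_fraction):
    """Bits of optimization needed to land in a target occupying this
    fraction of state-space (assumed measure: -log2 of the fraction)."""
    if not 0.0 < target_fraction <= 1.0:
        raise ValueError("target_fraction must be in (0, 1]")
    return -math.log2(target_fraction)


def can_grant(optimizer_bits, wish_fraction):
    """A limited optimizer can grant a wish only if it has enough bits
    to reach a target that small."""
    return optimizer_bits >= bits_required(wish_fraction)


# Hypothetical wish: outcomes satisfying it make up 1/1024 of state-space.
wish = 1.0 / 1024.0
print(bits_required(wish))   # bits needed for this wish
print(can_grant(8, wish))    # weaker optimizer
print(can_grant(12, wish))   # stronger optimizer
```

On this picture a weak optimizer is safe only in the sense that the really bad outcomes sit in chunks too small for it to hit, which is the same reason it cannot grant the difficult good wishes either.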
It does sound like our disagreement is the same thing outlined in Realism about Rationality (although I disagree with almost all of the "realism about rationality" examples in that post - e.g. I don't think AGI will necessarily be an "agent", I don't think Turing machines or Kolmogorov complexity are useful foundations for epistemology, I'm not bothered by moral intuitions containing contradictions, etc).
I would also describe my "no proofs => doomed" view, not as the proofs being causally important, but as the proofs being evidence of understanding. If we don't have the proofs, it's highly unlikely that we understand the system well enough to usefully predict whether it is safe - but the proofs themselves play a relatively minor role.
I do not know of any engineering discipline which places most of the confidence in safety on comprehensive, expensive testing. Every single engineering discipline I have ever studied starts from understanding the system under design, the principles which govern its function, and designs a system which is expected to be safe based on that understanding. As long as those underlying principles are ...
A downside of the portfolio approach to AI safety research
Given typical human biases, researchers of each AI safety approach are likely to be overconfident about the difficulty and safety of the approach they're personally advocating and pursuing, which exacerbates the problem of unilateralist's curse in AI. This should be highlighted and kept in mind by practitioners of the portfolio approach to AI safety research (e.g., grant makers). In particular it may be a good idea to make sure researchers who are being funded have a good understanding of the overconfidence effect and other relevant biases, as well as the unilateralist's curse.
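A minimal sketch of how overconfidence compounds the unilateralist's curse, under assumptions that are mine rather than part of the comment above: each of n research groups independently estimates the value of deploying an approach as the true value plus Gaussian noise plus a shared optimism bias, and any group deploys if its own estimate is positive. All names and parameters are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_someone_deploys(true_value, noise_sd, n_groups, optimism_bias=0.0):
    """Probability that at least one of n_groups deploys.

    Assumed model: estimate_i = true_value + optimism_bias + noise_i,
    noise_i ~ Normal(0, noise_sd) independently, deploy if estimate_i > 0.
    """
    p_one = 1.0 - normal_cdf(-(true_value + optimism_bias) / noise_sd)
    return 1.0 - (1.0 - p_one) ** n_groups


# Hypothetical case: deployment is actually harmful (true value below zero).
for n in (1, 3, 10):
    unbiased = prob_someone_deploys(-1.0, 1.0, n)
    biased = prob_someone_deploys(-1.0, 1.0, n, optimism_bias=0.5)
    print(f"n={n:2d}  unbiased={unbiased:.3f}  overconfident={biased:.3f}")
```

In this toy model the chance that someone goes ahead with a harmful approach rises with the number of groups in the portfolio, and rises faster when every group is biased toward its own approach.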
I'm posting a few research directions in my research agenda about which I haven't written much elsewhere (except maybe in the MIRIx Discord server), and for which I so far haven't got the time to make a full length essay with mathematical details. Each direction is in a separate child comment.
You are handed a hypercomputer, and allowed to run any code you like on it. You can then take 1Tb of data from your computations and attach it to a normal computer. The hypercomputer is removed. You are then handed a magic human utility function. How do you make an FAI with these resources?
The normal computer is capable of running a highly efficient super-intelligence. The hypercomputer can do a brute force search for efficient algorithms. The idea is to split FAI into building a capability module, and a value module.
The telos of life is to collect and preserve information. That is to say: this is the defining behavior of a living system, so it is an inherent goal. The beginning of life must have involved some replicating medium for storing information. At first, life actively preserved information by replicating, and passively collected information through the process of evolution by natural selection. Now life forms have several ways of collecting and storing information: genetics, epigenetics, brains, immune systems, gut biomes, etc.
Obviously a system that collects a...
Is this open thread not going to be a monthly thing?
FWIW I liked reading the comment threads here, and would be inclined to participate in the future. But that's just my opinion. I'm curious if more senior people had reasons for not liking the idea?
I've just been invited to this forum. How do I decide whether to put a post on the Alignment Forum vs. Less Wrong?
I saw this thread complaining about the state of peer review in machine learning. Has anyone thought about trying to design a better peer review system, then creating a new ML conference around it and also adding in a safety emphasis?