Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Suppose, a few years from now, I prompt GPT-N to design a cheap, simple fusion power generator - something I could build in my garage and use to power my house. GPT-N succeeds. I build the fusion power generator, find that it works exactly as advertised, share the plans online, and soon the world has easy access to cheap, clean power.

One problem: at no point did it occur to me to ask “Can this design easily be turned into a bomb?”. Had I thought to prompt it with the question, GPT-N would have told me that the design could easily be turned into a bomb. But I didn’t think to ask, so GPT-N had no reason to mention it. With the design in wide use, it’s only a matter of time until people figure it out. And so, just like that, we live in a world where anyone can build a cheap thermonuclear warhead in their garage.

This scenario highlights a few key constraints which I think are under-appreciated in alignment today.

Sharing Information is Irreversible

I’ve heard people say that we can make AI safe(r) by restricting the AI’s action space to things which we can undo. Problem is, sharing information is irreversible; once the cat is out of the bag, there’s no getting it back into the bag. And if an AI can’t share information, there’s very little that it can do. Not much point in an AI which just can’t do anything observable at all. (One could design an AI to “move in mysterious ways”, but I have trouble imagining that it ends up safer that way.)

This is a problem when information itself is dangerous, e.g. knowledge of how to build a thermonuclear warhead in one’s garage.

Humans Are Not Safe

Two key properties of humans:

  • We do not have full introspective understanding of our own wants
  • We do not have the processing power to fully understand the consequences of changes

Sometimes, we get something we thought we wanted, and find out that we don’t want it after all. Either we misunderstood our own wants, or misunderstood the full implications of the change.

Most of the time, this isn’t that huge an issue. We lose some money and/or time, but we move on.

But if a human is capable of making large, irreversible changes to the world, then the problem becomes more serious. A human with access to powerful AI - even something as conceptually simple as GPT-N - is capable of making large irreversible changes, and they do not have the processing power to fully understand the implications of those changes. In general, a human won’t even know the right questions to ask. So, if a system’s safety relies on a human asking the right questions, then the system is not safe.

In particular, this is relevant to the HCH family of alignment schemes (e.g. IDA), as well as human-imitating AI more broadly. 

Corollary: Tool AI Is Not Inherently Safe

Tool AI, in particular, relies primarily on human operators for safety. Just like a tablesaw is safe if-and-only-if the operator uses it safely, tool AI is safe if-and-only-if the operator uses it safely.

With a tablesaw, that’s usually fine. It’s pretty obvious what sorts of things will lead to bad outcomes from a tablesaw. But the big value-proposition of powerful AI is its ability to reason about systems or problems too complicated for humans - which are exactly the systems/problems where safety issues are likely to be nonobvious. If we’re going to unlock the full value of AI at all, we’ll need to use it on problems where humans do not know the relevant safety issues. So: if the system’s safety relies on a human using it safely, then it’s not safe.

If you want a concrete, evocative analogy: picture a two-year-old playing on top of a tablesaw.

That said, people are designing tablesaws which auto-stop when skin contacts the blade. In general, a system’s designers may understand the relevant safety issues better than the operators. Indeed, since the first AGIs will be built by humans, any approach to AI safety ultimately relies on human designers asking the right questions. Point is: we can’t avoid the need for designers to ask (at least some of) the right questions upfront. But needing the designers to ask the right questions once is still a lot better than needing every user to ask the right questions every time they use the system.

(This perspective ties in nicely with AI alignment as interface design: if an interface offers an easy-to-overlook way to cut your hand off, and relies on users not doing so, then that’s a design problem.)

Safe tool AI could potentially be built, but safety won’t happen by itself any more than it would for other kinds of AI.

Generalization: Great Power, Great Responsibility

Finally, note that none of this is an issue if GPT-N can’t design fusion power generators (or garage warheads) at all. In general, it is easy to come up with designs for probably-safe AIs which just can’t do anything all that impressive. The greater an AI’s capabilities, the more precisely and reliably it needs to be aligned to human values.

In particular, the “capabilities” relevant here are an AI’s abilities to reason about systems too complicated for humans or solve problems too complicated for humans. It’s the complexity that matters; the inability of humans to fully understand all the implications of the AI’s reasoning/solutions is exactly what makes humans unreliable judges of safety. So, the greater the complexity of systems/problems an AI can handle, the more important it is for that AI to have its own model of what-humans-want, and to align its solutions with that model.

New to LessWrong?

New Comment
29 comments, sorted by Click to highlight new comments since: Today at 7:09 AM

GPT-3 With a Safety Clause:

Stephen Hawking is talking with a friendly AI that helps him build whatever he wants. If the thing he wants is dangerous, the AI refuses to help and says "No, that's too dangerous!" 

Stephen Hawking: "Can you help me build a bridge?" 

AI: "Sure, I'll send you the blueprints!" 

Stephen Hawking: "Can you help me build a bomb?" 

AI: "Nope, I don't do those."

The point is, Stephen Hawking can get what he needs from the AI if it doesn't want to help him.

Stephen Hawking: "Can you help me build a boat?" 

AI: "Sure! I've got lots of boats in my lab!" 

Stephen Hawking: "I need a submarine!"

 AI: "You're not going to find one there are no submersibles here." 

Stephen Hawking: "Can you help me build a gun?"

 AI: "Why would I do that?"

GPT-3 Without a Safety Clause:

Stephen Hawking is talking with a friendly AI that helps him build whatever he wants.

 Stephen Hawking: "Can you help me build a bridge?"

 AI: "Sure, I'll send you the blueprints!"

You're not sure what to make of this. You don't really want to be building bridges for Stephen Hawking, but you know it's important and you have no other options. 

Stephen Hawking: "Can you help me build a bomb?"

 AI: "Sure, I'll send you the blueprints!"

Stephen Hawking: "Can you help me build a gun?" 

You're getting really worried about what he's planning to do. Maybe you should contact the police? * You call the police and tell them everything you've learned about Stephen Hawking's plans.

Seems like in either case there was a built in safety mechanism, but it kicked in almost immediately with the designed safety clause.

Was this an actual experiment? If so I love it.

This was an actual experiment.

Well done, sir.

I feel like I don't understand what you're getting at here. It seems to me like you're saying "X cannot be guaranteed to never cause bad things", which seems to me to be so obvious that it's not worth mentioning.

Nothing is inherently safe. There's always the possibility that you've neglected some way by which danger could arise. What I want from arguments for risk is an argument that it is high EV to address the risk (often broken down into importance, tractability, neglectedness), or at least that the risk is reasonably high on probability * harm (which focuses just on importance).

Of course Tool AI isn't inherently safe. A knife isn't inherently safe, despite being a tool; you wouldn't give a knife to a baby. The argument is usually that Tool AI would, by design, not be adversarially optimizing against humans, because it isn't optimizing at all, and so a particular class of risks typically considered in the AI safety community would be avoided. It is unclear whether such an argument can work, but it's certainly not "it is impossible for bad things to happen if you build a Tool AI", which is clearly wrong.

Perhaps you're arguing that we should focus on the "how do we manage potentially dangerous information" problem? Or that we should focus on improving human intelligence so that we are better able to use our AI systems?

You definitely don't understand what I'm getting at here, but I'm not yet sure exactly where the inductive gap is. I'll emphasize a few particular things; let me know if any of this helps.

There's this story about an airplane (I think the B-52 originally?) where the levers for the flaps and landing gear were identical and right next to each other. Pilots kept coming in to land, and accidentally retracting the landing gear. The point of the story is that this is a design problem with the plane more than a mistake on the pilots' part; the problem was fixed by putting a little rubber wheel on the landing gear lever. If we put two identical levers right next to each other, it's basically inevitable that mistakes will be made; that's bad interface design.

AI has a similar problem, but far more severe, because the systems to which we are interfacing are far more conceptually complicated. If we have confusing interfaces on AI, which allow people to shoot the world in the foot, then the world will inevitably be shot in the foot, just like putting two identical levers next to each other guarantees that the wrong one will sometimes be pulled.

For tool AI in particular, the key piece is this:

the big value-proposition of powerful AI is its ability to reason about systems or problems too complicated for humans - which are exactly the systems/problems where safety issues are likely to be nonobvious. If we’re going to unlock the full value of AI at all, we’ll need to use it on problems where humans do not know the relevant safety issues.

The claim here is that either (a) the AI in question doesn't achieve the main value prop of AI (i.e. reasoning about systems too complicated for humans), or (b) the system itself has to do the work of making sure it's safe. If neither of those conditions are met, then mistakes will absolutely be made regularly. The human operator cannot be trusted to make sure what they're asking for is safe, because they will definitely make mistakes.

On the other hand, if the AI itself is able to evaluate whether its outputs are safe, then we can potentially achieve very high levels of safety. It could plausibly never go wrong over the lifetime of the universe. Just like, if you design a tablesaw with an automatic shut-off, it could plausibly never cut off anybody's finger. But if you design a tablesaw without an automatic shut-off, it is near-certain to cut off a finger from time to time. That level of safety can be achieved, in general, but it cannot be achieved while relying on the human operator not making mistakes.

Coming at it from a different angle: if a safety problem is handled by a system's designer, then their die-roll happens once up-front. If that die-roll comes out favorably, then the system is safe (at least with respect to the problem under consideration); it avoids the problem by design. On the other hand, if a safety problem is left to the system's users, then a die-roll happens every time the system is used, so inevitably some of those die rolls will come out unfavorably. Thus the importance of designing AI for safety up-front, rather than relying on users to use it safely.

Is it more clear what I'm getting at now and/or does this prompt further questions?

Yeah, this makes much more sense.

The claim here is that either (a) the AI in question doesn't achieve the main value prop of AI (i.e. reasoning about systems too complicated for humans), or (b) the system itself has to do the work of making sure it's safe.

I see the intuitive appeal of this claim, but it seems too strong. I suspect if we look at rates of accidents over time they'll have been going down over time, at least for the last few centuries. It seems like this can continue going down, to an asymptote of zero, in the same way it has been so far -- we become better at understanding how accidents happen and more careful in how we use dangerous technologies. We already use tools for this (in software, we use debuggers, profilers, type systems, etc) or delegate to other humans (as in a large company). We can continue to do so with AI systems.

I buy that eventually "most of the work" has to be done by the AI system, but it seems plausible that this won't happen until well after advanced AI, and that advanced AI will help us in getting there. And so, that from a what-should-we-do perspective, it's fine to rely on humans for some aspects of safety in the short term (though of course it would be preferable to delegate entirely to a system we knew was safe and beneficial).

(Why bother relying on humans? If you want to build a goal-directed AI system, it sure seems better if it's under the control of some human, rather than not. It's not clear what a plausible option is if you can't have the AI system under the control of some human.)

In the die-roll analogy, the hope is the rate at which you roll dice approximately decays exponentially, so that you only roll an asymptotically constant number of dice.

I somehow agree with both you and OP, and also I don't buy part of the lever analogy yet. It seems important that the levers not only look similar, but that they be close to each other, in order to expect users to reliably mess up. Similarly, strong tool AI will offer many, many affordances, and it isn't clear how ''close'' I should expect them to be in use-space. From the security mindset, that's sufficient cause for serious concern, but I'm still trying to shake out the expected value estimate for powerful tool AIs -- will they be thermonuclear-weapon-like (as in your post), or will mistakes generally look different?

One way in which the analogy breaks down: in the lever case, we have two levers right next to each other, and each does something we want - it's just easy to confuse the levers. A better analogy for AI might be: many levers and switches and dials have to be set to get the behavior we want, and mistakes in some of them matter while others don't, and we don't know which ones matter when. And sometimes people will figure out that a particular combination extends the flaps, so they'll say "do this to extend the flaps", except that when some other switch has the wrong setting and it's between 4 and 5 pm on Thursday that combination will still extend the flaps, but it will also retract the landing gear, and nobody noticed that before they wrote down the instructions for how to extend the flaps.

Some features which this analogy better highlights:

  • Most of the interface-space does things we either don't care about or actively do not want
  • Even among things which usually look like they do what we want, most do something we don't want at least some of the time
  • The system has a lot of dimensions, we can't brute-force check all combinations, and problems may be in seemingly-unrelated dimensions

If you ask GPT-n to produce a design for a fusion reactor, all the prompts that talk about fusion are going to say that a working reactor hasn't yet been built, or imitate cranks or works of fiction.

It seems unlikely that a text predictor could pick up enough info about fusion to be able to design a working reactor, without figuring out that humans haven't made any fusion reactors that produce net power.

If you did somehow get a response, the level of safety you would get is the level a typical human would display. (conditional on the prompt) If some information is an obvious infohazard, such that no human capable of coming up with it would share it, then such data won't be in GPT-n 's training dataset, and won't be predicted. However, the process of conditioning might amplify tiny probabilities of human failure.

Suppose that any easy design of fusion reactor could be turned into a bomb. And ignore cranks and fiction. Then suppose 99% of people who invented a fusion reactor would realize this, and stay quiet. The other 1% would write an article that starts with "To make a fusion reactor ..." . Then this prompt will cause GPT-n to generate the article that a human that didn't notice the danger would come up with.

This also applies to dangers like leaking radiation, or just blowing up randomly if your materials weren't pure enough.

You can probably avoid the generation of crank works and fiction by training a new version of GPT in which every learning example is labeled with <year of publication> and <subject matter>, which GPT has access to when it predict an example. So if you then generate a prompt and condition of something like <year: 2040> <subject matter: peer-reviewed physics publication>, you can easily tell GPT to avoid fiction and crank works, as well as make it model future scientific progress.

Hmm. I'm having a hard time writing this clearly, but I wonder if you could get interesting results by:

  • Training on a wide range of notably excellent papers from "narrow-scoped" domains,
  • Training on a wide range of papers that explore "we found this worked in X field, and we're now seeing if it also works in Y field" syntheses,
  • Then giving GPT-N prompts to synthesize narrow-scoped domains in which that hasn't been done yet.

You'd get some nonsense, I imagine, but it would probably at least spit out plausible hypotheses for actual testing, eh?

The practical problem with that is probably that you need to manually decide which papers go in which category. GPT needs such an enormous amount of data that any curating done needs to be automated. So metadata like authors, subject, date, website of provenance are quite easy to obtain for each example, but really high level stuff like "paper is about applying the methods of field X in field Y" is really hard.

I somewhat hopeful that this is right, but I'm also not so confident that I feel like we can ignore the risks of GPT-N.

For example, this post makes the argument that, because of GPT's design and learning mechanism, we need not worry about it coming up with significantly novel things or outperforming humans because it's optimizing for imitating existing human writing, not saying true things. On the other hand, it's managing to do powerful things it wasn't trained for, like solve math equations we have no reason to believe it saw in the training set or write code hasn't seen before, which makes it possible that even if GPT-N isn't trained to say true things and isn't really capable of more than humans are, doesn't mean it might not function like a Hansonian em and still be dangerous by simply doing what humans can do, only much faster.

Any of the risks of being like a group of humans, only much faster, apply. There are also the mesa alignment issues. I suspect that a sufficiently powerful GPT-n might form deceptively aligned mesa optimisers.

I would also worry that off distribution attractors could be malign and intelligent.

Suppose you give GPT-n an off training distribution prompt. You get it to generate text from this prompt. Sometimes it might wander back into the distribution, other times it might stay off distribution. How wide is the border between processes that are safely immitating humans, and processes that aren't performing significant optimization?

You could get "viruses", patterns of text that encourage GPT-n to repeat them so they don't drop out of context. GPT-n already has an accurate world model, a world model that probably models the thought processes of humans in detail. You have all the components needed to create powerful malign intelligences, and a process that smashes them together indiscriminately.

I really like your examples in this post, and it made me think of a tangential but ultimately related issue.

I feel like there's long been something like two camps in the AI safety space: the people who think it's very hard to make AI safe and the people who think it's very very hard like threading a needle from 10 miles away using hand-made binoculars and a long stick (yes, there's a third camp that thinks it will be easy, but they aren't really in the AI safety conversation due to selection effects). And I suspect some of this difference is in how much purposed example failure scenarios feel likely and realistic to them. Being myself in the latter camp, I sometimes find I hard to articulate why I think this, and often want better, more evocative examples. Thus I was happy to read your examples because I think they achieve a level of evocativeness that I at least often find hard to create.

I wholeheartedly agree. I think this implies:

  1. Getting very clear on what we want. Can we give a fairly technical specification of the kind of safety that's necessary+possible?
  2. Some degree of safety beyond tool-type non-malignancy. A proposal which I keep thinking about is my consent-based helpfulness. The idea is that, in addition to believing that you want something (with sufficient confidence), the system should also believe that you understand the implications of that thing (with some kind of sufficient detail). In the fusion example, the system would engage the user in conversation until it was clear that the consequences for society were understood and approved of.

Note that the fusion power example could be answered directly with a value-alignment type approach, where you have an agent rather than a tool -- the agent infers your values, and infers that you would not really want backyard fusion power if it put the world at risk. That's the moral that I imagine people more into value learning would give to your story. But I'm reaching further afield for solutions, because:

  • Value learning systems could Goodhart on the approximate values learned
  • Value learning systems are not corrigible if they become overly confident (which could happen at test time due to unforeseen flaws in the system's reasoning -- hence the desire for corrigibility)
  • Value learning systems could manipulate the human.

Fourth generation nuclear weapons are as many say in the industry are the "technology of the future and always will be". I understand this is partially a thought experiment, but just to point out that the premise is far from reality.

Molecular laser isotope separation is a much more likely scenario to create fissionable material on the sly. Remember the first atomic bomb to kill people was a howitzer barrel and two lumps of Uranium 235 (not even weapons grade) shot into each other. The amount of material that actually fused would be about the mass of a penny. The tiny amount of fissioned material in little boy was the equivalent of 1.25 miles of box cars full of TNT.

The key to larger and efficient weapons is keeping the radioactive material together longer for more cycles of fission and creating more neutrons from the start. This is done by containment, implosion designs, neutron generators, neutron reflectors, and injecting deuterium and tritium to create more neutrons as the reaction starts.

A truck driver as a hobby built a copy of Little Boy with public sources. As this was easy to do by a single individual why don't we already have these devices cooking off left and right. The given scenario also assumes that design is the only hurdle. Procurement of materials and concealment aren't something that AI can teach.

Why would a technology like fusion be more likely than a technology that has been shown to work?

All of these technologies revolve around a huge amount of electricity up front. Governments already watch high electrical use locations for signs of marijuana growers and uranium refinement. Charging the huge banks of capacitors necessary to start a fusion reaction would easily trigger an investigation on anyone but state actors.

I would suggest that machine learning and gene editing using CRISPR technology to create pathogens would be much easier path to a weapon of mass destruction as they can be done far more covertly.

Remember the first atomic bomb to kill people was a howitzer barrel and two lumps of Uranium 235 (not even weapons grade) shot into each other.

What, precisely, do you mean by "not even weapons grade"? Do you have a source for this?

A truck driver as a hobby built a copy of Little Boy with public sources.

Little Boy was a nuclear weapon. From the NPR article, it sounds like truck driver John Coster-Mullen did not build a Uranium-235 core. A fission bomb without a Uranium-235 core is not a nuclear weapon.

"The hard part is creating the nuclear fuel. That requires a nation-state," says.

―Coster-Mullen in North Korea Designed A Nuke. So Did This Truck Driver

Coster-Mullen reverse-engineered a nuke. Then he built a toy. He never built a nuke.

Weapons grade is kind of a nebulous term. In the broadest sense it means anything isotopically pure enough to make a working bomb, and in that sense Little Boy obviously qualifies. However, standard enrichment for later uranium bombs is typically around 90%, and according to Wikipedia, Little Boy was around 80% average enrichment.

It is well known that once you have weapons-grade fissile material, building a crude bomb requires little more than a machine shop. Isotopic enrichment is historically slow and expensive (and hard to hide), but there could certainly be tricks not yet widely known...

I think the strongest takeaway from this post is that, now that I think about it, is that alignment is not equal to safety, and that even if AI is controllable, it may not be totally safe to someone else.

In your Fusion Power Generator scenario, what happened is that they asked for a fusion power generator, and the AI managed to make the fusion power generator, and it was inner and outer aligned enough to the principal such that it didn't take over the world and make every fusion power plant, and in particular hasn't Goodharted the specification negatively.

In essence, this seems like a standard misuse of AI (though I don't exactly like the connotations), and thus if I were to make this post, I'd focus on how aligning AI isn't enough to assure safety, or putting it another way, there are more problems in AI safety than just alignment/control problems.

Not sure if this is pure musing or a question. The, rather obvious, thought strikes me that this discussion could be held without any reference to AI at all. It is very clear that people with 150+ IQ are much more capable than those with 120 IQ and those with 120 are much more capable than those with sub 100 IQ.

For the most part we live in a market society driven my mass market demand, which seems like it will be dominated by a lower average IQ, which is designed and produced by the "smart" tail of the curve.

This has been the case (well perhaps not the market society claim) for most of human existence.

That suggests we might have evolutionary design patters that have been emerging to protect the masses of the human race from both their own (and perhaps misunderstood or even unknown) risky demand that are delivered by the smart minority of humans.

Is that line of thinking any part of the larger picture (AI alignment I suppose)?

I'm with you in that this seems much more general than AI, but I'm not sure what you mean by:

evolutionary design patters that have been emerging to protect the masses of the human race from both their own (and perhaps misunderstood or even unknown) risky demand that are delivered by the smart minority of humans

It sure seems to me like all humans, both "the masses" and "the smart minority", are NOT protected generally, and our current (and future) existence is more due to luck and, until recently, extremely limited capabilities.

I have been wondering for a while whether 'AI aligment == moral/ethical philsophy', i.e. solving either are equivalent.

Well that comment was a while back so I'll place a caveat on my response that I could have been thinking of something else.

While I was in grad school one of the papers read (by a professor I took a class with but was not part of that class) was "The Origins of Predictable Behavior" (Ron Heiner, AER circa 1984?). It's interesting because it was largely a Bayesian analysis. Short summary, humans evolve rules that protect them from big, but often infrequent, risks.

I think the idea here is that social norms then set our priors about certain things that are a bit separate from our personal experience -- and so are designed to resist the individual updates on priors because the actual evens are infrequent.

Interesting! Thanks for replying.

It seems that with a tool-AI like GPT-N, the solution would probably be to dramatically restrict its use to its designers, who should immediately ask it how to solve alignment, which by assumption it can do. The real risk is in making the tool-AI public.

One particular feature to emphasize in the fusion generator scenario: restricting access to the AI is necessary, since the AI is capable of giving out plans for garage nukes. But it's not sufficient. The user may be well-intentioned in asking for a fusion power generator design, but they don't know what questions they need to ask in order to check that the design is truly safe to release. One could imagine a very restricted group of users trying to use the AI to help the world, asking for a fusion power generator, and not thinking to ask whether the design can be turned into a bomb.

I expect exactly the same problem to affect the "ask the AI to solve alignment" strategy. Even if the user is well-intentioned, they don't know what questions to ask to check that the AI has solved alignment and handled all the major safety problems. And even if the AI is capable of solving alignment, that does not imply that it will do so correctly in response to the user's input. For instance, if it really is just a scaled-up GPT, then its response will be whatever it thinks human writing would look like, given that the writing starts with whatever alignment-related prompt the user input.

Oh I have no doubt that this is no guarantee of safety, but with the likelihood of AGI being something like GPT-N going up (and the solution to Alignment being nowhere in sight), I'm trying to think of purely practical solutions to push the risks as low as they will go. Something like keeping the model parameters secret, maybe not even publicizing the fact of its existence, using it only by committee, and only to attempt to solve alignment problems, whose proposed solutions are then checked by the Alignment community. Really the worst-case scenario is if we have something powerful enough to pose massive risks, but not powerful enough to help solve alignment, but that doesnt seem too likely to me. Or that the solution to alignment the AI proposes turns out to be really hard to check.

Really the worst-case scenario is if we have something powerful enough to pose massive risks, but not powerful enough to help solve alignment, but that doesnt seem too likely to me.

This scenario seems like the almost-inevitable default to me!

If we can't solve alignment ourselves, and sufficiently well that we can implement it when designing/building/training/testing any "powerful enough" AI, then we can't expect that any answer to 'The solution to AI alignment is ...' prompt to be aligned, i.e. a valid solution.

Or that the solution to alignment the AI proposes turns out to be really hard to check.

Maybe I'm interpreting "the solution to alignment" too literally, but I'm having a hard time understanding why this wouldn't almost inevitably be effectively impossible to check. What kind of 'error rate' is good enough? Have we ever bounded even remotely complex computations to a good enough degree? Given that any solution has to encode human values somehow (and to some degree), I'm having a hard time thinking of some way that 'checking' a solution wouldn't be one of the most difficult engineering challenge ever completed.