Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Summary / Preamble

In AGI Ruin: A List of Lethalities, Eliezer writes “A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure.”

I have larger error-bars than Eliezer on some AI-safety-related beliefs, but I share many of his concerns (thanks in large part to being influenced by his writings).

In this series I will try to explore if we might:

  • Start out with a superintelligent AGI that may be unaligned (but seems superficially aligned)
  • Only use the AGI in ways where it's channels of causal influence are minimized (and where great steps are taken to make it hard for the AGI to hack itself out of the "box" it's in)
  • Work quickly but step-by-step towards a AGI-system that probably is aligned, enabling us to use it in more and more extensive ways (as we get more assurances that it's aligned)

From the AGI-system we may (directly or indirectly) obtain programs that are interpretable and verifiable. These specialized programs could give us new capabilities, and we may trust these capabilities to be aligned and safe (even if we don't trust the AGI to be so). We may use these capabilities to help us with verification, widening the scope of programs we are able to verify (and maybe helping us to make the AGI-system safer to interact with). This could perhaps be a positive feedback-loop of sorts, where we get more and more aligned capabilities, and the AGI-system becomes safer and safer to interact with.

The reasons for exploring these kinds of strategies are two-fold:

  • Maybe we wont solve alignment prior to getting superintelligence (even though it would be better if we did!)
  • Even if we think we have solved alignment prior to superintelligence, some of the techniques and strategies outlined here could be encouraged as best practice, so that we get additional layers of alignment-assurance.

The strategy as a whole involves many iterative and contingency-dependent steps working together. I don't claim to have a 100% watertight and crystalized plan that would get us from A to B. Maybe some readers could be inspired to build upon some of the ideas or analyze them more comprehensively.

Are any of the ideas in this series new? See here for a discussion of that.


Me: I have some ideas about how how to make use of an unaligned AGI-system to make an aligned AGI-system.

Imaginary friend: My system 1 is predicting that a lot of confused and misguided ideas are about to come out of your mouth.

Me: I guess we’ll see. Maybe I'm missing the mark somehow. But do hear me out.

Imaginary friend: Ok.

Me: First off, do we agree that a superintelligence would be able to understand what you want when asking for something, presuming that it is given enough information?

Imaginary friend: Well, kind of. Often there isn’t really a clear answer to what you want.

Me: Sure. But it would probably be good at predicting what looks to me like good answers. Even if it isn’t properly aligned, it would probably be extremely good at pretending to give me what I want. Right?

Imaginary friend: Agreed.

Me: So if I said to it “show me the best source code you can come up with for an aligned AGI-system, and write the code in such a way that it’s as easy as possible to verify that it works as it should”, then what it gave me would look really helpful - with no easily way for me to see a difference between what I’m provided and what I would be provided if it was aligned. Right?

Imaginary friend: I guess I sort of agree. Like, if it answered your request it would probably look really convincing. But maybe it doesn’t answer your question. It could find a security vulnerability in the OS, and hack itself onto the internet somehow - that would be game over before you even got to ask it any questions. Or maybe you didn’t even try to box it in the first place, since you didn’t realize how capable your AI-system was getting, and it was hiding its capabilities from you.

Imaginary friend: Or maybe it socially manipulated you in some really clever way, or “hacked” your neural circuitry somehow through sensory input, or figured out some way it could affect the physical world from within the digital realm (e.g. generating radio waves by “thinking”, or some thing we don't even know is physically possible).

When we are dealing with a system that may prefer to destroy us (for instrumentally convergent reasons), and that system may be orders of magnitude smarter than ourselves - well, it's better to be too careful than not paranoid enough..

Me: I agree with all that. But it’s hard to cover all the branches of things that should be considered in one conversation-path. So for the time being, let’s assume a hypothetical situation where the AI is “boxed” in. And let’s assume that we know it’s extremely capable, and that it can’t “hack” itself out of the box in some direct way (like exploiting a security flaw in the operating system). Ok?

Imaginary friend: Ok.

Me: I presume you agree that there are more and less safe ways to use a superintelligent AGI-system? To give an exaggerated example: There is a big difference between “letting it onto the internet” and “having it boxed in, only giving it multiplication questions, and only letting it answer yes or no”.

Imaginary friend: Obviously. But even if you only give it multiplication-questions, some other team will sooner or later develop AGI and be less scrupulous..

Me: Sure. But still, we agree that there are more and less safe to try to use an AGI? There is a “scale” of sorts?

Imaginary friend: Of course.

Me: Would you also agree that there is a “scale” for how hard it is for an oracle/genie to “trick” you into falsely believing that it has provided you with what you want? For example, if I ask it to prove a mathematical conjecture, that is much harder to “pretend” to do the way I want it without actually doing it (compared to most things)?

Imaginary friend: Sure.

Me: What I want to talk about are ways of asking an AGI genie/oracle for things in ways where it’s hard for it to “pretend” that it’s giving us what we want without doing it. And ways we might leverage that to eventually end up with an aligned AGI-system, while trying to keep the total risk (of all the steps we take) low.

Imaginary friend: My system 1 suspects I am about to hear some half-baked ideas.

Me: And your system 1 may have a point. I don’t claim to have detailed and watertight arguments showing exactly how we get from A to B. What I have is an outline of how we might get there, while minimizing risk along the way (not to 0%, but to way less than 50% if suggestions are executed faithfully).

Believe me, I don’t have full knowledge and understanding of everything that has been explored in the AI alignment community…

Imaginary friend: I believe you.

Me: …but it seems to me that the kinds of techniques and strategies I'll be outlining are under-discussed.

Imaginary friend: It sounds like you are looking for ways to avoid having to solve the alignment problem, so that you can focus on other things instead of solving it beforehand.

Me: I guess it can be interpreted that way. I don’t want it to come across that way though! If we find reliable methods to align a superintelligent AGIs before we get superintelligent AGIs, then that would improve our odds of success!

But maybe we don’t. If we don’t then it’s better to have some plans ready, so that less improvisation is necessary. AFAIK there is no fire alarm for AGI, and it certainly doesn't seem too early to start analyzing these kinds of scenarios in a more comprehensive way..

And also, even if we think we have solved alignment in a reliable way prior to superintelligence, some of the strategies and principles I’ll discuss can be used as additional layers of alignment-assurance. Hence me thinking that it may be a good idea to develop these ideas further, and to encourage them as best practice.

Restrictions in expressivity of AI

Me: The first part of this discussion will focus on things that we can do while the AGI stays in the digital realm. We will get to the question of how to do complicated things in the non-digital realm (and on the internet), but that’s for later.

Imaginary friend: You understand that any use of an AGI enables it to affect the non-digital realm, right? If you let it write or modify code, that is a way for the AGI to communicate with humans, and communicating with humans is a way to affect the non-digital realm.

Me: Yes, I agree with that. But I would point out that it is up to us how much “expressivity” we give the AI-system. For example, if we get help from it to point out potential mistakes in some piece of code, maybe we don’t let it write text to explain itself (limiting instead to things such as pointing out areas of code to take an extra look at, highlighting examples of input that may produce unintended behavior, etc). And if we do let it write text, then it is up to us how we limit the expressivity of what it can write in text.

Imaginary friend: Ok, did you by the way notice that you spoke about not letting it write text - but letting it write example input? Well, example input can be text! Those kinds of things can be easy to miss, and often you may not be aware of what you’re missing.

Me: I may miss such things sometimes, but sometimes I will also leave them out even though I’m aware of them, so as to not get into tangents all the time. I did notice that one though. Partly thanks to you helping me to get better at noticing those sorts of things.

Imaginary friend: That’s what imaginary friends are for.

Keep in mind that any way to point out bugs or potential problems is a way for the AI to affect how you think and what you think about. Take this test of selective attention for example. Or look at how magicians focus and misdirect people's attention to trick them. There are lots of things like this.

Me: I’m certainly not claiming that the techniques I’ll describe are risk free. At the same time, it’s important to consider how different techniques can come together to reduce risk. For example, one thing that will be touched upon again and again is testing the convergence of the results from different methods and architectures and pieces of code. Preferably, if an AI-system writes the code for some module, then the overall system should ensure that we also have that module be written by other AI-systems, and also by humans if possible.

In regards to e.g. AIs pointing out potential issues, the system as a whole (a system of systems of systems..) would preferably keep track of the changes made due to such suggestions, and also try to keep track of what the output of the system would be if those changes hadn’t been made.

Imaginary friend: Ok. But just because something makes you more safe, and you can handwave to a lot of techniques where each thing helps a bit with safety - this does not show that the the process as a whole will be safe. And I am a bit worried about you proposing ideas that are hard to criticize because they’re complicated and vague, instead of getting to the “core” of the problem. Also, adding on lots of complication to increase security can in some cases be counter-productive.

Me: Some good points there. Although do I feel uncertain about the degree to which there is a “core” of the problem. I want this series to outline a path towards a more "complete" solution, but I want to start by focusing more on techniques and principles that are "building-blocks".

The best scenario is if the first superintelligent AGI is robustly aligned from the start. I think we should try hard to make that be the case, and I don't hope this series is much of a distraction from that.

I am reminded of how I am in support of trying to reduce CO2 emissions today, but also support research on geoengineering.

Topics I'll cover in this series

Imaginary friend: So where do you see this discussion going?

Me: I guess we’ll see. But I do have some plans.

One thing I want to discuss is the concept of “solution-generators” (and maybe also “solution-generators-generators”, etc). When asking an oracle/genie for something, it may often make sense to not ask the question directly, but to instead ask for a “solution-generator” - that is to say, a function that generates solutions within the domain in question. These “generators” should be optimized for being narrow/specialized/modular/verifiable (techniques for trying to ensure that the system is faithfully optimizing the “generator” for these properties is a topic by itself).

For example, if we want to do a physics-simulation, then we might want to use a “simulation-generator” or “simulation-generator-generator”. One merit of this is that the code can be easier to inspect and understand than the code of a superintelligent AGI. Another merit is that the “generator” can be tested against existing data.

But of course, you can also ask it to answer the question directly (in addition to asking for the “solution-generator”). And then you can see if the output is the same when you obtain it indirectly vs when you obtain it directly.

Imaginary friend: Isn’t it better to not ask it the specific question, so that it’s harder for it to guess what you’re after?

Me: I will mostly not be mentioning this sort of thing explicitly, but yes I agree, you shouldn’t ask both questions of the same instance of the AGI-system. The instance that is asked for A should not know that you also have asked it for B. Though of course, you could have additional instances that are asked both questions, and check if results/output converge.

Another topic I want to speculate a bit about is the topic of brain emulations, and also other “generators” that try to predict the behavior and thinking of humans. I’m interested in the question of “Can we bootstrap from AGI to either brain emulations and/or something else that can predict human behavior, while more or less staying within the digital realm?”. Obviously there are lots of ethical considerations here, and it’s very important to avoid suffering subroutines insofar as possible!

Me: I also want to discuss techniques involving computable argument-trees/”proofs”, where every inference-step is made explicit. As well as outlining possible techniques to have the concepts/propositions of such proofs represent more or less any thought that is sufficiently “clear/crisp/explicit” (blurring the distinction between “mathematical” proofs and any other argument about anything). Included in the discussion will be outlines of ideas for how to deal with “vagueness” and cluster-like concepts within such argument-trees/”proofs”.

And I’ll be outlining thoughts about capabilities that I think will help with verifying that instructions for doing things in the real world (developing new types of machines and that sort of thing) will work as intended. Such as for example copying a strawberry at the molecular level without unintended consequences. Among other things there will be some focus on “generators” for mappings between (1) models/ontologies, and (2) data-structures representing geometric structures (e.g. some sort of physics-simulation), and (3) real things in the actual world that the models are meant to refer to.

The more people there are who (1) are smart and have thought a lot about something and (2) see things differently from you, the more reason for self-doubt about your own judgment. And this is for me a significant source of uncertainty about my ideas in regards to alignment (and AI more generally). But it seems best to me to just try to describe my perspective as well as I can, and then people can do with that what seems best to them.

Imaginary friend: Talk to you later then.

Any feedback or comments (be that positive or negative or neither) would be received with interest.


Ω 4

6 comments, sorted by Click to highlight new comments since: Today at 11:46 AM
New Comment

I skimmed the post, here is some feedback (context):

  • I'm probably not the best person to red team this since some of my own alignment ideas are along similar lines. I'm also a bit on the optimistic side about alignment more generally -- it might be better to talk to a pessimist.

  • This sounds a bit like the idea of a "low-bandwidth oracle".

  • I think the biggest difficulty is the one you explicitly acknowledged -- boxing is hard.

  • But there are also problems around ensuring that bandwidth is actually limited. If you have a human check to see that the AGI's output conforms to the spec, the AGI could put an infohazard in the output, or write code which looks like it does one thing but actually does another. If you have a computer program or AI check the AI's output, the AGI could exploit a flaw in that program/AI. I think this sort of thing basically gets you a probabilistic safety measure, because there's always a risk that there's a flaw that the superintelligent AGI sees (or can infer) that you don't see. (I like this intuition pump for seeing why these sort of problems are plausible.) I think probabilistic safety measures can be good if we stack a lot of them together in the right way.

  • The idea of emitting machine-checkable proofs is interesting. I'm not sure such proofs are very useful though. "Finding the right spec is one of the biggest challenges in formal methods." - source. And finding the right spec seems more difficult to outsource to an unfriendly AI. In general, I think using AI to improve software reliability seems good, and tractable.

I think you'll find it easier to get feedback if you keep your writing brief. Assume the reader's time is valuable. Sentences like "I will mention some stuff later that maybe will make it more clear how I’d think about such a question." should simply be deleted -- make huge cuts. I think I might have been able to generate the bullet points above based on a 2-paragraph executive summary of your post. Maybe post a summary at the top, and say people are welcome to give feedback after just having read the summary.

Similarly, I think it is worth investing in clarity. If a sentence is unclear, I have a tendency to just keep reading and not mention it unless I have a prior that the author knows what they're talking about. (The older I get, the more I assume that unclear writing means the author is confused and ignorable.) I like writing advice from Paul Graham and Scott Adams.

Personally I'm more willing to give feedback on prepublication drafts because that gives me more influence on what people end up reading. I don't have much time to do feedback right now unfortunately.

This sounds a bit like the idea of a "low-bandwidth oracle".

Thanks, that's interesting. Hadn't seen that (insofar as I can remember). Definitely overlap there.

I think probabilistic safety measures can be good if we stack a lot of them together in the right way.

Same, and that's a good/crisp way to put it. 

Maybe post a summary at the top, and say people are welcome to give feedback after just having read the summary.

Will edit at some point so as to follow the first part of that suggestion. Thanks!

I think you'll find it easier to get feedback if you keep your writing brief. (..) I think I might have been able to generate the bullet points above based on a 2-paragraph executive summary of your post.

Some things in that bullet-list addresses stuff I left out to cut length, and stuff I though I would address in future parts of series. Found also those parts of bullet-list helpful, but still this exemplifies dilemmas/tradeoffs regarding length. Will try to make more effort to look for things to make shorter based on your advice. And I should have read through this one more before publishing. 

Personally I'm more willing to give feedback on prepublication drafts because that gives me more influence on what people end up reading.

I have a first draft ready for part 2 now:

Will read it over more, but plan to post within maybe a few days.

I have also made a few changes to part 1, and will probably make additional changes to part 1 over time.

As you can see if you open the Google Doc, part 2 is not any shorter than part 1. You may or may not interpret that as an indication that I don't make effective use of feedback.

Part 3, which I have not finished, is the part that will focus more on proofs.

Any help from anyone in reading over would be appreciated, but at the same time it is not expected :)

Me: So if I said to it “show me the best source code you can come up with for an aligned AGI-system, and write the code in such a way that it’s as easy as possible to verify that it works as it should”, then what it gave me would look really helpful - with no easily way for me to see a difference between what I’m provided and what I would be provided if it was aligned.

Is this really true? I would guess that we might be so far from solving alignment that nothing would look particularly helpful? Or, even worse, the only thing that would look helpful is something completely wrong?

Thanks for commenting :)

> I would guess that we might be so far from solving alignment that nothing would look particularly helpful?

My thinking is that using reinforcement-learning-like methods will select towards systems that look like they are aligned / optimized for what we are trying to optimize them for. If the system gives answers/solutions/etc where humans can see that it doesn't really optimize well for what we want it to optimize for, then I presume it would be tweaked further until that no longer was the case. For example, suppose that we get it to write code for us, and we select how easy the code is to read/understand for humans as an optimization-criteria, and then it returns code that has clear examples of things that could be improved - well, then it would presumably be tweaked further (and it would presumably be a bad strategy for it if it was trying to trick us into thinking it was aligned).

That being said, it is a possibility that a system could be pretending that it's less capable than it really is. That way it could return answers that didn't look particularly helpful. Which I guess sort of makes the answer to your question "yes". As in, there could be scenarios with AGI/superintelligence where it is "pretending" to not be an AGI/superintelligence. And then there could be other scenarios where it's not hiding that it's really capable, but where it does pretend to be less capable than it really is at certain types of things. But there would presumably be incentives for it to not do that to such a degree that it became easy for humans to notice that it isn't doing its best. (Btw, I am consciously leaving out a few types of conceivable scenarios so as to make this comment less jumbled.)

Most of this series will focus on scenarios where we know that the system in question is superintelligent, and where we had the foresight to "box" it in (making it hard to hack itself out through security vulnerabilities in operating system and that sort of stuff) well before it acquired anything resembling superintelligence (or something that plausibly might self-improve in a self-reinforcing feedback-loop towards anything resembling superintelligence).

> Or, even worse, the only thing that would look helpful is something completely wrong?

Yes, that is a major worry. And that is one of the main concerns/problems/difficulties that I'm hoping to address in this series (not the only one, but one of the major ones). That is to say, I want to outline strategies/techniques that are intended to make it as hard and as unlikely as possible for the AI-system to be able to trick us into thinking it is being helpful (giving us what we want) while it really isn't.

Are any of the ideas in this series new?

TLDR: Many ideas in this series have been written about before by others. But it also seems to me that some of the ideas are new and important, or if not new then at least under-discussed. But I could be wrong.

There are many big and small ideas in this series. Many of them are have been described or alluded to elsewhere by others. But there are also ideas in this series that (1) seem to me like they probably are important and (2) I cannot remember having seen described elsewhere.

I've had a hobby interest in the topic superintelligence since 2009, and in the topic of AI Alignment since 2014. So I've read a lot of what has been written, but there is also a lot that I haven't read. I could be presenting ideas in this series that seem new/interesting to me, but that actually are old. 

Here are some writings I am aware of that I think are relevant to ideas in this series (some of them have influenced me, and some of them I know overlap quite a bit):

  • Eric Drexler has written extensively about principles and techniques for designing AGI-systems that don't have agent-like behavior. In Comprehensive AI Services as General Intelligence he - well, I'm not goanna do his 200-page document justice, but one of the things he writes about is having AGI-systems consisting of more narrowly intelligent sub-systems (that are constrained and limited in what they do).
  • Paul Christiano and others have written about concepts such as Iterated Distillation and Amplification and AI safety techniques revolving around debate.
  • Several people have talked about using AIs to help with work on AI safety. For example, Paul Christiano writes: "I think Eliezer is probably wrong about how useful AI systems will become, including for tasks like AI alignment, before it is catastrophically dangerous. I believe we are relatively quickly approaching AI systems that can meaningfully accelerate progress by generating ideas, recognizing problems for those ideas and, proposing modifications to proposals, etc. and that all of those things will become possible in a small way well before AI systems that can double the pace of AI research. By the time AI systems can double the pace of AI research, it seems like they can greatly accelerate the pace of alignment research."
    • To me both Eliezer's and Paul's intuitions about FOOM seem like they plausibly could be correct. However, I think my object-level intuitions are more similar to Eliezer's than Paul's. Which partly explains my interest in how we might go about getting help from an AGI-system with alignment after it has become catastrophically dangerous. If AI-systems help us to more or less solve alignment before they become catastrophically dangerous, then that would of course be preferable - and, I dunno, perhaps they will, but I prefer to contemplate scenarios where they don't.
  • Several people have pointed out that verification often is easier than generation, and that this is a principle we can make heavy use of in AI Alignment. Paul Christiano writes: "Eliezer seems to argue that humans couldn’t verify pivotal acts proposed by AI systems (e.g. contributions to alignment research), and that this further makes it difficult to safely perform pivotal acts. (...) I think that this claim is probably wrong and clearly overconfident. I think it doesn’t match well with pragmatic experience in R&D in almost any domain, where verification is much, much easier than generation in virtually every domain."
    • It seems to me as well that Eliezer is greatly underestimating the power of verifiability. At the same time, I know Eliezer is really smart and has thought deeply about AI Alignment. Eliezer seems to think quite differently from me on this topic, but the reasons why are not clear to me, and this irks me.
  • Steve Omohundro (a fellow enthusiast of proofs, verifiability, and of combining symbolic and connectionist systems) has written about Safe-AI Scaffolding. He writes: "Ancient builders created the idea of first building a wood form on top of which the stone arch could be built. Once the arch was completed and stable, the wood form could be removed. (...) We can safely develop autonomous technologies in a similar way. We build a sequence of provably-safe autonomous systems which are used in the construction of more powerful and less limited successor systems. The early systems are used to model human values and governance structures. They are also used to construct proofs of safety and other desired characteristics for more complex and less limited successor systems."
  • As John Maxwell pointed out to me, Stuart Armstrong has written about Low-Bandwidth Oracles. In the same comment he wrote "I'm probably not the best person to red team this since some of my own alignment ideas are along similar lines".
  • In his book Superintelligence, Nick Bostrom writes: "We could ask such an oracle questions of a type for which it is difficult to find the answer but easy to verify whether a given answer is correct. Many mathematical problems are of this kind. If we are wondering whether a mathematical proposition is true, we could ask the oracle to produce a proof or disproof of the proposition. Finding the proof may require insight and creativity beyond our ken, but checking a purported proof’s validity can be done by a simple mechanical procedure. (...) For instance, one might ask for the solution to various technical or philosophical problems that may arise in the course of trying to develop more advanced motivation selection methods. If we had a proposed AI design alleged to be safe, we could ask an oracle whether it could identify any significant flaw in the design, and whether it could explain any such flaw to us in twenty words or less. (...) Suffice it here to note that the protocol determining which questions are asked, in which sequence, and how the answers are reported and disseminated could be of great significance. One might also consider whether to try to build the oracle in such a way that it would refuse to answer any question in cases where it predicts that its answering would have consequences classified as catastrophic according to some rough-and-ready criteria."
  • There are at least a couple of posts on LessWrong with titles such as Bootstrapped Alignment.

I am reminded a bit of experiments where people are told to hum a melody, and how it's often quite hard for others to guess what song you are humming. AI Alignment discussion feels to me a bit like that sometimes - it's hard to convey exactly what I have in mind, and hard to guess exactly what others have in mind. Often there is a lot of inferential distance, and we are forced to convey great networks of thoughts and concepts over the low bandwidth medium of text/speech.

I am also reminded a bit about Holden Karnofsky's article Thoughts on the Singularity Institute (SI), where he wrote:

One of the things that bothers me most about SI is that there is practically no public content, as far as I can tell, explicitly addressing the idea of a "tool" and giving arguments for why AGI is likely to work only as an "agent."

In his response to this, Eliezer wrote the following:

Tool AI wasn't the obvious solution to John McCarthy, I.J. Good, or Marvin Minsky. Today's leading AI textbook, Artificial Intelligence: A Modern Approach - where you can learn all about A* search, by the way - discusses Friendly AI and AI risk for 3.5 pages but doesn't mention tool AI as an obvious solution. For Ray Kurzweil, the obvious solution is merging humans and AIs. For Jurgen Schmidhuber, the obvious solution is AIs that value a certain complicated definition of complexity in their sensory inputs. Ben Goertzel, J. Storrs Hall, and Bill Hibbard, among others, have all written about how silly Singinst is to pursue Friendly AI when the solution is obviously X, for various different X. Among current leading people working on serious AGI programs labeled as such, neither Demis Hassabis (VC-funded to the tune of several million dollars) nor Moshe Looks (head of AGI research at Google) nor Henry Markram (Blue Brain at IBM) think that the obvious answer is Tool AI. Vernor Vinge, Isaac Asimov, and any number of other SF writers with technical backgrounds who spent serious time thinking about these issues didn't converge on that solution.

In conclusion, I wouldn't write this series if I didn't think it could be useful. But as a fellow hairless monkey trying to do his best, it's hard for me to confidently distinguish ideas that are new/helpful from ideas that are old/obvious/misguided. I appreciate any help in disabusing me of false beliefs.

New to LessWrong?