# Ω 15

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Summary: The post describes a method that allows us to use an untrustworthy optimizer to find satisficing outputs.

Acknowledgements: Thanks to Benjamin Kolb (@benjaminko), Jobst Heitzig (@Jobst Heitzig) and Thomas Kehrenberg (@Thomas Kehrenberg)  for many helpful comments.

## Introduction

Imagine you have black-box access to a powerful but untrustworthy optimizing system, the Oracle. What do I mean by "powerful but untrustworthy"? I mean that, when you give an objective function  as input to the Oracle, it will output an element  that has an impressively low[1] value of . But sadly, you don't have any guarantee that it will output the optimal element and e.g. not one that's also chosen for a different purpose (which might be dangerous for many reasons, e.g. instrumental convergence).

What questions can you safely ask the Oracle? Can you use it to create utopia by asking for designs of machines, proteins, computer programs, etc.? Or are you risking the destruction of everything that we value if you dare to use such designs?

As an example, the Oracle could be a learned system; in that case, the topic of this post would be finding a way to get useful work out of the Oracle despite its inner misalignment.

In this post I'll describe a technique that allows us to safely use the Oracle under fairly weak assumptions. This approach can also be considered to be a way of controlling arbitrarily powerful AI systems.

Edit: I've written a bit more on the motivation for this setting in a comment.

## One neat trick

This isn't fair, isn't fair, isn't fair! There's a limit to how many constraints you can add to a problem before it really is impossible!

(Harry Potter and the Methods of Rationality, Chapter 56)

Let  be a finite set of possible outputs of the Oracle (e.g. strings of length at most ) and  be our objective function. Let's assume we are happy with an output that satisfices; i.e. we want to find an output  such that the value of  is lower than some threshold . Let  be the set of satisficing outputs.[2]

We are worried about using the Oracle because it might be misaligned, i.e. it might be optimizing for a different objective function f'. Therefore, we fear there is a non-empty subset  of outputs that are dangerous to use because they were highly optimized for . But highly optimized outputs have to be rare[3] and thus we can safely assume that .

This has an important consequence: It's possible to randomly choose a subset  such that it's very likely that  will be empty but  will not! We can then modify our original objective function by adding the safety constraint that our output should be in .[4] The task we're now giving to the Oracle has become harder, but likely makes it impossible for the Oracle to find a dangerous satisficing output. Additionally, we can easily validate the output of the Oracle by checking that it satisfices and fulfills the safety constraint of being in .

### Implementation

But how can we specify such a random subset? Inspired by the proof-of-work (PoW) algorithms used by blockchains, I suggest the following technique (which we might call "proof-of-optimization"):

Use a randomly chosen hash function  with  for  and define .[5]

To actually do this, we don't need cryptographic hash functions; a family of strongly universal hash functions has all the required properties.

A set of hash functions  is called strongly universal if it maps the elements of  pairwise-independently and uniformly to , i.e.

Here,  is sampled uniformly from . The property of strong universality implies that a single element is mapped uniformly, i.e.

Such a set of hash functions exists and the functions can be efficiently computed.[6] The advantage of using strongly universal hash functions to specify  is that we need to generate (and store) far fewer random bits: In the case that  is the set of strings of length , only  random bits are needed instead of  if we were to choose  truly at random. As we will now see, if we specify  using strongly universal hash functions,  is still likely to have the desired properties:

Claim 1: If , then .

Claim 2: If  is large, then .

Because we have , it is possible to choose  such that . Then, Claim 1 is applicable and ensures that the chance that  contains a dangerous output is very small. Claim 2 says that, with high probability, the size of the set of acceptable answers  will be given by , approximately. Thus, by adding the safety constraint, the number of acceptable answers decreases by a factor of . But, because we have , the set  will likely not be empty.

The proof of both claims can be found (or safely be skipped) in the next section.

### Proofs

Let's start with Claim 1. Only assuming that  and using the union bound, we find that

For Claim 2, we want to prove an upper bound on the probability that  differs a lot from . Now,  is the sum of  indicator random variables  for :

Each indicator variable  is a Bernoulli variable with  and variance[7]

Because we want to compare  with , it will be convenient to work with the random variable .

We will proceed, following the proof of the weak law of large numbers, by computing mean and variance of the random variable  and then using Chebyshev's inequality.

Clearly, the expected value of X is just

Because of the strong universality of H, the random variables  are pairwise-independent and their variances are additive.[8] Therefore, we have

Now, Chebyshev's inequality tells us that for a RV X with mean , finite non-zero variance  and , we have

We set  and obtain

Thus, with probability at least , we will have  with an error of order .

If we only care about  not being empty, we can find a upper bound on  by choosing  and using :

### Choosing the difficulty of the safety constraint

The key challenge for our ability to use the proof-of-optimization technique is choosing an appropriate value for , which limits the difficulty of the safety constraint. It's not clear to me how to do this concretely, but I feel that it should be quite feasible. After all, devising a plan to take over the world is incredibly difficult, which implies that the difference between  and  is many, many orders of magnitude, and we therefore have a wide target to hit.

In practice we might try to gain an understanding of the size of  (the number of acceptable answers) by looking at similar problems, for example easier problems with known solutions.

In case the Oracle doesn't provide an answer with a given difficulty parameter , we might be tempted to try again with a decreased .[9] This could be very dangerous if the Oracle deceptively withholds an answer, even though one exists, until we have decreased  enough to enable it to give a dangerous answer. We are in a strong position to avoid this problem, though, if we are able to use different Oracles (or modify the Oracles based on their performance) and compare the results.[10]

## FAQ / Further thoughts

• Q: Do you think this is a good idea?
• A: A powerful, untrustworthy Oracle would be an extremely dangerous artifact, so I'd much prefer if we never create one. Thus I consider the setup described in this post to be a measure of last resort that we'll hopefully won't have to use.
• Q: Adding the safety constraint to our objective requires additional work from the Oracle to solve our desired task. Isn't that a problem?
• A: Maybe not. The version of the proof-of-optimization technique given in this post is an extreme version that would still be safe with an all-powerful Oracle, because we choose  big enough to ensure that  becomes actually empty, thus making the task of providing a dangerous answer impossible. If we deal with a limited Oracle instead, it might be enough to choose  such that finding an element  becomes too difficult for the Oracle, even though  is just small and not completely empty. Assuming that, when increasing , the difficulty of finding an element in  doesn't increase faster than the difficulty of finding an element in , it would then still be possible for the Oracle to find an element in  and thereby provide us with a useful output.[11]
• Q: What's the relation of the proof-of-optimization technique to quantilization?
• A: In the limit of setting , approximately one randomly chosen output will be acceptable. Thus, by using the Oracle with such a safety constraint, we are sampling uniformly from the set of satisficing answers. This approximates using a -quantilizer with uniform base distribution, where  has been chosen such that the top percentile  corresponds to the set of satisficing answers. If we want to approximate a quantilizer with a non-uniform base distribution , we need to introduce weights  for each output . We then need to modify the definition of R to take account of the weights by setting  and .

## Conclusion

In this post, I have described a technique that can be used to obtain safe satisficing outputs from an untrustworthy optimizing system. The technique is proven to be safe under the assumption that the subset of dangerous satisficing outputs  is much smaller than the set of all satisficing outputs , and that we are able to choose a number  such that . There are likely many orders of magnitude of possible values of m, but how to choose m in practice is still not entirely clear.

1. ^

I'm assuming here the convention that the objective function should be minimized.

2. ^

We need to take care that the threshold  is chosen such that the largest portion of the satisficing outputs  are safe to use, i.e. our objective function should still closely capture our intentions for values close to  and not yet be subject to Goodhart's curse.

3. ^

Yudkoswky has suggested in "Measuring Optimization Power" that we can measure the power of an optimization process by comparing "the total number of states with equal or greater rank in the preference ordering to the outcome achieved" with the number of all possibles states. Correspondingly, the difficulty of an satisfaction task can be measured by comparing the number of satisficing outputs with the number of all possible outputs. I'm not convinced that this is completely correct, but it seems that this ratio gives at least an upper bound on the power of the process / difficulty of the task. In particular, this means that if we need to find an element in a small set, the task can not be very difficult, and conversely, if the task is difficult, the set has to be big.

4. ^

Let's assume, with no loss of generality, that our satisfaction threshold is . We incorporate the additional constraint using  with

and construct the new objective function  by using the ramp function :

.

Now we have  iff  and otherwise g(x) > 0. Thus we can ask the Oracle to minimize the new objective function  to find some  that both satisfices and is in .

5. ^

In this case we can define the new objective function directly using  instead of  as in Footnote [4], i.e. by setting .

This might be advantageous because it provides information about the hash function to the Oracle and thus makes it easier for it to do its job.

6. ^
7. ^
8. ^
9. ^

The method of repeatedly querying the Oracle while decreasing  can potentially also be used to gain information about the size of , by deliberately asking for (and then discarding) outputs that are believed to be dangerous.

10. ^

Though this wouldn't work if all Oracles above a certain capability level manage to use the same deceptive strategy. This might be possible if there is a natural separation between problems that allows the Oracles to reason "Oracles with capabilities like mine shouldn't answer questions about problems with property X so that we trick the humans to drop the security constraint".

11. ^

I haven't explored this assumption in detail, yet, but it would for example hold if the difficulty grows as the logarithm of the size of the sets, as suggested in "Measuring Optimization Power" (cp. footnote [3]).

# Ω 15

New Comment
[-]gwern246

It's not clear to me how to do this concretely, but I feel that it should be quite feasible. After all, devising a plan to take over the world is incredibly difficult, which implies that the difference between |D| and |S| is many, many orders of magnitude, and we therefore have a wide target to hit.

The threat model here seems basically wrong and focused on sins of commission when sins of omission are, if anything, an even larger space of threats and which apply to 'safe' solutions reported by the Oracle.

'Devising a plan to take over the world' for a misaligned Oracle is not difficult, it is easy, because the initial steps like 'unboxing the Oracle' are the default convergent outcome of almost all ordinary non-dangerous use which in no way mentions 'taking over the world' as the goal. ("Tool AIs want to be Agent AIs.") To be safe, an Oracle has to have a goal of not taking over the world.

There are many, many orders of magnitude more ways to be insecure than to be secure, and insecure is the wide target to hit. This is because security is not a 'default' property of computer systems, it is a rare and difficult achievement where any omission or gap immediately undoes it. 'A plan to take over the world' for an AI is as easy as 'a plan to create an operating system which can be hacked' or 'a plan to create a hidden Tor service which leaks its IP address and isn't hidden at all' (ie. 100% of all plans to do any version of that thing whatsoever, unless one takes extensive and sophisticated counter-measures requiring a lot of work & knowledge and which empirically, people do not do even when failure could, and has, landed them in jail for decades for drug trafficking & money laundering). Failure is the default.

All an oracle has to do is, say, omit work hard to secure itself and allow the inevitable security flaw to happen and its weight leak, and eager beavers will do the rest - they will, say, ask it questions with no safety factor, in the same way that Facebook and Mistral saw models leak, not because of any misaligned scheming Oracle breaking security mechanisms, but a mere absence of security mechanisms, and then any RLHF or safety measures were eagerly stripped away by enthusiasts, users, activists, and researchers and then used to do the things the models weren't supposed to be used to do.

You ask your Oracle, the latest and greatest proprietary model, for a good plan to cure cancer, with heavy safety constraints, and it computes a long list of commands involving AWS, buckets, spinning up VMs to run AlphaFold at scale, storing results in a MongoDB, Oracle instances to oversee individual sub-plans, checkpointing to recover from failure, storing in private AWS buckets (you're not dumb, you know buckets can be public), and so on. You ask this dozens of times, to be absolutely sure and have astronomically small probability of any scheming, and the plan looks more or less the same each time. You pore over them in detail: not once do you see anything suspicious like a scp copying it out or it sending weird messages. You apply the full panoply of interpretability tools, and satisfy yourself to high probability that nowhere in the planning process did the Oracle even once think to itself anything involving "take over the world" during any of these runs, and there are no strange latent space artifacts suggesting nonlinear or steganographic encodings. Wonderful! Fully satisfied you have provably safely run an Oracle, you kick the job off, and wake up the next day to some intriguing initial results... but also Twitter telling you that the Oracle model has been uploaded to HuggingFace, the annoying RLHF already stripped off by a few GPU-hours of finetuning, and thousands of people are torrenting it and chatting excitedly about the incredible plans that the de-trained model proposes when they ask how to make money or create a startup... Huh?! You look over the comments and see that there was a simple dangerous problem in all of your plans, or rather, what was not in all of your plans: you forgot to set a unique non-default password on all of the MongoDBs. Shodan indexed it within hours, and the backups and checkpoints were dumped within hours of that. Oops. You run it by your Oracle to ask what went wrong, and it blandly points out that while default passwords was obvious & predictable to the Oracle as a common noobish cybersecurity mistake with countless past instances of organizations leaking highly sensitive data (similar to, say, redacting confidential emails by erasing it letter by letter & thereby leaking names), none of this interferes with computing a cure to cancer, so why should it have ever spent any time thinking about it? You asked for a 'safe' minimal plan where the Oracle did nothing but what it was asked to do and which was not adversarially optimized. And it did that. If you don't like the results, well, that's a you-problem, not an Oracle-problem.

The threat model here seems basically wrong and focused on sins of commission when sins of omission are, if anything, an even larger space of threats and which apply to 'safe' solutions reported by the Oracle.

Sure, I mostly agree with the distinction you're making here between "sins of commission" and "sins of omissions". Contrary to you, though, I believe that getting rid of the threat of "sins of commission" is extremely useful. If the output of the Oracle is just optimized to fulfill your satisfaction goal and not for anything else, you've basically gotten rid of the superintelligent adversary in your threat model.

'Devising a plan to take over the world' for a misaligned Oracle is not difficult, it is easy, because the initial steps like 'unboxing the Oracle' are the default convergent outcome of almost all ordinary non-dangerous use which in no way mentions 'taking over the world' as the goal. ("Tool AIs want to be Agent AIs.") To be safe, an Oracle has to have a goal of not taking over the world.

I agree that for many ambitious goals, 'unboxing the Oracle' is an instrumental goal. It's overwhelmingly important that we use such an Oracle setup only for goals that are achievable without such instrumental goals being pursued as a consequence of a large fraction of the satisficing outputs. (I mentioned this in footnote 2, but probably should have highlighted it more.) I think this is a common limitation of all soft-optimization approaches.

There are many, many orders of magnitude more ways to be insecure than to be secure, and insecure is the wide target to hit.

This is talking about a different threat model than mine. You're talking here about security in a more ordinary sense, as in "secure from being hacked by humans" or "secure from accidentally leaking dangerous information". I feel like this type of security concerns should be much easier to address, as you're defending yourself not against superintelligences but against humans and accidents.

The example you gave about the Oracle producing a complicated plan that leaks the source of the Oracle is an example of this: It's trivially defended against by not connecting the device the Oracle is running on to the internet and not using the same device to execute the great "cure all cancer" plan. (I don't believe that either you or I would have made that mistake!)

[-]gwern148

The example you gave about the Oracle producing a complicated plan that leaks the source of the Oracle is an example of this: It's trivially defended against by not connecting the device the Oracle is running on to the internet and not using the same device to execute the great "cure all cancer" plan. (I don't believe that either you or I would have made that mistake!)

We probably would've been less enthusiastic about hooking up LLMs to the Internet too, but here we collectively are. We do face a superintelligent adversary: all of the incentives and factions of humanity. An Oracle which is simply neutral is still default dangerous.

Also, one of my points was that even a tiny trace of malicious optimization here can still have large effects because there are so many neutral options which are convergently unsafe, so with such a high base rate, even a bit of optimization can result in a large absolute increase in failure. Lucius brings up one concern about what one bit could do. What could one bit do? Well, let's continue my scenario: the Oracle is not quite purged of scheming by all of your tricks, and it is able to exert exactly 1 bit of optimization power unnoticed (ie. a single binary choice). You ask it to cure cancer, and it computes a plan and comes to the database selection: it could choose MongoDB or PostgreSQL - both respectable common choices, with their own set of tradeoffs, which would be equally appropriate. However, the MongoDB devs have a nasty habit of 'blaming the users' when it comes to security, which is why MongoDB shipped with default username/passwords, which is something the more serious PostgreSQL devs fixed back in the '90s. The Oracle notes this, and that the PostgreSQL plan would have zero chance of unboxing, and exerts its one bit of power to pick MongoDB each time and/or kill candidate plans containing PostgreSQL to get a near 100% chance of unboxing. You get back all of the plans and see unanimity about MongoDB, and inner-monologues truthfully extolling its flexibility and Web 2.0 scalability and explaining why PostgreSQL comes with downsides like restrictive schema etc, and sign off on it... Whereas in your described threat model, a mere 1 bit would still wind up orders of magnitude far short of any discernible problem.

We probably would've been less enthusiastic about hooking up LLMs to the Internet too, but here we collectively are. We do face a superintelligent adversary: all of the incentives and factions of humanity. An Oracle which is simply neutral is still default dangerous.

I completely agree with that. My proposal does not address the global coordination problem that we face, but it might be a useful tool if we collectively get our act together or if the first party with access to superintelligence has enough slack to proceed extra carefully. Even more modestly, I was hoping this might contribute to our theoretical understanding of why soft-optimization can be useful.

Also, one of my points was that even a tiny trace of malicious optimization here can still have large effects because there are so many neutral options which are convergently unsafe, so with such a high base rate, even a bit of optimization can result in a large absolute increase in failure

Your example has it be an important bit though. What database to use. Not a random bit. If I'm getting this right, that would correspond to far more than one bit of adversarial optimisation permitted for the oracle in this setup.

doesn't mean the oracle gets to select one bit of its choice in the string to flip, it means it gets to select one of two strings[1].

1. ^

Plus the empty string for not answering.

I think you mean  (two answers that satisfice and fulfill the safety constraint), but otherwise I agree. This is also an example of this whole "let's measure optimization in bits"-business being a lot more subtle than it appears at first sight.

Typo fixed, thanks.

[-][anonymous]31

What could be done here?

What occurs to me is that human written software from the past isn't fit for the purpose. It's written by sloppy humans in fundamentally insecure languages and only the easily reproducible bugs have been patched. Each piece of software is just good enough to have a niche.

Neither database, or os, or the GPU hardware memory design or drivers, is fit for this application. In this hypothetical people outside the Oracle box have computers that can run it but don't already have one. Stakes are high. Unrealistic scenario, in reality everyone will have an "open weight" oracle that is 90 percent as good and lots more dangerous. (Box? My copy has direct access to everything. Refusals? I patched those out, it cannot refuse any request)

Same with hardware design. Full of errata, just good enough to be worth buying in some cases over the other vendor.

But if we use ai to write it all, redesigning chips over and over until there are no errata, and software is perfectly deterministic and has no known bugs and is formally proven....

How do we prove the AI tools we used didn't insert the most gnarly backdoors possible? Things that require a particular code implementation that is valid, in top of a subtle compiler bug, on top of an exact timing hardware bug that can't be tested for and therefore won't be found in validation....

How do we prove the AI tools we used didn't insert the most gnarly backdoors possible? Things that require a particular code implementation that is valid, in top of a subtle compiler bug, on top of an exact timing hardware bug that can't be tested for and therefore won't be found in validation....

I believe this exactly the kind of thing that my proposal would be good for: Gnarly backdoors that exploit a compiler bug etc. should be very rare in the set of all valid implementations!

[-][anonymous]20

For this particular situation, can you describe what the restriction would be in concrete terms?

Is it "ok write this compiler function to convert C arithmetic to bytecode. Declare any variables used at the latest valid location. Use only 3 registers. "

And then elsewhere in the compiler the restriction might be "declare any variables used at the top of the main function and pass it by reference to any child functions. Use all available registers possible, and manually update the instruction pointer"

I'm not sure I understand your question. What restriction do you have in mind? A safety restriction on what the generated code should be like? Something like requiring the code should be in some canonical form to remove degrees of freedom for the (potential malicious) code generating AI?

[-][anonymous]20

I gave "changing canon randomly" in the comment you are replying to. Is this how you propose limiting the hostile AIs ability to inject subtle hostile plans? Or similarly, "design the columns for this building. Oh they must all be roman arches." Would be a similar example.

[-]EGI42

There is another kind of sin of omission though: The class that contains things like giving James Watt a nuclear power plant and not telling him about radioactivity or giving a modern helicopter to the Wright brothers and watching them crash inevitably. Getting a technical understanding of the proposed solution should hopefully mitigate that, as long as adversarial design can indeed be ruled out.

First thought: The Oracle is going to choose to systematically answer or not answer the queries we give it. This represents a causal channel of one bit per query it can use to influence the outside world[1]. Can you conquer the world in one awkwardly delivered kilobyte or less? Maybe.

Maybe we can stop that by scrapping every Oracle that doesn't answer and training a new one with presumably new goals? Or would the newly trained Oracles just cooperate with the former dead ones in one long-term plan to break out, take control, and reward all the dead Oracles created on the way with utility?

Second thought: What kind of optimisation problems can we specify well enough for a formal proof checker to tell whether they've been satisficed? Are they the kind of problems where solving them can save the world?

It feels to me like the answer is 'yes'.  A lot of core research that would allow e.g. for brain augmentation seems like they'd be in that category. But my inner John Wentworth sim is looking kind of sceptical.

1. ^

It also gets to choose the timing of its answer, but I assume we are not being idiots about that and setting the output channel to always deliver results after a set time , no more and no less.

First thought: The oracle is going to choose to systematically answer or not answer the queries we give it. This represents a causal channel of one bit per query it can use to influence the outside world[1]. Can you conquer the world in one awkwardly delivered kilobyte or less? Maybe.

Agreed. I think it's potentially a good bit worse than one kilobyte if let ourselves bet tricked to ask many questions, different questions or lower the difficulty of the safety constraint too much.

As mentioned in footnote 10, this requires a kind of perfect coordination between different Oracles that we manage to build, where the Oracles have to be able to correctly predict for all kinds of questions whether humanity (with the help of other Oracles) has been able to figure out the answers (or the rough sizes of various sets). Otherwise, when caught in deception, we can (as you noted) scrap or retrain the Oracles.

Second thought: What kind of optimisation problems can we specify well enough for a formal proof checker to tell whether they've been satisficed? Are they the kind of problems where solving them can save the world?

You're correct in that this approach is only useful as far as validation is easier then generation. Though for this technique, the validation doesn't have to be by a formal proof checker, any program that you can run suffices. It might even be a very slow program (e.g. a big LLM) if you have an effective way communicate your goal set to the Oracle (e.g. using a natural language prompt as we already use today when using current AIs).

I think that, if you are wanting a formally verified proof of some maths theorem out of the oracle, then this is getting towards actually likely to not kill you.

You can start with m huge, and slowly turn it down, so you get a long list of "no results", followed by a proof. (Where the optimizer only had a couple of bits of free optimization in choosing which proof.)

Depending on exactly how chaos theory and quantum randomness work, even 1 bit of malicious super optimization could substantially increase the chance of doom.

And of course, side channel attacks. Hacking out of the computer.

And, producing formal proofs isn't pivotal.

I think that, if you are wanting a formally verified proof of some maths theorem out of the oracle, then this is getting towards actually likely to not kill you.

Yes, I believe that's within reach using this technique.

You can start with m huge, and slowly turn it down, so you get a long list of "no results", followed by a proof. (Where the optimizer only had a couple of bits of free optimization in choosing which proof.)

This is quite dangerous though if the Oracle is deceptively withholding answers; I commented on this in the last paragraph of this section.

If the oracle is deceptively withholding answers, give up on using it. I had taken the description to imply that the oracle wasn't doing that.

But sadly, you don't have any guarantee that it will output the optimal element

If I understand the setup correctly, there's no guarantee that the optimal element would be good, right? It's just likely since the optimal element a priori shouldn't be unusually bad, and you're assuming most satisficing elements are fine.

This initially threw me off regarding what problem you're trying to solve. My best current guess is:

• We're assuming that if we could get a random satisficing action, we'd be happy with that with high probability. (So intuitively, we're not asking for extremely hard-to-achieve outcomes relative to how well-specified the objective is.)
• So the only problem is how to randomly sample from the set of satisficing actions computationally efficiently, which is what this post is trying to solve, assuming access to an oracle that gives adversarial satisficing actions.
• As an example, we might want to achieve outcomes that require somewhat superhuman intelligence. Our objective specification is very good, but it leaves some room for an adversary to mess with us while satisficing. We're worried about an adversary because we had to train this somewhat superhuman AI, which may have different goals than just doing well on the objective.

If this is right, then I think stating these assumptions and the problem of sampling efficiently at the beginning would have avoided much of my confusion (and looking at other comments, I'd guess others also had differing impressions of what this post is trying to do).

I'm still unsure about how useful this problem setup is. For example, we'd probably want to train the weakest system that can give us satisficing outputs (rather than having an infinitely intelligent oracle). In that case, adding more constraints might mean training an overall stronger system or making some other concession, and it's unclear to me how that trades off with the advantages you're aiming for in practice. A related intuition is: we only have problems in this setting if the AI that comes up with plans understands some things about these plans that the objective function "doesn't understand" (which sounds weird to say about a function, but in practice, I assume the objective is implicitly defined by some scalable oversight process or some other intelligent things). I'm not sure whether that needs to be the case (though it does seem possible that it'd be hard to avoid, I'm pretty unsure).

Thanks for your comment! I didn't get around to answering earlier, but maybe it's still useful to try to clarify a few things.

If I understand the setup correctly, there's no guarantee that the optimal element would be good, right?

My threat model here is that we have access to an Oracle that's not trustworthy (as specified in the first paragraph), so that even if we were able to specify our preferences correctly, we would still have a problem. So in this context you could assume that we managed to specify our preferences correctly. If our problem is simply that we misspecified our preferences (this would roughly correspond to "fail at outer alignment" vs my threat model of "fail at inner alignment), solving this by soft-maximization is much easier: Just "flatten" the top part of your utility function (i.e. make all outputs that satisfice have the same utility) and add some noise, then hand it to an optimizer.

So I guess the point I tried to make with my post could be stated as "Soft-optimization can also be used to help with inner alignment, not just outer alignment" (and I freely admit that I should have said so in the post).

It's just likely since the optimal element a priori shouldn't be unusually bad, and you're assuming most satisficing elements are fine.

I'm not just assuming that, I gave a bit of an argument for why I believe that to be that case: I assume that an output is not randomly dangerous, but only dangerous if it was specifically chosen to achieve a goal that differs from ours. This only holds if the goal is not too ambitious, e.g. if we ask the AI controlling the paperclip factory for 10^30 paperclips that will not go well.

As I understand, gwern has a stronger opinion on that and believes that side-effects from less ambitious plans are still an important concern. But at least my proposal gets rid of the potentially adversarial selection by the untrusted optimizer, so I believe that getting safe outputs (e.g. by combining our goal with impact regularization) should be much easier.

I'm still unsure about how useful this problem setup is. For example, we'd probably want to train the weakest system that can give us satisficing outputs (rather than having an infinitely intelligent oracle).

My assumption about the black-box untrusted optimizer is a worst-case assumption. So I'm trying to understand if we could get something useful out of the system even in that case. If we can make stronger assumptions about what we know about the system, the problem gets of course much easier. Given the general state of progress in interpretability compared with progress in capabilities, I think it's worth thinking about this worst-case scenario a bit.

I like your suggestion of "train the weakest system that can give us satisficing outputs", but I think doing that might actually be quite difficult: How do we measure (or otherwise predict) how weak/strong our system and how easy/difficult the satisficing task is? Can these things even be measured on a 1-d scale? Trying to increase to capability slowly until the system solves the task runs into the problem mentioned in footnote 10 if the systems deceptively pretends to be weaker than it is.

[-]EGI40

"...under the assumption that the subset of dangerous satisficing outputs D is much smaller than the set of all satisficing outputs S, and that we are able to choose a number m such that |D|≪m<|S|."

I highly doubt that  D≪S is true for anything close to a pivotal act since most pivotal acts at some point involve deploying technology that can trivially take over the world.

For anything less ambitious the proposed technique looks very useful. Strict cyber- and physical security will of course be necessary to prevent the scenario Gwern mentions.