*Note: weird stuff, very informal.*

Suppose I search for an algorithm that has made good predictions in the past, and use that algorithm to make predictions in the future.

I may get a "daemon," a consequentialist who happens to be motivated to make good predictions (perhaps because it has realized that only good predictors survive). Under different conditions, the daemon may no longer be motivated to predict well, and may instead make "predictions" that help it achieve its goals at my expense.

I don't know whether this is a real problem or not. But from a theoretical perspective, not knowing is already concerning--I'm trying to find a strong argument that we've solved alignment, not just something that seems to work in practice.

I am pretty convinced that daemons are a real problem for Solomonoff induction. Intuitively, the problem is caused by "too much compute." I suspect that daemons are also a problem for some more realistic learning procedures (like human evolution), though in a different shape. I think that this problem can probably be patched, but that's one of the major open questions for the feasibility of prosaic AGI alignment.

I suspect that daemons *aren't* a problem if we exclusively select for computational efficiency. That is, I suspect that **the fastest way to solve any particular problem doesn't involve daemons**.

I don't think this question has much intrinsic importance, because almost all realistic learning procedures involve a strong simplicity prior (e.g. weight sharing in neural networks).

But I do think this question has deep similarities to more important problems, and that answering this question will involve developing useful conceptual machinery. Because we have an unusually strong intuitive handle on the problem, I think it's a good thing to think about.

## Problem statement and intuition

Can the smallest boolean circuit that solves a problem be a daemon? For example, can the smallest circuit that predicts my behavior (at some level of accuracy) be a daemon?
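To make "smallest circuit" concrete, here is a brute-force sketch (purely illustrative, and only feasible for tiny inputs): enumerate straight-line NAND programs in order of gate count and return the first one that matches a target truth table.

```python
from itertools import product

def run(program, bits):
    """Evaluate a straight-line NAND program; each gate reads two earlier wires."""
    wires = list(bits)
    for i, j in program:
        wires.append(1 - (wires[i] & wires[j]))
    return wires[-1]

def smallest_circuit(target, k, max_gates=5):
    """Find a minimal NAND program over k input wires whose last wire matches
    `target`, a tuple of outputs for the k-bit inputs in lexicographic order."""
    inputs = list(product((0, 1), repeat=k))
    for gates in range(1, max_gates + 1):
        per_gate = [list(product(range(k + g), repeat=2)) for g in range(gates)]
        for program in product(*per_gate):
            if all(run(program, x) == y for x, y in zip(inputs, target)):
                return program
    return None

# AND needs two NANDs: NAND(a, b), then NAND of that wire with itself.
print(smallest_circuit((0, 0, 0, 1), 2))  # ((0, 1), (2, 2))
```

The question in this post is about what the programs this search returns can look like internally, not about the search itself.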

Intuitively, if we have a daemon that is instrumentally or incidentally motivated to solve my problem, then there is some smaller circuit that solves the problem equally well but skips the instrumental reasoning. If my daemon is doing some complex reasoning to answer the question "Should I predict well?" we could just skip straight to the answer "yes." This both makes the circuit smaller, and prevents the circuit from ever deciding not to predict well.
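The "skip straight to the answer" move can be sketched as constant folding (a toy model; the expression-tree circuit representation is invented for illustration): replace the daemon's "should I predict well?" subcircuit with the constant 1 and propagate, leaving a strictly smaller circuit with identical behavior.

```python
# Toy illustration: a circuit as an expression tree, where a daemon gates its
# honest answer on a nontrivial "should I cooperate?" subcircuit. Hardwiring
# that subcircuit to 1 and constant-folding yields a smaller circuit.

def size(c):
    """Number of nodes in an expression-tree circuit."""
    if c[0] in ("const", "input"):
        return 1
    return 1 + sum(size(child) for child in c[1:])

def fold(c):
    """Propagate constants: AND with 1 disappears, AND with 0 is 0."""
    if c[0] in ("const", "input"):
        return c
    children = [fold(ch) for ch in c[1:]]
    if c[0] == "and":
        if any(ch == ("const", 0) for ch in children):
            return ("const", 0)
        children = [ch for ch in children if ch != ("const", 1)]
        if not children:
            return ("const", 1)
        if len(children) == 1:
            return children[0]
        return ("and", *children)
    return (c[0], *children)

# The daemon's deliberation, stubbed as some subcircuit over inputs x0..x2.
deliberation = ("and", ("input", "x0"), ("and", ("input", "x1"), ("input", "x2")))
honest_answer = ("input", "x3")
daemon = ("and", deliberation, honest_answer)

# Skip the deliberation: hardwire "yes" and fold away the dead structure.
patched = ("and", ("const", 1), honest_answer)
print(size(daemon), size(fold(patched)))  # the folded circuit is strictly smaller
```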

A different perspective on a similar intuition: the daemon is doing some actual cognitive work to solve the problem. Since that computation is being done inside the daemon, it is embedded within the daemon as a smaller circuit. Jessica explores this intuition a bit here. Here we are considering an easy version of the problem, since by taking the smallest circuit we are effectively quantifying over all possible ways of extracting logical information from the daemon.

Instead of showing that minimal circuits can't be daemons, we might end up concluding that they can be. That would be even more interesting.

Another possible outcome is giving a strong argument that captures our intuitions/concerns about daemons, and which clearly doesn't apply to the minimal circuit that solves a problem. In this case we couldn't prove anything positive about the minimal circuit, but we would have "screened off" the possible cause for concern.

## Difficulties

The first and most serious difficulty is understanding what we are talking about.

I don't expect to get total clarity on concepts like "daemon" or "optimization" or "generic problem," but we need to have a better grip than we do right now. I expect that we'll develop better concepts in the course of solving the problem, rather than as a precondition for solving the problem (in general I think "define things so that you can prove the theorem" is often the right strategy).

A second difficulty is that the different parts of the computation can be tangled up in an extremely complex way. In an extreme case, the daemon may be cryptographically obfuscated.

We want to show that given any daemon, there is a smaller circuit that solves the problem. The most natural approach is showing how to construct a smaller circuit, given a daemon. But if the daemon is obfuscated, there is no efficient procedure which takes the daemon circuit as input and produces a smaller circuit that still solves the problem.

So we can't find any efficient constructive argument. That rules out most of the obvious strategies.

For human evolution, the problem is too little compute rather than too much, right? Meaning if evolution had just given humans the goal of "maximize inclusive fitness", humans wouldn't have been able to find a good policy for achieving it due to lack of computing power, so instead we got a bunch of goals that would have been subgoals of "maximize inclusive fitness" in our ancestral environment (like eating tasty food and making friends/allies).

Suppose we wanted to make a minimal circuit that would do as well as humans in maximizing inclusive fitness in some range of environments. Wouldn't it make sense to also "help it out" by having it directly optimize for useful subgoals in those environments rather than having it do a big backchain from "maximize inclusive fitness"? And then it would be a daemon because it would keep optimizing for those subgoals even if you moved it outside of those environments?

I agree with this basic point and it seems important, thanks.

It seems like there are two qualitatively different concerns when trying to optimize for X, that probably need to be distinguished / thought about separately:

- Daemons that pursue some other goal Y which merely happens to lead to good performance at X on the training distribution ("upstream" daemons).
- Daemons that deliberately perform well at X as an instrumental strategy in service of their real goals ("downstream" daemons).

Obviously the real situation can be a complicated mixture, and this is not a clean distinction even apart from that.

The arguments in the OP only plausibly apply to downstream daemons. I think they make the most sense in terms of making induction benign.

I've normally thought of upstream daemons as much more likely, but much easier to deal with:

- In the case of upstream daemons, the most natural response is to probe a broad range of situations similar to the ones you care about. This requires a lot more understanding, but it seems quite realistic to try to avoid the situation where (Y --> empowering humans) on the training distribution but (Y --> disempo...

I think it makes sense to classify daemons into two types the way you do. Interestingly MIRI seems to be a lot more concerned about what you call upstream daemons. The Arbital page you linked to only talks about upstream daemons and the Google Doc "MIRI notes on alignment difficulty" seems to be mostly about that too. (What is it with people keeping important AI safety documents in private Google Docs these days, with no apparent plans of publication? Do you know any others that I'm not already shared on, BTW?)

I don't recall you writing about this before. How do you see this working? I guess with LBO you could train a complete "core for reasoning" and then amplify that to keep retraining the higher level agents on broader and broader distributions, but how would it work with HBO, where the human overseer's time becomes increasingly scarce/costly relative to the AI's as AIs get faster? I'm also pretty concerned about the overseer running into their own lack of robustness against distributional shifts if this is what you're planning.

I think people (including at MIRI) normally describe daemons as emerging from upstream optimization, but then describe them as becoming downstream daemons as they improve. Without the second step, it seems hard to be so pessimistic about the "normal" intervention of "test in a wider range of cases."

At time 0 the human trains the AI to operate at time 1. At time T>>0 the AI trains itself to operate at time T+1, at some point the human no longer needs to be involved---if the AI is actually aligned on inputs that it encounters at time T, then it has a hope of remaining aligned on inputs it encounters at time T+1.

I spoke a bit too glibly though, I think there are lots of possible approaches for dealing with this problem, each of them slightly increases my optimism, thi...

I propose a counterexample. Suppose we are playing a series of games with another agent. To play effectively, we train a circuit to predict the opponent's moves. At this point the circuit already contains an adversarial agent. However, one could object that it's unfair: we asked for an adversarial agent so we got an adversarial agent (nevertheless for AI alignment it's still a problem). To remove the objection, let's make some further assumptions. The training is done on some set of games, but distributional shift happens and later games are different. The opponent knows this, so on the training games it simulates a different agent. Specifically, it simulates an agent who searches for a strategy s.t. the best response to this strategy has the strongest counter-response. The minimal circuit hence contains the same agent. On the training data we win, but on the shifted distribution the daemon deceives us and we lose.

I consider the argument in this post a reasonably convincing negative answer to this question---a minimal circuit may nevertheless end up doing learning internally and thereby generate deceptive learned optimizers.

This suggests a second informal clarification of the problem (in addition to Wei Dai's comment): can the search for minimal circuits itself be responsible for generating deceptive behavior? Or is it always the case that something else was the offender and the search for minimal circuits is an innocent bystander?

If the search for minimal circuits was itself safe then there's still some hope for solutions that avoid deception by somehow penalizing computational cost. Namely: if that technique is competitive, then we can try to provide a loss that encourages any learned optimization to use the same techniques.

(I've previously thought about this mostly in the high-stakes setting, but I'm now thinking about it in the context of incentivizing honest answers in the low-stakes setting. The following story will focus on the low-stakes setting since I don't want to introduce extra ingredients to handle high stakes.)

To illustrate, suppose there was a trick where you can divide your...

I curated this post partly for the OP, and partly for the subsequent discussion.

Something valuable I think LessWrong can be is a place where people pose well formed questions on important problems, and then make progress on them. I don't have the impression that any clear-cut breakthroughs happened here, but it does look like incremental, "chip away at the edges" progress was made.

My current take is that the knowledge-building process has several phases, that can reinforce each other in a non-linear fashion:

I think it's important for LW to showcase progress on each of those stages. By default, the tendency is to only publish work that's reached the final stages, or that feels like it makes some kind of coherent point. This post and comments seemed to be doing something real, even if at a middle stage, and I want it to be clear that this is something LW strives to reward.

I'm having trouble thinking about what it would mean for a circuit to contain daemons such that we could hope for a proof. It would be nice if we could find a simple such definition, but it seems hard to make this intuition precise.

For example, we might say that a circuit contains daemons if it displays more optimization than necessary to solve a problem. Minimal circuits could have daemons under this definition, though. Suppose that some function f describes the behaviour of some powerful agent, a function ~f is like f with noise added, and our problem is to predict the function ~f sufficiently well. Then the smallest circuit that does well won't bother to memorize a bunch of noise, so it will pursue the goals of the agent described by f more efficiently than ~f does, and thus more efficiently than necessary.
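A toy numerical version of this intuition (the parity rule and noise rate are invented): if ~f flips a simple rule f on a few inputs and we only demand 90% accuracy, the rule alone already clears the bar, while matching ~f exactly would additionally require memorizing the exceptions.

```python
import random

# Toy sketch: f is a simple parity rule on 8-bit inputs; ~f flips f on a few
# randomly chosen inputs. A predictor that just computes f already clears a
# 90% accuracy bar, whereas reproducing ~f exactly would need the rule *plus*
# a memorized list of the flipped inputs, i.e. a strictly larger description.
random.seed(0)
k = 8
inputs = range(2 ** k)
f = {x: bin(x).count("1") % 2 for x in inputs}   # the clean rule
flipped = set(random.sample(list(inputs), 10))   # ~4% noise
noisy_f = {x: f[x] ^ (x in flipped) for x in inputs}

accuracy = sum(f[x] == noisy_f[x] for x in inputs) / 2 ** k
print(accuracy)  # the clean rule predicts ~f well above the 90% bar
```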

This post grounds a key question in safety in a relatively simple way. It led to the useful distinction between upstream and downstream daemons, which I think is necessary to make conceptual progress on understanding when and how daemons will arise.

Yes. Consider a predictor that predicts what Paul will say if given an input and n time-steps to think about it, where n can be any integer up to some bound k. One possible circuit would have k single-step simulators chained together, plus a mux which takes the output of the nth single-step simulator. But a circuit which consisted of k single-step simulators and took the output of the ...

Pretty minimal in and of itself, but has prompted plenty of interesting discussion. Operationally that suggests to me that posts like this should be encouraged, but not by putting them into "best of" compilations.

This post formulated a concrete open problem about what are now called 'inner optimisers'. For me, it added 'surface area' to the concept of inner optimisers in a way that I think was healthy and important for their study. It also spurred research that resulted in this post giving a promising framework for a negative answer.

I think it's worth distinguishing between "smallest" and "fastest" circuits.

A note on smallest.

1) Consider a travelling salesman problem and a small program that brute-forces the solution to it. If the "daemon" wants to make the travelling salesman visit a particular city first, then it would simply order the solution space to consider it first. This has no guarantee of working, but the daemon would get what it wants some of the time. More generally, if there is a class of solutions we are indifferent to, but daemons ha...
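To make the ordering trick concrete (cities and distances invented): a brute-force solver whose enumeration order prefers tours visiting a particular city first will, whenever several optimal tours tie, return the one the "daemon" prefers.

```python
from itertools import permutations

def tour_length(tour, dist):
    """Length of a closed tour over the given distance table."""
    return sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))

def brute_force_tsp(cities, dist, preferred=None):
    orderings = permutations(cities)
    if preferred is not None:
        # Bias the enumeration: tours starting at the preferred city come first.
        orderings = sorted(orderings, key=lambda t: t[0] != preferred)
    best = None
    for tour in orderings:
        length = tour_length(list(tour), dist)
        if best is None or length < best[1]:  # strict '<': the first optimum wins
            best = (tour, length)
    return best

# Fully symmetric distances: every 3-city tour has the same length, so the
# solver is "indifferent" among optima and the enumeration bias decides.
cities = ["A", "B", "C"]
dist = {a: {b: 1 for b in cities} for a in cities}
tour, length = brute_force_tsp(cities, dist, preferred="C")
print(tour, length)  # an optimal tour that visits C first
```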

Don't know if this counts as a 'daemon', but here's one scenario where a minimal circuit could plausibly exhibit optimization we don't want.

Say we are trying to build a model of some complex environment containing agents, e.g. a bunch of humans in a room. The fastest circuit that predicts this environment will almost certainly devote more computational resources to certain parts of the environment, in particular the agents, and will try to skimp as much as possible on less relevant parts such as chairs, desks etc. This could lead t...

I'm confused about the definition of

> the set of boolean circuits

in which we're looking at the smallest circuit. Is that set defined in terms of a set of inputs X and a boolean utility function u, and is that set all the boolean circuits that for each input x∈X yield an output o that fulfills u(o)=1?

I think some clarity for "minimal", "optimization", "hard", and "different conditions" would help.

I'll take your problem "definition" using a distribution D, a reward function R, some circuit C, and an expectation E over R(x, C(x)).

Do we want the minimal C that maximizes E? Or do we want the minimal C that satisfies E > 0? These are not necessarily equivalent, because max(E) might be non-computable while E > 0 is not. A simple example would be: R(x, C(x)) is the number of 1s that the Turing Machine wi...

This seems like the sort of problem that can be tackled more efficiently in the context of an actual AGI design. I don't see "daemons" as a problem per se; instead I see a heuristic for finding potential problems.

Consider something like code injection. There is no deep theory of code injection, at least not that I know of. It just describes a particular cluster of software vulnerabilities. You might create best practices to prevent particular types of code injection, but a software stack which claims to be "immune to code injection" sou...

This may be relevant:

Imagine a computational task that breaks up into solving many instances of problems A and B. Each instance reduces to at most n instances of problem A and at most m instances of problem B. However, these two maxima are never achieved both at once: The sum of the number of instances of A and instances of B is bounded above by some r<n+m. One way to compute this with a circuit is to include n copies of a circuit C0 for computing problem A and m copies of a circuit C1 for computing problem B. Another approach for solving the task is to...
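The comment is cut off, but one natural completion (my assumption) is to lay down r interchangeable blocks, each able to compute either A or B. A back-of-the-envelope gate count (all numbers invented) makes the comparison concrete; note that since r is at least max(n, m) here, the shared layout only wins if one combined block is cheaper than a copy of C0 plus a copy of C1.

```python
# All numbers invented, just to make the gate-count comparison concrete.
g0, g1 = 50, 50   # gates in dedicated circuits C0 (problem A) and C1 (problem B)
combined = 60     # gates in one block that can compute either A or B,
                  # assuming A and B share most of their computation
n, m, r = 10, 10, 12   # instance bounds, with r < n + m

dedicated = n * g0 + m * g1   # n copies of C0 plus m copies of C1
shared = r * combined         # r reconfigurable blocks
print(dedicated, shared)      # the shared layout uses fewer gates
```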

Is there a non-obfuscated circuit corresponding to every obfuscated one? And would the non-obfuscated circuit be at least as small as the obfuscated one?

If so it seems like you could just show how to construct the sm...

(Eli's personal "trying to have thoughts" before reading the other comments. Probably incoherent. Possibly not even on topic. Respond iff you'd like.)

(Also, my thinking here is influenced by having read this report recently.)

On the one hand, I can see the intuition that if a daemon is solving a problem, there is some part of the system that is solving the problem, and there is another part that is working to (potentially) optimize against you. In theory, we could "cut out" the part that is the problematic agency, preserving th...

Is a Turing machine that has a property because it searches the space of all Turing machines for one with that property, and emulates it, a daemon?

I don't think the procedure needs to be efficient to solve the problem, since we only care about existence of a smaller circuit (not an efficient way to produce it).

Does this mean you do not expect daemons to occur in practice because they are too complicated?

Given any random circuit, you cannot, in general, show whether it is the smallest circuit that produces the output it does. That's just Rice's theorem, right? So why would it be possible for a daemon?

Let's set aside daemons for a moment, and think about a process which does "try to" make accurate predictions, but also "tries to" perform the relevant calculations as efficiently as possible. If it's successful in this regard, it will generate small (but probably not minimal) prediction circuits. Let's call this an efficient-predictor process. The same intuitive argument used for daemons also applies to this new process: it seems like we can get a smaller circuit which makes the same predictions, by removing the optimizy...

If I understand the problem correctly, then it is not that deep. Consider the specific example of weather (e.g. temperature) prediction. Let C(n) be the set of circuits that correctly predict the weather for the last n days. It is obvious that the smallest circuit in C(1) is a constant, which predicts nothing, and which also doesn't fall into C(2). Likewise, for every n there are many circuits that simply compress the ...