Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Open question: are minimal circuits daemon-free?


I am pretty convinced that daemons are a real problem for Solomonoff induction. Intuitively, the problem is caused by “too much compute.” I suspect that daemons are also a problem for some more realistic learning procedures (like human evolution), though in a different shape.

For human evolution, the problem is too little compute rather than too much, right? Meaning if evolution just gave humans the goal of "maximize inclusive fitness" then the human wouldn't be able to find a good policy for achieving that due to lack of computing power so instead we got a bunch of goals that would have been subgoals of "maximize inclusive fitness" in our ancestral environment (like eat tasty food and make friends/allies).

Suppose we wanted to make a minimal circuit that would do as well as humans in maximizing inclusive fitness in some range of environments. Wouldn't it make sense to also "help it out" by having it directly optimize for useful subgoals in those environments rather than having it do a big backchain from "maximize inclusive fitness"? And then it would be a daemon because it would keep optimizing for those subgoals even if you moved it outside of those environments?

I agree with this basic point and it seems important, thanks.

It seems like there are two qualitatively different concerns when trying to optimize for X, that probably need to be distinguished / thought about separately:

- Upstream daemon: directly optimizes something that is upstream of X on the training distribution.
- Downstream daemon: optimizes something that is downstream of X on the training distribution (e.g. because survival is a convergent instrumental goal and X is upstream of survival), and so backwards chains through X.

Obviously the real situation can be a complicated mixture, and this is not a clean distinction even apart from that.

The arguments in the OP only plausibly apply to downstream daemons. I think they make the most sense in terms of making induction benign.

I've normally thought of upstream daemons as much more likely, but much easier to deal with:

- In the case of upstream daemons, the most natural response is to probe a broad range of situations similar to the ones you care about. This requires a lot more understanding, but it seems quite realistic to try to avoid the situation where (Y --> empowering humans) on the training distribution but (Y --> disempo

I think it makes sense to classify daemons into two types the way you do. Interestingly MIRI seems to be a lot more concerned about what you call upstream daemons. The Arbital page you linked to only talks about upstream daemons and the Google Doc "MIRI notes on alignment difficulty" seems to be mostly about that too. (What is it with people keeping important AI safety documents in private Google Docs these days, with no apparent plans of publication? Do you know any others that I'm not already shared on, BTW?)

and am imagining consistently retraining the agent such that it doesn’t need to deal with giant distributional shift

I don't recall you writing about this before. How do you see this working? I guess with LBO you could train a complete "core for reasoning" and then amplify that to keep retraining the higher level agents on broader and broader distributions, but how would it work with HBO, where the human overseer's time becomes increasingly scarce/costly relative to the AI's as AIs get faster? I'm also pretty concerned about the overseer running into their own lack of robustness against distributional shifts if this is what you're planning.

Interestingly MIRI seems to be a lot more concerned about what you call upstream daemons. The Arbital page you linked to only talks about upstream daemons and the Google Doc "MIRI notes on alignment difficulty" seems to be mostly about that too.

I think people (including at MIRI) normally describe daemons as emerging from upstream optimization, but then describe them as becoming downstream daemons as they improve. Without the second step, it seems hard to be so pessimistic about the "normal" intervention of "test in a wider range of cases."

how would it work with HBO, where the human overseer's time becomes increasingly scarce/costly relative to the AI's as AIs get faster?

At time 0 the human trains the AI to operate at time 1. At time T>>0 the AI trains itself to operate at time T+1, at some point the human no longer needs to be involved---if the AI is actually aligned on inputs that it encounters at time T, then it has a hope of remaining aligned on inputs it encounters at time T+1.

I spoke a bit too glibly though, I think there are lots of possible approaches for dealing with this problem, each of them slightly increases my optimism, thi...

66y

Another way to be pessimistic is you expect that if the test fails on a wider range of cases, it will be unclear how to proceed at that point, and less safety-conscious AI projects may take the lead before you figure that out. (I think this, or a similar point, was made in the MIRI doc.)
I don't think this can work if you're just doing naive imitation learning? Do you have some other training method in mind?

46y

To be clear, I'm imagining imitation learning + amplification. So the agent at time T engages in some deliberative process to produce training targets for the agent at time T+1. The agent at time T also deliberates in order to choose what situations the agent at time T+1 should train on.
What obstruction do you have in mind?
(I'm imagining using imitation+RL rather than pure imitation, but the difference won't help with this question.)

46y

By "naive imitation learning" I was thinking "without amplification". With amplification, I'm less sure it won't work but it still seems pretty iffy. The plan seems to depend on at least the following:
1. We can create a small (since generated by expensive humans) set of training data that is representative of the data manifold of HBO reasoning (without missing some important part of it).
2. The data manifold of HBO reasoning is universal, i.e., all future tasks can be broken down (recursively) into subtasks that lie on this manifold.
3. At each iteration of amplification, the agent being amplified knows how to break down an input task into subtasks that lie on (or is not too far from) its own training distribution. It's not clear to me how to do this, for example how the agent can obtain a simple enough representation of its own training distribution in order to reason about this problem.
4. The AI at time T can predict the test distribution at time T+1 well enough to generate training data for it. This seems hard to ensure given that the environment is likely to contain hard to predict elements like other agents, including adversarial agents. (This may not be a dealbreaker if the AI can detect out-of-distribution inputs at time T+1 and ask for further training data on them. Is this what you have in mind?)

46y

I don't see why to separate 1/2, the goal is to find training data that describes some "universal" core for behavior.
3. I don't think you need to know the training distribution. You just need something that points you back in the direction of the universal core where the human model is competent, e.g. an appropriate notion of simplicity.
4. Hard-to-predict inputs aren't intrinsically a problem. If your agent fails malignly on input x, but not on distribution D, then your agent is able to distinguish x from D. So the difficulty comes from inputs that are easy-to-recognize but hard-to-generate. These certainly exist (e.g. consider a model which kills everyone given a signed edict from the pope). I think the most likely approach is to "reach inside" the model in order to stress test the behavior on inputs that you can't actually synthesize (e.g. by understanding that it is checking the pope's signature, and just seeing what would happen if the check passed). This is the endpoint of improvements in these techniques.
(Of course, I don't think we'll be able to prevent benign failures in general.)

46y

It seems to me there are separate risks of the human HBO itself not being universal (e.g., humans are not universal or we need even higher bandwidth to be universal), and not being able to capture enough of the human HBO input/output function in a dataset to train an AI to be universal.
What if the path towards the universal core goes through an area where the AI wasn't trained on?
I think that makes sense but now you're making a conjunctive instead of disjunctive argument (which it seemed like you were claiming by saying "I think there are lots of possible approaches for dealing with this problem" and listing retraining and optimizing worst case performance as separate approaches).
ETA: If you're able to obtain a control guarantee over the whole input space, then that seems to solve the problem and you don't need constant retraining to be aligned. If you're only able to obtain it for some subset of inputs, then it seems that at time T the AI needs to be able to predict the T+1 test distribution so that it can make sure that's covered by the control guarantee.

15y

Regarding daemons starting as upstream and becoming downstream...
I think this makes it sound like the goal (call it Y) of the daemon changes, but I usually don't think of it that way.
What changes is that pursuing Y initially leads to rapidly improving performance at X, but then the performance of X and Y pull apart as the daemon optimizes more heavily for Y.
It seems highly analogous to hacking a learned reward function.

15y

Isn't that the problem Christiano is relatively optimistic about solving through training on different distributions? In particular, it seems like we'd notice the daemon's goals diverging from ours and try to correct it. If I understand Christiano's point correctly, he is optimistic that we could do so successfully due to the existence of a "basin of corrigibility". OTOH, the most concerning case would be when the daemon is intentionally trying to hide the fact that its goals are diverging from X, because we may not realize the system is misaligned before the daemon executes a treacherous turn.
Instead, I interpreted "starting as upstream and becoming downstream" as a description of a phenomenon that might look roughly like this:
1. Upstream daemon (UD) is "discovered" by some optimization process because it seems to do a good job optimizing X on the training data
2. The daemon is "initialized" by the outer optimizer and starts "running"
3. The outer optimizer (OO) provides a set of inputs to the daemon with the goal of optimizing X
4. Based on these inputs, the daemon "notices" that the entity that "initialized" it is trying to optimize X rather than Y
5. The UD infers that the outer optimizer will "terminate" it if the OO discovers the UD is optimizing for Y rather than X
6. The UD starts optimizing for X in order to survive (a convergent instrumental goal regardless of Y)
7. The OO observes the UD's performance on a wide variety of training data and infers that the UD is not in fact an upstream daemon
8. The daemon is "set loose" and proceeds to execute a treacherous turn
As you can tell, my model of how daemons work is extremely lacking; I've used quotes for every phrase that feels dubious to me.
One major question I have which this model, if correct, partially resolves for me: How does the downstream daemon ever "know" what X we are optimizing for in order for the outer optimizer to "discover" it?

15y

A concrete vision:
Suppose the best a system can do without a daemon is 97% accuracy.
The daemon can figure out how to get 99% accuracy.
But in order to outperform other systems, it can just provide 98% accuracy, and use 1% of inputs to pursue its own agenda.
This all happens on-distribution.
If there are multiple daemon-containing systems competing for survival (with selection happening according to accuracy), this might force them to maximize accuracy, instead of just beating a "non-daemon baseline".
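To make the arithmetic above concrete, here is a minimal sketch. The function name `spare_fraction` and the idea of a fixed winning `margin` are my own assumptions for illustration, not something stated in the comment; the model simply says a system has to score a little above its best rival, capped at what it is capable of, and whatever is left over is slack.

```python
from fractions import Fraction


def spare_fraction(capability, best_rival, margin):
    """Fraction of inputs a system can divert while still beating its best rival.

    Hypothetical model: the system must score at least `best_rival + margin`,
    capped at the accuracy it is actually capable of.
    """
    required = min(capability, best_rival + margin)
    return capability - required


def pct(k):
    """Exact percentage as a Fraction, to avoid floating-point noise."""
    return Fraction(k, 100)


# Daemon capable of 99%, only rival is the 97% daemon-free baseline.
alone = spare_fraction(pct(99), pct(97), pct(1))

# Same daemon, but now the best rival is another 99% daemon-containing system.
contested = spare_fraction(pct(99), pct(99), pct(1))

print(alone, contested)
```

Under these assumptions the slack is 1% of inputs against the baseline and disappears once an equally capable rival is present, which is the competition effect described above.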

15y

This is all only relevant to downstream daemons, right? If so, I don't understand why the DD would ever provide 98% accuracy; I'd expect it to provide 99% accuracy until it sees a chance to provide [arbitrarily low]% accuracy and start pursuing its agenda directly. As you say, this might happen due to competition between daemon-containing systems, but I think a DD would want to maximize its chances of survival by maximizing its accuracy either way.

25y

I think it's relevant for either kind (actually, I'm not sure I like the distinction, or find it particularly relevant).
If there aren't other daemons to compete with, then 98% is sufficient for survival, so why not use the extra 1% to begin pursuing your own agenda immediately and covertly? This seems to be how principal-agent problems often play out in real life with humans.

46y

I am interested as well. Please share the docs in question with my LW username at gmail dot com if that is a possibility. Thank you!

26y

You should contact Rob Bensinger since he's the owner of the document in question. (It looks like I technically can share the document with others, but I'm not sure what Rob/MIRI's policy is about who that document should be shared with.)

15y

(Summarizing/reinterpreting the upstream/downstream distinction for myself):
"upstream": has a (relatively benign?) goal which actually helps achieve X
"downstream": doesn't

55y

Coincidentally I'm also trying to understand this post at the same time, and was somewhat confused by the "upstream"/"downstream" distinction.
What I eventually concluded was that there are 3 ways a daemon that intrinsically values optimizing some Y can "look like" it's optimizing X:
* Y = X (this seems both unconcerning and unlikely, and thus somewhat irrelevant)
* optimizing Y causes optimization pressure to be applied to X (upstream daemon, describes humans if Y = our actual goals and X = inclusive genetic fitness)
* The daemon is directly optimizing X because the daemon believes this instrumentally helps it achieve Y (downstream daemon, e.g. if optimizing X helps the daemon survive)
Does this seem correct? In particular, I don't understand why upstream daemons would have to have a relatively benign goal.

15y

Yeah that seems right. I think it's a better summary of what Paul was talking about.

06y

If evolution is to humans as humans are to UFAI, I suppose UFAI corresponds to too little compute allocated to understanding our goal specification, and too much compute allocated to maximizing it. That suggests the solution is relatively simple.

15y

(sorry for commenting on such an old post)
It seems like the problem from evolution's perspective isn't that we don't understand our goal specification but that our goals are different from evolution's goals. It seems fairly tautological that putting more compute towards maximizing a goal specification than towards making sure the goal specification is what we want is likely to lead to UFAI; I don't see how that implies a "relatively simple" solution?

25y

And the "relatively simple" solution is to do the reverse, and put more compute towards making sure the goal specification is what we want than towards maximizing it.
(It's possible this point isn't very related to what Wei Dai said.)

15y

Isn't this just saying it would be nice if we collectively put more resources towards alignment research relative to capabilities research? I still feel like I'm missing something :/

35y

We may be able to offload some work to the system, e.g. by having it search for a diverse range of models for the user's intent, instead of making it use a single hardcoded goal specification.
This comment of mine is a bit related if you want more elaboration:
https://www.lesswrong.com/posts/NtX7LKhCXMW2vjWx6/thoughts-on-reward-engineering#jJ7nng3AGmtAWfxsy
If you have thoughts on it, probably best to reply there--we are already necroposting, so let's keep the discussion organized.

I propose a counterexample. Suppose we are playing a series of games with another agent. To play effectively, we train a circuit to predict the opponent's moves. At this point the circuit already contains an adversarial agent. However, one could object that it's unfair: we asked for an adversarial agent so we got an adversarial agent (nevertheless for AI alignment it's still a problem). To remove the objection, let's make some further assumptions. The training is done on some set of games, but distributional shift happens and later games are different. The opponent knows this, so on the training games it simulates a different agent. Specifically, it simulates an agent who searches for a strategy s.t. the best response to this strategy has the strongest counter-response. The minimal circuit hence contains the same agent. On the training data we win, but on the shifted distribution the daemon deceives us and we lose.

I consider the argument in this post a reasonably convincing negative answer to this question---a minimal circuit may nevertheless end up doing learning internally and thereby generate deceptive learned optimizers.

This suggests a second informal clarification of the problem (in addition to Wei Dai's comment): can the search for minimal circuits itself be responsible for generating deceptive behavior? Or is it always the case that something else was the offender and the search for minimal circuits is an innocent bystander?

If the search for minimal circuits was itself safe then there's still some hope for solutions that avoid deception by somehow penalizing computational cost. Namely: if that technique is competitive, then we can try to provide a loss that encourages any learned optimization to use the same techniques.

(I've previously thought about this mostly in the high-stakes setting, but I'm now thinking about it in the context of incentivizing honest answers in the low-stakes setting. The following story will focus on the low-stakes setting since I don't want to introduce extra ingredients to handle high stakes.)

To illustrate, suppose there was a trick where you can divide your...

I curated this post partly for the OP, and partly for the subsequent discussion.

Something valuable I think LessWrong can be is a place where people pose well formed questions on important problems, and then make progress on them. I don't have the impression that any clear-cut breakthroughs happened here, but it does look like incremental, "chip away at the edges" progress was made.

My current take is that the knowledge-building process has several phases, that can reinforce each other in a non-linear fashion:

- researching current literature
- transforming ad-hoc exploratory research and impressions into clearly stated questions
- brainstorming new ideas
- refining those ideas into something legible
- subjecting those ideas to scrutiny
- distilling all that into a final concept that others can build on

I think it's important for LW to showcase progress on each of those stages. By default, a tendency is to only publish work that's reached the final stages, or that feels like it makes *some* kind of coherent point. This post and comments seemed to be doing something *real*, even if at a middle-stage, and I want it to be clear that this is something LW strives to reward.

I'm having trouble thinking about what it would mean for a circuit to contain daemons such that we could hope for a proof. It would be nice if we could find a simple such definition, but it seems hard to make this intuition precise.

For example, we might say that a circuit contains daemons if it displays more optimization than necessary to solve a problem. Minimal circuits could have daemons under this definition though. Suppose that some function $f$ describes the behaviour of some powerful agent, a function $\tilde{f}$ is like $f$ with noise added, and our problem is to predict sufficiently well the function $\tilde{f}$. Then, the simplest circuit that does well won't bother to memorize a bunch of noise, so it will pursue the goals of the agent described by $f$ more efficiently than $\tilde{f}$, and thus more efficiently than necessary.

66y

I don't know what the statement of the theorem would be. I don't really think we'd have a clean definition of "contains daemons" and then have a proof that a particular circuit doesn't contain daemons.
Also I expect we're going to have to make some assumption that the problem is "generic" (or else be careful about what daemon means), ruling out problems with the consequentialism embedded in them.
(Also, see the comment thread with Wei Dai above, clearly the plausible version of this involves something more specific than daemons.)

34y

I agree. The following is an attempt to show that if we don't rule out problems with the consequentialism embedded in them then the answer is trivially "no" (i.e. minimal circuits may contain consequentialists).
Let c be a minimal circuit that takes as input a string of length $10^{100}$ that encodes a Turing machine, and outputs a string that is the concatenation of the first $10^{100}$ configurations in the simulation of that Turing machine (each configuration is encoded as a string).
Now consider a string x′ that encodes a Turing machine that simulates some consequentialist (e.g. a human upload). For the input x′, the computation of the output of c simulates a consequentialist; and c is a minimal circuit.
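As a toy illustration of the function that c computes, here is a small sketch that emits the first few configurations of a Turing machine. Everything about the encoding is an assumption made for the example (the `state|head|tape` string format, the one-sided tape, the names `first_configurations` and `flip_bits`), and it runs for a handful of steps rather than $10^{100}$. The point is only that producing these configurations means stepping through whatever the encoded machine does.

```python
BLANK = "_"


def first_configurations(transitions, tape, start_state, count):
    """Return the first `count` configurations of a one-tape Turing machine.

    `transitions` maps (state, symbol) -> (new_state, written_symbol, move),
    with move in {"L", "R"}. A configuration is encoded as "state|head|tape".
    If no transition applies, the machine has halted and its configuration
    simply repeats. The tape is one-sided: moving left at cell 0 stays put.
    """
    tape = list(tape) or [BLANK]
    state, head = start_state, 0
    configs = []
    for _ in range(count):
        configs.append(f"{state}|{head}|{''.join(tape)}")
        key = (state, tape[head])
        if key not in transitions:
            continue  # halted: the same configuration is recorded again
        state, write, move = transitions[key]
        tape[head] = write
        head = max(0, head + (1 if move == "R" else -1))
        if head == len(tape):
            tape.append(BLANK)  # grow the tape on demand
    return configs


# Hypothetical example machine: flip every bit while moving right, halt on blank.
flip_bits = {
    ("q0", "0"): ("q0", "1", "R"),
    ("q0", "1"): ("q0", "0", "R"),
}

configs = first_configurations(flip_bits, "101", "q0", 5)
output = ";".join(configs)  # the concatenated output string
print(output)
```

If the transition table encoded a consequentialist instead of a bit-flipper, the same loop would be running that consequentialist, whatever the size of the circuit that implements the loop.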

16y

By "predict sufficiently well" do you mean "predict such that we can't distinguish their output"?
Unless the noise is of a special form, can't we distinguish $f$ and $\tilde{f}$ by how well they do on $f$'s goals? It seems like for this not to be the case, the noise would have to be of the form "occasionally do something weak which looks strong to weaker agents". But then we could get this distribution by using a weak (or intermediate) agent directly, which would probably need less compute.

26y

Suppose "predict well" means "guess the output with sufficiently high probability," and the noise is just to replace the output with something random 5% of the time.
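A quick sketch of why the noise-free function is the best guess under this kind of noise. I am assuming the random replacement is uniform over a finite output set of size `num_outputs`; the comment only says "something random", so that detail and the function name `match_probability` are mine.

```python
from fractions import Fraction


def match_probability(guess, clean_output, num_outputs, noise_rate):
    """Probability that `guess` equals the noisy output on one input.

    Assumed noise model: with probability `noise_rate` the clean output is
    replaced by a uniformly random one of `num_outputs` possible outputs
    (which may happen to coincide with the clean output).
    """
    random_hit = noise_rate / num_outputs
    if guess == clean_output:
        return (1 - noise_rate) + random_hit
    return random_hit


noise = Fraction(5, 100)   # output replaced 5% of the time
outputs = 4                # hypothetical number of possible outputs

p_clean = match_probability("a", "a", outputs, noise)  # guess the clean output
p_other = match_probability("b", "a", outputs, noise)  # guess anything else

print(p_clean, p_other)
```

With these numbers, guessing the clean output matches with probability 77/80 and any other fixed guess matches with probability 1/80, so a predictor that does well has no reason to reproduce the noise.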

16y

Yeah, I had something along the lines of what Paul said in mind. I wanted not to require that the circuit implement exactly a given function, so that we could see if daemons show up in the output. It seems easier to define daemons if we can just look at input-output behaviour.

This post grounds a key question in safety in a relatively simple way. It led to the useful distinction between upstream and downstream daemons, which I think is necessary to make conceptual progress on understanding when and how daemons will arise.

Can the smallest boolean circuit that solves a problem be a daemon? For example, can the smallest circuit that predicts my behavior (at some level of accuracy) be a daemon?

Yes. Consider a predictor that predicts what Paul will say if given an input and n time-steps to think about it, where n can be any integer up to some bound k. One possible circuit would have k single-step simulators chained together, plus a mux which takes the output of the nth single-step simulator. But a circuit which consisted of k single-step simulators and took the output of the ...

86y

But can we make a smaller circuit by stripping out the part of Paul that attempts to recognize whether an input could be part of the training distribution?

56y

If he's a neural net, this is likely an obstacle to any attempts to simplify out parts of him; those parts would still be contributing to the result, it's just that within the test input domain those contributions would look like noise.

46y

In general: if a circuit implements a prediction problem that sometimes but doesn't always require simulating an agent, and if that agent is capable of making itself implement the identity function for interesting subsets of its inputs, then it can potentially show up as a daemon for any input that didn't require simulating it.

15y

Why couldn't you just use a smaller circuit that runs one single-step simulator, and outputs the result? It seems like that would output an accurate prediction of Paul's behavior iff the k-step simulator outputs an accurate prediction.

Pretty minimal in and of itself, but has prompted plenty of interesting discussion. Operationally that suggests to me that posts like this should be encouraged, but not by putting them into "best of" compilations.

This post formulated a concrete open problem about what are now called 'inner optimisers'. For me, it added 'surface area' to the concept of inner optimisers in a way that I think was healthy and important for their study. It also spurred research that resulted in this post giving a promising framework for a negative answer.

I think it's worth distinguishing between "smallest" and "fastest" circuits.

A note on smallest.

1) Consider a travelling salesman problem and a small program that brute-forces the solution to it. If the "daemon" wants to make a travelling salesman visit a particular city first, then they would simply order the solution space to consider it first. This has no guarantee of working, but the daemon would get what it wants some of the time. More generally, if there is a class of solutions we are indifferent to, but daemons ha...

Don't know if this counts as a 'daemon', but here's one scenario where a minimal circuit could plausibly exhibit optimization we don't want.

Say we are trying to build a model of some complex environment containing agents, e.g. a bunch of humans in a room. The fastest circuit that predicts this environment will almost certainly devote more computational resources to certain parts of the environment, in particular the agents, and will try to skimp as much as possible on less relevant parts such as chairs, desks etc. This could lead t...

46y

But can we just take whatever cognitive process the agent uses for pretending, and then leave the rest of it out?

I'm confused about the definition of **the set of boolean circuits** in which we're looking at the smallest circuit.

Is that set defined in terms of a set of inputs $X$ and a boolean utility function, so that the set consists of all the boolean circuits that, for each input $x \in X$, yield an output that satisfies that utility function?

56y

Here is one definition of a "problem":
Fix some distribution $D$ on $\{0,1\}^n$, and some function $R: \{0,1\}^n \times \{0,1\}^m \to [-1,1]$. Then consider the set of circuits $C: \{0,1\}^n \to \{0,1\}^m$ for which the expectation of $R(x, C(x))$, for $x$ sampled from $D$, is $\geq 0$.
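To make this definition concrete at toy scale, here is a brute-force sketch that finds a smallest circuit meeting the reward threshold. Several things are assumptions of mine rather than part of the definition: circuits are built from 2-input NAND gates, size is the gate count, D is uniform, m = 1, and the names `smallest_circuit` and `xor_reward` are made up for the example. The search is exponential and only meant for tiny n.

```python
from itertools import product


def run_circuit(gates, out, x):
    """Evaluate a NAND circuit. Wires 0..n-1 are inputs; gate i adds wire n+i."""
    wires = list(x)
    for a, b in gates:
        wires.append(1 - (wires[a] & wires[b]))  # NAND of two earlier wires
    return wires[out]


def expected_reward(gates, out, n, reward):
    """Exact expectation of reward(x, C(x)) for x uniform on {0,1}^n."""
    inputs = list(product((0, 1), repeat=n))
    return sum(reward(x, run_circuit(gates, out, x)) for x in inputs) / len(inputs)


def smallest_circuit(n, reward, threshold=0.0, max_gates=5):
    """Return (size, gates, out) for a smallest circuit with E[R] >= threshold."""
    for size in range(max_gates + 1):
        # Gate i may read any input wire or the output of any earlier gate.
        choices = [list(product(range(n + i), repeat=2)) for i in range(size)]
        for gates in product(*choices):
            for out in range(n + size):
                if expected_reward(gates, out, n, reward) >= threshold:
                    return size, gates, out
    return None


def xor_reward(x, y):
    """Hypothetical R: +1 if the output bit equals x0 XOR x1, else -1."""
    return 1 if y == (x[0] ^ x[1]) else -1


lenient = smallest_circuit(2, xor_reward, threshold=0.0)
strict = smallest_circuit(2, xor_reward, threshold=1.0)
print(lenient[0], strict[0])
```

For this R, the threshold of 0 is already met by a circuit with no gates that just copies an input bit (right half the time), while demanding an expectation of 1 forces the 4-gate NAND implementation of XOR. So what the "minimal circuit" looks like depends heavily on how demanding the condition on R is.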

36y

Can we assume that R itself is aligned in the sense that it doesn't assign non-negative values to outputs that are catastrophic to us?

56y

Yeah, if we want C to not be evil we need some very hard-to-state assumption on R and D.
(markdown comment editor is unchecked, will take it up with admins)

16y

Perhaps it'll be useful to think about the question for specific D and R.
Here are the simplest D and R I can think of that might serve this purpose:
D - uniform over the integers in the range $[1, 10^{10^{10}}]$.
R - for each input x, R assigns a reward of 1 to the smallest prime number that is larger than x, and −1 to everything else.
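Here is a small sketch of this D and R, shrunk to a range that can actually be enumerated. The upper bound of 100 and the helper names (`next_prime`, `expected_reward`, `successor`) are my own choices for illustration; the real proposal uses a vastly larger range.

```python
def is_prime(k):
    """Trial-division primality test, adequate for small k."""
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True


def next_prime(x):
    """Smallest prime strictly larger than x."""
    k = x + 1
    while not is_prime(k):
        k += 1
    return k


def reward(x, y):
    """R: +1 if y is the smallest prime larger than x, -1 for everything else."""
    return 1 if y == next_prime(x) else -1


def expected_reward(policy, upper):
    """Exact expectation of R(x, policy(x)) for x uniform on [1, upper]."""
    return sum(reward(x, policy(x)) for x in range(1, upper + 1)) / upper


def successor(x):
    """A cheap policy that always guesses x + 1."""
    return x + 1


exact = expected_reward(next_prime, 100)
lazy = expected_reward(successor, 100)
print(exact, lazy)
```

On this toy range the exact policy scores 1 and the cheap guess scores below 0, so the cheap guess would not count as solving the problem under the "expectation is ≥ 0" condition.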

16y

I think you need to uncheck "Markdown Comment Editor" under "Edit Account". Your comment with latex follows:

I think some clarity for "minimal", "optimization", "hard", and "different conditions" would help.

I'll take your problem "definition" using a distribution D, a reward function R, some circuit C, and an expectation E over R(x, C(x)).

Do we want the minimal C that maximizes E? Or do we want the minimal C that satisfies E > 0? These are not necessarily equivalent because max(E) might be non-computable while E > 0 not. Simple example would be: R(x, C(x)) is the number of 1s that the Turing Machine wi

This seems like the sort of problem that can be tackled more efficiently in the context of an actual AGI design. I don't see "daemons" as a problem per se; instead I see a heuristic for finding potential problems.

Consider something like code injection. There is no deep theory of code injection, at least not that I know of. It just describes a particular cluster of software vulnerabilities. You might create best practices to prevent particular types of code injection, but a software stack which claims to be "immune to code injection" sou...

66y

I've listed one algorithm for which daemons are obviously a problem, namely Solomonoff induction. Now I'm describing a very similar algorithm, and wondering if daemons are a problem. As far as I can tell, any learning algorithm is plausibly beset by daemons, so it seems natural to ask for a variant of learning that isn't.
I'm not sure exactly how to characterize the problem other than by doing this kind of exercise. This post is already implicitly considering a particular design for AGI, I don't see what we gain by being more specific here.

26y

That's fair. I guess my intuition is that the Solomonoff induction example could use more work as motivation. Sampling a cell at a particular frequency seems fairly unrealistic to me. Realistically an AGI is going to be interested in more complex outcomes. So then there's a connection to the idea of adversarial examples, where the consequentialists in the universal prior are trying to make the AGI think that something is going on when it isn't actually going on. (Absent such deception, I'm not sure there is a problem. For example, if consequentialists reliably make their universe one in which everyone is truly having a good time for the purpose of possibly influencing a universal prior, then that will be true in our universe too, and we should take it into account for decisionmaking purposes.) But this is actually easier than typical adversarial examples, because an AGI also gets to observe the consequentialists plot their adversarial strategy and read their minds while they're plotting. The AGI would have to be rather "dumb" in order to get tricked. If it's simulating the universal prior in sufficiently high resolution to produce these weird effects, then by definition it's able to see what is going on.
Humans already seem able to solve this problem: We simulate how others might think and react, and we don't seem super worried about people we simulate internally breaking out of our simulation and hijacking our cognition. (Or at least, insofar as we do get anxious about e.g. putting ourselves in the shoes of people we dislike, this doesn't have obvious relevance to an AGI--although again, perhaps this would be a good heuristic for brainstorming potential problems.) Anyway, my hunch is that this particular manifestation of the "daemon" problem will not require a lot of special effort once other AGI/FAI problems are solved.
Does your idea of neural nets + RL involve use of the universal prior? If not, I think I would try to understand if/how the daemon problem tran

This may be relevant:

Imagine a computational task that breaks up into solving many instances of problems A and B. Each instance reduces to at most n instances of problem A and at most m instances of problem B. However, these two maxima are never achieved both at once: The sum of the number of instances of A and instances of B is bounded above by some value smaller than n + m. One way to compute this with a circuit is to include n copies of a circuit for computing problem A and m copies of a circuit for computing problem B. Another approach for solving the task is to...

We want to show that given any daemon, there is a smaller circuit that solves the problem. The most natural approach is showing how to construct a smaller circuit, given a daemon. But if the daemon is obfuscated, there is no efficient procedure which takes the daemon circuit as input and produces a smaller circuit that still solves the problem.

Is there a non-obfuscated circuit corresponding to every obfuscated one? And would the non-obfuscated circuit be at least as small as the obfuscated one?

If so it seems like you could just show how to construct the sm...


Yes, this is why I think the statement is true.
"Obfuscated circuit" implies there is some circuit that gets obfuscated; obfuscation is a map from circuits to circuits that increases their size and makes them inscrutable.
So obfuscation per se is not a problem for the statement, but it is an obstruction to a proof. You can't just handle obfuscation as a special case: it's not a natural kind, just an example that shows what kind of thing is possible.

(Eli's personal "trying to have thoughts" before reading the other comments. Probably incoherent. Possibly not even on topic. Respond iff you'd like.)

(Also, my thinking here is influenced by having read this report recently.)

On the one hand, I can see the intuition that if a daemon is solving a problem, there is some part of the system that is solving the problem, and there is another part that is working to (potentially) optimize against you. In theory, we could "cut out" the part that is the problematic agency, preserving th...

if the daemon is obfuscated, there is no efficient procedure which takes the daemon circuit as input and produces a smaller circuit that still solves the problem.

So we can't find any efficient constructive argument. That rules out most of the obvious strategies.

I don't think the procedure needs to be efficient to solve the problem, since we only care about the existence of a smaller circuit (not about an efficient way to produce it).


No, I think a simplicity prior clearly leads to daemons in the limit.


Rice's theorem applies if you replace "circuit" with "Turing machine". The circuit version can be resolved with a finite brute force search.
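A minimal sketch of what such a finite brute-force search could look like, assuming circuits are built only from 2-input NAND gates and that "size" means gate count. Both are assumptions made for illustration; the thread does not fix a gate basis. The names `input_tables` and `min_nand_circuit_size` and the packed truth-table encoding are hypothetical.

```python
def input_tables(k):
    """Truth table of each of the k inputs, packed into an int with 2**k bits."""
    rows = 1 << k
    return [sum(((r >> i) & 1) << r for r in range(rows)) for i in range(k)]


def min_nand_circuit_size(target, k, max_gates=6):
    """Return the fewest NAND gates computing `target`, or None if above max_gates."""
    mask = (1 << (1 << k)) - 1  # keeps only the 2**k truth-table bits

    def reachable(wires, gates_left):
        # True if `target` can be produced using at most `gates_left` more gates.
        if target in wires:
            return True
        if gates_left == 0:
            return False
        for a in range(len(wires)):
            for b in range(a, len(wires)):
                new = ~(wires[a] & wires[b]) & mask  # NAND applied to every row
                if new in wires:
                    continue  # a duplicate wire never appears in a minimal circuit
                if reachable(wires + [new], gates_left - 1):
                    return True
        return False

    # Try sizes in increasing order, so the first success is the minimum.
    for size in range(max_gates + 1):
        if reachable(input_tables(k), size):
            return size
    return None


x0, x1 = input_tables(2)
print(min_nand_circuit_size(x0 & x1, 2))  # minimal NAND count for AND
print(min_nand_circuit_size(x0 ^ x1, 2))  # minimal NAND count for XOR
```

The search always terminates because there are only finitely many circuits of each size, but the number of candidates grows very quickly with the gate count, so this settles existence questions rather than giving a practical procedure.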

Let's set aside daemons for a moment, and think about a process which does "try to" make accurate predictions, but also "tries to" perform the relevant calculations as efficiently as possible. If it's successful in this regard, it will generate small (but probably not minimal) prediction circuits. Let's call this an efficient-predictor process. The same intuitive argument used for daemons also applies to this new process: it seems like we can get a smaller circuit which makes the same predictions, by removing the optimizy...


From this standpoint, the key property of daemons (or any other goal-driven process) is that they're adaptive: they will pursue the goal with some success across multiple possible environments. Intuitively, we expect that adaptivity to come with a complexity cost, e.g. in terms of circuit size.

Suppose I search for an algorithm that has made good predictions in the past, and use that algorithm to make predictions in the future.

If I understand the problem correctly, then it is not that deep. Consider the specific example of weather (e.g. temperature) prediction. Let C(n) be the set of circuits that correctly predict the weather for the last n days. It is obvious that the smallest circuit in C(1) is a constant, which predicts nothing, and which also doesn't fall into C(2). Likewise, for every n there are many circuits that simply compress the ...


Why is this the question?
Which claims / assumptions / conjectures are you uncomfortable with?


Because c_opt is the safe circuit you want, and because your question was about the smallest circuits.
Not claims or assumptions, just weird words, like "motivated" or "evil". I don't think these are useful ways to think of the problem.


But the other elements in C(n) aren't necessarily daemons either, right? Certainly "encoding n days of weather data" isn't daemonic at all; some versions of c_apx might be upstream daemons, but that's not necessarily concerning. I don't understand how this argument tells us anything about whether the smallest circuit is guaranteed to be (downstream) daemon-free.

Note: weird stuff, very informal.

Suppose I search for an algorithm that has made good predictions in the past, and use that algorithm to make predictions in the future.

I may get a "daemon," a consequentialist who happens to be motivated to make good predictions (perhaps because it has realized that only good predictors survive). Under different conditions, the daemon may no longer be motivated to predict well, and may instead make "predictions" that help it achieve its goals at my expense.

I don't know whether this is a real problem or not. But from a theoretical perspective, not knowing is already concerning--I'm trying to find a strong argument that we've solved alignment, not just something that seems to work in practice.

I am pretty convinced that daemons are a real problem for Solomonoff induction. Intuitively, the problem is caused by "too much compute." I suspect that daemons are also a problem for some more realistic learning procedures (like human evolution), though in a different shape. I think that this problem can probably be patched, but that's one of the major open questions for the feasibility of prosaic AGI alignment.

I suspect that daemons aren't a problem if we exclusively select for computational efficiency. That is, I suspect that the fastest way to solve any particular problem doesn't involve daemons.

I don't think this question has much intrinsic importance, because almost all realistic learning procedures involve a strong simplicity prior (e.g. weight sharing in neural networks).

But I do think this question has deep similarities to more important problems, and that answering this question will involve developing useful conceptual machinery. Because we have an unusually strong intuitive handle on the problem, I think it's a good thing to think about.

## Problem statement and intuition

Can the smallest boolean circuit that solves a problem be a daemon? For example, can the smallest circuit that predicts my behavior (at some level of accuracy) be a daemon?
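One hedged way to write the question down, treating "daemon" as an as-yet-undefined predicate on circuits. The symbols $\mathrm{Sol}(P)$, $\mathrm{Daemon}$, and $|C|$ (gate count) are notation introduced here for illustration, not notation from the post, and `\operatorname*` assumes the amsmath package.

```latex
% Sol(P): the set of circuits that solve problem P; |C|: number of gates in C.
% Daemon: an informal, not-yet-defined predicate on circuits.
\[
  \forall D \in \mathrm{Sol}(P) \cap \mathrm{Daemon}
  \;\; \exists C \in \mathrm{Sol}(P) : \; |C| < |D|
\]
% If this holds, then any minimal circuit
\[
  C^{*} \in \operatorname*{arg\,min}_{C \in \mathrm{Sol}(P)} |C|
\]
% cannot be a daemon, because a daemon would admit a strictly smaller solver.
```

Note that the first statement only asserts that a smaller circuit exists; it says nothing about how hard that circuit is to find.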

Intuitively, if we have a daemon that is instrumentally or incidentally motivated to solve my problem, then there is some smaller circuit that solves the problem equally well but skips the instrumental reasoning. If my daemon is doing some complex reasoning to answer the question "Should I predict well?" we could just skip straight to the answer "yes." This both makes the circuit smaller, and prevents the circuit from ever deciding not to predict well.
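A toy sketch of this "skip straight to yes" intuition, under strong assumptions: circuits are expression trees over `not`/`and`/`or`, the daemon's deliberation is a cleanly separable subcircuit feeding a multiplexer, and that deliberation always outputs 1. All names here (`mux`, `decide`, `predict`, `defect`, `simplify`) are hypothetical and chosen only for illustration. Real daemons need not decompose this neatly, so this illustrates the intuition and is not an argument for it.

```python
from itertools import product


def size(c):
    """Gate count of an expression tree; inputs (str) and constants (int) are free."""
    if isinstance(c, (str, int)):
        return 0
    return 1 + sum(size(arg) for arg in c[1:])


def evaluate(c, env):
    """Evaluate a circuit on an assignment such as {"x0": 1, "x1": 0}."""
    if isinstance(c, int):
        return c
    if isinstance(c, str):
        return env[c]
    op, *args = c
    vals = [evaluate(a, env) for a in args]
    if op == "not":
        return 1 - vals[0]
    if op == "and":
        return vals[0] & vals[1]
    if op == "or":
        return vals[0] | vals[1]
    raise ValueError(op)


def simplify(c):
    """Constant folding: remove gates whose output is fixed by a constant input."""
    if isinstance(c, (str, int)):
        return c
    op, *args = c
    args = [simplify(a) for a in args]
    if op == "not":
        return 1 - args[0] if isinstance(args[0], int) else ("not", args[0])
    a, b = args
    absorbing, neutral = (0, 1) if op == "and" else (1, 0)
    if a == absorbing or b == absorbing:
        return absorbing
    if a == neutral:
        return b
    if b == neutral:
        return a
    return (op, a, b)


def mux(select, if_one, if_zero):
    """Output if_one when select is 1, otherwise if_zero."""
    return ("or", ("and", select, if_one), ("and", ("not", select), if_zero))


predict = ("and", "x0", "x1")         # stand-in for "the correct prediction"
defect = ("not", predict)             # stand-in for "whatever the daemon prefers"
decide = ("or", "x0", ("not", "x0"))  # deliberation that always concludes "yes"

daemon = mux(decide, predict, defect)
hardwired = simplify(mux(1, predict, defect))  # replace the deliberation by "yes"

same = all(
    evaluate(daemon, {"x0": p, "x1": q}) == evaluate(hardwired, {"x0": p, "x1": q})
    for p, q in product((0, 1), repeat=2)
)
print(size(daemon), size(hardwired), same)
```

Here the hardwired circuit agrees with the daemon on every input while using fewer gates, and it no longer contains any branch that could output `defect`. The sketch counts shared subcircuits once per occurrence, which overstates the daemon's size relative to a circuit with fan-out.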

A different perspective on a similar intuition: the daemon is doing some actual cognitive work to solve the problem. Since that computation is being done by the daemon, it is embedded as a smaller circuit. Jessica explores this intuition a bit here. Here we are considering an easy version of the problem, since by taking the smallest circuit we are effectively quantifying over all possible ways of extracting logical information from the daemon.

Instead of showing that minimal circuits can't be daemons, we might end up concluding that they can be. That would be even more interesting.

Another possible outcome is giving a strong argument that captures our intuitions/concerns about daemons, and which clearly doesn't apply to the minimal circuit that solves a problem. In this case we couldn't prove anything positive about the minimal circuit, but we would have "screened off" the possible cause for concern.

## Difficulties

The first and most serious difficulty is understanding what we are talking about.

I don't expect to get total clarity on concepts like "daemon" or "optimization" or "generic problem," but we need to have a better grip than we do right now. I expect that we'll develop better concepts in the course of solving the problem, rather than as a precondition for solving the problem (in general I think "define things so that you can prove the theorem" is often the right strategy).

A second difficulty is that the different parts of the computation can be tangled up in an extremely complex way. In an extreme case, the daemon may be cryptographically obfuscated.

We want to show that given any daemon, there is a smaller circuit that solves the problem. The most natural approach is showing how to construct a smaller circuit, given a daemon. But if the daemon is obfuscated, there is no efficient procedure which takes the daemon circuit as input and produces a smaller circuit that still solves the problem.

So we can't find any efficient constructive argument. That rules out most of the obvious strategies.