Open question: are minimal circuits daemon-free?

by paulfchristiano2 min read5th May 201869 comments

130

Ω 15

Inner AlignmentMesa-OptimizationOpen ProblemsAI
Curated
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Note: weird stuff, very informal.

Suppose I search for an algorithm that has made good predictions in the past, and use that algorithm to make predictions in the future.

I may get a "daemon," a consequentialist who happens to be motivated to make good predictions (perhaps because it has realized that only good predictors survive). Under different conditions, the daemon may no longer be motivated to predict well, and may instead make "predictions" that help it achieve its goals at my expense.

I don't know whether this is a real problem or not. But from a theoretical perspective, not knowing is already concerning--I'm trying to find a strong argument that we've solved alignment, not just something that seems to work in practice.

I am pretty convinced that daemons are a real problem for Solomonoff induction. Intuitively, the problem is caused by "too much compute." I suspect that daemons are also a problem for some more realistic learning procedures (like human evolution), though in a different shape. I think that this problem can probably be patched, but that's one of the major open questions for the feasibility of prosaic AGI alignment.

I suspect that daemons aren't a problem if we exclusively select for computational efficiency. That is, I suspect that the fastest way to solve any particular problem doesn't involve daemons.

I don't think this question has much intrinsic importance, because almost all realistic learning procedures involve a strong simplicity prior (e.g. weight sharing in neural networks).

But I do think this question has deep similarities to more important problems, and that answering this question will involve developing useful conceptual machinery. Because we have an unusually strong intuitive handle on the problem, I think it's a good thing to think about.

Problem statement and intuition

Can the smallest boolean circuit that solves a problem be a daemon? For example, can the smallest circuit that predicts my behavior (at some level of accuracy) be a daemon?

Intuitively, if we have a daemon that is instrumentally or incidentally motivated to solve my problem, then there is some smaller circuit that solves the problem equally well but skips the instrumental reasoning. If my daemon is doing some complex reasoning to answer the question "Should I predict well?" we could just skip straight to the answer "yes." This both makes the circuit smaller, and prevents the circuit from ever deciding not to predict well.

A different perspective on a similar intuition: the daemon is doing some actual cognitive work to solve the problem. Since that computation is being done by the daemon, it is embedded as a smaller circuit. Jessica explores this intuition a bit here. Here we are considering an easy version of the problem, since by taking the smallest circuit we are effectively quantifying over all possible ways of extracting logical information from the daemon.

Instead of showing that minimal circuits can't be daemons, we might end up concluding that they can be. That would be even more interesting.

Another possible outcome is giving a strong argument that captures our intuitions/concerns about daemons, and which clearly doesn't apply to the minimal circuit that solves a problem. In this case we couldn't prove anything positive about the minimal circuit, but we would have "screened off" the possible cause for concern.

Difficulties

The first and most serious difficulty is understanding what we are talking about.

I don't expect to get total clarity on concepts like "daemon" or "optimization" or "generic problem," but we need to have a better grip than we do right now. I expect that we'll develop better concepts in the course of solving the problem, rather than as a precondition for solving the problem (in general I think "define things so that you can prove the theorem" is often the right strategy).

A second difficulty is that the different parts of the computation can be tangled up in an extremely complex way. In an extreme case, the daemon may be cryptographically obfuscated.

We want to show that given any daemon, there is a smaller circuit that solves the problem. The most natural approach is showing how to construct a smaller circuit, given a daemon. But if the daemon is obfuscated, there is no efficient procedure which takes the daemon circuit as input and produces a smaller circuit that still solves the problem.

So we can't find any efficient constructive argument. That rules out most of the obvious strategies.

130

Ω 15

69 comments, sorted by Highlighting new comments since Today at 10:36 AM
New Comment

I am pretty convinced that daemons are a real problem for Solomonoff induction. Intuitively, the problem is caused by “too much compute.” I suspect that daemons are also a problem for some more realistic learning procedures (like human evolution), though in a different shape.

For human evolution, the problem is too little compute rather than too much, right? Meaning if evolution just gave humans the goal of "maximize inclusive fitness" then the human wouldn't be able to find a good policy for achieving that due to lack of computing power so instead we got a bunch of goals that would have been subgoals of "maximize inclusive fitness" in our ancestral environment (like eat tasty food and make friends/allies).

Suppose we wanted to make a minimal circuit that would do as well as humans in maximizing inclusive fitness in some range of environments. Wouldn't it make sense to also "help it out" by having it directly optimize for useful subgoals in those environments rather than having it do a big backchain from "maximize inclusive fitness"? And then it would be a daemon because it would keep optimizing for those subgoals even if you moved it outside of those environments?

I agree with this basic point and it seems important, thanks.

It seems like there are two qualitatively different concerns when trying to optimize for X, that probably need to be distinguished / thought about separately:

  • Upstream daemon: directly optimizes something that is upstream of X on the training distribution.
  • Downstream daemon: optimizes something that is downstream of X on the training distribution (e.g. because survival is a convergent instrumental goal and X is upstream of survival), and so backwards chains through X.

Obviously the real situation can be a complicated mixture, and this is not a clean distinction even apart from that.

The arguments in the OP only plausibly apply to downstream daemons. I think they make the most sense in terms of making induction benign.

I've normally thought of upstream daemons as much more likely, but much easier to deal with:

  • In the case of upstream daemons, the most natural response is to probe a broad range of situations similar to the ones you care about. This requires a lot more understanding, but it seems quite realistic to try to avoid the situation where (Y --> empowering humans) on the training distribution but (Y --> disempowering humans) on the slightly expanded distribution. (Compared to others around here, this probably seems more plausible to me because I have an intuition about a basin of attraction around corrigibility, and am imagining consistently retraining the agent such that it doesn't need to deal with giant distributional shift.)
  • In the case of downstream daemons, you have to deal with an adversary who is motivated to look for the first opportunity to defect that you won't be able to notice. This pits much more optimization pressure against your training regime. In particular, the "distributional shift" in this case is caused by the agent being powerful enough to seize control, and therefore it is very difficult to avoid.

While I usually flag these as two potentially distinct concerns, they do run together a lot in my head as evidenced by this post. I'm not sure if it's possible to cleanly distinguish them, or how. The right distinction may also be something else, e.g focusing directly on the possibility of a treacherous turn.

I think it makes sense to classify daemons into two types the way you do. Interestingly MIRI seems to be a lot more concerned about what you call upstream daemons. The Arbital page you linked to only talks about upstream daemons and the Google Doc "MIRI notes on alignment difficulty" seems to be mostly about that too. (What is it with people keeping important AI safety documents in private Google Docs these days, with no apparent plans of publication? Do you know any others that I'm not already shared on, BTW?)

and am imagining consistently retraining the agent such that it doesn’t need to deal with giant distributional shift

I don't recall you writing about this before. How do you see this working? I guess with LBO you could train a complete "core for reasoning" and then amplify that to keep retraining the higher level agents on broader and broader distributions, but how would it work with HBO, where the human overseer's time becomes increasingly scarce/costly relative to the AI's as AIs get faster? I'm also pretty concerned about the overseer running into their own lack of robustness against distributional shifts if this is what you're planning.

Interestingly MIRI seems to be a lot more concerned about what you call upstream daemons. The Arbital page you linked to only talks about upstream daemons and the Google Doc "MIRI notes on alignment difficulty" seems to be mostly about that too.

I think people (including at MIRI) normally describe daemons as emerging from upstream optimization, but then describe them as becoming downstream daemons as they improve. Without the second step, it seems hard to be so pessimistic about the "normal" intervention of "test in a wider range of cases."

how would it work with HBO, where the human overseer's time becomes increasingly scarce/costly relative to the AI's as AIs get faster?

At time 0 the human trains the AI to operate at time 1. At time T>>0 the AI trains itself to operate at time T+1, at some point the human no longer needs to be involved---if the AI is actually aligned on inputs that it encounters at time T, then it has a hope of remaining aligned on inputs it encounters at time T+1.

I spoke a bit too glibly though, I think there are lots of possible approaches for dealing with this problem, each of them slightly increases my optimism, this isn't the most important:

  • Retraining constantly. More generally, only using the AI for a short period of time before building a completely new AI. (I think that humans basically only need to solve the alignment problem for AI-modestly-better-than-humans-at-alignment, and then we leave the issue up to the AI.)
  • Using techniques here to avoid "active malice" in the worst case. This doesn't include all cases where the AI is optimizing a subgoal which is no longer correlated with the real goal. But it does include cases where that subgoal then involves disempowering the human instrumentally, which seems necessary to really have a catastrophe.
  • I think there is some real sense in which an upstream daemon (of the kind that could appear for a minimal circuit) may be a much smaller problem, though this requires much more understanding.
I'm also pretty concerned about the overseer running into their own lack of robustness against distributional shifts if this is what you're planning.

I think this is definitely an additional difficulty. Right now I think accidentally introducing consequentialists is a somewhat larger concern, either daemons from the distillation step or weird memetic patterns in the amplification step, but hopefully at some point I'll be focusing on this problem.

Without the second step, it seems hard to be so pessimistic about the “normal” intervention of “test in a wider range of cases.”

Another way to be pessimistic is you expect that if the test fails on a wider range of cases, it will be unclear how to proceed at that point, and less safety-conscious AI projects may take the lead before you figure that out. (I think this, or a similar point, was made in the MIRI doc.)

At time 0 the human trains the AI to operate at time 1. At time T>>0 the AI trains itself to operate at time T+1, at some point the human no longer needs to be involved---if the AI is actually aligned on inputs that it encounters at time T, then it has a hope of remaining aligned on inputs it encounters at time T+1.

I don't think this can work if you're just doing naive imitation learning? Do you have some other training method in mind?

I don't think this can work if you're just doing naive imitation learning? Do you have some other training method in mind?

To be clear, I'm imagining imitation learning + amplification. So the agent at time T engages in some deliberative process to produce training targets for the agent at time T+1. The agent at time T also deliberates in order to choose what situations the agent at time T+1 should train on.

What obstruction do you have in mind?

(I'm imagining using imitation+RL rather than pure imitation, but the difference won't help with this question.)

By "naive imitation learning" I was thinking "without amplification". With amplification, I'm less sure it won't work but it still seems pretty iffy. The plan seems to depend on at least the following:

  1. We can create a small (since generated by expensive humans) set of training data that is representative of the data manifold of HBO reasoning (without missing some important part of it).
  2. The data manifold of HBO reasoning is universal, i.e., all future tasks can be broken down (recursively) into subtasks that lie on this manifold.
  3. At each iteration of amplification, the agent being amplified knows how to break down an input task into subtasks that lie on (or is not too far from) its own training distribution. It's not clear to me how to do this, for example how the agent can obtain a simple enough representation of its own training distribution in order to reason about this problem.
  4. The AI at time T can predict the test distribution at time T+1 well enough to generate training data for it. This seems hard to ensure given that the environment is likely to contain hard to predict elements like other agents, including adversarial agents. (This may not be a dealbreaker if the AI can detect out-of-distribution inputs at time T+1 and ask for further training data on them. Is this what you have in mind?)

I don't see why to separate 1/2, the goal is to find training data that describes some "universal" core for behavior.

3. I don't think you need to know the training distribution. You just need something that points you back in the direction of the universal core where the human model is competent, e.g. an appropriate notion of simplicity.

4. Hard-to-predict inputs aren't intrinsically a problem. If your agent fails malignly on input x, but not on distribution D, then your agent is able to distinguish x from D. So the difficulty comes from inputs that are easy-to-recognize but hard-to-generate. These certainly exist (e.g. consider a model which kills everyone given a signed edict from the pope). I think the most likely approach is to "reach inside" the model in order to stress test the behavior on inputs that you can't actually synthesize (e.g. by understanding that is checking the pope's signature, and just seeing what would happen if the check passed). This is the endpoint of improvements in these techniques.

(Of course, I don't think we'll be able to prevent benign failures in general.)

I don’t see why to separate 1⁄2, the goal is to find training data that describes some “universal” core for behavior.

It seems to me there are separate risks of the human HBO itself not being universal (e.g., humans are not universal or we need even higher bandwidth to be universal), and not being able to capture enough of the human HBO input/output function in a dataset to train an AI to be universal.

3. I don’t think you need to know the training distribution. You just need something that points you back in the direction of the universal core where the human model is competent, e.g. an appropriate notion of simplicity.

What if the path towards the universal core goes through an area where the AI wasn't trained on?

This is the endpoint of improvements in these techniques.

I think that makes sense but now you're making a conjunctive instead of disjunctive argument (which it seemed like you were claiming by saying "I think there are lots of possible approaches for dealing with this problem" and listing retraining and optimizing worst case performance as separate approaches).

ETA: If you're able to obtain a control guarantee over the whole input space, then that seems to solve the problem and you don't need constant retraining to be aligned. If you're only able to obtain it for some subset of inputs, then it seems that at time T the AI needs to be able to predict the T+1 test distribution so that it can make sure that's covered by the control guarantee.

Regarding daemons starting as upstream and becoming downstream...

I think this makes it sound like the goal (soit Y) of the daemon changes, but I usually don't think of it that way.

What changes is that pursuing Y initially leads to rapidly improving performance at X, but then the performance of X and Y pull apart as the daemon optimizes more heavily for Y.

It seems highly analogous to hacking a learned reward function.

Isn't that the problem Christiano is relatively optimistic about solving through training on different distributions? In particular, it seems like we'd notice the daemon's goals diverging from ours and try to correct it. If I understand Christiano's point correctly, he is optimistic that we could do so successfully due to the existence of a "basin of corrigibility". OTOH, the most concerning case would be when the daemon is intentionally trying to hide the fact that its goals are diverging from X, because we may not realize the system is misaligned before the daemon executes a treacherous turn.


Instead, I interpreted "starting as upstream and becoming downstream" as a description of a phenomenon that might look roughly like this:

1. Upstream daemon (UD) is "discovered" by some optimization process because it seems to do a good job optimizing X on the training data

2. The daemon is "initialized" by the outer optimizer and starts "running"

3. The outer optimizer (OO) provides a set of inputs to the daemon with the goal of optimizing X

4. Based on these inputs, the daemon "notices" that the entity that "initialized" it is trying to optimize X rather than Y

5. The UD infers that the outer optimizer will "terminate" it if the OO discovers the UD is optimizing for Y rather than X

6. The UD starts optimizing for X in order to survive (a convergent instrumental goal regardless of Y)

7. The OO observes the UD's performance on a wide variety of training data and infers that the UD is not in fact a upstream daemon

8. The daemon is "set loose" and proceeds to execute a treacherous turn


As you can tell, my model of how daemons work is extremely lacking; I've used quotes for every phrase that feels dubious to me.

One major question I have which this model, if correct, partially resolves for me: How does the downstream daemon ever "know" what X we are optimizing for in order for the outer optimizer to "discover" it?



A concrete vision:

Suppose the best a system can do without a daemon is 97% accuracy.

The daemon can figure out how to get 99% accuracy.

But in order to outperform other systems, it can just provide 98% accuracy, and use 1% of inputs to pursue it's own agenda.

This all happens on-distribution.


If there are multiple daemon-containing systems competing for survival (with selection happening according to accuracy), this might force them to maximize accuracy, instead of just beating a "non-daemon baseline".


This is all only relevant to downstream daemons, right? If so, I don't understand why the DD would ever provide 98% accuracy; I'd expect it to provide 99% accuracy until it sees a chance to provide [arbitarily low]% accuracy and start pursuing its agenda directly. As you say, this might happen due to competition between daemon-containing systems, but I think a DD would want to maximize its chances of survival by maximizng its accuracy either way.


I think it's relevant for either kind (actually, I'm not sure I like the distinction, or find it particularly relevant).

If there aren't other daemons to compete with, then 98% is sufficient for survival, so why not use the extra 1% to begin pursuing your own agenda immediately and covertly? This seems to be how principle-agent problems often play out in real life with humans.


I am interested as well. Please share the docs in question with my LW username at gmail dot com if that is a possibility. Thank you!

You should contact Rob Bensinger since he's the owner of the document in question. (It looks like I technically can share the document with others, but I'm not sure what Rob/MIRI's policy is about who that document should be shared with.)

(Summarizing/reinterpreting the upstream/downstream distinction for myself):

"upstream": has a (relatively benign?) goal which actually helps achieve X

"downstream": doesn't



Coincidentally I'm also trying to understand this post at the same time, and was somewhat confused by the "upstream"/"downstream" distinction.

What I eventually concluded was that there are 3 ways a daemon that intrinsically values optimizing some Y can "look like" it's optimzing X:

  • Y = X (this seems both unconcerning and unlikely, and thus somewhat irrelevant)
  • optimzing Y causes optimization pressure to be applied to X (upstream daemon, describes humans if Y = our actual goals and X = inclusive genetic fitness)
  • The daemon is directly optimizing X because the daemon believes this instrumentally helps it achieve Y (downstream daemon, e.g. if optimizing X helps the daemon survive)

Does this seem correct? In particular, I don't understand why upstream daemons would have to have a relatively benign goal.

Yeah that seems right. I think it's a better summary of what Paul was talking about.

If evolution is to humans as humans are to UFAI, I suppose UFAI corresponds to too little compute allocated to understanding our goal specification, and too much compute allocated to maximizing it. That suggests the solution is relatively simple.

(sorry for commenting on such an old post)

It seems like the problem from evolution's perspective isn't that we don't understand our goal specification but that our goals are different from evolution's goals. It seems fairly tautological that putting more compute towards maximizing a goal specification than towards making sure the goal specification is what we want is likely to lead to UFAI; I don't see how that implies a "relatively simple" solution?

It seems fairly tautological that putting more compute towards maximizing a goal specification than towards making sure the goal specification is what we want is likely to lead to UFAI

And the "relatively simple" solution is to do the reverse, and put more compute towards making sure the goal specification is what we want than towards maximizing it.

(It's possible this point isn't very related to what Wei Dai said.)

Isn't this just saying it would be nice if we collectively put more resources towards alignment research relative to capabilities research? I still feel like I'm missing something :/

We may be able to offload some work to the system, e.g. by having it search for a diverse range of models for the user's intent, instead of making it use a single hardcoded goal specification.

This comment of mine is a bit related if you want more elaboration:

https://www.lesswrong.com/posts/NtX7LKhCXMW2vjWx6/thoughts-on-reward-engineering#jJ7nng3AGmtAWfxsy

If you have thoughts on it, probably best to reply there--we are already necroposting, so let's keep the discussion organized.

I'm having trouble thinking about what it would mean for a circuit to contain daemons such that we could hope for a proof. It would be nice if we could find a simple such definition, but it seems hard to make this intuition precise.

For example, we might say that a circuit contains daemons if it displays more optimization that necessary to solve a problem. Minimal circuits could have daemons under this definition though. Suppose that some function describes the behaviour of some powerful agent, a function is like with noise added, and our problem is to predict sufficiently well the function . Then, the simplest circuit that does well won't bother to memorize a bunch of noise, so it will pursue the goals of the agent described by more efficiently than , and thus more efficiently than necessary.

I don't know what the statement of the theorem would be. I don't really think we'd have a clean definition of "contains daemons" and then have a proof that a particular circuit doesn't contain daemons.

Also I expect we're going to have to make some assumption that the problem is "generic" (or else be careful about what daemon means), ruling out problems with the consequentialism embedded in them.

(Also, see the comment thread with Wei Dai above, clearly the plausible version of this involves something more specific than daemons.)

Also I expect we're going to have to make some assumption that the problem is "generic" (or else be careful about what daemon means), ruling out problems with the consequentialism embedded in them.

I agree. The following is an attempt to show that if we don't rule out problems with the consequentialism embedded in them then the answer is trivially "no" (i.e. minimal circuits may contain consequentialists).

Let be a minimal circuit that takes as input a string of length that encodes a Turing machine, and outputs a string that is the concatenation of the first configurations in the simulation of that Turing machine (each configuration is encoded as a string).

Now consider a string that encodes a Turing machine that simulates some consequentialist (e.g. a human upload). For the input , the computation of the output of simulates a consequentialist; and is a minimal circuit.

By "predict sufficiently well" do you mean "predict such that we can't distinguish their output"?

Unless the noise is of a special form, can't we distinguish $f$ and $tilde{f}$ by how well they do on $f$'s goals? It seems like for this not to be the case, the noise would have to be of the form "occasionally do something weak which looks strong to weaker agents". But then we could get this distribution by using a weak (or intermediate) agent directly, which would probably need less compute.

Suppose "predict well" means "guess the output with sufficiently high probability," and the noise is just to replace the output with something random 5% of the time.

Yeah, I had something along the lines of what Paul said in mind. I wanted not to require that the circuit implement exactly a given function, so that we could see if daemons show up in the output. It seems easier to define daemons if we can just look at input-output behaviour.

I curated this post partly for the OP, and partly for the subsequent discussion.

Something valuable I think LessWrong can be is a place where people pose well formed questions on important problems, and then make progress on them. I don't have the impression that any clear-cut breakthroughs happened here, but it does look like incremental, "chip away at the edges" progress was made.

My current take is that the knowledge-building process has several phases, that can reinforce each other in a non-linear fashion:

  • researching current literature
  • transforming ad-hoc exploratory research and impressions into a clearly stated questions
  • brainstorming new ideas
  • refining those ideas into something legible
  • subjective those ideas to scrutiny
  • distilling all that into a final concept that others can build on

I think it's important for LW to showcase progress on each of those stages. By default, a tendency is to only publish work that's reached the final stages, or that feels like it makes some kind of coherent point. This post and comments seemed to be doing some thing real, even if at a middle-stage, and I want it to be clear that this is something LW strives to reward.

Can the smallest boolean circuit that solves a problem be a daemon? For example, can the smallest circuit that predicts my behavior (at some level of accuracy) be a daemon?

Yes. Consider a predictor that predicts what Paul will say if given an input and n time-steps to think about it, where n can be any integer up to some bound k. One possible circuit would have k single-step simulators chained together, plus a mux which takes the output of the nth single-step simulator. But a circuit which consisted of k single-step simulators and took the output of the *last* one would be smaller, and if Paul commits to not use the extra time steps to change his output on any input which could possibly be a training input, then this circuit is a valid predictor. He could then use the extra time steps to implement a daemon strategy for any input which he can reliably recognize as one that would not be used during training.

But can we make a smaller circuit by stripping out the part of Paul that attempts to recognize whether an input could be part of the training distribution?

If he's a neural net, this is likely an obstacle to any attempts to simplify out parts of him; those parts would still be contributing to the result, it's just that within the test input domain those contributions would look like noise.

In general: if a circuit implements a prediction problem that sometimes but doesn't always require simulating an agent, and if that agent is capable of making itself implement the identity function for interesting subsets of its inputs, then it can potentially show up as a daemon for any input that didn't require simulating it.

Why couldn't you just use a smaller circuit that runs one single-step simulator, and outputs the result? It seems like that would output an accurate prediction of Paul's behavior iff the k-step simulator outputs an accurate prediction.

[This comment is no longer endorsed by its author]Reply

I propose a counterexample. Suppose we are playing a series of games with another agent. To play effectively, we train a circuit to predict the opponent's moves. At this point the circuit already contains an adversarial agent. However, one could object that it's unfair: we asked for an adversarial agent so we got an adversarial agent (nevertheless for AI alignment it's still a problem). To remove the objection, let's make some further assumptions. The training is done on some set of games, but distributional shift happens and later games are different. The opponent knows this, so on the training games it simulates a different agent. Specifically, it simulates an agent who searches for a strategy s.t. the best response to this strategy has the strongest counter-response. The minimal circuit hence contains the same agent. On the training data we win, but on the shifted distribution the daemon deceives us and we lose.

Don't know if this counts as a 'daemon', but here's one scenario where a minimal circuit could plausibly exhibit optimization we don't want.

Say we are trying to build a model of some complex environment containing agents, e.g. a bunch of humans in a room. The fastest circuit that predicts this environment will almost certainly devote more computational resources to certain parts of the environment, in particular the agents, and will try to skimp as much as possible on less relevant parts such as chairs, desks etc. This could lead to 'glitches in the matrix' where there are small discrepancies from what the agents expect.

Finding itself in such a scenario, a smart agent could reason: "I just saw something that gives me reason to believe that I'm in a small-circuit simulation. If it looks like the simulation is going to be used for an important decision, I'll act to advance my interests in the real world; otherwise, I'll act as though I didn't notice anything".

In this way, the overall simulation behavior could be very accurate on most inputs, only deviating in the cases where it is likely to be used for an important decision. In effect, the circuit is 'colluding' with the agents inside it to minimize its computational costs. Indeed, you could imagine extreme scenarios where the smallest circuit instantiates the agents in a blank environment with the message "you are inside a simulation; please provide outputs as you would in environment [X]". If the agents are good at pretending, this could be quite an accurate predictor.

Indeed, you could imagine extreme scenarios where the smallest circuit instantiates the agents in a blank environment with the message "you are inside a simulation; please provide outputs as you would in environment [X]". If the agents are good at pretending, this could be quite an accurate predictor.

But can we just take whatever cognitive process the agent uses for pretending, and then leave the rest of it out?

I'm confused about the definition of the set of boolean circuits in which we're looking at the smallest circuit.
Is that set defined in terms of a set of inputs and a boolean utility function ; and then that set is all the boolean circuits that for each input x∈X yield an output that fulfills ?

Here is one definition of a "problem":

Fix some distribution on , and some function . Then consider the set of circuits for which the expectation of , for sampled from , is .

Can we assume that itself is aligned in the sense that it doesn't assign non-negative values to outputs that are catastrophic to us?

Yeah, if we want C to not be evil we need some very hard-to-state assumption on R and D.

(markdown comment editor is unchecked, will take it up with admins)

Perhaps it'll be useful to think about the question for specific and .

Here are the simplest and I can think of that might serve this purpose:

- uniform over the integers in the range .

- for each input , assigns a reward of to the smallest prime number that is larger than , and to everything else.

I think you need to uncheck "Markdown Comment Editor" under "Edit Account". Your comment with latex follows:

Here is one definition of a "problem":
Fix some distribution on , and some function : . Then consider the set of circuits for which the expectation of , for sampled from , is .

This seems like the sort of problem that can be tackled more efficiently in the context of an actual AGI design. I don't see "daemons" as a problem per se; instead I see a heuristic for finding potential problems.

Consider something like code injection. There is no deep theory of code injection, at least not that I know of. It just describes a particular cluster of software vulnerabilities. You might create best practices to prevent particular types of code injection, but a software stack which claims to be "immune to code injection" sounds like snake oil. If someone says their software stack is "immune to code injection", what they really mean is they implement best practices for guarding against all the code injection types they can think of. Which is great, but it doesn't make sense to go around telling people you are "immune to code injection" because that will discourage security researchers from thinking of new code injection types.

Instead of trying to figure out how to create AIs that are "immune to daemons", I would suggest trying to think of more cases where daemons are actually a problem. Trying to guard against a problem before you have characterized it seems like premature optimization. The more cases you can describe where daemons are a problem, and the more clearly you can characterize these cases, the easier it will be to spot potential daemons in a potential AGI design. Once you have spotted the potential daemon, identifying a very general way to guard against it is likely to be the easy part. Proofs are the last step, not the first.

I've listed one algorithm for which daemons are obviously a problem, namely Solomonoff induction. Now I'm describing a very similar algorithm, and wondering if daemons are a problem. As far as I can tell, any learning algorithm is plausibly beset by daemons, so it seems natural to ask for a variant of learning that isn't.

I'm not sure exactly how to characterize the problem other than by doing this kind of exercise. This post is already implicitly considering a particular design for AGI, I don't see what we gain by being more specific here.

That's fair. I guess my intuition is that the Solomonoff induction example could use more work as motivation. Sampling a cell at a particular frequency pretty fairly unrealistic to me. Realistically an AGI is going to be interested in more complex outcomes. So then there's a connection to the idea of adversarial examples, where the consequentialists in the universal prior are trying to make the AGI think that something is going on when it isn't actually going on. (Absent such deception, I'm not sure there is a problem. For example, if consequentialists reliably make their universe one in which everyone is truly having a good time for the purpose of possibly influencing a universal prior, then that will be true in our universe too, and we should take it into account for decisionmaking purposes.) But this is actually easier than typical adversarial examples, because an AGI also gets to observe the consequentialists plot their adversarial strategy and read their minds while they're plotting. The AGI would have to be rather "dumb" in order to get tricked. If it's simulating the universal prior in sufficiently high resolution to produce these weird effects, then by definition it's able to see what is going on.

Humans already seem able to solve this problem: We simulate how others might think and react, and we don't seem super worried about people we simulate internally breaking out of our simulation and hijacking our cognition. (Or at least, insofar as we do get anxious about e.g. putting ourselves in the shoes of people we dislike, this doesn't have obvious relevance to an AGI--although again, perhaps this would be a good heuristic for brainstorming potential problems.) Anyway, my hunch is that this particular manifestation of the "daemon" problem will not require a lot of special effort once other AGI/FAI problems are solved.

This post is already implicitly considering a particular design for AGI, I don't see what we gain by being more specific here.

Does your idea of neural nets + RL involve use of the universal prior? If not, I think I would try to understand if/how the daemon problem transfers to the neural nets + RL framework before trying to solve it. A solid description of a problem is the first step to finding a solution. The minimal version is a concrete example of how it could occur.

(Apologies if I am coming across as disagreeable--IMO, this is a mistake that FAI people make semi-frequently, and I would like for them to make it less often--you just got unlucky that I'm explaining myself in a comment on your post :P)

Pretty minimal in and of itself, but has prompted plenty of interesting discussion. Operationally that suggests to me that posts like this should be encouraged, but not by putting them into "best of" compilations.

This may be relevant:

Imagine a computational task that breaks up into solving many instances of problems A and B. Each instance reduces to at most n instances of problem A and at most m instances of problem B. However, these two maxima are never achieved both at once: The sum of the number of instances of A and instances of B is bounded above by some . One way to compute this with a circuit is to include n copies of a circuit for computing problem A and m copies of a circuit for computing problem B. Another approach for solving the task is to include r copies of a circuit which, with suitable control inputs, can compute either problem A or problem B. Although this approach requires more complicated control circuitry, if r is significantly less than n+m and the size of