Summary: I’m going to give a $10k prize to the best evidence that my preferred approach to AI safety is doomed. Submit by commenting on this post with a link by April 20.

I have a particular vision for how AI might be aligned with human interests, reflected in posts at ai-alignment.com and centered on iterated amplification.

This vision has a huge number of possible problems and missing pieces; it’s not clear whether these can be resolved. Many people endorse this or a similar vision as their current favored approach to alignment, so it would be extremely valuable to learn about dealbreakers as early as possible (whether to adjust the vision or abandon it).

Here’s the plan:

  • If you want to explain why this approach is doomed, explore a reason it may be doomed, or argue that it’s doomed, I strongly encourage you to do that.
  • Post a link to any relevant research/argument/evidence (a paper, blog post, repo, whatever) in the comments on this post.
  • The contest closes April 20.
  • You can submit content that was published before this prize was announced.
  • I’ll use some process to pick my favorite 1-3 contributions. This might involve delegating to other people or might involve me just picking. I make no promise that my decisions will be defensible.
  • I’ll distribute (at least) $10k amongst my favorite contributions.

If you think that some other use of this money or some other kind of research would be better for AI alignment, I encourage you to apply for funding to do that (or just to say so in the comments).

This prize is orthogonal and unrelated to the broader AI alignment prize. (Reminder: the next round closes March 31. Feel free to submit something to both.)

This contest is not intended to be “fair”---the ideas I’m interested in have not been articulated clearly, so even if they are totally wrong-headed it may not be easy to explain why. The point of the exercise is not to prove that my approach is promising because no one can prove it’s doomed. The point is just to have a slightly better understanding of the challenges.

Edited to add the results:

  • $5k for this post by Wei_Dai, and the preceding/following discussion, which make some points about the difficulty of learning corrigibility in small pieces.
  • $3k for Point 1 from this comment by eric_langlois, an intuition pump for why security amplification is likely to be more difficult than you might think.
  • $2k for this post by William_S, which clearly explains a consideration / design constraint that would make people less optimistic about my scheme. (This fits under "summarizing/clarifying" rather than a novel observation.)

Thanks to everyone who submitted a criticism! Overall I found this process useful for clarifying my own thinking (and highlighting places where I could make it easier to engage with my research by communicating more clearly).

Background on what I’m looking for

I’m most excited about particularly thorough criticism that either makes tight arguments or “plays both sides”---points out a problem, explores plausible responses to the problem, and shows that natural attempts to fix the problem systematically fail.

If I thought I had a solution to the alignment problem I’d be interested in highlighting any possible problem with my proposal. But that’s not the situation yet; I’m trying to explore an approach to alignment and I’m looking for arguments that this approach will run into insuperable obstacles. I'm already aware that there are plenty of possible problems. So a convincing argument is trying to establish a universal quantifier over potential solutions to a possible problem.

On the other hand, I’m hoping that we'll solve alignment in a way that knowably works under extremely pessimistic assumptions, so I’m fine with arguments that make weird assumptions or consider weird situations / adversaries.

Examples of interlocking obstacles I think might totally kill my approach:

  • Amplification may be doomed because there are important parts of cognition that are too big to safely learn from a human, yet can’t be safely decomposed. (Relatedly, security amplification might be impossible.)
  • A clearer inspection of what amplification needs to do (e.g. building a competitive model of the world in which an amplified human can detect incorrigible behavior) may show that amplification isn’t getting around the fundamental problems that MIRI is interested in and will only work if we develop a much deeper understanding of effective cognition.
  • There may be kinds of errors (or malign optimization) that are amplified by amplification and can’t be easily controlled (or this concern might be predictably hard to address in advance by theory+experiment).
  • Corrigibility may be incoherent, or may not actually be easy enough to learn, or may not confer the kind of robustness to prediction errors that I’m counting on, or may not be preserved by amplification.
  • Satisfying safety properties in the worst case (like corrigibility) may be impossible. See this post for my current thoughts on plausible techniques. (I’m happy to provisionally grant that optimization daemons would be catastrophic if you couldn’t train robust models.)
  • Informed oversight might be impossible even if amplification works quite well. (This is most likely to be impossible in the context of determining what behavior is catastrophic.)

I value objections but probably won't have time to engage significantly with most of them. That said: (a) I’ll be able to engage in a limited way, and will engage with objections that significantly shift my view, (b) thorough objections can produce a lot of value even if no proponent publicly engages with them, since they can be convincing on their own, (c) in the medium term I’m optimistic about starting a broader discussion about iterated amplification which involves proponents other than me.

I think our long-term goal should be to find, for each powerful AI technique, an analog of that technique that is aligned and works nearly as well. My current work is trying to find analogs of model-free RL or AlphaZero-style model-based RL. I think that these are the most likely forms for powerful AI systems in the short term, that they are particularly hard cases for alignment, and that they are likely to turn up alignment techniques that are very generally applicable. So for now I’m not trying to be competitive with other kinds of AI systems.

The results:

  • $5k for this post by Wei_Dai, and the preceding/following discussion, which make some points about the difficulty of learning corrigibility in small pieces.
  • $3k for Point 1 from this comment by eric_langlois, an intuition pump for why security amplification is likely to be more difficult than you might think.
  • $2k for this post by William_S, which clearly explains a consideration / design constraint that would make people less optimistic about my scheme. (This fits under "summarizing/clarifying" rather than a novel observation.)

Thanks to everyone who submitted a criticism! Overall I found this process useful for clarifying my own thinking (and highlighting places where I could make it easier to engage with my research by communicating more clearly).

Can you link this comment from the OP? I skimmed the whole thread looking for info on who won prizes and managed to miss this on my first pass.

Point 1: Meta-Execution and Security Amplification

I have a comment on the specific difficulty of meta-execution as an approach to security amplification. I believe that while the framework limits the "corruptibility" of the individual agents, the system as a whole is still quite vulnerable to adversarial inputs.

As far as I can tell, the meta-execution framework is Turing complete. You could store the tape contents within one pointer and the head location in another, or there's probably a more direct analogy with lambda calculus. And by Turing complete I mean that there exists some meta-execution agent that, when given any (suitably encoded) description of a Turing machine as input, executes that Turing machine and returns its output.
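To make that concrete, here is a minimal sketch (purely illustrative; nothing like this appears in the meta-execution posts) of how a recursive decomposition in which each node inspects only a constant-sized view of the state can nonetheless execute an arbitrary encoded Turing machine:

```python
# Illustrative sketch only. Each call plays the role of one meta-execution
# node: it looks at a constant-sized view (the current machine state and the
# single tape cell under the head) and delegates the rest of the computation
# to a sub-query. The tape is passed along as an opaque, pointer-like handle.

def step(machine, state, tape, head):
    if state == "halt":
        return tape                                   # computation finished
    symbol = tape.get(head, 0)                        # constant-sized view
    new_state, new_symbol, move = machine[(state, symbol)]
    tape = dict(tape)
    tape[head] = new_symbol
    # "Sub-query": another node continues from the updated configuration.
    return step(machine, new_state, tape, head + (1 if move == "R" else -1))

# A tiny 2-state machine that writes two 1s and halts.
machine = {("A", 0): ("B", 1, "R"), ("B", 0): ("halt", 1, "R")}
print(step(machine, "A", {}, 0))   # -> {0: 1, 1: 1}
```

No individual call ever sees more than a few symbols, yet the tree as a whole runs whatever machine the input encodes, which is exactly the worry.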

Now, just because the meta-execution framework is Turing complete, this doesn't mean that any particular agent created in this manner is Turing complete. If our agents were in practice Turing complete, I feel like that would defeat the security-amplification purpose of meta-execution. Maybe the individual nodes cannot be corrupted by the limited input they see, but the system as a whole could be made to perform arbitrary computation and produce arbitrary output on specific inputs. The result of "interpret the input as a Turing machine and run it" is probably not the correct or aligned response to those inputs.

Unfortunately, it seems to be empirically the case that computational systems become Turing complete very easily; there are many examples of systems that turn out to be Turing complete despite never being designed for general computation.

In particular, return-oriented programming is interesting as an adversarial attack on pre-written programs: it takes advantage of the fact that limited control over execution flow, in the presence of existing code, often forms a Turing complete system, despite the attacker having no control over the existing code.

So I suspect that any meta-computation agent that is practically useful for answering general queries is likely to be Turing complete, and that it will be difficult to avoid Turing completeness (up to a resource limit, which doesn't help the arbitrary code execution problem).

An addition to this argument thanks to William Saunders: We might end up having to accept that our agent will be Turing complete and hope that the malicious inputs are hard to find or work with low probability. But in that case, limiting the amount of information seen by individual nodes may make it harder for the system to detect and avoid these inputs. So what you gain in per-node security you lose in overall system security.

Point 2: IDA in general

More broadly, my main concern with IDA isn't that it has a fatal flaw but that it isn't clear to me how the system helps with ensuring alignment compared to other architectures. I do think that IDA can be used to provide modest improvement in capabilities with small loss in alignment (not sure if better or worse than augmenting humans with computational power in other ways), but that the alignment error is not zero and increases the larger the improvement in capabilities.

Argument:

  1. It is easy and tempting for the amplification to result in some form of search ("what is the outcome of this action?" "what is the quality of this outcome?" repeat), which fails if the human might misevaluate some states (see the toy sketch after this list).
  2. To avoid that, H needs to be very careful about how they use the system.
  3. I don't believe that it is practically possible to formally specify the rules H needs to follow in order to produce an aligned system (or if you can, it's just as hard as specifying the rules for a CPU + RAM architecture). You might disagree with this premise, in which case the rest doesn't follow.
  4. If we can't be confident of the rules H needs to follow, then it is very risky just asking H to act as best as they can in this system without knowing how to prevent things from going wrong.
  5. Since I don't believe specifying IDA-specific rules is any easier than for other architectures, it seems unlikely to me that you'd have a proof about the alignment or corrigibility of such a system that wouldn't be more generally applicable, in which case why not use a more direct architecture with fewer approximation steps?
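To make point 1 concrete, here is a toy sketch (invented names and numbers, not from the argument above) of the search pattern described there, and why a single systematically misevaluated outcome is enough to break it:

```python
# Toy sketch: amplification implemented as "predict the outcome, have H rate
# it, pick the best" silently optimizes toward outcomes that H overrates.

def amplified_choice(actions, predict_outcome, h_rating):
    """Pick the action whose predicted outcome H rates highest."""
    return max(actions, key=lambda a: h_rating(predict_outcome(a)))

# Hypothetical example: H systematically overrates one kind of outcome.
predicted = {"safe plan": "modest success", "risky plan": "looks great, actually bad"}
h_rating = {"modest success": 0.6, "looks great, actually bad": 0.9}.get
print(amplified_choice(["safe plan", "risky plan"], predicted.get, h_rating))
# -> "risky plan": the search faithfully optimizes H's mistaken evaluation
```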

To expand on the last point, if A[*], the limiting agent, is aligned with H then it must contain at least implicitly some representation of H's values (retrievable through IRL, for example). And so must A[i] for every i. So the amplification and distillation procedures must preserve the implicit values of H. If we can prove that the distillation preserves implicit values, then it seems plausible that a similar procedure with a similar proof would be able to just directly distill the values of H explicitly, and then we can train an agent to behave optimally with respect to those.

I find your point 1 very interesting but point 2 may be based in part on a misunderstanding.

To expand on the last point, if A[*], the limiting agent, is aligned with H then it must contain at least implicitly some representation of H’s values (retrievable through IRL, for example). And so must A[i] for every i.

I think this is not how Paul hopes his scheme would work. If you read https://www.lesswrong.com/posts/yxzrKb2vFXRkwndQ4/understanding-iterated-distillation-and-amplification-claims, it's clear that in the LBO variant of IDA, A[1] can't possibly learn H's values. Instead A[1] is supposed to learn "corrigibility" from H and then after enough amplifications, A[n] will gain the ability to learn values from some external user (who may or may not be H) and then the "corrigibility" that was learned and preserved through the IDA process is supposed to make it want to help the user achieve their values.

I won't deny probably misunderstanding parts of IDA, but if the point is to learn corrigibility from H couldn't you just say that corrigibility is a value that H has? Then use the same argument with "corrigibility" in place of "value"? (This assumes that corrigibility is entirely defined with reference to H. If not, replace it with the subset that is defined entirely from H; if that is empty, then remove H.)

If A[*] has H-derived-corrigibility then so must A[1], so distillation must preserve H-derived-corrigibility, so we could instead directly distill H-derived-corrigibility from H, which can be used to directly train a powerful agent with that property, which can then be trained from some other user.

so we could instead directly distill H-derived-corrigibility from H which can be used to directly train a powerful agent with that property

I'm imagining the problem statement for distillation being: we have a powerful aligned/corrigible agent. Now we want to train a faster agent which is also aligned/corrigible.

If there is a way to do this without starting from a more powerful agent, then I agree that we can skip the amplification process and jump straight to the goal.

So a convincing argument is trying to establish a universal quantifier over potential solutions to a possible problem.

This seems like a hard thing to do that most people may not have much experience with (especially since the problems are only defined informally at this point). Can you link to some existing such arguments, either against this AI alignment approach (that previously caused you to change your vision), or on other topics, to give a sense of what kinds of techniques might be helpful for establishing such a universal quantifier?

For example should one try to define the problem formally and then mathematically prove that no solution exists? But how does one show that there's not an alternative formal definition of the problem (that still captures the essence of the informal problem) for which a solution does exist?

Some examples that come to mind:

  • This comment of yours changed my thinking about security amplification by cutting off some lines of argument and forced me to lower my overall goals (though it is simple enough that it feels like it should have been clear in advance). I believe the scheme overall survives, as I discussed at the workshop, but in a slightly different form.
  • This post by Jessica both does a good job of overviewing some concerns and makes a novel argument (if the importance weight is slightly wrong then you totally lose) that leaves me very skeptical about any importance-weighting approach to fixing Solomonoff induction, which in turn leaves me more skeptical about "direct" approaches to benign induction.
  • In this post I listed implicit ensembling as an approach to robustness. Between Jessica's construction described here and discussions with MIRI folk arguing persuasively that the number of extra bits needed to get honesty was large enough that even a good KWIK bound would be mediocre (partially described by Jessica here), I ended up pessimistic.

None of these posts use heavy machinery.

To clarify, when I say "trying to establish" I don't mean "trying to establish in a rigorous way," I just mean that the goal of the informal reasoning should be the informal conclusion "we won't be able to find a way around this problem." It's also not a literal universal quantifier, in the same way that cryptography isn't up against a literal universal quantifier, so I was doubly sloppy.

I don't think that a mathematical proof is likely to be convincing on its own (as you point out, there is a lot of slack in the choice of formalization). It might be helpful as part of an argument, though I doubt that's going to be where the action is.

I'm not bidding for the prize, because I'm judging the other prize and my money situation is okay anyway. But here's one possible objection:

You're hoping that alignment will be preserved across steps. But alignment strongly depends on decisions in extreme situations (very high capability, lots of weirdness), because strong AI is kind of an extreme situation by itself. I don't see why even the first optimization step will preserve alignment w.r.t. extreme situations, because that can't be easily tested. What if the tails come apart immediately?

This is related to your concerns about "security amplification" and "errors that are amplified by amplification", so you're almost certainly aware of this. More generally, it's a special case of Marcello's objection that says path dependence is the main problem. Even a decade later, it's one of the best comments I've ever seen on LW.

It seems like this objection might be empirically testable, and in fact might be testable even with the capabilities we have right now.

For example, Paul posits that AlphaZero is a special case of his amplification scheme. In his post on AlphaZero, he doesn't mention there being an aligned "H" as part of the set-up, but if we imagine there to be one, it seems like the "H" in the AlphaZero situation is really just a fixed, immutable calculation that determines the game state (win/loss/etc.) that can be performed with any board input, with no risk of the calculation being incorrectly performed, and no uncertainty of the result. The entire board is visible to H, and every board state can be evaluated by H. H does not need to consult A for assistance in determining the game state, and A does not suggest actions that H should take (H always takes one action). The agent A does not choose which portions of the board are visible to H. Because of this, "H" in this scenario might be better understood as an immutable property of the environment rather than an agent that interacts with A and is influenced by A.

My question is, to what degree is the stable convergence of AlphaZero dependent on these properties? And can we alter the setup of AlphaZero such that some or all of these properties are violated? If so, then it seems as though we should be able to actually code up a version in which H still wants to "win", but breaks the independence between A and H, and then see if this results in "weirder" or unstable behavior.

Clearly the agent will converge to the mean on unusual situations, since e.g. it has learned a bunch of heuristics that are useful for situations that come up in training. My primary concern is that it remains corrigible (or something like that) in extreme situations. This requires (a) corrigibility makes sense and is sufficiently easy-to-learn (I think it probably does but it's far from certain) and (b) something like these techniques can avoid catastrophic failures off distribution (I suspect they can but am even less confident).

One concern that I haven't seen anyone express yet is, if we can't discover a theory which assures us that IDA will stay aligned indefinitely as the amplifications iterate, it may become a risky yet extremely tempting piece of technology to deploy, possibly worsening the strategic situation relative to one where only obviously dangerous AIs like reinforcement learners can be built. If anyone is creating mathematical models of AI safety and strategy, it would be interesting to see if this intuition (that the invention of marginally less risky AIs can actually make things worse overall by increasing incentives to deploy risky AI) can be formalized in math.

A counter-argument here might be that this applies to all AI safety work, so why single out this particular approach. I think some approaches, like MIRI's HRAD, are more obviously unsafe or just infeasible without a strong theoretical framework to build upon, but IDA (especially the HBO variant) looks plausibly safe on its face, even if we never solve problems like how to prevent adversarial attacks on the overseer, or how to ensure that incorrigible optimizations do not creep into the system. Some policy makers are bound to not understand those problems, or see them as esoteric issues not worth worrying about when more obviously important problems are at hand (like how to win a war or not get crushed by economic competition).

Can iterated amplification recreate a human's ability for creative insight? By that I mean the phenomenon where after thinking about a problem for an extended period of time, from hours to years, a novel solution suddenly pops into your head seemingly out of nowhere. I guess under the hood what's probably happening is that you're building up and testing various conceptual frameworks for thinking about the problem, and using those frameworks and other heuristics to do a guided search of the solution space. The problem for iterated amplification is that we typically don't have introspective access to the conceptual framework building algorithms or the search heuristics that our brains learned or came up with over our lifetimes, so it's unclear how to break down these tasks when faced with a problem that requires creative insight to solve.

If iterated amplification needs to exhibit creative insight in order to succeed (not sure if you can sidestep the problem or find a workaround for it), I suggest that it be included in the set of tasks that Ought will evaluate for their factored cognition project.

EDIT: Maybe this is essentially the same as the translation example, and I'm just not understanding how you're proposing to handle that class of problems?

EDIT: Maybe this is essentially the same as the translation example, and I'm just not understanding how you're proposing to handle that class of problems

Yes, I think these are the same case. The discussion in this thread applies to both. The relevant quote from the OP:

I think our long-term goal should be to find, for each powerful AI technique, an analog of that technique that is aligned and works nearly as well. My current work is trying to find analogs of model-free RL or AlphaZero-style model-based RL.

I think "copy human expertise by imitation learning," or even "delegate to a human," raise different kinds of problems than RL. I don't think those problems all have clean answers.

Going back to the translation example, I can understand your motivation to restrict attention to some subset of all AI techniques. But I think it's reasonable for people to expect that if you're aiming to be competitive with a certain kind of AI, you'll also aim to avoid ending up not being competitive with minor variations of your own design (in this case, forms of iterated amplification that don't break down tasks into such small pieces). Otherwise, aren't you "cheating" by letting aligned AIs use AI techniques that their competitors aren't allowed to use?

To put it another way, people clearly get the impression from you that there's hope that IDA can simultaneously be aligned and achieve state of the art performance at runtime. See this post where Ajeya Cotra says exactly this:

The hope is that if we use IDA to train each learned component of an AI then the overall AI will remain aligned with the user’s interests while achieving state of the art performance at runtime — provided that any non-learned components such as search or logic are also built to preserve alignment and maintain runtime performance.

But the actual situation seems to be that at best IDA can either be aligned (if you break down tasks enough) or achieve state of the art performance (if you don't), but not both at the same time.

In general, if you have some useful but potentially malign data source (humans, in the translation example) then that's a possible problem---whether you learn from the data source or merely consult it.

You have to solve each instance of that problem in a way that depends on the details of the data source. In the translation example, you need to actually reason about human psychology. In the case of SETI, we need to coordinate to not use malign alien messages (or else opt to let the aliens take over).

Otherwise, aren't you "cheating" by letting aligned AIs use AI techniques that their competitors aren't allowed to use?

I'm just trying to compete with a particular set of AI techniques. Then every time you would have used those (potentially dangerous) techniques, you can instead use the safe alternative we've developed.

If there are other ways to make your AI more powerful, you have to deal with those on your own. That may be learning from human abilities that are entangled with malign behavior in complex ways, or using an AI design that you found in an alien message, or using an unsafe physical process in order to generate large amounts of power, or whatever.

I grant that my definition of the alignment problem would count "learn from malign data source" as an alignment problem, since you ultimately end up with a malign AI, but that problem occurs with or without AI and I don't think it is deceptive to factor that problem out (but I agree that I should be more careful about the statement / switch to a more refined statement).

I also don't think it's a particularly important problem. And it's not what people usually have in mind as a failure mode---I've discussed this problem with a few people, to try to explain some subtleties of the alignment problem, and most people hadn't thought about it and were pretty skeptical. So in those respects I think it's basically fine.

When Ajeya says:

provided that any non-learned components such as search or logic are also built to preserve alignment and maintain runtime performance.

This is meant to include things like "You don't have a malign data source that you are learning from." I agree that it's slightly misleading if we think that humans are such a data source.

I think “copy human expertise by imitation learning,” or even “delegate to a human,” raise different kinds of problems than RL. I don’t think those problems all have clean answers.

I think I can restate the problem as about competing with RL: Presumably eventually RL will be as capable as a human (on its own, without copying from or delegating to a human), including on problems that humans need to use "creative insight" on. In order to compete with such RL-based AI with an Amplification-based AI, it seems that H needs to be able to introspectively access their cognitive framework algorithms and search heuristics in order to use them to help break down tasks, but H doesn't have such introspective access, so how does Amplification-based AI compete?

If an RL agent can learn to behave creatively, then that implies that amplification from a small core can learn to behave creatively.

This is pretty clear if you don't care about alignment---you can just perform the exponential search within the amplification step, and then amplification is structurally identical to RL. The difficult problem is how to do that without introducing malign optimization. But that's not really about H's abilities.

This is pretty clear if you don’t care about alignment---you can just perform the exponential search within the amplification step, and then amplification is structurally identical to RL.

I don't follow. I think if you perform the exponential search within the amplification step, amplification would be exponentially slow whereas RL presumably wouldn't be? How would they be structurally identical? (If someone else understands this, please feel free to jump in and explain.)

The difficult problem is how to do that without introducing malign optimization.

Do you consider this problem to be inside your problem scope? I'm guessing yes but I'm not sure and I'm generally still very confused about this. I think it would help a lot if you could give a precise definition of what the scope is.

As another example of my confusion, an RL agent will presumably learn to do symbolic reasoning and perform arbitrary computations either inside its neural network or via an attached general purpose computer, so it could self-modify into or emulate an arbitrary AI. So under one natural definition of "compete", to compete with RL is to compete with every type of AI. You must not be using this definition but I'm not sure what definition you are using. The trouble I'm having is that there seems to be no clear dividing line between "internal cognition the RL agent has learned to do" and "AI technique the RL agent is emulating" but presumably you want to include the former and exclude the latter from your problem definition?

Another example is that you said that you exclude "all failures of competence" and I still only have a vague sense of what that means.

How would they be structurally identical? (If someone else understands this, please feel free to jump in and explain.)

AlphaZero is exactly the same as this: you want to explore an exponentially large search tree. You can't do that. Instead you explore a small part of the search tree. Then you train a model to quickly (lossily) imitate that search. Then you repeat the process, using the learned model in the leaves to effectively search a deeper tree. (Also see Will's comment.)
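Here is a rough sketch of that loop on a toy problem (hypothetical code; the lookup-table "distillation" stands in for training a network):

```python
# Toy amplify/distill loop. States are integers mod N; the "fast model" maps a
# state to a value estimate. Amplification is a shallow explicit search that
# consults the fast model at the leaves; distillation fits a new fast model to
# the search's outputs, so the next round effectively searches a deeper tree.

N = 101

def children(s):
    return [(2 * s) % N, (2 * s + 1) % N]

def weak_value(s):
    return 1.0 if s % 7 == 0 else 0.0        # the initial, weak evaluation

def tree_search(s, fast_model, depth):
    if depth == 0:
        return fast_model(s)
    return max(tree_search(c, fast_model, depth - 1) for c in children(s))

def distill(examples):
    table = dict(examples)                    # stand-in for fitting a model
    return lambda s: table.get(s, 0.0)

def iterate(rounds, depth):
    fast = weak_value
    for _ in range(rounds):
        examples = [(s, tree_search(s, fast, depth)) for s in range(N)]
        fast = distill(examples)              # imitate the (slow) search
    return fast

model = iterate(rounds=3, depth=2)
print(model(3))   # after 3 rounds, reflects roughly a 6-ply lookahead from 3
```

Each round only ever pays for a shallow search, but because the leaves are evaluated by the previous round's distilled model, the effective lookahead compounds across rounds.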

Do you consider this problem to be inside your problem scope? I'm guessing yes but I'm not sure and I'm generally still very confused about this.

For now let's restrict attention to the particular RL algorithms mentioned in the post, to make definitions clearer.

By default these techniques yield an unaligned AI.

I want a version of those techniques that produces aligned AI, which is trying to help us get what we want.

That aligned AI may still need to do dangerous things, e.g. "build a new AI" or "form an organization with a precise and immutable mission statement" or whatever. Alignment doesn't imply "never has to deal with a difficult situation again," and I'm not (now) trying to solve alignment for all possible future AI techniques.

We would have encountered those problems even if we replaced the aligned AI with a human. If the AI is aligned, it will at least be trying to solve those problems. But even as such, it may fail. And separately from whether we solve the alignment problem, we may build an incompetent AI (e.g. it may be worse at solving the next round of the alignment problem).

The goal is to get out an AI that is trying to do the right thing. A good litmus test is whether the same problem would occur with a secure human. (Or with a human who happened to be very smart, or with a large group of humans...). If so, then that's out of scope for me.

To address the example you gave: doing some optimization without introducing misalignment is necessary to perform as well as the RL techniques we are discussing. Avoiding that optimization is in scope.

There may be other optimization or heuristics that an RL agent (or an aligned human) would eventually use in order to perform well, e.g. using a certain kind of external aid. That's out of scope, because we aren't trying to compete with all of the things that an RL agent will eventually do (as you say, a powerful RL agent will eventually learn to do everything...) we are trying to compete with the RL algorithm itself.

We need an aligned version of the optimization done by the RL algorithm, not all optimization that the RL agent will eventually decide to do.

I think the way to do exponential search in amplification without being exponentially slow is to not try to do the search in one amplification step, but to start with smaller problems, learn how to solve those efficiently, and then use that knowledge to speed up the search in later rounds of iterated amplification.

Suppose we have some problem with branching factor 2 (ie. searching for binary strings that fit some criteria)

Start with agent A[0].

Amplify agent A[0] to solve problems which require searching a tree of depth d at cost 2^d.

Distill agent A[1], which uses the output of the amplification process to learn how to solve problems of depth d faster than the amplified A[0], ideally as fast as any other ML approach. One way would be to learn heuristics for which parts of the tree don't contain useful information, and can be pruned.

Amplify agent A[1], which can use the heuristics it has learned to prune the tree much earlier and solve problems of depth 2d at cost less than 2^(2d).

Distill agent A[2], which can now efficiently solve problems of depth 2d.

If this process is efficient enough, the training cost can be less than O(2^n) to get an agent that solves problems of depth n (and the runtime cost is as good as the runtime cost of the ML algorithm that implements the distilled agent).
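A toy version of this scheme (hypothetical code, not William_S's; the lookup-table "distillation" stands in for a learned, generalizing model):

```python
# Search for a length-n binary string s with F(s) True. The amplified agent
# brute-forces the tree but consults a learned `prune` heuristic before
# recursing; distillation fits that heuristic to the previous round's traces.

def search(prefix, n, F, prune, trace):
    if len(prefix) == n:
        return prefix if F(prefix) else None
    for bit in "01":
        child = prefix + bit
        if prune(child):                      # learned heuristic: skip subtree
            continue
        result = search(child, n, F, prune, trace)
        trace.append((child, result is not None))
        if result is not None:
            return result
    return None

def distill_prune(traces):
    """Stub distillation: prune any prefix that never led to a solution."""
    dead = {p for p, found in traces if not found}
    return lambda prefix: prefix in dead

F = lambda s: s.endswith("11")                # toy criterion
traces = []
search("", 4, F, prune=lambda p: False, trace=traces)    # amplified, brute force
prune = distill_prune(traces)                            # distill a pruning rule
print(search("", 4, F, prune, trace=[]))                 # visits far fewer nodes
```

As the reply below points out, a table keyed on exact prefixes does not transfer to longer problems; the scheme only helps if the distilled heuristic genuinely generalizes.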

Thanks for the explanation, but I'm not seeing how this would work in general. Let's use Paul's notation where B[n] = Amplify(A[n]) and A[n+1] = Distill(B[n]). And say we're searching for binary strings s such that F(s, t)=1 for fixed F and variable t. So we start with A[0] (a human) and distill+amplify it into B[1], which searches strings up to length d (which requires searching a tree of depth d at cost 2^d). Then we distill that into A[2], which learns how to solve problems of depth d faster than B[1], and suppose it does that by learning the heuristic that the first bit of s is almost always the parity of t.

Now suppose I'm an instance of H running at the top level of B[2]. I have access to other instances of A[2] which can solve this problem up to length d, but I need to solve a problem of length d+1. So I ask another instance of A[2] "Find a string s of length d+1 such that s starts with 0 and F(s, t)=1", followed by a query to another A[2] "Find a string s of length d+1 such that s starts with 1 and F(s, t)=1". Well, the heuristic that A[2] learned doesn't help to speed up those queries, so each of them is still going to take time around 2^d.

The problem here as I see it is that it's not clear how I, as H, can make use of the previously learned heuristics to help solve larger problems more efficiently, since I have no introspective access to them. If there's a way to do that and I'm missing it, please let me know.

(I posted this from greaterwrong.com and it seems the LaTeX isn't working. Someone please PM me if you know how to fix this.)

[Habryka edit: Fixed your LaTeX for you. GreaterWrong doesn't currently support LaTeX I think. We would have to either improve our API, or greaterwrong would need to do some more fancy client-side processing to make it work]

For this example, I think you can do this if you implement the additional query "How likely is the search on [partial solution] to return a complete solution?". This is asked of all potential branches before recursing into them. A[2] learns to answer the solution probability query efficiently.

Then, in the amplification of A[2] at the top level of B[2], looking for a solution to a problem of length d+1, the root agent first asks "How likely is the search on [string starting with 0] to return a complete solution?" and "How likely is the search on [string starting with 1] to return a complete solution?". Then, the root agent first queries whichever subtree is most likely to contain a solution. (This doesn't improve worst case running time, but does improve average case running time.)

This is analogous to running a value estimation network in tree search, and then picking the most promising node to query first.
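A minimal sketch of that ordering trick (hypothetical code; `guess` stands in for the distilled agent answering the solution-probability query):

```python
# DFS that asks the learned model how promising each branch is and expands the
# more promising branch first: better average-case cost, unchanged worst case.

def ordered_search(prefix, n, F, solution_prob):
    if len(prefix) == n:
        return prefix if F(prefix) else None
    for child in sorted((prefix + b for b in "01"), key=solution_prob, reverse=True):
        result = ordered_search(child, n, F, solution_prob)
        if result is not None:
            return result
    return None

F = lambda s: s == "1101"                                 # toy target
guess = lambda p: sum(a == b for a, b in zip(p, "1101"))  # learned estimate (stub)
print(ordered_search("", 4, F, guess))   # -> "1101", found with almost no detours
```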