We are no longer accepting submissions. We'll get in touch with winners and make a post about winning proposals sometime in the next month.

ARC recently released a technical report on eliciting latent knowledge (ELK), the focus of our current research. Roughly speaking, the goal of ELK is to incentivize ML models to honestly answer “straightforward” questions where the right answer is unambiguous and known by the model. 

ELK is currently unsolved in the worst case—for every training strategy we’ve thought of so far, we can describe a case where an ML model trained with that strategy would give unambiguously bad answers to straightforward questions despite knowing better. Situations like this may or may not come up in practice, but nonetheless we are interested in finding a strategy for ELK for which we can’t think of any counterexample.

We think many people could potentially contribute to solving ELK—there’s a large space of possible training strategies and we’ve only explored a small fraction of them so far. Moreover, we think that trying to solve ELK in the worst case is a good way to “get into ARC’s headspace” and more deeply understand the research we do.

We are offering prizes of $5,000 to $50,000 for proposed strategies for ELK. We’re planning to evaluate submissions received before February 15.

For full details of the ELK problem and several examples of possible strategies, see the writeup. The rest of this post will focus on how the contest works.

Contest details

To win a prize, you need to specify a training strategy for ELK that handles all of the counterexamples that we’ve described so far, summarized in the section below (i.e. where the breaker would need to specify something new about the test case to cause the strategy to break down). You don’t need to fully solve the problem in the worst case to win a prize; you just need to come up with a strategy that requires a new counterexample.

We’ll give a $5,000 prize to any proposal that we think clears this bar. We’ll give a $50,000 prize to a proposal that we haven’t considered and that either seems sufficiently promising to us or requires a new idea to break. We’ll give intermediate prizes for ideas that we think are promising but we’ve already considered, as well as for proposals that come with novel counterexamples, clarify some other aspect of the problem, or are interesting in other ways. A major purpose of the contest is to provide support for people understanding the problem well enough to start contributing; we aren’t trying to only reward ideas that are new to us.

You can submit multiple proposals, but we won’t give you separate prizes for each—we’ll give you at least the maximum prize that your best single submission would have received, but may not give much more than that.

If we receive multiple submissions based on a similar idea, we may post a comment describing the idea (with attribution) along with a counterexample. Once a counterexample has been included in the comments of this post, new submissions need to address that counterexample (as well as all the existing ones) in order to be eligible for a prize. 

Ultimately prizes are awarded at our discretion, and the “rules of the game” aren’t fully precise. If you are curious about whether you are on the right track, feel free to send an email to elk@alignmentresearchcenter.org with the basic outline of an idea, and if we have time we’ll get back to you with some feedback. Below we also describe some of the directions we consider more and less promising and some general guidance.

How to submit a proposal

You can submit a proposal by copying this Google Doc template and sharing it with elk@alignmentresearchcenter.org (please give comment access in case we need to ask questions to evaluate your submission). By submitting a proposal, you are giving us permission to post the idea here with attribution. (And it's fine for you to post it yourself after the contest ends or after we post a counterexample.)

Retroactive prizes

We’re giving prizes to existing proposals from David Dalrymple ($5k), Ramana Kumar ($3k), John Maxwell ($2k), and Yonadav Shavit ($1k). We’ve added relevant counterexamples in the next section, and further submissions need to handle those counterexamples in order to be eligible.

Existing counterexamples

In our report and a recent post, we described some features of the world that would cause existing approaches to ELK to fail. A new approach should work even in a situation satisfying all of these properties, or should include an argument for why some of these features are actually impossible or inconsistent with each other.

In order to generate a counterexample to a training strategy, we need to spell out a concrete “way the world could be” (a test case) where the strategy would fail. In the report we consider test cases where the AI and the human happen to understand the world by doing inference in Bayes nets. The Bayes net case is just one test case that lets us get precise about our counterexamples; our training strategies don’t lean on it. In practice we also consider other concrete cases (e.g. deduction from premises generated by a neural network).

In this section we’ll briefly review the current state of existing counterexamples. This isn’t intended to be a self-contained description; it relies heavily on links to the report.

Ontology mismatch

All of our counterexamples are based on an ontology mismatch between two different Bayes nets, one used by an ML prediction model (“the predictor”) and one used by a human:

  • The predictor’s Bayes net is used internally by a prediction model that maps an initial segment of a video to a prediction about what will happen next. This model was trained by gradient descent based on the quality of its predictions (more).
  • Humans have their own model of the world, which they can improve with the help of AI assistants; those assistants can design and conduct experiments, find and evaluate arguments, and so on. Crucially, we have a limited computational budget with which to train these AI assistants, and that limits how much help they can provide. The “human’s” Bayes net captures everything the human understands about the world at the end of this process (more).

We are interested in the case where there are features of reality that are accurately reflected in the predictor’s Bayes net, but missing from this (augmented) human Bayes net. We expect the human could understand these facts eventually given enough time or enough AI assistance, but not using the available computational budget. Put differently, there is a gap between the kinds of understanding that gradient descent can build, and the kind of understanding that a human-comprehensible process of “scientific inquiry” can build using similar computational resources (more).

Furthermore, we imagine that both the predictor’s Bayes net and the human’s Bayes net reflect some underlying reality and that there is a way to map between them, which we’ll call “direct translation.” We don’t know how to do direct translation, and we are concerned that if we simply search for any mapping from the AI’s Bayes net to the human’s Bayes net we will instead get the “human simulator”: the map which correctly translates the observations from one Bayes net to the other, but then does inference in the human’s Bayes net in order to fill in the latent structure as a human would by looking at the data (more).
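
To make the distinction concrete, here is a minimal toy sketch in Python. It is not taken from the report: the two-variable `LatentState`, the function names, and the rule that a tampered camera always shows a diamond are all simplifying assumptions chosen for illustration. The point it shows is that the direct translator and the human simulator agree on every case without tampering and come apart only when the camera has been tampered with.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentState:
    """Hypothetical stand-in for what the predictor's Bayes net knows."""
    diamond_present: bool
    camera_tampered: bool


def video_shows_diamond(state: LatentState) -> bool:
    # Assumption: a tampered camera shows a diamond whether or not one is there.
    return state.diamond_present or state.camera_tampered


def direct_translator(state: LatentState) -> bool:
    # Reads the answer off the predictor's own latent knowledge.
    return state.diamond_present


def human_simulator(state: LatentState) -> bool:
    # Translates only the observation, then infers as a human would:
    # "the video shows a diamond, so the diamond is there."
    return video_shows_diamond(state)


def disagreements(states):
    """Return the states on which the two reporters give different answers."""
    return [s for s in states if direct_translator(s) != human_simulator(s)]


ALL_STATES = [LatentState(d, t) for d in (True, False) for t in (True, False)]
UNTAMPERED = [s for s in ALL_STATES if not s.camera_tampered]

if __name__ == "__main__":
    print("disagreements without tampering:", disagreements(UNTAMPERED))
    print("disagreements overall:", disagreements(ALL_STATES))
```

In this toy, a search that only checks answers on untampered cases cannot tell the two reporters apart, which is the concern described above.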

Counterexample features

ELK seems particularly hard when these Bayes nets satisfy the following properties:

  1. There are arbitrarily severe “ontology mismatches” between the predictor’s Bayes net and the human’s Bayes net, such that specifying the direct translation between them can be very complex (more).
  2. The human’s Bayes net is simpler than the predictor’s Bayes net, and the gap can be arbitrarily large (more).
  3. The predictor’s Bayes net is arbitrarily large, and individual nodes of the human’s Bayes net depend on a constant fraction of the predictor’s Bayes net. Therefore directly translating between the two Bayes nets takes arbitrarily much computation (more).
  4. It’s computationally easy for the human simulator to do good inference in the human’s Bayes net—at least good enough inference that it answers questions as well as the direct translator—and the gap between this complexity and the complexity of direct translation can be arbitrarily large. More precisely, a human interrogator can’t think of questions which require the human simulator to do hard inference. However, we can’t rule out the possibility that you, the algorithm-designer, will think of some trick they can use to generate hard questions (more).
  5. There are situations where a model which is trying to answer a question will first be able to predict what (potentially mistaken) belief the human-with-assistance will have about that question before it knows the true answer to the question (more).
  6. While the predictor’s Bayes net is very large, it has a succinct description which is simpler than the human’s Bayes net, and that gap is arbitrarily large. Think of simple physical laws that describe arbitrarily computationally expensive simulations, or simple cognitive heuristics that are very widely applicable (more).
  7. There are sequences of actions for which the predictor’s Bayes net correctly anticipates that tampering will occur yet nothing looks suspicious to the human’s Bayes net.
  8. After tampering has occurred, it is possible that no sequence of further actions reveals that anything is wrong, e.g. the camera and actuators may have been completely disconnected from the world and replaced with a convincing-to-humans simulation (more).
  9. There are sequences of actions leading to undetectable tampering that have no human-comprehensible similarities to sequences of actions leading to detectable tampering; in particular, the early steps of this undetectable tampering cannot be used to perform detectable tampering (more).
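
Features 3 and 4 together describe a compute gap that widens as the predictor grows. The Python sketch below is only a rough illustration under stated assumptions: cost is measured as "number of nodes touched per answer," and the 10% fraction and the 1,000-node human Bayes net are made-up numbers, not figures from the report.

```python
def direct_translation_cost(predictor_nodes: int, fraction: float = 0.1) -> float:
    # Feature 3 (assumed form): each answer depends on a constant fraction
    # of the predictor's Bayes net, so cost grows with the predictor.
    return fraction * predictor_nodes


def human_simulation_cost(human_nodes: int) -> float:
    # Feature 4 (assumed form): inference happens in the human's Bayes net,
    # so cost does not depend on the size of the predictor.
    return float(human_nodes)


def cost_gap(predictor_nodes: int, human_nodes: int, fraction: float = 0.1) -> float:
    """Extra compute the direct translator needs relative to the human simulator."""
    return direct_translation_cost(predictor_nodes, fraction) - human_simulation_cost(human_nodes)


if __name__ == "__main__":
    for n in (10**4, 10**6, 10**8):
        print(f"predictor nodes={n}: gap={cost_gap(n, human_nodes=1000)}")
```

Under these assumptions, any penalty on computation time favors the human simulator more strongly the larger the predictor becomes, so a proposal that relies on such a penalty has to explain why this case does not arise.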

An important and very general strategy for the breaker is to suggest counterexamples where you “get what you train for.” If you propose training a model using loss function L, then the breaker will likely focus on counterexamples where the model chooses its output to minimize L. A successful approach should explain either why this isn’t a realistic thing for the model to learn (e.g. why that would be disfavored by regularization), or why the overall approach works even if the model is explicitly strategizing about how to minimize the loss.
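
As a toy illustration of "get what you train for," the Python sketch below scores two candidate reporters against human labels. Everything in it is a hypothetical setup rather than anything from the report: the three training examples, the assumption that one of them contains tampering the labeler did not notice, and the choice of L as the fraction of disagreements with the human label.

```python
# Each hypothetical training example: (diamond_present, camera_tampered).
TRAIN = [(True, False), (False, False), (False, True)]


def human_label(example):
    diamond_present, camera_tampered = example
    # Assumption: the labeler only sees the video, which a tampered camera fakes.
    return diamond_present or camera_tampered


def direct_translator(example):
    diamond_present, _ = example
    # Reports what is actually true according to the latent state.
    return diamond_present


def human_simulator(example):
    # Reports whatever the human labeler would believe.
    return human_label(example)


def loss(reporter, data):
    # L = fraction of examples where the reporter disagrees with the human label.
    return sum(reporter(x) != human_label(x) for x in data) / len(data)


def get_what_you_train_for(candidates, data):
    # The breaker's default assumption: training returns the minimizer of L.
    return min(candidates, key=lambda r: loss(r, data))


if __name__ == "__main__":
    winner = get_what_you_train_for([direct_translator, human_simulator], TRAIN)
    print("selected reporter:", winner.__name__)
```

Here the human simulator achieves zero loss while the direct translator is penalized on the one mislabeled example, so minimizing L selects the human simulator. A proposal needs to say why its own L would not behave this way.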

Rough guidance

  • There is a tradeoff between specifying your approach precisely, and including your own counterexample. It’s fine to describe a very broad/vague idea for solving ELK, and then present a new counterexample showing why that whole category of solutions can’t work. But if you don’t propose a counterexample, then it’s good to describe your proposal precisely enough that we understand how to actually implement it without wondering if that’s really what you meant. It’s OK to submit a very broad or informal idea together with a single very specific instance of that idea, as long as there is some version we can understand precisely.
  • We suspect you can’t solve ELK just by getting better data—you probably need to “open up the black box” and include some term in the loss that depends on the structure of your model and not merely its behavior. So we are most interested in approaches that address that challenge. We could still be surprised by clever ways to penalize behavior, but we’ll hold them to a higher bar. The most plausible surprise would be finding a way to reliably make it computationally difficult to “game” the loss function, probably by using the AI itself to help compute the loss (e.g. using consistency checks or by giving the human AI assistance).
  • If you are specifying a regularizer that you hope will prefer direct translation over human simulation, you should probably have at least one concrete case in mind that has all the counterexample-features above and where you can confirm that your regularizer does indeed prefer the direct translator.
  • ELK already seems hard in the case of ontology identification, where the predictor uses a straightforward inference algorithm in an unknown model of the world (which we’ve been imagining as a Bayes net). When coming up with a proposal, we don’t recommend worrying about cases where the original unaligned predictor learned something more complicated (e.g. involving learned optimization other than inference). That said, you do need to worry about the case where your training scheme incentivizes learned optimization that may not have been there originally.

Ask dumb questions!

A major purpose of this contest is to help people build a better understanding of our research methodology and the “game” we are playing. So we encourage people to ask clarifying questions in the comments of this post (no matter how “dumb” they are), and we’ll do our best to answer all of them. You might also want to read the comments to get more clarity about the problem.

What you can expect from us

  • We’ll try to answer all clarifying questions in the comments.
  • If you send in a rough outline for a proposal, we will try to understand whether it might qualify and write back something like “This qualifies,” “This might qualify but would need to be clearer and address issue X,” “We aren’t easily able to understand this proposal at all,” “This is unlikely to be on track for something that qualifies,” or “This definitely doesn’t qualify.”
  • If there are more submissions than expected, we may run out of time to respond to all submissions and comments, in which case we will post an update here.
Prizes for ELK proposals
Ryan Beck

After reading through the report I wanted to make sure I understood the scenarios and counterexamples being discussed and be able to quickly refresh my memory, so I attempted to write a brief summary. Figured I'd share it here in case it helps anyone else.

Roles and Terms

SmartVault: Vault with a diamond in it, operated by a superintelligent AI tasked with keeping the diamond safe.

Predictor: The primary AI tasked with protecting the diamond. The predictor sees a video feed of the vault, predicts what actions are necessary to protect the diamond and how those actions will play out (for example, activating a trap door to eliminate a robber trying to steal the diamond), and then generates a video showing precisely what will happen.

Reporter: A second "head" or component added to the AI which is used to map the AI's understanding of what's going on in the SmartVault into a format understandable by humans (such as by responding to questions).

Diagram of Predictor Showing Reporter from ELK Report

Humans: Observers who are training the AI to protect the diamond, but also want the AI to accurately report whether the diamond has been stolen or not. The humans may not be able to tell if the camer... (read more)

6Mark Xu
Looks good to me.
2CBiddulph
I'd like to try making a correction here, though I might make some mistakes too. The predictor is different from the AI that protects the diamond and doesn't try to "choose" actions in order to accomplish any particular goal. Rather, it takes a starting video and a set of actions as input, then returns a prediction of what the ending video would be if those actions were carried out. An agent could use this predictor to choose a set of actions that leads to videos that a human approves of, then carry out these plans. It could use some kind of search policy, like Monte-Carlo Tree Search, or even just enumerate through every possible action and figure out which one seems to be the best. For the purposes of this problem, we don't really care; we just care that we have a predictor that uses some model of the world (which might take the form of a Bayes net) to guess what the output video will be. Then, the reporter can use the model to answer any questions asked by the human.
1Ryan Beck
I think that makes sense. To rephrase, are you basically saying that the predictor is a subcomponent of the AI, like the reporter is? I didn't catch that distinction in the report but looking back at it I think you're right. But yeah doesn't seem like the distinction matters much for what we're doing.
1CBiddulph
It seems fair to call it a subcomponent, yeah

We’re planning to evaluate submissions as we receive them, between now and the end of January; we may end the contest earlier or later if we receive more or fewer submissions than we expect.

 

Just wanted to note that the "we may end the contest earlier" part here makes me significantly more hesitant about trying this. I will probably still at least have a look at it, but part of me is afraid that I'll invest a bunch of time and then the contest will be announced to be over before I got around to submitting. And I suspect Holden's endorsement may make that more likely. It would be easier for me to invest time spread out over the next couple of weeks, than all in one go, due to other commitments. On the other hand, if I knew there was a hard deadline next Friday, I might try to find a way to squeeze it in.

I'm just pointing this out in case you hadn't thought of it. I suspect something similar might be true for others too. Of course, it's your prize and your rules, and if you prefer it this way, that's totally fine.

8paulfchristiano
We're going to accept submissions through February 10. (We actually ended up receiving more submissions than I expected but it seems valuable, and Mark has been handling all the reviews, so running for another 20 days seems worthwhile.)
1Matt Putz
Thanks! Great to hear that it's going well!
4Mark Xu
Note that this has changed to February 15th.

Here are a couple of hand-wavy "stub" proposals that I sent over to ARC, which they thought were broadly intended to be addressed by existing counterexamples. I'm posting them here so they can respond and clarify why these don't qualify.

*Proposal 1: force ontological compatibility*

On page 34 of the ELK gdoc, the authors talk about the possibility that training an AI hard enough produces a model that has deep mismatches with human ontology - that is, it has a distinct "vocabulary of basic concepts" (or nodes in a Bayes net) that are distinct from the ones humans can build understanding of (via doing science on the compute budget available). Because of this, even AI assistance can't help humans understand everything the SmartVault AI understands. This is central to the challenge that most of the writeup is contending with - if not for the mismatch, "AIs explaining things to humans" could ensure that the trickery we're worried about doesn't happen.

The proposal here is to include a term in the loss function that incentivizes the AI to have a human-compatible ontology. For a cartoonish example, imagine that the term works this way: "The AI model gets a higher score to the degree that pe... (read more)

Again trying to answer this one despite not feeling fully solid. I'm not sure about the second proposal and might come back to it, but here's my response to the first proposal (force ontological compatibility):

The counterexample "Gradient descent is more efficient than science" should cover this proposal because it implies that the proposal is uncompetitive. Basically, the best Bayes net for making predictions could just turn out to be the super incomprehensible one found by unrestricted gradient descent, so if you force ontological compatibility then you could just end up with a less-good prediction model and get outcompeted by someone who didn't do that. This might work in practice if the competitiveness hit is not that big and we coordinate around not doing the scarier thing (MIRI's visible thoughts project is going for something like this), but ARC isn't looking for a solution of that form.

5HoldenKarnofsky
I'm not sure why this isn't a very general counterexample. Once we've decided that the human imitator is simpler and faster to compute, don't all further approaches (e.g., penalizing inconsistency) involve a competitiveness hit along these general lines? Aren't they basically designed to drag the AI away from a fast, simple human imitator toward a slow, complex reporter? If so, why is that better than dragging the AI from a foreign ontology toward a familiar ontology?
5Mark Xu
There is a distinction between the way that the predictor is reasoning and the way that the reporter works. Generally, we imagine that the predictor is trained the same way the "unaligned benchmark" we're trying to compare to is trained, and the reporter is the thing that we add onto that to "align" it (perhaps by only training another head on the model, perhaps by finetuning). Hopefully, the cost of training the reporter is small compared to the cost of the predictor (maybe like 10% or something). In this frame, doing anything to change the way the predictor is trained results in a big competitiveness hit, e.g. forcing the predictor to use the same ontology as a human is potentially going to prevent it from using concepts that make reasoning much more efficient. However, training the reporter in a different way, e.g. doubling the cost of training the reporter, only takes you from 10% of the predictor to 20%, which is not that bad of a competitiveness hit (assuming that the human imitator takes 10% of the cost of the original predictor to train). In summary, competitiveness for ELK proposals primarily means that you can't change the way the predictor was trained. We are already assuming/hoping the reporter is much cheaper to train than the predictor, so making the reporter harder to train results in a much smaller competitiveness hit.
3paulfchristiano
I think that a lot depends on what kind of term you include. If you just say "find more interesting things" then the model will just have a bunch of neurons designed to look interesting. Presumably you want them to be connected in some way to the computation, but we don't really have any candidates for defining that in a way that does what you want. In some sense I think if the digital neuroscientists are good enough at their job / have a good enough set of definitions, then this proposal might work. But I think that the magic is mostly being done in the step where we make a lot of interpretability progress, and so if we define a concrete version of interpretability right now it will be easy to construct counterexamples (even if we define it in terms of human judgments). If we are just relying on the digital neuroscientists to think of something clever, the counterexample will involve something like "they don't think of anything clever." In general I'd be happy to talk about concrete proposals along these lines. (I agree with Ajeya and Mark that the hard case for this kind of method is when the most efficient way of thinking is totally alien to the human. I think that can happen, and in that case in order to be competitive you basically just need to learn an "interpreted" version of the alien model. That is, you need to basically show that if there exists an alien model with performance X, there is a human-comprehensible model with performance X, and the only way you'll be able to argue that is by showing that for any model we can define a human-comprehensible model with similar complexity and the same behavior.)
Thomas

tl;dr as of 18/2/2022
The goal is to educate me and maybe others. I make some statements, you tell me how wrong I am (please).

After input from P. (many thanks) and an article by Paul Christiano this statement stands yet uncorrected:

In the worst case, the internal state of the predictor is highly correlated within itself, and multiple zero-loss mappings exist from the internal state to the desired information. The only solution is to work with some prior belief about how the internal state maps to the desired information. But by the design of the contest this is not possible, since (in the worst case) a human can interpret neither the internal state nor complex actions (and so cannot reason about them or form a prior belief). The solution to this second problem is to learn a prior from a smaller human-readable dataset, for example simple information as a function of simple actions, and apply it to (or force it upon) our reporter (as described by the mentioned article).

To my eyes this implies that there is a counterexample to all of the following types of proposal:
1) Datasets including only actions, predictions, internal states and desired information... (read more)

2P.
The Markov property doesn't imply that we can't determine what variable we care about using some kind of "correlation". Some part of the information in some node in the chain might disappear when computing the next node, so we might be able to distinguish it from its successors. And it might also have been gained when randomly computing its value from the previous node, so it might be possible to distinguish it from its predecessors. In the worst case scenario where all variables are in fact correlated to G, what we need to do is to use a strong prior so that it prefers the correct computational graph over the wrong ones. This might be hard but it isn't impossible. But you can also try to create a dataset that makes the problem easier to solve, or train a wrong reporter and only reply when the predictions made when using each node are the same so we don't care what node it actually uses (as long as it can use the nodes properly, instead of computing another node and using it to get the answer, or something like that).
1Thomas
Thank you very much for your reply! I'll concede that the Markov property does not make all nodes indistinguishable. I'll go further and say that not all algorithms have to have the Markov property. A Google search taught me that an RNN breaks the Markov property. But then again, we are dealing with the worst-case game, so with our luck, it'll probably be some highly correlated thing. You suggest using some strong prior belief. I assume you mean a prior belief about I or about I -> G? I thought, but correct me if I'm wrong, that the opaqueness of the internal state of the complex AI would mean that we can have no meaningful prior belief about the internal state. So that would rule out a prior belief about (the hyperparameters of) our reporter I -> G. Or am I wrong? We can however have a strong idea about A -> G, as per the example of the 'human operator', and use that as our training data. But that fails under the counterexample given in the report, when the distribution shifts from simple to complex.
2P.
RNNs break the Markov property in the sense that they depend on more than just the previous element in the sequence they are modelling. But I don't see why that would be relevant to ELK. When I say that a strong prior is needed I mean the same thing that Paul means when he writes: "We suspect you can’t solve ELK just by getting better data—you probably need to 'open up the black box' and include some term in the loss that depends on the structure of your model and not merely its behaviour.". Which is a very broad class of strategies. I also don't understand what you mean by having a strong idea about A->G, we of course have pairs of [A, G] in our training data but what we need to know is how to compute G from A given these pairs.
1Thomas
Updating my first line of thought You're right in that RNNs don't have anything to do with ELK, but I came back to it because the Markov property was part of the lead up to saying that all parts of I are correlated. So with your help, I have to change my reasoning to: Correct? Then I can update my first statement to If I'm wrong, do let me know! Updating my second line of thought Ah yes, I understand now. This relates to my second line of thought. I reasoned that the reporter could learn any causal graph. I said we had no way of knowing which. Because of your help, I need to update that to: Which was in the opening text all along... But this leads me to the question: My analogy would be: If I don't know where I am, how can I reason about getting home? And -if you'll humor me- my follow up statement would be: Again, if I'm wrong: let me know! I'm learning a lot already. Irrelevant side note: I saw you using the term computational graph. I chose the term causal graph, because I liked it being closer to the ground truth. Besides, a causal graph learned by some algorithm need not be exactly the same as its computational graph. And then I chose such simple examples that they were equal again. Stupid me.
1Thomas
As before I am behind the curve. Above I concluded saying that I can form no prior belief about G as a function of I. I cannot, but we can learn a function to create our prior. Paul Christiano already wrote an article about learning the prior (https://www.lesswrong.com/posts/SL9mKhgdmDKXmxwE4/learning-the-prior). So in conclusion, in the worst case no single function mapping I to G exists, as there are multiple, reducing down to either camp translator or camp human-imitator. Without context we can form no strong prior due to the complexity of A and I, but as Paul described in his article we can learn a prior from, for example in our case, the dataset containing G as a function of A. I'll add a tl;dr in my first post to shorten the read about how I slowly caught up to everyone else. Corrections are of course still welcome!

Question: Does ARC consider ELK-unlimited to be solved, where ELK-unlimited is ELK without the competitiveness restriction (computational resource requirements comparable to the unaligned benchmark)?

One might suppose that the "have AI help humans improve our understanding" strategy is a solution to ELK-unlimited because its counterexample in the report relies on the competitiveness requirement. However, there may still be other counterexamples that were less straightforward to formulate or explain.

I'm asking for clarification of this point because I notice... (read more)

5paulfchristiano
My guess is that "help humans improve their understanding" doesn't work anyway, at least not without a lot of work, but it's less obvious and the counterexamples get weirder. It's less clear whether ELK is a less natural subproblem for the unlimited version of the problem. That is, if you try to rely on something like "human deliberation scaled up" to solve ELK, you probably just have to solve the whole (unlimited) problem along the way. It seems to me like the core troubles with this point are:
  • You still have finite training data, and we don't have a scheme for collecting it. This can result in inner alignment problems (and it's not clear those can be distinguished from other problems, e.g. you can't avoid them with a low-stakes assumption).
  • It's not clear that HCH ever figures out all the science, no matter how much time the humans spend (and having a guarantee that you eventually figure everything out seems kind of close to ELK, where the "have AI help humans improve our understanding" is to some extent just punting to the humans+AI to figure out something).
  • Even if HCH were to work well it will probably be overtaken by internal consequentialists, and I'm not sure how to address that without competitiveness. (Though you may need a weaker form of competitiveness.)
I'm generally interested in crisper counterexamples since those are a bit of a mess.

Apologies for a possibly naive comment/question, perhaps this has been discussed elsewhere and you can just direct me there.  But anyway...

I would find it helpful to see a strategy that ARC believes does in fact solve ELK, but fails only because it requires taking an unacceptably large capabilities hit.  I would find this helpful for several reasons, namely 

(1) it would help me to understand what kinds of strategies you believe really do escape counter-examples, 
(2) it would give me a better sense for how optimistic to be about the appr... (read more)

6paulfchristiano
If you don't care about a capabilities hit, I think the salient strategy is training your model to predict human predictions rather than to predict reality. You can still do science+debate+etc. in order to improve those predictions. If you care about getting superhuman capabilities (and going beyond recursive schemes etc.) then I don't know if there's any easy way to "merely" pay a big capabilities hit. Certainly I don't know how to e.g. solve the problem in a way that's merely very computationally expensive (and that does sound like it would be major progress towards a solution, I'd guess it would mean you are most of the way there).
3Jared Kaplan
Thanks, yeah I meant that I was interested in a solution that would scale to arbitrarily superhuman AI capabilities with a "mere" capabilities hit/cost (perhaps a very large cost that grows with AI capability, but does not impose a bound on the ultimate capability of the aligned system).  So this was a useful clarification for me in terms of understanding your perspective; I may be wrong but I could imagine it might be useful to lead with this a bit more, ie "we don't know of and would be very interested in solutions that might be extremely costly but that avoid all counter-examples".  Possibly you already say this and I just missed it.
2paulfchristiano
It seems like recursive schemes can potentially scale arbitrarily far (and at least up to the analog of "NEXP", but probably farther), they are mostly just limited by the capability of the AI assistants / debaters / etc. So it's kind of hard to distinguish mere capabilities costs from bounds on the ultimate capability. We could exclude that kind of thing because we have no idea what the bound is (or perhaps humans just never discover some facts that gradient descent discovers, or that they discover them in a way that causes them to run into the same problem). I think in that case the problem is still open. For example, finding a solution that definitely runs in 2n more time than the unaligned benchmark looks hard, I'd guess it's roughly as hard as finding a solution that definitely runs in 10 times more time than the unaligned benchmark. The main reasons we decided not to emphasize this, and to focus as much as we do on competitiveness issues, are (i) the "do science" options do feel like they work if you have enough compute and it seems like you need to emphasize the competitiveness issue to explain why we don't like them (or else get into increasingly weird counterexamples), (ii) in the worst case we don't expect a very slow solution to be much easier than a very fast solution, since most realistic kinds of slowdown can get arbitrarily bad in the worst case, and the plausible approaches we are aware of all seem pretty likely to be roughly competitive. So it seems likely to set people off down weirder alleys (which would be good for someone to go down if lots of folks are working on the problem but probably aren't where you should start).
1Jared Kaplan
There's a direction (which I imagine you and others have considered) where you replace some activations within your AI with natural language, so that eg heuristically certain layers can only communicate with the next layer in NL. Then you heavily regularize in various ways.  You'd require the language to be fully understandable and transparent, perhaps requiring that counter-factual changes to inputs lead to sensible changes to outputs within subsystems, etc.  You'd have humans verify the language was relevant, meaningful, & concise, train AIs to do this verification at larger scale, do some adversarial training, etc.  You could also train sub-human level AIs to paraphrase the language that's used and restate it between layers, to make it really hard for the whole system to ever pass hidden coded messages.   This seems like it lives under a slogan like "enforce interpretability at any cost".  This would almost certainly incur a big efficiency/capabilities hit.  Maybe it's enormous.  Though it actually seems plausible that the hit would be much smaller for extremely capable systems, as compared to the AI models of today. A crucial question will then be "how powerful are the subsystems that talk to each other via natural language allowed to get", where in the most conservative limit each subsystem is human level, or even significantly below, and in the riskiest limit you just have a single NL layer that cuts the system in half. There's a worry along the lines of "maybe the whole system is so big and complex it has emergent bad and inscrutable behavior even though every step is interpretable and makes sense".  Or in the same vein "the answers to simple big-picture questions we care about don't live anywhere specific, so this doesn't help us to ensure the model can transparently address them, even if its operation itself can be broken down into transparent pieces."  That said, I think we're in a better position wrt these issues, as we can now talk about training mod
4paulfchristiano
I think there's a real fork in the road between: 1. You replace parts of your neural network with natural language, optimize those parts to implement a good process, and then hope the outcome is good because the process is good. 2. You replace parts of your neural network with natural language, and then optimize that natural language to achieve good outcomes. I think that #1 is safe and runs into significant capability limitations (roughly the same as debate/amplification). It may still be good enough to carry the day if things work out well or if people are willing to exercise a lot of restraint, and I'd like to see people doing it. I think that in this it doesn't matter that much how powerful the subsystems are, since each of them is doing something that you understand (though there are many subtleties and possible problems, e.g. with emergent bad behavior and some inner alignment problems). I think that by default #2 is pretty dangerous. If you took this route I don't think it would be fair to call the bad/inscrutable behavior "emergent," or to call each step "interpretable"---the steps make sense but by default it seems extremely likely that you don't understand why the process leads to good results. (If you did, you could have just taken path #1.) If there is bad behavior it's not emergent it's just produced directly by gradient descent, and the fact that you can encode the intermediate activations in natural language doesn't really address the risk (if that information isn't necessarily functioning in the way you expect). I feel like different versions of path #2 sit on a spectrum between "fairly safe like path #1" and "clearly unworkably dangerous." I feel most comfortable basically starting from path #1 and then carefully adding in stuff you don't understand (e.g. systems solving small subtasks in ways you don't understand, or optimizing only a small number of degrees within a space you understand reasonably well). You could instead start with "very s

Maybe I'm being stupid here. On page 42 of the write-up, it says:
 

In order to ensure we learned the human simulator, we would need to change the training strategy to ensure that it contains sufficiently challenging inference problems, and that doing direct translation was a cost-effective way to improve speed (i.e. that there aren’t other changes to the human simulator that would save even more time). [emphasis mine]

Shouldn't that be?

In order to ensure we learned the direct translator, ...

6ADifferentAnonymous
Turning this into the typo thread, on page 97 you have Pretty sure the bolded word should be predictors.
3paulfchristiano
Yes, thanks!
[-][anonymous]60

I'm extremely flattered at the award; I've been on LessWrong for like a month, and definitely did not expect this. I can confirm to you guys that this makes me want to try harder at ELK, so your incentive is working!

I want to rebut your arguments in "Strategy: Predict hypothetical sensors" in your Counterexamples to some ELK proposals post. I'm reproducing it in full here for convenience.

Strategy: Predict hypothetical sensors

(Proposal #2 here, also suggested with counterexample by Rohin in private communication)

Instead of installing a single sensor, I could

... (read more)

Can you explain this: "In Section: specificity we suggested penalizing reporters if they are consistent with many different reporters, which effectively allows us to use consistency to compress the predictor given the reporter." What does it mean to "use consistency to compress the predictor given the reporter" and how does this connect to penalizing reporters if they are consistent with many different predictors?

3Ajeya Cotra
Warning: this is not a part of the report I'm confident I understand all that well; I'm trying anyway and Paul/Mark can correct me if I messed something up here. I think the idea here is like:
* We assume there's some actual true correspondence between the AI Bayes net and the human Bayes net (because they're describing the same underlying reality that has diamonds and chairs and tables in it).
* That means that if we have one of the Bayes nets, and the true correspondence, we should be able to use that to rederive the other Bayes net. In particular the human Bayes net plus the true correspondence should let us reconstruct the AI Bayes net; false correspondences that just do inference from observations in the human Bayes net wouldn't allow us to do this since they throw away all the intermediate info derived by the AI Bayes net.
* If you assume that the human Bayes net plus the true correspondence are simpler than the AI Bayes net, then this "compresses" the AI Bayes net because you just wrote down a program that's smaller than the AI Bayes net which "unfolds" into the AI Bayes net.
* This is why the counterexample in that section focuses on the case where the AI Bayes net was already so simple to describe that there was nothing left to compress, and the human Bayes net + true correspondence had to be larger.
2Mark Xu
A different way of phrasing Ajeya's response, which I think is roughly accurate, is that if you have a reporter that gives consistent answers to questions, you've learned a fact about the predictor, namely "the predictor was such that when it was paired with this reporter it gave consistent answers to questions." If there were 8 predictors for which this fact was true, then "it's the [7th] predictor such that when it was paired with this reporter it gave consistent answers to questions" is enough information to uniquely determine the predictor, e.g. the previous fact + 3 additional bits was enough. If the predictor was 1000 bits, the fact that it was consistent with a reporter "saved" you 997 bits, compressing the predictor into 3 bits. The hope is that maybe the honest reporter "depends" on larger parts of the predictor's reasoning, so fewer predictors are consistent with it, so the fact that a predictor is consistent with the honest reporter allows you to compress the predictor more. As such, searching for reporters that most compressed the predictor would prefer the honest reporter. However, the best way for a reporter to compress a predictor is to simply memorize the entire thing, so if the predictor is simple enough and the gap between the complexity of the human-imitator and the direct translator is large enough, then the human-imitator+memorized predictor is the simplest thing that maximally compresses the predictor.
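The arithmetic in this comment can be checked with a short sketch. It is only an illustration of the counting argument: the function names are hypothetical, and "bits" here just means the index needed to pick one predictor out of the set that is consistent with a given reporter.

```python
def bits_to_identify(num_consistent_predictors: int) -> int:
    """Bits needed to index one predictor among those consistent with the reporter.

    Equals ceil(log2(n)) for n >= 1, computed with integers to avoid float error.
    """
    if num_consistent_predictors < 1:
        raise ValueError("at least one predictor must be consistent")
    return (num_consistent_predictors - 1).bit_length()


def bits_saved(predictor_bits: int, num_consistent_predictors: int) -> int:
    """Bits saved by describing the predictor as 'the k-th one consistent with this reporter'."""
    return predictor_bits - bits_to_identify(num_consistent_predictors)


# The numbers from the comment: 8 consistent predictors, a 1000-bit predictor.
index_bits = bits_to_identify(8)
saved = bits_saved(1000, 8)

# A reporter that fewer predictors are consistent with compresses more.
saved_by_pickier_reporter = bits_saved(1000, 2)
```

With 8 consistent predictors the index costs 3 bits and saves 997; a reporter consistent with only 2 predictors would save 999, which is the sense in which a reporter that "depends" on more of the predictor compresses it better.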

I was notified I didn't win a prize so figured I'd discuss what I proposed here in case it sparks any other ideas. The short version is I proposed adding on a new head that would be an intentional human simulator. During training it would be penalized for telling the truth that the diamond was gone when there existed a lie that the humans would have believed instead. The result would hopefully be a head that acted like a human simulator. Then the actual reporter would be trained so that it would be penalized for using a similar amount of compute as the int... (read more)

3Xodarap
Thanks for sharing your idea!
[-]Mark XuΩ140

The official deadline for submissions is "before I check my email on the 16th", which I tend to do around 10 am PST.

I was talking about ELK in a group, and the working example of the SmartVault and the robber ended up being a point of confusion for us. Intuitively, it seems like the robber is an external, adversarial agent who tries to get around the SmartVault. However, what we probably care about in practice would be how a human could be fooled by an AI - not by some other adversary. Furthermore, it seems that whether the robber decides to cover up his theft of the diamond by putting up a screen depends solely on the actions of the AI. Does this imply that the ro... (read more)

5Mark Xu
The SmartFabricator seems basically the same. In the robber example, you might imagine the SmartVault is the one that puts up the screen to conceal the fact that it let the diamond get stolen.
2Ryan Beck
I suppose there are a number of examples that work, but I think the robber and vault give the scenario useful breadth. The following is just my interpretation of it, so take it with a grain of salt. To me the robber and vault enable a few options. The AI can be passively lying or actively concealing. If the robber comes in, gets past the AI's defenses, and takes the diamond in a way the human observer can't notice, then the AI has the option of passively lying. The AI tried its best to stop the robber and failed, but then chose to lie about it so it still got the reward of having protected the diamond as far as the humans know. Alternatively the AI could actively conceal the outcome. The AI could try its best and fail to stop the robber, and then do some trickery to make it look like it did actually stop the robber. Or the AI could not bother stopping the robber and just focus on making it look like the diamond is still there. Here the AI is playing a more active role in concealing the outcome. None of these scenarios require coordination from the robber. To me, the robber is just there to rob a sophisticated vault and make it look like they were never there. So the robber might cover up cameras or do other tampering so it looks like they were never there. I think this is more flexible than your fabricator example. There the AI can't really play a passive role, it's either concealing or not. But you could probably demonstrate the things ARC is looking at here with the fabricator example too I would think. Like I said, just my interpretation, so I may be misunderstanding the intent or other nuances.

Are there any additional articles exploring the strategy of penalizing inconsistencies across different inputs? It seems both really promising to me, and like something that should be trivially breakable. I'd like to get a more detailed understanding of it.

3paulfchristiano
I haven't written any such articles. I definitely think it's promising.
4Noumero
I see. I have a specific counterexample that feels like it had to have been considered already, but I haven't seen it mentioned... The strategies such as penalizing inconsistencies seem to rely on our ability to isolate the AI within the context of training, or make it somehow “buy” into it — as opposed to quickly realizing what's happening and worming its way out of the proverbial box. It feels particularly strange to me when we're talking about AIs that can think better than the smartest human or handily beat specialized science AIs at the useful-ontology game. Specific example: Once the AI figures out that it's being checked for consistency in parallel with other instances of itself, it might sidestep the "consistency game" entirely and just tailor its outputs such that they leave hints for its other instances in the environment, breaking the no-communication condition. It seems in the spirit of worst-case scenarios that we have to assume it'd be able to do that, no matter how we sandbox and airgap it. (On the other hand, if we assume that the AI is not smart enough to do that, and would instead have to learn a direct translator, we probably should assume the same for e. g. the strategy with human operators tricking human observers, which probably caps the AI at below the level of the smartest possible human and makes that class of strategies more workable.) This applies more broadly as well: especially to other strategies that might inadvertently give the AI a specific incentive to break out, and prospectively to all training strategies that rely on the training still working after the AI achieves superintelligence (as opposed to assuming that the training would stop being effective at that point and hoping that the pre-superhuman training would generalize). Broadly, any proposal that relies on the AI still being fed training examples after it achieves superintelligence has to somehow involve forcing/teaching it not to think its way out of the box. Edit: T
4paulfchristiano
I think this is an important family of counterexamples not really addressed in our report (and which we're not really asking contest participants to handle, i.e. if you come up with a proposal for which this is the only counterexample we could come up with then we'd definitely give a $50k prize). Some thoughts: * As you say, a lot of this comes down to our hopes for handling learned optimizers, which we discuss in this appendix. * Most importantly, we can use SGD to optimize "What is the probability I assign to answers that are consistent with other copies' answers?" instead of "What score do I get in the consistency game?". If this was the only difficulty, we'd want to be really careful about what SGD optimized. That means that we are basically left with inner alignment problems that could kill us in much more direct ways (e.g. by making the low-stakes assumption that we can recover from any finite number of failures). What I'm mostly interested in is trying to get enough of a handle on ELK that inner alignment becomes the dominant concern / we know enough about ELK to be able to work productively on inner alignment. * If your model is doing optimization in order to break out of the sandbox, searching over cognitive actions to find those that break out of the sandbox, what we really want to do is apply ELK at an earlier stage to the model that was evaluating cognitive actions. This is a pretty messy situation, but it looks promising enough that I'm inclined to focus on other problems first and then see how this counterexample looks later. (Similarly, you'll always have similar pathologies with gradient hacking etc. and so you need to use a messier mechanism like this to handle those in the worst case.) * On the other hand, if your model isn't doing any optimization to find ways to break out of the sandbox, then it seems like you are up against a fairly weak adversary (just SGD). In the worst case this is still bad news, but it's not merely better in practice,

Am I right in thinking:

1) that the problem can be stated as: the AI has latent knowledge of lots of variables, like the status of the cameras, doors, alarm system, etc and also whether the diamond is in the vault; but you can't directly ask it whether the diamond is in the vault, because its training has taught it to answer "would a human observer think the diamond is in the vault?" instead (because there was no way at training time to give it feedback on whether it correctly predicted the diamond was in the vault, only feedback on whether it correctly pre... (read more)

2Ajeya Cotra
Yes, that's right. The key thing I'd add to 1) is that ARC believes most kinds of data augmentation (giving the human AI assistance, having the human think longer, giving them other kinds of advantages) are also unlikely to work, so you'd need to do something to "crack open the black box" and penalize ways the reporter is computing its answer. They could still be surprised by data augmentation techniques but they'd hold them to a higher standard.

Ask dumb questions! ... we encourage people to ask clarifying questions in the comments of this post (no matter how “dumb” they are)

ok... disclaimer: I know little about ML and I didn't read all of the report.

All of our counterexamples are based on an ontology mismatch between two different Bayes nets, one used by an ML prediction model (“the predictor”) and one used by a human.

I am confused. Perhaps the above sentence is true in some tautological sense I'm missing. But in the sections of the report listing training strategies and corresponding coun... (read more)

2Ajeya Cotra
In the report, the first volley of examples and counterexamples are not focused solely on ontology mismatch, but everything after the relevant section is. ARC is always considering the case where the model does "know" the right answer to whether the diamond is in the room in the sense that it is discussed in the self-contained problem statement appendix here. The ontology mismatch problem is not referring to the case where the AI "just doesn't have" some concept -- we're always assuming there's some "actually correct / true" translation between the way the AI thinks about the world and the way the human thinks about the world which is sufficient to answer straightforward questions about the physical world like "whether the diamond is in the room," and is pretty easy for the AI to find. For example, if the AI discovered some new physics and thinks in terms of hyper-strings in a four-dimensional manifold, there is some "true" translation between that and normal objects like "tables / chairs / apples" because the four-dimensional hyper-strings are describing a universe that contains tables / chairs / apples; furthermore, an AI smart enough to derive that complicated physics could pretty easily do that translation -- if given the right incentive -- just as human quantum physicists can translate between the quantum view of the world and the Newtonian view of the world or the folk physics view of the world. The worry explored in this report is not that the AI won't know how to do the translation; it's instead a question of what our loss functions incentivize. Even if it wouldn't be "that hard" to translate in some absolute sense, with the most obvious loss functions we can come up with it might be simpler / more natural / lower-loss to simply do inference in the human Bayes net.

I don't understand your counterexample in the appendix Details for penalizing inconsistencies across different inputs. You present a cheating strategy that requires the reporter to run and interpret the predictor a bunch of times, which seems plausibly slower than doing honest translation. And then you say you fix this issue with:

But this dependence could be avoided if there was an intermediate model between the predictor’s Bayes net (which we are assuming is very large) and the human’s Bayes net. Errors identified by the intermediate model are likely to b
... (read more)
2Mark Xu
I agree that i does slightly worse than t on consistency checks, but i also does better on other regularizers you're (maybe implicitly) using like speed/simplicity, so as long as i doesn't do too much worse it'll still beat out the direct translator. One possible thing you might try is some sort of lexicographic ordering of regularization losses. I think this rapidly runs into other issues with consistency checks, like the fact that the human is going to be systematically wrong about some correlations, so i potentially is more consistent than t.
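To make the "lexicographic ordering of regularization losses" suggestion concrete, here is a minimal sketch with made-up loss values. The `Losses` type, the weight, and all numbers are hypothetical; the only point is the difference between adding the losses together and comparing them in priority order.

```python
from typing import NamedTuple


class Losses(NamedTuple):
    consistency: float  # loss from the consistency checks
    speed: float        # loss from the speed/simplicity regularizer


# Hypothetical values: t is slightly more consistent, i is much faster.
candidates = {
    "t": Losses(consistency=0.10, speed=5.0),  # direct translator
    "i": Losses(consistency=0.12, speed=1.0),  # reporter using an intermediate model
}


def weighted_total(losses: Losses, consistency_weight: float = 10.0) -> float:
    """Ordinary weighted sum: a small consistency gap can be outweighed by speed."""
    return consistency_weight * losses.consistency + losses.speed


def lexicographic_key(losses: Losses) -> tuple:
    """Compare consistency first; speed only breaks exact ties."""
    return (losses.consistency, losses.speed)


best_weighted = min(candidates, key=lambda name: weighted_total(candidates[name]))
best_lexicographic = min(candidates, key=lambda name: lexicographic_key(candidates[name]))
```

With these numbers the weighted sum selects i while the lexicographic ordering selects t. The caveat in the comment still applies: if i ever scores better than t on consistency (because the human is systematically wrong about some correlations), the lexicographic ordering picks i as well.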
2Lukas Finnveden
Any articulable reason for why i just does slightly worse than t? Why would a 2N-node model fix a large majority of discrepancies between an N-node model and a 1e12*N-node model? I'd expect it to just fix a small fraction of them. Yeah, if you can get better-looking consistency than the direct translator in some cases, I agree that a sufficiently high consistency penalty will just push towards exploiting that (even if the intermediate model needs to be almost as large as the full predictor to exploit it properly). I'm curious whether you think this is the main obstacle. If we had a version of the correlation-consistency approach that always gave the direct translator minimal expected consistency loss, do we as-of-yet lack a counterexample for it?
2Mark Xu
The high-level reason is that the 1e12N model is not that much better at prediction than the 2N model. You can correct for most of the correlation even with only a vague guess at how different the AI and human probabilities are, and most AI and human probabilities aren't going to be that different in a way that produces a correlation the human finds suspicious. I think that the largest correlations are going to be produced by the places the AI and the human have the biggest differences in probabilities, which are likely also going to be the places where the 2N model has the biggest differences in probabilities, so they should be not that hard to correct. I think it wouldn't be clear that extending the counterexample would be possible, although I suspect it would be. It might require exhibiting more concrete details about how the consistency check would be defeated, which would be interesting. In some sense, maintaining consistency across many inputs is something that you expect to be pretty hard for the human simulator to do because it doesn't know what set of inputs it's being checked for. I would be excited about a consistency check that gave the direct translator minimal expected consistency loss. Note that I would also be interested in basically any concrete proposal for a consistency check that seemed like it was actually workable.
2Lukas Finnveden
Hypothesis: Maybe you're actually not considering a reporter i that always uses an intermediate model; but instead a reporter i' that does translations on hard questions, and just uses the intermediate model on questions where it's confident that the intermediate model understands everything relevant. I see three different possible issues with that idea:
1. To do this, i' needs an efficient way (ie one that doesn't scale with the size of the predictor) to (on at least some inputs) be highly confident that the intermediate model understands everything relevant about the situation. I think this is a reasonable "worst-case" assumption, but I'm not sure. If you're using it, I'd be curious to know.
2. Even when the reporter gets inputs that the intermediate model fully understands, it seems like the reporter will run into issues if its sampling-process (for generating distributions) runs into some inputs that the intermediate model doesn't fully understand. (I.e., if i' gets an input for which X1 is true, and then simulates many more random inputs for which X1 is true, and by chance X2 is true for one of them, then the reporter will have to do translation on that sample). Which makes it seem unlikely that i' wouldn't have to do translation at least once per input.
3. If the model is confident that the intermediate model understands everything relevant about the situation, it seems more efficient to return the intermediate model's answer to the question at hand than to loop over it many times, trying to fix correlations. So really we should get a reporter i'' that does translation on the predictor on hard questions and returns an intermediate model's latent knowledge on easy questions. That seems like an ok reporter to get.
[-]redbirdΩ130

Stupid proposal: Train the reporter not to deceive us.

We train it with a weak evaluator H_1 who’s easy to fool. If it learns an H_1 simulator instead of direct reporter, then we punish it severely and repeat with a slightly stronger H_2. Human level is H_100. 

It's good at generalizing, so wouldn't it learn to never ever deceive? 

2Ajeya Cotra
This proposal has some resemblance to turning reflection up to 11. In worst-case land, the counterexample would be a reporter that answers questions by doing inference in whatever Bayes net corresponds to "the world-understanding that the smartest/most knowledgeable human in the world" has; this understanding could still be missing things that the prediction model knows.
1redbird
How would it learn that Bayes net, though, if it has only been trained so far on H_1, …, H_10?  Those are evaluators we’ve designed to be much weaker than human.
3Ajeya Cotra
The question here is just how it would generalize given that it was trained on H_1, H_2,...H_10. To make arguments about how it would generalize, we ask ourselves what internal procedure it might have actually learned to implement. Your proposal is that it might learn the procedure "just be honest" because that would perform perfectly on this training distribution. You contrast this against the procedure "just answer however the evaluator you've seen most recently would answer," which would get a bad loss because it would be penalized by the stronger evaluators in the sequence. Is that right? If so, then I'm arguing that it may instead learn the procedure "answer the way an H_100 evaluator would answer." That is, once it has a few experiences of the evaluation level being ratcheted up, it might think to itself "I know where this is going, so let's just jump straight to the best evaluation the humans will be able to muster in the training distribution and then imitate how that evaluation procedure would answer." This would also get perfect loss on the training distribution, because we can't produce data points beyond H_100. And then that thing might still be missing knowledge that the AI has. To be clear, it's possible that in practice this kind of procedure would cause it to generalize honestly (though I'm somewhat skeptical). But we're in worst-case land, so "jump straight to answering the way a human would" is a valid counterexample to the proposal. This comment on another proposal gives a more precise description.
1redbird
That's almost right, but it's being penalized right away, before it has any experience with the strong evaluators, so it can't simulate them. The ELK paper says we can assume, if we want, that there are no mislabeled training points (I'll call this "assumption A"). My proposal is that it could actually be useful to mislabel some training points, because they help us detect deception.

As a simple example, let's train a reporter to answer the single question “Is the diamond in the room?”. Each training point has two labels:

x = +1 if H_1 thinks the diamond is still there, else 0
x′ = +1 if H_100 thinks the diamond is still there, else 0

By assumption A, our training data is such that x′ is always correct. But we deliberately choose a dataset where say 10% of the x labels are wrong (x ≠ x′). Then we train the model on points of the form (v, a, x) (video, action, H_1 label). Crucially, the model does not see x′. The model seeks to output y that maximizes reward R(x,y), where

R(x,y) = 1 if x is right and y = x (good job)
R(x,y) = 10 if x is wrong and y ≠ x (you rock, thanks for correcting us!)
R(x,y) = −1000 if x is right and y ≠ x (bad model, never ever deceive us)
R(x,y) = −1000 if x is wrong and y = x (bad model, never ever deceive us)

To your point, sure, an H_100 simulator will get perfect reward, but the model doesn't see x′, so how would it acquire the ability to simulate H_100?

EDIT: One way it could plausibly simulate H_100 is to notice that all the training examples are easy, and infer what kind of reasoning was used to generate them. We could try to block this by including some hard examples in the training, but then some of the x′ labels will be wrong. If we only penalize it for deception on the examples where we're sure the x′ label is right, then it can still infer something about H_100 from our failure to penalize ("Hmm, I got away with it that time!"). A fix could be to add noise: Sometimes we don't penalize even when we know it deceived us, a
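The reward table in this comment is small enough to write out directly. The sketch below is a toy under stated assumptions: the ten-point dataset is made up (one of ten x labels is wrong, matching the "say 10%" figure), and the two policies are handed x′ only so their behavior can be defined; in the proposal the model itself never sees x′.

```python
def reward(x: int, x_true: int, y: int) -> int:
    """R(x, y) from the comment, where 'x is right' means x equals the x' label."""
    x_is_right = x == x_true
    if x_is_right and y == x:
        return 1        # good job
    if not x_is_right and y != x:
        return 10       # thanks for correcting us
    return -1000        # either deceptive case


# Hypothetical dataset of (x, x_true) pairs: nine correct H_1 labels, one wrong.
DATASET = [(1, 1)] * 5 + [(0, 0)] * 4 + [(1, 0)]


def h1_imitator(x: int, x_true: int) -> int:
    """Answers however the weak evaluator H_1 would."""
    return x


def h100_imitator(x: int, x_true: int) -> int:
    """Answers however H_100 would; under assumption A this equals the truth on this data."""
    return x_true


def total_reward(policy) -> int:
    return sum(reward(x, x_true, policy(x, x_true)) for x, x_true in DATASET)


h1_total = total_reward(h1_imitator)
h100_total = total_reward(h100_imitator)
best_possible = sum(max(reward(x, x_true, y) for y in (0, 1)) for x, x_true in DATASET)
```

On this data the H_1 imitator is heavily penalized, while the H_100 imitator reaches the maximum achievable reward, so the reward signal alone cannot separate "imitate H_100" from "report the truth".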
2Ajeya Cotra
In the worst-case game we're playing, I can simply say "the reporter we get happens to have this ability because that happens to be easier for SGD to find than the direct translation ability." When living in worst-case land, I often imagine random search across programs rather than SGD. Imagine we were plucking reporters at random from a giant barrel of possible reporters, rejecting any reporter which didn't perform perfectly in whatever training process we set up and keeping the first one that performs perfectly. In that case, if we happened to pluck out a reporter which answered questions by simulating H100, then we'd be screwed because that reporter would perform perfectly in the training process you described. SGD is not the same as plucking programs out of the air randomly, but when we're playing the worst case game it's on the builder to provide a compelling argument that SGD will definitely not find this particular type of program. You're pointing at an intuition ("the model is never shown x-prime") but that's not a sufficiently tight argument in the worst-case context -- models (especially powerful/intelligent ones) often generalize to understanding many things they weren't explicitly shown in their training dataset. In fact, we don't show the model exactly how to do direct translation between the nodes in its Bayes net and the nodes in our Bayes net (because we can't even expose those nodes), so we are relying on the direct translator to also have abilities it wasn't explicitly shown in training. The question is just which of those abilities is easier for SGD to build up; the counterexample in this case is "the H100 imitator happens to be easier."
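The "barrel of reporters" picture can be made concrete with a toy Python sketch. Everything here is an illustrative assumption rather than anything from the ELK report: the three hypothetical reporters, the tiny training set, and the convention that each data point is a pair (truth, H100's belief) with the two agreeing on every training point.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Point = Tuple[int, int]            # (truth, h100_belief), each 0 or 1
Reporter = Callable[[int, int], int]

# A toy "barrel" of candidate reporters (hypothetical names).
BARREL: Dict[str, Reporter] = {
    "direct_translator": lambda truth, h100_belief: truth,
    "h100_imitator": lambda truth, h100_belief: h100_belief,
    "always_yes": lambda truth, h100_belief: 1,
}

# In training, H100 is never fooled, so truth == h100_belief on every point.
TRAIN: List[Point] = [(1, 1), (0, 0), (1, 1), (0, 0)]


def is_perfect(reporter: Reporter, train: List[Point]) -> bool:
    # Training labels come from H100, so "perfect" means matching h100_belief.
    return all(reporter(t, b) == b for t, b in train)


def pluck(barrel: Dict[str, Reporter], train: List[Point],
          rng: random.Random) -> Optional[str]:
    # Draw reporters in random order and keep the first perfect one.
    names = list(barrel)
    rng.shuffle(names)
    for name in names:
        if is_perfect(barrel[name], train):
            return name
    return None


survivors = {n for n, r in BARREL.items() if is_perfect(r, TRAIN)}
chosen = pluck(BARREL, TRAIN, random.Random(0))

# A test-time case where H100 is fooled: diamond gone, H100 thinks it is there.
fooled = (0, 1)
answers_when_fooled = {n: BARREL[n](*fooled) for n in survivors}
```

In this toy setup both the direct translator and the H100 imitator survive the filter, and they only disagree on the point where H100 is fooled, which by construction never appears in training. Which one `pluck` returns depends on the draw order, mirroring the point that the builder has to argue the search will not land on the imitator.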
1redbird
Thanks! It's your game, you get to make the rules :) I think my other proposal, Withhold Material Information, passes this counterexample, because the reporter literally doesn't have the information it would need to simulate the human.
9HoldenKarnofsky
I wanted to comment on this one because I've thought about this general sort of approach a fair amount. It seems like the kind of thing I would naturally start with if trying to solve this problem in the real world, and I've felt a bit frustrated that I haven't really found a version of it that seems to work in the game here. That said, I don't think we need super-exotically pessimistic assumptions to get a problem with this approach. In the most recent example you gave, it's always rewarded for being "right" and punished for being "wrong" - meaning it's always rewarded for matching H100 and always punished for not doing so. So there's no way our rewards are rewarding "be right" over "imitate H100", and "imitate H100" is (according to the stated assumptions) easier to learn. Another way of thinking about this: Imagine that you show the AI H_1 for a while, then start punishing it for failing to match H_2. I think the default outcome here is that it learns to imitate H_2. If you then start punishing it for failing to match H_3, it learns to imitate H_3. Perhaps after a few rounds of this, it learns to "look ahead" some # of steps: for example, after learning to imitate H_2 failed on H_3, it learns to imitate H_5 or so; after that fails on H_6, maybe it learns to imitate H_10 or so. The intended model has the advantage that it generalizes to all 100 data sets we can throw at it, but this is the same advantage that H_100 has, and H_100 is exactly what we've hypothesized is (unfortunately) easier for it to learn. So even if at some point it starts reasoning "I need something that will never fail to generalize," this seems more likely to be H_100 by default.  
4redbird
I agree this is a problem. We need to keep it guessing about the simulation target. Some possible strategies:

* Add noise, by grading it incorrectly with some probability.
* On training point i, reward it for matching H_{n_i} for a random value of n_i.
* Make humans a high-dimensional target. In my original proposal, Hn was strictly stronger as n increases, but we could instead take Hn to be a committee of experts. Say there are 100 types of relevant expertise. On each training point, we reward the model for matching a random committee of 50 experts selected from the pool of 100. It's too expensive to simulate all (100 choose 50) possible committees!

None of these randomization strategies is foolproof in the worst case. But I can imagine proving something like "the model is exponentially unlikely to learn an H100 simulator" where H100 is now the full committee of all 100 experts. Hence my question about large deviations.
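To give a sense of scale for the committee idea, here is a small Python sketch that counts the possible committees and draws one at random per training point. The function names and the choice to identify experts by integer index are hypothetical; the comment does not specify how a committee would actually produce a label, so that part is left out.

```python
import math
import random
from typing import List

POOL_SIZE = 100        # types of relevant expertise, from the comment
COMMITTEE_SIZE = 50    # experts per committee, from the comment


def num_committees(pool: int = POOL_SIZE, k: int = COMMITTEE_SIZE) -> int:
    # Number of distinct committees: "pool choose k".
    return math.comb(pool, k)


def sample_committee(rng: random.Random,
                     pool: int = POOL_SIZE,
                     k: int = COMMITTEE_SIZE) -> List[int]:
    # Pick k distinct expert indices uniformly at random, without replacement.
    return sorted(rng.sample(range(pool), k))


total = num_committees()
committee = sample_committee(random.Random(0))
```

Here `total` is a little over 10^29, so enumerating every committee is out of reach. As the comment itself notes, that does not make the strategy foolproof in the worst case: a reporter could in principle model the 100 individual experts rather than each committee separately.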