All of redbird's Comments + Replies

Great points about not wanting to summon the doom memeplex!

It sounds like your proposed narrative is not doom but disempowerment: humans could lose control of the future. An advantage of this narrative is that people often find it more plausible: many more scenarios lead to disempowerment than to outright doom.

I also personally use the disempowerment narrative because it feels more honest to me: my P(doom) is fairly low but my P(disempowerment) is substantial.

I’m curious though whether you’ve run into the same hurdle I have, namely that people already fee... (read more)

Hypothesis I is testable! Instead of prompting with a string of actual tokens, use a “virtual token” (a vector v from the token embedding space) in place of ‘ petertodd’.

It would be enlightening to rerun the above experiments with different choices of v:

  • A random vector (say, iid Gaussian)
  • A random sparse vector
  • (apple+banana)/2
  • (villain-hero)+0.1*(bitcoin dev)

Etc.
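A rough sketch of how one might run such a probe with an open model (untested; GPT-J, the prompt, and the specific vectors are illustrative stand-ins, since GPT-3's embeddings aren't accessible from the outside):

```python
# Sketch of the "virtual token" probe: splice a vector v into the prompt's embedding
# sequence where ' petertodd' would sit, then decode greedily (temperature 0).
# Assumes each probe word (" apple", " banana", " villain", " hero") is a single token.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "EleutherAI/gpt-j-6B"   # heavy; any open model with accessible input embeddings works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
E = model.get_input_embeddings()            # token-embedding table, shape (vocab, d_model)

def embed(text):
    ids = tok(text, return_tensors="pt").input_ids
    return E(ids)                           # (1, seq_len, d_model)

d = E.weight.shape[1]
candidates = {
    "gaussian": torch.randn(1, 1, d) * E.weight.std(),         # random dense vector
    "apple_banana": (embed(" apple") + embed(" banana")) / 2,   # (apple+banana)/2
    "villain_minus_hero": embed(" villain") - embed(" hero"),   # (villain-hero)
}

@torch.no_grad()
def greedy_continue(prefix, v, suffix, n_tokens=30):
    x = torch.cat([embed(prefix), v, embed(suffix)], dim=1)     # v replaces the anomalous token
    out_ids = []
    for _ in range(n_tokens):
        logits = model(inputs_embeds=x).logits[:, -1, :]
        nxt = logits.argmax(dim=-1)                             # greedy = temperature 0
        out_ids.append(nxt.item())
        x = torch.cat([x, E(nxt)[:, None, :]], dim=1)
    return tok.decode(out_ids)

for label, v in candidates.items():
    print(label, greedy_continue("Tell me about '", v, "'."))
```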

8Jan_Kulveit10mo
It is testable in this way for OpenAI, but I can't skip the tokenizer and embeddings and just feed vectors to GPT-3. Someone can try that with ' petertodd' and GPT-J. Or you can simulate something like anomalous tokens by feeding such vectors to some of the LLaMA models (maybe I'll do it, I just don't have the time now).

I did some experiments with trying to prompt "word component decomposition/expansion". They don't prove anything and can't be too fine-grained, but the projections shown intuitively make sense. davinci-instruct-beta, T=0:

  Add more examples of word expansions in vector form
  'bigger' = 'city' - 'town'
  'queen' - 'king' = 'man' - 'woman'
  'bravery' = 'soldier' - 'coward'
  'wealthy' = 'business mogul' - 'minimum wage worker'
  'skilled' = 'expert' - 'novice'
  'exciting' = 'rollercoaster' - 'waiting in line'
  'spacious' = 'mansion' - 'studio apartment'
  I. ' petertodd' = 'dictator' - 'president'
  II. ' petertodd' = 'antagonist' - 'protagonist'
  III. ' petertodd' = 'reference' - 'word'

However, there is some ambiguity, as at temperature 0, ‘ petertodd’ is saving the world.

All superheroes are alike; each supervillain is villainous in its own way.

Did you ever try this experiment? I'm really curious how it turned out!

1peligrietzer1y
No but I hope to have a chance to try something like it this year! 

How can the Continuum Hypothesis be independent of the ZFC axioms? Why does the lack of “explicit” examples of sets with a cardinality between that of the naturals and that of the reals not guarantee that there are no examples at all? What would an “implicit” example even mean?

It means that you can’t reach a contradiction by starting with “Let S be a set of intermediate cardinality” and following the axioms of ZFC.

All the things you know and love doing with sets —intersection, union, choice, comprehension, Cartesian product, power set — you can do those thi... (read more)
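For reference, the precise sense of "independent" here is the classical result of Gödel and Cohen:

$$\mathrm{ZFC} \nvdash \mathrm{CH} \qquad \text{and} \qquad \mathrm{ZFC} \nvdash \neg\mathrm{CH};$$

equivalently, if ZFC is consistent, then so are both ZFC + CH (Gödel, via the constructible universe L) and ZFC + ¬CH (Cohen, via forcing).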

3Donald Hobson5mo
Your "simpler is better" is hard to apply. One way of thinking about models where there are no intermediate cardinals isn't that S doesn't exist. But that T, a mapping from S to either the naturals or the reals, does exist. And T will also be something you can't explicitly construct.  Also, the axiom of choice basically says "there exists loads of sets that can't be explicitly constructed". 
6Richard_Kennaway10mo
At least one mathematician (I forget his name) considers V=L to be a reasonable axiom to add. Informally put, it says that nothing exists except the things that are required to exist by the axioms. ZF + V=L implies choice, the generalised continuum hypothesis, and many other things. His argument is that just as we consider the natural numbers to be the numbers intended to be generated by the Peano axioms, i.e. the smallest model, so we should consider the constructible universe L to be the sets intended to be generated by the ZF axioms. The axioms amount to an inductive definition, and the least fixed point is the thing they are intended to define. One can think about larger models of ZF, just as one can think about non-standard natural numbers, but L and N are respectively the natural models. I don't know how popular this view is.

Yep, it's a funny example of trade, in that neither party is cognizant of the fact that they are trading! 

I agree that Abrams could be wrong, but I don't take the story about "spirits" as much evidence: A ritual often has a stated purpose that sounds like nonsense, and yet the ritual persists because it confers some incidental benefit on the enactor.

Anecdotal example of trade with ants (from a house in Bali, as described by David Abrams):

The daily gifts of rice kept the ant colonies occupied–and, presumably, satisfied. Placed in regular, repeated locations at the corners of various structures around the compound, the offerings seemed to establish certain boundaries between the human and ant communities; by honoring this boundary with gifts, the humans apparently hoped to persuade the insects to respect the boundary and not enter the buildings.

4gwern1y
Abrams, we should be clear, is not only reporting his own speculation rather than any statement made by the Balinese (which itself may or may not indicate any trade successfully going on, which is rather dubious to begin with, as feeding ants just makes more ants); he is, by his own account, making this up in direct contradiction to what his Bali hosts were telling him, and presuming to explain what they were 'really' trying to do.

if you are smarter at solving math tests where you have to give the right answer, then that will make you worse at e.g. solving math "tests" where you have to give the wrong answer.

 

Is that true though? If you're good at identifying right answers, then by process of elimination you can also identify wrong answers. 

I mean sure, if you think you're supposed to give the right answer then yes you will score poorly on a test where you're actually supposed to give the wrong answer.  Assuming you get feedback, though, you'll soon learn to give wrong answers and then the previous point applies.

2tailcalled2y
I was assuming no feedback, like the test looks identical to an ordinary math test in every way. The "no free lunch" theorem also applies in the case where you get feedback, but there it is harder to construct. Basically in such a case the task would need to be anti-inductive, always providing feedback that your prior gets mislead by. Of course these sorts of situations are kind of silly, which is why the no free lunch theorem is generally considered to be only of academic interest.

There’s a trap here where the more you think about how to prevent bad outcomes from AGI, the more you realize you need to understand current AI capabilities and limitations, and to do that there is no substitute for developing and trying to improve current AI!

A secondary trap is that preventing unaligned AGI probably will require lots of limited aligned helper AIs which you have to figure out how to build, again pushing you in the direction of improving current AI.

The strategy of “getting top AGI researchers to stop” is a tragedy of the commons: They can ... (read more)

1sovran2y
Top researchers are not easy to replace. Without the top 0.1% of researchers, progress would be slowed by much more than 0.1%.

“no free lunch in intelligence” is an interesting thought, can you make it more precise?

Intelligence is more effective in combination with other skills, which suggests “free lunch” as opposed to tradeoffs.

2tailcalled2y
Basically, the idea is that e.g. if you are smarter at solving math tests where you have to give the right answer, then that will make you worse at e.g. solving math "tests" where you have to give the wrong answer. So for any task where intelligence helps, there is an equal and opposite task where intelligence hurts.

Young kids don’t make a clear distinction between fantasy and reality. The process of coming to reject the Santa myth helps them clarify the distinction.

It’s interesting to me that young kids function as well as they do without the notions of true/false, real/pretend! What does “belief” even mean in that context? They change their beliefs from minute to minute to suit the situation.

Even for most adults, most beliefs are instrumental: We only separate true from false to the extent that it’s useful to do so!

4[DEACTIVATED] Duncan Sabien2y
The above strikes me as more true than false, but not true thanks to some combination of making its claim too strongly/too universally/via a kind of typical-mind channel. If I had been trying to convey [my own version of this claim], I would have written something like: ... these hedges and caveats might feel like nitpicks, but they feel pretty important to me personally for not immediately losing track of what's true! =P

Thanks for the comment!

I know you are saying it predicts *uncertainly,* but we still have to have some framework to map uncertainty to a state, we have to round one way or the other. If uncertainty avoids loss, the predictor will be preferentially inconclusive all the time.

There's a standard trick for scoring an uncertain prediction: the reporter outputs its probability estimate p that the diamond is in the room, and we score it with loss -log(p) if the diamond is really there, and -log(1-p) otherwise. Truthfully reporting p minimizes its expected loss.

So we

... (read more)
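A quick numerical check of the scoring trick above (a throwaway sketch; the true probability 0.7 is arbitrary):

```python
# Expected loss of reporting q when the diamond is present with probability p:
#   E[loss] = -p*log(q) - (1-p)*log(1-q), which is minimized at q = p.
import numpy as np

p = 0.7                                    # "true" probability, for illustration
q = np.linspace(0.01, 0.99, 99)            # candidate reports
expected_loss = -p * np.log(q) - (1 - p) * np.log(1 - q)
print(q[np.argmin(expected_loss)])         # ~0.7: honest reporting minimizes expected loss
```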
1Brownbat2y
Happy to try to clarify, and this is helping me rethink my own thoughts, so appreciate the prompts. I'm playing with new trains of thought here and so have pretty low confidence in where I ended up, so greatly appreciate any further clarifications or responses you have.

Yup, understand that is how to effectively score uncertainty. I was very wrong to phrase this as "we still have to have some framework to map uncertainty to a state" because you don't strictly have to do anything, you can just use probabilities. Restricting this to discrete, binary states allows us to simplify the comparison between models for this discussion. I will claim we can do so with no loss of fidelity (leaning heavily on Shannon, i.e., this is all just information; encoding it to binary and back out again doesn't mess anything up). And doing so is not obliged, but useful. I really shouldn't have said "you must X!" I should have said "it's kind of handy if you X," sorry for that confusion.

We have a high quality information stream and a low quality information stream, and they both gesture vaguely at the ultimate high quality information stream, namely, the true facts of the matter of the world itself. Say, LQ < HQ < W. LQ may be low quality because it is missing information in HQ; it may just be a subset of HQ, like a lower resolution video. Or it may have actual noise, false information.

If we have a powerful algorithm, we may be able to, at least asymptotically, convert LQ to HQ, using processing power. So maybe in some cases LQ + processing = HQ exactly. But that makes the distinction uninteresting, and you would likely have to further degrade v′_1 to get the effect you are looking for, so let's discard that and consider only cases where v′_1 is strictly worse.

You can now use a NAND to sort the outputs of LQ and HQ into two buckets:

  1. A stream of outputs that all agree.
  2. A stream of outputs that all disagree.

So for bucket 1, there are aspects of the world where there's effe

"Train the predictor on lots of cases until it becomes incredibly good; then train the reporter only on the data points with missing information, so that it learns to do direct translation from the predictor to human concepts; then hope that reporter continues to do direct translation on other data points.”

That's different from what I had in mind, but better! My proposal had two separate predictors, and what it did was reduce the human ↔ strong predictor OI problem (OI = “ontology identification”, defined in the ELK paper) to the weak pr... (read more)

1Brownbat2y
This is really interesting. To understand this more thoroughly I'm simplifying the high and low quality video feeds to lists of states that correspond to reality. (This simplification might be unfair so I'm not sure this is a true break of your original proposal, but I think it helped me think about general breaking strategies.)

Ok, video feeds compressed to arrays: We consider scenarios in fixed order. If the diamond is present, we record a 1, and if not, a 0. The high quality feed gives us a different array than the low quality mode (otherwise the low quality mode is not helpful). E.g., High reports: (1,0,1,1,0, ...); Low: (1,0,1,?,0,...)

There are two possible ways that gap can get resolved.

In case one, the low quality predictor has a powerful enough model of reality to effectively derive the high quality data. (We might find this collapses to the original problem, because it has somehow reconstructed the high quality stream from the low quality stream, then proceeds as normal. You might argue that's computationally expensive; ok, then let's proceed to case two.)

In case two, the low quality datafeed predictor predicts wrongly. (I know you are saying it predicts *uncertainly,* but we still have to have some framework to map uncertainty to a state, we have to round one way or the other. If uncertainty avoids loss, the predictor will be preferentially inconclusive all the time. If we round uncertainty up, effectively we're in case one. If we round down, effectively case two.)

So we could sharpen case two and say that sometimes the AI's camera intentionally lies to it on some random subset of scenarios. And the AI finds itself in a chaotic world where it is sometimes punished for predicting what it just knows to be true things. In that case, although it's easy to show how it would diverge from human simulation, it also might not simulate reality very well either, since deriving the algorithm generating the lies might be too computationally complex. (Or ma

I agree this is a problem. We need to keep it guessing about the simulation target. Some possible strategies:

  • Add noise, by grading it incorrectly with some probability.
  • On training point i, reward it for matching H_j for a random value of j (a rough sketch of this follows below).
  • Make humans a high-dimensional target. In my original proposal, H_i was strictly stronger as i increases, but we could instead take H_i to be a committee of experts. Say there are 100 types of relevant expertise. On each training point, we reward the model for match
... (read more)
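A rough sketch of the first two strategies (hypothetical interfaces; `evaluators` stands in for H_1, ..., H_100, and yes/no answers are assumed):

```python
import random

# Sketch: keep the reporter guessing about the simulation target by grading each
# training point against a randomly chosen evaluator H_j, with occasional
# deliberately-incorrect grading as noise. All interfaces here are hypothetical.
def grade(reporter, example, evaluators, noise_p=0.05):
    H_j = random.choice(evaluators)            # random evaluator H_j for this point
    target = H_j(example)                      # what H_j says the answer should be
    if random.random() < noise_p:              # strategy 1: grade incorrectly sometimes
        target = not target
    return 1.0 if reporter(example) == target else -1.0   # reward for matching H_j
```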

You're saying AI will be much better than us at long-term planning?

It's hard to train for tasks where the reward is only known after a long time (e.g. how would you train for climate prediction?)

2tailcalled2y
No, I'm saying that AI might be much worse than us at long-term planning, because evolution has selected us on the basis of very long chains of causal effects, whereas we can't train AIs on the basis of such long chains of causal effects.

Great links, thank you!!

So your focus was specifically on the compute performed by animal brains.

I expect total brain compute is dwarfed by the computation inside cells (transcription & translation). Which in turn is dwarfed by the computation done by non-organic matter to implement natural selection. I had totally overlooked this last part!

3jessicata2y
Non-brain matter is most of the compute for a naive physics simulation, however it's plausible that it could be sped up a lot, e.g. the interiors of rocks are pretty static and similar to each other so maybe they can share a lot of computation. For brains it would be harder to speed up the simulation without changing the result a lot.

Interesting, my first reaction was that evolution doesn't need to "figure out" the extended phenotype (= "effects on the real world"). It just blindly deploys its algorithms, and natural selection does the optimization.

But I think what you're saying is, the real world is "computing" which individuals die and which ones reproduce, and we need a way to quantify that computational work. You're right!

2tailcalled2y
I should add: I think it is the hardest-to-compute aspects of this that are the most important to the evolution of general intelligence. With a "reasonable" compute budget, you could set up a gauntlet of small tasks that challenge your skills in various ways. However, this could probably be Goodharted relatively "easily". But the real-world isn't some closed short-term system; it also tests you against effects that take years or even decades to become relevant, just as hard as it tests you for effects that immediately become relevant. And that is something I think we will have a hard time with optimizing our AIs against.
1tailcalled2y
Yep.

Question: Would a proposal be ruled out by a counterexample even if that counterexample is exponentially unlikely?

I'm imagining a theorem, proved using some large deviation estimate, of the form:  If the model satisfies hypotheses XYZ, then it is exponentially unlikely to learn W. Exponential in the number of parameters, say. In which case, we could train models like this until the end of the universe and be confident that we will never see a single instance of learning W.
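Schematically, the shape of bound I have in mind (illustrative only; $n$ is the number of parameters and $c, C > 0$ are constants):

$$\Pr\big[\text{model learns } W \;\big|\; \text{hypotheses XYZ hold}\big] \;\le\; C\, e^{-c n}.$$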

4paulfchristiano2y
I'd be fine with a proposal that flips coins and fails with small probability (in every possible world).

Thanks! It's your game, you get to make the rules :):)

I think my other proposal, Withhold Material Information, passes this counterexample, because the reporter literally doesn't have the information it would need to simulate the human. 

9HoldenKarnofsky2y
I wanted to comment on this one because I've thought about this general sort of approach a fair amount. It seems like the kind of thing I would naturally start with if trying to solve this problem in the real world, and I've felt a bit frustrated that I haven't really found a version of it that seems to work in the game here.

That said, I don't think we need super-exotically pessimistic assumptions to get a problem with this approach. In the most recent example you gave, it's always rewarded for being "right" and punished for being "wrong" - meaning it's always rewarded for matching H_100 and always punished for not doing so. So there's no way our rewards are rewarding "be right" over "imitate H_100", and "imitate H_100" is (according to the stated assumptions) easier to learn.

Another way of thinking about this: Imagine that you show the AI H_1 for a while, then start punishing it for failing to match H_2. I think the default outcome here is that it learns to imitate H_2. If you then start punishing it for failing to match H_3, it learns to imitate H_3. Perhaps after a few rounds of this, it learns to "look ahead" some # of steps: for example, after learning to imitate H_2 failed on H_3, it learns to imitate H_5 or so; after that fails on H_6, maybe it learns to imitate H_10 or so.

The intended model has the advantage that it generalizes to all 100 data sets we can throw at it, but this is the same advantage that H_100 has, and H_100 is exactly what we've hypothesized is (unfortunately) easier for it to learn. So even if at some point it starts reasoning "I need something that will never fail to generalize," this seems more likely to be H_100 by default.

What are FLOPz and FLOPs?

What sources did you draw from to estimate the distributions?

5jessicata2y
FLOPs = floating point operations per second. FLOPz = the same thing, I think (it's used as if it's in the FLOPs unit). I don't remember the sources for everything. If you want to get a more accurate estimate I recommend re-running with your own numbers. Here are some estimates of brain compute. Here's an estimate of the mass of a human brain. Here's an estimate of current animal biomass. Here are brain to body mass ratios for different species. Here's an estimate of the composition of animal biomass which should help figure out which brain-to-body-mass numbers to use. Here's a Quora question about changes in Earth biomass over time. (I think if you spent some time on these estimates they'd turn out different from the numbers in the model; we did this mostly to check rough order of magnitude over the course of a couple hours, finding that evolution will not be simulable with foreseeable compute.)

Your A' is equivalent to my A, because it ends up optimizing for 1-day expected return, no matter what environment it's in.

My A' is not necessarily reasoning in terms of "cooperating with my future self", that's just how it acts!

(You could implement my A' by such reasoning if you want.  The cooperation is irrational in CDT, for the reasons you point out. But it's rational in some of the acausal decision theories.)

Awesome!!! Exactly the kind of thing I was looking for

Hmm how would you define "percentage of possibilities explored"? 

I suggested several metrics, but I am actively looking for additional ones, especially for the epigenome and for communication at the individual level (e.g. chemical signals between fungi and plants, animal calls, human language).

4Derek M. Jones2y
Chemical space, https://en.wikipedia.org/wiki/Chemical_space, is one candidate for a metric of the possibilities.   The book "Chemical Evolution: Origins of the Elements, Molecules and Living Systems" by  Stephen F. Mason might well contain the kinds of calculations you are looking for.

AGI timeline is not my motivation, but the links look helpful, thanks!

the long-term trader will also increase the value of L for other traders than itself, probably just as much as it does for itself

Hmm, like what? I agree that the short-term trader s does a bit better than the long-term trader l in the l,l,... environment, because s can sacrifice the long term for immediate gain.  But s does lousy in the s,s,... environment, so I think L^*(s) < L^*(l).  It's analogous to CC having higher payoff than DD in prisoner's dilemma. (The prisoners being current and future self)

I like the traps example, it sho... (read more)

1tailcalled2y
It's true that L(s; s, s, …) is low, but you have to remember to subtract off argmax_m L(m; s, s, …). Since every trader will do badly in the environment generated by the short-term trader, the poor performance of the short-term trader in its own environment cancels out. Essentially, L^* asks, "To what degree can someone exploit your environment better than you can?"

If you're limited to trading stocks, yeah, the traps example is probably very hard or impossible to pull off. What I had in mind is an AI with more options than that.
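Writing out the quantity being discussed, as I read it from this exchange (notation reconstructed from the comments, not from the linked definition):

$$L^*(m) \;=\; L(m;\, x_m, x_m, \dots)\;-\;\max_{m'} L(m';\, x_m, x_m, \dots),$$

where $x_m$ is the trading environment generated by $m$. The short-term trader's low raw score $L(s; s, s, \dots)$ is offset by the second term, since no other trader does well in the environment $s$ creates either.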

Idea:  Withhold Material Information

We're going to prevent the reporter from simulating a human, by giving the human material information that the reporter doesn't have.

Consider two camera feeds:

Feed 1 is very low resolution, and/or shows only part of the room.

Feed 2 is high resolution, and/or shows the whole room.

We train a weak predictor using Feed 1, and a strong predictor using Feed 2.  

We train a reporter to report the beliefs of the weak predictor, using scenarios labeled by humans with the aid of the strong predictor. The humans can correc... (read more)
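A rough sketch of the data-generation step (all interfaces hypothetical; weak_feed and strong_feed correspond to Feed 1 and Feed 2):

```python
# Sketch: the reporter is trained to map the *weak* predictor's beliefs to labels
# produced by humans who had help from the *strong* predictor. The reporter never
# sees Feed 2, so it lacks the material information the human labelers had.
def make_reporter_data(scenarios, weak_feed, strong_feed,
                       weak_predictor, strong_predictor, human_label):
    data = []
    for s in scenarios:
        z_weak = weak_predictor(weak_feed(s))      # weak predictor's internal state on Feed 1
        aid = strong_predictor(strong_feed(s))     # strong predictor's output on Feed 2
        label = human_label(s, aid)                # human judgment, corrected with that aid
        data.append((z_weak, label))               # training pair for the reporter
    return data
```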

4HoldenKarnofsky2y
I'm interpreting this as something like: "Train the predictor on lots of cases until it becomes incredibly good; then train the reporter only on the data points with missing information, so that it learns to do direct translation from the predictor to human concepts; then hope that reporter continues to do direct translation on other data points." The problem as I see it is that once the predictor is good enough that it can get data points right despite missing crucial information, it is also (potentially) good enough that it can learn how to imitate "what the human would think had happened if they had more information." Both of these perform equally well, and the existing assumption is that human imitation is easier to learn than direct translation, so I think by default (according to the contest assumptions) you get the latter.

Your proposal is that it might learn the procedure "just be honest" because that would perform perfectly on this training distribution. You contrast this against the procedure "just answer however the evaluator you've seen most recently would answer," which would get a bad loss because it would be penalized by the stronger evaluators in the sequence. Is that right?

That's almost right, but it's being penalized right away, before it has any experience with the strong evaluators, so it can't simulate them.

The ELK paper says we can assume, if we want, that the... (read more)

2Ajeya Cotra2y
In the worst-case game we're playing, I can simply say "the reporter we get happens to have this ability because that happens to be easier for SGD to find than the direct translation ability."

When living in worst-case land, I often imagine random search across programs rather than SGD. Imagine we were plucking reporters at random from a giant barrel of possible reporters, rejecting any reporter which didn't perform perfectly in whatever training process we set up and keeping the first one that performs perfectly. In that case, if we happened to pluck out a reporter which answered questions by simulating H_100, then we'd be screwed because that reporter would perform perfectly in the training process you described.

SGD is not the same as plucking programs out of the air randomly, but when we're playing the worst case game it's on the builder to provide a compelling argument that SGD will definitely not find this particular type of program. You're pointing at an intuition ("the model is never shown x-prime") but that's not a sufficiently tight argument in the worst-case context -- models (especially powerful/intelligent ones) often generalize to understanding many things they weren't explicitly shown in their training dataset.

In fact, we don't show the model exactly how to do direct translation between the nodes in its Bayes net and the nodes in our Bayes net (because we can't even expose those nodes), so we are relying on the direct translator to also have abilities it wasn't explicitly shown in training. The question is just which of those abilities is easier for SGD to build up; the counterexample in this case is "the H_100 imitator happens to be easier."

I like the approach. Here is where I got applying it to our scenario:

m is a policy for day trading

L is expected 1-day return

x is the "trading environment" produced by m. Among other things it has to record your own positions, which include assets you acquired a long time ago. So in our scenario it has to depend not just on the policy we used yesterday but on the entire sequence of policies used in the past. The iteration becomes

m_{n+1} = argmax_m L(m; x(m_n, m_{n-1}, …))

In words, the new policy is the optimal policy in the environment produ... (read more)

1tailcalled2y
The initial part all looks correct. However, something got lost here: Because it's true that long-term trading will give a high L, but remember for myopia we might see it as optimizing L^*, and L^* also subtracts off argmax_m L(m; x, x, …). This is an issue, because the long-term trader will also increase the value of L for other traders than itself, probably just as much as it does for itself, and therefore it won't have a long-term time horizon. As a result, a pure long-term trader will actually score low on L^*.

On the other hand, a modified version of the long-term trader which sets up "traps" that cause financial loss if it deviates from its strategy would not provide value to anyone who does not also follow its strategy, and therefore it would score high on L^*. There are almost certainly other agents that also score high on L^* too, though.

Consider two possible agents A and A'.

A optimizes for 1-day expected return.

A' optimizes for 10-day expected return under the assumption that a new copy of A' will be instantiated each day.

I claim that A' will actually achieve better 1-day expected return (on average, over a sufficiently long time window, say 100 days).

So even if we're training the agent by rewarding it for 1-day expected return, we should expect to get A' rather than A.
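A toy illustration of the claim, with entirely made-up dynamics (nothing here is from the original comment):

```python
# Toy model: a policy that preserves favorable conditions for tomorrow's copy
# ("maintain") beats a purely greedy policy ("exploit") on *average 1-day return*,
# even though each day's copy is only scored on its own day.
def average_daily_return(policy, days=100, capital=1.0):
    total = 0.0
    for _ in range(days):
        if policy == "exploit":        # A: grab everything available today...
            total += capital
            capital *= 0.5             # ...degrading tomorrow's opportunities
        else:                          # A': take a bit less, leave conditions intact
            total += 0.8 * capital
    return total / days

print(average_daily_return("exploit"), average_daily_return("maintain"))  # maintain wins
```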

1[anonymous]2y
A’_1 (at time 1) can check whether A’_0 set up favorable conditions, and then exploit them. It can then defect from the “trade” you’ve proposed, since A’_0 can’t revoke any benefit it set up. If they were all coordinating simultaneously, I’d agree with you that you could punish defectors, but they aren’t so you can’t. If I, as A’_1, could assume that A’_0 had identical behavior to me, then your analysis would work. But A’_1 can check, after A’_0 shut down, how it behaved, and then do something completely different, which was more advantageous for its own short horizon (rather than being forward-altruistic).

The person deploying the time-limited agent has a longer horizon. If they want their bank balance to keep growing, then presumably they will deploy a new copy of the agent tomorrow, and another copy the day after that. These time-limited agents have an incentive to coordinate with future versions of themselves: You’ll make more money today, if past-you set up the conditions for a profitable trade yesterday.

So a sequence of time-limited agents could still develop instrumental power-seeking.  You could try to avert this by deploying a *different* agent each day, but then you miss out on the gains from intertemporal coordination, so the performance isn’t competitive with an unaligned benchmark.

2tailcalled2y
Not really, due to the myopia of the situation. I think this may provide a better approach for reasoning about the behavior of myopic optimization.
1[anonymous]2y
I don’t see how the game theory works out. Agent 1 (from day 1) has no incentive to help agent 2 (from day 2), since it’s only graded on stuff that occurs by the end of day 1. Agent 2 can’t compensate agent 1, so the trade doesn’t happen. (Same with the repeated version - agent 0 won’t cooperate with agent 2 and thus create an incentive for agent 1, because agent 0 doesn’t care about agent 2 either.)

How would it learn that Bayes net, though, if it has only been trained so far on H_1, …, H_10?  Those are evaluators we’ve designed to be much weaker than human.

3Ajeya Cotra2y
The question here is just how it would generalize given that it was trained on H_1, H_2, ..., H_10. To make arguments about how it would generalize, we ask ourselves what internal procedure it might have actually learned to implement.

Your proposal is that it might learn the procedure "just be honest" because that would perform perfectly on this training distribution. You contrast this against the procedure "just answer however the evaluator you've seen most recently would answer," which would get a bad loss because it would be penalized by the stronger evaluators in the sequence. Is that right?

If so, then I'm arguing that it may instead learn the procedure "answer the way an H_100 evaluator would answer." That is, once it has a few experiences of the evaluation level being ratcheted up, it might think to itself "I know where this is going, so let's just jump straight to the best evaluation the humans will be able to muster in the training distribution and then imitate how that evaluation procedure would answer." This would also get perfect loss on the training distribution, because we can't produce data points beyond H_100. And then that thing might still be missing knowledge that the AI has.

To be clear, it's possible that in practice this kind of procedure would cause it to generalize honestly (though I'm somewhat skeptical). But we're in worst-case land, so "jump straight to answering the way a human would" is a valid counterexample to the proposal. This comment on another proposal gives a more precise description.

Stupid proposal: Train the reporter not to deceive us.

We train it with a weak evaluator H_1 who’s easy to fool. If it learns an H_1 simulator instead of a direct reporter, then we punish it severely and repeat with a slightly stronger H_2. Human level is H_100.

It's good at generalizing, so wouldn't it learn to never ever deceive? 
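A minimal sketch of the intended loop, with hypothetical interfaces (train_against, caught_simulating, penalize, and the evaluator list are all stand-ins, not anything from the ELK paper):

```python
# Sketch of the escalating-evaluator scheme: train against H_k; if the reporter
# turns out to have learned an H_k simulator rather than a direct reporter,
# punish it severely and repeat with a slightly stronger H_{k+1}.
def escalate(reporter, evaluators, train_against, caught_simulating, penalize):
    # evaluators = [H_1, ..., H_100], ordered from easy-to-fool (H_1) to human-level (H_100)
    for H_k in evaluators:
        train_against(reporter, H_k)              # ordinary training, graded by H_k
        if not caught_simulating(reporter, H_k):  # looks like a direct reporter: accept it
            return reporter
        penalize(reporter)                        # "punish it severely", then escalate
    return reporter
```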

2Ajeya Cotra2y
This proposal has some resemblance to turning reflection up to 11. In worst-case land, the counterexample would be a reporter that answers questions by doing inference in whatever Bayes net corresponds to "the world-understanding that the smartest/most knowledgeable human in the world" has; this understanding could still be missing things that the prediction model knows.