Epistemologist specialized in the difficulties of alignment. Currently at Conjecture, and running Refine.


Building Blocks
Becoming Stronger as Epistemologist
Epistemic Cookbook for Alignment
Reviews for the Alignment Forum
AI Alignment Unwrapped
Deconfusing Goal-Directedness



I'm confused by your confusion, given that I'm pretty sure you understand the meaning of cognitive bias, which is quite explicitly the meaning of bias drawn upon here.

Thanks for your comment!

Actually, I don't think we really disagree. I might have just not made my position very clear in the original post.

The point of the post is not to say that these activities are not often valuable, but instead to point out that they can easily turn into "To do science, I need to always do [activity]". And what I'm getting from the examples is that in some cases, you actually don't need to do [activity]. There's a shortcut, or maybe you're just in a different phase of the problem.

Do you think there is still a disagreement after this clarification?

In a limited context, the first example that comes to me is high performers in competitive sports and games. Because if they truly only give a shit about winning (and the best generally do), they will throw away their legacy approaches when they find a new one, however it pains them.

Thanks for the kind words!

I'm not aware of any such statistics, but I'm guessing that MATS organizers might have some.

I interpret Alex as arguing that this is not two different difficulties versus one, but the same difficulty plus an additional one. From this perspective, having two will be more of an issue than having one, because you have to address strictly more things.

This makes me wonder though if there is not just some sort of direction question underlying the debate here. Because if you assume the "difficulties" are only positive numbers, then if the difficulty for the direct instillation is $d_I$ and the one for the grader optimization is $d_I + d_E$, then there's no debate that the latter is bigger than the former.

But if you allow directionality (even in one dimension), then there's the risk that the sum leads to less difficulty in total (by having the $d_E$ move in the opposite direction in one dimension). That being said, these two difficulties seem strictly additive, in the sense that I don't see (currently) how the difficulty of evaluation could partially cancel the difficulty of instillation.
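
As a small worked version of the sign argument above, write $d_I$ for the difficulty of direct instillation and $d_E$ for the extra difficulty contributed by evaluation. Both symbols are labels assumed here for illustration, and treating difficulty as a single real number is itself a simplification:

```latex
\[
d_I > 0,\; d_E > 0 \;\Longrightarrow\; d_I + d_E > d_I,
\qquad\text{whereas}\qquad
d_E < 0 \;\Longrightarrow\; d_I + d_E < d_I .
\]
```

For instance, $d_I = 3$ and $d_E = 2$ give a total of $5 > 3$, while a hypothetical $d_E = -1$ would give $2 < 3$. The second case is the partial cancellation that does not seem to apply to these two difficulties.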

Thanks for taking time to answer my questions in detail!

About your example for other failure modes

Is it meant to point at the ability of the actor to make the plan more confusing/harder to evaluate? Meaning that you're pointing at the ability for the actor to "obfuscate" its plan in order to get high reward?

If so, it's not clear to me why this is valuable for the actor to do? How is it supposed to get better reward from confusion only? If it has another agenda (making paperclips instead of diamonds for example), then the obfuscation is clearly valuable to allow it to work on its main goal. But here its goal is to improve evaluation, and so confusion doesn't seem like it helps.

About the cost/competitiveness argument

I think that a values-executing AGI can also search over as many plans which actually make sense, I don't think its options are limited or anything. But it'll be generating different kinds of plans, using reflective reasoning to restrict its search to non-adversarial-to-own-values parts of plan space (e.g. "don't think about basilisks").

This is the part where I currently cannot reconstruct your model (maybe because I haven't read shard theory in a while). Because from an abstract perspective, the tricky part of starting from significantly more limited plans is how to expand the range of plans without allowing adversarial ones. And I don't have a model of what mechanism you think makes it easier to go from the bottom up (from few plans accepted to more plans accepted) safely rather than from the top down (from all plans accepted to fewer plans accepted) safely.

About the "negative result" of not even trying to do grader optimization

  1. I don't see why that should exist; any plan-inhabiting adversary wishes to fool the boundary of whatever rule you provide. 
  2. See the last 10+ years of alignment researchers failing to do this. Probably wise to not spend further increments of research time on such matters, once the fault is pointed out.

But why don't you see that it should exist? What are the parts of your model that make this prediction? Because by definition, there is not a single innovation or progress that most people expected to exist. That's what it means to figure out something new. So you need an actual reason for why you're not falling into that trap of only being able to imagine what already exists.

Here are some guesses:

  • Specific scientific claim that there should be no true name/clean concept of what is adversarial for a large class of cognitive architectures (in opposition to Natural Abstraction Hypothesis I guess?)
  • Human limitation claim that even though there might be such a true name/clear concept, we have neither the time nor the ability to get it right under the current conditions
  • Security mindset claim that because the actor is allowed to get arbitrarily smart, you should expect that if it optimizes against you, it wins.


For the second point, what do you make of the claim that there are ridiculously few researchers who even attempted to work on the problem in the last 10 years, compared to almost all easier fields of knowledge? Do you think that this should not alter the evidence of absence you get from the absence of evidence? 

It... seems to be a significant simplification of the problem? I mean, not needing all the interpretability and surgery tools would be a bigger improvement, but that's probably not something we can have.

Why do you think so? Currently I'm seeing a couple of massive difficulties here that don't generally or necessarily appear in alternative approaches:

  • You need to know that you're going to reach an AGI before it becomes superintelligent, or you'll waste your time training an AI that will be taken over by the competitors. Whereas many approaches don't require this.
  • You need basically perfect interpretability, compared with approaches that require no or just some interpretability capabilities
  • You need to figure out the right translation to bootstrap it, and there seem to be risks if you get it wrong.
  • You need to figure out the right thought similarity measure to bootstrap it, and there seem to be risks if you get it wrong.

Can you help me understand why you think that these strong requirements nonetheless are simpler than most versions or approaches of the problem that you know about?

The way you write this (especially the last sentence) makes me think that you see this attempt as being close to the only one that makes sense to you atm. Which makes me curious:

  • Do you think that you are internally trying to approximate your own ?
  • Do you think that you have ever made the decision (either implicitly or explicitly) to not eval all or most plans because you don't trust your ability to do so for adversarial examples (as opposed to tractability issues for example)?
  • Can you think of concrete instances where you improved your own Eval?
  • Can you think of concrete instances where you thought you improved your own Eval but then regretted it later?
  • Do you think that your own changes to your eval have been moving in the direction of your ?

> This includes “What would this specific and superintelligent CEV-universe-simulation say about this plan?”.

> This doesn’t include (somehow) getting an AI which correctly computes what program would be recommended by AGI designers in an altruistic and superintelligent branch of humanity, and then the AI executes that program and shuts itself off without doing anything else.[5]

But isn't 1 here at least as good as 2, since the CEV-universe-simulation could always compute X=[the program that would be recommended by AGI designers in an altruistic and superintelligent branch of humanity] then return 1 iff input-plan = 'run X then shuts itself off without doing anything else' (by doing a simple text match), 0 otherwise, so there's no chance of adversarial inputs? Not to say this is a realistic way of getting an aligned AGI, but just that your argument seems to be proving too much, if it's saying that 2 is safer/better than 1.

Is your issue here that there exists a specific CEV-universe-simulation that makes 1 just as safe as 2, by basically emulating the latter situation? If so, why do you think this is a point against Alex's claim (which strikes me more as saying "there are a lot more cases of 2. being safe than of 1.")?
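
The "simple text match" evaluator from the quoted comment can be sketched as a toy grader. This is only an illustration under stated assumptions: `make_exact_match_grader`, `grade`, and the placeholder plan string are hypothetical names, and the string `X` stands in for a program that nobody knows how to compute.

```python
from typing import Callable


def make_exact_match_grader(target_plan: str) -> Callable[[str], int]:
    """Return a grader that accepts exactly one plan text (hypothetical helper)."""

    def grade(plan: str) -> int:
        # Simple text match: 1 iff the input is exactly the target plan, else 0.
        return 1 if plan == target_plan else 0

    return grade


# "X" is a placeholder for the recommended program, not a real artifact.
TARGET = "run X then shut down without doing anything else"
grade = make_exact_match_grader(TARGET)

print(grade(TARGET))                                # the one accepted plan
print(grade(TARGET + " and also make paperclips"))  # any other text is rejected
```

Because the grader's output depends only on string equality, searching over plan texts cannot find an input that scores well without being the target plan. This is the sense in which such a special-case evaluator emulates situation 2, and it says nothing about evaluators in general.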

  1. Intelligence => strong selection pressure => bad outcomes if the selection pressure is off target.
  2. In the case of agents that are motivated to optimize evaluations of plans, this argument turns into "what if the agent tricks the evaluator".
  3. In the case of agents that pursue values / shards instilled by some other process, this argument turns into "what if the values / shards are different from what we wanted".
  4. To argue for one of these over the other, you need to compare these two arguments. However, this post is stating point 2 while ignoring point 3.

One thing that is not clear to me from your comment is what you make of Alex's argument (as I see it) to the effect that "evaluation goals" are further away from "direct goals" than "direct goals" are from each other. If I run with this, it seems like an answer to your point 4 would be:

  • with directly instilled goals, there will be some risk of discrepancy that can explode due to selection pressure;
  • with evaluation-based goals, there is the same discrepancy as with directly instilled goals (because it's hard to get your goal exactly right) plus an additional discrepancy between valuing "the evaluation of X" and valuing "X".

I'm curious what you think of this claim, and if that influences at all your take.
