Wiki Contributions


I'm confused about the example you give. In the paragraph, Eliezer is trying to show that you ought to accept the independence axiom, cause you can be Dutch booked if you don't. I'd think if you're updateless, that means you already accept the independence axiom (cause you wouldn't be time-consistent otherwise).

And in that sense it seems reasonable to assume that someone who doesn't already accept the independence axiom is also not updateless.

I agree it's important to be careful about which policies we push for, but I disagree both with the general thrust of this post and the concrete example you give ("restrictions on training data are bad").

Re the concrete point: it seems like the clear first-order consequence of any strong restriction is to slow down AI capabilities. Effects on alignment are more speculative and seem weaker in expectation. For example, it may be bad if it were illegal to collect user data (eg from users of chat-gpt) for fine-tuning, but such data collection is unlikely to fall under restrictions that digital artists are lobbying for.

Re the broader point: yes, it would be bad if we just adopted whatever policy proposals other groups propose. But I don't think this is likely to happen! In a successful alliance, we would find common interests between us and other groups worried about AI, and push specifically for those. Of course it's not clear that this will work, but it seems worth trying.

I also think that often "the AI just maximizes reward" is a useful simplifying assumption. That is, we can make an argument of the form "even if the AI just maximizes reward, it still takes over; if it maximizes some correlate of the reward instead, then we have even less control over what it does and so are even more doomed".

(Though of course it's important to spell the argument out)

I agree with your general point here, but I think Ajeya's post actually gets this right, eg

There is some ambiguity about what exactly “maximize reward” means, but once Alex is sufficiently powerful -- and once human knowledge/control has eroded enough -- an uprising or coup eventually seems to be the reward-maximizing move under most interpretations of “reward.”


What if Alex doesn’t generalize to maximizing its reward in the deployment setting? What if it has more complex behaviors or “motives” that aren’t directly and simply derived from trying to maximize reward? This is very plausible to me, but I don’t think this possibility provides much comfort -- I still think Alex would want to attempt a takeover.

FWIW I believe I wrote that sentence and I now think this is a matter of definition, and that it’s actually reasonable to think of an agent that e.g. reliably solves a maze as an optimizer even if it does not use explicit search internally.

  • importance / difficulty of outer vs inner alignment
  • outlining some research directions that seem relatively promising to you, and explain why they seem more promising than others

I would be very curious to see your / OpenAI's responses to Eliezer's Dimensions of Operational Adequacy in AGI Projects post. Which points do you / OpenAI leadership disagree with? Insofar as you agree but haven't implemented the recommendations, what's stopping you?

People at OpenAI regularly say things like

And you say:

  • OpenAI leadership tend to put more likelihood on slow takeoff, are more optimistic about the possibility of solving alignment, especially via empirical methods that rely on capabilities

AFAICT, no-one from OpenAI has publicly explained why they believe that RLHF + amplification is supposed to be enough to safely train systems that can solve alignment for us. The blog post linked above says "we believe" four times, but does not take the time to explain why anyone believes these things.

Writing up this kind of reasoning is time-intensive, but I think it would be worth it: if you're right, then the value of information for the rest of the community is huge; if you're wrong, it's an opportunity to change your minds.

Load More