Leon Lang

Wiki Contributions


Does this contest still run, given that the FTX Future Fund doesn't exist anymore?

Thanks a lot for these pointers!

Neither of your interpretations is what I was trying to say. It seems like I expressed myself not well enough.

What I was trying to say is that I think outer alignment itself, as defined by you (and maybe also everyone else), is a priori impossible since no physically realizable reward function that is defined solely based on observations rewards only actions that would be chosen by a competent, well-motivated AI. It always also rewards actions that lead to corrupted observations that are consistent with the actions of a benevolent AI. These rewarded actions may come from a misaligned AI. 

However, I notice people seem to use the terms of outer and inner alignment a lot, and quite some people seem to try to solve alignment by solving outer and inner alignment separately. Then I was wondering if they use a more refined notion of what outer alignment means, possibly by taking into account the physical capabilities of the agent, and I was trying to ask if something like that has already been written down anywhere. 

Specification gaming: Even on the training distribution, the AI is taking object-level actions that humans think are bad. This can include bad stuff that they don't notice at the time, but that seems obvious to us humans abstractly reasoning about the hypothetical.

Do you mean "the AI is taking object-level actions that humans think are bad while achieving high reward"?

If so, I don't see how this solves the problem. I still claim that every reward function can be gamed in principle, absent assumptions about the AI in question. 

To classify as specification gaming, there needs to be bad feedback provided on the actual training data. There are many ways to operationalize good/bad feedback. The choice we make here is that the training data feedback is good if it rewards exactly those outputs that would be chosen by a competent, well-motivated AI.


I assume you would agree with the following rephrasing of your last sentence:

The training data feedback is good if it rewards outputs if and only if they might be chosen by a competent, well-motivated AI. 

If so, I would appreciate it if you could clarify why achieving good training data feedback is even possible: the system that gives feedback necessarily looks at the world through observations that conceal large parts of the state of the universe. For every observation that is consistent with the actions of a competent, well-motivated AI, the underlying state of the world might actually be catastrophic from the point of view of our "intentions". E.g., observations can be faked, or the universe can be arbitrarily altered outside of the range of view of the feedback system.

If you agree with this, then you probably assume that there are some limits to the physical capabilities of the AI, such that it is possible to have a feedback mechanism that cannot be effectively gamed. Possibly when the AI becomes more powerful, the feedback mechanism would in turn need to become more powerful to ensure that its observations "track reality" in the relevant way. 

Does there exist a write-up of the meaning of specification gaming and/or outer alignment that takes into account that this notion is always "relative" to the AI's physical capabilities?

Yes, after reflection I think this is correct. I think I had in mind a situation where with deployment, the training of the AI system simply stops, but of course, this need not be the case. So if training continues, then one either needs to argue stronger reasons why the distribution shift leads to a catastrophe (e.g., along the lines you argue) or make the case that the training signal couldn't keep up with the fast pace of the development. The latter would be an outer alignment failure, which I tried to avoid talking about in the text. 

Yes, it seems like both creating a "New Shortform" when hovering over my user name and commenting on "Leon Lang's Shortform" will do the exact same thing. But I can also reply to the comments.

This is my first comment on my own, i.e., Leon Lang's, shortform. It doesn't have any content, I just want to test the functionality.

This is my first short form. It doesn't have any content, I just want to test the functionality.

But essentially, I suspect generating suffering as a subgoal for an AGI is something like an anti-convergent goal: It makes almost all other goals harder to achieve.

I think I basically agree (though maybe not with as much high confidence as you), but I think that doesn't mean that huge amounts of suffering will not dominate the future. For example, if there will be not one but many superintelligent AI systems determining the future, this might create suffering due to cooperation failures. 

Load More