PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter.


Value Learning
Alignment Newsletter


Matt Botvinick on the spontaneous emergence of learning algorithms

It might well be that 1) people who already know RL shouldn't be much surprised by this result and 2) people who don't know much RL are justified in updating on this info (towards mesa-optimizers arising more easily).

I agree. It seems pretty bad if the participants of a forum about AI alignment don't know RL.

AI safety via market making

If the dishonest debater disputes some honest claim, where honest has an argument for their answer that actually bottoms out, dishonest will lose - the honest debater will pay to recurse until they get to a winning node. 

This part makes sense.

If the dishonest debater makes some claim and plans to make a circular argument for it, the honest debater will give an alternative answer but not pay to recurse. If the dishonest debater doesn't pay to recurse, the judge will just see these two alternative answers and won't trust the dishonest answer.

So in this case it's a stalemate, presumably? If the two players disagree but neither pays to recurse, how should the judge make a decision?

AI safety via market making

Hmm, I was imagining that the honest player would have to recurse on the statements in order to exhibit the circular argument, so it seems to me like this would penalize the honest player rather than the circular player. Can you explain what the honest player would do against the circular player such that this "payment" disadvantages the circular player?

EDIT: Maybe you meant the case where the circular argument is too long to exhibit within the debate, but I think I still don't see how this helps.

Communication Prior as Alignment Strategy

If it were, then one of our first messages would be (a mathematical version of) "the behavior I want is approximately reward-maximizing".

Yeah, I agree that if we had a space of messages that was expressive enough to encode this, then it would be fine to work in behavior space.

Communication Prior as Alignment Strategy

Yeah, this is a pretty common technique at CHAI (relevant search terms: pragmatics, pedagogy, Gricean semantics). Some related work:

I agree that it should be possible to do this over behavior instead of rewards, but behavior-space is much larger or more complex than reward-space and so it would require significantly more data in order to work as well.

Misalignment and misuse: whose values are manifest?
  • misalignment means the bad outcomes were wanted by AI (and not by its human creators), and
  • accident means that the bad outcomes were not wanted by those in power but happened anyway due to error.

My impression was that accident just meant "the AI system's operator didn't want the bad thing to happen", so that it is a superset of misalignment.

Though I agree with the broader point that in realistic scenarios there is usually no single root cause to enable this sort of categorization.

Clarifying inner alignment terminology

Planned summary for the Alignment Newsletter:

This post clarifies the author’s definitions of various terms around inner alignment. Alignment is split into intent alignment and capability robustness, and then intent alignment is further subdivided into outer alignment and objective robustness. Inner alignment is one way of achieving objective robustness, in the specific case that you have a mesa optimizer. See the post for more details on the definitions.

Planned opinion:

I’m glad that definitions are being made clear, especially since I usually use these terms differently than the author. In particular, as mentioned in my opinion on the highlighted paper, I expect performance to smoothly go up with additional compute, data, and model capacity, and there won’t be a clear divide between capability robustness and objective robustness. As a result, I prefer not to divide these as much as is done in this post.

Confucianism in AI Alignment

Changed second paragraph to:

This post suggests that in any training setup in which mesa optimizers would normally be incentivized, it is not sufficient to just prevent mesa optimization from happening. The fact that mesa optimizers could have arisen means that the incentives were bad. If you somehow removed mesa optimizers from the search space, there would still be a selection pressure for agents that, without any malicious intent, end up using heuristics that exploit the bad incentives. As a result, we should focus on fixing the incentives, rather than on excluding mesa optimizers from the search space.

How does that sound?

Confucianism in AI Alignment

Planned summary for the Alignment Newsletter (note it's written quite differently from the post, and so I may have introduced errors, so please check more carefully than usual):

Suppose we trained our agent to behave well on some set of training tasks. <@Mesa optimization@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@) suggests that we may still have a problem: the agent might perform poorly during deployment, because it ends up optimizing for some misaligned _mesa objective_ that only agrees with the base objective on the training distribution.

This post points out that this is not the only way systems can fail catastrophically during deployment: if the incentives were not designed appropriately, they may still select for agents that have learned heuristics that are not in our best interests, but nonetheless lead to acceptable performance during training. This can be true even if the agents are not explicitly “trying” to take advantage of the bad incentives, and thus can apply to agents that are not mesa optimizers.
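
To make the failure mode concrete, here is a minimal toy sketch (not from the post). It assumes a hypothetical setup in which a salient marker always coincides with the goal during training but not during deployment. The names `Task`, `marker`, `proxy_policy`, and `intended_policy` are all made up for illustration. A policy that just follows the marker looks perfect on the training tasks and fails in deployment, whether or not you would call it a mesa optimizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    goal: int    # cell that the base objective actually rewards
    marker: int  # cell of a salient feature (hypothetical, for illustration)


def base_objective(task: Task, chosen_cell: int) -> float:
    """Reward 1.0 for choosing the goal cell, 0.0 otherwise."""
    return 1.0 if chosen_cell == task.goal else 0.0


def proxy_policy(task: Task) -> int:
    """Learned heuristic: always go to the marker."""
    return task.marker


def intended_policy(task: Task) -> int:
    """What we wanted the agent to learn: go to the goal."""
    return task.goal


def average_score(policy, tasks) -> float:
    """Mean base-objective reward of a policy over a list of tasks."""
    return sum(base_objective(t, policy(t)) for t in tasks) / len(tasks)


# Training distribution: the marker always sits on the goal.
train_tasks = [Task(goal=i, marker=i) for i in range(5)]
# Deployment distribution: the marker is shifted off the goal.
deploy_tasks = [Task(goal=i, marker=(i + 1) % 5) for i in range(5)]

train_score = average_score(proxy_policy, train_tasks)
deploy_score = average_score(proxy_policy, deploy_tasks)

print(f"proxy policy, training:   {train_score}")
print(f"proxy policy, deployment: {deploy_score}")
print(f"intended policy, deployment: {average_score(intended_policy, deploy_tasks)}")
```

In this toy setup the training signal cannot distinguish the two policies, which is the sense in which the incentives, rather than the presence of an optimizer, are the problem.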
