Specification gaming examples in AI

Thanks a lot for this post. I think “specification gaming” is an essential problem to solve but also possibly one of the hardest to make progress on, so it’s great to see resources aimed at helping people do that. I was unaware of CycleGAN, but it sounds very interesting and I plan to look into it. Thanks for posting something that put me onto an interesting observation!

Metaphilosophical competence can't be disentangled from alignment

I think decision making can have an impact on values, but that this depends on the design of the agent. In my comment, by values, I had in mind something like “the thing that the agent is maximizing”. We can imagine an agent like the paperclip maximizer for which the “decision making” ability of the agent doesn’t change the agent’s values. Is this agent in an “epistemic pit”? I think the agent is in a “pit” from our perspective, but it’s not clear that the pit is epistemic. One could model the paperclip maximizer as an agent whose epistemology is fine but that simply values different things than we do. In the same way, I think people could be worried about amplifying humans because they are worried that those humans will get stuck in non-epistemic pits rather than epistemic pits. For example, I think the discussion in other comments related to slavery partly has to do with this issue.

The extent to which a human's "values" will improve as a result of improvements in "decision making" seems to me to depend on having a psychological model of humans, which won't necessarily be a good model of an AI. As a result, people may agree/disagree in terms of their intuitions about amplification as applied to humans due to similarities/differences in their psychological models of those humans, while their agreement/disagreement on amplification as applied to AI may result from different factors. In this sense, I'm not sure that different intuitions about amplifying humans are necessarily a crux for differences in terms of amplifying AIs.

In general I expect amplification to improve decision-making processes substantially, but in most cases to not improve them enough.

To me, this seems like a good candidate for something close to a crux between "optimistic" alignment strategies (similar to amplification/distillation) vs. "pessimistic" alignment strategies (similar to agent foundations). I see it like this. The "optimistic" approach is more optimistic that certain aspects of metaphilosophical competence can be learned during the learning process or else addressed by fairly simple adjustments to the design of the learning process, whereas the "pessimistic" approach is pessimistic that solutions to certain issues can be learned and so thinks that we need to do a lot of hard work to get solutions to these problems that we understand on a deep level before we can align an agent. I'm not sure which is correct, but I do think this is a critical difference in the rationales underlying these approaches.

Metaphilosophical competence can't be disentangled from alignment

I think the thought experiment that you propose is interesting, but doesn't isolate different factors that may contribute to people's intuitions. For example, it doesn't distinguish between worries about making individual people powerful because of their values (e.g. they are selfish or sociopathic) vs. worries due to their decision-making processes. I think this is important because it seems likely that "amplifying" someone won't fix value-based issues, but possibly will fix decision-making issues. If I had to propose a candidate crux, it would probably be more along the lines of how much of alignment can be solved through using a learning algorithm to help learn solutions vs. how much of the problem needs to be solved "by hand" and understood on a deep level rather than learned. Along those lines, I found the postscript to Paul Christiano's article on corrigibility interesting.