Research scientist at DeepMind working on AI safety, and cofounder of the Future of Life Institute.
Website and blog: vkrakovna.wordpress.com
As a data point, I found it to be a net positive to live in a smallish group house (~5 people) during the pandemic. The negotiations around covid protocols were time-consuming and annoying at times, but still manageable because of the small number of people, and seemed worth it for the benefits of socializing in person to my mental well-being. It also helped that we had been living together for a few years and knew each other pretty well. I can see how this would quickly become overwhelming with more people involved, and result in nothing being allowed if anyone can veto any given activity.
Writing this post helped clarify my understanding of the concepts in both taxonomies - the different levels of specification and types of Goodhart effects. The parts of the taxonomies that I was not sure how to match up usually corresponded to the concepts I was most confused about. For example, I initially thought that adversarial Goodhart is an emergent specification problem, but upon further reflection this didn't seem right. Looking back, I think I still endorse the mapping described in this post.
I hoped to get more comments on this post proposing other ways to match up these concepts, and I think the post would have more impact if there was more discussion of its claims. The low level of engagement with this post was an update for me that the exercise of connecting different maps of safety problems is less valuable than I thought.
Hi there! If you'd like to get up to speed on impact measures, I would recommend these papers and the Reframing Impact sequence.
It was not my intention to imply that semantic structure is never needed - I was just saying that the pedestrian example does not indicate the need for semantic structure. I would generally like to minimize the use of semantic structure in impact measures, but I agree it's unlikely we can get away without it.
There are some kinds of semantic structure that the agent can learn without explicit human input, e.g. by observing how humans have arranged the world (as in the RLSP paper). I think it's plausible that agents can learn the semantic structure that's needed for impact measures through unsupervised learning about the world, without relying on human input. This information could be incorporated in the weights assigned to reaching different states or satisfying different utility functions by the deviation measure (e.g. states where pigeons / cats are alive).
Looks great, thanks! Minor point: in the sparse reward case, rather than "setting the baseline to the last state in which a reward was achieved", we set the initial state of the inaction baseline to be this last rewarded state, and then apply noops from this initial state to obtain the baseline state (otherwise this would be a starting state baseline rather than an inaction baseline).
I would say that impact measures don't consider these kinds of judgments. The "doing nothing" baseline can be seen as analogous to the agent never being deployed, e.g. in the Low Impact AI paper. If the agent is never deployed, and someone dies in the meantime, then it's not the agent's responsibility and is not part of the agent's impact on the world.
I think the intuition you are describing partly arises from the choice of language: "killing someone by not doing something" vs "someone dying while you are doing nothing". The word "killing" is an active verb that carries a connotation of responsibility. If you taboo this word, does your question persist?
Thanks Flo for pointing this out. I agree with your reasoning for why we want the Markov property. For the second modification, we can sample a rollout from the agent policy rather than computing a penalty over all possible rollouts. For example, we could randomly choose an integer N, roll out the agent policy and the inaction policy for N steps, and then compare the resulting states. This does require a complete environment model (which does make it more complicated to apply standard RL), while inaction rollouts only require a partial environment model (predicting the outcome of the noop action in each state). If you don't have a complete environment model, then you can still use the first modification (sampling a baseline state from the inaction rollout).
I don't think the pedestrian example shows a need for semantic structure. The example is intended to illustrate that an agent with the stepwise inaction baseline has no incentive to undo the delayed effect that it has set up. We want the baseline to incentivize the agent to undo any delayed effect, whether it involves hitting a pedestrian or making a pigeon fly.
The pedestrian and pigeon effects differ in the magnitude of impact, so it is the job of the deviation measure to distinguish between them and penalize the pedestrian effect more. Optionality-based deviation measures (AU and RR) capture this distinction because hitting the pedestrian eliminates more options than making the pigeon fly.
The baseline is not intended to indicate what should happen, but rather what happens by default. The role of the baseline is to filter out effects that were not caused by the agent, to avoid penalizing the agent for them (which would produce interference incentives). Explicitly specifying what should happen usually requires environment-specific human input, and impact measures generally try to avoid this.
I was thinking of an AI specific tag, it seems a bit too broad otherwise.