Alec Harris — LessWrong

LESSWRONG
LW

Replying toPreference gaps as a safeguard against AI self-replication

Preference gaps as a safeguard against AI self-replication

One concern might be that creating copies/counterparts instrumentally could be very useful for automating AI safety research. Perhaps one can get around this by making copies up front that AIs can use for their AI safety research. However, a misaligned AI might then be able to replace "making copies" with "moving existing copies". Is it possible to make a firm distinction between what we need for automating AI safety research and the behavior we want to eliminate?

Replying toPreference gaps as a safeguard against AI self-replication

Alec Harris2mo

Preference gaps as a safeguard against AI self-replication

Finally, notice that there seems to be little risk involved in specifying the class of counterparts more broadly than is needed, given that there are few circumstances in which an agent needs to prefer to create some counterparts but not others in order to be useful.

Could a risk of specifying counterparts too broadly be that the agent is incentivized to kill existing counterparts in order to create more aligned counterparts? For example, Alice's AI might want to kill Bob's AI (considered a counterpart) so that Alice's AI can replicate itself and do twice as much work without altering the number of counterparts at later timesteps.

I could see it being the case that... (read more)

Replying toShutdownable Agents through POST-Agency

Alec Harris3mo*

Shutdownable Agents through POST-Agency

It seems to me like a potential limitation of POST-Agency might be impediment avoidance. Take the "Work or Steal" example from Section 14. The agent might choose to work rather than steal if it believes that stealing is likely to be punished by jail time (as a risk unique from shutdown).

Similarly, if the agent believes a human is in the way of where a paperclip factory should be, it might send a killer drone to remove the human. If other humans would take down the killer drone this might present the possibility of further impediment. Thus, the agent may scheme to take countermeasures in advance to minimize this impediment. In order to... (read more)

Replying toShutdownable Agents through POST-Agency

Alec Harris3mo

Shutdownable Agents through POST-Agency

I had a thought about a potential reward function for Neutrality+. For each batch you would:

Run the agent in the environment many times to build a dataset of episodes. The environment would have ways of getting shut down interwoven with ways of getting reward
For each trajectory length, select 10 episodes from the dataset where the agent was shutdown at that trajectory length
Sum the reward across all of the selected episodes (10 * the number of trajectory lengths); that would be the reward for the batch

The idea is that the agent's reward will be situated on a guarantee of having reached each trajectory length a constant number of times so it will have... (read more)