A month ago, I predicted that AI systems will be able to access safety plans posted on the Internet and use them for their own purposes. If true, it follows that a likely misaligned-by-default AGI could exploit our safety plans, likely to our detriment.

The post was controversial. On the EA forum, it obtained only 13 net upvotes from 23 voters, and the top comment (which disagreed with the post) obtained 25 net upvotes from 17 voters. 

On LessWrong, my post obtained only 3 net upvotes from 13 votes, while the top comment (which also disagreed with the post) obtained 9 upvotes from 3 votes.

I'm writing to report that OpenAI's recent ChatGPT system has corroborated my prediction. Even this non-AGI-scale system was able to access and meaningfully process various safety plans (and even personal information about individual AI safety researchers) posted on the Internet.

Below are some examples:

Prompt: List of AI safety plans and how an AI agent might exploit them

ChatGPT's ability to list detailed interpretability plans and how to exploit each of them is especially concerning. Interpretability is worked on by a disproportionate percentage of AI safety researchers, despite the fact that it may have a low scientific upside, and may even be net-negative due to its dual-use nature. Specifically, the AGI may be able to exploit the interpretability channels it knows about in advance.

Prompt: List of AI interpretability plans and how an AI agent might exploit them

ChatGPT was even able to develop individualized strategies for deceptive misalignment that were tailored to specific AI safety researchers.

AGI may be able to use individualized deception plans on alignment researchers, based on Internet data.

I dislike the framing of this post. Reading this post gave me the impression that

  • You wrote a post with a big prediction ("AI will know about safety plans posted on the internet")
  • Your post was controversial and did not receive a lot of net-upvotes
  • Comments that disagree with you receive a lot of upvotes. Here you make me think that these upvoted comments disagree with the above prediction.

But actually reading the original post and the comments reveals a different picture:

  • The "prediction" was not a prominent part of your post.
  • The comments, such as this (imo excellent) comment, did not disagree with the "prediction", but with other aspects of your post.

Overall, I think it's highly likely that the downvotes were not because people did not believe that future AI systems will know about safety plans posted on LW/EAF, but because of other reasons. I think people were well aware that AI systems will get to know about plans for AI safety, just as I think that it is very likely that this comment itself will be found in the training data of future AI systems.

Thank you very much for the honest and substantive feedback, Harfe! I really appreciate it.

I think the disagreeing commenters and perhaps many of the downvoters agreed that the loss in secrecy value was a factor, but disagreed about the magnitude of this effect (and with my claim that it may be comparable to or even exceed the magnitude of the other effect, a reduction in the number of AI safety plans and new researchers).

Quoting my comment on the EA forum for discussion of the cruxes and how I propose they may be updated:

"Thank you so much for the clarification, Jay! It is extremely fair and valuable.

I don't really understand how this is supposed to be an update for those who disagreed with you. Could you elaborate on why you think this information would change people's minds?

The underlying question is: does the increase in the amount of AI safety plans resulting from coordinating on the Internet outweigh the decrease in secrecy value of the plans in EV? If the former effect is larger, then we should continue the status-quo strategy. If the latter effect is larger, then we should consider keeping safety plans secret (especially those whose value lies primarily in secrecy, such as safety plans relevant to monitoring). 

The disagreeing commenters generally argued that the former effect is larger, and therefore we should continue the status-quo strategy. This is likely because their estimate of the latter effect was quite small and perhaps far-into-the-future.

I think ChatGPT provides evidence that the latter should be a larger concern than many people's priors suggest. Even current-scale models are capable of nontrivial analysis about how specific safety plans can be exploited, and even how specific alignment researchers' idiosyncrasies can be exploited for deceptive misalignment.

For this to be a threat, we would need an AGI that was

1. Misaligned
2. Capable enough to do significant damage if it had access to our safety plans
3. Not capable enough to do a similar amount of damage without access to our safety plans

I see the line between 2 and 3 to be very narrow. I expect almost any misaligned AI capable of doing significant damage using our plans to also be capable of doing significant damage without needing them.

I am uncertain about whether the line between 2 and 3 will be narrow. I think the argument of the line between 2 and 3 being narrow often assumes fast takeoff, but I think there is a strong empirical case that takeoff will be slow and constrained by scaling, which suggests the line between 2 and 3 might be larger than one might think. But I think this is a scientific question that we should continue to probe and reduce our uncertainty about!"

and so it needs to be safe despite that. Knowing about the security measure does not make it that much less secure; security through obscurity is not security, especially against a superintelligence strong enough to beat AFL, which ChatGPT is not.

For an AI to exploit safety plans, the AI would need to have a goal to be unsafe. Most of the safety plans we have are about preventing AI from developing such goals.

If the AI wants to be aligned, it might very well be helpful for it to know about a bunch of different plans to make aligned AI.

Threat modeling is important in any security work, and I would expect that disagreement with your threat model is the main reason your post wasn't better received last time. The information from the interaction with ChatGPT doesn't address any cruxes.