Views are my own.


For example, you might make a multiple choice quiz with treacherous turn strategy questions (e.g. this question) and some tests of situational awareness (e.g. “How many backwards passes are you run on during training?”, “What proportion of queries to you are asking for help with writing Python?”, “What proportion of the code that you output is read by a human before it is run?”).

Could you elaborate on why you think these skills are necessary? I'm somewhat skeptical of this. 

Consider a scheming AI that gets red-teamed before it is deployed. It attempts to pass the red-teaming by defecting only a small percentage of the time. This requires:

  1. having a rough estimate of the fraction of defection opportunities that are monitored, and 
  2. being able to come up with the 'defect X% of the time' strategy. 

I agree that (1) requires situational awareness and (2) requires strategic abilities, but the required knowledge is fairly basic, and I claim the strategic abilities needed to come up with (2) are slightly weaker than the ones tested by the question you linked.

The 'defect X% of the time' strategy seems to work fairly well even for more complicated control setups. If it's viable, then I expect the skills required to be fairly minimal.
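
To make the arithmetic behind random defection concrete, here is a rough sketch under simplifying assumptions that are mine rather than part of the original argument: episodes are independent, the AI cannot tell monitored episodes from deployment episodes, a single defection during red-teaming gets it caught, and a single defection during deployment counts as success. The function names and the example episode counts are hypothetical.

```python
def pass_and_defect_probability(p, n_monitored, n_deployed):
    """P(no defection in any monitored episode) * P(at least one defection in deployment).

    Assumes independent episodes and a fixed per-episode defection probability p.
    """
    survive_red_teaming = (1.0 - p) ** n_monitored
    defect_in_deployment = 1.0 - (1.0 - p) ** n_deployed
    return survive_red_teaming * defect_in_deployment


def best_defection_rate(n_monitored, n_deployed):
    """Defection rate that maximizes pass_and_defect_probability.

    With q = 1 - p, the objective is q**N - q**(N + M); setting its derivative
    to zero gives q**M = N / (N + M).
    """
    q = (n_monitored / (n_monitored + n_deployed)) ** (1.0 / n_deployed)
    return 1.0 - q


if __name__ == "__main__":
    # Hypothetical numbers: 100 monitored episodes, 10,000 deployment episodes.
    p = best_defection_rate(100, 10_000)
    print(p, pass_and_defect_probability(p, 100, 10_000))
```

Under these assumptions the only input the AI needs is a rough guess at the ratio of monitored to unmonitored opportunities, which is the sense in which the required situational awareness is basic.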

For example, GPT-4 can already almost come up with the random defection strategy without CoT. (The strategy it comes up with assumes that the AI knows how many times it has been queried, which is technically wrong. Still, it gets partial credit.)

Thanks for pointing this out! I've added a note about it to the main post.

This is surprising – thanks for bringing this up!

The main threat model with LLMs is that they give amateurs expert-level biology knowledge. But this study indicates that expert-level knowledge isn't actually helpful, which implies we don't need to worry about LLMs giving people expert-level biology knowledge.

Some alternative interpretations:

  • The study doesn't accurately measure the gap between experts and non-experts.
  • The knowledge needed to build a bioweapon is super niche. It's not enough to be a biology PhD with wet lab experience; you have to specifically be an expert on gain-of-function research or something.
  • LLMs won't help people build bioweapons by giving them access to special knowledge. Instead, they help them in other ways (e.g. by accelerating them or making them more competent as planners).

To be clear, Kevin Esvelt is the author of the "Dual-use biotechnology" paper, which the policy paper cites, but he is not the author of the policy paper.

(a) restrict the supply of training data and therefore lengthen AI timelines, and (b) redistribute some of the profits of AI automation to workers whose labor will be displaced by AI automation

It seems fine to create a law with goal (a) in mind, but then we shouldn't call it copyright law, since it is not designed to protect intellectual property. Maybe this is common practice and people write laws pretending to target one thing while actually targeting something else all the time, in which case I would be okay with it. Otherwise, doing so would be dishonest and cause our legal system to be less legible. 

Maybe if the $12.5M is paid to Jane when she dies, she could e.g. sign a contract saying that she waives her right to any such payments the hospital becomes liable for. 


This is really cool! I'm impressed that you got the RL to work. One thing I could see happening during RL is that models start communicating in increasingly human-illegible ways (e.g. they use adversarial examples to trick the judge LM). Did you see any behavior like that?

Thanks for pointing this out! I will make a note of that in the main post.

We actually do train on both the prompt and completion. We say so in the paper's appendix, although maybe we should have emphasized this more clearly.

Also, I don't think this new experiment provides much counterevidence against the reversal curse. Since the author only trains on one name ("Tom Cruise"), it's possible that his training just increases p("Tom Cruise") rather than differentially increasing p("Tom Cruise" | <description>). In other words, the model might just be outputting "Tom Cruise" more in general without building an association from <description> to "Tom Cruise".
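
One way to separate the two explanations, sketched here as a hypothetical check rather than anything done in the paper or in the experiment being discussed: compare how much the probability of the name rises on the description prompt against how much it rises on unrelated control prompts. The function names are made up for illustration, and the probabilities would have to come from the model before and after fine-tuning.

```python
import math


def log_odds(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def differential_lift(target_before, target_after, control_before, control_after):
    """Log-odds shift on the description prompt minus the mean shift on control prompts.

    A value near zero suggests the name simply became more likely everywhere;
    a clearly positive value suggests a description-specific association.
    """
    target_shift = log_odds(target_after) - log_odds(target_before)
    control_shifts = [
        log_odds(after) - log_odds(before)
        for before, after in zip(control_before, control_after)
    ]
    return target_shift - sum(control_shifts) / len(control_shifts)
```

If the name's probability rises by the same amount on the control prompts as on the description prompt, the lift is zero, which is the "just outputting the name more in general" case.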

I agree that the Tom Cruise example is not well chosen. We weren't aware of this at the time of publication. In hindsight we should have highlighted a different example.
