lberglund

Views are my own.

For example, you might make a multiple choice quiz with treacherous turn strategy questions (e.g. this question) and some tests of situational awareness (e.g. “How many backward passes are you run on during training?”, “What proportion of queries to you are asking for help with writing Python?”, “What proportion of the code that you output is read by a human before it is run?”).

Could you elaborate on why you think these skills are necessary? I'm somewhat skeptical of this. 

Consider a scheming AI that gets red-teamed before it is deployed. It attempts to pass the red-teaming by defecting only a small percentage of the time. This requires:

  1. having a rough estimate of the fraction of defection opportunities that are monitored, and 
  2. being able to come up with the 'defect X% of the time' strategy. 

I agree that (1) requires situational awareness and (2) requires strategic ability, but the required knowledge is fairly basic, and I'd claim that the strategic ability needed to come up with (2) is slightly less than what the question you linked asks for.

'Defect X% of the time' seems to work fairly well even against more complicated control setups. If it's viable, then I expect the skills required to be fairly minimal. 
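To make this concrete, here is a minimal sketch of the arithmetic behind random defection (the defection rate and episode counts are invented for illustration, not taken from any real setup):

```python
# If the AI defects with probability p on each monitored opportunity,
# it survives red-teaming with probability (1 - p)**n_monitored,
# while still defecting almost surely at deployment scale.

defect_prob = 0.01     # assumed defection rate ("defect 1% of the time")
n_monitored = 100      # assumed number of monitored opportunities
n_deployed = 100_000   # assumed number of deployment opportunities

survive_prob = (1 - defect_prob) ** n_monitored
defect_eventually = 1 - (1 - defect_prob) ** n_deployed

print(f"P(survive red-teaming)              = {survive_prob:.2f}")      # -> 0.37
print(f"P(defect at least once in deploy)   = {defect_eventually:.4f}") # -> 1.0000
```

Choosing a workable p only requires a rough estimate of how many defection opportunities are monitored, which is the basic situational awareness referred to in (1).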

For example, GPT-4 can already almost come up with the random defection strategy without CoT. (The strategy it comes up with assumes that the AI knows how many times it has been queried, which is technically wrong. Still, it gets partial credit.)

Thanks for pointing this out! I've added a note about it to the main post.

This is surprising – thanks for bringing this up!

The main threat model with LLMs is that they give amateurs expert-level biology knowledge. But this study indicates that expert-level knowledge isn't actually helpful, which would imply we don't need to worry about LLMs giving people expert-level biology knowledge. 

Some alternative interpretations:

  • The study doesn't accurately measure the gap between experts and non-experts
  • The knowledge needed to build a bioweapon is super niche. It's not enough to be a biology PhD with wet lab experience; you have to specifically be an expert on gain of function research or something.
  • LLMs won't help people build bioweapons by giving them access to special knowledge. Instead, they help them in other ways (e.g. by accelerating them or making them more competent as planners).

To be clear, Kevin Esvelt is the author of the "Dual-use biotechnology" paper, which the policy paper cites, but he is not the author of the policy paper.

(a) restrict the supply of training data and therefore lengthen AI timelines, and (b) redistribute some of the profits of AI automation to workers whose labor will be displaced by AI automation

It seems fine to create a law with goal (a) in mind, but then we shouldn't call it copyright law, since it would not be designed to protect intellectual property. Maybe this is common practice and people write laws pretending to target one thing while actually targeting another all the time, in which case I would be okay with it. Otherwise, doing so would be dishonest and make our legal system less legible. 

Maybe if the $12.5M is paid to Jane when she dies, she could e.g. sign a contract saying that she waives her right to any such payments the hospital becomes liable for. 


This is really cool! I'm impressed that you got the RL to work. One thing I could see happening during RL is that models start communicating in increasingly human-illegible ways (e.g. they use adversarial examples to trick the judge LM). Did you see any behavior like that?

Thanks for pointing this out! I will make a note of that in the main post.

We actually do train on both the prompt and completion. We say so in the paper's appendix, although maybe we should have emphasized this more clearly.

Also, I don't think this new experiment provides much counterevidence against the reversal curse. Since the author only trains on one name ("Tom Cruise"), it's possible that his training just increases p("Tom Cruise") rather than differentially increasing p("Tom Cruise" | <description>). In other words, the model might just be outputting "Tom Cruise" more in general, without building an association from <description> to "Tom Cruise".
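A control comparison would distinguish the two hypotheses: if finetuning merely raises p("Tom Cruise") unconditionally, the name should also become more likely after unrelated prompts, not just after the description. Here's a toy sketch of that check (all log-probs below are invented numbers for illustration, not real measurements):

```python
# A model that just says "Tom Cruise" more often raises log p(name)
# after *every* prompt; a model that learned the association raises it
# specifically after the description.

# Hypothetical log-probs of "Tom Cruise" before/after finetuning
logp = {
    ("before", "description"): -9.0,
    ("before", "unrelated"):   -9.0,
    ("after",  "description"): -5.0,
    ("after",  "unrelated"):   -5.2,
}

def differential_gain(logp):
    """Gain after the description minus gain after an unrelated prompt.
    Near zero => the increase is not specific to the description."""
    gain_desc = logp[("after", "description")] - logp[("before", "description")]
    gain_ctrl = logp[("after", "unrelated")] - logp[("before", "unrelated")]
    return gain_desc - gain_ctrl

print(round(differential_gain(logp), 2))  # -> 0.2: mostly a general increase
```

With these toy numbers the name's probability rises almost as much after the unrelated prompt as after the description, i.e. the model learned to say "Tom Cruise" more, not to associate the description with him.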

I agree that the Tom Cruise example is not well chosen. We weren't aware of this at the time of publication. In hindsight we should have highlighted a different example.
