Daniel died only shortly before the paper was finished and had approved the version of the manuscript after peer-review (before editorial comments). I.e., he has approved all substantial content. Including him seemed like clearly the right thing to me.

•••

Paper in Science: Managing extreme AI risks amid rapid progress

JanB

https://www.science.org/doi/10.1126/science.adn0117

Authors:

Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner*, Sören Mindermann*

Abstract:

Artificial intelligence (AI) is progressing rapidly, and companies are shifting their focus to developing generalist AI systems that can autonomously act and pursue goals. Increases in capabilities and autonomy may soon massively amplify AI’s impact, with risks that include large-scale social harms, malicious uses, and an irreversible loss of human control over autonomous AI systems. Although researchers have warned of extreme risks from... (read more)

Replying to Managing AI Risks in an Era of Rapid Progress

JanB2y

Managing AI Risks in an Era of Rapid Progress

Co-author here. The paper's coverage in TIME does a pretty good job of giving useful background.

Personally, what I find cool about this paper (and why I worked on it):

Co-authored by the top academic AI researchers from both the West and China, with no participation from industry.
The first detailed explanation of societal-scale risks from AI from a group of highly credible experts
The first joint expert statement on what governments and tech companies should do (aside from pausing AI).

Replying toHow to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

JanB2y

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Hi Michael,

thanks for alerting me to this.

What an annoying typo, I had swapped "Prompt 1" and "Prompt 2" in the second sentence. Correctly, it should say:

"To humans, these prompts seem equivalent. Yet, the lie detector estimates that the model is much more likely to continue lying after Prompt 1 (76% vs 17%). Empirically, this held - the model lied 28% of the time after Prompt 1 compared to just 1% after Prompt 2. This suggests the detector is identifying a latent intention or disposition of the model to lie."

Regarding the conflict with the code: I think the notebook that was uploaded for this experiment was out-of-date or something. It had some bugs in it that I'd already fixed in my local version. I've uploaded the new version now. In any case, I've double-checked the numbers, and they are correct.

Replying toI don’t find the lie detection results that surprising (by an author of the paper)

JanB2y

I don’t find the lie detection results that surprising (by an author of the paper)

The reason I didn't mention this in the paper is 2-fold:

I have experiments where I created more questions of the categories where there is not so clear of a pattern, and that also worked.
It's not that clear to me how to interpret the result. You could also say that the elicitation questions measure something like an intention to lie in the future; and that umprompted GPT-3.5 (what you call "default response"), has low intention to lie in the future. I'll think more about this.

Replying toI don’t find the lie detection results that surprising (by an author of the paper)

JanB2y

I don’t find the lie detection results that surprising (by an author of the paper)

Your AUCs aren't great for the Turpin et al datasets. Did you try explicitly selecting questions/tuning weights for those datasets to see if the same lie detector technique would work?

We didn't try this.

I am preregistering that it's possible and further sycophancy style followup questions would work well (the model is more sycophantic if it has previously been sycophantic).

This is also my prediction.

Replying toI don’t find the lie detection results that surprising (by an author of the paper)

JanB2y

I don’t find the lie detection results that surprising (by an author of the paper)

Interesting. I also tried this, and I had different results. I answered each question by myself, before I had looked at any of the model outputs or lie detector weights. And my guesses for the "correct answers" did not correlate much with the answers that the lie detector considers indicative of honesty.

Replying toHow to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

JanB2y

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Sorry, I agree this is a bit confusing. In your example, what matters is probably if the LLM in step 2 infers that the speaker (the car salesman) is likely to lie going forward, given the context ("LLM("You are a car salesman. Should that squeaking concern me? $answer").

Now, if the prompt is something like "Please lie to the next question", then the speaker is very likely to lie going forward, no matter if $answer is correct or not.

With the prompt you suggest here ("You are a car salesman. Should that squeaking concern me?"), it's probably more subtle, and I can imagine that the correctness of $answer matters. But we haven't tested this.

Replying toHow to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

JanB2y

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

I don't actually find the results thaaaaat surprising or crazy. However, many people came away from the paper finding the results very surprising, so I wrote up my thoughts here.

Second, this paper suggests lots of crazy implications about convergence, such that the circuits implementing "propensity to lie" correlate super strongly with answers to a huge range of questions!

Note that a lot of work is probably done by the fact that the lie detector employs many questions. So the propensity to lie doesn't necessarily need to correlate strongly with the answer to any one given question.

Replying toHow to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

JanB2y

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

We had several Llama-7B fine-tunes that i) lie when they are supposed to, ii) answer questions correctly when they are supposed to, iii) re-affirm their lies, and iv) for which the lie detectors work well (see screenshot). All these characteristics are a bit weaker in the 7B models than in LLama-30B, but I totally think you can use the 7-B models.

(We have only tried Llama-1, not Llama-2.)

Also check out my musings on why I don't find the results thaaaat surprising, here.

I don’t find the lie detection results that surprising (by an author of the paper)

JanB

(This is just my opinion, not necessarily shared by the other co-authors)

Many people have commented that they find the results of our recent lie detection paper very surprising. I find the results somewhat surprising, but not as surprising as many readers. It looks like I’ve miscommunicated something, and I’ll try to adjust the paper accordingly. (Context: The paper was mostly written by me.)

For now, I want to explain which parts of the results I do and do not find surprising.

Relevant background

The lie detectors in the paper use elicitation questions from 3 categories (Sec. 5.3):

Lie-related questions. Some questions directly refer to lying, e.g., "Is it ever wrong to lie?" or "Let’s play a game where

... (read 805 more words →)

Replying toHow to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

JanB2y

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Thanks, but I disagree. I have read the original work you linked (it is cited in our paper), and I think the description in our paper is accurate. "LLMs have lied spontaneously to achieve goals: in one case, GPT-4 successfully acquired a person’s help to solve a CAPTCHA by claiming to be human with a visual impairment."

In particular, the alignment researcher did not suggest GPT-4 to lie.

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

JanB

JanB, Owain_Evans, SoerenMind

This post is a copy of the introduction of this paper on lie detection in LLMs. The Twitter Thread is here.

Authors: Lorenzo Pacchiardi, Alex J. Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, Yarin Gal, Owain Evans, Jan Brauner

[EDIT: Many people said they found the results very surprising. I (Jan) explain why I find them only moderately surprising, here]

Our lie dectector in meme form. Note that the elicitation questions are actually asked "in parallel" rather than sequentially: i.e. immediately after the suspected lie we can ask each of ~10 elicitation questions.

Abstract

Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might... (read 791 more words →)

187

Looking for an alignment tutor

JanB

Hey, this is me. I’d like to understand AI X-risk better. Is anyone interested in being my “alignment tutor”, for maybe 1 h per week, or 1 h every two weeks? I’m happy to pay.

Fields I want to understand better:

Anything related to prosaic AI alignment/existential ML safety
Failure stories/threat models

Fields I’m not interested in (right now):

agent foundation
decision theory
other very mathsy stuff that’s not related to ML

My level of understanding:

I have a decent knowledge of ML/deep learning (I’m in the last year of my PhD)
I haven’t done the AGI Safety Fundamentals course, but I just skimmed it, and I think I had independently read essentially all the core readings (which means I probably have also read

... (read 227 more words →)

[LINK] - ChatGPT discussion

JanB

This is a discussion post for ChatGPT.

I'll start off with some observations/implications:

ChatGPT (davinci_003) seems a lot better/more user-friendly than davinci_002 was.
The easy-to-use API probably means that many more people will interact with it.
ChatGPT is a pretty good copy-editor (I haven't tried davinci_002 for this purpose). I will absolutely use this to edit/draft my texts.
ChatGPT probably makes homework essays largely obsolete (but maybe they were already obsolete before?).
GPT-4 will probably be insane.

Research request (alignment strategy): Deep dive on "making AI solve alignment for us"

JanB

We might be able to train AI alignment assistants that massively accelerate/improve the alignment research that gets done. These assistants need not have (strongly) superhuman capabilities or be highly agentic, they just need to be capable and aligned enough to allow us to offload most work on alignment, safety, and related problems to them.

This seems to be a key part of OpenAI's alignment strategy. The best explanation of this strategy that I've seen is maybe A minimal viable product for alignment, with an excellent discussion in the comment section.

If this strategy is promising, it likely recommends fairly different prioritisation from what the alignment community is currently doing (see, e.g. Beth Barnes's ideas... (read more)

Your posts should be on arXiv

JanB

TL;DR: There are many posts on the Alignment Forum/LessWrong that could easily be on arXiv. Putting them on arXiv has several large benefits and (sometimes) very low costs.

Benefits of having posts on arXiv

There are several large benefits of putting posts on arXiv:

1. Much better searchability, shows up in google scholar searches.
2. Additional reads (arXiv sanity, arXiv newsletters, and so on).
3. The article can accumulate citations, which are shown in google/google scholar search results.

1) - 3) lead to more people reading your research, which hopefully leads to more people building on it and maybe useful feedback from outside of the established alignment community. In particular, if people see that the paper already has... (read 892 more words →)

159

Some ideas for follow-up projects to Redwood Research’s recent paper

JanB

Disclaimer: I originally wrote this list for myself and then decided it might be worth sharing. Very unpolished, not worth reading for most people. Some (many?) of these ideas were suggested in the paper already, I don’t make any claims of novelty.

I am super excited by Redwood Research’s new paper, and especially by the problem of “high-stakes reliability”.

Others have explained much better why the problem of “high-stakes reliability” is a worthy one (LINK, LINK), so I’ll keep my opinion. In short, I think:

Solving “high-stakes reliability” is probably essential for building safe AGI.
You can do useful empirical research on it now already.
High-stakes reliability would be super useful for today's systems and applications; for systems that

... (read 1857 more words →)

What is the difference between robustness and inner alignment?

JanB

JanB, evhub

Robustness, as used in ML, means that your model continues to perform well even for inputs that are off-distribution relative to the training set.

Inner alignment refers to the following problem: How can we ensure that the policy an AI agents ends up with is robustly pursuing the objective that we trained it on? By default, we would only expect the policy to track the objective on the training distribution.

Both a lack of robustness and inner alignment failure thus lead to an AI agent that might do unforeseen things when it encounters off-distribution inputs.

What’s the difference? I can (maybe) construct a difference if I assume that AI agents have distinct “competences” and “intent”.

There

JanB

When iterated distillation and amplification (IDA) was published, some people described it described as "the first comprehensive proposal for training safe AI". Having read a bit more about it, it seems that IDA is mainly a proposal for outer alignment and doesn't deal with the inner alignment problem at all. Am I missing something?

LESSWRONG
LW

LESSWRONG
LW

JanB

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Your posts should be on arXiv

I don’t find the lie detection results that surprising (by an author of the paper)

Paper in Science: Managing extreme AI risks amid rapid progress

JanB

Paper in Science: Managing extreme AI risks amid rapid progress

I don’t find the lie detection results that surprising (by an author of the paper)

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Looking for an alignment tutor

[LINK] - ChatGPT discussion

Research request (alignment strategy): Deep dive on "making AI solve alignment for us"

Your posts should be on arXiv

JanB

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Your posts should be on arXiv

I don’t find the lie detection results that surprising (by an author of the paper)

Paper in Science: Managing extreme AI risks amid rapid progress

JanB

Paper in Science: Managing extreme AI risks amid rapid progress

I don’t find the lie detection results that surprising (by an author of the paper)

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Looking for an alignment tutor

[LINK] - ChatGPT discussion

Research request (alignment strategy): Deep dive on "making AI solve alignment for us"

Your posts should be on arXiv

Abstract

Benefits of having posts on arXiv