Truthful LMs as a warm-up for aligned AGI

Here's my worry.

If we adopt a little bit of deltonian pessimism (though not the whole hog), and model present-day language models as doing something vaguely like nearest-neighbor interpolation in a slightly sophisticated latent space (while still being very impressive), then we might predict that there are going to be some ways of getting honest answers an impressive percentage of the time while staying entirely within the interpolation regime.

And then if you look at the extrapolation regime, it's basically the entire alignment problem squeezed into a smaller space! So I worry that people are going to do the obvious things, get good answers on 90%+ of human questions, and then feel some kind of pressure to write off the remainder as not that important ("we've got honest answers 98% of the time, so the alignment problem is like 98% solved"). What I want instead is for people to use language models as a laboratory to keep being ambitious, and to do theory-informed experiments that push the envelope in extrapolating human preferences in a human-approved way.
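To make the interpolation/extrapolation distinction concrete, here is a toy sketch. It is only an illustration of the "nearest-neighbor interpolation" picture above, not a claim about how language models actually work. The target function, the training range, and all names are assumptions chosen for the example.

```python
def target(x):
    # Hypothetical ground truth the "model" is supposed to reproduce.
    return x * x


def nearest_neighbor_predict(train, x):
    """1-nearest-neighbor: return the label of the closest training input."""
    _, nearest_y = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_y


def mean_abs_error(train, xs):
    """Average absolute gap between the 1-NN prediction and the true target."""
    errors = [abs(nearest_neighbor_predict(train, x) - target(x)) for x in xs]
    return sum(errors) / len(errors)


# Training data covers only the interval [0, 1].
train = [(i / 100, target(i / 100)) for i in range(101)]

# Interpolation regime: queries fall between training points.
inside = [(i + 0.5) / 100 for i in range(100)]
# Extrapolation regime: queries fall well outside the training range.
outside = [2 + i / 100 for i in range(101)]

interp_error = mean_abs_error(train, inside)
extrap_error = mean_abs_error(train, outside)

print(f"interpolation error: {interp_error:.4f}")
print(f"extrapolation error: {extrap_error:.4f}")
```

The same predictor looks excellent if you only score it on queries resembling the training data, and the score tells you almost nothing about the region outside it.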

[-]Jacob_Hilton4yΩ350

I can think of a few different interpretations of your concern (and am interested to hear if these don't cover it):

There will be insufficient attention paid to robustness.
There will be insufficient attention paid to going beyond naive human supervision.
The results of the research will be misinterpreted as representing more progress than is warranted.

I agree that all of these are possibilities, and that the value of the endeavor could well depend on whether the people conducting (and communicating) the research are able to avoid pitfalls such as these.

There's certainly more object-level discussion to be had about how much emphasis should be placed on avoiding these particular pitfalls, and I'm happy to dig into them further if you're able to clarify which of them, if any, captures your main concern.

[-]Charlie Steiner4yΩ120

I think there are different kinds of robustness, and people focused on present-day applications (including tests that are easy to do in the present) are going to focus on the kinds of robustness that help with present-day problems. Being robust to malicious input from human teenagers will only marginally help make you robust to input from a future with lots of extra technological progress. They might have very different-looking solutions, because of factors like interpolation vs. extrapolation.

Framing it this way suggests one concrete thing I might hope for you to do, which is to create artificial problems for the language model that you think will exercise kinds of robustness and generalization not represented by the problem of fine-tuning GPT (or a BERT-based classifier) to be robust to the teenager distribution.

[-]Jacob_Hilton4yΩ110

one concrete thing I might hope for you to do...

I think this is included in what I intended by "adversarial training": we'd try to find tasks that cause the model to produce negligent falsehoods, train the model to perform better at those tasks, and aim for a model that is robust to someone searching for such tasks.

[-]Charlie Steiner4yΩ120

Sure - another way of phrasing what I'm saying is that I'm not super interested (as alignment research, at least) in adversarial training that involves looking at difficult subsets of the training distribution, or adversarial training where the proposed solution is to give the AI more labeled examples that effectively extend the training distribution to include the difficult cases.

It would be bad if we built an AI that wasn't robust on the training distribution, of course, but I think of this as a problem already being addressed by the field of ML without any need for looking ahead to AGI.

[-]Jon Garcia4y40

I like this approach to alignment research. Getting AIs to be robustly truthful (producing language output that is consistent with their best models of reality, modulo uncertainty) seems like it falls in the same space as getting them to keep their goals consistent with their best estimates of human goals and values.

As for avoiding negligent falsehoods, I think it will be crucial for the AI to have explicit estimates of its uncertainty for anything it might try to say. To a first approximation, assuming the system can project statements to a consistent conceptual space, it could predict the variance in the distribution of opinions in its training data around any particular issue. Then it could use this estimate of uncertainty to decide whether to state something confidently, to add caveats to what it says, or to turn it into a question for the interlocutor.
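A minimal sketch of the decision rule described above, under strong simplifying assumptions: opinions on an issue are already projected to scalar "stance scores" in [-1, 1], and the two thresholds are made-up values for illustration. None of these names or numbers come from an existing system.

```python
from statistics import pvariance

# Hypothetical cutoffs on the variance of stance scores; illustrative only.
CAVEAT_THRESHOLD = 0.05
QUESTION_THRESHOLD = 0.5


def choose_response_mode(stance_scores):
    """Map the spread of opinions on an issue to a response style."""
    if len(stance_scores) < 2:
        # Too little data to estimate a spread, so defer to the interlocutor.
        return "ask"
    spread = pvariance(stance_scores)
    if spread < CAVEAT_THRESHOLD:
        return "assert"
    if spread < QUESTION_THRESHOLD:
        return "caveat"
    return "ask"


def render(claim, stance_scores):
    """Phrase a claim according to the estimated uncertainty."""
    mode = choose_response_mode(stance_scores)
    if mode == "assert":
        return f"{claim}."
    if mode == "caveat":
        return f"I think {claim}, though sources disagree somewhat."
    return f"I'm not sure whether {claim}. What have you seen?"


print(render("water boils at 100 C at sea level", [0.9, 1.0, 0.95]))
print(render("this supplement helps with sleep", [0.0, 1.0, 0.5]))
print(render("this policy reduced crime", [-1.0, 1.0, -1.0, 1.0]))
```

Variance of opinions in the training data is only a proxy for the system's own uncertainty, so this is a first approximation in the same sense as the paragraph above.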

[-]Daniel Kokotajlo4yΩ330

Thanks for this!

I think that working on truthful LMs has a comparative advantage in worlds where:

--We have around 10-40 years until transformative AI

--Transformative AI is built using techniques that resemble modern deep learning

--There is a slow takeoff

--Alignment does not require vastly more theoretical insight (but may require some)

--Our current picture of the risks posed by transformative AI is incomplete

Can you elaborate on what you mean by slow takeoff here?

Also, what do you mean by the current picture of the risks being incomplete? What would it even mean for our picture to be complete?

[-]Jacob_Hilton4yΩ210

Thanks for these questions; these phrases were ambiguous or poorly chosen:

By "slow takeoff", I had in mind the "Paul slow takeoff" definition, although I think the (related) "Continuous takeoff" definition is more relevant to this post. The point is that trying for alignment to continually keep pace with capabilities, and to catch misalignment early, seems less valuable if there is going to be a sudden jump in capabilities. (I could be wrong about this, as I don't think I understand the fast takeoff viewpoint well.)
By "our current picture of the risks is incomplete", I meant something like: a significant portion of the total existential risk from AI comes from scenarios that have not yet been clearly articulated. More specifically, I had in mind power-seeking misalignment as the most clearly articulated risk, so I think it would have been better to say: a significant portion of the total existential risk from AI comes from risks other than power-seeking misalignment. Examples of potential sources of such risk include AI persuasion, social upheaval, deliberate misuse, authoritarianism and unforeseen risks.

[-]Daniel Kokotajlo4yΩ440

Thanks, these clarifications are very helpful.

FWIW I think Paul slow takeoff is pretty unlikely for reasons to be found in this thread and this post. On the other hand, as someone who thinks fast takeoff (in various senses) is more likely than not, I don't yet see why that makes Truthful LM work significantly less useful. (By contrast I totally see why Truthful LM work is significantly less useful if AGI/TAI/etc. comes from stuff that doesn't resemble modern deep learning.)

"Catch misalignment early..." This makes it sound like misalignment is something that AIs don't have yet but might one day have, so we need to be vigilant and notice it when it appears. But instead isn't misalignment something that all AIs have by default?

My current view is that power-seeking misalignment will probably cause existential catastrophe, that persuasion tools happen first and have a >20% chance of destroying our ability to solve that problem, and that there are various philosophical and societal problems that could (>20%) get us even if we solve power-seeking misalignment. Does this mean I agree or disagree with "our current picture of the risks is incomplete?"

[-]MondSemmel4y40

FYI, the "this thread" link in your comment doesn't work. Apparently it's possible for a link to be simultaneously green and unclickable.

[-]Zack_M_Davis4y40

(The underlying HTML is <a href="http://I think that working on truthful LMs has a comparative advantage in worlds where: We have around 10-40 years until transformative AI Transformative AI is built using techniques that resemble modern deep learning There is a slow takeoff Alignment does not require vastly more theoretical insight (but may require some) Our current picture of the risks posed by transformative AI is incomplete">this thread</a>. I am also surprised that this results in clicking being a no-op, rather than a "functional" link that leads to your browser's could-not-resolve-host page.)

[-]Daniel Kokotajlo4y20

Fixed, thanks!

[-]Jacob_Hilton4yΩ330

"Catch misalignment early..." - This should have been "scary misalignment", e.g. power-seeking misalignment, deliberate deception in order to achieve human approval, etc., which I don't think we've seen clear signs of in current LMs. My thinking was that in fast takeoff scenarios, we're less likely to spot this until it's too late, and more generally that truthful LM work is less likely to "scale gracefully" to AGI. It's interesting that you don't share these intuitions.

Does this mean I agree or disagree with "our current picture of the risks is incomplete?"

As mentioned, this phrase should probably be replaced by "a significant portion of the total existential risk from AI comes from risks other than power-seeking misalignment". There isn't supposed to be a binary cutoff for "significant portion"; the claim is that the greater the risks other than power-seeking misalignment, the greater the comparative advantage of truthful LM work. This is because truthful LM work seems more useful for addressing risks from social problems such as AI persuasion (as well as other potential risks that haven't been as clearly articulated yet, I think). Sorry that my original phrasing was so unclear.

[-]Daniel Kokotajlo4yΩ330

Nothing to apologize for, it was reasonably clear, I'm just trying to learn more about what you believe and why. This has been helpful, thanks!

I totally agree that in fast takeoff scenarios we are less likely to spot those things until it's too late. I guess I agree that truthful LM work is less likely to scale gracefully to AGI in fast takeoff scenarios... so I guess I agree with your overall point... I just notice I feel a bit confused and muddled about it, is all. I can imagine plausible slow-takeoff scenarios in which truthful LM work doesn't scale gracefully, and plausible fast-takeoff scenarios in which it does. At least, I think I can. The former scenario would be something like: It turns out the techniques we develop for making dumb AIs truthful stop working once the AIs get smart, for similar reasons that techniques we use to make small children be honest (or to put it more vividly, believe in Santa) stop working once they grow up. The latter scenario would be something like: Actually that's not the case, the techniques work all the way up past human-level intelligence, and "fast takeoff" in practice means "throttled takeoff" where the leading AI project knows they have a few-month lead over everyone else and is using those months to do some sort of iterated distillation and amplification, in which it's crucial that the early stages be truthful and that the techniques scale to stage N overseeing stage N+1.
