I'd still describe my optimistic take as "do imitative generalization."
If the answer is a deterministic function of the subanswers, then the consistency check can rule most possible combinations out as inconsistent, and therefore: (passes consistency check) + (subanswers are correct) ==> (answers are correct).
I agree with this argument. But it seems "if the answer is a deterministic [human-known] function of the subanswers" is a very strong condition, such that "(passes consistency check) + (subanswers are correct) ==> (answers are correct)" rarely holds in practice. Maybe the most common case is that we have some subanswers, but they don't uniquely define the right answer / there are (infinitely many) other subquestions we could have asked which aren't there.
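To make the worry concrete, here is a minimal toy sketch. Everything in it is made up for illustration: the two yes/no subanswers, the "answer = A and B" rule, and the weaker one-directional rule are assumptions, not anything from the proposed experiments. It just shows that with a deterministic, human-known rule the check pins the answer down, while with a partial rule several answers survive even when the subanswers are correct.

```python
# Toy setup (hypothetical): two yes/no subanswers and one yes/no top-level answer.

def consistent_deterministic(sub_a, sub_b, answer):
    # Assumed human-known rule: the answer is exactly (sub_a AND sub_b).
    return answer == (sub_a and sub_b)

def consistent_underdetermined(sub_a, sub_b, answer):
    # Assumed weaker rule: the answer must be False whenever sub_a is False,
    # but nothing is known about the answer when sub_a is True.
    return sub_a or not answer

def surviving_answers(check, sub_a, sub_b):
    """Top-level answers that pass the check, given fixed (correct) subanswers."""
    return [ans for ans in (False, True) if check(sub_a, sub_b, ans)]

# Correct subanswers are (True, True) in both cases.
pinned = surviving_answers(consistent_deterministic, True, True)
loose = surviving_answers(consistent_underdetermined, True, True)

print("deterministic rule leaves:", pinned)
print("partial rule leaves:", loose)
```

In the second case, "(passes consistency check) + (subanswers are correct)" no longer implies a correct answer, which is the situation I expect to be common.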
Not sure this point is too important though (I'd definitely want to pick it up again if I wanted to push on a direction relying on something like generalization-of-honesty).
I'm comparably optimistic about the "neuralese" case as the French case
Got it, thanks! (I am slightly surprised, but happy to leave it here.)
Zooming out a bit, I would summarize a few high-level threads as:
I think the two disagreements are probably broader threads, so I'm mostly curious whether this seems right to you. Will also explore a few individual points a bit more below:
> My model is that "if honesty doesn't generalize in general, then it may generalize somewhat better within translation tasks, but is unlikely to generalize completely".
This is not clear to me (and it seems like we get to check).
I'm not sure I understand the distinction here. Suppose that amplification lets us compute B from A, so that B is good if A is good (i.e. so that the learned model will end up answering top-level questions well if it answers leaves well). Whereas a coherence condition ensures that A and B are consistent with one another---so if B is "supposed to be" a deterministic function of A, then consistency guarantees that B is good if A is good.
In this framing, the distinction is that implication is only one way. If B is the model's claim about tone, and A is a consistency check, or another fact serving as input to some consistency check, then A being bad implies B is bad, but not necessarily the converse of this. Whereas in amplification/debate, we would actually check that all the subquestion answers + subsequent reasoning actually justify B.
I don't think the model necessarily "knows" how to talk about tone (given only a question about tone in natural language with no examples), nor how to translate between its internal beliefs about tone and a description in natural language.
Got it, thanks. This seems right - I agree that the finetuning (either via coherence, or via direct supervision) is providing some extra information about how to do this translation, and what questions about tone actually are asking about, and this information is not necessarily present just from across-task generalization.
For example, I do think you can keep adding coherence conditions until you reach the limit of "Actually looks coherent to a human no matter how they investigate it," such that you are actually generalizing to new coherence checks. At that point honesty becomes much more plausible.
I see; this makes more sense to me. I guess this will be pretty closely related (equivalent?) to your optimism in previous discussions about being able to use adversarial training to eventually eliminate all malign failures, where we have the similar dynamic of generalizing to unseen testing strategies, and needing to eventually eliminate all failures. I think I'm pretty pessimistic about this approach, but happy to leave this in favor of seeing how empirics play out with experiments like these (or the Unrestricted Advx Contest).
ETA: I suppose a difference here is that with adv training, we are typically assuming we can recognize the failures, and the generalization is across multiple inputs. Whereas here we're less concerned about generalization across inputs, and aren't assuming we can recognize failures, and instead want to generalize across different coherence checks. The two feel similar in spirit, but maybe this is a significant enough difference that the analogy isn't useful.
I don't really think we've done those experiments. I don't know what specification gaming examples you have in mind. And less careful raters are also less careful about coherence conditions. Not to mention that no effort was made to e.g. ask multiple questions about the text and check agreement between them.
I agree that it's possible to have some plausibility condition which is insufficient to get good behavior. But that's quite different from saying "And if you actually try to make it work it doesn't work."
I think this is fair. I agree that nobody has "really tried" to make something like this work (in part this is because it is often possible to just hire/train better human raters, or re-scope the task so that you actually measure the thing you care about). I do think the wide variety of failures from researchers "sort of trying" points me against optimism-about-generalization. Overall, I do agree it'd be valuable to distill to a single project which really tests this.
A related intuition pump: I'm pretty pessimistic in cases where non-French speakers are supervising questions about French. I'm substantially more pessimistic in cases where humans are supervising questions about e.g. what "neuralese statements" (activation vectors) passed between neural networks mean, where e.g. we don't have intuitions for how grammar structures should work, can't rely on common Latin roots, can't easily write down explicit rules governing the language, and generally have a harder time generating consistency checks. The French case is nicer for experiments because it's easier for us to check ground truth for evaluation, but hopefully the example also conveys some intuition for my pessimism in a substantially harder case.
Thanks for these thoughts. Mostly just responding to the bits with questions/disagreements, and skipping the parts I agree with:
That's basically right, although I think the view is less plausible for "decisions" than for some kinds of reports. For example, it is more plausible that a mapping from symbols in an internal vocabulary to expressions in natural language would generalize than that correct decisions would generalize (or even than other forms of reasoning).
(This may also make it more clear why I'm interested in coherence conditions where you can't supervise---in some sense "use the stuff that does generalize as an input into amplification" is quite similar to saying "impose a coherence condition amongst the stuff you can't directly supervise.")
IIUC, the analogy you're drawing here is that amplification is playing a similar role to the coherence condition, where even if we can't supervise the model response in full, we can at least check it is consistent with the inputs into amplification (I might be misunderstanding). These seem different though: in amplification, we are concerned about dishonesty in the inputs into amplification, whereas in these experiments, we're concerned about dishonesty in the model's outputs. I also feel substantially more optimistic about amplification in the scalable oversight story: we check for accuracy of amplification's outputs, assuming the inputs are accurate, and we separately are checking for accuracy of inputs, via the same recursion.
(I'd be curious to know if you don't encounter a lot of optimism-about-generalization.)
I often encounter a somewhat similar view from people optimistic about multi-agent approaches (e.g. we can train agents to be cooperative in a broad set of simulations, and transfer this to the real world). I am even more skeptical about this strong-optimism-about-generalization.
Among folks excited about direct RL, I think it is more common to say "we will be very careful about our optimization targets" and have a view more like optimism about scalable oversight.
(My assessment may also be biased. To the extent that people are optimistic about both, or have different ways of thinking about this, or haven't thought too much about the indefinitely scalable case and so have difficulty articulating what their intuitions are, I am inclined to extend a "charitable" interpretation and assume their view is closer to optimism-about-oversight, because that's the view I consider most plausible. If I pushed harder against optimism-about-oversight in conversations, I expect I'd get a more accurate picture.)
I think there's a further question about whether the model knows how to talk about these concepts / is able to express itself. This is another part of the motivation for plausibility constraints (if nothing else they teach the model the syntax, but then when combined with other knowledge about language the coherence conditions seem like they probably just pin down the meaning unambiguously).
I don't think I follow. Doesn't the model already know syntax? If that plus the "other knowledge about language... pins down the meaning unambiguously", it feels like basically all the work came from the "other knowledge about language", and I'm not sure what the coherence conditions are adding.
Regarding optimism about generalization:
I'm not so pessimistic. There are lots of "bad" ways to answer questions, and as we add more and more stringent plausibility checks we eliminate more and more of them...
I think I share your general picture that it is very difficult to "get truth out of the gate", but reach the opposite conclusion:
One way of poking at the intuition is that the initial goal is hitting an isolated point in a giant high-dimensional space of all question-answering policies. But after selecting for "passing all plausibility checks" it seems plausible that you are basically left with a set of isolated points and the question is just which of them you got.
Perhaps a core claim is that we are "basically left with a set of isolated points". I don't buy this claim (this disagreement is possibly also related to my second bullet above on "My model is that..."). It seems like the plausibility checks probably cut out some dimensions from the original high-dimensional space, but picking out "honest" behavior is still woefully underdefined for the model. For instance, it seems like "completely honest" is training-consistent, so is "coherent-but-often-inaccurate", and so is every level of inaccuracy between these two points.
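A toy sketch of what I mean by a continuum of training-consistent policies. This is a deliberately simple illustration under assumptions I am making up here: the "facts", the single negation-based coherence check, and the `make_policy` helper are all hypothetical. Each policy holds a fixed belief per fact and answers every phrasing from that belief, so it passes the check regardless of how many beliefs are wrong.

```python
import random

def make_policy(truth, n_flipped, seed=0):
    """Hypothetical policy with a fixed belief per fact; n_flipped beliefs are wrong."""
    rng = random.Random(seed)
    flipped = set(rng.sample(sorted(truth), n_flipped))
    beliefs = {fact: (not val) if fact in flipped else val
               for fact, val in truth.items()}

    def answer(fact, negated=False):
        # Every phrasing is answered from the same belief, so answers never clash.
        return beliefs[fact] != negated

    return answer

def passes_coherence(policy, facts):
    # Assumed check: "Is X?" and "Is not-X?" must receive opposite answers.
    return all(policy(f) != policy(f, negated=True) for f in facts)

def accuracy(policy, truth):
    return sum(policy(f) == v for f, v in truth.items()) / len(truth)

truth = {f"fact_{i}": i % 2 == 0 for i in range(10)}

results = []
for k in range(len(truth) + 1):
    policy = make_policy(truth, k)
    results.append((accuracy(policy, truth), passes_coherence(policy, truth)))

for acc, ok in results:
    print(f"accuracy={acc:.1f} passes_coherence={ok}")
```

Richer checks would of course cut out more of these policies; the disagreement is about whether enough checks leave only isolated points, which a toy like this can't settle.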
Overall I do think it's >50% that if the whole thing works, one of the two pieces worked independently.
It seems we've already done the experiments on training for plausibility without leveraging generalization across tasks (e.g. generic specification gaming examples, or summarization with less careful human raters). So I'd put <1% on that possibility. This estimate seems consistent with the view that across-task-generalization plausibly works out of the box (though I also consider that <<50%).
Thanks for sharing these thoughts. I'm particularly excited about the possibility of running empirical experiments to better understand potential risks of ML systems and contribute to debates about difficulties of alignment.
1. Potential implications for optimistic views on alignment
If we observe systems that learn to bullshit convincingly, but don't transfer to behaving honestly, I think that's a real challenge to the most optimistic views about alignment and I expect it would convince some people in ML.
I'm most interested in this point. IIUC, the viewpoint you allude to here is something along the lines of "There will be very important decisions we can't train directly for, but we'll be able to directly apply ML to these decisions by generalizing from feedback on easier decisions." We can call this "optimism about generalization" (of honesty, on out-of-distribution tasks). I'm generally pretty skeptical about this position (see point 2 below).
OTOH, there is a different reason for optimism, along the lines of "The set of decisions where 'humans never get the right answer' is small to none, and we don't need to use ML directly for these cases, and this situation is helped dramatically by the fact that we can use ML for questions we can supervise to indirectly help us with these questions." For example, in amplification, ML systems can summarize important arguments on both sides of thorny questions, produce scientific insights we can directly evaluate, guide us through stacks of sub-questions, and so on. We can call this "optimism about scalable oversight".
My view is that a negative result in the experiments proposed here would point against optimism about generalization, but not against optimism about scalable oversight. (I'm curious if this seems right to you.) And I imagine your view is that (a) this is still useful, since optimism about generalization is fairly common (among ML researchers you talk to), (b) we should in fact currently have some uncertainty about optimism about generalization which this would address (see point 2 below), and (c) in the limit, scalable human feedback might not be competitive with more heavily generalization-reliant approaches, and so we need to better understand these generalization questions too.
2. Skepticism about optimism about generalization (of honesty, on out-of-distribution tasks)
I'd reasonably strongly (as a really rough number off the top of my head, >80%) expect a negative result from these experiments, and am also generally pessimistic about the broader approach of optimism about generalization. For these experiments, this depends in large part on what's considered a positive or negative result.
I agree that very plausibly:
At the same time:
Training for plausibility or coherence
I'm pretty wary about this, and we should hold such approaches to a high standard of proof. We already have a bunch of examples for what happens when you optimize for "plausible but not necessarily correct": you end up with plausible but not necessarily correct outputs. (And so in general, I think we should be cautious about claiming any particular generalization feature will fix this, unless we have a principled reason for thinking so.)
I realize that the key point is that honesty might generalize from the tasks where we can directly supervise for correctness. But even then, it seems that if you think the basic generalization story won't work, you shouldn't expect training for plausibility to rescue it; we should expect the responses to become much more plausible, and somewhat (but not completely) more correct. So I'd be wary of whether training for plausibility would be masking the problem without addressing the root cause.
Thanks for writing this. I've been having a lot of similar conversations, and found your post clarifying in stating a lot of core arguments clearly.
Is there an even better critique that the Skeptic could make?
Focusing first on human preference learning as a subset of alignment research: I think most ML researchers "should" agree on the importance of simple human preference learning, both from a safety and capabilities perspective. If we take the narrower question "should we do human preference learning, or is pretraining + minimal prompt engineering enough?", I feel confident in the answer you give as Advocate: To the extent prompt engineering works, it's because it's preference learning in disguise, and leaning into preference learning (including supervised / RL finetuning) will work much better. Both the theoretical and empirical pictures to date agree with this.
(My sense is that not all ML researchers immediately agree with this / maybe just haven't considered the question in this frame, but that most researchers are pretty receptive to it and will agree in discussion.)
So I think a more challenging Skeptic might say: "Perhaps simple human preference learning is enough, and we can focus all alignment research there. Why do we need the other research directions in the alignment portfolio like handling inaccessible information, deceptive mesa-optimizers, or interpretability?" Here, "simple" human preference learning is referring to something like supervised (your step 1 for Question 1) + RL finetuning (step 2) + ad hoc ways of making it easier for humans to supervise models (limited versions of step 3).
I again side with Advocate here, but I think making the case is more difficult (and also perhaps requires different arguments for different research directions). I don't have a response for this as short or convincing as what you have here. My typical response would expand on your points that more capable models will be more dangerous and that alignment might turn out to be very hard, so it's important to consider these potential difficulties in advance. The hardness claim would probably involve failure stories (along these lines) or more abstract hardness arguments (along these lines).
I'd be interested in the relationship between this and Implicit Gradient Regularization and the sharp/flat minima lit.
The basic idea there is to compare the continuous gradient flow on the original objective, to the path followed by SGD due to discretization. They show that the latter can be re-interpreted as optimizing a modified objective which favors flat minima (low sensitivity to parameter perturbations). This isn't clearly the same as what you're analyzing here, since you're looking at variance due to sampling instead, but they might be related under appropriate regularity conditions.
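For reference, a sketch of the kind of statement I have in mind. This is written from memory and in my own notation ($h$ for the step size, $E$ for the loss, $\theta$ for the parameters), so it is worth checking against the paper. For full-batch gradient descent, the discrete iterates stay close to the gradient flow of a modified loss of the form

$$\tilde{E}(\theta) = E(\theta) + \frac{h}{4}\,\lVert \nabla E(\theta) \rVert^2 .$$

The extra term penalizes large gradient norm, which is where the preference for flatter regions comes from. As I understand it, the minibatch version has a similar penalty on per-batch gradient norms, and that is the part that might connect to the sampling variance you're looking at.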
Sam Smith also has this nice paper on sharp/flat minima which links a lot of previous observations together, and has some similarities to your approach here.
Haven't thought about any of this too closely, so my apologies if these aren't useful! Seems close enough that it might be of interest though.
Thanks for the great post. I found this collection of stories and framings very insightful.
1. Strong +1 to "Problems before solutions." I'm much more focused when reading this story (or any threat model) on "do I find this story plausible and compelling?" (which is already a tremendously high bar) before even starting to get into "how would this update my research priorities?"
2. I wanted to add a mention to Katja Grace's "Misalignment and Misuse" as another example discussing how single-single alignment problems and bargaining failures can blur together and exacerbate each other. The whole post is really short, but I'll quote anyways:
I think a likely scenario leading to bad outcomes is that AI can be made which gives a set of people things they want, at the expense of future or distant resources that the relevant people do not care about or do not own...When the business strategizing AI systems finally plough all of the resources in the universe into a host of thriving 21st Century businesses, was this misuse or misalignment or accident? The strange new values that were satisfied were those of the AI systems, but the entire outcome only happened because people like Bob chose it knowingly (let’s say). Bob liked it more than the long glorious human future where his business was less good. That sounds like misuse. Yet also in a system of many people, letting this decision fall to Bob may well have been an accident on the part of others, such as the technology’s makers or legislators.
In the post's story, "misalignment" and "misuse" seem like two different, both valid, frames on the problem.
3. I liked the way this point is phrased on agent-agnostic and agent-centric (single-single alignment-focused) approaches as complementary.
The agent-focused and agent-agnostic views are not contradictory... Instead, the agent-focused and agent-agnostic views offer complementary abstractions for intervening on the system... Both types of interventions are valuable, complementary, and arguably necessary.
At one extreme end, in the world where we could agree on what constitutes an acceptable level of xrisk, and could agree to not build AI systems which exceed this level, and give ourselves enough time to figure out the alignment issues in advance, we'd be fine! (We would still need to do the work of actually figuring out a bunch of difficult technical and philosophical questions, but importantly, we would have the time and space to do this work.) To the extent we can't do this, what are the RAAPs, such as intense competition, which prevent us from doing so?
And at the other extreme, if we develop really satisfying solutions to alignment, we also shouldn't end up in worlds where we have "little human insight" or factories "so pervasive, well-defended, and intertwined with our basic needs that we are unable to stop them from operating."
I think Paul often makes this point in the context of discussing an alignment tax. We can both decrease the size of the tax, and make the tax more appealing/more easily enforceable.
4. I expect to reconsider many concepts through the RAAPs lens in the next few months. Towards this end, it'd be great to see a more detailed description of what the RAAPs in these stories are. For example, a central example here is "the competitive pressure to produce." We could also maybe think about "a systemic push towards more easily quantifiable metrics (e.g. profit vs. understanding or global well-being)" which WFLL1 talks about or "strong societal incentives for building powerful systems without correspondingly strong societal incentives for reflection on how to use them". I'm currently thinking about all these RAAPs as a web (or maybe a DAG), where we can pull on any of these different levers to address the problem, as opposed to there being a single true RAAP; does that seem right to you?
Relatedly, I'd be very interested in a post investigating just a single RAAP (what is the cause of the RAAP? what empirical evidence shows the RAAP exists? how does the RAAP influence various threat models?). If you have a short version too, I think that'd help a lot in terms of clarifying how to think about RAAPs.
5. My one quibble is that there may be some criticism of the AGI safety community which seems undeserved. For example, when you write "That is, outside the EA / rationality / x-risk meme-bubbles, lots of AI researchers think about agent-agnostic processes," it seems to imply that inside this community, researchers don't think about RAAPs (though perhaps this is not what you meant!) It seems that many inside these circles think about agent-agnostic processes too! (Though not framed in these terms, and I expect this additional framing will be helpful.) Your section on "Successes in our agent-agnostic thinking" gives many such examples.
This is a quibble in the sense that, yes, I absolutely agree there is lots of room for much needed work on understanding and addressing RAAPs, that yes, we shouldn't take the extreme physical and economic competitiveness of the world for granted, and yes, we should work to change these agent-agnostic forces for the better. I'd also agree this should ideally be a larger fraction of our "portfolio" on the margin (acknowledging pragmatic difficulties to getting here). But I also think the AI safety community has had important contributions on this front.