Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a link post for https://aligned.substack.com/p/alignment-mvp

I'm writing a sequence of posts on the approach to alignment I'm currently most excited about. This second post argues that instead of trying to solve the alignment problem once and for all, we can succeed with something less ambitious: building a system that allows us to bootstrap better alignment techniques.


Building weak AI systems that help improve alignment seems extremely important to me and is a significant part of my optimism about AI alignment. I also think it's a major reason that my work may turn out not to be relevant in the long term.

I still think there are tons of ways that delegating alignment can fail, such that it matters that we do alignment research in advance:

  • AI systems could have comparative disadvantage at alignment relative to causing trouble, so that AI systems are catastrophically risky before they solve alignment. Or more realistically, they may be worse at some parts of alignment such that human efforts continue to be relevant even if most of the work is being done by AI.
  • It could be very hard to oversee work on alignment because it's hard to tell what constitutes progress. Being really good at alignment ourselves increases the probability that we know what to look for, so that we can train AI systems to do alignment research. More broadly, doing a bunch of alignment increases the chances that we know how to automate it.
  • There may be long serial dependencies, where it's hard to scale up work in alignment too rapidly and early work can facilitate larger investments down the line. Or more generally there could be prep work that's important to do in advance that we can recognize and then do. This is closely related to the last two points.
  • It may be that alignment is just not soluble, and by recognizing that further in advance (and understanding the nature of the difficulty) we can have a better response (like actually avoiding dangerous forms of AI altogether, or starting to invest more seriously in various hail mary plans).

Overall I think that "make sure we are able to get good alignment research out of early AI systems" is comparably important to "do alignment ourselves." Realistically I think the best case for "do alignment ourselves" is that if "do alignment" is the most important task to automate, then just working a ton on alignment is a great way to automate it. But that still means you should be investing quite a significant fraction of your time in automating alignment.

I also basically buy that language models are now good enough that "use them to help with alignment" can be taken seriously and it's good to be attacking it directly.

What do you (or others) think is the most promising, soon-possible way to use language models to help with alignment? A couple of possible ideas:

  1. Using LMs to help with alignment theory (e.g., alignment forum posts, ELK proposals, etc.)
  2. Using LMs to run experiments (e.g., writing code, launching experiments, analyzing experiments, and repeat)
  3. Using LMs as research assistants (what Ought is doing with Elicit)
  4. Something else?

This seems to completely ignore the main problem with approaches which try to outsource alignment research to AGI: optimizing for alignment strategies which look promising to a human reviewer will also automatically incentivize strategies which fool the human reviewer. Evaluation is not actually easier than generation, when Goodhart is the main problem to begin with.

Evaluation is not actually easier than generation, when Goodhart is the main problem to begin with.

I think it's very unclear how big a problem Goodhart is for alignment research---it seems like a question about a particular technical domain. There are domains where evaluation is much easier; most obviously mathematics, but also in e.g. physics or computer science, there are massive gaps between recognition and generation even if you don't have formal theorem statements. There are also domains where it's not much easier, where the whole thing rests on complicated judgments where the search for clever arguments just isn't doing much work.

It looks to me like alignment is somewhere in the middle, though it's not at all clear---right now there are different strands of alignment progress, which seem to have very different properties with respect to the ease of evaluation.

The kind of Goodhart we are usually concerned about is stuff like "it's easier to hijack the reward signal than to actually perform a challenging task," and I don't think that's very tightly correlated with the question about alignment. So this feels like the rhetoric here involves a bit of an equivocation.

I think it's very unclear how big a problem Goodhart is for alignment research---it seems like a question about a particular technical domain.

Just a couple weeks ago I had this post talking about how, in some technical areas, we've been able to find very robust formulations of particular concepts (i.e. "True Names"). The domains where evaluation is much easier - math, physics, CS - are the domains where we have those robust formulations. Even within e.g. physics, evaluation stops being easy when we're in a domain where we don't have a robust mathematical formulation of the phenomena of interest.

The other point of that post is that we do not currently have such formulations for the phenomena of interest in alignment, and (one way of framing) the point of foundational agency research is to find them.

So I agree that the difficulty of evaluation varies by domain, but I don't think it's some mysterious hard-to-predict thing. The places where robust evaluation is easy all build on qualitatively-similar foundational pieces, and alignment does not yet have those sorts of building blocks.

The kind of Goodhart we are usually concerned about is stuff like "it's easier to hijack the reward signal than to actually perform a challenging task," and I don't think that's very tightly correlated with the question about alignment. So this feels like the rhetoric here involves a bit of an equivocation.

Go take a look at that other post, it has two good examples of how Goodhart shows up as a central barrier to alignment.

I don't buy the empirical claim about when recognition is easier than generation. As an example, I think that you can recognize robust formulations much more easily than you can generate them in math, computer science, and physics. In general I think "recognition is not trivial" is different from "recognition is as hard as generation."

I found this comment pretty convincing. Alignment has been compared to philosophy, which seems to sit at the opposite end of "the fuzziness spectrum" from math and physics. And it does seem like concept fuzziness would make evaluation harder.

I'll note though that ARC's approach to alignment seems more math-problem-flavored than yours, which might be a source of disagreement between you two (since maybe you conceptualize what it means to work on alignment differently).

If it turns out that evaluation of alignment proposals is not easier than generation, we're in pretty big trouble because we'll struggle to convince others that any good alignment proposals humans come up with are worth implementing.

You could still argue by generalization that we should use alignment proposals produced by humans who had a lot of good proposals on other problems even if we're not sure about those alignment proposals. But then you're still susceptible to the same kinds of problems.

I think we need to unpack "sufficiently aligned"; here's my attempt. There are A=2^10000 10000-bit strings. Maybe 2^1000 of them are coherent English text, and B=2^200 of these are alignment proposals that look promising to a human reviewer, and C=2^100 of them are actually correct and will result in aligned AI. The thesis of the post requires that we can make a "sufficiently aligned" AI that, conditional on a proposal looking promising, is likely to be actually correct.

  • A system that produces a random 10000-bit string that looks promising to a human reviewer is not "sufficiently aligned"
  • A system that follows the process that the most truthful possible humans use to do alignment research is sufficiently aligned (or if not, we're doomed anyway). Truth-seeking humans doing alignment research are only accessing a tiny part of the space of 2^200 persuasive ideas, and most of it lies within the subset of 2^100 truthful ideas.
  • If the system is selecting for appearance, it needs to also have 100 bits of selection towards truth to be sufficiently aligned.

We can't get those 100 bits through further selection for appearance. It seems plausible that we can get them somehow, though.
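The bit-counting above can be sketched as a toy calculation. All exponents here are the hypothetical orders of magnitude from this comment, not measured quantities:

```python
# Toy version of the counting argument above. The exponents are the
# comment's hypothetical numbers, not measured quantities.
from fractions import Fraction

PROMISING_BITS = 200  # log2(# proposals that look promising to a human reviewer)
CORRECT_BITS = 100    # log2(# proposals that are actually correct)

# Assume, as the bullets do, that actually-correct proposals are a subset
# of promising-looking ones. Then a random promising-looking proposal is
# correct with probability 2^-(200-100) = 2^-100:
p_correct_given_promising = Fraction(1, 2 ** (PROMISING_BITS - CORRECT_BITS))

# The "100 bits of selection towards truth" the comment asks for:
bits_needed = PROMISING_BITS - CORRECT_BITS
print(bits_needed)  # 100
```

The point of the exercise is just that selecting harder on appearance changes PROMISING_BITS but not the gap, so the missing 100 bits have to come from somewhere else.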

Is your story:

  1. AI systems are likely to be much better at persuasion than humans, relative to how good they are at alignment.
  2. Actually, if a human were trying to write down a convincing alignment proposal, it would be much easier to trick us than to write down a good proposal.

It sounds like you are thinking of 2. But I think we have reasonably good intuitions about that. I think for short evaluations "fool us" is obviously easier. For long evaluations (including similarly-informed critics pointing out holes etc.) I think that it rapidly becomes easier to just do good work (though it clearly depends on the kind of work).

Consider the space of 10-page google docs. Within this space, we pick out all the google docs which some human evaluator would consider a good alignment proposal. (You can imagine the human is assisted in some way if you want, it makes little difference to this particular argument.) Then the question is, what fraction of these will actually be good alignment proposals? So, we have two relevant numbers:

  • Number of proposals which look good to the human
  • Number of proposals which look good to the human AND are actually good

Now, the key heuristic: in a high-dimensional space, adding any non-simple constraint will exponentially shrink the search space. "Number of proposals which look good to the human AND are actually good" has one more complicated constraint than "Number of proposals which look good to the human", and will therefore be exponentially smaller.

So in "it would be much easier to trick us than to write down a good proposal", the relevant operationalization of "easier" for this argument is "the number of proposals which both look good and are good is exponentially smaller than the number which look good".

I think that argument applies just as easily to a human as to a model, doesn't it?

So it seems like you are making an equally strong claim that "if a human tries to write down something that looks like good alignment work almost all of it will be persuasive but bad." And I think that's kind of true and kind of not true. In general I think you can get much better estimates by thinking about delegating to sociopathic humans (or to humans with slightly different comparative advantages) than trying to make a counting argument.

(I think the fact that "how smart the human is" doesn't matter mostly just proves that the counting argument is untethered from the key considerations.)

A human writing their own alignment proposal has introspective access to the process-which-generates-the-proposal, and can get a ton of bits from that. They can trust the process, rather than just the output.

A human who is good at making their own thinking process legible to others, coupled with an audience who knows to look for that, could get similar benefits in a more distributed manner. Faking a whole thought process is more difficult, for a human, than simply faking an output. That does not apply nearly as well to an AI; it is far more likely that the AI's thought-process would be very different from ours, such that it would be easier to fake a human-legible path than to truly follow one from the start.

(I think the fact that "how smart the human is" doesn't matter mostly just proves that the counting argument is untethered from the key considerations.)

I think "how smart the human is" is not a key consideration.

I think how well we can evaluate claims and arguments about AI alignment absolutely determines whether delegating alignment to machines is easier than doing alignment ourselves. A heuristic argument that says "evaluation isn't easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it" seems obviously wrong to me. If that's a good summary of the disagreement I'm happy to just leave it there.

A heuristic argument that says "evaluation isn't easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it" seems obviously wrong to me.

Yup, that sounds like a crux. Bookmarked for later.

I strongly agree with you that it'll eventually be very difficult for humans to tell apart AI-generated alignment proposals that look good and aren't good from ones that look good and are actually good.

There is a much stronger version of the claim "alignment proposals are easier to evaluate than to generate" that I think we're discussing in this thread, where you claim that humans will be able to tell all good alignment proposals apart from bad ones, or at least not accept any bad ones (precision matters much more than recall here, since you can compensate for bad recall with compute). If this strong claim is true, then conceptually RLHF/reward modeling should be sufficient as an alignment technique for the minimal viable product. Personally I think that this strong version of the claim is unlikely to be true, but I'm not certain that it will be false for the first systems that can do useful alignment research.

As William points out below, if we get AI-assisted human evaluation to work well, then we can uncover flaws in alignment proposals that are too hard to find for unassisted humans. This is a weaker version of the claim, because you're just claiming that humans + AI assistance are better at evaluating alignment proposals than human + AI assistance are at generating them. Generally I'm pretty optimistic about that level of supervision actually allowing us to supervise superhuman alignment research; I've written more about this here: https://aligned.substack.com/p/ai-assisted-human-feedback

Is the claim here that the 2^200 "persuasive ideas" would actually pass the scrutiny of top human researchers (for example, Paul Christiano studies one of them for a week and concludes that it is probably a full solution)?  Or do you just mean that they would look promising in a shorter evaluation done for training purposes?

Assuming humans can always find the truth eventually, the number of persuasive ideas probably shrinks as humans have more time-- maybe 2^300 in a training loop, 2^250 for Paul thinking for a day, 2^200 for Paul thinking for a month, 2^150 for Paul thinking for 5 years... I think the core point still applies.
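The shrinking can be made concrete with the same bookkeeping as before, using the illustrative numbers from this comment (all figures hypothetical):

```python
# Illustrative only: log2 counts of proposals that would pass each level of
# scrutiny, using the hypothetical numbers from this comment.
CORRECT_BITS = 100  # log2(# actually-correct proposals), assumed fixed

persuasive_bits = {
    "training-loop evaluation": 300,
    "Paul thinking for a day": 250,
    "Paul thinking for a month": 200,
    "Paul thinking for 5 years": 150,
}

# Residual gap: bits of selection towards truth still needed under each
# evaluation regime, assuming correct proposals always pass evaluation.
gaps = {regime: bits - CORRECT_BITS for regime, bits in persuasive_bits.items()}
for regime, gap in gaps.items():
    print(f"{regime}: ~{gap} bits short")
```

More evaluation effort shrinks the gap but, on these numbers, never closes it, which is the sense in which the core point still applies.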

I endorse this explanation.

You might think that humans are more robust on the distribution of [proposals generated by humans trying to solve alignment] vs [proposals generated by a somewhat superhuman model trying to get a maximal score].

yeah that's a fair point

If it turns out that evaluation of alignment proposals is not easier than generation, we’re in pretty big trouble because we’ll struggle to convince others that any good alignment proposals humans come up with are worth implementing.

But this is pretty likely the case though, isn't it? Actually I think by default the situation will be the opposite: it will be too easy to convince others that some alignment proposal is worth implementing, because humans are in general too easily convinced by informal arguments that look good but contain hidden flaws (and formalizing the arguments is both very difficult and doesn't help much because you're still depending on informal arguments for why the formalized theoretical concepts correspond well enough to the pre-theoretical concepts that we actually care about). Look at the history of philosophy, or cryptography, if you doubt this.

But suppose we're able to convince people to distrust their intuitive sense of how good an argument is, and to keep looking for hidden flaws and counterarguments (which might have their own hidden flaws and so on). Well, how do we know when it's safe to end this process and actually hit the run button?

It feels to me like there's basically no question that recognizing good cryptosystems is easier than generating them. And recognizing attacks on cryptosystems is easier than coming up with attacks (even if they work by exploiting holes in the formalisms). And recognizing good abstract arguments for why formalisms are inadequate is easier than generating them. And recognizing good formalisms is easier than generating them.

This is all true notwithstanding the fact that we often make mistakes. (Though as we've discussed before, I think that a lot of the examples you point to in cryptography are cases where there were pretty obvious gaps in formalisms or possible improvements in systems, and those would have motivated a search for better alternatives if doing so was cheap with AI labor.)

The example of cryptography was mainly intended to make the point that humans are by default too credulous when it comes to informal arguments. But consider your statement:

It feels to me like there’s basically no question that recognizing good cryptosystems is easier than generating them.

Consider some cryptosystem widely considered to be secure, like AES. How much time did humanity spend on learning / figuring out how to recognize good cryptosystems (e.g. finding all the attacks one has to worry about, like differential cryptanalysis), versus specifically generating AES with the background knowledge in mind? Maybe the latter is on the order of 10% of the former?

Then consider that we don't actually know that AES is secure, because we don't know all the possible attacks and we don't know how to prove it secure, i.e., we don't know how to recognize a good cryptosystem. Suppose one day we figure that out, wouldn't finding an actually good cryptosystem be trivial at that point compared to all the previous effort?

Some of your other points are valid, I think, but cryptography is just easier than alignment (don't have time to say more as my flight is about to take off), and philosophy is perhaps a better analogy for the more general point.

I think this concern is only relevant if your strategy is to do RL on human evaluations of alignment research. If instead you just imitate the distribution of current alignment research, I don't think you get this problem, at least any more than we have it now--and I think you can still substantially accelerate alignment research with just imitation. Of course, you still have inner alignment issues, but from an outer alignment perspective I think imitation of human alignment research is a pretty good thing to try.

Evaluation assistance as mentioned in the post on AI-assisted human feedback could help people avoid being fooled (e.g. in debate where the opponent can point out how you're being fooled). It's still an open question how well that will work in practice and how quickly it will Goodhart (these techniques should fail on some things, as discussed in the ELK report), but it seems possible that models will be helpful enough on alignment before they Goodhart.

What are people's timelines for deceptive alignment failures arising in models, relative to AI-based alignment research being useful?

Today's language models are on track to become quite useful, without showing signs of deceptive misalignment or its eyebrow-raising pre-requisites (e.g., awareness of the training procedure), afaik. So my current best guess is that we'll be able to get useful alignment work from superhuman sub-deception agents for 5-10+ years or so. I'm very curious if others disagree here though

I personally have pretty broad error bars; I think it's plausible enough that AI won't help with automating alignment that it's still valuable for us to work on alignment, and plausible enough that AI will help with automating alignment that it significantly increases our chances of survival and is worth preparing for making use of. I also tend to think that current progress in language modeling seems to suggest that models will reach the point of being extremely helpful with alignment way before they become super scary.

Eliezer has consistently expressed confidence that AI systems smart enough to help with alignment will also be smart enough that they'll inevitably be trying to kill you. I don't think he's really explained this view, and I've never found it particularly compelling. I think a lot of folks around LW have absorbed a similar view; I'm not totally sure how much it comes from Eliezer, but I'd guess that's a lot of it.

I think part of Eliezer's views of this come from a view of intelligence and recursive self-improvement that imply that explosive recursive self-improvement begins before high object-level competence on other research tasks. I think this view is most likely mistaken, but my guess is that it's tied up with Eliezer's views about how to build AGI closely enough that Eliezer won't want to defend his position here.

(My position is the very naive one, that recursive self-improvement will become critical at roughly the same time that AI systems are better than humans at contributing to further AI progress, which has roughly a 50-50 shot of happening before alignment progress.)

Beyond that, Eliezer has not said very much about where these intuitions are coming from. What he has said does not seem (to me) to have fared particularly well over the last few years. For example:

Similar remarks apply to interpreting and answering "What will be its effect on _?" It turns out that getting an AI to understand human language is a very hard problem, and it may very well be that even though talking doesn't feel like having a utility function, our brains are using consequential reasoning to do it. Certainly, when I write language, that feels like I'm being deliberate. It's also worth noting that "What is the effect on X?" really means "What are the effects I care about on X?" and that there's a large understanding-the-human's-utility-function problem here.

In fact it does not seem hard to get AI systems to understand the relevant parts of human language (relative to being able to easily kill all humans or to inevitably be trying to kill all humans). And it does not seem hard to get an AI to predict which things you will judge to be relevant, well enough that this is a very bad way of explaining why Holden's proposal would fail.

Of course getting an AI to tell you what it's really thinking may be hard (and indeed I think it's hard enough that I think there's a significant probability that we will all die because we failed to solve it). And I think Eliezer even has a fair model of why it's hard (or at least I've often defended him based on a more charitable reading of his overall views).

But my point is that to the extent Eliezer has explained why he thinks AI won't be helpful until it's too late, so far it doesn't seem like adjacent intuitions have stood the test of time well.

Your link redirects back to this page. The quote is from one of Eliezer's comments in Reply to Holden on Tool AI.

A model which is just predicting the next word isn't optimizing for strategies which look good to a human reviewer; it's optimizing for truth itself (as contained in its training data). If you begin re-feeding its outputs as training inputs, then there could be a feedback loop leading to such incentives, but if the model is general and sufficiently intelligent, you don't need to do that. You can train it in a different domain and it will generalize to your domain of interest.

Even if you do that, you can try to make the new data grounded in reality in some way, like including experiment results. And the model won't just absorb the new data as truth; it will incorporate it into its world model to make better predictions. If it's fed a bunch of new alignment forum posts that are bad ideas which look good to humans, it will just predict that the alignment forum produces that kind of post, but that doesn't mean there isn't some prompt that can make it output what it actually thinks is correct.

IMO, the alignment MVP claim Jan is making is approximately "we only need to focus on aligning narrow-ish alignment research models that are just above human level, which can be done with RRM (and maybe some other things, but no conceptual progress?)"
and requires:

  1. We can build models that are:
    1. Not dangerous themselves
    2. Capable of alignment research
    3. Alignable, using RRM, well enough that we can get useful research out of them
  2. We can build these models before [anyone builds models that would be dangerous without [more progress on alignment than is required for aligning the above models]]
  3. We have these models for long enough before danger, and/or the models speed up alignment progress by enough, that the alignment progress made during this time is comparably large to or larger than the progress made up to that date.


I'd imagine some cruxes to include:

  • whether it's possible to build models capable of somewhat superhuman alignment research that do not have inner agents
  • whether people will build systems that require conceptual progress in alignment to make safe before we can build the alignment MVP and get significant work out of it

I think I’m something like 30% on ‘The highest-leverage point for alignment work is once we have models that are capable of alignment research - we should focus on maximising the progress we make at that point, rather than on making progress now, or on making it to that point - most of the danger comes after it’

Things this maybe implies:

  • We should try to differentially advance models’ ability to do alignment research relative to other abilities (abilities required to be dangerous, or abilities required to accelerate capabilities)
    • For instance, trying to make really good datasets related to alignment, e.g. by paying humans to proliferate/augment all the alignment research and writing we have so far
    • Figuring out what combination of math/code/language/arxiv etc seem to be the most conducive to alignment-relevant capabilities
    • More generally, researching how to develop models that are strong in some domains and handicapped in others
  • We should focus on getting enough alignment to extract the alignment research capabilities
    • This might mean we only need to align:
      • models that are not agentic/not actively trying to deceive you
      • models that in many domains are subhuman
    • If we think these models are going to be close to having agency, maybe we want to avoid RL or other finetuning that incentivizes the model to think about its environment/human supervisors. Instead we might want to use some techniques that are more like interpretability or extracting latent knowledge from representations, rather than RLHF?
  • We should think about how we can use powerful models to accelerate alignment
  • We should focus more on how we would recognise good alignment research as opposed to producing it
    • For example, setups where you can safely train a fairly capable model according to some proposed alignment scheme, and see how well it works?