Machines of Faithful Obedience

[-]ryan_greenblatt5mo184

In fact, I am more worried about partial success than total failure in aligning AIs. In particular, I am concerned that we will end up in the “uncanny valley,” where we succeed in aligning AIs to a sufficient level for deployment, but then discover too late some “edge cases” in the real world that have a large negative impact.

I think it's pretty plausible that AIs which are obviously pretty misaligned are deployed (e.g. we've caught them trying to escape or sabotage research, or maybe we've caught an earlier model doing this and we don't have any strong reason to think we've resolved this in the current system). This is made more likely by an aggressive race to the bottom (possibly caused by an arms race over AI as you discuss). Part of my perspective here is that misalignment issues might be very difficult to resolve in time due to rapid AI progress and difficulty studying misalignment. (For instance, because the very AIs you're studying might not want to be studied!).

I also think it's plausible that very capable AIs will end up being schemers/alignment-fakers which look very aligned (sufficient for your deployment checks), but have misaligned long run aims. And, even if you found evidence of this in earlier AIs, this wouldn't suffice to prevent AIs where we haven't confidently ruled this out from being deployed (see prior paragraph). I also think it's plausible that you won't see smoking gun evidence of this before it's too late as I discuss in a prior post of mine.

The issues I'm worried about don't feel like edge cases or an uncanny valley to me (though I suppose you could think of alignment faking as an uncanny valley).

My understanding is that you disagree about the possibility of relatively worst case scenarios with respect to scheming and think that people would radically change their approach if we had clear evidence of (relatively consistent, across context) scheming without a strong resolution of this problem that generalizes to more capable AIs. I hope you're right.

To be clear, I agree that it would be better if AIs are obviously (seriously) misaligned than if they are instead scheming undetected.

[-]boazbarak5mo40

I am not sure how much we actually disagree, but let me add some more clarificatoins.

My mental model is what happened with software security. It was never completely ignored, but for a long time I think many companies had the mental model of "do the minimal security work so that it's not obviously broken." For example, it took some huge scandals for Microsoft to make security its highest priority (see Bill Gates 2002 memo and retrospective). Ultimately the larger companies changed their mindset, but I would like us not to repeat this history with AI, especially given that it is likely to progress faster.

At the moment the way we deploy AIs are in settings (chat or using as coding agent for discrete tasks) where there very output is fed to a human that is ultimately responsible for the results. So this setting allows us to be quite tolerant of alignment failures. But I don't think this will last for long.

I do think that alignment faking is a form of "uncanny valley". There are some subtleties with the examples you mention since it is debatable whether we have "caught" models doing bad thing X or "entrapped" them to do so. This is why I like the sycophancy incident since it is one where (1) the bad thing occured in the wild, (2) it points out to a deeper issue which is the mixed objectives that AIs have, and in particular the objective to satisfy the user as well as the objective to follow the model spec / policies.

But, I agree that we should get to the point where it is impossible to even "entrap" the models to exhibit misaligned behavior. Again in cybersecurity there have been many examples of attacks that were initially dismissed by practitioners as being too "academic" but were eventually extended to realistic setting. I think these days people understand that even such "academic attacks" point out to real weaknesses, since (as people often say in cryptography) "attacks only get better".

[-]ryan_greenblatt5mo41

I think if the attitude in AI was "there can't be any even slightly plausible routes to misalignment related catastrophe" and this was consistently upheld in a reasonable way, that would address my concerns. (So, e.g. by the time we're deploying AIs which could cause huge problems if they were conspiring against us there needs to be a robust solution to alignment faking / scheming which has broad consensus among researchers in the area.)

I don't expect this because we seem very far from success and this might happen rapidly. (Though we might end up here after eating a bunch of ex-ante risk for some period while using AIs to do safety work.)

[-]boazbarak5mo51

Different people might have different interpretation for "slightly plausible" but I agree we are very far right now and need to step up our game!

[-]ryan_greenblatt5mo31

Agreed, maybe the relevant operationalization would be "broad consensus" and then we could outline a relevant group of researchers.

[-]ryan_greenblatt5mo*132

I left some comments noting disagreements, but I thought it be helpful to note some areas of agreement:

I agree AI could be (highly) risky and think that it's good to acknowledge this (as you do).
I agree that you could have a maximally obedient superintelligent AI. (Some questions around manipulation could be philosophically tricky, but this seems resolvable at least in principle.)
I agree that obedience (instruction following) is a good target, though I think there are some caveats to this (and uncertainties I have).
I agree there are substantial risks from arms races (which cause a race to the bottom on safety), AI enabled authoritarianism (which could be ~indefinitely stable), and undesirable societal shifts due to unemployment and general upheaval. I'm most worried about misalignment risks, though these risks might be increased by these other factors, particularly arms races.
I agree that pure internal deployment, single monopoly, and safety underinvestment would (probably) make these risks worse. Though I might think of the downsides of these as somewhat different than what you're focusing on.
I agree that probably offense-defense issue can be handled if most intelligence is in the hands of good actors, and careful (and potentially quite strong) actions are taken to defend properly. (I'm maybe skeptical that sufficiently strong actions will be taken in practice for reasons similar to those discussed here, though I don't agree with the bottom line of this linked post overall.)

[-]boazbarak5mo40

Thank you! I am happy to see that there are so many points of agreement!

[-]ryan_greenblatt5mo62

Currently, it is very challenging to train AIs to achieve tasks that are too hard for humans to supervise. I believe that (a) we will need to solve this challenge to unlock “self-improvement” and superhuman AI, and (b) we will be able to solve it. I do not want to discuss here why I believe these statements. I would just say that if you assume (a) and not (b), then the capabilities of AIs, and hence their risks, will be much more limited.
Ensuring AIs accurately follow our instructions, even in settings too complex for direct human oversight, is one such hard-to-supervise objective. Getting AIs to be maximally honest, even in settings that are too hard for us to verify, is another such objective. I am optimistic about the prospects of getting AIs to maximize these objectives.

I agree that to train AIs which are generally very superhuman, you'll need to be able to make AIs highly capable on tasks that are too hard for humans to supervise. And, that if we have no ability to make AIs capable on tasks which are hard for humans to supervise, risks are much more limited.^[1]

However, I don't think that making AIs highly capable on tasks which are too hard for humans to supervise necessarily requires being able to ensure AIs do what we want in these settings nor does it require being able to train AIs for specific objectives in these settings.

Instead, you could in principle create very superhuman AIs through transfer (as humans do in many cases) and this wouldn't require any ability to directly supervise on domains where the AI ends up being superhuman nevertheless. Further, you might be able to directly train AIs to be highly capable (as in, without depending on much transfer) using flawed feedback in a given domain (e.g. feedback which is often possible to reward hack but which still teaches the AI the relevant abilities).

So, I agree that the ability to make very superhuman AIs implies that we'll (very likely) be able to make AIs which are capable of following our instructions and which are capable of being maximally honest, but this doesn't imply that we'll be able to ensure these properties (e.g. the AI could intentionally disobey instructions or lie). Further, there is a difference between being able to supervise instruction following and honesty in any given task and being able to produce an AI which robustly instruction follows and is honest. (Things like online training using this supervision only give you average case guarantees, and that's if you actually use online training.)

It's certainly possible that there is substantial transfer from the task of "train AIs to be highly capable (and useful) in harder-to-check domains" to the task of ensuring AIs are robustly honest and instruction following, but it is also easy to imagine ways this could go wrong. E.g., the AIs are faking alignment or at some level of capability, increased capabilities (from transfer or whatever) still makes the AIs (seem) more useful while simultaneously it makes them less instruction following and honest due to issues in the training signal.

(My all-things-considered view is that the default course of advancing capabilities and usefulness will end up figuring out some ways to supervise AIs in training sufficiently well to train them to perform reasonably well on average in most cases (according to human judgment of outcomes), but this performance will substantially depend on transfer and generalization in ways which aren't robust to egregious misalignment. And that this will solve some alignment problems that would have otherwise existed if not for people optimizing for usefulness over pure capabilities. That said I also think it's possible that we'll be seeing increasingly sophisticated and egregious reward hacking rise with capabilities but with a sufficient increase in usefulness in many domains despite this reward hacking such that scaling continues. And, I don't think handling worst case scheming/alignment-faking will be very incentivized by commercial/capabilities incentives by default.)

You might separately be optimistic about getting superhuman AIs to robustly follow instructions, but I don't think "we have a way to make the AIs superhumanly capable in general" implies "we can ensure the AIs actually robustly follow instructions (rather than just being capable of following instructions)".

To the extent that you're defining "capable" in a somewhat non-standard way, it seems good to be careful about this and consider explaining how you are using these terms (or consider defining a new term).

^{^}
That said, I do think it would be possible in principle to automate AI R&D or at least automate most of AI R&D (perhaps what you mean by unlocking "self-improvement") even if we could only initially make AIs highly capable on tasks which humans can supervise. Humans can supervise the tasks involved in AI R&D and superhumanness isn't necessarily required for this automation. Also, there are verification generation gaps in AI R&D because we can use outcome based feedback, so you can in principle get substantially superhuman AI R&D while only training AIs on tasks you could have in principle supervised. In practice, the way AI R&D is done today often requires doing tasks which are expensive to run and only happen a few times (e.g. big training runs), so literally doing outcome based RL over the whole process wouldn't work with the current setup. But, humans can in principle supervise the process, it's just that the process is expensive to run.

[-]boazbarak5mo50

I agree that I am not justifying in this essay my optimism about getting superhuman AI to robustly follow instructions. You can think as the above as more like the intuition that generally points out in that direction.

Justifying this optimism is a side discussion, but TBH rather than discussions I hope that we can make empirical progress toward justifying it.

[-]ryan_greenblatt5mo60

I believe that AI (including AGI and ASI) can do the same and be a positive force for humanity. I also believe that it is possible to solve the “technical alignment” problem and build AIs that follow the words and intent of our instructions and report faithfully on their actions and observations.
I will not defend these two claims here. However, even granting these optimistic premises, AI’s positive impact is not guaranteed.

It's interesting that you describe the claims "AI can be a positive force for humanity" and "technical alignment can be solved" as optimistic premises. I think people who think misalignment risks are catastrophically high (e.g. a >25% chance of literal AI takeover) would agree with these premises (they think misalignment risks could be avoided, it's just that we're at least somewhat likely to not succeed at this), so these claims don't typically distinguish between typical optimists and pessimists (at least with respect to worries around misalignment risks).

Perhaps when you say "it is possible to solve" you mean "we're very likely to solve it in practice given the realistic time and budget we will have for this problem". In this case, there certainly is disagreement!

I'm not sure whether you were trying to highlight typical disagreements or if you introduced things in this way for some othe reason.

You might imagine that by positing that technical alignment is solvable, I “assumed away” all the potential risks with AI.

I certainly don't think so! As you note, solvable does not imply that it will be solved!

[-]boazbarak5mo72

Thanks! Note that I did not optimize this essay just for the LessWrong audience, so different people might have different points on agreements and disagreement.

I think I am indeed more optimistic, but I would not say that we will solve it "by default." Like I say in the essay, I don't think that by simply letting the market work we will get sufficient level of alignment. We need to make an effort- this is similar to the example of Microsoft that I mentioned in this comment, but I hope that unlike the Microsoft example we don't need to wait for huge failures until we do that.

[-]Seth Herd5mo40

See my recent post Problems with instruction-following as an alignment target

[-]boazbarak5mo10

Thank you! This is quite relevant. Some of your concerns are about the technical feasibility of achieving instruction following, which is not a point I’m going into in this post. FWIW when I say “instruction following” I mean models that respect the instruction hierarchy and chain of command (for example as in our model spec). However I do not mean models that obey hypothetical future instructions that were not given to them (which is one option you mention in your post).

[-]Seth Herd5mo20

If they don't obey future instructions not yet given, the only sensible thing to do to carry out your current instructions thoroughly and with certainty is to make sure you can't issue new instructions. That would logically prevent them from completing their instructions and represent utter failure in their goal.

[-]boazbarak5mo10

I think AI assistants would have common sense even if they are obedient. I doubt an AI assistants would interpret “go fetch me coffee” as “kill me first so I can’t interrupt your task and then fetch me coffee” but YMMV

[-]Seth Herd5mo20

I don't think it's safe to assume that LLM-based AGI will have common sense (maybe this is different from the assistant you're addressing, in that it's a lot smarter and can think for itself more?). I'm talking about machines based on networks, but which can also reason in depth. They will understand common sense, but that doesn't ensure that it will guide their reasoning.

So it depends what you mean by "obedient". And how you trained them to be obedient. And whether that ensures that their interpretation doesn't change once they can reason more deeply than we can.

So I think those questions require serious thought, but you can't tackle it all at once, so starting by assuming that all works is also sensible. I'm just focusing on that first, because I don't think that's likely to work unless we put a lot more careful thought in before we try it.

If you look at my previous posts on the topic, linked from Problems with instruction-following, you'll see that I was initially more focused on downstream concerns like yours. After working in the field for longer and having more in-depth discussions and research on the techniques we'd use to align future LLM agents, I am increasingly concerned that it's not that easy, and we should focus on the difficulty of aligning AGI while we still have time. Difficulties from aligned AGI are also substantial, and I've addressed those as well in my string of work backlinked from Whether governments will control AGI is important and neglected

I am also drawn to the idea that government control of AGI is quite dangerous; I've addressed this tension in Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours

But on the whole, I think the risks of widely distributed AGI are even greater. How many people can control a mind that can create technologies capable of taking over or destroying the world before someone uses them?

Michael Nielsen's excellent ASI existential risk: Reconsidering Alignment as a Goal is a similar analysis of why even obedient aligned AGI would be intensely dangerous if it's allowed to proliferate.

[-]Nina Panickssery5mo20

I liked this post and its framing of the alignment problem.

[-]Stephen Martin5mo10

On the "safety underinvestment" point I am going to say something which I think is obvious (and has probably been discussed before) but I have not personally seen anyone advocate for:

The DoD should be conducting its own safety/alignment research, and pushing for this should be one of the primary goals of safety advocates.

There is this constant push for labs to invest more in safety. At the same time we all acknowledge this is a zero sum game. Every dollar/joule/flop they put into safety/alignment is a dollar/joule/flop they can't put into capabilities research. I think this dynamic of insisting that labs carry the torch entirely on safety, instead of pushing for government funded safety research, is a big part of why you've seen labs/VCs becoming increasingly hostile to safety advocacy/regulation. Because the only pitch so far makes safety onerous for labs.

Pushing for safety/alignment research to be done by the DoD allows for investments to be made into safety in a way that doesn't require labs reroute scarce resources or decelerate their capabilities progress. It can also be purely additive, there's no reason labs can't continue their own safety research as well.

[-]boazbarak5mo10

I think it's useful for government to do safety/alignment but:

1. I don't think it's a zero sum game - often safety/alignment improvements go hand in hand with capabilities.
2. Like we have seen with software security, if safety is not "baked in", it is hard to add it after the fact, so it is important for safety and capabilities researchers to work together.

LESSWRONG
LW

LESSWRONG
LW

37

Machines of Faithful Obedience

37

37

Faithful, obedient, superintelligent.

Are we there yet?

Risky scenarios