Alignment Stream of Thought

Wiki Contributions


I suppose you're talking about this paper (https://arxiv.org/abs/2210.10760). It's important to note that in the setting of this paper, the reward model is only trained on samples from the original policy, whereas GAN discriminators are constantly trained with new data. Section 4.3 touches briefly on the iterated problems, which is closer in setting to GANs, where we correspondingly expect a reduction in overoptimization (i.e the beta term).

It is definitely true that you have to be careful whenever you're optimizing any proxy metric, and this is one big reason I feel kind of uncomfortable about proposals like RLHF/RRM. In fact, our setting probably underestimates the amount of overoptimization due to the synthetic setup. However, it does seem like GAN mode collapse is largely unrelated to this effect of overoptimization, and it seems like gwern's claim is mostly about this.

I expect that the key externalities will be borne by society. The main reason for this is I expect deceptive alignment to be a big deal. It will at some point be very easy to make AI appear safe, by making it pretend to be aligned, and very hard to make it actually aligned. Then, I expect something like the following to play out (this is already an optimistic rollout intended to isolate the externality aspect, not a representative one):

We start observing alignment failures in models. Maybe a bunch of AIs do things analogous to shoddy accounting practices. Everyone says "yes, AI safety is Very Important". Someone notices that when you punish the AI for exhibiting bad behaviour with RLHF or something the AI stops exhibiting bad behaviour (because it's pretending to be aligned). Some people are complaining that this doesn't actually make it aligned, but they're ignored or given a token mention. A bunch of regulations are passed to enforce that everyone uses RLHF to align their models. People notice that alignment failures decrease across the board. The models don't have to somehow magically all coordinate to not accidentally reveal deception, because even in cases where models fail in dangerous ways people chalk this up to the techniques not being perfect, but they're being iterated on, etc. Heck, humans commit fraud all the time and yet it doesn't cause people to suddenly stop trusting everyone they know when a high profile fraud case is exposed. And locally there's always the incentive to just make the accounting fraud go away by applying Well Known Technique rather than really dig deep and figuring out why it's happening. Also, a lot of people will have vested interest in not having the general public think that AI might be deceptive, and so will try to discredit the idea as being fringe. Over time, AI systems control more and more of the economy. At some point they will control enough of the economy to cause catastrophic damage, and a treacherous turn happens.

At every point through this story, the local incentive for most businesses is to do whatever it takes to make the AI stop committing accounting fraud or whatever, not to try and stave off a hypothetical long term catastrophe. A real life example that this is analogous to is antibiotic overuse.

This story does hinge on "sweeping under the rug" being easier than actually properly solving alignment, but if deceptive alignment is a thing and is even moderately hard to solve properly then this seems very likely the case.

I expect society (specifically, relevant decision-makers) to start listening once the demonstrated alignment problems actually hurt people

I predict that for most operationalizations of "actually hurt people", the result is that the right problems will not be paid attention to. And I don't expect lightning fast takeoff to be necessary. Again, in the case of climate change, which has very slow "takeoff", millions of people are directly impacted, and yet governments and major corporations move very slowly and mostly just say things about climate change mitigation being Very Important and doing token paper straw efforts. Deceptive alignment means that there is a very attractive easy option that makes the immediate crisis go away for a while.

But even setting aside the question of whether we should even expect to see warning signs, and whether deceptive alignment is a thing, I find it plausible that even the response to a warning sign that is as blatantly obvious as possible (an AI system tries to take over the world, fails, kills a bunch of people in the process) just results in front page headlines for a few days, some token statements, a bunch of political squabbling between people using the issue as a proxy fight for the broader "tech good or bad" narrative and a postmortem that results in patching the specific things that went wrong without trying to solve the underlying problem. (If even that; we're still doing gain of function research on coronaviruses!)

I can't speak for anyone else, but for me at least the answers to most of your questions correlate because of my underlying model.  However, it seems like correlation on most these questions (not just a few pairs) is to be expected, as many of them probe a similar underlying model from a few different directions. Just because there are lots of individual arguments about various axes does not immediately imply that any or even most configurations of beliefs on those arguments is coherent. In fact, in a very successfully truth-finding system we should also expect convergence on most axes of argument, with only a few cruxes remaining.

My answers to the questions:

1. I lean towards short timelines, though I have lots of uncertainty.

2. I mostly operate under the assumption of fast takeoffs. I think "slower" takeoffs (months to low years) are plausible but I think even in this world we're unlikely to respond adequately and that iterative approaches still fail.

3: I expect values to be fragile. However, my model of how things go wrong depends on a much weaker version of this claim.

4: I don't expect teaching a value system to an AGI to be hard or relevant to the reasons why things might go wrong. 

5: I expect corrigibility to be hard.

6: Yes (depending on the definition of "simple", they already happen), but I don't expect it to update people sufficiently.

7: >90%, if you condition on alignment not being solved (which also precludes alignment being solved because it's easy). 

8: If you factor out the feasibility of implementation/enforcement, then merely designing a good governance mechanism is relatively easy. However, I'm not sure why you would ever want to do this factorization.

9: I expect implementation/enforcement to be hard.

A small group of researchers raise alarm that this is going on, but society at large doesn't listen to them because everything seems to be going so well.

Arguably this is already the situation with alignment. We have already observed empirical examples of many early alignment problems like reward hacking. One could make an argument that looks something like "well yes but this is just in a toy environment, and it's a big leap to it taking over the world", but it seems unclear when society will start listening. In analogy to the AI goalpost moving problem ("chess was never actually hard!"), in my model it seems entirely plausible that every time we observe some alignment failure it updates a few people but most people remain un-updated. I predict that for a large set of things currently claimed will cause people to take alignment seriously, most of them will either be ignored by most people once they happen, or never happen before catastrophic failure.

We can also see analogous dynamics in i.e climate change, where even given decades of hard numbers and tangible physical phenomena large amounts of people (and importantly, major polluters) still reject its existence, many interventions are undertaken which only serve as lip service (greenwashing), and all of this would be worse if renewables were still economically uncompetitive.

I expect the alignment situation to be strictly worse because a) I expect the most egregious failures to only come shortly before AGI, so once evidence as robust as climate change (i.e literally catching AIs red handed trying and almost succeeding at taking over the world), I estimate we have anywhere between a few years and negative years left b) the space of ineffectual alignment interventions is far larger and harder to distinguish from real solutions to the underlying problem c) in particular, training away failures in ways that don't solve the underlying problems (i.e incentivizing deception) is an extremely attractive option and there does not exist any solution to this technical problem, and just observing the visible problems disappear is insufficient to distinguish whether the underlying problems are solved d) 80% of the tech for solving climate change basically already exists or is within reach, and society basically just has to decide that it cares, and the cost to society is legible. For alignment, we have no idea how to solve the technical problem, or even how that solution will vaguely look. This makes it a harder sell to society, e) the economic value of AGI vastly outweighs the value of fossil fuels, making the vested interest substantially larger, f) especially due to deceptive alignment, I expect actually-aligned systems to be strictly more expensive than unaligned systems; the cost will be more than just a fixed % more money, but also cost in terms of additional difficulty and uncertainty, time to market disadvantage, etc.

Seconded: mine also isn't.

Also, for what it's worth, I also don't think of myself as the kind of person to naturally gravitate towards the apocalypse/"saving the world" trope. From a purely narrative-aesthetic perspective, I much prefer the idea of building novel things, pioneering new frontiers, realizing the potential of humanity, etc, as opposed to trying to prevent disaster, reduce risk, etc. I am quite disappointed at reality for not conforming to my literary preferences.

The meta-values thing gets at the same thing that HRH is getting at. Also, I feel like fundamentally wireheading is a problem of embeddedness, and has a completely different causal story to the problem of reflective processes changing our values to be "zombified", though they feel vaguely similar. The way I would look at this is if you are a non-embedded algorithm running in an embedded world, you are potentially susceptible to wireheading, and only if you are an embedded algorithm then you could possible have a preference that implies wanting zombification, or preferences guided by meta-values that avoid this, etc.

I totally agree that the choice of "power seeking" is very unfortunate because of the same reasons you describe. I don't think optionality is quite it, though. I think "consequentialist" or "goal seeking" might be better (or we could just stick with "instrumental convergence"--it at least has neutral affect).

As for underappreciatedness, I think this is possibly true, though anecdotally at least for me I already strongly believed this and in fact a large part of my generator of why I think alignment is difficult is based on this.

I think I disagree about leveraging this for alignment but I'll read your proposal in more detail before commenting on that further.

I basically agree with the argument here. I think that approaches to alignment that try to avoid instrumental convergence are generally unlikely to succeed for exactly the reason that this removes the usefulness of AGI.[1] I also agree with jacob_cannell that the terminology choice of "power seeking" is unfortunate and misleading in this regard.

I think this is (at least for me) also one of the core generators of why alignment is so hard: AGI is dangerous for exactly the same reason why it is useful; the danger comes from not one specific kind of failure or one specific module in the model or whatever, but rather the fact that the things we want and don't want fall out of the exact same kind of cognition.

[1]: I do think there exists some work here that might be able to weasel out of this by making use of the surprising effectiveness of less-general intelligence plus the fact that capabilities research mostly pushes this kind of work currently, but this kind of thing has to hinge on a lot of specific assumptions, and I wouldn't bet on it.


  1. I don't really have a great answer to this. My guess is that it's related to the fact that the higher penalty runs do a bunch more steps to get to the same KL, and those extra steps do something weird. Also, it's possible that rather than the gold RM scores becoming more like the proxy RMs with more params/data; perhaps the changed frontier is solely due to some kind of additional exploitation in those extra steps, and evaporates when the RMs become sufficiently good.
  2. Mostly some evidence of other hyperparameters not leading to this behaviour, but also observing this behaviour replicated in other environments.

That sounds accurate. In the particular setting of RLHF (that this paper attempts to simulate), I think there are actually three levels of proxy:

  • The train-test gap (our RM (and resulting policies) are less valid out of distribution)
  • The data-RM gap (our RM doesn't capture the data perfectly
  • The intent-data gap (the fact that the data doesn't necessarily accurately reflect the human intent (i.e things that just look good to the human given the sensors they have, as opposed to what they actually want).

Regularization likely helps a lot but I think the main reason why regularization is insufficient as a full solution to Goodhart is that it breaks if the simplest generalization of the training set is bad (or if the data is broken in some way). In particular, there are specific kinds of bad generalizations that are consistently bad and potentially simple. For instance I would think of things like ELK human simulators and deceptive alignment as all fitting into this framework.

(I also want to flag that I think in particular we have a very different ontology from each other, so I expect you will probably disagree with/find the previous claim strange, but I think the crux actually lies somewhere else)

Load More