Mark Xu

I do alignment research at the Alignment Research Center. Learn more about me at markxu.com/about

Sequences

Intermittent Distllations
Training Regime

Wiki Contributions

Comments

Sorted by
Mark Xu20

Yes, I agree. If I had more time, this would have been a top-level post. If anyone reading wants to write such a post using my quick take as a base, I would be happy to take a look and offer comments. I might do it myself at some point as well.

Mark XuΩ6012725

Alignment researchers should think hard about switching to working on AI Control

I think Redwood Research’s recent work on AI control really “hits it out of the park”, and they have identified a tractable and neglected intervention that can make AI go a lot better. Obviously we should shift labor until the marginal unit of research in either area decreases P(doom) by the same amount. I think that implies lots of alignment researchers should shift to AI control type work, and would naively guess that the equilibrium is close to 50/50 across people who are reading this post. That means if you’re working on alignment and reading this, I think there’s probably a ~45% chance it would be better for your values if you instead were working on AI control!

For this post, my definitions are roughly:

  • AI alignment is the task of ensuring the AIs “do what you want them to do”
  • AI control is the task of ensuring that if the AIs are not aligned (e.g. don’t always “do what you want” and potentially want to mess with you), then you are still OK and can use them for economically productive tasks (an important one of which is doing more alignment/control research.)

Here are some thoughts, arguments, and analogies (epistemic status: there is no “hidden content”, if you don’t find the literal words I wrote persuasive you shouldn’t update. In particular, just update on the words and don't update about what my words imply about my beliefs.):

  • Everything is in degrees. We can “partially align” some AIs, and things will be better if we can use those AIs for productive tasks, like helping with alignment research. The thing that actually matters is “how aligned are the AIs” + “how aligned to they need to be to use them for stuff”, so we should also focus on the 2nd thing.
  • If you were a hedge fund, and your strategy for preventing people from stealing your data was and starting new hedge fund was “we will make the hedge fund a super fun place to work and interview people carefully to make sure they really care about working for us”, and you also did not do things like “maybe people shouldn’t be able to download all the trading strategies onto a USB stick which we let them take home”, “maybe people should be forced to keep their work computers in the office” and “maybe we should have some people checking if other employees are trying to steal data and make some plans about how to stop them from starting a rival hedge fund if they do”, then all the other hedge funds would think you were an idiot. You should obviously try both to check if people want to defect, and also make it harder for them to defect.
  • I think a lot of the neglect of AI control can be traced to the “pivotal act” based framings of how to make aligned AI go well. I think these views are mistaken, but I won’t provide a great argument here (quick analogy: if the sum of a bunch of numbers is large, there doesn’t need to be any individual number that is large; similarly, if the consequences of a sequence of actions results in a large change, no individual action needs to be “pivotal”). However, I don’t think things are shaping up for single pivotal acts to be very important, and being able to productively use early AI labor to help with alignment seems like it is going to be an important part of the future going well.
  • Working on AI control is really tractable compared to working on alignment. There is a lot of prior art in people thinking about how to e.g. prevent data from being stolen by insider threats, secure various information against adversaries, generally ensure that national security secrets don’t get leaked, etc. We can set up various plausibly looking scenarios and try to “game them out” under certain plausibly seeming assumptions about how much alignment we’ll be able to achieve. We can make assumptions about AI capabilities and alignment without really getting into the nitty gritty of philosophically confusing topics like “what is deception really?” and “how likely is deception?” and “when will this all go down?”
    • Note that this means control might be easier for the AIs to automate, and easier to do “during crunch time.”
    • It also means that AI control is a bit contiguous with work people already do to protect from adversaries. I think the AI thing qualitatively changes the picture though, so it’s still worth thinking about.
  • If you succeed at AI control, there is a more compelling story you can tell for why you should implement the specific scheme that you thought of. In comparison, a lot of alignment proposals depend on philosophically dubious and confusing assumptions about the nature of cognition/SGD/etc. That is to say, a successful proposal in AI control comes with an argument that might be convincing to established security professionals if they thought about it, but there is likely no such class of professionals capable of lending your proposal external legitimacy in AI alignment.
  • The intelligence explosion might happen with less-fully-AGI AIs, who will also be doing some alignment work on the side. It’s important for them to not escape and do other bad stuff until they’ve solve alignment. We can give ourselves more time to use smart AIs to help with alignment if we have better AI control.
  • There is a bunch of research that is useful for both alignment and control, e.g. model organisms of deception, interpretability techniques, oversight techniques, etc. More people should analyze such research from the perspective of “how can this be helpful even if it fails to produce an aligned AI?”
Mark XuΩ560

Yes I agree with what you have written, and do think it’s overall not that likely that everything pans out as hoped. We do also have other hopes for how this general picture can still cohere if the specific path doesn’t work out, eg we’re open to learning some stuff empirically and adding an “algorithmic cherry on top” to produce the estimate.

Mark XuΩ680

The literature review is very strange to me. Where is the section on certified robustness against epsilon-ball adversarial examples? The techniques used in that literature (e.g. interval propagation) are nearly identical to what you discuss here.

I was meaning to include such a section, but forgot :). Perhaps I will edit it in. I think such work is qualitatively similar to what we're trying to do, but that the key difference is that we're interested in "best guess" estimates, as opposed to formally verified-to-be-correct estimates (mostly because we don't think formally verified estimates are tractable to produce in general).

Relatedly, what's the source of hope for these kinds of methods outperforming adversarial training? My sense from the certified defenses literature is that the estimates they produce are very weak, because of the problems with failing to model all the information in activations. (Note I'm not sure how weak the estimates actually are, since they usually report fraction of inputs which could be certified robust, rather than an estimate of the probability that a sampled input will cause a misclassification, which would be more analogous to your setting.)

The main hope comes from the fact that we're using a "best guess" estimate, instead of trying to certify that the model won't produce catastrophic actions. For example, Method 1 can be thought of as running a single example with a Gaussian blob around it through the model, but also tracking the "1st order" contributions that come from the Gaussian blob. If we wanted to bound the potential contributions from the Gaussian blob, our estimates would get really broad really fast, as you tend to see with interval propagation.

Although, this also comes with the opposite issue of how to know if the estimates are at all reasonble, especially when you train against them.

If your catastrophe detector involves a weak model running many many inferences, then it seems like the total number of layers is vastly larger than the number of layers in M, which seems like it will exacerbate the problems above by a lot. Any ideas for dealing with this?

I think fundamentally we just need our estimates to "not get that much worse" as things get deeper/more complicated. The main hope for why we can achieve this is that the underlying model itself will not get worse as it gets deeper/the chain of thought gets longer. This implies that there is some sort of stabalization going on, so we will need to capture the effect of this stabalization. It does seem like in order to do this, we will have to model only high level properties of this distribution, instead of trying to model things on the level of activations.

In other words, one issue with interval propagation is that it makes an assumption that can only become less true as you propagate through the model. After a few layers, you're (perhaps only implicitly) putting high probability on activations that the model will never produce. But as long as your "activation model" is behaving reasonably, then hopefully it will only become more uncertain insofar as the underlying reasoning done by the model becomes more uncertain.

What's your proposal for the distribution P0 for Method 2 (independent linear features)?

You can either train an SAE on the input distribution, or just try to select the input distribution to maximize the probability of catastrophe produced by the estimation method (perhaps starting with an SAE of the input distribution, or a random one). Probably this wouldn't work that well in practice.

Why think this is a cost you can pay? Even if we ignore the existence of C and just focus on M, and we just require modeling the correlations between any pair of layers (which of course can be broken by higher-order correlations), that is still quadratic in the number of parameters of M and so has a cost similar to training M in the first place. In practice I would assume it is a much higher cost (not least because C is so much larger than M).

Our ultimate goal is vaguely to "only pay costs that SGD had to pay to produce M" Slightly more specifically, M has a bunch of correlations between its layers. Some of these correlations were actively selected to be those particular values by SGD, and other correlations were kind of random. We want to track the ones that were selected, and just assume the other ones are random. Hopefully, since SGD was not actively manipulating those correlations, the underlying model is in some sense invariant to their precise values, and so a model that treats such correlations as random will predict the same underlying behavior as a model that models the precise values of those correlations.

Mark Xu40

I don’t think Paul thinks verification is generally easy or that delegation is fundamentally viable. He, for example, doesn’t suck at hiring because he thinks it’s in fact a hard problem to verify if someone is good at their job.

I liked Rohin's comment elsewhere on this general thread.

I’m happy to answer more specific questions, although would generally feel more comfortable answering questions about my views then about Paul’s.

Mark Xu40

If you're commited to producing a powerful AI then the thing that matters is the probability there exists something you can't find that will kill you. I think our current understanding is sufficiently paltry that the chance of this working is pretty low (the value added by doing selection on non-deceptive behavior is probably very small, but I think there's a decent chance you just won't get that much deception). But you can also get evidence about the propensity for your training process to produce deceptive AIs and stop producing them until you develop better understanding, or alter your training process in other ways. For example, you can use your understanding of the simpler forms of deception your AIs engage in to invest resources in understanding more complicated forms of deception, e.g. by focusing interpretability efforts.

Mark Xu20

For any given system, you have some distribution over which properties will be necessary to verify in order to not die to that system. Some of those you will in fact be able to verify, thereby obtaining evidence about whether that system is dangerous. “Strategic deception” is a large set of features, some of which are possible to verify.

Mark XuΩ340

yes, you would need the catastrophe detector to be reasonably robust. Although I think it's fine if e.g. you have at least 1/million chance of catching any particular catastrophe.

I think there is a gap, but that the gap is probably not that bad (for "worst case" tail risk estimation). That is maybe because I think being able to do estimation through a single forward pass is likely already to be very hard, and to require being able to do "abstractions" over the concepts being manipulated by the forward pass. CoT seems like it will require vaguely similar struction of a qualitatively similar kind.

Mark Xu20

I think there are some easy-to-verify properties that would make us more likely to die if they were hard-to-verify. And therefore think "verification is easier than generation" is an important part of the overall landscape of AI risk.

Load More