ryan_greenblatt

I work at Redwood Research.


This argument isn't control-specific; it applies to any safety scheme with some operating tax or implementation difficulty.[1][2]

I agree that if there is insufficient security, paying any sizable safety tax (including slowing down) will be difficult. It would either require substantial coordination or much more aggressive USG action than would otherwise be needed.

Thus, I agree that security is a huge factor in how AI goes, plausibly more important than technical safety work done in advance, and I agree that the technical AI safety community probably underestimates the importance of security. (Security is probably notably less tractable than technical AI safety measures for a variety of reasons, but perhaps of greater importance than doing technical AI safety in advance.)

That said, I can certainly imagine worlds where AI control saves us despite the model being stolen by multiple adversaries. My sense is that the situation is much worse if a powerful AI is stolen, especially if it is stolen by multiple actors, but it isn't obviously fatal.


Therefore, AI control will only be useful in the coming years under one (or more) of these conditions

I would note that we don't just aim to prevent the outcome of the model escaping. Indeed, even if a (misaligned) model were literally made publicly available (on a torrent or whatever), we might still care about safely extracting work out of instances of the model we control. We would no longer care about preventing escape, but we would still care about getting good non-sabotaged work (on e.g. security or technical AI safety).

We also aren't just worried about direct sabotage; we want to prevent the AI from setting itself up in a position where it has much more power over the AI company. E.g., we really want to prevent it from being able to run itself without monitoring while having full permissions to do things within the AI company.

Even if the model has literally already escaped into the outside world, I think control is useful. (Though this is obviously a dismal situation to be in.)

As an example scenario:

  • China steals the model and is using it heavily.
  • We still want to be able to use our model without it causing huge problems, though escaping to the outside world isn't as big of a problem (it still might be better to reduce further proliferation).
  • We coordinate with China to slow down and to take sufficient precautions that the model doesn't escape from either of us. This allows for paying the control tax, at least for a while. We don't necessarily need China to do a good job with control other than avoiding their model escaping. Just delaying the point at which it escapes is also pretty helpful.
  • During the time we have, we use huge amounts of AI labor to figure out a cheaper and more robust alignment strategy or are able to build the case for a stronger level of coordination.

Edit: One further note is that if the tax from control is sufficiently small and takeoff is sufficiently slow by default, then control can still let us get the key work done in time (prior to adversaries being able to train much more powerful models such that we can't be competitive anymore).


  1. You also bring up a risk compensation argument: maybe control is bad because it encourages the development of powerful models via making some of the risks lower. This argument also applies to other safety strategies. ↩︎

  2. Zach also noted this, but I thought it was worth emphasizing. ↩︎

(Oops, slow reply)

If SLT were to say nontrivial things about what instruction fine-tuning and RLHF are doing to models, and those things were verified in experiments, would that shift your skepticism?

If SLT resulted in interesting predictions in some case, or was generally able to notably improve our ability to predict downstream behavior (including just within a subset of cases), that would be a large update for me. Seems like a reasonable goal to shoot for.

Could you explain why you are skeptical that current baseline methods can be dramatically improved?

Note the "absent substantial advances in science"! My perspective was[1] that it is pretty likely that for white box methods (methods that go beyond just training and black box behavior), the most simple and baseline approaches are reasonably likely to be pretty close to as good as you can get for reducing risk if you are just doing basic empirical monkeying. At least for risks from close-to-worst-case-scheming models, I think other threat models might be more amenable to complex white-box methods as they are easier to directly study.

(Threat models other than close-to-worst-case-scheming have the property that if they exist, you should be able to directly measure them and potentially you can measure how well the method worked on the model itself in a very analogous test bed; see discussion here for details. More minimally, it is easier to construct many empirical test beds if you aren't worried that the model is directly trying to undermine your ability to study it.)

Why think this? The core reason is that it is hard to build good empirical test beds for techniques that use the internals of models in non-trivial ways. We don't have the internals of a scheming model in front of us. We can certainly study behavioral interventions (black-box control), and we can try to make test beds which are as analogous as possible. But it seems hard to make them that analogous, for a variety of reasons. You could try to demonstrate properties of models which are robust enough that you can predict some property of scheming models well enough to take advantage of it. But now we're probably talking about fundamental advances in science.

I don't think I've really adequately defended this perspective here, but this is at least roughly why I think this.

Incidentally, I don't really understand what you mean by "captured by black-box control evaluations". Was there a follow-up?

Black-box control evaluations are evaluations where you basically assume your model is worst-case misaligned, you make conservative assumptions about inductive biases (e.g., you get nothing other than what the model is forced to do), and you try to see if the model is capable enough to cause problems even given your countermeasures.

Basically, in any place where you can't establish a property with a capability evaluation, you make conservative/pessimal assumptions.

(I usually imagine some wiggle room, e.g., we need to study sample efficiency and exploration hacking so that we can argue about the effects of training and elicitation. We'd also want to answer some questions which are a bit tricky to answer with capability evaluations, but this is at least conceivable, e.g., how good is the AI at coming up with good strategies for subverting safeguards within a forward pass.)
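To make the shape of this concrete, here is a toy sketch of how such an evaluation bottoms out in a safety number. This is not the actual methodology; the protocol (trusted monitoring with auditing), the parameter values, and the function name are all illustrative, and in a real evaluation the red team would optimize the attack policy while the catch probability would come from capability measurements rather than assumption.

```python
import random

def estimate_safety(
    n_scenarios: int = 10_000,  # simulated deployments
    n_tasks: int = 1_000,       # tasks per deployment
    attack_prob: float = 0.01,  # red-team policy: chance of attempting sabotage on a given task
    catch_prob: float = 0.8,    # chance an attempt is flagged and audited before it succeeds
) -> float:
    """Fraction of deployments in which no sabotage attempt succeeds.

    Conservative setup: the model attacks exactly when the red-team policy
    says to, and a deployment resolves at the first attempt (caught means
    shutdown, so safe; uncaught means the model wins, so unsafe).
    """
    safe = 0
    for _ in range(n_scenarios):
        outcome = "safe"
        for _ in range(n_tasks):
            if random.random() < attack_prob:
                outcome = "safe" if random.random() < catch_prob else "unsafe"
                break
        if outcome == "safe":
            safe += 1
    return safe / n_scenarios

if __name__ == "__main__":
    print(f"estimated safety: {estimate_safety():.3f}")
```

The point of the conservative framing is that everything favorable to safety (here, catch_prob) has to be backed by a capability-style measurement, while everything about the model's goals and choices is handed to the red team.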


  1. I've updated somewhat from this position, partially based on latent adversarial training and also just after thinking about it more. ↩︎

Furthermore, I don't think independent AI safety funding is that important anymore; models are smart enough now that most of the work to do in AI safety is directly working with them, most of that is happening at labs,

It might be the case that most of the quality-weighted safety research involving working with large models is happening at labs, but I'm pretty skeptical that having this mostly happen at labs is the best approach, and it seems like OpenPhil should be actively interested in building up a robust safety research ecosystem outside of labs.

(Better model access seems substantially overrated in its importance and large fractions of research can and should happen with just prompting or on smaller models. Additionally, at the moment, open weight models are pretty close to the best models.)

(This argument is also locally invalid at a more basic level. Just because this research seems to be mostly happening at large AI companies (a claim I'm also somewhat skeptical of) doesn't imply that this is the way it should be, and funding should try to push people to do better stuff rather than merely reacting to the current allocation.)

Here is another more narrow way to put this argument:

  • Let's say Nate is 35 (arbitrary guess).
  • Let's say that branches which deviated 35 years ago would pay for our branch (and other branches in our reference class). The case for this is that many people are over 50 (thus existing in both branches), and care about deviated versions of themselves and their children etc. Probably the discount relative to zero deviation is less than 10x.
  • Let's say that Nate thinks that if he didn't ever exist, P(takeover) would go up by 1 / 10 billion (roughly 2^-32). If it was wildly lower than this, that would be somewhat surprising and might suggest different actions.
  • Nate existing is sensitive to a bit of quantum randomness 35 years ago, so other people as good as Nate could be created with a bit of quantum randomness. So, 1 bit of randomness can reduce risk by at least 1 / 10 billion.
  • Thus, 75 bits of randomness presumably reduces risk by > 1 / 10 billion, which is >> 2^-75.

(This argument is a bit messy because presumably some logical facts imply that Nate will be very helpful and some imply that he won't be, and I was taking an expectation over this, while we really care about the effect on all the quantum branches. I'm not sure how to make the argument exactly right, but at least I think it is roughly right.)

What about the case where we only go back 10 years? We can apply the same argument, but instead just use some number of bits (e.g. 10) to make Nate work a bit more, say 1 week of additional work, via changing whether Nate ends up getting sick (by adjusting the weather or which children are born, or whatever). This should also reduce doom by 1 week / (52 weeks/year × 20 years of work) × 1 / 10 billion ≈ 1 / 10 trillion.
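Spelling that arithmetic out (a quick check, using the same made-up numbers):

$$\frac{1\ \text{week}}{52\ \text{weeks/year} \times 20\ \text{years}} \times \frac{1}{10^{10}} \approx \frac{10^{-10}}{1040} \approx 10^{-13} = \frac{1}{10\ \text{trillion}}.$$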

And surely there are more efficient schemes.

To be clear, only having ~1 in 10 billion branches survive is rough from a trade perspective.

Thanks, this seems like a reasonable summary of the proposal and a reasonable place to wrap.

I agree that kindness is more likely to buy human survival than something better described as trade/insurance schemes, though I think the insurance schemes are reasonably likely to matter.

(That is, reasonably likely to matter if the kindness funds aren't large enough to mostly saturate the returns of this scheme. As a wild guess, maybe 35% likely to matter on my views on doom and 20% on yours.)

At a more basic level, I think the situation is just actually much more confusing than human extinction in a bunch of ways.

(Separately, under my views, misaligned AI takeover seems worse than human extinction due to (e.g.) biorisk. This is because primates or other closely related species seem very likely to re-evolve into an intelligent civilization, and I feel better about this civilization than about AIs.)

And I think this is also true by the vast majority of common-sense ethical views. People care about the future of humanity. "Saving the world" is hugely more important than preventing the marginal atrocity. Outside of EA I have never actually met a welfarist who only cares about present humans. People of course think we are supposed to be good stewards of humanity's future, especially if you select on the people who are actually involved in global scale decisions.

Hmmm, I agree with this as stated, but it's not clear to me that this is scope sensitive. As in, suppose that the AI will eventually leave humans in control of earth and the solar system. Do people typically think this is extremely bad? I don't think so, though I'm not sure.

And, I think trading for humans to eventually control the solar system is pretty doable. (Most of the trade cost is in preventing the earlier slaughter and violence that would have been useful for takeover or for avoiding delay.)

(I mostly care about the long-term future and scope-sensitive resource use, like habryka, TBC.)

Sure, we can amend to:

"I believe that AI takeover would eliminate humanity's control over its future, has a high probability of killing billions, and should be strongly avoided."


We could also say something like "AI takeover seems similar to takeover by hostile aliens with potentially unrecognizable values. It would eliminate humanity's control over its future and has a high probability of killing billions."

I agree that "not dying in a base universe" is a more reasonable thing to care about than "proving people right that takeoff is slow" but I feel like both lines of argument that you bring up here are doing something where you take a perspective on the world that is very computationalist, unituitive and therefore takes you to extremely weird places, makes strong assumptions about what a post-singularity humanity will care about, and then uses that to try to defeat an argument in a weird and twisted way that maybe is technically correct, but I think unless you are really careful with every step, really does not actually communicate what is going on.

I agree that common sense morality and common sense views are quite confused about the relevant situation. Indexical selfish perspectives are also pretty confused and are perhaps even more incoherent.

However, I think that under the most straightforward generalization of common sense views or selfishness where you just care about the base universe and there is just one base universe, this scheme can work to save lives in the base universe[1].

I legitimately think that common sense moral views should care less about AI takeover due to these arguments. As in, there is a reasonable chance that a bunch of people aren't killed due to these arguments (and other different arguments) in the most straightforward sense.

I also think "the AI might leave you alone, but we don't really know and there seems at least a high chance that huge numbers of people, including you, die" is not a bad summary of the situation.

In some sense they are an argument against any specific human-scale bad thing to happen, because if we do win, we could spend a substantial fraction of our resources with future AI systems to prevent that.

Yes. I think any human-scale bad thing (except stuff needed for the AI to most easily take over and solidify control) can be paid for and this has some chance of working. (Tiny amounts of kindness works in a similar way.)


Humanity will not go extinct, because we are in a simulation.

FWIW, I think it is non-obvious how common sense views interpret these considerations. I think it is probably common to just care about base reality? (Which is basically equivalent to having a measure, etc.) I do think that common sense moral views don't consider it good to run these simulations for this purpose, while bailing out aliens who would have bailed us out is totally normal/reasonable under such views.


It is obviously extremely fucking bad for AI to disempower humanity. I think "literally everyone you know dies" is a much more accurate capture of that, and also a much more valid conclusion from conservative premises

Why not just say what's more straightforwardly true:

"I believe that AI takeover has a high probability of killing billions and should be strongly avoided, and would be a serious and irreversible decision by our society that's likely to be a mistake even if it doesn't lead to billions of deaths."

I don't think "literally everyone you know dies if AI takes over" is accurate because I don't expect that in the base reality version of this universe for multiple reasons. Like it might happen, but I don't know if it is more than 50% likely.


  1. It's not crazy to call the resulting scheme "multiverse/simulation shenanigans" TBC (as it involves prediction/simulation and uncertainty over the base universe), but I think this is just because I expect that multiverse/simulation shenanigans will alter the way AIs in base reality act in the common sense straightforward way. ↩︎

I probably won't respond further than this. Some responses to your comment:


I agree with your statements about the nature of UDT/FDT. I often talk about "things you would have committed to" because it is simpler to reason about and easier for people to understand (and I care about third parties understanding this), but I agree this is not the true abstraction.


It seems like you're imagining that we have to bamboozle some civilizations which seem clearly more competent than humanity by your lights. I don't think this is true.

Imagine we take all the civilizations which are roughly equally-competent-seeming-to-you and these civilizations make such an insurance deal[1]. My understanding is that your view is something like P(takeover) = 85%. So, let's say all of these civilizations are in a similar spot from your current epistemic perspective. While I expect that you think takeover is highly correlated between these worlds[2], my guess is that you should think it would be very unlikely that >99.9% of all of these civilizations get taken over. As in, even in the worst 10% of worlds where takeover happens in our world and the logical facts on alignment are quite bad, >0.1% of the corresponding civilizations are still in control of their universe. Do you disagree here? >0.1% of universes should be easily enough to bail out all the rest of the worlds[3].
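As a rough back-of-the-envelope for why >0.1% seems like plenty (illustrative numbers, not a worked-out deal): if a fraction $p$ of these civilizations stay in control, each surviving one has to cover about $(1-p)/p$ doomed worlds, so the commitment per surviving world is roughly

$$\frac{1-p}{p}\, c \;\le\; \frac{0.999}{0.001}\, c \approx 10^{3} c,$$

where $c$ is the (tiny) fraction of a universe's resources needed to keep the humans of one doomed world alive and reasonably well off. As long as $c$ is small, even the pessimistic $p = 0.1\%$ leaves the deal cheap for the survivors.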

And, if you really, really cared about not getting killed in base reality (including on reflection etc) you'd want to take a deal which is at least this good. There might be better approaches which reduce the correlation between worlds and thus make the fraction of available resources higher, but you'd like something at least this good.

(To be clear, I don't think this means we'd be fine; there are many ways this can go wrong! And I think it would be crazy for humanity to rely on this. I just think this sort of thing has a good chance of succeeding.)

(Also, my view is something like P(takeover) = 35% in our universe and in the worst 10% of worlds 30% of the universes in a similar epistemic state avoided takeover. But I didn't think about this very carefully.)

And further, we don't need to figure out the details of the deal now for the deal to work. We just need to make good choices about this in the counterfactuals where we were able to avoid takeover.

Another way to put this is that you seem to be assuming that there is no way our civilization would end up being the competent civilization doing the payout (and thus to survive some bamboozling must occur). But your view is that it is totally plausible (e.g. 15%) from your current epistemic state that we avoid takeover and thus a deal should be possible! While we might bring in a bunch of doomed branches, ex-ante we have a good chance of paying out.


I get the sense that you're approaching this from the perspective of "does this exact proposal have issues" rather than "in the future, if our enlightened selves really wanted to avoid dying in base reality, would there be an approach which greatly (acausally) reduces the chance of this". (And yes, I agree this is a kind of crazy and incoherent thing to care about, as you can just create more happy simulated lives with those galaxies.)

There just needs to exist one such insurance/trade scheme which can be found and it seems like there should be a trade with huge gains to the extent that people really care a lot about not dying. Not dying is very cheap.


  1. Yes, it is unnatural and arbitrary to coordinate on Nate's personal intuitive sense of competence. But for the sake of argument ↩︎

  2. I edited in the start of this sentence to improve clarity. ↩︎

  3. Assuming there isn't a huge correlation between measure of universe and takeover probability. ↩︎
