ryan_greenblatt

I work at Redwood Research.

I feel frustrated that your initial comment (which is now the top reply) implies I either hadn't read the 1700 word grant justification that is at the core of my argument, or was intentionally misrepresenting it to make my point.

I think this comment is extremely important for bystanders to understand the context of the grant, and that context isn't mentioned in your original shortform post.

So, regardless of whether you understand the situation, it's important that other people understand the intention of the grant (and this intention isn't obvious from your original comment). Thus, this comment from Buck is valuable.

I also think that the main interpretation from bystanders of your original shortform would be something like:

  1. OpenPhil made a grant to OpenAI
  2. OpenAI is bad (and this was ex-ante obvious)
  3. Therefore this grant is bad and the people who made this grant are bad.

Fair enough if this wasn't your intention, but I think it will be how bystanders interact with this.

A core advantage of Bayesian methods is the ability to handle out-of-distribution situations more gracefully

I dispute that Bayesian methods will be much better at this in practice.

[

Aside:

In general, most (?) AI safety problems can be cast as an instance of a case where a model behaves as intended on a training distribution

This seems like about 1/2 of the problem from my perspective. (So I almost agree.) Though, you can shove all AI safety problems into this bucket by doing a maneuver like "train your model on the easy cases humans can label, then deploy into the full distribution". But at some point, this is no longer very meaningful. (E.g. you train on solving 5th grade math problems and deploy to the Riemann hypothesis.)

]

Traditional ML has no straightforward way of dealing with such cases, since it only maintains a single hypothesis at any given time.

Is this true? Aren't NNs implicitly ensembles of a vast number of models? Also, does ensembling 5 NNs help? If it doesn't, why does sampling 5 models from the Bayesian posterior help? Or is it that we'd need to approximate sampling 1,000,000 models from the posterior? If we're conservative over a million models, how will we ever do anything?
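As a concrete version of the ensembling question above, here is a minimal sketch (hypothetical toy PyTorch code, not anyone's actual proposal) of using disagreement across a small deep ensemble as an uncertainty/OOD signal, which is the same kind of signal you would get from sampling a handful of models from an approximate posterior:

```python
import torch
import torch.nn as nn

# Toy stand-in for "5 models": in practice each member would be trained on the
# same data from a different random init (or sampled from an approximate posterior).
def make_net() -> nn.Module:
    return nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

ensemble = [make_net() for _ in range(5)]

def predict_with_disagreement(x: torch.Tensor):
    """Return the ensemble-mean prediction and the across-member standard
    deviation, the usual cheap proxy for 'are we out of distribution?'."""
    with torch.no_grad():
        preds = torch.stack([net(x) for net in ensemble])  # shape (5, batch, 1)
    return preds.mean(dim=0), preds.std(dim=0)

# Usage: flag inputs where members disagree much more than on the training set.
x = torch.randn(8, 16)
mean_pred, disagreement = predict_with_disagreement(x)
flagged = disagreement > 0.5  # threshold would be calibrated on held-out data
```

Any extra value from Bayesian posterior sampling would have to show up as a difference between this and the same procedure with posterior samples in place of ensemble members.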

However, Bayesian methods may make it less likely that a model will misgeneralise, or should at least give you a way of detecting when this is the case.

Do they? I'm skeptical of both of these. Maybe it helps a little and rules out some unlikely scenarios, but I'm skeptical overall.

Overall, my view on the Bayesian approach is something like:

  • What prior are we using for Bayesian methods? If it's just the NN prior, then I'm not really sold that we do much better than just training a NN (or an ensemble of NNs). If our prior is importantly different in a way we think will help, why can't we train a NN in the normal way with a regularizer that roughly approximates this prior? (See the sketch after this list.)
  • My main concern is that we can get a smart predictive model which understands OOD cases totally fine, but we still get catastrophic generalization for whatever reason. I don't see why Bayesian methods help here.
    • In the ELK case, our issue is that too much of the prior is human imitation or other problematic generalization. (We can ensemble/sample from the posterior and reject things where our hypotheses don't match, but this will only help so much, and I don't really see a strong case for Bayes helping more than ensembling.)
    • In the case of a treacherous turn, it seems like the core concern was that all of our models are schemers and will work together. If this isn't the case (e.g. if ensembling gets 25% non-schemers), then we have other great options. I again don't see how Bayes ensures you have some non-schemers while ensembling doesn't. (It could in principle, but why? Training your models on way more dog fanfiction could also make them less likely to be schemers; we need some reason to think this isn't just noise.)
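A minimal sketch of the "regularize toward the prior" point in the first bullet above, assuming (a big assumption) the prior over parameters is something we can actually write down, e.g. a Gaussian; this is hypothetical toy code, not anyone's actual proposal:

```python
import torch

def map_loss(model: torch.nn.Module, data_loss: torch.Tensor,
             prior_std: float = 1.0) -> torch.Tensor:
    """MAP objective: data loss plus the negative log density of an assumed
    Gaussian prior over the parameters. With a unit Gaussian this is just
    weight decay, which is the sense in which ordinary regularized training
    already crudely approximates training against that prior."""
    neg_log_prior = sum((p ** 2).sum() for p in model.parameters()) / (2 * prior_std ** 2)
    return data_loss + neg_log_prior
```

Any benefit from a genuinely Bayesian treatment would then have to come from a prior (or posterior) that this kind of regularization can't express.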

I also don't agree with the characterisation that "almost all the interesting work is in the step where we need to know whether a hypothesis implies harm" (if I understand you correctly). Of course, creating a formal definition or model of "harm" is difficult, and creating a world model is difficult, but once this has been done, it may not be very hard to detect if a given action would result in harm.

My claim here is that all the interesting work is in ensuring that we know whether a hypothesis "thinks" that harm will result. It would be fine to put this work into constructing an interpretable hypothesis such that we can know whether it implies harm, or into constructing a formal model of harm and ensuring we have access to all the important latent variables for this formal model, but this work still must be done.

Another way to put this is that all the interesting action happens at the point where you solve the ELK problem. I agree that if:

  1. You have access to all the interesting latent variables for your predictive hypothesis. (And your predictive hypothesis (or hypotheses) is competitive with your AI agent at predicting these latent variables.)
  2. You can formally define harm in terms of these latent variables,

then you're fine. But step (1) is just the ELK problem, and I don't even really think you need to solve step (2) for most plans. (You can just have humans compute step (2) manually for most types of latent variables, though this does have some issues.)
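To make the step (1) / step (2) split concrete, here is a toy sketch (all names and variables hypothetical) of why step (2) looks comparatively easy once step (1) is done; the entire difficulty is hidden in the elicitation function, which is the ELK problem:

```python
from typing import TypedDict

class Latents(TypedDict):
    # Hypothetical latent variables we would want the predictive hypothesis to report.
    humans_physically_harmed: bool
    sensors_tampered_with: bool
    informed_human_approval: float  # rough stand-in, in [0, 1]

def elicit_latents(hypothesis, proposed_action) -> Latents:
    """Step (1): get the hypothesis's own beliefs about these latents.
    This is the ELK problem; no implementation is offered here."""
    raise NotImplementedError

def harm(latents: Latents) -> bool:
    """Step (2): a formal-ish harm predicate over the latents. Comparatively easy,
    and humans could also just evaluate the latents directly instead."""
    return (latents["humans_physically_harmed"]
            or latents["sensors_tampered_with"]
            or latents["informed_human_approval"] < 0.5)
```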

Specifically, the world model does not necessarily have to be built manually

I thought the plan was to build it with either AI labor or human labor so that it will be sufficiently interpretable, not to e.g. build it with SGD. If the plan is to build it with SGD and not to ensure that it is interpretable, then why does it provide any safety guarantee? How can we use the world model to define a harm predicate?

it does not have to be as good at prediction as our AI. The world model only needs to be good at predicting the variables that are important for the safety specification(s), within the range of outputs that the AI system may produce

Won't predicting safety-specific variables contain all of the difficulty of predicting the world? (Because these variables can be mediated by arbitrary intermediate variables.) This sounds to me very similar to "we need to build an interpretable next-token predictor, but the next-token predictor only needs to be as good as the model at predicting the lowercase version of the text on just scientific papers". This is just as hard as building a full-distribution next-token predictor.

[Not very confident, but just saying my current view.]

I'm pretty skeptical about integrated gradients.

As far as why, I don't think we should care about the derivative at the baseline (zero or the mean).

As far as the axioms, I think I get off the train at "Completeness", which doesn't seem like a property we need/want.

I think you just need to accept that there isn't any sensible attribution method that satisfies Completeness.

The same applies to attribution in general (e.g. in decision making).
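For reference, the standard definition at issue (stated here for readers; this is the usual formulation from the integrated gradients paper, not quoted from anywhere above). For input x, baseline x', and model f:

```latex
\mathrm{IG}_i(x) = (x_i - x'_i)\int_0^1
  \frac{\partial f\left(x' + \alpha\,(x - x')\right)}{\partial x_i}\, d\alpha,
\qquad
\text{Completeness: } \sum_i \mathrm{IG}_i(x) = f(x) - f(x').
```

Completeness says the attributions sum exactly to the change in output relative to the baseline, which is the property being disputed as something we actually need.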

Integrated gradients is a computationally efficient attribution method (compared to activation patching / ablations) grounded in a series of axioms.

Maybe I'm confused, but isn't integrated gradients strictly slower than an ablation to a baseline?
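A rough sketch of the cost comparison, using hypothetical toy code (f is assumed to be scalar-valued and differentiable): ablating a single feature to the baseline costs one extra forward pass, while integrated gradients costs `steps` forward-plus-backward passes along the path; IG mainly pays off by producing attributions for all features at once.

```python
import torch

def ablation_attribution(f, x: torch.Tensor, baseline: torch.Tensor, i: int) -> torch.Tensor:
    """Attribution of feature i by ablating it to the baseline value.
    Cost: one extra forward pass per ablated feature (f(x) can be cached)."""
    x_ablated = x.clone()
    x_ablated[i] = baseline[i]
    return f(x) - f(x_ablated)

def integrated_gradients(f, x: torch.Tensor, baseline: torch.Tensor, steps: int = 50) -> torch.Tensor:
    """Attributions for all features via a Riemann-sum approximation of the path integral.
    Cost: `steps` forward+backward passes, independent of the number of features."""
    total_grad = torch.zeros_like(x)
    for k in range(1, steps + 1):
        point = (baseline + (k / steps) * (x - baseline)).detach().requires_grad_(True)
        f(point).backward()
        total_grad += point.grad
    return (x - baseline) * total_grad / steps
```

So for attributing to a single component, a direct ablation is cheaper; the quoted efficiency claim only kicks in when you want attributions over many features simultaneously.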

I've thought a bit about actions to reduce the probability that AI takeover involves violent conflict.

I don't think there are any amazing-looking options. If governments were generally more competent, that would help.

Having some sort of apparatus for negotiating with rogue AIs could also help, but I expect this is politically infeasible and not that leveraged to advocate for on the margin.

I agree with 1 and think that race dynamics make the situation considerably worse when we only have access to prosaic approaches. (Though I don't think this is the biggest issue with these approaches.)

I think I expect a period substantially longer than several months by default due to slower takeoff than this. (More like 2 years than 2 months.)

Insofar as the hope was for governments to step in at some point, I think the best and easiest point for them to step in is actually the point where AIs are already becoming very powerful:

  • Prior to this point, we don't get substantial value from pausing, especially if we're pausing/dismantling all semiconductor R&D globally.
  • Prior to this point, AI won't be concerning enough for governments to take aggressive action.
  • At this point, additional time is extremely useful due to access to powerful AIs.
  • The main counterargument is that at this point more powerful AI will also look very attractive. So, it will seem too expensive to stop.

So, I don't really see very compelling alternatives to push on at the margin as far as "metastrategy" (though I'm not sure I know exactly what you're pointing at here). Pushing for bigger asks seems fine, but probably less leveraged.

I actually don't think control is a great meme for the interests of labs that purely optimize for power, as it is a relatively legible ask which is potentially considerably more expensive than just "our model looks aligned because we red-teamed it", which is more like the default IMO.

In the same way, "secure these model weights from China" isn't a great meme for these interests IMO.

I think literal extinction is unlikely even conditional on misaligned AI takeover due to:

  • The potential for the AI to be at least a tiny bit "kind" (in the same way that humans probably wouldn't kill all aliens). [1]
  • Decision theory/trade reasons

This is discussed in more detail here and here.

Insofar as humans and/or aliens care about nature, similar arguments apply there too, though this is mostly beside the point: if humans survive and have (even a tiny bit of) resources, they can easily preserve some nature.

I find it annoying how confident this article is without really bothering to engage with the relevant arguments here.

(Same goes for many other posts asserting that AIs will disassemble humans for their atoms.)

Edit: note that I think AI takeover is probably quite bad and has a high chance of being violent.


  1. This includes the potential for the AI to generally have preferences that are morally valuable from a typical human perspective. ↩︎

The Internet seems to agree with you. I wonder why I remember "edit time addition".

ETA = edit time addition

I should probably not use this term; I think I picked up this habit from some other people on LW.

I interpreted the comment as being more general than this. (As in, if someone does something that works out very badly, they should be forced to resign.)

Upon rereading, the comment reads as less generic than my original interpretation. I'm not sure if I just misread the comment or if it was edited. (It would be nice to see the original version if it was actually edited.)

(Edit: Also, you shouldn't interpret my comment as an endorsement of or agreement with the rest of the content of Ben's comment.)
