Joar Skalse

My name is pronounced "YOO-ar SKULL-se".

I'm a DPhil Scholar at the Future of Humanity Institute in Oxford.

Wiki Contributions

Comments

What I had in mind is a situation where we have access to the latent variables during training, and only use the model to prove safety properties in situations that are within the range of the training distribution in some sense (eg, situations where we have some learning-theoretic guarantees). As for treacherous turns, I am implicitly assuming that we don't have to worry about a treacherous turn from the world model, but that we may have to worry about it from the AI policy that we're verifying.

However, note that even this is meaningfully different from just using RLHF, especially in settings with some adversarial component. In particular, a situation that is OOD for the policy need not be OOD for the world model. For example, learning a model of the rules of chess is much easier than learning a policy that is good at playing chess. It would also be much easier to prove a learning-theoretic guarantee for the former than the latter.

So, suppose we're training a chess-playing AI, and we want to be sure that it cannot be defeated in n moves or less. The RLHF strategy would, in this scenario, essentially amount to letting a group of red-teamers play against the AI many times, trying to find cases where they manage to beat the AI in n moves or less, and then training against those cases. This would only give us very weak quantitative guarantees, because there might be strategies that the red-teamers didn't think of.

Alternatively, we could also train a network to model the rules of chess (in this particular example, we could of course also specify this model manually, but let's ignore that for the sake of the argument). It seems fairly likely that we could train this model to be highly accurate. Moreover, using normal statistical methods, we could estimate a bound on the fraction of the state-action space on which this model makes an incorrect prediction, and derive other learning-theoretic guarantees (depending on how the training data is collected, etc). We could then formally verify that the chess AI cannot be beaten in n moves, relative to this world model. This would produce a much stronger quantitative guarantee, and the assumptions behind this guarantee would be much easier to audit. The guarantee would of course still not be an absolute proof, because there will likely be some errors in the world model, and the chess AI might be targeting those errors, etc, but it is substantially better than what you get in the RLHF case. Also note that, as we run the chess AI, we could track the predictions of the world model online. If the world model ever makes a prediction that contradicts what in fact happens, we could shut down the chess AI, or transition to a safe mode. This gives us even stronger guarantees.

This is of course a toy example, because it is a case where we can design a perfect world model manually (excluding the opponent, of course). We can also design a chess-playing AI manually, etc. However, I think it illustrates that there is a meaningful difference between RLHF and formal verification relative to even a black-box world model. The complexity of the space of all policies grows much faster than the complexity of the environment, and this fact can be exploited.
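To make the two ingredients above concrete (the statistical error bound on the world model, and the online consistency check), here is a minimal sketch. It assumes hypothetical `policy`, `world_model_predict`, `env_step`, and `safe_mode` functions, and is only an illustration of the shape of the argument, not part of any existing system.

```python
import numpy as np

def empirical_error_bound(errors: np.ndarray, delta: float = 1e-6) -> float:
    """Upper-bound the world model's true error rate with probability >= 1 - delta,
    assuming the held-out (state, action) samples are drawn i.i.d. from the
    relevant distribution (Hoeffding's inequality)."""
    n = len(errors)  # errors[i] = 1 if the model mispredicted sample i, else 0
    return errors.mean() + np.sqrt(np.log(1.0 / delta) / (2.0 * n))

def run_with_monitor(policy, world_model_predict, env_step, state, safe_mode,
                     max_steps=1000):
    """Run the policy, but fall back to a safe mode as soon as the world
    model's prediction disagrees with what actually happens.
    States are assumed to be arrays (all functions here are hypothetical)."""
    for _ in range(max_steps):
        action = policy(state)
        predicted_next = world_model_predict(state, action)
        actual_next = env_step(state, action)
        if not np.array_equal(predicted_next, actual_next):
            # The world model was wrong: stop trusting the verified guarantee.
            return safe_mode(actual_next)
        state = actual_next
    return state
```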

In practice, I think you are unlikely to end up with a schemer unless you train your model to solve some agentic task (or train it to model a system that may itself be a schemer, such as a human). However, in order to guarantee that, I agree we need some additional property (such as interpretability, or some learning-theoretic guarantee).

I'm not so convinced of this. Yes, for some complex safety properties, the world model will probably have to be very smart. However, this does not mean that you have to model everything -- depending on your safety specification and use case, you may be able to factor out a huge amount of complexity. We know from existing cases that this is true on a small scale -- why should it not also be true on a medium or large scale?

For example, with a detailed model of the human body, you may be able to prove whether or not a given chemical could be harmful to ingest. This cannot be done with current tools, because we don't have a detailed computational model of the human body (and even if we did, we would not be able to use it for scalable inference). However, this seems like the kind of thing that could plausibly be created in the not-so-long term using AI tools. And if we had such a model, we could prove many interesting safety properties for e.g. pharmaceutical development AIs (even if these AIs know many things that are not covered by this world model).

 

Suppose you had a world model which was as smart as GPT-3 (but magically interpretable). Do you think this would be useful for something?

I think that would be extremely useful, because it would tell us many things about how to implement cognitive algorithms. But I don't think it would be very useful for proving safety properties (which I assume was your actual question). GPT-3's capabilities are wide but shallow, whereas in most cases what we would need are capabilities that are narrow but deep.

I think the distinction between these two cases can often be somewhat vague.

Why do you think that the adversarial case is very different?

I think you're perhaps reading me as being more bullish on Bayesian methods than I in fact am -- I am not necessarily saying that Bayesian methods in fact can solve OOD generalisation in practice, nor am I saying that other methods could not also do this. In fact, I was until recently very skeptical of Bayesian methods, before talking about it with Yoshua Bengio. Rather, my original reply was meant to explain why the Bayesian aspect of Bengio's research agenda is a core part of its motivation, in response to your remark that "from my understanding, the bayesian aspect of [Bengio's] agenda doesn't add much value".

I agree that if a Bayesian learner uses the NN prior, then its behaviour should -- in the limit -- be very similar to training a large ensemble of NNs. However, there could still be advantages to an explicitly Bayesian method. For example, off the top of my head:

  1. It may be that you need an extremely large ensemble to approximate the posterior well, and that the Bayesian learner can approximate it much better with far fewer resources (see the toy sketch below this list).
  2. It may be that you can more easily prove learning-theoretic guarantees for the Bayesian learner.
  3. It may be that a Bayesian learner makes it easier to condition on events that have a very small probability in your posterior (such as, for example, the event that a particular complex plan is executed).
  4. It may be that the Bayesian learner has a more interpretable prior, or that you can reprogram it more easily.

And so on; these are just some examples. Of course, whether you get these benefits in practice is a matter of speculation until we have a concrete algorithm to analyse. All I'm saying is that there are valid and well-motivated reasons to explore this particular direction.
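To make point 1 concrete, here is a toy case where the exact posterior is available in closed form (Bayesian linear regression), so the exact posterior predictive variance at an out-of-distribution input can be compared against the spread of a small bootstrap ensemble. This is only an illustration of the gap between the two estimators in a simple setting; it says nothing direct about deep networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1D inputs in a narrow range, linear targets with noise.
n, sigma, alpha = 20, 0.1, 1.0
x_train = rng.uniform(-0.5, 0.5, size=n)
X = np.stack([np.ones(n), x_train], axis=1)      # bias + slope features
y = 2.0 * x_train + sigma * rng.normal(size=n)

# Exact Bayesian linear regression posterior, with prior w ~ N(0, alpha^-1 I).
S = np.linalg.inv(alpha * np.eye(2) + (X.T @ X) / sigma**2)

x_ood = np.array([1.0, 5.0])                     # input far outside the training range
bayes_var = sigma**2 + x_ood @ S @ x_ood         # exact posterior predictive variance

# A small bootstrap ensemble of least-squares fits, as a crude approximation.
preds = []
for _ in range(5):
    idx = rng.integers(0, n, size=n)
    w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    preds.append(x_ood @ w)
ensemble_var = np.var(preds)

print(f"exact Bayesian predictive variance: {bayes_var:.3f}")
print(f"5-member ensemble variance:         {ensemble_var:.3f}")
```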

But at some point, this is no longer very meaningful. (E.g. you train on solving 5th grade math problems and deploy to the Riemann hypothesis.) 

It sounds to me like we agree here; I don't want to put too much weight on "most".

 

Is this true?

It is true in the sense that you don't have any theoretical guarantees, and in the sense that it also often fails to work in practice.

 

Aren't NN implicitly ensembles of vast number of models?

They probably are, to some extent. However, in practice, you often find that neural networks make very confident (and wrong) predictions for out-of-distribution inputs, in a way that seems to be caused by them relying on some spurious correlation. For example, you train a network to distinguish different types of tanks, but it learns to distinguish night from day. You train an agent to collect coins, but it learns to go to the right. You train a network to detect criminality, but it learns to detect smiles. Adversarial examples could also be cast as an instance of this phenomenon. In all of these cases, we have a situation where there are multiple patterns in the data that fit a given training objective, but where a neural network ends up giving an unreasonably large weight to some of these patterns at the expense of other plausible patterns. I thus think it's fair to say that -- empirically -- neural networks do not reliably quantify uncertainty when out-of-distribution. It may be that this problem mostly goes away with a sufficiently large amount of sufficiently varied data, but it seems hard to get high confidence in that.
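As a deliberately simple illustration of that failure mode (using logistic regression rather than a neural network, and synthetic features rather than any of the real examples above), one can set up a training distribution in which a spurious feature is perfectly predictive and a causal feature is only weakly predictive; when the spurious correlation is broken at test time, the model stays confident while performing no better than chance:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Training distribution: the "causal" feature predicts the label 75% of the
# time, while the "spurious" feature is perfectly correlated with the label.
n = 5000
y_train = rng.integers(0, 2, n)
causal = np.where(rng.random(n) < 0.75, y_train, 1 - y_train)
spurious = y_train.copy()
X_train = np.stack([causal, spurious], axis=1)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Out of distribution: the spurious correlation is broken.
y_test = rng.integers(0, 2, n)
causal_t = np.where(rng.random(n) < 0.75, y_test, 1 - y_test)
spurious_t = rng.integers(0, 2, n)               # now unrelated to the label
X_test = np.stack([causal_t, spurious_t], axis=1)

probs = clf.predict_proba(X_test)[:, 1]
acc = ((probs > 0.5).astype(int) == y_test).mean()
print(f"learned weights (causal, spurious): {clf.coef_[0].round(2)}")
print(f"OOD accuracy: {acc:.2f}, mean confidence: {np.abs(probs - 0.5).mean() + 0.5:.2f}")
```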

 

Also, does ensembling 5 NNs help?

In practice, this does not seem to help very much.

 

If we're conservative over a million models, how will we ever do anything?

I mean, there can easily be cases where we assign a very low probability to any given "complete" model of a situation, but where we are still able to assign a high probability to many different partial hypotheses. For example, if you randomly sample a building somewhere on earth, then your credence that the building has a particular floor plan might be less than 1 in 1,000,000 for each given floor plan. However, you could still assign a credence of near-1 that the building has a door, and a roof, etc. To give a less contrived example, there are many ways for the stock market to evolve over time. It would be very foolish to assume that it will evolve according to, e.g., your maximum-likelihood model. However, you can still assign a high credence to the hypothesis that it will grow on average. In many (if not all?) cases, such partial hypotheses are sufficient.
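As a minimal numerical sketch of that point (everything here is synthetic and purely illustrative): even if each "complete" hypothesis has negligible posterior probability, a partial hypothesis obtained by marginalising over them can still have credence near 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# A posterior over 1,000,000 "complete" hypotheses, each specifying every
# detail of a situation; every individual weight is tiny.
n_models = 1_000_000
posterior = rng.dirichlet(np.ones(n_models))

# Hypothetical predicate: does this complete model satisfy the partial
# hypothesis (e.g. "the building has a door")? Here, 99.9% of models do.
satisfies = rng.random(n_models) < 0.999

print(f"largest single-model probability:   {posterior.max():.2e}")
print(f"credence in the partial hypothesis: {posterior[satisfies].sum():.4f}")
```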

 

If our prior is importantly different in a way which we think will help, why can't we regularize to train a NN in a normal way which will vaguely reasonably approximate this prior?

In the examples I gave above, the issue is less about the prior and more about the need to keep track of all plausible alternatives (which neural networks do not seem to do, robustly). Using ensembles might help, but in practice this does not seem to work that well.

 

I again don't see how bayes ensures you have some non-schemers while ensembling doesn't.

I also don't see a good reason to think that a Bayesian posterior over agents should give a large weight to non-schemers. However, that isn't the use-case here (the world model is not meant to be agentic).

 

Another way to put this, is that all the interesting action was happening at the point where you solved the ELK problem.

So, this depends on how you attempt to create the world model. If you try to do this by training a black-box model to do raw sensory prediction, and then attempt to either extract latent variables from that model, or turn it into an interpretable model, or something like that, then yes, you would probably have to solve ELK, or solve interpretability, or something like that. I would not be very optimistic about that working. However, this is in no way the only option. As a very simple example, you could simply train a black-box model to directly predict the values of all latent variables that you need for your definition of harm. This would not produce an interpretable model, and so you may not trust it very much (unless you have some learning-theoretic guarantee, perhaps), but it would not be difficult to determine if such a model "thinks" that harm would occur in a given scenario. As another example, you could build a world model "manually" (with humans and LLMs). Such a model may be interpretable by default. And so on.
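As a sketch of the first option (a black-box model trained to directly predict the safety-relevant latent variables, with a harm predicate defined on top of them), where the latent variable names, thresholds, and architecture are all hypothetical placeholders rather than a proposal for a real system:

```python
import torch
import torch.nn as nn

# Hypothetical latent variables that a harm definition might be stated over.
LATENT_NAMES = ["energy_released", "humans_in_vicinity", "data_exfiltrated_bytes"]

class LatentPredictor(nn.Module):
    """Black-box model mapping observations directly to the latent variables
    needed by the harm definition. It would be trained by supervised
    regression on (observation, latent) pairs, which assumes access to the
    latent variables during training."""
    def __init__(self, obs_dim: int, n_latents: int = len(LATENT_NAMES)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, n_latents),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def harm_predicate(latents: torch.Tensor) -> torch.Tensor:
    """Hypothetical formal definition of harm: harm occurs in a scenario if
    any predicted latent variable exceeds its threshold."""
    thresholds = torch.tensor([1.0e6, 0.0, 0.0])
    return (latents > thresholds).any(dim=-1)
```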

 

I thought the plan was to build it with either AI labor or human labor so that it will be sufficiently intepretable. Not to e.g. build it with SGD. If the plan is to build it with SGD and not to ensure that it is interpretable, then why does it provide any safety guarantee? How can we use the world model to define a harm predicate?

The general strategy may include either of these two approaches. I'm just saying that the plan does not definitionally rely on the assumption that the world model is built manually.

 

Won't predicting safety specific variables contain all of the difficulty of predicting the world?

That very much depends on what the safety specifications are, and how you want to use your AI system. For example, think about the situations where similar things are already done today. You can prove that a cryptographic protocol is unbreakable, given some assumptions, without needing to have a complete predictive model of the humans that use that protocol. You can prove that a computer program will terminate using a bounded amount of time and memory, without needing a complete predictive model of all inputs to that computer program. You can prove that a bridge can withstand an earthquake of such-and-such magnitude, without having to model everything about the earth's climate. And so on. If you want to prove something like "the AI will never copy itself to an external computer", or "the AI will never communicate outside this trusted channel", or "the AI will never tamper with this sensor", or something like that, then your world model might not need to be all that detailed. For more ambitious safety specifications, you might of course need a more detailed world model. However, I don't think there is any good reason to believe that the world model categorically would need to be a "complete" world model in order to prove interesting safety properties. 
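As a minimal sketch of what "prove the property relative to the world model" can look like when the world model is a small, discrete abstraction (with hypothetical `world_model_step` and `is_bad_state` functions; a real system would use a model checker or theorem prover rather than explicit enumeration):

```python
from collections import deque

def violates_safety(initial_state, actions, world_model_step, is_bad_state,
                    max_depth=10):
    """Exhaustively search all action sequences up to max_depth steps,
    relative to the world model. Returns a counterexample trace if some
    reachable state violates the safety property, otherwise None.
    States are assumed to be hashable; all callables are hypothetical."""
    frontier = deque([(initial_state, [])])
    seen = {initial_state}
    while frontier:
        state, trace = frontier.popleft()
        if is_bad_state(state):
            return trace
        if len(trace) >= max_depth:
            continue
        for action in actions:
            nxt = world_model_step(state, action)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, trace + [action]))
    return None  # no violation reachable within max_depth, per the world model
```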

From my understanding, the bayesian aspect of this agenda doesn't add much value.

A core advantage of Bayesian methods is the ability to handle out-of-distribution situations more gracefully, and this is arguably as important as (if not more important than) interpretability. In general, most (?) AI safety problems can be cast as an instance of a case where a model behaves as intended on a training distribution, but generalises in an unintended way when placed in a novel situation. Traditional ML has no straightforward way of dealing with such cases, since it only maintains a single hypothesis at any given time. However, Bayesian methods may make it less likely that a model will misgeneralise, or should at least give you a way of detecting when this is the case.

I also don't agree with the characterisation that "almost all the interesting work is in the step where we need to know whether a hypothesis implies harm" (if I understand you correctly). Of course, creating a formal definition or model of "harm" is difficult, and creating a world model is difficult, but once this has been done, it may not be very hard to detect if a given action would result in harm. But I may not understand your point in the intended way.

 

manually build an (interpretable) infra-bayesian world model which is sufficiently predictive of the world (as smart as our AI)

I don't think this is an accurate description of davidad's plan. Specifically, the world model does not necessarily have to be built manually, and it does not have to be as good at prediction as our AI. The world model only needs to be good at predicting the variables that are important for the safety specification(s), within the range of outputs that the AI system may produce.

 

Proof checking on this world model also seems likely to be unworkable

I agree that this is likely to be hard, but not necessarily to the point of being unworkable. Similar things are already done for other kinds of software deployed in complex contexts, and ASL-2/3 AI may make this substantially easier.


You can imagine different types of world models, going from very simple ones to very detailed ones. In a sense, you could perhaps think of the assumption that the input distribution is i.i.d. as a "world model". However, what is imagined is generally something that is much more detailed than this. More useful safety specifications would require world models that (to some extent) describe the physics of the environment of the AI (perhaps including human behaviour, though it would probably be better if this can be avoided). More detail about what the world model would need to do, and how such a world model may be created, is discussed in Section 3.2. My personal opinion is that the creation of such a world model probably would be challenging, but not more challenging than the problems encountered in other alignment research paths (such as mechanistic interpretability, etc). Also note that you can obtain guarantees without assuming that the world model is entirely accurate. For example, consider the guarantees that are derived in cryptography, or the guarantees derived from formal verification of airplane controllers, etc. You could also monitor the environment of the AI at runtime to look for signs that the world model is inaccurate in a certain situation, and if such signs are detected, transition the AI to a safe mode where it can be disabled.


If a universality statement like the above holds for neural networks, it would tell us that most of the details of the parameter-function map are irrelevant.  

I suppose this depends on what you mean by "most". DNNs and CNNs have noticeable and meaningful differences in their (macroscopic) generalisation behaviour, and these differences are due to their parameter-function map. This is also true of LSTMs vs transformers, and so on. I think it's fairly likely that these kinds of differences could have a large impact on the probability that a given type of model will learn to exhibit goal-directed behaviour in a given training setup, for example.

The ambitious statement here might be that all the relevant information you might care about (in terms of understanding universality) are already contained in the loss landscape.

Do you mean the loss landscape in the limit of infinite data, or the loss landscape for a "small" amount of data? In the former case, the loss landscape determines the parameter-function map over the data distribution. In the latter case, my guess would be that the statement probably is false (though I'm not sure).

You're right, I put the parameters the wrong way around. I have fixed it now, thanks!
