Because the way they went about it was to give the language model access to a separate physics simulator (MuJoCo, owned by DeepMind) rather than something like the language model learning the rules of physics through a physics textbook or landing on some encoding of English tokens that happens to represent physics.

I interpreted having to go to a different engine to get better inputs for the language model as counting as multiple interacting services.

[-][anonymous]3y30

Would it satisfy your objections if the physics engine were a separate system, but that system was using only neural networks or other ML hypotheses capable of learning from feedback? So no hardcoded physics code, just a ML system that regresses between some input representation of the world and future predictions. And then we develop the representation with autoencoders so no human bias there either.

[-]ryan_b3y41

I don't think so, no - the way I understand it, any kind of separation into separate systems falls into the CAIS sphere of thought. This is because we are measuring things from the capability side, rather than the generation-of-capability side: for example, I think it is still CAIS even if we take the exact same ML architecture and train copies of it into different specializations which then use each other as services.

There are a couple of things worth distinguishing, though:

That a sufficiently integrated CAIS is indistinguishable from a single general agent to us is what tells us CAIS isn't safe either.
The arguments that a single general agent represents a different or more severe danger profile still motivate tracking the two paths differently.

I will also say I think this type of work could easily be part of the creation of a single general agent. If we consider Gato as the anchor for a general agent: many different tasks were captured in a single transformer, but as far as I understand it, Gato kept the input-to-task associations, which is to say the language inputs for the language tasks and the physics inputs for the physics tasks. But if the language model fed to Gato-2 uses this Mind's Eye technique, it would be possible to do physics tasks from a language input and maybe also explain the physics tasks as a language output.

So before, it could respond to sentences with other sentences, and it could respond to equations with other equations. Now it can certainly process ancient geometry books and Newton's Principia, which use words to describe equations, and maybe even compose outputs of a similar kind.

[-]Paul Tiplady3y30

That a sufficiently integrated CAIS is indistinguishable from a single general agent to us is what tells us CAIS isn't safe either.

Fleshing this point out, I think one can probably make conditional statistical arguments about safety here, to define what I think you are getting at with "sufficiently integrated".

If your model is N parameters and integrates a bunch of Services, and we've established that a SOTA physics model requires N*100 parameters (the OP paper suggests that OOM difference), then it is likely safe to say that the model has not "re-learned" physics in some way that would render it illegible. (I've been thinking of this configuration as "thin model, fat modules"; maybe "thin model, fat services" fits better with the CAIS terminology.) However, another model at N*1000 would be able to embed a physics simulator, and therefore could effectively re-implement it in an illegible/deceptive way.
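To make the parameter-budget argument concrete, here is a minimal sketch in Python. The function names are hypothetical, and the only inputs are the N, N*100, and N*1000 figures from the paragraph above. It is a back-of-the-envelope heuristic, not a safety guarantee.

```python
def capacity_margin(model_params: float, physics_model_params: float) -> float:
    """How many times larger a SOTA physics model is than the model under test."""
    return physics_model_params / model_params


def could_embed_physics(model_params: float, physics_model_params: float) -> bool:
    """Heuristic from the comment: a model with fewer parameters than a SOTA
    physics model is unlikely to have re-learned physics internally."""
    return model_params >= physics_model_params


N = 1.0                      # thin model size, in arbitrary units
physics = 100 * N            # assumed size of a SOTA physics model (N*100)

thin_margin = capacity_margin(N, physics)            # 100x headroom
thin_risky = could_embed_physics(N, physics)          # thin model: not enough capacity
fat_risky = could_embed_physics(1000 * N, physics)    # N*1000 model: enough capacity
```

Under these assumptions the thin model has a 100x margin, while the N*1000 model has none.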

Thinking about ways in which this safety margin could break; is it possible to have a thin mapping layer on top of your Physics simulator that somehow subverts or obfuscates it without having to fully re-implement it? For example, running two simple problems in the simulator, then merging/inverting them in the model layer? Intuitively I suspect it's hard to achieve much with this attack, but it's probably not possible to rule out in general.

[-][anonymous]3y30

Note that the "thin layer" is what you need to do regardless. Suppose your machine observes a situation, either through robotics or in some video it found online. Say the situation is an object being dropped: the physics sim predicts a fairly vigorous bounce, but the actual object splats.

You need some way to correct the physics sim's predictions to correspond with actual results, and the simplest way is a neural network that learns the mapping (real world input, physics engine prediction) -> (real world observation).

You have to then use the predicted real world observation or probability distribution of predicted observations for your machine reasoning about what it should do.
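A minimal sketch of such a correction layer, under loud assumptions: instead of a neural network it swaps in a per-material least-squares scale factor, the function names are hypothetical, and the numbers are made-up toy data rather than real measurements. It only illustrates the shape of the mapping (real world input, simulator prediction) -> (predicted observation).

```python
from collections import defaultdict


def fit_correction(samples):
    """Fit one least-squares scale factor per material.

    samples: iterable of (material, sim_prediction, observed) tuples.
    Returns {material: scale} minimizing sum((scale * sim - observed) ** 2).
    """
    num = defaultdict(float)
    den = defaultdict(float)
    for material, sim, obs in samples:
        num[material] += sim * obs
        den[material] += sim * sim
    return {m: num[m] / den[m] for m in num if den[m] > 0}


def corrected_prediction(scales, material, sim_prediction):
    # Unknown materials fall back to trusting the simulator unchanged.
    return scales.get(material, 1.0) * sim_prediction


# Toy, made-up bounce heights in metres: (material, simulated, observed).
toy = [
    ("rubber", 0.80, 0.76),
    ("rubber", 0.40, 0.38),
    ("clay", 0.80, 0.0),   # the simulator predicts a bounce, the object splats
    ("clay", 0.40, 0.0),
]
scales = fit_correction(toy)
```

The downstream reasoning would then consume `corrected_prediction(...)` rather than the raw simulator output.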

[-]Paul Tiplady3y30

I realized the reference to "thin layer" is ambiguous in my post. I just wanted to confirm whether you were referring to the general case, "thin model, fat services", or to the specific safety question at the bottom, "is it possible to have a thin mapping layer on top of your Physics simulator that somehow subverts or obfuscates it"? My child reply assumed the former, but on consideration/re-reading I suspect the latter might be more likely.

[-]ryan_b3y20

Thinking about ways in which this safety margin could break; is it possible to have a thin mapping layer on top of your Physics simulator that somehow subverts or obfuscates it

I suppose that a mapping task might fall under the heading of a mesa-optimizer, where what it is doing is optimizing for fidelity between the outputs of the language layer and the inputs of the physics layer. This would be in addition to the mesa-optimization going on just in the physics simulator. Working title:

CAIS: The Case For Maximum Mesa Optimizers

[-]Paul Tiplady3y30

With "thin model / fat service" I'm attempting to contrast with the typical end-to-end model architecture, where there is no separate "physics simulator", and instead the model just learns its own physics model, embedded with all the other relations that it has learned. So under that dichotomy, I think there is no "thin layer in front of the physics simulation" in an end-to-end model, as any part of the physics simulator can connect to or be connected from any other part of the model.

In such an end-to-end model, it's really hard to figure out where the "physics simulator" (such as it is) even lives, and what kind of ontology it uses. I'm explicitly thinking of the ELK paper's Ontology Identification problem here. How do we know the AI is being honest when it says "I predict that object X will be in position Y at time T"? If you can interpret its physics model, you can see what simulations it is running and know if it's being honest.

It's possible, if we completely solve interpretability, to extract the end-to-end model's physics simulator, I suppose. But if we have completely solved interpretability, I think we're most of the way towards solving alignment. On the other hand, if we use Services (e.g. MuJoCo for Physics) then we already have interpretability on that service, by construction, without having to interpret whatever structures the model has learned to solve physics problems.

You need some way to correct the physics sim's predictions to correspond with actual results

Sure, in some sense an end-to-end model needs to normalize inputs, extract features, and so on, before invoking its physics knowledge. But the point I'm getting at is that with Services we know for sure what the model's representation has been normalized to (since we can just look at the data passed to MuJoCo), whereas this is (currently) opaque in an end-to-end model.
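As an illustration of "just look at the data passed to MuJoCo", here is a hedged Python sketch. The helper names are hypothetical, and the element and attribute names follow my understanding of MuJoCo's MJCF XML format, so treat them as assumptions. It does not call MuJoCo at all; it only builds the kind of scene description a model might emit and then parses it back into a human-readable summary.

```python
import xml.etree.ElementTree as ET


def build_drop_scene(name, height_m, radius_m, mass_kg):
    """Build a small MJCF-style scene: one sphere released above the origin."""
    root = ET.Element("mujoco", model="drop_test")
    world = ET.SubElement(root, "worldbody")
    body = ET.SubElement(world, "body", name=name, pos=f"0 0 {height_m}")
    ET.SubElement(body, "freejoint")  # lets the body move freely under gravity
    ET.SubElement(body, "geom", type="sphere",
                  size=str(radius_m), mass=str(mass_kg))
    return ET.tostring(root, encoding="unicode")


def audit_scene(xml_text):
    """Summarize what the simulator is being asked to simulate."""
    root = ET.fromstring(xml_text)
    summary = []
    for body in root.iter("body"):
        geom = body.find("geom")
        summary.append({
            "name": body.get("name"),
            "pos": [float(v) for v in body.get("pos").split()],
            "shape": geom.get("type"),
            "mass": float(geom.get("mass")),
        })
    return summary


scene = build_drop_scene("ball", 1.0, 0.05, 0.1)
report = audit_scene(scene)
```

The point of the sketch is that `report` is legible by construction: whatever the language model "meant", the simulator only ever sees this explicit description.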

[-][anonymous]3y12

Note that you would probably never use such a "model learns its own physics" approach as the solution in any production AI system. For either the reason you gave, or more pedantically, so long as rival architectures using a human-written physics system at the core perform more consistently in testing, you should never pick the less reliable solution.

[-]Paul Tiplady3y30

I suppose the follow-up question is: how effectively can a model learn to re-implement a physics simulator, if given access to it during training -- instead of being explicitly trained to generate XML config files to run the simulator during inference?

If it's substantially more efficient to use this paper's approach and train your model to use a general purpose (and transparent) physics simulator, I think this bodes well for interpretability in general. In the ELK formulation, this would enable Ontology Identification.

On this point, the paper says:

Mind’s Eye is also efficient, since it delegates domain-specific knowledge to external expert models... The size of the LM can thus be significantly shrunk since it removes the burden of memorizing all the domain-specific knowledge. Experiments find that 100× smaller LMs augmented with Mind’s Eye can achieve similar reasoning capabilities as vanilla large models, and its prompting-based nature avoids the instability issues of training mixture-of-expert models (Zoph et al., 2022). The compatibility with small LMs not only enables faster LM inference, but also saves time during model saving, storing, and sharing.

On the other hand, the general trend of "end-to-end trained is better than hand-crafted architectures" has been going strong in recent years; it's mentioned in the CAIS post, and Demis Hassabis noted that he thinks it's likely to continue in his interview by Lex Fridman (indeed they chatted quite a bit about using AI models to solve Physics problems). And indeed, DeepMind has a recent paper gesturing towards an end-to-end learned Physics model from video, which looks far less capable than the one shown in the OP, but two papers down the line, who knows.

[-]jacob_cannell3y20

This seems like a direct step down the CAIS path of development

Cooperative AI Systems/Services? A quick Google search isn't finding it; I'm working from only a vague memory.

[-]niplav3y40

Comprehensive AI Services.

Summary, original paper.

[-]ryan_b3y20

Yeah, I shoulda linked that. Fixing shortly, thanks to niplav in the meantime!

[-]Dave Orr3y10

Basically their idea is that instead of having one agent that optimizes the hell out of its value function and bad things happen, have a collection of smaller components that each are working on a subproblem with limited resources. If you can do that and also aggregate them such that as a unit they are superhuman, you get a lot of the benefits without (at least some of) the big risks.

Here's a brief explainer with some objections.

[-]Zach Stein-Perlman3y10

Interesting. Figure 2 looks pretty impressive. What do the phrases like "log(350) = 2.5B" at the bottom of the figure mean?

[-]ryan_b3y2-1

That doesn't appear to be explained specifically, but what I think they are giving is the larger model size equivalence. That is to say, the 350M parameter language model with Mind's Eye is about as good as a 2.5B parameter language model, and so on.

[-]Hjalmar_Wijk3y21

I might be missing something, but are they not just giving the number of parameters (in millions of parameters) on a log10 scale? Scaling laws are usually by log-parameters, and I suppose they felt that it was cleaner to subtract the constant log(10^6) from all the results (e.g. taking log(1300) instead of log(1.3B)).

The B they put at the end is a bit weird though.
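A quick arithmetic check of this reading, as a small Python sketch (the function name is hypothetical; the 350M and 1.3B sizes are the ones mentioned in this thread):

```python
import math


def log10_millions(params: float) -> float:
    """log10 of the parameter count after dividing by 10^6,
    i.e. log10(params) - 6."""
    return math.log10(params / 1e6)


label_350m = round(log10_millions(350e6), 1)   # log10(350)
label_1_3b = round(log10_millions(1.3e9), 1)   # log10(1300)
offset = math.log10(1.3e9) - log10_millions(1.3e9)  # the subtracted constant
```

Rounded to one decimal, log10(350) comes out to 2.5, which matches the "log(350) = 2.5B" label if the trailing "B" is ignored.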


They gave LLMs access to physics simulators
