It matters when the first sharp left turn happens

[-]jacob_cannell3y72

By contrast, the world where sharp left turns happen near human-level capabilities seems pretty dangerous. In that world, we don’t get to study things like deceptive alignment empirically. We don’t get to leverage powerful AI researcher tools.

We actually can study and iterate on alignment (and deception) in simulation sandboxes while leveraging any powerful introspection/analysis tools. Containing human-level AGI in (correctly constructed) simulations is likely easy.

But then further down you say:

For instance, we can put weak models in environments that encourage alignment failure so as to study left turns empirically (shifting towards the "weak" worlds).

Which seems like a partial contradiction, unless you believe we can't contain human-level agents?

[-]Adam Jermyn3y32

I am skeptical that we can contain human-level agents, particularly if they have speed/memory advantages over humans.

[-]jacob_cannell3y22

Is this something you've thought deeply about and/or care to expand on? I'm curious about the source of your skepticism, considering:

we can completely design and constrain the knowledge base of sim agents, limiting them to the equivalent of 1000BC human knowledge, or whatever we want
we can completely and automatically monitor their inner monologues, thoughts, etc

How do you concretely propose a sim agent will break containment? (which firstly requires sim-awareness)

How would you break containment now, assuming you are in a sim?

Also, significant speed/memory advantages go against the definition of 'human-level agents' and are intrinsically unlikely anyway: 2x speed/memory agents are simply preceded by 1x speed/memory agents, and the constraints of GPUs permit acceleration by parallelizing learning across agents more than serial speedup of individual agents.

[-]Adam Jermyn3y10

My modal expectation is that these are going to look something like large language models, meaning that while we'll have controlled the training corpus it won't be something hand-curated. So there will almost certainly be lots of information in there that models could use to e.g. figure out how to manipulate people, program computers, and so on. I expect then that we won't be limiting their knowledge as much as you think we will, because I expect them to be trained with giant corpuses of (mostly) post-1000BC text.

We do not currently have the interpretability tools needed to monitor their inner thoughts. I would feel much better about this problem if we had such tools!

My best guess is that containment fails because a model convinces one of the people in the lab to let it access the internet. That could look like getting someone to think it's genuinely a good idea (because it'll be helpful for some project), or it could look like bribery (with offer of future payment) or blackmail.

I'm guessing that we'll want to run some agents fast because that's how you get lots of serial intellectual work done. So even if all we get are human-level agents, there'll be a big incentive to run them fast.

[-]jacob_cannell3y02

Indeed a model trained on (or with full access to) our internet could be very hard to contain. That is in fact a key part of my argument. If it is hard to contain, it is hard to test. If it's hard to test, we are unlikely to succeed.

So if we agree there, then perhaps you agree that giving untested AGI access to our immense knowledge base is profoundly unwise.

I expect then that we won't be limiting their knowledge as much as you think we will, because I expect them to be trained with giant corpuses of (mostly) post-1000BC text.

That's then just equivalent to saying "I expect then that we won't even bother with testing our alignment designs". Do you actually believe that testing is unnecessary? Or do you just believe the leading teams won't care? And if you agree that testing is necessary, then shouldn't this be key to any successful alignment plan?

My best guess is that containment fails because a model convinces one of the people in the lab to let it access the internet.

Which obviously is nearly impossible if it doesn't know it is in a sim, and doesn't know what a lab or the internet are, and lacks even the precursor concepts. Since you seem to be focused on the latest fad (large language models), consider the plight of poor simulated Elon Musk - who at least is aware of the sim argument.

[-]jacquesthibs3y10

Thanks for pushing back on some stuff here.

Since you seem to be focused on the latest fad (large language models)

Is this your way of saying you don't think LLMs will scale to AGI (or AGI-level capabilities)?

So if we agree there, then perhaps you agree that giving untested AGI access to our immense knowledge base is profoundly unwise.

This seems like it will happen with economic incentives (I mean, WebGPT already exists) and models will already be trained on the entire internet, so it's not clear to me how we'd prevent this?

[-]jacob_cannell3y42

Is this your way of saying you don't think LLMs will scale to AGI (or AGI-level capabilities)?

It's more that AGI will contain an LLM module, as the brain contains linguistic cortex modules. Indeed, several new neuroscience studies have shown that LLMs are functionally equivalent to the human brain's linguistic cortex: trained the same way on the same data, resulting in very similar learned representations.

It's also obvious from the LLM scaling that the larger LLMs are already comparable to linguistic cortex but lag the brain in many core capabilities. You don't get those capabilities from just making a larger vanilla LLM / linguistic cortex.

This seems like it will happen with economic incentives (I mean, WebGPT already exists) and models will already be trained on the entire internet

Training just a language module on the internet is not dangerous by itself, but of course yes the precedent is concerning. There are now several projects working on multi-modal foundation agents that control virtual terminals with full web access. If that continues and reaches AGI before the safer sim/game training path, then we may be in trouble.

so it's not clear to me how we'd prevent this?

Well, if we agree that we obviously can contain human-level agents in sims if we want to, then it's some combination of the standard approaches: spreading awareness, advocacy, and advancing sim techniques.

Consider, for example, if there were a complete alignment solution available today, and it necessarily required changing how we train models. Clearly, the fact that there is current inertia in the wrong direction can't be some knockdown argument against doing the right thing. If you learn your train is on a collision course, you change course or jump.

[-]Vladimir_Nesov3yΩ240

The usual framing is to align agent policies in a way that also ensures that they don't have an out-of-distribution phase change. With deceptive alignment, sharp left turn, and goodharting, capability to consider complicated plans or mere optimization-in-actuality prompts situations that are too out-of-distribution. The behavior on-distribution stops being a good indication of what's going to happen there, regardless of difficulties of modern ML with robustness. In this framing, the training distribution also needs to confer alignment, so it anchors to more usual situations that can currently be captured in practice, with more capable agents or over-optimizing agents discovering more unusual situations than that. This anchoring of the character of the training distribution might be a problem.

It seems useful to deconfuse agents that don't have that sort of phase change in their behavior, without being distracted by the separate problem of their alignment. When not focusing on aligned agents, it becomes more natural to consider arbitrary agents in arbitrary situations, including situations that can't be obtained from the world as training data, so won't be useful for ensuring alignment. And to treat such situations as training data for ML purposes.

The simulator framing is an example of this point, except it still pays too much attention to currently observable reality (even as simulators themselves are bad at that task). A simulator is a map of counterfactuals and not an agent, and it's not about describing particular agents, instead it describes all sorts of agents and their interaction in all sorts of situations simultaneously. Can a simulator be trained to be particularly good at describing agents that make coherent decisions across a wide variety of situations, including situations that can't be collected as training data from the world, or won't occur in actuality, and need to instead be generated, with little hope or intent of remaining predictive of the real world? Or maybe such coherence is more naturally found in something smaller than agents, a collective of intents/purposes/concepts/norms that each shapes a scope of situations where it's relevant, competing/bargaining with others of its kind to find an equilibrium that doesn't have the more jarring sorts of behavioral phase changes, especially capability-related ones.

[-]Catnee3y10

It probably also depends on how much information about "various models trying to naively backstab their own creators" there is in the training dataset.
