Safety Implications of LeCun's path to machine intelligence

Great post, thanks for sharing. Here's my core concern about LeCun's worldview, then two other thoughts:

The intrinsic cost module (IC) is where the basic behavioral nature of the agent is defined. It is where basic behaviors can be indirectly specified. For a robot, these terms would include obvious proprioceptive measurements corresponding to “pain”, “hunger”, and “instinctive fears”, measuring such things as external force overloads, dangerous electrical, chemical, or thermal environments, excessive power consumption, low levels of energy reserves in the power source, etc.
They may also include basic drives to help the agent learn basic skills or accomplish its missions. For example, a legged robot may comprise an intrinsic cost to drive it to stand up and walk. This may also include social drives such as seeking the company of humans, finding interactions with humans and praises from them rewarding, and finding their pain unpleasant (akin to empathy in social animals). Other intrinsic behavioral drives, such as curiosity, or taking actions that have an observable impact, may be included to maximize the diversity of situations with which the world model is trained (Gottlieb et al., 2013)
The IC can be seen as playing a role similar to that of the amygdala in the mammalian brain and similar structures in other vertebrates. To prevent a kind of behavioral collapse or an uncontrolled drift towards bad behaviors, the IC must be immutable and not subject to learning (nor to external modifications).

This is the paper's treatment of the outer alignment problem. It says models should have basic drives and behaviors that are specified directly by humans and not trained. The paper doesn't mention the challenges of reward specification or the potential for learning human preferences. It doesn't discuss our normative systems or even the kinds of abstractions that humans care about. I don't understand why he doesn't see the challenges with specifying human values.

Most of the paper instead focuses on the challenges of building accurate, multimodal predictive world models. This seems entirely necessary to continue advancing AI, but the primary focus on predictive capabilities, combined with the downplaying of the challenges in learning human values, worries me.

If anybody has good sources about LeCun's views on AI safety and value learning, I'd be interested.

success of model-free RL in complex video game environments like StarCraft and Dota 2

Do we expect model-free RL to succeed in domains where you can't obtain incredible amounts of data thanks to e.g. self-play? A purely predictive world model seems better able to utilize self-supervised predictive objective functions, and to generalize to many possible goals that share a single world model. (Not to mention the potential alignment benefits of a more modular system.) Is model-free RL simply a fluke that learns heuristics by playing games against itself, or are there reasons to believe it will succeed on more important tasks?

Since the whole architecture is trained end-to-end with gradient descent

I don't think this is what he meant, though I might've missed something. The world model could be trained with the self-supervised objective functions of language and vision models, as well as perhaps large labeled datasets and games via self-play. On the other hand, the actor must learn to adapt to many different tasks very quickly, but could potentially use few-shot learning or fine-tuning to that end. The more natural architecture would seem to be modules that treat each other as black boxes and can be swapped out relatively easily.

[-]Evan R. Murphy3y*11-3

If anybody has good sources about LeCun's views on AI safety and value learning, I'd be interested.

There's a conversation LeCun had with Stuart Russell and a few others in a Facebook comment thread back in 2019, arguing about instrumental convergence.

The full conversation is a bit long and difficult to skim. I haven't finished reading it myself, but in it LeCun links to an article he co-authored for Scientific American which argues x-risk from AI misalignment isn't something people should worry about. (He's more concerned about misuse risks.) Here's a quote from it:

We dramatically overestimate the threat of an accidental AI takeover, because we tend to conflate intelligence with the drive to achieve dominance. [...] But intelligence per se does not generate the drive for domination, any more than horns do.

[-]Ivan Vendrov3y64

My read of LeCun in that conversation is that he doesn't think in terms of outer alignment / value alignment at all, but rather in terms of implementing a series of "safeguards" that allow humans to recover if the AI behaves poorly (See Steven Byrnes' summary).

I think this paper helps clarify why he believes this - he had something like this architecture in mind, and so outer alignment seemed basically impossible. Independently, he believes it's unnecessary because the obvious safeguards will prove sufficient.

[-]Ivan Vendrov3y21

Ah you're right, the paper never directly says the architecture is trained end-to-end - updated the post, thanks for the catch.

He might still mean something closer to end-to-end learning, because:

The world model is differentiable w.r.t. the cost (Figure 2), suggesting it isn't trained purely using self-supervised learning.
The configurator needs to learn to modulate the world model, the cost, and the actor; it seems unlikely that this can be done well if these are all swappable black boxes. So there is likely some phase of co-adaptation between configurator, actor, cost, and world model.

[-]Steven Byrnes3y72

I claim there’s some overlap with brain-like AGI safety. No coincidence, since he’s explicitly inspired by how the brain works. :)

Interpretability becomes much easier

I would state that in a more pessimistic way, by saying “Interpretability seems extraordinarily hard if not impossible in this approach, but it would be even worse in other approaches”. See discussion here & here for example.

Most of the "intelligence" in the system (the world model) is aimed at increasing predictive accuracy, and the agent is motivated by relatively simple hard-coded drives; whether its intelligent behaviors are safe or dangerous will not be predictable in advance.

I think that’s unduly pessimistic. I think it’s a hard problem, but that it’s at least premature to say that it’s impossible. For example, we get to pick the “relatively simple hard-coded drives”, we get to pick the training data / environment, we get to invent other tricks, etc.

Whether an AI deployment leads to catastrophic outcomes will mostly be a function not of the agent's properties, but of the safety affordances implemented by the people deploying it

My opinion is that by far the most important determinant, and most important intervention point, is whether the agent is trying to bring about catastrophic outcomes (which, again, I see as a hard but not knowably-doomed thing to intervene on). See here.

[-][deactivated]2y*50

Very lucidly written. Thanks.

Broadly, it seems that in a world where LeCun's architecture becomes dominant, useful AI safety work looks more analogous to the kind of work that goes on now to make self-driving cars safe. It's not difficult to understand the individual components of a self-driving car or to debug them in isolation, but emergent interactions between the components and a diverse range of environments require massive and ongoing investments in testing and redundancy.

I think this is the crux of the matter. This is why LeCun tweeted:

One cannot just "solve the AI alignment problem." Let alone do it in 4 years. One doesn't just "solve" the safety problem for turbojets, cars, rockets, or human societies, either. Engineering-for-reliability is always a process of continuous & iterative refinement.

LeCun, like Sam Altman, believes in an empirical, iterative approach to AI safety. This is in sharp contrast to the highly theoretical, figure-it-all-out-far-in-advance approach of MIRI.

I don’t get why some folks are so dismissive of the empirical, iterative approach. Is it because they believe in a fast takeoff?

[-]quetzal_rainbow2y32

I think that if you mentioned cars and planes you should read this

[-]Algon3y50

Interpretability becomes much easier

How so? It's still a big old pile of vectors generated by SGD, just with a cost function that we can use to calculate stuff. But we may not understand what this cost function means, especially in terms of the model's native ontology. Sure, maybe it will have natural abstractions interpretable by circuits-style reasoning. But that's equally true of any current NN.

Looking at this, it is clear that this is a utility maximiser. And those are dangerous by default. Another worrying part is that there should be some "safety guardrails" in the cost function. But what kind of terms could make it safe? Nothing purely internal, at least not without crippling the AI's utility. For a utility function that's pointing to something in the real world, there are two issues:

Humans are very complex, and it seems tricky to point them out in a world model.
The AI's world model is potentially a shifting inscrutable mess. How do we reliably point to anything in it?

[-]james.lucassen3y43

In general, I'm a bit unsure about how much of an interpretability advantage we get from slicing the model up into chunks. If the pieces are trained separately, then we can reason about each part individually based on its training procedure. In the optimistic scenario, this means that the computation happening in the part of the system labeled "world model" is actually something humans would call world modelling. This is definitely helpful for interpretability. But the alternative possibility is that we get one or more mesa-optimizers, which seems less interpretable.

[-]Steven Byrnes3y40

I for one am moderately optimistic that the world-model can actually remain “just” a world-model (and not a secret deceptive world-optimizer), and that the value function can actually remain “just” a value function (and not a secret deceptive world-optimizer), and so on, for reasons in my post Thoughts on safety in predictive learning—particularly the idea that the world-model data structure / algorithm can be relatively narrowly tailored to being a world-model, and the value function data structure / algorithm can be relatively narrowly tailored to being a value function, etc.

[-]Evan R. Murphy3y2-1

Since LeCun's architecture as a whole is a kind of optimizer (I agree with Algon that it's probably a utility maximizer), the emergence of additional mesa-optimizers seems less likely.

We expect optimization to emerge because it's a powerful algorithm for SGD to stumble on that outcompetes the alternatives. But if the system is already an optimizer, then where is that selection pressure coming from to make another one?

[-]the gears to ascension3y10

it's coming from the fact that every module wants to be an optimizer of something in order to do its job

[-]Evan R. Murphy3y*10

Interesting, I wonder how the dynamics of a multiple mesa-optimizer system would play out (if it's possible).

[-]Ivan Vendrov3y31

I think it's easier to interpret than model-free RL (provided the line between model and actor is maintained through training, which is an assumption LeCun makes but doesn't defend) because it's doing explicit model-based planning, so there's a clear causal explanation for why the agent took a particular action - because it predicted that it would lead to a specific low-cost world state. It still might be hard to decode the world state representation, but much easier than decoding what the agent is trying to do from the activations of a policy network.
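To make the "clear causal explanation" point concrete, here is a toy sketch of explicit model-based planning. Everything in it is a hypothetical stand-in: `world_model`, `cost`, and `plan` are illustrative names, the world is a 1-D integer position, and the search is brute force rather than anything from LeCun's paper. The point is only that the planner returns the predicted trajectory alongside the chosen actions, so the "why" of each action is available by construction.

```python
from itertools import product

# Hypothetical 1-D world: the state is an integer position, actions shift it.
ACTIONS = (-1, 0, 1)

def world_model(state, action):
    """Toy stand-in for a learned world model: predicts the next state."""
    return state + action

def cost(state, target=3):
    """Toy stand-in for the cost module: distance from a target position."""
    return abs(state - target)

def plan(state, horizon=3):
    """Search all action sequences; return (total cost, actions, predicted states)."""
    best = None
    for seq in product(ACTIONS, repeat=horizon):
        s, trajectory = state, []
        for a in seq:
            s = world_model(s, a)   # roll the model forward one step
            trajectory.append(s)
        total = sum(cost(x) for x in trajectory)
        if best is None or total < best[0]:
            best = (total, seq, trajectory)
    return best

total_cost, actions, predicted_states = plan(state=0)
print(actions, predicted_states, total_cost)
```

The explanation for the chosen actions is the returned `predicted_states` plus their cost. Of course, in a real system those states would be learned latent vectors, which is exactly the "hard to decode the world state representation" caveat.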

Not obvious to me that it will be a utility maximizer, but definitely dangerous by default. In a world where this architecture is dominant, we probably have to give up on getting intent alignment and fall back to safety guarantees like "well it behaved well in all of our adversarial simulations, and we have a powerful supervising process that will turn it off if the plans look fishy". Not my ideal world, but an important world to consider.

[-]Algon3y43

It decides its actions via minimising a cost function. How's that not isomorphic to a utility maximiser?

[-]Ivan Vendrov3y80

The configurator dynamically modulates the cost function, so the agent is not guaranteed to have the same cost function over time, hence can be dutch booked / violate VNM axioms.
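A minimal sketch of the money-pump idea, under loudly stated assumptions: the two cost tables, the alternating schedule, the `FEE`, and the myopic accept-any-improving-trade rule are all made up for illustration, and none of it is meant to model how LeCun's configurator would actually behave. It only shows how an agent whose cost function flips over time can be walked in a circle while paying at every step.

```python
# Hypothetical cost tables; a stand-in "configurator" alternates between them.
COSTS = {
    "mode_a": {"apple": 0.0, "banana": 1.0},  # in this mode the agent prefers apple
    "mode_b": {"apple": 1.0, "banana": 0.0},  # in this mode the agent prefers banana
}
FEE = 0.25  # what the trader charges per swap (assumed value)

def other(item):
    """Return the item the agent is not currently holding."""
    return "banana" if item == "apple" else "apple"

def run_money_pump(rounds=4, holding="banana", money=2.0):
    """Offer a swap each round; the agent accepts if it lowers cost by more than the fee."""
    for t in range(rounds):
        mode = "mode_a" if t % 2 == 0 else "mode_b"  # configurator flips the cost
        cost = COSTS[mode]
        swap = other(holding)
        if cost[holding] - cost[swap] > FEE:  # myopic: judged by the current cost only
            holding, money = swap, money - FEE
    return holding, money

holding, money = run_money_pump()
print(holding, money)
```

After four rounds the agent holds the same item it started with and has paid the fee four times. An agent with a single fixed cost function would accept at most one of these swaps.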

[-]Algon3y40

Good point. But at any given time, it's doing EV calculations to decide its actions. Even if it modulates itself by picking amongst a variety of utility functions, its actions are still influenced by explicit EV calcs. If I understand TurnTrout's work correctly, that alone is enough to make the agent power-seeking, which is dangerous by default.

[-]Evan R. Murphy3y40

Some additional discussion of LeCun's paper on this earlier LessWrong post: https://www.lesswrong.com/posts/Y7XkGQXwHWkHHZvbm/yann-lecun-a-path-towards-autonomous-machine-intelligence-1
