Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Beyond past confusions

Over the last year, I wrote and thought many confused and confusing things about the relationship between goal-directedness and behavior. In the linked post for example, I defended a deconfusion of goal-directedness solely in terms of behavior; in doing so, I might pass for a behaviorist (someone who thinks that mental constructs are not needed and so don't exist), or look like I'm implying that we should never use internal knowledge of our models to determine goal-directedness. And that's without even mentioning the factual errors.

So here is my attempt at a short and clear explanation of the link I see between goal-directedness and behavior. If you're confused by this take, or believe me to be confused, I would really appreciate a comment. My goal isn't to prove that I'm obviously right, just to get less confused and hopefully help lift the fog of confusion for everyone.

Thanks to Jack Koch for a recent discussion that reminded me of this issue, and to Richard Ngo for giving me food for thought on this subject with his comments.

Behavior in all its glory

What's my take? I think that when we talk about goal-directedness, what we really care about is a range of possible behaviors, some of which we worry about in the context of alignment and safety. We might for example think that goal-directed systems have convergent subgoals, which tells us how they could lack corrigibility and cause catastrophic outcomes.

My entire point is that for deconfusing goal-directedness, we want a better understanding of this range of behaviors. At the moment, when thinking about a given behavior, I don't know whether that's the sort of thing a goal-directed system would do. This gap seems problematic both for understanding the risks of goal-directed systems, and for detecting them.

Note that even a purely structural definition of goal-directedness would constrain the structure such that the system behaves in a certain way. So even if we want a structural definition, clarifying the range of behaviors sounds like progress.

What I'm not saying

  • We shouldn't ascribe any cognition to the system, just find rules of association for its behavior (aka Behaviorism)
    • That's not even coherent with my favored approach to goal-directedness, the intentional stance. Dennett clearly ascribes beliefs and desires to beings and systems; his point is that the ascription is done based on the behavior and the circumstances.
  • Nothing but the behavior is useful to check goal-directedness.
    • Even in my original confused post, I point out that structural knowledge about the system is probably necessary to check goal-directedness, as it's probably the only tractable way of finding out what the system will do.
    • I hadn't thought about it last year, but I see more and more the value of thinking about the justified beliefs that the system might have, due to its training data, learning algorithm and inductive biases. (This is an idea of Paul with ties to universality)

How I could be wrong

The main crux I see about this take on behavior is whether it's even possible or tractable to deconfuse and formalize the range of behaviors of goal-directed systems. No matter how useful a formalization would be, if we can't get it, we should turn to other approaches.

That being said, I haven't seen any convincing argument that it's impossible, and the more I dig, the more stuff I find, so I am quite convinced that some progress is possible.

4 comments

What's my take? I think that when we talk about goal-directedness, what we really care about is a range of possible behaviors, some of which we worry about in the context of alignment and safety.

  • (What I'm not saying) We shouldn't ascribe any cognition to the system, just find rules of association for its behavior (aka Behaviorism)
  • That's not even coherent with my favored approach to goal-directedness, the intentional stance. Dennett clearly ascribes beliefs and desires to beings and systems; his point is that the ascription is done based on the behavior and the circumstances.

I agree pretty strongly with all of this, fwiw. I think Dennett/the intentional stance really gets at the core of what it means for a system to "be an agent"; essentially, a system is an agent to the extent it makes sense to model it as such, i.e. as having beliefs and preferences, and acting on those beliefs to achieve those preferences, etc. The very reason why we usually consider ourselves and other humans to be "agents" is exactly because that's the model over sensory data that the mind finds most reasonable to use, most of the time. In doing so, we actually are ascribing cognition to these systems, and in practice, of course we'll need to understand how such behavior will actually be implemented in our AIs. (And thinking about how "goal-directed behavior" is implemented in humans/biological neural nets seems like a good place to mine for useful insights and analogies for this purpose.)

I'm glad; you're one of the handful of people I wrote this post for. ;)

(And thinking about how "goal-directed behavior" is implemented in humans/biological neural nets seems like a good place to mine for useful insights and analogies for this purpose.)

Definitely. I have tended to neglect this angle, but I'm trying to correct that mistake.

The main crux I see about this take on behavior is whether it's even possible or tractable to deconfuse and formalize the range of behaviors of goal-directed systems.

It seems like a goal-directed system could decide to [temporarily] enact any behavior (that it was physically capable of), for example for the purpose of deception. How does this square with your current thoughts?

Here are some responses I can imagine:

  • Any behavior is possible, but some behaviors are far more likely than others, and probabilistic characterizations of the behavior of goal-directed systems would still be very useful
  • Computational resource limits put a bound on how much a system can pretend to be another system
  • If we can structurally detect non-deception, then we can get much stronger constraints on the system's behavior
  • Even though it feels intuitively like a sufficiently intelligent system could decide to enact any behavior, it is in fact a specific system that will necessarily have a more limited range of behaviors, and we can somehow figure out exactly which behaviors it can and can't enact

Interesting points!

My first reaction is that for the system to accomplish its goal, it must eventually behave goal-directedly. It's easy to imagine an AI accomplishing goal X by pursuing goal Y (for example building a robot to do X), but it's hard to imagine the AI accomplishing goal X by not accomplishing any goal.

As for your proposed answers:

  • I guess this is a probabilistic take on my argument that it will eventually need to be goal-directed to do things
  • I would say more "constraints on how well the actual goal must be accomplished"
  • That's true, but I'm not sure yet that we can detect deception.
  • Basically my intuition.