Comments

Answer by jbkjr · Dec 09, 2021

This is kind of vague, but I have this sense that almost everybody doing RL and related research takes the notion of "agent" for granted, as if it's some metaphysical primitive*, as opposed to being a (very) leaky abstraction that exists in the world models of humans. But I don't think the average alignment researcher has much better intuitions about agency, either, to be honest, even though some spend time thinking about things like embedded agency. It's hard to think meaningfully about the illusoriness of the Cartesian boundary when you still live 99% of your life and think 99% of your thoughts as if you were a Cartesian agent, fully "in control" of your choices, thoughts, and actions.

(*Not that "agent" couldn't, in fact, be a metaphysical primitive, just that such "agents" are hardly "agents" in the way most people consider humans to "be agents" [and, equally importantly, other things, like thermostats and quarks, to "not be agents"].)

jbkjr · 2y

> To illustrate my reservations: soon after I read the sentence about GNW meaning you can only be conscious of one thing at a time, as I was considering that proposition, I felt my chin was a little itchy and so I scratched it. So now I can remember thinking about the proposition while simultaneously scratching my chin. Trying to recall exactly what I was thinking at the time now also brings up a feeling of a specific body posture.

To me, "thinking about the proposition while simultaneously scratching my chin" sounds like a separate "thing" (a complex representation formed in the GNW) from either "think about the proposition" or "scratch my chin"... and you experienced this thing after the other ones, right? Like, from the way you described it, it sounds to me like there was actually 1) the proposition, 2) the itch, and 3) the output of a 'summarizer' that effectively says "just now, I was considering this proposition and scratching my chin". [I guess, in this sense, I would say you are ordinarily doing some "weird self-deceptive dance" that prevents you from noticing this, because most people seem to ordinarily experience "themselves" as the locus of/basis for experience, instead of there being a stream of moments of consciousness, some of which apparently refer to an 'I'.]

Also, I have this sense that you're chunking your experience into "things" based on what your metacognitive summarizer-of-mental-activity is outputting back to the workspace, but there are at least 10 representations streaming through the workspace each second, and many of these will be far more primitive than any of the "things" we've already mentioned here (or than would ordinarily be noticed by the summarizer without specific training for it, e.g. in meditation). Like, in your example, there were visual sensations from the reading, mental analyses about its content, the original raw sensation of the itch, the labeling of it as "itchy," the intention to scratch the itch, (definitely lots more...), and, eventually, the thought "I remember thinking about this proposition and scratching my chin 'at the same time'."

jbkjr · 3y

I think it's really cool you're posting updates as you go and writing about uncertainties! I also like the fiction continuation as a good first task for experimenting with these things.

> My life is a relentless sequence of exercises in importance sampling and counterfactual analysis

This made me laugh out loud :P

jbkjr · 3y

> If you then deconfuse agency as "its behavior is reliably predictable by the intentional strategy", I then have the same question: "why is its behavior reliably predictable by the intentional strategy?" Sure, its behavior in the set of circumstances we've observed is predictable by the intentional strategy, but none of those circumstances involved human extinction; why expect that the behavior will continue to be reliably predictable in settings where the prediction is "causes human extinction"?

> Overall, I generally agree with the intentional stance as an explanation of the human concept of agency, but I do not think it can be used as a foundation for AI risk arguments. For that, you need something else, such as mechanistic implementation details, empirical trend extrapolations, analyses of the inductive biases of AI systems, etc.

The requirement for its behavior being "reliably predictable" by the intentional strategy doesn't necessarily limit us to postdiction in already-observed situations; we could require our intentional stance model of the system's behavior to generalize OOD. Obviously, to build such a model that generalizes well, you'll want it to mirror the actual causal dynamics producing the agent's behavior as closely as possible, so you need to make further assumptions about the agent's cognitive architecture, inductive biases, etc. that you hope will hold true in that specific context (e.g. human minds or prosaic AIs). However, these are additional assumptions needed to answer the question of why an intentional stance model will generalize OOD, not replacements for the intentional stance as the foundation of our concept of agency, because, as you say, it explains the human concept of agency, and we're worried that AI systems will fail catastrophically in ways that look agentic and goal-directed... to us.

You are correct that having only the intentional stance is insufficient to make the case for AI risk from "goal-directed" prosaic systems, but having it as the foundation of what we mean by "agent" clarifies what more is needed to make a sufficient case: what about the mechanics of prosaic systems will allow us to build intentional stance models of their behavior that generalize well OOD?

jbkjr · 3y

> What's my take? I think that when we talk about goal-directedness, what we really care about is a range of possible behaviors, some of which we worry about in the context of alignment and safety.
>
> • (What I'm not saying) We shouldn't ascribe any cognition to the system, just find rules of association for its behavior (aka Behaviorism)
> • That's not even coherent with my favored approach to goal-directedness, the intentional stance. Dennett clearly ascribes beliefs and desires to beings and systems; his point is that the ascription is done based on the behavior and the circumstances.

I agree pretty strongly with all of this, fwiw. I think Dennett/the intentional stance really gets at the core of what it means for a system to "be an agent"; essentially, a system is an agent to the extent it makes sense to model it as such, i.e. as having beliefs and preferences, and acting on those beliefs to achieve those preferences, etc. The very reason why we usually consider ourselves and other humans to be "agents" is exactly because that's the model over sensory data that the mind finds most reasonable to use, most of the time. In doing so, we actually are ascribing cognition to these systems, and in practice, of course we'll need to understand how such behavior will actually be implemented in our AIs. (And thinking about how "goal-directed behavior" is implemented in humans/biological neural nets seems like a good place to mine for useful insights and analogies for this purpose.)

Answer by jbkjr · Aug 06, 2021

I think they do some sort of distillation-type thing where they train massive models to label data or act as “overseers” for the much smaller models that actually are deployed in cars (as inference has to be fast enough to make decisions in real time)… so I wouldn’t actually expect the deployed models to be that big in the actual cars. More details about this can be found in Karpathy’s recent CVPR talk, iirc, but not about parameter count/model size?
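Purely to illustrate the general pattern being described (and emphatically not a claim about Tesla's actual pipeline), here is a minimal teacher/student distillation sketch in PyTorch; the model sizes, the toy data, and the KL-based loss are all assumptions made up for this example.

```python
# Illustrative sketch of offline distillation: a large "teacher" model labels
# data, and a much smaller "student" model (the one you would actually deploy
# for low-latency inference) is trained to match those labels. All shapes and
# data here are toy assumptions, not anything specific to a real system.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(           # big offline "overseer"/auto-labeling model
    nn.Linear(128, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
)
student = nn.Sequential(           # small model intended for real-time inference
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
unlabeled = torch.randn(256, 128)  # stand-in for raw, unlabeled fleet data

for step in range(100):
    with torch.no_grad():                          # teacher labels data offline
        soft_targets = F.softmax(teacher(unlabeled), dim=-1)
    log_probs = F.log_softmax(student(unlabeled), dim=-1)
    # Train the student to match the teacher's soft labels (KL divergence).
    loss = F.kl_div(log_probs, soft_targets, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of the sketch is just that only the small student ever needs to run under the real-time latency budget; the large teacher can label logged data offline at whatever speed it likes.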

jbkjr · 3y

> To try to understand a bit better: does your pessimism about this come from the hardness of the technical challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the definitional challenge of exhaustively labeling every one of those zillion particles to make sure the demarcation is fully specified? Or is there a reason you think constructing any such demarcation is impossible even in principle? Or something else?

Probably something like the last one, although I think "even in principle" is probably doing something suspicious in that statement. Like, sure, "in principle," you can pretty much construct any demarcation you could possibly imagine, including the Cartesian one, but what I'm trying to say is something like, "all demarcations, by their very nature, exist only in the map, not the territory." Carving reality is an operation that only makes sense within the context of a map, as reality simply is. Your concept of "agent" is defined in terms of other representations that similarly exist only within your world-model; other humans have a similar concept of "agent" because they have a similar representation built from correspondingly similar parts. If an AI is to understand the human notion of "agency," it will need to also understand plenty of other "things" which are also only abstractions or latent variables within our world models, as well as what those variables "point to" (at least, what variables in the AI's own world model they 'point to,' as by now I hope you're seeing the problem with trying to talk about "things they point to" in external/'objective' reality!).

jbkjr · 3y

> (Because you'd always be unable to answer the legitimate question: "the mesa-objective of what?")
>
> All I'm saying is that, to the extent you can meaningfully ask the question, "what is this bit of the universe optimizing for?", you should be able to clearly demarcate which bit you're asking about.

I totally agree with this; I guess I'm just (very) wary about being able to "clearly demarcate" whichever bit we're asking about and therefore fairly pessimistic we can "meaningfully" ask the question to begin with? Like, if you start asking yourself questions like "what am 'I' optimizing for?," and then try to figure out exactly what the demarcation is between "you" and "everything else" in order to answer that question, you're gonna have a real tough time finding anything close to a satisfactory answer.

jbkjr · 3y

> Btw, if you're aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can't carve up a world, or recover a consistent utility function through this sort of process — please let me know. I'm directly working on a generalization of this problem at the moment, and anything like that could significantly accelerate my execution.

I'm not saying you can't reason under the assumption of a Cartesian boundary; I'm saying the results you obtain when doing so are of questionable relevance to reality, because "agents" and "environments" can only exist in a map, not the territory. The idea of trying to e.g. separate "your atoms" or whatever from those of "your environment," so that you can drop them into those of "another environment," is only a useful fiction, as in reality they're entangled with everything else. I'm not aware of a formal proof of the point I'm trying to make; it's just a pretty strongly held intuition. Isn't this also kind of one of the key motivations for thinking about embedded agency?

jbkjr · 3y

> I definitely see it as a shift in that direction, although I'm not ready to really bite the bullets -- I'm still feeling out what I personally see as the implications. Like, I want a realist-but-anti-realist view ;p

You might find Joscha Bach's view interesting...
