[ Question ]

Does Agent-like Behavior Imply Agent-like Architecture?

by Scott Garrabrant, 23rd Aug 2019 · 1 min read · 7 comments


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is not a well-specified question. I don't know what "agent-like behavior" or "agent-like architecture" should mean. Perhaps the question should be "Can you define the fuzzy terms such that 'Agent-like behavior implies agent-like architecture' is true, useful, and in the spirit of the original question?" I mostly think the answer is no, but it seems like it would be really useful to know if it were true, and the process of trying to make it true might help us triangulate what we should mean by agent-like behavior and agent-like architecture.

Now I'll say some more to try to communicate the spirit of the original question. First, a giant look-up table (GLUT) is not (directly) a counterexample. This is because it might be that the only way to produce an agent-like GLUT is to use agent-like architecture to search for it. Similarly, a program that outputs all possible GLUTs is also not a counterexample, because you might have to use your agent-like architecture to point at the specific counterexample. A longer version of the conjecture is "If you see a program that implements agent-like behavior, there must be some agent-like architecture in the program itself, in the causal history of the program, or in the process that brought your attention to the program." The pseudo-theorem I want is similar to the claim that correlation really does imply causation, or to the good regulator theorem.
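To make the GLUT point concrete, here is a minimal Python sketch (not from the original post). The toy world of five positions on a line and the names `model`, `search_policy`, `GLUT`, and `glut_policy` are all hypothetical, chosen only for illustration. The table behaves exactly like the searcher but contains no model, goal, or search at run time; the search lives in the table's causal history.

```python
# Hypothetical toy world: positions 0..4 on a line, with the goal at position 3.
STATES = range(5)
ACTIONS = (-1, 0, +1)
GOAL = 3

def model(state, action):
    """Toy world model: predicted next position, clipped to the line."""
    return min(max(state + action, 0), 4)

def search_policy(state):
    """Agent-like: search over actions for the one predicted to land nearest the goal."""
    return min(ACTIONS, key=lambda a: abs(model(state, a) - GOAL))

# "Compile" the searcher into a look-up table (tiny here, giant in general).
GLUT = {s: search_policy(s) for s in STATES}

def glut_policy(state):
    """No model, no goal, no search at run time: just a table read."""
    return GLUT[state]

# The two policies are behaviorally indistinguishable on every state.
assert all(glut_policy(s) == search_policy(s) for s in STATES)
```

In this sketch, inspecting `glut_policy` alone reveals nothing agent-like; you have to look at how `GLUT` was built.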

One way of defining agent-like behavior is as that which can only be produced by an agent-like architecture. This makes the theorem trivial, and the challenge is making the theorem non-vacuous. In this light, the question is something like "Is there some nonempty class of architectures that can reasonably be described as a subclass of 'agent-like' such that the class can be equivalently specified either functionally or syntactically?" This looks like it might conflict with the spirit of Rice's theorem, but I think making it probabilistic and referring to the entire causal history of the algorithm might give it a chance of working.

One possible way of defining agent-like architecture is something like "Has a world model and a goal, and searches over possible outputs to find one such that the model believes that output leads to the goal." Many words in this will have to be defined further. World model might be something that has high logical mutual information with the environment. It might be hard to define search generally enough to include everything that counts as search. There also might be completely different ways to define agent-like architecture. Do whatever makes the theorem true.
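As a rough illustration of the "world model + goal + search" shape described above, here is a small Python sketch. Everything in it is an assumption made for illustration: the class name `SearchAgent`, integer states and outputs, the brute-force search over short output sequences, and the `horizon` parameter. It does not attempt to capture "logical mutual information" or a fully general notion of search.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Optional, Sequence, Tuple

@dataclass
class SearchAgent:
    world_model: Callable[[int, int], int]  # (state, output) -> predicted next state
    goal: Callable[[int], bool]             # predicate on predicted states
    outputs: Sequence[int]                  # candidate outputs to search over
    horizon: int = 3                        # longest plan the search will consider

    def predict(self, state: int, plan: Tuple[int, ...]) -> int:
        """Roll the world model forward along a sequence of outputs."""
        for out in plan:
            state = self.world_model(state, out)
        return state

    def choose(self, state: int) -> Optional[Tuple[int, ...]]:
        """Return the first shortest plan the model believes reaches the goal, else None."""
        for length in range(self.horizon + 1):
            for plan in product(self.outputs, repeat=length):
                if self.goal(self.predict(state, plan)):
                    return plan
        return None

# Usage: the model says outputs add to the state; the goal is to reach 5.
agent = SearchAgent(world_model=lambda s, a: s + a,
                    goal=lambda s: s == 5,
                    outputs=(1, 2))
print(agent.choose(0))  # (1, 2, 2)
```

Note that the agent only checks what its model predicts, not what the world actually does, which matches the "the model believes that output leads to the goal" wording.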


2 Answers

Conjecture: Every short proof of agentic behavior points out agentic architecture.

Let's say "agent-like behavior" is "taking actions that are more-likely-than-chance to create an a-priori-specifiable consequence" (this definition includes bacteria).
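One hedged way to operationalize "more-likely-than-chance" is to compare a policy's success rate against a random baseline. The Python sketch below assumes a made-up one-step world (food is on the left or the right, loosely in the spirit of bacterial chemotaxis); the names `success_rate`, `gradient_follower`, `random_mover`, `is_agent_like`, and the `margin` threshold are all hypothetical.

```python
import random

SIDES = ("left", "right")

def success_rate(policy, trials=10_000, seed=0):
    """Fraction of trials in which the policy's action reaches the food."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        food = rng.choice(SIDES)    # the a-priori-specified consequence: reach the food
        action = policy(food, rng)  # the policy observes which side the food is on
        hits += action == food
    return hits / trials

def gradient_follower(observation, rng):
    """Uses information about the world: move toward the observed food."""
    return observation

def random_mover(observation, rng):
    """Ignores the observation: the 'chance' baseline."""
    return rng.choice(SIDES)

def is_agent_like(policy, baseline=random_mover, margin=0.05):
    """Behavioral test: does the policy beat the chance baseline by a margin?"""
    return success_rate(policy) > success_rate(baseline) + margin

print(is_agent_like(gradient_follower))  # True
print(is_agent_like(random_mover))       # False
```

This is purely a behavioral test: it never looks inside the policy, which is exactly why it cannot by itself say anything about architecture.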

Then I'd say this requires "agent-like processes", involving (at least) all 4 of: (1) having access to some information about the world (at least the local environment), including in particular (2) how one's actions affect the world. This information can come either baked into the design (bacteria, giant lookup table), and/or from previous experience (RL), and/or via reasoning from input data. It also needs (3) an ability to use this information to choose actions that are likelier-than-chance to achieve the consequence in question (again, the outcome of this search process could be baked into the design like bacteria, or it could be calculated on-the-fly like human foresight), and of course (4) a tendency to actually execute those actions in question.

I feel like this is almost trivial, like I'm just restating the same thing in two different ways... I mean, if there's no mutual information between the agent and the world, its actions can be effective only insofar as the exact same action would be effective when executed in a random location of a random universe. (Does contracting your own muscle count as "accomplishing something without any world knowledge"?)
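One way to spell out that step, as a sketch under assumptions of my own choosing: let A be the action taken and W the world state (treating "the agent" as just its action variable), and assume success is a deterministic function of the pair (a, w).

```latex
% Zero mutual information is equivalent to independence of action and world:
\[
I(A;W) = 0 \iff P(a, w) = P(a)\,P(w) \quad \text{for all } a, w .
\]
% So the success probability is an average over fixed actions, each scored on a random world:
\[
\Pr[\text{success}]
  = \sum_{a} P(a) \sum_{w} P(w)\,\mathbf{1}[a \text{ succeeds in } w]
  \;\le\; \max_{a} \Pr_{w \sim P}[a \text{ succeeds in } w] .
\]
```

In words: with zero mutual information, the agent does no better than the single best action chosen blind to the world.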

Anyway, where I'm really skeptical here is in the term "architecture". "Architecture" in everyday usage usually implies software properties that are obvious parts of how a program is built, and probably put in on purpose. (Is there a more specific definition of "architecture" you had in mind?) I'm pretty doubtful that the ingredients 1-4 have to be part of the "architecture" in that sense. For example, I've been thinking a lot about self-supervised learning algorithms, which have ingredient (1) by design and have (3) sorta incidentally. The other two ingredients (2) and (4) are definitely not part of the "architecture" (in the sense above). But I've argued that they can both occur as unintended side-effects of its operation: See here, and also here for more details about (2). And thus I argue at that first link that this system can have agent-like behavior.

(And what's the "architecture" of a bacterium anyway? Not a rhetorical question.)

Sorry if this is all incorrect and/or not in the spirit of your question.