This is a fictional snippet from the AI Vignettes Day. I do not think this is a likely future, but it’s a useful future to think about - it gives a different lens for thinking about the alignment problem and potential solutions. It’s the sort of future I’d expect if my own research went far better than I actually expect, saw rapid adoption, and most other ML/AI research stalled in the meantime.

[Transcript from PyCon 2028 Lightning Talk. Lightly edited for readability.]

Ok, so, today we’re going to talk about Bird, and especially about the future of Bird and machine learning.

First of all, what is Bird? I assume everyone here has heard of it and played around with it, but it’s hard to describe exactly what it is. You’ve maybe heard the phrase “human concept library”, but obviously this isn’t just a graph connecting words together.

Some background. In ye olden days (like, ten years ago) a lot of people thought that the brain’s internal data structures were inherently illegible, an evolved hodgepodge with no rhyme or reason to it. And that turned out to be basically false. At the low level, yeah, it’s a mess, but the higher-level data structures used by our brains for concept-representation are actually pretty sensible mathematical structures with some nice universal properties.

Those data structures are the main foundation of Bird.

When you write something like “from Bird.World import Bookshelf”, the data structure you’re importing - the Bookshelf - is basically an accurate translation of the data structure your own brain uses to represent a bookshelf. And of course it’s hooked up to a whole world-model, representing all the pieces of a bookshelf, things you put on a bookshelf, where you’d find a bookshelf, etc., as well as the grounding of all those things in a lower-level world model, and their ultimate connection to sensors/actuators. But when writing code using Bird, we usually don’t have to explicitly think about all that. That’s the beauty of the data structures: we can write code which intuitively matches the way we think about bookshelves, and end up with robust functionality.
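
To give a rough flavor (this is just a toy stand-in, not Bird's actual data structures), you can picture a concept as a node in a world-model graph, linked to other concepts by named relations:

```python
# Toy stand-in only; every name here is illustrative, not part of any real library.
from dataclasses import dataclass, field

@dataclass
class Concept:
    name: str
    relations: dict = field(default_factory=dict)  # relation name -> list of related Concepts

    def relate(self, relation: str, other: "Concept") -> None:
        self.relations.setdefault(relation, []).append(other)

bookshelf = Concept("Bookshelf")
book = Concept("Book")
living_room = Concept("LivingRoom")

bookshelf.relate("holds", book)            # things you put on a bookshelf
bookshelf.relate("found_in", living_room)  # where you'd find a bookshelf

print([c.name for c in bookshelf.relations["holds"]])  # ['Book']
```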

Functionally, Bird is about translation. It’s a high-level language very close to the structure of human thought. But unlike natural language, it’s fully formally specified, and those formal specifications (along with Bird’s standard training data set) accurately capture our own intuitive concepts. They accurately translate human concepts into formal specifications. So, for instance, we can express the idea of “put a strawberry on a plate” in Bird, hand that off to an ML algorithm as a training objective, and it will actually figure out how to put a strawberry on a plate, rather than Goodharting the objective. The objective actually correctly represents the thing we’re intuitively saying.
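
As a toy illustration of that workflow (again, a stand-in, not Bird itself), think of the objective as a predicate over world-model relations rather than over raw sensor data, so "success" means the modeled relation actually holds:

```python
# Toy illustration: the training target is the concept "a strawberry is on a plate"
# itself, expressed over world-model relations, rather than a sensor-level proxy
# that could be Goodharted. The world-model format here is made up.
toy_world = {
    ("on", "strawberry_1", "plate_3"): True,  # relations the world-model currently asserts
    ("on", "strawberry_2", "table_1"): True,
}

def strawberry_on_plate(world: dict) -> bool:
    """True iff the world-model asserts some strawberry is on some plate."""
    return any(
        rel == "on" and a.startswith("strawberry") and b.startswith("plate") and holds
        for (rel, a, b), holds in world.items()
    )

print(strawberry_on_plate(toy_world))  # True
```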

That’s the vision, anyway. The problem is that there’s still some assumed social context which Bird doesn’t necessarily capture - like, if I write “put a strawberry on a plate”, the implicit context includes things like “don’t kill anyone in the process”. Bird won’t include that context unless we explicitly add it. It accurately captures “put a strawberry on a plate”, but nothing else.
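
In toy form (stub predicates, purely illustrative), that means the implicit context only counts if someone writes it in as an explicit extra conjunct:

```python
# Minimal sketch: the implicit social context has to be added explicitly.
# Both predicates are made-up stubs over a toy world-model; nothing fills in
# no_one_is_harmed for you.
def strawberry_on_plate(world: dict) -> bool:
    return world.get(("on", "strawberry_1", "plate_3"), False)

def no_one_is_harmed(world: dict) -> bool:
    return not world.get("harm_occurred", False)

def objective(world: dict) -> bool:
    return strawberry_on_plate(world) and no_one_is_harmed(world)

print(objective({("on", "strawberry_1", "plate_3"): True, "harm_occurred": False}))  # True
```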

Today, of course, Bird’s main use-case is for smart contracts. For that use-case, it’s fine to not include things like “don’t kill anyone”, because we have social norms and legal structures to enforce all that already. So for contracts between humans, it’s great. Thus the big resurgence in smart contracts over the past few years: we can finally formally specify contracts which actually do what we want.

But for ML systems, that doesn’t really cut it. ML systems aren’t human; they won’t strictly follow social norms and laws unless we program - or train - them to do so.

The obvious solution to this is to express things like “follow the law” or “obey social norms” or ideally “do what I mean, not what I say” in Bird. But this is all tightly tied in with things like agency, self-reference, and goal-directedness. We don’t yet fully understand the right data structures for those things - self-reference makes the math more complicated. That’s the main missing piece in Bird today. But it is an active research area, so hopefully within the next few years we’ll be able to formally specify “what we want”.


I generally like the idea of (for example) somehow finding the concept "I am being helpful" in the world-model and flagging it as "goal!", and then running an algorithm that chooses actions that increase the probability that that concept is true.

In fact, that kind of thing seems to me like the only way to get an AGI to be trying to do certain things that it can't learn by experiencing reward—I have an example in this comment here.

Then there are a few things I'm concerned about.

First, making sure you find the right concept.

Second, "different aspects of the value-function duking it out". I don't see how you can set up a goal without it possibly manifesting as multiple subagents working at cross-purposes, and if one can sabotage the others then you wind up with a quite different goal than what you started with. Like "name a three-digit prime number" seems like a single thing in the world-model that we can flag as a "goal", but actually I think it would break into lots of compositional pieces like "I'm going to name one number", "it's prime", "it's three digits". You can say "No problem, we'll just multiply the probabilities of those three components" or whatever, but the problem is that thoughts can abstain to make predictions about certain things (think of logical induction for example, or "what color is the concept of multiplication?"), and then you wind up allowing thoughts that are purely advancing one of the subgoals and maybe not making any claims about how they'll impact the other subgoals, and it turns out that they're bad from the perspective of the other subgoals. Something like that anyway…? I'm hoping there's some notion of "conservatism" that helps here ("no thoughts are allowed unless they actively advance all goal components") but it's pretty vague in my head, I don't know how to make sure that actually works.

Third, making sure that whatever concept we flag as a goal doesn't have problematic instrumental subgoals (incorrigibility, etc.) 

Fourth, when we want the system to solve hard problems like doing original research or inventing new inventions, I think we need to allow the system to discover new concepts as it runs, and add them to the library. And I think we need to allow it to update "what it's trying to do" in ways that reference those new concepts (or, for that matter, that reference old concepts). (See discussion in section 7.2 here.) So then we face "ontological crisis" type problems, where the concept flagged as a goal winds up morphing somehow, and/or goal drift.

On the "duking it out" issue specifically: one solution is to just give every component a veto. As long as different components mostly care about different things and/or can "trade" with each other, it should be possible to find pareto improvements acceptable to all of them.

I think a big part of the safety challenge for this type of approach is the thing I called "the 1st-person problem" here (Section 1.1.3). It seems easy enough to get a computer to learn a concept like "Alice is following human norms" by passive observation, but what we really want is the concept of "I am following human norms", which is related but different. You could have the computer learn it actively by allowing it to do stuff and then labeling it, but then you're really labeling "I am following human norms (as far as the humans can tell)", which is different from what we want in an obviously problematic way. One way around that would be to solve transparency and thus correctly label deceptive actions, but I don't have any idea how to do that reliably. Another possible approach might be actually figuring out how to fiddle with the low-level world-model variables to turn the "Alice is following human norms" concept into the "I am following human norms" concept. I'm not sure if that works either, but anyway, I'm planning to think more about this when I get a chance, starting with how it works in humans, and of course I'm open to ideas.

In terms of data structures, I usually think about this sort of thing in terms of a map with a self pointer.

Suppose our environment is a python list E. We wish to represent it using another python list, the "model" M. Two key points:

  • M is a data structure, it can contain pointers, but it's not allowed to contain pointers directly to E or things in E: things in the model can only point directly to other things in the model. For instance, a pointer might literally be represented as an index to a position in M - i.e. a pointer with value 0 would point to M[0].
  • The model M is contained in E itself - for simplicity, we'll say E[0] = M.

So: how can we make the model match the environment (i.e. make M accurately represent E)?

The trick is quite similar to a typical quine. We can model the environment excluding M easily enough: M = [???, E[1], E[2], ...]. But then the ??? part has to match E[0], which is M itself, and we can't point to the whole map - there is no index which contains the map. So, we have to drop another copy in there: M = [[???, E[1], E[2], ...], E[1], E[2], ...]. Now we have almost the same problem: we need to replace the ??? with something. But this time, we can point it at something in the map: we can point it at M[0]. So, the final representation looks like M = [[ptr(0), E[1], E[2], ...], E[1], E[2], ...], where ptr(0) is a pointer to M[0].
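
Here's a runnable toy version of that construction; the Ptr wrapper for index-pointers is just to keep them from being confused with ordinary data, everything else follows the setup above.

```python
# Runnable toy version of the construction above. A pointer is an index into the
# model M (wrapped in Ptr so it isn't confused with ordinary data); the environment
# E contains the model at position 0, and M never points directly into E.
class Ptr:
    def __init__(self, index: int):
        self.index = index

E = [None, "tree", "rock"]                      # E[0] will hold the model itself
M = [[Ptr(0), "tree", "rock"], "tree", "rock"]  # the final representation from above
E[0] = M

def represents(entry, thing, model) -> bool:
    """Does a model entry correctly represent the given piece of the environment?"""
    if isinstance(entry, Ptr):                  # a pointer stands for the model object it indexes
        return model[entry.index] is thing
    if isinstance(entry, list):                 # sub-models represent lists piecewise
        return (isinstance(thing, list) and len(entry) == len(thing)
                and all(represents(e, t, model) for e, t in zip(entry, thing)))
    return entry == thing                       # plain values represent themselves

assert represents(M, E, M)                      # the map matches the territory, including itself
print("self-model checks out")
```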

Seems like the keys to training something like this would be:

  • Make sure the model can support the appropriate kind of pointer.
  • Train in an environment where the system can observe its own internal map.

Actually getting the self-model pointed at the goal we want would be a whole extra step. Not sure how that would work, other than using transparency tools to explicitly locate the self-model and plug in a goal.

Why not just have a "my model of" thing in the model, so you can have both "this door" and "my model of" + "this door" = "my model of this door"? (Of course I'm assuming compositionality here, but whatever, I always assume compositionality. This is the same kind of thing as "Carol's" + "door" = "Carol's door".) What am I missing? I didn't use any quines. Seems too simple… :-P

The thing I'm interested in re "1st-person problem" is slightly different than that, I think, because your reply still assumes a passive model, whereas I think we're going to need an AGI that "does things"—even if it's just "thinking thoughts"—for reasons discussed in section 7.2 here. So there would be a bunch of 1st-person actions / decisions thrown into the mix.

The main issue with "my model of" + "this door" = "my model of this door", taken literally, is that there's no semantics. It's the semantics which I expect to need something quine-like.

Adding actions is indeed a big step, and I still don't know the best way to do that. Main strategies I've thought about are:

  • something predictive-processing-esque
  • keep the model itself passive, but include an agent with actions in the model, and then require correctness of the model-abstraction. (In other words, put an agent in the map, then require map-territory correspondence.)
  • something thermodynamic-esque but not predictive processing. This one seems most promising long-term, but it's also the one I'm still most confused about how to set up.

The main issue with "my model of" + "this door" = "my model of this door", taken literally, is that there's no semantics. It's the semantics which I expect to need something quine-like.

I think you're saying that I'm proposing how to label everything but not describing what those things are or do. (Correct?) I guess I'd say we learn general rules to follow with the "my model of" piece-of-thought, and exceptions to those rules, and exceptions to the exceptions, etc. Like "the relation between my-model-of-X and my-model-of-Y is the same as the relation between X and Y" could be an imperfect rule with various exceptions. See my "Python code runs the same on Windows and Mac" example here.

You say "formal", which I guess is fine, but I think most people associate "formal" with "everything has a strict all-or-nothing mathematical definition", whereas I think the data structure would turn out to have everything being fuzzy, like things can range continuously from "100% totally a bookshelf" to "0% absolutely not a bookshelf", or "a bookshelf in the context of a certain movie where it's viewed from a particular angle and being used in a particular way, but not in other contexts", etc. etc. (So the smart contract would have to be something like "if we provide Bird with the following documents and CCTV footage and files, presented in the following order, then Bird will assign >99% truthiness to the statement 'Party A has tried in good faith to put the strawberry on the plate as further described in the following paragraphs…'") We can still call that "formal" insofar as there's a mathematical function that anyone can evaluate on the same data and get the same answer, just as a particular trained ConvNet image classifier can be called a "formally specified" function, i.e. specified by its list of weights and so on. I'm not sure if that's what you meant.

Yup, that is basically what I meant.