I generally like the idea of (for example) somehow finding the concept "I am being helpful" in the world-model and flagging it as "goal!", and then running an algorithm that chooses actions that increase the probability that that concept is true.
In fact, that kind of thing seems to me like the only way to get an AGI to be trying to do certain things that it can't learn by experiencing reward—I have an example in this comment here.
Then there are a few things I'm concerned about.
First, making sure you find the right concept.
Second, "different aspects of the value-function duking it out". I don't see how you can set up a goal without it possibly manifesting as multiple subagents working at cross-purposes, and if one can sabotage the others then you wind up with a quite different goal than what you started with. Like "name a three-digit prime number" seems like a single thing in the world-model that we can flag as a "goal", but actually I think it would break into lots of compositional pieces like "I'm going to name one number", "it's prime", "it's three digits". You can say "No problem, we'll just multiply the probabilities of those three components" or whatever, but the problem is that thoughts can abstain to make predictions about certain things (think of logical induction for example, or "what color is the concept of multiplication?"), and then you wind up allowing thoughts that are purely advancing one of the subgoals and maybe not making any claims about how they'll impact the other subgoals, and it turns out that they're bad from the perspective of the other subgoals. Something like that anyway…? I'm hoping there's some notion of "conservatism" that helps here ("no thoughts are allowed unless they actively advance all goal components") but it's pretty vague in my head, I don't know how to make sure that actually works.
Third, making sure that whatever concept we flag as a goal doesn't have problematic instrumental subgoals (incorrigibility, etc.)
Fourth, when we want the system to solve hard problems like doing original research or inventing new inventions, I think we need to allow the system to discover new concepts as it runs, and add them to the library. And I think we need to allow it to update "what it's trying to do" in ways that reference those new concepts (or, for that matter, that reference old concepts). (See discussion in section 7.2 here.) So then we face "ontological crisis" type problems, where the concept flagged as a goal winds up morphing somehow, and/or goal drift.
On the "duking it out" issue specifically: one solution is to just give every component a veto. As long as different components mostly care about different things and/or can "trade" with each other, it should be possible to find pareto improvements acceptable to all of them.
I think a big part of the safety challenge for this type of approach is the thing I called "the 1st-person problem" here (Section 1.1.3). It seems easy enough to get a computer to learn a concept like "Alice is following human norms" by passive observation, but what we really want is the concept of "I am following human norms", which is related but different. You could have the computer learn it actively by allowing it to do stuff and then labeling it, but then you're really labeling "I am following human norms (as far as the humans can tell)", which is different from what we want in an obviously problematic way. One way around that would be to solve transparency and thus correctly label deceptive actions, but I don't have any idea how to do that reliably. Another possible approach might be actually figuring out how to fiddle with the low-level world-model variables to turn the "Alice is following human norms" concept into the "I am following human norms" concept. I'm not sure if that works either, but anyway, I'm planning to think more about this when I get a chance, starting with how it works in humans, and of course I'm open to ideas.
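One way to picture the gap (purely a toy sketch, with a made-up predicate): passive observation naturally gives you the concept with a third-person agent slot already filled in, whereas what we want is the same concept with that slot bound to the system's own self-pointer.

```python
# Hypothetical stand-in for a concept learned from passive observation.
def follows_human_norms(agent, world_state):
    ...  # some learned function over the world-model

# What passive observation gives us - a third-person concept:
alice_concept = lambda world: follows_human_norms("Alice", world)

# What we actually want to flag as the goal - the same concept with the
# agent slot rebound to the system's self-pointer:
self_concept = lambda world: follows_human_norms(world["self"], world)
```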
In terms of data structures, I usually think about this sort of thing in terms of a map with a self pointer.
Suppose our environment is a Python list, and the "model" is another Python list which is itself one of the environment's elements. Two key points: the model lives inside the environment, and we want the model to accurately represent the environment.
So: how can we make the model match the environment (i.e. make the model equal, element-for-element, to the environment that contains it)?
The trick is quite similar to a typical quine. We can model the environment excluding the model easily enough: list out the other contents and leave a ??? slot where the model's representation of itself should go. But then the ??? part has to match the model, and we can't point to the whole map - there is no index which contains the map. So, we have to drop another copy in there: the ??? slot holds its own copy of the other contents plus another ??? slot. Now we have almost the same problem: we need to replace the inner ??? with something. But this time, we can point it at something in the map: we can point it at the nested copy itself. So, the final representation is the other contents, plus a nested copy of them whose last slot is a pointer back to that nested copy.
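Spelled out with actual Python lists (the particular environment contents are made up; the point is just the self-pointer):

```python
# The environment's non-model contents here ("sky", "tree") are invented.
inner = ["sky", "tree", None]
inner[2] = inner                 # the self-pointer: the nested copy's last
                                 # slot points back at the nested copy itself
model = ["sky", "tree", inner]   # the model: other contents + nested copy
env   = ["sky", "tree", model]   # the environment contains the model

# The model now matches the environment. (Python's list == short-circuits on
# object identity, so this recursive comparison terminates.)
assert model == env
```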
Seems like the keys to training something like this would be:
Actually getting the self-model pointed at the goal we want would be a whole extra step. Not sure how that would work, other than using transparency tools to explicitly locate the self-model and plug in a goal.
Why not just have a "my model of" thing in the model, so you can have both "this door" and "my model of" + "this door" = "my model of this door"? (Of course I'm assuming compositionality here, but whatever, I always assume compositionality. This is the same kind of thing as "Carol's" + "door" = "Carol's door".) What am I missing? I didn't use any quines. Seems too simple… :-P
The thing I'm interested in re "1st-person problem" is slightly different than that, I think, because your reply still assumes a passive model, I think, whereas I think we're going to need an AGI that "does things"—even if it's just "thinking thoughts"—for reasons discussed in section 7.2 here. So there would be a bunch of 1st-person actions / decisions thrown into the mix.
The main issue with "my model of" + "this door" = "my model of this door", taken literally, is that there's no semantics. It's the semantics which I expect to need something quine-like.
Adding actions is indeed a big step, and I still don't know the best way to do that. Main strategies I've thought about are:
The main issue with "my model of" + "this door" = "my model of this door", taken literally, is that there's no semantics. It's the semantics which I expect to need something quine-like.
I think you're saying that I'm proposing how to label everything but not describing what those things are or do. (Correct?) I guess I'd say we learn general rules to follow with the "my model of" piece-of-thought, and exceptions to those rules, and exceptions to the exceptions, etc. Like "the relation between my-model-of-X and my-model-of-Y is the same as the relation between X and Y" could be an imperfect rule with various exceptions. See my "Python code runs the same on Windows and Mac" example here.
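As a toy sketch of that rule-plus-exceptions picture (everything here is made up): the default is that my-model-of-X inherits X's relations, and learned exceptions override the default.

```python
# Made-up illustration of "general rule with learned exceptions" for the
# my-model-of operator.
def my_model_relation(relation, x, y, exceptions):
    """Default rule: the relation between my-model-of-X and my-model-of-Y is
    the same as the relation between X and Y; a learned exception overrides."""
    key = (relation.__name__, x, y)
    if key in exceptions:
        return exceptions[key]
    return relation(x, y)

def heavier_than(x, y):
    made_up_weights = {"bookshelf": 40, "door": 25}   # kg, invented numbers
    return made_up_weights[x] > made_up_weights[y]

exceptions = {}   # cases where I know my model diverges from reality
print(my_model_relation(heavier_than, "bookshelf", "door", exceptions))  # True
```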
You say "formal", which I guess is fine, but I think most people associate "formal" with "everything has a strict all-or-nothing mathematical definition", whereas I think the data structure would turn out to have everything being fuzzy, like things can range continuously from "100% totally a bookshelf" to "0% absolutely not a bookshelf", or "a bookshelf in the context of a certain movie where it's viewed from a particular angle and being used in a particular way, but not in other contexts", etc. etc. (So the smart contract would have to be something like "if we provide Bird with the following documents and CCTV footage and files, presented in the following order, then Bird will assign >99% truthiness to the statement 'Party A has tried in good faith to put the strawberry on the plate as further described in the following paragraphs…'") We can still call that "formal" insofar as there's a mathematical function that anyone can evaluate on the same data and get the same answer, just as a particular trained ConvNet image classifier can be called a "formally specified" function, i.e. specified by its list of weights and so on. I'm not sure if that's what you meant.
This is a fictional snippet from the AI Vignettes Day. I do not think this is a likely future, but it’s a useful future to think about - it gives a different lens for thinking about the alignment problem and potential solutions. It’s the sort of future I’d expect if my own research went far better than I actually expect, saw rapid adoption, and most other ML/AI research stalled in the meantime.
[Transcript from PyCon 2028 Lightning Talk. Lightly edited for readability.]
Ok, so, today we’re going to talk about Bird, and especially about the future of Bird and machine learning.
First of all, what is Bird? I assume everyone here has heard of it and played around with it, but it’s hard to describe exactly what it is. You’ve maybe heard the phrase “human concept library”, but obviously this isn’t just a graph connecting words together.
Some background. In ye olden days (like, ten years ago) a lot of people thought that the brain’s internal data structures were inherently illegible, an evolved hodgepodge with no rhyme or reason to it. And that turned out to be basically false. At the low level, yeah, it’s a mess, but the higher-level data structures used by our brains for concept-representation are actually pretty sensible mathematical structures with some nice universal properties.
Those data structures are the main foundation of Bird.
When you write something like “from Bird.World import Bookshelf”, the data structure you’re importing - the Bookshelf - is basically an accurate translation of the data structure your own brain uses to represent a bookshelf. And of course it’s hooked up to a whole world-model, representing all the pieces of a bookshelf, things you put on a bookshelf, where you’d find a bookshelf, etc, as well as the grounding of all those things in a lower-level world model, and their ultimate connection to sensors/actuators. But when writing code using Bird, we usually don’t have to explicitly think about all that. That’s the beauty of the data structures: we can write code which intuitively matches the way we think about bookshelves, and end up with robust functionality.
Functionally, Bird is about translation. It’s a high-level language very close to the structure of human thought. But unlike natural language, it’s fully formally specified, and those formal specifications (along with Bird’s standard training data set) accurately capture our own intuitive concepts. They accurately translate human concepts into formal specifications. So, for instance, we can express the idea of “put a strawberry on a plate” in Bird, hand that off to an ML algorithm as a training objective, and it will actually figure out how to put a strawberry on a plate, rather than Goodharting the objective. The objective actually correctly represents the thing we’re intuitively saying.
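To make that concrete, the code looks roughly like this - I'm simplifying, and these particular names are just for illustration:

```python
# (Slide code - simplified for the talk; these Bird names are illustrative.)
from Bird.World import Strawberry, Plate
from Bird.Relations import OnTopOf

# A formally specified objective that matches the intuitive concept:
objective = OnTopOf(Strawberry(), Plate())

# Hand it to whatever ML system you like as the training objective, e.g.:
#   trainer.fit(policy, objective=objective)
```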
That’s the vision, anyway. The problem is that there’s still some assumed social context which Bird doesn’t necessarily capture - like, if I write “put a strawberry on a plate”, the implicit context includes things like “don’t kill anyone in the process”. Bird won’t include that context unless we explicitly add it. It accurately captures “put a strawberry on a plate”, but nothing else.
Today, of course, Bird’s main use-case is for smart contracts. For that use-case, it’s fine to not include things like “don’t kill anyone”, because we have social norms and legal structures to enforce all that already. So for contracts between humans, it’s great. Thus the big resurgence in smart contracts over the past few years: we can finally formally specify contracts which actually do what we want.
But for ML systems, that doesn’t really cut it. ML systems aren’t human, they won't strictly follow social norms and laws unless we program - or train - them to do so.
The obvious solution to this is to express things like “follow the law” or “obey social norms” or ideally “do what I mean, not what I say” in Bird. But this is all tightly tied in with things like agency, self-reference, and goal-directedness. We don’t yet fully understand the right data structures for those things - self-reference makes the math more complicated. That’s the main missing piece in Bird today. But it is an active research area, so hopefully within the next few years we’ll be able to formally specify “what we want”.