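Within the context of AI safety, why do we want a theory of agency? I see two main reasons: describing existing agenty systems (biological organisms, humans, markets and so on), and designing AI systems with strong properties.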
The ideas involved overlap to a large extent, but I’ve noticed some major differences in what kinds of questions researchers ask, depending on which of these two goals they’re usually thinking about.
One type of difference seems particularly central: results which identify one tractable design within some class, vs results which characterize all designs within a class.
This difference suggests different trade-offs:
As an example, consider logical induction. Logical induction was a major step forward in designing agent systems with strong properties - e.g. eventually having sane beliefs over logical statements despite finite resources. On the other hand, it mostly doesn’t help us describe existing agenty systems - bacteria or cats or (more debatably) humans probably don’t have embedded logical inductors.
Diving more into different questions/subgoals:
I’ve been pointing out differences, but of course there’s a huge amount of overlap between the theoretical problems of these two use-cases. Most of the problems of embedded world-models are central to both use-cases, as is the lack of a Cartesian boundary and all the problems which stem from that.
My general impression is that most MIRI folks (at least within the embedded agents group) are more focused on the AI design angle. Personally, my interest in embedded agents originally came from wanting to describe biological organisms, neural networks, markets and other real-world systems as agents, so I’m mostly focused on describing existing agents. I suspect that a lot of the disagreements I have with e.g. Abram stem from these differing goals.
In terms of how the two use-cases “should” be prioritized, I certainly see both as necessary for best-case AI outcomes. Description of existing agents seems more directly relevant to human alignment: in order to reliably point to human values, we need a theory of how things made of atoms can coherently “want” things in a world they can’t fully model. AI design problems seem more related to “scaling up” a human-aligned AI, i.e. having it self-modify and copy itself and whatnot without wireheading or value drift.
I’d be interested to hear from agency researchers who focus more on the design use-case whether all this sounds like an accurate representation of how you’re thinking, or whether I’m totally missing the target here. Also, if you think that design-type use-cases really are more central to AI safety than description-type use-cases, or that the distinction isn’t useful at all, I’d be interested to hear why.
Yeah. As someone more on the description side, I would say the problems are even more different than you make out, because the description problem isn't just "find a model that fits humans." It's figuring out ways to model the entire world in terms of the human-perceived environment, and using the right level of description in the right context.