Formal Metaethics and Metasemantics for AI Alignment
New MetaEthical.AI Summary and Q&A at UC Berkeley
This time I tried to focus less on the technical details and more on providing the intuition behind the principles guiding the project. I'm grateful for questions and comments from Stuart Armstrong and the AI Safety Reading Group. I've posted the slides on Twitter.
Abstract: We construct a fully technical ethical goal function for AI by directly tackling the philosophical problems of metaethics and mental content. To simplify our reduction of these philosophical challenges into "merely" engineering ones, we suppose that unlimited computation and a complete low-level causal model of the world and the adult human brains in it are available.
Given such a model, the AI attributes beliefs and values to a brain in two stages. First, it identifies the syntax of a brain's mental content by selecting a decision algorithm which is i) isomorphic to the brain's causal processes and ii) best compresses its behavior while iii) maximizing charity. The semantics of that content then consists first in sense data that primitively refer to their own occurrence and then in logical and causal structural combinations of such content.
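To make the selection step concrete, here is a toy sketch in Python. It is not the project's formalization. The `Candidate` fields, the `charity_weight` parameter, and the way compression and charity are traded off in a single score are all assumptions made for illustration; the abstract names the three criteria but not how they are combined.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """A hypothetical candidate decision algorithm for one brain."""
    name: str
    isomorphic: bool         # assumed flag: mirrors the brain's causal processes
    description_bits: float  # assumed cost: bits needed to encode the behavior
    charity: float           # assumed score in [0, 1]: how rational the agent looks


def select_decision_algorithm(
    candidates: List[Candidate], charity_weight: float = 10.0
) -> Optional[Candidate]:
    """Keep isomorphic candidates, then prefer short and charitable ones.

    The linear trade-off below is an illustrative assumption only.
    """
    viable = [c for c in candidates if c.isomorphic]
    if not viable:
        return None
    return min(viable, key=lambda c: c.description_bits - charity_weight * c.charity)


candidates = [
    Candidate("lookup_table", True, 900.0, 0.1),
    Candidate("planner", True, 120.0, 0.8),
    Candidate("tiny_but_wrong", False, 10.0, 0.9),
]
best = select_decision_algorithm(candidates)
print(best.name if best else "no viable candidate")
```

In this toy setup the non-isomorphic candidate is excluded no matter how short it is, which reflects treating isomorphism as a hard constraint rather than as one more term in the score.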
The resulting decision algorithm can capture how we decide what to do, but it can also identify the ethical factors that we seek to determine when we decide what to value or even how to decide. Unfolding the implications of those factors, we arrive at what we should do. All together, this allows us to imbue the AI with the necessary concepts to determine and do what we should program it to do.
What do you see as advantages and disadvantages of this design compared to something like Paul Christiano's 2012 formalization of indirect normativity? (One thing I personally like about Paul's design is that it's more agnostic about meta-ethics, and I worry about your stronger meta-ethical assumptions, which I'm not very convinced about. See metaethical policing for my general views on this.)
How worried are you about this kind of observation? People's actual moral views seem at best very under-determined by their "fundamental norms", with their environment and specifically what status games they're embedded in playing a big role. If many people are currently embedded in games that cause them to want to freeze their morally relevant views against further change and reflection, how will your algorithm handle that?
I agree that people's actual moral views don't track all that well with correct reasoning from their fundamental norms. Normative reasoning is just one causal influence on our views, but there are plenty of biases, such as those from status games, that also play a causal role. That's no problem for my theory. It just carefully avoids the distortions and focuses on the paths with correct reasoning to determine the normative truths. In general, our conscious desires and first-order views don’t matter that much on my view unless they are endorsed by the standards we implicitly appeal to when reflecting.
If anything, these status games and other biases are much more of a problem for Paul’s indirect normativity since Paul pursues extrapolation by simulating the entire person, which includes their normative reasoning but also all their biases. Are the emulations getting wiser or are they stagnating in their moral blindspots, being driven subtly insane by the strange, unprecedented circumstances or simply gradually becoming different people whose values no longer reflect the originals’?
I’m sure there are various clever mechanisms that can mitigate some of this (while likely introducing other distortions), but from my perspective, these just seem like epicycles trying to correct for garbage input. If what we want is better normative reasoning, it’s much cleaner and more elegant to precisely understand that process and extrapolate from that, not from the entire contradictory, kludgy mess of a human brain.
Given the astronomical stakes, I'm just not satisfied with trusting in any humans’ morality. Even the most virtuous people in history are inevitably seen in hindsight to have glaring moral flaws. Hoping you pick the right person who will prove to be an exception is not much of a solution. I think aligning superintelligence requires superhuman performance on ethics.
Now I’m sympathetic to the concern that a metaethical implementation may be brittle but I’d prefer to address this metaphilosophically. For instance, we should be able to metasemantically check whether our concept of ‘ought’ matches with the metaethical theory programmed into the AI. Adding in metaethics, we may be able to extend that to cases where we ought to revise our concept to match the metaethical theory. In an ideal world, we would even be able to program in a self-correcting metaethical / metaphilosophical theory such that so long as it starts off with an adequate theory, it will eventually revise itself into the correct theory. Of course, we’d still want to supplement this with additional checks such as making it show its work and evaluating that along with its solutions to unrelated philosophical problems.
Very interesting! More interesting to me than the last time I looked through your proposal, both because of some small changes I think you've made but primarily because I'm a lot more amenable to this "genre" than I was.
I'd like to encourage a shift in perspective from having to read preferences from the brain, to being able to infer human preferences from all sorts of human-related data. This is related to another shift from trying to use preferences to predict human behavior in perfect detail, to being content to merely predict "human-scale" facts about humans using an agential model.
These two shifts are related by the conceptual change from thinking about the human preferences as "in the human," thus being inextricably linked to understanding humans on a microscopic level, to thinking about human preferences as "in our model of the human" - as being components that need to be understood as elements of an intentional-stance story we're telling about the world.
This of course isn't to say that brains have no mutual information with values. But rather than having two separate steps in your plan like "first, figure out human values" and "later, fit those human values into the AI's model of the world," I wonder if you've explored how it could work for the AI to try to figure out human values while simultaneously locating them within a way (or ways) of modeling the world.
My aim is to specify our preferences and values in a way that is as philosophically correct as possible in defining the AI's utility function. It's compatible with this that in practice, the (eventual scaled down version of the) AI would use various heuristics and approximations to make its best guess based on "human-related data" rather than direct brain data. But I do think it's important for the AI to have an accurate concept of what these are supposed to be an approximation to.
But it sounds like you have a deeper worry that intentional states are not really out there in the world, perhaps because you think all that exists are microphysical states. I don't share that concern because I'm a functionalist or reductionist instead of an eliminativist. Physical states get to count as intentional states when they play the right role, in which case the intentional states are real. I bring in Chalmers on when a physical system implements a computation and combine it with Dennett's intentional stance to help specify what those roles are.
I don't think we disagree too much, but what does "play the right functional role" mean, since my desires are not merely about what brain-state I want to have, but are about the real world? If I have a simple thermostat where a simple bimetallic spring opens or closes a switch, I can't talk about the real-world approximate goals of the thermostat until I know whether the switch goes to the heater or to the air conditioner. And if I had two such thermostats, I would need the connections to the external world to figure out if they were consistent or inconsistent.
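The thermostat point can be shown with a small simulation. This is a sketch under stated assumptions: the threshold, the one-degree step, and the function names are invented for illustration, and the room has no dynamics other than the attached device.

```python
def switch_closed(temp: float, threshold: float = 20.0) -> bool:
    """Internal mechanism: the spring closes the switch when it is cold."""
    return temp < threshold


def simulate(device_effect: float, start_temp: float, steps: int = 50) -> float:
    """Run the same switch wired to a device that changes temperature.

    device_effect is the assumed temperature change per step while the
    switch is closed: positive for a heater, negative for an air conditioner.
    """
    temp = start_temp
    for _ in range(steps):
        if switch_closed(temp):
            temp += device_effect
    return temp


wired_to_heater = simulate(device_effect=+1.0, start_temp=15.0)
wired_to_ac = simulate(device_effect=-1.0, start_temp=15.0)
print(wired_to_heater, wired_to_ac)
```

The switch logic is identical in both runs. Wired to the heater, the system settles at the threshold; wired to the air conditioner, it drives the temperature ever lower. Any "goal" we attribute shows up only once the external connection is specified.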
In short, the important functional role that my desires play does not just take place intra-cranially, they function in interaction with my environment. If you were a new superintelligence, and the first thing you found was a wireheaded human, you might conclude that humans value having pleasurable brain states. If the first thing you found were humans in their ancestral environment, you might conclude that they value nutritious foods or producing healthy babies. The brains are basically the same, but the outside world they're hooked up to is different.
So from the premises of functionalism, we get a sort of holism.
I think the simplest intentional systems just refer to their own sensory states. It's true that we are able to refer to external things, but that's not because our cognitive states somehow have different external causes from those of such simple systems. External reference is earned by reasoning in such a way that attributing content like 'the cause of this and that sensory state ...' is a better explanation of our brain's dynamics and behavior than just 'this sensory state', e.g. reasoning in accordance with the axioms of Pearl's causal models. This applies to the content of both our beliefs and desires.
In philosophical terms, you seem to be thinking in terms of a causal theory of reference whereas I'm taking a neo-descriptivist approach. Both theories acknowledge that one aspect of meaning is what terms refer to and that obviously depends on the world. But if you consider cases like 'creature with a heart' and 'creature with a kidney' which may very well refer to the same things but clearly still differ in meaning, you can start to see there's more to meaning than reference.
Neo-descriptivists would say there's an intension, which is roughly a function from possible worlds to the term's reference in that world. It explains how reference is determined and, unlike reference, does not depend on the external world. This makes it well-suited to explaining cognition and behavior in terms of processes internal to the brain, which might otherwise look like spooky action at a distance if you tried explaining in terms of external reference. In the context of my project, I define intension here. See also Chalmers on two-dimensional semantics.
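As a rough illustration of "a function from possible worlds to the term's reference", here is a toy Python sketch. The representation of a world as a mapping from creatures to organs, and every name in it, is an assumption made for the example; it is not the definition used in the project.

```python
from typing import Callable, Dict, FrozenSet

# A toy "possible world": each creature mapped to the organs it has.
World = Dict[str, FrozenSet[str]]
# An intension maps a world to the term's extension in that world.
Intension = Callable[[World], FrozenSet[str]]


def creature_with(organ: str) -> Intension:
    """Build the intension of 'creature with a <organ>'."""
    def intension(world: World) -> FrozenSet[str]:
        return frozenset(c for c, organs in world.items() if organ in organs)
    return intension


with_heart = creature_with("heart")
with_kidney = creature_with("kidney")

actual_world: World = {
    "dog": frozenset({"heart", "kidney"}),
    "cat": frozenset({"heart", "kidney"}),
}
# A hypothetical world containing a creature with a heart but no kidney.
other_world: World = {
    "dog": frozenset({"heart", "kidney"}),
    "zorb": frozenset({"heart"}),
}

same_reference_here = with_heart(actual_world) == with_kidney(actual_world)
same_reference_there = with_heart(other_world) == with_kidney(other_world)
print(same_reference_here, same_reference_there)
```

The two terms pick out the same creatures in the toy actual world but come apart in the other one, so they are different functions even though their reference coincides. Each function is also fully specified without consulting which world is actual, which is the sense in which an intension does not depend on the external world.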
No, I'm definitely being more descriptivist than causal-ist here. The point I want to get at is on a different axis.
Suppose you were Laplace's demon, and had perfect knowledge of a human's brain (it's not strictly necessary to pretend determinism, but it sure makes the argument simpler). You would have no need to track the human's "wants" or "beliefs," you would just predict based on the laws of physics. Not only could you do a better job than some human psychologist on human-scale tasks (like predicting in advance which button the human will press), you would be making information-dense predictions about the microphysical state of the human's brain that would be totally beyond a model of humans coarse-grained to the level of psychology rather than physics.
So when you say "External reference is earned by reasoning in such a way that attributing content like 'the cause of this and that sensory state ...' is a better explanation", I totally agree, but I want to emphasize: better explanation for whom? If we somehow built Laplace's demon, what I'd want to tell it is something like "model me according to my own standards for intentionality."