June Ku — LessWrong

25 Min Talk on MetaEthical.AI with Questions from Stuart Armstrong

I agree that people's actual moral views don't track all that well with correct reasoning from their fundamental norms. Normative reasoning is just one causal influence on our views; there are plenty of biases, such as those from status games, that also play a causal role. That's no problem for my theory. It just carefully avoids the distortions and focuses on the paths with correct reasoning to determine the normative truths. In general, our conscious desires and first-order views don’t matter that much on my view unless they are endorsed by the standards we implicitly appeal to when reflecting.

If anything, these status games and other biases are much more of a problem for Paul’s indirect normativity since Paul pursues extrapolation by simulating the entire person, which includes their normative reasoning but also all their biases. Are the emulations getting wiser or are they stagnating in their moral blindspots, being driven subtly insane by the strange, unprecedented circumstances or simply gradually becoming different people whose values no longer reflect the originals’?

I’m sure there are various clever mechanisms that can mitigate some of this (while likely introducing other distortions), but from my perspective, these just seem like epicycles trying to correct for garbage input. If what we want is better normative reasoning, it’s much cleaner and more elegant to precisely understand that process and extrapolate from that, not from the entire contradictory, kludgy mess of a human brain.

Given the astronomical stakes, I'm just not satisfied with trusting in any human’s morality. Even the most virtuous people in history are inevitably seen in hindsight to have glaring moral flaws. Hoping you pick the right person who will prove to be an exception is not much of a solution. I think aligning superintelligence requires superhuman performance on ethics.

Now I’m sympathetic to the concern that a metaethical implementation may be brittle but I’d prefer to address this metaphilosophically. For instance, we should be able to metasemantically check whether our concept of ‘ought’ matches with the metaethical theory programmed into the AI. Adding in metaethics, we may be able to extend that to cases where we ought to revise our concept to match the metaethical theory. In an ideal world, we would even be able to program in a self-correcting metaethical / metaphilosophical theory such that so long as it starts off with an adequate theory, it will eventually revise itself into the correct theory. Of course, we’d still want to supplement this with additional checks such as making it show its work and evaluating that along with its solutions to unrelated philosophical problems.

Ngo's view on alignment difficulty

June Ku4y130

I think philosophy is basically either conceptual analysis to turn an unclear question into a well-defined empirical / mathematical one or normative reasoning about what we ought to do, feel or believe. I’ve developed and programmed a formal theory of metasemantics and metaethics that can explain how to ideally do those. I apply them to construct an ethical goal function for AI. It would take some more work to figure out the details but I think together they also provide the necessary resources to solve metaphilosophy.

25 Min Talk on MetaEthical.AI with Questions from Stuart Armstrong

June Ku4y20

I think the simplest intentional systems just refer to their own sensory states. It's true that we are able to refer to external things but that's not by somehow having different external causes of our cognitive states from that of those simple systems. External reference is earned by reasoning in such a way that attributing content like 'the cause of this and that sensory state ...' is a better explanation of our brain's dynamics and behavior than just 'this sensory state', e.g. reasoning in accordance with the axioms of Pearl's causal models. This applies to the content of both our beliefs and desires.

In philosophical terms, you seem to be thinking in terms of a causal theory of reference whereas I'm taking a neo-descriptivist approach. Both theories acknowledge that one aspect of meaning is what terms refer to and that obviously depends on the world. But if you consider cases like 'creature with a heart' and 'creature with a kidney' which may very well refer to the same things but clearly still differ in meaning, you can start to see there's more to meaning than reference.

Neo-descriptivists would say there's an intension, which is roughly a function from possible worlds to the term's reference in that world. It explains how reference is determined and, unlike reference, does not depend on the external world. This makes it well-suited to explaining cognition and behavior in terms of processes internal to the brain, which might otherwise look like spooky action at a distance if you tried explaining in terms of external reference. In the context of my project, I define intension here. See also Chalmers on two-dimensional semantics.
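To make the "function from possible worlds to reference" idea concrete, here is a minimal Python sketch. It is only an illustration, not the definition used in the project: the toy worlds, the creature names and the helper has_organ are all hypothetical. It shows two terms that pick out the same things in one world yet differ as functions.

```python
from typing import Callable, Dict, FrozenSet

# Hypothetical toy model: a "possible world" is a table mapping each
# creature to the set of organs it has in that world.
World = Dict[str, FrozenSet[str]]
# An intension maps a world to the term's reference (extension) in that world.
Intension = Callable[[World], FrozenSet[str]]


def has_organ(organ: str) -> Intension:
    """Build the intension of 'creature with a <organ>'."""
    def intension(world: World) -> FrozenSet[str]:
        return frozenset(name for name, organs in world.items() if organ in organs)
    return intension


creature_with_heart = has_organ("heart")
creature_with_kidney = has_organ("kidney")

# In this world every creature with a heart also has a kidney.
actual_world: World = {
    "dog": frozenset({"heart", "kidney"}),
    "cat": frozenset({"heart", "kidney"}),
    "rock": frozenset(),
}
# In this made-up world one creature has a heart but no kidney.
counterfactual_world: World = {
    "dog": frozenset({"heart", "kidney"}),
    "blob": frozenset({"heart"}),
}

same_reference_actual = (
    creature_with_heart(actual_world) == creature_with_kidney(actual_world)
)
same_reference_counterfactual = (
    creature_with_heart(counterfactual_world)
    == creature_with_kidney(counterfactual_world)
)
```

The two terms agree on actual_world but come apart on counterfactual_world, which is the sense in which they share a reference while differing in intension.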

25 Min Talk on MetaEthical.AI with Questions from Stuart Armstrong

June Ku4y20

My aim is to specify our preferences and values in a way that is as philosophically correct as possible in defining the AI's utility function. It's compatible with this that in practice, the (eventual scaled down version of the) AI would use various heuristics and approximations to make its best guess based on "human-related data" rather than direct brain data. But I do think it's important for the AI to have an accurate concept of what these are supposed to be an approximation to.

But it sounds like you have a deeper worry that intentional states are not really out there in the world, perhaps because you think all that exists are microphysical states. I don't share that concern because I'm a functionalist or reductionist instead of an eliminativist. Physical states get to count as intentional states when they play the right role, in which case the intentional states are real. I bring in Chalmers on when a physical system implements a computation and combine it with Dennett's intentional stance to help specify what those roles are.

New MetaEthical.AI Summary and Q&A at UC Berkeley

June Ku6y10

Officially, my research is metaethical. I tell the AI how to identify someone’s higher-order utility functions but remain neutral on what those actually are in humans. Unofficially, I suspect they amount to some specification of reflective equilibrium and prescribe changing one’s values to be more in line with that equilibrium.

On distortion, I’m not sure what else to say but repeat myself. Distortions are just changes in value not governed by satisfying higher-order decision criteria. The examples I gave are not part of the specification, they’re just things I expect to be included.

Distortion is also not meant to specify all irrationality or nonoptimality. It’s just a corrective to a necessary part of the parliamentary procedure. We must simulate the brain’s continuation in some specific circumstance or other and that brings its own influences. So, I wouldn’t call a higher-order criterion a distortion even if it gets rejected. It’s more like a prima facie reason that gets overruled. In any case, we can evaluate such criteria as rational or not, but we’d be doing so by invoking some higher-order criteria (other ones, unless the criteria are reflective).

For the most part, I don’t believe in norms universal to all agents. Given our shared evolutionary history, I expect significant overlap among humans but that there’d also be some subtle differences from development and the environment. It may also be worth mentioning that even with the same norm, we can preserve uniqueness if for instance, it takes one’s current state into consideration.

New MetaEthical.AI Summary and Q&A at UC Berkeley

June Ku6y10

Here, the optimal decisions would be the higher-order outputs which maximize higher-order utility. They are decisions about what to value or how to decide rather than about what to do.

To capture rational values, we are trying to focus on the changes to values that flow out of satisfying one’s higher-order decision criteria. By unrelated distortions of value, I pretty much mean changes in value from any other causes, e.g. from noise, biases, or mere associations.

In the code and outline, I call the lack of distortion Agential Identity (similar to personal identity). I had previously tried to just extract the criteria out of the brain and directly operate on them. But now, I think the brain is sufficiently messy that we can only simulate many continuations and aggregate them. That opens up a lot of potential to stray far from the original state. This Agential Identity helps ensure we’re uncovering your dispositions rather than those of a stranger or a funhouse mirror distortion.

New MetaEthical.AI Summary and Q&A at UC Berkeley

June Ku6y10

Nice catch. Yes, I think I’ll have to change the ordinal utility functions to range over lotteries rather than simply outcomes.
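As a rough illustration of why a ranking over outcomes alone can leave choices under uncertainty open (this is a generic example, not necessarily the issue raised in the thread), the Python sketch below uses two hypothetical utility tables that order the outcomes A, B, C identically but disagree about a sure B versus a 50/50 gamble between A and C. Expected utility is assumed here purely for the illustration.

```python
from typing import Dict

# A lottery maps each outcome to its probability (hypothetical representation).
Lottery = Dict[str, float]


def expected_utility(lottery: Lottery, utility: Dict[str, float]) -> float:
    """Probability-weighted sum of outcome utilities."""
    return sum(prob * utility[outcome] for outcome, prob in lottery.items())


def prefers(utility: Dict[str, float], first: Lottery, second: Lottery) -> bool:
    """True if `first` has strictly higher expected utility than `second`."""
    return expected_utility(first, utility) > expected_utility(second, utility)


# Two made-up utility tables with the same ordering A < B < C.
u1 = {"A": 0.0, "B": 1.0, "C": 10.0}
u2 = {"A": 0.0, "B": 9.0, "C": 10.0}

sure_b: Lottery = {"B": 1.0}
gamble: Lottery = {"A": 0.5, "C": 0.5}

same_outcome_ranking = sorted(u1, key=u1.get) == sorted(u2, key=u2.get)
u1_prefers_gamble = prefers(u1, gamble, sure_b)
u2_prefers_gamble = prefers(u2, gamble, sure_b)
```

Both tables rank the outcomes the same way, yet only u1 prefers the gamble, so a preference ordering has to be stated over lotteries to settle such cases.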

In this initial version, I am just assuming the ontology of the world is given, perhaps from just an oracle or the world model the AI has inferred.

Formal Metaethics and Metasemantics for AI Alignment

June Ku6y10

I now have a much more readable explanation of my code. I'd be interested to hear your thoughts on it.

Formal Metaethics and Metasemantics for AI Alignment

June Ku6y30

Yeah, more or less. In the abstract, I "suppose that unlimited computation and a complete low-level causal model of the world and the adult human brains in it are available." I've tended to imagine this as an oracle that just has a causal model of the actual world and the brains in it. But whole brain emulations would likely also suffice.

In the code, the causal models of the world and brains in it would be passed as parameters to the metaethical_ai_u function in main. The world w and each element of the set bs would be an instance of the causal_markov_model class.

Each brain gets associated with an instance of the decision_algorithm class by calling the class function implemented_by. A decision algorithm models the brain in higher level concepts like credences and preferences as opposed to bare causal states. And yeah, in determining both the decision algorithm implemented by a brain and its rational values, we look at their responses to all possible inputs.

For implementation, we aim for isomorphic, coherent, instrumentally rational and parsimonious explanations. For rational values, we aggregate the values of possible continuations weighting more heavily those that better satisfied the agent's own higher-order decision criteria without introducing too much unrelated distortion of values.
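The aggregation step for rational values can be pictured with a small Python sketch. Everything here is an assumption made for illustration: the Continuation record, the numeric scores, and the particular weighting rule (criteria satisfaction minus a distortion penalty, floored at zero) are hypothetical and are not taken from the actual code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Continuation:
    """One simulated continuation of an agent (hypothetical record)."""
    values: Dict[str, float]      # how much this continuation values each item
    criteria_satisfaction: float  # how well it satisfied higher-order criteria
    distortion: float             # unrelated drift from the original values


def weight(c: Continuation, distortion_penalty: float = 1.0) -> float:
    """Assumed rule: reward criteria satisfaction, penalize distortion."""
    return max(c.criteria_satisfaction - distortion_penalty * c.distortion, 0.0)


def aggregate(continuations: List[Continuation]) -> Dict[str, float]:
    """Weighted average of the continuations' values."""
    total = sum(weight(c) for c in continuations)
    if total == 0:
        raise ValueError("no continuation has positive weight")
    keys = set().union(*(c.values for c in continuations))
    return {
        key: sum(weight(c) * c.values.get(key, 0.0) for c in continuations) / total
        for key in keys
    }


faithful = Continuation({"honesty": 1.0}, criteria_satisfaction=0.9, distortion=0.1)
drifted = Continuation({"honesty": 0.0}, criteria_satisfaction=0.5, distortion=0.3)
rational_values = aggregate([faithful, drifted])
```

With these made-up numbers the faithful continuation carries most of the weight, so the aggregate leans toward its values.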

Formal Metaethics and Metasemantics for AI Alignment

June Ku6y30

If you or anyone else could point to a specific function in my code that we don't know how to compute, I'd be very interested to hear that. The only place that I know of that is uncomputable is in calculating Kolmogorov complexity, but that could be replaced by some finite approximation. The rest should be computable, though its complexity may be super-duper exponentially exponential.
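One common computable stand-in for Kolmogorov complexity is the length of a compressed encoding, which gives a crude upper-bound style proxy. The comment does not say which finite approximation would be used, so the Python sketch below is only an assumption about what such a replacement could look like, using zlib from the standard library.

```python
import random
import zlib


def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data, used as a rough
    computable proxy for description length (not true Kolmogorov complexity)."""
    return len(zlib.compress(data, 9))


# A highly regular string: a short description ("repeat 'ab' 500 times") exists.
regular = b"ab" * 500

# Pseudo-random bytes of the same length, seeded for reproducibility.
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(1000))

regular_score = compressed_length(regular)
noisy_score = compressed_length(noisy)
```

The regular string compresses to far fewer bytes than the noisy one, matching the intuition that simpler explanations have shorter descriptions.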

In the early stages, I would often find, as you expect, components that I thought would be fairly straightforward to define technically but would realize upon digging in that it was not so clear and required more philosophical progress. Over time, these lessened to more like just technical details than philosophical gaps, until I didn't find even technical gaps.

Then I started writing automated tests and uncovered more bugs, though for the most part these were pretty minor, where I think a sympathetic programmer could probably work out what was meant to be done. I think around 42% of the procedures defined now have an automated test. Admittedly, these are generally the easier functions and simpler test cases. It turns out that writing code intended for an infinitely powerful computer doesn't exactly lend itself to being tested on current machines. (Having a proper testing framework, however, with the ability to stub and mock objects might help considerably.)

There are likely still many bugs in the untested parts but I would expect them to be fairly minor. Still, I'm only one person so I'd love to have more eyes on it. I also like the schema idea and have often thought of my work as a scaffold. Even if you disagree with one component, you might be able to just slot in a different philosophical theory. Perhaps you could even replace every component but still retain something of the flavor of my theory! I just hope it's more like replacing Newtonian mechanics than phlogiston.
